New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
master is broken using latest pandas release #1308
Comments
To fix this correctly requires some large-ish changes to the codebase (mostly code deletion). I'm almost done with a PR that will drop Python 2 support, so want that to be merged first to avoid conflicts. I'll submit a PR now to cap the pandas version so master will be passing, then will fix the issue for real once Python 2 support is removed. |
jairideout
added a commit
to jairideout/scikit-bio
that referenced
this issue
Mar 23, 2016
The rationale for this temporary hack is described in scikit-bio#1308.
jairideout
added a commit
to jairideout/scikit-bio
that referenced
this issue
Mar 30, 2016
Removed memory optimizations for storing empty `metadata` and `positional_metadata`, currently on `Sequence` and `TabularMSA`. Previously, empty metadata was stored internally as `None`, and an empty metadata representation (e.g., `{}` or `pd.DataFrame(index=range(axis_len))` was created and stored on the object when `.metadata` or `.positional_metadata` properties were accessed for the first time. While this optimization resulted in some memory savings, its implementation is not sustainable because scikit-bio devs must remember to use `has_metadata` or `has_positional_metadata` to handle empty metadata efficiently, otherwise any memory savings are lost. Every routine interacting with metadata needed a unit test ensuring empty metadata is handled efficiently, and worse, these tests had to interact with private state. As metadata will be applied to more, if not all, scikit-bio objects in the future, the maintenance and cognitive burden placed on developers is not worth the memory savings at this time. Further, this memory optimization caused the memory footprints of scikit-bio objects to be unpredictable. For example, a user may not understand why they are running out of memory when accessing `.metadata`, `.positional_metadata`, or a routine that doesn't handle empty metadata efficiently. Now, empty `metadata` will be stored as `{}` and empty `positional_metadata` as a `DataFrame` with default index matching the object's axis length. The empty dict shouldn't be too heavy but the empty `DataFrame` is problematic. To fix scikit-bio#1308 we will use the new default `RangeIndex` in pandas 0.18.0, which is much more memory-efficient (essentially storing 3 integers: start, stop, and end). Thus, empty `positional_metadata` will have a much smaller memory footprint soon. If more memory savings are needed in the future, we could implement proxy objects. Deprecated `has_metadata` and `has_positional_metadata`, as they mainly existed for scikit-bio devs to check whether metadata exists without creating empty metadata objects. Related to scikit-bio#1308.
jairideout
added a commit
to jairideout/scikit-bio
that referenced
this issue
Mar 30, 2016
Removed memory optimizations for storing empty `metadata` and `positional_metadata`, currently on `Sequence` and `TabularMSA`. Previously, empty metadata was stored internally as `None`, and an empty metadata representation (e.g., `{}` or `pd.DataFrame(index=range(axis_len))` was created and stored on the object when `.metadata` or `.positional_metadata` properties were accessed for the first time. While this optimization resulted in some memory savings, its implementation is not sustainable because scikit-bio devs must remember to use `has_metadata` or `has_positional_metadata` to handle empty metadata efficiently, otherwise any memory savings are lost. Every routine interacting with metadata needed a unit test ensuring empty metadata is handled efficiently, and worse, these tests had to interact with private state. As metadata will be applied to more, if not all, scikit-bio objects in the future, the maintenance and cognitive burden placed on developers is not worth the memory savings at this time. Further, this memory optimization caused the memory footprints of scikit-bio objects to be unpredictable. For example, a user may not understand why they are running out of memory when accessing `.metadata`, `.positional_metadata`, or a routine that doesn't handle empty metadata efficiently. Now, empty `metadata` will be stored as `{}` and empty `positional_metadata` as a `DataFrame` with default index matching the object's axis length. The empty dict shouldn't be too heavy but the empty `DataFrame` is problematic. To fix scikit-bio#1308 we will use the new default `RangeIndex` in pandas 0.18.0, which is much more memory-efficient (essentially storing 3 integers: start, stop, and end). Thus, empty `positional_metadata` will have a much smaller memory footprint soon. If more memory savings are needed in the future, we could implement proxy objects. Deprecated `has_metadata` and `has_positional_metadata`, as they mainly existed for scikit-bio devs to check whether metadata exists without creating empty metadata objects. Related to scikit-bio#1308.
jairideout
added a commit
to jairideout/scikit-bio
that referenced
this issue
Mar 30, 2016
Removed memory optimizations for storing empty `metadata` and `positional_metadata`, currently on `Sequence` and `TabularMSA`. Previously, empty metadata was stored internally as `None`, and an empty metadata representation (e.g., `{}` or `pd.DataFrame(index=range(axis_len))` was created and stored on the object when `.metadata` or `.positional_metadata` properties were accessed for the first time. While this optimization resulted in some memory savings, its implementation is not sustainable because scikit-bio devs must remember to use `has_metadata` or `has_positional_metadata` to handle empty metadata efficiently, otherwise any memory savings are lost. Every routine interacting with metadata needed a unit test ensuring empty metadata is handled efficiently, and worse, these tests had to interact with private state. As metadata will be applied to more, if not all, scikit-bio objects in the future, the maintenance and cognitive burden placed on developers is not worth the memory savings at this time. Further, this memory optimization caused the memory footprints of scikit-bio objects to be unpredictable. For example, a user may not understand why they are running out of memory when accessing `.metadata`, `.positional_metadata`, or a routine that doesn't handle empty metadata efficiently. Now, empty `metadata` will be stored as `{}` and empty `positional_metadata` as a `DataFrame` with default index matching the object's axis length. The empty dict shouldn't be too heavy but the empty `DataFrame` is problematic. To fix scikit-bio#1308 we will use the new default `RangeIndex` in pandas 0.18.0, which is much more memory-efficient (essentially storing 3 integers: start, stop, and end). Thus, empty `positional_metadata` will have a much smaller memory footprint soon. If more memory savings are needed in the future, we could implement proxy objects. Deprecated `has_metadata` and `has_positional_metadata`, as they mainly existed for scikit-bio devs to check whether metadata exists without creating empty metadata objects. Related to scikit-bio#1308.
jairideout
added a commit
to jairideout/scikit-bio
that referenced
this issue
Mar 30, 2016
Removed memory optimizations for storing empty `metadata` and `positional_metadata`, currently on `Sequence` and `TabularMSA`. Previously, empty metadata was stored internally as `None`, and an empty metadata representation (e.g., `{}` or `pd.DataFrame(index=range(axis_len))` was created and stored on the object when `.metadata` or `.positional_metadata` properties were accessed for the first time. While this optimization resulted in some memory savings, its implementation is not sustainable because scikit-bio devs must remember to use `has_metadata` or `has_positional_metadata` to handle empty metadata efficiently, otherwise any memory savings are lost. Every routine interacting with metadata needed a unit test ensuring empty metadata is handled efficiently, and worse, these tests had to interact with private state. As metadata will be applied to more, if not all, scikit-bio objects in the future, the maintenance and cognitive burden placed on developers is not worth the memory savings at this time. Further, this memory optimization caused the memory footprints of scikit-bio objects to be unpredictable. For example, a user may not understand why they are running out of memory when accessing `.metadata`, `.positional_metadata`, or a routine that doesn't handle empty metadata efficiently. Now, empty `metadata` will be stored as `{}` and empty `positional_metadata` as a `DataFrame` with default index matching the object's axis length. The empty dict shouldn't be too heavy but the empty `DataFrame` is problematic. To fix scikit-bio#1308 we will use the new default `RangeIndex` in pandas 0.18.0, which is much more memory-efficient (essentially storing 3 integers: start, stop, and end). Thus, empty `positional_metadata` will have a much smaller memory footprint soon. If more memory savings are needed in the future, we could implement proxy objects. Deprecated `has_metadata` and `has_positional_metadata`, as they mainly existed for scikit-bio devs to check whether metadata exists without creating empty metadata objects. Related to scikit-bio#1308.
jairideout
added a commit
to jairideout/scikit-bio
that referenced
this issue
Apr 7, 2016
Fixes scikit-bio#1308. scikit-bio now depends on pandas >= 0.18.0. Added support for `pandas.RangeIndex`, lowering the memory footprint of default integer index objects. `Sequence.positional_metadata` and `TabularMSA.positional_metadata` now use `pd.RangeIndex` as the positional metadata index. `TabularMSA` now uses `pd.RangeIndex` as the default index. Usage of `pd.RangeIndex` over the previous `pd.Int64Index` [should be transparent](http://pandas.pydata.org/pandas-docs/version/0.18.0/whatsnew.html#range-index), so these changes should be non-breaking to users. Added `reset_index=False` parameter to `TabularMSA.append` and `TabularMSA.extend` for resetting the MSA's index to the default index after appending/extending. `TabularMSA.append` and `TabularMSA.extend` now require one of `minter`, `index`, or `reset_index` to be provided when incorporating new sequences into an MSA. Previous behavior was to auto-increment the index labels if `minter` and `index` weren't provided and the MSA had a default integer index, otherwise error. Fixed doctests to work with pandas 0.18.0 ("integers" stored in a float column now repr with a trailing `.0`).
3 tasks
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Using the latest pandas release breaks many many unit tests and doctests. The default pandas index is now
RangeIndex
instead ofInt64Index
, and data frame/seriesrepr
adds trailing.0
to whole numbers.The text was updated successfully, but these errors were encountered: