Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

master is broken using latest pandas release #1308

Closed
jairideout opened this issue Mar 17, 2016 · 1 comment · Fixed by #1321
Closed

master is broken using latest pandas release #1308

jairideout opened this issue Mar 17, 2016 · 1 comment · Fixed by #1321

Comments

@jairideout
Copy link
Contributor

Using the latest pandas release breaks many many unit tests and doctests. The default pandas index is now RangeIndex instead of Int64Index, and data frame/series repr adds trailing .0 to whole numbers.

@jairideout
Copy link
Contributor Author

To fix this correctly requires some large-ish changes to the codebase (mostly code deletion). I'm almost done with a PR that will drop Python 2 support, so want that to be merged first to avoid conflicts. I'll submit a PR now to cap the pandas version so master will be passing, then will fix the issue for real once Python 2 support is removed.

jairideout added a commit to jairideout/scikit-bio that referenced this issue Mar 23, 2016
The rationale for this temporary hack is described in scikit-bio#1308.
jairideout added a commit to jairideout/scikit-bio that referenced this issue Mar 30, 2016
Removed memory optimizations for storing empty `metadata` and `positional_metadata`, currently on `Sequence` and `TabularMSA`.

Previously, empty metadata was stored internally as `None`, and an empty metadata representation (e.g., `{}` or `pd.DataFrame(index=range(axis_len))` was created and stored on the object when `.metadata` or `.positional_metadata` properties were accessed for the first time.

While this optimization resulted in some memory savings, its implementation is not sustainable because scikit-bio devs must remember to use `has_metadata` or `has_positional_metadata` to handle empty metadata efficiently, otherwise any memory savings are lost. Every routine interacting with metadata needed a unit test ensuring empty metadata is handled efficiently, and worse, these tests had to interact with private state. As metadata will be applied to more, if not all, scikit-bio objects in the future, the maintenance and cognitive burden placed on developers is not worth the memory savings at this time. Further, this memory optimization caused the memory footprints of scikit-bio objects to be unpredictable. For example, a user may not understand why they are running out of memory when accessing `.metadata`, `.positional_metadata`, or a routine that doesn't handle empty metadata efficiently.

Now, empty `metadata` will be stored as `{}` and empty `positional_metadata` as a `DataFrame` with default index matching the object's axis length. The empty dict shouldn't be too heavy but the empty `DataFrame` is problematic. To fix scikit-bio#1308 we will use the new default `RangeIndex` in pandas 0.18.0, which is much more memory-efficient (essentially storing 3 integers: start, stop, and end). Thus, empty `positional_metadata` will have a much smaller memory footprint soon. If more memory savings are needed in the future, we could implement proxy objects.

Deprecated `has_metadata` and `has_positional_metadata`, as they mainly existed for scikit-bio devs to check whether metadata exists without creating empty metadata objects.

Related to scikit-bio#1308.
jairideout added a commit to jairideout/scikit-bio that referenced this issue Mar 30, 2016
Removed memory optimizations for storing empty `metadata` and `positional_metadata`, currently on `Sequence` and `TabularMSA`.

Previously, empty metadata was stored internally as `None`, and an empty metadata representation (e.g., `{}` or `pd.DataFrame(index=range(axis_len))` was created and stored on the object when `.metadata` or `.positional_metadata` properties were accessed for the first time.

While this optimization resulted in some memory savings, its implementation is not sustainable because scikit-bio devs must remember to use `has_metadata` or `has_positional_metadata` to handle empty metadata efficiently, otherwise any memory savings are lost. Every routine interacting with metadata needed a unit test ensuring empty metadata is handled efficiently, and worse, these tests had to interact with private state. As metadata will be applied to more, if not all, scikit-bio objects in the future, the maintenance and cognitive burden placed on developers is not worth the memory savings at this time. Further, this memory optimization caused the memory footprints of scikit-bio objects to be unpredictable. For example, a user may not understand why they are running out of memory when accessing `.metadata`, `.positional_metadata`, or a routine that doesn't handle empty metadata efficiently.

Now, empty `metadata` will be stored as `{}` and empty `positional_metadata` as a `DataFrame` with default index matching the object's axis length. The empty dict shouldn't be too heavy but the empty `DataFrame` is problematic. To fix scikit-bio#1308 we will use the new default `RangeIndex` in pandas 0.18.0, which is much more memory-efficient (essentially storing 3 integers: start, stop, and end). Thus, empty `positional_metadata` will have a much smaller memory footprint soon. If more memory savings are needed in the future, we could implement proxy objects.

Deprecated `has_metadata` and `has_positional_metadata`, as they mainly existed for scikit-bio devs to check whether metadata exists without creating empty metadata objects.

Related to scikit-bio#1308.
jairideout added a commit to jairideout/scikit-bio that referenced this issue Mar 30, 2016
Removed memory optimizations for storing empty `metadata` and `positional_metadata`, currently on `Sequence` and `TabularMSA`.

Previously, empty metadata was stored internally as `None`, and an empty metadata representation (e.g., `{}` or `pd.DataFrame(index=range(axis_len))` was created and stored on the object when `.metadata` or `.positional_metadata` properties were accessed for the first time.

While this optimization resulted in some memory savings, its implementation is not sustainable because scikit-bio devs must remember to use `has_metadata` or `has_positional_metadata` to handle empty metadata efficiently, otherwise any memory savings are lost. Every routine interacting with metadata needed a unit test ensuring empty metadata is handled efficiently, and worse, these tests had to interact with private state. As metadata will be applied to more, if not all, scikit-bio objects in the future, the maintenance and cognitive burden placed on developers is not worth the memory savings at this time. Further, this memory optimization caused the memory footprints of scikit-bio objects to be unpredictable. For example, a user may not understand why they are running out of memory when accessing `.metadata`, `.positional_metadata`, or a routine that doesn't handle empty metadata efficiently.

Now, empty `metadata` will be stored as `{}` and empty `positional_metadata` as a `DataFrame` with default index matching the object's axis length. The empty dict shouldn't be too heavy but the empty `DataFrame` is problematic. To fix scikit-bio#1308 we will use the new default `RangeIndex` in pandas 0.18.0, which is much more memory-efficient (essentially storing 3 integers: start, stop, and end). Thus, empty `positional_metadata` will have a much smaller memory footprint soon. If more memory savings are needed in the future, we could implement proxy objects.

Deprecated `has_metadata` and `has_positional_metadata`, as they mainly existed for scikit-bio devs to check whether metadata exists without creating empty metadata objects.

Related to scikit-bio#1308.
jairideout added a commit to jairideout/scikit-bio that referenced this issue Mar 30, 2016
Removed memory optimizations for storing empty `metadata` and `positional_metadata`, currently on `Sequence` and `TabularMSA`.

Previously, empty metadata was stored internally as `None`, and an empty metadata representation (e.g., `{}` or `pd.DataFrame(index=range(axis_len))` was created and stored on the object when `.metadata` or `.positional_metadata` properties were accessed for the first time.

While this optimization resulted in some memory savings, its implementation is not sustainable because scikit-bio devs must remember to use `has_metadata` or `has_positional_metadata` to handle empty metadata efficiently, otherwise any memory savings are lost. Every routine interacting with metadata needed a unit test ensuring empty metadata is handled efficiently, and worse, these tests had to interact with private state. As metadata will be applied to more, if not all, scikit-bio objects in the future, the maintenance and cognitive burden placed on developers is not worth the memory savings at this time. Further, this memory optimization caused the memory footprints of scikit-bio objects to be unpredictable. For example, a user may not understand why they are running out of memory when accessing `.metadata`, `.positional_metadata`, or a routine that doesn't handle empty metadata efficiently.

Now, empty `metadata` will be stored as `{}` and empty `positional_metadata` as a `DataFrame` with default index matching the object's axis length. The empty dict shouldn't be too heavy but the empty `DataFrame` is problematic. To fix scikit-bio#1308 we will use the new default `RangeIndex` in pandas 0.18.0, which is much more memory-efficient (essentially storing 3 integers: start, stop, and end). Thus, empty `positional_metadata` will have a much smaller memory footprint soon. If more memory savings are needed in the future, we could implement proxy objects.

Deprecated `has_metadata` and `has_positional_metadata`, as they mainly existed for scikit-bio devs to check whether metadata exists without creating empty metadata objects.

Related to scikit-bio#1308.
@jairideout jairideout reopened this Mar 30, 2016
jairideout added a commit to jairideout/scikit-bio that referenced this issue Apr 7, 2016
Fixes scikit-bio#1308. scikit-bio now depends on pandas >= 0.18.0.

Added support for `pandas.RangeIndex`, lowering the memory footprint of default integer index objects. `Sequence.positional_metadata` and `TabularMSA.positional_metadata` now use `pd.RangeIndex` as the positional metadata index. `TabularMSA` now uses `pd.RangeIndex` as the default index.  Usage of `pd.RangeIndex` over the previous `pd.Int64Index` [should be transparent](http://pandas.pydata.org/pandas-docs/version/0.18.0/whatsnew.html#range-index), so these changes should be non-breaking to users.

Added `reset_index=False` parameter to `TabularMSA.append` and `TabularMSA.extend` for resetting the MSA's index to the default index after appending/extending. `TabularMSA.append` and `TabularMSA.extend` now require one of `minter`, `index`, or `reset_index` to be provided when incorporating new sequences into an MSA. Previous behavior was to auto-increment the index labels if `minter` and `index` weren't provided and the MSA had a default integer index, otherwise error.

Fixed doctests to work with pandas 0.18.0 ("integers" stored in a float column now repr with a trailing `.0`).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant