master is broken using latest pandas release #1308

jairideout · 2016-03-17T22:01:39Z

Using the latest pandas release breaks many many unit tests and doctests. The default pandas index is now RangeIndex instead of Int64Index, and data frame/series repr adds trailing .0 to whole numbers.

The text was updated successfully, but these errors were encountered:

jairideout · 2016-03-23T19:13:03Z

To fix this correctly requires some large-ish changes to the codebase (mostly code deletion). I'm almost done with a PR that will drop Python 2 support, so want that to be merged first to avoid conflicts. I'll submit a PR now to cap the pandas version so master will be passing, then will fix the issue for real once Python 2 support is removed.

The rationale for this temporary hack is described in scikit-bio#1308.

Removed memory optimizations for storing empty `metadata` and `positional_metadata`, currently on `Sequence` and `TabularMSA`. Previously, empty metadata was stored internally as `None`, and an empty metadata representation (e.g., `{}` or `pd.DataFrame(index=range(axis_len))` was created and stored on the object when `.metadata` or `.positional_metadata` properties were accessed for the first time. While this optimization resulted in some memory savings, its implementation is not sustainable because scikit-bio devs must remember to use `has_metadata` or `has_positional_metadata` to handle empty metadata efficiently, otherwise any memory savings are lost. Every routine interacting with metadata needed a unit test ensuring empty metadata is handled efficiently, and worse, these tests had to interact with private state. As metadata will be applied to more, if not all, scikit-bio objects in the future, the maintenance and cognitive burden placed on developers is not worth the memory savings at this time. Further, this memory optimization caused the memory footprints of scikit-bio objects to be unpredictable. For example, a user may not understand why they are running out of memory when accessing `.metadata`, `.positional_metadata`, or a routine that doesn't handle empty metadata efficiently. Now, empty `metadata` will be stored as `{}` and empty `positional_metadata` as a `DataFrame` with default index matching the object's axis length. The empty dict shouldn't be too heavy but the empty `DataFrame` is problematic. To fix scikit-bio#1308 we will use the new default `RangeIndex` in pandas 0.18.0, which is much more memory-efficient (essentially storing 3 integers: start, stop, and end). Thus, empty `positional_metadata` will have a much smaller memory footprint soon. If more memory savings are needed in the future, we could implement proxy objects. Deprecated `has_metadata` and `has_positional_metadata`, as they mainly existed for scikit-bio devs to check whether metadata exists without creating empty metadata objects. Related to scikit-bio#1308.

Fixes scikit-bio#1308. scikit-bio now depends on pandas >= 0.18.0. Added support for `pandas.RangeIndex`, lowering the memory footprint of default integer index objects. `Sequence.positional_metadata` and `TabularMSA.positional_metadata` now use `pd.RangeIndex` as the positional metadata index. `TabularMSA` now uses `pd.RangeIndex` as the default index. Usage of `pd.RangeIndex` over the previous `pd.Int64Index` [should be transparent](http://pandas.pydata.org/pandas-docs/version/0.18.0/whatsnew.html#range-index), so these changes should be non-breaking to users. Added `reset_index=False` parameter to `TabularMSA.append` and `TabularMSA.extend` for resetting the MSA's index to the default index after appending/extending. `TabularMSA.append` and `TabularMSA.extend` now require one of `minter`, `index`, or `reset_index` to be provided when incorporating new sequences into an MSA. Previous behavior was to auto-increment the index labels if `minter` and `index` weren't provided and the MSA had a default integer index, otherwise error. Fixed doctests to work with pandas 0.18.0 ("integers" stored in a float column now repr with a trailing `.0`).

jairideout added CI / testing severity: blocking labels Mar 17, 2016

jairideout self-assigned this Mar 17, 2016

jairideout added this to the 0.5.0: Python 3 support only milestone Mar 17, 2016

jairideout added a commit to jairideout/scikit-bio that referenced this issue Mar 23, 2016

HACK: temporary pandas version cap in CI

da6ca9c

The rationale for this temporary hack is described in scikit-bio#1308.

jairideout mentioned this issue Mar 23, 2016

HACK: temporary pandas version cap in CI #1314

Merged

jairideout mentioned this issue Mar 30, 2016

MAINT: remove empty metadata optimizations #1321

Merged

ebolyen closed this as completed in #1321 Mar 30, 2016

jairideout reopened this Mar 30, 2016

jairideout mentioned this issue Apr 7, 2016

ENH: pandas 0.18.0 support #1325

Merged

ebolyen closed this as completed in #1325 May 4, 2016

wasade mentioned this issue May 18, 2016

Partial pairwise calculation for beta diversity. #1337

Merged

3 tasks

wasade mentioned this issue Jun 1, 2016

ENH: TreeNode subclass support #1364

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

master is broken using latest pandas release #1308

master is broken using latest pandas release #1308

jairideout commented Mar 17, 2016

jairideout commented Mar 23, 2016

master is broken using latest pandas release #1308

master is broken using latest pandas release #1308

Comments

jairideout commented Mar 17, 2016

jairideout commented Mar 23, 2016