Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pandas forward compat #1828

Closed
wants to merge 21 commits into from
Closed

Pandas forward compat #1828

wants to merge 21 commits into from

Conversation

kernc
Copy link
Contributor

@kernc kernc commented Dec 16, 2016

Description of changes

Minimal (non-exhaustive) pandas forward-compatible API with our current Table technology. Current API still (mostly) works but is deprecated and updated in tests.

Includes
  • Code changes
  • Tests
  • Documentation

@kernc kernc force-pushed the pandas branch 9 times, most recently from 4ea0ab7 to a572b5d Compare December 20, 2016 23:09
@kernc kernc changed the title [WIP] Pandas forward compat Pandas forward compat Dec 20, 2016
@codecov-io
Copy link

codecov-io commented Dec 20, 2016

Current coverage is 89.08% (diff: 85.00%)

Merging #1828 into master will decrease coverage by 0.15%

@@             master      #1828   diff @@
==========================================
  Files            86         85     -1   
  Lines          9100       9169    +69   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits           8121       8168    +47   
- Misses          979       1001    +22   
  Partials          0          0          

Sunburst

Powered by Codecov. Last update 7acb5fc...a572b5d

@@ -261,10 +262,10 @@ def save(self, filename):
with open(filename, "wt") as fle:
fle.write(data + "\n")
if col_labels is not None:
fle.write("\t".join(str(e.metas[0]) for e in col_labels) + "\n")
fle.write("\t".join(str(m[0]) for m in col_labels.metas) + "\n")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why?

for i, row in enumerate(self):
if row_labels is not None:
fle.write(str(row_labels[i].metas[0]) + "\t")
fle.write(str(row_labels.metas.T[0][i]) + "\t")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why?

@@ -81,8 +81,7 @@ class DropInstances(BaseImputeMethod):
description = ""

def __call__(self, data, variable):
index = data.domain.index(variable)
return numpy.isnan(data[:, index]).reshape(-1)
return numpy.isnan([list(row) for _, row in data[variable].iterrows()]).reshape(-1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

slow?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't measured it, but I suppose not much slower than it was before. numpy.isnan(data[:, index]) also did a full iteration over data (via Sequence.__iter__() via table.__getitem__(i)).

In pandas, this will be: return data[variable].isnull().

I'll fix it up to use Table.get_column_view() interim.



def get_contingencies(dat, skipDiscrete=False, skipContinuous=False):
def get_contingencies(dat, skip_discrete=False, skip_continuous=False):
Copy link
Member

@astaric astaric Dec 21, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was this rename really necessary?

@@ -39,8 +39,8 @@ def test_discrete(self):

def test_discrete_missing(self):
d = data.Table("zoo")
d.Y[25] = float("nan")
d[0][0] = float("nan")
d.loc[d.index[25], d.domain.class_var] = float("nan")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just love how pandas makes the code cleaner :)

Copy link
Contributor Author

@kernc kernc Dec 21, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not code. These are tests, made compatible with as little change as possible. The useful function code will always be much cleaner.

Ideally, this would be:

d.type.iloc[25] = np.nan  # 'type' is classvar name
# or
d.target.iloc[25] = np.nan  # target as an introduced alias
# or
d[d.domain.class_var].iloc[25] = np.nan  # familiar, verbose

But our current code makes it hard to adapt it so. I suppose it will have to get a bit more ugly before it gets neat.

@@ -544,8 +544,7 @@ def duplicateFeature(self):
def check_attrs_values(self, attr, data):
for i in range(len(data)):
for var in attr:
if not math.isnan(data[i, var]) \
and int(data[i, var]) >= len(var.values):
if data[var].iloc[data.index[i]] not in range(var.values):
Copy link
Member

@astaric astaric Dec 21, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not know that you can use "in" with generators :)

Copy link
Contributor Author

@kernc kernc Jan 4, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Python 3, range(x) is a special object. You can't use in with generators.

@@ -278,11 +278,6 @@ def _update_predictions_model(self):
results = []
for p in slots:
values, prob = p.results
if p.predictor.domain.class_var.is_discrete:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why?

Replaced were:

    if [data|table]:                 (for __bool__)
    (data|table)\[                   (changed to pandas indexing)
    \[.{10}(\.\.\.|Ellipsis).{10}\]  (unsupported Ellipsis replaced with
                                      slice(None))
    for row in data                  (replaced with .iterrows() call)
    ...
@astaric
Copy link
Member

astaric commented Dec 20, 2017

As this PR touches too many files in the repository and has not been updated for almost a year, it is highly unlikely that it will be merged. We are still interested in porting the Table to pandas, but it will have to be done in a more gradual way.

If anyone is interested in working on this, I would suggest the following steps:

  1. Add new methods to Table object that match pandas interface
    (empty(), iterrows(), ... - see commit 9a1a35a for more ideas)
  2. Modify the code to call the new methods while deprecating the old ones
    (code calls empty(), __bool__ becomes deprecated, ...)
  3. Figure out if Table can be ported to pandas without modifying more than 10 files, if not, return to 1. :)

Each "pandas compatible" method should be added in a separate pull request, as that eases reviewing and raises the chances of PR actually being merged. I would also split 1. and 2. into two PRs as 1 should only touch a single file and rebasing it should be much easier than the modifications of all places the code is used in 2.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants