-
-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Pandas forward compat #1828
Pandas forward compat #1828
Conversation
4ea0ab7
to
a572b5d
Compare
Current coverage is 89.08% (diff: 85.00%)@@ master #1828 diff @@
==========================================
Files 86 85 -1
Lines 9100 9169 +69
Methods 0 0
Messages 0 0
Branches 0 0
==========================================
+ Hits 8121 8168 +47
- Misses 979 1001 +22
Partials 0 0
|
Orange/misc/distmatrix.py
Outdated
@@ -261,10 +262,10 @@ def save(self, filename): | |||
with open(filename, "wt") as fle: | |||
fle.write(data + "\n") | |||
if col_labels is not None: | |||
fle.write("\t".join(str(e.metas[0]) for e in col_labels) + "\n") | |||
fle.write("\t".join(str(m[0]) for m in col_labels.metas) + "\n") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why?
Orange/misc/distmatrix.py
Outdated
for i, row in enumerate(self): | ||
if row_labels is not None: | ||
fle.write(str(row_labels[i].metas[0]) + "\t") | ||
fle.write(str(row_labels.metas.T[0][i]) + "\t") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why?
Orange/preprocess/impute.py
Outdated
@@ -81,8 +81,7 @@ class DropInstances(BaseImputeMethod): | |||
description = "" | |||
|
|||
def __call__(self, data, variable): | |||
index = data.domain.index(variable) | |||
return numpy.isnan(data[:, index]).reshape(-1) | |||
return numpy.isnan([list(row) for _, row in data[variable].iterrows()]).reshape(-1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
slow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haven't measured it, but I suppose not much slower than it was before. numpy.isnan(data[:, index])
also did a full iteration over data (via Sequence.__iter__()
via table.__getitem__(i)
).
In pandas, this will be: return data[variable].isnull()
.
I'll fix it up to use Table.get_column_view()
interim.
Orange/statistics/contingency.py
Outdated
|
||
|
||
def get_contingencies(dat, skipDiscrete=False, skipContinuous=False): | ||
def get_contingencies(dat, skip_discrete=False, skip_continuous=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
was this rename really necessary?
@@ -39,8 +39,8 @@ def test_discrete(self): | |||
|
|||
def test_discrete_missing(self): | |||
d = data.Table("zoo") | |||
d.Y[25] = float("nan") | |||
d[0][0] = float("nan") | |||
d.loc[d.index[25], d.domain.class_var] = float("nan") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just love how pandas makes the code cleaner :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not code. These are tests, made compatible with as little change as possible. The useful function code will always be much cleaner.
Ideally, this would be:
d.type.iloc[25] = np.nan # 'type' is classvar name
# or
d.target.iloc[25] = np.nan # target as an introduced alias
# or
d[d.domain.class_var].iloc[25] = np.nan # familiar, verbose
But our current code makes it hard to adapt it so. I suppose it will have to get a bit more ugly before it gets neat.
@@ -544,8 +544,7 @@ def duplicateFeature(self): | |||
def check_attrs_values(self, attr, data): | |||
for i in range(len(data)): | |||
for var in attr: | |||
if not math.isnan(data[i, var]) \ | |||
and int(data[i, var]) >= len(var.values): | |||
if data[var].iloc[data.index[i]] not in range(var.values): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not know that you can use "in" with generators :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In Python 3, range(x)
is a special object. You can't use in
with generators.
@@ -278,11 +278,6 @@ def _update_predictions_model(self): | |||
results = [] | |||
for p in slots: | |||
values, prob = p.results | |||
if p.predictor.domain.class_var.is_discrete: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why?
Sparse support pending.
Of note is the forced value type to float: without it, integer arrays are interpreted as double pointers and their values are 0, which breaks weights.
Over 90 % of base widgets now work. I haven't checked every single button, but they produce the intended result. Merge data, feature constructor and Venn diagram need a bit more work, so those aren't functional yet.
Replaced were: if [data|table]: (for __bool__) (data|table)\[ (changed to pandas indexing) \[.{10}(\.\.\.|Ellipsis).{10}\] (unsupported Ellipsis replaced with slice(None)) for row in data (replaced with .iterrows() call) ...
As this PR touches too many files in the repository and has not been updated for almost a year, it is highly unlikely that it will be merged. We are still interested in porting the Table to pandas, but it will have to be done in a more gradual way. If anyone is interested in working on this, I would suggest the following steps:
Each "pandas compatible" method should be added in a separate pull request, as that eases reviewing and raises the chances of PR actually being merged. I would also split 1. and 2. into two PRs as 1 should only touch a single file and rebasing it should be much easier than the modifications of all places the code is used in 2. |
Description of changes
Minimal (non-exhaustive) pandas forward-compatible API with our current Table technology. Current API still (mostly) works but is deprecated and updated in tests.
Includes