-
Notifications
You must be signed in to change notification settings - Fork 91
Upgrade Woodwork to 0.6.0 #2690
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## main #2690 +/- ##
=======================================
+ Coverage 99.9% 99.9% +0.1%
=======================================
Files 301 301
Lines 27607 27678 +71
=======================================
+ Hits 27563 27634 +71
Misses 44 44
Continue to review full report at Codecov.
|
# Conflicts: # docs/source/release_notes.rst
# Conflicts: # docs/source/release_notes.rst
| features.set_index(X_ww.index, inplace=True) | ||
|
|
||
| X_ww = X_ww.ww.drop(self._columns) | ||
| features.ww.init(logical_types={col_: "categorical" for col_ in features}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So you're taking all features generated from DFS and Calculate Feature Matrix, and making them categorical?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes but currently this only applies for URL and Email features. It was the old behavior but now the problem is that ww 0.6.1 infers features created by ft as Unknown.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, after discussion with @dsherry and @freddyaboulton this is what we decided on to address the test issues for this transformer
freddyaboulton
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @ParthivNaresh ! I know it must have been tough to update all of the tests. I agree with the general approach of modifying the set-up code/test fixtures to recover what the results of the inference would have been in 0.5.1.
Nothing "dangerous" stood out to me hehe but I left some questions about some of the modifications we've made here. I look forward to what comes out of #2715 . Hopefully we can cut out ww type inference from our unit tests for greater stability in the future!
| { | ||
| "categorical col": pd.Series( | ||
| ["zero", "one", "two", "zero", "three"], dtype="category" | ||
| ["zero", "one", "two", "zero", "two"] * 4, dtype="category" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So to make sure I understand, this will now get inferred as categorical because it has three unique values out of 20 values, so n_unique/total_values = 0.15 which is below 0.20?
Sounds good to me. Although I'm worried what will happen if ww ever changes the threshold again. I'm good to proceed with this approach! I would be interested in trying to cut out the inference from our tests entirely - Maybe #2715 is intended to track that @chukarsten ?
evalml/tests/model_understanding_tests/prediction_explanations_tests/test_force_plots.py
Outdated
Show resolved
Hide resolved
| features.set_index(X_ww.index, inplace=True) | ||
|
|
||
| X_ww = X_ww.ww.drop(self._columns) | ||
| features.ww.init(logical_types={col_: "categorical" for col_ in features}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes but currently this only applies for URL and Email features. It was the old behavior but now the problem is that ww 0.6.1 infers features created by ft as Unknown.
|
@freddyaboulton I can't reply to your comment about the categorical threshold directly for some reason, but the categorical inference threshold is a configurable setting in Woodwork, so if you wanted to use a different value in your tests (or in EvalML overall) you can do so. You can also set this inside a context manager if you wanted to do it temporarily in a test. Here is an example which shows how you would do that (using a different config option though). |
|
Thank you @thehomebrewnerd ! @ParthivNaresh did you try this? I think we talked about it at the beginning but not sure what came of it. It seems like setting |
|
@freddyaboulton @thehomebrewnerd Is this something we want to do? While this did require a decent number of changes, getting rid of the sampling inference + setting a threshold seems like a good long term move. Wouldn't setting |
|
@ParthivNaresh Ah my bad, maybe I meant 1? I'm ok with the approach we have here! But my concern is that if the default threshold value ever changes in the future, we'll have to modify the test cases again to increase/decrease the length so that they match the new threshold. I think setting the thresholds in the tests would prevent us from having to do that? I'm down to kick that out into #2715 though. @chukarsten Maybe we should we rename that epic "Minimize impact of default ww inference in unit tests"? |
@ParthivNaresh Yes. If you want a series to be recognized as |
| X_permuted.iloc[:, col_idx] = col | ||
| X_permuted.ww.init(schema=X_features.ww.schema) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the schema being invalidated by setting the col? Should we file a WW issue for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Modifying the dataframe outside of Woodwork, as is being done with X_permuted.iloc[:, col_idx] = col, always carries the risk of invalidating the schema.
That said, I don't think it is currently possible to do this type of assignment through WW currently with this:
X_permuted.ww.iloc[:, col_idx] = col
I'm not sure how easy it would be to implement, but we could look into it if needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By the way, based on my limited testing, as long as the new values don't change the column dtype, the schema should not be invalidated by this type of assignment outside of WW.
| "bool col": [True, False, False, True, True] * 4, | ||
| "categorical with nan": pd.Series( | ||
| ["0", "1", "0", "0", "3"], dtype="category" | ||
| ["0", "1", "0", "0", "3"] * 4, dtype="category" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get that our tests are pretty synthetic, but is it worth pushing back on Woodwork here? It seems like these columns should very much be inferred as categorical, despite being small.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI, if you are setting the dtype of the column explicitly to category, Woodwork should recognize this as Categorical, regardless of how many values are present.
An example:
>>> import pandas as pd
>>> import woodwork as ww
>>> df = pd.DataFrame({'id': [0], 'vals': pd.Series(['1'], dtype='category')})
>>> df.ww.init()
>>> df.ww
Physical Type Logical Type Semantic Tag(s)
Column
id int64 Integer ['numeric']
vals category Categorical ['category']So, that's another option instead of changing series length or updating the inference threshold.
| X2_fit = pd.DataFrame({"feature": ["c", "b", "a", "d"]}) | ||
| X2_predict = pd.DataFrame({"feature": ["d", "c", "b", "a"]}) | ||
| X2_predict_expected = pd.DataFrame({0: [3.0, 2.0, 1.0, 0.0]}, dtype="category") | ||
| X2_fit.ww.init(logical_types={"feature": "categorical"}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels like we're having to work around inference rather than using it...
| { | ||
| "categorical col": pd.Series( | ||
| ["zero", "one", "two", "zero", "three"], dtype="category" | ||
| ["zero", "one", "two", "zero", "two"] * 4, dtype="category" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was the idea @freddyaboulton . The intent is to try and move as many of these dataframes into a common location, at least within the testing module so that we can avoid the more laborious upgrade lifts. Like this lol
Fixes #2616
After creating the pull request: in order to pass the release_notes_updated check you will need to update the "Future Release" section of
docs/source/release_notes.rstto include this pull request by adding :pr:123.