
Upgrade Woodwork to 0.6.0 #2690

Merged
merged 45 commits into main from woodwork-upgrade-0.6.0 on Aug 31, 2021

Conversation

ParthivNaresh (Contributor):

Fixes #2616


After creating the pull request: in order to pass the release_notes_updated check you will need to update the "Future Release" section of docs/source/release_notes.rst to include this pull request by adding :pr:`123`.

codecov bot commented Aug 24, 2021

Codecov Report

Merging #2690 (61c91ee) into main (635f3e7) will increase coverage by 0.1%.
The diff coverage is 100.0%.


@@           Coverage Diff           @@
##            main   #2690     +/-   ##
=======================================
+ Coverage   99.9%   99.9%   +0.1%     
=======================================
  Files        301     301             
  Lines      27607   27678     +71     
=======================================
+ Hits       27563   27634     +71     
  Misses        44      44             
Impacted Files Coverage Δ
...alml/model_understanding/permutation_importance.py 100.0% <100.0%> (ø)
...elines/components/transformers/column_selectors.py 100.0% <100.0%> (ø)
...rs/preprocessing/transform_primitive_components.py 100.0% <100.0%> (ø)
...mponent_tests/test_column_selector_transformers.py 100.0% <100.0%> (ø)
...mponent_tests/test_delayed_features_transformer.py 100.0% <100.0%> (ø)
evalml/tests/component_tests/test_imputer.py 100.0% <100.0%> (ø)
...alml/tests/component_tests/test_lgbm_classifier.py 100.0% <100.0%> (ø)
...valml/tests/component_tests/test_lgbm_regressor.py 100.0% <100.0%> (ø)
...alml/tests/component_tests/test_one_hot_encoder.py 100.0% <100.0%> (ø)
...l/tests/component_tests/test_per_column_imputer.py 100.0% <100.0%> (ø)
... and 15 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

features.set_index(X_ww.index, inplace=True)

X_ww = X_ww.ww.drop(self._columns)
features.ww.init(logical_types={col_: "categorical" for col_ in features})
Contributor:

So you're taking all features generated from DFS and Calculate Feature Matrix, and making them categorical?

Contributor:

Yes, but currently this only applies to URL and Email features. It was the old behavior, but now the problem is that ww 0.6.1 infers features created by ft as Unknown.

Contributor Author:

Yes. After discussion with @dsherry and @freddyaboulton, this is what we decided on to address the test issues for this transformer.

Contributor @freddyaboulton left a comment:

Thank you @ParthivNaresh ! I know it must have been tough to update all of the tests. I agree with the general approach of modifying the set-up code/test fixtures to recover what the results of the inference would have been in 0.5.1.

Nothing "dangerous" stood out to me hehe, but I left some questions about some of the modifications we've made here. I look forward to what comes out of #2715. Hopefully we can cut out ww type inference from our unit tests for greater stability in the future!

@@ -19,24 +19,24 @@ def imputer_test_data():
return pd.DataFrame(
{
"categorical col": pd.Series(
["zero", "one", "two", "zero", "three"], dtype="category"
["zero", "one", "two", "zero", "two"] * 4, dtype="category"
Contributor:

So to make sure I understand, this will now get inferred as categorical because it has three unique values out of 20 values, so n_unique/total_values = 0.15 which is below 0.20?

Sounds good to me. Although I'm worried what will happen if ww ever changes the threshold again. I'm good to proceed with this approach! I would be interested in trying to cut out the inference from our tests entirely - Maybe #2715 is intended to track that @chukarsten ?
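The ratio check described in this thread can be sketched in plain pandas. Note that `looks_categorical` is a hypothetical helper for illustration, not Woodwork's actual inference code, and `0.2` stands in for the threshold mentioned above:

```python
import pandas as pd

def looks_categorical(series: pd.Series, threshold: float = 0.2) -> bool:
    # Hypothetical sketch of the check discussed above: treat a column
    # as categorical when the share of unique values is below the threshold.
    return series.nunique() / len(series) < threshold

# The fixture column: 3 unique values out of 20, so the ratio is 0.15.
vals = pd.Series(["zero", "one", "two", "zero", "two"] * 4)
ratio = vals.nunique() / len(vals)
```

With 20 values and 3 uniques the ratio is 0.15, under the 0.20 threshold, which is why the `* 4` repetition in the fixture makes the column land on the categorical side.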


@thehomebrewnerd (Contributor):

@freddyaboulton I can't reply to your comment about the categorical threshold directly for some reason, but the categorical inference threshold is a configurable setting in Woodwork, so if you wanted to use a different value in your tests (or in EvalML overall) you can do so.

You can also set this inside a context manager if you wanted to do it temporarily in a test. Here is an example which shows how you would do that (using a different config option though).
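The temporary-override pattern being suggested can be sketched with `contextlib`; the `CONFIG` dict and `config_option` helper below are stand-ins for Woodwork's real config API, not its actual implementation:

```python
from contextlib import contextmanager

# Hypothetical stand-in for a library-level config store.
CONFIG = {"categorical_threshold": 0.2}

@contextmanager
def config_option(name, value):
    # Temporarily override one option, restoring the old value on exit,
    # so a single test can opt into different inference behavior.
    old = CONFIG[name]
    CONFIG[name] = value
    try:
        yield
    finally:
        CONFIG[name] = old

with config_option("categorical_threshold", 0.5):
    inside = CONFIG["categorical_threshold"]
after = CONFIG["categorical_threshold"]
```

The `try`/`finally` matters: the old value is restored even if the test body raises, so one test's override can't leak into the rest of the suite.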

@freddyaboulton (Contributor):

Thank you @thehomebrewnerd ! @ParthivNaresh did you try this? I think we talked about it at the beginning but not sure what came of it. It seems like setting categorical_threshold to 0 and numeric_categorical_threshold to None would recover the old behavior?

@ParthivNaresh (Contributor Author):

@freddyaboulton @thehomebrewnerd Is this something we want to do? While this did require a decent number of changes, getting rid of the sampling inference + setting a threshold seems like a good long term move. Wouldn't setting categorical_threshold to 0 prevent anything from being recognized as Categorical?

@freddyaboulton (Contributor):

@ParthivNaresh Ah, my bad, maybe I meant 1? I'm OK with the approach we have here! But my concern is that if the default threshold value ever changes in the future, we'll have to modify the test cases again to increase/decrease the length so that they match the new threshold. I think setting the thresholds in the tests would prevent us from having to do that? I'm down to kick that out into #2715 though. @chukarsten, maybe we should rename that epic "Minimize impact of default ww inference in unit tests"?

@thehomebrewnerd (Contributor):

Wouldn't setting categorical_threshold to 0 prevent anything from being recognized as Categorical?

@ParthivNaresh Yes. If you want a series to be recognized as Categorical "more easily" you would want to increase the threshold.

Comment on lines 295 to +296
X_permuted.iloc[:, col_idx] = col
X_permuted.ww.init(schema=X_features.ww.schema)
Contributor:

Is the schema being invalidated by setting the col? Should we file a WW issue for this?

Contributor:

Modifying the dataframe outside of Woodwork, as is being done with X_permuted.iloc[:, col_idx] = col, always carries the risk of invalidating the schema.

That said, I don't think it is currently possible to do this type of assignment through WW with this:

X_permuted.ww.iloc[:, col_idx] = col

I'm not sure how easy it would be to implement, but we could look into it if needed.

Contributor:

By the way, based on my limited testing, as long as the new values don't change the column dtype, the schema should not be invalidated by this type of assignment outside of WW.
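A quick standalone pandas check of that observation (illustrative only, not EvalML's actual permutation code): writing same-dtype values back through `.iloc` leaves the column dtype unchanged.

```python
import pandas as pd

df = pd.DataFrame({"a": pd.Series(["x", "y", "x", "z"], dtype="category")})

# Permute the column's values, as permutation importance does, then
# write them back positionally. Because the replacement values have the
# same dtype, the column stays categorical (and so, per the comment
# above, the ww schema should remain valid).
permuted = df["a"].sample(frac=1, random_state=0).reset_index(drop=True)
df.iloc[:, 0] = permuted
```

If the assigned values had a different dtype (say, plain object strings), the column dtype could change and the schema would no longer match, which is the risk being discussed.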

"categorical with nan": pd.Series(
["0", "1", "0", "0", "3"], dtype="category"
["0", "1", "0", "0", "3"] * 4, dtype="category"
Contributor:

I get that our tests are pretty synthetic, but is it worth pushing back on Woodwork here? It seems like these columns should very much be inferred as categorical, despite being small.

Contributor:

FYI, if you are setting the dtype of the column explicitly to category, Woodwork should recognize this as Categorical, regardless of how many values are present.

An example:

>>> import pandas as pd
>>> import woodwork as ww
>>> df = pd.DataFrame({'id': [0], 'vals': pd.Series(['1'], dtype='category')}) 
>>> df.ww.init()
>>> df.ww
       Physical Type Logical Type Semantic Tag(s)
Column                                           
id             int64      Integer     ['numeric']
vals        category  Categorical    ['category']

So, that's another option instead of changing series length or updating the inference threshold.

@@ -158,6 +162,8 @@ def test_multiple_fit(mock_predict):
X2_fit = pd.DataFrame({"feature": ["c", "b", "a", "d"]})
X2_predict = pd.DataFrame({"feature": ["d", "c", "b", "a"]})
X2_predict_expected = pd.DataFrame({0: [3.0, 2.0, 1.0, 0.0]}, dtype="category")
X2_fit.ww.init(logical_types={"feature": "categorical"})
Contributor:

This feels like we're having to work around inference rather than using it...

@@ -19,24 +19,24 @@ def imputer_test_data():
return pd.DataFrame(
{
"categorical col": pd.Series(
["zero", "one", "two", "zero", "three"], dtype="category"
["zero", "one", "two", "zero", "two"] * 4, dtype="category"
Contributor:

That was the idea @freddyaboulton . The intent is to try and move as many of these dataframes into a common location, at least within the testing module so that we can avoid the more laborious upgrade lifts. Like this lol

@ParthivNaresh ParthivNaresh merged commit 6607edf into main Aug 31, 2021
@chukarsten chukarsten mentioned this pull request Sep 1, 2021
@freddyaboulton freddyaboulton deleted the woodwork-upgrade-0.6.0 branch May 13, 2022 15:02
Successfully merging this pull request may close these issues: Upgrade to ww 0.6.0