
Allow AutoMLSearch to handle Unknown type #2477

Merged: 34 commits into main, Jul 20, 2021
Conversation

bchen1116 (Contributor) commented Jul 8, 2021

Fixes #2426.

Perf tests here

bchen1116 self-assigned this on Jul 8, 2021
codecov bot commented Jul 8, 2021

Codecov Report

Merging #2477 (bbc1021) into main (b892fc9) will increase coverage by 0.1%.
The diff coverage is 100.0%.


@@           Coverage Diff           @@
##            main   #2477     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%     
=======================================
  Files        283     283             
  Lines      25808   25897     +89     
=======================================
+ Hits       25707   25796     +89     
  Misses       101     101             
Impacted Files Coverage Δ
evalml/automl/automl_search.py 99.4% <100.0%> (+0.1%) ⬆️
evalml/model_understanding/graphs.py 100.0% <100.0%> (ø)
evalml/pipelines/utils.py 99.2% <100.0%> (ø)
evalml/tests/automl_tests/test_automl.py 99.8% <100.0%> (+0.1%) ⬆️
evalml/tests/component_tests/test_lsa.py 100.0% <100.0%> (ø)
...l/tests/component_tests/test_per_column_imputer.py 100.0% <100.0%> (ø)
...alml/tests/component_tests/test_text_featurizer.py 100.0% <100.0%> (ø)
evalml/tests/conftest.py 98.3% <100.0%> (+0.1%) ⬆️
evalml/tests/data_checks_tests/test_data_checks.py 100.0% <100.0%> (ø)
...ecks_tests/test_natural_language_nan_data_check.py 100.0% <100.0%> (ø)
... and 3 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update b892fc9...bbc1021.

@@ -262,7 +262,7 @@ def test_per_column_imputer_woodwork_custom_overrides_returned_by_components(
     override_types = [Integer, Double, Categorical, NaturalLanguage, Boolean]
     for logical_type in override_types:
         # Column with Nans to boolean used to fail. Now it doesn't
-        if has_nan and logical_type == Boolean:
+        if has_nan and logical_type in [Boolean, NaturalLanguage]:
bchen1116 (Contributor, Author):

Leave out NaturalLanguage, since casting it turns np.nan into pd.NA, which fails the imputer.
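
A minimal sketch of the failure mode, assuming pandas' nullable "string" dtype (which backs Woodwork's NaturalLanguage logical type) and scikit-learn's SimpleImputer as the imputer in question:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Under pandas' nullable "string" dtype, a missing value is stored as pd.NA
# rather than np.nan:
s = pd.Series(["hello world", np.nan], dtype="string")
print(s[1] is pd.NA)  # True

# SimpleImputer's default missing_values is np.nan, so it doesn't recognize
# pd.NA; with the library versions in use at the time, this produced a
# stacktrace instead of imputing the column:
SimpleImputer(strategy="most_frequent").fit_transform(s.to_frame())
```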

freddyaboulton (Contributor):

But how come this wouldn't happen before? Woodwork doesn't convert np.nan to pd.NA in 0.4.2? Is this because of the pandas upgrade?

bchen1116 (Contributor, Author) commented Jul 14, 2021:

@freddyaboulton I believe it's due to the pandas upgrade!
[screenshot: pandas 1.3.0 release notes]
Looking at the 1.3.0 release docs, it seems there are a lot of changes to NaN handling, and they're using <NA> for scalar types.

freddyaboulton (Contributor):

Thanks for looking into this, @bchen1116. Let's list this as a breaking change for now. I imagine we might want to file an issue to discuss whether there are any changes we need to make to the SimpleImputer? If users run it on natural language after this PR, they'll get a stacktrace they didn't get before.

bchen1116 (Contributor, Author):

@freddyaboulton updated the release notes with the breaking change and filed the issue here!

     assert nl_nan_check.validate([nl_col, nl_col_without_nan]) == expected

-    # test np.array
-    assert nl_nan_check.validate(np.array([nl_col, nl_col_without_nan])) == expected
bchen1116 (Contributor, Author):

Since we can't cast this to the NaturalLanguage type without ww, we remove it from the tests. It'll be inferred as Unknown instead.
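
A short sketch of the distinction, assuming Woodwork's `ww` DataFrame accessor: plain list/np.array input carries no type annotations, so free-form text may be inferred as Unknown, and NaturalLanguage has to be requested explicitly:

```python
import pandas as pd
import woodwork as ww  # registers the .ww accessor

df = pd.DataFrame({"text": ["the quick brown fox", None, "jumps over the lazy dog"]})

# With no override, Woodwork's inference may land this column on Unknown:
df.ww.init()
print(df.ww.logical_types["text"])

# Treating it as natural language requires an explicit annotation:
df2 = pd.DataFrame({"text": ["the quick brown fox", None, "jumps over the lazy dog"]})
df2.ww.init(logical_types={"text": "NaturalLanguage"})
print(df2.ww.logical_types["text"])
```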

bchen1116 marked this pull request as ready for review on July 13, 2021 17:08
bchen1116 requested review from dsherry, angela97lin, and chukarsten, and removed the request for dsherry, on July 13, 2021 17:08
freddyaboulton (Contributor) left a comment:

@bchen1116 Thank you for doing this! The changes look good to me, and I'm glad performance improves with this change on a lot of the datasets we commonly test with.

One thing I'd like to find out and cover with small, fast unit tests before merge, though: do our model-understanding methods support Unknown features?

I think we should focus on

  • partial dependence
  • permutation importance
  • prediction explanations

My hope is yes but I'm not 100% sure.
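
A rough sketch of the kind of small unit test being asked for, using evalml's public model-understanding entry points; the fixture names (`fitted_binary_pipeline`, `X_y_binary`) are hypothetical, and exact signatures may differ across evalml versions:

```python
import pandas as pd
import woodwork as ww  # registers the .ww accessor
from evalml.model_understanding import (
    calculate_permutation_importance,
    partial_dependence,
)
from evalml.model_understanding.prediction_explanations import explain_predictions


def test_model_understanding_handles_unknown(fitted_binary_pipeline, X_y_binary):
    X, y = X_y_binary
    X = pd.DataFrame(X)
    # Force one feature to Woodwork's Unknown logical type.
    X.ww.init(logical_types={X.columns[0]: "Unknown"})

    # None of these should raise on a pipeline whose input includes an
    # Unknown-typed feature.
    partial_dependence(fitted_binary_pipeline, X, features=X.columns[0], grid_resolution=2)
    calculate_permutation_importance(fitted_binary_pipeline, X, y, "Log Loss Binary")
    explain_predictions(fitted_binary_pipeline, X, y, indices_to_explain=[0])
```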

Resolved review threads: evalml/tests/automl_tests/test_automl.py; evalml/tests/component_tests/test_text_featurizer.py (outdated).
chukarsten (Contributor) left a comment:

A nit here and there, but overall I like where this is headed. One thing I'd like you to consider, particularly with respect to the prediction explanations, partial dependence, and permutation importance tests, is whether we're adding back in more runtime. I need to reacquaint myself with Freddy's testing environment/context manager to see whether it could help us there, but I'd like us to be cognizant of whether we're walking back any of his work by adding these tests.

Overall, though, this is solid. Thanks a lot for addressing it so quickly, from design to implementation, and getting it out there.

Resolved (outdated) review threads: evalml/automl/automl_search.py; evalml/model_understanding/graphs.py; evalml/pipelines/utils.py.
Comment on lines 1312 to 1314:
partial_dependence(pl, X, indices, grid_resolution=10)
return
s = partial_dependence(pl, X, indices, grid_resolution=10)
chukarsten (Contributor):

How low can we keep this grid_resolution?

bchen1116 (Contributor, Author):

The lowest it can go is 2.
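
For context, a hedged illustration with scikit-learn's partial dependence (evalml's implementation builds on a similar grid computation): grid_resolution is the number of evenly spaced grid points evaluated per feature, and the underlying check rejects values below 2, so 2 is the cheapest grid that still exercises the full code path:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import partial_dependence

X, y = make_classification(n_samples=50, n_features=3, random_state=0)
clf = RandomForestClassifier(n_estimators=5, random_state=0).fit(X, y)

# grid_resolution=2 evaluates only two grid points per feature, which keeps
# a test fast while still running the full partial-dependence machinery.
result = partial_dependence(clf, X, features=[0], grid_resolution=2)
print(result["average"].shape)  # (1, 2): one output, two grid points
```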

bchen1116 (Contributor, Author):

@chukarsten two of the tests take about 1 second each, while the permutation importance test adds about 3 seconds. I think it's important to have the coverage from these tests, and they don't take long enough to raise concern for now, imo. Let me know what you think!

chukarsten merged commit 87df494 into main on Jul 20, 2021
chukarsten deleted the bc_2426_unknown branch on July 20, 2021 17:53
chukarsten mentioned this pull request on Jul 22, 2021
Closes #2426: Handle woodwork's Unknown Logical Type.