Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update data checks to accept Woodwork data structures #1481

Merged
merged 19 commits into from
Dec 3, 2020

Conversation

angela97lin
Copy link
Contributor

@angela97lin angela97lin commented Nov 30, 2020

Closes #1465

  • Should we update all tests to use Woodwork? ex: we currently usually test pandas data structures, and then have one test to check for other inputs (lists, arrays, etc.). I added Woodwork to that test in most cases, but is that against the principle of having Woodwork be the main expected input?

From discussion with @dsherry and @freddyaboulton we agreed that as long as we have some test that makes sure Woodwork inputs work as well as pandas/numpy, that suffices in most cases where we're not using any Woodwork metadata.

  • In some tests, I removed checks for 1D lists or arrays. These would get converted to DataColumns, then to Series, and would not have the appropriate attributes/methods that our method impl were looking for. This was previously not as much of an issue because we just did a check and converted our input to DataFrames (regardless of 1D or not). This update is more limiting, but more consistent--the user must pass in a 2D structure for X. Would love to hear suggestions/thoughts though!
  • Is it possible to use woodwork_wrapper in multiple places (outside of data checks)? May be difficult given different parameters across different methods.

@angela97lin angela97lin added this to the December 2020 milestone Nov 30, 2020
@angela97lin angela97lin self-assigned this Nov 30, 2020
@codecov
Copy link

codecov bot commented Nov 30, 2020

Codecov Report

Merging #1481 (f1653b1) into main (d1d2d52) will increase coverage by 0.1%.
The diff coverage is 100.0%.

Impacted file tree graph

@@            Coverage Diff            @@
##             main    #1481     +/-   ##
=========================================
+ Coverage   100.0%   100.0%   +0.1%     
=========================================
  Files         223      223             
  Lines       15158    15257     +99     
=========================================
+ Hits        15151    15250     +99     
  Misses          7        7             
Impacted Files Coverage Δ
evalml/data_checks/class_imbalance_data_check.py 100.0% <100.0%> (ø)
evalml/data_checks/data_checks.py 100.0% <100.0%> (ø)
evalml/data_checks/highly_null_data_check.py 100.0% <100.0%> (ø)
evalml/data_checks/id_columns_data_check.py 100.0% <100.0%> (ø)
evalml/data_checks/invalid_targets_data_check.py 100.0% <100.0%> (ø)
evalml/data_checks/no_variance_data_check.py 100.0% <100.0%> (ø)
evalml/data_checks/outliers_data_check.py 100.0% <100.0%> (ø)
evalml/data_checks/target_leakage_data_check.py 100.0% <100.0%> (ø)
...ta_checks_tests/test_class_imbalance_data_check.py 100.0% <100.0%> (ø)
evalml/tests/data_checks_tests/test_data_checks.py 100.0% <100.0%> (ø)
... and 6 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d1d2d52...f1653b1. Read the comment docs.

Copy link
Contributor

@freddyaboulton freddyaboulton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@angela97lin Thanks for doing this!

Should we update AutoMLSearch to not convert to pandas before running data checks?

You raise a good point about whether we should use ww as the primary data type in our tests! I think this applies to components/pipelines and not just data checks.

I'd be in favor of continuing our pattern of using pandas as the "primary" data type and having one (or a couple) unit tests with a parametrize that checks that the feature works as expected given all possible data types. The reason being that (I think) most of our features don't leverage the logical/semantic types in the table so I don't see a lot of value in updating all of our tests.

That being said, I think there are some places where we don't follow the pattern I just mentioned so we should update those tests. And features that leverage logical/semantic types should be tested against lots of different inferred and user-defined types. (I think that's the case but just being explicit).

Curious to hear what others think. Certainly happy to brainstorm on a call or something together.

evalml/utils/gen_utils.py Outdated Show resolved Hide resolved
Copy link
Contributor

@bchen1116 bchen1116 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! I agree with @freddyaboulton in keeping the pandas tests for the reasons he mentioned. Also agree that the @woodwork_wrapper would be more understandable from the user's perspective as a call inside the method instead of as a function wrapper.

evalml/utils/gen_utils.py Outdated Show resolved Hide resolved
@dsherry
Copy link
Contributor

dsherry commented Dec 2, 2020

@angela97lin RE your question in the description:

Should we update all tests to use Woodwork? ex: we currently usually test pandas data structures, and then have one test to check for other inputs (lists, arrays, etc.). I added Woodwork to that test in most cases, but is that against the principle of having Woodwork be the main expected input?

Yes I think we should make that update! And we should also have separate tests which ensure that if a pandas dataframe is provided to the data checks, it gets converted to woodwork correctly.

@angela97lin angela97lin merged commit 7f63094 into main Dec 3, 2020
@angela97lin angela97lin deleted the 1465_data_checks_ww branch December 3, 2020 20:33
Copy link
Contributor

@dsherry dsherry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@angela97lin looks good! 👍

@dsherry dsherry mentioned this pull request Dec 29, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Update all data checks to support Woodwork
4 participants