Update data checks to accept Woodwork data structures #1481

angela97lin · 2020-11-30T22:31:20Z

Should we update all tests to use Woodwork? ex: we currently usually test pandas data structures, and then have one test to check for other inputs (lists, arrays, etc.). I added Woodwork to that test in most cases, but is that against the principle of having Woodwork be the main expected input?

From discussion with @dsherry and @freddyaboulton we agreed that as long as we have some test that makes sure Woodwork inputs work as well as pandas/numpy, that suffices in most cases where we're not using any Woodwork metadata.

In some tests, I removed checks for 1D lists or arrays. These would get converted to DataColumns, then to Series, and would not have the appropriate attributes/methods that our method impl were looking for. This was previously not as much of an issue because we just did a check and converted our input to DataFrames (regardless of 1D or not). This update is more limiting, but more consistent--the user must pass in a 2D structure for X. Would love to hear suggestions/thoughts though!
Is it possible to use woodwork_wrapper in multiple places (outside of data checks)? May be difficult given different parameters across different methods.

codecov · 2020-11-30T22:44:20Z

Codecov Report

Merging #1481 (f1653b1) into main (d1d2d52) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@            Coverage Diff            @@
##             main    #1481     +/-   ##
=========================================
+ Coverage   100.0%   100.0%   +0.1%     
=========================================
  Files         223      223             
  Lines       15158    15257     +99     
=========================================
+ Hits        15151    15250     +99     
  Misses          7        7

Impacted Files	Coverage Δ
evalml/data_checks/class_imbalance_data_check.py	`100.0% <100.0%> (ø)`
evalml/data_checks/data_checks.py	`100.0% <100.0%> (ø)`
evalml/data_checks/highly_null_data_check.py	`100.0% <100.0%> (ø)`
evalml/data_checks/id_columns_data_check.py	`100.0% <100.0%> (ø)`
evalml/data_checks/invalid_targets_data_check.py	`100.0% <100.0%> (ø)`
evalml/data_checks/no_variance_data_check.py	`100.0% <100.0%> (ø)`
evalml/data_checks/outliers_data_check.py	`100.0% <100.0%> (ø)`
evalml/data_checks/target_leakage_data_check.py	`100.0% <100.0%> (ø)`
...ta_checks_tests/test_class_imbalance_data_check.py	`100.0% <100.0%> (ø)`
evalml/tests/data_checks_tests/test_data_checks.py	`100.0% <100.0%> (ø)`
... and 6 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d1d2d52...f1653b1. Read the comment docs.

freddyaboulton

@angela97lin Thanks for doing this!

Should we update AutoMLSearch to not convert to pandas before running data checks?

You raise a good point about whether we should use ww as the primary data type in our tests! I think this applies to components/pipelines and not just data checks.

I'd be in favor of continuing our pattern of using pandas as the "primary" data type and having one (or a couple) unit tests with a parametrize that checks that the feature works as expected given all possible data types. The reason being that (I think) most of our features don't leverage the logical/semantic types in the table so I don't see a lot of value in updating all of our tests.

That being said, I think there are some places where we don't follow the pattern I just mentioned so we should update those tests. And features that leverage logical/semantic types should be tested against lots of different inferred and user-defined types. (I think that's the case but just being explicit).

Curious to hear what others think. Certainly happy to brainstorm on a call or something together.

evalml/tests/data_checks_tests/test_id_columns_data_check.py

evalml/utils/gen_utils.py

bchen1116

Looks great! I agree with @freddyaboulton in keeping the pandas tests for the reasons he mentioned. Also agree that the @woodwork_wrapper would be more understandable from the user's perspective as a call inside the method instead of as a function wrapper.

evalml/utils/gen_utils.py

dsherry · 2020-12-02T20:30:02Z

@angela97lin RE your question in the description:

Should we update all tests to use Woodwork? ex: we currently usually test pandas data structures, and then have one test to check for other inputs (lists, arrays, etc.). I added Woodwork to that test in most cases, but is that against the principle of having Woodwork be the main expected input?

Yes I think we should make that update! And we should also have separate tests which ensure that if a pandas dataframe is provided to the data checks, it gets converted to woodwork correctly.

…1465_data_checks_ww

dsherry

@angela97lin looks good! 👍

angela97lin added 7 commits November 30, 2020 16:08

init

c11db6d

id cols dc

4d7a1ee

invalid targets

2ce6acf

no variance

49685ff

target leakage

5602782

outliers, datachecks

ce48a80

linting

65fa0f1

angela97lin added this to the December 2020 milestone Nov 30, 2020

angela97lin self-assigned this Nov 30, 2020

Merge branch 'main' into 1465_data_checks_ww

860d399

angela97lin added 4 commits December 1, 2020 11:12

Merge branch 'main' into 1465_data_checks_ww

16fad1c

add tests and cleanup

62348ef

Merge branch 'main' into 1465_data_checks_ww

d1ba7e9

cleanup docstr

079ff42

angela97lin marked this pull request as ready for review December 1, 2020 20:46

angela97lin requested review from freddyaboulton, dsherry, bchen1116, christopherbunn, eccabay and jeremyliweishih December 1, 2020 20:46

freddyaboulton approved these changes Dec 1, 2020

View reviewed changes

evalml/tests/data_checks_tests/test_id_columns_data_check.py Show resolved Hide resolved

evalml/utils/gen_utils.py Outdated Show resolved Hide resolved

bchen1116 approved these changes Dec 2, 2020

View reviewed changes

evalml/utils/gen_utils.py Outdated Show resolved Hide resolved

angela97lin added 5 commits December 3, 2020 12:04

release notes

74434ec

release notes

77336b1

Merge branch 'main' into 1465_data_checks_ww

63bc56b

remove wrapper

06e4efe

Merge branch '1465_data_checks_ww' of github.com:alteryx/evalml into …

4cd0726

…1465_data_checks_ww

angela97lin mentioned this pull request Dec 3, 2020

Integrate Woodwork DataTables into EvalML #1229

Closed

angela97lin added 2 commits December 3, 2020 14:49

fix doctests

d040539

Merge branch 'main' into 1465_data_checks_ww

f1653b1

angela97lin merged commit 7f63094 into main Dec 3, 2020

angela97lin deleted the 1465_data_checks_ww branch December 3, 2020 20:33

dsherry reviewed Dec 3, 2020

View reviewed changes

dsherry mentioned this pull request Dec 29, 2020

Release v0.17.0 #1623

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Update data checks to accept Woodwork data structures #1481

Update data checks to accept Woodwork data structures #1481

angela97lin commented Nov 30, 2020 •

edited

Loading

codecov bot commented Nov 30, 2020 •

edited

Loading

freddyaboulton left a comment

bchen1116 left a comment

dsherry commented Dec 2, 2020

dsherry left a comment

Update data checks to accept Woodwork data structures #1481

Update data checks to accept Woodwork data structures #1481

Conversation

angela97lin commented Nov 30, 2020 • edited Loading

codecov bot commented Nov 30, 2020 • edited Loading

Codecov Report

freddyaboulton left a comment

Choose a reason for hiding this comment

bchen1116 left a comment

Choose a reason for hiding this comment

dsherry commented Dec 2, 2020

dsherry left a comment

Choose a reason for hiding this comment

angela97lin commented Nov 30, 2020 •

edited

Loading

codecov bot commented Nov 30, 2020 •

edited

Loading