Update data checks to accept Woodwork data structures #1481
Codecov Report
@@            Coverage Diff            @@
##             main    #1481    +/-   ##
=========================================
+ Coverage   100.0%   100.0%   +0.1%
=========================================
  Files         223      223
  Lines       15158    15257     +99
=========================================
+ Hits        15151    15250     +99
  Misses          7        7
@angela97lin Thanks for doing this!
Should we update AutoMLSearch to not convert to pandas before running data checks?
You raise a good point about whether we should use ww as the primary data type in our tests! I think this applies to components/pipelines and not just data checks.
I'd be in favor of continuing our pattern of using pandas as the "primary" data type and having one (or a couple) unit tests with a parametrize that checks that the feature works as expected given all possible data types. The reason being that (I think) most of our features don't leverage the logical/semantic types in the table so I don't see a lot of value in updating all of our tests.
That being said, I think there are some places where we don't follow the pattern I just mentioned so we should update those tests. And features that leverage logical/semantic types should be tested against lots of different inferred and user-defined types. (I think that's the case but just being explicit).
Curious to hear what others think. Certainly happy to brainstorm on a call or something together.
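The pattern described above (pandas as the primary type, plus one parametrized test that covers the other input types) could be sketched roughly as follows. This is only an illustration: `count_null_columns` is a hypothetical stand-in for a data check, not a function from this project, and `RAW` is made-up sample data. A Woodwork structure would be one more parametrized case; it is left out here because its constructor depends on the Woodwork version in use.

```python
import numpy as np
import pandas as pd
import pytest


def count_null_columns(X):
    """Hypothetical stand-in for a data check: count columns containing nulls.

    The input is normalized to a pandas DataFrame first, so every supported
    input type goes through the same code path.
    """
    X = pd.DataFrame(X)
    return int(X.isnull().any(axis=0).sum())


# Made-up sample data: the second column has one missing value.
RAW = [[1.0, np.nan], [2.0, 3.0], [4.0, 5.0]]


@pytest.mark.parametrize(
    "make_input",
    [
        pd.DataFrame,  # the "primary" type used in most tests
        np.array,      # numpy input
        list,          # plain nested lists
        # A Woodwork constructor would be one more entry here (assumption).
    ],
)
def test_count_null_columns_input_types(make_input):
    # The same expectation must hold regardless of the input container.
    assert count_null_columns(make_input(RAW)) == 1
```

Features that do rely on logical or semantic types would still need their own, more targeted tests, since a single container-type sweep like this says nothing about type inference.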
Looks great! I agree with @freddyaboulton in keeping the pandas tests for the reasons he mentioned. Also agree that the @woodwork_wrapper would be more understandable from the user's perspective as a call inside the method instead of as a function wrapper.
@angela97lin RE your question in the description:
Yes I think we should make that update! And we should also have separate tests which ensure that if a pandas dataframe is provided to the data checks, it gets converted to woodwork correctly.
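To make the wrapper-versus-inline-call point concrete, here is a minimal sketch of the two styles. Everything in it is hypothetical: `Table` stands in for a Woodwork data structure, and `to_table` and `convert_first_arg` are illustrative names, not the `woodwork_wrapper` from this PR, whose actual signature is not shown in the thread.

```python
import functools


class Table:
    """Hypothetical stand-in for a Woodwork data structure."""

    def __init__(self, rows):
        self.rows = [list(r) for r in rows]


def to_table(data):
    """Hypothetical helper: pass tables through, wrap anything else."""
    return data if isinstance(data, Table) else Table(data)


def convert_first_arg(method):
    """Decorator style: the conversion happens outside the method body.

    Assumes the data is always the first positional argument after self,
    which may not hold for methods with different parameters.
    """
    @functools.wraps(method)
    def wrapped(self, X, *args, **kwargs):
        return method(self, to_table(X), *args, **kwargs)
    return wrapped


class DecoratedCheck:
    @convert_first_arg
    def validate(self, X):
        # X is already a Table here, but nothing in the body says so.
        return len(X.rows)


class InlineCheck:
    def validate(self, X):
        X = to_table(X)  # conversion is visible where the data is used
        return len(X.rows)
```

Both classes behave the same for list or `Table` input; the inline version simply keeps the conversion in plain sight for someone reading the method.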
@angela97lin looks good!
Closes #1465
From discussion with @dsherry and @freddyaboulton we agreed that as long as we have some test that makes sure Woodwork inputs work as well as pandas/numpy, that suffices in most cases where we're not using any Woodwork metadata.
X
. Would love to hear suggestions/thoughts though!
woodwork_wrapper in multiple places (outside of data checks)? May be difficult given different parameters across different methods.