New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update docs to use data check action methods rather than manually cleaning data #3050
Conversation
Codecov Report
@@ Coverage Diff @@
## main #3050 +/- ##
=====================================
Coverage 99.8% 99.8%
=====================================
Files 312 312
Lines 30421 30421
=====================================
Hits 30330 30330
Misses 91 91 Continue to review full report at Codecov.
|
haven't taken a look at the PR itself but the string for |
@jeremyliweishih Yah, that's what #3000 addresses! I'll probably pick that up this sprint too :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with your option of removing the manual data cleaning. That doesn't really highlight what EvalML brings to the table. DataChecks/Actions and utilize it to do the cleaning in a quick and convenient way, does. I filed this issue to follow up on this work, should we decide to do so. I think this a good move, though.
"from evalml.data_checks import DataCheckAction\n", | ||
"\n", | ||
"# Convert dictionary form of actions returned from data check output dictionary as DataCheckAction objects\n", | ||
"actions = [\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is definitely a step in the right direction. Obviously not in the scope of this PR, but what do you think about a follow up PR to add to the search_iterative()
function a parameter like action_return_type="object"
where you can pass a string in that will either give you back the list of converted DataCheckActions (essentially doing what you do here in this cell) in the results[1]['actions']
value of the results dict? If we set the default to "dict" then it can retain the current behavior. Just a shortcut, but as a novice EvalML user, I don't look at this list comprehension and think "this makes my life easier!" lol.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I love, love this idea and very much agree--thank you for filing! 🙏
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! I think either option1 or 2 would work. I do see value in showing that addressing all warnings/errors would be the best option to have better search results, but it also makes sense to showcase what EvalML can do versus manual cleaning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@angela97lin I think it's fine to keep option 2 you presented. I think it's pretty clear what's happening and it presents users with two different ways to clean their data prior to search. I don't think it's lame that one section is manual while the other isn't. Some users may prefer to do manual cleaning anyways.
I guess we can consider making highly null columns an error instead of a warning to side-step this point?
Other than that, just two minor nits. Looks good to me!
"\n", | ||
"EvalML streamlines the creation and implementation of machine learning models for tabular data. One of the many features it offers is [data checks](https://evalml.alteryx.com/en/stable/user_guide/data_checks.html), which are geared towards determining the health of the data before we train a model on it. These data checks have associated actions with them and will be shown in this notebook. In our default data checks, we have the following checks:\n", | ||
"EvalML streamlines the creation and implementation of machine learning models for tabular data. One of the many features it offers is [data checks](https://evalml.alteryx.com/en/stable/user_guide/data_checks.html), which help determine the health of the our data before we train a model on it. These data checks have associated actions with them and will be shown in this notebook. In our default data checks, we have the following checks:\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: "of the our data"
"# we must also drop this for y since we are removing its associated feature input\n", | ||
"y_train.drop(index=1477, inplace=True)\n", | ||
"\n", | ||
"from evalml.pipelines.utils import make_pipeline_from_actions\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can now get rid of the In the future, we aim to provide a helper function to allow users to quickly clean the data by taking in the list of actions and creating an appropriate pipeline of transformers to alter the data
line below?
@freddyaboulton @chukarsten @bchen1116 It sounds like there's no clear consensus on what's the better option here. Here are my thoughts after reading your comments: I agree with @chukarsten's comment that we want to highlight what EvalML can bring to the table. I think adding the section about manual cleaning detracts from this since it doesn't get straight to the point of what we can provide. I'm going to move the section about addressing via However, we can still keep the manual cleaning section, since it could provide users with an idea of how they could address comments by looking at the output of data check actions. LMK if you have any objections :) |
Closes #3004. Docs here: https://feature-labs-inc-evalml--3050.com.readthedocs.build/en/3050/user_guide/data_actions.html
Main docs here: https://evalml.alteryx.com/en/stable/user_guide/data_actions.html
Q: In the documentation, there's a section where we manually address errors and later show that not addressing warnings leads to worse performance. However, data check actions don't differentiate between warnings/errors and severity of the action.
We could either: