Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update docs to use data check action methods rather than manually cleaning data #3050

Merged
merged 13 commits into from
Nov 18, 2021

Conversation

angela97lin
Copy link
Contributor

@angela97lin angela97lin commented Nov 15, 2021

Closes #3004. Docs here: https://feature-labs-inc-evalml--3050.com.readthedocs.build/en/3050/user_guide/data_actions.html

Main docs here: https://evalml.alteryx.com/en/stable/user_guide/data_actions.html

Q: In the documentation, there's a section where we manually address errors and later show that not addressing warnings leads to worse performance. However, data check actions don't differentiate between warnings/errors and severity of the action.

We could either:

  • Remove this section. Reasoning being that we're showcasing actions, and this is manual cleaning
  • Keep as is. It's lame that the error-cleaning section is manual, but there's still a point we get across that data check warnings are important and useful to address to increase model performance
  • Add functionality to actions to only address / return components if we'll error out in search. I'm not fully convinced about the usefulness of this method outside of this case.

@codecov
Copy link

codecov bot commented Nov 15, 2021

Codecov Report

Merging #3050 (e77fc2a) into main (401457c) will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@          Coverage Diff          @@
##            main   #3050   +/-   ##
=====================================
  Coverage   99.8%   99.8%           
=====================================
  Files        312     312           
  Lines      30421   30421           
=====================================
  Hits       30330   30330           
  Misses        91      91           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 401457c...e77fc2a. Read the comment docs.

@angela97lin angela97lin self-assigned this Nov 16, 2021
@jeremyliweishih
Copy link
Collaborator

haven't taken a look at the PR itself but the string for null_row_indices is ginormous, could we alter the dataset to prodouce less null rows or truncate the list somehow?

@angela97lin
Copy link
Contributor Author

@jeremyliweishih Yah, that's what #3000 addresses! I'll probably pick that up this sprint too :)

Copy link
Contributor

@chukarsten chukarsten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with your option of removing the manual data cleaning. That doesn't really highlight what EvalML brings to the table. DataChecks/Actions and utilize it to do the cleaning in a quick and convenient way, does. I filed this issue to follow up on this work, should we decide to do so. I think this a good move, though.

"from evalml.data_checks import DataCheckAction\n",
"\n",
"# Convert dictionary form of actions returned from data check output dictionary as DataCheckAction objects\n",
"actions = [\n",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is definitely a step in the right direction. Obviously not in the scope of this PR, but what do you think about a follow up PR to add to the search_iterative() function a parameter like action_return_type="object" where you can pass a string in that will either give you back the list of converted DataCheckActions (essentially doing what you do here in this cell) in the results[1]['actions'] value of the results dict? If we set the default to "dict" then it can retain the current behavior. Just a shortcut, but as a novice EvalML user, I don't look at this list comprehension and think "this makes my life easier!" lol.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love, love this idea and very much agree--thank you for filing! 🙏

Copy link
Contributor

@bchen1116 bchen1116 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! I think either option1 or 2 would work. I do see value in showing that addressing all warnings/errors would be the best option to have better search results, but it also makes sense to showcase what EvalML can do versus manual cleaning.

Copy link
Contributor

@freddyaboulton freddyaboulton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@angela97lin I think it's fine to keep option 2 you presented. I think it's pretty clear what's happening and it presents users with two different ways to clean their data prior to search. I don't think it's lame that one section is manual while the other isn't. Some users may prefer to do manual cleaning anyways.

I guess we can consider making highly null columns an error instead of a warning to side-step this point?

Other than that, just two minor nits. Looks good to me!

"\n",
"EvalML streamlines the creation and implementation of machine learning models for tabular data. One of the many features it offers is [data checks](https://evalml.alteryx.com/en/stable/user_guide/data_checks.html), which are geared towards determining the health of the data before we train a model on it. These data checks have associated actions with them and will be shown in this notebook. In our default data checks, we have the following checks:\n",
"EvalML streamlines the creation and implementation of machine learning models for tabular data. One of the many features it offers is [data checks](https://evalml.alteryx.com/en/stable/user_guide/data_checks.html), which help determine the health of the our data before we train a model on it. These data checks have associated actions with them and will be shown in this notebook. In our default data checks, we have the following checks:\n",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: "of the our data"

"# we must also drop this for y since we are removing its associated feature input\n",
"y_train.drop(index=1477, inplace=True)\n",
"\n",
"from evalml.pipelines.utils import make_pipeline_from_actions\n",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can now get rid of the In the future, we aim to provide a helper function to allow users to quickly clean the data by taking in the list of actions and creating an appropriate pipeline of transformers to alter the data line below?

@angela97lin
Copy link
Contributor Author

@freddyaboulton @chukarsten @bchen1116 It sounds like there's no clear consensus on what's the better option here. Here are my thoughts after reading your comments:

I agree with @chukarsten's comment that we want to highlight what EvalML can bring to the table. I think adding the section about manual cleaning detracts from this since it doesn't get straight to the point of what we can provide. I'm going to move the section about addressing via make_pipeline_from_actions above the manual cleaning section.

However, we can still keep the manual cleaning section, since it could provide users with an idea of how they could address comments by looking at the output of data check actions.

LMK if you have any objections :)

@angela97lin angela97lin merged commit 292d5aa into main Nov 18, 2021
@angela97lin angela97lin deleted the 3004_update_docs branch November 18, 2021 05:39
@chukarsten chukarsten mentioned this pull request Nov 29, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Update docs to use make_pipeline_from_actions to address data checks
5 participants