Update docs to use data check action methods rather than manually cleaning data #3050

angela97lin · 2021-11-15T21:10:20Z

Closes #3004. Docs here: https://feature-labs-inc-evalml--3050.com.readthedocs.build/en/3050/user_guide/data_actions.html

Main docs here: https://evalml.alteryx.com/en/stable/user_guide/data_actions.html

Q: In the documentation, there's a section where we manually address errors and later show that not addressing warnings leads to worse performance. However, data check actions don't differentiate between warnings/errors and severity of the action.

We could either:

1. Remove this section. Reasoning being that we're showcasing actions, and this is manual cleaning.
2. Keep as is. It's lame that the error-cleaning section is manual, but there's still a point we get across that data check warnings are important and useful to address to increase model performance.
3. Add functionality to actions to only address / return components if we'll error out in search. I'm not fully convinced about the usefulness of this method outside of this case.
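
To make the last option concrete, here is a minimal sketch of filtering actions down to those tied to an error. It assumes the data check output is a dict with "warnings", "errors", and "actions" lists, and that each message and action carries a "data_check_name" key. The helper `actions_for_errors` and the check name `SomeErrorCheck` are hypothetical and not part of EvalML.

```python
def actions_for_errors(results):
    """Keep only the actions whose originating data check reported an error.

    Hypothetical helper: assumes messages and actions share a
    "data_check_name" key, which may not match the real output format.
    """
    error_checks = {msg["data_check_name"] for msg in results["errors"]}
    return [
        action
        for action in results["actions"]
        if action["data_check_name"] in error_checks
    ]


# Illustrative data only; the check names and columns are made up.
results = {
    "warnings": [{"data_check_name": "HighlyNullDataCheck", "level": "warning"}],
    "errors": [{"data_check_name": "SomeErrorCheck", "level": "error"}],
    "actions": [
        {
            "code": "DROP_COL",
            "data_check_name": "HighlyNullDataCheck",
            "metadata": {"columns": ["a"]},
        },
        {
            "code": "DROP_COL",
            "data_check_name": "SomeErrorCheck",
            "metadata": {"columns": ["b"]},
        },
    ],
}

blocking_actions = actions_for_errors(results)
```

Matching on the check name is crude: a check that emits both a warning and an error would have all of its actions kept, so a real implementation would probably need a tighter link between a message and its action.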

codecov · 2021-11-15T21:13:53Z

Codecov Report

Merging #3050 (e77fc2a) into main (401457c) will not change coverage.
The diff coverage is n/a.

@@          Coverage Diff          @@
##            main   #3050   +/-   ##
=====================================
  Coverage   99.8%   99.8%           
=====================================
  Files        312     312           
  Lines      30421   30421           
=====================================
  Hits       30330   30330           
  Misses        91      91


jeremyliweishih · 2021-11-16T18:26:03Z

Haven't taken a look at the PR itself, but the string for null_row_indices is ginormous. Could we alter the dataset to produce fewer null rows or truncate the list somehow?

angela97lin · 2021-11-16T18:27:40Z

@jeremyliweishih Yah, that's what #3000 addresses! I'll probably pick that up this sprint too :)
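
For illustration only, the truncation idea could look something like the sketch below. The helper `summarize_indices` and its `max_shown` parameter are hypothetical; this is not necessarily how #3000 approaches the problem.

```python
def summarize_indices(indices, max_shown=5):
    """Render a list of row indices, eliding everything past max_shown."""
    indices = list(indices)
    if len(indices) <= max_shown:
        return str(indices)
    shown = ", ".join(str(i) for i in indices[:max_shown])
    return f"[{shown}, ... ({len(indices) - max_shown} more)]"


short_summary = summarize_indices(range(3))
long_summary = summarize_indices(range(100))
print(short_summary)  # [0, 1, 2]
print(long_summary)   # [0, 1, 2, 3, 4, ... (95 more)]
```

The full list of indices would still be available in the underlying data; only the displayed string is shortened.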

chukarsten

I agree with your option of removing the manual data cleaning. That doesn't really highlight what EvalML brings to the table. Using DataChecks/Actions to do the cleaning in a quick and convenient way does. I filed this issue to follow up on this work, should we decide to do so. I think this is a good move, though.

chukarsten · 2021-11-17T17:58:48Z

docs/source/user_guide/data_actions.ipynb

+    "from evalml.data_checks import DataCheckAction\n",
+    "\n",
+    "# Convert dictionary form of actions returned from data check output dictionary as DataCheckAction objects\n",
+    "actions = [\n",


I think this is definitely a step in the right direction. Obviously not in the scope of this PR, but what do you think about a follow-up PR that adds a parameter like action_return_type="object" to the search_iterative() function? Passing "object" would give you back the list of converted DataCheckActions (essentially doing what you do here in this cell) in the results[1]['actions'] value of the results dict. If we set the default to "dict" then it can retain the current behavior. Just a shortcut, but as a novice EvalML user, I don't look at this list comprehension and think "this makes my life easier!" lol.

I love, love this idea and very much agree--thank you for filing! 🙏
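
As a rough sketch of the shortcut proposed above, the snippet below switches between dict and object output with an action_return_type argument. Everything in it is a stand-in: `ActionStandIn` is a hypothetical placeholder for DataCheckAction, `format_actions` is a made-up helper rather than anything in search_iterative(), and the shape of the action dict is assumed.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class ActionStandIn:
    """Hypothetical placeholder for evalml's DataCheckAction (not the real class)."""

    code: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def format_actions(action_dicts, action_return_type="dict"):
    """Return actions as plain dicts (the default) or as converted objects."""
    if action_return_type == "dict":
        return list(action_dicts)
    if action_return_type == "object":
        return [
            ActionStandIn(code=a["code"], metadata=a.get("metadata", {}))
            for a in action_dicts
        ]
    raise ValueError(f"Unknown action_return_type: {action_return_type!r}")


# Assumed shape of one action dict; the column name is illustrative.
raw_actions = [{"code": "DROP_COL", "metadata": {"columns": ["mostly_null_col"]}}]

as_dicts = format_actions(raw_actions)
as_objects = format_actions(raw_actions, action_return_type="object")
```

Defaulting to "dict" keeps existing callers working, while "object" saves users from writing the conversion list comprehension themselves.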

bchen1116

LGTM! I think either option 1 or 2 would work. I do see value in showing that addressing all warnings/errors would be the best option to have better search results, but it also makes sense to showcase what EvalML can do versus manual cleaning.

freddyaboulton

@angela97lin I think it's fine to keep option 2 you presented. I think it's pretty clear what's happening and it presents users with two different ways to clean their data prior to search. I don't think it's lame that one section is manual while the other isn't. Some users may prefer to do manual cleaning anyways.

I guess we can consider making highly null columns an error instead of a warning to side-step this point?

Other than that, just two minor nits. Looks good to me!

freddyaboulton · 2021-11-17T19:09:18Z

docs/source/user_guide/data_actions.ipynb

    "\n",
-    "EvalML streamlines the creation and implementation of machine learning models for tabular data. One of the many features it offers is [data checks](https://evalml.alteryx.com/en/stable/user_guide/data_checks.html), which are geared towards determining the health of the data before we train a model on it. These data checks have associated actions with them and will be shown in this notebook. In our default data checks, we have the following checks:\n",
+    "EvalML streamlines the creation and implementation of machine learning models for tabular data. One of the many features it offers is [data checks](https://evalml.alteryx.com/en/stable/user_guide/data_checks.html), which help determine the health of the our data before we train a model on it. These data checks have associated actions with them and will be shown in this notebook. In our default data checks, we have the following checks:\n",


typo: "of the our data"

freddyaboulton · 2021-11-17T19:19:11Z

docs/source/user_guide/data_actions.ipynb

-    "# we must also drop this for y since we are removing its associated feature input\n",
-    "y_train.drop(index=1477, inplace=True)\n",
-    "\n",
+    "from evalml.pipelines.utils import make_pipeline_from_actions\n",


I think we can now get rid of the "In the future, we aim to provide a helper function to allow users to quickly clean the data by taking in the list of actions and creating an appropriate pipeline of transformers to alter the data" line below?

angela97lin · 2021-11-17T19:48:50Z

@freddyaboulton @chukarsten @bchen1116 It sounds like there's no clear consensus on what's the better option here. Here are my thoughts after reading your comments:

I agree with @chukarsten's comment that we want to highlight what EvalML can bring to the table. I think adding the section about manual cleaning detracts from this since it doesn't get straight to the point of what we can provide. I'm going to move the section about addressing via make_pipeline_from_actions above the manual cleaning section.

However, we can still keep the manual cleaning section, since it could provide users with an idea of how they could address comments by looking at the output of data check actions.

LMK if you have any objections :)

angela97lin added 3 commits November 15, 2021 16:09

init

732c2c2

notebook linting

ee2bc94

cleanup

b6a039d

angela97lin added 2 commits November 15, 2021 16:14

fix missing colon in release notes

ca6d4fc

remove scrolled

d288206

angela97lin self-assigned this Nov 16, 2021

Merge branch 'main' into 3004_update_docs

467f116

angela97lin marked this pull request as ready for review November 16, 2021 18:23

angela97lin requested review from bchen1116, freddyaboulton, dsherry, christopherbunn, chukarsten, eccabay, jeremyliweishih and ParthivNaresh November 16, 2021 18:23

chukarsten mentioned this pull request Nov 17, 2021

Return list of DataCheckAction objects #3072

Closed

chukarsten approved these changes Nov 17, 2021

bchen1116 approved these changes Nov 17, 2021

freddyaboulton approved these changes Nov 17, 2021

Merge branch 'main' into 3004_update_docs

9b1f5ca

angela97lin added 5 commits November 17, 2021 15:37

move sections around and cleanup

990f069

linting

edcc6a3

Merge branch 'main' into 3004_update_docs

c5355bf

fix print statement

a664975

Merge branch '3004_update_docs' of github.com:alteryx/evalml into 300…

422eefa

…4_update_docs

retrigger

e77fc2a

angela97lin merged commit 292d5aa into main Nov 18, 2021

angela97lin deleted the 3004_update_docs branch November 18, 2021 05:39

chukarsten mentioned this pull request Nov 29, 2021

Release v.0.38.0 #3102

Merged
