Skip to content

Delete data check code from AutoML search #1935

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 15 commits into from
Mar 14, 2021
Merged

Delete data check code from AutoML search #1935

merged 15 commits into from
Mar 14, 2021

Conversation

angela97lin
Copy link
Contributor

Closes #1882

@angela97lin angela97lin self-assigned this Mar 5, 2021
@angela97lin angela97lin changed the title Move data check code from AutoML search to init Delete data check code from AutoML search Mar 8, 2021
@codecov
Copy link

codecov bot commented Mar 8, 2021

Codecov Report

Merging #1935 (b1b04c1) into main (10297a2) will decrease coverage by 0.1%.
The diff coverage is 100.0%.

Impacted file tree graph

@@            Coverage Diff            @@
##             main    #1935     +/-   ##
=========================================
- Coverage   100.0%   100.0%   -0.0%     
=========================================
  Files         273      273             
  Lines       22406    22288    -118     
=========================================
- Hits        22400    22282    -118     
  Misses          6        6             
Impacted Files Coverage Δ
evalml/data_checks/data_checks.py 100.0% <ø> (ø)
evalml/tuners/random_search_tuner.py 95.3% <ø> (ø)
evalml/automl/automl_search.py 100.0% <100.0%> (ø)
evalml/data_checks/__init__.py 100.0% <100.0%> (ø)
evalml/tests/automl_tests/test_automl.py 100.0% <100.0%> (ø)
evalml/tests/data_checks_tests/test_data_checks.py 100.0% <100.0%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 10297a2...b1b04c1. Read the comment docs.

"cell_type": "markdown",
"metadata": {},
"source": [
"### High Variance Cross-Validation Scores\n",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops, forgot to delete this before!

@angela97lin angela97lin marked this pull request as ready for review March 9, 2021 16:07
@angela97lin angela97lin added this to the Sprint 2021 Mar A milestone Mar 9, 2021
Copy link
Contributor

@freddyaboulton freddyaboulton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@angela97lin The code changes look great! I think we should synch about whether or not we should delete the data checks section from the automl user guide before merging!

@@ -142,17 +142,6 @@
"print(regression_objective.name)"
]
},
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think we should mention data checks in the automl user guide.

If I'm following the intention of the data check actions epic, we're moving from "AutoML will check your data for you" to "Evalml has functionality to check your data and recommend fixes before you start AutoML".

I think it's important to document the relationship between data checks and automl and that it makes sense to document this relationship here.

So I think we should:

  1. Add back the data checks section (changing it a bit to make it clear that AutoMLSearch is not running the checks)
  2. Import DefaultDataChecks and run on the data.
  3. Say that the results of the data checks don't reveal anything that would cause problems so we can run search.
  4. Link to the data checks user guide page for more information

Let me know what you think!

Copy link
Contributor Author

@angela97lin angela97lin Mar 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm, I like how this sounds... my only qualm about this is it would make sense for me that the AutoML user guide currently focuses on the functionality within AutoMLSearch (we currently list out error callbacks, passing in custom pipelines, etc.). This seems more like a flow of how the user should use EvalML.

I guess TLDR is: for what our docs currently look like, I like this change. Maybe this also means we should separate out our docs and not shove everything into the AutoML guide; we could have a intro user guide that steps through the steps a user would take, and then a dedicated AutoML guide that deep-dives into AutoMLSearch specifics.

Will update this, thank you for the great feedback @freddyaboulton!

Copy link
Contributor

@bchen1116 bchen1116 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Agree this should be a breaking change and should discuss what happens with automl.ipynb

@dsherry dsherry removed this from the Sprint 2021 Mar A milestone Mar 9, 2021
@angela97lin angela97lin dismissed freddyaboulton’s stale review March 10, 2021 02:30

Addressed requested changes

Copy link
Contributor

@freddyaboulton freddyaboulton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@angela97lin Looks great! This change add two uncovered lines to our repo.

Can you please add coverage in test_data_checks.py It should be quick but we can also handle that in a separate issue.

In my opinion, only the first uncovered line is important: raise ValueError(f"Parameter data_checks must be a list. Received {type(data_check_classes).__name__}."). We're probably going to get rid of the AutoMLDataChecks class (I can file an issue) since AutoMLSearch no longer knows about data checks so the second uncovered line is not high priority.

"source": [
"### Data Checks\n",
"\n",
"Before calling `AutoMLSearch.search`, we should run some sanity checks on our data to ensure that the input data being passed will not run into some common issues before running a potentially time-consuming search. EvalML has various data checks that makes this easy. Each data check will return a collection of warnings and errors if it detects potential issues with the input data. This allows users to inspect their data to avoid confusing errors that may arise during the search process. You can learn about each of the data checks available through our [data checks guide](data_checks.ipynb) \n",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Beautiful

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@angela97lin
Copy link
Contributor Author

@freddyaboulton Thanks for catching this! Sucks that codecov let me slip this by but I added a test and just decided to remove AutoMLDataChecks in this PR since it's not really useful anymore / was only used in automl.

"cell_type": "markdown",
"metadata": {},
"source": [
"Since there were no warnings or errors returned, we can safely continue with the search process."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great! Gonna be really cool when we can replace this with the data check actions :D

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreeeeed 🥳

Copy link
Contributor

@dsherry dsherry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent!

@angela97lin angela97lin merged commit 1b3e084 into main Mar 14, 2021
@angela97lin angela97lin deleted the 1882_init branch March 14, 2021 04:08
@dsherry dsherry mentioned this pull request Mar 24, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Delete data check code from AutoML
5 participants