1570 AutoML: pass number of CV folds to default data checks #1619

Merged
17 commits merged into main from 1570-CV-Folds-Default-Data-Checks on Jan 5, 2021

Conversation

ParthivNaresh (Contributor) commented on Dec 29, 2020:

Pull Request Description

Fixes #1570

@@ -328,7 +328,7 @@ def _validate_data_checks(self, data_checks):
return AutoMLDataChecks(data_checks)
ParthivNaresh (Contributor, author) commented:

Although this PR is for DefaultDataChecks, I spent a while trying to figure out how we might pass it to DataChecks or AutoMLDataChecks. It becomes a bigger issue: we would have to check whether ClassImbalanceDataCheck is one of the data checks, determine which parameters the user passed and whether num_cv_folds is among them, and then overwrite that value to reflect self.data_splitter.get_n_splits(). There's probably an easier way, I'm sure :)

A contributor (reviewer) replied:
I think it's fine to only make the change for DefaultDataChecks! If the user is providing their own data checks, it's assumed they init them with the appropriate parameters!
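
For context, here is a minimal sketch of the approach settled on in this thread: only DefaultDataChecks learns about the splitter's fold count, and user-supplied data checks are assumed to already be configured. The class and function names below mirror the discussion but are simplified assumptions, not the exact evalml code.

```python
# Simplified sketch of the agreed approach; names follow this thread but the
# exact evalml signatures may differ.
from evalml.data_checks import ClassImbalanceDataCheck


class DefaultDataChecksSketch:
    """Only the default checks are constructed with the CV fold count."""

    def __init__(self, n_splits=3):
        # Forward the splitter's fold count to the checks that depend on it,
        # e.g. ClassImbalanceDataCheck's num_cv_folds parameter.
        self.data_checks = [ClassImbalanceDataCheck(num_cv_folds=n_splits)]

    def validate(self, X, y):
        # Run every default check and collect the results.
        return [check.validate(X, y) for check in self.data_checks]


def default_checks_for_search(automl):
    # Inside AutoMLSearch, when data_checks="auto" is requested, build the
    # defaults from the splitter the search will actually use.
    return DefaultDataChecksSketch(n_splits=automl.data_splitter.get_n_splits())
```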

# Conflicts:
#	evalml/automl/automl_search.py
#	evalml/tests/automl_tests/test_automl.py
codecov bot commented on Dec 29, 2020:

Codecov Report

Merging #1619 (2c92d5c) into main (be22ef8) will increase coverage by 0.1%.
The diff coverage is 100.0%.


@@            Coverage Diff            @@
##             main    #1619     +/-   ##
=========================================
+ Coverage   100.0%   100.0%   +0.1%     
=========================================
  Files         240      240             
  Lines       18247    18255      +8     
=========================================
+ Hits        18239    18247      +8     
  Misses          8        8             
Impacted Files                               Coverage Δ
evalml/automl/automl_search.py               99.7% <100.0%> (ø)
evalml/data_checks/default_data_checks.py    100.0% <100.0%> (ø)
evalml/tests/automl_tests/test_automl.py     100.0% <100.0%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

angela97lin (Contributor) left a comment:

Left a comment about testing: We want to confirm that the data check got the appropriate parameter!

evalml/tests/automl_tests/test_automl.py (comment outdated and resolved)
freddyaboulton (Contributor) left a comment:
@ParthivNaresh Looks good! I just have a small suggestion on how to improve the unit test. Implementation looks great :)

evalml/data_checks/default_data_checks.py (comment resolved)

@@ -412,6 +413,15 @@ def test_automl_bad_data_check_parameter_type():
automl.search(data_checks=[MockDataCheckErrorAndWarning])


def test_validate_data_check_n_splits(X_y_multi):
A contributor (reviewer) commented:

Maybe a better way of testing this is to run search(data_checks='auto') on a dataset where one of the classes only has 7 observations.

That way we don't have to call the private _validate_data_checks or rely on the fact that the ClassImbalanceDataCheck is the last one in the list or that cv_folds actually corresponds to 2 * n_splits.

ParthivNaresh (author) replied:

Ah I got it, so you're recommending I validate that the parameter is being passed by triggering the ClassImbalanceDataCheck error, nice!
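
A rough sketch of what that test could look like follows. The assumptions here: the AutoMLSearch API of this era (X_train/y_train in the constructor, data_checks passed to search()) and that search() raises a ValueError when data checks return errors; the exact evalml behavior and error message may differ.

```python
import numpy as np
import pandas as pd
import pytest

from evalml.automl import AutoMLSearch


def test_default_data_checks_get_splitter_n_splits():
    # One class has only 7 observations, as suggested above, so the
    # ClassImbalanceDataCheck configured with the splitter's fold count
    # should flag it when data_checks="auto" is used.
    X = pd.DataFrame({"feature": np.arange(107)})
    y = pd.Series([0] * 50 + [1] * 50 + [2] * 7)

    automl = AutoMLSearch(X_train=X, y_train=y, problem_type="multiclass")
    # Assumption: search surfaces data check errors by raising, rather than
    # silently continuing; adjust the assertion to however search reports them.
    with pytest.raises(ValueError, match="Data checks raised"):
        automl.search(data_checks="auto")
```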

angela97lin (Contributor) left a comment:

Sweet, thanks for updating this! LGTM 🥳

ParthivNaresh merged commit 1bc8982 into main on Jan 5, 2021
1 check passed
freddyaboulton deleted the 1570-CV-Folds-Default-Data-Checks branch on May 13, 2022
Linked issue (#1570): AutoML: pass number of CV folds to default data checks
3 participants