Internal target check to ensure no class missing from train/val #1226

Merged

bchen1116 merged 9 commits into main from bc_760_split

Sep 29, 2020

Contributor

bchen1116 commented Sep 24, 2020

fix #760

Throws an error if the data split results in missing target values in either the train or validation sets
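
A check of this kind can be sketched in a few lines. This is only an illustration of the idea, not the code merged in this PR: `check_all_classes_present` and its arguments are hypothetical names, and plain Python lists stand in for the real target data.

```python
def check_all_classes_present(y, train_idx, valid_idx):
    """Raise ValueError if either split is missing a target value.

    Hypothetical helper for illustration; not the evalml implementation.
    """
    all_classes = set(y)
    for name, idx in (("training", train_idx), ("validation", valid_idx)):
        # Classes seen in the full target but absent from this split
        missing = all_classes - {y[i] for i in idx}
        if missing:
            raise ValueError(
                f"The {name} split is missing target values: {sorted(missing)}"
            )


y = [0, 0, 0, 1, 1, 2]
# Class 2 has a single row, so it cannot land in both splits.
try:
    check_all_classes_present(y, train_idx=[0, 3, 5], valid_idx=[1, 2, 4])
except ValueError as err:
    print(err)
```

Running the check on each split before fitting means the failure surfaces as one clear error instead of a confusing scoring failure later on.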

bchen1116 added 3 commits

September 24, 2020 15:52


          implementation for internal check

9f36455


          update release notes

9afa9a0


          fix bugs

46b3c3a

codecov bot commented Sep 24, 2020 (edited)

Codecov Report

Merging #1226 into main will increase coverage by 8.40%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main    #1226      +/-   ##
==========================================
+ Coverage   91.52%   99.92%   +8.40%     
==========================================
  Files         200      200              
  Lines       12293    12339      +46     
==========================================
+ Hits        11251    12330    +1079     
+ Misses       1042        9    -1033

Impacted Files	Coverage Δ
evalml/automl/automl_search.py	`99.58% <100.00%> (+0.42%)`	⬆️
evalml/tests/automl_tests/test_automl.py	`100.00% <100.00%> (ø)`
evalml/tests/component_tests/test_components.py	`100.00% <0.00%> (+0.51%)`	⬆️
...s/prediction_explanations_tests/test_algorithms.py	`100.00% <0.00%> (+1.11%)`	⬆️
evalml/tests/component_tests/test_utils.py	`100.00% <0.00%> (+3.57%)`	⬆️
evalml/tests/pipeline_tests/test_pipelines.py	`100.00% <0.00%> (+3.89%)`	⬆️
...derstanding/prediction_explanations/_algorithms.py	`97.14% <0.00%> (+4.28%)`	⬆️
evalml/utils/gen_utils.py	`99.02% <0.00%> (+5.82%)`	⬆️
evalml/tests/utils_tests/test_dependencies.py	`100.00% <0.00%> (+6.25%)`	⬆️
... and 18 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 7e8f614...70129b7.

Contributor Author

bchen1116 commented Sep 24, 2020 (edited)

Of the original issues, this PR addresses issues 1 and 3. Issue 2 has been addressed with PR 1135.

1. Add solid test coverage of class imbalance in binary and multiclass cases.
2. Add a class-imbalance data check which warns if class imbalance is so severe that data splitting would be affected.
3. Add an internal check which throws an error if we generate a training split which is missing a class
4. For CV splitting, make sure we understand and can describe how sklearn's StratifiedKFold works when it comes to edge-cases
5. Figure out what we'd use to get a stratified TVH split (StratifiedShuffleSplit)

Doc for points 4 and 5 is here

bchen1116 commented

evalml/automl/automl_search.py Outdated

@@ -607,10 +607,18 @@ def _compute_cv_scores(self, pipeline, X, y):
                       start = time.time()
                       cv_data = []
                       logger.info("\tStarting cross validation")
+                      warnings.filterwarnings("ignore", lineno=665)

Contributor Author

bchen1116 Sep 24, 2020

Add filter to catch and suppress SKLearn's warning for having too few cases of a target given n_splits=3

Contributor

freddyaboulton Sep 28, 2020

I'm skeptical of lineno because it will be impossible to maintain as the size of this file changes. Maybe we don't worry about suppressing warnings?

FWIW, I'm still seeing the warning when I run this code anyways.
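
One way to drop the line-number dependency is to filter on the warning's message and category, scoped with `warnings.catch_warnings()` so the filter does not leak into the rest of the process. A minimal sketch under stated assumptions: the message prefix below is assumed to match the scikit-learn warning being discussed, and `noisy_split` is a hypothetical stand-in for the cross-validation call that emits it.

```python
import warnings


def noisy_split():
    # Hypothetical stand-in for the call that triggers the warning.
    warnings.warn(
        "The least populated class in y has only 2 members, "
        "which is less than n_splits=3.",
        UserWarning,
    )
    return "done"


def run_quietly():
    # The filter only applies inside this block; the previous
    # filters are restored when the block exits.
    with warnings.catch_warnings():
        warnings.filterwarnings(
            "ignore",
            message="The least populated class in y",  # regex matched at the start
            category=UserWarning,
        )
        return noisy_split()


result = run_quietly()
```

Matching on the message text still breaks if the upstream wording changes, but it survives edits to this file, which `lineno` does not.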

bchen1116 commented

evalml/automl/automl_search.py Show resolved Hide resolved

bchen1116 added 2 commits

September 25, 2020 09:19


          fix release notes

8c636b3


          update test

e1fc547

bchen1116 self-assigned this

bchen1116 marked this pull request as ready for review

September 25, 2020 15:42

bchen1116 requested review from dsherry, angela97lin, freddyaboulton, christopherbunn, eccabay and jeremyliweishih and removed request for dsherry and angela97lin

September 25, 2020 15:42


          Merge branch 'main' into bc_760_split

1fc2738

freddyaboulton approved these changes

Contributor

freddyaboulton left a comment

@bchen1116 I have some minor suggestions to improve the implementation and a question regarding your comment about changing this for TrainingValidationSplit - otherwise looks great!

evalml/automl/automl_search.py Show resolved Hide resolved

evalml/automl/automl_search.py Outdated Show resolved Hide resolved

evalml/tests/automl_tests/test_automl.py Show resolved Hide resolved

freddyaboulton reviewed

evalml/automl/automl_search.py Outdated Show resolved Hide resolved

bchen1116 added 2 commits

September 28, 2020 16:29


          Merge branch 'main' into bc_760_split

2badd0d


          update tests

3ee7d1d

Contributor Author

bchen1116 commented Sep 28, 2020

@freddyaboulton about my comment for TrainingValidationSplit: this internal check requires every class to occur in each train/validation split. That makes sense and is easy to satisfy with StratifiedKFold, since as long as we have > 3 occurrences of each target class, we shouldn't run into this error. However, for TrainingValidationSplit, we don't give it a stratify argument, which means that we might not pass this check even with more instances of a target class. I'm just curious whether we should include some default stratify arg; otherwise the user might run into this error when they use a really large dataset?
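
To illustrate what a stratified train/validation split guarantees, here is a minimal pure-Python sketch. It is not evalml's `TrainingValidationSplit` and not scikit-learn's implementation (scikit-learn's `train_test_split` exposes a `stratify` parameter for this purpose); `stratified_train_valid_split` and its parameters are hypothetical names chosen for the example.

```python
import random
from collections import defaultdict


def stratified_train_valid_split(y, valid_fraction=0.25, seed=0):
    """Return (train_idx, valid_idx) with every class present on both sides.

    Hypothetical helper for illustration; needs at least 2 rows per class.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)

    train_idx, valid_idx = [], []
    for label, rows in by_class.items():
        if len(rows) < 2:
            raise ValueError(f"Class {label!r} has fewer than 2 rows")
        rng.shuffle(rows)
        # Send at least one row of each class to validation and
        # keep at least one row of each class in training.
        n_valid = min(max(1, round(len(rows) * valid_fraction)), len(rows) - 1)
        valid_idx.extend(rows[:n_valid])
        train_idx.extend(rows[n_valid:])
    return sorted(train_idx), sorted(valid_idx)


# A rare class ("b") that a purely random split could easily miss.
y = ["a"] * 97 + ["b"] * 3
train_idx, valid_idx = stratified_train_valid_split(y)
```

Because the split is made per class, the rare class lands in both sides regardless of how large the dataset is, which is the property the internal check requires.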


          Merge branch 'main' into bc_760_split

70129b7

bchen1116 merged commit fbdb8bc into main

angela97lin mentioned this pull request

Release v0.14.1 #1241

Merged

freddyaboulton deleted the bc_760_split branch

May 13, 2022 15:16

Reviewers

freddyaboulton freddyaboulton approved these changes

dsherry Awaiting requested review from dsherry

christopherbunn Awaiting requested review from christopherbunn

eccabay Awaiting requested review from eccabay

jeremyliweishih Awaiting requested review from jeremyliweishih

angela97lin Awaiting requested review from angela97lin