Internal target check to ensure no class missing from train/val #1226

Merged
bchen1116 merged 9 commits into main from bc_760_split on Sep 29, 2020

bchen1116
Contributor

@bchen1116 bchen1116 commented Sep 24, 2020

fix #760

Throws an error if the data split leaves any target value missing from either the training set or the validation set.
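
As a rough illustration of the check described above, here is a minimal sketch in plain Python. The function name `check_split_has_all_classes`, its signature, and the error message are assumptions made for this example; they are not the code added to `automl_search.py` in this PR.

```python
def check_split_has_all_classes(y, train_idx, valid_idx):
    """Raise ValueError if any class in y is absent from either split.

    Hypothetical helper for illustration only, not evalml's implementation.
    """
    all_classes = set(y)
    for name, indices in (("training", train_idx), ("validation", valid_idx)):
        present = {y[i] for i in indices}
        missing = all_classes - present
        if missing:
            raise ValueError(
                f"Missing target values in the {name} set after data split: {sorted(missing)}"
            )


y = [0, 0, 1, 1, 2, 2]

# Every class appears on both sides, so this call returns without raising.
check_split_has_all_classes(y, train_idx=[0, 2, 4], valid_idx=[1, 3, 5])

# Class 2 never reaches the training indices here, so the check raises.
try:
    check_split_has_all_classes(y, train_idx=[0, 1, 2, 3], valid_idx=[4, 5])
    raised = False
except ValueError as err:
    raised = True
    error_text = str(err)
```

Checking both sides of the split in one loop keeps the error message specific about which set lost a class.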

@codecov

codecov bot commented Sep 24, 2020

Codecov Report

Merging #1226 into main will increase coverage by 8.40%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main    #1226      +/-   ##
==========================================
+ Coverage   91.52%   99.92%   +8.40%     
==========================================
  Files         200      200              
  Lines       12293    12339      +46     
==========================================
+ Hits        11251    12330    +1079     
+ Misses       1042        9    -1033     
| Impacted Files | Coverage Δ | |
|---|---|---|
| evalml/automl/automl_search.py | 99.58% <100.00%> (+0.42%) | ⬆️ |
| evalml/tests/automl_tests/test_automl.py | 100.00% <100.00%> (ø) | |
| evalml/tests/component_tests/test_components.py | 100.00% <0.00%> (+0.51%) | ⬆️ |
| ...s/prediction_explanations_tests/test_algorithms.py | 100.00% <0.00%> (+1.11%) | ⬆️ |
| evalml/tests/component_tests/test_utils.py | 100.00% <0.00%> (+3.57%) | ⬆️ |
| evalml/tests/pipeline_tests/test_pipelines.py | 100.00% <0.00%> (+3.89%) | ⬆️ |
| ...derstanding/prediction_explanations/_algorithms.py | 97.14% <0.00%> (+4.28%) | ⬆️ |
| evalml/utils/gen_utils.py | 99.02% <0.00%> (+5.82%) | ⬆️ |
| evalml/tests/utils_tests/test_dependencies.py | 100.00% <0.00%> (+6.25%) | ⬆️ |
| ... and 18 more | | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 7e8f614...70129b7.

@bchen1116
Contributor Author

bchen1116 commented Sep 24, 2020

Of the items listed in the original issue, this PR addresses items 1 and 3. Item 2 was addressed in PR 1135.

  1. Add solid test coverage of class imbalance in binary and multiclass cases.
  2. Add a class-imbalance data check which warns if class imbalance is so severe that data splitting would be affected.
  3. Add an internal check which throws an error if we generate a training split which is missing a class.
  4. For CV splitting, make sure we understand and can describe how sklearn's StratifiedKFold works when it comes to edge cases.
  5. Figure out what we'd use to get a stratified TVH split (StratifiedShuffleSplit).

Doc for points 4 and 5 is here
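
For point 5, a stratified train/validation split can be sketched with scikit-learn's `StratifiedShuffleSplit`. The toy data, the 80/20 ratio, and the `random_state` below are assumptions chosen for illustration; this is not the splitter configuration evalml uses.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Toy data (assumed for this sketch): three classes with ten rows each.
X = np.zeros((30, 1))
y = np.repeat([0, 1, 2], 10)

# A single stratified shuffle gives one train/validation split that
# preserves the class proportions on both sides.
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, valid_idx = next(splitter.split(X, y))

train_classes = set(y[train_idx].tolist())
valid_classes = set(y[valid_idx].tolist())
```

With balanced classes and a 20% validation share, each class contributes rows to both sides, so a check for missing classes would pass on this split.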

@@ -607,10 +607,18 @@ def _compute_cv_scores(self, pipeline, X, y):
        start = time.time()
        cv_data = []
        logger.info("\tStarting cross validation")
        warnings.filterwarnings("ignore", lineno=665)
Contributor Author

@bchen1116 bchen1116 Sep 24, 2020

Added a filter to suppress scikit-learn's warning that a target class has too few members for n_splits=3.

Contributor

@freddyaboulton freddyaboulton Sep 28, 2020

I'm skeptical of lineno because it will be impossible to maintain as the size of this file changes. Maybe we don't worry about suppressing warnings?

FWIW, I'm still seeing the warning when I run this code anyway.
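
One alternative to `lineno`, sketched here with only the standard library, is to match the warning by message prefix and category inside a `warnings.catch_warnings()` block, so the filter does not depend on file layout and does not leak outside the call. The message text and the stand-in function `noisy_split` are assumptions for this example; the exact wording of scikit-learn's warning should be checked against the installed version.

```python
import warnings

# Assumed message text, modeled on the warning discussed in this thread.
MESSAGE = (
    "The least populated class in y has only 2 members, "
    "which is less than n_splits=3."
)


def noisy_split():
    """Stand-in for a splitter call that emits the warning (hypothetical)."""
    warnings.warn(MESSAGE, UserWarning)
    return "split done"


def run_quietly():
    """Call noisy_split with the specific warning suppressed locally."""
    with warnings.catch_warnings():
        # `message` is a regex matched against the start of the warning text.
        warnings.filterwarnings(
            "ignore",
            message="The least populated class",
            category=UserWarning,
        )
        return noisy_split()


result = run_quietly()
```

Because the filter is scoped to the `with` block, other warnings raised elsewhere in the search are still shown.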

@bchen1116 bchen1116 self-assigned this Sep 25, 2020
@bchen1116 bchen1116 marked this pull request as ready for review Sep 25, 2020
Contributor

@freddyaboulton freddyaboulton left a comment

@bchen1116 I have some minor suggestions to improve the implementation and a question regarding your comment about changing this for TrainingValidationSplit - otherwise looks great!

@bchen1116
Contributor Author

bchen1116 commented Sep 28, 2020

@freddyaboulton, regarding my comment about TrainingValidationSplit: this internal check requires every class to occur in each train/validation split. That makes sense and is easy to satisfy with StratifiedKFold, since as long as we have > 3 occurrences of each target class, we shouldn't run into this error. TrainingValidationSplit, however, is not given a stratify argument, so a split might fail this check even when a target class has more instances than that. Should we pass some default stratify argument? Otherwise users might run into this error when they use a really large dataset.
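
To put a rough number on that concern, the chance that an unstratified random split leaves a rare class out of one side follows the hypergeometric distribution. The sketch below assumes a uniformly random split; the dataset size, class count, and 80/20 ratio are made-up values for illustration.

```python
from math import comb


def prob_class_missing(n_total, n_class, n_subset):
    """Probability that a uniformly random subset of n_subset rows, drawn
    from n_total rows, contains none of the n_class rows of a given class."""
    return comb(n_total - n_class, n_subset) / comb(n_total, n_subset)


# Assumed example: 1000 rows, a class with only 5 rows, 80/20 split.
p_missing_from_validation = prob_class_missing(1000, 5, 200)
p_missing_from_training = prob_class_missing(1000, 5, 800)
```

Under these assumptions the rare class is absent from the validation set roughly one time in three (about 0.33), while it is almost never absent from the training set, which is why a class with more than three members can still trip the check when the split is not stratified.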

@bchen1116 bchen1116 merged commit fbdb8bc into main Sep 29, 2020
@angela97lin angela97lin mentioned this pull request Sep 29, 2020
@freddyaboulton freddyaboulton deleted the bc_760_split branch May 13, 2022
Successfully merging this pull request may close these issues.

Ensure no classes are missing from training/validation splits during data splitting