
Add Balanced Classification Data Splitter #1875

Merged
merged 74 commits into main from bc_973_balanced_splitter on Mar 9, 2021

Conversation


@bchen1116 bchen1116 commented Feb 23, 2021

fix #973

Perf test results here
Updated and more in-depth perf test here
Summary:

  • The data sampler performs slightly worse (in holdout log loss, holdout F1, and time to fit) than the original StratifiedKFold.
  • This gap might be due to our default values for min_samples/min_percentage, so it could be worth tuning them (see the sketch after this list).
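To make those two knobs concrete, here is a minimal standalone sketch of the kind of balanced downsampling the new splitter applies to a training fold. This is not the evalml implementation, and the exact semantics of min_samples and min_percentage below are assumptions for illustration only:

import numpy as np
import pandas as pd

def balanced_indices(y, sampling_ratio=4, min_samples=100, min_percentage=0.1, seed=0):
    """Keep minority classes whole; downsample majority classes, but never below the floors."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    counts = pd.Series(y).value_counts()
    # Assumed semantics: cap each class at sampling_ratio x the minority class size,
    # but never cut a class below min_samples rows or min_percentage of the dataset.
    floor = max(min_samples, int(min_percentage * len(y)))
    target = max(sampling_ratio * counts.min(), floor)
    keep = []
    for label, count in counts.items():
        idx = np.flatnonzero(y == label)
        if count > target:
            idx = rng.choice(idx, size=target, replace=False)
        keep.extend(idx.tolist())
    return sorted(keep)

y = np.array([0] * 900 + [1] * 100)   # 9:1 imbalance
print(len(balanced_indices(y)))       # 500: class 0 cut to 400, class 1 kept whole

Lowering min_samples/min_percentage would let the sampler cut majority classes more aggressively, which is the tuning direction suggested above.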

@bchen1116 bchen1116 self-assigned this Feb 23, 2021

codecov bot commented Feb 23, 2021

Codecov Report

Merging #1875 (828a411) into main (62c092a) will increase coverage by 0.1%.
The diff coverage is 100.0%.

Impacted file tree graph

@@            Coverage Diff            @@
##             main    #1875     +/-   ##
=========================================
+ Coverage   100.0%   100.0%   +0.1%     
=========================================
  Files         266      269      +3     
  Lines       21959    22188    +229     
=========================================
+ Hits        21953    22182    +229     
  Misses          6        6             
Impacted Files Coverage Δ
...sing_tests/test_balanced_classification_sampler.py 100.0% <ø> (ø)
evalml/automl/automl_search.py 100.0% <100.0%> (ø)
evalml/automl/utils.py 100.0% <100.0%> (ø)
evalml/data_checks/class_imbalance_data_check.py 100.0% <100.0%> (ø)
evalml/preprocessing/__init__.py 100.0% <100.0%> (ø)
evalml/preprocessing/data_splitters/__init__.py 100.0% <100.0%> (ø)
.../data_splitters/balanced_classification_sampler.py 100.0% <100.0%> (ø)
...data_splitters/balanced_classification_splitter.py 100.0% <100.0%> (ø)
...lml/preprocessing/data_splitters/base_splitters.py 100.0% <100.0%> (ø)
evalml/tests/automl_tests/test_automl.py 100.0% <100.0%> (ø)
... and 7 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 62c092a...828a411. Read the comment docs.

@bchen1116 bchen1116 marked this pull request as draft February 23, 2021 19:52
@freddyaboulton freddyaboulton left a comment

@bchen1116 I think this looks good! The one thing I want your input on is the order of sampling and creating a threshold tuning split in _find_best_pipeline.

@@ -63,7 +64,7 @@ def make_data_splitter(X, y, problem_type, problem_configuration=None, n_splits=
if problem_type == ProblemTypes.REGRESSION:
@freddyaboulton (Contributor) commented:

How come we're not using BalancedClassificationDataTVSplit for the large classification datasets?

@bchen1116 (Contributor Author) replied:

Hmmm, that's a good question. It wasn't something I was going to add in this iteration, but it might be good to add? It's untested in terms of perf tests, however. @dsherry what do you think?

@dsherry (Contributor) replied:

Ah, good call @freddyaboulton . Yeah @bchen1116 we should add that in, otherwise we're breaking our current large data support.

I took a stab at the control flow here, and threw in a little refactor for timeseries just to clean things up:

if is_time_series(problem_type):
    if not problem_configuration:
        raise ValueError("problem_configuration is required for time series problem types")
    return TimeSeriesSplit(n_splits=n_splits, gap=problem_configuration.get('gap'),
        max_delay=problem_configuration.get('max_delay'))
if X.shape[0] > _LARGE_DATA_THRESHOLD:
    if problem_type == ProblemTypes.REGRESSION:
        return TrainingValidationSplit(test_size=_LARGE_DATA_PERCENT_VALIDATION, shuffle=shuffle)
    elif problem_type in [ProblemTypes.BINARY, ProblemTypes.MULTICLASS]:
        return BalancedClassificationDataTVSplit(test_size=_LARGE_DATA_PERCENT_VALIDATION, random_seed=random_seed, shuffle=shuffle)
if problem_type == ProblemTypes.REGRESSION:
    return KFold(n_splits=n_splits, random_state=random_seed, shuffle=shuffle)
elif problem_type in [ProblemTypes.BINARY, ProblemTypes.MULTICLASS]:
    return BalancedClassificationDataCVSplit(n_splits=n_splits, random_seed=random_seed, shuffle=shuffle)
raise ValueError('Invalid problem_type')

I may be screwing up the usage of shuffle above so pls check my math haha

Thoughts @bchen1116 @freddyaboulton ?
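For reference, a hypothetical usage sketch of the control flow above. The toy data is invented, the import paths are assumed from the file list in the Codecov report, and the random_seed keyword is assumed to be accepted the way the snippet uses it:

import numpy as np
import pandas as pd
from evalml.automl.utils import make_data_splitter
from evalml.problem_types import ProblemTypes

# A small classification dataset, so we stay under _LARGE_DATA_THRESHOLD
# and expect the CV variant rather than the TV variant.
X = pd.DataFrame(np.random.rand(500, 5))
y = pd.Series(np.random.randint(0, 2, 500))

splitter = make_data_splitter(X, y, ProblemTypes.BINARY, n_splits=3, random_seed=0)
print(type(splitter).__name__)  # expected: BalancedClassificationDataCVSplit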

Resolved review threads: evalml/tests/automl_tests/test_automl.py (2), evalml/automl/automl_search.py (2)
@@ -63,7 +64,7 @@ def make_data_splitter(X, y, problem_type, problem_configuration=None, n_splits=
     if problem_type == ProblemTypes.REGRESSION:
         return KFold(n_splits=n_splits, random_state=random_seed, shuffle=shuffle)
     elif problem_type in [ProblemTypes.BINARY, ProblemTypes.MULTICLASS]:
-        return StratifiedKFold(n_splits=n_splits, random_state=random_seed, shuffle=shuffle)
+        return BalancedClassificationDataCVSplit(n_splits=n_splits, random_seed=random_seed, shuffle=shuffle)
A contributor replied: 👍
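Since the diff above swaps StratifiedKFold for BalancedClassificationDataCVSplit, here is a hedged sketch of drop-in use, assuming the new splitter follows the usual sklearn splitter interface (a split(X, y) generator of train/test indices) and that the import path matches the file layout shown in the Codecov report; both are assumptions, not confirmed by this PR:

import numpy as np
import pandas as pd
from evalml.preprocessing.data_splitters import BalancedClassificationDataCVSplit

X = pd.DataFrame(np.random.rand(300, 4))
y = pd.Series([0] * 270 + [1] * 30)   # imbalanced binary target for illustration

splitter = BalancedClassificationDataCVSplit(n_splits=3, random_seed=0, shuffle=True)
for train_indices, test_indices in splitter.split(X, y):
    # Each training fold is downsampled toward balance; the validation fold is left untouched.
    print(len(train_indices), len(test_indices))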


dsherry commented Mar 3, 2021

@bchen1116 I left a few comments on the perf test doc, feel free to send me an invite if you wanna chat about any of that! Will review code shortly

dsherry previously requested changes Mar 4, 2021

@dsherry dsherry left a comment

@bchen1116 good stuff! This is looking nice so far.

Main points to change:

  • I agree with @freddyaboulton that we should not break our current large data support, and that this PR should use the TV splitter instead of CV when appropriate. I left a suggestion for the control flow there.
  • I left a suggestion for how to break up the unit tests a bit to make it easier for us to read and modify them in the future, and to make sure we have coverage of the math (which it appears you do!)
  • A few impl suggestions
  • Docstrings on the two splitter classes


@bchen1116 bchen1116 requested a review from dsherry March 4, 2021 19:49
@ParthivNaresh ParthivNaresh left a comment

Great work, the new parametrized tests look excellent!

@bchen1116 bchen1116 dismissed dsherry’s stale review March 9, 2021 18:04

we have discussed

@bchen1116 bchen1116 merged commit 0b57961 into main Mar 9, 2021
@dsherry dsherry mentioned this pull request Mar 11, 2021
@freddyaboulton freddyaboulton deleted the bc_973_balanced_splitter branch May 13, 2022 15:34
Successfully merging this pull request may close these issues: Balanced sampling for classification problems.

5 participants