Conversation


@bchen1116 bchen1116 commented Feb 23, 2021

fix #973

Perf test results here
Updated and more in-depth perf test here
Summary:

  • The data sampler performs slightly worse (in holdout log loss, holdout F1, and time to fit) than the original StratifiedKFold.
  • This gap might be due to our default values for min_samples/min_percentage, so it may be worth tuning them.
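For illustration, here is a minimal sketch of the kind of majority-class downsampling being perf-tested: a ratio-based cap with a per-class floor, in the spirit of the min_samples/min_percentage defaults mentioned above. The function name, signature, and exact semantics are assumptions, not the evalml implementation.

```python
import numpy as np

def balanced_sample_indices(y, sampling_ratio=0.25, min_samples=100, random_seed=0):
    """Downsample majority classes so no class keeps more than
    minority_count / sampling_ratio rows, while never cutting a class
    below min_samples rows. Hypothetical sketch, not evalml's sampler."""
    rng = np.random.default_rng(random_seed)
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    # cap per class: ratio-based, but floored at min_samples
    cap = max(int(counts.min() / sampling_ratio), min_samples)
    kept = []
    for cls, count in zip(classes, counts):
        idx = np.flatnonzero(y == cls)
        if count > cap:
            idx = rng.choice(idx, size=cap, replace=False)
        kept.append(idx)
    return np.sort(np.concatenate(kept))
```

For example, with 900 negatives and 100 positives and a 0.25 ratio, the majority class is cut to 400 rows while the minority class is untouched.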

@bchen1116 bchen1116 self-assigned this Feb 23, 2021

codecov bot commented Feb 23, 2021

Codecov Report

Merging #1875 (828a411) into main (62c092a) will increase coverage by 0.1%.
The diff coverage is 100.0%.

Impacted file tree graph

@@            Coverage Diff            @@
##             main    #1875     +/-   ##
=========================================
+ Coverage   100.0%   100.0%   +0.1%     
=========================================
  Files         266      269      +3     
  Lines       21959    22188    +229     
=========================================
+ Hits        21953    22182    +229     
  Misses          6        6             
Impacted Files Coverage Δ
...sing_tests/test_balanced_classification_sampler.py 100.0% <ø> (ø)
evalml/automl/automl_search.py 100.0% <100.0%> (ø)
evalml/automl/utils.py 100.0% <100.0%> (ø)
evalml/data_checks/class_imbalance_data_check.py 100.0% <100.0%> (ø)
evalml/preprocessing/__init__.py 100.0% <100.0%> (ø)
evalml/preprocessing/data_splitters/__init__.py 100.0% <100.0%> (ø)
.../data_splitters/balanced_classification_sampler.py 100.0% <100.0%> (ø)
...data_splitters/balanced_classification_splitter.py 100.0% <100.0%> (ø)
...lml/preprocessing/data_splitters/base_splitters.py 100.0% <100.0%> (ø)
evalml/tests/automl_tests/test_automl.py 100.0% <100.0%> (ø)
... and 7 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@bchen1116 bchen1116 marked this pull request as draft February 23, 2021 19:52
@bchen1116 bchen1116 force-pushed the bc_973_balanced_splitter branch from 2b54849 to e2a81f5 Compare February 23, 2021 20:40

@freddyaboulton freddyaboulton left a comment

@bchen1116 I think this looks good! The one thing I want your input on is the order of sampling and creating a threshold tuning split in _find_best_pipeline.

if X.shape[0] > _LARGE_DATA_ROW_THRESHOLD:
    return TrainingValidationSplit(test_size=_LARGE_DATA_PERCENT_VALIDATION, shuffle=True)

if problem_type == ProblemTypes.REGRESSION:
@freddyaboulton commented on this diff:
How come we're not using BalancedClassificationDataTVSplit for the large classification datasets?

@bchen1116 (author) replied:

Hmmm, that's a good question. It wasn't something I was going to add in this iteration, but it might be good to add? It's untested in terms of perf tests, however. @dsherry what do you think?

@dsherry replied:

Ah, good call @freddyaboulton. Yeah @bchen1116, we should add that in; otherwise we're breaking our current large data support.

I took a stab at the control flow here, and threw in a little refactor for timeseries just to clean things up:

if is_time_series(problem_type):
    if not problem_configuration:
        raise ValueError("problem_configuration is required for time series problem types")
    return TimeSeriesSplit(n_splits=n_splits, gap=problem_configuration.get('gap'),
        max_delay=problem_configuration.get('max_delay'))
if X.shape[0] > _LARGE_DATA_THRESHOLD:
    if problem_type == ProblemTypes.REGRESSION:
        return TrainingValidationSplit(test_size=_LARGE_DATA_PERCENT_VALIDATION, shuffle=shuffle)
    elif problem_type in [ProblemTypes.BINARY, ProblemTypes.MULTICLASS]:
        return BalancedClassificationDataTVSplit(test_size=_LARGE_DATA_PERCENT_VALIDATION, random_seed=random_seed, shuffle=shuffle)
if problem_type == ProblemTypes.REGRESSION:
    return KFold(n_splits=n_splits, random_state=random_seed, shuffle=shuffle)
elif problem_type in [ProblemTypes.BINARY, ProblemTypes.MULTICLASS]:
    return BalancedClassificationDataCVSplit(n_splits=n_splits, random_seed=random_seed, shuffle=shuffle)
raise ValueError('Invalid problem_type')

I may be screwing up the usage of shuffle above so pls check my math haha

Thoughts @bchen1116 @freddyaboulton ?
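The proposed control flow can be condensed into a self-contained sketch. Plain strings stand in for ProblemTypes, and sklearn splitters stand in for the evalml-specific classes (TrainingValidationSplit and the BalancedClassification* splitters are not replicated here); the threshold value is assumed for illustration.

```python
from sklearn.model_selection import KFold, ShuffleSplit, StratifiedKFold, TimeSeriesSplit

_LARGE_DATA_THRESHOLD = 100_000  # assumed value for illustration

def make_splitter(problem_type, n_rows, n_splits=3, shuffle=True, random_seed=0):
    # Mirrors the branch order proposed above: time series first,
    # then the large-data TV path, then the regular CV path.
    if problem_type == "time series":
        return TimeSeriesSplit(n_splits=n_splits)
    if n_rows > _LARGE_DATA_THRESHOLD:
        # evalml would return TrainingValidationSplit or
        # BalancedClassificationDataTVSplit here; a single-split
        # ShuffleSplit plays that role in this sketch.
        return ShuffleSplit(n_splits=1, test_size=0.25, random_state=random_seed)
    if problem_type == "regression":
        return KFold(n_splits=n_splits, shuffle=shuffle, random_state=random_seed)
    if problem_type in ("binary", "multiclass"):
        # stand-in for BalancedClassificationDataCVSplit
        return StratifiedKFold(n_splits=n_splits, shuffle=shuffle, random_state=random_seed)
    raise ValueError("Invalid problem_type")
```

Note how the large-data check happens before the regression/classification branch, so large classification datasets get the TV path rather than falling through to CV.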

Suggested change:

      return KFold(n_splits=n_splits, random_state=random_seed, shuffle=shuffle)
  elif problem_type in [ProblemTypes.BINARY, ProblemTypes.MULTICLASS]:
-     return StratifiedKFold(n_splits=n_splits, random_state=random_seed, shuffle=shuffle)
+     return BalancedClassificationDataCVSplit(n_splits=n_splits, random_seed=random_seed, shuffle=shuffle)
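To show what replacing StratifiedKFold with a balanced CV splitter means in practice, here is a hedged sketch of a splitter that yields stratified folds and then downsamples majority classes within each training fold only (the validation fold is left untouched). This mimics the idea behind BalancedClassificationDataCVSplit but is not the evalml class; the class name and ratio parameter are assumptions.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

class BalancedCVSketch:
    """Stratified CV whose training folds are downsampled toward balance."""

    def __init__(self, n_splits=3, sampling_ratio=0.25, random_seed=0):
        self._cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_seed)
        self.sampling_ratio = sampling_ratio
        self.random_seed = random_seed

    def split(self, X, y):
        y = np.asarray(y)
        rng = np.random.default_rng(self.random_seed)
        for train, test in self._cv.split(X, y):
            classes, counts = np.unique(y[train], return_counts=True)
            # per-class cap in the training fold, from the minority count
            cap = int(counts.min() / self.sampling_ratio)
            kept = []
            for cls in classes:
                idx = train[y[train] == cls]
                if len(idx) > cap:
                    idx = rng.choice(idx, size=cap, replace=False)
                kept.append(idx)
            # only the training indices shrink; the test fold is unchanged
            yield np.concatenate(kept), test
```

Sampling inside each fold (rather than before splitting) keeps the validation folds representative of the original class distribution, which matters for the holdout metrics reported above.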

👍


dsherry commented Mar 3, 2021

@bchen1116 I left a few comments on the perf test doc, feel free to send me an invite if you wanna chat about any of that! Will review code shortly

dsherry previously requested changes Mar 4, 2021

@dsherry dsherry left a comment


@bchen1116 good stuff! This is looking nice so far.

Main points to change:

  • I agree with @freddyaboulton that we should not break our current large data support, and that this PR should use the TV splitter instead of CV when appropriate. I left a suggestion for the control flow there.
  • I left a suggestion for how to break up the unit tests a bit to make it easier for us to read and modify them in the future, and to make sure we have coverage of the math (which it appears you do!)
  • A few impl suggestions
  • Docstrings on the two splitter classes
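The suggestion to break up the unit tests might take a parametrized shape along these lines; the helper and cases below are illustrative stand-ins, not the PR's actual tests.

```python
import pytest

def downsample_cap(counts, sampling_ratio=0.25):
    # illustrative helper: the per-class cap implied by the minority count
    return int(min(counts) / sampling_ratio)

@pytest.mark.parametrize("counts,expected_cap", [
    ([100, 900], 400),   # imbalanced binary
    ([50, 50], 200),     # already balanced
    ([10, 30, 60], 40),  # multiclass
])
def test_downsample_cap(counts, expected_cap):
    assert downsample_cap(counts) == expected_cap
```

Parametrizing over class-count scenarios keeps each case readable and makes it cheap to add coverage of the math as edge cases turn up.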


@bchen1116 bchen1116 requested a review from dsherry March 4, 2021 19:49

@ParthivNaresh ParthivNaresh left a comment


Great work, the new parametrized tests look excellent!

@bchen1116 bchen1116 dismissed dsherry’s stale review March 9, 2021 18:04

we have discussed

@bchen1116 bchen1116 merged commit 0b57961 into main Mar 9, 2021
@dsherry dsherry mentioned this pull request Mar 11, 2021
@freddyaboulton freddyaboulton deleted the bc_973_balanced_splitter branch May 13, 2022 15:34


Development

Successfully merging this pull request may close these issues.

Balanced sampling for classification problems

6 participants