Add Balanced Classification Data Splitter #1875
Conversation
Codecov Report
@@            Coverage Diff            @@
##              main    #1875    +/-   ##
=========================================
+ Coverage    100.0%   100.0%    +0.1%
=========================================
  Files          266      269       +3
  Lines        21959    22188     +229
=========================================
+ Hits         21953    22182     +229
  Misses           6        6
Continue to review full report at Codecov.
Force-pushed from 2b54849 to e2a81f5 (compare)
@bchen1116 I think this looks good! The one thing I want your input on is the order of sampling and creating a threshold-tuning split in _find_best_pipeline.
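For context, the ordering question looks roughly like this. A minimal runnable sketch: the downsample helper is a toy stand-in I wrote for the PR's sampler, and the 4:1 cap is an assumed default, not evalml's actual internals.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def downsample(X, y, ratio=4, seed=0):
    """Toy stand-in for the PR's sampler: cap the majority class at
    `ratio` times the minority count (assumption, not the real impl)."""
    rng = np.random.default_rng(seed)
    counts = y.value_counts()
    cap = ratio * counts.min()
    keep = np.concatenate([
        rng.permutation(y.index[y == cls].to_numpy())[:cap] for cls in counts.index
    ])
    return X.loc[keep], y.loc[keep]

X = pd.DataFrame(np.random.rand(1000, 3))
y = pd.Series([0] * 900 + [1] * 100)

# Option A: sample first, then carve out the threshold-tuning split;
# the tuning data inherits the balanced class distribution.
X_bal, y_bal = downsample(X, y)
X_train_a, X_thresh_a, y_train_a, y_thresh_a = train_test_split(
    X_bal, y_bal, test_size=0.2, random_state=0)

# Option B: carve out the threshold-tuning split first, then sample only
# the training portion; the tuning data keeps the original distribution,
# arguably closer to what the tuned threshold will see at predict time.
X_train_b, X_thresh_b, y_train_b, y_thresh_b = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train_b, y_train_b = downsample(X_train_b, y_train_b)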
@@ -63,7 +64,7 @@ def make_data_splitter(X, y, problem_type, problem_configuration=None, n_splits=
    if problem_type == ProblemTypes.REGRESSION:
How come we're not using BalancedClassificationDataTVSplit for the large classification datasets?
Hmmm, that's a good question. It wasn't something I was going to add in this iteration, but it might be good to add? It's untested in terms of perf tests, however. @dsherry what do you think?
Ah, good call @freddyaboulton . Yeah @bchen1116 we should add that in, otherwise we're breaking our current large data support.
I took a stab at the control flow here, and threw in a little refactor for timeseries just to clean things up:
if is_time_series(problem_type):
    if not problem_configuration:
        raise ValueError("problem_configuration is required for time series problem types")
    return TimeSeriesSplit(n_splits=n_splits, gap=problem_configuration.get('gap'),
                           max_delay=problem_configuration.get('max_delay'))
if X.shape[0] > _LARGE_DATA_THRESHOLD:
    if problem_type == ProblemTypes.REGRESSION:
        return TrainingValidationSplit(test_size=_LARGE_DATA_PERCENT_VALIDATION, shuffle=shuffle)
    elif problem_type in [ProblemTypes.BINARY, ProblemTypes.MULTICLASS]:
        return BalancedClassificationDataTVSplit(test_size=_LARGE_DATA_PERCENT_VALIDATION, random_seed=random_seed, shuffle=shuffle)
if problem_type == ProblemTypes.REGRESSION:
    return KFold(n_splits=n_splits, random_state=random_seed, shuffle=shuffle)
elif problem_type in [ProblemTypes.BINARY, ProblemTypes.MULTICLASS]:
    return BalancedClassificationDataCVSplit(n_splits=n_splits, random_seed=random_seed, shuffle=shuffle)
raise ValueError('Invalid problem_type')
I may be screwing up the usage of shuffle above, so pls check my math haha.
Thoughts @bchen1116 @freddyaboulton ?
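A quick usage sketch to sanity-check the dispatch above. The import paths and the exact keyword names are my assumptions about where things live as of this PR, so adjust as needed:

import numpy as np
import pandas as pd

# Assumed import locations as of this PR; adjust if the modules differ.
from evalml.automl.utils import make_data_splitter
from evalml.problem_types import ProblemTypes

# Small, imbalanced binary dataset: below _LARGE_DATA_THRESHOLD, so the
# control flow above should land on BalancedClassificationDataCVSplit.
X = pd.DataFrame(np.random.rand(1000, 5))
y = pd.Series([0] * 900 + [1] * 100)
splitter = make_data_splitter(X, y, ProblemTypes.BINARY, n_splits=3, random_seed=0)
print(type(splitter).__name__)  # expected: BalancedClassificationDataCVSplit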
@@ -63,7 +64,7 @@ def make_data_splitter(X, y, problem_type, problem_configuration=None, n_splits=
     if problem_type == ProblemTypes.REGRESSION:
         return KFold(n_splits=n_splits, random_state=random_seed, shuffle=shuffle)
     elif problem_type in [ProblemTypes.BINARY, ProblemTypes.MULTICLASS]:
-        return StratifiedKFold(n_splits=n_splits, random_state=random_seed, shuffle=shuffle)
+        return BalancedClassificationDataCVSplit(n_splits=n_splits, random_seed=random_seed, shuffle=shuffle)
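To make the swap in that hunk concrete, here is a minimal self-contained sketch of the idea as I read it: keep stratified folds, but downsample the majority class within each training fold, leaving test folds at the original distribution. This is a paraphrase of the splitter's intent, not its actual implementation, and the 4:1 ratio is an assumed default.

import numpy as np
from sklearn.model_selection import StratifiedKFold

def balanced_stratified_splits(X, y, n_splits=3, ratio=4, random_seed=0):
    """Yield (train_indices, test_indices) where each training fold's
    majority class is capped at `ratio` times its minority count.
    Test folds are left untouched so evaluation reflects the real
    class distribution."""
    rng = np.random.default_rng(random_seed)
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_seed)
    for train_idx, test_idx in skf.split(X, y):
        y_train = y[train_idx]
        classes, counts = np.unique(y_train, return_counts=True)
        cap = ratio * counts.min()  # per-class ceiling for the training fold
        keep = []
        for cls in classes:
            cls_idx = train_idx[y_train == cls]
            if len(cls_idx) > cap:
                cls_idx = rng.choice(cls_idx, size=cap, replace=False)
            keep.append(cls_idx)
        yield np.concatenate(keep), test_idx

# With y as a numpy array of 900 zeros and 100 ones, each training fold
# keeps all its ones and roughly 4x as many zeros; test folds stay ~9:1.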
👍
@bchen1116 I left a few comments on the perf test doc, feel free to send me an invite if you wanna chat about any of that! Will review code shortly
@bchen1116 good stuff! This is looking nice so far.
Main points to change:
- I agree with @freddyaboulton that we should not break our current large data support, and that this PR should use the TV splitter instead of CV when appropriate. I left a suggestion for the control flow there.
- I left a suggestion for how to break up the unit tests a bit to make it easier for us to read and modify them in the future, and to make sure we have coverage of the math (which it appears you do!); a sketch of the shape I have in mind follows this list
- A few impl suggestions
- Docstrings on the two splitter classes
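As an illustration of the breakup mentioned above, one hedged sketch of a parametrized math-coverage test. The 4:1 ratio, the case values, and the assumption that the splitter exposes a sklearn-style split(X, y) yielding index arrays are all mine; wire it up to the real API as needed.

import numpy as np
import pandas as pd
import pytest

# Import path per this PR's file layout; adjust if it differs.
from evalml.preprocessing.data_splitters import BalancedClassificationDataCVSplit

@pytest.mark.parametrize("majority,minority", [
    (1000, 100),  # heavy imbalance: majority should get downsampled
    (400, 100),   # near the assumed 4:1 ratio: borderline case
    (150, 100),   # mild imbalance: should be left alone
])
def test_training_folds_respect_balance_ratio(majority, minority):
    X = pd.DataFrame(np.random.rand(majority + minority, 3))
    y = pd.Series([0] * majority + [1] * minority)
    splitter = BalancedClassificationDataCVSplit(n_splits=3, random_seed=0, shuffle=True)
    for train_idx, _ in splitter.split(X, y):
        counts = y.iloc[train_idx].value_counts()
        # assumed default: majority capped at 4x the minority count
        assert counts.max() <= 4 * counts.min()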
evalml/preprocessing/data_splitters/balanced_classification_splitter.py (outdated; resolved)
evalml/preprocessing/data_splitters/balanced_classification_sampler.py (outdated; 3 resolved threads)
evalml/tests/preprocessing_tests/test_balanced_classification_data_splitter.py (resolved)
Great work, the new parametrized tests look excellent!
fix #973
Perf test results here
Updated and more in-depth perf test here
Summary:
- StratifiedKFold
- min_samples/min_percentage, so might be useful to do tuning.
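Since min_samples/min_percentage come up above, here is my reading of the downsampling math as a standalone sketch. The semantics (a balanced_ratio cap, a min_samples floor, and a min_percentage opt-out for extreme imbalance) and the default values are assumptions from this discussion, not a copy of the implementation.

import numpy as np
import pandas as pd

def downsample_indices(y, balanced_ratio=4, min_samples=100, min_percentage=0.1, random_seed=0):
    """Pick row indices so each majority class is capped at
    `balanced_ratio` times the minority count, never dropping a class
    below `min_samples`, and skipping sampling entirely when the
    minority class is under `min_percentage` of all rows
    (assumed semantics, not the actual evalml implementation)."""
    rng = np.random.default_rng(random_seed)
    counts = y.value_counts()
    minority = counts.min()
    if minority / len(y) < min_percentage:
        return y.index.to_numpy()  # too imbalanced: leave the data untouched
    cap = max(int(balanced_ratio * minority), min_samples)
    keep = []
    for cls in counts.index:
        cls_idx = y.index[y == cls].to_numpy()
        if len(cls_idx) > cap:
            cls_idx = rng.choice(cls_idx, size=cap, replace=False)
        keep.append(cls_idx)
    return np.concatenate(keep)

# e.g. 900:100 -> minority is exactly 10% of rows, so sampling runs: the
# majority is cut to max(4 * 100, 100) = 400 rows, the minority keeps 100.
y = pd.Series([0] * 900 + [1] * 100)
print(y.loc[downsample_indices(y)].value_counts())

Tuning then amounts to sweeping these three knobs against the perf-test datasets linked above.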