Add easier way to determine whether data splitter is CV #3297

Merged: bchen1116 merged 10 commits into main from bc_3098_cv on Feb 7, 2022

Conversation

@bchen1116
Contributor

fix #3098

@bchen1116 bchen1116 self-assigned this Feb 1, 2022
codecov bot commented Feb 1, 2022

Codecov Report

Merging #3297 (e25b87c) into main (4fdcf63) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##            main   #3297     +/-   ##
=======================================
+ Coverage   99.8%   99.8%   +0.1%     
=======================================
  Files        322     324      +2     
  Lines      31714   31764     +50     
=======================================
+ Hits       31624   31674     +50     
  Misses        90      90             
| Impacted Files | Coverage Δ |
|---|---|
| evalml/automl/utils.py | 100.0% <ø> (ø) |
| evalml/preprocessing/data_splitters/__init__.py | 100.0% <100.0%> (ø) |
| evalml/preprocessing/data_splitters/no_split.py | 100.0% <100.0%> (ø) |
| ...valml/preprocessing/data_splitters/sk_splitters.py | 100.0% <100.0%> (ø) |
| .../preprocessing/data_splitters/time_series_split.py | 96.7% <100.0%> (+0.4%) ⬆️ |
| ...essing/data_splitters/training_validation_split.py | 100.0% <100.0%> (ø) |
| evalml/tests/automl_tests/test_automl_utils.py | 100.0% <100.0%> (ø) |
| evalml/tests/preprocessing_tests/test_no_split.py | 100.0% <100.0%> (ø) |
| ...lml/tests/preprocessing_tests/test_sk_splitters.py | 100.0% <100.0%> (ø) |
| ...processing_tests/test_training_validation_split.py | 100.0% <100.0%> (ø) |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 4fdcf63...e25b87c.

Contributor

@chukarsten chukarsten left a comment

Thanks for tackling, Bryan. I think we might need to rethink dynamically adding the attribute to the sklearn object, though. Perhaps a subclass of KFold/StratifiedKFold?

- return KFold(n_splits=n_splits, random_state=random_seed, shuffle=shuffle)
+ kfold = KFold(n_splits=n_splits, random_state=random_seed, shuffle=shuffle)
+ # can set this to true directly since k-fold requires >1 splits
+ kfold.is_cv = True
Contributor

This is kind of worrisome. KFold is an sklearn class, so contributors and other devs have little reason to expect this attribute on a standard sklearn object unless they know about the code segment that adds it. Maybe we should consider simple wrapper classes with the same names that inherit from KFold and StratifiedKFold but define the property as the other splitters do. Curious what others think...

Contributor Author

@chukarsten I added a quick fix to this where we define our own classes and add is_cv as a property to that! The performance shouldn't change otherwise though. Let me know what you think
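
For readers following along, here is a minimal sketch of the wrapper approach discussed in this thread, assuming scikit-learn is installed. It is illustrative only and is not copied from the merged diff: the import aliases `SkKFold` and `SkStratifiedKFold` and the docstrings are assumptions. The wrappers keep sklearn's splitting behavior unchanged and expose `is_cv` as a read-only property.

```python
from sklearn.model_selection import KFold as SkKFold
from sklearn.model_selection import StratifiedKFold as SkStratifiedKFold


class KFold(SkKFold):
    """Thin wrapper around sklearn's KFold that also reports is_cv."""

    @property
    def is_cv(self):
        # k-fold requires more than one split, so it always counts as CV.
        return True


class StratifiedKFold(SkStratifiedKFold):
    """Thin wrapper around sklearn's StratifiedKFold that also reports is_cv."""

    @property
    def is_cv(self):
        return True


# Splitting still goes through sklearn's implementation.
splitter = KFold(n_splits=3, shuffle=True, random_state=0)
folds = list(splitter.split([[i] for i in range(6)]))
print(splitter.is_cv, len(folds))
```

Because `is_cv` lives on the class rather than being attached to an instance after construction, anyone reading the class definition can see it, and `isinstance` checks against the sklearn classes still pass.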

@bchen1116 bchen1116 requested a review from chukarsten February 1, 2022 20:05
Contributor

@chukarsten chukarsten left a comment

Thanks so much @bchen1116 !! Good to go.

Contributor

@angela97lin angela97lin left a comment

LGTM. I left a comment about making is_cv abstract on our base class, but it's not blocking.
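
The inline comment itself is not shown here, so the following is only a guess at what "abstract" could look like, using Python's standard `abc` module. The names `SplitterBase`, `SingleSplit`, and `ForgotIsCV` are hypothetical and do not refer to the project's actual classes.

```python
from abc import ABC, abstractmethod


class SplitterBase(ABC):
    """Hypothetical base class; not the project's actual base splitter."""

    @property
    @abstractmethod
    def is_cv(self):
        """Return True if the splitter performs cross-validation."""


class SingleSplit(SplitterBase):
    """Hypothetical splitter with one train/validation split."""

    @property
    def is_cv(self):
        return False


class ForgotIsCV(SplitterBase):
    """Hypothetical subclass that does not define is_cv."""


single = SingleSplit()

# Instantiating a subclass that leaves is_cv undefined raises TypeError.
try:
    ForgotIsCV()
    instantiated = True
except TypeError:
    instantiated = False

print(single.is_cv, instantiated)
```

With this shape, a new splitter that omits `is_cv` fails at construction time instead of later, when some caller reads the attribute.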

@bchen1116 bchen1116 merged commit 465ae93 into main Feb 7, 2022
@chukarsten chukarsten mentioned this pull request Feb 18, 2022
@freddyaboulton freddyaboulton deleted the bc_3098_cv branch May 13, 2022 15:06

Development

Successfully merging this pull request may close these issues.

Easy way to determine whether or not a data splitter counts as a CV splitter
