Make it easy to customize automl search data splitter#1568

Merged
dsherry merged 16 commits into main from ds_1567_make_data_split on Dec 18, 2020

@dsherry
Contributor

@dsherry dsherry commented Dec 17, 2020

Fix #1567

Usage: configure automl search to use a different number of folds (splits) in CV

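from evalml.automl import AutoMLSearch, make_data_splitter
# X, y, and problem_type are assumed to be defined already
random_state = 42
data_split = make_data_splitter(X, y, problem_type, n_splits=5, random_state=random_state)
automl = AutoMLSearch(problem_type=problem_type, data_split=data_split, random_state=random_state, ...)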
...
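
For readers unfamiliar with what n_splits controls, here is a minimal pure-Python sketch of plain k-fold index generation. It is an illustration only, not the evalml implementation: the function name k_fold_indices is hypothetical, and the splitter returned by make_data_splitter may behave differently (for example by stratifying or shuffling).

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train, test) index lists for plain, unshuffled k-fold CV.

    Hypothetical helper for illustration; not part of evalml.
    """
    if not 2 <= n_splits <= n_samples:
        raise ValueError("n_splits must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        stop = start + base + (1 if fold < extra else 0)
        test = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, test
        start = stop


# With n_splits=5 on 10 rows, each row is held out exactly once.
folds = list(k_fold_indices(n_samples=10, n_splits=5))
```

Raising n_splits gives more, smaller validation folds and more pipeline fits per evaluation, which is the trade-off a caller is choosing when passing n_splits=5.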

Usage: disable shuffling

random_state = 42
data_split = make_data_splitter(X, y, problem_type, shuffle=False, random_state=random_state)
automl = AutoMLSearch(problem_type=problem_type, data_split=data_split, random_state=random_state, ...)
...
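
To see why the shuffle flag matters, the sketch below splits class-sorted labels into plain (non-stratified) folds with and without shuffling. The helper fold_labels is hypothetical and written only for this illustration; it says nothing about how evalml's own splitters are implemented, and a stratified splitter would behave differently.

```python
import random


def fold_labels(y, n_splits, shuffle, random_state=None):
    """Return the labels landing in each test fold of a plain k-fold split.

    Hypothetical helper; assumes len(y) is divisible by n_splits.
    """
    order = list(range(len(y)))
    if shuffle:
        # A seeded Random instance keeps the shuffle reproducible.
        random.Random(random_state).shuffle(order)
    fold_size = len(y) // n_splits
    return [
        [y[i] for i in order[k * fold_size:(k + 1) * fold_size]]
        for k in range(n_splits)
    ]


y = [0] * 6 + [1] * 6  # labels sorted by class
unshuffled = fold_labels(y, n_splits=3, shuffle=False)
shuffled = fold_labels(y, n_splits=3, shuffle=True, random_state=42)
```

Without shuffling, the first and last folds of this sorted data each contain a single class, so shuffle=False is mainly appropriate when row order is meaningful or the data is already well mixed.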

I suppose we could simply add n_splits and shuffle to AutoMLSearch.__init__, but I wanted to keep us flexible on this a while longer. We'll be adding more heuristics to the data splitting logic at some point, and it's nice to expose that through a separate function.
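
As a rough illustration of that design choice, a factory function can hold the splitting heuristics in one place while the search constructor just accepts whatever it returns. Everything in this sketch is hypothetical (SplitPlan, plan_data_split, and the strategy names); the default of 3 folds is the only value taken from this thread, and the actual rules inside make_data_splitter are not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SplitPlan:
    """Hypothetical description of a chosen CV strategy."""
    strategy: str
    n_splits: int
    shuffle: bool


def plan_data_split(problem_type, n_splits=3, shuffle=True):
    """Pick a CV strategy from the problem type (illustrative heuristic only)."""
    if n_splits < 2:
        raise ValueError("n_splits must be at least 2")
    if problem_type in ("binary", "multiclass"):
        # Classification usually benefits from preserving class ratios per fold.
        strategy = "stratified_k_fold"
    elif problem_type == "regression":
        strategy = "k_fold"
    else:
        raise ValueError(f"unsupported problem type: {problem_type!r}")
    return SplitPlan(strategy=strategy, n_splits=n_splits, shuffle=shuffle)


default_plan = plan_data_split("binary")
custom_plan = plan_data_split("regression", n_splits=5, shuffle=False)
```

New heuristics can then be added inside the factory without changing the constructor's signature.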

Hmm, perhaps we need to rename "data_split" to "data_splitter" as well.

@codecov

codecov bot commented Dec 17, 2020

Codecov Report

Merging #1568 (fcf7083) into main (22cd574) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@            Coverage Diff            @@
##             main    #1568     +/-   ##
=========================================
+ Coverage   100.0%   100.0%   +0.1%     
=========================================
  Files         239      240      +1     
  Lines       17593    17677     +84     
=========================================
+ Hits        17585    17669     +84     
  Misses          8        8             
Impacted Files                                    Coverage Δ
evalml/automl/__init__.py                         100.0% <100.0%> (ø)
evalml/automl/automl_search.py                    99.7% <100.0%> (-<0.1%) ⬇️
evalml/automl/utils.py                            100.0% <100.0%> (ø)
evalml/tests/automl_tests/test_automl.py          100.0% <100.0%> (ø)
evalml/tests/automl_tests/test_automl_utils.py    100.0% <100.0%> (ø)

@dsherry dsherry marked this pull request as ready for review December 17, 2020 13:48
Contributor

@bchen1116 bchen1116 left a comment

LGTM!

This change means we should file a separate issue to update our implementation of class_imbalance_data_check, specifically its cv_folds arg. It defaults to 3 because that's the default number of folds in the data split, but now that the fold count is configurable, I believe AutoMLSearch should pass the actual value in. Not blocking this PR though!
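
A minimal sketch of why the fold count matters to such a check: a class with fewer samples than folds cannot appear in every fold. The function classes_below_fold_count is hypothetical and is not the class_imbalance_data_check implementation, whose exact threshold is not stated in this thread.

```python
from collections import Counter


def classes_below_fold_count(y, cv_folds):
    """Return, sorted, the class labels with fewer samples than cv_folds.

    Hypothetical helper for illustration only.
    """
    counts = Counter(y)
    return sorted(label for label, n in counts.items() if n < cv_folds)


y = ["a"] * 10 + ["b"] * 4
ok_with_default = classes_below_fold_count(y, cv_folds=3)
flagged_with_five = classes_below_fold_count(y, cv_folds=5)
```

The same target passes with the default of 3 folds but is flagged once the caller asks for 5, so a check that keeps assuming 3 would miss the problem.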

@dsherry
Contributor Author

dsherry commented Dec 17, 2020

@bchen1116 excellent point RE the class imbalance data check! I just filed it as #1570

Contributor

@freddyaboulton freddyaboulton left a comment

@dsherry Looks great!

Contributor

@angela97lin angela97lin left a comment

LGTM! 😁

@dsherry dsherry force-pushed the ds_1567_make_data_split branch from 8576972 to 150bd95 Compare December 18, 2020 00:06
@dsherry dsherry force-pushed the ds_1567_make_data_split branch from c8ed8ff to fcf7083 Compare December 18, 2020 14:58
@dsherry dsherry merged commit 162992d into main Dec 18, 2020
@dsherry dsherry deleted the ds_1567_make_data_split branch December 18, 2020 16:10
@dsherry dsherry mentioned this pull request Dec 29, 2020