Add Time Series Splitting Data Check#3141
Conversation
Codecov Report
@@           Coverage Diff           @@
##            main    #3141    +/-  ##
=======================================
+ Coverage   99.7%    99.7%   +0.1%
=======================================
  Files        318      320      +2
  Lines      30948    31030     +82
=======================================
+ Hits       30843    30925     +82
  Misses       105      105
# Conflicts:
#	docs/source/release_notes.rst
#	docs/source/user_guide/data_checks.ipynb
#	evalml/automl/automl_search.py
#	evalml/data_checks/default_data_checks.py
#	evalml/tests/data_checks_tests/test_data_checks.py
freddyaboulton
left a comment
@ParthivNaresh Looks good! I think the main thing is that we should be creating a data_splitter in search and search_iterative rather than adding a new parameter to AutoMLSearch init.
Secondly, I think we only need to care about the first training split? As long as the pipeline is trained with all targets, the pipeline will be able to predict on the validation set, regardless of the number of unique target values in the validation set. Since the splits are sequentially constructed, if the first split has all the target values, all the subsequent ones will too.
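The reasoning above can be sketched in a few lines. This is an illustrative stand-in, not EvalML's actual data check: because sequential time series splits only grow the training window, if the first (smallest) training split contains every target class, every later one does too.

```python
# Illustrative sketch (not EvalML's implementation): with sequential splits,
# the training window only grows, so checking the FIRST training split for
# all target classes is sufficient.

def sequential_splits(n_samples, n_splits):
    """Yield (train_indices, validation_indices) for sequential splits."""
    fold = n_samples // (n_splits + 1)
    for i in range(1, n_splits + 1):
        yield list(range(0, i * fold)), list(range(i * fold, (i + 1) * fold))

def first_split_has_all_classes(y, n_splits):
    """True if the first (smallest) training window sees every class in y."""
    train, _ = next(sequential_splits(len(y), n_splits))
    return {y[i] for i in train} == set(y)

# Class 1 first appears at index 20, inside the first training window (0:25).
y = [0] * 20 + [1] * 30 + [0] * 50
print(first_split_has_all_classes(y, 3))  # True
```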
    max_time=None,
    patience=None,
    tolerance=None,
    n_splits=3,
@ParthivNaresh I don't think we need to add this parameter to AutoMLSearch. Isn't it implicit in the data_splitter?
I was thinking about the use case where a user starts with search and, while they don't want to pass in their own data splitter, they still want to control the number of splits. It would be passed into the make_data_splitter call below.
Makes sense! I'm not opposed to that but maybe that should be its own issue. This API change is not related to the scope of the original issue.
Gotcha, understandable. I'm fine with making a data splitter in search and search_iterative and leaving n_splits alone for another issue.
    "tolerance": tolerance,
    "verbose": verbose,
    "problem_configuration": problem_configuration,
    "n_splits": n_splits,
Rather than threading this parameter down to AutoMLSearch, should we just call make_data_splitter and pass the result in as the data_splitter parameter?
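The suggested pattern could look roughly like this. All names and bodies below are hypothetical stand-ins mirroring the discussion (search, AutoMLSearch, make_data_splitter), not EvalML's real signatures: the splitter is constructed inside search, so n_splits never has to become an AutoMLSearch parameter.

```python
# Hypothetical sketch of the suggested design: build the splitter in search()
# and hand it to AutoMLSearch, rather than threading n_splits down into
# AutoMLSearch's __init__.

class AutoMLSearch:
    def __init__(self, data_splitter=None):
        # AutoMLSearch only ever sees a ready-made splitter object.
        self.data_splitter = data_splitter

def make_data_splitter(problem_type, n_splits=3):
    # Stand-in factory: returns some splitter configured for the problem.
    return {"problem_type": problem_type, "n_splits": n_splits}

def search(problem_type, n_splits=3):
    # n_splits is consumed here; AutoMLSearch's API stays unchanged.
    splitter = make_data_splitter(problem_type, n_splits=n_splits)
    return AutoMLSearch(data_splitter=splitter)

automl = search("time series binary", n_splits=5)
print(automl.data_splitter["n_splits"])  # 5
```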
Due to the nature of time series data, splitting cannot involve shuffling and must be done sequentially. This means splitting the data into `n_splits` + 1 sections and increasing the size of the training data by the split size every iteration, while keeping the validation size equal to the split size.

For time series binary and time series multiclass problems, every training and validation segment must contain at least one example of every class found in the entire target. The reason is that many classification models run into issues if they are trained on data that lacks an instance of a class the model is later expected to predict. For example, with 3 splits and a split size of 25, every training/validation split, (0:25)/(25:50), (0:50)/(50:75), (0:75)/(75:100), must contain at least one instance of all unique target classes in both the training and validation sets.
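The split boundaries described above can be enumerated with a minimal sketch, assuming 100 rows, 3 splits, and a split size of 25 (i.e. `n_splits` + 1 sections, with the training window growing by one section each iteration):

```python
# Minimal sketch of sequential time series split boundaries: n_splits + 1
# equal sections, training window grows, validation stays one section wide.

def split_boundaries(n_samples, n_splits):
    size = n_samples // (n_splits + 1)
    return [((0, (i + 1) * size), ((i + 1) * size, (i + 2) * size))
            for i in range(n_splits)]

for train, val in split_boundaries(100, 3):
    print(f"train {train[0]}:{train[1]} / validation {val[0]}:{val[1]}")
# train 0:25 / validation 25:50
# train 0:50 / validation 50:75
# train 0:75 / validation 75:100
```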
Thanks @ParthivNaresh ! Looks good to me.
In my opinion, it's best to leave the n_splits addition to the AutoMLSearch API for its own issue. In search and search_iterative we can create the data splitter with make_data_splitter and pass it to AutoMLSearch as the data_splitter param.
Curious what other people on the team think about that, but changing the AutoMLSearch API feels like its own issue since it's not strictly needed to close this one out.
bchen1116
left a comment
LGTM! Left some minor comments, but looks great!
Fixes #1681