Add Time Series Splitting Data Check #3141

Merged
ParthivNaresh merged 21 commits into main from MissingClassesSplittingTSClassification on Dec 17, 2021

Conversation

@ParthivNaresh
Contributor

Fixes #1681

@codecov

codecov bot commented Dec 9, 2021

Codecov Report

Merging #3141 (be6a0b1) into main (7bbc60a) will increase coverage by 0.1%.
The diff coverage is 100.0%.

Impacted file tree graph

@@           Coverage Diff           @@
##            main   #3141     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%     
=======================================
  Files        318     320      +2     
  Lines      30948   31030     +82     
=======================================
+ Hits       30843   30925     +82     
  Misses       105     105             
Impacted Files Coverage Δ
evalml/automl/automl_search.py 99.9% <100.0%> (+0.1%) ⬆️
evalml/data_checks/__init__.py 100.0% <100.0%> (ø)
evalml/data_checks/data_check_message_code.py 100.0% <100.0%> (ø)
evalml/data_checks/default_data_checks.py 100.0% <100.0%> (ø)
evalml/data_checks/ts_splitting_data_check.py 100.0% <100.0%> (ø)
evalml/tests/automl_tests/test_search.py 100.0% <100.0%> (ø)
evalml/tests/automl_tests/test_search_iterative.py 100.0% <100.0%> (ø)
evalml/tests/data_checks_tests/test_data_checks.py 100.0% <100.0%> (ø)
.../data_checks_tests/test_ts_splitting_data_check.py 100.0% <100.0%> (ø)

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7bbc60a...be6a0b1.

@ParthivNaresh changed the title from "Update Splitting Logic for Time Series Classification" to "Add Time Series Splitting Data Check" on Dec 10, 2021
# Conflicts:
#	docs/source/release_notes.rst
#	docs/source/user_guide/data_checks.ipynb
#	evalml/automl/automl_search.py
#	evalml/data_checks/default_data_checks.py
#	evalml/tests/data_checks_tests/test_data_checks.py
@ParthivNaresh ParthivNaresh marked this pull request as ready for review December 15, 2021 15:56
Contributor

@freddyaboulton freddyaboulton left a comment

@ParthivNaresh Looks good! I think the main thing is that we should be creating a data_splitter in search and search_iterative rather than adding a new parameter to AutoMLSearch init.

Secondly, I think we only need to care about the first training split? As long as the pipeline is trained with all targets, the pipeline will be able to predict on the validation set, regardless of the number of unique target values in the validation set. Since the splits are sequentially constructed, if the first split has all the target values, all the subsequent ones will too.
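The nesting argument above can be sketched in a few lines of plain Python. This is an illustration only, not the PR's implementation, and `ts_train_windows` is a hypothetical helper: because sequential training windows are prefixes of one another, a first training window that contains every class guarantees that all later training windows do too.

```python
# Illustration only (not evalml's implementation): sequential time
# series splits produce nested training windows, so a class present
# in the first training window is present in every later one.

def ts_train_windows(n_samples, n_splits):
    """Yield (train_end, val_end) boundaries for sequential splits."""
    split_size = n_samples // (n_splits + 1)
    for i in range(1, n_splits + 1):
        yield i * split_size, (i + 1) * split_size

y = [0, 1] * 50  # 100 targets; both classes appear in the first window
all_classes = set(y)

windows = list(ts_train_windows(len(y), n_splits=3))
print(windows)  # → [(25, 50), (50, 75), (75, 100)]

# The first training window contains every class...
assert all_classes <= set(y[: windows[0][0]])
# ...and because each later window only extends it, they all do too.
assert all(all_classes <= set(y[:train_end]) for train_end, _ in windows)
```

Note that the validation windows are disjoint rather than nested, which is why the validation side of each split still needs its own check.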

Comment thread evalml/automl/automl_search.py Outdated
max_time=None,
patience=None,
tolerance=None,
n_splits=3,
Contributor

@ParthivNaresh I don't think we need to add this parameter to AutoMLSearch. Isn't it implicit in the data_splitter?

Contributor Author

I was thinking about the use case where a user starts with search and decides that, while they don't want to pass in their own data splitter, they still want to control the number of splits. It would be passed into the make_data_splitter call below.

Contributor

Makes sense! I'm not opposed to that but maybe that should be its own issue. This API change is not related to the scope of the original issue.

Contributor Author

Gotcha, understandable. I'm fine with creating the data splitter in search and search_iterative and leaving n_splits alone for another issue.

Comment thread evalml/automl/automl_search.py Outdated
"tolerance": tolerance,
"verbose": verbose,
"problem_configuration": problem_configuration,
"n_splits": n_splits,
Contributor

Rather than threading this parameter down to AutoMLSearch, we should just call make_data_splitter and pass the result in as the data_splitter parameter?

Comment thread evalml/data_checks/ts_splitting_data_check.py Outdated
Comment thread evalml/data_checks/ts_splitting_data_check.py
Due to the nature of time series data, splitting cannot involve shuffling and has to be done in a sequential manner. This means splitting the data into `n_splits` + 1 different sections and increasing the size of the training data by the split size every iteration, while keeping the test size equal to the split size.

For every split in the data, the training and validation segments must contain target data that has an example of every class found in the entire target set for time series binary and time series multiclass problems. The reason for this is that many classification machine learning models run into issues if they're trained on data that doesn't contain an instance of a class but the model is then expected to predict for it. For example, with 3 splits and a split size of 25, every training/validation split, i.e. (0:25)/(25:50), (0:50)/(50:75), (0:75)/(75:100), must contain at least one instance of all unique target classes in both the training and validation set.
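The behavior described above can be sketched as a small stand-alone check. This is a simplified stand-in for the data check added in this PR, not its actual code, and `check_ts_splits` is a hypothetical name: it flags every split whose training or validation segment is missing at least one target class.

```python
# Simplified sketch of the splitting check described above; not
# evalml's actual TimeSeriesSplittingDataCheck, just the same idea.

def check_ts_splits(y, n_splits=3):
    """Return indices of splits whose training or validation segment
    is missing at least one of the target classes in y."""
    all_classes = set(y)
    split_size = len(y) // (n_splits + 1)
    bad_splits = []
    for i in range(1, n_splits + 1):
        train = y[: i * split_size]
        val = y[i * split_size : (i + 1) * split_size]
        if not (all_classes <= set(train) and all_classes <= set(val)):
            bad_splits.append(i - 1)
    return bad_splits

# 100 rows where class 1 only appears in the last quarter: every
# training segment is missing class 1, and the last validation
# segment is missing class 0, so all three splits are flagged.
y = [0] * 75 + [1] * 25
print(check_ts_splits(y, n_splits=3))  # → [0, 1, 2]
```

A well-mixed target such as `[0, 1] * 50` would return an empty list, since every training and validation segment sees both classes.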
Contributor

👍

Contributor

@freddyaboulton freddyaboulton left a comment

Thanks @ParthivNaresh ! Looks good to me.

In my opinion, it's best to leave the n_splits addition to the AutoMLSearch api for its own issue. In search and search_iterative we can create the data splitter with make_data_splitter and pass to AutoMLSearch as the data_splitter param.

Curious what other people on the team think about that, but changing the AutoMLSearch API feels like its own issue since it's not 100% needed to close this one out.

Contributor

@bchen1116 bchen1116 left a comment

LGTM! Left some minor comments, but looks great!

Comment thread evalml/tests/data_checks_tests/test_ts_splitting_data_check.py
Comment thread evalml/tests/data_checks_tests/test_ts_splitting_data_check.py Outdated
@ParthivNaresh ParthivNaresh merged commit 2e374a5 into main Dec 17, 2021
@angela97lin angela97lin mentioned this pull request Dec 22, 2021
@freddyaboulton freddyaboulton deleted the MissingClassesSplittingTSClassification branch May 13, 2022 15:34

Development

Successfully merging this pull request may close these issues.

Changes and fixes for time series classification problems
