
Add DataCheck to validate time series problem configuration parameters #3111

Merged
merged 8 commits into from
Dec 6, 2021

Conversation

freddyaboulton
Contributor

Pull Request Description

Fixes #3103 and also updates the time series data splitter to use the same logic as the data check for validating the parameters. The splitter was already doing this validation, but it became buggy after the addition of forecast_horizon.



@@ -112,6 +117,7 @@ def test_make_data_splitter_default(problem_type, large_data):
assert data_splitter.n_splits == 3
assert data_splitter.gap == 1
assert data_splitter.max_delay == 7
assert data_splitter.forecast_horizon == 4
Contributor Author

To make sure the data splitter gets the right forecast horizon value.

)


class TimeSeriesParametersDataCheck(DataCheck):
Contributor Author
@freddyaboulton freddyaboulton Dec 2, 2021

Should this be a part of default data checks?

The reason I'm not adding it is that it would require changing the init API of the default data checks, and some of the use cases we've seen manually create their own DataChecks class from a list of individual DataCheck instances. So if users want this check, they just need to add it to their list.
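As a rough sketch of the composition pattern described here (class names mirror the PR, but the bodies are illustrative stand-ins, not evalml's actual implementation):

```python
class DataCheck:
    """Base interface: each check returns a list of error messages."""
    def validate(self, X, y=None):
        return []


class TimeSeriesParametersDataCheck(DataCheck):
    # Illustrative stand-in for the check added in this PR.
    def __init__(self, problem_configuration, n_splits):
        self.window = (
            problem_configuration["gap"]
            + problem_configuration["max_delay"]
            + problem_configuration["forecast_horizon"]
        )
        self.n_splits = n_splits

    def validate(self, X, y=None):
        # Each of the n_splits + 1 folds gets roughly len(X) // (n_splits + 1) rows.
        smallest_split_size = len(X) // (self.n_splits + 1)
        if self.window > smallest_split_size:
            return [
                f"Window of {self.window} rows exceeds smallest split of "
                f"{smallest_split_size} rows."
            ]
        return []


class DataChecks:
    """Users compose their own suite from a list of DataCheck instances."""
    def __init__(self, checks):
        self.checks = checks

    def validate(self, X, y=None):
        return [msg for check in self.checks for msg in check.validate(X, y)]
```

With this shape, opting into the new check is just appending one more instance to the list passed to DataChecks.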

Contributor

I think it could be a good idea to add it, but not necessary in this PR. Let's file an issue for it?

Contributor Author

Issue here: #3125

@codecov
codecov bot commented Dec 2, 2021

Codecov Report

Merging #3111 (e76074d) into main (b786c54) will increase coverage by 0.8%.
The diff coverage is 100.0%.

Impacted file tree graph

@@           Coverage Diff           @@
##            main   #3111     +/-   ##
=======================================
+ Coverage   99.0%   99.8%   +0.8%     
=======================================
  Files        313     315      +2     
  Lines      30603   30664     +61     
=======================================
+ Hits       30281   30573    +292     
+ Misses       322      91    -231     
Impacted Files Coverage Δ
evalml/automl/utils.py 100.0% <ø> (+1.7%) ⬆️
...ts/automl_tests/parallel_tests/test_automl_dask.py 100.0% <ø> (ø)
evalml/tests/automl_tests/test_automl.py 99.5% <ø> (+0.1%) ⬆️
evalml/automl/automl_search.py 99.9% <100.0%> (+0.2%) ⬆️
evalml/data_checks/__init__.py 100.0% <100.0%> (ø)
evalml/data_checks/data_check_message_code.py 100.0% <100.0%> (ø)
evalml/data_checks/ts_parameters_data_check.py 100.0% <100.0%> (ø)
.../preprocessing/data_splitters/time_series_split.py 96.3% <100.0%> (-3.7%) ⬇️
...ests/automl_tests/test_automl_search_regression.py 100.0% <100.0%> (+15.6%) ⬆️
evalml/tests/automl_tests/test_automl_utils.py 100.0% <100.0%> (+9.5%) ⬆️
... and 15 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update b786c54...e76074d.

Contributor
@bchen1116 bchen1116 left a comment

Looks great to me! I left a comment about adding some tests, but mostly left small nitpicks otherwise.

gap (int): Gap used in time series problem. Time series pipelines shift the target variable by gap rows. Defaults to 0.
gap (int): Number of time units separating the data used to generate features and the data to forecast on.
Defaults to 0.
forecast_horizon (int): Number of time units to forecast. Defaults to 0.
Contributor

Defaults to 1*

"then at least one of the splits would be empty by the time it reaches the pipeline. "
"Please use a smaller number of splits or collect more data."
)
result = are_ts_parameters_valid_for_split(
Contributor

Nice! I like this value assignment


Args:
X (pd.DataFrame, np.ndarray): Features.
y (pd.Series, np.ndarray): Ignored. Defaults to None.
Contributor

Nit, but there seems to be an extra space between Ignored. and Defaults to?

forecast_horizon=forecast_horizon,
n_splits=n_splits,
date_index="date",
)
X = pd.DataFrame({"features": range(15)})
# Each split would have 15 // 5 = 3 data points. However, this is smaller than the number of data_points required
Contributor

I think this comment needs to be updated since not every split is now using 3 data points.

Contributor Author

I'm just going to delete it. Don't think it's adding much value since it's obvious the values are incompatible given the expected behavior of the test.

TsParameterValidationResult - named tuple with four fields
is_valid (bool): True if parameters are valid.
msg (str): Contains error message to display. Empty if is_valid.
smallest_split_size (int): smallest split size given n_obs and n_splits.
Contributor

nit: Can we capitalize the first letters of these descriptions to get everything into the same format, ie Smallest split size ...

f"the smallest split would have {split_size} observations. "
f"Since {gap + max_delay + forecast_horizon} (gap + max_delay + forecast_horizon) > {split_size}, "
"then at least one of the splits would be empty by the time it reaches the pipeline. "
"Please use a smaller number of splits or collect more data."
Contributor

I like this error message because it explains why the values passed in will not work! Great work here!

Contributor

Should we change the end of the error message to include this?
"Please use a smaller number of splits, consider reducing one or more of these parameters, or collect more data."

Contributor Author

I'm on it!

)


def are_ts_parameters_valid_for_split(
Contributor

I know these are covered in other tests, but can we add some explicit tests for these in the test_utils file? Mainly because if we decide to change some of this in the future, it would be nice to ensure we have full test coverage for these methods.

Contributor Author

Word, adding some small tests now too!

@freddyaboulton freddyaboulton force-pushed the 3103-catch-incompatible-ts-parameters branch 2 times, most recently from 2b55ebf to fe6b0cc Compare December 3, 2021 16:06
Contributor
@eccabay eccabay left a comment

Good stuff!

class TimeSeriesParametersDataCheck(DataCheck):
"""Checks whether the time series parameters are compatible with data splitting.

If gap + max_delay + forecast_horizon > X.shape[0] // (n_splits + 1)
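The condition quoted above can be sketched numerically. This is a hypothetical re-implementation of the documented rule for illustration, not evalml's actual code:

```python
def ts_parameters_valid(gap, max_delay, forecast_horizon, n_obs, n_splits):
    # Each of the n_splits + 1 folds gets roughly n_obs // (n_splits + 1) rows;
    # the window consumed by gap + max_delay + forecast_horizon must fit
    # inside the smallest fold, otherwise one split ends up empty.
    smallest_split_size = n_obs // (n_splits + 1)
    return gap + max_delay + forecast_horizon <= smallest_split_size


# 100 rows, 3 splits: smallest split has 25 rows; a window of 1 + 2 + 1 = 4 fits.
assert ts_parameters_valid(gap=1, max_delay=2, forecast_horizon=1, n_obs=100, n_splits=3)
# 15 rows, 4 splits: smallest split has 3 rows; a window of 0 + 7 + 4 = 11 does not.
assert not ts_parameters_valid(gap=0, max_delay=7, forecast_horizon=4, n_obs=15, n_splits=4)
```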
Contributor

Nit: add markdown so this will display cleanly in our documentation


Args:
problem_configuration (dict): Dict containing problem_configuration parameters.
n_splits (int): Number of time series split.
Contributor

Nit: splits?

some feature and target engineering, e.g. delaying input features and shifting the target variable by the
desired amount. If the data that will be split already has all the features and appropriate target values,
then set max_delay and gap to 0.
The max_delay, gap, and forecast_horizon parameters are just used to validate that the requested split size
Contributor

mega-nit: "just"->"only"?

evalml/tests/automl_tests/test_automl.py (resolved thread)
"problem_configuration must be a dict containing values for at least the date_index, gap, max_delay, "
f"and forecast_horizon parameters. Received {problem_configuration}."
)
return not (msg), msg
Contributor
super smooth, love this logic!
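The idiom being praised is `return not (msg), msg`: an empty error message is falsy, so the validity flag falls out of the message itself. A minimal standalone sketch of the pattern (the function name is illustrative):

```python
def validate_problem_configuration(problem_configuration):
    msg = ""
    required = {"date_index", "gap", "max_delay", "forecast_horizon"}
    if not isinstance(problem_configuration, dict) or not required.issubset(problem_configuration):
        msg = (
            "problem_configuration must be a dict containing values for at least the "
            f"date_index, gap, max_delay, and forecast_horizon parameters. "
            f"Received {problem_configuration}."
        )
    # An empty msg is falsy, so `not msg` doubles as the is_valid flag.
    return not msg, msg
```

This keeps the validity flag and its explanation from ever drifting out of sync, since both derive from the single `msg` value.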

Contributor
@ParthivNaresh ParthivNaresh left a comment

Looks great, I love how you've split up the logic for the parameters


@freddyaboulton freddyaboulton force-pushed the 3103-catch-incompatible-ts-parameters branch from eda9526 to e76074d Compare December 6, 2021 17:03
@freddyaboulton freddyaboulton merged commit b9d4cb6 into main Dec 6, 2021
@freddyaboulton freddyaboulton deleted the 3103-catch-incompatible-ts-parameters branch December 6, 2021 17:54
@chukarsten chukarsten mentioned this pull request Dec 9, 2021
Labels: none yet
Projects: none yet
5 participants