
Add DataCheck to validate time series problem configuration parameters #3111

Merged
merged 8 commits into from
Dec 6, 2021

Conversation

freddyaboulton
Contributor

Pull Request Description

Fixes #3103 and also updates the time series data splitter to use the same logic as the data check for validating the parameters. The splitter was already doing this validation, but it became buggy after the addition of forecast_horizon.



@@ -112,6 +117,7 @@ def test_make_data_splitter_default(problem_type, large_data):
assert data_splitter.n_splits == 3
assert data_splitter.gap == 1
assert data_splitter.max_delay == 7
assert data_splitter.forecast_horizon == 4
Contributor Author

To make sure the data splitter gets the right forecast horizon value.

)


class TimeSeriesParametersDataCheck(DataCheck):
Contributor Author
@freddyaboulton freddyaboulton Dec 2, 2021

Should this be a part of default data checks?

The reason I'm not adding it is that it would require changing the init API of the default data checks, and some of the use cases we've seen manually create their own DataChecks class from a list of individual DataCheck instances. So if users want this check, they just need to add it to their list.
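As a rough sketch of the composition pattern described here (class names mirror the PR, but the bodies are illustrative stand-ins, not evalml's actual implementation):

```python
class DataCheck:
    """Base interface: each check returns a list of error messages."""
    def validate(self, X, y=None):
        return []


class TimeSeriesParametersDataCheck(DataCheck):
    # Illustrative stand-in for the check added in this PR.
    def __init__(self, problem_configuration, n_splits):
        self.window = (
            problem_configuration["gap"]
            + problem_configuration["max_delay"]
            + problem_configuration["forecast_horizon"]
        )
        self.n_splits = n_splits

    def validate(self, X, y=None):
        # Each of the n_splits + 1 folds gets roughly len(X) // (n_splits + 1) rows.
        smallest_split_size = len(X) // (self.n_splits + 1)
        if self.window > smallest_split_size:
            return [
                f"Window of {self.window} rows exceeds smallest split of "
                f"{smallest_split_size} rows."
            ]
        return []


class DataChecks:
    """Users compose their own suite from a list of DataCheck instances."""
    def __init__(self, checks):
        self.checks = checks

    def validate(self, X, y=None):
        return [msg for check in self.checks for msg in check.validate(X, y)]
```

With this shape, opting into the new check is just appending one more instance to the list passed to DataChecks.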

Contributor

I think it could be a good idea to add it, but not necessary in this PR. Let's file an issue for it?

Contributor Author

Issue here: #3125

@codecov
codecov bot commented Dec 2, 2021

Codecov Report

Merging #3111 (e76074d) into main (b786c54) will increase coverage by 0.8%.
The diff coverage is 100.0%.

Impacted file tree graph

@@           Coverage Diff           @@
##            main   #3111     +/-   ##
=======================================
+ Coverage   99.0%   99.8%   +0.8%     
=======================================
  Files        313     315      +2     
  Lines      30603   30664     +61     
=======================================
+ Hits       30281   30573    +292     
+ Misses       322      91    -231     
Impacted Files Coverage Δ
evalml/automl/utils.py 100.0% <ø> (+1.7%) ⬆️
...ts/automl_tests/parallel_tests/test_automl_dask.py 100.0% <ø> (ø)
evalml/tests/automl_tests/test_automl.py 99.5% <ø> (+0.1%) ⬆️
evalml/automl/automl_search.py 99.9% <100.0%> (+0.2%) ⬆️
evalml/data_checks/__init__.py 100.0% <100.0%> (ø)
evalml/data_checks/data_check_message_code.py 100.0% <100.0%> (ø)
evalml/data_checks/ts_parameters_data_check.py 100.0% <100.0%> (ø)
.../preprocessing/data_splitters/time_series_split.py 96.3% <100.0%> (-3.7%) ⬇️
...ests/automl_tests/test_automl_search_regression.py 100.0% <100.0%> (+15.6%) ⬆️
evalml/tests/automl_tests/test_automl_utils.py 100.0% <100.0%> (+9.5%) ⬆️
... and 15 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update b786c54...e76074d.

Contributor
@bchen1116 bchen1116 left a comment

Looks great to me! I left a comment about adding some tests, but mostly left small nitpicks otherwise.

gap (int): Gap used in time series problem. Time series pipelines shift the target variable by gap rows. Defaults to 0.
gap (int): Number of time units separating the data used to generate features and the data to forecast on.
Defaults to 0.
forecast_horizon (int): Number of time units to forecast. Defaults to 0.
Contributor

Defaults to 1*

"then at least one of the splits would be empty by the time it reaches the pipeline. "
"Please use a smaller number of splits or collect more data."
)
result = are_ts_parameters_valid_for_split(
Contributor

Nice! I like this value assignment


Args:
X (pd.DataFrame, np.ndarray): Features.
y (pd.Series, np.ndarray): Ignored. Defaults to None.
Contributor

Nit, but there seems to be an extra space between Ignored. and Defaults to?

forecast_horizon=forecast_horizon,
n_splits=n_splits,
date_index="date",
)
X = pd.DataFrame({"features": range(15)})
# Each split would have 15 // 5 = 3 data points. However, this is smaller than the number of data_points required
Contributor

I think this comment needs to be updated since not every split is now using 3 data points.

Contributor Author

I'm just going to delete it. Don't think it's adding much value since it's obvious the values are incompatible given the expected behavior of the test.

TsParameterValidationResult - named tuple with four fields
is_valid (bool): True if parameters are valid.
msg (str): Contains error message to display. Empty if is_valid.
smallest_split_size (int): smallest split size given n_obs and n_splits.
Contributor

nit: Can we capitalize the first letters of these descriptions to get everything into the same format, ie Smallest split size ...

f"the smallest split would have {split_size} observations. "
f"Since {gap + max_delay + forecast_horizon} (gap + max_delay + forecast_horizon) > {split_size}, "
"then at least one of the splits would be empty by the time it reaches the pipeline. "
"Please use a smaller number of splits or collect more data."
Contributor

I like this error message because it explains why the values passed in will not work! Great work here!

Contributor

Should we change the end of the error message to include this?
"Please use a smaller number of splits, consider reducing one or more of these parameters, or collect more data."

Contributor Author

I'm on it!

)


def are_ts_parameters_valid_for_split(
Contributor

I know these are covered in other tests, but can we add some explicit tests for these in the test_utils file? Mainly because if we decide to change some of this in the future, it would be nice to ensure we have full test coverage for these methods.

Contributor Author

Word, adding some small tests now too!

@freddyaboulton freddyaboulton force-pushed the 3103-catch-incompatible-ts-parameters branch 2 times, most recently from 2b55ebf to fe6b0cc Compare December 3, 2021 16:06
Contributor
@eccabay eccabay left a comment

Good stuff!

class TimeSeriesParametersDataCheck(DataCheck):
"""Checks whether the time series parameters are compatible with data splitting.

If gap + max_delay + forecast_horizon > X.shape[0] // (n_splits + 1)
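The condition quoted above can be sketched numerically. This is a hypothetical re-implementation of the documented rule for illustration, not evalml's actual code:

```python
def ts_parameters_valid(gap, max_delay, forecast_horizon, n_obs, n_splits):
    # Each of the n_splits + 1 folds gets roughly n_obs // (n_splits + 1) rows;
    # the window consumed by gap + max_delay + forecast_horizon must fit
    # inside the smallest fold, otherwise one split ends up empty.
    smallest_split_size = n_obs // (n_splits + 1)
    return gap + max_delay + forecast_horizon <= smallest_split_size


# 100 rows, 3 splits: smallest split has 25 rows; a window of 1 + 2 + 1 = 4 fits.
assert ts_parameters_valid(gap=1, max_delay=2, forecast_horizon=1, n_obs=100, n_splits=3)
# 15 rows, 4 splits: smallest split has 3 rows; a window of 0 + 7 + 4 = 11 does not.
assert not ts_parameters_valid(gap=0, max_delay=7, forecast_horizon=4, n_obs=15, n_splits=4)
```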
Contributor

Nit: add markdown so this will display cleanly in our documentation


Args:
problem_configuration (dict): Dict containing problem_configuration parameters.
n_splits (int): Number of time series split.
Contributor

Nit: splits?

some feature and target engineering, e.g. delaying input features and shifting the target variable by the
desired amount. If the data that will be split already has all the features and appropriate target values,
then set max_delay and gap to 0.
The max_delay, gap, and forecast_horizon parameters are just used to validate that the requested split size
Contributor

mega-nit: "just"->"only"?

evalml/tests/automl_tests/test_automl.py (resolved thread)
"problem_configuration must be a dict containing values for at least the date_index, gap, max_delay, "
f"and forecast_horizon parameters. Received {problem_configuration}."
)
return not (msg), msg
Contributor
super smooth, love this logic!
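The idiom being praised is `return not (msg), msg`: an empty error message is falsy, so the validity flag falls out of the message itself. A minimal standalone sketch of the pattern (the function name is illustrative):

```python
def validate_problem_configuration(problem_configuration):
    msg = ""
    required = {"date_index", "gap", "max_delay", "forecast_horizon"}
    if not isinstance(problem_configuration, dict) or not required.issubset(problem_configuration):
        msg = (
            "problem_configuration must be a dict containing values for at least the "
            f"date_index, gap, max_delay, and forecast_horizon parameters. "
            f"Received {problem_configuration}."
        )
    # An empty msg is falsy, so `not msg` doubles as the is_valid flag.
    return not msg, msg
```

This keeps the validity flag and its explanation from ever drifting out of sync, since both derive from the single `msg` value.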

Contributor
@ParthivNaresh ParthivNaresh left a comment

Looks great, I love how you've split up the logic for the parameters


@freddyaboulton freddyaboulton force-pushed the 3103-catch-incompatible-ts-parameters branch from eda9526 to e76074d Compare December 6, 2021 17:03
@freddyaboulton freddyaboulton merged commit b9d4cb6 into main Dec 6, 2021
@freddyaboulton freddyaboulton deleted the 3103-catch-incompatible-ts-parameters branch December 6, 2021 17:54
@chukarsten chukarsten mentioned this pull request Dec 9, 2021
Labels: none yet
Projects: none yet
5 participants