Add DataCheck to validate time series problem configuration parameters #3111
Conversation
@@ -112,6 +117,7 @@ def test_make_data_splitter_default(problem_type, large_data):
    assert data_splitter.n_splits == 3
    assert data_splitter.gap == 1
    assert data_splitter.max_delay == 7
    assert data_splitter.forecast_horizon == 4
To make sure the data splitter gets the right forecast horizon value.
)

class TimeSeriesParametersDataCheck(DataCheck):
Should this be a part of default data checks?
The reason I'm not adding it is that it would require changing the init API of the default data checks, and some of the use cases we've seen manually create their own DataChecks class from a list of individual DataCheck instances. So if they wanted to use this check, they would just need to add it to their list.
I think it could be a good idea to add it, but it's not necessary in this PR. Let's file an issue for it?
Issue here: #3125
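The composition pattern this thread describes, a runner built from a user-supplied list of individual check instances, can be sketched in plain Python. All class names below are illustrative stand-ins, not evalml's actual API:

```python
class DataCheck:
    """Minimal base class: each check validates X and returns a list of messages."""
    def validate(self, X):
        raise NotImplementedError

class TimeSeriesParametersCheck(DataCheck):
    """Flags configurations where the smallest split would be empty."""
    def __init__(self, gap, max_delay, forecast_horizon, n_splits):
        self.window = gap + max_delay + forecast_horizon
        self.n_splits = n_splits

    def validate(self, X):
        # Assumption: smallest split size is n_obs // (n_splits + 1).
        split_size = len(X) // (self.n_splits + 1)
        if self.window > split_size:
            return [f"window {self.window} exceeds smallest split size {split_size}"]
        return []

class DataChecks:
    """Runs every check in a user-supplied list, as the comment describes."""
    def __init__(self, checks):
        self.checks = checks

    def validate(self, X):
        messages = []
        for check in self.checks:
            messages.extend(check.validate(X))
        return messages

# Users who want the new check just add it to their list:
checks = DataChecks(
    [TimeSeriesParametersCheck(gap=1, max_delay=7, forecast_horizon=4, n_splits=3)]
)
print(checks.validate(range(100)))
```

With 100 observations the window of 12 fits inside each split of 25 rows, so no messages come back; shrinking the data would trigger the message.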
Codecov Report
@@ Coverage Diff @@
## main #3111 +/- ##
=======================================
+ Coverage 99.0% 99.8% +0.8%
=======================================
Files 313 315 +2
Lines 30603 30664 +61
=======================================
+ Hits 30281 30573 +292
+ Misses 322 91 -231
Continue to review full report at Codecov.
Looks great to me! I left a comment about adding some tests, but mostly left small nitpicks otherwise.
gap (int): Gap used in time series problem. Time series pipelines shift the target variable by gap rows. Defaults to 0.
gap (int): Number of time units separating the data used to generate features and the data to forecast on.
    Defaults to 0.
forecast_horizon (int): Number of time units to forecast. Defaults to 0.
Defaults to 1*
    "then at least one of the splits would be empty by the time it reaches the pipeline. "
    "Please use a smaller number of splits or collect more data."
)
result = are_ts_parameters_valid_for_split(
Nice! I like this value assignment
Args:
    X (pd.DataFrame, np.ndarray): Features.
    y (pd.Series, np.ndarray): Ignored. Defaults to None.
Nit, but there seems to be an extra space between "Ignored." and "Defaults to"?
    forecast_horizon=forecast_horizon,
    n_splits=n_splits,
    date_index="date",
)
X = pd.DataFrame({"features": range(15)})
# Each split would have 15 // 5 = 3 data points. However, this is smaller than the number of data_points required
I think this comment needs to be updated since not every split is now using 3 data points.
I'm just going to delete it. Don't think it's adding much value since it's obvious the values are incompatible given the expected behavior of the test.
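The incompatibility this test exercises reduces to simple arithmetic. A pure-Python sketch, assuming the check computes the smallest split size as n_obs // (n_splits + 1); the gap, max_delay, and forecast_horizon values below are hypothetical, not the test's actual parameters:

```python
# Smallest fold size when TimeSeriesSplit carves n_obs rows into n_splits splits
# (assumption: n_splits + 1 folds of roughly equal size).
def smallest_split_size(n_obs, n_splits):
    return n_obs // (n_splits + 1)

# Rows consumed by feature delays, the gap, and the forecast itself.
def window_size(gap, max_delay, forecast_horizon):
    return gap + max_delay + forecast_horizon

n_obs, n_splits = 15, 4                        # 15 rows, 4 splits -> 5 folds
split = smallest_split_size(n_obs, n_splits)   # 15 // 5 = 3
window = window_size(gap=2, max_delay=1, forecast_horizon=1)  # hypothetical values

# The configuration is invalid when the window swallows the smallest split.
print(split, window, window > split)  # 3 4 True
```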
evalml/utils/gen_utils.py
Outdated
TsParameterValidationResult - named tuple with four fields
    is_valid (bool): True if parameters are valid.
    msg (str): Contains error message to display. Empty if is_valid.
    smallest_split_size (int): smallest split size given n_obs and n_splits.
nit: Can we capitalize the first letters of these descriptions to get everything into the same format, ie Smallest split size ...
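The four-field result described in this docstring could be modeled with a namedtuple. This is only a sketch: the quoted snippet lists three of the four fields, so the fourth field's name (max_window_size) is an assumption, and the real definition in evalml/utils/gen_utils.py may differ:

```python
from collections import namedtuple

# Sketch of the result type the docstring describes; the fourth field name
# is assumed, since only three fields are shown in the quoted snippet.
TsParameterValidationResult = namedtuple(
    "TsParameterValidationResult",
    ["is_valid", "msg", "smallest_split_size", "max_window_size"],
)

result = TsParameterValidationResult(
    is_valid=False,
    msg="gap + max_delay + forecast_horizon exceeds smallest split size.",
    smallest_split_size=3,
    max_window_size=4,  # hypothetical: gap + max_delay + forecast_horizon
)
print(result.is_valid, result.smallest_split_size)
```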
evalml/utils/gen_utils.py
Outdated
f"the smallest split would have {split_size} observations. "
f"Since {gap + max_delay + forecast_horizon} (gap + max_delay + forecast_horizon) > {split_size}, "
"then at least one of the splits would be empty by the time it reaches the pipeline. "
"Please use a smaller number of splits or collect more data."
I like this error message because it explains why the values passed in will not work! Great work here!
Should we change the end of the error message to include this?
"Please use a smaller number of splits, consider reducing one or more of these parameters, or collect more data."
I'm on it!
)


def are_ts_parameters_valid_for_split(
I know these are covered in other tests, but can we add some explicit tests for these in the test_utils file? Mainly because if we decide to change some stuff in the future, it would be nice to ensure that we have full test coverage for these methods.
Word, adding some small tests now too!
Force-pushed from 2b55ebf to fe6b0cc
Good stuff!
class TimeSeriesParametersDataCheck(DataCheck):
    """Checks whether the time series parameters are compatible with data splitting.

    If gap + max_delay + forecast_horizon > X.shape[0] // (n_splits + 1)
Nit: add markdown so this will display cleanly in our documentation
Args:
    problem_configuration (dict): Dict containing problem_configuration parameters.
    n_splits (int): Number of time series split.
Nit: splits?
some feature and target engineering, e.g delaying input features and shifting the target variable by the
desired amount. If the data that will be split already has all the features and appropriate target values, and
then set max_delay and gap to 0.
The max_delay, gap, and forecast_horizon parameters are just used to validate that the requested split size |
mega-nit: "just"->"only"?
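One plausible reading of the gap shift described in the quoted docstring, as a pure-Python sketch (the direction of the shift is an assumption, not confirmed by this thread):

```python
# With gap=2, the features at row i are paired with the target observed
# 2 time steps later; the last `gap` rows have no usable target.
target = [10, 11, 12, 13, 14]
gap = 2
shifted = target[gap:] + [None] * gap  # row i now holds the target at i + gap
print(shifted)  # [12, 13, 14, None, None]
```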
    "problem_configuration must be a dict containing values for at least the date_index, gap, max_delay, "
    f"and forecast_horizon parameters. Received {problem_configuration}."
)
return not (msg), msg
super smooth, love this logic!
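The pattern being praised here, building up an error string and deriving validity from whether it is empty, can be sketched as follows; validate_problem_configuration and REQUIRED_KEYS are hypothetical stand-ins for the real helper:

```python
# Hypothetical stand-in for the validation helper under review: accumulate an
# error message, then report validity as the message's emptiness.
REQUIRED_KEYS = {"date_index", "gap", "max_delay", "forecast_horizon"}

def validate_problem_configuration(problem_configuration=None):
    msg = ""
    if not problem_configuration or not REQUIRED_KEYS.issubset(problem_configuration):
        msg = (
            "problem_configuration must be a dict containing values for at least the "
            "date_index, gap, max_delay, and forecast_horizon parameters. "
            f"Received {problem_configuration}."
        )
    # An empty message means the configuration is valid.
    return not msg, msg

is_valid, msg = validate_problem_configuration(
    {"date_index": "date", "gap": 0, "max_delay": 7, "forecast_horizon": 4}
)
print(is_valid)  # True
```

The appeal of the `return not msg, msg` idiom is that validity can never drift out of sync with the message: the message is the single source of truth.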
Looks great, I love how you've split up the logic for the parameters
Force-pushed from eda9526 to e76074d
Pull Request Description
Fixes #3103 and also updates the time series data splitter to use the same logic as the data check to validate the parameters. This was something the splitter was doing before, but it was buggy given the addition of forecast_horizon.

After creating the pull request: in order to pass the release_notes_updated check you will need to update the "Future Release" section of docs/source/release_notes.rst to include this pull request by adding :pr:123.