Improve time series data splitting #3616
Conversation
Codecov Report
```
@@           Coverage Diff           @@
##            main   #3616    +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%
=======================================
  Files        335     335
  Lines      33504   33508      +4
=======================================
+ Hits       33382   33386      +4
  Misses       122     122
```
Continue to review full report at Codecov.
OK, just some minor nits. I think I get how this is working... it seems like a super minor change (passing the forecast horizon into the TSTimeSplit) had major results. Do we need to update the docs at all?
```diff
         self.time_index = time_index
         self.n_splits = n_splits
-        self._splitter = SkTimeSeriesSplit(n_splits=n_splits)
+        self._splitter = SkTimeSeriesSplit(
+            n_splits=n_splits, test_size=forecast_horizon or None
+        )
```
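The `or None` only matters when `forecast_horizon` is falsy (`None` or `0`); any positive int is passed through unchanged. A minimal sketch of the effect, calling scikit-learn's `TimeSeriesSplit` directly (the class this code imports under the `SkTimeSeriesSplit` alias):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)  # 20 synthetic time steps

# A positive horizon pins every test fold to exactly that many rows.
pinned = TimeSeriesSplit(n_splits=3, test_size=2 or None)
pinned_sizes = [len(test) for _, test in pinned.split(X)]

# A falsy horizon (None or 0) collapses to test_size=None, i.e.
# scikit-learn's default of n_samples // (n_splits + 1) = 20 // 4 = 5.
defaulted = TimeSeriesSplit(n_splits=3, test_size=None)
default_sizes = [len(test) for _, test in defaulted.split(X)]
```

So `forecast_horizon or None` is a no-op for valid positive horizons; it only rewrites `0` to `None`, which is the behavior the reviewer below flags as redundant.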
lol, what's going on with forecast_horizon? The docstring says int, but here it's None. Can't we just pass forecast_horizon here? The or None seems a little redundant, since it won't do anything to catch non-integer inputs.
Yeah, this was an oversight on my part. The forecast horizon defaulting to 1 caused issues, so this was a workaround, but it's not actually necessary after updating the default value. I'll fix this.
```diff
@@ -23,7 +23,7 @@ def test_time_series_split_init():

 @pytest.mark.parametrize(
     "gap,max_delay,forecast_horizon,n_splits",
-    [[7, 3, 1, 4], [0, 3, 2, 3], [1, 1, 1, 4]],
+    [[7, 3, 1, 5], [0, 8, 2, 3], [5, 4, 2, 4]],
```
I'm not sure I can read these changes and know if they're correct.
Should I refactor this to make it clearer?
Looks good to me - some suggestions for improvements and some clarification questions. @eccabay and I spoke offline about the changes to the forecast horizon in the tests. The root cause is that a forecast_horizon of 1 errors out when scoring AUC, since AUC requires more than one class. The proposed solution is to remove AUC as a default objective for TS classification, and @eccabay will file an issue to track it.
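To illustrate the root cause: with a forecast horizon of 1, each validation fold holds a single row, so `y_true` can only ever contain one class, and scikit-learn's `roc_auc_score` (which backs the AUC objective) raises rather than returning a score. A minimal reproduction with made-up values:

```python
from sklearn.metrics import roc_auc_score

# A one-row test fold can only ever contain a single class.
y_true = [1]
y_pred_proba = [0.7]

try:
    roc_auc_score(y_true, y_pred_proba)
    auc_failed = False
except ValueError:
    # sklearn: "Only one class present in y_true. ROC AUC score
    # is not defined in that case."
    auc_failed = True

# With at least two rows the fold can contain both classes and AUC works.
two_class_auc = roc_auc_score([0, 1], [0.2, 0.8])
```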
```diff
         time_index=None,
         n_splits=3,
     ):
         self.max_delay = max_delay
         self.gap = gap
-        self.forecast_horizon = forecast_horizon
+        self.forecast_horizon = forecast_horizon if forecast_horizon else 1
```
Would we want 1 if forecast_horizon isn't passed? Or is there a scenario where forecast_horizon isn't passed? If there isn't, I would argue for making horizon a required parameter, or for selecting a better default than 1 by looking at the time frequency of the passed-in data. Could be a low-priority followup issue, since we shouldn't be using it without forecast_horizon anyways!
This situation is a little tricky, but I think we should keep forecast_horizon as an optional parameter. Previously, the test size was None for scikit's TimeSeriesSplit, so it defaulted to n_samples // (n_splits + 1). The only reason forecast_horizon was used in this function was to validate that the split size was large enough. Notice here that self.forecast_horizon gets set to 1 if forecast_horizon is None. However, forecast_horizon, and not self.forecast_horizon, is what's passed in to the SkTimeSeriesSplit. This way, if no forecast horizon is passed in, we maintain our historical behavior and only update the split size when a forecast horizon is explicitly set. Does that make sense?
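A small sketch of the two behaviors being contrasted, using scikit-learn's `TimeSeriesSplit` directly on a toy 8-row series (the data and sizes are illustrative, not from the PR):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(8).reshape(-1, 1)  # 8 synthetic time steps

# Historical behavior: test_size=None, so each test fold gets
# n_samples // (n_splits + 1) = 8 // 4 = 2 rows.
historical = [
    (list(tr), list(te)) for tr, te in TimeSeriesSplit(n_splits=3).split(X)
]

# Explicit horizon: test_size=1 shrinks every test fold to one row,
# leaving more history in each training window.
explicit = [
    (list(tr), list(te))
    for tr, te in TimeSeriesSplit(n_splits=3, test_size=1).split(X)
]
```

Passing `forecast_horizon` (which may be None) rather than `self.forecast_horizon` (which is never None) is exactly what keeps the first branch reachable.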
Closes #3287