Improve time series data splitting#3616
Conversation
Codecov Report
@@ Coverage Diff @@
## main #3616 +/- ##
=======================================
+ Coverage 99.7% 99.7% +0.1%
=======================================
Files 335 335
Lines 33504 33508 +4
=======================================
+ Hits 33382 33386 +4
Misses 122 122
Continue to review full report at Codecov.
|
chukarsten
left a comment
There was a problem hiding this comment.
OK, just some minor nits. I think I get how this is working...it seems like a super minor change made, passing the forecast horizon into the TSTimeSplit had major results. Do we need to update the docs at all?
| self.n_splits = n_splits | ||
| self._splitter = SkTimeSeriesSplit(n_splits=n_splits) | ||
| self._splitter = SkTimeSeriesSplit( | ||
| n_splits=n_splits, test_size=forecast_horizon or None |
There was a problem hiding this comment.
lol what's going on with forecast_horizon? docstring says int, here it's None. CAn't we just pass forecast_horizon here? Seems the or None is a little redundant as it won't do anything to catch non-integer inputs
There was a problem hiding this comment.
Yeah, this was an oversight on my part. There was an issue with the forecast horizon defaulting to 1 caused issues so this was a workaround, but it's not actually necessary after updating the default value. I'll fix this.
| @pytest.mark.parametrize( | ||
| "gap,max_delay,forecast_horizon,n_splits", | ||
| [[7, 3, 1, 4], [0, 3, 2, 3], [1, 1, 1, 4]], | ||
| [[7, 3, 1, 5], [0, 8, 2, 3], [5, 4, 2, 4]], |
There was a problem hiding this comment.
I'm not sure I can read these changes and know if they're correct.
There was a problem hiding this comment.
Should I refactor this to make it clearerer?
jeremyliweishih
left a comment
There was a problem hiding this comment.
Looks good to me - some suggestions for improvements and some clarification questions. @eccabay and I spoke offline about the changes to the forecast horizon in the tests with the root cause being having a forecast_horizon of 1 errors out on scoring for AUC since it requires more than 1 class. The proposed solution is to remove AUC as a default objective for TS classification and @eccabay will file an issue to track.
| self.max_delay = max_delay | ||
| self.gap = gap | ||
| self.forecast_horizon = forecast_horizon | ||
| self.forecast_horizon = forecast_horizon if forecast_horizon else 1 |
There was a problem hiding this comment.
would we want 1 if forecast_horizon isn't passed? or is there a scenario where forecast_horizon isn't passed? If there isn't I would argue to make horizon a required parameter or we select a better default option other than 1 by looking at the time frequency for the passed in data. Could be a low priority followup issue since we shouldn't be using it without forecast_horizon anyways!
There was a problem hiding this comment.
This situation is a little tricky, but I think we should keep forecast_horizon as an optional parameter.
Previously, the test size was None for scikit's TimeSeriesSplit, so it defaulted to n_samples // (n_splits + 1). The only reason forecast_horizon was used in this function was to validate the split size was large enough.
Notice here that self.forecast_horizon gets set to 1 if forecast_horizon is None. However, forecast_horizon and not self.forecast_horizon is what's passed in to the SkTimeSeriesSplit. This way, if no forecast horizon is passed in, we maintain our historical behavior and only update the split size when a forecast horizon is explicitly set. Does that make sense?
Closes #3287