Add dates_needed_for_prediction to TimeSeriesPipelineBase#3906
Conversation
e45c38a to
b58d431
Compare
Codecov Report
@@ Coverage Diff @@
## main #3906 +/- ##
=======================================
+ Coverage 99.7% 99.7% +0.1%
=======================================
Files 346 346
Lines 36669 36707 +38
=======================================
+ Hits 36536 36574 +38
Misses 133 133
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. |
| self.forecast_horizon | ||
| + self.max_delay # include start delay for featurization | ||
| + self.gap # add first gap for the actual gap from the end date | ||
| + self.gap, # add another gap to ensure training data is greater than gap |
There was a problem hiding this comment.
We have a limitation s.t we require the training data (using the dates from this method) to be larger than the gap. Thus, I'm directly adding the gap here so that we ensure our training data is always long enough. This can be optimized further by the dates needed to push the length of the training data over the gap but decided to err on the side of caution for the first pass. I believe its safer to ask the user for more data unless we know that it is hard for the user to provide us such data.
There was a problem hiding this comment.
A frequency of 'MS' (month start) here would be interpreted as milliseconds, right? https://numpy.org/doc/stable/reference/arrays.datetime.html#datetime-units
There was a problem hiding this comment.
Yeah, I think Frank is right here in that MS would be milliseconds. From my understanding, I think numpy defaults to milliseconds while pandas maps to month start for this.
There was a problem hiding this comment.
ah gotcha - I'll change to using a datetime timedelta!
| ts_data, | ||
| time_series_regression_pipeline_class, | ||
| ): | ||
| X, X_t, y = ts_data() |
There was a problem hiding this comment.
Based on Frank's comment above, it'd probably be a good idea to test a few different frequencies here!
| self.forecast_horizon | ||
| + self.max_delay # include start delay for featurization | ||
| + self.gap # add first gap for the actual gap from the end date | ||
| + self.gap, # add another gap to ensure training data is greater than gap |
There was a problem hiding this comment.
Yeah, I think Frank is right here in that MS would be milliseconds. From my understanding, I think numpy defaults to milliseconds while pandas maps to month start for this.
| + self.max_delay # include start delay for featurization | ||
| + self.gap # add first gap for the actual gap from the end date | ||
| + self.gap | ||
| + 1 # for the + 1 in the time series featurizer |
There was a problem hiding this comment.
:-/ Is it safe to assume that all pipelines will have the time series featurizer?
| dates = pd.date_range( | ||
| beginning_date, | ||
| end_date, | ||
| freq=pipeline.frequency.split("-")[0], | ||
| ) | ||
| X_train = pd.DataFrame(index=[i + 1 for i in range(len(dates))]) | ||
| feature = pd.Series([i + 1 for i in range(len(dates))], index=X_train.index) | ||
|
|
||
| X_train["feature"] = pd.Series(feature.values, index=X_train.index) | ||
| X_train["date"] = pd.Series(dates.values, index=X_train.index) | ||
| y_train = pd.Series(X_train["feature"].values, index=X_train.index) | ||
|
|
||
| X_test = pd.DataFrame({"feature": [54], "date": [prediction_date]}) | ||
|
|
||
| assert X_train.shape[0] == len(dates) | ||
| assert not X_train.isnull().any().any() | ||
| assert len(y_train) == X_train.shape[0] | ||
| assert X_test.shape[0] == 1 | ||
|
|
||
| X_train.ww.init() | ||
| X_test.ww.init() | ||
|
|
||
| X_train.ww.set_time_index("date") | ||
| X_test.ww.set_time_index("date") | ||
|
|
||
| _ = pipeline.predict(X_test, X_train=X_train, y_train=y_train).all() | ||
| assert not mock_predict.call_args[0][0].empty |
There was a problem hiding this comment.
Can you explain why we need all this? I would think we might want to keep the test a little more minimalist rather than do a predict. Like don't we just want to test the dates_needed_for_prediction function itself?
There was a problem hiding this comment.
During implementation I created date ranges that were mathematically correct but didn't work entirely with our featurization so added this last part to ensure!
| assert beginning_date <= end_date | ||
| assert end_date < prediction_date | ||
| date_diff = pipeline.forecast_horizon + pipeline.max_delay + pipeline.gap | ||
| assert end_date == beginning_date + pd.tseries.frequencies.to_offset( | ||
| f"{date_diff}{pipeline.frequency}", | ||
| ) |
There was a problem hiding this comment.
Do we want to reparametrize this so that the pytest parameters contain the answers to the incoming inputs and just test them against each other?
There was a problem hiding this comment.
I think it'll get pretty messy including the answers with the different combinations of parameters and frequencies above. I could do so but I would need to narrow down the scope of combinations.
…valml into js_dates_needed_for_prediction
Adds
dates_needed_for_predictionfor time series pipelines. Erring on the side of caution and asking for a longer date period and will optimize further if necessary.