Add `dates_needed_for_prediction` to TimeSeriesPipelineBase #3906

Conversation
Codecov Report

```diff
@@           Coverage Diff           @@
##            main   #3906     +/-  ##
=======================================
+ Coverage    99.7%   99.7%   +0.1%
=======================================
  Files         346     346
  Lines       36669   36707     +38
=======================================
+ Hits        36536   36574     +38
  Misses        133     133
```
```python
self.forecast_horizon
+ self.max_delay  # include start delay for featurization
+ self.gap  # add first gap for the actual gap from the end date
+ self.gap,  # add another gap to ensure training data is greater than gap
```
We have a limitation such that we require the training data (using the dates from this method) to be larger than the gap. Thus, I'm directly adding the gap here so that we ensure our training data is always long enough. This could be optimized further by computing only the dates needed to push the length of the training data over the gap, but I decided to err on the side of caution for the first pass. I believe it's safer to ask the user for more data unless we know that it is hard for the user to provide such data.
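For context, a minimal sketch (not the actual evalml implementation) of how a window with this padding could be derived. The standalone function shape and the placement of the window relative to the prediction date are assumptions for illustration:

```python
import pandas as pd

def dates_needed_for_prediction(pipeline, prediction_date):
    """Illustrative sketch only; the real method lives on TimeSeriesPipelineBase."""
    # Window length in periods, mirroring the sum in the diff above.
    window = (
        pipeline.forecast_horizon
        + pipeline.max_delay  # start delay for featurization
        + pipeline.gap        # the actual gap from the end date
        + pipeline.gap        # extra gap so training data stays longer than the gap
    )
    # Assumption: the window ends one period before the prediction date.
    end_date = prediction_date - pd.tseries.frequencies.to_offset(f"1{pipeline.frequency}")
    beginning_date = end_date - pd.tseries.frequencies.to_offset(
        f"{window}{pipeline.frequency}",
    )
    return beginning_date, end_date
```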
A frequency of 'MS' (month start) here would be interpreted as milliseconds, right? https://numpy.org/doc/stable/reference/arrays.datetime.html#datetime-units
Yeah, I think Frank is right here in that `MS` would be milliseconds. From my understanding, numpy defaults to milliseconds while pandas maps this to month start.
ah gotcha - I'll change to using a datetime timedelta!
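For reference, a quick demonstration of the alias difference under discussion: pandas reads "MS" as month start, while numpy's case-sensitive datetime unit codes use "ms" for milliseconds (per the linked numpy doc):

```python
import numpy as np
import pandas as pd

# pandas: "MS" is the month-start frequency alias.
print(pd.date_range("2022-01-01", periods=3, freq="MS"))
# DatetimeIndex(['2022-01-01', '2022-02-01', '2022-03-01'], dtype='datetime64[ns]', freq='MS')

# numpy: datetime unit codes are case-sensitive, and "ms" means milliseconds.
print(np.timedelta64(1, "ms"))
# 1 milliseconds
```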
```python
    ts_data,
    time_series_regression_pipeline_class,
):
    X, X_t, y = ts_data()
```
Based on Frank's comment above, it'd probably be a good idea to test a few different frequencies here!
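A sketch of how that parametrization might look (the test name, alias choices, and the fixture's `freq` argument are assumptions, not from this PR):

```python
import pytest

@pytest.mark.parametrize("frequency", ["D", "W", "MS", "H"])
def test_dates_needed_for_prediction_frequencies(frequency, ts_data):
    # Assumes the ts_data fixture can generate data at a requested frequency.
    X, _, y = ts_data(freq=frequency)
    # ...the same window assertions as the daily-frequency case would follow.
```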
```python
+ self.max_delay  # include start delay for featurization
+ self.gap  # add first gap for the actual gap from the end date
+ self.gap
+ 1  # for the + 1 in the time series featurizer
```
:-/ Is it safe to assume that all pipelines will have the time series featurizer?
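If that assumption can't hold, one hypothetical guard (the component name and the `compute_order` attribute are assumptions for illustration) would be to pad only when the featurizer is actually present:

```python
# Hypothetical: only add the extra row when a time series featurizer is in the
# pipeline's component graph.
has_featurizer = any(
    "Time Series Featurizer" in name
    for name in pipeline.component_graph.compute_order
)
extra_padding = 1 if has_featurizer else 0
```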
```python
dates = pd.date_range(
    beginning_date,
    end_date,
    freq=pipeline.frequency.split("-")[0],
)
X_train = pd.DataFrame(index=[i + 1 for i in range(len(dates))])
feature = pd.Series([i + 1 for i in range(len(dates))], index=X_train.index)

X_train["feature"] = pd.Series(feature.values, index=X_train.index)
X_train["date"] = pd.Series(dates.values, index=X_train.index)
y_train = pd.Series(X_train["feature"].values, index=X_train.index)

X_test = pd.DataFrame({"feature": [54], "date": [prediction_date]})

assert X_train.shape[0] == len(dates)
assert not X_train.isnull().any().any()
assert len(y_train) == X_train.shape[0]
assert X_test.shape[0] == 1

X_train.ww.init()
X_test.ww.init()

X_train.ww.set_time_index("date")
X_test.ww.set_time_index("date")

_ = pipeline.predict(X_test, X_train=X_train, y_train=y_train).all()
assert not mock_predict.call_args[0][0].empty
```
Can you explain why we need all this? I would think we might want to keep the test a little more minimalist rather than do a predict. Like, don't we just want to test the `dates_needed_for_prediction` function itself?
During implementation I created date ranges that were mathematically correct but didn't work entirely with our featurization, so I added this last part to ensure the dates work end to end!
```python
assert beginning_date <= end_date
assert end_date < prediction_date
date_diff = pipeline.forecast_horizon + pipeline.max_delay + pipeline.gap
assert end_date == beginning_date + pd.tseries.frequencies.to_offset(
    f"{date_diff}{pipeline.frequency}",
)
```
Do we want to reparametrize this so that the `pytest` parameters contain the answers to the incoming inputs, and just test them against each other?
I think it'll get pretty messy including the answers with the different combinations of parameters and frequencies above. I could do so, but I would need to narrow down the scope of the combinations.
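For concreteness, a skeleton of that style (every value below is an illustrative placeholder, not a real expectation from this PR):

```python
import pytest

@pytest.mark.parametrize(
    "gap,max_delay,forecast_horizon,frequency,expected_beginning,expected_end",
    [
        # Placeholder answers for illustration only.
        (0, 1, 1, "D", "2022-03-08", "2022-03-10"),
        (1, 2, 1, "D", "2022-03-04", "2022-03-09"),
    ],
)
def test_dates_needed_for_prediction_answers(
    gap, max_delay, forecast_horizon, frequency, expected_beginning, expected_end
):
    # Build the pipeline under test here, then compare directly, e.g.:
    # beginning, end = pipeline.dates_needed_for_prediction(prediction_date)
    # assert beginning == pd.Timestamp(expected_beginning)
    # assert end == pd.Timestamp(expected_end)
    ...
```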
…valml into js_dates_needed_for_prediction
Adds `dates_needed_for_prediction` for time series pipelines. Erring on the side of caution and asking for a longer date period; we'll optimize further if necessary.
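A possible usage sketch, with the `(beginning_date, end_date)` return shape inferred from the tests above (treat the exact call signature as an assumption):

```python
import pandas as pd

# Ask the pipeline which historical dates are required to predict on a date.
prediction_date = pd.Timestamp("2022-03-15")
beginning_date, end_date = pipeline.dates_needed_for_prediction(prediction_date)

# Slice the full history down to just the window the pipeline needs.
mask = (X["date"] >= beginning_date) & (X["date"] <= end_date)
X_train, y_train = X[mask], y[mask]
preds = pipeline.predict(X_test, X_train=X_train, y_train=y_train)
```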