Validate holdout dataset passed to predict/predict_proba for time series #2804

Merged: freddyaboulton merged 4 commits into main from 2732-validate-holdout-dataset-ts on Sep 20, 2021

freddyaboulton
Contributor

Pull Request Description

Fixes #2732


After creating the pull request: in order to pass the release_notes_updated check you will need to update the "Future Release" section of docs/source/release_notes.rst to include this pull request by adding :pr:123.

codecov bot commented Sep 17, 2021

Codecov Report

Merging #2804 (a17c87a) into main (ad797e1) will increase coverage by 0.1%.
The diff coverage is 100.0%.

❗ Current head a17c87a differs from pull request most recent head 7ade293. Consider uploading reports for the commit 7ade293 to get more accurate results

@@           Coverage Diff           @@
##            main   #2804     +/-   ##
=======================================
+ Coverage   99.8%   99.8%   +0.1%     
=======================================
  Files        297     297             
  Lines      27719   27754     +35     
=======================================
+ Hits       27651   27686     +35     
  Misses        68      68             
| Impacted Files | Coverage Δ |
|---|---|
| .../pipelines/time_series_classification_pipelines.py | 99.0% <100.0%> (+0.1%) ⬆️ |
| evalml/pipelines/time_series_pipeline_base.py | 99.1% <100.0%> (+0.1%) ⬆️ |
| evalml/tests/pipeline_tests/test_pipelines.py | 99.9% <100.0%> (+0.1%) ⬆️ |
| .../tests/pipeline_tests/test_time_series_pipeline.py | 99.8% <100.0%> (+0.1%) ⬆️ |
| ...tests/component_tests/test_polynomial_detrender.py | 100.0% <0.0%> (ø) |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update ad797e1...7ade293.

Contributor

@chukarsten chukarsten left a comment

Nice work! I think you just need to fix up a copy-pasta doc string and validate your double validation of the holdout sets and I think you're gucci!

X (pd.DataFrame): Data of shape [n_samples, n_features].
objective (Object or string): The objective to use to make predictions.
X_train (pd.DataFrame): Training data.
y_train (pd.Series): Training labels.
Contributor

Docstring doesn't match the function sig!

Contributor Author

Done!

y_holdout = self._create_empty_series(y_train)
X, y_holdout = self._convert_to_woodwork(X, y_holdout)
y_holdout = infer_feature_types(y_holdout)
self._validate_holdout_datasets(X, X_train)
Contributor

Did you intend to validate the holdout datasets twice?

Contributor Author

My mistake lol

},
)
X, y = ts_data
X_train, y_train = X.iloc[:15], y.iloc[:15]
Contributor

Might be good to put the 15 into a variable in case we come back to change it down the road.

Contributor Author

Done!

elif thing_thats_wrong == "not-separated-by-gap":
X = X.iloc[15 + gap + 2 : 15 + gap + 2 + forecast_horizon]
else:
X = X.iloc[15 + gap + 2 : 15 + gap + 2 + forecast_horizon + 1]
Contributor

Remind me again what the right length looks like here? What subsetting of X passes the test?

Contributor Author

Great question. X.iloc[15 + gap: 15 + gap + forecast_horizon]
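
For anyone reading this thread later, here is a rough sketch of what "right length, separated by gap" means for the slices above. It is not the code from this PR: `check_holdout` is a hypothetical stand-in for `_validate_holdout_datasets`, it assumes a default integer index with one row per time step, and the values of `n_train`, `gap` and `forecast_horizon` are picked only for illustration.

```python
import pandas as pd


def check_holdout(X_train, X, gap, forecast_horizon):
    """Hypothetical check: raise ValueError unless X is a valid holdout for X_train.

    Assumes an integer index with one row per time step. This is an
    illustration only, not evalml's implementation.
    """
    # The holdout must contain exactly forecast_horizon rows.
    if len(X) != forecast_horizon:
        raise ValueError(
            f"Holdout must have {forecast_horizon} rows, got {len(X)}."
        )
    # The holdout must start gap rows after the end of the training data.
    expected_start = X_train.index[-1] + gap + 1
    if X.index[0] != expected_start:
        raise ValueError(
            f"Holdout must start at index {expected_start}, got {X.index[0]}."
        )


# Illustrative values; n_train plays the role of the hard-coded 15 in the test.
n_train, gap, forecast_horizon = 15, 2, 3
X = pd.DataFrame({"feature": range(30)})
X_train = X.iloc[:n_train]

# The passing subset from the reply above.
good = X.iloc[n_train + gap : n_train + gap + forecast_horizon]
check_holdout(X_train, good, gap, forecast_horizon)  # no exception

# The two failing subsets used in the test.
not_separated = X.iloc[n_train + gap + 2 : n_train + gap + 2 + forecast_horizon]
wrong_length = X.iloc[n_train + gap + 2 : n_train + gap + 2 + forecast_horizon + 1]
for bad in (not_separated, wrong_length):
    try:
        check_holdout(X_train, bad, gap, forecast_horizon)
    except ValueError as err:
        print(err)
```

Naming the training size once (`n_train`) keeps the three slices in sync if the split point changes.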

Contributor

@chukarsten chukarsten left a comment

Thanks for making the changes! Looks solid.

@freddyaboulton freddyaboulton merged commit af994b3 into main Sep 20, 2021
@freddyaboulton freddyaboulton deleted the 2732-validate-holdout-dataset-ts branch September 20, 2021 20:12
@chukarsten chukarsten mentioned this pull request Oct 1, 2021
Development

Successfully merging this pull request may close these issues.

Validate holdout dataset passed to time series predict and predict_proba
2 participants