Better error msg for time series predict + tests #3579

Merged: 16 commits from 3443-Error-TimeSeries-Predict-y into main on Jun 23, 2022

Conversation

MichaelFu512 (Contributor)

Pull Request Description

Closes #3443


After creating the pull request: in order to pass the release_notes_updated check you will need to update the "Future Release" section of docs/source/release_notes.rst to include this pull request by adding :pr:123.

codecov bot commented Jun 21, 2022

Codecov Report

Merging #3579 (14a3ab9) into main (204c849) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##            main   #3579     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%     
=======================================
  Files        335     335             
  Lines      33350   33375     +25     
=======================================
+ Hits       33221   33246     +25     
  Misses       129     129             
Impacted Files Coverage Δ
evalml/pipelines/time_series_pipeline_base.py 100.0% <100.0%> (ø)
.../tests/pipeline_tests/test_time_series_pipeline.py 99.9% <100.0%> (+0.1%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@eccabay (Contributor) left a comment:

This is looking great! I just left a few suggestions on making this code as robust as it can be.

@@ -6,6 +6,7 @@ Release Notes
* Updated the Imputer and SimpleImputer to work with scikit-learn 1.1.1. :pr:`3525`
* Bumped the minimum versions of scikit-learn to 1.1.1 and imbalanced-learn to 0.9.1. :pr:`3525`
* Added a clearer error message when ``describe`` is called on an un-instantiated ComponentGraph :pr:`3569`
* Added a clear error message when ``predict`` is called without a y_train parameter :pr:`3579`
Contributor:

This should mention that the change is only for time series problems; predict should run just fine without y_train for other problem types!

Contributor:

Also, this should mention the error message for X_train as well, since we now have a clearer error message in that case too!

Contributor:

@eccabay your devotion to clear and thorough release notes is appreciated


        Returns:
            Predictions.
        """
        X_train, y_train = self._convert_to_woodwork(X_train, y_train)
        try:
            print("HELLO")
Contributor:

We don't need this line!

Contributor Author:

oh shoot sorry about that!

Comment on lines 193 to 195
            X_train, y_train = self._convert_to_woodwork(X_train, y_train)
        except AttributeError:
            raise ValueError(
Contributor:

While this does work to catch the correct bug, there are definitely easier/more Pythonic ways to do this. I would also recommend doing separate checks for missing X_train and y_train, so the user has more information about their specific error!
The easiest way to do these checks would be to check whether X_train and/or y_train are None.
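A minimal sketch of what those separate checks could look like (placement and exact messages here are illustrative, not the final implementation):

def predict(self, X, objective=None, X_train=None, y_train=None):
    """Sketch only; see the real method for the full docstring."""
    # Separate checks let the message name the specific missing argument.
    if X_train is None:
        raise ValueError("Make sure to have a value for X_train when calling predict")
    if y_train is None:
        raise ValueError("Make sure to have a value for y_train when calling predict")
    X_train, y_train = self._convert_to_woodwork(X_train, y_train)
    # ... rest of the existing predict logic continues here.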

Contributor:

This is a great callout @eccabay. As you start to get into our larger products, you'll notice that having specific, clear error messages is incredibly beneficial when EvalML consumers are handling these errors.

        ValueError,
        match="Make sure to have a value for both X_train and y_train when calling predict",
    ):
        clf.predict(X_train)
Contributor:

If you're throwing the error when either X_train or y_train is missing, it's a good idea to cover both cases in your test.
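For example, something along these lines (a rough sketch: the fixtures are the ones used in this file, but the pipeline parameters and match strings are assumptions):

import pytest

def test_predict_errors_without_X_train_or_y_train(
    time_series_binary_classification_pipeline_class, ts_data_binary
):
    X, y = ts_data_binary
    clf = time_series_binary_classification_pipeline_class(
        parameters={
            "Logistic Regression Classifier": {"n_jobs": 1},
            # Assumed time series settings; the real test may configure these differently.
            "pipeline": {"gap": 0, "max_delay": 0, "time_index": "date", "forecast_horizon": 1},
        },
    )
    # Missing X_train: only y_train is supplied.
    with pytest.raises(ValueError, match="X_train"):
        clf.predict(X, y_train=y)
    # Missing y_train: only X_train is supplied.
    with pytest.raises(ValueError, match="y_train"):
        clf.predict(X, X_train=X)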

Contributor Author:

It looks like if X_train is None, the convert_to_woodwork() function makes X_train a pd.DataFrame() and infers its type from there. It doesn't do the same for y_train, which is why this initial error occurs.

So probably, instead of checking both X_train and y_train, I just need to check y_train, since if X_train is None the code still runs (though it does make X_train a blank DataFrame).
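Roughly, the behavior being described looks something like this (an approximation for illustration only, not the actual evalml helper):

import pandas as pd

from evalml.utils import infer_feature_types  # evalml's woodwork type-inference utility

def _convert_to_woodwork_approximation(X_train, y_train):
    # A missing X_train is quietly replaced with an empty DataFrame, so the
    # conversion still succeeds (X_train just ends up as a blank frame).
    if X_train is None:
        X_train = pd.DataFrame()
    X_train = infer_feature_types(X_train)
    # y_train gets no such default, so a None here is what ultimately surfaces
    # as the AttributeError described above.
    y_train = infer_feature_types(y_train)
    return X_train, y_train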

Contributor:

That's very fair! For the overall predict function (outside of just convert_to_woodwork), we do still need X_train to not be None. I think it would be better to simply check to make sure neither X_train nor y_train is None instead of depending on convert_to_woodwork() failing. It's more future-proof, too, since someday we may update that sub-function to do the same "convert to empty series" behavior for y_train.


        Returns:
            Predictions.
        """
        X_train, y_train = self._convert_to_woodwork(X_train, y_train)
        try:
            print("HELLO")
Collaborator:

need to remove this print

Contributor:

I am glad to see that someone else does this as well

@@ -183,12 +183,18 @@ def predict(self, X, objective=None, X_train=None, y_train=None):
y_train (pd.Series or None): Training labels.
Collaborator:

@ParthivNaresh if we need X_train and y_train, should we still accept None as input? I don't have the context on why we allowed this before!

Contributor:

Yeah, I'm not sure why X_train and y_train are optional here; they're very much required.

Contributor:

Oh, I think it had to do with the fact that the parent PipelineBase has them as optional, because regular regression and classification problems don't need them.

        try:
            print("HELLO")
            X_train, y_train = self._convert_to_woodwork(X_train, y_train)
        except AttributeError:
Collaborator:

Think this is good, but we may need to reconsider allowing X_train or y_train as optional arguments. We can also consider just validating that X_train and y_train exist. Let's keep this for now and see what @ParthivNaresh has to say!

    X, y = ts_data_binary
    clf = time_series_binary_classification_pipeline_class(
        parameters={
            "Logistic Regression Classifier": {"n_jobs": 1},
Collaborator:

If all the parameters are the same for each of these pipelines, we can define the parameters up top and then reuse them to save some space.
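For instance (names here are just a suggestion):

# Defined once near the top of the test file...
PIPELINE_PARAMETERS = {
    "Logistic Regression Classifier": {"n_jobs": 1},
    "Random Forest Regressor": {"n_jobs": 1},
}

# ...then reused inside each test, e.g.:
# clf = time_series_binary_classification_pipeline_class(parameters=PIPELINE_PARAMETERS)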

@eccabay (Contributor) left a comment:

This is looking better! Since we've determined that both X_train and y_train should be required parameters (but we should probably keep the function signature the same as the other predict functions), we should shift this back to checking for both parameters, but separately this time!


Comment on lines 1084 to 1145
"Logistic Regression Classifier": {"n_jobs": 1},
"Random Forest Regressor": {"n_jobs": 1},
Contributor:

What's the reason for this (and the below) change?

Comment on lines 1063 to 1064
@pytest.fixture
def my_parameters():
Contributor:

Generally, if we're going to go through the effort of creating a pytest fixture, we try to use it in more than one test. I'd either put this at the top of the function instead of a fixture, or update the other tests that have the same parameters in this file to use the same fixture.

Also, my_parameters isn't a very useful name 😂. Could you name it something more problem-specific (and not first person)?
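One possible shape for this (the fixture name is just a suggestion, not from the PR):

import pytest

@pytest.fixture
def time_series_pipeline_parameters():
    # Descriptive, problem-specific name instead of my_parameters; intended to be
    # shared by every test in this file that builds pipelines with these settings.
    return {
        "Logistic Regression Classifier": {"n_jobs": 1},
        "Random Forest Regressor": {"n_jobs": 1},
    }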

@eccabay (Contributor) left a comment:

Awesome, Michael, thanks for making these updates! I just left a final few comments, but other than that this is good to go! 🚢

evalml/pipelines/time_series_pipeline_base.py (outdated comments, resolved)
evalml/tests/pipeline_tests/test_time_series_pipeline.py (outdated comments, resolved)
@jeremyliweishih (Collaborator) left a comment:

Sweet LGTM - just a style comment 😄


        Returns:
            Predictions.
        """
        if X_train is None:
            raise ValueError(
                "Make sure to have a non None value for X_train when calling time series' predict"
Collaborator:

nit: instead of "non None value", maybe just "to include an input for X_train". Double negatives can get a little confusing. Just a style thing, so up to you!

Contributor Author:

For sure!
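For reference, one possible rewording along those lines (illustrative; the final wording is the author's call):

if X_train is None:
    raise ValueError(
        "Make sure to include an input for X_train when calling time series' predict"
    )
if y_train is None:
    raise ValueError(
        "Make sure to include an input for y_train when calling time series' predict"
    )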

@MichaelFu512 MichaelFu512 merged commit 4f5f7a4 into main Jun 23, 2022
@MichaelFu512 MichaelFu512 deleted the 3443-Error-TimeSeries-Predict-y branch June 23, 2022 19:34
Development

Successfully merging this pull request may close these issues.

Provide a useful error when there is no y value passed to time series predict
5 participants