Update pipeline score method to return nan for any objective which errors #787

Merged
merged 4 commits into master from ds_786_score_errors on May 21, 2020

Conversation

@dsherry dsherry commented May 21, 2020

Changes:

codecov bot commented May 21, 2020

Codecov Report

Merging #787 into master will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #787      +/-   ##
==========================================
+ Coverage   99.40%   99.42%   +0.01%     
==========================================
  Files         150      150              
  Lines        5588     5709     +121     
==========================================
+ Hits         5555     5676     +121     
  Misses         33       33              
Impacted Files Coverage Δ
evalml/pipelines/binary_classification_pipeline.py 100.00% <ø> (ø)
...ml/pipelines/multiclass_classification_pipeline.py 100.00% <ø> (ø)
evalml/pipelines/classification_pipeline.py 100.00% <100.00%> (ø)
evalml/pipelines/pipeline_base.py 100.00% <100.00%> (ø)
evalml/pipelines/regression_pipeline.py 100.00% <100.00%> (ø)
evalml/tests/automl_tests/test_autobase.py 100.00% <100.00%> (ø)
evalml/tests/pipeline_tests/test_pipelines.py 99.73% <100.00%> (+0.09%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

kmax12 commented May 21, 2020

@dsherry here are some thoughts on how we could think about raise_errors in the scoring situation. Curious what you think.

The pipeline object should always raise errors when they occur. If we start silencing errors at the pipeline level, it becomes harder to integrate the pipelines into other applications. The application using the pipeline (in this case automl search) should be doing any error handling.

Then, it's a question of whether the pipeline should literally raise an error or just return nan. I can see arguments both ways. On one hand, a pipeline score could be nan for many reasons, and a proper error is an opportunity to say why. On the other hand, if we're scoring many objectives and only some error, how do we both return a score and raise the error? I'm leaning towards just returning nans when you do Pipeline.score, but raising errors when you do objective.score. I think this is how we have it set up in this PR, but want to confirm the rationale.
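To make the split concrete, here is a minimal sketch of that behavior (a hypothetical helper, not the actual evalml implementation): the per-objective scoring is still allowed to raise, and only the pipeline-level loop records nan.

import numpy as np

def score_objectives(pipeline, X, y, objectives):
    """Sketch: objective.score may raise; the pipeline-level loop records nan instead."""
    scores = {}
    predictions = pipeline.predict(X)
    for objective in objectives:
        try:
            # argument order here is illustrative only
            scores[objective.name] = objective.score(y, predictions)
        except Exception:
            # swallow the per-objective failure so the other objectives still get scored
            scores[objective.name] = np.nan
    return scores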

At the automl layer, when does a user want errors to arise? I think if the primary objective scores but an additional objective fails, I would only consider that an error in the strictest of situations. So, I say we can focus on cases where the primary objective fails. Here are the cases that come to mind (sketched in code after the list):

  1. In the first n iterations, the primary objective always returns nan. In this case, I'd say it's 95% a real error and we'll never be able to score. Raise an error regardless.

  2. The primary objective scores, but on iteration i, it errors. By default, I'd consider that just an invalid set of hyperparameters; we should continue for most users, and they can view the problem in the results. However, I could see a user wanting to know immediately, so this could go either way. In my mind, this is the case where raise_errors is relevant.

  3. The primary objective scores for Pipeline A with any hyperparameters, but never works for another Pipeline B with any hyperparameters. This is a trickier case, but I think we don't raise the error and instead consider Pipeline B invalid for this problem. This is a more complex version of situation 2.
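A rough sketch of how those cases could translate into automl-layer logic (function name, thresholds, and messages are all illustrative, not a proposal for the actual evalml code):

import numpy as np

def handle_primary_score(score, has_ever_scored, iteration, raise_errors, n_warmup=5):
    """Illustrative only: decide what to do when the primary objective returns nan."""
    if not np.isnan(score):
        return score  # primary objective scored; additional-objective failures stay as nan
    if not has_ever_scored and iteration >= n_warmup:
        # case 1: the primary objective has never scored -> almost certainly a real error
        raise RuntimeError("primary objective never scored in the first iterations")
    if raise_errors:
        # case 2: the user asked to hear about failures immediately
        raise RuntimeError(f"primary objective errored on iteration {iteration}")
    # cases 2 and 3 by default: record nan in the results and keep searching
    return score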

Finally, this doesn't take into account errors while fitting the pipelines, but I think that's outside of scope for this PR.

Anyways, those are my initial thoughts.

dsherry commented May 21, 2020

@kmax12 good points all around. I like your separation of whether the primary objective failed vs the additional ones; our code doesn't reflect that well right now IMO.

This is complex enough to warrant discussion offline. I just sent you an invite.

For the moment, I feel this PR adds value. It solves the immediate problem in #786. And it simplifies the pipeline score code, which makes it easier to modify later to do whatever we decide and will hopefully prevent future errors. I plan to merge it on approval, unless our discussion significantly changes our views.

@dsherry dsherry marked this pull request as ready for review May 21, 2020
@dsherry dsherry requested a review from kmax12 May 21, 2020

@jeremyliweishih jeremyliweishih left a comment

I think this is looking good - I'm a big fan of the mocking and the consolidation of the pipeline scoring. In terms of what Max brought up, I think this PR is good as it stands, and we can put up another PR detailing error handling after discussing further.

@@ -82,6 +82,32 @@ def test_pipeline_fit_raises(mock_fit, X_y):
assert np.isnan(score)


@patch('evalml.objectives.AUC.score')
def test_pipeline_score_raises(mock_score, X_y):
msg = 'all your model are belong to us'

@jeremyliweishih jeremyliweishih May 21, 2020

nice

@dsherry dsherry May 21, 2020

Yeah, we really need to figure out a good strategy for mocking this sort of thing. This test adds 9 sec. A problem for another day!
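For context, the mocking pattern being discussed looks roughly like this (a sketch with an assumed pipeline fixture and an assumed pipeline.score call signature, not the exact test in this PR):

from unittest.mock import patch

import numpy as np

@patch('evalml.objectives.AUC.score')
def test_score_swallows_objective_error(mock_score, X_y, dummy_binary_pipeline):
    # dummy_binary_pipeline is an assumed fixture; the score() signature below
    # is also an assumption for illustration
    X, y = X_y
    mock_score.side_effect = Exception('all your model are belong to us')
    dummy_binary_pipeline.fit(X, y)
    scores = dummy_binary_pipeline.score(X, y, objectives=['AUC'])
    assert np.isnan(list(scores.values())[0])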

kmax12 approved these changes May 21, 2020

@kmax12 kmax12 left a comment

this looks good to me to merge. left some code organization suggestions if you want them :)

return scores

def _score(self, X, y, predictions, objective):

@kmax12 kmax12 May 21, 2020

might be good to add a docstring here to make it clear that predictions can be either the actual predictions or the probabilities

i also wonder if this should be a standalone util method since it doesn't depend on the class instance in any way. this would probably make more sense if you move the default .score() implementation to the regression base too

finally, i don't think it's the end of the world to duplicate this particular chunk of code in two places since it's not doing much and should still be tested in both cases. but your call

@dsherry dsherry May 21, 2020

Yep good point, will add docstring.

Ah, yeah, should be static at least.

I agree it's not a huge problem to have this code duplicated, but having it exist in only one place makes it harder to break.
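For reference, the suggestion could look something like this (a sketch; the wrapper class and the argument order passed to objective.score are assumptions):

class PipelineScoringSketch:
    """Illustrative container only; not the real evalml PipelineBase."""

    @staticmethod
    def _score(X, y, predictions, objective):
        """Score a single objective.

        Arguments:
            X: feature matrix, available for objectives that need extra columns.
            y: true labels.
            predictions: predicted labels, or predicted probabilities when the
                objective scores on probabilities.
            objective: the objective to compute.

        Returns:
            float: the objective's score.
        """
        # argument order is illustrative; defer to the real ObjectiveBase.score signature
        return objective.score(y, predictions, X)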

evalml/pipelines/pipeline_base.py (resolved review thread)
…nan for errored objectives. Add more scoring test coverage
@dsherry dsherry merged commit ed68580 into master May 21, 2020
2 checks passed
@dsherry dsherry deleted the ds_786_score_errors branch May 21, 2020

angela97lin commented May 22, 2020

@dsherry Sorry to comment after this was merged, but I think this PR might introduce a change that breaks how scoring works -- but silently, via a logged error! Previously, we had evalml/pipelines/multiclass_classification_pipeline.py and evalml/pipelines/binary_classification_pipeline.py each implement score(); the difference was that evalml/pipelines/binary_classification_pipeline.py set y_predicted_proba to y_predicted_proba[:, 1], while evalml/pipelines/multiclass_classification_pipeline.py returned the full array. If I'm understanding correctly, with this PR the two are consolidated to always return the full array. This causes our AUC score function to raise an error complaining about the shape of the data.

Here's a small reproducible example from one of the notebooks (and how I ran into this):

import evalml
import pandas as pd

X = pd.DataFrame({
    "leaked_feature": [6, 6, 10, 5, 5, 11, 5, 10, 11, 4],
    "leaked_feature_2": [3, 2.5, 5, 2.5, 3, 5.5, 2, 5, 5.5, 2],
    "valid_feature": [3, 1, 3, 2, 4, 6, 1, 3, 3, 11]
})

y = pd.Series([1, 1, 0, 1, 1, 0, 1, 0, 0, 1])

automl = evalml.AutoClassificationSearch(
    max_pipelines=1,
    allowed_model_families=["linear_model"],
)

automl.search(X, y)

(From https://evalml.featurelabs.com/en/latest/guardrails/overfitting.html)
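For what it's worth, the shape issue can be reproduced standalone with sklearn's roc_auc_score (which, as far as I know, is what the AUC objective is built on):

import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([1, 1, 0, 1, 0])
# a binary pipeline's predict_proba returns one column per class
y_proba = np.array([[0.2, 0.8], [0.3, 0.7], [0.9, 0.1], [0.4, 0.6], [0.6, 0.4]])

# passing the full (n_samples, 2) array for a binary target errors out
try:
    roc_auc_score(y_true, y_proba)
except Exception as err:
    print("full probability array:", err)

# slicing out the positive-class column, as binary_classification_pipeline.py used to do, works
print("positive-class column only:", roc_auc_score(y_true, y_proba[:, 1]))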

kmax12 commented May 22, 2020

i think this is related to #348. i cannot find documentation on where we ended up in terms of what shape binary classifiers should return.

angela97lin commented May 22, 2020

Yup! I think during the objectives API project we decided that our binary classifiers would return a consistent shape (resolving #348), and then we would modify the shape in score() if we needed to, hence the two different implementations.

dsherry commented May 22, 2020

@angela97lin oh no! Can you please file a bug with this and tag me? I'll fix it in a bit. It's a bit disconcerting that our tests didn't catch this -- all green on this PR and on master. Will have to fix that too.

If I understand right, this error would only occur for binary classification metrics which rely on probabilities, like AUC.

dsherry commented May 22, 2020

@angela97lin I filed #797 and am taking a look now

Successfully merging this pull request may close these issues.

Automl: if any provided objective errors, all computed scores are nan
4 participants