Include target column name and class labels in predict/predict_proba output #951

Merged: 19 commits into main, Jul 23, 2020

Conversation

angela97lin (Contributor)

Closes #645

@angela97lin angela97lin self-assigned this Jul 20, 2020
codecov bot commented Jul 20, 2020

Codecov Report

Merging #951 into main will increase coverage by 0.00%.
The diff coverage is 100.00%.


@@           Coverage Diff           @@
##             main     #951   +/-   ##
=======================================
  Coverage   99.87%   99.87%           
=======================================
  Files         171      171           
  Lines        8780     8800   +20     
=======================================
+ Hits         8769     8789   +20     
  Misses         11       11           
Impacted Files Coverage Δ
evalml/pipelines/classification_pipeline.py 100.00% <100.00%> (ø)
...ents/estimators/classifiers/baseline_classifier.py 100.00% <100.00%> (ø)
evalml/tests/automl_tests/test_automl.py 100.00% <100.00%> (ø)
.../tests/component_tests/test_baseline_classifier.py 100.00% <100.00%> (ø)
...ion_pipeline_tests/test_baseline_classification.py 100.00% <100.00%> (ø)
...ation_pipeline_tests/test_binary_classification.py 100.00% <100.00%> (ø)


@angela97lin angela97lin added this to the July 2020 milestone Jul 21, 2020
@angela97lin angela97lin marked this pull request as ready for review July 21, 2020 20:24
@dsherry dsherry (Contributor) left a comment

It's great that we've arrived at this point where we can have predict_proba return unique class labels!

I left a discussion on the estimator changes. My suggestion is that we have this PR update the predict_proba col names only for classification pipelines, and file a separate issue where we can continue the conversation on classification estimators separately. What do you think?

That aside, I see you added an automl test to cover predict_proba columns, and that's great. We should also add a pipeline-level test which trains a pipeline on data with various target types (int, categorical, str) and expects correct column names and ordering of the predict_proba output.
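
Not part of the PR, but a sketch of what that pipeline-level test might look like; binary_pipeline here is a hypothetical fixture standing in for however the test suite constructs an unfitted pipeline:

import pandas as pd
import pytest

@pytest.mark.parametrize("y", [
    pd.Series([0, 1, 0, 1]),                            # int targets
    pd.Series(["a", "b", "a", "b"]),                    # str targets
    pd.Series(["a", "b", "a", "b"], dtype="category"),  # categorical targets
])
def test_predict_proba_column_labels(y, binary_pipeline):
    # binary_pipeline is a hypothetical fixture supplying an unfitted
    # binary classification pipeline
    X = pd.DataFrame({"feature": [0.0, 1.0, 2.0, 3.0]})
    binary_pipeline.fit(X, y)
    proba = binary_pipeline.predict_proba(X)
    # columns should carry the original class labels, in encoder order
    assert list(proba.columns) == sorted(set(y))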

evalml/pipelines/classification_pipeline.py
@@ -101,6 +101,7 @@ def predict_proba(self, X):

X = self._transform(X)
proba = self.estimator.predict_proba(X)
proba.columns = self._decode_targets(proba.columns)
Contributor

Sweet. So all our estimators' predict_proba return pd.DataFrame? And _decode_targets still works when .columns is a RangeIndex?

Contributor

Let's enforce the following for the use of estimators in our pipelines:

  • When part of pipelines, classification estimators are always given a target of type int ranging from 0 to n-1. Therefore, their notion of which class is which will by definition match the ordering provided by the pipeline.

Let's update the code here to use the pipeline label encoder to set the predict_proba column labels. As long as the estimator conforms to the above statement (needs unit testing), this code will work.

if list(proba.columns) != list(self._encoder.classes_):
    logger.warning('Estimator may break class order')
proba.columns = self._decode_targets(self._encoder.classes_)

@angela97lin angela97lin (Contributor Author) commented Jul 22, 2020

@dsherry Hahaha upon looking at this more closely, we had it backwards 😅 We can't decode self._encoder.classes_ when self._encoder.classes_ are the original, and hence already-decoded, targets!

Alternatives we can do:

  1. Assume that the column order of the estimator's predict_proba is the order of the classes listed in self._encoder.classes_ and simply do

proba.columns = self._encoder.classes

  2. If the predict_proba column names are equal to self._encoder.classes_, then we assume that decoding was already done and don't do anything. Otherwise, we try to decode.

if set(self._encoder.classes_) == set(proba.columns):
    return proba
else:
    proba.columns = self._decode_targets(proba.columns)

Contributor

Ah, right. I think we should go with option 1. Not sure if there was a typo in your example or if you meant something else, but just for clarity:

proba.columns = self._encoder.classes_

Right?

Contributor

I do think there's value in adding a warning message if the estimator predict_proba doesn't match what we expect, and perhaps some debug logs throughout the encoding/decoding. In particular, if the estimator's predict_proba returns something with a different number of columns from what we expect, we should raise an error. Maybe there's a way to have a similar check for the column names and ordering but that may get too complex.

Contributor Author

Ah yes, sorry for the typo, but underscore is correct!

Re: testing and adding checks: I agree these checks would be helpful, but with regards to the problem we're having here with column names and ordering, it's much harder to do when sklearn drops them by returning numpy arrays and we have to use the estimator's .classes_ attribute to match up each column :( I can add some debug logging within fit / wherever else we're calling encode/decode!
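
For illustration, here's the situation being described, using scikit-learn directly rather than any evalml code: predict_proba hands back a bare numpy array, so the column labels have to be rebuilt from the fitted estimator's classes_ attribute.

import pandas as pd
from sklearn.linear_model import LogisticRegression

X = [[0.0], [1.0], [2.0], [3.0]]
y = ["no", "no", "yes", "yes"]

clf = LogisticRegression().fit(X, y)
raw = clf.predict_proba(X)  # plain numpy array; column names are dropped
# classes_ gives the label for each probability column, in order
proba = pd.DataFrame(raw, columns=clf.classes_)
print(list(proba.columns))  # ['no', 'yes']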

Contributor

Sounds good! I agree given the complexity already present here, adding a bunch of error handling is fine to keep out of scope for now.

try:
    return self._component_obj.classes_
except AttributeError:
    raise MethodPropertyNotFoundError("Estimator requires a classes_ property or a component_obj that implements classes_")
Contributor

Ah, I see. So this PR adds class labels to predict_proba not only for classification pipelines but also for classification estimators.

By adding this, we're now expecting all custom classification estimators to include a classes_ property, right? At least those which don't define a component_obj. I know this is how sklearn classification estimators work. My concern is that by doing this, we're making it difficult to add custom classification estimators. The baseline model happened to be tracking this info already, but what would be required if that weren't the case?

I think another viable path would be to explore using the same label_encoder pattern we used in the pipeline here in the estimator. This would also be the first non-generic logic we've added to the estimators, so perhaps we can consider alternatives.

Another consideration is that @jeremyliweishih is working on a metaclass pattern for #851 which may be helpful here.

My suggestion: have this PR update the predict_proba col names for classification pipelines. And file a separate issue for estimators where we can continue the conversation. Does that sound good to you?

Contributor Author

@dsherry Yes, you're correct, and if a user wants to implement their own custom classifier, then they would have to implement _classes.

The reason I took this approach is that using / storing classes at the pipeline level is possible but also tricky, and depends on the output of the estimator's predict_proba. That is, at the pipeline level, we're not guaranteed that the order of the columns we're setting matches the order of the columns output by the estimator's predict_proba method.

With that in mind, what do you think? We're not guaranteed that the class labels we store are in the order the estimator will return values in, which is why I left it up to the estimator to do so.

Contributor

@angela97lin ok got it.

For classification pipelines we're currently storing unique class values using the _encoder. I like what we settled on there. My concerns are solely about the classification estimators.

For estimators, there are a few cases to consider:

  1. User provides a component_obj to wrap, and the component_obj exposes classes_ as an attribute (i.e. sklearn estimators). Your proposal would work perfectly here; the user doesn't need to override component classes_.

  2. User provides a component_obj to wrap, and the component_obj exposes classes_ but under a different name. In this case the user needs to override component classes_ to point to the right accessor on the component_obj. That's fine.

  3. User provides a component_obj to wrap, but the component_obj doesn't surface the unique class values. In this case, the user would have to override fit and learn it using a label encoder or similar custom logic.

  4. No component_obj; user implements custom __init__, fit, predict/predict_proba. In this case, does the user need to define classes_? Functionally, they don't, since we don't care how they attach labels to predict_proba columns as long as they conform to our API's expectations. But now that we expose classes_ as a public property, we'd need to clearly articulate whether or not this is necessary to override.

Your proposal does well for cases 1 and 2, but I think we can improve on it for cases 3 and 4.

What if we added a label encoder to all classification estimators by default, just like we do now with classification pipelines? That way we don't need to rely on the component_obj to learn and expose the ordered class labels. We could expose this under private methods: _learn_unique_class_labels, which we'd call in Estimator.fit, and _get_unique_class_labels, which we'd call in Estimator.predict_proba. This would address cases 1 and 2, and then for cases 3 and 4 we can advise users to follow our pattern and call these themselves instead of having to write that code from scratch.
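
A rough sketch of that proposal, with the caveat that _learn_unique_class_labels and _get_unique_class_labels are names floated in this comment, not actual evalml APIs:

from sklearn.preprocessing import LabelEncoder

class ClassificationEstimatorSketch:
    """Illustrative only: an estimator that tracks ordered class labels itself."""

    def __init__(self):
        self._label_encoder = LabelEncoder()

    def _learn_unique_class_labels(self, y):
        # would be called from Estimator.fit
        self._label_encoder.fit(y)

    def _get_unique_class_labels(self):
        # would be called from Estimator.predict_proba to label columns
        return list(self._label_encoder.classes_)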

Also, I agree that it would be confusing if classification pipelines' predict_proba vs estimators' predict_proba returned columns in a different order. We should keep an eye out for that detail.

All this said, I think we should split this into its own issue so we can merge the code you've got for pipelines now and tackle these questions separately.

This is reminding me we really need a thorough page in the User Guide about how to define custom components!

@dsherry dsherry (Contributor) commented Jul 22, 2020

Discussed again with @angela97lin

We've made the following decisions:

  • When part of pipelines, classification estimators are always given a target of type int ranging from 0 to n-1. Therefore, their notion of which class is which will by definition match the ordering provided by the pipeline (see the sketch below).
  • Our pipeline code will make no assumptions about the predict_proba column labels provided by classification estimators
  • A separate PR could add support for accepting str targets in classification estimators, and also including str labels in the predict_proba column labels. However, we want to keep estimators simple, and we feel this is low-priority for the estimators defined in our codebase.
  • We should write a custom component user guide, and document what is expected of custom components here. TODO @dsherry file issue -- done: Add a "Defining Custom Components" page to the user guide #964

We should make sure our test coverage covers:

  • All our classification estimators work fine with int targets from 0 to n-1, and predict_proba column labels are set to ints 0 to n-1
  • Pipeline sets correct int or str predict_proba column labels depending on the target type
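
For concreteness, a simplified standalone sketch of the settled pattern, using scikit-learn directly in place of the actual pipeline code:

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

X = pd.DataFrame({"feature": [0.0, 1.0, 2.0, 3.0]})
y = pd.Series(["cat", "dog", "cat", "dog"])

encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)  # ints 0 to n-1, in classes_ order

estimator = LogisticRegression().fit(X, y_encoded)
proba = pd.DataFrame(estimator.predict_proba(X))
proba.columns = encoder.classes_  # restore the original class labels
print(list(proba.columns))  # ['cat', 'dog']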

@angela97lin angela97lin requested a review from dsherry July 23, 2020 18:18
@dsherry dsherry (Contributor) left a comment

@angela97lin looks good to me! Thanks for closing this out. It's pretty cool that we can provide this kind of encoding and mask the complexity inside the pipelines.

This isn't blocking, but I think the intent of the test_targets_data_types_classification automl test is to check that the automl scores and other info are computed correctly for different target datatypes. Checking pipelines' predict and predict_proba output is more fundamental and ideally should be done outside the context of automl; hopefully that will make it easier to maintain. So my suggestion is to move that coverage to a pipeline test instead of doing it in an automl test, i.e. add a test_targets_data_types_classification in test_pipelines.py.

@@ -101,6 +101,7 @@ def predict_proba(self, X):

X = self._transform(X)
proba = self.estimator.predict_proba(X)
proba.columns = self._encoder.classes_
Contributor

Awesome, makes sense!

    Returns:
        list(str) or list(float): class names
    """
    return self._classes
Contributor

I guess it can't hurt to leave this in, but no longer necessary for this PR, right?

Contributor Author

No longer necessary! Goes well with our direction towards using @property for read-only attributes though? :D

@@ -33,6 +33,7 @@ def test_binary_classification_pipeline_predict(mock_fit,
# test custom threshold set but no objective passed
mock_predict_proba.return_value = pd.DataFrame([[0.1, 0.2], [0.1, 0.2]])
binary_pipeline.threshold = 0.6
binary_pipeline._encoder.classes_ = [0, 1]
Contributor

Hmm, why didn't we need this before this PR?

Contributor Author

We need this now because of the proba.columns = self._encoder.classes_ line we added in this PR. Without it, we'll get an AttributeError.

Contributor

Ah, thanks. When I looked at this first I thought this was happening further down and was confused. Got it.

@angela97lin angela97lin merged commit 1cfd36a into main Jul 23, 2020
@angela97lin angela97lin (Contributor Author)

@dsherry Argh sorry I missed the comment associated with your approval again 🤦‍♀️ I know it wasn't a blocking comment but I'll get something up to move things to some pipeline tests!
