Include target column name and class labels in predict/predict_proba output #951
Conversation
Codecov Report
@@            Coverage Diff            @@
##              main     #951    +/-   ##
=========================================
  Coverage    99.87%   99.87%
=========================================
  Files          171      171
  Lines         8780     8800    +20
=========================================
+ Hits          8769     8789    +20
  Misses          11       11
Continue to review full report at Codecov.
It's great that we've arrived at this point where we can have `predict_proba` return unique class labels!
I left a discussion on the estimator changes. My suggestion is that we have this PR update the `predict_proba` col names only for classification pipelines, and file a separate issue where we can continue the conversation on classification estimators separately. What do you think?
That aside, I see you added an automl test to cover `predict_proba` columns, and that's great. We should also add a pipeline-level test which trains a pipeline on data with various target types (int, categorical, str) and expects correct column names and ordering of the `predict_proba` output.
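For illustration, here is a minimal sketch of what such a pipeline-level test could assert, using plain scikit-learn stand-ins (`LabelEncoder` + `LogisticRegression`) in place of the real pipeline internals. The test name, the stand-ins, and the assertions are assumptions; a real test would fit an actual pipeline class and call its `predict_proba` directly.

```python
# Sketch only: scikit-learn stand-ins for pipeline internals; a real test would
# fit an actual pipeline class instead of the inline encode/fit/relabel steps.
import numpy as np
import pandas as pd
import pytest
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder


@pytest.mark.parametrize("y", [
    pd.Series([0, 1] * 50),                           # int targets
    pd.Series(["no", "yes"] * 50),                     # str targets
    pd.Series(["no", "yes"] * 50, dtype="category"),   # categorical targets
])
def test_predict_proba_column_names(y):
    rng = np.random.RandomState(0)
    X = pd.DataFrame({"a": rng.rand(100), "b": rng.rand(100)})

    # Stand-in for pipeline.fit: encode targets, fit the estimator on 0..n-1
    encoder = LabelEncoder()
    y_encoded = encoder.fit_transform(y)
    estimator = LogisticRegression().fit(X, y_encoded)

    # Stand-in for pipeline.predict_proba: relabel columns with original classes
    proba = pd.DataFrame(estimator.predict_proba(X))
    proba.columns = encoder.classes_

    # Column names should be the original class labels, in encoder order
    assert list(proba.columns) == list(encoder.classes_)
    assert set(proba.columns) == set(y.unique())
```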
@@ -101,6 +101,7 @@ def predict_proba(self, X):
        X = self._transform(X)
        proba = self.estimator.predict_proba(X)
        proba.columns = self._decode_targets(proba.columns)
Sweet. So all our estimators' `predict_proba` return a `pd.DataFrame`? And `_decode_targets` still works when `columns` returns a `RangeIndex`?
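For context on that question, a small standalone illustration (assumed, not code from this PR) of where the `RangeIndex` comes from and what decoding positional column labels through a fitted label encoder yields:

```python
# Wrapping an estimator's numpy predict_proba output in a DataFrame yields
# positional RangeIndex columns 0..n-1; as long as the estimator was trained on
# targets encoded as 0..n-1, inverse-transforming those positions recovers the
# original class labels.
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder().fit(["cat", "dog", "cat"])
proba = pd.DataFrame(np.array([[0.7, 0.3], [0.2, 0.8]]))
print(proba.columns)                                   # RangeIndex(start=0, stop=2, step=1)
print(encoder.inverse_transform(list(proba.columns)))  # ['cat' 'dog']
```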
Let's enforce the following for the use of estimators in our pipelines:
- When part of pipelines, classification estimators are always given a target of type int ranging from 0 to `n-1`. Therefore, their notion of which class is which will by definition match the ordering provided by the pipeline.
Let's update the code here to use the pipeline label encoder to set the `predict_proba` column labels. As long as the estimator conforms to the above statement (needs unit testing), this code will work:
if (self._encoder.classes_ != proba.columns):
    logger.warn('Estimator may break class order')
proba.columns = self._decode_targets(self._encoder.classes_)
@dsherry Hahaha, upon looking at this more closely, we had it backwards 😅 We can't decode `self._encoder.classes_` when `self._encoder.classes_` are the original and hence already-decoded targets!
Alternatives we can do:
1. Assume that the column order of the estimator's `predict_proba` is the order of the classes listed in `self._encoder.classes_` and simply do `proba.columns = self._encoder.classes`
2. If the `predict_proba` column names are equal to `self._encoder.classes_`, then we assume that the decoding was already done and don't do anything. Otherwise, we try to decode:
if (set(self._encoder.classes_) == set(proba.columns)):
    return proba
else:
    proba.columns = self._decode_targets(proba.columns)
Ah, right. I think we should go with option 1. Not sure if there was a typo in your example or if you meant something else, but just for clarity:
proba.columns = self._encoder.classes_
Right?
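For reference, a tiny standalone sketch (plain scikit-learn; the assumptions are noted in the comments) of why option 1 lines up, given the 0-to-`n-1` encoding convention discussed above:

```python
# The pipeline encoder's classes_ holds the original labels in the same order as
# the encoded values 0..n-1 the estimator is trained on, so assigning classes_ as
# column names relabels predict_proba columns with the original classes in the
# right order (assuming the estimator keeps the 0..n-1 column ordering discussed above).
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder().fit(["dog", "cat", "fish"])
print(encoder.classes_)                     # ['cat' 'dog' 'fish']
print(encoder.transform(encoder.classes_))  # [0 1 2] -> the estimator's column order
```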
I do think there's value in adding a warning message if the estimator `predict_proba` doesn't match what we expect, and perhaps some debug logs throughout the encoding/decoding. In particular, if the estimator's `predict_proba` returns something with a different number of columns from what we expect, we should raise an error. Maybe there's a way to have a similar check for the column names and ordering, but that may get too complex.
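A rough sketch of the kind of check being described; the helper name and its placement in the pipeline's `predict_proba` are assumptions, not code from this PR:

```python
# Hypothetical helper sketching the suggested checks: hard-error on a column-count
# mismatch, debug-log anything odd about the column labels, then relabel.
import logging

logger = logging.getLogger(__name__)


def _relabel_proba_columns(proba, encoder):
    """Relabel an estimator's predict_proba output with the original class labels."""
    n_expected = len(encoder.classes_)
    if proba.shape[1] != n_expected:
        raise ValueError(
            f"Estimator predict_proba returned {proba.shape[1]} columns, "
            f"expected {n_expected} (one per class)"
        )
    if list(proba.columns) != list(range(n_expected)):
        logger.debug("Unexpected predict_proba column labels: %s", list(proba.columns))
    proba.columns = encoder.classes_
    return proba
```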
Ah yes, sorry for the typo, but underscore is correct!
RE testing and adding checks: I agree adding these checks would be helpful, but with regards to the problem we're having here with column names and ordering... it's much harder to do when sklearn drops them by returning numpy arrays and we have to use the estimator's `.classes_` attribute to match up each column :( I can add some debug logging within fit / wherever else we're calling encode/decode!
Sounds good! Given the complexity already present here, I agree it's fine to keep additional error handling out of scope for now.
try:
    return self._component_obj.classes_
except AttributeError:
    raise MethodPropertyNotFoundError("Estimator requires a classes_ property or a component_obj that implements classes_")
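For what this requirement means in practice, a hypothetical example (class and attribute names made up) of a custom classification estimator overriding this property when its wrapped model exposes labels under a different name:

```python
# Hypothetical custom estimator: the wrapped third-party model stores its labels
# as "labels_", so the classes_ property is overridden to point at it.
class MyCustomClassifier:  # would subclass the classification Estimator base class in practice
    @property
    def classes_(self):
        return self._component_obj.labels_
```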
Ah, I see. So this PR adds class labels to `predict_proba` not only for classification pipelines but also for classification estimators.
By adding this, we're now expecting all custom classification estimators to include a `classes_` property, right? At least those which don't define a `component_obj`. I know this is how sklearn classification estimators work. My concern is that by doing this, we're making it difficult to add custom classification estimators. The baseline model happened to be tracking this info already, but what would be required if that weren't the case?
I think another viable path would be to explore using the same `label_encoder` pattern we used in the pipeline here in the estimator. This would also be the first non-generic logic we've added to the estimators, so perhaps we can consider alternatives.
Another consideration is that @jeremyliweishih is working on a metaclass pattern for #851 which may be helpful here.
My suggestion: have this PR update the `predict_proba` col names for classification pipelines, and file a separate issue for estimators where we can continue the conversation. Does that sound good to you?
@dsherry Yes, you're correct, and if a user wants to implement their own custom classifier, then they would have to implement `_classes`.
The reason I took this approach is that using / storing classes at the pipeline level is possible but also tricky, and depends on the output of the estimator's `predict_proba`. That is, at the pipeline level, we're not guaranteed that the order of the columns we're setting is equal to the order of the columns output by the estimator's `predict_proba` method.
With that in mind, what do you think? We're not guaranteed that the class labels we store are in the order that the estimator will return values in, which is why I left it up to the estimator to do so.
@angela97lin ok got it.
For classification pipelines we're currently storing unique class values using the `_encoder`. I like what we settled on there. My concerns are solely about the classification estimators.
For estimators, there are a few cases to consider:
1. User provides a `component_obj` to wrap, and the `component_obj` exposes `classes_` as an attribute (i.e. sklearn estimators). Your proposal would work perfectly here; the user doesn't need to override the component's `classes_`.
2. User provides a `component_obj` to wrap, and the `component_obj` exposes `classes_` but under a different name. In this case the user needs to override the component's `classes_` to point to the right accessor on the `component_obj`. That's fine.
3. User provides a `component_obj` to wrap, but the `component_obj` doesn't surface the unique class values. In this case, the user would have to override `fit` and learn them using a label encoder or similar custom logic.
4. No `component_obj`; the user implements a custom `__init__`, `fit`, and `predict`/`predict_proba`. In this case, does the user need to define `classes_`? Functionally, they don't, since we don't care how they attach labels to `predict_proba` columns as long as they conform to our API's expectations. But now that we expose `classes_` as a public property, we'd need to clearly articulate whether or not it is necessary to override.
Your proposal does well for cases 1 and 2, but I think we can improve on it for cases 3 and 4.
What if we added a label encoder to all classification estimators by default, just like we do now with classification pipelines? That way we don't need to rely on the `component_obj` to learn and expose the ordered class labels. We could expose this under private methods `_learn_unique_class_labels`, which we'd call in `Estimator.fit`, and `_get_unique_class_labels`, which we'd call in `Estimator.predict_proba`. This would address cases 1 and 2, and then for cases 3 and 4 we can advise users to follow our pattern and call this themselves instead of having to write that code from scratch.
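A sketch of what that pattern could look like; the two method names come from the comment above, while the `LabelEncoder`-based implementation and the wiring into `fit`/`predict_proba` are assumptions:

```python
# Sketch of the proposed estimator-level label encoder pattern. Method names are
# from the suggestion above; everything else here is an assumed implementation.
import pandas as pd
from sklearn.preprocessing import LabelEncoder


class ClassificationEstimatorMixin:
    def _learn_unique_class_labels(self, y):
        # Called from Estimator.fit before fitting the component_obj, so the
        # ordered class labels are learned even if the wrapped model never
        # exposes them.
        self._label_encoder = LabelEncoder().fit(y)

    def _get_unique_class_labels(self):
        # Called from Estimator.predict_proba to label the output columns.
        return self._label_encoder.classes_

    def predict_proba(self, X):
        proba = pd.DataFrame(self._component_obj.predict_proba(X))
        proba.columns = self._get_unique_class_labels()
        return proba
```

Custom estimators without a `component_obj` (cases 3 and 4) could then call the same two helpers from their own `fit`/`predict_proba` instead of re-implementing that bookkeeping.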
Also, I agree that it would be confusing if classification pipelines' `predict_proba` and estimators' `predict_proba` returned columns in a different order. We should keep an eye out for that detail.
All this said, I think we should split this into its own issue so we can merge the code you've got for pipelines now and tackle these questions separately.
This is reminding me we really need a thorough page in the User Guide about how to define custom components!
Discussed again with @angela97lin. We've made the following decisions:
We should make sure our test coverage covers:
@angela97lin looks good to me! Thanks for closing this out. It's pretty cool that we can provide this kind of encoding and mask the complexity inside the pipelines.
This isn't blocking, but I think the intent of the `test_targets_data_types_classification` automl test is to check that the automl scores and other info are computed correctly for different target datatypes. Checking pipelines' `predict` and `predict_proba` output is more fundamental and ideally should be done outside the context of automl. Hopefully that will make it easier to maintain. So my suggestion is to move that coverage to a pipeline test instead of doing it in an automl test, i.e. add a `test_targets_data_types_classification` in `test_pipelines.py`.
@@ -101,6 +101,7 @@ def predict_proba(self, X):
        X = self._transform(X)
        proba = self.estimator.predict_proba(X)
        proba.columns = self._encoder.classes_
Awesome, makes sense!
    Returns:
        list(str) or list(float): class names
    """
    return self._classes
I guess it can't hurt to leave this in, but no longer necessary for this PR, right?
No longer necessary! It does go well with our direction towards using `property` for read-only attributes, though :D
@@ -33,6 +33,7 @@ def test_binary_classification_pipeline_predict(mock_fit,
    # test custom threshold set but no objective passed
    mock_predict_proba.return_value = pd.DataFrame([[0.1, 0.2], [0.1, 0.2]])
    binary_pipeline.threshold = 0.6
    binary_pipeline._encoder.classes_ = [0, 1]
Hmm, why didn't we need this before this PR?
We need this now because of the `proba.columns = self._encoder.classes_` line we added in this PR. Without it, we'll get an `AttributeError`.
Ah, thanks. When I looked at this first I thought this was happening further down and was confused. Got it.
@dsherry Argh, sorry I missed the comment associated with your approval again 🤦‍♀️ I know it wasn't a blocking comment, but I'll get something up to move things to some pipeline tests!
Closes #645