Explain Predictions #1016
Conversation
Codecov Report
@@            Coverage Diff             @@
##             main    #1016      +/-   ##
==========================================
+ Coverage   99.72%   99.90%   +0.18%
==========================================
  Files         181      181
  Lines       9748     9998     +250
==========================================
+ Hits        9721     9989     +268
+ Misses        27        9      -18
Continue to review full report at Codecov.
Force-pushed from ac9355a to 2410871 (Compare)
@@ -132,9 +132,9 @@
    "outputs": [],
    "source": [
    "# get the predicted probabilities associated with the \"true\" label\n",
-   "y = y.map({'malignant': 0, 'benign': 1})\n",
+   "y_encoded = y.map({'malignant': 0, 'benign': 1})\n",
Doing this so that the calls to explain_predictions_best_worst don't break (the pipeline was fit on string labels, but we would otherwise be passing in int labels).
Got it. How about:
- Move this y_encoded = y.map({'malignant': 0, 'benign': 1}) down into the chunk about prediction explanations
- Don't change what's passed to graph_precision_recall_curve and graph_roc_curve. Not necessary, right?
graph_precision_recall_curve and graph_roc_curve need ints, and the prediction explanations need strings (since that is what the pipeline was originally fit on). I think we have to change what gets passed to precision_recall_curve and roc_curve unless we only use int labels in the entire notebook (the original labels are strings)?
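To make the thread concrete, here is a minimal sketch of the label handling being discussed. The evalml calls shown in the comments (explain_predictions_best_worst, graph_roc_curve, graph_precision_recall_curve) are assumed from context rather than copied from the notebook.

```python
import pandas as pd

# Original string labels: the pipeline is fit on these, so the prediction
# explanation functions should also receive them.
y = pd.Series(["malignant", "benign", "benign", "malignant"])

# The graphing utilities want integer labels, so the notebook keeps a separate
# encoded copy instead of overwriting y:
y_encoded = y.map({"malignant": 0, "benign": 1})

# Sketch of how the two views would be used (evalml calls omitted / assumed):
#   explain_predictions_best_worst(pipeline, X, y, ...)      # string labels
#   graph_roc_curve(y_encoded, y_pred_proba)                  # int labels
#   graph_precision_recall_curve(y_encoded, y_pred_proba)     # int labels
print(y_encoded.tolist())  # [0, 1, 1, 0]
```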
    # Classification
    if isinstance(shap_values, list):
This logic got split into _SHAPMultiClassTableMaker, _SHAPBinaryTableMaker, and _SHAPRegressionTableMaker. I think this is more maintainable and easier to understand.
    return table_maker(shap_values, normalized_shap_values, top_k, include_shap_values)
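For readers skimming the diff, here is a rough, self-contained sketch of the dispatch that comment describes. The class names come from the comment, but the bodies are placeholders and the binary-vs-multiclass check is an assumption, not the actual evalml implementation.

```python
class _SHAPRegressionTableMaker:
    def __call__(self, shap_values, normalized_values, top_k, include_shap_values):
        return "<regression table>"  # placeholder body

class _SHAPBinaryTableMaker:
    def __call__(self, shap_values, normalized_values, top_k, include_shap_values):
        return "<binary table>"  # placeholder body

class _SHAPMultiClassTableMaker:
    def __call__(self, shap_values, normalized_values, top_k, include_shap_values):
        return "<multiclass table>"  # placeholder body

def _make_table(shap_values, normalized_values, top_k=3, include_shap_values=False):
    # SHAP returns a list of per-class values for classification and a single
    # mapping for regression, so the dispatch keys off isinstance(..., list).
    if isinstance(shap_values, list):
        table_maker = (_SHAPMultiClassTableMaker() if len(shap_values) > 2
                       else _SHAPBinaryTableMaker())
    else:
        table_maker = _SHAPRegressionTableMaker()
    return table_maker(shap_values, normalized_values, top_k, include_shap_values)
```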
def _abs_error(y_true, y_pred): |
_abs_error and _cross_entropy don't have to be defined in this file, but I didn't want to create a new module since they would only be used in this file.
Hmmm, makes me wonder if they're useful in standard_metrics 🤔
Yea, that's what I was thinking too, but the difference between these functions and the ones in standard_metrics is that these functions return a float for each data point, while the standard_metrics objectives return a single float for the entire CV fold.
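A hedged sketch of that distinction (the real signatures in explainers.py may differ; the argument shapes for _cross_entropy here are an assumption):

```python
import numpy as np

def _abs_error(y_true, y_pred):
    # Per-datapoint metric: returns one score per row, so rows can be ranked
    # best/worst, unlike the standard_metrics objectives, which return a
    # single float for a whole CV fold.
    return np.abs(np.asarray(y_true) - np.asarray(y_pred))

def _cross_entropy(y_true, y_pred_proba, eps=1e-15):
    # y_true: integer-encoded labels; y_pred_proba: (n_rows, n_classes) probabilities.
    proba_of_true = y_pred_proba[np.arange(len(y_true)), np.asarray(y_true)]
    return -np.log(np.clip(proba_of_true, eps, 1.0))

scores = _abs_error([3.0, 5.0], [2.5, 7.0])  # array([0.5, 2.0]) -- one value per row
```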
@pytest.mark.parametrize("problem_type", ["binary", "multi"])
def test_pipeline_has_classes_property(logistic_regression_binary_pipeline_class,
Had to add this test because I added a _classes property to the ClassificationPipeline class.
Got it. This makes me think: a) we should just bite the bullet and make that property public; b) can you also add a similar test with an int-type target instead of a str-type one, because those could be treated differently by the encoder if we alter that code in the future.
I added the test for an int-type target!
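The actual test uses the evalml pipeline fixtures shown in the diff; the snippet below is only a simplified stand-in (using sklearn's LabelEncoder in place of the pipeline's encoder) for the property being checked with both str-type and int-type targets.

```python
import numpy as np
import pandas as pd
import pytest
from sklearn.preprocessing import LabelEncoder

@pytest.mark.parametrize("target", [
    pd.Series(["malignant", "benign", "benign"]),  # str-type target
    pd.Series([0, 1, 1]),                          # int-type target
])
def test_classes_match_original_labels(target):
    # Stand-in for the pipeline's label encoding; the real test checks
    # ClassificationPipeline._classes after fitting an evalml pipeline.
    encoder = LabelEncoder().fit(target)
    # The exposed classes should be the original labels, not the encoded ints.
    np.testing.assert_array_equal(encoder.classes_, np.unique(target))
```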
    (ProblemTypes.MULTICLASS, multiclass_no_best_worst_answer)])
@patch("evalml.pipelines.prediction_explanations.explainers._DEFAULT_METRICS")
@patch("evalml.pipelines.prediction_explanations.explainers.explain_prediction")
def test_explain_predictions_custom_index(explain_prediction_mock, mock_default_metrics,
I wanted to add a test that makes sure the way we index the features dataframe doesn't rely on the dataframe's index. I was afraid custom indexes would break future refactorings of the code, but maybe I'm being paranoid lol.
Amazing. Great thinking. I respect your paranoia 🤣
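A small illustration of the property that test guards. The _rows_to_explain helper is hypothetical; the real logic lives in explainers.py.

```python
import pandas as pd

def _rows_to_explain(features, positions):
    # Positional indexing (iloc) keeps the selection stable even when the
    # DataFrame has a non-default (custom) index.
    return features.iloc[positions]

def test_custom_index_does_not_change_selection():
    features = pd.DataFrame({"a": [10, 20, 30]}, index=["row-7", "row-2", "row-9"])
    selected = _rows_to_explain(features, [0, 2])
    assert list(selected["a"]) == [10, 30]
```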
class _ReportSectionMaker:
    """Make a prediction explanation report.
I think this docstring helps explain how the following classes fit together: _HeadingMaker, _SHAPTableMaker, _EmptyPredictedValuesMaker, _ClassificationPredictedValuesMaker, and _RegressionPredictedValuesMaker.
I originally wrote a working implementation without this structure, but there were so many slight differences between the expected output of explain_predictions and explain_predictions_best_worst (depending on whether it is a regression or classification problem) that I decided breaking the differences up into their own classes was the best way to go.
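For context, here is a very rough sketch of how those pieces could compose. The class and method names follow the comment above, but the constructor arguments and output format are assumptions, not the actual evalml implementation.

```python
class _HeadingMaker:
    def make(self, rank):
        return [f"\tRank {rank + 1}\n\n"]

class _SHAPTableMaker:
    def make(self, index, pipeline, input_features):
        return ["\t<SHAP table for this row>\n\n"]

class _ClassificationPredictedValuesMaker:
    def make(self, index, y_pred, y_true, scores):
        return [f"\tPredicted: {y_pred[index]}, Actual: {y_true[index]}, "
                f"Error: {scores[index]:.3f}\n\n"]

class _ReportSectionMaker:
    """Glues the section pieces together: heading, predicted values, SHAP table."""
    def __init__(self, heading_maker, predicted_values_maker, table_maker):
        self.heading_maker = heading_maker
        self.predicted_values_maker = predicted_values_maker
        self.table_maker = table_maker

    def make_report_section(self, indices, pipeline, input_features, y_pred, y_true, scores):
        report = []
        for rank, index in enumerate(indices):
            report.extend(self.heading_maker.make(rank))
            report.extend(self.predicted_values_maker.make(index, y_pred, y_true, scores))
            report.extend(self.table_maker.make(index, pipeline, input_features))
        return report
```

The regression and "empty" predicted-values variants mentioned in the comment would slot into the same predicted_values_maker position.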
Left some tiny comments for now, but not quite done looking through yet :'D Looking good though!
"\n", | ||
"For regression problems, the default metric is the absolute difference between the prediction and target value. We can specify our own ranking function by passing in a function to the `metric` parameter. This function will be called on `y_true` and `y_pred`. By convention, lower scores are better.\n", | ||
"\n", | ||
"At the top of each table, you can see the predicted probabilities, target value, and error on that prediction.\n" |
Minor nitpick: "At the top of each table, you can see" --> "At the top of each table, we can see", just to be consistent with the previous paragraphs' use of "us" and "our"?
Sounds good!
"\n", | ||
"For regression problems, the default metric is the absolute difference between the prediction and target value. We can specify our own ranking function by passing in a function to the `metric` parameter. This function will be called on `y_true` and `y_pred`. By convention, lower scores are better.\n", | ||
"\n", | ||
"At the top of each table, you can see the predicted probabilities, target value, and error on that prediction.\n" |
I'm a tiny bit confused since we're jumping back and forth between classification and regression. We talk about regression problems but then mention for the table that we can see predicted probabilities--but we can only see that for classification problems, right?
The behavior is slightly different between regression and classification, so I'm trying to give as much detail as possible. But you're right that since this example is classification, it can be confusing to talk about regression.
Yup, that makes sense! Maybe would just help to restructure (ex: move regression paragraph after example)?
dtypes = ["t", "t"] | ||
alignment = ["c", "c"] | ||
|
||
if include_shap_values: | ||
dtypes.append("f") | ||
alignment.append("c") |
(Nitpick/style)
Could do something like (syntax not guaranteed :p):
dtypes = ["t", "t", "f"] if include_shap_values else ["t", "t"]
alignment = ["c", "c", "c"] if include_shap_values else ["c", "c"]
👍
evalml/tests/pipeline_tests/explanations_tests/test_user_interface.py (outdated; resolved)
Force-pushed from 634e1bc to 6ea9d6a (Compare)
@freddyaboulton this is really great. The guide reads real easy, and the APIs and the tests look solid.
I left a few suggestions on the guide/docs.
I left some comments on organization in explainers.py. I think it's fine to keep most of the impl classes and methods private-named, but perhaps then we should have explainers.py contain only public-named stuff.
I also wonder if there's a way to have explain_predictions_best_worst call explain_predictions; left a couple comments about that.
I left a discussion about the classification pipeline _classes accessor and adding more unit testing. I think it's fine to keep it private-named for now.
Nothing blocking merge IMO. We should resolve the code organization / naming discussions though.
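On the "have explain_predictions_best_worst call explain_predictions" idea, here is one sketch of the shape that could take; both functions below are stand-ins with made-up signatures, not the evalml API, and the ranking assumes the docs' convention that lower scores are better.

```python
import numpy as np

def explain_predictions(pipeline, input_features, indices):
    # Stand-in: the real function would build SHAP tables for the given rows.
    return [f"<explanation for row {i}>" for i in indices]

def explain_predictions_best_worst(pipeline, input_features, scores, num_to_explain=5):
    order = np.argsort(scores)
    best = order[:num_to_explain]           # lowest error = best predictions
    worst = order[-num_to_explain:][::-1]   # highest error = worst predictions
    return {
        "best": explain_predictions(pipeline, input_features, best),
        "worst": explain_predictions(pipeline, input_features, worst),
    }
```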
Force-pushed from 1f05fa2 to 1e104b6 (Compare)
@dsherry I think I addressed all of your feedback! Regarding code organization:
LGTM!
…tions and explain_predictions_best_worst.
…cs with more user-friendly output.
…reference and fixing docs.
…e model understanding user guide.
…ne, a helpful ValueError will be raised.
…e classes to make the structure more clear.
Force-pushed from 277783b to d9b0efb (Compare)
Pull Request Description
Closes #986 #955
Implements the design for explain_predictions and explain_predictions_best_worst in this design document.
Demo of what the user sees
Regression - Boston Housing Dataset
Binary - Breast Cancer Dataset
Multiclass - Iris Dataset
For more examples and the complete output per pipeline, see this PR.
Changes to docs
See the updates here.
After creating the pull request: in order to pass the release_notes_updated check, you will need to update the "Future Release" section of docs/source/release_notes.rst to include this pull request by adding :pr:`123`.