
Explain Predictions #1016

Merged
merged 21 commits into main from 986-explain-best-worst-random-k on Aug 10, 2020

Conversation

@freddyaboulton (Contributor) commented Aug 3, 2020

Pull Request Description

Closes #986 #955

Implements the design for explain_predictions and explain_predictions_best_worst described in this design document.
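
For orientation, a rough sketch of how the two new functions might be called from user code; the argument names (input_features, y_true, num_to_explain) are assumptions based on this description and the docs changes, not a guaranteed API.

from evalml.pipelines.prediction_explanations import (
    explain_predictions,
    explain_predictions_best_worst,
)

# Assumes `pipeline` is an already-fitted evalml pipeline and X, y are the
# pandas feature matrix and target it was trained on.

# Explain a few specific rows.
print(explain_predictions(pipeline, X.iloc[:3]))

# Rank all rows by an error metric and explain the best/worst predictions.
print(explain_predictions_best_worst(pipeline, X, y_true=y, num_to_explain=2))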

Demo of what the user sees

Regression - Boston Housing Dataset: [screenshot of the explanation report]

Binary - Breast Cancer Dataset: [screenshot of the explanation report]

Multiclass - Iris Dataset: [screenshot of the explanation report]

For more examples and the complete output per pipeline, see this PR.

Changes to docs

See the updates here.


After creating the pull request: in order to pass the release_notes_updated check, you will need to update the "Future Release" section of docs/source/release_notes.rst to include this pull request by adding :pr:`123`.

codecov bot commented Aug 3, 2020

Codecov Report

Merging #1016 into main will increase coverage by 0.18%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1016      +/-   ##
==========================================
+ Coverage   99.72%   99.90%   +0.18%     
==========================================
  Files         181      181              
  Lines        9748     9998     +250     
==========================================
+ Hits         9721     9989     +268     
+ Misses         27        9      -18     
Impacted Files Coverage Δ
evalml/pipelines/classification_pipeline.py 100.00% <100.00%> (ø)
...alml/pipelines/prediction_explanations/__init__.py 100.00% <100.00%> (ø)
...pelines/prediction_explanations/_user_interface.py 100.00% <100.00%> (ø)
...ml/pipelines/prediction_explanations/explainers.py 100.00% <100.00%> (ø)
...assification_pipeline_tests/test_classification.py 100.00% <100.00%> (ø)
...peline_tests/explanations_tests/test_explainers.py 100.00% <100.00%> (ø)
...ne_tests/explanations_tests/test_user_interface.py 100.00% <100.00%> (ø)
evalml/tests/component_tests/test_components.py 99.59% <0.00%> (+0.40%) ⬆️
evalml/automl/automl_search.py 99.55% <0.00%> (+0.44%) ⬆️
.../automl_tests/test_automl_search_classification.py 100.00% <0.00%> (+0.45%) ⬆️
... and 5 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update bbc315f...d9b0efb.

@freddyaboulton freddyaboulton force-pushed the 986-explain-best-worst-random-k branch 2 times, most recently from ac9355a to 2410871 Compare August 3, 2020 18:58
@freddyaboulton freddyaboulton marked this pull request as ready for review August 3, 2020 21:32
@@ -132,9 +132,9 @@
"outputs": [],
"source": [
"# get the predicted probabilities associated with the \"true\" label\n",
"y = y.map({'malignant': 0, 'benign': 1})\n",
"y_encoded = y.map({'malignant': 0, 'benign': 1})\n",
freddyaboulton (Contributor, author):
Doing this so that the calls to explain_predictions_best_worst don't break (the pipeline was fit on string labels, but we would then be passing in int labels).

Reviewer (Contributor):
Got it. How about:

  • Move this y_encoded = y.map({'malignant': 0, 'benign': 1}) down into the chunk about prediction explanations
  • Don't change what's passed to graph_precision_recall_curve and graph_roc_curve. Not necessary, right?

freddyaboulton (Contributor, author):

graph_precision_recall_curve and graph_roc_curve need ints, and the prediction explanations need strings (since that is what the pipeline was originally fit on). I think we have to change what gets passed to precision_recall_curve and roc_curve unless we use only int labels in the entire notebook (the original labels are strings)?

@freddyaboulton freddyaboulton added this to the August 2020 milestone Aug 3, 2020

# Classification
if isinstance(shap_values, list):
freddyaboulton (Contributor, author):

This logic got split into _SHAPMultiClassTableMaker, _SHAPBinaryTableMaker, and _SHAPRegressionTableMaker. I think this is more maintainable and easier to understand.

return table_maker(shap_values, normalized_shap_values, top_k, include_shap_values)


def _abs_error(y_true, y_pred):
freddyaboulton (Contributor, author):

_abs_error and _cross_entropy don't have to be defined in this file, but I didn't want to create a new module since they would only be used here.

Reviewer (Contributor):

Hmmm, makes me wonder if they're useful in standard_metrics 🤔

freddyaboulton (Contributor, author):

Yeah, that's what I was thinking too, but the difference between these functions and the ones in standard_metrics is that these return a float for each data point, while the standard_metrics objectives return a single float for the entire CV fold.
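
A minimal sketch of the distinction (illustrative definitions only, not necessarily the exact ones in the PR): the ranking metrics return one score per data point, while a standard_metrics objective collapses everything into a single score for the fold.

import numpy as np

def _abs_error(y_true, y_pred):
    # One score per data point, used to rank individual predictions.
    return np.abs(y_true - y_pred)

def _cross_entropy(y_true_one_hot, y_pred_proba, eps=1e-15):
    # Per-row negative log-likelihood of the true class.
    proba = np.clip(y_pred_proba, eps, 1 - eps)
    return -np.sum(y_true_one_hot * np.log(proba), axis=1)

# By contrast, an objective in standard_metrics would reduce to one float for
# the whole CV fold, e.g. np.mean(np.abs(y_true - y_pred)) for MAE.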



@pytest.mark.parametrize("problem_type", ["binary", "multi"])
def test_pipeline_has_classes_property(logistic_regression_binary_pipeline_class,
freddyaboulton (Contributor, author):

Had to add this test because I added a _classes property to the ClassificationPipeline class.

Reviewer (Contributor):

Got it.

This makes me think: a) we should just bite the bullet and make that property public, and b) can you also do a similar test with an int-type target instead of a str-type one, since those could be treated differently by the encoder if we alter that code in the future?

freddyaboulton (Contributor, author):

I added the test for an int-type target!
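
A hedged sketch of what the int-target variant might look like; X_small is a hypothetical fixture (a small feature matrix with matching rows), and the exact assertion in the PR may differ.

import pandas as pd
import pytest

@pytest.mark.parametrize("target", [
    pd.Series(["malignant", "benign", "benign", "malignant"]),  # str labels
    pd.Series([0, 1, 1, 0]),                                    # int labels
])
def test_classes_property_roundtrips_labels(target, X_small,
                                            logistic_regression_binary_pipeline_class):
    # X_small is a hypothetical fixture; the pipeline fixture appears in the diff above.
    pipeline = logistic_regression_binary_pipeline_class(parameters={})
    pipeline.fit(X_small, target)
    # _classes should report the original label values, whatever their dtype.
    assert set(pipeline._classes) == set(target.unique())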

(ProblemTypes.MULTICLASS, multiclass_no_best_worst_answer)])
@patch("evalml.pipelines.prediction_explanations.explainers._DEFAULT_METRICS")
@patch("evalml.pipelines.prediction_explanations.explainers.explain_prediction")
def test_explain_predictions_custom_index(explain_prediction_mock, mock_default_metrics,
@freddyaboulton (Contributor, author) commented Aug 3, 2020:

I wanted to add a test to make sure the way we index the features dataframe doesn't rely on the dataframe's index. I was afraid custom indexes would break future refactorings of the code, but maybe I'm being paranoid lol.

Reviewer (Contributor):

Amazing. Great thinking. I respect your paranoia 🤣
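
For illustration, a tiny standalone example of the concern (not the PR's actual test): with a non-default index, positional and label-based selection diverge, so the explainer code should select rows positionally.

import pandas as pd

features = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, index=[10, 20, 30])

# Positional selection keeps working regardless of the custom index...
assert features.iloc[0]["a"] == 1

# ...while label-based selection with positional habits would fail:
# features.loc[0]  # KeyError: 0 is not a label in the index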



class _ReportSectionMaker:
"""Make a prediction explanation report.
freddyaboulton (Contributor, author):

I think this docstring helps explain how the following classes fit together:

  • _HeadingMaker
  • _SHAPTableMaker
  • _EmptyPredictedValuesMaker
  • _ClassificationPredictedValuesMaker
  • _RegressionPredictedValuesMaker

I originally wrote a working implementation without this structure, but there were so many slight differences in the expected output of explain_predictions and explain_predictions_best_worst, depending on whether the problem is regression or classification, that I decided breaking those differences out into their own classes was the best way to go.
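
Roughly, the composition described here might look like the following; method and constructor names are assumptions, not the exact implementation.

class _HeadingMaker:
    # One section of the report: a heading for each explained prediction.
    def make_rows(self, index, pipeline, input_features):
        return [f"Prediction for row {index}", ""]

class _ReportSectionMaker:
    # Assembles the report by delegating each section to its maker; the other
    # makers (predicted values, SHAP table) would follow the same make_rows interface.
    def __init__(self, section_makers):
        self.section_makers = section_makers

    def make_report(self, indices, pipeline, input_features):
        rows = []
        for index in indices:
            for maker in self.section_makers:
                rows.extend(maker.make_rows(index, pipeline, input_features))
        return "\n".join(rows)

# Usage sketch: a classification report would swap in the classification
# predicted-values maker; regression would use the regression one.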

@angela97lin (Contributor) left a comment:

Left some tiny comments for now, but not quite done looking through yet :'D Looking good though!

"\n",
"For regression problems, the default metric is the absolute difference between the prediction and target value. We can specify our own ranking function by passing in a function to the `metric` parameter. This function will be called on `y_true` and `y_pred`. By convention, lower scores are better.\n",
"\n",
"At the top of each table, you can see the predicted probabilities, target value, and error on that prediction.\n"
Reviewer (Contributor):

Minor nitpick: At the top of each table, you can see --> At the top of each table, we can see

Just to be consistent with previous paragraphs' use of "us" and "our"?

freddyaboulton (Contributor, author):

Sounds good!

"\n",
"For regression problems, the default metric is the absolute difference between the prediction and target value. We can specify our own ranking function by passing in a function to the `metric` parameter. This function will be called on `y_true` and `y_pred`. By convention, lower scores are better.\n",
"\n",
"At the top of each table, you can see the predicted probabilities, target value, and error on that prediction.\n"
Reviewer (Contributor):

I'm a tiny bit confused since we're jumping back and forth between classification and regression. We talk about regression problems, but then mention that in the table we can see predicted probabilities, which we can only see for classification problems, right?

freddyaboulton (Contributor, author):

The behavior is slightly different between regression and classification, so I'm trying to give as much detail as possible. But you're right that since this example is classification, it can be confusing to talk about regression.

Reviewer (Contributor):

Yup, that makes sense! Maybe it would just help to restructure (e.g., move the regression paragraph after the example)?
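
For reference, a hedged sketch of the custom ranking function described in the quoted doc cell; the metric parameter name comes from that text, while the surrounding call is an assumption.

import numpy as np

# Lower is better, by convention; called on y_true and y_pred.
def squared_error(y_true, y_pred):
    return (np.asarray(y_true) - np.asarray(y_pred)) ** 2

# Hypothetical call, assuming a fitted `pipeline` and pandas X, y:
# report = explain_predictions_best_worst(pipeline, X, y_true=y,
#                                         num_to_explain=2, metric=squared_error)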

Comment on lines 50 to 55
dtypes = ["t", "t"]
alignment = ["c", "c"]

if include_shap_values:
dtypes.append("f")
alignment.append("c")
Reviewer (Contributor):

(Nitpick/style)
Could do something like (syntax not guaranteed :p)

dtypes = ["t", "t", "f"] if include_shap_values else ["t", "t"]
alignment = ["t", "t", "f"] if include_shap_values else ["c", "c"]

freddyaboulton (Contributor, author):

👍

@freddyaboulton freddyaboulton force-pushed the 986-explain-best-worst-random-k branch from 634e1bc to 6ea9d6a Compare August 5, 2020 20:43
@dsherry (Contributor) left a comment:

@freddyaboulton this is really great. The guide reads really easily, and the APIs and the tests look solid.

I left a few suggestions on the guide/docs.

I left some comments on organization in explainers.py. I think it's fine to keep most of the impl classes and methods private-named, but perhaps then explainers.py should contain only public-named stuff.

I also wonder if there's a way to have explain_predictions_best_worst call explain_predictions; I left a couple of comments about that.

I left a discussion about the classification pipeline _classes accessor and adding more unit testing. I think it's fine to keep it private-named for now.

Nothing blocking merge IMO. We should resolve the code organization / naming discussions though.


@freddyaboulton freddyaboulton force-pushed the 986-explain-best-worst-random-k branch from 1f05fa2 to 1e104b6 Compare August 7, 2020 21:41
@freddyaboulton (Contributor, author) commented Aug 7, 2020

@dsherry I think I addressed all of your feedback!

Regarding code organization:

  • I created a _TableMaker base class. _SHAPRegressionTableMaker, _SHAPBinaryTableMaker, and _SHAPMultiClassTableMaker now inherit from _TableMaker.
  • I moved the implementation code that was in explainers to _user_interface.py. The only slight hiccup is that, before, _SHAPTableMaker called explain_prediction. To avoid circular imports, I created a new function called _make_single_prediction_shap_table. Both explain_prediction and _SHAPTableMaker now call this function (see the sketch after this list).
  • I created a _SectionMaker base class. _HeadingMaker, _EmptyPredictedValuesMaker, _ClassificationPredictedValuesMaker, _RegressionPredictedValuesMaker, and _SHAPTableMaker now inherit from _SectionMaker.
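
A loose sketch of that organization, with assumed signatures rather than the exact ones in the PR:

# _user_interface.py (sketch)

def _make_single_prediction_shap_table(pipeline, features, top_k=3,
                                       include_shap_values=False):
    # Shared helper so both explain_prediction and _SHAPTableMaker can build a
    # table without importing each other (this is what avoids the circular import).
    ...

class _TableMaker:
    # Base class: format SHAP values for one problem type into a table.
    def make_table(self, shap_values, normalized_shap_values, top_k, include_shap_values):
        raise NotImplementedError

class _SHAPRegressionTableMaker(_TableMaker): ...
class _SHAPBinaryTableMaker(_TableMaker): ...
class _SHAPMultiClassTableMaker(_TableMaker): ...


# explainers.py (sketch) -- public API only; it imports the helper from
# _user_interface rather than the other way around.
def explain_prediction(pipeline, input_features, top_k=3, include_shap_values=False):
    return _make_single_prediction_shap_table(pipeline, input_features,
                                              top_k=top_k,
                                              include_shap_values=include_shap_values)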

@dsherry (Contributor) left a comment:

LGTM!

@freddyaboulton freddyaboulton merged commit 49556bb into main Aug 10, 2020
@freddyaboulton freddyaboulton deleted the 986-explain-best-worst-random-k branch August 10, 2020 19:17
@dsherry dsherry mentioned this pull request Aug 25, 2020
Labels: none yet
Projects: none yet
Development

Successfully merging this pull request may close these issues.

Explain Best/Worst K Predictions
4 participants