Adding Feature Value column to SHAP table. #1064
Conversation
Codecov Report

@@            Coverage Diff            @@
##              main    #1064    +/- ##
========================================
  Coverage    99.91%   99.91%
========================================
  Files          188      191       +3
  Lines        10296    10852     +556
========================================
+ Hits         10287    10843     +556
  Misses           9        9

Continue to review full report at Codecov.
This was a really good suggestion from the demos and this LGTM! 😁
…del_understanding_tests.
…of github.com:FeatureLabs/evalml into 1035-add-feature-values-column-to-explanation-reports
Great! I left a question about how the call to PipelineBase._transform is happening now, a suggestion about using a pandas dataframe instead of a dict for table_maker, a suggestion about rounding, and a suggested simplification of some test code. None are blocking, though.
@@ -47,14 +47,13 @@ def _compute_shap_values(pipeline, features, training_data=None):
        if estimator.model_family == ModelFamily.BASELINE:
            raise ValueError("You passed in a baseline pipeline. These are simple enough that SHAP values are not needed.")

-       pipeline_features = pipeline._transform(features)
How are you able to delete this? Are you now calling PipelineBase._transform one step up the stack?
Yea, moved it to _make_single_prediction_shap_table to only call it once.
        row = [feature_name, display_text]
        feature_value = pipeline_features[feature_name]
        if pd.api.types.is_number(feature_value):
            feature_value = np.round(feature_value, 2)
Is the rounding necessary? I think '{0:.2f}'.format(feature_value) would be better. That way, you're guaranteed the output is limited to at most 2 digits after the decimal point.
You're 100% right! Much better than calling round.
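A small illustration of the difference between the two approaches discussed above (the sample value is made up): np.round keeps the value numeric, while string formatting pads to a fixed number of decimal places.

```python
import numpy as np

# Hypothetical sample value, just to show the two approaches side by side.
feature_value = 1.2

# np.round keeps the value numeric, so the trailing zero is lost when printed.
rounded = np.round(feature_value, 2)

# String formatting guarantees exactly two digits after the decimal point.
formatted = "{0:.2f}".format(feature_value)

print(rounded)    # 1.2
print(formatted)  # 1.20
```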
-        dtypes = ["t", "t", "f"] if include_shap_values else ["t", "t"]
-        alignment = ["c", "c", "c"] if include_shap_values else ["c", "c"]
+        dtypes = ["t", "f", "t", "f"] if include_shap_values else ["t", "t", "t"]
+        alignment = ["c", "c", "c", "c"] if include_shap_values else ["c", "c", "c"]
What's this doing?
Setting the formatting for the columns. But now that we are using string formatting to round rather than numpy, we can use text for each column, which will simplify these lines.
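A sketch of the simplification described above. The "t"/"c" codes come from the diff; the helper name and everything else here are assumptions for illustration.

```python
def make_table_format(include_shap_values):
    # With string formatting handling the rounding, every column is plain
    # text ("t") and centered ("c"); only the column count changes.
    n_cols = 4 if include_shap_values else 3
    dtypes = ["t"] * n_cols
    alignment = ["c"] * n_cols
    return dtypes, alignment

print(make_table_format(True))   # (['t', 't', 't', 't'], ['c', 'c', 'c', 'c'])
print(make_table_format(False))  # (['t', 't', 't'], ['c', 'c', 'c'])
```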
@@ -136,14 +142,18 @@ def _make_single_prediction_shap_table(pipeline, input_features, top_k=3, training_data=None):
        shap_values = _compute_shap_values(pipeline, input_features, training_data)
        normalized_shap_values = _normalize_shap_values(shap_values)

+       # We need a dict of type {column_name: feature value}
+       pipeline_features = pipeline._transform(input_features)
+       features_dict = dict(zip(pipeline_features.columns, *pipeline_features.values))
Why do you need a dict? Why not pass in the pandas df and index into it? The fewer data structures we have floating around, the better.
I made this change!
Thanks, great!
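A hypothetical sketch of the suggestion above, contrasting the dict built in the diff with indexing the one-row pipeline_features DataFrame directly (the frame here is sample data):

```python
import pandas as pd

# Sample stand-in for the one-row output of pipeline._transform.
pipeline_features = pd.DataFrame({"a": [10], "b": [20], "c": [30]})

# Approach from the diff: flatten the single row into a dict.
features_dict = dict(zip(pipeline_features.columns, *pipeline_features.values))

# Suggested approach: look the value up straight from the frame.
feature_value = pipeline_features["b"].iloc[0]

print(features_dict["b"], feature_value)  # 20 20
```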
@@ -18,7 +18,11 @@
def compare_two_tables(table_1, table_2):
    assert len(table_1) == len(table_2)
    for row, row_answer in zip(table_1, table_2):
        assert row.strip().split() == row_answer.strip().split()
        # To make it easier to compare header underline
Oh, so this isn't fixed-length? Or something you could easily calculate in the test?
I got lazy, but I fixed it now by setting the correct length in the answer and removing this line 😬
@@ -96,6 +100,7 @@ def test_explain_prediction(mock_normalize_shap_values,
    pipeline = MagicMock()
    pipeline.problem_type = problem_type
    pipeline._classes = ["class_0", "class_1", "class_2"]
    pipeline._transform.return_value = pd.DataFrame({"a": [10], "b": [20], "c": [30], "d": [40]})
Cool. Could be nice to have the DF have more than one row, to make sure indexing works. Not sure if that's covered elsewhere.
By the time we get to this point in the code, the df is guaranteed to only have one row. I'll add a comment to the test to explain!
Ah ok thanks
    if include_string_features:
        pipeline_features["a"] = "foo-feature"
        pipeline_features["b"] = np.datetime64("2020-08-14")
Nice, great that you included datetimes and something other than numeric and string.
    if include_shap_values:
        new_answer = copy.deepcopy(answer)
        for row in new_answer:
            row.append(values[row[0]][0])
    else:
        new_answer = answer

    assert _make_rows(values, values, top_k, include_shap_values) == new_answer
    if include_string_features:
        new_answer = copy.deepcopy(new_answer)
This is a no-op, right?
Yea, deleted this since I went with your suggestion below!
        if row[0] == "a":
            row[1] = "foo-feature"
        elif row[0] == "b":
            row[1] = "2020-08-14"
I had to rack my brain to remember if this would work in Python!
Suggested simplification:

filtered_answer = []
for row in new_answer:
    val = row[1]
    if row[0] == "a":
        val = "foo-feature"
    elif row[0] == "b":
        val = "2020-08-14"
    filtered_answer.append((row[0], val))
I went with your suggestion! Although it looks a little bit different because row has four values, and so we want to append four values to filtered_answer, not just two values.
    c 1.000 +  0.000
    d -1.560 -- -2.560
    e -1.800 -- -2.800
    f -1.900 -- -2.900""".splitlines()
These reports look good!
@dsherry Thanks for the feedback - I addressed your comments!
Pull Request Description
Closes #1035
Docs Changes
Here
Demo for Regression Problem
After creating the pull request: in order to pass the release_notes_updated check, you will need to update the "Future Release" section of docs/source/release_notes.rst to include this pull request by adding :pr:`123`.