In [None]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import eli5
import graphviz
import shap
shap.initjs()
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from eli5.sklearn import PermutationImportance
from matplotlib import pyplot as plt
from pdpbox import pdp

from visualization_utils import load_notebook_config, show_feature_importance
load_notebook_config(static=False)

# ML explainability

## Agenda

- Motivation

- Permutation importance

- Partial Dependence Plots

- SHAP


## Motivation


Explore techniques to extract the following insights from machine learning models:

- What features in the data did the model think are most important?

- For any single prediction from a model, how did each feature in the data affect that particular prediction?

- How does each feature affect the model's predictions in a big-picture sense (what is its typical effect when considered over a large number of possible predictions)?

## Motivation


Why Are These Insights Valuable?

- Debugging

- Informing feature engineering

- Directing future data collection

- Informing human decision-making

- Building Trust

## Feature importance

- What features have the biggest impact on predictions?

- It only tells you which features contribute to the decision, not in which direction ("which way").

- Permutation importance: a feature importance technique.

## Permutation importance

- Fast to calculate

- Widely used and understood

- It is calculated after a model has been fitted

### How it works

- If I randomly shuffle a single column of the validation data, leaving the target and all other columns in place, how would that affect the accuracy of predictions in that now-shuffled data?


- Example: "We want to predict a person's height when they become 20 years old, using data that is available at age 10."

<img src="../images/permutation_importance_example_1.png" width="800" height="400">

### The process


1. Get a trained model.

2. Shuffle the values in a single column, make predictions using the resulting dataset. Use these predictions and the true target values to calculate how much the loss function suffered from shuffling. That performance deterioration measures the importance of the variable you just shuffled.

3. Return the data to the original order (undoing the shuffle from step 2). Now repeat step 2 with the next column in the dataset, until you have calculated the importance of each column.
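The three steps above can be sketched from scratch with NumPy. This is only an illustration under stated assumptions, not how eli5 implements it: `permutation_importance_sketch`, `toy_predict`, and the toy data are hypothetical names invented here, and accuracy stands in for the loss function.

```python
import numpy as np

def permutation_importance_sketch(predict, X, y, n_repeats=5, seed=0):
    """Return the mean and std of the accuracy drop caused by shuffling each column."""
    rng = np.random.default_rng(seed)
    baseline = np.mean(predict(X) == y)            # step 1: score of the trained model
    drops = np.zeros((X.shape[1], n_repeats))
    for col in range(X.shape[1]):                  # step 3: repeat for every column
        for rep in range(n_repeats):
            X_shuffled = X.copy()                  # start again from the original order
            X_shuffled[:, col] = rng.permutation(X_shuffled[:, col])  # step 2: shuffle one column
            drops[col, rep] = baseline - np.mean(predict(X_shuffled) == y)
    return drops.mean(axis=1), drops.std(axis=1)

# Toy data (assumption): the label depends only on column 0, column 1 is pure noise.
toy_rng = np.random.default_rng(42)
X_toy = toy_rng.normal(size=(500, 2))
y_toy = (X_toy[:, 0] > 0).astype(int)

def toy_predict(X):
    # Hypothetical stand-in for a fitted model's predict method.
    return (X[:, 0] > 0).astype(int)

mean_drop, std_drop = permutation_importance_sketch(toy_predict, X_toy, y_toy)
print("mean accuracy drop per column:", mean_drop)
print("std across reshuffles:", std_drop)
```

Shuffling column 0 destroys the signal, so its accuracy drop is large. Column 1 is never used by the stand-in model, so its drop is zero.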

### Code example with eli5 library


- The idea is to use a model that predicts whether a football team will have the "Man of the Match" winner, based on the team's statistics.

- https://www.kaggle.com/mathan/fifa-2018-match-statistics

In [None]:
fifa2018 = pd.read_csv('../data/FIFA_2018_Statistics.csv')
print(fifa2018.shape)
fifa2018.head(2)

### Code example with eli5 library

In [None]:
y = (fifa2018['Man of the Match'] == "Yes")  # Convert from string "Yes"/"No" to binary
feature_names = [i for i in fifa2018.columns if fifa2018[i].dtype in [np.int64]]
X = fifa2018[feature_names]
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
my_model = RandomForestClassifier(n_estimators=100, random_state=0).fit(train_X, train_y)

In [None]:
perm = PermutationImportance(my_model, random_state=1).fit(val_X, val_y)
eli5.show_weights(perm, feature_names = val_X.columns.tolist())

##### The first number in each row shows how much model performance decreased with a random shuffling.

##### The number after the ± measures how performance varied from one reshuffling to the next.

##### You'll occasionally see negative values for permutation importances. In those cases, the predictions on the shuffled (or noisy) data happened to be more accurate than those on the real data, which occurs by chance when the feature matters little.

### Warnings


- Codependent features tend to share importance.

- Permuting a column is fast because the model is not retrained, but it can introduce nonsensical observations by shuffling invalid values into records (e.g., moving a "pregnant = true" value into a male’s record).


(Warnings source: Martin's feature importance docs :D)
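The first warning can be illustrated with a small NumPy experiment. It is a sketch under assumptions: the data is synthetic, and `predict_single` and `predict_dup` are hypothetical stand-ins for fitted models, the second one relying equally on two identical copies of the same signal.

```python
import numpy as np

demo_rng = np.random.default_rng(0)
signal = demo_rng.normal(size=2000)
y_demo = (signal > 0).astype(int)

def accuracy_drop(predict, X, y, col, rng):
    """Accuracy lost when a single column is shuffled."""
    base = np.mean(predict(X) == y)
    X_shuffled = X.copy()
    X_shuffled[:, col] = rng.permutation(X_shuffled[:, col])
    return base - np.mean(predict(X_shuffled) == y)

# Case A: a single column carries the signal.
X_single = signal.reshape(-1, 1)

def predict_single(X):
    return (X[:, 0] > 0).astype(int)

# Case B: the same signal appears twice and the model uses both copies equally.
X_dup = np.column_stack([signal, signal])

def predict_dup(X):
    return (X[:, 0] + X[:, 1] > 0).astype(int)

drop_single = accuracy_drop(predict_single, X_single, y_demo, 0, demo_rng)
drop_dup = [accuracy_drop(predict_dup, X_dup, y_demo, c, demo_rng) for c in (0, 1)]
print("single column:", drop_single)
print("duplicated columns:", drop_dup)
```

Each duplicated column shows a smaller drop than the single column does, because the intact copy still carries the information when the other is shuffled.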

## Partial Dependence Plots

- While feature importance shows WHAT VARIABLES most affect predictions, partial dependence plots show HOW A FEATURE affects predictions.

This is useful to answer questions like:

- Controlling for all other house features, what impact do longitude and latitude have on home prices? To restate this, how would similarly sized houses be priced in different areas?

### How it works

- Like permutation importance, partial dependence plots are calculated after a model has been fit.

- We take a single row of data and use the fitted model to predict the outcome (the probability that the team's player won "Man of the Match").

- We then repeatedly alter the value of one variable to make a series of predictions. For instance, if Ball Possession % is 50 for that row, we also make predictions with other possible values: 20, 30, 60, 70.

### How it works


- We trace out predicted outcomes (on the vertical axis) as we move from small values of ball possession to large values (on the horizontal axis).

- Interactions between features may cause the plot for a single row to be atypical. So, we repeat that mental experiment with multiple rows from the original dataset, and we plot the average predicted outcome on the vertical axis.
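The procedure above can be sketched by hand as follows. This is not the pdpbox implementation: `partial_dependence_sketch` and `toy_predict_proba` are hypothetical names, and the stand-in model and data are invented so the example runs on its own.

```python
import numpy as np

def partial_dependence_sketch(predict, X, col, grid):
    """Average prediction over all rows, with column `col` forced to each grid value."""
    averages = []
    for value in grid:
        X_modified = X.copy()
        X_modified[:, col] = value              # overwrite the feature in every row
        averages.append(predict(X_modified).mean())
    return np.array(averages)

# Synthetic data (assumption): column 0 plays the role of ball possession.
pd_rng = np.random.default_rng(0)
X_pd = pd_rng.uniform(0, 100, size=(200, 2))

def toy_predict_proba(X):
    # Hypothetical model: the probability rises with column 0 and saturates at 60.
    return 0.8 * np.clip(X[:, 0] / 60.0, 0, 1) + 0.002 * X[:, 1]

grid = [20, 30, 50, 60, 70]
pd_curve = partial_dependence_sketch(toy_predict_proba, X_pd, col=0, grid=grid)
for value, avg in zip(grid, pd_curve):
    print(value, round(float(avg), 3))
```

Plotting `pd_curve` against `grid` gives the partial dependence curve. Here it rises until 60 and is flat afterwards.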

### Code Example

In [None]:
tree_model = DecisionTreeClassifier(random_state=0, max_depth=5, min_samples_split=5).fit(train_X, train_y)
tree_graph = tree.export_graphviz(tree_model, out_file=None, feature_names=feature_names)
graphviz.Source(tree_graph)

In [None]:
warnings.filterwarnings('ignore', module="matplotlib")
pdp_goals = pdp.pdp_isolate(model=tree_model, dataset=val_X, model_features=feature_names, feature='Goal Scored')
pdp.pdp_plot(pdp_goals, 'Goal Scored')
plt.show()

- The y axis shows the change in the prediction relative to the prediction at the baseline (leftmost) value.

- The blue shaded area indicates the level of confidence.

- From this particular graph, we see that scoring a goal substantially increases your chances of winning "Man of the Match". But extra goals beyond that appear to have little impact on predictions.

In [None]:
rf_model = RandomForestClassifier(random_state=0).fit(train_X, train_y)
pdp_dist = pdp.pdp_isolate(model=rf_model, dataset=val_X, model_features=feature_names, feature='Distance Covered (Kms)')
pdp.pdp_plot(pdp_dist, 'Distance Covered (Kms)')
plt.show()

### 2D Partial Dependence Plots


- To see interactions between features.

Library bug fix:

- https://github.com/SauceCat/PDPbox/commit/73c69665f1663b53984e187c7bc8996e25fea18e

- In pdp_plot_utils.py, replace line 251

        inter_ax.clabel(c2, contour_label_fontsize=fontsize, inline=1)

  with

        inter_ax.clabel(c2, fontsize=fontsize, inline=1)

In [None]:
features_to_plot = ['Goal Scored', 'Distance Covered (Kms)']
inter1 = pdp.pdp_interact(model=tree_model, dataset=val_X, model_features=feature_names, features=features_to_plot)
pdp.pdp_interact_plot(pdp_interact_out=inter1, feature_names=features_to_plot, plot_type='contour')
plt.show()

- We see the highest predictions when a team scores at least 1 goal and runs a total distance close to 100 km.

- If they score 0 goals, distance covered doesn't matter.

- But distance can impact predictions if they score goals.
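The same averaging idea extends to two features at once. The sketch below is an illustration under assumptions, not pdpbox's code: `partial_dependence_2d_sketch` and `toy_predict_2d` are hypothetical, and the stand-in model is constructed so that distance matters only when at least one goal is scored.

```python
import numpy as np

def partial_dependence_2d_sketch(predict, X, col_a, grid_a, col_b, grid_b):
    """Average prediction with two columns forced to every pair of grid values."""
    surface = np.zeros((len(grid_a), len(grid_b)))
    for i, a in enumerate(grid_a):
        for j, b in enumerate(grid_b):
            X_mod = X.copy()
            X_mod[:, col_a] = a
            X_mod[:, col_b] = b
            surface[i, j] = predict(X_mod).mean()
    return surface

# Synthetic data (assumption): goals, distance in km, and one unrelated feature.
rng_2d = np.random.default_rng(0)
X_2d = np.column_stack([
    rng_2d.integers(0, 4, size=100),
    rng_2d.uniform(80, 120, size=100),
    rng_2d.uniform(0, 1, size=100),
])

def toy_predict_2d(X):
    # Hypothetical model with an interaction between goals and distance.
    goals, distance, other = X[:, 0], X[:, 1], X[:, 2]
    with_goal = 0.4 + 0.3 * np.exp(-((distance - 100) / 10) ** 2)
    return np.where(goals >= 1, with_goal, 0.2) + 0.05 * other

grid_goals = [0, 1, 2]
grid_distance = [80, 90, 100, 110, 120]
surface = partial_dependence_2d_sketch(toy_predict_2d, X_2d, 0, grid_goals, 1, grid_distance)
print(np.round(surface, 3))
```

Each row of `surface` corresponds to a number of goals and each column to a distance. The row for 0 goals is flat, while the other rows peak at 100 km.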

## SHAP

- But what if you want to break down how the model works for an individual prediction?

- SHAP Values break down a prediction to show the impact of each feature.

Where could you use this?

- A model says a bank shouldn't loan someone money, and the bank is legally required to explain the basis for each loan rejection

- A healthcare provider wants to identify what factors are driving each patient's risk of some disease so they can directly address those risk factors with targeted health interventions

### How it works

- SHAP values interpret the impact of having a certain value for a given feature in comparison to the prediction we'd make if that feature took some baseline value.

- Property: sum(SHAP values for all features) = prediction - pred_for_baseline_values

    - That is, the SHAP values of all features sum up to explain why my prediction was different from the baseline.

- The base value is the average model output (based on the provided training data).
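The additivity property can be checked on a tiny example by computing exact Shapley values by brute force. This sketch makes simplifying assumptions: a single baseline row stands in for the "missing" feature values, and `shapley_values_sketch`, `toy_model`, and the numbers are hypothetical. The shap library uses much more efficient algorithms and a background dataset rather than one row.

```python
from itertools import combinations
from math import factorial

def shapley_values_sketch(predict, x, baseline):
    """Exact Shapley values, with absent features replaced by their baseline value."""
    n = len(x)

    def value(subset):
        # Features in `subset` keep their real value, the rest take the baseline value.
        mixed = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                total += weight * (value(set(subset) | {i}) - value(set(subset)))
        phi.append(total)
    return phi

def toy_model(row):
    # Hypothetical model with an interaction between goals and possession.
    goals, possession, fouls = row
    bonus = 0.05 * goals if possession > 50 else 0.0
    return 0.1 * goals + 0.004 * possession - 0.01 * fouls + bonus

x_row = [2, 60, 10]          # the row being explained (made-up values)
baseline_row = [1, 50, 14]   # the baseline values (made-up values)
phi = shapley_values_sketch(toy_model, x_row, baseline_row)

print("SHAP-style values:", phi)
print("sum of values:", sum(phi))
print("prediction - baseline prediction:", toy_model(x_row) - toy_model(baseline_row))
```

The last two printed numbers agree up to floating-point error, which is the property stated above.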

### Example

In [None]:
row_to_show = 5
data_for_prediction = val_X.iloc[row_to_show]
rf_model.predict_proba(data_for_prediction.values.reshape(1, -1))

In [None]:
# Create object that can calculate shap values
explainer = shap.TreeExplainer(rf_model)
# Calculate Shap values
shap_values = explainer.shap_values(data_for_prediction)

# The shap_values object above is a list with two arrays (for regression problems there is only one array).
print(len(shap_values))

# The first array is the SHAP values for a negative outcome (don't win the award),
# and the second array is the list of SHAP values for the positive outcome (wins the award).
print(len(shap_values[0]), len(shap_values[1]))

#### Force / Decision Plot

- To understand individual predictions

In [None]:
shap.force_plot(explainer.expected_value[1], shap_values[1], data_for_prediction)

- If you subtract the length of the blue bars from the length of the red bars, it equals the distance from the base value to the output.
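The arithmetic behind that statement can be written out with a few made-up numbers. The contributions and base value below are hypothetical, chosen only to illustrate the bookkeeping, and are not output from the model above.

```python
# Hypothetical SHAP values for one prediction (illustrative numbers only).
contributions = {
    "Goal Scored": 0.12,
    "Ball Possession %": 0.05,
    "Fouls Committed": -0.04,
}
base_value = 0.50

red_length = sum(v for v in contributions.values() if v > 0)    # pushes the prediction up
blue_length = -sum(v for v in contributions.values() if v < 0)  # pushes the prediction down

output = base_value + sum(contributions.values())
print("red - blue:", red_length - blue_length)
print("output - base value:", output - base_value)
```

Both printed differences are the same quantity: the net push away from the base value.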

About Explainers:

- The SHAP package provides explainers for different types of models.

- TreeExplainer works with tree-based models.

- DeepExplainer works with deep learning models.

- KernelExplainer works with all models, though it is slower than the other explainers and gives an approximation rather than exact SHAP values.

#### Summary plot 


- Each dot has three characteristics:

    - Vertical location shows what feature it is depicting
    - Color shows whether that feature was high or low for that row of the dataset
    - Horizontal location shows whether the effect of that value caused a higher or lower prediction.

In [None]:
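shap_values_all = explainer.shap_values(val_X)  # SHAP values for every validation row
shap.summary_plot(shap_values_all[1], val_X, max_display=10)  # index 1: positive class (wins the award)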