SHAP-demo-python
================



## Introduction to SHAP values and ML interpretability



Notebook based on the DataCamp tutorial by Abid Ali Awan ([June 2023](https://www.datacamp.com/tutorial/introduction-to-shap-values-machine-learning-interpretability)).



## Overview



SHAP values measure how much each feature or predictor of a predictive
algorithm contributes to the model's prediction. SHAP values can help
you see which features affect the model's outcome the most.

This demo briefly explains SHAP values in general and in machine
learning (ML), and shows how to implement them in Python using a
supervised Random Forest classifier for customer churn (or attrition)
i.e. a prediction of how many customers a company is going to lose
over time based on observed customer features that might lead to
churn.



## What are SHAP values?



SHAP (SHapley Additive exPlanations) values explain the output of any
ML model using a game theoretic approach to establish the contribution
of each predictor to the final outcome.

SHAP values are model-agnostic, which means that they can be used to
interpret any ML model, whether white-box (transparent) or black-box
(intransparent).



### The properties of SHAP values



1.  Additivity: SHAP values are additive, which means that the
    contribution of each feature to the final prediction can be
    computed independently and then summed up. This property allows for
    efficient computation of SHAP values, even for high-dimensional
    datasets.
2.  Local accuracy: SHAP values add up to the difference between the
    expected model output and the actual output for a given input. This
    means that SHAP values provide an accurate and local interpretation
    of the model's prediction for a given input.
3.  Missingness: SHAP values are zero for missing or irrelevant
    features for a prediction. This makes SHAP values robust to missing
    data and ensures that irrelevant features do not distort the
    interpretation.
4.  Consistency: SHAP values do not change when the model changes
    unless the contribution of a feature changes. This means that SHAP
    values provide a consistent interpretation of the model's behavior,
    even when the model parameters change.



### Using SHAP values



Apart from machine learning interpretability and explainability, SHAP
value can be used for:

-   Model debugging. By examining the SHAP values, we can identify any
    biases or outliers in the data that may be causing the model to make
    mistakes.
-   Feature importance. Identifying and removing low-impact features can
    create a more optimized model.
-   Anchoring explanations. We can use SHAP values to explain individual
    predictions by highlighting the essential features that caused that
    prediction. It can help users understand and trust a model's
    decisions.
-   Model summaries. It can provide a global summary of a model in the
    form of a SHAP value summary plot. It gives an overview of the most
    important features across the entire dataset.
-   Detecting biases. The SHAP value analysis helps identify if certain
    features disproportionately affect particular groups. It enables the
    detection and reduction of discrimination in the model.
-   Fairness auditing. It can be used to assess a model's fairness and
    ethical implications.
-   Regulatory approval. SHAP values can help gain regulatory approval
    by explaining the model's decisions.



## Setting up the Python environment



First, import libraries for SHAP, data frame manipulation (`pandas`,
numerical computation (`numpy`) and machine learning (`sklearn`):



In [1]:
import shap
import pandas as pd
import numpy as np
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

## Read CSV data into data frame



Create a data frame `customer` from the CSV data:



In [1]:
customer = pd.read_csv("data/customer_churn.csv")
print(customer.head())
print(customer)

#+begin_example
   Call Failure  Complaints  Subscription Length  ...  Age  Customer Value  Churn
0             8           0                   38  ...   30         197.640      0
1             0           0                   39  ...   25          46.035      0
2            10           0                   37  ...   30        1536.520      0
3            10           0                   38  ...   15         240.020      0
4             3           0                   38  ...   15         145.805      0

[5 rows x 14 columns]
      Call Failure  Complaints  Subscription Length  ...  Age  Customer Value  Churn
0                8           0                   38  ...   30         197.640      0
1                0           0                   39  ...   25          46.035      0
2               10           0                   37  ...   30        1536.520      0
3               10           0                   38  ...   15         240.020      0
4                3           0              

## Model training and evaluation



Create a data frame `X` of independent variables or predictors, and a
vector `y` for the `customer.Churn` that we want to predict:



In [1]:
X = customer.drop("Churn", axis=1)
y = customer.Churn
print(X.columns)

# fancier printout - like str() in R
[print(f'{col}: {X[col].tolist()}') for col in X.columns]

predictors
#+begin_example
Index(['Call Failure', 'Complaints', 'Subscription Length', 'Charge Amount',
       'Seconds of Use', 'Frequency of use', 'Frequency of SMS',
       'Distinct Called Numbers', 'Age Group', 'Tariff Plan', 'Status', 'Age',
       'Customer Value'],
      dtype='object')
Call Failure: [8, 0, 10, 10, 3, 11, 4, 13, 7, 7, 6, 9, 25, 4, 9, 3, 0, 2, 0, 3, 7, 8, 23, 21, 13, 1, 9, 9, 0, 1, 3, 0, 0, 25, 6, 2, 3, 6, 4, 4, 11, 0, 10, 2, 2, 16, 12, 8, 2, 3, 11, 3, 13, 13, 6, 14, 7, 16, 10, 10, 9, 12, 28, 7, 12, 6, 3, 5, 3, 6, 10, 11, 26, 24, 16, 4, 12, 12, 3, 4, 6, 3, 3, 28, 9, 5, 6, 9, 7, 7, 14, 3, 13, 5, 5, 19, 15, 11, 5, 6, 5, 0, 7, 7, 0, 8, 1, 10, 4, 4, 3, 6, 22, 1, 6, 0, 0, 0, 0, 0, 4, 5, 20, 18, 10, 0, 6, 6, 0, 0, 0, 0, 0, 22, 3, 0, 0, 3, 1, 1, 8, 0, 7, 0, 0, 13, 9, 5, 0, 0, 9, 1, 11, 11, 4, 12, 5, 14, 8, 8, 7, 10, 26, 5, 10, 4, 1, 3, 1, 4, 8, 9, 24, 22, 14, 2, 10, 10, 1, 2, 4, 1, 1, 26, 7, 3, 4, 7, 5, 5, 12, 1, 11, 3, 3, 17, 13, 9, 3, 4, 7, 0, 9, 9, 2, 10, 3, 12, 6, 

Split the data frame and the vector into a training (70%) and a test
(30%) data set of randomly mixed up customer records:



In [1]:
X_train, X_test, y_train, y_test =\
    train_test_split(X, y, test_size=0.4, random_state=1)

Train a machine learning model (a random forest ensemble, i.e. a group
of [decision trees](https://scikit-learn.org/stable/modules/tree.html) - rule-based classifiers - see [illustration](https://github.com/birkenkrahe/ai23/blob/main/img/randomForest.png)):



In [1]:
clf = RandomForestClassifier()
clf.fit(X_train, y_train)

Make a prediction on the testing data:



In [1]:
y_pred = clf.predict(X_test)

Build and print a classification report based on comparing the
predicted churn from the testing data with the (known) values:



In [1]:
print(classification_report(y_pred,y_test))

precision    recall  f1-score   support
:
           0       0.98      0.96      0.97      1091
           1       0.77      0.85      0.81       169
:
    accuracy                           0.95      1260
   macro avg       0.87      0.91      0.89      1260
weighted avg       0.95      0.95      0.95      1260

The model shows high (> 90%) accuracy. Here,

-   `support` shows the nubmer of samples in each class in the data set,
    customers who do (1) or do not (0) churn.
-   `accuracy` shows the overall ability to predict the correct class:
    here, the (weighted) average is a measure for the average difference
    between predicted and (known) values in the test dataset after
    training the model on the training data - weighted by the number of
    samples in each class.



### Setting



## Setting up the SHAP explainer



The SHAP package ([documentation](https://shap.readthedocs.io/en/latest/example_notebooks/overviews/An%20introduction%20to%20explainable%20AI%20with%20Shapley%20values.html)) contains functions to compute the
SHAP values. We will calculate the SHAP values for the Random Forest model, and
visualize feature contributions.

Initialize SHAP display:



In [1]:
shap.initjs()

Set up an explainer object, then calculate SHAP value using the test
set:



In [1]:
explainer = shap.Explainer(clf)
shap_values = explainer.shap_values(X_test)

## Explainer visualization



SHAP values can be plotted in various ways to show contributions of
the predictors to the prediction, their dependencies, the "force" of
their impact on the prediction, and a "decision" plot - all of whom
contain pretty much the same information.



### Summary plots



Display the summary plot using SHAP values and whole testing data set:



In [1]:
shap.summary_plot(shap_values,X_test)

According to the plot, which features dominate the churn decision?

Display only the summary plot of the label "0" (no churn):



In [1]:
shap.summary_plot(shap_values[0],X_test)

According to the plot, which feature's negative SHAP value has the
greatest impact on the churn?

Explanation:

-   Y-axis indicates the feature names in order of importance from top
    to bottom.
-   X-axis represents the SHAP value, which indicates the degree of
    change in log odds.
-   The color of each point on the graph represents the value of the
    corresponding feature, with red indicating high values and blue
    indicating low values.
-   Each point represents a row of data from the original dataset.

If you look at the feature “Complaints ', you will see that it is
mostly high with a negative SHAP value. It means higher complaint
counts tend to negatively affect the output.

Note: for label “1” the visualization will be flipped. Try that:



In [1]:
shap.summary_plot(shap_values[1],X_test)

### Dependence plot



A dependence plot is a type of scatter plot that displays how a
model's predictions are affected by a specific feature (Subscription
Length). The plot shows that, on average, subscription lengths have a
mostly positive (blue) effect on the model, i.e. low churn values for
customers with longer subscription length. The third dimension,
indicated by the color, is age.



In [1]:
shap.dependence_plot("Subscription Length",
                     shap_values[0],
                     X_test,
                     interaction_index="Age")

Vary the second variable - choose e.g. `Tariff Plan` instead of `Age`.



### Force plot



We will examine the first sample in the testing set to determine which
features contributed to the "0" (no churn) result. To do this, we will
utilize a force plot and provide the expected value, SHAP value, and
testing sample.



In [1]:
shap.plots.force(explainer.expected_value[0],
                 shap_values[0][0,:],
                 X_test.iloc[0, :],
                 matplotlib = True)

We can clearly see that zero complaints and zero call failures have
contributed to keeping customers (negative loss on the left hand
side of the force plot).

Next, look at the customer churn samples with the label "1":



In [1]:
shap.plots.force(explainer.expected_value[1],
                 shap_values[1][6, :],
                 X_test.iloc[6, :],
                 matplotlib = True)

You can see all of the features with the value and magnitude that have
contributed to a loss of customers. It seems that even one unresolved
complaint can cost a telecommunications company a customer.



### Decision plot



Lastly, the decision plot: it shows the model decisions by mapping the
cumulative SHAP values for each prediction. First for the label "1":



In [1]:
shap.decision_plot(explainer.expected_value[1],
                   shap_values[1],
                   X_test.columns)

Display the decision plot for the label "0":



In [1]:
shap.decision_plot(explainer.expected_value[0],
                   shap_values[0],
                   X_test.columns)

## Summary



-   We explored SHAP values for a random forest ensemble model trained
    on customer churn data for a telecommunications company.
-   SHAP extends the focus on model accuracy (predictive power) to
    interpretability (why was a prediction made) and transparency
-   Being able to interpret a model's output can help debug potential
    bias, data issues and model decisions.



## References



-   Awan A A. An Introduction to SHAP Values and Machine Learning
    Interpretability. 2023. URL: [datacamp.com](https://www.datacamp.com/tutorial/introduction-to-shap-values-machine-learning-interpretability).

