# Understanding SHAP values 

In [1]:
!pip install plotly

Collecting plotly
  Downloading plotly-5.3.1-py2.py3-none-any.whl (23.9 MB)
Collecting tenacity>=6.2.0
  Downloading tenacity-8.0.1-py3-none-any.whl (24 kB)
Installing collected packages: tenacity, plotly
Successfully installed plotly-5.3.1 tenacity-8.0.1
You should consider upgrading via the '/opt/anaconda3/bin/python -m pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated packages.


In [3]:
!pip install shap

Collecting numpy
  Downloading numpy-1.20.3-cp38-cp38-macosx_10_9_x86_64.whl (16.0 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.21.2
    Uninstalling numpy-1.21.2:
      Successfully uninstalled numpy-1.21.2
Successfully installed numpy-1.20.3
You should consider upgrading via the '/opt/anaconda3/bin/python -m pip install --upgrade pip' command.


In [6]:
!pip install numba

You should consider upgrading via the '/opt/anaconda3/bin/python -m pip install --upgrade pip' command.


### Import libraries

In [7]:
# Import libraries
import pandas as pd
import numpy as np
import plotly
import matplotlib.pyplot as plt
import shap

# Seed NumPy's global random state so the train/test split below is reproducible
np.random.seed(0)

ImportError: Numba needs NumPy 1.20 or less

### Read dataset

In [None]:
df = pd.read_csv('winequality-red.csv')  # pass sep=';' if your copy is the semicolon-delimited UCI file

In [None]:
df.shape

In [None]:
# Cast the target to int first, then display the column names (last expression in the cell)
df['quality'] = df['quality'].astype(int)
df.columns

In [None]:
df.head()

### Target variable is quality of the wine

In [None]:
df['quality'].hist(grid=True, bins=20, rwidth=0.9,
                   color='#607c8e')
plt.title('Quality of wine count')
plt.xlabel('Wine Quality')
plt.ylabel('Count of quality')
plt.grid(axis='y', alpha=0.75)

### Implement Random Forest regressor

In [None]:
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.ensemble import RandomForestRegressor

#target variable
Y = df['quality']

# Independent variables
X =  df[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol']]

In [None]:
# Train test split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2)

In [None]:
model = RandomForestRegressor(max_depth=6, random_state=0, n_estimators=10)

# Fit the model
model.fit(X_train, Y_train)  
print(model.feature_importances_)

### Variable Importance Plot — Global Interpretability

### Average impact on Quality of wine

In [None]:
shap_values = shap.TreeExplainer(model).shap_values(X_train)
shap.summary_plot(shap_values, X_train, plot_type="bar")

In [None]:
len(shap_values)

In [None]:
shap_values.shape

In [None]:

print(f'Shape of training dataset: {X_train.shape}')
print(f'Type of shap_values: {type(shap_values)}. Number of rows: {len(shap_values)}')
print(f'Shape of shap_values: {np.array(shap_values).shape}')

The bar plot above ranks features by importance, but it cannot show the direction of the relationship between each feature and the target variable. The SHAP summary plot below can. It plots the SHAP value of every feature for every sample, so the plot is made of many dots, and it sorts features by the total of absolute SHAP values over all samples. Each dot has three characteristics:

1. The vertical location shows the feature importance.

2. The horizontal location shows whether the effect of that value caused a higher or lower prediction.

3. The color shows whether that feature was high (red) or low (blue) for that observation.
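The ordering rule described above (features sorted by total absolute SHAP value) can be sketched in plain Python. The SHAP matrix below is invented for illustration and is not output from the model in this notebook, and `rank_features` is a hypothetical helper, not part of the shap library.

```python
# Toy SHAP matrix: one row per sample, one column per feature.
# These numbers are invented for illustration only.
feature_names = ["alcohol", "sulphates", "pH"]
toy_shap_values = [
    [0.30, -0.10, 0.02],
    [-0.25, 0.15, -0.01],
    [0.40, 0.05, 0.03],
    [-0.35, -0.20, -0.02],
]

def rank_features(shap_matrix, names):
    """Return (name, total |SHAP|) pairs, largest total first."""
    totals = [sum(abs(row[k]) for row in shap_matrix) for k in range(len(names))]
    return sorted(zip(names, totals), key=lambda pair: pair[1], reverse=True)

ranking = rank_features(toy_shap_values, feature_names)
for name, total in ranking:
    print(f"{name}: {total:.2f}")
```

The bar version of the summary plot shows the mean rather than the total of the absolute values. Since every feature has the same number of samples, both give the same ordering.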

### Plot of SHAP values showing impact on model output (target: wine quality)

In [None]:
shap.summary_plot(shap_values, X_train)

We can describe the model. A high quality rating of wine is associated with the following characteristics:

1. High alcohol content

2. High sulphates

3. Low volatile acidity

4. Low total sulfur dioxide

5. Low pH

6. Low chlorides

7. Low citric acid

8. Low density

9. High fixed acidity content

10. High free sulfur dioxide

11. High residual sugar

To understand how a single feature affects the output of the model, we can plot the SHAP value of that feature against the value of the feature for all the examples in a dataset. Since SHAP values represent a feature's responsibility for a change in the model output, the plots below show the change in predicted wine quality as a feature such as alcohol changes. Vertical dispersion at a single feature value represents interaction effects with other features. To help reveal these interactions, dependence_plot automatically selects another feature for coloring.
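Why the same feature value can receive different SHAP values is easier to see on a toy case. The sketch below assumes a made-up two-feature model with an interaction term and a small invented background sample, and computes the exact Shapley value of the first feature by brute force. It illustrates the idea only; it is not how TreeExplainer computes its values, and `toy_model`, `value` and `shap_alcohol` are hypothetical names.

```python
from statistics import mean

# Hypothetical toy model with an interaction between its two inputs.
def toy_model(alcohol, sulphates):
    return alcohol + alcohol * sulphates

# Small invented background sample used to average out "unknown" features.
background = [(9.0, 0.5), (10.0, 0.6), (11.0, 0.7), (12.0, 0.8)]

def value(point, known):
    """Average model output when only the features in `known` are fixed to `point`."""
    return mean(
        toy_model(
            point[0] if 0 in known else bg[0],
            point[1] if 1 in known else bg[1],
        )
        for bg in background
    )

def shap_alcohol(point):
    """Exact Shapley value of feature 0 (alcohol) for a two-feature model."""
    alone = value(point, {0}) - value(point, set())
    after_other = value(point, {0, 1}) - value(point, {1})
    return 0.5 * alone + 0.5 * after_other

# Same alcohol value, different sulphates values.
low_sulphates = shap_alcohol((12.0, 0.5))
high_sulphates = shap_alcohol((12.0, 0.8))
print(low_sulphates, high_sulphates)
```

Both points share the same alcohol value, yet the SHAP value for alcohol differs because of the interaction term. In a dependence plot this shows up as vertical spread at a single x position.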

### SHAP Dependence Plot — Global Interpretability

In [None]:
# the interaction is really in the model see SHAP interaction values below
shap.dependence_plot("alcohol", shap_values, X_train)

The function automatically colors the points by the variable that your chosen variable interacts with most. The plot above shows an approximately linear, positive trend between “alcohol” and the target variable, and “alcohol” interacts with “sulphates” frequently.

In [None]:
shap.dependence_plot("volatile acidity", shap_values, X_train)

The plot above shows there exists an approximately linear but negative relationship between “volatile acidity” and the target variable. 

In [None]:
shap.dependence_plot("total sulfur dioxide", shap_values, X_train, show=False)
plt.show()

### SHAP values interaction for the target variable 'Quality of wine'

### Individual SHAP Value Plot — Local Interpretability

In [None]:
X_output = X_test.copy()
X_output.loc[:,'predict wine quality'] = np.round(model.predict(X_output),2)

random_picks = np.arange(1,330,50)
S = X_output.iloc[random_picks]
S

In [None]:
shap.initjs()

The shap.force_plot() call below takes three values: the base value (explainerModel.expected_value), the SHAP values for row j (shap_values_Model[j]), and that row's feature values. The base value, or expected value, is the average of the model output over the training data X_train. It is the base value used in the following plots.
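The relationship between these three quantities is that the base value plus the sum of a row's SHAP values equals the model's prediction for that row. The sketch below checks this on a toy problem with brute-force Shapley values. The model `toy_predict`, the background rows and the helper names are all assumptions made up for illustration; they stand in for the random forest and X_train and do not reproduce TreeExplainer's algorithm.

```python
from itertools import permutations
from statistics import mean

# Hypothetical stand-in for model.predict on one row (not the random forest above).
def toy_predict(row):
    alcohol, sulphates, acidity = row
    return 3.0 + 0.3 * alcohol + 1.5 * sulphates - 2.0 * acidity * sulphates

# Invented background rows playing the role of X_train.
background = [
    (9.5, 0.55, 0.70),
    (10.0, 0.60, 0.50),
    (11.2, 0.75, 0.40),
    (12.5, 0.80, 0.30),
]

def coalition_value(row, known):
    """Mean prediction when features in `known` come from `row`, the rest from the background."""
    return mean(
        toy_predict(tuple(row[k] if k in known else bg[k] for k in range(len(row))))
        for bg in background
    )

def exact_shap(row):
    """Average each feature's marginal contribution over all feature orderings."""
    n = len(row)
    contributions = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        known = set()
        for k in order:
            before = coalition_value(row, known)
            known.add(k)
            contributions[k] += coalition_value(row, known) - before
    return [c / len(orderings) for c in contributions]

row = (9.2, 0.58, 0.66)
base_value = coalition_value(row, set())  # mean prediction over the background
shap_row = exact_shap(row)
prediction = toy_predict(row)

print(f"base value:       {base_value:.4f}")
print(f"base + sum(SHAP): {base_value + sum(shap_row):.4f}")
print(f"prediction:       {prediction:.4f}")
```

The last two printed numbers agree, which is the additivity property the force plot visualizes: the arrows start at the base value and end at the output value.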

In [None]:
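def shap_plot(j):
    # S also carries the 'predict wine quality' column, so keep only the model's input features
    S_features = S[X_train.columns]
    # Compute SHAP values for the selected rows
    explainerModel = shap.TreeExplainer(model)
    shap_values_Model = explainerModel.shap_values(S_features)
    # Force plot for row j: base value, that row's SHAP values, that row's feature values
    p = shap.force_plot(explainerModel.expected_value, shap_values_Model[j], S_features.iloc[[j]])
    return p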

In [None]:
# mean of X train
X_train.mean()

In [None]:
# mean of Y test
Y_test.mean()

In [None]:
shap_plot(1)

Output value: The prediction for that observation.

Base value: The original paper explains that the base value E(y_hat) is "the value that would be predicted if we did not know any features for the current output." In other words, it is the mean prediction, mean(y_hat), which the explainer computes over the training data. The mean of Y_test computed above is 5.62; note that this averages the observed test targets, which is a related but different quantity.

Features: The above explanation shows the features that contribute to pushing the final prediction away from the base value.

Red/blue: Those features that push the prediction higher (to the right) are shown in red, and those pushing the prediction lower are in blue.

Alcohol: has a negative impact on the quality rating for this wine. Its alcohol content is 9.2, which is less than the average value of 10.41, so it pushes the prediction to the left.

pH: has a negative impact on the quality rating for this wine. A higher-than-average pH drives the prediction to the left.

Volatile acidity: is negatively related to the quality rating. A higher-than-average volatile acidity pushes the prediction to the left.
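The red and blue sides described above can be read off directly from the signs of one row's SHAP values. The sketch below uses invented numbers for a single wine (they are not taken from the force plot) and a hypothetical base value, purely to show how the split and the output value fit together.

```python
# Invented numbers for one wine; they are not read from the force plot above.
base_value = 5.60
shap_row = {
    "alcohol": -0.21,
    "volatile acidity": -0.09,
    "pH": -0.04,
    "sulphates": 0.06,
}

# Positive SHAP values push the prediction to the right (red in the plot),
# negative ones push it to the left (blue). Strongest push is listed first.
pushes_higher = sorted((f for f, v in shap_row.items() if v > 0), key=lambda f: -shap_row[f])
pushes_lower = sorted((f for f, v in shap_row.items() if v < 0), key=lambda f: shap_row[f])

# The output value is the base value shifted by all pushes combined.
output_value = base_value + sum(shap_row.values())

print("pushes higher:", pushes_higher)
print("pushes lower:", pushes_lower)
print(f"output value: {output_value:.2f}")
```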

In [None]:
shap_plot(2)

In [None]:
shap_plot(3)

In [None]:
shap_plot(4)

In [None]:
fig = shap_plot(4)
fig

In [None]:
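explainer = shap.TreeExplainer(model)  # the second argument must be SHAP values, not the explainer itself
shap.force_plot(explainer.expected_value, explainer.shap_values(S[X_train.columns]), S[X_train.columns])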

## SHAP Works for a Binary Target as Well

In [None]:
# Suppose the target is a binary variable
df['quality_bin'] = np.where(df['quality'].astype(int)>6,1,0)

In [None]:
Y = df['quality_bin']
X =  df[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol']]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2)

model = RandomForestRegressor(max_depth=6, random_state=0, n_estimators=10)
model.fit(X_train, Y_train)  
print(model.feature_importances_)

importances = model.feature_importances_
indices = np.argsort(importances)

features = X_train.columns
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()

In [None]:
shap_values = shap.TreeExplainer(model).shap_values(X_train)

In [None]:
shap.summary_plot(shap_values, X_train)

Next steps: for any dataset, list the independent and dependent variables (mention them), then try a different dataset and run a series of models.