# Explainer examples
**Introduction**

In this notebook we use the Pima Indians Diabetes dataset from the National Institute of Diabetes and Digestive and Kidney Diseases.

The task is to predict whether a person has diabetes, based on other available measurements such as body mass index and insulin level.

This notebook shows how you can use the `Explainer` object for interactive visualization in your jupyter notebook.

A second question is which of these features the diabetes prediction depends on most.

All this plotting functionality gets called by the `ExplainerDashboard` to construct the interactive dashboard.

# Google colab link:

[https://colab.research.google.com/github/oegedijk/explainerdashboard/blob/master/explainer_examples.ipynb](https://colab.research.google.com/github/oegedijk/explainerdashboard/blob/master/explainer_examples.ipynb)

In [None]:
!pip install explainerdashboard


# notebook properties

Display multiple outputs per cell:

In [None]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# ClassifierExplainer:

## train model

In [None]:
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [None]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

Saving diabetes.csv to diabetes.csv
User uploaded file "diabetes.csv" with length 23875 bytes


In [None]:
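from sklearn.ensemble import RandomForestClassifier

diabetes = pd.read_csv('diabetes.csv')
# Features are every column except the binary "Outcome" label
X = diabetes.drop(["Outcome"], axis=1)
Y = diabetes["Outcome"]
# Hold out 10% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.1, random_state=0)
# No random_state is set on the forest, so results vary slightly between runs
model = RandomForestClassifier(n_estimators=50, max_depth=5)
model.fit(X_train, y_train)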

In [None]:
Y.value_counts()

0    500
1    268
Name: Outcome, dtype: int64
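
The counts above show a moderate class imbalance, so any accuracy figure later in the notebook is best read against the majority-class baseline. A minimal sketch of that arithmetic, using only the two counts printed above (the variable names are illustrative):

```python
# Class counts copied from the value_counts() output above.
counts = {0: 500, 1: 268}

total = sum(counts.values())
# Accuracy of a model that always predicts the most frequent class.
majority_share = max(counts.values()) / total
print(f"majority-class baseline accuracy: {majority_share:.3f}")
```

A classifier only adds value over this baseline to the extent its accuracy exceeds that share.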

## build explainer

In [None]:
from explainerdashboard import ClassifierExplainer

explainer = ClassifierExplainer(model, X_test, y_test,
                                target='Outcome',
                                labels=['Not Diabetes', 'Diabetes'])  # in class order: Outcome 0, then 1

Detected RandomForestClassifier model: Changing class type to RandomForestClassifierExplainer...
Note: model_output=='probability', so assuming that raw shap output of RandomForestClassifier is in probability space...
Generating self.shap_explainer = shap.TreeExplainer(model)


In [None]:
import explainerdashboard
dir(explainerdashboard)

['ClassifierExplainer',
 'ExplainerDashboard',
 'ExplainerHub',
 'InlineExplainer',
 'RegressionExplainer',
 '___version__',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 'dashboard_components',
 'dashboard_methods',
 'dashboards',
 'explainer_methods',
 'explainer_plots',
 'explainers',
 'to_html']

## Explaining inherently interpretable models

In [None]:
!pip install interpret



In [None]:
X_test[:10]

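| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| 661 | 1 | 199 | 76 | 43 | 0 | 42.9 | 1.394 | 22 |
| 122 | 2 | 107 | 74 | 30 | 100 | 33.6 | 0.404 | 23 |
| 113 | 4 | 76 | 62 | 0 | 0 | 34.0 | 0.391 | 25 |
| 14 | 5 | 166 | 72 | 19 | 175 | 25.8 | 0.587 | 51 |
| 529 | 0 | 111 | 65 | 0 | 0 | 24.6 | 0.66 | 31 |
| 103 | 1 | 81 | 72 | 18 | 40 | 26.6 | 0.283 | 24 |
| 338 | 9 | 152 | 78 | 34 | 171 | 34.2 | 0.893 | 33 |
| 588 | 3 | 176 | 86 | 27 | 156 | 33.3 | 1.154 | 52 |
| 395 | 2 | 127 | 58 | 24 | 275 | 27.7 | 1.6 | 25 |
| 204 | 6 | 103 | 72 | 32 | 190 | 37.7 | 0.324 | 55 |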


In [None]:
from interpret.glassbox import (LogisticRegression,
                                ClassificationTree,
                                ExplainableBoostingClassifier)
from interpret import show
from sklearn.metrics import f1_score, accuracy_score

# %% Fit decision tree model
tree = ClassificationTree()
tree.fit(X_train, y_train)
print("Training finished.")
y_pred = tree.predict(X_test)
print(f"F1 Score {f1_score(y_test, y_pred, average='macro')}")
print(f"Accuracy {accuracy_score(y_test, y_pred)}")


# %% Explain local prediction
tree_local = tree.explain_local(X_test[:10], y_test[:10])
show(tree_local)


<interpret.glassbox._decisiontree.ClassificationTree at 0x7f2de5abc700>

Training finished.
F1 Score 0.7578616352201257
Accuracy 0.7922077922077922


In [None]:
X_test.iloc[0]  # first row by position; X_test[0] looks for a column named 0 and raises KeyError

==================================================

## Importances

Get a dataframe of the mean absolute SHAP value per feature, dropping features whose value falls below a cutoff of 0.01:

In [None]:
explainer.get_mean_abs_shap_df(cutoff=0.01)

Calculating shap values...


| | Feature | MEAN_ABS_SHAP |
|---|---|---|
| 0 | Glucose | 0.110568 |
| 1 | Age | 0.069232 |
| 2 | BMI | 0.064237 |
| 3 | Pregnancies | 0.029459 |
| 4 | DiabetesPedigreeFunction | 0.026478 |
| 5 | Insulin | 0.013555 |
| 6 | SkinThickness | 0.012532 |
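
To make the ranking above concrete, here is a minimal sketch of how a mean-absolute-SHAP table with a cutoff can be computed from a matrix of per-row SHAP values. The numbers in `shap_values` are made up for illustration, and `mean_abs_shap` is a hypothetical helper, not the library's implementation.

```python
# Hypothetical SHAP values: one row per observation, one column per feature.
feature_names = ["Glucose", "Age", "BloodPressure"]
shap_values = [
    [0.20, -0.05, 0.001],
    [-0.10, 0.08, -0.002],
    [0.15, -0.02, 0.000],
]

def mean_abs_shap(values, names, cutoff=0.0):
    """Return (feature, mean |shap|) pairs at or above `cutoff`, largest first."""
    n_rows = len(values)
    scores = {
        name: sum(abs(row[j]) for row in values) / n_rows
        for j, name in enumerate(names)
    }
    kept = [(name, score) for name, score in scores.items() if score >= cutoff]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

ranking = mean_abs_shap(shap_values, feature_names, cutoff=0.01)
print(ranking)
```

In this toy data `BloodPressure` falls below the cutoff and is dropped. The real table above lists seven of the eight features, which is consistent with `BloodPressure` having been filtered out the same way.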


Get permutation importances (the decrease in the metric when the values of one feature are randomly permuted):

In [None]:
explainer.get_permutation_importances_df(topx=5)

Calculating permutation importances (if slow, try setting n_jobs parameter)...


| | Feature | Importance | Score |
|---|---|---|---|
| 1 | Glucose | 0.195466 | 0.665341 |
| 5 | BMI | 0.035594 | 0.825214 |
| 7 | Age | 0.024259 | 0.836548 |
| 4 | Insulin | 0.011334 | 0.849473 |
| 0 | Pregnancies | 0.009743 | 0.851064 |
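
In the table above, Importance plus Score comes to about 0.861 in every row, which is consistent with Score being the metric after permuting that feature and Importance being the drop from a shared baseline. The sketch below illustrates the idea on toy data, with accuracy as the metric. `toy_model`, `accuracy` and `permutation_importance` are hypothetical stand-ins, not explainerdashboard functions.

```python
import random

def accuracy(model, rows, labels):
    """Fraction of rows where the model's prediction equals the label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(model, rows, labels, col, seed=0):
    """Drop in accuracy after shuffling column `col` across rows."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    shuffled_col = [r[col] for r in rows]
    rng.shuffle(shuffled_col)
    permuted = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, shuffled_col)]
    return baseline - accuracy(model, permuted, labels)

# Toy data: the label depends only on column 0; column 1 is pure noise.
data_rng = random.Random(42)
rows = [[data_rng.random(), data_rng.random()] for _ in range(200)]
labels = [int(r[0] > 0.5) for r in rows]
toy_model = lambda r: int(r[0] > 0.5)  # stand-in for a fitted classifier

imp_signal = permutation_importance(toy_model, rows, labels, col=0)
imp_noise = permutation_importance(toy_model, rows, labels, col=1)
print(imp_signal, imp_noise)
```

Shuffling the informative column destroys the link between feature and label, so the score drops sharply; shuffling the noise column changes nothing.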


### Plot mean absolute shap importances:

In [None]:
explainer.plot_importances(kind='shap', topx=5)

### Permutation importances showing top 6

In [None]:
explainer.plot_importances(kind='permutation', topx=6)

## detailed shap summary

Show the top 10 features. One-hot-encoded categorical features would be grouped, but this dataset has only numeric columns:

In [None]:
explainer.plot_importances_detailed(topx=10)

## interaction importances

### mean absolute shap interaction values for interactions with 'Glucose'
- the direct effect is usually the largest
- the remaining bars rank the other features by the strength of their interaction with Glucose

In [None]:
explainer.plot_interactions_importance('Glucose', topx=5)

In [None]:
explainer.plot_interactions_importance('BMI', topx=5)

### Detailed shap interactions summary:

In [None]:
explainer.plot_interactions_detailed("Glucose")

## Contributions

In [None]:
index = 0 # explain prediction for first row of X_test
explainer.get_contrib_df(index, topx=8)

| | col | contribution | value | cumulative | base |
|---|---|---|---|---|---|
| 0 | _BASE | 0.360033 | | 0.360033 | 0.0 |
| 1 | Glucose | 0.240406 | 199.0 | 0.600438 | 0.360033 |
| 2 | DiabetesPedigreeFunction | 0.075803 | 1.394 | 0.676242 | 0.600438 |
| 3 | Age | -0.070567 | 22.0 | 0.605675 | 0.676242 |
| 4 | BMI | 0.052231 | 42.9 | 0.657905 | 0.605675 |
| 5 | Pregnancies | -0.020031 | 1.0 | 0.637874 | 0.657905 |
| 6 | SkinThickness | 0.007736 | 43.0 | 0.645611 | 0.637874 |
| 7 | Insulin | 0.001281 | 0.0 | 0.646892 | 0.645611 |
| 8 | BloodPressure | -0.000429 | 76.0 | 0.646463 | 0.646892 |
| 9 | _REST | 0.0 | | 0.646463 | 0.646463 |
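
The `cumulative` column is a running sum: it starts at the base value and adds one feature contribution at a time, ending at the model's prediction for this row. A minimal sketch that reproduces it from the numbers in the table above (the variable names are illustrative):

```python
from itertools import accumulate

# Base value and (feature, contribution) pairs copied from the table above.
base = 0.360033
contributions = [
    ("Glucose", 0.240406),
    ("DiabetesPedigreeFunction", 0.075803),
    ("Age", -0.070567),
    ("BMI", 0.052231),
    ("Pregnancies", -0.020031),
    ("SkinThickness", 0.007736),
    ("Insulin", 0.001281),
    ("BloodPressure", -0.000429),
]

# Running total: start at the base and add each contribution in turn.
cumulative = list(accumulate((c for _, c in contributions), initial=base))
final_prediction = cumulative[-1]
print(round(final_prediction, 6))
```

Small differences in the last digit against the table come from the table's own rounding of each contribution.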


In [None]:
explainer.plot_contributions(index, topx=8)

In [None]:
# explain the prediction for the second row of X_test
explainer.plot_contributions(1)

## Shap dependence plots

In [None]:
explainer.plot_dependence("Glucose")

In [None]:
explainer.plot_dependence("BMI")

### color by BMI


In [None]:
explainer.plot_dependence("Glucose", color_col="BMI")

In [None]:
explainer.plot_dependence("Age", color_col="BMI")

### Highlight particular index

In [None]:

explainer.plot_dependence("Age", color_col="BMI", highlight_index=5)

## Shap interactions plots

In [None]:
explainer.plot_interaction("Glucose", "BMI")

In [None]:
explainer.plot_interaction("Age", "BMI")

In [None]:
explainer.plot_interaction("Glucose", "Insulin")

## partial dependence plots (pdp)

### Plot average general partial dependence plot with ice lines for specific observations

In [None]:
explainer.plot_pdp("Glucose")

In [None]:
explainer.plot_pdp("Insulin")

### highlight pdp for specific observation

In [None]:
explainer.plot_pdp("Age", 1)

### with default parameters:

In [None]:
explainer.plot_pdp("Age", index=5, drop_na=True, sample=100,
                    gridlines=100, gridpoints=10)

### adjusting parameters:

- `drop_na=False` keep rows whose value equals `self.na_fill` (-999 by default) instead of dropping them
- `sample=200` sample 200 rows for calculating the average
- `gridlines=10` draw 10 individual (ICE) lines in addition to the average
- `gridpoints=50` evaluate the lines at 50 points along the x axis

In [None]:
explainer.plot_pdp("Age", index=5, drop_na=False, sample=200,
                    gridlines=10, gridpoints=50)
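
The following sketch shows roughly what `sample` and `gridpoints` control in a partial dependence calculation: each sampled row yields one ICE line by varying a single feature over a grid, and the partial dependence curve is the average of those lines. Everything here is an assumption for illustration: `partial_dependence` and `toy_model` are hypothetical, and the evenly spaced grid may differ from how the library picks its grid.

```python
import random

def partial_dependence(model, rows, col, gridpoints=10, sample=100, seed=0):
    """Return (grid, ice_lines, pdp) for the feature in column `col`."""
    rng = random.Random(seed)
    sampled = rng.sample(rows, min(sample, len(rows)))
    lo = min(r[col] for r in rows)
    hi = max(r[col] for r in rows)
    grid = [lo + (hi - lo) * i / (gridpoints - 1) for i in range(gridpoints)]
    # One ICE line per sampled row: vary only `col`, keep the rest fixed.
    ice_lines = [
        [model(r[:col] + [g] + r[col + 1:]) for g in grid] for r in sampled
    ]
    # The partial dependence curve is the pointwise average of the ICE lines.
    pdp = [sum(line[i] for line in ice_lines) / len(ice_lines)
           for i in range(gridpoints)]
    return grid, ice_lines, pdp

# Toy stand-in for a fitted model: the prediction rises with column 0.
toy_model = lambda r: 0.01 * r[0] + 0.001 * r[1]
data_rng = random.Random(1)
rows = [[data_rng.uniform(20, 80), data_rng.uniform(18, 45)] for _ in range(300)]

grid, ice_lines, pdp = partial_dependence(toy_model, rows, col=0,
                                          gridpoints=50, sample=200)
print(len(grid), len(ice_lines), round(pdp[0], 3), round(pdp[-1], 3))
```

A plot would then draw only a subset of `ice_lines` (the role of `gridlines`) on top of the averaged `pdp` curve.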

## Classification validation plots:

In [None]:
explainer.metrics(cutoff=0.8)

Calculating prediction probabilities...
Calculating metrics...


{'accuracy': 0.7012987012987013,
 'precision': 0.6666666666666666,
 'recall': 0.0425531914893617,
 'f1': 0.08,
 'roc_auc_score': 0.8514615231656393,
 'pr_auc_score': 0.6910539116232207,
 'log_loss': 0.44943139825098016}
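
Precision, recall and F1 depend on the chosen cutoff, while `roc_auc_score`, `pr_auc_score` and `log_loss` are computed from the probabilities themselves. That explains how recall can be very low at `cutoff=0.8` while ROC AUC stays high. A minimal sketch with made-up labels and probabilities (`metrics_at_cutoff` is a hypothetical helper, not the library's code):

```python
def metrics_at_cutoff(y_true, proba, cutoff=0.5):
    """Precision, recall and F1 when proba >= cutoff is called positive."""
    pred = [int(p >= cutoff) for p in proba]
    tp = sum(1 for y, p in zip(y_true, pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, pred) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Made-up labels and predicted probabilities, for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
proba = [0.9, 0.7, 0.6, 0.4, 0.85, 0.3, 0.2, 0.1]

at_half = metrics_at_cutoff(y_true, proba, cutoff=0.5)
at_high = metrics_at_cutoff(y_true, proba, cutoff=0.8)
print(at_half)
print(at_high)
```

Raising the cutoff leaves fewer rows classified as positive, so recall falls even though the underlying probabilities have not changed.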

In [None]:
explainer.prediction_result_df(1)

| | label | probability |
|---|---|---|
| 0 | Diabetes* | 0.883 |
| 1 | Not Diabetes | 0.117 |


### confusion matrix

In [None]:
explainer.plot_confusion_matrix(cutoff=0.5, binary=True)

Calculating confusion matrices...


#### For multiclass classifiers, `binary=False` would display e.g. a 3x3 confusion matrix
- in this case it's a binary classifier, so binary=False makes no difference

### precision plot
- if the classifier works well the predicted probability should be the same as the observed probability per bin, so we would expect a nice straight line from 0 to 1

#### based on bin size:

In [None]:
explainer.plot_precision(bin_size=0.1)
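
As a rough illustration of what a precision plot with fixed-width bins measures, the sketch below groups predictions into bins of width `bin_size` and reports the observed share of positives in each non-empty bin. The data is made up and `precision_by_bin` is a hypothetical helper; the library's exact binning rules may differ.

```python
def precision_by_bin(y_true, proba, bin_size=0.1):
    """Observed positive rate per predicted-probability bin (lower edge, rate)."""
    n_bins = round(1 / bin_size)
    counts = [0] * n_bins
    positives = [0] * n_bins
    for y, p in zip(y_true, proba):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        counts[b] += 1
        positives[b] += y
    return [
        (b / n_bins, positives[b] / counts[b])
        for b in range(n_bins) if counts[b]
    ]

# Made-up labels and predicted probabilities, for illustration only.
y_true = [0, 0, 1, 1, 1, 0]
proba = [0.05, 0.15, 0.12, 0.85, 0.88, 0.95]

bins = precision_by_bin(y_true, proba, bin_size=0.1)
for lower_edge, observed_rate in bins:
    print(f"bin starting at {lower_edge:.1f}: observed positive rate {observed_rate:.2f}")
```

For a well-calibrated classifier the observed rate in each bin would sit close to the predicted probabilities falling in that bin.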

#### based on quantiles, showing all classes, adding in a cutoff value

In [None]:
explainer.plot_precision(quantiles=10, cutoff=0.75, multiclass=True)

### Cumulative precision

In [None]:
explainer.plot_cumulative_precision()

Calculating liftcurve_dfs...


### lift curve

In [None]:
explainer.plot_lift_curve(cutoff=None, percentage=False, round=2)

In [None]:
explainer.plot_lift_curve(cutoff=0.75, percentage=True, round=2)
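
A lift curve ranks rows from highest to lowest predicted probability and asks how many positives have been found after looking at the top k rows, compared with what random selection would find. A minimal sketch on made-up data (`lift_curve` is a hypothetical helper, not the library's implementation):

```python
def lift_curve(y_true, proba):
    """Cumulative positives and lift when taking rows from highest proba down."""
    order = sorted(range(len(proba)), key=lambda i: proba[i], reverse=True)
    base_rate = sum(y_true) / len(y_true)
    found, curve = 0, []
    for k, i in enumerate(order, start=1):
        found += y_true[i]
        precision_at_k = found / k
        curve.append({"top_k": k, "positives": found,
                      "lift": precision_at_k / base_rate})
    return curve

# Made-up labels and predicted probabilities, for illustration only.
y_true = [1, 0, 1, 0, 0, 1, 0, 0]
proba = [0.9, 0.8, 0.7, 0.4, 0.3, 0.6, 0.2, 0.1]

curve = lift_curve(y_true, proba)
for point in curve:
    print(point)
```

A lift of 2.0 at `top_k=4` means the top four rows contain twice as many positives as four randomly chosen rows would on average; at the full sample the lift is always 1.0.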

### Plot classification:

In [None]:
explainer.plot_classification()

Calculating classification_dfs...


In [None]:
explainer.plot_classification(cutoff=0.75, percentage=False)

### ROC AUC Curve

In [None]:
explainer.plot_roc_auc(cutoff=0.75)

Calculating roc auc curves...
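
The `cutoff` argument only marks one operating point on the curve; the area under it does not depend on any cutoff. One way to see this is that ROC AUC equals the probability that a randomly chosen positive row gets a higher score than a randomly chosen negative row. A minimal sketch of that pairwise definition on made-up data (`roc_auc` is a hypothetical helper):

```python
def roc_auc(y_true, proba):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [p for y, p in zip(y_true, proba) if y == 1]
    neg = [p for y, p in zip(y_true, proba) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted probabilities, for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
proba = [0.9, 0.7, 0.6, 0.4, 0.85, 0.3, 0.2, 0.1]

auc = roc_auc(y_true, proba)
print(auc)
```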


### Plot PR AUC

In [None]:
explainer.plot_pr_auc(cutoff=0.25)

Calculating pr auc curves...


# RegressionExplainer

In [None]:
from sklearn.ensemble import RandomForestRegressor

# Reuses the diabetes split from above: the regressor is fit on the 0/1 Outcome column
model = RandomForestRegressor(n_estimators=50, max_depth=5)
model.fit(X_train, y_train)


In [None]:
from explainerdashboard import RegressionExplainer

# `target` is only a display name; y_test holds Outcome, so label it accordingly
explainer = RegressionExplainer(model, X_test, y_test,
                                target='Outcome')

Changing class type to RandomForestRegressionExplainer...
Generating self.shap_explainer = shap.TreeExplainer(model)


## Importances

### Mean absolute shap importances:

In [None]:
explainer.plot_importances(kind='shap', topx=5, round=3)

Calculating shap values...


### Permutation importances,  showing top 4

In [None]:
explainer.plot_importances(kind='permutation', topx=4, round=3)

Calculating importances...


## detailed shap summary

In [None]:
explainer.plot_importances_detailed(topx=10)

## interaction importances

### mean absolute shap interaction values for interactions with 'Glucose'
- the direct effect is usually the largest
- the remaining bars rank the other features by the strength of their interaction with Glucose

In [None]:
explainer.plot_interactions_importance('Glucose', topx=5)

Calculating shap interaction values...
Reminder: TreeShap computational complexity is O(TLD^2), where T is the number of trees, L is the maximum number of leaves in any tree and D the maximal depth of any tree. So reducing these will speed up the calculation.


In [None]:
explainer.plot_interactions_importance('Age', topx=5)

### Detailed shap interactions summary:

In [None]:
explainer.plot_interactions_detailed("Glucose")

## Contributions

In [None]:
index = 0 # explain prediction for first row of X_test
explainer.plot_contributions(index, topx=5, round=2)

In [None]:
# explainer prediction for specific observation
explainer.plot_contributions(1, sort='low-to-high', orientation='horizontal')

## Shap dependence plots

In [None]:
explainer.plot_dependence("Age")

### color by Glucose

In [None]:
explainer.plot_dependence("Age", color_col="Glucose")

### Highlight particular index

In [None]:

explainer.plot_dependence("Glucose", color_col="Insulin", highlight_index=5)

## Shap interactions plots

In [None]:
explainer.plot_interaction("Glucose", "BMI")

In [None]:
explainer.plot_interaction("BMI", "Insulin")

In [None]:
explainer.plot_interaction("Glucose", "Age", highlight_index=5)

## partial dependence plots (pdp)

### Plot average general partial dependence plot with ice lines for specific observations

In [None]:
explainer.plot_pdp("Glucose")

In [None]:
explainer.plot_pdp("Age")

### highlight pdp for specific observation

In [None]:
explainer.plot_pdp("BMI", 1)

### with default parameters:

In [None]:
explainer.plot_pdp("BMI", index=17, drop_na=True, sample=100,
                    gridlines=100, gridpoints=10)

### adjusting parameters:

- `drop_na=False` keep rows whose value equals `self.na_fill` (-999 by default) instead of dropping them
- `sample=200` sample 200 rows for calculating the average
- `gridlines=10` draw 10 individual (ICE) lines in addition to the average
- `gridpoints=50` evaluate the lines at 50 points along the x axis

In [None]:
explainer.plot_pdp("Glucose", index=17, drop_na=False, sample=200,
                    gridlines=10, gridpoints=50)

## Regression validation plots:

In [None]:
explainer.metrics()

Calculating predictions...


{'mean-squared-error': 0.14030080772221198,
 'root-mean-squared-error': 0.3745674942146101,
 'mean-absolute-error': 0.2892639956822969,
 'mean-absolute-percentage-error': 729142086963093.1,
 'R-squared': 0.3383627051222948}
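
The enormous `mean-absolute-percentage-error` is what happens when the observed target contains zeros: each error is divided by the observed value, and a zero denominator is replaced by a tiny positive number. The sketch below mirrors the way scikit-learn's `mean_absolute_percentage_error` floors the denominator at machine epsilon. Whether explainerdashboard delegates to that function is an assumption here, and the data is made up.

```python
import sys

def mape(y_true, y_pred, eps=sys.float_info.epsilon):
    """Mean absolute percentage error with the denominator floored at eps."""
    return sum(abs(t - p) / max(abs(t), eps)
               for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up data with a 0/1 target, like the Outcome column.
y_true = [0.0, 1.0, 0.0, 1.0]
y_pred = [0.1, 0.8, 0.3, 0.6]

with_zeros = mape(y_true, y_pred)            # dominated by rows where y_true == 0
without_zeros = mape([1.0, 1.0], [0.8, 0.6])  # well-behaved when no target is zero
print(with_zeros, without_zeros)
```

For a 0/1 target, MAPE is therefore not informative, and the other metrics in the dictionary are the ones to read.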

In [None]:
explainer.prediction_result_df(1)

| | | Insulin |
|---|---|---|
| 0 | Predicted | 0.109 |
| 1 | Observed | 0.0 |
| 2 | Residual | -0.109 |


### predicted vs actual

In [None]:
explainer.plot_predicted_vs_actual()

In [None]:
explainer.plot_predicted_vs_actual(log_x=True, log_y=True)

### plot residuals

In [None]:
explainer.plot_residuals()

In [None]:
explainer.plot_residuals(vs_actual=True, residuals='ratio')

In [None]:
explainer.plot_residuals(vs_actual=True, residuals='log-ratio')


divide by zero encountered in log
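
The warning above is expected when an observed value is zero, because the logarithm of a zero ratio is undefined. The sketch below shows the three residual types on the row from `prediction_result_df(1)` (observed 0.0, predicted 0.109). The sign of the difference matches that table; treating the ratio as observed divided by predicted is an assumption, and `residual` is a hypothetical helper.

```python
import math

def residual(observed, predicted, kind="difference"):
    """Residual as a difference, a ratio, or the log of the ratio."""
    if kind == "difference":
        return observed - predicted
    ratio = observed / predicted
    if kind == "ratio":
        return ratio
    if kind == "log-ratio":
        # log(0) is undefined: NumPy warns and returns -inf, math.log raises.
        return math.log(ratio) if ratio > 0 else float("-inf")
    raise ValueError(f"unknown kind: {kind}")

for kind in ("difference", "ratio", "log-ratio"):
    print(kind, residual(0.0, 0.109, kind))
```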



### residuals vs specific feature

In [None]:
explainer.plot_residuals_vs_feature("Age")

# RandomForestExplainer

For RandomForest models, the class type gets recast to either a `RandomForestClassifierExplainer` or a `RandomForestRegressionExplainer`, which provide some additional functionality to visualize the individual trees in the RandomForest.

In [None]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=50, max_depth=5)
model.fit(X_train, y_train)

explainer = ClassifierExplainer(model, X_test, y_test)

Detected RandomForestClassifier model: Changing class type to RandomForestClassifierExplainer...
Note: model_output=='probability', so assuming that raw shap output of RandomForestClassifier is in probability space...
Generating self.shap_explainer = shap.TreeExplainer(model)


In [None]:
explainer.plot_trees(1, highlight_tree=20)


X has feature names, but DecisionTreeClassifier was fitted without feature names



In [None]:
explainer.get_decisionpath_df(tree_idx=20, index=1)

Calculating ShadowDecTree for each individual decision tree...



invalid value encountered in long_scalars



| | node_id | average | feature | value | split | direction | left | right | diff |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.305195 | Glucose | 107.0 | 127.5 | left | 0.158416 | 0.584906 | -0.146779 |
| 1 | 1 | 0.158416 | BMI | 33.6 | 26.45 | right | 0.034483 | 0.208333 | 0.049917 |
| 2 | 9 | 0.208333 | Glucose | 107.0 | 94.5 | right | 0.0 | 0.294118 | 0.085784 |
| 3 | 15 | 0.294118 | BloodPressure | 74.0 | 67.0 | right | 0.142857 | 0.4 | 0.105882 |
| 4 | 19 | 0.4 | Insulin | 100.0 | 501.0 | left | 0.4 | | 0.0 |
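
The `direction` column follows from comparing `value` with `split`: scikit-learn decision trees send a row to the left child when the feature value is less than or equal to the split threshold, and to the right child otherwise. A minimal sketch that reproduces the column from the numbers in the table above (`direction` is an illustrative helper):

```python
# (feature, value, split) triples copied from the table above.
path = [
    ("Glucose", 107.0, 127.5),
    ("BMI", 33.6, 26.45),
    ("Glucose", 107.0, 94.5),
    ("BloodPressure", 74.0, 67.0),
    ("Insulin", 100.0, 501.0),
]

def direction(value, split):
    """Go left when value <= split, right otherwise."""
    return "left" if value <= split else "right"

directions = [direction(value, split) for _, value, split in path]
print(directions)
```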


In [None]:
explainer.get_decisionpath_summary_df(tree_idx=5, index=1)

| | Feature | Condition | Adjustment | New Prediction |
|---|---|---|---|---|
| 0 | | | Starting average | 30.52% |
| 1 | Insulin | 100.0 < 121.0 | -5.1% | 25.42% |
| 2 | Age | 23.0 < 29.5 | -4.53% | 20.9% |
| 3 | DiabetesPedigreeFunction | 0.404 < 1.2800000309944153 | -0.9% | 20.0% |
| 4 | Glucose | 107.0 < 130.5 | -10.57% | 9.43% |
| 5 | Glucose | 107.0 < 111.5 | -6.66% | 2.78% |
| 6 | | | Final Prediction | 2.78% |
