# SHapley Additive exPlanations (SHAP)

This Jupyter notebook shows how the SHAP library can be used to explain the predictions of a black-box model. We will use the TreeSHAP and KernelSHAP explainers to explain a Gradient Boosted Decision Tree (GBDT) trained on tabular data and a neural network, and then compare the explanations produced by LIME and SHAP.
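
To make the idea behind SHAP concrete, here is a minimal sketch that computes exact Shapley values by brute force for a tiny, made-up function. The function `toy_value`, the instance `x`, and the all-zero `baseline` are illustrative assumptions, not part of the SHAP library; SHAP itself averages over a background dataset and uses much faster algorithms than this enumeration.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values for a set function over features 0..n_features-1."""
    players = range(n_features)
    phi = [0.0] * n_features
    for i in players:
        others = [j for j in players if j != i]
        for size in range(len(others) + 1):
            # Weight shared by every coalition S of this size: |S|! (n-|S|-1)! / n!
            weight = (
                factorial(size)
                * factorial(n_features - size - 1)
                / factorial(n_features)
            )
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of feature i when it joins coalition S
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi


# Hypothetical toy "model": f(x) = 2*x0 + 3*x1 + x0*x2, explained at x = (1, 1, 1).
# Assumption: features outside the coalition are replaced by a baseline of 0.
x = (1.0, 1.0, 1.0)
baseline = (0.0, 0.0, 0.0)


def toy_value(coalition):
    z = [x[j] if j in coalition else baseline[j] for j in range(3)]
    return 2 * z[0] + 3 * z[1] + z[0] * z[2]


phi = shapley_values(toy_value, 3)
base_value = toy_value(frozenset())
full_value = toy_value(frozenset({0, 1, 2}))

print("Shapley values:", [round(p, 4) for p in phi])
# Additivity: base value + sum of contributions should match the prediction
print("base + sum(phi):", base_value + sum(phi), "prediction:", full_value)
```

In this toy case the two linear terms are attributed entirely to their own features, while the interaction term `x0*x2` is split equally between features 0 and 2. The contributions add up to the difference between the prediction and the baseline output, which is the "additive" property in the method's name.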


## **1. Import required libraries**

In [15]:

import pandas as pd
import numpy as np
import shap
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import roc_auc_score, f1_score




##  **2. Load Dataset**

We will use the Diabetes Health Indicators Dataset from Kaggle, available from [this download link](https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset). It is derived from the Behavioral Risk Factor Surveillance System (BRFSS), a health-related telephone survey conducted annually by the CDC. Each year, the survey collects responses from over 400,000 Americans on health-related risk behaviors, chronic health conditions, and the use of preventative services. It has been conducted every year since 1984.

This dataset contains 3 files:

1. **diabetes_012_health_indicators_BRFSS2015.csv** is a clean dataset of 253,680 survey responses to the CDC's BRFSS2015. The target variable Diabetes_012 has 3 classes: 0 is for no diabetes or only during pregnancy, 1 is for prediabetes, and 2 is for diabetes. This dataset has 21 feature variables and is imbalanced.

2. **diabetes_binary_5050split_health_indicators_BRFSS2015.csv** is a clean dataset of 70,692 survey responses to the CDC's BRFSS2015. It has an equal 50-50 split of respondents with no diabetes and with either prediabetes or diabetes. The target variable Diabetes_binary has 2 classes: 0 is for no diabetes, and 1 is for prediabetes or diabetes. This dataset has 21 feature variables and is balanced.

3. **diabetes_binary_health_indicators_BRFSS2015.csv** is a clean dataset of 253,680 survey responses to the CDC's BRFSS2015. The target variable Diabetes_binary has 2 classes: 0 is for no diabetes, and 1 is for prediabetes or diabetes. This dataset has 21 feature variables and is not balanced.


To start, we will use file 2, the balanced binary dataset.

In [20]:
df = pd.read_csv('../datasets/diabetes-health-indicators/diabetes_binary_5050split_health_indicators_BRFSS2015.csv')
df.head()

## **3. Train and evaluate the GBDT model**

X = df.drop(columns=['Diabetes_binary'])
y = df['Diabetes_binary']


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

gbdt_model = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=42)

gbdt_model.fit(X_train, y_train)



# Predict positive-class probabilities (for AUC) and hard class labels (for F1)
y_pred_proba = gbdt_model.predict_proba(X_test)[:, 1]
y_pred = gbdt_model.predict(X_test)

# Calculate AUC and F1 on the held-out test set
auc = roc_auc_score(y_test, y_pred_proba)
f1 = f1_score(y_test, y_pred)
print(f"AUC: {auc:.4f}, F1: {f1:.4f}")



AUC: 0.8295, F1: 0.7595
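
The two metrics above consume different inputs: AUC is computed from the predicted probabilities, while F1 is computed from hard class labels. The sketch below reimplements both from scratch on a hypothetical six-sample example to show what each one measures. The helper names `auc_by_ranking` and `f1_at_threshold`, the toy labels and scores, and the 0.5 threshold are assumptions for illustration; they only approximate what `roc_auc_score` and `f1_score` do in the cell above.

```python
def auc_by_ranking(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def f1_at_threshold(y_true, scores, threshold=0.5):
    """F1 of the hard labels obtained by thresholding the scores."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(preds, y_true) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, y_true) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, y_true) if p == 0 and y == 1)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels and predicted probabilities (not taken from the dataset)
y_true_toy = [0, 0, 1, 1, 0, 1]
scores_toy = [0.1, 0.4, 0.35, 0.8, 0.6, 0.7]

toy_auc = auc_by_ranking(y_true_toy, scores_toy)
toy_f1 = f1_at_threshold(y_true_toy, scores_toy)
print(f"AUC: {toy_auc:.4f}, F1: {toy_f1:.4f}")
```

AUC depends only on how the scores order positives relative to negatives, so it does not change with the decision threshold, whereas F1 does. On the balanced 50-50 file used here, both metrics are easier to interpret than they would be on the imbalanced files.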


## **4. Explain Model predictions using TreeSHAP**

In [None]:
# Compute SHAP values for every training row, then plot global feature importance
shap_values = shap.TreeExplainer(gbdt_model).shap_values(X_train)
shap.summary_plot(shap_values, X_train, plot_type='bar')
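
The bar summary plot ranks features by their mean absolute SHAP value across samples. The following sketch reproduces that ranking with NumPy on a small, made-up SHAP matrix; the values and the feature names `f0`, `f1`, `f2` are hypothetical and are not output from the model above.

```python
import numpy as np

# Hypothetical SHAP matrix: 4 samples x 3 features (illustrative values only)
feature_names = ["f0", "f1", "f2"]
toy_shap = np.array([
    [0.40, -0.10, 0.20],
    [-0.30, 0.25, 0.10],
    [0.50, -0.05, -0.20],
    [-0.40, 0.20, 0.30],
])

# Global importance: average magnitude of each feature's contribution
mean_abs = np.abs(toy_shap).mean(axis=0)
# Signed mean, shown only for contrast: opposite signs cancel out
mean_signed = toy_shap.mean(axis=0)

# Feature indices sorted from most to least important
order = np.argsort(mean_abs)[::-1]
for idx in order:
    print(
        f"{feature_names[idx]}: mean|SHAP|={mean_abs[idx]:.3f}, "
        f"mean SHAP={mean_signed[idx]:.3f}"
    )
```

Taking the absolute value first matters: a feature that pushes some predictions up and others down can have a signed mean near zero while still being the most influential feature overall. Note also that, with its default settings, `TreeExplainer` explains the raw margin (log-odds) of a `GradientBoostingClassifier` rather than the predicted probability, so the magnitudes in the plot are in log-odds units.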