In [8]:
!pip install shap

Collecting shap
  Using cached shap-0.50.0-cp313-cp313-macosx_10_13_x86_64.whl.metadata (25 kB)
Collecting tqdm>=4.27.0 (from shap)
  Using cached tqdm-4.67.3-py3-none-any.whl.metadata (57 kB)
Collecting slicer==0.0.8 (from shap)
  Using cached slicer-0.0.8-py3-none-any.whl.metadata (4.0 kB)
Collecting numba>=0.54 (from shap)
  Using cached numba-0.64.0-cp313-cp313-macosx_10_15_x86_64.whl
Collecting cloudpickle (from shap)
  Using cached cloudpickle-3.1.2-py3-none-any.whl.metadata (7.1 kB)
Collecting llvmlite<0.47,>=0.46.0dev0 (from numba>=0.54->shap)
  Using cached llvmlite-0.46.0.tar.gz (193 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Using cached shap-0.50.0-cp313-cp313-macosx_10_13_x86_64.whl (558 kB)
Using cached slicer-0.0.8-py3-none-any.whl (15 kB)
Using cached tqdm-4.67.3-py3-none-any.whl (78 kB)
Using cached cloudpickle-3.1.2-py3-none-any.whl (22 kB)
Bui

In [9]:
# import packages
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
import shap
import matplotlib.pyplot as plt
import seaborn as sns

ModuleNotFoundError: No module named 'shap'

In [5]:
seed = 2724

### Import data

In [None]:
DF_PATH = "mod04_data/sample.csv"
df = pd.read_csv(DF_PATH)

### Separate data by independent (X) and dependent (y) variables

In [None]:
X = df[["income", "education_years", "zipcode_score"]]
y = df["target"]

### Split the data into a _training_ set (to build a model) and _test_ set (to validate a model)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=seed
)

### Build a model on the training set

In [None]:
model = RandomForestRegressor(
    n_estimators=200,
    random_state=seed
)
model.fit(X_train, y_train)

### Use SHAP to explain the model on test data

In [None]:
explainer = shap.Explainer(model, X_train)
shap_values = explainer(X_test)

This lets us see which variables contribute most to the model's predictions.

In [None]:
shap.plots.bar(shap_values)
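As a rough illustration of what the bar plot summarizes: by default, `shap.plots.bar` ranks features by their mean absolute SHAP value across rows. The sketch below reproduces that ranking in plain Python. The helper `mean_abs_by_feature` and the numbers in `toy_shap` are made up for illustration and are not taken from this dataset.

```python
def mean_abs_by_feature(shap_matrix, feature_names):
    """Return {feature: mean(|SHAP|)} sorted from most to least important."""
    n_rows = len(shap_matrix)
    means = {
        name: sum(abs(row[j]) for row in shap_matrix) / n_rows
        for j, name in enumerate(feature_names)
    }
    # Sort by importance, largest first (dicts preserve insertion order)
    return dict(sorted(means.items(), key=lambda kv: kv[1], reverse=True))


# Hypothetical SHAP values: 2 rows x 3 features (illustrative numbers only)
toy_shap = [
    [2.0, -0.5, 1.0],
    [-4.0, 0.5, -1.0],
]
feature_names = ["income", "education_years", "zipcode_score"]

ranking = mean_abs_by_feature(toy_shap, feature_names)
print(ranking)
```

With the real model, the same idea would presumably apply to `shap_values.values` and `X_test.columns`.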

### Import the `group` variable, which was **not** used in training this model.

In [None]:
X_test_with_group = X_test.copy()
X_test_with_group["group"] = df.loc[X_test.index, "group"]

### Look at the difference in SHAP values between the two groups across the variables used in the model.

In [None]:
shap_df = pd.DataFrame(shap_values.values, columns=X_test.columns)
shap_df["group"] = X_test_with_group["group"].values

shap_df.groupby("group").mean()

### Let's put `group` and `zipcode_score` in the same plot

In [None]:
def plot_shap(var):
    # Extract SHAP values for the feature
    shap_var = shap_values[:, var].values

    # Plot the values of each group using different colors
    plt.figure()
    plt.scatter(
        X_test[var],
        shap_var,
        c=X_test_with_group["group"]
    )
    plt.xlabel(var)
    plt.ylabel(f"SHAP value for {var}")
    plt.title("Proxy feature impact by group")
    plt.show()

plot_shap("zipcode_score")

# Discussion Questions

### What is a _SHAP_ (or Shapley) value? 

A SHAP (SHapley Additive exPlanations) value comes from cooperative game theory. It measures how much each individual feature contributes to a machine learning model's prediction for a given observation.
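To make the game theory idea concrete, here is a minimal sketch of an exact Shapley value calculation on a toy problem. It averages each feature's marginal contribution over every possible ordering of the features. The payoff function `toy_value` is hypothetical and stands in for "what the model predicts when only these features are known". The `shap` library uses much faster algorithms in practice, so this is not how the library computes its values.

```python
from itertools import permutations


def shapley_values(features, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    contrib = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        coalition = frozenset()
        for f in order:
            with_f = coalition | {f}
            # Marginal contribution of f when it joins the current coalition
            contrib[f] += value(with_f) - value(coalition)
            coalition = with_f
    return {f: total / len(orderings) for f, total in contrib.items()}


def toy_value(coalition):
    """Hypothetical payoff for a set of known features (illustration only)."""
    v = 0.0
    if "income" in coalition:
        v += 10.0
    if "education_years" in coalition:
        v += 4.0
    if "income" in coalition and "zipcode_score" in coalition:
        v += 6.0  # interaction: only pays off when both are present
    return v


features = ["income", "education_years", "zipcode_score"]
phi = shapley_values(features, toy_value)
print(phi)
# The values sum to the payoff with all features minus the payoff with none
print(sum(phi.values()), toy_value(frozenset(features)) - toy_value(frozenset()))
```

Note how the interaction payoff is split evenly between `income` and `zipcode_score`, and how the contributions add up to the full payoff. That additivity is what lets SHAP values be read as a decomposition of a single prediction.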

### Suppose you built this model and then it is peer reviewed by another entity. If the reviewer asks whether you used the variable `group` in your model, what would your answer be?

My answer would be "No". The only features used to train the model were `income`, `education_years`, and `zipcode_score`. The variable `group` was introduced only after training, to analyze the SHAP values by group.

### If the reviewer asks whether the outcome of your model is correlated with `group`, what would your answer be?

It may be. The features used to train the model may themselves be correlated with `group`, so the model can still encode group-related patterns through them, and its predictions can be statistically associated with `group` even though `group` was never an input. For example, if the random forest learns that higher income and a higher zipcode score lead to a higher predicted loan repayment, and those features differ between groups, then the predictions will differ between groups as well. This is referred to as a proxy effect.
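The proxy effect can be sketched with synthetic data. Everything below is assumed for illustration: the data are randomly generated, the `predict` function is a simple stand-in rather than the random forest from this notebook, and `zipcode_score` is deliberately constructed to shift with `group`. The point is only that predictions can correlate with a variable the model never sees.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


rng = random.Random(2724)

# Synthetic data: group is 0 or 1; zipcode_score is built to depend on group,
# while income is generated independently of group.
group = [rng.randint(0, 1) for _ in range(1000)]
zipcode_score = [rng.gauss(1.0 if g == 1 else -1.0, 0.5) for g in group]
income = [rng.gauss(0.0, 1.0) for _ in group]


def predict(inc, zc):
    """Stand-in model (assumption): uses income and zipcode_score, never group."""
    return 0.5 * inc + 2.0 * zc


preds = [predict(i, z) for i, z in zip(income, zipcode_score)]

r_pred = pearson(group, preds)
r_income = pearson(group, income)
print(f"corr(group, prediction) = {r_pred:.2f}")
print(f"corr(group, income)     = {r_income:.2f}")
```

In this setup the prediction correlates strongly with `group` purely through `zipcode_score`, while `income` on its own shows almost no association.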

### Construct a "proxy feature impact by group" plot for `income`. How is this plot different from the one for `zipcode_score`?

In [10]:
# The structure of the plot is already defined in the plot_shap(var) function
plot_shap("income")

NameError: name 'plot_shap' is not defined

The plot for `income` shows the points from the two groups heavily intermixed, whereas the plot for `zipcode_score` showed a clean separation between groups. This suggests that `income` varies widely within each group and is only weakly related to `group`, while `zipcode_score` has a clear boundary between the groups.

### If, instead, you were the **reviewer**, what other questions might you ask the person who built this model? Give at least two.

(1) Are both groups fairly represented in the training data?
(2) Do these group differences reflect real-life relationships?
(3) Are error rates statistically different across groups?
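As a starting point for question (3), a reviewer could compare the model's error within each group. The sketch below computes the mean absolute error per group in plain Python. The helper `mae_by_group` and the toy numbers are hypothetical, and a per-group average alone does not establish statistical significance; that would need a formal test on top of it.

```python
from collections import defaultdict


def mae_by_group(y_true, y_pred, groups):
    """Mean absolute error of the predictions, computed separately per group."""
    abs_errors = defaultdict(list)
    for t, p, g in zip(y_true, y_pred, groups):
        abs_errors[g].append(abs(t - p))
    return {g: sum(errs) / len(errs) for g, errs in abs_errors.items()}


# Hypothetical toy values (not from this dataset)
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.0, 6.0]
groups = ["A", "A", "B", "B"]

result = mae_by_group(y_true, y_pred, groups)
print(result)
```

In this notebook, the inputs would presumably be `y_test`, `model.predict(X_test)`, and `X_test_with_group["group"]`.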