In [1]:
#%pip install shap==0.44.1
%pip install shap


In [2]:
import shap
import pandas as pd
import numpy as np
shap.initjs()



ModuleNotFoundError: No module named 'shap'

In [None]:
print(shap.__version__)
customer = pd.DataFrame()
customer

In [None]:
customer = pd.read_csv("sample_data/CustomerChurn.csv")

customer.head()

In [None]:
print(customer.dtypes)


In [None]:
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X = customer.drop("Churn", axis='columns')


y = customer.Churn # Dependent variable

# Split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Train a machine learning model
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
clf.fit(X_train, y_train)

# Make prediction on the testing data
y_pred = clf.predict(X_test)

# Classification Report
# classification_report expects the true labels first, then the predictions
print(classification_report(y_test, y_pred))

In [None]:
y_pred


This code performs a classification task using a Random Forest model.

First, it separates the features (X) from the target variable (y).
The 'Churn' column is the target — it indicates whether a customer has left or not.
All other columns are used as input features to predict churn.

The dataset is split into a training set (70%) and a test set (30%) using train_test_split.
The training set is used to train a RandomForestClassifier, an ensemble model that builds many decision trees and combines their votes.
Then, the model predicts the churn values for the test set (X_test).

The classification_report function prints performance metrics comparing the actual values (y_test) with the predicted values (y_pred). The report includes precision, recall, F1-score, and support for each class.

This process helps evaluate how well the model performs in predicting customer churn.
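
To make the metrics in the report concrete, here is a minimal sketch that computes precision, recall, F1-score, and support by hand for the churn class. The labels are invented for illustration and binary_metrics is a hypothetical helper, not a scikit-learn function. It also shows why the true labels must come first: swapping the two arguments swaps precision and recall and changes the support.

```python
# Toy labels, invented for illustration only (1 = churn, 0 = no churn).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]


def binary_metrics(y_true, y_pred, positive=1):
    """Return (precision, recall, f1, support) for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    support = sum(t == positive for t in y_true)  # number of true churners
    return precision, recall, f1, support


precision, recall, f1, support = binary_metrics(y_true, y_pred)
print("correct order:", precision, recall, round(f1, 4), support)

# Swapping the arguments treats the predictions as ground truth.
swapped = binary_metrics(y_pred, y_true)
print("swapped order:", swapped[0], swapped[1], round(swapped[2], 4), swapped[3])
```

F1 is symmetric in the two arguments, so only precision, recall, and support reveal a swapped call.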


In [None]:
# Create the explainer (shap.Explainer dispatches to a TreeExplainer for tree-based models)
explainer = shap.Explainer(clf, X_train, feature_names=X_train.columns, model_output="probability")
shap_values = explainer(X_test)

# Check shapes
print("shap_values shape:", shap_values.values.shape)
print("X_test shape:", X_test.shape)

This code uses SHAP (SHapley Additive exPlanations) to interpret the predictions made by the trained Random Forest model.

A SHAP Explainer is created for the trained model (clf), with the training data (X_train) serving as the background dataset against which feature effects are measured.
The parameter 'feature_names' ensures the explanations are labeled correctly.
'model_output="probability"' tells SHAP to explain the model's predicted probabilities instead of raw scores or logits.

Then, shap_values are calculated for the test set (X_test).
These values represent how much each feature contributed to pushing the prediction higher or lower for each individual test sample.

Finally, the shapes of the resulting SHAP values and X_test are printed to confirm they match.
shap_values.values.shape should have dimensions (number of samples, number of features, number of classes),
and X_test.shape should match the sample and feature count.
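
To illustrate what a per-feature contribution means, the sketch below computes exact Shapley values by brute force for a tiny toy function. Everything in it is an assumption made for illustration: toy_model stands in for the random forest, the background is a single all-zeros reference point, and the enumeration over subsets is only practical for a handful of features. It is not how the SHAP library computes values for tree models. The point is the additivity property: the base value plus the sum of the contributions equals the prediction.

```python
from itertools import combinations
from math import factorial


def toy_model(x):
    # Hypothetical scoring function with an interaction term (not the random forest).
    return 0.1 + 0.3 * x[0] + 0.2 * x[1] * x[2]


background = [0.0, 0.0, 0.0]  # assumed reference input
sample = [1.0, 1.0, 1.0]      # the input being explained


def value(subset):
    # Features in `subset` take the sample's values; the rest stay at the background.
    x = [sample[i] if i in subset else background[i] for i in range(len(sample))]
    return toy_model(x)


def shapley_values():
    n = len(sample)
    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_i = value(set(subset) | {i})
                without_i = value(set(subset))
                total += weight * (with_i - without_i)
        phi.append(total)
    return phi


phi = shapley_values()
base_value = value(set())  # model output at the background point
print("contributions:", [round(p, 4) for p in phi])
print("base + sum:", round(base_value + sum(phi), 4), "prediction:", toy_model(sample))
```

The interaction term is split evenly between the second and third features, which is the kind of attribution a Shapley-based explanation produces.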


In [None]:
# Summary plot for the positive class (Churn = 1)
shap.summary_plot(shap_values.values[:, :, 1], X_test)

This SHAP summary plot explains how each feature (column in your dataset) influenced the model’s prediction for Churn = 1 (meaning: the customer leaves).
Each dot is one customer. The x-axis shows how much a feature pushed the prediction towards or away from churn.
Features are listed from top to bottom by importance (top = most important).
Left side (negative SHAP value) = feature pushed the prediction toward no churn.
Right side (positive SHAP value) = feature pushed the prediction toward churn.

Color shows the value of the feature for that customer:
🔴 Red = high value
🔵 Blue = low value

Status is the most important feature:
Red dots are on the right → high "Status" values tend to increase churn risk.
Blue dots are on the left → low "Status" values reduce churn risk.

Complains:
High values (red) strongly push predictions toward churn.
This makes sense: customers who complain are more likely to leave.

Seconds of Use:
Red and blue dots both appear on both sides of zero.
This means the effect depends on the context (sometimes high usage reduces churn, sometimes it increases it).
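
The [:, :, 1] slice used for the summary plot assumes the SHAP values come back as one 3-D array of shape (samples, features, classes). As a hedged sketch, the helper below selects the positive class whether the values arrive in that layout or as a list with one 2-D array per class, a layout assumed here for older SHAP interfaces. positive_class_values is a hypothetical name, not part of the shap API, and the arrays are synthetic.

```python
import numpy as np


def positive_class_values(values, class_index=1):
    """Return a (samples, features) array of SHAP values for one class.

    Hypothetical helper for illustration; not part of the shap API.
    """
    if isinstance(values, list):
        # Assumed older layout: one (samples, features) array per class.
        return np.asarray(values[class_index])
    values = np.asarray(values)
    if values.ndim == 3:
        # Layout used in this notebook: (samples, features, classes).
        return values[:, :, class_index]
    # Already 2-D: a single output, nothing to select.
    return values


# Synthetic stand-ins: 5 samples, 3 features, 2 classes.
as_array = np.zeros((5, 3, 2))
as_array[:, :, 1] = 1.0
as_list = [np.zeros((5, 3)), np.ones((5, 3))]

print(positive_class_values(as_array).shape)
print(positive_class_values(as_list).shape)
```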

In [None]:
# SHAP feature importance bar chart
shap.summary_plot(shap_values.values[:, :, 1], X_test, plot_type="bar")


SHAP feature importance bar chart
This chart shows the average impact of each feature on the model’s prediction for customer churn.
Each bar represents a feature from your dataset.
The longer the bar, the more influence that feature has on the model’s predictions.
The x-axis is the mean of the absolute SHAP values (mean(|SHAP value|)), which means:
How strongly, on average, a feature contributes to pushing a prediction toward either "churn" or "not churn" — regardless of direction.
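
The quantity on the x-axis can be reproduced directly. The sketch below uses made-up SHAP values (four customers, three of the feature names mentioned above) to compute mean(|SHAP value|) per feature and rank the features by it. It also prints the signed mean to show why the absolute value matters: positive and negative contributions would otherwise cancel out.

```python
import numpy as np

# Made-up SHAP values for the positive class: 4 customers x 3 features.
feature_names = ["Status", "Complains", "Seconds of Use"]
toy_shap = np.array([
    [0.20, -0.05, 0.01],
    [-0.30, 0.10, -0.02],
    [0.10, 0.15, 0.03],
    [-0.20, -0.10, -0.02],
])

mean_abs = np.abs(toy_shap).mean(axis=0)  # what the bar chart plots
signed_mean = toy_shap.mean(axis=0)       # direction-aware average, for comparison

# Rank features from most to least influential.
order = np.argsort(mean_abs)[::-1]
ranked = [feature_names[i] for i in order]

for i in order:
    print(f"{feature_names[i]:<15} mean|SHAP|={mean_abs[i]:.3f}  mean SHAP={signed_mean[i]:+.3f}")
```

With the real model, the same calculation would apply to shap_values.values[:, :, 1] in place of toy_shap.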

In [None]:
#SHAP Decision Plot
shap.decision_plot(
    base_value=explainer.expected_value[1], # Use index 1 for the probability of class 1 (Churn)
    shap_values=shap_values.values[:5, :, 1], # Use index 1 for the probability of class 1 and select first 5 samples
    features=X_test.iloc[:5], # Select the first 5 samples from X_test
    feature_names=X_test.columns.tolist()
)


This SHAP Decision Plot shows how the model made its predictions for the first 5 customers in your test set.
Each line represents one customer and how the model's output value (probability of churn) was built up step by step from the base value by adding the effect of each feature.
The x-axis shows the model output (probability of churn).
Segments that move left lower the predicted churn probability; segments that move right raise it.
Each line starts at the bottom of the plot at the base value (average model output for the background data, here the training set).
As you move from bottom to top, each feature's SHAP value is added, shifting the line left or right until it reaches the final prediction at the top.

The color of the lines shows the prediction:
🔴 Red = high churn risk
🔵 Blue = low churn risk
🟣 Purple = in-between (medium risk)
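
The path of a single line in the decision plot is a running sum. The sketch below assumes a hypothetical base value and three hypothetical SHAP values, listed in the order the features are drawn from the bottom of the plot to the top, and accumulates them to recover each point on the line and the final predicted probability.

```python
import numpy as np

# Hypothetical numbers for one customer; not taken from the notebook's model.
base_value = 0.16                         # assumed average predicted churn probability
shap_row = np.array([0.02, -0.05, 0.30])  # contributions, ordered bottom to top

# Each point on the line is the base value plus the contributions added so far.
path = base_value + np.cumsum(shap_row)
prediction = path[-1]

print("line positions:", np.round(path, 2))
print("final prediction:", round(float(prediction), 2))
```

For the real model, base_value would correspond to explainer.expected_value[1] and shap_row to one row of shap_values.values[:, :, 1], reordered as the plot draws them.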