# Task 3: Model Explainability

This notebook uses SHAP (SHapley Additive exPlanations) to interpret the best-performing Random Forest models from Task 2, which were trained for fraud detection on the e-commerce and credit card datasets at Adey Innovations Inc. It relies on modular functions in `src/` for preprocessing, model training, and SHAP analysis.

## Objectives
- Load preprocessed data from Task 2.
- Retrain and evaluate the Task 2 models to recover the best Random Forest for each dataset.
- Apply SHAP to analyze feature importance using `shap_utils`.
- Generate and interpret SHAP Summary and Force plots.

## Datasets
- `processed_ecommerce_with_features.csv`: Cleaned e-commerce data.
- `processed_creditcard.csv`: Cleaned credit card data.

## Setup
- Run in the virtual environment with dependencies from `requirements.txt` (including `shap==0.45.1`, `pandas==2.2.2`, `numpy==1.26.4`, `scikit-learn==1.5.1`, `imbalanced-learn==0.12.3`).
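
Before running the notebook, it can help to confirm that the active environment matches the pins listed above. The following is a minimal sketch using only the standard library; the `PINNED` dictionary and the `check_pins` helper are illustrative names, not part of `src/`.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the Setup list above.
PINNED = {
    "shap": "0.45.1",
    "pandas": "2.2.2",
    "numpy": "1.26.4",
    "scikit-learn": "1.5.1",
    "imbalanced-learn": "0.12.3",
}


def check_pins(pinned):
    """Return {package: (expected, installed_or_None, matches)}."""
    report = {}
    for name, expected in pinned.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[name] = (expected, installed, installed == expected)
    return report


report = check_pins(PINNED)
for name, (expected, installed, ok) in report.items():
    status = "OK" if ok else "MISMATCH"
    print(f"{name:<18} expected {expected:<8} installed {installed!s:<8} {status}")
```

A mismatch is not necessarily fatal, but SHAP's return shapes for tree classifiers have changed between releases, so the pinned version is the safest choice for reproducing the outputs below.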

In [1]:
# Import necessary libraries and modules
import pandas as pd
import numpy as np
import sys
import os
sys.path.append('..')
from src.data_utils import load_data
from src.feature_engineering import add_time_features, add_transaction_frequency, preprocess_data
from src.model_utils import train_evaluate_model
from src.shap_utils import compute_shap_values, plot_shap_summary, plot_shap_force

# Load datasets
ecommerce_df = load_data('../data/processed/processed_ecommerce_with_features.csv')
creditcard_df = load_data('../data/processed/processed_creditcard.csv')

print('E-commerce Dataset Shape:', ecommerce_df.shape)
print('Credit Card Dataset Shape:', creditcard_df.shape)

E-commerce Dataset Shape: (151112, 16)
Credit Card Dataset Shape: (283726, 31)


## Preprocess Datasets

This cell preprocesses both datasets. For the e-commerce data, it adds time-based features (`time_since_signup`, `hour_of_day`, `day_of_week`) and a per-user transaction frequency (`trans_freq`), then applies scaling to the numerical columns, one-hot encoding to the categorical columns, and SMOTE oversampling to the training set. The credit card dataset has no categorical columns, so it only needs numerical scaling and SMOTE. `preprocess_data` also returns the encoded feature names, which are used later to label the SHAP plots.
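
To make the SMOTE step concrete, here is a minimal pure-Python sketch of the core idea: each synthetic fraud row is an interpolation between a real minority row and one of its nearest minority neighbours. This is a simplified illustration under stated assumptions, not the `imbalanced-learn` implementation that `preprocess_data` presumably wraps; the function name `smote_like_oversample` and the toy numbers are hypothetical.

```python
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create `n_new` synthetic rows by interpolating between a minority row
    and one of its k nearest minority neighbours (the core SMOTE idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority rows,
        # ranked by squared Euclidean distance
        neighbours = sorted(
            (row for row in minority if row is not base),
            key=lambda row: sum((a - b) ** 2 for a, b in zip(base, row)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy, already-scaled TRAINING split (hypothetical numbers): 6 legitimate rows, 3 fraud rows.
majority = [[0.1, 0.2], [0.0, -0.3], [0.4, 0.1], [-0.2, 0.3], [0.3, -0.1], [-0.1, 0.0]]
minority = [[1.5, 1.2], [1.8, 0.9], [1.2, 1.6]]

# Add just enough synthetic fraud rows to match the majority class size.
new_rows = smote_like_oversample(minority, n_new=len(majority) - len(minority), k=2)
balanced_minority = minority + new_rows
print(len(majority), len(balanced_minority))
```

Only the training split should be oversampled; the test split has to keep its original class ratio so that the evaluation metrics reflect real-world fraud rates. The "Resampled train set shape" output below is consistent with that design.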

In [2]:
# Preprocess datasets (reuse Task 2 preprocessing with additional features)
ecomm_cat_cols = ['source', 'browser', 'country']
ecomm_num_cols = ['purchase_value', 'time_since_signup', 'hour_of_day', 'day_of_week', 'trans_freq']
ecommerce_df = add_time_features(ecommerce_df)
ecommerce_df = add_transaction_frequency(ecommerce_df, 'user_id')
X_train_ecomm, X_test_ecomm, y_train_ecomm, y_test_ecomm, ecomm_encoder, ecomm_feature_names = preprocess_data(
    ecommerce_df, 'class', ecomm_cat_cols, ecomm_num_cols
)

cc_num_cols = ['Time', 'Amount', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 
               'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21', 'V22', 
               'V23', 'V24', 'V25', 'V26', 'V27', 'V28']
X_train_cc, X_test_cc, y_train_cc, y_test_cc, cc_encoder, cc_feature_names = preprocess_data(
    creditcard_df, 'Class', [], cc_num_cols
)

print('E-commerce - Resampled train set shape:', X_train_ecomm.shape)
print('Credit Card - Resampled train set shape:', X_train_cc.shape)
print('E-commerce feature names:', ecomm_feature_names)
print('Credit Card feature names:', cc_feature_names)

E-commerce - Resampled train set shape: (219136, 152)
Credit Card - Resampled train set shape: (453204, 30)
E-commerce feature names: ['source_Ads', 'source_Direct', 'source_SEO', 'browser_Chrome', 'browser_FireFox', 'browser_IE', 'browser_Opera', 'browser_Safari', 'country_Afghanistan', 'country_Albania', 'country_Algeria', 'country_Angola', 'country_Antigua and Barbuda', 'country_Argentina', 'country_Armenia', 'country_Australia', 'country_Austria', 'country_Azerbaijan', 'country_Bahamas', 'country_Bahrain', 'country_Bangladesh', 'country_Barbados', 'country_Belarus', 'country_Belgium', 'country_Benin', 'country_Bhutan', 'country_Bosnia and Herzegowina', 'country_Botswana', 'country_Brunei Darussalam', 'country_Bulgaria', 'country_Burkina Faso', 'country_Cambodia', 'country_Cameroon', 'country_Canada', 'country_Cape Verde', 'country_Cayman Islands', 'country_China', 'country_Congo The Democratic Republic of The', "country_Cote D'ivoire", 'country_Croatia (LOCAL Name: Hrvatska)', 'cou

## Train and Evaluate Models

This cell reruns the Task 2 training logic, which tunes and evaluates a Logistic Regression and a Random Forest model for each dataset. Only the tuned Random Forest returned for each dataset is kept for SHAP analysis.

In [3]:
# Train and evaluate models; only the tuned Random Forest (second return value) is kept
import time

# E-commerce
start_time = time.perf_counter()  # monotonic clock, better suited to timing than time.time()
_, best_rf_ecomm, y_test_ecomm, _, _ = train_evaluate_model(
    X_train_ecomm, X_test_ecomm, y_train_ecomm, y_test_ecomm, 'E-commerce'
)
print(f'E-commerce training time: {time.perf_counter() - start_time:.2f} seconds')

# Credit card
start_time = time.perf_counter()
_, best_rf_cc, y_test_cc, _, _ = train_evaluate_model(
    X_train_cc, X_test_cc, y_train_cc, y_test_cc, 'Credit Card'
)
print(f'Credit Card training time: {time.perf_counter() - start_time:.2f} seconds')

Tuned E-commerce Logistic Regression Metrics: {'accuracy': 0.6505641398934586, 'precision': 0.16782675947409126, 'recall': 0.6901060070671378, 'f1': 0.2699937789451856, 'roc_auc': 0.668292517277956}
Best C: 10
Tuned E-commerce Random Forest Metrics: {'accuracy': 0.954041623928796, 'precision': 0.9657401422107305, 'recall': 0.5279151943462898, 'f1': 0.682659355723098, 'roc_auc': 0.762990196742378}
Best Parameters: {'max_depth': 20, 'n_estimators': 50}
E-commerce Logistic Regression CV F1 Scores: [0.68774897 0.68757491 0.68464666 0.68906964 0.68306461]
Mean CV F1 Score: 0.686420956432628
E-commerce Random Forest CV F1 Scores: [0.76383296 0.81584554 0.81688014 0.79776512 0.81580934]
Mean CV F1 Score: 0.8020266191487974
E-commerce Logistic Regression Confusion Matrix:\n [[17709  9684]
 [  877  1953]]
E-commerce Random Forest Confusion Matrix:\n [[27340    53]
 [ 1336  1494]]
E-commerce training time: 456.16 seconds
Tuned Credit Card Logistic Regression Metrics: {'accuracy': 0.9736897754907
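
The E-commerce metrics above can be reproduced directly from the printed confusion matrices, which is a useful sanity check. The sketch below assumes scikit-learn's layout `[[TN, FP], [FN, TP]]`; the helper name `metrics_from_confusion` is illustrative. One observation worth flagging: the reported `roc_auc` values coincide with balanced accuracy, (TPR + TNR) / 2, which is what a ROC AUC reduces to when it is computed from hard class predictions rather than predicted probabilities. Whether `train_evaluate_model` does this cannot be confirmed from the notebook alone, so treat it as an inference.

```python
def metrics_from_confusion(cm):
    """Derive binary-classification metrics from [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)        # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "balanced_accuracy": (recall + specificity) / 2,
    }


# Confusion matrices copied from the E-commerce output above.
rf_ecomm = metrics_from_confusion([[27340, 53], [1336, 1494]])
lr_ecomm = metrics_from_confusion([[17709, 9684], [877, 1953]])

for name, m in (("Random Forest", rf_ecomm), ("Logistic Regression", lr_ecomm)):
    print(name, {k: round(v, 4) for k, v in m.items()})
```

If the AUC is indeed based on hard labels, scoring with `predict_proba(X_test)[:, 1]` instead would give a threshold-independent value. The matrices also show why recall matters here: the Random Forest raises very few false alarms (53) but misses 1,336 of the 2,830 fraud cases in the test set.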

## SHAP Analysis

This cell runs SHAP analysis on the tuned Random Forest models for both datasets. It computes SHAP values on a 1,000-sample subset of each test set, then generates Summary plots (global feature importance, labeled with the encoded feature names) and Force plots (local feature contributions for a single sample). The plots are saved to the `plots/` directory.

In [5]:
# SHAP Analysis for Random Forest (best model)
shap_values_ecomm, X_test_ecomm_subset, explainer_ecomm = compute_shap_values(best_rf_ecomm, X_test_ecomm, 'E-commerce')
shap_values_cc, X_test_cc_subset, explainer_cc = compute_shap_values(best_rf_cc, X_test_cc, 'Credit Card')

# Generate SHAP plots using the subset data and encoded feature names
plot_shap_summary(shap_values_ecomm, X_test_ecomm_subset, 'E-commerce', feature_names=ecomm_feature_names)
plot_shap_summary(shap_values_cc, X_test_cc_subset, 'Credit Card', feature_names=cc_feature_names)
plot_shap_force(shap_values_ecomm, X_test_ecomm_subset, 'E-commerce', explainer_ecomm)
plot_shap_force(shap_values_cc, X_test_cc_subset, 'Credit Card', explainer_cc)

SHAP values computed for E-commerce with 1000 samples
SHAP values shape for E-commerce: (1000, 152)
X_test_subset shape for E-commerce: (1000, 152)
SHAP values computed for Credit Card with 1000 samples
SHAP values shape for Credit Card: (1000, 30)
X_test_subset shape for Credit Card: (1000, 30)
Force plot saved for E-commerce
Force plot saved for Credit Card


## SHAP Interpretation

This section interprets the SHAP plots generated above. The Summary Plot ranks features by their overall contribution across the sampled test rows (global importance), while the Force Plot shows how each feature pushes a single prediction above or below the model's base value (local contribution). Together, they help identify what drives the models' fraud predictions in each dataset.

- **E-commerce Dataset**: The SHAP Summary Plot indicates that `purchase_value` and `time_since_signup` contribute most to the model's fraud predictions, with higher purchase values and newer accounts increasing the predicted fraud probability. The Force Plot for a single sample shows `time_since_signup` as a major contributor, pushing the prediction toward fraud for a new account.
- **Credit Card Dataset**: The SHAP Summary Plot highlights `V14`, `V17`, and `Amount` as the top contributors, with large transaction amounts and specific `V` features raising the predicted fraud likelihood. The Force Plot for a single sample illustrates how `V14` and `Amount` increase the predicted fraud probability.
- **Insights**: The globally important features align with domain knowledge (e.g., transaction amount, account age), which makes the Random Forest's predictions easier to interpret and justify. These attributions describe what drives the model's output, not necessarily what causes fraud.
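
To show how a Force Plot should be read, the sketch below computes exact Shapley values for a toy scoring function and checks the additivity property that SHAP relies on: base value plus the sum of the per-feature attributions equals the prediction for that sample. Everything here is a hypothetical illustration: `toy_fraud_score`, the thresholds, and the background rows are made up, and brute-force enumeration is used in place of the tree-specific algorithm that the `shap` library applies to Random Forests.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, background):
    """Exact Shapley values for one sample `x`.

    Features outside a coalition are filled in from each background row and the
    predictions are averaged (an interventional value function).
    """
    n = len(x)

    def value(coalition):
        total = 0.0
        for row in background:
            mixed = [x[j] if j in coalition else row[j] for j in range(n)]
            total += predict(mixed)
        return total / len(background)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        contribution = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                contribution += weight * (value(s | {i}) - value(s))
        phi.append(contribution)
    return value(set()), phi


def toy_fraud_score(features):
    """Hypothetical score; hour_of_day is deliberately ignored."""
    purchase_value, time_since_signup, hour_of_day = features
    score = 0.1
    if purchase_value > 100:
        score += 0.3
    if time_since_signup < 1:  # assumed unit: days
        score += 0.4
    return score


feature_names = ["purchase_value", "time_since_signup", "hour_of_day"]
background = [[20, 30, 9], [50, 10, 14], [150, 5, 22], [80, 0.5, 3]]
x = [200, 0.2, 2]  # large purchase from a brand-new account

base_value, phi = shapley_values(toy_fraud_score, x, background)
prediction = toy_fraud_score(x)

print("base value:", round(base_value, 3))
for name, contribution in zip(feature_names, phi):
    print(f"{name}: {contribution:+.3f}")
print("base + contributions:", round(base_value + sum(phi), 3), "prediction:", round(prediction, 3))
```

In this toy case the base value is 0.275 (the mean score over the background rows), `purchase_value` adds 0.225, `time_since_signup` adds 0.3, and the unused `hour_of_day` receives exactly zero, which sums to the prediction of 0.8. A Force Plot draws the same decomposition, and a Summary Plot aggregates the magnitude of these per-sample contributions across many rows.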