# 🔍 SHAP Interpretation for Tox21 - XGBoost

This notebook trains an XGBoost model for each Tox21 target using ECFP4 fingerprints, performs a train/test split, saves the model, and generates SHAP summary plots for interpretation.

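Before running:
- Place the Tox21 data in `data/tox21.csv`, with a `smiles` column and one column per target.
- The ECFP4 feature matrix `X` is computed from the SMILES strings in cell 2, so it does not need to exist beforehand.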

In [1]:
import shap
import joblib
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Create necessary directories
os.makedirs("final_models", exist_ok=True)
os.makedirs("shap_summaries", exist_ok=True)
os.makedirs("shap_summaries/xgb", exist_ok=True)

# Load dataset
df = pd.read_csv("data/tox21.csv")

# The ECFP4 matrix X is computed from df['smiles'] in the next cell

In [2]:
from rdkit import Chem
from rdkit.Chem.rdFingerprintGenerator import GetMorganGenerator
import numpy as np

# Generate ECFP4 fingerprints (Morgan radius=2, 1024 bits)
N_BITS = 1024
morgan_gen = GetMorganGenerator(radius=2, fpSize=N_BITS)

def smiles_to_ecfp(smiles):
    """Return the ECFP4 bit vector for a SMILES string as a NumPy array."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        # Unparseable SMILES: return an all-zero vector so rows stay aligned with df
        return np.zeros((N_BITS,))
    return np.array(morgan_gen.GetFingerprint(mol))

# Apply to all rows
X = np.array([smiles_to_ecfp(s) for s in df['smiles']])




In [3]:
targets = [
    'NR-AR', 'NR-AR-LBD', 'NR-AhR', 'NR-Aromatase', 'NR-ER',
    'NR-ER-LBD', 'NR-PPAR-gamma', 'SR-ARE', 'SR-ATAD5',
    'SR-HSE', 'SR-MMP', 'SR-p53'
]

In [4]:
for target in targets:
    print(f"🚀 Processing {target} (XGBoost)")

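    y = df[target]
    mask = y.notna().to_numpy()  # boolean NumPy mask: rows with a measured label
    X_clean = X[mask]
    y_clean = y[mask].astype(int).to_numpy()  # labels as 0/1 integers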

    if len(np.unique(y_clean)) < 2:
        print(f"⚠️ Skipping {target} — only one class present.")
        continue

    X_train, X_test, y_train, y_test = train_test_split(
        X_clean, y_clean, test_size=0.2, stratify=y_clean, random_state=42
    )

    # Train XGBoost
    # (use_label_encoder is omitted: recent XGBoost releases ignore it and only emit a warning)
    model = XGBClassifier(
        n_estimators=100,
        max_depth=6,
        eval_metric='logloss',
        random_state=42
    )
    model.fit(X_train, y_train)

    # Save the model
    joblib.dump(model, f"final_models/{target}_XGBoost.joblib")

    # SHAP summary plot
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X_test)

    plt.figure()
    shap.summary_plot(shap_values, X_test, plot_type="bar", max_display=20, show=False)
    plt.title(f"{target} - SHAP Summary (XGBoost)")
    plt.savefig(f"shap_summaries/xgb/{target}_XGBoost_shap_summary.png", bbox_inches='tight')
    plt.close()
    print(f"✅ Saved model and SHAP plot for {target}")


🚀 Processing NR-AR (XGBoost)
✅ Saved model and SHAP plot for NR-AR
🚀 Processing NR-AR-LBD (XGBoost)
✅ Saved model and SHAP plot for NR-AR-LBD
🚀 Processing NR-AhR (XGBoost)
✅ Saved model and SHAP plot for NR-AhR
🚀 Processing NR-Aromatase (XGBoost)
✅ Saved model and SHAP plot for NR-Aromatase
🚀 Processing NR-ER (XGBoost)
✅ Saved model and SHAP plot for NR-ER
🚀 Processing NR-ER-LBD (XGBoost)
✅ Saved model and SHAP plot for NR-ER-LBD
🚀 Processing NR-PPAR-gamma (XGBoost)
✅ Saved model and SHAP plot for NR-PPAR-gamma
🚀 Processing SR-ARE (XGBoost)
✅ Saved model and SHAP plot for SR-ARE
🚀 Processing SR-ATAD5 (XGBoost)
✅ Saved model and SHAP plot for SR-ATAD5
🚀 Processing SR-HSE (XGBoost)
✅ Saved model and SHAP plot for SR-HSE
🚀 Processing SR-MMP (XGBoost)
✅ Saved model and SHAP plot for SR-MMP
🚀 Processing SR-p53 (XGBoost)
✅ Saved model and SHAP plot for SR-p53


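The SHAP summary plots rank the fingerprint features that most influence each model’s toxicity predictions. Each bar represents one ECFP4 bit (a hashed molecular substructure), ranked by its mean absolute SHAP value across the test samples; a longer bar means the bit shifts the model’s output more on average. Because the bars show absolute values, they convey the magnitude of a bit’s contribution but not whether it pushes predictions toward or away from toxicity. The bits are also not directly interpretable: with 1024-bit hashing, several distinct substructures can map to the same bit. A bit’s high importance therefore suggests, but does not prove, that the model has captured a consistent structure-activity relationship.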
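
The ranking behind each bar chart can be reproduced directly from the SHAP matrix. The sketch below is illustrative only: `rank_bits_by_mean_abs_shap` is a hypothetical helper, and `toy_shap` is a made-up stand-in for the array returned by `explainer.shap_values(X_test)`, assumed here to have shape (samples, bits).

```python
import numpy as np

def rank_bits_by_mean_abs_shap(shap_values, top_k=20):
    """Return (bit_indices, mean_abs_shap) for the top_k bits, largest first.

    Hypothetical helper: assumes shap_values has shape (n_samples, n_bits).
    """
    shap_values = np.asarray(shap_values, dtype=float)
    mean_abs = np.abs(shap_values).mean(axis=0)  # one score per fingerprint bit
    order = np.argsort(-mean_abs, kind="stable")[:top_k]
    return order, mean_abs[order]

# Toy stand-in for explainer.shap_values(X_test): 4 samples x 3 bits (invented numbers)
toy_shap = np.array([
    [ 0.5, -0.1, 0.0],
    [-0.5,  0.2, 0.0],
    [ 0.5, -0.1, 0.0],
    [-0.5,  0.2, 0.0],
])

bits, scores = rank_bits_by_mean_abs_shap(toy_shap, top_k=2)
print("Top bits:", bits.tolist())
print("Mean |SHAP|:", scores.tolist())
print("Signed mean for top bit:", toy_shap[:, bits[0]].mean())
```

In the toy data, bit 0 ranks first even though its signed SHAP values average to zero, which illustrates why a bar plot of absolute values says nothing about the direction of a bit's effect.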