# Audit Your Own Model: Custom Data Template

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/GlassAlpha/glassalpha/blob/main/examples/notebooks/custom_data_template.ipynb)

**Use this template to audit YOUR model with YOUR data**

This notebook shows how to:
1. Load your CSV data
2. Train or load your model
3. Generate a complete audit
4. Export configuration for CI/CD

**Replace the placeholder data paths with your own files!**

## Step 1: Installation

In [None]:
%pip install -q glassalpha[explain]

In [None]:
"""Environment verification for reproducibility"""
import platform
import random
import sys

import glassalpha as ga
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier  # Replace with your model
from sklearn.model_selection import train_test_split

SEED = 42
random.seed(SEED)
np.random.seed(SEED)

print({
    "python": sys.version.split()[0],
    "platform": platform.platform(),
    "glassalpha": getattr(ga, "__version__", "dev"),
    "seed": SEED
})

## Step 2: Load YOUR Data

**REPLACE THIS with your actual data path**

In [None]:
# Option A: Load from CSV
# df = pd.read_csv('your_data.csv')

# Option B: This template uses the German Credit dataset as an example
df = ga.datasets.load_german_credit()

print(f"Dataset shape: {df.shape}")
print("\nFirst few rows:")
df.head()

## Step 3: Define Your Features and Target

**CUSTOMIZE THIS** based on your data

In [None]:
# Define column names
TARGET_COLUMN = 'credit_risk'  # ← CHANGE THIS to your target column
PROTECTED_ATTRIBUTES = ['gender', 'age_group']  # ← CHANGE THIS to your protected attributes

# Features = all columns except target and protected attributes
feature_columns = [col for col in df.columns if col not in [TARGET_COLUMN] + PROTECTED_ATTRIBUTES]

print(f"Target: {TARGET_COLUMN}")
print(f"Protected attributes: {PROTECTED_ATTRIBUTES}")
print(f"Features ({len(feature_columns)}): {', '.join(feature_columns[:5])}...")
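A typo in `TARGET_COLUMN` or `PROTECTED_ATTRIBUTES` would only surface later as a `KeyError`. As a minimal sketch, assuming you want to catch that early, the check below compares the configured names against the available columns. `check_columns` is a hypothetical helper (not part of GlassAlpha), and the column list stands in for `list(df.columns)`.

```python
def check_columns(available, target, protected):
    """Return the configured column names that are absent from `available`."""
    available = set(available)
    wanted = [target, *protected]
    return [name for name in wanted if name not in available]


# Hypothetical column list standing in for list(df.columns)
columns = ["duration", "credit_amount", "gender", "credit_risk"]
missing_columns = check_columns(
    columns, target="credit_risk", protected=["gender", "age_group"]
)
print(missing_columns)  # names that still need fixing before continuing
```

In the notebook itself you could call `check_columns(df.columns, TARGET_COLUMN, PROTECTED_ATTRIBUTES)` and stop if the returned list is not empty.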

## Step 4: Validate Your Data

Check for missing values, class imbalance, and small protected groups before training.

In [None]:
# Check for missing values (protected attributes included, since the audit uses them too)
missing = df[feature_columns + [TARGET_COLUMN] + PROTECTED_ATTRIBUTES].isnull().sum()
if missing.sum() > 0:
    print("⚠️ Missing values detected:")
    print(missing[missing > 0])
    print("\nConsider: df.fillna() or df.dropna()")
else:
    print("✓ No missing values")

# Check target distribution
print("\nTarget distribution:")
print(df[TARGET_COLUMN].value_counts())
# .mean() equals the positive-class share only when the target is encoded as 0/1
print(f"Positive class rate: {df[TARGET_COLUMN].mean():.1%}")

# Check protected group sizes
print("\nProtected group sizes:")
for attr in PROTECTED_ATTRIBUTES:
    counts = df[attr].value_counts()
    min_size = counts.min()
    print(f"{attr}: {dict(counts)} (min={min_size}, {'✓ OK' if min_size >= 30 else '⚠️ TOO SMALL'})")
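Taking `.mean()` of the target only describes class balance when the target is numeric 0/1. If your labels are strings such as `"good"` and `"bad"`, a dependency-free sketch like the one below reports the share of each class instead. `class_shares` is a hypothetical helper and the labels are made-up illustration data.

```python
from collections import Counter


def class_shares(labels):
    """Map each distinct label to its fraction of the rows."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Made-up labels standing in for df[TARGET_COLUMN]
labels = ["good", "good", "bad", "good"]
shares = class_shares(labels)
print(shares)
```

With pandas, `df[TARGET_COLUMN].value_counts(normalize=True)` gives the same information.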

## Step 5: Train/Test Split

In [None]:
X = df[feature_columns]
y = df[TARGET_COLUMN]
protected_data = df[PROTECTED_ATTRIBUTES]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=SEED, stratify=y
)

print(f"Train: {len(X_train)} samples ({len(X_train)/len(X):.0%})")
print(f"Test: {len(X_test)} samples ({len(X_test)/len(X):.0%})")

## Step 6: Train Your Model

**REPLACE THIS** with your actual model

In [None]:
# Option A: Train a new model (example with RandomForest)
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=SEED)
model.fit(X_train, y_train)

# Option B: Load a pre-trained model
# import joblib
# model = joblib.load('your_model.joblib')

# Quick performance check
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"Train accuracy: {train_acc:.3f}")
print(f"Test accuracy: {test_acc:.3f}")
print("✓ Model ready for audit")

## Step 7: Generate Audit

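`ga.audit.from_model` evaluates the trained model on the held-out test set and collects performance, fairness, calibration, and explanation results in a single object.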

In [None]:
result = ga.audit.from_model(
    model=model,
    X_test=X_test,
    y_test=y_test,
    protected_attributes={
        attr: protected_data.loc[X_test.index, attr]
        for attr in PROTECTED_ATTRIBUTES
    },
    feature_names=list(X.columns),
    target_name=TARGET_COLUMN,
    threshold=0.5,  # Adjust if needed
    random_seed=SEED
)

print("✓ Audit complete")
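The `threshold` argument is presumably the cut-off that turns predicted probabilities into class labels. The sketch below illustrates that idea in plain Python. `apply_threshold` is a hypothetical helper, not a GlassAlpha function, and treating a score exactly at the threshold as positive is an assumption.

```python
def apply_threshold(probabilities, threshold=0.5):
    """Label a row positive (1) when its score is at or above the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]


# Made-up scores standing in for model.predict_proba(X_test)[:, 1]
scores = [0.10, 0.45, 0.50, 0.80]
default_labels = apply_threshold(scores)
lenient_labels = apply_threshold(scores, threshold=0.4)
print(default_labels)
print(lenient_labels)
```

Lowering the threshold marks more rows as positive. That typically raises recall and lowers precision, and it can also shift the fairness metrics, so it is worth re-running the audit after changing it.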

In [None]:
# Display inline summary
result

## Step 8: Review Key Metrics

In [None]:
print("=== PERFORMANCE ===")
print(f"Accuracy: {result.performance.accuracy:.3f}")
print(f"AUC-ROC: {result.performance.auc_roc:.3f}")
print(f"Precision: {result.performance.precision:.3f}")
print(f"Recall: {result.performance.recall:.3f}")

print("\n=== FAIRNESS ===")
print(f"Demographic Parity: {result.fairness.demographic_parity_difference:.3f}")
print(f"Equal Opportunity: {result.fairness.equal_opportunity_difference:.3f}")
if result.fairness.has_bias(threshold=0.10):
    print("⚠️ WARNING: Bias detected (>10% threshold)")
else:
    print("✓ PASS: No significant bias (within 10% tolerance)")

print("\n=== CALIBRATION ===")
print(f"Expected Calibration Error: {result.calibration.expected_calibration_error:.4f}")
if result.calibration.expected_calibration_error < 0.05:
    print("✓ PASS: Well-calibrated (ECE < 0.05)")
else:
    print("⚠️ WARNING: Calibration could be improved")

## Step 9: Visualize Results

In [None]:
# Performance and fairness metrics are printed in Step 8; this cell adds the Brier score
print("🎯 CALIBRATION")
print(f"  ECE: {result.calibration.expected_calibration_error:.4f}")
print(f"  Brier Score: {result.calibration.brier_score:.4f}")

# Note: Interactive plotting (.plot_*) coming in Phase 3
# All visualizations are available in the PDF report
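For readers unfamiliar with the calibration metrics reported above, the sketch below shows one common way to compute them for a binary classifier. The Brier score is the mean squared error of the predicted probabilities. Expected calibration error (ECE) bins the predictions and averages the gap between predicted and observed positive rates. The equal-width binning and `n_bins=10` are assumptions, and GlassAlpha may bin differently.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)


def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Weighted average gap between predicted and observed positive rates."""
    bins = [[] for _ in range(n_bins)]
    for t, p in zip(y_true, y_prob):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((t, p))

    total = len(y_true)
    ece = 0.0
    for rows in bins:
        if not rows:
            continue  # empty bins contribute nothing
        observed = sum(t for t, _ in rows) / len(rows)
        predicted = sum(p for _, p in rows) / len(rows)
        ece += len(rows) / total * abs(observed - predicted)
    return ece


# Made-up outcomes and predicted probabilities for four rows
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.3, 0.7, 0.9]

brier = brier_score(y_true, y_prob)
ece = expected_calibration_error(y_true, y_prob)
print(f"Brier score: {brier:.3f}")
print(f"ECE: {ece:.3f}")
```

Lower is better for both metrics. A model can have a good Brier score and still be poorly calibrated in a particular probability range, which is why the audit reports both.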

In [None]:
print("Top 10 Important Features:\n")
print(result.explanations.feature_importance.head(10))


## Step 10: Export Audit Outputs

In [None]:
# Export PDF report
result.to_pdf('my_model_audit.pdf')
print('✓ PDF report: my_model_audit.pdf')

# Export metrics as JSON
result.to_json('my_model_metrics.json')
print('✓ Metrics JSON: my_model_metrics.json')

# Export config for CI/CD reproduction
result.to_config('my_audit_config.yaml')
print('✓ Config YAML: my_audit_config.yaml')

print('\n✓ All outputs saved!')

## Step 11: Reproduce with CLI (Optional)

The exported config file lets you reproduce this audit from the command line:

In [None]:
print("To reproduce this audit:\n")
print("  1. Install GlassAlpha: pip install 'glassalpha[explain]'  (quotes keep zsh from globbing the brackets)")
print("  2. Run: glassalpha audit --config my_audit_config.yaml --output report.pdf")
print("\nFor CI/CD integration, add this to your GitHub Actions:")
print("""\n```yaml
- name: Run ML Audit
  run: |
    pip install glassalpha[explain]
    glassalpha audit --config my_audit_config.yaml --output audit.pdf
```""")
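In CI you will usually want the job to fail when the audit crosses a limit, not just produce a PDF. The sketch below shows one way to gate on the exported metrics. The flat key names are hypothetical, so inspect your own `my_model_metrics.json` and adjust the lookups to its actual layout. The limits mirror the 0.10 and 0.05 cut-offs used in Step 8.

```python
import json


def find_violations(metrics, max_fairness_gap=0.10, max_ece=0.05):
    """Return a message for every metric that is missing or above its limit."""
    checks = [
        ("demographic_parity_difference", max_fairness_gap),
        ("equal_opportunity_difference", max_fairness_gap),
        ("expected_calibration_error", max_ece),
    ]
    violations = []
    for key, limit in checks:
        value = metrics.get(key)
        if value is None:
            violations.append(f"{key}: missing from metrics")
        elif abs(value) > limit:  # a value exactly at the limit passes
            violations.append(f"{key}: {value:.3f} exceeds limit {limit:.2f}")
    return violations


# Hypothetical flat payload; the real JSON layout may be nested differently
example = json.loads(
    '{"demographic_parity_difference": 0.14,'
    ' "equal_opportunity_difference": 0.03,'
    ' "expected_calibration_error": 0.02}'
)
violations = find_violations(example)
print(violations)
```

In a pipeline step you would load the file with `json.load`, call `find_violations`, and exit with a non-zero status (for example `sys.exit(1)`) when the list is not empty.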

## Checklist: Customize This Template

Before using with your own data, update:

- [ ] **Step 2**: Load your CSV file (`pd.read_csv('your_data.csv')`)
- [ ] **Step 3**: Set `TARGET_COLUMN` to your target variable name
- [ ] **Step 3**: Set `PROTECTED_ATTRIBUTES` to your fairness-sensitive columns
- [ ] **Step 6**: Train your model or load a pre-trained model
- [ ] **Step 7**: Adjust `threshold` if needed (default: 0.5)
- [ ] **Step 10**: Customize output filenames

**Need help?**
- [Custom Data Guide](https://glassalpha.com/getting-started/custom-data/)
- [Configuration Guide](https://glassalpha.com/getting-started/configuration/)
- [API Reference](https://glassalpha.com/reference/api/)