# Understanding IsolationForest and SHAP

This notebook explains the two core technologies behind DQX's anomaly detection feature:
- **IsolationForest**: A tree-based algorithm for detecting anomalies
- **SHAP (SHapley Additive exPlanations)**: A method for explaining which features contributed to a prediction

We'll use synthetic e-commerce transaction data to demonstrate these concepts.

In [None]:
# Install required packages (if needed)
# !pip install faker scikit-learn shap pandas numpy matplotlib

In [None]:
import pandas as pd
import numpy as np
from faker import Faker
from sklearn.ensemble import IsolationForest
import shap
import matplotlib.pyplot as plt

# Set seed for reproducibility
np.random.seed(42)
fake = Faker()
Faker.seed(42)

## Part 1: Generate Synthetic Data

We'll create a dataset of e-commerce transactions with:
- Normal transactions: Typical purchase amounts, quantities, and discounts
- Anomalous transactions: Unusual patterns (e.g., very high amounts, unusual quantities)

In [None]:
def generate_normal_transactions(n=1000):
    """Generate normal e-commerce transactions."""
    data = []
    for _ in range(n):
        amount = np.random.normal(100, 30)  # Mean $100, std $30
        quantity = np.random.randint(1, 5)   # 1-4 items
        discount = np.random.uniform(0, 0.2) # 0-20% discount
        data.append({
            'transaction_id': fake.uuid4(),
            'amount': max(10, amount),  # Minimum $10
            'quantity': quantity,
            'discount': discount
        })
    return data

def generate_anomalous_transactions(n=50):
    """Generate anomalous transactions with unusual patterns."""
    data = []
    for _ in range(n):
        anomaly_type = np.random.choice(['high_amount', 'high_quantity', 'high_discount'])
        
        if anomaly_type == 'high_amount':
            amount = np.random.uniform(500, 2000)  # Unusually high
            quantity = np.random.randint(1, 5)
            discount = np.random.uniform(0, 0.2)
        elif anomaly_type == 'high_quantity':
            amount = np.random.normal(100, 30)
            quantity = np.random.randint(20, 100)  # Unusually high
            discount = np.random.uniform(0, 0.2)
        else:  # high_discount
            amount = np.random.normal(100, 30)
            quantity = np.random.randint(1, 5)
            discount = np.random.uniform(0.7, 0.99)  # Unusually high
        
        data.append({
            'transaction_id': fake.uuid4(),
            'amount': max(10, amount),
            'quantity': quantity,
            'discount': discount,
            'true_label': 'anomaly'  # Ground truth for evaluation
        })
    return data

# Generate data
normal_data = generate_normal_transactions(1000)
anomalous_data = generate_anomalous_transactions(50)

# Add ground truth labels
for record in normal_data:
    record['true_label'] = 'normal'

# Combine and shuffle
all_data = normal_data + anomalous_data
np.random.shuffle(all_data)

df = pd.DataFrame(all_data)
print(f"Generated {len(df)} transactions ({len(normal_data)} normal, {len(anomalous_data)} anomalous)")
print("\nFirst 5 rows:")
df.head()

In [None]:
# Basic statistics
print("Data Statistics:\n")
df[['amount', 'quantity', 'discount']].describe()

## Part 2: IsolationForest - How It Works

### Key Concepts:

1. **Isolation Trees**: The algorithm builds random decision trees that isolate data points
2. **Path Length**: Anomalies are easier to isolate (shorter path from root to leaf)
3. **Anomaly Score**: Normalized score between 0 and 1 where values closer to 1 indicate anomalies (scikit-learn's `score_samples()` returns the negative of this score)
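
The normalization behind the anomaly score can be sketched from the original Isolation Forest formulation, where the score is `2 ** (-E[h(x)] / c(n))`, `E[h(x)]` is the mean path length of a point across trees, and `c(n)` is the average path length for a sample of `n` points. The helper names below (`average_path_length`, `anomaly_score`) are illustrative assumptions, not part of scikit-learn's public API.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def average_path_length(n: int) -> float:
    """c(n): average path length of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of the harmonic number H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def anomaly_score(mean_path_length: float, n: int) -> float:
    """Normalized score: short paths map toward 1, long paths toward 0."""
    return 2.0 ** (-mean_path_length / average_path_length(n))


c_256 = average_path_length(256)  # 256 is the default per-tree sample size in scikit-learn
print(f"c(256) = {c_256:.4f}")
for depth in (2.0, c_256, 2.0 * c_256):
    print(f"mean path length {depth:6.2f} -> score {anomaly_score(depth, 256):.3f}")
```

A point whose mean path length equals `c(n)` scores exactly 0.5, which is why scores well above 0.5 are read as anomalous.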

### Why It's Effective:
- No need to define what "normal" looks like explicitly
- Works well with high-dimensional data
- Fast and scalable
- Unsupervised: doesn't need labeled training data
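
The claim that anomalies are easier to isolate can be checked with a small one-dimensional simulation. This is a simplified sketch, not scikit-learn's implementation: it repeatedly picks a random split between the current minimum and maximum, keeps the side containing the target point, and counts splits until the point is alone. The function name `isolation_depth` is hypothetical.

```python
import numpy as np


def isolation_depth(point, data, rng, max_depth=50):
    """Count random splits needed until `point` is the only value left in its partition."""
    subset = data
    depth = 0
    while len(subset) > 1 and depth < max_depth:
        low, high = subset.min(), subset.max()
        if low == high:
            break
        split = rng.uniform(low, high)
        # Keep only the side of the split that contains the target point
        subset = subset[subset < split] if point < split else subset[subset >= split]
        depth += 1
    return depth


rng = np.random.default_rng(42)
data = np.append(rng.normal(0.0, 1.0, size=255), 10.0)  # 255 typical values plus one outlier
outlier = data[-1]
inlier = data[np.argmin(np.abs(data))]  # the value closest to the center

outlier_depth = np.mean([isolation_depth(outlier, data, rng) for _ in range(200)])
inlier_depth = np.mean([isolation_depth(inlier, data, rng) for _ in range(200)])
print(f"Mean splits to isolate the outlier: {outlier_depth:.1f}")
print(f"Mean splits to isolate the inlier:  {inlier_depth:.1f}")
```

The outlier sits in a sparse region, so a random split is likely to separate it early, while the inlier needs many more splits to be singled out from its neighbors.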

In [None]:
# Prepare features for training
feature_columns = ['amount', 'quantity', 'discount']
X = df[feature_columns].values

# Train IsolationForest
# contamination: expected proportion of anomalies (50/1050 is about 4.8%, rounded to 5%)
model = IsolationForest(
    n_estimators=100,      # Number of trees
    contamination=0.05,    # Expected fraction of anomalies
    random_state=42,
    max_samples='auto'     # 'auto' draws min(256, n_samples) rows per tree
)

print("Training IsolationForest...")
model.fit(X)
print("Training complete.")

# Get predictions
# predict() returns: 1 for normal, -1 for anomaly
predictions = model.predict(X)
df['prediction'] = predictions
df['prediction_label'] = df['prediction'].map({1: 'normal', -1: 'anomaly'})

# Get anomaly scores
# score_samples() returns negative values: more negative = more anomalous
# We'll convert to positive scale for easier interpretation
anomaly_scores = model.score_samples(X)
df['anomaly_score'] = -anomaly_scores  # Negate so higher = more anomalous

print(f"\nPredicted {(predictions == -1).sum()} anomalies out of {len(df)} transactions")
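
To make the relationship between `score_samples()`, `contamination`, and `predict()` concrete, the sketch below fits a separate model on random data (an assumption made only to keep the example self-contained) and inspects the fitted attributes `max_samples_` and `offset_`.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1050, 3))  # stand-in data, not the transaction dataset

demo_model = IsolationForest(
    n_estimators=100, contamination=0.05, random_state=42
).fit(X_demo)

raw_scores = demo_model.score_samples(X_demo)     # negative values, lower = more anomalous
decision = demo_model.decision_function(X_demo)   # raw score shifted by the learned offset
labels = demo_model.predict(X_demo)               # -1 where the decision value is negative

print("Rows sampled per tree:", demo_model.max_samples_)
print("Learned offset_:", round(float(demo_model.offset_), 4))
print("Fraction flagged:", round(float(np.mean(labels == -1)), 4))
```

With a numeric `contamination`, the offset is chosen so that roughly that fraction of the training rows falls below the threshold, which is why the number of predicted anomalies tracks the setting rather than the data.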

In [None]:
# Evaluate model performance
from sklearn.metrics import classification_report, confusion_matrix

# Convert labels for evaluation
y_true = (df['true_label'] == 'anomaly').astype(int)
y_pred = (df['prediction'] == -1).astype(int)

print("Confusion Matrix:")
print(confusion_matrix(y_true, y_pred))
print("\nClassification Report:")
print(classification_report(y_true, y_pred, target_names=['Normal', 'Anomaly']))
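
The confusion matrix depends on the threshold implied by `contamination`. A threshold-free view ranks rows by anomaly score and measures how well that ranking separates the two classes. The sketch below is a self-contained illustration on separate synthetic data (the names `X_eval`, `y_eval`, and `eval_model` are assumptions for this example), using `roc_auc_score` and `average_precision_score` from scikit-learn.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)

# 1000 typical rows, plus 50 rows where one randomly chosen feature is shifted far out
normal_rows = rng.normal(0.0, 1.0, size=(1000, 3))
anomalous_rows = rng.normal(0.0, 1.0, size=(50, 3))
shifted_feature = rng.integers(0, 3, size=50)
anomalous_rows[np.arange(50), shifted_feature] += 8.0

X_eval = np.vstack([normal_rows, anomalous_rows])
y_eval = np.concatenate([np.zeros(1000, dtype=int), np.ones(50, dtype=int)])

eval_model = IsolationForest(n_estimators=100, random_state=42).fit(X_eval)
eval_scores = -eval_model.score_samples(X_eval)  # negate so higher = more anomalous

roc_auc = roc_auc_score(y_eval, eval_scores)
avg_precision = average_precision_score(y_eval, eval_scores)
print(f"ROC AUC: {roc_auc:.3f}")
print(f"Average precision: {avg_precision:.3f}")
```

In the notebook itself, the same two calls could be applied to `y_true` and `df['anomaly_score']`.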

In [None]:
# Visualize anomaly score distribution
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
normal_scores = df[df['true_label'] == 'normal']['anomaly_score']
# Named separately so the raw `anomaly_scores` array from score_samples() is not overwritten
true_anomaly_scores = df[df['true_label'] == 'anomaly']['anomaly_score']

plt.hist(normal_scores, bins=50, alpha=0.7, label='Normal (Ground Truth)', color='blue')
plt.hist(true_anomaly_scores, bins=50, alpha=0.7, label='Anomaly (Ground Truth)', color='red')
plt.xlabel('Anomaly Score (higher = more anomalous)')
plt.ylabel('Count')
plt.title('Distribution of Anomaly Scores by True Label')
plt.legend()
plt.grid(alpha=0.3)

plt.subplot(1, 2, 2)
plt.scatter(df['amount'], df['quantity'], c=df['anomaly_score'], 
            cmap='YlOrRd', alpha=0.6, s=30)
plt.colorbar(label='Anomaly Score')
plt.xlabel('Transaction Amount ($)')
plt.ylabel('Quantity')
plt.title('Transactions colored by Anomaly Score')
plt.grid(alpha=0.3)

plt.tight_layout()
plt.show()

## Part 3: SHAP - Explaining Model Predictions

### The Problem:
IsolationForest tells us a transaction is anomalous, but **why**? Which features made it anomalous?

### SHAP Solution:
SHAP (SHapley Additive exPlanations) is based on game theory and provides:
- **Feature attribution**: How much each feature contributed to the prediction
- **Additive**: For each row, the feature contributions plus the base value sum to the model output
- **Local explanations**: Per-row feature importance
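
The idea of a Shapley value can be illustrated by brute force on a toy function. The sketch below is not how TreeSHAP works internally: it averages each feature's marginal contribution over every ordering in which features are switched from a baseline to the row's values. The function `order_value`, the baseline, and the example row are all assumptions made up for this illustration.

```python
from itertools import permutations


def order_value(amount, quantity, discount):
    """Toy model: total value of an order after discount."""
    return amount * quantity * (1.0 - discount)


baseline = {"amount": 100.0, "quantity": 2.0, "discount": 0.10}  # a "typical" row
row = {"amount": 1500.0, "quantity": 3.0, "discount": 0.05}      # the row to explain


def shapley_values(model_fn, row, baseline):
    """Average marginal contribution of each feature over all feature orderings."""
    features = list(row)
    totals = {name: 0.0 for name in features}
    orderings = list(permutations(features))
    for ordering in orderings:
        current = dict(baseline)
        previous_output = model_fn(**current)
        for name in ordering:
            current[name] = row[name]  # switch this feature from baseline to the row's value
            new_output = model_fn(**current)
            totals[name] += new_output - previous_output
            previous_output = new_output
    return {name: total / len(orderings) for name, total in totals.items()}


phi = shapley_values(order_value, row, baseline)
base_output = order_value(**baseline)
row_output = order_value(**row)

for name, value in phi.items():
    print(f"{name}: {value:+.2f}")
print(f"base + contributions = {base_output + sum(phi.values()):.2f}, model output = {row_output:.2f}")
```

The last line demonstrates additivity: the base output plus all contributions reproduces the model output for the row. The number of orderings grows factorially with the feature count, which is the cost that tree-specific algorithms avoid.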

### TreeSHAP:
For tree-based models like IsolationForest, SHAP uses an optimized algorithm called TreeSHAP:
- Fast computation (polynomial time instead of exponential)
- Exact Shapley values (not approximations)
- Works by tracing decision paths through the trees

In [None]:
# Initialize SHAP TreeExplainer
print("Creating SHAP explainer...")
explainer = shap.TreeExplainer(model)

# Calculate SHAP values for all transactions
# This tells us which features contributed to each anomaly score
print("Computing SHAP values...")
shap_values = explainer.shap_values(X)
print(f"Shape of SHAP values: {shap_values.shape}")
print("Done.")

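### Understanding SHAP Values:

`shap.TreeExplainer` explains an IsolationForest through its average isolation path length, not through the negated `score_samples()` value stored in `anomaly_score`. Shorter paths mean more anomalous, so for each transaction and each feature:
- **Negative SHAP value**: Feature shortened the path, pushing the row toward "more anomalous"
- **Positive SHAP value**: Feature lengthened the path, pushing the row toward "more normal"
- **Magnitude**: How strong the contribution was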

Let's examine some specific examples:

In [None]:
# Find top anomalies
top_anomalies = df.nlargest(5, 'anomaly_score')

print("Top 5 Most Anomalous Transactions:\n")
for idx, row in top_anomalies.iterrows():
    print(f"Transaction {row['transaction_id'][:8]}...")
    print(f"  Amount: ${row['amount']:.2f}, Quantity: {row['quantity']}, Discount: {row['discount']:.1%}")
    print(f"  Anomaly Score: {row['anomaly_score']:.3f}")
    print(f"  Ground Truth: {row['true_label']}")
    
    # Get SHAP values for this transaction
    shap_vals = shap_values[idx]
    
    # Calculate absolute contributions (normalized to sum to 1.0)
    abs_shap = np.abs(shap_vals)
    if abs_shap.sum() > 0:
        contributions = abs_shap / abs_shap.sum()
    else:
        contributions = np.ones(len(feature_columns)) / len(feature_columns)
    
    print("  Feature Contributions:")
    for i, col in enumerate(feature_columns):
        print(f"    {col}: {contributions[i]:.1%}")
    print()
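
The per-row normalization used in the loop above can also be applied to the whole SHAP matrix at once. The sketch below is a vectorized NumPy version of the same logic, including the uniform fallback for rows whose SHAP values are all zero. The function name `normalize_contributions` and the example matrix are hypothetical.

```python
import numpy as np


def normalize_contributions(shap_matrix):
    """Return per-row absolute SHAP values scaled to sum to 1.0 (uniform if the row is all zeros)."""
    abs_vals = np.abs(np.asarray(shap_matrix, dtype=float))
    row_sums = abs_vals.sum(axis=1, keepdims=True)
    n_features = abs_vals.shape[1]
    # Rows with a zero sum keep this uniform fallback because `where` skips the division for them
    result = np.full_like(abs_vals, 1.0 / n_features)
    np.divide(abs_vals, row_sums, out=result, where=row_sums > 0)
    return result


example_shap = np.array([
    [-0.6, 0.2, 0.2],   # first feature dominates
    [0.0, 0.0, 0.0],    # degenerate row: falls back to equal shares
])
normalized = normalize_contributions(example_shap)
print(normalized)
```

Taking absolute values discards the sign, so these shares describe how strongly each feature influenced the row, not in which direction.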

In [None]:
# SHAP Summary Plot: Shows global feature importance
# Each point is a transaction, color represents feature value
plt.figure(figsize=(10, 6))
shap.summary_plot(shap_values, X, feature_names=feature_columns, show=False)
plt.title('SHAP Summary Plot: Global Feature Importance')
plt.tight_layout()
plt.show()

print("Interpretation:")
print("- X-axis: SHAP value (impact on model output)")
print("- Y-axis: Features ranked by importance")
print("- Color: Feature value (red=high, blue=low)")
print("- Each dot: One transaction")

In [None]:
# SHAP Waterfall Plot: Detailed explanation for a single anomalous transaction
anomaly_idx = top_anomalies.index[0]  # Most anomalous transaction

print("Detailed Explanation for Most Anomalous Transaction:")
print(f"Transaction {df.loc[anomaly_idx, 'transaction_id'][:8]}...\n")

# Create explanation object
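explanation = shap.Explanation(
    values=shap_values[anomaly_idx],
    # Waterfall plots need a scalar base value; expected_value may be a 1-element array
    base_values=float(np.ravel(explainer.expected_value)[0]),
    data=X[anomaly_idx],
    feature_names=feature_columns
)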

# Waterfall plot shows how features push prediction from base value to final value
plt.figure(figsize=(10, 6))
shap.waterfall_plot(explanation, show=False)
plt.title('SHAP Waterfall: How Features Contributed to Anomaly Score')
plt.tight_layout()
plt.show()

print("\nInterpretation:")
print("- Starts from base value E[f(X)] (average model output)")
print("- Each row shows how a feature pushed the output up or down")
print("- Final value f(x) is the average path length for this row, not anomaly_score;")
print("  shorter paths (lower f(x)) mean more anomalous")

## Part 4: How This Maps to DQX

DQX's `anomaly.has_no_anomalies()` check uses these same concepts in a distributed Spark environment.

### Key Implementation Details:

1. **Why Pandas UDFs?**
   - Distributes computation across Spark cluster executors
   - Each executor loads the model once, scores many rows
   - Balances performance vs. model distribution overhead

2. **Why scikit-learn instead of Spark ML?**
   - IsolationForest not available in open-source PySpark
   - scikit-learn works everywhere (local, DBR, Spark Connect)
   - Mature, well-tested implementation

3. **Why does SHAP need cluster installation?**
   - UDF code runs on executors, not just the driver
   - Each executor needs the `shap` package installed
   - That's why users see: "Install 'shap' on your cluster"
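
The "load the model once, score many rows" pattern can be sketched with a function that consumes an iterator of pandas DataFrames, which is the shape Spark's `DataFrame.mapInPandas` expects. This is an illustrative sketch and not DQX's actual implementation: `make_batch_scorer`, the pickled model bytes, and the batch layout are assumptions. The example runs locally on plain pandas batches so it does not need a Spark session.

```python
import pickle
from typing import Iterator

import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest


def make_batch_scorer(model_bytes: bytes, feature_columns: list):
    """Build a function that scores an iterator of pandas batches with one deserialized model."""

    def score_batches(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
        model = pickle.loads(model_bytes)  # deserialize once, then reuse for every batch
        for batch in batches:
            scored = batch.copy()
            features = batch[feature_columns].to_numpy()
            scored["anomaly_score"] = -model.score_samples(features)  # higher = more anomalous
            yield scored

    return score_batches


# Local stand-in for what would be Spark partitions
rng = np.random.default_rng(0)
columns = ["amount", "quantity", "discount"]
train = pd.DataFrame(rng.normal(size=(500, 3)), columns=columns)
trained = IsolationForest(n_estimators=50, random_state=42).fit(train.to_numpy())

scorer = make_batch_scorer(pickle.dumps(trained), columns)
batches = [train.iloc[:5], train.iloc[5:10]]
results = list(scorer(iter(batches)))
print(pd.concat(results)[["amount", "anomaly_score"]])

# On a Spark DataFrame this could be wired up roughly as:
#   spark_df.mapInPandas(scorer, schema="amount double, quantity double, discount double, anomaly_score double")
```

Because the model is deserialized inside the function, the cost is paid once per invocation rather than once per row, and any library the function imports must be installed wherever it runs.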

## Part 5: Performance Considerations

Understanding the computational costs:

In [None]:
import time

# Benchmark scoring performance
# Random standard-normal rows are used only for timing; their scores are not meaningful
n_rows = 10000
X_benchmark = np.random.randn(n_rows, len(feature_columns))

# 1. IsolationForest scoring (fast)
# perf_counter() is a monotonic, high-resolution clock suited to short measurements
start = time.perf_counter()
bench_scores = model.score_samples(X_benchmark)
bench_predictions = model.predict(X_benchmark)
scoring_time = time.perf_counter() - start

print("IsolationForest Scoring:")
print(f"  {n_rows:,} rows in {scoring_time:.3f} seconds")
print(f"  {n_rows/scoring_time:.0f} rows/second")

# 2. SHAP computation (slower, but provides explanations)
# Results use bench_* names so earlier variables (predictions, shap_vals) are not overwritten
start = time.perf_counter()
bench_shap_values = explainer.shap_values(X_benchmark)
shap_time = time.perf_counter() - start

print("\nSHAP Computation:")
print(f"  {n_rows:,} rows in {shap_time:.3f} seconds")
print(f"  {n_rows/shap_time:.0f} rows/second")
print(f"  {shap_time/scoring_time:.1f}x slower than scoring alone")

print("\nDQX Design Choice:")
print("  - anomaly_score: Always computed (fast)")
print("  - anomaly_contributions: Optional (include_contributions=True)")
print("  - This gives users control over the speed/explainability tradeoff")
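
A single timed run can be noisy. The sketch below shows one way to make such measurements steadier by repeating the call and taking the median. The helper `time_call` is a hypothetical name, and the sorted-list workload is only a stand-in so the example runs without a fitted model.

```python
import statistics
import time


def time_call(fn, *args, repeats=5):
    """Run fn(*args) several times and return the median wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)


workload = list(range(100_000, 0, -1))  # stand-in workload
median_seconds = time_call(sorted, workload)
print(f"Median over 5 runs: {median_seconds:.4f} seconds")
```

In this notebook the same helper could be called as `time_call(model.score_samples, X_benchmark)` and `time_call(explainer.shap_values, X_benchmark)` to compare the two costs.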

## Summary

### IsolationForest:
- **What**: Tree-based unsupervised anomaly detection
- **How**: Isolates anomalies using random decision trees
- **Output**: Anomaly score (negated `score_samples()`, so higher = more anomalous) and binary prediction (1 = normal, -1 = anomaly)
- **Pros**: Fast, scalable, no labeled data needed

### SHAP (TreeSHAP):
- **What**: Explainability framework based on game theory
- **How**: Computes Shapley values by tracing tree paths
- **Output**: Per-feature contributions to each prediction
- **Pros**: Accurate, consistent, theoretically sound
- **Cons**: Computationally expensive (roughly 3-10x slower than scoring alone; the Part 5 benchmark prints the measured ratio)

### DQX Integration:
- Uses scikit-learn IsolationForest for model training
- Distributes scoring via Spark Pandas UDFs
- Optionally computes SHAP values for explainability
- Stores models in Unity Catalog via MLflow
- Makes SHAP optional to balance performance vs. explainability

### Key Takeaway:
DQX gives users the power of IsolationForest anomaly detection at Spark scale, with optional SHAP explanations when they need to understand **why** a row was flagged as anomalous.