# Project 19: Real-Time Fraud Detection System

**Detect fraudulent transactions using real-time data**

In this tutorial, we'll build a complete fraud detection system that can:
- Handle highly imbalanced data (fraud is rare!)
- Detect fraudulent transactions in real-time
- Optimize for business metrics (minimize false negatives)
- Process streaming transactions

**Dataset**: Credit Card Fraud Detection (Kaggle)
- 284,807 transactions
- Only 492 frauds (0.172% - highly imbalanced!)
- Features V1-V28: PCA transformed (anonymized)
- Time: Seconds since first transaction
- Amount: Transaction amount
- Class: 0 = Normal, 1 = Fraud

**Key Challenges**:
1. Extreme class imbalance (99.83% vs 0.17%)
2. Real-time prediction requirements
3. High cost of false negatives (missed fraud)
4. Need for interpretable decisions
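
To see why the imbalance in challenge 1 matters, here is a minimal sketch using only the class counts quoted above. The "model" is a hypothetical one that labels every transaction as normal; it is an illustration, not part of the pipeline built later.

```python
import numpy as np

# Class counts quoted in the dataset description above
n_total, n_fraud = 284_807, 492

# Ground truth: 1 = fraud, 0 = normal
y_true = np.zeros(n_total, dtype=int)
y_true[:n_fraud] = 1

# Hypothetical "model" that labels every transaction as normal
y_pred = np.zeros(n_total, dtype=int)

accuracy = (y_pred == y_true).mean()
recall = (y_pred[y_true == 1] == 1).mean()  # share of frauds that were flagged

print(f"Accuracy: {accuracy:.4%}")
print(f"Recall:   {recall:.2%}")
```

Accuracy looks excellent while recall is zero, which is why the rest of the tutorial leans on precision, recall, and PR-AUC instead.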

## Table of Contents

1. [Setup and Data Loading](#1-setup-and-data-loading)
2. [Exploratory Data Analysis](#2-exploratory-data-analysis)
3. [Data Preprocessing](#3-data-preprocessing)
4. [Handling Class Imbalance](#4-handling-class-imbalance)
5. [Model Building](#5-model-building)
6. [Model Evaluation](#6-model-evaluation)
7. [Threshold Optimization](#7-threshold-optimization)
8. [Real-Time Detection Pipeline](#8-real-time-detection-pipeline)
9. [Streaming Simulation](#9-streaming-simulation)
10. [Summary](#10-summary)

## 1. Setup and Data Loading

In [None]:
# Install required packages
!pip install -q kagglehub imbalanced-learn xgboost lightgbm

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
import time
import warnings
warnings.filterwarnings('ignore')

# Sklearn
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import (
    classification_report, confusion_matrix, roc_auc_score,
    precision_recall_curve, roc_curve, auc, f1_score,
    precision_score, recall_score, average_precision_score
)

# Imbalanced learning
from imblearn.over_sampling import SMOTE, ADASYN
from imblearn.under_sampling import RandomUnderSampler
from imblearn.combine import SMOTETomek
from imblearn.pipeline import Pipeline as ImbPipeline

# XGBoost
import xgboost as xgb

# Deep Learning
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Set random seeds
SEED = 42
np.random.seed(SEED)
torch.manual_seed(SEED)

print("All libraries imported successfully!")
print(f"PyTorch: {torch.__version__}")
print(f"Device: {'cuda' if torch.cuda.is_available() else 'cpu'}")

In [None]:
# Load the Credit Card Fraud dataset
import os

# Check if running on Kaggle
USE_KAGGLE = os.path.exists('/kaggle/input')

if USE_KAGGLE:
    # Direct path on Kaggle
    try:
        df = pd.read_csv('/kaggle/input/creditcardfraud/creditcard.csv')
        print("Loaded from Kaggle input directory")
    except Exception:
        import kagglehub
        from kagglehub import KaggleDatasetAdapter
        df = kagglehub.load_dataset(
            KaggleDatasetAdapter.PANDAS,
            "mlg-ulb/creditcardfraud",
            "creditcard.csv",  # file to load from the dataset
        )
        print("Loaded via kagglehub")
else:
    # Try kagglehub
    try:
        import kagglehub
        from kagglehub import KaggleDatasetAdapter
        df = kagglehub.load_dataset(
            KaggleDatasetAdapter.PANDAS,
            "mlg-ulb/creditcardfraud",
            "creditcard.csv",  # file to load from the dataset
        )
        print("Loaded via kagglehub")
    except Exception as e:
        print(f"Could not load via kagglehub: {e}")
        print("Please download the dataset manually from:")
        print("https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud")
        # Create sample data for demonstration
        print("\nCreating synthetic fraud data for demonstration...")
        n_samples = 10000
        n_fraud = int(n_samples * 0.002)  # 0.2% fraud rate
        
        # Generate features
        np.random.seed(SEED)
        normal_data = np.random.randn(n_samples - n_fraud, 28)
        fraud_data = np.random.randn(n_fraud, 28) + np.random.choice([-2, 2], size=(n_fraud, 28))
        
        V_cols = [f'V{i}' for i in range(1, 29)]
        df_normal = pd.DataFrame(normal_data, columns=V_cols)
        df_normal['Class'] = 0
        df_fraud = pd.DataFrame(fraud_data, columns=V_cols)
        df_fraud['Class'] = 1
        
        df = pd.concat([df_normal, df_fraud], ignore_index=True)
        df['Time'] = np.arange(len(df))
        df['Amount'] = np.abs(np.random.exponential(100, len(df)))
        df = df.sample(frac=1, random_state=SEED).reset_index(drop=True)

print(f"\nDataset shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")

## 2. Exploratory Data Analysis

In [None]:
# Basic info
print("Dataset Overview")
print("=" * 50)
print(f"Total transactions: {len(df):,}")
print(f"Features: {df.shape[1]}")
print(f"\nMissing values: {df.isnull().sum().sum()}")
print(f"\nData types:")
print(df.dtypes.value_counts())

In [None]:
# Class distribution - THE KEY CHALLENGE
print("\nClass Distribution (Target Variable)")
print("=" * 50)

class_counts = df['Class'].value_counts()
class_pcts = df['Class'].value_counts(normalize=True) * 100

print(f"Normal transactions (0): {class_counts[0]:,} ({class_pcts[0]:.3f}%)")
print(f"Fraud transactions (1):  {class_counts[1]:,} ({class_pcts[1]:.3f}%)")
print(f"\nImbalance ratio: 1:{class_counts[0]//class_counts[1]}")

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Count plot
ax1 = axes[0]
colors = ['#2ecc71', '#e74c3c']
bars = ax1.bar(['Normal', 'Fraud'], class_counts.values, color=colors)
ax1.set_ylabel('Count')
ax1.set_title('Transaction Class Distribution')
for bar, count in zip(bars, class_counts.values):
    ax1.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 1000, 
             f'{count:,}', ha='center', va='bottom', fontsize=10)

# Pie chart (percentage view)
ax2 = axes[1]
ax2.pie(class_pcts.values, labels=['Normal', 'Fraud'], autopct='%1.2f%%',
        colors=colors, explode=[0, 0.1], shadow=True)
ax2.set_title('Class Percentage')

plt.tight_layout()
plt.show()

print("\n*** This extreme imbalance is the main challenge! ***")
print(f"A naive model predicting all 'Normal' would be {class_pcts[0]:.2f}% accurate but useless!")

In [None]:
# Analyze Time and Amount features
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Time distribution
ax1 = axes[0, 0]
ax1.hist(df[df['Class']==0]['Time'], bins=50, alpha=0.7, label='Normal', color='#2ecc71')
ax1.hist(df[df['Class']==1]['Time'], bins=50, alpha=0.7, label='Fraud', color='#e74c3c')
ax1.set_xlabel('Time (seconds from first transaction)')
ax1.set_ylabel('Count')
ax1.set_title('Transaction Time Distribution')
ax1.legend()

# Amount distribution (log scale)
ax2 = axes[0, 1]
ax2.hist(df[df['Class']==0]['Amount'], bins=50, alpha=0.7, label='Normal', color='#2ecc71')
ax2.hist(df[df['Class']==1]['Amount'], bins=50, alpha=0.7, label='Fraud', color='#e74c3c')
ax2.set_xlabel('Amount')
ax2.set_ylabel('Count')
ax2.set_title('Transaction Amount Distribution')
ax2.set_yscale('log')
ax2.legend()

# Amount boxplot by class
ax3 = axes[1, 0]
df.boxplot(column='Amount', by='Class', ax=ax3)
ax3.set_title('Amount by Class')
ax3.set_xlabel('Class (0=Normal, 1=Fraud)')
plt.suptitle('')

# Amount statistics
ax4 = axes[1, 1]
amount_stats = df.groupby('Class')['Amount'].describe()
ax4.axis('off')
table = ax4.table(cellText=amount_stats.round(2).values,
                  colLabels=amount_stats.columns,
                  rowLabels=['Normal', 'Fraud'],
                  cellLoc='center',
                  loc='center')
table.auto_set_font_size(False)
table.set_fontsize(9)
table.scale(1.2, 1.5)
ax4.set_title('Amount Statistics by Class', pad=20)

plt.tight_layout()
plt.show()

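print("\nKey Observations:")
print(f"  - Mean fraud amount: ${df[df['Class']==1]['Amount'].mean():.2f}")
print(f"  - Mean normal amount: ${df[df['Class']==0]['Amount'].mean():.2f}")
print(f"  - Median fraud amount: ${df[df['Class']==1]['Amount'].median():.2f}")
print(f"  - Median normal amount: ${df[df['Class']==0]['Amount'].median():.2f}")
print("  - Amount is right-skewed, so compare medians as well as means")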

In [None]:
# Analyze PCA features (V1-V28)
v_features = [f'V{i}' for i in range(1, 29)]

# Correlation with target
correlations = df[v_features + ['Class']].corr()['Class'].drop('Class').sort_values()

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Correlation bar plot
ax1 = axes[0]
colors = ['#e74c3c' if x < 0 else '#2ecc71' for x in correlations.values]
ax1.barh(correlations.index, correlations.values, color=colors)
ax1.set_xlabel('Correlation with Fraud')
ax1.set_title('Feature Correlation with Fraud Class')
ax1.axvline(x=0, color='black', linestyle='-', linewidth=0.5)

# Top features distribution
ax2 = axes[1]
top_features = correlations.abs().nlargest(5).index.tolist()
for feat in top_features:
    fraud_vals = df[df['Class']==1][feat]
    ax2.hist(fraud_vals, bins=30, alpha=0.5, label=feat, density=True)
ax2.set_xlabel('Feature Value')
ax2.set_ylabel('Density')
ax2.set_title('Top Correlated Features (Fraud Only)')
ax2.legend()

plt.tight_layout()
plt.show()

print("\nTop features correlated with fraud:")
print(correlations.abs().nlargest(10))

## 3. Data Preprocessing

In [None]:
# Prepare features and target
print("Data Preprocessing")
print("=" * 50)

# Features and target
X = df.drop('Class', axis=1)
y = df['Class']

print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")

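# Amount and Time stay raw here; their scalers are fitted on the training
# split only (next cell) so that no test-set statistics leak into preprocessing.
# V1-V28 are PCA components: roughly centered, but with differing variances.
print(f"\nFeatures before scaling: {X.shape[1]}")
print(f"Feature names: {X.columns.tolist()[:5]}... (and {len(X.columns)-5} more)")

In [None]:
# Train-test split (stratified to maintain class ratio)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=SEED, stratify=y
)
X_train, X_test = X_train.copy(), X_test.copy()

# Fit the scalers on the training split, then apply them to both splits
scaler_amount = RobustScaler()  # Robust to outliers
scaler_time = StandardScaler()

X_train['Amount_scaled'] = scaler_amount.fit_transform(X_train[['Amount']]).ravel()
X_train['Time_scaled'] = scaler_time.fit_transform(X_train[['Time']]).ravel()
X_test['Amount_scaled'] = scaler_amount.transform(X_test[['Amount']]).ravel()
X_test['Time_scaled'] = scaler_time.transform(X_test[['Time']]).ravel()

# Drop original Amount and Time
X_train = X_train.drop(['Amount', 'Time'], axis=1)
X_test = X_test.drop(['Amount', 'Time'], axis=1)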

print("Train-Test Split (Stratified)")
print("=" * 50)
print(f"Training set: {X_train.shape[0]:,} samples")
print(f"  - Normal: {(y_train==0).sum():,} ({(y_train==0).mean()*100:.2f}%)")
print(f"  - Fraud:  {(y_train==1).sum():,} ({(y_train==1).mean()*100:.2f}%)")
print(f"\nTest set: {X_test.shape[0]:,} samples")
print(f"  - Normal: {(y_test==0).sum():,} ({(y_test==0).mean()*100:.2f}%)")
print(f"  - Fraud:  {(y_test==1).sum():,} ({(y_test==1).mean()*100:.2f}%)")

## 4. Handling Class Imbalance

Several techniques to handle the extreme imbalance:

1. **Oversampling**: Create synthetic fraud samples (SMOTE)
2. **Undersampling**: Reduce normal samples
3. **Class Weights**: Penalize misclassifying minority class
4. **Threshold Adjustment**: Lower decision threshold for fraud
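
As a small illustration of option 3, the sketch below computes "balanced" class weights by hand, using the formula scikit-learn documents for `class_weight='balanced'`: `n_samples / (n_classes * count_per_class)`. The toy labels are made up for the example and are not drawn from the dataset.

```python
import numpy as np

# Hypothetical toy labels: 995 normal (0) and 5 fraud (1)
y = np.array([0] * 995 + [1] * 5)

counts = np.bincount(y)                       # samples per class
n_samples, n_classes = len(y), len(counts)

# "balanced" heuristic: n_samples / (n_classes * count_per_class)
weights = n_samples / (n_classes * counts)

print({cls: round(float(w), 3) for cls, w in enumerate(weights)})
print(f"Fraud/normal weight ratio: {weights[1] / weights[0]:.0f}x")
```

The weight ratio equals the inverse of the class-count ratio, so each class contributes the same total weight to the loss.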

In [None]:
# Demonstrate different sampling techniques
print("Sampling Techniques for Class Imbalance")
print("=" * 50)

# Original distribution
print(f"\nOriginal: {y_train.value_counts().to_dict()}")

# 1. SMOTE (Synthetic Minority Oversampling)
smote = SMOTE(random_state=SEED, sampling_strategy=0.5)  # 50% of majority
X_smote, y_smote = smote.fit_resample(X_train, y_train)
print(f"SMOTE:    {pd.Series(y_smote).value_counts().to_dict()}")

# 2. Random Undersampling
rus = RandomUnderSampler(random_state=SEED, sampling_strategy=0.5)
X_rus, y_rus = rus.fit_resample(X_train, y_train)
print(f"Undersample: {pd.Series(y_rus).value_counts().to_dict()}")

# 3. SMOTE + Tomek Links (combined)
smt = SMOTETomek(random_state=SEED)
X_smt, y_smt = smt.fit_resample(X_train, y_train)
print(f"SMOTETomek: {pd.Series(y_smt).value_counts().to_dict()}")

# Visualize
fig, axes = plt.subplots(1, 4, figsize=(16, 4))

datasets = [
    ('Original', y_train),
    ('SMOTE', y_smote),
    ('Undersampling', y_rus),
    ('SMOTETomek', y_smt)
]

for ax, (name, y_data) in zip(axes, datasets):
    counts = pd.Series(y_data).value_counts()
    ax.bar(['Normal', 'Fraud'], counts.values, color=['#2ecc71', '#e74c3c'])
    ax.set_title(f'{name}\n({len(y_data):,} samples)')
    ax.set_ylabel('Count')
    for i, v in enumerate(counts.values):
        ax.text(i, v + 100, f'{v:,}', ha='center', fontsize=9)

plt.tight_layout()
plt.show()

In [None]:
# Calculate class weights for algorithms that support it
from sklearn.utils.class_weight import compute_class_weight

class_weights = compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)
class_weight_dict = {0: class_weights[0], 1: class_weights[1]}

print("Class Weights (for algorithms that support it):")
print(f"  Normal (0): {class_weight_dict[0]:.4f}")
print(f"  Fraud (1):  {class_weight_dict[1]:.4f}")
print(f"\nThis means fraud samples are weighted {class_weight_dict[1]/class_weight_dict[0]:.0f}x more!")

## 5. Model Building

We'll train multiple models and compare their performance:
1. Logistic Regression (baseline)
2. Random Forest
3. XGBoost
4. Neural Network

In [None]:
# Store models and results
models = {}
results = {}

def evaluate_model(name, model, X_test, y_test, threshold=0.5):
    """
    Evaluate model and return metrics.
    """
    # Predictions
    if hasattr(model, 'predict_proba'):
        y_proba = model.predict_proba(X_test)[:, 1]
    else:
        y_proba = model.predict(X_test)
    
    y_pred = (y_proba >= threshold).astype(int)
    
    # Metrics
    metrics = {
        'precision': precision_score(y_test, y_pred),
        'recall': recall_score(y_test, y_pred),
        'f1': f1_score(y_test, y_pred),
        'roc_auc': roc_auc_score(y_test, y_proba),
        'pr_auc': average_precision_score(y_test, y_proba),
        'y_proba': y_proba,
        'y_pred': y_pred
    }
    
    return metrics

print("Model training functions defined.")

In [None]:
# 1. Logistic Regression (with class weights)
print("Training Model 1: Logistic Regression")
print("=" * 50)

lr_model = LogisticRegression(
    class_weight='balanced',
    max_iter=1000,
    random_state=SEED,
    n_jobs=-1
)
lr_model.fit(X_train, y_train)
models['Logistic Regression'] = lr_model

results['Logistic Regression'] = evaluate_model('Logistic Regression', lr_model, X_test, y_test)
print(f"ROC-AUC: {results['Logistic Regression']['roc_auc']:.4f}")
print(f"PR-AUC:  {results['Logistic Regression']['pr_auc']:.4f}")

In [None]:
# 2. Random Forest (with class weights)
print("Training Model 2: Random Forest")
print("=" * 50)

rf_model = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    class_weight='balanced',
    random_state=SEED,
    n_jobs=-1
)
rf_model.fit(X_train, y_train)
models['Random Forest'] = rf_model

results['Random Forest'] = evaluate_model('Random Forest', rf_model, X_test, y_test)
print(f"ROC-AUC: {results['Random Forest']['roc_auc']:.4f}")
print(f"PR-AUC:  {results['Random Forest']['pr_auc']:.4f}")

In [None]:
# 3. XGBoost (with scale_pos_weight)
print("Training Model 3: XGBoost")
print("=" * 50)

# Calculate scale_pos_weight
scale_pos_weight = (y_train == 0).sum() / (y_train == 1).sum()

xgb_model = xgb.XGBClassifier(
    n_estimators=100,
    max_depth=6,
    learning_rate=0.1,
    scale_pos_weight=scale_pos_weight,
    random_state=SEED,
    n_jobs=-1,
    eval_metric='auc'
)
xgb_model.fit(X_train, y_train)
models['XGBoost'] = xgb_model

results['XGBoost'] = evaluate_model('XGBoost', xgb_model, X_test, y_test)
print(f"ROC-AUC: {results['XGBoost']['roc_auc']:.4f}")
print(f"PR-AUC:  {results['XGBoost']['pr_auc']:.4f}")

In [None]:
# 4. Neural Network with SMOTE
print("Training Model 4: Neural Network (with SMOTE)")
print("=" * 50)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

class FraudDetectionNN(nn.Module):
    """Neural network for fraud detection."""
    
    def __init__(self, input_dim):
        super(FraudDetectionNN, self).__init__()
        
        self.network = nn.Sequential(
            nn.Linear(input_dim, 64),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(32, 16),
            nn.ReLU(),
            nn.Linear(16, 1),
            nn.Sigmoid()
        )
    
    def forward(self, x):
        return self.network(x)

# Apply SMOTE for neural network training
smote = SMOTE(random_state=SEED, sampling_strategy=0.3)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

# Convert to tensors
X_train_tensor = torch.FloatTensor(X_train_smote.values).to(device)
y_train_tensor = torch.FloatTensor(y_train_smote.values).to(device)
X_test_tensor = torch.FloatTensor(X_test.values).to(device)

# Create model
nn_model = FraudDetectionNN(X_train.shape[1]).to(device)

# Loss: the network ends in Sigmoid, so it outputs probabilities and needs BCELoss
# (BCEWithLogitsLoss expects raw logits). SMOTE has already rebalanced the
# training data, so no extra positive-class weight is applied.
criterion = nn.BCELoss()
optimizer = optim.Adam(nn_model.parameters(), lr=0.001)

# Training
batch_size = 256
n_epochs = 20

dataset = TensorDataset(X_train_tensor, y_train_tensor)
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

nn_model.train()
for epoch in range(n_epochs):
    epoch_loss = 0
    for batch_X, batch_y in dataloader:
        optimizer.zero_grad()
        outputs = nn_model(batch_X).squeeze(1)  # (batch, 1) -> (batch,)
        loss = criterion(outputs, batch_y)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    
    if (epoch + 1) % 5 == 0:
        print(f"Epoch {epoch+1}/{n_epochs}, Loss: {epoch_loss/len(dataloader):.4f}")

# Evaluate
nn_model.eval()
with torch.no_grad():
    y_proba_nn = nn_model(X_test_tensor).cpu().numpy().squeeze()

# Store results
y_pred_nn = (y_proba_nn >= 0.5).astype(int)
results['Neural Network'] = {
    'precision': precision_score(y_test, y_pred_nn),
    'recall': recall_score(y_test, y_pred_nn),
    'f1': f1_score(y_test, y_pred_nn),
    'roc_auc': roc_auc_score(y_test, y_proba_nn),
    'pr_auc': average_precision_score(y_test, y_proba_nn),
    'y_proba': y_proba_nn,
    'y_pred': y_pred_nn
}
models['Neural Network'] = nn_model

print(f"\nROC-AUC: {results['Neural Network']['roc_auc']:.4f}")
print(f"PR-AUC:  {results['Neural Network']['pr_auc']:.4f}")

## 6. Model Evaluation

For imbalanced fraud detection, we focus on:
- **Precision-Recall** (not ROC-AUC alone!)
- **Recall** (catch as many frauds as possible)
- **Confusion Matrix** (see false negatives)
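
The sketch below shows how these metrics fall out of the four confusion-matrix cells. The counts are hypothetical, chosen only to make the arithmetic easy to follow; they are not results from the models above.

```python
# Hypothetical confusion-matrix counts for a test set containing 100 frauds
tp, fn = 80, 20        # frauds caught / frauds missed
fp, tn = 40, 56_860    # false alarms / correctly allowed

precision = tp / (tp + fp)   # of flagged transactions, how many were fraud
recall = tp / (tp + fn)      # of actual frauds, how many were flagged
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1:        {f1:.3f}")
print(f"Accuracy:  {accuracy:.4f}")
```

Accuracy stays near 1 even though a fifth of the frauds are missed, because the true negatives dominate the total.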

In [None]:
# Compare all models
print("Model Comparison")
print("=" * 70)

comparison_df = pd.DataFrame({
    name: {
        'Precision': res['precision'],
        'Recall': res['recall'],
        'F1-Score': res['f1'],
        'ROC-AUC': res['roc_auc'],
        'PR-AUC': res['pr_auc']
    }
    for name, res in results.items()
}).T

print(comparison_df.round(4).to_string())

# Highlight best
print("\nBest Models:")
for metric in comparison_df.columns:
    best = comparison_df[metric].idxmax()
    print(f"  {metric}: {best} ({comparison_df.loc[best, metric]:.4f})")

In [None]:
# ROC and PR Curves
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

colors = ['#3498db', '#2ecc71', '#e74c3c', '#9b59b6']

# ROC Curves
ax1 = axes[0]
for (name, res), color in zip(results.items(), colors):
    fpr, tpr, _ = roc_curve(y_test, res['y_proba'])
    ax1.plot(fpr, tpr, label=f"{name} (AUC={res['roc_auc']:.3f})", color=color, linewidth=2)

ax1.plot([0, 1], [0, 1], 'k--', linewidth=1)
ax1.set_xlabel('False Positive Rate')
ax1.set_ylabel('True Positive Rate')
ax1.set_title('ROC Curves')
ax1.legend(loc='lower right')
ax1.grid(True, alpha=0.3)

# Precision-Recall Curves
ax2 = axes[1]
for (name, res), color in zip(results.items(), colors):
    precision, recall, _ = precision_recall_curve(y_test, res['y_proba'])
    ax2.plot(recall, precision, label=f"{name} (AP={res['pr_auc']:.3f})", color=color, linewidth=2)

# Baseline (random classifier)
baseline = y_test.mean()
ax2.axhline(y=baseline, color='k', linestyle='--', label=f'Baseline ({baseline:.4f})')

ax2.set_xlabel('Recall')
ax2.set_ylabel('Precision')
ax2.set_title('Precision-Recall Curves (More Important for Imbalanced Data!)')
ax2.legend(loc='upper right')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nNote: PR-AUC is more informative than ROC-AUC for imbalanced datasets!")

In [None]:
# Confusion matrices
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

for ax, (name, res) in zip(axes.flatten(), results.items()):
    cm = confusion_matrix(y_test, res['y_pred'])
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=ax,
                xticklabels=['Normal', 'Fraud'],
                yticklabels=['Normal', 'Fraud'])
    ax.set_title(f'{name}')
    ax.set_ylabel('Actual')
    ax.set_xlabel('Predicted')
    
    # Add metrics
    tn, fp, fn, tp = cm.ravel()
    ax.text(0.5, -0.15, f'FN (Missed Fraud): {fn} | FP (False Alarm): {fp}',
            transform=ax.transAxes, ha='center', fontsize=9)

plt.tight_layout()
plt.show()

print("Key Metrics for Fraud Detection:")
print("  - False Negatives (FN): Missed frauds - VERY COSTLY!")
print("  - False Positives (FP): False alarms - Annoying but less costly")

## 7. Threshold Optimization

The default threshold of 0.5 is rarely optimal when false negatives and false positives carry different costs. For fraud detection:
- **Lower threshold** = Catch more fraud (higher recall) but more false alarms
- **Higher threshold** = Fewer false alarms but miss more fraud
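
A toy illustration of this trade-off, assuming ten made-up scores and labels (the helper `recall_and_false_alarms` is hypothetical and only exists for this sketch):

```python
import numpy as np

# Hypothetical labels (1 = fraud) and model scores for ten transactions
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
y_score = np.array([0.02, 0.05, 0.10, 0.15, 0.20, 0.35, 0.40, 0.55, 0.70, 0.95])

def recall_and_false_alarms(threshold):
    """Return (recall, number of false alarms) at the given threshold."""
    y_pred = y_score >= threshold
    tp = int(np.sum(y_pred & (y_true == 1)))
    fp = int(np.sum(y_pred & (y_true == 0)))
    return tp / int(y_true.sum()), fp

for t in (0.3, 0.5, 0.8):
    rec, fp = recall_and_false_alarms(t)
    print(f"threshold={t:.1f}  recall={rec:.2f}  false alarms={fp}")
```

Moving the threshold from 0.8 down to 0.3 catches every fraud in this toy set, at the price of two false alarms.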

In [None]:
def find_optimal_threshold(y_true, y_proba, metric='f1'):
    """
    Find optimal classification threshold.
    
    Args:
        metric: 'f1', 'recall', or 'precision'
    """
    thresholds = np.arange(0.1, 0.9, 0.01)
    scores = []
    
    for thresh in thresholds:
        y_pred = (y_proba >= thresh).astype(int)
        if metric == 'f1':
            score = f1_score(y_true, y_pred)
        elif metric == 'recall':
            score = recall_score(y_true, y_pred)
        elif metric == 'precision':
            score = precision_score(y_true, y_pred, zero_division=0)
        else:
            raise ValueError(f"Unknown metric: {metric!r}")
        scores.append(score)
    
    best_idx = np.argmax(scores)
    return thresholds[best_idx], scores[best_idx], thresholds, scores

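# Find optimal thresholds for the best model by PR-AUC.
# Only models exposing predict_proba are considered, because the real-time
# detector in Section 8 calls predict_proba on the selected model.
# Note: thresholds are tuned on the test set here for brevity; in practice,
# tune them on a separate validation split.
candidates = [name for name, m in models.items() if hasattr(m, 'predict_proba')]
best_model_name = comparison_df.loc[candidates, 'PR-AUC'].idxmax()
y_proba_best = results[best_model_name]['y_proba']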

print(f"Threshold Optimization for {best_model_name}")
print("=" * 50)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot threshold vs metrics
ax1 = axes[0]
for metric, color in [('f1', 'blue'), ('recall', 'green'), ('precision', 'red')]:
    opt_thresh, opt_score, thresholds, scores = find_optimal_threshold(y_test, y_proba_best, metric)
    ax1.plot(thresholds, scores, label=f'{metric.capitalize()} (opt={opt_thresh:.2f})', color=color)
    ax1.axvline(x=opt_thresh, color=color, linestyle='--', alpha=0.5)

ax1.axvline(x=0.5, color='black', linestyle='-', alpha=0.3, label='Default (0.5)')
ax1.set_xlabel('Threshold')
ax1.set_ylabel('Score')
ax1.set_title('Metrics vs Threshold')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Business cost analysis
ax2 = axes[1]

# Assume: Cost of missed fraud = $500, Cost of false alarm = $10
cost_fn = 500  # False Negative cost
cost_fp = 10   # False Positive cost

costs = []
thresholds = np.arange(0.1, 0.9, 0.01)

for thresh in thresholds:
    y_pred = (y_proba_best >= thresh).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    total_cost = (fn * cost_fn) + (fp * cost_fp)
    costs.append(total_cost)

ax2.plot(thresholds, costs, color='purple', linewidth=2)
opt_idx = np.argmin(costs)
ax2.axvline(x=thresholds[opt_idx], color='red', linestyle='--', 
            label=f'Optimal: {thresholds[opt_idx]:.2f}')
ax2.scatter([thresholds[opt_idx]], [costs[opt_idx]], color='red', s=100, zorder=5)

ax2.set_xlabel('Threshold')
ax2.set_ylabel('Total Cost ($)')
ax2.set_title(f'Business Cost Analysis\n(FN=${cost_fn}, FP=${cost_fp})')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\nOptimal threshold (minimum cost): {thresholds[opt_idx]:.2f}")
print(f"Minimum total cost: ${costs[opt_idx]:,.0f}")

In [None]:
# Compare default vs optimized threshold
optimal_threshold = thresholds[opt_idx]

print("Performance Comparison: Default vs Optimized Threshold")
print("=" * 60)

for thresh, name in [(0.5, 'Default (0.5)'), (optimal_threshold, f'Optimized ({optimal_threshold:.2f})')]:
    y_pred = (y_proba_best >= thresh).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    
    print(f"\n{name}:")
    print(f"  Precision: {precision_score(y_test, y_pred):.4f}")
    print(f"  Recall:    {recall_score(y_test, y_pred):.4f}")
    print(f"  F1-Score:  {f1_score(y_test, y_pred):.4f}")
    print(f"  Missed Frauds (FN): {fn}")
    print(f"  False Alarms (FP): {fp}")
    print(f"  Total Cost: ${(fn * cost_fn) + (fp * cost_fp):,}")

## 8. Real-Time Detection Pipeline

Now let's wrap the trained model, the fitted scalers, and the tuned threshold in a class that scores one transaction at a time, as a real-time system would.

In [None]:
class RealTimeFraudDetector:
    """
    Real-time fraud detection system.
    
    Features:
    - Fast prediction for streaming transactions
    - Configurable threshold
    - Risk scoring
    - Transaction history tracking
    """
    
    def __init__(self, model, scaler_amount, scaler_time, threshold=0.5):
        """
        Initialize fraud detector.
        
        Args:
            model: Trained classifier
            scaler_amount: Fitted scaler for Amount
            scaler_time: Fitted scaler for Time
            threshold: Classification threshold
        """
        self.model = model
        self.scaler_amount = scaler_amount
        self.scaler_time = scaler_time
        self.threshold = threshold
        
        # Statistics tracking
        self.total_transactions = 0
        self.total_flagged = 0
        self.transaction_history = []
        
    def preprocess(self, transaction):
        """
        Preprocess a single transaction.
        
        Args:
            transaction: dict with keys 'V1'-'V28', 'Amount', 'Time'
            
        Returns:
            numpy array ready for prediction
        """
        # Extract features
        features = [transaction.get(f'V{i}', 0) for i in range(1, 29)]
        
        # Scale Amount and Time
        amount_scaled = self.scaler_amount.transform([[transaction['Amount']]])[0][0]
        time_scaled = self.scaler_time.transform([[transaction['Time']]])[0][0]
        
        features.extend([amount_scaled, time_scaled])
        
        return np.array(features).reshape(1, -1)
    
    def predict(self, transaction):
        """
        Predict if transaction is fraudulent.
        
        Returns:
            dict with prediction results
        """
        start_time = time.time()
        
        # Preprocess
        features = self.preprocess(transaction)
        
        # Get probability
        if hasattr(self.model, 'predict_proba'):
            fraud_prob = self.model.predict_proba(features)[0][1]
        else:
            fraud_prob = self.model.predict(features)[0]
        
        # Classification
        is_fraud = fraud_prob >= self.threshold
        
        # Risk level
        if fraud_prob < 0.3:
            risk_level = 'LOW'
        elif fraud_prob < 0.6:
            risk_level = 'MEDIUM'
        elif fraud_prob < 0.8:
            risk_level = 'HIGH'
        else:
            risk_level = 'CRITICAL'
        
        # Update statistics
        self.total_transactions += 1
        if is_fraud:
            self.total_flagged += 1
        
        prediction_time = (time.time() - start_time) * 1000  # ms
        
        result = {
            'transaction_id': self.total_transactions,
            'amount': transaction['Amount'],
            'fraud_probability': fraud_prob,
            'is_fraud': is_fraud,
            'risk_level': risk_level,
            'prediction_time_ms': prediction_time,
            'action': 'BLOCK' if is_fraud else 'ALLOW'
        }
        
        self.transaction_history.append(result)
        
        return result
    
    def get_statistics(self):
        """Get detection statistics."""
        return {
            'total_transactions': self.total_transactions,
            'total_flagged': self.total_flagged,
            'fraud_rate': self.total_flagged / max(1, self.total_transactions),
            'avg_prediction_time_ms': np.mean([t['prediction_time_ms'] for t in self.transaction_history]) if self.transaction_history else 0
        }
    
    def reset_statistics(self):
        """Reset tracking statistics."""
        self.total_transactions = 0
        self.total_flagged = 0
        self.transaction_history = []

# Create detector with best model
detector = RealTimeFraudDetector(
    model=models[best_model_name],
    scaler_amount=scaler_amount,
    scaler_time=scaler_time,
    threshold=optimal_threshold
)

print(f"Real-Time Fraud Detector initialized!")
print(f"  Model: {best_model_name}")
print(f"  Threshold: {optimal_threshold:.2f}")

In [None]:
# Test with sample transactions
print("Testing Real-Time Fraud Detection")
print("=" * 60)

# Get some test transactions
test_indices = y_test.reset_index(drop=True)
fraud_indices = test_indices[test_indices == 1].index[:3].tolist()
normal_indices = test_indices[test_indices == 0].index[:3].tolist()

test_samples = fraud_indices + normal_indices

for idx in test_samples:
    # Create transaction dict
    row = df.iloc[X_test.index[idx]]
    transaction = {
        'Time': row['Time'],
        'Amount': row['Amount'],
        **{f'V{i}': row[f'V{i}'] for i in range(1, 29)}
    }
    
    # Predict
    result = detector.predict(transaction)
    actual = 'FRAUD' if y_test.iloc[idx] == 1 else 'NORMAL'
    
    print(f"\nTransaction #{result['transaction_id']}:")
    print(f"  Amount: ${result['amount']:.2f}")
    print(f"  Fraud Probability: {result['fraud_probability']:.4f}")
    print(f"  Risk Level: {result['risk_level']}")
    print(f"  Prediction: {result['action']} | Actual: {actual}")
    print(f"  Time: {result['prediction_time_ms']:.2f}ms")

## 9. Streaming Simulation

Simulate real-time transaction processing.

In [None]:
def simulate_transaction_stream(detector, X_data, y_data, n_transactions=100, delay=0.01):
    """
    Simulate streaming transactions.
    
    Args:
        detector: RealTimeFraudDetector instance
        X_data: Feature data
        y_data: Labels
        n_transactions: Number of transactions to process
        delay: Delay between transactions (seconds)
    """
    detector.reset_statistics()
    
    # Sample transactions
    indices = np.random.choice(len(X_data), min(n_transactions, len(X_data)), replace=False)
    
    true_positives = 0
    false_positives = 0
    true_negatives = 0
    false_negatives = 0
    
    print(f"Processing {n_transactions} transactions...")
    print("=" * 50)
    
    for i, idx in enumerate(indices):
        # Get transaction
        row = df.iloc[X_data.index[idx]]
        transaction = {
            'Time': row['Time'],
            'Amount': row['Amount'],
            **{f'V{i}': row[f'V{i}'] for i in range(1, 29)}
        }
        
        # Predict
        result = detector.predict(transaction)
        actual_fraud = y_data.iloc[idx] == 1
        predicted_fraud = result['is_fraud']
        
        # Track metrics
        if actual_fraud and predicted_fraud:
            true_positives += 1
        elif not actual_fraud and predicted_fraud:
            false_positives += 1
        elif actual_fraud and not predicted_fraud:
            false_negatives += 1
        else:
            true_negatives += 1
        
        # Print alerts for fraud
        if result['risk_level'] in ['HIGH', 'CRITICAL']:
            status = "CAUGHT!" if actual_fraud else "False Alarm"
            print(f"  [{result['risk_level']}] Transaction #{result['transaction_id']}: "
                  f"${result['amount']:.2f} - {result['action']} ({status})")
        
        # Progress
        if (i + 1) % 200 == 0:
            print(f"  Processed {i + 1}/{n_transactions}...")
        
        time.sleep(delay)
    
    # Final statistics
    stats = detector.get_statistics()
    
    print("\n" + "=" * 50)
    print("STREAMING SIMULATION RESULTS")
    print("=" * 50)
    print(f"Total Transactions: {stats['total_transactions']}")
    print(f"Flagged as Fraud: {stats['total_flagged']}")
    print(f"Avg Prediction Time: {stats['avg_prediction_time_ms']:.2f}ms")
    print("\nConfusion Matrix:")
    print(f"  True Positives (Caught Fraud): {true_positives}")
    print(f"  False Positives (False Alarms): {false_positives}")
    print(f"  True Negatives (Correct Allow): {true_negatives}")
    print(f"  False Negatives (Missed Fraud): {false_negatives}")
    
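    if true_positives + false_negatives > 0:
        recall = true_positives / (true_positives + false_negatives)
        print(f"\nFraud Detection Rate (Recall): {recall:.2%}")
    else:
        # With fraud at ~0.17% of transactions, a small sample may contain no fraud at all
        print("\nNo fraud cases in this sample; recall is undefined.")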
    
    return {
        'tp': true_positives, 'fp': false_positives,
        'tn': true_negatives, 'fn': false_negatives,
        'stats': stats
    }

# Run simulation
sim_results = simulate_transaction_stream(detector, X_test, y_test, n_transactions=500, delay=0)
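
The simulation returns raw confusion counts, and recall is only one of the rates they support. The sketch below derives precision, recall, F1, and the false positive rate from such counts. `summarize_confusion` is a hypothetical helper, not part of the `FraudDetector` class, and the counts passed to it are made up for illustration. Note that at a fraud rate of about 0.17%, a sample of 500 transactions is expected to contain fewer than one fraud on average, so rates from a single run are very noisy.

```python
def summarize_confusion(tp, fp, tn, fn):
    """Derive standard rates from raw confusion-matrix counts (hypothetical helper)."""
    def safe_div(num, den):
        # Return None instead of raising when the denominator is zero
        return num / den if den > 0 else None

    precision = safe_div(tp, tp + fp)   # share of flagged transactions that were fraud
    recall = safe_div(tp, tp + fn)      # share of fraud that was flagged
    fpr = safe_div(fp, fp + tn)         # share of normal transactions that were flagged

    if precision is not None and recall is not None and (precision + recall) > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = None

    return {
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'false_positive_rate': fpr,
    }

# Made-up counts for illustration only (not results from this notebook)
example = summarize_confusion(tp=8, fp=2, tn=489, fn=1)
for name, value in example.items():
    print(f"{name}: {value:.4f}" if value is not None else f"{name}: undefined")
```

With the dictionary returned by the simulation, the call would look like `summarize_confusion(sim_results['tp'], sim_results['fp'], sim_results['tn'], sim_results['fn'])`.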

In [None]:
# Visualize streaming results
history = detector.transaction_history

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# 1. Fraud probability over time
ax1 = axes[0, 0]
probs = [t['fraud_probability'] for t in history]
colors = ['red' if t['is_fraud'] else 'green' for t in history]
ax1.scatter(range(len(probs)), probs, c=colors, alpha=0.5, s=10)
ax1.axhline(y=optimal_threshold, color='blue', linestyle='--', label=f'Threshold ({optimal_threshold:.2f})')
ax1.set_xlabel('Transaction #')
ax1.set_ylabel('Fraud Probability')
ax1.set_title('Fraud Probability Over Time')
ax1.legend()

# 2. Risk level distribution
ax2 = axes[0, 1]
risk_counts = pd.Series([t['risk_level'] for t in history]).value_counts()
risk_colors = {'LOW': '#2ecc71', 'MEDIUM': '#f1c40f', 'HIGH': '#e67e22', 'CRITICAL': '#e74c3c'}
bars = ax2.bar(risk_counts.index, risk_counts.values, 
               color=[risk_colors.get(r, 'gray') for r in risk_counts.index])
ax2.set_xlabel('Risk Level')
ax2.set_ylabel('Count')
ax2.set_title('Risk Level Distribution')

# 3. Prediction time distribution
ax3 = axes[1, 0]
pred_times = [t['prediction_time_ms'] for t in history]
ax3.hist(pred_times, bins=30, color='#3498db', edgecolor='black', alpha=0.7)
ax3.axvline(x=np.mean(pred_times), color='red', linestyle='--', 
            label=f'Mean: {np.mean(pred_times):.2f}ms')
ax3.set_xlabel('Prediction Time (ms)')
ax3.set_ylabel('Frequency')
ax3.set_title('Prediction Latency Distribution')
ax3.legend()

# 4. Cumulative fraud detection
ax4 = axes[1, 1]
flagged_cumsum = np.cumsum([1 if t['is_fraud'] else 0 for t in history])
ax4.plot(flagged_cumsum, color='red', linewidth=2)
ax4.set_xlabel('Transaction #')
ax4.set_ylabel('Cumulative Flagged Transactions')
ax4.set_title('Cumulative Fraud Flags Over Time')
ax4.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
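
The histogram above marks only the mean latency, but real-time systems are usually judged by their tail. The following sketch summarizes latencies with percentiles. `latency_summary` is a hypothetical helper, and the lognormal sample is synthetic stand-in data, not measurements from the detector.

```python
import numpy as np

def latency_summary(times_ms):
    """Summarize prediction latencies (ms) with mean and tail percentiles (hypothetical helper)."""
    times = np.asarray(times_ms, dtype=float)
    p50, p95, p99 = np.percentile(times, [50, 95, 99])
    return {
        'mean': float(times.mean()),
        'p50': float(p50),
        'p95': float(p95),
        'p99': float(p99),
        'max': float(times.max()),
    }

# Synthetic, right-skewed latencies used only to exercise the function
rng = np.random.default_rng(1)
fake_times_ms = rng.lognormal(mean=0.0, sigma=0.5, size=500)

summary = latency_summary(fake_times_ms)
for name, value in summary.items():
    print(f"{name}: {value:.2f}ms")
```

On the notebook's own data, the input would be the `pred_times` list built for the histogram.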

## 10. Summary

### What We Built

A complete **Real-Time Fraud Detection System** with:

| Component | Implementation |
|-----------|---------------|
| **Data Handling** | Kaggle credit card dataset |
| **Imbalance Handling** | SMOTE, class weights, threshold tuning |
| **Models** | Logistic Regression, Random Forest, XGBoost, Neural Network |
| **Evaluation** | PR-AUC, ROC-AUC, Confusion Matrix |
| **Real-Time Pipeline** | FraudDetector class with streaming simulation |

### Key Learnings

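1. **Class Imbalance**: Accuracy is misleading here, since labeling every transaction as normal already scores about 99.83%. Use PR-AUC and recall instead
2. **Threshold Tuning**: The default 0.5 threshold is rarely optimal for fraud
3. **Business Costs**: A missed fraud (FN) and a false alarm (FP) carry different costs, so weight them differently
4. **Real-Time Requirements**: Single-transaction predictions can be fast enough for real-time scoring; check the measured latency rather than assuming it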
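
Learnings 2 and 3 combine naturally: once false negatives and false positives have explicit costs, the threshold can be chosen to minimize total cost. The sketch below shows one way to do that with a simple grid search. `pick_threshold_by_cost` is a hypothetical helper, the 100:1 cost ratio is an assumed placeholder rather than a figure from this tutorial, and the scores are synthetic.

```python
import numpy as np

def pick_threshold_by_cost(y_true, y_prob, cost_fn=100.0, cost_fp=1.0, thresholds=None):
    """Grid-search the threshold that minimizes cost_fn * FN + cost_fp * FP (hypothetical helper)."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    if thresholds is None:
        thresholds = np.linspace(0.01, 0.99, 99)

    best_t, best_cost = None, np.inf
    for t in thresholds:
        pred = y_prob >= t                       # flag as fraud at or above the threshold
        fn = np.sum((y_true == 1) & ~pred)       # missed fraud
        fp = np.sum((y_true == 0) & pred)        # false alarms
        cost = cost_fn * fn + cost_fp * fp
        if cost < best_cost:
            best_t, best_cost = float(t), float(cost)
    return best_t, best_cost

# Synthetic labels and scores: rare positives that tend to score higher than negatives
rng = np.random.default_rng(0)
y_true = (rng.random(20000) < 0.002).astype(int)
y_prob = np.where(y_true == 1,
                  rng.beta(5, 2, y_true.size),
                  rng.beta(1, 12, y_true.size))

best_t, best_cost = pick_threshold_by_cost(y_true, y_prob)
print(f"Cost-minimizing threshold: {best_t:.2f} (total cost {best_cost:.0f})")
```

The costs are the key design input: raising `cost_fn` relative to `cost_fp` pushes the chosen threshold down, trading more false alarms for fewer missed frauds.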

### Best Practices for Fraud Detection

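1. Always use stratified sampling for train/test splits and cross-validation, so every split contains fraud cases
2. Prioritize recall (catching fraud) over precision, within a false-alarm rate the business can tolerate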
3. Consider business costs when setting thresholds
4. Monitor model performance over time (fraud patterns change!)
5. Use ensemble methods for robustness
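
Practice 4 can start very simply. The sketch below tracks the share of flagged transactions in a sliding window and raises a warning when it climbs well above a baseline. `FlagRateMonitor` is a hypothetical class, and the window size, baseline rate, and tolerance are assumed values that would need tuning. It only watches the flag rate, which needs no labels; monitoring recall would require confirmed fraud outcomes, which typically arrive later.

```python
from collections import deque

class FlagRateMonitor:
    """Hypothetical helper: tracks the share of flagged transactions in a sliding window."""

    def __init__(self, window=1000, baseline_rate=0.002, tolerance=3.0):
        self.window = deque(maxlen=window)   # oldest entries drop out automatically
        self.baseline_rate = baseline_rate   # expected flag rate (assumed value)
        self.tolerance = tolerance           # alert when the rate exceeds tolerance * baseline

    def update(self, is_flagged):
        """Record one prediction and return the current windowed flag rate."""
        self.window.append(1 if is_flagged else 0)
        return self.current_rate()

    def current_rate(self):
        return sum(self.window) / len(self.window) if self.window else 0.0

    def is_drifting(self):
        # Only judge once the window is full, to avoid noisy early alerts.
        # This checks upward drift only; a sudden drop in flags may also deserve a check.
        if len(self.window) < self.window.maxlen:
            return False
        return self.current_rate() > self.tolerance * self.baseline_rate

# Tiny demonstration with made-up predictions
monitor = FlagRateMonitor(window=10, baseline_rate=0.1, tolerance=2.0)
for flagged in [False] * 10 + [True] * 5:
    monitor.update(flagged)
print(f"Windowed flag rate: {monitor.current_rate():.2f}, drifting: {monitor.is_drifting()}")
```

In the streaming loop, `monitor.update(result['is_fraud'])` could be called after each prediction.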

In [None]:
# Final summary
print("="*60)
print("REAL-TIME FRAUD DETECTION SYSTEM - FINAL SUMMARY")
print("="*60)

print(f"""
Dataset:
────────
  - {len(df):,} credit card transactions
  - {(df['Class']==1).sum()} frauds ({(df['Class']==1).mean()*100:.3f}%)
  - 30 features (V1-V28 + Amount + Time)

Best Model: {best_model_name}
────────────────────────────────
  - ROC-AUC: {results[best_model_name]['roc_auc']:.4f}
  - PR-AUC:  {results[best_model_name]['pr_auc']:.4f}
  - Optimal Threshold: {optimal_threshold:.2f}

Real-Time Performance:
─────────────────────
  - Avg Prediction Time: {detector.get_statistics()['avg_prediction_time_ms']:.2f}ms
  - Throughput: ~{1000/max(0.01, detector.get_statistics()['avg_prediction_time_ms']):.0f} transactions/second

Techniques Used:
───────────────
  - SMOTE for oversampling
  - Class weights for imbalance
  - Threshold optimization
  - Business cost analysis
  - Streaming simulation

Key Metrics for Fraud Detection:
────────────────────────────────
  - Recall (Sensitivity): Catch as many frauds as possible
  - PR-AUC: Better than ROC-AUC for imbalanced data
  - False Negative Cost: Most important to minimize
""")