# Lab 4: Classification - Fraud Detection
## Interactive Notebook

### Learning Objectives

1. Handle imbalanced datasets
2. Implement multiple classification algorithms
3. Evaluate models with ROC-AUC
4. Optimize for precision vs recall
5. Deploy a fraud detection system

**Estimated Time:** 4-5 hours

---

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.metrics import (classification_report, confusion_matrix, 
                             roc_auc_score, roc_curve, precision_recall_curve,
                             accuracy_score, precision_score, recall_score, f1_score)
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
import warnings
warnings.filterwarnings('ignore')

plt.style.use('seaborn-v0_8-whitegrid')
print('✅ Libraries loaded successfully!')

## Part 1: Load and Explore Data

We'll create a synthetic credit card fraud dataset of 10,000 transactions, 2% of which are fraudulent.

In [None]:
# Generate synthetic fraud detection data
np.random.seed(42)

n_samples = 10000
fraud_ratio = 0.02  # 2% fraud

# Normal transactions
# Amounts and distances are clipped at 0 because negative values are not meaningful
n_normal = int(n_samples * (1 - fraud_ratio))
normal_amount = np.clip(np.random.normal(100, 50, n_normal), 0, None)
normal_time = np.random.uniform(0, 24, n_normal)
normal_distance = np.clip(np.random.normal(10, 5, n_normal), 0, None)

# Fraudulent transactions (different patterns)
n_fraud = int(n_samples * fraud_ratio)
fraud_amount = np.clip(np.random.normal(500, 200, n_fraud), 0, None)
fraud_time = np.random.uniform(0, 6, n_fraud)  # Late night (hours 0-6)
fraud_distance = np.clip(np.random.normal(100, 50, n_fraud), 0, None)  # Far from home

# Create DataFrame
df = pd.DataFrame({
    'Amount': np.concatenate([normal_amount, fraud_amount]),
    'Time': np.concatenate([normal_time, fraud_time]),
    'Distance': np.concatenate([normal_distance, fraud_distance]),
    'NumTransactions': np.concatenate([
        np.random.randint(1, 10, n_normal),
        np.random.randint(1, 3, n_fraud)
    ]),
    'IsFraud': np.concatenate([np.zeros(n_normal), np.ones(n_fraud)]).astype(int)  # 0 = normal, 1 = fraud
})

# Shuffle
df = df.sample(frac=1, random_state=42).reset_index(drop=True)

print(f'Dataset created: {len(df)} transactions')
print(f'\nClass distribution:')
print(df['IsFraud'].value_counts())
print(f'\nFraud percentage: {df["IsFraud"].mean()*100:.2f}%')
df.head()

### 📝 Task 1: Explore Class Imbalance

In [None]:
# TODO: Create a bar plot showing class distribution
plt.figure(figsize=(10, 6))
# YOUR CODE HERE

plt.xlabel('Class')
plt.ylabel('Count')
plt.title('Class Distribution (Imbalanced)')
plt.show()

# TODO: Calculate imbalance ratio
imbalance_ratio = None  # YOUR CODE HERE
print(f'Imbalance ratio: 1:{imbalance_ratio:.0f}')

In [None]:
# Solution
plt.figure(figsize=(10, 6))
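# sort_index() guarantees the order 0 (normal), 1 (fraud) so the bar labels below match
class_counts = df['IsFraud'].value_counts().sort_index()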
plt.bar(['Normal', 'Fraud'], class_counts, color=['green', 'red'], alpha=0.7, edgecolor='black')
for i, v in enumerate(class_counts):
    plt.text(i, v + 50, str(v), ha='center', fontweight='bold')
plt.xlabel('Class', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.title('Class Distribution (Highly Imbalanced)', fontsize=14, fontweight='bold')
plt.grid(alpha=0.3, axis='y')
plt.show()

n_normal = (df['IsFraud'] == 0).sum()
n_fraud = (df['IsFraud'] == 1).sum()
imbalance_ratio = n_normal / n_fraud
print(f'⚠️ Imbalance ratio: 1:{imbalance_ratio:.0f}')
print(f'For every 1 fraud case, there are {imbalance_ratio:.0f} normal cases')

### 📝 Task 2: Compare Feature Distributions

In [None]:
# TODO: Create box plots for each feature, separated by class
fig, axes = plt.subplots(2, 2, figsize=(16, 10))

features = ['Amount', 'Time', 'Distance', 'NumTransactions']

for idx, feature in enumerate(features):
    ax = axes[idx // 2, idx % 2]
    # YOUR CODE HERE: Create box plot
    pass
    
plt.tight_layout()
plt.show()
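
One possible way to fill in the loop above is sketched below. It is only a sketch: it rebuilds a compact stand-in for the Part 1 DataFrame (same distributions, different random draws) so that it runs on its own, and it uses plain matplotlib box plots rather than any particular seaborn call.

```python
import matplotlib
matplotlib.use('Agg')  # headless backend so the sketch runs as a script; remove this line inside Jupyter
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Stand-in for the Part 1 DataFrame (assumption: same distributions, fresh random draws)
rng = np.random.default_rng(42)
n_normal, n_fraud = 9800, 200
df = pd.DataFrame({
    'Amount': np.concatenate([rng.normal(100, 50, n_normal), rng.normal(500, 200, n_fraud)]),
    'Time': np.concatenate([rng.uniform(0, 24, n_normal), rng.uniform(0, 6, n_fraud)]),
    'Distance': np.concatenate([rng.normal(10, 5, n_normal), rng.normal(100, 50, n_fraud)]),
    'NumTransactions': np.concatenate([rng.integers(1, 10, n_normal), rng.integers(1, 3, n_fraud)]),
    'IsFraud': np.concatenate([np.zeros(n_normal, dtype=int), np.ones(n_fraud, dtype=int)]),
})

features = ['Amount', 'Time', 'Distance', 'NumTransactions']
fig, axes = plt.subplots(2, 2, figsize=(16, 10))

for idx, feature in enumerate(features):
    ax = axes[idx // 2, idx % 2]
    # One box per class: values for normal (0) and fraud (1) transactions
    groups = [df.loc[df['IsFraud'] == label, feature].to_numpy() for label in (0, 1)]
    ax.boxplot(groups)
    ax.set_xticklabels(['Normal', 'Fraud'])
    ax.set_title(f'{feature} by class')
    ax.set_ylabel(feature)

plt.tight_layout()
plt.show()
```

Features whose boxes barely overlap between the two classes are the ones a classifier can separate most easily.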

## Part 2: Data Preparation

### Split and Scale Data

In [None]:
# Prepare features and target
X = df.drop('IsFraud', axis=1)
y = df['IsFraud']

# TODO: Split data (80/20)
X_train, X_test, y_train, y_test = None, None, None, None  # YOUR CODE HERE

# TODO: Scale features
scaler = StandardScaler()
X_train_scaled = None  # YOUR CODE HERE
X_test_scaled = None  # YOUR CODE HERE

print(f'Training set: {X_train.shape[0]} samples')
print(f'Test set: {X_test.shape[0]} samples')
print(f'\nTraining class distribution:')
print(y_train.value_counts())
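
A minimal sketch of the split-and-scale step follows, assuming a stand-in DataFrame with the same columns as Part 1. Two choices here are suggestions rather than requirements of the lab: `stratify=y` keeps the fraud rate the same in both splits, and the scaler is fitted on the training set only so that no test-set statistics leak into training.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in for the Part 1 DataFrame (assumption: same distributions, fresh random draws)
rng = np.random.default_rng(42)
n_normal, n_fraud = 9800, 200
df = pd.DataFrame({
    'Amount': np.concatenate([rng.normal(100, 50, n_normal), rng.normal(500, 200, n_fraud)]),
    'Time': np.concatenate([rng.uniform(0, 24, n_normal), rng.uniform(0, 6, n_fraud)]),
    'Distance': np.concatenate([rng.normal(10, 5, n_normal), rng.normal(100, 50, n_fraud)]),
    'NumTransactions': np.concatenate([rng.integers(1, 10, n_normal), rng.integers(1, 3, n_fraud)]),
    'IsFraud': np.concatenate([np.zeros(n_normal, dtype=int), np.ones(n_fraud, dtype=int)]),
})

X = df.drop('IsFraud', axis=1)
y = df['IsFraud']

# 80/20 split; stratify=y preserves the ~2% fraud rate in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std from training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics, no refit

print(f'Train fraud rate: {y_train.mean():.3f}')
print(f'Test fraud rate:  {y_test.mean():.3f}')
```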

## Part 3: Baseline Model (Without Handling Imbalance)

In [None]:
# TODO: Train Logistic Regression
lr_baseline = LogisticRegression(random_state=42)
# YOUR CODE HERE

# TODO: Make predictions
y_pred_baseline = None  # YOUR CODE HERE

# TODO: Print classification report
print('Baseline Model (No Rebalancing):')
print('=' * 60)
# YOUR CODE HERE
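
The baseline steps above could be sketched as follows. The data here is a NumPy stand-in for the Part 1 dataset (an assumption made so the snippet runs on its own), and the stratified split is a suggestion, not something the lab prescribes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in data with the Part 1 distributions (columns: Amount, Time, Distance, NumTransactions)
rng = np.random.default_rng(42)
n_normal, n_fraud = 9800, 200
X = np.column_stack([
    np.concatenate([rng.normal(100, 50, n_normal), rng.normal(500, 200, n_fraud)]),
    np.concatenate([rng.uniform(0, 24, n_normal), rng.uniform(0, 6, n_fraud)]),
    np.concatenate([rng.normal(10, 5, n_normal), rng.normal(100, 50, n_fraud)]),
    np.concatenate([rng.integers(1, 10, n_normal), rng.integers(1, 3, n_fraud)]),
])
y = np.concatenate([np.zeros(n_normal, dtype=int), np.ones(n_fraud, dtype=int)])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train on the original, imbalanced training set (no rebalancing yet)
lr_baseline = LogisticRegression(random_state=42)
lr_baseline.fit(X_train_scaled, y_train)

# Hard class predictions on the held-out test set
y_pred_baseline = lr_baseline.predict(X_test_scaled)

print('Baseline Model (No Rebalancing):')
print('=' * 60)
print(classification_report(y_test, y_pred_baseline, target_names=['Normal', 'Fraud'], digits=3))
```

When reading the report, look at the `Fraud` row rather than the overall accuracy line.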


### ⚠️ Notice the Problem!

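Without rebalancing, a model trained on imbalanced data often shows:
- High accuracy that is misleading, because the majority class dominates the score
- Low recall for the fraud class
- Poor overall performance on the minority class

Check your classification report to see how strongly this shows up: the synthetic classes in this lab are fairly well separated, so the effect may be milder than on real fraud data.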
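
To see why accuracy alone is misleading at a 2% fraud rate, consider a trivial "classifier" that labels every transaction as normal. The sketch below uses the class counts from Part 1 (9,800 normal, 200 fraud) and is an illustration only, not one of the lab's models.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# True labels with the Part 1 class counts: 9,800 normal (0) and 200 fraud (1)
y_true = np.concatenate([np.zeros(9800, dtype=int), np.ones(200, dtype=int)])

# A useless model that never flags fraud
y_pred_all_normal = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred_all_normal)
rec = recall_score(y_true, y_pred_all_normal)  # recall for the fraud class

print(f'Accuracy:     {acc:.2%}')  # 98.00%
print(f'Fraud recall: {rec:.2%}')  # 0.00%
```

Any real model should therefore be judged against this 98% floor, and on fraud-class recall and precision rather than accuracy.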

## Part 4: Handle Imbalance with SMOTE

In [None]:
# TODO: Apply SMOTE
smote = SMOTE(random_state=42)
X_train_balanced, y_train_balanced = None, None  # YOUR CODE HERE

print('After SMOTE:')
print(f'Training samples: {len(X_train_balanced)}')
print(f'Class distribution:')
print(pd.Series(y_train_balanced).value_counts())
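
The idea behind SMOTE is to create synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbors. The sketch below illustrates that idea with NumPy and scikit-learn's `NearestNeighbors`. It is not imblearn's implementation: `smote_like_oversample` is a hypothetical helper, and the random minority rows stand in for the scaled fraud rows of the training set.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like_oversample(X_min, n_new, k=5, seed=42):
    """Hypothetical helper: build n_new synthetic rows by interpolating
    between minority samples and their nearest minority neighbors."""
    rng = np.random.default_rng(seed)
    # Ask for k + 1 neighbors because each point is normally its own nearest neighbor
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, neighbor_idx = nn.kneighbors(X_min)
    base = rng.integers(0, len(X_min), size=n_new)  # random minority rows to start from
    pick = rng.integers(1, k + 1, size=n_new)       # random neighbor rank, skipping column 0
    neighbors = neighbor_idx[base, pick]
    gap = rng.random((n_new, 1))                    # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[neighbors] - X_min[base])

# Stand-in for the 160 scaled fraud rows expected in a stratified 80% training split
rng = np.random.default_rng(0)
X_minority = rng.normal(size=(160, 4))

# Add enough synthetic rows to match 7,840 majority rows
X_synthetic = smote_like_oversample(X_minority, n_new=7840 - 160)
X_minority_balanced = np.vstack([X_minority, X_synthetic])
print(X_minority_balanced.shape)  # (7840, 4)
```

Because every synthetic row lies on a line segment between two real fraud rows, oversampling adds no information from outside the minority region. Resampling should be applied to the training set only, never to the test set.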

## Part 5: Train Multiple Classifiers

In [None]:
# Define models
models = {
    'Logistic Regression': LogisticRegression(random_state=42),
    'Decision Tree': DecisionTreeClassifier(random_state=42, max_depth=10),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42)
}

# TODO: Train each model and store predictions
results = {}

for name, model in models.items():
    # YOUR CODE HERE: Fit model
    
    # YOUR CODE HERE: Predict
    
    # YOUR CODE HERE: Calculate metrics
    results[name] = {
        'model': model,
        'predictions': None,  # YOUR CODE
        'accuracy': None,  # YOUR CODE
        'precision': None,  # YOUR CODE
        'recall': None,  # YOUR CODE
        'f1': None,  # YOUR CODE
        'roc_auc': None  # YOUR CODE
    }

# Display results
results_df = pd.DataFrame(results).T
print(results_df[['accuracy', 'precision', 'recall', 'f1', 'roc_auc']])
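
The training loop above might be filled in along these lines. This is a sketch under stated assumptions: it uses stand-in data, only two of the four models to keep it quick, and it fits on the unbalanced training set so that it does not depend on imblearn. In the notebook you would fit on `X_train_balanced` and `y_train_balanced` instead.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Stand-in data with the Part 1 distributions (columns: Amount, Time, Distance, NumTransactions)
rng = np.random.default_rng(42)
n_normal, n_fraud = 9800, 200
X = np.column_stack([
    np.concatenate([rng.normal(100, 50, n_normal), rng.normal(500, 200, n_fraud)]),
    np.concatenate([rng.uniform(0, 24, n_normal), rng.uniform(0, 6, n_fraud)]),
    np.concatenate([rng.normal(10, 5, n_normal), rng.normal(100, 50, n_fraud)]),
    np.concatenate([rng.integers(1, 10, n_normal), rng.integers(1, 3, n_fraud)]),
])
y = np.concatenate([np.zeros(n_normal, dtype=int), np.ones(n_fraud, dtype=int)])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

models = {
    'Logistic Regression': LogisticRegression(random_state=42),
    'Decision Tree': DecisionTreeClassifier(random_state=42, max_depth=10),
}

results = {}
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    # ROC-AUC needs scores, not hard labels: use the predicted probability of fraud
    y_proba = model.predict_proba(X_test_scaled)[:, 1]
    results[name] = {
        'accuracy': accuracy_score(y_test, y_pred),
        'precision': precision_score(y_test, y_pred, zero_division=0),
        'recall': recall_score(y_test, y_pred),
        'f1': f1_score(y_test, y_pred),
        'roc_auc': roc_auc_score(y_test, y_proba),
    }

results_df = pd.DataFrame(results).T
print(results_df.round(3))
```

Passing probabilities rather than `y_pred` to `roc_auc_score` matters: with hard labels the ROC curve collapses to a single operating point.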

## Part 6: ROC Curve Analysis

In [None]:
# TODO: Plot ROC curves for all models
plt.figure(figsize=(12, 8))

for name, result in results.items():
    # YOUR CODE HERE: Get probability predictions
    # YOUR CODE HERE: Calculate ROC curve
    # YOUR CODE HERE: Plot curve
    pass

plt.plot([0, 1], [0, 1], 'k--', label='Random Classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curves - Model Comparison')
plt.legend()
plt.grid(alpha=0.3)
plt.show()
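
One way to complete the ROC loop is sketched below. It assumes stand-in data and two models fitted inline (in the notebook, the fitted models would come from `results[name]['model']`), and it stores each curve in a hypothetical `curves` dictionary before plotting.

```python
import matplotlib
matplotlib.use('Agg')  # headless backend so the sketch runs as a script; remove this line inside Jupyter
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data with the Part 1 distributions (columns: Amount, Time, Distance, NumTransactions)
rng = np.random.default_rng(42)
n_normal, n_fraud = 9800, 200
X = np.column_stack([
    np.concatenate([rng.normal(100, 50, n_normal), rng.normal(500, 200, n_fraud)]),
    np.concatenate([rng.uniform(0, 24, n_normal), rng.uniform(0, 6, n_fraud)]),
    np.concatenate([rng.normal(10, 5, n_normal), rng.normal(100, 50, n_fraud)]),
    np.concatenate([rng.integers(1, 10, n_normal), rng.integers(1, 3, n_fraud)]),
])
y = np.concatenate([np.zeros(n_normal, dtype=int), np.ones(n_fraud, dtype=int)])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

models = {
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(random_state=42, max_depth=10),
}

curves = {}
plt.figure(figsize=(12, 8))
for name, model in models.items():
    model.fit(X_train, y_train)
    y_proba = model.predict_proba(X_test)[:, 1]   # probability of the fraud class
    fpr, tpr, _ = roc_curve(y_test, y_proba)      # one (FPR, TPR) point per threshold
    auc = roc_auc_score(y_test, y_proba)
    curves[name] = (fpr, tpr, auc)
    plt.plot(fpr, tpr, label=f'{name} (AUC = {auc:.3f})')

plt.plot([0, 1], [0, 1], 'k--', label='Random Classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curves - Model Comparison')
plt.legend()
plt.grid(alpha=0.3)
plt.show()
```

With heavily imbalanced data, a precision-recall curve is often a useful companion to the ROC curve, since the false positive rate stays small even when many normal transactions are flagged.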

---

# 🎯 Practice Questions

1. Which model performs best for fraud detection? Why?
2. What is more important for fraud detection: precision or recall?
3. How does SMOTE help with imbalanced data?
4. What other techniques could handle class imbalance?
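
For question 2, it can help to see the trade-off directly. The sketch below uses `precision_recall_curve` to find the highest decision threshold that still reaches a target recall, then reports the precision paid for it. The 0.95 target and the stand-in data are assumptions chosen for illustration, not values from the lab.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Stand-in data with the Part 1 distributions (columns: Amount, Time, Distance, NumTransactions)
rng = np.random.default_rng(42)
n_normal, n_fraud = 9800, 200
X = np.column_stack([
    np.concatenate([rng.normal(100, 50, n_normal), rng.normal(500, 200, n_fraud)]),
    np.concatenate([rng.uniform(0, 24, n_normal), rng.uniform(0, 6, n_fraud)]),
    np.concatenate([rng.normal(10, 5, n_normal), rng.normal(100, 50, n_fraud)]),
    np.concatenate([rng.integers(1, 10, n_normal), rng.integers(1, 3, n_fraud)]),
])
y = np.concatenate([np.zeros(n_normal, dtype=int), np.ones(n_fraud, dtype=int)])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

model = LogisticRegression(random_state=42, max_iter=1000).fit(X_train, y_train)
y_proba = model.predict_proba(X_test)[:, 1]

# precision[i] and recall[i] describe predictions made with y_proba >= thresholds[i]
precision, recall, thresholds = precision_recall_curve(y_test, y_proba)

target_recall = 0.95  # hypothetical business requirement: catch at least 95% of fraud
meets_target = recall[:-1] >= target_recall
# Thresholds are increasing and recall is non-increasing, so the last match is the highest threshold
threshold = thresholds[meets_target][-1]

y_pred = (y_proba >= threshold).astype(int)
print(f'Chosen threshold: {threshold:.3f}')
print(f'Recall:    {recall_score(y_test, y_pred):.3f}')
print(f'Precision: {precision_score(y_test, y_pred, zero_division=0):.3f}')
```

Raising `target_recall` catches more fraud but generally flags more legitimate transactions; which side to favor depends on the relative cost of a missed fraud versus a false alarm.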

---

# 📝 Summary

- ✅ Identified and handled class imbalance
- ✅ Implemented SMOTE for oversampling
- ✅ Trained multiple classifiers
- ✅ Evaluated with appropriate metrics
- ✅ Analyzed ROC curves

**Great work on fraud detection! 🎉**