# Credit Card Fraud Detection

This notebook demonstrates how to build and evaluate machine learning models for credit card fraud detection. 

We'll explore:
1. Understanding the data and class imbalance
2. Building a baseline model
3. Handling class imbalance with SMOTE
4. Using class weights
5. Evaluating models with appropriate metrics


In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Import our custom modules
from model import FraudDetector
from utils import (
    calculate_class_distribution,
    preprocess_data,
    plot_roc_curves,
    plot_confusion_matrices,
    plot_feature_importance
)

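# Set some plotting parameters
# Matplotlib 3.6 renamed the 'seaborn' style to 'seaborn-v0_8'; use whichever name is installed
style_name = 'seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn'
plt.style.use(style_name)
sns.set_palette('viridis')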

# For reproducibility
np.random.seed(42)

## 1. Load and Explore the Data

For this example, we'll use synthetic data to mimic credit card transactions. In a real scenario, you would load your actual dataset here.

In [None]:
# Create a synthetic dataset
def create_synthetic_data(n_samples=10000, n_features=10, fraud_ratio=0.01):
    """
    Create a synthetic dataset to mimic credit card transactions.
    
    Parameters:
    - n_samples: Number of transactions
    - n_features: Number of features (will be named V1, V2, etc.)
    - fraud_ratio: Ratio of fraudulent transactions
    """
    # Create feature names
    feature_names = [f'V{i+1}' for i in range(n_features)]
    
    # Split the sample count so the two classes always sum to n_samples
    # (truncating both products with int() can drop a row to float rounding)
    n_fraud = int(round(n_samples * fraud_ratio))
    n_non_fraud = n_samples - n_fraud
    
    # Generate non-fraudulent transactions (standard normal: mean 0, std 1)
    non_fraud_data = np.random.normal(0, 1, size=(n_non_fraud, n_features))
    non_fraud_labels = np.zeros(n_non_fraud)
    
    # Generate fraudulent transactions (shifted mean -2, wider spread std 2)
    fraud_data = np.random.normal(-2, 2, size=(n_fraud, n_features))
    fraud_labels = np.ones(n_fraud)
    
    # Combine the data
    X = np.vstack([non_fraud_data, fraud_data])
    y = np.hstack([non_fraud_labels, fraud_labels])
    
    # Shuffle the data
    idx = np.random.permutation(len(y))
    X, y = X[idx], y[idx]
    
    # Create a DataFrame
    df = pd.DataFrame(X, columns=feature_names)
    df['Class'] = y.astype(int)
    
    return df

# Create the dataset
data = create_synthetic_data(n_samples=10000, n_features=10, fraud_ratio=0.01)

# Display the first few rows
data.head()

In [None]:
# Explore class distribution
class_stats = calculate_class_distribution(data['Class'])

# Display statistics
for key, value in class_stats.items():
    print(f"{key}: {value}")

# Plot the class distribution
plt.figure(figsize=(8, 6))
sns.countplot(x='Class', data=data)
plt.title('Class Distribution')
plt.xlabel('Class (0: Non-Fraud, 1: Fraud)')
plt.ylabel('Count')
plt.yscale('log')  # Log scale so the small fraud bar stays visible next to the majority class
plt.grid(True, alpha=0.3)
plt.show()
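
The implementation of `calculate_class_distribution` lives in `utils` and is not shown here. As a rough sketch of the kind of statistics such a helper might report, the hypothetical function `summarize_classes` below counts each class and derives the fraud ratio and the imbalance ratio. The function name and the dictionary keys are assumptions made for illustration.

```python
import numpy as np

def summarize_classes(labels):
    """Return counts and ratios for a binary 0/1 label array (hypothetical helper)."""
    labels = np.asarray(labels).astype(int)
    counts = np.bincount(labels, minlength=2)
    n_non_fraud, n_fraud = int(counts[0]), int(counts[1])
    total = n_non_fraud + n_fraud
    return {
        "n_non_fraud": n_non_fraud,
        "n_fraud": n_fraud,
        # Share of all transactions that are fraudulent
        "fraud_ratio": n_fraud / total,
        # Number of legitimate transactions per fraudulent one
        "imbalance_ratio": n_non_fraud / n_fraud if n_fraud else float("inf"),
    }

# Stand-in labels with the same 1% fraud ratio as the synthetic dataset
labels = np.array([0] * 9900 + [1] * 100)
stats = summarize_classes(labels)
for key, value in stats.items():
    print(f"{key}: {value}")
```

The imbalance ratio is a handy number to keep in mind later, because it gives a natural starting point for class weights.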

## 2. Preprocess the Data

Next, we'll prepare our data for modeling.

In [None]:
# Preprocess the data
X, y = preprocess_data(data, target_col='Class')

# Check the processed data
print("Features shape:", X.shape)
print("Target shape:", y.shape)
X.head()
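
What `preprocess_data` does internally is not shown in this notebook. One common preprocessing step for models such as logistic regression is feature standardization, so here is a minimal sketch of it, assuming NumPy arrays as input. The helper name `standardize` is hypothetical. Note that the statistics are estimated on the training rows only, so that no information from the test set leaks into the transformation.

```python
import numpy as np

def standardize(X_train, X_test):
    """Z-score both arrays using the mean and std of the training rows only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # Guard against constant columns
    return (X_train - mean) / std, (X_test - mean) / std

# Illustrative data, unrelated to the notebook's dataset
rng = np.random.default_rng(42)
X_train_demo = rng.normal(5.0, 3.0, size=(800, 4))
X_test_demo = rng.normal(5.0, 3.0, size=(200, 4))

X_train_scaled, X_test_scaled = standardize(X_train_demo, X_test_demo)
print("Train means:", X_train_scaled.mean(axis=0).round(3))
print("Train stds: ", X_train_scaled.std(axis=0).round(3))
```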

## 3. Train and Evaluate Models

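Now we'll train four models to detect fraud (a baseline, a SMOTE-resampled model, a class-weighted model, and logistic regression) and evaluate each on the same held-out test set.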
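
With only about 1% fraud, a purely random train/test split can leave the test set with very few positive cases. A stratified split avoids this by splitting each class separately. Whether `detector.prepare_data` stratifies is not shown here, so the sketch below is only an illustration of the idea. The helper name `stratified_split` and the 20% test size are assumptions.

```python
import numpy as np

def stratified_split(y, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) that keep each class's share roughly equal."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    train_parts, test_parts = [], []
    for cls in np.unique(y):
        # Shuffle the row indices of this class, then cut off the test share
        cls_idx = rng.permutation(np.flatnonzero(y == cls))
        n_test = int(round(len(cls_idx) * test_size))
        test_parts.append(cls_idx[:n_test])
        train_parts.append(cls_idx[n_test:])
    return np.concatenate(train_parts), np.concatenate(test_parts)

# Stand-in labels with a 1% fraud ratio
y_demo = np.array([0] * 9900 + [1] * 100)
train_idx, test_idx = stratified_split(y_demo, test_size=0.2)
print("Train fraud ratio:", y_demo[train_idx].mean())
print("Test fraud ratio: ", y_demo[test_idx].mean())
```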

In [None]:
# Initialize our fraud detector
detector = FraudDetector(random_state=42)

# Split the data
X_train, X_test, y_train, y_test = detector.prepare_data(X, y)

# Train the baseline model
baseline_model = detector.train_baseline_model(X_train, y_train)

# Train model with SMOTE
smote_model = detector.train_smote_model(X_train, y_train)

# Train model with class weights
weighted_model = detector.train_weighted_model(X_train, y_train, weight_ratio=100)

# Train logistic regression model
logistic_model = detector.train_logistic_model(X_train, y_train)
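
To make the two rebalancing strategies above more concrete, here is a simplified NumPy sketch of the ideas behind them. It is not the code inside `FraudDetector`, and `smote_sketch` is a stripped-down illustration rather than the imbalanced-learn implementation: it creates each synthetic fraud row at a random point on the line segment between a real fraud row and one of its `k` nearest fraud neighbours. `balanced_class_weights` applies the common "balanced" heuristic, `n_samples / (n_classes * class_count)`. Both function names are hypothetical.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=42):
    """Create n_new synthetic minority rows by interpolating between neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority samples
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # A sample is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)   # Random starting rows
    choice = rng.integers(0, k, size=n_new)          # Which neighbour to use
    picked = neighbours[base, choice]
    gap = rng.random((n_new, 1))                     # Position along the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = np.bincount(y)
    return len(y) / (len(counts) * counts)

# Stand-in minority rows drawn like the synthetic fraud transactions
rng = np.random.default_rng(0)
X_min_demo = rng.normal(-2, 2, size=(80, 10))
X_new = smote_sketch(X_min_demo, n_new=500)
print("Synthetic rows:", X_new.shape)

y_demo = np.array([0] * 9900 + [1] * 100)
weights = balanced_class_weights(y_demo)
print("Class weights:", weights)
print("Fraud-to-legitimate weight ratio:", weights[1] / weights[0])
```

For a 1% fraud ratio the balanced heuristic yields a weight ratio of 99, which is close to the `weight_ratio=100` passed to `train_weighted_model`. Resampling and weighting should only ever be applied to the training data; the test set must keep its original class distribution.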

In [None]:
# Evaluate all models
results = {}
for name, model in detector.models.items():
    results[name] = detector.evaluate_model(model, X_test, y_test)
    print(f"\n--- {name.upper()} MODEL EVALUATION ---")
    print(f"AUC: {results[name]['auc']:.4f}")
    print("\nConfusion Matrix:")
    print(results[name]['confusion_matrix'])
    print("\nClassification Report:")
    for cls in ['0', '1']:
        cr = results[name]['classification_report'][cls]
        print(f"Class {cls} - Precision: {cr['precision']:.4f}, "
              f"Recall: {cr['recall']:.4f}, "
              f"F1: {cr['f1-score']:.4f}")
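
Accuracy is a poor guide on data like this: with 1% fraud, a model that never predicts fraud is 99% accurate and catches nothing. The metrics printed above avoid that trap. As a sketch of how they are defined, the hypothetical helpers below compute precision, recall, and F1 from the confusion-matrix counts, and AUC as the probability that a randomly chosen fraud case receives a higher score than a randomly chosen legitimate one. The toy labels and scores are made up for illustration.

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Precision, recall, and F1 for the positive (fraud) class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

def roc_auc(y_true, scores):
    """AUC via pairwise comparison of positive and negative scores."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    # Compare every positive with every negative; ties count as half a win
    wins = np.sum(pos[:, None] > neg[None, :])
    ties = np.sum(pos[:, None] == neg[None, :])
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

# Toy example: 7 legitimate and 3 fraudulent transactions
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
scores_demo = np.array([0.1, 0.2, 0.1, 0.3, 0.6, 0.2, 0.4, 0.9, 0.7, 0.35])
y_pred_demo = (scores_demo >= 0.5).astype(int)  # Default 0.5 decision threshold

metrics_demo = binary_metrics(y_true_demo, y_pred_demo)
auc_demo = roc_auc(y_true_demo, scores_demo)
print(metrics_demo)
print(f"AUC: {auc_demo:.4f}")
```

The pairwise comparison is quadratic in the number of samples, so it is only meant to show the definition; library implementations use a sort-based approach instead.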

## 4. Visualize Results

Let's create some visualizations to better understand our models' performance.

In [None]:
# Plot ROC curves
plot_roc_curves(results)
plt.show()

# Plot confusion matrices
plot_confusion_matrices(results)
plt.show()

# Plot normalized confusion matrices
plot_confusion_matrices(results, normalize=True)
plt.show()
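
On imbalanced data the raw confusion matrix is dominated by the legitimate class, which is why a normalized version is useful. The sketch below assumes that `normalize=True` divides each row by its total, and that rows are true classes and columns are predicted classes (the scikit-learn convention). Under those assumptions, the diagonal of the normalized matrix is the per-class recall. The helper name `normalize_rows` and the counts are illustrative, not results from this notebook.

```python
import numpy as np

def normalize_rows(cm):
    """Divide each row of a confusion matrix by its row total."""
    cm = np.asarray(cm, dtype=float)
    row_sums = cm.sum(axis=1, keepdims=True)
    # Leave rows with no samples as zeros instead of dividing by zero
    return np.divide(cm, row_sums, out=np.zeros_like(cm), where=row_sums != 0)

# Illustrative counts: rows are true classes, columns are predicted classes
cm_demo = np.array([[1970, 10],
                    [4, 16]])
cm_norm = normalize_rows(cm_demo)
print(cm_norm.round(4))
```

In this example the raw counts look excellent overall, yet the normalized second row shows that one in five fraud cases is missed.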

In [None]:
# Plot feature importance for baseline model
importance = detector.get_feature_importance('baseline')
plt_fig, feat_df = plot_feature_importance(X.columns, importance)
plt.show()

# Display top features
feat_df.head(10)

## 5. Conclusion

This notebook demonstrated several key concepts in fraud detection:

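1. **Understanding class imbalance**: Fraud is a rare event (1% of transactions in our synthetic data), which creates challenges for standard ML approaches: a model can reach 99% accuracy without detecting a single fraudulent transaction.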

2. **Handling imbalanced data**: We explored multiple techniques:
   - Baseline model (no special handling)
   - SMOTE for synthetic minority oversampling
   - Class weighting
   - Alternative algorithms (logistic regression)

3. **Appropriate evaluation**: We used metrics beyond accuracy:
   - ROC curves and AUC
   - Precision and recall 
   - Confusion matrices

In a real-world scenario, additional steps would include:
- Hyperparameter tuning
- More advanced feature engineering
- Deployment considerations
- Monitoring for model drift

For a production fraud detection system, you would likely combine multiple models with business rules rather than rely on a single classifier.