# SageMaker ML Workshop: Credit Card Fraud Detection with XGBoost

## Workshop Overview

Welcome to this hands-on machine learning workshop! You'll learn to build a **real-world fintech fraud detection system** using **XGBoost**, a gradient boosting library that is widely used for fraud detection on tabular transaction data.

### What You'll Build:

A machine learning model that can:
- ✅ Detect fraudulent credit card transactions in real-time
- ✅ Minimize false positives (legitimate transactions marked as fraud)
- ✅ Maximize fraud catch rate (true positives)
- ✅ Handle highly imbalanced data (fraud is rare!)

### What You'll Learn:

1. **Environment Setup** - Install XGBoost and supporting data science libraries
2. **Data Acquisition** - Download and load credit card transaction data
3. **Data Exploration** - Deep dive into fraud patterns
4. **Data Preprocessing** - Handle imbalanced classes
5. **Model Training** - Build XGBoost classifier with extensive tuning
6. **Model Evaluation** - Assess performance with fintech-specific metrics
7. **Feature Importance** - Understand fraud indicators
8. **Real-time Scoring** - Deploy for transaction screening
9. **Production Deployment** - Build production-ready inference code

### Real-World Application:

This same approach is used by:
- **Credit card companies**: Real-time fraud detection
- **Financial institutions**: Transaction monitoring and AML (Anti-Money Laundering)
- **Payment processors**: Risk scoring for transactions
- **Fintech startups**: User behavior analysis

### Dataset:

We'll use the **Credit Card Fraud Detection Dataset** from Kaggle, which contains:
- **284,807 transactions** from European cardholders (September 2013)
- **492 frauds** (0.172% - highly imbalanced!)
- **Anonymized features** (PCA-transformed for privacy)
- **Real-world challenges**: Class imbalance, feature engineering
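
The imbalance figures above imply that plain accuracy is a poor yardstick for this problem. As a quick check, the arithmetic below (a sketch that uses only the counts quoted above) shows what a hypothetical model that labels every transaction as legitimate would score.

```python
# Counts quoted in the dataset description above
n_total = 284_807
n_fraud = 492
n_legit = n_total - n_fraud

# Share of fraud in the data
fraud_rate = n_fraud / n_total

# Hypothetical "always predict legitimate" baseline
baseline_accuracy = n_legit / n_total   # every legitimate row counts as correct
baseline_recall = 0 / n_fraud           # no fraud is ever flagged

print(f"Fraud rate:        {fraud_rate:.3%}")
print(f"Baseline accuracy: {baseline_accuracy:.3%}")
print(f"Baseline recall:   {baseline_recall:.0%}")
```

The baseline is right on nearly every transaction and still catches no fraud, which is why later sections lean on recall, precision, and the precision-recall curve instead of accuracy.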

---

## Section 1: Understanding XGBoost for Fraud Detection

### What is XGBoost?

**XGBoost** stands for **eXtreme Gradient Boosting**. It is one of the most widely used algorithms for fraud detection on tabular data because:

#### Why Financial Institutions Choose XGBoost:

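1. **High Accuracy**: Often among the strongest performers on tabular data such as transaction records
2. **Low False Positives**: With a well-chosen decision threshold, keeps the number of blocked legitimate transactions small
3. **Fast Inference**: Makes decisions in milliseconds
4. **Handles Imbalance**: Supports class weighting, so it can learn even when fraud is around 0.1% of transactions
5. **Interpretable**: Feature importance scores help explain why a transaction was flagged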
6. **Production-Ready**: Scales to millions of transactions
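
One common way to configure XGBoost for rare positives is the `scale_pos_weight` parameter, which up-weights the fraud class. A usual starting point is the ratio of negative to positive examples. The sketch below only computes that ratio from the dataset counts used in this workshop and places it in a parameter dictionary. The `max_depth` and `learning_rate` values are illustrative assumptions, not tuned settings.

```python
# Class counts for the full dataset used in this workshop
n_total = 284_807
n_fraud = 492
n_legit = n_total - n_fraud

# Ratio of negative to positive examples
scale_pos_weight = n_legit / n_fraud

# Parameter dictionary in the form XGBoost accepts
params = {
    "objective": "binary:logistic",
    "eval_metric": "aucpr",           # area under the precision-recall curve
    "scale_pos_weight": scale_pos_weight,
    "max_depth": 4,                   # illustrative assumption
    "learning_rate": 0.1,             # illustrative assumption
}

print(f"scale_pos_weight = {scale_pos_weight:.1f}")
```

In practice the ratio would be computed on the training split only, so that the test set stays untouched.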

#### How Does XGBoost Work? (Financial Example)

Imagine you're a fraud analyst:

**Traditional Approach** (Single Rule):
- "Flag transactions over $1,000 from new locations"
- **Problem**: Misses sophisticated fraud, blocks legitimate travel purchases

**XGBoost Approach** (Ensemble of 100+ "Expert Analysts"):

1. **Analyst 1**: "This looks fraudulent because amount is unusual for this merchant"
2. **Analyst 2**: "Actually, let me check - this customer travels frequently, so new location is normal"
3. **Analyst 3**: "But wait - the transaction velocity is suspicious (10 transactions in 5 minutes)"
4. **Analyst 4**: "And the device fingerprint doesn't match customer's usual devices"
5. **Continue for 100+ analysts...**

**Final Decision**: Adds up the contributions of all analysts, with each contribution scaled down by a learning rate so that no single analyst dominates

#### Technical: How Boosting Works

```
Step 1: Build Tree 1 → Catches obvious fraud (60% accuracy)
        ↓
Step 2: Find what Tree 1 missed → Build Tree 2 to catch those (70% accuracy)
        ↓
Step 3: Find what Trees 1+2 missed → Build Tree 3 (80% accuracy)
        ↓
Continue for 100-1000 trees...
        ↓
Final Model: Combines all trees → 95%+ accuracy
```

Each tree **learns from the mistakes** of previous trees! The percentages in the diagram are illustrative, not measured results.
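
To make the "learn from the mistakes" idea concrete, here is a minimal from-scratch sketch of gradient boosting on a one-feature toy problem. It is not how XGBoost is implemented: XGBoost also uses second-order gradients, regularization, and multi-feature trees. The toy data, the single-split "stumps", and all function names are assumptions made for illustration.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the single threshold split on 1-D x that best fits residual (least squares)."""
    best = None
    for threshold in np.unique(x)[:-1]:          # skip the max so the right side is never empty
        left = x <= threshold
        left_value = residual[left].mean()
        right_value = residual[~left].mean()
        pred = np.where(left, left_value, right_value)
        error = np.sum((residual - pred) ** 2)
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y, p):
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Toy data: "fraud" happens only in a narrow band of the feature
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=400)
y = ((x > 3) & (x < 5)).astype(float)

learning_rate = 0.5
n_trees = 30
raw_score = np.zeros_like(x)      # log-odds; 0 means probability 0.5
losses = []

for _ in range(n_trees):
    prob = sigmoid(raw_score)
    residual = y - prob           # what the current ensemble still gets wrong
    stump = fit_stump(x, residual)
    raw_score += learning_rate * predict_stump(stump, x)
    losses.append(log_loss(y, sigmoid(raw_score)))

print(f"log loss after 1 tree:   {losses[0]:.4f}")
print(f"log loss after {n_trees} trees: {losses[-1]:.4f}")
```

Each new stump is fitted to the residuals of the ensemble so far, and the training loss falls as stumps are added. No single stump can describe the band between 3 and 5, but the sum of several can.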

#### XGBoost vs Other Algorithms (Fraud Detection Context):

| Algorithm | Fraud Detection Accuracy | False Positive Rate | Inference Speed | Imbalance Handling |
|-----------|-------------------------|---------------------|-----------------|--------------------|
| **XGBoost** | ⭐⭐⭐⭐⭐ (95-99%) | ⭐⭐⭐⭐⭐ Very Low | ⭐⭐⭐⭐ Fast | ⭐⭐⭐⭐⭐ Excellent |
| Random Forest | ⭐⭐⭐⭐ (90-95%) | ⭐⭐⭐⭐ Low | ⭐⭐⭐ Moderate | ⭐⭐⭐⭐ Good |
| Logistic Regression | ⭐⭐⭐ (75-85%) | ⭐⭐⭐ Moderate | ⭐⭐⭐⭐⭐ Very Fast | ⭐⭐ Poor |
| Neural Networks | ⭐⭐⭐⭐ (90-95%) | ⭐⭐⭐ Moderate | ⭐⭐ Slow | ⭐⭐⭐ Fair |
| Rule-Based | ⭐⭐ (60-70%) | ⭐ High | ⭐⭐⭐⭐⭐ Very Fast | ⭐ Very Poor |

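**Winner for Fraud Detection**: In this comparison, XGBoost offers the best balance of accuracy and production-ready performance. Treat the star ratings and percentage ranges as rough rules of thumb; actual results depend on the dataset and tuning.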

---

## Section 2: Data Download Instructions

### Option 1: Download from Kaggle (Recommended)

**Dataset**: Credit Card Fraud Detection
**URL**: https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud
**Size**: About 150 MB as `creditcard.csv` (the zip download is smaller)

#### Steps to Download:

**Method A: Using Kaggle Website (Easiest)**

1. **Create Kaggle Account** (if you don't have one)
   - Go to https://www.kaggle.com/
   - Click "Register" and sign up (it's free!)

2. **Navigate to Dataset**
   - Visit: https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud
   - Click the blue "Download" button
   - File will download as `creditcardfraud.zip`

3. **Upload to SageMaker**
   - Extract the zip file (contains `creditcard.csv`)
   - In SageMaker Studio/Notebook:
     - Click the upload button (⬆️ icon)
     - Select `creditcard.csv`
     - Wait for upload to complete

**Method B: Using Kaggle API (For Advanced Users)**

```bash
# 1. Install Kaggle API
pip install kaggle

# 2. Get your Kaggle API credentials
# - Go to https://www.kaggle.com/account
# - Scroll to "API" section
# - Click "Create New API Token"
# - This downloads kaggle.json

# 3. Set up credentials
mkdir -p ~/.kaggle
mv kaggle.json ~/.kaggle/
chmod 600 ~/.kaggle/kaggle.json

# 4. Download dataset
kaggle datasets download -d mlg-ulb/creditcardfraud

# 5. Unzip
unzip creditcardfraud.zip
```

### Option 2: Use Sample Dataset (For Quick Start)

If you want to get started immediately, we can create a smaller synthetic dataset for practice:

```python
# We'll provide code below to generate sample data
# This is useful for testing but use real data for production!
```
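
As a possible version of that generator, the sketch below builds a small random dataset with the same column layout as `creditcard.csv`. Everything about it is an assumption for practice purposes: the function name, the row count, the fraud rate, and the way fraud rows are shifted in the first five `V` columns so that a model has some signal to find. It does not reproduce the statistics of the real data.

```python
import numpy as np
import pandas as pd

def make_synthetic_transactions(n_rows=10_000, fraud_rate=0.002, seed=42):
    """Generate random data with the same columns as creditcard.csv (hypothetical helper)."""
    rng = np.random.default_rng(seed)

    # Rare positive class
    is_fraud = rng.random(n_rows) < fraud_rate

    # Time: seconds since the first transaction, spread over two days
    data = {"Time": np.sort(rng.uniform(0, 172_800, n_rows)).round()}

    # V1..V28: standard normal noise; shift a few columns for fraud rows
    for i in range(1, 29):
        values = rng.normal(0.0, 1.0, n_rows)
        if i <= 5:
            values[is_fraud] += 2.0
        data[f"V{i}"] = values

    # Amount: right-skewed, non-negative
    data["Amount"] = rng.lognormal(mean=3.0, sigma=1.2, size=n_rows).round(2)
    data["Class"] = is_fraud.astype(int)
    return pd.DataFrame(data)

sample_df = make_synthetic_transactions()
print(sample_df.shape)
print(sample_df["Class"].value_counts())
```

To use it in place of the real file, you could write it out with `sample_df.to_csv('creditcard.csv', index=False)`. Results on this data say nothing about performance on real transactions.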

### Option 3: AWS S3 (For Production)

```bash
# Download from S3 if your organization has the data there
aws s3 cp s3://your-bucket/creditcard.csv ./creditcard.csv
```

### Dataset Information:

Once downloaded, `creditcard.csv` contains:
- **Rows**: 284,807 transactions
- **Columns**: 31 (30 features plus the target)
  - `Time`: Seconds elapsed between transaction and first transaction
  - `V1-V28`: Anonymized features (PCA components)
  - `Amount`: Transaction amount
  - `Class`: Target variable (0 = legitimate, 1 = fraud)

**Privacy Note**: Features V1-V28 are PCA-transformed to protect customer privacy while maintaining fraud patterns.
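
A quick way to confirm that a file matches this layout is to compare its header against the expected column names. The helper below is a small sketch; `EXPECTED_COLUMNS` and `check_columns` are names introduced here for illustration.

```python
# Expected layout: Time, V1..V28, Amount, Class
EXPECTED_COLUMNS = ["Time"] + [f"V{i}" for i in range(1, 29)] + ["Amount", "Class"]

def check_columns(columns):
    """Return (missing, unexpected) column names relative to the expected layout."""
    columns = list(columns)
    missing = [c for c in EXPECTED_COLUMNS if c not in columns]
    unexpected = [c for c in columns if c not in EXPECTED_COLUMNS]
    return missing, unexpected

# Example: a header that lacks the target column
header = EXPECTED_COLUMNS[:-1]
missing, unexpected = check_columns(header)
print(len(EXPECTED_COLUMNS), missing, unexpected)   # 31 ['Class'] []
```

Once the data is loaded into a DataFrame, the same check can be run as `check_columns(df.columns)`.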

### Expected File Location:

After upload, your file should be at:
```
/home/sagemaker-user/creditcard.csv
```

Or wherever you saved it in your SageMaker environment.

---

**✅ Before proceeding**: Make sure you have `creditcard.csv` ready!

---

## Section 3: Environment Setup

### Step 3.1: Install Required Libraries

First, let's set up our Python environment with all necessary packages.

**What each library does:**
- **xgboost**: The main XGBoost library for fraud detection models
- **pandas**: For working with transaction data (like Excel for Python)
- **numpy**: For numerical operations and array handling
- **scikit-learn**: For data preprocessing and evaluation metrics
- **matplotlib & seaborn**: For creating visualizations
- **imbalanced-learn**: Special tools for handling fraud (rare event) detection

In [None]:
# Install all required packages
# The '--quiet' flag suppresses installation messages
!pip install xgboost scikit-learn pandas numpy matplotlib seaborn imbalanced-learn --quiet

print("✅ All libraries installed successfully!")
print("\n📦 Installed packages:")
print("   • XGBoost: Gradient boosting framework")
print("   • Scikit-learn: Machine learning utilities")
print("   • Pandas: Data manipulation")
print("   • NumPy: Numerical computing")
print("   • Matplotlib & Seaborn: Data visualization")
print("   • Imbalanced-learn: Tools for imbalanced datasets")

### Step 3.2: Import Libraries

Now we'll import all the tools we need. Think of this as laying out all your tools before starting a project.

**Import Organization:**
1. **Core libraries**: pandas, numpy (data handling)
2. **Machine learning**: XGBoost, scikit-learn (model building)
3. **Visualization**: matplotlib, seaborn (charts and graphs)
4. **Utilities**: warnings, time (helper functions)

In [None]:
# =============================================================================
# CORE DATA MANIPULATION LIBRARIES
# =============================================================================
import pandas as pd                    # For working with transaction data tables
import numpy as np                     # For numerical operations
import warnings                        # To suppress unnecessary warnings
warnings.filterwarnings('ignore')      # Keep output clean

# =============================================================================
# MACHINE LEARNING LIBRARIES
# =============================================================================
import xgboost as xgb                  # The star of our show!

# Scikit-learn: Data preprocessing
from sklearn.model_selection import (
    train_test_split,                  # Split data into train/test sets
    cross_val_score,                   # Cross-validation
    StratifiedKFold                    # For imbalanced data splitting
)
from sklearn.preprocessing import StandardScaler  # Feature scaling

# Scikit-learn: Model evaluation metrics
from sklearn.metrics import (
    accuracy_score,                    # Overall correctness
    precision_score,                   # Precision: TP / (TP + FP)
    recall_score,                      # Recall: TP / (TP + FN)
    f1_score,                          # F1: Harmonic mean of precision & recall
    roc_auc_score,                     # Area under ROC curve
    average_precision_score,           # Area under PR curve (better for imbalanced data)
    confusion_matrix,                  # Breakdown of predictions
    classification_report,             # Comprehensive report
    roc_curve,                         # For ROC curve plotting
    precision_recall_curve,            # For PR curve plotting
    matthews_corrcoef                  # MCC: Good for imbalanced data
)

# Imbalanced-learn: Special tools for fraud detection
from imblearn.over_sampling import SMOTE  # Synthetic minority oversampling
from imblearn.under_sampling import RandomUnderSampler  # Undersample majority

# =============================================================================
# VISUALIZATION LIBRARIES
# =============================================================================
import matplotlib.pyplot as plt        # Basic plotting
import seaborn as sns                  # Beautiful statistical plots

# Configure visualization settings for professional-looking plots
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
sns.set_context("notebook", font_scale=1.1)
%matplotlib inline

# =============================================================================
# PANDAS DISPLAY SETTINGS
# =============================================================================
pd.set_option('display.max_columns', None)     # Show all columns
pd.set_option('display.max_rows', 100)         # Show up to 100 rows
pd.set_option('display.precision', 4)          # 4 decimal places
pd.set_option('display.float_format', '{:.4f}'.format)  # Consistent formatting

# =============================================================================
# REPRODUCIBILITY
# =============================================================================
# Set random seeds for reproducible results
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)

# =============================================================================
# IMPORT VERIFICATION
# =============================================================================
print("✅ All libraries imported successfully!")
print("\n📊 Library Versions:")
print(f"   • XGBoost: {xgb.__version__}")
print(f"   • Pandas: {pd.__version__}")
print(f"   • NumPy: {np.__version__}")
print(f"\n🎲 Random seed set to: {RANDOM_SEED}")
print("   → Reuse it as random_state in splits and models for reproducible results")

print("\n🚀 Ready to build fraud detection models!")

### Step 3.3: Create Helper Functions

Let's create some utility functions we'll use throughout the notebook. These will help us:
- Print formatted outputs
- Calculate custom metrics
- Create consistent visualizations

**Why create helper functions?**
- **DRY Principle**: Don't Repeat Yourself
- **Readability**: Cleaner code
- **Reusability**: Use across multiple projects
- **Maintainability**: Fix bugs in one place

In [None]:
def print_section_header(title, emoji="📊"):
    """
    Print a formatted section header for better readability.
    
    Args:
        title (str): Section title
        emoji (str): Emoji to display (default: 📊)
    """
    print("\n" + "="*80)
    print(f"{emoji} {title}")
    print("="*80)


def calculate_financial_metrics(y_true, y_pred, transaction_amounts=None):
    """
    Calculate business-relevant metrics for fraud detection.
    
    In fraud detection, it's not just about accuracy - we care about:
    - How much fraud $ we catch
    - How many legitimate customers we inconvenience
    - The total financial impact
    
    Args:
        y_true: Actual labels (0=legitimate, 1=fraud)
        y_pred: Predicted labels
        transaction_amounts: Dollar amounts (optional)
    
    Returns:
        dict: Dictionary of financial metrics
    """
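    from sklearn.metrics import confusion_matrix
    
    # Convert inputs to arrays so the boolean masks below also work for
    # lists and for pandas Series with a non-default index
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    
    # labels=[0, 1] keeps the matrix 2x2 even if one class is absent
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()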
    
    metrics = {
        'true_negatives': int(tn),      # Correctly identified legitimate
        'false_positives': int(fp),     # Legitimate flagged as fraud (BAD for UX)
        'false_negatives': int(fn),     # Fraud missed (BAD for losses)
        'true_positives': int(tp),      # Correctly identified fraud
        'fraud_catch_rate': tp / (tp + fn) if (tp + fn) > 0 else 0,  # Recall
        'precision': tp / (tp + fp) if (tp + fp) > 0 else 0,
        'customer_friction_rate': fp / (fp + tn) if (fp + tn) > 0 else 0  # FPR
    }
    
    # Calculate financial impact if amounts provided
    if transaction_amounts is not None:
        amounts = np.array(transaction_amounts)
        
        # Money saved by catching fraud
        fraud_caught_amount = amounts[(y_true == 1) & (y_pred == 1)].sum()
        
        # Money lost to missed fraud
        fraud_missed_amount = amounts[(y_true == 1) & (y_pred == 0)].sum()
        
        # Volume of legitimate transactions incorrectly blocked
        false_positive_amount = amounts[(y_true == 0) & (y_pred == 1)].sum()
        
        metrics['fraud_caught_$'] = fraud_caught_amount
        metrics['fraud_missed_$'] = fraud_missed_amount
        metrics['false_positive_$'] = false_positive_amount
        metrics['fraud_prevention_rate'] = (fraud_caught_amount / 
                                           (fraud_caught_amount + fraud_missed_amount) 
                                           if (fraud_caught_amount + fraud_missed_amount) > 0 else 0)
    
    return metrics


def plot_metric_comparison(metrics_dict, title="Model Performance Comparison"):
    """
    Create a bar plot comparing different metrics.
    
    Args:
        metrics_dict (dict): Dictionary of metric names and values
        title (str): Plot title
    """
    fig, ax = plt.subplots(figsize=(10, 6))
    
    metrics = list(metrics_dict.keys())
    values = list(metrics_dict.values())
    
    colors = plt.cm.viridis(np.linspace(0.3, 0.9, len(metrics)))
    bars = ax.barh(metrics, values, color=colors, alpha=0.8)
    
    ax.set_xlabel('Score', fontsize=12)
    ax.set_title(title, fontsize=14, fontweight='bold')
    ax.set_xlim([0, 1.0])
    ax.grid(axis='x', alpha=0.3)
    
    # Add value labels
    for bar, value in zip(bars, values):
        width = bar.get_width()
        ax.text(width, bar.get_y() + bar.get_height()/2, 
                f' {value:.4f}', 
                ha='left', va='center', fontsize=10, fontweight='bold')
    
    plt.tight_layout()
    plt.show()


print("✅ Helper functions created!")
print("\n📋 Available functions:")
print("   • print_section_header(): Format section titles")
print("   • calculate_financial_metrics(): Business-focused metrics")
print("   • plot_metric_comparison(): Visualize performance")
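
To make the confusion-matrix metrics in `calculate_financial_metrics()` concrete, here is a worked example with made-up counts. The numbers are hypothetical and are not results from this dataset.

```python
# Hypothetical confusion-matrix counts for 10,000 scored transactions
tp, fn = 40, 10        # fraud caught / fraud missed
fp, tn = 50, 9_900     # legitimate flagged / legitimate passed

fraud_catch_rate = tp / (tp + fn)            # recall
precision = tp / (tp + fp)                   # share of alerts that are real fraud
customer_friction_rate = fp / (fp + tn)      # false positive rate

print(f"Fraud catch rate:       {fraud_catch_rate:.1%}")        # 80.0%
print(f"Precision:              {precision:.1%}")               # 44.4%
print(f"Customer friction rate: {customer_friction_rate:.2%}")  # 0.50%
```

Even with a friction rate of half a percent, more than half of the alerts in this example are false alarms, because legitimate transactions vastly outnumber fraud.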

---

## Section 4: Data Loading

### Step 4.1: Load the Credit Card Fraud Dataset

Now we'll load our credit card transaction data. This dataset is from Kaggle and contains real (anonymized) credit card transactions.

**What to expect:**
- **Size**: ~150 MB, 284,807 transactions
- **Time period**: 2 days of transactions (September 2013)
- **Features**: 30 features + 1 target variable
- **Challenge**: Highly imbalanced (only 0.172% fraud)

**Important**: If you haven't downloaded the data yet, go back to Section 2 for instructions!

In [None]:
print_section_header("Loading Credit Card Fraud Dataset", "💳")

# Define the path to your data file
# Adjust this path based on where you saved the file
DATA_PATH = 'creditcard.csv'  # Change this if your file is elsewhere

print(f"\n📂 Loading data from: {DATA_PATH}")
print("⏳ This may take 30-60 seconds for a large dataset...\n")

try:
    # Load the data
    # Time is a numeric offset in seconds, so no date parsing is needed
    df = pd.read_csv(DATA_PATH)
    
    print("✅ Dataset loaded successfully!")
    print("\n📊 Dataset Overview:")
    print(f"   • Total transactions: {len(df):,}")
    print(f"   • Number of features: {len(df.columns) - 1}  (excluding target 'Class')")
    print(f"   • Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
    print(f"   • Time period: {df['Time'].max() / 3600:.1f} hours")
    
    # Quick fraud statistics
    n_fraud = (df['Class'] == 1).sum()
    n_legit = (df['Class'] == 0).sum()
    fraud_pct = (n_fraud / len(df)) * 100
    
    print(f"\n🚨 Fraud Statistics:")
    print(f"   • Legitimate transactions: {n_legit:,} ({100-fraud_pct:.3f}%)")
    print(f"   • Fraudulent transactions: {n_fraud:,} ({fraud_pct:.3f}%)")
    print(f"   • Imbalance ratio: {n_legit/n_fraud:.1f}:1")
    print(f"   → This is a HIGHLY IMBALANCED dataset!")
    
except FileNotFoundError:
    print("❌ Error: Data file not found!")
    print("\n📝 Please follow these steps:")
    print("   1. Download from: https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud")
    print("   2. Extract creditcard.csv from the zip file")
    print("   3. Upload to this notebook's directory")
    print("   4. Update DATA_PATH variable if needed")
    print("\n   Or use the Kaggle API instructions from Section 2")
    raise

except Exception as e:
    print(f"❌ Unexpected error loading data: {str(e)}")
    raise
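
If memory is tight in your notebook instance, one option is to load the numeric columns as 32-bit floats. The sketch below demonstrates the idea on a tiny in-memory CSV so it runs without the real file. The column names are a subset of the dataset's, and the sample rows are made up.

```python
import io
import numpy as np
import pandas as pd

# Tiny stand-in for creditcard.csv (made-up values, only two V columns)
csv_text = """Time,V1,V2,Amount,Class
0,-1.36,0.07,149.62,0
1,1.19,0.27,2.69,0
406,-2.31,1.95,0.00,1
"""

# Load once with pandas defaults (64-bit columns)
df_default = pd.read_csv(io.StringIO(csv_text))

# Load again with explicit narrower dtypes
feature_cols = ["Time", "V1", "V2", "Amount"]
dtypes = {col: np.float32 for col in feature_cols}
dtypes["Class"] = np.int8
df_small = pd.read_csv(io.StringIO(csv_text), dtype=dtypes)

print(df_default.memory_usage(index=False).sum(), "bytes (default dtypes)")
print(df_small.memory_usage(index=False).sum(), "bytes (float32/int8)")
```

For the real file, the `dtypes` dictionary would cover `Time`, `V1` through `V28`, and `Amount`. Note that `float32` keeps only about seven significant digits, so check that this precision is acceptable before adopting it.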

### Step 4.2: Initial Data Inspection

Let's take our first look at the data. This is like opening a package to see what's inside!

**What we're looking for:**
- Column names and types
- Sample transactions
- Data types (numeric, categorical, etc.)
- Any obvious issues

In [None]:
print_section_header("Data Inspection", "🔍")

print("\n📋 First 5 Transactions:")
print(df.head())

print("\n\n📋 Last 5 Transactions:")
print(df.tail())

print("\n\n📊 Dataset Information:")
df.info()  # info() prints its report directly and returns None

print("\n\n📈 Statistical Summary:")
print(df.describe())

print("\n\n💡 Column Explanation:")
print("="*80)
print("   • Time: Seconds since first transaction (for sequence analysis)")
print("   • V1-V28: Principal components from PCA transformation")
print("             → Original features anonymized for privacy")
print("             → Still capture important fraud patterns")
print("   • Amount: Transaction amount in Euros (€)")
print("   • Class: Target variable (0 = Legitimate, 1 = Fraud)")
print("\n   ⚠️  Note: We can't know what V1-V28 originally represented")
print("      (could be merchant category, location, time of day, etc.)")
print("      but they still contain the fraud patterns we need!")

### Step 4.3: Check for Data Quality Issues

Before building any model, we MUST check data quality. In fintech, bad data = bad decisions = lost money!

**Common data quality issues:**
1. **Missing values**: Incomplete records
2. **Duplicates**: Same transaction recorded twice
3. **Outliers**: Unusual values that might be errors
4. **Data types**: Wrong format (e.g., numbers stored as text)
5. **Impossible values**: Negative amounts, future dates, etc.

In [None]:
print_section_header("Data Quality Assessment", "🔬")

# 1. Check for missing values
print("\n1️⃣  Missing Values Check:")
print("-" * 80)
missing_values = df.isnull().sum()
total_missing = missing_values.sum()

if total_missing == 0:
    print("✅ No missing values found!")
    print("   → Dataset is complete - excellent for model training")
else:
    print(f"⚠️  Found {total_missing} missing values:")
    print(missing_values[missing_values > 0])
    print("\n   → We'll need to handle these before training")

# 2. Check for duplicate transactions
print("\n\n2️⃣  Duplicate Transactions Check:")
print("-" * 80)
n_duplicates = df.duplicated().sum()

if n_duplicates == 0:
    print("✅ No duplicate transactions found!")
else:
    print(f"⚠️  Found {n_duplicates} duplicate transactions")
    print("   → These might be legitimate (e.g., subscription payments)")
    print("   → Or data collection errors")
    print("   → We'll investigate further")

# 3. Check data types
print("\n\n3️⃣  Data Types Check:")
print("-" * 80)
print(df.dtypes.value_counts())
print("\n   Expected: All numeric (float64 or int64)")

if (df.dtypes == 'object').any():
    print("\n⚠️  Warning: Found non-numeric columns:")
    print(df.select_dtypes(include='object').columns.tolist())
else:
    print("   ✅ All columns are numeric - ready for ML!")

# 4. Check for impossible values
print("\n\n4️⃣  Business Logic Validation:")
print("-" * 80)

# Check for negative amounts (shouldn't exist in transactions)
negative_amounts = (df['Amount'] < 0).sum()
if negative_amounts > 0:
    print(f"⚠️  Found {negative_amounts} transactions with negative amounts!")
else:
    print("✅ All transaction amounts are non-negative")

# Check for zero amounts
zero_amounts = (df['Amount'] == 0).sum()
print(f"   • Transactions with zero amount: {zero_amounts:,}")
if zero_amounts > 100:
    print("     → This is unusual - might be authorization checks")

# Check Class values
unique_classes = df['Class'].unique()
print(f"\n   • Unique Class values: {sorted(unique_classes)}")
if set(unique_classes) == {0, 1}:
    print("     ✅ Correct: Only 0 (legitimate) and 1 (fraud)")
else:
    print("     ⚠️  Unexpected class values!")

# 5. Summary
print("\n\n" + "="*80)
print("📊 Data Quality Summary:")
print("="*80)

quality_score = 0
if total_missing == 0: quality_score += 25
if n_duplicates == 0: quality_score += 25
if not (df.dtypes == 'object').any(): quality_score += 25
if negative_amounts == 0: quality_score += 25

print(f"\n   Overall Data Quality Score: {quality_score}/100")

if quality_score == 100:
    print("   🌟 Excellent! Data is production-ready")
elif quality_score >= 75:
    print("   ✅ Good! Minor issues that are manageable")
elif quality_score >= 50:
    print("   ⚠️  Fair - some data cleaning required")
else:
    print("   ❌ Poor - significant data quality issues to address")

print("\n   ✅ Ready to proceed with exploratory analysis!")
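
The checklist at the top of this step lists outliers, which the cell above does not test. One simple option is the interquartile-range (IQR) rule, sketched here on a small made-up array. With the real data you would pass `df['Amount']` instead. The function name is hypothetical, and the 1.5 multiplier is the conventional default rather than a value tuned for this dataset.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Made-up transaction amounts with one extreme value
amounts = np.array([5.0, 12.5, 20.0, 22.0, 25.0, 30.0, 35.0, 2500.0])
mask = iqr_outlier_mask(amounts)

print(f"Outliers flagged: {mask.sum()} of {len(amounts)}")
print(amounts[mask])
```

Transaction amounts are strongly right-skewed, so this rule will flag many large but legitimate purchases. Treat flagged rows as candidates for review, not as errors to delete.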

---

## TO BE CONTINUED...

The remaining sections of this workshop will include:

- **Section 5**: Exploratory Data Analysis (fraud patterns, distributions)
- **Section 6**: Feature Engineering (time-based features, aggregations)
- **Section 7**: Data Preprocessing (scaling, train/test split)
- **Section 8**: Handling Class Imbalance (SMOTE, class weights)
- **Section 9**: Model Training (XGBoost with extensive tuning)
- **Section 10**: Model Evaluation (ROC-AUC, Precision-Recall, Cost-Benefit)
- **Section 11**: Feature Importance (understanding fraud indicators)
- **Section 12**: Hyperparameter Tuning (grid search, optimization)
- **Section 13**: Production Deployment (inference code, monitoring)
- **Section 14**: Real-time Scoring (API endpoint simulation)
- **Appendix**: Exercises and challenges
