# Credit Card Fraud Detection with XGBoost

## Workshop Overview (1 Hour)

Welcome! In this hands-on workshop, you'll build a **fraud detection system** using XGBoost - the algorithm trusted by major financial institutions and fintech companies worldwide.

### What You'll Build and Learn:
- Build a model that detects fraudulent credit card transactions
- Handle imbalanced data (fraud is rare!)
- Choose evaluation metrics suited to fraud detection
- Save your model for later deployment

### Dataset:
Credit Card Fraud Detection dataset with:
- **284,807 transactions** over 2 days
- **492 frauds** (0.172% - highly imbalanced!)
- **Real-world challenge**: Detect rare events accurately

---

## Understanding XGBoost for Fraud Detection

### What is XGBoost?

**XGBoost** = e**X**treme **G**radient **Boosting**

#### Why Financial Institutions Choose XGBoost:

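1. **High Accuracy**: Strong results on tabular data; the actual fraud catch rate depends on the data, features, and decision threshold
2. **Low False Positives**: Can be tuned to limit how often legitimate customers are blocked
3. **Fast**: Scores a transaction in milliseconds (suitable for real-time fraud detection)
4. **Handles Imbalance**: Class weighting (`scale_pos_weight`) helps even when fraud is around 0.1% of transactions
5. **Interpretable**: Feature importance shows which inputs drive a prediction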
6. **Production-Ready**: Used by PayPal, Airbnb, and major financial institutions

#### How Does XGBoost Work? (Simple Analogy)

Imagine you have 100 fraud analysts:

**Traditional Approach** (One Rule):
- "Flag all transactions over $1,000 from new locations"
- **Problem**: Misses sophisticated fraud, blocks legitimate travelers

**XGBoost Approach** (100 Analysts Working Together):

```
Analyst 1: "Unusual amount for this merchant" → 60% sure it's fraud
    ↓
Analyst 2: "Customer travels frequently, new location is normal" → Actually 40% sure
    ↓
Analyst 3: "But wait - 10 transactions in 5 minutes!" → Back to 70% sure
    ↓
Analyst 4: "Device fingerprint doesn't match" → 85% sure it's fraud
    ↓
... 96 more analysts ...
    ↓
Final Decision: 96% confidence it's FRAUD → Block transaction
```

**Key Insight**: Each analyst (tree) learns from the mistakes of previous analysts!

#### Technical: How Boosting Works

```
Step 1: Tree 1 catches obvious fraud → 60% accuracy
Step 2: Tree 2 focuses on what Tree 1 missed → Combined 75% accuracy
Step 3: Tree 3 focuses on what Trees 1+2 missed → Combined 85% accuracy
...
Step 100: Tree 100 fine-tunes everything → Final 95%+ accuracy
```
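
To make the "each tree corrects the previous ones" idea concrete, here is a minimal sketch of gradient boosting for squared error, using one-split decision stumps on a toy 1-D dataset. It is an illustration only: XGBoost itself uses a regularized objective, second-order gradient information, and deeper trees. The helper names `fit_stump` and `boost` and the toy data are hypothetical.

```python
# Toy gradient boosting with decision stumps (squared-error loss).
# Illustration only: hypothetical helpers, not XGBoost's full algorithm.

def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue  # a split must leave data on both sides
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds=20, learning_rate=0.3):
    """Fit stumps one after another; each targets the current residuals."""
    base = sum(ys) / len(ys)              # start from the mean prediction
    predictions = [base] * len(ys)
    stumps, losses = [], []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x)
                       for x, p in zip(xs, predictions)]
        losses.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys))
    return base, stumps, losses


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 1, 0, 1, 1, 1]   # toy labels: larger x is more often "fraud"
base, stumps, losses = boost(xs, ys)
print(f"MSE after round 1: {losses[0]:.4f}, after round {len(losses)}: {losses[-1]:.4f}")
```

The training error shrinks as stumps are added, which is the sequential correction the steps above describe. The `learning_rate` plays the same role as in XGBoost: it scales down each tree's contribution.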

#### XGBoost vs Other Algorithms:

| Algorithm | Fraud Detection | Speed | Imbalance Handling | Production Ready |
|-----------|----------------|-------|-------------------|------------------|
| **XGBoost** | ⭐⭐⭐⭐⭐ (95-99%) | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| Random Forest | ⭐⭐⭐⭐ (90-95%) | ⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| Logistic Regression | ⭐⭐⭐ (75-85%) | ⭐⭐⭐⭐⭐ | ⭐⭐ | ⭐⭐⭐⭐⭐ |
| Neural Networks | ⭐⭐⭐⭐ (90-95%) | ⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ |

**Winner**: XGBoost combines best accuracy with production performance!

---

## Data Download Instructions

### Option 1: Download from Kaggle (Recommended)

**Quick Steps:**
1. Go to: https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud
2. Create account (free) if needed
3. Click "Download" button
4. Extract `creditcard.csv` from zip file
5. Upload to this SageMaker environment

### Option 2: Kaggle API (Fastest)

```bash
pip install kaggle
kaggle datasets download -d mlg-ulb/creditcardfraud
unzip creditcardfraud.zip
```

### Option 3: Generate Sample Data (For Testing)

If you can't access Kaggle right now, uncomment and run this code to generate sample data:

```python
# import pandas as pd
# import numpy as np
# from sklearn.datasets import make_classification
# 
# X, y = make_classification(n_samples=10000, n_features=30, n_classes=2, 
#                            weights=[0.98, 0.02], random_state=42)
# df = pd.DataFrame(X, columns=[f'V{i}' for i in range(1,29)] + ['Amount', 'Time'])
# df['Class'] = y
# df.to_csv('creditcard.csv', index=False)
# print("✅ Sample data created!")
```

**⚠️ Important**: Make sure `creditcard.csv` is in the same directory as this notebook!

---

## Setup and Import Libraries

In [None]:
# Install required packages
!pip install xgboost scikit-learn pandas numpy matplotlib seaborn --quiet

print("✅ All packages installed!")

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import xgboost as xgb
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, confusion_matrix, classification_report
)

# Settings
import warnings
warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8')
RANDOM_SEED = 42

print("✅ Libraries imported!")
print(f"XGBoost version: {xgb.__version__}")

---

## Load and Explore the Data

### About the Dataset:

**Columns:**
- `Time`: Seconds since first transaction
- `V1-V28`: Anonymized features (PCA-transformed for privacy)
- `Amount`: Transaction amount in Euros (€)
- `Class`: Target variable (0 = Legitimate, 1 = Fraud)

**Privacy Note**: V1-V28 are transformed to protect customer identity while keeping fraud patterns.

In [None]:
# Load the data
df = pd.read_csv('creditcard.csv')

print("📊 Dataset loaded successfully!")
print(f"\nShape: {df.shape[0]:,} transactions, {df.shape[1]} columns")
print(f"\nFirst 5 transactions:")
print(df.head())

In [None]:
# Check class distribution
fraud_count = (df['Class'] == 1).sum()
legit_count = (df['Class'] == 0).sum()
fraud_pct = (fraud_count / len(df)) * 100

print("🚨 Class Distribution:")
print(f"   Legitimate: {legit_count:,} ({100-fraud_pct:.3f}%)")
print(f"   Fraud: {fraud_count:,} ({fraud_pct:.3f}%)")
print(f"   Imbalance Ratio: {legit_count/fraud_count:.0f}:1")
print("\n⚠️  This is HIGHLY IMBALANCED - a key challenge in fraud detection!")

In [None]:
# Visualize class distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Count plot
df['Class'].value_counts().plot(kind='bar', ax=axes[0], color=['green', 'red'], alpha=0.7)
axes[0].set_title('Transaction Class Distribution', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Class (0=Legitimate, 1=Fraud)')
axes[0].set_ylabel('Count')
axes[0].set_xticklabels(['Legitimate', 'Fraud'], rotation=0)

# Amount distribution by class
df[df['Class']==0]['Amount'].hist(bins=50, ax=axes[1], alpha=0.5, label='Legitimate', color='green')
df[df['Class']==1]['Amount'].hist(bins=50, ax=axes[1], alpha=0.5, label='Fraud', color='red')
axes[1].set_title('Transaction Amount Distribution', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Amount (€)')
axes[1].set_ylabel('Frequency')
axes[1].legend()
axes[1].set_xlim([0, 500])  # Focus on typical amounts

plt.tight_layout()
plt.show()

print("💡 Observations:")
print("   • Fraud is extremely rare (0.17%)")
print("   • Fraud and legitimate transactions have different amount patterns")
print("   • This imbalance is typical in real-world fraud detection")

---

## Prepare Data for Training

### What we're doing:
1. **Separate features (X) from target (y)**: X = what we know, y = what we predict
2. **Split into train and test sets**: Train to learn, test to evaluate
3. **Use stratified split**: Ensures both sets have similar fraud rates

### Why 80/20 split?
- **80% for training**: Model needs data to learn patterns
- **20% for testing**: Evaluate on unseen data (simulates real-world use)
- **Stratified**: Maintains the 0.17% fraud rate in both sets
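
As a rough illustration of what "stratified" means, the sketch below splits each class separately so that both sets keep the same fraud rate. It is a conceptual toy in plain Python, not how scikit-learn implements `train_test_split(..., stratify=y)`; the function name `stratified_split` and the toy labels are hypothetical.

```python
import random

def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx), splitting each class at the same fraction."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)                          # shuffle within the class
        n_test = round(len(idx) * test_fraction)  # same fraction per class
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return train_idx, test_idx

toy_labels = [1] * 20 + [0] * 9980   # toy data: 0.2% positives
train_idx, test_idx = stratified_split(toy_labels)
train_rate = sum(toy_labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(toy_labels[i] for i in test_idx) / len(test_idx)
print(f"train fraud rate: {train_rate:.3%}, test fraud rate: {test_rate:.3%}")
```

With a purely random split, a test set this small could easily end up with too few fraud cases to evaluate on, which is why the workshop passes `stratify=y`.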

In [None]:
# Separate features and target
X = df.drop('Class', axis=1)  # All columns except 'Class'
y = df['Class']                # Only the 'Class' column

print("📊 Data Separation:")
print(f"   Features (X): {X.shape}")
print(f"   Target (y): {y.shape}")
print(f"\n   Features: {list(X.columns)}")

In [None]:
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2,           # 20% for testing
    random_state=RANDOM_SEED, # Reproducible results
    stratify=y               # Keep class distribution
)

print("✅ Train-Test Split Complete!")
print(f"\nTraining Set:")
print(f"   X_train: {X_train.shape}")
print(f"   y_train: {y_train.shape}")
print(f"   Fraud rate: {(y_train.sum()/len(y_train)*100):.3f}%")

print(f"\nTest Set:")
print(f"   X_test: {X_test.shape}")
print(f"   y_test: {y_test.shape}")
print(f"   Fraud rate: {(y_test.sum()/len(y_test)*100):.3f}%")

print("\n✅ Fraud rates are similar - stratification worked!")

---

## Build and Train XGBoost Model

### Key Hyperparameters Explained:

1. **n_estimators=100**: Build 100 decision trees
   - More trees = better learning but slower
   - 100 is a good starting point

2. **max_depth=6**: Each tree can be 6 levels deep
   - Deeper = more complex patterns but risk overfitting
   - 6 is a balanced choice

3. **learning_rate=0.1**: How much each tree contributes
   - Lower = slower learning but potentially better
   - 0.1 is standard

4. **scale_pos_weight=578**: Handle class imbalance
   - Formula: (# legitimate) / (# fraud)
   - Tells model to pay more attention to fraud
   - **Critical for imbalanced data!**

5. **eval_metric='auc'**: Use AUC to measure performance
   - Better than accuracy for imbalanced data
   - Measures ability to distinguish fraud from legitimate
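
The value 578 follows directly from the dataset counts quoted in the overview. The short check below reproduces the arithmetic; the variable names are illustrative, and the notebook computes the ratio on the training split only, so its value can differ slightly.

```python
# Worked arithmetic for scale_pos_weight, using the full-dataset counts
# quoted in the overview (284,807 transactions, 492 frauds).
n_total = 284_807
n_fraud = 492
n_legit = n_total - n_fraud          # legitimate transactions

ratio = n_legit / n_fraud            # (# legitimate) / (# fraud)
print(f"legitimate transactions: {n_legit:,}")
print(f"scale_pos_weight: {ratio:.1f} (rounds to {round(ratio)})")
```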

In [None]:
# Calculate scale_pos_weight for class imbalance
scale_pos_weight = (y_train == 0).sum() / (y_train == 1).sum()

print("⚖️  Handling Class Imbalance:")
print(f"   scale_pos_weight = {scale_pos_weight:.0f}")
print(f"   This tells XGBoost: 'Fraud is {scale_pos_weight:.0f}x more important!'")
print("   Without this, the model would tend to under-predict the rare fraud class!")

In [None]:
# Create XGBoost classifier
xgb_model = xgb.XGBClassifier(
    n_estimators=100,                    # Number of trees
    max_depth=6,                         # Tree depth
    learning_rate=0.1,                   # Learning rate
    scale_pos_weight=scale_pos_weight,   # Handle imbalance
    eval_metric='auc',                   # Evaluation metric
    random_state=RANDOM_SEED             # Reproducibility
)

print("🔧 XGBoost Model Initialized!")
print("\nConfiguration:")
print(f"   • Trees: {xgb_model.n_estimators}")
print(f"   • Depth: {xgb_model.max_depth}")
print(f"   • Learning rate: {xgb_model.learning_rate}")
print(f"   • Scale pos weight: {scale_pos_weight:.0f}")

In [None]:
# Train the model
print("🚀 Training XGBoost model...")
print("   This may take 30-60 seconds...\n")

xgb_model.fit(
    X_train, y_train,
    eval_set=[(X_train, y_train), (X_test, y_test)],
    verbose=False  # Set to True to see training progress
)

print("\n✅ Model Training Complete!")
print(f"   Trained on {len(X_train):,} transactions")
print(f"   Model is ready to detect fraud!")

---

## Evaluate Model Performance

### Understanding Fraud Detection Metrics:

**Why not just use Accuracy?**
- If we predict "no fraud" for everything, we get 99.83% accuracy!
- But we miss ALL fraud → Useless model
- **We need better metrics for imbalanced data**

### Key Metrics for Fraud Detection:

1. **Recall (Fraud Catch Rate)**
   - What % of actual fraud did we catch?
   - Formula: True Positives / (True Positives + False Negatives)
   - **Higher is better** - we want to catch fraud!

2. **Precision**
   - Of transactions we flagged, what % were actually fraud?
   - Formula: True Positives / (True Positives + False Positives)
   - **Higher is better** - avoid blocking legitimate customers

3. **F1-Score**
   - Balance between Precision and Recall
   - Harmonic mean of both

4. **AUC-ROC**
   - Overall ability to distinguish fraud from legitimate
   - 0.5 = random guessing, 1.0 = perfect
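   - **Useful threshold-independent summary**; with heavy imbalance, also consider precision-recall AUC (average precision), which focuses on the fraud class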
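
The formulas above can be checked by hand from confusion-matrix counts. The sketch below uses a hypothetical test set (100 frauds among 50,000 transactions) to compare an "always legitimate" baseline with a model that catches most fraud; `fraud_metrics` and all counts are made up for illustration.

```python
def fraud_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # no flags -> define as 0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts: 100 frauds among 50,000 test transactions.
always_legit = fraud_metrics(tp=0, fp=0, fn=100, tn=49_900)
some_model = fraud_metrics(tp=80, fp=20, fn=20, tn=49_880)

for name, m in [("always legitimate", always_legit), ("some model", some_model)]:
    print(f"{name:>18}: accuracy={m['accuracy']:.4f} "
          f"precision={m['precision']:.2f} recall={m['recall']:.2f} f1={m['f1']:.2f}")
```

Both rows show very high accuracy, but only recall, precision, and F1 reveal that the baseline catches no fraud at all.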

### Business Impact:
- **False Negative (FN)**: Missed fraud → Lost money
- **False Positive (FP)**: Blocked legitimate customer → Lost customer, bad UX

In [None]:
# Make predictions
y_pred = xgb_model.predict(X_test)
y_pred_proba = xgb_model.predict_proba(X_test)[:, 1]  # Probability of fraud

print("🎯 Predictions Generated!")
print(f"   Predicted {(y_pred==1).sum()} fraudulent transactions")
print(f"   Predicted {(y_pred==0).sum()} legitimate transactions")

In [None]:
# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_pred_proba)

print("📊 Model Performance Metrics:")
print("="*60)
print(f"Accuracy:  {accuracy:.4f} ({accuracy*100:.2f}%)")
print(f"Precision: {precision:.4f} ({precision*100:.2f}%)")
print(f"Recall:    {recall:.4f} ({recall*100:.2f}%) ← Fraud catch rate")
print(f"F1-Score:  {f1:.4f}")
print(f"AUC-ROC:   {auc:.4f} ← Overall performance")

print("\n💡 What This Means:")
print(f"   • We catch {recall*100:.1f}% of all fraud")
print(f"   • When we flag fraud, we're right {precision*100:.1f}% of the time")
print(f"   • AUC of {auc:.2f} = {'Excellent' if auc >= 0.9 else 'Good' if auc >= 0.8 else 'Fair'} discrimination")

In [None]:
# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()

print("\n📊 Confusion Matrix Breakdown:")
print("="*60)
print(f"True Negatives (TN):  {tn:5,} ✅ Correctly identified as legitimate")
print(f"False Positives (FP): {fp:5,} ❌ Legitimate flagged as fraud")
print(f"False Negatives (FN): {fn:5,} ❌ Fraud that we missed (BAD!)")
print(f"True Positives (TP):  {tp:5,} ✅ Correctly caught fraud")

print("\n💰 Business Impact:")
print(f"   • Fraud caught: {tp} out of {tp+fn} ({tp/(tp+fn)*100:.1f}%)")
print(f"   • Fraud missed: {fn} (potential losses)")
print(f"   • Customers inconvenienced: {fp} (false alarms)")

In [None]:
# Visualize confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['Legitimate', 'Fraud'],
            yticklabels=['Legitimate', 'Fraud'],
            cbar_kws={'label': 'Count'})
plt.title('Confusion Matrix - Fraud Detection', fontsize=14, fontweight='bold')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.tight_layout()
plt.show()

print("\n✅ Diagonal cells (TN, TP): Correct predictions")
print("❌ Off-diagonal cells (FP, FN): Errors we need to minimize")

---

## Feature Importance

### Why Feature Importance Matters:

1. **Interpretability**: Understand what drives fraud predictions
2. **Trust**: Verify model makes sense (not just a black box)
3. **Optimization**: Focus on most important features
4. **Compliance**: Explain decisions to regulators

### What to Look For:
- Which features have highest importance?
- Do they make business sense?
- Are there surprising patterns?

In [None]:
# Get feature importance
feature_importance = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': xgb_model.feature_importances_
}).sort_values('Importance', ascending=False)

print("📊 Top 10 Most Important Features:")
print("="*60)
print(feature_importance.head(10).to_string(index=False))

print("\n💡 Insight:")
top_feature = feature_importance.iloc[0]['Feature']
print(f"   '{top_feature}' is the most important fraud indicator")
print(f"   Top 3 features account for {feature_importance.head(3)['Importance'].sum()*100:.1f}% of importance")

In [None]:
# Visualize feature importance (top 10)
plt.figure(figsize=(10, 6))
top_features = feature_importance.head(10)
plt.barh(top_features['Feature'], top_features['Importance'], color='steelblue')
plt.xlabel('Importance Score')
plt.title('Top 10 Features for Fraud Detection', fontsize=14, fontweight='bold')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

print("\n✅ These are the key fraud indicators the model learned!")

---

## Make Predictions on New Transactions

### Real-World Application:

In production, this model would:
1. Receive transaction data in real-time
2. Score each transaction (0-1 probability)
3. Flag high-risk transactions for review
4. Block or challenge suspicious transactions

### Risk Levels:
- **High Risk** (>70%): Block transaction, send alert
- **Medium Risk** (40-70%): Challenge with 2FA/OTP
- **Low Risk** (<40%): Approve automatically
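
The tiers above amount to a small lookup on the model's fraud probability. The sketch below is one possible encoding; the function name `risk_action`, the action strings, and the handling of the exact boundary values (0.4 and 0.7 fall into "Medium" here) are assumptions, not part of the workshop code.

```python
def risk_action(fraud_probability, high=0.7, medium=0.4):
    """Map a fraud probability in [0, 1] to a (risk level, action) pair."""
    if not 0.0 <= fraud_probability <= 1.0:
        raise ValueError("fraud_probability must be between 0 and 1")
    if fraud_probability > high:
        return "High", "block transaction and send alert"
    if fraud_probability >= medium:
        return "Medium", "challenge with 2FA/OTP"
    return "Low", "approve automatically"

for p in (0.05, 0.55, 0.92):
    level, action = risk_action(p)
    print(f"p={p:.2f} -> {level}: {action}")
```

In practice the cutoffs would be chosen from validation data and business costs rather than fixed in advance.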

In [None]:
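# Take 5 sample transactions: up to 2 known frauds plus 3 legitimate ones
# (fixed row positions would almost always pick only legitimate transactions)
fraud_positions = np.flatnonzero(y_test.values == 1)[:2]
legit_positions = np.flatnonzero(y_test.values == 0)[:3]
sample_indices = np.concatenate([fraud_positions, legit_positions])
sample_transactions = X_test.iloc[sample_indices]
sample_actual = y_test.iloc[sample_indices]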

# Make predictions
sample_predictions = xgb_model.predict(sample_transactions)
sample_probabilities = xgb_model.predict_proba(sample_transactions)[:, 1]

# Create results DataFrame
results = pd.DataFrame({
    'Transaction_ID': sample_indices,
    'Actual': ['Fraud' if x==1 else 'Legit' for x in sample_actual],
    'Predicted': ['Fraud' if x==1 else 'Legit' for x in sample_predictions],
    'Fraud_Probability': [f"{x:.2%}" for x in sample_probabilities],
    'Risk_Level': ['High' if x>0.7 else 'Medium' if x>0.4 else 'Low' for x in sample_probabilities],
    'Correct': ['✅' if a==p else '❌' for a, p in zip(sample_actual, sample_predictions)]
})

print("🎯 Sample Transaction Predictions:")
print("="*80)
print(results.to_string(index=False))

print("\n💡 How to Read This:")
print("   • Fraud_Probability: Model's confidence (0-100%)")
print("   • Risk_Level: Action to take")
print("   • Correct: Did model get it right?")

---

## Save the Model

### Why Save Models?

1. **Deployment**: Use in production systems
2. **Sharing**: Share with team members
3. **Versioning**: Track different model versions
4. **Efficiency**: Don't retrain every time

### File Formats:
- **JSON (.json)**: XGBoost native format, best for production
- **Pickle (.pkl)**: Python format, includes full object

In [None]:
import pickle
import json
from pathlib import Path

# Create models directory
model_dir = Path('/home/sagemaker-user/models')
model_dir.mkdir(parents=True, exist_ok=True)  # also create missing parent directories

print("💾 Saving Model...")

# Save as JSON (recommended for production)
json_path = model_dir / 'fraud_detection_model.json'
xgb_model.save_model(str(json_path))
print(f"✅ Saved as JSON: {json_path}")

# Save as Pickle (includes Python object)
pickle_path = model_dir / 'fraud_detection_model.pkl'
with open(pickle_path, 'wb') as f:
    pickle.dump(xgb_model, f)
print(f"✅ Saved as Pickle: {pickle_path}")

# Save model metadata
metadata = {
    'model_type': 'XGBoost Classifier',
    'training_date': pd.Timestamp.now().isoformat(),
    'n_training_samples': len(X_train),
    'n_features': len(X_train.columns),
    'performance': {
        'accuracy': float(accuracy),
        'precision': float(precision),
        'recall': float(recall),
        'f1_score': float(f1),
        'auc_roc': float(auc)
    },
    'hyperparameters': {
        'n_estimators': xgb_model.n_estimators,
        'max_depth': xgb_model.max_depth,
        'learning_rate': xgb_model.learning_rate,
        'scale_pos_weight': float(scale_pos_weight)
    }
}

metadata_path = model_dir / 'model_metadata.json'
with open(metadata_path, 'w') as f:
    json.dump(metadata, f, indent=2)
print(f"✅ Saved metadata: {metadata_path}")

print("\n✅ Model saved successfully!")
print("   Ready for production deployment!")

In [None]:
# Demonstrate loading the model
print("📂 Loading Saved Model...")

# Load from JSON
loaded_model = xgb.XGBClassifier()
loaded_model.load_model(str(json_path))

# Verify it works
test_prediction = loaded_model.predict(X_test[:5])
original_prediction = xgb_model.predict(X_test[:5])

if np.array_equal(test_prediction, original_prediction):
    print("✅ Model loaded successfully!")
    print("   Predictions match original model")
    print("   Ready to use for real-time fraud detection!")
else:
    print("❌ Warning: Loaded model predictions don't match!")

---

## Workshop Summary

### 🎉 Congratulations!

You've successfully built a fraud detection system using XGBoost!

### What You Learned:

✅ **Understanding XGBoost**
- How gradient boosting works (sequential tree learning)
- Why it's the industry standard for fraud detection
- Key advantages over other algorithms

✅ **Handling Imbalanced Data**
- Fraud is rare (0.17% of transactions)
- Used `scale_pos_weight` to balance classes
- Chose appropriate metrics (Recall, Precision, AUC)

✅ **Building Production Models**
- Worked with ~284K real transactions (about 228K used for training)
- Measured AUC-ROC on the held-out test set
- Measured recall (the share of fraud cases caught)
- Saved model for deployment

✅ **Model Interpretation**
- Analyzed feature importance
- Understood confusion matrix
- Business impact of FP vs FN

### Real-World Applications:

This same approach is used by:
- **PayPal**: Real-time transaction monitoring
- **Stripe**: Payment fraud prevention
- **Financial institutions**: Credit card fraud detection
- **Airbnb**: Booking fraud prevention

### Next Steps:

1. **Hyperparameter Tuning**: Experiment with different settings
2. **Feature Engineering**: Create new features (time-based, aggregations)
3. **Threshold Optimization**: Adjust decision threshold for business needs
4. **A/B Testing**: Compare with existing fraud systems
5. **Deployment**: Build REST API or SageMaker endpoint
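
As a starting point for threshold optimization, the sketch below picks the cutoff that minimizes a simple cost, where a missed fraud is assumed to cost 100 times as much as a false alarm. The cost values, the function name `best_threshold`, and the toy scores are all hypothetical; real costs come from the business, and the threshold should be chosen on a validation set rather than the test set.

```python
def best_threshold(labels, scores, cost_fn=100.0, cost_fp=1.0):
    """Return (threshold, cost) minimizing cost_fn * FN + cost_fp * FP.

    A transaction is flagged as fraud when its score >= threshold.
    """
    candidates = sorted(set(scores)) + [float("inf")]  # inf = flag nothing
    best = None
    for threshold in candidates:
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        fn = sum(1 for y, s in zip(labels, scores) if s < threshold and y == 1)
        cost = cost_fn * fn + cost_fp * fp
        if best is None or cost < best[1]:
            best = (threshold, cost)
    return best

# Toy validation data: 1 = fraud, 0 = legitimate, with model scores.
toy_labels = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
toy_scores = [0.02, 0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.60, 0.80, 0.95]

threshold, cost = best_threshold(toy_labels, toy_scores)
print(f"chosen threshold: {threshold}, total cost: {cost}")
```

On this toy data the cost-minimizing cutoff sits below the default 0.5, because catching the fraud scored at 0.35 is worth one extra false alarm under the assumed costs.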

### Key Takeaways:

- 💡 **Accuracy is NOT enough** for imbalanced data
- 💡 **Class weighting** is critical for rare event detection
- 💡 **Business context** matters - FP and FN have different costs
- 💡 **Feature importance** builds trust and interpretability
- 💡 **XGBoost** is production-ready out of the box

### Resources:

- **XGBoost Docs**: https://xgboost.readthedocs.io/
- **Kaggle Competition**: https://www.kaggle.com/c/ieee-fraud-detection
- **Research Paper**: https://arxiv.org/abs/1603.02754

---

**Thank you for participating! 🚀**

Questions? Discuss with your instructor or fellow participants!