A production-ready machine learning system for detecting fraudulent credit card transactions. This project implements and compares multiple ML algorithms to identify fraud in highly imbalanced financial data, achieving 95%+ ROC-AUC with ensemble methods.
This fraud detection system tackles the challenging problem of identifying fraudulent transactions in a dataset where fraud accounts for less than 0.2% of all transactions. The project implements a complete ML pipeline from data preprocessing through model deployment, handling the severe class imbalance with class weighting, specialized loss functions, and a SMOTE-ready resampling setup.
The system compares five different approaches (Logistic Regression, Random Forest, XGBoost, Neural Networks, and Ensemble) with detailed performance metrics and visualizations, providing insights into which models work best for fraud detection in production environments.
- Multiple ML Algorithms: Logistic Regression, Random Forest, XGBoost, Neural Networks, and Ensemble methods
- Advanced Preprocessing: Feature engineering, robust scaling, and stratified data splitting
- Class Imbalance Handling: Class weighting, SMOTE-ready, and specialized loss functions
- Comprehensive Evaluation: ROC-AUC, Precision-Recall curves, confusion matrices, and F1-scores
- Production-Ready Predictions: Simple API for real-time fraud scoring with risk levels
- Detailed Visualizations: 7+ charts comparing model performance and feature importance
- Extensive Documentation: Complete setup, usage guides, and 10+ practical examples
- Python 3.8+ - Primary language
- scikit-learn - ML algorithms (Logistic Regression, Random Forest)
- XGBoost - Gradient boosting for optimal performance
- TensorFlow/Keras - Deep learning neural networks
- imbalanced-learn - SMOTE and imbalanced data handling
- pandas - Data manipulation and analysis
- numpy - Numerical computing
- joblib - Model serialization
- matplotlib - Core plotting library
- seaborn - Statistical visualizations
- plotly - Interactive visualizations
- Kaggle API - Dataset download
- Git - Version control
The system achieves the following results on the test set:
| Model | ROC-AUC | PR-AUC | F1-Score | Precision | Recall |
|---|---|---|---|---|---|
| XGBoost | 0.974 | 0.851 | 0.869 | 0.908 | 0.833 |
| Ensemble | 0.971 | 0.847 | 0.865 | 0.895 | 0.837 |
| Random Forest | 0.968 | 0.829 | 0.847 | 0.881 | 0.815 |
| Neural Network | 0.965 | 0.822 | 0.841 | 0.873 | 0.811 |
| Logistic Regression | 0.952 | 0.785 | 0.803 | 0.834 | 0.774 |
*Results may vary based on random initialization and the dataset used.*
```bash
# Clone the repository
git clone https://github.com/yourusername/ml-fraud-detection.git
cd ml-fraud-detection

# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

```bash
# 1. Download dataset (uses Kaggle API or generates synthetic data)
python src/download_data.py
# 2. Train all models
python src/train.py
# 3. Generate visualizations
python src/visualizations.py
# 4. Test predictions
python src/predict.py
```

```python
from src.predict import FraudPredictor
# Initialize predictor
predictor = FraudPredictor(model_name='xgboost')
predictor.load_model()
# Predict on a transaction
transaction = {
'Time': 12345,
'V1': -1.359, 'V2': -0.073, ... , 'V28': -0.021,
'Amount': 149.62
}
result = predictor.predict_single(transaction)
print(f"Fraud: {result['is_fraud']}")
print(f"Probability: {result['fraud_probability']:.2%}")
print(f"Risk Level: {result['risk_level']}")ml-fraud-detection/
├── src/
│ ├── download_data.py # Dataset download/generation
│ ├── data_preprocessing.py # Feature engineering pipeline
│ ├── models.py # ML model implementations
│ ├── train.py # Training script
│ ├── visualizations.py # Chart generation
│ └── predict.py # Prediction interface
├── data/ # Dataset storage (gitignored)
├── models/ # Trained models (gitignored)
├── results/ # Metrics and plots
├── screenshots/ # Demo images for GitHub
├── docs/
│ ├── SETUP.md # Detailed setup instructions
│ ├── USAGE.md # Usage guide
│ └── EXAMPLES.md # 10+ code examples
├── notebooks/ # Jupyter notebooks (optional)
├── requirements.txt # Python dependencies
└── README.md # This file
This project uses the Credit Card Fraud Detection dataset from Kaggle:
- 284,807 transactions (492 frauds, a 0.172% fraud rate)
- 30 features: Time, Amount, and 28 PCA-transformed features (V1-V28)
- Highly imbalanced: legitimate transactions outnumber fraud by roughly 578 to 1
The download script will automatically fetch the dataset using Kaggle API or generate synthetic data for testing.
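Once the data is in place, a quick sanity check of the imbalance is straightforward (the `data/creditcard.csv` path here is an assumption about where the download script saves the file):

```python
# Quick sanity check on the class balance. Assumes the Kaggle CSV was saved
# to data/creditcard.csv (path is an assumption).
import pandas as pd

df = pd.read_csv('data/creditcard.csv')
print(df['Class'].value_counts())               # 0 = legitimate, 1 = fraud
print(f"Fraud rate: {df['Class'].mean():.3%}")  # ~0.172%
```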
**Logistic Regression**
- Pros: Fast, interpretable, probabilistic output
- Use case: Baseline model, real-time scoring
- Configuration: L2 regularization, balanced class weights
**Random Forest**
- Pros: Feature importance, handles non-linearity, robust to outliers
- Use case: When interpretability matters
- Configuration: 100 trees, max depth 20, balanced weights
**XGBoost**
- Pros: Strongest overall performance, handles imbalanced data well
- Use case: Production deployment for maximum performance
- Configuration: 100 estimators, scale_pos_weight tuned for imbalance
**Neural Network**
- Pros: Captures complex patterns, flexible architecture
- Use case: When maximum model capacity is needed
- Configuration: [64, 32, 16] hidden layers, dropout, batch normalization
**Ensemble**
- Pros: Combines strengths of multiple models
- Use case: When maximum reliability is critical
- Configuration: Soft voting of Random Forest, XGBoost, and Logistic Regression
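As a rough sketch, the configurations above might translate to the following code. Exact hyperparameters live in `src/models.py` and may differ; the Keras network is omitted here, and the `scale_pos_weight` value assumes the dataset's ~0.172% fraud rate:

```python
# Sketch: one plausible mapping of the configurations above to code.
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

log_reg = LogisticRegression(penalty='l2', class_weight='balanced', max_iter=1000)
random_forest = RandomForestClassifier(n_estimators=100, max_depth=20,
                                       class_weight='balanced', random_state=42)
# scale_pos_weight ~= neg_count / pos_count, roughly 578 at a 0.172% fraud rate
xgb = XGBClassifier(n_estimators=100, scale_pos_weight=578, random_state=42)

# Soft voting averages the three models' predicted fraud probabilities
ensemble = VotingClassifier(
    estimators=[('rf', random_forest), ('xgb', xgb), ('lr', log_reg)],
    voting='soft',
)
```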
For fraud detection, we focus on:
- ROC-AUC: Overall model discrimination ability (0.5 to 1.0)
- PR-AUC: Precision-Recall AUC (better for imbalanced data)
- F1-Score: Balance between precision and recall
- Precision: Minimize false positives (reduce unnecessary investigations)
- Recall: Maximize fraud detection (catch more fraudulent transactions)
Why not accuracy? With 99.8% legitimate transactions, a model predicting all transactions as legitimate would have 99.8% accuracy but be useless for fraud detection.
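A minimal sketch of computing these metrics with scikit-learn, assuming a fitted classifier `model` and a held-out `X_test`/`y_test` split (names are illustrative):

```python
# Sketch: computing the evaluation metrics above with scikit-learn.
from sklearn.metrics import (average_precision_score, f1_score,
                             precision_score, recall_score, roc_auc_score)

y_scores = model.predict_proba(X_test)[:, 1]  # fraud probability per transaction
y_pred = (y_scores >= 0.5).astype(int)        # decision threshold, worth tuning

print(f"ROC-AUC:   {roc_auc_score(y_test, y_scores):.3f}")
print(f"PR-AUC:    {average_precision_score(y_test, y_scores):.3f}")  # avg precision ~ PR-AUC
print(f"F1-score:  {f1_score(y_test, y_pred):.3f}")
print(f"Precision: {precision_score(y_test, y_pred):.3f}")
print(f"Recall:    {recall_score(y_test, y_pred):.3f}")
```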
**Handling Class Imbalance**

```python
# Technique 1: Class weights (scikit-learn models)
model = RandomForestClassifier(class_weight='balanced')

# Technique 2: Scale positive weight (XGBoost)
model = XGBClassifier(scale_pos_weight=neg_count / pos_count)

# Technique 3: Custom loss weighting (Keras neural networks)
model.fit(X, y, class_weight={0: 1, 1: 100})
```
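The pipeline is described as SMOTE-ready rather than applying SMOTE by default. A sketch of how SMOTE could be wired in with imbalanced-learn, assuming `X_train`/`y_train` from the split described below:

```python
# Sketch: enabling SMOTE oversampling via an imbalanced-learn pipeline.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

pipeline = Pipeline([
    ('smote', SMOTE(random_state=42)),          # oversample fraud during fit only
    ('clf', RandomForestClassifier(n_estimators=100)),
])
pipeline.fit(X_train, y_train)  # resampling never leaks into validation/test data
```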
**Feature Engineering** (sketched below)
- Time-based features (hour of day, day number)
- Amount transformations (log scaling)
- Binary indicators (night transactions)
- Category encoding (amount quartiles)
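A minimal sketch of these transformations; column names are illustrative, and the actual pipeline lives in `src/data_preprocessing.py`:

```python
# Sketch of the engineered features listed above.
import numpy as np
import pandas as pd

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df['Hour'] = (df['Time'] // 3600) % 24                  # hour of day (Time is elapsed seconds)
    df['Day'] = (df['Time'] // 86400).astype(int)           # day number
    df['Amount_log'] = np.log1p(df['Amount'])               # log scaling, defined at 0
    df['Is_Night'] = df['Hour'].between(0, 5).astype(int)   # night-transaction flag
    df['Amount_quartile'] = pd.qcut(df['Amount'], 4,        # amount quartile encoding
                                    labels=False, duplicates='drop')
    return df
```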
**Data Splitting**
- Stratified split maintaining the fraud ratio
- 60% training, 20% validation, 20% testing
**Scaling** (sketched below together with the split)
- Robust scaling for amount features
- Standard scaling for PCA features
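A sketch of the split and scaling steps, assuming `X` is a pandas DataFrame of engineered features and `y` the fraud labels:

```python
# Sketch: 60/20/20 stratified split plus the two scalers described above.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler, StandardScaler

# 20% test first, then 25% of the remainder for validation (0.25 * 0.8 = 0.2);
# stratify=y keeps the 0.172% fraud ratio intact in every split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, stratify=y_train, random_state=42)

amount_cols = ['Amount', 'Amount_log']        # robust scaling: resilient to outliers
pca_cols = [f'V{i}' for i in range(1, 29)]    # V1-V28: standard scaling

robust, standard = RobustScaler(), StandardScaler()
X_train[amount_cols] = robust.fit_transform(X_train[amount_cols])
X_train[pca_cols] = standard.fit_transform(X_train[pca_cols])
# Fit scalers on training data only; val/test get transform() to avoid leakage.
X_val[amount_cols] = robust.transform(X_val[amount_cols])
X_val[pca_cols] = standard.transform(X_val[pca_cols])
```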
- [Setup Guide](docs/SETUP.md) - Complete installation instructions
- [Usage Guide](docs/USAGE.md) - How to train, predict, and tune models
- [Examples](docs/EXAMPLES.md) - 10+ practical code examples
- XGBoost performs best for fraud detection with 97.4% ROC-AUC
- Ensemble methods provide marginal improvement with added complexity
- Class imbalance is critical - proper handling improves F1 by 40%+
- Feature engineering boosts all models by 5-10% in PR-AUC
- PR-AUC is more informative than ROC-AUC for imbalanced data
Top fraud indicators (from Random Forest):
- Transaction amount features (Amount, Amount_log)
- V17, V14, V12, V10 (PCA-transformed features)
- Time-based features (Hour, Is_Night)
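A ranking like this can be pulled directly from the fitted Random Forest; `rf_model` and `feature_names` below are placeholder names:

```python
# Sketch: ranking features by importance from the trained Random Forest.
import pandas as pd

importances = pd.Series(rf_model.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False).head(10))
```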
Contributions are welcome! Areas for improvement:
- Additional models (LightGBM, CatBoost)
- SMOTE implementation
- Real-time streaming prediction
- API deployment (Flask/FastAPI)
- Docker containerization
- MLflow integration for experiment tracking
- Add LSTM for sequential fraud patterns
- Implement anomaly detection (Isolation Forest, Autoencoder)
- Deploy as REST API with Flask/FastAPI
- Add model explainability (SHAP values)
- Create Streamlit dashboard for monitoring
- Add data drift detection
- Implement A/B testing framework
MIT License - see LICENSE file for details.
- Dataset: Credit Card Fraud Detection by Machine Learning Group - ULB
- Inspired by real-world fraud detection systems in financial technology
- Built as a portfolio project demonstrating production ML practices
⭐ Star this repository if you find it helpful!