# Reproducible Fraud Detection with LakeFS

## 1. Project Overview

**Goal:** This project demonstrates a Data-Centric MLOps pipeline for detecting credit card fraud. Unlike traditional scripts, we use **LakeFS** to provide Git-like version control for our data and models.

## 2. The Dataset

We are using the **Credit Card Fraud Detection** dataset (anonymized real-world transactions).
* **Challenge:** The dataset is highly imbalanced (**0.17% fraud** vs. 99.83% legitimate).
* **Implication:** Standard accuracy metrics are misleading (a dummy model predicting "legit" every time gets 99.8% accuracy). We must prioritize **F1-Score** and **Recall**.
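
To make the accuracy trap concrete, the sketch below scores a model that labels every transaction as legitimate. The class counts (56,864 legitimate, 98 fraud) are the test-set supports printed later in this notebook; the function name `score_all_legit` is illustrative and not part of the project's utilities.

```python
def score_all_legit(n_legit: int, n_fraud: int) -> dict:
    """Metrics for a model that predicts 'legit' for every transaction."""
    tp = 0          # fraud correctly flagged: none, the model never flags anything
    fp = 0          # legitimate rows wrongly flagged: none
    fn = n_fraud    # every fraud case is missed
    tn = n_legit    # every legitimate case is trivially correct

    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "recall": recall, "f1": f1}


baseline = score_all_legit(n_legit=56864, n_fraud=98)
print(f"accuracy={baseline['accuracy']:.4f}  "
      f"recall={baseline['recall']:.2f}  f1={baseline['f1']:.2f}")
```

Accuracy comes out above 99.8% even though recall and F1 for the fraud class are both zero, which is why the leaderboard ranks models by fraud F1 instead.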

## 3. The Tool: LakeFS

**LakeFS** creates a versioning layer over our object storage. It allows us to:
* **Commit** data snapshots (just like Git commits code).
* **Branch** data for isolated experiments.
* **Revert** changes if data becomes corrupted.
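
The three operations above can be pictured with a small in-memory toy. This is only a sketch of the semantics, assuming a branch is a pointer to an immutable snapshot; `ToyRepo` and its methods are hypothetical and are not the LakeFS API or the `LakeFSDataHandler` used below.

```python
class ToyRepo:
    """In-memory toy showing commit/branch/revert semantics. Not the LakeFS API."""

    def __init__(self):
        self.snapshots = [{}]        # commit id -> immutable {path: content} snapshot
        self.parents = [None]        # commit id -> parent commit id
        self.heads = {"main": 0}     # branch name -> commit id it points at

    def commit(self, branch, changes):
        """Record a new snapshot on `branch` and advance the branch pointer."""
        parent = self.heads[branch]
        snapshot = {**self.snapshots[parent], **changes}
        self.snapshots.append(snapshot)
        self.parents.append(parent)
        self.heads[branch] = len(self.snapshots) - 1
        return self.heads[branch]

    def create_branch(self, name, source):
        """A branch is only a new pointer; no data is copied."""
        self.heads[name] = self.heads[source]

    def revert_last(self, branch):
        """Undo the latest commit by committing its parent's snapshot again."""
        parent = self.parents[self.heads[branch]]
        if parent is None:
            raise ValueError("nothing to revert")
        self.snapshots.append(dict(self.snapshots[parent]))
        self.parents.append(self.heads[branch])
        self.heads[branch] = len(self.snapshots) - 1

    def read(self, branch, path):
        """Return the content of `path` as seen from the head of `branch`."""
        return self.snapshots[self.heads[branch]].get(path)


repo = ToyRepo()
repo.commit("main", {"data/train.csv": "clean rows"})
repo.create_branch("exp-xgb", "main")
repo.commit("exp-xgb", {"data/train.csv": "corrupted rows"})
main_view = repo.read("main", "data/train.csv")        # unaffected by the experiment
repo.revert_last("exp-xgb")
branch_view = repo.read("exp-xgb", "data/train.csv")   # restored to the earlier content
print(main_view, "|", branch_view)
```

The point of the toy is that a bad commit on an experiment branch never changes what `main` sees, and that undoing it is a cheap pointer-and-snapshot operation rather than a manual file restore.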

In [1]:
# Uncomment and run this cell once if any of the imports below fail
#!pip install lakefs-client imbalanced-learn
#!pip install xgboost lightgbm tensorflow

In [2]:
import io
import os

import numpy as np
import pandas as pd
from sklearn.metrics import classification_report

from LakeFS_Fraud_utils import (
    LakeFSDataHandler, preprocess_data_pro, train_and_eval,
    save_confusion_matrix, save_roc_curve, save_pr_curve,
)

## Setup

In [3]:
LAKEFS_HOST = 'http://host.docker.internal:8000' 
REPO_NAME = 'creditcard-fraud'
ACCESS_KEY = 'YOUR_ACCESS_KEY' 
SECRET_KEY = 'YOUR_SECRET_KEY'

handler = LakeFSDataHandler(LAKEFS_HOST, ACCESS_KEY, SECRET_KEY, REPO_NAME)

## Phase 1: Ingestion, Feature Engineering & Immutable Baseline

### Aim
To establish a "Golden Record" of our training data. We will load the raw CSV, generate new predictive features, apply critical preprocessing, and **commit** the result to the `main` branch.

### Methodology
1.  **Feature Engineering:**
    * **Time $\rightarrow$ Hour of Day:** We convert the raw timestamp (seconds) into an `Hour` feature (0-23) to capture circadian patterns in fraud.
    * **Amount $\rightarrow$ LogAmount:** Transaction amounts are heavily right-skewed. We apply a log transform (`log1p`, i.e. $\log(1 + x)$, which is defined for zero amounts) to compress the long tail, which helps linear models in particular.
2.  **SMOTE (Synthetic Minority Over-sampling Technique):** Since fraud cases are rare (0.17%), we synthesize new fraud examples to balance the training set. The test set keeps its original class ratio so that evaluation reflects real-world conditions.
3.  **Standard Scaling:** We standardize the features (V1-V28, Hour, LogAmount) to zero mean and unit variance so that algorithms like Neural Networks converge faster.
4.  **LakeFS Commit:** We upload the processed `train.csv` and `test.csv` to LakeFS and commit them.
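
As a rough illustration of steps 1 and 2 above, the sketch below applies the two feature transforms to single values and shows the interpolation step at the core of SMOTE. It uses only the standard library, the function names are hypothetical, and `preprocess_data_pro` may well be implemented differently (in practice SMOTE usually comes from `imbalanced-learn`, which also handles the nearest-neighbour search omitted here).

```python
import math
import random


def hour_of_day(seconds: float) -> int:
    """Map elapsed seconds to an hour bucket in 0-23."""
    return int(seconds // 3600) % 24


def log_amount(amount: float) -> float:
    """log(1 + amount); safe for zero-value transactions."""
    return math.log1p(amount)


def smote_sample(x, neighbor, rng):
    """One synthetic minority point on the segment between x and a minority-class neighbor."""
    gap = rng.random()  # uniform in [0, 1)
    return [xi + gap * (ni - xi) for xi, ni in zip(x, neighbor)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
synthetic = smote_sample([1.0, 2.0], [3.0, 6.0], rng)

print(hour_of_day(90000))      # 25 whole hours elapsed, wrapped into the 0-23 range
print(log_amount(99.0))        # natural log of 100
print(synthetic)               # lies between the two input points
```

Because synthetic points are interpolated from real fraud rows, oversampling is normally applied to the training split only; applying it before the split would leak near-copies of training rows into the test set.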

### Inference
By committing these files to `main`, we ensure that every subsequent experiment starts from the **exact same data snapshot** (including our engineered features), guaranteeing reproducibility.
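
LakeFS identifies each snapshot by its commit. Independently of that, a content hash is a lightweight way to confirm that two runs really read identical bytes. The sketch below is an optional sanity check under that assumption; `content_fingerprint` is a hypothetical helper, and the CSV snippets are made-up stand-ins for the real files.

```python
import hashlib


def content_fingerprint(data: bytes) -> str:
    """SHA-256 hex digest of the raw file bytes."""
    return hashlib.sha256(data).hexdigest()


# Made-up stand-ins for two downloads of data/processed/train.csv
run_a = b"V1,Hour,LogAmount,Class\n0.1,3,4.6,0\n"
run_b = b"V1,Hour,LogAmount,Class\n0.1,3,4.6,0\n"
tampered = b"V1,Hour,LogAmount,Class\n0.1,3,4.6,1\n"

same_snapshot = content_fingerprint(run_a) == content_fingerprint(run_b)
detects_change = content_fingerprint(run_a) != content_fingerprint(tampered)
print(same_snapshot, detects_change)
```

Logging the fingerprint next to each experiment's metrics makes it easy to spot a run that was trained on different data than the others.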

In [4]:
# Load Raw Data
print("Loading raw data...")
df = pd.read_csv('creditcard.csv')
print(f"Original Shape: {df.shape}")

# FEATURE ENGINEERING (Task Requirement)
# Extract 'Hour' from 'Time' (seconds) to capture daily cycles:
# floor to whole hours, then take mod 24 to get an hour in 0-23.
df['Hour'] = np.floor(df['Time'] / 3600) % 24

# Log Transform 'Amount' (Handle extreme skewness)
df['LogAmount'] = np.log1p(df['Amount'])

# Drop the original raw columns (Model will use the new engineered ones)
df = df.drop(['Time', 'Amount'], axis=1)

print("New Features Added: ['Hour', 'LogAmount']")
print(f"Shape after Engineering: {df.shape}")

# PREPROCESSING & UPLOAD
# Apply SMOTE and Scaling (wrapped in helper function)
print("\nApplying SMOTE and Standard Scaling...")
train_df, test_df = preprocess_data_pro(df)

# Upload and Commit to LakeFS 'main' branch
print("\nVersioning Data in LakeFS...")
handler.upload_df(train_df, 'main', 'data/processed/train.csv', 'Final Preprocessed Train Data')
handler.upload_df(test_df, 'main', 'data/processed/test.csv', 'Final Preprocessed Test Data')

Loading raw data...
Original Shape: (284807, 31)
New Features Added: ['Hour', 'LogAmount']
Shape after Engineering: (284807, 31)

Applying SMOTE and Standard Scaling...
Preprocessing...

Versioning Data in LakeFS...
Uploading to branch 'main' at path 'data/processed/train.csv'...
Committing: Final Preprocessed Train Data
Uploading to branch 'main' at path 'data/processed/test.csv'...
Committing: Final Preprocessed Test Data


## Phase 2: The 8-Model Tournament 

### **Aim**
To find the best performing model without polluting our production environment. We will run an automated tournament comparing 8 different algorithms.

### **The "Isolation" Strategy**
For each model (e.g., XGBoost, Random Forest), the code will:
1.  Create a **New Branch** (e.g., `exp-xgb`) from `main`.
2.  Train the model on that branch.
3.  Upload **Visualization Artifacts** (Confusion Matrix, ROC Curve) to that branch.

### **Inference**
Using branches ensures **experiment isolation**. If the "Neural Network" experiment fails or produces junk data, it remains trapped in the `exp-nn` branch and never touches our clean `main` branch.

In [5]:
algorithms = ['lr', 'rf', 'xgb', 'lgbm', 'nn', 'ensemble', 'power_ensemble', 'tuned_xgb']
algo_names = {
    'lr': 'Logistic Regression',
    'rf': 'Random Forest',
    'xgb': 'XGBoost',
    'lgbm': 'LightGBM',
    'nn': 'Neural Network',
    'ensemble': 'Basic Ensemble (All)',
    'power_ensemble': 'Power Ensemble (Tree Models Only)',
    'tuned_xgb': 'XGBoost (Hyperparameter Tuned)',
}
tournament_results = []

for algo in algorithms:
    full_name = algo_names[algo]
    print(f"\n\n=== Running Experiment: {full_name} ===")
    
    # Creating a branch for this model
    branch_name = f'exp-{algo}'
    handler.create_branch(branch_name, 'main')
    
    # Train and Evaluate
    y_true, y_pred, y_prob = train_and_eval(train_df, test_df, algo=algo)
    
    if y_pred is not None:
        # Generate report as a dictionary to extract numbers
        report_dict = classification_report(y_true, y_pred, output_dict=True)
        
        # Extract F1-score for Class '1' (Fraud)
        fraud_f1 = report_dict['1']['f1-score']
        tournament_results.append({'Model': full_name, 'Fraud F1-Score': fraud_f1})
        
        # Save Artifacts
        cm_file = save_confusion_matrix(y_true, y_pred, algo)
        roc_file = save_roc_curve(y_true, y_prob, algo)
        pr_file = save_pr_curve(y_true, y_prob, algo)
        
        # Uploading Artifacts to LakeFS
        print(f"Uploading artifacts to '{branch_name}'...")
        handler.upload_file(cm_file, branch_name, f'results/viz/{algo}_cm.png', 'Confusion Matrix')
        handler.upload_file(roc_file, branch_name, f'results/viz/{algo}_roc.png', 'ROC Curve')
        handler.upload_file(pr_file, branch_name, f'results/viz/{algo}_pr.png', 'PR Curve')
        
        # Clean up local files
        os.remove(cm_file)
        os.remove(roc_file)
        os.remove(pr_file)
        print(f"Experiment {algo} complete!")
        
print("\n=== TOURNAMENT COMPLETE ===")



=== Running Experiment: Logistic Regression ===
Branch 'exp-lr' created.
Training LR...
              precision    recall  f1-score   support

           0       1.00      0.97      0.99     56864
           1       0.06      0.92      0.10        98

    accuracy                           0.97     56962
   macro avg       0.53      0.95      0.55     56962
weighted avg       1.00      0.97      0.98     56962

Uploading artifacts to 'exp-lr'...
Uploading to branch 'exp-lr' at path 'results/viz/lr_cm.png'...
Committing: Confusion Matrix
Uploading to branch 'exp-lr' at path 'results/viz/lr_roc.png'...
Committing: ROC Curve
Uploading to branch 'exp-lr' at path 'results/viz/lr_pr.png'...
Committing: PR Curve
Experiment lr complete!


=== Running Experiment: Random Forest ===
Branch 'exp-rf' created.
Training RF...
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56864
           1       0.86      0.84      0.85        98

    accurac

## Phase 3: Results & Leaderboard

### **Aim**
To aggregate metrics from all 8 branches and identify the superior model based on the **F1-Score for Fraud** (Class 1).

### **Conclusion & Inferences**
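* **Winning Model:** The **Random Forest** model (F1 ≈ 0.85) was the top performer, ahead of XGBoost (≈ 0.82). A plausible explanation is that tree ensembles capture non-linear interactions and can split directly on features such as the engineered `Hour`, although this notebook does not test that explanation in isolation.
* **Ensemble Strategy:** The **Power Ensemble** (tree-based only, F1 ≈ 0.82) scored higher than the **Basic Ensemble** (≈ 0.78), which suggests that including weaker linear models (like Logistic Regression) dilutes overall performance. Neither ensemble beat Random Forest on its own, and the tuned XGBoost (≈ 0.77) scored below the default XGBoost, so tuning did not help on this split.
* **Model Constraints:** The **Logistic Regression** baseline struggled (F1 ≈ 0.10). Its recall is high (0.92) but its precision is only 0.06, so the low score comes from a flood of false positives rather than from missed fraud. This is consistent with a decision boundary between "Fraud" and "Legit" that a linear model cannot separate cleanly, particularly when it is trained on SMOTE-balanced data and evaluated on the original imbalanced test set.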
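
The gap between the Logistic Regression and Random Forest scores is easier to read from raw counts. In the sketch below, the true-positive, false-positive, and false-negative counts are back-calculated so that they reproduce the printed reports and leaderboard values; the notebook never logs them directly, so treat them as illustrative. `f1_from_counts` is a hypothetical helper.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 for the positive class: harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Counts inferred from the rounded reports (98 fraud rows in the test set); assumptions, not logged values.
lr_f1 = f1_from_counts(tp=90, fp=1532, fn=8)    # catches most fraud, but with many false alarms
rf_f1 = f1_from_counts(tp=82, fp=13, fn=16)     # misses a few more, with far fewer false alarms

print(f"Logistic Regression F1: {lr_f1:.6f}")
print(f"Random Forest F1:       {rf_f1:.6f}")
```

Under these counts, Logistic Regression finds more fraud cases than Random Forest but raises over a hundred times as many false alarms, which is what drags its F1 down.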

In [6]:
print("\n\n=== üèÜ FINAL TOURNAMENT LEADERBOARD üèÜ ===")
leaderboard_df = pd.DataFrame(tournament_results)
leaderboard_df = leaderboard_df.sort_values(by='Fraud F1-Score', ascending=False).reset_index(drop=True)

print(leaderboard_df)



=== 🏆 FINAL TOURNAMENT LEADERBOARD 🏆 ===
                               Model  Fraud F1-Score
0                      Random Forest        0.849741
1                            XGBoost        0.822967
2  Power Ensemble (Tree Models Only)        0.821256
3               Basic Ensemble (All)        0.780269
4     XGBoost (Hyperparameter Tuned)        0.770642
5                           LightGBM        0.699187
6                     Neural Network        0.588235
7                Logistic Regression        0.104651


## Upload the Leaderboard to Main Branch

In [7]:
print("\nSaving Leaderboard to LakeFS (main branch)...")
csv_buffer = io.StringIO()
leaderboard_df.to_csv(csv_buffer, index=False)
handler._upload_and_commit(
    branch='main',
    path='results/final_leaderboard.csv',
    content=io.BytesIO(csv_buffer.getvalue().encode('utf-8')),
    message='Added Final Tournament Leaderboard',
)


Saving Leaderboard to LakeFS (main branch)...
Uploading to branch 'main' at path 'results/final_leaderboard.csv'...
Committing: Added Final Tournament Leaderboard
