# Credit Card Fraud Detection – Supervised Learning Project

## 1. Introduction

### 1.1 Problem Statement

Credit card fraud causes massive financial losses each year. The goal of this project is to build a **supervised machine learning model** that can detect fraudulent transactions based on anonymized transaction features.

We will use the **Credit Card Fraud Detection** dataset (European cardholders, September 2013), which poses a **highly imbalanced binary classification problem**:

- **Class 0** – legitimate transaction  
- **Class 1** – fraudulent transaction  

Our objectives:

- Perform **Exploratory Data Analysis (EDA)** to understand the data distribution and class imbalance.
- Build several supervised ML models (e.g., Logistic Regression, Random Forest).
- Handle **severe class imbalance** using techniques like class weighting (and optionally resampling).
- Evaluate models using appropriate metrics (Precision, Recall, F1, ROC-AUC, PR curve).
- Compare models and discuss trade-offs.


## 2. Data Loading & Overview

In [None]:
# 2. Setup & Imports

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    roc_auc_score,
    roc_curve,
    precision_recall_curve,
    average_precision_score
)

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from collections import OrderedDict

sns.set_theme(style="whitegrid")  # set_theme is the current name; sns.set is an alias for it
plt.rcParams["figure.figsize"] = (8, 5)

In [None]:
# 2.1 Load dataset

# Adjust this path if needed
DATA_PATH = "data/creditcard.csv"

df = pd.read_csv(DATA_PATH)
df.head()

In [None]:
# 2.2 Basic info and summary

df.info()

In [None]:
df.describe().T

### 2.3 Dataset Description

- Number of rows (transactions): `df.shape[0]`  
- Number of columns (features + target): `df.shape[1]`  
- Features:
  - `Time` – seconds elapsed between this transaction and the first transaction in the dataset  
  - `Amount` – transaction amount  
  - `V1` … `V28` – numeric features obtained by PCA transformation (anonymized)  
- Target:
  - `Class` – 0 for legitimate transactions, 1 for fraudulent

---

In [None]:
# 2.4 Class distribution

class_counts = df["Class"].value_counts()
class_ratio = df["Class"].value_counts(normalize=True)

print("Class counts:\n", class_counts)
print("\nClass ratio:\n", class_ratio)

sns.barplot(x=class_counts.index, y=class_counts.values)
plt.title("Class Distribution (0 = Non-fraud, 1 = Fraud)")
plt.xlabel("Class")
plt.ylabel("Count")
plt.show()

**Observation:**

- The dataset is **heavily imbalanced**.
- Fraud (Class = 1) makes up only about 0.17% of transactions (492 of 284,807 in the Kaggle version of the dataset).
- This has important implications for modeling and evaluation:
  - Accuracy alone is misleading.
  - We must pay attention to **precision, recall, F1-score**, and **ROC/PR curves**.
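
To make the point concrete, here is a minimal sketch on made-up labels (not the real dataset) showing how accuracy can look excellent while precision, recall, and F1 tell a different story. The counts and the `toy_` names are assumptions chosen purely for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical toy labels: 1,000 transactions, 10 of them fraud (1%).
toy_true = np.zeros(1000, dtype=int)
toy_true[:10] = 1

# Hypothetical model output: catches 6 of the 10 frauds and raises 4 false alarms.
toy_pred = np.zeros(1000, dtype=int)
toy_pred[:6] = 1      # 6 true positives (4 frauds are missed)
toy_pred[10:14] = 1   # 4 false positives

toy_accuracy = accuracy_score(toy_true, toy_pred)    # (6 + 986) / 1000
toy_precision = precision_score(toy_true, toy_pred)  # 6 / (6 + 4)
toy_recall = recall_score(toy_true, toy_pred)        # 6 / (6 + 4)
toy_f1 = f1_score(toy_true, toy_pred)                # harmonic mean of the two

print(f"accuracy={toy_accuracy:.3f}  precision={toy_precision:.2f}  "
      f"recall={toy_recall:.2f}  f1={toy_f1:.2f}")
```

Accuracy stays above 99% here even though 4 of the 10 frauds are missed, which is why the later sections report fraud-class precision and recall rather than accuracy.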

---

## 3. Exploratory Data Analysis (EDA)


In [None]:
# 3.1 Distribution of 'Amount'

sns.histplot(df["Amount"], bins=50, kde=True)
plt.title("Distribution of Transaction Amount")
plt.xlabel("Amount")
plt.ylabel("Count")
plt.show()

In [None]:
# 3.2 Distribution of 'Time'

sns.histplot(df["Time"], bins=50, kde=True)
plt.title("Distribution of Time (seconds from first transaction)")
plt.xlabel("Time")
plt.ylabel("Count")
plt.show()

In [None]:
# 3.3 Amount vs Class

sns.boxplot(x="Class", y="Amount", data=df)
plt.title("Transaction Amount by Class")
plt.xlabel("Class")
plt.ylabel("Amount")
plt.show()

**Observations:**

- Many transactions have relatively small amounts, with a long tail of larger transactions.
- Fraudulent transactions may follow a different amount distribution than legitimate ones; compare the two boxes in the plot above before drawing conclusions.
- `Time` may show some periodic patterns, but it is not necessarily predictive on its own.
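
A numeric summary per class can back up what the boxplot suggests. The sketch below uses a small synthetic frame (`toy_df`, an assumption for illustration); on the real data the same call would be `df.groupby("Class")["Amount"].describe()`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: skewed amounts, 990 legitimate rows and 10 fraud rows.
toy_df = pd.DataFrame({
    "Amount": np.concatenate([
        rng.lognormal(mean=3.0, sigma=1.5, size=990),
        rng.lognormal(mean=3.5, sigma=1.8, size=10),
    ]),
    "Class": [0] * 990 + [1] * 10,
})

# Count, mean, quartiles, and extremes of Amount for each class.
amount_by_class = toy_df.groupby("Class")["Amount"].describe()
print(amount_by_class[["count", "mean", "50%", "max"]])
```

Comparing medians (`50%`) rather than means is usually safer here, because the long right tail pulls the mean upward.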

---

In [None]:
# 3.4 Correlation matrix (features only, excluding 'Class')

corr = df.drop(columns=["Class"]).corr()

plt.figure(figsize=(12, 10))
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.title("Correlation Heatmap of Features")
plt.show()

**Observations:**

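- The `V` features are PCA components, so they are mutually uncorrelated by construction; their off-diagonal entries in the heatmap should be close to zero.
- `Amount` and `Time` were not PCA-transformed, so they can still correlate with some of the `V` components.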
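
The claim about PCA components can be checked on synthetic data. This is only a sketch: the mixing matrix and the `pca_` names are made up, and the dataset's actual PCA transformation is not public.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Build three correlated columns by mixing independent normal columns.
pca_base = rng.normal(size=(500, 3))
pca_mix = np.array([[1.0, 0.8, 0.2],
                    [0.0, 1.0, 0.5],
                    [0.0, 0.0, 1.0]])
pca_input = pca_base @ pca_mix

# Project onto the principal components.
pca_components = PCA(n_components=3).fit_transform(pca_input)

corr_before = np.corrcoef(pca_input, rowvar=False)
corr_after = np.corrcoef(pca_components, rowvar=False)

# Largest absolute correlation between two *different* components.
max_off_diag = np.abs(corr_after[~np.eye(3, dtype=bool)]).max()
print("corr(col0, col1) before PCA:", round(corr_before[0, 1], 3))
print("max off-diagonal |corr| after PCA:", max_off_diag)
```

The original columns are clearly correlated, while the components are uncorrelated up to floating-point error.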

---

## 4. Data Preparation & Preprocessing


In [None]:
# 4.1 Check for missing values

df.isna().sum()

The Kaggle credit card dataset is expected to have **no missing values**; this check confirms it.

---

### 4.2 Train–Test Split (with Stratification)

We will:
- Use `Class` as the target variable.
- Use all other columns as features.
- Perform a stratified split to preserve the fraud ratio in both train and test sets.
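
As a quick illustration of what stratification buys us, the sketch below splits made-up labels with 2% positives and checks that both parts keep that ratio. The `strat_` names and the toy labels are assumptions, not part of the project data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 1,000 rows, 20 positives (2%).
strat_y = np.array([1] * 20 + [0] * 980)
strat_X = np.arange(1000).reshape(-1, 1)

strat_X_tr, strat_X_te, strat_y_tr, strat_y_te = train_test_split(
    strat_X, strat_y, test_size=0.2, random_state=42, stratify=strat_y
)

# With stratify=..., each part keeps the overall positive rate.
print("train positive rate:", strat_y_tr.mean())
print("test positive rate: ", strat_y_te.mean())
```

Without `stratify`, a random 20% sample of such rare positives could easily contain too few of them to evaluate recall reliably.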


In [None]:
X = df.drop("Class", axis=1)
y = df["Class"]

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

X_train.shape, X_test.shape

### 4.3 Scaling Features

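- The PCA features (`V1`–`V28`) are already centered by the PCA transformation, although their variances differ.
- `Amount` and `Time` are still on their raw scales.
- We apply `StandardScaler` inside a `Pipeline` so that all numeric features are scaled consistently for models like Logistic Regression, and so that the scaler is fitted on the training data only.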
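
The following sketch, on synthetic data with hypothetical `leak_` names, shows the behaviour we rely on: when the scaler lives inside the `Pipeline`, its statistics come from the data passed to `fit` and are merely reused at prediction time.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic, unscaled features with a simple label rule.
leak_X_tr = rng.normal(loc=100.0, scale=50.0, size=(200, 2))
leak_y_tr = (leak_X_tr[:, 0] > 100.0).astype(int)
leak_X_te = rng.normal(loc=100.0, scale=50.0, size=(50, 2))

leak_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
leak_pipe.fit(leak_X_tr, leak_y_tr)

# The fitted scaler holds the *training* means; the test rows never influenced them.
leak_scaler_mean = leak_pipe.named_steps["scaler"].mean_
leak_pred = leak_pipe.predict(leak_X_te)  # test rows are scaled with the training statistics
print("scaler means:", leak_scaler_mean)
```

The same property holds inside cross-validation: each fold refits the scaler on that fold's training portion.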

---

In [None]:
# 4.4 Quick baseline: majority class accuracy for reference

majority_class = y_train.mode()[0]
# Predict the majority class for every test row
baseline_pred = np.full(len(y_test), majority_class)

print(f"Baseline accuracy (predict all {majority_class}):", np.mean(baseline_pred == y_test))
print("Fraud rate in test set:", y_test.mean())

Predicting **non-fraud (0)** for every transaction already yields very high accuracy, yet it catches no fraudulent cases at all (recall of 0 on the fraud class).  
This shows **why we must not rely solely on accuracy** in imbalanced settings.
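
The same baseline can be expressed with scikit-learn's `DummyClassifier`, which makes it easy to report recall alongside accuracy. The toy labels below (5 frauds in 1,000 rows) and the `dummy_` names are assumptions for illustration only.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Toy labels: 5 frauds among 1,000 transactions; the features are irrelevant to this baseline.
dummy_y = np.array([1] * 5 + [0] * 995)
dummy_X = np.zeros((1000, 1))

# Always predicts the most frequent class seen during fit (here: 0).
dummy_clf = DummyClassifier(strategy="most_frequent").fit(dummy_X, dummy_y)
dummy_pred = dummy_clf.predict(dummy_X)

dummy_accuracy = accuracy_score(dummy_y, dummy_pred)
dummy_recall = recall_score(dummy_y, dummy_pred, zero_division=0)
print(f"accuracy={dummy_accuracy:.3f}, fraud recall={dummy_recall:.1f}")
```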

---

## 5. Modeling & Evaluation

We will start with:

1. Logistic Regression (with class weights)
2. Random Forest (with class weights)
3. Optional: Hyperparameter tuning with GridSearchCV

We will use:
- Confusion matrix
- Precision, recall, F1-score
- ROC-AUC
- Precision–Recall curve


In [None]:
from sklearn.metrics import (
    ConfusionMatrixDisplay,
    PrecisionRecallDisplay
)

### 5.1 Logistic Regression (with class_weight='balanced')
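
Before fitting, it helps to see what `class_weight="balanced"` does. Scikit-learn documents the weights as `n_samples / (n_classes * np.bincount(y))`; the sketch below reproduces that on made-up labels (the `cw_` names are hypothetical) and compares the result with `compute_class_weight`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels: 990 legitimate (0) and 10 fraud (1).
cw_y = np.array([0] * 990 + [1] * 10)
cw_classes = np.array([0, 1])

# Manual formula: n_samples / (n_classes * count_per_class)
cw_manual = len(cw_y) / (len(cw_classes) * np.bincount(cw_y))

# Same weights via scikit-learn's helper.
cw_sklearn = compute_class_weight(class_weight="balanced", classes=cw_classes, y=cw_y)

print("manual: ", cw_manual)
print("sklearn:", cw_sklearn)
```

With these counts, each fraud example weighs roughly 100 times as much as a legitimate one in the loss, which pushes the model toward higher recall at the cost of more false positives.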

In [None]:
# 5.1 Logistic Regression Pipeline

log_reg = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(
        solver="lbfgs",
        max_iter=1000,
        class_weight="balanced"
    ))
])

log_reg.fit(X_train, y_train)

In [None]:
# Evaluate Logistic Regression

y_pred_lr = log_reg.predict(X_test)
y_proba_lr = log_reg.predict_proba(X_test)[:, 1]

print("=== Logistic Regression (class_weight='balanced') ===\n")
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_lr))
print("\nClassification Report:")
print(classification_report(y_test, y_pred_lr, digits=4))

roc_auc_lr = roc_auc_score(y_test, y_proba_lr)
print("ROC-AUC:", roc_auc_lr)

In [None]:
# ROC Curve for Logistic Regression

fpr_lr, tpr_lr, _ = roc_curve(y_test, y_proba_lr)

plt.figure(figsize=(6, 5))
plt.plot(fpr_lr, tpr_lr, label=f"LogReg (AUC = {roc_auc_lr:.3f})")
plt.plot([0, 1], [0, 1], "k--", label="Random")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve – Logistic Regression")
plt.legend()
plt.grid(True)
plt.show()

In [None]:
# Precision-Recall Curve for Logistic Regression

precision_lr, recall_lr, _ = precision_recall_curve(y_test, y_proba_lr)
ap_lr = average_precision_score(y_test, y_proba_lr)

plt.figure(figsize=(6, 5))
plt.plot(recall_lr, precision_lr, label=f"LogReg (AP = {ap_lr:.3f})")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall Curve – Logistic Regression")
plt.legend()
plt.grid(True)
plt.show()

### 5.2 Random Forest Classifier

In [None]:
# 5.2 Random Forest Classifier

rf = RandomForestClassifier(
    n_estimators=200,
    max_depth=None,
    random_state=42,
    n_jobs=-1,
    class_weight="balanced_subsample"
)

rf.fit(X_train, y_train)

In [None]:
# Evaluate Random Forest

y_pred_rf = rf.predict(X_test)
y_proba_rf = rf.predict_proba(X_test)[:, 1]

print("=== Random Forest (class_weight='balanced_subsample') ===\n")
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_rf))
print("\nClassification Report:")
print(classification_report(y_test, y_pred_rf, digits=4))

roc_auc_rf = roc_auc_score(y_test, y_proba_rf)
print("ROC-AUC:", roc_auc_rf)

In [None]:
# ROC Curve – Random Forest vs Logistic Regression

fpr_rf, tpr_rf, _ = roc_curve(y_test, y_proba_rf)

plt.figure(figsize=(6, 5))
plt.plot(fpr_lr, tpr_lr, label=f"LogReg (AUC = {roc_auc_lr:.3f})")
plt.plot(fpr_rf, tpr_rf, label=f"Random Forest (AUC = {roc_auc_rf:.3f})")
plt.plot([0, 1], [0, 1], "k--", label="Random")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve – Model Comparison")
plt.legend()
plt.grid(True)
plt.show()

In [None]:
# Precision-Recall Curve – Random Forest vs Logistic Regression

precision_rf, recall_rf, _ = precision_recall_curve(y_test, y_proba_rf)
ap_rf = average_precision_score(y_test, y_proba_rf)

plt.figure(figsize=(6, 5))
plt.plot(recall_lr, precision_lr, label=f"LogReg (AP = {ap_lr:.3f})")
plt.plot(recall_rf, precision_rf, label=f"Random Forest (AP = {ap_rf:.3f})")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall Curve – Model Comparison")
plt.legend()
plt.grid(True)
plt.show()

### 5.3 Hyperparameter Tuning (Example: Logistic Regression)

In [None]:
# 5.3 Hyperparameter Tuning for Logistic Regression

param_grid_lr = {
    "clf__C": [0.01, 0.1, 1, 10],
    "clf__penalty": ["l2"]
}

grid_log_reg = GridSearchCV(
    estimator=log_reg,
    param_grid=param_grid_lr,
    scoring="f1",
    cv=5,
    n_jobs=-1,
    verbose=1
)

grid_log_reg.fit(X_train, y_train)

In [None]:
print("Best parameters (LogReg):", grid_log_reg.best_params_)
print("Best CV F1-score:", grid_log_reg.best_score_)

In [None]:
# Evaluate tuned Logistic Regression on test set

best_lr = grid_log_reg.best_estimator_

y_pred_best_lr = best_lr.predict(X_test)
y_proba_best_lr = best_lr.predict_proba(X_test)[:, 1]

print("=== Tuned Logistic Regression ===\n")
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_best_lr))
print("\nClassification Report:")
print(classification_report(y_test, y_pred_best_lr, digits=4))

roc_auc_best_lr = roc_auc_score(y_test, y_proba_best_lr)
print("ROC-AUC:", roc_auc_best_lr)

## 6. Model Comparison

In [None]:
results = OrderedDict()

results["LogReg (balanced)"] = {
    "ROC-AUC": roc_auc_lr,
    "AP": ap_lr
}
results["Random Forest (balanced_subsample)"] = {
    "ROC-AUC": roc_auc_rf,
    "AP": ap_rf
}
results["Tuned LogReg"] = {
    "ROC-AUC": roc_auc_best_lr,
    "AP": average_precision_score(y_test, y_proba_best_lr)
}

results_df = pd.DataFrame(results).T
results_df

**Interpretation:**

- Compare models by **ROC-AUC** and **Average Precision (AP)**.
- Discuss:
  - Which model has better recall on fraud cases?
  - Which model has higher precision?
  - Is there a trade-off? Which model is preferable in a real-world fraud setting?
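
One way to make the precision/recall trade-off actionable is to choose the decision threshold from the PR curve instead of using the default 0.5. The sketch below uses invented scores and an assumed recall target of 0.9; both are placeholders, and in this notebook the inputs would be `y_test` and one of the `y_proba_*` arrays (ideally on a validation split rather than the test set).

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Invented labels and predicted fraud probabilities.
thr_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
thr_scores = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.70, 0.90])

thr_precision, thr_recall, thr_values = precision_recall_curve(thr_true, thr_scores)

# precision/recall have one more entry than thresholds; drop the final (1, 0) point.
target_recall = 0.9  # assumed business requirement
meets_target = thr_recall[:-1] >= target_recall

# Among thresholds that reach the recall target, take the one with the highest precision.
best_idx = np.argmax(np.where(meets_target, thr_precision[:-1], -1.0))
chosen_threshold = thr_values[best_idx]

thr_pred = (thr_scores >= chosen_threshold).astype(int)
print("chosen threshold:", chosen_threshold)
print("predictions:", thr_pred)
```

Lowering the threshold raises recall and lowers precision; the right operating point depends on the relative cost of a missed fraud versus a false alarm.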

---

## 7. Conclusion & Future Work

### 7.1 Summary

- We tackled a **highly imbalanced binary classification** problem: credit card fraud detection.
- We performed EDA:
  - Verified strong class imbalance.
  - Inspected distributions of Time and Amount.
  - Looked at correlations between features.
- We built and evaluated multiple models:
  - Logistic Regression with class weighting.
  - Random Forest with class weighting.
  - Tuned Logistic Regression with GridSearchCV.

Key observations:

- Class imbalance makes **accuracy misleading**.
- Models must be judged using **precision, recall, F1-score, ROC-AUC, and PR curves**.
- Random Forest often provides strong performance out-of-the-box, while Logistic Regression is more interpretable.
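
To illustrate the interpretability point, the coefficients of a fitted Logistic Regression pipeline can be read directly. This sketch uses synthetic features named `signal` and `noise` (hypothetical); with the notebook's pipeline, the equivalent lookup would be `log_reg.named_steps["clf"].coef_[0]` indexed by `X_train.columns`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: the label depends on "signal" only.
coef_X = pd.DataFrame({
    "signal": rng.normal(size=400),
    "noise": rng.normal(size=400),
})
coef_y = (coef_X["signal"] + 0.3 * rng.normal(size=400) > 0).astype(int)

coef_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
]).fit(coef_X, coef_y)

# One coefficient per (scaled) feature; sort by absolute size.
coefs = pd.Series(coef_pipe.named_steps["clf"].coef_[0], index=coef_X.columns)
coefs = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(coefs)
```

Because the features are standardized, the magnitudes are comparable across features. For the anonymized `V` components, however, a large coefficient still does not reveal which original variable drives it.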

### 7.2 Limitations

- Features `V1`–`V28` are anonymized PCA components → limited interpretability.
- Dataset represents a specific time window and region; generalization to other domains may require retraining.

### 7.3 Future Work

- Experiment with:
  - **SMOTE** or other advanced resampling techniques.
  - More advanced models (e.g., XGBoost, LightGBM).
  - Cost-sensitive learning (different misclassification costs for fraud vs non-fraud).
- Calibrate predicted probabilities (e.g., via Platt scaling or isotonic regression).
- Deploy as an API or web app for real-time fraud scoring.
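
As a starting point for the calibration item, scikit-learn provides `CalibratedClassifierCV`. The sketch below is not tuned for this project: it uses a synthetic imbalanced dataset from `make_classification`, a small forest, and isotonic regression, all of which are assumptions for illustration.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic imbalanced data (about 5% positives), standing in for the real dataset.
cal_X, cal_y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)

# Wrap the classifier; each of the 3 folds fits a forest plus an isotonic calibrator.
calibrated_rf = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    method="isotonic",  # use method="sigmoid" for Platt scaling
    cv=3,
)
calibrated_rf.fit(cal_X, cal_y)

calibrated_proba = calibrated_rf.predict_proba(cal_X)[:, 1]
print("mean predicted fraud probability:", calibrated_proba.mean())
print("observed positive rate:          ", cal_y.mean())
```

Calibration matters when scores are used as probabilities, for example to compute expected loss per transaction; it does not by itself change the ranking-based metrics much.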

---

## 8. References

- Kaggle – Credit Card Fraud Detection dataset  
- Scikit-learn documentation  
- Imbalanced classification resources
