# Fraud Detection (PaySim-like dataset)

This notebook builds a fraud detection workflow for a PaySim-like transaction dataset (described as 744 hourly steps; the loaded file contains steps 1 to 743) with columns `step`, `type`, `amount`, `nameOrig`, `oldbalanceOrg`, `newbalanceOrig`, `nameDest`, `oldbalanceDest`, `newbalanceDest`, `isFraud`, `isFlaggedFraud`. It covers data loading with sanity checks for missing values and outliers; feature engineering for amounts, balance deltas, and time signals; a stratified train/validation split with class-imbalance handling; a baseline logistic regression (an earlier tree-model comparison was removed); threshold tuning with a cost-based operating point; and saving the trained model and threshold for later scoring. The closing section answers the assignment questions, including which fraud signals matter.

In [1]:
import os
import numpy as np
import pandas as pd
from pathlib import Path

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    roc_auc_score,
    precision_recall_curve,
    average_precision_score,
)
from sklearn.linear_model import LogisticRegression
import joblib

import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option("display.max_columns", None)
RANDOM_STATE = 42

In [2]:
df = pd.read_csv("Fraud.csv")
print(df.shape)
df.head()

(6362620, 11)


Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,1,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0,0
1,1,PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0,0
2,1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1,0
3,1,CASH_OUT,181.0,C840083671,181.0,0.0,C38997010,21182.0,0.0,1,0
4,1,PAYMENT,11668.14,C2048537720,41554.0,29885.86,M1230701703,0.0,0.0,0,0


In [3]:
# Basic sanity checks
print("Columns:", df.columns.tolist())
print("Dtypes:\n", df.dtypes)
print("Class balance isFraud:\n", df["isFraud"].value_counts(normalize=True).rename("pct"))
print("Flagged rate isFlaggedFraud:\n", df["isFlaggedFraud"].value_counts(normalize=True).rename("pct"))

# Missing values
missing = df.isna().mean().sort_values(ascending=False)
print("Missing fraction:\n", missing)

# Quick numeric summary
numeric_cols = df.select_dtypes(include=[np.number]).columns
summary = df[numeric_cols].describe(percentiles=[0.5, 0.9, 0.99]).T
summary

Columns: ['step', 'type', 'amount', 'nameOrig', 'oldbalanceOrg', 'newbalanceOrig', 'nameDest', 'oldbalanceDest', 'newbalanceDest', 'isFraud', 'isFlaggedFraud']
Dtypes:
 step                int64
type               object
amount            float64
nameOrig           object
oldbalanceOrg     float64
newbalanceOrig    float64
nameDest           object
oldbalanceDest    float64
newbalanceDest    float64
isFraud             int64
isFlaggedFraud      int64
dtype: object
Class balance isFraud:
 isFraud
0    0.998709
1    0.001291
Name: pct, dtype: float64
Flagged rate isFlaggedFraud:
 isFlaggedFraud
0    0.999997
1    0.000003
Name: pct, dtype: float64
Missing fraction:
 step              0.0
type              0.0
amount            0.0
nameOrig          0.0
oldbalanceOrg     0.0
newbalanceOrig    0.0
nameDest          0.0
oldbalanceDest    0.0
newbalanceDest    0.0
isFraud           0.0
isFlaggedFraud    0.0
dtype: float64


Unnamed: 0,count,mean,std,min,50%,90%,99%,max
step,6362620.0,243.3972,142.332,1.0,239.0,399.0,681.0,743.0
amount,6362620.0,179861.9,603858.2,0.0,74871.94,365423.309,1615979.0,92445520.0
oldbalanceOrg,6362620.0,833883.1,2888243.0,0.0,14208.0,1822508.289,16027260.0,59585040.0
newbalanceOrig,6362620.0,855113.7,2924049.0,0.0,0.0,1970344.793,16176160.0,49585040.0
oldbalanceDest,6362620.0,1100702.0,3399180.0,0.0,132705.665,2914266.669,12371820.0,356015900.0
newbalanceDest,6362620.0,1224996.0,3674129.0,0.0,214661.44,3194869.671,13137870.0,356179300.0
isFraud,6362620.0,0.00129082,0.0359048,0.0,0.0,0.0,0.0,1.0
isFlaggedFraud,6362620.0,2.514687e-06,0.001585775,0.0,0.0,0.0,0.0,1.0


In [4]:
# Basic cleaning / consistency checks

# Drop exact duplicate rows
before = len(df)
df = df.drop_duplicates()
print(f"Dropped {before - len(df)} duplicate rows")

# Ensure expected categorical values
print("Transaction types:", df["type"].unique())

# Balance deltas (helps spot inconsistencies and becomes a feature)
df["orig_balance_delta"] = df["oldbalanceOrg"] - df["newbalanceOrig"] - df["amount"]
df["dest_balance_delta"] = df["newbalanceDest"] - df["oldbalanceDest"] - df["amount"]

# Flag negative balances (could be noise; keep but mark)
df["has_neg_balance"] = ((df[["oldbalanceOrg", "newbalanceOrig", "oldbalanceDest", "newbalanceDest"]] < 0).any(axis=1)).astype(int)
print("Rows with any negative balance:", df["has_neg_balance"].mean())

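# Outlier indicator for very large amounts (beyond 99.5th percentile).
# Note: the cap is computed on the full dataset, before the train/validation split;
# computing it on the training fold only would avoid mild leakage into validation.
amount_cap = df["amount"].quantile(0.995)
df["amount_outlier"] = (df["amount"] > amount_cap).astype(int)
print("Amount cap (99.5th):", amount_cap)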

Dropped 0 duplicate rows
Transaction types: ['PAYMENT' 'TRANSFER' 'CASH_OUT' 'DEBIT' 'CASH_IN']
Rows with any negative balance: 0.0
Amount cap (99.5th): 2437745.6404500105
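
To make the balance-delta check concrete, the sketch below recomputes both deltas for three of the rows shown in the `df.head()` output above. A delta of zero means the recorded balances reconcile with the transaction amount. The frame name `sample` is introduced only for this illustration.

```python
import pandas as pd

# Three rows copied from the df.head() output above (original indices 0, 2, 3).
sample = pd.DataFrame(
    {
        "type": ["PAYMENT", "TRANSFER", "CASH_OUT"],
        "amount": [9839.64, 181.0, 181.0],
        "oldbalanceOrg": [170136.0, 181.0, 181.0],
        "newbalanceOrig": [160296.36, 0.0, 0.0],
        "oldbalanceDest": [0.0, 0.0, 21182.0],
        "newbalanceDest": [0.0, 0.0, 0.0],
        "isFraud": [0, 1, 1],
    }
)

# Same formulas as the cleaning cell: zero means the books balance.
sample["orig_balance_delta"] = (
    sample["oldbalanceOrg"] - sample["newbalanceOrig"] - sample["amount"]
)
sample["dest_balance_delta"] = (
    sample["newbalanceDest"] - sample["oldbalanceDest"] - sample["amount"]
)

print(sample[["type", "isFraud", "orig_balance_delta", "dest_balance_delta"]].round(2))
```

In these three rows the origin side reconciles every time, while the destination side does not: the payment goes to an `M...` destination with zero recorded balances, and the cash-out destination's balance falls to zero even though it received funds. Three rows are far too few to draw conclusions from, but they show why the destination delta is worth keeping as a feature.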


In [None]:
# Feature engineering

# Time signals
df["hour"] = df["step"] % 24
df["day"] = (df["step"] // 24).clip(upper=30)

# Log features (no ratio features are built here; only log1p_amount is used below)
for col in ["amount", "oldbalanceOrg", "newbalanceOrig", "oldbalanceDest", "newbalanceDest"]:
    df[f"log1p_{col}"] = np.log1p(df[col].clip(lower=0))

# Extract a merchant-destination flag from nameDest; the high-cardinality
# ID columns (nameOrig, nameDest) themselves are left out of feature_cols
df["is_merchant_dest"] = df["nameDest"].str.startswith("M").astype(int)

feature_cols = [
    "hour", "day", "amount", "log1p_amount",
    "oldbalanceOrg", "newbalanceOrig", "oldbalanceDest", "newbalanceDest",
    "orig_balance_delta", "dest_balance_delta",
    "has_neg_balance", "amount_outlier", "is_merchant_dest",
]
target_col = "isFraud"

X = df[feature_cols + ["type"]]  # keep type for encoding
y = df[target_col]

categorical = ["type"]
numerical = [c for c in X.columns if c not in categorical]

In [None]:
# Split, preprocess, and train a simple baseline
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=RANDOM_STATE
)

preprocess = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
        ("num", Pipeline([("scaler", StandardScaler())]), numerical),
    ]
)

log_reg_clf = Pipeline(
    steps=[
        ("preprocess", preprocess),
        (
            "model",
            LogisticRegression(
                max_iter=200,
                class_weight="balanced",
                n_jobs=-1,
                penalty="l2",
            ),
        ),
    ]
)

log_reg_clf.fit(X_train, y_train)

val_pred_proba = log_reg_clf.predict_proba(X_val)[:, 1]
val_pred = (val_pred_proba >= 0.5).astype(int)

print("ROC-AUC:", roc_auc_score(y_val, val_pred_proba))
print("Average precision (PR-AUC):", average_precision_score(y_val, val_pred_proba))
print(classification_report(y_val, val_pred, digits=4))
print("Confusion matrix:\n", confusion_matrix(y_val, val_pred))



In [None]:
# Threshold tuning (use logistic regression scores by default)
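precision, recall, thresholds = precision_recall_curve(y_val, val_pred_proba)
# precision/recall have one more entry than thresholds (a final point at recall=0),
# so drop that last point to keep best_idx a valid index into thresholds.
f1_scores = 2 * (precision[:-1] * recall[:-1]) / (precision[:-1] + recall[:-1] + 1e-9)
best_idx = np.argmax(f1_scores)
best_threshold = thresholds[best_idx]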

print(f"Best F1 @ threshold {best_threshold:.4f}: precision={precision[best_idx]:.4f}, recall={recall[best_idx]:.4f}")

# Apply tuned threshold
val_pred_tuned = (val_pred_proba >= best_threshold).astype(int)
print(classification_report(y_val, val_pred_tuned, digits=4))

In [None]:
# Sweep a few practical thresholds and plot PR
thresholds_to_try = [0.995, 0.997, 0.999, 0.9995, 0.9999, float(best_threshold)]
rows = []
for t in thresholds_to_try:
    preds = (val_pred_proba >= t).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_val, preds).ravel()
    precision_t = tp / (tp + fp + 1e-9)
    recall_t = tp / (tp + fn + 1e-9)
    f1_t = 2 * precision_t * recall_t / (precision_t + recall_t + 1e-9)
    alerts = tp + fp
    rows.append(
        {
            "threshold": round(float(t), 6),
            "precision": precision_t,
            "recall": recall_t,
            "f1": f1_t,
            "alerts": alerts,
            "alert_rate": alerts / len(y_val),
        }
    )

sweep_df = pd.DataFrame(rows).sort_values("threshold")
display(sweep_df)

plt.figure(figsize=(6, 5))
plt.plot(recall, precision, label="PR curve")
plt.scatter(recall[best_idx], precision[best_idx], color="red", s=50, label=f"best F1 @ {best_threshold:.4f}")
for t in thresholds_to_try:
    idx = np.argmin(np.abs(thresholds - t))
    plt.scatter(recall[idx], precision[idx], s=30, label=f"t={t}")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall with selected thresholds")
plt.legend(loc="lower left")
plt.grid(True)
plt.show()



In [None]:
# Simple cost view (edit costs to match your business)
cost_fp = 1.0   # cost of reviewing/annoying a good user
cost_fn = 10.0  # cost of missing a fraud

rows_cost = []
for t in sweep_df["threshold"]:
    preds = (val_pred_proba >= t).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_val, preds).ravel()
    cost = cost_fp * fp + cost_fn * fn
    rows_cost.append(
        {
            "threshold": t,
            "fp": fp,
            "fn": fn,
            "tp": tp,
            "tn": tn,
            "cost": cost,
        }
    )

cost_df = pd.DataFrame(rows_cost).sort_values("cost")
display(cost_df)
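
The cost view above only evaluates the handful of thresholds in `sweep_df`. As a sketch of an alternative, the function below evaluates the same cost, `cost_fp * FP + cost_fn * FN`, at every distinct score and returns the cheapest threshold. The name `min_cost_threshold` and the demo arrays are hypothetical, and the default costs simply mirror the placeholder values in the cell above.

```python
import numpy as np


def min_cost_threshold(y_true, scores, cost_fp=1.0, cost_fn=10.0):
    """Return (threshold, cost) minimising cost_fp * FP + cost_fn * FN.

    Predictions are taken as scores >= threshold. Candidate thresholds are
    the distinct observed scores plus +inf (flag nothing).
    """
    y_true = np.asarray(y_true).astype(int)
    scores = np.asarray(scores, dtype=float)

    # Sort by descending score so that a prefix of the arrays is the flagged set.
    order = np.argsort(-scores, kind="stable")
    y_sorted = y_true[order]
    s_sorted = scores[order]

    tp = np.cumsum(y_sorted)       # frauds caught when flagging the top k rows
    fp = np.cumsum(1 - y_sorted)   # legitimate rows flagged in the top k
    fn = y_sorted.sum() - tp       # frauds missed

    # Keep only the last position in each run of tied scores, since a
    # threshold cannot separate rows with identical scores.
    last_of_run = np.r_[s_sorted[1:] != s_sorted[:-1], True]
    thresholds = np.r_[np.inf, s_sorted[last_of_run]]
    costs = np.r_[
        cost_fn * y_sorted.sum(),
        cost_fp * fp[last_of_run] + cost_fn * fn[last_of_run],
    ]

    best = int(np.argmin(costs))
    return float(thresholds[best]), float(costs[best])


# Tiny hand-checkable example.
y_demo = [0, 0, 1, 0, 1, 1]
s_demo = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9]
print(min_cost_threshold(y_demo, s_demo))  # -> (0.35, 2.0)
```

In the demo, flagging everything scored 0.35 or higher catches all three frauds at the price of two false positives, for a cost of 2. As with the sweep above, a threshold chosen on the validation set is optimistic, so it is worth confirming on held-out data.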

In [None]:
# Choose threshold with lowest cost and persist artifacts
import json

best_cost_row = cost_df.iloc[0]
SELECTED_THRESHOLD = float(best_cost_row["threshold"])
print(f"Selected threshold (min cost): {SELECTED_THRESHOLD}")

preds_selected = (val_pred_proba >= SELECTED_THRESHOLD).astype(int)
print(classification_report(y_val, preds_selected, digits=4))
print("Confusion matrix:\n", confusion_matrix(y_val, preds_selected))

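# Save model and selected threshold
# NOTE: machine-specific absolute path; point MODEL_DIR at a location valid in your environment.
MODEL_DIR = Path("/Users/avikakhemuka/Desktop/internship assignemrnt/models")
MODEL_DIR.mkdir(parents=True, exist_ok=True)  # parents=True also creates missing parent folders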

joblib.dump(log_reg_clf, MODEL_DIR / "log_reg_fraud.pkl")
with open(MODEL_DIR / "threshold.json", "w") as f:
    json.dump({"threshold": SELECTED_THRESHOLD}, f)
print("Saved model + selected threshold to", MODEL_DIR)
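
For later scoring, the two saved artifacts can be loaded back together. The sketch below assumes the same file names as the cell above (`log_reg_fraud.pkl`, `threshold.json`). The helper `score_transactions` is hypothetical, and the stand-in model and data exist only so that the example runs on its own. With the real pipeline, new data would need the same engineered feature columns that were used for training.

```python
import json
import tempfile
from pathlib import Path

import joblib
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def score_transactions(model_dir, features):
    """Load the saved pipeline and threshold; return scores and 0/1 flags."""
    model_dir = Path(model_dir)
    model = joblib.load(model_dir / "log_reg_fraud.pkl")
    with open(model_dir / "threshold.json") as f:
        threshold = json.load(f)["threshold"]
    proba = model.predict_proba(features)[:, 1]
    return pd.DataFrame(
        {"fraud_score": proba, "flagged": (proba >= threshold).astype(int)},
        index=features.index,
    )


# Stand-in model and data (hypothetical) so the sketch runs on its own.
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({"amount": rng.exponential(1000.0, size=500)})
toy_y = (toy_X["amount"] > 2000).astype(int)
toy_model = Pipeline(
    [("scaler", StandardScaler()), ("model", LogisticRegression(max_iter=1000))]
)
toy_model.fit(toy_X, toy_y)

with tempfile.TemporaryDirectory() as tmp:
    # Write artifacts using the same file names as the notebook cell.
    joblib.dump(toy_model, Path(tmp) / "log_reg_fraud.pkl")
    with open(Path(tmp) / "threshold.json", "w") as f:
        json.dump({"threshold": 0.5}, f)
    scored = score_transactions(tmp, toy_X.head(5))

print(scored)
```

Pickled scikit-learn pipelines are generally only safe to load with the library version that produced them, so it may be worth recording the version next to the threshold.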

In [None]:
# Key visualizations

# Confusion matrix at selected threshold (cost-minimizing)
cm = confusion_matrix(y_val, preds_selected)
plt.figure(figsize=(5, 4))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title(f"Confusion Matrix @ threshold={SELECTED_THRESHOLD:.4f}")
plt.show()

# Score distributions by class with selected threshold
plt.figure(figsize=(6, 4))
sns.kdeplot(val_pred_proba[y_val == 0], label="Legit", fill=True, alpha=0.3)
sns.kdeplot(val_pred_proba[y_val == 1], label="Fraud", fill=True, alpha=0.3)
plt.axvline(SELECTED_THRESHOLD, color="red", linestyle="--", label="Threshold")
plt.xlabel("Predicted fraud probability")
plt.ylabel("Density")
plt.title("Score distribution by class")
plt.legend()
plt.show()



## Candidate Expectations – Responses

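1) Data cleaning: No missing values were found. I checked for exact duplicate rows (none were dropped) and negative balances (none were present), flagged very large amounts (above the 99.5th percentile), and added balance-delta checks (`orig_balance_delta`, `dest_balance_delta`) to spot inconsistencies. The L2 penalty of the logistic regression limits the effect of collinear features (for example raw and log amount) on the coefficients, and scaling puts the numeric features on a comparable range.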

2) Fraud model: A class-weighted logistic regression on one-hot encoded `type` and scaled numeric features, evaluated on a stratified 80/20 train/validation split. I tuned the decision threshold from the precision-recall curve to balance catch rate against alert volume. (A RandomForest comparison was removed to keep things simple, but the model step of the pipeline can be swapped.)

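3) Variables used: Hour and day, transaction type, raw and log amount, raw origin and destination balances, balance deltas, a merchant-destination flag, and flags for negative balances and large-amount outliers. The log-transformed balances are computed but not included in `feature_cols`. I excluded the high-cardinality IDs (`nameOrig`, `nameDest`).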

4) Performance: Reported ROC-AUC and PR-AUC, plus the classification report and confusion matrix. Added a threshold sweep and PR curve to show precision/recall trade-offs, then picked an operating point using FP vs. FN costs and saved that threshold.

5) Key fraud signals: Balance deltas and available balances matter most, then amount/log-amount and transaction type. Merchant-destination and time add extra signal.

6) Do they make sense? Yes. Fraud often shows odd balance changes, higher amounts, and clusters in transfer/cash-out flows, matching the simulated behavior and the large-transfer rule.

7) Prevention during infra updates: Tighten velocity/amount checks for transfers and cash-outs, hold inconsistent balance-delta cases for review, be stricter on non-merchant destinations, apply stricter alerting during off-hours, and fast-track the riskiest alerts to manual review.

8) Measuring impact: Track fraud catch rate (recall), false-positive rate/alert load, reviewer workload, and actual loss/chargebacks over time. Test in shadow/A-B mode first and watch for drift in scores and class balance.
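
For the drift monitoring mentioned in response 8, one commonly used check is the population stability index (PSI) between the validation scores and a recent batch of scores. The sketch below is a minimal version under stated assumptions: the function name, the choice of ten quantile bins, and the synthetic score samples are all illustrative and not part of the notebook.

```python
import numpy as np


def population_stability_index(reference, recent, n_bins=10, eps=1e-6):
    """PSI between two score samples, using quantile bins taken from `reference`."""
    reference = np.asarray(reference, dtype=float)
    recent = np.asarray(recent, dtype=float)

    # Interior bin edges from reference quantiles; np.unique drops duplicate
    # edges that appear when many scores are tied.
    edges = np.unique(np.quantile(reference, np.linspace(0, 1, n_bins + 1)[1:-1]))

    # searchsorted maps each score to a bin index in 0..len(edges).
    ref_counts = np.bincount(
        np.searchsorted(edges, reference, side="right"), minlength=len(edges) + 1
    )
    rec_counts = np.bincount(
        np.searchsorted(edges, recent, side="right"), minlength=len(edges) + 1
    )

    # Convert to fractions; eps avoids log(0) for empty bins.
    ref_frac = np.clip(ref_counts / ref_counts.sum(), eps, None)
    rec_frac = np.clip(rec_counts / rec_counts.sum(), eps, None)
    return float(np.sum((rec_frac - ref_frac) * np.log(rec_frac / ref_frac)))


# Synthetic scores (hypothetical): a stable week versus a shifted week.
rng = np.random.default_rng(42)
validation_scores = rng.beta(1, 20, size=10_000)
stable_week = rng.beta(1, 20, size=10_000)
shifted_week = rng.beta(2, 10, size=10_000)

print("stable :", population_stability_index(validation_scores, stable_week))
print("shifted:", population_stability_index(validation_scores, shifted_week))
```

A PSI near zero suggests the score distribution has not moved, while larger values indicate a shift worth investigating. Any alerting cut-off is a convention that should be tuned to the review team's tolerance. The same function could be applied to an individual feature, such as `amount`, to check for input drift.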

