
# Fraud Detection Assignment — End‑to‑End Solution
**Author:** _Your Name_  
**Deadline:** 25 Aug, 11:59 PM IST

This notebook is structured to satisfy all deliverables in the brief:

1) **Data cleaning**: missing values, outliers, multicollinearity  
2) **Model**: end‑to‑end build with rationale  
3) **Feature selection rationale**  
4) **Performance demonstration**: rigorous, time‑aware validation, metrics, plots  
5) **Key fraud drivers** (global + local explainability)  
6) **Sanity of factors** (business reasoning)  
7) **Prevention recommendations**  
8) **Measurement plan**

> 🔧 **How to use**
> - Put your dataset path in the cell below (`DATA_PATH`).  
> - Run cells in order.  
> - If running on Colab, a GPU runtime is optional; the models below are configured for CPU training.  
> - This notebook uses memory‑efficient reading + LightGBM/XGBoost (install as needed).


In [2]:

# Install dependencies (comment this line out if they are already installed):
%pip install pandas numpy scikit-learn lightgbm xgboost shap matplotlib plotly tqdm pyarrow fastparquet statsmodels

import os, gc, math, warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
from tqdm import tqdm

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    roc_auc_score, average_precision_score, precision_recall_curve,
    confusion_matrix, classification_report, roc_curve
)

import matplotlib.pyplot as plt

# Optional (if available):
try:
    import lightgbm as lgb
    HAS_LGBM = True
except Exception:
    HAS_LGBM = False

try:
    import xgboost as xgb
    HAS_XGB = True
except Exception:
    HAS_XGB = False

try:
    import shap
    HAS_SHAP = True
except Exception:
    HAS_SHAP = False

print("Packages -> LGBM:", HAS_LGBM, "| XGB:", HAS_XGB, "| SHAP:", HAS_SHAP)



Packages -> LGBM: True | XGB: True | SHAP: True


In [None]:

# ====== CONFIG ======
DATA_PATH = r"C:\Users\yashh\Downloads\Fraud.csv"  # <-- PUT YOUR CSV PATH HERE (raw string so backslashes are not treated as escapes)
RANDOM_STATE = 42
TEST_SIZE = 0.20  # time-aware split is applied below; this is a fallback
TARGET_COL = "isFraud"

# Memory optimization: set dtypes for known columns
DTYPES = {
    "step": "int32",
    "type": "category",
    "amount": "float32",
    "nameOrig": "category",
    "oldbalanceOrg": "float32",
    "newbalanceOrig": "float32",
    "nameDest": "category",
    "oldbalanceDest": "float32",
    "newbalanceDest": "float32",
    "isFraud": "int8",
    "isFlaggedFraud": "int8",
}

assert os.path.exists(DATA_PATH), f"Dataset not found at {DATA_PATH}. Please update DATA_PATH."
print('Using dataset at:', DATA_PATH)


In [None]:

# ====== LOAD DATA ======
# For 6.36M rows, pandas can load with the right dtypes on a 16GB+ machine.
# If RAM is tight, consider reading with chunks for EDA; we do full load for modeling.
df = pd.read_csv(DATA_PATH, dtype=DTYPES)
print(df.shape)
df.head(3)
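
The comment above mentions chunked reading when RAM is tight. A minimal sketch of that idea follows, assuming only the `type` and `isFraud` columns are needed for the summary. The helper name `fraud_rate_by_type_chunked` and the tiny in-memory CSV are hypothetical and stand in for the real file.

```python
import io
import pandas as pd

def fraud_rate_by_type_chunked(csv_source, chunksize=100_000):
    """Stream a CSV in chunks and accumulate fraud counts per transaction type."""
    totals = {}  # type -> [n_rows, n_fraud]
    reader = pd.read_csv(
        csv_source,
        usecols=["type", "isFraud"],  # read only the columns the summary needs
        dtype={"type": "object", "isFraud": "int8"},
        chunksize=chunksize,
    )
    for chunk in reader:
        grouped = chunk.groupby("type")["isFraud"].agg(["count", "sum"])
        for txn_type, row in grouped.iterrows():
            acc = totals.setdefault(txn_type, [0, 0])
            acc[0] += int(row["count"])
            acc[1] += int(row["sum"])
    out = pd.DataFrame.from_dict(totals, orient="index", columns=["n_rows", "n_fraud"])
    out["fraud_rate"] = out["n_fraud"] / out["n_rows"]
    return out.sort_index()

# Tiny synthetic CSV standing in for Fraud.csv (values are made up)
demo_csv = io.StringIO(
    "type,amount,isFraud\n"
    "TRANSFER,100.0,1\n"
    "PAYMENT,20.0,0\n"
    "TRANSFER,50.0,0\n"
    "CASH_OUT,75.0,1\n"
    "PAYMENT,10.0,0\n"
)
summary = fraud_rate_by_type_chunked(demo_csv, chunksize=2)
print(summary)
```

Passing `DATA_PATH` instead of `demo_csv` should give the same per-type table as the crosstab in the EDA section, without holding the full frame in memory.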


In [None]:

# ====== DATA HEALTH CHECKS ======
print("\nBasic Info:")
df.info()  # prints directly; wrapping it in print() would also emit "None"

print("\nTarget balance:")
print(df[TARGET_COL].value_counts(normalize=True).rename('proportion'))

print("\nMissing values per column:")
print(df.isna().sum())

# Summary stats for numeric columns
num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
cat_cols = df.select_dtypes(include=['category', 'object']).columns.tolist()
print("\nNumeric columns:", num_cols)
print("Categorical columns:", cat_cols)

df[num_cols].describe(percentiles=[.01,.05,.25,.5,.75,.95,.99])


## Exploratory Data Analysis (target leakage, distributions, imbalances)

In [None]:

# Target vs type
ct = pd.crosstab(df['type'], df[TARGET_COL], normalize='index')
print(ct)

# Distribution of amounts (trimmed)
amt_q99 = df['amount'].quantile(0.99)
df['amount_clip'] = df['amount'].clip(upper=amt_q99)
df['amount_clip'].hist(bins=50)
plt.title('Amount (clipped at 99th pct)')
plt.xlabel('amount')
plt.ylabel('count')
plt.show()

# Time progression of fraud rate
fraud_rate_by_step = df.groupby('step')[TARGET_COL].mean()
fraud_rate_by_step.plot()
plt.title('Fraud rate by time step (hour)')
plt.xlabel('step')
plt.ylabel('fraud_rate')
plt.show()
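
The histogram above clips `amount` at the 99th percentile. As a rough sketch of how the outlier notes in the conclusions could be quantified, the block below reports how much of a heavy-tailed amount column lies above the usual IQR fence and above the percentile cap. The helper `outlier_summary` and the lognormal sample are assumptions for illustration, not part of the dataset.

```python
import numpy as np
import pandas as pd

def outlier_summary(values, upper_q=0.99):
    """Summarize the upper tail: IQR fence and percentile cap with their shares."""
    s = pd.Series(values, dtype="float64")
    q1, q3 = s.quantile([0.25, 0.75])
    upper_fence = q3 + 1.5 * (q3 - q1)   # classic Tukey fence
    cap = s.quantile(upper_q)            # same cap rule as amount_clip
    return {
        "upper_fence": float(upper_fence),
        "share_above_fence": float((s > upper_fence).mean()),
        "cap": float(cap),
        "share_clipped": float((s > cap).mean()),
    }

# Synthetic heavy-tailed amounts (made up; real amounts come from df['amount'])
rng = np.random.default_rng(7)
amounts_demo = rng.lognormal(mean=10.0, sigma=1.5, size=10_000)

stats = outlier_summary(amounts_demo)
clipped = np.clip(amounts_demo, None, stats["cap"])  # winsorize the upper tail
log_amount = np.log1p(amounts_demo)                  # alternative: compress the tail
print(stats)
```

Tree models are largely insensitive to monotone transforms such as clipping or `log1p`, so this matters mainly for plots and for any linear baseline.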


## Feature Engineering

In [None]:

# Derived risk features
df['is_type_TRANSFER'] = (df['type'] == 'TRANSFER').astype('int8')
df['is_type_CASH_OUT'] = (df['type'] == 'CASH_OUT').astype('int8')
df['is_merchant_dest'] = df['nameDest'].astype(str).str.startswith('M').astype('int8')

# Balance deltas for origin and destination accounts
df['deltaOrig'] = (df['oldbalanceOrg'] - df['newbalanceOrig']).astype('float32')
df['deltaDest'] = (df['newbalanceDest'] - df['oldbalanceDest']).astype('float32')

# Suspicious patterns
df['orig_balance_zero_then_txn'] = ((df['oldbalanceOrg'] == 0) & (df['amount'] > 0)).astype('int8')
df['dest_balance_zero_then_in'] = ((df['oldbalanceDest'] == 0) & (df['amount'] > 0)).astype('int8')
# Note: balances are loaded as float32, whose rounding error can exceed 1e-2 for large values;
# load the monetary columns as float64 in DTYPES if exact mismatch flags matter.
df['mismatch_orig'] = (np.abs(df['deltaOrig'] - df['amount']) > 1e-2).astype('int8')
df['mismatch_dest'] = (np.abs(df['deltaDest'] - df['amount']) > 1e-2).astype('int8')

# Drop leakage columns (IDs kept only if used as categorical signals)
LEAKS = []  # if any discovered, append here
feature_cols = [c for c in df.columns if c not in [TARGET_COL, 'isFlaggedFraud', 'amount_clip'] + LEAKS]
print("Feature count:", len(feature_cols))
feature_cols[:15]
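
To make the delta and mismatch features concrete, here is a small worked example on three made-up transactions. The column names match the dataset; the `toy` frame and its values are purely illustrative.

```python
import numpy as np
import pandas as pd

# Three hypothetical transactions:
# row 0: origin is debited but destination balance does not move
# row 1: both sides reconcile with the amount
# row 2: origin balance is zero before and after, yet destination is credited
toy = pd.DataFrame({
    "amount":         [100.0, 100.0, 250.0],
    "oldbalanceOrg":  [100.0, 500.0,   0.0],
    "newbalanceOrig": [  0.0, 400.0,   0.0],
    "oldbalanceDest": [  0.0,  50.0,   0.0],
    "newbalanceDest": [  0.0, 150.0, 250.0],
})

toy["deltaOrig"] = toy["oldbalanceOrg"] - toy["newbalanceOrig"]   # money leaving origin
toy["deltaDest"] = toy["newbalanceDest"] - toy["oldbalanceDest"]  # money arriving at destination
toy["mismatch_orig"] = (np.abs(toy["deltaOrig"] - toy["amount"]) > 1e-2).astype("int8")
toy["mismatch_dest"] = (np.abs(toy["deltaDest"] - toy["amount"]) > 1e-2).astype("int8")

print(toy[["amount", "deltaOrig", "deltaDest", "mismatch_orig", "mismatch_dest"]])
```

Only the second row reconciles on both sides; the other two each raise one flag, which is the kind of bookkeeping inconsistency these features are meant to surface.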


## Train/Validation Split (Time‑aware)

In [None]:

# Time-aware split: train on rows up to the step at the 80th percentile of rows, validate on later steps
step_cut = int(df['step'].quantile(0.80))
train_idx = df['step'] <= step_cut
valid_idx = df['step'] > step_cut

train = df.loc[train_idx].reset_index(drop=True)
valid = df.loc[valid_idx].reset_index(drop=True)

# .copy() so the in-place encoding below does not write to a view of train/valid
X_train = train[feature_cols].copy()
y_train = train[TARGET_COL].astype('int8')
X_valid = valid[feature_cols].copy()
y_valid = valid[TARGET_COL].astype('int8')

print(train.shape, valid.shape, " | step_cut:", step_cut)

# Encode categoricals (simple): convert category to integer codes
for c in X_train.select_dtypes(include=['category']).columns:
    # Use the categories of the full frame so codes are consistent between train/valid
    categories = df[c].cat.categories
    X_train[c] = pd.Categorical(X_train[c], categories=categories).codes.astype('int32')
    X_valid[c] = pd.Categorical(X_valid[c], categories=categories).codes.astype('int32')

# Fill any remaining NaNs (should be rare)
X_train = X_train.fillna(0)
X_valid = X_valid.fillna(0)
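
The split logic can be checked in isolation. The sketch below wraps it in a hypothetical helper, `time_aware_split`, and runs it on a synthetic frame to confirm that every validation row comes strictly after every training row.

```python
import numpy as np
import pandas as pd

def time_aware_split(frame, time_col="step", train_frac=0.80):
    """Split on the step value at the train_frac quantile of rows."""
    cut = int(frame[time_col].quantile(train_frac))
    train = frame.loc[frame[time_col] <= cut].reset_index(drop=True)
    valid = frame.loc[frame[time_col] > cut].reset_index(drop=True)
    return train, valid, cut

# Synthetic data: 10 steps with 10 rows each (labels are random and made up)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "step": np.repeat(np.arange(1, 11), 10),
    "isFraud": rng.integers(0, 2, size=100),
})

train_demo, valid_demo, cut = time_aware_split(demo)
print(len(train_demo), len(valid_demo), "| cut:", cut)
# No temporal overlap: the latest training step precedes the earliest validation step
assert train_demo["step"].max() < valid_demo["step"].min()
```

Because whole steps are kept together, the realized train share can differ from 80% when row counts per step are uneven, and the validation set is empty if the cut lands on the last step. It is worth printing both shapes, as the cell above does.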


## Baseline Models

In [None]:

def evaluate_preds(y_true, y_prob, threshold=0.5, title_suffix=""):
    y_pred = (y_prob >= threshold).astype(int)
    auc = roc_auc_score(y_true, y_prob)
    ap = average_precision_score(y_true, y_prob)
    cm = confusion_matrix(y_true, y_pred)
    print(f"AUC: {auc:.5f} | Average Precision (PR AUC): {ap:.5f}")
    print("Confusion Matrix:\n", cm)
    print("\nClassification report:\n", classification_report(y_true, y_pred, digits=4))

    # PR Curve
    precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
    plt.figure()
    plt.plot(recall, precision)
    plt.title(f'Precision-Recall Curve {title_suffix}')
    plt.xlabel('Recall'); plt.ylabel('Precision')
    plt.show()

    # ROC Curve
    fpr, tpr, _ = roc_curve(y_true, y_prob)
    plt.figure()
    plt.plot(fpr, tpr)
    plt.title(f'ROC Curve {title_suffix}')
    plt.xlabel('FPR'); plt.ylabel('TPR')
    plt.show()

# LightGBM (preferred for tabular data; class imbalance handled via scale_pos_weight)
if HAS_LGBM:
    lgb_train = lgb.Dataset(X_train, label=y_train)
    lgb_valid = lgb.Dataset(X_valid, label=y_valid, reference=lgb_train)
    params = {
        'objective': 'binary',
        'metric': ['auc'],
        'learning_rate': 0.05,
        'num_leaves': 64,
        'feature_fraction': 0.8,
        'bagging_fraction': 0.8,
        'bagging_freq': 2,
        'reg_lambda': 5.0,
        'min_data_in_leaf': 50,
        'max_depth': -1,
        'verbose': -1,
        'scale_pos_weight': max(1.0, (y_train==0).sum() / max(1,(y_train==1).sum())),
        'seed': 42
    }
    print("Training LightGBM with params:", params)
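    lgb_model = lgb.train(
        params,
        lgb_train,
        valid_sets=[lgb_train, lgb_valid],
        valid_names=['train','valid'],
        num_boost_round=2000,
        # LightGBM 4.x removed the early_stopping_rounds / verbose_eval arguments; use callbacks
        callbacks=[
            lgb.early_stopping(stopping_rounds=100),
            lgb.log_evaluation(period=100),
        ]
    )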
    y_valid_prob_lgb = lgb_model.predict(X_valid, num_iteration=lgb_model.best_iteration)
    evaluate_preds(y_valid, y_valid_prob_lgb, title_suffix="(LightGBM)")
else:
    print("LightGBM not available; skipping.")


In [None]:

if HAS_XGB:
    dtrain = xgb.DMatrix(X_train, label=y_train)
    dvalid = xgb.DMatrix(X_valid, label=y_valid)
    scale_pos_weight = max(1.0, (y_train==0).sum() / max(1,(y_train==1).sum()))
    xgb_params = {
        'objective': 'binary:logistic',
        'eval_metric': 'auc',
        'eta': 0.05,
        'max_depth': 8,
        'subsample': 0.8,
        'colsample_bytree': 0.8,
        'lambda': 5.0,
        'scale_pos_weight': scale_pos_weight,
        'seed': 42
    }
    print("Training XGBoost with params:", xgb_params)
    xgb_model = xgb.train(
        xgb_params,
        dtrain,
        num_boost_round=3000,
        evals=[(dtrain,'train'),(dvalid,'valid')],
        early_stopping_rounds=100,
        verbose_eval=200
    )
    # best_ntree_limit was removed in XGBoost 2.0; best_iteration is 0-based, hence the +1
    y_valid_prob_xgb = xgb_model.predict(dvalid, iteration_range=(0, xgb_model.best_iteration + 1))
    evaluate_preds(y_valid, y_valid_prob_xgb, title_suffix="(XGBoost)")
else:
    print("XGBoost not available; skipping.")


## Explainability — Feature Importance & SHAP

In [None]:

# Feature importances (model-dependent)
def plot_importance(names, importances, topn=25, title="Feature Importance"):
    order = np.argsort(importances)[::-1][:topn]
    plt.figure(figsize=(8, 6))
    plt.barh(range(len(order)), np.array(importances)[order][::-1])
    plt.yticks(range(len(order)), np.array(names)[order][::-1])
    plt.title(title)
    plt.xlabel('importance')
    plt.show()

if HAS_LGBM:
    imp = lgb_model.feature_importance(importance_type='gain')
    plot_importance(X_train.columns, imp, title="LightGBM Feature Importance (gain)")

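# SHAP (optional, can be heavy)
if HAS_SHAP and HAS_LGBM:
    explainer = shap.TreeExplainer(lgb_model)
    # Use a small sample for speed
    sample = X_valid.sample(n=min(5000, len(X_valid)), random_state=RANDOM_STATE)
    shap_values = explainer.shap_values(sample)
    # Some SHAP versions return one array per class for binary models; keep the fraud class
    if isinstance(shap_values, list):
        shap_values = shap_values[1]
    shap.summary_plot(shap_values, sample, show=True)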


## Multicollinearity (VIF)

In [None]:

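# Compute VIF on a numeric subset to check multicollinearity
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm

num_for_vif = X_train.select_dtypes(include=[np.number])
# Limit to a reasonable subset to keep runtime manageable
cols_for_vif = [c for c in num_for_vif.columns if num_for_vif[c].nunique() > 10]
cols_for_vif = cols_for_vif[:30]  # cap for speed
# VIF fits one regression per feature, so subsample rows; float64 for numerical stability
vif_sample = num_for_vif[cols_for_vif].sample(
    n=min(200_000, len(num_for_vif)), random_state=RANDOM_STATE
).astype('float64')
# statsmodels does not add an intercept itself; add one so R^2 is the centered version
vif_X = sm.add_constant(vif_sample, has_constant='add')
vif_df = pd.DataFrame({
    'feature': cols_for_vif,
    # column 0 is the constant, so feature i sits at index i + 1
    'VIF': [variance_inflation_factor(vif_X.values, i + 1) for i in range(len(cols_for_vif))]
})
vif_df.sort_values('VIF', ascending=False).head(15)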
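
For intuition about what the table reports, VIF for feature j is 1 / (1 - R²), where R² comes from regressing feature j on the remaining features plus an intercept. The NumPy-only sketch below reimplements that definition on synthetic columns. The helper `vif_numpy` and the simulated balances are assumptions for illustration, not the notebook's data.

```python
import numpy as np

def vif_numpy(X):
    """VIF per column: 1 / (1 - R^2) from regressing it on the other columns + intercept."""
    X = np.asarray(X, dtype=np.float64)
    n, k = X.shape
    vifs = np.empty(k)
    for j in range(k):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ coef
        ss_res = float(resid @ resid)
        ss_tot = float(((y - y.mean()) ** 2).sum())
        r2 = 1.0 - ss_res / ss_tot
        vifs[j] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return vifs

# Synthetic columns: new_balance is almost exactly old_balance - amount
rng = np.random.default_rng(42)
old_balance = rng.normal(1000.0, 200.0, size=2000)
amount = rng.normal(100.0, 30.0, size=2000)
new_balance = old_balance - amount + rng.normal(0.0, 1.0, size=2000)
unrelated = rng.normal(0.0, 1.0, size=2000)

vifs = vif_numpy(np.column_stack([old_balance, amount, new_balance, unrelated]))
print(dict(zip(["old_balance", "amount", "new_balance", "unrelated"], vifs.round(2))))
```

The three balance-related columns get very large VIFs while the unrelated column stays near 1. In this notebook the same effect is expected by construction, since `deltaOrig` and `deltaDest` are exact differences of the balance columns. Tree ensembles tolerate that, but it does split importance across the correlated features.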


## Threshold Tuning (Optimize for Business Cost)

In [None]:

# Example: pick the lowest threshold whose precision meets a target (keeps recall as high as possible)
def best_threshold_by_precision_target(y_true, y_prob, precision_target=0.95):
    precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
    for p, r, t in zip(precision, recall, np.append(thresholds, 1)):
        if p >= precision_target:
            return float(t), float(p), float(r)
    return 0.5, precision[0], recall[0]

if HAS_LGBM:
    thr, p, r = best_threshold_by_precision_target(y_valid, y_valid_prob_lgb, precision_target=0.90)
    print(f"Threshold for >=90% precision: {thr:.4f} -> precision={p:.3f}, recall={r:.3f}")
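
The section title mentions business cost, while the cell above targets a precision level. One possible way to tune on cost directly is sketched below, assuming a fixed cost per missed fraud and per false alert. The helper `best_threshold_by_cost`, the cost values, and the demo scores are all hypothetical placeholders to be replaced with real figures.

```python
import numpy as np

def best_threshold_by_cost(y_true, y_prob, cost_fn=100.0, cost_fp=1.0, grid=None):
    """Scan thresholds and return the one minimizing cost_fn * FN + cost_fp * FP."""
    y_true = np.asarray(y_true).astype(bool)
    y_prob = np.asarray(y_prob, dtype=float)
    if grid is None:
        grid = np.linspace(0.01, 0.99, 99)
    costs = []
    for t in grid:
        flagged = y_prob >= t
        fn = np.sum(y_true & ~flagged)   # frauds that slip through
        fp = np.sum(~y_true & flagged)   # legitimate transactions flagged
        costs.append(cost_fn * fn + cost_fp * fp)
    costs = np.asarray(costs)
    best = int(np.argmin(costs))         # first minimum, i.e. the lowest such threshold
    return float(grid[best]), float(costs[best])

# Made-up labels and scores
y_true_demo = np.array([0, 0, 0, 0, 1, 0, 1, 1])
y_prob_demo = np.array([0.051, 0.103, 0.204, 0.355, 0.406, 0.607, 0.808, 0.909])

thr_fn_heavy, cost_fn_heavy = best_threshold_by_cost(y_true_demo, y_prob_demo, cost_fn=100.0, cost_fp=1.0)
thr_fp_heavy, cost_fp_heavy = best_threshold_by_cost(y_true_demo, y_prob_demo, cost_fn=1.0, cost_fp=100.0)
print(thr_fn_heavy, cost_fn_heavy, "|", thr_fp_heavy, cost_fp_heavy)
```

When missed fraud is expensive the chosen threshold drops so that every fraud is caught; when false alerts are expensive it rises above the highest-scoring legitimate transaction. A per-transaction variant could weight false negatives by `amount` instead of a flat cost.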


## Save Artifacts

In [None]:

# Save predictions for audit / attachment in submission
if HAS_LGBM:
    valid_out = valid[['step','type','amount','nameOrig','nameDest','isFlaggedFraud',TARGET_COL]].copy()
    valid_out['fraud_prob_lgb'] = y_valid_prob_lgb
    valid_out.to_parquet('valid_predictions.parquet', index=False)
    print("Saved: valid_predictions.parquet")



## Conclusions & Answers (fill after running)

**1) Data cleaning:**  
- Missing: _<notes>_  
- Outliers: _<notes>_  
- Multicollinearity: _<notes>_  

**2) Model description:**  
- Algorithm: _<LightGBM/XGBoost>_  
- Why: _<reasoning>_  
- Handling imbalance: _<scale_pos_weight, threshold tuning>_  

**3) Variable selection:**  
- Included: engineered deltas, type indicators, merchant flags, mismatch signals  
- Rationale: information value, importance, SHAP  

**4) Performance:**  
- Metrics: AUC, PR‑AUC, Precision@Recall, confusion matrix  
- Validation: time‑aware split (first 80% steps train, last 20% validate)  

**5) Key predictors:**  
- Top features: _<from gain importance / SHAP>_  

**6) Do they make sense?**  
- Business reasoning: _<explain link to fraud modus operandi>_  

**7) Prevention recommendations:**  
- _<rate limiting high‑risk flows, MFA, velocity rules, beneficiary cooling periods, anomaly detection in balances, merchant monitoring, graph rules, device fingerprinting>_  

**8) Measurement of success:**  
- A/B or backtest: reduction in fraud loss, false positive rate, precision/recall lift, alert fatigue drop, manual review SLA, net profit impact.
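
As a starting point for the measurement plan, the sketch below computes a few of the KPIs listed above for two alerting policies on the same backtest window. The helper `alert_kpis`, the two flag vectors, and the amounts are hypothetical. In practice the flags could come from `isFlaggedFraud` and from the tuned model threshold.

```python
import numpy as np

def alert_kpis(y_true, flagged, amount):
    """Precision, recall, false positive rate, alert rate and fraud value captured."""
    y_true = np.asarray(y_true).astype(bool)
    flagged = np.asarray(flagged).astype(bool)
    amount = np.asarray(amount, dtype=float)
    tp = int(np.sum(y_true & flagged))
    fp = int(np.sum(~y_true & flagged))
    fn = int(np.sum(y_true & ~flagged))
    tn = int(np.sum(~y_true & ~flagged))
    fraud_value = float(amount[y_true].sum())
    return {
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        "alert_rate": float(flagged.mean()),
        "fraud_value_captured": float(amount[y_true & flagged].sum()) / fraud_value if fraud_value else 0.0,
    }

# Made-up backtest window with 3 frauds among 8 transactions
y_true_bt = np.array([1, 1, 0, 0, 0, 1, 0, 0])
amount_bt = np.array([500.0, 300.0, 50.0, 20.0, 10.0, 200.0, 40.0, 30.0])
rule_flags = np.array([1, 0, 0, 0, 0, 0, 0, 0])    # existing rule (baseline)
model_flags = np.array([1, 1, 1, 0, 0, 0, 0, 0])   # model at a chosen threshold

baseline_kpis = alert_kpis(y_true_bt, rule_flags, amount_bt)
model_kpis = alert_kpis(y_true_bt, model_flags, amount_bt)
print("baseline:", baseline_kpis)
print("model:   ", model_kpis)
```

Reporting both policies side by side makes the trade-off explicit: here the model captures more fraud value at the price of a higher false positive rate and alert volume.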
