# Fraud Detection Pipeline (Notebook)

**Goal:** demonstrate the analytical process from raw events → features → model → business impact.

**Highlights:** cost-sensitive training, precision/recall trade-offs, and deployable artifacts.

## 1. Business Framing
- Chargebacks and fraud costs
- Analyst capacity constraints
- KPI: **recall at fixed alert volume (top 1%)**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, precision_recall_curve, auc
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from joblib import dump

%matplotlib inline

## 2. Data Generation (Synthetic)

In [None]:
from pathlib import Path
import sys
# Make the parent directory importable so the local `src` package resolves.
sys.path.append(str(Path('..').resolve()))
from src.data_prep import generate_synthetic

df = generate_synthetic(n_samples=50000, n_features=20, weights=[0.995], random_state=7)
df.head()

## 3. EDA Quick Peek

In [None]:
df['label'].value_counts(normalize=True)

## 4. Train/Test Split + Model

In [None]:
X = df.drop(columns=['label'])
y = df['label']
# Stratify so the rare positive class keeps the same share in train and test.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

model = GradientBoostingClassifier(random_state=42)
model.fit(X_train, y_train)
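# Column 1 holds the predicted probability of the positive class (label 1).
probs = model.predict_proba(X_test)[:, 1]

auroc = roc_auc_score(y_test, probs)
prec, rec, thr = precision_recall_curve(y_test, probs)
# Trapezoidal area under the PR curve; sklearn's average_precision_score is a non-interpolated alternative.
auprc = auc(rec, prec)
auroc, auprc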
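
The highlights mention cost-sensitive training, but the fit above weights every row equally. One common way to add cost sensitivity, sketched here as an assumption rather than as this notebook's method, is to pass per-row weights inversely proportional to class frequency through the `sample_weight` argument of `fit`. The helper name `balanced_sample_weights` is hypothetical.

```python
import numpy as np

def balanced_sample_weights(y):
    """Weight each row by n_samples / (n_classes * count of its class)."""
    y = np.asarray(y)
    classes, inverse, counts = np.unique(y, return_inverse=True, return_counts=True)
    class_weights = len(y) / (len(classes) * counts)
    # Map each row back to the weight of its own class.
    return class_weights[inverse]

# Toy labels: three legitimate rows and one fraud row.
w = balanced_sample_weights([0, 0, 0, 1])
print(w)
```

With these weights each class contributes the same total weight to the loss. In the notebook this would look like `model.fit(X_train, y_train, sample_weight=balanced_sample_weights(y_train))`; whether it actually improves recall at the top 1% should be checked on held-out data.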

## 5. Precision–Recall Curve

In [None]:
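plt.figure()
plt.plot(rec, prec, label='model')
# A random scorer's precision equals the positive rate, so draw it as a baseline.
plt.axhline(y_test.mean(), linestyle='--', label='prevalence baseline')
plt.xlabel('Recall'); plt.ylabel('Precision'); plt.title('PR Curve')
plt.legend(); plt.grid(True); plt.show()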
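
To see why the PR view matters at this class balance, the sketch below scores synthetic data with a scorer that separates the classes reasonably well and compares AUROC with average precision (AP). Everything here is an illustrative assumption: the 1% prevalence, the two-standard-deviation shift, and the helper `auroc_and_ap`, which is a NumPy-only approximation that assumes no tied scores and is not the scikit-learn implementation.

```python
import numpy as np

def auroc_and_ap(y_true, scores):
    """AUROC via the rank-sum identity; AP as mean precision at each positive.

    Assumes binary 0/1 labels and no tied scores.
    """
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    n_pos = int(y_true.sum())
    n_neg = len(y_true) - n_pos
    # Ascending ranks (1 = lowest score) for the Mann-Whitney form of AUROC.
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    auroc = (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
    # Descending order: precision at the position of each positive, averaged.
    hits = y_true[np.argsort(-scores)]
    precision_at_rank = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    ap = precision_at_rank[hits == 1].mean()
    return float(auroc), float(ap)

rng = np.random.default_rng(0)
n, prevalence = 20_000, 0.01
y = (rng.random(n) < prevalence).astype(int)
# Fraud rows score two standard deviations higher on average (arbitrary choice).
s = rng.normal(size=n) + 2.0 * y
auroc, ap = auroc_and_ap(y, s)
print(f"prevalence={y.mean():.4f}  AUROC={auroc:.3f}  AP={ap:.3f}")
```

AUROC can look strong here while AP stays far below 1, because even a small false-positive rate applied to the large negative class produces many false alerts.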

## 6. Business KPI: Recall @ Top 1%

In [None]:
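# Alert budget: analysts review only the top 1% of scored transactions.
n = len(probs)
k = max(1, int(0.01 * n))
# Positions of the k highest-scoring test rows (positional, hence .iloc below).
idx = np.argsort(probs)[::-1][:k]
# Share of all test-set fraud captured within the budget; the epsilon avoids division by zero.
recall_at_top_1 = y_test.iloc[idx].sum() / (y_test.sum() + 1e-12)
float(recall_at_top_1)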
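
The same calculation can be wrapped in a helper that also reports precision within the alert budget and the score cutoff the budget implies, which is the value a deployed service would apply. This is a minimal sketch; the function name `alert_budget_metrics` and the toy data are illustrative assumptions.

```python
import numpy as np

def alert_budget_metrics(y_true, scores, fraction=0.01):
    """Recall, precision, and score cutoff when alerting on the top `fraction` of rows."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    k = max(1, int(fraction * len(scores)))
    # Stable sort on negated scores: highest scores first, ties in original order.
    top = np.argsort(-scores, kind="stable")[:k]
    hits = y_true[top].sum()
    total_pos = y_true.sum()
    return {
        "k": k,
        "recall": float(hits / total_pos) if total_pos > 0 else float("nan"),
        "precision": float(hits / k),
        "threshold": float(scores[top].min()),
    }

# Toy example: 10 transactions, 2 fraudulent, budget = top 20%.
y = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
s = [0.1, 0.2, 0.9, 0.3, 0.05, 0.4, 0.35, 0.8, 0.15, 0.25]
m = alert_budget_metrics(y, s, fraction=0.2)
print(m)
```

If several rows tie at the cutoff, alerting on `score >= threshold` can flag more than `k` rows, so a production rule may need an explicit tie-break.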

## 7. Export Artifact

In [None]:
from sklearn.base import clone
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

features = list(X.columns)
pre = ColumnTransformer([('num', StandardScaler(), features)])
# Clone the estimator so fitting the pipeline does not refit the model evaluated above in place.
pipe = Pipeline([('pre', pre), ('clf', clone(model))])
# Note: the exported pipeline is a fresh fit, not the exact model object scored above.
pipe.fit(X_train, y_train)

Path('../artifacts').mkdir(parents=True, exist_ok=True)
dump({'model': pipe, 'features': features}, '../artifacts/model_from_notebook.joblib')
print('Saved to ../artifacts/model_from_notebook.joblib')
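
Because the artifact is a dictionary holding both the fitted pipeline and the feature list, a consumer can reload it and enforce the column contract before scoring. The round trip below is a self-contained sketch on toy data. The stand-in model (`LogisticRegression`), the column names, the temporary path, and the `score` helper are assumptions chosen to keep the example small; they are not the notebook's actual artifact.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd
from joblib import dump, load
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy training data with two hypothetical feature columns.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 2)), columns=["f0", "f1"])
y = (X["f0"] + X["f1"] > 0).astype(int)

pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])
pipe.fit(X, y)

# Save and reload in the same {'model', 'features'} layout used above.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "model.joblib"
    dump({"model": pipe, "features": list(X.columns)}, path)
    bundle = load(path)

def score(bundle, frame):
    """Select the saved feature columns in their saved order, then score."""
    missing = [c for c in bundle["features"] if c not in frame.columns]
    if missing:
        raise ValueError(f"missing feature columns: {missing}")
    return bundle["model"].predict_proba(frame[bundle["features"]])[:, 1]

# Extra or reordered input columns are handled by the selection step.
new_rows = pd.DataFrame({"extra": [1, 2], "f1": [0.5, -1.0], "f0": [1.5, -2.0]})
scores = score(bundle, new_rows)
print(scores)
```

Storing the feature list next to the model lets the scoring side fail fast on schema drift instead of silently scoring misaligned columns.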

## 8. Talking Points for Interviews
- Why **AUPRC** over AUROC on imbalanced data
- Thresholding for **fixed analyst capacity**
- Cost-sensitive improvements and expected ROI
- Path to production: API (FastAPI), dashboard (Streamlit), monitoring and retraining
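
The ROI talking point can be made concrete with simple expected-value arithmetic. The sketch below uses hypothetical inputs that do not come from this notebook (an average loss per fraud case, a cost per analyst review, and illustrative alert counts), and it assumes every reviewed fraud case is stopped. It only shows the shape of the calculation.

```python
def expected_net_savings(n_alerts, precision, avg_fraud_loss, review_cost):
    """Net value of reviewing n_alerts, assuming every reviewed fraud is stopped."""
    caught = n_alerts * precision
    return caught * avg_fraud_loss - n_alerts * review_cost

def break_even_precision(avg_fraud_loss, review_cost):
    """Precision at which the value of caught fraud equals the review cost."""
    return review_cost / avg_fraud_loss

# Hypothetical numbers for illustration only.
savings = expected_net_savings(n_alerts=100, precision=0.4, avg_fraud_loss=250.0, review_cost=5.0)
floor = break_even_precision(avg_fraud_loss=250.0, review_cost=5.0)
print(savings, floor)
```

Under these assumptions, any alert list whose precision exceeds the break-even value pays for its own review time, which links the precision side of the PR trade-off directly to analyst capacity.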