**Programmer: python_scripts (Abhijith Warrier)**

**PYTHON SCRIPT TO *DETECT CREDIT CARD FRAUD USING MACHINE LEARNING ON IMBALANCED DATA*. 🧠💳🚨**

This script demonstrates how to build a **fraud detection model** for data in which fraudulent transactions are extremely rare. We focus on **handling class imbalance**, training a robust classifier, and evaluating it with metrics better suited to the task than accuracy.

---

## **📦 Install Required Packages**

**Install ML and imbalance-handling libraries.**

In [None]:
pip install pandas numpy scikit-learn imbalanced-learn matplotlib

---

## **🧩 Load and Inspect the Dataset**

**We simulate a fraud-like dataset with heavy class imbalance.**

In [None]:
import pandas as pd
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=10000,
    n_features=20,
    n_informative=10,
    n_redundant=5,
    weights=[0.995, 0.005],   # extreme fraud imbalance
    random_state=42
)

df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(20)])
df["fraud"] = y

This mirrors real-world fraud datasets, where fraudulent transactions often account for **less than 1%** of all records.
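
The section title mentions inspecting the data, so a quick check of the class balance may be useful. The sketch below rebuilds the same synthetic dataset and prints the class counts; the names `class_counts` and `fraud_rate` are illustrative. Note that `make_classification` also flips a small share of labels at random by default (`flip_y=0.01`), so the observed fraud rate can differ from the 0.5% requested in `weights`.

```python
import pandas as pd
from sklearn.datasets import make_classification

# Same generator settings as the cell above.
X, y = make_classification(
    n_samples=10000,
    n_features=20,
    n_informative=10,
    n_redundant=5,
    weights=[0.995, 0.005],
    random_state=42,
)

df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(20)])
df["fraud"] = y

# Count rows per class and compute the share of fraud (label 1).
class_counts = df["fraud"].value_counts().sort_index()
fraud_rate = df["fraud"].mean()

print(class_counts)
print(f"Fraud rate: {fraud_rate:.2%}")
```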

---

## **✂️ Train/Test Split (Stratified)**

**Preserve class distribution during splitting.**

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df.drop("fraud", axis=1),
    df["fraud"],
    test_size=0.3,
    stratify=df["fraud"],
    random_state=42
)
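
To see what `stratify` buys us, the following sketch compares the fraud rate in each split. It works on the raw arrays rather than the DataFrame to stay short, and `train_rate` and `test_rate` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Same synthetic data as in the notebook.
X, y = make_classification(
    n_samples=10000,
    n_features=20,
    n_informative=10,
    n_redundant=5,
    weights=[0.995, 0.005],
    random_state=42,
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# With stratification the two rates should be nearly identical.
train_rate = y_train.mean()
test_rate = y_test.mean()

print(f"Train fraud rate: {train_rate:.3%}")
print(f"Test fraud rate:  {test_rate:.3%}")
```

Without `stratify`, a random split of data this skewed could leave the test set with noticeably fewer (or more) fraud cases than the training set.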

---

## **⚖️ Handle Class Imbalance with SMOTE**

**Balance the training set to help the model learn fraud patterns.**

In [None]:
from imblearn.over_sampling import SMOTE

# Oversample the fraud class in the training split only; the test split stays untouched
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

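Without resampling, a model trained on data this skewed tends to predict **“not fraud” for almost everything**. SMOTE is applied to the training split only, so the test set keeps its original class ratio.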

---

## **🌲 Train a Fraud Detection Model**

**Random Forest often works well on non-linear, tabular fraud data.**

In [None]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
    n_estimators=300,
    max_depth=10,
    random_state=42
)

model.fit(X_resampled, y_resampled)

---

## **📊 Evaluate Fraud Detection Performance**

**Use metrics suited for imbalanced classification.**

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

y_pred = model.predict(X_test)

# Per-class precision, recall, and F1
print(classification_report(y_test, y_pred))
# Rows are true classes, columns are predicted classes
print(confusion_matrix(y_test, y_pred))

👉 **Recall for the fraud class matters more than overall accuracy.**
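
If recall is the priority, one common lever is the decision threshold applied to the predicted fraud probability. The sketch below is illustrative rather than part of the original pipeline: to stay short and self-contained it uses fewer trees and stands in for SMOTE with scikit-learn's `class_weight="balanced"` option, and the threshold values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=10000,
    n_features=20,
    n_informative=10,
    n_redundant=5,
    weights=[0.995, 0.005],
    random_state=42,
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Assumption: class weighting instead of SMOTE, and 100 trees, to keep this quick.
model = RandomForestClassifier(
    n_estimators=100, max_depth=10, class_weight="balanced", random_state=42
)
model.fit(X_train, y_train)

# Predicted probability of the fraud class for each test row.
fraud_proba = model.predict_proba(X_test)[:, 1]

results = {}
for threshold in (0.5, 0.3, 0.1):
    # Flag a transaction as fraud when its probability reaches the threshold.
    y_pred = (fraud_proba >= threshold).astype(int)
    precision = precision_score(y_test, y_pred, zero_division=0)
    recall = recall_score(y_test, y_pred, zero_division=0)
    results[threshold] = (precision, recall)
    print(f"threshold={threshold:.1f}  precision={precision:.3f}  recall={recall:.3f}")
```

Lowering the threshold can only add fraud predictions, so recall never decreases, while precision usually pays the price. Where to set it depends on the relative cost of a missed fraud versus a false alarm.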

---

## **📈 ROC–AUC for Fraud Detection**

**ROC–AUC measures how well the model separates fraud from non-fraud across all decision thresholds.**

In [None]:
from sklearn.metrics import roc_auc_score

# Probability of the positive (fraud) class for each test transaction
y_proba = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_proba)

print("ROC–AUC:", auc)
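
On data this skewed, ROC–AUC can look comfortable even when precision on the fraud class is poor, so it may help to report average precision (the area under the precision-recall curve) next to it. This is a hedged sketch, not part of the original script: it trains a smaller forest with `class_weight="balanced"` in place of SMOTE so that it runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=10000,
    n_features=20,
    n_informative=10,
    n_redundant=5,
    weights=[0.995, 0.005],
    random_state=42,
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Assumption: class weighting instead of SMOTE, and 100 trees, for a quick demo.
model = RandomForestClassifier(
    n_estimators=100, max_depth=10, class_weight="balanced", random_state=42
)
model.fit(X_train, y_train)

fraud_proba = model.predict_proba(X_test)[:, 1]

roc_auc = roc_auc_score(y_test, fraud_proba)
avg_precision = average_precision_score(y_test, fraud_proba)

# A classifier that scores at random has an expected average precision
# close to the share of positives in the test set.
chance_level = y_test.mean()

print(f"ROC-AUC:           {roc_auc:.3f}")
print(f"Average precision: {avg_precision:.3f}")
print(f"Chance level (AP): {chance_level:.3f}")
```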

---

## **🧪 Why Fraud Detection Is Different**

- Extreme class imbalance
- False negatives are very costly
- Accuracy is misleading
- Recall, precision, and ROC–AUC matter most
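
To make the point about accuracy concrete, here is a small sketch using scikit-learn's `DummyClassifier`, which always predicts the majority class. It is an illustration on the same synthetic data, not part of the original pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=10000,
    n_features=20,
    n_informative=10,
    n_redundant=5,
    weights=[0.995, 0.005],
    random_state=42,
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Baseline that ignores the features and always answers "not fraud".
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
y_pred = baseline.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
fraud_recall = recall_score(y_test, y_pred, zero_division=0)

print(f"Accuracy:     {accuracy:.3f}")
print(f"Fraud recall: {fraud_recall:.3f}")
```

The baseline scores very high accuracy while catching no fraud at all, which is why accuracy on its own says little here.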

---

## **Key Takeaways**

1. Fraud detection is an extremely imbalanced classification problem.
2. SMOTE, applied to the training set only, helps models learn rare fraud patterns.
3. Random Forest handles complex fraud signals well.
4. Accuracy alone is misleading for fraud use cases.
5. ROC–AUC and recall are critical evaluation metrics.

---