# Data Science for Business
**Faculty of Economics and Business**  
**Accounting Department**  
**Master of Accounting Program**  
**Universitas Indonesia**


**Course:** Data Science for Business (ECEM801201)  
**Semester:** Odd Semester 2025/2026   
**Part:** Business Problem with Data Solutions   
**Content:** Supervised & Unsupervised Machine Learning

---

## Class Information

| Role | Name | Contact | LinkedIn |
|------------|-------------|---------|---------|
| Lecturer | Yudhistira Dharma Putra, S.E., M.Sc. | y.dharma@ui.ac.id | https://id.linkedin.com/in/yudhistira-dharma-putra-91367256 |
| Assistant Lecturer | Fiqry Revadiansyah | fiqryrevadiansyah@gmail.com | https://www.linkedin.com/in/fiqryrevadiansyah/ |

# Part 3: Supervised Learning: Classification

In supervised learning, we teach a model using examples that already have the correct answer (the label). After learning patterns from these labeled examples, the model predicts the label for new, unseen data.

📌 Accounting analogy  
An auditor studies past transactions labeled fraud or legitimate. After enough examples, the auditor can flag risky transactions in a fresh ledger. That is classification.

---

## 3.1 Problem framing

- **Goal**: predict a **category** for each record  
  Examples: is a business still open (1) or closed (0), is a transaction fraud or not, will a customer default or not.
- **Input** (**X**, features): columns like revenue growth, cash ratio, industry, account age.
- **Output** (**y**, label): the thing to predict, such as `isOpen` or `isFraud`.
- **Train vs test split**: learn patterns on training data, judge generalization on test data.

Key ideas  
- **Generalization**: doing well on unseen data.  
- **Bias vs variance**: simple models may underfit (high bias). Complex models may overfit (high variance).  
- **Data leakage**: do not use any future or target-derived information during training.
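
The bias-variance idea above can be sketched with a small experiment. This is an illustrative example only: the data is synthetic (generated with `make_classification`, not the course dataset), and the variable names are placeholders. It trains decision trees of increasing depth and compares training accuracy with test accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 2,000 records, 20 features, roughly 80/20 classes
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=20, n_informative=5,
    weights=[0.8, 0.2], flip_y=0.05, random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

results = {}
for depth in (1, 4, None):  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42).fit(X_tr, y_tr)
    results[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(f"max_depth={depth}: train={results[depth][0]:.3f}, test={results[depth][1]:.3f}")
```

A very shallow tree tends to score modestly on both sets (underfitting), while the unrestricted tree memorizes the training set and shows a gap between training and test accuracy (overfitting).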

In [5]:
# 3.1 Problem Framing — one simple cell

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

TARGET_COL = "isOpen"
df = pd.read_csv('../data/w2--dataset.csv')

# 1) Make sure target exists (create a simple demo target if missing)
if TARGET_COL not in df.columns:
    print(f"[Info] '{TARGET_COL}' not found. Creating a demo target for teaching.")
    num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    if num_cols:
        key = num_cols[0]
        thresh = df[key].median()
        df[TARGET_COL] = (df[key] > thresh).astype(int)
    else:
        np.random.seed(42)
        df[TARGET_COL] = np.random.choice([0, 1], size=len(df))

# 2) Define X (features) and y (label) — use numeric features only for now
feature_cols = df.select_dtypes(include=[np.number]).columns.tolist()
if TARGET_COL in feature_cols:
    feature_cols.remove(TARGET_COL)

X = df[feature_cols].copy()
y = df[TARGET_COL].copy()

# 3) Quick summary
print("Dataset shape:", df.shape)
print("Number of numeric features (X):", len(feature_cols))
print("First 5 feature names:", feature_cols[:5])
print("\nTarget distribution (y):")
print(y.value_counts(normalize=True).rename(lambda k: f"class {k}").mul(100).round(1).astype(str) + "%")

Dataset shape: (50000, 19)
Number of numeric features (X): 10
First 5 feature names: ['buisness_year', 'doc_id', 'document_create_date', 'document_create_date.1', 'due_in_date']

Target distribution (y):
isOpen
class 0    80.0%
class 1    20.0%
Name: proportion, dtype: object


## 3.2 Data preparation checklist

1) **Target definition**  
   - Confirm the label is aligned with the business question and time horizon.  
   - Example: `isOpen` refers to next quarter status, not current status.

2) **Train and test split**  
   - Use stratification for imbalanced classes.  
   - Keep the test set cold (no peeking).

3) **Feature engineering**  
   - Numeric features: raw values, ratios, growth, log transforms.  
   - Categorical features: industry, region, risk tier.  
   - Dates: convert to age, month, seasonality indicators.  
   - Aggregations: customer-level averages, rolling statistics.

4) **Missing values**  
   - Impute with mean or median for numeric, most frequent for categorical.  
   - Add missingness indicators when it carries signal.

5) **Scaling**  
   - Needed for distance-based, margin-based, and gradient-trained models (k-NN, SVM, regularized logistic regression, neural nets), where features on large scales would otherwise dominate.  
   - Not required for trees and tree ensembles.

6) **Encoding categoricals**  
   - One-hot encoding for low or medium cardinality.  
   - Target or CatBoost encoding for high-cardinality categoricals.  
   - Some libraries (CatBoost) handle categoricals natively.

7) **Pipelines**  
   - Wrap preprocessing and models in a single pipeline to avoid leakage and keep training consistent.
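
To illustrate item 7, here is a minimal sketch of preprocessing and a model wrapped in one pipeline and scored with cross-validation. The toy DataFrame and its column names (`cash_ratio`, `revenue_growth`, `industry`) are hypothetical and are not taken from the course dataset. Because the imputer, scaler, and encoder sit inside the pipeline, they are re-fitted on the training part of each fold only, which is what prevents leakage.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data (assumption): names and distributions are illustrative only
rng = np.random.default_rng(42)
n = 500
toy = pd.DataFrame({
    "cash_ratio": rng.normal(1.0, 0.3, n),
    "revenue_growth": rng.normal(0.05, 0.1, n),
    "industry": rng.choice(["retail", "mining", "services"], n),
})
toy.loc[rng.choice(n, 25, replace=False), "cash_ratio"] = np.nan  # inject missing values
y_toy = (toy["revenue_growth"] + rng.normal(0, 0.1, n) > 0.05).astype(int)

numeric_steps = Pipeline([("imputer", SimpleImputer(strategy="median")),
                          ("scaler", StandardScaler())])
categorical_steps = Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                              ("onehot", OneHotEncoder(handle_unknown="ignore"))])
preprocess_toy = ColumnTransformer([
    ("num", numeric_steps, ["cash_ratio", "revenue_growth"]),
    ("cat", categorical_steps, ["industry"]),
])

# Preprocessing + model in ONE object: each CV fold fits the imputer/scaler on its own training part
clf = Pipeline([("preprocess", preprocess_toy),
                ("model", LogisticRegression(max_iter=1000))])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(clf, toy, y_toy, cv=cv, scoring="roc_auc")
print("ROC-AUC per fold:", scores.round(3))
```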

---

In [6]:
# 3.2 Data Preparation — one simple cell (checklist → code)
#
# This cell prepares data for ML with a clean, reusable sklearn Pipeline:
# - Defines target and splits (if not already split)
# - Engineers simple date features (year, month, quarter, age_days)
# - Imputes missing values (numeric=median+indicator, categorical=most_frequent)
# - One-hot encodes categoricals
# - Scales numeric features (good for LR/SVM/NN; trees don’t need it but harmless)
# - Wraps everything in a Pipeline/ColumnTransformer to avoid leakage

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

TARGET_COL = "isOpen"

# --- 1) Split data (stratified so class balance is preserved) ---
X = df.drop(columns=[TARGET_COL])
y = df[TARGET_COL]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# --- 2) Identify numeric vs categorical features ---
numeric_features = X_train.select_dtypes(include=[np.number]).columns.tolist()
categorical_features = X_train.select_dtypes(include=['object', 'category']).columns.tolist()

# --- 3) Preprocessing pipelines ---
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median", add_indicator=True)),
    ("scaler", StandardScaler())  # scaling useful for LR/SVM/NN
])

categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore", sparse_output=False))  # dense output; `sparse_output` needs scikit-learn >= 1.2
])

preprocess = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ],
    remainder="drop"
)

# --- 4) Fit on training, transform both train & test ---
preprocess.fit(X_train)

X_train_ready = preprocess.transform(X_train)
X_test_ready  = preprocess.transform(X_test)

print("Prep complete.")
print("Numeric features:", len(numeric_features), "| Categorical features:", len(categorical_features))
print("X_train_ready shape:", X_train_ready.shape, "| X_test_ready shape:", X_test_ready.shape)
print("y_train distribution:", y_train.value_counts(normalize=True).round(2).to_dict())



Prep complete.
Numeric features: 10 | Categorical features: 8
X_train_ready shape: (40000, 6058) | X_test_ready shape: (10000, 6058)
y_train distribution: {0: 0.8, 1: 0.2}




## 3.3 Evaluation and business alignment

### Common metrics
- **Accuracy**: percent correct. Simple baseline, can mislead under imbalance.  
- **Balanced accuracy**: average of recall across classes. Better when classes are imbalanced.  
- **Precision (for positive class)**: out of predicted positives, how many are truly positive.  
- **Recall (for positive class)**: out of all true positives, how many did we catch.  
- **F1 score**: harmonic mean of precision and recall.  
- **ROC-AUC**: probability the model ranks a random positive above a random negative.  
- **PR-AUC**: area under the precision-recall curve, more informative than ROC-AUC under heavy imbalance.  
- **Log loss**: measures probability quality. Useful when you will use calibrated probabilities.

### Threshold tuning
- Models output probabilities. You choose a decision threshold (often 0.5 by default).  
- Optimize the threshold using business costs.  
  - Example: fraud detection where missing a fraud is expensive. Choose a threshold that maximizes expected profit or minimizes expected cost.
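
One way to make "optimize the threshold using business costs" concrete is to sweep candidate thresholds and keep the cheapest one. The sketch below reuses the 15 dummy transactions and the two cost assumptions (Rp 5,000,000 per missed fraud, Rp 150,000 per manual review) from the demo cells in this section. The helper `expected_cost` is a hypothetical name introduced here for illustration.

```python
import numpy as np

# Same 15 dummy transactions as the demo cell in this section (1 = fraud)
y_true = np.array([1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0])
y_score = np.array([0.85, 0.40, 0.72, 0.10, 0.22, 0.55, 0.65, 0.18,
                    0.12, 0.33, 0.49, 0.31, 0.08, 0.27, 0.60])

COST_FN = 5_000_000  # assumed loss per missed fraud (Rp)
COST_FP = 150_000    # assumed cost per manual review of a legitimate transaction (Rp)

def expected_cost(threshold):
    """Total cost of false negatives and false positives at a given threshold."""
    y_pred = (y_score >= threshold).astype(int)
    fn = int(((y_pred == 0) & (y_true == 1)).sum())
    fp = int(((y_pred == 1) & (y_true == 0)).sum())
    return fn * COST_FN + fp * COST_FP

candidates = np.unique(y_score)  # every observed score is a candidate cut-off
costs = np.array([expected_cost(t) for t in candidates])
best_threshold = float(candidates[costs.argmin()])
best_cost = int(costs.min())
print(f"Lowest-cost threshold: {best_threshold:.2f} (total cost Rp {best_cost:,})")
# → Lowest-cost threshold: 0.49 (total cost Rp 300,000)
```

On real data the threshold should be chosen on a validation set rather than on the data used for the final evaluation, otherwise the reported cost is optimistic.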

### Confusion matrix and costs
- **TP**: correctly predicted positive  
- **FP**: predicted positive but actually negative  
- **FN**: predicted negative but actually positive  
- **TN**: correctly predicted negative
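
For reference, the metrics listed above can be written in terms of these four counts. The cost symbols $c_{FN}$ and $c_{FP}$ (cost per false negative and per false positive) are notation introduced here for illustration.

$$
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
\text{Specificity} = \frac{TN}{TN + FP}
$$

$$
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}, \qquad
\text{Balanced accuracy} = \frac{\text{Recall} + \text{Specificity}}{2}, \qquad
\text{Total cost} = FN \cdot c_{FN} + FP \cdot c_{FP}
$$

As a worked example with illustrative counts $TP = 3$, $FP = 2$, $FN = 1$, $TN = 9$: precision is $3/5 = 0.60$, recall is $3/4 = 0.75$, specificity is $9/11 \approx 0.82$, $F_1 = 0.9/1.35 \approx 0.67$, and balanced accuracy is about $0.78$.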

---

In [None]:
# 3.3 Evaluation & Business Alignment: demo (dummy accounting/finance data)
#
# Context: detecting "fraud transactions" (positive = 1) vs "non-fraud transactions" (negative = 0)
# Real-world examples: card fraud, fictitious expense claims, or invoices likely to default.
# y_true  : actual labels from the audit/risk team
# y_score : model probability that a transaction is risky (higher = riskier)
# y_pred  : 0/1 decision based on a threshold
#
# Output:
# - Confusion matrix (TP/FP/FN/TN) with an explanation of the terms
# - Accuracy, Precision, Recall, F1, Specificity, Balanced Accuracy
# - ROC-AUC, PR-AUC, Log Loss
# - Threshold comparison: 0.45 vs 0.50 vs 0.70
# - Cost simulation: cost_of_review (manual review of a safe transaction wrongly flagged as risky)
#                    avg_fraud_loss (loss when a risky transaction slips through)
#
# Note: the demo continues in the next two cells; split it further per metric as needed.

import numpy as np
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, average_precision_score, log_loss,
    confusion_matrix
)

# -----------------------------
# 0) Dummy accounting/finance data
# -----------------------------
# 15 transactions: 1 = risky (fraud/default), 0 = safe
# Example: transactions #1, #3, #7, #11 are risky (ground truth from audit)
y_true = np.array([
    1, 0, 1, 0, 0,
    0, 1, 0, 0, 0,
    1, 0, 0, 0, 0
])

# Model probabilities: how risky each transaction is
# A higher number means the model is more suspicious of the transaction
y_score = np.array([
    0.85, 0.40, 0.72, 0.10, 0.22,
    0.55, 0.65, 0.18, 0.12, 0.33,
    0.49, 0.31, 0.08, 0.27, 0.60
])

# Default threshold 0.5 (a common starting point)
thr_default = 0.50
y_pred = (y_score >= thr_default).astype(int)

# -----------------------------
# 1) Confusion matrix + terminology
# -----------------------------
cm = confusion_matrix(y_true, y_pred, labels=[0,1])
TN, FP, FN, TP = cm.ravel()  # rows = actual [0,1], columns = predicted [0,1]

print("=== Business Context (Fraud Transaction Detection) ===")
print("- Positive (1)  : Risky transaction (fraud/default)")
print("- Negative (0)  : Safe transaction")
print("- TP (True +)   : Flagged as fraud and actually fraud → loss avoided")
print("- FP (False +)  : Flagged as fraud but actually not fraud → manual review cost/customer complaints")
print("- FN (False -)  : Not flagged but actually fraud → financial loss")
print("- TN (True -)   : Not flagged and actually not fraud\n")

print(f"=== Confusion Matrix @ threshold {thr_default:.2f} ===")
print("[ [TN FP] ; [FN TP] ]")
print(cm)

# -----------------------------
# 2) Basic classification metrics
# -----------------------------
accuracy  = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred, zero_division=0)
recall    = recall_score(y_true, y_pred, zero_division=0)
f1        = f1_score(y_true, y_pred, zero_division=0)

# Specificity = TN / (TN + FP)
specificity = TN / (TN + FP) if (TN + FP) > 0 else 0.0
balanced_acc = (recall + specificity) / 2

# Probability-based metrics (threshold-independent)
eps = 1e-7
p = np.clip(y_score, eps, 1 - eps)
roc_auc = roc_auc_score(y_true, y_score)
pr_auc  = average_precision_score(y_true, y_score)
ll      = log_loss(y_true, p)

In [24]:
# Think of the model as an AUTOMATED SCREENING SYSTEM.
# Genuine/legitimate transaction = a transaction from an honest buyer.
# Fraud transaction = a transaction from a fraudster.

# TP (True Positive)  : The system CORRECTLY blocks a fraudster's transaction. ✅
# TN (True Negative)  : The system CORRECTLY lets an honest buyer's transaction through. ✅
# FP (False Positive) : The system WRONGLY blocks an honest buyer's transaction. (False alarm 🚨)
# FN (False Negative) : The system MISSES a fraudster's transaction and lets it through. (Fraud slips through 💸)

print(f"Accuracy              : {accuracy:.3f}")
print("  → How often the system's call is right overall (both when blocking fraudsters and when letting honest buyers through).")
print("  → Like the system's overall report-card grade.")
print("  → CAUTION: Accuracy can mislead when legitimate transactions far outnumber fraud. A system can score 99% just by letting every transaction through, even though every fraudster gets through too.")

print(f"\nPrecision             : {precision:.3f}")
print("  → Of all transactions the system BLOCKED, what share were ACTUALLY fraud?")
print("  → Measures the QUALITY of the blocks. Avoid blocking the wrong transactions too often!")
print("  → If Precision is HIGH: the system is very reliable. When it blocks something, it is almost certainly fraud. This matters so honest customers are not frustrated by rejected transactions.")
print("  → If Precision is LOW: the system is too 'paranoid'. Many transactions from honest buyers get blocked too, which leads to a poor shopping experience and many customer complaints.")

print(f"\nRecall                : {recall:.3f}")
print("  → Of all fraud transactions that ACTUALLY occurred, what share did the system manage to SCREEN OUT/BLOCK?")
print("  → Measures the system's ABILITY to 'sweep up' the fraudsters.")
print("  → If Recall is HIGH: the system is very effective. Almost no fraudster gets through, so company losses are contained.")
print("  → If Recall is LOW: the system 'leaks' a lot. Many fraud transactions get through and cause financial losses for the company.")

print(f"\nF1-Score              : {f1:.3f}")
print("  → A combined score that balances Precision and Recall.")
print("  → Looks for the best 'middle ground' between 'too strict' (many wrong blocks) and 'too lenient' (much fraud slipping through).")
print("  → Think of it as the system's GPA: a single number for judging whether performance is balanced across these two important tasks.")
print("  → Useful because in fraud cases we cannot focus on only one of the two.")

print(f"\nSpecificity (TNR)     : {specificity:.3f}")
print("  → How good is the system at recognizing HONEST BUYERS and letting them through?")
print("  → Of all transactions that are ACTUALLY LEGITIMATE, what share did the system let through without trouble?")
print("  → This is the 'customer satisfaction' metric. A high value means the system very rarely disrupts transactions from legitimate buyers.")

print("\n\n📈 METRICS FOR IMBALANCED DATA:")
print("   (Highly relevant for fraud, where legitimate transactions far outnumber fraud)")
print("-" * 60)
print(f"Balanced Accuracy     : {balanced_acc:.3f}")
print("  → An Accuracy score that has been 'corrected' to be fairer on imbalanced data.")
print("  → Computed by averaging the system's ability to detect fraud (Recall) and its ability to recognize legitimate transactions (Specificity).")
print("  → Prevents the system from scoring well just because most transactions are legitimate. Performance on both sides (legitimate vs fraud) counts equally.")

print(f"\nROC-AUC               : {roc_auc:.3f}")
print("  → Measures the system's ability to DISTINGUISH fraud from legitimate transactions.")
print("  → Imagine the system assigns a 'suspicion score' to every transaction. This value answers: if we pick 1 fraud and 1 legitimate transaction at random, how likely is the system to score the fraud one higher?")
print("  → 1.0 = perfect. 0.5 = no better than a coin flip.")

print(f"\nPR-AUC                : {pr_auc:.3f}")
print("  → Similar to ROC-AUC, but a 'specialist' for highly imbalanced data such as fraud.")
print("  → Focuses on the question: 'How well can the system find all the fraud (Recall) without too many false catches (Precision)?'")
print("  → Because it is not affected by the millions of easy-to-guess legitimate transactions, it often reflects the model's real performance on fraud better.")

print(f"\nLog Loss              : {ll:.3f} (lower is better)")
print("  → Penalizes the system not only for wrong calls, but also for being 'overconfident' when wrong.")
print("  → A good system does not just say 'fraud' or 'legitimate'; it also gives a confidence level (probability).")
print("  → Example: if the system says with 99% confidence that a transaction is fraud and it turns out to be legitimate, it receives a very large penalty (Log Loss value).")

Accuracy              : 0.800
  → How often the system's call is right overall (both when blocking fraudsters and when letting honest buyers through).
  → Like the system's overall report-card grade.
  → CAUTION: Accuracy can mislead when legitimate transactions far outnumber fraud. A system can score 99% just by letting every transaction through, even though every fraudster gets through too.

Precision             : 0.600
  → Of all transactions the system BLOCKED, what share were ACTUALLY fraud?
  → Measures the QUALITY of the blocks. Avoid blocking the wrong transactions too often!
  → If Precision is HIGH: the system is very reliable. When it blocks something, it is almost certainly fraud. This matters so honest customers are not frustrated by rejected transactions.
  → If Precision is LOW: the system is too 'paranoid'. Many transactions from honest buyers get blocked too, which leads to a poor shopping experience and many customer complaints.

Recall                : 0.7

In [25]:
# --------------------------------------------------------------------
# 3) COST ANALYSIS & BUSINESS STRATEGY SIMULATION
#    Computes the potential financial loss implied by the model's decisions.
# --------------------------------------------------------------------

# Assume the following costs (in Rupiah).
# These are estimates that the business/finance team must validate.

# Average loss when 1 fraud transaction slips through (False Negative),
# for example the average transaction value lost.
avg_fraud_loss = 5000000  # Rp 5,000,000

# Cost of each manual investigation of a wrongly flagged transaction (False Positive).
# Covers analyst time, tooling, or the cost of delaying a legitimate customer's transaction.
cost_of_review = 150000   # Rp 150,000

print("=== Business Cost Assumptions ===")
print(f"Loss per missed fraud (FN)    : Rp {avg_fraud_loss:,}")
print(f"Cost per manual review (FP)   : Rp {cost_of_review:,}\n")

# ---
# Scenario A: CHASE RECALL (the "never miss fraud" team)
# Lower the threshold so the model is more sensitive and catches more fraud.
# Consequence: more False Positives (wrongly flagged transactions) are expected.
# ---
print("--- [ Scenario A: HIGH RECALL focus (low threshold) ] ---")
thr_recall = 0.45  # lower threshold so more transactions are treated as risky

# Recompute predictions, metrics, and losses
y_pred_recall = (y_score >= thr_recall).astype(int)
TN_r, FP_r, FN_r, TP_r = confusion_matrix(y_true, y_pred_recall, labels=[0, 1]).ravel()
precision_r = precision_score(y_true, y_pred_recall, zero_division=0)
recall_r = recall_score(y_true, y_pred_recall, zero_division=0)

# Total financial loss for this scenario
loss_from_missed_fraud = FN_r * avg_fraud_loss
loss_from_reviews = FP_r * cost_of_review
total_loss_recall = loss_from_missed_fraud + loss_from_reviews

print(f"With threshold = {thr_recall}, the model becomes more 'aggressive':")
print(f"  → Recall: {recall_r:.2f} (higher, good!) | Precision: {precision_r:.2f} (watch the trade-off)")
print(f"  → Missed fraud (FN)      : {FN_r} transactions → Loss: Rp {loss_from_missed_fraud:,}")
print(f"  → Wrongly blocked (FP)   : {FP_r} transactions → Cost: Rp {loss_from_reviews:,}")
print(f"  → TOTAL POTENTIAL LOSS   : Rp {total_loss_recall:,}\n")
print("  (+) PRO: Most fraud is caught, minimizing large losses.")
print("  (-) CON: Review operating costs rise, and many honest customers may be disrupted.\n")


# ---
# Scenario B: PROTECT PRECISION (the "never accuse wrongly" team)
# Raise the threshold so only HIGHLY SUSPICIOUS transactions are blocked.
# Consequence: more False Negatives (fraud slipping through) are expected.
# ---
print("--- [ Scenario B: HIGH PRECISION focus (high threshold) ] ---")
thr_precision = 0.70  # raise the threshold; the model only acts on very high scores

# Recompute predictions, metrics, and losses
y_pred_precision = (y_score >= thr_precision).astype(int)
TN_p, FP_p, FN_p, TP_p = confusion_matrix(y_true, y_pred_precision, labels=[0, 1]).ravel()
precision_p = precision_score(y_true, y_pred_precision, zero_division=0)
recall_p = recall_score(y_true, y_pred_precision, zero_division=0)

# Total financial loss for this scenario
loss_from_missed_fraud_p = FN_p * avg_fraud_loss
loss_from_reviews_p = FP_p * cost_of_review
total_loss_precision = loss_from_missed_fraud_p + loss_from_reviews_p

print(f"With threshold = {thr_precision}, the model becomes more 'conservative':")
print(f"  → Recall: {recall_p:.2f} (lower, dangerous!) | Precision: {precision_p:.2f} (higher, good!)")
print(f"  → Missed fraud (FN)      : {FN_p} transactions → Loss: Rp {loss_from_missed_fraud_p:,}")
print(f"  → Wrongly blocked (FP)   : {FP_p} transactions → Cost: Rp {loss_from_reviews_p:,}")
print(f"  → TOTAL POTENTIAL LOSS   : Rp {total_loss_precision:,}\n")
print("  (+) PRO: Honest customers are rarely disrupted, and review costs stay low.")
print("  (-) CON: The risk of missing high-value fraud is very high.\n")


# ---
# Scenario C: Balanced approach (uses the default threshold from the earlier cell)
# ---
print("--- [ Scenario C: Balanced approach (default threshold) ] ---")
loss_from_missed_fraud_def = FN * avg_fraud_loss
loss_from_reviews_def = FP * cost_of_review
total_loss_default = loss_from_missed_fraud_def + loss_from_reviews_def
print(f"With threshold = {thr_default:.2f}, the baseline performance is:")
print(f"  → Recall: {recall:.2f} | Precision: {precision:.2f}")
print(f"  → TOTAL POTENTIAL LOSS   : Rp {total_loss_default:,}\n")


# -----------------------------
# 4) CONCLUSION & RECOMMENDATION
# -----------------------------
print("=" * 64)
print("             COMPARISON OF TOTAL POTENTIAL LOSS")
print("=" * 64)
print(f"| {'Strategy':<28} | {'Threshold':^9} | {'Total loss':<17} |")
print("|" + "-" * 30 + "|" + "-" * 11 + "|" + "-" * 19 + "|")
print(f"| {'A: Recall (aggressive)':<28} | {thr_recall:^9.2f} | Rp {total_loss_recall:^14,d} |")
print(f"| {'C: Balanced (default)':<28} | {thr_default:^9.2f} | Rp {total_loss_default:^14,d} |")
print(f"| {'B: Precision (conservative)':<28} | {thr_precision:^9.2f} | Rp {total_loss_precision:^14,d} |")
print("=" * 64 + "\n")

print("💡 Business considerations:")
print("There is no 'perfect' threshold, only a threshold that is 'optimal' for the business goal.")
print("  - If the business is focused on reducing financial losses from fraud (for example during an attack), strategy A (recall focus) makes more sense even though operating costs rise.")
print("  - If the top priority is protecting customer experience and satisfaction, strategy B (precision focus) can be chosen, knowing that the risk of missed fraud is higher.")
print("\nThis kind of analysis helps the team find the 'sweet spot': the threshold that gives the company the lowest total loss.")

=== Business Cost Assumptions ===
Loss per missed fraud (FN)    : Rp 5,000,000
Cost per manual review (FP)   : Rp 150,000

--- [ Scenario A: HIGH RECALL focus (low threshold) ] ---
With threshold = 0.45, the model becomes more 'aggressive':
  → Recall: 1.00 (higher, good!) | Precision: 0.67 (watch the trade-off)
  → Missed fraud (FN)      : 0 transactions → Loss: Rp 0
  → Wrongly blocked (FP)   : 2 transactions → Cost: Rp 300,000
  → TOTAL POTENTIAL LOSS   : Rp 300,000

  (+) PRO: Most fraud is caught, minimizing large losses.
  (-) CON: Review operating costs rise, and many honest customers may be disrupted.

--- [ Scenario B: HIGH PRECISION focus (high threshold) ] ---
With threshold = 0.7, the model becomes more 'conservative':
  → Recall: 0.50 (lower, dangerous!) | Precision: 1.00 (higher, good!)
  → Missed fraud (FN)      : 2 transactions → Loss: Rp 10,000,000
  → Wrongly blocked (FP)   : 0 transactions → Cost: Rp 0
  → TOTAL POTENTIAL LOSS   : Rp 10,000,000

  (+

## 3.4 Model catalog: academic and practical view

Below is a balanced catalog that covers baselines, interpretable models, and high-performance models. The best practice is to start with strong **benchmarks** then escalate in complexity.

---

### 1) Linear and Probabilistic Models

These models find simple, often linear relationships in your data. They are fast, easy to understand, and provide a great starting point for any classification task.

- **Logistic Regression**
  - **Definition:** Think of this as a statistical tool for predicting a yes/no outcome (e.g., will a customer default?). It takes a weighted sum of the input variables (like income, age, loan amount) and passes it through the logistic (sigmoid) function to produce a probability. The "coefficients" or weights tell you how much each factor shifts the log-odds of the outcome, and in which direction.
  - **When to Use:** An excellent baseline for binary classification problems like fraud detection or credit approval. It's fast, robust, and the results are easy to explain to stakeholders. Training is more stable when your financial metrics are on a similar scale (e.g., standardized or mapped to 0 to 1), and regularization (L1 or L2) helps it handle a large number of features without overfitting. Note that it assumes a linear relationship between the inputs and the log-odds of the outcome.

- **Linear Discriminant Analysis (LDA)**
  - **Definition:** LDA finds the combination of your financial metrics that best separates two or more groups (e.g., profitable vs. non-profitable companies). It projects the data onto a lower-dimensional space chosen to maximize the separation between group means relative to the spread within each group.
  - **When to Use:** It's very fast and effective when your numerical data (like financial ratios) is roughly normally distributed within each group and the groups have similar covariance. It can be a strong performer when these statistical assumptions hold.

- **Naive Bayes**
  - **Definition:** This model uses probability (specifically Bayes' Theorem) to classify data. It makes a "naive" but powerful assumption that the input features are independent of each other once the class is known (e.g., that among companies in the same class, revenue tells you nothing further about the number of employees).
  - **When to Use:** It's extremely fast and a great first model to try, especially for analyzing text like news articles for sentiment analysis or classifying transaction descriptions. Despite its simple assumption, it often performs surprisingly well in practice.
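
As a small illustration of how logistic regression coefficients are read, the sketch below fits a scaled logistic regression on synthetic data and converts each coefficient into an odds ratio. The feature names are hypothetical labels attached to generated columns, not real financial data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the feature names are hypothetical labels
feature_names = ["leverage", "cash_ratio", "account_age", "revenue_growth"]
X_lr, y_lr = make_classification(n_samples=1000, n_features=4, n_informative=3,
                                 n_redundant=1, random_state=0)

# Scaling first puts all coefficients on a comparable "per standard deviation" basis.
# LogisticRegression applies L2 regularization by default; C controls its strength.
model = make_pipeline(StandardScaler(), LogisticRegression(C=1.0))
model.fit(X_lr, y_lr)

coefs = model.named_steps["logisticregression"].coef_[0]
odds_ratios = np.exp(coefs)  # multiplicative change in odds per 1 std. dev. increase
for name, coef, ratio in zip(feature_names, coefs, odds_ratios):
    print(f"{name:>15}: coef={coef:+.3f}, odds ratio={ratio:.3f}")
```

An odds ratio above 1 means the feature pushes the prediction toward the positive class; below 1 means it pushes away from it.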

---

### 2) Distance and Margin-Based Models

These models classify data points based on their proximity to other points or to a dividing boundary.

- **k-Nearest Neighbors (k-NN)**
  - **Definition:** A simple and intuitive model that classifies a new data point based on the majority class of its "k" closest neighbors. Think of it as "judging a company by the company it keeps." If a new loan applicant's financial profile is very similar to 5 previous applicants who all defaulted, k-NN would predict a default.
  - **When to Use:** Good for smaller, clean datasets where data points with similar features tend to have similar outcomes. It's easy to understand but can become very slow for making predictions on large datasets and can struggle when you have a high number of features (the "curse of dimensionality"). Requires feature scaling.

- **Support Vector Machines (SVM)**
  - **Definition:** An SVM seeks to find the best possible "line" or boundary that separates different classes of data. It does this by maximizing the margin, or the gap, between the closest data points of each class (the "support vectors"). Using a "kernel trick," it can even find complex, non-linear boundaries.
  - **When to Use:** A powerful choice for medium-sized, well-structured datasets, such as predicting stock price direction (up or down). It requires feature scaling and careful tuning of its parameters, and can be computationally intensive on very large datasets.
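
To see why distance-based models need feature scaling, the sketch below compares k-NN with and without `StandardScaler` on synthetic data in which one uninformative column is deliberately put on a much larger scale. The data and the factor of 1000 are assumptions chosen to make the effect visible.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): with shuffle=False the 4 informative columns come first,
# followed by 2 pure-noise columns
X_knn, y_knn = make_classification(n_samples=1000, n_features=6, n_informative=4,
                                   n_redundant=0, shuffle=False, random_state=1)
X_knn[:, -1] *= 1000  # one noise column on a much larger scale (think rupiah vs ratios)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
knn_raw = KNeighborsClassifier(n_neighbors=5)
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

acc_raw = cross_val_score(knn_raw, X_knn, y_knn, cv=cv).mean()
acc_scaled = cross_val_score(knn_scaled, X_knn, y_knn, cv=cv).mean()
print(f"k-NN accuracy without scaling: {acc_raw:.3f}")
print(f"k-NN accuracy with scaling   : {acc_scaled:.3f}")
```

Without scaling, the large-scale noise column dominates the Euclidean distance, so the neighbors k-NN finds are close to arbitrary.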

---

### 3) Tree-Based Models

These models use a series of "if-then" rules to make predictions, forming a structure that looks like a tree.

- **Decision Tree**
  - **Definition:** Creates a flowchart of questions to arrive at a classification. For example: "Is the P/E ratio > 20? If no, classify as 'Buy'. If yes, ask a follow-up question: is the debt-to-equity ratio < 0.5?" The resulting tree is highly transparent and easy to interpret.
  - **When to Use:** Excellent when you need to explain the model's logic clearly. They are fast but, on their own, can easily memorize the training data (overfit). They are most often used as the building blocks for more powerful "ensemble" models below.

- **Random Forest**
  - **Definition:** Instead of relying on one decision tree, a Random Forest builds a large number of them. Each tree is trained on a random sample of the data and considers only a random subset of features. The final prediction is determined by a majority vote from all the trees in the "forest."
  - **When to Use:** A fantastic default model for many problems. It's powerful, handles complex relationships and interactions automatically, is robust to outliers, and requires minimal tuning. Great for tasks like customer churn prediction or asset valuation. Its main drawback is being less directly interpretable than a single tree.

- **Extra Trees (Extremely Randomized Trees)**
  - **Definition:** A variation of Random Forest that adds another layer of randomness. When splitting a node, instead of searching for the best possible cut-off, it draws a random cut-off for each candidate feature and picks the best among those. This tends to reduce variance (at the cost of slightly more bias) and speeds up training.
  - **When to Use:** A good alternative to try alongside a Random Forest. It is often faster to train and can sometimes yield better performance.

- **Gradient Boosting Trees (XGBoost, LightGBM, CatBoost)**
  - **Definition:** The current state-of-the-art for many tabular data problems. This technique builds trees sequentially, where each new tree is trained to correct the errors made by the previous ones. It's like assembling a team of specialists, where each one focuses on the hardest cases the previous one got wrong.
  - **When to Use:** When you need the highest possible performance on structured data like financial statements, market data, or transaction logs.
    - **XGBoost:** The well-established, reliable, and highly customizable standard.
    - **LightGBM:** Known for its incredible speed, especially on very large datasets.
    - **CatBoost:** Shines when you have a lot of categorical features (e.g., industry, country, product type), as it can handle them automatically and effectively.
  - These models require careful tuning and cross-validation to prevent overfitting.

---

### 4) Neural Models for Tabular Data

These models are inspired by the structure of the human brain and can learn very complex patterns.

- **Multilayer Perceptron (MLP)**
  - **Definition:** A classic type of neural network that uses interconnected layers of "neurons" to learn complex, non-linear patterns in the data. It's considered a universal approximator, meaning that with enough hidden units it can theoretically approximate any continuous function.
  - **When to Use:** While tree-based models often perform better on standard tabular financial data, an MLP can be a good choice if you have a very large dataset or if you suspect there are intricate underlying patterns that other models miss. They require careful feature scaling, tuning of the network structure, and regularization.
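
The three requirements named above (feature scaling, network structure, regularization) can be sketched with scikit-learn's `MLPClassifier`. This is an illustrative setup on synthetic data, not a tuned architecture; the layer sizes and `alpha` value are assumptions chosen only to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption: stands in for a real tabular dataset)
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=6, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

mlp = make_pipeline(
    StandardScaler(),                  # scaling lives inside the pipeline
    MLPClassifier(
        hidden_layer_sizes=(32, 16),   # two small hidden layers
        alpha=1e-3,                    # L2 regularization strength
        early_stopping=True,           # stops when the internal validation score stalls
        max_iter=300,
        random_state=42,
    ),
)
mlp.fit(X_tr, y_tr)

mlp_accuracy = mlp.score(X_te, y_te)
print(f"MLP test accuracy: {mlp_accuracy:.3f}")
```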

---

### 5) Probabilistic and Calibrated Outputs

Ensuring the model's probability scores are trustworthy.

- **Calibrated Classifiers**
  - **Definition:** A process that adjusts a model's output scores to ensure they represent true probabilities. For example, when a calibrated model predicts an 80% probability of default, it means that out of 100 loans with that score, about 80 will actually default. Models like SVMs or even boosted trees can be poorly calibrated out-of-the-box.
  - **When to Use:** This is **critical** when your business decision depends on the probability value itself, not just the final classification. Examples include calculating expected credit loss, setting insurance premiums, or pricing financial instruments based on risk. Calibration should always be done on a validation dataset that the model hasn't seen during training.
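
One way to obtain calibrated probabilities is scikit-learn's `CalibratedClassifierCV`. The sketch below wraps a linear SVM, which has no `predict_proba` of its own, and checks the result with the Brier score and a reliability table. The synthetic data and the choice of sigmoid (Platt) scaling are assumptions for illustration; with `cv=5`, each calibrator is fitted on a fold the underlying model did not train on.

```python
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic data (assumption: stands in for loan or transaction records)
X, y = make_classification(
    n_samples=3000, n_features=10, n_informative=5,
    n_clusters_per_class=1, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# LinearSVC only outputs decision scores, so it needs calibration to give probabilities
svm = make_pipeline(StandardScaler(), LinearSVC(dual=False))
calibrated = CalibratedClassifierCV(svm, method="sigmoid", cv=5)
calibrated.fit(X_tr, y_tr)

proba = calibrated.predict_proba(X_te)[:, 1]
brier = brier_score_loss(y_te, proba)  # lower is better; 0.25 matches always predicting 0.5
print(f"Brier score: {brier:.3f}")

# Reliability table: mean predicted probability vs observed positive rate per bin
observed, predicted = calibration_curve(y_te, proba, n_bins=5)
for p, o in zip(predicted, observed):
    print(f"predicted {p:.2f} -> observed {o:.2f}")
```

For a well-calibrated model the two columns of the reliability table stay close to each other across bins.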

---

## 3.5 Imbalanced learning strategies

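When the positive class is rare (fraud, default), accuracy can mislead: on a dataset with 20% positives, a model that always predicts the majority class scores 80% accuracy while catching no positive cases.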

Tools  
- **Class weights**: make errors on minority class count more.  
- **Resampling**: oversample minority (RandomOverSampler, SMOTE) or undersample majority.  
- **Threshold moving**: raise or lower the decision threshold to meet cost or recall targets.  
- **Appropriate metrics**: PR-AUC, F1, recall at fixed precision, cost-based evaluation.
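
Two of the tools above, class weights and threshold moving, can be combined in a few lines. The sketch below is a minimal example on synthetic data with 5% positives; the 80% recall target is an assumed business requirement, and the threshold is chosen on a validation split so that a final test set can stay untouched.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    average_precision_score,
    precision_recall_curve,
    recall_score,
)
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (assumption: about 5% positives, e.g. fraud)
X, y = make_classification(
    n_samples=5000, n_features=10, n_informative=5,
    weights=[0.95, 0.05], random_state=0,
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Class weights: errors on the minority class count more during training
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_tr, y_tr)
proba = clf.predict_proba(X_val)[:, 1]

pr_auc = average_precision_score(y_val, proba)
print(f"PR-AUC: {pr_auc:.3f}")

# Threshold moving: highest threshold that still reaches the recall target
target_recall = 0.80  # assumed requirement
precision, recall, thresholds = precision_recall_curve(y_val, proba)
meets_target = recall[:-1] >= target_recall  # last point has no threshold
chosen_threshold = thresholds[meets_target].max()

y_pred = (proba >= chosen_threshold).astype(int)
achieved_recall = recall_score(y_val, y_pred)
print(f"Threshold {chosen_threshold:.3f} -> recall {achieved_recall:.3f}")
```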

---

## 3.6 Interpretability and accountability

- **Global explanations**  
  - Logistic regression coefficients and odds ratios.  
  - Tree feature importances.  
  - Permutation importance for model-agnostic insight.  
  - SHAP values to explain contributions per feature.

- **Local explanations**  
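  - Explain a specific decision for auditability, for example which features pushed one flagged transaction toward the fraud label (per-record SHAP values or coefficient contributions).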

- **Fairness and compliance**  
  - Avoid sensitive attributes unless explicitly permitted and justified.  
  - Monitor disparate impact.  
  - Keep documentation of data sources, preprocessing, versions, and decisions.
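
Two of the global explanations listed above can be produced with scikit-learn alone: odds ratios from logistic regression coefficients, and model-agnostic permutation importance. The sketch below uses synthetic data, and the feature names are hypothetical labels attached for readability only. SHAP is left out because it needs a separate package.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical feature names on synthetic data (assumption: illustration only)
feature_names = ["revenue_growth", "cash_ratio", "expense_ratio", "account_age"]
X, y = make_classification(
    n_samples=1500, n_features=4, n_informative=3, n_redundant=1, random_state=1
)
X = pd.DataFrame(X, columns=feature_names)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1
)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_tr, y_tr)

# Odds ratio per one-standard-deviation increase (features are standardized)
coefs = model.named_steps["logisticregression"].coef_[0]
odds_ratios = pd.Series(np.exp(coefs), index=feature_names)
print(odds_ratios.round(2))

# Permutation importance: drop in ROC-AUC when one feature is shuffled
perm = permutation_importance(
    model, X_te, y_te, n_repeats=10, scoring="roc_auc", random_state=1
)
perm_importance = pd.Series(
    perm.importances_mean, index=feature_names
).sort_values(ascending=False)
print(perm_importance.round(3))
```

An odds ratio above 1 means the feature raises the odds of the positive class; below 1 means it lowers them.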

---

## 3.7 Robust practice: from notebook to production

- **Cross-validation**: use stratified k-fold for stable estimates.  
- **Hyperparameter search**: start with simple grids or randomized search, then narrow.  
- **Pipelines**: include scaling and encoding inside the pipeline to prevent leakage.  
- **Out-of-time validation**: for temporal data (quarters), validate on a future slice.  
- **Monitoring**: track data drift, performance decay, calibration drift.  
- **Rollback plan**: maintain a safe baseline model and alerts.
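
The first four practices above can be sketched together: a pipeline that keeps scaling inside each fold, stratified k-fold, a randomized search, and a time-ordered splitter for out-of-time validation. The data is synthetic, the search range for `C` is an assumption, and `TimeSeriesSplit` is only meaningful if the rows are sorted by time, which is assumed here.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    RandomizedSearchCV,
    StratifiedKFold,
    TimeSeriesSplit,
)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption: rows are treated as if sorted by time)
X, y = make_classification(
    n_samples=1500, n_features=12, n_informative=5,
    weights=[0.85, 0.15], random_state=7,
)

# Scaling is refitted on each training fold, so validation folds never leak in
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)
search = RandomizedSearchCV(
    pipe,
    param_distributions={"clf__C": loguniform(1e-3, 1e2)},
    n_iter=10,
    scoring="average_precision",
    cv=cv,
    random_state=7,
)
search.fit(X, y)
print(f"Best C: {search.best_params_['clf__C']:.4f}")
print(f"Best CV PR-AUC: {search.best_score_:.3f}")

# Out-of-time validation: every validation slice comes after its training slice
tscv = TimeSeriesSplit(n_splits=4)
for train_idx, val_idx in tscv.split(X):
    print(f"train rows 0..{train_idx.max()}, validate rows {val_idx.min()}..{val_idx.max()}")
```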

---

## 3.8 Suggested benchmarking protocol

1) **Sanity checks**  
   - DummyClassifier (majority).  
   - Single-feature threshold.

2) **Interpretable baseline**  
   - Logistic Regression with standardized numeric features and one-hot categoricals.  
   - Collect coefficients and odds ratios to explain drivers.

3) **Strong default**  
   - Random Forest with reasonable depth and estimators.  
   - Compare metrics and calibration.

4) **High-performance candidate**  
   - Gradient Boosting (LightGBM or XGBoost) with early stopping.  
   - Tune learning rate, max depth or leaves, regularization.

5) **Pick by cost and constraints**  
   - If explanations and speed to sign-off matter: Logistic Regression or small trees with clear rules.  
   - If predictive power is paramount and auditability can be handled: Gradient Boosting.
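
Steps 1 and 2 of the protocol can be sketched as follows. A depth-1 decision tree is used here as one possible reading of the "single-feature threshold" check, since it picks one feature and one cut-off; that interpretation, the synthetic data, and the choice of balanced accuracy are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with a 20% positive class (assumption: not the course data)
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=4,
    n_clusters_per_class=1, weights=[0.8, 0.2], random_state=3,
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=3)

candidates = {
    # Sanity check 1: always predict the majority class
    "Dummy (majority)": DummyClassifier(strategy="most_frequent"),
    # Sanity check 2: one feature, one threshold (a decision stump)
    "Single-feature threshold": DecisionTreeClassifier(max_depth=1, random_state=3),
    # Interpretable baseline with standardized features
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
}

benchmark = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="balanced_accuracy")
    benchmark[name] = scores.mean()
    print(f"{name:<26} balanced accuracy: {benchmark[name]:.3f}")
```

Any later candidate (Random Forest, Gradient Boosting) should clearly beat these floors before its extra complexity is justified.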

---

## 3.9 Quick cheat sheet

| Model | Use when | Strengths | Caveats | Quick setup |
|---|---|---|---|---|
| Dummy baseline | Always | Sets floor | None | Majority class baseline |
| Logistic Regression | Need interpretability and speed | Simple, fast, well-calibrated with tuning | Needs scaling, linear boundary | Standardize, L2 penalty, class_weight if needed |
| LDA | Numeric, near-Gaussian | Very fast, solid if assumptions hold | Sensitive to covariance assumptions | Standardize, check covariance |
| Naive Bayes | Text or counts | Blazing fast | Independence assumption | MultinomialNB for counts |
| k-NN | Small to medium data | Simple, non-parametric | Slow at prediction, needs scaling | Scale, tune k via CV |
| SVM (RBF) | Medium-sized clean data | Strong margins | Tuning C and gamma | Scale, grid C and gamma |
| Decision Tree | Rules needed | Transparent rules | Overfits without limits | Limit depth, prune |
| Random Forest | General tabular default | Robust, low tuning, handles mixes | Less interpretable | 200 trees, limit depth, class_weight |
| Extra Trees | Similar to RF, faster | Often strong | Similar caveats | 400 trees, shallow to medium depth |
| LightGBM/XGBoost | Best tabular accuracy | State-of-the-art | Tuning required | Early stopping, tune depth and learning rate |
| CatBoost | Many categoricals | Native categorical handling | Slower than LightGBM | Use default with categorical features |
| MLP | Very large data or special structure | Flexible | Tuning, needs scaling | Scale, small hidden layers, early stopping |

---

## 3.10 Glossary

- **Feature**: input variable.  
- **Label**: target variable to predict.  
- **Overfitting**: great on train, poor on test.  
- **Underfitting**: poor on both train and test.  
- **Calibration**: predicted probabilities match observed frequencies.  
- **Stratification**: keep class ratios similar across folds or splits.

---

## 3.11 Accounting-focused examples

- **Business viability (`isOpen`)**  
  Inputs: revenue trend, expenses ratio, cash balance, industry, age.  
  Useful models: Logistic Regression for explainability, Random Forest or LightGBM for lift.  
  Metric: balanced accuracy, recall for closures, or cost-based threshold.

- **Loan default**  
  Inputs: credit score, debt-to-income ratio (DTI), payment history, collateral value.  
  Useful models: Gradient Boosting with calibration.  
  Metric: ROC-AUC, PR-AUC, expected loss.

- **Fraud detection**  
  Inputs: transaction amount, merchant category, device ID, velocity features.  
  Useful models: Gradient Boosting, calibrated probabilities, cost-sensitive threshold.  
  Metric: precision at top-k, recall at fixed false positive rate, cost minimization.

---

## 3.12 Minimal workflow you can reuse

1) Define target and features.  
2) Split into train and test (stratified).  
3) Build pipeline with: imputers, encoders, scaler.  
4) Train baseline (Logistic Regression).  
5) Train strong default (Random Forest).  
6) If needed, train boosted trees with early stopping.  
7) Evaluate with proper metrics and pick thresholds by cost.  
8) Document model and release with monitoring.
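
Steps 1 to 4 and step 7 of the workflow above might look like the sketch below. Everything about the data is hypothetical: the column names, the `is_closed` label, and the 10:1 cost ratio between a missed closure and a false alarm are assumptions made up for illustration. The threshold is picked on a validation split, leaving any final test set untouched.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# 1) Define target and features (hypothetical toy data)
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "revenue_growth": rng.normal(0.05, 0.2, n),
    "cash_ratio": rng.normal(1.0, 0.4, n),
    "industry": rng.choice(["retail", "services", "manufacturing"], n),
})
logit = -1.5 - 4 * df["revenue_growth"] - 1.5 * (df["cash_ratio"] - 1.0)
df["is_closed"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)
df.loc[rng.choice(n, 50, replace=False), "cash_ratio"] = np.nan  # missing values

X, y = df.drop(columns="is_closed"), df["is_closed"]

# 2) Stratified split
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# 3) Pipeline with imputers, encoder, scaler
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), ["revenue_growth", "cash_ratio"]),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["industry"]),
])

# 4) Baseline model
model = Pipeline([("prep", preprocess), ("clf", LogisticRegression(max_iter=1000))])
model.fit(X_tr, y_tr)
proba = model.predict_proba(X_val)[:, 1]

# 7) Pick the threshold that minimizes total cost (assumed costs)
COST_FN, COST_FP = 10.0, 1.0

def total_cost(threshold):
    """Total misclassification cost on the validation split at this threshold."""
    y_pred = (proba >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_val, y_pred, labels=[0, 1]).ravel()
    return COST_FN * fn + COST_FP * fp

thresholds = np.linspace(0.05, 0.95, 19)
costs = [total_cost(t) for t in thresholds]
best_threshold = thresholds[int(np.argmin(costs))]
print(f"Cost-minimizing threshold: {best_threshold:.2f} (cost {min(costs):.0f})")
```

Swapping the `clf` step for a Random Forest or boosted model covers steps 5 and 6 without touching the preprocessing.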

---

**Learning outcomes**  
By the end of this part, students will be able to:  
- Frame a classification problem that aligns with an accounting decision.  
- Build and evaluate baseline and benchmark models.  
- Interpret results using the right metrics and thresholds.  
- Explain model trade-offs to non-technical stakeholders.  
- Prepare a model for responsible deployment with monitoring and documentation.

In [28]:
from autogluon.tabular import TabularDataset, TabularPredictor
import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix
import time
import warnings
warnings.filterwarnings('ignore')

print("=" * 80)
print("🚀 AUTOGLUON AUTOML - SIMPLE & FIXED VERSION")
print("=" * 80)

# ===========================
# PREPARE DATA FOR AUTOGLUON
# ===========================

# Combine preprocessed features with target for AutoGluon
train_data = pd.DataFrame(X_train_ready)
train_data['target'] = y_train.values

test_data = pd.DataFrame(X_test_ready)
test_data['target'] = y_test.values

print(f"\n📊 Dataset Info:")
print(f"Training samples: {len(train_data)}")
print(f"Test samples: {len(test_data)}")
print(f"Features: {X_train_ready.shape[1]}")
print(f"Target distribution: {y_train.value_counts(normalize=True).round(3).to_dict()}")

# ===========================
# TRAIN WITH AUTOGLUON (SIMPLIFIED)
# ===========================

print("\n" + "=" * 80)
print("🔄 TRAINING MODELS WITH AUTOGLUON (Simplified Settings)")
print("=" * 80)

# Initialize predictor with simpler settings
predictor = TabularPredictor(
    label='target',  # target column name
    problem_type='binary',  # binary classification
    eval_metric='f1',  # optimize for F1 score
    verbosity=1  # less verbose to avoid clutter
)

# Train models with simpler configuration to avoid getting stuck
start_time = time.time()

print("\nTraining AutoGluon with simplified settings...")
print("This should take 2-3 minutes...")

# Use simpler preset and disable Ray parallelization
predictor.fit(
    train_data=train_data,
    time_limit=180,  # 3 minutes time limit
    presets='medium_quality',  # Simpler preset: 'medium_quality' instead of 'best_quality'
    hyperparameters={  # Specify simpler models only
        'GBM': {},  # LightGBM
        'CAT': {},  # CatBoost  
        'XGB': {},  # XGBoost
        'RF': {},   # Random Forest
        'XT': {},   # Extra Trees
        'LR': {},   # Logistic Regression
    },
    num_bag_folds=0,  # Disable bagging to speed up
    num_stack_levels=0,  # Disable stacking to avoid complexity
    ag_args_fit={
        'num_cpus': 1,  # Use only 1 CPU per model to avoid resource conflicts
    }
)

train_time = time.time() - start_time
print(f"\n✅ Training completed in {train_time:.1f} seconds")

# ===========================
# EVALUATE MODELS
# ===========================

print("\n" + "=" * 80)
print("📈 MODEL EVALUATION RESULTS")
print("=" * 80)

# Get predictions
y_pred = predictor.predict(test_data.drop('target', axis=1))
y_proba = predictor.predict_proba(test_data.drop('target', axis=1))[1]  # probability of positive class

# Calculate metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, zero_division=0)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_proba)

# Confusion matrix
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
specificity = tn / (tn + fp) if (tn + fp) > 0 else 0

print("\n🎯 BASIC METRICS:")
print("-" * 40)
print(f"Accuracy    : {accuracy:.3f}")
print(f"Precision   : {precision:.3f}  (of those predicted fraud, what % are correct)")
print(f"Recall      : {recall:.3f}  (of all fraud, what % is caught)")
print(f"F1-Score    : {f1:.3f}  (balance of precision & recall)")
print(f"Specificity : {specificity:.3f}  (of all normal, what % are recognized as normal)")
print(f"ROC-AUC     : {roc_auc:.3f}  (ranking quality of the fraud scores)")

print("\n📊 CONFUSION MATRIX:")
print("-" * 40)
print(f"True Negatives  (TN): {tn:,}  -> Normal, predicted Normal ✓")
print(f"False Positives (FP): {fp:,}  -> Normal, predicted Fraud ✗")
print(f"False Negatives (FN): {fn:,}  -> Fraud, predicted Normal ✗")
print(f"True Positives  (TP): {tp:,}  -> Fraud, predicted Fraud ✓")

# ===========================
# MODEL LEADERBOARD
# ===========================

print("\n" + "=" * 80)
print("🏆 MODEL LEADERBOARD")
print("=" * 80)

# Show leaderboard of all models tried
leaderboard = predictor.leaderboard(test_data, silent=True)
print("\nModels Trained:")
print(leaderboard[['model', 'score_test', 'pred_time_test', 'fit_time']].to_string())

# ===========================
# BEST MODEL INFO
# ===========================

print("\n" + "=" * 80)
print("🥇 BEST MODEL DETAILS")
print("=" * 80)

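# `get_model_best()` is no longer available in recent AutoGluon releases
# (it raises AttributeError); the `model_best` property replaces it.
best_model = predictor.model_best
print(f"Best Model: {best_model}")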

# Model info
model_info = predictor.info()
print(f"Total models trained: {model_info['num_models_trained']}")

print("\n" + "=" * 80)
print("✅ AUTOGLUON AUTOML COMPLETE!")
print("=" * 80)


No path specified. Models will be saved in: "AutogluonModels/ag-20250831_124759"


🚀 AUTOGLUON AUTOML - SIMPLE & FIXED VERSION

📊 Dataset Info:
Training samples: 40000
Test samples: 10000
Features: 6058
Target distribution: {0: 0.8, 1: 0.2}

🔄 TRAINING MODELS WITH AUTOGLUON (Simplified Settings)

Training AutoGluon with simplified settings...
This should take 2-3 minutes...





✅ Training completed in 55.3 seconds

📈 MODEL EVALUATION RESULTS

🎯 BASIC METRICS:
----------------------------------------
Accuracy    : 1.000
Precision   : 1.000  (of those predicted fraud, what % are correct)
Recall      : 1.000  (of all fraud, what % is caught)
F1-Score    : 1.000  (balance of precision & recall)
Specificity : 1.000  (of all normal, what % are recognized as normal)
ROC-AUC     : 1.000  (ranking quality of the fraud scores)

📊 CONFUSION MATRIX:
----------------------------------------
True Negatives  (TN): 7,999  -> Normal, predicted Normal ✓
False Positives (FP): 1  -> Normal, predicted Fraud ✗
False Negatives (FN): 0  -> Fraud, predicted Normal ✗
True Positives  (TP): 2,000  -> Fraud, predicted Fraud ✓

🏆 MODEL LEADERBOARD

Models Trained:
                 model  score_test  pred_time_test  fit_time
0             CatBoost     1.00000        0.034854  8.194745
1             LightGBM     1.00000        0.041689  3.163693
2           ExtraTrees     1.00000        0.

AttributeError: 'TabularPredictor' object has no attribute 'get_model_best'

In [29]:
# ===========================
# DATA LEAKAGE & PERFECT SCORE CHECKER
# ===========================

import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
import warnings
warnings.filterwarnings('ignore')

print("=" * 80)
print("🔍 INVESTIGATING PERFECT SCORE (F1 = 1.000)")
print("=" * 80)

# ===========================
# 1. CHECK DATA STATISTICS
# ===========================

print("\n1️⃣ DATA STATISTICS CHECK:")
print("-" * 40)

# Basic stats
print(f"Train shape: {X_train_ready.shape}")
print(f"Test shape: {X_test_ready.shape}")
print(f"Number of features: {X_train_ready.shape[1]}")

# Check for constant features
train_df = pd.DataFrame(X_train_ready)
test_df = pd.DataFrame(X_test_ready)

constant_features = (train_df.nunique() == 1).sum()
print(f"Constant features in train: {constant_features}")

# Check variance
low_variance = (train_df.var() < 0.0001).sum()
print(f"Very low variance features: {low_variance}")

# ===========================
# 2. CHECK FOR DUPLICATES
# ===========================

print("\n2️⃣ DUPLICATE CHECK:")
print("-" * 40)

# Check duplicates within train
train_duplicates = train_df.duplicated().sum()
print(f"Duplicate rows in train set: {train_duplicates}")

# Check duplicates within test  
test_duplicates = test_df.duplicated().sum()
print(f"Duplicate rows in test set: {test_duplicates}")

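# Check if test samples exist in train
# Attach labels by position: y_train / y_test may still carry their pre-split index,
# which would misalign against the fresh 0..n-1 index of train_df / test_df
train_combined = train_df.assign(target=np.asarray(y_train))
test_combined = test_df.assign(target=np.asarray(y_test))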

# Create hash of each row for comparison
train_hash = pd.util.hash_pandas_object(train_df, index=False)
test_hash = pd.util.hash_pandas_object(test_df, index=False)

overlap = len(set(train_hash) & set(test_hash))
print(f"Test samples that appear in train: {overlap}")

# ===========================
# 3. CHECK FEATURE IMPORTANCE
# ===========================

print("\n3️⃣ FEATURE IMPORTANCE CHECK:")
print("-" * 40)

# Quick Random Forest to check feature importance
rf = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=42)
rf.fit(X_train_ready, y_train)

# Get top features
importances = rf.feature_importances_
top_features_idx = np.argsort(importances)[-10:][::-1]
top_importances = importances[top_features_idx]

print("Top 10 feature importances:")
for idx, imp in zip(top_features_idx, top_importances):
    print(f"  Feature {idx}: {imp:.4f}")

# Check if any single feature is too dominant
max_importance = importances.max()
if max_importance > 0.5:
    print(f"\n⚠️ WARNING: Feature {np.argmax(importances)} has {max_importance:.2%} importance!")
    print("This could indicate data leakage!")

# ===========================
# 4. CHECK SEPARABILITY
# ===========================

print("\n4️⃣ CLASS SEPARABILITY CHECK:")
print("-" * 40)

# Check with simple model
lr = LogisticRegression(max_iter=100, random_state=42)
lr_scores = cross_val_score(lr, X_train_ready, y_train, cv=3, scoring='f1')
print(f"Logistic Regression CV F1: {lr_scores.mean():.3f} (+/- {lr_scores.std():.3f})")

if lr_scores.mean() > 0.99:
    print("⚠️ Even simple linear model gets perfect score - data is too easy or leaked!")

# ===========================
# 5. CHECK INDIVIDUAL FEATURES
# ===========================

print("\n5️⃣ SUSPICIOUS FEATURE CHECK:")
print("-" * 40)

# Check correlation with target
# Constant features give NaN correlations, so use NaN-aware reductions below
correlations = []
for i in range(min(X_train_ready.shape[1], 100)):  # Check first 100 features only
    corr = np.corrcoef(X_train_ready[:, i], y_train)[0, 1]
    correlations.append(abs(corr))

max_corr = np.nanmax(correlations)
max_corr_idx = np.nanargmax(correlations)

print(f"Maximum correlation with target: {max_corr:.4f} (Feature {max_corr_idx})")

if max_corr > 0.9:
    print("⚠️ CRITICAL: Feature has |correlation| > 0.9 with target - likely data leakage!")

# Check if any feature perfectly separates classes
perfect_separator = False
for i in range(min(X_train_ready.shape[1], 20)):  # Check first 20 features
    feature_vals = X_train_ready[:, i]
    
    # Check if feature perfectly separates classes
    class0_vals = feature_vals[y_train == 0]
    class1_vals = feature_vals[y_train == 1]
    
    if len(class0_vals) > 0 and len(class1_vals) > 0:
        if class0_vals.max() < class1_vals.min() or class1_vals.max() < class0_vals.min():
            print(f"⚠️ Feature {i} perfectly separates classes!")
            perfect_separator = True
            break

# ===========================
# 6. RECOMMENDATIONS
# ===========================

print("\n" + "=" * 80)
print("💡 RECOMMENDATIONS")
print("=" * 80)

if max_importance > 0.5 or max_corr > 0.9 or perfect_separator or lr_scores.mean() > 0.99:
    print("\n🚨 HIGH RISK OF DATA LEAKAGE DETECTED!")
    print("\nImmediate actions:")
    print("1. Review your preprocessing pipeline:")
    print("   - Were scalers, imputers, or encoders fitted on the full dataset before the split?")
    print("   - Is target encoding done correctly?")
    print("   - Are there any features derived from the target?")
    print("\n2. Check original features:")
    print("   - Is 'Amount' or any ID field acting as proxy for fraud?")
    print("   - Are there temporal features that shouldn't be there?")
    print("\n3. Try training without preprocessing:")
    print("   ```python")
    print("   # Train on raw data")
    print("   from sklearn.ensemble import RandomForestClassifier")
    print("   rf = RandomForestClassifier()")
    print("   rf.fit(X_train_original, y_train)  # Use original features")
    print("   ```")
else:
    print("\nNo obvious data leakage detected, but F1=1.0 is still suspicious.")
    print("Consider:")
    print("• Using a different train/test split")
    print("• Checking if the problem is genuinely too easy")
    print("• Validating on completely new data")

print("\n" + "=" * 80)

🔍 INVESTIGATING PERFECT SCORE (F1 = 1.000)

1️⃣ DATA STATISTICS CHECK:
----------------------------------------
Train shape: (40000, 6058)
Test shape: (10000, 6058)
Number of features: 6058
Constant features in train: 2
Very low variance features: 3560

2️⃣ DUPLICATE CHECK:
----------------------------------------
Duplicate rows in train set: 730
Duplicate rows in test set: 52
Test samples that appear in train: 379

3️⃣ FEATURE IMPORTANCE CHECK:
----------------------------------------
Top 10 feature importances:
  Feature 2: 0.0988
  Feature 4: 0.0783
  Feature 3: 0.0588
  Feature 8: 0.0584
  Feature 0: 0.0379
  Feature 5967: 0.0326
  Feature 5927: 0.0325
  Feature 5950: 0.0277
  Feature 5919: 0.0245
  Feature 5912: 0.0242

4️⃣ CLASS SEPARABILITY CHECK:
----------------------------------------
Logistic Regression CV F1: 1.000 (+/- 0.000)
⚠️ Even simple linear model gets perfect score - data is too easy or leaked!

5️⃣ SUSPICIOUS FEATURE CHECK:
----------------------------------------
