<a href="https://colab.research.google.com/github/annisafitribas/ft_credit_home/blob/main/ft_credit_home.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Setup**

In [33]:
import os, sys, time, json
import warnings
warnings.filterwarnings('ignore')

WORKDIR = '/content/home_credit_task'
os.makedirs(WORKDIR, exist_ok=True)
print("WORKDIR:", WORKDIR)

WORKDIR: /content/home_credit_task


Initial Python environment setup: import the standard libraries, silence warnings, and create the working directory.

## Dependency check

In [34]:
# Colab-friendly installs (only if missing)
try:
    import gdown
except Exception:
    !pip install -q gdown
    import gdown

# ML libs: install if missing
try:
    import lightgbm as lgb
except Exception:
    !pip install -q lightgbm
    import lightgbm as lgb

try:
    import xgboost as xgb
except Exception:
    !pip install -q xgboost
    import xgboost as xgb

try:
    from catboost import CatBoostClassifier
except Exception:
    !pip install -q catboost
    from catboost import CatBoostClassifier

try:
    from pptx import Presentation
    from pptx.util import Inches, Pt
except Exception:
    !pip install -q python-pptx
    from pptx import Presentation
    from pptx.util import Inches, Pt

This block installs several key libraries automatically when they are missing, so the notebook runs as-is on Google Colab or in any other environment that does not yet have them.
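
The repeated try/except pattern can be folded into one function. The sketch below is a minimal version that assumes a hypothetical helper named `ensure_package` (it is not part of the notebook). It calls pip through `sys.executable`, so it also works outside a notebook where the `!pip` shell syntax is unavailable.

```python
import importlib
import subprocess
import sys


def ensure_package(module_name, pip_name=None):
    """Import module_name, installing it with pip first if the import fails.

    ensure_package is a hypothetical helper, not part of the original notebook.
    pip_name covers packages whose PyPI name differs from the import name
    (for example, the module 'pptx' is distributed as 'python-pptx').
    """
    try:
        return importlib.import_module(module_name)
    except ImportError:
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install", "-q", pip_name or module_name]
        )
        # Make the freshly installed package visible to the import system
        importlib.invalidate_caches()
        return importlib.import_module(module_name)


# Modules that are already importable are returned without touching pip.
json_mod = ensure_package("json")
print(json_mod.__name__)
```

In the notebook this would be called as, for example, `gdown = ensure_package("gdown")` or `pptx = ensure_package("pptx", "python-pptx")`.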

## Plotting and data libraries, scikit-learn imports

In [35]:
# plotting / data
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from pathlib import Path
import joblib

# sklearn
from sklearn.model_selection import train_test_split, StratifiedKFold, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (roc_auc_score, classification_report, confusion_matrix, roc_curve, auc,
                             precision_recall_curve, average_precision_score)
from sklearn.impute import SimpleImputer

- EDA & plotting → matplotlib, pandas, numpy
- Preprocessing & scaling → StandardScaler, SimpleImputer
- Modeling → Logistic Regression, Random Forest
- Evaluation → scikit-learn metrics
- Utilities → joblib, pathlib

## Helper function

In [36]:
def savefig(fig, filename):
    path = os.path.join(WORKDIR, filename)
    fig.savefig(path, bbox_inches='tight')
    print('Saved plot:', path)

This function saves every plot to the working directory in a consistent way, without writing out the full path each time.

# **1. Download Dataset**

The data is stored on Google Drive.

In [37]:
files = {
    'application_train.csv': '1q059QolR6CNxB0PWESAjkEWIprNutajA',
    'application_test.csv' : '1QD7ehk_hzXze0vHQuYa5qyqfDcfI8Sex',
    'bureau.csv'           : '1hndizX1t5ab0DTnKMTedqVJ1ZxLVclhF',
    'bureau_balance.csv'   : '1OXEQb_L6S_mZALJi4--C6RyFI6yOsq4x',
    'credit_card_balance.csv': '1t6Hhsmj0vSCCKUlNXht_xDQ6Z6l4M0Vu',
    'installments_payments.csv': '126xrKCW5EQrxkQoDwmN-yb00ILBKnhR8',
    'POS_CASH_balance.csv' : '1dODAmBQLaylpM2JcCHfc4KNbbtKY7xhA',
    'previous_application.csv': '1D4O7xf-lF_3oBeu6XMwzhpXtSvhcgoBU',
    'HomeCredit_columns_description.csv': '1v2iGGOJjlUGSTsQz-bsjtjtyM5IQp7uW',
    'sample_submission.csv': '1JongVA9fWMYml5XKVnbhm8TUlR5Efs0n'
}

for fname, fid in files.items():
    dest = os.path.join(WORKDIR, fname)
    if not os.path.exists(dest):
        print("Downloading", fname)
        url = f"https://drive.google.com/uc?export=download&id={fid}"
        gdown.download(url, dest, quiet=False)
    else:
        print("Exists:", fname)

Exists: application_train.csv
Exists: application_test.csv
Exists: bureau.csv
Exists: bureau_balance.csv
Exists: credit_card_balance.csv
Exists: installments_payments.csv
Exists: POS_CASH_balance.csv
Exists: previous_application.csv
Exists: HomeCredit_columns_description.csv
Exists: sample_submission.csv


# **2. Study the problem context**
Study the problem context from external sources. (Note: this is guidance in Markdown, not code.)

In [38]:
md_context = """
**Context / SMK (put into notebook as Markdown)**

- Business: consumer credit scoring (Home Credit). Goal is to predict default (TARGET=1) to support accept/decline/manual-review.
- Common finance metrics: PD (probability of default), LGD (loss given default), EAD (exposure at default), Approval rate, Bad rate.
- Regulations / fairness: ensure model doesn't unfairly discriminate; add explainability later (SHAP).
- Typical objectives: maximize AUC/PR while optimizing business net-savings when integrating decision thresholds.
"""
print(md_context)


**Context / SMK (put into notebook as Markdown)**

- Business: consumer credit scoring (Home Credit). Goal is to predict default (TARGET=1) to support accept/decline/manual-review.
- Common finance metrics: PD (probability of default), LGD (loss given default), EAD (exposure at default), Approval rate, Bad rate.
- Regulations / fairness: ensure model doesn't unfairly discriminate; add explainability later (SHAP).
- Typical objectives: maximize AUC/PR while optimizing business net-savings when integrating decision thresholds.
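
The three risk metrics listed in the context (PD, LGD, EAD) are commonly combined into an expected loss, EL = PD × LGD × EAD. The sketch below works through that arithmetic for a single loan. The numbers are made up for illustration and are not taken from the Home Credit data, and `expected_loss` is a hypothetical helper name.

```python
def expected_loss(pd_, lgd, ead):
    """Expected loss for one loan: PD x LGD x EAD.

    pd_  : probability of default (0..1)
    lgd  : loss given default, as a fraction of the exposure (0..1)
    ead  : exposure at default, in currency units
    """
    return pd_ * lgd * ead


# Illustrative values only (assumptions, not notebook results)
loan = {"pd": 0.08, "lgd": 0.45, "ead": 10_000.0}
el = expected_loss(loan["pd"], loan["lgd"], loan["ead"])
print(f"Expected loss: {el:.2f}")
```

The model built in this notebook estimates only the PD term. LGD and EAD would have to come from other sources.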



# **3. Understand the available column descriptions**

In [39]:
desc_path = os.path.join(WORKDIR, 'HomeCredit_columns_description.csv')
if os.path.exists(desc_path):
    col_desc = pd.read_csv(
        desc_path,
        encoding='latin1',   # safe for non-UTF-8 files
        low_memory=False
    )
    print("Column description sample:")
    display(col_desc.head(8))
else:
    print("Column description file not found in WORKDIR.")

Column description sample:


Unnamed: 0.1,Unnamed: 0,Table,Row,Description,Special
0,1,application_{train|test}.csv,SK_ID_CURR,ID of loan in our sample,
1,2,application_{train|test}.csv,TARGET,Target variable (1 - client with payment diffi...,
2,5,application_{train|test}.csv,NAME_CONTRACT_TYPE,Identification if loan is cash or revolving,
3,6,application_{train|test}.csv,CODE_GENDER,Gender of the client,
4,7,application_{train|test}.csv,FLAG_OWN_CAR,Flag if the client owns a car,
5,8,application_{train|test}.csv,FLAG_OWN_REALTY,Flag if client owns a house or flat,
6,9,application_{train|test}.csv,CNT_CHILDREN,Number of children the client has,
7,10,application_{train|test}.csv,AMT_INCOME_TOTAL,Income of the client,


# **4. Define the goal, objective, and metrics**

In [40]:
goal = "Predict probability of loan default (TARGET) to reduce defaults while balancing approval rate."
objective = "Maximize AUC and maximize business net-savings under threshold policy."
metrics = ["ROC-AUC (primary)", "Average Precision (PR-AUC)", "Approve rate vs bad rate", "Net savings"]
print("Goal:", goal)
print("Objective:", objective)
print("Metrics:", metrics)

Goal: Predict probability of loan default (TARGET) to reduce defaults while balancing approval rate.
Objective: Maximize AUC and maximize business net-savings under threshold policy.
Metrics: ['ROC-AUC (primary)', 'Average Precision (PR-AUC)', 'Approve rate vs bad rate', 'Net savings']
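
The "approve rate vs bad rate" and "net savings" metrics depend on a decision threshold. The sketch below shows one way they could be computed, assuming applicants are approved when their predicted default probability is below the threshold. The function name `threshold_policy_report` and the per-loan amounts `loss_per_default` and `profit_per_good` are hypothetical, since the notebook does not define a cost model.

```python
import numpy as np


def threshold_policy_report(y_true, proba, threshold, loss_per_default, profit_per_good):
    """Summarise an approve-if-below-threshold policy.

    loss_per_default and profit_per_good are hypothetical per-loan amounts.
    Net savings are measured against a baseline that approves everyone.
    """
    y_true = np.asarray(y_true)
    approved = np.asarray(proba) < threshold
    rejected = ~approved
    n_approved = int(approved.sum())

    approve_rate = n_approved / len(y_true)
    # Bad rate: share of approved loans that actually defaulted
    bad_rate = float(y_true[approved].mean()) if n_approved else 0.0

    # Rejecting a defaulter avoids a loss; rejecting a good client forgoes profit
    avoided_defaults = int((y_true[rejected] == 1).sum())
    lost_goods = int((y_true[rejected] == 0).sum())
    net_savings = avoided_defaults * loss_per_default - lost_goods * profit_per_good

    return {"approve_rate": approve_rate, "bad_rate": bad_rate, "net_savings": net_savings}


# Toy labels and scores (made up)
y_toy = [0, 0, 1, 0, 1, 0]
p_toy = [0.05, 0.10, 0.70, 0.40, 0.20, 0.60]
report = threshold_policy_report(
    y_toy, p_toy, threshold=0.5, loss_per_default=1000.0, profit_per_good=100.0
)
print(report)
```

Sweeping `threshold` over a grid and keeping the value with the highest `net_savings` is one simple way to turn the model's scores into a policy.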


# **5. EDA (Exploratory Data Analysis)**
Initial EDA and dataset loading for the Home Credit project. The steps are walked through one at a time below.

In [41]:
print("\nLoading CSVs ...")
train = pd.read_csv(os.path.join(WORKDIR, 'application_train.csv'), low_memory=False)
test  = pd.read_csv(os.path.join(WORKDIR, 'application_test.csv'), low_memory=False)
bureau = pd.read_csv(os.path.join(WORKDIR, 'bureau.csv'), low_memory=False)
bureau_balance = pd.read_csv(os.path.join(WORKDIR, 'bureau_balance.csv'), low_memory=False)
credit_card_balance = pd.read_csv(os.path.join(WORKDIR, 'credit_card_balance.csv'), low_memory=False)
installments = pd.read_csv(os.path.join(WORKDIR, 'installments_payments.csv'), low_memory=False)
pos_cash = pd.read_csv(os.path.join(WORKDIR, 'POS_CASH_balance.csv'), low_memory=False)
prev_app = pd.read_csv(os.path.join(WORKDIR, 'previous_application.csv'), low_memory=False)
sample_sub = pd.read_csv(os.path.join(WORKDIR, 'sample_submission.csv'), low_memory=False)

print("Shapes:")
print("train", train.shape, "test", test.shape)
print("bureau", bureau.shape, "bureau_balance", bureau_balance.shape)
print("credit_card_balance", credit_card_balance.shape, "installments", installments.shape)
print("pos_cash", pos_cash.shape, "previous_application", prev_app.shape)


Loading CSVs ...
Shapes:
train (307511, 122) test (48744, 121)
bureau (1716428, 17) bureau_balance (27299925, 3)
credit_card_balance (3840312, 23) installments (13605401, 8)
pos_cash (10001358, 8) previous_application (1670214, 37)


In [42]:
# quick target dist
print("\n--- TARGET distribution ---")
print(train['TARGET'].value_counts(normalize=True))
# save target plot
fig = plt.figure(figsize=(6,4)); ax = fig.add_subplot(111)
counts = train['TARGET'].value_counts().sort_index()
ax.bar(counts.index.astype(str), counts.values)
ax.set_title('Target distribution (counts)'); ax.set_xlabel('TARGET'); ax.set_ylabel('Count')
savefig(fig, '1_target_distribution.png'); plt.close(fig)


--- TARGET distribution ---
TARGET
0    0.919271
1    0.080729
Name: proportion, dtype: float64
Saved plot: /content/home_credit_task/1_target_distribution.png
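
With only about 8% positives, accuracy is a misleading yardstick for this target. The short calculation below uses the proportion printed above to derive the trivial baselines that any model should beat. It is a sketch of the arithmetic, not part of the original notebook.

```python
# Share of defaulters (TARGET=1) printed by the notebook for application_train.csv
p_default = 0.080729

# A model that always predicts "no default" is right for every non-defaulter
always_zero_accuracy = 1.0 - p_default

# A random ranking scores about 0.5 ROC-AUC, and its average precision
# is approximately the positive-class prevalence
random_auc = 0.5
random_ap = p_default

print(f"Accuracy of always predicting 0: {always_zero_accuracy:.4f}")
print(f"Random-ranking baselines: AUC ~ {random_auc}, AP ~ {random_ap:.4f}")
```

This is why the notebook tracks ROC-AUC and average precision rather than accuracy.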


In [43]:
# missingness top
missing = train.isna().mean().sort_values(ascending=False).head(30)
fig = plt.figure(figsize=(8,6)); ax = fig.add_subplot(111)
ax.barh(missing.index[::-1], missing.values[::-1]); ax.set_title('Top missing percentage (train)')
ax.set_xlabel('Fraction missing'); savefig(fig, '2_missing_pct_top.png'); plt.close(fig)

Saved plot: /content/home_credit_task/2_missing_pct_top.png


In [44]:
# numeric correlation subset
num = train.select_dtypes(include=[np.number]).drop(['SK_ID_CURR','TARGET'], axis=1, errors='ignore')
num_small = num.sample(n=min(30, num.shape[1]), axis=1)
corr = num_small.corr()
fig = plt.figure(figsize=(10,8)); ax = fig.add_subplot(111)
cax = ax.imshow(corr.values, interpolation='nearest')
ax.set_xticks(np.arange(len(corr.columns))); ax.set_xticklabels(corr.columns, rotation=90, fontsize=8)
ax.set_yticks(np.arange(len(corr.columns))); ax.set_yticklabels(corr.columns, fontsize=8)
ax.set_title('Correlation matrix (subset)'); fig.colorbar(cax, ax=ax)
savefig(fig, '3_corr_matrix_subset.png'); plt.close(fig)

Saved plot: /content/home_credit_task/3_corr_matrix_subset.png


# **6. Data Cleaning and Data Processing**
(Feature Engineering)

In [45]:
# Aggregations from bureau
b_agg = bureau.groupby('SK_ID_CURR').agg(
    bureau_loans_count = ('SK_ID_BUREAU', 'count'),
    bureau_credit_sum_mean = ('AMT_CREDIT_SUM', 'mean'),
    bureau_credit_sum_max = ('AMT_CREDIT_SUM', 'max'),
    bureau_active_cnt = ('CREDIT_ACTIVE', lambda x: (x=='Active').sum())
).reset_index()

Count the loans and compute the mean and maximum credit amount and the number of active loans per customer.

In [46]:
# bureau_balance -> bad rate per bureau id then agg
bb_bad = bureau_balance[bureau_balance['STATUS'].isin(['2','3','4','5'])].groupby('SK_ID_BUREAU').size().rename('bad_months')
bb_tot = bureau_balance.groupby('SK_ID_BUREAU').size().rename('total_months')
bb = pd.concat([bb_bad, bb_tot], axis=1).fillna(0)
bb['bad_rate'] = bb['bad_months'] / bb['total_months']
bureau2 = bureau.merge(bb.reset_index(), on='SK_ID_BUREAU', how='left')
b2_agg = bureau2.groupby('SK_ID_CURR').agg(
    bureau_prev_bad_rate_mean = ('bad_rate','mean'),
    bureau_prev_months_mean = ('total_months','mean')
).reset_index()

b_agg = b_agg.merge(b2_agg, on='SK_ID_CURR', how='left')

Ratio of problem months (STATUS 2 to 5) per loan, then aggregated to the customer level.
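
The two-level aggregation is easier to follow on a toy example. The sketch below uses made-up stand-ins for `bureau_balance` and `bureau`, and it flags bad months with a boolean column before grouping. That is a slightly different route from the notebook's `concat` approach, but it yields the same per-loan rate.

```python
import pandas as pd

# Toy stand-ins for bureau_balance and bureau (values are made up)
bb_toy = pd.DataFrame({
    "SK_ID_BUREAU": [10, 10, 10, 10, 11, 11, 12],
    "STATUS":       ["0", "2", "C", "5", "0", "X", "1"],
})
bureau_toy = pd.DataFrame({"SK_ID_CURR": [1, 1, 2], "SK_ID_BUREAU": [10, 11, 12]})

# Level 1: per loan, count bad months (STATUS 2..5) and total months
bb_toy["is_bad"] = bb_toy["STATUS"].isin(["2", "3", "4", "5"])
per_loan = bb_toy.groupby("SK_ID_BUREAU")["is_bad"].agg(
    bad_months="sum", total_months="count"
)
per_loan["bad_rate"] = per_loan["bad_months"] / per_loan["total_months"]

# Level 2: attach each loan to its customer and average the loan-level rates
per_customer = (
    bureau_toy.merge(per_loan.reset_index(), on="SK_ID_BUREAU", how="left")
    .groupby("SK_ID_CURR")["bad_rate"]
    .mean()
)
print(per_customer)
```

Customer 1 has one loan with 2 bad months out of 4 and one clean loan, so the customer-level mean is 0.25.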

In [47]:
# previous_application aggregates
prev_agg = prev_app.groupby('SK_ID_CURR').agg(
    prev_count = ('SK_ID_PREV','count'),
    prev_amt_app_mean = ('AMT_APPLICATION','mean'),
    prev_amt_credit_mean = ('AMT_CREDIT','mean'),
    prev_approved = ('NAME_CONTRACT_STATUS', lambda x: (x=='Approved').sum())
).reset_index()

Number of previous applications, mean application amount and mean credit amount, and number of approved applications per customer.

In [48]:
# installments
inst_agg = installments.groupby('SK_ID_CURR').agg(
    inst_count = ('NUM_INSTALMENT_VERSION','count'),
    inst_amt_sum = ('AMT_PAYMENT','sum'),
    inst_delay_mean = ('DAYS_ENTRY_PAYMENT', lambda x: np.nanmean(x - installments.loc[x.index,'DAYS_INSTALMENT']))
).reset_index()

Number of installment records, total amount paid, and mean payment delay per customer.
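
The delay is the difference between the actual payment day and the scheduled day. The lambda in the cell above looks up `DAYS_INSTALMENT` once per group, which can be slow on 13.6 million rows. A possible alternative, sketched here on made-up data, is to compute the delay as a column first and then take a plain group mean. `DELAY_DAYS` is a hypothetical column name.

```python
import numpy as np
import pandas as pd

# Toy stand-in for installments_payments.csv (values are made up)
inst_toy = pd.DataFrame({
    "SK_ID_CURR":         [1, 1, 1, 2, 2],
    "DAYS_INSTALMENT":    [-100.0, -70.0, -40.0, -50.0, -20.0],
    "DAYS_ENTRY_PAYMENT": [-102.0, -65.0, np.nan, -50.0, -10.0],
})

# Positive = paid late, negative = paid early, NaN = no payment date recorded
inst_toy["DELAY_DAYS"] = inst_toy["DAYS_ENTRY_PAYMENT"] - inst_toy["DAYS_INSTALMENT"]

# GroupBy.mean skips NaN by default, matching np.nanmean in the original lambda
delay_mean = inst_toy.groupby("SK_ID_CURR")["DELAY_DAYS"].mean()
print(delay_mean)
```

Customer 1 paid 2 days early and 5 days late (the third payment is missing), giving a mean delay of 1.5 days.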

In [49]:
# credit_card & pos
cc_agg = credit_card_balance.groupby('SK_ID_CURR').agg(
    cc_count = ('SK_ID_PREV','count'),
    cc_bal_mean = ('AMT_BALANCE','mean'),
    cc_limit_mean = ('AMT_CREDIT_LIMIT_ACTUAL','mean')
).reset_index()

pos_agg = pos_cash.groupby('SK_ID_CURR').agg(
    pos_count = ('SK_ID_PREV','count'),
    pos_dpd_mean = ('SK_DPD','mean')
).reset_index()

- Credit card: number of monthly balance records, mean balance, mean credit limit
- POS cash: number of monthly balance records, mean DPD (days past due)

In [50]:
# application-level features
def make_app_features(df):
    df = df.copy()
    df['DAYS_BIRTH_YEARS'] = (-df['DAYS_BIRTH']) / 365.25
    df['DAYS_EMPLOYED_YEARS'] = df['DAYS_EMPLOYED'].replace(365243, np.nan) / -365.25
    df['INCOME_CREDIT_RATIO'] = df['AMT_INCOME_TOTAL'] / (df['AMT_CREDIT'] + 1)
    df['CREDIT_GOODS_RATIO'] = df['AMT_CREDIT'] / (df['AMT_GOODS_PRICE'] + 1)
    return df[['SK_ID_CURR','DAYS_BIRTH_YEARS','DAYS_EMPLOYED_YEARS','AMT_INCOME_TOTAL','AMT_CREDIT','INCOME_CREDIT_RATIO','CREDIT_GOODS_RATIO']]

app_train_feats = make_app_features(train)
app_test_feats = make_app_features(test)

- Age (DAYS_BIRTH_YEARS) and length of employment (DAYS_EMPLOYED_YEARS), both in years
- Income-to-credit and credit-to-goods-price ratios
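
The day-based columns are stored as negative offsets from the application date, and the code treats the value 365243 in `DAYS_EMPLOYED` as a placeholder rather than a real duration. The sketch below applies the same two conversions to a made-up two-row frame so the effect of each step is visible.

```python
import numpy as np
import pandas as pd

# Toy rows (made up): one employed applicant, one carrying the 365243 placeholder
app_toy = pd.DataFrame({
    "DAYS_BIRTH":    [-14610, -21915],
    "DAYS_EMPLOYED": [-3652.5, 365243],
})

# Negative day offsets become positive years
age_years = -app_toy["DAYS_BIRTH"] / 365.25

# The placeholder is turned into NaN before converting, so it cannot
# show up as roughly -1000 years of employment
employed_years = app_toy["DAYS_EMPLOYED"].replace(365243, np.nan) / -365.25

print(age_years.tolist())
print(employed_years.tolist())
```

The resulting NaN is later filled by the median imputer, like any other missing numeric value.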

In [51]:
# Merge features
train_base = train[['SK_ID_CURR','TARGET']].merge(b_agg, on='SK_ID_CURR', how='left') \
                               .merge(prev_agg, on='SK_ID_CURR', how='left') \
                               .merge(inst_agg, on='SK_ID_CURR', how='left') \
                               .merge(cc_agg, on='SK_ID_CURR', how='left') \
                               .merge(pos_agg, on='SK_ID_CURR', how='left') \
                               .merge(app_train_feats, on='SK_ID_CURR', how='left')

test_base = test[['SK_ID_CURR']].merge(b_agg, on='SK_ID_CURR', how='left') \
                               .merge(prev_agg, on='SK_ID_CURR', how='left') \
                               .merge(inst_agg, on='SK_ID_CURR', how='left') \
                               .merge(cc_agg, on='SK_ID_CURR', how='left') \
                               .merge(pos_agg, on='SK_ID_CURR', how='left') \
                               .merge(app_test_feats, on='SK_ID_CURR', how='left')

print("\nMerged shapes:", train_base.shape, test_base.shape)


Merged shapes: (307511, 26) (48744, 25)


Merge the aggregates from every data source (bureau, previous applications, installments, credit card, POS cash, application-level features) per customer.

In [52]:
# Prepare X, y
Y = train_base['TARGET']
X = train_base.drop(['SK_ID_CURR','TARGET'], axis=1)
X_test = test_base.drop(['SK_ID_CURR'], axis=1)

- Y = target (TARGET)
- X = all numeric and categorical features used for training
- X_test = the same features for the test set

In [53]:
# numeric / categorical separation
num_cols = [c for c in X.columns if X[c].dtype.kind in 'biufc']
cat_cols = [c for c in X.columns if c not in num_cols]



Numeric: boolean, integer, float, or complex dtypes (dtype kind in 'biufc')

Categorical: everything else
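
The `dtype.kind` test can be checked on a small frame. The sketch below uses made-up columns to show which dtypes land in each list. The `_toy` names are illustrative only.

```python
import pandas as pd

# Toy frame with one column per common dtype (values are made up)
df_toy = pd.DataFrame({
    "flag":   [True, False],   # kind 'b'
    "count":  [1, 2],          # kind 'i'
    "amount": [1.5, 2.5],      # kind 'f'
    "label":  ["a", "b"],      # not in 'biufc'
})

# Same rule as the notebook: numeric if the dtype kind is one of b, i, u, f, c
num_cols_toy = [c for c in df_toy.columns if df_toy[c].dtype.kind in "biufc"]
cat_cols_toy = [c for c in df_toy.columns if c not in num_cols_toy]

print("numeric:", num_cols_toy)
print("categorical:", cat_cols_toy)
```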

In [54]:
# Impute numeric with median
num_imputer = SimpleImputer(strategy='median')
X[num_cols] = num_imputer.fit_transform(X[num_cols])
X_test[num_cols] = num_imputer.transform(X_test[num_cols])

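# For categorical, fill missing values and label-encode.
# Codes are learned on train and reused for test so both sets share one mapping;
# categories that appear only in test get code -1.
for c in cat_cols:
    X[c] = X[c].fillna('MISSING').astype(str)
    X_test[c] = X_test[c].fillna('MISSING').astype(str)
    codes, uniques = pd.factorize(X[c])
    X[c] = codes
    X_test[c] = pd.Categorical(X_test[c], categories=uniques).codes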

# Ensure X_test has all columns
X_test = X_test.reindex(columns=X.columns, fill_value=0)
print("\nNumber of features:", X.shape[1])


Number of features: 24


**Missing-value imputation**

- Numeric: median
- Categorical: fill with 'MISSING', then label-encode

**Categorical encoding**

- Convert categories to integers (factorize) so the ML models can use them

**Align X_test with X**

- Add any columns missing from the test set so it has the same features as X
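
`pd.factorize` numbers categories in order of first appearance, so encoding train and test independently can give the same category different codes. The sketch below, on made-up values, shows one way to learn the mapping on the training column and reuse it for the test column.

```python
import pandas as pd

# Toy categorical columns (made up); "Other" never appears in train
train_col = pd.Series(["Cash", "Revolving", "Cash", "MISSING"])
test_col = pd.Series(["Revolving", "Cash", "Other"])

# Learn the code for each category from the training data only
codes_train, categories = pd.factorize(train_col)

# Reuse the same categories for test; unseen values are encoded as -1
codes_test = pd.Categorical(test_col, categories=categories).codes

print(list(codes_train))
print(list(codes_test))
```

Factorizing `test_col` on its own would instead assign "Revolving" the code 0, which is the code "Cash" has in the training data.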

# **7. Split data & scaling**

In [55]:
X_tr, X_val, y_tr, y_val = train_test_split(X, Y, test_size=0.2, random_state=42, stratify=Y)
scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)

# save scaler
joblib.dump(scaler, os.path.join(WORKDIR, 'scaler.pkl'))

['/content/home_credit_task/scaler.pkl']

- Split the data into training and validation sets (80/20, stratified on the target)
- Scale the features so that scale-sensitive models such as Logistic Regression train more stably
- Save the scaler so it can be reused on the test set or in deployment
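
The important detail in this step is that the scaling statistics come from the training split only and are then applied unchanged to the validation and test data. The NumPy sketch below reproduces that arithmetic on made-up numbers. It illustrates what standardization computes and is not a replacement for `StandardScaler`.

```python
import numpy as np

# Toy feature matrices (made up)
X_train_toy = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
X_val_toy = np.array([[3.0, 70.0]])

# "fit": per-column mean and population standard deviation (ddof=0) from train only
mu = X_train_toy.mean(axis=0)
sigma = X_train_toy.std(axis=0)

# "transform": the same mu and sigma are applied to every split
X_train_scaled_toy = (X_train_toy - mu) / sigma
X_val_scaled_toy = (X_val_toy - mu) / sigma

print(X_train_scaled_toy)
print(X_val_scaled_toy)
```

The validation row is scaled with the training statistics, so a value outside the training range (70 in the second column) simply maps to a larger z-score.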

# **8. Modeling: Logistic Regression + hyperparameter tuning**

In [57]:
print("\nModeling (multiple algorithms + light tuning)")

# Model and hyperparameter definitions
models = {
    'LogisticRegression': LogisticRegression(max_iter=2000, class_weight='balanced', n_jobs=-1),
    'RandomForest': RandomForestClassifier(random_state=42, n_jobs=-1),
    'XGBoost': xgb.XGBClassifier(use_label_encoder=False, eval_metric='logloss', random_state=42, n_jobs=-1),
    'LightGBM': lgb.LGBMClassifier(objective='binary', random_state=42, n_jobs=-1),
    'CatBoost': CatBoostClassifier(verbose=0, random_state=42)
}
param_grids = {
    'LogisticRegression': {'C': [0.01, 0.1, 1, 5]},
    'RandomForest': {'n_estimators': [100, 200], 'max_depth': [6, 12, None], 'min_samples_leaf': [1,3]},
    'XGBoost': {'n_estimators': [100,200], 'max_depth': [3,6], 'learning_rate':[0.05,0.1]},
    'LightGBM': {'n_estimators':[100,300], 'num_leaves':[31,63], 'learning_rate':[0.01,0.05]},
    'CatBoost': {'iterations':[200,400], 'depth':[4,6], 'learning_rate':[0.03,0.1]}
}


Modeling (multiple algorithms + light tuning)


In [58]:
# Stratified K-Fold and CV setup
n_iter_search = 10
cv_folds = 3
cv = StratifiedKFold(n_splits=cv_folds, shuffle=True, random_state=42)

results = []
fitted_models = {}

In [None]:
# Loop over each model
for name, estimator in models.items():
    print(f"\n>> {name}")
    grid = param_grids.get(name, {})
    if len(grid) == 0:
        print("No grid -> fit default")
        if name == 'LogisticRegression':
            estimator.fit(X_tr_scaled, y_tr)
        else:
            estimator.fit(X_tr, y_tr)
        best = estimator
        cv_best = None
    else:
        # choose data (scaled for LR)
        X_for_fit = X_tr_scaled if name=='LogisticRegression' else X_tr
        rs = RandomizedSearchCV(estimator, param_distributions=grid,
                                n_iter=min(n_iter_search, max(1, np.prod([len(v) for v in grid.values()]))),
                                scoring='roc_auc', n_jobs=-1, cv=cv, random_state=42, verbose=0)
        rs.fit(X_for_fit, y_tr)
        best = rs.best_estimator_
        cv_best = rs.best_score_
        print("Best params:", rs.best_params_)
        print("CV AUC (approx):", cv_best)

    # Evaluate on the holdout set
    X_val_use = X_val_scaled if name=='LogisticRegression' else X_val
    proba_val = best.predict_proba(X_val_use)[:,1]
    auc_holdout = roc_auc_score(y_val, proba_val)
    ap = average_precision_score(y_val, proba_val)
    print(f"Holdout AUC: {auc_holdout:.4f} | AP: {ap:.4f}")
    # Store the model and record the results
    fitted_models[name] = {'model': best, 'proba_val': proba_val}
    results.append({'model': name, 'cv_auc': cv_best, 'holdout_auc': auc_holdout})

    # save model
    joblib.dump(best, os.path.join(WORKDIR, f"model_{name}.pkl"))


>> LogisticRegression
Best params: {'C': 5}
CV AUC (approx): 0.6761919059017285
Holdout AUC: 0.6780 | AP: 0.1566

>> RandomForest


In [None]:
results_df = pd.DataFrame(results).sort_values('holdout_auc', ascending=False)
results_df.to_csv(os.path.join(WORKDIR, 'model_results_summary.csv'), index=False)
print("\nModel results:\n", results_df)

# **9. Evaluate the modeling results (visual + metrics)**

In [None]:
# ROC comparison
plt.figure(figsize=(8,6))
for name, info in fitted_models.items():
    proba = info['proba_val']
    fpr, tpr, _ = roc_curve(y_val, proba)
    auc_val = auc(fpr, tpr)
    plt.plot(fpr, tpr, label=f"{name} (AUC={auc_val:.4f})")
plt.plot([0,1],[0,1], linestyle='--', color='gray')
plt.title('ROC comparison (holdout)'); plt.xlabel('FPR'); plt.ylabel('TPR'); plt.legend(loc='lower right')
savefig(plt.gcf(), '4_roc_comparison.png'); plt.close()

In [None]:
# PR comparison
plt.figure(figsize=(8,6))
for name, info in fitted_models.items():
    proba = info['proba_val']
    precision, recall, _ = precision_recall_curve(y_val, proba)
    ap = average_precision_score(y_val, proba)
    plt.plot(recall, precision, label=f"{name} (AP={ap:.4f})")
plt.title('Precision-Recall (holdout)'); plt.xlabel('Recall'); plt.ylabel('Precision'); plt.legend(loc='lower left')
savefig(plt.gcf(), '5_pr_comparison.png'); plt.close()

In [None]:
# Confusion matrix & classification report at threshold 0.5 for top model
top_model_name = results_df.iloc[0]['model']
top_model = fitted_models[top_model_name]['model']
top_proba = fitted_models[top_model_name]['proba_val']
thr = 0.5
y_pred_thr = (top_proba > thr).astype(int)
print(f"\nTop model: {top_model_name} | Holdout AUC: {results_df.iloc[0]['holdout_auc']:.4f}")
print("Classification report (threshold=0.5):")
print(classification_report(y_val, y_pred_thr))
cm = confusion_matrix(y_val, y_pred_thr)
print("Confusion matrix:\n", cm)

In [None]:
# Save ROC for logistic as well (already have combined)
# Save feature importance for tree-based top model if available
if hasattr(top_model, 'feature_importances_'):
    fi = pd.Series(top_model.feature_importances_, index=X.columns).sort_values(ascending=False)
    fig = plt.figure(figsize=(8,6)); ax = fig.add_subplot(111)
    topn = fi.head(20)[::-1]
    ax.barh(topn.index, topn.values); ax.set_title('Top 20 feature importances (top model)')
    savefig(fig, '6_feature_importance_top20.png'); plt.close(fig)