# Modern Classification Workflow: Corporate Bankruptcy (2026 Best Practices)

This notebook demonstrates **reproducibility**, **correct validation** (pipelines + stratified CV), **multiple metrics**, and **interpretability (SHAP)** on the Taiwan Corporate Bankruptcy dataset. It serves as the reference implementation for the portfolio—see [docs/BEST_PRACTICES.md](docs/BEST_PRACTICES.md) for detailed explanations of each practice.

## 1. Reproducibility: set the random seed

**Why:** ML involves randomness (splits, bootstrap, stochastic algorithms). Fixing the seed makes runs repeatable: rerunning this notebook in the same environment (same library versions and hardware) should give the same results. Call `set_seed` once at the top and pass `random_state` to every random component (splits, estimators, CV).

In [None]:
from portfolio_utils import set_seed

set_seed(42)
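
`set_seed` comes from `portfolio_utils`, whose source is not shown in this notebook. As a rough idea of what such a helper typically does, here is a minimal sketch. The name `set_seed_sketch` and its body are assumptions for illustration, not the portfolio's actual implementation.

```python
import os
import random

import numpy as np


def set_seed_sketch(seed: int = 42) -> None:
    """Hypothetical stand-in for portfolio_utils.set_seed (assumed behavior)."""
    # Only affects Python processes started after this point, not the running one.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)     # Python's built-in RNG
    np.random.seed(seed)  # NumPy's global RNG, which scikit-learn uses when random_state=None


# Reseeding with the same value reproduces the same draws.
set_seed_sketch(42)
first = np.random.rand(3)
set_seed_sketch(42)
second = np.random.rand(3)
print(np.array_equal(first, second))
```

A global seed is a safety net only. Passing `random_state` explicitly, as the rest of the notebook does, keeps each component reproducible even if cells are run out of order.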

## 2. Load data and prepare target/features

Data is loaded via the portfolio data loader (Kaggle API or local `data/`). We separate the target `Bankrupt?` from features and use only numeric columns. No preprocessing is fitted here—that happens inside the pipeline so we avoid leaking information from the test set.

In [None]:
import pandas as pd
import numpy as np
from portfolio_utils import load_bankruptcy

df = load_bankruptcy()
y = df["Bankrupt?"]
X = df.drop(columns=["Bankrupt?"]).select_dtypes(include=[np.number])
# Drop constant columns so SelectKBest (F-test) is well-defined
constant_cols = [c for c in X.columns if X[c].nunique() <= 1]
if constant_cols:
    X = X.drop(columns=constant_cols)
feature_names = list(X.columns)
print("Shape:", X.shape, "Target balance:")
print(y.value_counts())
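
To see why constant columns are dropped before the F-test, consider a toy frame. A constant column has zero variance within every class, so its F-statistic is 0/0 and therefore undefined. The data and column names below are made up for illustration.

```python
import warnings

import numpy as np
import pandas as pd
from sklearn.feature_selection import f_classif

# Toy data (made up): one informative column and one constant column.
toy_X = pd.DataFrame({
    "informative": [0.1, 0.2, 0.3, 0.9, 1.0, 1.1],
    "constant": [1.0] * 6,
})
toy_y = np.array([0, 0, 0, 1, 1, 1])

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the warnings about the constant feature
    f_stats, _ = f_classif(toy_X, toy_y)
print(dict(zip(toy_X.columns, f_stats)))  # the constant column's F-statistic is NaN

# Same filter as in the cell above: keep columns with more than one distinct value.
kept = [c for c in toy_X.columns if toy_X[c].nunique() > 1]
print(kept)
```

Note that `nunique()` ignores missing values by default, so the `<= 1` check in the cell above also drops a column that is entirely NaN.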

## 3. Stratified train/test split

**Why:** The dataset is imbalanced (few bankruptcies). Using `stratify=y` keeps the same class proportions in train and test so evaluation is fair. We hold out 20% for a final test set and never use it until the end.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print("Train:", X_train.shape, "Test:", X_test.shape)
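
The effect of `stratify` is easiest to see on synthetic labels. The sketch below assumes 1,000 rows with 5% positives (invented numbers, not the bankruptcy data) and compares the positive rate in each split with and without stratification.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels (assumption for illustration): 1,000 rows, 5% positives.
demo_y = np.array([1] * 50 + [0] * 950)
demo_X = np.arange(len(demo_y)).reshape(-1, 1)

# Plain random split: the positive rate can drift between train and test.
_, _, y_tr_plain, y_te_plain = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=42
)
# Stratified split: each side keeps (almost exactly) the overall positive rate.
_, _, y_tr_strat, y_te_strat = train_test_split(
    demo_X, demo_y, test_size=0.2, stratify=demo_y, random_state=42
)

print("plain      train/test positive rate:", y_tr_plain.mean(), y_te_plain.mean())
print("stratified train/test positive rate:", y_tr_strat.mean(), y_te_strat.mean())
```

With few positives, a plain split can leave the test set with noticeably more or fewer bankruptcies than the training set, which makes test metrics noisier.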

## 4. Pipeline: preprocessing + model

**Why:** A single pipeline ensures (1) the scaler and feature selector are fitted only on training data, (2) the same transformations are applied to test data, and (3) cross-validation fits preprocessing per fold—no leakage. We use StandardScaler, SelectKBest (top 30 features by F-statistic), and XGBClassifier. Tree-based models can work without scaling, but scaling is harmless and keeps the pattern reusable for non-tree models.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
import xgboost as xgb

n_features = min(30, X_train.shape[1] - 1)
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("selector", SelectKBest(f_classif, k=n_features)),
    ("estimator", xgb.XGBClassifier(random_state=42)),
])
pipeline
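
The leakage that the pipeline prevents can be demonstrated on pure noise. In this sketch the features carry no signal at all, so an honest estimate should sit near chance. `LogisticRegression` is swapped in for `XGBClassifier` to keep the example light, and all data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(42)
noise_X = rng.normal(size=(100, 2000))  # pure noise: no feature carries signal
noise_y = np.array([0, 1] * 50)         # labels unrelated to the features
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Leaky: the selector sees every row, including rows later used for validation.
leaky_X = SelectKBest(f_classif, k=20).fit_transform(noise_X, noise_y)
leaky_acc = cross_val_score(
    LogisticRegression(max_iter=1000), leaky_X, noise_y, cv=cv
).mean()

# Honest: the selector is refit inside each training fold.
honest_pipe = Pipeline([
    ("selector", SelectKBest(f_classif, k=20)),
    ("estimator", LogisticRegression(max_iter=1000)),
])
honest_acc = cross_val_score(honest_pipe, noise_X, noise_y, cv=cv).mean()

print(f"leaky: {leaky_acc:.2f}  honest: {honest_acc:.2f}")
```

The leaky score looks good only because the selector picked the noise features that happen to correlate with the labels across all rows, validation rows included.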

## 5. Cross-validation with multiple metrics

**Why:** We evaluate the **entire pipeline** with `cross_validate` so that scaling and feature selection are refit on each fold's training portion. Reporting several metrics (accuracy, precision, recall, F1, ROC-AUC) gives a fuller picture than any single number; on imbalanced data, accuracy alone can look high even when the model misses most bankruptcies, so F1 and ROC-AUC are more informative.

In [None]:
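from sklearn.model_selection import StratifiedKFold, cross_validate

# Score the positive (bankrupt) class directly; the *_weighted variants are
# dominated by the majority class on imbalanced data.
scoring = ["accuracy", "precision", "recall", "f1", "roc_auc"]
# Explicit, seeded splitter so the folds are reproducible.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_results = cross_validate(pipeline, X_train, y_train, cv=cv, scoring=scoring, n_jobs=-1)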

print("Cross-validation (5-fold stratified):")
for metric in scoring:
    key = f"test_{metric}"
    if key in cv_results:
        mean_val = cv_results[key].mean()
        std_val = cv_results[key].std()
        print(f"  {metric}: {mean_val:.4f} (+/- {std_val:.4f})")
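
A quick way to see why accuracy alone misleads on imbalanced data is to score a baseline that always predicts the majority class. The sketch below uses synthetic data with 5% positives (an assumption for illustration, not the bankruptcy data).

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic data (assumption for illustration): 5% positives, uninformative features.
rng = np.random.default_rng(42)
imb_X = rng.normal(size=(1000, 3))
imb_y = np.array([1] * 50 + [0] * 950)

baseline = DummyClassifier(strategy="most_frequent")  # always predicts class 0
demo_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
demo_metrics = ["accuracy", "recall", "roc_auc"]
baseline_cv = cross_validate(baseline, imb_X, imb_y, cv=demo_cv, scoring=demo_metrics)

for name in demo_metrics:
    key = f"test_{name}"
    print(f"{name}: {baseline_cv[key].mean():.3f}")
```

The baseline reaches high accuracy while finding no positives at all, and its ROC-AUC is at chance level. Any real model should be compared against numbers like these rather than against 50% accuracy.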

## 6. Fit on full training set and evaluate on holdout test set

We fit the pipeline once on all training data, then predict on the held-out test set. We report classification report, confusion matrix, and ROC-AUC to show we care about more than accuracy.

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
y_proba = pipeline.predict_proba(X_test)[:, 1]

print("Classification report (test set):")
print(classification_report(y_test, y_pred, target_names=["Not bankrupt", "Bankrupt"]))
print("Confusion matrix:")
print(confusion_matrix(y_test, y_pred))
print("ROC-AUC (test):", round(float(roc_auc_score(y_test, y_proba)), 4))
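
For reading the confusion matrix, it helps to unpack its four cells and derive precision and recall by hand. The labels below are made up for illustration (1 = bankrupt, 0 = not bankrupt).

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels for illustration (1 = bankrupt, 0 = not bankrupt).
toy_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
toy_pred = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

# Rows are true classes, columns are predicted classes, both in label order [0, 1].
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred).ravel()
precision = tp / (tp + fp)  # of the firms flagged as bankrupt, how many really were
recall = tp / (tp + fn)     # of the firms that went bankrupt, how many were flagged

print(tn, fp, fn, tp)
print(precision, recall)
```

These hand-computed values match `precision_score` and `recall_score` for the positive class, which are the same numbers shown in the "Bankrupt" row of the classification report.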

## 7. Interpretability: SHAP summary

**Why:** SHAP (Shapley values) explains which features drove the model's predictions. For tree models we use `TreeExplainer` on the **final estimator** and the **transformed** training data (as the model sees it). We use the first 500 transformed training rows to keep runtime reasonable. The beeswarm plot ranks features by importance; each point's horizontal position shows whether that feature pushed the prediction toward bankruptcy (positive SHAP value) or away from it, and its color shows the feature's value (red = high, blue = low).

In [None]:
import matplotlib.pyplot as plt

try:
    import shap

    # Pipeline: data flows through scaler -> selector -> estimator.
    # SHAP needs the estimator and the data in the form the estimator sees.
    estimator = pipeline.named_steps["estimator"]
    X_train_transformed = pipeline["selector"].transform(pipeline["scaler"].transform(X_train))
    selected_names = np.array(feature_names)[pipeline["selector"].get_support()].tolist()

    sample_size = min(500, len(X_train_transformed))
    X_sample = X_train_transformed[:sample_size]

    explainer = shap.TreeExplainer(estimator, X_sample)
    shap_values = explainer.shap_values(X_sample)

    plt.figure(figsize=(10, 8))
    shap.summary_plot(shap_values, X_sample, feature_names=selected_names, max_display=15, show=False)
    plt.title("SHAP summary (class 1 = Bankrupt)")
    plt.tight_layout()  # run after the title is set so it is included in the layout
    plt.show()
except ImportError:
    print("Install shap: pip install shap (or uv add shap)")
except Exception as e:
    print("SHAP error:", e)

## Summary

- **Reproducibility:** `set_seed(42)` and `random_state=42` everywhere.
- **Validation:** One pipeline (scaler → selector → model), fitted only on train; `cross_validate` on the pipeline; stratified splits.
- **Metrics:** Accuracy, precision, recall, F1, and ROC-AUC in cross-validation; classification report, confusion matrix, and ROC-AUC on the holdout test set.
- **Interpretability:** SHAP summary plot on the fitted model with transformed features.

Apply these patterns across the portfolio—see [docs/BEST_PRACTICES.md](docs/BEST_PRACTICES.md) and [IMPROVEMENTS.md](IMPROVEMENTS.md).