# Modern Classification Workflow: Corporate Bankruptcy (2026 Best Practices)

This notebook demonstrates **reproducibility**, **correct validation** (pipelines + stratified CV), **multiple metrics**, and **interpretability (SHAP)** on the Taiwan Corporate Bankruptcy dataset. It serves as the reference implementation for the portfolio—see [docs/BEST_PRACTICES.md](docs/BEST_PRACTICES.md) for detailed explanations of each practice.

## 1. Reproducibility: set the random seed

**Why:** ML involves randomness (splits, bootstrap, stochastic algorithms). Fixing the seed makes runs repeatable: with the same data, library versions, and hardware, rerunning this notebook should give the same results. Call `set_seed` once at the top and also pass `random_state` to every component that accepts one (splits, estimators, shuffled CV splitters).

In [1]:
from portfolio_utils import set_seed

set_seed(42)
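
The implementation of `portfolio_utils.set_seed` is not shown in this notebook. As a rough sketch of what such a helper commonly does, the hypothetical `set_seed_sketch` below seeds Python's `random` module and NumPy's global generator. The function name and body are assumptions for illustration, not the portfolio's actual code.

```python
import os
import random

import numpy as np


def set_seed_sketch(seed: int = 42) -> None:
    """Hypothetical stand-in for portfolio_utils.set_seed (real implementation not shown)."""
    # Only affects hashing in Python processes started after this point.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)      # Python's built-in RNG
    np.random.seed(seed)   # NumPy's legacy global RNG


# Seeding twice with the same value reproduces the same draws.
set_seed_sketch(42)
first = (random.random(), np.random.rand(3).tolist())
set_seed_sketch(42)
second = (random.random(), np.random.rand(3).tolist())
print(first == second)
```

Global seeding does not reach estimators that keep their own generator, which is why `random_state` is still passed explicitly in the cells that follow.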

## 2. Load data and prepare target/features

Data is loaded via the portfolio data loader (Kaggle API or local `data/`). We separate the target `Bankrupt?` from features and use only numeric columns. No preprocessing is fitted here—that happens inside the pipeline so we avoid leaking information from the test set.

In [2]:
import pandas as pd
import numpy as np
from portfolio_utils import load_bankruptcy

df = load_bankruptcy()
y = df["Bankrupt?"]
X = df.drop(columns=["Bankrupt?"]).select_dtypes(include=[np.number])
# Drop constant columns so SelectKBest (F-test) is well-defined
constant_cols = [c for c in X.columns if X[c].nunique() <= 1]
if constant_cols:
    X = X.drop(columns=constant_cols)
feature_names = list(X.columns)
print("Shape:", X.shape, "Target balance:")
print(y.value_counts())

Shape: (6819, 94) Target balance:
Bankrupt?
0    6599
1     220
Name: count, dtype: int64
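
To make the constant-column filter concrete, here is a small sketch on a made-up four-row frame. The column names `ratio_a`, `flag_const`, and `ratio_b` are hypothetical and are not columns of the bankruptcy dataset. A column with a single unique value has zero variance, so an F-test on it is undefined; this is the case the filter removes.

```python
import pandas as pd

# Toy frame with hypothetical column names; "flag_const" never varies.
toy = pd.DataFrame({
    "ratio_a": [0.1, 0.4, 0.3, 0.9],
    "flag_const": [1, 1, 1, 1],
    "ratio_b": [2.0, 1.5, 2.5, 0.5],
    "Bankrupt?": [0, 0, 1, 0],
})

y_toy = toy["Bankrupt?"]
X_toy = toy.drop(columns=["Bankrupt?"]).select_dtypes(include="number")

# Same rule as in the cell above: a column with at most one unique value is constant.
constant_cols_toy = [c for c in X_toy.columns if X_toy[c].nunique() <= 1]
X_toy = X_toy.drop(columns=constant_cols_toy)

print("Dropped:", constant_cols_toy)
print("Kept:", list(X_toy.columns))
print(f"Positive rate: {y_toy.mean():.2%}")
```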


## 3. Stratified train/test split

**Why:** The dataset is imbalanced: 220 of 6,819 companies (about 3.2%) went bankrupt. Using `stratify=y` keeps roughly the same class proportions in train and test, so the test set holds a representative number of positives. We hold out 20% as a final test set and do not touch it until the end.

In [3]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print("Train:", X_train.shape, "Test:", X_test.shape)

Train: (5455, 94) Test: (1364, 94)
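
The effect of `stratify=y` can be checked directly. The sketch below rebuilds a label vector with the class counts printed in Section 2 (6,599 and 220) and pairs it with a placeholder feature column, since only the labels matter for the split. It then compares the positive rate in each part with and without stratification.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels with the class counts from Section 2; features are a dummy placeholder.
y_demo = np.array([0] * 6599 + [1] * 220)
X_demo = np.zeros((len(y_demo), 1))

# Stratified split: class proportions are preserved in both parts.
_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)
# Plain random split for comparison: proportions can drift.
_, _, y_tr_plain, y_te_plain = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print("Sizes:", len(y_tr), len(y_te))
print(f"Stratified positive rate   train={y_tr.mean():.4f} test={y_te.mean():.4f}")
print(f"Unstratified positive rate train={y_tr_plain.mean():.4f} test={y_te_plain.mean():.4f}")
```

With so few positives, an unstratified split can shift the test-set positive count by several cases from one seed to the next, which is enough to move recall noticeably.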


## 4. Pipeline: preprocessing + model

**Why:** A single pipeline ensures (1) the scaler and feature selector are fitted only on training data, (2) the same transformations are applied to test data, and (3) cross-validation fits preprocessing per fold—no leakage. We use StandardScaler, SelectKBest (top 30 features by F-statistic), and XGBClassifier. Tree-based models can work without scaling, but scaling is harmless and keeps the pattern reusable for non-tree models.

In [4]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
import xgboost as xgb

n_features = min(30, X_train.shape[1] - 1)
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("selector", SelectKBest(f_classif, k=n_features)),
    ("estimator", xgb.XGBClassifier(random_state=42)),
])
pipeline

[Pipeline display, condensed from the interactive HTML output]
- Pipeline: steps = scaler → selector → estimator; transform_input and memory unset; verbose=False
- scaler (StandardScaler): copy=True, with_mean=True, with_std=True
- selector (SelectKBest): score_func=f_classif, k=30
- estimator (XGBClassifier): objective='binary:logistic', enable_categorical=False; the other listed parameters are left at their defaults
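
One way to convince yourself that the pipeline fits preprocessing on training rows only is to inspect the fitted scaler. The sketch below uses synthetic data from `make_classification` and swaps in `LogisticRegression` for `XGBClassifier` so that it runs with scikit-learn alone. The dataset, the `k=5` setting, and the `demo_` names are illustrative assumptions, not part of the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the bankruptcy dataset).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=4, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("selector", SelectKBest(f_classif, k=5)),
    ("estimator", LogisticRegression(max_iter=1000)),  # stand-in for XGBClassifier
])
demo_pipeline.fit(X_tr, y_tr)

scaler_mean = demo_pipeline.named_steps["scaler"].mean_
n_selected = int(demo_pipeline.named_steps["selector"].get_support().sum())

# The scaler's statistics match the training rows, not the full data.
print("Matches train mean:", np.allclose(scaler_mean, X_tr.mean(axis=0)))
print("Matches full-data mean:", np.allclose(scaler_mean, X_demo.mean(axis=0)))
print("Features kept by selector:", n_selected)

# predict() reuses the fitted scaler and selector; nothing is refit on test data.
print("Test accuracy:", round(demo_pipeline.score(X_te, y_te), 3))
```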


## 5. Cross-validation with multiple metrics

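**Why:** We evaluate the **entire pipeline** with `cross_validate` so that scaling and feature selection are refit on each fold's training portion. Because the estimator is a classifier, `cv=5` uses stratified 5-fold splitting. Reporting several metrics (accuracy, precision, recall, F1, ROC-AUC) gives a fuller picture than accuracy alone. Note that the `_weighted` variants average per-class scores by class support, so with about 97% non-bankrupt firms they are dominated by the majority class, and `recall_weighted` is mathematically identical to accuracy (both print as 0.9683 below). Of the metrics listed here, ROC-AUC says the most about how well the model ranks the minority class.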

In [5]:
from sklearn.model_selection import cross_validate

scoring = ["accuracy", "precision_weighted", "recall_weighted", "f1_weighted", "roc_auc_ovr"]
cv_results = cross_validate(pipeline, X_train, y_train, cv=5, scoring=scoring, n_jobs=-1)

print("Cross-validation (5-fold stratified):")
for metric in scoring:
    key = f"test_{metric}"
    if key in cv_results:
        mean_val = cv_results[key].mean()
        std_val = cv_results[key].std()
        print(f"  {metric}: {mean_val:.4f} (+/- {std_val:.4f})")

Cross-validation (5-fold stratified):
  accuracy: 0.9683 (+/- 0.0031)
  precision_weighted: 0.9599 (+/- 0.0053)
  recall_weighted: 0.9683 (+/- 0.0031)
  f1_weighted: 0.9621 (+/- 0.0042)
  roc_auc_ovr: 0.9174 (+/- 0.0299)
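
The weighted scores above mostly reflect the majority class. A possible complement, sketched here on synthetic imbalanced data with `LogisticRegression` as a stand-in model, is to score the positive class directly and to pass an explicit shuffled `StratifiedKFold` with a `random_state`. The data, the model, and the names in the `scoring` dictionary are assumptions for illustration; the numbers it prints say nothing about the bankruptcy dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, precision_score
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic data with roughly 3% positives (not the bankruptcy dataset).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.97, 0.03], class_sep=1.5, random_state=0,
)

scoring_demo = {
    "accuracy": "accuracy",
    "recall_weighted": "recall_weighted",
    # zero_division=0 avoids warnings when a fold has no positive predictions.
    "precision_pos": make_scorer(precision_score, zero_division=0),
    "recall_pos": "recall",
    "f1_pos": "f1",
    "roc_auc": "roc_auc",
    "avg_precision": "average_precision",
}
cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
results = cross_validate(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=cv_demo, scoring=scoring_demo
)

for name in scoring_demo:
    scores = results[f"test_{name}"]
    print(f"  {name}: {scores.mean():.4f} (+/- {scores.std():.4f})")

# Support-weighted recall reduces to accuracy, fold by fold.
same = np.allclose(results["test_accuracy"], results["test_recall_weighted"])
print("recall_weighted == accuracy:", same)
```

The plain `"precision"`, `"recall"`, and `"f1"` scorers report the class labelled 1, which is the bankrupt class in this notebook's encoding.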


## 6. Fit on full training set and evaluate on holdout test set

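We fit the pipeline once on all training data, then predict on the held-out test set. We report the classification report, confusion matrix, and ROC-AUC because accuracy alone hides minority-class performance: in the output below, accuracy is 0.97 while recall on the bankrupt class is only 0.30 (13 of 44 bankruptcies detected at the default 0.5 threshold).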

In [6]:
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
y_proba = pipeline.predict_proba(X_test)[:, 1]

print("Classification report (test set):")
print(classification_report(y_test, y_pred, target_names=["Not bankrupt", "Bankrupt"]))
print("Confusion matrix:")
print(confusion_matrix(y_test, y_pred))
print("ROC-AUC (test):", round(roc_auc_score(y_test, y_proba), 4))

Classification report (test set):
              precision    recall  f1-score   support

Not bankrupt       0.98      0.99      0.98      1320
    Bankrupt       0.54      0.30      0.38        44

    accuracy                           0.97      1364
   macro avg       0.76      0.64      0.68      1364
weighted avg       0.96      0.97      0.96      1364

Confusion matrix:
[[1309   11]
 [  31   13]]
ROC-AUC (test): 0.92
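
The bankrupt-class row of the report can be reproduced by hand from the confusion matrix above. scikit-learn lays the matrix out with true classes as rows and predicted classes as columns, so for labels 0 and 1 it reads `[[TN, FP], [FN, TP]]`.

```python
# Counts taken from the confusion matrix printed above: [[TN, FP], [FN, TP]].
tn, fp, fn, tp = 1309, 11, 31, 13

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the firms flagged as bankrupt, how many were
recall = tp / (tp + fn)      # of the actual bankruptcies, how many were flagged
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy:  {accuracy:.2f}")   # 0.97
print(f"precision: {precision:.2f}")  # 0.54
print(f"recall:    {recall:.2f}")     # 0.30
print(f"f1:        {f1:.2f}")         # 0.38
```

The 31 false negatives are the bankruptcies the model missed. Lowering the decision threshold applied to `y_proba` would usually trade some precision for higher recall.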


## 7. Interpretability: SHAP summary

**Why:** SHAP (Shapley values) attributes each prediction to the input features. For tree models we use `TreeExplainer` on the **final estimator** and the **transformed** training data (as the model sees it). We use the first 500 rows to keep runtime reasonable. The beeswarm plot ranks features by overall impact. Each point's horizontal position is its SHAP value (right of zero pushes the prediction toward bankruptcy), and its color encodes the feature value (red = high, blue = low). In the saved run below, `shap` was not installed, so the cell printed the install hint instead of a plot.

In [None]:
import matplotlib.pyplot as plt

try:
    import shap

    # Pipeline: data flows through scaler -> selector -> estimator.
    # SHAP needs the estimator and the data in the form the estimator sees.
    estimator = pipeline.named_steps["estimator"]
    X_train_transformed = pipeline["selector"].transform(pipeline["scaler"].transform(X_train))
    selected_names = np.array(feature_names)[pipeline["selector"].get_support()].tolist()

    sample_size = min(500, len(X_train_transformed))
    X_sample = X_train_transformed[:sample_size]

    explainer = shap.TreeExplainer(estimator, X_sample)
    shap_values = explainer.shap_values(X_sample)

    plt.figure(figsize=(10, 8))
    shap.summary_plot(shap_values, X_sample, feature_names=selected_names, max_display=15, show=False)
    plt.title("SHAP summary (class 1 = Bankrupt)")
    plt.tight_layout()  # run after the title is set so the layout accounts for it
    plt.show()
except ImportError:
    print("Install shap: pip install shap (or uv add shap)")
except Exception as e:
    print("SHAP error:", e)

Install shap: pip install shap (or uv add shap)
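
Since the plot is missing from this run, a tiny worked example may help show what a Shapley value is. The sketch below computes exact Shapley values for a hypothetical three-feature scoring function by averaging each feature's marginal contribution over all feature orderings. This is the brute-force definition, not the algorithm `TreeExplainer` uses. The function `toy_model`, the input, and the all-zeros baseline are made up for illustration.

```python
from itertools import permutations
from math import factorial


def toy_model(x):
    """Hypothetical score with an interaction between features 0 and 2."""
    return 2.0 * x[0] + 1.0 * x[1] + 3.0 * x[0] * x[2]


def shapley_values(f, x, baseline):
    """Exact Shapley values: average marginal contribution over all orderings."""
    n = len(x)
    phi = [0.0] * n
    for order in permutations(range(n)):
        current = list(baseline)      # start from the baseline input
        prev = f(current)
        for i in order:
            current[i] = x[i]         # switch feature i to its actual value
            new = f(current)
            phi[i] += new - prev      # marginal contribution in this ordering
            prev = new
    return [p / factorial(n) for p in phi]


x = [1.0, 2.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)

print("Shapley values:", phi)  # [3.5, 2.0, 1.5]
print("Sum:", sum(phi), "= f(x) - f(baseline) =", toy_model(x) - toy_model(baseline))
```

The interaction term contributes 3.0 in total and is split evenly between features 0 and 2. The values sum to the gap between the prediction and the baseline prediction, which is the additivity property that lets a beeswarm plot be read as a decomposition of each prediction.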


## Summary

- **Reproducibility:** `set_seed(42)` and `random_state=42` everywhere.
- **Validation:** One pipeline (scaler → selector → model), fitted only on train; `cross_validate` on the pipeline; stratified splits.
- **Metrics:** Accuracy, precision, recall, F1, and ROC-AUC in cross-validation; classification report, confusion matrix, and ROC-AUC on the holdout test set.
- **Interpretability:** SHAP summary plot on the fitted model with transformed features.

Apply these patterns across the portfolio—see [docs/BEST_PRACTICES.md](docs/BEST_PRACTICES.md) and [IMPROVEMENTS.md](IMPROVEMENTS.md).