In [1]:
import multiprocessing
print(multiprocessing.cpu_count())

import psutil
print(f"Available memory before training: {psutil.virtual_memory().available / 1e9:.2f} GB")

10
Available memory before training: 9.25 GB


# Diabetes Readmission – LightGBM Gradient Boosting

## Introduction

This notebook implements LightGBM (Light Gradient Boosting Machine) for predicting hospital readmission within 30 days for diabetic patients. We use the same preprocessed dataset as XGBoost, optimized for tree-based methods, which includes:

-Full dataset: All encounters retained (101,763 records), as gradient boosting methods handle correlated observations effectively
- Binary and count features: ICD-9 diagnostic codes expanded into both indicator variables and count-based features
- Ordinal encoding: Categorical variables encoded as integers for optimal tree-based learning
- Raw numeric feature: No scaling applied as LightGBM handles different scales naturally

## Methodology

**No Class Imbalance Handling**: LightGBM includes built-in class weighting mechanisms that naturally handle imbalanced datasets without requiring synthetic sampling techniques.

**Leaf-wise Tree Growth**: Unlike XGBoost's level-wise approach, LightGBM grows trees leaf-wise, expanding the most beneficial leaf first. This provides:
- Faster training: Significantly reduced computation time compared to XGBoost
- Memory efficiency: Lower memory usage through optimized data structures
- Higher accuracy: Often achieves better performance with fewer iterations

**Enhanced Feature Processing**: LightGBM includes several optimizations:
- Native categorical handling: Direct support for categorical features without preprocessing
- Feature bundling: Automatic grouping of sparse features for efficiency
- Network communication optimization: Faster distributed training capabilities

**Hyperparameter Optimization**: Using Optuna's intelligent search across 12 key parameters including LightGBM-specific options like `num_leaves`, `min_split_gain`, and `feature_fraction`.

**Preprocessing Pipeline**: StandardScaler for numeric features and ordinal encoding for categoricals

The goal is to leverage LightGBM's speed and efficiency advantages while potentially achieving superior performance to XGBoost through its advanced leaf-wise tree construction and optimized feature handling.

In [2]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import pickle
import time

In [3]:
token = 'f11' # iteratable by the user as we try new things
randy = 42 # random value insertion for repeatability
lgbm_data = pd.read_pickle("../models/randomForests.pkl") # See prior notebook, p02.

In [4]:
lgbm_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 101763 entries, 0 to 101765
Columns: 147 entries, encounter_id to count_E990_E999
dtypes: bool(4), float64(6), int64(115), object(22)
memory usage: 112.2+ MB


## Memory Optimization

The `optimize_dtypes()` function reduces memory usage by downcasting numeric types to their smallest sufficient representation:
- `int64` → `int8/int16/int32` based on value ranges
- `float64` → `float32` when precision allows

This optimization is particularly valuable for large datasets and memory-intensive operations like SMOTE resampling.

In [5]:
def optimize_dtypes(df):
    
    """
    Here we convert some of our columns intelligently to save on memory & time
    """
    
    for col in df.columns:
        col_type = df[col].dtype

        if col_type == 'int64':
            c_min = df[col].min()
            c_max = df[col].max()

            if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                df[col] = df[col].astype(np.int8)
            elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                df[col] = df[col].astype(np.int16)
            elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                df[col] = df[col].astype(np.int32)

        elif col_type == 'float64':
            c_min = df[col].min()
            c_max = df[col].max()

            if c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                df[col] = df[col].astype(np.float32)

    return df

In [6]:
lgbm_data = optimize_dtypes(lgbm_data)
lgbm_data.info() # 90+mb RAM savings

<class 'pandas.core.frame.DataFrame'>
Index: 101763 entries, 0 to 101765
Columns: 147 entries, encounter_id to count_E990_E999
dtypes: bool(4), float32(6), int16(1), int32(2), int8(112), object(22)
memory usage: 32.4+ MB


In [7]:
import lightgbm as lgb
from lightgbm import LGBMClassifier

import optuna
from optuna.pruners import MedianPruner
from optuna.samplers import TPESampler
from optuna.integration import LightGBMPruningCallback

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder, StandardScaler 
from sklearn.model_selection import train_test_split
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import KFold

from sklearn.pipeline import Pipeline
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix, roc_curve, auc
from sklearn.metrics import precision_score, recall_score, f1_score

  from .autonotebook import tqdm as notebook_tqdm


## Model Evaluation and Persistence Function

The `evaluate_and_save_model()` function provides standardized evaluation across all modeling approaches in this project. See other notebooks for why. For the Light GBM implementation, since no scikit pipelines were used, it was slightly modified, but still exports everything the other notebooks do.

In [8]:
def evaluate_and_save_model(model, preprocessor, namestring, token, 
                            X_train, X_test, y_train, y_test,
                            console_out=False):
    """Direct model evaluation without pipeline"""
    # Suppress the feature name warnings during evaluation
    with warnings.catch_warnings():
        warnings.filterwarnings('ignore', category=UserWarning, module='sklearn')

        # Apply preprocessing
        X_train_proc = preprocessor.transform(X_train)
        X_test_proc = preprocessor.transform(X_test)

        # Make predictions using PROCESSED data
        y_train_pred = model.predict(X_train_proc)
        y_test_pred = model.predict(X_test_proc)
        y_test_pred_proba = model.predict_proba(X_test_proc)[:, 1]

        # Calculate metrics using PROCESSED data
        accuracy = model.score(X_test_proc, y_test)  # Changed this line
        precision = precision_score(y_test, y_test_pred)
        recall = recall_score(y_test, y_test_pred)
        f1 = f1_score(y_test, y_test_pred)

        # Confusion matrix metrics
        tn, fp, fn, tp = confusion_matrix(y_test, y_test_pred).ravel()
        specificity = tn / (tn + fp) if (tn + fp) > 0 else 0

        # ROC curve
        fpr, tpr, thresholds = roc_curve(y_test, y_test_pred_proba)
        roc_auc = auc(fpr, tpr)

        # Save metrics
        pickle_metrics = {
            'model_version': f"{token}_{namestring}",
            'accuracy': accuracy,
            'precision': precision,
            'recall': recall,
            'f1_score': f1,
            'specificity': specificity,
            'roc_auc': roc_auc,
            'y_test': y_test,
            'y_train_pred': y_train_pred,
            'y_test_pred': y_test_pred,
            'y_test_pred_proba': y_test_pred_proba,
            'display_labels': [0, 1],
            'confusion_matrix': {'tn': tn, 'fp': fp, 'fn': fn, 'tp': tp},
            'roc_curve': {'fpr': fpr, 'tpr': tpr, 'thresholds': thresholds},
            'shap_data': {
                'model': model,
                'preprocessor': preprocessor,
                'X_train_processed': X_train_proc,
                'X_test_processed': X_test_proc,
                'feature_names': preprocessor.get_feature_names_out(),
                'original_feature_names': list(X_train.columns)
            }
        }

    filename = f"../models/fits_pickle_{token}_{namestring}.pkl"
    with open(filename, "wb") as file:
        pickle.dump(pickle_metrics, file)

    return pickle_metrics

In [9]:
X = lgbm_data.drop(["readmitted"], axis=1)
y = lgbm_data["readmitted"]

## Training Feature Type Classification

**Feature Type Identification**:
The preprocessing pipeline requires different handling for different data types:

**exclude_features**: Filter list ensuring ID columns (`patient_nbr`, `encounter_id`) and target variable (`readmitted`) are excluded from feature sets.

**numeric_features**: Continuous variables requiring standardization for LightGBM
- Applied to StandardScaler in ColumnTransformer
- LightGBM benefits from normalized features for optimal convergence
- Includes engineered features like `service_utilization` and medication counts

**boolean_features**: Binary indicator variables treated as categorical
- Combined with object_features for ordinal encoding
- More efficient than one-hot encoding for tree-based models
- Preserves boolean nature while making them LightGBM-compatible

**object_features**: String categorical variables requiring encoding
- Medical specialties, diagnostic groups, demographic categories
- Converted to integers via OrdinalEncoder for tree splitting
- `handle_unknown` parameter ensures robust handling of unseen categories

**categorical_features**: Index positions of all categorical columns for LightGBM
- Used to inform LightGBM which features should be treated as categorical
- Enables LightGBM's native categorical optimization algorithms
- Critical for proper handling of ordinal-encoded categorical variables

This systematic classification ensures each feature type receives appropriate preprocessing while maintaining LightGBM's performance advantages.

In [10]:
# Training features to include
exclude_features = ["patient_nbr", "encounter_id", "readmitted"]
numeric_features = [
    col
    for col in X.columns
    if col not in exclude_features and pd.api.types.is_numeric_dtype(X[col])
]
boolean_features = [
    col for col in X.columns if col not in exclude_features and X[col].dtype == "bool"
]
object_features = [
    col for col in X.columns if col not in exclude_features and X[col].dtype == "object"
]
categorical_features = [
    X.columns.get_loc(col) for col in object_features + boolean_features
]

In [11]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=randy, stratify=y
)

In [12]:
preprocessor_lgb = ColumnTransformer([
    ('num', StandardScaler(), numeric_features),
    ('cat', OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
    object_features + boolean_features)
], remainder='drop')

categorical_indices = list(range(len(numeric_features), len(numeric_features) + len(object_features) + len(boolean_features)))

In [13]:
def objective_lgb(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 100, 3000), 
        'max_depth': trial.suggest_int('max_depth', 3, 25), 
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.5, log=True),  # Log scale
        'num_leaves': trial.suggest_int('num_leaves', 10, 1024),
        'min_child_samples': trial.suggest_int('min_child_samples', 1, 100),  
        'subsample': trial.suggest_float('subsample', 0.25, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.25, 1.0),  
        'reg_alpha': trial.suggest_float('reg_alpha', 1e-8, 10.0, log=True),  # Log scale
        'reg_lambda': trial.suggest_float('reg_lambda', 1e-8, 100.0, log=True),  # Log scale
        'feature_fraction': trial.suggest_float('feature_fraction', 0.4, 1.0),
        'bagging_fraction': trial.suggest_float('bagging_fraction', 0.2, 1.0),
        'class_weight': trial.suggest_categorical('class_weight', ['balanced', None]),

        # Additional LightGBM-specific parameters Optuna can explore
        'min_split_gain': trial.suggest_float('min_split_gain', 0.0, 15.0),
        'min_child_weight': trial.suggest_float('min_child_weight', 1e-8, 10.0, log=True),
    }

    # Use lgb.cv for proper pruning support
    X_train_proc = preprocessor_lgb.fit_transform(X_train)

    # Create LightGBM dataset
    dtrain = lgb.Dataset(X_train_proc, label=y_train)

    # Use lgb.cv with pruning callback
    cv_results = lgb.cv(
        params={
            'objective': 'binary',
            'metric': 'auc',
            'verbosity': -1,
            'random_state': randy,
            **params
        },
        train_set=dtrain,
        num_boost_round=params['n_estimators'],
        nfold=5,
        stratified=True,
        shuffle=True,
        seed=randy,
        callbacks=[LightGBMPruningCallback(trial, 'auc')],
        return_cvbooster=True
    )

    # Return best CV score
    return cv_results['valid auc-mean'][-1]

In [14]:
# Callbacks for monitoring
def progress_callback(study, trial):
    if trial.number % 10 == 0:
        print(f"Trial {trial.number}: Best so far = {study.best_value:.4f}")

In [15]:
# Run optimization
cv_folds = list(KFold(n_splits=5, shuffle=True, random_state=randy).split(X_train))

In [16]:
%%time

# Set up pruner for early stopping
pruner = MedianPruner(
    n_startup_trials=10,  # Don't prune first 10 trials
    n_warmup_steps=2,     # Prune after 2 CV folds if clearly bad
    interval_steps=1      # Check after each CV fold
)

# Run Optuna optimization
# Create study with efficient sampler and pruner
study = optuna.create_study(
    direction='maximize',
    pruner=pruner,
    sampler=TPESampler(seed=randy, n_startup_trials=20)
)

study.optimize(
    objective_lgb,
    n_trials=200,
    callbacks=[progress_callback],
    show_progress_bar=False
)


# After study.optimize() completes
best_params = study.best_params
print(f"Best parameters: {best_params}")
print(f"Best AUC: {study.best_value:.4f}")
with open(f"../models/{token}_LGBM_best_params.pkl", "wb") as f:
    pickle.dump(best_params, f)

[I 2025-07-07 10:35:21,345] A new study created in memory with name: no-name-75685d12-a13f-4a37-8b38-fdf6ec48163d


[I 2025-07-07 10:45:34,381] Trial 199 pruned. Trial was pruned at iteration 2.


Best parameters: {'n_estimators': 2021, 'max_depth': 10, 'learning_rate': 0.07648565112369948, 'num_leaves': 564, 'min_child_samples': 19, 'subsample': 0.9771884708234189, 'colsample_bytree': 0.8313496175208359, 'reg_alpha': 2.8542399074977594, 'reg_lambda': 8.877148894655603, 'feature_fraction': 0.758739987286651, 'bagging_fraction': 0.9374993880184935, 'class_weight': None, 'min_split_gain': 0.678409333658071, 'min_child_weight': 8.471746987003668e-06}
Best AUC: 0.7082
CPU times: user 18min 37s, sys: 11min 54s, total: 30min 31s
Wall time: 10min 13s


## Final Model Training with Optimized Hyperparameters

**Model Configuration:**
Building the final LightGBM model using the best hyperparameters discovered through Optuna optimization:
- Core parameters: Optimized tree structure (`num_leaves`, `max_depth`) and learning dynamics (`learning_rate`, `n_estimators`)
- Regularization: Tuned `reg_alpha` and `reg_lambda` for optimal bias-variance tradeoff
- Sampling strategies: Optimized `subsample` and `feature_fraction` for robust ensemble learning
- Class handling: `class_weight` setting for imbalanced dataset management

**Pipeline Assembly:**
The final pipeline combines two essential components:
1. StandardScaler preprocessing: Normalizes numeric features for optimal LightGBM performance
2. Optimized LightGBM: Model configured with best hyperparameters from Bayesian optimization

**Training Strategy:**
- Uses the full training set (no resampling needed due to LightGBM's class handling)
- Maintains consistent random state for reproducible results
- Verbose output disabled for clean execution

This represents the production-ready model combining preprocessing optimization with hyperparameter tuning for maximum predictive performance.

In [17]:
# Load best params from disk
best_params = pd.read_pickle(f"../models/f04_LGBM_best_params.pkl")

In [18]:
# Train final model without pipeline
X_train_proc = preprocessor_lgb.fit_transform(X_train)
X_test_proc = preprocessor_lgb.transform(X_test)

lgb_final = LGBMClassifier(
    objective="binary",
    random_state=randy,
    verbose=-1,
    **best_params
)

lgb_final.fit(X_train_proc, y_train)

In [19]:
# Save trained model to disk
with open(f"../models/{token}_LGBM_final.pkl", "wb") as file:
    pickle.dump(lgb_final, file)

In [20]:
# Open a trained model
# lgb_final = pd.read_pickle("../models/f04_XGB_final.pkl")

In [21]:
evaluate_and_save_model(
    model=lgb_final,
    preprocessor=preprocessor_lgb,
    namestring='LGB',
    token=token,
    X_train=X_train,
    X_test=X_test,
    y_train=y_train,
    y_test=y_test
)

{'model_version': 'f11_LGB',
 'accuracy': 0.6547437724168427,
 'precision': 0.6429438911829002,
 'recall': 0.564332160750453,
 'f1_score': 0.601078626170877,
 'specificity': np.float64(0.7320452059788553),
 'roc_auc': np.float64(0.7134766548048208),
 'y_test': 27827    1
 84192    0
 60829    0
 84663    1
 72262    1
         ..
 46646    1
 64740    0
 9515     1
 89761    1
 16019    0
 Name: readmitted, Length: 20353, dtype: int8,
 'y_train_pred': array([1, 0, 1, ..., 0, 0, 1], dtype=int8),
 'y_test_pred': array([0, 1, 1, ..., 1, 1, 0], dtype=int8),
 'y_test_pred_proba': array([0.37212897, 0.74116775, 0.5584925 , ..., 0.70028383, 0.68268187,
        0.39955486]),
 'display_labels': [0, 1],
 'confusion_matrix': {'tn': np.int64(8032),
  'fp': np.int64(2940),
  'fn': np.int64(4087),
  'tp': np.int64(5294)},
 'roc_curve': {'fpr': array([0.        , 0.        , 0.        , ..., 0.97101713, 0.97101713,
         1.        ]),
  'tpr': array([0.00000000e+00, 1.06598444e-04, 1.38577977e-03,