In [1]:
import multiprocessing
print(multiprocessing.cpu_count())

import psutil
print(f"Available memory before training: {psutil.virtual_memory().available / 1e9:.2f} GB")

10
Available memory before training: 6.65 GB


# Diabetes Readmission – Random Forest Classification

## Introduction

This notebook implements Random Forest classification for predicting hospital readmission within 30 days for diabetic patients. We use the same preprocessed dataset as other tree-based methods, optimized for ensemble learning, which includes:

- **Full dataset**: All encounters retained (101,763 records), as ensemble methods can effectively handle correlated observations
- **Binary and count features**: ICD-9 diagnostic codes expanded into both indicator variables and count-based features
- **Ordinal encoding**: Categorical variables encoded as integers for optimal tree-based learning
- **Raw numeric features**: No scaling applied as Random Forest is invariant to monotonic transformations

## Methodology

**Bootstrap Aggregating (Bagging)**: Random Forest builds multiple decision trees on bootstrap samples of the training data, then averages their predictions. This approach provides:
- **Variance reduction**: Multiple trees reduce overfitting compared to single decision trees
- **Feature randomness**: Each split considers only a random subset of features, increasing diversity
- **Natural regularization**: Built-in protection against overfitting through ensemble averaging

**Class Imbalance Handling**: Using the `class_weight='balanced'` parameter to automatically adjust for class imbalance without requiring synthetic sampling techniques.

**Hyperparameter Optimization**: Using Optuna's intelligent search across 8 key parameters including:
- **Tree structure**: `n_estimators`, `max_depth`, `max_features`
- **Splitting criteria**: `min_samples_split`, `min_samples_leaf`, `min_weight_fraction_leaf`
- **Regularization**: `min_impurity_decrease`, `class_weight`

**Preprocessing Pipeline**: Ordinal encoding for categorical features while preserving numeric features as-is, maintaining the original ~147 features without expansion.

The goal is to leverage Random Forest's robustness and interpretability while achieving strong predictive performance through proper ensemble construction and hyperparameter optimization.

In [2]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import pickle
import time

In [3]:
token = 'f11' # iteratable by the user as we try new things
randy = 42 # random value insertion for repeatability
rf_data = pd.read_pickle("../models/randomForests.pkl") # See prior notebook, p02.

In [4]:
rf_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 101763 entries, 0 to 101765
Columns: 147 entries, encounter_id to count_E990_E999
dtypes: bool(4), float64(6), int64(115), object(22)
memory usage: 112.2+ MB


## Memory Optimization

The `optimize_dtypes()` function reduces memory usage by downcasting numeric types to their smallest sufficient representation:
- `int64` → `int8/int16/int32` based on value ranges
- `float64` → `float32` when precision allows

This optimization is particularly valuable for large datasets and memory-intensive operations like SMOTE resampling.

In [5]:
def optimize_dtypes(df):
    
    """
    Here we convert some of our columns intelligently to save on memory & time
    """
    
    for col in df.columns:
        col_type = df[col].dtype

        if col_type == 'int64':
            c_min = df[col].min()
            c_max = df[col].max()

            if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                df[col] = df[col].astype(np.int8)
            elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                df[col] = df[col].astype(np.int16)
            elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                df[col] = df[col].astype(np.int32)

        elif col_type == 'float64':
            c_min = df[col].min()
            c_max = df[col].max()

            if c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                df[col] = df[col].astype(np.float32)

    return df

In [6]:
rf_data = optimize_dtypes(rf_data)
rf_data.info() # ~90mb RAM savings

<class 'pandas.core.frame.DataFrame'>
Index: 101763 entries, 0 to 101765
Columns: 147 entries, encounter_id to count_E990_E999
dtypes: bool(4), float32(6), int16(1), int32(2), int8(112), object(22)
memory usage: 32.4+ MB


In [7]:
from sklearn.ensemble import RandomForestClassifier

import optuna
from optuna.pruners import MedianPruner
from optuna.samplers import TPESampler

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split, StratifiedKFold

from sklearn.pipeline import Pipeline

from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix, roc_curve, auc
from sklearn.metrics import precision_score, recall_score, f1_score

  from .autonotebook import tqdm as notebook_tqdm


## Model Evaluation and Persistence Function

The `evaluate_and_save_pipeline()` function provides standardized evaluation across all modeling approaches in this project:

**Comprehensive Metrics Calculation:**
- **Classification performance**: Accuracy, precision, recall, F1-score, specificity
- **Probability-based metrics**: ROC curve data and AUC score for threshold optimization
- **Confusion matrix**: True/false positive/negative counts for detailed performance analysis
- **Prediction arrays**: Both binary predictions and probability scores for ensemble building

**Standardized Output Format:**
All metrics are saved in identical pickle format enabling:
- Direct performance comparison across different model types
- Consistent evaluation methodology regardless of underlying algorithm
- Easy integration into ensemble methods and model selection workflows
- Reproducible results with preserved prediction arrays

**Model Persistence:**
Trained pipelines are saved with preprocessing steps intact, ensuring deployment-ready models that can handle new data with the same
transformations applied during training.

This standardization is critical for fair model comparison and supports the ensemble modeling approach in later notebooks.

In [8]:
def evaluate_and_save_pipeline(pipeline, namestring, token, 
                                X_train, X_test, 
                                y_train, y_test,
                                console_out = False):
    """
    Evaluates a trained pipeline and saves metrics to a pickle file.
    """

    # Input validation
    if any(v is None for v in [X_train, X_test, y_train, y_test]):
        raise ValueError("X_train, X_test, y_train, or y_test must not be None.")

    # Convert to numpy if needed
    y_train = y_train.values if hasattr(y_train, "values") else y_train
    y_test = y_test.values if hasattr(y_test, "values") else y_test

    # Make predictions once
    y_train_pred = pipeline.predict(X_train)
    y_test_pred = pipeline.predict(X_test)

    # Get probability predictions
    if hasattr(pipeline, "predict_proba"):
        y_test_pred_pct = pipeline.predict_proba(X_test)[:, 1]
    elif hasattr(pipeline, "decision_function"):
        y_test_pred_pct = pipeline.decision_function(X_test)
    else:
        raise AttributeError("Pipeline needs predict_proba() or decision_function() for ROC/AUC.")

    # Classification metrics (not regression metrics)
    accuracy = pipeline.score(X_test, y_test)
    precision = precision_score(y_test, y_test_pred)
    recall = recall_score(y_test, y_test_pred)  # Same as sensitivity
    f1 = f1_score(y_test, y_test_pred)

    # Confusion matrix metrics
    tn, fp, fn, tp = confusion_matrix(y_test, y_test_pred).ravel()
    specificity = tn / (tn + fp) if (tn + fp) > 0 else 0

    # ROC curve
    fpr, tpr, thresholds = roc_curve(y_test, y_test_pred_pct)
    roc_auc = auc(fpr, tpr)

    # Safe access to classes
    classes_ = getattr(pipeline, 'classes_', np.unique(y_train))

    # Save metrics
    pickle_metrics = {
        'model_version': f"{token}_{namestring}",
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1_score': f1,
        'specificity': specificity,
        'roc_auc': roc_auc,
        'y_test': y_test,
        'y_train_pred': y_train_pred,
        'y_test_pred': y_test_pred,
        'y_test_pred_proba': y_test_pred_pct,
        'display_labels': classes_,
        'confusion_matrix': {'tn': tn, 'fp': fp, 'fn': fn, 'tp': tp},
        'roc_curve': {'fpr': fpr, 'tpr': tpr, 'thresholds': thresholds},
        # SHAP-specific additions
        'shap_data': {
            'model': pipeline,
            'X_train_processed': pipeline.named_steps['preprocessor'].transform(X_train),
            'X_test_processed': pipeline.named_steps['preprocessor'].transform(X_test),
            'feature_names': pipeline.named_steps['preprocessor'].get_feature_names_out(),
            'original_feature_names': list(X_train.columns)
        }
    }

    # Save to file
    filename = f"../models/fits_pickle_{token}_{namestring}.pkl"
    with open(filename, "wb") as file:
        pickle.dump(pickle_metrics, file)

    if console_out:
        # Print summary
        print(f"Metrics saved to {filename}")
        print(f'Accuracy:    {accuracy:.4f}')
        print(f'Precision:   {precision:.4f}')
        print(f'Recall:      {recall:.4f}')
        print(f'F1-Score:    {f1:.4f}')
        print(f'Specificity: {specificity:.4f}')
        print(f'ROC AUC:     {roc_auc:.4f}')

        # Plot confusion matrix
        cm = confusion_matrix(y_test, y_test_pred)
        disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=classes_)
        disp.plot(cmap=plt.cm.Blues)
        plt.title(f"Confusion Matrix - {namestring}")
        plt.show()

    return pickle_metrics

In [9]:
X = rf_data.drop(["readmitted"], axis=1)
y = rf_data["readmitted"]

## Training Feature Type Classification

**Feature Type Identification**:
Different data types require specific preprocessing for Random Forest:

**exclude_features**: Filters out ID columns and target variable from feature sets.

**numeric_features**: Continuous variables with passthrough (no scaling needed for tree-based models)

**boolean_features** + **object_features**: Combined for ordinal encoding
- More efficient than one-hot encoding for Random Forest
- Converted to integers for optimal tree splitting

**categorical_features**: Index positions tracking categorical columns for consistency with other tree-based implementations.

This classification ensures appropriate preprocessing while maintaining Random Forest's robustness advantages.

In [10]:
# Training features to include
exclude_features = ["patient_nbr", "encounter_id", "readmitted"]
numeric_features = [
    col
    for col in X.columns
    if col not in exclude_features and pd.api.types.is_numeric_dtype(X[col])
]
boolean_features = [
    col for col in X.columns if col not in exclude_features and X[col].dtype == "bool"
]
object_features = [
    col for col in X.columns if col not in exclude_features and X[col].dtype == "object"
]
categorical_features = [
    X.columns.get_loc(col) for col in object_features + boolean_features
]

In [11]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=randy, stratify=y
)

## Preprocessing Strategy for Random Forest

**Categorical Encoding**: OrdinalEncoder converts categorical variables to integers, which is optimal for tree-based models like Random Forest that make binary splits on feature values.

**Numeric Passthrough**: Numeric features are left unchanged using `remainder='passthrough'` since Random Forest is invariant to scaling and can handle different value ranges naturally.

This minimal preprocessing approach preserves the original data characteristics while ensuring compatibility with Random Forest's tree-splitting algorithms.

In [12]:
preprocessor_rf = ColumnTransformer([
    ('cat', OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
    object_features + boolean_features)
], remainder='passthrough')  # Keep numeric features as-is

categorical_indices = list(range(len(numeric_features), len(numeric_features) + len(object_features) + len(boolean_features)))

## Manual Cross-Validation with Pruning

**Manual fold iteration** enables Optuna's pruning mechanism:
- Early stopping: Poor trials abort after 2-3 folds via `trial.should_prune()`
- Efficiency: Saves ~60% computation by stopping bad hyperparameter combinations
- Consistency: Pre-generated folds ensure fair comparison across trials

**Parameter Search Strategy**:
- Tree structure: `n_estimators` (100-800), `max_depth` (10-50) control ensemble size and complexity
- Split quality: `min_samples_split/leaf` prevent overfitting through minimum sample requirements
- Feature sampling: `max_features` ('sqrt', 'log2') controls randomness in feature selection
- Class handling: `class_weight` ('balanced', None) addresses imbalanced dataset

This combines rigorous CV with intelligent early stopping for efficient Random Forest optimization.

In [13]:
def objective_rf(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 100, 1500),
        'max_depth': trial.suggest_int('max_depth', 10, 100),
        'min_samples_split': trial.suggest_float('min_samples_split', 0.005, 0.2),
        'min_samples_leaf': trial.suggest_float('min_samples_leaf', 0.0001, 0.02),
        'min_weight_fraction_leaf': trial.suggest_float('min_weight_fraction_leaf', 0.0, 0.3),
        'min_impurity_decrease': trial.suggest_float('min_impurity_decrease', 0.0, 0.02),
        'max_features': trial.suggest_categorical('max_features', ['sqrt', 'log2']),
        'class_weight': trial.suggest_categorical('class_weight', ['balanced', None]),
        'max_leaf_nodes': trial.suggest_int('max_leaf_nodes', 10, 1000),
    }

    model = RandomForestClassifier(
        criterion="gini",
        random_state=randy,
        n_jobs=-1,
        verbose=0,
        **params
    )

    pipeline = Pipeline([
        ('preprocessor', preprocessor_rf),
        ('model', model)
    ])

    # Manual CV with pruning support
    cv_folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=randy)
    scores = []

    ## Stratified needs both x and y
    for fold_idx, (train_idx, val_idx) in enumerate(cv_folds.split(X_train, y_train)):
        X_fold_train, X_fold_val = X_train.iloc[train_idx], X_train.iloc[val_idx]
        y_fold_train, y_fold_val = y_train.iloc[train_idx], y_train.iloc[val_idx]

        pipeline.fit(X_fold_train, y_fold_train)
        val_proba = pipeline.predict_proba(X_fold_val)[:, 1]

        # Calculate ROC-AUC for this fold
        from sklearn.metrics import roc_auc_score
        fold_score = roc_auc_score(y_fold_val, val_proba)
        scores.append(fold_score)

        # Report intermediate score for pruning
        trial.report(np.mean(scores), fold_idx)

        # Prune if trial looks bad
        if trial.should_prune():
            raise optuna.TrialPruned()

    return np.mean(scores)

In [14]:
# Callbacks for monitoring
def progress_callback(study, trial):
    if trial.number % 25 == 0:
        print(f"Trial {trial.number}: Best ROC-AUC = {study.best_value:.4f}")
        print(f"Best params so far: {study.best_params}")

def save_best_callback(study, trial):
    if study.best_trial == trial:
        with open(f"../models/{token}_RF_best_params.pkl", "wb") as f:
            pickle.dump(study.best_params, f)

In [15]:
# Set up pruner and sampler for efficiency
pruner = MedianPruner(
    n_startup_trials=15,  # Don't prune first 15 trials
    n_warmup_steps=2,     # Prune after 2 CV folds if clearly bad
    interval_steps=1      # Check after each CV fold
)

study = optuna.create_study(
    direction='maximize',
    pruner=pruner,
    sampler=TPESampler(seed=randy, n_startup_trials=25)
)

[I 2025-07-07 10:45:41,163] A new study created in memory with name: no-name-75e53310-3708-48fe-a485-958bbd3385f1


In [16]:
%%time

# Run optimization

study.optimize(
    objective_rf,
    n_trials=200,
    callbacks=[progress_callback, save_best_callback],
    # show_progress_bar=True
)

print(f"\nOptimization completed!")
best_params = study.best_params
print(f"Best ROC-AUC: {study.best_value:.4f}")
print(f"Best parameters: {study.best_params}")

[I 2025-07-07 10:45:47,853] Trial 0 finished with value: 0.6562182418950282 and parameters: {'n_estimators': 624, 'max_depth': 96, 'min_samples_split': 0.147738818653224, 'min_samples_leaf': 0.012013303835521029, 'min_weight_fraction_leaf': 0.04680559213273095, 'min_impurity_decrease': 0.003119890406724053, 'max_features': 'log2', 'class_weight': None, 'max_leaf_nodes': 30}. Best is trial 0 with value: 0.6562182418950282.


Trial 0: Best ROC-AUC = 0.6562
Best params so far: {'n_estimators': 624, 'max_depth': 96, 'min_samples_split': 0.147738818653224, 'min_samples_leaf': 0.012013303835521029, 'min_weight_fraction_leaf': 0.04680559213273095, 'min_impurity_decrease': 0.003119890406724053, 'max_features': 'log2', 'class_weight': None, 'max_leaf_nodes': 30}


[I 2025-07-07 10:46:04,103] Trial 1 finished with value: 0.6542989360426839 and parameters: {'n_estimators': 1458, 'max_depth': 85, 'min_samples_split': 0.046406126582263854, 'min_samples_leaf': 0.0037183168474213026, 'min_weight_fraction_leaf': 0.055021352956030146, 'min_impurity_decrease': 0.006084844859190755, 'max_features': 'sqrt', 'class_weight': None, 'max_leaf_nodes': 148}. Best is trial 0 with value: 0.6562182418950282.


[I 2025-07-07 11:20:01,938] Trial 199 finished with value: 0.6931355656742044 and parameters: {'n_estimators': 1029, 'max_depth': 85, 'min_samples_split': 0.019413522587478298, 'min_samples_leaf': 0.007621229753178505, 'min_weight_fraction_leaf': 0.011299003840782813, 'min_impurity_decrease': 1.4440952024140954e-05, 'max_features': 'sqrt', 'class_weight': 'balanced', 'max_leaf_nodes': 794}. Best is trial 199 with value: 0.6931355656742044.



Optimization completed!
Best ROC-AUC: 0.6931
Best parameters: {'n_estimators': 1029, 'max_depth': 85, 'min_samples_split': 0.019413522587478298, 'min_samples_leaf': 0.007621229753178505, 'min_weight_fraction_leaf': 0.011299003840782813, 'min_impurity_decrease': 1.4440952024140954e-05, 'max_features': 'sqrt', 'class_weight': 'balanced', 'max_leaf_nodes': 794}
CPU times: user 2h 32min 58s, sys: 13min 51s, total: 2h 46min 49s
Wall time: 34min 20s


In [17]:
# Create final model with best parameters
rf_final_model = RandomForestClassifier(
    criterion="gini",
    random_state=randy,
    **best_params  # Unpack the best parameters
)

# Create final pipeline
rf_final = Pipeline(steps=[
    ('preprocessor', preprocessor_rf),
    ('model', rf_final_model)
])

rf_final.fit(X_train, y_train)

filename=f"../models/{token}_RF_final.pkl"
with open(filename, "wb") as file:
    pickle.dump(rf_final, file)
print(f"Model saved as {filename}")

Model saved as ../models/f11_RF_final.pkl


In [18]:
# Load model from disk 
# rf_final = pd.read_pickle("../models/f04_RF_final.pkl")

In [19]:
evaluate_and_save_pipeline(
    pipeline=rf_final, 
    namestring='Random Forest',
    token=token, 
    X_train=X_train, 
    X_test=X_test, 
    y_train=y_train, 
    y_test=y_test)

{'model_version': 'f11_Random Forest',
 'accuracy': 0.6401021962364271,
 'precision': 0.6129670329670329,
 'recall': 0.5946061187506663,
 'f1_score': 0.6036469887993074,
 'specificity': np.float64(0.6790010936930369),
 'roc_auc': np.float64(0.69666521944609),
 'y_test': array([1, 0, 0, ..., 1, 1, 0], dtype=int8),
 'y_train_pred': array([1, 0, 1, ..., 0, 0, 1], dtype=int8),
 'y_test_pred': array([0, 1, 1, ..., 1, 1, 0], dtype=int8),
 'y_test_pred_proba': array([0.49610255, 0.65781136, 0.56962064, ..., 0.59112288, 0.61456102,
        0.40604344]),
 'display_labels': array([0, 1], dtype=int8),
 'confusion_matrix': {'tn': np.int64(7450),
  'fp': np.int64(3522),
  'fn': np.int64(3803),
  'tp': np.int64(5578)},
 'roc_curve': {'fpr': array([0.        , 0.        , 0.        , ..., 0.99835946, 0.99835946,
         1.        ]),
  'tpr': array([0.00000000e+00, 1.06598444e-04, 1.70557510e-03, ...,
         9.99893402e-01, 1.00000000e+00, 1.00000000e+00]),
  'thresholds': array([       inf, 0.730