# Building Predictive Models with Decision Trees and Random Forests


Time estimate: **20** minutes




## Objectives

After completing this lab, you will be able to:

- Understand decision tree and random forest algorithms and when to use them.
- Prepare tabular clinical data for tree-based models, including encoding and handling missing values.
- Train, tune, and evaluate decision trees and random forests using scikit-learn.
- Interpret model outputs: feature importance, partial dependence, and tree visualization.
- Assess model performance with cross-validation, ROC/AUC, precision/recall, and calibration.
- Communicate model limitations and suggestions for clinical deployment.


## What you will do in this lab

- Use a simulated clinical dataset for predicting a binary outcome (e.g., deterioration/readmission).
- Train a decision tree and a random forest (RF) classifier.
- Tune hyperparameters using grid search with cross-validation.
- Inspect feature importances and visualize a sample tree.
- Evaluate models on a holdout set and compare metrics (ROC, AUC, precision/recall).
- Use partial dependence plots for interpretation.
- Complete 7 consolidated exercises with hints & solutions at the end of the lab.


## Overview

Decision trees are intuitive, rule-based models that split features to predict outcomes. Random forests combine many trees to reduce variance and improve generalization. In clinical settings, tree-based models are popular for their interpretability and ability to handle mixed data types. This lab covers the end-to-end workflow: simulation, preprocessing, training, tuning, interpretation, and evaluation with practical advice for clinical use.


## About the dataset/environment

You will use a simulated dataset with patient demographics, vitals summaries, and simple lab flags to predict a binary clinical outcome 'event' (e.g., deterioration within 48 hours). The dataset includes numeric and categorical features, missing values, and class imbalance. Tools: Python (pandas, numpy, scikit-learn, matplotlib, seaborn, joblib).


## Setup

Run the following cells to install the required libraries (if needed) and import the tools used for data preparation, model training, evaluation, and interpretation. The install cell is mainly useful in hosted environments such as Google Colab, where it keeps the setup consistent.


In [None]:
# Install required libraries quietly (useful in Colab environments)
!pip -q install numpy pandas scikit-learn matplotlib seaborn joblib

In [None]:
# Import numerical computing library
import numpy as np

# Import data manipulation library
import pandas as pd

# Import visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Import model selection utilities
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score

# Import tree-based models
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier

# Import evaluation metrics
from sklearn.metrics import (
    roc_auc_score, roc_curve,
    confusion_matrix, classification_report,
    precision_recall_curve, average_precision_score
)

# Import preprocessing utilities
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer

# Import pipeline tools
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# Import model interpretation tools
from sklearn.inspection import partial_dependence, PartialDependenceDisplay

# Import utility for saving models
import joblib

## Load the dataset

Run this code cell to load the clinical deterioration dataset directly from a remote URL and preview its structure by displaying the first few rows. This confirms that the data has been loaded correctly before analysis.

In [None]:
# Load the dataset from a remote URL
df = pd.read_csv('https://statistical-analysis-and-data-modeling-in-healthcare-02a43a.gitlab.io/labs/lab5/deterioration_data.csv')

# Display the first five rows of the dataset
df.head()


## Step 1: Inspect and preprocess the data

Before modeling, inspect the dataset by reviewing its size, data types, summary statistics, outcome balance, and missing values. This helps surface preprocessing requirements and potential modeling risks early.

In [None]:
# Print number of rows and columns
print('Rows, Columns:', df.shape)

In [None]:
# Display data types of all columns
print(df.dtypes)

In [None]:
# Show descriptive statistics for numeric variables
df.describe().T

In [None]:
# Display class counts for the outcome variable
print('\nEvent counts:\n', df['event'].value_counts())

In [None]:
# Check for missing values in each column
print(df.isna().sum())


## Step 2: Define preprocessing and pipeline

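scikit-learn's tree-based estimators need numeric inputs, so categorical variables must be encoded first. Recent scikit-learn releases add native missing-value handling to some tree estimators, but explicit imputation keeps the pipeline portable across versions. To prepare the data, you'll impute missing numeric values using the median, one-hot encode categorical variables, and combine all preprocessing steps into a single pipeline.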

In [None]:
# Define numeric feature columns
numeric_features = ['age','hr_mean','sbp_mean','comorbidity_count','prior_adm']

In [None]:
# Create numeric preprocessing pipeline with median imputation
numeric_transformer = Pipeline(
    steps=[
        ('imputer', SimpleImputer(strategy='median'))
    ]
)

In [None]:
# Define categorical feature columns
categorical_features = ['sex','lab_flag']

In [None]:
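# Create categorical preprocessing pipeline with one-hot encoding
# handle_unknown='ignore' encodes categories unseen during fitting as all zeros
# instead of raising an error (relevant for small cross-validation folds)
categorical_transformer = Pipeline(
    steps=[
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ]
)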

In [None]:
# Combine numeric and categorical preprocessing
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

In [None]:
# Separate predictors and outcome
X = df.drop(columns=['patient_id','event'])
y = df['event']

In [None]:
# Split data into training and test sets with stratification
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=42
)

In [None]:
# Print training data shape
print('Training shape:', X_train.shape)

## Step 3: Train decision tree baseline

Now, let's train a baseline decision tree to establish a simple reference model, visualize the learned decision rules, and measure initial model performance.

In [None]:
# Create a pipeline with preprocessing and decision tree classifier
dt_pipeline = Pipeline(
    steps=[
        ('preprocessor', preprocessor),
        ('clf', DecisionTreeClassifier(random_state=42))
    ]
)


In [None]:
# Train the decision tree model
dt_pipeline.fit(X_train, y_train)

In [None]:
# Predict probabilities for the positive class
y_prob_dt = dt_pipeline.predict_proba(X_test)[:,1]

In [None]:
# Compute and print AUC score
print('Decision Tree AUC:', round(roc_auc_score(y_test, y_prob_dt),3))
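
A baseline tree with no depth limit keeps splitting until its leaves are pure, so it can memorize the training data. The sketch below illustrates this on synthetic data from `make_classification` (not the lab dataset; all settings are arbitrary) by comparing training and test AUC for an unlimited tree and a depth-limited one.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced binary problem with some label noise (flip_y)
X_demo, y_demo = make_classification(
    n_samples=800, n_features=8, n_informative=4,
    weights=[0.8, 0.2], flip_y=0.1, random_state=1
)
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, stratify=y_demo, test_size=0.25, random_state=1
)

# Fit an unlimited tree and a shallow tree, recording train and test AUC
results = {}
for label, depth in [('unlimited', None), ('max_depth=3', 3)]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=1).fit(Xtr, ytr)
    results[label] = {
        'train': roc_auc_score(ytr, tree.predict_proba(Xtr)[:, 1]),
        'test': roc_auc_score(yte, tree.predict_proba(Xte)[:, 1]),
    }
    print(label, {k: round(v, 3) for k, v in results[label].items()})
```

A large gap between training and test AUC is a sign of overfitting, which is the motivation for limiting depth or moving to an ensemble.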

## Visualize the decision tree

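Run this code cell to visualize the structure of the trained decision tree. Setting `max_depth=3` in `plot_tree` limits only the drawing to the top levels of the tree; the fitted tree itself is unchanged and may be much deeper. Each node shows the feature and split rule along with the class distribution at that node, and the fill color indicates the majority class, helping you see how the model arrives at a clinical risk prediction.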

In [None]:
# Create a large figure for readability
plt.figure(figsize=(20,10))

# Extract numeric feature names
num_feats = list(preprocessor.transformers_[0][2])

# Extract one-hot encoded feature names
ohe_feats = list(
    dt_pipeline.named_steps['preprocessor']
    .named_transformers_['cat']
    .named_steps['onehot']
    .get_feature_names_out(['sex','lab_flag'])
)

# Combine all feature names
feature_names = num_feats + ohe_feats

# Plot a shallow version of the decision tree
plot_tree(
    dt_pipeline.named_steps['clf'],
    feature_names=feature_names,
    max_depth=3,
    filled=True,
    fontsize=8
)

# Display the plot
plt.show()
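
When a plot is hard to read or share, the same rules can be printed as text. This is a small sketch using scikit-learn's `export_text` on synthetic data; the feature names `feat_a` to `feat_d` are hypothetical placeholders, not columns from the lab dataset.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Small synthetic problem with four features
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
demo_names = ['feat_a', 'feat_b', 'feat_c', 'feat_d']  # hypothetical names

# Fit a deliberately shallow tree so the rule list stays short
small_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_demo, y_demo)

# Render the split rules as indented text
rules = export_text(small_tree, feature_names=demo_names)
print(rules)
```

Each indented line is a split condition, and each `class:` line is the prediction at a leaf.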


## Step 4: Train random forest and evaluate

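Random forests reduce overfitting by averaging predictions from many decision trees, each trained on a bootstrap sample of the data with a random subset of features considered at each split. Let's train a baseline forest and examine its impurity-based feature importances to understand which variables drive predictions.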

In [None]:
# Create a pipeline with preprocessing and random forest classifier
rf_pipeline = Pipeline(
    steps=[
        ('preprocessor', preprocessor),
        ('clf', RandomForestClassifier(
            n_estimators=100,
            random_state=42,
            n_jobs=-1
        ))
    ]
)


In [None]:

# Train the random forest model
rf_pipeline.fit(X_train, y_train)

# Predict probabilities for the positive class
y_prob_rf = rf_pipeline.predict_proba(X_test)[:,1]

# Compute and print AUC score
print('Random Forest AUC:', round(roc_auc_score(y_test, y_prob_rf),3))
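
A single train/test split gives only one estimate of performance. As a sketch of how cross-validation can compare a single tree with a forest, the example below uses synthetic data (not the lab dataset) and 5-fold stratified cross-validation; the fold count and data settings are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, mildly imbalanced binary problem
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=4,
    weights=[0.8, 0.2], flip_y=0.05, random_state=0
)

# Stratified folds keep the event rate similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

tree_auc = cross_val_score(
    DecisionTreeClassifier(random_state=0),
    X_demo, y_demo, cv=cv, scoring='roc_auc'
)
forest_auc = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=0),
    X_demo, y_demo, cv=cv, scoring='roc_auc'
)

print('Single tree AUC per fold:  ', tree_auc.round(3))
print('Random forest AUC per fold:', forest_auc.round(3))
print('Mean AUC (tree, forest):', round(tree_auc.mean(), 3), round(forest_auc.mean(), 3))
```

The spread of the per-fold scores is as informative as the mean: an averaged ensemble typically varies less from fold to fold than a single deep tree.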

In [None]:
# Collect feature names in the order produced by the preprocessor
num_feats = list(preprocessor.transformers_[0][2])
ohe = (
    rf_pipeline.named_steps['preprocessor']
    .named_transformers_['cat']
    .named_steps['onehot']
)
cat_feats = list(ohe.get_feature_names_out(['sex','lab_flag']))
feature_names = num_feats + cat_feats

# Pair each feature with its impurity-based importance and rank them
importances = rf_pipeline.named_steps['clf'].feature_importances_
feat_imp = pd.DataFrame(
    {'feature': feature_names, 'importance': importances}
).sort_values('importance', ascending=False)

# Show the ten most important features
feat_imp.head(10)
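
The `feature_importances_` attribute is impurity-based and is computed from the training data, so it can favor features with many distinct values. A common complement is permutation importance measured on held-out data. The following is a sketch on synthetic data with placeholder feature names `x0` to `x5`, not a result for the lab dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: 3 informative features and 3 pure-noise features
X_demo, y_demo = make_classification(
    n_samples=600, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
demo_names = [f'x{i}' for i in range(6)]  # placeholder names
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, stratify=y_demo, test_size=0.25, random_state=0
)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(Xtr, ytr)

# Shuffle each feature on the test set and record the drop in AUC
perm = permutation_importance(
    rf, Xte, yte, scoring='roc_auc', n_repeats=10, random_state=0
)

comparison = pd.DataFrame({
    'feature': demo_names,
    'impurity_importance': rf.feature_importances_,
    'permutation_importance': perm.importances_mean,
}).sort_values('permutation_importance', ascending=False)
print(comparison)
```

Impurity-based importances always sum to 1, while permutation importances are expressed in units of the chosen score and can be close to zero or slightly negative for uninformative features.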

## Step 5: Hyperparameter tuning with GridSearchCV

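Run these code cells to tune key random forest hyperparameters (number of trees, maximum tree depth, and maximum features per split) using 3-fold cross-validation scored by ROC AUC. The grid below has 2 × 3 × 2 = 12 combinations, so the search fits 36 models plus one final refit of the best configuration on the full training set.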

In [None]:
# Define grid of hyperparameters to search
param_grid = {
    'clf__n_estimators': [100, 200],
    'clf__max_depth': [None, 6, 12],
    'clf__max_features': ['sqrt', 0.5]
}


In [None]:
# Create GridSearchCV object
grid = GridSearchCV(
    rf_pipeline,
    param_grid,
    cv=3,
    scoring='roc_auc',
    n_jobs=-1
)

In [None]:
# Fit grid search on training data
grid.fit(X_train, y_train)


In [None]:
# Print best hyperparameters
print('Best params:', grid.best_params_)

In [None]:
# Extract best model
best_rf = grid.best_estimator_


In [None]:
# Predict probabilities with tuned model
y_prob_best = best_rf.predict_proba(X_test)[:,1]


In [None]:
# Compute and print AUC
print('Best RF AUC:', round(roc_auc_score(y_test, y_prob_best),3))
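
Beyond `best_params_`, the fitted search object stores the score of every combination in `cv_results_`, which helps judge whether the winner is clearly better or just within noise. Below is a minimal sketch on synthetic data with a deliberately tiny grid; the grid values are arbitrary and differ from the lab's `param_grid`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=0
)

# Tiny illustrative grid: 2 x 2 = 4 combinations
demo_grid = {'n_estimators': [20, 40], 'max_depth': [None, 4]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    demo_grid, cv=3, scoring='roc_auc'
)
search.fit(X_demo, y_demo)

# One row per combination, ranked by mean cross-validated AUC
results = (
    pd.DataFrame(search.cv_results_)
    [['params', 'mean_test_score', 'std_test_score', 'rank_test_score']]
    .sort_values('rank_test_score')
)
print(results.to_string(index=False))
```

If the top few rows differ by less than their `std_test_score`, the simpler or cheaper configuration is often a reasonable choice.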

## Step 6: Interpretation - Feature importance and partial dependence

Run this code cell to explore how input features influence model predictions using a partial dependence plot (PDP). A PDP shows the average predicted outcome as one feature is varied across its range while the remaining features keep their observed values and are averaged over, helping you interpret the model's behavior. Note that the plot describes the model, not a causal effect.

In [None]:
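# Select the top 3 most important features (ranked in Step 4)
top_feats = feat_imp['feature'].head(3).tolist()

# Map the feature names to column positions in the transformed matrix
top_idx = [feature_names.index(f) for f in top_feats]

# Transform the training data with the tuned pipeline's preprocessor
X_train_t = best_rf.named_steps['preprocessor'].transform(X_train)

# Create a figure for PDP
fig, ax = plt.subplots(figsize=(12,4))

# Generate partial dependence plots for the top features
PartialDependenceDisplay.from_estimator(
    best_rf.named_steps['clf'],
    X_train_t,
    features=top_idx,
    feature_names=feature_names,
    ax=ax
)

# Display the plot
plt.show()

# Note: This is a simplified PDP demonstration for teaching purposes.
# feat_imp ranks features using the baseline forest; the tuned model's ranking may differ.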
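
To see what a partial dependence curve actually computes, the sketch below builds one by hand on synthetic data: for each grid value, the chosen feature is overwritten in every row and the predicted probabilities are averaged. The grid of ten quantiles is an arbitrary choice for illustration and is not necessarily the grid scikit-learn would use.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(
    n_samples=400, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

feature_idx = 0  # column whose effect we want to trace

# Evaluate the curve at ten quantiles of the feature (5th to 95th percentile)
grid_values = np.quantile(X_demo[:, feature_idx], np.linspace(0.05, 0.95, 10))

pd_values = []
for v in grid_values:
    X_mod = X_demo.copy()
    X_mod[:, feature_idx] = v  # force every row to this value
    # Average the predicted event probability over all rows
    pd_values.append(rf.predict_proba(X_mod)[:, 1].mean())
pd_values = np.array(pd_values)

for v, p in zip(grid_values, pd_values):
    print(f'feature value {v:7.3f} -> mean predicted probability {p:.3f}')
```

Because the other columns keep their observed values, correlated features can produce unrealistic combinations in `X_mod`, which is one reason PDPs should be read with care.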


## Step 7: Evaluation, calibration, and model selection

Run this code cell to compare the predictive performance of the baseline and tuned random forest models using AUC. You will then evaluate how well the tuned model’s predicted probabilities align with observed outcomes by generating a calibration curve.

In [None]:
# Compare baseline and tuned Random Forest performance
print('Baseline RF AUC:', round(roc_auc_score(y_test, y_prob_rf),3))
print('Tuned RF AUC:', round(roc_auc_score(y_test, y_prob_best),3))

In [None]:
# Import calibration function
from sklearn.calibration import calibration_curve

# Compute calibration curve values
prob_true, prob_pred = calibration_curve(
    y_test, y_prob_best, n_bins=10
)

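# Plot calibration curve against the diagonal (perfect calibration)
plt.figure(figsize=(6,5))
plt.plot(prob_pred, prob_true, marker='o')
plt.plot([0,1],[0,1],'k--')
plt.xlabel('Mean predicted probability')
plt.ylabel('Observed event fraction')
plt.title('Calibration plot (tuned RF)')
plt.show()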
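
With an imbalanced outcome, AUC alone can look reassuring while precision at a usable threshold is low. The sketch below, on synthetic data rather than the lab dataset, computes average precision, the Brier score, and the highest threshold that still reaches a target recall. The target of 0.80 is a hypothetical value chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    average_precision_score, brier_score_loss, precision_recall_curve
)
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 15% positives
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, n_informative=4,
    weights=[0.85, 0.15], flip_y=0.05, random_state=0
)
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, stratify=y_demo, test_size=0.25, random_state=0
)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(Xtr, ytr)
probs = rf.predict_proba(Xte)[:, 1]

# Precision/recall at every candidate threshold
precision, recall, thresholds = precision_recall_curve(yte, probs)
ap = average_precision_score(yte, probs)   # summary of the PR curve
brier = brier_score_loss(yte, probs)       # mean squared error of probabilities
prevalence = yte.mean()                    # average precision of a random ranking

# Highest threshold that still achieves the target recall
target_recall = 0.80  # hypothetical clinical requirement
ok = np.where(recall[:-1] >= target_recall)[0]
best = ok[-1]

print('Prevalence:', round(prevalence, 3))
print('Average precision:', round(ap, 3))
print('Brier score:', round(brier, 3))
print('Threshold:', round(thresholds[best], 3),
      'precision:', round(precision[best], 3),
      'recall:', round(recall[best], 3))
```

Average precision is best read against the event prevalence, and a lower Brier score indicates probabilities that are closer to the observed outcomes.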


## Consolidated practice exercises



### Exercise 1: Inspect data & missingness, report event rate and null counts

In [None]:
# your code goes here

<details> <summary>Click here for a hint</summary>

Use df.isna().sum() and df['event'].value_counts(normalize=True).

</details>

<details> <summary>Click here for solution</summary>

```python
print(df.isna().sum())
print('Event rate:', df['event'].mean())
```

</details>

### Exercise 2: Build preprocessing pipeline and show transformed training shape

In [None]:
# your code goes here

<details> <summary>Click here for a hint</summary>

Call preprocessor.fit_transform(X_train) and inspect the .shape of the result.

</details>

<details> <summary>Click here for solution</summary>

```python
Xt = preprocessor.fit_transform(X_train)
print('Transformed shape:', Xt.shape)
```

</details>

### Exercise 3: Train a decision tree with max_depth=6 and report AUC

In [None]:
# your code goes here

<details> <summary>Click here for a hint</summary>

Set DecisionTreeClassifier(max_depth=6) in pipeline and compute roc_auc_score.

</details>

<details> <summary>Click here for solution</summary>

```python
dt = Pipeline(steps=[('preprocessor', preprocessor), ('clf', DecisionTreeClassifier(max_depth=6, random_state=42))])
dt.fit(X_train, y_train)
y_prob = dt.predict_proba(X_test)[:,1]
print('AUC:', roc_auc_score(y_test, y_prob))
```

</details>

### Exercise 4: Train random forest with 200 trees and report feature importances (top 5)

In [None]:
# your code goes here

<details> <summary>Click here for a hint</summary>

Use RandomForestClassifier(n_estimators=200).

</details>

<details> <summary>Click here for solution</summary>

```python
rf = Pipeline(steps=[('preprocessor', preprocessor), ('clf', RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1))])
rf.fit(X_train, y_train)
feat_imp = pd.DataFrame({'feature': feature_names, 'importance': rf.named_steps['clf'].feature_importances_}).sort_values('importance', ascending=False)
feat_imp.head(5)
```

</details>

### Exercise 5: Perform GridSearchCV (small grid) and report best params

In [None]:
# your code goes here

<details> <summary>Click here for a hint</summary>

Use the param_grid shown above and GridSearchCV(cv=3).

</details>

<details> <summary>Click here for solution</summary>

```python
grid = GridSearchCV(rf_pipeline, param_grid, cv=3, scoring='roc_auc', n_jobs=-1)
grid.fit(X_train, y_train)
print(grid.best_params_)
```

</details>

### Exercise 6: Plot partial dependence for the top feature and interpret shape

In [None]:
# your code goes here

<details> <summary>Click here for a hint</summary>

Use PartialDependenceDisplay.from_estimator on best_rf.

</details>

<details> <summary>Click here for solution</summary>

```python
top = feat_imp['feature'].iloc[0]
top_idx = feature_names.index(top)  # column position of the top-ranked feature
PartialDependenceDisplay.from_estimator(best_rf.named_steps['clf'], best_rf.named_steps['preprocessor'].transform(X_train), features=[top_idx], feature_names=feature_names)
plt.show()
```

</details>

### Exercise 7: Compare tuned RF and baseline RF AUC on test set and choose model

In [None]:
# your code goes here

<details> <summary>Click here for a hint</summary>

Compute roc_auc_score for y_prob_rf and y_prob_best and compare them. If the AUCs are close, prefer the simpler model and also check calibration.

</details>

<details> <summary>Click here for solution</summary>

```python
print('Baseline RF AUC:', roc_auc_score(y_test, y_prob_rf))
print('Tuned RF AUC:', roc_auc_score(y_test, y_prob_best))
```

</details>

## Final thoughts and best practices

- Tree-based models handle mixed data types once categorical features are encoded, and this lab imputes missing values before training. Single trees overfit easily, so use tuning and ensembling.
- Random forests reduce variance but lose some interpretability; partial dependence plots and feature importances help.
- Validate models with cross-validation and assess calibration before deployment.
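
Once a model has been validated, it is usually saved so the exact fitted object can be reloaded later. The sketch below uses `joblib` with a synthetic model and a temporary directory; the file name `demo_rf.joblib` is a hypothetical placeholder. In this lab you would save the whole fitted pipeline so that preprocessing travels with the model.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_demo, y_demo)

# Save to a temporary location and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'demo_rf.joblib')  # hypothetical file name
    joblib.dump(model, model_path)
    restored = joblib.load(model_path)

# The restored model should reproduce the original predictions
same_predictions = np.allclose(
    model.predict_proba(X_demo), restored.predict_proba(X_demo)
)
print('Predictions identical after reload:', same_predictions)
```

Saved models should be loaded with the same library versions used to create them, and only from trusted sources, because loading a pickle-based file can execute arbitrary code.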


# Congratulations!

You have successfully completed this lab on **Building Predictive Models with Decision Trees and Random Forests**.

In this lab, you built and compared tree-based machine learning models to predict a binary clinical outcome using a simulated dataset. You trained a baseline decision tree and a random forest, examined feature importance, and visualized decision rules to understand how predictions are made.

You then tuned random forest hyperparameters using cross-validation, evaluated models with ROC/AUC and calibration plots, and used partial dependence plots to interpret key predictors.

By the end of the lab, you practiced selecting, tuning, and interpreting tree-based models, and summarizing their strengths, limitations, and readiness for clinical deployment.

## Authors

Ramesh Sannareddy

Copyright © 2025 SkillUp. All rights reserved.