# Predicting Readmission Risk with Logistic Regression


Time estimate: **20** minutes




## Objectives

After completing this lab, you will be able to:

- Prepare clinical tabular data for binary classification (readmission yes/no).
- Train, evaluate, and interpret a logistic regression model for readmission risk.
- Handle class imbalance, perform feature preprocessing, and compute performance metrics (ROC, AUC, precision/recall).
- Use calibration plots and decision thresholds to support clinical decision-making.
- Report model performance with clarity for clinicians, including limitations and fairness checks.


## What you will do in this lab

- Simulate a hospital dataset with patient demographics, comorbidities, and inpatient features.
- Preprocess data: one-hot encoding, scaling, and train/test split.
- Train logistic regression with regularization and examine coefficients.
- Evaluate model using confusion matrix, ROC/AUC, precision-recall, and calibration.
- Address class imbalance using class weights and simple resampling.
- Complete seven consolidated exercises, with hints and solutions provided at the end of the lab.


## Overview

Predicting 30-day hospital readmission is a common applied machine-learning task in healthcare. Logistic regression provides a transparent, interpretable baseline model. This lab walks through an end-to-end workflow (data simulation, preprocessing, model training, evaluation, calibration, and reporting) with an emphasis on interpretability and clinical relevance.


## About the dataset/environment

We will simulate a dataset with the following columns:

- `patient_id`, `age`, `sex`, `comorbidity_count`, `length_of_stay`, `num_prior_admissions`, `lab_abnormal_flag`, `discharge_destination` (Home/SNF), and `readmitted_30d` (0/1).

The simulated dataset includes moderate class imbalance (a readmission rate of about 15%).

Tools: Python (pandas, numpy, scikit-learn, matplotlib, seaborn, and imbalanced-learn, which Step 6 uses for oversampling).


## Setup

Run the following setup cell to install any needed libraries (especially in Google Colab), import key packages for preprocessing and model evaluation, and configure reproducibility for consistent results.


In [None]:
# Colab compatibility: dependencies are installed automatically when running in Google Colab
try:
    import google.colab
    IN_COLAB = True
except Exception:
    IN_COLAB = False

if IN_COLAB:
    !pip -q install numpy pandas scikit-learn matplotlib seaborn imbalanced-learn

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.metrics import roc_auc_score, roc_curve, confusion_matrix, classification_report, precision_recall_curve, average_precision_score
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.calibration import calibration_curve, CalibratedClassifierCV
from imblearn.over_sampling import RandomOverSampler

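# Reproducibility: the simulation below draws from NumPy's global random state
np.random.seed(42)
RNG = np.random.default_rng(42)  # spare Generator for your own experiments; the lab code does not use it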
pd.set_option('display.max_columns', 60)
sns.set(style='whitegrid')
print('Setup complete. Running in Colab:', IN_COLAB)


Execute this cell to create a simulated 30-day readmission dataset. The generator constructs clinically meaningful predictors such as age, comorbidities, prior admissions, and discharge status, then assigns readmission probabilities using a controlled logistic process. This ensures you have a realistic, well-behaved dataset for experimenting with preprocessing steps and model evaluation techniques.

In [None]:
# Simulate readmission dataset
def simulate_readmission(n_patients=2000, readmit_rate=0.15):
    ids = [f"P{10000+i}" for i in range(n_patients)]
    age = np.clip(np.random.normal(65, 14, size=n_patients).astype(int), 18, 95)
    sex = np.random.choice(['M','F'], size=n_patients, p=[0.52,0.48])
    comorbidity_count = np.random.poisson(2, size=n_patients)
    length_of_stay = np.clip(np.random.exponential(4, size=n_patients).astype(int)+1, 1, 60)
    num_prior_adm = np.random.poisson(1, size=n_patients)
    lab_abnormal_flag = np.random.binomial(1, 0.18, size=n_patients)
    discharge_destination = np.random.choice(['Home','SNF'], size=n_patients, p=[0.85,0.15])
    # Base log-odds of readmission from the clinical predictors
    logits = (-2.0
              + 0.02*(age-65)
              + 0.4*lab_abnormal_flag
              + 0.25*comorbidity_count
              + 0.08*length_of_stay
              + 0.3*(discharge_destination=='SNF')
              + 0.15*num_prior_adm)
    probs = 1/(1+np.exp(-logits))
    # Rescale so the mean probability is approximately readmit_rate
    # (the clipping step afterwards can shift the realized rate slightly)
    probs = probs * (readmit_rate / probs.mean())
    probs = np.clip(probs, 0.01, 0.9)
    readmitted = np.random.binomial(1, probs)
    df = pd.DataFrame({
        'patient_id': ids,
        'age': age,
        'sex': sex,
        'comorbidity_count': comorbidity_count,
        'length_of_stay': length_of_stay,
        'num_prior_admissions': num_prior_adm,
        'lab_abnormal_flag': lab_abnormal_flag,
        'discharge_destination': discharge_destination,
        'readmitted_30d': readmitted
    })
    return df

df = simulate_readmission(n_patients=2000, readmit_rate=0.15)
df.head()
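
To make the "controlled logistic process" concrete, the sketch below applies the simulator's log-odds formula to a single patient and converts the result to a probability. The patient values are hypothetical and chosen only for illustration; the coefficients are copied from `simulate_readmission`. The result is the raw probability, before the generator rescales it toward the target readmission rate.

```python
import math

# Hypothetical patient, chosen only for illustration
patient = {
    "age": 75,
    "lab_abnormal_flag": 1,
    "comorbidity_count": 3,
    "length_of_stay": 5,
    "discharged_to_snf": 1,
    "num_prior_admissions": 2,
}

# Coefficients copied from simulate_readmission
logit = (
    -2.0
    + 0.02 * (patient["age"] - 65)
    + 0.4 * patient["lab_abnormal_flag"]
    + 0.25 * patient["comorbidity_count"]
    + 0.08 * patient["length_of_stay"]
    + 0.3 * patient["discharged_to_snf"]
    + 0.15 * patient["num_prior_admissions"]
)

# The sigmoid maps log-odds to a probability in (0, 1)
raw_probability = 1 / (1 + math.exp(-logit))
print(f"log-odds = {logit:.2f}, raw probability = {raw_probability:.3f}")
```

For this patient the log-odds come to 0.35, which corresponds to a raw probability of roughly 0.59. Changing one input at a time is a quick way to see how strongly each predictor drives the simulated risk.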


## Step 1: Inspect and understand class balance

Basic inspection and class balance.

Run this cell to inspect the structure and quality of the simulated readmission dataset. You will confirm the number of rows and columns, review variable data types, compute descriptive statistics, and check the overall 30-day readmission rate before beginning model development.

In [None]:
print('Rows, Columns:', df.shape)
print(df.dtypes)
# Only the last expression in a cell is displayed automatically, so print the summary explicitly
print(df.describe().T)
print('\nReadmission counts:\n', df['readmitted_30d'].value_counts())
print('\nReadmission rate:', df['readmitted_30d'].mean())
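
The class balance matters because it sets the bar for any accuracy figure you report later. As a rough sketch, the example below uses synthetic labels with 15% positives (an assumption matching the lab's target rate, not the actual simulated data) and scores a "model" that never flags anyone.

```python
import numpy as np

# Synthetic labels with the lab's target prevalence (15% positives)
y_true = np.array([1] * 150 + [0] * 850)

# A "model" that never flags anyone for readmission
y_naive = np.zeros_like(y_true)

accuracy = (y_naive == y_true).mean()
sensitivity = (y_naive[y_true == 1] == 1).mean()
print(f"accuracy = {accuracy:.2f}, sensitivity = {sensitivity:.2f}")
```

The naive rule reaches 85% accuracy while identifying none of the readmitted patients, which is why the later steps rely on AUC, precision, and sensitivity instead of accuracy alone.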

## Step 2: Preprocessing pipeline (encoding and scaling)

Define ColumnTransformer and pipeline.

Run this cell to define the preprocessing pipeline for your readmission model. Numeric features will be standardized, and categorical variables will be one-hot encoded. This ensures all inputs are properly scaled and formatted before training the logistic regression model.

In [None]:
# Preprocessing setup
numeric_features = ['age','comorbidity_count','length_of_stay','num_prior_admissions']
numeric_transformer = Pipeline(steps=[('scaler', StandardScaler())])

categorical_features = ['sex','discharge_destination','lab_abnormal_flag']
# lab_abnormal_flag is already binary; it is treated as categorical here so it can share the one-hot step
categorical_transformer = Pipeline(steps=[('onehot', OneHotEncoder())])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

## Step 3: Train/Test split and baseline logistic regression

Train model with class weights.

Run this code to split your dataset into training and test sets, preserving the readmission class balance. You’ll then fit a baseline logistic regression model with class weighting to account for the imbalanced outcome.


In [None]:
# Train/test split
X = df.drop(columns=['patient_id','readmitted_30d'])
y = df['readmitted_30d']
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.25, random_state=42)

# Baseline logistic regression with class weight to handle imbalance
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('clf', LogisticRegression(class_weight='balanced', solver='liblinear', random_state=42))])
model = clf.fit(X_train, y_train)
print('Model trained.')
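
The `class_weight='balanced'` option follows scikit-learn's documented heuristic `n_samples / (n_classes * np.bincount(y))`. The sketch below reproduces that arithmetic on synthetic labels; the counts (1,275 negatives and 225 positives) are an assumption that mirrors a 75% training split at a 15% readmission rate, not the exact counts from your run.

```python
import numpy as np

# Synthetic training labels: 1,500 rows with 15% positives
y_demo = np.array([0] * 1275 + [1] * 225)

n_samples = len(y_demo)
class_counts = np.bincount(y_demo)   # rows per class: [1275, 225]
n_classes = len(class_counts)

# scikit-learn's documented "balanced" heuristic
class_weights = n_samples / (n_classes * class_counts)

for label, weight in enumerate(class_weights):
    print(f"class {label}: weight {weight:.3f}")
```

Each readmitted patient counts about 5.7 times as much as a non-readmitted patient in the loss, so both classes contribute the same total weight. This improves sensitivity at the default threshold but also pushes predicted probabilities upward.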

## Step 4: Evaluate model - ROC, AUC, and confusion matrix

Plot ROC and compute AUC.

Run this cell to generate predictions from your logistic regression model and evaluate its performance. You’ll compute predicted probabilities, plot the ROC curve with AUC, and review the confusion matrix and classification report at a 0.5 threshold.

In [None]:
# Predictions and probabilities
y_prob = model.predict_proba(X_test)[:,1]
y_pred = model.predict(X_test)

# ROC and AUC
auc = roc_auc_score(y_test, y_prob)
print('AUC:', round(auc,3))
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
plt.figure(figsize=(6,5))
plt.plot(fpr, tpr, label=f'AUC={auc:.3f}')
plt.plot([0,1],[0,1],'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()

# Confusion matrix at 0.5 threshold
cm = confusion_matrix(y_test, y_pred)
print('Confusion matrix (0.5 threshold):\n', cm)
print('\nClassification report:\n', classification_report(y_test, y_pred))
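
With an imbalanced outcome, a precision-recall view is a useful companion to ROC/AUC. The sketch below shows the two relevant scikit-learn functions on a tiny made-up example; `y_true_demo` and `y_prob_demo` are illustrative stand-ins for `y_test` and `y_prob`.

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

# Tiny made-up example: true labels and predicted readmission probabilities
y_true_demo = np.array([0, 0, 1, 1])
y_prob_demo = np.array([0.10, 0.40, 0.35, 0.80])

precision, recall, pr_thresholds = precision_recall_curve(y_true_demo, y_prob_demo)
ap = average_precision_score(y_true_demo, y_prob_demo)

for p, r in zip(precision, recall):
    print(f"precision = {p:.2f}, recall = {r:.2f}")
print(f"average precision = {ap:.3f}")
```

In the lab you would pass `y_test` and `y_prob` and plot `recall` against `precision`. A useful reference point is the prevalence: a model with no discriminative ability has an average precision close to the readmission rate, about 0.15 here.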

## Step 5: Calibration and thresholding

Calibration curve and choose threshold for clinical use.

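Run this code to evaluate how well your model’s predicted probabilities align with observed outcomes. You’ll generate a calibration curve to assess probability accuracy and then compare two key performance metrics, precision (PPV) and sensitivity, across multiple decision thresholds. Because the baseline model was trained with `class_weight='balanced'`, expect its predicted probabilities to run higher than the observed readmission rate.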

In [None]:
# Calibration curve
prob_true, prob_pred = calibration_curve(y_test, y_prob, n_bins=10)
plt.figure(figsize=(6,5))
plt.plot(prob_pred, prob_true, marker='o')
plt.plot([0,1],[0,1],'k--')
plt.xlabel('Mean predicted probability')
plt.ylabel('Fraction of positives')
plt.title('Calibration plot')
plt.show()

# Explore different thresholds (named separately so the ROC `thresholds` array is not overwritten)
candidate_thresholds = [0.2, 0.3, 0.4, 0.5]
for t in candidate_thresholds:
    pred_t = (y_prob >= t).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, pred_t).ravel()
    ppv = tp/(tp+fp) if (tp+fp)>0 else 0
    sens = tp/(tp+fn) if (tp+fn)>0 else 0
    print(f'Threshold {t}: PPV={ppv:.3f}, Sensitivity={sens:.3f}')
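
One way to turn the threshold table into a decision rule is to fix a minimum sensitivity and take the highest threshold that still meets it, which usually keeps the number of flagged patients low. The sketch below is one possible approach, not a clinical recommendation; `pick_threshold`, the 0.75 sensitivity target, and the demo data are all hypothetical.

```python
import numpy as np

def pick_threshold(y_true, y_prob, candidates, min_sensitivity):
    """Return (threshold, ppv, sensitivity) for the highest candidate
    threshold whose sensitivity is at least min_sensitivity, or None."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    for t in sorted(candidates, reverse=True):
        pred = y_prob >= t
        tp = np.sum(pred & (y_true == 1))
        fp = np.sum(pred & (y_true == 0))
        fn = np.sum(~pred & (y_true == 1))
        sens = tp / (tp + fn) if (tp + fn) > 0 else 0.0
        ppv = tp / (tp + fp) if (tp + fp) > 0 else 0.0
        if sens >= min_sensitivity:
            return t, float(ppv), float(sens)
    return None

# Made-up labels and probabilities, for illustration only
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_prob_demo = [0.9, 0.6, 0.35, 0.25, 0.55, 0.3, 0.2, 0.15, 0.1, 0.05]

result = pick_threshold(y_true_demo, y_prob_demo,
                        candidates=[0.2, 0.3, 0.4, 0.5],
                        min_sensitivity=0.75)
if result is not None:
    t, ppv, sens = result
    print(f"threshold = {t}, PPV = {ppv:.2f}, sensitivity = {sens:.2f}")
```

On the demo data, thresholds 0.5 and 0.4 catch only half of the positives, so the rule settles on 0.3. In practice the sensitivity target should come from clinicians, and the threshold should be chosen on validation data rather than the final test set.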

## Step 6: Address imbalance (oversampling) and compare

Use RandomOverSampler and retrain.

Run this code to balance the training data using oversampling and retrain the logistic regression model. Random oversampling duplicates minority-class rows until both classes are the same size, which gives readmissions more influence during training. After oversampling, you’ll evaluate performance by computing the AUC on the untouched test set.

In [None]:
# Oversampling training set
ros = RandomOverSampler(random_state=42)
X_res, y_res = ros.fit_resample(X_train, y_train)
print('Resampled class distribution:', pd.Series(y_res).value_counts())

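# clone() gives this pipeline its own preprocessor, so fitting it here
# does not refit the preprocessor inside the baseline `model` from Step 3
from sklearn.base import clone
clf_res = Pipeline(steps=[('preprocessor', clone(preprocessor)),
                      ('clf', LogisticRegression(solver='liblinear', random_state=42))])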
model_res = clf_res.fit(X_res, y_res)
y_prob_res = model_res.predict_proba(X_test)[:,1]
print('AUC (resampled train):', round(roc_auc_score(y_test, y_prob_res),3))
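
To show what random oversampling does to the training data, here is a simplified NumPy stand-in for the binary case. It is not imbalanced-learn's implementation; `random_oversample` and the demo arrays are hypothetical and exist only to illustrate the idea.

```python
import numpy as np

def random_oversample(X, y, seed=42):
    """Duplicate randomly chosen minority rows until both classes are equal in size."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_extra = counts.max() - counts.min()
    minority_idx = np.flatnonzero(y == minority)
    # Sample minority row indices with replacement
    extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)
    keep = np.concatenate([np.arange(len(y)), extra_idx])
    return X[keep], y[keep]

# Made-up data: 8 negatives and 2 positives
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0] * 8 + [1] * 2)

X_bal, y_bal = random_oversample(X_demo, y_demo)
print("class counts after oversampling:", np.bincount(y_bal))
```

The extra rows are exact copies of existing minority rows, so no new information is created. That is also why only the training set is resampled: duplicating rows in the test set would distort the evaluation.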

## Step 7: Interpretation and reporting

Coefficient interpretation and fairness checks.

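Run this code cell to extract and interpret the logistic regression model’s coefficients. You’ll generate a consolidated table showing each feature, its estimated coefficient, and the corresponding odds ratio, which helps you understand how each variable influences the likelihood of 30-day readmission. Because the numeric features were standardized, each numeric odds ratio describes a one-standard-deviation increase in that feature rather than a one-unit increase.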

In [None]:
# Coefficients interpretation
feature_names_num = numeric_features
# get one-hot feature names
ohe = model.named_steps['preprocessor'].named_transformers_['cat'].named_steps['onehot']
ohe_features = list(ohe.get_feature_names_out(['sex','discharge_destination','lab_abnormal_flag']))
feature_names = list(feature_names_num) + ohe_features
coefs = model.named_steps['clf'].coef_[0]
coef_df = pd.DataFrame({'feature': feature_names, 'coef': coefs})
coef_df['odds_ratio'] = np.exp(coef_df['coef'])
coef_df.sort_values('coef', ascending=False).reset_index(drop=True)
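
Clinicians usually think in natural units (per year of age, per additional day in hospital), while coefficients fitted on standardized features are per standard deviation. The sketch below shows the conversion; the coefficient and standard deviation are hypothetical values for illustration, not results from the lab's model.

```python
import math

# Hypothetical values, for illustration only (not results from the lab's model)
coef_per_sd = 0.28   # coefficient on standardized age
age_sd = 14.0        # standard deviation of age in the training data

or_per_sd = math.exp(coef_per_sd)                    # odds ratio per 1 SD of age
or_per_year = math.exp(coef_per_sd / age_sd)         # odds ratio per 1 year
or_per_decade = math.exp(10 * coef_per_sd / age_sd)  # odds ratio per 10 years

print(f"per SD: {or_per_sd:.3f}, per year: {or_per_year:.3f}, per decade: {or_per_decade:.3f}")
```

In the lab, the fitted standard deviations are available from the scaler's `scale_` attribute, in the same order as `numeric_features`. Dividing each numeric coefficient by its standard deviation before exponentiating gives the per-unit odds ratio.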

## Consolidated practice exercises



### Exercise 1: Report class balance and compute baseline readmission rate

In [None]:
# your code goes here

<details> <summary>Click here for a hint</summary>

Use `df['readmitted_30d'].value_counts()` and `.mean()`.

</details>

<details> <summary>Click here for solution</summary>

```python
print(df['readmitted_30d'].value_counts())
print('Readmit rate:', df['readmitted_30d'].mean())
```

</details>

### Exercise 2: Build preprocessing pipeline and show transformed feature matrix shape

In [None]:
# your code goes here

<details> <summary>Click here for a hint</summary>

Use the ColumnTransformer pipeline and call `preprocessor.fit_transform(X_train)`.

</details>

<details> <summary>Click here for solution</summary>

```python
preprocessor.fit(X_train)
Xt = preprocessor.transform(X_train)
print('Transformed shape:', Xt.shape)
```

</details>

### Exercise 3: Train logistic regression with class_weight='balanced' and report AUC on test set

In [None]:
# your code goes here

<details> <summary>Click here for a hint</summary>

Fit pipeline and compute `roc_auc_score(y_test, y_prob)`.

</details>

<details> <summary>Click here for solution</summary>

```python
clf = Pipeline(steps=[('preprocessor', preprocessor), ('clf', LogisticRegression(class_weight='balanced', solver='liblinear', random_state=42))])
model = clf.fit(X_train, y_train)
y_prob = model.predict_proba(X_test)[:,1]
print('AUC:', roc_auc_score(y_test, y_prob))
```

</details>

### Exercise 4: Plot ROC curve and mark threshold at 0.3

In [None]:
# your code goes here

<details> <summary>Click here for a hint</summary>

Use `roc_curve` and plot, optionally mark point where threshold ~0.3.

</details>

<details> <summary>Click here for solution</summary>

```python
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
# Index of the ROC point whose threshold is closest to 0.3
idx = np.argmin(np.abs(thresholds - 0.3))
plt.plot(fpr, tpr)
plt.scatter(fpr[idx], tpr[idx], color='red', label=f'threshold ~ {thresholds[idx]:.2f}')
plt.legend()
plt.show()
```

</details>

### Exercise 5: Compute calibration curve and comment if model is over- or under-confident

In [None]:
# your code goes here

<details> <summary>Click here for a hint</summary>

Use `calibration_curve(y_test, y_prob, n_bins=10)` and inspect plot.

</details>

<details> <summary>Click here for solution</summary>

```python
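prob_true, prob_pred = calibration_curve(y_test, y_prob, n_bins=10)
print(prob_true, prob_pred)
# Bins where prob_pred exceeds prob_true over-predict risk (points below the diagonal);
# bins where prob_true exceeds prob_pred under-predict risk (points above the diagonal)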
```

</details>

### Exercise 6: Retrain using RandomOverSampler on training set and compare AUC to baseline

In [None]:
# your code goes here

<details> <summary>Click here for a hint</summary>

Use `RandomOverSampler` to resample then fit and compute roc_auc_score on test set.

</details>

<details> <summary>Click here for solution</summary>

```python
ros = RandomOverSampler(random_state=42)
X_res, y_res = ros.fit_resample(X_train, y_train)
from sklearn.base import clone  # fresh preprocessor so the baseline pipeline is not refit
clf_res = Pipeline(steps=[('preprocessor', clone(preprocessor)), ('clf', LogisticRegression(solver='liblinear', random_state=42))])
model_res = clf_res.fit(X_res, y_res)
y_prob_res = model_res.predict_proba(X_test)[:,1]
print('AUC (resampled):', roc_auc_score(y_test, y_prob_res))
```

</details>

### Exercise 7: List top 5 features by absolute coefficient (importance) and their odds ratios

In [None]:
# your code goes here

<details> <summary>Click here for a hint</summary>

Extract feature names and coefficients from trained model and sort by abs(coef).

</details>

<details> <summary>Click here for solution</summary>

```python
ohe = model.named_steps['preprocessor'].named_transformers_['cat'].named_steps['onehot']
feature_names = list(numeric_features) + list(ohe.get_feature_names_out(categorical_features))
coefs = model.named_steps['clf'].coef_[0]
coef_df = pd.DataFrame({'feature':feature_names,'coef':coefs})
coef_df['abs_coef']=coef_df['coef'].abs()
coef_df['odds_ratio']=np.exp(coef_df['coef'])
coef_df.sort_values('abs_coef',ascending=False).head(5)
```

</details>

## Final thoughts and best practices

- Logistic regression is a strong baseline that balances interpretability and performance.  
- Check calibration and consider decision thresholds aligned with clinical priorities.  
- Be explicit about bias and fairness checks across subgroups.  
- Document preprocessing and model selection for reproducibility.
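
As a starting point for the subgroup checks mentioned above, the sketch below compares sensitivity and false positive rate across groups. It is a minimal illustration on made-up data, not a complete fairness audit; `subgroup_rates` and the demo arrays are hypothetical.

```python
import numpy as np

def subgroup_rates(y_true, y_pred, group):
    """Return {group_value: (n, sensitivity, false_positive_rate)}."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    out = {}
    for g in np.unique(group):
        mask = group == g
        pos = mask & (y_true == 1)
        neg = mask & (y_true == 0)
        sens = float((y_pred[pos] == 1).mean()) if pos.any() else float("nan")
        fpr = float((y_pred[neg] == 1).mean()) if neg.any() else float("nan")
        out[str(g)] = (int(mask.sum()), sens, fpr)
    return out

# Made-up labels, predictions, and group membership, for illustration only
y_true_demo = [1, 1, 0, 0, 0, 1, 1, 0, 0, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]
sex_demo = ["F", "F", "F", "F", "F", "M", "M", "M", "M", "M"]

rates = subgroup_rates(y_true_demo, y_pred_demo, sex_demo)
for g, (n, sens, fpr) in rates.items():
    print(f"{g}: n={n}, sensitivity={sens:.2f}, FPR={fpr:.2f}")
```

In the lab, you could pass `y_test`, thresholded predictions, and `X_test['sex']` or `X_test['discharge_destination']`. Large gaps between groups are a prompt for further investigation, and small subgroups produce noisy estimates, so report the group sizes alongside the rates.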


# Congratulations!

You have successfully completed this lab on **Predicting Readmission Risk with Logistic Regression**.

In this lab, you built an end-to-end logistic regression workflow to predict 30-day hospital readmission using a simulated clinical dataset. You prepared the data with encoding, scaling, and a train/test split, then trained a logistic regression model with class-imbalance handling and regularization.

You evaluated performance using ROC/AUC, confusion matrices, precision–recall curves, and calibration plots to judge probability accuracy. You also interpreted coefficients through odds ratios and conducted basic fairness and limitations checks relevant to clinical use.

By the end, you practiced selecting thresholds for clinical decision-making, comparing baseline and oversampled models, and summarizing model results clearly and responsibly for clinicians.

## Authors

Ramesh Sannareddy

Copyright © 2025 SkillUp. All rights reserved.