# Hospital Readmission Prediction

Clinical and economic context  
Thirty-day readmissions are a priority quality metric in value-based care. Avoidable readmissions generate penalties and represent missed opportunities for safe transitions. A practical predictive model can support discharge planning and early outreach to high-risk individuals.

Repository  
https://github.com/albertokabore/Hospital-Readmission-Prediction

Dataset  
`data/hospital_readmissions_30k.csv`  

Outcome  
`readmitted_30_days` with values Yes or No


In [1]:
# Environment and imports
import os
from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    roc_auc_score,
    average_precision_score,
    RocCurveDisplay,
    PrecisionRecallDisplay
)

pd.set_option("display.max_columns", 120)



## Project overview

Goal  
Estimate the probability of a thirty-day readmission at the time of discharge using routinely collected features.

Clinical use  
Flag high-risk patients for post-discharge calls, early clinic visits, medication review, and home care referrals.

Success criteria  
Balanced performance that supports early intervention. We report AUROC and AUPRC because the outcome classes are imbalanced and accuracy alone would be misleading.
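
To make the two metrics concrete, the sketch below computes them by hand on a small hypothetical example. The labels and scores are invented for illustration and are not drawn from the dataset, and the helper names `auroc` and `average_precision` are our own. In the notebook itself these values come from `roc_auc_score` and `average_precision_score`.

```python
# Toy illustration: hypothetical labels and scores, not drawn from the dataset.

def auroc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Mean of the precision at the rank of each positive (assumes no tied scores)."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]  # 20 percent positives
scores = [0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.40, 0.70, 0.60, 0.90]

prevalence = sum(y_true) / len(y_true)
print(f"Prevalence: {prevalence:.2f}")
print(f"Accuracy of always predicting No: {1 - prevalence:.2f}")
print(f"AUROC: {auroc(y_true, scores):.4f}")
print(f"AUPRC: {average_precision(y_true, scores):.4f}")
```

A model that always predicts No reaches 80 percent accuracy here while finding no readmissions, which is why accuracy is not a success criterion. A ranker with no skill has an expected AUPRC close to the prevalence, so AUPRC should be read against that baseline.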


In [None]:
# Paths and outcome definition
DATA_PATH = Path("data")
FILE_NAME = "hospital_readmissions_30k.csv"
TARGET = "readmitted_30_days"

csv_path = DATA_PATH / FILE_NAME
assert csv_path.exists(), f"Dataset not found at {csv_path}. Place the CSV under data/."

df = pd.read_csv(csv_path, low_memory=False)
df.head()


## Data description and quality review


In [None]:
# Schema and missing values
import io

buf = io.StringIO()  # df.info writes to a text buffer; a list's append method is not one
df.info(buf=buf)
print(buf.getvalue())

print("\nMissing values by column (top twenty):")
print(df.isna().sum().sort_values(ascending=False).head(20))

print("\nOutcome distribution:")
print(df[TARGET].value_counts(dropna=False))


## Outcome balance review


In [None]:
y_raw = df[TARGET].astype(str).str.strip().str.title()
pos_rate = (y_raw == "Yes").mean()
print(f"Positive rate Yes: {pos_rate:.3f}")

fig = plt.figure()
y_raw.value_counts().plot(kind="bar")
plt.title("Outcome distribution")
plt.xlabel(TARGET)
plt.ylabel("count")
plt.show()


## Target cleaning and feature catalog

We map Yes to one and No to zero.  
We identify numeric and categorical predictors for preprocessing.


In [None]:
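y = y_raw.map({"Yes": 1, "No": 0})
# Fail early with a clear message if the target holds anything other than Yes or No
assert y.notna().all(), f"Unexpected values in {TARGET}; expected only Yes or No."
y = y.astype(int)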
X = df.drop(columns=[TARGET])

cat_cols = [c for c in X.columns if X[c].dtype == "object"]
num_cols = [c for c in X.columns if c not in cat_cols]

print(f"Categorical columns: {len(cat_cols)}")
print(f"Numeric columns: {len(num_cols)}")
print("Sample categoricals:", cat_cols[:8])
print("Sample numerics:", num_cols[:8])


## Exploratory analysis that informs modeling


In [None]:
# Numeric summary
display(X[num_cols].describe(percentiles=[0.01, 0.25, 0.5, 0.75, 0.99]).T)


In [None]:
# Selected numeric distributions if present
for col in ["age", "length_of_stay", "bmi", "cholesterol", "medication_count"]:
    if col in X.columns:
        fig = plt.figure()
        X[col].hist(bins=30)
        plt.title(f"Distribution of {col}")
        plt.xlabel(col)
        plt.ylabel("count")
        plt.show()


In [None]:
# Selected categorical profiles if present
for col in ["gender", "diabetes", "hypertension", "discharge_destination"]:
    if col in X.columns:
        print(f"\nTop values for {col}")
        print(X[col].value_counts(dropna=False).head(10))


## Train and test partition

We use a stratified split to preserve the outcome rate.
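
As a rough illustration of what stratification does, the sketch below splits each class separately so both partitions keep the same outcome rate. It is a simplified stand-in, not scikit-learn's implementation. The function name `stratified_split` and the outcome vector are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with roughly the same class mix in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)  # shuffle within the class only
        n_test = round(len(idx) * test_size)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


def positive_rate(labels, idx):
    """Share of positive labels among the selected rows."""
    return sum(labels[i] for i in idx) / len(idx)


# Hypothetical outcome vector: 100 positives among 1,000 rows.
labels = [1] * 100 + [0] * 900
train_idx, test_idx = stratified_split(labels)

print(len(train_idx), len(test_idx))
print(f"Train rate: {positive_rate(labels, train_idx):.3f}")
print(f"Test rate:  {positive_rate(labels, test_idx):.3f}")
```

Without stratification, a small test set can end up with noticeably more or fewer positives than the training set by chance, which makes AUPRC harder to compare across runs.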


In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_train.shape, X_test.shape


## Modeling strategy

Rationale  
Logistic Regression offers transparency for clinical stakeholders and supports coefficient-based interpretation.  
Random Forest and Gradient Boosting often improve predictive performance on tabular data.
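
Coefficient-based interpretation usually means reading each coefficient as an odds ratio: exponentiating a log-odds coefficient gives the multiplicative change in the odds of readmission per one-unit increase in the feature, holding the others fixed. The sketch below shows the arithmetic. The coefficient values and the `diabetes_Yes` column name are hypothetical and were not fitted on this dataset.

```python
import math

# Hypothetical coefficients for illustration only; not fitted on the dataset.
coefficients = {"length_of_stay": 0.08, "diabetes_Yes": 0.35, "age": 0.01}

# exp(beta) is the odds ratio for a one-unit increase in the feature
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: beta={coefficients[name]:+.2f}, odds ratio={ratio:.3f}")
```

Because the numeric features are not scaled in this pipeline, each coefficient is expressed per raw unit of its feature, so magnitudes are not directly comparable across features.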

Preprocessing  
Numeric imputation uses the median.  
Categorical imputation uses the most frequent value, followed by one-hot encoding.  
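
The sketch below walks through the same three steps in plain Python on two hypothetical columns. The values, including the category names, are invented, and the helper functions are our own simplification rather than the scikit-learn classes. It also shows the effect of ignoring unknown categories: a category never seen in training is encoded as all zeros.

```python
from collections import Counter
from statistics import median


def fit_numeric(values):
    """Learn the median of the observed (non-missing) training values."""
    return median(v for v in values if v is not None)


def fit_categorical(values):
    """Learn the most frequent category and the categories seen in training."""
    observed = [v for v in values if v is not None]
    fill = Counter(observed).most_common(1)[0][0]
    return fill, sorted(set(observed))


def one_hot(value, fill, categories):
    """Impute a missing value, then encode; unseen categories become all zeros."""
    value = fill if value is None else value
    return [1 if value == c else 0 for c in categories]


# Hypothetical training columns (None marks a missing entry).
train_los = [2, 5, None, 3, 10]
train_dest = ["Home", "Home", None, "Nursing_Facility", "Home"]

los_fill = fit_numeric(train_los)
dest_fill, dest_cats = fit_categorical(train_dest)

print("Median fill:", los_fill)
print("Categories:", dest_cats)
print("Missing value:", one_hot(None, dest_fill, dest_cats))
print("Unseen 'Rehab':", one_hot("Rehab", dest_fill, dest_cats))
```

The fill values are learned from the training rows only and then reused on the test rows. Wrapping the preprocessing and the classifier in one pipeline gives the same guarantee in the notebook.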


In [None]:
preprocess = ColumnTransformer(
    transformers=[
        ("num", SimpleImputer(strategy="median"), num_cols),
        ("cat", Pipeline(steps=[
            ("imputer", SimpleImputer(strategy="most_frequent")),
            ("onehot", OneHotEncoder(handle_unknown="ignore"))
        ]), cat_cols),
    ],
    n_jobs=None
)

models = {
    "LogReg_balanced": LogisticRegression(max_iter=1000, class_weight="balanced"),
    "RandomForest": RandomForestClassifier(
        n_estimators=300, random_state=42, n_jobs=-1, class_weight="balanced"
    ),
    "GradientBoosting": GradientBoostingClassifier(random_state=42)
}

pipelines = {
    name: Pipeline(steps=[("preprocess", preprocess), ("clf", clf)])
    for name, clf in models.items()
}
list(pipelines.keys())


## Model fitting and core evaluation

We report AUROC and AUPRC.  
We print a full classification report and a confusion matrix.  
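
The classification report and the confusion matrix both depend on the decision threshold, unlike AUROC and AUPRC. The sketch below shows how the counts shift when the threshold moves. The labels and probabilities are hypothetical, and `confusion_counts` is a helper written for this illustration.

```python
def confusion_counts(y_true, proba, threshold=0.5):
    """Return (tn, fp, fn, tp) after thresholding predicted probabilities."""
    tn = fp = fn = tp = 0
    for y, p in zip(y_true, proba):
        pred = int(p >= threshold)
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 1:
            fn += 1
        else:
            tn += 1
    return tn, fp, fn, tp


# Hypothetical labels and predicted probabilities, for illustration only.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
proba = [0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.45, 0.55, 0.40, 0.80]

for threshold in (0.5, 0.3):
    tn, fp, fn, tp = confusion_counts(y_true, proba, threshold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    print(f"threshold={threshold}: tn={tn} fp={fp} fn={fn} tp={tp} "
          f"precision={precision:.2f} recall={recall:.2f}")
```

Lowering the threshold catches more readmissions at the cost of more false alarms. Which trade-off is acceptable depends on outreach capacity, so the default of 0.5 is a starting point rather than a recommendation.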


In [None]:
results = {}
reports = {}

for name, pipe in pipelines.items():
    pipe.fit(X_train, y_train)

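    if hasattr(pipe.named_steps["clf"], "predict_proba"):
        proba = pipe.predict_proba(X_test)[:, 1]
        pred = (proba >= 0.5).astype(int)
    else:
        # decision_function returns signed margins, so the cut-off is 0, not 0.5
        proba = pipe.decision_function(X_test)
        pred = (proba >= 0).astype(int)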

    au
