# Building a Readmission Risk Classifier

Time estimate: **20** minutes


## Objectives
After completing this lab, you will be able to:
- Explain the concept of hospital readmission risk.
- Prepare a feature set for binary classification.
- Train a simple readmission risk classifier.
- Evaluate model performance using basic metrics.
- Interpret model outputs in a healthcare context.



## What you will do in this lab

In this lab, you will use de-identified clinical data to build, evaluate, and interpret a simple hospital readmission risk prediction model.

You will:

- Load a prepared, de-identified clinical dataset.
- Define a readmission outcome variable.
- Split data into training and testing sets.
- Train a baseline classification model.
- Evaluate and interpret model results.



## Overview
Hospital readmissions are costly and often preventable. Predicting which patients
are at higher risk of readmission allows healthcare organizations to intervene early.
In this lab, you will build a simple readmission risk classifier. The goal is not model
sophistication but an understanding of the end-to-end workflow of clinical prediction.



## About the dataset/environment
You will work with a **synthetic, de-identified patient-level dataset** that includes:
- Encounter counts
- Average laboratory values
- Chronic condition indicators
- A binary readmission label

The dataset simulates real-world inputs to a readmission risk model.


## Setup

In [None]:

# This cell imports required libraries.

import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix


In [None]:
# This cell reads a synthetic modeling dataset.

data = pd.read_csv('https://machine-learning-for-healthcare-applications-f276df.gitlab.io/labs/lab1/patient_data1.csv')


## Step 1: Review the modeling dataset
In this step, you will look at the features and the outcome variable used for prediction.

**Why this matters in healthcare:** Understanding features ensures clinical relevance and trust in predictions.


In [None]:

# This cell displays the dataset structure and summary.


data.info()


In [None]:
# List the first five rows.
# Reviewing the data is critical before modeling.
data.head()
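
Beyond `info()` and `head()`, it often helps to count missing values and check value ranges before modeling. The following is a minimal sketch on a small hand-made table; the column names `encounter_count` and `avg_glucose` are hypothetical and are not taken from the lab dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table; only "readmitted" is a column name used in this lab.
toy = pd.DataFrame({
    "encounter_count": [2, 5, 1, 7],
    "avg_glucose": [98.0, np.nan, 110.5, 140.2],
    "readmitted": [0, 1, 0, 1],
})

# Count missing values per column; lab values are a common source of gaps.
missing_per_column = toy.isna().sum()

# Summary statistics help spot implausible ranges (for example, negative counts).
summary = toy.describe()

print(missing_per_column)
print(summary.loc[["min", "max"]])
```

The same two calls can be applied to `data` directly.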


## Step 2: Separate features and the target variable
You will separate input features from the readmission outcome.

**Why this matters in healthcare:** Keeping the outcome out of the feature set prevents data leakage, where the model is given the answer it is supposed to predict.


In [None]:

# This cell separates features (X) and target (y).
# Models require this explicit distinction.

X = data.drop(columns=["readmitted"])
y = data["readmitted"]


In [None]:
# Display the first rows of X.
X.head()

In [None]:
# Display the first rows of y.
y.head()
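
After separating features and target, two quick checks are worth considering: that the target column is no longer among the features, and how balanced the two outcome classes are. The sketch below uses a hand-made table with hypothetical feature names (`encounter_count`, `has_diabetes`), not the lab dataset.

```python
import pandas as pd

# Hypothetical toy data; feature names are assumptions for illustration.
toy = pd.DataFrame({
    "encounter_count": [1, 4, 2, 6, 3, 5, 1, 2],
    "has_diabetes":    [0, 1, 0, 1, 0, 1, 0, 0],
    "readmitted":      [0, 1, 0, 1, 0, 0, 0, 0],
})

X_toy = toy.drop(columns=["readmitted"])
y_toy = toy["readmitted"]

# Simple leakage check: the target must not appear among the features.
target_in_features = "readmitted" in X_toy.columns

# Share of each class in the outcome. When one class dominates,
# accuracy alone can look good even for a model that misses most readmissions.
class_share = y_toy.value_counts(normalize=True)

print(target_in_features)
print(class_share)
```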


## Step 3: Split data into training and testing sets
You will reserve part of the data for unbiased evaluation.

**Why this matters in healthcare:** Performance on unseen data is a more honest estimate of how the model will behave on new patients.


In [None]:
# This cell splits the dataset into training and test sets.
# It helps evaluate the model's ability to generalize to new, unseen data.

# train_test_split is a function from scikit-learn that divides arrays or matrices into random train and test subsets.
# X: The feature matrix (independent variables).
# y: The target variable (dependent variable).
# test_size=0.25: This means 25% of the data will be used for the test set, and 75% for the training set.
# random_state=42: This parameter ensures reproducibility. If you run the code multiple times with the same random_state,
#                  you will get the exact same split every time. This is crucial for consistent experimentation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Print the shapes (number of rows, number of columns) of the training and testing feature sets.
# This helps to confirm that the split was performed correctly and to understand the size of each dataset.
print(X_train.shape, X_test.shape)
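
When readmissions are much rarer than non-readmissions, a purely random split can leave the test set with very few positive cases. One common option, not used in the cell above, is the `stratify` argument of `train_test_split`, which keeps the class proportions similar in both subsets. The sketch below assumes a toy dataset with a hypothetical `encounter_count` feature.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 8 readmitted patients out of 40 (20%).
toy = pd.DataFrame({
    "encounter_count": list(range(40)),
    "readmitted": [1] * 8 + [0] * 32,
})
X_toy = toy.drop(columns=["readmitted"])
y_toy = toy["readmitted"]

# stratify=y_toy asks scikit-learn to preserve the class ratio in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)

# Compare the readmission rate in each subset.
print("Train readmission rate:", y_tr.mean())
print("Test readmission rate:", y_te.mean())
```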


## Step 4: Train a readmission risk classifier
You will train a simple logistic regression model.

**Why this matters in healthcare:** Logistic regression is interpretable and widely used in clinical settings.


In [None]:

# This cell trains a logistic regression classifier.
# A simple model is appropriate for baseline risk prediction.

model = LogisticRegression()
model.fit(X_train, y_train)
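
Part of what makes logistic regression interpretable is that each coefficient can be read as a change in the log-odds of the outcome, and its exponential as an odds ratio. The following is a minimal sketch on randomly generated data; the feature names and the way the outcome is simulated are assumptions for illustration, and `max_iter=1000` is set only to avoid convergence warnings.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200

# Hypothetical features, generated at random.
toy_X = pd.DataFrame({
    "encounter_count": rng.integers(0, 10, size=n),
    "has_chronic_condition": rng.integers(0, 2, size=n),
})

# Simulated outcome: by construction, risk rises with encounter_count only.
log_odds = -2.0 + 0.5 * toy_X["encounter_count"]
toy_y = (rng.random(n) < 1 / (1 + np.exp(-log_odds))).astype(int)

toy_model = LogisticRegression(max_iter=1000)
toy_model.fit(toy_X, toy_y)

# exp(coefficient) is the multiplicative change in odds per one-unit increase.
odds_ratios = pd.Series(np.exp(toy_model.coef_[0]), index=toy_X.columns)
print(odds_ratios)
```

An odds ratio above 1 suggests the feature is associated with higher readmission odds in the fitted model; it does not by itself establish a causal effect.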



## Step 5: Generate predictions
You will predict readmission risk on the test set.

**Why this matters in healthcare:** Predictions enable proactive clinical interventions.


In [None]:

# This cell generates predictions on test data.
# Predictions are later evaluated for accuracy.

y_pred = model.predict(X_test)
y_pred
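
`predict` returns hard 0/1 labels. To work with a graded risk score instead, scikit-learn classifiers such as `LogisticRegression` also provide `predict_proba`. The sketch below uses a tiny hand-made dataset, and the 0.3 threshold is an arbitrary illustrative value, not a clinical recommendation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: one feature, outcome more likely at higher values.
X_toy = np.array([[0], [1], [2], [3], [4], [5], [6], [7]])
y_toy = np.array([0, 0, 0, 1, 0, 1, 1, 1])

toy_model = LogisticRegression().fit(X_toy, y_toy)

# Column 1 of predict_proba holds the estimated probability of class 1 (readmitted).
risk = toy_model.predict_proba(X_toy)[:, 1]

# For a binary model, predict() corresponds to a 0.5 probability cutoff.
default_labels = toy_model.predict(X_toy)

# A lower, illustrative cutoff flags at least as many patients for follow-up.
flagged = (risk >= 0.3).astype(int)

print(np.round(risk, 2))
print(default_labels)
print(flagged)
```

Lowering the cutoff trades more false alarms for fewer missed readmissions, which is the balance examined in the evaluation step.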



## Step 6: Evaluate model performance
You will evaluate accuracy and examine a confusion matrix.

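**Why this matters in healthcare:** Accuracy alone can hide clinically important errors. The confusion matrix separates missed readmissions (false negatives) from false alarms (false positives), so the two can be weighed against each other.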


In [None]:

# This cell evaluates model performance.
# Basic metrics provide initial insight.

accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)

print("Accuracy =", accuracy)
print("Confusion Matrix =\n", cm)
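
The four cells of a binary confusion matrix can be turned into metrics that speak directly to missed readmissions and false alarms. The sketch below uses made-up labels rather than the lab's predictions; for labels 0 and 1, scikit-learn orders the flattened matrix as true negatives, false positives, false negatives, true positives.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels for illustration: 1 = readmitted, 0 = not readmitted.
y_true_toy = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

# ravel() flattens the 2x2 matrix in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

sensitivity = tp / (tp + fn)   # share of actual readmissions that were caught
specificity = tn / (tn + fp)   # share of non-readmissions correctly left alone
precision = tp / (tp + fp)     # share of flagged patients who were readmitted

print("Missed readmissions (FN):", fn)
print("False alarms (FP):", fp)
print("Sensitivity:", sensitivity)
print("Specificity:", specificity)
print("Precision:", precision)

# The same sensitivity and precision are available as library functions.
print(recall_score(y_true_toy, y_pred_toy), precision_score(y_true_toy, y_pred_toy))
```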


## Exercises

In [None]:
# Load the exercise-specific dataset.
data = pd.read_csv('https://machine-learning-for-healthcare-applications-f276df.gitlab.io/labs/lab1/patient_data2.csv')

### Exercise 1: Inspect the data

In [None]:
# your code goes here

<details>
<summary>Click here for a hint</summary>

Use `head()` to list the first rows.

</details>

<details>
<summary>Click here for solution</summary>

```python
data.head()
```

</details>


### Exercise 2: Identify features and target

In [None]:
# your code goes here

<details>
<summary>Click here for a hint</summary>

The target is the `readmitted` column. Drop it from the DataFrame to get the features.

</details>

<details>
<summary>Click here for solution</summary>

```python
X = data.drop(columns=["readmitted"])
y = data["readmitted"]
```

</details>


### Exercise 3: Split data with test size as 40%

In [None]:
# your code goes here

<details>
<summary>Click here for a hint</summary>

Modify the test_size parameter.

</details>

<details>
<summary>Click here for solution</summary>

```python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
```

</details>


### Exercise 4: Retrain the model

In [None]:
# your code goes here

<details>
<summary>Click here for a hint</summary>

Call `fit` again on the training data from the new split.

</details>

<details>
<summary>Click here for solution</summary>

```python
model.fit(X_train, y_train)
```

</details>


### Exercise 5: Compute accuracy

In [None]:
# your code goes here

<details>
<summary>Click here for a hint</summary>

Use accuracy_score on predictions.

</details>

<details>
<summary>Click here for solution</summary>

```python
y_pred = model.predict(X_test)
accuracy_score(y_test, y_pred)
```

</details>


## Congratulations!

You have successfully completed this lab on building and evaluating a basic readmission risk classifier. You practiced the complete workflow of building, evaluating, and interpreting a simple predictive model using healthcare data.

## Authors
Ramesh Sannareddy

<br>

© SkillUp. All rights reserved.

Materials may not be reproduced in whole or in part without written permission from SkillUp.