# Feature Engineering for Clinical Events, Labs, and Encounters

Time estimate: **20** minutes

## Objectives
After completing this lab, you will be able to:
- Engineer features from raw clinical, lab, and encounter data.  
- Aggregate lab values across clinically meaningful time windows.  
- Derive encounter frequency and utilization indicators.  
- Encode comorbidities and medication counts.  
- Construct episode-level features while avoiding data leakage.


## What you will do in this lab
In this lab, you will prepare longitudinal clinical data for modeling by aggregating events, encoding clinical context, and preventing data leakage.

You will:
- Load a synthetic longitudinal clinical dataset.  
- Aggregate laboratory results over time windows.  
- Derive encounter-based utilization features.  
- Encode comorbidities and medication exposure.  
- Build episode-level features suitable for modeling.  
- Identify and avoid common data leakage pitfalls.


## Overview
Raw clinical data must be transformed into meaningful features before it can be used
for analytics or machine learning. Feature engineering in healthcare requires careful
attention to time, clinical context, and patient-level aggregation. This lab focuses
on creating clinically sensible features from events, labs, and encounters while
highlighting common pitfalls such as data leakage.


## About the dataset/environment
You will work with a **synthetic longitudinal clinical dataset** that includes:
- Patient encounters over time  
- Laboratory test results  
- Medication records  
- Diagnosis codes  

The dataset is designed to support feature engineering exercises and includes
timestamps needed for time-aware aggregation.


## Setup

In [None]:

# This cell imports required libraries and loads a synthetic longitudinal clinical dataset.


import pandas as pd


# Load a synthetic encounter-level dataset
encounters_df = pd.read_csv("https://fundamentals-of-healthcare-data-science-858397.gitlab.io/labs/lab5/encounters_dataset1.csv")

# Load a synthetic lab dataset
labs_df = pd.read_csv("https://fundamentals-of-healthcare-data-science-858397.gitlab.io/labs/lab5/laboratory_dataset1.csv")

# Load a synthetic medication dataset
meds_df = pd.read_csv("https://fundamentals-of-healthcare-data-science-858397.gitlab.io/labs/lab5/medications_dataset1.csv")


In [None]:
encounters_df.head()

In [None]:
labs_df.head()

In [None]:
meds_df.head()

## Step 1: Aggregate lab values over time windows
You will begin by summarizing lab values over a fixed time window before each encounter.
This captures the patient's recent clinical status at the time of the encounter.

**Why this matters in healthcare:** Lab trends over time are often more informative than single measurements.


In [None]:
# This cell aggregates lab values within 30 days prior to each encounter.
# Time-windowed aggregation is a common clinical feature engineering technique.

# Convert date columns to datetime objects
encounters_df['encounter_date'] = pd.to_datetime(encounters_df['encounter_date'])
labs_df['lab_date'] = pd.to_datetime(labs_df['lab_date'])

# Merge encounters with labs
enc_labs = encounters_df.merge(labs_df, on="patient_id", how="left")

# Calculate days between lab and encounter
enc_labs["days_before_encounter"] = (
    enc_labs["encounter_date"] - enc_labs["lab_date"]
).dt.days

# Filter labs within 30 days before encounter
recent_labs = enc_labs[
    (enc_labs["days_before_encounter"] >= 0) &
    (enc_labs["days_before_encounter"] <= 30)
]

# Aggregate mean lab value per encounter
lab_features = recent_labs.groupby("encounter_id")["lab_value"].mean().reset_index()
lab_features
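
A single mean can hide how many results fall in the window and how recent they are. The sketch below applies the same 30-day filter to a small toy table and computes several summary statistics per encounter. The toy values and the output column names (`lab_mean`, `lab_last`, and so on) are illustrative assumptions, not part of the lab dataset.

```python
import pandas as pd

# Toy data with made-up values (not taken from the lab dataset)
toy_encounters = pd.DataFrame({
    "patient_id": [1, 1],
    "encounter_id": [101, 102],
    "encounter_date": pd.to_datetime(["2024-01-31", "2024-03-15"]),
})
toy_labs = pd.DataFrame({
    "patient_id": [1, 1, 1, 1],
    "lab_date": pd.to_datetime(
        ["2024-01-05", "2024-01-20", "2024-02-01", "2024-03-10"]
    ),
    "lab_value": [5.0, 7.0, 9.0, 6.0],
})

# Pair every encounter with every lab of the same patient
pairs = toy_encounters.merge(toy_labs, on="patient_id", how="left")
pairs["days_before_encounter"] = (
    pairs["encounter_date"] - pairs["lab_date"]
).dt.days

# Keep labs from 0 to 30 days before the encounter (both ends inclusive)
in_window = pairs[pairs["days_before_encounter"].between(0, 30)]

# Sort by lab date so that "last" picks the most recent value in the window
toy_lab_stats = (
    in_window.sort_values("lab_date")
    .groupby("encounter_id")["lab_value"]
    .agg(lab_mean="mean", lab_min="min", lab_max="max",
         lab_count="count", lab_last="last")
    .reset_index()
)
print(toy_lab_stats)
```

If the real lab table holds several kinds of tests, averaging them together mixes units. In that case it may make sense to add the test identifier to the `groupby` keys, assuming such a column exists.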

## Step 2: Derive encounter frequency features
Here, you will calculate how often each patient visits healthcare facilities.

**Why this matters in healthcare:** High encounter frequency often signals disease severity or care complexity.


In [None]:

# This cell derives encounter frequency per patient.
# Utilization features are strong predictors in many healthcare models.

encounter_counts = (
    encounters_df.groupby("patient_id")["encounter_id"]
    .count()
    .reset_index(name="encounter_count")
)

encounter_counts
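
A per-patient total counts every encounter in the dataset, including those that happen after the encounter being modeled. When the feature table has one row per encounter, a count of earlier encounters may be the safer choice. The following is a minimal sketch on toy data. The column name `prior_encounter_count` is an illustrative assumption.

```python
import pandas as pd

# Toy data with made-up values, deliberately out of date order
toy_encounters = pd.DataFrame({
    "patient_id": [1, 1, 1, 2],
    "encounter_id": [11, 12, 13, 21],
    "encounter_date": pd.to_datetime(
        ["2024-03-01", "2024-01-10", "2024-02-05", "2024-01-20"]
    ),
})

# Order each patient's encounters chronologically
toy_encounters = toy_encounters.sort_values(["patient_id", "encounter_date"])

# cumcount() numbers rows within each group starting at 0, so after sorting
# it equals the number of earlier encounters for the same patient
toy_encounters["prior_encounter_count"] = (
    toy_encounters.groupby("patient_id").cumcount()
)
print(toy_encounters)
```

Encounters that share the same date are ordered arbitrarily by this approach, so a tie-breaking rule would be needed if same-day visits occur.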


## Step 3: Encode comorbidities
You will convert diagnosis codes into indicators representing the presence of chronic conditions.

**Why this matters in healthcare:** Comorbidities strongly influence outcomes and risk stratification.


In [None]:

# This cell encodes comorbidities as binary indicators.
# Diagnosis-based features are commonly used in clinical models.

# Create comorbidity flags
encounters_df["has_diabetes"] = encounters_df["diagnosis_code"] == "E11"
encounters_df["has_hypertension"] = encounters_df["diagnosis_code"] == "I10"

# Aggregate to patient level
comorbidity_features = encounters_df.groupby("patient_id")[
    ["has_diabetes", "has_hypertension"]
].any().reset_index()

comorbidity_features
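
The exact comparison above matches only the three-character category codes. ICD-10 codes are often recorded with a subcategory, such as `E11.9`, which an equality test against `"E11"` would miss. The sketch below uses prefix matching on toy data instead. Whether the lab dataset contains such longer codes is an assumption to verify against the data.

```python
import pandas as pd

# Toy data with made-up values, including a subcategory code and a missing code
toy_dx = pd.DataFrame({
    "patient_id": [1, 1, 2, 3],
    "diagnosis_code": ["E11.9", "I10", "E11", None],
})

# Treat missing codes as empty strings so the flags stay strictly boolean
codes = toy_dx["diagnosis_code"].fillna("")

# Prefix match: "E11" and "E11.9" both count as type 2 diabetes
toy_dx["has_diabetes"] = codes.str.startswith("E11")
toy_dx["has_hypertension"] = codes.str.startswith("I10")

# A patient has the condition if any of their rows carries the flag
toy_flags = (
    toy_dx.groupby("patient_id")[["has_diabetes", "has_hypertension"]]
    .any()
    .reset_index()
)
print(toy_flags)
```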


## Step 4: Derive medication exposure features
Next, you will count the number of medications prescribed to each patient.

**Why this matters in healthcare:** Medication burden is a proxy for disease complexity and risk.


In [None]:

# This cell counts distinct medications per patient.
# Medication exposure features are important clinical predictors.

med_counts = meds_df.groupby("patient_id")["medication_name"].nunique().reset_index(name="medication_count")

med_counts
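
`nunique()` compares strings exactly, so differences in case or stray whitespace inflate the count. The sketch below shows the effect on toy data and one possible normalization. The toy names and the helper column `medication_clean` are illustrative assumptions, and the lab dataset may already be clean.

```python
import pandas as pd

# Toy data with made-up values; "Metformin" appears in two spellings
toy_meds = pd.DataFrame({
    "patient_id": [1, 1, 1, 2],
    "medication_name": ["Metformin", "metformin ", "Lisinopril", "Aspirin"],
})

# Counting the raw strings treats the two spellings as different drugs
raw_counts = toy_meds.groupby("patient_id")["medication_name"].nunique()

# Normalize case and surrounding whitespace before counting
toy_meds["medication_clean"] = (
    toy_meds["medication_name"].str.strip().str.lower()
)
clean_counts = toy_meds.groupby("patient_id")["medication_clean"].nunique()

print(raw_counts)
print(clean_counts)
```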


## Step 5: Construct episode-level features
You will now combine all engineered features into a single episode-level dataset.

**Why this matters in healthcare:** Models typically operate on episode- or patient-level feature tables.


In [None]:

# This cell merges all engineered features.
# Consolidating features prepares the dataset for modeling.

episode_features = (
    encounters_df
    .merge(lab_features, on="encounter_id", how="left")
    .merge(encounter_counts, on="patient_id", how="left")
    .merge(comorbidity_features, on="patient_id", how="left")
    .merge(med_counts, on="patient_id", how="left")
)

episode_features
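
Left merges can silently duplicate rows when a feature table has repeated keys, and they leave `NaN` where no match exists. The sketch below uses the `validate` argument of `merge` to guard against duplication and then handles the missing values on toy data. The toy tables, the `has_recent_lab` column, and the choice to fill missing medication counts with 0 are illustrative assumptions.

```python
import pandas as pd

# Toy data with made-up values
toy_encounters = pd.DataFrame({
    "patient_id": [1, 1, 2],
    "encounter_id": [101, 102, 201],
})
toy_lab_features = pd.DataFrame({"encounter_id": [101], "lab_value": [6.0]})
toy_med_counts = pd.DataFrame({"patient_id": [1], "medication_count": [2]})

# validate raises MergeError if the keys are not unique where expected
toy_episode = (
    toy_encounters
    .merge(toy_lab_features, on="encounter_id", how="left",
           validate="one_to_one")
    .merge(toy_med_counts, on="patient_id", how="left",
           validate="many_to_one")
)

# The episode table should still have exactly one row per encounter
assert len(toy_episode) == len(toy_encounters)

# A missing lab mean means "no lab in the window", so record that explicitly
toy_episode["has_recent_lab"] = toy_episode["lab_value"].notna()

# Assumption: a patient absent from the medication table has no medications
toy_episode["medication_count"] = (
    toy_episode["medication_count"].fillna(0).astype(int)
)
print(toy_episode)
```

Filling a missing count with 0 is only reasonable when absence from the table really means no records. A missing lab value is different, because 0 would be read as a measured result.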


## Step 6: Identify and avoid data leakage
Finally, you will review features to ensure they do not use information from the future.

**Why this matters in healthcare:** Data leakage produces misleadingly high model performance and unsafe models.


In [None]:

# This cell highlights potential data leakage risks.
# Features must only use information available at prediction time.

# Example leakage check: list labs recorded after the encounter date.
# These rows must not contribute to the features of that encounter.
leakage_check = enc_labs[enc_labs["days_before_encounter"] < 0]

leakage_check
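
The merged table pairs each encounter with all of the patient's labs, so it will normally contain later labs. What matters is that none of them survive the window filter. The sketch below runs the same check on the filtered rows using toy data. The helper `find_future_labs` is a hypothetical name introduced for illustration.

```python
import pandas as pd

def find_future_labs(lab_rows: pd.DataFrame) -> pd.DataFrame:
    """Return rows whose lab_date is later than encounter_date."""
    return lab_rows[lab_rows["lab_date"] > lab_rows["encounter_date"]]

# Toy data with made-up values: one lab before, one on, one after the encounter
toy_rows = pd.DataFrame({
    "encounter_id": [101, 101, 101],
    "encounter_date": pd.to_datetime(["2024-01-31"] * 3),
    "lab_date": pd.to_datetime(["2024-01-20", "2024-01-31", "2024-02-01"]),
})
toy_rows["days_before_encounter"] = (
    toy_rows["encounter_date"] - toy_rows["lab_date"]
).dt.days

# Apply the same 0 to 30 day window used for feature building
toy_window = toy_rows[toy_rows["days_before_encounter"].between(0, 30)]

future_in_all = find_future_labs(toy_rows)       # expected to be non-empty
future_in_window = find_future_labs(toy_window)  # must be empty

assert future_in_window.empty, "Future labs leaked into the feature window"
print(len(future_in_all), len(future_in_window))
```

A lab recorded on the same day as the encounter passes this check. If the columns carry only dates, such a result could still have been produced after the prediction time, so a stricter cutoff may be appropriate.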


## Exercises

In [None]:
# Load a synthetic encounter-level dataset for exercises
encounters_df = pd.read_csv("https://fundamentals-of-healthcare-data-science-858397.gitlab.io/labs/lab5/encounters_dataset2.csv")

# Load a synthetic lab dataset for exercises
labs_df = pd.read_csv("https://fundamentals-of-healthcare-data-science-858397.gitlab.io/labs/lab5/laboratory_dataset2.csv")

# Load a synthetic medication dataset for exercises
meds_df = pd.read_csv("https://fundamentals-of-healthcare-data-science-858397.gitlab.io/labs/lab5/medications_dataset2.csv")

### Exercise 1: Aggregate lab values over time windows

In [None]:
# your code goes here

<details>
<summary>Click here for a hint</summary>

Compute the mean lab value per encounter, using only labs from the 30 days before it.

</details>

<details>
<summary>Click here for solution</summary>

```python
# Convert date columns to datetime objects
encounters_df['encounter_date'] = pd.to_datetime(encounters_df['encounter_date'])
labs_df['lab_date'] = pd.to_datetime(labs_df['lab_date'])

# Merge encounters with labs
enc_labs = encounters_df.merge(labs_df, on="patient_id", how="left")

# Calculate days between lab and encounter
enc_labs["days_before_encounter"] = (
    enc_labs["encounter_date"] - enc_labs["lab_date"]
).dt.days

# Filter labs within 30 days before encounter
recent_labs = enc_labs[
    (enc_labs["days_before_encounter"] >= 0) &
    (enc_labs["days_before_encounter"] <= 30)
]

# Aggregate mean lab value per encounter
lab_features = recent_labs.groupby("encounter_id")["lab_value"].mean().reset_index()
lab_features
```

</details>

### Exercise 2: Calculate encounter frequency

In [None]:
# your code goes here

<details>
<summary>Click here for a hint</summary>

Count encounters per patient.

</details>

<details>
<summary>Click here for solution</summary>

```python
encounter_counts = encounters_df.groupby("patient_id")["encounter_id"].count().reset_index(name="encounter_count")
encounter_counts
```

</details>

### Exercise 3: Encode comorbidities

In [None]:
# your code goes here

<details>
<summary>Click here for a hint</summary>

Create diabetes and hypertension flags.

</details>

<details>
<summary>Click here for solution</summary>

```python
# Create comorbidity flags
encounters_df["has_diabetes"] = encounters_df["diagnosis_code"] == "E11"
encounters_df["has_hypertension"] = encounters_df["diagnosis_code"] == "I10"

# Aggregate to patient level
comorbidity_features = encounters_df.groupby("patient_id")[
    ["has_diabetes", "has_hypertension"]
].any().reset_index()

comorbidity_features
```

</details>

### Exercise 4: Derive medication counts

In [None]:
# your code goes here

<details>
<summary>Click here for a hint</summary>

Count unique medications.

</details>

<details>
<summary>Click here for solution</summary>

```python
med_counts = meds_df.groupby("patient_id")["medication_name"].nunique().reset_index(name="medication_count")

med_counts

```

</details>

### Exercise 5: Build episode-level features

In [None]:
# your code goes here

<details>
<summary>Click here for a hint</summary>

Merge all feature tables.

</details>

<details>
<summary>Click here for solution</summary>

```python
episode_features = (
    encounters_df
    .merge(lab_features, on="encounter_id", how="left")
    .merge(encounter_counts, on="patient_id", how="left")
    .merge(comorbidity_features, on="patient_id", how="left")
    .merge(med_counts, on="patient_id", how="left")
)

episode_features
```

</details>

### Exercise 6: Check for data leakage

In [None]:
# your code goes here

<details>
<summary>Click here for a hint</summary>

Identify labs after encounters.

</details>

<details>
<summary>Click here for solution</summary>

```python
leakage_check = enc_labs[enc_labs["days_before_encounter"] < 0]

leakage_check
```

</details>

## Congratulations!

You have engineered clinically meaningful features while avoiding data leakage. These features reflect real-world healthcare data constraints and are suitable for downstream analytics and modeling tasks.

## Authors
Ramesh Sannareddy  
<br>  
© SkillUp. All rights reserved.
<br>  
Materials may not be reproduced in whole or in part without written permission from SkillUp.