# Building a Healthcare Analytics Dataset from Raw Multi-Source Data

## Environment setup and synthetic data

**Important — Do not modify these datasets**

The synthetic datasets below are fixed and shared by all learners.
Do not modify the cell that defines them or change any data values, because **grading is based on the exact outputs** produced from these datasets.

You should only write code in the sections of the notebook that are clearly marked for you to complete.

In [None]:
# ⚠️ DO NOT MODIFY THIS CELL
# This dataset is fixed and used for consistent grading across all learners.

import pandas as pd
import numpy as np

ehr_df = pd.DataFrame({
    "patient_id": ["P001","P002","P003","P003","P004"],
    "patient_name": ["John Doe","Jane Smith","Alice Brown","Alice Brown","Bob Lee"],
    "dob": pd.to_datetime(["1980-05-12","1975-09-30","1990-02-15","1990-02-15","1965-11-01"]),
    "encounter_id": ["E1","E2","E3","E3","E4"],
    "encounter_date": pd.to_datetime(["2023-01-10","2023-01-12","2023-01-15","2023-01-15","2023-01-18"]),
    "heart_rate": [72,-10,85,85,220]
})

claims_df = pd.DataFrame({
    "member_id": ["001","002","003","004"],
    "diagnosis_code": ["I10","E11","E11","I10"],
    "claim_amount": [12000,18000,15000,20000]
})

labs_df = pd.DataFrame({
    "lab_patient_id": ["P-001", "P-002", "P-003", "P-004"],
    "lab_date": pd.to_datetime(["2023-01-08", "2023-01-09", "2023-01-14", "2023-01-17"]),
    "glucose": [95, 5.5, 110, 6.2],
    "unit": ["mg/dL", "mmol/L", "mg/dL", "mmol/L"]
})

### Step 1: Inspect dataset structure

**What is expected**
- Display the number of rows and columns of the EHR dataset
- Display all column names and their data types

For example, show the dataset's shape and the data type of each column.
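
As a hedged illustration of this kind of inspection, the sketch below uses a small made-up frame. `demo_df` and its values are hypothetical and are not the graded dataset.

```python
import pandas as pd

# Hypothetical demo frame, not the graded dataset.
demo_df = pd.DataFrame({
    "patient_id": ["A1", "A2"],
    "visit_date": pd.to_datetime(["2024-03-01", "2024-03-02"]),
    "heart_rate": [70, 88],
})

print(demo_df.shape)   # tuple of (number of rows, number of columns)
print(demo_df.dtypes)  # one data type per column, indexed by column name
```

`DataFrame.info()` gives a similar summary in one call, along with non-null counts.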

In [None]:
# your code goes here

### Step 2: Identify duplicate records

**What is expected**
- Print the total number of duplicate rows
- Show which rows are duplicates using a row-level indicator (for example, a True/False flag for each row)
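
One possible way to produce such an indicator is `DataFrame.duplicated()`. The sketch below runs on a made-up frame (`demo_df` is hypothetical, not the graded data) and assumes that a duplicate means a row identical to an earlier one in every column.

```python
import pandas as pd

# Hypothetical demo frame with one fully repeated row.
demo_df = pd.DataFrame({
    "patient_id": ["A1", "A2", "A2"],
    "heart_rate": [70, 88, 88],
})

# True for each row that repeats an earlier row; the first occurrence stays False.
dup_mask = demo_df.duplicated()

print("Duplicate rows:", int(dup_mask.sum()))
print(dup_mask)
```

Passing `keep=False` would instead flag every member of a duplicated group, including the first occurrence.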


In [None]:
# your code goes here

### Step 3: Remove duplicate records

**What is expected**
- Remove duplicate rows from the dataset
- Print the updated dataset shape (rows and columns)
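
A minimal sketch of de-duplication on a made-up frame (`demo_df` is hypothetical), assuming exact full-row duplicates and keeping the first occurrence:

```python
import pandas as pd

# Hypothetical demo frame with one fully repeated row.
demo_df = pd.DataFrame({
    "patient_id": ["A1", "A2", "A2"],
    "heart_rate": [70, 88, 88],
})

# Keep the first occurrence of each row and renumber the index from 0.
deduped_df = demo_df.drop_duplicates().reset_index(drop=True)

print(deduped_df.shape)
```

Resetting the index is optional; it only keeps row labels contiguous after rows are dropped.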

In [None]:
# your code goes here

### Step 4: Identify invalid clinical values

**What is expected**
- Print any rows where heart rate values are invalid (less than zero)
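
The rule above (heart rate below zero) can be expressed as a boolean filter. The sketch uses a made-up frame; `demo_df` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical demo frame with one negative heart rate.
demo_df = pd.DataFrame({
    "patient_id": ["A1", "A2", "A3"],
    "heart_rate": [70, -5, 88],
})

# Select only the rows that violate the "less than zero" rule.
invalid_rows = demo_df[demo_df["heart_rate"] < 0]

print(invalid_rows)
```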

In [None]:
# your code goes here

### Step 5: Correct invalid clinical values

**What is expected**
- Replace invalid heart rate values with missing values (NaN)
- Print the updated heart rate column
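
One way to blank out invalid readings is `Series.mask`, which replaces values where a condition holds with NaN. This is a sketch on a made-up frame (`demo_df` is hypothetical), using the same "less than zero" rule as the previous step.

```python
import pandas as pd

# Hypothetical demo frame with one negative heart rate.
demo_df = pd.DataFrame({
    "patient_id": ["A1", "A2", "A3"],
    "heart_rate": [70, -5, 88],
})

# mask() returns NaN wherever the condition is True.
# The integer column is upcast to float because NaN is a float value.
demo_df["heart_rate"] = demo_df["heart_rate"].mask(demo_df["heart_rate"] < 0)

print(demo_df["heart_rate"])
```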

In [None]:
# your code goes here

### Step 6: Standardize patient identifiers across sources

**What is expected**
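- Create a standardized patient identifier column in each of the three datasets (EHR, claims, and labs), so that the same patient has the same value in every source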
- Print the standardized identifiers from all three datasets
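
A hedged sketch of one normalization strategy: keep only the digits of each identifier. The frames, the helper `to_std_id`, and the column name `patient_key` are all hypothetical, and the choice of a digits-only key is an assumption (adding a prefix to every source would work equally well).

```python
import pandas as pd

# Hypothetical frames whose ID formats differ by source.
demo_ehr = pd.DataFrame({"patient_id": ["P010", "P011"]})
demo_claims = pd.DataFrame({"member_id": ["010", "011"]})
demo_labs = pd.DataFrame({"lab_patient_id": ["P-010", "P-011"]})

def to_std_id(ids: pd.Series) -> pd.Series:
    """Return the first run of digits in each identifier as a string."""
    return ids.astype(str).str.extract(r"(\d+)", expand=False)

demo_ehr["patient_key"] = to_std_id(demo_ehr["patient_id"])
demo_claims["patient_key"] = to_std_id(demo_claims["member_id"])
demo_labs["patient_key"] = to_std_id(demo_labs["lab_patient_id"])

print(demo_ehr["patient_key"].tolist())
print(demo_claims["patient_key"].tolist())
print(demo_labs["patient_key"].tolist())
```

Keeping the key as a string preserves leading zeros, which an integer conversion would drop.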

In [None]:
# your code goes here

### Step 7: Map diagnosis codes to clinical conditions

**What is expected**
- Map diagnosis codes to readable clinical condition names
- Print diagnosis codes alongside mapped conditions
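
Code-to-label mapping can be sketched with a dictionary and `Series.map`. In ICD-10, I10 denotes essential (primary) hypertension and E11 denotes type 2 diabetes mellitus; the exact label strings, the frame `demo_claims`, and the column name `condition` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical claims frame; "Z99" stands in for a code missing from the lookup.
demo_claims = pd.DataFrame({"diagnosis_code": ["I10", "E11", "Z99"]})

# Assumed label wording; adjust to whatever naming the analysis requires.
code_to_condition = {
    "I10": "Hypertension",
    "E11": "Type 2 diabetes",
}

# Codes absent from the dictionary map to NaN rather than raising an error.
demo_claims["condition"] = demo_claims["diagnosis_code"].map(code_to_condition)

print(demo_claims[["diagnosis_code", "condition"]])
```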

In [None]:
# your code goes here

### Step 8: Resolve laboratory unit mismatches

**What is expected**
- Convert all glucose values to a consistent unit (mg/dL)
- Print original values, units, and converted values
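
The sketch below converts mixed units on a made-up frame (`demo_labs` is hypothetical). It assumes the common approximation 1 mmol/L ≈ 18 mg/dL for glucose; the assignment text does not state which factor to use, and a more precise value is about 18.016. The column name `glucose_mg_dl` is also an assumption.

```python
import pandas as pd

# Hypothetical lab frame with mixed units.
demo_labs = pd.DataFrame({
    "glucose": [100.0, 5.0],
    "unit": ["mg/dL", "mmol/L"],
})

MMOL_TO_MGDL = 18.0  # assumed approximate conversion factor for glucose

is_mmol = demo_labs["unit"] == "mmol/L"

# Keep values already in mg/dL; multiply mmol/L values by the factor.
demo_labs["glucose_mg_dl"] = demo_labs["glucose"].where(
    ~is_mmol, demo_labs["glucose"] * MMOL_TO_MGDL
)

print(demo_labs[["glucose", "unit", "glucose_mg_dl"]])
```

Writing the result to a new column leaves the original values and units available for the side-by-side printout.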

In [None]:
# your code goes here

### Step 9: Merge EHR, claims, and laboratory data

**What is expected**
- Merge all datasets using the standardized patient identifier
- Display the column names to confirm the merge
- Display key fields from each source system (for example, a patient identifier, an encounter field, a diagnosis field, and a laboratory value) to verify the merged result
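
A minimal sketch of a three-way merge on a shared key, using made-up frames. The frame names, the key `patient_key`, and the choice of left joins from the EHR side are assumptions; an inner join may be equally appropriate when every patient appears in all sources.

```python
import pandas as pd

# Hypothetical frames that already share a standardized key.
demo_ehr = pd.DataFrame({"patient_key": ["010", "011"], "encounter_id": ["X1", "X2"]})
demo_claims = pd.DataFrame({"patient_key": ["010", "011"], "diagnosis_code": ["I10", "E11"]})
demo_labs = pd.DataFrame({"patient_key": ["010", "011"], "glucose_mg_dl": [100.0, 90.0]})

# Chain two merges; how="left" keeps every EHR row even without a match.
merged_df = (
    demo_ehr
    .merge(demo_claims, on="patient_key", how="left")
    .merge(demo_labs, on="patient_key", how="left")
)

print(merged_df.columns.tolist())
print(merged_df[["patient_key", "encounter_id", "diagnosis_code", "glucose_mg_dl"]])
```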

In [None]:
# your code goes here

### Step 10: Apply HIPAA Safe Harbor de-identification

**What is expected**
- Remove the patient name column, which represents a direct identifier in this dataset
- Print the column list to confirm that the patient_name column has been removed
- Create a generalized year-of-birth field from the date of birth (for example, extracting the year)
- To keep the output readable for grading, **print only the patient identifier and birth year columns**

Note: The full dataset should still be updated. You are only limiting what is printed.
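
The two operations described above can be sketched as follows on a made-up frame. `demo_df`, `patient_key`, and `birth_year` are hypothetical names chosen for illustration.

```python
import pandas as pd

# Hypothetical frame containing a direct identifier and a full birth date.
demo_df = pd.DataFrame({
    "patient_key": ["010", "011"],
    "patient_name": ["Pat A", "Pat B"],
    "dob": pd.to_datetime(["1970-01-15", "1985-07-04"]),
})

# Drop the direct identifier.
demo_df = demo_df.drop(columns=["patient_name"])

# Generalize the birth date to the year only.
demo_df["birth_year"] = demo_df["dob"].dt.year

print(demo_df.columns.tolist())
print(demo_df[["patient_key", "birth_year"]])
```

Safe Harbor treats date elements more specific than the year as identifiers, so a complete de-identification workflow would likely also drop the full `dob` column; the step as written asks only for the year field.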

In [None]:
# your code goes here

### Step 11: Create encounter utilization features

**What is expected**
- Count the number of encounters per patient
- Print the encounter count for each patient using the patient identifier
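
One way to count encounters is a group-by with `nunique`, which counts distinct encounter IDs per patient. This is a sketch on a made-up frame; `demo_df`, `patient_key`, and `encounter_count` are hypothetical names.

```python
import pandas as pd

# Hypothetical frame: patient 010 has two encounters, patient 011 has one.
demo_df = pd.DataFrame({
    "patient_key": ["010", "010", "011"],
    "encounter_id": ["X1", "X2", "X3"],
})

# Count distinct encounters per patient and return a tidy two-column frame.
encounter_counts = (
    demo_df.groupby("patient_key")["encounter_id"]
    .nunique()
    .reset_index(name="encounter_count")
)

print(encounter_counts)
```

Using `nunique` rather than `count` avoids inflating the total if a merge has repeated the same encounter across several rows.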

In [None]:
# your code goes here

### Step 12: Engineer final model-ready features

**What is expected**
- Aggregate the data to the patient level
- Create a final feature table with one row per patient that includes a glucose-related feature and an encounter count
- Print the final feature table
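
A hedged sketch of a one-row-per-patient table built with named aggregation. The frame, the column names, and the choice of mean glucose as the glucose-related feature are assumptions; a maximum or most recent value would fit the requirement as well.

```python
import pandas as pd

# Hypothetical merged frame with one row per encounter.
demo_df = pd.DataFrame({
    "patient_key": ["010", "010", "011"],
    "encounter_id": ["X1", "X2", "X3"],
    "glucose_mg_dl": [100.0, 110.0, 90.0],
})

# Named aggregation: each keyword becomes an output column.
features_df = demo_df.groupby("patient_key", as_index=False).agg(
    mean_glucose_mg_dl=("glucose_mg_dl", "mean"),
    encounter_count=("encounter_id", "nunique"),
)

print(features_df)
```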

In [None]:
# your code goes here