# Cleaning Clinical Datasets and Applying HIPAA De-Identification

Time estimate: **20** minutes

## Objectives
After completing this lab, you will be able to:
- Identify and remove duplicate clinical records.  
- Correct invalid or inconsistent clinical values.  
- Handle missing data using appropriate strategies.  
- Detect and review potential outliers.  
- Apply HIPAA Safe Harbor de-identification rules.


## What you will do in this lab
In this lab, you will work with a synthetic clinical dataset to clean, prepare, and de-identify healthcare data for safe analysis.

You will:

- Load and inspect a synthetic clinical dataset.  
- Clean duplicate and invalid records.  
- Address missing values in key fields.  
- Identify potential outliers in clinical measurements.  
- De-identify the dataset using HIPAA Safe Harbor rules.


## Overview
Clinical datasets often contain data quality issues such as duplicates, invalid values,
and missing information. In addition, healthcare data must be handled carefully to
protect patient privacy. This lab focuses on practical data cleaning techniques while
also applying HIPAA Safe Harbor de-identification rules to produce a privacy-compliant,
analysis-ready dataset.


## About the dataset/environment
You will work with a **synthetic clinical dataset** that includes patient demographics,
visit information, and vital signs. The dataset intentionally contains duplicates,
invalid entries, missing values, outliers, and identifiable information such as names
and dates. These issues simulate real-world healthcare data challenges.


## Setup

In [None]:

# This cell imports required libraries and loads a synthetic clinical dataset.


import pandas as pd
import numpy as np

# Load a synthetic clinical dataset
clinical_df = pd.read_csv("https://fundamentals-of-healthcare-data-science-858397.gitlab.io/labs/lab4/clinical_data_quality_lab_dataset.csv")

# Convert date columns to datetime
clinical_df["date_of_birth"] = pd.to_datetime(clinical_df["date_of_birth"])
clinical_df["visit_date"] = pd.to_datetime(clinical_df["visit_date"])

# Display the top 5 rows in the dataset
clinical_df.head()
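
Before cleaning, it can help to check column types and count missing values per column. The following is a minimal sketch on a small hypothetical DataFrame (`toy_df` and its values are invented for illustration; only the column names come from this lab). It also shows `errors="coerce"`, an optional argument to `pd.to_datetime` that turns unparseable dates into `NaT` instead of raising an error.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data that reuses the lab's column names
toy_df = pd.DataFrame({
    "patient_name": ["Ann Lee", "Bob Roy", "Ann Lee"],
    "date_of_birth": ["1980-05-01", "not a date", "1980-05-01"],
    "heart_rate": [72, -5, 72],
    "systolic_bp": [120.0, np.nan, 120.0],
})

# errors="coerce" converts values that cannot be parsed into NaT
toy_df["date_of_birth"] = pd.to_datetime(toy_df["date_of_birth"], errors="coerce")

# Count missing values in each column
missing_per_column = toy_df.isna().sum()

print(toy_df.dtypes)
print(missing_per_column)
```

Coercing bad dates keeps the load step from failing, but the resulting `NaT` values still need review as missing data.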


## Step 1: Identify duplicate records
You will start by checking whether the same clinical record appears more than once.
Duplicate records are common in healthcare systems due to data entry or system issues.

**Why this matters in healthcare:** Duplicate records can inflate patient counts and distort clinical metrics.


In [None]:
# Print the number of rows and columns in the dataset.
print(f"Number of rows: {clinical_df.shape[0]}")
print(f"Number of columns: {clinical_df.shape[1]}")

In [None]:

# Detecting duplicates is the first step in cleaning clinical data.
# This cell identifies duplicate rows in the dataset.


# Count the number of duplicate rows
print(clinical_df.duplicated().sum())


In [None]:

# Display the duplicate rows (each repeat after the first occurrence)
clinical_df[clinical_df.duplicated()]
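
By default, `duplicated()` marks only the later copies of a fully identical row. The sketch below, on hypothetical toy data, shows two variations that may be useful during review: `keep=False` marks every member of a duplicate group, and `subset` compares only selected columns. Which columns define "the same visit" is an assumption here, not something this lab specifies.

```python
import pandas as pd

# Hypothetical toy data that reuses the lab's column names
toy_df = pd.DataFrame({
    "patient_name": ["Ann Lee", "Bob Roy", "Ann Lee", "Ann Lee"],
    "visit_date": ["2024-01-10", "2024-01-11", "2024-01-10", "2024-02-01"],
    "heart_rate": [72, 80, 72, 75],
})

# Default keep="first": only the later repeats are counted
later_copies = toy_df.duplicated().sum()

# keep=False: show every row that belongs to a duplicate group
all_copies = toy_df[toy_df.duplicated(keep=False)]

# subset: treat rows as duplicates when patient and visit date match
same_visit = toy_df.duplicated(subset=["patient_name", "visit_date"], keep=False).sum()

print(later_copies, len(all_copies), same_visit)
```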

## Step 2: Remove duplicate records
After identifying duplicates, you will remove them to ensure each record is counted once.

**Why this matters in healthcare:** Removing duplicates prevents over-counting patients and visits.


In [None]:

# This cell removes duplicate rows from the dataset.
# Deduplication ensures accurate analysis.

clinical_df_deduped = clinical_df.drop_duplicates().copy()
clinical_df_deduped
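
A quick sanity check after deduplication is to compare row counts and confirm that no duplicates remain. This is a small sketch on hypothetical toy data; `reset_index(drop=True)` is optional and only renumbers the rows after the drop.

```python
import pandas as pd

# Hypothetical toy data with one exact duplicate row
toy_df = pd.DataFrame({
    "patient_name": ["Ann Lee", "Bob Roy", "Ann Lee", "Ann Lee"],
    "visit_date": ["2024-01-10", "2024-01-11", "2024-01-10", "2024-02-01"],
    "heart_rate": [72, 80, 72, 75],
})

# Remove exact duplicates and renumber the remaining rows
deduped = toy_df.drop_duplicates().reset_index(drop=True)

# Compare sizes before and after, and confirm nothing is duplicated now
rows_removed = len(toy_df) - len(deduped)
remaining_duplicates = deduped.duplicated().sum()

print(f"Rows removed: {rows_removed}")
print(f"Remaining duplicates: {remaining_duplicates}")
```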


## Step 3: Correct invalid clinical values
Next, you will look for values that do not make clinical sense, such as negative heart rates.

**Why this matters in healthcare:** Invalid clinical values can lead to unsafe or misleading conclusions.


In [None]:

# This cell replaces invalid clinical values with missing values (NaN).
# Removing invalid values improves data reliability.

# Replace negative heart rates with NaN (Not a Number)
clinical_df_deduped.loc[clinical_df_deduped["heart_rate"] < 0, "heart_rate"] = np.nan

clinical_df_deduped
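
The same idea extends from negative values to any physiologically implausible value. The sketch below applies a range check to several columns at once. The limits in `plausible_limits` are hypothetical placeholders for illustration only, not clinical guidance, and the toy data is invented.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data that reuses the lab's column names
toy_df = pd.DataFrame({
    "heart_rate": [72.0, -5.0, 0.0, 410.0, np.nan],
    "systolic_bp": [120.0, 300.0, 95.0, -1.0, 130.0],
})

# Hypothetical plausibility limits (assumed for this example only)
plausible_limits = {"heart_rate": (20, 300), "systolic_bp": (40, 280)}

for column, (low, high) in plausible_limits.items():
    # between() is inclusive on both ends and returns False for NaN
    in_range = toy_df[column].between(low, high)
    # where() keeps in-range values and sets everything else to NaN
    toy_df[column] = toy_df[column].where(in_range)

print(toy_df)
```

Values set to `NaN` here become missing data, so they are handled by whatever missing-value strategy is applied afterward.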


## Step 4: Handle missing values
Clinical data often contains missing measurements.
Here, you will apply a simple and commonly used strategy, median imputation, to handle missing values.

**Why this matters in healthcare:** Improper handling of missing data can bias results.


In [None]:

# This cell handles missing values in clinical measurements.
# Filling or flagging missing values is a common cleaning step.

# Fill missing systolic blood pressure with median
median_bp = clinical_df_deduped["systolic_bp"].median()
clinical_df_deduped["systolic_bp"] = clinical_df_deduped["systolic_bp"].fillna(median_bp)

clinical_df_deduped
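
Median imputation hides which values were originally missing. One common variation, sketched below on hypothetical toy data, records a flag column before filling so that later analysis can tell measured values from imputed ones. The column name `systolic_bp_was_missing` is an assumption made for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing blood pressure readings
toy_df = pd.DataFrame({"systolic_bp": [110.0, np.nan, 130.0, 150.0, np.nan]})

# Record which rows were missing before any values are filled
toy_df["systolic_bp_was_missing"] = toy_df["systolic_bp"].isna()

# median() skips NaN by default, so it uses only the observed values
median_bp = toy_df["systolic_bp"].median()
toy_df["systolic_bp"] = toy_df["systolic_bp"].fillna(median_bp)

print(toy_df)
```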


## Step 5: Detect potential outliers
You will now review clinical measurements to identify unusually high or low values.

**Why this matters in healthcare:** Outliers might indicate data errors or rare but important clinical events.


In [None]:

# This cell identifies potential outliers using simple thresholds.
# Outlier detection supports data quality review.

# Flag extremely high heart rates
clinical_df_deduped["heart_rate_outlier"] = clinical_df_deduped["heart_rate"] > 180

clinical_df_deduped
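
A fixed threshold is one way to flag outliers. Another widely used option is the interquartile range (IQR) rule, which derives the cutoffs from the data itself. The sketch below is an alternative technique shown on hypothetical heart rate values; it is not the method used in this lab's cell, and the 1.5 multiplier is the conventional default rather than a clinical standard.

```python
import pandas as pd

# Hypothetical heart rate readings with one extreme value
heart_rate = pd.Series([60, 65, 70, 72, 75, 78, 80, 85, 190], name="heart_rate")

# Quartiles and interquartile range
q1 = heart_rate.quantile(0.25)
q3 = heart_rate.quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged for review
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
iqr_outlier = (heart_rate < lower_bound) | (heart_rate > upper_bound)

print(heart_rate[iqr_outlier])
```

Flagged values are candidates for review, not automatic deletions, since an extreme reading can reflect a real clinical event.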


## Step 6: Apply HIPAA Safe Harbor de-identification
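Finally, you will remove or generalize identifiers following the HIPAA Safe Harbor method, which lists 18 categories of identifiers (such as names and dates more specific than the year) that must be removed. This lab drops patient names and reduces dates to the year.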

**Why this matters in healthcare:** De-identification is required to protect patient privacy and enable data sharing.


In [None]:

# This cell applies HIPAA Safe Harbor de-identification.
# Removing direct identifiers helps protect patient privacy.

# Drop direct identifiers
deidentified_df = clinical_df_deduped.drop(columns=["patient_name"])

# Generalize dates to year only
deidentified_df["birth_year"] = deidentified_df["date_of_birth"].dt.year
deidentified_df["visit_year"] = deidentified_df["visit_date"].dt.year

# Drop original date columns
deidentified_df = deidentified_df.drop(columns=["date_of_birth", "visit_date"])

deidentified_df
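
Safe Harbor also requires that ages over 89, and any date elements (including year) that reveal such an age, be aggregated into a single "90 or older" category. A birth year alone can therefore still be identifying for the oldest patients. The sketch below shows one possible way to handle this on hypothetical toy data; the `age_group` column name and the year-difference age approximation are assumptions made for this example.

```python
import pandas as pd

# Hypothetical toy data using the year columns created in this step
toy_df = pd.DataFrame({
    "birth_year": [1990, 1930, 1955],
    "visit_year": [2024, 2024, 2024],
})

# Approximate age at visit from the year difference (can overstate by one year)
age = toy_df["visit_year"] - toy_df["birth_year"]

# Keep ages up to 89 as-is and collapse everything else into "90+"
toy_df["age_group"] = age.astype(str).where(age < 90, "90+")

# Drop birth_year, because it would reveal the exact age of the 90+ group
toy_df = toy_df.drop(columns=["birth_year"])

print(toy_df)
```

Because the year difference can overstate age by one year, this approach errs on the side of placing borderline patients in the "90+" group.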


## Exercises

In [None]:
# Load the synthetic clinical dataset for the exercises
clinical_df = pd.read_csv("https://fundamentals-of-healthcare-data-science-858397.gitlab.io/labs/lab4/clinical_data_quality_lab_dataset_exercises.csv")
clinical_df["date_of_birth"] = pd.to_datetime(clinical_df["date_of_birth"])
clinical_df["visit_date"] = pd.to_datetime(clinical_df["visit_date"])
clinical_df.head()


### Exercise 1: Identify duplicates

In [None]:
# your code goes here

<details>
<summary>Click here for a hint</summary>

Use `duplicated()` to find repeated rows.

</details>

<details>
<summary>Click here for solution</summary>

```python
clinical_df[clinical_df.duplicated()]
```

</details>

### Exercise 2: Remove duplicates

In [None]:
# your code goes here

<details>
<summary>Click here for a hint</summary>

Drop duplicate rows.

</details>

<details>
<summary>Click here for solution</summary>

```python
clinical_df_deduped = clinical_df.drop_duplicates().copy()
clinical_df_deduped
```

</details>

### Exercise 3: Fix invalid values

In [None]:
# your code goes here

<details>
<summary>Click here for a hint</summary>

Replace negative heart rates.

</details>

<details>
<summary>Click here for solution</summary>

```python
clinical_df_deduped.loc[clinical_df_deduped["heart_rate"] < 0, "heart_rate"] = np.nan

clinical_df_deduped
```

</details>

### Exercise 4: Handle missing values

In [None]:
# your code goes here

<details>
<summary>Click here for a hint</summary>

Fill missing blood pressure values.

</details>

<details>
<summary>Click here for solution</summary>

```python
median_bp = clinical_df_deduped["systolic_bp"].median()
clinical_df_deduped["systolic_bp"] = clinical_df_deduped["systolic_bp"].fillna(median_bp)

clinical_df_deduped
```

</details>

### Exercise 5: Flag outliers

In [None]:
# your code goes here

<details>
<summary>Click here for a hint</summary>

Identify extreme heart rates.

</details>

<details>
<summary>Click here for solution</summary>

```python
clinical_df_deduped["heart_rate_outlier"] = clinical_df_deduped["heart_rate"] > 180

clinical_df_deduped
```

</details>

### Exercise 6: De-identify the dataset

In [None]:
# your code goes here

<details>
<summary>Click here for a hint</summary>

Remove names and generalize dates.

</details>

<details>
<summary>Click here for solution</summary>

```python
# Drop direct identifiers
deidentified_df = clinical_df_deduped.drop(columns=["patient_name"])

# Generalize dates to year only
deidentified_df["birth_year"] = deidentified_df["date_of_birth"].dt.year
deidentified_df["visit_year"] = deidentified_df["visit_date"].dt.year

# Drop original date columns
deidentified_df = deidentified_df.drop(columns=["date_of_birth", "visit_date"])

deidentified_df
```

</details>


## Congratulations!

You have cleaned a clinical dataset and applied HIPAA de-identification rules. You now have an analysis-ready, privacy-compliant dataset that reflects the data quality and governance practices expected in real-world healthcare analytics.

## Authors
Ramesh Sannareddy  

<br>

© SkillUp. All rights reserved.


Materials may not be reproduced in whole or in part without written permission from SkillUp.