# Exploring Raw EHR Data for Structure and Quality Issues

Time estimate: **20** minutes

## Objectives
After completing this lab, you will be able to:
- Inspect the structure and schema of raw EHR data.  
- Identify missing values, duplicates, and inconsistencies.  
- Analyze temporal and clinical data quality issues.  
- Explain how data quality impacts downstream healthcare analytics.


## What you will do in this lab
In this lab, you will work with a synthetic EHR dataset to explore its structure and identify common data quality issues that can affect healthcare analysis.

You will:

- Load a synthetic EHR dataset.  
- Carefully explore how the data is organized.  
- Look for missing, duplicated, and inconsistent information.  
- Check whether dates and clinical values are reasonable.  
- Summarize key data quality issues found in the dataset.


## Overview
Electronic health record (EHR) data is created as part of routine clinical care and hospital operations.
Because it is collected for operational purposes rather than analysis, the data often contains gaps,
duplicates, and logical inconsistencies. Before any reporting, analytics, or machine learning can be
trusted, analysts must first explore the raw data and understand these issues. This lab focuses on
building that foundational inspection skill.


## About the dataset/environment
You will work with a **synthetic EHR dataset** designed to resemble real-world healthcare data.
The dataset includes patient identifiers, encounter identifiers, vital signs, and admission and
discharge dates. The data is intentionally imperfect to reflect what analysts typically encounter
when working with real hospital systems.


## Setup

In [None]:

# This cell imports required libraries and loads a synthetic EHR dataset.

import pandas as pd  # Used for working with tabular data
import numpy as np   # Used for numeric operations

# Load a synthetic EHR dataset
ehr_df = pd.read_csv("https://fundamentals-of-healthcare-data-science-858397.gitlab.io/labs/lab1/ehr_dataset.csv")

# Display the first few rows to confirm the dataset loaded correctly
ehr_df.head()


## Step 1: Inspect dataset structure
In this step, you will take a broad look at the dataset. You will check how many rows
and columns are present, what each column represents, and which data type each column is stored as.
This sets expectations before any deeper analysis.

**Why this matters in healthcare:** Misunderstanding data structure often leads to incorrect joins,
misinterpreted fields, and inaccurate reports.


In [None]:

# This cell displays the overall structure of the dataset.
# It helps you understand the size of the data and how each column is stored.

# Display the number of rows and columns in the dataset
# (print is needed because a notebook cell only displays its last expression)
print(ehr_df.shape)

# Display column names, data types, and non-null counts
ehr_df.info()
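
The structure check above can also be condensed into a single table. The following is a minimal sketch on a small hypothetical frame (`toy_df` and its values are made up for illustration; only the column names mirror the lab dataset). It lists, for each column, the stored data type, the non-null count, and the number of distinct values.

```python
import pandas as pd

# Hypothetical toy frame: column names mirror the lab dataset, values are invented
toy_df = pd.DataFrame({
    "patient_id": ["P1", "P2", "P2", None],
    "encounter_id": ["E1", "E2", "E3", "E4"],
    "heart_rate": [72, 88, None, 65],
    "admission_time": ["2024-01-01", "2024-01-03", "2024-01-05", "2024-01-07"],
})

# One row per column: stored dtype, non-null count, and number of distinct values
structure_summary = pd.DataFrame({
    "dtype": toy_df.dtypes.astype(str),
    "non_null": toy_df.notna().sum(),
    "n_unique": toy_df.nunique(),
})
print(structure_summary)
```

A date column that shows up with a text data type, as `admission_time` does here, is a common sign that it still needs to be parsed before any time-based comparison.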


## Step 2: Evaluate identifier quality
Here, you will focus on patient and encounter identifiers. These fields are used to link records across
tables and time. You will check whether these identifiers are missing and whether they behave as expected.

**Why this matters in healthcare:** Missing or inconsistent identifiers can fragment patient records
or incorrectly combine data from different individuals.


In [None]:

# This cell checks whether identifier fields are complete and consistent.
# Reliable identifiers are critical for accurate patient-level analysis.

# Count missing values in identifier columns
print(ehr_df[['patient_id', 'encounter_id']].isnull().sum())

# Count how many unique patients and encounters exist
print("Unique patients:", ehr_df['patient_id'].nunique())
print("Unique encounters:", ehr_df['encounter_id'].nunique())
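
Counting missing and unique values does not show whether identifiers "behave as expected". One possible follow-up, sketched here on a hypothetical `toy_df` with invented values, assumes that each `encounter_id` should appear on one row and belong to one patient. That rule is an assumption for illustration; the lab dataset may legitimately hold several rows per encounter.

```python
import pandas as pd

# Hypothetical toy frame: values are invented for illustration
toy_df = pd.DataFrame({
    "patient_id": ["P1", "P2", "P3", None, "P4"],
    "encounter_id": ["E1", "E2", "E2", "E3", "E4"],
})

# Rows whose encounter ID appears more than once (keep=False marks every copy)
repeated_encounters = toy_df[toy_df["encounter_id"].duplicated(keep=False)]

# Encounter IDs linked to more than one distinct patient (missing IDs are not counted)
patients_per_encounter = toy_df.groupby("encounter_id")["patient_id"].nunique()
conflicting_encounters = patients_per_encounter[patients_per_encounter > 1].index.tolist()

print(repeated_encounters)
print("Encounters linked to several patients:", conflicting_encounters)
```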


## Step 3: Analyze missing data
In this step, you will examine how much data is missing and where those gaps occur. Some missing values
might be expected, while others can signal data capture or system issues.

**Why this matters in healthcare:** Missing clinical or demographic data can introduce bias and
reduce confidence in analytical results.


In [None]:

# This cell calculates how many values are missing in each column.
# Understanding missingness helps assess data completeness.

# Count missing values per column
ehr_df.isnull().sum()
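
Raw counts are easier to compare when they are shown alongside percentages. The sketch below uses a hypothetical `toy_df` with invented values and ranks columns by the share of missing entries.

```python
import pandas as pd

# Hypothetical toy frame: values are invented for illustration
toy_df = pd.DataFrame({
    "patient_id": ["P1", "P2", "P3", "P4"],
    "heart_rate": [72, None, None, 65],
    "systolic_bp": [120, 130, None, 110],
})

# isnull().mean() gives the fraction of missing values per column
missing_summary = pd.DataFrame({
    "missing_count": toy_df.isnull().sum(),
    "missing_pct": (toy_df.isnull().mean() * 100).round(1),
}).sort_values("missing_pct", ascending=False)

print(missing_summary)
```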


## Step 4: Detect duplicate records
Next, you will check whether the same record appears more than once in the dataset. Duplicate rows can
occur due to system errors or data integration issues.

**Why this matters in healthcare:** Duplicate records can inflate patient counts, visit volumes,
and outcome measures.


In [None]:

# This cell identifies duplicate rows in the dataset.
# Duplicate records can distort summary statistics and trends.

# Count the number of duplicate rows
ehr_df.duplicated().sum()
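
`duplicated()` with no arguments only flags rows that match in every column. Records that share identifiers but differ in some other field are missed. The sketch below uses a hypothetical `toy_df` with invented values and compares exact duplicates with duplicates on a key. Treating `patient_id` plus `encounter_id` as the key is an assumption made for illustration.

```python
import pandas as pd

# Hypothetical toy frame: values are invented for illustration
toy_df = pd.DataFrame({
    "patient_id": ["P1", "P1", "P2", "P2", "P3"],
    "encounter_id": ["E1", "E1", "E2", "E2", "E3"],
    "heart_rate": [72, 72, 80, 85, 90],
})

# Rows that repeat an earlier row in every column
exact_duplicate_count = int(toy_df.duplicated().sum())

# Rows that share the same identifiers, even if other values differ
key_columns = ["patient_id", "encounter_id"]
key_duplicates = toy_df[toy_df.duplicated(subset=key_columns, keep=False)]

print("Exact duplicate rows:", exact_duplicate_count)
print(key_duplicates)
```

In this toy data, the two `E2` rows are not exact duplicates because their heart rates differ, yet they would still double count the encounter in a visit total.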


## Step 5: Validate temporal consistency
Here, you will verify that date fields follow a logical order. For example, a patient should not be
discharged before they are admitted.

**Why this matters in healthcare:** Incorrect timelines lead to wrong length-of-stay calculations
and misleading operational metrics.


In [None]:

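# This cell checks for illogical date sequences.
# Time-related errors can lead to incorrect duration calculations.

# Parse the date columns first; read_csv loads them as text unless told otherwise
admission = pd.to_datetime(ehr_df['admission_time'], errors='coerce')
discharge = pd.to_datetime(ehr_df['discharge_time'], errors='coerce')

# Find records where discharge occurs before admission
ehr_df[discharge < admission].head()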
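
To see how a reversed timeline affects length of stay, the sketch below computes the stay in days on a hypothetical `toy_df`. The values, the `los_days` column name, and the timestamp format are assumptions for illustration; the lab dataset may store dates differently.

```python
import pandas as pd

# Hypothetical toy frame: values and timestamp format are invented for illustration
toy_df = pd.DataFrame({
    "encounter_id": ["E1", "E2", "E3"],
    "admission_time": ["2024-01-01 08:00", "2024-01-05 10:00", "not recorded"],
    "discharge_time": ["2024-01-03 08:00", "2024-01-04 10:00", "2024-01-09 12:00"],
})

# Parse text to datetimes; values that do not match the format become NaT
for col in ["admission_time", "discharge_time"]:
    toy_df[col] = pd.to_datetime(toy_df[col], format="%Y-%m-%d %H:%M", errors="coerce")

# Length of stay in days (86400 seconds per day)
stay = toy_df["discharge_time"] - toy_df["admission_time"]
toy_df["los_days"] = stay.dt.total_seconds() / 86400

# Discharge before admission gives a negative stay
negative_stay = toy_df[toy_df["los_days"] < 0]

# Rows where either date could not be parsed
unparseable = toy_df[toy_df[["admission_time", "discharge_time"]].isna().any(axis=1)]

print(negative_stay)
print(unparseable)
```

Rows with an unparseable date drop out of the negative-stay check because comparisons with `NaT` evaluate to `False`, so it helps to report them separately.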


## Step 6: Inspect clinical value ranges
Finally, you will review whether clinical measurements fall within reasonable ranges. Extremely high or
low values might indicate data entry mistakes or system issues.

**Why this matters in healthcare:** Implausible clinical values can mislead analyses and compromise
patient safety insights.


In [None]:

# This cell summarizes clinical measurements.
# Extreme or unrealistic values might signal data quality problems.

# Generate summary statistics for vital signs
ehr_df[['heart_rate', 'systolic_bp', 'diastolic_bp']].describe()
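
Summary statistics show the extremes, but explicit rules make the flagged rows easy to count. The sketch below applies range checks to a hypothetical `toy_df`. The thresholds in `plausible_ranges` are illustrative assumptions only, not clinical guidance, and should be replaced with limits agreed with clinical experts.

```python
import pandas as pd

# Hypothetical toy frame: values are invented for illustration
toy_df = pd.DataFrame({
    "heart_rate": [72, 400, None, 65],
    "systolic_bp": [120, 130, 80, 110],
    "diastolic_bp": [80, 85, 95, 70],
})

# Illustrative limits only (assumed for this sketch, not clinical guidance)
plausible_ranges = {
    "heart_rate": (20, 250),
    "systolic_bp": (50, 300),
    "diastolic_bp": (20, 200),
}

out_of_range_counts = {}
for col, (low, high) in plausible_ranges.items():
    # between() is inclusive; missing values are left to the missingness check
    flagged = toy_df[col].notna() & ~toy_df[col].between(low, high)
    out_of_range_counts[col] = int(flagged.sum())

# Cross-field check: systolic pressure is expected to exceed diastolic pressure
inverted_bp = toy_df[toy_df["systolic_bp"] <= toy_df["diastolic_bp"]]

print(out_of_range_counts)
print(inverted_bp)
```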


## Exercises

In [None]:
# Load the synthetic EHR dataset for the exercises
ehr_exercise_df = pd.read_csv("https://fundamentals-of-healthcare-data-science-858397.gitlab.io/labs/lab1/ehr_exercise_dataset.csv")
ehr_exercise_df.head()

### Exercise 1: Inspect dataset structure

In [None]:
# your code goes here

<details>
<summary>Click here for a hint</summary>

Look at dataset size and data types.

</details>

<details>
<summary>Click here for solution</summary>

```python
print(ehr_exercise_df.shape)
ehr_exercise_df.info()
```

</details>


### Exercise 2: Check identifier completeness

In [None]:
# your code goes here

<details>
<summary>Click here for a hint</summary>

Check for missing patient or encounter IDs.

</details>

<details>
<summary>Click here for solution</summary>

```python
ehr_exercise_df[['patient_id','encounter_id']].isnull().sum()
```

</details>


### Exercise 3: Identify missing data patterns

In [None]:
# your code goes here

<details>
<summary>Click here for a hint</summary>

Count missing values per column.

</details>

<details>
<summary>Click here for solution</summary>

```python
ehr_exercise_df.isnull().sum()
```

</details>


### Exercise 4: Detect duplicate records

In [None]:
# your code goes here

<details>
<summary>Click here for a hint</summary>

Check whether rows are repeated.

</details>

<details>
<summary>Click here for solution</summary>

```python
ehr_exercise_df.duplicated().sum()
```

</details>


### Exercise 5: Identify invalid timestamps

In [None]:
# your code goes here

<details>
<summary>Click here for a hint</summary>

Compare admission and discharge dates.

</details>

<details>
<summary>Click here for solution</summary>

```python
times = ehr_exercise_df[['admission_time', 'discharge_time']].apply(pd.to_datetime, errors='coerce')
ehr_exercise_df[times['discharge_time'] < times['admission_time']]
```

</details>


### Exercise 6: Review clinical value ranges

In [None]:
# your code goes here

<details>
<summary>Click here for a hint</summary>

Summarize vital sign distributions.

</details>

<details>
<summary>Click here for solution</summary>

```python
ehr_exercise_df[['heart_rate','systolic_bp','diastolic_bp']].describe()
```

</details>


## Congratulations!

You have completed this lab and practiced inspecting raw EHR data for quality issues. This experience prepares you to catch such problems early, before they affect analysis or downstream models.

## Authors
Ramesh Sannareddy   
<br>  
© SkillUp. All rights reserved.

Materials may not be reproduced in whole or in part without written permission from SkillUp.