# Analyzing Patient Vital Signs with Descriptive Statistics

Time estimate: **20** minutes

## Objectives

After completing this lab, you will be able to:

- Load and inspect a patient vital-signs dataset.
- Clean and prepare vital-signs data for analysis (handle missing values and outliers).
- Compute and interpret descriptive statistics for continuous and categorical clinical variables.
- Visualize vital-sign distributions and relationships using histograms, boxplots, and density plots.
- Summarize patient-level aggregated metrics and produce clinician-ready reports.

## What you will do in this lab

- Generate a realistic simulated patient vital-signs dataset.
- Explore data structure and quality.
- Compute central tendency and dispersion measures for vitals (HR, BP, SpO₂, Temp).
- Detect and handle outliers and missing data using simple clinical rules.
- Create visual summaries and grouped descriptive tables (by ward, age group, sex).
- Complete 5 consolidated exercises (with hints & solutions) at the end of the lab.

## Overview

Descriptive statistics for patient vital signs (heart rate, blood pressure, respiratory rate, oxygen saturation, temperature) are essential for clinical monitoring, triage, and research. This lab demonstrates how to prepare, summarize, visualize, and report vital-sign data in a reproducible manner. We simulate a dataset to allow controlled examples of missingness and outliers useful for teaching.

## About the dataset/environment

Let's simulate a dataset representing repeated vital-sign measurements for hospitalized patients collected across several wards over multiple days. Each row is a single measurement event and contains:

- `patient_id` — unique patient identifier  
- `age` — patient age in years  
- `sex` — 'M' or 'F'  
- `ward` — ward name (e.g., Med, Surg, ICU)  
- `timestamp` — datetime of measurement  
- `hr` — heart rate (beats per minute)  
- `sbp` — systolic blood pressure (mmHg)  
- `dbp` — diastolic blood pressure (mmHg)  
- `spo2` — oxygen saturation (%)  
- `temp_c` — temperature in Celsius  
- `resp_rate` — respiratory rate (breaths per minute)

You will use Python (pandas, numpy, matplotlib, seaborn) for analysis.
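
Before any analysis, it can help to confirm that a frame actually follows the schema listed above. The sketch below is one possible check, not part of the lab's own code: the helper name `check_vitals_frame` and the limits in `PLAUSIBLE_RANGES` are assumptions made for illustration (the limits are loose plausibility bounds, not clinical reference ranges).

```python
import numpy as np
import pandas as pd

# Column names taken from the dataset description above
EXPECTED_COLUMNS = ['patient_id', 'age', 'sex', 'ward', 'timestamp',
                    'hr', 'sbp', 'dbp', 'spo2', 'temp_c', 'resp_rate']

# Hypothetical plausibility limits (assumption for illustration only)
PLAUSIBLE_RANGES = {
    'hr': (30, 200),
    'sbp': (60, 260),
    'dbp': (30, 160),
    'spo2': (70, 100),
    'temp_c': (34.0, 41.0),
    'resp_rate': (6, 40),
}

def check_vitals_frame(frame):
    """Return the missing columns and a count of out-of-range values per vital."""
    missing_cols = [c for c in EXPECTED_COLUMNS if c not in frame.columns]
    out_of_range = {}
    for col, (low, high) in PLAUSIBLE_RANGES.items():
        if col in frame.columns:
            values = frame[col]
            # NaN compares False on both sides, so missing values are not counted
            out_of_range[col] = int(((values < low) | (values > high)).sum())
    return missing_cols, out_of_range

# Tiny toy frame with one implausible HR and one implausible SpO2
toy = pd.DataFrame({
    'patient_id': ['P1', 'P1', 'P2'],
    'hr': [72, np.nan, 250],
    'spo2': [98, 65, 97],
})
missing_cols, out_of_range = check_vitals_frame(toy)
print('Missing columns:', missing_cols)
print('Out-of-range counts:', out_of_range)
```

Once the dataset has been created, the same helper can be called as `check_vitals_frame(df)`.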

---

## Setup
The cell below checks whether the notebook is running in Google Colab, installs any missing packages, imports the required libraries, and sets the random seeds and display options so that the simulation and analysis steps give the same results each time the notebook is run from top to bottom.

In [None]:
# If running in Colab, install any missing packages (most are preinstalled)
try:
    import google.colab
    IN_COLAB = True
except Exception:
    IN_COLAB = False

if IN_COLAB:
    # Colab environment: install packages if needed (most are preinstalled)
    !pip -q install pandas matplotlib seaborn numpy

# Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
import random

# Reproducibility
RNG = np.random.default_rng(42)
random.seed(42)

pd.set_option('display.max_columns', 50)
sns.set(style="whitegrid")
print('Setup complete. Running in Colab:', IN_COLAB)


The following cell simulates a realistic, clinically plausible patient vital-signs dataset by generating demographic attributes, assigning patients to wards, creating timestamped vital-sign measurements over multiple days, and adding natural variability, missing values, and occasional outliers to mirror real-world hospital data for analysis and practice.

In [None]:
# Function to simulate vital signs dataset
def simulate_vitals(n_patients=120, days=4, measurements_per_day=3, start_date=datetime(2025,11,20,6,0)):
    patient_ids = [f"P{1000+i}" for i in range(n_patients)]
    records = []
    wards = ['Med', 'Surg', 'ICU']
    for pid in patient_ids:
        age = int(RNG.normal(58, 16)) if RNG.random() > 0.05 else int(RNG.normal(35, 8))
        sex = 'M' if RNG.random() < 0.53 else 'F'
        ward = RNG.choice(wards, p=[0.55, 0.35, 0.10])
        baseline_hr = int(max(40, RNG.normal(78, 12)))
        baseline_sbp = int(max(80, RNG.normal(125, 14)))
        baseline_dbp = int(max(40, RNG.normal(78, 9)))
        baseline_spo2 = int(min(100, max(85, RNG.normal(97, 2))))
        baseline_temp = round(RNG.normal(36.7, 0.4), 1)
        for d in range(days):
            for m in range(measurements_per_day):
                ts = start_date + timedelta(days=d, hours=m*6) + timedelta(minutes=int(RNG.normal(0,30)))
                hr = int(np.clip(RNG.normal(baseline_hr, 8), 30, 200))
                sbp = int(np.clip(RNG.normal(baseline_sbp, 10), 60, 260))
                dbp = int(np.clip(RNG.normal(baseline_dbp, 6), 30, 160))
                spo2 = int(np.clip(RNG.normal(baseline_spo2, 1.8), 70, 100))
                temp_c = round(np.clip(RNG.normal(baseline_temp, 0.25), 34.0, 41.0), 1)
                resp_rate = int(np.clip(RNG.normal(16, 3), 6, 40))
                # inject missingness
                if RNG.random() < 0.05:
                    hr = np.nan
                if RNG.random() < 0.04:
                    sbp = np.nan
                    dbp = np.nan
                if RNG.random() < 0.03:
                    spo2 = np.nan
                # inject rare outliers
                if RNG.random() < 0.01:
                    hr = int(RNG.normal(160, 8))
                if RNG.random() < 0.005:
                    temp_c = round(RNG.normal(40.2, 0.6), 1)
                records.append({
                    'patient_id': pid,
                    'age': age,
                    'sex': sex,
                    'ward': ward,
                    'timestamp': ts,
                    'hr': hr,
                    'sbp': sbp,
                    'dbp': dbp,
                    'spo2': spo2,
                    'temp_c': temp_c,
                    'resp_rate': resp_rate
                })
    df = pd.DataFrame.from_records(records)
    df = df.sample(frac=1, random_state=42).reset_index(drop=True)
    df['timestamp'] = pd.to_datetime(df['timestamp'])
    return df

# Create dataset
df = simulate_vitals(n_patients=120, days=4, measurements_per_day=3)
df.head()
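
Because `simulate_vitals` draws from the shared `RNG` object created in the setup cell, calling it a second time in the same session continues the random stream and yields a different dataset. The sketch below illustrates that behavior with NumPy alone; the variable names are hypothetical and the draws only mimic the baseline heart-rate distribution.

```python
import numpy as np

# Two generators created with the same seed produce identical draws
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)
first_a = rng_a.normal(78, 12, size=5)
first_b = rng_b.normal(78, 12, size=5)
same_seed_matches = np.array_equal(first_a, first_b)

# Drawing again from the same generator continues the stream,
# so the second batch differs from the first
second_a = rng_a.normal(78, 12, size=5)
stream_advances = not np.array_equal(first_a, second_a)

print('Same seed, same draws:', same_seed_matches)
print('Second draw differs:', stream_advances)
```

To regenerate exactly the same dataset, re-run the setup cell (which recreates `RNG`) before calling `simulate_vitals` again.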

## Step 1: Inspect the data

Run the code cell below for a quick structural and statistical overview of the dataset: its dimensions, the data type of each column, and descriptive statistics for every column. Use the output to verify data integrity before performing deeper analysis.

In [None]:
print('Rows, Columns:', df.shape)
print(df.dtypes)
df.describe(include='all').T.head(12)
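
Since each row is meant to be a single measurement event, one further integrity check is to look for repeated `(patient_id, timestamp)` pairs and to count rows per patient. The following is a minimal sketch on a hand-made toy frame (the values are invented for illustration).

```python
import pandas as pd

toy = pd.DataFrame({
    'patient_id': ['P1', 'P1', 'P1', 'P2'],
    'timestamp': pd.to_datetime([
        '2025-11-20 06:00', '2025-11-20 06:00',
        '2025-11-20 12:00', '2025-11-20 06:00',
    ]),
    'hr': [70, 70, 75, 88],
})

# keep=False marks every row that shares its key with another row
dup_mask = toy.duplicated(subset=['patient_id', 'timestamp'], keep=False)
n_dup_rows = int(dup_mask.sum())

# Number of measurement rows recorded for each patient
rows_per_patient = toy.groupby('patient_id').size()

print('Rows involved in duplicate keys:', n_dup_rows)
print(rows_per_patient)
```

On the lab dataset, the same two statements can be applied to `df` in place of `toy`.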

## Step 2: Basic descriptive statistics for continuous vitals

Run the following code cell to isolate all vital-sign variables and generate their descriptive statistics. This will allow you to quickly assess central tendencies and variability for each clinical measurement.

In [None]:
vital_cols = ['hr', 'sbp', 'dbp', 'spo2', 'temp_c', 'resp_rate']
df[vital_cols].describe().T
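
When reading the table, keep in mind that the mean and standard deviation react strongly to extreme values, while the median and IQR barely move. A small sketch with invented heart-rate values shows the effect; `robust_summary` is a hypothetical helper written for this illustration.

```python
import pandas as pd

hr_clean = pd.Series([70, 72, 74, 76, 78])
hr_with_outlier = pd.Series([70, 72, 74, 76, 78, 160])  # one extreme reading added

def robust_summary(series):
    """Return mean/std next to their robust counterparts, median/IQR."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    return {
        'mean': series.mean(),
        'median': series.median(),
        'std': series.std(),
        'iqr': q3 - q1,
    }

clean = robust_summary(hr_clean)
contaminated = robust_summary(hr_with_outlier)

# The mean shifts by more than 14 bpm, the median by only 1 bpm
print('Mean:  ', clean['mean'], '->', round(contaminated['mean'], 1))
print('Median:', clean['median'], '->', contaminated['median'])
print('IQR:   ', clean['iqr'], '->', contaminated['iqr'])
```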

## Step 3: Visualize distributions

The following cells show common plots. Run this code cell to visualize the distribution of each vital-sign variable using histograms with density curves. This will help you quickly identify patterns, ranges, skewness, and potential outliers across the clinical measurements.

In [None]:
# Histograms
plt.figure(figsize=(12, 8))
for i, col in enumerate(vital_cols, 1):
    plt.subplot(3, 2, i)
    sns.histplot(df[col].dropna(), kde=True, bins=30)
    plt.title(col)
plt.tight_layout()

Run this code cell to create boxplots of selected vital signs across different hospital wards to compare their distributions. This will make it easier to spot differences, variability, and potential outliers between patient groups.

In [None]:
# Boxplots by ward
plt.figure(figsize=(12, 8))
for i, col in enumerate(['hr', 'sbp', 'temp_c']):
    plt.subplot(2, 2, i+1)
    sns.boxplot(x='ward', y=col, data=df)
    plt.title(f'{col} by ward')
plt.tight_layout()

## Step 4: Missing data handling

This step first quantifies missingness and then fills gaps within each patient's timeline. Run this code cell to calculate and print the percentage of missing values in each column. This will help you assess data completeness and identify variables that may require cleaning or imputation.

In [None]:
# Convert to percent before rounding to avoid floating-point artifacts such as 4.8999999
missing_pct = (df.isna().mean() * 100).round(1)
print('Missing % per column:\n', missing_pct)

This code cell fills missing vital-sign values within each patient's timeline using forward-fill and backward-fill methods, then compares missing-value counts before and after imputation to verify that the gaps have been handled.

In [None]:
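df_sorted = df.sort_values(['patient_id', 'timestamp']).copy()
# Fill only the vital-sign columns within each patient's timeline; patient_id and the other columns are kept
df_ffill = df_sorted.copy()
df_ffill[vital_cols] = df_sorted.groupby('patient_id')[vital_cols].transform(lambda s: s.ffill().bfill())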
print('Before missing counts:\n', df.isna().sum())
print('After ffill/bfill counts:\n', df_ffill.isna().sum())
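
The reason for filling within each patient, rather than over the whole column, is that a plain forward-fill can carry one patient's last value into the next patient's first row. The toy example below (invented values, two patients) sketches the difference.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'patient_id': ['A', 'A', 'A', 'B', 'B'],
    'timestamp': pd.to_datetime([
        '2025-11-20 06:00', '2025-11-20 12:00', '2025-11-20 18:00',
        '2025-11-20 06:00', '2025-11-20 12:00',
    ]),
    'hr': [80, np.nan, 84, np.nan, 60],
}).sort_values(['patient_id', 'timestamp'])

# Naive fill: patient B's first reading inherits patient A's last value
naive = toy['hr'].ffill().tolist()

# Grouped forward-fill: the gap at the start of patient B stays missing
grouped_ffill = toy.groupby('patient_id')['hr'].ffill().tolist()

# Grouped forward- then backward-fill: the leading gap is filled from B's own data
grouped_both = toy.groupby('patient_id')['hr'].transform(
    lambda s: s.ffill().bfill()
).tolist()

print('naive:        ', naive)
print('grouped ffill:', grouped_ffill)
print('grouped both: ', grouped_both)
```

Note that backward-filling uses a later reading to fill an earlier gap, which is acceptable for descriptive summaries but would leak future information in a predictive setting.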

## Step 5: Outlier detection

IQR-based detection example. 

Run the code cell to define a function to identify outliers using the Interquartile Range (IQR) method and apply it to heart-rate data. This allows you to detect unusually low or high HR values that may need investigation or cleaning.

In [None]:
def iqr_outliers(series, k=1.5):
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return series[(series < lower) | (series > upper)]

hr_outliers = iqr_outliers(df['hr'].dropna())
print('HR outliers count:', len(hr_outliers))
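
To make the IQR rule concrete, here is a small worked example on nine invented heart-rate readings. The helper `iqr_fences` is a hypothetical name used only in this sketch; it returns the lower and upper fences instead of the outliers themselves.

```python
import pandas as pd

def iqr_fences(series, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

hr = pd.Series([60, 70, 72, 75, 78, 80, 82, 85, 160])

# With 9 sorted values, Q1 is the 3rd value (72) and Q3 is the 7th (82),
# so IQR = 10, lower fence = 72 - 15 = 57, upper fence = 82 + 15 = 97
lower, upper = iqr_fences(hr)
flagged = hr[(hr < lower) | (hr > upper)].tolist()

print('Fences:', lower, upper)
print('Flagged values:', flagged)
```

Only the reading of 160 lies outside the fences; the value 60 is low but still above the lower fence of 57.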

## Step 6: Group summaries (by ward, age group, sex)

Aggregated summaries example.

Run the code cell to group records by ward and age category, then compute median values and IQRs for key vital signs, along with record counts, to summarize how clinical measurements vary across demographic and clinical subgroups.

In [None]:
bins = [0, 30, 50, 65, 120]
labels = ['<30', '30-49', '50-64', '65+']
df['age_group'] = pd.cut(df['age'], bins=bins, labels=labels, right=False)

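# observed=True keeps only ward/age-group combinations that occur in the data
agg = df.groupby(['ward', 'age_group'], observed=True).agg(
    hr_median=('hr', 'median'),
    hr_iqr=('hr', lambda x: x.quantile(0.75)-x.quantile(0.25)),
    sbp_median=('sbp', 'median'),
    sbp_iqr=('sbp', lambda x: x.quantile(0.75)-x.quantile(0.25)),
    n_records=('patient_id', 'count')  # rows (measurement events), not distinct patients
).reset_index()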
agg.head(12)
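
The `right=False` argument matters for the age labels: it makes each bin include its left edge and exclude its right edge, so a 30-year-old falls in `30-49` rather than `<30`. The sketch below checks this on a few boundary ages (values chosen for illustration).

```python
import pandas as pd

bins = [0, 30, 50, 65, 120]
labels = ['<30', '30-49', '50-64', '65+']
ages = pd.Series([29, 30, 49, 50, 64, 65])

# Left-closed bins: [0, 30), [30, 50), [50, 65), [65, 120)
left_closed = pd.cut(ages, bins=bins, labels=labels, right=False).astype(str).tolist()

# Default right-closed bins: (0, 30], (30, 50], (50, 65], (65, 120]
right_closed = pd.cut(ages, bins=bins, labels=labels).astype(str).tolist()

print('right=False:', left_closed)
print('right=True: ', right_closed)
```

With the default right-closed bins, ages 30, 50, and 65 would each be assigned to the younger group, which contradicts the labels.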

## Step 7: Patient-level summaries and variability

Per-patient aggregations.

Run the code cell to create a patient-level summary: the number of measurements per patient, the mean and standard deviation of HR and SBP, the mean SpO₂, and the maximum temperature. This will enable you to understand individual patient patterns and variability over time.

In [None]:
patient_summary = df.groupby('patient_id').agg(
    n_measurements=('timestamp', 'count'),  # timestamp is never missing, so this counts every measurement event
    hr_mean=('hr', 'mean'),
    hr_std=('hr', 'std'),
    sbp_mean=('sbp', 'mean'),
    sbp_std=('sbp', 'std'),
    spo2_mean=('spo2', 'mean'),
    temp_max=('temp_c', 'max')
).reset_index()
patient_summary.head()
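
Standard deviation alone can be hard to compare between patients whose baselines differ. As one possible extension (an assumption of this sketch, not something the lab requires), the within-patient range and the coefficient of variation (std divided by mean) can be added to the summary. The toy values below are invented.

```python
import pandas as pd

toy = pd.DataFrame({
    'patient_id': ['A', 'A', 'A', 'B', 'B', 'B'],
    'hr': [70, 72, 74, 60, 90, 120],
})

# Named aggregation on a single column: keyword = output name, value = function
summary = toy.groupby('patient_id')['hr'].agg(
    hr_mean='mean',
    hr_std='std',   # sample standard deviation (ddof=1)
    hr_min='min',
    hr_max='max',
)
summary['hr_range'] = summary['hr_max'] - summary['hr_min']
summary['hr_cv'] = summary['hr_std'] / summary['hr_mean']

print(summary.round(3))
```

Patient A is stable (std 2 bpm around a mean of 72), whereas patient B varies widely (std 30 bpm around a mean of 90).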

## Step 8: Relationships and correlations

Scatter and correlation heatmap.

Run the code cell to visualize relationships between vital signs by plotting systolic versus diastolic blood pressure across wards and generating a correlation heatmap. This allows you to explore how different clinical measurements relate to one another.

In [None]:
plt.figure(figsize=(6,5))
sns.scatterplot(data=df.dropna(subset=['sbp','dbp']), x='sbp', y='dbp', hue='ward', alpha=0.6)
plt.title('SBP vs DBP by ward')
plt.show()

plt.figure(figsize=(8,6))
corr = df[vital_cols].corr()
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation matrix of vital signs')
plt.show()
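
Each cell of the heatmap is a Pearson correlation coefficient computed from the rows where both variables are present. The sketch below reproduces one such coefficient by hand on invented SBP/DBP values and compares it with the pandas result.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'sbp': [110, 120, 130, 140, np.nan],
    'dbp': [70, 75, 85, 90, 80],
})

# pandas: Pearson by default; the pair with a missing SBP is excluded
r_pandas = toy['sbp'].corr(toy['dbp'])

# Manual calculation on complete pairs:
# r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
complete = toy.dropna(subset=['sbp', 'dbp'])
x = complete['sbp'].to_numpy()
y = complete['dbp'].to_numpy()
x_dev = x - x.mean()
y_dev = y - y.mean()
r_manual = (x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum())

print('pandas:', round(r_pandas, 4))
print('manual:', round(r_manual, 4))
```

Because the dataset contains repeated measurements per patient, rows are not independent; the coefficients describe the pooled records and should be read as descriptive only.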

## Step 9: Assemble a clinician-ready summary

Ward-level reporting example.

Run the code cell to generate a concise, ward-level summary report by computing record counts and key statistics for multiple vital signs. This provides a quick snapshot of typical clinical values and variability within each hospital ward.

In [None]:
ward_report = df.groupby('ward').agg(
    n_records=('patient_id', 'count'),
    hr_median=('hr', 'median'),
    hr_iqr=('hr', lambda x: x.quantile(0.75)-x.quantile(0.25)),
    sbp_median=('sbp', 'median'),
    spo2_median=('spo2', 'median'),
    temp_mean=('temp_c', 'mean')
).reset_index()
ward_report
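
Clinical tables often present a median together with its quartiles in a single cell. The following sketch shows one way to build such a column; the helper `median_iqr` and the "median (Q1 to Q3)" layout are assumptions made for illustration, and the toy values are invented.

```python
import numpy as np
import pandas as pd

def median_iqr(series, decimals=0):
    """Format a numeric series as 'median (Q1 to Q3)', ignoring missing values."""
    s = series.dropna()
    q1, med, q3 = s.quantile([0.25, 0.5, 0.75])
    return f"{med:.{decimals}f} ({q1:.{decimals}f} to {q3:.{decimals}f})"

toy = pd.DataFrame({
    'ward': ['Med'] * 5 + ['ICU'] * 4,
    'hr': [70, 74, 78, 82, 86, 90, 100, 110, np.nan],
})

# One formatted string per ward
report = (
    toy.groupby('ward')['hr']
    .apply(median_iqr)
    .rename('hr_median_iqr')
    .reset_index()
)
print(report)
```

The resulting frame can be written out with `report.to_csv(...)` or pasted into a document alongside the plots.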

## Consolidated practice exercises



### Exercise 1: Inspect missingness and unique counts

In [None]:
# your code goes here

<details> <summary>Click here for a hint</summary>

Check `df.isna().sum()` for missingness and `df['patient_id'].nunique()` for unique patient count.

</details>

<details> <summary>Click here for solution</summary>

```python
print('Missing values per column:')
print(df.isna().sum())
print('\nUnique patients:', df['patient_id'].nunique())
```

</details>

### Exercise 2: Create `describe_vitals(df, cols)` returning mean, median, std, IQR for specified columns

In [None]:
# your code goes here

<details> <summary>Click here for a hint</summary>

Use `.median()`, `.mean()`, `.std()`, and quantiles `.quantile(0.25)` and `.quantile(0.75)` to compute IQR.

</details>

<details> <summary>Click here for solution</summary>

```python
def describe_vitals(df, cols):
    out = {}
    for col in cols:
        s = df[col].dropna()
        q1 = s.quantile(0.25)
        q3 = s.quantile(0.75)
        out[col] = {
            'count': int(s.count()),
            'mean': float(s.mean()),
            'median': float(s.median()),
            'std': float(s.std()),
            'iqr': float(q3 - q1),
            'min': float(s.min()),
            'max': float(s.max())
        }
    return pd.DataFrame(out).T

describe_vitals(df, vital_cols)
```

</details>

### Exercise 3: Impute missing SBP and DBP by ward median and show counts before/after

In [None]:
# your code goes here

<details> <summary>Click here for a hint</summary>

Use `df.groupby('ward')['sbp'].transform('median')` and fillna on the column.

</details>

<details> <summary>Click here for solution</summary>

```python
df_imputed = df.copy()
for col in ['sbp', 'dbp']:
    before = df_imputed[col].isna().sum()
    df_imputed[col] = df_imputed[col].fillna(df_imputed.groupby('ward')[col].transform('median'))
    after = df_imputed[col].isna().sum()
    print(f"{col}: before {before}, after {after}")
```

</details>

### Exercise 4: Flag rows where temperature > 39.0°C or SpO₂ < 90% and show count + head(5)

In [None]:
# your code goes here

<details> <summary>Click here for a hint</summary>

Create a boolean mask `(df['temp_c'] > 39.0) | (df['spo2'] < 90)` and use it to filter.

</details>

<details> <summary>Click here for solution</summary>

```python
mask = (df['temp_c'] > 39.0) | (df['spo2'] < 90)
print('Count:', mask.sum())
df.loc[mask].head(5)
```

</details>

### Exercise 5: Calculate Pearson correlation between heart rate and temperature and interpret strength (weak/moderate/strong)

In [None]:
# your code goes here

<details> <summary>Click here for a hint</summary>

Use `df[['hr','temp_c']].dropna().corr().loc['hr','temp_c']`. Interpret using these rule-of-thumb thresholds: |r| < 0.3 weak, 0.3 to below 0.6 moderate, 0.6 or above strong.

</details>

<details> <summary>Click here for solution</summary>

```python
corr_value = df[['hr','temp_c']].dropna().corr().loc['hr','temp_c']
print('Pearson correlation (HR, Temp):', round(corr_value, 3))
if abs(corr_value) < 0.3:
    print('Interpretation: weak correlation')
elif abs(corr_value) < 0.6:
    print('Interpretation: moderate correlation')
else:
    print('Interpretation: strong correlation')
```

</details>

## Final thoughts and best practices

- Always document data provenance and measurement devices.  
- Prefer robust summary statistics (median, IQR) when distributions are skewed.  
- Include both tabular summaries and visualizations when reporting to clinicians.  
- Save reproducible code that regenerates the exact figures and tables.


# Congratulations!

You have successfully completed this lab on **Analyzing Patient Vital Signs with Descriptive Statistics**.

In this lab, you explored a simulated patient vital-signs dataset to practice core healthcare analytics skills. You began by inspecting the data, computing descriptive statistics, and visualizing key patterns. You handled missing values, detected outliers, and created summaries by ward and age group. You also generated patient-level insights and examined relationships between vital signs using scatterplots and a correlation heatmap. By the end, you produced a clear, clinician-ready summary that interprets vital-sign trends and variability in a real-world context.

## Authors

Ramesh Sannareddy

Copyright © 2025 SkillUp. All rights reserved.