# Unit Normalization, Time Alignment, Resampling, and Windowing for Longitudinal Data

This notebook covers essential preprocessing techniques for longitudinal medical data, including unit normalization, time alignment, resampling, and windowing. These techniques are crucial for preparing time-series medical data for analysis and integration across different sources and measurement systems.

First, let's import the necessary libraries for working with time-series data and create some synthetic longitudinal medical data to demonstrate the concepts.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime, timedelta
from scipy import stats
from sklearn.preprocessing import StandardScaler, MinMaxScaler
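import warnings

# Silence only FutureWarning noise; a blanket 'ignore' would also hide real problems
warnings.filterwarnings('ignore', category=FutureWarning)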

# Set random seed for reproducibility
np.random.seed(42)

Now we'll create synthetic longitudinal medical data for three patients. Candidate measurement times fall on a 6-hour grid, and each slot is kept with a fixed probability, so the gaps between consecutive measurements vary. This simulates real-world scenarios where data collection frequency varies.

In [None]:
# Create synthetic longitudinal data for 3 patients
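def create_patient_data(patient_id, start_date, n_days, measurement_frequency):
    # measurement_frequency is the probability (0 to 1) that a reading is taken
    # at each 6-hour slot, not a rate in readings per day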
    dates = []
    current_date = start_date
    
    while current_date <= start_date + timedelta(days=n_days):
        if np.random.random() < measurement_frequency:
            dates.append(current_date)
        current_date += timedelta(hours=6)  # Check every 6 hours
    
    n_measurements = len(dates)
    
    # Simulate different vital signs with different units and scales
    data = {
        'patient_id': [patient_id] * n_measurements,
        'timestamp': dates,
        'heart_rate_bpm': np.random.normal(70 + patient_id * 10, 10, n_measurements),
        'blood_pressure_systolic_mmHg': np.random.normal(120 + patient_id * 5, 15, n_measurements),
        'temperature_celsius': np.random.normal(36.5 + patient_id * 0.2, 0.5, n_measurements),
        'weight_kg': np.random.normal(70 + patient_id * 5, 2, n_measurements)
    }
    
    return pd.DataFrame(data)

# Generate data for 3 patients
start_date = datetime(2023, 1, 1)
patients_data = []

for patient_id in range(1, 4):
    patient_data = create_patient_data(patient_id, start_date, 30, 0.3)
    patients_data.append(patient_data)

# Combine all patient data
raw_data = pd.concat(patients_data, ignore_index=True)
print(f"Generated {len(raw_data)} measurements for {raw_data['patient_id'].nunique()} patients")
raw_data.head(10)
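
One way to quantify how irregular the sampling is would be to compute the gap between consecutive measurements for each patient. The sketch below uses a small hand-made frame (`toy`, a hypothetical stand-in for `raw_data`) so that it runs on its own; only the column names `patient_id` and `timestamp` are taken from the notebook.

```python
import pandas as pd

# Hypothetical stand-in for raw_data: two patients with uneven gaps
toy = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2],
    "timestamp": pd.to_datetime([
        "2023-01-01 00:00", "2023-01-01 06:00", "2023-01-02 00:00",
        "2023-01-01 12:00", "2023-01-02 12:00",
    ]),
})

# Gap to the previous measurement of the same patient, in hours
toy = toy.sort_values(["patient_id", "timestamp"])
toy["gap_hours"] = (
    toy.groupby("patient_id")["timestamp"].diff().dt.total_seconds() / 3600
)

# Per-patient summary; the first row of each patient has no gap (NaN)
gap_summary = toy.groupby("patient_id")["gap_hours"].agg(["min", "median", "max"])
print(gap_summary)
```

Applied to `raw_data`, the same two steps should show gaps that are multiples of 6 hours, since the generator only checks every 6 hours.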

Let's visualize the raw data to understand the irregular timing and different scales of measurements across patients.

In [None]:
# Visualize the raw longitudinal data
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
axes = axes.ravel()

measurements = ['heart_rate_bpm', 'blood_pressure_systolic_mmHg', 'temperature_celsius', 'weight_kg']

for i, measurement in enumerate(measurements):
    for patient_id in raw_data['patient_id'].unique():
        patient_data = raw_data[raw_data['patient_id'] == patient_id]
        axes[i].scatter(patient_data['timestamp'], patient_data[measurement], 
                       label=f'Patient {patient_id}', alpha=0.7)
    
    axes[i].set_title(measurement.replace('_', ' ').title())
    axes[i].legend()
    axes[i].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

## 1. Unit Normalization

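Measurements recorded in different units and ranges (bpm, mmHg, °C, kg) cannot be compared or combined directly. In this notebook, unit normalization means rescaling each numeric column to a common, unitless scale. We'll demonstrate both min-max normalization and z-score standardization.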

In [None]:
# Prepare data for normalization (excluding non-numeric columns)
numeric_columns = ['heart_rate_bpm', 'blood_pressure_systolic_mmHg', 'temperature_celsius', 'weight_kg']
normalization_data = raw_data.copy()

print("Original data statistics:")
print(normalization_data[numeric_columns].describe())

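Let's apply min-max normalization, which maps each column's minimum to 0 and its maximum to 1. This preserves the ordering and relative spacing of values within a column while putting all features on the same scale. Because the range is set by the extremes, a single outlier can compress the remaining values.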

In [None]:
# Apply Min-Max normalization
scaler_minmax = MinMaxScaler()
normalized_minmax = normalization_data.copy()
normalized_minmax[numeric_columns] = scaler_minmax.fit_transform(normalization_data[numeric_columns])

print("Min-Max normalized data statistics:")
print(normalized_minmax[numeric_columns].describe())
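
To make the min-max formula concrete, here is a minimal sketch that applies `(x - min) / (max - min)` by hand with NumPy. The readings are illustrative values, not taken from the notebook's data, and the second half shows how one extreme value changes the result.

```python
import numpy as np

# Illustrative heart-rate readings (made up for this example)
x = np.array([60.0, 70.0, 80.0, 100.0])

# Min-max scaling: (x - min) / (max - min)
x_minmax = (x - x.min()) / (x.max() - x.min())
print(x_minmax)  # values 0, 0.25, 0.5, 1

# Add one extreme reading and rescale: the original values get squeezed
x_outlier = np.append(x, 200.0)
x_outlier_minmax = (x_outlier - x_outlier.min()) / (x_outlier.max() - x_outlier.min())
print(x_outlier_minmax)  # 100 bpm now maps to 40/140 instead of 1
```

If outliers like this are expected, z-score standardization or a scaler based on medians and quantiles may be a better fit.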

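Now let's apply z-score standardization, which subtracts each column's mean and divides by its standard deviation, so every column ends up with zero mean and unit variance. Values then read as "standard deviations from the mean", which is easiest to interpret when the measurements are roughly normally distributed.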

In [None]:
# Apply Z-score standardization
scaler_standard = StandardScaler()
normalized_zscore = normalization_data.copy()
normalized_zscore[numeric_columns] = scaler_standard.fit_transform(normalization_data[numeric_columns])

print("Z-score standardized data statistics:")
print(normalized_zscore[numeric_columns].describe())
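
The `std` row printed by `describe()` will usually be slightly above 1 rather than exactly 1. `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas reports the sample standard deviation (`ddof=1`). The sketch below reproduces the effect by hand on illustrative values.

```python
import pandas as pd

values = pd.Series([60.0, 70.0, 80.0, 100.0])  # illustrative readings

# z-score using the population standard deviation (ddof=0),
# the same convention StandardScaler uses
z = (values - values.mean()) / values.std(ddof=0)

print(z.mean())       # approximately 0
print(z.std(ddof=0))  # population std of the z-scores: 1
print(z.std(ddof=1))  # sample std, as describe() reports it: sqrt(n / (n - 1))
```

With many rows the two conventions are nearly identical; the gap is only noticeable for small samples.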

Let's visualize the effect of different normalization techniques on our data distribution.

In [None]:
# Compare normalization techniques
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Original data
for col in numeric_columns:
    axes[0].hist(normalization_data[col], alpha=0.5, label=col, bins=20)
axes[0].set_title('Original Data')
axes[0].legend()
axes[0].set_ylabel('Frequency')

# Min-Max normalized
for col in numeric_columns:
    axes[1].hist(normalized_minmax[col], alpha=0.5, label=col, bins=20)
axes[1].set_title('Min-Max Normalized')
axes[1].legend()

# Z-score standardized
for col in numeric_columns:
    axes[2].hist(normalized_zscore[col], alpha=0.5, label=col, bins=20)
axes[2].set_title('Z-score Standardized')
axes[2].legend()

plt.tight_layout()
plt.show()
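
The scalers above rescale values but do not convert physical units. When sources report the same quantity in different units (for example °F and °C), a conversion step would normally come before scaling. The following is a minimal sketch under stated assumptions: the long-format `readings` frame, its `unit` column, and the `to_target` lookup are all hypothetical and not part of the notebook's data.

```python
import pandas as pd

# Hypothetical readings from two sources that use different units
readings = pd.DataFrame({
    "measurement": ["temperature", "temperature", "weight", "weight"],
    "value": [98.6, 37.0, 154.0, 70.0],
    "unit": ["degF", "degC", "lb", "kg"],
})

# Converters to the units used in this notebook (degC and kg)
to_target = {
    "degF": lambda v: (v - 32.0) * 5.0 / 9.0,
    "degC": lambda v: v,
    "lb": lambda v: v * 0.45359237,
    "kg": lambda v: v,
}

# Convert row by row; an unknown unit raises KeyError instead of passing silently
readings["value_std"] = [
    to_target[unit](value)
    for value, unit in zip(readings["value"], readings["unit"])
]
readings["unit_std"] = readings["measurement"].map(
    {"temperature": "degC", "weight": "kg"}
)
print(readings)
```

Failing loudly on an unrecognized unit is a deliberate choice here: a silently unconverted value would be hard to spot after scaling.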

## 2. Time Alignment

Time alignment involves synchronizing measurements from different sources or patients to a common time reference. We'll demonstrate this by expressing every timestamp as elapsed hours since the earliest timestamp in the dataset (or since a reference time you pass in).

In [None]:
# Time alignment - align all patients to a common time reference
def align_timestamps(data, reference_time=None):
    aligned_data = data.copy()
    
    if reference_time is None:
        reference_time = data['timestamp'].min()
    
    # Calculate time difference in hours from reference time
    aligned_data['time_from_start_hours'] = (aligned_data['timestamp'] - reference_time).dt.total_seconds() / 3600
    
    return aligned_data, reference_time

aligned_data, ref_time = align_timestamps(raw_data)

print(f"Reference time: {ref_time}")
print(f"Time range: {aligned_data['time_from_start_hours'].min():.1f} to {aligned_data['time_from_start_hours'].max():.1f} hours")
aligned_data[['patient_id', 'timestamp', 'time_from_start_hours', 'heart_rate_bpm']].head()
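
A common variant is to align each patient to their own first measurement (for example, hours since admission) instead of a single global reference. The sketch below shows one way to do that with `groupby(...).transform`; the `toy` frame and the `hours_since_first` column name are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical stand-in for raw_data: patients who start on different days
toy = pd.DataFrame({
    "patient_id": [1, 1, 2, 2],
    "timestamp": pd.to_datetime([
        "2023-01-01 06:00", "2023-01-01 18:00",
        "2023-01-03 00:00", "2023-01-03 06:00",
    ]),
})

# Each patient's own first timestamp, broadcast back to every row
first_seen = toy.groupby("patient_id")["timestamp"].transform("min")

# Elapsed hours relative to that per-patient reference
toy["hours_since_first"] = (toy["timestamp"] - first_seen).dt.total_seconds() / 3600
print(toy)
```

Which reference is appropriate depends on the question: a global reference keeps calendar effects comparable, while a per-patient reference lines up each patient's trajectory from its start.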

Let's visualize how the time alignment affects our data representation, showing measurements as a function of time from the common reference point.

In [None]:
# Visualize time-aligned data
plt.figure(figsize=(12, 6))

for patient_id in aligned_data['patient_id'].unique():
    patient_data = aligned_data[aligned_data['patient_id'] == patient_id]
    plt.scatter(patient_data['time_from_start_hours'], patient_data['heart_rate_bpm'], 
               label=f'Patient {patient_id}', alpha=0.7)

plt.xlabel('Time from Start (hours)')
plt.ylabel('Heart Rate (BPM)')
plt.title('Time-Aligned Heart Rate Measurements')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

## 3. Resampling

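Resampling converts irregular time series data to regular intervals. The function below does this in two steps for each patient: it averages the readings that fall in each interval (downsampling by aggregation), then fills intervals that contain no readings by interpolation (upsampling).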

In [None]:
# Resampling - convert to regular intervals
def resample_patient_data(patient_data, freq='6h', method='linear'):
    """
    Resample patient data to regular intervals
    freq: pandas frequency string (e.g., '6h' for 6 hours; the uppercase
          'H' alias is deprecated in recent pandas versions)
    method: interpolation method ('linear', 'nearest', 'cubic');
            'nearest' and 'cubic' require SciPy
    """
    # Set timestamp as index
    patient_data_indexed = patient_data.set_index('timestamp')
    
    # Aggregate to regular intervals (mean per bin); empty bins become NaN
    resampled = patient_data_indexed.resample(freq).mean()
    
    # Interpolate missing values
    for col in numeric_columns:
        if col in resampled.columns:
            resampled[col] = resampled[col].interpolate(method=method)
    
    # Reset index and add patient_id back
    resampled = resampled.reset_index()
    resampled['patient_id'] = patient_data['patient_id'].iloc[0]
    
    return resampled

# Resample data for each patient
resampled_data_list = []
for patient_id in aligned_data['patient_id'].unique():
    patient_data = aligned_data[aligned_data['patient_id'] == patient_id]
    resampled_patient = resample_patient_data(patient_data, freq='6h')
    resampled_data_list.append(resampled_patient)

resampled_data = pd.concat(resampled_data_list, ignore_index=True)

print(f"Original data points: {len(aligned_data)}")
print(f"Resampled data points: {len(resampled_data)}")
resampled_data.head()
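
The two steps inside `resample_patient_data` are easier to follow on a tiny series. This sketch uses three illustrative readings and additionally keeps a mask of which bins were filled by interpolation; the `was_interpolated` name is an assumption, and the notebook's function does not track this.

```python
import pandas as pd

# Three illustrative readings: two in the first 6-hour bin, none in the second
s = pd.Series(
    [70.0, 74.0, 90.0],
    index=pd.to_datetime([
        "2023-01-01 00:00", "2023-01-01 01:00", "2023-01-01 12:00",
    ]),
)

# Step 1: aggregate into 6-hour bins; the empty 06:00 bin becomes NaN
binned = s.resample("6h").mean()

# Remember which bins had no real measurement before filling them
was_interpolated = binned.isna()

# Step 2: fill empty bins by linear interpolation between neighbours
filled = binned.interpolate(method="linear")

print(pd.DataFrame({"binned": binned, "filled": filled, "interpolated": was_interpolated}))
```

Keeping the mask makes it possible to exclude or down-weight synthetic values later, which can matter when long gaps are interpolated.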

Let's compare the original irregular data with the resampled regular interval data to see the effect of interpolation.

In [None]:
# Compare original vs resampled data
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Original irregular data
for patient_id in aligned_data['patient_id'].unique():
    patient_data = aligned_data[aligned_data['patient_id'] == patient_id]
    axes[0].scatter(patient_data['timestamp'], patient_data['heart_rate_bpm'], 
                   label=f'Patient {patient_id}', alpha=0.7)

axes[0].set_title('Original Irregular Data')
axes[0].legend()
axes[0].tick_params(axis='x', rotation=45)

# Resampled regular data
for patient_id in resampled_data['patient_id'].unique():
    patient_data = resampled_data[resampled_data['patient_id'] == patient_id]
    axes[1].plot(patient_data['timestamp'], patient_data['heart_rate_bpm'], 
                'o-', label=f'Patient {patient_id}', alpha=0.7)

axes[1].set_title('Resampled Regular Intervals (6H)')
axes[1].legend()
axes[1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

## 4. Windowing

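Windowing splits each patient's series into fixed-length time windows, overlapping or not, and summarizes each one. This is crucial for feature extraction and temporal pattern analysis in longitudinal data. The function below moves the window forward by `window_size_hours - overlap_hours` at each step and records the mean, standard deviation, minimum, and maximum of every measurement in the window.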

In [None]:
# Windowing - create time windows for analysis
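def create_time_windows(data, window_size_hours=24, overlap_hours=12):
    """
    Create overlapping time windows from longitudinal data
    window_size_hours: size of each window in hours
    overlap_hours: overlap between consecutive windows in hours
                   (must be smaller than window_size_hours)
    """
    # A non-positive step would make the loop below run forever
    if overlap_hours >= window_size_hours:
        raise ValueError("overlap_hours must be smaller than window_size_hours")
    
    windows = []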
    
    for patient_id in data['patient_id'].unique():
        patient_data = data[data['patient_id'] == patient_id].sort_values('timestamp')
        
        if len(patient_data) == 0:
            continue
            
        start_time = patient_data['timestamp'].min()
        end_time = patient_data['timestamp'].max()
        
        current_window_start = start_time
        window_id = 0
        
        while current_window_start + timedelta(hours=window_size_hours) <= end_time:
            window_end = current_window_start + timedelta(hours=window_size_hours)
            
            # Extract data within the window
            window_data = patient_data[
                (patient_data['timestamp'] >= current_window_start) & 
                (patient_data['timestamp'] < window_end)
            ].copy()
            
            if len(window_data) > 0:
                # Calculate window statistics
                window_stats = {
                    'patient_id': patient_id,
                    'window_id': window_id,
                    'window_start': current_window_start,
                    'window_end': window_end,
                    'n_measurements': len(window_data)
                }
                
                # Add statistics for each measurement
                for col in numeric_columns:
                    if col in window_data.columns:
                        window_stats[f'{col}_mean'] = window_data[col].mean()
                        window_stats[f'{col}_std'] = window_data[col].std()
                        window_stats[f'{col}_min'] = window_data[col].min()
                        window_stats[f'{col}_max'] = window_data[col].max()
                
                windows.append(window_stats)
            
            # Move to next window
            current_window_start += timedelta(hours=window_size_hours - overlap_hours)
            window_id += 1
    
    return pd.DataFrame(windows)

# Create 24-hour windows with 12-hour overlap
windowed_data = create_time_windows(resampled_data, window_size_hours=24, overlap_hours=12)

print(f"Created {len(windowed_data)} windows from {len(resampled_data)} data points")
windowed_data.head()
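
As a sanity check on the window count, the loop in `create_time_windows` advances by a stride of `window_size_hours - overlap_hours` and stops once a full window no longer fits. The helper below, `expected_window_count`, is a hypothetical name introduced here to work through that arithmetic. It gives an upper bound per patient, because the notebook's function skips windows that contain no data.

```python
def expected_window_count(span_hours, window_size_hours, overlap_hours):
    """Number of full windows that fit in a span, given size and overlap."""
    stride = window_size_hours - overlap_hours
    if stride <= 0:
        raise ValueError("overlap_hours must be smaller than window_size_hours")
    if span_hours < window_size_hours:
        return 0
    # Window k starts at k * stride and must end by span_hours
    return int((span_hours - window_size_hours) // stride) + 1


# A 30-day span (720 hours) with 24-hour windows
overlapping = expected_window_count(720, 24, 12)      # stride of 12 hours
non_overlapping = expected_window_count(720, 24, 0)   # stride of 24 hours
print(overlapping, non_overlapping)  # 59 30
```

Here the span is the time between a patient's first and last resampled timestamp, so the actual count per patient may be a little lower than for a full 30 days.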

Let's visualize the windowed data to show how statistical features are extracted from each time window.

In [None]:
# Visualize windowed statistics
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
axes = axes.ravel()

statistics = ['mean', 'std', 'min', 'max']
measurement = 'heart_rate_bpm'

for i, stat in enumerate(statistics):
    col_name = f'{measurement}_{stat}'
    
    for patient_id in windowed_data['patient_id'].unique():
        patient_windows = windowed_data[windowed_data['patient_id'] == patient_id]
        axes[i].plot(patient_windows['window_id'], patient_windows[col_name], 
                    'o-', label=f'Patient {patient_id}', alpha=0.7)
    
    axes[i].set_title(f'Heart Rate {stat.upper()} per Window')
    axes[i].set_xlabel('Window ID')
    axes[i].set_ylabel(f'{stat.upper()} (BPM)')
    axes[i].legend()
    axes[i].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

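Now let's combine the preprocessing steps into a single pipeline. Note the order: alignment, then normalization, then resampling, then windowing. The window statistics are therefore computed on normalized values, and the scaler is fit on all patients' data together.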

In [None]:
# Complete preprocessing pipeline
def preprocess_longitudinal_data(raw_data, normalization_method='zscore', 
                                resample_freq='6h', window_size_hours=24, 
                                overlap_hours=12):
    """
    Complete preprocessing pipeline for longitudinal medical data
    """
    print("Step 1: Time alignment...")
    aligned_data, ref_time = align_timestamps(raw_data)
    
    print("Step 2: Unit normalization...")
    if normalization_method == 'zscore':
        scaler = StandardScaler()
    else:
        scaler = MinMaxScaler()
    
    normalized_data = aligned_data.copy()
    normalized_data[numeric_columns] = scaler.fit_transform(aligned_data[numeric_columns])
    
    print("Step 3: Resampling...")
    resampled_list = []
    for patient_id in normalized_data['patient_id'].unique():
        patient_data = normalized_data[normalized_data['patient_id'] == patient_id]
        resampled_patient = resample_patient_data(patient_data, freq=resample_freq)
        resampled_list.append(resampled_patient)
    
    resampled_data = pd.concat(resampled_list, ignore_index=True)
    
    print("Step 4: Windowing...")
    windowed_data = create_time_windows(resampled_data, window_size_hours, overlap_hours)
    
    return {
        'aligned': aligned_data,
        'normalized': normalized_data,
        'resampled': resampled_data,
        'windowed': windowed_data,
        'scaler': scaler,
        'reference_time': ref_time
    }

# Apply complete pipeline
processed_data = preprocess_longitudinal_data(raw_data)

print("\nPipeline completed:")
print(f"- Raw data: {len(raw_data)} measurements")
print(f"- Resampled data: {len(processed_data['resampled'])} measurements")
print(f"- Windowed features: {len(processed_data['windowed'])} windows")
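
Because the pipeline normalizes before windowing, the windowed statistics are in z-score units. To report them in physical units, the standardization can be undone per column: location statistics (mean, min, max) are multiplied by the column's scale and shifted by its mean, while the standard deviation is only multiplied by the scale. The sketch below checks this with NumPy on made-up numbers. `mu` and `sigma` are stand-ins for the per-column values that a fitted `StandardScaler` exposes as `mean_` and `scale_`; this does not apply to the `MinMaxScaler` branch.

```python
import numpy as np

# Stand-ins for one column's fitted mean and scale (illustrative values)
mu, sigma = 80.0, 10.0

# One window of raw readings and their z-scores
raw = np.array([70.0, 85.0, 95.0, 90.0])
z = (raw - mu) / sigma

# Window statistics computed on z-scores, as the pipeline does
z_mean, z_std, z_min, z_max = z.mean(), z.std(ddof=1), z.min(), z.max()

# Map the statistics back to the original units
mean_back = z_mean * sigma + mu
std_back = z_std * sigma          # spread scales, but does not shift
min_back = z_min * sigma + mu
max_back = z_max * sigma + mu

print(mean_back, std_back, min_back, max_back)
```

The `ddof=1` matches the default of the pandas `.std()` call used inside `create_time_windows`.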

Finally, let's examine the final processed features that could be used for downstream analysis or machine learning tasks.

In [None]:
# Examine the final processed features
final_features = processed_data['windowed']

# Select feature columns (exclude metadata)
feature_columns = [col for col in final_features.columns if 
                  col.endswith(('_mean', '_std', '_min', '_max'))]

print(f"Available features: {len(feature_columns)}")
print("Feature columns:")
for col in feature_columns[:8]:  # Show first 8 features
    print(f"  - {col}")
print(f"  ... and {len(feature_columns) - 8} more")

# Show feature matrix
print("\nFeature matrix shape:", final_features[feature_columns].shape)
final_features[['patient_id', 'window_id'] + feature_columns[:4]].head()

## Summary

In this notebook, we covered four essential preprocessing techniques for longitudinal medical data:

1. **Unit Normalization**: We demonstrated min-max scaling and z-score standardization to handle measurements with different units and scales
2. **Time Alignment**: We aligned timestamps to a common reference point to enable comparison across patients
3. **Resampling**: We converted irregular time series to regular 6-hour intervals by averaging within each interval and interpolating the empty ones
4. **Windowing**: We created overlapping time windows and extracted statistical features for analysis

These techniques are fundamental for preparing longitudinal medical data for integration, analysis, and machine learning applications.

## Exercise

Using the techniques learned in this notebook, complete the following tasks:

1. **Generate new synthetic data**: Create longitudinal data for 5 patients over 60 days with measurements including heart rate, blood pressure, temperature, and glucose levels (mg/dL with values around 100±20).

2. **Apply different normalization techniques**: Compare the results of min-max normalization, z-score standardization, and robust scaling (using `RobustScaler` from sklearn) on your data.

3. **Experiment with resampling**: Try different resampling frequencies (2H, 12H, 24H) and interpolation methods ('linear', 'cubic', 'nearest'). Visualize the differences.

4. **Custom windowing**: Create a windowing function that extracts additional features such as slope (trend), area under the curve, and peak detection for each window.

5. **Pipeline evaluation**: Create a function that evaluates the quality of your preprocessing pipeline by calculating metrics such as data coverage (percentage of time with data), interpolation error, and feature correlation before and after processing.

Submit your solution showing the preprocessed data characteristics, visualizations comparing different approaches, and your evaluation metrics.