# Export Patterns: Subsets, Snapshots, Reproducible Filters

Medical data exports need to be reproducible and well documented so that different research teams can regenerate exactly the same dataset from the same source. This notebook covers three export patterns: creating data subsets, taking temporal snapshots, and implementing reproducible filters. It then combines them into a single export pipeline that records what was done at each step.

Let's start by importing the necessary libraries and creating a sample medical dataset that simulates patient records with temporal information.

In [None]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import json
from pathlib import Path
import hashlib
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)
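
`np.random.seed` sets NumPy's global random state, so any other code that draws random numbers earlier in the session changes the data generated afterwards. As a minimal sketch of an alternative, assuming you are willing to pass a generator object around, a local `numpy.random.Generator` created with `np.random.default_rng` keeps its own state. The variable names below are illustrative and are not used elsewhere in this notebook.

```python
import numpy as np

# Two generators built from the same seed produce identical streams,
# regardless of what happens to NumPy's global random state in between.
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)

ages_a = rng_a.normal(55, 15, size=5)
np.random.seed(0)  # unrelated global reseed; does not affect rng_b
ages_b = rng_b.normal(55, 15, size=5)

print(np.array_equal(ages_a, ages_b))
```

The rest of this notebook keeps the global seed for simplicity.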

Now we'll create a synthetic medical dataset with patient information, including demographics, a diagnosis, glucose and blood pressure measurements, the admitting hospital, and admission dates.

In [None]:
# Create synthetic medical data
n_patients = 1000
base_date = datetime(2020, 1, 1)

# Generate patient data
patients_data = {
    'patient_id': [f'P_{i:04d}' for i in range(1, n_patients + 1)],
    # Clip to a plausible range so the normal draw cannot yield negative ages
    'age': np.random.normal(55, 15, n_patients).astype(int).clip(0, 100),
    'gender': np.random.choice(['M', 'F'], n_patients),
    'diagnosis': np.random.choice(['Diabetes', 'Hypertension', 'Asthma', 'Obesity', 'Healthy'], n_patients),
    'admission_date': [base_date + timedelta(days=np.random.randint(0, 1095)) for _ in range(n_patients)],
    'glucose_level': np.random.normal(100, 20, n_patients),
    'blood_pressure_sys': np.random.normal(120, 15, n_patients),
    'blood_pressure_dia': np.random.normal(80, 10, n_patients),
    'hospital': np.random.choice(['Hospital_A', 'Hospital_B', 'Hospital_C'], n_patients)
}

df_patients = pd.DataFrame(patients_data)
print(f"Created dataset with {len(df_patients)} patients")
df_patients.head()

Let's examine the basic characteristics of our dataset to understand its structure and temporal distribution.

In [None]:
# Basic dataset information
print("Dataset Overview:")
print(f"Date range: {df_patients['admission_date'].min()} to {df_patients['admission_date'].max()}")
print("\nDiagnosis distribution:")
print(df_patients['diagnosis'].value_counts())
print("\nHospital distribution:")
print(df_patients['hospital'].value_counts())

## 1. Creating Data Subsets

Data subsets allow us to extract specific portions of the dataset based on clinical criteria. We'll create a function that generates subsets based on multiple conditions.

In [None]:
def create_subset(df, conditions, subset_name):
    """
    Create a data subset based on multiple conditions
    
    Parameters:
    df: pandas DataFrame
    conditions: dict with column names as keys and filter conditions as values
    subset_name: string identifier for the subset
    """
    subset_df = df.copy()
    filter_log = []
    
    for column, condition in conditions.items():
        initial_count = len(subset_df)
        
        if isinstance(condition, list):
            # Filter for values in list
            subset_df = subset_df[subset_df[column].isin(condition)]
            filter_log.append(f"{column} in {condition}: {initial_count} -> {len(subset_df)}")
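        elif isinstance(condition, dict):
            # Handle range conditions (both bounds are inclusive)
            if 'min' in condition:
                subset_df = subset_df[subset_df[column] >= condition['min']]
            if 'max' in condition:
                subset_df = subset_df[subset_df[column] <= condition['max']]
            filter_log.append(f"{column} range {condition}: {initial_count} -> {len(subset_df)}")
        else:
            # Fail loudly rather than silently ignoring an unsupported condition
            raise TypeError(f"Condition for '{column}' must be a list or dict, got {type(condition).__name__}")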
    
    print(f"Subset '{subset_name}' created:")
    for log in filter_log:
        print(f"  {log}")
    print(f"Final subset size: {len(subset_df)} patients")
    
    return subset_df

Now let's create a specific subset for diabetic patients with certain age and glucose criteria.

In [None]:
# Create subset: Diabetic patients with specific criteria
diabetes_conditions = {
    'diagnosis': ['Diabetes'],
    'age': {'min': 40, 'max': 70},
    'glucose_level': {'min': 110}
}

diabetes_subset = create_subset(df_patients, diabetes_conditions, 'Diabetes_40-70_HighGlucose')
diabetes_subset.head()

Let's create another subset focusing on patients from specific hospitals with hypertension.

In [None]:
# Create subset: Hypertension patients from specific hospitals
hypertension_conditions = {
    'diagnosis': ['Hypertension'],
    'hospital': ['Hospital_A', 'Hospital_B'],
    'blood_pressure_sys': {'min': 130}
}

hypertension_subset = create_subset(df_patients, hypertension_conditions, 'Hypertension_HospAB_HighBP')
hypertension_subset.describe()
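
The conditions in a subset are combined with a logical AND, so the final subset does not depend on the order in which they are applied; only the per-step counts in the log do. The sketch below illustrates this on a tiny made-up DataFrame (the values are invented for the example) by comparing step-by-step filtering with a single combined boolean mask.

```python
import pandas as pd

# Tiny illustrative frame; the values are made up for this sketch.
df = pd.DataFrame({
    'patient_id': ['P_0001', 'P_0002', 'P_0003', 'P_0004'],
    'diagnosis': ['Diabetes', 'Diabetes', 'Asthma', 'Diabetes'],
    'age': [45, 72, 50, 60],
    'glucose_level': [130.0, 140.0, 150.0, 95.0],
})

# Sequential filtering: diagnosis first, then age, then glucose.
step = df[df['diagnosis'].isin(['Diabetes'])]
step = step[step['age'].between(40, 70)]  # inclusive on both ends
sequential = step[step['glucose_level'] >= 110]

# One combined mask with the same criteria listed in a different order.
mask = (
    (df['glucose_level'] >= 110)
    & df['age'].between(40, 70)
    & df['diagnosis'].isin(['Diabetes'])
)
combined = df[mask]

print(sequential.equals(combined))
print(combined['patient_id'].tolist())
```

A check like this can be useful as a quick sanity test when a subset definition is rewritten or reordered.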

## 2. Creating Temporal Snapshots

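Temporal snapshots capture the state of the data as of a specific date, which longitudinal medical studies need in order to reconstruct what was known at each analysis point. Here a snapshot is cumulative: it contains every record whose admission date falls on or before the snapshot date. We'll create a function that generates such snapshots and returns a small metadata dictionary alongside the data.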

In [None]:
def create_temporal_snapshot(df, snapshot_date, date_column='admission_date'):
    """
    Create a snapshot of data up to a specific date
    
    Parameters:
    df: pandas DataFrame
    snapshot_date: datetime object or string
    date_column: column name containing dates
    """
    # pd.to_datetime accepts strings, datetime objects, and pandas Timestamps
    snapshot_date = pd.to_datetime(snapshot_date)
    
    snapshot_df = df[df[date_column] <= snapshot_date].copy()
    
    # An empty snapshot has no min/max date (both are NaT), so skip the formatting
    if snapshot_df.empty:
        date_range = 'no records'
    else:
        date_range = f"{snapshot_df[date_column].min().strftime('%Y-%m-%d')} to {snapshot_df[date_column].max().strftime('%Y-%m-%d')}"
    
    snapshot_info = {
        'snapshot_date': snapshot_date.strftime('%Y-%m-%d'),
        'total_records': len(snapshot_df),
        'date_range': date_range,
        'created_at': datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    }
    
    return snapshot_df, snapshot_info

Let's create snapshots for different time points to see how our patient population evolved over time.

In [None]:
# Create snapshots for different time points
snapshot_dates = ['2021-01-01', '2021-12-31', '2022-12-31']

snapshots = {}
for date in snapshot_dates:
    snapshot_df, snapshot_info = create_temporal_snapshot(df_patients, date)
    snapshots[date] = {'data': snapshot_df, 'info': snapshot_info}
    
    print(f"Snapshot {date}:")
    print(f"  Records: {snapshot_info['total_records']}")
    print(f"  Date range: {snapshot_info['date_range']}")
    print()

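Let's compare the diagnosis distribution across the snapshots. Because each snapshot is cumulative, later snapshots contain all patients from earlier ones, and since the data is synthetic with randomly assigned diagnoses, any differences between snapshots reflect sampling noise rather than real temporal trends.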

In [None]:
# Compare diagnosis distribution across snapshots
print("Diagnosis distribution across temporal snapshots:")
print("=" * 50)

for date, snapshot in snapshots.items():
    print(f"\nSnapshot {date} ({snapshot['info']['total_records']} patients):")
    diagnosis_counts = snapshot['data']['diagnosis'].value_counts()
    for diagnosis, count in diagnosis_counts.items():
        percentage = (count / snapshot['info']['total_records']) * 100
        print(f"  {diagnosis}: {count} ({percentage:.1f}%)")
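
Since each snapshot includes every admission up to its cutoff, the number of new admissions between two snapshots is the difference of their sizes. The sketch below shows this on a handful of made-up admission dates (invented for the example), using the same three cutoffs as above.

```python
import pandas as pd

# Made-up admission dates for illustration only.
df = pd.DataFrame({
    'patient_id': ['P_0001', 'P_0002', 'P_0003', 'P_0004', 'P_0005'],
    'admission_date': pd.to_datetime(
        ['2020-03-15', '2020-11-02', '2021-06-30', '2022-02-10', '2022-12-01']
    ),
})

cutoffs = pd.to_datetime(['2021-01-01', '2021-12-31', '2022-12-31'])

# Cumulative size of each snapshot: all admissions on or before the cutoff.
cumulative = pd.Series(
    [int((df['admission_date'] <= c).sum()) for c in cutoffs],
    index=cutoffs.strftime('%Y-%m-%d'),
)

# New admissions since the previous cutoff (the first entry keeps its own count).
new_since_previous = cumulative.diff().fillna(cumulative.iloc[0]).astype(int)

print(cumulative.tolist())
print(new_since_previous.tolist())
```

Reporting both numbers side by side makes it easier to tell growth in the cohort apart from a change in its composition.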

## 3. Reproducible Filters

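Reproducible filters ensure that the same filtering criteria can be applied consistently across different analyses. We'll create a class that stores filter conditions, applies them to a DataFrame, and saves or loads its configuration as JSON. Note that conditions are keyed by column name, so adding a second condition for the same column replaces the first.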

In [None]:
class ReproducibleFilter:
    def __init__(self, name, description=""):
        self.name = name
        self.description = description
        self.conditions = {}
        self.metadata = {
            'created_at': datetime.now().isoformat(),
            'version': '1.0'
        }
    
    def add_condition(self, column, condition_type, values):
        """Add a filter condition"""
        self.conditions[column] = {
            'type': condition_type,
            'values': values
        }
        return self
    
    def apply(self, df):
        """Apply filter to dataframe"""
        filtered_df = df.copy()
        filter_log = []
        
        for column, condition in self.conditions.items():
            initial_count = len(filtered_df)
            
            if condition['type'] == 'in':
                filtered_df = filtered_df[filtered_df[column].isin(condition['values'])]
            elif condition['type'] == 'range':
                if 'min' in condition['values']:
                    filtered_df = filtered_df[filtered_df[column] >= condition['values']['min']]
                if 'max' in condition['values']:
                    filtered_df = filtered_df[filtered_df[column] <= condition['values']['max']]
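            elif condition['type'] == 'greater_than':
                filtered_df = filtered_df[filtered_df[column] > condition['values']]
            elif condition['type'] == 'less_than':
                filtered_df = filtered_df[filtered_df[column] < condition['values']]
            else:
                # Fail loudly instead of silently skipping an unknown condition type
                raise ValueError(f"Unknown condition type for '{column}': {condition['type']!r}")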
            
            filter_log.append(f"{column}: {initial_count} -> {len(filtered_df)}")
        
        return filtered_df, filter_log
    
    def save_config(self, filepath):
        """Save filter configuration to JSON file"""
        config = {
            'name': self.name,
            'description': self.description,
            'conditions': self.conditions,
            'metadata': self.metadata
        }
        
        with open(filepath, 'w') as f:
            json.dump(config, f, indent=2)
        
        print(f"Filter configuration saved to {filepath}")
    
    @classmethod
    def load_config(cls, filepath):
        """Load filter configuration from JSON file"""
        with open(filepath, 'r') as f:
            config = json.load(f)
        
        filter_obj = cls(config['name'], config['description'])
        filter_obj.conditions = config['conditions']
        filter_obj.metadata = config['metadata']
        
        return filter_obj
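
A configuration loaded from JSON may have been edited by hand, so it can help to check it against the target DataFrame before applying it. Catching a misspelled column or an unrecognized condition type up front gives a clearer report than a failure partway through filtering. The helper below is a hypothetical sketch, not part of the class above; `validate_conditions` and `KNOWN_TYPES` are names introduced here, and the set of types is an assumption that mirrors the ones used in this notebook.

```python
import pandas as pd

# Assumed set of condition types, mirroring those used in this notebook.
KNOWN_TYPES = {'in', 'range', 'greater_than', 'less_than'}

def validate_conditions(conditions, df):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for column, condition in conditions.items():
        if column not in df.columns:
            problems.append(f"unknown column: {column}")
        ctype = condition.get('type')
        values = condition.get('values')
        if ctype not in KNOWN_TYPES:
            problems.append(f"unknown condition type for {column}: {ctype}")
        elif ctype == 'range' and not (
            isinstance(values, dict) and ({'min', 'max'} & values.keys())
        ):
            problems.append(f"range condition for {column} has neither min nor max")
    return problems

demo_df = pd.DataFrame({'age': [34, 61], 'diagnosis': ['Asthma', 'Diabetes']})
good = {'age': {'type': 'greater_than', 'values': 50}}
bad = {
    'agee': {'type': 'greater_than', 'values': 50},        # misspelled column
    'diagnosis': {'type': 'equals', 'values': 'Asthma'},   # unsupported type
}

print(validate_conditions(good, demo_df))
print(validate_conditions(bad, demo_df))
```

Returning a list of problems rather than raising on the first one lets a reviewer see everything that is wrong with a configuration in a single pass.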

Now let's create and configure a reproducible filter for high-risk patients.

In [None]:
# Create a reproducible filter for high-risk patients
high_risk_filter = ReproducibleFilter(
    name="High_Risk_Patients_v1",
    description="Filter for patients with elevated cardiovascular risk factors"
)

# Add conditions to the filter
high_risk_filter.add_condition('age', 'greater_than', 50)
high_risk_filter.add_condition('blood_pressure_sys', 'greater_than', 130)
high_risk_filter.add_condition('diagnosis', 'in', ['Diabetes', 'Hypertension'])

print(f"Created filter: {high_risk_filter.name}")
print(f"Description: {high_risk_filter.description}")
print(f"Conditions: {len(high_risk_filter.conditions)}")

Let's apply the reproducible filter to our dataset and examine the results.

In [None]:
# Apply the filter
filtered_data, log = high_risk_filter.apply(df_patients)

print("Filter application log:")
for entry in log:
    print(f"  {entry}")

print(f"\nFinal filtered dataset: {len(filtered_data)} patients")
print("\nFiltered data characteristics:")
print(filtered_data[['age', 'diagnosis', 'blood_pressure_sys', 'glucose_level']].describe())

Now we'll save the filter configuration to ensure reproducibility across different analysis sessions.

In [None]:
# Save the filter configuration
filter_path = "high_risk_filter_config.json"
high_risk_filter.save_config(filter_path)

# Demonstrate loading the saved configuration
loaded_filter = ReproducibleFilter.load_config(filter_path)
print(f"\nLoaded filter: {loaded_filter.name}")
print(f"Created at: {loaded_filter.metadata['created_at']}")
print(f"Conditions loaded: {list(loaded_filter.conditions.keys())}")
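
The saved file contains a `created_at` timestamp, so two configuration files with identical conditions can still differ byte for byte. One way to check whether two filters are logically the same is to fingerprint only the conditions. The sketch below assumes the conditions are plain JSON-compatible dictionaries, as in this notebook; `conditions_fingerprint` is a hypothetical helper introduced for illustration.

```python
import hashlib
import json

def conditions_fingerprint(conditions):
    """SHA-256 of the conditions serialized as canonical JSON."""
    # sort_keys makes the result independent of dictionary insertion order
    canonical = json.dumps(conditions, sort_keys=True, separators=(',', ':'))
    return hashlib.sha256(canonical.encode('utf-8')).hexdigest()

a = {
    'age': {'type': 'greater_than', 'values': 50},
    'diagnosis': {'type': 'in', 'values': ['Diabetes', 'Hypertension']},
}
# Same conditions, different key insertion order.
b = {
    'diagnosis': {'values': ['Diabetes', 'Hypertension'], 'type': 'in'},
    'age': {'values': 50, 'type': 'greater_than'},
}
# One threshold changed.
c = {**a, 'age': {'type': 'greater_than', 'values': 55}}

print(conditions_fingerprint(a) == conditions_fingerprint(b))
print(conditions_fingerprint(a) == conditions_fingerprint(c))
```

Note that the order of items inside a `values` list still affects the fingerprint, so lists would need to be sorted first if their order should not matter.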

## 4. Export Pipeline Integration

Let's combine these patterns into a pipeline that applies an optional temporal snapshot, then a reproducible filter, and records each step alongside a hash of the resulting data.

In [None]:
def create_export_pipeline(df, filter_obj, snapshot_date=None, export_name="medical_export"):
    """
    Comprehensive export pipeline combining filters, snapshots, and documentation
    """
    pipeline_info = {
        'export_name': export_name,
        'original_records': len(df),
        'pipeline_steps': [],
        'created_at': datetime.now().isoformat()
    }
    
    # Step 1: Apply temporal snapshot if specified
    current_df = df.copy()
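    if snapshot_date is not None:
        current_df, snapshot_info = create_temporal_snapshot(current_df, snapshot_date)
        pipeline_info['pipeline_steps'].append({
            'step': 'temporal_snapshot',
            # Use the formatted string so the metadata stays JSON-serializable
            # even when snapshot_date is passed as a datetime object
            'snapshot_date': snapshot_info['snapshot_date'],
            'records_after': len(current_df)
        })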
    
    # Step 2: Apply reproducible filter
    filtered_df, filter_log = filter_obj.apply(current_df)
    pipeline_info['pipeline_steps'].append({
        'step': 'reproducible_filter',
        'filter_name': filter_obj.name,
        'filter_log': filter_log,
        'records_after': len(filtered_df)
    })
    
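    # Step 3: Generate data hash for integrity checking
    # Hash the CSV serialization (the same form used for the export file) with SHA-256;
    # to_string() output depends on pandas display settings such as float precision
    data_string = filtered_df.to_csv(index=False)
    data_hash = hashlib.sha256(data_string.encode('utf-8')).hexdigest()
    pipeline_info['data_hash'] = data_hash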
    pipeline_info['final_records'] = len(filtered_df)
    
    return filtered_df, pipeline_info
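
A digest computed from a text rendering of the DataFrame depends on how values are formatted. As a sketch of an alternative, assuming it is acceptable to hash the in-memory values directly, `pd.util.hash_pandas_object` produces one hash per row, and those can be combined into a single digest. `frame_digest` is a hypothetical helper name. To my understanding this digest covers values only and not column names, so consider hashing the column names separately if that matters for your use case.

```python
import hashlib
import pandas as pd

def frame_digest(df):
    """SHA-256 over the per-row hashes of the values (index ignored)."""
    row_hashes = pd.util.hash_pandas_object(df, index=False)
    return hashlib.sha256(row_hashes.to_numpy().tobytes()).hexdigest()

demo = pd.DataFrame({
    'patient_id': ['P_0001', 'P_0002'],
    'glucose_level': [101.5, 98.25],
})

changed = demo.copy()
changed.loc[1, 'glucose_level'] = 98.2500001  # tiny numeric change

print(frame_digest(demo) == frame_digest(demo.copy()))
print(frame_digest(demo) == frame_digest(changed))
```

Because the row hashes are combined in order, sorting the DataFrame by a stable key such as `patient_id` before hashing would be needed if row order should not affect the digest.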

Let's execute the complete export pipeline with a temporal snapshot and our high-risk filter.

In [None]:
# Execute complete export pipeline
export_df, export_info = create_export_pipeline(
    df_patients, 
    high_risk_filter, 
    snapshot_date='2022-01-01',
    export_name="HighRisk_2022_Export"
)

print("Export Pipeline Results:")
print(f"Export name: {export_info['export_name']}")
print(f"Original records: {export_info['original_records']}")
print(f"Final records: {export_info['final_records']}")
print(f"Data integrity hash: {export_info['data_hash'][:16]}...")

print("\nPipeline steps:")
for step in export_info['pipeline_steps']:
    print(f"  {step['step']}: {step['records_after']} records")

Finally, let's save both the exported data and its metadata for complete reproducibility.

In [None]:
# Save export data and metadata
export_filename = f"{export_info['export_name']}.csv"
metadata_filename = f"{export_info['export_name']}_metadata.json"

# Save data
export_df.to_csv(export_filename, index=False)

# Save metadata
with open(metadata_filename, 'w') as f:
    json.dump(export_info, f, indent=2)

print("Export completed successfully!")
print(f"Data saved to: {export_filename}")
print(f"Metadata saved to: {metadata_filename}")
print("\nExported dataset preview:")
export_df.head()
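
Someone receiving the export may want to confirm that the CSV file they hold is the one the metadata describes. The sketch below shows one way to do that, assuming the metadata records a digest of the file's bytes. The `file_sha256` key and the demo file names are hypothetical and are not produced by the pipeline above. The example writes to a temporary directory so it does not touch the files saved in the previous cell.

```python
import hashlib
import json
import tempfile
from pathlib import Path

import pandas as pd

def sha256_of_file(path):
    """Hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / 'demo_export.csv'            # hypothetical file names
    meta_path = Path(tmp) / 'demo_export_metadata.json'

    demo = pd.DataFrame({'patient_id': ['P_0001', 'P_0002'], 'age': [63, 58]})
    demo.to_csv(csv_path, index=False)

    # Record the row count and file digest next to the data.
    meta = {'final_records': len(demo), 'file_sha256': sha256_of_file(csv_path)}
    meta_path.write_text(json.dumps(meta, indent=2))

    # Later, a recipient re-checks the file against the metadata.
    recorded = json.loads(meta_path.read_text())
    reloaded = pd.read_csv(csv_path)
    rows_match = len(reloaded) == recorded['final_records']
    hash_match = sha256_of_file(csv_path) == recorded['file_sha256']

print(rows_match, hash_match)
```

Hashing the file bytes checks the artifact that is actually shared, which is independent of how pandas parses it back into a DataFrame.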

## Summary

In this notebook, we covered three essential export patterns for medical data integration:

1. **Data Subsets**: Creating focused datasets based on clinical criteria
2. **Temporal Snapshots**: Capturing data state at specific time points for longitudinal studies
3. **Reproducible Filters**: Implementing reusable, configurable filtering systems

These patterns ensure data consistency, reproducibility, and proper documentation in medical research workflows.

## Exercise

Create a comprehensive export for a pediatric study using the patterns learned in this notebook:

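1. **Create a pediatric filter**: Design a ReproducibleFilter for patients aged 0-18 with asthma diagnosis. Because the synthetic ages are drawn from a normal distribution centered on 55, very few patients fall in this range, so consider generating a separate pediatric cohort or widening the age distribution first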
2. **Generate quarterly snapshots**: Create temporal snapshots for each quarter of 2021 (Q1: March 31, Q2: June 30, Q3: September 30, Q4: December 31)
3. **Compare trends**: Analyze how the pediatric asthma population changed across these quarters
4. **Export pipeline**: Use the export pipeline to create a final dataset combining your filter with the Q4 2021 snapshot
5. **Documentation**: Save all configurations and metadata to ensure your analysis can be reproduced

**Bonus**: Calculate the percentage change in pediatric asthma cases between Q1 and Q4 2021, and create a filter configuration that could identify patients with concerning vital signs (you define the criteria).