<a href="https://colab.research.google.com/github/altalanta/clinical-data-platform/blob/main/notebooks/public_demo_quickstart.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🏥 Clinical Data Platform - Public Demo Quickstart

This notebook demonstrates the clinical data platform using **synthetic OMOP CDM data**. No PHI or real patient data is used.

**What you'll see:**
- Synthetic patient demographics, visits, conditions, and measurements
- Data quality validation with Pandera schemas
- Demographic and visit analytics with pandas and matplotlib
- Analytical SQL queries with DuckDB

**Runtime:** ~3-5 minutes

## 🚀 Setup

First, let's install the required dependencies and clone the repository:

In [None]:
# Install required packages
!pip install -q duckdb pandas numpy great-expectations pandera matplotlib seaborn

# Clone the repository
!git clone --depth 1 https://github.com/altalanta/clinical-data-platform.git
%cd clinical-data-platform

# Install the platform package
!pip install -q -e .

## 📦 Import Libraries

In [None]:
import sys
import json
import pandas as pd
import numpy as np
import duckdb
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# Add src to path for imports
sys.path.insert(0, str(Path.cwd() / 'src'))

from clinical_platform.data_adapters.public_cdm import PublicCDMAdapter
from clinical_platform.validation.pandera_public import PublicCDMSchemas, validate_public_cdm_data

print("✅ All imports successful!")

## 🎲 Generate Synthetic Clinical Data

Let's create realistic synthetic OMOP CDM data:

In [None]:
# Initialize the adapter with fast mode for Colab
adapter = PublicCDMAdapter(seed=42, fast_mode=True)

print("🏭 Generating synthetic OMOP CDM data...")
print("   - Person demographics")
print("   - Visit occurrences")
print("   - Condition diagnoses")
print("   - Laboratory measurements")

# Generate the data files
adapter.fetch(target_dir="data/public_cdm")

print("\n✅ Synthetic data generated successfully!")
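
The `seed` argument is what makes the generated data reproducible between runs. The adapter's internals are not shown in this notebook, so the sketch below uses a hypothetical `make_person_table` helper (not part of the platform) to illustrate the general idea with NumPy's seeded generator: the same seed yields the same table.

```python
import numpy as np
import pandas as pd


def make_person_table(n_rows: int, seed: int) -> pd.DataFrame:
    """Hypothetical stand-in for a seeded generator (not the adapter's code)."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "person_id": np.arange(1, n_rows + 1),
        # 8507 = male, 8532 = female (the gender concept IDs used later in this notebook)
        "gender_concept_id": rng.choice([8507, 8532], size=n_rows),
        # Upper bound is exclusive, so birth years fall in 1930..2005 (illustrative range)
        "year_of_birth": rng.integers(1930, 2006, size=n_rows),
    })


first = make_person_table(5, seed=42)
second = make_person_table(5, seed=42)

# Two runs with the same seed produce identical tables
print(first.equals(second))  # True
```

Fixing the seed is what lets a demo like this show the same counts and charts every time it is run.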

## 📊 Load and Explore the Data

In [None]:
# Load the generated CSV files
person_df = pd.read_csv("data/public_cdm/person.csv")
visit_df = pd.read_csv("data/public_cdm/visit_occurrence.csv")
condition_df = pd.read_csv("data/public_cdm/condition_occurrence.csv")
measurement_df = pd.read_csv("data/public_cdm/measurement.csv")

# Convert date columns
for df in [person_df, visit_df, condition_df, measurement_df]:
    for col in df.columns:
        if 'date' in col.lower() or col == 'birth_datetime':
            df[col] = pd.to_datetime(df[col], errors='coerce')

print("📋 Dataset Summary:")
print(f"   👥 Persons: {len(person_df):,}")
print(f"   🏥 Visits: {len(visit_df):,}")
print(f"   🩺 Conditions: {len(condition_df):,}")
print(f"   🧪 Measurements: {len(measurement_df):,}")
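
Because the date columns are parsed with `errors='coerce'`, any value that cannot be parsed silently becomes `NaT`. A minimal sketch of how one might count such values, using a small made-up frame rather than the generated files:

```python
import pandas as pd

# Made-up sample with one malformed date, for illustration only
raw = pd.DataFrame({
    "visit_occurrence_id": [1, 2, 3],
    "visit_start_date": ["2023-01-05", "not-a-date", "2023-02-10"],
})

parsed = pd.to_datetime(raw["visit_start_date"], errors="coerce")

# Values that were present in the raw column but could not be parsed
unparseable = raw["visit_start_date"].notna() & parsed.isna()
print(int(unparseable.sum()))  # 1
```

Running a check like this before overwriting the column makes it easier to tell a genuinely missing date from a malformed one.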

### Sample Data Preview

In [None]:
print("👥 Sample Person Records:")
display(person_df[['person_id', 'gender_concept_id', 'year_of_birth', 'race_concept_id']].head())

print("\n🏥 Sample Visit Records:")
display(visit_df[['visit_occurrence_id', 'person_id', 'visit_concept_id', 'visit_start_date', 'visit_end_date']].head())

## ✅ Data Quality Validation

Let's run comprehensive data quality checks using Pandera schemas:

In [None]:
print("🔍 Running data quality validation...")

# Run pandera validation
validation_results = validate_public_cdm_data(
    person_df=person_df,
    visit_df=visit_df,
    condition_df=condition_df,
    measurement_df=measurement_df
)

print("\n📊 Validation Results:")
for table_name, result in validation_results.items():
    status_emoji = "✅" if result['status'] == 'passed' else "❌"
    print(f"   {status_emoji} {table_name}: {result['status']} ({result['row_count']:,} rows)")
    
    if result['errors']:
        for error in result['errors'][:3]:  # Show first 3 errors
            print(f"      - {error}")
        if len(result['errors']) > 3:
            print(f"      ... and {len(result['errors']) - 3} more errors")

# Check overall validation status
all_passed = all(result['status'] == 'passed' for result in validation_results.values())
print(f"\n🎯 Overall Status: {'✅ ALL VALIDATIONS PASSED' if all_passed else '❌ Some validations failed'}")
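
The reporting loop above relies only on each table's result exposing `status`, `row_count`, and `errors`. To make that shape concrete, here is a hand-rolled sketch in plain pandas. It is not Pandera and not the platform's `validate_public_cdm_data`; the function name `check_person_table` and the 1900-2024 bounds are assumptions chosen for illustration.

```python
import pandas as pd


def check_person_table(df: pd.DataFrame) -> dict:
    """Hypothetical check that returns the same result shape as used above."""
    errors = []
    if df["person_id"].duplicated().any():
        errors.append("person_id contains duplicates")
    # between() is False for missing values, so nulls are flagged too
    if not df["year_of_birth"].between(1900, 2024).all():
        errors.append("year_of_birth missing or outside 1900-2024")
    return {
        "status": "passed" if not errors else "failed",
        "row_count": len(df),
        "errors": errors,
    }


good = pd.DataFrame({"person_id": [1, 2], "year_of_birth": [1980, 1995]})
bad = pd.DataFrame({"person_id": [1, 1], "year_of_birth": [1980, 2090]})

print(check_person_table(good))
print(check_person_table(bad))
```

A schema library such as Pandera expresses the same kind of rules declaratively, which is usually easier to maintain than a growing list of `if` statements.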

## 📈 Data Analytics & Insights

Let's chart demographics, visit types, and length of stay for the synthetic cohort:

In [None]:
# Set up plotting style
plt.style.use('default')
sns.set_palette("husl")

# Create demographic analysis
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
# Plain-text title: matplotlib's default font has no emoji glyphs
fig.suptitle('Synthetic Patient Demographics Analysis', fontsize=16, fontweight='bold')

# Age distribution
current_year = 2024
ages = current_year - person_df['year_of_birth']
axes[0, 0].hist(ages, bins=20, alpha=0.7, color='skyblue', edgecolor='black')
axes[0, 0].set_title('Age Distribution')
axes[0, 0].set_xlabel('Age (years)')
axes[0, 0].set_ylabel('Count')

# Gender distribution
gender_mapping = {8507: 'Male', 8532: 'Female', 8551: 'Unknown', 0: 'Other'}
gender_counts = person_df['gender_concept_id'].map(gender_mapping).value_counts()
axes[0, 1].pie(gender_counts.values, labels=gender_counts.index, autopct='%1.1f%%', startangle=90)
axes[0, 1].set_title('Gender Distribution')

# Visit types
visit_mapping = {9201: 'Inpatient', 9202: 'Outpatient', 9203: 'Emergency'}
visit_counts = visit_df['visit_concept_id'].map(visit_mapping).value_counts()
axes[1, 0].bar(visit_counts.index, visit_counts.values, color=['lightcoral', 'lightgreen', 'lightsalmon'])
axes[1, 0].set_title('Visit Types')
axes[1, 0].set_ylabel('Count')
axes[1, 0].tick_params(axis='x', rotation=45)

# Length of stay distribution
los_days = (visit_df['visit_end_date'] - visit_df['visit_start_date']).dt.days
axes[1, 1].hist(los_days[los_days <= 30], bins=20, alpha=0.7, color='lightsteelblue', edgecolor='black')
axes[1, 1].set_title('Length of Stay (≤30 days)')
axes[1, 1].set_xlabel('Days')
axes[1, 1].set_ylabel('Count')

plt.tight_layout()
plt.show()

print("📊 Key Statistics:")
print(f"   📅 Age Range: {ages.min():.0f} - {ages.max():.0f} years")
print(f"   🏥 Avg Visits per Patient: {len(visit_df) / len(person_df):.1f}")
print(f"   🩺 Avg Conditions per Patient: {len(condition_df) / len(person_df):.1f}")
print(f"   🧪 Avg Measurements per Patient: {len(measurement_df) / len(person_df):.1f}")

## 🗄️ Database Operations with DuckDB

Let's load the data into DuckDB and run some analytical queries:

In [None]:
# Create DuckDB connection
conn = duckdb.connect('colab_demo.duckdb')

print("🗄️ Loading data into DuckDB...")

# Load data into DuckDB
# CREATE OR REPLACE keeps the cell re-runnable against the on-disk database file
conn.execute("CREATE OR REPLACE TABLE person AS SELECT * FROM person_df")
conn.execute("CREATE OR REPLACE TABLE visit_occurrence AS SELECT * FROM visit_df")
conn.execute("CREATE OR REPLACE TABLE condition_occurrence AS SELECT * FROM condition_df")
conn.execute("CREATE OR REPLACE TABLE measurement AS SELECT * FROM measurement_df")

print("✅ Data loaded successfully!")

# Run analytical queries
print("\n🔍 Running analytical queries...")

# Patient summary query
query = """
SELECT 
    COUNT(*) as total_patients,
    AVG(2024 - year_of_birth) as avg_age,
    COUNT(DISTINCT CASE WHEN gender_concept_id = 8507 THEN person_id END) as male_count,
    COUNT(DISTINCT CASE WHEN gender_concept_id = 8532 THEN person_id END) as female_count
FROM person
"""

result = conn.execute(query).fetchone()
print(f"   👥 Total Patients: {result[0]:,}")
print(f"   📅 Average Age: {result[1]:.1f} years")
print(f"   👨 Male Patients: {result[2]:,}")
print(f"   👩 Female Patients: {result[3]:,}")

# Visit patterns
visit_query = """
SELECT 
    visit_concept_id,
    COUNT(*) as visit_count,
    AVG(date_diff('day', visit_start_date, visit_end_date)) as avg_length_of_stay
FROM visit_occurrence 
GROUP BY visit_concept_id
ORDER BY visit_count DESC
"""

visit_results = conn.execute(visit_query).fetchall()
print("\n🏥 Visit Patterns:")
for visit_type, count, avg_los in visit_results:
    visit_name = visit_mapping.get(visit_type, f"Type {visit_type}")
    print(f"   {visit_name}: {count:,} visits (avg LOS: {avg_los:.1f} days)")

# Close connection
conn.close()
print("\n✅ Database operations completed!")
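
The visit-pattern query can be cross-checked in pandas. The sketch below uses a small made-up `visits` frame (not the generated data) to show an equivalent group-by: count visits per `visit_concept_id` and average the length of stay in whole days.

```python
import pandas as pd

# Made-up visits for illustration: two inpatient (9201), one outpatient (9202)
visits = pd.DataFrame({
    "visit_concept_id": [9201, 9201, 9202],
    "visit_start_date": pd.to_datetime(["2023-01-01", "2023-03-10", "2023-05-05"]),
    "visit_end_date": pd.to_datetime(["2023-01-04", "2023-03-15", "2023-05-05"]),
})

# Length of stay in whole days, as in the plotting cell earlier
los_days = (visits["visit_end_date"] - visits["visit_start_date"]).dt.days

summary = (
    visits.assign(los_days=los_days)
    .groupby("visit_concept_id")
    .agg(visit_count=("los_days", "size"), avg_length_of_stay=("los_days", "mean"))
    .sort_values("visit_count", ascending=False)
)
print(summary)
```

Applied to `visit_df`, the same pattern should give counts and averages that match the SQL result, which is a quick way to confirm the load into DuckDB preserved the data.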

## 🎯 Summary & Next Steps

### What we accomplished:
✅ Generated realistic synthetic OMOP CDM data  
✅ Validated data quality with comprehensive schemas  
✅ Created demographic and clinical analytics  
✅ Demonstrated database operations with DuckDB  

### Key Features Demonstrated:
- **No PHI**: 100% synthetic data, safe for public demos
- **OMOP CDM Structure**: Tables and concept IDs follow the OMOP Common Data Model
- **Data Quality**: Schema validation with Pandera (Great Expectations runs in the full pipeline)
- **Analytics Ready**: Structured for downstream ML and analytics
- **Fast & Lightweight**: Generates complete dataset in minutes

### Try it yourself:
1. **GitHub Codespaces**: [![Open in GitHub Codespaces](https://github.com/codespaces/badge.svg)](https://codespaces.new/altalanta/clinical-data-platform?quickstart=1)
2. **Local Setup**: `git clone` → `make demo-public`
3. **Full Pipeline**: Includes dbt transformations, Great Expectations, and data docs

---

**🏥 Clinical Data Platform** - Production-ready healthcare data infrastructure with privacy-first design.

In [None]:
# Final metrics summary
metrics_summary = {
    "demo_completed": True,
    "data_quality_passed": all_passed,
    "total_patients": len(person_df),
    "total_visits": len(visit_df),
    "total_conditions": len(condition_df),
    "total_measurements": len(measurement_df),
    "avg_age": float(ages.mean()),
    "gender_distribution": gender_counts.to_dict(),
    "visit_distribution": visit_counts.to_dict()
}

print("📋 Final Demo Metrics:")
print(json.dumps(metrics_summary, indent=2, default=str))

print("\n🎉 Demo completed successfully! Thank you for trying the Clinical Data Platform.")