# Welcome to the ZIMAM AI Datathon 2025

## November 6-8, 2025 | Dubai, UAE

Welcome! This notebook will help you get started with the ZIMAM AI Datathon 2025. You'll learn about the datasets, set up your environment, and explore sample analyses.

### What You'll Learn

- Introduction to MIMIC-IV and eICU datasets
- How to access and query the data
- Sample analyses and visualization techniques
- Best practices for working with healthcare data
- Tips for successful collaboration

### Event Overview

Over two and a half intensive days, multidisciplinary teams of clinicians and data scientists will use real-world critical care datasets to answer clinical questions and prototype AI solutions for patient care.

**Datasets**: MIMIC-IV and eICU Collaborative Research Database  
**Focus**: AI + Data Science in Healthcare  
**Goal**: Build trustworthy, transparent, and impactful AI solutions

---

## 1. Environment Setup

First, let's set up the Python environment with necessary libraries for data analysis and visualization.

In [None]:
# Install required packages
# Uncomment the following line if running for the first time
# (db-dtypes is needed by BigQuery's to_dataframe() in recent client versions)
# !pip install pandas numpy matplotlib seaborn scikit-learn google-cloud-bigquery db-dtypes plotly

In [None]:
# Import essential libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
import warnings

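# Configure display settings
# NOTE: this hides all warnings, including deprecation and convergence warnings;
# comment it out while debugging
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
plt.style.use('seaborn-v0_8-darkgrid')  # this style name requires matplotlib >= 3.6
sns.set_palette("husl")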

print("Environment setup complete!")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")
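
Before going further, it can help to confirm that the packages from the install cell are importable. The sketch below uses only the standard library. The helper name `find_missing_modules` is a hypothetical name chosen for this example, and some import names differ from their pip names (for example, `scikit-learn` is imported as `sklearn`).

```python
import importlib.util


def find_missing_modules(module_names):
    """Return the import names that cannot be located in this environment."""
    missing = []
    for name in module_names:
        try:
            spec = importlib.util.find_spec(name)
        except ImportError:
            # Raised for dotted names whose parent package is absent
            spec = None
        if spec is None:
            missing.append(name)
    return missing


# Import names (not pip names) for the packages listed in the install cell
required = ["pandas", "numpy", "matplotlib", "seaborn", "sklearn",
            "google.cloud.bigquery", "plotly"]
missing = find_missing_modules(required)
print("Missing modules:", missing if missing else "none")
```

Anything reported as missing can be installed with the pip command in the first cell.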

---

## 2. Introduction to MIMIC-IV

### What is MIMIC-IV?

**MIMIC-IV (Medical Information Mart for Intensive Care IV)** is a large database, freely available to credentialed users, of de-identified health-related data from patients admitted to critical care units at the Beth Israel Deaconess Medical Center in Boston, Massachusetts.

### Key Features

- **ICU admissions**: Over 70,000
- **Time Period**: 2008-2019
- **Data Types**: 
  - Demographics
  - Vital signs (heart rate, blood pressure, temperature, etc.)
  - Laboratory measurements
  - Medications
  - Clinical notes
  - Diagnostic codes (ICD-9, ICD-10)
  - Procedures

### Database Structure

MIMIC-IV is organized into several modules:

1. **Hosp**: Hospital-wide data including patient demographics, admissions, diagnoses, procedures, pharmacy, lab tests (in current releases the `patients` and `admissions` tables live here)
2. **ICU**: ICU-specific data including vital signs, procedures, medications
3. **ED**: Emergency department data
4. **CXR**: Chest X-ray reports and images

### Important Tables

- `admissions`: Hospital admission and discharge information
- `patients`: Patient demographics
- `icustays`: ICU admission details
- `chartevents`: Vital signs and clinical measurements
- `labevents`: Laboratory test results
- `prescriptions`: Medication orders
- `diagnoses_icd`: Diagnostic codes

### Sample MIMIC-IV Query

Here's an example of how you might query MIMIC-IV data (replace with your actual data access method):

In [None]:
# Example: define a sample MIMIC-IV query
# NOTE: This cell only builds and prints the SQL text; nothing is executed.
# During the datathon, run it through your BigQuery or database connection.

sample_query = """
SELECT 
    p.subject_id,
    p.gender,
    p.anchor_age,
    a.admittime,
    a.dischtime,
    a.admission_type,
    a.admission_location
FROM 
    `physionet-data.mimiciv_hosp.patients` p
INNER JOIN 
    `physionet-data.mimiciv_hosp.admissions` a
ON 
    p.subject_id = a.subject_id
LIMIT 1000
"""

print("Sample MIMIC-IV Query:")
print(sample_query)
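
`anchor_age` in the query above is the patient's age in a reference year, not the age at each admission. In MIMIC-IV the `patients` table also carries an `anchor_year` column (not selected in the sample query), and age at admission is commonly derived as `anchor_age` plus the years elapsed between `anchor_year` and `admittime`. A minimal pandas sketch on made-up rows, with `add_admission_age` as a hypothetical helper name:

```python
import pandas as pd

# Toy rows that mimic the column layout of the patients and admissions tables
patients = pd.DataFrame({
    "subject_id": [1, 2],
    "gender": ["F", "M"],
    "anchor_age": [60, 45],
    "anchor_year": [2150, 2130],
})
admissions = pd.DataFrame({
    "subject_id": [1, 1, 2],
    "hadm_id": [101, 102, 201],
    "admittime": pd.to_datetime(
        ["2150-03-01 10:00", "2153-07-15 08:30", "2131-01-20 22:15"]
    ),
})


def add_admission_age(patients, admissions):
    """Join admissions to patients and derive the age at each admission."""
    merged = admissions.merge(
        patients, on="subject_id", how="inner", validate="many_to_one"
    )
    years_since_anchor = merged["admittime"].dt.year - merged["anchor_year"]
    merged["admission_age"] = merged["anchor_age"] + years_since_anchor
    return merged


cohort = add_admission_age(patients, admissions)
print(cohort[["subject_id", "hadm_id", "admission_age"]])
```

The `validate="many_to_one"` argument makes the merge fail loudly if `subject_id` is ever duplicated in the patients frame, which guards against silently multiplied rows.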

---

## 3. Introduction to eICU

### What is eICU?

The **eICU Collaborative Research Database** is a multi-center ICU database with high-granularity data for over 200,000 admissions to ICUs across the United States.

### Key Features

- **ICU admissions**: Over 200,000
- **Hospitals**: 335 units across 208 hospitals
- **Time Period**: 2014-2015
- **Geographic Diversity**: Multiple regions across the United States
- **Data Types**:
  - Vital signs
  - Laboratory values
  - APACHE scores
  - Care plan data
  - Diagnoses
  - Treatment information

### Database Structure

Key tables in eICU:

- `patient`: Patient demographics and outcomes
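- `vitalPeriodic`: Vital signs captured continuously by bedside monitors and stored at regular intervals (5-minute medians)
- `vitalAperiodic`: Vital signs recorded at irregular intervals (for example, non-invasive blood pressure)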
- `lab`: Laboratory results
- `medication`: Medication administration
- `diagnosis`: Admission and active diagnoses
- `treatment`: Treatment information
- `apacheApsVar`: APACHE scoring variables
- `apachePatientResult`: APACHE predictions and outcomes

### eICU vs MIMIC-IV

**eICU Advantages**:
- Multi-center data (better generalizability)
- Geographic diversity
- APACHE scoring already calculated

**MIMIC-IV Advantages**:
- Longer time span (2008-2019 versus 2014-2015)
- More detailed clinical notes
- Laboratory data with more granular timing

### Sample eICU Query

Here's an example of how you might query eICU data:

In [None]:
# Example: define a sample eICU query
# NOTE: This cell only builds and prints the SQL text; run it through your own data access method during the datathon

sample_eicu_query = """
SELECT 
    p.patientunitstayid,
    p.gender,
    p.age,  -- stored as text; patients older than 89 appear as '> 89'
    p.ethnicity,
    p.admissionheight,
    p.admissionweight,
    p.unittype,
    p.unitadmitsource,
    p.unitdischargestatus,
    p.hospitaldischargestatus
FROM 
    `physionet-data.eicu_crd.patient` p
LIMIT 1000
"""

print("Sample eICU Query:")
print(sample_eicu_query)
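
Two details of the eICU `patient` table tend to trip up first-time users: `age` is stored as text (patients older than 89 appear as `> 89`), and times are stored as offsets in minutes from unit admission rather than as timestamps. The sketch below cleans both on made-up rows. The column `unitdischargeoffset` is not part of the sample query above, the helper name `clean_eicu_patient` is hypothetical, and mapping `> 89` to 90 is an assumption made for illustration, not an official convention.

```python
import pandas as pd

# Toy rows that mimic a few columns of the eICU patient table
toy_patient = pd.DataFrame({
    "patientunitstayid": [1, 2, 3, 4],
    "age": ["67", "> 89", "", "45"],
    "unitdischargeoffset": [2880, 720, 10080, 1440],  # minutes from unit admission
})


def clean_eicu_patient(df):
    """Add a numeric age and a unit length of stay in days."""
    out = df.copy()
    age_text = out["age"].astype(str).str.strip().replace({"> 89": "90"})
    # Empty or unparseable ages become NaN instead of raising
    out["age_years"] = pd.to_numeric(age_text, errors="coerce")
    out["unit_los_days"] = out["unitdischargeoffset"] / (60 * 24)
    return out


cleaned_patient = clean_eicu_patient(toy_patient)
print(cleaned_patient[["patientunitstayid", "age_years", "unit_los_days"]])
```

Whatever value you choose for the capped ages, record the choice, because it affects any age-based feature or subgroup analysis.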

---

## 4. Data Access

### PhysioNet Credentialing

To access MIMIC-IV and eICU, you need to:

1. **Create a PhysioNet account** at https://physionet.org/
2. **Complete required training**:
   - CITI "Data or Specimens Only Research" course
3. **Request access** to the specific datasets
4. **Sign the Data Use Agreement (DUA)**

### Access Methods During the Datathon

During the ZIMAM AI Datathon, you'll have access to the data through:

1. **Google BigQuery**: Cloud-based SQL queries (recommended for large-scale analyses)
2. **Local databases**: PostgreSQL or similar (if you prefer local development)
3. **Pre-processed datasets**: Your mentors may provide pre-processed data for specific tasks

### Setting Up BigQuery Access

In [None]:
# Example: Setting up BigQuery client
# Uncomment and configure during the datathon

# from google.cloud import bigquery
# import os

# # Set your project ID
# project_id = 'your-project-id'
# os.environ['GOOGLE_CLOUD_PROJECT'] = project_id

# # Initialize BigQuery client
# client = bigquery.Client(project=project_id)

# # Test query
# query = """
# SELECT COUNT(*) as patient_count
# FROM `physionet-data.mimiciv_hosp.patients`
# """

# df = client.query(query).to_dataframe()
# print(f"Total patients in MIMIC-IV: {df['patient_count'].iloc[0]}")

print("BigQuery setup instructions displayed above.")
print("Uncomment and configure during the datathon with your project credentials.")
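
Re-running the same query many times during exploration costs time and, depending on the billing setup, money. One possible pattern is to cache each result locally after the first run. The sketch below assumes only that the client exposes `client.query(sql).to_dataframe()`, as in the commented cell above; `cached_query`, `StubClient`, and `StubJob` are hypothetical names, and the stub stands in for a real `bigquery.Client` so the example runs without credentials.

```python
import tempfile
from pathlib import Path

import pandas as pd


def cached_query(client, sql, cache_path):
    """Run `sql` once and reuse a local CSV copy on later calls."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        return pd.read_csv(cache_path)
    df = client.query(sql).to_dataframe()
    df.to_csv(cache_path, index=False)
    return df


class StubJob:
    """Stand-in for a query job; returns a fixed result."""

    def to_dataframe(self):
        return pd.DataFrame({"patient_count": [3]})


class StubClient:
    """Stand-in for bigquery.Client that counts how often it is queried."""

    def __init__(self):
        self.calls = 0

    def query(self, sql):
        self.calls += 1
        return StubJob()


stub = StubClient()
sql = "SELECT COUNT(*) AS patient_count FROM some_table"
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "patient_count.csv"
    first = cached_query(stub, sql, path)   # hits the client
    second = cached_query(stub, sql, path)  # served from the CSV file
print(f"Client was queried {stub.calls} time(s)")
```

A CSV round trip drops dtype information such as datetimes, so parse those columns again after loading. Keep cache files inside the approved environment, since they contain restricted data.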

---

## 5. Sample Analysis: Patient Demographics

Let's create a sample analysis workflow that you might use during the datathon.

In [None]:
# Create sample demographic data for demonstration
np.random.seed(42)

n_patients = 1000

sample_data = pd.DataFrame({
    'patient_id': range(1, n_patients + 1),
    'age': np.random.normal(65, 15, n_patients).clip(18, 90).astype(int),
    'gender': np.random.choice(['M', 'F'], n_patients),
    'admission_type': np.random.choice(['Emergency', 'Elective', 'Urgent'], n_patients, p=[0.6, 0.2, 0.2]),
    'los_days': np.random.exponential(5, n_patients).clip(0.5, 30),
    'mortality': np.random.choice([0, 1], n_patients, p=[0.85, 0.15]),
    'apache_score': np.random.normal(60, 20, n_patients).clip(0, 100).astype(int)
})

print("Sample Patient Data:")
print(sample_data.head(10))
print(f"\nTotal patients: {len(sample_data)}")

### Exploratory Data Analysis

In [None]:
# Summary statistics
print("Summary Statistics:")
print(sample_data.describe())

print("\nMissing Values:")
print(sample_data.isnull().sum())

In [None]:
# Visualize demographics and outcomes in a 2x2 grid
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Age distribution
axes[0, 0].hist(sample_data['age'], bins=30, edgecolor='black', alpha=0.7)
axes[0, 0].set_xlabel('Age (years)')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].set_title('Age Distribution')
axes[0, 0].axvline(sample_data['age'].mean(), color='red', linestyle='--', label=f'Mean: {sample_data["age"].mean():.1f}')
axes[0, 0].legend()

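# Gender distribution (colors are keyed by category so they stay stable if the counts reorder)
gender_counts = sample_data['gender'].value_counts()
gender_palette = {'M': 'lightblue', 'F': 'lightpink'}
axes[0, 1].bar(gender_counts.index, gender_counts.values, color=[gender_palette[g] for g in gender_counts.index])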
axes[0, 1].set_xlabel('Gender')
axes[0, 1].set_ylabel('Count')
axes[0, 1].set_title('Gender Distribution')

# Admission type
admission_counts = sample_data['admission_type'].value_counts()
axes[1, 0].barh(admission_counts.index, admission_counts.values, color='skyblue')
axes[1, 0].set_xlabel('Count')
axes[1, 0].set_ylabel('Admission Type')
axes[1, 0].set_title('Admission Type Distribution')

# Length of stay vs mortality
for mortality_status in [0, 1]:
    data_subset = sample_data[sample_data['mortality'] == mortality_status]['los_days']
    axes[1, 1].hist(data_subset, bins=20, alpha=0.6, label=f'Mortality: {mortality_status}')
axes[1, 1].set_xlabel('Length of Stay (days)')
axes[1, 1].set_ylabel('Frequency')
axes[1, 1].set_title('Length of Stay by Mortality Status')
axes[1, 1].legend()

plt.tight_layout()
plt.show()

---

## 6. Sample Clinical Question: Mortality Risk Prediction

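Let's explore a common clinical question: predicting ICU mortality based on patient characteristics. Because the sample data above is randomly generated, with mortality drawn independently of every feature, expect performance near chance (AUC around 0.5). The cells below illustrate the workflow, not a meaningful result.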

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score, roc_curve, confusion_matrix

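# Prepare features
# Encode categorical variables as integers.
# NOTE: LabelEncoder imposes an arbitrary order on admission types; tree models
# tolerate this, but prefer one-hot encoding for linear models.
le_gender = LabelEncoder()
le_admission = LabelEncoder()

sample_data['gender_encoded'] = le_gender.fit_transform(sample_data['gender'])
sample_data['admission_type_encoded'] = le_admission.fit_transform(sample_data['admission_type'])

# Select features and target
# NOTE: los_days is only known at discharge, so on real data it would leak outcome
# information into an admission-time mortality model; it is kept here for illustration.
features = ['age', 'gender_encoded', 'admission_type_encoded', 'los_days', 'apache_score']
X = sample_data[features]
y = sample_data['mortality']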

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print(f"Training set size: {len(X_train)}")
print(f"Test set size: {len(X_test)}")
print(f"Mortality rate in training: {y_train.mean():.2%}")
print(f"Mortality rate in test: {y_test.mean():.2%}")
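
In the synthetic data each row is a distinct patient, so a row-level split is fine. In real ICU data a patient can have several stays, and a row-level split may place stays from the same patient in both training and test sets, which inflates test performance. A minimal sketch of a patient-level split with scikit-learn's `GroupShuffleSplit`, using made-up rows (the column names follow MIMIC-IV conventions but the values are invented):

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy ICU stays: several patients have more than one stay
stays = pd.DataFrame({
    "subject_id": [1, 1, 2, 3, 3, 3, 4, 5, 5, 6],
    "stay_id": list(range(100, 110)),
    "mortality": [0, 0, 1, 0, 0, 1, 0, 0, 0, 1],
})

# Split on patients, not rows: every stay of a patient lands on one side
splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
train_idx, test_idx = next(splitter.split(stays, groups=stays["subject_id"]))
train_stays = stays.iloc[train_idx]
test_stays = stays.iloc[test_idx]

shared_patients = set(train_stays["subject_id"]) & set(test_stays["subject_id"])
print(f"Train stays: {len(train_stays)}, test stays: {len(test_stays)}")
print(f"Patients in both sets: {len(shared_patients)}")
```

Note that `test_size` here is a fraction of patients, so the share of rows in the test set can differ from 30%.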

In [None]:
# Train a simple Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42, max_depth=10)
rf_model.fit(X_train, y_train)

# Make predictions
y_pred = rf_model.predict(X_test)
y_pred_proba = rf_model.predict_proba(X_test)[:, 1]

# Evaluate model
print("Classification Report:")
print(classification_report(y_test, y_pred))

print(f"\nROC-AUC Score: {roc_auc_score(y_test, y_pred_proba):.3f}")
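
A single AUC computed on a 200-row test set is a noisy estimate, so it is worth reporting an interval alongside it. The sketch below is one simple option, a percentile bootstrap over test rows; `bootstrap_auc_ci` is a hypothetical helper, and the demo scores are random numbers unrelated to the labels.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def bootstrap_auc_ci(y_true, y_score, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for ROC-AUC."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    rng = np.random.default_rng(seed)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))  # resample with replacement
        if len(np.unique(y_true[idx])) < 2:
            continue  # AUC is undefined when only one class is drawn
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    lower, upper = np.quantile(aucs, [alpha / 2, 1 - alpha / 2])
    return float(lower), float(upper)


# Demo: labels and scores are independent, so the AUC should sit near 0.5
rng = np.random.default_rng(42)
y_demo = rng.integers(0, 2, 200)
score_demo = rng.random(200)
ci_low, ci_high = bootstrap_auc_ci(y_demo, score_demo, n_boot=500)
print(f"AUC: {roc_auc_score(y_demo, score_demo):.3f} "
      f"(95% CI {ci_low:.3f} to {ci_high:.3f})")
```

With the notebook's model, the same helper could be called as `bootstrap_auc_ci(y_test, y_pred_proba)`.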

In [None]:
# Visualize results
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
axes[0].plot(fpr, tpr, linewidth=2, label=f'ROC (AUC = {roc_auc_score(y_test, y_pred_proba):.3f})')
axes[0].plot([0, 1], [0, 1], 'k--', linewidth=1, label='Random Classifier')
axes[0].set_xlabel('False Positive Rate')
axes[0].set_ylabel('True Positive Rate')
axes[0].set_title('ROC Curve')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[1], 
            xticklabels=['Survived', 'Died'], yticklabels=['Survived', 'Died'])
axes[1].set_xlabel('Predicted')
axes[1].set_ylabel('Actual')
axes[1].set_title('Confusion Matrix')

plt.tight_layout()
plt.show()

In [None]:
# Feature importance (impurity-based; it can overstate continuous or high-cardinality features)
feature_importance = pd.DataFrame({
    'feature': features,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)

plt.figure(figsize=(10, 6))
plt.barh(feature_importance['feature'], feature_importance['importance'], color='steelblue')
plt.xlabel('Feature Importance')
plt.ylabel('Feature')
plt.title('Feature Importance for Mortality Prediction')
plt.gca().invert_yaxis()
plt.grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.show()

print("\nFeature Importance:")
print(feature_importance)
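
The `feature_importances_` attribute of a random forest is computed from the training data and can rank uninformative continuous features highly. Permutation importance on held-out data is a common cross-check: it measures how much a metric drops when one feature is shuffled. A self-contained sketch on invented data, where only the first feature carries signal:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Invented data: column 0 drives the label, columns 1 and 2 are pure noise
rng = np.random.default_rng(0)
n_rows = 600
X_demo = rng.normal(size=(n_rows, 3))
y_demo = (X_demo[:, 0] + 0.3 * rng.normal(size=n_rows) > 0).astype(int)
feature_names = ["signal", "noise_1", "noise_2"]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)
demo_model = RandomForestClassifier(n_estimators=100, random_state=0)
demo_model.fit(X_tr, y_tr)

# Shuffle each feature 10 times on the test set and record the AUC drop
perm = permutation_importance(
    demo_model, X_te, y_te, scoring="roc_auc", n_repeats=10, random_state=0
)
for name, mean, std in zip(feature_names, perm.importances_mean, perm.importances_std):
    print(f"{name:>8}: {mean:.3f} +/- {std:.3f}")
```

Noise features should score close to zero here, while the signal feature shows a large drop in AUC when shuffled.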

---

## 7. Best Practices for Healthcare AI

### Data Quality and Integrity

1. **Check for missing data**: Understand why data is missing: completely at random (MCAR), at random (MAR), or not at random (MNAR)
2. **Validate data ranges**: Ensure vital signs and lab values are physiologically plausible
3. **Handle outliers carefully**: Outliers may represent true clinical events
4. **Understand temporal relationships**: Time of measurement matters in ICU data
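
The first three checks can be sketched in a few lines of pandas. The bounds below are illustrative assumptions, not clinical reference ranges, and the column names are invented; agree on real limits with the clinicians on your team. Values outside the bounds are flagged rather than dropped, so that true clinical extremes can still be reviewed.

```python
import numpy as np
import pandas as pd

# Illustrative bounds only (assumption); replace with clinically agreed limits
PLAUSIBLE_RANGES = {
    "heart_rate": (20, 300),      # beats per minute
    "temperature_c": (25, 45),    # degrees Celsius
}

vitals = pd.DataFrame({
    "heart_rate": [72, 0, 145, 999, np.nan],
    "temperature_c": [36.8, 37.2, 98.6, np.nan, 38.1],  # 98.6 looks like Fahrenheit
})


def flag_implausible(df, ranges):
    """Return a boolean frame marking non-missing values outside their range."""
    flags = pd.DataFrame(False, index=df.index, columns=list(ranges))
    for col, (low, high) in ranges.items():
        flags[col] = df[col].notna() & ~df[col].between(low, high)
    return flags


flags = flag_implausible(vitals, PLAUSIBLE_RANGES)
cleaned_vitals = vitals.mask(flags)  # flagged values become NaN; flags are kept

print("Implausible values per column:")
print(flags.sum())
print("\nFraction missing after cleaning:")
print(cleaned_vitals.isna().mean())
```

Keeping the `flags` frame separate from the cleaned data makes it possible to report how many values were removed and to revisit the thresholds later.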

### Ethical Considerations

1. **Fairness**: Check for bias across demographic groups (gender, age, ethnicity)
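2. **Privacy**: The data is already de-identified, but the Data Use Agreement still applies: never attempt to re-identify patients or share the data with anyone who is not credentialed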
3. **Transparency**: Make your model interpretable to clinicians
4. **Clinical validity**: Ensure your findings make clinical sense
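
One simple way to start a fairness check is to compute the same metric separately for each demographic group. The sketch below reports AUC, sample size, and event rate per group on invented data in which the score is informative for one group and pure noise for the other. `auc_by_group` and the column names are hypothetical, and a gap in AUC is a prompt for investigation, not a complete fairness assessment.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score


def auc_by_group(df, group_col, label_col="label", score_col="score"):
    """Compute AUC, size, and event rate for each level of `group_col`."""
    rows = []
    for group, part in df.groupby(group_col):
        has_both_classes = part[label_col].nunique() == 2
        auc = (roc_auc_score(part[label_col], part[score_col])
               if has_both_classes else np.nan)
        rows.append({
            group_col: group,
            "n": len(part),
            "event_rate": part[label_col].mean(),
            "auc": auc,
        })
    return pd.DataFrame(rows)


# Invented data: the score tracks the label for "F" but is random for "M"
rng = np.random.default_rng(1)
n_rows = 400
demo = pd.DataFrame({
    "gender": np.repeat(["F", "M"], n_rows // 2),
    "label": rng.integers(0, 2, n_rows),
})
noise = rng.random(n_rows)
demo["score"] = np.where(demo["gender"] == "F",
                         0.5 * demo["label"] + 0.6 * noise,
                         noise)

report = auc_by_group(demo, "gender")
print(report)
```

Small groups give unstable estimates, so read the `n` column before drawing conclusions from a gap.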

### Model Development

1. **Feature engineering**: Create clinically meaningful features
2. **Validation strategy**: Use proper train/validation/test splits or cross-validation
3. **Performance metrics**: Choose appropriate metrics (AUC-ROC, precision, recall, calibration)
4. **Clinical applicability**: Consider how the model would be used in practice
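
The validation and metric points above can be combined in one short pattern: put preprocessing and the model in a scikit-learn `Pipeline` so that scaling and encoding are refit inside each fold, then score with stratified cross-validation on both discrimination (AUC) and a calibration-sensitive metric (Brier score). The data below is invented, with risk rising with age by construction, and logistic regression is used only as a simple stand-in model.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Invented cohort: mortality risk increases with age (assumption for the demo)
rng = np.random.default_rng(7)
n_rows = 500
cohort = pd.DataFrame({
    "age": rng.normal(65, 15, n_rows).clip(18, 90),
    "admission_type": rng.choice(["Emergency", "Elective", "Urgent"], n_rows),
})
logit = (cohort["age"] - 65) / 10 - 1.0
cohort["mortality"] = (rng.random(n_rows) < 1 / (1 + np.exp(-logit))).astype(int)

# Preprocessing lives inside the pipeline, so each fold fits it on training rows only
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["admission_type"]),
])
pipeline = Pipeline([
    ("prep", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_validate(
    pipeline,
    cohort[["age", "admission_type"]],
    cohort["mortality"],
    cv=cv,
    scoring={"auc": "roc_auc", "brier": "neg_brier_score"},
)

print(f"AUC:   {cv_scores['test_auc'].mean():.3f} +/- {cv_scores['test_auc'].std():.3f}")
print(f"Brier: {-cv_scores['test_brier'].mean():.3f} (lower is better)")
```

scikit-learn reports the Brier score negated so that higher is always better, which is why the sign is flipped before printing.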

### Collaboration Tips

1. **Clinical + Data Science Partnership**: Clinicians shape the question; data scientists develop solutions
2. **Clear communication**: Use terminology that both groups understand
3. **Iterative approach**: Start simple, then add complexity
4. **Documentation**: Keep track of your analyses and decisions

---

## 8. Potential Research Questions

Here are some example research questions you might explore during the datathon:

### Prediction Tasks
- Predict ICU mortality
- Predict length of stay
- Predict need for mechanical ventilation
- Predict sepsis onset
- Predict acute kidney injury

### Comparative Effectiveness
- Compare treatment strategies for specific conditions
- Analyze differences between eICU centers
- Evaluate timing of interventions

### Resource Optimization
- Identify patients at risk for readmission
- Optimize ICU bed allocation
- Predict healthcare resource utilization

### Disease-Specific Analyses
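- COVID-19 outcomes and risk factors (only if your data release extends beyond the 2008-2019 MIMIC-IV and 2014-2015 eICU periods described above, which predate the pandemic)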
- Heart failure management
- Sepsis treatment protocols
- Traumatic brain injury outcomes

### AI Model Development
- Develop early warning systems
- Create clinical decision support tools
- Build personalized treatment recommendation systems

---

## 9. Useful Resources

### Documentation
- [MIMIC-IV Documentation](https://mimic.mit.edu/docs/iv/)
- [eICU Documentation](https://eicu-crd.mit.edu/about/eicu/)
- [PhysioNet](https://physionet.org/)

### Code Repositories
- [MIMIC Code Repository](https://github.com/MIT-LCP/mimic-code)
- [eICU Code Repository](https://github.com/MIT-LCP/eicu-code)

### Learning Materials
- [MIT Critical Data Course](https://criticaldata.mit.edu/)
- [Google Cloud Healthcare Datathon Tutorials](https://github.com/GoogleCloudPlatform/healthcare/tree/master/datathon)

### Python Libraries for Healthcare
- `scikit-learn`: Machine learning
- `pandas`: Data manipulation
- `numpy`: Numerical computing
- `matplotlib` / `seaborn`: Visualization
- `lifelines`: Survival analysis
- `statsmodels`: Statistical modeling
- `xgboost` / `lightgbm`: Gradient boosting
- `tensorflow` / `pytorch`: Deep learning

---

## 10. Getting Started with Your Team

### Day 1 Checklist

- [ ] Meet your team members and introduce yourselves
- [ ] Discuss everyone's background and expertise
- [ ] Brainstorm potential research questions
- [ ] Choose 2-3 questions to explore further
- [ ] Identify required data tables and features
- [ ] Set up data access for all team members
- [ ] Divide initial exploration tasks
- [ ] Connect with your mentor

### Throughout the Datathon

- Regular check-ins with your team
- Document your progress and findings
- Ask mentors for guidance
- Attend checkpoint sessions
- Prepare your final presentation

### Final Presentation Tips

1. **Clear clinical question**: What problem are you solving?
2. **Methodology**: How did you approach the problem?
3. **Results**: What did you find?
4. **Clinical significance**: Why does this matter for patient care?
5. **Limitations**: What are the constraints of your analysis?
6. **Next steps**: How could this work be extended?

---

## 11. Contact and Support

### During the Datathon

- **Mentors**: Available throughout the event for technical and clinical guidance
- **Organizers**: Contact via event communication channels
- **Technical Support**: Available for data access and computing issues

### Event Information

- **Website**: [https://www.gccehealth.org/](https://www.gccehealth.org/)
- **Venue**: 
  - Nov 6: Jumeirah Emirates Towers
  - Nov 7-8: Center for Innovation & Technology, Dubai Health

---

## Good luck and have fun!

Remember: The goal is to learn, collaborate, and develop innovative solutions for healthcare challenges. Don't hesitate to ask questions and engage with mentors and fellow participants.

**Let's build trustworthy, transparent, and impactful AI solutions for healthcare together!**