# MIMIC-III Psychiatric Clustering: Data Exploration

This notebook loads the MIMIC-III source tables, checks their structure and quality, and selects the psychiatric cohort used by the later notebooks.

## Objectives
1. Load MIMIC-III CSV files (NOTEEVENTS, DIAGNOSES_ICD, PRESCRIPTIONS)
2. Explore data structure and quality
3. Filter for psychiatric admissions
4. Create an aggregated dataset at the patient-admission level (one row per admission)
5. Visualize data distributions

In [None]:
import sys
sys.path.append('/app')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm

from src.utils import load_config, set_random_seeds
from src.data_loader import MIMICDataLoader
from src.preprocessor import PsychiatricCohortSelector

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

%matplotlib inline

## 1. Load Configuration

In [None]:
# Load pipeline configuration
config = load_config('/app/config/config.yaml')

# Set random seeds for reproducibility
set_random_seeds(config['random_state'])

print('Configuration loaded successfully')
print('Dataset paths:')
print(f"  Notes: {config['data']['notes_path']}")
print(f"  Diagnoses: {config['data']['diagnoses_path']}")
print(f"  Prescriptions: {config['data']['prescriptions_path']}")
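
Before the slow loading step, it can help to confirm that the configuration contains every key this notebook reads and that the files exist. The following is a minimal sketch, not part of `src.utils`: the helper name `find_config_problems` and the example paths are hypothetical, and only the key names (`random_state`, `data.notes_path`, `data.diagnoses_path`, `data.prescriptions_path`) come from the cell above.

```python
from pathlib import Path

# Keys this notebook reads from the config, taken from the cell above.
REQUIRED_DATA_KEYS = ("notes_path", "diagnoses_path", "prescriptions_path")


def find_config_problems(config: dict) -> list:
    """Return human-readable problems; an empty list means the config looks usable.

    Hypothetical helper for illustration only.
    """
    problems = []
    if "random_state" not in config:
        problems.append("missing top-level key: random_state")
    data_section = config.get("data", {})
    for key in REQUIRED_DATA_KEYS:
        if key not in data_section:
            problems.append(f"missing data key: {key}")
        elif not Path(data_section[key]).is_file():
            problems.append(f"file not found: {data_section[key]}")
    return problems


# Illustrative config: two made-up paths that do not exist and one missing key.
example_config = {
    "random_state": 42,
    "data": {
        "notes_path": "/nonexistent/NOTEEVENTS.csv",
        "diagnoses_path": "/nonexistent/DIAGNOSES_ICD.csv",
    },
}
for problem in find_config_problems(example_config):
    print(problem)
```

In the notebook itself this would be called as `find_config_problems(config)` right after `load_config`, stopping early if the returned list is not empty.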

## 2. Load MIMIC-III Data

**Note**: This step may take 5-10 minutes depending on dataset size.

In [None]:
# Initialize data loader
data_loader = MIMICDataLoader(config)

# Load all data and create unified dataset
print('Loading MIMIC-III data...')
patient_df = data_loader.load_all()

print(f'\nLoaded {len(patient_df)} patient admissions')
print(f'Unique patients: {patient_df["SUBJECT_ID"].nunique()}')
print(f'Unique admissions: {patient_df["HADM_ID"].nunique()}')
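
Most of the loading time is likely spent on NOTEEVENTS, which is by far the largest of the three files. The internals of `MIMICDataLoader.load_all` are not shown here, so the following is only a sketch of one common approach: stream the file in chunks, keep discharge summaries, and concatenate them per admission. The function name `load_discharge_summaries` is hypothetical, and the tiny in-memory CSV stands in for the real file. The column names and the `Discharge summary` category follow the MIMIC-III NOTEEVENTS schema.

```python
import io

import pandas as pd


def load_discharge_summaries(source, chunksize=50_000):
    """Stream a NOTEEVENTS-style CSV and return one text per admission.

    Hypothetical helper; not the implementation used by MIMICDataLoader.
    """
    wanted = ["SUBJECT_ID", "HADM_ID", "CATEGORY", "TEXT"]
    kept = []
    for chunk in pd.read_csv(source, usecols=wanted, chunksize=chunksize):
        is_summary = chunk["CATEGORY"].str.strip() == "Discharge summary"
        # Notes without an admission ID or without text cannot be used here.
        usable = chunk["HADM_ID"].notna() & chunk["TEXT"].notna()
        kept.append(chunk.loc[is_summary & usable])
    notes = pd.concat(kept, ignore_index=True)
    # Join multiple summaries (for example addenda) for the same admission.
    return (
        notes.groupby(["SUBJECT_ID", "HADM_ID"], as_index=False)["TEXT"]
        .agg("\n".join)
        .rename(columns={"TEXT": "discharge_summary"})
    )


# Synthetic stand-in for NOTEEVENTS.csv; the rows are invented.
fake_csv = io.StringIO(
    "SUBJECT_ID,HADM_ID,CATEGORY,TEXT\n"
    "1,100,Discharge summary,First part\n"
    "3,,Discharge summary,No admission id\n"
    "1,100,Discharge summary,Addendum\n"
    "1,100,Nursing,Routine check\n"
    "2,200,Discharge summary,Only note\n"
)
summaries = load_discharge_summaries(fake_csv, chunksize=2)
print(summaries)
```

Reading only the needed columns with `usecols` and processing in chunks keeps peak memory bounded by the chunk size rather than by the full file.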

## 3. Explore Dataset Structure

In [None]:
# Display dataset info
print('Dataset shape:', patient_df.shape)
print('\nColumn names:')
print(patient_df.columns.tolist())

# Display first few rows
patient_df.head()

In [None]:
# Check for missing values
print('Missing values:')
missing = patient_df.isnull().sum()
missing[missing > 0]
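
Raw missing counts are easier to judge next to the share of rows they represent. A minimal sketch of such a summary follows; the helper name `summarize_missing` and the toy data are illustrative assumptions, while the column names mirror those used in this notebook.

```python
import numpy as np
import pandas as pd


def summarize_missing(df):
    """Return missing counts and percentages for columns with any missing value."""
    counts = df.isnull().sum()
    summary = pd.DataFrame(
        {
            "n_missing": counts,
            "pct_missing": (counts / len(df) * 100).round(1),
        }
    )
    # Keep only affected columns, worst first.
    return summary[summary["n_missing"] > 0].sort_values(
        "n_missing", ascending=False
    )


# Toy data standing in for patient_df; the values are invented.
toy = pd.DataFrame(
    {
        "HADM_ID": [1, 2, 3, 4],
        "num_medications": [3, np.nan, 1, np.nan],
        "discharge_summary": ["text", None, "more", "x"],
    }
)
report = summarize_missing(toy)
print(report)
```

Note that this counts only `NaN`/`None`. Empty strings in a text column such as `discharge_summary` are not reported as missing and would need a separate check.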

In [None]:
# Basic statistics (each row is one admission, so these are per-admission means)
print('Statistics:')
print(f'  Average diagnoses per admission: {patient_df["num_diagnoses"].mean():.2f}')
print(f'  Average medications per admission: {patient_df["num_medications"].mean():.2f}')
print(f'  Average discharge summary length: {patient_df["discharge_summary"].str.len().mean():.0f} characters')

## 4. Select Psychiatric Cohort

In [None]:
# Initialize preprocessor
preprocessor = PsychiatricCohortSelector(config)

# Process and filter psychiatric patients
print('Selecting psychiatric cohort...')
psych_df = preprocessor.process(patient_df)

print(f'\nPsychiatric cohort size: {len(psych_df)}')
print(f'Percentage of total: {len(psych_df)/len(patient_df)*100:.1f}%')
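
The cohort is defined by ICD-9 codes 290-319, the mental disorders chapter. How `PsychiatricCohortSelector` applies this range is not shown in this notebook, so the following is a sketch of one way to do it, assuming codes are stored as in MIMIC-III `DIAGNOSES_ICD` (strings without a decimal point, so `29620` means 296.20). The function name `is_psychiatric_icd9` and the sample rows are illustrative.

```python
import pandas as pd


def is_psychiatric_icd9(code) -> bool:
    """True when an ICD-9 code falls in the 290-319 range.

    Assumes codes are strings without a decimal point, e.g. '29620'.
    """
    if not isinstance(code, str):
        return False
    prefix = code.strip()[:3]
    # V and E codes start with a letter, so they fail the decimal check.
    if len(prefix) != 3 or not prefix.isdecimal():
        return False
    return 290 <= int(prefix) <= 319


# Invented sample rows in the shape of DIAGNOSES_ICD.
diagnoses = pd.DataFrame(
    {
        "HADM_ID": [100, 100, 200, 300, 300],
        "ICD9_CODE": ["29620", "4019", "V6284", "30500", "0389"],
    }
)
diagnoses["is_psych"] = diagnoses["ICD9_CODE"].map(is_psychiatric_icd9)

# Number of psychiatric diagnoses per admission.
num_psych = diagnoses.loc[diagnoses["is_psych"]].groupby("HADM_ID").size()
print(num_psych)
```

Comparing the three-digit prefix as an integer avoids the pitfalls of string comparison and of casting whole codes to numbers, which would drop leading zeros such as in `0389`.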

## 5. Visualize Data Distributions

In [None]:
# Plot diagnosis and medication distributions
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Diagnoses
axes[0].hist(psych_df['num_psych_diagnoses'], bins=20, edgecolor='black')
axes[0].set_xlabel('Number of Psychiatric Diagnoses')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Distribution of Psychiatric Diagnoses')
axes[0].grid(True, alpha=0.3)

# Medications
axes[1].hist(psych_df['num_medications'], bins=20, edgecolor='black', color='orange')
axes[1].set_xlabel('Number of Medications')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Distribution of Medications')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
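
Both plotted variables are integer counts, and a fixed `bins=20` can split or merge integer values unevenly when the range is not a multiple of 20. One option, sketched below, is to centre one bin on each integer. The helper name `integer_bin_edges` is hypothetical, and it assumes the input has no missing values.

```python
import numpy as np


def integer_bin_edges(values):
    """Return bin edges that centre one bin on each integer in the data range."""
    values = np.asarray(values)
    return np.arange(values.min(), values.max() + 2) - 0.5


# Invented counts for illustration.
counts = np.array([1, 1, 2, 2, 2, 3, 5])
edges = integer_bin_edges(counts)
hist, _ = np.histogram(counts, bins=edges)
print(edges)
print(hist)
```

The same edges can be passed to Matplotlib, for example `axes[0].hist(psych_df['num_psych_diagnoses'], bins=integer_bin_edges(psych_df['num_psych_diagnoses']))`.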

In [None]:
# Plot cluster framework flags
flag_cols = [col for col in psych_df.columns if col.startswith('flag_')]

flag_counts = {}
for col in flag_cols:
    # Strip only the leading 'flag_' and turn underscores into spaces for readable labels
    category_name = col[len('flag_'):].replace('_', ' ').title()
    flag_counts[category_name] = psych_df[col].sum()

# Plot
plt.figure(figsize=(12, 6))
bars = plt.bar(flag_counts.keys(), flag_counts.values())
plt.xlabel('Theoretical Framework Category')
plt.ylabel('Number of Patients')
plt.title('Distribution Across Theoretical Psychiatric Categories')
plt.xticks(rotation=45, ha='right')
plt.grid(axis='y', alpha=0.3)

# Add value labels
for bar in bars:
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height,
            f'{int(height)}\n({height/len(psych_df)*100:.1f}%)',
            ha='center', va='bottom')

plt.tight_layout()
plt.show()

In [None]:
# Comorbidity analysis
print('Comorbidity Statistics:')
print(f'  Patients with comorbidity: {psych_df["has_comorbidity"].sum()} ({psych_df["has_comorbidity"].mean()*100:.1f}%)')
print(f'  Average cluster categories per patient: {psych_df["num_cluster_categories"].mean():.2f}')

# Plot comorbidity distribution
plt.figure(figsize=(10, 5))
psych_df['num_cluster_categories'].value_counts().sort_index().plot(kind='bar')
plt.xlabel('Number of Cluster Categories')
plt.ylabel('Number of Patients')
plt.title('Psychiatric Comorbidity: Number of Cluster Categories per Patient')
plt.xticks(rotation=0)
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()
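
The columns `num_cluster_categories` and `has_comorbidity` are produced by the preprocessor, and their exact definitions are not shown here. The sketch below shows one plausible derivation from the `flag_` columns, under the assumption that comorbidity means two or more flagged categories. It also builds a pairwise co-occurrence table, which shows which categories tend to appear together. The flag names and values are invented.

```python
import pandas as pd

# Invented flags standing in for the flag_ columns of psych_df.
cohort = pd.DataFrame(
    {
        "flag_mood": [1, 1, 0, 1],
        "flag_psychotic": [0, 1, 0, 0],
        "flag_substance": [0, 1, 1, 1],
    }
)
flag_cols = [col for col in cohort.columns if col.startswith("flag_")]
flags = cohort[flag_cols].astype(int)

# Assumption: comorbidity means at least two flagged categories.
cohort["num_cluster_categories"] = flags.sum(axis=1)
cohort["has_comorbidity"] = cohort["num_cluster_categories"] >= 2

# Entry (i, j) counts rows where both flag i and flag j are set;
# the diagonal holds the per-category totals.
co_occurrence = flags.T.dot(flags)
print(cohort)
print(co_occurrence)
```

The co-occurrence table can be passed to `sns.heatmap(co_occurrence, annot=True)` to complement the bar chart above.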

## 6. Sample Discharge Summary

In [None]:
# Display a sample discharge summary, truncated to the first 1,000 characters
print('Sample Discharge Summary:')
print('='*80)
full_text = psych_df.iloc[0]['discharge_summary']
print(full_text[:1000])
# Show an ellipsis only when the text was actually truncated
if len(full_text) > 1000:
    print('...')
print('='*80)
print(f'Total length: {len(full_text):,} characters')

## 7. Save Processed Data

In [None]:
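# Save for next notebook
from pathlib import Path

output_path = Path('/app/outputs/notebook_psych_cohort_step1.csv')
output_path.parent.mkdir(parents=True, exist_ok=True)  # create the output directory if it is missing
psych_df.to_csv(output_path, index=False)
print(f'Saved psychiatric cohort to {output_path}')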

## Summary

In this notebook, we:
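- Loaded MIMIC-III data (NOTEEVENTS, DIAGNOSES_ICD, PRESCRIPTIONS)
- Merged the tables at the patient-admission level
- Checked dataset structure and missing values
- Selected the psychiatric cohort (ICD-9 codes 290-319)
- Visualized diagnosis, medication, and category distributions
- Summarized comorbidity across cluster categories
- Saved the cohort to `/app/outputs/notebook_psych_cohort_step1.csv`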

**Next steps**: Feature engineering in notebook 02