# 01 - Data Exploration for 30-Day Readmission Risk

## Purpose
This notebook is the canonical Phase 1 data exploration notebook for the readmission ML learning workflow in `med-z1`.

## Objectives
- Validate source-table availability and baseline data quality
- Explore encounter-level distributions relevant to 30-day readmission modeling
- Establish leakage-safe patterns for downstream feature engineering
- Perform early profiling of the Family History domain

## Primary Data Sources
- `clinical.patient_encounters`
- `clinical.patient_demographics`
- `clinical.patient_problems`
- `clinical.patient_military_history`
- `clinical.patient_family_history`

## Inputs / Outputs
- Input: PostgreSQL data via `DATABASE_URL` from `/Users/chuck/swdev/med/med-z1/config.py`
- Output: In-notebook exploratory tables/observations (no production artifacts)

## Guardrails
- Keep outputs de-identified (no names/SSN/ICN unless explicitly needed)
- Do not print secrets (`DATABASE_URL`)
- For later label and feature construction, use only data available at or before the index discharge datetime
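
The "do not print secrets" guardrail still leaves room for a quick check of which host and database the notebook points at. The sketch below uses only the standard library. `mask_database_url` is a hypothetical helper (not part of `med-z1`) and the example URL is made up.

```python
from urllib.parse import urlsplit, urlunsplit


def mask_database_url(url: str) -> str:
    """Return the URL with any password replaced by '***' (hypothetical helper)."""
    parts = urlsplit(url)
    if parts.password is None:
        return url
    # Rebuild the network location without the real password
    user = parts.username or ""
    host = parts.hostname or ""
    port = f":{parts.port}" if parts.port else ""
    netloc = f"{user}:***@{host}{port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


# Made-up URL for illustration only
example_url = "postgresql://analyst:s3cret@localhost:5432/medz1"
masked = mask_database_url(example_url)
print(masked)
```

If a masked value is ever displayed, it may still be worth confirming that the username and host are not themselves sensitive in your environment.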

## Run Order
Run cells top-to-bottom. Re-run setup cells whenever environment or dependency context changes.

In [None]:
# Project setup + config import (robust project-root detection)

import sys
from pathlib import Path

# Find project root by locating config.py from current notebook directory upward
cwd = Path.cwd()
project_root = next((p for p in [cwd, *cwd.parents] if (p / 'config.py').exists()), None)
if project_root is None:
    raise FileNotFoundError('Could not locate config.py from current notebook path.')

if str(project_root) not in sys.path:
    sys.path.append(str(project_root))

from config import DATABASE_URL

# Validate configuration without exposing secrets
if not DATABASE_URL:
    raise ValueError(f"DATABASE_URL is empty. Check {project_root / 'config.py'}")


In [None]:
# Core dependencies for exploration + later modeling phases
# For Apple Silicon machines, you may need: brew install libomp
import pandas as pd
import numpy as np
import sklearn
import xgboost as xgb
import shap
import imblearn

from sqlalchemy import create_engine


In [None]:
# Environment checks (safe to display; no secrets)
import platform

print(f"Python:       {platform.python_version()}")
print(f"Pandas:       {pd.__version__}")
print(f"NumPy:        {np.__version__}")
print(f"Scikit-Learn: {sklearn.__version__}")
print(f"XGBoost:      {xgb.__version__}")
print(f"SHAP:         {shap.__version__}")
print(f"ImbLearn:     {imblearn.__version__}")
print(f"Project root: {project_root}")
print("DATABASE_URL loaded: yes")


In [None]:
# Connecting to med-z1 PostgreSQL
# Database Schema to Query:
#  - clinical.patient_encounters - Hospital admissions (your target events)
#  - clinical.patient_demographics - Age, sex, DOB
#  - clinical.patient_medications_outpatient - Active meds (polypharmacy feature)
#  - clinical.patient_vitals - BP, weight, temp (clinical instability features)
#  - clinical.patient_labs - Creatinine, Hgb (abnormal lab features)

# Create database connection
engine = create_engine(DATABASE_URL)

### Notebook Safety Notes (PHI + Leakage Guardrails)
- Keep exploration de-identified where possible (avoid names/SSN/ICN in outputs).
- Do not print connection secrets (for example, full `DATABASE_URL`).
- For future feature engineering, only use records available at or before the index discharge datetime to avoid data leakage.


In [None]:
# Let's try a database query (patient_demographics)
# De-identified/minimum fields for ML exploration

patient_demographics_query = """
SELECT
    patient_key,
    age,
    sex
FROM clinical.patient_demographics
ORDER BY patient_key
"""

# Load into pandas DataFrame
patient_demographics_df = pd.read_sql(patient_demographics_query, engine)

patient_demographics_df
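
A few baseline data-quality checks can follow a load like the one above. The sketch below runs on a small synthetic frame so it works without a database connection. The column names mirror the query, while the data and the uniqueness expectation for `patient_key` are assumptions.

```python
import pandas as pd

# Synthetic stand-in for patient_demographics_df (illustration only)
demo = pd.DataFrame({
    "patient_key": [101, 102, 103, 103],
    "age": [67, None, 54, 54],
    "sex": ["M", "F", None, None],
})

# Share of missing values per column
missing_share = demo.isna().mean()

# patient_key is expected to be unique in a demographics table (an assumption here)
duplicate_keys = int(demo["patient_key"].duplicated().sum())

# Category counts, keeping missing values visible
sex_counts = demo["sex"].value_counts(dropna=False)

print(missing_share)
print(f"Duplicate patient_key rows: {duplicate_keys}")
print(sex_counts)
```

The same three checks could be pointed at `patient_demographics_df` once the real query has run.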

In [None]:
# Let's try a database query (patient_encounters)
# Keep outputs de-identified for notebook learning work

# Query 1
pt_encounters_count = """
    SELECT COUNT(*) AS encounter_count
    FROM clinical.patient_encounters
"""

# Query 2
pt_encounters_query = """
SELECT
    e.patient_key,
    d.age,
    d.sex,
    e.sta3n,
    e.admit_datetime::DATE AS admit_date,
    e.discharge_datetime::DATE AS discharge_date,
    e.discharge_disposition AS disposition
FROM clinical.patient_encounters AS e
INNER JOIN clinical.patient_demographics AS d
    ON e.patient_key = d.patient_key
WHERE e.discharge_datetime IS NOT NULL
ORDER BY e.patient_key DESC, e.discharge_datetime
"""

# Load Query 1 into pandas DataFrame and print it
# (a bare expression is only auto-displayed when it is the last line of a cell)
pt_encounters_count_df = pd.read_sql(pt_encounters_count, engine)
print(pt_encounters_count_df)

# Load Query 2 into pandas DataFrame and display
pt_encounters_df = pd.read_sql(pt_encounters_query, engine)
print(f"Shape: {pt_encounters_df.shape}")
pt_encounters_df.head(35)

In [None]:
# Inspect the last 35 rows of the same result
print(f"Shape: {pt_encounters_df.shape}")
pt_encounters_df.tail(35)

In [None]:
# Sort by patient, then chronologically by discharge date
pt_encounters_df = pt_encounters_df.sort_values(['patient_key', 'discharge_date'])
pt_encounters_df
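
With encounters ordered by patient and discharge date, the gap between one discharge and the same patient's next admission can be explored directly. The sketch below is a rough exploratory pass on synthetic rows, not the label definition for this project. The 30-day window, the derived column names, and the data are all assumptions.

```python
import pandas as pd

# Synthetic encounters for illustration; columns mirror pt_encounters_df
encounters = pd.DataFrame({
    "patient_key": [1, 1, 1, 2],
    "admit_date": pd.to_datetime(["2024-01-01", "2024-01-20", "2024-06-01", "2024-02-01"]),
    "discharge_date": pd.to_datetime(["2024-01-05", "2024-01-25", "2024-06-04", "2024-02-03"]),
}).sort_values(["patient_key", "discharge_date"])

# Next admission date for the same patient (NaT when there is no later encounter)
encounters["next_admit_date"] = encounters.groupby("patient_key")["admit_date"].shift(-1)

# Whole days between this discharge and the next admission
encounters["days_to_next_admit"] = (
    encounters["next_admit_date"] - encounters["discharge_date"]
).dt.days

# Exploratory flag only: a 30-day window is assumed, and missing gaps compare as False
encounters["readmit_within_30d"] = encounters["days_to_next_admit"].between(0, 30)

print(encounters[["patient_key", "discharge_date", "days_to_next_admit", "readmit_within_30d"]])
```

Because the query casts datetimes to dates, gaps here are in whole calendar days. A modeling label would likely need the full `admit_datetime` and `discharge_datetime` values and rules for transfers or overlapping stays.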

### Family History Domain (Early Profiling)
Use aggregate views first to confirm data availability and quality before building patient-level features.

Note for later feature engineering: for each index discharge, only include family-history rows where `recorded_datetime <= discharge_datetime`.
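
The rule `recorded_datetime <= discharge_datetime` can be sketched in pandas as a join followed by a row filter. The example uses synthetic rows so it runs without a database. The frame names and data are assumptions, and only the column names come from this notebook.

```python
import pandas as pd

# Synthetic rows for illustration only; column names mirror the queries in this notebook
index_discharges = pd.DataFrame({
    "patient_key": [1, 2],
    "discharge_datetime": pd.to_datetime(["2024-03-10 14:00", "2024-05-01 09:30"]),
})
family_history = pd.DataFrame({
    "patient_key": [1, 1, 2, 2],
    "condition_category": ["CARDIAC", "DIABETES", "CANCER", None],
    "recorded_datetime": pd.to_datetime(
        ["2024-01-05 08:00", "2024-04-01 10:00", "2024-05-01 09:30", None]
    ),
})

# Pair each index discharge with that patient's family-history rows
candidates = index_discharges.merge(family_history, on="patient_key", how="left")

# Keep only rows recorded at or before discharge; comparisons against NaT are False,
# so rows with a missing recorded_datetime are excluded as well
leakage_safe = candidates[
    candidates["recorded_datetime"] <= candidates["discharge_datetime"]
]

print(leakage_safe[["patient_key", "condition_category", "recorded_datetime"]])
```

Dropping rows with a missing `recorded_datetime` is the conservative choice in this sketch. Whether those rows should instead be kept under some other rule is a decision for the feature-engineering phase.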

In [None]:
# Family history: high-level profile (aggregate only, no direct identifiers)
family_history_profile_query = """
SELECT
    COUNT(*) AS total_rows,
    COUNT(DISTINCT patient_key) AS unique_patients,
    SUM(CASE WHEN first_degree_relative_flag THEN 1 ELSE 0 END) AS first_degree_rows,
    SUM(CASE WHEN hereditary_risk_flag THEN 1 ELSE 0 END) AS hereditary_risk_rows,
    SUM(CASE WHEN recorded_datetime IS NULL THEN 1 ELSE 0 END) AS missing_recorded_datetime_rows
FROM clinical.patient_family_history
"""

family_history_profile_df = pd.read_sql(family_history_profile_query, engine)
family_history_profile_df

In [None]:
# Top condition categories to understand signal density
family_history_category_query = """
SELECT
    COALESCE(condition_category, 'UNKNOWN') AS condition_category,
    COUNT(*) AS row_count,
    COUNT(DISTINCT patient_key) AS patient_count
FROM clinical.patient_family_history
GROUP BY COALESCE(condition_category, 'UNKNOWN')
ORDER BY row_count DESC
LIMIT 15
"""

family_history_category_df = pd.read_sql(family_history_category_query, engine)
family_history_category_df