# Exploratory Data Analysis - DDI, Medications, and Demographics
This notebook explores the drug-drug interaction (DDI) reference dataset, the patient medications dataset, and the patient demographics dataset to:
1. Understand data structure and quality
2. Identify potential DDI matches in current data
3. Analyze patient demographics distribution
4. Generate recommendations for seeding CDWWork with known DDI pairs

**Datasets**:
- DDI Reference: `med-data/v1_raw/ddi/db_drug_interactions.parquet`
- Medications: `med-data/v1_raw/medications/medications_combined.parquet`
- Demographics: `med-data/v1_raw/demographics/patient_demographics.parquet`

In [1]:
# Import dependencies

import os
import sys
import logging
import time
import re
from collections import Counter
import numpy as np
import pandas as pd
import s3fs
import pyarrow as pa
from importlib.metadata import version
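# Explicit names instead of a wildcard import, so it is clear what config provides
from config import (
    MINIO_ENDPOINT, MINIO_ACCESS_KEY, MINIO_SECRET_KEY,
    DEST_BUCKET, V1_RAW_DDI_PREFIX, V1_RAW_MEDICATIONS_PREFIX,
)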

In [2]:
# Verify dependencies

def print_version():
    print("pandas:", pd.__version__)
    print("numpy:", np.__version__)
    print("s3fs:", s3fs.__version__)
    print("pyarrow:", pa.__version__)

print_version()

pandas: 2.3.3
numpy: 2.3.4
s3fs: 2025.10.0
pyarrow: 22.0.0


In [3]:
# Set up logging

for handler in logging.root.handlers[:]:
    logging.root.removeHandler(handler)

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s"
)

logging.info("Logging configured successfully")

2025-12-02 18:03:59,857 INFO Logging configured successfully


In [4]:
# Load configuration

logging.info(f"MinIO endpoint: {MINIO_ENDPOINT}")
logging.info(f"DDI data: {DEST_BUCKET}/{V1_RAW_DDI_PREFIX}")
logging.info(f"Medications data: {DEST_BUCKET}/{V1_RAW_MEDICATIONS_PREFIX}")

2025-12-02 18:04:02,503 INFO MinIO endpoint: localhost:9000
2025-12-02 18:04:02,503 INFO DDI data: med-data/v1_raw/ddi/
2025-12-02 18:04:02,504 INFO Medications data: med-data/v1_raw/medications/


In [5]:
# Create S3FileSystem for MinIO

logging.info(f"Initializing S3FileSystem for MinIO at {MINIO_ENDPOINT}")
fs = s3fs.S3FileSystem(
    anon=False,
    key=MINIO_ACCESS_KEY,
    secret=MINIO_SECRET_KEY,
    client_kwargs={'endpoint_url': f"http://{MINIO_ENDPOINT}"}
)
logging.info("S3FileSystem created successfully")

2025-12-02 18:04:04,303 INFO Initializing S3FileSystem for MinIO at localhost:9000
2025-12-02 18:04:04,305 INFO S3FileSystem created successfully


---
## Part 1: DDI Reference Dataset Exploration

In [6]:
# Load DDI reference dataset

ddi_uri = f"s3://{DEST_BUCKET}/{V1_RAW_DDI_PREFIX}db_drug_interactions.parquet"
logging.info(f"Reading DDI data: {ddi_uri}")

start_time = time.time()
df_ddi = pd.read_parquet(ddi_uri, filesystem=fs)
elapsed = time.time() - start_time

logging.info(f"Loaded {len(df_ddi):,} DDI records in {elapsed:.2f}s")

2025-12-02 18:04:07,533 INFO Reading DDI data: s3://med-data/v1_raw/ddi/db_drug_interactions.parquet
2025-12-02 18:04:07,688 INFO Loaded 191,541 DDI records in 0.15s


In [7]:
# DDI dataset overview

print("="*80)
print("DDI REFERENCE DATASET OVERVIEW")
print("="*80)
print(f"Shape: {df_ddi.shape}")
print(f"Columns: {list(df_ddi.columns)}")
print(f"\nMemory: {df_ddi.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
print("="*80)

df_ddi.head(10)

DDI REFERENCE DATASET OVERVIEW
Shape: (191541, 3)
Columns: ['Drug 1', 'Drug 2', 'Interaction Description']

Memory: 51.75 MB


| | Drug 1 | Drug 2 | Interaction Description |
|---|---|---|---|
| 0 | Trioxsalen | Verteporfin | Trioxsalen may increase the photosensitizing activities of Verteporfin. |
| 1 | Aminolevulinic acid | Verteporfin | Aminolevulinic acid may increase the photosensitizing activities of Verteporfin. |
| 2 | Titanium dioxide | Verteporfin | Titanium dioxide may increase the photosensitizing activities of Verteporfin. |
| 3 | Tiaprofenic acid | Verteporfin | Tiaprofenic acid may increase the photosensitizing activities of Verteporfin. |
| 4 | Cyamemazine | Verteporfin | Cyamemazine may increase the photosensitizing activities of Verteporfin. |
| 5 | Temoporfin | Verteporfin | Temoporfin may increase the photosensitizing activities of Verteporfin. |
| 6 | Methoxsalen | Verteporfin | Methoxsalen may increase the photosensitizing activities of Verteporfin. |
| 7 | Hexaminolevulinate | Verteporfin | Hexaminolevulinate may increase the photosensitizing activities of Verteporfin. |
| 8 | Benzophenone | Verteporfin | Benzophenone may increase the photosensitizing activities of Verteporfin. |
| 9 | Riboflavin | Verteporfin | Riboflavin may increase the photosensitizing activities of Verteporfin. |


In [8]:
# DDI dataset info

df_ddi.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 191541 entries, 0 to 191540
Data columns (total 3 columns):
 #   Column                   Non-Null Count   Dtype 
---  ------                   --------------   ----- 
 0   Drug 1                   191541 non-null  object
 1   Drug 2                   191541 non-null  object
 2   Interaction Description  191541 non-null  object
dtypes: object(3)
memory usage: 4.4+ MB


In [9]:
# DDI dataset statistics

print("\n" + "="*80)
print("DDI DATASET STATISTICS")
print("="*80)

# Unique drugs
unique_drug1 = df_ddi['Drug 1'].nunique()
unique_drug2 = df_ddi['Drug 2'].nunique()
all_drugs = pd.concat([df_ddi['Drug 1'], df_ddi['Drug 2']]).unique()
total_unique_drugs = len(all_drugs)

print(f"Total DDI pairs: {len(df_ddi):,}")
print(f"Unique Drug 1: {unique_drug1:,}")
print(f"Unique Drug 2: {unique_drug2:,}")
print(f"Total unique drugs: {total_unique_drugs:,}")
print("\nMissing values:")
print(df_ddi.isnull().sum())
print("="*80)


DDI DATASET STATISTICS
Total DDI pairs: 191,541
Unique Drug 1: 1,634
Unique Drug 2: 1,606
Total unique drugs: 1,701

Missing values:
Drug 1                     0
Drug 2                     0
Interaction Description    0
dtype: int64


In [10]:
# Most common drugs in DDI dataset

print("\n" + "="*80)
print("TOP 20 DRUGS IN DDI DATASET (by interaction count)")
print("="*80)

# Count appearances in either Drug 1 or Drug 2
drug_counts = Counter()
drug_counts.update(df_ddi['Drug 1'])
drug_counts.update(df_ddi['Drug 2'])

top_drugs = pd.DataFrame(drug_counts.most_common(20), columns=['Drug', 'Interaction_Count'])
print(top_drugs.to_string(index=False))
print("="*80)


TOP 20 DRUGS IN DDI DATASET (by interaction count)
         Drug  Interaction_Count
Phenobarbital                920
    Primidone                896
    Phenytoin                895
 Fosphenytoin                877
Carbamazepine                868
  Venlafaxine                860
Pentobarbital                845
  Fluvoxamine                845
   Amiodarone                844
   Nefazodone                815
    Diltiazem                795
 Cyclosporine                793
  Ziprasidone                789
  Vemurafenib                777
    Clozapine                773
    Verapamil                766
   Rifampicin                763
  Stiripentol                760
 Mifepristone                758
   Isradipine                754


In [11]:
# Analyze interaction descriptions for severity keywords

print("\n" + "="*80)
print("INTERACTION SEVERITY ANALYSIS (keyword-based)")
print("="*80)

# Extract severity keywords from descriptions
def extract_severity_keywords(description):
    """Extract potential severity indicators from interaction description."""
    if pd.isna(description):
        return 'Unknown'
    desc_lower = description.lower()
    
    # Check for severity keywords
    if any(word in desc_lower for word in ['contraindicated', 'avoid', 'serious', 'severe']):
        return 'High'
    elif any(word in desc_lower for word in ['caution', 'monitor', 'may increase', 'may decrease']):
        return 'Moderate'
    else:
        return 'Unknown'

df_ddi['Severity_Keyword'] = df_ddi['Interaction Description'].apply(extract_severity_keywords)

severity_dist = df_ddi['Severity_Keyword'].value_counts()
print(severity_dist)
print("\nNote: Severity based on keyword analysis of descriptions")
print("="*80)


INTERACTION SEVERITY ANALYSIS (keyword-based)
Severity_Keyword
Unknown     144313
Moderate     47228
Name: count, dtype: int64

Note: Severity based on keyword analysis of descriptions


In [12]:
# Sample high-severity interactions

print("\n" + "="*80)
print("SAMPLE HIGH-SEVERITY INTERACTIONS")
print("="*80)

high_severity = df_ddi[df_ddi['Severity_Keyword'] == 'High'].head(10)
print(high_severity[['Drug 1', 'Drug 2', 'Interaction Description']].to_string(index=False))
print("="*80)


SAMPLE HIGH-SEVERITY INTERACTIONS
Empty DataFrame
Columns: [Drug 1, Drug 2, Interaction Description]
Index: []
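
The keyword scan above uses plain substring matching, so a term such as "avoid" would also match inside "unavoidable", and no description in this dataset hits the high-severity list. A minimal sketch of a stricter variant with word-boundary regexes follows. The keyword lists are copied from the cell above; the helper name `classify_severity` and the sample sentences in the usage note are illustrative assumptions, not part of the notebook.

```python
import re

# Keyword lists copied from extract_severity_keywords above.
HIGH_TERMS = ["contraindicated", "avoid", "serious", "severe"]
MODERATE_TERMS = ["caution", "monitor", "may increase", "may decrease"]


def _compile(terms):
    # \b anchors each term at word boundaries, so "avoid" does not match "unavoidable".
    pattern = r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b"
    return re.compile(pattern, re.IGNORECASE)


HIGH_RE = _compile(HIGH_TERMS)
MODERATE_RE = _compile(MODERATE_TERMS)


def classify_severity(description):
    """Hypothetical helper: return 'High', 'Moderate', or 'Unknown'."""
    # Non-string input (None, NaN) is treated as unknown.
    if not isinstance(description, str):
        return "Unknown"
    if HIGH_RE.search(description):
        return "High"
    if MODERATE_RE.search(description):
        return "Moderate"
    return "Unknown"
```

With this variant, `classify_severity("Avoid combination.")` returns `'High'`, while a sentence containing only "unavoidable" stays `'Unknown'`. The trade-off is that inflected forms such as "monitoring" no longer match "monitor", so the term lists may need extending before applying it with `df_ddi['Interaction Description'].apply(classify_severity)`.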


---
## Part 2: Medications Dataset Exploration

In [13]:
# Load medications dataset

meds_uri = f"s3://{DEST_BUCKET}/{V1_RAW_MEDICATIONS_PREFIX}medications_combined.parquet"
logging.info(f"Reading medications data: {meds_uri}")

start_time = time.time()
df_meds = pd.read_parquet(meds_uri, filesystem=fs)
elapsed = time.time() - start_time

logging.info(f"Loaded {len(df_meds):,} medication records in {elapsed:.2f}s")

2025-12-02 18:04:35,604 INFO Reading medications data: s3://med-data/v1_raw/medications/medications_combined.parquet
2025-12-02 18:04:35,633 INFO Loaded 1,257 medication records in 0.03s


In [15]:
# Medications dataset overview

print("="*80)
print("MEDICATIONS DATASET OVERVIEW")
print("="*80)
print(f"Shape: {df_meds.shape}")
print(f"Columns: {list(df_meds.columns)}")
print(f"\nMemory: {df_meds.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
print("="*80)

df_meds.head(25)

MEDICATIONS DATASET OVERVIEW
Shape: (1257, 24)
Columns: ['PatientSID', 'PatientIEN', 'Sta3n', 'DrugNameWithoutDose', 'DrugNameWithDose', 'SourceSystem', 'MedicationDateTime', 'StartDate', 'EndDate', 'Status', 'DaysSupply', 'Quantity', 'DEASchedule', 'ControlledSubstanceFlag', 'OrderNumber', 'ProviderSID', 'LocalDrugSID', 'NationalDrugSID', 'Route', 'DosageOrdered', 'Frequency', 'PrescriptionNumber', 'PharmacyName', 'ProviderType']

Memory: 1.01 MB


Unnamed: 0,PatientSID,PatientIEN,Sta3n,DrugNameWithoutDose,DrugNameWithDose,SourceSystem,MedicationDateTime,StartDate,EndDate,Status,...,OrderNumber,ProviderSID,LocalDrugSID,NationalDrugSID,Route,DosageOrdered,Frequency,PrescriptionNumber,PharmacyName,ProviderType
0,1001,PtIEN1001,508,LISINOPRIL,LISINOPRIL 10MG TAB,BCMA,2025-01-01 08:05:00,2025-01-01 08:05:00,NaT,GIVEN,...,IP-2025-001001,1001.0,10002.0,20002.0,,,,,,
1,1001,PtIEN1001,508,METFORMIN HCL,METFORMIN HCL 500MG TAB,BCMA,2025-01-01 12:10:00,2025-01-01 12:10:00,NaT,GIVEN,...,IP-2025-001002,1001.0,10001.0,20001.0,,,,,,
2,1001,PtIEN1001,508,METFORMIN HCL,METFORMIN HCL 500MG TAB,BCMA,2025-01-01 18:08:00,2025-01-01 18:08:00,NaT,GIVEN,...,IP-2025-001002,1001.0,10001.0,20001.0,,,,,,
3,1001,PtIEN1001,508,LISINOPRIL,LISINOPRIL 10MG TAB,BCMA,2025-01-02 08:45:00,2025-01-02 08:45:00,NaT,GIVEN,...,IP-2025-001001,1001.0,10002.0,20002.0,,,,,,
4,1001,PtIEN1001,508,METFORMIN HCL,METFORMIN HCL 500MG TAB,RxOut,2025-01-15 10:30:00,2025-01-15 10:30:00,2025-01-15,ACTIVE,...,2024-001-0001,1001.0,10001.0,20001.0,,,,,,
5,1001,PtIEN1001,508,LISINOPRIL,LISINOPRIL 10MG TAB,RxOut,2025-01-15 10:35:00,2025-01-15 10:35:00,2025-01-15,ACTIVE,...,2024-001-0002,1001.0,10002.0,20002.0,,,,,,
6,1001,PtIEN1001,508,SPIRONOLACTONE,SPIRONOLACTONE 25MG TAB,RxOut,2025-03-01 10:00:00,2025-03-01 10:00:00,2025-03-01,ACTIVE,...,2024-001-0004,1001.0,10022.0,20022.0,,,,,,
7,1002,PtIEN1002,508,INSULIN GLARGINE,INSULIN GLARGINE 100UNIT/ML INJ,BCMA,2025-01-02 07:30:00,2025-01-02 07:30:00,NaT,GIVEN,...,IP-2025-002003,1002.0,10019.0,20019.0,,,,,,
8,1002,PtIEN1002,508,HYDROCODONE-ACETAMINOPHEN,HYDROCODONE-ACETAMINOPHEN 5-325MG TAB,BCMA,2025-01-02 14:35:00,2025-01-02 14:35:00,NaT,GIVEN,...,IP-2025-002001,1002.0,10012.0,20012.0,,,,,,
9,1002,PtIEN1002,508,ALBUTEROL SULFATE,ALBUTEROL SULFATE HFA 90MCG INHALER,RxOut,2025-01-22 09:15:00,2025-01-22 09:15:00,2025-01-22,ACTIVE,...,2024-002-0001,1002.0,10004.0,20004.0,,,,,,


In [None]:
df_meds.tail(25)

In [None]:
# Medications dataset info

df_meds.info()

In [None]:
# Medications dataset statistics

print("\n" + "="*80)
print("MEDICATIONS DATASET STATISTICS")
print("="*80)

print(f"Total medication records: {len(df_meds):,}")
print(f"Unique patients: {df_meds['PatientSID'].nunique()}")
print(f"Unique drug names (with dose): {df_meds['DrugNameWithDose'].nunique()}")
print(f"Unique drug names (without dose): {df_meds['DrugNameWithoutDose'].nunique()}")

print("\nSource system distribution:")
print(df_meds['SourceSystem'].value_counts())

print("\nMissing values:")
print(df_meds.isnull().sum())
print("="*80)

In [None]:
# Patient medication profiles

print("\n" + "="*80)
print("PATIENT MEDICATION PROFILES")
print("="*80)

# Medications per patient
meds_per_patient = df_meds.groupby('PatientSID').agg({
    'DrugNameWithDose': 'count',
    'SourceSystem': lambda x: ', '.join(x.unique())
}).rename(columns={'DrugNameWithDose': 'MedicationCount', 'SourceSystem': 'Sources'})

print(meds_per_patient)
print(f"\nAverage medications per patient: {meds_per_patient['MedicationCount'].mean():.1f}")
print("="*80)

In [None]:
# Detailed view: Sample patient medication list

sample_patient = df_meds['PatientSID'].iloc[0]

print("\n" + "="*80)
print(f"SAMPLE PATIENT MEDICATION LIST (PatientSID={sample_patient})")
print("="*80)

patient_meds = df_meds[df_meds['PatientSID'] == sample_patient][[
    'DrugNameWithDose', 'SourceSystem', 'MedicationDateTime', 'Status'
]].sort_values('MedicationDateTime')

print(patient_meds.to_string(index=False))
print("="*80)
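
Note that `MedicationCount` in the profile cell above counts rows, so repeated BCMA administrations of the same drug inflate it. A hedged sketch using pandas named aggregation to report row counts and distinct drugs side by side is shown below. The toy frame only mirrors the column names of `df_meds`; its values and the output column names (`RecordCount`, `DistinctDrugs`) are illustrative assumptions.

```python
import pandas as pd

# Toy rows mirroring the columns used above (values are illustrative only).
toy_meds = pd.DataFrame({
    "PatientSID": [1001, 1001, 1001, 1002],
    "DrugNameWithoutDose": ["LISINOPRIL", "METFORMIN HCL", "METFORMIN HCL", "INSULIN GLARGINE"],
    "SourceSystem": ["BCMA", "BCMA", "RxOut", "BCMA"],
})

profile = toy_meds.groupby("PatientSID").agg(
    # Number of medication records (rows) per patient
    RecordCount=("DrugNameWithoutDose", "size"),
    # Number of distinct drug names per patient
    DistinctDrugs=("DrugNameWithoutDose", "nunique"),
    # Source systems seen for the patient, in a stable order
    Sources=("SourceSystem", lambda s: ", ".join(sorted(s.unique()))),
)
print(profile)
```

Swapping `toy_meds` for `df_meds` should give a per-patient view in which the distinct-drug count, rather than the row count, indicates how many pairwise interactions are possible.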

---
## Part 3: Patient Demographics Data

In [None]:
# Load patient demographics dataset

demo_uri = f"s3://{DEST_BUCKET}/v1_raw/demographics/patient_demographics.parquet"
logging.info(f"Reading demographics data: {demo_uri}")

start_time = time.time()
df_demo = pd.read_parquet(demo_uri, filesystem=fs)
elapsed = time.time() - start_time

logging.info(f"Loaded {len(df_demo):,} patient demographic records in {elapsed:.2f}s")

In [None]:
# Demographics dataset overview

print("="*80)
print("DEMOGRAPHICS DATASET OVERVIEW")
print("="*80)
print(f"Shape: {df_demo.shape}")
print(f"Columns: {list(df_demo.columns)}")
print(f"\nMemory: {df_demo.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
print("="*80)

df_demo.head(10)

In [None]:
# Demographics dataset info

df_demo.info()

In [None]:
# Age distribution analysis

print("\n" + "="*80)
print("AGE DISTRIBUTION")
print("="*80)

print("\nAge summary statistics:")
print(df_demo['Age'].describe())

print("\nAge group distribution:")
age_group_counts = df_demo['AgeGroup'].value_counts().sort_index()
age_group_pct = (age_group_counts / len(df_demo) * 100).round(1)

age_dist = pd.DataFrame({
    'Count': age_group_counts,
    'Percentage': age_group_pct
})
print(age_dist)

print(f"\nElderly patients (65+): {df_demo['Age'].ge(65).sum()} ({df_demo['Age'].ge(65).sum() / len(df_demo) * 100:.1f}%)")
print(f"Mean age: {df_demo['Age'].mean():.1f} years")
print(f"Median age: {df_demo['Age'].median():.1f} years")
print("="*80)

In [None]:
# Gender distribution analysis

print("\n" + "="*80)
print("GENDER DISTRIBUTION")
print("="*80)

gender_counts = df_demo['Gender'].value_counts()
gender_pct = (gender_counts / len(df_demo) * 100).round(1)

gender_dist = pd.DataFrame({
    'Count': gender_counts,
    'Percentage': gender_pct
})

print(gender_dist)
print("="*80)

In [None]:
# Demographics data quality checks

print("\n" + "="*80)
print("DEMOGRAPHICS DATA QUALITY")
print("="*80)

# Missing values
print("\nMissing values:")
missing_counts = df_demo.isnull().sum()
print(missing_counts[missing_counts > 0] if missing_counts.sum() > 0 else "None")

# Duplicate patients
duplicates = df_demo['PatientSID'].duplicated().sum()
print(f"\nDuplicate PatientSIDs: {duplicates}")

# Age outliers
negative_ages = (df_demo['Age'] < 0).sum()
extreme_ages = (df_demo['Age'] > 120).sum()

if negative_ages > 0:
    print(f"\n⚠ WARNING: {negative_ages} patients with negative age")
if extreme_ages > 0:
    print(f"⚠ WARNING: {extreme_ages} patients with age > 120")
if negative_ages == 0 and extreme_ages == 0:
    print("\n✓ Age values within expected range (0-120)")

print(f"\nAge range: {df_demo['Age'].min()} to {df_demo['Age'].max()} years")

# Check alignment with medications dataset
if 'df_meds' in locals():
    meds_patients = set(df_meds['PatientSID'].unique())
    demo_patients = set(df_demo['PatientSID'].unique())
    
    patients_with_demo = meds_patients & demo_patients
    patients_without_demo = meds_patients - demo_patients
    
    print(f"\nPatient alignment with medications:")
    print(f"  Patients with medications: {len(meds_patients)}")
    print(f"  Patients with demographics: {len(patients_with_demo)}")
    print(f"  Patients without demographics: {len(patients_without_demo)}")
    
    if len(patients_without_demo) > 0:
        print(f"  ⚠ WARNING: {len(patients_without_demo)} patients missing demographics")
    else:
        print(f"  ✓ All medication patients have demographics")

print("="*80)
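
The set arithmetic above reports only patients who have medications but no demographics. As a sketch of an alternative, an outer merge with `indicator=True` shows both directions of the mismatch in one table. The toy frames and their values are assumptions for illustration; only the `PatientSID` column name comes from the notebook.

```python
import pandas as pd

# Toy stand-ins for df_meds and df_demo (values are illustrative only).
toy_meds = pd.DataFrame({"PatientSID": [1001, 1001, 1002, 1003]})
toy_demo = pd.DataFrame({"PatientSID": [1001, 1002, 1004], "Age": [67, 54, 71]})

# One row per patient on each side, then an outer merge that labels each key as
# 'left_only' (medications only), 'right_only' (demographics only), or 'both'.
coverage = (
    toy_meds[["PatientSID"]].drop_duplicates()
    .merge(
        toy_demo[["PatientSID"]].drop_duplicates(),
        on="PatientSID",
        how="outer",
        indicator=True,
    )
)

summary = coverage["_merge"].astype(str).value_counts()
missing_demo = coverage.loc[coverage["_merge"] == "left_only", "PatientSID"].tolist()
print(summary)
print("Patients missing demographics:", missing_demo)
```

Here `left_only` corresponds to `patients_without_demo` in the cell above, and `right_only` lists patients with demographics but no medication records, which the set-based check does not print.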

---
## Part 4: Drug Name Normalization

In [None]:
# Drug name normalization functions

def normalize_drug_name(drug_name):
    """
    Normalize drug name for matching:
    - Convert to uppercase
    - Remove one trailing salt suffix (HCL, SODIUM, CALCIUM, etc.), only when the name ends with it
    - Strip whitespace
    
    Examples:
    'METFORMIN HCL' → 'METFORMIN'
    'ATORVASTATIN CALCIUM' → 'ATORVASTATIN'
    'Warfarin' → 'WARFARIN'
    """
    if pd.isna(drug_name):
        return None
    
    # Convert to uppercase
    name = str(drug_name).upper().strip()
    
    # Remove common salt/compound suffixes
    suffixes_to_remove = [
        ' HCL', ' HYDROCHLORIDE',
        ' SODIUM', ' POTASSIUM', ' CALCIUM',
        ' SULFATE', ' TARTRATE', ' SUCCINATE',
        ' MALEATE', ' FUMARATE', ' ACETATE',
        ' CITRATE', ' PHOSPHATE'
    ]
    
    for suffix in suffixes_to_remove:
        if name.endswith(suffix):
            name = name[:-len(suffix)].strip()
            break
    
    return name


def extract_base_drug_name(drug_name_with_dose):
    """
    Extract base drug name from full name with dose.
    
    Examples:
    'METFORMIN HCL 500MG TAB' → 'METFORMIN'
    'LISINOPRIL 10MG TAB' → 'LISINOPRIL'
    'HYDROCODONE-ACETAMINOPHEN 5-325MG TAB' → 'HYDROCODONE-ACETAMINOPHEN'
    """
    if pd.isna(drug_name_with_dose):
        return None
    
    name = str(drug_name_with_dose).upper().strip()
    
    # Remove dose information (digits + units like MG, MCG, ML, etc.)
    # Pattern: remove everything starting with first digit
    name = re.split(r'\s*\d', name)[0].strip()
    
    # Apply normalization to remove salts
    name = normalize_drug_name(name)
    
    return name


logging.info("Drug name normalization functions defined")

In [None]:
# Test normalization functions

print("\n" + "="*80)
print("DRUG NAME NORMALIZATION EXAMPLES")
print("="*80)

test_names = [
    'METFORMIN HCL 500MG TAB',
    'LISINOPRIL 10MG TAB',
    'ATORVASTATIN CALCIUM 20MG TAB',
    'HYDROCODONE-ACETAMINOPHEN 5-325MG TAB',
    'WARFARIN SODIUM 5MG TAB',
    'ALBUTEROL SULFATE HFA 90MCG INHALER'
]

for name in test_names:
    normalized = extract_base_drug_name(name)
    print(f"{name:50s} → {normalized}")

print("="*80)
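
One case in the test list is worth checking by hand: `'ALBUTEROL SULFATE HFA 90MCG INHALER'` becomes `'ALBUTEROL SULFATE HFA'` after dose removal, and because that string does not end with a salt suffix, `normalize_drug_name` leaves it unchanged. A minimal sketch of a token-based variant that pops trailing tokens repeatedly is given below. The salt list is copied from the function above; the helper name `strip_trailing_tokens` and the `FORMULATION_TOKENS` set are assumptions, and the sketch handles `None` but not pandas NaN values.

```python
import re

# Salt suffixes copied from normalize_drug_name above (without the leading space).
SALT_TOKENS = {
    "HCL", "HYDROCHLORIDE", "SODIUM", "POTASSIUM", "CALCIUM",
    "SULFATE", "TARTRATE", "SUCCINATE", "MALEATE", "FUMARATE",
    "ACETATE", "CITRATE", "PHOSPHATE",
}
# Assumed formulation tokens; extend only after checking real drug names.
FORMULATION_TOKENS = {"HFA"}
TRAILING_TOKENS = SALT_TOKENS | FORMULATION_TOKENS


def strip_trailing_tokens(drug_name):
    """Hypothetical helper: drop the dose, then pop trailing salt/formulation tokens."""
    if drug_name is None:
        return None
    # Same dose rule as extract_base_drug_name: cut at the first digit.
    base = re.split(r"\s*\d", str(drug_name).upper().strip())[0].strip()
    tokens = base.split()
    # Always keep the first token so a name is never stripped to nothing.
    while len(tokens) > 1 and tokens[-1] in TRAILING_TOKENS:
        tokens.pop()
    return " ".join(tokens)
```

Like the original function, this still collapses names whose first word is itself a salt-like term (for example `'Calcium acetate'` becomes `'CALCIUM'`), so the non-matching list in Part 5 remains the place to look for names that were stripped too far.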

In [None]:
# Apply normalization to both datasets

logging.info("Applying normalization to DDI dataset...")
df_ddi['Drug1_Normalized'] = df_ddi['Drug 1'].apply(normalize_drug_name)
df_ddi['Drug2_Normalized'] = df_ddi['Drug 2'].apply(normalize_drug_name)

logging.info("Applying normalization to medications dataset...")
df_meds['DrugName_Normalized'] = df_meds['DrugNameWithoutDose'].apply(extract_base_drug_name)

logging.info("Normalization complete")

print("\nSample normalized medications:")
print(df_meds[['DrugNameWithDose', 'DrugNameWithoutDose', 'DrugName_Normalized']].head(10))

---
## Part 5: DDI Matching Analysis

In [None]:
# Check which medications appear in DDI dataset

print("\n" + "="*80)
print("MEDICATION DRUGS FOUND IN DDI REFERENCE")
print("="*80)

# Get all unique drugs from DDI dataset
ddi_drugs = set(df_ddi['Drug1_Normalized'].dropna()) | set(df_ddi['Drug2_Normalized'].dropna())

# Get all unique drugs from medications
med_drugs = set(df_meds['DrugName_Normalized'].dropna())

# Find matches
matching_drugs = med_drugs & ddi_drugs
non_matching_drugs = med_drugs - ddi_drugs

print(f"Total unique medications in patient data: {len(med_drugs)}")
print(f"Total unique drugs in DDI reference: {len(ddi_drugs):,}")
print(f"\nMatching drugs (found in DDI): {len(matching_drugs)}")
print(f"Non-matching drugs: {len(non_matching_drugs)}")

if matching_drugs:
    print("\n✓ Medications found in DDI dataset:")
    for drug in sorted(matching_drugs):
        print(f"  - {drug}")
else:
    print("\n✗ No matching drugs found")

if non_matching_drugs:
    print("\n✗ Medications NOT in DDI dataset:")
    for drug in sorted(non_matching_drugs):
        print(f"  - {drug}")

print("="*80)

In [None]:
# Check for potential DDI pairs within each patient

print("\n" + "="*80)
print("POTENTIAL DDI PAIRS WITHIN PATIENTS")
print("="*80)

def find_patient_ddi_pairs(patient_id, patient_meds_df, ddi_df):
    """
    For a given patient, find all potential DDI pairs from their medication list.
    """
    # Get patient's medications
    meds = patient_meds_df[patient_meds_df['PatientSID'] == patient_id]['DrugName_Normalized'].dropna().unique()
    
    interactions = []
    
    # Check all pairs of patient's medications
    for i, drug1 in enumerate(meds):
        for drug2 in meds[i+1:]:
            # Check if this pair exists in DDI dataset (either order)
            match = ddi_df[
                ((ddi_df['Drug1_Normalized'] == drug1) & (ddi_df['Drug2_Normalized'] == drug2)) |
                ((ddi_df['Drug1_Normalized'] == drug2) & (ddi_df['Drug2_Normalized'] == drug1))
            ]
            
            if not match.empty:
                for _, row in match.iterrows():
                    # Truncate long descriptions for display
                    description = row['Interaction Description']
                    if len(description) > 100:
                        description = description[:100] + '...'
                    interactions.append({
                        'PatientSID': patient_id,
                        'Drug1': drug1,
                        'Drug2': drug2,
                        'Interaction': description,
                        'Severity': row.get('Severity_Keyword', 'Unknown')
                    })
    
    return interactions

# Check all patients
all_patient_ddis = []
for patient_id in df_meds['PatientSID'].unique():
    patient_ddis = find_patient_ddi_pairs(patient_id, df_meds, df_ddi)
    all_patient_ddis.extend(patient_ddis)

if all_patient_ddis:
    df_patient_ddis = pd.DataFrame(all_patient_ddis)
    print(f"\n✓ Found {len(df_patient_ddis)} potential DDI pairs in patient medication lists:\n")
    print(df_patient_ddis.to_string(index=False))
else:
    print("\n✗ No DDI pairs found in current patient medication data")
    print("\nThis indicates a need to seed CDWWork with known interacting medication pairs.")

print("\n" + "="*80)
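
The matching loop above filters the full DDI frame once per medication pair, which is fine for this sample but grows with both the number of patients and the size of the reference table. One possible alternative, sketched here under assumptions, builds an order-independent index keyed by `frozenset` once and then looks up each pair from `itertools.combinations`. The function names and the placeholder rows in the usage note are hypothetical.

```python
from itertools import combinations


def build_pair_index(rows):
    """Map an unordered drug pair to its list of interaction descriptions.

    rows: iterable of (drug1, drug2, description) tuples with normalized names.
    """
    index = {}
    for drug1, drug2, description in rows:
        # frozenset makes (A, B) and (B, A) the same key.
        index.setdefault(frozenset((drug1, drug2)), []).append(description)
    return index


def find_pairs(patient_drugs, index):
    """Return (drug_a, drug_b, description) for every known pair in a medication list."""
    hits = []
    # Deduplicate and sort so results are deterministic.
    for a, b in combinations(sorted(set(patient_drugs)), 2):
        for description in index.get(frozenset((a, b)), []):
            hits.append((a, b, description))
    return hits
```

For example, with `index = build_pair_index([("LISINOPRIL", "SPIRONOLACTONE", "placeholder description")])`, calling `find_pairs(["SPIRONOLACTONE", "METFORMIN", "LISINOPRIL"], index)` returns the single LISINOPRIL/SPIRONOLACTONE entry. In the notebook, the rows could come from `zip(df_ddi['Drug1_Normalized'], df_ddi['Drug2_Normalized'], df_ddi['Interaction Description'])`.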

---
## Part 6: Seeding Recommendations

In [None]:
# Identify high-value DDI pairs to add to CDWWork

print("\n" + "="*80)
print("RECOMMENDED DDI PAIRS TO ADD TO CDWWORK")
print("="*80)

# Define clinically important, common drug pairs
# NOTE: Use drug names as they appear in DDI dataset (e.g., ACETYLSALICYLIC ACID not ASPIRIN)
recommended_pairs = [
    # Drug pairs that should be easy to find in DDI dataset
    ('WARFARIN', 'ACETYLSALICYLIC ACID'),  # Anticoagulant + antiplatelet = increased bleeding risk
    ('WARFARIN', 'IBUPROFEN'),
    ('LISINOPRIL', 'IBUPROFEN'),
    ('LISINOPRIL', 'SPIRONOLACTONE'),      # ACE inhibitor + K-sparing diuretic = hyperkalemia
    ('METFORMIN', 'IBUPROFEN'),
    ('SERTRALINE', 'TRAMADOL'),
    ('METOPROLOL', 'VERAPAMIL'),
    ('ALBUTEROL', 'PROPRANOLOL'),
    ('WARFARIN', 'AMOXICILLIN'),
]

print("\nSearching DDI dataset for recommended pairs...\n")

found_pairs = []
not_found_pairs = []

for drug1, drug2 in recommended_pairs:
    # Search in DDI dataset (either order)
    match = df_ddi[
        ((df_ddi['Drug1_Normalized'] == drug1) & (df_ddi['Drug2_Normalized'] == drug2)) |
        ((df_ddi['Drug1_Normalized'] == drug2) & (df_ddi['Drug2_Normalized'] == drug1))
    ]
    
    if not match.empty:
        interaction = match.iloc[0]
        found_pairs.append({
            'Drug 1': drug1,
            'Drug 2': drug2,
            'Found': 'YES',
            'Severity': interaction.get('Severity_Keyword', 'Unknown'),
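            # Append an ellipsis only when the description is actually truncated
            'Interaction': interaction['Interaction Description'][:80] + ('...' if len(interaction['Interaction Description']) > 80 else '')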
        })
    else:
        not_found_pairs.append({'Drug 1': drug1, 'Drug 2': drug2, 'Found': 'NO'})

if found_pairs:
    df_found = pd.DataFrame(found_pairs)
    print("✓ FOUND IN DDI DATASET (ready to add to CDWWork):")
    print(df_found.to_string(index=False))

if not_found_pairs:
    df_not_found = pd.DataFrame(not_found_pairs)
    print("\n✗ NOT FOUND IN DDI DATASET (may need different drug names):")
    print(df_not_found.to_string(index=False))

print("\n" + "="*80)

In [None]:
# Generate SQL INSERT statement templates for CDWWork seeding

print("\n" + "="*80)
print("SQL INSERT TEMPLATE FOR CDWWORK SEEDING")
print("="*80)

print("""
-- Add these medications to CDWWork insert scripts to create DDI test scenarios
-- Recommendation: Add to existing patients or create new test patients

IMPORTANT: Drug name mapping between CDWWork and DDI dataset
  - CDWWork uses common/brand names (e.g., ASPIRIN)
  - DDI dataset uses chemical names (e.g., Acetylsalicylic acid)
  - Use ACETYLSALICYLIC ACID in CDWWork to ensure DDI matching

Example 1: Patient 1005 (already has WARFARIN, add ACETYLSALICYLIC ACID for interaction)
✓ COMPLETED - Added to CDWWork

INSERT INTO RxOut.RxOutpat (...)
VALUES
(5021, 'RxIEN5021', 508, 1005, 'PtIEN1005',
 10021, 'DrugIEN10021', 20021,
 'ACETYLSALICYLIC ACID', 'ACETYLSALICYLIC ACID 81MG TAB',
 '2024-005-0003', '2024-02-01 10:00:00', ...);
-- Expected DDI: WARFARIN + ACETYLSALICYLIC ACID = Increased bleeding risk

Example 2: Patient 1001 (already has LISINOPRIL, add SPIRONOLACTONE for interaction)
✓ COMPLETED - Added to CDWWork

INSERT INTO RxOut.RxOutpat (...)
VALUES
(5022, 'RxIEN5022', 508, 1001, 'PtIEN1001',
 10022, 'DrugIEN10022', 20022,
 'SPIRONOLACTONE', 'SPIRONOLACTONE 25MG TAB',
 '2024-001-0004', '2024-03-01 10:00:00', ...);
-- Expected DDI: LISINOPRIL + SPIRONOLACTONE = Hyperkalemia risk

Example 3: Future addition - Patient 1006 (has IBUPROFEN, add WARFARIN for interaction)

INSERT INTO RxOut.RxOutpat (...)
VALUES
(5023, 'RxIEN5023', 508, 1006, 'PtIEN1006',
 10023, 'DrugIEN10023', 20023,
 'WARFARIN', 'WARFARIN SODIUM 5MG TAB',
 '2024-006-0003', '2024-04-15 10:00:00', ...);
-- Expected DDI: WARFARIN + IBUPROFEN = Increased bleeding risk
""")

print("="*80)
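
The template above resolves the naming mismatch by storing ACETYLSALICYLIC ACID in CDWWork directly. A complementary option, sketched here as an assumption rather than something the notebook implements, is to map common names to DDI reference names at match time. Only the ASPIRIN entry comes from the note above; the helper name `to_reference_name` is hypothetical.

```python
# Only the ASPIRIN entry comes from the mapping note above; any further
# entries are assumptions that should be verified against the DDI dataset.
NAME_ALIASES = {
    "ASPIRIN": "ACETYLSALICYLIC ACID",
}


def to_reference_name(normalized_name, aliases=NAME_ALIASES):
    """Hypothetical helper: translate a normalized drug name to its DDI reference name."""
    if normalized_name is None:
        return None
    key = normalized_name.upper().strip()
    # Names without an alias pass through unchanged.
    return aliases.get(key, key)
```

Applied after normalization (for instance with `df_meds['DrugName_Normalized'].map(to_reference_name)`), this would let source data keep common names while still matching the reference dataset.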

In [None]:
# Prioritized seeding recommendations

print("\n" + "="*80)
print("PRIORITIZED SEEDING RECOMMENDATIONS")
print("="*80)

recommendations = """
Based on the exploration analysis, here are prioritized recommendations:

COMPLETED ITEMS:
  ✓ Patient 1005 (has WARFARIN) → Added ACETYLSALICYLIC ACID (bleeding risk DDI)
  ✓ Patient 1001 (has LISINOPRIL) → Added SPIRONOLACTONE (hyperkalemia risk DDI)

PRIORITY 1 - Add to Existing Patients (Quick Wins):
  □ Patient 1005 (has WARFARIN + ACETYLSALICYLIC ACID) → Add IBUPROFEN (triple bleeding risk)
  □ Patient 1001 (has LISINOPRIL + SPIRONOLACTONE) → Add IBUPROFEN (NSAID + ACE inhibitor)
  □ Patient 1002 (has TRAMADOL) → Add SERTRALINE (serotonin syndrome risk)
  □ Patient 1006 (has IBUPROFEN) → Add WARFARIN or LISINOPRIL

PRIORITY 2 - Create New Test Patient with Multiple DDIs:
  □ Patient 1011 - "High-risk polypharmacy patient"
    Medications:
    - WARFARIN SODIUM 5MG TAB (anticoagulant)
    - ACETYLSALICYLIC ACID 81MG TAB (antiplatelet) → DDI with WARFARIN
    - IBUPROFEN 600MG TAB (NSAID) → DDI with WARFARIN and ACETYLSALICYLIC ACID
    - LISINOPRIL 10MG TAB (ACE inhibitor) → DDI with IBUPROFEN
    Expected: at least the 4 DDI pairs listed above (4 medications give 6 possible pairs)

PRIORITY 3 - Vary Severity Levels:
  □ Add more moderate-severity interactions (already present in the current data)
  □ Add high-severity/contraindicated pairs
  □ Ensure mix of RxOut and BCMA sources (current: df_meds.head() shows both BCMA and RxOut rows)

IMPORTANT NOTES:
  - Drug Name Mapping: DDI dataset uses chemical names (ACETYLSALICYLIC ACID not ASPIRIN)
  - Date Range: Ensure medication dates fall within 01b notebook date range filter
  - Normalization: Salt suffixes (HCL, SODIUM, etc.) are automatically stripped during matching

NEXT STEPS (for future additions):
1. Update CDWWork insert scripts in med-data/sql-server/cdwwork/insert/RxOut.RxOutpat.sql
2. Re-run database population: cd med-data/sql-server/cdwwork/insert && ./_master.sh
3. Re-run 01b_dataprep_medications.ipynb to regenerate parquet file
4. Re-run this notebook (02_explore.ipynb) to verify DDI matches appear
5. Proceed to 03_clean.ipynb when satisfied with test data
"""

print(recommendations)
print("="*80)
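
To sanity-check the Priority 2 plan, the number of possible pairs among n medications is n(n-1)/2, so the four listed medications give six candidate pairs, of which the plan names four. A small sketch that enumerates them follows; the variable names are illustrative, and whether the two unlisted pairs exist in the DDI dataset is not asserted here.

```python
from itertools import combinations
from math import comb

# Medications from the proposed polypharmacy test patient (normalized names).
planned_meds = ["WARFARIN", "ACETYLSALICYLIC ACID", "IBUPROFEN", "LISINOPRIL"]

# Pairs explicitly named in the Priority 2 plan.
expected_pairs = {
    frozenset(pair)
    for pair in [
        ("WARFARIN", "ACETYLSALICYLIC ACID"),
        ("IBUPROFEN", "WARFARIN"),
        ("IBUPROFEN", "ACETYLSALICYLIC ACID"),
        ("LISINOPRIL", "IBUPROFEN"),
    ]
}

# Every unordered pair that the four medications can form.
all_pairs = {frozenset(pair) for pair in combinations(planned_meds, 2)}

# Pairs the plan does not mention; check these against the DDI dataset separately.
unlisted_pairs = all_pairs - expected_pairs

print(f"Possible pairs: {comb(len(planned_meds), 2)}")
print(f"Listed in plan: {len(expected_pairs)}")
print(f"Not listed: {sorted(sorted(pair) for pair in unlisted_pairs)}")
```

The two unlisted pairs both involve LISINOPRIL (with WARFARIN and with ACETYLSALICYLIC ACID), so the matching cell in Part 5 may report more than four interactions for this patient if the reference dataset contains them.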

---
## Part 7: Data Quality Summary

In [None]:
# Final data quality and readiness summary

print("\n" + "="*80)
print("DATA QUALITY & ML READINESS SUMMARY")
print("="*80)

demo_status = ""
if 'df_demo' in locals():
    demo_status = f"""
DEMOGRAPHICS DATASET:
  Rows: {len(df_demo):,}
  Unique patients: {df_demo['PatientSID'].nunique()}
  Mean age: {df_demo['Age'].mean():.1f} years
  Elderly (65+): {df_demo['Age'].ge(65).sum()} ({df_demo['Age'].ge(65).sum() / len(df_demo) * 100:.1f}%)
  Gender distribution: {dict(df_demo['Gender'].value_counts())}
  Status: ✓ Ready for use
"""

print(f"""
DDI REFERENCE DATASET:
  Rows: {len(df_ddi):,}
  Unique drugs: {len(ddi_drugs):,}
  Columns: {len(df_ddi.columns)}
  Missing values: {df_ddi.isnull().sum().sum()}
  Status: ✓ Ready for use

MEDICATIONS DATASET:
  Rows: {len(df_meds):,}
  Unique patients: {df_meds['PatientSID'].nunique()}
  Unique medications: {len(med_drugs)}
  Medications found in DDI: {len(matching_drugs)}
  Potential DDI pairs: {len(all_patient_ddis)}
  Status: {'✓ Has DDI matches' if all_patient_ddis else '✗ No DDI matches - seeding needed'}
{demo_status}
DATA NORMALIZATION:
  Drug name cleaning: ✓ Implemented
  Dose removal: ✓ Implemented
  Salt/compound suffix removal: ✓ Implemented
  Case standardization: ✓ Implemented

ML READINESS:
  DDI reference data: ✓ Ready
  Patient medication data: {'✓ Ready' if all_patient_ddis else '⚠ Needs seeding for validation'}
  Patient demographics data: {'✓ Ready' if 'df_demo' in locals() else '✗ Not loaded'}
  Name normalization: ✓ Ready
  Next steps: {'Proceed to 03_clean.ipynb' if all_patient_ddis else 'Seed CDWWork with DDI pairs, then re-run 01b and 02'}
""")

print("="*80)