# Data Augmentation Pipeline v2.0

**For Research Paper**: Complete pipeline for augmenting the symptom-disease dataset.

## Pipeline Overview

| Stage | Description | Output |
|-------|-------------|--------|
| **0** | Expand symptom vocabulary with Mayo Clinic symptoms | `symptom_vocabulary.json` (updated in place) |
| **1** | Generate synthetic samples for rare diseases (<20 samples), raising each to 25 samples | `symptoms_augmented_no_demographics.csv` |
| **2** | Add demographic variables (age, sex) | `symptoms_augmented_with_demographics.csv` |

## Requirements
- `data/rare_diseases_symptoms_template.json` - Filled with manually added symptoms
- `data/final_disease_demographics.json` - Demographics from ChatGPT + synthetic rules

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path
import json
import re
import random
import sys
import gc
from collections import Counter

# Add project root to path
project_root = Path.cwd().parent
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

# Import normalization utilities (consolidated from scripts/symptom_mapper.py)
from utils.symptom_normalizer import normalize_symptom, find_similar_symptoms
from utils.consts import SYNONYM_MAP, NON_SYMPTOM_COLS

# Paths
# Input files
data_path = project_root / "data" / "processed" / "symptoms" / "symptoms_to_disease_cleaned.csv"
symptom_cols_path = project_root / "data" / "symptom_vocabulary.json"
template_path = project_root / "data" / "rare_diseases_symptoms_template.json"
category_map_path = project_root / "data" / "disease_mapping.json"
demographics_path = project_root / "data" / "final_disease_demographics.json"

# Output files - vocabulary saved to SAME file (overwrites original)
expanded_vocab_path = symptom_cols_path  # Overwrites original
output_no_demo_path = project_root / "data" / "processed" / "symptoms" / "symptoms_augmented_no_demographics.csv"
output_with_demo_path = project_root / "data" / "processed" / "symptoms" / "symptoms_augmented_with_demographics.csv"

print(f"Project root: {project_root}")
print(f"\nInput files:")
print(f"  Data: {data_path.exists()} - {data_path}")
print(f"  Vocab: {symptom_cols_path.exists()} - {symptom_cols_path}")
print(f"  Template: {template_path.exists()} - {template_path}")
print(f"  Demographics: {demographics_path.exists()} - {demographics_path}")


Project root: c:\Users\henry\Desktop\Programming\Python\Multimodal_Diagnosis

Input files:
  Data: True - c:\Users\henry\Desktop\Programming\Python\Multimodal_Diagnosis\data\processed\symptoms\symptoms_to_disease_cleaned.csv
  Vocab: True - c:\Users\henry\Desktop\Programming\Python\Multimodal_Diagnosis\data\symptom_vocabulary.json
  Template: True - c:\Users\henry\Desktop\Programming\Python\Multimodal_Diagnosis\data\rare_diseases_symptoms_template.json
  Demographics: True - c:\Users\henry\Desktop\Programming\Python\Multimodal_Diagnosis\data\final_disease_demographics.json
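
The cell above only prints whether each input exists; nothing stops the notebook if one is missing. A minimal sketch of a fail-fast check, assuming a hypothetical helper `require_files` that is not part of the project:

```python
from pathlib import Path


def require_files(paths: dict[str, Path]) -> None:
    """Raise FileNotFoundError naming every missing input (hypothetical helper)."""
    missing = [f"{name}: {path}" for name, path in paths.items()
               if not Path(path).exists()]
    if missing:
        raise FileNotFoundError("Missing input files:\n  " + "\n  ".join(missing))


# Intended use in this notebook (paths as defined in the first cell):
# require_files({"Data": data_path, "Vocab": symptom_cols_path,
#                "Template": template_path, "Categories": category_map_path,
#                "Demographics": demographics_path})
```

Listing all missing files in one error avoids fixing them one run at a time.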


---
# Stage 0: Expand Symptom Vocabulary

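The vocabulary file loaded below contains 456 symptoms. Many of the manually collected Mayo Clinic symptoms have no exact match in it.

**Strategy**: Identify which template symptoms are new at this stage. The dataset columns are expanded later, in Stage 1, using every symptom that maps to the vocabulary (no minimum disease-count filter is applied).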

In [2]:
# Load current vocabulary
with open(symptom_cols_path) as f:
    ORIGINAL_VOCAB = json.load(f)
ORIGINAL_SET = set(s.lower() for s in ORIGINAL_VOCAB)

print(f"Original vocabulary: {len(ORIGINAL_VOCAB)} symptoms")

# Load template with Mayo Clinic symptoms
with open(template_path) as f:
    template = json.load(f)

print(f"Diseases in template: {len(template)}")

Original vocabulary: 456 symptoms
Diseases in template: 135


In [3]:
# Extract all unique symptoms from template
all_mayo_symptoms = set()
symptom_counts = Counter()

print("Extracting and normalizing symptoms from template...")
for disease, info in template.items():
    mayo = info.get("mayo_clinic_symptoms", [])
    for sym in mayo:
        # Normalize symptom using project specific normalizer
        sym_norm = normalize_symptom(sym)
        
        if sym_norm and not sym_norm.startswith('"notes"'):
            all_mayo_symptoms.add(sym_norm)
            symptom_counts[sym_norm] += 1

print(f"Total unique mayo symptoms: {len(all_mayo_symptoms)}")

# Find symptoms NOT in current vocabulary
new_symptoms = [s for s in all_mayo_symptoms if s not in ORIGINAL_SET]
print(f"New symptoms (not in vocabulary): {len(new_symptoms)}")


Extracting and normalizing symptoms from template...
Total unique mayo symptoms: 690
New symptoms (not in vocabulary): 526
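
Note that `symptom_counts` above is incremented once per raw template entry, so two phrasings that normalize to the same string within a single disease are counted twice. A minimal sketch of counting each symptom at most once per disease, on a toy template and with a hypothetical stand-in normalizer `simple_normalize` (the project's `normalize_symptom` is not shown in this notebook):

```python
from collections import Counter


def simple_normalize(symptom: str) -> str:
    """Hypothetical stand-in for normalize_symptom: lowercase, collapse whitespace."""
    return " ".join(symptom.lower().split())


def count_diseases_per_symptom(template: dict) -> Counter:
    """For each normalized symptom, count the number of diseases that list it."""
    counts = Counter()
    for info in template.values():
        # A set ensures one increment per disease, however many phrasings it has
        unique = {simple_normalize(s) for s in info.get("mayo_clinic_symptoms", [])}
        unique.discard("")
        counts.update(unique)
    return counts


# Toy data for illustration only
toy_template = {
    "disease a": {"mayo_clinic_symptoms": ["Stiff neck", "stiff  neck", "Fever"]},
    "disease b": {"mayo_clinic_symptoms": ["fever"]},
}
disease_counts = count_diseases_per_symptom(toy_template)
```

Here `stiff neck` is counted for one disease and `fever` for two, even though the first disease lists two spellings of the same symptom.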


In [4]:
# Show most common new symptoms
new_symptom_counts = [(s, symptom_counts[s]) for s in new_symptoms]
new_symptom_counts.sort(key=lambda x: -x[1])

print("Most common new symptoms (top 50):")
print("-" * 60)
for sym, count in new_symptom_counts[:50]:
    print(f"  [{count:2d}x] {sym}")

Most common new symptoms (top 50):
------------------------------------------------------------
  [ 2x] low-set ears
  [ 1x] cervical insufficiency
  [ 1x] general feeling of discomfort
  [ 1x] congenital heart disease
  [ 1x] foul-smelling sputum
  [ 1x] irritable
  [ 1x] sleep problems in infants
  [ 1x] weight gain in the face
  [ 1x] extra fuild behind a fetus' next
  [ 1x] shivering
  [ 1x] disseminated sporotrichosis
  [ 1x] sores on thighs
  [ 1x] trouble processing information
  [ 1x] skin changes in color on vulva
  [ 1x] pain when talking
  [ 1x] soreness
  [ 1x] weakened tooth enamel
  [ 1x] reduced mental sharpness
  [ 1x] stiff neck
  [ 1x] rough scaly skin
  [ 1x] sudden abdominal pain
  [ 1x] blisters with yellowish fluid
  [ 1x] spasms
  [ 1x] cold clammy skin
  [ 1x] swelling of the legs
  [ 1x] eye watering
  [ 1x] rectal prolapse
  [ 1x] itchiness
  [ 1x] pain on vulva
  [ 1x] gurgling noise at back of the throat
  [ 1x] changes in voice
  [ 1x] night blindness
  [ 1

In [5]:
# NOTE: This cell shows statistics about new symptoms found in Mayo Clinic data
# The actual column expansion happens AFTER symptom mapping (see later cell)
# We no longer filter by MIN_DISEASE_COUNT - ALL mapped symptoms are used

# Show statistics about new symptoms (informational only)
new_symptom_counts = [(s, symptom_counts[s]) for s in new_symptoms]
new_symptom_counts.sort(key=lambda x: -x[1])

print(f"New symptoms found in Mayo Clinic data: {len(new_symptoms)}")
print("All mapped symptoms will be added as columns later")
print("\nTop 20 new symptoms by frequency:")
print("-" * 60)
for sym, count in new_symptom_counts[:20]:
    print(f"  [{count:2d}x] {sym}")


New symptoms found in Mayo Clinic data: 526
All mapped symptoms will be added as columns later

Top 20 new symptoms by frequency:
------------------------------------------------------------
  [ 2x] low-set ears
  [ 1x] cervical insufficiency
  [ 1x] general feeling of discomfort
  [ 1x] congenital heart disease
  [ 1x] foul-smelling sputum
  [ 1x] irritable
  [ 1x] sleep problems in infants
  [ 1x] weight gain in the face
  [ 1x] extra fuild behind a fetus' next
  [ 1x] shivering
  [ 1x] disseminated sporotrichosis
  [ 1x] sores on thighs
  [ 1x] trouble processing information
  [ 1x] skin changes in color on vulva
  [ 1x] pain when talking
  [ 1x] soreness
  [ 1x] weakened tooth enamel
  [ 1x] reduced mental sharpness
  [ 1x] stiff neck
  [ 1x] rough scaly skin


In [6]:
# Create expanded vocabulary
# NOTE: We start with ORIGINAL_VOCAB, but additional columns will be added
# dynamically during symptom mapping (see later cell for column expansion)
EXPANDED_VOCAB = list(ORIGINAL_VOCAB)  # Start with original
EXPANDED_SET = set(s.lower() for s in EXPANDED_VOCAB)

print(f"Starting vocabulary: {len(EXPANDED_VOCAB)} symptoms")
print("Additional Mayo Clinic symptoms will be added after mapping")


Starting vocabulary: 456 symptoms
Additional Mayo Clinic symptoms will be added after mapping


---
# Stage 1: Symptom Mapping & Synthetic Data Generation

1. Map Mayo Clinic symptoms to expanded vocabulary
2. Generate synthetic samples for template diseases below the 25-sample target (mainly diseases with <20 samples)

In [7]:
# Load original dataset
df = pd.read_csv(data_path)
print(f"Original dataset: {len(df):,} rows, {df['diseases'].nunique()} diseases")

# Get disease counts
counts = df['diseases'].value_counts()
rare_diseases = counts[counts < 20]
print(f"\nDiseases with <20 samples: {len(rare_diseases)}")
print(f"Total samples in rare diseases: {rare_diseases.sum():,}")

Original dataset: 206,267 rows, 627 diseases

Diseases with <20 samples: 128
Total samples in rare diseases: 1,085


In [8]:
# Load category mapping
with open(category_map_path) as f:
    category_map = json.load(f)

# Create disease -> category lookup
disease_to_category = {}
for cat, diseases in category_map.items():
    for d in diseases:
        disease_to_category[d] = cat

print(f"Loaded {len(disease_to_category)} disease -> category mappings")

Loaded 599 disease -> category mappings
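
The lookup above silently keeps the last category seen when a disease is listed under more than one category. A minimal sketch that builds the same lookup while recording such conflicts, on toy data (the helper name `invert_category_map` and the category names are illustrative):

```python
def invert_category_map(category_map: dict[str, list[str]]):
    """Return (disease -> category lookup, diseases listed under several categories)."""
    lookup: dict[str, str] = {}
    conflicts: dict[str, list[str]] = {}
    for category, diseases in category_map.items():
        for disease in diseases:
            if disease in lookup and lookup[disease] != category:
                # Remember every category this disease was seen under
                conflicts.setdefault(disease, [lookup[disease]]).append(category)
            lookup[disease] = category  # Same last-one-wins rule as the cell above
    return lookup, conflicts


# Toy data for illustration only
toy_map = {
    "Infectious": ["typhoid fever", "hpv"],
    "Oncology": ["hpv", "thyroid cancer"],
}
lookup, conflicts = invert_category_map(toy_map)
```

An empty `conflicts` dict confirms that each disease belongs to exactly one category.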


In [9]:
# ============================================================================
# SMART SYMPTOM MAPPING (Integrated from scripts/symptom_mapper.py)
# Uses normalization + fuzzy matching for better mapping coverage
# ============================================================================

def smart_map_symptom(external_symptom: str, vocab_set: set) -> str | None:
    """
    Map an external symptom to vocabulary using normalization + fuzzy matching.
    Returns the matched symptom or None if no match found.
    """
    # Step 1: Normalize with synonyms (e.g., 'belly pain' -> 'abdominal pain')
    normalized = normalize_symptom(external_symptom, apply_synonyms=True)
    if normalized in vocab_set:
        return normalized
    
    # Step 2: Try without synonym mapping (in case original is better)
    normalized_raw = normalize_symptom(external_symptom, apply_synonyms=False)
    if normalized_raw in vocab_set:
        return normalized_raw
    
    # Step 3: Fuzzy match (only for high-confidence matches >= 85%)
    matches = find_similar_symptoms(external_symptom, list(vocab_set), threshold=0.85)
    if matches and matches[0][1] >= 0.85:
        return matches[0][0]
    
    return None  # No match found


def map_symptoms_to_vocab(symptoms_list, vocab_set):
    """
    Map a list of symptoms to the vocabulary using smart matching.
    Returns list of symptoms that matched.
    """
    mapped = []
    for sym in symptoms_list:
        result = smart_map_symptom(sym, vocab_set)
        if result:
            mapped.append(result)
    return sorted(set(mapped))  # Remove duplicates; sorted so seeded sampling is reproducible across sessions


def generate_synthetic_samples(disease: str, symptoms: list, n_samples: int,
                               all_symptom_cols: list, min_sym: int = 4, max_sym: int = 8) -> list:
    """
    Generate synthetic samples for a disease.
    Each sample has random 4-8 symptoms selected from the symptom list.
    NOTE: all_symptom_cols should be the DataFrame columns (includes new Mayo cols).
    """
    samples = []
    category = disease_to_category.get(disease, "Unknown Type")
    all_symptom_set = set(all_symptom_cols)
    
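    # Clamp the range so a disease with fewer than min_sym symptoms cannot make randint raise ValueError
    upper = min(max_sym, len(symptoms))
    lower = min(min_sym, upper)
    
    for _ in range(n_samples):
        # Select random symptoms
        n_sym = random.randint(lower, upper)
        selected = random.sample(symptoms, n_sym)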
        
        # Create row with all symptoms as 0
        row = {col: 0 for col in all_symptom_cols}
        
        # Set selected symptoms to 1 (only if column exists)
        for sym in selected:
            if sym in all_symptom_set:
                row[sym] = 1
        
        row['diseases'] = disease
        row['disease_category'] = category
        row['symptoms'] = ", ".join(selected)
        
        samples.append(row)
    
    return samples

print("Defined smart mapping functions (integrated from symptom_mapper.py)")


Defined smart mapping functions (integrated from symptom_mapper.py)
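
`normalize_symptom` and `find_similar_symptoms` live in the project's `utils.symptom_normalizer` module and are not shown in this notebook. As a rough illustration of the three-step cascade (synonym match, raw match, fuzzy match at a 0.85 threshold), the sketch below substitutes `difflib.SequenceMatcher` for the project's similarity function. The synonym table and the `toy_` helpers are hypothetical, so the scores will not match the project's:

```python
from difflib import SequenceMatcher

TOY_SYNONYMS = {"belly pain": "abdominal pain"}  # Hypothetical synonym table


def toy_normalize(symptom: str, apply_synonyms: bool = True) -> str:
    """Lowercase, collapse whitespace, optionally apply the toy synonym table."""
    text = " ".join(symptom.lower().split())
    return TOY_SYNONYMS.get(text, text) if apply_synonyms else text


def toy_map_symptom(symptom: str, vocab: set, threshold: float = 0.85):
    """Return the matching vocabulary entry, or None if nothing is close enough."""
    # Steps 1 and 2: exact match with, then without, synonym substitution
    for apply_synonyms in (True, False):
        candidate = toy_normalize(symptom, apply_synonyms)
        if candidate in vocab:
            return candidate
    # Step 3: best fuzzy match, accepted only at or above the threshold
    raw = toy_normalize(symptom, apply_synonyms=False)
    scored = [(SequenceMatcher(None, raw, v).ratio(), v) for v in sorted(vocab)]
    if not scored:
        return None
    best_score, best = max(scored)
    return best if best_score >= threshold else None


toy_vocab = {"abdominal pain", "stiff neck", "fever"}
```

With this toy vocabulary, `"Belly pain"` resolves through the synonym table, `"stiff necks"` resolves through fuzzy matching, and `"night blindness"` returns `None`.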


In [10]:
# Map symptoms for each disease in template
disease_mapped_symptoms = {}
mapping_stats = {'total': 0, 'mapped': 0, 'diseases_ready': 0}

for disease, info in template.items():
    mayo = info.get("mayo_clinic_symptoms", [])
    if not mayo:
        continue
    
    mapped = map_symptoms_to_vocab(mayo, EXPANDED_SET)
    mapping_stats['total'] += len(mayo)
    mapping_stats['mapped'] += len(mapped)
    
    if len(mapped) >= 4:  # Minimum for synthetic generation
        disease_mapped_symptoms[disease] = mapped
        mapping_stats['diseases_ready'] += 1

print(f"Symptom mapping results:")
print(f"  Total mayo symptoms: {mapping_stats['total']}")
print(f"  Mapped to vocabulary: {mapping_stats['mapped']} ({100*mapping_stats['mapped']/mapping_stats['total']:.1f}%)")
print(f"  Diseases ready for synthesis (>=4 symptoms): {mapping_stats['diseases_ready']}")

Symptom mapping results:
  Total mayo symptoms: 1255
  Mapped to vocabulary: 716 (57.1%)
  Diseases ready for synthesis (>=4 symptoms): 81


In [11]:
# ============================================================================
# FIX: Add ALL mapped Mayo symptoms as columns
# This ensures synthetic samples have proper columns for their symptoms
# ============================================================================

# Collect ALL symptoms that were successfully mapped from Mayo Clinic data
all_mapped_symptoms = set()
base_disease_set = set(df['diseases'].unique())

for disease, symptoms in disease_mapped_symptoms.items():
    # Only add diseases that exist in base data
    if disease not in base_disease_set:
        continue
    all_mapped_symptoms.update(symptoms)

print(f"Total unique mapped symptoms from Mayo Clinic: {len(all_mapped_symptoms)}")

# Find symptoms that need to be added as new columns
existing_cols = set(c.lower() for c in df.columns)
new_cols_to_add = [sym for sym in all_mapped_symptoms if sym.lower() not in existing_cols]

if new_cols_to_add:
    print(f"Adding {len(new_cols_to_add)} new symptom columns from Mayo Clinic data...")
    # Create a separate DataFrame for new columns (int8 for memory efficiency)
    new_data = pd.DataFrame(0, index=df.index, columns=new_cols_to_add, dtype='int8')
    
    # Concatenate once (avoids fragmentation)
    df = pd.concat([df, new_data], axis=1)
    if len(new_cols_to_add) > 10:
        print(f"  Sample new columns: {new_cols_to_add[:10]}...")
    else:
        print(f"  Added: {new_cols_to_add}")
else:
    print("No new columns to add (all mapped symptoms already exist).")

print(f"Expanded dataset now has {len(df.columns)} columns")


Total unique mapped symptoms from Mayo Clinic: 157
Adding 81 new symptom columns from Mayo Clinic data...
  Sample new columns: ['gas', 'brittle nails', 'swelling', 'sleep problems', 'heart palpitations', 'balance problems', 'rapid heartbeat', 'enlarged liver', 'vaginal bleeding', 'poor growth']...
Expanded dataset now has 458 columns


In [12]:
# Generate synthetic samples
# IMPORTANT: Only augment diseases that EXIST in the base cleaned dataset
# This prevents adding back diseases that were intentionally excluded
random.seed(42)
TARGET_SAMPLES = 25  # Minimum samples per disease

# Get set of diseases in base data (for filtering)
base_disease_set = set(df['diseases'].unique())
print(f"Base dataset has {len(base_disease_set)} unique diseases")

# Get symptom columns from expanded df (excluding non-symptom cols)
non_symptom_lower = {s.lower() for s in NON_SYMPTOM_COLS}  # Build the lookup set once
symptom_cols = [c for c in df.columns if c.lower() not in non_symptom_lower]
print(f"Using {len(symptom_cols)} symptom columns for synthetic generation")

all_synthetic = []
generation_log = []
skipped_diseases = []

for disease, symptoms in disease_mapped_symptoms.items():
    # FILTER: Only augment diseases that exist in base data
    if disease not in base_disease_set:
        skipped_diseases.append(disease)
        continue
    
    current_count = counts.get(disease, 0)
    
    if current_count >= TARGET_SAMPLES:
        continue
    
    n_new = TARGET_SAMPLES - current_count
    samples = generate_synthetic_samples(disease, symptoms, n_new, symptom_cols)
    all_synthetic.extend(samples)
    
    generation_log.append({
        'disease': disease,
        'original': current_count,
        'added': n_new,
        'symptoms_available': len(symptoms)
    })

print(f"\nGenerated {len(all_synthetic):,} synthetic samples for {len(generation_log)} diseases")
if skipped_diseases:
    print(f"Skipped {len(skipped_diseases)} diseases not in base data: {skipped_diseases}")
print("\nGeneration details:")
for log in generation_log[:20]:
    print(f"  {log['disease']}: {log['original']} -> {log['original'] + log['added']} (+{log['added']}, {log['symptoms_available']} symptoms available)")
if len(generation_log) > 20:
    print(f"  ... and {len(generation_log) - 20} more diseases")


Base dataset has 627 unique diseases
Using 456 symptom columns for synthetic generation

Generated 1,251 synthetic samples for 78 diseases
Skipped 3 diseases not in base data: ['poisoning due to antipsychotics', 'hypothermia', 'poisoning due to antihypertensives']

Generation details:
  rocky mountain spotted fever: 1 -> 25 (+24, 13 symptoms available)
  myocarditis: 1 -> 25 (+24, 9 symptoms available)
  kaposi sarcoma: 1 -> 25 (+24, 8 symptoms available)
  chronic ulcer: 1 -> 25 (+24, 7 symptoms available)
  diabetes: 1 -> 25 (+24, 6 symptoms available)
  gas gangrene: 1 -> 25 (+24, 6 symptoms available)
  thalassemia: 1 -> 25 (+24, 6 symptoms available)
  typhoid fever: 1 -> 25 (+24, 8 symptoms available)
  diabetic kidney disease: 2 -> 25 (+23, 9 symptoms available)
  rheumatic fever: 2 -> 25 (+23, 5 symptoms available)
  human immunodeficiency virus infection (hiv): 2 -> 25 (+23, 10 symptoms available)
  hashimoto thyroiditis: 2 -> 25 (+23, 10 symptoms available)
  carcinoid syndro
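
To illustrate the sampling rule used above (each synthetic row switches on a random subset of 4 to 8 of the disease's mapped symptoms), here is a minimal standard-library sketch on toy data. It uses a local `random.Random` instance and a sorted symptom pool so that the same seed always yields the same rows. The function name and the toy symptoms are illustrative, not the notebook's own:

```python
import random


def toy_synthetic_rows(symptoms, n_rows, columns, rng, min_sym=4, max_sym=8):
    """Build n_rows binary rows, each with a random subset of the symptoms set to 1."""
    pool = sorted(set(symptoms))          # Fixed order, independent of set hashing
    upper = min(max_sym, len(pool))       # Cannot pick more symptoms than exist
    lower = min(min_sym, upper)           # Keeps randint's range valid for short lists
    rows = []
    for _ in range(n_rows):
        chosen = set(rng.sample(pool, rng.randint(lower, upper)))
        rows.append({col: int(col in chosen) for col in columns})
    return rows


# Toy data for illustration only
columns = ["fever", "rash", "headache", "nausea", "chills", "cough"]
symptoms = ["fever", "rash", "headache", "nausea", "chills"]
rows_a = toy_synthetic_rows(symptoms, 24, columns, random.Random(42))
rows_b = toy_synthetic_rows(symptoms, 24, columns, random.Random(42))
```

Because both calls start from the same seed and the same ordered pool, `rows_a` and `rows_b` are identical, and a column that is not among the disease's symptoms (`cough` here) stays 0 in every row.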

In [13]:
# Combine original + synthetic
if all_synthetic:
    df_synthetic = pd.DataFrame(all_synthetic)
    
    # Ensure synthetic dataframe has all columns
    # Optimization: Use reindex which is faster/cleaner
    df_synthetic = df_synthetic.reindex(columns=df.columns, fill_value=0).astype(df.dtypes)
    
    # Concatenate
    df_augmented = pd.concat([df, df_synthetic], ignore_index=True)
    
    print(f"Original samples: {len(df):,}")
    print(f"Synthetic samples: {len(df_synthetic):,}")
    print(f"Total augmented: {len(df_augmented):,}")
    
    # cleanup
    del df_synthetic
else:
    df_augmented = df
    print("No synthetic samples generated")

# Free up memory
del df
gc.collect()
print("Memory cleanup: deleted original df")


Original samples: 206,267
Synthetic samples: 1,251
Total augmented: 207,518
Memory cleanup: deleted original df


In [14]:
# Verify rare disease counts improved
new_counts = df_augmented['diseases'].value_counts()
new_rare = new_counts[new_counts < 20]

print(f"Before augmentation: {len(rare_diseases)} diseases with <20 samples")
print(f"After augmentation: {len(new_rare)} diseases with <20 samples")
print(f"\nDiseases still below 20 samples:")
for d, c in new_rare.items():
    print(f"  {c:2d}  {d}")

Before augmentation: 128 diseases with <20 samples
After augmentation: 50 diseases with <20 samples

Diseases still below 20 samples:
  19  otosclerosis
  18  cyst of the eyelid
  14  pneumoconiosis
  14  fibrocystic breast disease
  13  factitious disorder
  13  hpv
  13  congenital malformation syndrome
  12  raynaud disease
  11  galactorrhea of unknown cause
  11  zenker diverticulum
  11  myoclonus
  10  avascular necrosis
  10  granuloma inguinale
  10  optic neuritis
  10  testicular cancer
  10  decubitus ulcer
  10  vesicoureteral reflux
   9  vacterl syndrome
   9  aphakia
   9  lichen planus
   9  thyroid cancer
   9  hemarthrosis
   9  placenta previa
   8  hypercholesterolemia
   8  spinocerebellar ataxia
   7  pemphigus
   7  omphalitis
   6  blepharospasm
   6  breast cancer
   6  tuberous sclerosis
   6  g6pd enzyme deficiency
   6  edward syndrome
   6  vulvar cancer
   6  priapism
   5  uterine cancer
   5  vitamin a deficiency
   5  pelvic fistula
   5  spherocytosis

In [15]:
# Save dataset WITHOUT demographics
df_augmented.to_csv(output_no_demo_path, index=False)

print(f"Saved dataset WITHOUT demographics:")
print(f"  Path: {output_no_demo_path}")
print(f"  Size: {output_no_demo_path.stat().st_size / 1024 / 1024:.1f} MB")
print(f"  Rows: {len(df_augmented):,}")
print(f"  Columns: {len(df_augmented.columns)}")

Saved dataset WITHOUT demographics:
  Path: c:\Users\henry\Desktop\Programming\Python\Multimodal_Diagnosis\data\processed\symptoms\symptoms_augmented_no_demographics.csv
  Size: 189.1 MB
  Rows: 207,518
  Columns: 458


---
# Stage 2: Add Demographics (Age, Sex)

Using merged demographics from ChatGPT + synthetic rules.

In [16]:
# Load demographics
with open(demographics_path) as f:
    demographics = json.load(f)

print(f"Loaded demographics for {len(demographics)} diseases")

# Check coverage
all_diseases = set(df_augmented['diseases'].unique())
demo_diseases = set(demographics.keys())
covered = all_diseases & demo_diseases
missing = all_diseases - demo_diseases

print(f"\nDemographic coverage:")
print(f"  Total diseases in dataset: {len(all_diseases)}")
print(f"  Covered by demographics: {len(covered)} ({100*len(covered)/len(all_diseases):.1f}%)")
print(f"  Missing (will use defaults): {len(missing)}")

Loaded demographics for 667 diseases

Demographic coverage:
  Total diseases in dataset: 627
  Covered by demographics: 620 (98.9%)
  Missing (will use defaults): 7


In [17]:
# Default demographics for missing diseases
DEFAULT_DEMO = {
    "age_min": 10,
    "age_max": 80,
    "age_peak": 45,
    "male_pct": 50
}

def sample_age(demo: dict) -> int:
    """Sample age from triangular distribution."""
    age_min = demo.get('age_min', 10)
    age_max = demo.get('age_max', 80)
    age_peak = demo.get('age_peak', 45)
    
    # Handle edge cases
    if age_min == age_max:
        return int(age_min)
    
    age_peak = max(age_min, min(age_peak, age_max))
    
    if age_min == age_peak or age_peak == age_max:
        age = np.random.uniform(age_min, age_max)
    else:
        age = np.random.triangular(age_min, age_peak, age_max)
    
    return int(np.clip(age, 0, 100))


def sample_sex(demo: dict) -> str:
    """Sample sex from Bernoulli distribution."""
    male_pct = demo.get('male_pct', 50)
    return 'M' if np.random.random() * 100 < male_pct else 'F'

print("Defined demographic sampling functions")

Defined demographic sampling functions
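
One property of the triangular distribution is worth keeping in mind when reading the verification output: its mean is (age_min + age_peak + age_max) / 3, not `age_peak`, so for an asymmetric range the average sampled age drifts away from the peak. A small standard-library check of this, using `random.triangular(low, high, mode)` instead of the NumPy call above and illustrative parameter values that are not taken from the demographics file:

```python
import random


def triangular_mean(age_min: float, age_peak: float, age_max: float) -> float:
    """Theoretical mean of a triangular distribution."""
    return (age_min + age_peak + age_max) / 3


rng = random.Random(0)
age_min, age_peak, age_max = 30, 55, 90  # Illustrative values only

# Note the stdlib argument order: low, high, mode
samples = [rng.triangular(age_min, age_max, age_peak) for _ in range(100_000)]
empirical_mean = sum(samples) / len(samples)
expected_mean = triangular_mean(age_min, age_peak, age_max)
```

For these values the theoretical mean is 175 / 3, a little above 58, which is several years higher than the peak of 55. The notebook's `sample_age` also truncates to an integer, which lowers the mean slightly further.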


In [18]:
# Generate demographics for all rows
np.random.seed(42)
ages = []
sexes = []

# Iterate over the disease labels only; iterrows() builds a full Series for every row
for idx, disease in enumerate(df_augmented['diseases']):
    demo = demographics.get(disease, DEFAULT_DEMO)
    
    ages.append(sample_age(demo))
    sexes.append(sample_sex(demo))
    
    if idx % 50000 == 0:
        print(f"Processed {idx:,} rows...")

df_augmented = df_augmented.assign(
    age=ages,
    sex=sexes
)

print(f"\nGenerated demographics for {len(df_augmented):,} rows")

Processed 0 rows...
Processed 50,000 rows...
Processed 100,000 rows...
Processed 150,000 rows...
Processed 200,000 rows...

Generated demographics for 207,518 rows


In [19]:
# Summary statistics
print("Demographics Summary:")
print("=" * 50)
print(f"Age: min={df_augmented['age'].min()}, max={df_augmented['age'].max()}, mean={df_augmented['age'].mean():.1f}")
print(f"Sex: {(df_augmented['sex'] == 'M').mean() * 100:.1f}% male, {(df_augmented['sex'] == 'F').mean() * 100:.1f}% female")
print(f"\n{df_augmented['sex'].value_counts()}")

Demographics Summary:
Age: min=0, max=99, mean=43.4
Sex: 47.2% male, 52.8% female

sex
F    109472
M     98046
Name: count, dtype: int64


In [20]:
# Verify key diseases
print("Verification - Sample diseases:")
print("=" * 80)

verify_diseases = ["prostate cancer", "preeclampsia", "migraine", "pyloric stenosis", "diabetes"]

for disease in verify_diseases:
    subset = df_augmented[df_augmented['diseases'] == disease]
    if len(subset) > 0:
        male_pct = (subset['sex'] == 'M').mean() * 100
        mean_age = subset['age'].mean()
        expected = demographics.get(disease, DEFAULT_DEMO)
        
        print(f"{disease:25} Age: {mean_age:5.1f} (exp: {expected.get('age_peak', '?'):>3}), "
              f"Male: {male_pct:5.1f}% (exp: {expected.get('male_pct', '?'):>3}%), "
              f"n={len(subset)}")

Verification - Sample diseases:
prostate cancer           Age:  70.2 (exp:  70), Male: 100.0% (exp: 100%), n=135
preeclampsia              Age:  30.3 (exp:  30), Male:   0.0% (exp:   0%), n=217
migraine                  Age:  32.0 (exp:  30), Male:  30.3% (exp:  30%), n=221
pyloric stenosis          Age:   0.0 (exp:   0), Male:  76.0% (exp:  80%), n=25
diabetes                  Age:  58.6 (exp:  55), Male:  80.0% (exp:  52%), n=25
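
With only 25 rows, an observed sex ratio can sit far from its target. A rough way to judge the gap is the binomial standard error sqrt(p(1-p)/n). The sketch below applies it to two rows of the output above; the helper name is illustrative, and the normal approximation is crude at n=25:

```python
import math


def sex_ratio_z(observed_pct: float, expected_pct: float, n: int) -> float:
    """Distance of the observed male share from its target, in standard errors."""
    p = expected_pct / 100
    if p in (0.0, 1.0):
        # No sampling variance: any deviation at all is infinitely surprising
        if observed_pct == expected_pct:
            return 0.0
        return math.copysign(math.inf, observed_pct - expected_pct)
    standard_error = math.sqrt(p * (1 - p) / n)
    return (observed_pct / 100 - p) / standard_error


# Values copied from the verification output above
z_diabetes = sex_ratio_z(80.0, 52, 25)
z_pyloric = sex_ratio_z(76.0, 80, 25)
```

On these figures the diabetes subset lies about 2.8 standard errors above its 52% target, while pyloric stenosis lies 0.5 standard errors below its 80% target, so the diabetes rows may deserve a second look.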


In [21]:
# Reorder columns: diseases, category, age, sex, then symptoms
cols = df_augmented.columns.tolist()

# Move key columns to front
key_cols = ['diseases', 'disease_category', 'age', 'sex']
symptom_cols = [c for c in cols if c not in key_cols + ['symptoms']]
final_order = key_cols + symptom_cols + ['symptoms']

# Only include columns that exist
final_order = [c for c in final_order if c in df_augmented.columns]

df_final = df_augmented[final_order]
print(f"Reordered columns: {len(df_final.columns)} total")
print(f"First 10: {df_final.columns[:10].tolist()}")

Reordered columns: 460 total
First 10: ['diseases', 'disease_category', 'age', 'sex', 'anxiety and nervousness', 'depression', 'shortness of breath', 'depressive or psychotic symptoms', 'sharp chest pain', 'dizziness']


In [22]:
# Save dataset WITH demographics
df_final.to_csv(output_with_demo_path, index=False)

print(f"Saved dataset WITH demographics:")
print(f"  Path: {output_with_demo_path}")
print(f"  Size: {output_with_demo_path.stat().st_size / 1024 / 1024:.1f} MB")
print(f"  Rows: {len(df_final):,}")
print(f"  Columns: {len(df_final.columns)}")

Saved dataset WITH demographics:
  Path: c:\Users\henry\Desktop\Programming\Python\Multimodal_Diagnosis\data\processed\symptoms\symptoms_augmented_with_demographics.csv
  Size: 190.1 MB
  Rows: 207,518
  Columns: 460


---
# Summary

## Output Files

In [23]:
print("=" * 80)
print("DATA AUGMENTATION COMPLETE")
print("=" * 80)

# Count actual symptom columns in final dataset
final_symptom_count = len([c for c in df_final.columns 
                           if c.lower() not in {s.lower() for s in NON_SYMPTOM_COLS}
                           and c != "symptoms"])

print("\n1. VOCABULARY:")
print(f"   {symptom_cols_path}")
print(f"   Original: {len(ORIGINAL_VOCAB)} symptoms")
print(f"   Final dataset: {final_symptom_count} symptom columns")

print("\n2. DATASET WITHOUT DEMOGRAPHICS:")
print(f"   {output_no_demo_path}")
if output_no_demo_path.exists():
    df_check = pd.read_csv(output_no_demo_path, nrows=1)
    print(f"   Rows: {len(df_augmented):,}, Columns: {len(df_check.columns)}")
    print(f"   Size: {output_no_demo_path.stat().st_size / 1024 / 1024:.1f} MB")

print("\n3. DATASET WITH DEMOGRAPHICS:")
print(f"   {output_with_demo_path}")
if output_with_demo_path.exists():
    df_check = pd.read_csv(output_with_demo_path, nrows=1)
    print(f"   Rows: {len(df_final):,}, Columns: {len(df_check.columns)}")
    print(f"   Size: {output_with_demo_path.stat().st_size / 1024 / 1024:.1f} MB")
    print(f"   Includes: age, sex columns")


DATA AUGMENTATION COMPLETE

1. VOCABULARY:
   c:\Users\henry\Desktop\Programming\Python\Multimodal_Diagnosis\data\symptom_vocabulary.json
   Original: 456 symptoms
   Final dataset: 456 symptom columns

2. DATASET WITHOUT DEMOGRAPHICS:
   c:\Users\henry\Desktop\Programming\Python\Multimodal_Diagnosis\data\processed\symptoms\symptoms_augmented_no_demographics.csv
   Rows: 207,518, Columns: 458
   Size: 189.1 MB

3. DATASET WITH DEMOGRAPHICS:
   c:\Users\henry\Desktop\Programming\Python\Multimodal_Diagnosis\data\processed\symptoms\symptoms_augmented_with_demographics.csv
   Rows: 207,518, Columns: 460
   Size: 190.1 MB
   Includes: age, sex columns


In [24]:
# ============================================================================
# FINAL STEP: Update vocabulary file to match actual dataset columns
# This ensures vocabulary.json stays in sync with the augmented data
# ============================================================================

# Get final symptom columns from augmented dataset
final_symptom_cols = [c for c in df_final.columns 
                       if c.lower() not in {s.lower() for s in NON_SYMPTOM_COLS} 
                       and c != 'symptoms']

# Sort alphabetically for consistency
final_symptom_cols = sorted(final_symptom_cols)

# Save updated vocabulary
with open(symptom_cols_path, 'w') as f:
    json.dump(final_symptom_cols, f, indent=2)

print(f"Updated vocabulary file: {symptom_cols_path}")
print(f"  Original vocabulary: {len(ORIGINAL_VOCAB)} symptoms")
print(f"  Final vocabulary: {len(final_symptom_cols)} symptoms")
print(f"  Net change: {len(final_symptom_cols) - len(ORIGINAL_VOCAB):+d} symptoms")


Updated vocabulary file: c:\Users\henry\Desktop\Programming\Python\Multimodal_Diagnosis\data\symptom_vocabulary.json
  Original vocabulary: 456 symptoms
  Final vocabulary: 456 symptoms
  Net change: +0 symptoms


---
## Documentation for Research Paper

> **Data Augmentation Pipeline**
>
> **Stage 0 - Vocabulary Expansion:**
> 1. Collected symptom lists from Mayo Clinic and Cleveland Clinic for 135 rare diseases
> 2. Identified template symptoms missing from the original vocabulary (no minimum disease-count filter is applied)
> 3. Expanded vocabulary from 377 to N symptoms (updated in place)
>
> **Stage 1 - Synthetic Symptom Data:**
> 1. Mapped Mayo Clinic symptoms to expanded vocabulary (normalization, synonym lookup, fuzzy matching at similarity >= 0.85)
> 2. For template diseases below the 25-sample target with at least 4 mapped symptoms, generated synthetic samples
> 3. Each synthetic sample: random 4-8 symptom subset from disease's symptom profile
> 4. Raised 78 rare diseases to 25 samples each (1,251 synthetic rows); the 50 diseases that could not be augmented remain below 20 samples
>
> **Stage 2 - Demographic Variables (Age/Sex):**
> 1. Collected epidemiological demographics via GPT-4 queries
> 2. Applied category-level defaults with keyword-based overrides for sex-specific diseases
> 3. Age sampled from triangular distribution (min, peak, max)
> 4. Sex sampled from Bernoulli distribution based on disease-specific male percentage
>
> Two output datasets were created:
> - `symptoms_augmented_no_demographics.csv`: For symptom-only models
> - `symptoms_augmented_with_demographics.csv`: For multimodal models incorporating age/sex