# Preprocessing Notebook: `4_df_conditions.ipynb`

### Description  
Processes the **`conditions`** table from the AACT Clinical Trials database to create grouped, standardized features representing the medical conditions studied across interventional trials.  
This notebook consolidates thousands of condition names into interpretable, model-ready condition categories.

### Key Steps  
- Load condition data (`nct_id`, `name`) from the AACT database using SQLAlchemy.  
- Filter to include only **interventional studies** by joining with the cleaned `studies` dataset.  
- Export the **top 1000 most frequent conditions** for manual grouping, then map them to categories using the reviewed file `conditions_top1000_grouped.csv`.  
- Apply **keyword-based classification** to categorize unmapped conditions using domain-specific logic (e.g., cancer, cardiovascular, infectious diseases).  
- One-hot encode grouped condition categories (prefix `condt_`).  
- Aggregate by trial (`nct_id`) to produce a single record per study.
- Export the final dataset to `../data/processed/conditions_clean.csv`.

### Output  
- ✅ `conditions_clean.csv`

In [1]:
# Load packages and connect to AACT database
from sqlalchemy import create_engine
import pandas as pd
from dotenv import load_dotenv
import os

load_dotenv()
DATABASE_URL = os.getenv('DATABASE_URL')
engine = create_engine(DATABASE_URL)

In [2]:
# Load conditions and filter interventional studies
query = """
SELECT nct_id, name
FROM ctgov.conditions;
"""
df_conditions = pd.read_sql(query, engine)

df_conditions['name'] = df_conditions['name'].str.lower()

df_studies = pd.read_csv('../data/processed/studies_clean.csv')
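df_conditions = df_conditions.merge(df_studies[['nct_id', 'study_type']], on='nct_id', how='left')
# Keep interventional trials only; rows with no matching study have NaN study_type and are dropped too.
# Chaining .drop() returns a new frame, which avoids an in-place edit on a filtered slice.
df_conditions = df_conditions[df_conditions['study_type'] == 'INTERVENTIONAL'].drop(columns=['study_type'])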
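
Because the join above is a left merge followed by an equality filter, condition rows whose `nct_id` is absent from `studies_clean.csv` disappear silently. The sketch below shows one way to count them with the `indicator` option of `DataFrame.merge`. The two small frames and their IDs are made up for illustration and are not taken from AACT.

```python
import pandas as pd

# Toy stand-ins for the conditions table and studies_clean.csv (made-up IDs)
conditions = pd.DataFrame({
    "nct_id": ["NCT001", "NCT001", "NCT002", "NCT003"],
    "name": ["asthma", "copd", "breast cancer", "migraine"],
})
studies = pd.DataFrame({
    "nct_id": ["NCT001", "NCT002"],
    "study_type": ["INTERVENTIONAL", "OBSERVATIONAL"],
})

# indicator=True adds a `_merge` column recording where each row came from
merged = conditions.merge(studies, on="nct_id", how="left", indicator=True)

# Rows found only on the left side have no study record (study_type is NaN)
unmatched = int((merged["_merge"] == "left_only").sum())

# Same filter as the notebook: keep interventional trials only
interventional = merged.loc[merged["study_type"] == "INTERVENTIONAL", ["nct_id", "name"]]

print(f"condition rows without a study record: {unmatched}")
print(interventional)
```

If the unmatched count is large in the real data, it may be worth checking whether `studies_clean.csv` was filtered more aggressively than intended.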

In [3]:
# Export top 1000 most common condition names for manual review and grouping
top_conditions = df_conditions['name'].value_counts().head(1000)
top_conditions.to_csv('../data/processed/conditions_top1000.csv')

In [4]:
# Map top 1000 common conditions to grouped categories
conditions_1000 = pd.read_csv('../data/processed/conditions_top1000_grouped.csv')
mapping_dict = dict(zip(conditions_1000['name'], conditions_1000['grouped_conditions']))
df_conditions['grouped_conditions'] = df_conditions['name'].map(mapping_dict)
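
`Series.map` with a dictionary leaves every name that is missing from the dictionary as `NaN`, and those rows are what the keyword classifier later fills in. A quick coverage check like the one below can show how much of the data the manual mapping handles. The names and the two-entry mapping are hypothetical, not the contents of `conditions_top1000_grouped.csv`.

```python
import pandas as pd

# Hypothetical condition names and a tiny manual mapping
names = pd.Series(["asthma", "breast cancer", "asthma", "rare syndrome x"], name="name")
mapping = {"asthma": "respiratory_disorders", "breast cancer": "cancers"}

# Names missing from the mapping become NaN
grouped = names.map(mapping)

# Share of rows covered by the manual mapping
coverage = grouped.notna().mean()

# Distinct names that will fall through to the keyword classifier
unmapped_names = names[grouped.isna()].unique().tolist()

print(f"rows covered by manual mapping: {coverage:.0%}")
print("unmapped names:", unmapped_names)
```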

In [5]:
# Define keyword-based classifier for unmapped conditions
keyword_map = {
    "cancers": ["astrocytoma", "metastases", "leukemia", "lymphoma", "malignancy", "melanoma", "mds", "myelodysplastic",
                "myeloproliferative", "cancer", "carcinoma", "tumor", "neoplasm", "sarcoma", "glioblastoma", "glioma"],
    "cardiovascular_diseases": ["cardio", "arrhythmia", "atrial", "cardiac", "vein", "ischemia", "myocardial", "arterial", "vascular", "heart", "coronary", "aortic", "artery", "stroke", "hypertension", "angina"],
    "dental_disorders": ["dental", "tooth", "caries", "periodontitis", "oral hygiene", "gingivitis"],
    "dermatological_disorders": ["acne", "psoriasis", "eczema", "rash", "skin", "derm"],
    "endocrine/metabolic_disorders": ["diabetes", "thyroid", "insulin", "metabolic", "endocrine", "glucose", "obesity", "overweight", "parathyroidism"],
    "gastrointestinal_disorders": ["gastro", "liver", "colon", "bowel", "hepatitis", "ulcer", "esophageal", "pancreas", "celiac", "hernia", "colitis", "crohn"],
    "genetic_disorders": ["genetic", "mutation", "chromosome", "inherited", "congenital"],
    "hepatic_disorders": ["cirrhosis", "liver", "hepatic", "nafld"],
    "infectious_diseases": ["corona", "sars", "immunodeficiency", "influenza", "sepsis", "tuberculosis", "tb", "std", "hiv", "viral", "bacterial", "infection", "covid", "hepatitis"],
    "mental_health_disorders": ["delirium", "mental disorders", "ocd", "compulsive", "stress", "insomnia", "psychosis", "schizophrenia", "abuse", "dependence", "use disorders", "depression", "anxiety", "ptsd", 
                         "bipolar", "autism", "adhd"],
    "musculoskeletal_disorders": ["arthritis", "disc", "fracture", "dystrophy", "myalgia", "gout", "osteo", "osteoporosis", "joint", "bone", "muscle", "capsulitis", "back pain"],
    "neurological_disorders": ["neuro", "brain", "sclerosis", "cerebral", "stroke", "headache", "meningitis", "migraine", "nervous", "seizure", "parkinson", "epilepsy", "dementia", "als", "ms"],
    "opthamological_disorders": ["eye", "glaucoma", "macular", "myopia", "presbyopia", "retina"],
    "pain_disorders": ["pain"],
    "respiratory_disorders": ["sinusitis", "respiratory", "bronchitis", "pulmonary", "copd", "cystic fibrosis", "pneumonia", "asthma", "lung"],
    "renal/urological_disorders": ["renal", "kidney", "urinary", "bladder", "prostate"],
    "reproductive_health": ["pregnancy", "fertility", "menstrual", "contraception", "ovarian", "uterine", "pcod","ovary", "menopause"]
}


In [6]:
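# Priority order follows the insertion order of keyword_map: the first matching group wins
priority_order = list(keyword_map.keys())

# Classify using keyword_map
def classify_condition_priority(name):
    """Return the first group whose keywords occur in `name` as substrings, else "others"."""
    name = str(name).lower().strip()
    for group in priority_order:
        keywords = keyword_map[group]
        # Plain substring test: short keywords such as "ms" or "tb" also match inside longer words
        if any(keyword in name for keyword in keywords):
            return group
    return "others"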
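
Two properties of this classifier are easy to miss: a keyword listed under several groups always resolves to the group that comes first, and the `in` test matches substrings, so short keywords can fire inside unrelated words. The sketch below illustrates both on a trimmed-down map. `demo_map` and `classify_word_start` are hypothetical and not part of the notebook; the regex variant is only one possible way to tighten short keywords.

```python
import re

# Trimmed-down, illustrative keyword map (not the notebook's full dictionary)
demo_map = {
    "cardiovascular_diseases": ["cardio", "stroke"],
    "neurological_disorders": ["neuro", "stroke", "ms"],
}

def classify_substring(name):
    """Same logic as the notebook: first group with any keyword contained in the name."""
    name = str(name).lower().strip()
    for group, keywords in demo_map.items():
        if any(keyword in name for keyword in keywords):
            return group
    return "others"

def classify_word_start(name):
    """Hypothetical variant: keywords must start a word; those of 3 letters or fewer must be whole words."""
    name = str(name).lower().strip()
    for group, keywords in demo_map.items():
        for keyword in keywords:
            suffix = r"\b" if len(keyword) <= 3 else ""
            if re.search(r"\b" + re.escape(keyword) + suffix, name):
                return group
    return "others"

for example in ["ischemic stroke", "menopausal symptoms", "relapsing ms"]:
    print(f"{example}: {classify_substring(example)} | {classify_word_start(example)}")
```

With substring matching, "menopausal symptoms" lands in the neurological group because "symptoms" contains "ms". The word-start variant avoids that, but it would also stop matching stems that occur mid-word, so it is a trade-off and not a drop-in replacement.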

In [7]:
# Apply fallback classification to unmapped conditions (manual mappings take precedence)
df_conditions['grouped_conditions'] = df_conditions['grouped_conditions'].fillna(
    df_conditions['name'].map(classify_condition_priority))

In [8]:
# Standardize and consolidate condition groups
df_conditions['grouped_conditions'] = df_conditions['grouped_conditions'].replace({
    'infectious_disorders': 'infectious_diseases',
    'mental_health_disorders': 'mental_disorders',
    'neurodevelopmental_disorders': 'mental_disorders',
    'reproductive_disorders': 'reproductive_health',
    'hepatic_disorders': 'gastrointestinal_disorders'
})

In [9]:
# One-hot encode grouped condition categories
df_conditions = pd.get_dummies(df_conditions, columns=['grouped_conditions'], prefix='condt', dtype=int)

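# Drop non-feature columns first so the aggregation only sees the 0/1 flags
df_conditions = df_conditions.drop(columns=['name', 'study_type', 'condt_healthy_participants'], errors='ignore')

# Group by NCT ID: a flag is 1 if any of the trial's conditions fell into that group
df_conditions = df_conditions.groupby('nct_id').max().reset_index()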

# Save cleaned condition features
df_conditions.to_csv('../data/processed/conditions_clean.csv', index=False)
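
The encode-then-aggregate step can be checked on a few made-up rows. Taking `max()` over 0/1 flags within each `nct_id` acts as a logical OR, so a trial with conditions in two groups ends up with both flags set. The frame below is illustrative only.

```python
import pandas as pd

# Made-up long-format rows: one row per (trial, condition group)
long_df = pd.DataFrame({
    "nct_id": ["NCT001", "NCT001", "NCT002"],
    "grouped_conditions": ["cancers", "pain_disorders", "cancers"],
})

# One 0/1 column per group, prefixed with `condt_`
flags = pd.get_dummies(long_df, columns=["grouped_conditions"], prefix="condt", dtype=int)

# max() over 0/1 flags is a logical OR across each trial's rows
per_trial = flags.groupby("nct_id", as_index=False).max()

print(per_trial)
```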

---

## Summary  
This notebook standardized and grouped the medical condition names recorded for interventional trials in the AACT database.  

Key transformations included:
- Mapping and grouping the **top 1000 conditions** using manual review.  
- Applying **keyword-based classification** for unmapped conditions.  
- Consolidating overlapping categories for uniform labeling.  
- Encoding final condition groups into binary flag variables (`condt_*`).  

The resulting file, `conditions_clean.csv`, provides **interpretable, high-level clinical condition features** for downstream EDA and modeling.

---

📂 **Next Notebook:** `5_df_designs.ipynb` → Processes study design details (allocation, masking, model, and purpose) for each trial.
