# Preprocessing Notebook: `6_df_eligibilities.ipynb`

### Description  
Processes the **`eligibilities`** table from the AACT Clinical Trials database to standardize demographic and participation criteria such as gender, age limits, and healthy volunteer status.  
These features describe population-level eligibility for each interventional study.

### Key Steps  
- Extracted `nct_id`, `gender`, `minimum_age`, `maximum_age`, and `healthy_volunteers` from AACT.  
- Joined with `studies_clean.csv` to include **interventional trials only**.
- Standardized gender values and converted missing entries to `"all"`.
- Encoded `healthy_volunteers` as integers (`True → 1`, `False → 0`) and filled missing values with `-1`.
- Converted age strings (e.g., `"18 Years"`, `"6 Months"`) into numeric years for uniform comparison.  
- Derived **age group categories** → `child`, `adult`, `senior`, `mixed`, or `unknown`.  
- Encoded categorical variables into binary flags (`elig_gender_*`, `elig_age_*`).  
- Aggregated by `nct_id` to maintain one record per trial.  
- Saved cleaned file to: `../data/processed/eligibilities_clean.csv`  

### Output
- ✅ `eligibilities_clean.csv`

In [1]:
# Import required libraries and set up the DB connection
from sqlalchemy import create_engine
import pandas as pd
import numpy as np
from dotenv import load_dotenv
import os

load_dotenv()
DATABASE_URL = os.getenv('DATABASE_URL')
engine = create_engine(DATABASE_URL)

In [2]:
# Load eligibility data from AACT
query = '''
SELECT nct_id, gender, minimum_age, maximum_age, healthy_volunteers
FROM ctgov.eligibilities;
'''
df_eligibilities = pd.read_sql(query, engine)

df_studies = pd.read_csv('../data/processed/studies_clean.csv')
df_eligibilities = df_eligibilities.merge(df_studies[['nct_id', 'study_type']], on='nct_id', how='left')
df_eligibilities = df_eligibilities[df_eligibilities['study_type'] == 'INTERVENTIONAL']
# Reassign instead of dropping in place: the filtered frame is a slice of the merged one
df_eligibilities = df_eligibilities.drop(columns=['study_type'])
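
The left merge followed by a filter behaves like an inner join against the interventional subset of the studies table. The sketch below shows that equivalent form on toy data; the NCT IDs are made up for illustration, and `validate` is an optional guard (not used in the notebook) that fails loudly if the studies table ever contains duplicate `nct_id` values.

```python
import pandas as pd

# Toy stand-ins for the AACT extract and studies_clean.csv (IDs are hypothetical).
elig = pd.DataFrame({
    "nct_id": ["NCT001", "NCT002", "NCT003"],
    "gender": ["ALL", "FEMALE", "MALE"],
})
studies = pd.DataFrame({
    "nct_id": ["NCT001", "NCT002"],
    "study_type": ["INTERVENTIONAL", "OBSERVATIONAL"],
})

# Keep only the IDs of interventional trials.
interventional_ids = studies.loc[studies["study_type"] == "INTERVENTIONAL", ["nct_id"]]

# validate="many_to_one" raises MergeError if the right side repeats an nct_id.
filtered = elig.merge(interventional_ids, on="nct_id", how="inner", validate="many_to_one")
print(filtered)
```

Trials missing from the studies table (here `NCT003`) drop out in both forms, since their `study_type` would be NaN after a left merge.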

In [3]:
# Clean gender and healthy_volunteers columns
df_eligibilities['gender'] = df_eligibilities['gender'].str.lower().fillna('all')
# map() leaves missing values as NaN (float); cast back to int after filling with -1
df_eligibilities['healthy_volunteers'] = df_eligibilities['healthy_volunteers'].map({True: 1, False: 0}).fillna(-1).astype(int)
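
A small illustration of the healthy-volunteers encoding on toy values. `Series.map` returns NaN for entries that are not keys of the mapping, which makes the intermediate column float; the final cast shown here is one way to end up with true integers.

```python
import pandas as pd

# Toy column: accepts healthy volunteers, does not, and missing.
raw = pd.Series([True, False, None], dtype="object")

# None has no entry in the mapping, so it becomes NaN and is then filled with -1.
encoded = raw.map({True: 1, False: 0}).fillna(-1).astype(int)
print(encoded.tolist())  # [1, 0, -1]
```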

In [4]:
# Convert age strings to numeric (in years)
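def convert_age_to_years(age_str):
    # Missing values and strings that are not "<number> <unit>" (e.g. "N/A") become NaN
    if pd.isnull(age_str):
        return np.nan
    parts = str(age_str).lower().split()
    if len(parts) != 2:
        return np.nan
    try:
        val = float(parts[0])
    except ValueError:
        return np.nan
    unit = parts[1]
    if 'year' in unit:
        return round(val, 2)
    elif 'month' in unit:
        return round(val / 12, 2)
    elif 'week' in unit:
        return round(val / 52, 2)
    elif 'day' in unit:
        return round(val / 365, 2)
    else:
        return np.nan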

df_eligibilities['minimum_age'] = df_eligibilities['minimum_age'].apply(convert_age_to_years)
df_eligibilities['maximum_age'] = df_eligibilities['maximum_age'].apply(convert_age_to_years)
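
For large extracts, the same conversion can be written without a row-wise `apply`. The following is a sketch of a vectorized alternative, not the notebook's method: `ages_to_years` and `UNIT_TO_YEARS` are hypothetical names, and the divisors are the same ones used in the cell above.

```python
import pandas as pd

# Divisors that convert each unit to years (same factors as the row-wise function).
UNIT_TO_YEARS = {"year": 1, "month": 12, "week": 52, "day": 365}

def ages_to_years(ages: pd.Series) -> pd.Series:
    """Parse strings such as '18 Years' into numeric years; anything else becomes NaN."""
    parts = ages.str.lower().str.extract(r"^\s*(\d+(?:\.\d+)?)\s*(year|month|week|day)")
    value = pd.to_numeric(parts[0], errors="coerce")
    divisor = parts[1].map(UNIT_TO_YEARS).astype(float)
    return (value / divisor).round(2)

sample = pd.Series(["18 Years", "6 Months", "N/A", None, "26 Weeks"])
result = ages_to_years(sample)
print(result.tolist())
```

Strings that do not match the pattern, such as `"N/A"`, yield NaN rather than raising an error.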

In [5]:
# Classify age group based on min/max age
def age_group(min_age, max_age):
    if pd.isna(min_age) and pd.isna(max_age): 
        return 'unknown'
    if pd.isna(min_age):
        return 'child' if max_age < 18 else 'adult' if max_age <= 65 else 'senior'
    if pd.isna(max_age):
        return 'mixed' if min_age < 18 else 'adult' if min_age <= 65 else 'senior'
    if max_age < 18: return 'child'
    if min_age >= 18 and max_age <= 65: return 'adult'
    if min_age > 65: return 'senior'
    return 'mixed'

df_eligibilities['age_group'] = df_eligibilities.apply(lambda row: age_group(row['minimum_age'], row['maximum_age']), axis=1)
df_eligibilities.drop(columns=['minimum_age', 'maximum_age'], inplace=True)
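
The boundary behaviour of `age_group` is easiest to see on a few hand-picked inputs. The function is repeated below only so the block runs on its own; the test cases are illustrative and not taken from the data.

```python
import numpy as np
import pandas as pd

def age_group(min_age, max_age):
    # Same rules as the cell above, repeated so this block is self-contained.
    if pd.isna(min_age) and pd.isna(max_age):
        return 'unknown'
    if pd.isna(min_age):
        return 'child' if max_age < 18 else 'adult' if max_age <= 65 else 'senior'
    if pd.isna(max_age):
        return 'mixed' if min_age < 18 else 'adult' if min_age <= 65 else 'senior'
    if max_age < 18:
        return 'child'
    if min_age >= 18 and max_age <= 65:
        return 'adult'
    if min_age > 65:
        return 'senior'
    return 'mixed'

# (minimum_age, maximum_age) pairs in years; NaN means the bound is missing.
cases = [(np.nan, np.nan), (0, 17), (18, 65), (18, np.nan),
         (12, np.nan), (65, 80), (70, 90), (np.nan, 80)]
labels = [age_group(lo, hi) for lo, hi in cases]
for case, label in zip(cases, labels):
    print(case, '->', label)
```

Two cases are worth reviewing against the intended semantics: a minimum of 18 with no maximum is labelled `adult` even though seniors qualify, and a missing minimum with a maximum of 80 is labelled `senior` even though younger participants qualify.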

In [6]:
# One-hot encode `gender` and `age_group`
df_eligibilities = pd.get_dummies(df_eligibilities, columns=['gender', 'age_group'], prefix=['elig_gender', 'elig_age'], dtype=int)

# Group by `nct_id` to ensure one row per trial
df_eligibilities = df_eligibilities.groupby('nct_id').max().reset_index()
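
To illustrate what the encode-then-aggregate step does, here is a toy frame in which one hypothetical trial appears twice. Taking the column-wise `max` per `nct_id` means a flag is 1 if any row for that trial set it, and `healthy_volunteers` stays at `-1` only when every row for the trial was missing.

```python
import pandas as pd

# Toy data: NCT001 is deliberately duplicated (IDs are made up).
toy = pd.DataFrame({
    "nct_id": ["NCT001", "NCT001", "NCT002"],
    "healthy_volunteers": [0, 1, -1],
    "gender": ["all", "female", "male"],
})

# One 0/1 column per gender value, prefixed like the notebook's output.
flags = pd.get_dummies(toy, columns=["gender"], prefix=["elig_gender"], dtype=int)

# Collapse to one row per trial by taking the maximum of each column.
per_trial = flags.groupby("nct_id").max().reset_index()
print(per_trial)
```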

In [7]:
# Export final cleaned dataset
df_eligibilities.to_csv('../data/processed/eligibilities_clean.csv', index=False)
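
Before or after exporting, a few assertions can confirm the properties described in this notebook. The helper below is a hypothetical sketch (`check_eligibilities` is not part of the original notebook) and is demonstrated on a toy frame with made-up IDs.

```python
import pandas as pd

def check_eligibilities(df: pd.DataFrame) -> None:
    """Hypothetical sanity checks for the cleaned eligibility table."""
    # One record per trial.
    assert df["nct_id"].is_unique, "duplicate nct_id values"
    # healthy_volunteers uses the 1 / 0 / -1 encoding.
    assert df["healthy_volunteers"].isin([-1, 0, 1]).all(), "unexpected healthy_volunteers code"
    # Every elig_* column is a binary flag.
    flag_cols = [c for c in df.columns if c.startswith("elig_")]
    assert df[flag_cols].isin([0, 1]).all().all(), "non-binary flag value"

toy_clean = pd.DataFrame({
    "nct_id": ["NCT001", "NCT002"],
    "healthy_volunteers": [1, -1],
    "elig_gender_all": [1, 0],
    "elig_gender_male": [0, 1],
    "elig_age_adult": [1, 1],
})
check_eligibilities(toy_clean)  # passes silently
```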

## Summary  
This notebook cleaned and encoded participant eligibility criteria from the AACT `eligibilities` table for interventional trials.  

Key outcomes:
- Transformed age values into standardized numeric years.
- Encoded **healthy_volunteers** as 1 (yes), 0 (no), -1 (missing).
- Derived categorical **age group** and **gender eligibility** variables.  
- Encoded and aggregated all eligibility features at the trial level.  
- Exported the final cleaned dataset: `eligibilities_clean.csv`.  

---

📂 **Next Notebook:** `7_df_sponsors.ipynb` → Standardizes sponsor types and roles (lead vs. collaborator) for inclusion in modeling.
