# Preprocessing Notebook: `2_df_baseline_features.ipynb`

### Description  
Processes baseline demographic data from the **`baseline_counts`** and **`baseline_measurements`** tables in the AACT Clinical Trials database.  
Focuses on cleaning and transforming participant-level data related to **age**, **gender**, and **overall participant counts** to create a unified baseline feature table.

### Key Steps  
- Load and clean participant-level data from `baseline_counts` and `baseline_measurements` tables.  
- Keep only **participant-level rows** and remove redundant or null entries.  
- Standardize column names and categorical labels for **gender** and **age groups**.  
- Pivot categorical values to wide format for better interpretability.  
- Compute continuous variables such as **mean age** and **number analyzed** per trial.  
- Merge all derived features into a single baseline dataset.
- Export the final dataset to `../data/processed/baseline_features_clean.csv`.  

### Output  
- ✅ `baseline_features_clean.csv`

**Note:** These features were later excluded from modeling due to high missingness and inconsistency.

In [1]:
from sqlalchemy import create_engine
import pandas as pd
from dotenv import load_dotenv
import os

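# Read the connection string from the project-level .env file
load_dotenv(dotenv_path="../.env")
DATABASE_URL = os.getenv('DATABASE_URL')
if not DATABASE_URL:
    # Fail early with a readable message instead of an opaque SQLAlchemy error
    raise RuntimeError("DATABASE_URL is not set; check ../.env")
engine = create_engine(DATABASE_URL)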


In [2]:
# Load and process `baseline_counts` table
query_counts = 'SELECT * FROM ctgov.baseline_counts'
baseline_counts = pd.read_sql(query_counts, engine)

# Keep only participant-level rows, remove unnecessary columns
baseline_counts = baseline_counts.drop(columns=['id', 'scope'])
baseline_counts = baseline_counts.dropna(subset=['nct_id']).drop_duplicates()
baseline_counts = baseline_counts[baseline_counts['units'] == 'Participants']

# Convert and aggregate count
baseline_counts['count'] = pd.to_numeric(baseline_counts['count'], errors='coerce')
baseline_counts = baseline_counts.groupby('nct_id')['count'].sum().reset_index()
baseline_counts.rename(columns={'count': 'overall_count'}, inplace=True)
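
One caveat with the aggregation above: after `pd.to_numeric(..., errors='coerce')`, a trial whose counts are all non-numeric is summed to `0` by default rather than to a missing value. The sketch below uses toy data (the IDs and counts are invented for illustration, not taken from AACT) to show the difference that `min_count=1` makes. Whether this case actually occurs in `baseline_counts` has not been checked here.

```python
import pandas as pd

# Toy data (hypothetical): NCT0002 has only a non-numeric count
demo_counts = pd.DataFrame({
    "nct_id": ["NCT0001", "NCT0001", "NCT0002"],
    "count": ["12", "8", "NA"],
})
demo_counts["count"] = pd.to_numeric(demo_counts["count"], errors="coerce")

# Default behavior: an all-NaN group sums to 0
default_sum = demo_counts.groupby("nct_id")["count"].sum()

# With min_count=1, a group with no valid values stays NaN
strict_sum = demo_counts.groupby("nct_id")["count"].sum(min_count=1)

print(default_sum)
print(strict_sum)
```

Keeping such trials as missing rather than zero makes it easier to tell "no participants reported" apart from "zero participants".
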

In [3]:
# Load and filter `baseline_measurements` table
query_measurements = '''
SELECT nct_id, category, title, units, param_value_num, number_analyzed, number_analyzed_units
FROM ctgov.baseline_measurements;
'''
baseline_measurements = pd.read_sql(query_measurements, engine)

# Filter for participant-level rows only
baseline_measurements = baseline_measurements[
    baseline_measurements['number_analyzed_units'] == 'Participants'
].dropna(subset=['param_value_num']).drop_duplicates()
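
Because the query above does not select a reporting-group identifier, `drop_duplicates()` can collapse two different arms that happen to report identical values. The toy example below illustrates the effect. It assumes a group column named `ctgov_group_code` is available in the table; that name and all values are assumptions for illustration and should be checked against the AACT schema.

```python
import pandas as pd

# Toy data (hypothetical): two arms of one trial report identical values
demo = pd.DataFrame({
    "nct_id": ["NCT0001", "NCT0001"],
    "ctgov_group_code": ["B1", "B2"],  # assumed arm identifier column
    "category": ["Female", "Female"],
    "title": ["Sex: Female, Male", "Sex: Female, Male"],
    "param_value_num": [10.0, 10.0],
    "number_analyzed": [20, 20],
})

# Without the group column, the second arm looks like a duplicate and is dropped
without_group = demo.drop(columns="ctgov_group_code").drop_duplicates()

# With the group column, both arms survive deduplication
with_group = demo.drop_duplicates()

print(without_group["param_value_num"].sum())  # counts one arm only
print(with_group["param_value_num"].sum())     # counts both arms
```
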

In [4]:
# Standardize titles and filter only relevant features (age & gender)
baseline_measurements['title'] = baseline_measurements['title'].replace({
    'Age, Categorical': 'age_categorical',
    'Age Categorical': 'age_categorical',
    'Age, Continuous': 'age_continuous',
    'Age Continuous': 'age_continuous',
    'Sex: Female, Male': 'gender',
    'Gender': 'gender'
})
baseline_measurements = baseline_measurements[
    baseline_measurements['title'].isin(['age_categorical', 'age_continuous', 'gender'])
]

In [5]:
# Clean and normalize the `category` column
# Drop rows with no reported units
baseline_measurements = baseline_measurements.dropna(subset=['units'])

# Drop rows with missing category for age_categorical
baseline_measurements = baseline_measurements[~(
    (baseline_measurements['title'] == 'age_categorical') & 
    (baseline_measurements['category'].isna())
)]

# Fill missing category with 'overall' for age_continuous
baseline_measurements.loc[
    (baseline_measurements['title'] == 'age_continuous') & 
    (baseline_measurements['category'].isna()), 'category'
] = 'overall'

# Normalize values
baseline_measurements['category'] = baseline_measurements['category'].replace({
    'FEMALE': 'female', 'Female': 'female', 'MALE': 'male', 'Male': 'male',
    '<=18 years': '<=18years', 'Between 18 and 65 years': '18-65years',
    '18-64 years': '18-65years', 'Between 18 and 65 Years': '18-65years',
    '>=65 years': '>=65years'
})

# Final filter on accepted categories
accepted = ['female', 'male', '<=18years', '18-65years', '>=65years', 'overall']
baseline_measurements = baseline_measurements[
    baseline_measurements['category'].isin(accepted)
]
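
The exact-match mapping above only catches the spellings listed, so a variant such as `'>=65 Years'` would be silently filtered out. A possible alternative, sketched below, lowercases and strips labels before mapping and keeps a list of unmapped labels for inspection. The helper `normalize_category` and the `CATEGORY_MAP` dictionary are hypothetical names introduced for this sketch, and the input labels are toy values.

```python
import pandas as pd

# Keys are lowercase so that case variants map to the same label
CATEGORY_MAP = {
    "female": "female",
    "male": "male",
    "<=18 years": "<=18years",
    "between 18 and 65 years": "18-65years",
    "18-64 years": "18-65years",
    ">=65 years": ">=65years",
    "overall": "overall",
}

def normalize_category(series: pd.Series) -> pd.Series:
    """Strip whitespace, lowercase, then map; unknown labels become NaN."""
    cleaned = series.str.strip().str.lower()
    return cleaned.map(CATEGORY_MAP)

# Toy labels (hypothetical)
raw = pd.Series(["FEMALE", " Male ", "Between 18 and 65 Years", ">=65 Years", "Unknown"])
normalized = normalize_category(raw)

# Labels that did not match any key, useful for deciding what to add to the map
unmapped = raw[normalized.isna()]
print(normalized.tolist())
print(unmapped.tolist())
```

Reviewing `unmapped` before the final filter shows how many rows the accepted-category list would discard.
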

In [6]:
# Pivot and aggregate baseline data
age_cat = baseline_measurements[baseline_measurements['title'] == 'age_categorical']
gender = baseline_measurements[baseline_measurements['title'] == 'gender']
age_cont = baseline_measurements[baseline_measurements['title'] == 'age_continuous']

age_cat_pivot = age_cat.pivot_table(index='nct_id', columns='category', values='param_value_num', aggfunc='sum').reset_index()
gender_pivot = gender.pivot_table(index='nct_id', columns='category', values='param_value_num', aggfunc='sum').reset_index()
mean_age = age_cont.groupby('nct_id')['param_value_num'].mean().reset_index().rename(columns={'param_value_num': 'mean_age'})
age_analyzed = age_cont.groupby('nct_id')['number_analyzed'].mean().reset_index().rename(columns={'number_analyzed': 'age_number_analyzed'})
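
Note that `mean_age` above is an unweighted average across reporting groups, so a small arm counts as much as a large one. If a participant-weighted mean is preferred, one option is to weight each group's value by `number_analyzed`. The sketch below uses toy values and assumes every row reports a mean in the same unit; it is an illustration, not the method used in this notebook.

```python
import pandas as pd

# Toy data (hypothetical): NCT0001 has a small young arm and a larger older arm
age_cont_demo = pd.DataFrame({
    "nct_id": ["NCT0001", "NCT0001", "NCT0002"],
    "param_value_num": [30.0, 50.0, 42.0],
    "number_analyzed": [10, 30, 25],
})

# Unweighted mean, as in the cell above
unweighted = age_cont_demo.groupby("nct_id")["param_value_num"].mean()

# Weighted mean: sum(age * n) / sum(n) per trial
totals = (
    age_cont_demo
    .assign(weighted_age=lambda d: d["param_value_num"] * d["number_analyzed"])
    .groupby("nct_id")[["weighted_age", "number_analyzed"]]
    .sum()
)
weighted = totals["weighted_age"] / totals["number_analyzed"]

print(unweighted)
print(weighted)
```
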

In [7]:
# Merge all baseline features
from functools import reduce
baseline_features = reduce(
    lambda left, right: pd.merge(left, right, on='nct_id', how='outer'),
    [age_cat_pivot, gender_pivot, mean_age, age_analyzed]
)
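
The outer merges above assume each input has exactly one row per `nct_id`. As an optional safeguard, `pd.merge` accepts a `validate` argument that raises an error when that assumption is violated. The sketch below uses toy frames (all values hypothetical) in which the right-hand frame deliberately contains a duplicated trial ID.

```python
import pandas as pd

# Toy frames (hypothetical): `right` repeats NCT0001
left = pd.DataFrame({"nct_id": ["NCT0001", "NCT0002"], "mean_age": [45.0, 42.0]})
right = pd.DataFrame({"nct_id": ["NCT0001", "NCT0001"], "female": [10, 12]})

try:
    # validate="one_to_one" checks that merge keys are unique on both sides
    pd.merge(left, right, on="nct_id", how="outer", validate="one_to_one")
    merge_ok = True
except pd.errors.MergeError as err:
    merge_ok = False
    print(f"Merge rejected: {err}")
```

Since the inputs here come from `groupby` and `pivot_table` on `nct_id`, the check should pass in practice; it mainly protects against later changes to upstream cells.
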

In [8]:
# Merge with overall participant counts and export to CSV
df_baseline_features = pd.merge(baseline_features, baseline_counts, on='nct_id', how='left')
df_baseline_features = df_baseline_features[[
    'nct_id', 'overall_count', 'female', 'male', '<=18years', '18-65years', '>=65years',
    'mean_age', 'age_number_analyzed']]
df_baseline_features.to_csv('../data/processed/baseline_features_clean.csv', index=False)

---

## Summary  
This notebook successfully extracted, cleaned, and structured baseline demographic information from AACT’s baseline tables.  
Key derived variables included **overall participant count**, **gender counts**, **age-group counts**, and **mean age** per trial.

However, these features were **excluded from the final modeling dataset** because:
- Only ~26% of trials reported complete demographic data.  
- High **missingness** and **zero-inflated values** were observed across gender and age columns.  
- Reporting patterns were inconsistent between trial arms and groups.  

Thus, including them could introduce **bias and noise** in the predictive models.  
The cleaning steps and the rationale for exclusion are documented here for transparency and reproducibility.
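
A completeness check of the kind summarized above can be sketched as follows. The frame below is toy data with invented values, so its percentages do not correspond to the figures reported for the real dataset; the column subset in `feature_cols` is likewise an assumption chosen for brevity.

```python
import pandas as pd

# Toy feature table (hypothetical values)
demo_features = pd.DataFrame({
    "nct_id": ["NCT0001", "NCT0002", "NCT0003", "NCT0004"],
    "female": [10, None, 5, None],
    "male": [12, None, 7, 3],
    "mean_age": [45.0, None, None, 60.0],
})
feature_cols = ["female", "male", "mean_age"]

# Share of missing values per column
missing_share = demo_features[feature_cols].isna().mean()

# Share of trials with every demographic column populated
complete_share = demo_features[feature_cols].notna().all(axis=1).mean()

print(missing_share)
print(f"Trials with complete demographics: {complete_share:.0%}")
```

Running the same two expressions on `df_baseline_features` would give per-column missingness and the share of fully reported trials.
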

---

📂 **Next Notebook:** `3_df_interventions.ipynb` → Cleans and encodes intervention-related data for modeling.
