# Preprocessing Notebook: `5_df_designs.ipynb`

### Description  
Processes the **`designs`** table from the AACT Clinical Trials database to extract and standardize trial design characteristics such as allocation, intervention model, masking, and primary purpose.  
These features represent core study design attributes critical for modeling clinical trial outcomes.

### Key Steps  
- Load design-related columns (`allocation`, `intervention_model`, `primary_purpose`, `masking`) from the AACT schema.  
- Merge with the cleaned `studies` dataset to include **interventional trials only**.  
- Standardize categorical values by converting them to lowercase.  
- Drop the logically invalid combination of randomized allocation with a `single_group` intervention model.  
- Consolidate related purpose categories (e.g., `basic_science`, `device_feasibility` → `research`).  
- Fill missing or ambiguous values with standardized labels (`unknown` for allocation, intervention model, and masking; `other` for primary purpose).  
- Apply one-hot encoding to design variables and aggregate features at the **trial (`nct_id`) level**.  
- Export the final dataset to `../data/processed/designs_clean.csv`.

### Output  
- ✅ `designs_clean.csv`

In [1]:
# Load libraries and connect to database
from sqlalchemy import create_engine
import pandas as pd
from dotenv import load_dotenv
import os

load_dotenv()
DATABASE_URL = os.getenv('DATABASE_URL')
engine = create_engine(DATABASE_URL)

In [2]:
# Load `designs` table with selected columns
query = '''
SELECT nct_id, allocation, intervention_model, primary_purpose, masking
FROM ctgov.designs;
'''
df_designs = pd.read_sql(query, engine)

# Merge with `studies_clean.csv` and keep interventional trials only
df_studies = pd.read_csv('../data/processed/studies_clean.csv')
df_designs = df_designs.merge(df_studies[['nct_id', 'study_type']], on='nct_id', how='left')
df_designs = df_designs[df_designs['study_type'] == 'INTERVENTIONAL']
# Reassign rather than drop in place: an in-place drop on a filtered frame can raise SettingWithCopyWarning
df_designs = df_designs.drop(columns=['study_type'])
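
The left merge followed by a filter could also be written as an inner join against only the interventional IDs. The sketch below illustrates this on toy data; the frames `designs` and `studies` and their rows are hypothetical stand-ins, not AACT records.

```python
import pandas as pd

# Hypothetical stand-ins for the `designs` table and `studies_clean.csv`.
designs = pd.DataFrame({
    "nct_id": ["NCT001", "NCT002", "NCT003"],
    "allocation": ["RANDOMIZED", None, "NON_RANDOMIZED"],
})
studies = pd.DataFrame({
    "nct_id": ["NCT001", "NCT002"],
    "study_type": ["INTERVENTIONAL", "OBSERVATIONAL"],
})

# Keep only the IDs of interventional trials, then inner-join on them.
interventional_ids = studies.loc[studies["study_type"] == "INTERVENTIONAL", ["nct_id"]]
filtered = designs.merge(
    interventional_ids,
    on="nct_id",
    how="inner",
    validate="many_to_one",  # fails loudly if an nct_id repeats on the studies side
)

print(filtered["nct_id"].tolist())
```

With this form the `study_type` column never enters the design frame, so no later drop is needed.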

In [3]:
# Convert all object-type columns to lowercase (except `nct_id`)
for col in df_designs.columns:
    if col != 'nct_id' and df_designs[col].dtype == 'object':
        df_designs[col] = df_designs[col].str.lower()

In [4]:
# Drop logically inconsistent rows: randomized + single_group
invalid_rows = (df_designs['intervention_model'] == 'single_group') & (df_designs['allocation'] == 'randomized')
df_designs = df_designs.drop(df_designs[invalid_rows].index)
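
Before dropping rows it can be useful to count how many are affected. A minimal sketch on toy data (the frame `toy` and its values are hypothetical); a missing allocation compares as `False`, so only explicit `randomized` + `single_group` rows are flagged.

```python
import pandas as pd

# Hypothetical rows covering the valid, invalid, and missing cases.
toy = pd.DataFrame({
    "nct_id": ["NCT001", "NCT002", "NCT003", "NCT004"],
    "allocation": ["randomized", "randomized", "non_randomized", None],
    "intervention_model": ["parallel", "single_group", "single_group", "single_group"],
})

# A single-group trial has only one arm, so it cannot be randomized.
invalid = (toy["intervention_model"] == "single_group") & (toy["allocation"] == "randomized")
n_invalid = int(invalid.sum())

# Same drop pattern as the cell above, applied to the toy frame.
cleaned = toy.drop(index=toy[invalid].index)

print(f"dropping {n_invalid} of {len(toy)} rows")
```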

In [5]:
# A single-group trial has no allocation step, so a missing or 'na' allocation is set to non_randomized.
# This must run before the generic fill below, which would otherwise label these rows 'unknown'.
df_designs.loc[(df_designs['intervention_model'] == 'single_group') & (df_designs['allocation'].isna()), 'allocation'] = 'non_randomized'
df_designs.loc[(df_designs['intervention_model'] == 'single_group') & (df_designs['allocation'] == 'na'), 'allocation'] = 'non_randomized'

# Any remaining missing or 'na' values become 'unknown'
df_designs['allocation'] = df_designs['allocation'].replace({'na': 'unknown'}).fillna('unknown')
df_designs['intervention_model'] = df_designs['intervention_model'].fillna('unknown')
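
The effect of these rules is easiest to see on a few rows. The worked example below uses a hypothetical frame `toy` with one row per case; it combines the two single-group conditions into one mask, which is an equivalent rewrite for illustration rather than the notebook's exact code.

```python
import pandas as pd

# Hypothetical rows: one per case handled by the fill rules.
toy = pd.DataFrame({
    "allocation": [None, "na", "na", "randomized"],
    "intervention_model": ["single_group", "single_group", "parallel", None],
})

# Rule 1: single-group trials with no stated allocation are non-randomized.
single_group = toy["intervention_model"] == "single_group"
unspecified = toy["allocation"].isna() | (toy["allocation"] == "na")
toy.loc[single_group & unspecified, "allocation"] = "non_randomized"

# Rule 2: whatever is still missing or 'na' becomes 'unknown'.
toy["allocation"] = toy["allocation"].replace({"na": "unknown"}).fillna("unknown")
toy["intervention_model"] = toy["intervention_model"].fillna("unknown")

print(toy)
```

Only the third row (parallel model, `na` allocation) ends up with an `unknown` allocation; the two single-group rows become `non_randomized`.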

In [6]:
# Consolidate related primary_purpose categories; missing values become 'other'
df_designs['primary_purpose'] = df_designs['primary_purpose'].replace({
    'basic_science': 'research',
    'health_services_research': 'research',
    'device_feasibility': 'research',
    'screening': 'diagnostic',
    'ect': 'other'  # ECT = Educational/Counseling/Training (not a typo for "etc")
}).fillna('other')
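
A quick way to confirm the mapping behaves as intended is to apply it to a small series. The sketch below reuses the mapping from the cell above under the name `PURPOSE_MAP` (a hypothetical name introduced here) and checks the result against an expected set; `treatment` is assumed to be among the categories left unchanged.

```python
import pandas as pd

# Same mapping as the cell above; the constant name is an assumption.
PURPOSE_MAP = {
    "basic_science": "research",
    "health_services_research": "research",
    "device_feasibility": "research",
    "screening": "diagnostic",
    "ect": "other",
}

# Hypothetical sample of lowercase primary_purpose values, including a missing one.
purpose = pd.Series(["treatment", "basic_science", "screening", "ect", None, "device_feasibility"])
consolidated = purpose.replace(PURPOSE_MAP).fillna("other")

# None of the source categories that were mapped away should survive.
leftover = set(consolidated) & set(PURPOSE_MAP)

print(consolidated.tolist())
```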

In [7]:
# Normalize and fill `masking` values
# Note: 'none' (no masking, i.e. open label) is merged with missing values under 'unknown'
df_designs['masking'] = df_designs['masking'].replace({'none': 'unknown'}).fillna('unknown')

In [8]:
# One-hot encode design features and aggregate by trial
df_designs = pd.get_dummies(
    df_designs,
    columns=['allocation', 'intervention_model', 'primary_purpose', 'masking'],
    prefix=['allocation', 'model', 'purpose', 'masking'],
    dtype=int
)
df_designs = df_designs.groupby('nct_id').max().reset_index()
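
The `groupby(...).max()` step acts as a safeguard in case a trial appears on more than one row: each indicator is set to 1 if any of that trial's rows had the category. A minimal sketch on hypothetical data, where `NCT001` deliberately appears twice:

```python
import pandas as pd

# Hypothetical frame in which one trial has two design rows.
toy = pd.DataFrame({
    "nct_id": ["NCT001", "NCT001", "NCT002"],
    "masking": ["double", "single", "unknown"],
})

# One indicator column per category, as in the cell above.
encoded = pd.get_dummies(toy, columns=["masking"], prefix=["masking"], dtype=int)

# Collapse to one row per trial; max() keeps a 1 if any row had the category.
per_trial = encoded.groupby("nct_id").max().reset_index()

print(per_trial)
```

One consequence is that a trial with conflicting rows ends up with several indicators set within the same feature group (here both `masking_double` and `masking_single` for `NCT001`), so the columns are not guaranteed to be mutually exclusive.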

In [9]:
# Export cleaned design dataset
df_designs.to_csv('../data/processed/designs_clean.csv', index=False)
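
A lightweight check before or after export can confirm the two properties the later notebooks rely on: one row per trial and strictly binary indicators. The helper `check_design_features` below is hypothetical and not part of the original notebook; it is shown on a toy frame.

```python
import pandas as pd

def check_design_features(df: pd.DataFrame) -> None:
    """Raise ValueError unless df has one row per nct_id and only 0/1 indicators."""
    if not df["nct_id"].is_unique:
        raise ValueError("duplicate nct_id rows found")
    indicators = df.drop(columns=["nct_id"])
    if not indicators.isin([0, 1]).all().all():
        raise ValueError("indicator columns contain values other than 0 and 1")

# Hypothetical frame shaped like the exported dataset.
toy = pd.DataFrame({
    "nct_id": ["NCT001", "NCT002"],
    "allocation_randomized": [1, 0],
    "allocation_unknown": [0, 1],
})

check_design_features(toy)  # passes silently for a well-formed frame
```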

## Summary  
This notebook extracts and standardizes **trial design features** from the AACT database.  

Key outcomes:
- Cleaned and harmonized allocation, intervention model, masking, and purpose categories.  
- Filled missing values and dropped logically inconsistent design combinations.  
- Consolidated overlapping categories (e.g., grouped research-related purposes).  
- Applied one-hot encoding to categorical variables and aggregated to one row per trial (`nct_id`).  
- Exported a compact, machine-learning-ready dataset: `designs_clean.csv`.  

---

📂 **Next Notebook:** `6_df_eligibilities.ipynb` → Processes participant eligibility and demographic criteria for each trial.
