# Preprocessing Notebook: `3_df_interventions.ipynb`

### Description  
Processes the **`interventions`** table from the AACT Clinical Trials database to create structured, model-ready features that represent the types and diversity of interventions used in clinical studies.

### Key Steps  
- Load data from the `interventions` table using SQLAlchemy.  
- Remove null and duplicate records to ensure data integrity.  
- Standardize `intervention_type` values by converting them to lowercase.  
- One-hot encode all unique intervention types (Drug, Device, Biological, Procedure, etc.).  
- Aggregate encoded features at the **trial (`nct_id`) level**.  
- Create new engineered features:  
  - `intervention_count`: number of distinct intervention types per trial.  
  - `has_multiple_intervention_types`: binary flag indicating if a trial includes more than one intervention type.  
- Verify that trial IDs (`nct_id`) are unique.  
- Export the final dataset to `../data/processed/interventions_clean.csv` for downstream merging.
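
The steps above can be sketched on a small in-memory table. This is a minimal illustration under stated assumptions, not the notebook's actual run: the rows in `toy` and the IDs `NCT001` and `NCT002` are made up, and the names `encoded`, `per_trial`, and `type_cols` are placeholders that do not appear in the notebook.

```python
import pandas as pd

# Toy stand-in for the `interventions` table (made-up rows, not AACT data).
toy = pd.DataFrame({
    "nct_id": ["NCT001", "NCT001", "NCT001", "NCT002", None],
    "intervention_type": ["Drug", "drug", "Device", "Procedure", "Drug"],
})

# Drop rows with missing keys, drop exact duplicates, then lowercase the type.
toy = (
    toy.dropna(subset=["nct_id", "intervention_type"])
       .drop_duplicates()
       .assign(intervention_type=lambda d: d["intervention_type"].str.lower())
)

# One-hot encode the type, then collapse to one row per trial with max().
encoded = pd.get_dummies(
    toy, columns=["intervention_type"], prefix="intervention", dtype=int
)
per_trial = encoded.groupby("nct_id").max().reset_index()

# Derived features: count of distinct types and a multi-type flag.
type_cols = [c for c in per_trial.columns if c.startswith("intervention_")]
per_trial["intervention_count"] = per_trial[type_cols].sum(axis=1)
per_trial["has_multiple_intervention_types"] = (
    per_trial["intervention_count"] > 1
).astype(int)

print(per_trial)
```

In this toy case `NCT001` has two distinct types (`drug` and `device`), so its flag is 1, while `NCT002` has a single type and a flag of 0. Because the per-trial aggregation uses `max()`, the `Drug`/`drug` case variants collapse into one flag even though they survive the exact-match de-duplication.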

### Output  
- ✅ `interventions_clean.csv`

In [1]:
# Load packages and setup database connection
from sqlalchemy import create_engine
import pandas as pd
from dotenv import load_dotenv
import os

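# Read DATABASE_URL from a local .env file and fail early if it is missing
load_dotenv()
DATABASE_URL = os.getenv('DATABASE_URL')
if not DATABASE_URL:
    raise RuntimeError('DATABASE_URL is not set; add it to your .env file')
engine = create_engine(DATABASE_URL)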

In [2]:
# Load `interventions` table from the AACT database
query = '''
SELECT nct_id, intervention_type
FROM ctgov.interventions;
'''
df_interventions = pd.read_sql(query, engine)

In [3]:
# Drop null values and duplicates
df_interventions = df_interventions.dropna(subset=['nct_id', 'intervention_type'])
df_interventions = df_interventions.drop_duplicates()

In [4]:
# Standardize `intervention_type` values to lowercase
df_interventions['intervention_type'] = df_interventions['intervention_type'].str.lower()

# One-hot encode intervention types
df_interventions = pd.get_dummies(
    df_interventions,
    columns=['intervention_type'],
    prefix='intervention',
    dtype=int
)

# Aggregate one-hot encoded values per trial
df_interventions = df_interventions.groupby('nct_id').max().reset_index()

In [5]:
# Derived features
# Number of distinct intervention types per trial.
# Exclude the derived column so that re-running this cell does not double count.
intervention_cols = [
    col for col in df_interventions.columns
    if col.startswith('intervention_') and col != 'intervention_count'
]
df_interventions['intervention_count'] = df_interventions[intervention_cols].sum(axis=1)

# Binary flag: 1 if the trial uses more than one intervention type
df_interventions['has_multiple_intervention_types'] = (
    df_interventions['intervention_count'] > 1
).astype(int)

In [6]:
# Ensure NCT IDs are unique and export final dataset
assert df_interventions['nct_id'].is_unique, 'NCT ID duplication error!'
df_interventions.to_csv('../data/processed/interventions_clean.csv', index=False)

---

## Summary  
This notebook successfully extracted, standardized, and encoded the intervention-level data from AACT.  

Key highlights:
- Generated one-hot encoded intervention flags for each trial.  
- Derived two interpretable features: `intervention_count` and `has_multiple_intervention_types`.  
- Ensured dataset quality by removing null and duplicate records and checking that `nct_id` is unique.  

The resulting file, `interventions_clean.csv`, provides a **compact, structured representation** of trial interventions, with one row per `nct_id`, suitable for merging with other datasets in the modeling pipeline.
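
As a rough sketch of that downstream merge, the snippet below joins the intervention features onto a trial-level table. Everything here is an assumption for illustration: the `studies` frame, its `enrollment` column, and the sample IDs are hypothetical, and filling missing flags with 0 is one possible choice rather than something this notebook prescribes.

```python
import pandas as pd

# Hypothetical stand-in for the exported interventions_clean.csv.
interventions = pd.DataFrame({
    "nct_id": ["NCT001", "NCT002"],
    "intervention_drug": [1, 0],
    "intervention_count": [2, 1],
})

# Hypothetical trial-level table from another notebook in the pipeline.
studies = pd.DataFrame({
    "nct_id": ["NCT001", "NCT002", "NCT003"],
    "enrollment": [120, 45, 300],
})

# validate="one_to_one" raises if nct_id is duplicated on either side.
merged = studies.merge(
    interventions, on="nct_id", how="left", validate="one_to_one"
)

# Trials with no intervention rows come through as NaN; here they are
# treated as "no interventions recorded" (an assumption, not a rule).
feature_cols = [c for c in interventions.columns if c != "nct_id"]
merged[feature_cols] = merged[feature_cols].fillna(0).astype(int)

print(merged)
```

The `validate` argument repeats the uniqueness check from the export step at merge time, so a duplicated `nct_id` introduced elsewhere in the pipeline raises an error instead of silently multiplying rows.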

---

📂 **Next Notebook:** `4_df_conditions.ipynb` → Cleans and encodes medical condition categories associated with each trial.

