# Integration Notebook: `8_df_merged.ipynb`

### Description  
This notebook consolidates all cleaned feature tables from the AACT Clinical Trials database into a single, analysis-ready dataset.  
It represents the **final preprocessing step** before Exploratory Data Analysis (EDA) and model development.

### Key Steps  
- Imported all previously cleaned datasets:  
  - `studies_clean.csv`  
  - `interventions_clean.csv`  
  - `conditions_clean.csv`  
  - `designs_clean.csv`  
  - `eligibilities_clean.csv`  
  - `sponsors_clean.csv`  
- Sequentially merged all tables on the common identifier `nct_id` using left joins, with `studies_clean.csv` as the base table.  
- Verified dataset integrity (no duplicate `nct_id` values).  
- Identified missing values: 20 one-hot encoded design columns (`masking_*`, `model_*`, `purpose_*`, `allocation_*`) each had 5,791 nulls after the merge.  
- Filled those missing flag columns with `0` to maintain consistency.  
- Verified the final dimensions (263,165 rows × 80 columns).  
- Exported the consolidated dataset to `../data/processed/df_merged.csv`.  

### Output  
- ✅ `df_merged.csv`

In [1]:
# Import libraries and load all feature datasets
import pandas as pd
from functools import reduce

df_studies = pd.read_csv('../data/processed/studies_clean.csv')
df_interventions = pd.read_csv('../data/processed/interventions_clean.csv')
df_conditions = pd.read_csv('../data/processed/conditions_clean.csv')
df_designs = pd.read_csv('../data/processed/designs_clean.csv')
df_eligibilities = pd.read_csv('../data/processed/eligibilities_clean.csv')
df_sponsors = pd.read_csv('../data/processed/sponsors_clean.csv')

In [2]:
# Merge all datasets sequentially on `nct_id`
dfs_to_merge = [
    df_studies,
    df_interventions,
    df_conditions,
    df_designs,
    df_eligibilities,
    df_sponsors
]

df_merged = reduce(lambda left, right: pd.merge(left, right, on='nct_id', how='left'), dfs_to_merge)
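
A left join on `nct_id` keeps one row per trial only if every right-hand table also has at most one row per `nct_id`. The sketch below is a minimal illustration on toy data, not the notebook's actual code: the frames `studies`, `designs`, `sponsors` and the helper `merge_on_nct_id` are hypothetical stand-ins. It uses the `validate` argument of `pd.merge`, which raises `pandas.errors.MergeError` when the key is not unique on either side.

```python
from functools import reduce

import pandas as pd

# Toy stand-ins for the cleaned tables (hypothetical values, not AACT data).
studies = pd.DataFrame({"nct_id": ["NCT001", "NCT002", "NCT003"],
                        "enrollment": [120, 45, 300]})
designs = pd.DataFrame({"nct_id": ["NCT001", "NCT002"],
                        "masking_double": [1, 0]})
sponsors = pd.DataFrame({"nct_id": ["NCT001", "NCT002", "NCT003"],
                         "sponsor_count": [1, 2, 1]})


def merge_on_nct_id(left, right):
    # validate="one_to_one" makes pandas raise MergeError if either
    # side contains a repeated nct_id, instead of silently adding rows.
    return pd.merge(left, right, on="nct_id", how="left",
                    validate="one_to_one")


merged = reduce(merge_on_nct_id, [studies, designs, sponsors])
print(merged.shape)  # one row per trial in the base table
```

With this guard, a duplicated key fails at merge time rather than surfacing later in the duplicate check.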

In [3]:
df_merged.columns

Index(['nct_id', 'study_type', 'enrollment', 'overall_status',
       'number_of_arms', 'has_dmc', 'has_expanded_access',
       'is_fda_regulated_drug', 'is_fda_regulated_device', 'duration_of_study',
       'phase_1', 'phase_2', 'phase_3', 'phase_4', 'phase_not applicable',
       'intervention_behavioral', 'intervention_biological',
       'intervention_combination_product', 'intervention_device',
       'intervention_diagnostic_test', 'intervention_dietary_supplement',
       'intervention_drug', 'intervention_genetic', 'intervention_other',
       'intervention_procedure', 'intervention_radiation',
       'intervention_count', 'has_multiple_intervention_types',
       'condt_cancers', 'condt_cardiovascular_diseases',
       'condt_dental_disorders', 'condt_dermatological_disorders',
       'condt_endocrine/metabolic_disorders',
       'condt_gastrointestinal_disorders', 'condt_genetic_disorders',
       'condt_infectious_diseases', 'condt_mental_disorders',
       'condt_musculosk

In [4]:
# Check for missing values
df_merged.isnull().sum().sort_values(ascending=False).head(30)

masking_double                      5791
model_single_group                  5791
purpose_treatment                   5791
purpose_prevention                  5791
masking_quadruple                   5791
masking_single                      5791
masking_triple                      5791
masking_unknown                     5791
purpose_other                       5791
purpose_diagnostic                  5791
model_unknown                       5791
model_sequential                    5791
purpose_research                    5791
model_parallel                      5791
model_factorial                     5791
model_crossover                     5791
allocation_unknown                  5791
allocation_randomized               5791
allocation_non_randomized           5791
purpose_supportive_care             5791
condt_respiratory_disorders            0
condt_reproductive_health              0
condt_renal/urological_disorders       0
nct_id                                 0
study_type      
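
The identical null count across all design-derived columns is consistent with trials that have no matching row in `designs_clean.csv`, although the notebook does not confirm this directly. One way to check such a hypothesis is the `indicator` argument of `DataFrame.merge`, sketched here on hypothetical toy frames (`studies`, `designs`) rather than the real tables.

```python
import pandas as pd

# Toy frames (hypothetical): NCT002 has no design record.
studies = pd.DataFrame({"nct_id": ["NCT001", "NCT002", "NCT003"]})
designs = pd.DataFrame({"nct_id": ["NCT001", "NCT003"],
                        "allocation_randomized": [1, 0]})

# indicator=True adds a `_merge` column marking where each row came from.
checked = studies.merge(designs, on="nct_id", how="left", indicator=True)

# Rows tagged "left_only" exist in the base table but not in `designs`;
# these are the rows that receive NaN in every design column.
unmatched = checked.loc[checked["_merge"] == "left_only", "nct_id"].tolist()
print(unmatched)
```

If the number of `left_only` rows equals the per-column null count, the missing values come from the join rather than from the source data.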

In [5]:
# Check for duplicate trial entries
df_merged['nct_id'].duplicated().sum()

0

In [6]:
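# Fill missing values in one-hot encoded features with 0.
# Select every column that still contains nulls instead of hard-coding the
# count (5791); per the check in In [4], only the design flags have nulls.
null_counts = df_merged.isnull().sum()
flag_cols = null_counts[null_counts > 0].index
df_merged[flag_cols] = df_merged[flag_cols].fillna(0)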
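
Columns that contained `NaN` are stored as floats, and `fillna(0)` does not change that, so the filled flags hold `0.0`/`1.0` rather than integers. If integer flags are preferred downstream, a cast can follow the fill. The sketch below uses a small hypothetical frame `df`; the `int8` dtype is an assumption chosen for compactness, not something the notebook specifies.

```python
import pandas as pd

# Toy frame (hypothetical): two flag columns with gaps left by a join.
df = pd.DataFrame({
    "nct_id": ["NCT001", "NCT002", "NCT003"],
    "enrollment": [120.0, 45.0, 300.0],
    "masking_double": [1.0, None, 0.0],
    "purpose_treatment": [None, 1.0, 1.0],
})

# Pick the columns that still contain nulls, then fill them with 0.
null_counts = df.isnull().sum()
flag_cols = null_counts[null_counts > 0].index
df[flag_cols] = df[flag_cols].fillna(0)

# Cast the filled flags from float to a small integer type.
df = df.astype({col: "int8" for col in flag_cols})
print(df.dtypes)
```

Columns without nulls, such as `enrollment` here, keep their original dtype.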

In [7]:
# Final check: remaining nulls and dataset shape
print("Total missing values in the dataset:", df_merged.isnull().sum().sum())
print("Final shape after merging and filling nulls:", df_merged.shape)

Total missing values in the dataset: 0
Final shape after merging and filling nulls: (263165, 80)


In [8]:
# Export merged dataset
df_merged.to_csv('../data/processed/df_merged.csv', index=False)
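
An optional sanity check after exporting is to read the file back and compare its shape and column order with the in-memory frame. The following is a minimal sketch using a temporary directory and a hypothetical toy frame `df`; it does not touch `../data/processed/`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame (hypothetical) standing in for the merged dataset.
df = pd.DataFrame({"nct_id": ["NCT001", "NCT002"],
                   "enrollment": [120, 45],
                   "phase_1": [1, 0]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "df_merged.csv"
    # index=False keeps the row index out of the file, so no extra
    # unnamed column appears when the CSV is read back.
    df.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

same_shape = reloaded.shape == df.shape
same_cols = list(reloaded.columns) == list(df.columns)
print(same_shape, same_cols)
```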

---

## Summary  
This notebook successfully consolidated all preprocessed AACT feature tables into a single dataset.  

Key outcomes:
- Merged six cleaned tables (`studies`, `interventions`, `conditions`, `designs`, `eligibilities`, `sponsors`).  
- Verified no duplicate `nct_id` entries.  
- Filled the 5,791 missing values in each of the 20 one-hot encoded design columns with `0`.  
- Produced a final, clean dataset ready for downstream EDA and model training.  

---

📂 **Next Phase:** Exploratory Data Analysis  
The next step involves statistical inspection, visualization, and feature-target relationship assessment to identify patterns predictive of clinical trial outcomes.
