# Final Dataset Preparation: `9_df_final.ipynb`

### Description  
This notebook generates the **final modeling-ready datasets** once all Exploratory Data Analysis (EDA) steps are complete.  
It produces two datasets, one for linear models (Logistic Regression) and one for tree-based models (XGBoost), so that each model family receives clean inputs in the encoding that suits it.

### Key Steps  
- Loaded the cleaned dataset output from `EDA_3`.  
- Dropped columns used only during visualization or exploratory inspection (e.g., raw enrollment, duration bins, sponsor role).  
- Created **two modeling datasets**:  
  - **One-Hot Encoded Dataset** → keeps one-hot categorical flags (best for Logistic Regression).  
  - **Grouped Dataset** → keeps grouped categorical features (`*_grouped`) for tree-based models like XGBoost.  
- Checked both datasets for null values and confirmed their shapes.  
- Exported both datasets to the `../data/final/` directory for direct use in model training.

### Outputs  
- `df_final_onehot.csv` → One-hot encoded dataset for Logistic Regression.  
- `df_final_grouped.csv` → Grouped categorical dataset for XGBoost.  

In [1]:
import numpy as np
import pandas as pd

# Load the cleaned dataset after EDA_3
df_final = pd.read_csv('../data/processed/df_EDA_3.csv', keep_default_na=False)
print("✅ Loaded dataset")
print("Shape:", df_final.shape)

# Inspect columns
df_final.columns

✅ Loaded dataset
Shape: (263136, 95)


Index(['nct_id', 'enrollment', 'overall_status', 'number_of_arms', 'has_dmc',
       'has_expanded_access', 'is_fda_regulated_drug',
       'is_fda_regulated_device', 'duration_of_study', 'phase_1', 'phase_2',
       'phase_3', 'phase_4', 'phase_not applicable', 'intervention_behavioral',
       'intervention_biological', 'intervention_combination_product',
       'intervention_device', 'intervention_diagnostic_test',
       'intervention_dietary_supplement', 'intervention_drug',
       'intervention_genetic', 'intervention_other', 'intervention_procedure',
       'intervention_radiation', 'intervention_count',
       'has_multiple_intervention_types', 'condt_cancers',
       'condt_cardiovascular_diseases', 'condt_dental_disorders',
       'condt_dermatological_disorders', 'condt_endocrine/metabolic_disorders',
       'condt_gastrointestinal_disorders', 'condt_genetic_disorders',
       'condt_infectious_diseases', 'condt_mental_disorders',
       'condt_musculoskeletal_disorders', 'c

In [2]:
# Drop columns used only during EDA (not needed for modeling)
eda_only_cols = [
    'role_collaborator', 'role_lead',      # sponsor role (dropped after EDA)
    'duration_bins', 'arms_capped',        # engineered bins for EDA
    'duration_of_study', 'enrollment'      # raw duration/enrollment (we keep log values instead)
]

df_final.drop(columns=eda_only_cols, inplace=True, errors="ignore")
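
Because `errors="ignore"` silently skips names that are not present, a misspelled column name would go unnoticed. The following is a minimal sketch of how the drop could also report which names were absent. It uses a toy frame, and the helper `drop_and_report` is hypothetical (not part of this notebook).

```python
import pandas as pd

def drop_and_report(df, cols):
    """Drop the listed columns that exist; also return the names that were absent."""
    present = [c for c in cols if c in df.columns]
    missing = [c for c in cols if c not in df.columns]
    return df.drop(columns=present), missing

# Toy stand-in for df_final (values invented for illustration).
toy = pd.DataFrame({"nct_id": ["NCT001"], "enrollment": [120], "log_enrollment": [4.79]})

trimmed, missing = drop_and_report(toy, ["enrollment", "duration_bins"])
print(list(trimmed.columns))  # ['nct_id', 'log_enrollment']
print(missing)                # ['duration_bins']
```

Returning the missing names keeps the tolerant behavior of `errors="ignore"` while making skipped columns visible.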

In [3]:
# Create One-Hot Encoded Dataset for LogReg
# Drop grouped categorical columns, keep one-hot encoded flags
df_final_onehot = df_final.drop(columns=[c for c in df_final if c.endswith('grouped')])
print("One-Hot Dataset Shape:", df_final_onehot.shape)

One-Hot Dataset Shape: (263136, 79)


In [4]:
# Create Grouped Dataset for XGBoost
# Define the core columns we want to retain in grouped dataset
columns_to_keep = [
    'nct_id', 'overall_status',
    'number_of_arms', 'intervention_count',
    'has_multiple_intervention_types',
    'log_enrollment', 'log_duration',
    'high_enroll_flag_975', 'high_enroll_flag_99',
    'has_dmc', 'has_expanded_access', 'healthy_volunteers',
    'is_fda_regulated_drug', 'is_fda_regulated_device'
]

# Keep grouped categorical variables + selected numeric/flags
# .copy() makes this an independent frame rather than a view of df_final
df_final_grouped = df_final[columns_to_keep + [c for c in df_final.columns if c.endswith('grouped')]].copy()
print("Grouped Dataset Shape:", df_final_grouped.shape)

Grouped Dataset Shape: (263136, 24)
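
A CSV file does not store dtypes, so the `*_grouped` columns will come back as plain strings when a later notebook reloads them. The sketch below shows one way they might be restored to a categorical dtype, assuming the grouped columns hold string labels. The column name `sponsor_grouped` and all values are hypothetical.

```python
import pandas as pd

# Toy stand-in for df_final_grouped after reloading from CSV.
toy = pd.DataFrame({
    "nct_id": ["NCT001", "NCT002", "NCT003"],
    "sponsor_grouped": ["industry", "academic", "industry"],  # hypothetical column
    "log_enrollment": [4.1, 5.3, 3.9],
})

# Same suffix filter the notebook uses to pick the grouped columns.
grouped_cols = [c for c in toy.columns if c.endswith("grouped")]

# Cast only the grouped columns; other columns keep their dtypes.
toy = toy.astype({c: "category" for c in grouped_cols})
print(toy.dtypes["sponsor_grouped"])  # category
```

XGBoost's scikit-learn interface has an `enable_categorical` option for columns of this dtype; whether the modeling notebooks rely on it is not stated here.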


In [5]:
# Quick sanity check for null values
print("Nulls in One-Hot:", df_final_onehot.isnull().sum().sum())
print("Nulls in Grouped:", df_final_grouped.isnull().sum().sum())

Nulls in One-Hot: 0
Nulls in Grouped: 0
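
One caveat on this check: the file was loaded with `keep_default_na=False`, so empty CSV fields arrive as empty strings rather than `NaN`, and `isnull()` does not count them. A complementary check could count empty strings as well. The following is a sketch on a toy CSV (contents invented for illustration).

```python
import io
import pandas as pd

# Toy CSV with one empty field in the second data row.
csv_text = "nct_id,phase_grouped,has_dmc\nNCT001,phase_2,1\nNCT002,,0\n"
toy = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

null_count = int(toy.isnull().sum().sum())   # empty field is not NaN here
blank_count = int((toy == "").sum().sum())   # but it is an empty string
print(null_count, blank_count)  # 0 1
```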


In [6]:
# Save final datasets
df_final_onehot.to_csv('../data/final/df_final_onehot.csv', index=False)
df_final_grouped.to_csv('../data/final/df_final_grouped.csv', index=False)

print("✅ Final datasets saved:")
print("One-Hot Encoded:", df_final_onehot.shape, "→ ../data/final/df_final_onehot.csv")
print("Grouped:", df_final_grouped.shape, "→ ../data/final/df_final_grouped.csv")

✅ Final datasets saved:
One-Hot Encoded: (263136, 79) → ../data/final/df_final_onehot.csv
Grouped: (263136, 24) → ../data/final/df_final_grouped.csv
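
As an optional extra step, the saved files could be read back to confirm that shape and column order survive the round trip. The sketch below uses a toy frame and a temporary directory rather than the real `../data/final/` paths.

```python
import os
import tempfile
import pandas as pd

# Toy stand-in for one of the final datasets.
toy = pd.DataFrame({
    "nct_id": ["NCT001", "NCT002"],
    "has_dmc": [1, 0],
    "log_enrollment": [4.1, 5.3],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "df_final_toy.csv")  # hypothetical file name
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path, keep_default_na=False)

same_shape = reloaded.shape == toy.shape
same_columns = list(reloaded.columns) == list(toy.columns)
print(same_shape, same_columns)  # True True
```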


---

## Summary  
This notebook finalized the preprocessing workflow by generating **two clean, modeling-ready datasets**.  

Key outcomes:
- Dropped redundant or EDA-only columns.  
- Created separate inputs tailored for Logistic Regression (one-hot) and XGBoost (grouped).  
- Verified that neither dataset contains **missing values**.  
- Saved both datasets to the `../data/final/` directory.
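
A duplicate-identifier check is not shown in the cells above. The following is a minimal sketch of one, assuming `nct_id` is meant to be unique per row (toy data for illustration).

```python
import pandas as pd

# Toy frame with one repeated identifier.
toy = pd.DataFrame({"nct_id": ["NCT001", "NCT002", "NCT002"], "has_dmc": [1, 0, 0]})

# keep=False marks every row that shares its nct_id with another row.
dup_mask = toy["nct_id"].duplicated(keep=False)
n_dup_rows = int(dup_mask.sum())
print(n_dup_rows)                # 2
print(toy["nct_id"].is_unique)   # False
```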

These files represent the **final feature space** used across all subsequent modeling notebooks.

---

📂 **Next Phase:** `Model_LogReg.ipynb` & `Model_XGBoost.ipynb` → Train, evaluate, and interpret machine learning models using the prepared datasets.
