### Feature Processing


This section defines the exact transformation logic for every column in the final dataset (`project_data.csv`). These rules will be implemented in a scikit-learn `ColumnTransformer` during the Modeling Phase.

#### A. Target Variable (Passthrough)
*   **`target`**: The label (0/1). No processing required; separate it from the feature matrix before fitting.

#### B. Numerical Features (Scaling Required)
*   **Possible Strategy:** `SimpleImputer(strategy='median')` followed by `StandardScaler`.
*   **Log-Transformation:** Apply `np.log1p` before scaling to features with highly skewed, long-tailed distributions. These are marked "Log + Scale" below; the remaining features are only scaled.
    *   `num_facilities` (Log + Scale)
    *   `num_countries` (Log + Scale)
    *   `competition_niche` (Log + Scale)
    *   `competition_broad` (Log + Scale)
    *   `number_of_arms` (Scale)
    *   `start_year` (Scale)
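
The numerical rules above can be sketched as two small pipelines combined in a `ColumnTransformer`. This is a minimal illustration rather than the project's final code: the names `LOG_SCALE_COLS`, `SCALE_ONLY_COLS`, and `numeric_transformer`, and the toy values in `demo`, are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Column groups taken from the list above (variable names are illustrative).
LOG_SCALE_COLS = ["num_facilities", "num_countries",
                  "competition_niche", "competition_broad"]
SCALE_ONLY_COLS = ["number_of_arms", "start_year"]

# Impute with the median, compress the long tail with log1p, then standardize.
log_scale_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("log1p", FunctionTransformer(np.log1p)),
    ("scale", StandardScaler()),
])

# Same as above, without the log step.
scale_only_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

numeric_transformer = ColumnTransformer([
    ("log_scale", log_scale_pipeline, LOG_SCALE_COLS),
    ("scale_only", scale_only_pipeline, SCALE_ONLY_COLS),
])

# Toy data (made up) with a couple of missing values.
demo = pd.DataFrame({
    "num_facilities": [1, 10, 100, np.nan],
    "num_countries": [1, 2, 30, 1],
    "competition_niche": [0, 5, 50, 500],
    "competition_broad": [10, 100, 1000, 10000],
    "number_of_arms": [1, 2, 2, 4],
    "start_year": [2005, 2010, np.nan, 2020],
})
numeric_out = numeric_transformer.fit_transform(demo)
```

Imputing before the log step keeps `np.log1p` from seeing missing values. `np.log1p` also assumes the inputs are non-negative, which holds for counts such as `num_facilities`.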

#### C. Ordinal Features (Passthrough)
*   **Strategy:** Keep as numeric so that the order is preserved (1.0 < 2.0 < 3.0).
    *   `phase_ordinal`
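
In a `ColumnTransformer`, leaving a column untouched is expressed with the string `"passthrough"`. The sketch below assumes `phase_ordinal` has no missing values, since passthrough performs no imputation; the toy frame and the `other_col` column are made up for the example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer

# "passthrough" forwards the column unchanged; everything else is dropped here.
ordinal_transformer = ColumnTransformer(
    [("ordinal", "passthrough", ["phase_ordinal"])],
    remainder="drop",
)

# Toy data (made up); other_col only shows that unlisted columns are dropped.
demo = pd.DataFrame({
    "phase_ordinal": [1.0, 2.0, 3.0],
    "other_col": ["x", "y", "z"],
})
passed = ordinal_transformer.fit_transform(demo)
```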

#### D. Low-Cardinality Categorical (One-Hot Encoding)
*   **Strategy:** Apply `OneHotEncoder(handle_unknown='ignore')`.
*   **Imputation:** Fill missing values with the constant `"UNKNOWN"` or the column mode before encoding.
    *   `agency_class` (Industry, NIH, Other)
    *   `masking` (None, Double, Quadruple...)
    *   `primary_purpose` (Treatment, Prevention...)
    *   `intervention_model` (Parallel, Crossover...)
    *   `allocation` (Randomized, Non-Randomized, Unknown)
    *   `therapeutic_area` (Oncology, Cardiology...)
    *   `gender` (All, Female, Male)
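
A minimal sketch of the impute-then-encode rule, using the constant `"UNKNOWN"` option. The toy frames and the unseen category `"Network"` are made up for the example; only two of the listed columns are shown.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

categorical_pipeline = Pipeline([
    # Missing values become their own "UNKNOWN" category.
    ("impute", SimpleImputer(strategy="constant", fill_value="UNKNOWN")),
    # Categories not seen during fit are encoded as all zeros.
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

train = pd.DataFrame({
    "agency_class": ["Industry", "NIH", np.nan, "Other"],
    "gender": ["All", "Female", "Male", "All"],
})
# "Network" does not appear in the training data.
new_rows = pd.DataFrame({"agency_class": ["Network"], "gender": ["All"]})

categorical_pipeline.fit(train)
encoded_new = categorical_pipeline.transform(new_rows).toarray()
```

With `handle_unknown='ignore'`, the unseen `agency_class` value produces zeros in all four `agency_class` columns instead of raising an error at prediction time.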

#### E. Binary Flags (Binary Encoding)
*   **Strategy:** These are already 0/1 or True/False. Pass them through as 0/1, or apply `OneHotEncoder(drop='if_binary')` so that each flag yields a single column.
    *   `includes_us`
    *   `covid_exposure`
    *   `healthy_volunteers`
    *   `adult`
    *   `child`
    *   `older_adult`
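
The `drop='if_binary'` option can be illustrated as follows. This is a sketch on made-up values showing that a True/False flag and a 0/1 flag each collapse to one output column.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data (made up): one boolean flag and one 0/1 flag.
flags = pd.DataFrame({
    "includes_us": [True, False, True],
    "covid_exposure": [0, 1, 0],
})

# For two-category features, the first category (False / 0) is dropped,
# leaving one indicator column per flag.
binary_encoder = OneHotEncoder(drop="if_binary")
encoded_flags = binary_encoder.fit_transform(flags).toarray()
```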

#### F. High-Cardinality Categorical (Target Encoding)
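*   **Strategy:** Apply `TargetEncoder`, which replaces each category with the (smoothed) mean of `target` observed for that category. Fit it on training data only so that validation labels do not leak into the encoding.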
*   **Why:** Too many unique values for One-Hot Encoding.
    *   `therapeutic_subgroup_name` (~140 categories)
    *   `best_pathology` (~1,200 categories)

#### G. Unstructured Text (NLP Embeddings)
*   **Strategy:** Process via **BERT/Transformers** (Stream B) or **TF-IDF** to generate risk scores/embeddings.
    *   `criteria` (Primary Source of Complexity)
    *   `official_title` (Secondary)
    *   `brief_summary` (Secondary)
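
For the TF-IDF option, one way to wire text columns into a `ColumnTransformer` is sketched below. The vocabulary caps (`max_features`), the stop-word setting, and the toy documents are assumptions for illustration. The BERT/Transformers stream is not covered here.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents (made up); the third row has no criteria text.
docs = pd.DataFrame({
    "criteria": [
        "Inclusion Criteria: age 18 years or older. "
        "Exclusion Criteria: pregnant or breastfeeding.",
        "Inclusion Criteria: confirmed diagnosis of type 2 diabetes. "
        "Exclusion Criteria: prior insulin therapy.",
        None,
    ],
    "official_title": [
        "A Randomized Study of Drug A in Adults",
        "Safety and Efficacy of Drug B",
        "Observational Registry of Patients",
    ],
}).fillna("")  # TfidfVectorizer expects strings, not missing values

# Each text column gets its own vectorizer. The column is given as a plain
# string (not a list) so the vectorizer receives a 1-D sequence of documents.
text_transformer = ColumnTransformer([
    ("criteria_tfidf",
     TfidfVectorizer(stop_words="english", max_features=5000), "criteria"),
    ("title_tfidf",
     TfidfVectorizer(stop_words="english", max_features=1000), "official_title"),
])
text_features = text_transformer.fit_transform(docs)
```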

#### H. Excluded Fields (Drop)
*   **Strategy:** Drop from the training set entirely.
    *   **IDs:** `nct_id`
    *   **Data Leakage:** `overall_status`, `min_p_value`, `why_stopped`
    *   **Noise/Sparse:** `detailed_description` (41% missing, redundant with `criteria`)
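
A small sketch of the exclusion step with pandas. The helper name `drop_excluded` and the toy frame are hypothetical; the column names come from the list above.

```python
import pandas as pd

# Columns to exclude, as listed above.
EXCLUDED_COLS = [
    "nct_id",                # identifier
    "overall_status",        # leakage
    "min_p_value",           # leakage
    "why_stopped",           # leakage
    "detailed_description",  # sparse, redundant with criteria
]

def drop_excluded(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df without the excluded columns."""
    # errors="ignore" tolerates frames where some columns are already absent.
    return df.drop(columns=EXCLUDED_COLS, errors="ignore")

# Toy data (made up) containing only some of the excluded columns.
raw = pd.DataFrame({
    "nct_id": ["id_a", "id_b"],
    "overall_status": ["Completed", "Terminated"],
    "num_facilities": [3, 12],
    "target": [0, 1],
})
features = drop_excluded(raw)
leaked = set(EXCLUDED_COLS) & set(features.columns)
```

Checking that `leaked` is empty before training is a cheap guard against a leakage column slipping back in after an upstream schema change.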