# Clinical Trial Risk Engine: Data Engineering Pipeline

**Project:** Premature Termination Risk Prediction
**Output Artifact:** `project_data.csv`
**Scope:** Interventional Drug Trials (Phases 1, 2, and 3).
**Temporal Filter:** Trials with a start year from 2000 through the current year plus two.

## Data Dictionary and Feature Rationale

This dataset aggregates data from the AACT (Aggregate Analysis of ClinicalTrials.gov) database to predict trial termination.

### 1. Target Variable
*   **`target`** (Binary): The classification target.
    *   `0`: **Completed**. The trial concluded according to protocol.
    *   `1`: **Failed**. The trial was Terminated, Withdrawn, or Suspended.
*   **`overall_status`** (String): The raw status label. *Note: This column is retained for validation but must be dropped prior to training to prevent data leakage.*
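
A minimal sketch of that drop step, assuming the exported file has been loaded into a DataFrame. The names `NON_FEATURE_COLS` and `split_features_target` are illustrative and not part of the pipeline; the extra columns mirror the `EXCLUDED` group defined in the audit step.

```python
import pandas as pd

# Hypothetical list: columns that reveal the outcome (or are identifiers)
# and therefore should not reach the model.
NON_FEATURE_COLS = ["overall_status", "why_stopped", "min_p_value", "nct_id"]

def split_features_target(df: pd.DataFrame, target_col: str = "target"):
    """Return (X, y) with the target and non-feature columns removed from X."""
    y = df[target_col].astype(int)
    X = df.drop(columns=[target_col, *NON_FEATURE_COLS], errors="ignore")
    return X, y

# Toy rows standing in for project_data.csv
toy = pd.DataFrame({
    "nct_id": ["NCT00000001", "NCT00000002"],
    "overall_status": ["COMPLETED", "TERMINATED"],
    "why_stopped": [None, "Slow accrual"],
    "phase_ordinal": [2.0, 3.0],
    "target": [0, 1],
})
X, y = split_features_target(toy)
print(list(X.columns))
```

Passing `errors="ignore"` keeps the helper usable even if some of the listed columns were already removed upstream.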

### 2. Operational Features (Complexity Proxies)
*   **`is_international`** (Binary): `1` if the trial lists sites in more than one country. Indicates regulatory complexity (e.g., FDA, EMA, PMDA coordination). The raw country count is computed in the pipeline but dropped before export as a leakage precaution.
*   **`phase_ordinal`** (Float): A numeric mapping of the trial phase (Phase 1 = 1.0, Phase 1/2 = 1.5, Phase 2 = 2.0, Phase 2/3 = 2.5, Phase 3 = 3.0). Used as a proxy for trial magnitude and resource requirements.
*   **`number_of_arms`** (Integer): The number of intervention groups. Indicates protocol complexity.
*   **`start_year`** (Integer): The year the trial began. Used to capture temporal trends in clinical research standards.

### 3. Eligibility and Protocol Features
*   **`txt_criteria`** (Text): The full inclusion and exclusion criteria, copied from the AACT `criteria` field. This unstructured text contains high-value signals regarding protocol restrictiveness.
*   **`gender`** (Categorical): Indicates if the trial is restricted by sex (Female, Male, or All).
*   **`healthy_volunteers`** (Binary): Indicates if healthy participants are accepted. Often distinguishes Phase 1 safety trials (Yes) from Phase 2/3 efficacy trials (No).
*   **`adult`, `child`, `older_adult`** (Binary): Pre-calculated flags indicating the target age groups. Used in place of raw numeric age limits to reduce noise and missing data.

### 4. Scientific and Design Features
*   **`therapeutic_area`** (Categorical): High-level medical classification (e.g., Oncology, Cardiology).
*   **`therapeutic_subgroup_name`** (Categorical): Granular disease classification (e.g., Neoplasms by Site).
*   **`intervention_model`** (Categorical): The strategy for assigning interventions (e.g., Parallel, Crossover, Single Group).
*   **`masking`** (Categorical): The level of blinding (e.g., Double, Quadruple, None).
*   **`allocation`** (Categorical): Indicates if participants are randomized.

### 5. Environmental Context (Competition)
*   **`competition_niche`** (Integer): The count of cohort trials with the *same* Phase group and *same* Therapeutic Subgroup that start in the trial's start year or the following two years (the trial itself included). Represents direct competition for specific patient populations.
*   **`competition_broad`** (Integer): The same count taken over the *same* Therapeutic Area and *same* Phase group. Represents broader resource saturation.
*   **`covid_exposure`** (Binary): `1` if the trial's start year falls within the 2019-2021 global disruption window.

### 6. Sponsor Information
*   **`agency_class`** (Categorical): The type of lead sponsor (Industry, NIH, Other). Industry trials typically have different risk profiles compared to academic or government-funded studies.
*   **`includes_us`** (Binary): Indicates if the trial has at least one site in the United States, which typically implies FDA regulatory oversight.

### 1. Configuration and Path Management
This block establishes the environment settings. It checks a short list of relative locations for the `data` folder so the code runs from the project root or from a notebook subdirectory, and falls back to a hard-coded absolute path otherwise. It also defines shared `read_csv` parameters for the pipe-delimited AACT text files (quoting rules, string dtypes, and a warning on malformed lines).
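
The same lookup can be written as a loop over candidate locations, which avoids the machine-specific fallback. This is a sketch only: `find_data_dir` is a hypothetical helper, and the candidate list simply repeats the relative paths used in the cell below.

```python
from pathlib import Path
from typing import Iterable, Optional

def find_data_dir(candidates: Iterable[str], base: Optional[Path] = None) -> Optional[Path]:
    """Return the first candidate (relative to `base`) that is an existing directory."""
    base = Path.cwd() if base is None else Path(base)
    for rel in candidates:
        path = (base / rel).resolve()
        if path.is_dir():
            return path
    return None  # caller decides how to handle a missing folder

data_dir = find_data_dir(["data", "../../data", "../data"])
print(data_dir if data_dir is not None else "data folder not found; set DATA_PATH explicitly")
```

Returning `None` rather than a fixed absolute path makes a missing folder visible immediately instead of failing later at the first `read_csv` call.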

In [17]:
import pandas as pd
import numpy as np
import os
import csv

# -----------------------------------------------------------------------------
# 1. CONFIGURATION & ROBUST PATH SETUP
# -----------------------------------------------------------------------------

# Dynamic Path Finding: Looks for the 'data' folder relative to where you are
if os.path.exists("data"):
    DATA_PATH = "data"                        # Scenario: Running from Project Root
elif os.path.exists("../../data"):
    DATA_PATH = "../../data"                  # Scenario: Running from notebooks/nicolas
elif os.path.exists("../data"):
    DATA_PATH = "../data"                     # Scenario: Running from notebooks/
else:
    # Fallback: Absolute path (Only used if the above fail)
    DATA_PATH = "/home/delaunan/code/delaunan/clintrialpredict/data"

print(f">>> DATA_PATH set to: {os.path.abspath(DATA_PATH)}")

OUTPUT_FILE = 'project_data.csv'

# ROBUST LOADING PARAMETERS
AACT_LOAD_PARAMS = {
    "sep": "|",
    "dtype": str,
    "header": 0,
    "quotechar": '"',
    "quoting": csv.QUOTE_MINIMAL,
    "low_memory": False,
    "on_bad_lines": "warn"
}

print(">>> Setup Complete. Ready to process.")

>>> DATA_PATH set to: /home/delaunan/code/delaunan/clintrialpredict/data
>>> Setup Complete. Ready to process.


### 2. Data Loading and Cohort Filtering
This step defines the study cohort. We apply strict inclusion and exclusion criteria to ensure data quality:
1.  **Study Type:** Retain only `INTERVENTIONAL` trials.
2.  **Intervention:** Retain only `DRUG` or `BIOLOGICAL` trials.
3.  **Status:** Retain only closed or halted statuses (`COMPLETED`, `TERMINATED`, `WITHDRAWN`, `SUSPENDED`).
4.  **Phase:** Exclude Early Phase 1 (Phase 0), Phase 4, and trials with no applicable phase to focus on the core development pipeline.
5.  **Temporal Validity:** Retain trials with a start year from 2000 through the current year plus two, removing invalid dates (e.g., 1900), unparseable dates, and far-future placeholder records.
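
One detail worth checking when filtering on `phase`: by default, `pandas.read_csv` treats the literal string `NA` as a missing value, even with `dtype=str`. The sketch below uses a made-up two-row file to show the effect. Whether the AACT export used here encodes non-applicable phases as `NA` is an assumption to verify against the raw file.

```python
import io
import pandas as pd

# Made-up pipe-delimited sample in the same layout as studies.txt
raw = "nct_id|phase\nNCT00000001|PHASE2\nNCT00000002|NA\n"

# Default parsing: the string "NA" becomes NaN, so isin(['NA']) cannot match it.
default = pd.read_csv(io.StringIO(raw), sep="|", dtype=str)
print(default["phase"].isna().tolist())        # [False, True]
print(default["phase"].isin(["NA"]).tolist())  # [False, False]

# keep_default_na=False keeps "NA" as a literal string.
literal = pd.read_csv(io.StringIO(raw), sep="|", dtype=str, keep_default_na=False)
print(literal["phase"].isin(["NA"]).tolist())  # [False, True]
```

Rows with a missing phase that pass this stage are still removed later, when phases without an ordinal mapping are dropped.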

In [18]:
# -----------------------------------------------------------------------------
# 2. THE FUNNEL: LOADING & FILTERING (Updated with Year Filter)
# -----------------------------------------------------------------------------
print(">>> Loading Studies & Applying Filters...")

# A. Load Studies
cols_studies = [
    'nct_id', 'overall_status', 'study_type', 'phase',
    'start_date', 'start_date_type',
    'number_of_arms', 'official_title', 'why_stopped'
]
df = pd.read_csv(os.path.join(DATA_PATH, 'studies.txt'), usecols=cols_studies, **AACT_LOAD_PARAMS)

# B. Filter: Interventional Only
df = df[df['study_type'] == 'INTERVENTIONAL'].copy()

# C. Filter: Drugs Only
df_int = pd.read_csv(os.path.join(DATA_PATH, 'interventions.txt'), usecols=['nct_id', 'intervention_type'], **AACT_LOAD_PARAMS)
drug_ids = df_int[df_int['intervention_type'].str.upper().isin(['DRUG', 'BIOLOGICAL'])]['nct_id'].unique()
df = df[df['nct_id'].isin(drug_ids)]

# D. Filter: Closed Statuses Only
allowed_statuses = ['COMPLETED', 'TERMINATED', 'WITHDRAWN', 'SUSPENDED']
df = df[df['overall_status'].isin(allowed_statuses)]

# E. Filter: Exclude Phase 0 and Phase 4 (Refined Scope)
excluded_phases = ['EARLY_PHASE1', 'PHASE4', 'NA']
df = df[~df['phase'].isin(excluded_phases)]

# F. Create Target & Fix Dates
df['target'] = df['overall_status'].apply(lambda x: 0 if x == 'COMPLETED' else 1)
df['start_date'] = pd.to_datetime(df['start_date'], errors='coerce')
df['start_year'] = df['start_date'].dt.year

# --- NEW FILTER: VALID YEARS ONLY ---
# Drop placeholder years (e.g., 1900), unparseable dates (NaT), and starts more than 2 years ahead
current_year = pd.Timestamp.now().year
df = df[df['start_year'].between(2000, current_year + 2)]

print(f"   - Core Cohort Size (Phases 1-3, Years 2000-{current_year+2}): {len(df)} trials")

>>> Loading Studies & Applying Filters...
   - Core Cohort Size (Phases 1-3, Years 2000-2027): 119201 trials


### 3. Medical Hierarchy Integration
This block enriches the dataset with standardized medical classifications. By mapping `nct_id` to the MeSH (Medical Subject Headings) hierarchy, we derive:
*   **`therapeutic_area`:** The broad medical category.
*   **`therapeutic_subgroup`:** The specific disease family.

This hierarchical approach allows the model to learn risk patterns associated with specific medical fields (e.g., Oncology vs. Infectious Diseases).
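
The subgroup code is the first seven characters of the MeSH tree number, which corresponds to two levels of the hierarchy. A minimal sketch of that truncation follows; `subgroup_code` is a hypothetical helper and the tree numbers are illustrative inputs.

```python
def subgroup_code(tree_number, width: int = 7) -> str:
    """Truncate a MeSH tree number to its first `width` characters (two levels).

    Non-string values (None, NaN) and codes shorter than `width` map to 'Unknown'.
    """
    if isinstance(tree_number, str) and len(tree_number) >= width:
        return tree_number[:width]
    return "Unknown"

for code in ["C04.557.337", "C04.557", "C04", None, float("nan")]:
    print(repr(code), "->", subgroup_code(code))
```

Checking the type explicitly avoids relying on the string `'nan'` being shorter than seven characters.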


In [19]:
# -----------------------------------------------------------------------------
# 3. MEDICAL HIERARCHY & SUBGROUPS
# -----------------------------------------------------------------------------
print(">>> Attaching Medical Hierarchy...")

# A. Load Smart Lookup (Best Term per Trial)
df_smart = pd.read_csv(os.path.join(DATA_PATH, 'smart_pathology_lookup.csv'))
df = df.merge(df_smart, on='nct_id', how='left')

# B. Fill Missing
df['therapeutic_area'] = df['therapeutic_area'].fillna('Other/Unclassified')
df['best_pathology'] = df['best_pathology'].fillna('Unknown')

# C. Create Subgroup Code (Level 2 Hierarchy)
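# Check the type directly: after astype(str), NaN becomes the string 'nan' and pd.notna() is always True
df['therapeutic_subgroup'] = df['tree_number'].apply(
    lambda x: x[:7] if isinstance(x, str) and len(x) >= 7 else 'Unknown'
)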

# D. Map Subgroup Code to Name
df_lookup = pd.read_csv(os.path.join(DATA_PATH, 'mesh_lookup.csv'), sep='|')
code_to_name = pd.Series(df_lookup.mesh_term.values, index=df_lookup.tree_number).to_dict()
df['therapeutic_subgroup_name'] = df['therapeutic_subgroup'].map(code_to_name).fillna('Unknown Subgroup')

>>> Attaching Medical Hierarchy...


### 4. Competition Intensity Calculation
This block calculates "Crowding" metrics to quantify the competitive environment. For each trial, we count cohort trials in the same group that start within a 3-year forward window (Start Year, +1, +2); the count includes the trial itself.
*   **`competition_broad`:** Measures saturation within the same therapeutic area and phase group.
*   **`competition_niche`:** Measures direct competition for the specific patient population (same Subgroup and Phase group).
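
The window logic can be illustrated on a toy cohort using only the standard library. The trials and the helper name `window_count` are made up for the example; the grouping keys follow the broad-competition definition (year, area, phase group).

```python
from collections import Counter

def window_count(counts: Counter, year: int, group: tuple, width: int = 3) -> int:
    """Sum trial counts for `group` over year, year + 1, ..., year + width - 1."""
    return sum(counts.get((year + offset, *group), 0) for offset in range(width))

# Toy cohort: (start_year, therapeutic_area, phase_group)
trials = [
    (2010, "Oncology", "PHASE2"),
    (2010, "Oncology", "PHASE2"),
    (2011, "Oncology", "PHASE2"),
    (2012, "Oncology", "PHASE3"),
    (2013, "Oncology", "PHASE2"),
]
counts = Counter(trials)

for year, area, phase in trials:
    print(year, area, phase, window_count(counts, year, (area, phase)))
```

Each 2010 Phase 2 trial scores 3 (two trials in 2010, one in 2011, none in 2012), which shows that a trial counts itself and that only later start years contribute.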

In [20]:
# -----------------------------------------------------------------------------
# 4. DUAL-LEVEL CROWDING (Niche vs Broad)
# -----------------------------------------------------------------------------
print(">>> Calculating Competition Intensity (Dual Level)...")

# A. Standardize Phase for Grouping
phase_group_map = {
    'PHASE1': 'PHASE1', 'PHASE1/PHASE2': 'PHASE2',
    'PHASE2': 'PHASE2', 'PHASE2/PHASE3': 'PHASE3', 'PHASE3': 'PHASE3'
}
df['phase_group'] = df['phase'].map(phase_group_map).fillna('UNKNOWN')

# --- LEVEL 1: BROAD COMPETITION (Area + Phase) ---
grid_broad = df.groupby(['start_year', 'therapeutic_area', 'phase_group']).size().reset_index(name='count')
dict_broad = dict(zip(zip(grid_broad['start_year'], grid_broad['therapeutic_area'], grid_broad['phase_group']), grid_broad['count']))

def get_broad_crowding(row):
    y, area, ph = row['start_year'], row['therapeutic_area'], row['phase_group']
    if pd.isna(y): return 0
    return dict_broad.get((y, area, ph), 0) + dict_broad.get((y+1, area, ph), 0) + dict_broad.get((y+2, area, ph), 0)

df['competition_broad'] = df.apply(get_broad_crowding, axis=1)

# --- LEVEL 2: NICHE COMPETITION (Subgroup + Phase) ---
grid_niche = df.groupby(['start_year', 'therapeutic_subgroup', 'phase_group']).size().reset_index(name='count')
dict_niche = dict(zip(zip(grid_niche['start_year'], grid_niche['therapeutic_subgroup'], grid_niche['phase_group']), grid_niche['count']))

def get_niche_crowding(row):
    y, sub, ph = row['start_year'], row['therapeutic_subgroup'], row['phase_group']
    if pd.isna(y) or sub == 'Unknown': return 0
    return dict_niche.get((y, sub, ph), 0) + dict_niche.get((y+1, sub, ph), 0) + dict_niche.get((y+2, sub, ph), 0)

df['competition_niche'] = df.apply(get_niche_crowding, axis=1)

df.drop(columns=['phase_group'], inplace=True)
print("   - Created 'competition_broad' and 'competition_niche'")

>>> Calculating Competition Intensity (Dual Level)...
   - Created 'competition_broad' and 'competition_niche'


### 5. Protocol Details and Eligibility
This block extracts key protocol design features.
*   **Eligibility Flags:** We incorporate categorical flags (`gender`, `healthy_volunteers`, `adult`, `child`, `older_adult`) to characterize the study population.
*   **Text Data:** We extract the full `criteria` text block for downstream Natural Language Processing (NLP).
*   **Endpoints:** We read the number of primary endpoints from `calculated_values.txt` as a proxy for scientific complexity, defaulting to 1 when the value is missing.
*   **Analysis Data:** We extract `min_p_value` for post-hoc analysis (excluded from training to prevent leakage).
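
Every enrichment in this step is a left merge on `nct_id`, which silently duplicates trials if a lookup table carries more than one row per key. One way to guard against that, sketched here on toy frames, is the `validate` argument of `DataFrame.merge`. The frames below are invented; whether each AACT table is unique on `nct_id` should be checked rather than assumed.

```python
import pandas as pd
from pandas.errors import MergeError

studies = pd.DataFrame({"nct_id": ["NCT00000001", "NCT00000002"]})
# Toy lookup with an accidental duplicate key
eligibility = pd.DataFrame({
    "nct_id": ["NCT00000001", "NCT00000001", "NCT00000002"],
    "gender": ["ALL", "ALL", "FEMALE"],
})

try:
    studies.merge(eligibility, on="nct_id", how="left", validate="many_to_one")
    duplicated_keys = False
except MergeError:
    duplicated_keys = True
print("duplicate keys detected:", duplicated_keys)

# After de-duplicating the lookup, the merge preserves the row count.
merged = studies.merge(eligibility.drop_duplicates("nct_id"),
                       on="nct_id", how="left", validate="many_to_one")
print(len(merged) == len(studies))
```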


In [21]:
# -----------------------------------------------------------------------------
# 5. PROTOCOL DETAILS (Eligibility, Endpoints) - UPDATED
# -----------------------------------------------------------------------------
print(">>> Extracting Eligibility & Endpoints...")

# A. Load Eligibility Fields (No Age Parsing)
# We rely on the pre-calculated flags (adult/child/older_adult) instead of parsing numbers.
cols_elig = [
    'nct_id',
    'criteria',
    'gender', 'healthy_volunteers',
    'adult', 'child', 'older_adult'
]

df_elig = pd.read_csv(os.path.join(DATA_PATH, 'eligibilities.txt'),
                      usecols=cols_elig,
                      **AACT_LOAD_PARAMS)

# B. Merge into Main DataFrame
df = df.merge(df_elig, on='nct_id', how='left')

# C. Endpoint Counts
df_calc = pd.read_csv(os.path.join(DATA_PATH, 'calculated_values.txt'),
                      usecols=['nct_id', 'number_of_primary_outcomes_to_measure'],
                      **AACT_LOAD_PARAMS)
df = df.merge(df_calc, on='nct_id', how='left')
df['num_primary_endpoints'] = pd.to_numeric(df['number_of_primary_outcomes_to_measure'], errors='coerce').fillna(1)

# D. P-Values (Analysis Only)
df_outcomes = pd.read_csv(os.path.join(DATA_PATH, 'outcomes.txt'), usecols=['id', 'nct_id', 'outcome_type'], **AACT_LOAD_PARAMS)
prim_ids = df_outcomes[df_outcomes['outcome_type'] == 'PRIMARY']['id'].unique()

df_an = pd.read_csv(os.path.join(DATA_PATH, 'outcome_analyses.txt'), usecols=['outcome_id', 'p_value'], **AACT_LOAD_PARAMS)
df_an = df_an[df_an['outcome_id'].isin(prim_ids)]
df_an['p_value_num'] = pd.to_numeric(df_an['p_value'], errors='coerce')

min_p = df_an.groupby('outcome_id')['p_value_num'].min().reset_index()
min_p = min_p.merge(df_outcomes[['id', 'nct_id']], left_on='outcome_id', right_on='id')
trial_p = min_p.groupby('nct_id')['p_value_num'].min().reset_index(name='min_p_value')

df = df.merge(trial_p, on='nct_id', how='left')

>>> Extracting Eligibility & Endpoints...


### 6. Operational Proxies and Sponsor Data
This block merges operational and administrative features:
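*   **Phase Mapping:** Converts text phases to a numeric ordinal scale (1.0–3.0, with 1.5 and 2.5 for combined phases). Trials whose phase is missing or not in the map are dropped here. This is the only row filter after step 2, so it accounts for the drop from the 119,201 trials reported there to the 105,884 rows exported.
*   **Geography:** Counts the unique countries per trial to flag multi-country trials (`is_international`) and flags trials with a US site (`includes_us`). The raw country count is then dropped.
*   **Sponsor:** Identifies the lead agency class (e.g., Industry vs. Other).
*   **External Factors:** Sets `covid_exposure` to 1 when the trial's start year falls within the 2019-2021 pandemic window.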
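
A small worked example of the geography flags, using an invented country table. The column name `n_countries` is illustrative (the pipeline calls the intermediate count `cnt`).

```python
import pandas as pd

# Toy stand-in for countries.txt: one row per (trial, country)
countries = pd.DataFrame({
    "nct_id": ["NCT00000001", "NCT00000001", "NCT00000002", "NCT00000003"],
    "name": ["United States", "Canada", "France", "United States"],
})

stats = countries.groupby("nct_id").agg(
    n_countries=("name", "nunique"),
    includes_us=("name", lambda s: int((s == "United States").any())),
).reset_index()
stats["is_international"] = (stats["n_countries"] > 1).astype(int)
print(stats)
```

In the pipeline, trials with no rows in the country table get a missing count after the left merge; because `NaN > 1` evaluates to `False`, they end up with `is_international = 0`, and `includes_us` is filled with 0.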

In [22]:
# -----------------------------------------------------------------------------
# 6. OPERATIONAL PROXIES, SPONSORS & EXTERNAL FACTORS (Updated)
# -----------------------------------------------------------------------------
print(">>> Merging Operational Features & Calculating COVID Exposure...")

# A. Phase Ordinal
phase_map = {'PHASE1': 1, 'PHASE1/PHASE2': 1.5, 'PHASE2': 2, 'PHASE2/PHASE3': 2.5, 'PHASE3': 3}
df['phase_ordinal'] = df['phase'].map(phase_map).fillna(0)
df = df[df['phase_ordinal'] > 0] # Drop unknown phases

# B. COVID Exposure
df['covid_exposure'] = df['start_year'].between(2019, 2021).astype(int)

# C. Geography (International Flag + US Flag)
# Logic:
# 1. International = Sites in >1 unique country.
# 2. Includes US = 'United States' is listed as a location.
df_countries = pd.read_csv(os.path.join(DATA_PATH, 'countries.txt'), usecols=['nct_id', 'name'], **AACT_LOAD_PARAMS)

country_stats = df_countries.groupby('nct_id')['name'].agg(
    cnt='nunique',
    includes_us=lambda x: 1 if 'United States' in x.values else 0
).reset_index()

df = df.merge(country_stats, on='nct_id', how='left')

# Create the flags
df['is_international'] = (df['cnt'] > 1).astype(int)
df['includes_us'] = df['includes_us'].fillna(0).astype(int)

# Drop the raw count (leakage prevention)
df.drop(columns=['cnt'], inplace=True)

# D. Sponsors & Design
df_sponsors = pd.read_csv(os.path.join(DATA_PATH, 'sponsors.txt'), **AACT_LOAD_PARAMS)
df_lead = df_sponsors[df_sponsors['lead_or_collaborator'] == 'lead'][['nct_id', 'agency_class']].drop_duplicates('nct_id')
df = df.merge(df_lead, on='nct_id', how='left')

cols_des = ['nct_id', 'allocation', 'intervention_model', 'masking', 'primary_purpose']
df_des = pd.read_csv(os.path.join(DATA_PATH, 'designs.txt'), usecols=cols_des, **AACT_LOAD_PARAMS)
df = df.merge(df_des, on='nct_id', how='left')

print("   - Created 'is_international' and 'includes_us' flags.")
print("   - Dropped leaky counts (facilities/countries).")

>>> Merging Operational Features & Calculating COVID Exposure...
   - Created 'is_international' and 'includes_us' flags.
   - Dropped leaky counts (facilities/countries).


### 7. Text Integration and Data Export
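This final processing step builds two text features and exports the dataset. Keywords and the names and descriptions of drug or biological interventions are aggregated per trial and combined with the official title into `txt_tags`; the eligibility criteria are kept separate as `txt_criteria`. Temporary technical columns are then dropped and the result is exported as `project_data.csv`.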
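
Because missing fragments are filled with empty strings before concatenation, a trial without keywords or intervention descriptions ends up with consecutive separators in `txt_tags` (e.g. `Title |  | Drug X | `). If that matters for tokenization, one option is to join only the non-empty fragments. The helper `join_nonempty` is hypothetical and not part of the pipeline.

```python
def join_nonempty(parts, sep: str = " | ") -> str:
    """Join the non-empty string fragments with `sep`, skipping None/NaN/blank values."""
    cleaned = [p.strip() for p in parts if isinstance(p, str) and p.strip()]
    return sep.join(cleaned)

print(join_nonempty(["A Phase 2 Study of Drug X", "", "Drug X", None]))
# A Phase 2 Study of Drug X | Drug X
```

Applied row-wise to the four source columns, this would replace the fixed four-part concatenation at the cost of a slower `apply`.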

In [23]:
# -----------------------------------------------------------------------------
# 7. TEXT FEATURE ENGINEERING ("Tags" vs "Complexity") - REFINED FOR SHAP
# -----------------------------------------------------------------------------
print(">>> Engineering Text Features...")

# A. Load Components
# 1. Keywords
df_keys = pd.read_csv(os.path.join(DATA_PATH, 'keywords.txt'), usecols=['nct_id', 'name'], **AACT_LOAD_PARAMS)
keys_grouped = df_keys.groupby('nct_id')['name'].apply(lambda x: " | ".join(x.dropna().astype(str))).reset_index(name='txt_keywords')

# 2. Intervention Details
df_int_det = pd.read_csv(os.path.join(DATA_PATH, 'interventions.txt'),
                         usecols=['nct_id', 'intervention_type', 'name', 'description'],
                         **AACT_LOAD_PARAMS)
# Filter for drugs only
df_int_det = df_int_det[df_int_det['intervention_type'].str.upper().isin(['DRUG', 'BIOLOGICAL'])]

int_names = df_int_det.groupby('nct_id')['name'].apply(lambda x: " | ".join(x.dropna().astype(str))).reset_index(name='txt_int_names')
int_desc = df_int_det.groupby('nct_id')['description'].apply(lambda x: " ".join(x.dropna().astype(str))).reset_index(name='txt_int_desc')

# B. Merge Components
df = df.merge(keys_grouped, on='nct_id', how='left')
df = df.merge(int_names, on='nct_id', how='left')
df = df.merge(int_desc, on='nct_id', how='left')

# Fill NaNs before combining
text_cols = ['official_title', 'txt_keywords', 'txt_int_names', 'criteria', 'txt_int_desc']
df[text_cols] = df[text_cols].fillna("")

# C. Create Final Features (Separated for Explainability)

# 1. TAGS (The "What") -> TF-IDF
# We include Intervention Description here because it contains factual keywords (e.g., "Intravenous")
print("   - Creating 'txt_tags' (Title + Keywords + Drug Names + Int. Desc)...")
df['txt_tags'] = (
    df['official_title'] + " | " +
    df['txt_keywords'] + " | " +
    df['txt_int_names'] + " | " +
    df['txt_int_desc']
)

# 2. COMPLEXITY (The "How Hard") -> BERT
# We keep Criteria ISOLATED. This allows SHAP to specifically blame "Strict Criteria" for failure.
print("   - Creating 'txt_criteria' (Inclusion/Exclusion Rules only)...")
df['txt_criteria'] = df['criteria']

# D. Cleanup
cols_to_drop = [
    'start_date', 'start_date_type', 'tree_number', 'number_of_primary_outcomes_to_measure',
    'official_title', 'txt_keywords', 'txt_int_names', 'criteria', 'txt_int_desc'
]
df.drop(columns=cols_to_drop, inplace=True, errors='ignore')

# E. Save
df.to_csv(os.path.join(DATA_PATH, OUTPUT_FILE), index=False, quoting=csv.QUOTE_MINIMAL)

print(f"\n>>> SUCCESS: Final Dataset saved to {OUTPUT_FILE}")
print(f"    Rows: {len(df)}")
print(f"    Columns: {len(df.columns)}")

>>> Engineering Text Features...
   - Creating 'txt_tags' (Title + Keywords + Drug Names + Int. Desc)...
   - Creating 'txt_criteria' (Inclusion/Exclusion Rules only)...

>>> SUCCESS: Final Dataset saved to project_data.csv
    Rows: 105884
    Columns: 32


In [25]:
df.columns

Index(['nct_id', 'study_type', 'overall_status', 'phase', 'number_of_arms',
       'why_stopped', 'target', 'start_year', 'best_pathology',
       'therapeutic_area', 'therapeutic_subgroup', 'therapeutic_subgroup_name',
       'competition_broad', 'competition_niche', 'gender',
       'healthy_volunteers', 'adult', 'child', 'older_adult',
       'num_primary_endpoints', 'min_p_value', 'phase_ordinal',
       'covid_exposure', 'includes_us', 'is_international', 'agency_class',
       'allocation', 'intervention_model', 'primary_purpose', 'masking',
       'txt_tags', 'txt_criteria'],
      dtype='object')

### 8. Quality Assurance and Encoding Strategy
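This block audits the final dataset. For each column listed in the hand-defined `STRATEGY_MAP`, it records the missing rate plus group-specific statistics (word counts for text, skew and zero share for numeric fields, cardinality and dominant-category share for categoricals) and writes them to `audit_dataset_report.txt` next to the proposed encoding and scaling. The report is meant to confirm or revise the planned treatment of each field in the machine learning pipeline (e.g., One-Hot Encoding, Target Encoding, or NLP transformation).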

In [24]:
import pandas as pd
import numpy as np
import os

# -----------------------------------------------------------------------------
# 8. FINAL DATASET AUDIT & PIPELINE BLUEPRINT
# -----------------------------------------------------------------------------
INPUT_FILE = 'project_data.csv'
OUTPUT_REPORT = 'audit_dataset_report.txt'

# Define the Strategy Logic
STRATEGY_MAP = {
    "TARGET": {
        "columns": ["target"],
        "encoding": "Label (None)",
        "scaling": "None",
        "why": "Outcome variable (0=Completed, 1=Failed)."
    },
    "NUMERIC_SKEWED": {
        "columns": ["number_of_arms", "start_year", "competition_niche", "competition_broad"],
        "encoding": "Numeric (Passthrough)",
        "scaling": "Log1p + StandardScaler",
        "why": "High Skewness expected. Log1p compresses outliers; Scaler normalizes range."
    },
    "ORDINAL": {
        "columns": ["phase_ordinal"],
        "encoding": "Numeric (Passthrough)",
        "scaling": "MinMax (Optional) or None",
        "why": "Ordinal nature (1 < 2 < 3). Preserving magnitude is important."
    },
    "CATEGORICAL_BINARY": {
        "columns": ["includes_us", "is_international", "covid_exposure", "healthy_volunteers", "adult", "child", "older_adult"],
        "encoding": "OneHotEncoder (drop='if_binary')",
        "scaling": "None",
        "why": "Already Boolean. Dropping one column prevents multicollinearity."
    },
    "CATEGORICAL_NOMINAL": {
        "columns": ["gender", "agency_class", "masking", "allocation", "intervention_model", "primary_purpose", "therapeutic_area"],
        "encoding": "OneHotEncoder (handle_unknown='ignore')",
        "scaling": "None",
        "why": "Low cardinality (<50). One-Hot is interpretable for SHAP."
    },
    "CATEGORICAL_HIGH_CARD": {
        "columns": ["therapeutic_subgroup_name", "best_pathology"],
        "encoding": "TargetEncoder",
        "scaling": "None",
        "why": "High cardinality (>50). One-Hot would create sparse matrices. Target Encoding captures risk probability."
    },
    "TEXT_TAGS": {
        "columns": ["txt_tags"],
        "encoding": "TF-IDF Vectorizer (Top 50)",
        "scaling": "None",
        "why": "Bag-of-words approach. Identifies specific topics (e.g., 'Placebo', 'Oncology')."
    },
    "TEXT_COMPLEXITY": {
        "columns": ["txt_criteria"],
        "encoding": "BERT Embeddings (Stream B)",
        "scaling": "None",
        "why": "Semantic complexity. BERT captures the difficulty of the protocol rules."
    },
    "EXCLUDED": {
        "columns": ["overall_status", "min_p_value", "why_stopped", "nct_id"],
        "encoding": "DROP",
        "scaling": "None",
        "why": "Data Leakage (Future Info) or ID columns."
    }
}

def run_blueprint_audit():
    print(f">>> STARTING PIPELINE BLUEPRINT AUDIT ON: {INPUT_FILE}...")
    file_path = os.path.join(DATA_PATH, INPUT_FILE)
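    if not os.path.exists(file_path):
        # Report the problem instead of returning silently
        print(f"ERROR: {file_path} not found. Run the export step first.")
        return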

    df = pd.read_csv(file_path, low_memory=False)

    with open(os.path.join(DATA_PATH, OUTPUT_REPORT), 'w', encoding='utf-8') as f:
        f.write("PIPELINE BLUEPRINT & DATA AUDIT\n")
        f.write(f"Rows: {len(df):,}\nCols: {len(df.columns)}\n")
        f.write("="*80 + "\n")

        for group, meta in STRATEGY_MAP.items():
            f.write(f"\n[{group}]\n")
            f.write(f"  > Strategy:  {meta['encoding']} | {meta['scaling']}\n")
            f.write(f"  > Rationale: {meta['why']}\n")
            f.write("-" * 80 + "\n")
            f.write(f"  {'COLUMN NAME':<30} | {'MISSING':<8} | {'STATS / DISTRIBUTION'}\n")
            f.write("-" * 80 + "\n")

            for col in meta['columns']:
                if col not in df.columns:
                    f.write(f"  ❌ {col:<28} | NOT FOUND\n")
                    continue

                # Calculate Stats
                missing = df[col].isna().sum()
                missing_pct = (missing / len(df)) * 100

                # 1. TEXT STATS
                if "TEXT" in group:
                    avg_words = df[col].astype(str).apply(lambda x: len(x.split())).mean()
                    empty_rows = (df[col].fillna("").astype(str).str.strip() == "").sum()
                    stat_str = f"Avg Words: {avg_words:.0f} | Empty Rows: {empty_rows}"

                # 2. NUMERIC STATS
                elif group in ["NUMERIC_SKEWED", "ORDINAL"]:
                    stats = df[col].describe()
                    skew = df[col].skew()
                    zeros = (df[col] == 0).sum()
                    zero_pct = (zeros / len(df)) * 100
                    stat_str = f"Mean: {stats['mean']:.1f} | Max: {stats['max']:.0f} | Skew: {skew:.2f} | Zeros: {zero_pct:.1f}%"

                # 3. CATEGORICAL STATS
                else:
                    unique = df[col].nunique()
                    # Check for Dominance (if one category is >90% of data)
                    if unique > 0:
                        top_cat_pct = df[col].value_counts(normalize=True).iloc[0] * 100
                        stat_str = f"Unique: {unique:<4} | Dominant Cat: {top_cat_pct:.1f}%"
                    else:
                        stat_str = "Empty"

                f.write(f"  • {col:<28} | {missing_pct:>6.1f}% | {stat_str}\n")

    print(f"SUCCESS: Pipeline Blueprint saved to {OUTPUT_REPORT}")
    print("Check the 'Skew' and 'Dominant Cat' columns to confirm your strategy.")

if __name__ == "__main__":
    run_blueprint_audit()

>>> STARTING PIPELINE BLUEPRINT AUDIT ON: project_data.csv...
SUCCESS: Pipeline Blueprint saved to audit_dataset_report.txt
Check the 'Skew' and 'Dominant Cat' columns to confirm your strategy.
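
The audit only reports columns named in `STRATEGY_MAP`, so a column that belongs to no group goes unreported. Comparing the map with the exported column list shown earlier, `study_type`, `phase`, `therapeutic_subgroup`, and `num_primary_endpoints` are not assigned to any group. A small check like the following could surface such gaps; the function name and the toy inputs are illustrative.

```python
def unassigned_columns(columns, strategy_map):
    """Return columns that appear in no strategy group, in their original order."""
    assigned = {col for meta in strategy_map.values() for col in meta["columns"]}
    return [col for col in columns if col not in assigned]

# Toy inputs with the same shape as STRATEGY_MAP and df.columns
toy_map = {
    "TARGET": {"columns": ["target"]},
    "ORDINAL": {"columns": ["phase_ordinal"]},
    "EXCLUDED": {"columns": ["nct_id", "overall_status"]},
}
toy_columns = ["nct_id", "overall_status", "phase", "phase_ordinal",
               "num_primary_endpoints", "target"]

print(unassigned_columns(toy_columns, toy_map))
# ['phase', 'num_primary_endpoints']
```

Running it against the real `STRATEGY_MAP` and `df.columns` inside `run_blueprint_audit` would make each unassigned column an explicit decision: encode it or add it to `EXCLUDED`.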
