# Clinical Trial Risk Engine: Data Engineering Pipeline

**Project:** Premature Termination Risk Prediction
**Output Artifact:** `project_data.csv`
**Scope:** Interventional drug and biological trials (Phases 1, 2, and 3, including combined Phase 1/2 and Phase 2/3).
**Temporal Filter:** Trials with a start year from 2000 through the current year plus two.

## Data Dictionary and Feature Rationale

This dataset aggregates data from the AACT (Aggregate Analysis of ClinicalTrials.gov) database to predict trial termination.

### 1. Target Variable
*   **`target`** (Binary): The classification target.
    *   `0`: **Completed**. The trial concluded according to protocol.
    *   `1`: **Failed**. The trial was Terminated, Withdrawn, or Suspended.
*   **`overall_status`** (String): The raw status label. *Note: This column is retained for validation but must be dropped prior to training to prevent data leakage.*
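
As a minimal sketch of the leakage note above, the split below separates the label from the features and drops the columns that encode the outcome. The toy rows and the `LEAKAGE_COLUMNS` name are illustrative assumptions; `why_stopped` and `min_p_value` are included because the audit step groups them with the excluded fields.

```python
import pandas as pd

# Toy rows standing in for project_data.csv (values are made up for illustration).
toy = pd.DataFrame({
    "nct_id": ["NCT001", "NCT002", "NCT003"],
    "overall_status": ["COMPLETED", "TERMINATED", "WITHDRAWN"],
    "why_stopped": [None, "Slow accrual", "Funding"],
    "min_p_value": [0.03, None, None],
    "num_facilities": [12, 3, 1],
    "target": [0, 1, 1],
})

# Hypothetical helper list: columns that reveal the outcome and must not reach the model.
LEAKAGE_COLUMNS = ["overall_status", "why_stopped", "min_p_value"]

y = toy["target"]
X = toy.drop(columns=LEAKAGE_COLUMNS + ["target"])

print(list(X.columns))
```

In a real run the identifier `nct_id` would presumably also be set aside before fitting, since it carries no predictive signal.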

### 2. Operational Features (Complexity Proxies)
*   **`num_facilities`** (Integer): The count of distinct recruiting sites. Higher counts correlate with increased operational complexity and cost.
*   **`num_countries`** (Integer): The count of unique countries involved. Indicates regulatory complexity (e.g., FDA, EMA, PMDA coordination).
*   **`phase_ordinal`** (Float): A numeric mapping of the trial phase (Phase 1 = 1.0, Phase 1/2 = 1.5, Phase 2 = 2.0, Phase 2/3 = 2.5, Phase 3 = 3.0). This is the primary proxy for trial magnitude and resource requirements.
*   **`number_of_arms`** (Integer): The number of intervention groups. Indicates protocol complexity.
*   **`start_year`** (Integer): The year the trial began. Used to capture temporal trends in clinical research standards.

### 3. Eligibility and Protocol Features
*   **`criteria`** (Text): The full inclusion and exclusion criteria. This unstructured text contains high-value signals regarding protocol restrictiveness.
*   **`gender`** (Categorical): Indicates if the trial is restricted by sex (Female, Male, or All).
*   **`healthy_volunteers`** (Binary): Indicates if healthy participants are accepted. Often distinguishes Phase 1 safety trials (Yes) from Phase 2/3 efficacy trials (No).
*   **`adult`, `child`, `older_adult`** (Binary): Pre-calculated flags indicating the target age groups. Used in place of raw numeric age limits to reduce noise and missing data.

### 4. Scientific and Design Features
*   **`therapeutic_area`** (Categorical): High-level medical classification (e.g., Oncology, Cardiology).
*   **`therapeutic_subgroup_name`** (Categorical): Granular disease classification (e.g., Neoplasms by Site).
*   **`intervention_model`** (Categorical): The strategy for assigning interventions (e.g., Parallel, Crossover, Single Group).
*   **`masking`** (Categorical): The level of blinding (e.g., Double, Quadruple, None).
*   **`allocation`** (Categorical): Indicates if participants are randomized.

### 5. Environmental Context (Competition)
*   **`competition_niche`** (Integer): The count of cohort trials in the *same* phase group and *same* Therapeutic Subgroup that start in the trial's start year or the two following years. Represents direct competition for specific patient populations.
*   **`competition_broad`** (Integer): The count of cohort trials in the *same* phase group and *same* Therapeutic Area over the same three-year window. Represents broader resource saturation.
*   **`covid_exposure`** (Binary): Indicates if the trial started between 2019 and 2021 (inclusive), the global disruption window.

### 6. Sponsor Information
*   **`agency_class`** (Categorical): The type of lead sponsor (Industry, NIH, Other). Industry trials typically have different risk profiles compared to academic or government-funded studies.
*   **`includes_us`** (Binary): Indicates if the trial has at least one site in the United States, implying FDA regulatory oversight.

### 1. Configuration and Path Management
This block establishes the environment settings. It also defines safe loading parameters to handle special characters in the raw AACT text files.

In [29]:
import pandas as pd
import numpy as np
import os
import csv

# -----------------------------------------------------------------------------
# 1. PATH SETUP
# -----------------------------------------------------------------------------

DATA_PATH = "/home/delaunan/code/delaunan/clintrialpredict/data"

print(f">>> DATA_PATH set to: {os.path.abspath(DATA_PATH)}")


>>> DATA_PATH set to: /home/delaunan/code/delaunan/clintrialpredict/data


In [30]:
import pandas as pd
import numpy as np
import os
import csv

# -----------------------------------------------------------------------------
# 1. CONFIGURATION & SETUP
# -----------------------------------------------------------------------------

OUTPUT_FILE = 'project_data.csv'

# ROBUST LOADING PARAMETERS
AACT_LOAD_PARAMS = {
    "sep": "|",
    "dtype": str,
    "header": 0,
    "quotechar": '"',
    "quoting": csv.QUOTE_MINIMAL,
    "low_memory": False,
    "on_bad_lines": "warn"
}

print(">>> Setup Complete. Ready to process.")

>>> Setup Complete. Ready to process.
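
To illustrate what the loading parameters do, the sketch below parses a small in-memory, pipe-delimited sample instead of a real AACT file. The sample rows are made up, and the assumption is that the export quotes any field containing the delimiter with double quotes.

```python
import csv
import io

import pandas as pd

# Same settings as AACT_LOAD_PARAMS, repeated so the sketch is self-contained.
params = {
    "sep": "|",
    "dtype": str,
    "header": 0,
    "quotechar": '"',
    "quoting": csv.QUOTE_MINIMAL,
    "low_memory": False,
    "on_bad_lines": "warn",
}

# Made-up sample: the second title contains a pipe, so the exporter quoted it.
raw = (
    "nct_id|official_title|number_of_arms\n"
    "NCT001|A Simple Title|2\n"
    'NCT002|"Drug A | Drug B Combination"|03\n'
)

sample = pd.read_csv(io.StringIO(raw), **params)
print(sample)
```

Because `dtype` is `str`, the value `03` is kept as text rather than parsed as a number, which is why later steps convert numeric columns explicitly with `pd.to_numeric`.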


### 2. Data Loading and Cohort Filtering
This step defines the study cohort. We apply strict inclusion and exclusion criteria to ensure data quality:
1.  **Study Type:** Retain only `INTERVENTIONAL` trials.
2.  **Intervention:** Retain only `DRUG` or `BIOLOGICAL` trials.
3.  **Status:** Retain only definitive outcomes (`COMPLETED`, `TERMINATED`, `WITHDRAWN`, `SUSPENDED`).
4.  **Phase:** Exclude Phase 0 and Phase 4 to focus on the core development pipeline.
5.  **Temporal Validity:** Filter for trials starting between 2000 and the near future, removing invalid dates (e.g., 1900) and placeholder future records.

In [31]:
# -----------------------------------------------------------------------------
# 2. THE FUNNEL: LOADING & FILTERING (Updated with Year Filter)
# -----------------------------------------------------------------------------
print(">>> Loading Studies & Applying Filters...")

# A. Load Studies
cols_studies = [
    'nct_id', 'overall_status', 'study_type', 'phase',
    'start_date', 'start_date_type',
    'number_of_arms', 'official_title', 'why_stopped'
]
df = pd.read_csv(os.path.join(DATA_PATH, 'studies.txt'), usecols=cols_studies, **AACT_LOAD_PARAMS)

# B. Filter: Interventional Only
df = df[df['study_type'] == 'INTERVENTIONAL'].copy()

# C. Filter: Drugs Only
df_int = pd.read_csv(os.path.join(DATA_PATH, 'interventions.txt'), usecols=['nct_id', 'intervention_type'], **AACT_LOAD_PARAMS)
drug_ids = df_int[df_int['intervention_type'].str.upper().isin(['DRUG', 'BIOLOGICAL'])]['nct_id'].unique()
df = df[df['nct_id'].isin(drug_ids)]

# D. Filter: Closed Statuses Only
allowed_statuses = ['COMPLETED', 'TERMINATED', 'WITHDRAWN', 'SUSPENDED']
df = df[df['overall_status'].isin(allowed_statuses)]

# E. Filter: Exclude Phase 0 and Phase 4 (Refined Scope)
excluded_phases = ['EARLY_PHASE1', 'PHASE4', 'NA']
df = df[~df['phase'].isin(excluded_phases)]

# F. Create Target & Fix Dates
df['target'] = df['overall_status'].apply(lambda x: 0 if x == 'COMPLETED' else 1)
df['start_date'] = pd.to_datetime(df['start_date'], errors='coerce')
df['start_year'] = df['start_date'].dt.year

# --- NEW FILTER: VALID YEARS ONLY ---
# Drop 1900 (Errors) and Future Dates (Invalid for training)
current_year = pd.Timestamp.now().year
df = df[df['start_year'].between(2000, current_year + 2)]

print(f"   - Core Cohort Size (Phases 1-3, Years 2000-{current_year+2}): {len(df)} trials")

>>> Loading Studies & Applying Filters...
   - Core Cohort Size (Phases 1-3, Years 2000-2027): 119201 trials
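
The cell above reports only the final cohort size. As a hedged sketch of how the funnel could be made auditable, the example below applies three of the same filters to a made-up table and records the row count after each step. The `steps` list and `attrition` log are hypothetical names, not part of the pipeline.

```python
import pandas as pd

# Made-up studies table; only the columns the filters need.
studies = pd.DataFrame({
    "nct_id": ["N1", "N2", "N3", "N4", "N5"],
    "study_type": ["INTERVENTIONAL", "OBSERVATIONAL", "INTERVENTIONAL",
                   "INTERVENTIONAL", "INTERVENTIONAL"],
    "overall_status": ["COMPLETED", "COMPLETED", "RECRUITING",
                       "TERMINATED", "WITHDRAWN"],
    "phase": ["PHASE2", "NA", "PHASE3", "PHASE4", "PHASE1"],
})

# Each step is a (label, mask builder) pair, applied in order.
steps = [
    ("interventional", lambda d: d["study_type"] == "INTERVENTIONAL"),
    ("closed status", lambda d: d["overall_status"].isin(
        ["COMPLETED", "TERMINATED", "WITHDRAWN", "SUSPENDED"])),
    ("phase 1-3", lambda d: ~d["phase"].isin(["EARLY_PHASE1", "PHASE4", "NA"])),
]

attrition = [("start", len(studies))]
cohort = studies
for label, make_mask in steps:
    cohort = cohort[make_mask(cohort)]
    attrition.append((label, len(cohort)))

for label, n in attrition:
    print(f"{label:<15} {n}")
```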


### 3. Medical Hierarchy Integration
This block enriches the dataset with standardized medical classifications. By mapping `nct_id` to the MeSH (Medical Subject Headings) hierarchy, we derive:
*   **`therapeutic_area`:** The broad medical category.
*   **`therapeutic_subgroup`:** The specific disease family.
This hierarchical approach allows the model to learn risk patterns associated with specific medical fields (e.g., Oncology vs. Infectious Diseases).


In [32]:
# -----------------------------------------------------------------------------
# 3. MEDICAL HIERARCHY & SUBGROUPS
# -----------------------------------------------------------------------------
print(">>> Attaching Medical Hierarchy...")

# A. Load Smart Lookup (Best Term per Trial)
df_smart = pd.read_csv(os.path.join(DATA_PATH, 'smart_pathology_lookup.csv'))
df = df.merge(df_smart, on='nct_id', how='left')

# B. Fill Missing
df['therapeutic_area'] = df['therapeutic_area'].fillna('Other/Unclassified')
df['best_pathology'] = df['best_pathology'].fillna('Unknown')

# C. Create Subgroup Code (Level 2 Hierarchy)
# Logic: Take first 7 chars of tree number (e.g., C04.588.180 -> C04.588)
# Missing tree numbers are NaN (not strings), so they fall through to 'Unknown'.
df['therapeutic_subgroup'] = df['tree_number'].apply(
    lambda x: x[:7] if isinstance(x, str) and len(x) >= 7 else 'Unknown'
)

# D. Map Subgroup Code to Name (Optional but good for Explainability)
# We load the full lookup to get the name for "C04.588"
df_lookup = pd.read_csv(os.path.join(DATA_PATH, 'mesh_lookup.csv'), sep='|')
code_to_name = pd.Series(df_lookup.mesh_term.values, index=df_lookup.tree_number).to_dict()
df['therapeutic_subgroup_name'] = df['therapeutic_subgroup'].map(code_to_name).fillna('Unknown Subgroup')

>>> Attaching Medical Hierarchy...
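
The subgroup rule (first 7 characters of the MeSH tree number, otherwise `Unknown`) can be checked on a few values. This is a vectorised sketch on made-up tree numbers, not the pipeline's own code path.

```python
import pandas as pd

# Made-up tree numbers, including a missing value and a top-level code.
tree = pd.Series(["C04.588.180", "C14.280", None, "C04"], name="tree_number")

# Keep the first 7 characters when the code is at least that long; otherwise 'Unknown'.
long_enough = tree.str.len().fillna(0) >= 7
subgroup = tree.str.slice(0, 7).where(long_enough, "Unknown")

print(subgroup.tolist())
```

Top-level codes such as `C04` are shorter than 7 characters, so under this rule they are grouped with missing values rather than treated as their own subgroup.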


### 4. Competition Intensity Calculation
This block calculates "Crowding" metrics to quantify the competitive environment. We define competition using a forward-looking 3-year window (Start Year, +1, +2) and count trials within the filtered cohort, including the trial itself.
*   **`competition_broad`:** Measures saturation within the same therapeutic area and phase group.
*   **`competition_niche`:** Measures direct competition for the specific patient population (same Subgroup and Phase); set to 0 when the subgroup is `Unknown`.

In [33]:
# -----------------------------------------------------------------------------
# 4. DUAL-LEVEL CROWDING (Niche vs Broad) - UPDATED
# -----------------------------------------------------------------------------
print(">>> Calculating Competition Intensity (Dual Level)...")

# A. Standardize Phase for Grouping
# Group "Phase 1/2" with "Phase 2" for competition purposes
phase_group_map = {
    'PHASE1': 'PHASE1', 'PHASE1/PHASE2': 'PHASE2',
    'PHASE2': 'PHASE2', 'PHASE2/PHASE3': 'PHASE3', 'PHASE3': 'PHASE3'
}
df['phase_group'] = df['phase'].map(phase_group_map).fillna('UNKNOWN')

# --- LEVEL 1: BROAD COMPETITION (Area + Phase) ---
# How many trials in "Oncology" + "Phase 3" started in this window?
grid_broad = df.groupby(['start_year', 'therapeutic_area', 'phase_group']).size().reset_index(name='count')
dict_broad = dict(zip(zip(grid_broad['start_year'], grid_broad['therapeutic_area'], grid_broad['phase_group']), grid_broad['count']))

def get_broad_crowding(row):
    y, area, ph = row['start_year'], row['therapeutic_area'], row['phase_group']
    if pd.isna(y): return 0
    # Sum Year 0, +1, +2
    return dict_broad.get((y, area, ph), 0) + dict_broad.get((y+1, area, ph), 0) + dict_broad.get((y+2, area, ph), 0)

df['competition_broad'] = df.apply(get_broad_crowding, axis=1)

# --- LEVEL 2: NICHE COMPETITION (Subgroup + Phase) ---
# How many trials in "Gastrointestinal Neoplasms" + "Phase 3" started in this window?
grid_niche = df.groupby(['start_year', 'therapeutic_subgroup', 'phase_group']).size().reset_index(name='count')
dict_niche = dict(zip(zip(grid_niche['start_year'], grid_niche['therapeutic_subgroup'], grid_niche['phase_group']), grid_niche['count']))

def get_niche_crowding(row):
    y, sub, ph = row['start_year'], row['therapeutic_subgroup'], row['phase_group']
    if pd.isna(y) or sub == 'Unknown': return 0
    # Sum Year 0, +1, +2
    return dict_niche.get((y, sub, ph), 0) + dict_niche.get((y+1, sub, ph), 0) + dict_niche.get((y+2, sub, ph), 0)

df['competition_niche'] = df.apply(get_niche_crowding, axis=1)

df.drop(columns=['phase_group'], inplace=True)
print("   - Created 'competition_broad' and 'competition_niche'")

>>> Calculating Competition Intensity (Dual Level)...
   - Created 'competition_broad' and 'competition_niche'
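
The window sum above is computed row by row with `df.apply`. As an alternative sketch, the same counts can be obtained with a grouped merge: each yearly cell is shifted back by 0, 1, and 2 years so that it lands on every start year whose window contains it. The toy data and the `shifted` and `window` names are illustrative assumptions.

```python
import pandas as pd

trials = pd.DataFrame({
    "nct_id": ["N1", "N2", "N3", "N4", "N5"],
    "start_year": [2010, 2010, 2011, 2012, 2013],
    "therapeutic_area": ["Oncology"] * 5,
    "phase_group": ["PHASE3"] * 5,
})

keys = ["therapeutic_area", "phase_group"]

# Trials per (year, area, phase) cell.
counts = trials.groupby(["start_year"] + keys).size().rename("n").reset_index()

# A cell for year s belongs to the windows of trials starting in s, s-1 and s-2.
shifted = pd.concat(
    [counts.assign(start_year=counts["start_year"] - k) for k in range(3)]
)
window = (
    shifted.groupby(["start_year"] + keys)["n"].sum()
    .rename("competition_broad").reset_index()
)

trials = trials.merge(window, on=["start_year"] + keys, how="left")
print(trials[["nct_id", "start_year", "competition_broad"]])
```

As in the dictionary-based version, each trial counts itself, so the minimum value for a trial with a known start year is 1.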


### 5. Protocol Details and Eligibility
This block extracts key protocol design features.
*   **Eligibility Flags:** We incorporate categorical flags (`gender`, `healthy_volunteers`, `adult`, `child`, `older_adult`) to characterize the study population.
*   **Text Data:** We extract the full `criteria` text block for downstream Natural Language Processing (NLP).
*   **Endpoints:** We calculate the number of primary endpoints as a proxy for scientific complexity.
*   **Analysis Data:** We extract `min_p_value` for post-hoc analysis (excluded from training to prevent leakage).


In [34]:
# -----------------------------------------------------------------------------
# 5. PROTOCOL DETAILS (Eligibility, Endpoints) - UPDATED
# -----------------------------------------------------------------------------
print(">>> Extracting Eligibility & Endpoints...")

# A. Load Eligibility Fields (No Age Parsing)
# We rely on the pre-calculated flags (adult/child/older_adult) instead of parsing numbers.
cols_elig = [
    'nct_id',
    'criteria',
    'gender', 'healthy_volunteers',
    'adult', 'child', 'older_adult'
]

df_elig = pd.read_csv(os.path.join(DATA_PATH, 'eligibilities.txt'),
                      usecols=cols_elig,
                      **AACT_LOAD_PARAMS)

# B. Merge into Main DataFrame
df = df.merge(df_elig, on='nct_id', how='left')

# C. Endpoint Counts (Existing Logic)
df_calc = pd.read_csv(os.path.join(DATA_PATH, 'calculated_values.txt'),
                      usecols=['nct_id', 'number_of_primary_outcomes_to_measure'],
                      **AACT_LOAD_PARAMS)
df = df.merge(df_calc, on='nct_id', how='left')
df['num_primary_endpoints'] = pd.to_numeric(df['number_of_primary_outcomes_to_measure'], errors='coerce').fillna(1)

# D. P-Values (Analysis Only - Existing Logic)
df_outcomes = pd.read_csv(os.path.join(DATA_PATH, 'outcomes.txt'), usecols=['id', 'nct_id', 'outcome_type'], **AACT_LOAD_PARAMS)
prim_ids = df_outcomes[df_outcomes['outcome_type'] == 'PRIMARY']['id'].unique()

df_an = pd.read_csv(os.path.join(DATA_PATH, 'outcome_analyses.txt'), usecols=['outcome_id', 'p_value'], **AACT_LOAD_PARAMS)
df_an = df_an[df_an['outcome_id'].isin(prim_ids)]
df_an['p_value_num'] = pd.to_numeric(df_an['p_value'], errors='coerce')

min_p = df_an.groupby('outcome_id')['p_value_num'].min().reset_index()
min_p = min_p.merge(df_outcomes[['id', 'nct_id']], left_on='outcome_id', right_on='id')
trial_p = min_p.groupby('nct_id')['p_value_num'].min().reset_index(name='min_p_value')

df = df.merge(trial_p, on='nct_id', how='left')

>>> Extracting Eligibility & Endpoints...
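
Each left merge above assumes the lookup table holds at most one row per `nct_id`; a duplicated key would silently multiply cohort rows. The sketch below shows one way to guard against that with the `validate` argument of `DataFrame.merge`. The tables are made up, and whether any AACT file actually contains duplicate keys is not established here.

```python
import pandas as pd

cohort = pd.DataFrame({"nct_id": ["N1", "N2"], "target": [0, 1]})

# Made-up lookup with a duplicated key: N1 appears twice.
lookup = pd.DataFrame({
    "nct_id": ["N1", "N1", "N2"],
    "gender": ["ALL", "FEMALE", "ALL"],
})

try:
    cohort.merge(lookup, on="nct_id", how="left", validate="many_to_one")
    duplicated_key_detected = False
except pd.errors.MergeError:
    duplicated_key_detected = True

# After de-duplicating the lookup, the merge keeps one row per trial.
clean = lookup.drop_duplicates("nct_id")
merged = cohort.merge(clean, on="nct_id", how="left", validate="many_to_one")

print(duplicated_key_detected, len(merged))
```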


### 6. Operational Proxies and Sponsor Data
This block merges operational and administrative features:
*   **Phase Mapping:** Converts text phases to a numeric ordinal scale (1.0 to 3.0). Invalid phases are filtered out.
*   **Geography:** Calculates the number of facilities and countries, and flags US-based trials.
*   **Sponsor:** Identifies the lead agency class (e.g., Industry vs. Other).
*   **External Factors:** Calculates `covid_exposure` based on the trial's start date relative to the pandemic window.

In [35]:
# -----------------------------------------------------------------------------
# 6. OPERATIONAL PROXIES, SPONSORS & EXTERNAL FACTORS (COVID)
# -----------------------------------------------------------------------------
print(">>> Merging Operational Features & Calculating COVID Exposure...")

# A. Phase Ordinal
phase_map = {'PHASE1': 1, 'PHASE1/PHASE2': 1.5, 'PHASE2': 2, 'PHASE2/PHASE3': 2.5, 'PHASE3': 3}
df['phase_ordinal'] = df['phase'].map(phase_map).fillna(0)

# --- NEW FILTER: DROP UNKNOWN PHASES ---
# We only want ordinal 1.0 to 3.0. Drop 0.0.
df = df[df['phase_ordinal'] > 0]

# B. COVID Exposure
df['covid_exposure'] = df['start_year'].between(2019, 2021).astype(int)

# C. Facilities (Raw Count)
df_fac = pd.read_csv(os.path.join(DATA_PATH, 'facilities.txt'), usecols=['nct_id', 'id'], **AACT_LOAD_PARAMS)
fac_counts = df_fac.groupby('nct_id')['id'].count().reset_index(name='num_facilities')
df = df.merge(fac_counts, on='nct_id', how='left')
df['num_facilities'] = df['num_facilities'].fillna(1).astype(int)

# D. Countries
df_countries = pd.read_csv(os.path.join(DATA_PATH, 'countries.txt'), usecols=['nct_id', 'name'], **AACT_LOAD_PARAMS)
country_stats = df_countries.groupby('nct_id').agg(
    num_countries=('name', 'nunique'),
    includes_us=('name', lambda x: 1 if 'United States' in x.values else 0)
).reset_index()
df = df.merge(country_stats, on='nct_id', how='left')
df['num_countries'] = df['num_countries'].fillna(1).astype(int)
df['includes_us'] = df['includes_us'].fillna(0).astype(int)

# E. Sponsors & Design
df_sponsors = pd.read_csv(os.path.join(DATA_PATH, 'sponsors.txt'), **AACT_LOAD_PARAMS)
df_lead = df_sponsors[df_sponsors['lead_or_collaborator'] == 'lead'][['nct_id', 'agency_class']].drop_duplicates('nct_id')
df = df.merge(df_lead, on='nct_id', how='left')

cols_des = ['nct_id', 'allocation', 'intervention_model', 'masking', 'primary_purpose']
df_des = pd.read_csv(os.path.join(DATA_PATH, 'designs.txt'), usecols=cols_des, **AACT_LOAD_PARAMS)
df = df.merge(df_des, on='nct_id', how='left')

print("   - Filtered out invalid phases (0.0).")

>>> Merging Operational Features & Calculating COVID Exposure...
   - Filtered out invalid phases (0.0).


### 7. Text Integration and Data Export
This final processing step merges the remaining unstructured text fields (`brief_summary`, `detailed_description`) and cleans up temporary technical columns. The final dataset is exported as `project_data.csv`.

In [36]:
# -----------------------------------------------------------------------------
# 7. TEXT MERGE & FINAL SAVE
# -----------------------------------------------------------------------------
print(">>> Merging Text & Saving...")

# A. Text Data
df_brief = pd.read_csv(os.path.join(DATA_PATH, 'brief_summaries.txt'), usecols=['nct_id', 'description'], **AACT_LOAD_PARAMS)
df_brief.rename(columns={'description': 'brief_summary'}, inplace=True)
df = df.merge(df_brief, on='nct_id', how='left')

df_detail = pd.read_csv(os.path.join(DATA_PATH, 'detailed_descriptions.txt'), usecols=['nct_id', 'description'], **AACT_LOAD_PARAMS)
df_detail.rename(columns={'description': 'detailed_description'}, inplace=True)
df = df.merge(df_detail, on='nct_id', how='left')

# B. Cleanup
# Drop technical columns
df.drop(columns=['start_date', 'start_date_type', 'tree_number', 'number_of_primary_outcomes_to_measure'], inplace=True, errors='ignore')

# C. Save
df.to_csv(os.path.join(DATA_PATH, OUTPUT_FILE), index=False, quoting=csv.QUOTE_MINIMAL)

print(f"\n>>> SUCCESS: Final Dataset saved to {OUTPUT_FILE}")
print(f"    Rows: {len(df)}")
print(f"    Columns: {len(df.columns)}")
print("    New Features: 'competition_broad', 'competition_niche', 'covid_exposure', 'min_p_value'")

>>> Merging Text & Saving...

>>> SUCCESS: Final Dataset saved to project_data.csv
    Rows: 105884
    Columns: 35
    New Features: 'competition_broad', 'competition_niche', 'covid_exposure', 'min_p_value'
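
Because the exported file carries long free-text fields, it can be worth confirming that embedded newlines, commas, and quotes survive a write and read cycle. The following is a small in-memory sketch with a made-up record, using the same `quoting` setting as the export call.

```python
import csv
import io

import pandas as pd

# Made-up record whose text field contains newlines, a comma and double quotes.
out = pd.DataFrame({
    "nct_id": ["N1"],
    "criteria": ['Inclusion:\n- Age >= 18, "adult"\nExclusion:\n- None'],
    "target": [0],
})

buffer = io.StringIO()
out.to_csv(buffer, index=False, quoting=csv.QUOTE_MINIMAL)

# Read the CSV text back and compare the free-text field with the original.
buffer.seek(0)
back = pd.read_csv(buffer)

print(back.loc[0, "criteria"] == out.loc[0, "criteria"])
```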


### 8. Quality Assurance and Encoding Strategy
This block performs a comprehensive audit of the final dataset. It analyzes missing values, cardinality, and data types to generate an automated **Encoding Strategy Report**. This report provides specific recommendations for the machine learning pipeline (e.g., which fields require One-Hot Encoding, Target Encoding, or NLP transformation).

In [37]:
import pandas as pd
import numpy as np
import os

# -----------------------------------------------------------------------------
# 8. FINAL DATASET AUDIT & ENCODING STRATEGY
# -----------------------------------------------------------------------------
print(">>> Running Dataset Audit...")

INPUT_FILE = 'project_data.csv'
OUTPUT_REPORT = 'audit_dataset_report.txt'

# Define the Schema & Groups
FEATURE_GROUPS = {
    "1. TARGET": ["target", "overall_status"],
    "2. NUMERIC (Scale & Log)": [
        "num_facilities", "num_countries", "number_of_arms",
        "start_year", "competition_niche", "competition_broad"
    ],
    "3. ORDINAL (Keep Numeric)": ["phase_ordinal"],
    "4. CATEGORICAL (One-Hot Encode)": [
        "gender", "healthy_volunteers", "adult", "child", "older_adult",
        "agency_class", "includes_us", "covid_exposure",
        "masking", "allocation", "intervention_model", "primary_purpose",
        "therapeutic_area"
    ],
    "5. HIGH CARDINALITY (Target Encode)": [
        "therapeutic_subgroup_name", "best_pathology"
    ],
    "6. TEXT (NLP / BERT)": [
        "official_title", "criteria", "brief_summary", "detailed_description"
    ],
    "7. EXCLUDED (Leakage/Analysis)": ["min_p_value", "why_stopped"]
}

def recommend_encoding(col, dtype, unique_count, is_text=False):
    """Logic to suggest the best Scikit-Learn transformer."""
    if col == 'target': return "LABEL (Do not process)"
    if col in ['overall_status', 'why_stopped', 'min_p_value']: return "DROP (Leakage/Analysis)"

    if is_text:
        return "NLP (BERT Embeddings or TF-IDF)"

    if pd.api.types.is_numeric_dtype(dtype):
        if unique_count < 10: return "ORDINAL / PASSTHROUGH"
        return "NUMERIC (StandardScaler + Log1p if skewed)"

    if unique_count <= 2: return "BINARY (One-Hot drop='if_binary')"
    if unique_count < 50: return "ONE-HOT ENCODING"
    return "TARGET ENCODING (High Cardinality)"

def run_enriched_audit():
    print(f">>> STARTING ENRICHED AUDIT ON: {INPUT_FILE}...")

    file_path = os.path.join(DATA_PATH, INPUT_FILE)
    if not os.path.exists(file_path):
        print(f"[ERROR] File not found: {file_path}")
        return

    df = pd.read_csv(file_path, low_memory=False)
    print(f"   - Loaded {len(df):,} rows.")

    with open(os.path.join(DATA_PATH, OUTPUT_REPORT), 'w', encoding='utf-8') as f:
        f.write("================================================================\n")
        f.write(f"FINAL DATASET AUDIT & STRATEGY REPORT\n")
        f.write(f"File: {INPUT_FILE}\n")
        f.write(f"Rows: {len(df):,}\n")
        f.write(f"Cols: {len(df.columns)}\n")
        f.write("================================================================\n\n")

        for category, columns in FEATURE_GROUPS.items():
            f.write(f"\n{'='*80}\n")
            f.write(f"GROUP: {category}\n")
            f.write(f"{'='*80}\n")

            for col in columns:
                if col not in df.columns:
                    f.write(f"\n[MISSING] '{col}' not found.\n")
                    continue

                # Stats
                missing = df[col].isna().sum()
                missing_pct = (missing / len(df)) * 100
                unique = df[col].nunique()
                dtype = df[col].dtype

                # Determine Strategy
                is_text = col in FEATURE_GROUPS["6. TEXT (NLP / BERT)"]
                strategy = recommend_encoding(col, dtype, unique, is_text)

                f.write(f"\n>>> FIELD: {col.upper()}\n")
                f.write(f"    Type: {dtype} | Unique: {unique} | Missing: {missing_pct:.2f}%\n")
                f.write(f"    STRATEGY: {strategy}\n")

                # Content Analysis
                if is_text:
                    # Fill NaN before casting; otherwise astype(str) turns NaN into the word 'nan'.
                    avg_words = df[col].fillna("").astype(str).str.split().str.len().mean()
                    f.write(f"    - Avg Length: {avg_words:.0f} words\n")

                elif pd.api.types.is_numeric_dtype(df[col]):
                    stats = df[col].describe()
                    zeros = (df[col] == 0).sum()
                    f.write(f"    - Stats: Mean={stats['mean']:.2f}, Min={stats['min']}, Max={stats['max']}\n")
                    f.write(f"    - Zeros: {zeros} ({zeros/len(df):.1%})\n")
                    if col != 'target':
                        corr = df[[col, 'target']].corr().iloc[0,1]
                        f.write(f"    - Correlation w/ Target: {corr:.4f}\n")

                else:
                    # Categorical Distribution
                    top_n = df[col].value_counts().head(5)
                    f.write("    - Top Values:\n")
                    for val, count in top_n.items():
                        f.write(f"      * {str(val)[:30]:<30} : {count} ({count/len(df):.1%})\n")

    print(f"SUCCESS: Audit report saved to {OUTPUT_REPORT}")


if __name__ == "__main__":
    run_enriched_audit()

>>> Running Dataset Audit...
>>> STARTING ENRICHED AUDIT ON: project_data.csv...
   - Loaded 105,884 rows.
SUCCESS: Audit report saved to audit_dataset_report.txt
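
The report recommends target encoding for high-cardinality fields such as `best_pathology`. As a rough sketch of what that could look like, the example below computes a smoothed mean of `target` per category in plain pandas. The toy rows and the smoothing strength `m` are assumptions, and this is one common formulation rather than the pipeline's chosen method.

```python
import pandas as pd

toy = pd.DataFrame({
    "best_pathology": ["Breast Neoplasms"] * 4 + ["Asthma"],
    "target": [1, 1, 0, 0, 1],
})

# Hypothetical smoothing strength: higher values pull small groups toward the global rate.
m = 2.0
global_rate = toy["target"].mean()

# Smoothed rate per category: (failures + m * global_rate) / (trials + m).
stats = toy.groupby("best_pathology")["target"].agg(["sum", "count"])
encoding = (stats["sum"] + m * global_rate) / (stats["count"] + m)

toy["best_pathology_te"] = toy["best_pathology"].map(encoding)
print(encoding.round(3).to_dict())
```

To avoid leaking the label, an encoding like this would normally be fitted on training folds only and then applied to held-out rows.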


In [None]:
import pandas as pd
import numpy as np
import os
import csv

# -----------------------------------------------------------------------------
# 1. CONFIGURATION & PATHS
# -----------------------------------------------------------------------------
# Robust path finding
if os.path.exists("data"):
    DATA_PATH = "data"
elif os.path.exists("../../data"):
    DATA_PATH = "../../data"
else:
    DATA_PATH = "/home/delaunan/code/delaunan/clintrialpredict/data"

OUTPUT_REPORT = "text_strategy_audit.txt"

AACT_PARAMS = {
    "sep": "|", "dtype": str, "header": 0, "quotechar": '"',
    "quoting": csv.QUOTE_MINIMAL, "low_memory": False, "on_bad_lines": "warn"
}

print(f">>> STARTING TEXT STRATEGY AUDIT...")
print(f">>> Data Path: {os.path.abspath(DATA_PATH)}")

# -----------------------------------------------------------------------------
# 2. BUILD THE COHORT (Apply Filters)
# -----------------------------------------------------------------------------
print(">>> Building Filtered Cohort (Interventional Drugs, Phase 1-3, 2000+)...")

# Load Studies
cols_studies = ['nct_id', 'overall_status', 'study_type', 'phase', 'start_date', 'official_title']
df = pd.read_csv(os.path.join(DATA_PATH, 'studies.txt'), usecols=cols_studies, **AACT_PARAMS)

# Filters
df = df[df['study_type'] == 'INTERVENTIONAL']
df = df[df['overall_status'].isin(['COMPLETED', 'TERMINATED', 'WITHDRAWN', 'SUSPENDED'])]
df = df[~df['phase'].isin(['EARLY_PHASE1', 'PHASE4', 'NA'])]

# Date Filter
df['start_year'] = pd.to_datetime(df['start_date'], errors='coerce').dt.year
current_year = pd.Timestamp.now().year
df = df[df['start_year'].between(2000, current_year + 2)]

# Drug Filter
df_int = pd.read_csv(os.path.join(DATA_PATH, 'interventions.txt'), usecols=['nct_id', 'intervention_type'], **AACT_PARAMS)
drug_ids = df_int[df_int['intervention_type'].str.upper().isin(['DRUG', 'BIOLOGICAL'])]['nct_id'].unique()
df = df[df['nct_id'].isin(drug_ids)]

print(f"   - Final Cohort Size: {len(df)} trials")

# -----------------------------------------------------------------------------
# 3. LOAD & MERGE CANDIDATE FIELDS
# -----------------------------------------------------------------------------
print(">>> Loading Text Candidates...")

# A. Criteria (The Gold Standard for Complexity)
df_elig = pd.read_csv(os.path.join(DATA_PATH, 'eligibilities.txt'), usecols=['nct_id', 'criteria'], **AACT_PARAMS)
df = df.merge(df_elig, on='nct_id', how='left')

# B. Keywords (For XGBoost Tags)
df_keys = pd.read_csv(os.path.join(DATA_PATH, 'keywords.txt'), usecols=['nct_id', 'name'], **AACT_PARAMS)
keys_grouped = df_keys.groupby('nct_id')['name'].apply(lambda x: " | ".join(x.dropna().astype(str))).reset_index(name='txt_keywords')
df = df.merge(keys_grouped, on='nct_id', how='left')

# C. Intervention Details (Name = Tag, Desc = Complexity)
df_int_det = pd.read_csv(os.path.join(DATA_PATH, 'interventions.txt'),
                         usecols=['nct_id', 'intervention_type', 'name', 'description'],
                         **AACT_PARAMS)
# Filter for drugs only in the details
df_int_det = df_int_det[df_int_det['intervention_type'].str.upper().isin(['DRUG', 'BIOLOGICAL'])]

int_names = df_int_det.groupby('nct_id')['name'].apply(lambda x: " | ".join(x.dropna().astype(str))).reset_index(name='txt_intervention_names')
int_desc = df_int_det.groupby('nct_id')['description'].apply(lambda x: " ".join(x.dropna().astype(str))).reset_index(name='txt_intervention_desc')

df = df.merge(int_names, on='nct_id', how='left')
df = df.merge(int_desc, on='nct_id', how='left')

# D. International Flag (Calculated)
print(">>> Calculating International Flag...")
df_countries = pd.read_csv(os.path.join(DATA_PATH, 'countries.txt'), usecols=['nct_id', 'name'], **AACT_PARAMS)
country_counts = df_countries.groupby('nct_id')['name'].nunique().reset_index(name='cnt')
df = df.merge(country_counts, on='nct_id', how='left')
df['is_international'] = (df['cnt'] > 1).astype(int)

# -----------------------------------------------------------------------------
# 4. GENERATE AUDIT REPORT
# -----------------------------------------------------------------------------
print(f">>> Generating Report: {OUTPUT_REPORT}...")

fields_to_audit = {
    "OFFICIAL_TITLE": "official_title",
    "CRITERIA": "criteria",
    "KEYWORDS (Grouped)": "txt_keywords",
    "INTERVENTION_NAMES": "txt_intervention_names",
    "INTERVENTION_DESC": "txt_intervention_desc",
    "IS_INTERNATIONAL (Flag)": "is_international"
}

with open(os.path.join(DATA_PATH, OUTPUT_REPORT), 'w', encoding='utf-8') as f:
    f.write("================================================================\n")
    f.write("TEXT STRATEGY AUDIT (Filtered Cohort)\n")
    f.write(f"Total Trials: {len(df)}\n")
    f.write("================================================================\n\n")

    for label, col in fields_to_audit.items():
        f.write(f"FIELD: {label}\n")
        f.write("-" * 40 + "\n")

        if col not in df.columns:
            f.write("❌ MISSING (Not found in dataframe)\n\n")
            continue

        # Stats
        missing = df[col].isna().sum()
        fill_rate = 100 - ((missing / len(df)) * 100)

        f.write(f"Fill Rate: {fill_rate:.2f}%\n")

        if col == 'is_international':
            dist = df[col].value_counts(normalize=True)
            f.write(f"Distribution: No (0)={dist.get(0,0):.1%}, Yes (1)={dist.get(1,0):.1%}\n")
        else:
            # Text Stats
            # Fill NA with empty for length calc
            series = df[col].fillna("").astype(str)
            avg_chars = series.apply(len).mean()
            avg_words = series.apply(lambda x: len(x.split())).mean()

            f.write(f"Avg Length: {avg_chars:.0f} chars ({avg_words:.0f} words)\n")

            # Recommendation Logic
            rec = "???"
            if avg_words < 5: rec = "⚠️ TOO SHORT / SPARSE"
            elif avg_words < 50: rec = "🏷️ KEYWORD CANDIDATE (XGBoost/TF-IDF)"
            else: rec = "🧠 COMPLEXITY CANDIDATE (BERT)"

            f.write(f"VERDICT:    {rec}\n")

            # Samples
            f.write("Samples:\n")
            non_empty = df[df[col].notna()][col]
            if len(non_empty) > 0:
                # Cap the sample size so fields with fewer than 3 values do not raise.
                for val in non_empty.sample(min(3, len(non_empty))).values:
                    preview = str(val)[:100].replace('\n', ' ')
                    f.write(f"  - {preview}...\n")

        f.write("\n")

print(">>> DONE. Please upload 'text_strategy_audit.txt'.")

>>> STARTING TEXT STRATEGY AUDIT...
>>> Data Path: /home/delaunan/code/delaunan/clintrialpredict/data
>>> Building Filtered Cohort (Interventional Drugs, Phase 1-3, 2000+)...
   - Final Cohort Size: 119201 trials
>>> Loading Text Candidates...
>>> Calculating International Flag...
>>> Generating Report: text_strategy_audit.txt...
>>> DONE. Please upload 'text_strategy_audit.txt'.
