Here is the **Comprehensive Data Dictionary**. It merges the **Statistical Reality** (from your audit) with the **Business Definitions & Technical Explanations** you requested.

This document explains **what** the data is, **where** it comes from, **what the labels mean**, and **why** it matters for predicting trial failure.

***

# Clinical Trial Risk Engine: Master Data Dictionary

**Dataset Name:** `project_data.csv`
**Total Records:** 124,490
**Scope:** Interventional Drug Trials (Phases 1, 2, 3). *Excludes Phase 0 and Phase 4.*

---

## 1. Target Variable

### **`target`**
*   **Definition:** The binary classification target indicating whether the trial completed its protocol or failed to do so.
    *   `0`: **Completed**. The trial finished normally.
    *   `1`: **Failed**. The trial was Terminated, Withdrawn, or Suspended.
*   **Source:** Derived from `studies.overall_status`.
*   **Statistics:**
    *   **Completed (0):** 81.2%
    *   **Failed (1):** 18.8%
*   **Relevance:** This is the variable we are training the model to predict.
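
The mapping and the class shares can be checked with a few lines of pandas. This is a minimal sketch on a toy frame: the name `toy` and its five rows are invented for illustration and are not taken from `project_data.csv`.

```python
import pandas as pd

# Toy status column (illustrative values only, not real trial data)
toy = pd.DataFrame({
    "overall_status": ["COMPLETED", "COMPLETED", "TERMINATED", "COMPLETED", "WITHDRAWN"]
})

# 0 = Completed, 1 = any other closed status (Terminated, Withdrawn, Suspended)
toy["target"] = (toy["overall_status"] != "COMPLETED").astype(int)

# Share of each class, indexed by target value
class_share = toy["target"].value_counts(normalize=True).sort_index()
print(class_share)
```

Running the same two statements on the real dataset should reproduce the percentages listed under **Statistics**, assuming the file is unchanged since the audit.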

### **`overall_status`**
*   **Definition:** The raw status label provided by ClinicalTrials.gov.
*   **Source:** `studies.overall_status`
*   **Labels:**
    *   `COMPLETED`: The study has concluded normally; participants are no longer being examined or treated.
    *   `TERMINATED`: The study has stopped early and will not start again. Participants are no longer being examined or treated.
    *   `WITHDRAWN`: The study stopped early, before enrolling its first participant.
    *   `SUSPENDED`: The study has stopped early but may start again.
*   **Relevance:** Source of truth for the Target. **Must be dropped before training** to prevent leakage.

---

## 2. Operational Signals
*Metrics describing the logistical scale, cost, and feasibility of the study.*

### **`num_facilities`**
*   **Definition:** The total count of distinct medical centers (hospitals, clinics) listed as recruiting sites.
*   **Source:** Calculated count of rows in `facilities.txt` per `nct_id`.
*   **Statistics:** Mean: 13.02 sites | Max: 1,745 sites.
*   **Relevance:** Proxy for **Operational Footprint**. High facility counts imply massive coordination costs and logistical complexity, but also suggest strong financial backing (usually Phase 3).

### **`num_countries`**
*   **Definition:** The count of unique countries where the trial is active.
*   **Source:** Calculated count of unique values in `countries.name`.
*   **Statistics:** Mean: 2.23 countries.
*   **Relevance:** Proxy for **Regulatory Complexity**. Multi-country trials must satisfy multiple health authorities (FDA, EMA, PMDA) simultaneously, increasing the risk of administrative delays or shutdowns.

### **`phase_ordinal`**
*   **Definition:** A numeric mapping of the trial phase to represent the progression of drug development.
*   **Source:** Mapped from `studies.phase`.
*   **Labels & Logic:**
    *   `1.0` (**Phase 1**): Safety & Dosage. Small cohorts. High technical risk, low operational risk.
    *   `1.5` (**Phase 1/Phase 2**): Adaptive design. Seamless transition from safety to efficacy.
    *   `2.0` (**Phase 2**): Efficacy. Medium cohorts. High risk of scientific failure (drug doesn't work).
    *   `2.5` (**Phase 2/Phase 3**): Adaptive design.
    *   `3.0` (**Phase 3**): Confirmation. Large cohorts. High risk of operational failure (cost/recruitment) and statistical futility.
*   **Relevance:** The single strongest predictor of trial size and cost.

### **`start_year`**
*   **Definition:** The calendar year the trial began.
*   **Source:** Extracted from `studies.start_date`.
*   **Statistics:** Mean: 2012.4.
*   **Relevance:** Captures **Modernization**. Clinical trial standards have become stricter over time. "Fail Fast" strategies in Pharma mean newer trials might be terminated more aggressively than older ones.

### **`number_of_arms`**
*   **Definition:** The number of distinct intervention groups in the study design (e.g., Placebo, Low Dose, High Dose = 3 arms).
*   **Source:** `studies.number_of_arms`.
*   **Statistics:** Mean: 2.38 arms.
*   **Relevance:** Proxy for **Protocol Complexity**. More arms require more patients and more complex supply chain management.

### **`min_age` / `max_age`**
*   **Definition:** The numeric age limits (in years) for participant eligibility.
*   **Source:** Parsed from `eligibilities.minimum_age` and `maximum_age`.
*   **Statistics:** `max_age` is missing in 45% of cases (implies "No Upper Limit").
*   **Relevance:** Defines the **Recruitment Pool**. Extremely narrow age ranges (e.g., "18 to 25") make recruitment statistically difficult, increasing the risk of termination due to "Slow Accrual."
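
One possible way to turn the two limits into a single "narrowness" signal is the width of the age window. The sketch below is an assumption, not part of the pipeline: the function name `age_window_years` and the cap of 100 years used for a missing upper limit are hypothetical choices.

```python
import math

def age_window_years(min_age, max_age, assumed_cap=100.0):
    """Width of the eligible age window in years.

    A missing bound is treated as open-ended: a missing minimum becomes 0 and
    a missing maximum becomes `assumed_cap` (an arbitrary illustrative cap).
    """
    lo = 0.0 if min_age is None or math.isnan(min_age) else float(min_age)
    hi = assumed_cap if max_age is None or math.isnan(max_age) else float(max_age)
    return max(hi - lo, 0.0)

print(age_window_years(18, 25))            # narrow window
print(age_window_years(18, float("nan")))  # no upper limit
```

A small width flags the narrow protocols described above, while a missing `max_age` yields a wide window instead of a missing value.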

---

## 3. Scientific Signals (Design)
*Technical attributes of the experimental protocol.*

### **`intervention_model`**
*   **Definition:** The general design strategy for assigning interventions.
*   **Source:** `designs.intervention_model`.
*   **Labels:**
    *   `PARALLEL` (54.5%): Participants are assigned to one group (e.g., Drug OR Placebo) and remain there. Standard for Phase 3.
    *   `SINGLE_GROUP` (30.4%): All participants receive the intervention. No control group. Common in Phase 1 or Oncology.
    *   `CROSSOVER` (10.2%): Participants receive Treatment A, wait, then receive Treatment B. They serve as their own control.
    *   `FACTORIAL`: Evaluates two or more interventions simultaneously (e.g., A, B, A+B, Neither). High statistical complexity.
    *   `SEQUENTIAL`: Participants are enrolled in stages; the trial may stop early based on interim results.
*   **Relevance:** Complex designs (Factorial, Crossover) have higher operational risks. Single Group designs are often scientifically riskier (early phase).

### **`masking`**
*   **Definition:** The level of blinding used to prevent bias.
*   **Source:** `designs.masking`.
*   **Labels:**
    *   `NONE` (Open Label): Everyone knows the treatment. High bias risk, but operationally easier.
    *   `DOUBLE`: Participant and Investigator are blinded. Standard for rigor.
    *   `QUADRUPLE`: Participant, Investigator, Care Provider, and Outcomes Assessor are blinded. Gold standard.
*   **Relevance:** High rigor (Quadruple Masking) requires complex supply chains (placebo matching), increasing operational risk. Open Label (None) is common in Phase 1 but less rigorous.
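
Because the labels describe increasing levels of blinding, an ordinal encoding may suit tree-based or linear models better than one-hot columns. The sketch below is illustrative only: `MASKING_LEVEL` and `masking_level` are hypothetical names, and the `SINGLE` and `TRIPLE` labels are assumed intermediate values that should be confirmed against the actual contents of `masking`.

```python
# Ordinal encoding of masking depth. SINGLE and TRIPLE are assumed
# intermediate labels; confirm against the values present in `masking`.
MASKING_LEVEL = {"NONE": 0, "SINGLE": 1, "DOUBLE": 2, "TRIPLE": 3, "QUADRUPLE": 4}

def masking_level(label):
    """Return the number of blinded parties, or -1 for missing/unknown labels."""
    if not isinstance(label, str):
        return -1
    return MASKING_LEVEL.get(label.strip().upper(), -1)

print(masking_level("QUADRUPLE"), masking_level("NONE"), masking_level(None))
```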

### **`allocation`**
*   **Definition:** The method used to assign participants to arms.
*   **Source:** `designs.allocation`.
*   **Labels:** `RANDOMIZED` (82%), `NON_RANDOMIZED` (18%).
*   **Relevance:** Randomized trials are the scientific standard but require more infrastructure than non-randomized designs.

### **`primary_purpose`**
*   **Definition:** The main reason for the clinical trial.
*   **Source:** `designs.primary_purpose`.
*   **Labels:** `TREATMENT` (80%), `PREVENTION`, `DIAGNOSTIC`.
*   **Relevance:** Prevention trials (vaccines) often require massive sample sizes compared to Treatment trials, altering the risk profile.

---

## 4. Medical Signals (Pathology)
*Classification of the disease being studied.*

### **`therapeutic_area`**
*   **Definition:** The highest level of the medical hierarchy (Level 1).
*   **Source:** Derived from `browse_conditions.mesh_term` mapped to MeSH Tree Codes (C01-C26).
*   **Labels:** Oncology, Cardiovascular, Neurology, Infectious Disease, etc.
*   **Relevance:** **Biological Risk**. Oncology trials have historically higher failure rates due to the complexity of cancer biology compared to, say, Antibiotics (Infectious).

### **`therapeutic_subgroup_name`**
*   **Definition:** The mid-level classification derived from the MeSH Tree Structure (Level 2). Groups specific diseases into families.
*   **Source:** Derived from `mesh_terms.xml` (Tree Number slicing).
*   **Examples:** "Neoplasms by Site" (groups Breast, Lung, Colon cancer), "Metabolic Diseases" (groups Diabetes, Thyroid).
*   **Relevance:** Allows the model to learn risks specific to disease families (e.g., "Neurodegenerative diseases are hard to treat") without getting lost in thousands of specific disease names.
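
The tree-number slicing can be illustrated in isolation. This sketch splits on the dot separator rather than taking a fixed number of characters; for codes shaped like `C04.588.180` both approaches give the same result. The function name `subgroup_code` is hypothetical.

```python
def subgroup_code(tree_number):
    """Return the Level 2 prefix of a MeSH tree number, e.g. 'C04.588.180' -> 'C04.588'."""
    if not isinstance(tree_number, str):
        return "Unknown"          # missing value (NaN / None)
    parts = tree_number.strip().split(".")
    if len(parts) < 2:
        return "Unknown"          # Level 1 only (e.g. 'C04'): no subgroup available
    return ".".join(parts[:2])

print(subgroup_code("C04.588.180"))
print(subgroup_code("C04"))
```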

### **`best_pathology`**
*   **Definition:** The specific disease name derived from the trial's condition list using a medical priority logic.
*   **Source:** `browse_conditions.mesh_term` (Smart Selection).
*   **Relevance:** The granular disease target.

---

## 5. External Environment
*Contextual factors impacting trial success.*

### **`competition_niche` / `competition_broad`** (competition intensity)
*   **Definition:** Two calculated indices representing recruitment competition.
*   **Calculation:** The count of drug trials (the trial itself included) that started within a **3-year window** (Start Year, +1, +2) in the **same Phase group** (Phase 1/2 is grouped with Phase 2, Phase 2/3 with Phase 3), targeting either the **same Therapeutic Subgroup** (`competition_niche`) or the **same Therapeutic Area** (`competition_broad`).
*   **Source:** Calculated feature.
*   **Relevance:** **Market Saturation**. High values indicate a "crowded" market. If 50 companies are recruiting for "Breast Cancer Phase 3" at the same time, finding eligible patients becomes difficult, leading to "Slow Accrual" termination.
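
A small worked example of the window count may help. The five trials below are invented for illustration. Whether the trial itself is counted changes the value by one, so the sketch prints both variants.

```python
from collections import Counter

# Toy trials: (start_year, subgroup_code, phase_group). Illustrative values only.
trials = [
    (2010, "C04.588", "PHASE3"),
    (2010, "C04.588", "PHASE3"),
    (2011, "C04.588", "PHASE3"),
    (2012, "C04.588", "PHASE3"),
    (2012, "C04.588", "PHASE2"),   # different phase: not counted for Phase 3
]

counts = Counter(trials)

def window_count(year, subgroup, phase, window=3):
    """Trials starting in year, year+1, ..., year+window-1 for the same subgroup and phase."""
    return sum(counts[(year + k, subgroup, phase)] for k in range(window))

first = trials[0]
total = window_count(*first)   # includes the trial itself
others = total - 1             # competitors only
print(total, others)
```

For the first trial, the window 2010 to 2012 holds two, one, and one Phase 3 trials, so the total is 4 and the count of other trials is 3.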

### **`covid_exposure`**
*   **Definition:** A binary flag indicating if the trial started immediately before or during the peak of the COVID-19 pandemic (2019-2021).
*   **Source:** Calculated from `studies.start_date`.
*   **Relevance:** **External Shock**. Trials in this window faced unique risks: site closures, patient dropout, and supply chain breaks.

---

## 6. Sponsor & Geography

### **`agency_class`**
*   **Definition:** The type of organization sponsoring the trial.
*   **Source:** `sponsors.agency_class`.
*   **Labels:**
    *   `INDUSTRY` (51%): Pharmaceutical/Biotech. High funding, but strict "Go/No-Go" business decisions.
    *   `OTHER` (41%): Academic/Hospitals. Often grant-funded. May run longer/slower.
    *   `NIH/FED` (5%): Government.
*   **Relevance:** **Financial Risk**. Industry sponsors are more likely to terminate a trial for "Business Reasons" (e.g., change in strategy) even if the science is okay.

### **`includes_us`**
*   **Definition:** A binary flag indicating if at least one trial site is located in the United States.
*   **Source:** Derived from `countries.name`.
*   **Relevance:** **Regulatory Environment**. US trials are subject to FDA oversight (strict safety monitoring) and high healthcare costs. This often correlates with higher termination rates compared to trials in developing regions.

---

## 7. Text Data
*Unstructured text fields available for Natural Language Processing.*

### **`official_title`**
*   **Definition:** The scientific title of the study.
*   **Source:** `studies.official_title`.
*   **Relevance:** Contains technical keywords (e.g., "Monoclonal Antibody", "Placebo-Controlled") that signal complexity.

### **`criteria`**
*   **Definition:** The detailed list of Inclusion and Exclusion criteria.
*   **Source:** `eligibilities.criteria`.
*   **Relevance:** **The most valuable text field.** It defines the "narrowness" of the eligible population. Complex, restrictive criteria are a leading cause of recruitment failure.
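
Before any heavier NLP, a few cheap counts can approximate how restrictive a protocol is. The sketch below is a rough heuristic under stated assumptions: it assumes the text contains an "Exclusion Criteria" heading and one criterion per line, which will not hold for every record. The function name `criteria_features` and the sample text are hypothetical.

```python
import re

def criteria_features(text):
    """Rough restrictiveness features from an eligibility criteria string."""
    if not isinstance(text, str) or not text.strip():
        return {"n_words": 0, "n_inclusion_lines": 0, "n_exclusion_lines": 0}
    # Split at the first "Exclusion Criteria" heading, case-insensitively
    parts = re.split(r"exclusion criteria", text, maxsplit=1, flags=re.IGNORECASE)
    inclusion = parts[0]
    exclusion = parts[1] if len(parts) > 1 else ""

    def count_items(block):
        # Count non-empty lines that are not section headings
        lines = [ln.strip(" *-:\t") for ln in block.splitlines()]
        return sum(1 for ln in lines if ln and not ln.lower().endswith("criteria"))

    return {
        "n_words": len(text.split()),
        "n_inclusion_lines": count_items(inclusion),
        "n_exclusion_lines": count_items(exclusion),
    }

sample = (
    "Inclusion Criteria:\n* Age 18 to 65\n* Confirmed diagnosis\n\n"
    "Exclusion Criteria:\n* Pregnancy\n"
)
features = criteria_features(sample)
print(features)
```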

---

## 8. Analysis Fields (Excluded from Training)
*Fields that contain future information (Data Leakage) but are useful for post-hoc analysis.*

### **`min_p_value`**
*   **Definition:** The lowest P-value reported for the trial's primary outcome measures.
*   **Source:** `outcome_analyses.p_value`.
*   **Relevance:** Explains **Scientific Failure**. If a trial Completed but failed to prove efficacy, the P-value will be high (>0.05).

### **`why_stopped`**
*   **Definition:** The free-text reason provided by the sponsor for termination.
*   **Source:** `studies.why_stopped`.
*   **Relevance:** **Ground Truth for Validation**. Used to verify if the model correctly identified a high-risk trial that later stopped for "Lack of funding" or "Adverse Events".
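
The leakage rule can be enforced in one place when building the training matrix. This is a minimal sketch: `LEAKAGE_COLS` and `split_features_target` are hypothetical names, and the toy frame holds invented values.

```python
import pandas as pd

# Columns that encode the outcome and must not reach the model
LEAKAGE_COLS = ["overall_status", "min_p_value", "why_stopped"]

def split_features_target(df, target_col="target", leakage_cols=LEAKAGE_COLS):
    """Return (X, y) with the target and all leakage columns removed from X."""
    y = df[target_col]
    X = df.drop(columns=[target_col, *leakage_cols], errors="ignore")
    return X, y

# Toy frame with illustrative values only
toy = pd.DataFrame({
    "num_facilities": [3, 40],
    "overall_status": ["COMPLETED", "TERMINATED"],
    "why_stopped": [None, "Slow accrual"],
    "min_p_value": [0.03, None],
    "target": [0, 1],
})
X, y = split_features_target(toy)
print(list(X.columns))
```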

Here is the complete, modular Data Engineering Pipeline. You can copy and paste these blocks sequentially into a Jupyter Notebook or a Python script.

### Block 1: Setup & Robust Configuration
**What this does:**
Sets up the file paths and defines the **"Safe Load Parameters"**. The AACT database is messy; text fields often contain the pipe character (`|`) or unescaped quotes, which breaks standard CSV readers. These parameters make pandas read every column as a string and honor quoted fields, so an embedded pipe does not shift the columns; malformed rows are skipped with a warning instead of crashing the load.

In [None]:
DATA_PATH = '/home/delaunan/code/delaunan/project/00_data'

In [None]:
import pandas as pd
import numpy as np
import os
import csv

# -----------------------------------------------------------------------------
# 1. CONFIGURATION & SETUP
# -----------------------------------------------------------------------------

OUTPUT_FILE = 'project_data.csv'

# ROBUST LOADING PARAMETERS
AACT_LOAD_PARAMS = {
    "sep": "|",
    "dtype": str,
    "header": 0,
    "quotechar": '"',
    "quoting": csv.QUOTE_MINIMAL,
    "low_memory": False,
    "on_bad_lines": "warn"
}

print(">>> Setup Complete. Ready to process.")

>>> Setup Complete. Ready to process.
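
The effect of the quoting parameters can be seen on a tiny in-memory sample. This is a sketch only: the two rows and the `TRIAL_A` / `TRIAL_B` identifiers are made up, and the parameter dictionary is repeated so the snippet runs on its own.

```python
import csv
import io
import pandas as pd

# Same values as AACT_LOAD_PARAMS, repeated so the sketch is self-contained
load_params = {
    "sep": "|", "dtype": str, "header": 0, "quotechar": '"',
    "quoting": csv.QUOTE_MINIMAL, "low_memory": False, "on_bad_lines": "warn",
}

# Made-up sample: the first title contains a pipe inside a quoted field
sample = (
    'nct_id|official_title\n'
    'TRIAL_A|"Drug A | Drug B Combination Study"\n'
    'TRIAL_B|Plain Title\n'
)

demo = pd.read_csv(io.StringIO(sample), **load_params)
print(demo.shape)
print(demo.loc[0, "official_title"])
```

The quoted pipe stays inside the title, so the frame keeps two columns.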


### Block 2: The Funnel (Loading & Filtering)
**What this does:**
1.  Loads the core `studies` table.
2.  **Filters:** Keeps only **Interventional** trials (no observational).
3.  **Filters:** Keeps only **Drug/Biologic** trials (using the `interventions` table).
4.  **Filters:** Keeps only **Closed** trials (Completed, Terminated, Withdrawn, Suspended) so we have a definite target.
5.  **Target Creation:** Creates the `target` column (0 = Completed, 1 = Failed).

In [3]:
# -----------------------------------------------------------------------------
# 2. THE FUNNEL: LOADING & FILTERING
# -----------------------------------------------------------------------------
print(">>> Loading Studies & Applying Filters...")

# A. Load Studies
cols_studies = [
    'nct_id', 'overall_status', 'study_type', 'phase',
    'start_date', 'start_date_type',
    'number_of_arms', 'official_title', 'why_stopped'
]
df = pd.read_csv(os.path.join(DATA_PATH, 'studies.txt'), usecols=cols_studies, **AACT_LOAD_PARAMS)

# B. Filter: Interventional Only
df = df[df['study_type'] == 'INTERVENTIONAL'].copy()

# C. Filter: Drugs Only
df_int = pd.read_csv(os.path.join(DATA_PATH, 'interventions.txt'), usecols=['nct_id', 'intervention_type'], **AACT_LOAD_PARAMS)
drug_ids = df_int[df_int['intervention_type'].str.upper().isin(['DRUG', 'BIOLOGICAL'])]['nct_id'].unique()
df = df[df['nct_id'].isin(drug_ids)]

# D. Filter: Closed Statuses Only
allowed_statuses = ['COMPLETED', 'TERMINATED', 'WITHDRAWN', 'SUSPENDED']
df = df[df['overall_status'].isin(allowed_statuses)]

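# E. Filter: Exclude Phase 0 and Phase 4 (Refined Scope)
# We only want Phase 1, 1/2, 2, 2/3, 3
# Caveat: pandas reads the literal string 'NA' as NaN by default, so the 'NA'
# entry below only matches if the file is read with keep_default_na=False;
# otherwise rows with a missing phase pass this filter (phase_ordinal = 0 in Block 6).
excluded_phases = ['EARLY_PHASE1', 'PHASE4', 'NA']
df = df[~df['phase'].isin(excluded_phases)]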

# F. Create Target & Fix Dates
df['target'] = df['overall_status'].apply(lambda x: 0 if x == 'COMPLETED' else 1)
df['start_date'] = pd.to_datetime(df['start_date'], errors='coerce')
df['start_year'] = df['start_date'].dt.year

print(f"   - Core Cohort Size (Phases 1-3 only): {len(df)} trials")

>>> Loading Studies & Applying Filters...
   - Core Cohort Size (Phases 1-3 only): 124490 trials
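
One detail of the phase filter is worth checking: with default settings, pandas converts the literal text `NA` to a missing value even when `dtype=str` is set, so a filter on the string `'NA'` may not match anything. The sketch below demonstrates this on a made-up two-row extract. Whether the real `studies.txt` stores `NA` or an empty field is an assumption to verify against the file.

```python
import io
import pandas as pd

# Made-up two-row extract with a literal 'NA' phase
sample = "nct_id|phase\nTRIAL_A|NA\nTRIAL_B|PHASE1\n"

default_read = pd.read_csv(io.StringIO(sample), sep="|", dtype=str)
literal_read = pd.read_csv(io.StringIO(sample), sep="|", dtype=str, keep_default_na=False)

print(default_read["phase"].isna().tolist())   # 'NA' parsed as missing
print(literal_read["phase"].tolist())          # 'NA' kept as text

# A filter that is safe under either parsing: keep only the phases in scope
in_scope = ["PHASE1", "PHASE1/PHASE2", "PHASE2", "PHASE2/PHASE3", "PHASE3"]
kept = default_read[default_read["phase"].isin(in_scope)]
print(kept["nct_id"].tolist())
```

Filtering on an allow-list rather than an exclusion list removes the dependency on how missing phases are encoded.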


### Block 3: Medical Hierarchy & Subgroups
**What this does:**
Merges `smart_pathology_lookup.csv`, built beforehand from an external database that holds the therapeutic area and disease hierarchy, to add the therapeutic area information that is not present in the original tables.
It also extracts the **Therapeutic Subgroup** code (e.g., `C04.588`), which is critical for the crowding calculation in the next step (how many trials are recruiting patients for the same therapeutic purpose).


In [4]:
# -----------------------------------------------------------------------------
# 3. MEDICAL HIERARCHY & SUBGROUPS
# -----------------------------------------------------------------------------
print(">>> Attaching Medical Hierarchy...")

# A. Load Smart Lookup (Best Term per Trial)
df_smart = pd.read_csv(os.path.join(DATA_PATH, 'smart_pathology_lookup.csv'))
df = df.merge(df_smart, on='nct_id', how='left')

# B. Fill Missing
df['therapeutic_area'] = df['therapeutic_area'].fillna('Other/Unclassified')
df['best_pathology'] = df['best_pathology'].fillna('Unknown')

# C. Create Subgroup Code (Level 2 Hierarchy)
# Logic: Take first 7 chars of tree number (e.g., C04.588.180 -> C04.588)
# Missing values (NaN) and codes shorter than 7 chars (Level 1 only) become 'Unknown'
df['therapeutic_subgroup'] = df['tree_number'].apply(
    lambda x: x[:7] if isinstance(x, str) and len(x) >= 7 else 'Unknown'
)

# D. Map Subgroup Code to Name (Optional but good for Explainability)
# We load the full lookup to get the name for "C04.588"
df_lookup = pd.read_csv(os.path.join(DATA_PATH, 'mesh_lookup.csv'), sep='|')
code_to_name = pd.Series(df_lookup.mesh_term.values, index=df_lookup.tree_number).to_dict()
df['therapeutic_subgroup_name'] = df['therapeutic_subgroup'].map(code_to_name).fillna('Unknown Subgroup')

>>> Attaching Medical Hierarchy...
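
A left merge silently multiplies rows if the lookup holds more than one row per `nct_id`, which would inflate the cohort size. pandas can guard against this with the `validate` argument of `merge`. The sketch below uses invented frames and a deliberately duplicated key to show the check firing; whether the real lookup files contain duplicates is not known here.

```python
import pandas as pd

trials = pd.DataFrame({"nct_id": ["TRIAL_A", "TRIAL_B"]})

# Hypothetical lookup with an accidental duplicate key
lookup = pd.DataFrame({
    "nct_id": ["TRIAL_A", "TRIAL_A", "TRIAL_B"],
    "therapeutic_area": ["Oncology", "Cardiovascular", "Neurology"],
})

try:
    trials.merge(lookup, on="nct_id", how="left", validate="many_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True

# One way to resolve it: keep the first row per trial before merging
deduped = lookup.drop_duplicates("nct_id")
merged = trials.merge(deduped, on="nct_id", how="left", validate="many_to_one")
print(duplicate_detected, len(merged))
```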



### Block 4: Research Space Crowding (Competition Intensity)
**What this does:**
Calculates the competition intensity at two levels: **`competition_niche`** and **`competition_broad`**.
*   **Logic:** It counts how many trials started in the **same 3-year window** (Start Year, +1, +2) and the **same Phase group**, for either the **same Medical Subgroup** (niche) or the **same Therapeutic Area** (broad).
*   **Why:** A Phase 1 Glaucoma trial does not compete with a Phase 3 Breast Cancer trial. This metric is specific.



In [5]:
# -----------------------------------------------------------------------------
# 4. DUAL-LEVEL CROWDING (Niche vs Broad) - UPDATED
# -----------------------------------------------------------------------------
print(">>> Calculating Competition Intensity (Dual Level)...")

# A. Standardize Phase for Grouping
# Merge combined phases into the later phase: "Phase 1/2" -> "Phase 2", "Phase 2/3" -> "Phase 3"
phase_group_map = {
    'PHASE1': 'PHASE1', 'PHASE1/PHASE2': 'PHASE2',
    'PHASE2': 'PHASE2', 'PHASE2/PHASE3': 'PHASE3', 'PHASE3': 'PHASE3'
}
df['phase_group'] = df['phase'].map(phase_group_map).fillna('UNKNOWN')

# --- LEVEL 1: BROAD COMPETITION (Area + Phase) ---
# How many trials in "Oncology" + "Phase 3" started in this window?
grid_broad = df.groupby(['start_year', 'therapeutic_area', 'phase_group']).size().reset_index(name='count')
dict_broad = dict(zip(zip(grid_broad['start_year'], grid_broad['therapeutic_area'], grid_broad['phase_group']), grid_broad['count']))

def get_broad_crowding(row):
    y, area, ph = row['start_year'], row['therapeutic_area'], row['phase_group']
    if pd.isna(y): return 0
    # Sum Year 0, +1, +2
    return dict_broad.get((y, area, ph), 0) + dict_broad.get((y+1, area, ph), 0) + dict_broad.get((y+2, area, ph), 0)

df['competition_broad'] = df.apply(get_broad_crowding, axis=1)

# --- LEVEL 2: NICHE COMPETITION (Subgroup + Phase) ---
# How many trials in "Gastrointestinal Neoplasms" + "Phase 3" started in this window?
grid_niche = df.groupby(['start_year', 'therapeutic_subgroup', 'phase_group']).size().reset_index(name='count')
dict_niche = dict(zip(zip(grid_niche['start_year'], grid_niche['therapeutic_subgroup'], grid_niche['phase_group']), grid_niche['count']))

def get_niche_crowding(row):
    y, sub, ph = row['start_year'], row['therapeutic_subgroup'], row['phase_group']
    if pd.isna(y) or sub == 'Unknown': return 0
    # Sum Year 0, +1, +2
    return dict_niche.get((y, sub, ph), 0) + dict_niche.get((y+1, sub, ph), 0) + dict_niche.get((y+2, sub, ph), 0)

df['competition_niche'] = df.apply(get_niche_crowding, axis=1)

df.drop(columns=['phase_group'], inplace=True)
print("   - Created 'competition_broad' and 'competition_niche'")

>>> Calculating Competition Intensity (Dual Level)...
   - Created 'competition_broad' and 'competition_niche'
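
The row-wise `apply` above can be slow on a six-figure row count. A possible alternative, shown here as a sketch and not as the pipeline's own method, computes the same window sum with three merges. The helper name `window_counts` and the toy frame are hypothetical.

```python
import pandas as pd

def window_counts(df, group_cols, year_col="start_year", window=3):
    """For each row, count rows in the same group starting in year, year+1, ..., year+window-1."""
    keys = [year_col, *group_cols]
    counts = df.groupby(keys).size().rename("n").reset_index()
    total = pd.Series(0, index=df.index, dtype="int64")
    for offset in range(window):
        # Relabel the counts of year y+offset as year y so they join onto trials starting in y
        shifted = counts.assign(**{year_col: counts[year_col] - offset})
        hits = df[keys].merge(shifted, on=keys, how="left")["n"]
        total += hits.fillna(0).astype("int64").to_numpy()
    return total

# Toy frame with illustrative values only; the last row has no start year
toy = pd.DataFrame({
    "start_year": [2010, 2010, 2011, 2012, 2013, None],
    "therapeutic_area": ["Oncology"] * 6,
    "phase_group": ["PHASE3"] * 6,
})
toy["competition_broad"] = window_counts(toy, ["therapeutic_area", "phase_group"])
print(toy["competition_broad"].tolist())
```

As in the row-wise version, each trial is included in its own count and a missing start year yields 0.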


### Block 5: Protocol (Age) & Results (P-Values)
**What this does:**
1.  **Age:** Parses `minimum_age` and `maximum_age` (the eligibility age limits for participants) into numeric years, stored as `min_age` and `max_age`.
2.  **Endpoints:** Gets the count of primary endpoints as a complexity proxy. A study with many primary objectives can mean two things:
    *   *an intent to strengthen the scientific evidence*: more goals raise the chance that at least one is met;
    *   *a complex and potentially large study*.
3.  **P-Values:** Extracts the minimum P-value across primary outcomes in case we want to use it. **(Note: for now, this is for Analysis only, not Training).**


In [6]:
# -----------------------------------------------------------------------------
# 5. PROTOCOL DETAILS & ANALYTICAL RESULTS
# -----------------------------------------------------------------------------
print(">>> Extracting Age, Endpoints & P-Values...")

# A. Age Parsing (From Eligibilities)
df_elig = pd.read_csv(os.path.join(DATA_PATH, 'eligibilities.txt'),
                      usecols=['nct_id', 'minimum_age', 'maximum_age', 'criteria'],
                      **AACT_LOAD_PARAMS)

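def parse_age(val):
    """Convert an age string such as '18 Years' or '6 Months' to years (float)."""
    if pd.isna(val): return np.nan
    val = str(val).lower()
    try:
        num = float(val.split()[0])
    except (ValueError, IndexError):
        # Empty or non-numeric value
        return np.nan
    if 'month' in val: return num / 12
    if 'week' in val: return num / 52
    if 'day' in val: return num / 365
    # Sub-day units, if present, would otherwise be read as years
    if 'hour' in val: return num / (365 * 24)
    if 'minute' in val: return num / (365 * 24 * 60)
    return num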

df_elig['min_age'] = df_elig['minimum_age'].apply(parse_age)
df_elig['max_age'] = df_elig['maximum_age'].apply(parse_age)
df = df.merge(df_elig[['nct_id', 'min_age', 'max_age', 'criteria']], on='nct_id', how='left')

# B. Endpoint Counts
df_calc = pd.read_csv(os.path.join(DATA_PATH, 'calculated_values.txt'),
                      usecols=['nct_id', 'number_of_primary_outcomes_to_measure'],
                      **AACT_LOAD_PARAMS)
df = df.merge(df_calc, on='nct_id', how='left')
df['num_primary_endpoints'] = pd.to_numeric(df['number_of_primary_outcomes_to_measure'], errors='coerce').fillna(1)

# C. P-Values (Analysis Only)
# Load Outcomes to find Primary IDs
df_outcomes = pd.read_csv(os.path.join(DATA_PATH, 'outcomes.txt'), usecols=['id', 'nct_id', 'outcome_type'], **AACT_LOAD_PARAMS)
prim_ids = df_outcomes[df_outcomes['outcome_type'] == 'PRIMARY']['id'].unique()

# Load Analyses
df_an = pd.read_csv(os.path.join(DATA_PATH, 'outcome_analyses.txt'), usecols=['outcome_id', 'p_value'], **AACT_LOAD_PARAMS)
df_an = df_an[df_an['outcome_id'].isin(prim_ids)]
df_an['p_value_num'] = pd.to_numeric(df_an['p_value'], errors='coerce')

# Get Min P-Value per Trial
min_p = df_an.groupby('outcome_id')['p_value_num'].min().reset_index()
# Link back to NCT via outcomes table
min_p = min_p.merge(df_outcomes[['id', 'nct_id']], left_on='outcome_id', right_on='id')
trial_p = min_p.groupby('nct_id')['p_value_num'].min().reset_index(name='min_p_value')

df = df.merge(trial_p, on='nct_id', how='left')

>>> Extracting Age, Endpoints & P-Values...


### Block 6: Operational Proxies, Sponsors & External Factors (covid_exposure)
**What this does:**
Builds the operational features (phase ordinal, COVID exposure flag, facility and country counts) and merges the lead sponsor's agency class and the study design fields defined at the start of the trial.


In [7]:
# -----------------------------------------------------------------------------
# 6. OPERATIONAL PROXIES, SPONSORS & EXTERNAL FACTORS - UPDATED
# -----------------------------------------------------------------------------
print(">>> Merging Operational Features & Calculating COVID Exposure...")

# A. Phase Ordinal
phase_map = {'PHASE1': 1, 'PHASE1/PHASE2': 1.5, 'PHASE2': 2, 'PHASE2/PHASE3': 2.5, 'PHASE3': 3}
df['phase_ordinal'] = df['phase'].map(phase_map).fillna(0)

# B. COVID Exposure (The Missing Piece)
# Logic: Trials starting just before or during the peak disruption (2019-2021)
df['covid_exposure'] = df['start_year'].between(2019, 2021).astype(int)

# C. Facilities (Raw Count)
df_fac = pd.read_csv(os.path.join(DATA_PATH, 'facilities.txt'), usecols=['nct_id', 'id'], **AACT_LOAD_PARAMS)
fac_counts = df_fac.groupby('nct_id')['id'].count().reset_index(name='num_facilities')
df = df.merge(fac_counts, on='nct_id', how='left')
df['num_facilities'] = df['num_facilities'].fillna(1).astype(int)

# D. Countries
df_countries = pd.read_csv(os.path.join(DATA_PATH, 'countries.txt'), usecols=['nct_id', 'name'], **AACT_LOAD_PARAMS)
country_stats = df_countries.groupby('nct_id').agg(
    num_countries=('name', 'nunique'),
    includes_us=('name', lambda x: 1 if 'United States' in x.values else 0)
).reset_index()
df = df.merge(country_stats, on='nct_id', how='left')
df['num_countries'] = df['num_countries'].fillna(1).astype(int)
df['includes_us'] = df['includes_us'].fillna(0).astype(int)

# E. Sponsors & Design
df_sponsors = pd.read_csv(os.path.join(DATA_PATH, 'sponsors.txt'), **AACT_LOAD_PARAMS)
df_lead = df_sponsors[df_sponsors['lead_or_collaborator'] == 'lead'][['nct_id', 'agency_class']].drop_duplicates('nct_id')
df = df.merge(df_lead, on='nct_id', how='left')

cols_des = ['nct_id', 'allocation', 'intervention_model', 'masking', 'primary_purpose']
df_des = pd.read_csv(os.path.join(DATA_PATH, 'designs.txt'), usecols=cols_des, **AACT_LOAD_PARAMS)
df = df.merge(df_des, on='nct_id', how='left')

>>> Merging Operational Features & Calculating COVID Exposure...


### Block 7: Save & Health Check
**What this does:**
1.  Merges the free-text fields (`brief_summary`, `detailed_description`).
2.  Drops technical columns (`start_date` is replaced by `start_year`).
3.  Saves the final CSV.
4.  Prints a summary so you can verify the data quality immediately.

In [8]:
# -----------------------------------------------------------------------------
# 7. TEXT MERGE & FINAL SAVE
# -----------------------------------------------------------------------------
print(">>> Merging Text & Saving...")

# A. Text Data
df_brief = pd.read_csv(os.path.join(DATA_PATH, 'brief_summaries.txt'), usecols=['nct_id', 'description'], **AACT_LOAD_PARAMS)
df_brief.rename(columns={'description': 'brief_summary'}, inplace=True)
df = df.merge(df_brief, on='nct_id', how='left')

df_detail = pd.read_csv(os.path.join(DATA_PATH, 'detailed_descriptions.txt'), usecols=['nct_id', 'description'], **AACT_LOAD_PARAMS)
df_detail.rename(columns={'description': 'detailed_description'}, inplace=True)
df = df.merge(df_detail, on='nct_id', how='left')

# B. Cleanup
# Drop technical columns
df.drop(columns=['start_date', 'start_date_type', 'tree_number', 'number_of_primary_outcomes_to_measure'], inplace=True, errors='ignore')

# C. Save
df.to_csv(os.path.join(DATA_PATH, OUTPUT_FILE), index=False, quoting=csv.QUOTE_MINIMAL)

print(f"\n>>> SUCCESS: Final Dataset saved to {OUTPUT_FILE}")
print(f"    Rows: {len(df)}")
print(f"    Columns: {len(df.columns)}")
print("    New Features: 'competition_niche', 'competition_broad', 'min_age', 'max_age', 'min_p_value'")

>>> Merging Text & Saving...

>>> SUCCESS: Final Dataset saved to project_data.csv
    Rows: 124490
    Columns: 32
    New Features: 'competition_niche', 'competition_broad', 'min_age', 'max_age', 'min_p_value'


In [9]:
import pandas as pd
import numpy as np
import os

# -----------------------------------------------------------------------------
# CONFIGURATION
# -----------------------------------------------------------------------------
INPUT_FILE = 'project_data.csv'  # Make sure this matches your final filename
OUTPUT_REPORT = 'audit_dataset_report.txt'

# -----------------------------------------------------------------------------
# OBJECTIVE-BASED FEATURE GROUPS (Updated for Final Schema)
# -----------------------------------------------------------------------------
FEATURE_GROUPS = {
    "1. TARGET VARIABLE (The Goal)": [
        "target",
        "overall_status"
    ],

    "2. OPERATIONAL SIGNALS (Complexity & Feasibility)": [
        "num_facilities",       # Raw count of sites
        "num_countries",        # Geographic spread
        "phase_ordinal",        # Size proxy (1.0 - 3.0)
        "number_of_arms",       # Logistical complexity
        "start_year",           # Modernization factor
        "min_age",              # Protocol restrictiveness
        "max_age"               # Protocol restrictiveness
    ],

    "3. SCIENTIFIC SIGNALS (Design & Pathology)": [
        "therapeutic_area",         # High level (Oncology)
        "therapeutic_subgroup_name",# Mid level (Neoplasms by Site)
        "best_pathology",           # Low level (Breast Cancer)
        "intervention_model",       # Parallel/Crossover
        "masking",                  # Blinded?
        "allocation",               # Randomized?
        "primary_purpose",          # Treatment/Prevention
        "num_primary_endpoints"     # Scientific complexity
    ],

    "4. EXTERNAL ENVIRONMENT (Competition & Context)": [
        "competition_niche",        # Direct competitors (Same Subgroup + Phase)
        "competition_broad",        # Resource competitors (Same Area + Phase)
        "covid_exposure"            # Pandemic impact
    ],

    "5. SPONSOR & BIAS (Who is running it?)": [
        "agency_class",             # Industry vs Academic
        "includes_us"               # Regulatory environment (FDA)
    ],

    "6. LEAKAGE CHECKS (Do not use for Training)": [
        "min_p_value",              # Result (Should be highly correlated with target)
        "why_stopped"               # Result (Text explanation of failure)
    ],

    "7. NLP INPUTS (Text Data Quality)": [
        "official_title",
        "brief_summary",
        "detailed_description",
        "criteria"
    ]
}

def run_objective_audit():
    print(f">>> STARTING AUDIT ON: {INPUT_FILE}...")

    file_path = os.path.join(DATA_PATH, INPUT_FILE)
    if not os.path.exists(file_path):
        print(f"[ERROR] File not found: {file_path}")
        return

    df = pd.read_csv(file_path, low_memory=False)
    print(f"   - Loaded {len(df):,} rows.")

    with open(OUTPUT_REPORT, 'w', encoding='utf-8') as f:
        f.write("================================================================\n")
        f.write("FINAL DATASET AUDIT REPORT\n")
        f.write(f"File: {INPUT_FILE}\n")
        f.write(f"Rows: {len(df):,}\n")
        f.write(f"Cols: {len(df.columns)}\n")
        f.write("================================================================\n\n")

        for category, columns in FEATURE_GROUPS.items():
            f.write(f"\n{'='*80}\n")
            f.write(f"OBJECTIVE: {category}\n")
            f.write(f"{'='*80}\n")

            for col in columns:
                if col not in df.columns:
                    f.write(f"\n[MISSING COLUMN] '{col}' was expected but not found.\n")
                    continue

                # Basic Stats
                missing = df[col].isna().sum()
                missing_pct = (missing / len(df)) * 100
                dtype = df[col].dtype

                f.write(f"\n>>> FIELD: {col} ({dtype})\n")
                f.write(f"    - Missing: {missing} ({missing_pct:.2f}%)\n")

                # NUMERIC ANALYSIS
                if pd.api.types.is_numeric_dtype(df[col]):
                    stats = df[col].describe()
                    zeros = (df[col] == 0).sum()
                    f.write(f"    - Mean:    {stats['mean']:.2f} (Std: {stats['std']:.2f})\n")
                    f.write(f"    - Min/Max: {stats['min']} / {stats['max']}\n")
                    f.write(f"    - Zeros:   {zeros} ({zeros/len(df):.1%})\n")

                    # Correlation with Target (Signal Strength Check)
                    if col != 'target':
                        # Pearson correlation; yields NaN (not an exception) for constant columns
                        corr = df[col].corr(df['target'])
                        if pd.isna(corr):
                            f.write("    - Corr w/ Target: NaN (Constant value?)\n")
                        else:
                            f.write(f"    - Corr w/ Target: {corr:.4f} ")
                            if abs(corr) > 0.1: f.write("(STRONG SIGNAL)")
                            elif abs(corr) < 0.01: f.write("(NO SIGNAL)")
                            f.write("\n")

                # CATEGORICAL ANALYSIS
                else:
                    unique = df[col].nunique()
                    f.write(f"    - Unique Labels: {unique}\n")

                    # Show distribution (Top 10)
                    if unique < 50:
                        f.write("    - Distribution:\n")
                        dist = df[col].value_counts(normalize=True).head(10) * 100
                        for val, pct in dist.items():
                            f.write(f"      * {str(val)[:40].ljust(40)} : {pct:.1f}%\n")
                    else:
                        f.write("    - Top 5 Most Frequent:\n")
                        dist = df[col].value_counts().head(5)
                        for val, count in dist.items():
                            f.write(f"      * {str(val)[:40].ljust(40)} : {count}\n")

                # TEXT ANALYSIS (Specific)
                if col in ['official_title', 'brief_summary', 'detailed_description', 'criteria', 'why_stopped']:
                    # Avg word count (missing values count as zero words)
                    word_counts = df[col].fillna('').astype(str).str.split().str.len()
                    avg_words = word_counts.mean()
                    empty_rows = (word_counts < 3).sum()
                    f.write(f"    - Text Stats: Avg {avg_words:.0f} words. {empty_rows} rows are empty/short.\n")

    print(f"SUCCESS: Audit saved to {OUTPUT_REPORT}")
    print("You can now open this file to verify your data quality before modeling.")

if __name__ == "__main__":
    run_objective_audit()

>>> STARTING AUDIT ON: project_data.csv...
   - Loaded 124,490 rows.
SUCCESS: Audit saved to audit_dataset_report.txt
You can now open this file to verify your data quality before modeling.
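
The audit reports a correlation with the target for numeric fields only. For categorical fields, a comparable signal check is the failure rate per label. The sketch below is one possible complement, not part of the audit script: `failure_rate_by_label` is a hypothetical helper and the toy rows are invented.

```python
import pandas as pd

def failure_rate_by_label(df, col, target_col="target", min_count=1):
    """Failure rate and row count per label, sorted from riskiest to safest."""
    table = (
        df.groupby(col, dropna=False)[target_col]
          .agg(failure_rate="mean", n="size")
          .reset_index()
    )
    table = table[table["n"] >= min_count]
    return table.sort_values("failure_rate", ascending=False).reset_index(drop=True)

# Toy frame with illustrative values only
toy = pd.DataFrame({
    "agency_class": ["INDUSTRY", "INDUSTRY", "OTHER", "OTHER", "OTHER"],
    "target": [1, 0, 0, 0, 1],
})
table = failure_rate_by_label(toy, "agency_class")
print(table)
```

Raising `min_count` hides labels with too few trials for the rate to be meaningful.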
