Here is the comprehensive **Project Recap & System Architecture** document.

You can copy and paste the text below directly into your new chat. It contains the exact context, logic, and constraints required for an expert audit of your code.

***

# 🚀 Project Recap: Clinical Trial Termination Risk Engine

### **1. Objective**
**Goal:** Predict the probability of a clinical trial being **Terminated** (Failure) vs. **Completed** (Success) based *strictly* on data available at the start of the trial.
**Business Value:** Early identification of high-risk trials to optimize portfolio management for Pharma/Investors.
**Constraint:** 8-Day Sprint (MVP focus).

---

### **2. The Target Variable**
*   **Source:** `studies.txt` $\to$ Column `overall_status`.
*   **Logic:**
    *   **0 (Success):** `COMPLETED`
    *   **1 (Failure):** `TERMINATED`, `WITHDRAWN`, `SUSPENDED`
*   **Filters:**
    *   `study_type` = 'Interventional'
    *   `intervention_type` = 'Drug' or 'Biological'
    *   `start_year`: 2000 to 2015 (Core Cohort).
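
The labeling and filter rules above can be sketched as a single function. This is a minimal illustration, not the project's loader: it assumes each trial has already been flattened into one dict with the keys `study_type`, `intervention_type`, `start_year`, and `overall_status`, and the case normalization is an assumption about how the raw values are spelled.

```python
from typing import Optional

FAILURE_STATUSES = {"TERMINATED", "WITHDRAWN", "SUSPENDED"}
ALLOWED_INTERVENTIONS = {"DRUG", "BIOLOGICAL"}


def label_trial(row: dict) -> Optional[int]:
    """Return 0 (Success), 1 (Failure), or None if the trial is outside the cohort."""
    if str(row["study_type"]).strip().lower() != "interventional":
        return None
    if str(row["intervention_type"]).strip().upper() not in ALLOWED_INTERVENTIONS:
        return None
    if not 2000 <= int(row["start_year"]) <= 2015:
        return None

    status = str(row["overall_status"]).strip().upper()
    if status == "COMPLETED":
        return 0
    if status in FAILURE_STATUSES:
        return 1
    return None  # any other status has no label under the rules above
```

Trials that return `None` are dropped before training, so the model only ever sees rows with a defined outcome.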

---

### **3. Feature Engineering Dictionary**

#### **A. Operational & Sponsor Risk (Financial/Execution)**
| Feature Name | Source | Logic | Why? |
| :--- | :--- | :--- | :--- |
| **`sponsor_tier`** | `sponsors.txt` | **Static Mapping:** Matched against a hardcoded list of Top 20 Pharma (Pfizer, GSK, etc.).<br>• Tier 1: Giants<br>• Tier 2: Others | **Leakage Fix:** We removed dynamic counting to prevent future information from leaking into the training set. |
| **`sponsor_clean`** | `sponsors.txt` | Regex cleaning (remove "Inc", "LLC"). | Used for **Target Encoding** to capture specific sponsor performance history. |
| **`is_international`** | `countries.txt` | Binary (1 if >1 country). | Multi-country trials are operationally harder but suggest higher investment. |
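
A rough sketch of the three features in this table follows. The `TOP_PHARMA` set is a hypothetical, abbreviated stand-in for the hardcoded Top 20 list (only Pfizer and GSK are named above), and matching on the first token of the cleaned name is an assumption, not necessarily how the project matches sponsors.

```python
import re

# Hypothetical, abbreviated stand-in for the hardcoded Top 20 list.
TOP_PHARMA = {"pfizer", "gsk", "glaxosmithkline"}

# Matches the legal suffixes named in the table ("Inc", "LLC"), with an optional dot.
_SUFFIX_RE = re.compile(r"\b(inc|llc)\b\.?", re.IGNORECASE)


def clean_sponsor(name: str) -> str:
    """Lowercase the sponsor name and strip legal suffixes and punctuation."""
    no_suffix = _SUFFIX_RE.sub(" ", name)
    no_punct = re.sub(r"[^\w\s]", " ", no_suffix)
    return " ".join(no_punct.lower().split())


def sponsor_tier(name: str) -> int:
    """Tier 1 if the first token of the cleaned name is on the static list, else Tier 2."""
    tokens = clean_sponsor(name).split()
    return 1 if tokens and tokens[0] in TOP_PHARMA else 2


def is_international(countries) -> int:
    """1 if the trial lists more than one distinct country, else 0."""
    distinct = {c.strip().lower() for c in countries if c and c.strip()}
    return int(len(distinct) > 1)
```

Because the tier comes from a fixed list rather than from counts computed over the dataset, it cannot change depending on which trials end up in the training split.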

#### **B. Scientific & Protocol Risk (Complexity)**
| Feature Name | Source | Logic | Why? |
| :--- | :--- | :--- | :--- |
| **`phase_ordinal`** | `studies.txt` | **Integer Mapping:**<br>• Phase 1 $\to$ 1<br>• Phase 1/2 & Phase 2 $\to$ 2<br>• Phase 2/3 & Phase 3 $\to$ 3 | **Risk Bucketing:** Mixed phases (1/2) are treated as the higher risk/efficacy hurdle (2). |
| **`num_primary_endpoints`** | `calculated_values.txt` | Raw Count. | More endpoints = higher chance of missing one = higher failure risk. |
| **`criteria_len_log`** | `eligibilities.txt` | `Log(Length of text)`. | Longer criteria = stricter protocol = harder recruitment. |
| **`healthy_volunteers`** | `eligibilities.txt` | Binary. | Phase 1 often allows healthy volunteers (lower scientific risk). |
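
The phase mapping and the log-length feature could look roughly like this. Two choices here are assumptions rather than documented behavior: phase labels not covered by the mapping above (for example Phase 4 or "Early Phase 1") return `None`, and `log1p` is used instead of a plain log so that empty criteria text maps to 0 instead of raising an error.

```python
import math
import re
from typing import Optional


def phase_ordinal(phase: Optional[str]) -> Optional[int]:
    """Map a phase label to 1, 2, or 3; mixed phases take the higher number."""
    text = (phase or "").lower()
    if "early" in text:
        return None  # not covered by the mapping in the table
    digits = [int(d) for d in re.findall(r"[1-4]", text)]
    if not digits or max(digits) > 3:
        return None  # e.g. "N/A" or Phase 4
    return max(digits)


def criteria_len_log(criteria: Optional[str]) -> float:
    """log(1 + number of characters) of the eligibility criteria text."""
    return math.log1p(len(criteria or ""))
```

Taking the maximum digit is what sends "Phase 1/2" to 2 and "Phase 2/3" to 3, matching the risk bucketing described in the table.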

#### **C. Market Context (Competition)**
| Feature Name | Source | Logic | Why? |
| :--- | :--- | :--- | :--- |
| **`competition_broad`** | Derived | Count of trials in same `therapeutic_area` in year $Y$ and $Y-1$. | **Leakage Fix:** We stopped looking at $Y+1$. Measures market saturation at launch. |
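
As an illustration of the backward-looking count, the sketch below tallies trials per (`therapeutic_area`, `start_year`) and sums year $Y$ and $Y-1$ for each trial. The dict-based input format and the decision to exclude the trial itself from its own count are assumptions.

```python
from collections import Counter


def competition_broad(trials):
    """For each trial, count other trials in the same area starting in year Y or Y-1."""
    per_area_year = Counter((t["therapeutic_area"], t["start_year"]) for t in trials)
    counts = []
    for t in trials:
        area, year = t["therapeutic_area"], t["start_year"]
        same_year = per_area_year[(area, year)]
        prev_year = per_area_year[(area, year - 1)]
        counts.append(same_year + prev_year - 1)  # minus 1: exclude the trial itself
    return counts
```

Year $Y+1$ never enters the sum. Counting all of year $Y$ still includes trials that start later in the same year, so a stricter variant would compare full start dates.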

#### **D. Text Features (NLP Readiness)**
| Feature Name | Source | Logic | Why? |
| :--- | :--- | :--- | :--- |
| **`txt_tags`** | Mixed | Concatenation of `official_title` + `keywords` + `intervention_name`. | Raw text for future TF-IDF or Embedding layers. |

---

### **4. Data Processing Architecture**

#### **Step 1: The Loader (`data_loader_temp.py`)**
*   **Robust Parsing:** Uses a dual strategy (`csv.QUOTE_MINIMAL` vs. `quoting=3`, i.e. `csv.QUOTE_NONE`) to handle rows that break the pipe-delimited format of the raw AACT text files.
*   **Filtering:** Aggressively filters for "Drugs/Biologics" using `interventions.txt` IDs.
*   **Leakage Control:** Calculates competition metrics using only trials from year $Y$ and $Y-1$, never $Y+1$.
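
The dual parsing strategy can be sketched with the standard `csv` module. The real loader presumably passes these quoting modes to its own reader; the function name `read_pipe_delimited`, the order of the two attempts, and the "every row has as many fields as the header" check are all assumptions made for illustration.

```python
import csv
import io
from typing import List


def read_pipe_delimited(text: str) -> List[List[str]]:
    """Parse pipe-delimited text, falling back to QUOTE_NONE if quoting is broken."""
    for quoting in (csv.QUOTE_MINIMAL, csv.QUOTE_NONE):  # QUOTE_NONE == 3
        reader = csv.reader(
            io.StringIO(text, newline=""),
            delimiter="|",
            quoting=quoting,
            strict=True,  # raise csv.Error on malformed quoting instead of guessing
        )
        try:
            rows = [row for row in reader if row]
        except csv.Error:
            continue  # try the next quoting mode
        # Accept the parse only if every row matches the header width.
        if rows and all(len(row) == len(rows[0]) for row in rows):
            return rows
    raise ValueError("could not parse text with either quoting strategy")
```

With `QUOTE_MINIMAL`, a stray opening quote can swallow delimiters and newlines until the next quote or the end of the file. With `QUOTE_NONE`, quote characters are kept as literal text, so each line splits on `|` only.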

#### **Step 2: The Pipeline (`preprocessing_temp.py`)**
*   **Model:** Transitioning from Logistic Regression to **XGBoost**.
*   **Scaling:**
    *   `MinMaxScaler`: Applied to `phase_ordinal` (preserves order).
    *   `StandardScaler`: Applied to Years/Counts (not required for XGBoost, whose tree splits are unaffected by monotonic rescaling, but kept for stability).
*   **Encoding:**
    *   `OneHotEncoder`: For low cardinality (Gender, Agency Class).
    *   `TargetEncoder`: For high cardinality (`sponsor_clean`, `pathology`). **Crucial:** Because the target is 1 for Failure, this learns the historical *failure rate* of specific diseases/sponsors, and it must be fitted on the Training set only.
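
To make the "Training set only" rule concrete, here is a hand-rolled smoothed-mean target encoder. It is a sketch of the general technique, not the `TargetEncoder` class the pipeline uses; the class name `SimpleTargetEncoder` and the `smoothing` parameter are hypothetical.

```python
from collections import defaultdict


class SimpleTargetEncoder:
    """Replace each category with a smoothed mean of the target seen during fit."""

    def __init__(self, smoothing: float = 10.0):
        self.smoothing = smoothing
        self.prior_ = None
        self.mapping_ = {}

    def fit(self, categories, targets):
        sums = defaultdict(float)
        counts = defaultdict(int)
        for cat, y in zip(categories, targets):
            sums[cat] += y
            counts[cat] += 1
        # Global failure rate of the training set, used to shrink rare categories.
        self.prior_ = sum(targets) / len(targets)
        self.mapping_ = {
            cat: (sums[cat] + self.smoothing * self.prior_) / (counts[cat] + self.smoothing)
            for cat in counts
        }
        return self

    def transform(self, categories):
        # Categories never seen during fit fall back to the global training rate.
        return [self.mapping_.get(cat, self.prior_) for cat in categories]
```

Usage follows the usual split discipline: call `fit` on the training sponsors and labels, then `transform` on both training and test rows. Fitting on the full dataset would let test-set outcomes leak into the encoded feature.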
