# 01 — Data Acquisition (ClinicalTrials.gov API v2)

## Objective
Pull Phase II–III **interventional** studies from ClinicalTrials.gov using the official API v2 and store raw records locally in an auditable, reproducible format.

## Why raw NDJSON?
- **Auditability:** raw records remain unchanged for traceability  
- **Reproducibility:** exact query + run metadata saved alongside data  
- **Pipeline-friendly:** line-delimited JSON is easy to stream and parse
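
As a rough illustration of the streaming point, the sketch below reads an NDJSON file one record at a time without loading the whole file into memory. `iter_ndjson` is a hypothetical helper name, and the two demo records are made up for the example; they are not registry data.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterator


def iter_ndjson(path: Path) -> Iterator[dict]:
    """Yield one parsed record per non-empty line, reading lazily."""
    with path.open("rb") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


# Tiny self-contained demo with made-up records (not real registry data).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.ndjson"
    demo_path.write_text('{"id": "A"}\n{"id": "B"}\n', encoding="utf-8")
    demo_ids = [rec["id"] for rec in iter_ndjson(demo_path)]
```

Because each line is an independent JSON document, a downstream step can stop early, skip a damaged line, or process the file in chunks without re-parsing everything.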

## Outputs (written to disk)
- `data/raw/ctgov_studies_<run_id>_<query_hash>.ndjson`
- `data/raw/ctgov_studies_<run_id>_<query_hash>.meta.json`

In [1]:
import json
import time
import hashlib
from datetime import datetime, timezone
from pathlib import Path

import requests
import orjson
from tenacity import retry, stop_after_attempt, wait_exponential

## Configuration

The query selects interventional trials in Phase II or Phase III.  
`MAX_STUDIES` is a safety cap: it stops the download once that many records are stored, so a broad query cannot produce a file too large for local iteration.

In [2]:
# Repo-relative paths (this notebook lives in /notebooks)
REPO_ROOT = Path("..").resolve()
DATA_RAW = REPO_ROOT / "data" / "raw"
DATA_RAW.mkdir(parents=True, exist_ok=True)

# ClinicalTrials.gov API v2 endpoint
CTG_BASE_URL = "https://clinicaltrials.gov/api/v2/studies"

# Query: Interventional Phase II–III trials
QUERY_TERM = (
    "AREA[StudyType]Interventional "
    "AND (AREA[Phase]PHASE2 OR AREA[Phase]PHASE3)"
)

PAGE_SIZE = 100
MAX_STUDIES = 25000   # safety cap for local iteration
TIMEOUT_SECS = 60

## Reliable request helper (with retries)

ClinicalTrials.gov API calls occasionally fail due to transient network issues.  
This helper retries each request up to five times with exponential backoff (waits clamped to between 1 and 20 seconds); an error reaches the caller only if every attempt fails.

In [3]:
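# Up to 5 attempts with exponential backoff; reraise=True surfaces the original error.
@retry(stop=stop_after_attempt(5), wait=wait_exponential(multiplier=1, min=1, max=20), reraise=True)
def get_page(params: dict) -> dict:
    """Fetch one page of results and return the decoded JSON body."""
    r = requests.get(CTG_BASE_URL, params=params, timeout=TIMEOUT_SECS)
    r.raise_for_status()  # any HTTP 4xx/5xx raises here and triggers a retry
    return r.json()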
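
To make the retry behaviour concrete, here is a stdlib-only sketch of the same idea: call a function, and on failure wait progressively longer before trying again. `retry_with_backoff` and `flaky` are hypothetical names used only for this illustration, and the wait schedule shown (1 s, 2 s, 4 s, ... capped at 20 s) is an assumption that may not match tenacity's schedule exactly in every version.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fn: Callable[[], T],
    attempts: int = 5,
    base_wait: float = 1.0,
    max_wait: float = 20.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn until it succeeds or `attempts` calls have failed."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Double the wait after each failure, capped at max_wait.
            sleep(min(max_wait, base_wait * 2 ** (attempt - 1)))
    raise ValueError("attempts must be at least 1")


# Demo: a simulated call that fails twice, then succeeds.
calls = {"n": 0}
waits: List[float] = []  # records the waits instead of actually sleeping


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated transient failure")
    return "ok"


result = retry_with_backoff(flaky, sleep=waits.append)
```

Passing `sleep` as a parameter keeps the demo instant and makes the schedule easy to inspect. Note that this sketch, like the decorator above, retries on every exception, including non-transient ones such as a malformed query.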

## Download function

This function:
1. Queries the API
2. Iterates through pages using `nextPageToken`
3. Writes each study as one JSON line (NDJSON)
4. Writes a metadata JSON with query + counts for reproducibility

In [4]:
def query_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


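def download_ctgov(
    query_term: str,
    page_size: int = 100,
    max_studies: int = 25000,
    polite_sleep: float = 0.15,
):
    """Download matching studies to NDJSON and write a metadata sidecar.

    Stops at `max_studies` records; sleeps `polite_sleep` seconds between pages.
    Returns (out_path, meta_path, downloaded_count, api_total_count).
    """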
    run_id = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    qh = query_hash(query_term)

    out_path = DATA_RAW / f"ctgov_studies_{run_id}_{qh}.ndjson"
    meta_path = DATA_RAW / f"ctgov_studies_{run_id}_{qh}.meta.json"

    params = {
        "query.term": query_term,
        "pageSize": page_size,
        "countTotal": "true",
        "format": "json",
    }

    first = get_page(params)
    total = first.get("totalCount")
    next_token = first.get("nextPageToken")

    n = 0
    with out_path.open("wb") as f:
        # first page
        for s in first.get("studies", []):
            f.write(orjson.dumps(s))
            f.write(b"\n")
            n += 1
            if n >= max_studies:
                break

        # remaining pages
        while next_token and n < max_studies:
            time.sleep(polite_sleep)
            page = get_page({**params, "pageToken": next_token})
            next_token = page.get("nextPageToken")

            for s in page.get("studies", []):
                f.write(orjson.dumps(s))
                f.write(b"\n")
                n += 1
                if n >= max_studies:
                    break

    meta = {
        "run_id_utc": run_id,
        "query_term": query_term,
        "page_size": page_size,
        "max_studies": max_studies,
        "downloaded_studies": n,
        "api_reported_total_count": total,
        "output_file": out_path.name,
        "source": "ClinicalTrials.gov API v2",
    }
    # Explicit encoding keeps the sidecar byte-identical across platforms.
    meta_path.write_text(json.dumps(meta, indent=2), encoding="utf-8")

    return out_path, meta_path, n, total

## Execute download

This will write:
- raw NDJSON of studies
- a metadata JSON file with query + counts

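If `downloaded_studies` equals `MAX_STUDIES`, the download stopped at the safety cap and the file holds only a subset of the matching studies. That is fine for local iteration; we’ll tighten filters later if needed.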

In [5]:
out_path, meta_path, n, total = download_ctgov(
    query_term=QUERY_TERM,
    page_size=PAGE_SIZE,
    max_studies=MAX_STUDIES,
)

(out_path, meta_path, n, total)

(PosixPath('/Users/saturnine/Desktop/trialpulse/data/raw/ctgov_studies_20260210T040533Z_3c62edb50608b1de.ndjson'),
 PosixPath('/Users/saturnine/Desktop/trialpulse/data/raw/ctgov_studies_20260210T040533Z_3c62edb50608b1de.meta.json'),
 25000,
 127951)
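
The output above shows the cap was reached: 25,000 records stored against 127,951 reported by the API. A small helper along the following lines could flag this directly from the metadata. `summarize_coverage` is a hypothetical name; the dictionary keys are the ones `download_ctgov` writes, and the demo values are copied from the run above.

```python
from typing import Optional


def summarize_coverage(meta: dict) -> dict:
    """Report whether a run hit the cap and what share of matches it stored."""
    downloaded: int = meta["downloaded_studies"]
    total: Optional[int] = meta.get("api_reported_total_count")
    return {
        "hit_cap": downloaded >= meta["max_studies"],
        "truncated": total is not None and downloaded < total,
        # Fraction is undefined when the API did not report a total.
        "fraction": downloaded / total if total else None,
    }


# Values taken from the run shown above.
demo_meta = {
    "max_studies": 25000,
    "downloaded_studies": 25000,
    "api_reported_total_count": 127951,
}
coverage = summarize_coverage(demo_meta)
```

For this run the stored share is a little under one fifth of the matching studies, which is worth keeping in mind before drawing conclusions from the sample.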

## Quick integrity checks

We verify that:
- both output files exist
- the stored data is non-empty
- the first record parses as valid JSON

In [6]:
out_path.exists(), meta_path.exists()

(True, True)

In [7]:
with out_path.open("rb") as f:
    first_line = f.readline().strip()

len(first_line), first_line[:120]

(10109,
 b'{"protocolSection":{"identificationModule":{"nctId":"NCT05162937","orgStudyIdInfo":{"id":"GR1501-002"},"organization":{"')

In [8]:
# Parse first record to ensure it is valid JSON
first_record = json.loads(first_line)
list(first_record.keys())[:10]

['protocolSection', 'derivedSection', 'hasResults']
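
The checks above only look at the first line. A fuller pass, sketched here under stated assumptions, would parse every line, count records, and look for repeated `nctId` values. `check_ndjson` is a hypothetical helper; the `protocolSection.identificationModule.nctId` path is the one visible in the first record above, and the demo IDs are made-up placeholders.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def check_ndjson(path: Path) -> dict:
    """Count parseable records, unparseable line numbers, and duplicate nctIds."""
    n_records = 0
    bad_lines = []
    ids = Counter()
    with path.open("rb") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                rec = json.loads(line)
            except ValueError:  # covers JSON and UTF-8 decoding errors
                bad_lines.append(line_no)
                continue
            if not isinstance(rec, dict):
                bad_lines.append(line_no)
                continue
            n_records += 1
            ident = rec.get("protocolSection", {}).get("identificationModule", {})
            ids[ident.get("nctId")] += 1
    duplicates = sorted(k for k, v in ids.items() if k is not None and v > 1)
    return {"records": n_records, "bad_lines": bad_lines, "duplicate_ids": duplicates}


# Demo file with placeholder IDs: one broken line and one repeated ID.
demo_lines = [
    '{"protocolSection": {"identificationModule": {"nctId": "NCT00000001"}}}',
    '{"protocolSection": {"identificationModule": {"nctId": "NCT00000002"}}}',
    "{not valid json",
    '{"protocolSection": {"identificationModule": {"nctId": "NCT00000001"}}}',
]
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.ndjson"
    demo_path.write_text("\n".join(demo_lines) + "\n", encoding="utf-8")
    report = check_ndjson(demo_path)
```

On the real file, `report["records"]` could also be compared with `downloaded_studies` in the metadata sidecar.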

## Next notebook

Proceed to **02_schema_validation_missingness.ipynb** to:
- flatten key fields into a tabular dataset
- check missingness + duplicates
- write interim profiling outputs for cleaning/feature engineering
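
As a preview of the flattening step, the sketch below pulls a few nested fields into a flat row. It is an illustration only, not the code in the next notebook: `dig` and `flatten_study` are hypothetical helpers, and the field paths are limited to those visible in the first record printed above.

```python
from typing import Any, Optional


def dig(record: dict, *keys: str) -> Optional[Any]:
    """Walk nested dicts by key, returning None if any level is missing."""
    cur: Any = record
    for key in keys:
        if not isinstance(cur, dict) or key not in cur:
            return None
        cur = cur[key]
    return cur


def flatten_study(record: dict) -> dict:
    """Map one raw study record to a flat row of selected fields."""
    return {
        "nct_id": dig(record, "protocolSection", "identificationModule", "nctId"),
        "org_study_id": dig(
            record, "protocolSection", "identificationModule", "orgStudyIdInfo", "id"
        ),
        "has_results": record.get("hasResults"),
    }


# Demo record built from the fragment of the first line shown earlier;
# hasResults is omitted here, so it flattens to None.
demo_record = {
    "protocolSection": {
        "identificationModule": {
            "nctId": "NCT05162937",
            "orgStudyIdInfo": {"id": "GR1501-002"},
        }
    }
}
flat = flatten_study(demo_record)
```

Returning `None` for absent fields, rather than raising, keeps missing values visible for the missingness checks that follow.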