# Get Data

## Drive

In [None]:
!gdown 1L3WUcXnBRWA4egitSENQVfO1txGQrz0s

Downloading...
From: https://drive.google.com/uc?id=1L3WUcXnBRWA4egitSENQVfO1txGQrz0s
To: /content/rag_data.zip
100% 7.48M/7.48M [00:00<00:00, 19.5MB/s]


In [None]:
!unzip /content/rag_data.zip

Archive:  /content/rag_data.zip
   creating: rag_data/
  inflating: __MACOSX/._rag_data     
  inflating: rag_data/Clinical Development Success Rates 2011–2020- BIO_QLS Advisors Report.pdf  
  inflating: __MACOSX/rag_data/._Clinical Development Success Rates 2011–2020- BIO_QLS Advisors Report.pdf  
  inflating: rag_data/Factors Affecting Success of New Drug Clinical Trials (2023).pdf  
  inflating: __MACOSX/rag_data/._Factors Affecting Success of New Drug Clinical Trials (2023).pdf  
  inflating: rag_data/Enrollment Success, Factors, and Prediction Models in Cancer Trials (2008‑2019).pdf  
  inflating: __MACOSX/rag_data/._Enrollment Success, Factors, and Prediction Models in Cancer Trials (2008‑2019).pdf  
  inflating: rag_data/docs_manifest.csv  
  inflating: __MACOSX/rag_data/._docs_manifest.csv  
  inflating: rag_data/Clinical endpoints in oncology ‑ a primer (Delgado & Guddati, 2021).pdf  
  inflating: __MACOSX/rag_data/._Clinical endpoints in oncology ‑ a primer (Delgado & Guddati
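
The archive was created on macOS, so `unzip` also writes a `__MACOSX/` tree of `._*` sidecar files next to the real data. A minimal sketch of an alternative extraction step that skips those entries, using only the standard library; the helper name `extract_without_macos_metadata` is an assumption for illustration and is not part of the notebook:

```python
import zipfile
from pathlib import Path


def extract_without_macos_metadata(zip_path, dest_dir):
    """Extract a zip archive, skipping macOS metadata entries.

    Returns the list of archive member names that were extracted.
    """
    dest_dir = Path(dest_dir)
    extracted = []
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            member = Path(info.filename)
            # Skip the __MACOSX tree and AppleDouble "._" sidecar files.
            if "__MACOSX" in member.parts or member.name.startswith("._"):
                continue
            zf.extract(info, dest_dir)
            extracted.append(info.filename)
    return extracted
```

In Colab this would be called as `extract_without_macos_metadata("/content/rag_data.zip", "/content")`, which should leave only `rag_data/` on disk.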

## Web

In [None]:
import os
import csv
from pathlib import Path
import requests
from urllib.parse import urlparse
from time import sleep

# === SETTINGS ===
TARGET_DIR = Path("./rag_data")           # where the PDFs are saved
TARGET_DIR.mkdir(parents=True, exist_ok=True)

MANIFEST_PATH = TARGET_DIR / "docs_manifest.csv"  # manifest file name


# === DOCUMENT LIST (verified links) ===
docs = [
    # --- FDA ---
    {
        "filename": "fda_oncology_endpoints_guidance.pdf",
        "title": "Clinical Trial Endpoints for the Approval of Cancer Drugs and Biologics",
        "url": "https://www.fda.gov/media/71195/download",
        "pub_date": "2015-06-01",
        "source": "FDA",
        "tags": "endpoints,regulatory,oncology"
    },
    {
        "filename": "fda_bicr_structural_doc.pdf",
        "title": "FDA Staff Manual (contains BICR organizational info)",
        "url": "https://www.fda.gov/media/89514/download",
        "pub_date": "2016-01-01",
        "source": "FDA",
        "tags": "bicr,organization"
    },
    {
        "filename": "fda_non_inferiority_trials.pdf",
        "title": "Non-Inferiority Clinical Trials to Establish Effectiveness",
        "url": "https://www.fda.gov/media/78504/download",
        "pub_date": "2010-11-01",
        "source": "FDA",
        "tags": "non-inferiority,design,statistics"
    },

    # --- EMA ---
    {
        "filename": "ema_missing_data_guideline.pdf",
        "title": "Guideline on Missing Data in Confirmatory Clinical Trials",
        "url": "https://www.ema.europa.eu/en/documents/scientific-guideline/guideline-missing-data-confirmatory-clinical-trials_en.pdf",
        "pub_date": "2010-06-01",
        "source": "EMA",
        "tags": "missing-data,statistics"
    },

    # --- ICH ---
    {
        "filename": "ich_e9_statistical_principles.pdf",
        "title": "ICH E9: Statistical Principles for Clinical Trials",
        "url": "https://database.ich.org/sites/default/files/E9_Guideline.pdf",
        "pub_date": "1998-09-01",
        "source": "ICH",
        "tags": "statistics,ich,e9"
    },
    {
        "filename": "ich_e10_choice_of_control.pdf",
        "title": "ICH E10: Choice of Control Group and Related Issues",
        "url": "https://database.ich.org/sites/default/files/E10_Guideline.pdf",
        "pub_date": "2000-07-01",
        "source": "ICH",
        "tags": "control-group,design,ich,e10"
    },

    # --- Review articles / methodology ---
    {
        "filename": "oncology_endpoints_primer.pdf",
        "title": "Clinical endpoints in oncology – a primer",
        "url": "https://pmc.ncbi.nlm.nih.gov/articles/PMC8085844/pdf/main.pdf",
        "pub_date": "2021-04-30",
        "source": "PMC",
        "tags": "review,endpoints,oncology"
    },
    {
        "filename": "surrogate_endpoints_in_oncology.pdf",
        "title": "Surrogate endpoints in oncology: when are they acceptable for regulatory and clinical decisions?",
        "url": "https://pmc.ncbi.nlm.nih.gov/articles/PMC5520356/pdf/12916_2017_Article_902.pdf",
        "pub_date": "2017-07-01",
        "source": "PMC",
        "tags": "surrogate,endpoints,oncology"
    },
    {
        "filename": "hazard_ratios_cancer_trials.pdf",
        "title": "Hazard ratios in cancer clinical trials – a primer",
        "url": "https://pmc.ncbi.nlm.nih.gov/articles/PMC7457144/pdf/nihms-1616885.pdf",
        "pub_date": "2020-08-01",
        "source": "PMC",
        "tags": "hazard-ratio,os,pfs,statistics"
    },
    {
        "filename": "recist_1_1_guidelines.pdf",
        "title": "RECIST 1.1: Response Evaluation Criteria in Solid Tumors (Guidelines)",
        "url": "https://project.eortc.org/recist/wp-content/uploads/sites/4/2015/03/RECISTGuidelines.pdf",
        "pub_date": "2009-01-01",
        "source": "EORTC",
        "tags": "recist,response,criteria"
    },
    {
        "filename": "irecist_guideline.pdf",
        "title": "iRECIST: guidelines for response criteria for use in trials testing immunotherapeutics",
        "url": "https://recist.eortc.org/recist/wp-content/uploads/sites/4/2017/03/Manuscript_IRECIST_Lancet-Oncology_Seymour-et-al_revision_FINAL_clean_nov25.pdf",
        "pub_date": "2017-03-01",
        "source": "EORTC",
        "tags": "irecist,immunotherapy,response"
    },
]


def make_headers(url: str):
    """Build browser-like User-Agent and Referer headers so the request is less likely to be rejected as a bot."""
    host = urlparse(url).netloc
    headers = {
        "User-Agent": (
            "Mozilla/5.0 (X11; Linux x86_64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/120.0 Safari/537.36"
        ),
        "Accept": "application/pdf,application/octet-stream;q=0.9,*/*;q=0.8",
    }
    if "fda.gov" in host:
        headers["Referer"] = "https://www.fda.gov/"
    elif "pmc.ncbi.nlm.nih.gov" in host:
        headers["Referer"] = "https://pmc.ncbi.nlm.nih.gov/"
    return headers


def download_pdf(url: str, dest_path: Path, timeout: int = 60, retries: int = 2) -> bool:
    """Try to download a PDF up to `retries` times. Return True on success, False otherwise."""
    headers = make_headers(url)
    # Write to a temporary name first so a failed attempt never leaves a truncated PDF at dest_path
    tmp_path = dest_path.with_name(dest_path.name + ".part")

    for attempt in range(1, retries + 1):
        try:
            with requests.get(url, stream=True, timeout=timeout, headers=headers) as resp:
                resp.raise_for_status()

                # If an HTML page came back instead (bot protection, apology / captcha), do not save it
                ctype = resp.headers.get("Content-Type", "").lower()
                if "pdf" not in ctype and "application/octet-stream" not in ctype:
                    raise RuntimeError(f"Unexpected content-type: {ctype}")

                with open(tmp_path, "wb") as f:
                    for chunk in resp.iter_content(chunk_size=8192):
                        if chunk:
                            f.write(chunk)
            tmp_path.replace(dest_path)  # publish only a fully written file
            return True

        except Exception as e:
            print(f"   Attempt {attempt} failed: {e}")
            tmp_path.unlink(missing_ok=True)
            if attempt < retries:
                sleep(1.5)

    return False


def load_existing_manifest_paths(manifest_path: Path):
    if not manifest_path.exists():
        return set()
    paths = set()
    with open(manifest_path, newline="", encoding="utf-8") as f:
        r = csv.DictReader(f)
        for row in r:
            if "file_path" in row:
                paths.add(row["file_path"])
    return paths


existing_paths = load_existing_manifest_paths(MANIFEST_PATH)
header = ["file_path", "title", "url", "pub_date", "source", "tags"]

manifest_rows_to_add = []

for doc in docs:
    dest = TARGET_DIR / doc["filename"]

    if dest.exists():
        print(f"✅ Already exists, skip download: {dest.name}")
        ok = True
    else:
        print(f"⬇️  Downloading: {doc['title']}")
        ok = download_pdf(doc["url"], dest)

    if not ok:
        print(f"   ⚠️ Could not download {doc['url']}. "
              f"If you download it manually, save as: {dest}")
        # if the file is there after all (you downloaded it manually), still add it to the manifest
        if not dest.exists():
            continue

    row = {
        "file_path": str(dest),
        "title": doc["title"],
        "url": doc["url"],
        "pub_date": doc["pub_date"],
        "source": doc["source"],
        "tags": doc["tags"],
    }
    if row["file_path"] not in existing_paths:
        manifest_rows_to_add.append(row)

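# Write / append to the manifest; emit the header only when the file is new or empty
write_header = not MANIFEST_PATH.exists() or MANIFEST_PATH.stat().st_size == 0
with open(MANIFEST_PATH, "a", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=header)
    if write_header:
        writer.writeheader()
    for row in manifest_rows_to_add:
        writer.writerow(row)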

print(f"\n📝 Manifest updated at: {MANIFEST_PATH.resolve()}")
print(f"   Added {len(manifest_rows_to_add)} new rows.")

⬇️  Downloading: Clinical Trial Endpoints for the Approval of Cancer Drugs and Biologics
⬇️  Downloading: FDA Staff Manual (contains BICR organizational info)
⬇️  Downloading: Non-Inferiority Clinical Trials to Establish Effectiveness
✅ Already exists, skip download: ema_missing_data_guideline.pdf
✅ Already exists, skip download: ich_e9_statistical_principles.pdf
✅ Already exists, skip download: ich_e10_choice_of_control.pdf
⬇️  Downloading: Clinical endpoints in oncology – a primer
   Attempt 1 failed: 403 Client Error: Forbidden for url: https://pmc.ncbi.nlm.nih.gov/articles/PMC8085844/pdf/main.pdf
   Attempt 2 failed: 403 Client Error: Forbidden for url: https://pmc.ncbi.nlm.nih.gov/articles/PMC8085844/pdf/main.pdf
   ⚠️ Could not download https://pmc.ncbi.nlm.nih.gov/articles/PMC8085844/pdf/main.pdf. If you download it manually, save as: rag_data/oncology_endpoints_primer.pdf
⬇️  Downloading: Surrogate endpoints in oncology: when are they acceptable for regulatory and clinical deci
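
The downloader above trusts the `Content-Type` header, and files saved by hand after a 403 bypass that check entirely. A minimal sketch of a complementary check that looks for the `%PDF-` signature at the start of each file; the helper names `looks_like_pdf` and `find_suspect_pdfs` are assumptions for illustration:

```python
from pathlib import Path


def looks_like_pdf(path, probe_bytes=1024):
    """Return True if the %PDF- signature appears in the first probe_bytes of the file."""
    path = Path(path)
    if not path.is_file():
        return False
    with open(path, "rb") as f:
        head = f.read(probe_bytes)
    # Some PDFs carry a few stray bytes before the header, so search the probe window.
    return b"%PDF-" in head


def find_suspect_pdfs(data_dir):
    """List the names of *.pdf files under data_dir that do not look like real PDFs."""
    return sorted(p.name for p in Path(data_dir).glob("*.pdf") if not looks_like_pdf(p))
```

Running `find_suspect_pdfs("./rag_data")` before indexing would flag, for example, an HTML error page that was saved with a `.pdf` extension.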

# RAG

**Restart the session before running the cells in this section.**

In [None]:
# ==== Step 1: Choose a lightweight stack + install ====
!pip -q install pypdf sentence-transformers faiss-cpu rapidfuzz python-dateutil > /dev/null

import os, json, math, textwrap, uuid, shutil
from pathlib import Path
from dataclasses import dataclass, asdict
from typing import List, Dict, Any, Optional
from datetime import datetime
from dateutil.parser import parse as parse_dt

import numpy as np
from pypdf import PdfReader
from sentence_transformers import SentenceTransformer
import faiss

DATA_DIR = Path("./rag_data")        # directory for your PDFs & manifest
INDEX_DIR = Path("./rag_index")      # directory for index artifacts
DATA_DIR.mkdir(exist_ok=True, parents=True)
INDEX_DIR.mkdir(exist_ok=True, parents=True)

CUTOFF_ISO = "2025-05-28"            # strict time cutoff
CUTOFF_DT = datetime.fromisoformat(CUTOFF_ISO)

# Embedding model (E5 family; inputs must be prefixed with "query: " or "passage: ")
EMBED_MODEL_NAME = "intfloat/e5-base"
EMBED_BATCH_SIZE = 32

print("✅ Stack ready. Data dir:", DATA_DIR.resolve())


✅ Stack ready. Data dir: /content/rag_data
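
The ingestion cell below drops every document whose `pub_date` is later than `CUTOFF_ISO`; a document dated exactly on the cutoff is kept. A minimal sketch of that comparison using only the standard library, assuming dates are plain ISO `YYYY-MM-DD` strings (the notebook itself parses more formats via `dateutil`); `is_within_cutoff` is a hypothetical helper:

```python
from datetime import datetime

CUTOFF_DT = datetime.fromisoformat("2025-05-28")


def is_within_cutoff(pub_date, cutoff=CUTOFF_DT):
    """True if pub_date (ISO 'YYYY-MM-DD') parses and is not later than the cutoff."""
    try:
        pub = datetime.fromisoformat(pub_date)
    except (TypeError, ValueError):
        return False  # missing or malformed dates are treated as ineligible
    return pub <= cutoff
```

Treating unparseable dates as ineligible mirrors the notebook's choice to skip rows with no valid `pub_date`.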


In [None]:
import csv
from itertools import islice

MANIFEST_CSV = DATA_DIR / "docs_manifest.csv"   # preferred
MANIFEST_JSONL = DATA_DIR / "docs_manifest.jsonl"  # optional alternative

# -------- helpers --------
def _safe_isodate(s: str) -> Optional[datetime]:
    try:
        return parse_dt(s).replace(tzinfo=None)
    except Exception:
        return None

def _read_manifest() -> List[Dict[str, Any]]:
    items = []
    if MANIFEST_CSV.exists():
        with open(MANIFEST_CSV, newline='', encoding="utf-8") as f:
            for row in csv.DictReader(f):
                items.append({k: (v.strip() if isinstance(v, str) else v) for k, v in row.items()})
    elif MANIFEST_JSONL.exists():
        with open(MANIFEST_JSONL, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    items.append(json.loads(line))
    else:
        # create a template CSV
        with open(MANIFEST_CSV, "w", newline="", encoding="utf-8") as f:
            w = csv.writer(f)
            w.writerow(["file_path","title","url","pub_date","source","tags"])
            w.writerow(["./rag_data/oncology_endpoints_primer.pdf",
                        "Clinical endpoints in oncology – a primer",
                        "https://pmc.ncbi.nlm.nih.gov/articles/PMC8085844/",
                        "2021-04-30","PMC","endpoints,oncology,primer"])
            w.writerow(["./rag_data/fda_oncology_endpoints_guidance.pdf",
                        "FDA Oncology Endpoints Guidance",
                        "https://www.fda.gov/media/71195/download",
                        "2015-06-01","FDA","regulatory,endpoints"])
        print(f"📝 Created template manifest at {MANIFEST_CSV}. "
              f"Add your PDFs to {DATA_DIR} and fill real rows, then re-run.")
        return []
    return items

def _approx_token_len(text: str) -> int:
    # rough proxy: counts whitespace-separated words; real token counts run higher (~1 token per 0.75 words)
    return max(1, len(text.split()))

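def _chunk_text(pages: List[Dict[str, Any]],
               target_tokens: int = 900,
               overlap_tokens: int = 150) -> List[Dict[str, Any]]:
    """Sliding-window chunking across concatenated pages; keeps page ranges.

    Sizes are counted in whitespace-separated words, not model tokens. E5 models
    truncate inputs at 512 tokens, so with the 900-word default only the start of
    each chunk is actually embedded; consider a smaller target.
    """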
    # build a list of (page_num, words[])
    page_units = []
    for p in pages:
        words = p["text"].split()
        page_units.append((p["page"], words))

    chunks = []
    buf_words, buf_pages = [], []
    cur_len = 0

    def flush_chunk():
        if not buf_words:
            return
        text = " ".join(buf_words).strip()
        start_p = buf_pages[0]
        end_p = buf_pages[-1]
        chunks.append({"text": text, "page_start": start_p, "page_end": end_p})

    i_page = 0
    while i_page < len(page_units):
        pnum, words = page_units[i_page]
        for w in words:
            buf_words.append(w)
            cur_len += 1
            if pnum not in buf_pages:
                buf_pages.append(pnum)
            if cur_len >= target_tokens:
                flush_chunk()
                # overlap
                buf_words = buf_words[-overlap_tokens:]
                cur_len = len(buf_words)
                # page range overlap: keep last page only as best effort
                buf_pages = [buf_pages[-1]]
        i_page += 1
    flush_chunk()
    return chunks

def _pdf_to_pages(pdf_path: Path) -> List[Dict[str, Any]]:
    reader = PdfReader(str(pdf_path))
    pages = []
    for i, page in enumerate(reader.pages, start=1):
        try:
            txt = page.extract_text() or ""
        except Exception:
            txt = ""
        txt = " ".join(txt.split())  # normalize whitespace
        if txt.strip():
            pages.append({"page": i, "text": txt})
    return pages

# -------- ingest + index --------
items = _read_manifest()
if not items:
    raise SystemExit("Fill the manifest and re-run this cell.")

# filter by cutoff & existence
valid_docs = []
for it in items:
    path = Path(it["file_path"])
    pub = _safe_isodate(it.get("pub_date",""))
    if not path.exists():
        print(f"⚠️ Missing file: {path} (skipping)")
        continue
    if not pub:
        print(f"⚠️ No/invalid pub_date for {path} (skipping)")
        continue
    if pub > CUTOFF_DT:
        print(f"⛔ Excluded (after cutoff {CUTOFF_ISO}): {path} — pub_date={pub.date()}")
        continue
    valid_docs.append({
        "doc_id": str(uuid.uuid4()),
        "file_path": str(path),
        "title": it.get("title") or path.stem,
        "url": it.get("url") or "",
        "source": it.get("source") or "",
        "tags": it.get("tags") or "",
        "pub_date": pub.date().isoformat()
    })

if not valid_docs:
    raise SystemExit("No eligible documents ≤ cutoff. Update manifest and retry.")

print(f"📚 Eligible docs: {len(valid_docs)} (cutoff {CUTOFF_ISO})")

# extract + chunk
all_chunks = []
for d in valid_docs:
    pages = _pdf_to_pages(Path(d["file_path"]))
    if not pages:
        print(f"⚠️ Empty/Unextractable text: {d['file_path']}")
        continue
    chunks = _chunk_text(pages, target_tokens=900, overlap_tokens=150)
    for ch in chunks:
        all_chunks.append({
            "doc_id": d["doc_id"],
            "title": d["title"],
            "url": d["url"],
            "source": d["source"],
            "tags": d["tags"],
            "pub_date": d["pub_date"],
            "page_start": ch["page_start"],
            "page_end": ch["page_end"],
            "text": ch["text"]
        })

print(f"✂️ Created {len(all_chunks)} chunks from {len(valid_docs)} docs")

# embed
model = SentenceTransformer(EMBED_MODEL_NAME)
def _embed_passages(texts: List[str]) -> np.ndarray:
    # E5 expects "passage: ..." for passages
    return model.encode([f"passage: {t}" for t in texts],
                        batch_size=EMBED_BATCH_SIZE,
                        show_progress_bar=True,
                        normalize_embeddings=True)

emb = _embed_passages([c["text"] for c in all_chunks]).astype("float32")
if emb.shape[0] != len(all_chunks):
    raise RuntimeError("Embedding count mismatch.")

# build FAISS (cosine via inner-product on normalized vectors)
dim = emb.shape[1]
index = faiss.IndexFlatIP(dim)
index.add(emb)

# persist
faiss.write_index(index, str(INDEX_DIR / "rag.faiss"))
with open(INDEX_DIR / "rag_meta.jsonl", "w", encoding="utf-8") as f:
    for c in all_chunks:
        f.write(json.dumps(c, ensure_ascii=False) + "\n")
with open(INDEX_DIR / "embed_model.txt", "w") as f:
    f.write(EMBED_MODEL_NAME + "\n")
with open(INDEX_DIR / "stats.json", "w") as f:
    json.dump({
        "num_docs": len(valid_docs),
        "num_chunks": len(all_chunks),
        "cutoff": CUTOFF_ISO,
        "embed_model": EMBED_MODEL_NAME
    }, f, indent=2)

print(f"✅ Index saved to {INDEX_DIR / 'rag.faiss'} "
      f"with metadata {INDEX_DIR / 'rag_meta.jsonl'}")


📚 Eligible docs: 15 (cutoff 2025-05-28)
✂️ Created 226 chunks from 15 docs
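
The chunk count above comes from a sliding window over words: each window holds up to `target_tokens` words and repeats the last `overlap_tokens` words of the previous one. A simplified sketch of the same idea without page tracking; `sliding_window_chunks` is a hypothetical helper, and unlike `_chunk_text` it does not emit a final chunk made up only of overlap words:

```python
def sliding_window_chunks(words, size, overlap):
    """Split a word list into windows of `size` words that share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + size])
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

With `size=900` and `overlap=150`, as in the notebook, each new window advances by 750 words.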


✅ Index saved to rag_index/rag.faiss with metadata rag_index/rag_meta.jsonl
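
The index uses `IndexFlatIP` on embeddings produced with `normalize_embeddings=True`. For unit-length vectors the inner product equals cosine similarity, which is why no separate cosine metric is needed. A small pure-Python illustration of that identity (the helper names are assumptions for this sketch):

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity computed directly from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))
```

For any non-zero `a` and `b`, `dot(l2_normalize(a), l2_normalize(b))` matches `cosine(a, b)` up to floating-point error.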


In [None]:
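# Simple retrieval helper for a manual query
from functools import lru_cache

import faiss

#INDEX_DIR = '/content/rag_index'

@lru_cache(maxsize=1)
def load_index_and_meta():
    """Load the FAISS index, chunk metadata, and embedding model once; later calls reuse them.

    After rebuilding the index, call load_index_and_meta.cache_clear() to reload.
    """
    idx = faiss.read_index(str(INDEX_DIR / "rag.faiss"))
    meta = []
    with open(INDEX_DIR / "rag_meta.jsonl", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                meta.append(json.loads(line))
    mdl_name = (INDEX_DIR / "embed_model.txt").read_text().strip()
    mdl = SentenceTransformer(mdl_name)
    return idx, meta, mdl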

def retrieve(query: str, k: int = 5) -> List[Dict[str, Any]]:
    idx, meta, mdl = load_index_and_meta()
    qv = mdl.encode([f"query: {query}"], normalize_embeddings=True).astype("float32")
    scores, ids = idx.search(qv, k)
    res = []
    for rank, (sc, i) in enumerate(zip(scores[0], ids[0]), start=1):
        if i == -1:
            continue
        m = meta[int(i)]
        res.append({
            "rank": rank,
            "score": float(sc),
            "title": m["title"],
            "url": m["url"],
            "pub_date": m["pub_date"],
            "page_span": f"pp.{m['page_start']}-{m['page_end']}",
            "preview": (m["text"][:400] + "…") if len(m["text"]) > 400 else m["text"]
        })
    return res

# Try a couple of queries you expect to hit:
for q in [
    "oncology endpoints OS vs PFS definitions",
    "regulatory guidance acceptable endpoints for cancer trials"
]:
    print("\n🔎 Query:", q)
    hits = retrieve(q, k=3)
    for h in hits:
        print(f"  #{h['rank']}  [{h['score']:.3f}] {h['title']} ({h['pub_date']}, {h['page_span']})")
        print("   ", h["preview"].replace("\n"," "))



🔎 Query: oncology endpoints OS vs PFS definitions
  #1  [0.888] Clinical Trial Design and Drug Approval in Oncology (2020-09-01, pp.6-7)
    not causality • Recall bias susceptibility • Group sizes may be unequal Case report • A detailed report of the diagnosis, treatment, and follow-up of an individual patient • Useful in reporting postmarketing exposure to approved interventions generating ideas, hypotheses, and techniques that can then be tested in a clinical trial • Not generalizable Note. RCT = randomized controlled trial. Informa…
  #2  [0.888] Clinical Endpoints in Oncology – A Primer (2021-04-30, pp.3-4)
    is important for studies to clarify what is meant by evidence of disease progres - sion. In advanced breast cancer, some in- vestigators use PFS and TTP interchangeably, potentially leading to confusion when com- paring the outcomes of various trials [17]. Meanwhile, studies have used TTP to evaluate aggressive therapies for advanced non-small cell lung cancer, however, it
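
Both hits above score 0.888. The E5 model card notes that cosine scores from these models tend to fall in a narrow band of roughly 0.7 to 1.0, so the ranking is more informative than the absolute value. What a flat inner-product index does at query time is an exhaustive scan; a pure-Python sketch of that scan follows, with `top_k_inner_product` as a hypothetical helper rather than a FAISS API:

```python
import heapq


def top_k_inner_product(query, vectors, k):
    """Score every stored vector against the query and return the k best.

    Returns (score, index) pairs ordered from highest to lowest score.
    """
    scored = (
        (sum(q * v for q, v in zip(query, vec)), i)
        for i, vec in enumerate(vectors)
    )
    return heapq.nlargest(k, scored)
```

When fewer than `k` vectors are stored, this sketch simply returns fewer pairs, whereas FAISS pads the result with index `-1`, which is why `retrieve` skips those entries.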

In [None]:
!zip -r rag_data.zip /content/rag_data

  adding: content/rag_data/ (stored 0%)
  adding: content/rag_data/irecist_guideline.pdf (deflated 8%)
  adding: content/rag_data/recist_1_1_guidelines.pdf (deflated 6%)
  adding: content/rag_data/Clinical trial design and drug approval in oncology (Kurtin et al., 2020).pdf (deflated 17%)
  adding: content/rag_data/Estimation of clinical trial success rates and related parameters (Wong, Siah & Lo, 2019).pdf (deflated 12%)
  adding: content/rag_data/ema_missing_data_guideline.pdf (deflated 23%)
  adding: content/rag_data/fda_bicr_structural_doc.pdf (deflated 26%)
  adding: content/rag_data/Clinical endpoints in oncology ‑ a primer (Delgado & Guddati, 2021).pdf (deflated 24%)
  adding: content/rag_data/ich_e9_statistical_principles.pdf (deflated 20%)
  adding: content/rag_data/fda_non_inferiority_trials.pdf (deflated 4%)
  adding: content/rag_data/Cancer Drug Approval Endpoints- FDA Guidance.pdf (deflated 8%)
  adding: content/rag_data/ich_e10_choice_of_control.pdf (deflated 25%)
  addin

In [None]:
!zip -r rag_index.zip /content/rag_index

  adding: content/rag_index/ (stored 0%)
  adding: content/rag_index/rag_meta.jsonl (deflated 75%)
  adding: content/rag_index/stats.json (deflated 16%)
  adding: content/rag_index/embed_model.txt (stored 0%)
  adding: content/rag_index/rag.faiss (deflated 7%)
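
Before reusing the zipped index elsewhere, it may be worth confirming that the artifacts agree with each other. A minimal sketch that compares `num_chunks` in `stats.json` with the number of rows in `rag_meta.jsonl`; `check_index_artifacts` is a hypothetical helper, and it does not open `rag.faiss` itself:

```python
import json
from pathlib import Path


def check_index_artifacts(index_dir):
    """Compare the chunk count in stats.json with the rows in rag_meta.jsonl."""
    index_dir = Path(index_dir)
    stats = json.loads((index_dir / "stats.json").read_text(encoding="utf-8"))
    with open(index_dir / "rag_meta.jsonl", encoding="utf-8") as f:
        meta_rows = sum(1 for line in f if line.strip())
    return {
        "expected_chunks": stats["num_chunks"],
        "meta_rows": meta_rows,
        "consistent": stats["num_chunks"] == meta_rows,
    }
```

A further check, with `faiss` installed, would be to compare the same count against the loaded index's `ntotal` attribute.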
