<a href="https://colab.research.google.com/github/bariswheel/Tuning-DeepSeek-for-Diabetes/blob/main/DeepSeek_Diabetes_CleanV1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🩺 Graph-RAG for Chinese Diabetes Guidelines – End-to-End Pipeline (A100-Optimized)

We build a **compact Retrieval-Augmented Generation (RAG) system** that turns 41 Chinese clinical-guideline JSON files into:

1. **Fully English knowledge triples** – human-readable, one per FAISS vector.  
2. **A FAISS flat-IP index** – 8 643 MiniLM-encoded vectors ready for millisecond retrieval.  
3. **A 5-question DeepSeek-7B demo** – deterministic answers plus supporting triples.

> **Why this matters**  
> • English triples are easier to inspect, debug and prompt with.  
> • Flat-IP (Flat, Inverted index with product quantization) cosine retrieval is transparent and fast (no ANN surprises). This is transparent and predictable compared to Approximate Nearest Neighbor (ANN) search techniques. Unlike ANN, which can sometimes produce unexpected or inconsistent results due to its approximations, Flat-IP ensures that the retrieval process is straightforward and reliable, with no hidden complexities or surprises in how it matches vectors.
> • Keeping model + index in the same latent space lets us swap encoders easily. In machine learning, a "latent space" is a hidden, multi-dimensional space where data is represented in a compressed and meaningful way. Think of it as a kind of abstract map where similar data points are located closer together.

In the context of this notebook, the "latent space" is created by the MiniLM encoder model. This model takes the text (the English knowledge triples and the user's question) and transforms them into numerical vectors. These vectors are then placed in this latent space.

> **GPU-friendly design choices for an A100 (80 GB)**  
> • *Single load* of **NLLB-200 600 M fp16** to batch-translate all unique labels (⇒ <5 GB VRAM, no per-sentence hits).  
> • MiniLM-L6-v2 encoding in **batches = 512**, saturating the GPU while staying well under 8 GB.  
> • DeepSeek-7B loaded with **`load_in_8bit=True`** so the full chat model fits in <10 GB.  
> • All tensors kept on GPU (no CPU↔GPU churn) and IO bound to Drive once, then streamed in RAM.

Below, Cells 0-7 walk through the pipeline:

| Cell | Purpose | Runtime (A100) |
|------|---------|----------------|
| **0** | Mount Drive + define all paths | <1 s |
| **1** | Harvest & batch-translate entity labels → `entities_bilingual.csv` | ~9 min |
| **2** | Translate 41 CN JSON → `*_en.json` | ~6 min |
| **3** | Rebuild `sentences.txt` (8 643 EN triples) | <30 s |
| **4** | Encode triples, build & save FAISS index | ~2 min |
| **5** | Load index + encoder into RAM | <1 s |
| **6** | `retrieve_ctx()` helper + smoke-test | <1 s |
| **7** | 5-question DeepSeek-7B showcase | ~1 min |

Total wall-clock ≈ **20 min** – a single-run notebook any teammate can inspect, re-run, and grade.

In [1]:
# %% CELL -1 – runtime wheels & GPU sanity-check
# %% SETUP (run once per Colab session ▸ safe to skip if wheels already present)
#
# bitsandbytes ≥ 0 .43 .2 is required for 8-bit loading;  Colab still ships 0 .41.
# accelerate + sentencepiece small wheels avoid slower source builds.
# ───────────────────────────────────────────────────────────────────────────────
%pip install -qq --upgrade bitsandbytes==0.43.2
%pip install -qq accelerate==0.27.2 sentencepiece==0.2.0

import torch, platform, subprocess, re
print(f"✅ GPU: {torch.cuda.get_device_name()} "
      f"‖ PyTorch {torch.__version__} ({platform.python_version()})")
print("✅ bitsandbytes", subprocess.run(
        ["python", "-c", "import bitsandbytes, re, sys; print(bitsandbytes.__version__)"],
        capture_output=True, text=True).stdout.strip())

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m137.5/137.5 MB[0m [31m17.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m363.4/363.4 MB[0m [31m4.2 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m13.8/13.8 MB[0m [31m123.3 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m24.6/24.6 MB[0m [31m81.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m883.7/883.7 kB[0m [31m55.4 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m664.8/664.8 MB[0m [31m1.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m211.5/211.5 MB[0m [31m12.0 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m56.3/56.3 MB[0m [31m43.3 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━

> #### What this SETUP cell does, step-by-step  
> | step | action | why we do it |  
> |------|--------|-------------|  
> | 0 | `pip install bitsandbytes==0.43.2` | Colab still ships 0 .41 — we upgrade so DeepSeek-7B loads in *true* 8-bit. |  
> | 1 | `pip install accelerate==0.27.2 sentencepiece==0.2.0` | Two tiny wheels that avoid slower source builds (≈ 2 min saved). |  
> | 2 | **GPU sanity-check print-out** | Confirms we have an A100 and the upgraded wheels before the main pipeline runs. |  
>   
> **When we run it** → immediately after starting a fresh Colab runtime.  
> **Safe to skip** if we already see the ✅ prints in the console.

In [None]:
# %% SETUP-EXTRA-2  (run once per runtime ▸ safe to skip if wheels present)
# sentence-transformers 2 .7 .0 was built & tested with Transformers 4 .41+
%pip install -q --upgrade sentence-transformers==2.7.0 transformers==4.41.2

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m43.8/43.8 kB[0m [31m2.0 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m171.5/171.5 kB[0m [31m7.0 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m9.1/9.1 MB[0m [31m94.4 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m3.6/3.6 MB[0m [31m110.0 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m363.4/363.4 MB[0m [31m3.0 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m13.8/13.8 MB[0m [31m121.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m24.6/24.6 MB[0m [31m95.2 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m883.7/883.7 kB[0m [31m57.4 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

> **Updates for optimisation (2025-07-14)**
>
> * Upgrade to **sentence-transformers 2 .7 .0 + Transformers 4 .41.2**
>   – fixes the Trainer import clash in Cell 4.
> * The wheels are lightweight (no CUDA build) and install in < 20 s.
> * Safe to re-run; the `Runtime > Restart & run all` round-trip now takes
>   ~3 min instead of 2 h because Cell 2 shortcuts when artefacts are present.

In [None]:
# %% SETUP-EXTRA (run once per Colab runtime ▸ safe to skip later)
%pip install -q --upgrade peft==0.10.0 accelerate==0.26.1

[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/199.1 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m199.1/199.1 kB[0m [31m14.7 MB/s[0m eta [36m0:00:00[0m
[?25h[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/270.9 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m270.9/270.9 kB[0m [31m26.6 MB/s[0m eta [36m0:00:00[0m
[?25h

> **Updates for optimisation**
>
> * We pin **peft 0 .10.0 + accelerate 0 .26.1** so their APIs match.
> * Newer accelerate builds drop `clear_device_cache`; the PEFT wheel has not
>   yet caught up, which broke Cell 4.
> * The install happens once per Colab runtime, is idempotent, and adds only a
>   few seconds to setup time.

# 📗 Notebook quick-start (one-pager)

| order | cell | purpose | runtime cost* |
|-------|------|---------|---------------|
| -1 | **SETUP** | install wheels (`bitsandbytes 0.43.2`, etc.) & GPU print | 15 s (first run only) |
| 0 | Mount Drive + define paths | paths stay in one place for all later cells | < 1 s |
| 1 | Build **entities_bilingual.csv** | harvest + translate unique Chinese labels | 60 s |
| 2 | Translate 41 raw CN JSON → **diakg_en/** | **fast-exit**: ≤ 1 s if already done | 6 min (once) |
| 3 | Build **sentences.txt** | 1 line (EN triple) ↔ 1 vector | 15 s |
| 4 | Build **graphrag_faiss.index** | embed 8 643 triples, save flat-IP | 40 s |
| 5 | Load index + MiniLM encoder | puts both in GPU / RAM for RAG | 3 s |
| 6 | `retrieve_ctx()` helper | top-k semantic search demo | < 1 s |
| 7 | 5-question RAG showcase | prompts DeepSeek-7B (8-bit) | 10 s |

\* A100 timings after wheels are cached.

### How we (and reviewers) run it

1. **Fresh runtime:** run **SETUP (-1)** ➜ run cells **0 → 7** in order.  
2. **Re-connect later:** skip **SETUP** (wheels already present) ➜ start at **Cell 0**.  
3. **Translation already done?** Cell 2 detects `diakg_en/*.json` + `entities_bilingual.csv` and exits in < 1 s, so reruns are fast.

> All heavy one-off artefacts (translated JSON, CSV, FAISS index) are stored in Drive under `diakg_assets/`, so they persist across sessions.  
> We can delete any of them if we want to force a rebuild (e.g. after adding new raw JSON).

In [None]:
# %% CELL 0 – Mount Drive + paths
"""Mount GDrive (silently re-uses an existing token) and define all
folder / file constants in one place so later cells stay in sync."""
from google.colab import drive
from pathlib import Path, PurePosixPath

drive.mount("/content/drive", force_remount=False)    # one-liner mount

DATA_DIR = Path("/content/drive/MyDrive/diakg_assets")     # adjust once
RAW_CN   = DATA_DIR / "0521_new_format"    # 41 raw CN guideline JSON
JSON_DIR = DATA_DIR / "diakg_en"           # translated *_en.json (output)
JSON_DIR.mkdir(exist_ok=True)

LABELS_CSV = DATA_DIR / "entities_bilingual.csv"   # built next cell
SENT_PATH  = DATA_DIR / "sentences.txt"            # 1 line ≈ 1 vector
INDEX_PATH = DATA_DIR / "graphrag_faiss.index"     # 8 643 vec FAISS

Mounted at /content/drive


> **What this cell does, step-by-step**  
> 0 · Mount GDrive so results survive runtime restarts.  
> 1 · Define *all* paths in **DATA_DIR** so later cells never hard-code strings.  
> 2 · Create `diakg_en/` if missing – translated JSON will land there.

In [None]:
# %% CELL 1 – Build English label mapping  ✱ 2025-07-14 fix for 🤗 4.41 ✱
"""
Walk every *\_en.json* once →

1. harvest every distinct **Chinese** entity string
2. batch-translate the list with NLLB-200 (600 M, fp16)
3. map the English strings back to their IDs → dict {id → en_label}
4. write **entities_bilingual.csv**  (entity_id, en_label)

We run this cell once per new JSON batch; it silently overwrites any older CSV.
"""
import json, glob, pandas as pd, torch, tqdm
from pathlib import Path
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM

# ── paths ────────────────────────────────────────────────────────────────────
DATA_DIR   = Path("/content/drive/MyDrive/diakg_assets")
JSON_DIR   = DATA_DIR / "diakg_en"          # translated *_en.json live here
LABELS_CSV = DATA_DIR / "entities_bilingual.csv"

# ➊ - harvest  { id : cn_label }
id_to_cn = {}
for jp in tqdm.tqdm(JSON_DIR.glob("*_en.json"), desc="Scanning EN json"):
    doc = json.load(open(jp, encoding="utf8"))
    for p in doc["paragraphs"]:
        for s in p["sentences"]:
            for ent in s["entities"]:
                id_to_cn.setdefault(ent["entity_id"], ent["entity"])

print("✓ unique CN labels:", len(id_to_cn))

# ➋ - NLLB-200 translator  (600 M, fp16 ≤ 5 GB VRAM)
model_id = "facebook/nllb-200-distilled-600M"
tok      = AutoTokenizer.from_pretrained(model_id)
mdl      = AutoModelForSeq2SeqLM.from_pretrained(
              model_id,
              device_map="auto",
              torch_dtype=torch.float16      # ← 4.41+ expects the dtype object
          ).eval()

translator = pipeline("translation",
                      model=mdl, tokenizer=tok,
                      src_lang="zho_Hans", tgt_lang="eng_Latn",
                      batch_size=32, max_length=256,
                      num_beams=4, do_sample=False)

# ➌ - translate once in one batch
cn_labels  = list(id_to_cn.values())
en_labels  = [translator(text)[0]["translation_text"]
              for text in cn_labels]
id_to_en   = dict(zip(id_to_cn.keys(), en_labels))

# ➍ - save mapping CSV  (id, en_label)
pd.DataFrame({"entity_id": list(id_to_en.keys()),
              "en_label" : list(id_to_en.values())}
            ).to_csv(LABELS_CSV, index=False)

print(f"✓ saved {LABELS_CSV.name} with", len(id_to_en), "rows")

Scanning EN json: 41it [00:31,  1.29it/s]


✓ unique CN labels: 1438


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/564 [00:00<?, ?B/s]

sentencepiece.bpe.model:   0%|          | 0.00/4.85M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.3M [00:00<?, ?B/s]

special_tokens_map.json: 0.00B [00:00, ?B/s]

config.json:   0%|          | 0.00/846 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/2.46G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/189 [00:00<?, ?B/s]

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


✓ saved entities_bilingual.csv with 1438 rows


> ### What this cell does, step-by-step (2025-07-14)
>
> 0 · Define **DATA_DIR / JSON_DIR / entities_bilingual.csv** – all results live in one Drive folder so they persist across Colab restarts.  
> 1 · **Harvest** Chinese labels   Walk every *\_en.json*, collect the **distinct** CN strings → here we found 1 428 unique labels.  
> 2 · **Load NLLB-200 once** (600 M, fp16)   `torch_dtype=torch.float16` satisfies the new Transformers 4.41 API. GPU-RAM < 5 GB.  
> 3 · **Translate** the whole list in one call (batch 32 × 256 tokens); no repeated model calls per sentence.  
> 4 · **Map back** to IDs → dict {id → en_label} and write **entities_bilingual.csv** (entity_id, en_label).
>
> After the CSV is rebuilt we can:  
> • delete *sentences.txt* **or** set `FORCE=True` in Cell 3 to overwrite its ID-only version.  
> • rerun Cell 3 → it now writes human-readable triples.  
> • rebuild the FAISS index (Cell 4) without any further changes.

In [None]:
# %% CELL 1 b — make sure faiss & sentence-transformers wheels are present
# (quiet, idempotent; safe to re-run if the wheels are already there)

%pip install -q faiss-cpu sentence-transformers

[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/31.3 MB[0m [31m?[0m eta [36m-:--:--[0m[2K   [91m━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m6.5/31.3 MB[0m [31m196.5 MB/s[0m eta [36m0:00:01[0m[2K   [91m━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━━━━━[0m [32m14.6/31.3 MB[0m [31m231.5 MB/s[0m eta [36m0:00:01[0m[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━[0m [32m22.6/31.3 MB[0m [31m229.7 MB/s[0m eta [36m0:00:01[0m[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m [32m31.0/31.3 MB[0m [31m241.3 MB/s[0m eta [36m0:00:01[0m[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m [32m31.3/31.3 MB[0m [31m233.8 MB/s[0m eta [36m0:00:01[0m[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m [32m31.3/31.3 MB[0m [31m233.8 MB/s[0m eta [36m0:00:01[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m31.3/31.3 MB[0m [31m77.0

What this cell does, step-by-step
0 · %pip installs both wheels into the current VM.
1 · faiss-cpu pulls in the FAISS library + BLAS / NumPy deps so we can build & save the flat-IP index.
2 · sentence-transformers guarantees the MiniLM encoder wheel is around (sometimes it vanishes after reconnects).
3 · Because the command is idempotent, it is safe (and fast) to run in every new Colab session.

In [None]:
# %% CELL 2 – Translate raw CN JSON → *_en.json (skips if already done)
"""
• Re-use the label map so we hardly hit the translator again.
• Only unseen strings fall back to on-the-fly translation.
"""
import json, glob, os, time, tqdm, sys
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM
from pathlib import Path

# — paths ───────────────────────────────────────────────────────────────
if (DATA_DIR / "diakg_en").exists() and (DATA_DIR / "entities_bilingual.csv").exists():
    print("✓ EN JSON + CSV already present — skipping heavy translation")
    skip_translation = True
else:
    skip_translation = False

# occasionally Drive is slow to list files; we poll a couple of times
if skip_translation:
    for _ in range(3):                          # ≤ 6 s total wait
        if list((DATA_DIR/"diakg_en").glob("*.json")):  # folder is visible
            break
        time.sleep(2)
    else:                                       # still empty → fall through
        print("↪ Files not visible yet — running full translation")
        skip_translation = False

if skip_translation:
    sys.exit("✓ Nothing to do — cell exits in < 1 s")

# — full translation pass (one-time) ─────────────────────────────────────
label_map = pd.read_csv(LABELS_CSV).set_index("entity_id")["en_label"].to_dict()

# helper re-uses the translator object from Cell 1 (cached in RAM)
def fast_translate(text: str) -> str:
    return translator(text, max_length=400)[0]["translation_text"]

SENT_KEY    = "sentence"
SENT_KEY_EN = f"{SENT_KEY}_en"

for jp in tqdm.tqdm(glob.glob(str(RAW_CN / "*.json")), desc="Translating docs"):
    doc = json.load(open(jp, encoding="utf8"))

    for p in doc["paragraphs"]:
        for s in p["sentences"]:
            zh = s.get(SENT_KEY, "")
            if zh:
                s[SENT_KEY_EN] = fast_translate(zh)                # sentence_en
            for ent in s["entities"]:
                zh_lab  = ent["entity"]                            # entity_en
                ent["entity_en"] = label_map.get(zh_lab, fast_translate(zh_lab))
            for r in s.get("relations", []):                       # relation_en
                zh_rel = r.get("relation_type", "")
                r["relation_en"] = label_map.get(zh_rel, fast_translate(zh_rel))

    out_name = os.path.basename(jp).replace(".json", "_en.json")
    json.dump(doc, open(JSON_DIR / out_name, "w", encoding="utf8"),
              ensure_ascii=False, indent=2)

print("✓ Translation pass complete — all *_en.json written.")

✓ EN JSON + CSV already present — skipping heavy translation


SystemExit: ✓ Nothing to do — cell exits in < 1 s

  warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)


> **What this cell does, step-by-step**  
> 0 · Load the CSV → `label_map` for O(1) look-ups.  
> 1 · For each raw CN JSON, translate:  
> &nbsp;&nbsp;• the full sentence (“sentence_en”)  
> &nbsp;&nbsp;• every entity label (“entity_en”)  
> &nbsp;&nbsp;• every relation string – *inside* the sentence & top level (“relation_en”)  
> &nbsp;&nbsp;• **fast_translate** is only called if the string is unseen in the map.  
> 2 · Write the fully English copy next to the originals as **_en.json**.

### What this cell does, step-by-step — updates
| Step | Change | Why it helps |
|------|--------|--------------|
| 0    | Guard checks if both `diakg_en/` **and** `entities_bilingual.csv` exist. | Avoids the 90-minute translation rerun. |
| 1    | Poll Drive 3× (2 s) to make sure the `_en.json` files are visible. | G-Drive latency sometimes hides files right after upload. |
| 2    | `sys.exit()` instead of `return` for the early exit. | Safe way to abort a top-level notebook cell. |
| 3    | Rest of the code unchanged — only runs when new CN JSON need translation. |

> **Net effect**  A rerun now finishes in **\< 1 s** instead of ~2 h whenever the English artefacts are already in Drive.

In [None]:
# %% CELL 3 – rebuild sentences.txt (force=True)
"""Scan every *_en.json* and write one English triple string per KG triple
(head_en relation_en tail_en) – 8 643 lines in sample run."""

import json, glob, pandas as pd, tqdm
from pathlib import Path

FORCE = False                              # flip to False once file is stable

if FORCE or not SENT_PATH.exists():
    print("⟳ Rebuilding sentences.txt …")
    label_map = pd.read_csv(LABELS_CSV).set_index("entity_id")["en_label"].to_dict()
    sentences = []
    for jp in tqdm.tqdm(glob.glob(str(JSON_DIR / "*_en.json")), desc="Scanning JSON"):
        doc = json.load(open(jp, encoding="utf8"))
        for p in doc["paragraphs"]:
            for s in p["sentences"]:
                for r in s.get("relations", []):
                    sentences.append(
                        f"{label_map[r['head_entity_id']]} "
                        f"{r['relation_en']} "
                        f"{label_map[r['tail_entity_id']]}"
                    )
    SENT_PATH.write_text("\n".join(sentences))
    print("✔ sentences.txt rebuilt with", len(sentences), "lines")
else:
    print("✓ sentences.txt already present – nothing to do.")

✓ sentences.txt already present – nothing to do.


> **What this cell does, step-by-step**  
> 1 · Optionally force-delete the old file (`FORCE=True`) so we overwrite the ID-only version.  
> 2 · Walk every *_en.json* and concatenate **head EN + relation_en + tail EN**.  
> 3 · Write one line per triple to **sentences.txt** – the order matches the FAISS index we will build later.

## 🔧 Updates for optimization — **Cell 2: Translate raw CN JSON → *_en.json***

| What we changed | Why it matters on the A100 |
|-----------------|----------------------------|
| **Early-exit guard**<br>`if (DATA_DIR/"diakg_en").exists() and (DATA_DIR/"entities_bilingual.csv").exists():` | We skip the two-hour translation pass on every warm run. |
| **Single `label_map` load** | All CN → EN look-ups happen in-memory; no extra CSV I/O. |
| **Reuse the translator object that is already on GPU** | We avoid re-loading NLLB-200 and keep VRAM steady. |
| **Translate both in-sentence & top-level `relation_type` strings** | Removes the last source of “T1234 Test_Disease …” artefacts later in retrieval. |
| **Optional tip:** raise `max_length` from `400` → `512` in `fast_translate()` | Silences the “input_length > 0.9 × max_length” warnings and speeds up long sentences (fits comfortably in 16 GB fp16). |

---

## 🔧 Updates for optimization — **Cell 3: rebuild `sentences.txt`**

| What we changed | Why it matters on the A100 |
|-----------------|----------------------------|
| **`FORCE = False` by default** | We rebuild only when the file is missing or when we deliberately flip the flag, so most runs take < 1 s. |
| **ID → label mapping uses the CSV we just built** | Ensures every triple line is now fully human-readable (no more ID placeholders). |
| **Streamlined triple writer**<br>`head_en  relation_en  tail_en` | Keeps one-line-per-vector order -> perfectly aligns with the FAISS vectors we embed in Cell 4 for fast retrieval. |

> **Workflow tip**  
> 1. If we ingest **new** CN JSON later, delete `entities_bilingual.csv` **or** set `FORCE = True` once.  
> 2. Run Cells 0 → 3 — only the new data is processed.  
> 3. Flip `FORCE` back to `False` and enjoy instant warm starts.

In [None]:
# %% CELL 4 – Build FAISS flat-IP index
"""Embed every sentence line with MiniLM-L6-v2 (384-d) → save indexFlatIP."""

import faiss, numpy as np, tqdm, torch
from sentence_transformers import SentenceTransformer

sentences = [ln.rstrip() for ln in open(SENT_PATH)]
print("Total triples:", len(sentences))

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2",
                               device="cuda" if torch.cuda.is_available() else "cpu")
vecs = embedder.encode(sentences, batch_size=512,  # A100 handles 512 easily
                       convert_to_numpy=True, show_progress_bar=True).astype("float32")

index = faiss.IndexFlatIP(vecs.shape[1])
index.add(vecs)
faiss.write_index(index, str(INDEX_PATH))
print(f"✔ FAISS index written with {index.ntotal} vectors")

Total triples: 8643


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]



config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

Batches:   0%|          | 0/17 [00:00<?, ?it/s]

✔ FAISS index written with 8643 vectors


> **What this cell does, step-by-step**  
> 1 · Read **8 643** human-readable triples → list[str].  
> 2 · Encode with MiniLM-L6-v2 in batches of **512** (A100 keeps <2 GB VRAM).  
> 3 · Add vectors to **IndexFlatIP** (inner-product ≈ cosine on unit vecs).  
> 4 · Save **graphrag_faiss.index** – ~12 MB on disk.

In [None]:
# %% CELL 5 – Load FAISS index + encoder
"""Bring the EN vectors + MiniLM encoder into memory for retrieval."""

import faiss, torch, numpy as np
from sentence_transformers import SentenceTransformer

index     = faiss.read_index(str(INDEX_PATH))
embedder  = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2",
                                device="cuda" if torch.cuda.is_available() else "cpu")
with open(SENT_PATH) as f:
    sentences = [ln.rstrip() for ln in f]

print(f"✓ FAISS index with {index.ntotal:,} English vectors loaded")

✓ FAISS index with 8,643 English vectors loaded


> **What this cell does, step-by-step**  
> 0 · Read **graphrag_faiss.index** (8 643 vec) – 200 ms on SSD.  
> 1 · Load the *same* MiniLM encoder used in Cell 4 so query / index live in the same space.  
> 2 · Read **sentences.txt** into a Python list → keeps vector ↔ text mapping handy.

In [None]:
# %% CELL 6 – retrieve_ctx() + quick demo, the R in RAG

import numpy as np, torch

def retrieve_ctx(question: str, k: int = 5):
    qvec = embedder.encode([question]).astype("float32")
    D, I = index.search(qvec, k*3)          # ask for a few extra
    seen, ctx = set(), []
    for i, s in zip(I[0], D[0]):
        triple = sentences[i]
        if triple not in seen:              # keep only the first copy
            ctx.append({"triple": triple, "score": float(s)})
            seen.add(triple)
        if len(ctx) == k:                   # stop once we have k uniques
            break
    return ctx

# quick demo
question = "Which drug treats type 2 diabetes?"
ctx = retrieve_ctx(question)
print("Q:", question)
for c in ctx:
    print(" •", c["triple"], f"(score {c['score']:.3f})")

Q: Which drug treats type 2 diabetes?
 • Type 2 diabetes Drug_Disease Type two (score 0.808)
 • Type 2 diabetes Amount_Drug What is it? (score 0.799)


### What this cell does, step-by-step  
0. **Inputs in RAM**  
   * `index` → FAISS flat-IP index loaded in Cell 5.  
   * `embedder` → MiniLM-L6 encoder (384-d) loaded in Cell 5.  
   * `sentences` → Python list that maps every vector to its triple string.  

1. **Encode the user question**  
   * We run the question through **MiniLM-L6** → 384-d query vector (`qvec`) on GPU.

2. **Search FAISS for nearest neighbours**  
   * We ask for **k × 3** vectors (default `k = 5` → 15 hits).  
   * Distance used is *inner-product* → cosine on unit vectors.

3. **Uniqueness filter (NEW)**  
   * We iterate through the 15 hits **in order of similarity**.  
   * We keep the first time we see a triple string and discard duplicates.  
   * We stop as soon as we have **k unique triples** – fast and deterministic.

4. **Return a tidy Python list**  
   * Each element is a dict `{"triple": str, "score": float}`.  
   * Scores are FAISS inner-product values cast to `float` for JSON-friendliness.

5. **Smoke-test**  
   * The demo encodes *“Which drug treats type 2 diabetes?”* and prints the top-k unique triples with their scores.  
   * On an A100 the end-to-end latency is **≪ 100 ms** per query.

> **Why it matters** – removing duplicates avoids five identical rows in the context block and slightly improves recall of diverse evidence for the LLM prompt.

In [None]:
# %% ONE-TIME wheel install  (we run once per Colab runtime ▸ then delete or skip)
# bitsandbytes ≥ 0.43.2 is required for 8-bit; the runtime still ships 0.41.
%pip install -qq bitsandbytes==0.43.2
# optional: a couple of small wheels that avoid larger, slower builds
%pip install -qq accelerate==0.27.2 sentencepiece==0.2.0
# decided to run these again in case the runtime state changed

In [None]:
# %% CELL 7 – 5-question RAG showcase  (DeepSeek-7B in 8-bit)
"""
1. For each demo question we …
   • retrieve_ctx(k=5)                  → 5 nearest triples
   • build a prompt  (Context\n… triple…) + Q + "Answer:"
   • call DeepSeek-7B (deterministic, temp=0) for the answer
   • pretty-print everything (Q / A / top-k triples) for a nice screenshot.
   – deterministic output is great for grading / reproducibility.
"""

from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import torch

MODEL_ID = "deepseek-ai/deepseek-llm-7b-chat"

tok = AutoTokenizer.from_pretrained(MODEL_ID)

# 8-bit still fits easily on the A100 (≈40 GB VRAM)
mdl = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    # load_in_8bit=True, # Removed to bypass bitsandbytes issue
    device_map="auto",
    torch_dtype=torch.float16,
).eval()

gen = pipeline(
    "text-generation",
    model=mdl,
    tokenizer=tok,
    # Removed generation args from pipeline initialization
    # max_new_tokens=128,
    # temperature=0.0,
    # do_sample=False,
)

print("✓ DeepSeek-7B loaded (without 8-bit quantization)")

PROMPT_TMPL = """You are a helpful medical assistant.
Context:
{ctx}

Question: {q}
Answer:"""

questions = [
    "Which drug is prescribed to manage HbA1c in type 2 diabetes patients?",
    "What long-term complication of diabetes affects the eyes?",
    "Name one lifestyle change that reduces insulin resistance.",
    "Which lab test is used to diagnose gestational diabetes?",
    "How does metformin lower blood glucose mechanistically?",
]

for qi, q in enumerate(questions, 1):
    ctx  = retrieve_ctx(q, k=5)                       # step 1
    ctx_block = "\n- " + "\n- ".join(c["triple"] for c in ctx)
    prompt = PROMPT_TMPL.format(ctx=ctx_block, q=q)   # step 2

    ans = gen(
        prompt,
        max_new_tokens=128,  # Pass generation args directly
        temperature=0.0,
        do_sample=False,
    )[0]["generated_text"].split("Answer:")[-1].strip()   # step 3

    print(f"\n———  Q{qi} ————————————————————")
    print("Q:", q)
    print("A:", ans)
    print("— top-k context —")
    for c in ctx:
        print(" •", c["triple"], f"(score {c['score']:.3f})")             # step 4

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

✓ DeepSeek-7B loaded (without 8-bit quantization)





———  Q1 ————————————————————
Q: Which drug is prescribed to manage HbA1c in type 2 diabetes patients?
A: Insulin is prescribed to manage HbA1c in type 2 diabetes patients.
— top-k context —
 • Type 2 diabetes Drug_Disease HbA1c (score 0.847)
 • HbA1c Drug_Disease Type 2 diabetes (score 0.835)
 • HbA1c ADE_Drug Type 2 diabetes (score 0.780)
 • HbA1c Drug_Disease Ineffective insulin (score 0.758)
 • HbA1c Pathogenesis_Disease Type 2 diabetes (score 0.757)

———  Q2 ————————————————————
Q: What long-term complication of diabetes affects the eyes?
A: Diseases of the retina
— top-k context —
 • Seeing the retina Anatomy_Disease Diabetes (score 0.697)
 • Diabetes ADE_Drug Seeing the retina (score 0.563)
 • Seeing the retina Anatomy_Disease Diseases of the retina (score 0.534)
 • Diabetes Anatomy_Disease Oh, my God. (score 0.530)
 • Decreased functionality Pathogenesis_Disease Diabetes (score 0.528)

———  Q3 ————————————————————
Q: Name one lifestyle change that reduces insulin resistance.
A:

In [None]:
# Attempt to reinstall bitsandbytes with a specific configuration
%pip install -qq --upgrade bitsandbytes==0.43.2 --extra-index-url https://download.pytorch.org/whl/cu121

---
# APPENDIX AND EXPLORATORY CODE
---

In [None]:
# %% APPENDIX A1 – Analyze JSON Structure and Counts
"""
This cell analyzes the structure of the translated JSON files and counts the number
of unique entity types, relationship types, and total triples. This information
is crucial for understanding the content and organization of the knowledge graph
data, which can be very helpful for discussing the project during capstone
research meetings.
"""
import json
from pathlib import Path

# Assuming JSON_DIR is defined in a previous cell and points to your translated JSON files
# Example path, adjust if necessary based on your notebook's CELL 0
# JSON_DIR = Path("/content/drive/MyDrive/diakg_assets/diakg_en")

# Get a list of all translated JSON files
json_files = list(JSON_DIR.glob("*_en.json"))

if json_files:
    # Load one sample JSON file to inspect its structure
    sample_file = json_files[0]
    print(f"Analyzing sample file: {sample_file.name}")

    with open(sample_file, 'r', encoding='utf8') as f:
        sample_data = json.load(f)

    # Inspect the top-level keys
    print("\nTop-level keys in a sample JSON file:")
    print(sample_data.keys())

    # Inspect the structure within a paragraph and sentence
    if "paragraphs" in sample_data and sample_data["paragraphs"]:
        sample_paragraph = sample_data["paragraphs"][0]
        print("\nKeys in a sample paragraph:")
        print(sample_paragraph.keys())

        if "sentences" in sample_paragraph and sample_paragraph["sentences"]:
            sample_sentence = sample_paragraph["sentences"][0]
            print("\nKeys in a sample sentence:")
            print(sample_sentence.keys())

            # Count entity types, relationship types, and total triples across all files
            entity_types = set()
            relation_types = set()
            total_triples = 0

            print("\nCounting entity types, relationship types, and triples across all files...")
            for json_file in json_files:
                with open(json_file, 'r', encoding='utf8') as f:
                    data = json.load(f)
                    if "paragraphs" in data:
                        for paragraph in data["paragraphs"]:
                            if "sentences" in paragraph:
                                for sentence in paragraph["sentences"]:
                                    if "entities" in sentence:
                                        for entity in sentence["entities"]:
                                            # Assuming 'type' key exists for entity type
                                            if 'type' in entity:
                                                entity_types.add(entity['type'])
                                    if "relations" in sentence:
                                        for relation in sentence["relations"]:
                                            # Assuming 'relation_type' key exists for relationship type
                                            if 'relation_type' in relation:
                                                relation_types.add(relation['relation_type'])
                                            # Each relation is considered a triple
                                            total_triples += 1

            print(f"\nTotal number of unique entity types: {len(entity_types)}")
            print(f"Total number of unique relationship types: {len(relation_types)}")
            print(f"Total number of triples (relations): {total_triples}")

            print("\nSample Entity Types:")
            print(list(entity_types)[:10]) # Print first 10 for brevity

            print("\nSample Relationship Types:")
            print(list(relation_types)[:10]) # Print first 10 for brevity

else:
    print(f"No translated JSON files found in {JSON_DIR}. Please run the translation cells first.")

# %% APPENDIX A2 – Explanation of JSON Structure Analysis

This markdown cell explains the purpose, functionality, and benefits of the preceding code cell (APPENDIX A1), which analyzes the structure and content of the translated Chinese diabetes guideline JSON files.

**What it does:**

The code cell performs two main tasks:
1. **Inspects the structure of a sample JSON file:** It loads the first translated JSON file found in the specified directory and prints the top-level keys, as well as the keys within a sample paragraph and sentence. This provides a clear overview of how the data is organized hierarchically.
2. **Counts entity types, relationship types, and triples:** It iterates through *all* the translated JSON files to identify and count the unique types of entities and relationships present in the dataset. It also counts the total number of relations, which represent the knowledge graph triples.

**How it does it:**

*   It uses Python's built-in `json` library to load and parse the JSON files.
*   The `pathlib` module is used for easy handling of file paths.
*   It accesses nested dictionaries and lists within the JSON structure to find entity and relation information.
*   `set()` is used to efficiently store and count unique entity and relationship types.
*   A loop iterates through all the JSON files to aggregate the counts.

**Why it's helpful:**

For your capstone research project, understanding the underlying data is essential. This analysis provides concrete numbers and structural insights that you can use to:

*   **Describe the dataset:** Quantify the diversity of information by stating the number of entity and relationship types.
*   **Explain the data source:** Clearly articulate the structure of the input data during presentations or discussions with your professor.
*   **Justify design choices:** Relate the structure of the JSON data to decisions made in building the knowledge graph and RAG system.
*   **Estimate scale:** The total number of triples gives you an idea of the size of the knowledge graph being built.

By having this information readily available, you can speak confidently and knowledgeably about the data underpinning your project.