<a href="https://colab.research.google.com/github/bariswheel/Tuning-DeepSeek-for-Diabetes/blob/main/DeepSeek_Diabetes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# ╔═══ CELL 0 ─ Mount Drive + central paths ═════════════════════════════════╗
from pathlib import Path
from google.colab import drive
import textwrap, sys, subprocess, glob, json, os                       # std libs

# 1️⃣  Mount (silently re-uses the session token if it’s already mounted)
drive_root = Path("/content/drive")
if not drive_root.exists() or not any(drive_root.iterdir()):
    print("🔗 Mounting Google Drive …")
    drive.mount("/content/drive")

# 2️⃣  Adjust *one* folder constant ↓ only once
DATA_DIR   = drive_root / "MyDrive/diakg_assets"                       # <-- edit
INDEX_PATH = DATA_DIR / "graphrag_faiss.index"                         # FAISS
SENT_PATH  = DATA_DIR / "sentences.txt"                                # 1 line / vec
JSON_DIR   = DATA_DIR / "diakg_en"                                     # translated JSONs
RAW_CN    = f"{DATA_DIR}/0521_new_format"
EN_DIR    = f"{DATA_DIR}/diakg_en"


# 3️⃣  Sanity-check the two critical artefacts
print(f"DATA_DIR   = {DATA_DIR}")
print(f"INDEX_PATH = {INDEX_PATH}")
if not INDEX_PATH.exists():
    sys.exit("⛔ FAISS index missing – rebuild / move the file then rerun ⇡")
if not SENT_PATH.exists():
    print("⚠️ sentences.txt missing – the rebuild cell will create it.")
print("✓ Paths look good – jump to Cell 2")
# ╚════════════════════════════════════════════════════════════════════════╝

# one-shot: delete the old ID-only file so the rebuild cell kicks in
!rm -f "$DATA_DIR/sentences.txt"

DATA_DIR   = /content/drive/MyDrive/diakg_assets
INDEX_PATH = /content/drive/MyDrive/diakg_assets/graphrag_faiss.index
⚠️ sentences.txt missing – the rebuild cell will create it.
✓ Paths look good – jump to Cell 2


Below is a three-step “mini debug kit”. Placed directly under the mount cell, it locates the index file, fixes INDEX_PATH, and reloads the index.

In [None]:
# ╔══ FIND-THE-INDEX (1) ─ quick directory scan ═══════════════════════════════╗
# Lists *every* .index file anywhere under DATA_DIR so you can eyeball them.
import subprocess, textwrap, os, pathlib, re, json, glob, sys, time, itertools
from pathlib import Path

def list_index_files(root: Path):
    print(f"🔍 scanning “{root}” …")
    for p in root.rglob("*.index"):
        print("   •", p.relative_to(root))

DATA_DIR   = Path("/content/drive/MyDrive/diakg_assets")   # same as Cell 0
list_index_files(DATA_DIR)
# ╚══════════════════════════════════════════════════════════════════════════╝

🔍 scanning “/content/drive/MyDrive/diakg_assets” …
   • graphrag_faiss.index


In [None]:
# %% CELL 0 – Build English label mapping  ───────────────────────────────
"""
Creates entities_bilingual.csv  →  two columns **in this order**

    entity_id,en_label
    T197,Diabetic Retinopathy
    T205,HbA1c
    …

The CSV is written to DATA_DIR and **silently over-writes** any older copy –
re-run this cell after adding new Chinese JSON.
"""
# ── 1. paths ────────────────────────────────────────────────────────────
from pathlib import Path
DATA_DIR   = Path("/content/drive/MyDrive/diakg_assets")         # adjust once
JSON_DIR   = DATA_DIR / "diakg_en"                               # *_en.json live here
LABELS_CSV = DATA_DIR / "entities_bilingual.csv"                 # (re)written below

# ── 2. harvest every { id , cn-label } pair ────────────────────────────
import json, glob, pandas as pd, tqdm
id_to_cn = {}                                                    # dict { id : cn-label }

for jp in tqdm.tqdm(JSON_DIR.glob("*_en.json"), desc="Scanning JSON"):
    doc = json.load(open(jp))
    for p in doc["paragraphs"]:
        for s in p["sentences"]:
            for ent in s["entities"]:
                id_to_cn.setdefault(ent["entity_id"], ent["entity"])  # keep first-seen label

print("✔ unique CN labels:", len(id_to_cn))

# ── 3. translate unique Chinese labels  (batches of 32, 4-beam search) ──
model_id  = "facebook/nllb-200-distilled-600M"                   # 600 M params in fp16, < 5 GB GPU RAM
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

tok = AutoTokenizer.from_pretrained(model_id)
mdl = AutoModelForSeq2SeqLM.from_pretrained(model_id,
                                            device_map="auto",
                                            torch_dtype="float16").eval()

translator = pipeline("translation",
                      model     = mdl,
                      tokenizer = tok,
                      src_lang  = "zho_Hans",   # CN → EN
                      tgt_lang  = "eng_Latn",
                      max_length=256,
                      batch_size=32,            # 32 × 256 tokens < 5 GB
                      num_beams = 4,
                      do_sample = False)

cn_labels = list(id_to_cn.values())
en_labels = [ t["translation_text"] for t in translator(cn_labels) ]
id_to_en  = dict(zip(id_to_cn.keys(), en_labels))

# ── 4. save the mapping CSV  ───────────────────────────────────────────
pd.DataFrame({ "entity_id": list(id_to_en.keys()),
               "en_label" : list(id_to_en.values()) }
             ).to_csv(LABELS_CSV, index=False)

print(f"✔ saved {LABELS_CSV} with", len(id_to_en), "rows")

Scanning JSON: 41it [00:00, 176.48it/s]


✔ unique CN labels: 1438


Device set to use cuda:0


✔ saved /content/drive/MyDrive/diakg_assets/entities_bilingual.csv with 1438 rows


> ### What this “build-labels” cell does step-by-step

**0 · Paths** Define `DATA_DIR`, `JSON_DIR`, `entities_bilingual.csv`.  
All files live in one Drive folder so they survive Colab restarts.

**1 · Harvest Chinese labels** Walk every `*_en.json` and keep the first-seen Chinese label for each entity ID → in our data we found **1,438** unique labels.  
The result is a Python `dict {id → cn_label}` ready for translation.

**2 · Load NLLB-200 once (600 M, fp16)** Keeps GPU RAM < 5 GB.  
We build **one** `pipeline("translation")` object and batch-translate the whole list (`32 × 256` tokens). No repeated model calls per sentence.

**3 · Map back to IDs** Zip the translated English strings to their IDs →  
`dict {id → en_label}`. Guarantees a 1-to-1 mapping downstream.

**4 · Save** Write **`entities_bilingual.csv`** with exactly two columns  
`entity_id,en_label`. The file is overwritten silently so I can re-run the cell any time (e.g. after adding new JSON).

After the CSV is rebuilt I can  
* delete `sentences.txt` **or** set `FORCE=True` in the rebuild-sentences cell  
* re-run that cell → it now writes *plain-English* triples  
* reload the FAISS index + run my demo → **no more “T197 Drug_Disease…”**
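
The harvest → translate → zip flow described above can be tried in isolation. This is a minimal sketch rather than the notebook's code: `fake_translate` is a hypothetical stand-in for the NLLB pipeline, and the sample entities are made up for illustration.

```python
def harvest_labels(entities):
    """Keep the first-seen label for every entity ID (mirrors dict.setdefault)."""
    id_to_label = {}
    for ent in entities:
        id_to_label.setdefault(ent["entity_id"], ent["entity"])
    return id_to_label


def fake_translate(labels):
    """Hypothetical stand-in for the translation pipeline: one output per input."""
    toy_dictionary = {"胰岛素": "insulin", "糖尿病": "diabetes"}
    return [toy_dictionary.get(label, label) for label in labels]


def build_id_to_en(entities, translate=fake_translate):
    """Translate each distinct label once, then zip the results back to the IDs."""
    id_to_cn = harvest_labels(entities)
    # dicts preserve insertion order, so keys() and values() stay aligned
    en_labels = translate(list(id_to_cn.values()))
    return dict(zip(id_to_cn.keys(), en_labels))


sample = [
    {"entity_id": "T1", "entity": "胰岛素"},
    {"entity_id": "T2", "entity": "糖尿病"},
    {"entity_id": "T1", "entity": "胰岛素制剂"},  # repeated ID: the first label wins
]
mapping = build_id_to_en(sample)
print(mapping)
```

Because the translator is called once on the whole list, the cost scales with the number of distinct IDs rather than with the number of entity mentions.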

In [None]:
# %% 2 ─ Translate the 41 Chinese guideline JSONs → *_en.json
"""
• Re-uses the NLLB-200 translator that is already in GPU RAM
• Looks up entity / relation labels in the CSV first (label_map)
• Falls back to on-the-fly translation only when a label is new
• Writes a fully English copy of each file to diakg_en/ as *_en.json
"""

# 0. paths + lookup table ------------------------------------------------------
from pathlib import Path
import pandas as pd, json, glob, os
from tqdm.auto import tqdm

DATA_DIR   = Path("/content/drive/MyDrive/diakg_assets")   # keep in sync with Cell 0
RAW_CN_DIR = DATA_DIR / "0521_new_format"                  # 41 original CN JSON
EN_DIR     = DATA_DIR / "diakg_en"                         # output folder (already used)
EN_DIR.mkdir(exist_ok=True)

LABELS_CSV = DATA_DIR / "entities_bilingual.csv"           # ⟵ Cell 0 output
label_df   = pd.read_csv(LABELS_CSV)
label_map  = dict(zip(label_df["entity_id"], label_df["en_label"]))

# 1. helper translator ---------------------------------------------------------
def fast_translate(text: str) -> str:
    """NLLB-200 translation of one string (sentences, plus labels missing from the CSV)."""
    return translator(text, max_length=400)[0]["translation_text"]

SENT_KEY    = "sentence"           # CN sentence key
SENT_KEY_EN = f"{SENT_KEY}_en"     # new EN key we add

# 2. walk every CN JSON --------------------------------------------------------
for jp in tqdm(glob.glob(f"{RAW_CN_DIR}/*.json"), desc="Translating docs"):
    doc = json.load(open(jp, encoding="utf8"))

    for p in doc["paragraphs"]:
        for s in p["sentences"]:
            # 2-A  translate the raw Chinese sentence itself (if present)
            zh = s.get(SENT_KEY, "")
            if zh:
                s[SENT_KEY_EN] = fast_translate(zh)

            # 2-B  translate *entity* labels inside the sentence
            #      (dict.get(key, default) would run the translator even on a hit,
            #       so the fallback is written as an explicit conditional)
            for ent in s.get("entities", []):
                zh_lab            = ent["entity"]              # Chinese label
                ent["entity_en"]  = (label_map[zh_lab] if zh_lab in label_map
                                     else fast_translate(zh_lab))

            # 2-C  translate *relation* labels inside the sentence  (added in v2)
            for r in s.get("relations", []):
                zh_rel            = r.get("relation_type", "")
                r["relation_en"]  = (label_map[zh_rel] if zh_rel in label_map
                                     else fast_translate(zh_rel))

    # 2-D  translate top-level relations already present in the JSON root
    for r in doc.get("relations", []):
        zh_rel            = r.get("relation_type", "")
        r["relation_en"]  = (label_map[zh_rel] if zh_rel in label_map
                             else fast_translate(zh_rel))

    # 3. persist the English copy ---------------------------------------------
    out_name = os.path.basename(jp).replace(".json", "_en.json")
    json.dump(doc,
              open(EN_DIR / out_name, "w", encoding="utf8"),
              ensure_ascii=False,
              indent=2)

print("✓ Translation pass complete — all *_en.json written.")

Translating docs:   0%|          | 0/41 [00:00<?, ?it/s]

Your input_length: 457 is bigger than 0.9 * max_length: 400. You might consider increasing your max_length manually, e.g. translator('...', max_length=400)
Your input_length: 431 is bigger than 0.9 * max_length: 400. You might consider increasing your max_length manually, e.g. translator('...', max_length=400)
Your input_length: 396 is bigger than 0.9 * max_length: 400. You might consider increasing your max_length manually, e.g. translator('...', max_length=400)
Your input_length: 365 is bigger than 0.9 * max_length: 400. You might consider increasing your max_length manually, e.g. translator('...', max_length=400)
Your input_length: 559 is bigger than 0.9 * max_length: 400. You might consider increasing your max_length manually, e.g. translator('...', max_length=400)
Your input_length: 636 is bigger than 0.9 * max_length: 400. You might consider increasing your max_length manually, e.g. translator('...', max_length=400)
Your input_length: 454 is bigger than 0.9 * max_length: 400. You

✓ Translation pass complete — all *_en.json written.


### What **Cell 2 – Translate the 41 Chinese JSONs** does (step-by-step)

1. **Load the label map**  
   *Reads* `entities_bilingual.csv` (built in Cell 0) into a Python `dict {id → en_label}` so we can look up translations instantly.

2. **Reuse the NLLB-200 translator**  
   Keeps the 600-M-param NLLB-200 model in GPU RAM, so there are no wasted reloads. Every sentence goes through it; labels are meant to go through it **only** when they are not found in the CSV.

3. **Walk every original Chinese JSON**  
   *For each file*:
   * adds a new key **`sentence_en`** (English version of the CN sentence)  
   * adds **`entity_en`** inside every entity (lookup → fallback translate)  
   * adds **`relation_en`** inside nested relations **and** top-level relations (lookup → fallback translate)

4. **Write the English copies**  
   Saves each fully translated document as **`*_en.json`** inside `diakg_en/`.

5. **Done → ready for FAISS indexing**  
   After this cell finishes you can delete `sentences.txt` (or set `FORCE=True` in the rebuild-sentences cell), rebuild the sentence list, reload the FAISS index, and your downstream demo will show **plain-English triples** instead of `T197 Drug_Disease…`.
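
The “lookup → fallback translate” step only saves work if the fallback runs on a miss and nowhere else. The sketch below illustrates that with a call-counting stub; `lookup_or_translate` and `counting_translate` are hypothetical names, and the toy `label_map` is keyed by the source label purely for the sake of the example.

```python
def lookup_or_translate(label, label_map, translate):
    """Return the cached translation, calling `translate` only on a cache miss."""
    if label in label_map:
        return label_map[label]
    return translate(label)


calls = []


def counting_translate(text):
    """Hypothetical translator stub that records every call it receives."""
    calls.append(text)
    return f"<{text}>"


label_map = {"糖尿病视网膜病变": "Diabetic Retinopathy"}

hit = lookup_or_translate("糖尿病视网膜病变", label_map, counting_translate)
miss = lookup_or_translate("新标签", label_map, counting_translate)
calls_after_lazy = list(calls)  # only the miss reached the translator

# dict.get(key, default) evaluates `default` before the lookup happens,
# so the translator runs here even though the key is present.
eager = label_map.get("糖尿病视网膜病变", counting_translate("糖尿病视网膜病变"))
print(len(calls_after_lazy), len(calls))
```

With a GPU translator behind the stub, the eager form would translate every label on every pass, whether or not it was cached.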

In [None]:
!pip install -q faiss-cpu sentence-transformers

# ===============================================================
# 3.  Build FAISS flat-IP index
# ===============================================================
import glob, json
import faiss, torch
from sentence_transformers import SentenceTransformer

sentences = []
for jp in glob.glob(f"{EN_DIR}/*_en.json"):
    doc = json.load(open(jp))
    for p in doc["paragraphs"]:
        for s in p["sentences"]:
            for r in s.get("relations", []):
                sentences.append(
                    f"{r['head_entity_id']} {r['relation_en']} {r['tail_entity_id']}"
                )
print("Total triples:", len(sentences))

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2",
                               device="cuda" if torch.cuda.is_available() else "cpu")
vecs = embedder.encode(sentences, batch_size=512,
                       convert_to_numpy=True).astype("float32")
index = faiss.IndexFlatIP(vecs.shape[1])
index.add(vecs)
faiss.write_index(index, f"{DATA_DIR}/graphrag_faiss.index")
print("✓ FAISS index written with", index.ntotal, "vectors")

Total triples: 8643



✓ FAISS index written with 8643 vectors


8,643 triples were encoded and the flat-IP index landed in /content/drive/MyDrive/diakg_assets/graphrag_faiss.index in about 15 s.
That means phase 1 (translation → vector store) is fully done.

In [None]:
#Quick Sanity Check
import faiss, numpy as np, torch
from sentence_transformers import SentenceTransformer

# 1 /  load the index & embedder
index     = faiss.read_index(f"{DATA_DIR}/graphrag_faiss.index")
embedder  = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2",
                                device="cuda" if torch.cuda.is_available() else "cpu")

# 2 /  encode a query triple-style
q = "胰岛素 treats type 2 diabetes"        # or any English fragment
qvec = embedder.encode([q]).astype("float32")

# 3 /  nearest-neighbour search
D, I = index.search(qvec, k=5)             # cosine-sim (inner-product) top-5
for rank, (score, idx) in enumerate(zip(D[0], I[0]), 1):
    print(f"{rank:>2}. {score:5.3f}  {sentences[idx]}")

 1. 0.341  T2 Treatment_Disease T0
 2. 0.331  T2 Drug_Disease T0
 3. 0.331  T2 Drug_Disease T0
 4. 0.331  T2 Drug_Disease T0
 5. 0.328  T0 Drug_Disease T2


What the “Quick Sanity Check” cell just did

1. Load artifacts
   * faiss.read_index(...) pulled the flat-IP index (graphrag_faiss.index) I built earlier. It already contains 8,643 dense vectors, one per English triple.
   * A fresh SentenceTransformer (all-MiniLM-L6-v2) was loaded on the A100 GPU (or CPU fallback). This is the same encoder that produced the vectors, so embeddings are comparable.

2. Turn a free-form query into a vector

```
q = "胰岛素 treats type 2 diabetes"
qvec = embedder.encode([q]).astype("float32")
```
The query string (mixed CN/EN is accepted) is converted to a 384-dimensional vector that lives in the same space as the index vectors.

3. Nearest-neighbour search in FAISS

```
D, I = index.search(qvec, k=5)
```

   * I[0] → the indices of the 5 most similar triples.
   * D[0] → their inner-product scores, which equal cosine similarity because the vectors are L2-normalised.

4. Display the hits

For each (score, idx) pair we printed e.g.

```
 1. 0.341  T2  Treatment_Disease  T0
```
   * 0.341 = similarity to my query.
   * T2 Treatment_Disease T0 is the stored head – relation – tail triple text retrieved from sentences[idx].

5. Why this matters

Getting ranked hits back with non-zero similarity scores confirms that:

   * The vector store is searchable and vectors align with the encoder.
   * My translation + triple-construction pipeline produced relation text (Treatment_Disease, Drug_Disease) that the encoder can match against a query.

The hits still show entity IDs (T2, T0) rather than names, so this check alone cannot tell whether they are insulin-related.

I can now wrap this retrieval logic into a prompt for DeepSeek-7B-Chat to complete the classic RAG loop:
query → FAISS top-k context → LLM answer.
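
The statement that inner-product scores equal cosine similarity once vectors are L2-normalised can be checked by hand. This is a small pure-Python sketch with made-up 3-dimensional vectors (the real embeddings have 384 dimensions); none of these helper names come from FAISS or sentence-transformers.

```python
import math


def inner_product(a, b):
    """Plain dot product, the score a flat inner-product index computes."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalise(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(inner_product(vec, vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity computed from un-normalised vectors."""
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )


query = [3.0, 4.0, 0.0]   # illustrative values only
stored = [1.0, 2.0, 2.0]

ip_raw = inner_product(query, stored)
ip_unit = inner_product(l2_normalise(query), l2_normalise(stored))
cos = cosine(query, stored)
print(ip_raw, ip_unit, cos)
```

On raw vectors the inner product also depends on vector length, which is why the equivalence only holds after normalisation.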




# End to End RAG Helper

In [None]:
# Cell A-0  ── one-time install / upgrade
!pip install -Uq bitsandbytes==0.43.1  # CUDA-11 build friendly to Colab

In [None]:
# install GPU-heavy deps once per session
!pip install -q triton==2.1.0 bitsandbytes==0.43.1 \
               accelerate transformers sentencepiece

pip pulled the new preview wheels for PyTorch 2.6 (cu124) when it saw the fresh
triton + bitsandbytes request, but Colab is still running the stock
PyTorch 2.1.2 (cu121).
Result → wheel-mismatch error:

```
torchvision 0.21.0+cu124 requires torch 2.6.0
torchaudio  2.6.0+cu124 requires torch 2.6.0
```
**Why these versions?**

* torch 2.1.2+cu121 ⟷ torchvision 0.16.2+cu121 ⟷ torchaudio 2.1.2+cu121 are the officially matched wheels for CUDA 12.1 (the runtime every A100/L4 Colab VM ships with).
* bitsandbytes 0.43.1 has pre-built 8-bit kernels for CUDA 11 & 12.
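
Before pinning anything, it can help to see which builds are actually installed. The following is a minimal sketch using only the standard library; `installed_versions` and `cuda_tags` are hypothetical helper names, and the package list is an assumption based on the wheels discussed above.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def cuda_tags(versions):
    """Collect local-version tags such as 'cu121' from strings like '2.1.2+cu121'."""
    return {v.split("+", 1)[1] for v in versions.values() if v and "+" in v}


stack = installed_versions(["torch", "torchvision", "torchaudio", "bitsandbytes"])
for name, version in stack.items():
    print(f"{name:<14} {version or 'not installed'}")

tags = cuda_tags(stack)
if len(tags) > 1:
    print("mixed CUDA builds detected:", sorted(tags))
```

More than one tag in the result (for example both cu121 and cu124) points to the kind of wheel mismatch shown in the error above.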


In [None]:
# 1️⃣  keep Colab’s CUDA-12.1 toolchain, pin the matching wheels
!pip install -q --no-cache-dir torch==2.1.2+cu121 \
                             torchvision==0.16.2+cu121 \
                             torchaudio==2.1.2+cu121 \
          --index-url https://download.pytorch.org/whl/cu121

# 2️⃣  reinstall bitsandbytes compiled for cu121
!pip install -q bitsandbytes==0.43.1


**Now the DeepSeek model-loading cell below...**

Re-sync NumPy ⇆ PyTorch ⇆ Triton wheels (CUDA 12.1 stack)


In [None]:
%pip uninstall -y opencv-python-headless opencv-python

Found existing installation: opencv-python-headless 4.12.0.88
Uninstalling opencv-python-headless-4.12.0.88:
  Successfully uninstalled opencv-python-headless-4.12.0.88
Found existing installation: opencv-python 4.11.0.86
Uninstalling opencv-python-4.11.0.86:
  Successfully uninstalled opencv-python-4.11.0.86


In [None]:
# clean conflicting wheels (already done)
%pip uninstall -y opencv-python-headless opencv-python

# install a wheel set that plays together on CUDA-12.1
%pip install -q --no-cache-dir \
    numpy==1.26.3 \
    torch==2.1.2+cu121 torchvision==0.16.2+cu121 torchaudio==2.1.2+cu121 \
    --index-url https://download.pytorch.org/whl/cu121 \
    bitsandbytes==0.43.1 triton==2.1.0 --upgrade


In [None]:
%pip install -q --no-cache-dir \
  'numpy<2,>=1.26.0' \
  torch==2.1.2+cu121 torchvision==0.16.2+cu121 torchaudio==2.1.2+cu121 \
    --index-url https://download.pytorch.org/whl/cu121 \
  bitsandbytes==0.43.1 triton==2.1.0 --upgrade

In [None]:
# remove the offending package
%pip uninstall -y torchao

Found existing installation: torchao 0.10.0
Uninstalling torchao-0.10.0:
  Successfully uninstalled torchao-0.10.0


In [None]:
# Cell 0 – clean-install the matching CUDA-12.1 stack
!pip uninstall -y -q torch torchvision torchaudio fastai bitsandbytes
!pip install -q --no-cache-dir \
    torch==2.1.2+cu121 torchvision==0.16.2+cu121 torchaudio==2.1.2+cu121 \
    bitsandbytes==0.43.2 \
    --extra-index-url https://download.pytorch.org/whl/cu121


In [None]:
# ⬇️ Cell A-0 — (re)install the latest bitsandbytes wheel
%pip install -q --upgrade "bitsandbytes>=0.43.2"   # quoted so the shell does not treat >= as a redirect

In [None]:
# Cell A – DeepSeek-7B (safetensors, 8-bit) ------------------------------
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

MODEL_ID = "deepseek-ai/deepseek-llm-7b-chat"

tok = AutoTokenizer.from_pretrained(MODEL_ID)

mdl = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    use_safetensors=True,        # <- skips .bin shards
    load_in_8bit=True,           # needs bnb ≥ 0.43.2
    device_map="auto",           # put all shards on the A100
    torch_dtype="float16"
).eval()

gen = pipeline(
    "text-generation",
    model=mdl,
    tokenizer=tok,
    #device=0,                    # CUDA:0 # no device=… here – accelerate already handled it
    max_new_tokens=256,
    temperature=0.7
)

print("✓ DeepSeek-7B loaded on", mdl.device)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.



Device set to use cuda:0


✓ DeepSeek-7B loaded on cuda:0


---

Now that the 8-bit DeepSeek-7B pipeline is alive on the A100, we can run a smoke test.

In [None]:
gen("Write a one-sentence diabetes tip")[0]['generated_text']

'Write a one-sentence diabetes tip that you would share with others. Write it down here.\nSample response: "Exercise regularly to help regulate blood sugar levels and improve overall health."\nHow to stay motivated to exercise regularly with a busy schedule?\n1. Set realistic goals: Start with small, achievable goals that you can work towards gradually.\n2. Schedule your workouts: Block out time in your calendar for exercise and hold yourself accountable.\n3. Find a workout buddy: Exercising with a friend or family member can make the experience more enjoyable and increase motivation.\n4. Mix it up: Try different types of workouts to keep things interesting and prevent boredom.\n5. Reward yourself: Celebrate your progress and accomplishments with small treats or rewards.\n6. Stay positive: Focus on the benefits of exercise, such as improved mood and energy levels, and use that motivation to keep going.'

---
# Now we continue the RAG cells below

1.   Load FAISS + MiniLM Encoder
2.   Lightweight retrieval helper
3.   Wrap everything into rag_answer()
4.   Ask a question and print the answer and supporting triplets



Quick rescue cell below, because dependency problems forced me to restart the session dozens of times.

In [None]:
#  CELL 0 ─── Paths & one-time constants
#
#  • Set Google-Drive workspace once here.
#  • Other cells reuse DATA_DIR, JSON_DIR, INDEX_PATH, etc.

from pathlib import Path

DATA_DIR  = Path("/content/drive/MyDrive/diakg_assets")     # adjust if we move the folder
JSON_DIR  = DATA_DIR / "diakg_en"                           # translated *_en.json live here
INDEX_PATH = DATA_DIR / "graphrag_faiss.index"              # FAISS file built earlier
SENT_PATH  = DATA_DIR / "sentences.txt"                     # one line ↔ one FAISS vector

In [None]:
# ╔═╡ CELL 1 — (Re)build sentences.txt with English labels ═══════════════════╣
"""
sentences.txt  ➜  one plain-English triple (head entity – relation – tail entity) per FAISS vector
• Rebuild time ≈ 20 s for ≈ 8 k triples on an A100
• Safe to re-run → quietly skips work if the file exists **unless** FORCE=True
"""

import json, glob, os, tqdm, pandas as pd
from pathlib import Path

# 0. Paths & switches ─────────────────────────────────────────────────────────
LABELS_CSV = DATA_DIR / "entities_bilingual.csv"   # bilingual lookup table
SENT_PATH  = DATA_DIR / "sentences.txt"            # output target
FORCE      = False                                 # ⇦ flip to True to force a rebuild

# 1. Load bilingual labels ─ dict { entity_id ☞ en_label } ───────────────────
labels_df = pd.read_csv(LABELS_CSV)

expected = {"entity_id", "en_label"}
if not expected.issubset(labels_df.columns):
    # Fall-back: assume the first two columns are id / label
    print("⚠️  Unexpected column names:", list(labels_df.columns))
    print("   Using the first two columns instead (rename later if you wish).")
    labels_df = labels_df.iloc[:, :2]
    labels_df.columns = ["entity_id", "en_label"]

labels_en = dict(zip(labels_df["entity_id"], labels_df["en_label"]))

def label_en(eid: str) -> str:
    """Map an entity-ID → English label (falls back to the ID if missing)."""
    return labels_en.get(eid, eid)

# 2. (Re)build only if necessary ──────────────────────────────────────────────
if SENT_PATH.exists() and not FORCE:
    print("✓ sentences.txt already present — nothing to do.")
else:
    if FORCE and SENT_PATH.exists():
        SENT_PATH.unlink(missing_ok=True)          # blow away the old file

    print("Rebuilding sentences.txt …")
    sentences = []
    for jp in tqdm.tqdm(JSON_DIR.glob("*_en.json"), desc="Scanning docs"):
        doc = json.load(open(jp, encoding="utf8"))
        for p in doc["paragraphs"]:
            for s in p["sentences"]:
                for r in s.get("relations", []):
                    sentences.append(
                        f"{label_en(r['head_entity_id'])} "
                        f"{r['relation_en']} "
                        f"{label_en(r['tail_entity_id'])}"
                    )

    # write once
    SENT_PATH.parent.mkdir(parents=True, exist_ok=True)
    SENT_PATH.write_text("\n".join(sentences))
    print(f"✓ sentences.txt rebuilt with {len(sentences):,} lines")
# ╚════════════════════════════════════════════════════════════════════════════╝

⚠️  Unexpected column names: ['label_cn', 'label_en']
   Using the first two columns instead (rename later if you wish).
Rebuilding sentences.txt …


Scanning docs: 41it [00:00, 208.32it/s]


✓ sentences.txt rebuilt with 8,643 lines


What CELL 1 does:

1. Imports & constants – pulls in pandas, tqdm, etc.; defines the paths and a FORCE switch.

2. Load the bilingual lookup table – reads entities_bilingual.csv.
   * If the file already uses entity_id / en_label headers we keep them.
   * Otherwise we warn you and fall back to the first two columns. If the first column holds labels rather than entity IDs (as in the run above, where the headers were label_cn / label_en), ID lookups miss and the triples keep their raw IDs.
   * A tiny helper label_en() translates any entity-ID to its English label (falls back to the ID if it’s missing).

3. Rebuild logic
   * If sentences.txt already exists and FORCE is False → we do nothing.
   * Otherwise we (optionally) delete the old file, scan every *_en.json, and build a human-readable triple for every relation.

4. Write-out & summary – writes the new file in one go and prints how many lines were written.
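
The skip-unless-forced behaviour can be exercised without touching Drive. This is a minimal sketch that works in a temporary directory; `rebuild_if_needed` is a hypothetical helper and the single triple is illustrative data, not output from the notebook.

```python
import tempfile
from pathlib import Path


def rebuild_if_needed(path, build_lines, force=False):
    """Write build_lines() to `path` unless it already exists; return True if written."""
    if path.exists() and not force:
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join(build_lines()), encoding="utf8")
    return True


def toy_triples():
    """Illustrative stand-in for the JSON scan that yields triple strings."""
    return ["insulin Drug_Disease type 2 diabetes"]


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "sentences.txt"
    first = rebuild_if_needed(target, toy_triples)               # file missing: written
    second = rebuild_if_needed(target, toy_triples)              # file present: skipped
    third = rebuild_if_needed(target, toy_triples, force=True)   # forced: rewritten
    content = target.read_text(encoding="utf8")

print(first, second, third)
```

Passing the builder as a function means the expensive scan is never run when the file is already there.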
  

In [None]:
# ╔═══ CELL 1b ─ make sure faiss & sentence-transformers wheels are here ═════╗
# (quiet, once-per-runtime; idempotent)
%pip install -q faiss-cpu sentence-transformers
# ╚════════════════════════════════════════════════════════════════════════╝


**Why CELL 1b above**

* faiss-cpu gives Colab the FAISS library plus its BLAS / NumPy deps, so you can read/write graphrag_faiss.index.
* sentence-transformers is already around because we used it earlier, but installing it again is safe and guarantees the MiniLM encoder is present after a reconnect.

Colab sometimes starts a fresh-ish runtime after reconnects; installing these two wheels once per session removes the “ModuleNotFoundError: no module named ‘faiss’ / ‘sentence_transformers’” surprises. The command is idempotent (safe to rerun; it’s silent when the wheels are already there).
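
A quick way to see whether the install is needed at all is to ask Python which modules it can find. This is a small sketch using the standard library; `missing_modules` is a hypothetical helper, and the two names checked are the import names of the wheels above.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


needed = ["faiss", "sentence_transformers"]
missing = missing_modules(needed)
print("missing:", missing or "none")
```

An empty list means both imports will succeed, so the `%pip install` line has nothing to do.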

In [None]:
# ╔═══ CELL 2 ─ Load FAISS index + MiniLM encoder into memory ═══════════════╗
"""
Brings (1) the FAISS vector index and (2) the *same* all-MiniLM-L6-v2 encoder
we used when **building** the index, so query / index vectors live in the same
latent space. Keeps the sentences list handy (one line per vector).
"""
import faiss, torch, numpy as np
from sentence_transformers import SentenceTransformer

index    = faiss.read_index(str(INDEX_PATH))
embedder = SentenceTransformer(
    "sentence-transformers/all-MiniLM-L6-v2",
    device="cuda" if torch.cuda.is_available() else "cpu"
)
with open(SENT_PATH) as f: sentences = [ln.rstrip() for ln in f]

print(f"✓ FAISS index with {index.ntotal:,} vectors loaded")
# ╚════════════════════════════════════════════════════════════════════════╝

✓ FAISS index with 8,643 vectors loaded


What cell 2 does above:

It boots up the retriever part of our RAG pipeline:

1. Loads the FAISS index we built earlier (faiss.read_index). That brings 8,643 dense vectors (one per KG triple) straight into RAM (faiss-cpu keeps the index on the CPU).
2. Re-instantiates the exact same MiniLM sentence-encoder (all-MiniLM-L6-v2) we used when building the index. Using the identical model guarantees that any new query you encode lands in the same vector space as the stored triples, so cosine similarity is meaningful.
3. Reads sentences.txt into a Python list so each vector’s human-readable triple string is instantly available when we fetch neighbours.

After this cell finishes we have:

* index – the searchable FAISS index in memory.
* embedder – the MiniLM model on CPU/GPU ready to encode user questions.
* sentences – a list of 8,643 triple-style strings aligned 1-to-1 with the index vectors.

Everything needed for fast nearest-neighbour lookup is now live; the subsequent cell can retrieve context in just a few milliseconds.
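
The 1-to-1 alignment between vectors and lines is an assumption the later cells rely on, so it is worth asserting. The sketch below is one hedged way to do that; `check_alignment` is a hypothetical helper, and in the notebook it would be called as `check_alignment(index.ntotal, sentences)`.

```python
def check_alignment(n_vectors, sentences):
    """Raise if the index size and the sentence list cannot be paired one-to-one."""
    if n_vectors != len(sentences):
        raise ValueError(
            f"index holds {n_vectors} vectors but sentences has {len(sentences)} lines"
        )
    blank = [i for i, line in enumerate(sentences) if not line.strip()]
    if blank:
        raise ValueError(f"blank triple text at rows {blank[:5]}")
    return True


toy_sentences = ["triple A", "triple B"]  # illustrative data
ok = check_alignment(2, toy_sentences)
print("aligned:", ok)
```

A mismatch usually means sentences.txt was rebuilt without rebuilding the index, or the other way round.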

In [None]:
# ╔═══ CELL 3 ─ Lightweight retrieval helper  ═══════════════════════════════╗
def retrieve_ctx(question: str, k: int = 5):
    vec  = embedder.encode([question]).astype("float32")
    D, I = index.search(vec, k)
    return [{"triple": sentences[i], "score": float(s)} for i, s in zip(I[0], D[0])]
# ╚════════════════════════════════════════════════════════════════════════╝

What cell 3 does above:


It defines retrieve_ctx(), a micro-helper that turns any free-form question into its k most relevant triples:

1. Encode the user’s question with our MiniLM embedder → a 384-dimensional vector (vec).
2. Search the in-memory FAISS index (index.search) to find the k nearest vectors (highest inner-product / cosine similarity).
3. Package the results into a tidy Python list of dicts. Each dict carries:
   * "triple" – the human-readable KG triple from sentences[i]
   * "score" – the similarity score (higher = closer match) cast to float for JSON-friendliness.

We can call retrieve_ctx("…") and we instantly get ranked, semantically closest triples ready to feed the LLM-prompt.
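
To make the search step concrete, here is what a flat inner-product lookup amounts to, written as brute force in pure Python. It is a sketch with toy 2-dimensional vectors and is not how FAISS is implemented internally; `top_k` is a hypothetical name chosen to mirror the return shape of retrieve_ctx().

```python
def top_k(query_vec, stored_vecs, sentences, k=5):
    """Score every stored vector by inner product and return the k best as dicts."""
    scored = [
        (sum(q * v for q, v in zip(query_vec, vec)), i)
        for i, vec in enumerate(stored_vecs)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [{"triple": sentences[i], "score": float(s)} for s, i in scored[:k]]


toy_vecs = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]        # unit-length toy vectors
toy_sentences = ["triple A", "triple B", "triple C"]   # one string per vector

hits = top_k([1.0, 0.0], toy_vecs, toy_sentences, k=2)
for hit in hits:
    print(hit["triple"], round(hit["score"], 3))
```

The list index is the only link between a vector and its text, which is why the sentence list and the index must stay in the same order.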

In [None]:
# ╔═══ CELL 4 ─ Tiny RAG wrapper using our DeepSeek-7B pipeline ═════════════╗
PROMPT = """You are a helpful diabetes assistant.
Context:
{ctx}

Question: {q}
Answer:"""

def rag_answer(question: str, k: int = 5):
    ctx  = retrieve_ctx(question, k)
    block = "\n".join(f"- {c['triple']}" for c in ctx)   # one "- triple" bullet per line
    answer = gen(PROMPT.format(ctx=block, q=question))[0]["generated_text"].split("Answer:")[-1].strip()
    return answer, ctx
# ╚════════════════════════════════════════════════════════════════════════╝
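
To see exactly what the model receives, the prompt assembly and the answer extraction can be pulled into two pure functions and run without loading DeepSeek. This is a minimal sketch, not the notebook's own code: `build_prompt`, `extract_answer` and the placeholder triples are hypothetical, while the template text is copied from Cell 4.

```python
PROMPT = """You are a helpful diabetes assistant.
Context:
{ctx}

Question: {q}
Answer:"""


def build_prompt(question, ctx):
    """Render the template with one '- triple' bullet per retrieved item."""
    block = "\n".join(f"- {c['triple']}" for c in ctx)
    return PROMPT.format(ctx=block, q=question)


def extract_answer(generated_text):
    """Keep only what follows the last 'Answer:' marker."""
    return generated_text.split("Answer:")[-1].strip()


# Placeholder context (made-up triples) in the shape retrieve_ctx returns.
toy_ctx = [
    {"triple": "entity_a relation_x entity_b", "score": 0.9},
    {"triple": "entity_c relation_y entity_d", "score": 0.8},
]
prompt = build_prompt("Example question?", toy_ctx)
print(prompt)
# A text-generation pipeline typically echoes the prompt, so simulate that here.
print(extract_answer(prompt + " Example answer."))
```

Splitting on the last "Answer:" works as long as the model does not emit that marker again inside its own reply; if it does, only the text after the final occurrence is kept.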

In [None]:
# ════════════════════════════════════════════════════════════════════
# ▸ CELL 5 ─ Showcase demo   (asks 5 diverse questions end-to-end)
# ════════════════════════════════════════════════════════════════════
"""
1. For each question ⟶ grab k=5 nearest triples with retrieve_ctx().
2. Build a compact prompt: context block + Q + “Answer:”.
3. Call DeepSeek-7B (deterministic, no sampling) for the answer.
4. Pretty-print everything for a nice screenshot.

Re-running is safe: nothing is mutated, only new generations appear.
"""

PROMPT_TMPL = """You are a helpful medical assistant.
Context:
{ctx}

Question: {q}
Answer:"""

questions = [
    "Which drug is prescribed to manage HbA1c in type 2 diabetes patients?",
    "What long-term complication of diabetes affects the eyes?",
    "Name one lifestyle change that reduces insulin resistance.",
    "Which lab test is used to diagnose gestational diabetes?",
    "How does metformin lower blood glucose mechanistically?"
]

for qi, q in enumerate(questions, 1):
    ctx = retrieve_ctx(q, k=5)
    ctx_block = "\n".join(f"- {c['triple']}" for c in ctx)
    prompt = PROMPT_TMPL.format(ctx=ctx_block, q=q)

    ans = gen(prompt, max_new_tokens=128, do_sample=False)[0]["generated_text"]
    ans = ans.split("Answer:")[-1].strip()  # keep text after “Answer:”

    print(f"\n══════════ Q{qi} ══════════")
    print("Q:", q)
    print("A:", ans)
    print("── top-k context ──")
    for c in ctx:
        print(f"  • {c['triple']}   (score {c['score']:.3f})")

The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.



══════════ Q1 ══════════
Q: Which drug is prescribed to manage HbA1c in type 2 diabetes patients?
A: Metformin is prescribed to manage HbA1c in type 2 diabetes patients.
── top-k context ──
  • T311 Drug_Disease T312   (score 0.443)
  • T312 Treatment_Disease T310   (score 0.436)
  • T312 Drug_Disease T310   (score 0.431)
  • T1 Drug_Disease T2   (score 0.431)
  • T462 Drug_Disease T460   (score 0.430)


The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.



══════════ Q2 ══════════
Q: What long-term complication of diabetes affects the eyes?
A: T205
── top-k context ──
  • T202 The following is a list of diseases and disorders: T197   (score 0.323)
  • T201 The following is a list of diseases and disorders: T197   (score 0.320)
  • T202 The following is a list of diseases and disorders: T205   (score 0.315)
  • T66 The following is a list of diseases and disorders: T64   (score 0.311)
  • T22 The following is a list of diseases and disorders: T20   (score 0.307)


The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.



══════════ Q3 ══════════
Q: Name one lifestyle change that reduces insulin resistance.
A: Engaging in regular physical activity can help reduce insulin resistance.
── top-k context ──
  • T202 The following is a list of diseases and disorders: T205   (score 0.316)
  • T202 The following is a list of diseases and disorders: T197   (score 0.284)
  • T368 The following is a list of diseases and disorders: T369   (score 0.275)
  • T25 The following is a list of diseases and disorders: T20   (score 0.272)
  • T5 The following is a list of diseases and disorders: T3   (score 0.272)


The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.



══════════ Q4 ══════════
Q: Which lab test is used to diagnose gestational diabetes?
A: T206 Test_Disease T205
── top-k context ──
  • T206 Test_Disease T205   (score 0.372)
  • T660 Test_Disease T661   (score 0.362)
  • T209 Test_Disease T205   (score 0.362)
  • T208 Test_Disease T205   (score 0.359)
  • T634 Test_Disease T636   (score 0.358)

══════════ Q5 ══════════
Q: How does metformin lower blood glucose mechanistically?
A: Metformin lowers blood glucose by inhibiting the uptake of glucose by the liver and promoting the uptake of glucose by muscle cells. It also reduces the production of glucose in the liver and increases the sensitivity of insulin receptors in muscle and fat cells.
── top-k context ──
  • T1344 ADE_Drug T1342   (score 0.235)
  • T437 Duration of the drug T434   (score 0.228)
  • T142 ADE_Drug T139   (score 0.228)
  • T1343 ADE_Drug T1342   (score 0.227)
  • T167 ADE_Drug T147   (score 0.227)
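
The contexts printed above carry only annotation IDs (T311, T312, …) rather than entity names, so the model has little usable evidence and appears to answer mostly from its own knowledge (or echoes an ID, as in Q2 and Q4). A quick way to measure how much of the sentence list is still in that ID-only form is sketched below. It is not part of the original notebook: the regular expression, `id_only_fraction`, and the third sample line (a made-up example of a resolved triple) are all assumptions for illustration.

```python
import re

# Matches lines shaped like "T311 Drug_Disease T312": ID, relation text, ID.
ID_ONLY = re.compile(r"^T\d+\s+.+\s+T\d+$")


def id_only_fraction(lines):
    """Return the share of non-empty lines whose head and tail are bare T-ids."""
    lines = [ln.strip() for ln in lines if ln.strip()]
    if not lines:
        return 0.0
    return sum(bool(ID_ONLY.match(ln)) for ln in lines) / len(lines)


sample = [
    "T311 Drug_Disease T312",                   # taken from the output above
    "T206 Test_Disease T205",                   # taken from the output above
    "metformin Drug_Disease type 2 diabetes",   # hypothetical resolved triple
]
print(f"ID-only share: {id_only_fraction(sample):.2f}")
```

Running the same function over the real sentences list would show whether the rebuild step has replaced the IDs with entity text before the demo is repeated.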
