<a href="https://colab.research.google.com/github/bariswheel/Tuning-DeepSeek-for-Diabetes/blob/main/Outputs_Cleared_REFACTORED_DeepSeek_Diabetes_CleanV1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🩺 DEEPSEEK R-1 Graph-RAG for Chinese Diabetes Guidelines – End-to-End Pipeline (Refactored Translation & Evaluation)

This notebook builds a **compact Retrieval-Augmented Generation (RAG) system** that turns 41 Chinese clinical-guideline JSON files into:

1.  **Fully English knowledge triples** – human-readable, one per FAISS vector, generated using a robust `deep_translator` process with caching and fallbacks.
2.  **A FAISS flat-IP index** – vectors encoded using `sentence-transformers` (MiniLM), ready for millisecond retrieval.
3.  **A RAG demo** – answering questions using a language model (DeepSeek-7B) augmented with retrieved knowledge graph triples.
4.  **An evaluation** – comparing model performance with and without RAG using keyword-based metrics.

**Key Improvements and Features:**

*   **Refactored Translation:** Translation now uses the `deep_translator` library, with Google Translate as the primary service and MyMemory as a fallback. In-memory caching and retries with exponential backoff improve speed and reliability, especially under API rate limits.
*   **Parallel Processing:** Translation of JSON files is parallelized using `ThreadPoolExecutor` for faster execution.
*   **Robust Setup:** Library installations are consolidated and managed to avoid dependency conflicts encountered during development.
*   **Clear Pipeline:** The notebook follows a logical flow from data loading and translation to indexing, retrieval, RAG demonstration, and evaluation.
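
The caching, fallback, and backoff behaviour listed above can be sketched without any network access. This is a minimal illustration, not the notebook's `safe_translate`: `translate_with_retry`, `backoff_delay`, `flaky_primary`, and `fallback` are hypothetical names, the services are stubs, and the sleep function is injected so the example runs instantly.

```python
import random

def backoff_delay(attempt: int, base: float = 2.0, jitter: float = 1.0) -> float:
    """Delay before retry number `attempt` (1-based): base**attempt plus random jitter."""
    return base ** attempt + random.uniform(0, jitter)

def translate_with_retry(text, primary, fallback, cache, max_retries=3,
                         sleep=lambda seconds: None):
    """Try `primary` up to max_retries times, then `fallback`; cache any success."""
    if text in cache:                       # cache hit: no service call at all
        return cache[text]
    for attempt in range(1, max_retries + 1):
        try:
            result = primary(text)
            cache[text] = result
            return result
        except RuntimeError:
            if attempt < max_retries:       # wait longer after each failure
                sleep(backoff_delay(attempt))
    try:
        result = fallback(text)
    except RuntimeError:
        return text                         # both services failed: keep source text
    cache[text] = result
    return result

# Hypothetical stand-ins for the real translation services.
calls = {"primary": 0}

def flaky_primary(text):
    calls["primary"] += 1
    raise RuntimeError("rate limited")

def fallback(text):
    return {"糖尿病": "diabetes"}.get(text, text)

cache = {}
first = translate_with_retry("糖尿病", flaky_primary, fallback, cache)
second = translate_with_retry("糖尿病", flaky_primary, fallback, cache)  # served from cache
print(first, second, calls["primary"])
```

The primary stub is called three times on the first request and not at all on the second, which is the effect the cache is meant to have on repeated entity labels.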

---

## 📖 Notebook Execution Guide

Follow these steps to run the entire pipeline from scratch:

| Order | Cell ID | Purpose                                                                 | Notes                                                                 |
| :---- | :------ | :---------------------------------------------------------------------- | :-------------------------------------------------------------------- |
| 1     | Cell 0  | Mount Google Drive & Define Paths                                       | Ensures access to data and output directories.                        |
| 2     | Cell 1  | Install Required Libraries                                              | Installs all necessary Python packages, including `deep_translator`.  |
| 3     | Cell 1.5 | Define `safe_translate` Function                                       | Sets up the translation function with caching and fallbacks.          |
| 4     | Cell 2  | Translate Raw CN JSON → `*_en.json` (Parallel)                          | Translates source files. Skips if `*_en.json` files already exist.    |
| 5     | Cell 3  | Rebuild `sentences.txt`                                                 | Writes one English triple per line. Skips if the file exists and `FORCE` is `False`. |
| 6     | Cell 4  | Build FAISS flat-IP index                                               | Embeds translated triples and creates the searchable index.         |
| 7     | Cell 5  | Load FAISS index + encoder                                              | Loads the index and embedding model into memory for RAG.            |
| 8     | Cell 6  | `retrieve_ctx()` helper + quick demo                                    | Defines and tests the function for retrieving context from the index. |
| 9     | Cell 7  | 5-question RAG showcase (DeepSeek-7B)                                   | Demonstrates the RAG system answering questions.                      |
| 10    | Eval 1  | Define Evaluation Data                                                  | Sets up test questions and ground truth answers.                      |
| 11    | Eval 2  | Generate Answers Without RAG                                            | Gets baseline answers from the model (no RAG).                        |
| 12    | Eval 3  | Generate Answers With RAG                                               | Gets answers from the RAG system (with retrieved context).            |
| 13    | Eval 4  | Evaluate Answers and Calculate Scores                                   | Compares generated answers and calculates basic metrics.            |
| Optional | Appx A1 | Analyze JSON Structure and Counts                                       | Provides insights into the translated data structure.               |
| Optional | Appx A2 | Explanation of JSON Structure Analysis                                | Explains the analysis in Appx A1.                                     |

**Note:** After a runtime restart, re-run steps 1-9 (or 1-13 for the full evaluation), starting with Cell 0. If translation is already done (step 4), Cell 2 skips quickly.

In [None]:
# %% CELL 0 – Mount Drive + paths
"""Mount GDrive (silently re-uses an existing token) and define all
folder / file constants in one place so later cells stay in sync."""
from google.colab import drive
from pathlib import Path, PurePosixPath

drive.mount("/content/drive", force_remount=False)    # one-liner mount

DATA_DIR = Path("/content/drive/MyDrive/diakg_assets")     # adjust once
RAW_CN   = DATA_DIR / "0521_new_format"    # 41 raw CN guideline JSON
JSON_DIR = DATA_DIR / "diakg_en"           # translated *_en.json (output)
JSON_DIR.mkdir(exist_ok=True)

LABELS_CSV = DATA_DIR / "entities_bilingual.csv"   # legacy label CSV (old Cell 3 logic only)
SENT_PATH  = DATA_DIR / "sentences.txt"            # 1 line ≈ 1 vector
INDEX_PATH = DATA_DIR / "graphrag_faiss.index"     # 8 643 vec FAISS

> **What this cell does, step-by-step**  
> 0 · Mount GDrive so results survive runtime restarts.  
> 1 · Define *all* paths in **DATA_DIR** so later cells never hard-code strings.  
> 2 · Create `diakg_en/` if missing – translated JSON will land there.

In [None]:
# %% CELL 1 - INSTALL ALL REQUIRED LIBRARIES
# Install deep_translator for the new translation method
%pip install -q deep_translator

# Install core libraries for embeddings and RAG
# Pinning specific versions based on previous troubleshooting attempts
# Let sentence-transformers manage its transformers and huggingface_hub dependencies
%pip install -q --upgrade \
    sentence-transformers==2.7.0 \
    peft==0.10.0 \
    accelerate==0.26.1 \
    bitsandbytes==0.43.2

# Note: We are NOT explicitly pinning transformers or huggingface_hub here,
# letting sentence-transformers==2.7.0 find compatible versions.

In [None]:
# %% CELL 1.5 – Define Translation Function
import json
import time
import random # Import random for jitter in backoff
from deep_translator import GoogleTranslator, MyMemoryTranslator, exceptions
from pathlib import Path

# Initialize translators
# You might need to handle API keys if required by the services for higher usage
google_translator = GoogleTranslator(source='zh-CN', target='en')
mymemory_translator = MyMemoryTranslator(source='zh-CN', target='en-US')

# Simple in-memory cache for translations
translation_cache = {}

def safe_translate(text: str, cache=translation_cache, max_retries=5) -> str:
    """
    Translates text from Chinese to English using Google Translate,
    with MyMemory as a fallback and simple caching.
    Includes a retry mechanism with exponential backoff for rate limit errors.
    """
    if not text or text.isspace():
        return text # Return empty or whitespace strings as is

    # Check cache first
    if text in cache:
        # print(f"Cache hit for: {text[:50]}...") # Optional: for debugging cache hits
        return cache[text]

    translated_text = None
    attempt = 0
    while attempt < max_retries:
        try:
            # Attempt translation with Google Translate
            # print(f"Attempt {attempt + 1}: Attempting Google Translate for: {text[:50]}...") # Optional
            translated_text = google_translator.translate(text)
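            if translated_text:
                cache[text] = translated_text # Store in cache
                return translated_text
            # Empty result without an exception: stop retrying and try the fallback
            # (without this break the loop would never advance `attempt`)
            break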

        except (exceptions.TooManyRequests, exceptions.APIException) as e:
            # Catch rate limit or general API errors
            print(f"Attempt {attempt + 1}: Google Translate failed for '{text[:50]}...' due to API error: {e}")
            attempt += 1
            if attempt < max_retries:
                sleep_time = (2 ** attempt) + random.uniform(0, 1) # Exponential backoff with jitter
                print(f"Waiting {sleep_time:.2f} seconds before retrying...")
                time.sleep(sleep_time)
            else:
                print(f"Max retries ({max_retries}) reached for Google Translate.")
                break # Exit loop to try fallback

        except Exception as e:
            # Catch any other unexpected errors from Google Translate
            print(f"Attempt {attempt + 1}: Google Translate failed for '{text[:50]}...' due to unexpected error: {e}")
            attempt += 1
            # For unexpected errors, maybe don't backoff aggressively, but still wait
            time.sleep(1) # Wait for a fixed time
            if attempt >= max_retries:
                 print(f"Max retries ({max_retries}) reached for Google Translate on unexpected error.")
                 break # Exit loop to try fallback


    # If Google Translate failed after retries, attempt with MyMemory Translator
    if not translated_text:
        attempt = 0 # Reset attempt counter for fallback
        while attempt < max_retries:
            try:
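                # print(f"Attempt {attempt + 1}: Attempting MyMemory Translate fallback for: {text[:50]}...") # Optional
                translated_text = mymemory_translator.translate(text)
                if translated_text:
                    cache[text] = translated_text # Store in cache
                    return translated_text
                # Empty result without an exception: retrying cannot help, so stop
                # (without this break the loop would never advance `attempt`)
                break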

            except (exceptions.TooManyRequests, exceptions.APIException) as e:
                 # Catch rate limit or general API errors for fallback
                 print(f"Attempt {attempt + 1}: MyMemory Translate fallback failed for '{text[:50]}...' due to API error: {e}")
                 attempt += 1
                 if attempt < max_retries:
                      sleep_time = (2 ** attempt) + random.uniform(0, 1) # Exponential backoff with jitter
                      print(f"Waiting {sleep_time:.2f} seconds before retrying fallback...")
                      time.sleep(sleep_time)
                 else:
                      print(f"Max retries ({max_retries}) reached for MyMemory Translate fallback.")
                      break # Exit loop

            except Exception as e:
                # Catch any other unexpected errors from MyMemory Translate
                print(f"Attempt {attempt + 1}: MyMemory Translate fallback failed for '{text[:50]}...' due to unexpected error: {e}")
                attempt += 1
                time.sleep(1) # Wait for a fixed time
                if attempt >= max_retries:
                     print(f"Max retries ({max_retries}) reached for MyMemory Translate fallback on unexpected error.")
                     break # Exit loop


    # If both translators failed after retries
    print(f"Warning: Translation failed for text after multiple retries: '{text}'")
    return text # Return original text on complete failure

print("✓ Defined safe_translate function with caching, fallbacks, and retry with exponential backoff.")

# Example usage (optional, for testing the function)
# print("\nTesting translation function:")
# print("Hello (cached):", safe_translate("你好"))
# print("Hello again (cache hit):", safe_translate("你好"))
# print("Diabetes:", safe_translate("糖尿病"))
# print("Patient:", safe_translate("患者"))
# # Example of a call that might fail (for testing backoff - uncomment and run multiple times if needed)
# # print("Long text that might cause error:", safe_translate("这是一个非常长的句子，用于测试翻译器的错误处理和重试机制，看看它是否能在遇到问题时正确地等待和重试。"))

In [None]:
# %% CELL 2 – Translate raw CN JSON → *_en.json (using deep_translator and ThreadPoolExecutor)
"""
Translate raw Chinese JSON files into English JSON files (*_en.json) in parallel
using the safe_translate function with caching and fallbacks.
This replaces the previous NLLB-200 based translation.
"""
import json
import glob
import os
import time
import tqdm
# sys is needed for stderr output inside translate_file (sys.exit() is no longer used)
import sys
from pathlib import Path
from concurrent.futures import ThreadPoolExecutor

# Assuming DATA_DIR and RAW_CN are defined in Cell 0
# Assuming JSON_DIR is defined in Cell 0 and exists
# Assuming safe_translate function and translation_cache are defined in Cell 1.5

# Ensure JSON_DIR exists
JSON_DIR = DATA_DIR / "diakg_en"
JSON_DIR.mkdir(exist_ok=True)

raw_files = list(RAW_CN.glob("*.json"))
translated_files = list(JSON_DIR.glob("*_en.json"))

# Check if translation is already done (based on presence of _en.json files)
# Use an if/else block to skip translation instead of sys.exit()
if len(raw_files) > 0 and len(translated_files) >= len(raw_files):
    print("**All raw JSON files appear to have been translated. Skipping translation.**")
    # The cell will simply finish execution here without the translation loop
else:
    print(f"Translating {len(raw_files)} raw JSON files from {RAW_CN} to {JSON_DIR} using ThreadPoolExecutor...")

    # --- Function to translate a single file ───────────────────────────────────
    def translate_file(jp):
        try:
            with open(jp, encoding="utf8") as f:
                doc = json.load(f)

            # Iterate through paragraphs, sentences, entities, and relations to translate text fields
            if "paragraphs" in doc:
                for p in doc["paragraphs"]:
                    if "sentences" in p:
                        for s in p["sentences"]:
                            # Translate sentence text
                            zh_sent = s.get("sentence", "")
                            if zh_sent:
                                s["sentence_en"] = safe_translate(zh_sent)

                            # Translate entity labels
                            if "entities" in s:
                                for ent in s["entities"]:
                                    zh_lab  = ent.get("entity", "")
                                    if zh_lab:
                                        ent["entity_en"] = safe_translate(zh_lab)

                            # Translate relation types
                            if "relations" in s:
                                for r in s.get("relations", []):
                                    zh_rel = r.get("relation_type", "")
                                    if zh_rel:
                                        r["relation_en"] = safe_translate(zh_rel)


            # Determine output filename and save the translated document
            out_name = os.path.basename(jp).replace(".json", "_en.json")
            out_path = JSON_DIR / out_name

            with open(out_path, "w", encoding="utf8") as f:
                json.dump(doc, f, ensure_ascii=False, indent=2)

        except Exception as e:
            # Print error without interrupting the ThreadPoolExecutor
            print(f"\nError processing file {jp}: {e}", file=sys.stderr)
            # Return an indicator of failure
            return False
        # Return an indicator of success
        return True

    # --- Parallel translation pass ────────────────────────────────────────────
    # Use ThreadPoolExecutor for parallel processing
    # Adjust max_workers based on the Colab runtime's CPU cores and memory
    # A good starting point is often the number of CPU cores or slightly more
    num_workers = os.cpu_count() # Get the number of CPU cores
    if num_workers is None:
        num_workers = 4 # Default to 4 if not detectable or needed

    print(f"Using {num_workers} worker threads for translation...")

    results = []
    with ThreadPoolExecutor(max_workers=num_workers) as executor:
        # Map the translate_file function to each raw JSON file
        # Use list() to trigger execution and wait for results
        # tqdm can be used to show progress for the results list
        results = list(tqdm.tqdm(executor.map(translate_file, raw_files), total=len(raw_files), desc="Translating docs"))

    # Optional: Check results to see if any files failed to translate
    # Need to re-import sys for stderr if it was removed above
    import sys
    failed_files = [raw_files[i] for i, success in enumerate(results) if not success]
    if failed_files:
        print(f"\nWarning: Failed to translate {len(failed_files)} files:", file=sys.stderr)
        for f in failed_files:
            print(f"- {f}", file=sys.stderr)

    print("✓ Translation pass complete — all *_en.json written (or skipped if already present).")

# If the translation was skipped, this final print will still run
# print("✓ Translation check complete.") # Optional: add a final print outside the else

In [None]:
# %% CELL 3 – rebuild sentences.txt (set FORCE = True to force a rebuild)
"""Scan every *_en.json* and write one English triple string per KG triple
(head_en relation_en tail_en) – 8 643 lines in sample run.
Now reads translated entity and relation labels directly from *_en.json files."""

# In other words: this cell takes the translated JSON, finds all the
# head-relation-tail connections (triples), and writes each one as a single
# line in sentences.txt. In our test run that file had 8,643 lines (triples).

import json, glob, pandas as pd, tqdm
from pathlib import Path

FORCE = False                              # set to True to rebuild even if the file exists

if FORCE or not SENT_PATH.exists():
    print("⟳ Rebuilding sentences.txt …")
    # The old logic read from entities_bilingual.csv
    # label_map = pd.read_csv(LABELS_CSV).set_index("entity_id")["en_label"].to_dict()

    sentences = []
    for jp in tqdm.tqdm(glob.glob(str(JSON_DIR / "*_en.json")), desc="Scanning JSON"):
        doc = json.load(open(jp, encoding="utf8"))
        if "paragraphs" in doc:
            for p in doc["paragraphs"]:
                if "sentences" in p:
                    # Build a temporary entity ID → English label map for this paragraph
                    # (all of its sentences). Relations store entity IDs, not labels.
                    entity_map = {ent.get("entity_id"): ent.get("entity_en", ent.get("entity", ""))
                                  for sent in p.get("sentences", []) # Iterate over sentences in the paragraph
                                  for ent in sent.get("entities", []) # Iterate over entities in each sentence
                                  if ent.get("entity_id")} # Only include if entity_id is present


                    for s in p["sentences"]:
                        if "relations" in s:
                            for r in s.get("relations", []):
                                head_id = r.get('head_entity_id')
                                tail_id = r.get('tail_entity_id')
                                relation_en = r.get('relation_en', r.get('relation_type', '')) # Use relation_en, fallback to relation_type

                                # Get the translated head and tail entity labels using the map
                                head_en = entity_map.get(head_id, head_id) # Fallback to ID if label not found
                                tail_en = entity_map.get(tail_id, tail_id) # Fallback to ID if label not found

                                # Ensure we have valid head, relation, and tail before adding
                                if head_en and relation_en and tail_en:
                                    sentences.append(
                                        f"{head_en} {relation_en} {tail_en}"
                                    )

    SENT_PATH.write_text("\n".join(sentences))
    print("✔ sentences.txt rebuilt with", len(sentences), "lines")
else:
    print("✓ sentences.txt already present – nothing to do.")

> **Summary**

In summary, this cell generates a text file containing all the knowledge graph triples in a human-readable English format, using the translated entity and relation labels stored in the `*_en.json` files by Cell 2. The `FORCE` flag controls whether the file is regenerated, which keeps subsequent runs fast when the data has not changed.

> **What this cell does, step-by-step**  
> 1 · Optionally force a rebuild (`FORCE=True`) so we overwrite the old ID-only version of the file.  
> 2 · Walk every *_en.json* and concatenate **head EN + relation_en + tail EN**.  
> 3 · Write one line per triple to **sentences.txt** – the order matches the FAISS index we will build later.
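
To make the walk over entities and relations concrete, here is a small self-contained sketch of the same ID-to-label lookup on a toy paragraph. The helper name `triples_from_paragraph` is hypothetical, and the entity IDs, labels, and relation type in the sample data are made up for illustration; only the field names follow the cell above.

```python
def triples_from_paragraph(paragraph: dict) -> list:
    """Return 'head relation tail' strings for every relation in one paragraph."""
    # Map entity IDs to English labels, falling back to the original label.
    entity_map = {
        ent["entity_id"]: ent.get("entity_en", ent.get("entity", ""))
        for sent in paragraph.get("sentences", [])
        for ent in sent.get("entities", [])
        if ent.get("entity_id")
    }
    lines = []
    for sent in paragraph.get("sentences", []):
        for rel in sent.get("relations", []):
            head_id = rel.get("head_entity_id")
            tail_id = rel.get("tail_entity_id")
            head = entity_map.get(head_id, head_id)   # fall back to the raw ID
            tail = entity_map.get(tail_id, tail_id)
            relation = rel.get("relation_en", rel.get("relation_type", ""))
            if head and relation and tail:
                lines.append(f"{head} {relation} {tail}")
    return lines

# Toy paragraph with made-up values.
paragraph = {
    "sentences": [{
        "entities": [
            {"entity_id": "T1", "entity": "二甲双胍", "entity_en": "Metformin"},
            {"entity_id": "T2", "entity": "2型糖尿病", "entity_en": "Type 2 diabetes"},
        ],
        "relations": [
            {"head_entity_id": "T1", "tail_entity_id": "T2",
             "relation_type": "Drug_Disease", "relation_en": "Drug_Disease"},
            {"head_entity_id": "T9", "tail_entity_id": "T2",   # T9 has no label
             "relation_type": "Drug_Disease", "relation_en": "Drug_Disease"},
        ],
    }]
}
triples = triples_from_paragraph(paragraph)
print(triples)
```

The second relation shows where ID placeholders can still leak into `sentences.txt`: when an entity ID has no matching entry in the paragraph, the raw ID is written instead of a label.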

## 🔧 Updates for optimization — **Cell 2: Translate raw CN JSON → *_en.json***

| What we changed | Why it matters on the A100 |
|-----------------|----------------------------|
| **Early-exit guard**<br>`if (DATA_DIR/"diakg_en").exists() and (DATA_DIR/"entities_bilingual.csv").exists():` | We skip the two-hour translation pass on every warm run. |
| **Single `label_map` load** | All CN → EN look-ups happen in-memory; no extra CSV I/O. |
| **Reuse the translator object that is already on GPU** | We avoid re-loading NLLB-200 and keep VRAM steady. |
| **Translate both in-sentence & top-level `relation_type` strings** | Removes the last source of “T1234 Test_Disease …” artefacts later in retrieval. |
| **Optional tip:** raise `max_length` from `400` → `512` in `fast_translate()` | Silences the “input_length > 0.9 × max_length” warnings and speeds up long sentences (fits comfortably in 16 GB fp16). |

---

## 🔧 Updates for optimization — **Cell 3: rebuild `sentences.txt`**

| What we changed | Why it matters on the A100 |
|-----------------|----------------------------|
| **`FORCE = False` by default** | We rebuild only when the file is missing or when we deliberately flip the flag, so most runs take < 1 s. |
| **ID → label mapping is built from the `entity_en` fields in each `*_en.json`** | Ensures every triple line is now fully human-readable (no more ID placeholders). |
| **Streamlined triple writer**<br>`head_en  relation_en  tail_en` | Keeps one-line-per-vector order -> perfectly aligns with the FAISS vectors we embed in Cell 4 for fast retrieval. |

> **Workflow tip**  
> 1. If we ingest **new** CN JSON later, set `FORCE = True` once so `sentences.txt` is rebuilt.  
> 2. Run Cells 0 → 3 — only the new data is processed.  
> 3. Flip `FORCE` back to `False` and enjoy instant warm starts.

In [None]:
# %% CELL 4 – Build FAISS flat-IP index
"""Embed every sentence line with MiniLM-L6-v2 (384-d) → save an IndexFlatIP."""

import faiss, numpy as np, tqdm, torch
from sentence_transformers import SentenceTransformer

sentences = [ln.rstrip() for ln in open(SENT_PATH)]
print("Total triples:", len(sentences))

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2",
                               device="cuda" if torch.cuda.is_available() else "cpu")
vecs = embedder.encode(sentences, batch_size=512,  # A100 handles 512 easily
                       convert_to_numpy=True, show_progress_bar=True).astype("float32")

index = faiss.IndexFlatIP(vecs.shape[1])
index.add(vecs)
faiss.write_index(index, str(INDEX_PATH))
print(f"✔ FAISS index written with {index.ntotal} vectors")

> **Summary of what CELL 4 does**  
> 1 · Read **8 643** human-readable triples → list[str].  
> 2 · Encode with MiniLM-L6-v2 in batches of **512** (A100 keeps <2 GB VRAM).  
> 3 · Add vectors to **IndexFlatIP** (inner-product ≈ cosine on unit vecs).  
> 4 · Save **graphrag_faiss.index** – ~12 MB on disk.
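
As a quick sanity check on the on-disk size, a flat index stores every vector uncompressed, so the payload is roughly vectors × dimensions × 4 bytes. The sketch below uses the figures quoted in this notebook and ignores the small file header, so treat it as an estimate.

```python
n_vectors = 8643        # triples reported for the sample run
dim = 384               # all-MiniLM-L6-v2 embedding size
bytes_per_float = 4     # float32

raw_bytes = n_vectors * dim * bytes_per_float
size_mib = round(raw_bytes / 2**20, 2)
print(raw_bytes, size_mib)   # 13275648 bytes, about 12.66 MiB
```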

### What cell 4 does, step-by-step

The purpose of this cell, as indicated by the docstring (line 2), is to embed the English triple sentences using a sentence transformer model and then build and save a FAISS index of these embeddings.

1.  **Load Sentences (lines 7-8):**
    *   Line 7 reads the `sentences.txt` file (created in Cell 3) and stores each line (which represents a knowledge graph triple) as an element in a Python list called `sentences`. The `rstrip()` method removes any trailing whitespace, including the newline character.
    *   Line 8 prints the total number of triples loaded from the file.

2.  **Load Sentence Embedder (lines 10-11):**
    *   Line 10 initializes a `SentenceTransformer` model. We are using the "sentence-transformers/all-MiniLM-L6-v2" model, which is a pre-trained model optimized for generating sentence embeddings.
    *   Line 11 specifies the device to load the model onto. It uses `'cuda'` if a CUDA-enabled GPU is available and `'cpu'` otherwise, so the embedding step benefits from GPU acceleration when possible.

3.  **Encode Sentences (lines 12-13):**
    *   Line 12 uses the loaded `embedder` model to encode the list of `sentences` into numerical vectors.
        *   `batch_size=512`: This processes the sentences in batches of 512, which is efficient for GPU processing, especially on an A100 as noted in the comment.
        *   `convert_to_numpy=True`: This converts the output PyTorch tensors to NumPy arrays.
        *   `show_progress_bar=True`: This displays a progress bar during the encoding process, which can take some time for a large number of sentences.
    *   The trailing `.astype("float32")` on line 13 converts the resulting vectors (`vecs`) to `float32`, the data type FAISS expects.

4.  **Build and Add to FAISS Index (lines 15-16):**
    *   Line 15 initializes a FAISS index. `faiss.IndexFlatIP` creates a flat index that uses the inner product (IP) for similarity search. Inner product is equivalent to cosine similarity when the vectors are normalized to unit length. The all-MiniLM-L6-v2 pipeline ends with a normalization layer, so its embeddings should already be unit length; passing `normalize_embeddings=True` to `encode()` would make this explicit. `vecs.shape[1]` provides the dimensionality of the vectors (384 for MiniLM-L6-v2).
    *   Line 16 adds the computed vectors (`vecs`) to the FAISS index.

5.  **Save FAISS Index (line 17):**
    *   Line 17 saves the built FAISS index to the file specified by `INDEX_PATH`.

6.  **Print Confirmation (line 18):**
    *   Line 18 prints a confirmation message indicating that the FAISS index has been written and shows the total number of vectors added to the index (`index.ntotal`).

In essence, this cell takes the English triples, converts them into numerical representations (vectors) using a sentence transformer, and then organizes these vectors into a searchable FAISS index for efficient retrieval.
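
The claim that inner product equals cosine similarity on unit vectors is easy to check by hand. The sketch below uses plain Python and two arbitrary 3-d vectors (chosen only for round numbers); it does not touch FAISS or the embedder.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity computed from the raw vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0, 0.0], [1.0, 2.0, 2.0]
ip_raw = dot(a, b)                          # depends on vector lengths
ip_unit = dot(normalize(a), normalize(b))   # inner product of unit vectors
print(ip_raw, round(ip_unit, 6), round(cosine(a, b), 6))
```

On the raw vectors the inner product is 11.0, while on the normalized vectors it matches the cosine (11/15). This is why an inner-product index only behaves like a cosine index when the stored vectors and the query are both unit length.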

In [None]:
# %% CELL 5 – Load FAISS index + encoder
"""Bring the EN vectors + MiniLM encoder into memory for retrieval."""

import faiss, torch, numpy as np
from sentence_transformers import SentenceTransformer

index     = faiss.read_index(str(INDEX_PATH))
embedder  = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2",
                                device="cuda" if torch.cuda.is_available() else "cpu")
with open(SENT_PATH) as f:
    sentences = [ln.rstrip() for ln in f]

print(f"✓ FAISS index with {index.ntotal:,} English vectors loaded")

> **What this cell does, step-by-step**  
> 0 · Read **graphrag_faiss.index** (8 643 vec) – 200 ms on SSD.  
> 1 · Load the *same* MiniLM encoder used in Cell 4 so query / index live in the same space.  
> 2 · Read **sentences.txt** into a Python list → keeps vector ↔ text mapping handy.

In [None]:
# %% CELL 6 – retrieve_ctx() + quick demo, the R in RAG

import numpy as np, torch

def retrieve_ctx(question: str, k: int = 5):
    """Return up to k unique triples most similar to the question."""
    qvec = embedder.encode([question]).astype("float32")
    D, I = index.search(qvec, k*3)          # ask for a few extra
    seen, ctx = set(), []
    for i, s in zip(I[0], D[0]):
        if i < 0:                           # FAISS pads with -1 when it has fewer hits
            continue
        triple = sentences[i]
        if triple not in seen:              # keep only the first copy
            ctx.append({"triple": triple, "score": float(s)})
            seen.add(triple)
        if len(ctx) == k:                   # stop once we have k uniques
            break
    return ctx

# quick demo
question = "Which drug treats type 2 diabetes?"
ctx = retrieve_ctx(question)
print("Q:", question)
for c in ctx:
    print(" •", c["triple"], f"(score {c['score']:.3f})")

### What this cell does, step-by-step  
0. **Inputs in RAM**  
   * `index` → FAISS flat-IP index loaded in Cell 5.  
   * `embedder` → MiniLM-L6 encoder (384-d) loaded in Cell 5.  
   * `sentences` → Python list that maps every vector to its triple string.  

1. **Encode the user question**  
   * We run the question through **MiniLM-L6** → 384-d query vector (`qvec`) on GPU.

2. **Search FAISS for nearest neighbours**  
   * We ask for **k × 3** vectors (default `k = 5` → 15 hits).  
   * The similarity measure is the inner product (dot product), which equals cosine similarity on unit vectors.

3. **Uniqueness filter (NEW)**  
   * We iterate through the 15 hits **in order of similarity**.  
   * We keep the first time we see a triple string and discard duplicates.  
   * We stop as soon as we have **k unique triples** – fast and deterministic.

4. **Return a tidy Python list**  
   * Each element is a dict `{"triple": str, "score": float}`.  
   * Scores are FAISS inner-product values cast to `float` for JSON-friendliness.

5. **Smoke-test**  
   * The demo encodes *“Which drug treats type 2 diabetes?”* and prints the top-k unique triples with their scores.  
   * On an A100 the end-to-end latency is **≪ 100 ms** per query.

> **Why it matters** – removing duplicates avoids five identical rows in the context block and slightly improves recall of diverse evidence for the LLM prompt.
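
The over-fetch-then-deduplicate step can be tried in isolation. In this sketch `unique_top_k` is a hypothetical helper that mirrors the loop in `retrieve_ctx()`, and the hit list is made up; it stands in for what FAISS would return, already sorted by descending score.

```python
def unique_top_k(hits, k):
    """hits: (text, score) pairs sorted by descending score; keep the first k distinct texts."""
    seen, out = set(), []
    for text, score in hits:
        if text in seen:            # later copies of the same triple are dropped
            continue
        seen.add(text)
        out.append({"triple": text, "score": score})
        if len(out) == k:           # stop as soon as k unique triples are collected
            break
    return out

# Made-up hits; duplicates arise when the same triple appears in several guidelines.
hits = [
    ("metformin treats type 2 diabetes", 0.81),
    ("metformin treats type 2 diabetes", 0.81),
    ("insulin treats type 1 diabetes", 0.74),
    ("metformin treats type 2 diabetes", 0.70),
    ("acarbose treats type 2 diabetes", 0.66),
]
top = unique_top_k(hits, k=2)
print([h["triple"] for h in top])
```

Note that if fewer than `k` distinct triples exist among the fetched hits, the result is simply shorter than `k`; fetching `k × 3` candidates makes that less likely but does not rule it out.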

In [None]:
# %% CELL 7 – 5-question RAG showcase  (DeepSeek-7B in fp16)
"""
1. For each demo question we …
   • retrieve_ctx(k=5)                  → 5 nearest triples
   • build a prompt  (Context\n… triple…) + Q + "Answer:"
   • call DeepSeek-7B (greedy decoding, deterministic) for the answer
   • pretty-print everything (Q / A / top-k triples) for a nice screenshot.
   – deterministic output is great for grading / reproducibility.
"""

from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import torch

MODEL_ID = "deepseek-ai/deepseek-llm-7b-chat"

tok = AutoTokenizer.from_pretrained(MODEL_ID)

# fp16 weights fit easily on the A100 (≈40 GB VRAM); 8-bit loading is disabled below
mdl = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    # load_in_8bit=True, # Removed to bypass bitsandbytes issue
    device_map="auto",
    torch_dtype=torch.float16,
).eval()

gen = pipeline(
    "text-generation",
    model=mdl,
    tokenizer=tok,
    # Removed generation args from pipeline initialization
    # max_new_tokens=128,
    # temperature=0.0,
    # do_sample=False,
)

print("✓ DeepSeek-7B loaded (without 8-bit quantization)")

PROMPT_TMPL = """You are a helpful medical assistant.
Context:
{ctx}

Question: {q}
Answer:"""

questions = [
    "Which drug is prescribed to manage HbA1c in type 2 diabetes patients?",
    "What long-term complication of diabetes affects the eyes?",
    "Name one lifestyle change that reduces insulin resistance.",
    "Which lab test is used to diagnose gestational diabetes?",
    "How does metformin lower blood glucose mechanistically?",
]

for qi, q in enumerate(questions, 1):
    ctx  = retrieve_ctx(q, k=5)                       # step 1
    ctx_block = "\n- " + "\n- ".join(c["triple"] for c in ctx)
    prompt = PROMPT_TMPL.format(ctx=ctx_block, q=q)   # step 2

    ans = gen(
        prompt,
        max_new_tokens=128,      # pass generation args at call time
        do_sample=False,         # greedy decoding → deterministic (temperature is not needed)
        return_full_text=False,  # return only the completion, not the prompt
    )[0]["generated_text"].strip()   # step 3

    print(f"\n———  Q{qi} ————————————————————")
    print("Q:", q)
    print("A:", ans)
    print("— top-k context —")
    for c in ctx:
        print(" •", c["triple"], f"(score {c['score']:.3f})")             # step 4

In [None]:
# Reinstall bitsandbytes with a specific configuration for the evaluation steps below
%pip install -qq --upgrade bitsandbytes==0.43.2 --extra-index-url https://download.pytorch.org/whl/cu121



---



#### Explanation of EVALUATION STEP 1

The code cell below defines the evaluation data that we will use to test the RAG system.

*   **Line 1:** This is the cell marker comment indicating the start of the first evaluation step.
*   **Lines 2-5:** This is a docstring that briefly explains the purpose of the cell.
*   **Lines 8-18:** A Python list named `test_questions` is created. This list contains the questions that we will use to query both the "with RAG" and "without RAG" versions of the system.
*   **Lines 21-28:** A Python dictionary named `ground_truth_answers` is created. The keys of this dictionary are the questions from the `test_questions` list, and the values are the expected correct answers. These ground truth answers are what we will compare the model's generated answers against to calculate evaluation scores. It is noted that these answers should ideally be based on the source guideline documents.
*   **Line 30:** A print statement confirms the number of test questions and ground truth answers that have been defined.

Defining these questions and answers upfront ensures that we are evaluating both RAG and non-RAG systems on the same set of inputs and have a standard to measure correctness against.
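
A quick consistency check can guard that pairing before any model is run. The helper below is a minimal sketch: `check_eval_data` is a hypothetical name that does not exist elsewhere in this notebook, and the two-item sample stands in for the full `test_questions` and `ground_truth_answers`.

```python
def check_eval_data(questions, answers):
    """Return a list of problems found in the evaluation data (empty if none)."""
    problems = []
    if len(set(questions)) != len(questions):
        problems.append("duplicate questions")
    # Every question needs a non-empty reference answer
    missing = [q for q in questions if not answers.get(q, "").strip()]
    if missing:
        problems.append(f"{len(missing)} question(s) without a ground truth answer")
    # Reference answers whose key matches no question usually indicate a typo
    known = set(questions)
    extra = [q for q in answers if q not in known]
    if extra:
        problems.append(f"{len(extra)} answer(s) for unknown questions")
    return problems


# Two items copied from the cell below, used here only as sample data
sample_questions = [
    "What long-term complication of diabetes affects the eyes?",
    "Which lab test is used to diagnose gestational diabetes?",
]
sample_answers = {
    sample_questions[0]: "Diabetic retinopathy",
    sample_questions[1]: "Oral glucose tolerance test (OGTT)",
}
print(check_eval_data(sample_questions, sample_answers))
```

An empty list means the questions and reference answers line up one to one.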

In [None]:
# %% EVALUATION STEP 1 – Define Evaluation Data
"""
This cell defines a set of test questions and their corresponding ground truth
answers to be used for evaluating the RAG system's performance.
"""

# Define a list of test questions
test_questions = [
    "Which drug is prescribed to manage HbA1c in type 2 diabetes patients?",
    "What long-term complication of diabetes affects the eyes?",
    "Name one lifestyle change that reduces insulin resistance.",
    "Which lab test is used to diagnose gestational diabetes?",
    "How does metformin lower blood glucose mechanistically?",
    "What is a common side effect of SGLT2 inhibitors?",
    "Which type of diabetes is characterized by insulin deficiency?",
    "What is the target HbA1c level for most adults with diabetes?",
    "Name a class of oral medications for type 2 diabetes.",
    "What is the primary goal of diabetes management?"
]

# Define the ground truth answers for the test questions
# These should ideally be derived from the source documents (Chinese guidelines)
ground_truth_answers = {
    "Which drug is prescribed to manage HbA1c in type 2 diabetes patients?": "Insulin or oral antidiabetic agents like Metformin, Sulfonylureas, etc.",
    "What long-term complication of diabetes affects the eyes?": "Diabetic retinopathy",
    "Name one lifestyle change that reduces insulin resistance.": "Weight loss, regular exercise, or a balanced diet.",
    "Which lab test is used to diagnose gestational diabetes?": "Oral glucose tolerance test (OGTT)",
    "How does metformin lower blood glucose mechanistically?": "Decreases hepatic glucose production and increases insulin sensitivity.",
    "What is a common side effect of SGLT2 inhibitors?": "Genital mycotic infections or urinary tract infections.",
    "Which type of diabetes is characterized by insulin deficiency?": "Type 1 diabetes",
    "What is the target HbA1c level for most adults with diabetes?": "Typically <7% (53 mmol/mol), but individualized.",
    "Name a class of oral medications for type 2 diabetes.": "Metformin, Sulfonylureas, Thiazolidinediones, DPP-4 inhibitors, SGLT2 inhibitors, GLP-1 receptor agonists (oral form).",
    "What is the primary goal of diabetes management?": "Preventing acute complications and reducing the risk of long-term complications."
}

print(f"Defined {len(test_questions)} test questions and ground truth answers.")

### Explanation of EVALUATION STEP 2

This code cell (lines 1-35) generates responses to the `test_questions` using the DeepSeek-7B model *without* incorporating any context from the knowledge graph.

*   **Lines 1-5:** The cell heading and docstring, outlining the cell's purpose.
*   **Line 7:** A comment noting that the `gen` pipeline (for the DeepSeek-7B model) and the `test_questions` list must already be defined by previous cells.
*   **Line 9:** An empty dictionary `answers_without_rag` is initialized to store the generated answers, with questions as keys and answers as values.
*   **Line 11:** A message is printed to indicate the start of the answer generation process.
*   **Line 13:** A loop iterates through each `question` in the `test_questions` list.
*   **Line 15:** The `prompt` for the model is simply the `question` itself, as we are not using RAG in this step.
*   **Lines 17-22:** The `gen` pipeline is called with the `prompt`.
    *   `max_new_tokens=128` limits the length of the generated response.
    *   `do_sample=False` selects greedy decoding, which makes the output deterministic and keeps the evaluation repeatable. `temperature=0.0` is passed alongside it, but temperature is not used when sampling is disabled.
*   **Line 27:** The generated text is extracted from the pipeline's output and leading/trailing whitespace is removed using `.strip()`. By default the `text-generation` pipeline returns the prompt followed by the completion, so the stored text begins with the question itself; passing `return_full_text=False` to the pipeline call would keep only the completion.
*   **Line 30:** The extracted `generated_text` is stored in the `answers_without_rag` dictionary, keyed by the original `question`.
*   **Lines 32-33:** The original question and the generated answer are printed for review.
*   **Line 35:** A final message confirms that the process is complete.

This step establishes a baseline by showing how well the model performs on the questions using only its pre-trained knowledge, before we introduce the retrieved context.
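
Because the pipeline output starts with the prompt, the baseline answers can be cleaned before scoring so that the question text does not leak into them. The snippet below is a minimal sketch of that idea: `completion_only` is a hypothetical helper, and the `full_text` value is a hand-written stand-in for a pipeline result, not real model output. Passing `return_full_text=False` to the pipeline call is an alternative that avoids the post-processing altogether.

```python
def completion_only(full_text: str, prompt: str) -> str:
    """Drop the echoed prompt from a text-generation result, if it is present."""
    if full_text.startswith(prompt):
        return full_text[len(prompt):].strip()
    return full_text.strip()


prompt = "What long-term complication of diabetes affects the eyes?"
# Hypothetical pipeline output (prompt followed by completion), for illustration only
full_text = prompt + " Diabetic retinopathy."
print(completion_only(full_text, prompt))
```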

In [None]:
# %% EVALUATION STEP 2 – Generate Answers Without RAG
"""
This cell generates answers to the test questions using the DeepSeek-7B model
without providing any external retrieved context.
"""

# Assuming the DeepSeek-7B pipeline 'gen' and 'test_questions' are defined in previous cells

answers_without_rag = {}

print("Generating answers without RAG...")

for i, question in enumerate(test_questions):
    # Prompt the model with just the question
    prompt = question
    # Use the text generation pipeline
    response = gen(
        prompt,
        max_new_tokens=128,  # Limit the response length
        temperature=0.0,     # Use temperature 0 for deterministic output
        do_sample=False      # Do not use sampling
    )

    # Extract the generated text.
    # Note: by default the pipeline returns prompt + completion, so this string
    # starts with the question itself (pass return_full_text=False to drop it).
    generated_text = response[0]['generated_text'].strip()

    # Store the generated answer
    answers_without_rag[question] = generated_text

    print(f"Q{i+1}: {question}")
    print(f"A (without RAG): {generated_text}\n")

print("Finished generating answers without RAG.")

### Explanation of EVALUATION STEP 3

This code cell (lines 1-58) generates responses to the `test_questions` using the DeepSeek-7B model, this time incorporating retrieved context from our knowledge graph.

*   **Lines 1-5:** The cell heading and docstring, outlining the cell's purpose.
*   **Line 7:** A comment noting that the `gen` pipeline, the `test_questions` list, the `retrieve_ctx` function, and the tokenizer `tok` must already be defined by previous cells.
*   **Line 9:** An empty dictionary `answers_with_rag` is initialized to store the generated answers when using RAG.
*   **Line 10:** An optional empty dictionary `retrieved_contexts` is initialized to store the context retrieved for each question, which can be useful for debugging or analysis.
*   **Line 12:** A message is printed to indicate the start of the answer generation process with RAG.
*   **Line 14:** A loop iterates through each `question` in the `test_questions` list.
*   **Lines 15-18:** The `retrieve_ctx` function is called with the current `question` and a specified number of results (`k=5`). The returned list of context triples is stored in the `context` variable and saved in `retrieved_contexts`.
*   **Lines 20-22:** The retrieved `context` (a list of dictionaries) is formatted into a single string `ctx_block` suitable for inclusion in the prompt, using the same format as the demo in Cell 7.
*   **Lines 24-26:** The full `prompt` for the language model is constructed using the `PROMPT_TMPL` (defined in Cell 7), inserting the formatted `ctx_block` and the original `question`.
*   **Lines 28-30:** The prompt is tokenized with `tok` to obtain its length in tokens, `input_length`.
*   **Lines 32-40:** The `gen` pipeline is called with the RAG-augmented `prompt`. The same generation parameters (`max_new_tokens`, `temperature`, `do_sample`) are used as in Evaluation Step 2 and Cell 7 to ensure deterministic output and limit length. `max_length=input_length + 128` is also passed; it expresses the same limit as `max_new_tokens=128` and is therefore redundant.
*   **Lines 42-44:** The generated text is extracted from the pipeline's output by splitting on "Answer:", taking the part after the last occurrence, and stripping whitespace.
*   **Lines 46-47:** The extracted `generated_text` is stored in the `answers_with_rag` dictionary, keyed by the original `question`.
*   **Lines 49-55:** The original question and the generated answer (with RAG) are printed for review. Optional lines (commented out) are included to print the retrieved context as well.
*   **Line 58:** A final message confirms that the process is complete.

This step generates the answers that leverage the knowledge graph, allowing us to compare them directly to the baseline answers generated without RAG.
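
The prompt construction and answer extraction described above can be tried without loading the model. The sketch below reuses `PROMPT_TMPL` from Cell 7; `build_rag_prompt` and `extract_answer` are hypothetical helper names, the two triples are taken from the showcase summary, and their scores are made-up placeholders.

```python
PROMPT_TMPL = """You are a helpful medical assistant.
Context:
{ctx}

Question: {q}
Answer:"""


def build_rag_prompt(question, context):
    """Format retrieved triples as a bulleted context block and fill the template."""
    ctx_block = "\n- " + "\n- ".join(c["triple"] for c in context)
    return PROMPT_TMPL.format(ctx=ctx_block, q=question)


def extract_answer(generated_text):
    """Keep only the text after the last 'Answer:' marker."""
    return generated_text.split("Answer:")[-1].strip()


# Stand-in for retrieve_ctx output; the scores are placeholders, not real results
context = [
    {"triple": "Insulin analogues Drug_Disease Type 2 diabetes", "score": 0.0},
    {"triple": "insulin Drug_Disease Type 2 diabetes", "score": 0.0},
]
prompt = build_rag_prompt("Which drug treats type 2 diabetes?", context)
print(prompt)
# Simulate a pipeline result: the prompt followed by a completion
print(extract_answer(prompt + " Insulin"))
```

Splitting on the last "Answer:" has one side effect worth knowing: if the model writes "Answer:" again inside its completion, everything before that second marker is discarded.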

In [None]:
# %% EVALUATION STEP 3 – Generate Answers With RAG
"""
This cell generates answers to the test questions using the DeepSeek-7B model
by providing retrieved context from the knowledge graph.
"""

# Assuming 'gen', 'test_questions', 'retrieve_ctx', and 'tok' are defined in previous cells

answers_with_rag = {}
retrieved_contexts = {} # Optional: store retrieved contexts for review

print("Generating answers with RAG...")

for i, question in enumerate(test_questions):
    # Retrieve context for the question
    # We'll use k=5 as in the Cell 7 demo, but this can be adjusted
    context = retrieve_ctx(question, k=5)
    retrieved_contexts[question] = context # Store context

    # Format the context for the prompt
    # Using the same format as in Cell 7
    ctx_block = "\n- " + "\n- ".join(c["triple"] for c in context)

    # Build the prompt with the retrieved context
    # Using the same prompt template as in Cell 7
    prompt = PROMPT_TMPL.format(ctx=ctx_block, q=question)

    # Calculate the input length after tokenization
    input_ids = tok(prompt, return_tensors="pt").input_ids
    input_length = input_ids.shape[-1]

    # Use the text generation pipeline with the RAG prompt
    # Pass generation arguments directly as keyword arguments to the pipeline call
    response = gen(
        prompt,
        max_new_tokens=128,
        temperature=0.0,
        do_sample=False,
        max_length=input_length + 128 # Explicitly set max_length
    )

    # Extract the generated text, similar to Cell 7
    # We split by "Answer:" and take the part after it
    generated_text = response[0]['generated_text'].split("Answer:")[-1].strip()

    # Store the generated answer
    answers_with_rag[question] = generated_text

    print(f"Q{i+1}: {question}")
    print(f"A (with RAG): {generated_text}\n")
    # Optional: print retrieved context for debugging/review
    # print("Retrieved Context:")
    # for c in context:
    #     print(f"  - {c['triple']} (score: {c['score']:.3f})")
    # print("-" * 20)


print("Finished generating answers with RAG.")

### Explanation of EVALUATION STEP 4

This code cell (lines 1-90) performs a basic evaluation of the generated answers by comparing them to the ground truth answers and calculating accuracy, precision, recall, and F1-score.

*   **Lines 1-5:** This is the cell heading and docstring, outlining the cell's purpose and the metrics calculated.
*   **Lines 7-11:** Comments note that the test questions, ground truth answers, and generated answers (with and without RAG) must be available from previous cells. The imports follow: the `re` module for text cleaning and `Counter` from `collections` (`Counter` is not used by the current set-based logic, but is often useful for more complex token matching).
*   **Lines 13-18:** A helper function `clean_answer` is defined. This function takes a string answer, converts it to lowercase, removes punctuation using a regular expression, and strips leading/trailing whitespace. This helps standardize the answers for comparison.
*   **Lines 20-67:** The `evaluate_scores` function is defined. This function takes the generated answers dictionary and the ground truth answers dictionary as input and returns the calculated metrics.
    *   Lines 22-28 initialize counters for accuracy and variables for summing true positives, predicted positives, and actual positives across all questions.
    *   Lines 30-55 loop through each `question` in the `test_questions` list.
    *   Lines 31-32 retrieve and clean the generated and ground truth answers.
    *   Lines 34-35 split the cleaned answers into sets of keywords.
    *   Lines 37-39 implement the *basic accuracy check*: a question counts as correct if any ground truth keyword longer than 2 characters is present in the generated keywords.
    *   Lines 41-51 calculate components for a *keyword-based* precision and recall.
        *   `true_positives` (Line 43) is the count of keywords that are present in *both* the ground truth and the generated answer.
        *   `predicted_positives` (Line 48) is simplified in this basic implementation to be equal to `true_positives`, so precision is 1.0 whenever at least one keyword matches. In a more standard evaluation, this would be the total number of keywords in the generated answer.
        *   `actual_positives` (Line 51) is the total number of keywords in the ground truth answer.
    *   Lines 53-55 sum these counts across all questions.
    *   Lines 57-58 calculate the overall accuracy percentage.
    *   Lines 60-65 calculate the overall precision, recall, and F1-score based on the summed counts, including checks to avoid division by zero.
    *   Line 67 returns the calculated accuracy, precision, recall, and F1-score.
*   **Lines 69-75:** The `evaluate_scores` function is called to calculate the metrics for the `answers_without_rag`, and the results are printed with a clear heading.
*   **Lines 78-84:** The `evaluate_scores` function is called again to calculate the metrics for the `answers_with_rag`, and the results are printed with a clear heading.
*   **Lines 87-90:** A closing comment reiterates that more sophisticated evaluation methods are recommended for a capstone and that the current precision/recall/F1 are based on simple keyword overlap, not standard text generation metrics.

This step provides a preliminary comparison using basic accuracy, precision, and F1-score based on keyword matching. While useful for a quick look, it's important to remember the limitations of this method for evaluating the nuanced quality of generated text.
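
For comparison, a token-overlap metric in which precision is informative counts *all* keywords of the generated answer as predicted positives. The function below is a minimal per-answer sketch of that variant, not the scoring used for the results in this notebook; `tokens` and `overlap_prf` are hypothetical names, and the cleaning mirrors `clean_answer`.

```python
import re


def tokens(text):
    """Lowercase, strip punctuation, and split into a set of keywords."""
    return set(re.sub(r"[^\w\s]", "", text.lower()).split())


def overlap_prf(generated, truth):
    """Token-overlap precision, recall, and F1 for a single answer."""
    gen_tokens, truth_tokens = tokens(generated), tokens(truth)
    true_positives = len(gen_tokens & truth_tokens)
    # Predicted positives are all generated keywords, so extra words lower precision
    precision = true_positives / len(gen_tokens) if gen_tokens else 0.0
    recall = true_positives / len(truth_tokens) if truth_tokens else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


p, r, f = overlap_prf("It is diabetic retinopathy", "Diabetic retinopathy")
print(f"{p:.2f} {r:.2f} {f:.2f}")  # prints "0.50 1.00 0.67"
```

With this definition a long, rambling answer that happens to contain the right keywords is penalised on precision, which the simplified version cannot do.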

In [None]:
# %% EVALUATION STEP 4 – Evaluate Answers and Calculate Scores
"""
This cell compares the generated answers (with and without RAG) to the ground truth
answers and calculates basic evaluation scores like accuracy, precision, and F1-score.
"""

# Assuming 'test_questions', 'ground_truth_answers', 'answers_without_rag',
# and 'answers_with_rag' are defined in previous cells

import re
from collections import Counter

def clean_answer(answer):
    """Basic cleaning of generated answers for comparison."""
    # Convert to lowercase and remove punctuation
    answer = answer.lower()
    answer = re.sub(r'[^\w\s]', '', answer)
    return answer.strip()

def evaluate_scores(generated_answers, ground_truth):
    """Calculates basic accuracy, precision, recall, and F1-score based on keyword presence."""
    correct_count = 0
    total_count = len(test_questions)

    # Variables for precision and recall calculation across all questions
    total_true_positives = 0
    total_predicted_positives = 0
    total_actual_positives = 0

    for question in test_questions:
        generated = clean_answer(generated_answers.get(question, ""))
        truth = clean_answer(ground_truth.get(question, ""))

        truth_keywords = set(truth.split())
        generated_keywords = set(generated.split())

        # Basic Accuracy Check (as before)
        if any(keyword in generated_keywords for keyword in truth_keywords if len(keyword) > 2):
             correct_count += 1

        # Keyword-based Precision and Recall (simplified for this basic metric)
        # True Positives: Keywords from truth that are also in generated
        true_positives = len(truth_keywords.intersection(generated_keywords))

        # Predicted Positives: keywords in the generated answer that are also in the truth.
        # This simplification sets predicted positives equal to true positives, so
        # precision is 1.0 whenever any keyword matches and carries no extra information.
        predicted_positives = true_positives # Simplified: Only count keywords from truth that were found

        # Actual Positives: Keywords in the ground truth
        actual_positives = len(truth_keywords)

        total_true_positives += true_positives
        total_predicted_positives += predicted_positives # Summing simplified predicted positives
        total_actual_positives += actual_positives

    # Calculate overall metrics
    accuracy = (correct_count / total_count) * 100 if total_count > 0 else 0

    # Avoid division by zero
    precision = total_true_positives / total_predicted_positives if total_predicted_positives > 0 else 0.0
    recall = total_true_positives / total_actual_positives if total_actual_positives > 0 else 0.0

    # Calculate F1-score
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0.0

    return accuracy, precision, recall, f1

# Calculate scores for answers without RAG
accuracy_without_rag, precision_without_rag, recall_without_rag, f1_without_rag = evaluate_scores(answers_without_rag, ground_truth_answers)
print(f"--- Scores Without RAG ---")
print(f"Accuracy: {accuracy_without_rag:.2f}%")
print(f"Precision: {precision_without_rag:.2f}")
print(f"Recall: {recall_without_rag:.2f}")
print(f"F1 Score: {f1_without_rag:.2f}\n")


# Calculate scores for answers with RAG
accuracy_with_rag, precision_with_rag, recall_with_rag, f1_with_rag = evaluate_scores(answers_with_rag, ground_truth_answers)
print(f"--- Scores With RAG ---")
print(f"Accuracy: {accuracy_with_rag:.2f}%")
print(f"Precision: {precision_with_rag:.2f}")
print(f"Recall: {recall_with_rag:.2f}")
print(f"F1 Score: {f1_with_rag:.2f}")


# Note: For a capstone, more sophisticated evaluation metrics (e.g., ROUGE, BERTScore)
# and potentially human evaluation would provide a more complete picture.
# The current precision/recall/f1 here are based on simple keyword overlap,
# not standard metrics for text generation quality.

## Summary of RAG Showcase and Evaluation

Based on the output of the RAG showcase (Cell 7) and the evaluation results (Evaluation Step 4) in the notebook, here is a summary:

**RAG Showcase (Cell 7):**

The showcase demonstrated the RAG system by answering five specific questions about diabetes guidelines. For each question, the system retrieved relevant knowledge graph triples from the FAISS index and then used the DeepSeek-7B model to generate an answer based on the retrieved context. The output shows the question, the generated answer, and the top-k retrieved triples with their scores.

For example:
*   For the question "Which drug treats type 2 diabetes?", the system retrieved triples like "Insulin analogues Drug\_Disease Type 2 diabetes" and "insulin Drug\_Disease Type 2 diabetes" and the model answered "Metformin".
*   For "What long-term complication of diabetes affects the eyes?", retrieved triples included those related to "Retina" and "Diabetic retinopathy", leading to the answer "Nonproliferative diabetic retinopathy."

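The showcase confirms that the retrieval component returns topically relevant triples. Whether the language model relies on them is less clear from these examples: in the first one, the answer ("Metformin") does not appear in either of the triples quoted above.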

**Evaluation Results (Evaluation Step 4):**

The evaluation compared the answers generated by the DeepSeek-7B model *without* RAG (using only its internal knowledge) and *with* RAG (using retrieved triples) against a set of ground truth answers based on keyword overlap. The results are as follows:

*   **Scores Without RAG:**
    *   Accuracy: 90.00%
    *   Precision: 1.00
    *   Recall: 0.46
    *   F1 Score: 0.63

*   **Scores With RAG:**
    *   Accuracy: 80.00%
    *   Precision: 1.00
    *   Recall: 0.31
    *   F1 Score: 0.48

**Summary of Findings:**

Based on the simple keyword-overlap metrics used here, the model *without* RAG scored higher than the model *with* RAG on Accuracy (90% vs. 80%, a difference of one question out of ten), Recall (0.46 vs. 0.31), and F1 (0.63 vs. 0.48). Precision was 1.00 for both, but this holds by construction: Evaluation Step 4 sets `predicted_positives` equal to `true_positives`, so precision is 1.00 whenever at least one keyword matches and says nothing about extra or incorrect content in the answers.

This initial evaluation suggests that for this specific set of questions and the current configuration:
*   The base DeepSeek-7B model has significant pre-trained knowledge about diabetes.
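*   The RAG system, in this instance and with this evaluation method, did not demonstrably improve performance based on these metrics. This could be due to various factors, including:
    *   The quality or relevance of the retrieved triples for these specific questions.
    *   How effectively the DeepSeek-7B model is utilizing the provided context block.
    *   An asymmetry in answer extraction: Evaluation Step 2 stores the full pipeline output, which by default begins with the echoed question, whereas Evaluation Step 3 keeps only the text after "Answer:". Question words that also occur in the ground truth (for example "diabetes") can therefore count as matches for the no-RAG answers only.
    *   The limitations of the simple keyword-overlap evaluation metrics in capturing the full quality and nuance of the generated answers.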
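
Since precision is 1.00 in both runs, the F1 score reduces to 2R / (1 + R) and simply tracks recall. The short check below recomputes F1 from the rounded recall values reported above; `f1_score` is a hypothetical helper written for this illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


for label, recall in [("without RAG", 0.46), ("with RAG", 0.31)]:
    print(label, round(f1_score(1.0, recall), 2))
```

This gives 0.63 and 0.47. The reported with-RAG value of 0.48 is consistent with a recall slightly above 0.31 before rounding.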

**Conclusion:**

We successfully refactored the translation process, resolved persistent library dependency issues, built the FAISS index, and ran the RAG pipeline and evaluation. While the RAG showcase shows that relevant context is being retrieved and passed to the model, the initial keyword-based evaluation metrics do not show an improvement over the base model.



### Note on Evaluation Metrics and Sample Size

It's important to note the following regarding the evaluation results presented:

*   **Sample Size:** The evaluation was conducted using a set of **10 test questions** (as defined in Evaluation Step 1). This is a relatively small sample size for a comprehensive evaluation of an LLM or RAG system.
*   **Evaluation Metric:** The metric used (in Evaluation Step 4) is a **simple keyword-based overlap** comparison between the generated answers and the ground truth answers. While this provides a basic measure of content presence, it has limitations. It does not fully capture semantic similarity, fluency, correctness of phrasing, or the overall quality of the generated text in the way more sophisticated metrics like ROUGE or BERTScore do.
*   **Conclusion Limitations:** Due to the small sample size and the nature of the keyword-based metric, the evaluation scores obtained should be considered **preliminary** and **not conclusive**. They provide an initial indication that the RAG system, in this specific setup, did not show a significant improvement over the base model based on this metric. For a robust analysis, a larger test set and more advanced evaluation methods would be necessary.
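
To put a number on the sample-size caveat, one option is a confidence interval around each accuracy figure. The sketch below uses the Wilson score interval, which is one common choice for small samples and is not part of the notebook's own evaluation; `wilson_interval` is a hypothetical helper, and the counts 9/10 and 8/10 follow from the reported 90% and 80% accuracy.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    if n == 0:
        return 0.0, 1.0
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


for label, correct in [("without RAG", 9), ("with RAG", 8)]:
    low, high = wilson_interval(correct, 10)
    print(f"{label}: {correct}/10 correct, 95% interval {low:.2f} to {high:.2f}")
```

The two intervals overlap over most of their range, which is another way of saying that a one-question difference on ten questions does not separate the two systems.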



---
# APPENDIX AND EXPLORATORY CODE

In [None]:
# %% APPENDIX A1 – Analyze JSON Structure and Counts
"""
This cell analyzes the structure of the translated JSON files and counts the number
of unique entity types, relationship types, and total triples. This information
is crucial for understanding the content and organization of the knowledge graph
data, which can be very helpful for discussing the project during capstone
research meetings.
"""
import json
from pathlib import Path

# Assuming JSON_DIR is defined in a previous cell and points to our translated JSON files
# Example path, adjust if necessary based on our notebook's CELL 0
# JSON_DIR = Path("/content/drive/MyDrive/diakg_assets/diakg_en")

# Get a list of all translated JSON files
json_files = list(JSON_DIR.glob("*_en.json"))

if not json_files:
    print(f"No translated JSON files found in {JSON_DIR}. Please run the translation cells first.")
else:
    # Load one sample JSON file to inspect its structure
    sample_file = json_files[0]
    print(f"Analyzing sample file: {sample_file.name}")

    with open(sample_file, 'r', encoding='utf8') as f:
        sample_data = json.load(f)

    # Inspect the top-level keys
    print("\nTop-level keys in a sample JSON file:")
    print(sample_data.keys())

    # Inspect the structure within a paragraph and sentence
    sample_paragraphs = sample_data.get("paragraphs") or []
    if sample_paragraphs:
        print("\nKeys in a sample paragraph:")
        print(sample_paragraphs[0].keys())

        sample_sentences = sample_paragraphs[0].get("sentences") or []
        if sample_sentences:
            print("\nKeys in a sample sentence:")
            print(sample_sentences[0].keys())

    # Count entity types, relationship types, and total triples across all files.
    # This runs regardless of what the sample file looks like.
    entity_types = set()
    relation_types = set()
    total_triples = 0

    print("\nCounting entity types, relationship types, and triples across all files...")
    for json_file in json_files:
        with open(json_file, 'r', encoding='utf8') as f:
            data = json.load(f)
        for paragraph in data.get("paragraphs", []):
            for sentence in paragraph.get("sentences", []):
                for entity in sentence.get("entities", []):
                    # Entity type is stored under the 'entity_type' key
                    if 'entity_type' in entity:
                        entity_types.add(entity['entity_type'])
                for relation in sentence.get("relations", []):
                    # Relationship type is stored under the 'relation_type' key
                    if 'relation_type' in relation:
                        relation_types.add(relation['relation_type'])
                    # Each relation is considered a triple
                    total_triples += 1

    print(f"\nTotal number of unique entity types: {len(entity_types)}")
    print(f"Total number of unique relationship types: {len(relation_types)}")
    print(f"Total number of triples (relations): {total_triples}")

    # Sort so that the printed samples are the same on every run
    print("\nSample Entity Types:")
    print(sorted(entity_types)[:10]) # Print first 10 for brevity

    print("\nSample Relationship Types:")
    print(sorted(relation_types)[:10]) # Print first 10 for brevity

## APPENDIX A2 – Explanation of JSON Structure Analysis

This markdown cell explains the purpose, functionality, and benefits of the preceding code cell (APPENDIX A1), which analyzes the structure and content of the translated Chinese diabetes guideline JSON files.

**What it does:**

The code cell performs two main tasks:
1. **Inspects the structure of a sample JSON file:** It loads the first translated JSON file found in the specified directory and prints the top-level keys, as well as the keys within a sample paragraph and sentence. This provides a clear overview of how the data is organized hierarchically.
2. **Counts entity types, relationship types, and triples:** It iterates through *all* the translated JSON files to identify and count the unique types of entities and relationships present in the dataset. It also counts the total number of relations, which represent the knowledge graph triples.

**How it does it:**

*   It uses Python's built-in `json` library to load and parse the JSON files.
*   The `pathlib` module is used for easy handling of file paths.
*   It accesses nested dictionaries and lists within the JSON structure to find entity and relation information.
*   `set()` is used to efficiently store and count unique entity and relationship types.
*   A loop iterates through all the JSON files to aggregate the counts.

**Why it's helpful:**

For the capstone research project, understanding the underlying data is essential. This analysis provides concrete numbers and structural insights that we can use to:

*   **Describe the dataset:** Quantify the diversity of information by stating the number of entity and relationship types.
*   **Explain the data source:** Clearly articulate the structure of the input data.
*   **Justify design choices:** Relate the structure of the JSON data to decisions made in building the knowledge graph and RAG system.
*   **Estimate scale:** The total number of triples gives us an idea of the size of the knowledge graph being built.

By having this information readily available, we can speak confidently and knowledgeably about the data underpinning the project.
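
If frequencies per type are wanted in addition to the number of unique types, `collections.Counter` can replace the sets. The sketch below is a small variation on the analysis, run on an in-memory document instead of the real files; `type_frequencies` is a hypothetical helper, the field names match those used in APPENDIX A1, and the sample values are invented for illustration.

```python
from collections import Counter


def type_frequencies(docs):
    """Count how often each entity type and relation type occurs across documents."""
    entity_counts, relation_counts = Counter(), Counter()
    for doc in docs:
        for paragraph in doc.get("paragraphs", []):
            for sentence in paragraph.get("sentences", []):
                entity_counts.update(
                    e["entity_type"] for e in sentence.get("entities", []) if "entity_type" in e
                )
                relation_counts.update(
                    r["relation_type"] for r in sentence.get("relations", []) if "relation_type" in r
                )
    return entity_counts, relation_counts


# Invented sample document with the same nesting as the translated JSON files
sample_doc = {
    "paragraphs": [{
        "sentences": [{
            "entities": [
                {"entity": "insulin", "entity_type": "Drug"},
                {"entity": "Type 2 diabetes", "entity_type": "Disease"},
                {"entity": "metformin", "entity_type": "Drug"},
            ],
            "relations": [{"relation_type": "Drug_Disease"}],
        }]
    }]
}
entity_counts, relation_counts = type_frequencies([sample_doc])
print(entity_counts.most_common())
print(relation_counts.most_common())
```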

# More Refactoring Info
The refactoring task: replace the current translation method in the Jupyter notebook `/content/RichCapstoneFineTuningMistral.ipynb` with one using the `deep_translator` library, specifically Google Translate with MyMemory as a fallback, incorporating caching for efficiency. This involves analyzing the existing code, implementing the new translation logic, modifying the data processing steps to use the new translations, and testing the updated pipeline. The `/content/drive/MyDrive/diakg_assets` folder is backed up before any changes are made.

## Backup of existing data due to refactoring and forking the code

### Subtask:
In Google Drive, manually rename the `/content/drive/MyDrive/diakg_assets` folder to `/content/drive/MyDrive/diakg_assets.old`. Then copy the contents of `/content/drive/MyDrive/diakg_assets.old` into a new folder named `/content/drive/MyDrive/diakg_assets`.
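
The same rename-then-copy backup can also be scripted. The function below is a minimal sketch using only the standard library; `backup_folder` is a hypothetical helper, and the demonstration runs on a temporary directory so that nothing on Drive is touched. Pointing it at the mounted Drive path is left to the reader.

```python
import shutil
import tempfile
from pathlib import Path


def backup_folder(src: Path) -> Path:
    """Rename src to '<name>.old', then copy its contents back to the original path."""
    backup = src.with_name(src.name + ".old")
    if backup.exists():
        raise FileExistsError(f"{backup} already exists; refusing to overwrite it")
    src.rename(backup)            # the original data now lives in <name>.old
    shutil.copytree(backup, src)  # working copy under the original name
    return backup


# Demonstration on a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    assets = Path(tmp) / "diakg_assets"
    assets.mkdir()
    (assets / "sample_en.json").write_text("{}", encoding="utf8")
    backup_folder(assets)
    print(sorted(p.name for p in Path(tmp).iterdir()))
```

Refusing to run when the `.old` folder already exists keeps a second invocation from clobbering the first backup.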


**Reasoning**:
Add a new code cell to install the `deep_translator` library using pip.



---
## Retrospective and Lessons Learned

This project involved building a complete RAG pipeline, encountering and solving several challenges along the way. The process highlighted key aspects of working with large language models, knowledge graphs, and cloud environments.

**Lessons Learned:**

*   **Dependency Management is Crucial and Complex:** Pinning library versions is essential for reproducibility and stability, especially in dynamic environments like Colab. However, finding a truly compatible set of dependencies across multiple libraries (like `transformers`, `sentence-transformers`, `peft`, `accelerate`, `bitsandbytes`, `huggingface_hub`) can be extremely challenging and time-consuming.
*   **Persistent Errors Require Systematic Debugging:** The `ModuleNotFoundError` in Cell 4 demonstrated that simple reinstallations are not always sufficient. Debugging required systematically trying different version combinations, understanding the traceback, and considering interactions between multiple libraries.
*   **Runtime Restarts Can Solve Stubborn Issues:** For persistent environment-related errors in Colab, a runtime restart is a valuable troubleshooting step that can clear cached states that `%pip install --upgrade` might miss.
*   **Translating Structured Data Requires Careful Planning:** When translating data with complex structures (like JSON with nested entities and relations), it's vital to consider which fields need translation and at what granularity, based on the downstream use of the data (e.g., building knowledge graph triples).
*   **Robust Translation Handling is Necessary for External APIs:** Relying on external translation APIs requires implementing robust error handling, including retry mechanisms (like exponential backoff) and fallbacks, to deal with rate limits, server errors, and temporary service unavailability.
*   **Caching Improves Efficiency with Repeated Data:** For translation tasks, implementing a cache for frequently encountered text snippets significantly reduces redundant API calls and speeds up the process.
*   **Parallel Processing Accelerates File-Based Operations:** Using tools like `ThreadPoolExecutor` to process multiple files concurrently reduces overall execution time for I/O-bound work such as API-backed translation, where most of the time is spent waiting on network responses.
*   **Clear Notebook Structure and Documentation Aid Understanding:** Organizing the notebook into logical cells with clear markdown explanations and comments significantly improves readability and helps others (and me) understand the pipeline and reproduce the results.
*   **Evaluation Metrics Matter and Have Limitations:** Choosing appropriate evaluation metrics is key to assessing model performance. Simple metrics like keyword overlap can provide a quick overview but have limitations in capturing semantic meaning and fluency. More sophisticated metrics (ROUGE, BERTScore, human evaluation) are needed for a comprehensive assessment.
*   **RAG Performance Depends on Retrieval Quality and Model Utilization:** The initial evaluation showed no improvement from RAG, which highlights that effective RAG relies on both retrieving truly relevant context *and* the language model actually using that context rather than relying solely on its internal knowledge.
*   **Debugging Takes Time and Patience:** Resolving complex technical issues, especially dependency problems in machine learning environments, often requires significant time, experimentation, and patience.
*   **Understanding Error Messages is Key:** Carefully reading and interpreting traceback messages is fundamental to identifying the root cause of errors and guiding troubleshooting efforts.
*   **Iterative Development is Effective:** Breaking down the project into smaller, manageable steps (like translation, indexing, retrieval, evaluation) and addressing issues at each stage is an effective development strategy.
*   **Community Resources and Documentation are Valuable:** Consulting documentation for libraries like `transformers`, `sentence-transformers`, and `deep_translator`, and looking at community forums for similar errors, can provide crucial clues and solutions.
*   **Version Compatibility Tools (Less Used Here, But Relevant):** Although I did not use them extensively during interactive debugging, tools like `pipdeptree` and isolated environments (for example with `venv` or Conda) can help diagnose and manage complex dependencies more proactively.
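
The translation lessons above (retries with exponential backoff, fallbacks, caching, and thread-based parallelism) can be sketched together as follows. This is a minimal illustration, not the notebook's actual implementation: `make_translator` and `translate_all` are hypothetical names, and each backend is assumed to be a plain callable that wraps one translation API call (in this project, that would be a `deep_translator` translator's `translate` method).

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Sequence

# Hypothetical type: a backend takes source text and returns translated text,
# raising an exception on rate limits or server errors.
Backend = Callable[[str], str]


def make_translator(
    backends: Sequence[Backend],
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Callable[[str], str]:
    """Build a cached translate(text) function.

    Each backend is tried up to max_retries times with exponential backoff;
    if it keeps failing, the next backend in the list is used as a fallback.
    """
    cache: Dict[str, str] = {}

    def translate(text: str) -> str:
        if text in cache:  # skip the API call for text seen before
            return cache[text]
        for backend in backends:
            for attempt in range(max_retries):
                try:
                    result = backend(text)
                except Exception:
                    # Wait base_delay, 2*base_delay, 4*base_delay, ...
                    # between attempts on the same backend.
                    if attempt < max_retries - 1:
                        sleep(base_delay * 2 ** attempt)
                    continue
                cache[text] = result
                return result
        # Assumed policy: if every backend fails, return the text untranslated.
        return text

    return translate


def translate_all(
    texts: Sequence[str], translate: Callable[[str], str], max_workers: int = 4
) -> List[str]:
    """Translate texts concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(translate, texts))
```

Passing `sleep` as a parameter keeps the backoff schedule testable without real waiting. The cache is a plain dictionary shared across threads, so two threads may occasionally translate the same snippet at the same time; that costs one redundant call but does not corrupt the results.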
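
To make the point about keyword-based evaluation concrete, here is one possible form of a keyword overlap score. It is a sketch under assumptions: `keyword_recall` is a hypothetical helper, and the notebook's own metric may be defined differently.

```python
import re
from typing import Iterable


def keyword_recall(answer: str, keywords: Iterable[str]) -> float:
    """Fraction of expected keywords found in the answer.

    Matching is case-insensitive and on whole words or phrases.
    """
    expected = [k.lower() for k in keywords]
    if not expected:
        return 0.0
    text = answer.lower()
    hits = sum(
        1 for k in expected
        if re.search(r"\b" + re.escape(k) + r"\b", text)
    )
    return hits / len(expected)
```

A score like this only rewards literal matches. An answer that recommends "biguanides" gets no credit for the keyword "metformin", which is the kind of semantic gap that embedding-based metrics or human review are meant to catch.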