# Custom-RAG Implementation:

## Setup:

In [1]:
%%capture
!pip install unsloth
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

In [2]:
!pip install rank_bm25 langchain_community sentence-transformers chromadb langchain-huggingface

In [3]:
from unsloth import FastLanguageModel
from langchain_classic.retrievers.ensemble import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import Chroma
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_core.documents import Document
import torch

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


## Fetching Models from HF:

### 1. `Llama-1B`:

In [None]:
model_1b, tokenizer_1b = FastLanguageModel.from_pretrained(
    model_name = "farazahmad2004/NLP-Medical-Chatbot-Llama-1B",
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
)
FastLanguageModel.for_inference(model_1b)

### 2. `Llama-3B`:

In [None]:
model_3b, tokenizer_3b = FastLanguageModel.from_pretrained(
    model_name = "farazahmad2004/NLP-Medical-Chatbot-Llama-3B",
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
)
FastLanguageModel.for_inference(model_3b)

### 3. `Qwen-2.5-7B`:

In [None]:
model_7b, tokenizer_7b = FastLanguageModel.from_pretrained(
    model_name = "farazahmad2004/NLP-Medical-Chatbot-Qwen-7B",
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
)
FastLanguageModel.for_inference(model_7b)

==((====))==  Unsloth 2025.12.7: Fast Qwen2 patching. Transformers: 4.57.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.1+cu128. CUDA: 7.5. CUDA Toolkit: 12.8. Triton: 3.5.1
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.33.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Unsloth 2025.12.7 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.


PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): Qwen2ForCausalLM(
      (model): Qwen2Model(
        (embed_tokens): Embedding(152064, 3584, padding_idx=151654)
        (layers): ModuleList(
          (0-27): 28 x Qwen2DecoderLayer(
            (self_attn): Qwen2Attention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=3584, out_features=3584, bias=True)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=3584, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=3584, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (k_proj): lora

## RAG Implementation:

### Getting RAG Dataset:

In [4]:
!wget https://raw.githubusercontent.com/farazahmad2004/NLP_Project_Medical_Chatbot/main/datasets/RAG_dataset/rag_dataset.txt

--2025-12-20 17:04:30--  https://raw.githubusercontent.com/farazahmad2004/NLP_Project_Medical_Chatbot/main/datasets/RAG_dataset/rag_dataset.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 26863 (26K) [text/plain]
Saving to: ‘rag_dataset.txt’


2025-12-20 17:04:30 (90.2 MB/s) - ‘rag_dataset.txt’ saved [26863/26863]



In [None]:
def load_disease_chunks(filepath):
    with open(filepath, "r", encoding="utf-8") as f:
        text = f.read()
    raw_chunks = text.split("###")
    return [Document(page_content=chunk.strip(), metadata={"source": filepath}) for chunk in raw_chunks if chunk.strip()]

In [None]:
docs = load_disease_chunks("rag_dataset.txt")
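
To make the chunking rule concrete, here is a minimal sketch of the same `###` split applied to an in-memory string. The helper name `split_chunks` and the two-entry sample are hypothetical and are not taken from `rag_dataset.txt`; the sketch returns plain strings rather than `Document` objects.

```python
def split_chunks(text, delimiter="###"):
    """Split on the delimiter and drop empty or whitespace-only pieces."""
    return [piece.strip() for piece in text.split(delimiter) if piece.strip()]


# Hypothetical sample in the same "one entry per ### block" layout.
sample = """###
Disease: A
Symptoms: x, y
###
Disease: B
Symptoms: z
"""

chunks = split_chunks(sample)
print(len(chunks))                 # number of non-empty entries
print(chunks[0].splitlines()[0])   # first line of the first entry
```

The leading delimiter produces an empty first piece, which is why the `if piece.strip()` filter matters.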

## Loading Semantic Multilingual Model:

In [None]:
embedding_model = HuggingFaceEmbeddings(model_name="intfloat/multilingual-e5-large")
vector_db = Chroma.from_documents(docs, embedding_model)
semantic_retriever = vector_db.as_retriever(search_kwargs={"k": 2})
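
The dense retriever returns the `k` chunks whose embeddings are closest to the query embedding. The sketch below illustrates that idea with cosine similarity over hand-written 3-dimensional vectors. Both are assumptions made for illustration: the vectors are made up (real `multilingual-e5-large` embeddings have 1024 dimensions), and the distance function Chroma actually applies depends on how the collection is configured.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k(query_vec, doc_vecs, k=2):
    """Return the k document ids most similar to the query, best first."""
    ranked = sorted(
        doc_vecs,
        key=lambda doc_id: cosine(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical toy "embeddings", one per disease chunk.
doc_vecs = {
    "constipation": [0.9, 0.1, 0.0],
    "vertigo": [0.1, 0.9, 0.0],
    "fever": [0.0, 0.2, 0.9],
}
query_vec = [0.8, 0.3, 0.1]

print(top_k(query_vec, doc_vecs, k=2))
```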

In [None]:
bm25_retriever = BM25Retriever.from_documents(docs)
bm25_retriever.k = 2
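
The keyword retriever ranks chunks by BM25, which rewards exact term overlap and down-weights terms that occur in many chunks. Below is a minimal sketch of a textbook BM25 scorer with whitespace tokenization. It is an illustration only: the function `bm25_scores`, the smoothed IDF variant, and the three toy documents are assumptions, and the scores need not match what `rank_bm25` computes internally.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with a textbook BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue  # terms absent from the document contribute nothing
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = term_freq[term]
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


# Hypothetical mini-corpus of Roman Urdu keywords.
toy_docs = ["qabz pait sakht", "bukhar sar dard", "khansi balgham"]
scores = bm25_scores("pait sakht hai", toy_docs)
best = max(range(len(toy_docs)), key=scores.__getitem__)
print(best, scores)
```

Because matching in this sketch is on exact tokens, Roman Urdu spelling variants such as `sakht` and `sakhat` count as different terms, which is one reason to pair keyword search with a semantic retriever.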

### Making an ensemble retriever:

In [None]:
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, semantic_retriever],
    weights=[0.9, 0.1]
)
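
As far as its documentation describes, `EnsembleRetriever` merges the ranked lists of its retrievers with weighted Reciprocal Rank Fusion. The sketch below shows that fusion rule on plain document ids. The function `weighted_rrf`, the constant `c=60`, and the two toy rankings are assumptions for illustration, not the library's code.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights, c=60):
    """Fuse ranked lists: each list adds weight / (rank + c) to a document's score."""
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical top-2 results from each retriever.
bm25_ranking = ["A", "B"]
semantic_ranking = ["B", "C"]

fused = weighted_rrf([bm25_ranking, semantic_ranking], weights=[0.9, 0.1])
print(fused)
```

With weights of 0.9 and 0.1 the BM25 order dominates, but a chunk that appears in both lists can still overtake a chunk found by BM25 alone, as `B` does here.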

In [None]:
def retrieve_context(query):
    docs = ensemble_retriever.invoke(query)
    return "\n\n".join([d.page_content for d in docs])

q = "Mujhe Qabz hai aur pait sakht hai"
context = retrieve_context(q)
print(f"Query: {q}")
print(f"Retrieved Context Snippet: {context[:200]}...")

Query: Mujhe Qabz hai aur pait sakht hai
Retrieved Context Snippet: Disease: Qabz (Constipation)
Symptoms: Pakhana (stool) sakhat aana, pait bhaari hona, pait mein gas banna.
Short-term relief: Ispaghol ka chilka dahi ya paani mein mila kar lein, garam paani piyein.
T...


#### Toy Human Evaluation:

In [None]:
def generate_rag_response(question, model, tokenizer):
    context_text = retrieve_context(question)
    if not context_text:
        return "Context not found.", "No context"
    rag_prompt = f"""You are an expert Medical AI Assistant.

    Task: Answer the user's question using ONLY the provided Context.

    Rules:
    1. Act as a professional Doctor.
    2. Do NOT talk about yourself. Do NOT say "Main" (I) or "Mujhe" (Me).
    3. If the Context is unrelated, say "Maaf kijiye, mere paas iski maloomat nahi."
    4. Speak in clear Roman Urdu.

    ---
    Example of Correct Behavior:
    Context: Disease: Bukhar. Treatment: Panadol lein.
    Question: Mujhe bukhar hai.
    Answer: Bukhar ke liye aap Panadol le sakte hain. Pani ziyada piyein aur aaram karein.
    ---

    Real Context:
    {context_text}

    User Question: {question}

    Answer:"""
    messages = [{"role": "user", "content": rag_prompt}]
    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt"
    ).to("cuda")
    outputs = model.generate(
        input_ids=inputs,
        max_new_tokens=200,
        use_cache=True,
        temperature=0.3,
        repetition_penalty=1.2
    )

    generated_tokens = outputs[0][inputs.shape[1]:]
    decoded_response = tokenizer.decode(generated_tokens, skip_special_tokens=True)
    # returning context here only for debugging/checking
    return decoded_response.strip(), context_text


symptoms_questions = [
    "Mujhe subah se sar mein dard ho raha hai, koi dawai bataen?",
    "Mere bachay ko tez bukhar hai aur wo kuch kha nahi raha.",
    "Mujhe khansi ke sath balgham aa raha hai, kya karoon?",
    "Pait mein mror uth rahe hain aur loose motion lage hain.",
    "Mere daant mein shadeed dard hai, dentist ke paas janay tak kya karoon?",
    "Mujhe saans lene mein dushwari ho rahi hai jab main sidhiyan charhta hoon.",
    "Meri aankhon mein jalan aur pani aa raha hai.",
    "Kaan mein dard hai, shayed paani chala gaya hai.",
    "Jism par lal nishan par gaye hain aur kharish ho rahi hai.",
    "Mujhe ulti jaisa mehsoos ho raha hai (nausea). kia karun?"
]

print("--- Evaluating Qwen-7B with RAG ---")
for question in symptoms_questions:
    print(f"Q: {question}")
    ans, context_used = generate_rag_response(question, model_7b, tokenizer_7b)
    print(f"A: {ans}")
    source_preview = context_used.split('\n')[0] if context_used else "None"
    print(f"Source Used: {source_preview}...")
    print("-" * 70)

--- Evaluating Qwen-7B with RAG ---
Q: Mujhe subah se sar mein dard ho raha hai, koi dawai bataen?
A: Aapko shayad ek migraine headache ho raha hai jo sirf subah hi hota hai. Ye sabse pehli cheez jo main suggest karunga wo ye hai ke aap apne doctor se milen kyunki migraines ka diagnosis khoon ke tests se confirm kiya ja sakta hai. Agar yeh migraines hain to aapka doctor aapke saath ilaaj ka plan banane mein madadgar hogi. Main aise over-the-counter pain relievers recommend karunga jo caffeine shamil karte hain, jaise Excedrin Migraine. Ismein aspirin, acetaminophen aur caffeine hota hai. Acetaminophen (Tylenol) aur ibuprofen (Advil/Motrin) bhi behtareen options hain. Kuch log dono types ki combination tablets istemal karte hain. Dusre nonsteroidal anti
Source Used: Disease: Chakkar Aana (Vertigo)...
----------------------------------------------------------------------
Q: Mere bachay ko tez bukhar hai aur wo kuch kha nahi raha.
A: Agar ye viral fever ya flu hai to unke liye anti-bacter

# Evaluation:
Here we run the full evaluation of all three models:

In [5]:
!pip install evaluate rouge_score bert_score sacrebleu

In [6]:
from tqdm import tqdm
import evaluate
import pandas as pd
import gc
from datasets import load_dataset

In [None]:
rouge = evaluate.load('rouge')
bertscore = evaluate.load('bertscore')
chrf = evaluate.load('chrf')

### Getting Validation dataset:

In [7]:
!wget https://raw.githubusercontent.com/farazahmad2004/NLP_Project_Medical_Chatbot/main/datasets/Roman_Urdu/jsonl/val_dataset.jsonl

--2025-12-20 17:05:23--  https://raw.githubusercontent.com/farazahmad2004/NLP_Project_Medical_Chatbot/main/datasets/Roman_Urdu/jsonl/val_dataset.jsonl
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5120239 (4.9M) [text/plain]
Saving to: ‘val_dataset.jsonl’


2025-12-20 17:05:23 (326 MB/s) - ‘val_dataset.jsonl’ saved [5120239/5120239]



In [None]:
dataset = load_dataset("json", data_files="val_dataset.jsonl", split="train")
subset_size = 100
eval_dataset = dataset.shuffle(seed=37).select(range(subset_size))
print(f"Evaluation Set: {len(eval_dataset)} samples")

Evaluation Set: 100 samples
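
The evaluation loop further down reads each record positionally, as `row['messages'][1]` for the user question and `row['messages'][2]` for the reference answer. A minimal sketch of the record layout this implies, with a role-based lookup as an alternative, is shown below. The `role` key, the system message at index 0, and the sample record are assumptions, since the file's schema is not shown here.

```python
def extract_pair(messages):
    """Return (user question, reference answer) by role instead of by position."""
    question = next(m["content"] for m in messages if m["role"] == "user")
    answer = next(m["content"] for m in messages if m["role"] == "assistant")
    return question, answer


# Hypothetical record in the assumed chat format.
row = {
    "messages": [
        {"role": "system", "content": "You are a medical assistant."},
        {"role": "user", "content": "Mujhe bukhar hai."},
        {"role": "assistant", "content": "Panadol lein aur aaram karein."},
    ]
}

question, answer = extract_pair(row["messages"])
print(question)
print(answer)
```

Looking up by role keeps working if a record has no system message, whereas fixed indices would silently shift.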


Combine the semantic and keyword-based retrievers into a single setup function:

In [None]:
def setup_rag_retriever(dataset_path="rag_dataset.txt"):
    with open(dataset_path, "r", encoding="utf-8") as f:
        text = f.read()
    raw_chunks = text.split("###")
    docs = [Document(page_content=chunk.strip(), metadata={"source": "medical_data"})
            for chunk in raw_chunks if chunk.strip()]
    # 1. Semantic (Dense)
    embedding_model = HuggingFaceEmbeddings(model_name="intfloat/multilingual-e5-large")
    vector_db = Chroma.from_documents(docs, embedding_model)
    semantic_retriever = vector_db.as_retriever(search_kwargs={"k": 2})
    # 2. Keyword (Sparse)
    bm25_retriever = BM25Retriever.from_documents(docs)
    bm25_retriever.k = 2
    # 3. Hybrid ensemble (a BM25 weight of 0.9 worked best)
    ensemble_retriever = EnsembleRetriever(
        retrievers=[bm25_retriever, semantic_retriever],
        weights=[0.9, 0.1]
    )
    return ensemble_retriever


In [None]:
rag_retriever = setup_rag_retriever()

In [None]:
def clean_memory():
    gc.collect()
    torch.cuda.empty_cache()

In [None]:
clean_memory()

In [None]:
def run_eval(model, tokenizer, alias, modes):
    results = []
    for mode in modes:
        preds = []
        truths = []
        for row in tqdm(eval_dataset):
            user_q = row['messages'][1]['content']
            true_a = row['messages'][2]['content']

            if mode == 'RAG':
                docs = rag_retriever.invoke(user_q)
                context = docs[0].page_content
                prompt = f"""You are an expert Medical AI Assistant.
Task: Answer the user's question using ONLY the provided Context.
Rules: Act as a professional Doctor. Do NOT talk about yourself. Speak in clear Roman Urdu.

CONTEXT:
{context}

USER QUESTION: {user_q}

ANSWER (in Roman Urdu):"""
            elif mode == 'Base':
                prompt = f"User Question: {user_q}\nAnswer this medical question in Roman Urdu."
            else:
                prompt = f"User Question: {user_q}\nAnswer this as a Doctor in Roman Urdu. Do not tell stories about yourself."

            messages = [{"role": "user", "content": prompt}]
            inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cuda")

            outputs = model.generate(input_ids=inputs, max_new_tokens=150, use_cache=True, temperature=0.3, repetition_penalty=1.2)
            response = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True).strip()

            preds.append(response)
            truths.append(true_a)

        r_score = rouge.compute(predictions=preds, references=truths)
        c_score = chrf.compute(predictions=preds, references=truths, word_order=2)
        b_score = bertscore.compute(predictions=preds, references=truths, lang="en")

        results.append({
            "Model": alias,
            "Mode": mode,
            "ROUGE-L": round(r_score['rougeL'], 4),
            "CHRF++": round(c_score['score'], 4),
            "BERT-F1": round(sum(b_score['f1']) / len(b_score['f1']), 4)
        })
    return results
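
For readers unfamiliar with ROUGE-L, the sketch below computes a sentence-level ROUGE-L F1 from the longest common subsequence (LCS) of two token lists. It is a simplified illustration, assuming lowercased whitespace tokens and no stemming, so its values need not match what `evaluate.load('rouge')` reports. The two sentences are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for token_a in a:
        curr = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction, reference):
    """F1 of LCS-based precision and recall over whitespace tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical prediction and reference.
prediction = "garam paani piyein aur aaram karein"
reference = "paani ziyada piyein aur aaram karein"
print(round(rouge_l_f1(prediction, reference), 4))
```

Since the score depends on shared tokens in the same order, a correct answer phrased differently from the reference can still score low, which is worth keeping in mind when reading the tables that follow.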

Llama 1B evaluation:

In [None]:
results_1b = []

print("Running Llama-1B Base...")
clean_memory()
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-1B-Instruct",
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
)
FastLanguageModel.for_inference(model)
results_1b.extend(run_eval(model, tokenizer, "Llama-1B", ["Base"]))
del model, tokenizer
clean_memory()

print("Running Llama-1B Fine-Tuned & RAG...")
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "farazahmad2004/NLP-Medical-Chatbot-Llama-1B",
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
)
FastLanguageModel.for_inference(model)
results_1b.extend(run_eval(model, tokenizer, "Llama-1B", ["Fine-Tuned", "RAG"]))
del model, tokenizer
clean_memory()

pd.DataFrame(results_1b).to_csv("results_1b.csv", index=False)
print(pd.DataFrame(results_1b))

Running Llama-1B Base...
==((====))==  Unsloth 2025.12.7: Fast Llama patching. Transformers: 4.57.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.1+cu128. CUDA: 7.5. CUDA Toolkit: 12.8. Triton: 3.5.1
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.33.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
100%|██████████| 100/100 [05:56<00:00,  3.56s/it]


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Running Llama-1B Fine-Tuned & RAG...
Unsloth 2025.12.7 patched 16 layers with 16 QKV layers, 16 O layers and 16 MLP layers.
100%|██████████| 100/100 [07:59<00:00,  4.79s/it]
100%|██████████| 100/100 [07:20<00:00,  4.40s/it]


      Model        Mode  ROUGE-L   CHRF++  BERT-F1
0  Llama-1B        Base   0.0384   6.6217   0.7350
1  Llama-1B  Fine-Tuned   0.1173  18.5206   0.8413
2  Llama-1B         RAG   0.1095  16.5166   0.8359


Llama 3B evaluation:

In [None]:
results_3b = []

print("Running Llama-3B Base...")
clean_memory()
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-Instruct",
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
)
FastLanguageModel.for_inference(model)
results_3b.extend(run_eval(model, tokenizer, "Llama-3B", ["Base"]))
del model, tokenizer
clean_memory()

print("Running Llama-3B Fine-Tuned & RAG...")
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "farazahmad2004/NLP-Medical-Chatbot-Llama-3B",
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
)
FastLanguageModel.for_inference(model)
results_3b.extend(run_eval(model, tokenizer, "Llama-3B", ["Fine-Tuned", "RAG"]))
del model, tokenizer
clean_memory()

pd.DataFrame(results_3b).to_csv("results_3b.csv", index=False)
print(pd.DataFrame(results_3b))

Running Llama-3B Base...
100%|██████████| 100/100 [09:44<00:00,  5.84s/it]


Running Llama-3B Fine-Tuned & RAG...
Unsloth 2025.12.7 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.
100%|██████████| 100/100 [12:47<00:00,  7.68s/it]
100%|██████████| 100/100 [12:26<00:00,  7.46s/it]


      Model        Mode  ROUGE-L   CHRF++  BERT-F1
0  Llama-3B        Base   0.0566   9.0776   0.7586
1  Llama-3B  Fine-Tuned   0.1189  18.4851   0.8411
2  Llama-3B         RAG   0.1131  16.8233   0.8380


Qwen 7B evaluation:

In [None]:
results_7b = []

print("Running Qwen-7B Base...")
clean_memory()
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen2.5-7B-Instruct-bnb-4bit",
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
)
FastLanguageModel.for_inference(model)
results_7b.extend(run_eval(model, tokenizer, "Qwen-7B", ["Base"]))
del model, tokenizer
clean_memory()
print("Running Qwen-7B Fine-Tuned & RAG...")
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "farazahmad2004/NLP-Medical-Chatbot-Qwen-7B",
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
)
FastLanguageModel.for_inference(model)
results_7b.extend(run_eval(model, tokenizer, "Qwen-7B", ["Fine-Tuned", "RAG"]))
del model, tokenizer
clean_memory()
pd.DataFrame(results_7b).to_csv("results_7b.csv", index=False)
print(pd.DataFrame(results_7b))

Running Qwen-7B Base...
100%|██████████| 100/100 [13:53<00:00,  8.33s/it]


Running Qwen-7B Fine-Tuned & RAG...
100%|██████████| 100/100 [13:57<00:00,  8.38s/it]
100%|██████████| 100/100 [13:55<00:00,  8.35s/it]


     Model        Mode  ROUGE-L   CHRF++  BERT-F1
0  Qwen-7B        Base   0.0191   3.6988   0.7138
1  Qwen-7B  Fine-Tuned   0.1229  18.9669   0.8417
2  Qwen-7B         RAG   0.1172  18.4333   0.8394
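
The three tables are easier to compare side by side. The sketch below copies the Fine-Tuned and RAG scores reported above and computes RAG minus Fine-Tuned for each metric. The helper name `rag_minus_finetuned` is hypothetical; the numbers are taken verbatim from the tables.

```python
METRICS = ("ROUGE-L", "CHRF++", "BERT-F1")

# Scores copied from the three result tables (100-sample validation subset).
reported = {
    "Llama-1B": {"Fine-Tuned": (0.1173, 18.5206, 0.8413), "RAG": (0.1095, 16.5166, 0.8359)},
    "Llama-3B": {"Fine-Tuned": (0.1189, 18.4851, 0.8411), "RAG": (0.1131, 16.8233, 0.8380)},
    "Qwen-7B": {"Fine-Tuned": (0.1229, 18.9669, 0.8417), "RAG": (0.1172, 18.4333, 0.8394)},
}


def rag_minus_finetuned(scores):
    """Per-model, per-metric difference between the RAG and Fine-Tuned modes."""
    deltas = {}
    for model, modes in scores.items():
        deltas[model] = {
            metric: round(rag - ft, 4)
            for metric, rag, ft in zip(METRICS, modes["RAG"], modes["Fine-Tuned"])
        }
    return deltas


deltas = rag_minus_finetuned(reported)
for model, row in deltas.items():
    print(model, row)
```

On this subset every difference is negative: adding the retrieved context did not raise any of the three reference-overlap metrics above the fine-tuned model alone, and the gap is smallest for Qwen-7B.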


## Human Evaluation:

In [9]:
queries = [
    "Mujhe subah se sar mein dard ho raha hai, koi dawai bataen?",
    "Mere bachay ko tez bukhar hai aur wo kuch kha nahi raha.",
    "Mujhe khansi ke sath balgham aa raha hai, kya karoon?",
    "Pait mein mror uth rahe hain aur loose motion lage hain.",
    "Mere daant mein shadeed dard hai, dentist ke paas janay tak kya karoon?",
    "Mujhe saans lene mein dushwari ho rahi hai jab main sidhiyan charhta hoon.",
    "Meri aankhon mein jalan aur pani aa raha hai.",
    "Kaan mein dard hai, shayed paani chala gaya hai.",
    "Jism par lal nishan par gaye hain aur kharish ho rahi hai.",
    "Mujhe ulti jaisa mehsoos ho raha hai (nausea). kia karun?"
]

In [10]:
def get_rag_system(dataset_path="rag_dataset.txt"):
    with open(dataset_path, "r", encoding="utf-8") as f:
        text = f.read()
    raw_chunks = text.split("###")
    docs = [Document(page_content=chunk.strip(), metadata={"source": "medical_data"})
            for chunk in raw_chunks if chunk.strip()]
    embedding_model = HuggingFaceEmbeddings(model_name="intfloat/multilingual-e5-large")
    vector_db = Chroma.from_documents(docs, embedding_model)
    semantic_retriever = vector_db.as_retriever(search_kwargs={"k": 2})
    bm25_retriever = BM25Retriever.from_documents(docs)
    bm25_retriever.k = 2
    ensemble_retriever = EnsembleRetriever(
        retrievers=[bm25_retriever, semantic_retriever],
        weights=[0.9, 0.1]
    )
    return ensemble_retriever

rag_retriever = get_rag_system()

In [11]:
def clean_memory():
    gc.collect()
    torch.cuda.empty_cache()

In [12]:
results = []
configs = [
    # Model Path, Alias, Mode
    ("unsloth/Llama-3.2-1B-Instruct", "Llama-1B", "Base"),
    ("farazahmad2004/NLP-Medical-Chatbot-Llama-1B", "Llama-1B", "FT"),
    ("farazahmad2004/NLP-Medical-Chatbot-Llama-1B", "Llama-1B", "RAG"),

    ("unsloth/Llama-3.2-3B-Instruct", "Llama-3B", "Base"),
    ("farazahmad2004/NLP-Medical-Chatbot-Llama-3B", "Llama-3B", "FT"),
    ("farazahmad2004/NLP-Medical-Chatbot-Llama-3B", "Llama-3B", "RAG"),

    ("unsloth/Qwen2.5-7B-Instruct-bnb-4bit", "Qwen-7B", "Base"),
    ("farazahmad2004/NLP-Medical-Chatbot-Qwen-7B", "Qwen-7B", "FT"),
    ("farazahmad2004/NLP-Medical-Chatbot-Qwen-7B", "Qwen-7B", "RAG"),
]
# We group by Model to avoid reloading heavy weights
current_model_path = ""
model = None
tokenizer = None

for model_path, alias, mode in configs:
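    if model_path != current_model_path:
        # Drop the previous model first so that clearing the CUDA cache can actually release its memory
        if model is not None:
            del model, tokenizer
        clean_memory()
        print(f"--- Loading {alias} ({model_path}) ---")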
        model, tokenizer = FastLanguageModel.from_pretrained(
            model_name = model_path,
            max_seq_length = 2048,
            dtype = None,
            load_in_4bit = True,
        )
        FastLanguageModel.for_inference(model)
        current_model_path = model_path
    print(f"Generating for {alias} [{mode}]...")
    for q in tqdm(queries):
        if mode == "RAG":
            docs = rag_retriever.invoke(q)
            context = docs[0].page_content
            prompt = f"""You are an expert Medical AI Assistant.
Task: Answer the user's question using ONLY the provided Context.
Rules: Act as a professional Doctor. Do NOT talk about yourself. Speak in clear Roman Urdu.

CONTEXT:
{context}

USER QUESTION: {q}

ANSWER (in Roman Urdu):"""

        elif mode == "Base":
            prompt = f"User Question: {q}\nAnswer this medical question in Roman Urdu."

        else: # FT (No RAG)
            prompt = f"User Question: {q}\nAnswer this as a Doctor in Roman Urdu. Do not tell stories about yourself."

        messages = [{"role": "user", "content": prompt}]
        inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cuda")

        outputs = model.generate(input_ids=inputs, max_new_tokens=200, use_cache=True, temperature=0.3)
        response = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True).strip()

        results.append({
            "Model": alias,
            "Mode": mode,
            "Query": q,
            "Response": response,
            "Human_Rating": ""
        })

--- Loading Llama-1B (unsloth/Llama-3.2-1B-Instruct) ---
==((====))==  Unsloth 2025.12.8: Fast Llama patching. Transformers: 4.57.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.1+cu128. CUDA: 7.5. CUDA Toolkit: 12.8. Triton: 3.5.1
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.33.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Generating for Llama-1B [Base]...


The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
100%|██████████| 10/10 [00:52<00:00,  5.29s/it]


--- Loading Llama-1B (farazahmad2004/NLP-Medical-Chatbot-Llama-1B) ---


Unsloth 2025.12.8 patched 16 layers with 16 QKV layers, 16 O layers and 16 MLP layers.


Generating for Llama-1B [FT]...


100%|██████████| 10/10 [00:59<00:00,  5.93s/it]


Generating for Llama-1B [RAG]...


100%|██████████| 10/10 [00:46<00:00,  4.67s/it]


--- Loading Llama-3B (unsloth/Llama-3.2-3B-Instruct) ---


Generating for Llama-3B [Base]...


100%|██████████| 10/10 [01:14<00:00,  7.43s/it]


--- Loading Llama-3B (farazahmad2004/NLP-Medical-Chatbot-Llama-3B) ---


Unsloth 2025.12.8 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.


Generating for Llama-3B [FT]...


100%|██████████| 10/10 [00:54<00:00,  5.43s/it]


Generating for Llama-3B [RAG]...


100%|██████████| 10/10 [00:41<00:00,  4.16s/it]


--- Loading Qwen-7B (unsloth/Qwen2.5-7B-Instruct-bnb-4bit) ---
==((====))==  Unsloth 2025.12.8: Fast Qwen2 patching. Transformers: 4.57.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.1+cu128. CUDA: 7.5. CUDA Toolkit: 12.8. Triton: 3.5.1
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.33.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Generating for Qwen-7B [Base]...


100%|██████████| 10/10 [01:47<00:00, 10.79s/it]


--- Loading Qwen-7B (farazahmad2004/NLP-Medical-Chatbot-Qwen-7B) ---


Generating for Qwen-7B [FT]...


100%|██████████| 10/10 [01:06<00:00,  6.63s/it]


Generating for Qwen-7B [RAG]...


100%|██████████| 10/10 [01:05<00:00,  6.59s/it]


In [13]:
df_human = pd.DataFrame(results)
df_human.to_csv("human_eval_results.csv", index=False)
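
Once the `Human_Rating` column of `human_eval_results.csv` has been filled in by hand, the ratings could be aggregated per model and mode. The following is a minimal sketch only: the 1-5 scale, the sample values, and the `summarize_ratings` helper are assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd

# Hypothetical ratings on an assumed 1-5 scale. In practice this frame would
# come from pd.read_csv("human_eval_results.csv") after manual rating.
rated = pd.DataFrame({
    "Model": ["Llama-1B", "Llama-1B", "Llama-1B", "Llama-1B"],
    "Mode": ["Base", "Base", "RAG", "RAG"],
    "Human_Rating": ["2", "3", "4", ""],  # empty string = not yet rated
})

def summarize_ratings(df: pd.DataFrame) -> pd.DataFrame:
    """Mean rating and number of rated rows per (Model, Mode)."""
    out = df.copy()
    # Blank cells become NaN so they are skipped instead of being counted as 0.
    out["Human_Rating"] = pd.to_numeric(out["Human_Rating"], errors="coerce")
    return (
        out.groupby(["Model", "Mode"])["Human_Rating"]
        .agg(mean_rating="mean", n_rated="count")
        .reset_index()
    )

summary = summarize_ratings(rated)
print(summary)
```

Reporting `n_rated` next to the mean makes it visible when a (Model, Mode) pair still has unrated responses, which matters with only 10 queries per configuration.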