# Medical Chatbot

We use three models with different parameter sizes:
1. `llama-3.2-1B-Instruct`
2. `llama-3.2-3B-Instruct`
3. `qwen-2.5-7B`

Each model is fine-tuned on our custom curated dataset using the Unsloth library with LoRA.

This notebook covers only Model 3 (`qwen-2.5-7B`).

## Setup

In [29]:
%%capture
!pip install unsloth
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

In [30]:
from unsloth import FastLanguageModel
import torch
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported
from unsloth.chat_templates import get_chat_template

In [32]:
import os
# Force usage of only the first GPU
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

### Getting our Dataset from GitHub:

In [33]:
!wget https://raw.githubusercontent.com/farazahmad2004/NLP_Project_Medical_Chatbot/main/datasets/Roman_Urdu/jsonl/train_dataset.jsonl
!wget https://raw.githubusercontent.com/farazahmad2004/NLP_Project_Medical_Chatbot/main/datasets/Roman_Urdu/jsonl/val_dataset.jsonl

--2025-12-14 20:19:53--  https://raw.githubusercontent.com/farazahmad2004/NLP_Project_Medical_Chatbot/main/datasets/Roman_Urdu/jsonl/train_dataset.jsonl
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 20532209 (20M) [text/plain]
Saving to: ‘train_dataset.jsonl.2’


2025-12-14 20:19:54 (219 MB/s) - ‘train_dataset.jsonl.2’ saved [20532209/20532209]

--2025-12-14 20:19:54--  https://raw.githubusercontent.com/farazahmad2004/NLP_Project_Medical_Chatbot/main/datasets/Roman_Urdu/jsonl/val_dataset.jsonl
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 20

## Model 3 (`qwen-2.5-7B`) Setup & Tokenizer:

In [34]:
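max_seq_length = 2048  # maximum number of tokens per training example
dtype = None           # None lets Unsloth auto-detect (float16 on T4/V100, bfloat16 on Ampere+)
load_in_4bit = True    # load the 4-bit quantized weights to reduce GPU memory use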

# 1. Load Qwen-2.5-7B
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen2.5-7B-Instruct-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

# 2. Add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)

==((====))==  Unsloth 2025.12.5: Fast Qwen2 patching. Transformers: 4.57.3.
   \\   /|    Tesla T4. Num GPUs = 2. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.1+cu128. CUDA: 7.5. CUDA Toolkit: 12.8. Triton: 3.5.1
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.33.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
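
As a sanity check on the adapter configuration above, the number of trainable LoRA parameters can be worked out by hand: each adapted linear layer with input size `d_in` and output size `d_out` adds `r * (d_in + d_out)` weights. The sketch below assumes the published Qwen2.5-7B dimensions (hidden size 3584, MLP size 18944, 28 layers, 4 key/value heads of size 128), none of which are stated in this notebook. Under those assumptions it reproduces the 40,370,176 trainable parameters that Unsloth reports when training starts.

```python
# Assumed Qwen2.5-7B dimensions (from the published model config, not from this notebook).
HIDDEN = 3584
INTERMEDIATE = 18944
NUM_LAYERS = 28
KV_DIM = 4 * 128  # 4 key/value heads x head size 128 (grouped-query attention)

R = 16  # LoRA rank, as passed to get_peft_model

# (d_in, d_out) for every target module in one decoder layer.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_param_count(shapes, rank, num_layers):
    """LoRA adds A (rank x d_in) and B (d_out x rank) for each adapted layer."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers

trainable = lora_param_count(TARGET_SHAPES, R, NUM_LAYERS)
print(f"{trainable:,}")  # 40,370,176
```

Because the count scales linearly with `r`, doubling the rank to 32 would double the adapter size while leaving the frozen base weights untouched.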


### Format Dataset:

In [35]:
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "qwen-2.5",
)

def formatting_prompts_func(examples):
    # Render each conversation (a list of role/content dicts) into a single training string.
    convos = examples["messages"]
    texts = [
        tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False)
        for convo in convos
    ]
    return { "text" : texts }

dataset = load_dataset("json", data_files="train_dataset.jsonl", split="train")
dataset = dataset.map(formatting_prompts_func, batched = True,)
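
For reference, the `qwen-2.5` template renders conversations in the ChatML layout, which is the same layout visible in the decoded inference outputs later in this notebook. The hand-written function below is only an illustration of that layout under simplifying assumptions: it ignores tool calls and does not inject the default system prompt that the real template adds when none is given. `render_chatml` and the sample conversation are hypothetical; the actual formatting should always come from `tokenizer.apply_chat_template`.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Approximate the ChatML layout: one <|im_start|>role ... <|im_end|> block per turn."""
    text = ""
    for message in messages:
        text += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    if add_generation_prompt:
        # At inference time the prompt ends with an open assistant turn.
        text += "<|im_start|>assistant\n"
    return text

# Illustrative conversation, not taken from the dataset.
example = [
    {"role": "user", "content": "sample question"},
    {"role": "assistant", "content": "sample answer"},
]
print(render_chatml(example))
```

Training uses `add_generation_prompt = False` because each example already contains the assistant turn; the open assistant header is only needed when the model is asked to generate.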

## Training:

In [36]:
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 2600,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs_qwen",
        report_to = "none",
    ),
)
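
With `warmup_steps = 5`, `max_steps = 2600` and `lr_scheduler_type = "linear"`, the learning rate ramps up to `2e-4` over the first 5 steps and then decays linearly to zero at the final step. A minimal sketch of that schedule, written as a standalone function rather than the scheduler object the trainer builds internally (the name `linear_lr` is hypothetical):

```python
def linear_lr(step, base_lr=2e-4, warmup_steps=5, total_steps=2600):
    """Learning rate at a given optimizer step: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

for s in (0, 5, 1300, 2600):
    print(s, f"{linear_lr(s):.2e}")
```

The warmup here is very short relative to the run, so almost all of training happens on the decaying part of the schedule.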

In [37]:
trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 2
   \\   /|    Num examples = 21,150 | Num Epochs = 1 | Total steps = 2,600
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 40,370,176 of 7,655,986,688 (0.53% trained)


| Step | Training Loss |
|------|---------------|
| 1 | 3.8273 |
| 2 | 3.8353 |
| 3 | 3.8019 |
| 4 | 3.4103 |
| 5 | 3.5989 |
| 6 | 3.2295 |
| 7 | 2.8182 |
| 8 | 2.8237 |
| 9 | 2.9272 |
| 10 | 2.7737 |


TrainOutput(global_step=2600, training_loss=1.601755894468381, metrics={'train_runtime': 25735.8972, 'train_samples_per_second': 0.808, 'train_steps_per_second': 0.101, 'total_flos': 3.3689398424766874e+17, 'train_loss': 1.601755894468381, 'epoch': 0.983451536643026})
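
The `epoch` value in the training output follows from the numbers in the banner: 2,600 steps at an effective batch size of 8 cover 20,800 of the 21,150 examples, so training stops just short of one full epoch. A quick check of that arithmetic, together with the wall-clock time and throughput:

```python
num_examples = 21_150
per_device_batch = 2
grad_accum = 4
max_steps = 2_600
train_runtime_s = 25_735.8972

effective_batch = per_device_batch * grad_accum   # 8 sequences per optimizer step
samples_seen = effective_batch * max_steps        # 20,800
epochs = samples_seen / num_examples

print(f"epochs: {epochs:.4f}")                             # epochs: 0.9835
print(f"hours: {train_runtime_s / 3600:.2f}")              # hours: 7.15
print(f"samples/s: {samples_seen / train_runtime_s:.3f}")  # samples/s: 0.808
```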

## Inference:

In [39]:
FastLanguageModel.for_inference(model)
messages = [
    {"role": "user", "content": "mujhe bukhar hai aur sardi lag rahi hai"}
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True,
    return_tensors = "pt",
).to("cuda")
outputs = model.generate(input_ids = inputs, max_new_tokens = 128, use_cache = True, temperature=0.6, min_p=0.1, repetition_penalty=1.2)
print(tokenizer.batch_decode(outputs))

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


['<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\nmujhe bukhar hai aur sardi lag rahi hai<|im_end|>\n<|im_start|>assistant\nsirf ek blood test karwa lo taake kisi infection ka pata chal sake, shayad yeh viral ho ya bacterial bhi ho sakta hai, agar dard kam na ho to anti-inflammatory tablets lein, garam paani mein lemon piyein, thandi nahi peene ki koshish karein kyunki is se jism ko zyada wapas aaana padega jo bohot mushkil hota hai, khushriyan parhain ke sardi kab tak rehti hain <link> umeed hai yahan madad milegi']
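
The `min_p = 0.1` argument in the call above limits sampling to tokens whose probability is at least 10% of the most likely token's probability. The sketch below illustrates that rule on a toy list of logits in plain Python; it is not the library's implementation, and `min_p_filter` is a hypothetical name.

```python
import math

def min_p_filter(logits, min_p=0.1):
    """Keep tokens whose probability is at least min_p times the top probability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    kept_total = sum(kept)
    return [p / kept_total for p in kept]        # renormalize over surviving tokens

# Toy logits for a three-token vocabulary; the third token falls below the threshold.
print(min_p_filter([2.0, 1.0, -3.0]))
```

Because the threshold scales with the top probability, the filter is strict when the model is confident and permissive when the distribution is flat.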


In [40]:
FastLanguageModel.for_inference(model)

messages = [
    {"role": "user", "content": "meray galey mein kaafi dard ho raha hai, aur cheenkein bhe aa rahi hein, mujhe kia karna chahye?"}
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True,
    return_tensors = "pt",
).to("cuda")
outputs = model.generate(input_ids = inputs, max_new_tokens = 256, use_cache = True, temperature=0.6, min_p=0.1, repetition_penalty=1.2)
print(tokenizer.batch_decode(outputs))

["<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\nmeray galey mein kaafi dard ho raha hai, aur cheenkein bhe aa rahi hein, mujhe kia karna chahye?<|im_end|>\n<|im_start|>assistant\nagar yeh pehli baar hua hai to shayad ek kidney stone ya urinary tract infection ki wajah se ho sakta hai, lekin agar iska sabab jaan'na zaroori nahi hai to behtar hai ke aap apne primary care provider se milain taake in asraat ko theek kar sakein jab tak maine aapki medical history dekhi hoti hai jo online nahi milti, main aapko salah dunga ke aap jald se jald apne doctor se milain. Achi kismet!<|im_end|>"]


# Saving the model:

## On HuggingFace:

In [None]:
from huggingface_hub import login
# login(token="")  # add your token here
repo_name = "farazahmad2004/NLP-Medical-Chatbot-Qwen-7B"
model.push_to_hub(repo_name, token=True)
tokenizer.push_to_hub(repo_name, token=True)
print(f"Model pushed to: https://huggingface.co/{repo_name}")

Saved model to https://huggingface.co/farazahmad2004/NLP-Medical-Chatbot-Qwen-7B


Model pushed to: https://huggingface.co/farazahmad2004/NLP-Medical-Chatbot-Qwen-7B


# Getting the model from HF:

In [None]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "farazahmad2004/NLP-Medical-Chatbot-Qwen-7B",
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
)
FastLanguageModel.for_inference(model)

==((====))==  Unsloth 2025.12.5: Fast Llama patching. Transformers: 4.57.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.1+cu128. CUDA: 7.5. CUDA Toolkit: 12.8. Triton: 3.5.1
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.33.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Unsloth 2025.12.5 patched 16 layers with 16 QKV layers, 16 O layers and 16 MLP layers.


PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(128256, 2048, padding_idx=128004)
        (layers): ModuleList(
          (0): LlamaDecoderLayer(
            (self_attn): LlamaAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=2048, out_features=2048, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=2048, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=2048, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (k_proj): lora.Linear

# Evaluation (initial, on 50 validation samples only)

## Evaluation Setup:



In [41]:
!pip install evaluate rouge_score bert_score

Collecting evaluate
  Downloading evaluate-0.4.6-py3-none-any.whl.metadata (9.5 kB)
Collecting bert_score
  Downloading bert_score-0.3.13-py3-none-any.whl.metadata (15 kB)
Downloading evaluate-0.4.6-py3-none-any.whl (84 kB)
Downloading bert_score-0.3.13-py3-none-any.whl (61 kB)
Installing collected packages: evaluate, bert_score
Successfully installed bert_score-0.3.13 evaluate-0.4.6


In [44]:
import evaluate
from tqdm import tqdm

## Validation Dataset:

In [50]:
VAL_FILE = "val_dataset.jsonl"
NUM_SAMPLES = 50

dataset = load_dataset("json", data_files=VAL_FILE, split="train")
if NUM_SAMPLES > 0:
    dataset = dataset.shuffle(seed=42).select(range(NUM_SAMPLES))
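
The evaluation loop later in this section reads the question and reference answer by position (`messages[1]` and `messages[2]`), which presumably relies on every row starting with a system message. A small sketch of a role-based lookup that does not depend on that ordering; the helper name `first_content` and the sample row are hypothetical:

```python
def first_content(messages, role):
    """Return the content of the first message with the given role, or None."""
    for message in messages:
        if message.get("role") == role:
            return message.get("content")
    return None

# Hypothetical row in the same shape the notebook indexes into.
row = {
    "messages": [
        {"role": "system", "content": "..."},
        {"role": "user", "content": "sample question"},
        {"role": "assistant", "content": "sample answer"},
    ]
}
user_q = first_content(row["messages"], "user")
true_a = first_content(row["messages"], "assistant")
print(user_q, "|", true_a)
```

Returning `None` for a missing role makes malformed rows easy to skip before generation starts.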

In [51]:
rouge = evaluate.load('rouge')
bertscore = evaluate.load('bertscore')
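
To make the ROUGE-L number easier to interpret, here is a minimal sketch of what the metric measures: the longest common subsequence (LCS) of tokens shared by prediction and reference, turned into precision, recall and F1. It assumes plain lowercase whitespace tokenization, whereas the package behind `evaluate.load('rouge')` uses its own tokenizer, so values may differ slightly from the library's. The function names and the sample strings are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for token_a in a:
        curr = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(prediction, reference):
    """ROUGE-L F1 from LCS-based precision and recall (whitespace tokens)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Illustrative strings, not taken from the dataset.
print(rouge_l_f1("aap doctor se milain", "aap apne doctor se jald milain"))
```

Since the score depends on shared word order, two answers that give the same advice in different wording can still score low, which is one reason it is paired with BERTScore here.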

In [52]:
def generate_response(question):
    messages = [{"role": "user", "content": question}]
    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt"
    ).to("cuda")

    outputs = model.generate(
        input_ids=inputs,
        max_new_tokens=128,
        use_cache=True,
        temperature=0.6,
        repetition_penalty=1.2
    )
    generated_tokens = outputs[0][inputs.shape[1]:]
    decoded_response = tokenizer.decode(generated_tokens, skip_special_tokens=True)
    return decoded_response.strip()

In [53]:
print(f"Generating answers for {len(dataset)} samples...")
generated_answers = []
ground_truths = []
for row in tqdm(dataset):
    user_q = row['messages'][1]['content']
    true_a = row['messages'][2]['content']
    pred_a = generate_response(user_q)
    generated_answers.append(pred_a)
    ground_truths.append(true_a)

print("\n--- Calculating Metrics ---")
# ROUGE-L: overlap based on the longest common subsequence of prediction and reference
rouge_results = rouge.compute(predictions=generated_answers, references=ground_truths)
print(f"ROUGE-L: {rouge_results['rougeL']:.4f}")
# BERTScore: lang="en" selects the default English scoring model (roberta-large)
bert_results = bertscore.compute(predictions=generated_answers, references=ground_truths, lang="en")
print(f"BERTScore (F1): {sum(bert_results['f1']) / len(bert_results['f1']):.4f}")
print("\n--- Example Outputs ---")
for i in range(5):
    print(f"Q: {dataset[i]['messages'][1]['content']}")
    print(f"True: {ground_truths[i]}")
    print(f"Pred: {generated_answers[i]}")
    print("-" * 30)

Generating answers for 50 samples...


100%|██████████| 50/50 [06:37<00:00,  7.95s/it]



--- Calculating Metrics ---
ROUGE-L: 0.1201


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BERTScore (F1): 0.8414

--- Example Outputs ---
Q: kya aap meri madad kar sakte hain samajhne mein ke mera boyfriend kyun ejaculate nahi kar pa raha
True: shayad usay retrograde ejaculation hai jahan sperm uski bladder mein wapas chalay jate hain ya shayad wo sirf urethra mein fluid nahi daal raha. hum yeh kuch surgeries ya kuch dawaiyon ke baad dekhte hain lekin yeh ek sehatmand aadmi ke liye bina kisi wajah ke aam nahi hai. mujhe lagta hai usay ek urologist se milna chahiye taake yeh confirm ho sake ke sab theek hai, yeh zaroori hai.
Pred: hi yeh sirf ek mumkinat hai agar usay prostate masla ya infection ho jo dard aur pressure ka sabab banta hai to is se ejaculation bohot mushkil hota hai jab tak wo apni medical history ko doctor se discuss na karein taake sahi diagnosis mil sake lekin shayad ye hi wajah hai umeed hai yeh aap dono ki madad karega good luck
------------------------------
Q: kya aap gandi haath se khud ko pochhne se pregnant ho sakti hain? mujhe apne haath par prec*m 

## Human Evaluation:

In [54]:
symptoms_questions = [
    "Mujhe subah se sar mein dard ho raha hai, koi dawai bataen?",
    "Mere bachay ko tez bukhar hai aur wo kuch kha nahi raha.",
    "Mujhe khansi ke sath balgham aa raha hai, kya karoon?",
    "Pait mein mror uth rahe hain aur loose motion lage hain.",
    "Mere daant mein shadeed dard hai, dentist ke paas janay tak kya karoon?",
    "Mujhe saans lene mein dushwari ho rahi hai jab main sidhiyan charhta hoon.",
    "Meri aankhon mein jalan aur pani aa raha hai.",
    "Kaan mein dard hai, shayed paani chala gaya hai.",
    "Jism par lal nishan par gaye hain aur kharish ho rahi hai.",
    "Mujhe ulti jaisa mehsoos ho raha hai (nausea). kia karun?"
]
for question in symptoms_questions:
    print(f"Q: {question}")
    print(f"A: {generate_response(question=question)}")
    print("-"*70)

Q: Mujhe subah se sar mein dard ho raha hai, koi dawai bataen?
A: Aapko apne doctor ya neurologist ke paas jana chahiye taake iske asal sabab ka pata lag sakein aur sahi ilaj kiya ja sake. Dard ki dawaain sirf temporary rahat deti hain lekin inka istemal zaroori nahi hota jab tak theek diagnosis na milta rahe.
----------------------------------------------------------------------
Q: Mere bachay ko tez bukhar hai aur wo kuch kha nahi raha.
A: Yeh shayad viral infection ki wajah se ho sakta hai lekin agar yeh ek hafta tak chalna band hota hai to mujhe lagta hai ke uska immune system bura kaam kar raha tha ya phir iski zyada tar mumkinat bacterial infection hain jo antibiotics lene par theek hogi. Aapko apne doctor se mashwara lena chahiye taake sahi taur par jaanch ki jaye kyunki internet par diagnosis zaroori tor par bekaar samjha ja sakta hai. Shuk
----------------------------------------------------------------------
Q: Mujhe khansi ke sath balgham aa raha hai, kya karoon?
A: Aapko ek