# Lab 3: Finetune Llama 3.2 on Medical Dataset

This notebook implements finetuning of Llama 3.2 on a medical dataset using Hugging Face Transformers and PEFT (Parameter-Efficient Fine-Tuning).



## Step 1: Install Required Libraries

In [1]:

%pip install -q transformers datasets peft accelerate bitsandbytes trl torch

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m59.4/59.4 MB[0m [31m14.4 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m465.5/465.5 kB[0m [31m41.7 MB/s[0m eta [36m0:00:00[0m
[?25h

## Step 2: Import Libraries and Set Up


In [2]:
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling
)
from peft import LoraConfig, get_peft_model, TaskType
from datasets import load_dataset
import json
import os
from datetime import datetime
import re

if torch.backends.mps.is_available():
    device = torch.device("mps")
    print(f"✅ Using device: {device} (Apple Silicon)")
elif torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"✅ Using device: {device}")
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"CUDA Version: {torch.version.cuda}")
else:
    device = torch.device("cpu")
    print(f"⚠️ Using device: {device} (CPU - training will be slower)")
    print("Note: If you have a GPU, make sure CUDA is properly installed.")

✅ Using device: cuda
GPU: Tesla T4
CUDA Version: 12.6


## Step 3: Load and Prepare Dataset

We'll use the MedMCQA dataset for medical question answering.

In [3]:
# Load the MedMCQA dataset
dataset = load_dataset("openlifescienceai/medmcqa", split="train")

print(f"Dataset size: {len(dataset)}")
print(f"Dataset features: {dataset.features}")
print("\nSample example:")
print(dataset[0])


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md: 0.00B [00:00, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/85.9M [00:00<?, ?B/s]

data/test-00000-of-00001.parquet:   0%|          | 0.00/936k [00:00<?, ?B/s]

data/validation-00000-of-00001.parquet:   0%|          | 0.00/1.48M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/182822 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/6150 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/4183 [00:00<?, ? examples/s]

Dataset size: 182822
Dataset features: {'id': Value('string'), 'question': Value('string'), 'opa': Value('string'), 'opb': Value('string'), 'opc': Value('string'), 'opd': Value('string'), 'cop': ClassLabel(names=['a', 'b', 'c', 'd']), 'choice_type': Value('string'), 'exp': Value('string'), 'subject_name': Value('string'), 'topic_name': Value('string')}

Sample example:
{'id': 'e9ad821a-c438-4965-9f77-760819dfa155', 'question': 'Chronic urethral obstruction due to benign prismatic hyperplasia can lead to the following change in kidney parenchyma', 'opa': 'Hyperplasia', 'opb': 'Hyperophy', 'opc': 'Atrophy', 'opd': 'Dyplasia', 'cop': 2, 'choice_type': 'single', 'exp': 'Chronic urethral obstruction because of urinary calculi, prostatic hyperophy, tumors, normal pregnancy, tumors, uterine prolapse or functional disorders cause hydronephrosis which by definition is used to describe dilatation of renal pelvis and calculus associated with progressive atrophy of the kidney due to obstruction to

## Step 4: Format Dataset for Instruction Tuning

We need to format the dataset in a way that the model can learn from. We'll create a prompt template.

In [45]:
def format_instruction(example):
    """Format medical Q&A into instruction format"""
    question = example.get("question", "")
    # Extract options from 'opa', 'opb', 'opc', 'opd'
    options = {}
    if 'opa' in example and example['opa'] is not None: options['a'] = example['opa']
    if 'opb' in example and example['opb'] is not None: options['b'] = example['opb']
    if 'opc' in example and example['opc'] is not None: options['c'] = example['opc']
    if 'opd' in example and example['opd'] is not None: options['d'] = example['opd']

    cop_idx = example.get("cop")  # Correct option index (0 for 'a', 1 for 'b', etc.)

    # Build options list
    # Ensure options are sorted by key (a, b, c, d)
    sorted_option_keys = sorted(options.keys())
    options_text = "\n".join([f"{k.upper()}. {options[k]}" for k in sorted_option_keys])

    # Get correct answer text
    correct_answer_text = ""
    if cop_idx is not None and 0 <= cop_idx < len(sorted_option_keys):
        correct_answer_key = sorted_option_keys[cop_idx] # 'a', 'b', 'c', or 'd'
        correct_answer_text = options[correct_answer_key]

    # Create prompt
    prompt = f"""You are a medical expert. Answer the following medical question.

Question: {question}

Options:
{options_text}

Answer: {correct_answer_text}"""

    # Return 'text' for training, and also original question, options, and cop for later evaluation
    return {"text": prompt, "original_question": question, "original_options": options, "original_cop": cop_idx}

# Format dataset (using subset for faster training)
dataset_size = min(2000, len(dataset))
dataset_subset_for_formatting = dataset.select(range(dataset_size)) # Explicitly define the subset used for formatting

formatted_dataset = dataset_subset_for_formatting.map(
    format_instruction,
    # Temporarily remove remove_columns to ensure all new columns are added
)

print(f"Formatted dataset features: {formatted_dataset.features}")

Map:   0%|          | 0/2000 [00:00<?, ? examples/s]

Formatted dataset features: {'id': Value('string'), 'question': Value('string'), 'opa': Value('string'), 'opb': Value('string'), 'opc': Value('string'), 'opd': Value('string'), 'cop': ClassLabel(names=['a', 'b', 'c', 'd']), 'choice_type': Value('string'), 'exp': Value('string'), 'subject_name': Value('string'), 'topic_name': Value('string'), 'text': Value('string'), 'original_question': Value('string'), 'original_options': {'a': Value('string'), 'b': Value('string'), 'c': Value('string'), 'd': Value('string')}, 'original_cop': Value('int64')}


## Local Inference on GPU
Model page: https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct

⚠️ If the generated code snippets do not work, please open an issue on either the [model repo](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct)
			and/or on [huggingface.js](https://github.com/huggingface/huggingface.js/blob/main/packages/tasks/src/model-libraries-snippets.ts) 🙏

The model you are trying to use is gated. Please make sure you have access to it by visiting the model page.To run inference, either set HF_TOKEN in your environment variables/ Secrets or run the following cell to login. 🤗

In [None]:
from huggingface_hub import login
login(new_session=False)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [34]:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="meta-llama/Llama-3.2-1B-Instruct")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

Device set to use cuda:0
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


[{'generated_text': [{'role': 'user', 'content': 'Who are you?'},
   {'role': 'assistant',
    'content': 'I\'m an artificial intelligence model known as Llama. Llama stands for "Large Language Model Meta AI."'}]}]

In [35]:
# Model configuration
model_name = "meta-llama/Llama-3.2-1B-Instruct"


# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# Load model
print("\nLoading model...")
# MPS doesn't support float16 well, use float32. CUDA can use float16 for efficiency
if torch.backends.mps.is_available():
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float32,
        device_map=None,  # MPS doesn't support device_map="auto"
        trust_remote_code=True
    )
    model = model.to(device)  # Manually move to MPS device
elif torch.cuda.is_available():
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16,
        device_map="auto",
        trust_remote_code=True
    )
else:
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float32,
        device_map=None,
        trust_remote_code=True
    )
    model = model.to(device)

print(f"Model loaded: {model_name}")
print(f"Model parameters: {sum(p.numel() for p in model.parameters())/1e9:.2f}B")


Loading model...
Model loaded: meta-llama/Llama-3.2-1B-Instruct
Model parameters: 1.24B


## Remote Inference via Inference Providers
Ensure you have a valid **HF_TOKEN** set in your environment. You can get your token from [your settings page](https://huggingface.co/settings/tokens). Note: running this may incur charges above the free tier.
The following Python example shows how to run the model remotely on HF Inference Providers, automatically selecting an available inference provider for you.
For more information on how to use the Inference Providers, please refer to our [documentation and guides](https://huggingface.co/docs/inference-providers/en/index).

In [10]:
import os
from google.colab import userdata

# Retrieve the HF_TOKEN from Colab secrets
os.environ['HF_TOKEN'] = userdata.get('HF_TOKEN')

In [11]:
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://router.huggingface.co/v1",
    api_key=os.environ["HF_TOKEN"],
)

completion = client.chat.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",
    messages=[
        {
            "role": "user",
            "content": "What is the capital of France?"
        }
    ],
)

print(completion.choices[0].message)

ChatCompletionMessage(content='The capital of France is Paris.', refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=None)


## Step 6: Configure LoRA for Parameter-Efficient Fine-Tuning

In [46]:
# LoRA configuration
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,  # Rank
    lora_alpha=32,  # LoRA alpha
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    bias="none",
)

# Apply LoRA to model
model = get_peft_model(model, lora_config)

# Enable input gradients for gradient checkpointing with PEFT
# This is crucial when using gradient_checkpointing with PEFT models
# to ensure the backward pass works correctly through frozen layers.
if training_args.gradient_checkpointing:
    model.enable_input_require_grads()

model.print_trainable_parameters()

print("\nLoRA configuration applied successfully!")

trainable params: 11,272,192 || all params: 1,247,086,592 || trainable%: 0.9039

LoRA configuration applied successfully!


## Step 7: Tokenize Dataset


In [47]:
def tokenize_function(examples):
    """Tokenize the text examples"""
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=256,
        padding="max_length",
    )

# Define columns to remove after formatting and tokenization
# Keep 'input_ids', 'attention_mask', 'original_question', 'original_options', 'original_cop'
columns_to_keep = ['input_ids', 'attention_mask', 'original_question', 'original_options', 'original_cop']

# Get all columns in the formatted_dataset
all_formatted_columns = formatted_dataset.column_names

# Identify columns to remove: all columns except those we want to keep
columns_to_remove_after_tokenization = [col for col in all_formatted_columns if col not in columns_to_keep]

# Tokenize dataset
tokenized_dataset = formatted_dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=columns_to_remove_after_tokenization,
)

print(f"Tokenized dataset size: {len(tokenized_dataset)}")
print(f"Sample tokenized example keys: {tokenized_dataset[0].keys()}")
print(f"Tokenized dataset features: {tokenized_dataset.features}")

Map:   0%|          | 0/2000 [00:00<?, ? examples/s]

Tokenized dataset size: 2000
Sample tokenized example keys: dict_keys(['original_question', 'original_options', 'original_cop', 'input_ids', 'attention_mask'])
Tokenized dataset features: {'original_question': Value('string'), 'original_options': {'a': Value('string'), 'b': Value('string'), 'c': Value('string'), 'd': Value('string')}, 'original_cop': Value('int64'), 'input_ids': List(Value('int32')), 'attention_mask': List(Value('int8'))}


## Step 8: Split Dataset into Train/Test

In [48]:
# Split dataset: 80% train, 20% test
split_dataset = tokenized_dataset.train_test_split(test_size=0.2, seed=42)
train_dataset = split_dataset["train"]
test_dataset = split_dataset["test"]

print(f"Training examples: {len(train_dataset)}")
print(f"Test examples: {len(test_dataset)}")

Training examples: 1600
Test examples: 400


## Step 9: Set Up Training Arguments

In [49]:
training_args = TrainingArguments(
    output_dir="./llama-medical-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=1,  # Reduced from 4 to 1 for MPS memory
    per_device_eval_batch_size=1,  # Reduced from 4 to 1 for MPS memory
    gradient_accumulation_steps=16,  # Increased from 4 to 16 to maintain effective batch size (1*16=16)
    warmup_steps=100,
    learning_rate=2e-4,
    fp16=torch.cuda.is_available(),  # Only enable for CUDA, not MPS
    gradient_checkpointing=True,  # Enable to save memory
    logging_steps=10,
    save_steps=500,
    eval_strategy="steps",  # Changed from evaluation_strategy to eval_strategy
    eval_steps=500,
    save_total_limit=3,
    load_best_model_at_end=True,
    report_to="none",
    push_to_hub=False,
)

print("Training arguments configured!")

Training arguments configured!


## Step 10: Create Data Collator

In [50]:
# Data collator for language modeling
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,  # We're doing causal LM, not masked LM
)

print("Data collator created!")


Data collator created!


## Step 11: Initialize Trainer and Start Training

In [51]:
# Initialize trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    data_collator=data_collator,
)

print("Trainer initialized. Starting training...")
print("This may take a while depending on your hardware...")


The model is already on multiple devices. Skipping the move to device specified in `args`.


Trainer initialized. Starting training...
This may take a while depending on your hardware...


In [28]:
# Start training
trainer.train()

Exception ignored in: <function tqdm.__del__ at 0x7fcfbf71dc60>
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/tqdm/std.py", line 1148, in __del__
    self.close()
  File "/usr/local/lib/python3.12/dist-packages/tqdm/notebook.py", line 279, in close
    self.disp(bar_style='danger', check_delay=False)
    ^^^^^^^^^
AttributeError: 'tqdm' object has no attribute 'disp'


Step,Training Loss,Validation Loss


TrainOutput(global_step=300, training_loss=1.3107888634999594, metrics={'train_runtime': 1237.4473, 'train_samples_per_second': 3.879, 'train_steps_per_second': 0.242, 'total_flos': 7257919271731200.0, 'train_loss': 1.3107888634999594, 'epoch': 3.0})

## Step 12: Save the Fine-Tuned Model

In [52]:
# Save the model
model.save_pretrained("./llama-medical-finetuned")
tokenizer.save_pretrained("./llama-medical-finetuned")

print("Model saved successfully!")


Model saved successfully!


## Step 13 : Evaluation, loading test examples

In [53]:
# load test examples from the split tokenized_dataset
test_examples = []
for i in range(len(test_dataset)):
    example = test_dataset[i]
    test_examples.append({
        "question": example["original_question"],
        "options": example["original_options"],
        "cop": example["original_cop"]
    })

print(f"Loaded {len(test_examples)} test examples")
print("\nSample test example:")
print(test_examples[0])


Loaded 400 test examples

Sample test example:
{'question': 'All the following are radiological features of Chronic Cor pulmonale except-', 'options': {'a': 'Kerley B lines', 'b': 'Prominent lower lobe vessels', 'c': 'Pleural effusion', 'd': 'Cardiomegaly'}, 'cop': 1}


## Step 14: Create Inference Function


In [54]:
def generate_answer(question, options, model, tokenizer, max_length=512):
    """Generate an answer for a medical question"""
    #format the prompt
    options_text = ""
    if options:
        for key in sorted(options.keys()):
            option_label = key.upper()
            option_text = options[key]
            options_text += f"{option_label}. {option_text}\n"

    prompt = f""" You are a medical expert.Answer the following medical question.

Question: {question}
Options:
{options_text.strip()}
Answer:"""

    # tokenize
    inputs =tokenizer(prompt, return_tensors="pt", truncation=True, max_length=max_length)
    inputs= {k: v.to(model.device) for k, v in inputs.items()}

    # generate
    with torch.no_grad() :
        outputs = model.generate(
            **inputs,
            max_new_tokens=100,
            temperature=0.7 ,
            do_sample=True ,
            pad_token_id=tokenizer.eos_token_id,
        )

    # decode
    generated_text =tokenizer.decode(outputs[0], skip_special_tokens=True)
    answer =generated_text.split("Answer:")[-1].strip()

    return answer

# test the function
sample_question =test_examples[0]["question"]
sample_options= test_examples[0]["options"]
sample_answer =generate_answer(sample_question, sample_options, model, tokenizer)
print("Sample generation :")
print(f"Question:{sample_question}")
print(f"Generated Answer:{sample_answer}")

Sample generation :
Question:All the following are radiological features of Chronic Cor pulmonale except-
Generated Answer:A
Explanation: Kerley B lines are seen in Pulmonary Emphysema, not Chronic Cor pulmonale. Chronic Cor pulmonale refers to the alteration of the structure and function of the right ventricle due to long-standing pulmonary hypertension, whereas Kerley B lines are indicative of interstitial edema. The other options are associated with Chronic Cor pulmonale. 

The best answer is A.


## Step 15: Implement Accuracy Checking


In [55]:
def check_answer_match(prediction,ground_truth, options):
    """Check if prediction matches ground truth (exact or partial)"""
    # get ground truth text
    if options and isinstance(ground_truth, int):
        option_keys = sorted(options.keys())
        if 0 <= ground_truth < len(option_keys):
            ground_truth_text = options[option_keys[ground_truth]]
        else:
            return False, "no_match"
    else:
        ground_truth_text = str(ground_truth)

    #normalize texts
    prediction_lower = prediction.lower().strip()
    ground_truth_lower = ground_truth_text.lower().strip()
    #check exact match
    if ground_truth_lower in prediction_lower or prediction_lower in ground_truth_lower:
        return True, "exact_match"

    #check keywords
    ground_truth_words = set(ground_truth_lower.split())
    prediction_words = set(prediction_lower.split())
    common_words = ground_truth_words.intersection(prediction_words)

    if len(common_words) >= len(ground_truth_words) * 0.5:
        return True, "partial_match"
    return False, "no_match"

#test the function
sample_cop = test_examples[0].get("cop")
if sample_cop is not None:
    match, match_type =check_answer_match(sample_answer, sample_cop, sample_options)
    print(f"Match: {match},Type:{match_type}")

Match: False,Type:no_match


## Step 16: Run Evaluation Loop

In [56]:
import time
from tqdm import tqdm

#evaluate on a subset of test examples
eval_size = min(20,len(test_examples))
eval_examples =test_examples[:eval_size]

results = {
    "exact_matches": 0 ,
    "partial_matches": 0,
    "incorrect": 0,
    "details": []
}


start_time = time.time()

for i, example in enumerate(tqdm(eval_examples,desc="Evaluating")):
    question = example.get("question", "")
    options = example.get("options", {})
    correct_answer_idx = example.get("cop", None)

    #generate answer
    prediction = generate_answer(question, options, model, tokenizer)

    #check match
    if correct_answer_idx is not None :
        is_match, match_type = check_answer_match(prediction, correct_answer_idx, options)

        # Get ground truth text
        option_keys = sorted(options.keys())
        if 0 <= correct_answer_idx < len(option_keys):
            ground_truth_text = options[option_keys[correct_answer_idx]]
        else:
            ground_truth_text = f"Option {correct_answer_idx}"

        if match_type == "exact_match":
            results["exact_matches"] += 1
            status = "CORRECT"
        elif match_type == "partial_match":
            results["partial_matches"] += 1
            status = "PARTIAL"
        else:
            results["incorrect"] += 1
            status = "INCORRECT "

        results["details"].append({
            "question": question[:100] + "..." if len(question) > 100 else question,
            "ground_truth": ground_truth_text,
            "prediction": prediction[:200] + "..." if len(prediction) > 200 else prediction,
            "match_type": match_type,
            "correct": is_match
        })

        print(f"\n{status}")
        print(f"Question: {question[:150]}...")
        print(f"Ground Truth: {ground_truth_text}")
        print(f"Prediction: {prediction[:150]}...")
        print("-" * 80)

total_time = time.time()- start_time
results["total_time"] = total_time
results["avg_time_per_example"] = total_time / eval_size

print(f"\nRunning accuracy: {(results['exact_matches'] +results['partial_matches']) / eval_size * 100:.1f}% ({results['exact_matches'] + results['partial_matches']}/{eval_size})")

Evaluating:   5%|▌         | 1/20 [00:04<01:29,  4.70s/it]


INCORRECT 
Question: All the following are radiological features of Chronic Cor pulmonale except-...
Ground Truth: Prominent lower lobe vessels
Prediction: C. Pleural effusion

Explanation: In chronic cor pulmonale, there is often pleural effusion due to the increased pressure in the lungs. This is a resu...
--------------------------------------------------------------------------------


Evaluating:  10%|█         | 2/20 [00:09<01:30,  5.01s/it]


INCORRECT 
Question: Lines of Blaschko&;s are along...
Ground Truth: Developmental
Prediction: C

Explanation: Lines of Blaschko are a set of linear patterns that are present on the skin and are thought to be caused by the migration of cells dur...
--------------------------------------------------------------------------------


Evaluating:  15%|█▌        | 3/20 [00:14<01:22,  4.87s/it]


CORRECT
Question: After mandibulectomy, muscle preventing falling back of tongue -...
Ground Truth: Hyoglossus
Prediction: D

Explanation: 
Mandibulectomy is a surgical procedure to remove part or all of the mandible (lower jawbone). This procedure can lead to muscle weakn...
--------------------------------------------------------------------------------


Evaluating:  20%|██        | 4/20 [00:18<01:12,  4.53s/it]


CORRECT
Question: Periosteal reaction in a case of acute osteomyelitis can be seen earliest at: March 2012...
Ground Truth: 10 days
Prediction: B
Explanation: Periosteal reaction is a characteristic feature of acute osteomyelitis. It is seen earliest at 10 days after the onset of the disease. ...
--------------------------------------------------------------------------------


Evaluating:  25%|██▌       | 5/20 [00:23<01:09,  4.63s/it]


CORRECT
Question: In sebaceous glands, accumulation of sebum leads to:...
Ground Truth: Acne
Prediction: B
Explanation: Accumulation of sebum in sebaceous glands leads to acne. Sebum is a lipid that is produced by the sebaceous glands. It is secreted into...
--------------------------------------------------------------------------------


Evaluating:  30%|███       | 6/20 [00:28<01:05,  4.67s/it]


PARTIAL
Question: Which statement best describes the cranial fossa?...
Ground Truth: The middle cranial fossa is floored by the sphenoid and temporal bones.
Prediction: D

Explanation: The internal acoustic meatus is a passage through the temporal bone, and it is located in the middle cranial fossa. The middle cranial...
--------------------------------------------------------------------------------


Evaluating:  35%|███▌      | 7/20 [00:33<01:03,  4.89s/it]


CORRECT
Question: A man is stuck with lathi on the lateral aspect of the head of the fibula. Which of the following can occur as a result of nerve injury...
Ground Truth: Loss of dorsiflexion
Prediction: A

Explanation: The lateral aspect of the fibula is the area innervated by the peroneal nerve, which is responsible for foot dorsiflexion. In the case...
--------------------------------------------------------------------------------


Evaluating:  40%|████      | 8/20 [00:38<00:58,  4.84s/it]


PARTIAL
Question: In which case cystometric study is indicated -...
Ground Truth: Neurogenic bladder
Prediction: B
Explanation: Cystometric study is a diagnostic procedure used to assess the storage and voiding function of the bladder. It is indicated in cases wh...
--------------------------------------------------------------------------------


Evaluating:  45%|████▌     | 9/20 [00:43<00:53,  4.88s/it]


CORRECT
Question: Placental abnormality related to PPH is?...
Ground Truth: All the above
Prediction: D. All the above

Explanation: Placental abnormality can lead to postpartum hemorrhage (PPH). 

- Placenta accreta: A condition where the placenta gro...
--------------------------------------------------------------------------------


Evaluating:  50%|█████     | 10/20 [00:48<00:49,  4.95s/it]


CORRECT
Question: All of the following causes coloured halos except...
Ground Truth: Retinal degeneration
Prediction: B

Explanation: Retinal degeneration is associated with coloured halos because it affects the light-sensitive photoreceptors in the retina. This can c...
--------------------------------------------------------------------------------


Evaluating:  55%|█████▌    | 11/20 [00:50<00:36,  4.02s/it]


PARTIAL
Question: Proposed guideline value for Radioactivity in drinking water is:...
Ground Truth: Gross a activity 0.1 Bq/L and Gross b activity 1.0 Bq/L
Prediction: A
Explanation: The proposed guideline value for radioactivity in drinking water is 0.1 Bq/L for gross alpha activity and 1.0 Bq/L for gross beta activ...
--------------------------------------------------------------------------------


Evaluating:  60%|██████    | 12/20 [00:55<00:34,  4.29s/it]


INCORRECT 
Question: What is the mechanism of action of Fluconazole?...
Ground Truth: Inhibits lanosterol 14 demethylase
Prediction: B

Explanation: Fluconazole is an azole antifungal drug. It works by inhibiting the enzyme lanosterol 14α-demethylase, which is involved in the synthe...
--------------------------------------------------------------------------------


Evaluating:  65%|██████▌   | 13/20 [01:00<00:31,  4.51s/it]


INCORRECT 
Question: Hemolytic uremic syndrome is caused by...
Ground Truth: Shigella
Prediction: A
Explanation: Hemolytic uremic syndrome (HUS) is a complex syndrome characterized by hemolytic anemia, acute kidney injury, and thrombocytopenia. It ...
--------------------------------------------------------------------------------


Evaluating:  70%|███████   | 14/20 [01:04<00:27,  4.57s/it]


CORRECT
Question: Hassal's corpuscles are seen in...
Ground Truth: Thymus
Prediction: A

Explanation: Hassal's corpuscles are a type of histological feature found in the thymus. They are a characteristic feature of the thymus and are in...
--------------------------------------------------------------------------------


Evaluating:  75%|███████▌  | 15/20 [01:10<00:23,  4.78s/it]


CORRECT
Question: Platelet transfusion is not indicated in -...
Ground Truth: Immunogenic Thrombocytopenia
Prediction: A

Explanation: 
- Platelet transfusion is not indicated in Dilutional Thrombocytopenia because the problem lies in the dilution of platelets in the b...
--------------------------------------------------------------------------------


Evaluating:  80%|████████  | 16/20 [01:14<00:18,  4.74s/it]


INCORRECT 
Question: A 7 years old boy presented with painful boggy swelling of scalp, multiple sinuses with purulent discharge, easily pluckable hairs and lymph nodes enl...
Ground Truth: KOH mount
Prediction: A
Explanation: This is a classic case of scabies, caused by the mite Sarcoptes scabiei. The clinical presentation of scabies is characterized by the d...
--------------------------------------------------------------------------------


Evaluating:  85%|████████▌ | 17/20 [01:20<00:14,  4.88s/it]


CORRECT
Question: All are Narcotic drug as per NDPS (National Drug Psychotropic Substances Act) EXCEPT:...
Ground Truth: Ketamine
Prediction: B

Explanation: 
The NDPS Act, 1985 classifies Narcotic drugs into three categories. Narcotic drugs are those which have an analgesic, sedative, hypno...
--------------------------------------------------------------------------------


Evaluating:  90%|█████████ | 18/20 [01:24<00:09,  4.88s/it]


CORRECT
Question: In chronic alcoholism the rate limiting component for alcohol metabolism excluding enzymes is/are : (PGI Dec 2008)...
Ground Truth: NAD+
Prediction: B
Explanation: In chronic alcoholism, the rate-limiting component for alcohol metabolism is NADH. Alcohol metabolism is catalyzed by alcohol dehydroge...
--------------------------------------------------------------------------------


Evaluating:  95%|█████████▌| 19/20 [01:29<00:04,  4.84s/it]


INCORRECT 
Question: Which of the following statement is false about hydrocele?...
Ground Truth: Wait for 5 years for spontaneous closure of congenital hydrocele
Prediction: B
Explanation: A hydrocele is a fluid accumulation in the tunica vaginalis surrounding the testis. It arises due to a patent processus vaginalis. Ther...
--------------------------------------------------------------------------------


Evaluating: 100%|██████████| 20/20 [01:34<00:00,  4.74s/it]


PARTIAL
Question: Lambda is meeting point of:...
Ground Truth: Sagittal and lambdoid suture
Prediction: B
Explanation: Lambda is the point of intersection of the coronal and lambdoid sutures. This is the most posterior point of the skull. The lambda is a...
--------------------------------------------------------------------------------

Running accuracy: 70.0% (14/20)





## Step 17: Calculate Final Metrics

In [57]:
total_correct = results["exact_matches"] + results["partial_matches"]
total_examples =eval_size
exact_match_rate =results["exact_matches"] / total_examples * 100
partial_match_rate = results["partial_matches"] / total_examples * 100
overall_accuracy = total_correct / total_examples * 100


print("FINAL RESULTS")
print(f"\nTotal examples evaluated: {total_examples}")
print(f"Exact matches: {results['exact_matches']} ({exact_match_rate:.1f}%)")
print(f"Partial matches: {results['partial_matches']} ({partial_match_rate:.1f}%)")
print(f"Total correct: {total_correct} ({overall_accuracy:.1f}%)")
print(f"Incorrect: {results['incorrect']} ({100 - overall_accuracy:.1f}%)")
print(f"\nTotal evaluation time: {total_time / 60:.1f} minutes")
print(f"Average time per example: {results['avg_time_per_example']:.1f} seconds")

FINAL RESULTS

Total examples evaluated: 20
Exact matches: 10 (50.0%)
Partial matches: 4 (20.0%)
Total correct: 14 (70.0%)
Incorrect: 6 (30.0%)

Total evaluation time: 1.6 minutes
Average time per example: 4.7 seconds


## Step 18: Analyze Detailed Results

In [58]:

print("DETAILED RESULTS")


#separate correct and incorrect examples
correct_examples = [r for r in results["details"] if r["correct"]]
incorrect_examples = [r for r in results["details"] if not r["correct"]]

print(f"\nINCORRECT EXAMPLES ({len(incorrect_examples)}):")

for i, ex in enumerate(incorrect_examples[:10], 1):
    print(f"\n{i}. Question: {ex['question']}")
    print(f"   Ground Truth: {ex['ground_truth']}")
    print(f"   Prediction: {ex['prediction']}")

if len(correct_examples) > 0:
    print(f"\nCORRECT EXAMPLES ({len(correct_examples)}):")

    for i, ex in enumerate(correct_examples[:5], 1):
        print(f"\n{i}. Question: {ex['question']}")
        print(f"   Ground Truth: {ex['ground_truth']}")
        print(f"   Prediction: {ex['prediction']}")
        print(f"   Match type: {ex['match_type']}")

DETAILED RESULTS

INCORRECT EXAMPLES (6):

1. Question: All the following are radiological features of Chronic Cor pulmonale except-
   Ground Truth: Prominent lower lobe vessels
   Prediction: C. Pleural effusion

Explanation: In chronic cor pulmonale, there is often pleural effusion due to the increased pressure in the lungs. This is a result of the right heart failure, which can be due to...

2. Question: Lines of Blaschko&;s are along
   Ground Truth: Developmental
   Prediction: C

Explanation: Lines of Blaschko are a set of linear patterns that are present on the skin and are thought to be caused by the migration of cells during embryonic development. They are a useful tool ...

3. Question: What is the mechanism of action of Fluconazole?
   Ground Truth: Inhibits lanosterol 14 demethylase
   Prediction: B

Explanation: Fluconazole is an azole antifungal drug. It works by inhibiting the enzyme lanosterol 14α-demethylase, which is involved in the synthesis of ergosterol, a critica

## Step 19: Assess Performance

In [59]:

print("PERFORMANCE ASSESSMENT")


if overall_accuracy >= 80:
    assessment = "EXCELLENT. Model performs very well."
    recommendation = "Model is ready for further testing and potential deployment."
elif overall_accuracy >= 60:
    assessment = "GOOD. Model shows promise."
    recommendation = "Consider more training epochs or data augmentation."
elif overall_accuracy >= 40:
    assessment = "MODERATE. Model needs improvement."
    recommendation = "Increase training data, adjust hyperparameters, or try different architectures."
else:
    assessment = "VERY POOR. Model barely learned."
    recommendation = "Verify data formatting and retrain from scratch."

print(f"\n{assessment}")
print(f"Recommendation: {recommendation}")

PERFORMANCE ASSESSMENT

GOOD. Model shows promise.
Recommendation: Consider more training epochs or data augmentation.


## Step 20: Save Results

In [60]:
# saving results in a json file
results_summary = {
    "total_examples": total_examples,
    "exact_matches": results["exact_matches"],
    "partial_matches": results["partial_matches"],
    "incorrect": results["incorrect"],
    "exact_match_rate": exact_match_rate,
    "partial_match_rate": partial_match_rate,
    "overall_accuracy": overall_accuracy,
    "evaluation_time": total_time,
    "avg_time_per_example": results["avg_time_per_example"],
    "timestamp": datetime.now().isoformat(),
    "details": results["details"]
}

with open("evaluation_results.json", "w") as f:
    json.dump(results_summary, f, indent=2)


## Part A: Model Improvement Strategies

### Question 1: Improving Model Performance

Based on your evaluation results, propose at least 2 or 3 specific strategies to improve your model's accuracy.

1. Increase training dataset size from 2000 to 5000/10000 for more diverse medical scenarios

2. adjust Lora parameters, as increasing rank to 32 or 64 for the model to learn more complex patterns


### Question 2: Analyzing Failure Patterns

Review your incorrect predictions and identify patterns in failures.

In [61]:

print("FAILURE PATTERN ANALYSIS")

failure_analysis = {
    "too_verbose": [],
    "wrong_concept": [],
    "partial_understanding": [],
    "hallucination": []
}

for ex in incorrect_examples:
    pred = ex["prediction"].lower()
    gt = ex["ground_truth"].lower()

    # verbose answers
    if len(ex["prediction"]) > len(ex["ground_truth"]) * 3:
        failure_analysis["too_verbose"].append(ex)
    # different medical words
    elif not any(word in pred for word in gt.split()[:3]):
        failure_analysis["wrong_concept"].append(ex)
    # partial understanding
    elif any(word in pred for word in gt.split()):
        failure_analysis["partial_understanding"].append(ex)
    else:
        failure_analysis["hallucination"].append(ex)
print(f"verbose answers: {len(failure_analysis['too_verbose'])}")
print(f"different medical words: {len(failure_analysis['wrong_concept'])}")
print(f"partial understanding: {len(failure_analysis['partial_understanding'])}")
print(f"hallucination {len(failure_analysis['hallucination'])}")

FAILURE PATTERN ANALYSIS
verbose answers: 6
different medical words: 0
partial understanding: 0
hallucination 0


### Question 3: Data Quality vs. Quantity

What do you think is better between training on 2000 examples (same quality) or 500 curated high-quality examples?

***500 curated high-quality examples as low quality examples can teach the model incorrect patterns, and 500 is still diverse with clean data***


## Part B: Resource-Constrained Inference

### Question 4: Optimizing for Limited Resources

How can you design a strategy to reduce inference time/memory for deployment in constrained environments?


- Use 8bit or 4bit to reduce memory
- Model pruning
- Switch from 3B to 1B or even smaller variants

### Question 5: Speed vs. Accuracy Trade-offs

Analyze how changing generation parameters affects speed, quality, and consistency.

In [63]:
# test different generation parameters
import time
test_question = test_examples[0]["question"]
test_options = test_examples[0]["options"]

configs = [
    {"max_new_tokens": 50, "temperature": 0.1, "do_sample": False, "name": "Greedy, Short"},
    {"max_new_tokens": 100, "temperature": 0.7,"do_sample": True, "name": "Sampling, Medium"},
    {"max_new_tokens": 200, "temperature": 1.0,"do_sample": True, "name": "Sampling, Long, High Temp"},
]

print("SPEED vs ACCURACY ANALYSIS")

for config in configs:
    start = time.time()
    # generate with different configs
    inputs = tokenizer(f"Question: {test_question}\nAnswer:", return_tensors="pt")
    inputs ={k: v.to(model.device) for k, v in inputs.items()}

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=config["max_new_tokens"],
            temperature=config["temperature"] ,
            do_sample=config["do_sample"] ,
            pad_token_id=tokenizer.eos_token_id,
        )

    elapsed =time.time() -start
    answer= tokenizer.decode(outputs[0], skip_special_tokens=True).split("Answer:")[-1].strip()

    print(f"\n{config['name']}:")
    print(f"Time: {elapsed:.3f}s")
    print(f"Length: {len(answer)} chars")
    print(f"Answer: {answer[:100]}...")

The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


SPEED vs ACCURACY ANALYSIS

Greedy, Short:
Time: 2.287s
Length: 29 chars
Answer: A, B, C
The best answer is C....

Sampling, Medium:
Time: 0.388s
Length: 22 chars
Answer: A
The best answer is A...

Sampling, Long, High Temp:
Time: 2.709s
Length: 23 chars
Answer: b
The best answer is b....




- Lower temperature (0.1-0.5): More focused, consistent answers, faster
- Higher temperature (0.7-1.0): More diverse but potentially less accurate, slower£
- Shorter max_new_tokens: Faster inference, but may truncate answers
- Longer max_new_tokens: Slower, but allows complete answers

## Part C: Evaluation Methodology

### Question 7: Improving Evaluation Metrics

Analyze limitations of current exact/partial match evaluation and propose improvements.

Limitations:

1. False Negatives : Model may give correct answer but in different wording
2. False Positives. : Partial matches might accept incorrect
3. No Semantic Understanding: Doesn't use embeddings to check meaning similarity
4. No Medical Terms Handling: Doesn't account for synonyms

Proposed Improvements:
1. Semantic similarity
2. Medical NER
3. Human evaluation
4. Confidence scores


### Question 8: Test Set Size and Confidence

Test other test sizes and observe the results. What can you say about the results?


In [64]:

test_sizes = [10, 20, 50, 100]

print("TEST SET SIZE ANALYSIS")


for size in test_sizes:
    if size >len(test_examples):
        continue

    eval_subset =test_examples[:size]
    correct =0

    for example in eval_subset:
        question =example.get("question", "")
        options =example.get("options", {})
        correct_answer_idx = example.get("cop", None)

        prediction =generate_answer(question, options, model, tokenizer)

        if correct_answer_idx is not None:
            is_match, _ = check_answer_match(prediction, correct_answer_idx, options)
            if is_match:
                correct += 1

    accuracy = correct / size * 100
    print(f"Test size: {size:3d} | Accuracy: {accuracy:5.1f}% | Correct: {correct}/{size}")

print("larger test sets provide more reliable estimates")
print("Small test sets have high variance")
print("Need at least 50-100 examples for a accurate result")

TEST SET SIZE ANALYSIS
Test size:  10 | Accuracy:  80.0% | Correct: 8/10
Test size:  20 | Accuracy:  70.0% | Correct: 14/20
Test size:  50 | Accuracy:  64.0% | Correct: 32/50
Test size: 100 | Accuracy:  68.0% | Correct: 68/100
larger test sets provide more reliable estimates
Small test sets have high variance
Need at least 50-100 examples for a accurate result


## Part D: Real-World Deployment Scenario

### Question 9: Production Considerations

What can you do to address safety, reliability, updates, and edge cases for deploying in a medical assistance application?


1. Safety:
   - Add disclaimers
   - Implement confidence thresholds
   - Input validation

2. Reliability
   - Track accuracy, latency, error rates
   - Redundancy
   - Error handling

3. Updates
   - Track model versions and performance
   - A/B testing between models before starting
   - Quick revert to previous model version

4. Edge Cases
   - Out-of-domain questions
   - Ambiguous questions
   - Handle different languages
   - Context length limits