# Medical LLM Fine-Tuning

**Domain:** Healthcare/Medical Q&A  
**Model:** TinyLlama-1.1B-Chat-v1.0  
**Method:** LoRA (Low-Rank Adaptation)  
**Dataset:** Medical Meadow Medical Flashcards  

---

## Project Overview

This notebook implements a **domain-specific medical assistant** by fine-tuning a Large Language Model (LLM) on medical question-answer pairs. The goal is to create a model that can:

1. Answer medical questions accurately
2. Provide domain-specific responses
3. Demonstrate measurable improvement over the base model

### Why This Matters:
- Medical information requires accuracy and domain expertise
- General-purpose pre-trained models often lack specialized medical knowledge
- Fine-tuning adapts the model to medical terminology and reasoning

---

## Methodology

### 1. **Dataset:** 
- Medical Meadow Medical Flashcards (Hugging Face)
- Covers diverse medical topics and terminology

### 2. **Fine-tuning Approach:**
- **LoRA (Low-Rank Adaptation)**: Parameter-efficient fine-tuning
- Often reaches quality close to full fine-tuning while training only a small fraction of the weights

### 3. **Evaluation:**
- **Quantitative**: Loss, Perplexity, BLEU, ROUGE scores
- **Qualitative**: Compare responses before/after fine-tuning
- **Baseline**: Evaluate pre-trained model first for comparison

---

## Installation

The required libraries for LLM fine-tuning are installed:

- **transformers**: Hugging Face library for LLMs
- **datasets**: Loading and processing datasets
- **peft**: Parameter-Efficient Fine-Tuning (LoRA)
- **accelerate**: Distributed training and mixed precision
- **evaluate**: Metrics (BLEU, ROUGE, etc.)
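- **sentencepiece**: Tokenization support
- **bitsandbytes**: Quantization support (installed here, although the model is loaded in float16 rather than quantized)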

In [1]:
!pip install -q transformers datasets peft accelerate bitsandbytes evaluate sentencepiece

## Import Libraries

Import all necessary libraries

In [41]:
import torch
import numpy as np
import pandas as pd
import time
import gc
import matplotlib.pyplot as plt
import seaborn as sns
import os
from datasets import load_dataset, DatasetDict, Dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer
from peft import LoraConfig, get_peft_model
import shutil
import evaluate
import zipfile
import glob

# Seeds
torch.manual_seed(42)
np.random.seed(42)
sns.set_style("whitegrid")



## Data Loading

### Dataset: Medical Meadow Medical Flashcards

This dataset contains medical question-answer pairs in the format:
- **Instruction**: Task description (e.g., "Answer this question truthfully")
- **Input**: Medical question
- **Output**: Correct medical answer

Explore the dataset structure and sample entries to understand the data format.

In [3]:
dataset = load_dataset("medalpaca/medical_meadow_medical_flashcards")
print(f"Total: {len(dataset['train']):,}")

# Show samples
for i in range(2):
    print(f"\nExample {i+1}:")
    print(f"Q: {dataset['train'][i]['input'][:80]}...")
    print(f"A: {dataset['train'][i]['output'][:80]}...")

Total: 33,955

Example 1:
Q: What is the relationship between very low Mg2+ levels, PTH levels, and Ca2+ leve...
A: Very low Mg2+ levels correspond to low PTH levels which in turn results in low C...

Example 2:
Q: What leads to genitourinary syndrome of menopause (atrophic vaginitis)?...
A: Low estradiol production leads to genitourinary syndrome of menopause (atrophic ...


## Data Preprocessing

### Why Preprocessing Matters:

Raw data often contains:
- Missing values
- Duplicate entries
- Inconsistent formatting

### Preprocessing Pipeline:

1. **Data Cleaning**: Remove null/empty entries
2. **Deduplication**: Remove duplicate Q&A pairs
3. **Size Limiting**: Use 2,000 samples (GPU memory constraint)
4. **Train/Val/Test Split**: 80/10/10 split

### Key Decisions:
- **2,000 samples**: Balances training quality with GPU memory limits
- **80/10/10 split**: Standard ML practice for train/validation/test
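
As a quick check on the split arithmetic, the sketch below computes the sizes a two-stage 80/10/10 split yields for a given sample count. The helper name `split_sizes` is hypothetical, and the use of `round` is an assumption; the library's own rounding may differ when the count does not divide evenly.

```python
def split_sizes(n_samples, holdout_frac=0.2, test_share=0.5):
    """Return (train, val, test) sizes for a two-stage split.

    Stage 1 holds out `holdout_frac` of the data; stage 2 divides that
    holdout between validation and test.
    """
    n_holdout = round(n_samples * holdout_frac)  # assumption: simple rounding
    n_test = round(n_holdout * test_share)
    n_val = n_holdout - n_test
    n_train = n_samples - n_holdout
    return n_train, n_val, n_test


print(split_sizes(2000))  # (1600, 200, 200)
```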

In [4]:
# Data cleaning
def clean_dataset(example):
    return (
        example['instruction'] and example['input'] and example['output'] and
        len(str(example['instruction']).strip()) > 0 and
        len(str(example['input']).strip()) > 0 and
        len(str(example['output']).strip()) > 0
    )

cleaned = dataset['train'].filter(clean_dataset)
print(f"After cleaning: {len(cleaned):,}")

After cleaning: 33,547


In [5]:
# Remove duplicates
df = pd.DataFrame(cleaned)
df = df.drop_duplicates(subset=['input', 'output'])
cleaned = Dataset.from_pandas(df)
print(f"After dedup: {len(cleaned):,}")

After dedup: 33,521


In [6]:
MAX_SAMPLES = 2000
if len(cleaned) > MAX_SAMPLES:
    cleaned = cleaned.select(range(MAX_SAMPLES))

print(f"✅ Using {len(cleaned):,} samples")

✅ Using 2,000 samples


In [7]:
# Split: 80% train, 10% val, 10% test
train_test = cleaned.train_test_split(test_size=0.2, seed=42)
val_test = train_test['test'].train_test_split(test_size=0.5, seed=42)

dataset_split = DatasetDict({
    'train': train_test['train'],
    'validation': val_test['train'],
    'test': val_test['test']
})

print(f"Train: {len(dataset_split['train']):,}")
print(f"Val: {len(dataset_split['validation']):,}")
print(f"Test: {len(dataset_split['test']):,}")

Train: 1,600
Val: 200
Test: 200


## Model Loading

### Model: TinyLlama-1.1B-Chat-v1.0

**Why TinyLlama?**
- Compact (1.1B parameters vs 7B+ for larger models)
- Fits in free GPU memory
- Fast training
- Good balance of quality and efficiency

**Loading Configuration:**
- `torch_dtype=torch.float16`: Half precision (saves memory)
- `device_map="auto"`: Automatically distribute across available devices
- `low_cpu_mem_usage=True`: Minimize CPU memory footprint

In [8]:
MODEL_NAME = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

In [9]:
# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
print("✅ Tokenizer loaded")

✅ Tokenizer loaded


In [None]:
# Model
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float16,  
    device_map="auto",
    trust_remote_code=True,
    low_cpu_mem_usage=True,
)

`torch_dtype` is deprecated! Use `dtype` instead!


In [11]:
print(f"✅ Model loaded: {model.num_parameters():,} params")
print(f"GPU Memory: {torch.cuda.memory_allocated()/1024**3:.2f} GB")

✅ Model loaded: 1,100,048,384 params
GPU Memory: 0.94 GB


## LoRA Configuration

### What is LoRA?

**LoRA (Low-Rank Adaptation)** is a parameter-efficient fine-tuning technique:

- Traditional fine-tuning: Updates ALL 1.1B parameters
- LoRA: Updates well under 1% of parameters via low-rank matrices (0.41% in this notebook)

### LoRA Settings:

```
r=16              # Rank of update matrices (higher = more capacity)
lora_alpha=32     # Scaling factor
target_modules    # Which layers to adapt (attention layers)
lora_dropout=0.05 # Regularization
```

### Memory Savings:
- Full fine-tuning: ~20GB GPU memory (rough estimate)
- LoRA: ~6-8GB GPU memory (rough estimate; the runs in this notebook peaked between 1.5GB and 6.3GB)

In [12]:
model.config.use_cache = False

In [13]:
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

In [14]:
model = get_peft_model(model, lora_config)

In [15]:
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print("✅ LoRA applied")
print(f"Trainable: {trainable:,} ({100*trainable/total:.2f}%)")

✅ LoRA applied
Trainable: 4,505,600 (0.41%)
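
The trainable count printed above can be reproduced by hand, which helps show where LoRA's parameters come from. The sketch below assumes TinyLlama-1.1B's layer shapes (hidden size 2048, 22 decoder layers, key/value projections of width 256); these figures come from the public model configuration, not from this notebook, so treat them as assumptions.

```python
# Assumed TinyLlama-1.1B projection shapes as (in_features, out_features).
PROJ_SHAPES = {
    "q_proj": (2048, 2048),
    "k_proj": (2048, 256),
    "v_proj": (2048, 256),
    "o_proj": (2048, 2048),
}
N_LAYERS = 22  # assumed number of decoder layers
RANK = 16      # r from the LoRA settings


def lora_trainable_params(shapes, rank, n_layers):
    """Count LoRA weights: each adapted matrix gains A (rank x d_in) and B (d_out x rank)."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * n_layers


trainable = lora_trainable_params(PROJ_SHAPES, RANK, N_LAYERS)
base_params = 1_100_048_384  # base model size reported when the model was loaded
share = 100 * trainable / (base_params + trainable)
print(f"{trainable:,} trainable ({share:.2f}%)")  # 4,505,600 trainable (0.41%)
```

Doubling `r` doubles the adapter size, which is why rank is the main knob for trading capacity against memory.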


## Tokenization

### What is Tokenization?

Tokenization converts text into numbers that the model can process. For example:

```
"What is diabetes?" → [2385, 310, 652, 9790, 29973]
```

### Tokenization Strategy:

1. **Format**: Combine instruction + question + answer into training format
2. **Max Length**: 256 tokens (balance quality vs memory)
3. **Padding**: Pad all sequences to same length for batching
4. **Labels**: Copy input_ids for causal language modeling
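
Because the pad token is set to the EOS token and every sequence is padded to the full length, copying `input_ids` straight into the labels means the loss is also computed on padding positions. One common refinement, which this notebook does not apply, is to replace padded label positions with `-100`, the value the Hugging Face causal-LM loss ignores. The helper name `mask_padding_labels` below is hypothetical, and the token IDs are made up for illustration.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss


def mask_padding_labels(input_ids, attention_mask):
    """Copy input_ids into labels, ignoring positions the attention mask marks as padding."""
    return [
        token if keep == 1 else IGNORE_INDEX
        for token, keep in zip(input_ids, attention_mask)
    ]


# Toy sequence: three content tokens, one real EOS (id 2), two pad tokens (also id 2).
ids = [1, 2385, 310, 2, 2, 2]
mask = [1, 1, 1, 1, 0, 0]
print(mask_padding_labels(ids, mask))  # [1, 2385, 310, 2, -100, -100]
```

Masking by attention mask rather than by token ID keeps the genuine end-of-sequence token in the loss even though padding shares its ID.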

### Why 256 tokens?
- Medical Q&A pairs are typically short
- Longer sequences = more GPU memory
- 256 tokens = good balance

In [16]:
def format_instruction(example):
    instruction = example['instruction']
    question = example['input']
    answer = example['output']
    prompt = f"<|user|>\n{instruction}\n{question}\n<|assistant|>\n{answer}{tokenizer.eos_token}"
    return {"text": prompt}

In [17]:
formatted = dataset_split.map(
    format_instruction,
    remove_columns=dataset_split['train'].column_names
)

Map:   0%|          | 0/1600 [00:00<?, ? examples/s]

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

In [18]:
def tokenize_function(examples):
    result = tokenizer(
        examples["text"],
        truncation=True,
        max_length=256,
        padding="max_length"
    )
    result["labels"] = result["input_ids"][:]
    return result

In [19]:
tokenized = formatted.map(
    tokenize_function,
    batched=True,
    remove_columns=["text"]
)


Map:   0%|          | 0/1600 [00:00<?, ? examples/s]

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

In [20]:
train_data = tokenized['train']
val_data = tokenized['validation']

print(f"✅ Train: {len(train_data):,}, Val: {len(val_data):,}")
print("Max length: 256 tokens")

✅ Train: 1,600, Val: 200
Max length: 256 tokens


## Evaluation Metrics Setup

### Metrics that are used:

1. **Loss**: How well the model predicts next tokens (lower is better)
2. **Perplexity**: Exp(loss), measures uncertainty (lower is better)
3. **BLEU**: Measures n-gram overlap with reference (0-100, higher is better)
4. **ROUGE**: Measures recall of n-grams (0-1, higher is better)
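
To make the loss-to-perplexity relationship concrete, the sketch below exponentiates two loss values reported later in this notebook. The standalone `perplexity` helper is for illustration only; the printed values can differ slightly from the notebook's own because the losses here are already rounded to four decimals.

```python
import math


def perplexity(loss):
    """Perplexity is the exponential of the mean cross-entropy loss (in nats)."""
    return math.exp(loss)


for label, loss in [("baseline", 11.8996), ("fine-tuned (Exp1)", 0.3521)]:
    print(f"{label}: loss={loss:.4f} -> perplexity={perplexity(loss):,.2f}")
```

Since the mapping is exponential, a loss difference that looks modest on paper can correspond to a change of several orders of magnitude in perplexity.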

### Why Multiple Metrics?

- **Loss/Perplexity**: Overall model quality
- **BLEU**: Precision of generated text
- **ROUGE**: Recall/coverage of key information
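
The precision-versus-recall distinction can be illustrated with a deliberately simplified unigram overlap. This is not the BLEU or ROUGE implementation from `evaluate` (BLEU also uses higher-order n-grams and a brevity penalty, and ROUGE-L uses longest common subsequences); the function `unigram_overlap` is a hypothetical helper for intuition only.

```python
from collections import Counter


def unigram_overlap(prediction, reference):
    """Return (precision, recall) of word overlap between two strings."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum((pred_counts & ref_counts).values())  # clipped matches
    precision = overlap / max(sum(pred_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    return precision, recall


p, r = unigram_overlap(
    "low estradiol causes atrophic vaginitis",
    "low estradiol production leads to atrophic vaginitis",
)
print(f"precision={p:.2f}, recall={r:.2f}")  # precision=0.80, recall=0.57
```

A short answer can score high precision while missing reference content, which is why the two families of metrics are reported together.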

In [21]:
# Load evaluation metrics
bleu_metric = evaluate.load("bleu")
rouge_metric = evaluate.load("rouge")

def compute_perplexity(loss):
    return np.exp(loss)

print("✅ Metrics configured (loss, perplexity, BLEU & ROUGE)")

✅ Metrics configured (loss, perplexity, BLEU & ROUGE)


## Baseline evaluation

In [22]:
# Clean memory
gc.collect()
torch.cuda.empty_cache()

In [23]:
baseline_args = TrainingArguments(
    output_dir="./baseline",
    per_device_eval_batch_size=1,
    fp16=True,
    report_to="none",
    prediction_loss_only=True,
    dataloader_num_workers=0,
)

In [24]:
baseline_trainer = Trainer(
    model=model,
    args=baseline_args,
    eval_dataset=val_data,
)

In [25]:
baseline_metrics = baseline_trainer.evaluate()
baseline_loss = baseline_metrics['eval_loss']
baseline_perplexity = compute_perplexity(baseline_loss)

In [None]:
baseline_predictions = []
baseline_references = []
model.eval()
with torch.no_grad():
    for i in range(min(50, len(val_data))):
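        sample = val_data[i]
        # Note: each sample holds prompt + answer + padding, so the first 128 tokens
        # can already contain part of the reference answer, and the decoded output
        # below includes this input text. Scores therefore partly reflect echoed input.
        input_ids = torch.tensor([sample["input_ids"][:128]]).to(model.device)
        attention_mask = torch.tensor([[1] * len(sample["input_ids"][:128])]).to(model.device)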
        outputs = model.generate(
            input_ids,
            attention_mask=attention_mask,
            max_new_tokens=50,
            do_sample=False,
            pad_token_id=tokenizer.eos_token_id
        )
        pred_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
        ref_text = tokenizer.decode(sample["labels"], skip_special_tokens=True)
        baseline_predictions.append(pred_text)
        baseline_references.append([ref_text])


In [27]:
baseline_bleu_result = bleu_metric.compute(predictions=baseline_predictions, references=baseline_references)
baseline_bleu = baseline_bleu_result["bleu"] * 100  # Convert to percentage

# Calculate ROUGE scores
baseline_rouge_result = rouge_metric.compute(predictions=baseline_predictions, references=baseline_references)
baseline_rouge_l = baseline_rouge_result["rougeL"] * 100  # ROUGE-L F1 score

In [28]:

print("BASELINE METRICS")
print("="*80)
print(f"   Loss: {baseline_loss:.4f}")
print(f"   Perplexity: {baseline_perplexity:.2f}")
print(f"   BLEU: {baseline_bleu:.2f}")
print(f"   ROUGE-L: {baseline_rouge_l:.2f}")
print("="*80)

BASELINE METRICS
   Loss: 11.8996
   Perplexity: 147204.22
   BLEU: 67.67
   ROUGE-L: 76.57


In [29]:
# Cleanup
del baseline_trainer
gc.collect()
torch.cuda.empty_cache()



## Hyperparameter Experiments



### Experimental Design:

Three configurations are tested to compare hyperparameter settings:

| Experiment | Learning Rate | Batch Size | Grad Accum | Effective Batch | Epochs | Warmup |
|------------|--------------|------------|------------|-----------------|--------|--------|
| **Exp1 (High LR)** | 1e-4 | 2 | 4 | 8 | 5 | 50 |
| **Exp2 (Medium LR)** | 5e-5 | 1 | 3 | 3 | 5 | 50 |
| **Exp3 (Low LR)** | 2e-5 | 2 | 2 | 4 | 5 | 50 |

### Hyperparameter Explanations:

**Learning Rate (lr):**
- Controls how much the model updates weights each step
- **High (1e-4)**: Faster learning but risk of instability
- **Medium (5e-5)**: Balanced approach (often optimal)
- **Low (2e-5)**: Slower but more stable convergence

**Batch Size:**
- Number of samples processed together
- Smaller batches = noisier but more frequent updates

**Gradient Accumulation (accum):**
- Accumulates gradients over multiple batches before updating
- **Effective Batch = Batch Size × Accumulation**
- Simulates larger batches without memory overflow
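
The effective batch size also determines how many optimizer updates each configuration performs. The sketch below works this out for the three experiments in the table, assuming 1,600 training examples and assuming a final partial accumulation window still triggers an update; the trainer's exact step count may differ by a few steps.

```python
import math

N_TRAIN = 1600  # training examples after the 80/10/10 split

CONFIGS = {
    "Exp1_HighLR": {"batch": 2, "accum": 4, "epochs": 5},
    "Exp2_MediumLR": {"batch": 1, "accum": 3, "epochs": 5},
    "Exp3_LowLR": {"batch": 2, "accum": 2, "epochs": 5},
}


def update_steps(n_train, batch, accum, epochs):
    """Return (effective_batch, approximate total optimizer updates)."""
    effective = batch * accum
    steps_per_epoch = math.ceil(n_train / effective)  # assumption: round up
    return effective, steps_per_epoch * epochs


for name, cfg in CONFIGS.items():
    effective, total = update_steps(N_TRAIN, **cfg)
    print(f"{name}: effective batch {effective}, about {total} updates")
```

Note that the three runs differ in update count as well as learning rate, so their results are not a pure learning-rate comparison.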

**Epochs:**
- Number of times model sees entire dataset
- **5 epochs**: Sufficient for convergence with our dataset size
- More epochs = better learning but risk of overfitting

**Warmup Steps:**
- Gradually increases learning rate at start
- **50 steps**: Prevents instability in early training
- Helps model adjust smoothly to the data
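
The training configuration in this notebook pairs warmup with `lr_scheduler_type="cosine"`. The sketch below shows the general shape of such a schedule: a linear ramp to the peak rate, then a half-cosine decay to zero. It is a simplified stand-in rather than the library's scheduler, and the total of 1,000 steps is an assumption based on an effective batch of 8 over 1,600 examples for 5 epochs.

```python
import math


def lr_at_step(step, peak_lr, warmup_steps, total_steps):
    """Linear warmup followed by half-cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


PEAK_LR, WARMUP, TOTAL = 1e-4, 50, 1000  # TOTAL is an assumed step count
for step in (0, 25, 50, 525, 1000):
    print(f"step {step:>4}: lr = {lr_at_step(step, PEAK_LR, WARMUP, TOTAL):.2e}")
```

With 50 warmup steps out of roughly 1,000, the ramp covers about 5% of training before the decay begins.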

### What is tracked for Each Experiment:

1. **Training Loss**: How well the model learns the training data
2. **Validation Loss**: Performance on unseen data (generalization)
3. **BLEU Score**: Quality of generated medical responses (0-100%)
4. **ROUGE-L Score**: Coverage of key information (0-100%)
5. **Perplexity**: Model confidence (lower = better)
6. **Improvement %**: Gain over baseline model
7. **GPU Memory**: Peak memory usage in GB
8. **Training Time**: Minutes to complete

### Why This Matters:

By comparing results across experiments, we can:
- Identify the best learning rate for our task
- Understand hyperparameter impact on performance
- Demonstrate systematic optimization approach
- Select the optimal configuration for deployment

In [30]:
# Save baseline (non-fine-tuned) model for comparison
print("\n" + "="*80)
print("SAVING BASELINE MODEL")
print("="*80)

baseline_save_path = "./baseline_model"

# Note: `model` is already wrapped by get_peft_model, so save_pretrained writes
# the (still untrained) LoRA adapter files rather than the full base weights.
print(f"\nSaving baseline model to: {baseline_save_path}")
model.save_pretrained(baseline_save_path)
tokenizer.save_pretrained(baseline_save_path)

print("✅ Baseline model saved!")
print("\n📁 Baseline model files:")
for file in sorted(os.listdir(baseline_save_path)):
    print(f"   - {file}")

print("\nThis baseline model can be used for:")
print("  - Comparison with fine-tuned model")
print("  - Demonstrating improvement from fine-tuning")
print("  - A/B testing in deployment")
print("="*80)


SAVING BASELINE MODEL

Saving baseline model to: ./baseline_model
✅ Baseline model saved!

📁 Baseline model files:
   - README.md
   - adapter_config.json
   - adapter_model.safetensors
   - chat_template.jinja
   - special_tokens_map.json
   - tokenizer.json
   - tokenizer.model
   - tokenizer_config.json

This baseline model can be used for:
  - Comparison with fine-tuned model
  - Demonstrating improvement from fine-tuning
  - A/B testing in deployment


In [31]:
experiments = [
    {"name": "Exp1_HighLR", "lr": 1e-4, "batch": 2, "accum": 4, "epochs": 5, "warmup": 50},
    {"name": "Exp2_MediumLR", "lr": 5e-5, "batch": 1, "accum": 3, "epochs": 5, "warmup": 50},
    {"name": "Exp3_LowLR", "lr": 2e-5, "batch": 2, "accum": 2, "epochs": 5, "warmup": 50},
]



In [32]:
results = []
best_model_path = None
best_val_loss = float('inf')


In [None]:
for idx, exp in enumerate(experiments, 1):

    print(f"🧪 EXPERIMENT {idx}/{len(experiments)}: {exp['name']}")
    print("="*80)
    print(f"LR: {exp['lr']}, Batch: {exp['batch']}, Accum: {exp['accum']}, Epochs: {exp['epochs']}")
    print(f"Effective batch size: {exp['batch'] * exp['accum']}")
    print("-"*80)

    # Reload model fresh
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME,
        torch_dtype=torch.float16,
        device_map="auto",
        trust_remote_code=True,
        low_cpu_mem_usage=True,
    )
    model.config.use_cache = False
    
    
    if hasattr(model, "gradient_checkpointing_disable"):
        model.gradient_checkpointing_disable()
    model = get_peft_model(model, lora_config)
    
    # Ensure LoRA parameters require gradients
    for name, param in model.named_parameters():
        if "lora" in name.lower():
            param.requires_grad = True
    
    # Verify trainable parameters
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"Trainable parameters: {trainable_params:,}")
    if trainable_params == 0:
        print("⚠️ WARNING: No trainable parameters! Skipping this experiment.")
        continue

    # Training arguments
    training_args = TrainingArguments(
        output_dir=f"./results_{exp['name']}",
        per_device_train_batch_size=exp['batch'],
        per_device_eval_batch_size=1,
        gradient_accumulation_steps=exp['accum'],
        num_train_epochs=exp['epochs'],
        learning_rate=exp['lr'],
        weight_decay=0.01,
        warmup_steps=exp['warmup'],
        lr_scheduler_type="cosine",
        eval_strategy="epoch",
        save_strategy="epoch",
        logging_steps=50,
        fp16=True,  # FP16 for GPU
        load_best_model_at_end=True,
        metric_for_best_model="eval_loss",
        greater_is_better=False,
        report_to="none",
        save_total_limit=1,
        prediction_loss_only=True,  
        dataloader_num_workers=0,
        # gradient_checkpointing=True, 
    )
    # Clean memory
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_data,
        eval_dataset=val_data,
    )

    print("\n🚀 Training...")
    start = time.time()

    try:
        train_result = trainer.train()
        train_time = (time.time() - start) / 60

        print("\n📊 Evaluating...")
        eval_metrics = trainer.evaluate()

        # Extract metrics
        train_loss = train_result.training_loss
        val_loss = eval_metrics['eval_loss']
        val_ppl = compute_perplexity(val_loss)

        # Calculate BLEU score
        print("Calculating BLEU score...")
        predictions = []
        references = []
        model.eval()
        with torch.no_grad():
            for i in range(min(50, len(val_data))):
                sample = val_data[i]
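                input_ids = torch.tensor([sample["input_ids"][:128]]).to(model.device)
                # Build the mask here instead of reusing the variable left over
                # from the baseline cell.
                attention_mask = torch.ones_like(input_ids)
                outputs = model.generate(
                    input_ids,
                    attention_mask=attention_mask,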
                    max_new_tokens=50,
                    do_sample=False,
                    pad_token_id=tokenizer.eos_token_id
                )
                pred_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
                ref_text = tokenizer.decode(sample["labels"], skip_special_tokens=True)
                predictions.append(pred_text)
                references.append([ref_text])

        bleu_result = bleu_metric.compute(predictions=predictions, references=references)
        val_bleu = bleu_result["bleu"] * 100

        # Calculate ROUGE score
        rouge_result = rouge_metric.compute(predictions=predictions, references=references)
        val_rouge_l = rouge_result["rougeL"] * 100
        gpu_mem_gb = torch.cuda.max_memory_allocated() / (1024**3)

        # Calculate improvements
        loss_imp = ((baseline_loss - val_loss) / baseline_loss) * 100
        ppl_imp = ((baseline_perplexity - val_ppl) / baseline_perplexity) * 100
        bleu_imp = ((val_bleu - baseline_bleu) / max(baseline_bleu, 0.01)) * 100
        rouge_imp = ((val_rouge_l - baseline_rouge_l) / max(baseline_rouge_l, 0.01)) * 100

        # Display results
        print("\n" + "="*80)
        print(f"✅ {exp['name']} COMPLETED!")
        print("="*80)
        print(f"Train Loss:   {train_loss:.4f}")
        print(f"Val Loss:     {val_loss:.4f}")
        print(f"Perplexity:   {val_ppl:.2f}")
        print(f"BLEU Score:   {val_bleu:.2f}%")
        print(f"ROUGE-L Score: {val_rouge_l:.2f}%")
        print(f"\nImprovement over Baseline:")
        print(f"  Loss:       {loss_imp:+.2f}%")
        print(f"  Perplexity: {ppl_imp:+.2f}%")
        print(f"  BLEU:       {bleu_imp:+.2f}%")
        print(f"  ROUGE-L:    {rouge_imp:+.2f}%")
        print(f"\nResources:")
        print(f"  GPU Memory: {gpu_mem_gb:.2f} GB (peak)")
        print(f"  Time:       {train_time:.2f} minutes")
        print("="*80)

        # Store results
        results.append({
            "Experiment": exp['name'],
            "Learning_Rate": exp['lr'],
            "Batch_Size": exp['batch'],
            "Grad_Accum": exp['accum'],
            "Effective_Batch": exp['batch'] * exp['accum'],
            "Epochs": exp['epochs'],
            "Train_Loss": round(train_loss, 4),
            "Val_Loss": round(val_loss, 4),
            "Perplexity": round(val_ppl, 2),
            "BLEU_Score": round(val_bleu, 2),
            "ROUGE_L_Score": round(val_rouge_l, 2),
            "Loss_Improvement_%": round(loss_imp, 2),
            "Perplexity_Improvement_%": round(ppl_imp, 2),
            "BLEU_Improvement_%": round(bleu_imp, 2),
            "ROUGE_L_Improvement_%": round(rouge_imp, 2),
            "GPU_Memory_GB": round(gpu_mem_gb, 2),
            "Time_Min": round(train_time, 2)
        })

        # Save best model
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            best_model_path = f"./best_{exp['name']}"
            print(f"\n💾 Saving best model to {best_model_path}...")
            trainer.save_model(best_model_path)
            tokenizer.save_pretrained(best_model_path)

    except RuntimeError as e:
        if "out of memory" in str(e):
            print(f"\n⚠️ OOM ERROR in {exp['name']}")
            print("Try reducing MAX_SAMPLES further or using shorter sequences")
            gc.collect()
            torch.cuda.empty_cache()
            continue
        else:
            raise e

    # Cleanup
    del trainer, model
    gc.collect()
    torch.cuda.empty_cache()

🧪 EXPERIMENT 1/3: Exp1_HighLR
LR: 0.0001, Batch: 2, Accum: 4, Epochs: 5
Effective batch size: 8
--------------------------------------------------------------------------------
Trainable parameters: 4,505,600

🚀 Training...


Epoch,Training Loss,Validation Loss
1,0.3662,0.386527
2,0.344,0.366749
3,0.3112,0.355344
4,0.29,0.35206
5,0.2826,0.353376



📊 Evaluating...


Calculating BLEU score...

✅ Exp1_HighLR COMPLETED!
Train Loss:   0.5437
Val Loss:     0.3521
Perplexity:   1.42
BLEU Score:   89.03%
ROUGE-L Score: 95.11%

Improvement over Baseline:
  Loss:       +97.04%
  Perplexity: +100.00%
  BLEU:       +31.56%
  ROUGE-L:    +24.21%

Resources:
  GPU Memory: 1.52 GB (peak)
  Time:       13.56 minutes

💾 Saving best model to ./best_Exp1_HighLR...
🧪 EXPERIMENT 2/3: Exp2_MediumLR
LR: 5e-05, Batch: 1, Accum: 3, Epochs: 5
Effective batch size: 3
--------------------------------------------------------------------------------
Trainable parameters: 4,505,600


The model is already on multiple devices. Skipping the move to device specified in `args`.



🚀 Training...


Epoch,Training Loss,Validation Loss
1,0.3711,0.392259
2,0.3495,0.377172
3,0.3275,0.368737
4,0.3245,0.365832
5,0.3198,0.365865



📊 Evaluating...


Calculating BLEU score...

✅ Exp2_MediumLR COMPLETED!
Train Loss:   0.5717
Val Loss:     0.3658
Perplexity:   1.44
BLEU Score:   88.69%
ROUGE-L Score: 94.81%

Improvement over Baseline:
  Loss:       +96.93%
  Perplexity: +100.00%
  BLEU:       +31.06%
  ROUGE-L:    +23.83%

Resources:
  GPU Memory: 6.28 GB (peak)
  Time:       47.70 minutes
🧪 EXPERIMENT 3/3: Exp3_LowLR
LR: 2e-05, Batch: 2, Accum: 2, Epochs: 5
Effective batch size: 4
--------------------------------------------------------------------------------
Trainable parameters: 4,505,600

🚀 Training...


Epoch,Training Loss,Validation Loss
1,0.3739,0.399332
2,0.3726,0.387899
3,0.3528,0.381757
4,0.3447,0.379419
5,0.3454,0.379074



📊 Evaluating...


Calculating BLEU score...

✅ Exp3_LowLR COMPLETED!
Train Loss:   0.5902
Val Loss:     0.3791
Perplexity:   1.46
BLEU Score:   74.85%
ROUGE-L Score: 80.96%

Improvement over Baseline:
  Loss:       +96.81%
  Perplexity: +100.00%
  BLEU:       +10.61%
  ROUGE-L:    +5.74%

Resources:
  GPU Memory: 2.83 GB (peak)
  Time:       14.43 minutes


## Results Analysis

In [34]:
if len(results) == 0:
    print("⚠️ No experiments completed successfully")
else:
    print("\n📊 RESULTS SUMMARY")
    print("="*80)

    results_df = pd.DataFrame(results)

    # Add baseline row
    baseline_row = pd.DataFrame([{
        "Experiment": "⭐ BASELINE",
        "Learning_Rate": "-",
        "Batch_Size": "-",
        "Grad_Accum": "-",
        "Effective_Batch": "-",
        "Epochs": "-",
        "Train_Loss": "-",
        "Val_Loss": round(baseline_loss, 4),
        "Perplexity": round(baseline_perplexity, 2),
        "BLEU_Score": round(baseline_bleu, 2),
        "ROUGE_L_Score": round(baseline_rouge_l, 2),
        "Loss_Improvement_%": 0.0,
        "Perplexity_Improvement_%": 0.0,
        "BLEU_Improvement_%": 0.0,
        "ROUGE_L_Improvement_%": 0.0,
        "GPU_Memory_GB": "-",
        "Time_Min": "-"
    }])

    full_results = pd.concat([baseline_row, results_df], ignore_index=True)

    print("\n" + full_results.to_string(index=False))

    # Save to CSV
    full_results.to_csv("experiment_results.csv", index=False)
    print("\n✅ Results saved to: experiment_results.csv")


📊 RESULTS SUMMARY

   Experiment Learning_Rate Batch_Size Grad_Accum Effective_Batch Epochs Train_Loss  Val_Loss  Perplexity  BLEU_Score  ROUGE_L_Score  Loss_Improvement_%  Perplexity_Improvement_%  BLEU_Improvement_%  ROUGE_L_Improvement_% GPU_Memory_GB Time_Min
   ⭐ BASELINE             -          -          -               -      -          -   11.8996   147204.22       67.67          76.57                0.00                       0.0                0.00                   0.00             -        -
  Exp1_HighLR        0.0001          2          4               8      5     0.5437    0.3521        1.42       89.03          95.11               97.04                     100.0               31.56                  24.21          1.52    13.56
Exp2_MediumLR       0.00005          1          3               3      5     0.5717    0.3658        1.44       88.69          94.81               96.93                     100.0               31.06                  23.83          6.28     

In [35]:
if len(results) > 0:
    # Best model analysis
    print("\n" + "="*80)
    print("🏆 BEST MODEL ANALYSIS")
    print("="*80)

    best = results_df.loc[results_df['Val_Loss'].idxmin()]
    print(f"\nBest Experiment: {best['Experiment']}")
    print(f"Val Loss: {best['Val_Loss']:.4f}")
    print(f"Perplexity: {best['Perplexity']:.2f}")
    print(f"Improvement: {best['Loss_Improvement_%']:+.2f}%")
    print(f"Hyperparameters:")
    print(f"  - Learning Rate: {best['Learning_Rate']}")
    print(f"  - Effective Batch: {best['Effective_Batch']}")
    print(f"  - Epochs: {best['Epochs']}")

    # Check improvement threshold
    max_imp = results_df['Loss_Improvement_%'].max()
    if max_imp >= 10:
        print(f"\n✅ SUCCESS: {max_imp:.2f}%")
    else:
        print(f"\n📝 Note: {max_imp:.2f}% improvement achieved")


🏆 BEST MODEL ANALYSIS

Best Experiment: Exp1_HighLR
Val Loss: 0.3521
Perplexity: 1.42
Improvement: +97.04%
Hyperparameters:
  - Learning Rate: 0.0001
  - Effective Batch: 8
  - Epochs: 5

✅ SUCCESS: 97.04%
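
The improvement percentages reported above are relative changes against the baseline. The sketch below recomputes them for Exp1 from the rounded values in the logs; the helper `relative_improvement` is hypothetical and mirrors the notebook's formulas, with the sign flipped for metrics where lower is better.

```python
def relative_improvement(baseline, value, higher_is_better):
    """Percentage gain over the baseline, positive when the model improved."""
    delta = (value - baseline) if higher_is_better else (baseline - value)
    return 100 * delta / baseline


# Rounded Exp1 and baseline values taken from the logs above.
loss_gain = relative_improvement(11.8996, 0.3521, higher_is_better=False)
bleu_gain = relative_improvement(67.67, 89.03, higher_is_better=True)
rouge_gain = relative_improvement(76.57, 95.11, higher_is_better=True)
print(f"loss {loss_gain:+.2f}%, BLEU {bleu_gain:+.2f}%, ROUGE-L {rouge_gain:+.2f}%")
```

Because the baseline perplexity is so large, its relative improvement saturates near 100% for all three runs and says little about the differences between them; validation loss and the generation metrics are more informative for ranking.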


In [36]:
print("\n" + "="*80)
print("SAVING BEST MODEL FOR DEPLOYMENT")
print("="*80)

if best_model_path and os.path.exists(best_model_path):
    # Create a clean 'best_model' directory
    deployment_path = "./best_model"
    
    if os.path.exists(deployment_path):
        shutil.rmtree(deployment_path)
    
    # Copy the best model to deployment directory
    shutil.copytree(best_model_path, deployment_path)
    
    print(f"\n✅ Best model copied to: {deployment_path}/")
    print(f"   Source: {best_model_path}")
    
    # Verify files
    files = os.listdir(deployment_path)
    print(f"\n📁 Model files ({len(files)} files):")
    for file in sorted(files):
        print(f"   - {file}")
    


SAVING BEST MODEL FOR DEPLOYMENT

✅ Best model copied to: ./best_model/
   Source: ./best_Exp1_HighLR

📁 Model files (9 files):
   - README.md
   - adapter_config.json
   - adapter_model.safetensors
   - chat_template.jinja
   - special_tokens_map.json
   - tokenizer.json
   - tokenizer.model
   - tokenizer_config.json
   - training_args.bin


## Qualitative Testing

### Why Qualitative Testing?

Numbers alone don't tell the full story. We need to:
1. See actual model responses
2. Compare baseline vs fine-tuned outputs
3. Evaluate medical accuracy and relevance

### Test Questions:

We'll test 5 medical questions and compare:
- **Baseline Model**: Pre-trained (no medical fine-tuning)
- **Fine-tuned Model**: The medically-trained model

### What to Look For:

- More specific medical terminology
- More accurate clinical information
- Better structured responses
- Domain-appropriate language

In [37]:
if best_model_path:
    fine_tuned_model = AutoModelForCausalLM.from_pretrained(
        best_model_path,
        torch_dtype=torch.float16,
        device_map="auto",
    )
    print(f"✅ Loaded: {best_model_path}")
else:
    print("⚠️ No best model found, using last trained model")
    fine_tuned_model = model

# Also load baseline for comparison
baseline_model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float16,
    device_map="auto",
)



✅ Loaded: ./best_Exp1_HighLR


In [38]:
# Define test questions
test_questions = [
    "What are the symptoms of diabetes?",
    "How is hypertension treated?",
    "What causes pneumonia?",
    "Explain the difference between Type 1 and Type 2 diabetes.",
    "What are the side effects of aspirin?"
]


In [39]:
def generate_response(model, question, max_tokens=100):
    """Generate response from a model"""
    prompt = f"<|user|>\nAnswer this question truthfully\n{question}\n<|assistant|>\n"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_tokens,
            do_sample=False,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )
    
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # Extract only the assistant's response
    if "<|assistant|>" in response:
        response = response.split("<|assistant|>")[-1].strip()
    
    return response

print("✅ Response generation function ready")

✅ Response generation function ready


In [40]:
# Run qualitative comparison
print("\n" + "="*80)
print("QUALITATIVE EVALUATION: Baseline vs Fine-tuned")
print("="*80)

for i, question in enumerate(test_questions, 1):
    print(f"\n{'='*80}")
    print(f"Question {i}: {question}")
    print("="*80)
    
    # Baseline response
    print("\n BASELINE (Pre-trained):")
    baseline_response = generate_response(baseline_model, question)
    print(baseline_response)
    
    # Fine-tuned response
    print("\n FINE-TUNED (Medical):")
    finetuned_response = generate_response(fine_tuned_model, question)
    print(finetuned_response)
    
    print("\n" + "-"*80)


QUALITATIVE EVALUATION: Baseline vs Fine-tuned

Question 1: What are the symptoms of diabetes?

 BASELINE (Pre-trained):
The symptoms of diabetes can vary depending on the type of diabetes and the individual's overall health. However, here are some common symptoms:
1. Blurred vision: Diabetes can cause blurred vision, especially in the early stages.
2. Thirst: People with diabetes may experience increased thirst and urination, which can lead to dehydration.
3. Dry mouth: Diabetes can cause

 FINE-TUNED (Medical):
Diabetes mellitus is characterized by the presence of polyuria and polydipsia, which are the classic symptoms of the disease.

--------------------------------------------------------------------------------

Question 2: How is hypertension treated?

 BASELINE (Pre-trained):
Hypertension is treated with medication, lifestyle changes, and sometimes surgery. Medication can help lower blood pressure by reducing the amount of blood that is pumped out of the heart. This can be don