# HealthBot: Notebook - All Experiments
## Fine-tuning TinyLlama with LoRA - Complete Hyperparameter Study

**Domain:** Healthcare / Medical Q&A  
**Model:** TinyLlama-1.1B-Chat  
**Technique:** Parameter-Efficient Fine-Tuning (PEFT) with LoRA  
**Dataset:** MedQuAD (Medical Question Answering Dataset)

---

## Experiments Overview

| Experiment | LR | Batch Size | Grad Accum | Epochs | LoRA r | Effective Batch | Expected Train Time | Expected Train Loss | Expected Val Loss |
|------------|-------|-------|------------|--------|--------|-----------|------------|---------------------|-------------------|
| **Exp 1**  | 2e-4  | 2     | 4          | 1      | 8      | 8         | ~18 min    | 1.82                | 1.91              |
| **Exp 2**  | 1e-4  | 2     | 4          | 2      | 16     | 8         | ~38 min    | 1.65                | 1.74              |
| **Exp 3**  | 2e-4  | 4     | 2          | 2      | 16     | 8         | ~36 min    | 1.58                | 1.68              |
| **Exp 4**  | 5e-5  | 2     | 8          | 3      | 16     | 16        | ~58 min    | 1.48                | 1.57              |

**Total Expected Runtime:** ~2.5 hours
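
The effective batch sizes in the table follow from per-device batch size times gradient accumulation steps. A minimal sketch of that arithmetic is below, assuming a single GPU and the 1,200-example training split prepared in Section 2. The `schedule` helper and the rounding convention are illustrative assumptions, not part of the notebook.

```python
import math

N_TRAIN = 1200     # training examples after the train/val/test split
EVAL_STEPS = 100   # eval_steps value used in every experiment

# (per-device batch size, gradient accumulation steps, epochs) from the table
EXPERIMENTS = {
    "Exp 1": (2, 4, 1),
    "Exp 2": (2, 4, 2),
    "Exp 3": (4, 2, 2),
    "Exp 4": (2, 8, 3),
}

def schedule(batch, accum, epochs, n_train=N_TRAIN):
    """Return (effective batch, optimizer steps, steps at which evaluation runs)."""
    effective_batch = batch * accum
    # Batches per epoch, then optimizer updates per epoch (rounding up is an assumption;
    # every configuration here divides evenly, so the convention does not matter).
    steps_per_epoch = math.ceil(math.ceil(n_train / batch) / accum)
    total_steps = steps_per_epoch * epochs
    eval_at = list(range(EVAL_STEPS, total_steps + 1, EVAL_STEPS))
    return effective_batch, total_steps, eval_at

for name, cfg in EXPERIMENTS.items():
    eff, total, eval_at = schedule(*cfg)
    print(f"{name}: effective batch={eff}, optimizer steps={total}, evals at {eval_at}")
```

Under these assumptions Exp 1 runs 150 optimizer steps, Exp 2 and Exp 3 run 300 each, and Exp 4 runs 225, so the number of intermediate evaluations differs between experiments.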

---

### Purpose
This master notebook runs all four hyperparameter experiments sequentially, allowing you to:
- Train all variants in one go
- Compare convergence patterns across configurations
- Identify optimal hyperparameters for your use case
- Analyze trade-offs between training time and model quality


## 1. Environment Setup & Dependencies

**Run once at the start, then restart runtime before continuing.**

In [1]:
# ============================================================================
# COMPLETE PACKAGE INSTALLATION
# ============================================================================
# After running this cell, RESTART YOUR RUNTIME!
# (Runtime → Restart runtime in Colab)
# ============================================================================

!pip install -q transformers datasets peft accelerate bitsandbytes gradio evaluate rouge_score bert_score nltk sentencepiece torch trl

print(' All packages installed successfully!')
print(' CRITICAL: Restart your runtime now!')
print(' In Colab: Runtime → Restart runtime')
print(' Then continue from the next cell')

 All packages installed successfully!
 CRITICAL: Restart your runtime now!
 In Colab: Runtime → Restart runtime
 Then continue from the next cell


In [2]:
import os
import json
import time
import warnings
import numpy as np
import pandas as pd
import nltk
import torch
import evaluate
import matplotlib.pyplot as plt
from datetime import datetime
from collections import defaultdict

warnings.filterwarnings('ignore')
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)

# HuggingFace & PEFT imports
from datasets import load_dataset, Dataset, DatasetDict
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    TrainingArguments,
    BitsAndBytesConfig,
    pipeline,
    GenerationConfig
)
from peft import (
    LoraConfig,
    get_peft_model,
    TaskType,
    PeftModel
)
from trl import SFTTrainer
from sklearn.model_selection import train_test_split

# Check GPU availability
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f'  Device: {device}')
if device == 'cuda':
    gpu_name = torch.cuda.get_device_name(0)
    gpu_mem = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f'   GPU: {gpu_name}')
    print(f'   VRAM: {gpu_mem:.1f} GB')

  Device: cuda
   GPU: Tesla T4
   VRAM: 15.6 GB


## 2. Dataset Preparation (Shared Across All Experiments)

In [3]:
# ─────────────────────────────────────────────
# 2.1  Load and Clean MedQuAD Dataset
# ─────────────────────────────────────────────
import re

print(' Loading MedQuAD dataset from Hugging Face...')
raw_dataset = load_dataset('lavita/MedQuAD')  # trust_remote_code is no longer supported and is not needed here
df = raw_dataset['train'].to_pandas()

def clean_text(text: str) -> str:
    """Normalize and clean raw text."""
    if not isinstance(text, str):
        return ''
    text = re.sub(r'<[^>]+>', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    text = text.strip()
    return text

Q_COL = 'question'
A_COL = 'answer'

df['question_clean'] = df[Q_COL].apply(clean_text)
df['answer_clean']   = df[A_COL].apply(clean_text)

# Remove rows whose question is 10 characters or shorter, or whose answer is 20 characters or shorter
df = df[(df['question_clean'].str.len() > 10) &
        (df['answer_clean'].str.len()   > 20)].copy()

# Filter out very long answers (more than 300 words)
df['answer_word_count'] = df['answer_clean'].apply(lambda x: len(x.split()))
df = df[df['answer_word_count'] <= 300].copy()

print(f' Cleaned dataset size: {len(df)}')
print(f' Avg question length: {df["question_clean"].apply(lambda x: len(x.split())).mean():.1f} words')
print(f' Avg answer length: {df["answer_word_count"].mean():.1f} words')



 Loading MedQuAD dataset from Hugging Face...



Generating train split:   0%|          | 0/47441 [00:00<?, ? examples/s]

 Cleaned dataset size: 13667
 Avg question length: 8.2 words
 Avg answer length: 129.1 words
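
To make the cleaning step concrete, the sketch below repeats the same two regular expressions on a made-up input string (the sample text is an assumption, not a MedQuAD row). It also shows a side effect worth knowing: replacing tags with a space can leave a space before punctuation.

```python
import re

def clean_text(text):
    """Same logic as the notebook's clean_text: strip tags, collapse whitespace."""
    if not isinstance(text, str):
        return ''
    text = re.sub(r'<[^>]+>', ' ', text)   # replace anything tag-shaped with a space
    text = re.sub(r'\s+', ' ', text)       # collapse runs of whitespace
    return text.strip()

sample = "<p>What is   <b>glaucoma</b>?</p>\n"   # made-up input for illustration
cleaned = clean_text(sample)
print(repr(cleaned))

# Non-string values (e.g. missing answers) become empty strings,
# which the length filter then removes.
print(repr(clean_text(None)))
```

Because the closing `</b>` tag becomes a space, the cleaned question reads `What is glaucoma ?`. If that matters for your use case, one option is to replace tags with an empty string instead, at the cost of occasionally gluing adjacent words together.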


In [4]:
# ─────────────────────────────────────────────
# 2.2  Format Data with the TinyLlama Chat Template
# ─────────────────────────────────────────────
SYSTEM_PROMPT = (
    "You are HealthBot, a knowledgeable and empathetic medical information assistant. "
    "You provide accurate, evidence-based health information to help users understand "
    "medical conditions, symptoms, and treatments. Always remind users to consult a "
    "qualified healthcare professional for personal medical advice."
)

def format_instruction(question: str, answer: str) -> str:
    """Format a QA pair into the TinyLlama chat format (Zephyr-style role tags, not ChatML)."""
    return (
        f"<|system|>\n{SYSTEM_PROMPT}</s>\n"
        f"<|user|>\n{question}</s>\n"
        f"<|assistant|>\n{answer}</s>"
    )

# Sample 1,500 examples
TARGET_N = 1500
df_sample = df.sample(n=min(TARGET_N, len(df)), random_state=42).reset_index(drop=True)

df_sample['text'] = df_sample.apply(
    lambda row: format_instruction(row['question_clean'], row['answer_clean']), axis=1
)

print(f' Formatted {len(df_sample)} examples')

 Formatted 1500 examples
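
The sketch below shows what one formatted training string looks like and how a matching inference prompt could be built. `format_prompt` is a hypothetical helper that does not appear in this notebook, and the system prompt is shortened here for readability.

```python
SYSTEM_PROMPT = "You are HealthBot."  # shortened stand-in for the full system prompt

def format_instruction(question, answer):
    """Training string: system, user, and assistant turns, each closed with </s>."""
    return (
        f"<|system|>\n{SYSTEM_PROMPT}</s>\n"
        f"<|user|>\n{question}</s>\n"
        f"<|assistant|>\n{answer}</s>"
    )

def format_prompt(question):
    """Hypothetical inference prompt: same layout, stopping after the assistant tag."""
    return (
        f"<|system|>\n{SYSTEM_PROMPT}</s>\n"
        f"<|user|>\n{question}</s>\n"
        f"<|assistant|>\n"
    )

example = format_instruction("What is anemia?", "Anemia is a low red blood cell count.")
print(example)
```

Keeping the inference prompt an exact prefix of the training string means the model sees the same token layout at generation time as it did during fine-tuning.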


In [5]:
# ─────────────────────────────────────────────
# 2.3  Train/Val/Test Split
# ─────────────────────────────────────────────
train_df, temp_df = train_test_split(df_sample, test_size=0.2, random_state=42)
val_df, test_df   = train_test_split(temp_df, test_size=0.5, random_state=42)

train_dataset = Dataset.from_pandas(train_df[['text']].reset_index(drop=True))
val_dataset   = Dataset.from_pandas(val_df[['text']].reset_index(drop=True))
test_dataset  = Dataset.from_pandas(test_df[['text']].reset_index(drop=True))

print(f' Train: {len(train_dataset)} | Val: {len(val_dataset)} | Test: {len(test_dataset)}')

 Train: 1200 | Val: 150 | Test: 150
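
The two-stage split yields an 80/10/10 partition. A small sketch of the size arithmetic follows; it assumes the held-out side of each fractional split is rounded up, which is how `train_test_split` is understood to behave. `split_sizes` is an illustrative helper, not a scikit-learn function.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_kept, n_held_out) for a fractional test_size, rounding the held-out side up."""
    n_held_out = math.ceil(test_size * n_samples)
    return n_samples - n_held_out, n_held_out

# Stage 1: hold out 20% of the 1,500 sampled examples
n_train, n_temp = split_sizes(1500, 0.2)
# Stage 2: split the held-out 20% in half for validation and test
n_val, n_test = split_sizes(n_temp, 0.5)

print(f"Train: {n_train} | Val: {n_val} | Test: {n_test}")
```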


In [6]:
# ─────────────────────────────────────────────
# 2.4  Mount Google Drive
# ─────────────────────────────────────────────
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/MyDrive/ML-Techniques-Fine-tuning

Mounted at /content/drive
/content/drive/MyDrive/ML-Techniques-Fine-tuning


## 3. Model Loading & Tokenizer (Shared Setup)

In [7]:
# ─────────────────────────────────────────────
# 3.1  Load Tokenizer
# ─────────────────────────────────────────────
MODEL_NAME = 'TinyLlama/TinyLlama-1.1B-Chat-v1.0'

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token    = tokenizer.eos_token
tokenizer.padding_side = 'right'

print(f' Tokenizer loaded | Vocab: {tokenizer.vocab_size}')


 Tokenizer loaded | Vocab: 32000


In [8]:
# ─────────────────────────────────────────────
# 3.2  Quantization Config (Reusable)
# ─────────────────────────────────────────────
bnb_config = BitsAndBytesConfig(
    load_in_4bit               = True,
    bnb_4bit_use_double_quant  = True,
    bnb_4bit_quant_type        = 'nf4',
    bnb_4bit_compute_dtype     = torch.float16
)

print(' Quantization config ready (4-bit QLoRA)')

 Quantization config ready (4-bit QLoRA)
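
As a rough guide to why 4-bit loading helps on a 16 GB GPU, the sketch below estimates weight storage alone. It assumes the nominal 1.1B parameter count from the model name and ignores quantization constants, layers that stay unquantized, activations, and optimizer state, so treat the 4-bit figure as a lower bound.

```python
N_PARAMS = 1.1e9  # nominal parameter count taken from the model name (assumption)

def weight_memory_gb(n_params, bits_per_param):
    """Storage for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

fp16_gb = weight_memory_gb(N_PARAMS, 16)
nf4_gb = weight_memory_gb(N_PARAMS, 4)

print(f"fp16 weights:  ~{fp16_gb:.2f} GB")
print(f"4-bit weights: ~{nf4_gb:.2f} GB")
```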


---
---

# EXPERIMENT 1: Fast Baseline (LR=2e-4, 1 epoch, r=8)

**Configuration:**
- Learning Rate: 2e-4
- Batch Size: 2
- Gradient Accumulation: 4
- Epochs: 1
- LoRA Rank: 8
- Expected Time: ~18 minutes

In [9]:
# ════════════════════════════════════════════════════════════════════════════════
# EXPERIMENT 1 - MODEL & LORA
# ════════════════════════════════════════════════════════════════════════════════
print('\n' + '='*80)
print(' EXPERIMENT 1: Fast Baseline (LR=2e-4, 1 epoch, r=8)')
print('='*80 + '\n')

# Load base model
base_model_1 = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config = bnb_config,
    device_map          = 'auto',
    trust_remote_code   = True
)
base_model_1.config.use_cache = False
base_model_1.config.pretraining_tp = 1

# LoRA Config - Experiment 1
lora_config_1 = LoraConfig(
    task_type      = TaskType.CAUSAL_LM,
    r              = 8,
    lora_alpha     = 16,
    lora_dropout   = 0.05,
    target_modules = ['q_proj', 'k_proj', 'v_proj', 'o_proj',
                      'gate_proj', 'up_proj', 'down_proj'],
    bias           = 'none',
)

model_1 = get_peft_model(base_model_1, lora_config_1)
trainable = sum(p.numel() for p in model_1.parameters() if p.requires_grad)
print(f' Trainable params: {trainable/1e6:.2f}M')


 EXPERIMENT 1: Fast Baseline (LR=2e-4, 1 epoch, r=8)




 Trainable params: 6.31M
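
The trainable parameter count can be reproduced by hand: each adapted linear layer adds `r * (in_features + out_features)` weights. The sketch below assumes TinyLlama-1.1B's layer dimensions (22 layers, hidden size 2048, MLP size 5632, and a 256-wide key/value projection from grouped-query attention). These dimensions are not stated in this notebook and are taken as assumptions.

```python
# Assumed TinyLlama-1.1B dimensions (not stated in this notebook)
HIDDEN = 2048
INTERMEDIATE = 5632
KV_DIM = 256       # 4 key/value heads x 64 dims each (grouped-query attention)
N_LAYERS = 22

# (in_features, out_features) for each projection listed in target_modules
TARGETS = {
    'q_proj':    (HIDDEN, HIDDEN),
    'k_proj':    (HIDDEN, KV_DIM),
    'v_proj':    (HIDDEN, KV_DIM),
    'o_proj':    (HIDDEN, HIDDEN),
    'gate_proj': (HIDDEN, INTERMEDIATE),
    'up_proj':   (HIDDEN, INTERMEDIATE),
    'down_proj': (INTERMEDIATE, HIDDEN),
}

def lora_param_count(r):
    """LoRA adds A (r x in) and B (out x r) per adapted layer: r * (in + out) weights."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in TARGETS.values())
    return per_layer * N_LAYERS

for r in (8, 16):
    print(f"r={r}: {lora_param_count(r) / 1e6:.2f}M trainable parameters")
```

Under these assumptions the count is 6.31M for `r=8`, matching the output above, and it doubles to 12.62M for `r=16` because the adapter size scales linearly with the rank.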


In [10]:
# ════════════════════════════════════════════════════════════════════════════════
# EXPERIMENT 1 - TRAINING
# ════════════════════════════════════════════════════════════════════════════════
OUTPUT_DIR_1 = '/content/drive/MyDrive/ML-Techniques-Fine-tuning/healthbot_tinyllama_lora_exp1'

training_args_1 = TrainingArguments(
    output_dir                  = OUTPUT_DIR_1,
    num_train_epochs            = 1,
    per_device_train_batch_size = 2,
    per_device_eval_batch_size  = 2,
    gradient_accumulation_steps = 4,
    learning_rate               = 2e-4,
    lr_scheduler_type           = 'cosine',
    warmup_ratio                = 0.05,
    weight_decay                = 0.001,
    optim                       = 'paged_adamw_8bit',
    fp16                        = False,
    bf16                        = False,
    max_grad_norm               = 0.3,
    gradient_checkpointing      = True,
    logging_steps               = 25,
    eval_strategy               = 'steps',
    eval_steps                  = 100,
    save_strategy               = 'steps',
    save_steps                  = 200,
    load_best_model_at_end      = True,
    metric_for_best_model       = 'loss',
    report_to                   = 'none',
    push_to_hub                 = False,
)

if torch.cuda.is_available():
    torch.cuda.empty_cache()

model_1.gradient_checkpointing_enable()

trainer_1 = SFTTrainer(
    model=model_1,
    args=training_args_1,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

print(' Starting Experiment 1 training...')
start_time_1 = time.time()
train_result_1 = trainer_1.train()
elapsed_1 = (time.time() - start_time_1) / 60

print(f'\n Experiment 1 complete in {elapsed_1:.1f} minutes')
print(f' Train loss: {train_result_1.training_loss:.4f}')

model_1.save_pretrained(OUTPUT_DIR_1)
tokenizer.save_pretrained(OUTPUT_DIR_1)
print(f' Model saved to {OUTPUT_DIR_1}')

warmup_ratio is deprecated and will be removed in v5.2. Use `warmup_steps` instead.

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'pad_token_id': 2}.


 Starting Experiment 1 training...


| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 100  | 0.813318      | 0.829797        |



 Experiment 1 complete in 10.7 minutes
 Train loss: 0.9563
 Model saved to /content/drive/MyDrive/ML-Techniques-Fine-tuning/healthbot_tinyllama_lora_exp1
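
The deprecation warning in the training output suggests passing `warmup_steps` instead of `warmup_ratio`. A minimal sketch of the conversion is below. It assumes the ratio is applied to the total number of optimizer steps and rounded up, and it derives the step totals from 1,200 training examples divided by each effective batch size, times the number of epochs.

```python
import math

WARMUP_RATIO = 0.05  # value used in every experiment

# Total optimizer steps per experiment: (1200 / effective batch) * epochs
TOTAL_STEPS = {"Exp 1": 150, "Exp 2": 300, "Exp 3": 300, "Exp 4": 225}

def warmup_steps(total_steps, ratio=WARMUP_RATIO):
    """Equivalent warmup length in steps, rounding up (assumed convention)."""
    return math.ceil(total_steps * ratio)

for name, total in TOTAL_STEPS.items():
    print(f"{name}: warmup_steps={warmup_steps(total)} of {total} total steps")
```

The resulting integer could then be passed as `warmup_steps` in `TrainingArguments` in place of `warmup_ratio`.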


---
---

# EXPERIMENT 2: Moderate Training (LR=1e-4, 2 epochs, r=16)

**Configuration:**
- Learning Rate: 1e-4
- Batch Size: 2
- Gradient Accumulation: 4
- Epochs: 2
- LoRA Rank: 16
- Expected Time: ~38 minutes

In [11]:
# ════════════════════════════════════════════════════════════════════════════════
# EXPERIMENT 2 - MODEL & LORA
# ════════════════════════════════════════════════════════════════════════════════
print('\n' + '='*80)
print(' EXPERIMENT 2: Moderate Training (LR=1e-4, 2 epochs, r=16)')
print('='*80 + '\n')

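# Clean up previous experiment and release GPU memory
import gc
del model_1, base_model_1, trainer_1
gc.collect()  # drop lingering Python references before clearing the CUDA cache
if torch.cuda.is_available():
    torch.cuda.empty_cache()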

# Load base model
base_model_2 = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config = bnb_config,
    device_map          = 'auto',
    trust_remote_code   = True
)
base_model_2.config.use_cache = False
base_model_2.config.pretraining_tp = 1

# LoRA Config - Experiment 2
lora_config_2 = LoraConfig(
    task_type      = TaskType.CAUSAL_LM,
    r              = 16,
    lora_alpha     = 32,
    lora_dropout   = 0.05,
    target_modules = ['q_proj', 'k_proj', 'v_proj', 'o_proj',
                      'gate_proj', 'up_proj', 'down_proj'],
    bias           = 'none',
)

model_2 = get_peft_model(base_model_2, lora_config_2)
trainable = sum(p.numel() for p in model_2.parameters() if p.requires_grad)
print(f' Trainable params: {trainable/1e6:.2f}M')


 EXPERIMENT 2: Moderate Training (LR=1e-4, 2 epochs, r=16)




 Trainable params: 12.62M


In [12]:
# ════════════════════════════════════════════════════════════════════════════════
# EXPERIMENT 2 - TRAINING
# ════════════════════════════════════════════════════════════════════════════════
OUTPUT_DIR_2 = '/content/drive/MyDrive/ML-Techniques-Fine-tuning/healthbot_tinyllama_lora_exp2'

training_args_2 = TrainingArguments(
    output_dir                  = OUTPUT_DIR_2,
    num_train_epochs            = 2,
    per_device_train_batch_size = 2,
    per_device_eval_batch_size  = 2,
    gradient_accumulation_steps = 4,
    learning_rate               = 1e-4,
    lr_scheduler_type           = 'cosine',
    warmup_ratio                = 0.05,
    weight_decay                = 0.001,
    optim                       = 'paged_adamw_8bit',
    fp16                        = False,
    bf16                        = False,
    max_grad_norm               = 0.3,
    gradient_checkpointing      = True,
    logging_steps               = 25,
    eval_strategy               = 'steps',
    eval_steps                  = 100,
    save_strategy               = 'steps',
    save_steps                  = 200,
    load_best_model_at_end      = True,
    metric_for_best_model       = 'loss',
    report_to                   = 'none',
    push_to_hub                 = False,
)

if torch.cuda.is_available():
    torch.cuda.empty_cache()

model_2.gradient_checkpointing_enable()

trainer_2 = SFTTrainer(
    model=model_2,
    args=training_args_2,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

print(' Starting Experiment 2 training...')
start_time_2 = time.time()
train_result_2 = trainer_2.train()
elapsed_2 = (time.time() - start_time_2) / 60

print(f'\n Experiment 2 complete in {elapsed_2:.1f} minutes')
print(f' Train loss: {train_result_2.training_loss:.4f}')

model_2.save_pretrained(OUTPUT_DIR_2)
tokenizer.save_pretrained(OUTPUT_DIR_2)
print(f' Model saved to {OUTPUT_DIR_2}')

warmup_ratio is deprecated and will be removed in v5.2. Use `warmup_steps` instead.

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'pad_token_id': 2}.


 Starting Experiment 2 training...


| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 100  | 0.816575      | 0.832438        |
| 200  | 0.851431      | 0.80983         |
| 300  | 0.802692      | 0.807771        |



 Experiment 2 complete in 21.5 minutes
 Train loss: 0.8850
 Model saved to /content/drive/MyDrive/ML-Techniques-Fine-tuning/healthbot_tinyllama_lora_exp2


---
---

# EXPERIMENT 3: Larger Batch (LR=2e-4, 2 epochs, batch=4, r=16)

**Configuration:**
- Learning Rate: 2e-4
- Batch Size: 4 (larger than others)
- Gradient Accumulation: 2
- Epochs: 2
- LoRA Rank: 16
- Expected Time: ~36 minutes

In [13]:
# ════════════════════════════════════════════════════════════════════════════════
# EXPERIMENT 3 - MODEL & LORA
# ════════════════════════════════════════════════════════════════════════════════
print('\n' + '='*80)
print(' EXPERIMENT 3: Larger Batch (LR=2e-4, 2 epochs, batch=4, r=16)')
print('='*80 + '\n')

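# Clean up previous experiment and release GPU memory
import gc
del model_2, base_model_2, trainer_2
gc.collect()  # drop lingering Python references before clearing the CUDA cache
if torch.cuda.is_available():
    torch.cuda.empty_cache()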

# Load base model
base_model_3 = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config = bnb_config,
    device_map          = 'auto',
    trust_remote_code   = True
)
base_model_3.config.use_cache = False
base_model_3.config.pretraining_tp = 1

# LoRA Config - Experiment 3
lora_config_3 = LoraConfig(
    task_type      = TaskType.CAUSAL_LM,
    r              = 16,
    lora_alpha     = 32,
    lora_dropout   = 0.05,
    target_modules = ['q_proj', 'k_proj', 'v_proj', 'o_proj',
                      'gate_proj', 'up_proj', 'down_proj'],
    bias           = 'none',
)

model_3 = get_peft_model(base_model_3, lora_config_3)
trainable = sum(p.numel() for p in model_3.parameters() if p.requires_grad)
print(f' Trainable params: {trainable/1e6:.2f}M')


 EXPERIMENT 3: Larger Batch (LR=2e-4, 2 epochs, batch=4, r=16)




 Trainable params: 12.62M


In [14]:
# ════════════════════════════════════════════════════════════════════════════════
# EXPERIMENT 3 - TRAINING
# ════════════════════════════════════════════════════════════════════════════════
OUTPUT_DIR_3 = '/content/drive/MyDrive/ML-Techniques-Fine-tuning/healthbot_tinyllama_lora_exp3'

training_args_3 = TrainingArguments(
    output_dir                  = OUTPUT_DIR_3,
    num_train_epochs            = 2,
    per_device_train_batch_size = 4,  # LARGER BATCH
    per_device_eval_batch_size  = 4,
    gradient_accumulation_steps = 2,  # LOWER ACCUMULATION
    learning_rate               = 2e-4,
    lr_scheduler_type           = 'cosine',
    warmup_ratio                = 0.05,
    weight_decay                = 0.001,
    optim                       = 'paged_adamw_8bit',
    fp16                        = False,
    bf16                        = False,
    max_grad_norm               = 0.3,
    gradient_checkpointing      = True,
    logging_steps               = 25,
    eval_strategy               = 'steps',
    eval_steps                  = 100,
    save_strategy               = 'steps',
    save_steps                  = 200,
    load_best_model_at_end      = True,
    metric_for_best_model       = 'loss',
    report_to                   = 'none',
    push_to_hub                 = False,
)

if torch.cuda.is_available():
    torch.cuda.empty_cache()

model_3.gradient_checkpointing_enable()

trainer_3 = SFTTrainer(
    model=model_3,
    args=training_args_3,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

print(' Starting Experiment 3 training...')
start_time_3 = time.time()
train_result_3 = trainer_3.train()
elapsed_3 = (time.time() - start_time_3) / 60

print(f'\n Experiment 3 complete in {elapsed_3:.1f} minutes')
print(f' Train loss: {train_result_3.training_loss:.4f}')

model_3.save_pretrained(OUTPUT_DIR_3)
tokenizer.save_pretrained(OUTPUT_DIR_3)
print(f' Model saved to {OUTPUT_DIR_3}')

warmup_ratio is deprecated and will be removed in v5.2. Use `warmup_steps` instead.

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'pad_token_id': 2}.


 Starting Experiment 3 training...


| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 100  | 0.796854      | 0.840671        |
| 200  | 0.789366      | 0.822852        |
| 300  | 0.740477      | 0.820834        |



 Experiment 3 complete in 21.2 minutes
 Train loss: 0.8330
 Model saved to /content/drive/MyDrive/ML-Techniques-Fine-tuning/healthbot_tinyllama_lora_exp3


---
---

# EXPERIMENT 4: Conservative & Extended (LR=5e-5, 3 epochs, r=16)

**Configuration:**
- Learning Rate: 5e-5 (very conservative)
- Batch Size: 2
- Gradient Accumulation: 8 (highest)
- Epochs: 3 (longest)
- LoRA Rank: 16
- Expected Time: ~58 minutes

In [15]:
# ════════════════════════════════════════════════════════════════════════════════
# EXPERIMENT 4 - MODEL & LORA
# ════════════════════════════════════════════════════════════════════════════════
print('\n' + '='*80)
print(' EXPERIMENT 4: Conservative & Extended (LR=5e-5, 3 epochs, r=16)')
print('='*80 + '\n')

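# Clean up previous experiment and release GPU memory
import gc
del model_3, base_model_3, trainer_3
gc.collect()  # drop lingering Python references before clearing the CUDA cache
if torch.cuda.is_available():
    torch.cuda.empty_cache()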

# Load base model
base_model_4 = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config = bnb_config,
    device_map          = 'auto',
    trust_remote_code   = True
)
base_model_4.config.use_cache = False
base_model_4.config.pretraining_tp = 1

# LoRA Config - Experiment 4
lora_config_4 = LoraConfig(
    task_type      = TaskType.CAUSAL_LM,
    r              = 16,
    lora_alpha     = 32,
    lora_dropout   = 0.05,
    target_modules = ['q_proj', 'k_proj', 'v_proj', 'o_proj',
                      'gate_proj', 'up_proj', 'down_proj'],
    bias           = 'none',
)

model_4 = get_peft_model(base_model_4, lora_config_4)
trainable = sum(p.numel() for p in model_4.parameters() if p.requires_grad)
print(f' Trainable params: {trainable/1e6:.2f}M')


 EXPERIMENT 4: Conservative & Extended (LR=5e-5, 3 epochs, r=16)




 Trainable params: 12.62M


In [16]:
# ════════════════════════════════════════════════════════════════════════════════
# EXPERIMENT 4 - TRAINING
# ════════════════════════════════════════════════════════════════════════════════
OUTPUT_DIR_4 = '/content/drive/MyDrive/ML-Techniques-Fine-tuning/healthbot_tinyllama_lora_exp4'

training_args_4 = TrainingArguments(
    output_dir                  = OUTPUT_DIR_4,
    num_train_epochs            = 3,  # LONGEST
    per_device_train_batch_size = 2,
    per_device_eval_batch_size  = 2,
    gradient_accumulation_steps = 8,  # HIGHEST
    learning_rate               = 5e-5,  # MOST CONSERVATIVE
    lr_scheduler_type           = 'cosine',
    warmup_ratio                = 0.05,
    weight_decay                = 0.001,
    optim                       = 'paged_adamw_8bit',
    fp16                        = False,
    bf16                        = False,
    max_grad_norm               = 0.3,
    gradient_checkpointing      = True,
    logging_steps               = 25,
    eval_strategy               = 'steps',
    eval_steps                  = 100,
    save_strategy               = 'steps',
    save_steps                  = 200,
    load_best_model_at_end      = True,
    metric_for_best_model       = 'loss',
    report_to                   = 'none',
    push_to_hub                 = False,
)

if torch.cuda.is_available():
    torch.cuda.empty_cache()

model_4.gradient_checkpointing_enable()

trainer_4 = SFTTrainer(
    model=model_4,
    args=training_args_4,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

print(' Starting Experiment 4 training...')
start_time_4 = time.time()
train_result_4 = trainer_4.train()
elapsed_4 = (time.time() - start_time_4) / 60

print(f'\n Experiment 4 complete in {elapsed_4:.1f} minutes')
print(f' Train loss: {train_result_4.training_loss:.4f}')

model_4.save_pretrained(OUTPUT_DIR_4)
tokenizer.save_pretrained(OUTPUT_DIR_4)
print(f' Model saved to {OUTPUT_DIR_4}')

warmup_ratio is deprecated and will be removed in v5.2. Use `warmup_steps` instead.

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'pad_token_id': 2}.


 Starting Experiment 4 training...


| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 100  | 0.884001      | 0.848429        |
| 200  | 0.804096      | 0.83292         |



 Experiment 4 complete in 31.0 minutes
 Train loss: 0.9627
 Model saved to /content/drive/MyDrive/ML-Techniques-Fine-tuning/healthbot_tinyllama_lora_exp4


---
---

# Training Summary

In [17]:
# ════════════════════════════════════════════════════════════════════════════════
# FINAL SUMMARY
# ════════════════════════════════════════════════════════════════════════════════
print('\n' + '='*80)
print(' ALL EXPERIMENTS COMPLETE')
print('='*80)

total_time = elapsed_1 + elapsed_2 + elapsed_3 + elapsed_4

summary_df = pd.DataFrame({
    'Experiment': ['Exp 1', 'Exp 2', 'Exp 3', 'Exp 4'],
    'LR': ['2e-4', '1e-4', '2e-4', '5e-5'],
    'Batch': [2, 2, 4, 2],
    'Grad Accum': [4, 4, 2, 8],
    'Epochs': [1, 2, 2, 3],
    'LoRA r': [8, 16, 16, 16],
    'Train Loss': [
        f"{train_result_1.training_loss:.4f}",
        f"{train_result_2.training_loss:.4f}",
        f"{train_result_3.training_loss:.4f}",
        f"{train_result_4.training_loss:.4f}"
    ],
    'Time (min)': [
        f"{elapsed_1:.1f}",
        f"{elapsed_2:.1f}",
        f"{elapsed_3:.1f}",
        f"{elapsed_4:.1f}"
    ]
})

print(f'\nTotal Training Time: {total_time:.1f} minutes ({total_time/60:.2f} hours)\n')
print(summary_df.to_string(index=False))
print('\n' + '='*80)
print(' All models saved to Google Drive')
print(' Use the comparison notebook to analyze results')
print('='*80)


 ALL EXPERIMENTS COMPLETE

Total Training Time: 84.3 minutes (1.41 hours)

Experiment   LR  Batch  Grad Accum  Epochs  LoRA r Train Loss Time (min)
     Exp 1 2e-4      2           4       1       8     0.9563       10.7
     Exp 2 1e-4      2           4       2      16     0.8850       21.5
     Exp 3 2e-4      4           2       2      16     0.8330       21.2
     Exp 4 5e-5      2           8       3      16     0.9627       31.0

 All models saved to Google Drive
 Use the comparison notebook to analyze results
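
The summary table reports only the average training loss. As a first look before the comparison notebook, the sketch below ranks the experiments by their lowest logged validation loss, using the values copied from the training logs above. These are single runs evaluated at different step counts, so small gaps should be read with caution.

```python
# (step, validation loss) pairs copied from each experiment's training log
VAL_LOGS = {
    "Exp 1": [(100, 0.829797)],
    "Exp 2": [(100, 0.832438), (200, 0.80983), (300, 0.807771)],
    "Exp 3": [(100, 0.840671), (200, 0.822852), (300, 0.820834)],
    "Exp 4": [(100, 0.848429), (200, 0.83292)],
}

# Average training loss from the summary table
TRAIN_LOSS = {"Exp 1": 0.9563, "Exp 2": 0.8850, "Exp 3": 0.8330, "Exp 4": 0.9627}

def best_eval(logs):
    """Return the (step, loss) pair with the lowest validation loss."""
    return min(logs, key=lambda pair: pair[1])

ranking = sorted(VAL_LOGS, key=lambda name: best_eval(VAL_LOGS[name])[1])

for name in ranking:
    step, loss = best_eval(VAL_LOGS[name])
    print(f"{name}: best val loss {loss:.4f} at step {step} "
          f"(avg train loss {TRAIN_LOSS[name]:.4f})")
```

On these logged values Exp 2 has the lowest validation loss, while Exp 3 has the lowest training loss, a gap that may indicate Exp 3 is fitting the training set somewhat more closely than it generalizes.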
