## Project Definition and Domain Alignment

**Domain:** Machine Learning and Artificial Intelligence Education

**Purpose:** This project builds a domain-specific chatbot that serves as an educational assistant for machine learning and AI concepts. The assistant is fine-tuned to answer questions about neural network architectures (transformers, LSTMs, CNNs), training techniques (backpropagation, gradient descent, regularization), modern AI paradigms (transfer learning, self-supervised learning, generative models), and foundational ML algorithms.

**Relevance:** As ML/AI adoption accelerates across industries, demand is growing for accessible educational tools that explain complex concepts clearly. A domain-specific assistant fine-tuned on curated ML content is intended to give more accurate, focused responses than a general-purpose model, with fewer hallucinations and greater pedagogical value. Section 4 evaluates this against the base model.

**Approach:** Generative QA using parameter-efficient fine-tuning (QLoRA) on TinyLlama-1.1B, chosen because it balances capability against the available hardware (a single GPU with 4 GB of VRAM).

In [28]:
import torch
import json
import time
import gc
import logging
import sys
import os
import numpy as np
import pandas as pd
from datasets import load_dataset, Dataset, concatenate_datasets
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, PeftModel
from trl import SFTTrainer
from sklearn.metrics import f1_score, precision_score, recall_score
import evaluate
import nltk
nltk.download("punkt_tab", quiet=True)

LOG_DIR = "./logs"
os.makedirs(LOG_DIR, exist_ok=True)

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s",
    handlers=[
        logging.FileHandler(f"{LOG_DIR}/pipeline.log", mode="w"),
        logging.StreamHandler(sys.stdout),
    ],
)
log = logging.getLogger("lora")

device = "cuda" if torch.cuda.is_available() else "cpu"
log.info(f"device: {device}")
if device == "cuda":
    log.info(f"gpu: {torch.cuda.get_device_name(0)}")
    log.info(f"vram: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} gb")

2026-02-19 00:27:18,509 [INFO] device: cuda
2026-02-19 00:27:18,510 [INFO] gpu: NVIDIA GeForce GTX 1650 Ti
2026-02-19 00:27:18,512 [INFO] vram: 4.0 gb


## 1. Dataset Collection and Preprocessing

**Data Sources:**
- **databricks/databricks-dolly-15k** (Hugging Face) - 15k human-generated instruction-response pairs across multiple categories
- **tatsu-lab/alpaca** (Hugging Face) - 52k instruction-following examples generated from text-davinci-003

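**Filtering Strategy:** Both datasets are filtered by case-insensitive substring matching against a list of 80+ ML/AI domain terms covering architectures, algorithms, frameworks, training concepts, and evaluation metrics. This narrows the data toward the domain, but substring matching also admits false positives: a short keyword such as "gan" matches inside unrelated words like "organized", so some retained examples (including the sample logged after trimming) are not about ML.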
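
A stricter alternative is to match keywords only as whole words or phrases. The following is a minimal sketch under stated assumptions: the helper names `build_keyword_pattern` and `is_ml_related_strict` and the three-keyword demo list are hypothetical and are not part of this notebook's pipeline. Adopting it would change the filtered counts logged in the cells that follow.

```python
import re

def build_keyword_pattern(keywords):
    """Compile one regex that matches any keyword only as a whole word or phrase."""
    # Longest alternatives first so multi-word phrases are tried before shorter ones.
    escaped = sorted((re.escape(kw) for kw in keywords), key=len, reverse=True)
    # (?<!\w) and (?!\w) require that no word character touches the match on either side.
    return re.compile(r"(?<!\w)(?:" + "|".join(escaped) + r")(?!\w)", re.IGNORECASE)

# Hypothetical mini keyword list, for illustration only.
demo_pattern = build_keyword_pattern(["gan", "neural network", "u-net"])

def is_ml_related_strict(text, pattern=demo_pattern):
    """Return True if any keyword appears as a standalone word or phrase."""
    return pattern.search(text) is not None

# Substring test (as used in the notebook) versus word-boundary test.
substring_hit = "gan" in "stay organized".lower()
boundary_hit = is_ml_related_strict("stay organized")
print(substring_hit, boundary_hit)
```

The trade-off is recall: a whole-word pattern no longer matches inflected forms such as "GANs" unless they are added to the list explicitly.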

**Preprocessing Pipeline:**
1. Load raw datasets from Hugging Face
2. Filter for ML/AI domain relevance using keyword matching
3. Standardize column names across sources
4. Data cleaning: remove empty/null responses, strip whitespace, deduplicate by instruction text
5. Combine and shuffle with fixed seed for reproducibility
6. Format into TinyLlama chat template (system/user/assistant roles)
7. Tokenize with the model's Llama tokenizer and analyze sequence lengths
8. Split into train (90%) and eval (10%) sets

In [29]:
dolly = load_dataset("databricks/databricks-dolly-15k", split="train")
log.info(f"dolly full size: {len(dolly)}")
log.info(f"columns: {dolly.column_names}")
log.info(f"categories: {set(dolly['category'])}")

2026-02-19 00:27:25,459 [INFO] dolly full size: 15011
2026-02-19 00:27:25,473 [INFO] columns: ['instruction', 'context', 'response', 'category']
2026-02-19 00:27:26,015 [INFO] categories: {'open_qa', 'brainstorming', 'general_qa', 'creative_writing', 'classification', 'summarization', 'closed_qa', 'information_extraction'}


In [30]:
ml_keywords = [
    "machine learning", "deep learning", "neural network", "artificial intelligence",
    "transformer", "attention mechanism", "lstm", "recurrent neural", "convolutional",
    "backpropagation", "gradient descent", "optimizer", "loss function", "activation function",
    "overfitting", "underfitting", "regularization", "dropout", "batch normalization",
    "supervised learning", "unsupervised learning", "reinforcement learning", "transfer learning",
    "natural language processing", "nlp", "computer vision", "generative adversarial",
    "autoencoder", "embedding", "tokenization", "fine-tuning", "pre-training",
    "classification", "regression", "clustering", "random forest", "decision tree",
    "support vector", "logistic regression", "linear regression", "naive bayes",
    "k-nearest", "cross-validation", "feature engineering", "dimensionality reduction",
    "principal component", "pca", "data augmentation", "hyperparameter",
    "bert", "gpt", "llm", "large language model", "diffusion model",
    "gan", "vae", "variational", "self-supervised", "contrastive learning",
    "object detection", "image segmentation", "speech recognition", "text generation",
    "sentiment analysis", "named entity", "word2vec", "glove", "fasttext",
    "tensorflow", "pytorch", "keras", "scikit-learn", "hugging face",
    "epoch", "batch size", "learning rate", "weight decay", "momentum",
    "softmax", "relu", "sigmoid", "tanh", "pooling layer",
    "residual network", "resnet", "inception", "yolo", "u-net",
    "sequence to sequence", "seq2seq", "beam search", "greedy decoding",
    "perceptron", "feedforward", "recurrent", "bidirectional",
    "data pipeline", "model deployment", "inference", "training loop",
    "confusion matrix", "precision", "recall", "f1 score", "accuracy",
    "roc curve", "auc", "mean squared error", "cross entropy"
]

def is_ml_related(example):
    text = f"{example['instruction']} {example['context']} {example['response']}".lower()
    return any(kw in text for kw in ml_keywords)

ml_dolly = dolly.filter(is_ml_related)
log.info(f"ml-related entries from dolly: {len(ml_dolly)}")

2026-02-19 00:27:33,480 [INFO] ml-related entries from dolly: 2115


In [31]:
alpaca = load_dataset("tatsu-lab/alpaca", split="train")
log.info(f"alpaca full size: {len(alpaca)}")

def is_ml_related_alpaca(example):
    text = f"{example['instruction']} {example['input']} {example['output']}".lower()
    return any(kw in text for kw in ml_keywords)

ml_alpaca = alpaca.filter(is_ml_related_alpaca)
log.info(f"ml-related entries from alpaca: {len(ml_alpaca)}")

ml_alpaca = ml_alpaca.rename_column("input", "context")
ml_alpaca = ml_alpaca.rename_column("output", "response")
if "text" in ml_alpaca.column_names:
    ml_alpaca = ml_alpaca.remove_columns(["text"])

2026-02-19 00:27:39,150 [INFO] alpaca full size: 52002
2026-02-19 00:27:39,172 [INFO] ml-related entries from alpaca: 5899


In [32]:
if "category" not in ml_alpaca.column_names:
    ml_alpaca = ml_alpaca.add_column("category", ["ml_education"] * len(ml_alpaca))

shared_cols = ["instruction", "context", "response", "category"]
ml_dolly_clean = ml_dolly.select_columns(shared_cols)
ml_alpaca_clean = ml_alpaca.select_columns(shared_cols)

combined = concatenate_datasets([ml_dolly_clean, ml_alpaca_clean])
log.info(f"combined before cleaning: {len(combined)}")

2026-02-19 00:27:42,960 [INFO] combined before cleaning: 8014


### Data Cleaning

We apply several cleaning steps to ensure data quality:
- Remove entries with empty or null instruction/response fields
- Strip leading/trailing whitespace from all text fields
- Remove duplicate entries based on instruction text
- Filter out extremely short responses (fewer than 10 characters) that lack educational value

In [33]:
def clean_example(example):
    example["instruction"] = example["instruction"].strip()
    example["response"] = example["response"].strip()
    example["context"] = example["context"].strip() if example["context"] else ""
    return example

combined = combined.map(clean_example)

before_nulls = len(combined)
combined = combined.filter(lambda x: len(x["instruction"]) > 0 and len(x["response"]) > 0)
log.info(f"removed {before_nulls - len(combined)} entries with empty instruction/response")

before_short = len(combined)
combined = combined.filter(lambda x: len(x["response"]) >= 10)
log.info(f"removed {before_short - len(combined)} entries with responses shorter than 10 chars")

seen_instructions = set()
unique_indices = []
for i, example in enumerate(combined):
    normalized = example["instruction"].lower().strip()
    if normalized not in seen_instructions:
        seen_instructions.add(normalized)
        unique_indices.append(i)

before_dedup = len(combined)
combined = combined.select(unique_indices)
log.info(f"removed {before_dedup - len(combined)} duplicate entries")

combined = combined.shuffle(seed=42)
log.info(f"final cleaned dataset: {len(combined)} entries")

2026-02-19 00:27:45,413 [INFO] removed 1 entries with empty instruction/response
2026-02-19 00:27:45,421 [INFO] removed 76 entries with responses shorter than 10 chars
2026-02-19 00:27:46,853 [INFO] removed 20 duplicate entries
2026-02-19 00:27:46,859 [INFO] final cleaned dataset: 7917 entries


In [34]:
MAX_SAMPLES = 1500
if len(combined) > MAX_SAMPLES:
    combined = combined.select(range(MAX_SAMPLES))
    log.info(f"trimmed to {MAX_SAMPLES} samples")

log.info(f"final dataset size: {len(combined)}")
log.info(f"sample instruction: {combined[0]['instruction'][:200]}")
log.info(f"sample response: {combined[0]['response'][:200]}")

2026-02-19 00:27:48,878 [INFO] trimmed to 1500 samples
2026-02-19 00:27:48,880 [INFO] final dataset size: 1500
2026-02-19 00:27:48,883 [INFO] sample instruction: Explain how using a journal can help someone stay organized.
2026-02-19 00:27:48,885 [INFO] sample response: Using a journal can help someone stay organized by providing a clear and consistent system for tracking tasks, notes, and thoughts. Journals can help an individual prioritize tasks, create a timeline 


### Tokenization and Formatting

Each example is formatted into TinyLlama's chat template with three roles:
- **system**: sets the assistant's domain expertise
- **user**: contains the instruction and optional context
- **assistant**: contains the target response

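Tokenization uses the model's Llama tokenizer (loaded here as `LlamaTokenizerFast`), which performs subword segmentation. We analyze the token length distribution to choose a maximum sequence length; the value used here, 256, sits above the median (184 tokens) but truncates 26.4% of examples.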
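
To make the trade-off behind a maximum sequence length explicit, it helps to tabulate how many examples each candidate value would truncate. The sketch below is illustrative only: `truncation_report` is a hypothetical helper, and `example_lengths` is made-up data standing in for the real token lengths computed in the next cells.

```python
def truncation_report(token_lengths, candidates):
    """Return {max_length: fraction of examples longer than max_length}."""
    n = len(token_lengths)
    return {
        max_len: sum(length > max_len for length in token_lengths) / n
        for max_len in candidates
    }

# Illustrative lengths only; the notebook computes the real ones from the formatted dataset.
example_lengths = [77, 120, 184, 240, 300, 781]
report = truncation_report(example_lengths, candidates=[128, 256, 512])
for max_len, frac in report.items():
    print(f"max_length={max_len}: {frac:.0%} of examples truncated")
```

A larger limit truncates fewer examples but raises activation memory per batch, which matters on a 4 GB GPU.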

In [35]:
MODEL_NAME = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

log.info(f"tokenizer: {type(tokenizer).__name__}")
log.info(f"vocab size: {tokenizer.vocab_size}")
log.info(f"model max length: {tokenizer.model_max_length}")

def format_prompt(example):
    ctx = f"\nContext: {example['context']}" if example["context"].strip() else ""
    text = (
        f"<|system|>\n"
        f"you are an expert ai/ml assistant specializing in machine learning, "
        f"deep learning, neural networks, and modern ai architectures. "
        f"provide clear, accurate, and educational responses.</s>\n"
        f"<|user|>\n"
        f"{example['instruction']}{ctx}</s>\n"
        f"<|assistant|>\n"
        f"{example['response']}</s>"
    )
    return {"text": text}

formatted = combined.map(format_prompt)
log.info(f"formatted sample:\n{formatted[0]['text'][:500]}")

2026-02-19 00:27:53,000 [INFO] tokenizer: LlamaTokenizerFast
2026-02-19 00:27:53,001 [INFO] vocab size: 32000
2026-02-19 00:27:53,003 [INFO] model max length: 2048
2026-02-19 00:27:53,013 [INFO] formatted sample:
<|system|>
you are an expert ai/ml assistant specializing in machine learning, deep learning, neural networks, and modern ai architectures. provide clear, accurate, and educational responses.</s>
<|user|>
Explain how using a journal can help someone stay organized.</s>
<|assistant|>
Using a journal can help someone stay organized by providing a clear and consistent system for tracking tasks, notes, and thoughts. Journals can help an individual prioritize tasks, create a timeline for achieving go


In [36]:
token_lengths = []
for example in formatted:
    tokens = tokenizer(example["text"], truncation=False)
    token_lengths.append(len(tokens["input_ids"]))

token_lengths = np.array(token_lengths)
log.info(f"token length distribution:")
log.info(f"  min: {token_lengths.min()}")
log.info(f"  mean: {token_lengths.mean():.0f}")
log.info(f"  median: {np.median(token_lengths):.0f}")
log.info(f"  max: {token_lengths.max()}")
log.info(f"  std: {token_lengths.std():.0f}")
log.info(f"  95th percentile: {np.percentile(token_lengths, 95):.0f}")

MAX_LENGTH = 256
log.info(f"selected max_length: {MAX_LENGTH}")

long_examples = (token_lengths > MAX_LENGTH).sum()
log.info(f"examples that will be truncated: {long_examples} ({long_examples/len(token_lengths)*100:.1f}%)")

Token indices sequence length is longer than the specified maximum sequence length for this model (2150 > 2048). Running this sequence through the model will result in indexing errors


2026-02-19 00:27:58,196 [INFO] token length distribution:
2026-02-19 00:27:58,198 [INFO]   min: 77
2026-02-19 00:27:58,200 [INFO]   mean: 281
2026-02-19 00:27:58,203 [INFO]   median: 184
2026-02-19 00:27:58,205 [INFO]   max: 6671
2026-02-19 00:27:58,207 [INFO]   std: 365
2026-02-19 00:27:58,210 [INFO]   95th percentile: 781
2026-02-19 00:27:58,212 [INFO] selected max_length: 256
2026-02-19 00:27:58,213 [INFO] examples that will be truncated: 396 (26.4%)


In [37]:
split = formatted.train_test_split(test_size=0.1, seed=42)
train_data = split["train"]
eval_data = split["test"]
log.info(f"train set: {len(train_data)} examples")
log.info(f"eval set: {len(eval_data)} examples")

2026-02-19 00:28:01,093 [INFO] train set: 1350 examples
2026-02-19 00:28:01,095 [INFO] eval set: 150 examples


## 2. Model Loading with QLoRA (4-bit Quantization)

**Base Model:** TinyLlama/TinyLlama-1.1B-Chat-v1.0 from Hugging Face
- 1.1 billion parameters
- Pre-trained on 3 trillion tokens from SlimPajama and StarCoder
- Chat-tuned variant with instruction-following capability

**Quantization:** 4-bit NF4 (Normal Float 4) via bitsandbytes
- Reduces model memory from ~4.4GB (FP32) to ~0.7GB
- Double quantization further compresses quantization constants
- Compute is performed in FP16, since the Turing-generation GPU used here lacks native bf16 support
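
The memory figures above follow from simple arithmetic. The sketch below is a lower-bound estimate under stated assumptions: it treats all 1.1 billion weights as quantized and uses the block sizes described for QLoRA (64 weights per quantization block, 256 first-level constants per second-level block). In practice the embedding table and output head are typically left unquantized, so the loaded footprint is larger than this estimate.

```python
N_PARAMS = 1.1e9   # parameter count stated for TinyLlama-1.1B
GB = 1e9           # decimal gigabytes, matching the ~4.4GB FP32 figure

def footprint_gb(bits_per_param, n_params=N_PARAMS):
    """Weight storage in GB for a given average number of bits per parameter."""
    return n_params * bits_per_param / 8 / GB

fp32_gb = footprint_gb(32)
fp16_gb = footprint_gb(16)

# Assumed block sizes: 64 weights share one scaling constant; with double
# quantization, 256 of those constants share one FP32 second-level constant.
BLOCK, BLOCK2 = 64, 256
overhead_single = 32 / BLOCK                          # one FP32 constant per block
overhead_double = 8 / BLOCK + 32 / (BLOCK * BLOCK2)   # 8-bit constants + FP32 second level

nf4_single_gb = footprint_gb(4 + overhead_single)
nf4_double_gb = footprint_gb(4 + overhead_double)

print(f"FP32: {fp32_gb:.2f} GB, FP16: {fp16_gb:.2f} GB")
print(f"NF4: {nf4_single_gb:.2f} GB, NF4 + double quantization: {nf4_double_gb:.2f} GB")
```

Double quantization lowers the per-parameter overhead of the scaling constants from 0.5 bits to roughly 0.127 bits under these assumptions.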

**LoRA Configuration:**
- Targets the attention projection matrices (q, k, v, o) where most task-specific adaptation occurs
- Rank is varied across experiments (16 and 32) with alpha fixed at twice the rank, to compare adapter capacity against efficiency
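
The adapter size follows directly from the rank: a LoRA pair on a projection with input width `d_in` and output width `d_out` adds `r * (d_in + d_out)` weights. The sketch below assumes the published TinyLlama-1.1B shape (hidden size 2048, 22 layers, 32 attention heads, 4 key/value heads). These dimensions are not stated in this notebook and should be checked against the model's `config.json`.

```python
# Assumed model dimensions (verify against the model's config.json).
HIDDEN = 2048
N_LAYERS = 22
N_HEADS = 32
N_KV_HEADS = 4
HEAD_DIM = HIDDEN // N_HEADS       # 64
KV_DIM = N_KV_HEADS * HEAD_DIM     # 256, because key/value heads are shared across query heads

# (d_in, d_out) of each attention projection targeted by the LoRA config.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
}

def lora_trainable_params(rank):
    """Total LoRA weights: each module gets A (rank x d_in) and B (d_out x rank)."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in PROJECTIONS.values())
    return per_layer * N_LAYERS

for rank in (16, 32):
    print(f"r={rank}: {lora_trainable_params(rank):,} trainable parameters")
```

Under these assumptions the counts are 4,505,600 for r=16 and 9,011,200 for r=32, which agree with the trainable-parameter counts logged by the experiments.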

In [38]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

In [39]:
from transformers import TrainerCallback
from trl import SFTConfig

class ProgressCallback(TrainerCallback):
    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs and state.global_step > 0:
            total = state.max_steps
            pct = state.global_step / total * 100 if total else 0
            parts = [f"step {state.global_step}/{total} ({pct:.0f}%)"]
            if "loss" in logs:
                parts.append(f"loss={logs['loss']:.4f}")
            if "eval_loss" in logs:
                parts.append(f"eval_loss={logs['eval_loss']:.4f}")
            if "learning_rate" in logs:
                parts.append(f"lr={logs['learning_rate']:.2e}")
            log.info("  " + " | ".join(parts))

    def on_epoch_begin(self, args, state, control, **kwargs):
        log.info(f"  epoch {int(state.epoch) + 1}/{int(args.num_train_epochs)} starting")

    def on_evaluate(self, args, state, control, **kwargs):
        log.info(f"  running evaluation at step {state.global_step}...")

def load_base_model():
    log.info("loading base model into gpu (4-bit quantization)...")
    mdl = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME,
        quantization_config=bnb_config,
        device_map="auto",
        torch_dtype=torch.float16,
    )
    log.info(f"model loaded, vram: {torch.cuda.memory_allocated() / 1024**3:.2f} gb")
    return mdl

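def cleanup_model(mdl):
    # NOTE: `del` only drops this function's local reference. The caller's own
    # references (e.g. `mdl` and `trainer` in run_experiment) keep the weights
    # alive until they go out of scope, so the figure logged here can stay high.
    del mdl
    gc.collect()
    torch.cuda.empty_cache()
    log.info(f"model cleaned up, vram freed: {torch.cuda.memory_allocated() / 1024**3:.2f} gb remaining")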

def run_experiment(train_data, eval_data, lora_r, lora_alpha, lr, batch_size, grad_accum, epochs, run_name):
    exp_log_path = f"{LOG_DIR}/{run_name}.log"
    exp_handler = logging.FileHandler(exp_log_path, mode="w")
    exp_handler.setFormatter(logging.Formatter("%(asctime)s [%(levelname)s] %(message)s"))
    log.addHandler(exp_handler)

    log.info(f"{'='*60}")
    log.info(f"starting experiment: {run_name}")
    log.info(f"  lora_r={lora_r}, lora_alpha={lora_alpha}, lr={lr}, batch={batch_size}, accum={grad_accum}, epochs={epochs}")

    mdl = load_base_model()

    log.info("preparing model for kbit training...")
    mdl = prepare_model_for_kbit_training(mdl, use_gradient_checkpointing=False)
    log.info("kbit preparation done")

    log.info("applying lora adapters...")
    lora_cfg = LoraConfig(
        r=lora_r,
        lora_alpha=lora_alpha,
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    )
    mdl = get_peft_model(mdl, lora_cfg)

    bf16_count = 0
    for param in mdl.parameters():
        if param.requires_grad and param.dtype == torch.bfloat16:
            param.data = param.data.to(torch.float32)
            bf16_count += 1
    if bf16_count > 0:
        log.info(f"  cast {bf16_count} bf16 params to float32 (turing gpu compatibility)")

    trainable, total = 0, 0
    for p in mdl.parameters():
        total += p.numel()
        if p.requires_grad:
            trainable += p.numel()
    log.info(f"  lora applied: {trainable:,} trainable / {total:,} total ({100 * trainable / total:.2f}%)")

    output_dir = f"./lora-ml-assistant/{run_name}"

    args = SFTConfig(
        output_dir=output_dir,
        num_train_epochs=epochs,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        gradient_accumulation_steps=grad_accum,
        gradient_checkpointing=False,
        eval_strategy="steps",
        eval_steps=50,
        save_strategy="steps",
        save_steps=50,
        logging_steps=10,
        learning_rate=lr,
        weight_decay=0.01,
        warmup_ratio=0.1,
        lr_scheduler_type="cosine",
        fp16=False,
        bf16=False,
        load_best_model_at_end=True,
        metric_for_best_model="eval_loss",
        report_to="none",
        optim="adamw_torch",
        max_length=MAX_LENGTH,
        dataset_text_field="text",
    )

    log.info("initializing trainer (tokenizing dataset, this may take a few minutes)...")
    trainer = SFTTrainer(
        model=mdl,
        train_dataset=train_data,
        eval_dataset=eval_data,
        processing_class=tokenizer,
        args=args,
    )
    trainer.add_callback(ProgressCallback())
    log.info(f"trainer ready. total steps: {trainer.state.max_steps if hasattr(trainer.state, 'max_steps') else 'calculating...'}")

    torch.cuda.reset_peak_memory_stats()
    start = time.time()
    log.info("training started")
    result = trainer.train()
    elapsed = time.time() - start
    peak_mem = torch.cuda.max_memory_allocated() / 1024**3

    logs = trainer.state.log_history
    eval_losses = [l["eval_loss"] for l in logs if "eval_loss" in l]
    best_eval = min(eval_losses) if eval_losses else None

    log.info(f"saving model to {output_dir}/final ...")
    trainer.save_model(f"{output_dir}/final")
    tokenizer.save_pretrained(f"{output_dir}/final")

    log.info(f"  train loss: {result.training_loss:.4f}")
    log.info(f"  best eval loss: {best_eval:.4f}" if best_eval else "  no eval loss recorded")
    log.info(f"  peak vram: {peak_mem:.2f} gb")
    log.info(f"  time: {elapsed/60:.1f} min")
    log.info(f"experiment {run_name} complete")
    log.info(f"{'='*60}")

    with open(f"{LOG_DIR}/{run_name}_history.json", "w") as f:
        json.dump(logs, f, indent=2)

    log.removeHandler(exp_handler)
    exp_handler.close()

    cleanup_model(mdl)

    return {
        "experiment": run_name,
        "learning_rate": str(lr),
        "batch_size": f"{batch_size} (accum {grad_accum})",
        "effective_batch": batch_size * grad_accum,
        "epochs": epochs,
        "lora_r": lora_r,
        "lora_alpha": lora_alpha,
        "trainable_params": f"{trainable:,}",
        "train_loss": round(result.training_loss, 4),
        "best_eval_loss": round(best_eval, 4) if best_eval else None,
        "peak_vram_gb": round(peak_mem, 2),
        "training_time_min": round(elapsed / 60, 1),
        "save_path": f"{output_dir}/final",
    }

## 3. Hyperparameter Experiments

We run three experiments varying key hyperparameters to find the optimal configuration. Each experiment tracks training loss, evaluation loss, GPU memory usage, and training time.

| Experiment | Variable Changed | Rationale |
|---|---|---|
| Run 1 (Baseline) | r=16, alpha=32, lr=2e-4, batch 2 × accum 8, 2 epochs | Establishes baseline performance with moderate settings |
| Run 2 | Lower learning rate (5e-5), higher LoRA rank (r=32, alpha=64) | Tests whether more capacity with slower learning improves quality |
| Run 3 | Larger per-device batch (4 × accum 4, same effective batch of 16), lower learning rate (1e-4), 1 epoch | Tests whether less training with larger per-device batches generalizes better |

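Note: Mixed precision is disabled in the trainer (`fp16=False`, `bf16=False`), so the LoRA adapters train in float32. This works around the bf16 incompatibility between bitsandbytes and the Turing GPU used here. The base weights remain 4-bit quantized, which still provides significant memory savings.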
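
The step counts and learning rates that appear in the run logs can be derived from the settings alone. The sketch below makes two assumptions: the rounding (ceiling at both the batch and accumulation stage) is chosen because it reproduces the 170 and 85 total steps shown in the logs, and `lr_at` is a hand-written approximation of linear warmup followed by cosine decay, matching `warmup_ratio=0.1` and `lr_scheduler_type="cosine"` in the config. Neither function is the library's own code.

```python
import math

def total_optimizer_steps(n_examples, per_device_batch, grad_accum, epochs):
    """Optimizer updates for a run, rounding up partial batches and partial accumulations."""
    batches_per_epoch = math.ceil(n_examples / per_device_batch)
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum)
    return steps_per_epoch * epochs

def lr_at(step, total_steps, peak_lr, warmup_ratio=0.1):
    """Linear warmup to peak_lr, then half-cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

N_TRAIN = 1350  # training examples after the 90/10 split
runs = {
    "run1_baseline": dict(per_device_batch=2, grad_accum=8, epochs=2),
    "run3_larger_batch_fewer_epochs": dict(per_device_batch=4, grad_accum=4, epochs=1),
}
for name, cfg in runs.items():
    effective = cfg["per_device_batch"] * cfg["grad_accum"]
    steps = total_optimizer_steps(N_TRAIN, **cfg)
    print(f"{name}: effective batch {effective}, total steps {steps}")
```

Both runs have an effective batch of 16; they differ in per-device batch size, learning rate, and epoch count. Evaluated one step before each logging point, `lr_at` gives 1.06e-04 for run 1 at step 10 and 8.87e-05 at step 100, in line with the logged values. That one-step offset is an observation from these logs rather than a documented guarantee.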

In [40]:
experiment_results = []

result_1 = run_experiment(
    train_data, eval_data,
    lora_r=16, lora_alpha=32,
    lr=2e-4, batch_size=2, grad_accum=8, epochs=2,
    run_name="run1_baseline"
)
experiment_results.append(result_1)

2026-02-19 00:28:14,246 [INFO] starting experiment: run1_baseline
2026-02-19 00:28:14,247 [INFO]   lora_r=16, lora_alpha=32, lr=0.0002, batch=2, accum=8, epochs=2
2026-02-19 00:28:14,249 [INFO] loading base model into gpu (4-bit quantization)...
2026-02-19 00:28:14,921 [INFO] We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).
2026-02-19 00:28:22,793 [INFO] model loaded, vram: 2.91 gb
2026-02-19 00:28:22,793 [INFO] preparing model for kbit training...
2026-02-19 00:28:22,814 [INFO] kbit preparation done
2026-02-19 00:28:22,815 [INFO] applying lora adapters...
2026-02-19 00:28:23,106 [INFO]   lora applied: 4,505,600 trainable / 620,111,872 total (0.73%)
2026-02-19 00:28:23,171 [INFO] initializing trainer (tokenizing dataset, this may take a few minutes)...
2026-02-19 00:28:23,378 [INFO] trainer ready. total steps: 0
2026-02-19 00:28:23,379 [INFO] trainin

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'pad_token_id': 2}.


2026-02-19 00:28:23,917 [INFO]   epoch 1/2 starting


| Step | Training Loss | Validation Loss |
|---|---|---|
| 50 | 1.0983 | 1.090436 |
| 100 | 1.0726 | 1.07328 |
| 150 | 1.142 | 1.069398 |


2026-02-19 00:41:41,633 [INFO]   step 10/170 (6%) | loss=1.9526 | lr=1.06e-04
2026-02-19 00:54:32,889 [INFO]   step 20/170 (12%) | loss=1.5472 | lr=2.00e-04
2026-02-19 01:07:20,989 [INFO]   step 30/170 (18%) | loss=1.2246 | lr=1.97e-04
2026-02-19 01:19:34,846 [INFO]   step 40/170 (24%) | loss=1.1375 | lr=1.90e-04
2026-02-19 01:32:27,046 [INFO]   step 50/170 (29%) | loss=1.0983 | lr=1.79e-04
2026-02-19 01:36:43,653 [INFO]   step 50/170 (29%) | eval_loss=1.0904
2026-02-19 01:36:43,659 [INFO]   running evaluation at step 50...
2026-02-19 01:49:51,676 [INFO]   step 60/170 (35%) | loss=1.1131 | lr=1.65e-04
2026-02-19 02:02:38,611 [INFO]   step 70/170 (41%) | loss=1.1083 | lr=1.48e-04
2026-02-19 02:15:00,037 [INFO]   step 80/170 (47%) | loss=1.0822 | lr=1.29e-04
2026-02-19 02:20:39,022 [INFO]   epoch 2/2 starting
2026-02-19 02:26:54,429 [INFO]   step 90/170 (53%) | loss=1.0840 | lr=1.09e-04
2026-02-19 02:39:58,690 [INFO]   step 100/170 (59%) | loss=1.0726 | lr=8.87e-05
2026-02-19 02:44:17,84

In [41]:
result_2 = run_experiment(
    train_data, eval_data,
    lora_r=32, lora_alpha=64,
    lr=5e-5, batch_size=2, grad_accum=8, epochs=2,
    run_name="run2_higher_rank_lower_lr"
)
experiment_results.append(result_2)

2026-02-19 04:44:33,413 [INFO] starting experiment: run2_higher_rank_lower_lr
2026-02-19 04:44:33,414 [INFO]   lora_r=32, lora_alpha=64, lr=5e-05, batch=2, accum=8, epochs=2
2026-02-19 04:44:33,416 [INFO] loading base model into gpu (4-bit quantization)...
2026-02-19 04:44:34,392 [INFO] We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).
2026-02-19 04:44:49,576 [INFO] model loaded, vram: 2.78 gb
2026-02-19 04:44:49,577 [INFO] preparing model for kbit training...
2026-02-19 04:44:49,597 [INFO] kbit preparation done
2026-02-19 04:44:49,599 [INFO] applying lora adapters...
2026-02-19 04:44:50,340 [INFO]   lora applied: 9,011,200 trainable / 624,617,472 total (1.44%)
2026-02-19 04:44:50,449 [INFO] initializing trainer (tokenizing dataset, this may take a few minutes)...
2026-02-19 04:44:50,992 [INFO] trainer ready. total steps: 0
2026-02-19 04:44:50,995 [IN

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'pad_token_id': 2}.


2026-02-19 04:44:51,480 [INFO]   epoch 1/2 starting


| Step | Training Loss | Validation Loss |
|---|---|---|
| 50 | 1.1502 | 1.134109 |
| 100 | 1.1053 | 1.092605 |
| 150 | 1.1722 | 1.089462 |


2026-02-19 04:58:49,102 [INFO]   step 10/170 (6%) | loss=1.9855 | lr=2.65e-05
2026-02-19 05:12:32,743 [INFO]   step 20/170 (12%) | loss=1.7993 | lr=5.00e-05
2026-02-19 05:25:59,506 [INFO]   step 30/170 (18%) | loss=1.4687 | lr=4.92e-05
2026-02-19 05:39:01,140 [INFO]   step 40/170 (24%) | loss=1.2202 | lr=4.75e-05
2026-02-19 05:51:56,154 [INFO]   step 50/170 (29%) | loss=1.1502 | lr=4.48e-05
2026-02-19 05:56:31,308 [INFO]   step 50/170 (29%) | eval_loss=1.1341
2026-02-19 05:56:31,312 [INFO]   running evaluation at step 50...
2026-02-19 06:09:51,901 [INFO]   step 60/170 (35%) | loss=1.1474 | lr=4.13e-05
2026-02-19 06:22:58,713 [INFO]   step 70/170 (41%) | loss=1.1322 | lr=3.71e-05
2026-02-19 06:35:40,825 [INFO]   step 80/170 (47%) | loss=1.1038 | lr=3.23e-05
2026-02-19 06:41:29,627 [INFO]   epoch 2/2 starting
2026-02-19 06:48:03,456 [INFO]   step 90/170 (53%) | loss=1.1113 | lr=2.73e-05
2026-02-19 07:01:18,475 [INFO]   step 100/170 (59%) | loss=1.1053 | lr=2.22e-05
2026-02-19 07:05:51,81

In [42]:
result_3 = run_experiment(
    train_data, eval_data,
    lora_r=16, lora_alpha=32,
    lr=1e-4, batch_size=4, grad_accum=4, epochs=1,
    run_name="run3_larger_batch_fewer_epochs"
)
experiment_results.append(result_3)

2026-02-19 08:47:43,273 [INFO] starting experiment: run3_larger_batch_fewer_epochs
2026-02-19 08:47:43,275 [INFO]   lora_r=16, lora_alpha=32, lr=0.0001, batch=4, accum=4, epochs=1
2026-02-19 08:47:43,277 [INFO] loading base model into gpu (4-bit quantization)...
2026-02-19 08:47:45,125 [INFO] We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).
2026-02-19 08:47:57,045 [INFO] model loaded, vram: 2.78 gb
2026-02-19 08:47:57,048 [INFO] preparing model for kbit training...
2026-02-19 08:47:57,090 [INFO] kbit preparation done
2026-02-19 08:47:57,092 [INFO] applying lora adapters...
2026-02-19 08:47:58,012 [INFO]   lora applied: 4,505,600 trainable / 620,111,872 total (0.73%)
2026-02-19 08:47:58,100 [INFO] initializing trainer (tokenizing dataset, this may take a few minutes)...
2026-02-19 08:47:58,389 [INFO] trainer ready. total steps: 0
2026-02-19 08:47:58,3

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'pad_token_id': 2}.


2026-02-19 08:47:58,800 [INFO]   epoch 1/1 starting


| Step | Training Loss | Validation Loss |
|---|---|---|
| 50 | 1.1253 | 1.141566 |


2026-02-19 09:02:18,175 [INFO]   step 10/85 (12%) | loss=1.9561 | lr=1.00e-04
2026-02-19 09:17:27,599 [INFO]   step 20/85 (24%) | loss=1.6081 | lr=9.58e-05
2026-02-19 09:33:43,582 [INFO]   step 30/85 (35%) | loss=1.2766 | lr=8.39e-05
2026-02-19 09:48:53,192 [INFO]   step 40/85 (47%) | loss=1.1793 | lr=6.62e-05
2026-02-19 10:03:19,009 [INFO]   step 50/85 (59%) | loss=1.1253 | lr=4.59e-05
2026-02-19 10:08:04,219 [INFO]   step 50/85 (59%) | eval_loss=1.1416
2026-02-19 10:08:04,229 [INFO]   running evaluation at step 50...


'(ReadTimeoutError("HTTPSConnectionPool(host='huggingface.co', port=443): Read timed out. (read timeout=10)"), '(Request ID: 780da934-64fa-438a-964f-b51742c34923)')' thrown while requesting HEAD https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0/resolve/main/config.json
Retrying in 1s [Retry 1/5].
'(ReadTimeoutError("HTTPSConnectionPool(host='huggingface.co', port=443): Read timed out. (read timeout=10)"), '(Request ID: b757b37a-bb3f-4d70-a077-c704d0f7dc77)')' thrown while requesting HEAD https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0/resolve/main/config.json
Retrying in 2s [Retry 2/5].
'(ReadTimeoutError("HTTPSConnectionPool(host='huggingface.co', port=443): Read timed out. (read timeout=10)"), '(Request ID: 6984b690-dff6-4462-a2c4-2d5483b28fe6)')' thrown while requesting HEAD https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0/resolve/main/config.json
Retrying in 4s [Retry 3/5].
'(ReadTimeoutError("HTTPSConnectionPool(host='huggingface.co', port=443): Read timed out. (read timeout=10)"), '(Request ID: de972189-e10b-4f1c-9391-ee76d15927a0)')' thrown while requesting HEAD https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0/resolve/main/config.json
Retrying in 8s [Retry 4/5].
'(ReadTimeoutError("HTTPSConnectionPool(host='huggingface.co', port=443): Read timed out. (read timeout=10)"), '(Request ID: 7efa3593-48f1-4290-8447-cc293a469660)')' thrown while requesting HEAD https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0/resolve/main/config.json
Retrying in 8s [Retry 5/5].
'(ReadTimeoutError("HTTPSConnectionPool(host='huggingface.co', port=443): Read timed out. (read timeout=10)"), '(Request ID: f63c0a9c-a50b-496b-960b-1e5f54e94aa6)')' thrown while requesting HEAD https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0/resolve/main/config.json


2026-02-19 10:24:28,834 [INFO]   step 60/85 (71%) | loss=1.1367 | lr=2.62e-05
2026-02-19 10:40:14,404 [INFO]   step 70/85 (82%) | loss=1.1319 | lr=1.05e-05
2026-02-19 10:56:25,173 [INFO]   step 80/85 (94%) | loss=1.1086 | lr=1.53e-06
2026-02-19 11:03:17,349 [INFO]   step 85/85 (100%)
2026-02-19 11:03:17,390 [INFO] saving model to ./lora-ml-assistant/run3_larger_batch_fewer_epochs/final ...
2026-02-19 11:03:18,794 [INFO]   train loss: 1.3080
2026-02-19 11:03:18,798 [INFO]   best eval loss: 1.1416
2026-02-19 11:03:18,801 [INFO]   peak vram: 6.44 gb
2026-02-19 11:03:18,803 [INFO]   time: 135.3 min
2026-02-19 11:03:18,805 [INFO] experiment run3_larger_batch_fewer_epochs complete
2026-02-19 11:03:24,758 [INFO] model cleaned up, vram freed: 2.59 gb remaining


### Experiment Results Table

The table below documents all experiments with their hyperparameters, performance metrics, GPU memory usage, and training time.

In [43]:
experiments_df = pd.DataFrame(experiment_results)
display_cols = [
    "experiment", "learning_rate", "batch_size", "epochs",
    "lora_r", "lora_alpha", "trainable_params",
    "train_loss", "best_eval_loss", "peak_vram_gb", "training_time_min"
]
log.info("experiment tracking table:")
log.info("\n" + experiments_df[display_cols].to_string(index=False))

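# runs without a recorded eval loss (None) sort last; a legitimate 0.0 is still compared normally
best_run = min(experiment_results, key=lambda x: x["best_eval_loss"] if x["best_eval_loss"] is not None else float("inf"))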
log.info(f"best experiment: {best_run['experiment']} (eval loss: {best_run['best_eval_loss']})")

experiments_df[display_cols].to_csv(f"{LOG_DIR}/experiments.csv", index=False)
log.info(f"experiments table saved to {LOG_DIR}/experiments.csv")

BEST_MODEL_PATH = best_run["save_path"]

2026-02-19 11:13:20,196 [INFO] experiment tracking table:
2026-02-19 11:13:20,263 [INFO] 
                    experiment learning_rate  batch_size  epochs  lora_r  lora_alpha trainable_params  train_loss  best_eval_loss  peak_vram_gb  training_time_min
                 run1_baseline        0.0002 2 (accum 8)       2      16          32        4,505,600      1.1856          1.0694          4.53              229.7
     run2_higher_rank_lower_lr         5e-05 2 (accum 8)       2      32          64        9,011,200      1.2454          1.0895          4.56              236.7
run3_larger_batch_fewer_epochs        0.0001 4 (accum 4)       1      16          32        4,505,600      1.3080          1.1416          6.44              135.3
2026-02-19 11:13:20,265 [INFO] best experiment: run1_baseline (eval loss: 1.0694)
2026-02-19 11:13:20,284 [INFO] experiments table saved to ./logs/experiments.csv
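
The `batch_size` column in the table above combines two numbers. The sketch below assumes that "accum" refers to gradient accumulation steps and that training ran on a single GPU; under those assumptions it computes the effective batch size per optimizer step. `effective_batch_size` is a hypothetical helper, not part of the notebook.

```python
import re

def effective_batch_size(batch_size_field):
    """Parse a value like '2 (accum 8)' and return per-device batch * accumulation steps."""
    match = re.fullmatch(r"\s*(\d+)\s*\(accum\s+(\d+)\)\s*", batch_size_field)
    if match is None:
        raise ValueError(f"unexpected batch_size format: {batch_size_field!r}")
    per_device, accum = map(int, match.groups())
    return per_device * accum

# batch_size values copied from the experiment table above
runs = {
    "run1_baseline": "2 (accum 8)",
    "run2_higher_rank_lower_lr": "2 (accum 8)",
    "run3_larger_batch_fewer_epochs": "4 (accum 4)",
}
effective = {name: effective_batch_size(value) for name, value in runs.items()}
for name, size in effective.items():
    print(f"{name}: effective batch size {size}")
```

If the assumption holds, all three runs use the same effective batch size, so run3 differs from the baseline mainly in its per-device batch (consistent with its higher peak VRAM), its learning rate, and its single epoch.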


## 4. Performance Evaluation

We evaluate the fine-tuned model against the base model using multiple metrics:

- **BLEU Score**: Measures n-gram precision of the generated text against the reference text (corpus-level, with smoothing)
- **ROUGE-1/2/L**: Measures overlap at the unigram, bigram, and longest-common-subsequence levels; reported here as the F-measure
- **F1 Score**: Token-level F1 combining precision and recall over the unique lowercased tokens of the generated and reference text
- **Perplexity**: Exponentiated average per-token loss on held-out evaluation data (lower = better)
- **Qualitative Testing**: Side-by-side comparison on domain and out-of-domain queries
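
To make the overlap metrics above concrete, here is a minimal sketch of clipped n-gram precision, recall, and F1 on a made-up sentence pair. The helpers `ngrams` and `overlap_scores` are hypothetical and are not the functions used for the reported numbers; real BLEU and ROUGE add details (brevity penalty, stemming, longest common subsequence) that are omitted here.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def overlap_scores(prediction, reference, n=1):
    """Clipped n-gram precision, recall, and F1 for one prediction/reference pair."""
    pred = ngrams(prediction.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection keeps the minimum count per n-gram (clipping)
    matched = sum((pred & ref).values())
    precision = matched / max(sum(pred.values()), 1)
    recall = matched / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return precision, recall, f1

# toy example (not taken from the evaluation set)
prediction = "dropout randomly drops neurons"
reference = "dropout randomly deactivates neurons during training"

for n in (1, 2):
    p, r, f = overlap_scores(prediction, reference, n=n)
    print(f"{n}-gram precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

In this toy pair, three of the four predicted words appear in the reference but only one of the three predicted bigrams does, which illustrates how unigram scores can be high while bigram-based scores stay low.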

In [44]:
test_questions = [
    "what is a transformer architecture in deep learning?",
    "explain the difference between lstm and gru networks",
    "what is backpropagation and why is it important?",
    "how does dropout regularization work?",
    "what is transfer learning and when should you use it?",
    "explain the attention mechanism in neural networks",
    "what is the vanishing gradient problem?",
    "how do generative adversarial networks work?",
    "what is the difference between supervised and unsupervised learning?",
    "explain batch normalization and its benefits",
]

reference_answers = [
    "a transformer is a neural network architecture that uses self-attention mechanisms to process sequential data in parallel, unlike rnns. it consists of encoder and decoder blocks with multi-head attention layers, enabling it to capture long-range dependencies efficiently. transformers are the foundation of models like bert and gpt.",
    "lstm uses three gates (input, forget, output) and a cell state to control information flow, while gru simplifies this to two gates (reset, update) making it computationally faster. both address the vanishing gradient problem in rnns, but gru has fewer parameters and trains faster with comparable performance on many tasks.",
    "backpropagation is an algorithm for computing gradients of the loss function with respect to network weights by applying the chain rule from output to input layers. it is essential because it enables gradient-based optimization methods to update weights and minimize the loss during training.",
    "dropout randomly deactivates a fraction of neurons during training, forcing the network to learn redundant representations and reducing co-adaptation between neurons. this acts as regularization, preventing overfitting by making the model more robust and improving generalization to unseen data.",
    "transfer learning involves taking a model pre-trained on a large dataset and adapting it to a new related task. it is useful when you have limited training data, as the pre-trained model already captures general features that can be fine-tuned for the specific task with fewer examples.",
    "the attention mechanism allows a model to focus on different parts of the input sequence when producing each output element. it computes weighted sums of values based on compatibility scores between queries and keys, enabling the model to capture relevant relationships regardless of distance in the sequence.",
    "the vanishing gradient problem occurs when gradients become extremely small as they are propagated back through many layers, effectively preventing early layers from learning. this is common in deep networks and rnns, and is addressed by architectures like lstm, residual connections, and careful initialization.",
    "gans consist of two networks: a generator that creates fake data and a discriminator that distinguishes real from fake. they are trained adversarially, with the generator trying to fool the discriminator. this competition drives both networks to improve, resulting in realistic synthetic data generation.",
    "supervised learning uses labeled data where each input has a corresponding target output, used for tasks like classification and regression. unsupervised learning works with unlabeled data to find hidden patterns or structure, used for clustering, dimensionality reduction, and anomaly detection.",
    "batch normalization normalizes the inputs of each layer to have zero mean and unit variance across the mini-batch. this stabilizes training by reducing internal covariate shift, allows higher learning rates, acts as mild regularization, and generally leads to faster convergence.",
]

out_of_domain_questions = [
    "what is the capital of france?",
    "how do i bake a chocolate cake?",
    "who won the 2022 world cup?",
]

In [45]:
def generate_response(mdl, tok, question, max_new_tokens=256):
    prompt = (
        f"<|system|>\n"
        f"you are an expert ai/ml assistant specializing in machine learning, "
        f"deep learning, neural networks, and modern ai architectures. "
        f"provide clear, accurate, and educational responses.</s>\n"
        f"<|user|>\n{question}</s>\n"
        f"<|assistant|>\n"
    )
    inputs = tok(prompt, return_tensors="pt").to(mdl.device)
    with torch.no_grad():
        outputs = mdl.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=0.7,
            top_p=0.9,
            do_sample=True,
            repetition_penalty=1.2,
            pad_token_id=tok.eos_token_id,
        )
    response = tok.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    return response.strip()

In [46]:
ft_model = load_base_model()
ft_model = PeftModel.from_pretrained(ft_model, BEST_MODEL_PATH)

log.info("generating fine-tuned model responses on domain questions")
finetuned_responses = []
for q in test_questions:
    resp = generate_response(ft_model, tokenizer, q)
    finetuned_responses.append(resp)
    log.info(f"q: {q}")
    log.info(f"a: {resp[:300]}\n")

2026-02-19 11:14:38,518 [INFO] loading base model into gpu (4-bit quantization)...


2026-02-19 11:15:59,436 [INFO] model loaded, vram: 2.78 gb
2026-02-19 11:15:59,909 [INFO] generating fine-tuned model responses on domain questions
2026-02-19 11:16:26,213 [INFO] q: what is a transformer architecture in deep learning?
2026-02-19 11:16:26,215 [INFO] a: A Transformer Architecture (Transformer) is a state-of-the-art natural language processing model that uses stacked encoder–decoders to process text data. It consists of multiple layers with attention mechanisms to capture contextual information from the input sequence while also learning representat

2026-02-19 11:17:00,888 [INFO] q: explain the difference between lstm and gru networks
2026-02-19 11:17:00,889 [INFO] a: LSTM (Long Short-Term Memory) is a type of recurrent neural network that can process sequential data such as text or speech. It uses a hidden state to remember previous input sequences and output them with a probability distribution over time. This allows it to create complex hierarchical structures

2026-0

In [47]:
log.info("generating fine-tuned model responses on out-of-domain questions")
for q in out_of_domain_questions:
    resp = generate_response(ft_model, tokenizer, q)
    log.info(f"q: {q}")
    log.info(f"a: {resp[:300]}\n")

2026-02-19 11:23:01,334 [INFO] generating fine-tuned model responses on out-of-domain questions
2026-02-19 11:23:32,768 [INFO] q: what is the capital of france?
2026-02-19 11:23:32,769 [INFO] a: The capital city of France is Paris, which has been its capital since the early Middle Ages (12th century). It is located on the Ile de la Cité, a small island surrounded by the Seine River. The most famous landmark on this island is the Eiffel Tower, built during the French Revolution as part of th

2026-02-19 11:24:20,231 [INFO] q: how do i bake a chocolate cake?
2026-02-19 11:24:20,232 [INFO] a: To make a classic chocolate cake recipe:
1. Preheat your oven to 350 degrees Fahrenheit (176°C).
2. Grease or line your muffin tin with cupcake liners. Set aside.
3. In a medium bowl whisk together flour, sugar, salt, baking powder, and cocoa powder. Set aside.
4. In another small bowl mix together 

2026-02-19 11:24:44,993 [INFO] q: who won the 2022 world cup?
2026-02-19 11:24:44,994 [INFO] a: The 2

In [48]:
cleanup_model(ft_model)

base_model = load_base_model()

log.info("generating base model responses on domain questions")
base_responses = []
for q in test_questions:
    resp = generate_response(base_model, tokenizer, q)
    base_responses.append(resp)
    log.info(f"q: {q}")
    log.info(f"a: {resp[:300]}\n")

2026-02-19 11:24:59,799 [INFO] model cleaned up, vram freed: 2.33 gb remaining
2026-02-19 11:24:59,800 [INFO] loading base model into gpu (4-bit quantization)...


2026-02-19 11:26:23,753 [INFO] We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).


2026-02-19 11:29:14,124 [INFO] model loaded, vram: 3.05 gb
2026-02-19 11:29:14,125 [INFO] generating base model responses on domain questions
2026-02-19 11:29:41,407 [INFO] q: what is a transformer architecture in deep learning?
2026-02-19 11:29:41,408 [INFO] a: A Transformer Architecture (Transformer) is a type of artificial intelligence model that uses self-attention mechanisms to process text data or natural language inputs. It can be used for various tasks such as language modeling, question answering, NLP, chatbots, and more. The Transformer architectu

2026-02-19 11:30:26,441 [INFO] q: explain the difference between lstm and gru networks
2026-02-19 11:30:26,443 [INFO] a: LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit) are two types of recurrent neural network (RNN) models that differ in their architecture design. Here's a brief explanation:

1. LSTM vs. RNN:

An RNN is a type of neural network that uses backpropagation to learn patterns from input data.

2026-02-19 1

### Quantitative Metrics: BLEU, ROUGE, and Token-Level F1

In [52]:
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction
from rouge_score import rouge_scorer

def token_f1(prediction, reference):
    pred_tokens = set(prediction.lower().split())
    ref_tokens = set(reference.lower().split())
    if not pred_tokens or not ref_tokens:
        return 0.0
    common = pred_tokens & ref_tokens
    if not common:
        return 0.0
    p = len(common) / len(pred_tokens)
    r = len(common) / len(ref_tokens)
    return 2 * p * r / (p + r)

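def compute_all_metrics(predictions, references):
    # corpus-level bleu on whitespace tokens. note: tokens are compared
    # case-sensitively here, and the reference answers are all lowercase,
    # so capitalized words in a prediction cannot match.
    refs_for_bleu = [[r.split()] for r in references]
    preds_for_bleu = [p.split() for p in predictions]
    smooth = SmoothingFunction().method1
    bleu = corpus_bleu(refs_for_bleu, preds_for_bleu, smoothing_function=smooth)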

    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    rouge_scores = {"rouge1": [], "rouge2": [], "rougeL": []}
    for pred, ref in zip(predictions, references):
        scores = scorer.score(ref, pred)
        for k in rouge_scores:
            rouge_scores[k].append(scores[k].fmeasure)

    f1_scores = [token_f1(p, r) for p, r in zip(predictions, references)]
    avg_f1 = np.mean(f1_scores)

    return {
        "bleu": bleu,
        "rouge1": np.mean(rouge_scores["rouge1"]),
        "rouge2": np.mean(rouge_scores["rouge2"]),
        "rougeL": np.mean(rouge_scores["rougeL"]),
        "token_f1": avg_f1,
    }

ft_metrics = compute_all_metrics(finetuned_responses, reference_answers)
base_metrics = compute_all_metrics(base_responses, reference_answers)

metrics_df = pd.DataFrame({
    "metric": ["bleu", "rouge-1", "rouge-2", "rouge-l", "token f1"],
    "base_model": [
        f'{base_metrics["bleu"]:.4f}',
        f'{base_metrics["rouge1"]:.4f}',
        f'{base_metrics["rouge2"]:.4f}',
        f'{base_metrics["rougeL"]:.4f}',
        f'{base_metrics["token_f1"]:.4f}',
    ],
    "fine_tuned": [
        f'{ft_metrics["bleu"]:.4f}',
        f'{ft_metrics["rouge1"]:.4f}',
        f'{ft_metrics["rouge2"]:.4f}',
        f'{ft_metrics["rougeL"]:.4f}',
        f'{ft_metrics["token_f1"]:.4f}',
    ],
})

log.info("evaluation metrics comparison (base vs fine-tuned):")
table_str = metrics_df.to_string(index=False)
for line in table_str.split(chr(10)):
    log.info(line)

for metric_name in ["bleu", "rouge1", "rouge2", "rougeL", "token_f1"]:
    base_val = base_metrics[metric_name]
    ft_val = ft_metrics[metric_name]
    if base_val > 0:
        improvement = ((ft_val - base_val) / base_val) * 100
        log.info(f"{metric_name} improvement: {improvement:+.1f}%")

metrics_df.to_csv(f"{LOG_DIR}/metrics.csv", index=False)
log.info(f"metrics saved to {LOG_DIR}/metrics.csv")

2026-02-19 11:48:35,640 [INFO] Using default tokenizer.
2026-02-19 11:48:35,719 [INFO] Using default tokenizer.
2026-02-19 11:48:35,846 [INFO] evaluation metrics comparison (base vs fine-tuned):
2026-02-19 11:48:35,855 [INFO]   metric base_model fine_tuned
2026-02-19 11:48:35,857 [INFO]     bleu     0.0163     0.0065
2026-02-19 11:48:35,858 [INFO]  rouge-1     0.2717     0.3302
2026-02-19 11:48:35,860 [INFO]  rouge-2     0.0695     0.0624
2026-02-19 11:48:35,864 [INFO]  rouge-l     0.1632     0.1960
2026-02-19 11:48:35,867 [INFO] token f1     0.2453     0.2775
2026-02-19 11:48:35,869 [INFO] bleu improvement: -59.9%
2026-02-19 11:48:35,872 [INFO] rouge1 improvement: +21.5%
2026-02-19 11:48:35,875 [INFO] rouge2 improvement: -10.1%
2026-02-19 11:48:35,877 [INFO] rougeL improvement: +20.1%
2026-02-19 11:48:35,885 [INFO] token_f1 improvement: +13.1%
2026-02-19 11:48:35,894 [INFO] metrics saved to ./logs/metrics.csv


### Perplexity Comparison

In [53]:
def compute_perplexity(mdl, tok, texts, max_length=512):
    total_loss = 0
    total_tokens = 0
    for text in texts:
        inputs = tok(text, return_tensors="pt", truncation=True, max_length=max_length).to(mdl.device)
        with torch.no_grad():
            outputs = mdl(**inputs, labels=inputs["input_ids"])
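        # outputs.loss is the mean loss over the shifted next-token targets (n - 1 of them),
        # so weighting it by the full sequence length n is a close approximation
        total_loss += outputs.loss.item() * inputs["input_ids"].shape[1]
        total_tokens += inputs["input_ids"].shape[1]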
    return np.exp(total_loss / total_tokens)

eval_texts = [ex["text"] for ex in eval_data.select(range(min(50, len(eval_data))))]

base_perplexity = compute_perplexity(base_model, tokenizer, eval_texts)
log.info(f"base model perplexity: {base_perplexity:.2f}")

cleanup_model(base_model)

ft_model = load_base_model()
ft_model = PeftModel.from_pretrained(ft_model, BEST_MODEL_PATH)

ft_perplexity = compute_perplexity(ft_model, tokenizer, eval_texts)
log.info(f"fine-tuned model perplexity: {ft_perplexity:.2f}")
log.info(f"perplexity improvement: {((base_perplexity - ft_perplexity) / base_perplexity * 100):.1f}%")

with open(f"{LOG_DIR}/perplexity.json", "w") as f:
    json.dump({"base": round(base_perplexity, 2), "fine_tuned": round(ft_perplexity, 2)}, f, indent=2)
log.info(f"perplexity results saved to {LOG_DIR}/perplexity.json")

2026-02-19 11:52:29,394 [INFO] base model perplexity: 7.82
2026-02-19 11:52:30,038 [INFO] model cleaned up, vram freed: 2.36 gb remaining
2026-02-19 11:52:30,042 [INFO] loading base model into gpu (4-bit quantization)...
2026-02-19 11:52:31,457 [INFO] We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).
2026-02-19 11:52:37,499 [INFO] model loaded, vram: 3.08 gb
2026-02-19 11:54:18,365 [INFO] fine-tuned model perplexity: 3.54
2026-02-19 11:54:18,366 [INFO] perplexity improvement: 54.7%
2026-02-19 11:54:18,373 [INFO] perplexity results saved to ./logs/perplexity.json


### Qualitative Comparison: Base vs Fine-Tuned

Side-by-side responses make it possible to inspect how fine-tuning changes domain-specific answer quality, relevance, and accuracy.

In [54]:
print("side-by-side comparison (base vs fine-tuned):\n")
for i, q in enumerate(test_questions[:5]):
    print(f"question: {q}")
    print(f"  base model: {base_responses[i][:300]}")
    print(f"  fine-tuned: {finetuned_responses[i][:300]}")
    print()

side-by-side comparison (base vs fine-tuned):

question: what is a transformer architecture in deep learning?
  base model: A Transformer Architecture (Transformer) is a type of artificial intelligence model that uses self-attention mechanisms to process text data or natural language inputs. It can be used for various tasks such as language modeling, question answering, NLP, chatbots, and more. The Transformer architectu
  fine-tuned: A Transformer Architecture (Transformer) is a state-of-the-art natural language processing model that uses stacked encoder–decoders to process text data. It consists of multiple layers with attention mechanisms to capture contextual information from the input sequence while also learning representat

question: explain the difference between lstm and gru networks
  base model: LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit) are two types of recurrent neural network (RNN) models that differ in their architecture design. Here's a brief explanati

### Metrics Analysis

The evaluation above compares the base pre-trained TinyLlama model against our fine-tuned version across multiple dimensions:

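**BLEU and ROUGE** measure how closely the generated responses match reference answers in terms of n-gram overlap. The results are mixed: ROUGE-1 (0.2717 to 0.3302, +21.5%) and ROUGE-L (0.1632 to 0.1960, +20.1%) improve after fine-tuning, while BLEU (0.0163 to 0.0065, -59.9%) and ROUGE-2 (0.0695 to 0.0624, -10.1%) decline. The fine-tuned model therefore shares more individual words and in-order subsequences with the references, but not more exact multi-word phrases. Both BLEU values are close to zero, and with only 10 questions and sampled generation (`do_sample=True`), these differences should be read as indicative rather than conclusive.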

**Token-level F1** captures both precision (are the generated tokens relevant?) and recall (are important reference tokens present?) over the sets of unique tokens. It rises from 0.2453 to 0.2775 (+13.1%), which suggests the fine-tuned model covers somewhat more of the key terminology in the references.

**Perplexity** measures how well the model predicts domain-specific text. It drops from 7.82 to 3.54 (a 54.7% reduction) on up to 50 held-out evaluation examples, indicating that the fine-tuned model has adapted to the language patterns of the ML/AI dataset and predicts such text more accurately.
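
As a worked illustration of the perplexity figure, the sketch below computes token-weighted perplexity as the exponential of the average per-token negative log-likelihood, then relates the logged values to a relative reduction and a per-token loss difference. The per-text losses and token counts are made up for illustration; `corpus_perplexity` and `relative_reduction` are hypothetical helpers, not the notebook's `compute_perplexity`.

```python
import math

def corpus_perplexity(mean_nll_per_text, tokens_per_text):
    """Token-weighted perplexity: exp(total NLL / total scored tokens)."""
    total_nll = sum(loss * n for loss, n in zip(mean_nll_per_text, tokens_per_text))
    total_tokens = sum(tokens_per_text)
    return math.exp(total_nll / total_tokens)

def relative_reduction(before, after):
    """Percentage by which `after` is lower than `before`."""
    return (before - after) / before * 100

# made-up example: two texts with mean losses 2.0 and 1.0 nats over 10 and 30 tokens
ppl = corpus_perplexity([2.0, 1.0], [10, 30])
print(f"toy perplexity: {ppl:.3f}")

# perplexities taken from the log output (base vs fine-tuned)
base_ppl, ft_ppl = 7.82, 3.54
print(f"relative reduction: {relative_reduction(base_ppl, ft_ppl):.1f}%")
# the same change expressed as a drop in average per-token loss (nats)
print(f"loss difference: {math.log(base_ppl) - math.log(ft_ppl):.2f} nats/token")
```

Because perplexity is an exponential of the loss, a roughly halved perplexity corresponds to a sub-one-nat drop in average loss per token rather than a halved loss.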

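**Out-of-domain testing** checks how the model handles non-ML queries. In the logged samples it answers the questions directly rather than steering them toward ML content, but it neither declines them nor stays factually reliable: the answer about Paris places the Eiffel Tower on the Ile de la Cité and dates it to the French Revolution, both of which are incorrect.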

## 5. Web Interface (React + FastAPI)

The fine-tuned model is deployed via a FastAPI backend (`backend/app.py`) that serves a `/chat` endpoint, `/metrics` for evaluation results, and `/model-info` for architecture details. The React frontend (`frontend/`) provides a feature-rich chat UI with dark/light mode, model info panel, metrics display, typing animation, and example questions.

To run:
1. Start backend: `python -m uvicorn backend.app:app --reload`
2. Start frontend: `cd frontend && npm start`
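
Once the backend is running, the `/chat` endpoint can be exercised without the frontend. The sketch below only builds the request. It assumes uvicorn's default address (`http://127.0.0.1:8000`) and a JSON body with a `message` field; the field name is a guess, since the request schema in `backend/app.py` is not shown here. `build_chat_request` is a hypothetical helper.

```python
import json
import urllib.request

def build_chat_request(question, base_url="http://127.0.0.1:8000"):
    """Build (but do not send) a POST request for the /chat endpoint."""
    # "message" is an assumed field name; adjust it to match backend/app.py
    payload = json.dumps({"message": question}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/chat",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request("how does dropout regularization work?")
print(req.get_method(), req.full_url)

# to send it while the backend is running:
# with urllib.request.urlopen(req, timeout=60) as resp:
#     print(json.load(resp))
```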