# Project 5: LLM

The documentation is split into small chunks, following the suggestion in class and the feedback on previous projects.

# Introduction

The task is reading comprehension with an LLM.


W&B Link: TODO

# Setup
Preliminary steps for getting the project running.

## Tools used
- GPUHub JupyterLab
- No AI tools used, as they do not help with reading API documentation and GitHub issues
- Documentation from previous projects

## Dependencies
The notebook was created with:
Python

Install all necessary dependencies
- PyTorch: `torch`
- Hugging Face: `huggingface_hub transformers datasets peft trl`
- Weights & Biases: `wandb`
- numpy: `numpy`
- scikit-learn: `scikit-learn`
- Lint and Formatting: `ruff`

Versions of `wandb`, `numpy`, `scikit-learn`, and `ruff` are pinned for reproducibility; the remaining packages are installed without a version pin.

In [1]:
%pip install torch huggingface_hub transformers[torch] datasets peft trl wandb==0.18.7 numpy==1.26.4 scikit-learn==1.5.2 ruff==0.7.4
#%pip install --upgrade transformers[torch] peft trl torchvision

Note: you may need to restart the kernel to use updated packages.


## Notebook setup
Import all necessary libraries.

In [2]:
import os
from transformers import (
    AutoTokenizer,
)
import torch
import numpy as np
from datasets import load_dataset
import wandb
import sklearn
from huggingface_hub import login as hf_login
from trl import SFTConfig, SFTTrainer, AutoModelForCausalLMWithValueHead, DataCollatorForCompletionOnlyLM
from peft import LoraConfig

Log into Hugging Face and Weights & Biases.

In [3]:
WANDB_PROJECT = "nlp-project-5"
os.environ["WANDB_PROJECT"] = WANDB_PROJECT
os.environ["WANDB_NOTEBOOK_NAME"] = "./project5-stage2.ipynb"
wandb.login()

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: Currently logged in as: [33myelin-zhang[0m ([33myelin-zhang-hslu[0m). Use [1m`wandb login --relogin`[0m to force relogin


True

In [4]:
hf_login()


In [5]:
MODEL = "meta-llama/Llama-3.2-1B"
BATCH_SIZE = 8

# Preprocessing

Predefined requirements:
- Download the BoolQ dataset with `datasets` and split it in the predefined way.
- Train / Validation / Test split

Used features:
- `question` and `passage` as input to the model
- `answer` as label

Input format:
- `question` and `passage` are kept as they are; for training they are later combined in a template to form the input prompt.

Label format:
- `answer` stays as it is for later training

Batch size: 64 for faster training than with individual samples

No further preprocessing is done, because Llama can handle all text in the dataset as it is: it uses a `tiktoken`-based tokenizer under the hood.

## Implementation

Download and split the dataset in the predefined way.

In [6]:
train_raw = load_dataset("google/boolq", split="train[:-1000]")
valid_raw = load_dataset("google/boolq", split="train[-1000:]")
test_raw = load_dataset("google/boolq", split="validation")

print(len(train_raw), len(valid_raw), len(test_raw))

8427 1000 3270


# Model
Predefined requirements:
- LLM (≥ 1B parameters)
- Use a quantized version as the base model

Chosen model: Llama 3.2
- 1.23B params
- Quantized with SpinQuant and GPTQ

Predictions:
The model should generate either "True" or "False" as the answer to the question, based on the given input. Metrics can then be calculated from these outputs.
If it predicts anything other than the expected outputs, the prediction is counted separately as a failed prediction.
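
The mapping from generated text to the three outcomes could look roughly like the sketch below. The helper name `parse_prediction`, the `"failed"` label and the rule of looking only at the first word after the answer marker are assumptions made for illustration; the notebook does not define this step yet.

```python
def parse_prediction(generated: str, marker: str = "### Answer:") -> str:
    """Map generated text to 'True', 'False', or 'failed' (hypothetical helper)."""
    # Keep only the text after the last answer marker, if the marker is present.
    completion = generated.rsplit(marker, 1)[-1].strip()
    words = completion.split()
    # Look at the first word only and ignore surrounding punctuation and case.
    first = words[0].strip(".,!\"'").lower() if words else ""
    if first == "true":
        return "True"
    if first == "false":
        return "False"
    # Anything else is counted as a failed prediction.
    return "failed"


print(parse_prediction("### Question: q\n ### Context: c\n ### Answer:True"))
print(parse_prediction("### Answer: Persian is a language spoken in Iran"))
```

A stricter or looser rule (for example accepting "yes" and "no") would change how many predictions end up in the failed class, so the rule should be fixed before comparing runs.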

Normalization: Llama uses RMSNorm layers.

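Regularization: The `AdamW` optimizer applies decoupled weight decay, which shrinks the weights directly instead of adding an L2 term to the loss. It is unknown whether Llama itself contains further regularization.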
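
To make the regularization in `AdamW` concrete: the weight decay is applied to the weight itself and is not part of the gradient of the loss. The following is a minimal single-parameter sketch of one update step. The function name `adamw_step` and the default hyperparameters are illustrative assumptions, not the values used by the trainer.

```python
import math


def adamw_step(param, grad, m, v, step, lr=1e-4, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a single scalar parameter (illustrative sketch)."""
    # Decoupled weight decay: shrink the weight directly, independent of the gradient.
    param = param * (1.0 - lr * weight_decay)
    # Exponential moving averages of the gradient and the squared gradient.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    # Bias correction for the first steps.
    m_hat = m / (1.0 - beta1 ** step)
    v_hat = v / (1.0 - beta2 ** step)
    # Adam step using the gradient of the loss only.
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# With a zero gradient only the weight decay changes the parameter.
p, m, v = adamw_step(1.0, 0.0, 0.0, 0.0, step=1, lr=0.1, weight_decay=0.5)
print(p)
```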

### Loss function
Default of the transformers library: cross-entropy
- Not changed, because next-token prediction is a classification over the vocabulary, for which cross-entropy is the standard choice
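
For a causal language model the cross-entropy is computed per position between the predicted next-token distribution and the actual next token. A minimal pure-Python sketch for a single position follows; the two-entry logit lists are a toy vocabulary used only for illustration.

```python
import math


def cross_entropy(logits, target_index):
    """Cross-entropy of one softmax prediction against the correct class index."""
    # Subtract the maximum logit for numerical stability (log-sum-exp trick).
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    # Negative log-probability of the target class.
    return log_sum_exp - logits[target_index]


# A confident correct prediction has a lower loss than an undecided one.
print(cross_entropy([5.0, 0.0], 0) < cross_entropy([0.0, 0.0], 0))
```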

### Optimizer
Default by transformers library: `AdamW`
- Not changed because it performs well and generally better than `Adam`.


## Implementation

In [13]:
tokenizer = AutoTokenizer.from_pretrained(MODEL)
tokenizer.pad_token = tokenizer.eos_token

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    base_model_name_or_path=MODEL
)

model = AutoModelForCausalLMWithValueHead.from_pretrained(
    MODEL,
    #peft_config=lora_config,
    max_length=8000,
    device_map = 'cuda'
)

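def formatting_prompts_func(example, include_answer=True):
    """Build one prompt string per row of a batched dataset slice."""
    output_texts = []
    for i in range(len(example['question'])):
        # The " ### Answer:" suffix has to match the response template of the collator.
        text = f"### Question: {example['question'][i]}\n ### Context: {example['passage'][i]}\n ### Answer:"
        if include_answer:
            # Append the label ("True" or "False") for supervised training.
            text += str(example['answer'][i])
        output_texts.append(text)
    return output_texts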



A correctness test of the model definition is done by running the model on one sample and checking the output.

The generated text looks right, so we can proceed with implementing the rest.

In [15]:
tokenized = tokenizer(train_raw[0]["question"], return_tensors="pt").to("cuda")
tokenizer.batch_decode(model.generate(tokenized["input_ids"], max_time=1, pad_token_id=tokenizer.pad_token_id))

['<|begin_of_text|>do iran and afghanistan speak the same language\nIran and Afghanistan are two countries that are very close to each other. They are both located in the Middle East and share many similarities. However']

Predefined requirement: Preliminary evaluation with 5 diverse prompts.

In [17]:
for i, prompt in enumerate(formatting_prompts_func(train_raw[:5], False)):
    tokenized = tokenizer(prompt, return_tensors="pt").to("cuda")
    print(f"Prompt {i+1}:")
    print(tokenizer.batch_decode(model.generate(tokenized["input_ids"], max_time=30, pad_token_id=tokenizer.pad_token_id))[0])

Prompt 0:
<|begin_of_text|>### Question: do iran and afghanistan speak the same language
 ### Context: Persian (/ˈpɜːrʒən, -ʃən/), also known by its endonym Farsi (فارسی fārsi (fɒːɾˈsiː) ( listen)), is one of the Western Iranian languages within the Indo-Iranian branch of the Indo-European language family. It is primarily spoken in Iran, Afghanistan (officially known as Dari since 1958), and Tajikistan (officially known as Tajiki since the Soviet era), and some other regions which historically were Persianate societies and considered part of Greater Iran. It is written in the Persian alphabet, a modified variant of the Arabic script, which itself evolved from the Aramaic alphabet.
 ### Answer: Persian is a language spoken by 75 million people in Iran, Afghanistan, Tajikistan, and other countries. It is the official language of Iran and Tajikistan, and is also spoken by a significant minority in Afghanistan, Pakistan, and the United Arab Emirates. The language is also used in the Persia

### Checkpoints
Checkpoints are saved at the end of training through the `transformers.integrations.WandbCallback`; the remaining configuration is set later in `TrainingArguments`.
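
A minimal sketch of how the upload at the end of training could be switched on, assuming the `WANDB_LOG_MODEL` environment variable read by the Hugging Face W&B integration is used. The notebook itself does not show this setting, so treat it as an assumption.

```python
import os

# Assumption: "end" asks the W&B integration to upload the model once training finishes.
# It has to be set before the trainer is created.
os.environ["WANDB_LOG_MODEL"] = "end"
```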

## Experiments
- `r`: 1, 16, 128, 256
- `alpha`: 0, 1, 16, 128, 256

There is a lot of conflicting information about how large `r` and `alpha` should be. What these parameters do is explained in the training section.
Therefore I want to try various combinations of those parameters.

These combinations result in 20 experiments.

In [10]:
experiments = {
    "method": "grid",
    "metric": {"goal": "minimize", "name": "val_loss"},
    "parameters": {
        "r": {"values": [1,16,128,256]},
        "alpha": {"values": [0,1,16,128,256]},
    },
    "early_terminate": {"type": "hyperband", "max_iter": 20, "s": 5, "eta": 3},
}
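
The grid above can be enumerated locally to confirm the number of runs. This is only an illustrative check; the names `r_values`, `alpha_values` and `grid` are introduced here and are not part of the notebook.

```python
from itertools import product

r_values = [1, 16, 128, 256]
alpha_values = [0, 1, 16, 128, 256]

# A grid search runs every combination of the listed values exactly once.
grid = [{"r": r, "alpha": alpha} for r, alpha in product(r_values, alpha_values)]

print(len(grid))  # 20
```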

# Training
Predefined requirements:
- Then train it with parameter-efficient fine-tuning (I suggest LoRA, see e.g. the HF blog post or quicktour).

Define the LoRA training config `LoraConfig`.
- `r`, the LoRA attention dimension (rank): the higher it is, the more parameters can be changed
- `lora_alpha`, the LoRA scaling: scales the LoRA weights, i.e. how strongly the weights are affected
- `lora_dropout=0.05`: a small dropout to discourage overfitting, but not so large that the layer is forced to spread information over many nodes
- `use_rslora=True`: uses an improved scaling factor (rank-stabilized LoRA)
- `init_lora_weights="pissa"`: improved weight initialization of the adapter layers
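
How `r`, `lora_alpha` and `use_rslora` interact can be sketched with two small helpers. The scaling formulas follow the PEFT documentation (`lora_alpha / r`, or `lora_alpha / sqrt(r)` with rsLoRA); the helper names and the 2048 x 2048 layer size are illustrative assumptions.

```python
import math


def lora_scaling(r, lora_alpha, use_rslora=False):
    """Factor by which the low-rank update is multiplied before it is added."""
    if use_rslora:
        return lora_alpha / math.sqrt(r)
    return lora_alpha / r


def lora_param_count(d_in, d_out, r):
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    # Two low-rank factors: A with shape (r, d_in) and B with shape (d_out, r).
    return r * (d_in + d_out)


print(lora_scaling(16, 32))                   # 2.0
print(lora_scaling(16, 32, use_rslora=True))  # 8.0
print(lora_param_count(2048, 2048, 16))       # 65536
```

Note that `lora_alpha=0` gives a scaling of 0, so under these formulas the adapter output is multiplied by zero.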

Use the `SFTTrainer` from `trl` to train the model. It conditions the model to prefer certain outputs, which is desired because I expect it to always return True or False.

A small learning rate of 1e-4 will be set, which has become a community standard. As learned in previous projects, a high learning rate results in pure majority or minority classifiers.

Metrics for training and validation:
- Accuracy, because we are interested in both correct true and false predictions
- Loss, to see how confident the model is in its predictions
- Metrics are logged every epoch, because logging per step is very noisy and has no benefit.

Loss is the main metric for all decisions, as it is the most important metric for the model. Accuracy should follow loss in a correct model. Therefore, it is not necessary to optimize for accuracy.

As discussed in class, no other metrics are needed for training and validation, as accuracy and loss are sufficient to evaluate the model performance.

## Implementation

- Use `wandb.sweep` for creating the experiments.
- Grid search will be used, because the values to check have already been defined.
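
A rough sketch of how a sweep could be wired up with `wandb.sweep` and `wandb.agent`. The helper names `lora_kwargs_from_config` and `run_sweep` are hypothetical, and the training function passed in as `train_fn` is assumed to call `wandb.init()` and read `wandb.config` itself.

```python
def lora_kwargs_from_config(config):
    """Translate one sweep configuration into LoraConfig keyword arguments."""
    return {
        "r": config["r"],
        "lora_alpha": config["alpha"],
        "lora_dropout": 0.05,
        "use_rslora": True,
        "init_lora_weights": "pissa",
        "task_type": "CAUSAL_LM",
    }


def run_sweep(sweep_config, train_fn, project="nlp-project-5"):
    """Register the sweep and run its configurations with a local agent."""
    import wandb  # imported lazily so the helper above works without W&B

    sweep_id = wandb.sweep(sweep=sweep_config, project=project)
    # The agent calls train_fn once per configuration chosen by the sweep.
    wandb.agent(sweep_id, function=train_fn)
    return sweep_id


print(lora_kwargs_from_config({"r": 16, "alpha": 32}))
```

Inside `train_fn`, the dictionary returned by `lora_kwargs_from_config(wandb.config)` could be unpacked into `LoraConfig(**kwargs)`.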

In [22]:
response_template = " ### Answer:"
collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    use_rslora=True,
    init_lora_weights="pissa",
    task_type="CAUSAL_LM",
    base_model_name_or_path=MODEL
)
sft_config = SFTConfig(
    output_dir="/tmp",
    run_name="test-run",
    report_to="wandb",
    max_seq_length=8000,
    per_device_train_batch_size=1,
)
trainer = SFTTrainer(
    model,
    train_dataset=train_raw,
    eval_dataset=valid_raw,
    peft_config=lora_config,
    tokenizer=tokenizer,
    args=sft_config,
    formatting_func=formatting_prompts_func,
    data_collator=collator,
)




In [23]:
trainer.train()

RuntimeError: grad can be implicitly created only for scalar outputs
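
PyTorch raises this error when `backward()` is called on a tensor with more than one element. A plausible cause here, stated as an assumption rather than a verified diagnosis, is that `AutoModelForCausalLMWithValueHead` returns a tuple whose first element is the logits rather than a scalar loss, so the trainer ends up calling `backward()` on a non-scalar tensor. Loading the model with `transformers.AutoModelForCausalLM` instead may avoid this. The sketch below only reproduces the error message in isolation, with a random tensor standing in for the model output.

```python
import torch

# Stand-in for a non-scalar model output such as logits.
logits = torch.randn(2, 3, requires_grad=True)

try:
    logits.backward()  # fails: the tensor has more than one element
    message = ""
except RuntimeError as err:
    message = str(err)
print(message)

# Reducing to a scalar first, as a loss function does, makes backward() work.
loss = logits.mean()
loss.backward()
print(logits.grad.shape)
```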

wandb: 🚀 View run test-run at: https://wandb.ai/yelin-zhang-hslu/nlp-project-5/runs/5d6mbe44
wandb: Find logs at: wandb/run-20241205_201307-5d6mbe44/logs


After all experiments have run, the run with the smallest loss is selected as the final model to be evaluated.

# Evaluation
Metrics:
- Accuracy
    - To be able to compare the model to the previous projects
    - To check how it compares to the dataset imbalance
- Confusion matrix
    - To be able to see where the model tends to make mistakes.

Evaluation will be done with the `Trainer` class, just using the `evaluate` method and the test dataset.

I expect the evaluation to be difficult, as the model output can vary wildly. Therefore I will treat anything that is not the expected `True` or `False` as a separate class that counts botched predictions.
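
Accuracy and a confusion matrix with the additional class for botched predictions could be computed as in the sketch below. The label name `"failed"`, the helper names and the four example predictions are made up for illustration and are not results of this project.

```python
from collections import Counter

LABELS = ["True", "False", "failed"]


def confusion_counts(references, predictions):
    """Count (reference, prediction) pairs; absent pairs count as zero."""
    return Counter(zip(references, predictions))


def accuracy(references, predictions):
    """Share of exact matches; failed predictions never match a reference."""
    correct = sum(ref == pred for ref, pred in zip(references, predictions))
    return correct / len(references)


# Made-up example data, only to show the shape of the output.
refs = ["True", "True", "False", "False"]
preds = ["True", "failed", "False", "True"]

counts = confusion_counts(refs, preds)
for ref in ["True", "False"]:
    print(ref, [counts[(ref, pred)] for pred in LABELS])
print(accuracy(refs, preds))
```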

## Implementation

## Result


Check whether the implementations for test and predict are correct.

Load the best model from wandb artifact registry.

Run evaluation of final model with test dataset.

# Interpretation
Llama 3.2 already has plenty of knowledge about text and reasoning. Therefore I am expecting an accuracy of at least 70% with a healthy mix of `True` and `False` predictions.

Compared to the previous projects, I expect the predictions to be the most accurate, with the caveat that some predictions will be botched, meaning the model generates something other than the expected `True` or `False`.

## Results

## Learning