# Tutorial 3 — PEFT Instruction Finetuning (LoRA on FLAN-T5-small)
### Objective
1. Perform a small **instruction finetune** on one text-classification dataset:  
&nbsp;&nbsp;• **AG News** – 4-way news-topic classification  

Then:

a. **Implement a fully manual training loop** (no `Trainer` / `Keras` high-level APIs).  
b. **Compare zero-shot vs. finetuned accuracy**.

2. Recap how Dense Passage Retrieval (DPR) works

# 1. Instruction fine-tuning

**Steps:**

1. Fix the random seed
2. Set up the arguments for PEFT (parameter-efficient fine-tuning, which adapts a large model by training only a small set of added weights)
3. Load and preprocess the dataset
4. Set up the training hyperparameters
5. Implement the training loop
6. Compare the performance of the fine-tuned model vs. the original (zero-shot) one
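
To make step 2 more concrete, here is a minimal NumPy sketch of the low-rank update that LoRA adds next to a frozen weight matrix. The width `d = 512` is an illustrative assumption and is not read from the model config; `r = 8` and `lora_alpha = 16` mirror the `LoraConfig` used in the first code cell. It is a sketch of the idea, not the PEFT library's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

d, r, lora_alpha = 512, 8, 16             # d is illustrative; r and alpha mirror the LoraConfig
W = rng.normal(size=(d, d))               # frozen pretrained weight (never updated)
A = rng.normal(scale=0.01, size=(r, d))   # trainable down-projection
B = np.zeros((d, r))                      # trainable up-projection, zero-initialised


def lora_forward(x, W, A, B, alpha, rank):
    """Compute x @ (W + (alpha / rank) * B @ A).T without forming the full update."""
    scaling = alpha / rank
    return x @ W.T + scaling * (x @ A.T) @ B.T


x = rng.normal(size=(4, d))
y = lora_forward(x, W, A, B, lora_alpha, r)

full_params = W.size
lora_params = A.size + B.size
print(f"full matrix: {full_params} params, LoRA adapter: {lora_params} params "
      f"({100 * lora_params / full_params:.2f}% of the full matrix)")
```

Because `B` starts at zero, the adapted layer initially behaves exactly like the frozen one; training only moves `A` and `B`.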



In [1]:
# ✨ Setup ————————————————————————————————————————————
# !pip install -q datasets transformers accelerate peft bitsandbytes sentencepiece

from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq
from peft import LoraConfig, get_peft_model
import random
import time

import numpy as np
import torch
from torch.utils.data import DataLoader

seed = 42  # Set this to any integer you want
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "google/flan-t5-small"

tokenizer  = AutoTokenizer.from_pretrained(model_name)
base_model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)

# correct PEFT arg names ↓
peft_cfg = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q", "v"],   # T5 attention projections
    bias="none",
    task_type="SEQ_2_SEQ_LM"
)
model = get_peft_model(base_model, peft_cfg)

# ✨ Dataset ————————————————————————————————————————————
ds = load_dataset("ag_news")
label_text = ["World", "Sports", "Business", "Sci/Tech"]

def preprocess(example):
    prompt = (
        "Classify the following news article into one of "
        "World, Sports, Business, or Sci/Tech.\n\n"
        f"Article: {example['text']}\n\n"
        "Answer:"
    )
    target = label_text[example["label"]]
    model_inputs = tokenizer(prompt, truncation=True, max_length=256)
    model_inputs["labels"] = tokenizer(target, max_length=4, truncation=True)["input_ids"]
    return model_inputs

train_ds = (
    ds["train"]
    .shuffle(seed=42)
    .select(range(4000))           # trim for Colab
    .map(preprocess, remove_columns=ds["train"].column_names)
)
test_ds = ds["test"].map(preprocess, remove_columns=ds["test"].column_names)

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, label_pad_token_id=-100)
train_loader = DataLoader(train_ds, batch_size=8, shuffle=True,  collate_fn=data_collator)
test_loader  = DataLoader(test_ds,  batch_size=8, shuffle=False, collate_fn=data_collator)

# ✨ Optimizer & scheduler
opt = torch.optim.AdamW(model.parameters(), lr=2e-4)
num_epochs, warmup = 1, 500
tot_steps = num_epochs * len(train_loader)
sched = torch.optim.lr_scheduler.LambdaLR(opt, lambda s: min((s+1)/warmup, 1.0))

# ✨ Manual training loop —————————————————————————————————
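model.train(); t0 = time.time()
step = 0
for epoch in range(num_epochs):          # honour num_epochs so tot_steps stays correct
    for batch in train_loader:
        step += 1
        batch = {k: v.to(device) for k, v in batch.items()}
        loss = model(**batch).loss       # forward pass; loss is computed from `labels`
        loss.backward()                  # backprop (only the LoRA weights require grads)
        opt.step(); sched.step(); opt.zero_grad()

        if step % 50 == 0:
            print(f"step {step:>5}/{tot_steps} | loss {loss.item():.4f}")

print(f"✅ finished in {(time.time()-t0)/60:.1f} min")
# Save the LoRA adapter weights and the tokenizer files
model.save_pretrained("flanT5_agnews_lora"); tokenizer.save_pretrained("flanT5_agnews_lora")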

step    50/500 | loss 0.2639
step   100/500 | loss 0.3299
step   150/500 | loss 0.1629
step   200/500 | loss 0.1775
step   250/500 | loss 0.2016
step   300/500 | loss 0.1510
step   350/500 | loss 0.1152
step   400/500 | loss 0.2626
step   450/500 | loss 0.1931
step   500/500 | loss 0.4090
✅ finished in 0.9 min


('flanT5_agnews_lora/tokenizer_config.json',
 'flanT5_agnews_lora/special_tokens_map.json',
 'flanT5_agnews_lora/spiece.model',
 'flanT5_agnews_lora/added_tokens.json',
 'flanT5_agnews_lora/tokenizer.json')
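
The learning-rate schedule in the cell above is easy to misread, so here is a small pure-Python sketch of the same rule. It assumes the scheduler's internal counter starts at 0, as it does for PyTorch's `LambdaLR`; the helper name `warmup_factor` is hypothetical.

```python
def warmup_factor(step, warmup=500):
    """Same rule as the LambdaLR lambda above: linear ramp, then constant at 1.0."""
    return min((step + 1) / warmup, 1.0)


base_lr = 2e-4
for update in (1, 50, 250, 500):
    # When the `update`-th optimizer step runs, the scheduler counter is `update - 1`
    lr = base_lr * warmup_factor(update - 1)
    print(f"update {update:>3}: lr = {lr:.2e}")
```

With `warmup = 500` and 500 total updates, the ramp spans the whole run, so the peak learning rate is only reached on the final update. A shorter warmup would spend more of the run at the full rate.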

In [2]:
# ------------------------------------------------------------
# ⬇️  Evaluation: zero-shot vs. finetuned
# ------------------------------------------------------------
from tqdm.auto import tqdm
from torch.nn.functional import log_softmax  # (unused now but may help later)

def decode_labels(label_batch):
    ids_clean = [
        [t for t in seq.tolist() if t not in (-100, tokenizer.pad_token_id)]
        for seq in label_batch
    ]
    return [tokenizer.decode(ids, skip_special_tokens=True).strip() for ids in ids_clean]

@torch.no_grad()
def accuracy(loader, mdl, tag="model"):
    mdl.eval()
    correct = total = 0
    for batch in tqdm(loader, desc=tag):
        preds = tokenizer.batch_decode(
            mdl.generate(
                input_ids=batch["input_ids"].to(device),
                attention_mask=batch["attention_mask"].to(device),
                max_new_tokens=4,
            ),
            skip_special_tokens=True,
        )
        gold = decode_labels(batch["labels"])
        correct += sum(p.strip() == g for p, g in zip(preds, gold))
        total   += len(gold)
    return correct / total


# ‼️ fresh zero-shot copy (no LoRA)
zero_model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)

acc_zero     = accuracy(test_loader, zero_model, tag="zero-shot")
acc_finetune = accuracy(test_loader, model,      tag="finetuned")

print(f"Zero-shot accuracy : {acc_zero     :.3f}")
print(f"Finetuned accuracy : {acc_finetune :.3f}")
del model
del base_model
del zero_model
del tokenizer
torch.cuda.empty_cache()
import gc; gc.collect()

Zero-shot accuracy : 0.846
Finetuned accuracy : 0.884
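
The accuracy above is an exact string match between the generated text and the gold label. The sketch below isolates that comparison; the example predictions are hypothetical and are not taken from the run above.

```python
def exact_match_accuracy(preds, golds):
    """Fraction of predictions equal to the gold label after whitespace stripping."""
    if len(preds) != len(golds):
        raise ValueError("preds and golds must have the same length")
    hits = sum(p.strip() == g.strip() for p, g in zip(preds, golds))
    return hits / len(golds)


# Hypothetical model outputs, for illustration only
preds = ["Sports", "World ", "business", "Sci/Tech"]
golds = ["Sports", "World", "Business", "Sci/Tech"]
print(exact_match_accuracy(preds, golds))  # 0.75: "business" fails on letter case
```

Exact match is strict: a prediction that differs only in case or wording counts as wrong, which can understate what a zero-shot model actually knows.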



<div id="first"></div>

## 2. Introduction to Dense Passage Retrieval

Dense Passage Retrieval (DPR) is a set of [tools and models](https://arxiv.org/abs/2004.04906) for state-of-the-art open-domain Q&A research.

### 2.1 Open-domain Q&A

Open-domain Q&A aims to develop systems capable of answering questions on a **wide range** of topics.

Unlike *domain-specific Q&A systems*, which are limited to a specific field of knowledge, open-domain Q&A systems can answer questions from any field.

For example:

  - User Question: *What content is covered in CS769: Natural Language Processing?*

  - The answer might be: *CS769: Natural Language Processing (NLP) at the University of Wisconsin-Madison is a comprehensive course designed to introduce graduate students to modern NLP techniques and research. The course covers a wide range of topics, focusing on both fundamental tasks and advanced methods in NLP.*

**QUIZ**: Which model(s) do you know that work on open-domain Q&A?

### 2.2 Dense Passage Retrieval (DPR)

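DPR idea:

![DPR](https://velog.velcdn.com/images/delee12/post/8cda641d-de81-4f6b-9169-5e0087d23528/image.png)

DPR Structure:

- BERT-based dual encoder: one encoder for questions, one for passages
- Similarity score: dot product between the question and passage embeddings
- Contrastive loss

Data Sample:

- *Positive Sample*

  - Locate the answer in the source document. When no gold paragraph is available, the DPR paper retrieves the top-100 paragraphs with BM25 and uses the highest-ranked one that contains the answer as the positive sample.

- *Negative Sample*

  - Three methods:
    - Random: any paragraph sampled from the corpus
    - BM25: top-ranked paragraphs that do not contain the answer
    - Gold: positive paragraphs that belong to other questions in the training set

Contrastive Loss: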

$L(q, p^+, p_1^-, \dots, p_n^-) = -\log\frac{e^{\mathrm{sim}(q,p^+)}}{e^{\mathrm{sim}(q,p^+)}+\sum_{j=1}^{n} e^{\mathrm{sim}(q,p_j^-)}}$
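
The loss above can be checked numerically with a few lines of NumPy. This is a minimal sketch that assumes the similarity is a dot product and uses random toy embeddings; the function name `dpr_nll_loss` is hypothetical.

```python
import numpy as np


def dpr_nll_loss(q, p_pos, p_negs):
    """Negative log-likelihood of the positive passage under a softmax over dot-product scores."""
    scores = np.concatenate(([q @ p_pos], p_negs @ q))  # positive score first
    scores = scores - scores.max()                      # stabilise the exponentials
    log_prob_pos = scores[0] - np.log(np.exp(scores).sum())
    return -log_prob_pos


rng = np.random.default_rng(0)
q = rng.normal(size=8)                    # toy question embedding
p_pos = q + 0.1 * rng.normal(size=8)      # positive: close to the question
p_negs = rng.normal(size=(3, 8))          # three random negatives

print(f"loss = {dpr_nll_loss(q, p_pos, p_negs):.4f}")
```

When all passages score the same, the loss equals `log(n + 1)`; it shrinks toward zero as the positive score pulls ahead of the negatives.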


Data Structure:

```
[
  {
    "question": "....",
    "answers": ["...", "...", "..."],
    "positive_ctxs": [{
      "title": "...",
      "text": "...."
    }],
    "negative_ctxs": ["..."]
  },
  ...
]
```
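
As a rough illustration of the structure above, the sketch below builds one record and checks that the expected keys are present. The record content is placeholder text, and representing each negative context as a `title`/`text` dictionary is an assumption made here by analogy with `positive_ctxs`; `validate_record` is a hypothetical helper.

```python
import json

REQUIRED_KEYS = {"question", "answers", "positive_ctxs", "negative_ctxs"}


def validate_record(record):
    """Return True if the record has the expected keys and at least one positive context."""
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if not record["positive_ctxs"]:
        raise ValueError("at least one positive context is needed")
    return True


# Placeholder content, for illustration only
sample = [{
    "question": "What is the capital of France?",
    "answers": ["Paris"],
    "positive_ctxs": [{"title": "Paris", "text": "Paris is the capital of France."}],
    "negative_ctxs": [{"title": "Berlin", "text": "Berlin is the capital of Germany."}],
}]

records = json.loads(json.dumps(sample))   # round-trip through JSON text
assert all(validate_record(r) for r in records)
print(f"{len(records)} valid record(s)")
```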




---

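### 2.3 DPR Sample

First, we need to understand the retrieval step of DPR: encode the query, score it against precomputed passage vectors, and return the top-k passages.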

In [3]:
# Minimal runnable sketch of the retrieval step in Dense Passage Retrieval (DPR)

# Import necessary libraries
import re
import zlib

import numpy as np


class HashingEncoder:
    """Toy stand-in for a trained DPR encoder (an assumption for this demo).

    It maps text to a normalised hashed bag-of-words vector so the cell runs
    without downloading a model. A real DPR system would use BERT-based
    question and passage encoders instead.
    """

    def __init__(self, dim=256):
        self.dim = dim

    def encode(self, text):
        vec = np.zeros(self.dim)
        for token in re.findall(r"\w+", text.lower()):
            vec[zlib.crc32(token.encode("utf-8")) % self.dim] += 1.0
        norm = np.linalg.norm(vec)
        return vec / norm if norm > 0 else vec


# Function to encode passages and queries
def encode(text, model):
    """
    Encode the given text into a dense vector using the provided model.

    Args:
    text (str): The text to be encoded.
    model (object): Any object with an `encode(text)` method.

    Returns:
    np.array: The dense vector representation of the text.
    """
    return model.encode(text)

# Function to retrieve top k passages
def retrieve_top_k(query, passage_vectors, passage_texts, model, k=5):
    """
    Retrieve the top k passages most relevant to the given query.

    Args:
    query (str): The input query.
    passage_vectors (np.array): The precomputed dense vectors of passages.
    passage_texts (list): The original texts of the passages.
    model (object): The model used for encoding.
    k (int): The number of top passages to retrieve.

    Returns:
    list: The top k most relevant passages, best first.
    """
    # Encode the query into a dense vector
    query_vector = encode(query, model)

    # Compute dot-product similarity between each passage vector and the query
    similarity_scores = np.dot(passage_vectors, query_vector)

    # Get the indices of the top k passages with the highest similarity scores
    top_k_indices = np.argsort(similarity_scores)[-k:][::-1]

    # Retrieve the top k passages
    top_k_passages = [passage_texts[i] for i in top_k_indices]

    return top_k_passages

# Example usage
if __name__ == "__main__":
    # Toy encoder; swap in a pretrained model that exposes `encode` for real use
    model = HashingEncoder()

    # Precompute dense vectors for the passages (placeholder texts)
    passages = [
        "Paris is the capital and largest city of France.",
        "Berlin is the capital of Germany.",
        "The Great Barrier Reef lies off the coast of Australia.",
    ]
    passage_vectors = np.array([encode(p, model) for p in passages])

    # Input query
    query = "What is the capital of France?"

    # Retrieve the top 2 most relevant passages
    top_passages = retrieve_top_k(query, passage_vectors, passages, model, k=2)

    # Output the top passages
    for i, passage in enumerate(top_passages):
        print(f"Top {i+1} passage: {passage}")


### 2.4 DPR in Hugging Face

The Hugging Face Hub hosts many pretrained DPR models.

You can browse them [here](https://huggingface.co/models?search=dpr).

The following cell loads one pair of question and context encoders and shows how to use them.

In [4]:
import logging
from transformers import logging as hf_logging
hf_logging.set_verbosity_error()
from transformers import DPRQuestionEncoder, DPRQuestionEncoderTokenizer
from transformers import DPRContextEncoder, DPRContextEncoderTokenizer
import torch
import torch.nn.functional as F


# Load Question Encoder and Tokenizer
question_encoder = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
question_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")

# Load Context Encoder and Tokenizer
context_encoder = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
context_tokenizer = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")

# Question Part
question = "What are the content includes in CS769: Natural Language Processing?"
question_inputs = question_tokenizer(question, return_tensors="pt")
question_embeddings = question_encoder(**question_inputs).pooler_output

# Context Part
context = ["Dense Passage Retrieval (DPR) is a method for open-domain question answering where dense neural networks are used to encode passages and questions.",
      "CS 769 introduces natural language processing (NLP), a key component of artificial intelligence (AI).  We delve into cutting-edge NLP techniques and explore their applications in spoken and written language, as well as in specific domain languages.",
      "STATS 769 is intended to provide students with computing concepts and skills involved in the acquisition, manipulation, and analysis of large and/or complex data sets.",
      "CIVIL 769 primarily focuses on the procedures, techniques, analysis methods and safety systems used in the geometric design of rural highways and railways."]
context_inputs = context_tokenizer(context, return_tensors="pt", padding=True, truncation=True)
context_embeddings = context_encoder(**context_inputs).pooler_output

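# Note: DPR itself is trained with dot-product similarity; this cell ranks with
# cosine similarity instead, which normalises away the embedding lengths.
similarity = F.cosine_similarity(question_embeddings, context_embeddings)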
print("*"*10+"Result"+"*"*10)
print(f"Similarity scores: {similarity}")
print(f"Top 1 passage: {context[torch.argmax(similarity)]}, Top 1 score: {torch.max(similarity)}")
print(f"Top 2 passage: {context[torch.argsort(similarity, descending=True)[1]]}, Top 2 score: {torch.sort(similarity, descending=True)[0][1]}")


**********Result**********
Similarity scores: tensor([0.5995, 0.6492, 0.5876, 0.5311], grad_fn=<SumBackward1>)
Top 1 passage: CS 769 introduces natural language processing (NLP), a key component of artificial intelligence (AI).  We delve into cutting-edge NLP techniques and explore their applications in spoken and written language, as well as in specific domain languages., Top 1 score: 0.6492143273353577
Top 2 passage: Dense Passage Retrieval (DPR) is a method for open-domain question answering where dense neural networks are used to encode passages and questions., Top 2 score: 0.5995257496833801
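
The cell above reads off the top two passages by hand. A small helper can rank all passages in one go; the sketch below reuses the four similarity scores printed above (rounded to four decimals), and the name `top_k` is hypothetical.

```python
import numpy as np

# Scores copied from the printed output above (rounded to four decimals)
scores = np.array([0.5995, 0.6492, 0.5876, 0.5311])


def top_k(scores, k):
    """Indices of the k highest scores, best first."""
    k = min(k, len(scores))
    return np.argsort(scores)[::-1][:k]


for rank, idx in enumerate(top_k(scores, 3), start=1):
    print(f"rank {rank}: passage {idx} (score {scores[idx]:.4f})")
```

The scores are close together (0.53 to 0.65), so the ranking is more informative than the absolute values.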
