# COMPSCI 769 - Assignment 1

**Total Points:** 30  
**Runtime:** Colab **T4 GPU**  
**Dependencies:** `transformers`, `datasets`, `peft`, `accelerate`, `sentence-transformers`, `evaluate`, `scipy`, `gensim`
**Objectives:**


1.   Assessing your understanding on the difference between static word embeddings and contextual word embeddings
2.   Assessing your understanding on different token feature fusion strategies of BERT
3. Assessing your ability of conducting instruction fine-tuning on pre-trained generative language models and understanding on its corresponding effects.



---
**Note:** There is no need to install dependencies other than the gensim. Google Colab has already setup this environment for you. Enforcing downloading dependencies of newer versions may airse package conflicts. Just follow the `pip install` commands already provided in this assignment.

**Please keep the console outputs, and do not clear them for marking easiness**

**Name:** Nhut Hoang Duong

**UPI:** nduo221

**Student ID:** 804763240

In [1]:
# ✅ Setup
import os, random, numpy as np, torch
SEED = 42
random.seed(SEED); np.random.seed(SEED); torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)
DEVICE = "cuda"
if torch.cuda.is_available():
    DEVICE = "cuda"
elif torch.backends.mps.is_available():
    DEVICE = "mps"
else:
    DEVICE = "cpu"
DEVICE

'cuda'

# Part A — Static (Word2Vec) vs. Contextual (BERT) Token Representations (9 pts)

**Objective.** Show that **static** word vectors (one vector per token type) cannot disambiguate polysemy (e.g., bank), while **contextual** token embeddings group occurrences by sense (e.g., riverbank vs. financial).

**Code Implementation (6 pts):** Implement `contextual_token_vec(sentence, word="bank")`.

**Discussion (3 pt):** Explain your observed similarity patterns and why they occur.

In [2]:
# !pip install gensim

In [3]:
import torch, numpy as np, itertools
import gensim.downloader as api
from gensim.models import Word2Vec
from transformers import AutoTokenizer, AutoModel

sentences = [
    "Willows lined the bank of the stream.",
    "I opened a new bank account yesterday.",
    "The fisherman sat quietly by the bank.",
    "The central bank raised interest rates."
]
target_word = "bank"

**Note:** *After installing the gensim package, please restart the session following the instructions printed in the console.*

In [4]:
# ---- Static baseline: small Word2Vec trained quickly on a slice of text8 ----
import torch, numpy as np, itertools
import gensim.downloader as api
from gensim.models import Word2Vec

# Load a manageable slice for speed
text8 = api.load("text8")                          # generator of tokenized sentences
sentences_w2v = [s for _, s in zip(range(50_000), text8)]  # ~50k lines

# Train a compact skip-gram Word2Vec
w2v = Word2Vec(
    sentences=sentences_w2v,
    vector_size=100, window=5, min_count=5, workers=2,
    sg=1, epochs=3
)

# One static vector for the token type "bank"
w2v_bank = torch.tensor(w2v.wv[target_word]).float()

def cosine(a, b):
    return torch.nn.functional.cosine_similarity(a, b, dim=-1)

# Pairwise similarities under the SAME static "bank" vector
reps_static = torch.stack([w2v_bank for _ in sentences])  # [N, D]
pairwise_static = np.zeros((len(sentences), len(sentences)))
for i, j in itertools.product(range(len(sentences)), range(len(sentences))):
    pairwise_static[i, j] = float(cosine(reps_static[i], reps_static[j]))
pairwise_static  # expected ~ all ones

array([[0.99999988, 0.99999988, 0.99999988, 0.99999988],
       [0.99999988, 0.99999988, 0.99999988, 0.99999988],
       [0.99999988, 0.99999988, 0.99999988, 0.99999988],
       [0.99999988, 0.99999988, 0.99999988, 0.99999988]])

In [5]:

# ---- Contextual token representations with BERT ----
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mdl = AutoModel.from_pretrained("bert-base-uncased").to(DEVICE).eval()
mdl.eval()  # set to evaluation mode

# TODO (6 pts): Implement this function to return the contextual vector for the first "bank" token.
def contextual_token_vec(sentence, word="bank"):
    """
    Return the contextual embedding vector (torch.Tensor) for the first occurrence of `word`
    in `sentence`, using BERT's last_hidden_state token at that position.
    """

    # Tokenize the input sentence
    # return_tensors="pt" tells it to give us back PyTorch tensors
    inputs = tok(sentence, return_tensors="pt").to(DEVICE)
    print(f"tokenized: {inputs.input_ids}")

    # not training, so no need to calculate gradients
    with torch.no_grad():
        outputs = mdl(**inputs)

    # Get the last hidden states (contextual embeddings)
    # BERT are designed to process multiple sentences at once in a "batch."
    # one sentence => batch of size 1 => index is [0].
    last_hidden_states = outputs.last_hidden_state[0]

    # Convert token IDs to tokens
    # [0] is same reason as above, batch processing
    tokens = tok.convert_ids_to_tokens(inputs.input_ids[0])
    print(f"tokens: {tokens}")

    # BERT will split words into subwords with ## prefix if not in its vocabulary
    word_tokens = tok.tokenize(word)
    if len(word_tokens) != 1:
        raise BaseException('Temporary not handle word that not in BERT vocabulary')

    # Find the position of the first occurrence
    target_position = tokens.index(word_tokens[0])
    print(f"target_position: {target_position}")

    # Return the contextual embedding for the target word
    return last_hidden_states[target_position]

# sentences[0] = "Willows lined the bank of the stream."
contextual_token_vec(sentences[0])

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

tokenized: tensor([[  101, 11940,  2015,  7732,  1996,  2924,  1997,  1996,  5460,  1012,
           102]], device='cuda:0')
tokens: ['[CLS]', 'willow', '##s', 'lined', 'the', 'bank', 'of', 'the', 'stream', '.', '[SEP]']
target_position: 5


tensor([-3.3650e-02, -4.0651e-01, -4.2578e-01,  2.5237e-02, -5.5180e-01,
         2.5169e-01,  1.5093e-01,  2.1321e+00,  2.0032e-01, -1.8318e-01,
         9.6993e-01, -1.5982e-01,  3.9372e-01,  2.3406e-01, -7.8746e-01,
         3.5004e-01, -1.1001e-01,  1.9852e-01,  5.8291e-01,  3.1892e-01,
         5.4508e-01, -9.3500e-02, -7.2562e-02,  1.2025e+00,  2.7306e-01,
         7.0967e-01,  7.6448e-01,  6.3142e-02, -1.5365e-01,  3.2747e-01,
         1.1737e+00,  4.5297e-01, -1.8828e-01, -2.4276e-01, -7.5355e-01,
         4.0795e-01,  9.3041e-02, -6.1755e-01, -8.8286e-01,  6.5056e-01,
        -8.0550e-01, -4.7044e-01, -5.5397e-01,  9.4918e-01,  5.8524e-01,
         3.8790e-01,  2.3710e-01, -2.9180e-01,  3.7618e-01,  4.0198e-02,
        -5.0435e-01,  4.3004e-01, -1.3380e-01, -6.5136e-01,  1.5740e-01,
         2.8018e-01, -7.4568e-02, -1.0982e+00, -3.6560e-01,  4.5398e-01,
         6.6386e-01,  1.5240e-01,  6.5575e-01, -1.8351e-01,  1.0039e-01,
         2.8807e-01, -8.6597e-01,  6.7807e-02, -3.2

In [6]:
# Use your function to get contextual vectors and build a pairwise cosine similarity matrix
ctx_vecs = [contextual_token_vec(s, target_word) for s in sentences]

pairwise_contextual = np.zeros((len(sentences), len(sentences)))
for i, j in itertools.product(range(len(sentences)), range(len(sentences))):
    pairwise_contextual[i, j] = float(cosine(ctx_vecs[i], ctx_vecs[j]))
pairwise_contextual

tokenized: tensor([[  101, 11940,  2015,  7732,  1996,  2924,  1997,  1996,  5460,  1012,
           102]], device='cuda:0')
tokens: ['[CLS]', 'willow', '##s', 'lined', 'the', 'bank', 'of', 'the', 'stream', '.', '[SEP]']
target_position: 5
tokenized: tensor([[ 101, 1045, 2441, 1037, 2047, 2924, 4070, 7483, 1012,  102]],
       device='cuda:0')
tokens: ['[CLS]', 'i', 'opened', 'a', 'new', 'bank', 'account', 'yesterday', '.', '[SEP]']
target_position: 5
tokenized: tensor([[  101,  1996, 19949,  2938,  5168,  2011,  1996,  2924,  1012,   102]],
       device='cuda:0')
tokens: ['[CLS]', 'the', 'fisherman', 'sat', 'quietly', 'by', 'the', 'bank', '.', '[SEP]']
target_position: 7
tokenized: tensor([[ 101, 1996, 2430, 2924, 2992, 3037, 6165, 1012,  102]],
       device='cuda:0')
tokens: ['[CLS]', 'the', 'central', 'bank', 'raised', 'interest', 'rates', '.', '[SEP]']
target_position: 3


array([[1.        , 0.37284273, 0.70163524, 0.41021448],
       [0.37284273, 1.        , 0.45429581, 0.61562097],
       [0.70163524, 0.45429581, 0.99999994, 0.47047782],
       [0.41021448, 0.61562097, 0.47047782, 1.        ]])

## Discussion (3 pt)

**Question:** Describe what you observe in pairwise_static vs. pairwise_contextual. Why do contextual embeddings separate the riverbank and financial senses while the static vector does not?

**Answer:**

Word2Vec represents any word as a single vector, therefore, it does not concern about word context.
- Mathemathic: "bank" words in all sentences is represent by a same vector, then, the similarity of `cosine` should be 1 (`0.99999994` may be floating point issue).
- Meaning of the word "bank" is the same for the model, althought human can understand there are two concepts.

BERT implements transformer
- Each `bank` word will have different vector due to self-attention.
- The `(river) bank` in sentence 1 and 3 and `(finance) bank` in sentences 2 and 4 are quite close toghether with similarities are `0.70163518` and `0.61562085` respectively. The model also can separate those meaning quite good with low value of similarity, aka, high difference, with `0.37284273`, `0.41021445`, `0.45429575` and `0.47047782`.

---

# Part B — Sentence Embeddings on a Tiny STS-B Slice (9 pts)

**Objective.** Evaluate sentence similarity using BERT [CLS], BERT mean pooling, and a small SBERT model by correlating cosine similarity with human similarity labels from GLUE STS-B.

	•	Dataset: GLUE → STS-B (Hugging Face)
  
Link: https://huggingface.co/datasets/glue

Fields (validation split):

• sentence1 (str)

• sentence2 (str)

• label (float, range 0–5; higher means more similar)

•	Metric: **Spearman rank correlation (ρ)** measures the monotonic relationship between two variables based on their ranks. **It checks whether pairs with higher human similarity also receive higher cosine similarity, without assuming linearity.**

**Code Implementation (6 pts):** Implement bert_embed(sentences, strategy="mean") supporting "cls" and "mean".

**Discussion (3 pt):** Which strategy performs best and why might SBERT help on STS?

In [7]:
from datasets import load_dataset
from scipy.stats import spearmanr, pearsonr

# A tiny slice for speed
sts = load_dataset("glue", "stsb", split="validation[:200]")


from sentence_transformers import SentenceTransformer
sbert_small = SentenceTransformer("all-MiniLM-L6-v2")

README.md: 0.00B [00:00, ?B/s]

stsb/train-00000-of-00001.parquet:   0%|          | 0.00/502k [00:00<?, ?B/s]

stsb/validation-00000-of-00001.parquet:   0%|          | 0.00/151k [00:00<?, ?B/s]

stsb/test-00000-of-00001.parquet:   0%|          | 0.00/114k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/5749 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/1500 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1379 [00:00<?, ? examples/s]

modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

In [8]:
# TODO (6 pts): Implement BERT sentence embeddings with "cls" and "mean" strategies.
# Reuse tok/mdl from Part A
def bert_embed(sentences, strategy="mean"):
    """
    Return a tensor of shape [B, H] for a list[str] `sentences`.
    - "cls": use the [CLS] token embedding at position 0
    - "mean": mask-aware average over token embeddings
    """
    # Note, for easier reference
    # B: batch size (num of sentences)
    # S: max sequence length in batch
    # H: hidden size (768 for bert-base)

    # Tokenize all sentences at once
    # padding=True:
    #   - Sentences in a batch have different lengths => with a special [PAD] token => rectangular tensor
    #   - True ('longest') create shorter (20) sequences than 'max_length' (512) => higher performance
    # return_tensors="pt" tells it to give us back PyTorch tensors
    # truncation=True: truncate a longer than max input sentence
    # The tokenizer returns a dictionary containing 'input_ids' and 'attention_mask'
    inputs = tok(sentences, padding=True, truncation=True, return_tensors="pt").to(DEVICE)

    # 2. Get the BERT outputs (the token embeddings)
    with torch.no_grad():
        # shape is [B, S, H],
        outputs = mdl(**inputs)
        last_hidden_state = outputs.last_hidden_state

    # 3. Apply the chosen pooling strategy
    if strategy == "cls":
        # Use the [CLS] token embedding
        # The [CLS] token is always at the first position => index 0
        # extracts the [CLS] embedding for every sentence in the batch => Shape: [B, H]
        sentence_embedding = last_hidden_state[:, 0, :]
    elif strategy == "mean":
        # # ======= Ignore the padding tokens =======
        # The attention_mask is [B, S] with 1s for real tokens and 0s for padding.
        print(f"inputs['attention_mask'].shape: {inputs['attention_mask'].shape}")

        # Expand mask to match the shape of embeddings for multiplication.
        # Shape: [B, S] => [B, S, 1]
        mask = inputs['attention_mask'].unsqueeze(-1)
        print(f"mask.shape: {mask.shape}")

        # Zero out the embeddings of padding tokens.
        # Multiplying [B, S, H] with [B, S, 1]
        masked_embeddings = last_hidden_state * mask
        print(f"masked_embeddings.shape: {masked_embeddings.shape}")


        # # ======= Compute the mean of the embeddings =======
        # Sum the embeddings across the sequence length dimension.
        # Shape: [B, H]
        summed_embeddings = torch.sum(masked_embeddings, dim=1)
        print(f"summed_embeddings.shape: {summed_embeddings.shape}")

        # Count the number of actual tokens in each sentence.
        # Shape: [B]
        token_counts = torch.sum(inputs['attention_mask'], dim=1)
        # => Shape: [B, 1]
        token_counts = token_counts.unsqueeze(-1)
        print(f"token_counts.shape: {token_counts.shape}")

        # avg = sum / count
        # Shape: [B, H]
        sentence_embedding = summed_embeddings / token_counts

    return sentence_embedding.cpu()

test_sentences = ["Hello world!", "BERT is great for embeddings."]
# display(bert_embed(test_sentences, strategy="cls"))
display(bert_embed(test_sentences, strategy="mean"))

inputs['attention_mask'].shape: torch.Size([2, 11])
mask.shape: torch.Size([2, 11, 1])
masked_embeddings.shape: torch.Size([2, 11, 768])
summed_embeddings.shape: torch.Size([2, 768])
token_counts.shape: torch.Size([2, 1])


tensor([[-0.1373, -0.1593,  0.0821,  ..., -0.0644, -0.0986, -0.0170],
        [ 0.3789,  0.0296, -0.1317,  ...,  0.1847, -0.0126,  0.1121]])

In [9]:
def sbert_embed(sentences):
    return torch.tensor(sbert_small.encode(list(sentences), convert_to_numpy=True))

def evaluate(embed_fn):
    s1 = [str(x) for x in sts["sentence1"]]
    s2 = [str(x) for x in sts["sentence2"]]
    a = embed_fn(s1); b = embed_fn(s2)
    sims = torch.nn.functional.cosine_similarity(a, b).cpu().numpy()
    gold = np.array(sts["label"], dtype=float)  # 0..5
    rho  = spearmanr(sims, gold).correlation
    r, _ = pearsonr(sims, gold)
    return {"spearman": float(rho), "pearson": float(r)}

scores = {
    "BERT_CLS":  evaluate(lambda S: bert_embed(S, "cls")),
    "BERT_MEAN": evaluate(lambda S: bert_embed(S, "mean")),
    "SBERT_all-MiniLM-L6-v2": evaluate(lambda S: sbert_embed(S)),
}
scores

inputs['attention_mask'].shape: torch.Size([200, 20])
mask.shape: torch.Size([200, 20, 1])
masked_embeddings.shape: torch.Size([200, 20, 768])
summed_embeddings.shape: torch.Size([200, 768])
token_counts.shape: torch.Size([200, 1])
inputs['attention_mask'].shape: torch.Size([200, 18])
mask.shape: torch.Size([200, 18, 1])
masked_embeddings.shape: torch.Size([200, 18, 768])
summed_embeddings.shape: torch.Size([200, 768])
token_counts.shape: torch.Size([200, 1])


{'BERT_CLS': {'spearman': 0.030008223586600657,
  'pearson': -0.011850216146757261},
 'BERT_MEAN': {'spearman': 0.3705995595647637, 'pearson': 0.3350413509716228},
 'SBERT_all-MiniLM-L6-v2': {'spearman': 0.9362387497584497,
  'pearson': 0.9287386978492369}}

# Discussion (3 pt).
**Question:** Compare the three results (especially Spearman ρ). Which strategy works best on this slice and why does SBERT (a sentence-embedding model) often outperform naive [CLS] pooling?

**Answer:**

`[CLS]` represent whole sentence for BERT to predict next sentence, not to compare the similarity of two sentences together, so get worst perfromance.

`BERT_MEAN` takes all words of sentences together, therefore, sentences which have near word vectors in the vector space can end up as "similar".
- And sentences are similar have change to contain near words in space. Then, the performance is much better than `[CLS]`
- However, near is not nessesarity as the same. They even can have oppsited meaning.

`SBERT_all-MiniLM-L6-v2` model, as its description, [can be used for tasks like clustering or semantic search](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) which is the usecase we trying to test. So it outperform the formers.

---

# Part C — Parameter-Efficient Instruction Finetuning (LoRA on FLAN-T5-small) + Manual Training Loop (12 pts)

**Objective.** Perform a tiny, multi-task instruction finetuning on SST-2 (sentiment) and BoolQ (yes/no QA) with LoRA, using a manual PyTorch training loop (**no Trainer function from HuggingFace**). Then compare zero-shot vs. finetuned accuracy.

Datasets and fields:

	1.	GLUE — SST-2 (HF): https://huggingface.co/datasets/nyu-mll/glue/viewer/sst2

	2.	BoolQ (HF): https://huggingface.co/datasets/google/boolq

*Please open the link to view dataset info, including its input and label*

**Code Implementation (9 pts):** Implement the manual training loop (forward → loss → backward → clip → step → sched.step). **Using Trainer function from Huggingface is not allowed.**

**Discussion (3 pt):** After comparing zero-shot vs finetuned, discuss the observed differences and why they occur.

In [10]:
from datasets import load_dataset, DatasetDict, concatenate_datasets

raw_sst   = load_dataset("glue", "sst2")
raw_boolq = load_dataset("boolq")

def to_instruct_sst(split, n=600):
    ds = raw_sst[split].shuffle(seed=SEED).select(range(n if split=="train" else min(200, len(raw_sst[split]))))
    def map_ex(x):
        return {
            "instruction": "Classify the sentiment as Positive or Negative.",
            "input": x["sentence"],
            "output": "Positive" if x["label"]==1 else "Negative",
            "task": "sst2"
        }
    return ds.map(map_ex, remove_columns=ds.column_names)

def to_instruct_boolq(split, n=600):
    ds = raw_boolq[split].shuffle(seed=SEED).select(range(n if split=="train" else min(200, len(raw_boolq[split]))))
    def map_ex(x):
        return {
            "instruction": "Answer the yes/no question based on the passage.",
            "input": f"Passage: {x['passage']}\nQuestion: {x['question']}",
            "output": "yes" if x["answer"] else "no",
            "task": "boolq"
        }
    return ds.map(map_ex, remove_columns=ds.column_names)

train = DatasetDict({
    "train": concatenate_datasets([
        to_instruct_sst("train", n=300),
        to_instruct_boolq("train", n=300),
    ])
})

eval_ds = DatasetDict({
    "validation": concatenate_datasets([
        to_instruct_sst("validation", n=200),
        to_instruct_boolq("validation", n=200),
    ])
})

len(train["train"]), len(eval_ds["validation"])

sst2/train-00000-of-00001.parquet:   0%|          | 0.00/3.11M [00:00<?, ?B/s]

sst2/validation-00000-of-00001.parquet:   0%|          | 0.00/72.8k [00:00<?, ?B/s]

sst2/test-00000-of-00001.parquet:   0%|          | 0.00/148k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/67349 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/872 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1821 [00:00<?, ? examples/s]

README.md: 0.00B [00:00, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/3.69M [00:00<?, ?B/s]

data/validation-00000-of-00001.parquet:   0%|          | 0.00/1.26M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/9427 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/3270 [00:00<?, ? examples/s]

Map:   0%|          | 0/300 [00:00<?, ? examples/s]

Map:   0%|          | 0/300 [00:00<?, ? examples/s]

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

(600, 400)

In [11]:
display(train["train"][0])
display(train["train"][300])

{'instruction': 'Classify the sentiment as Positive or Negative.',
 'input': 'klein , charming in comedies like american pie and dead-on in election , ',
 'output': 'Positive',
 'task': 'sst2'}

{'instruction': 'Answer the yes/no question based on the passage.',
 'input': "Passage: Henry Daniel Mills is a fictional character in ABC's television series Once Upon a Time. Henry is the boy Emma Swan gave up to adoption; Regina Mills adopted him. Henry was originally portrayed as a child by Jared S. Gilmore, who won the Young Artist Award for Best Performance in a TV Series -- Leading Young Actor in 2012. For the show's seventh and final season, Andrew J. West later took over the role of Henry as an adult and father to a eight-year-old girl named Lucy, with Gilmore also making three appearances as Henry during the season.\nQuestion: did henry die in once upon a time",
 'output': 'no',
 'task': 'boolq'}

In [12]:
# Tokenization for FLAN-T5 (text-to-text)
from transformers import AutoTokenizer
t5_name = "google/flan-t5-small"
t5_tok  = AutoTokenizer.from_pretrained(t5_name)

def format_example(ex):
    prompt = f"Instruction: {ex['instruction']}\nInput: {ex['input']}\nOutput:"
    out = t5_tok(prompt, max_length=512, truncation=True)
    label_ids = t5_tok(ex["output"], max_length=8, truncation=True).input_ids
    out["labels"] = label_ids  # list[int]
    return out

train_tokenized = train["train"].map(format_example, remove_columns=train["train"].column_names)
eval_tokenized  = eval_ds["validation"].map(format_example, remove_columns=eval_ds["validation"].column_names)

train_tokenized.set_format(type="torch")
eval_tokenized.set_format(type="torch")

tokenizer_config.json: 0.00B [00:00, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json: 0.00B [00:00, ?B/s]

Map:   0%|          | 0/600 [00:00<?, ? examples/s]

Map:   0%|          | 0/400 [00:00<?, ? examples/s]

In [13]:
display(train_tokenized[0])

{'input_ids': tensor([21035,    10,  4501,  4921,     8,  6493,    38, 24972,    42, 17141,
          1528,     5,    86,  2562,    10, 21856,     3,     6, 12216,    16,
           369,  7719,   114, 10211,  6253,    11,  3654,    18,   106,    16,
          4356,     3,     6,  3387,  2562,    10,     1]),
 'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]),
 'labels': tensor([24972,     1])}

In [22]:
# Model + LoRA
import math, time, torch
from torch.utils.data import DataLoader
from torch.optim import AdamW
from transformers import AutoModelForSeq2SeqLM, get_linear_schedule_with_warmup
from peft import LoraConfig, get_peft_model

base = AutoModelForSeq2SeqLM.from_pretrained(t5_name).to(DEVICE)
peft_cfg = LoraConfig(task_type="SEQ_2_SEQ_LM", r=18, lora_alpha=16, lora_dropout=0.1, target_modules=["q", "v"])
model = get_peft_model(base, peft_cfg).to(DEVICE)

# Robust collate: pad inputs; pad labels with pad_token_id then mask to -100 for loss
def collate_batch(features):
    input_ids = [torch.tensor(f["input_ids"], dtype=torch.long) for f in features]
    attention_mask = [torch.tensor(f["attention_mask"], dtype=torch.long) for f in features]
    labels = [torch.tensor(f["labels"], dtype=torch.long) for f in features]

    input_ids = torch.nn.utils.rnn.pad_sequence(input_ids, batch_first=True, padding_value=t5_tok.pad_token_id)
    attention_mask = torch.nn.utils.rnn.pad_sequence(attention_mask, batch_first=True, padding_value=0)
    labels = torch.nn.utils.rnn.pad_sequence(labels, batch_first=True, padding_value=t5_tok.pad_token_id)
    labels[labels == t5_tok.pad_token_id] = -100

    return {"input_ids": input_ids.to(DEVICE), "attention_mask": attention_mask.to(DEVICE), "labels": labels.to(DEVICE)}

train_loader = DataLoader(train_tokenized, batch_size=8, shuffle=True,  collate_fn=collate_batch)
eval_loader  = DataLoader(eval_tokenized,  batch_size=8, shuffle=False, collate_fn=collate_batch)

# Pre-defined hyperparameters and scheduler
lr = 2e-4
epochs = 5
max_grad_norm = 1.0
warmup_ratio = 0.06

optimizer = AdamW(model.parameters(), lr=lr)
num_training_steps = epochs * math.ceil(len(train_loader))
num_warmup_steps   = int(warmup_ratio * num_training_steps)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=num_warmup_steps,
                                            num_training_steps=num_training_steps)

In [23]:
# TODO (9 pts): Implement the **manual training loop**.
# Requirements:
#   - For each batch: forward -> get loss -> backward -> gradient clipping -> optimizer.step() -> optimizer.zero_grad() -> scheduler.step()
#   - Log average loss every ~50 steps (optional but recommended)
#   - (No AMP/scaler code; use FP32)


log_every = 50
model.train()
global_step = 0
running = 0.0
start_time = time.time()

for step, batch in enumerate(train_loader, 1):
    batch = {k: v.to(DEVICE) for k, v in batch.items()}
    loss  = model(**batch).loss
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()

    if step % log_every == 0:
        print(f"step {step:>5}/{num_training_steps} | loss {loss.item():.4f}")

print(f"✅ finished in {(time.time() - start_time)/60:.1f} min")
model.save_pretrained("flanT5_agnews_lora")

print("Training finished.")

  input_ids = [torch.tensor(f["input_ids"], dtype=torch.long) for f in features]
  attention_mask = [torch.tensor(f["attention_mask"], dtype=torch.long) for f in features]
  labels = [torch.tensor(f["labels"], dtype=torch.long) for f in features]


step    50/375 | loss 0.3393
✅ finished in 0.2 min
Training finished.


In [24]:
import re
import sys, subprocess
def ensure(pkg):
    try:
        __import__(pkg)
    except ModuleNotFoundError:
        subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", pkg])

ensure("evaluate")
import evaluate
acc = evaluate.load("accuracy")

base_eval = AutoModelForSeq2SeqLM.from_pretrained(t5_name).to(DEVICE).eval()
ft_eval   = model.eval()

def prompts_and_refs(task_name):
    subset = [ex for ex in eval_ds["validation"] if ex["task"]==task_name]
    prompts = [f"Instruction: {ex['instruction']}\nInput: {ex['input']}\nOutput:" for ex in subset]
    refs    = [ex["output"].lower().strip() for ex in subset]  # 'positive'/'negative' or 'yes'/'no'
    return prompts, refs

def batch_generate(mdl, prompts, max_new_tokens=5):
    toks = t5_tok(prompts, padding=True, truncation=True, return_tensors="pt").to(DEVICE)
    with torch.no_grad():
        gen = mdl.generate(**toks, max_new_tokens=max_new_tokens)
    outs = [t5_tok.decode(g, skip_special_tokens=True).strip().lower() for g in gen]
    return outs

def normalize_pred(task, s: str):
    t = re.findall(r"[a-z]+", s.lower())
    if task == "sst2":
        if any(tok.startswith("pos") for tok in t): return "positive"
        if any(tok.startswith("neg") for tok in t): return "negative"
        return "positive"  # fallback
    if task == "boolq":
        if "yes" in t or "true" in t:  return "yes"
        if "no"  in t or "false" in t: return "no"
        return "yes"  # fallback
    return s.strip().lower()

LABEL_ID = {
    "sst2": {"negative": 0, "positive": 1},
    "boolq": {"no": 0, "yes": 1},
}

def to_ids(task, items):
    m = LABEL_ID[task]
    return [m[x] for x in items]

def eval_task(task):
    prompts, refs = prompts_and_refs(task)
    base_outs = [normalize_pred(task, o) for o in batch_generate(base_eval, prompts)]
    ft_outs   = [normalize_pred(task, o) for o in batch_generate(ft_eval, prompts)]

    ref_ids  = to_ids(task, refs)
    base_ids = to_ids(task, base_outs)
    ft_ids   = to_ids(task, ft_outs)

    return {
        "base": acc.compute(predictions=base_ids, references=ref_ids)["accuracy"],
        "finetuned": acc.compute(predictions=ft_ids, references=ref_ids)["accuracy"],
    }

sst2_scores  = eval_task("sst2")
boolq_scores = eval_task("boolq")
{"SST2": sst2_scores, "BoolQ": boolq_scores}

{'SST2': {'base': 0.83, 'finetuned': 0.835},
 'BoolQ': {'base': 0.645, 'finetuned': 0.665}}

Performance notes:

epochs = 1
```
{'SST2': {'base': 0.83, 'finetuned': 0.835},
 'BoolQ': {'base': 0.645, 'finetuned': 0.685}}
```

epochs = 5
```
{'SST2': {'base': 0.83, 'finetuned': 0.835},
 'BoolQ': {'base': 0.645, 'finetuned': 0.665}}
```

**Note:** If your fine-tuned model perform worse than the base model, you should check your code.

# Discussion (3 pt)

**Quesion:** How did finetuning change performance compared to zero-shot for each task? Connect your observations to instruction finetuning and task specialization: why might SST-2 improve more than BoolQ (or vice versa) with this small LoRA update?

*LoRA:* a parameter-efficienct tuning method which requires only tuning a small amount of parameters instead of full-parameter updates

*Hint:* The FLAN-T5 model was already a fine-tuned version of the base T5 model. You can find online information/papers regarding with which kinds of tasks FLAN-T5 was already fine-tuned.

**Answer:**