# COMPSCI 769 - Assignment 1

**Total Points:** 30  
**Runtime:** Colab **T4 GPU**  
**Dependencies:** `transformers`, `datasets`, `peft`, `accelerate`, `sentence-transformers`, `evaluate`, `scipy`, `gensim`  
**Objectives:**


1. Assess your understanding of the difference between static word embeddings and contextual word embeddings.
2. Assess your understanding of different token feature fusion strategies for BERT.
3. Assess your ability to perform instruction fine-tuning on pre-trained generative language models, and your understanding of its effects.



---
**Note:** You do not need to install any dependency other than gensim; Google Colab has already set up this environment for you. Forcing newer versions of the dependencies may cause package conflicts. Just follow the `pip install` commands already provided in this assignment.

**Please keep the console outputs and do not clear them; this makes marking easier.**

**Name:**

**UPI:**

**Student ID:**

In [None]:
# ✅ Setup
import os, random, numpy as np, torch
SEED = 42
random.seed(SEED); np.random.seed(SEED); torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
DEVICE

# Part A — Static (Word2Vec) vs. Contextual (BERT) Token Representations (9 pts)

**Objective.** Show that **static** word vectors (one vector per token type) cannot disambiguate polysemy (e.g., bank), while **contextual** token embeddings group occurrences by sense (e.g., riverbank vs. financial).
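
To make the contrast concrete before touching real models, here is a minimal toy sketch in plain Python. The 2-d vectors, the `TOY_STATIC` table, and the neighbour-averaging rule in `toy_contextual_vec` are all hypothetical stand-ins chosen for illustration; they are not how Word2Vec or BERT compute their vectors.

```python
import math

def toy_cosine(u, v):
    """Cosine similarity between two plain Python vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical static table: exactly one vector per word type.
TOY_STATIC = {
    "bank": [1.0, 1.0],
    "river": [0.0, 2.0],
    "money": [2.0, 0.0],
}

def toy_static_vec(word, context):
    # A static lookup ignores the context entirely.
    return TOY_STATIC[word]

def toy_contextual_vec(word, context):
    # Crude stand-in for contextualisation: average the word with its neighbours.
    vecs = [TOY_STATIC[word]] + [TOY_STATIC[c] for c in context]
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(2)]

static_sim = toy_cosine(toy_static_vec("bank", ["river"]),
                        toy_static_vec("bank", ["money"]))
ctx_sim = toy_cosine(toy_contextual_vec("bank", ["river"]),
                     toy_contextual_vec("bank", ["money"]))
print(static_sim, ctx_sim)
```

In this toy setup the static similarity is 1 regardless of context, while the context-mixed vectors for the two uses of "bank" move apart.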

**Code Implementation (6 pts):** Implement `contextual_token_vec(sentence, word="bank")`.

**Discussion (3 pts):** Explain your observed similarity patterns and why they occur.

In [None]:
!pip install gensim
import torch, numpy as np, itertools
import gensim.downloader as api
from gensim.models import Word2Vec
from transformers import AutoTokenizer, AutoModel

sentences = [
    "Willows lined the bank of the stream.",
    "I opened a new bank account yesterday.",
    "The fisherman sat quietly by the bank.",
    "The central bank raised interest rates."
]
target_word = "bank"

**Note:** *After installing the gensim package, please restart the session following the instructions printed in the console.*

In [None]:
# ---- Static baseline: small Word2Vec trained quickly on a slice of text8 ----
import torch, numpy as np, itertools
import gensim.downloader as api
from gensim.models import Word2Vec

# Load a manageable slice for speed
text8 = api.load("text8")                          # iterable of tokenized text chunks (lists of words)
sentences_w2v = [s for _, s in zip(range(50_000), text8)]  # keep at most 50k chunks

# Train a compact skip-gram Word2Vec
w2v = Word2Vec(
    sentences=sentences_w2v,
    vector_size=100, window=5, min_count=5, workers=2,
    sg=1, epochs=3
)

# One static vector for the token type "bank"
w2v_bank = torch.tensor(w2v.wv[target_word]).float()

def cosine(a, b):
    return torch.nn.functional.cosine_similarity(a, b, dim=-1)

# Pairwise similarities under the SAME static "bank" vector
reps_static = torch.stack([w2v_bank for _ in sentences])  # [N, D]
pairwise_static = np.zeros((len(sentences), len(sentences)))
for i, j in itertools.product(range(len(sentences)), range(len(sentences))):
    pairwise_static[i, j] = float(cosine(reps_static[i], reps_static[j]))
pairwise_static  # expected ~ all ones

In [None]:
# ---- Contextual token representations with BERT ----
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mdl = AutoModel.from_pretrained("bert-base-uncased").to(DEVICE).eval()

# TODO (6 pts): Implement this function to return the contextual vector for the first "bank" token.
def contextual_token_vec(sentence, word="bank"):
    """
    Return the contextual embedding vector (torch.Tensor) for the first occurrence of `word`
    in `sentence`, using BERT's last_hidden_state token at that position.
    """
    # Your code here
    raise NotImplementedError

# Use your function to get contextual vectors and build a pairwise cosine similarity matrix
ctx_vecs = [contextual_token_vec(s, target_word) for s in sentences]

pairwise_contextual = np.zeros((len(sentences), len(sentences)))
for i, j in itertools.product(range(len(sentences)), range(len(sentences))):
    pairwise_contextual[i, j] = float(cosine(ctx_vecs[i], ctx_vecs[j]))
pairwise_contextual

## Discussion (3 pts)

**Question:** Describe what you observe in `pairwise_static` vs. `pairwise_contextual`. Why do contextual embeddings separate the riverbank and financial senses while the static vector does not?

**Answer:**

# Part B — Sentence Embeddings on a Tiny STS-B Slice (9 pts)

**Objective.** Evaluate sentence similarity using BERT [CLS], BERT mean pooling, and a small SBERT model by correlating cosine similarity with human similarity labels from GLUE STS-B.

- Dataset: GLUE → STS-B (Hugging Face). Link: https://huggingface.co/datasets/glue
- Fields (validation split):
  - `sentence1` (str)
  - `sentence2` (str)
  - `label` (float, range 0–5; higher means more similar)
- Metric: **Spearman rank correlation (ρ)** measures the monotonic relationship between two variables based on their ranks. **It checks whether pairs with higher human similarity also receive higher cosine similarity, without assuming linearity.**
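
The "without assuming linearity" point can be checked on a toy example. The sketch below implements Spearman as the Pearson correlation of (tie-averaged) ranks in plain Python; the data in `toy_gold` and `toy_sims` is made up for illustration, and in the assignment itself `scipy.stats.spearmanr` does this work.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def toy_pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def toy_spearman(x, y):
    # Spearman's rho is Pearson's r computed on the ranks.
    return toy_pearson(average_ranks(x), average_ranks(y))

# Made-up scores: similarity grows monotonically but not linearly with the label.
toy_gold = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
toy_sims = [g ** 3 / 125 for g in toy_gold]

rho = toy_spearman(toy_sims, toy_gold)
r = toy_pearson(toy_sims, toy_gold)
print(rho, r)
```

Because the ordering of the pairs is preserved exactly, ρ is 1 here even though the relationship is cubic, whereas Pearson's r is noticeably lower.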

**Code Implementation (6 pts):** Implement `bert_embed(sentences, strategy="mean")` supporting `"cls"` and `"mean"`.
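
As a reminder of what "mask-aware" means for mean pooling, here is a small sketch on plain Python lists. The helper `toy_masked_mean` and its toy inputs are hypothetical; your `bert_embed` should apply the same idea to batched tensors using the tokenizer's attention mask.

```python
def toy_masked_mean(token_vecs, mask):
    """Average only the vectors whose mask entry is 1 (real tokens, not padding)."""
    kept = [v for v, m in zip(token_vecs, mask) if m == 1]
    n = max(len(kept), 1)  # guard against an all-zero mask
    dim = len(token_vecs[0])
    return [sum(v[d] for v in kept) / n for d in range(dim)]

# Two real tokens followed by one padding position with arbitrary values.
toy_tokens = [[1.0, 3.0], [3.0, 5.0], [100.0, 100.0]]
toy_mask = [1, 1, 0]

pooled = toy_masked_mean(toy_tokens, toy_mask)
naive = [sum(v[d] for v in toy_tokens) / len(toy_tokens) for d in range(2)]
print(pooled, naive)
```

The naive average is dominated by the padding position, which is why the mask has to enter both the sum and the divisor.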

**Discussion (3 pts):** Which strategy performs best, and why might SBERT help on STS?

In [None]:
from datasets import load_dataset
from scipy.stats import spearmanr, pearsonr

# A tiny slice for speed
sts = load_dataset("glue", "stsb", split="validation[:200]")


from sentence_transformers import SentenceTransformer
sbert_small = SentenceTransformer("all-MiniLM-L6-v2")

In [None]:
# TODO (6 pts): Implement BERT sentence embeddings with "cls" and "mean" strategies.
# Reuse tok/mdl from Part A
def bert_embed(sentences, strategy="mean"):
    """
    Return a tensor of shape [B, H] for a list[str] `sentences`.
    - "cls": use the [CLS] token embedding at position 0
    - "mean": mask-aware average over token embeddings
    """
    # Your code here
    raise NotImplementedError

In [None]:
def sbert_embed(sentences):
    return torch.tensor(sbert_small.encode(list(sentences), convert_to_numpy=True))

def evaluate(embed_fn):
    s1 = [str(x) for x in sts["sentence1"]]
    s2 = [str(x) for x in sts["sentence2"]]
    a = embed_fn(s1); b = embed_fn(s2)
    sims = torch.nn.functional.cosine_similarity(a, b).cpu().numpy()
    gold = np.array(sts["label"], dtype=float)  # 0..5
    rho  = spearmanr(sims, gold).correlation
    r, _ = pearsonr(sims, gold)
    return {"spearman": float(rho), "pearson": float(r)}

scores = {
    "BERT_CLS":  evaluate(lambda S: bert_embed(S, "cls")),
    "BERT_MEAN": evaluate(lambda S: bert_embed(S, "mean")),
    "SBERT_all-MiniLM-L6-v2": evaluate(lambda S: sbert_embed(S)),
}
scores

## Discussion (3 pts)

**Question:** Compare the three results (especially Spearman ρ). Which strategy works best on this slice and why does SBERT (a sentence-embedding model) often outperform naive [CLS] pooling?

**Answer:**

# Part C — Parameter-Efficient Instruction Finetuning (LoRA on FLAN-T5-small) + Manual Training Loop (12 pts)

**Objective.** Perform a tiny, multi-task instruction finetuning on SST-2 (sentiment) and BoolQ (yes/no QA) with LoRA, using a manual PyTorch training loop (**no `Trainer` class from Hugging Face**). Then compare zero-shot vs. finetuned accuracy.

Datasets and fields:

1. GLUE SST-2 (HF): https://huggingface.co/datasets/nyu-mll/glue/viewer/sst2

2. BoolQ (HF): https://huggingface.co/datasets/google/boolq

*Please open the links to view each dataset's info, including its input and label fields.*

**Code Implementation (9 pts):** Implement the manual training loop (forward → loss → backward → gradient clipping → `optimizer.step()` → `optimizer.zero_grad()` → `scheduler.step()`). **Using the `Trainer` class from Hugging Face is not allowed.**

**Discussion (3 pts):** After comparing zero-shot vs. finetuned accuracy, discuss the observed differences and why they occur.
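
The order of operations in one training step can be sketched on a scalar toy problem, with no PyTorch involved. Everything here is an illustrative assumption: the quadratic loss, the hand-written gradient, the `toy_` names, and the `linear_warmup_decay` helper, which only mimics the general shape of a linear warmup-then-decay schedule. It is not the assignment solution.

```python
def linear_warmup_decay(step, warmup_steps, total_steps):
    """LR multiplier: ramps 0 -> 1 over warmup, then decays linearly to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

def clip_scalar_grad(grad, max_norm):
    # Scalar analogue of gradient-norm clipping: rescale if |grad| > max_norm.
    norm = abs(grad)
    return grad if norm <= max_norm else grad * (max_norm / norm)

toy_base_lr = 0.5
toy_max_grad_norm = 1.0
toy_total_steps, toy_warmup_steps = 50, 3

w = 0.0            # single trainable parameter
toy_losses = []
for step in range(toy_total_steps):
    loss = (w - 3.0) ** 2                            # forward pass and loss
    grad = 2.0 * (w - 3.0)                           # "backward": d(loss)/dw
    grad = clip_scalar_grad(grad, toy_max_grad_norm) # gradient clipping
    lr = toy_base_lr * linear_warmup_decay(step, toy_warmup_steps, toy_total_steps)
    w -= lr * grad                                   # optimizer step
    grad = 0.0                                       # analogue of zero_grad
    toy_losses.append(loss)                          # scheduler advances with `step`

print(toy_losses[0], toy_losses[-1], w)
```

Clipping caps the early updates while the parameter is far from the optimum, and the schedule determines how large each step is allowed to be.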

In [None]:
from datasets import load_dataset, DatasetDict, concatenate_datasets

raw_sst   = load_dataset("glue", "sst2")
raw_boolq = load_dataset("boolq")

def to_instruct_sst(split, n=600):
    ds = raw_sst[split].shuffle(seed=SEED).select(range(n if split=="train" else min(200, len(raw_sst[split]))))
    def map_ex(x):
        return {
            "instruction": "Classify the sentiment as Positive or Negative.",
            "input": x["sentence"],
            "output": "Positive" if x["label"]==1 else "Negative",
            "task": "sst2"
        }
    return ds.map(map_ex, remove_columns=ds.column_names)

def to_instruct_boolq(split, n=600):
    ds = raw_boolq[split].shuffle(seed=SEED).select(range(n if split=="train" else min(200, len(raw_boolq[split]))))
    def map_ex(x):
        return {
            "instruction": "Answer the yes/no question based on the passage.",
            "input": f"Passage: {x['passage']}\nQuestion: {x['question']}",
            "output": "yes" if x["answer"] else "no",
            "task": "boolq"
        }
    return ds.map(map_ex, remove_columns=ds.column_names)

train = DatasetDict({
    "train": concatenate_datasets([
        to_instruct_sst("train", n=300),
        to_instruct_boolq("train", n=300),
    ])
})

eval_ds = DatasetDict({
    "validation": concatenate_datasets([
        to_instruct_sst("validation", n=200),
        to_instruct_boolq("validation", n=200),
    ])
})

len(train["train"]), len(eval_ds["validation"])

In [None]:
# Tokenization for FLAN-T5 (text-to-text)
from transformers import AutoTokenizer
t5_name = "google/flan-t5-small"
t5_tok  = AutoTokenizer.from_pretrained(t5_name)

def format_example(ex):
    prompt = f"Instruction: {ex['instruction']}\nInput: {ex['input']}\nOutput:"
    out = t5_tok(prompt, max_length=512, truncation=True)
    label_ids = t5_tok(ex["output"], max_length=8, truncation=True).input_ids
    out["labels"] = label_ids  # list[int]
    return out

train_tokenized = train["train"].map(format_example, remove_columns=train["train"].column_names)
eval_tokenized  = eval_ds["validation"].map(format_example, remove_columns=eval_ds["validation"].column_names)

train_tokenized.set_format(type="torch")
eval_tokenized.set_format(type="torch")

In [None]:
# Model + LoRA
import math, time, torch
from torch.utils.data import DataLoader
from torch.optim import AdamW
from transformers import AutoModelForSeq2SeqLM, get_linear_schedule_with_warmup
from peft import LoraConfig, get_peft_model

base = AutoModelForSeq2SeqLM.from_pretrained(t5_name).to(DEVICE)
peft_cfg = LoraConfig(task_type="SEQ_2_SEQ_LM", r=8, lora_alpha=16, lora_dropout=0.1, target_modules=["q", "v"])
model = get_peft_model(base, peft_cfg).to(DEVICE)

# Robust collate: pad inputs; pad labels with pad_token_id then mask to -100 for loss
def collate_batch(features):
    # Features are already tensors after set_format(type="torch"); as_tensor avoids a copy-construct warning
    input_ids = [torch.as_tensor(f["input_ids"], dtype=torch.long) for f in features]
    attention_mask = [torch.as_tensor(f["attention_mask"], dtype=torch.long) for f in features]
    labels = [torch.as_tensor(f["labels"], dtype=torch.long) for f in features]

    input_ids = torch.nn.utils.rnn.pad_sequence(input_ids, batch_first=True, padding_value=t5_tok.pad_token_id)
    attention_mask = torch.nn.utils.rnn.pad_sequence(attention_mask, batch_first=True, padding_value=0)
    labels = torch.nn.utils.rnn.pad_sequence(labels, batch_first=True, padding_value=t5_tok.pad_token_id)
    labels[labels == t5_tok.pad_token_id] = -100

    return {"input_ids": input_ids.to(DEVICE), "attention_mask": attention_mask.to(DEVICE), "labels": labels.to(DEVICE)}

train_loader = DataLoader(train_tokenized, batch_size=8, shuffle=True,  collate_fn=collate_batch)
eval_loader  = DataLoader(eval_tokenized,  batch_size=8, shuffle=False, collate_fn=collate_batch)

# Pre-defined hyperparameters and scheduler
lr = 2e-4
epochs = 1
max_grad_norm = 1.0
warmup_ratio = 0.06

optimizer = AdamW(model.parameters(), lr=lr)
num_training_steps = epochs * math.ceil(len(train_loader))
num_warmup_steps   = int(warmup_ratio * num_training_steps)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=num_warmup_steps,
                                            num_training_steps=num_training_steps)

In [None]:
# TODO (9 pts): Implement the **manual training loop**.
# Requirements:
#   - For each batch: forward -> get loss -> backward -> gradient clipping -> optimizer.step() -> optimizer.zero_grad() -> scheduler.step()
#   - Log average loss every ~50 steps (optional but recommended)
#   - (No AMP/scaler code; use FP32)


log_every = 50
model.train()
global_step = 0
running = 0.0
start_time = time.time()

# Your code here
raise NotImplementedError

In [None]:
import re
import sys, subprocess
def ensure(pkg):
    try:
        __import__(pkg)
    except ModuleNotFoundError:
        subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", pkg])

ensure("evaluate")
import evaluate
acc = evaluate.load("accuracy")

base_eval = AutoModelForSeq2SeqLM.from_pretrained(t5_name).to(DEVICE).eval()
ft_eval   = model.eval()

def prompts_and_refs(task_name):
    subset = [ex for ex in eval_ds["validation"] if ex["task"]==task_name]
    prompts = [f"Instruction: {ex['instruction']}\nInput: {ex['input']}\nOutput:" for ex in subset]
    refs    = [ex["output"].lower().strip() for ex in subset]  # 'positive'/'negative' or 'yes'/'no'
    return prompts, refs

def batch_generate(mdl, prompts, max_new_tokens=5):
    toks = t5_tok(prompts, padding=True, truncation=True, return_tensors="pt").to(DEVICE)
    with torch.no_grad():
        gen = mdl.generate(**toks, max_new_tokens=max_new_tokens)
    outs = [t5_tok.decode(g, skip_special_tokens=True).strip().lower() for g in gen]
    return outs

def normalize_pred(task, s: str):
    t = re.findall(r"[a-z]+", s.lower())
    if task == "sst2":
        if any(tok.startswith("pos") for tok in t): return "positive"
        if any(tok.startswith("neg") for tok in t): return "negative"
        return "positive"  # fallback
    if task == "boolq":
        if "yes" in t or "true" in t:  return "yes"
        if "no"  in t or "false" in t: return "no"
        return "yes"  # fallback
    return s.strip().lower()

LABEL_ID = {
    "sst2": {"negative": 0, "positive": 1},
    "boolq": {"no": 0, "yes": 1},
}

def to_ids(task, items):
    m = LABEL_ID[task]
    return [m[x] for x in items]

def eval_task(task):
    prompts, refs = prompts_and_refs(task)
    base_outs = [normalize_pred(task, o) for o in batch_generate(base_eval, prompts)]
    ft_outs   = [normalize_pred(task, o) for o in batch_generate(ft_eval, prompts)]

    ref_ids  = to_ids(task, refs)
    base_ids = to_ids(task, base_outs)
    ft_ids   = to_ids(task, ft_outs)

    return {
        "base": acc.compute(predictions=base_ids, references=ref_ids)["accuracy"],
        "finetuned": acc.compute(predictions=ft_ids, references=ref_ids)["accuracy"],
    }

sst2_scores  = eval_task("sst2")
boolq_scores = eval_task("boolq")
{"SST2": sst2_scores, "BoolQ": boolq_scores}

**Note:** If your fine-tuned model performs worse than the base model, you should check your code.

## Discussion (3 pts)

**Question:** How did finetuning change performance compared to zero-shot for each task? Connect your observations to instruction finetuning and task specialization: why might SST-2 improve more than BoolQ (or vice versa) with this small LoRA update?

*LoRA:* a parameter-efficient tuning method that trains only a small number of added parameters instead of updating all model parameters.
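
To give a feel for "a small number of parameters", the sketch below counts trainable weights for one linear layer with and without a rank-`r` LoRA update, and shows the usual low-rank forward pass on tiny lists. The 512×512 layer size is an illustrative assumption rather than a statement about FLAN-T5-small's exact shapes; `r=8` and `alpha=16` are taken from the `LoraConfig` used in this assignment.

```python
def lora_param_counts(d_in, d_out, r):
    """Trainable weights: full matrix vs. two low-rank factors A (r x d_in), B (d_out x r)."""
    full = d_in * d_out
    lora = r * (d_in + d_out)
    return full, lora

def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha, r):
    # Frozen base output plus the scaled low-rank correction (alpha / r) * B(Ax).
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + (alpha / r) * d for b, d in zip(base, delta)]

full, lora = lora_param_counts(d_in=512, d_out=512, r=8)
ratio = lora / full
print(full, lora, ratio)

# Tiny rank-1 example: with B initialised to zeros the layer output is unchanged.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
A = [[0.5, 0.5, 0.5]]
B_zero = [[0.0], [0.0]]
x = [1.0, 2.0, 3.0]
out_init = lora_forward(W, A, B_zero, x, alpha=16, r=1)
```

Under these assumptions the low-rank factors hold about 3% as many weights as the full matrix, and a zero-initialised `B` leaves the frozen layer's output unchanged until training moves it.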

*Hint:* FLAN-T5 is itself an instruction-finetuned version of the base T5 model. You can find information and papers online describing the kinds of tasks FLAN-T5 was already finetuned on.

**Answer:**