# ADS-509 Assignment 5.1

## Finetuning LLMs
**Student Version**

In this assignment, you will use a small, locally-hosted LLM (`google/flan-t5-small`) to evaluate performance on the SST‑2 sentiment classification benchmark dataset. You will compare how the same model performs with each of the following approaches:
1) Zero‑shot prompting
2) Few‑shot prompting
3) Fine‑tuning


## General Assignment Instructions

These instructions are included in every assignment, to remind you of the coding standards for the class. Feel free to delete this cell after reading it.

Work through this notebook as if it were a worksheet, completing the code sections marked with **TODO** in the cells provided. Similarly, written questions will be marked by a "Q:" and will have a corresponding "A:" spot for you to fill in with your answers. **Make sure to answer every question marked with a Q: for full credit**.

Your code should be relatively easy-to-read, sensibly commented, and clean. Writing code is a messy process, so please be sure to edit your final submission. Remove any cells that are not needed or parts of cells that contain unnecessary code. Remove inessential import statements and make sure that all such statements are moved into the designated cell.

A .pdf of this notebook, with your completed code and written answers, is what you should submit in Canvas for full credit. **DO NOT SUBMIT A NEW NOTEBOOK FILE OR A RAW .PY FILE**. Submitting in a different format makes it difficult to grade your work, and students who have done this in the past inevitably miss some of the required work or written questions.

## Imports and Downloads

In [None]:
try:
    import datasets, transformers, evaluate, torch  # type: ignore
except Exception:
    %pip install -q datasets transformers evaluate accelerate sentencepiece

import os, random, numpy as np, warnings
import torch
from sklearn.metrics import confusion_matrix
from transformers import DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer
warnings.filterwarnings('ignore')


## Load Dataset and Model

For this assignment, you will be comparing performance on a common language model benchmarking task: predicting sentiment on the Stanford Sentiment Treebank ([SST-2](https://nlp.stanford.edu/sentiment/index.html)). We will use the same model, [Flan-T5-Small](https://huggingface.co/google/flan-t5-small), across all of our "training" methods so that the results are directly comparable.

**TODO**:

- Use your preferred method to select a sample of 2000 sentences from the train dataset.

**Q**: After reading a little bit about the Sentiment Treebank project at the link above, and recognizing that the paper was written in 2013, what method do we now use to provide the same kind of benefit that they intended with their tree-based sentiment representations?

**A**: 

In [None]:
from datasets import load_dataset
raw = load_dataset('glue', 'sst2')
raw

In [None]:
raw['train'][5], raw['validation'][5]

In [None]:
# TODO: select a sample for your training dataset

train_size = 2000
train_ds = ??
label_names = train_ds.features['label'].names
eval_ds  = raw['validation']


In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
MODEL_NAME = 'google/flan-t5-small'
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
flan = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

## Zero‑Shot and Few‑Shot Prompting

One of the major benefits of today's generative models is that they can often be used effectively without supervised, task-specific training (i.e., fine-tuning). Prompting a model with no labeled examples is called zero-shot prompting, and prompting with a handful of examples included in the prompt is called few-shot prompting; both avoid the time and expense needed to compile and train on a labeled dataset. However, using a generative model makes performance evaluation more complex, since the set of possible outputs is not pre-defined (i.e., the model can potentially produce any tokens in its vocabulary).
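
To see why free-form generation complicates evaluation, consider the minimal sketch below. The strings in `raw_outputs` are hypothetical examples invented for illustration, not real output from Flan-T5. The point is only that a strict string comparison against the label set can reject answers that a human reader would accept.

```python
# Hypothetical generated strings (made up for illustration, not real model output).
raw_outputs = ['positive', 'Positive', 'negative.', 'It is negative', 'great']
valid_labels = {'positive', 'negative'}

# Strict comparison: only outputs that are exactly a label string count as usable.
exact_matches = [out for out in raw_outputs if out in valid_labels]
unmatched = [out for out in raw_outputs if out not in valid_labels]

print(f'{len(exact_matches)} of {len(raw_outputs)} outputs match a label exactly')
print('Unmatched:', unmatched)
```

Differences in casing, punctuation, and surrounding words are the kinds of variation that an output-normalization step has to account for before predictions can be scored.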

**TODO**:

- Define a function to format zero-shot prompts. *HINT: Your prompt should introduce the labeling task (without providing an example) and end with a cue for the model to respond.*
- Define a function to format few-shot prompts using the examples provided. *HINT: This should be very similar to the zero-shot format, but with a couple of example input/output pairs provided.*
- Define a function that will normalize the generated output for evaluation.

**Q**: Discuss one additional benefit and one drawback of using a generative model with zero-shot or few-shot prompting in place of traditional supervised learning methods.

**A**: 

In [None]:
FEW_SHOTS = [
    ('This movie was fantastic and heartwarming.', 'positive'),
    ('The plot was boring and predictable.', 'negative'),
]

def zshot_prompt(text):
    # TODO: format your input text into a zero-shot prompt
    return ??

def fshot_prompt(text):
    # TODO: format your input text into a few-shot prompt, using the examples above
    return ??

def norm_label(s: str):
    s = (s or '').lower()
    # TODO: normalize the generated output to produce the labels 'positive' or 'negative'
    if ??: 
        return 'positive'
    if ??: 
        return 'negative'
    return s.strip()

@torch.no_grad()  # tells PyTorch not to track gradients during inference
def flan_predict(texts, mode='zero', max_new_tokens=3):
    preds = []
    for t in texts:
        prompt = zshot_prompt(t) if mode == 'zero' else fshot_prompt(t)
        inputs = tok(prompt, return_tensors='pt')
        outputs = flan.generate(**inputs, max_new_tokens=max_new_tokens)
        out = tok.batch_decode(outputs, skip_special_tokens=True)[0]
        preds.append(norm_label(out))
    return preds

print('Zero-shot:', flan_predict(['I loved this movie', 'This was terrible'], mode='zero'))
print('Few-shot :', flan_predict(['I loved this movie', 'This was terrible'], mode='few'))

### Evaluate Prompting Methods

**TODO**:

- Define a function (or use a pre-existing implementation) that computes accuracy, with lists of labels and predictions as input.
- Use your `flan_predict` function from above to produce zero-shot and few-shot predictions over your evaluation data.
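
When reading the confusion matrices for the prompting methods, keep in mind that a generative model can emit strings outside the label set. The sketch below uses hypothetical labels and predictions (invented for illustration) and a plain `Counter` rather than scikit-learn, to show how such a prediction ends up as its own category. As far as the scikit-learn documentation describes it, `confusion_matrix` behaves similarly when no `labels` argument is passed: it uses the sorted set of values that appear in either input.

```python
from collections import Counter

# Hypothetical gold labels and predictions (invented for illustration).
y_true = ['positive', 'negative', 'positive', 'negative']
y_pred = ['positive', 'positive', 'great', 'negative']

# Count each (true label, predicted label) pair.
pair_counts = Counter(zip(y_true, y_pred))

# Any prediction outside the gold label set becomes its own "class".
unexpected = sorted(set(y_pred) - set(y_true))

for (true_label, pred_label), count in sorted(pair_counts.items()):
    print(f'true={true_label:<8} pred={pred_label:<8} count={count}')
print('Predictions outside the label set:', unexpected)
```

A matrix with more than two rows or columns is therefore a sign that some outputs were not mapped to `positive` or `negative`.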

**Q**: Reflect on the performance of these two methods. Was there anything that surprised you?

**A**: 

In [None]:
def accuracy(y_true, y_pred):
    # TODO: compute the accuracy of your predictions
    return ??

def label_to_str(y): # the SST-2 dataset stores the sentiment labels as integers
    return label_names[int(y)]

eval_texts = [ex['sentence'] for ex in eval_ds]
eval_labels = [label_to_str(ex['label']) for ex in eval_ds]

# TODO: predict the sentiment of your evaluation dataset with the zero-shot and few-shot prompting methods
z_preds = ??
f_preds = ??

z_acc = accuracy(eval_labels,z_preds)
f_acc = accuracy(eval_labels,f_preds)
print({'zero_shot_acc': round(z_acc, 4), 'few_shot_acc': round(f_acc, 4)})
print("Zero-Shot Confusion Matrix:\n", confusion_matrix(eval_labels,z_preds))
print("Few-Shot Confusion Matrix:\n", confusion_matrix(eval_labels,f_preds))

## Model Fine Tuning

Now we will fine-tune the same Flan-T5 model on the SST-2 training dataset and compare performance. Since we are still using a generative model, the data needs to be formatted to support generation as we did above. We also need to define a custom metric function for the model to use during training, ensuring that the output matches our expected labels for evaluation.
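
One detail of the training pipeline in this section is the value `-100` that appears in the label ids. Target sequences in a batch can have different lengths, so the shorter ones are padded, and `-100` is the value that PyTorch's cross-entropy loss ignores by default. The sketch below is a simplified, pure-Python illustration of that idea. The helper names `pad_labels` and `strip_ignored` and the token ids are hypothetical, and this is not the actual implementation of `DataCollatorForSeq2Seq`.

```python
IGNORE_INDEX = -100  # value conventionally ignored by the loss function

def pad_labels(label_rows, ignore_index=IGNORE_INDEX):
    """Right-pad every row of label ids to the length of the longest row."""
    max_len = max(len(row) for row in label_rows)
    return [row + [ignore_index] * (max_len - len(row)) for row in label_rows]

def strip_ignored(label_rows, ignore_index=IGNORE_INDEX):
    """Remove the ignore value again so the ids can be decoded to text."""
    return [[i for i in row if i != ignore_index] for row in label_rows]

# Hypothetical token ids; real ids depend on the tokenizer's vocabulary.
labels = [[101, 1], [202, 303, 1]]
padded = pad_labels(labels)
print(padded)
print(strip_ignored(padded) == labels)
```

This is why the metric function has to remove `-100` from the label ids before decoding them back into text.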

**TODO**:

- Define a function to format data to use in a generative model training pipeline.
- Use your `norm_label` function from above to process the model output for evaluation.

**Q**: How would the model training pipeline change if we were using a representation model (i.e., an encoder-only transformer model like BERT) instead of a generative model? Which type of model makes more sense when doing a supervised training task?

**A**: 

In [None]:
def format_data(text):
    # TODO: format the model input to support generation
    inp = ??
    tgt = label_to_str(text['label'])
    return {'input_text': inp, 'target_text': tgt}

train_formatted = train_ds.map(format_data) # we can use the .map() function in place of looping over the whole dataset
eval_formatted  = eval_ds.map(format_data)

def tokenize_fn(batch):
    model_inputs = tok(batch['input_text'], truncation=True) # convert text to token ids and cut off very long inputs
    # tokenize the targets; text_target replaces the deprecated as_target_tokenizer() context manager
    labels = tok(text_target=batch['target_text'], truncation=True)
    model_inputs['labels'] = labels['input_ids']
    return model_inputs

train_toks = train_formatted.map(tokenize_fn, batched=True, remove_columns=train_formatted.column_names)
eval_toks  = eval_formatted.map(tokenize_fn,  batched=True, remove_columns=eval_formatted.column_names)

data_collator = DataCollatorForSeq2Seq(tok, model=flan) # the collator handles data batching during training

In [None]:
def compute_metrics(eval_pred):
    preds, labels = eval_pred
    # defensive: replace any -100 padding in the predictions so decoding does not fail
    preds = np.where(preds != -100, preds, tok.pad_token_id)
    pred_texts = tok.batch_decode(preds, skip_special_tokens=True) # translate token_ids to text
    labels_clean = []
    for row in labels:
        row = [tok_id for tok_id in row if tok_id != -100] # drop the -100 label padding
        labels_clean.append(row)
    ref_texts = tok.batch_decode(labels_clean, skip_special_tokens=True) # translate token_ids to text
    
    # TODO: normalize the model outputs for evaluation
    preds_norm = ??
    return {'accuracy': accuracy(ref_texts,preds_norm)}

In [None]:
training_args = Seq2SeqTrainingArguments(
    output_dir='outputs/flan_t5_sst2',
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=2,
    weight_decay=0.01,
    eval_strategy='epoch',
    save_strategy='no',
    predict_with_generate=True,
    logging_steps=50,
    report_to=[],
)

trainer = Seq2SeqTrainer(
    model=flan,
    args=training_args,
    train_dataset=train_toks,
    eval_dataset=eval_toks,
    tokenizer=tok,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train();

In [None]:
eval_out = trainer.predict(eval_toks)
ft_acc = round(eval_out.metrics["test_accuracy"], 4)
print({'finetuned_flan_t5_eval_accuracy': ft_acc})
# Decode predicted sequences
pred_texts = tok.batch_decode(eval_out.predictions, skip_special_tokens=True)

# Decode reference (true) sequences, removing padding (-100)
labels_clean = [[id for id in row if id != -100] for row in eval_out.label_ids]
ref_texts = tok.batch_decode(labels_clean, skip_special_tokens=True)
print(confusion_matrix(ref_texts, pred_texts, labels=["positive", "negative"]))

### Model Comparison

**Q**: Reflect on the performance of your three methods. Is there anything that was surprising? Would you do anything to improve the performance of any of the methods? Are there any other methods that you would like to compare?

**A**: 

In [None]:
summary = {
    'zero_shot_flan_t5_acc': round(z_acc, 4),
    'few_shot_flan_t5_acc': round(f_acc, 4),
    'finetuned_flan_t5_acc' : ft_acc,
}
summary