# Homework 6: Transformers

The goals of this assignment are:
1. Develop a better understanding of the *self-attention mechanism* in Transformers by implementing it in numpy. 
2. Understand and train a BERT variant on *coreference resolution*. 
3. Strengthen your understanding of using HuggingFace's `transformers` package. 

## Organization and Instructions
Execute the code cells in Part 1 to understand the background for this assignment. You will not need to modify or add anything to Part 1. Part 2 is where your solution begins.

**Part 1: Background.** 
- 1A. Environment set-up 
- 1B. Data exploration 

**Part 2: Your implementation.** 
- 2A. Self-attention 
- 2B. Zero-shot predictions 
- 2C. Fine-tuning 


**Additional instructions.** 
- Please follow the 50-foot rule. Your submitted solution and code must be yours alone. Copying and pasting a solution from the internet or another source is considered a violation of the honor code. 

**Evaluation.** Your solution will be evaluated *manually* by the TAs and instructor. 

To help bridge the gap between previous homeworks and the final project, we are **not giving you an autograder**. We hope this helps wean you off the grader and gives you practice testing your own code.

Please come see us during help hours if you need additional assistance! 

## 1A. Environment Set-up 

If you set up your conda environment correctly in HW0, you should see `Python [conda env:cs375]` as the kernel in the upper right-hand corner of the Jupyter webpage you are currently on. Run the cell below to make sure your environment is correctly installed. 

In [None]:
# Environment check 
# Return to HW0 if you run into errors in this cell 
# Do not modify this cell 
import os
assert os.environ['CONDA_DEFAULT_ENV'] == "cs375"

import sys
assert sys.version_info.major == 3 and sys.version_info.minor == 11

If there are any errors after running the cell above, return to the instructions from `HW0`. If you are still having difficulty, reach out to the instructor or TAs via Piazza. 

#### Importing other packages

In [None]:
import re
import typing
from typing import List
import numpy as np
import torch
import torch.nn.functional as F
from transformers import (AutoTokenizer, AutoModelForSequenceClassification, 
                          TrainingArguments, Trainer, DataCollatorWithPadding)
from sklearn.metrics import f1_score

In [None]:
import util #inspect util.py to see what is in this file 

## 1B. Data exploration

In this homework, we will use the WinoGrande dataset. You can read more about the dataset in [this paper](https://cdn.aaai.org/ojs/6399/6399-13-9624-1-10-20200517.pdf). 

Here is Table 1 from the WinoGrande paper with examples:  

![](figs/winograd.png)

HuggingFace provides a Python package for loading (and uploading) datasets. You can read more about the `datasets` Python package [here](https://huggingface.co/docs/datasets/en/index). 

In [None]:
from datasets import load_dataset

In [None]:
# Load the WinoGrande dataset
dataset = load_dataset("allenai/winogrande", "winogrande_s", trust_remote_code=True)

# Access the training and validation splits
train_dataset = dataset["train"]
validation_dataset = dataset["validation"].select(range(100)) #We'll just look at 100 dev exs

print(f"Num. train exs= {len(train_dataset)}")
print(f"Num. dev exs= {len(validation_dataset)}")

In [None]:
# Let's look at one example from the validation dataset
print(validation_dataset[12])

Above, the `'sentence'` is the full sentence with a `_` for where the pronoun or noun options should go. 

Then `option1` and `option2` are the two candidate spans the model will eventually choose between to fill the blank, and `answer` indicates which of the two is correct. 
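To make the record layout concrete, here is a minimal sketch that fills the blank with each candidate. The record is written by hand for illustration (including the `answer` value, which is an assumption here), and `fill_blank` is a hypothetical helper, not something provided by `util.py` or `datasets`.

```python
# Hand-written record in the WinoGrande layout (illustrative; not loaded from the dataset)
example = {
    "sentence": "Sarah was a much better surgeon than Maria so _ always got the easier cases.",
    "option1": "Sarah",
    "option2": "Maria",
    "answer": "2",  # assumed for illustration; the notebook converts this field with int(...)
}

def fill_blank(example: dict, choice: int) -> str:
    """Hypothetical helper: substitute option `choice` (1 or 2) for the blank."""
    return example["sentence"].replace("_", example[f"option{choice}"], 1)

for choice in (1, 2):
    print(choice, fill_blank(example, choice))
```

Reading the two filled-in sentences side by side is a quick way to see what the model is being asked to decide.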

## 2A. Self-attention

In this part, you will implement the parallelized version of the *masked* self-attention mechanism in Transformers using only numpy.


Recall that each layer $k$ of the transformer applies self-attention to its input. For a single example with $n$ tokens and embedding dimension $d$, we start with $X^k$, the contextual embedding matrix (size $n\times d$) for layer $k$. 

Then, we introduce the weights, 

$$ Q = X^k \times W_Q$$ 
$$ K = X^k \times W_K$$
$$ V = X^k \times W_V $$

and use the new matrices to get the contextual embedding matrix for the next layer, 

$$ X^{k+1} = \text{softmax} \bigg( \text{mask} \bigg( \frac{QK^T}{\sqrt{d}} \bigg) \bigg) V$$

This formulation is computationally efficient in a matrix-multiplication-optimized library like `numpy` because the whole computation is expressed as matrix operations, so your implementation should have **no for-loops!** 
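As a small illustration of the mask and softmax steps only, the sketch below works on a made-up $3\times 3$ score matrix standing in for $QK^T/\sqrt{d}$. It assumes the mask is causal (token $i$ ignores every later token $j > i$); it is one possible way to express the idea, not the required solution.

```python
import numpy as np

# Toy 3x3 score matrix standing in for QK^T / sqrt(d) (values are made up)
scores = np.array([[1.0, 2.0, 3.0],
                   [1.0, 1.0, 1.0],
                   [0.0, 2.0, 0.0]])

# True strictly above the diagonal: position i must not attend to any j > i
future = np.triu(np.ones_like(scores, dtype=bool), k=1)
masked = np.where(future, -np.inf, scores)

# Row-wise softmax; subtracting the row max keeps np.exp from overflowing
shifted = masked - masked.max(axis=1, keepdims=True)
weights = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

print(np.round(weights, 3))
```

The first row comes out as `[1, 0, 0]` and the second as `[0.5, 0.5, 0]`: masked positions receive exactly zero weight, and every row still sums to 1.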

Let's implement self-attention for the (modified) example we were looking at in Part 1:

*"I had to read an entire story for class tomorrow. Luckily, it was short."*

In [None]:
# Tokens for our example 
toks = ["i", "had", "to", "read", "an", 
        "entire", "story", "for", "class", "tomorrow", ".",
       "luckily", "it", "was", "short", "."]

In [None]:
# Load pre-specified embeddings and weights (for testing)
X, W_Q, W_K, W_V = util.load_attention_data(toks)

In [None]:
# TODO: Implement your approach in this function

def self_attention(X: np.ndarray, W_Q: np.ndarray, 
                   W_K: np.ndarray, W_V: np.ndarray) -> np.ndarray: 
    """
    Implements (masked) self-attention mechanism for a single layer 
    (and a single example)
    
    Returns: X_new, a np.ndarray that is the same shape as X
    
    Notes:
    - You can only use numpy for this part of the homework and no other packages
    - Your solution must not have any for-loops!
    
    Tips: 
        - Double-check the shapes of all the matrices you're working with. 
        - We recommend making a helper function for the softmax.
        - You may find a subset of these numpy methods and operators helpful: `np.exp`, `@`, `np.triu_indices`, `np.reshape`, `np.inf`, `np.broadcast_to`, `np.choose`.  
    """
    pass 

## 2B. Zero-shot predictions

Now, we will use a distilled version of "RoBERTa" (a BERT variant) to make zero-shot predictions on the WinoGrande dataset. 

#### Tokenization and pre-processing

In [None]:
model_name = 'distilroberta-base'

In [None]:
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, clean_up_tokenization_spaces=True)

In [None]:
# First example in the dev dataset (see dataset loading in Part 1B above)
text1 = validation_dataset[0]['sentence']
text1

In [None]:
# Converts to tokens and attention mask 
# The attention mask is 0 at positions that hold special "PAD" tokens (and 1 elsewhere)
inputs = tokenizer(text1, return_tensors="pt")
inputs

In [None]:
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
tokens

Note: Above, when we see the `Ġ` character before tokens, this is how the RoBERTa tokenizer indicates spaces. 
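As a quick illustration of what the marker looks like and how it can be removed, here is a sketch on a hand-written token list. The tokens are typed in by hand in the style of the tokenizer output, not produced by running it.

```python
# Hand-written tokens in the style of a RoBERTa tokenizer (illustrative only)
tokens = ["<s>", "Sarah", "Ġwas", "Ġa", "Ġmuch", "Ġbetter", "</s>"]

# Drop the leading space marker so tokens can be compared with plain strings
plain = [tok.lstrip("Ġ") for tok in tokens]
print(plain)
```

Note that the first word of a sentence carries no `Ġ`, since no space precedes it.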

In [None]:
# TODO: fill in the function below 
def which_tok_index(tok_string: str, input_ids: torch.Tensor, tokenizer) -> int:
    """
    Given the token string (tok_string) of interest, and the tokenizer, 
    return the first token index that matches the *beginning* of tok_string. 
    
    If there is no match (which could happen), return 0. 
    
    Example: 
        tok_string="Sarah"
        input_ids = tensor([0, 33671, 21, 10, 203, 357, 16308, 87,  
        5011, 98, 18134, 460, 300, 5, 3013, 1200,  4, 2])
        
        Returns: 1 
        
        This example returns 1 since 33671 is at index 1 in 'input_ids' 
        and 33671 is the token id for "Sarah"
        
        The original sentence in this example was 
        sentence = "Sarah was a much better surgeon than Maria so _ always got the easier cases."
        
    Notes:  
        - Be sure to deal with the "Ġ"
    """
    return 0 # delete and replace with your code 

In [None]:
# Unit test
inputs = tokenizer(validation_dataset[0]['sentence'], return_tensors="pt")
input_ids = inputs['input_ids'][0]
t1 = which_tok_index("Sarah", input_ids, tokenizer)
print(t1, "== 1?")
t2 = which_tok_index("Maria", input_ids, tokenizer)
print(t2, "== 8?")
t3 = which_tok_index("_", input_ids, tokenizer)
print(t3, "== 10?")

In [None]:
# Another unit test 
inputs = tokenizer(validation_dataset[3]['sentence'], return_tensors="pt")
input_ids = inputs['input_ids'][0]
t1 = which_tok_index(validation_dataset[3]['option1'], input_ids, tokenizer)
print(t1, "== 6?")
t2 = which_tok_index("blah", input_ids, tokenizer)
print(t2, "== 0?")

#### Zero-shot prediction

Now, we'll use the model to make zero-shot predictions. Note, this is "zero-shot" because we haven't ever trained the model on this particular task or dataset.  

Here's how we will make zero-shot predictions: 
1. Pass the sentence (after tokenization) into the pre-trained model 
2. Obtain the final layer contextual embeddings for the `"_"` token as well as the (first) token representing `option1` and `option2`. 
3. Compute the cosine similarity between the contextual embedding for `"_"` and the embedding we chose for `option1`, as well as the cosine similarity between `"_"` and the embedding we chose for `option2`. 
4. Choose whichever pair has the higher cosine similarity as the prediction. 

Note: This zero-shot approach involves some precision-recall tradeoffs as well as potential errors, since some strings for `option1` and `option2` will be represented by *multiple* tokens. 
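Steps 3 and 4 can be sketched on toy vectors as follows. The three-dimensional "embeddings" are made up, and `cosine_similarity` is a small helper written here in numpy for illustration; `torch.nn.functional.cosine_similarity` computes the same quantity on tensors.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up 3-dimensional "embeddings"; real ones come from the model's final layer
blank = np.array([1.0, 0.0, 1.0])
option1 = np.array([0.0, 1.0, 0.0])
option2 = np.array([2.0, 0.0, 1.0])

sims = [cosine_similarity(blank, option1), cosine_similarity(blank, option2)]
prediction = 1 + int(np.argmax(sims))  # 1 -> option1, 2 -> option2
print(sims, prediction)
```

Here `option1` is orthogonal to the blank's vector (similarity 0), so the prediction is 2.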

In [None]:
# TODO: Your implementation
def zero_shot_predictions(model, tokenizer, dataset) -> List[int]: 
    """
    Make zero-shot predictions with the last layer contextual embedding
    cosine similarity method described in the previous cell. 
    
    Returns: 
        List[int], a list of ints, one element for each 
        example in the input dataset. Each element is an int: 
            - 1 corresponding to "option1" in the dataset
            - 2 corresponding to "option2" in the dataset
    
    Note: 
        - For now, it's ok if you have a for-loop over examples. 
          (In an actual industry setting, you would make this all parallelized)
        - You might make use of the `which_tok_index()` helper function 
        you just implemented 
        - The documentation on AutoModelForSequenceClassification (the class used to load the model below) may be helpful here. 
        - torch.nn.functional may have some helpful methods 
        - Using model.eval() and torch.no_grad() will speed things up (since PyTorch will not
        have to build the computation graph)
    """
    pass  

In [None]:
print(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
# You might see a warning below saying that some weights are newly initialized and should be trained. 
# We'll end up doing that training in the next part of the homework ;)

In [None]:
# Test your code on just a single example 
zero_shot_predictions(model, tokenizer, validation_dataset.select(range(1)))

In [None]:
# Make the full predictions 
preds = zero_shot_predictions(model, tokenizer, validation_dataset)

In [None]:
# Check: Is the following true for your dataset? 
# len(preds) == len(validation_dataset)

#### Evaluation

In [None]:
try: 
    # Build the gold labels first so the length check below can actually run
    truth = [int(x['answer']) for x in validation_dataset]
    assert len(truth) == len(preds)
    y_true = np.array(truth)
    y_pred = np.array(preds)
    y_baseline = np.ones(len(y_true)) * 2  # baseline: always predict option 2
    print("F1 of baseline (maj. class)=", np.round(f1_score(y_true, y_baseline, pos_label=2), 2))
    print("F1 of zero-shot=", np.round(f1_score(y_true, y_pred, pos_label=2), 2))
except Exception: 
    print("Need len(preds) to equal len(truth) for eval")

How does your zero-shot model compare to the baseline? 
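For intuition about the baseline number, the sketch below computes F1 by hand for a predictor that always answers 2. The gold labels are made up, and `f1_for_label` is a hypothetical helper meant to mirror what `f1_score(..., pos_label=2)` computes for binary labels.

```python
def f1_for_label(y_true, y_pred, pos_label):
    """F1 = 2PR / (P + R), treating `pos_label` as the positive class."""
    tp = sum(t == pos_label and p == pos_label for t, p in zip(y_true, y_pred))
    fp = sum(t != pos_label and p == pos_label for t, p in zip(y_true, y_pred))
    fn = sum(t == pos_label and p != pos_label for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true = [1, 2, 2, 1, 2, 1]       # made-up gold answers
always_two = [2] * len(y_true)    # baseline that always predicts option 2
print(round(f1_for_label(y_true, always_two, pos_label=2), 2))
```

Because this baseline never misses a true 2, its recall is 1 and its F1 is driven entirely by precision, that is, by how often 2 is the correct answer.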

## 2C. Fine-tuning

Now we'll fine-tune our model on the training dataset. We'll give you some code to help with the pre-processing. It's your job to use `TrainingArguments` and `Trainer` from HuggingFace in order to train the model. 

In [None]:
model_name = 'distilroberta-base'
tokenizer = AutoTokenizer.from_pretrained(model_name, clean_up_tokenization_spaces=True)
model_tune = AutoModelForSequenceClassification.from_pretrained(model_name)

In [None]:
def preprocess_function(dataset):
    return tokenizer(dataset["sentence"], truncation=True, padding="max_length", max_length=100)

def add_labels(example, idx):
    example['label'] = int(example['answer'])-1
    return example

encoded_train = train_dataset.map(preprocess_function, batched=True)
encoded_train = encoded_train.map(add_labels, with_indices=True)
encoded_train

In [None]:
# Sanity check: each 'label' in the next cell should equal the matching 'answer' here minus 1
encoded_train['answer'][0:10]

In [None]:
encoded_train['label'][0:10]

In [None]:
encoded_dev = validation_dataset.map(preprocess_function, batched=True)
encoded_dev = encoded_dev.map(add_labels, with_indices=True)
encoded_dev

TODO: It's your job to use `TrainingArguments` and `Trainer` from HuggingFace in order to train `model_tune` above! 

In [None]:
## TODO: put your training code here ##

Be sure to report your final F1 score on the validation dataset. 
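When scoring a classifier with two output classes, a common step is to turn the two logits per example into a class index and compare against the 0/1 labels created by `add_labels`. A minimal sketch with made-up logits (in practice they would come from the fine-tuned model's output on `encoded_dev`):

```python
import numpy as np

# Made-up logits for 4 examples; column 0 scores option1, column 1 scores option2
logits = np.array([[ 2.0, -1.0],
                   [ 0.1,  0.3],
                   [-0.5,  1.5],
                   [ 1.2,  1.0]])

pred_labels = logits.argmax(axis=-1)  # 0/1, same encoding as the 'label' column
pred_answers = pred_labels + 1        # back to the dataset's 1/2 convention
print(pred_labels, pred_answers)
```

Whichever encoding you pass to `f1_score`, make sure the predictions, the gold labels, and `pos_label` all use the same one.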

In [None]:
## TODO: put your validation code here ### 

## Submission

In [None]:
%%bash

if [[ ! -f "./hw6.ipynb" ]]
then
    echo "WARNING: Did not find notebook in Jupyter working directory. Manual solution: go to File->Download .ipynb to download your notebook and other files, then zip them locally."
else
    echo "Found notebook file, creating submission zip..."
    zip -r submission.zip hw6.ipynb
fi