## Homework 3, CS678 Spring 2024

### This homework has two submission deadlines: (1) Checkpoint 1 on March 18, 2024, and (2) Checkpoint 2 on March 29, 2024. 

Submit the report to Gradescope with the naming convention of
**John_Doe_HW3_Report_CS678_S24_ckpt1/2.pdf** if your name is John Doe and you are submitting for checkpoint 1/2.
Only the report should be submitted to Gradescope. The rest of the code and data files,
including the completed version of this notebook,
should be submitted to Blackboard in a single zipped folder.

<!-- #### IMPORTANT: 

After copying this notebook to your Google Drive, please paste a link to it
below. To get a publicly-accessible link, hit the *Share* button at the top
right, then click "Get shareable link" and copy over the result. If you fail to
do this, you will receive no credit for this homework!

***LINK:***

--- -->


##### *How to submit this problem set:*
- Write all the answers in this notebook. 

- When creating your final version of the notebook to hand in, 
  please do a fresh restart and execute every cell in order. 
  One handy way to do this is by clicking `Runtime -> Run All` in the notebook menu.

##### *Policy regarding Google Colab:*
- The instructions in this notebook assume that you will use Colab.

- However, using Colab is not required. You are free to run the code on your local machine, though in that case you may suffer from a slow runtime due to the lack of proper GPU resources.

---

##### *Academic honesty*

- We will audit the notebooks from a set number of students, chosen at
  random. The audits will check that the code you wrote actually generates the
  answers in your report PDF. If you turn in correct answers on your PDF without code
  that actually generates those answers, we will consider this a serious case of
  cheating. See the course page for honesty policies.

- We will also run automatic checks of notebooks for plagiarism. 
  Copying code from others is also considered a serious case of cheating.

---

# Checkpoint 1: Data Collection and Annotation

In this homework, you will first collect a labeled dataset of **150** sentences for a text classification task of your choice. This process will include:

1. *Data collection*: Collect 150 sentences from any source you find interesting (e.g., literature, Tweets, news articles, reviews, etc.)

2. *Task design*: Come up with a multilabel sentence-level classification task that you would like to perform on your sentences. 

3. *Annotation*: Collect annotations for your task from **two** classmates on a **second, separate set** of at least **150** sentences. Everyone in this class will need to both create their own dataset and serve as an annotator for two other classmates. To get everything done on time, you need to complete the following steps:

> *   Find two classmates willing to label 150 sentences each (use the Piazza "search for teammates" thread if you're having issues finding labelers).
> *   Collect the labeled data from each of the two annotators.
> *   Sanity check the data for basic cleanliness (are all examples annotated? are all labels allowable ones?)

4. Collect feedback from annotators about the task including annotation time and obstacles encountered (e.g., maybe some sentences were particularly hard to annotate!)

5. Calculate and report inter-annotator agreement.

6. Aggregate the output from both annotators to create the final dataset (include your first 150 sentences too).

7. Perform NLP experiments on your new dataset!

The mapping of label names and IDs in seed.tsv is as follows:

```python
{
    'Economic': 1.0,
    'Capacity and Resources': 2.0,
    'Morality': 3.0,
    'Fairness and Equality': 4.0,
    'Legality, Constitutionality, Jurisdiction': 5.0,
    'Policy Prescription and Evaluation': 6.0,
    'Crime and Punishment': 7.0,
    'Security and Defense': 8.0,
    'Health and Safety': 9.0,
    'Quality of Life': 10.0,
    'Cultural Identity': 11.0,
    'Public Sentiment': 12.0,
    'Political': 13.0,
    'External Regulation and Reputation': 14.0,
    'Other': 15.0
}
```

Make sure that this mapping is followed in all of your data files.
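
A quick way to check that a data file follows this mapping is sketched below. This is only a sketch under assumptions: it uses the standard library, assumes a tab-separated file with a header row and a numeric `label` column, and the helper name `find_bad_rows` and the inline sample rows are made up for illustration.

```python
import csv
import io

# Label IDs 1..15, as listed in the mapping above.
ALLOWED_IDS = set(range(1, 16))

def find_bad_rows(tsv_file, label_column="label"):
    """Return (row_number, raw_value) pairs for missing or invalid labels.

    `tsv_file` is any open text file object; `label_column` is an assumed
    column name and should be changed to match your own files.
    """
    bad = []
    reader = csv.DictReader(tsv_file, delimiter="\t")
    for row_number, row in enumerate(reader, start=1):
        raw = (row.get(label_column) or "").strip()
        try:
            value = float(raw)
        except ValueError:
            # Empty or non-numeric label.
            bad.append((row_number, raw))
            continue
        if not value.is_integer() or int(value) not in ALLOWED_IDS:
            # Numeric, but not one of the allowed IDs.
            bad.append((row_number, raw))
    return bad

# Made-up sample: row 2 has no label and row 3 uses an ID outside 1..15.
sample = io.StringIO("text\tlabel\nA\t1.0\nB\t\nC\t16\nD\t7\n")
bad_rows = find_bad_rows(sample)
print(bad_rows)
```

For a real file, pass `open("annotator1.tsv", newline="", encoding="utf-8")` instead of the in-memory sample. An empty result means every row has a label that appears in the mapping.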

## Question 3 (8 points):
Now, compute the inter-annotator agreement between your two annotators. Upload both .tsv files to your Colab session (click the folder icon in the sidebar to the left of the screen). In the code cell below, read the data from the two files and compute both the raw agreement (% of examples for which both annotators agreed on the label) and [Cohen's Kappa](https://en.wikipedia.org/wiki/Cohen%27s_kappa). Feel free to use implementations in existing libraries (e.g., [sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.cohen_kappa_score.html)). After you're done, report the raw agreement and Cohen's Kappa score in your report.

*If you're curious, Cohen suggested the Kappa result be interpreted as follows: values ≤ 0 as indicating no agreement and 0.01–0.20 as none to slight, 0.21–0.40 as fair, 0.41–0.60 as moderate, 0.61–0.80 as substantial, and 0.81–1.00 as almost perfect agreement.*
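
To make the two quantities concrete, here is a minimal from-scratch sketch. It uses the standard definition kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed (raw) agreement and p_e is the agreement expected by chance given each annotator's label frequencies. The function names and the toy label lists are made up for illustration; on real data, `sklearn.metrics.cohen_kappa_score` should give the same value.

```python
from collections import Counter

def raw_agreement_score(labels_a, labels_b):
    """Fraction of positions where both annotators chose the same label."""
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

def cohens_kappa_score(labels_a, labels_b):
    """Cohen's kappa for two equal-length label sequences.

    Note: this is undefined (division by zero) when p_e == 1, i.e. when both
    annotators used a single identical label for every example.
    """
    n = len(labels_a)
    p_o = raw_agreement_score(labels_a, labels_b)
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Chance agreement: probability that both pick label k independently.
    p_e = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Toy annotations (illustrative only): the annotators agree on 6 of 8 items.
toy_a = [1, 1, 2, 2, 3, 3, 1, 2]
toy_b = [1, 1, 2, 3, 3, 3, 2, 2]

toy_raw = raw_agreement_score(toy_a, toy_b)
toy_kappa = cohens_kappa_score(toy_a, toy_b)
print(toy_raw)    # 0.75
print(toy_kappa)  # 27/43, roughly 0.628
```

Here p_e = 21/64, so the kappa is lower than the raw agreement: part of the 75% agreement would be expected even if both annotators labeled at random with the same label frequencies.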

In [None]:
### WRITE CODE TO LOAD ANNOTATIONS AND 
### COMPUTE AGREEMENT + COHEN'S KAPPA HERE!
import pandas as pd
from sklearn.metrics import cohen_kappa_score

def load_labels(path):
    # Read one annotator's TSV file and return its "label" column.
    return pd.read_csv(path, sep='\t')['label']

def raw_agreement(annotator1_path, annotator2_path):
    # Fraction of examples on which both annotators chose the same label.
    label1 = load_labels(annotator1_path)
    label2 = load_labels(annotator2_path)
    assert len(label1) == len(label2), "Both files must contain the same number of rows"
    return (label1.values == label2.values).mean()

def cohens_kappa(annotator1_path, annotator2_path):
    # Chance-corrected agreement, computed with scikit-learn.
    label1 = load_labels(annotator1_path)
    label2 = load_labels(annotator2_path)
    return cohen_kappa_score(label1, label2)



# TODO: load your data correctly
annotator1_path = "annotator1.tsv"
annotator2_path = "annotator2.tsv"

print("--- Raw agreement between annotator1 and annotator2 ---")
print(raw_agreement(annotator1_path, annotator2_path))

print("--- Cohen's kappa score between annotator1 and annotator2 ---")
print(cohens_kappa(annotator1_path, annotator2_path))

TODO : Write the values obtained above in this cell.

### *RAW AGREEMENT*:  0.7577639751552795

### *COHEN'S KAPPA*:  0.7395578414699905

# Checkpoint 2: Model Training and Testing

Now we'll move on to fine-tuning pretrained language models specifically on your dataset. This part of the homework is meant to be an introduction to the HuggingFace library, and it contains code that will potentially be useful for your final projects. Since we're dealing with large models, the first step is to change to a GPU runtime.

## Adding a hardware accelerator

Please go to the menu and add a GPU as follows:

`Edit > Notebook Settings > Hardware accelerator > (GPU)`

Run the following cell to confirm that the GPU is detected.

In [2]:
import torch
torch.cuda.empty_cache()

# Confirm that the GPU is detected

assert torch.cuda.is_available()

# Get the GPU device name.
device_name = torch.cuda.get_device_name()
n_gpu = torch.cuda.device_count()
print(f"Found device: {device_name}, n_gpu: {n_gpu}")
device = torch.device("cuda")

Found device: Tesla T4, n_gpu: 1


In [None]:
import random
import numpy as np

def seed_everything(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True

seed_everything()

## Installing Hugging Face's Transformers library
We will use Hugging Face's Transformers (https://github.com/huggingface/transformers), an open-source library that provides general-purpose architectures for natural language understanding and generation with a collection of various pretrained models made by the NLP community. This library will allow us to easily use pretrained models like `BERT` and perform experiments on top of them. We can use these models to solve downstream target tasks, such as text classification, question answering, and sequence labeling.

Run the following cell to install Hugging Face's Transformers library and download a sample data file called seed.tsv that contains 250 sentences in English, annotated with their frame.

In [3]:
!pip install transformers
!pip install -U -q PyDrive

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.25.1-py3-none-any.whl (5.8 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Collecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.11.1-py3-none-any.whl (182 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.11.1 tokenizers-0.13.2 transformers-4.25.1


The cell below imports some helper functions we wrote to demonstrate the task on
the sample seed dataset.

#### *IMPORTANT NOTE*:

The tokenize_and_format function in helpers.py uses bert-base-uncased as the
model for the tokenizer. If you are using a different model for training in this
notebook or for running predictions in a different notebook or python file, you
need to change the model name as well in the tokenizer, otherwise you will get
arbitrarily incorrect results down the line.

If you update the model name for the tokenizer, you would need to reload the
file which can be done simply by re-running the cell below.

In [4]:
from helpers import tokenize_and_format, flat_accuracy

# Data Prep and Model Specifications

Upload your data using the file explorer to the left. We have provided a
function below to tokenize and format your data as BERT requires. Make sure that
your tsv file, titled final_data.tsv, has one column "sentence" and another
column "label_ID" containing integer or float label IDs (that is, keep the same
format as seed.tsv for the sentence and label columns).

If you run the cell below without modifications, it will run on the seed.tsv
example data we have provided. It imports some helper functions we wrote to
demonstrate the task on the sample dataset. You should first run all of the
following cells with seed.tsv just to see how everything works. Then, once you
understand the whole preprocessing / fine-tuning process, change the tsv in the
below cell to your final_data.tsv file, add any extra preprocessing code you
wish, and then run the cells again on your own data.


#### Important Note :

The code below expects the data to be in a tsv file  with the columns as "sentence"
and "label_ID" (other columns are not that relevant here). But this is different
from the instructions in the report where you are expected to create data with
"text" and "label" columns for all of the annotation steps. 

Modify the code below to suitably handle this.
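
One possible way to reconcile the column names is sketched here, as a step you could apply right after loading your own file and before the `texts` and `labels` lines in the cell that follows. It is a sketch under assumptions: the small DataFrame stands in for your annotated data, and it assumes the "label" column already holds the numeric IDs from the mapping rather than label names.

```python
import pandas as pd

# Illustrative stand-in for data created with the report's column names.
annotated = pd.DataFrame({
    "text": ["First example sentence.", "Second example sentence."],
    "label": [7.0, 13.0],
})

# Rename to the column names that the notebook code expects.
df_renamed = annotated.rename(columns={"text": "sentence", "label": "label_ID"})

# Store the IDs as integers so they can be used directly as class indices.
df_renamed["label_ID"] = df_renamed["label_ID"].astype(int)

print(df_renamed.columns.tolist())
print(df_renamed["label_ID"].tolist())
```

If your "label" column contains label names instead of IDs, you would first need to map each name to its ID using the mapping given in Checkpoint 1.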

In [14]:
from helpers import tokenize_and_format, flat_accuracy
import pandas as pd
import numpy as np

seed_everything()

#df = pd.read_csv('final_data.tsv') # TODO : Uncomment this line to use the full dataset
df = pd.read_csv('seed.tsv')

df = df.sample(frac=1).reset_index(drop=True)

texts = df.sentence.values # this assumes that the column containing the text is called "sentence"
labels = df.label_ID.values # this assumes that the column containing the labels is called "label_ID"

### tokenize_and_format() is a helper function provided in helpers.py ###
### Make sure you use the correct model name for your tokenizer! ###
input_ids, attention_masks = tokenize_and_format(texts)

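# One-hot encode the labels. The vector length is fixed at 15 (the number of
# labels in the mapping) so that it matches num_labels in the model even if
# some labels never occur in your data.
num_labels = 15
label_list = []
for l in labels:
  label_array = np.zeros(num_labels)
  label_array[int(l)-1] = 1
  label_list.append(label_array)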

# Convert the lists into tensors.
input_ids = torch.cat(input_ids, dim=0)
attention_masks = torch.cat(attention_masks, dim=0)
labels = torch.tensor(np.array(label_list))

# Print sentence 0, now as a list of IDs.
print('Original: ', texts[0])
print('Token IDs:', input_ids[0])

Original:  At least 91 people have died since October in the U.S. Border Patrol's Tucson sector, covering most of southern Arizona. More than half of those deaths have been heat-related.
Token IDs: tensor([  101,  2012,  2560,  6205,  2111,  2031,  2351,  2144,  2255,  1999,
         1996,  1057,  1012,  1055,  1012,  3675,  6477,  1005,  1055, 17478,
         4753,  1010,  5266,  2087,  1997,  2670,  5334,  1012,  2062,  2084,
         2431,  1997,  2216,  6677,  2031,  2042,  3684,  1011,  3141,  1012,
          102,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0])


## Create train/test/validation splits

Here we split your dataset into 3 parts: a training set, a validation set, and a testing set. Each item in your dataset will be a 3-tuple containing an input_id tensor, an attention_mask tensor, and a label tensor.
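
The split sizes follow an 80/10/10 scheme in which the test set receives whatever remains after truncating the training and validation sizes to integers. The sketch below reproduces only that arithmetic; the function name `split_sizes` is made up for illustration and is not part of the notebook or `helpers.py`.

```python
def split_sizes(total, train_frac=0.8, val_frac=0.1):
    """Return (num_train, num_val, num_test) for a dataset of `total` items."""
    num_train = int(total * train_frac)  # truncated toward zero
    num_val = int(total * val_frac)      # truncated toward zero
    num_test = total - num_train - num_val  # the remainder goes to test
    return num_train, num_val, num_test

# The 250-sentence seed.tsv divides evenly.
print(split_sizes(250))  # (200, 25, 25)

# A size that does not divide evenly: the test set absorbs the leftover item.
print(split_sizes(161))  # (128, 16, 17)
```

With only 25 validation examples, each example is worth 4 percentage points of accuracy, so small differences between hyperparameter settings should be read with caution.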



In [6]:
seed_everything()

total = len(df)

num_train = int(total * .8)
num_val = int(total * .1)
num_test = total - num_train - num_val

# make lists of 3-tuples (already shuffled the dataframe in cell above)
train_set = [(input_ids[i], attention_masks[i], labels[i]) for i in range(num_train)]
val_set = [(input_ids[i], attention_masks[i], labels[i]) for i in range(num_train, num_val+num_train)]
test_set = [(input_ids[i], attention_masks[i], labels[i]) for i in range(num_val + num_train, total)]

train_text = [texts[i] for i in range(num_train)]
val_text = [texts[i] for i in range(num_train, num_val+num_train)]
test_text = [texts[i] for i in range(num_val + num_train, total)]


Here we choose the model we want to fine-tune from https://huggingface.co/transformers/pretrained_models.html. Because the task requires us to label sentences, we will be using BertForSequenceClassification below. You may see a warning that states that `some weights of the model checkpoint at [model name] were not used when initializing. . .` This warning is expected and means that you should fine-tune your pre-trained model before using it on your downstream task. See [here](https://github.com/huggingface/transformers/issues/5421#issuecomment-652582854) for more info.

In [8]:
from transformers import BertForSequenceClassification, AutoModelForSequenceClassification, BertConfig
from torch.optim import AdamW

model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", # Use the 12-layer English BERT model, with an uncased vocab.
    num_labels = 15, # The number of output labels.   
    output_attentions = False, # Whether the model returns attentions weights.
    output_hidden_states = False, # Whether the model returns all hidden-states.
)

# Tell pytorch to run this model on the GPU.
model.cuda()


Downloading:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element

# TODO: ACTION REQUIRED #

Define your fine-tuning hyperparameters in the cell below (we have randomly picked some values to start with). We want you to experiment with different configurations to find the one that works best (i.e., highest accuracy) on your validation set. Feel free to also change pretrained models to others available in the HuggingFace library (you'll have to modify the cell above to do this). You might find papers on BERT fine-tuning stability (e.g., [Mosbach et al., ICLR 2021](https://openreview.net/pdf?id=nzpLWnVAyah)) to be of interest.
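
One simple way to organize this search is a small grid over candidate values, keeping the configuration with the best validation accuracy. The sketch below shows only the bookkeeping. The candidate values are hypothetical, not recommendations from this assignment, and `train_and_validate` is a placeholder that returns a made-up score so the example runs on its own; in the notebook it would stand for re-creating the model and optimizer, running the training loop, and returning the validation accuracy.

```python
import itertools

# Hypothetical candidate values (assumptions for illustration only).
learning_rates = [1e-5, 3e-5, 5e-5]
batch_sizes = [16, 32]

def train_and_validate(lr, batch_size):
    """Placeholder for a full fine-tuning run.

    Returns a made-up score so that this sketch is runnable; it is NOT a
    real measurement and says nothing about which values work best.
    """
    return 1.0 - abs(lr - 3e-5) * 1e4 - batch_size / 1000

best_config, best_acc = None, float("-inf")
for lr, bs in itertools.product(learning_rates, batch_sizes):
    acc = train_and_validate(lr, bs)
    # Select on validation accuracy only; the test set stays untouched.
    if acc > best_acc:
        best_config, best_acc = (lr, bs), acc

print("Best configuration (lr, batch_size):", best_config)
```

Selecting on the validation set and evaluating on the test set only once at the end keeps the reported test accuracy from being biased by the search.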

In [15]:
batch_size = 50
# you can change lr and eps values in the AdamW call if you like
optimizer = AdamW(model.parameters()) # uses the default learning rate and epsilon values
epochs = 10



# Fine-tune your model
Here we provide code for fine-tuning your model, monitoring the loss, and checking your validation accuracy. Rerun both of the below cells when you change your hyperparameters above.

In [16]:
# function to get validation accuracy
def get_validation_performance(val_set):
    # Put the model in evaluation mode
    model.eval()

    # Tracking variables 
    total_eval_accuracy = 0
    total_eval_loss = 0

    num_batches = int(len(val_set)/batch_size) + 1

    total_correct = 0

    for i in range(num_batches):

      end_index = min(batch_size * (i+1), len(val_set))

      batch = val_set[i*batch_size:end_index]
      
      if len(batch) == 0: continue

      input_id_tensors = torch.stack([data[0] for data in batch])
      input_mask_tensors = torch.stack([data[1] for data in batch])
      label_tensors = torch.stack([data[2] for data in batch])
      
      # Move tensors to the GPU
      b_input_ids = input_id_tensors.to(device)
      b_input_mask = input_mask_tensors.to(device)
      b_labels = label_tensors.to(device)
        
      # Tell pytorch not to bother with constructing the compute graph during
      # the forward pass, since this is only needed for backprop (training).
      with torch.no_grad():        

        # Forward pass, calculate logit predictions.
        # Note: this line of code might need to change depending on the model
        # the current line will work for bert-base-uncased
        # please refer to huggingface documentation for other models
        outputs = model(b_input_ids, 
                                token_type_ids=None, 
                                attention_mask=b_input_mask,
                                labels=b_labels)
        loss = outputs.loss
        logits = outputs.logits
            
        # Accumulate the validation loss.
        total_eval_loss += loss.item()
        
        # Move logits and labels to CPU
        logits = (logits).detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()


        # Calculate the number of correctly labeled examples in batch
        pred_flat = np.argmax(logits, axis=1).flatten()
        labels_flat = np.argmax(label_ids, axis=1).flatten()

        num_correct = np.sum(pred_flat == labels_flat)
        total_correct += num_correct
        
    # Report the final accuracy for this validation run.
    print("Num of correct predictions =", total_correct)
    avg_val_accuracy = total_correct / len(val_set)
    return avg_val_accuracy
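
To see what the argmax comparison in `get_validation_performance` computes, here is a small standalone sketch with made-up logits and one-hot labels for three examples and three classes. Because the one-hot vectors are built with `label_array[int(l)-1] = 1`, a class index `i` here corresponds to label ID `i + 1` in the mapping.

```python
import numpy as np

# Made-up model outputs: one row per example, one column per class.
logits = np.array([[0.1, 2.0, -1.0],
                   [1.5, 0.3, 0.2],
                   [0.0, 0.1, 0.9]])

# Made-up one-hot gold labels for the same three examples.
one_hot_labels = np.array([[0, 1, 0],
                           [0, 0, 1],
                           [0, 0, 1]])

# The predicted class is the column with the largest logit.
pred_ids = np.argmax(logits, axis=1)
# The gold class is the position of the 1 in each one-hot row.
gold_ids = np.argmax(one_hot_labels, axis=1)

num_correct = int(np.sum(pred_ids == gold_ids))
accuracy = num_correct / len(gold_ids)

print(pred_ids.tolist(), gold_ids.tolist())  # [1, 0, 2] [1, 2, 2]
print(num_correct)                           # 2
```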



In [17]:
import random
seed_everything()

# training loop

# For each epoch...
for epoch_i in range(0, epochs):
    # Perform one full pass over the training set.

    print("")
    print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epochs))
    print('Training...')

    # Reset the total loss for this epoch.
    total_train_loss = 0

    # Put the model into training mode.
    model.train()

    # For each batch of training data...
    num_batches = int(len(train_set)/batch_size) + 1

    for i in range(num_batches):
      end_index = min(batch_size * (i+1), len(train_set))

      batch = train_set[i*batch_size:end_index]

      if len(batch) == 0: continue

      input_id_tensors = torch.stack([data[0] for data in batch])
      input_mask_tensors = torch.stack([data[1] for data in batch])
      label_tensors = torch.stack([data[2] for data in batch])

      # Move tensors to the GPU
      b_input_ids = input_id_tensors.to(device)
      b_input_mask = input_mask_tensors.to(device)
      b_labels = label_tensors.to(device) 

      optimizer.zero_grad()

      # Perform a forward pass (evaluate the model on this training batch).
      # this line of code might need to change depending on the model
      outputs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask, labels=b_labels)
      
      loss = outputs.loss
      logits = outputs.logits

      total_train_loss += loss.item() 

      # Perform a backward pass to calculate the gradients.
      loss.backward()

      # Update parameters and take a step using the computed gradient.
      optimizer.step()
        
    # ========================================
    #               Validation
    # ========================================
    # After the completion of each training epoch, measure our performance on
    # our validation set. Implement this function in the cell above.
    print(f"Total loss: {total_train_loss}")
    val_acc = get_validation_performance(val_set)
    print(f"Validation accuracy: {val_acc}")
    
print("")
print("Training complete!")

# TODO: SAVE YOUR MODEL HERE... (refer to the PyTorch documentation for how to save models)



Training...
Total loss: 1.215653028627547
Num of correct predictions = 4
Validation accuracy: 0.16

Training...
Total loss: 0.9767023516794046
Num of correct predictions = 1
Validation accuracy: 0.04

Training...
Total loss: 0.9790089977855483
Num of correct predictions = 1
Validation accuracy: 0.04

Training...
Total loss: 0.976090121122698
Num of correct predictions = 1
Validation accuracy: 0.04

Training...
Total loss: 0.9667896981785695
Num of correct predictions = 1
Validation accuracy: 0.04

Training...
Total loss: 1.0583393177092075
Num of correct predictions = 1
Validation accuracy: 0.04

Training...
Total loss: 0.9789332783296705
Num of correct predictions = 1
Validation accuracy: 0.04

Training...
Total loss: 0.9764186057522892
Num of correct predictions = 1
Validation accuracy: 0.04

Training...
Total loss: 0.9743644391273458
Num of correct predictions = 1
Validation accuracy: 0.04

Training...
Total loss: 0.970487670518458
Num of correct predictions = 1
Validation accuracy

# Evaluate your model on the test set
After you're satisfied with your hyperparameters (i.e., you're unable to achieve higher validation accuracy by modifying them further), it's time to evaluate your model on the test set! Run the below cell to compute test set accuracy.


In [18]:
seed_everything()

# If your notebook disconnects during training, first load the best model you
# saved (refer to the PyTorch docs), then check its validation performance here

get_validation_performance(test_set)

Num of correct predictions = 3


0.12

## Question 8 (10 points):
Finally, perform an *error analysis* on your model. This is good practice for your final project. Write some code in the below code cell to print out the text of up to five test set examples that your model gets **wrong**. If your model gets more than five test examples wrong, randomly choose five of them to analyze. If your model gets fewer than five examples wrong, please design five test examples that fool your model (i.e., *adversarial examples*). Then, in the following text cell, perform a qualitative analysis of these examples. See if you can figure out any reasons for errors that you observe, or if you have any informed guesses (e.g., common linguistic properties of these particular examples). Does this analysis suggest any possible future steps to improve your classifier?
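
The selection step ("up to five, chosen at random") can be sketched independently of the model. The example below assumes you have already collected the test texts, gold label IDs, and predicted label IDs as three parallel lists; the function name `sample_errors` and the toy data are made up for illustration.

```python
import random

def sample_errors(texts, gold_ids, predicted_ids, k=5, seed=42):
    """Return up to k randomly chosen (text, gold, predicted) error triples."""
    errors = [
        (text, gold, pred)
        for text, gold, pred in zip(texts, gold_ids, predicted_ids)
        if gold != pred
    ]
    # A dedicated generator keeps the choice reproducible without
    # touching the global random state.
    rng = random.Random(seed)
    return rng.sample(errors, min(k, len(errors)))

# Toy data (illustrative only): 6 of the 8 predictions are wrong.
toy_texts = [f"sentence {i}" for i in range(8)]
toy_gold = [1, 2, 3, 4, 5, 6, 7, 8]
toy_pred = [1, 9, 9, 9, 9, 9, 9, 8]

picked = sample_errors(toy_texts, toy_gold, toy_pred)
for text, gold, pred in picked:
    print(f"{text!r}: gold={gold}, predicted={pred}")
```

If fewer than five errors exist, the function simply returns all of them, which signals that you need to design adversarial examples as described above.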

In [13]:
seed_everything()
torch.cuda.empty_cache()

## YOUR ERROR ANALYSIS CODE HERE
## print out up to 5 test set examples (or adversarial examples) that your model gets wrong

### *DESCRIBE YOUR QUALITATIVE ANALYSIS OF THE ABOVE EXAMPLES IN YOUR REPORT*