## Multilingual Sentence Classification


# Data Collection and Annotation

First we collected a labeled dataset of **150** sentences for a text classification task of your choice. This process will include:

1. *Data collection*: Collect 150 sentences from any source you find interesting (e.g., literature, Tweets, news articles, reviews, etc.)

2. *Task design*: Come up with a multilabel sentence-level classification task that you would like to perform on your sentences. 

3. On your dataset, collect annotations from **two** classmates for your task on a **second, separate set** of a minimum of **150** sentences. Everyone in this class will need to both create their own dataset and also serve as an annotator for two other classmates. In order to get everything done on time, you need to complete the following steps:

> *   Find two classmates willing to label 150 sentences each.
*   Collect the labeled data from each of the two annotators.
*   Sanity check the data for basic cleanliness (are all examples annotated? are all labels allowable ones?)

4. Aggregate output from both annotators to create final dataset (include your first 150 sentences too).

5. Perform NLP experiments on your new dataset!

## Question 1.3:
Now, compute the inter-annotator agreement between your two annotators. Upload both .tsv files to your Colab session (click the folder icon in the sidebar to the left of the screen). In the code cell below, read the data from the two files and compute both the raw agreement (% of examples for which both annotators agreed on the label) and the [Cohen's Kappa](https://en.wikipedia.org/wiki/Cohen%27s_kappa). Feel free to use implementations in existing libraries (e.g., [sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.cohen_kappa_score.html)). After you’re done, report the raw agreement and Cohen’s scores in your report.

*If you're curious, Cohen suggested the Kappa result be interpreted as follows: values ≤ 0 as indicating no agreement and 0.01–0.20 as none to slight, 0.21–0.40 as fair, 0.41– 0.60 as moderate, 0.61–0.80 as substantial, and 0.81–1.00 as almost perfect agreement.*

In [43]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [44]:
### WRITE CODE TO LOAD ANNOTATIONS AND 
### COMPUTE AGREEMENT + COHEN'S KAPPA HERE!
import pandas as pd
from sklearn.metrics import cohen_kappa_score

In [45]:
import pandas as pd
data1=pd.read_csv("/content/drive/MyDrive/CS678-HW2/Annotator 1 - Sheet1 (1).tsv", sep='\t')["label_ID"].tolist()
data2=pd.read_csv("/content/drive/MyDrive/CS678-HW2/Annotator 2 - Sheet1 (1).tsv", sep='\t')["label_ID"].tolist()

In [46]:
# Compute the raw agreement
total_items = len(data1)
total_agree = 0

for i in range(total_items):
    if data1[i] == data2[i]:
        total_agree += 1

raw_agreement = total_agree / total_items
print("Raw Agreement: ", raw_agreement)

Raw Agreement:  0.6466666666666666


In [47]:
# Compute Cohen's Kappa
annotator1 = [data1[i] for i in range(len(data1))]
annotator2 = [data2[i] for i in range(len(data2))]

kappa = cohen_kappa_score(annotator1, annotator2)
print("Cohen's Kappa: ", kappa)


Cohen's Kappa:  0.6210857442447929


In [48]:
cohen_kappa_score(annotator1,annotator2)

0.6210857442447929

###*RAW AGREEMENT*: 0.6466666
###*COHEN'S KAPPA*: 0.6210857

# Model Training and Testing

Now we'll move onto fine-tuning  pretrained language models specifically on your dataset. This part of the homework is meant to be an introduction to the HuggingFace library, and it contains code that will potentially be useful for your final projects. Since we're dealing with large models, the first step is to change to a GPU runtime.

## Adding a hardware accelerator

Please go to the menu and add a GPU as follows:

`Edit > Notebook Settings > Hardware accelerator > (GPU)`

Run the following cell to confirm that the GPU is detected.

In [49]:
import torch
torch.cuda.empty_cache()

# Confirm that the GPU is detected

assert torch.cuda.is_available()

# Get the GPU device name.
device_name = torch.cuda.get_device_name()
n_gpu = torch.cuda.device_count()
print(f"Found device: {device_name}, n_gpu: {n_gpu}")
device = torch.device("cuda")

Found device: Tesla T4, n_gpu: 1


## Installing Hugging Face's Transformers library
We will use Hugging Face's Transformers (https://github.com/huggingface/transformers), an open-source library that provides general-purpose architectures for natural language understanding and generation with a collection of various pretrained models made by the NLP community. This library will allow us to easily use pretrained models like `BERT` and perform experiments on top of them. We can use these models to solve downstream target tasks, such as text classification, question answering, and sequence labeling.

Run the following cell to install Hugging Face's Transformers library and download a sample data file called seed.tsv that contains 250 sentences in English, annotated with their frame.

In [50]:
!pip install transformers
!pip install -U -q PyDrive

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


The cell below imports some helper functions we wrote to demonstrate the task on the sample seed dataset.

In [51]:
from helpers import tokenize_and_format, flat_accuracy

# Data Prep and Model Specifications

Upload your data using the file explorer to the left. We have provided a function below to tokenize and format your data as BERT requires. Make sure that your tsv file, titled final_data.tsv, has one column "sentence" and another column "labels_ID" containing integers/float.

If you run the cell below without modifications, it will run on the seed.tsv example data we have provided. It imports some helper functions we wrote to demonstrate the task on the sample dataset. You should first run all of the following cells with seed.tsv just to see how everything works. Then, once you understand the whole preprocessing / fine-tuning process, change the tsv in the below cell to your final_data.tsv file, add any extra preprocessing code you wish, and then run the cells again on your own data.

In [52]:
from helpers import tokenize_and_format, flat_accuracy
import pandas as pd
import numpy as np

df = pd.read_csv('/content/final_data .tsv',sep='\t')
df = df.dropna()

df = df.sample(frac=1).reset_index(drop=True)

texts = df.sentence.values
labels = df.label_ID.values
### tokenize_and_format() is a helper function provided in helpers.py ###
input_ids, attention_masks = tokenize_and_format(texts)
label_list = []
for l in labels:
  label_array = np.zeros(len(set(labels)))
  label_array[int(l)-1] = 1
  label_list.append(label_array)

# Convert the lists into tensors.
input_ids = torch.cat(input_ids, dim=0)
attention_masks = torch.cat(attention_masks, dim=0)
labels = torch.tensor(np.array(label_list))

# Print sentence 0, now as a list of IDs.
print('Original: ', texts[0])
print('Token IDs:', input_ids[0])

Original:  To address the challenges of illegal immigration, we must adopt policies that provide a pathway to citizenship for undocumented immigrants, while strengthening our border security and improving our visa system.
Token IDs: tensor([  101,  2000,  4769,  1996,  7860,  1997,  6206,  7521,  1010,  2057,
         2442, 11092,  6043,  2008,  3073,  1037, 12732,  2000,  9068,  2005,
        25672, 24894, 14088,  7489,  1010,  2096, 16003,  2256,  3675,  3036,
         1998,  9229,  2256,  9425,  2291,  1012,   102,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0])


## Create train/test/validation splits

Here we split your dataset into 3 parts: a training set, a validation set, and a testing set. Each item in your dataset will be a 3-tuple containing an input_id tensor, an attention_mask tensor, and a label tensor.



In [64]:
total = len(df)

num_train = int(total * .8)
num_val = int(total * .1)
num_test = total - num_train - num_val

# make lists of 3-tuples (already shuffled the dataframe in cell above)

train_set = [(input_ids[i], attention_masks[i], labels[i]) for i in range(num_train)]
val_set = [(input_ids[i], attention_masks[i], labels[i]) for i in range(num_train, num_val+num_train)]
test_set = [(input_ids[i], attention_masks[i], labels[i]) for i in range(num_val + num_train, total)]

train_text = [texts[i] for i in range(num_train)]
val_text = [texts[i] for i in range(num_train, num_val+num_train)]
test_text = [texts[i] for i in range(num_val + num_train, total)]


Here we choose the model we want to finetune from https://huggingface.co/transformers/pretrained_models.html. Because the task requires us to label sentences, we wil be using BertForSequenceClassification below. You may see a warning that states that `some weights of the model checkpoint at [model name] were not used when initializing. . .` This warning is expected and means that you should fine-tune your pre-trained model before using it on your downstream task. See [here](https://github.com/huggingface/transformers/issues/5421#issuecomment-652582854) for more info.

In [65]:
from transformers import BertForSequenceClassification, AdamW, BertConfig

model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", # Use the 12-layer English BERT model, with an uncased vocab.
    num_labels = 15, # The number of output labels.   
    output_attentions = False, # Whether the model returns attentions weights.
    output_hidden_states = False, # Whether the model returns all hidden-states.
)

# Tell pytorch to run this model on the GPU.
model.cuda()


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element

# ACTION REQUIRED #

Define your fine-tuning hyperparameters in the cell below (we have randomly picked some values to start with). We want you to experiment with different configurations to find the one that works best (i.e., highest accuracy) on your validation set. Feel free to also change pretrained models to others available in the HuggingFace library (you'll have to modify the cell above to do this). You might find papers on BERT fine-tuning stability (e.g., [Mosbach et al., ICLR 2021](https://openreview.net/pdf?id=nzpLWnVAyah)) to be of interest.

In [66]:
batch_size = 32
learning_rate = 2e-5
optimizer = AdamW(model.parameters(), lr=learning_rate) #with default values of learning rate and epsilon value
epochs = 50

# Fine-tune your model
Here we provide code for fine-tuning your model, monitoring the loss, and checking your validation accuracy. Rerun both of the below cells when you change your hyperparameters above.

In [67]:
# function to get validation accuracy
def get_validation_performance(val_set):
    # Put the model in evaluation mode
    model.eval()

    # Tracking variables 
    total_eval_accuracy = 0
    total_eval_loss = 0

    num_batches = int(len(val_set)/batch_size) + 1

    total_correct = 0

    for i in range(num_batches):

      end_index = min(batch_size * (i+1), len(val_set))

      batch = val_set[i*batch_size:end_index]
      
      if len(batch) == 0: continue

      input_id_tensors = torch.stack([data[0] for data in batch])
      input_mask_tensors = torch.stack([data[1] for data in batch])
      label_tensors = torch.stack([data[2] for data in batch])
      
      # Move tensors to the GPU
      b_input_ids = input_id_tensors.to(device)
      b_input_mask = input_mask_tensors.to(device)
      b_labels = label_tensors.to(device)
        
      # Tell pytorch not to bother with constructing the compute graph during
      # the forward pass, since this is only needed for backprop (training).
      with torch.no_grad():        

        # Forward pass, calculate logit predictions.
        outputs = model(b_input_ids, 
                                token_type_ids=None, 
                                attention_mask=b_input_mask,
                                labels=b_labels)
        loss = outputs.loss
        logits = outputs.logits
            
        # Accumulate the validation loss.
        total_eval_loss += loss.item()
        
        # Move logits and labels to CPU
        logits = (logits).detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()


        # Calculate the number of correctly labeled examples in batch
        pred_flat = np.argmax(logits, axis=1).flatten()
        labels_flat = np.argmax(label_ids, axis=1).flatten()

        num_correct = np.sum(pred_flat == labels_flat)
        total_correct += num_correct
    
            
    # Report the final accuracy for this validation run.
    print("Num of correct predictions =", total_correct)
    avg_val_accuracy = total_correct / len(val_set)
    return avg_val_accuracy



In [68]:
import random

# training loop

# For each epoch...
for epoch_i in range(0, epochs):
    # Perform one full pass over the training set.

    print("")
    print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epochs))
    print('Training...')

    # Reset the total loss for this epoch.
    total_train_loss = 0

    # Put the model into training mode.
    model.train()

    # For each batch of training data...
    num_batches = int(len(train_set)/batch_size) + 1

    for i in range(num_batches):
      end_index = min(batch_size * (i+1), len(train_set))

      batch = train_set[i*batch_size:end_index]

      if len(batch) == 0: continue

      input_id_tensors = torch.stack([data[0] for data in batch])
      input_mask_tensors = torch.stack([data[1] for data in batch])
      label_tensors = torch.stack([data[2] for data in batch])

      # Move tensors to the GPU
      b_input_ids = input_id_tensors.to(device)
      b_input_mask = input_mask_tensors.to(device)
      b_labels = label_tensors.to(device) 

      # Perform a forward pass (evaluate the model on this training batch).
      outputs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask, labels=b_labels)
      loss = outputs.loss
      logits = outputs.logits

      total_train_loss += loss.item()

      # Clear the previously calculated gradient
      model.zero_grad()     

      # Perform a backward pass to calculate the gradients.
      loss.backward()

      # Update parameters and take a step using the computed gradient.
      optimizer.step()
        
    # ========================================
    #               Validation
    # ========================================
    # After the completion of each training epoch, measure our performance on
    # our validation set. Implement this function in the cell above.
    print(f"Total loss: {total_train_loss}")
    val_acc = get_validation_performance(val_set)
    print(f"Validation accuracy: {val_acc}")
    
print("")
print("Training complete!")



Training...
Total loss: 2.7849502879080115
Num of correct predictions = 3
Validation accuracy: 0.2

Training...
Total loss: 2.5568401165858328
Num of correct predictions = 2
Validation accuracy: 0.13333333333333333

Training...
Total loss: 2.437467830245037
Num of correct predictions = 0
Validation accuracy: 0.0

Training...
Total loss: 2.2728034007077236
Num of correct predictions = 0
Validation accuracy: 0.0

Training...
Total loss: 2.1091169855971303
Num of correct predictions = 1
Validation accuracy: 0.06666666666666667

Training...
Total loss: 1.9450506554017515
Num of correct predictions = 0
Validation accuracy: 0.0

Training...
Total loss: 1.809488083758495
Num of correct predictions = 0
Validation accuracy: 0.0

Training...
Total loss: 1.681529566138569
Num of correct predictions = 0
Validation accuracy: 0.0

Training...
Total loss: 1.5699239372948393
Num of correct predictions = 0
Validation accuracy: 0.0

Training...
Total loss: 1.475114851931317
Num of correct predictions =

# Evaluate your model on the test set
After you're satisfied with your hyperparameters (i.e., you're unable to achieve higher validation accuracy by modifying them further), it's time to evaluate your model on the test set! Run the below cell to compute test set accuracy.


In [69]:
get_validation_performance(test_set)

Num of correct predictions = 12


0.8

## Question 2.2:
Finally, perform an *error analysis* on your model. This is good practice for your final project. Write some code in the below code cell to print out the text of up to five test set examples that your model gets **wrong**. If your model gets more than five test examples wrong, randomly choose five of them to analyze. If your model gets fewer than five examples wrong, please design five test examples that fool your model (i.e., *adversarial examples*). Then, in the following text cell, perform a qualitative analysis of these examples. See if you can figure out any reasons for errors that you observe, or if you have any informed guesses (e.g., common linguistic properties of these particular examples). Does this analysis suggest any possible future steps to improve your classifier?

In [60]:
batch_size = 64
learning_rate = 2e-5
optimizer = AdamW(model.parameters(), lr=learning_rate) #with default values of learning rate and epsilon value
epochs = 50

In [61]:
# Move tensors to the GPU
b_input_ids = input_id_tensors.to(device)
b_input_mask = input_mask_tensors.to(device)
b_labels = label_tensors.to(device)
        
# Tell pytorch not to bother with constructing the compute graph during
# the forward pass, since this is only needed for backprop (training).
with torch.no_grad():
  # Forward pass, calculate logit predictions.
  outputs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask, labels=b_labels)
  loss = outputs.loss
  logits = outputs.logits
        
  # Move logits and labels to CPU
  logits = (logits).detach().cpu().numpy()
  label_ids = b_labels.to('cpu').numpy()


  # Calculate the number of correctly labeled examples in batch
  pred_flat = np.argmax(logits, axis=1).flatten()
  labels_flat = np.argmax(label_ids, axis=1).flatten()


In [62]:
import random
wrong_indices = [i for i in range(len(test_text)) if pred_flat[i] != labels_flat[i]]
num_wrong = len(wrong_indices)
if num_wrong > 0:
    if num_wrong > 5:
        print("Printing out a random sample of 5 incorrect sentences.")
        wrong_indices = random.sample(wrong_indices, 5)
    for i in wrong_indices:
        print(test_text[i])


Printing out a random sample of 5 incorrect sentences.
The Trump administration's efforts to restrict immigration from certain countries have been criticized for being discriminatory and damaging to the country's reputation.
To protect the security of the nation, policies must be enacted that ensure the safety of citizens and prevent individuals who may pose a threat from entering the country.
The intersection of race and immigration is a complex and often overlooked issue, with people of colour facing additional barriers and challenges in the immigration process.
Discrimination against immigrants and LGBTQ individuals can have profound impacts on mental health, leading to issues like depression, anxiety, and PTSD.
The benefits of immigration are numerous and include increased economic growth, higher levels of entrepreneurship, and greater cultural diversity.
