# Fine-tuning Pre-trained Language Models
In this notebook, we will cover:
   - Applying models to down-stream tasks without extra training
   - Fine-tuning for sequence classification
   - Fine-tuning for sequence pair classification
   - Fine-tuning for token classification

In [None]:
%env CUBLAS_WORKSPACE_CONFIG=:4096:8

env: CUBLAS_WORKSPACE_CONFIG=:4096:8


In [None]:
# Comment this out if you want to enable huggingface warnings
import logging
logging.disable(logging.WARNING)

# This fixes colab's default encoding to match huggingface accelerate
import locale
locale.getpreferredencoding = lambda x=False: "UTF-8"

# HuggingFace Transformers

We will be making use of the [HuggingFace](https://huggingface.co/) transformers library. This library provides support for downloading, running, and training language models and is an essential tool for professionals who want to deploy NLP technology.

We will be installing four huggingface libraries:

1. [Transformers](https://huggingface.co/docs/transformers/index) (Model inference): `!pip install transformers`
2. [Accelerate](https://huggingface.co/docs/accelerate/index) (Model training): `!pip install accelerate`
3. [Datasets](https://huggingface.co/docs/datasets/index) (Data processing): `!pip install datasets`
4. [Evaluate](https://huggingface.co/docs/evaluate/index) (Model evaluation): `!pip install evaluate`

In [None]:
%%capture
!pip install datasets
!pip install transformers
!pip install --upgrade accelerate
!pip install evaluate

In [None]:
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoModelForTokenClassification
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling
import datasets
from datasets import load_dataset
from evaluate import evaluator
import evaluate
import numpy as np
import torch
import copy
from dill.source import getsource
from collections import defaultdict

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device_id = 0 if str(device) == 'cuda' else -1

In [None]:
print(device_id)

0


# Zero-Shot Application of Pre-Trained Language Models

We'll be exploring different ways of using pre-trained language models to accomplish natural language tasks. We will be using BERT as our pre-trained language model for this homework.

As you can see below, according to BERT, there is an 84.7% chance that the masked word is "Italian".

<!-- We are running BERT via a [HuggingFace Pipeline](https://huggingface.co/docs/transformers/v4.28.1/en/main_classes/pipelines) object. Here we are using a `fill-mask` pipeline but many others exist and we will be using them extensively in this assignment. While this is not the only way to query a model with HuggingFace, it is the most convenient. -->

In [None]:
# Download and query the bert-base-cased model
bert_model = pipeline('fill-mask', model='bert-base-cased', device=device_id)
print(bert_model("I went to an [MASK] restaurant and ordered pasta."))

[{'score': 0.8473308086395264, 'token': 2169, 'token_str': 'Italian', 'sequence': 'I went to an Italian restaurant and ordered pasta.'}, {'score': 0.03583093360066414, 'token': 1890, 'token_str': 'Indian', 'sequence': 'I went to an Indian restaurant and ordered pasta.'}, {'score': 0.016542408615350723, 'token': 1385, 'token_str': 'old', 'sequence': 'I went to an old restaurant and ordered pasta.'}, {'score': 0.006616922095417976, 'token': 3427, 'token_str': 'empty', 'sequence': 'I went to an empty restaurant and ordered pasta.'}, {'score': 0.005878509022295475, 'token': 6210, 'token_str': 'Egyptian', 'sequence': 'I went to an Egyptian restaurant and ordered pasta.'}]


In *theory* we should be able to use BERT to do natural language tasks such as Question Answering, Language Identification, Part-of-Speech Tagging, and even Translation by formulating our task in the style of a fill-in-the-blank sentence.

In [None]:
# Language Identification
print(bert_model("I am currently speaking in the [MASK] language.")[0])

# Factual QA
print(bert_model("The Declaration of Independence was written in the year [MASK].")[0])

# Part of speech tagging
print(bert_model("The word run is a [MASK].")[0])

# Translation
print(bert_model("The French word amour translates to [MASK] in English.")[0])

{'score': 0.18785135447978973, 'token': 1483, 'token_str': 'English', 'sequence': 'I am currently speaking in the English language.'}
{'score': 0.10682458430528641, 'token': 14447, 'token_str': '1776', 'sequence': 'The Declaration of Independence was written in the year 1776.'}
{'score': 0.22859828174114227, 'token': 12464, 'token_str': 'verb', 'sequence': 'The word run is a verb.'}
{'score': 0.055166736245155334, 'token': 1567, 'token_str': 'love', 'sequence': 'The French word amour translates to love in English.'}


However, it turns out that BERT really isn't very good at doing these tasks without extra training (as you can see below). In Section 1 of this homework we'll evaluate BERT without extra training on sentiment analysis to get an idea of where the base model is at. Then, in the next sections, we'll train the model and see how much we can improve the performance.

In [None]:
# Language Identification (failed)
print(bert_model("今[MASK]で話しています")[0])

# Factual QA (failed)
print(bert_model("The U.S.A. was founded in the year [MASK].")[0])

# Part of speech tagging (failed)
print(bert_model("The word golf is a [MASK].")[0])

# Translation (failed)
print(bert_model("The French word bonjour translates to [MASK] in English.")[0])

{'score': 0.5789163708686829, 'token': 100, 'token_str': '[UNK]', 'sequence': 'しています'}
{'score': 0.1736524999141693, 'token': 1196, 'token_str': 'before', 'sequence': 'The U. S. A. was founded in the year before.'}
{'score': 0.15804286301136017, 'token': 8155, 'token_str': 'joke', 'sequence': 'The word golf is a joke.'}
{'score': 0.016320660710334778, 'token': 6164, 'token_str': 'wolf', 'sequence': 'The French word bonjour translates to wolf in English.'}


### **Sequence Classification with pre-trained BERT**

In this section we'll be evaluating BERT for sentiment analysis without fine-tuning. This is purely for the purposes of demonstration as you'll be able to see the difference between what BERT does before and after fine-tuning.

Sequence classification without fine-tuning BERT requires two things:
1. A mask-filling template to append to the end of the sequence
2. A set of `targets` consisting of the class labels

Below you'll see an example of this procedure applied to sentiment analysis. For the sentence "I just saw a movie today and it was really great", BERT assigns a 2.78% chance to the `[MASK]` token being "positive" and a 1.65% chance to it being "negative", so we give the example a positive label.

In [None]:
bert_model("I just saw a movie today and it was really great. My opinion of the movie is [MASK].", targets=["positive", "negative"])

[{'score': 0.027819881215691566,
  'token': 3112,
  'token_str': 'positive',
  'sequence': 'I just saw a movie today and it was really great. My opinion of the movie is positive.'},
 {'score': 0.01648259349167347,
  'token': 4366,
  'token_str': 'negative',
  'sequence': 'I just saw a movie today and it was really great. My opinion of the movie is negative.'}]
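
Because `targets` restricts the output to a handful of candidate tokens, the raw scores are small and do not sum to one. The following is a minimal sketch of how the label is chosen, using the two scores printed above. The helper name `pick_label` is hypothetical, and renormalizing over the targets is an illustrative extra step rather than something the pipeline does itself.

```python
# Scores copied from the pipeline output above
results = [
    {"token_str": "positive", "score": 0.027819881215691566},
    {"token_str": "negative", "score": 0.01648259349167347},
]

def pick_label(results):
    """Return the highest-scoring target and the scores renormalized over the targets."""
    total = sum(r["score"] for r in results)
    relative = {r["token_str"]: r["score"] / total for r in results}
    best = max(relative, key=relative.get)
    return best, relative

label, relative = pick_label(results)
print(label, relative)
```

Only the ordering of the scores matters for the predicted label; the renormalized values simply make the comparison between the targets easier to read.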

Let's test BERT's ability to recognize sentiment using the [Yelp Reviews](https://huggingface.co/datasets/yelp_review_full) dataset. This dataset consists of 650,000 training reviews and 50,000 test reviews, each with its user-assigned star rating. The task is to predict how many stars (1 to 5) the user gave from the text of their review.

In [None]:
# Download the yelp dataset from huggingface
dataset = load_dataset("yelp_review_full")

In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 650000
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 50000
    })
})

### **Dataset Processing**

There are many functions we can use to process a HuggingFace dataset object. One of the most useful is the [Map](https://huggingface.co/docs/datasets/process#map) function. This function applies some function `f` to every row of the dataset where `f` takes in a row of the dataset and returns a dictionary containing the new columns of the data (or any edits to existing columns). For example, given a dataset with a column `binary_label` that contains a binary label (0 or 1) you can create a new column with the opposite label by writing either one of the following functions:
```
def reverse_label(row):
  return {'reverse_label': int(not row["binary_label"])}

def reverse_label(row):
  row['reverse_label'] = int(not row["binary_label"])
  return row
```
and then calling map on the dataset with the function
```
new_data = dataset.map(reverse_label)
```
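
To make the merge behaviour concrete, here is a plain-Python stand-in for `map` that operates on a list of dictionaries. The helper `toy_map` and the two sample rows are hypothetical; this sketches the semantics described above and is not how the `datasets` library is implemented.

```python
def reverse_label(row):
    # Return only the new column; it is merged into the row by the caller
    return {"reverse_label": int(not row["binary_label"])}

def toy_map(rows, f):
    """Apply f to every row and merge the returned columns into a copy of that row."""
    return [{**row, **f(row)} for row in rows]

rows = [{"binary_label": 0}, {"binary_label": 1}]  # hypothetical toy dataset
new_rows = toy_map(rows, reverse_label)
print(new_rows)
```

The original rows are left untouched, which mirrors the fact that `map` returns a new dataset rather than modifying the existing one.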

### **Concatenating the Mask-Fill Template**
We will concatenate the mask-fill template onto every item in the `text` column of the dataset and add the output to the dataset as a new column named `input`.

In [None]:
def concatenate_mask(row):
  '''
    Pseudocode:
        1. Concatenate this template string to the 'text' field of the input row.
        2. Add the resulting concatenated string as a new field 'input' in the row dictionary.

    Input:
        row: A dictionary representing a single row of the dataset.
             It contains at least the key "text" which holds a string of text.

    Returns:
        The input row dictionary, but with an added key "input" that contains the text
        with the template appended.
  '''
  template = " I give it a score of [MASK] out of 5."
  row['input'] = row['text'] + template
  return row

test_data = dataset['test'].map(concatenate_mask)

### **Intro to the BERT Tokenizer**
Before taking in a sequence of text, language models need to break up the sequence into tokens. Since BERT uses the [Transformer](https://en.wikipedia.org/wiki/Transformer_(machine_learning_model)) architecture, it has a fixed maximum number of tokens it can process at once (512). If you pass in a sequence of text longer than 512 tokens, BERT raises an error:

In [None]:
# Giving BERT a piece of text with more than 512 tokens produces an error
try:
  bert_model("Hello " * 512 + " my name is [MASK]")
except RuntimeError as e:
  print(e)

The size of tensor a (518) must match the size of tensor b (512) at non-singleton dimension 1
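
The figure of 518 in the error message can be reproduced by hand. This is a small worked count, assuming that each word in the input maps to exactly one token and that the tokenizer adds one `[CLS]` token at the start and one `[SEP]` token at the end:

```python
# Token count for: "Hello " * 512 + " my name is [MASK]"
n_hello = 512                                    # one token per repeated "Hello" (assumption)
n_suffix = len(["my", "name", "is", "[MASK]"])   # four more tokens
n_special = 2                                    # [CLS] and [SEP] added by the tokenizer

total_tokens = n_hello + n_suffix + n_special
over_limit = total_tokens - 512                  # how far past the 512-token limit we are
print(total_tokens, over_limit)
```

Under these assumptions the total matches the tensor size reported in the error, six tokens over the limit.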


### **Filtering the data for BERT's max length**
Below we will filter the dataset to only include examples that have 512 or fewer tokens.

In [None]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
max_length = tokenizer.model_max_length

def filter_for_max_length(row):
  '''
    Pseudocode:
        1. Tokenize the "input" field in the row using the BERT tokenizer.
        2. Get the number of tokens in the tokenized input.
        3. Compare the number of tokens to the maximum length allowed by the BERT model.
        4. Return True if the number of tokens is less than or equal to the maximum length, otherwise return False.

    Input:
        row: A dictionary representing a single row of the dataset.
              It contains at least the key "input" which holds a string of text.

    Returns:
        A boolean value indicating whether the length of the tokenized "input" is
        less than or equal to the maximum length allowed by the BERT model.
  '''
  input_ids = tokenizer(row['input'])['input_ids']
  # Keep the row only if it fits within BERT's maximum length
  return len(input_ids) <= max_length

filtered_data = test_data.filter(filter_for_max_length)

Here we'll be randomly selecting 100 examples from our filtered dataset for testing purposes.

In the real world, given unlimited time, we would ideally like to use the full datasets.

In [None]:
# Randomly sample 100 examples from the dataset (do not change!)
sampled_data = filtered_data.shuffle(seed=42).select(range(100))

### **Apply BERT to each example**
We will now write a function that runs the BERT model on every sequence of text in the dataset's `input` column and outputs the most likely integer star rating (1-5) as predicted by BERT in a new column named `score`.


In [None]:
def apply_bert(row):
  '''
  Pseudocode:
      1. Use the BERT model to compute the probabilities for each possible score (1 to 5).
      2. Identify the score with the highest probability.
      3. Return the most likely score in a new dictionary with the key "score".

  Input:
      row: A dictionary representing a single row of the dataset.
            It contains at least the key "input" which holds a string of text.

  Returns:
      A dictionary with the key "score" holding the most likely score (1 to 5) as an integer.
  '''
  all_scores = bert_model(row['input'], targets=['1', '2', '3', '4', '5'])
  # Pick the candidate score with the highest probability
  best = max(all_scores, key=lambda possibility: possibility['score'])
  return {'score': int(best['token_str'])}

predicted_data = sampled_data.map(apply_bert)

### **Evaluating the Model**

Below we've written code to evaluate the model's accuracy. Given that this task is five-way classification, 24% accuracy is only *barely* better than random chance! Clearly BERT cannot do sentiment analysis without extra training. Let's see if fine-tuning will help.

In [None]:
def accuracy(outputs, reference):
  return sum([o == r for o, r in zip(outputs, reference)]) / len(reference)

print(accuracy(predicted_data["score"], predicted_data["label"]))

0.24
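
One detail worth checking when comparing predictions to references is that both use the same label range. The sanity-check cell later in this notebook notes that `LABEL_0` corresponds to 1 star, which suggests the dataset's `label` column runs from 0 to 4 while `score` runs from 1 to 5. The sketch below uses hypothetical toy values to show how such an offset affects the `accuracy` function and how shifting the predictions by one aligns them. Whether this applies to the run above is an assumption that should be verified against the dataset's label names.

```python
def accuracy(outputs, reference):
    return sum(o == r for o, r in zip(outputs, reference)) / len(reference)

# Hypothetical toy values, not taken from the dataset
star_predictions = [5, 1, 3, 4]   # predicted star ratings on a 1-5 scale
labels = [4, 0, 2, 2]             # references stored as 0-indexed class ids

raw_accuracy = accuracy(star_predictions, labels)
aligned_accuracy = accuracy([s - 1 for s in star_predictions], labels)
print(raw_accuracy, aligned_accuracy)
```

In this toy case the unaligned comparison scores every example as wrong, while the aligned comparison credits the three predictions that match.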


# Section 2: Training BERT for Sentiment Analysis

In this section we'll be training BERT for the same Sentiment Analysis task from earlier.

Fine-tuning is the process of further training a pre-trained model to accomplish a particular down-stream task. This is typically done by swapping out the pre-trained model's last layer (its head) for a new randomly initialized head and training both model and head.

### **Model "heads"**

The final output layer of a language model is typically called a "head". The standard head used by models when pre-training is called a "language modeling head". This is a dense linear layer that projects the $D_{enc}$-length encoding of each of the $L_{context}$ tokens to a probability distribution over the vocabulary. The same weights are applied at every position, so the weight matrix of this layer has size ($D_{enc}$ x $|V|$) and its output has size ($L_{context}$ x $|V|$).
<!-- The head and the network are always trained *together*. We compute the cross-entropy loss $L = -\log(P(w))$ using the output of the head and backpropagate through the head to the rest of the network. -->

### **Adding on a "Classification head"**

In Section 1 we used BERT's language modeling head to perform the Sentiment Analysis task. While we *can* train the model with its original LM head, this is unnecessary, as the LM head outputs probabilities over all tokens rather than just our targets. Thus, in this section we're going to remove the language modeling head and replace it with a classification head.

A classification head is very similar to a language modeling head. It is a dense linear layer that projects a $D_{enc}$-length encoding to a probability distribution over the *$n$ classes* rather than the whole vocabulary. For sequence classification the input is a single pooled encoding of the whole sequence (for BERT, derived from the `[CLS]` token), so the weight matrix has size ($D_{enc}$ x $n$). Since $n$ is much smaller than $|V|$, this head is significantly smaller and more efficient to train.
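
A rough worked comparison of the two heads, counting only the weights applied to a single $D_{enc}$-length encoding and ignoring biases. The values `d_enc = 768` and `vocab_size = 28996` are assumptions about `bert-base-cased` and should be checked against the model's `config.json`:

```python
d_enc = 768          # assumed hidden size of bert-base-cased
vocab_size = 28996   # assumed vocabulary size of bert-base-cased
n_classes = 5        # star ratings in the Yelp task

lm_head_weights = d_enc * vocab_size    # projection onto the whole vocabulary
cls_head_weights = d_enc * n_classes    # projection onto the class labels
ratio = lm_head_weights / cls_head_weights

print(lm_head_weights, cls_head_weights, ratio)
```

Whatever the exact input dimension, the ratio between the two heads is $|V| / n$, which is why the classification head is so much cheaper.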

### **Using HuggingFace to swap heads**

Loading a model with a particular head is easy using HuggingFace. All you do is load the model as an instance of a given class. The classes we're using in this notebook are as follows:
1. [AutoModelForMaskedLM](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoModelForMaskedLM) - Language modeling Head
2. [AutoModelForSequenceClassification](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoModelForSequenceClassification) - Classification Head
3. [AutoModelForTokenClassification](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoModelForTokenClassification) - Token Classification Head


In [None]:
# Load in the bert-base-cased model with a classification head (num_labels = number of classes)
classification_head_bert = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)

### **Training the Model + Classification head**

Let's now train the model + classification head for sentiment analysis. We will use the HuggingFace [Trainer](https://huggingface.co/docs/transformers/main_classes/trainer) class. We need to:
1. Tokenize and process the dataset
2. Create a `compute_metrics` function
3. Initialize the `Trainer` and specify the `TrainingArguments`

### **Step 1: Tokenize and Process the dataset**

`tokenize_function` runs the BERT tokenizer on every example in the dataset and adds the output fields `input_ids`, `attention_mask`, and `token_type_ids` as new columns in the dataset.

This function also pads all sequences shorter than 512 tokens to exactly 512 tokens using the special `[PAD]` token and truncates sequences longer than 512 tokens to exactly 512. Fixed-length inputs can be stacked into batches, which helps GPU parallelization; both behaviors are controlled by arguments to the tokenizer.
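
The effect of padding and truncation can be sketched in plain Python. The helper `pad_or_truncate`, the token ids, and the small `max_length` are hypothetical and only illustrate the shape of the result; the real tokenizer additionally keeps the closing `[SEP]` token when it truncates.

```python
def pad_or_truncate(input_ids, max_length, pad_id=0):
    """Cut input_ids to max_length, then pad with pad_id up to max_length."""
    ids = input_ids[:max_length]
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,
        # 1 marks real tokens, 0 marks padding the model should ignore
        "attention_mask": [1] * len(ids) + [0] * n_pad,
    }

short_example = pad_or_truncate([101, 7592, 102], max_length=6)
long_example = pad_or_truncate(list(range(1, 11)), max_length=6)
print(short_example)
print(long_example)
```

Every example ends up with the same length, and the `attention_mask` records which positions hold real tokens.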

In [None]:
# Step 1: Tokenize and Process the dataset
dataset = load_dataset("yelp_review_full")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
max_length = tokenizer.model_max_length

def tokenize_function(row):
  '''
  Pseudocode:
      1. Use the BERT tokenizer to tokenize the text in the "text" field of the input row.
      2. Ensure that the tokenized output is padded to the maximum length.
      3. Ensure that the tokenized output is truncated to the maximum length if it exceeds it.
      4. Return the tokenized output.

  Input:
      row: A dictionary representing a single row of the dataset.
            It contains at least the key "text" which holds a string of text.

  Returns:
      A dictionary with the tokenized output including keys "input_ids", "attention_mask", and "token_type_ids".
  '''
  return tokenizer(row['text'], padding="max_length", truncation=True, max_length=max_length)

# Randomly select 1000 examples from the train and test data and tokenize - do not change!
# (Note: We are subsampling our data just so that training doesn't take too long)
train_data = dataset["train"].shuffle(seed=42).select(range(1000)).map(tokenize_function)
eval_data = dataset["test"].shuffle(seed=42).select(range(1000)).map(tokenize_function)

### **Step 2: Define a `compute_metrics` function**

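For this function we will be calculating accuracy between the predictions and true labels. The `probabilities` variable is a 2D numpy array of size ($n$ x $5$) containing one score per class label for each of the $n$ examples in the input dataset. Strictly speaking, the `Trainer` passes the model's raw logits rather than normalized probabilities, but the highest logit always corresponds to the highest probability, so the argmax is unchanged.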

We will select the index with the highest probability for each row in the `probabilities` array then calculate the accuracy of the model with respect to the ground truth label.
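
As a small illustration of this selection step, the sketch below uses plain Python on hypothetical scores for two examples. It also checks that taking the argmax of raw scores gives the same labels as taking the argmax after a softmax, since softmax preserves ordering. The helper names `softmax` and `argmax` are defined here for illustration only.

```python
import math

def softmax(xs):
    m = max(xs)                              # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(xs):
    return max(range(len(xs)), key=xs.__getitem__)

# Hypothetical raw scores for two examples over five classes
scores = [[2.0, 0.5, -1.0, 0.0, 0.1],
          [-0.3, 0.2, 0.1, 1.5, 1.4]]
labels = [0, 4]

pred_labels = [argmax(row) for row in scores]
pred_from_probs = [argmax(softmax(row)) for row in scores]
toy_accuracy = sum(p == y for p, y in zip(pred_labels, labels)) / len(labels)
print(pred_labels, pred_from_probs, toy_accuracy)
```

The numpy version in the notebook does the same thing in one vectorized call with `np.argmax(..., axis=1)`.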

In [None]:
# Step 2: Define a compute_metrics function
def compute_metrics(eval_pred):
  '''
  Pseudocode:
      1. Extract the predicted probabilities and the true labels from the evaluation predictions.
      2. Compute the predicted labels by taking the argmax of the probabilities along the last axis.
      3. Calculate the accuracy by comparing the predicted labels with the true labels.
      4. Return a dictionary containing the accuracy.

  Input:
      eval_pred: A tuple (probabilities, labels)
                  probabilities: A 2D numpy array of shape (num_examples, num_classes) representing the predicted probabilities for each class.
                  labels: A 1D numpy array of shape (num_examples,) representing the true labels.

  Returns:
      A dictionary with the key "accuracy" and the value being the calculated accuracy.
  '''
  # Get the true labels and predicted probabilities
  probabilities, labels = eval_pred
  pred_labels = np.argmax(probabilities, axis=1)
  accuracy = np.mean(pred_labels == labels)
  return {'accuracy': accuracy}

### **Step 3: Initialize the `Trainer` and Specify `TrainingArguments`**

Here we specify various arguments to the trainer. We choose to evaluate and log every epoch (i.e. every pass through the full training set), and we specify the output directory for checkpoints, the learning rate (`5e-05`), the number of epochs (`3`), and the batch size used for each gradient update (`8`). In addition, we set `full_determinism=True` for the sake of grading consistency.

In the `Trainer` we specify our model as the BERT model we loaded earlier, we pass in our training arguments and the tokenized train and evaluation datasets, and finally we pass in our `compute_metrics` function.
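
These settings determine how many optimization steps the run performs. A quick worked calculation, assuming a single device and no gradient accumulation (neither is stated in the notebook):

```python
import math

num_examples = 1000    # size of the subsampled training set
batch_size = 8         # per_device_train_batch_size
num_epochs = 3         # num_train_epochs

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)
```

With evaluation and logging set to run every epoch, the metrics are reported once per `steps_per_epoch` steps.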

In [None]:
# Step 3: Specify the TrainingArguments and Initialize the Trainer (Do not change!)
training_args = TrainingArguments(
    eval_strategy="epoch",
    logging_strategy="epoch",
    output_dir="yelp-training",
    learning_rate=5e-05,
    num_train_epochs=3.0,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    full_determinism=True
)

trainer = Trainer(
    model=classification_head_bert,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=eval_data,
    compute_metrics=compute_metrics,
)

### **Time to Train!**

In [None]:
# Train the model!
trainer.train()
trainer.save_model('./yelpBERT')

| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 1.4372 | 1.099206 | 0.515 |
| 2 | 0.9693 | 1.153174 | 0.504 |
| 3 | 0.6345 | 1.035289 | 0.599 |


## Sanity-Checking our trained model

In order to run our trained model we need to load it into a `text-classification` pipeline. A quick sanity check should show us that our model outputs sensible results.

In [None]:
yelpBERT = pipeline('text-classification', model='./yelpBERT', tokenizer=tokenizer, device=device_id)

In [None]:
# Output sentiment for reviews (LABEL_4 = 5 stars, LABEL_0 = 1 star)
yelpBERT(["This place was amazing!",
          "This place was good",
          "This place was fine, there were good and bad parts.",
          "This place was pretty bad",
          "This place was awful"])

[{'label': 'LABEL_4', 'score': 0.8508304357528687},
 {'label': 'LABEL_3', 'score': 0.6638771891593933},
 {'label': 'LABEL_2', 'score': 0.7089926600456238},
 {'label': 'LABEL_1', 'score': 0.6750457882881165},
 {'label': 'LABEL_0', 'score': 0.7940264940261841}]
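
The generic `LABEL_k` names can be turned back into star ratings using the mapping given in the comment above (`LABEL_0` = 1 star, `LABEL_4` = 5 stars). This is a minimal sketch; the helper name `label_to_stars` is hypothetical, and the labels are copied from the output above.

```python
def label_to_stars(label):
    """Convert a pipeline label such as 'LABEL_3' to a 1-5 star rating."""
    class_id = int(label.split("_")[1])
    return class_id + 1

# Labels copied from the pipeline output above
predicted_labels = ["LABEL_4", "LABEL_3", "LABEL_2", "LABEL_1", "LABEL_0"]
stars = [label_to_stars(label) for label in predicted_labels]
print(stars)
```

Read this way, the five test sentences receive ratings that decrease from the most positive sentence to the most negative one.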

## Evaluating the model

In order to get the official evaluation results we will be using the [Evaluator](https://huggingface.co/docs/evaluate/v0.4.0/en/package_reference/evaluator_classes#evaluator) pipeline for a standardized evaluation environment.

In [None]:
task_evaluator = evaluator("text-classification")

eval_results = task_evaluator.compute(
    model_or_pipeline=yelpBERT,
    data=eval_data,
    metric=evaluate.load("accuracy"),
    label_mapping={"LABEL_0": 0, "LABEL_1": 1, "LABEL_2": 2, "LABEL_3": 3, "LABEL_4": 4}
)

print(eval_results)

{'accuracy': 0.599, 'total_time_in_seconds': 26.257810813000106, 'samples_per_second': 38.08390604691634, 'latency_in_seconds': 0.026257810813000108}
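
The timing fields in this output are related by simple arithmetic. The sketch below assumes the evaluator derives them from the total time and the 1000 evaluation examples, which is consistent with the numbers printed above:

```python
total_time_in_seconds = 26.257810813000106   # copied from the output above
num_samples = 1000                           # size of the evaluation subset

samples_per_second = num_samples / total_time_in_seconds
latency_in_seconds = total_time_in_seconds / num_samples
print(samples_per_second, latency_in_seconds)
```

These values depend on the hardware and will differ from run to run, unlike the accuracy.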


# Section 3: Training BERT for Natural Language Inference

In this section we will train BERT to perform a new task: Natural Language Inference (NLI). NLI is the task of taking two pieces of text (a "premise" and a "hypothesis") and determining whether or not the hypothesis is entailed by the premise.

> Natural Language Inference is a task of determining whether the given “hypothesis” and “premise” logically follow (entailment) or unfollow (contradiction) or are undetermined (neutral) to each other. ~ [Oleh Loksyn](https://towardsdatascience.com/natural-language-inference-an-overview-57c0eecf6517)

This can be formulated as a sequence classification task where, given both sequences, we predict one of three labels:

(0 = entailment, 1 = neutral, 2 = contradiction)

### **Applying BERT to multiple sequences**

In order to encode multiple sequences of non-contiguous text with BERT we use the `[SEP]` token. This token is a special token (similar to `[MASK]`) that indicates to the model that the sequences on either side of the token are distinct. With the `[SEP]` token we're able to concatenate the premise and hypothesis together into one sequence allowing us to use a standard classification head.
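
The layout commonly used for BERT sequence pairs can be sketched in plain Python. Here whitespace-split words stand in for real WordPiece tokens and the helper `build_pair` is hypothetical. The `token_type_ids` shown are the segment ids BERT's tokenizer produces when the two texts are passed as a pair, which is an alternative to joining them into one string.

```python
def build_pair(premise_tokens, hypothesis_tokens):
    """Lay out two token lists as: [CLS] premise [SEP] hypothesis [SEP]."""
    tokens = ["[CLS]"] + premise_tokens + ["[SEP]"] + hypothesis_tokens + ["[SEP]"]
    # Segment 0 covers [CLS], the premise and the first [SEP]; segment 1 covers the rest
    first_segment_len = len(premise_tokens) + 2
    second_segment_len = len(hypothesis_tokens) + 1
    token_type_ids = [0] * first_segment_len + [1] * second_segment_len
    return tokens, token_type_ids

tokens, token_type_ids = build_pair("My name is John .".split(),
                                    "The sky is blue .".split())
print(tokens)
print(token_type_ids)
```

Either way, the model sees a single sequence in which `[SEP]` marks the boundary between the premise and the hypothesis.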

### **Loading BERT**
The first step is to load in BERT with a classification head. This is identical to the loading code from Section 2 save for a different number of labels.

In [None]:
# Load in the bert-base-cased model with a classification head (3 labels for NLI)
classification_head_bert = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=3)

### **The MNLI Dataset**
For this task we'll be using the [Multi-Genre NLI](https://huggingface.co/datasets/multi_nli) dataset. This dataset was collected by asking crowd workers to annotate 433,000 sentence pairs from various genres (Fiction, Travel, Telephone, Letters, Government) for their textual entailment information.

In [None]:
mnli = load_dataset("multi_nli")

Generating train split:   0%|          | 0/392702 [00:00<?, ? examples/s]

Generating validation_matched split:   0%|          | 0/9815 [00:00<?, ? examples/s]

Generating validation_mismatched split:   0%|          | 0/9832 [00:00<?, ? examples/s]

### **Tokenize and Process the Dataset**

In [None]:
def is_labeled(row):
  '''
  Input:
      row: A dictionary representing a single row of the dataset.
            It contains at least the key "label" which holds an integer.

  Returns:
      A boolean value indicating whether the "label" is not -1.
  '''
  # Unlabeled examples are marked with a label of -1
  return row['label'] != -1

mnli = mnli.filter(is_labeled)

In [None]:
def concatenate(row):
  '''
  Input:
      row: A dictionary representing a single row of the dataset.
            It contains at least the keys "premise" and "hypothesis", both holding strings.

  Returns:
      The input row dictionary, but with an added key "concat" that contains the concatenated string.

  '''
  #        Concatenate the text of the "premise" column to the "hypothesis" column
  #        using the [SEP] token. Must be in the order "<premise> [SEP] <hypothesis>"
  #        Add the output as a new column "concat" to the dataset
  row['concat'] = row['premise'] + ' [SEP] ' + row['hypothesis']
  return row

mnli = mnli.map(concatenate)

In [None]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize_function(row):
  '''
  Input:
      row: A dictionary representing a single row of the dataset.
            It contains at least the key "concat" which holds a string of text.

  Returns:
      A dictionary with the tokenized output including keys "input_ids", "attention_mask", and "token_type_ids".
  '''
  return tokenizer(row['concat'], padding="max_length", truncation=True, max_length=max_length)

# Randomly select 1000 examples from train and validation and process them (Do not change!)
train_mnli = mnli["train"].shuffle(seed=42).select(range(1000)).map(tokenize_function)
eval_mnli = mnli["validation_matched"].shuffle(seed=42).select(range(1000)).map(tokenize_function)

### **Define the `compute_metrics` function**

In [None]:
def compute_metrics(eval_pred):
  '''
  Input:
      eval_pred: A tuple (logits, labels)
                  logits: A 2D numpy array of shape (num_examples, num_classes) representing the raw predictions from the model.
                  labels: A 1D numpy array of shape (num_examples,) representing the true labels.

  Returns:
      A dictionary with the key "accuracy" and the value being the calculated accuracy.
  '''

  logits, labels = eval_pred
  pred_labels = np.argmax(logits, axis=1)
  accuracy = np.mean(pred_labels == labels)
  return {"accuracy": accuracy}

### **Initialize the `TrainingArguments` and `Trainer`**

In [None]:
training_args = TrainingArguments(
    eval_strategy="epoch",
    logging_strategy="epoch",
    output_dir="mnli-training",
    learning_rate=5e-05,
    num_train_epochs=3.0,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    full_determinism=True
)

trainer = Trainer(
    model=classification_head_bert,
    args=training_args,
    train_dataset=train_mnli,
    eval_dataset=eval_mnli,
    compute_metrics=compute_metrics,
)

### **Train the model**

In [None]:
trainer.train()
trainer.save_model('./mnliBERT')

| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 1.0747 | 1.019307 | 0.488 |
| 2 | 0.73 | 1.004586 | 0.551 |
| 3 | 0.2736 | 1.399728 | 0.574 |


### **Sanity Check**

Once again, if trained properly, a quick sanity check should provide us with sensible results.

In [None]:
mnliBERT = pipeline('text-classification', model='./mnliBERT', tokenizer=tokenizer, device=device_id)

In [None]:
# Output entailment (LABEL_0 = entailment, LABEL_1 = neutral, LABEL_2 = contradiction)
mnliBERT(["I just graduated college. [SEP] The person is a college graduate.",
          "My name is John. [SEP] The sky is blue.",
          "Thank you so much for your help. [SEP] The person was not helped."])

[{'label': 'LABEL_1', 'score': 0.7098633050918579},
 {'label': 'LABEL_1', 'score': 0.9340775609016418},
 {'label': 'LABEL_2', 'score': 0.9761003255844116}]
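
The raw `LABEL_n` names are easier to judge once they are mapped back to relation names. The sketch below does this for the three outputs above. The `expected` list is our own reading of the three sentence pairs, not a dataset label, so treat it as an assumption.

```python
# Label order taken from the comment in the cell above.
id2name = {"LABEL_0": "entailment", "LABEL_1": "neutral", "LABEL_2": "contradiction"}

# Pipeline outputs copied from above (scores rounded for readability).
outputs = [
    {"label": "LABEL_1", "score": 0.7099},
    {"label": "LABEL_1", "score": 0.9341},
    {"label": "LABEL_2", "score": 0.9761},
]

# Relations a human reader would likely assign (an assumption, not gold labels).
expected = ["entailment", "neutral", "contradiction"]

decoded = [id2name[o["label"]] for o in outputs]
num_matches = sum(name == want for name, want in zip(decoded, expected))

for name, want, out in zip(decoded, expected, outputs):
    flag = "ok" if name == want else "MISMATCH"
    print(f"{name:<13} (score {out['score']:.2f})  expected {want:<13} {flag}")
```

Under that reading, the model gets the second and third pairs right but labels the first pair as neutral instead of entailment, which is plausible for a model with a validation accuracy of about 57%.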

### **Evaluate the Model**

In [None]:
task_evaluator = evaluator("text-classification")

eval_results = task_evaluator.compute(
    model_or_pipeline=mnliBERT,
    data=eval_mnli,
    input_column="concat",
    label_column="label",
    metric=evaluate.load("accuracy"),
    label_mapping={"LABEL_0": 0, "LABEL_1": 1, "LABEL_2": 2}
)

print(eval_results)

{'accuracy': 0.574, 'total_time_in_seconds': 27.578206180000052, 'samples_per_second': 36.260516491649426, 'latency_in_seconds': 0.02757820618000005}


# Section 4: Training BERT for Named-Entity Recognition

In this final section we will train BERT to perform [Named Entity Recognition](https://en.wikipedia.org/wiki/Named-entity_recognition) (NER).

NER is the task of identifying and tagging named entities (e.g. people, organizations, locations) in text. For example, the sentence "Only France and Britain backed Fischler's proposal." would be tagged as

> Only (France, LOC) and (Britain, LOC) backed (Fischler's, PER) proposal.

NER is typically formulated as a token classification task: given a sequence of tokens, we assign a class to each token. Tokens that are not part of a named entity are given the null tag `O`, and tokens that are part of a named entity get one of four classes (`ORG`, `PER`, `LOC`, or `MISC`). Thus the sentence from earlier would be given the labels

>(Only, O) (France, LOC) (and, O) (Britain, LOC) (backed, O) (Fischler's, PER) (proposal, O)

Giving every token in the sequence a label allows us to use a token classification head for this task.

### **Beginning-Inside-Outside (BIO) tags**

One downside to the tagging scheme described is that for two consecutive tokens with the same tag, we can't tell whether or not they are the same entity. For example

> ("25-1", O), ("Barcelona", ORG), ("Real", ORG), ("Madrid", ORG)

We know that [Real Madrid](https://en.wikipedia.org/wiki/Real_Madrid_CF) and [Barcelona](https://en.wikipedia.org/wiki/FC_Barcelona) are distinct entities but our tag system combines them together. Thus we must distinguish between tags that *begin* an entity (`B-ORG`) and those that are *inside* an entity (`I-ORG`). With this new system we get

> ("25-1", O), ("Barcelona", B-ORG), ("Real", B-ORG), ("Madrid", I-ORG)

This allows us to distinguish between the two consecutive entities.
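
The conversion from entity spans to BIO tags can be sketched in a few lines. `to_bio` is a hypothetical helper, and it assumes we already know where each entity starts and ends, since that information cannot be recovered from the flat tags alone.

```python
def to_bio(num_tokens, spans):
    """Build BIO tags from (start, end, type) spans; `end` is exclusive."""
    tags = ["O"] * num_tokens
    for start, end, ent_type in spans:
        tags[start] = f"B-{ent_type}"        # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{ent_type}"        # remaining tokens of the same entity
    return tags

tokens = ["25-1", "Barcelona", "Real", "Madrid"]
# Two separate ORG entities: "Barcelona" and "Real Madrid".
spans = [(1, 2, "ORG"), (2, 4, "ORG")]

bio_tags = to_bio(len(tokens), spans)
print(list(zip(tokens, bio_tags)))
```

The second `B-ORG` is what marks the boundary: without it, "Barcelona Real Madrid" would read as one three-token organization.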

### **Training BERT + Token Classification Head**

In order to train BERT for NER we need to use a token classification head. A token classification head is a dense linear layer that outputs a score (logit) for each of the $n$ classes *for each token* in the sequence; a softmax over these scores gives a probability distribution over the classes. In the Hugging Face implementation the same layer is applied at every position, so its weight matrix has size ($D_{enc}$ x $n$) regardless of the context length $L_{context}$, and the output for a full sequence has shape ($L_{context}$ x $n$).
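
A minimal NumPy sketch of such a head is shown below, under the assumption that one linear layer is shared across all positions (which is how the Hugging Face BERT token classification model is built, as far as we are aware). The random arrays are stand-ins for real encoder outputs and trained weights.

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, d_enc, n_classes = 6, 768, 9   # 768 is the hidden size of BERT-base

# Stand-in for the encoder output: one hidden vector per token.
hidden_states = rng.normal(size=(seq_len, d_enc))

# A single weight matrix and bias, reused at every position.
W = rng.normal(scale=0.02, size=(d_enc, n_classes))
b = np.zeros(n_classes)

logits = hidden_states @ W + b          # shape: (seq_len, n_classes)

# Softmax over the class axis turns each row into a probability distribution.
shifted = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

num_head_params = W.size + b.size
print(logits.shape, num_head_params)
```

Because the weights are shared, the parameter count of the head depends only on the hidden size and the number of classes, not on the sequence length.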

In [None]:
label_names = {0: 'O', 1: 'B-PER', 2: 'I-PER', 3: 'B-ORG', 4: 'I-ORG', 5: 'B-LOC', 6: 'I-LOC', 7: 'B-MISC', 8:'I-MISC'}

token_classification_head_bert = AutoModelForTokenClassification.from_pretrained("bert-base-cased", num_labels=9, id2label=label_names)

In [None]:
attributes = copy.deepcopy(vars(token_classification_head_bert))
attributes['config'] = attributes['config'].__dict__
del attributes['_modules']

### **The CoNLL Dataset**

We will use the [CoNLL-2003](https://huggingface.co/datasets/conll2003) dataset as our training data for NER. This dataset consists of roughly 20,000 sentences from Reuters news articles (14,041 for training, 3,250 for validation, and 3,453 for testing) that were manually annotated with named entities. The tags used are the four classes from earlier (`ORG`, `PER`, `LOC`, or `MISC`) with both `B-` and `I-` variants, as well as a null tag `O`.

In [None]:
conll = load_dataset("conll2003")

README.md:   0%|          | 0.00/12.3k [00:00<?, ?B/s]

conll2003.py:   0%|          | 0.00/9.57k [00:00<?, ?B/s]

The repository for conll2003 contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/conll2003.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


Downloading data:   0%|          | 0.00/983k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/14041 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/3250 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/3453 [00:00<?, ? examples/s]

### **Token-tag mismatch**
When processing data for a tagging task, we need to ensure that every tag stays aligned with the correct token. BERT's tokenizer is a WordPiece tokenizer, a sub-word scheme closely related to [Byte-Pair Encoding](https://en.wikipedia.org/wiki/Byte_pair_encoding) (BPE), and it tends to split words into sub-word tokens. For example, take the sentence from earlier
> "Only France and Britain backed Fischler's proposal"

We can see below that the tokenization from CoNLL doesn't match the BERT tokenizer. This creates a problem when processing the tags, as we need to keep the correspondence between tags and tokens.

In [None]:
print(conll['train'][12]['ner_tags'])
print(tokenizer.convert_ids_to_tokens(tokenizer.encode("Only France and Britain backed Fischler's proposal"))[1:-1])

[0, 5, 0, 5, 0, 1, 0, 0, 0]
['Only', 'France', 'and', 'Britain', 'backed', 'Fi', '##sch', '##ler', "'", 's', 'proposal']


### **Redistributing Tags to Sub-word Tokens**

We will now implement the `tokenize_and_tag` function. This function does three things:
1. Retokenize the tokens from CoNLL using the BERT tokenizer and add the outputs (`input_ids`, `attention_mask`, and `token_type_ids`) as new columns in the dataset.
2. Redistribute the NER tags with respect to the newly tokenized sequence and add the tags as a new `labels` column to the dataset
3. Pad and truncate both the tokens and tags to be length `tokenizer.model_max_length` (512)

If a token is broken up into multiple sub-tokens then all sub-tokens should be given the same NER tag as the original token.

Some example inputs and outputs are as follows (Reminder to consult the `label_names` dictionary for the mapping from number to NER tag):
```
-- Example #1 --
Inputs:
"tokens": ['Israel', 'approves', 'Arafat', "'s", 'flight', 'to', 'West', 'Bank', '.']
"ner_tags": [5, 0, 1, 0, 0, 0, 5, 6, 0]

Outputs:
"bert_tokens": ['[CLS]', 'Israel', 'approve', '##s', 'Ara', '##fa', '##t', "'", 's', 'flight', 'to', 'West', 'Bank', '.', '[SEP]', '[PAD]', '[PAD]', ...]
"input_ids": [101, 3103, 14942, 1116, 25692, 8057, 1204, 112, 188, 3043, 1106, 1537, 2950, 119, 102, 0, 0, ...]
"labels": [-100, 5, 0, 0, 1, 1, 1, 0, 0, 0, 0, 5, 6, 0, -100, -100, -100, ...]

-- Example #2 --
Inputs:
"tokens": ['66', 'Paul', 'Goydos', ',', 'Billy', 'Mayfair', ',', 'Hidemichi', 'Tanaka', '(', 'Japan', ')']
"ner_tags": [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 5, 0]

Outputs:
"bert_tokens": ['[CLS]', '66', 'Paul', 'Go', '##yd', '##os', ',', 'Billy', 'May', '##fair', ',', 'Hi', '##de', '##mic', '##hi', 'Tanaka', '(', 'Japan', ')', '[SEP]', '[PAD]', '[PAD]', ...]
"input_ids": [101, 5046, 1795, 3414, 19429, 2155, 117, 4224, 1318, 19803, 117, 8790, 2007, 7257, 3031, 24128, 113, 1999, 114, 102, 0, 0, ...]
"labels": [-100, 0, 1, 2, 2, 2, 0, 1, 2, 2, 0, 1, 1, 1, 1, 2, 0, 5, 0, -100, -100, -100, ...]
```
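
The redistribution rule can be tried out without loading BERT. The sketch below replaces the tokenizer with a hypothetical lookup table (`PIECES`) whose entries are copied from Example #1, and uses a short `max_length` of 17 instead of 512 so the padded output stays readable. `redistribute` is an illustrative helper, not the function you are asked to implement.

```python
# Hypothetical stand-in for the BERT tokenizer: sub-word pieces copied from
# Example #1. Words that are not listed are kept whole.
PIECES = {
    "approves": ["approve", "##s"],
    "Arafat": ["Ara", "##fa", "##t"],
    "'s": ["'", "s"],
}

def redistribute(words, word_tags, max_length, ignore_index=-100):
    pieces, labels = ["[CLS]"], [ignore_index]
    for word, tag in zip(words, word_tags):
        sub = PIECES.get(word, [word])
        pieces.extend(sub)
        labels.extend([tag] * len(sub))      # every piece inherits the word's tag
    # Truncate so that [SEP] still fits, then pad both lists to max_length.
    pieces = pieces[: max_length - 1] + ["[SEP]"]
    labels = labels[: max_length - 1] + [ignore_index]
    pad = max_length - len(pieces)
    return pieces + ["[PAD]"] * pad, labels + [ignore_index] * pad

words = ["Israel", "approves", "Arafat", "'s", "flight", "to", "West", "Bank", "."]
word_tags = [5, 0, 1, 0, 0, 0, 5, 6, 0]

toy_pieces, toy_labels = redistribute(words, word_tags, max_length=17)
print(toy_pieces)
print(toy_labels)
```

The printed labels should agree with the first 17 entries of the `labels` list in Example #1. The value -100 is used for `[CLS]`, `[SEP]`, and `[PAD]` because PyTorch's cross-entropy loss ignores that index by default.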

In [None]:
max_length = tokenizer.model_max_length

def tokenize_and_tag(row):
  '''
  Pseudocode:
      1. Initialize an empty list for the sub-token tags.
      2. For each word and its corresponding tag in the input row:
          a. Tokenize the word using the BERT tokenizer.
          b. Extend the tags list with the tag repeated once per sub-token.
      3. Tokenize the entire sequence of words using the BERT tokenizer with padding and truncation.
      4. Truncate the tags to leave room for [CLS] and [SEP], add the special label -100
         to the start of the labels list, and pad it with -100 to the maximum length.
      5. Add the labels list to the tokenized samples dictionary.
      6. Return the tokenized samples dictionary.

  Input:
      row: A dictionary representing a single row of the dataset.
            It contains at least the keys "tokens" (a list of strings) and "ner_tags" (a list of integers).

  Returns:
      A dictionary with the tokenized output including keys "input_ids", "attention_mask", "token_type_ids", and "labels".
  '''

  tags = []

  # Repeat each word-level tag once per sub-word token produced for that word
  for word, tag in zip(row["tokens"], row["ner_tags"]):
    word_tokens = tokenizer.tokenize(word)
    tags.extend([tag] * len(word_tokens))

  tokenized_results = tokenizer(row["tokens"], is_split_into_words=True,
                                padding="max_length", truncation=True, max_length=max_length)

  # Keep at most max_length - 2 tags so the labels still line up with
  # [CLS] ... [SEP] when a long sequence is truncated
  tags = tags[:max_length - 2]

  # -100 is ignored by the loss; it covers [CLS], [SEP], and every [PAD] position
  labels = [-100] + tags
  labels.extend([-100] * (max_length - len(labels)))

  tokenized_results["labels"] = labels

  return tokenized_results

train_conll = conll["train"].shuffle(seed=42).select(range(1000)).map(tokenize_and_tag)
eval_conll = conll["validation"].shuffle(seed=42).select(range(1000)).map(tokenize_and_tag)


### **Computing Macro-Averaged F1 Score**

We want to measure how well our model identifies named entities. However, most tokens in most sequences carry the `O` tag. This makes plain accuracy a poor metric, since predicting `O` for every token yields high accuracy without actually accomplishing the task.

To solve this problem we will use the Macro-Averaged F1 Score as our metric. This is the unweighted average of the F1 scores for each individual token class, and it is a common metric for unbalanced multi-class classification tasks.

The `calculate_macro_f1` function does the following:
1. Count the true positives, false positives, and false negatives (TP, FP, FN) for each class
2. Compute the precision, recall, and F1 score for each class
3. Macro-average the F1 scores of all classes that appear in the labels or predictions to get the final score
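
To see why accuracy is misleading here, the sketch below scores a "model" that predicts `O` for every token of a made-up six-token sequence. `per_class_f1` is a hypothetical helper written for this illustration, separate from the `calculate_macro_f1` function defined in this section.

```python
def per_class_f1(true, pred, cls):
    # Count TP, FP, and FN for a single class, then combine them into F1.
    tp = sum(t == cls and p == cls for t, p in zip(true, pred))
    fp = sum(t != cls and p == cls for t, p in zip(true, pred))
    fn = sum(t == cls and p != cls for t, p in zip(true, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Made-up sequence: four O tokens and two entity tokens.
true = ["O", "O", "O", "O", "B-PER", "B-LOC"]
pred = ["O"] * len(true)                 # always predict O

accuracy = sum(t == p for t, p in zip(true, pred)) / len(true)
classes = sorted(set(true) | set(pred))
f1_by_class = {c: per_class_f1(true, pred, c) for c in classes}
macro_f1 = sum(f1_by_class.values()) / len(classes)

print(f"accuracy={accuracy:.3f}  macro-F1={macro_f1:.3f}")
print(f1_by_class)
```

Accuracy comes out at about 0.67, while macro-F1 is only about 0.27: the `O` class scores 0.8, but both entity classes score 0 and pull the unweighted average down.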

In [None]:
from sklearn.metrics import f1_score

def calculate_macro_f1(preds, labels):
  '''
  Pseudocode:
      1. Map prediction and label IDs to their corresponding tag names, skipping positions labeled -100.
      2. Count true positives, false positives, and false negatives (TP, FP, FN) for each class.
      3. Compute precision, recall, and F1 score for each class.
      4. Macro-average the F1 scores across all classes present in the labels or predictions.
      5. Return a dictionary containing the macro-averaged F1 score.

  Input:
      preds: A list of lists, where each sublist contains predicted label IDs for a sequence.
      labels: A list of lists, where each sublist contains true label IDs for a sequence.

  Returns:
      A dictionary with the key "macro-f1" and the value being the macro-averaged F1 score.
  '''
  # Filter out -100 and convert ids to tags
  label_map = {0: 'O', 1: 'B-PER', 2: 'I-PER', 3: 'B-ORG', 4: 'I-ORG', 5: 'B-LOC', 6: 'I-LOC', 7: 'B-MISC', 8:'I-MISC'}
  true_preds = [
    [(label_map[p], label_map[l]) for (p, l) in zip(pred, label) if l in label_map]
    for pred, label in zip(preds, labels)
  ]

  true_labels = [label for sentence in true_preds for (_, label) in sentence]
  pred_labels = [pred for sentence in true_preds for (pred, _) in sentence]

  present_classes = set(true_labels + pred_labels)

  TP = {c: 0 for c in present_classes}
  FP = {c: 0 for c in present_classes}
  FN = {c: 0 for c in present_classes}

  for true, pred in zip(true_labels, pred_labels):
    if true == pred:
      TP[true] += 1
    else:
      FP[pred] += 1
      FN[true] += 1

  precision = {}
  recall = {}
  f1 = {}

  for tag in present_classes:
    if TP[tag] + FP[tag] == 0:
      precision[tag] = 0
    else:
      precision[tag] = TP[tag] / (TP[tag] + FP[tag])
    if TP[tag] + FN[tag] == 0:
      recall[tag] = 0
    else:
      recall[tag] = TP[tag] / (TP[tag] + FN[tag])
    if precision[tag] + recall[tag] == 0:
      f1[tag] = 0
    else:
      f1[tag] = 2 * (precision[tag] * recall[tag]) / (precision[tag] + recall[tag])

  macro_f1 = np.mean(list(f1.values()))
  return {"macro-f1": macro_f1}

In [None]:
from sklearn.metrics import f1_score

def debug_macro_f1(num_tests, len_examples):
  # For each random test case
  for i in range(num_tests):
    # Generate two random arrays for the predictions and labels
    rand_preds = np.random.randint(0,9,len_examples).tolist()
    rand_labels = np.random.randint(0,9,len_examples).tolist()

    # Calculate Macro-F1 score with your implementation and scikitlearn
    sklearn_f1 = f1_score(rand_labels, rand_preds, average='macro')
    your_f1 = calculate_macro_f1([rand_preds], [rand_labels])

    # If the two implementations differ, print the example and the scores
    if abs(sklearn_f1 - your_f1['macro-f1']) > 0.001:
      print(f'preds: {rand_preds}')
      print(f'labels: {rand_labels}')
      print(sklearn_f1, your_f1['macro-f1'])

debug_macro_f1(1, 10)

In [None]:
def compute_metrics(eval_pred):
  logits, labels = eval_pred
  preds = np.argmax(logits, axis=2)
  return calculate_macro_f1(preds, labels)
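
The shapes involved are easy to lose track of, so here is a small sketch with made-up numbers. For token classification the logits have shape (batch, sequence length, classes), which is why the argmax runs over `axis=2`. The explicit mask mirrors the filtering of `-100` labels that happens inside `calculate_macro_f1`.

```python
import numpy as np

# Made-up logits: a batch of 1 sequence, 4 positions, 3 classes.
toy_logits = np.array([[
    [0.1, 0.2, 0.9],   # [CLS] position, ignored below
    [2.0, 0.1, 0.3],   # argmax -> 0
    [0.2, 1.7, 0.1],   # argmax -> 1
    [0.5, 0.4, 0.3],   # padding position, ignored below
]])
toy_labels = np.array([[-100, 0, 2, -100]])

toy_preds = np.argmax(toy_logits, axis=2)    # shape: (batch, seq_len)

# Keep only the positions whose label is a real tag.
mask = toy_labels != -100
kept_preds = toy_preds[mask].tolist()
kept_labels = toy_labels[mask].tolist()

print(toy_preds.shape, kept_preds, kept_labels)
```

Only the two middle positions are scored: one prediction matches its label and one does not.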

### **Training the model**

In [None]:
training_args = TrainingArguments(
    eval_strategy="epoch",
    logging_strategy="epoch",
    output_dir="conll-training",
    learning_rate=5e-05,
    num_train_epochs=3.0,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    full_determinism=True
)

trainer = Trainer(
    model=token_classification_head_bert,
    args=training_args,
    train_dataset=train_conll,
    eval_dataset=eval_conll,
    compute_metrics=compute_metrics
)

In [None]:
trainer.train()
trainer.save_model('./nerBERT')

| Epoch | Training Loss | Validation Loss | Macro-f1 |
|-------|---------------|-----------------|----------|
| 1 | 0.407 | 0.168978 | 0.721088 |
| 2 | 0.1131 | 0.129435 | 0.834078 |
| 3 | 0.0506 | 0.124393 | 0.852037 |


### **Sanity Checking the NER Predictions**
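In order to check our model we need to load it into a `token-classification` pipeline.

In the following example the model tags "Philadelphia" as a `LOC` and the "of Pennsylvania" in "University of Pennsylvania" as an `ORG`. It is not perfect, however: "University" is split off as a low-confidence `LOC`, and "United" and "States" are returned as two separate `LOC` entities.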

In [None]:
nerBERT = pipeline('token-classification', model='./nerBERT', tokenizer=tokenizer, device=device_id)

In [None]:
nerBERT(["Chris Callison-Burch is a professor at the University of Pennsylvania in Philadelphia",
          "Joe Biden is the 46th President of the United States"], aggregation_strategy="simple")

[[{'entity_group': 'PER',
   'score': np.float32(0.99479544),
   'word': 'Chris Callison - Burch',
   'start': 0,
   'end': 20},
  {'entity_group': 'LOC',
   'score': np.float32(0.5047579),
   'word': 'University',
   'start': 43,
   'end': 53},
  {'entity_group': 'ORG',
   'score': np.float32(0.9020468),
   'word': 'of Pennsylvania',
   'start': 54,
   'end': 69},
  {'entity_group': 'LOC',
   'score': np.float32(0.966795),
   'word': 'Philadelphia',
   'start': 73,
   'end': 85}],
 [{'entity_group': 'PER',
   'score': np.float32(0.9941389),
   'word': 'Joe Biden',
   'start': 0,
   'end': 9},
  {'entity_group': 'LOC',
   'score': np.float32(0.9863897),
   'word': 'United',
   'start': 39,
   'end': 45},
  {'entity_group': 'LOC',
   'score': np.float32(0.61383396),
   'word': 'States',
   'start': 46,
   'end': 52}]]
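
Note that the `word` field is re-joined from sub-word tokens, which is why the hyphenated name comes back as `'Chris Callison - Burch'`. The `start` and `end` character offsets are more reliable for recovering the original text, as the sketch below shows using values copied from the first output above. The 0.6 score threshold is an arbitrary choice for illustration.

```python
text = "Chris Callison-Burch is a professor at the University of Pennsylvania in Philadelphia"

# Entity groups copied from the pipeline output above (scores rounded).
entities = [
    {"entity_group": "PER", "score": 0.995, "start": 0, "end": 20},
    {"entity_group": "LOC", "score": 0.505, "start": 43, "end": 53},
    {"entity_group": "ORG", "score": 0.902, "start": 54, "end": 69},
    {"entity_group": "LOC", "score": 0.967, "start": 73, "end": 85},
]

# Slice the original string instead of relying on the re-joined 'word' field.
surface = [(text[e["start"]:e["end"]], e["entity_group"]) for e in entities]

# Hypothetical threshold: flag low-confidence spans for manual review.
uncertain = [text[e["start"]:e["end"]] for e in entities if e["score"] < 0.6]

print(surface)
print(uncertain)
```

Slicing by offsets restores "Chris Callison-Burch" exactly as written, and the threshold singles out "University", the span the model was least sure about.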

### **Evaluating NER with SeqEval**

A widely used library for evaluating a model on the CoNLL task is [SeqEval](https://github.com/chakki-works/seqeval). We'll install it and use it to score our model. Unlike our token-level Macro-F1, SeqEval scores whole entities: a prediction only counts as correct if the entity type and every `B-` and `I-` tag of the span match exactly. Its `overall_f1` is also micro-averaged over all entities rather than macro-averaged over classes, so the two numbers are related but not directly comparable.
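
The exact-match rule can be illustrated with a small sketch. This is a simplified stand-in written for this notebook, not SeqEval's actual code: `extract_spans` is a hypothetical helper, and it ignores stray `I-` tags that do not follow a matching `B-` tag.

```python
def extract_spans(tags):
    """Return the set of (start, end, type) spans in a BIO tag sequence."""
    spans, start, ent_type = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):   # sentinel closes the last span
        inside = tag.startswith("I-") and tag[2:] == ent_type
        if start is not None and not inside:
            spans.add((start, i, ent_type))        # the open span ends here
            start, ent_type = None, None
        if tag.startswith("B-"):
            start, ent_type = i, tag[2:]
    return spans

gold = ["B-ORG", "I-ORG", "O", "B-LOC"]
pred = ["B-ORG", "O", "O", "B-LOC"]     # the ORG entity is cut short

gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
correct = len(gold_spans & pred_spans)  # spans must match exactly

precision = correct / len(pred_spans)
recall = correct / len(gold_spans)
entity_f1 = 2 * precision * recall / (precision + recall)
token_accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)

print(f"token accuracy={token_accuracy:.2f}  entity F1={entity_f1:.2f}")
```

Three of the four tokens are tagged correctly, yet only one of the two entities is recovered in full, so the entity-level F1 is noticeably lower than the token accuracy.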

In [None]:
!pip install seqeval



In [None]:
task_evaluator = evaluator("token-classification")

eval_results = task_evaluator.compute(
    model_or_pipeline=nerBERT,
    data=eval_conll,
    input_column="tokens",
    label_column="ner_tags",
    tokenizer=tokenizer,
    metric=evaluate.load("seqeval")
)

for key in eval_results:
  print(f"{key}: {eval_results[key]}")


LOC: {'precision': np.float64(0.9047619047619048), 'recall': np.float64(0.890625), 'f1': np.float64(0.8976377952755906), 'number': np.int64(576)}
MISC: {'precision': np.float64(0.6588235294117647), 'recall': np.float64(0.7943262411347518), 'f1': np.float64(0.7202572347266881), 'number': np.int64(282)}
ORG: {'precision': np.float64(0.8072916666666666), 'recall': np.float64(0.8179419525065963), 'f1': np.float64(0.8125819134993445), 'number': np.int64(379)}
PER: {'precision': np.float64(0.9538188277087034), 'recall': np.float64(0.9640933572710951), 'f1': np.float64(0.9589285714285715), 'number': np.int64(557)}
overall_precision: 0.8543689320388349
overall_recall: 0.882943143812709
overall_f1: 0.8684210526315789
overall_accuracy: 0.9791881443298969
total_time_in_seconds: 70.54578304999995
samples_per_second: 14.17519172324221
latency_in_seconds: 0.07054578304999995
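
The aggregate scores can be cross-checked with a little arithmetic on the numbers printed above. The sketch below recomputes `overall_f1` as the harmonic mean of the overall precision and recall, and compares it with the unweighted mean of the four per-type F1 scores. All values are copied from the output; the variable names are our own.

```python
# Values copied from the SeqEval output above.
per_type = {
    "LOC":  {"f1": 0.8976377952755906, "recall": 0.890625,           "number": 576},
    "MISC": {"f1": 0.7202572347266881, "recall": 0.7943262411347518, "number": 282},
    "ORG":  {"f1": 0.8125819134993445, "recall": 0.8179419525065963, "number": 379},
    "PER":  {"f1": 0.9589285714285715, "recall": 0.9640933572710951, "number": 557},
}
overall_precision = 0.8543689320388349
overall_recall = 0.882943143812709

# Harmonic mean of the overall precision and recall.
micro_f1 = 2 * overall_precision * overall_recall / (overall_precision + overall_recall)

# Unweighted mean of the per-type F1 scores.
macro_f1 = sum(v["f1"] for v in per_type.values()) / len(per_type)

# Weighting each type's recall by its entity count recovers the overall recall.
total = sum(v["number"] for v in per_type.values())
weighted_recall = sum(v["recall"] * v["number"] for v in per_type.values()) / total

print(f"micro F1={micro_f1:.4f}  macro F1={macro_f1:.4f}  weighted recall={weighted_recall:.4f}")
```

The recomputed harmonic mean agrees with the reported `overall_f1` of about 0.868, while the unweighted per-type mean is lower at about 0.847, mostly because the weaker `MISC` type counts as much as the larger types in that average.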
