# Text Analytics Lab 5: Pretrained Language Models

This notebook introduces the Transformers library from HuggingFace, which we can use to access a wide range of pretrained language models. The sections are:

   1. **Introducing Transformers:** This section introduces the Transformers library from HuggingFace, showing you how to use it to obtain contextualised embeddings from pretrained transformer models.
   1. **Transformers for Text Classification:** Here we show you how to construct a classifier using Transformers.
   1. **OPTIONAL: More on Transformers:** Some pointers to other materials if you want to learn more about transformers, e.g., if using them in your summer project. 

Example code for all the tasks has been tested on a four-year-old MacBook Pro, and the longest training process took under 10 minutes. If you find that the code takes too long to run on your own machine, a good alternative is to use [Google Colab](https://colab.research.google.com/), Amazon Sagemaker Studio, or the lab machines on campus.

## Learning Outcomes

These sections will contain tutorial-like instructions, as you have seen in previous text analytics labs. On completing these sections, the intended learning outcomes are that you will be able to...
1. Use pretrained transformers to obtain contextualised word and sentence embeddings.
1. Apply a pretrained QA model to a new dataset. 
1. Construct classifiers with pretrained transformers. 
1. Find documentation on pretrained models in the Transformers library.

In [None]:
import numpy as np
import torch 
from datasets import load_dataset

cache_dir = "./data_cache"

# 1. Introducing Transformers 

HuggingFace is a company that has developed an open source library for loading pretrained transformer models. They also distribute many models that have been pretrained using language modelling tasks, or fine-tuned to specific downstream NLP tasks. It is currently one of the most widely used libraries for building NLP models on top of large, deep neural networks. This is especially useful for tasks where simpler, feature-based methods or smaller LSTM models do not perform well enough, for example, when complex processing of syntax and semantics is required (natural language 'understanding').

The larger models often give great performance, but the trade-off is that they require a lot of memory and compute. When building a model for a new dataset, it is a good idea to compare faster models with transformers to determine whether the performance/cost trade-off is worth it on that particular dataset. 

Let's start by looking at two key types of object in the transformers library: models and tokenizers.

## 1.1. Models

The neural network models available in the Transformers library are accessed through wrapper classes such as `AutoModel`. If we want to load a pretrained model, we can simply pass its name to the `from_pretrained` function, and the pretrained model weights will be downloaded from HuggingFace and a neural network model will be created with those weights. For example:

In [None]:
from transformers import AutoModel

model = AutoModel.from_pretrained("huawei-noah/TinyBERT_General_4L_312D") 

This code loads the TinyBERT model, which is a compressed version of BERT. It has about 14.5 million parameters, compared to the standard version of BERT, 'BERT-base', which has 110 million parameters. While TinyBERT will not perform as well as larger models, we will use it for this notebook to save memory and computation costs. See [documentation here](https://huggingface.co/huawei-noah/TinyBERT_General_4L_312D).


The same functions can be used to load other models from HuggingFace's repository simply by changing the model's name. Take a look at [the Models page](https://huggingface.co/models) to see what there is on offer. Do you recognise any of the models' names?

## 1.2. Tokenizers

Before we can apply a model to some text, we need to create a Tokenizer object. In Transformers, Tokenizer objects convert raw text to a sequence of numbers. First, the tokenizer performs tokenization, then it maps each token to its numerical ID. There are lots of different tokenizers that we can use to preprocess text. If we are loading a pretrained model, we will need to choose the tokenizer that corresponds to that model.

**TO-DO 1:** Why is it necessary to choose a matching tokenizer for a pretrained model?

We can load the right tokenizer as follows, in the same way we loaded the model itself:

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("huawei-noah/TinyBERT_General_4L_312D")

Let's see what the TinyBERT tokenizer does to an example sentence:

In [None]:
sentence = "The transformer architecture has transformed the field of NLP."

tokens = tokenizer.tokenize(sentence)
print(tokens)

Let's compare with the NLTK tokenizer we have seen before:

In [None]:
from nltk.tokenize import word_tokenize

nltk_tokens = word_tokenize(sentence)
print(nltk_tokens)

While NLTK keeps whole words as tokens, the BERT tokenizer splits some words into sub-words and inserts some special characters into the tokens. Splitting is applied to words with low frequency in the training set, such as 'transformer'. 

Rather than following a set of hand-crafted rules, the BERT tokenizer (a WordPiece tokenizer) learns its vocabulary from a large dataset. It starts by adding individual characters to its vocabulary. Then, it repeatedly merges a pair of existing tokens into a new, longer token and adds it to the vocabulary, until the desired vocabulary size is reached. Byte-pair encoding picks the most frequent pair at each step; WordPiece instead picks the pair whose merge most improves the likelihood of the training data, which favours pairs that occur together more often than their individual frequencies would suggest. When tokenizing a document, words that are not in the vocabulary are broken into the longest matching sub-word tokens.
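
To make the vocabulary-learning loop more concrete, here is a minimal sketch in plain Python. It is a count-based toy in the style of byte-pair encoding (merge the most frequent adjacent pair), not the actual procedure used to train BERT's tokenizer, which scores candidate pairs differently. The function name `learn_merges` and the word counts are made up for illustration.

```python
from collections import Counter

def learn_merges(word_counts, num_merges):
    """Learn sub-word merges from a {word: count} dictionary (count-based toy)."""
    # Start with every word split into single characters.
    corpus = {tuple(word): count for word, count in word_counts.items()}
    learned_vocab = {ch for symbols in corpus for ch in symbols}
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair of tokens occurs in the corpus.
        pair_counts = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break  # every word is already a single token
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        learned_vocab.add("".join(best))
        # Rewrite the corpus with the chosen pair merged into one token.
        new_corpus = {}
        for symbols, count in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_corpus[key] = new_corpus.get(key, 0) + count
        corpus = new_corpus
    return learned_vocab, merges

# Invented word frequencies, purely for illustration.
toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
learned_vocab, merges = learn_merges(toy_counts, num_merges=3)
print(merges)
print(sorted(learned_vocab))
```

With these toy counts, the frequent ending shared by "newest" and "widest" is merged into a single token after two steps, which is the kind of reusable sub-word unit that a learned tokenizer tends to discover.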

**TO-DO 2:** What is the benefit of splitting some words into sub-word tokens? 

After tokenization, the Tokenizer object can also map the tokens to their IDs (indexes in the vocabulary), so that we can pass them as input to a neural network:

In [None]:
ids = tokenizer.convert_tokens_to_ids(tokens)

print(ids)
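
The two steps above (splitting words into sub-words, then looking up IDs) can be sketched with a toy vocabulary. This is a simplified illustration that assumes greedy longest-match-first splitting with a `##` prefix on continuation pieces. The function `wordpiece_split`, the vocabulary and its IDs are all invented here and do not match TinyBERT's real vocabulary.

```python
def wordpiece_split(word, vocab, unk="[UNK]"):
    """Split one word into sub-word tokens by greedy longest-match-first."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a '##' prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Made-up vocabulary; the IDs are just list positions.
toy_vocab = ["[UNK]", "the", "transform", "##er", "##ed", "field"]
token_to_id = {tok: i for i, tok in enumerate(toy_vocab)}

toy_tokens = []
for word in "the transformer transformed the field".split():
    toy_tokens.extend(wordpiece_split(word, token_to_id))
toy_ids = [token_to_id[tok] for tok in toy_tokens]

print(toy_tokens)
print(toy_ids)
```

Both "transformer" and "transformed" share the piece "transform" here, so they map to overlapping ID sequences even though neither full word is in the toy vocabulary.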

Let's load up a dataset that we can use for our experiments later on. We will use the [TweetEval hate speech](https://huggingface.co/datasets/tweet_eval) dataset to train and test a classifier. The task is to classify each tweet as either 0: non-hate or 1: hate.

In [None]:
from datasets import load_dataset

cache_dir = './data_cache/'

# Load up the hate speech dataset...
train_dataset = load_dataset(
    "tweet_eval",
    name="hate",
    split="train",
    cache_dir=cache_dir,
)
print(f"Training dataset with {len(train_dataset)} instances loaded")

val_dataset = load_dataset(
    "tweet_eval",
    name="hate",
    split="validation",
    cache_dir=cache_dir,
)
print(f"Validation dataset with {len(val_dataset)} instances loaded")

test_dataset = load_dataset(
    "tweet_eval",
    name="hate",
    split="test",
    cache_dir=cache_dir,
)
print(f"Test dataset with {len(test_dataset)} instances loaded")

Now, let's apply our tokenizer to the dataset, using the `map` function to run it on all samples:

In [None]:
# tokenize...
tokenizer = AutoTokenizer.from_pretrained("huawei-noah/TinyBERT_General_4L_312D")

def tokenize_function(dataset):
    model_inputs = tokenizer(dataset['text'], padding="max_length", max_length=128, truncation=True)
    return model_inputs

train_dataset = train_dataset.map(tokenize_function, batched=True)
val_dataset = val_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)

## 1.3. Contextualised Embeddings

Now that we have a sequence of tokens, we are almost ready to process the sequence using the pretrained model. 

Our model takes as input a PyTorch `tensor` (a multi-dimensional array). Here, we need a two-dimensional matrix, where each row is a sequence of input tokens corresponding to a single sentence or document. Let's convert our list of IDs to a 2-D tensor with a single row:

In [None]:
ids_tensor = torch.tensor([ids])

print(ids_tensor)

Now we can process the sequence using our model. The pretrained transformer model maps the sequence of input IDs to a sequence of output vectors, which are contextualised word embeddings. The hidden state values produced in the last hidden layer of the model are used as the contextualised embeddings:

In [None]:
model_outputs = model(ids_tensor)
print('The complete model outputs: ')
print(model_outputs)

print()
print('The last hidden states for the first (and only) sequence in the batch, one embedding per token: ')
embeddings = model_outputs['last_hidden_state'][0]
print(embeddings)

We can retrieve the embedding vector for "transform" like this ("transform" is the second token in the sequence):

In [None]:
emb = embeddings[1]  # get second embedding in the sequence

# convert it to a numpy array so we can perform various operations on it later on
emb = emb.detach().numpy()

print(emb)
print(f'The TinyBERT embeddings have {emb.shape[0]} dimensions.')

**TO-DO 3:** Retrieve the embedding for "architecture".

In [None]:
# WRITE YOUR ANSWER HERE


Sentences and documents usually have varying lengths. So, to put multiple sentences into a single tensor, we need to pad the sequences up to a maximum length. Luckily, the tokenizer class takes care of this for us. When we pass in a list of sentences, the tokenizer creates a matrix, where each row is a sequence of the same length:

In [None]:
sentences = [
    "I can book tickets for the concert next week.",
    "Many readers find the first book of A Tale of Two Cities to be confusing.",
    "She opened the book to page 37 and began to read aloud.",
    "The police wanted to book him for driving too fast.",
    "I can reserve tickets for the concert next week."
]

model_inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")  

print(model_inputs)

`model_inputs` is a dictionary containing three objects:
 * The `input_ids` are the list of token IDs in the input sequences. 
 * The `attention_mask` records which tokens are special padding tokens and which are real tokens. Tokens with a 0 in the attention mask will be ignored.
 * `token_type_ids` is needed when two sequences are passed together as input to the model for tasks such as next sentence prediction that involve comparing two sentences. Here, each input is a single sentence, so we have only one type of token in the output above. 
 
**TO-DO 4:** Look at the outputs above. Which value do the special padding tokens have?

ANSWER: 

---

Notice that the input_ids all start with the same token ID, 101, even though they have different first words. They also have token ID 102 before the padding tokens. This is because the tokenizer inserts two special tokens, which are used in some applications of BERT. 101 is the '[CLS]' token, which is a dummy token whose embedding can be trained to represent the whole sequence. The [CLS] token's embedding can then be used as input to a text classifier to classify a sentence or document. Token 102 is '[SEP]', which can be used to separate multiple input sequences in a single example. This is needed in tasks where multiple pieces of text are provided as input, e.g., to build a classifier that can determine whether two sentences contradict each other.
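
To see how special tokens, padding and the attention mask fit together, here is a minimal sketch in plain Python. The helper `encode_batch` is hypothetical, and every ID in it (including the special-token and padding IDs) is invented, so it says nothing about the values TinyBERT's tokenizer really uses.

```python
def encode_batch(id_sequences, cls_id, sep_id, pad_id):
    """Wrap each sequence in [CLS] ... [SEP], then pad to the longest length."""
    wrapped = [[cls_id] + seq + [sep_id] for seq in id_sequences]
    max_len = max(len(seq) for seq in wrapped)
    input_ids, attention_mask = [], []
    for seq in wrapped:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding that the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# All IDs below are invented for illustration.
toy_batch = encode_batch([[7, 8, 9], [7]], cls_id=1, sep_id=2, pad_id=99)
for name, rows in toy_batch.items():
    print(name, rows)
```

Every row ends up with the same length, which is what allows a batch of sentences with different lengths to be stored in a single 2-D tensor.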

We can now pass all of the model inputs to the model to produce a set of contextualised embeddings:

In [None]:
# model_inputs is a dictionary, so to provide the arguments to model(), 
# we use the double star to unpack the dictionary so that each key in the dictionary is
# an argument to model() and each value is the value of the argument. 
model_outputs = model(**model_inputs) 
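
If the double-star syntax is new to you, this small sketch shows that unpacking a dictionary is equivalent to writing out the keyword arguments by hand. The function `toy_model` is a made-up stand-in and not part of the Transformers library.

```python
def toy_model(input_ids, attention_mask=None, token_type_ids=None):
    """Hypothetical stand-in for model(); reports which arguments it received."""
    received = {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "token_type_ids": token_type_ids,
    }
    return {name for name, value in received.items() if value is not None}

toy_inputs = {"input_ids": [[1, 7, 2]], "attention_mask": [[1, 1, 1]]}

# These two calls pass exactly the same arguments.
unpacked = toy_model(**toy_inputs)
explicit = toy_model(
    input_ids=toy_inputs["input_ids"],
    attention_mask=toy_inputs["attention_mask"],
)
print(unpacked == explicit)
```

The advantage of unpacking is that the same line of code keeps working whichever keys the tokenizer returns for a given model.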

**TO-DO 5:** The first four example sentences above all contain the word "book", and the last example contains "reserve". Obtain a list of contextualised word embeddings for 'book' and 'reserve' in the example sentences using our model. 

Hint: you may need to convert tensors to numpy arrays. Don't forget that the sequence of embeddings contains [CLS] and [SEP] embeddings. 

In [None]:
book_tok_id = tokenizer.convert_tokens_to_ids(['book'])

#WRITE YOUR OWN CODE HERE



**TO-DO 6:** Compute the similarities between these embeddings in the cell below, and show the results. How do the similarities relate to the meaning of the word "book" or "reserve" in each sentence?

In [None]:
from scipy.spatial.distance import cdist  # you may find this function useful for computing distances

### WRITE YOUR ANSWER HERE


###

for sen in sentences:
    print(sen)
    
print()
print("The table below shows similarities between words according to their contextualised embeddings:") 
print(np.round(similarities, decimals=2))

**TO-DO 7:** Use the BERT model to obtain an embedding of each complete sentence from the five sentences listed above. Show the similarities and discuss what you see. 


In [None]:
### WRITE YOUR ANSWER HERE


###

print(similarities)
# Let's find the most similar sentences...
similarities[range(5), range(5)] = 0  # ignore the similarity between a sentence and itself
most_similar = np.argmax(similarities[-1])  # the last row holds the similarities to the last sentence

print(f'The most similar sentence to "{sentences[-1]}" is "{sentences[most_similar]}", according to TinyBERT.')

# 2. Transformer-based Text Classifiers

In this section, you will learn how to construct and train a text classifier on top of a pretrained transformer. 

To begin you will need to instantiate a suitable classifier model.

**TO-DO 8:** Find an AutoModel class that constructs a text classifier from the pretrained TinyBERT model, "huawei-noah/TinyBERT_General_4L_312D". Create the `model` object in the cell below using this class. Refer to the [Hugging Face documentation for auto models](https://huggingface.co/transformers/v3.0.2/model_doc/auto.html) as needed. 

In [None]:
### WRITE YOUR ANSWER HERE ###


Typically, sequence classification models attach a linear layer (classification head) to the outputs of the transformer. The CLS token's embedding is passed into the classification head, which makes a prediction over classes. We can see a similar structure in most neural network models. Our original text classifier from the first notebook used a fully-connected layer to produce a hidden representation of the whole sentence, whereas now we are replacing that hidden layer with a complete BERT transformer, which produces a sequence of embeddings. 
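
The idea of a classification head can be sketched in a few lines of PyTorch. This is a simplified illustration that uses random numbers in place of real transformer outputs; the classes in the Transformers library may add further layers (such as pooling or dropout) between the [CLS] embedding and the linear layer. The variable names here are made up for this sketch.

```python
import torch
from torch import nn

hidden_size, num_classes = 312, 2  # 312 is the TinyBERT embedding size; 2 classes as in our task
batch_size, seq_len = 3, 10

# Random numbers standing in for the transformer's last hidden state.
fake_hidden_states = torch.randn(batch_size, seq_len, hidden_size)

head = nn.Linear(hidden_size, num_classes)    # the classification head
cls_embeddings = fake_hidden_states[:, 0, :]  # position 0 holds the [CLS] embedding
toy_logits = head(cls_embeddings)             # one score per class for each example

print(toy_logits.shape)
```

Only the `[CLS]` position is passed to the head, so the output has one row per example and one column per class, regardless of the sequence length.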

<img src="neural_text_classifier_smaller.png" alt="Neural text classifier diagram from the slides in lecture 8.1" width="400px"/>


We will need to train our model before we can use it (you may see a message in the output of the last cell telling you this). 

**TO-DO 9:** The classifier is built on top of a pretrained TinyBERT transformer, which was pretrained using masked language modelling and next sentence prediction. Why does the classifier require further training to provide accurate hate speech classifications?

Next, let's learn how to train our model. For some tasks it is not necessary to update the weights in the BERT model itself, so we can freeze them to save a lot of computation time. We can do this as follows. Since our pretrained model is based on BERT, we can access the weights inside BERT through the variable `model.bert`.

In [None]:
for param in model.bert.parameters():
    param.requires_grad = False
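
To check what freezing does, we can count the trainable parameters before and after. The sketch below uses a tiny made-up module in place of the real classifier (the names `toy`, `encoder` and `classifier` are hypothetical), but the same counting function should work on any PyTorch model.

```python
from torch import nn

toy = nn.ModuleDict({
    "encoder": nn.Linear(8, 4),     # stand-in for the pretrained transformer
    "classifier": nn.Linear(4, 2),  # stand-in for the classification head
})

def count_trainable(module):
    """Number of parameters that will receive gradient updates."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

before = count_trainable(toy)
for param in toy["encoder"].parameters():
    param.requires_grad = False  # freeze the encoder only
after = count_trainable(toy)

print(f"trainable before freezing: {before}, after freezing: {after}")
```

After freezing, only the head's weights and biases remain trainable, which is why training becomes much cheaper.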

To train our model, we can make use of the Trainer class, which encapsulates a lot of the complex training steps and avoids the need to define our own training function, as we did in the previous notebook (we don't need to write our own `train_nn`).

First, define some settings for the training process. This is where we can set training hyperparameters:

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="transformer_checkpoints",  # the directory where model weights will be saved at certain points during training (checkpoints)
    num_train_epochs=3, # A sensible and sufficient number to use for the to-dos below
    per_device_train_batch_size=16,  # you can decrease this if memory usage is too high while training
    logging_steps=50,  # how often to print progress during training
)
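
To get a feel for how these settings translate into training time, we can estimate the number of optimisation steps. This is a rough sketch that assumes a single device and no gradient accumulation; the dataset size of 1000 is a made-up example, and `total_training_steps` is a hypothetical helper rather than a library function.

```python
import math

def total_training_steps(num_examples, batch_size, num_epochs):
    """Estimate optimisation steps, assuming one device and no gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)  # the last batch may be smaller
    return steps_per_epoch * num_epochs

n_steps = total_training_steps(num_examples=1000, batch_size=16, num_epochs=3)
n_logs = n_steps // 50  # with logging_steps=50, progress is reported once every 50 steps

print(f"{n_steps} steps in total, with about {n_logs} progress reports")
```

Halving the batch size roughly doubles the number of steps, so reducing memory usage in this way usually comes at the cost of more update steps per epoch.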

Next, create a trainer object:

In [None]:
from transformers import Trainer
from torch import nn

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

To train the model, you will need to call `trainer.train()`.

Once the model is trained, we can obtain predictions using the function below. Notice that it is simpler than obtaining the spans for QA -- we simply get the logits for each tweet in the test set, then apply argmax over the classes to find the most probable class for each tweet:

In [None]:
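def predict_nn(trained_model, test_dataset):

    # Switch off dropout
    trained_model.eval()

    # Put the inputs on the same device as the model (the Trainer may have moved it to a GPU)
    device = next(trained_model.parameters()).device
    input_ids = torch.tensor(test_dataset["input_ids"]).to(device)
    attention_mask = torch.tensor(test_dataset["attention_mask"]).to(device)

    # Pass the required items from the dataset to the model; gradients are not needed for prediction
    with torch.no_grad():
        output = trained_model(attention_mask=attention_mask, input_ids=input_ids)

    # the output dictionary contains logits, which are the unnormalised scores for each class for each example:
    pred_labs = np.argmax(output["logits"].cpu().numpy(), axis=1)

    return pred_labs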
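
The argmax step is easy to check by hand on a few made-up logits. The sketch below uses plain Python lists instead of tensors; `argmax_rows` is a hypothetical helper written for this illustration.

```python
def argmax_rows(logit_rows):
    """Return the index of the largest score in each row (first index wins ties)."""
    return [max(range(len(row)), key=row.__getitem__) for row in logit_rows]

# Invented logits for three examples and two classes.
toy_logit_rows = [[2.1, -0.3], [-1.0, 0.4], [0.2, 0.2]]
toy_predictions = argmax_rows(toy_logit_rows)
print(toy_predictions)
```

Note that the logits do not need to be converted to probabilities first: the class with the largest logit is also the class with the largest probability.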

You should now have all the bits and pieces needed to build and train a text classifier. Let's put them all together...

**TO-DO 10:** Train and test your sequence classifier on the [TweetEval hate speech](https://huggingface.co/datasets/tweet_eval) dataset using a pretrained transformer. Choose a suitable evaluation metric and compare the result with the simpler neural network classifiers from the previous lab.

You may wish to 'unfreeze' the BERT model to see if this boosts performance, but note that it will require a lot more computation time to fine-tune the whole transformer model. Increasing the number of epochs could also boost performance, but again requires much more computation time.

In [None]:
# WRITE YOUR CODE HERE

**TO-DO 11:** What kinds of _transfer_ did your classifier use and what benefit do they provide?

The model currently outputs logits rather than probabilities, but probabilities are much more useful for most applications of a text classifier. To compute the probability of each class for a test sentence, we need to pass the logits through the softmax function. Complete the code below to obtain a probability distribution for a sentence of your choice.

In [None]:
sentences = ["A very joyful and happy day"]

model.eval()
output = model(**tokenizer(sentences,  max_length=128, padding="max_length", truncation=True, return_tensors="pt"))
        
# the output dictionary contains logits, which are the unnormalised scores for each class for each example:
logits = output["logits"]

#### WRITE YOUR ANSWER HERE   

####

print('The probability of each class is:')
classes = ['non-hate', 'hate']
for c, category in enumerate(classes):
    print(f'probability of {category} = {probs[0][c].detach().numpy()}')
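
If you would like to check the arithmetic behind softmax by hand, here is a minimal sketch in plain Python on made-up logits. It works on a single list of scores rather than on PyTorch tensors, so it illustrates the formula rather than replacing the tensor-based code in the cell above.

```python
import math

def softmax(scores):
    """Turn a list of unnormalised scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

toy_probs = softmax([2.0, 0.0])  # invented logits for two classes
print([round(p, 3) for p in toy_probs])
print(sum(toy_probs))
```

The ordering of the classes is unchanged by softmax; only the scale changes, so that the values can be read as a probability distribution.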

# 3. OPTIONAL: More on Transformers

There are many great resources out there to show you how to use this kind of model in practice:
* An extensive online course is provided by HuggingFace: https://huggingface.co/course/chapter1/1. The pages linked from the HuggingFace course website have an 'open in Colab' button on the top right. You can open the notebook and run it on a Google server there to access GPUs.
* Chapters that may be particularly useful: 
   * Transformers, what can they do? https://huggingface.co/course/chapter1/3?fw=pt
   * Using Transformers: https://huggingface.co/course/chapter2/2?fw=pt
* They provide information on fine-tuning the transformer models here: https://huggingface.co/docs/transformers/training. Fine-tuning updates the weights inside the pretrained network and requires extensive GPU or TPU computing. 
* Text Generation: https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/02_how_to_generate.ipynb. This topic goes way beyond data analytics on this unit and shows you another powerful feature of pretrained transformers.


