# Natural Language Processing Applications

Author: Stef Garasto. Released under the [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) licence.

The second part is a slightly modified version of this [HuggingFace tutorial](https://huggingface.co/course/chapter7/7?fw=pt#the-squad-dataset) (also released under the [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) licence).


# Introduction

This notebook contains the tutorial for the class on natural language processing with deep learning as part of the module "Artificial Intelligence Applications".

There are two goals for this tutorial:

1. Learn how to use BERT (or similar) embeddings to find similar words.
2. Fine-tune a question-answering system.



## Prerequisites
To execute the code, click on the corresponding cell and press SHIFT + ENTER or the little "play" button on the left.

You can set up this notebook to run on a CPU (the default) or on a GPU. To change from CPU to GPU, go to the menu bar above and click 'Runtime --> Change runtime type'. Then select 'GPU' from the dropdown box under 'Hardware accelerator' (switch back to 'None' to use a CPU).

You can also download this notebook to run it locally. However, note that (on top of Python 3+) these are the prerequisites:

Packages needed:
*   Scikit-learn
*   Jupyter notebook
*   Pandas
*   Numpy
*   Matplotlib
*   Pytorch
*   transformers
*   wget


## Datasets

We will use two main datasets in this tutorial:

1. [The ESCO classification](http://ec.europa.eu/esco) of the European Commission.
2. The [Stanford Question Answering Dataset (SQuAD)](https://rajpurkar.github.io/SQuAD-explorer/) dataset.

More details on each dataset can be found in the sections below.

# Setting things up

In [None]:
# variable to define whether the notebook is being run locally rather than on Colab
# by default, the notebook is run on Colab - change the flag to True if you're running locally.
LOCALRUN = False


In [None]:
# install some packages that we'll need that are not accessible by default in Colab
if LOCALRUN:
    print("Make sure the packages 'transformers' and 'wget' have been installed with 'pip install'")
else:
    !pip install transformers[sentencepiece]
    !pip install accelerate
    !apt install git-lfs
    print('---> Installed transformers')
    !pip install wget
    print('---> Installed wget')

In [None]:
# define imports

import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import random
import time
import torch
import wget
import collections
from tqdm.auto import tqdm

if not LOCALRUN:
    %matplotlib inline



SET UP PYTORCH USE

[PyTorch](https://pytorch.org/) is a deep learning library and an alternative to Keras.

With PyTorch, you can specify explicitly whether to train on GPU or on CPU by defining a "device" variable. Then you can "move" a model and a dataset to that device's memory so that they can be accessed more efficiently. You move a model to a device by calling `model.to(device)`.

In [None]:

# Define where we want to train the model
# If there's a GPU available...
USE_GPU = torch.cuda.is_available()
if USE_GPU:    

    # Tell PyTorch to use the GPU.    
    device = torch.device("cuda")

    print('There are %d GPU(s) available.' % torch.cuda.device_count())

    print('We will use the GPU:', torch.cuda.get_device_name(0))

# If not...
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")
    

In [None]:
# Set the seed value all over the place to make this reproducible.
seed_val = 42

random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)


In [None]:
# Just FYI, in case you need to clone GitHub repos from within Google Colab
#!git clone <name-of the repo>

## Transformer library imports


In [None]:
# huggingface's transformers import
from transformers import AutoTokenizer, AutoModel # we've seen these in week 10 
from transformers import AutoModelForQuestionAnswering
from transformers import TrainingArguments, Trainer

# Finding synonyms using embeddings

Here, we will turn each word in the text we are working with into a word embedding using the pre-trained BERT. We can then use the embeddings directly for any task we want to do. Using the embeddings this way corresponds to using BERT (or similar models) with all the layers "frozen" (feature-extraction), as we saw in week 10.

## Load the dataset

We're going to use the [ESCO dataset](https://ec.europa.eu/esco/), a dataset on skills used in the labour market. ESCO is the multilingual classification of European Skills, Competences, Qualifications and Occupations. It was created by the European Commission.

The ESCO classification identifies and categorises skills, competences, qualifications and occupations relevant for the EU labour market and education and training. It systematically shows the relationships between the different concepts.

The part of interest to us is the list of thousands of labour market skills (together with their descriptions, if you're interested).

I've left a copy of it accessible from my Google Drive, but it won't stay there forever after the end of the module, so I'd suggest downloading a local copy from Moodle!

In [None]:
# ESCO datasets

print('Downloading dataset...')

# Download the file using the URL for the dataset zip file. 
# Replace with your own google file ID if you have stored a version in your own google drive
# The ID is everything that comes after 'id='
if LOCALRUN:
    esco_url = 'https://docs.google.com/uc?export=download&id=1qu6_Hmed2mtSt9gLssLL3CjMh7ruXd4_'
    if not os.path.exists('./esco_public_1.1.zip'):
        wget.download(esco_url, './esco_public_1.1.zip')
else:
    !wget --no-check-certificate 'https://docs.google.com/uc?export=download&id=1qu6_Hmed2mtSt9gLssLL3CjMh7ruXd4_' -O esco_public_1.1.zip
    
if LOCALRUN:
    print(('The dataset has been downloaded in a file called esco_public_1.1.zip within '
          'the folder where this script lives.'))
    print("Please go to that folder and unzip it in the same folder (not in a subfolder). ")
    print("The process itself should create a new subfolder called 'v1.0.8'.")
else:
    # Unzip the dataset (if we haven't already)
    if not os.path.exists('./v1.0.8/'):
        !unzip esco_public_1.1.zip


In [None]:
# Load the dataset into a pandas dataframe.
esco_df = pd.read_csv("./v1.0.8/skills_en.csv", usecols = ['preferredLabel','description'])

# Report the number of skills.
print('Number of skills in the ESCO dataset is: {:,}\n'.format(esco_df.shape[0]))

# Display 10 random rows from the data.
esco_df.sample(10, random_state = seed_val)


There are two columns: 'preferredLabel', with the name of the skill, and 'description', which is an explanation of the skill.

## Load pre-trained BERT
Load both the tokenizer and the model

In [None]:
# transformers imports
from transformers import AutoTokenizer
from transformers import AutoModel 

In [None]:
# load tokenizer
feat_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', # we use the uncased BERT Base model (12 encoders)
                                          do_lower_case = True #everything lower case
                                          )

# load BERT model
bert_model = AutoModel.from_pretrained('bert-base-uncased',
                                       output_hidden_states=True # tell the model we want all the hidden states
                                       )

# set the model in evaluation mode
bert_model.eval()

# move model to GPU if needed
if USE_GPU:
    bert_model.cuda()

In [None]:
# show example skill
text_skill = esco_df.iloc[1000].preferredLabel
text_skill


In [None]:
# Let's try once to pass the test skill through the BERT model, what do we get?
# First, tokenize to put the input in a format that BERT understands
feat_encoded_dict = feat_tokenizer(text_skill,
                         add_special_tokens = True, # Add '[CLS]' and '[SEP]'
                         return_tensors = 'pt',     # Return pytorch tensors.
                         padding = True
                         )

token_ids = feat_encoded_dict['input_ids']
print(f'The number of tokens (including 2 special ones) is: {token_ids.shape[1]}')

# then, pass the input through the model
with torch.no_grad(): # no_grad because we don't want to train BERT
    bert_output = bert_model(token_ids.to(device), 
                           token_type_ids = None #no need for this since it's a single sentence
                           )

print(f'There are {len(bert_output)} items in the output')

What's BERT's output made of?

Well, the exact output depends on the parameters we used when loading the model with `from_pretrained`. In this case, we used `output_hidden_states=True` and `output_attentions=False` (the default), so the third item of the output is the sequence of all the hidden states from all the layers of BERT. (The first item is the output of the last layer and the second item is the pooled output, which is computed from the final hidden state of the [CLS] token.)


In [None]:
# Let's explore the full hidden states output
hidden_states = bert_output[2]
print(f'The full hidden states output is a {type(hidden_states)}')
print(f'It is a list of {len(hidden_states)}, each with {len(hidden_states[0].shape)} dimensions (size: {hidden_states[0].shape})')

# The list has 13 elements because BERT Base has 13 layers (1 embedding + 12 encoders)
# Each element of the list is the full output of that layer with dimensions: #samples x #tokens x #features (768)
# For us, #samples = 1 always because we'll apply BERT to one sentence at a time.

So, basically we have 13 different embeddings for each token, and 13 x #tokens different embeddings for the whole sentence. How do we combine them?

To obtain one embedding per token, the authors of the BERT paper tried different strategies: the first-layer embeddings, the embeddings from the second-to-last and last hidden layer, the weighted sum and the concatenation of the last four hidden layers and the weighted sum of all 12 layers.

They obtain good results with the weighted sum and the concatenation of the last four hidden layers. Therefore, here we'll adopt the approach of **averaging the output of the last four hidden layers**. However, the best strategy is application dependent.


## Define strategies to combine the outputs from BERT

As mentioned, we use the following strategies:

1. We average the output of the last four hidden layers to obtain one embedding per token.
2. We average all the token embeddings to obtain one embedding per sentence.

There are other options available. For example, I'd recommend checking out [sentence-transformers](https://pypi.org/project/sentence-transformers/).
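
To make the two averaging steps concrete, here is a minimal sketch on made-up numbers. It uses plain Python lists rather than real BERT outputs; the names `toy_hidden_states` and `mean_vectors`, and the tiny sizes (5 layers, 4 tokens, 2 features), are assumptions chosen only so that the result can be checked by hand.

```python
def mean_vectors(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(values) / n for values in zip(*vectors)]

# Toy "hidden states": 5 layers x 4 tokens x 2 features (made-up values).
# The vector at [layer][token] is [layer, token], which makes the averages easy to verify.
toy_hidden_states = [
    [[float(layer), float(token)] for token in range(4)]
    for layer in range(5)
]

# Step 1: for each token, average its vectors across the last four layers.
last_four = toy_hidden_states[-4:]
token_embeddings = [
    mean_vectors([layer[token] for layer in last_four])
    for token in range(4)
]

# Drop the first and last token, which play the role of [CLS] and [SEP] here.
token_embeddings = token_embeddings[1:-1]

# Step 2: average the remaining token embeddings into one sentence embedding.
sentence_embedding = mean_vectors(token_embeddings)

print(token_embeddings)    # [[2.5, 1.0], [2.5, 2.0]]
print(sentence_embedding)  # [2.5, 1.5]
```

With real BERT outputs the same two steps apply, only with 768 features per token instead of 2.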

In [None]:
def get_bert_embedding(tokens_id, model):
    ''' Pass the tokens into the BERT model, then combines the embeddings of the last four 
    encoders into one final embedding for each non-special token.
    The input is a torch tensor and a BERT model
    the output is a numpy array in 2D '''
    # model inference
    with torch.no_grad():
        bert_output = model(tokens_id.to(device), token_type_ids = None)

    # keep the output embeddings of the last 4 encoders
    last_hidden_states = [t.to('cpu').numpy() for t in bert_output[2][-4:]]

    # concatenate and average them
    token_embeddings = np.vstack(last_hidden_states).mean(axis = 0)

    # remove special tokens [CLS] and [SEP]
    N = token_embeddings.shape[0]
    token_embeddings = token_embeddings[1:N-1]

    return token_embeddings # should be of shape #tokens x 768

def get_sentence_embedding(sentence, tokenizer, model):
    ''' Returns the BERT based embedding of a sentence as the average across its
    token embeddings
    The output is a numpy array
    (but check out sentence-transformers too: https://pypi.org/project/sentence-transformers/)
    '''
    feat_encoded_dict = tokenizer(sentence,
                         add_special_tokens = True, # Add '[CLS]' and '[SEP]'
                         return_tensors = 'pt',     # Return pytorch tensors.
                         padding = False
                         )
    token_embeddings = get_bert_embedding(feat_encoded_dict['input_ids'], model) #shape #tokens x 768
    
    # average across embeddings
    sentence_embeddings = token_embeddings.mean(axis=0)

    return sentence_embeddings


In [None]:
# Processing all skills takes around 3 minutes on the GPU and likely more than an hour on the CPU.
# The slice [:101] below keeps only the first 101 skills so that the cell also runs quickly on the CPU;
# remove the slice to embed the whole dataset.
t0 = time.time()
skills_embeddings = []
for i,skill in enumerate(esco_df.preferredLabel.to_list()[:101]):
    skills_embeddings.append(get_sentence_embedding(skill, feat_tokenizer, bert_model))
    if i % 1000 == 0 and i>0:
      print(f'Time elapsed so far to embed {i} skills is {(time.time()- t0)/60:.2f} minutes')

print(f'The size of each skills embeddings is {skills_embeddings[0].shape}')
# concatenate them all
skills_embeddings = np.vstack(skills_embeddings)

print(f'The size of the skills embeddings is {skills_embeddings.shape}')


COMPUTING SIMILARITY

There are many similarity metrics out there. However, the ones most commonly used to check how similar two word embeddings are are the ["cosine similarity"](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html) and the "dot product" (the two coincide if the embeddings are normalized).

In the following we will use the cosine similarity, which ranges between -1 and 1. -1 means word embeddings are opposite, 0 means there is no similarity, +1 means there is perfect similarity.
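
As a quick illustration of these three reference values, here is a minimal sketch of the cosine similarity written with the standard library only. The helper name `cosine_sim` and the two-dimensional vectors are assumptions for illustration; the notebook itself relies on scikit-learn's `cosine_similarity`.

```python
import math

def cosine_sim(u, v):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

reference = [1.0, 2.0]
print(cosine_sim(reference, [2.0, 4.0]))    # same direction: close to +1
print(cosine_sim(reference, [-2.0, 1.0]))   # perpendicular: 0
print(cosine_sim(reference, [-1.0, -2.0]))  # opposite direction: close to -1
```

Note that the length of the vectors does not matter: `[2.0, 4.0]` is twice as long as `reference` but still has a similarity of (approximately) 1.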


In [None]:
# Potential applications. 
from sklearn.metrics.pairwise import cosine_similarity
# Let's say we have a new skill that might or might not be in the dataset already. 
# How do we make sense of it? One way is to select the existing skill that is the  
# most similar to the 'new' one. This way we can understand how the new skill 'fits in'.
new_skill = 'python programming language'

new_skill_embedding = get_sentence_embedding(new_skill, feat_tokenizer, bert_model)

# get the cosine similarity between the new skill and all the skills in the ESCO dataset
# The cosine similarity measures how similar two vectors are through the angle between
# the two vectors. The idea is that parallel vectors are the most similar (cosine
# similarity = 1), perpendicular vectors are unrelated (cosine similarity = 0) and
# vectors pointing in opposite directions are the least similar (cosine similarity = -1)
similarities = cosine_similarity(new_skill_embedding.reshape(1,-1), skills_embeddings)
# currently is 2D, make it 1D
similarities= similarities.flatten()

# get the top 5 similar skills
most_similar_skills_indices = list(np.argsort(similarities)[-5:])
# reverse the order
most_similar_skills_indices = most_similar_skills_indices[::-1]
most_similar_skills = [esco_df.iloc[i].preferredLabel for i in most_similar_skills_indices]

print(f'The 5 skills that are most similar to "{new_skill}" are:')
for i,skill in zip(most_similar_skills_indices,most_similar_skills):
  print(f"'{skill}', with similarity {similarities[i]:.2f}")

# you can try it out with other skills that come to mind, or even more generic words.

# We could also apply unsupervised algorithms to the dataset - for example, we can
# group the skills into sets of similar items (a technique called clustering), like 
# all IT skills, all language skills, etc.

## Exercise

1. Try computing the average token embedding in a different way. Instead of averaging the last 4 hidden layers, try using only the second-to-last hidden layer (this is another strategy used in the BERT paper). How do the results change?
2. Try using a model other than BERT, how do the results change?

In [None]:
#### Your code here

# Solution
# the function to aggregate the embedding can be modified like this:
def get_bert_embedding(tokens_id, model):
    ''' Pass the tokens into the BERT model, then take the output of the second-to-last
    hidden layer as the embedding for each non-special token.
    The input is a torch tensor and a BERT model
    the output is a numpy array in 2D '''
    # model inference
    with torch.no_grad():
        bert_output = model(tokens_id.to(device), token_type_ids = None)

    # keep the output embeddings of the second to last hidden layer
    second_last_hidden_state = [t.to('cpu').numpy() for t in bert_output[2][-2]]

    # get the token embeddings as they are
    token_embeddings = second_last_hidden_state[0]

    # remove special tokens [CLS] and [SEP]
    N = token_embeddings.shape[0]
    token_embeddings = token_embeddings[1:N-1]

    return token_embeddings # should be of shape #tokens x 768

# An example of a different model is 'roberta-base' (others are here: https://huggingface.co/docs/transformers/main_classes/model)

# Build a question-answering system

Please also refer to the original [HuggingFace tutorial](https://huggingface.co/course/chapter7/7?fw=pt).

## The SQUAD dataset

From the [SQUAD website](https://rajpurkar.github.io/SQuAD-explorer/):
"Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable."

Each data point in the dataset has the following features:
1. The "context", that is the piece of text from which the answer is taken.
2. A "question" to answer.
3. One or more possible answers to the question. Each answer includes the text of the answer and where the answer starts in the context (the starting point is indicated as the position of the character at which the answer starts).

SQUAD is an example of a dataset for **extractive** question-answering. This is when the answer is contained as-is within the context. You can also have a look at the [original paper](https://arxiv.org/pdf/1606.05250.pdf).

In [None]:
# load the SQUAD dataset
from datasets import load_dataset
raw_datasets = load_dataset("squad")

In [None]:
# this will show the number of data points and their key features
raw_datasets

### Some basic EDA

(not an exhaustive collection of all that can be done)

In [None]:

# What does an example data point look like?
print('\n Example data point:')
for k in raw_datasets['train'][1].keys():
    print(k, ':', raw_datasets['train'][1][k])

# For each question, how many possible answers are there?
for split_name in ['train','validation']:
    print(f"\n For the {split_name} split: ")
    print(collections.Counter([len(t['text']) for t in raw_datasets[split_name]['answers']]))
    #for lists, Counter is the equivalent of .value_counts() for dataframes
    

As you can see, all the training samples have only one possible answer, while the validation samples have multiple possible answers.

This is because during training we can be a bit more prescriptive and "force" the algorithm towards one and only one answer. However, some questions may inherently have multiple valid answers (language can be ambiguous!) and we don't want to penalize our evaluation by excluding some of them. For example:


In [None]:
for t in ['context','question','answers']:
    print(t.capitalize())
    print(raw_datasets['validation'][2][t])


In [None]:
# For the training dataset what's the length distribution for the answers (length as in number of words separated by a white space)?
# We can imagine that longer answers are harder to detect, for example...
# We restrict ourselves to the training dataset since we know there is always only one answer
all_answers_length = [len(t['text'][0].split()) for t in raw_datasets['train']['answers']]
plt.hist(all_answers_length, bins= 20)


Most answers seem very short. It might be interesting to see how the algorithm performs with answers of different length...

In [None]:
# For the training dataset what's the length distribution for the context (length as in number of words separated by a white space)?
all_contexts_length = [len(t.split()) for t in raw_datasets['train']['context']]
plt.hist(all_contexts_length, bins= 20)


In [None]:
# How about the position of the answers in the text? Is that equally distributed?
# Let's get the relative position of each answer in the text (character starting position / number of characters )
all_answers_starts = [t['answer_start'][0] for t in raw_datasets['train']['answers']]
all_context_lengths = [len(t) for t in raw_datasets['train']['context']]
relative_answers_starts = [n/d for n,d in zip(all_answers_starts,all_context_lengths)]
plt.hist(relative_answers_starts, bins= 20)


As you can see, the distribution is not uniform. This is somewhat expected, probably for two reasons:
1. Since the answer is a span of text, it can start at the beginning of the context, but it cannot start at the end, since otherwise it would have length zero.
2. To create the dataset, the authors employed crowdworkers (paid $9 per hour, BTW) to create up to 5 questions and answers for each context. Is it possible that the crowdworkers scanned the context left to right when creating the questions?

Is it possible that part of what the model learns is to look for the answer at the beginning of the context even when it shouldn't?

## Data preparation

#### Training dataset

In [None]:
# First, import the tokenizer
model_checkpoint = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)


To solve a question-answering task, the input to our model is the concatenation of the question and the context (note the question goes first). The two are separated by the special token [SEP] so that the model knows where the question ends and the context starts. As usual, BERT will also add the special token [CLS] at the beginning of the input and the token [SEP] again at the end of the second sentence. Overall the input looks like this:

`[CLS] question [SEP] context [SEP]`

The required output is the indexes of the tokens that represent the beginning and the end of the answer.

This means that extractive QA is equivalent to a token classification system: each token in the context is given two labels, one for being the start of the answer and one for being the end. Each label can be either 0 or 1, with 1 meaning that a token is the beginning or the end of the answer. What type of classification system do you think this is? (multi-class, multi-label, something else, ...)

This image from the [HuggingFace tutorial](https://huggingface.co/course/chapter7/7?fw=pt#the-squad-dataset) might help clarify the setup:

![Extractive QA labelling](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter7/qa_labels.svg)
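
The labelling scheme can also be sketched on a toy example. The question, context and answer below are made up, and the text is split into whole words for readability (a real tokenizer would produce sub-word tokens), so treat this as an illustration of the idea rather than the actual preprocessing.

```python
# Hypothetical, already-tokenized example (whole words for readability).
question = ["Who", "wrote", "Hamlet", "?"]
context = ["Hamlet", "was", "written", "by", "William", "Shakespeare", "."]
answer = ["William", "Shakespeare"]

# Build the model input: [CLS] question [SEP] context [SEP]
tokens = ["[CLS]"] + question + ["[SEP]"] + context + ["[SEP]"]
context_offset = len(question) + 2  # position of the first context token

# Locate the answer inside the context.
for i in range(len(context) - len(answer) + 1):
    if context[i:i + len(answer)] == answer:
        start_index = context_offset + i
        end_index = start_index + len(answer) - 1
        break

# One 0/1 label per token for "is the start" and one for "is the end".
start_labels = [int(i == start_index) for i in range(len(tokens))]
end_labels = [int(i == end_index) for i in range(len(tokens))]

for token, s, e in zip(tokens, start_labels, end_labels):
    print(f"{token:12s} start={s} end={e}")
```

Since each label vector contains a single 1, it can be stored compactly as the index of that 1 (here 10 for the start and 11 for the end).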

One problem is that some contexts might be very long. This is a problem because:
1. The input to the transformer has to be a fixed length for the network to be set up correctly.
2. This is achieved by padding short texts (that is, adding filler tokens at the end) or by truncating long texts (that is, dropping all tokens after a maximum length).
3. However, if we drop the tail of long texts, we risk losing the answer!

To address this problem, we prepare the data by splitting long contexts into overlapping texts (windows) of the right size. For example, if we have a limit of 32 tokens (without considering question and special characters) and the following context:


> Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes".

We would split it into:

> Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building

> Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes".

and use both windows as separate inputs to the neural network. For the windows that do not contain the right answer, the label would need to be set accordingly.
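
The windowing idea can be sketched in a few lines of plain Python. The function `split_into_windows` is a hypothetical helper written for this illustration, where `stride` is taken to mean the number of tokens shared by two consecutive windows; it is a simplified sketch and not the tokenizer's own implementation.

```python
def split_into_windows(tokens, max_length, stride):
    """Split tokens into windows of at most max_length tokens,
    where consecutive windows share `stride` tokens."""
    if stride >= max_length:
        raise ValueError("stride must be smaller than max_length")
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + max_length])
        if start + max_length >= len(tokens):
            break  # this window reached the end of the text
        start += max_length - stride  # move forward, keeping `stride` tokens of overlap
    return windows

tokens = [f"t{i}" for i in range(10)]
windows = split_into_windows(tokens, max_length=4, stride=2)
for window in windows:
    print(window)
```

With 10 tokens, a window size of 4 and an overlap of 2, this produces four windows, and each window starts with the last two tokens of the previous one.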

The tokenizer function can help us thanks to the keyword `return_overflowing_tokens`, which tells the tokenizer we want to keep those tokens that otherwise would be lost due to the truncation.




In [None]:
context = raw_datasets["train"][0]["context"]
question = raw_datasets["train"][0]["question"]

inputs = tokenizer(
    question,
    context,
    max_length=100,
    truncation="only_second", #only truncate the second sentence (the context) and not the first (the question)
    stride=50, #sets how many overlapping tokens we want
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
)

for ids in inputs["input_ids"]:
    print(tokenizer.decode(ids))

print(inputs.keys())

The output should be the different overlapping context windows, each preceded by the question (and including the special characters).

The other thing we need to do is to correctly map the beginning of the answer to the right character in the context window (or windows) that actually contains the answer. In this case the answer is:

In [None]:
raw_datasets["train"][0]["answers"]

We can see that the third and fourth context windows contain the answer, but the first and second don't. Furthermore, we need to find where the answer is within the windowed contexts.

We can do this thanks to the fact that the tokenizer returns the `offset_mapping`: this keeps track of the position of each token in the original context. Given that we know where the answer starts (`raw_datasets["train"][0]["answers"]["answer_start"][0]`) and how many characters long it is (`len(raw_datasets["train"][0]["answers"]["text"][0])`), we can compute whether a given token is the start of the answer, the end of the answer or neither of them.
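
Here is a minimal sketch of how character offsets can be turned into token positions. The helper `find_answer_tokens`, the example sentence and the whitespace-based offsets are all assumptions made for illustration; a real tokenizer would return sub-word offsets, and the indices would also have to account for the question and the special tokens.

```python
def find_answer_tokens(offsets, start_char, end_char):
    """Return (start_token, end_token) for the answer span, or (0, 0)
    if the answer is not fully covered by the given offsets."""
    if offsets[0][0] > start_char or offsets[-1][1] < end_char:
        return 0, 0
    # Last token that begins at or before the first character of the answer.
    start_token = max(i for i, (begin, _) in enumerate(offsets) if begin <= start_char)
    # First token that ends at or after the last character of the answer.
    end_token = min(i for i, (_, end) in enumerate(offsets) if end >= end_char)
    return start_token, end_token

context = "Hamlet was written by William Shakespeare around 1600"
answer_text = "William Shakespeare"
start_char = context.index(answer_text)
end_char = start_char + len(answer_text)

# Build (start, end) character offsets for each whitespace-separated token.
offsets = []
position = 0
for word in context.split():
    begin = context.index(word, position)
    offsets.append((begin, begin + len(word)))
    position = begin + len(word)

print(find_answer_tokens(offsets, start_char, end_char))       # (4, 5)
print(find_answer_tokens(offsets[2:4], start_char, end_char))  # (0, 0): answer outside this window
```

In the real setting position 0 is the [CLS] token, which is why (0, 0) can safely be used to signal that a window does not contain the answer.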

The other thing to mention is that now we have many more data points. We need to keep track of which original data point the new shortened contexts came from. We can do this thanks to another piece of information returned by the tokenizer, the `overflow_to_sample_mapping`, that tells us exactly this.
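
As a small illustration of how such a mapping is used, the sketch below assumes a made-up `overflow_to_sample_mapping` for three original data points that were split into six windows. The values and the `answers` list are hypothetical.

```python
import collections

# Hypothetical mapping: entry i is the index of the original data point
# that window i was cut from.
overflow_to_sample_mapping = [0, 0, 0, 1, 2, 2]
answers = ["answer A", "answer B", "answer C"]  # one answer per original data point

# How many windows did each original data point produce?
windows_per_example = collections.Counter(overflow_to_sample_mapping)
print(windows_per_example)

# Each window inherits the answer of the data point it came from.
answer_per_window = [answers[sample_idx] for sample_idx in overflow_to_sample_mapping]
print(answer_per_window)
```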

Putting it all together, we have the following preprocessing function:

In [None]:
max_length = 384 #how long the input should be
stride = 128 #how large the overlap should be between consecutive windows


def preprocess_training_examples(examples):
    """ split long contexts into overlapping windows and map to the right labels """
    questions = [q.strip() for q in examples["question"]]  #.strip() is to remove extra white spaces at the beginning of some questions

    # tokenize
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=max_length,
        truncation="only_second",
        stride=stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # get the map between the new data points to the original ones
    offset_mapping = inputs.pop("offset_mapping")
    sample_map = inputs.pop("overflow_to_sample_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    # define the label for each new datapoint
    for i, offset in enumerate(offset_mapping):
        sample_idx = sample_map[i]
        answer = answers[sample_idx]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label is (0, 0)
        if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs


Apply the pre-processing function to the training dataset using the `Dataset.map()` method:

In [None]:
train_dataset = raw_datasets["train"].map(
    preprocess_training_examples,
    batched=True,
    remove_columns=raw_datasets["train"].column_names,
)
len(raw_datasets["train"]), len(train_dataset)


#### Validation dataset

Since the validation dataset is only used to get the performance metric, we only need to process all the contexts so that they fit within the maximum number of tokens.

Indeed, the labels in the validation dataset are only needed to compute the performance metrics. As long as we can process the model output to be in the "right shape" we can pass the model output and the labels as they are to a HuggingFace function `metric = load_metric("squad")` that will do the heavy lifting for us. Please see below for the needed post-processing.

As a general tip, have a look at what metric functions HuggingFace (or other libraries) offer and **how** they expect to be given the predictions and the labels as input. This can save you a lot of time.
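
For instance, following the HuggingFace tutorial this notebook is based on, the squad metric expects the predictions as a list of dictionaries with the keys `id` and `prediction_text`, and the references as a list of dictionaries with the keys `id` and `answers`. The sketch below builds both lists for one made-up example; the example itself and the name `predicted_text_by_id` are assumptions for illustration.

```python
# Hypothetical example and prediction, for illustration only.
examples = [
    {
        "id": "q1",
        "answers": {"text": ["William Shakespeare", "Shakespeare"], "answer_start": [22, 30]},
    },
]
predicted_text_by_id = {"q1": "Shakespeare"}

# Predictions: one dictionary per question, with the predicted answer text.
predictions = [
    {"id": ex["id"], "prediction_text": predicted_text_by_id[ex["id"]]}
    for ex in examples
]

# References: one dictionary per question, with all the ground truth answers.
references = [{"id": ex["id"], "answers": ex["answers"]} for ex in examples]

print(predictions)
print(references)
# These two lists would then be passed to the metric, e.g.
# metric.compute(predictions=predictions, references=references)
```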

In [None]:
def preprocess_validation_examples(examples):
    """ pre-process the question+context into chunks of appropriate length"""
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=max_length,
        truncation="only_second",
        stride=stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    sample_map = inputs.pop("overflow_to_sample_mapping")
    example_ids = []

    for i in range(len(inputs["input_ids"])):
        sample_idx = sample_map[i]
        example_ids.append(examples["id"][sample_idx])

        sequence_ids = inputs.sequence_ids(i)
        offset = inputs["offset_mapping"][i]
        inputs["offset_mapping"][i] = [
            o if sequence_ids[k] == 1 else None for k, o in enumerate(offset)
        ]

    inputs["example_id"] = example_ids
    return inputs


In [None]:
validation_dataset = raw_datasets["validation"].map(
    preprocess_validation_examples,
    batched=True,
    remove_columns=raw_datasets["validation"].column_names,
)
len(raw_datasets["validation"]), len(validation_dataset)


In [None]:
validation_dataset[3]['example_id']

### Post-processing

Remember that the output of the question answering model is a vector of scores for each token to determine whether any of them is the start or the end of the answer. To transform this output back into proper answers, we need to take the following steps:

1. We mask the start and end logits corresponding to tokens outside of the context. The answer can only be in the context.
2. We then convert the start and end logits (the scores) into probabilities using a softmax.
3. We attribute a score to each (start_token, end_token) pair by taking the product of the corresponding two probabilities. This is because it's the whole answer span that we are after. This step is also necessary to allow for step 4.
4. We look for the pair with the maximum score that yields a valid answer (e.g., a start_token that does not come after the end_token).

One problem with this approach is that there are many pairs to evaluate. Can we simplify? Yes, we can decide to only compute the joint pair probability (start_token, end_token) for the N best candidate start tokens and candidate end tokens. Furthermore, we can work directly on the logits and sum them, instead of multiplying the probabilities. This works because log(a*b) = log(a) + log(b) and because, with a softmax, the log-probability of a token is equal to its logit minus a constant that is the same for all tokens in a given context window, so the ranking of the pairs does not change.
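
The simplification can be checked on a toy example. The sketch below uses made-up logits for a window of four tokens and two hypothetical helpers, `softmax` and `best_span`; it picks the best valid pair from the N best candidates by summing logits, and then compares it with the pair obtained by multiplying probabilities over all valid pairs.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def best_span(start_logits, end_logits, n_best=2):
    """Return (score, start, end) for the valid pair with the highest summed
    logit, considering only the n_best start and end candidates."""
    by_start = sorted(range(len(start_logits)), key=lambda i: start_logits[i], reverse=True)
    by_end = sorted(range(len(end_logits)), key=lambda i: end_logits[i], reverse=True)
    candidates = [
        (start_logits[s] + end_logits[e], s, e)
        for s in by_start[:n_best]
        for e in by_end[:n_best]
        if s <= e  # a valid answer cannot end before it starts
    ]
    return max(candidates) if candidates else None

# Made-up logits for a window of four tokens.
start_logits = [0.1, 2.0, 0.3, 1.5]
end_logits = [1.8, 0.2, 2.5, 0.4]

print(best_span(start_logits, end_logits))  # (4.5, 1, 2)

# Same search with the product of probabilities over all valid pairs.
p_start, p_end = softmax(start_logits), softmax(end_logits)
best_by_probability = max(
    (p_start[s] * p_end[e], s, e)
    for s in range(4) for e in range(4) if s <= e
)
print(best_by_probability[1:])  # (1, 2)
```

Both approaches select the span from token 1 to token 2, but the first one only needs to score four pairs instead of ten.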

Finally, the last thing we need to do is to extract the span of text that goes from start_token to end_token and compare it to the possible ground truth answers. Note that we do this for all the context windows derived from the original context. The overall predicted answer for each original data point is the answer with the highest score across all of its context windows.
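
The selection across context windows can be sketched as follows. The example ids, answer texts and scores are hypothetical placeholders, chosen only to show how candidates coming from different windows of the same question are grouped and compared.

```python
import collections

# Hypothetical best candidate from each context window:
# (id of the original example, answer text, logit score)
window_candidates = [
    ("q1", "Denver Broncos", 9.1),     # first window of example q1
    ("q1", "Carolina Panthers", 4.3),  # second window of example q1
    ("q2", "Santa Clara", 7.8),        # only window of example q2
]

# Group the candidates by the example they belong to
by_example = collections.defaultdict(list)
for example_id, text, score in window_candidates:
    by_example[example_id].append((text, score))

# For each example, keep the text of the candidate with the highest score
final_answers = {
    example_id: max(candidates, key=lambda c: c[1])[0]
    for example_id, candidates in by_example.items()
}
print(final_answers)
```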

In [None]:
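# load the pre-defined squad metric from HuggingFace to make things easier
# (note: newer releases of the datasets library have moved metrics to the separate
# evaluate library; if load_metric is not available, use evaluate.load("squad") instead)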
from datasets import load_metric

metric = load_metric("squad")


The squad metric computes two numbers:

1. The percentage of predicted answers that are exactly the same as (any of) the ground truth answers (`exact_match`).
2. The average overlap between the predicted and ground truth answers (`f1`). For each question, the overlap is the token-level F1 score between the prediction and a ground truth answer, taking the maximum across all possible answers for that question.
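
To make the two numbers more concrete, here is a simplified sketch of how they can be computed for a single question. It is not the official implementation: the function names are made up, and the only normalisation applied is lowercasing and whitespace splitting, whereas the real metric also strips punctuation and articles before comparing.

```python
import collections

def exact_match(prediction, ground_truths):
    # 1.0 if the prediction equals any ground truth (ignoring case), else 0.0
    pred = prediction.strip().lower()
    return float(any(pred == gt.strip().lower() for gt in ground_truths))

def token_f1(prediction, ground_truth):
    # F1 score computed on the tokens shared by prediction and ground truth
    pred_tokens = prediction.lower().split()
    true_tokens = ground_truth.lower().split()
    common = collections.Counter(pred_tokens) & collections.Counter(true_tokens)
    n_common = sum(common.values())
    if n_common == 0:
        return 0.0
    precision = n_common / len(pred_tokens)
    recall = n_common / len(true_tokens)
    return 2 * precision * recall / (precision + recall)

def best_f1(prediction, ground_truths):
    # Maximum F1 across all the possible answers for one question
    return max(token_f1(prediction, gt) for gt in ground_truths)

truths = ["Denver Broncos", "the Denver Broncos"]
print(exact_match("Denver Broncos", truths))
print(best_f1("Broncos", truths))
```

With these toy answers, the prediction "Broncos" is not an exact match, but it still gets an F1 of 2/3 against "Denver Broncos" (precision 1, recall 1/2). The final metric averages these per-question values over the dataset and reports them on a 0-100 scale.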

In [None]:
# define function to compute the metric. The hard part is not the metric itself, 
# it's the post-processing of the output
n_best = 20
max_answer_length = 30

def get_predicted_and_true_answers(start_logits, end_logits, features, examples):
    """ 
    Get the best predicted answer for each example, together with the
    corresponding ground truth answers
    """
    example_to_features = collections.defaultdict(list)
    for idx, feature in enumerate(features):
        example_to_features[feature["example_id"]].append(idx)

    predicted_answers = []
    for example in tqdm(examples):
        example_id = example["id"]
        context = example["context"]
        answers = []

        # Loop through all features associated with that example
        for feature_index in example_to_features[example_id]:
            start_logit = start_logits[feature_index]
            end_logit = end_logits[feature_index]
            offsets = features[feature_index]["offset_mapping"]

            # get the n_best candidates
            start_indexes = np.argsort(start_logit)[-1 : -n_best - 1 : -1].tolist()
            end_indexes = np.argsort(end_logit)[-1 : -n_best - 1 : -1].tolist()
            # score each (start_token, end_token) pair by summing the two logits
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # Skip answers that are not fully in the context
                    if offsets[start_index] is None or offsets[end_index] is None:
                        continue
                    # Skip answers with a length that is either < 0 or > max_answer_length
                    if (
                        end_index < start_index
                        or end_index - start_index + 1 > max_answer_length
                    ):
                        continue

                    answer = {
                        "text": context[offsets[start_index][0] : offsets[end_index][1]],
                        "logit_score": start_logit[start_index] + end_logit[end_index],
                    }
                    answers.append(answer)

        # Select the answer with the best score
        if len(answers) > 0:
            best_answer = max(answers, key=lambda x: x["logit_score"])
            predicted_answers.append(
                {"id": example_id, "prediction_text": best_answer["text"]}
            )
        else:
            predicted_answers.append({"id": example_id, "prediction_text": ""})

    theoretical_answers = [{"id": ex["id"], "answers": ex["answers"]} for ex in examples]
    return {'predicted': predicted_answers, 'theoretical': theoretical_answers}

def compute_metrics(start_logits, end_logits, features, examples):
    out = get_predicted_and_true_answers(start_logits, end_logits, features, examples)
    predicted_answers = out['predicted']
    theoretical_answers = out['theoretical']
    return metric.compute(predictions=predicted_answers, references=theoretical_answers)


### Model setup

In [None]:
# load the question answering model (it's a good pre-trained one for this task).
# Note that it's a DistilBERT model, that is a distilled version of BERT with the same
# cased vocabulary as bert-base-cased, so the tokenizer is the same
model_checkpoint = "distilbert-base-cased-distilled-squad"
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)


### Training

In [None]:
# set up the training hyperparameters (note the low learning rate since we're fine-tuning)

args = TrainingArguments(
    "bert-finetuned-squad",
    evaluation_strategy="no",
    save_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01,
    #fp16=True, #uncomment this if using the GPU
)


In [None]:
# perform the actual training

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset.select(range(50)),  # small subset to speed things up; use train_dataset to train on all the data
    eval_dataset=validation_dataset.select(range(20)),  # small subset to speed things up; use validation_dataset for the full set
    tokenizer=tokenizer,
)
trainer.train()


### Evaluation

Now that the training is done, we can check how well we can answer questions!

In [None]:
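Nval = 122  # number of validation examples to evaluate on; lower it to speed things up on the CPU
# long contexts are split into several features, so re-tokenize exactly these
# examples to make sure that each of them keeps all of its features
eval_examples = raw_datasets["validation"].select(range(Nval))
eval_features = eval_examples.map(
    preprocess_validation_examples, batched=True, remove_columns=eval_examples.column_names
)
predictions = trainer.predict(eval_features)
start_logits, end_logits = predictions.predictions
compute_metrics(start_logits, end_logits, eval_features, eval_examples)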


In [None]:
# let's show some examples
n_examples= 3
out = get_predicted_and_true_answers(start_logits, end_logits, validation_dataset.select(range(n_examples)), 
                                      raw_datasets["validation"].select(range(n_examples)))
for p, t in zip(out['predicted'],out['theoretical']):
    print('Predicted answer:')
    print(p['prediction_text']) 
    print('Desired answer(s):')
    print(t['answers']['text'])
    print()


How does it look?