# Question answering with BERT (HuggingFace)

Transformer models have revolutionized deep learning. Transformer-based models such as BERT are heavily used in NLP because of the rich numerical representations of text they provide. Here we discuss how to use HuggingFace's transformers library to conveniently explore various transformer-based NLP models. We will train a question answering model on the well-known SQuAD v1 dataset.

## Import libraries

In [1]:
# !pip install transformers
# ! pip install datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.26.0-py3-none-any.whl (6.3 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.12.0-py3-none-any.whl (190 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.12.0 tokenizers-0.13.2 transformers-4.26.0


### Importing all the necessary libraries and setting up their random seeds

In [37]:
import random
import time
from functools import partial

import numpy as np
import tensorflow as tf
import transformers
from datasets import load_dataset
from transformers import (
    DistilBertConfig,
    DistilBertTokenizerFast,
    TFDistilBertForQuestionAnswering,
)

def fix_random_seed(seed):
    """ Setting the random seed of various libraries """
    try:
        np.random.seed(seed)
    except NameError:
        print("Warning: Numpy is not imported. Setting the seed for Numpy failed.")
    try:
        tf.random.set_seed(seed)
    except NameError:
        print("Warning: TensorFlow is not imported. Setting the seed for TensorFlow failed.")
    try:
        random.seed(seed)
    except NameError:
        print("Warning: random module is not imported. Setting the seed for random failed.")
    try:
        transformers.trainer_utils.set_seed(seed)
    except NameError:
        print("Warning: transformers module is not imported. Setting the seed for transformers failed.")
        
# Fixing the random seed
random_seed=4321
fix_random_seed(random_seed)


## Download the dataset

For this we will use the [SQuAD v1 dataset](https://rajpurkar.github.io/SQuAD-explorer/), a question answering dataset. Each sample provides a question, a context (e.g. a paragraph in which the answer to the question may exist) and the answer. The goal is to predict the answer given the question and the context.

In [4]:
from datasets import load_dataset

dataset = load_dataset("squad")

print(dataset)

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})


## Print the answers of the first 5 samples in the training set

In [5]:
dataset["train"]["answers"][:5]

[{'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]},
 {'text': ['a copper statue of Christ'], 'answer_start': [188]},
 {'text': ['the Main Building'], 'answer_start': [279]},
 {'text': ['a Marian place of prayer and reflection'], 'answer_start': [381]},
 {'text': ['a golden statue of the Virgin Mary'], 'answer_start': [92]}]

>  We are only interested in the last 3 columns (context, question, answers) in the features section
1. `context` and `question` are strings.
2. `answers` is a dictionary. Each answer has a starting character index (`answer_start`) and a `text`, which is the actual answer.
3. Together these let us compute where the answer ends, i.e. end_index = start_index + len(text).
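The end-index computation can be sketched on a toy record. The context and answer below are made up for illustration and only mimic the layout of a SQuAD sample; `answer_span` is a hypothetical helper, not part of the dataset API.

```python
# Toy record in the same layout as a SQuAD sample (values are made up)
context = "The library opened in 1917 and was renovated in 2009."
answer = {"text": ["1917"], "answer_start": [22]}


def answer_span(answer):
    """Return the (start, end) character span of the first answer; end is exclusive."""
    start = answer["answer_start"][0]
    text = answer["text"][0]
    return start, start + len(text)


start_idx, end_idx = answer_span(answer)
# Slicing the context with the span should give back the answer text
print(start_idx, end_idx, repr(context[start_idx:end_idx]))
```

Because the end index is exclusive, `context[start_idx:end_idx]` recovers the answer text directly.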

## Correcting incorrect offsets of the provided answers

As mentioned above, the answers are provided by means of the starting index (`answer_start`) and the answer itself (`text`). However, for some examples the starting index can be slightly off from the actual index. The function below corrects that. Furthermore, we add `answer_end`, which denotes the (exclusive) character index at which the answer ends.

In [6]:
# Function to correct answers whose start offset is off by one or two characters from the actual answer (text)
# We will also add answer_end
def correct_indices_add_end_idx(answers, contexts):
    """ Correct the answer index of the samples (if wrong) """
    
    # Track how many were correct and fixed
    n_correct, n_fix = 0, 0
    # new fixed answers will be held in this variable
    fixed_answers = []
    # Iterate through each answer context pair.
    for answer, context in zip(answers, contexts):

        # Convert the answer from a list of strings to a single string
        gold_text = answer['text'][0]
        answer['text'] = gold_text
        # Convert the start of the answer from a list of integers to an integer
        start_idx = answer['answer_start'][0]
        answer['answer_start'] = start_idx
        # Print suspicious samples (negative start index or empty answer text)
        if start_idx < 0 or len(gold_text.strip()) == 0:
            print(answer)
        # Compute the (exclusive) end index of the answer
        end_idx = start_idx + len(gold_text)

        # Sometimes SQuAD answers are off by a character or two; fix this

        # The context slice already matches the answer, so no change is required
        if context[start_idx:end_idx] == gold_text:
            answer['answer_end'] = end_idx
            n_correct += 1
        # The slice must be shifted back by 1 character to match the answer
        elif context[start_idx-1:end_idx-1] == gold_text:
            answer['answer_start'] = start_idx - 1
            answer['answer_end'] = end_idx - 1
            n_fix += 1
        # The slice must be shifted back by 2 characters to match the answer
        elif context[start_idx-2:end_idx-2] == gold_text:
            answer['answer_start'] = start_idx - 2
            answer['answer_end'] = end_idx - 2
            n_fix += 1
        # No offset matched: keep the computed end index so that 'answer_end' always exists
        else:
            answer['answer_end'] = end_idx

        fixed_answers.append(answer)
        
    # Print how many samples were fixed
    print("\t{}/{} examples had the correct answer indices".format(n_correct, len(answers)))
    print("\t{}/{} examples had the wrong answer indices".format(n_fix, len(answers)))
    return fixed_answers, contexts


In [7]:
# Generating training and validation sets of question, answers and context.
train_questions = dataset["train"]["question"]
print("Training data corrections")
train_answers, train_contexts = correct_indices_add_end_idx(
    dataset["train"]["answers"], dataset["train"]["context"]
)
test_questions = dataset["validation"]["question"]
print("\nValidation data correction")
test_answers, test_contexts = correct_indices_add_end_idx(
    dataset["validation"]["answers"], dataset["validation"]["context"]
)

Training data corrections
	87599/87599 examples had the correct answer indices
	0/87599 examples had the wrong answer indices

Validation data correction
	10570/10570 examples had the correct answer indices
	0/10570 examples had the wrong answer indices


### Overview of the whole training process
1. We combine each question and context into one sequence and add special tokens: one marking the start of the pair and a separator marking where one segment ends and the next begins.
2. Remember that these pretrained models come in two parts, i.e. a tokenizer and the actual model. So we first convert our text into tokens, then convert these tokens into IDs, which we feed to the BERT model to look up embeddings.
3. The output of BERT is fed to two classifiers, which predict the starting index and the ending index of the answer within the context.
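Step 3 can be sketched with NumPy. This is a minimal illustration under stated assumptions, not the library's implementation: the encoder output is replaced by random numbers, the sizes are toy values, and the two classifiers are modelled as one dense layer with two outputs per token (`W`, `b` and `softmax` are names introduced here).

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, hidden_size = 6, 8  # toy sizes, far smaller than a real model

# Stand-in for the encoder output: one hidden vector per token
hidden_states = rng.normal(size=(seq_len, hidden_size))

# Assumed head: a dense layer producing two scores per token (start score, end score)
W = rng.normal(size=(hidden_size, 2))
b = np.zeros(2)

logits = hidden_states @ W + b  # shape (seq_len, 2)
start_logits, end_logits = logits[:, 0], logits[:, 1]


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


# Each head is a classifier over token positions
start_probs = softmax(start_logits)
end_probs = softmax(end_logits)
pred_start = int(np.argmax(start_probs))
pred_end = int(np.argmax(end_probs))
print(pred_start, pred_end)
```

Each head assigns a probability to every token position, so the training labels are simply the token indices of the answer's start and end.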

## Question answering with DistilBert

Now we will train a question answering model. The pretrained model we'll use is [DistilBert](https://arxiv.org/pdf/1910.01108.pdf). It is a smaller variant of BERT trained using knowledge distillation (a type of transfer learning).

### Defining the tokenizer

In [27]:
from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

### Convert some text to tokens with the tokenizer

In [28]:
context = "This is the context"
question = "This is the question"

token_ids = tokenizer(context, question, return_tensors='tf')
print(token_ids)
print(tokenizer.convert_ids_to_tokens(token_ids['input_ids'].numpy()[0]))

{'input_ids': <tf.Tensor: shape=(1, 11), dtype=int32, numpy=
array([[ 101, 2023, 2003, 1996, 6123,  102, 2023, 2003, 1996, 3160,  102]],
      dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(1, 11), dtype=int32, numpy=array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], dtype=int32)>}
['[CLS]', 'this', 'is', 'the', 'context', '[SEP]', 'this', 'is', 'the', 'question', '[SEP]']


## Converting the inputs to tokens

In addition to converting inputs to tokens and adding special tokens, the tokenizer will truncate inputs to the maximum sequence length the model supports and pad shorter inputs so that every sequence in the batch has the same length. You can check the maximum length with `tokenizer.model_max_length`.

In [29]:
# Encode train data
train_encodings = tokenizer(train_contexts, train_questions, truncation=True, padding=True, return_tensors='tf')
print("train_encodings.shape: {}".format(train_encodings["input_ids"].shape))
# Encode test data
test_encodings = tokenizer(test_contexts, test_questions, truncation=True, padding=True, return_tensors='tf')
print("test_encodings.shape: {}".format(test_encodings["input_ids"].shape))


train_encodings.shape: (87599, 512)
test_encodings.shape: (10570, 512)


> Setting `truncation` and `padding` to `True` makes the tokenizer pad/truncate the input sequences as necessary.
1. The tokenizer appends the special token [PAD] to every sequence that is shorter than the longest sequence in the batch.
2. It truncates any sequence that is longer than the maximum length the model supports.
3. No change occurs when a sequence already has the target length.
4. The maximum length here is 512 tokens, as the shapes above show.
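The padding and truncation rules can be sketched in plain Python. This is a simplified illustration, not the tokenizer's code: `pad_and_truncate` and `PAD_ID` are hypothetical names, the token ids are made up, and unlike the real tokenizer the sketch does not keep the trailing [SEP] when it truncates.

```python
PAD_ID = 0  # assumption: id of the [PAD] token in this toy vocabulary


def pad_and_truncate(batch, max_length):
    """Truncate each sequence to max_length, then pad all to the longest one."""
    truncated = [ids[:max_length] for ids in batch]
    target_len = max(len(ids) for ids in truncated)
    input_ids = [ids + [PAD_ID] * (target_len - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding
    attention_mask = [
        [1] * len(ids) + [0] * (target_len - len(ids)) for ids in truncated
    ]
    return input_ids, attention_mask


batch = [[101, 7, 8, 102], [101, 7, 8, 9, 10, 11, 102], [101, 7, 102]]
input_ids, attention_mask = pad_and_truncate(batch, max_length=5)
for ids, mask in zip(input_ids, attention_mask):
    print(ids, mask)
```

The attention mask is what lets the model ignore the padded positions.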

### Dealing with truncated answers

In the original dataset the `answer_start` and `answer_end` denote the *character*-level position of the answer. But since the model deals in tokens, we need the *token*-level position of the answer. For that, we use the `char_to_token` function of the encodings returned by the tokenizer. It converts a character index to a token index.

Because we are enforcing a maximum sequence length of 512, some answers will inevitably be truncated if they lie beyond the 512th token. Although this is rare, we still need to handle it, as it can otherwise result in numerical errors. Therefore, if a position is `None` (i.e. the answer could not be found), it is set to the maximum position.
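The idea behind the character-to-token mapping, and the fallback for truncated answers, can be sketched without the library. The `char_to_token` function below is a hypothetical stand-in written for this example (it is not the tokenizer's method), and the offsets are hand-written for a three-word text.

```python
def char_to_token(offsets, char_idx):
    """Return the index of the token whose character span covers char_idx, else None."""
    for token_idx, (start, end) in enumerate(offsets):
        if start <= char_idx < end:
            return token_idx
    return None


# Hand-written character spans of the tokens "the", "main", "building"
offsets = [(0, 3), (4, 8), (9, 17)]
answer_start, answer_end = 9, 17  # character span of "building", end exclusive

# Without truncation both ends of the answer map to token 2
full_start = char_to_token(offsets, answer_start)
full_end = char_to_token(offsets, answer_end - 1)

# Pretend the sequence was truncated to 2 tokens: the answer is no longer found
max_len = 2
truncated = offsets[:max_len]
start_tok = char_to_token(truncated, answer_start)
end_tok = char_to_token(truncated, answer_end - 1)

# Fallback used in this notebook: point to the last available position
if start_tok is None:
    start_tok = max_len - 1
if end_tok is None:
    end_tok = max_len - 1
print(full_start, full_end, start_tok, end_tok)
```

Note that the end position is looked up with `answer_end - 1`, because `answer_end` is exclusive and may fall on whitespace or past the last token.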

In [30]:
def update_char_to_token_positions_inplace(encodings, answers):
    start_positions = []
    end_positions = []
    n_updates = 0
    # Go through all the answers
    for i in range(len(answers)):        
        
        # Get the token position for both start and end char positions
        start_positions.append(encodings.char_to_token(i, answers[i]['answer_start']))
        end_positions.append(encodings.char_to_token(i, answers[i]['answer_end'] - 1))
        
        # keep track of how many answers have been truncated
        if start_positions[-1] is None or end_positions[-1] is None:
            n_updates += 1
       
        # if start position is None, the answer passage has been truncated therefore set it to the last available index i.e. 511
        if start_positions[-1] is None:
            start_positions[-1] = tokenizer.model_max_length -1

        # if end position was not found, some of answer's part lies outside our 512 tokens length     
        if end_positions[-1] is None:
            end_positions[-1] = tokenizer.model_max_length -1
            
    print("{}/{} had answers truncated".format(n_updates, len(answers)))
    encodings.update({'start_positions': start_positions, 'end_positions': end_positions})

update_char_to_token_positions_inplace(train_encodings, train_answers)
update_char_to_token_positions_inplace(test_encodings, test_answers)

10/87599 had answers truncated
8/10570 had answers truncated


> As you can see, we print the number of answers that were truncated. This is an important sanity check that helps ensure there are no problems in our data or bugs in our code. This number should always be small enough to be ignored.

### Creating TensorFlow dataset

In [31]:
def data_gen(input_ids, attention_mask, start_positions, end_positions):
    """ Generator for data """
    for inps, attn, start_pos, end_pos in zip(input_ids, attention_mask, start_positions, end_positions):
        
        yield (inps, attn), (start_pos, end_pos)

#### Our data generator returns data in a specific format.
1. The input tuple contains the input_ids (generated by our tokenizer) and the attention mask.
2. The output tuple describes the answer, i.e. its start and end token positions.
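To see this format concretely, the generator can be run on a couple of made-up samples. The ids and positions below are toy values chosen for illustration; `data_gen` is repeated so the snippet is self-contained.

```python
from functools import partial


def data_gen(input_ids, attention_mask, start_positions, end_positions):
    """Yield ((input_ids, attention_mask), (start_pos, end_pos)) per sample."""
    for inps, attn, start_pos, end_pos in zip(
        input_ids, attention_mask, start_positions, end_positions
    ):
        yield (inps, attn), (start_pos, end_pos)


# Two toy samples (made-up token ids and positions)
toy = dict(
    input_ids=[[101, 5, 102], [101, 6, 102]],
    attention_mask=[[1, 1, 1], [1, 1, 1]],
    start_positions=[1, 1],
    end_positions=[1, 1],
)

# partial binds the arguments, giving a zero-argument callable
gen_fn = partial(data_gen, **toy)
samples = list(gen_fn())
print(samples[0])

# Calling gen_fn again creates a fresh generator, so the data can be replayed
replayed = list(gen_fn())
```

Being able to call `gen_fn()` repeatedly is why a callable, rather than an already-started generator, is handed to the dataset: the data must be replayed once per epoch.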

In [32]:
# Creating training, validation and test datasets
print("Creating train data")

# Define the generator as a callable (not the generator itself); train_data_gen takes no arguments since they are all bound by the partial call
train_data_gen = partial(data_gen,
    input_ids=train_encodings['input_ids'], attention_mask=train_encodings['attention_mask'],
    start_positions=train_encodings['start_positions'], end_positions=train_encodings['end_positions']
)

# Define the dataset
train_dataset = tf.data.Dataset.from_generator(
    train_data_gen, output_types=(('int32', 'int32'), ('int32', 'int32'))
)
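# Shuffling the data. reshuffle_each_iteration=False keeps the order fixed across iterations,
# so the take/skip split below always yields the same, non-overlapping validation and training samples
train_dataset = train_dataset.shuffle(20000, reshuffle_each_iteration=False)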
print('\tDone')

batch_size = 8
# Valid set is taken as the first 10000 samples in the shuffled set
valid_dataset = train_dataset.take(10000)
valid_dataset = valid_dataset.batch(batch_size)

# Rest is kept as the training data
train_dataset = train_dataset.skip(10000)
train_dataset = train_dataset.batch(batch_size)

# Creating test data
print("Creating test data")

test_data_gen = partial(data_gen,
    input_ids=test_encodings['input_ids'], attention_mask=test_encodings['attention_mask'],
    start_positions=test_encodings['start_positions'], end_positions=test_encodings['end_positions']
)
test_dataset = tf.data.Dataset.from_generator(
    test_data_gen, output_types=(('int32', 'int32'), ('int32', 'int32'))
)
test_dataset = test_dataset.batch(batch_size)
print("\tDone")

Creating train data
	Done
Creating test data
	Done


### Defining the model

Here we define a DistilBert model (specifically the TensorFlow variant).

> Keras and TF expect the model output to be a tensor or a tuple of tensors, but that is not the case here because the transformers models return specific objects (descendants of transformers.file_utils.ModelOutput). Therefore, we will wrap the model in a Keras model.

In [40]:
from transformers import DistilBertConfig, TFDistilBertForQuestionAnswering

config = DistilBertConfig.from_pretrained("distilbert-base-uncased", return_dict=True)
model = TFDistilBertForQuestionAnswering.from_pretrained("distilbert-base-uncased", config=config)


Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertForQuestionAnswering: ['vocab_transform', 'vocab_projector', 'vocab_layer_norm', 'activation_13']
- This IS expected if you are initializing TFDistilBertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs', 'dropout_119']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [41]:
def tf_wrap_model(model):
    """ Wraps the HuggingFace model within the Keras Functional API """
    
    # If this is not wrapped in a keras model by taking the correct tensors from
    # TFQuestionAnsweringModelOutput produced, you will get the following error
    # setting return_dict did not seem to work as it should
    
    # TypeError: The two structures don't have the same sequence type. 
    # Input structure has type <class 'tuple'>, while shallow structure has type 
    # <class 'transformers.modeling_tf_outputs.TFQuestionAnsweringModelOutput'>.
    
    # Define an input layer that will take a batch of a token sequence
    input_ids = tf.keras.layers.Input([None,], dtype=tf.int32, name="input_ids")
    # Define an input for attention mask returned when encoding with the tokenizer
    attention_mask = tf.keras.layers.Input([None,], dtype=tf.int32, name="attention_mask")
    
    # Define the output (TFQuestionAnsweringModelOutput)
    out = model([input_ids, attention_mask])
    
    # Get the correct attributes in the produced object to generate an output tuple
    wrap_model = tf.keras.models.Model([input_ids, attention_mask], outputs=(out.start_logits, out.end_logits))
    
    return wrap_model

model_v2 = tf_wrap_model(model)

In [42]:
# Define and compile the model

# Keras will assign a separate loss for each output and add them together. So we'll just use the standard CE loss
# instead of using the built-in model.compute_loss, which expects a dict of outputs and averages the two terms.
# Note that this means the loss will be 2x of when using TFTrainer since we're adding instead of averaging them.

loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
acc = tf.keras.metrics.SparseCategoricalAccuracy()
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5)


model_v2.compile(optimizer=optimizer, loss=loss, metrics=[acc])
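The remark about the loss being twice the averaged value can be checked with a small NumPy sketch. The logits and labels are made up, and `sparse_ce` is a hypothetical helper that mirrors sparse categorical cross-entropy computed from logits for a single example.

```python
import numpy as np


def sparse_ce(logits, label):
    """Cross-entropy of one example from raw logits and an integer label."""
    shifted = logits - logits.max()  # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[label]


# Made-up logits over 4 token positions for one example
start_logits = np.array([0.1, 2.0, -1.0, 0.3])
end_logits = np.array([-0.5, 0.2, 1.5, 0.0])
start_label, end_label = 1, 2

start_loss = sparse_ce(start_logits, start_label)
end_loss = sparse_ce(end_logits, end_label)

summed = start_loss + end_loss          # what Keras reports as the total loss
averaged = (start_loss + end_loss) / 2  # what an averaged QA loss would report
print(summed, averaged)
```

The two differ only by a constant factor of 2, so the minimizer is unchanged; with a fixed learning rate the raw gradients are scaled by the same factor.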


### Training the model

In [None]:
import time

t1 = time.time()

model_v2.fit(
    train_dataset, 
    validation_data=valid_dataset,    
    epochs=3
)

t2 = time.time()

print("It took {} seconds to complete the training".format(t2-t1))

Epoch 1/3
Epoch 2/3
2246/9700 [=====>........................] - ETA: 1:02:39 - loss: 2.1013 - tf_distil_bert_for_question_answering_5_loss: 1.1005 - tf_distil_bert_for_question_answering_5_1_loss: 1.0009 - tf_distil_bert_for_question_answering_5_sparse_categorical_accuracy: 0.6689 - tf_distil_bert_for_question_answering_5_1_sparse_categorical_accuracy: 0.7133

### Save the model

In [None]:
# summary() prints the table itself and returns None, so no print() is needed
model_v2.summary()

**Note**: We cannot save `model_v2` as is, because it raises an error about not finding the config for the transformer model layer. Therefore, we save just the transformer model layer, so that we can call the `tf_wrap_model()` function at any time to get the wrapped model back.

In [None]:
import os

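# Create folders
os.makedirs('models', exist_ok=True)
os.makedirs('tokenizers', exist_ok=True)

# Save the model. `model` is the transformer layer wrapped inside `model_v2` (they share weights),
# so saving it directly avoids looking the layer up by its auto-generated name
# (e.g. "tf_distil_bert_for_question_answering_5" in the training log above)
model.save_pretrained(os.path.join('models', 'distilbert_qa'))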

# Save the tokenizer
tokenizer.save_pretrained(os.path.join('tokenizers', 'distilbert_qa'))



### Testing on unseen data

In [None]:
model_v2.evaluate(test_dataset)

## Ask BERT a question ...

In [None]:
i = 5

# Define sample question
sample_q = test_questions[i]
# Define sample context
sample_c = test_contexts[i]
# Define sample answer 
sample_a = test_answers[i]

# Get the input in the format BERT accepts
sample_input = (test_encodings["input_ids"][i:i+1], test_encodings["attention_mask"][i:i+1])

def ask_bert(sample_input, tokenizer, model):
    """ This function takes an input, a tokenizer and a model, and returns the prediction """
    out = model.predict(sample_input)
    # Most likely start and end token positions, as plain Python integers
    pred_ans_start = int(tf.argmax(out[0][0]))
    pred_ans_end = int(tf.argmax(out[1][0]))
    print("{}-{} token positions contain the answer".format(pred_ans_start, pred_ans_end))
    ans_tokens = sample_input[0][0][pred_ans_start:pred_ans_end+1]
    
    return " ".join(tokenizer.convert_ids_to_tokens(ans_tokens))

print("Question")
print("\t", sample_q, "\n")
print("Context")
print("\t", sample_c, "\n")
print("Answer (char indexed)")
print("\t", sample_a, "\n")
print('='*50,'\n')

sample_pred_ans = ask_bert(sample_input, tokenizer, model_v2)

print("Answer (predicted)")
print(sample_pred_ans)
print('='*50,'\n')
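Taking the argmax of the start and end logits independently, as `ask_bert` does, can occasionally give an end position that lies before the start. One possible refinement, sketched here with made-up logits, is to search for the best-scoring pair with start <= end. `best_span` and `max_answer_len` are hypothetical names introduced for this example, not part of the transformers API.

```python
import numpy as np


def best_span(start_logits, end_logits, max_answer_len=30):
    """Return the (start, end) pair with start <= end that maximises the summed logits."""
    best, best_score = (0, 0), -np.inf
    for s in range(len(start_logits)):
        # Only consider ends at or after the start, within the length limit
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = start_logits[s] + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best


# Made-up logits where the independent argmaxes give an invalid span
start_logits = np.array([0.1, 0.2, 3.0, 0.5])
end_logits = np.array([0.3, 4.0, 1.0, 2.0])

naive = (int(np.argmax(start_logits)), int(np.argmax(end_logits)))
constrained = best_span(start_logits, end_logits)
print(naive, constrained)
```

The double loop is quadratic in the sequence length, which is usually acceptable for 512 tokens and a modest `max_answer_len`.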