# Use case: Using BERT to answer questions

## Introduction to the Hugging Face `transformers` library 

Few points on Hugging Face `transformers` library by author:<br>
(Source: NLP w/ TensorFlow by by Thushan Ganegedara)

- The `transformers` library is a high-level API that is built on top of TensorFlow, PyTorch, and JAX. 

- It provides easy access to pre-trained Transformer models that can be downloaded and fine-tuned with ease. You can find models in the Hugging Face’s model registry at https://huggingface.co/models. 

- You can filter models by task, examine the underlying deep learning frameworks, and more.

- The transformers library was designed with the aim of providing a very low barrier for entry to using complex Transformer models. 

* **

- For this reason, there’s only a handful of concepts that you need to learn in order to hit the ground running with the library. **Three important classes are required to load and use a model successfully:**
    - **Model class** (such as `TFBertModel`) - *Contains the trained weights of the model in the form of `tf.keras.models.Model` or the PyTorch equivalent.*
    
    - **Configuration** (such as `BertConfig`) - *Stores various parameters and hyperparameters needed to load the model. If you’re using the pre-trained model as is, you don’t need to explicitly define its configuration.*
    
    - **Tokenizer** (such as `BertTokenizerFast`) - *Contains the vocabulary and token-to-ID mapping needed to tokenize the words for the model.*
    
    
- All of these classes can be used with two straightforward functions:
    - `from_pretrained()` – Provides a way to instantiate a model/configuration/tokenizer available from the model repository or locally
    
    - `save_pretrained()` – Provides a way to save the model/configuration/tokenizer so that it can be reloaded later
    
* **

> 🗝️**Resource:** *Official TensorFlow text processing tutorials, includes:* `Text generation, Text classification, NLP with BERT, Embeddings` -> 🔗[**Link**](https://www.tensorflow.org/text/tutorials/)

* **

It is also important to note the side-effects of having such an easy-to-grasp interface for using models. Due to serving the very specific purpose of providing a way to use Transformer models built with TensorFlow, PyTorch, or Jax, you don’t have the modularity or flexibility found in TensorFlow, for example. In other words, you cannot use the transformers library in the same way you would use TensorFlow to build a tf.keras.models.Model using tf.keras.layers.Layer objects.

* **

## Imports

In [1]:
import pandas as pd
# ^^^ pyforest auto-imports - don't write above this line
import random
import numpy as np
import transformers
from datasets import load_dataset
from transformers import DistilBertTokenizerFast
from transformers import DistilBertConfig, TFDistilBertForQuestionAnswering
import tensorflow as tf
import time

def set_random_seed(seed):
    """ Setting the random seed of various libraries """
    try:
        np.random.seed(seed)
    except NameError:
        print("Warning: Numpy is not imported. Setting the seed for Numpy failed.")
    try:
        tf.random.set_seed(seed)
    except NameError:
        print("Warning: TensorFlow is not imported. Setting the seed for TensorFlow failed.")
    try:
        random.seed(seed)
    except NameError:
        print("Warning: random module is not imported. Setting the seed for random failed.")
    try:
        transformers.trainer_utils.set_seed(seed)
    except NameError:
        print("Warning: transformers module is not imported. Setting the seed for transformers failed.")
        
# Fixing the random seed
random_seed=4321
set_random_seed(random_seed)

physical_devices = tf.config.list_physical_devices('GPU')
print(physical_devices)

try:
    tf.config.experimental.set_memory_growth(physical_devices[0], True)
except:
    print("No GPU found!")
    pass

[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]


## Exploring the data

- [10 Question-Answering Datasets To Build Robust Chatbot Systems](https://analyticsindiamag.com/10-question-answering-datasets-to-build-robust-chatbot-systems/)

- [Hugging Face Tutorial on SQuAD_v2](https://huggingface.co/transformers/v3.3.1/custom_datasets.html#question-answering-with-squad-2-0)

- SQuAD2.0 dataset constains questions with no answers

- SQuAD2.0 combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. To do well on SQuAD2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering.

- [Official SQuAD2.0 Dataset Website](https://rajpurkar.github.io/SQuAD-explorer/)

>**For now I'm removing questions with no answers; as I can't figure out what to do while adding `answer_end` index for empty answers. I replaced `answer_start` & `answer_end` with NaN values, also `answer_text` with empty string. But was getting float error in `Dealing with truncated answers` section.**

### download & read from url 

In [2]:
# Shortcut way to download

# !mkdir squad
# !curl https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json -o squad/train-v2.0.json
# !curl https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json -o squad/dev-v2.0.json

In [1]:
import os
import requests
from tqdm import tqdm

# Create the "squad_v2" folder if it doesn't exist
if not os.path.exists("squad_v2"):
    os.makedirs("squad_v2")

# URLs of the SQuAD dataset
train_url = "https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json"
dev_url = "https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json"

# Function to download a file with a progress bar
def download_file(url, local_path):
    response = requests.get(url, stream=True)
    file_size = int(response.headers.get("content-length", 0))
    
    with open(local_path, "wb") as file, tqdm(
        desc=local_path,
        total=file_size,
        unit="B",
        unit_scale=True,
        unit_divisor=1024,
    ) as bar:
        for data in response.iter_content(chunk_size=1024):
            file.write(data)
            bar.update(len(data))

# Check if the train dataset is already downloaded
train_file_path = "squad_v2/train-v2.0.json"
if not os.path.exists(train_file_path):
    download_file(train_url, train_file_path)
    print("Train dataset downloaded successfully.")
else:
    print("Train dataset is already downloaded.")

# Check if the dev dataset is already downloaded
dev_file_path = "squad_v2/dev-v2.0.json"
if not os.path.exists(dev_file_path):
    download_file(dev_url, dev_file_path)
    print("Dev dataset downloaded successfully.")
else:
    print("Dev dataset is already downloaded.")

squad_v2/train-v2.0.json: 40.2MB [00:01, 37.2MB/s]                                                                     


Train dataset downloaded successfully.


squad_v2/dev-v2.0.json: 4.17MB [00:00, 14.6MB/s]                                                                       

Dev dataset downloaded successfully.





In [8]:
import json
from pathlib import Path

def read_squad(path):
    path = Path(path)
    with open(path, 'rb') as f:
        squad_dict = json.load(f)

    contexts = []
    questions = []
    answers = []
    for group in squad_dict['data']:
        for passage in group['paragraphs']:
            context = passage['context']
            for qa in passage['qas']:
                question = qa['question']
                for answer in qa['answers']:
                    contexts.append(context)
                    questions.append(question)
                    answers.append(answer)

    return contexts, questions, answers

train_contexts, train_questions, train_answers = read_squad('squad_v2/train-v2.0.json')
val_contexts, val_questions, val_answers = read_squad('squad_v2/dev-v2.0.json')

### from hugging face

In [2]:
# from datasets import load_dataset

dataset = load_dataset("squad_v2")
# dataset = load_dataset("squad")

print(dataset)

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 130319
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 11873
    })
})


### Print the first 5 samples in the training set

In [3]:
# Print the first 5 samples in the training set

train_data = dataset["train"]

for q, a in zip(train_data['question'][:5], train_data['answers'][:5]):
    print(f"Question:\n{q}\nAnswer:\n{a}", end="\n\n")

Question:
When did Beyonce start becoming popular?
Answer:
{'text': ['in the late 1990s'], 'answer_start': [269]}

Question:
What areas did Beyonce compete in when she was growing up?
Answer:
{'text': ['singing and dancing'], 'answer_start': [207]}

Question:
When did Beyonce leave Destiny's Child and become a solo singer?
Answer:
{'text': ['2003'], 'answer_start': [526]}

Question:
In what city and state did Beyonce  grow up? 
Answer:
{'text': ['Houston, Texas'], 'answer_start': [166]}

Question:
In which decade did Beyonce become famous?
Answer:
{'text': ['late 1990s'], 'answer_start': [276]}



- Here, `answer_start` indicates the character index at which this answer starts in the context provided.

- **When training the model, we will be asking the model to predict the start and end indices of the answer.** 

- In its original form, only the `answer_start` is present. We will need to manually add `answer_end` to our dataset.

### Adding `answer_end` index

The answers are provided by means of the, starting index (`answer_start`) and the answer it self (`text`). We will add `answer_end`, which will denote the index of the position the answer ends.

In [3]:
def compute_end_index(answers, contexts):
    """Add end index to answers"""
    
    modified_answers = []
    for answer, context in zip(answers, contexts):
        # we have some questions with no answers
        if len(answer['text']) == 0 or len(answer['answer_start'])==0:
            answer['text'] = ''
            answer['answer_start'] = np.NaN
            answer['answer_end'] = np.NaN
        else:
            # here we are replacing the list with just the element
            gold_text = answer['text'][0]
            answer['text'] = gold_text

            start_idx = answer['answer_start'][0]
            answer['answer_start'] = start_idx

            # Make sure the starting index is valid and there is an answer
            assert start_idx >=0 and len(gold_text.strip()) > 0

            end_idx = start_idx + len(gold_text)
            answer['answer_end'] = end_idx

            # Make sure the corresponding context matches the actual answer
            assert context[start_idx:end_idx] == gold_text
        
        modified_answers.append(answer)
    
    return modified_answers, contexts

In [4]:
train_questions = dataset["train"]["question"]

print("Training data corrections")

train_answers, train_contexts = compute_end_index(dataset["train"]["answers"], 
                                                  dataset["train"]["context"])

#################

test_questions = dataset["validation"]["question"]

print("Validation data correction")

test_answers, test_contexts = compute_end_index(dataset["validation"]["answers"], 
                                                dataset["validation"]["context"])

Training data corrections
Validation data correction


In [6]:
# def find_empty_ans_indices(answers):
#     return [idx for idx, answer in enumerate(answers) if answer['text'] == '']

# def remove_empty_answers(arr, empty_ans_idx, arr_name):
#     print(f"Removed {len(empty_ans_idx)} data points with no answer from {arr_name}")
#     return np.delete(arr, empty_ans_idx)

# train_empty_ans_idx = find_empty_ans_indices(train_answers)
# test_empty_ans_idx = find_empty_ans_indices(test_answers)

# print("No. of questions with no answers in SQuAD_v2 dataset:")
# print(f"\tTrain: {len(train_empty_ans_idx)}/{len(train_answers)}")
# print(f"\tTest: {len(test_empty_ans_idx)}/{len(test_answers)}\n")

# train_questions = remove_empty_answers(train_questions, train_empty_ans_idx, "train_questions")
# train_answers = remove_empty_answers(train_answers, train_empty_ans_idx, "train_answers")
# train_contexts = remove_empty_answers(train_contexts, train_empty_ans_idx, "train_contexts")

# print()

# test_questions = remove_empty_answers(test_questions, test_empty_ans_idx, "test_questions")
# test_answers = remove_empty_answers(test_answers, test_empty_ans_idx, "test_answers")
# test_contexts = remove_empty_answers(test_contexts, test_empty_ans_idx, "test_contexts")

## Implementing BERT

To use a pre-trained Transformer Model from Hugging Face, we need 3 components:

- `Tokenizer` - Responsible for splitting a long bit of text (such as a sentence) into smaller tokens

- `config` – Contains the configuration of the model

- `Model` – Takes in the tokens, looks up the embeddings, and produces the final output(s) using the provided inputs


We can ignore the config as we are using the pre-trained model as is. However, to paint a full picture, we will use the configuration nevertheless

## Implementing Tokenizer & Visualizing the token_mappings 

We will be using a Tokenizer called `bert-base-uncased`. It is the Tokenizer developed for the BERT base model and is uncased (that is, there's no distinction between uppercase and lowercase characters)

In [5]:
# download/define the tokenizer

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

In [8]:
# tokenizer in action

context = "This is the context"
question = "This is the question"

token_ids = tokenizer(text=context, text_pair=question,
                      padding=False, return_tensors='tf')

print(token_ids)

{'input_ids': <tf.Tensor: shape=(1, 11), dtype=int32, numpy=array([[ 101, 2023, 2003, 1996, 6123,  102, 2023, 2003, 1996, 3160,  102]])>, 'token_type_ids': <tf.Tensor: shape=(1, 11), dtype=int32, numpy=array([[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]])>, 'attention_mask': <tf.Tensor: shape=(1, 11), dtype=int32, numpy=array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])>}


Let’s understand the arguments provided to the tokenizer’s call:

- `text` – A single or batch of text sequences to be encoded by the tokenizer. Each text sequence is a string.

- `text_pair` – An optional single or batch of text sequences to be encoded by the tokenizer. 
    - *It’s useful in situations where the model takes a multi-part input (such as a question and a context in question-answering).*<br></br>

- `padding` – Indicates the padding strategy. 
    - If set to `True`, it will be padded to the maximum sequence length in the dataset. 
    
    - If set to `max_length`, it will be padded to the length specified by the `max_length` argument. 
    
    - If set to `False`, no padding will be done.

- `return_tensors` – An argument that defines the type of tensors returned. It could be either `pt` (PyTorch) or `tf` (TensorFlow). Since we want TensorFlow tensors, we define it as `'tf'`.


In [9]:
type(token_ids)

transformers.tokenization_utils_base.BatchEncoding

In [10]:
for key, value in token_ids.items():
    print(key, "-")
    print(value.numpy()[0], end="\n\n")

input_ids -
[ 101 2023 2003 1996 6123  102 2023 2003 1996 3160  102]

token_type_ids -
[0 0 0 0 0 0 1 1 1 1 1]

attention_mask -
[1 1 1 1 1 1 1 1 1 1 1]



This outputs a `transformers.tokenization_utils_base.BatchEncoding` object, **which is essentially a dictionary. It has three keys and tensors as values:**

- `input_ids` – 
    - Provides the IDs of the tokens for the tokens found in the text sequences. 
    
    - Additionally, it introduces the `[CLS]` token ID at the beginning of the sequence and 
    
    - two instances of the `[SEP]` token ID, one between the question and context, and the other one at the end.

- `token_type_ids` – This is the segment ID we use for the segment embedding.

- `attention_mask` – 
    - The attention mask represents the words that are allowed to be attended to during the forward pass. 
    
    - Since BERT is an encoder model, any token can pay attention to any other token. The only exception is the padded tokens that will be ignored during the attention mechanism.

In [11]:
# We could also convert these token IDs to actual tokens to know what they represent.
print("token_ids:")
print("\t", token_ids['input_ids'].numpy()[0])

print("\nmappings:")
print("\t", tokenizer.convert_ids_to_tokens(token_ids['input_ids'].numpy()[0]))

token_ids:
	 [ 101 2023 2003 1996 6123  102 2023 2003 1996 3160  102]

mappings:
	 ['[CLS]', 'this', 'is', 'the', 'context', '[SEP]', 'this', 'is', 'the', 'question', '[SEP]']


>🗝️ **We can see how the tokenizer inserts special tokens like `[CLS]` and `[SEP]` into the text sequence.**

## Encoding the train & test data using Tokenizer

In [6]:
# Encode train data
train_encodings = tokenizer(train_contexts, train_questions,
                            truncation=True, padding=True, 
                            return_tensors='tf')

print(f"train_encodings.shape: {train_encodings['input_ids'].shape}")


# Encode test data
test_encodings = tokenizer(test_contexts, test_questions,
                           truncation=True, padding=True,
                           return_tensors='tf')

print(f"test_encodings.shape: {test_encodings['input_ids'].shape}")

train_encodings.shape: (130319, 512)
test_encodings.shape: (11873, 512)


The maximum sequence length in our dataset is 512. Therefore, we see that the maximum length of the sequences is 512.

## Dealing with truncated answers

Once we tokenize our data, we need to perform one more data processing step. 


- In the original dataset the `answer_start` and `answer_end` denote the `character-level position` of the answer. But in the model, since we deal in tokens we need the` token-level position` of the answer. 

- For that, we will use the `char_to_token` function in the `tokenizer`. It will convert the character index to a token index.

- Because we are enforcing a maximum sequence length of 512, some answers will be inevitably truncated if they are present after the 512th token. Although this is rare, we still need to take care of this as it can result in numerical errors otherwise. Therefore, if the positions are `None` (i.e. couldn't find the answer), it is set to the maximum position.

In [13]:
# def replace_char_with_token_indices(encodings, answers):
#     start_positions = []
#     end_positions = []
#     n_updates = 0
    
#     # Go through all the answers
#     for i in range(len(answers)):
#         # if answers[i]['text']=='':
#         #     continue
        
#         # get the token position for both start & end char positions
#         start_positions.append(encodings.char_to_token(i, answers[i]['answer_start']))
        
#         end_positions.append(encodings.char_to_token(i, answers[i]['answer_end']-1))
        
        
#         if start_positions[-1] is None or end_positions[-1] is None:
#             n_updates += 1
            
#         # if start position is None, the answer passage has been truncated
#         # In the guide, https://huggingface.co/transformers/custom_datasets.html#qa-squad
#         # they set it to model_max_length, but this will result in NaN losses as the last
#         # available label is model_max_length-1 (zero-indexed)
        
#         if start_positions[-1] is None:
#             start_positions[-1] = tokenizer.model_max_length - 1
            
#         if end_positions[-1] is None:
#             end_positions[-1] = tokenizer.model_max_length - 1
            
#     print("{}/{} had answers truncated".format(n_updates, len(answers)))
    
#     encodings.update({'start_positions': start_positions, 'end_positions': end_positions})

In [7]:
def replace_char_with_token_indices(encodings, answers):
    start_positions = []
    end_positions = []
    n_updates = 0

    # Go through all the answers
    for i in range(len(answers)):
        if pd.notna(answers[i]['answer_start']) and pd.notna(answers[i]['answer_end']):
            # get the token position for both start & end char positions
            start_positions.append(encodings.char_to_token(i, answers[i]['answer_start']))
            end_positions.append(encodings.char_to_token(i, answers[i]['answer_end'] - 1))
        else:
            # Handle the case where the answer is empty or the positions are NaN
            start_positions.append(None)
            end_positions.append(None)

        if start_positions[-1] is None or end_positions[-1] is None:
            n_updates += 1

        # if start position is None, the answer passage has been truncated
        # In this case, you can set it to some specific value or leave it as None
        # Here, we set it to model_max_length - 1 if it's None
        if start_positions[-1] is None:
            start_positions[-1] = tokenizer.model_max_length - 1

        if end_positions[-1] is None:
            end_positions[-1] = tokenizer.model_max_length - 1

    print("{}/{} had answers truncated".format(n_updates, len(answers)))

    encodings.update({'start_positions': start_positions, 'end_positions': end_positions})


- This function takes in a set of `BatchEncodings` called `encodings` generated by the tokenizer and a set of answers (a list of dictionaries). 

- Then it updates the provided encodings with two new keys: `start_positions` and `end_positions`. 

- *These keys respectively hold the token-based indices denoting the start and end of the answer. If the answer is not found, we set the start and end indices to the last token.*

- To convert our existing character-based indices to token-based indices, we use a function called `char_to_token()` provided by the `BatchEncodings` class. 

- It takes a character index as the input and provides the corresponding token index as the output.

In [8]:
replace_char_with_token_indices(train_encodings, train_answers)
replace_char_with_token_indices(test_encodings, test_answers)

43508/130319 had answers truncated
5951/11873 had answers truncated


## Defining a TensorFlow dataset

- Next, let’s implement a TensorFlow dataset to generate the data for the model. **Our data will consist of two tuples: one containing inputs and the other containing the targets.** 

- The input tuple contains:
    - Input token IDs – A batch of padded token IDs of size `[batch size, sequence_len]`
    - Attention mask – A batch of attention masks of size `[batch size, seq_len]`

- The output tuple contains:
    - Start index of the answer – A batch of start indices of the answer
    - End index of the answer – A batch of end indices of the answer
    
    
[`from functools import partial` - GFG](https://www.geeksforgeeks.org/partial-functions-python/)

In [10]:
import tensorflow as tf
from functools import partial


train_batch_size = 4
test_batch_size = 8


def data_gen(input_ids, attention_mask, start_positions, end_positions):
    """Generator for data"""
    for inps, attn, start_pos, end_pos in zip(input_ids, attention_mask, 
                                              start_positions, end_positions):
        yield (inps, attn),(start_pos, end_pos)
        
        
print("Creating train & validation data")

# Define the generator as a callable (not the generator itself)
# We define a partial func. that we can simply call later without passing any arguments:
train_data_gen = partial(data_gen,
                         input_ids = train_encodings['input_ids'], 
                         attention_mask = train_encodings['attention_mask'], 
                         start_positions = train_encodings['start_positions'], 
                         end_positions = train_encodings['end_positions'])

# Define the dataset
train_dataset = tf.data.Dataset.from_generator(
                    train_data_gen,
                    output_types = (('int32', 'int32'),('int32', 'int32'))
                )

# We then shuffle the data in our training dataset. 
# When shuffling a TF dataset we need to provide a buffer_size. 
# buffer_size defines how many samples are chosen to shuffle. Here 1000 samples:
train_dataset = train_dataset.shuffle(1000)

# Splitting train data into train-set & validation-set

# Valid set is taken as the first 10000 samples in the shuffled set
valid_dataset = train_dataset.take(10000)
valid_dataset = valid_dataset.batch(train_batch_size)

# Rest is kept as the training data
train_dataset = train_dataset.skip(10000)
train_dataset = train_dataset.batch(train_batch_size)

print('\tDone')

# Creating test data
print("Creating test data")

# Define the generator as a callable
test_data_gen = partial(data_gen,
                        input_ids=test_encodings['input_ids'], 
                        attention_mask=test_encodings['attention_mask'],
                        start_positions=test_encodings['start_positions'], 
                        end_positions=test_encodings['end_positions']
                       )

test_dataset = tf.data.Dataset.from_generator(
    test_data_gen, output_types=(('int32', 'int32'), ('int32', 'int32'))
)

test_dataset = test_dataset.batch(test_batch_size)
print("\tDone")

Creating train & validation data
	Done
Creating test data
	Done


## BERT for QnA. 

**Modifications that we'll introduce on the top of pre-trained BERT Model to leverage it for Question-Answering:**

- *First, model takes in a question followed by a context, the context may or may not contain the answer to the queation.*

- *i/p format:* `[CLS] <question_token> [SEP] <context_token> [SEP]`

- Then, for each token position of the context, we have two classification heads predicting a probability: 
    - *One head predicts the probability of each context token being the start of the answer, whereas* 
    - *the other one predicts the probability of each context token being the end of the answer.*<br></br>
    
- *Once we figure out the start and end indices of the answer, we can simply extract the answer from the context using those indices.*


<div align='center'>
    <img src='images/bert_q_a.png' title="NLP w/ TensorFlow by Thushan Ganegedara"/>
</div>

### Defining the Config & the Model

Here we define a **DistilBert model** (particularly a TF variant)

In Hugging Face, we have several variants of each Transformer model. These variants are based on different tasks solved by these models. 

For example, for BERT we have:
- `TFBertForPretraining` – The pre-trained model without a task-specific head
- `TFBertForSequenceClassification` – Used for classifying a sequence of text
- `TFBertForTokenClassification` – Used for classifying each token in the sequence of text
- `TFBertForMultipleChoice` – Used for answering multiple-choice questions
- `TFBertForQuestionAnswering` – Used for extracting answers to a question from a given context
- `TFBertForMaskedLM` – Used for pre-training BERT on the masked language modeling task
- `TFBertForNextSentencePrediction` – Used for pre-training BERT to predict the next sentence

we are interested in `TFBertForQuestionAnswering`:

In [12]:
from transformers import BertConfig, TFBertForQuestionAnswering

config = BertConfig.from_pretrained("bert-base-uncased", return_dict=False)
print(config, end='\n\n')

model = TFBertForQuestionAnswering.from_pretrained("bert-base-uncased", config=config)

BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "return_dict": false,
  "transformers_version": "4.33.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}




Downloading model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

All PyTorch model weights were used when initializing TFBertForQuestionAnswering.

Some weights or buffers of the TF 2.0 model TFBertForQuestionAnswering were not initialized from the PyTorch model and are newly initialized: ['qa_outputs.weight', 'qa_outputs.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


>🗝️**Above warning:** *This is expected and totally fine. It’s saying that there are some layers that have not been initialized from the pre-trained model; the output heads of the model need to be introduced as new layers, thus they are not pre-initialized.*

After that, we will define a function that will wrap the returned model as a `tf.keras.models.Model object`. We need to perform this step because if we try to use the model as it is, TensorFlow returns the following error: 
```
TypeError: The two structures don't have the same sequence type. Input structure has type <class 'tuple'>, while shallow structure has type 
<class 'transformers.modeling_tf_outputs.TFQuestionAnsweringModelOutput'>.
```

Therefore, we will define 2 i/p layers: one takes in the i/p token IDs(`input_ids`) and the other takes the `attention_mask` and passes it to the model. Finally, we get the output of the model. We then define a `tf.keras.models.Model` using these inputs and output:

In [18]:
model.input

[<KerasTensor: shape=(None, None) dtype=int32 (created by layer 'input_ids')>,
 <KerasTensor: shape=(None, None) dtype=int32 (created by layer 'attention_mask')>]

In [19]:
def tf_wrap_model(model):
    """Wraps the huggingface's model with in the Keras Functional API"""
    
    # If this is not wrapped in a keras model by taking the correct tensors from
    # TFQuestionAnsweringModelOutput produced, you will get the following error
    # setting return_dict did not seem to work as it should
    
    # Define inputs
    input_ids = tf.keras.layers.Input([None,], dtype=tf.int32, name='input_ids')
    attention_mask = tf.keras.layers.Input([None,], dtype=tf.int32, name="attention_mask")
    
    # Define the output (TFQuestionAnsweringModelOutput)
    # out = model([input_ids, attention_mask])
    start_logits, end_logits = model([input_ids, attention_mask])
    
    # Get the correct attributes in the produced object to generate an output tuple
    wrap_model = tf.keras.models.Model(inputs=[input_ids, attention_mask],
                                       # outputs=(out.start_logits, out.end_logits)
                                       outputs=[start_logits, end_logits]
                                      )
    return wrap_model

- *As we learned when studying the structure of the model, the question-answering BERT has two heads:* 
    - *one to predict the starting index of the answer and*
    - *the other to predict the end.* 
    
- **Therefore, we have to optimize 2 losses coming from the two heads. This means we need to add the two losses to get the final loss. When we have a multi-output model such as this, we can pass multiple loss functions aimed at each output head.** 

- *Here, we define a single loss function. This means the same loss will be used across both heads and will be summed to generate the final loss:*

In [20]:
# Define and compile the model

# Keras will assign a separate loss for each output and add them together. 

# So we'll just use the standard CE loss instead of using the built-in 
# model.compute_loss, which expects a dict of outputs and averages the two terms.

# Note that this means the loss will be 2x of when using TFTrainer 
# since we're adding instead of averaging them.

loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
acc = tf.keras.metrics.SparseCategoricalAccuracy()
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5)

model_v2 = tf_wrap_model(model)
model_v2.compile(optimizer=optimizer, loss=loss, metrics=[acc])

In [21]:
model_v2.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_ids (InputLayer)         [(None, None)]       0           []                               
                                                                                                  
 attention_mask (InputLayer)    [(None, None)]       0           []                               
                                                                                                  
 tf_bert_for_question_answering  ((None, None),      108893186   ['input_ids[0][0]',              
  (TFBertForQuestionAnswering)   (None, None))                    'attention_mask[0][0]']         
                                                                                                  
Total params: 108,893,186
Trainable params: 108,893,186
Non-trainable params: 0
______________

In [23]:
model.summary()

Model: "tf_bert_for_question_answering"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 bert (TFBertMainLayer)      multiple                  108891648 
                                                                 
 qa_outputs (Dense)          multiple                  1538      
                                                                 
Total params: 108,893,186
Trainable params: 108,893,186
Non-trainable params: 0
_________________________________________________________________


### Training the model

In [24]:
import time

t1 = time.time()

model_v2.fit(
    train_dataset,
    validation_data=valid_dataset,
    epochs=2
)

t2 = time.time()

print(f"\nTraining Time: {((t2 - t1) / 60):.2f} mins\n")

Epoch 1/2
Epoch 2/2

Training Time: 342.82 mins



>Validation accuracy between ~67% and ~70%. This is quite high, given we only trained the model for 2 epochs. This performance can be attributed to the high level of language understanding the pre-trained model already had when we downloaded it.

### Save the model

In [26]:
print(model_v2.summary())

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_ids (InputLayer)         [(None, None)]       0           []                               
                                                                                                  
 attention_mask (InputLayer)    [(None, None)]       0           []                               
                                                                                                  
 tf_bert_for_question_answering  ((None, None),      108893186   ['input_ids[0][0]',              
  (TFBertForQuestionAnswering)   (None, None))                    'attention_mask[0][0]']         
                                                                                                  
Total params: 108,893,186
Trainable params: 108,893,186
Non-trainable params: 0
______________

In [27]:
import os

# Create folders
if not os.path.exists('models'):
    os.makedirs('models')
if not os.path.exists('tokenizers'):
    os.makedirs('tokenizers')
    

# Save the model
model_v2.get_layer("tf_bert_for_question_answering").save_pretrained(os.path.join('models', 'bert_qa'))

# Save the tokenizer
tokenizer.save_pretrained(os.path.join('tokenizers', 'bert_qa'))

('tokenizers\\bert_qa\\tokenizer_config.json',
 'tokenizers\\bert_qa\\special_tokens_map.json',
 'tokenizers\\bert_qa\\vocab.txt',
 'tokenizers\\bert_qa\\added_tokens.json',
 'tokenizers\\bert_qa\\tokenizer.json')

## Testing on unseen data

$\sim 60\% accuracy$ on test-data

In [28]:
model_v2.evaluate(test_dataset)



[3.9819936752319336,
 1.9975428581237793,
 1.9844495058059692,
 0.5904152393341064,
 0.5985850095748901]

## Ask BERT a question...

In [30]:
i = 7

sample_q = test_questions[i]
sample_c = test_contexts[i]
sample_a = test_answers[i]

# Get the input in the format BERT accepts:
# The input to the model needs to have a batch dimension. 
# Therefore we use the [i:i+1] syntax to make sure the batch dimension is not flattened:

sample_input = (test_encodings["input_ids"][i:i+1], 
                test_encodings["attention_mask"][i:i+1])

def ask_bert(sample_input, tokenizer, model):
    """ This function takes an input, a tokenizer, a model and returns the prediciton """
    out = model.predict(sample_input)
    pred_ans_start = tf.argmax(out[0][0])
    pred_ans_end = tf.argmax(out[1][0])
    
    print(f"{pred_ans_start}-{pred_ans_end} token ids contain the answer")
    
    ans_tokens = sample_input[0][0][pred_ans_start:pred_ans_end-1]
    
    return " ".join(tokenizer.convert_ids_to_tokens(ans_tokens))


print("Question:")
print("\t", sample_q, "\n")

print("Context")
print("\t", sample_q, "\n")

print("Answer (char indexed)")
print("\t", sample_a, "\n")
print('='*50,'\n')

sample_pred_ans = ask_bert(sample_input, tokenizer, model_v2)

print("Answer (predicted)")
print(sample_pred_ans)
print('='*50,'\n')

Question:
	 Who did King Charles III swear fealty to? 

Context
	 Who did King Charles III swear fealty to? 

Answer (char indexed)
	 {'text': '', 'answer_start': nan, 'answer_end': nan} 


73-3 token ids contain the answer
Answer (predicted)




In [38]:
i = 1

sample_q = test_questions[i]
sample_c = test_contexts[i]
sample_a = test_answers[i]

sample_input = (test_encodings["input_ids"][i:i+1], 
                test_encodings["attention_mask"][i:i+1])


print("Question:")
print("\t", sample_q, "\n")

print("Context")
print("\t", sample_q, "\n")

print("Answer (char indexed)")
print("\t", sample_a, "\n")
print('='*50,'\n')

sample_pred_ans = ask_bert(sample_input, tokenizer, model_v2)

print("Answer (predicted)")
print(sample_pred_ans)
print('='*50,'\n')

Question:
	 When were the Normans in Normandy? 

Context
	 When were the Normans in Normandy? 

Answer (char indexed)
	 {'text': '10th and 11th centuries', 'answer_start': 94, 'answer_end': 117} 


28-31 token ids contain the answer
Answer (predicted)
10th and

