## Recap

In the previous two modules we discussed how to represent and model text using transformer models. In this module we will combine those topics to fine-tune a transformer model for question answering using the Hugging Face `transformers` library and PyTorch. The code in this example is based on the official Hugging Face tutorial, which can be found [here](https://huggingface.co/transformers/custom_datasets.html#qa-squad).

## What is [Question Answering](https://en.wikipedia.org/wiki/Question_answering#:~:text=Question%20answering%20(QA)%20is%20a,humans%20in%20a%20natural%20language)?

Question answering is concerned with building models that answer questions posed by humans. It is an important NLP task that requires a model to understand a human-written question, map the question text to its own knowledge of the world, and generate an understandable natural language response. Due to its complexity, question answering is often used to test the robustness of natural language understanding models.

While the task is not fully solved, current state-of-the-art transformer models perform well when fine-tuned on benchmark extractive question answering datasets such as SQuAD v2. Extractive QA involves answering a question about a passage by highlighting the segment of the passage that answers the question. These models can help build more contextual bot applications that automate important yet time-consuming industry tasks, from helping primary care physicians, lawyers, and customer support centers triage common questions to better indexing information.
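
To make the extractive setup concrete, here is a minimal sketch in plain Python. The passage, question and answer text are illustrative values chosen for this example, written in the style of a SQuAD annotation rather than taken from the dataset.

```python
# Toy illustration: in extractive QA the answer is a span of the passage,
# identified by a start offset and an end offset.
passage = "Microsoft Corporation is headquartered in Redmond, Washington."
question = "Where is Microsoft headquartered?"

# Illustrative annotation: the answer text plus its character offset.
answer_text = "Redmond, Washington"
answer_start = passage.index(answer_text)
answer_end = answer_start + len(answer_text)  # exclusive end offset

# "Highlighting" the answer is just slicing the passage.
highlighted = passage[answer_start:answer_end]
print(question, "->", highlighted)
```

A model for this task does not generate free text. It only has to predict where the span starts and where it ends.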

In the following notebook we will apply what we learned about contextual embeddings and transformers to build our own Q&A model with PyTorch and the Hugging Face library.


## Running a PreTrained DistilBERT Question Answering Model Using HuggingFace 

Model distillation is a neural model compression technique in which a small network (the student) is taught by a larger trained neural network (the teacher). [DistilBERT](https://medium.com/huggingface/distilbert-8cf3380435b5) is a small, fast, cheap and light Transformer model provided by Hugging Face, trained by distilling BERT base. It has 40% fewer parameters than bert-base-uncased and runs 60% faster while preserving over 95% of BERT's performance as measured on the GLUE language understanding benchmark.
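
The teacher-student idea can be sketched with a few lines of arithmetic. The example below is a simplified illustration in plain Python, not DistilBERT's actual training code: the logits and the temperature value are made up, and the real training objective combines a soft-target loss of this kind with other terms.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence between the softened teacher and student distributions."""
    p = softmax(teacher_logits, temperature)  # soft targets from the teacher
    q = softmax(student_logits, temperature)  # student predictions
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Made-up logits over three classes.
teacher = [4.0, 1.0, 0.5]
good_student = [3.5, 1.2, 0.4]   # roughly agrees with the teacher
poor_student = [0.5, 1.0, 4.0]   # disagrees with the teacher

print(distillation_loss(teacher, good_student))
print(distillation_loss(teacher, poor_student))
```

The student that mimics the teacher's distribution receives the smaller loss, which is the signal used to train the smaller network.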

In this section we will show how to load a pretrained DistilBERT model for question answering and use it to answer our own sample question about Microsoft, building on what we've learned about contextual embeddings and transformer models.



First we must prepare the input context and question for our pretrained model. For the context I've taken the first two sentences from the Microsoft Wikipedia page, and the question simply asks what Microsoft manufactures.

In [None]:
context = "Microsoft Corporation is an American multinational technology company with headquarters in Redmond, Washington. It develops, manufactures, licenses, supports, and sells computer software, consumer electronics, personal computers, and related services."
question = "What does Microsoft manufacture?"

Next we need to tokenize our input text with the DistilBERT tokenizer, which converts the question and context into the token ids and attention mask that the model expects.

In [None]:
!pip install transformers==4.1.1

In [None]:
from transformers import DistilBertTokenizer
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased',return_token_type_ids = True)
encoding = tokenizer.encode_plus(question, context)
input_ids, attention_mask = encoding["input_ids"], encoding["attention_mask"]

Once we've encoded our inputs we need to load our pretrained DistilBERT model and set it to evaluation mode.

In [None]:
from transformers import DistilBertForQuestionAnswering
model = DistilBertForQuestionAnswering.from_pretrained('distilbert-base-uncased-distilled-squad')
model.eval()

We can then pass our represented inputs to the model and get the predicted spans of our context that contain the answer to our question.

In [None]:
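import torch
# return_dict=False makes the model return a (start_logits, end_logits) tuple that can be unpacked
start_scores, end_scores = model(torch.tensor([input_ids]), attention_mask=torch.tensor([attention_mask]), return_dict=False)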

Next we want to extract the token ids from the context that our model is most confident contain the correct answer.

In [None]:
ans_tokens = input_ids[torch.argmax(start_scores) : torch.argmax(end_scores)+1]
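
The span selection step can be illustrated without a model. The sketch below uses made-up start and end scores over a short token list. The `best_span` helper is hypothetical: unlike a plain pair of `argmax` calls, it also guards against the case where the highest-scoring end position falls before the start.

```python
def best_span(start_scores, end_scores, max_len=15):
    """Return (start, end) maximising start_score + end_score with start <= end."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_scores):
        # Only consider ends at or after the start, within max_len tokens.
        for e in range(s, min(s + max_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

# Made-up tokens and scores, one score per token.
tokens = ["[CLS]", "what", "is", "made", "?", "[SEP]",
          "it", "makes", "software", "and", "hardware", "[SEP]"]
start_scores = [0.1, 0.0, 0.0, 0.2, 0.0, 0.0, 0.3, 0.5, 4.0, 0.2, 1.0, 0.0]
end_scores = [0.1, 0.0, 0.0, 0.1, 0.0, 0.0, 0.2, 0.3, 1.5, 0.4, 3.8, 0.0]

start, end = best_span(start_scores, end_scores)
answer = tokens[start : end + 1]  # the end index is inclusive, hence the +1
print(answer)
```

For the well-behaved scores above this picks the same span as two independent `argmax` calls would. The extra constraint only matters when the two maxima are inconsistent.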

Once we have these token ids we want to convert them back to their tokens.

In [None]:
answer_tokens = tokenizer.convert_ids_to_tokens(ans_tokens , skip_special_tokens=True)

Then from the tokens we can finally extract the plain text response.

In [None]:
answer_tokens_to_string = tokenizer.convert_tokens_to_string(answer_tokens)

print("Context:\n", context)
print("Question:\n", question)
print("Encoding:\n", encoding)
print("Model Answer:\n", answer_tokens_to_string)

From here we see that our pretrained model correctly answers our question.

## Fine Tuning Our Model on a Custom Dataset

Now that we have demonstrated that our pretrained question answering model works, I will show how to fine-tune the model on a custom dataset using PyTorch and SQuAD v2, based on the [HuggingFace documentation](https://huggingface.co/transformers/custom_datasets.html#qa-squad).

First we must download the SQuAD v2 dataset. While this dataset can be loaded using the Hugging Face `datasets` library, we will download and process the train and dev files directly to show how you could later train on your own data.


In [None]:
! mkdir squad
! wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json -O squad/train-v2.0.json
! wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json -O squad/dev-v2.0.json

Next we need to define a helper function to load the SQuAD data from JSON into memory.

In [None]:
import json
from pathlib import Path

def read_squad(path):
    path = Path(path)
    with open(path, 'rb') as f:
        squad_dict = json.load(f)

    contexts, questions, answers = [], [], []
    for group in squad_dict['data']:
        for passage in group['paragraphs']:
            context = passage['context']
            for qa in passage['qas']:
                question = qa['question']
                for answer in qa['answers']:
                    contexts.append(context)
                    questions.append(question)
                    answers.append(answer)

    return contexts, questions, answers
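
For reference, the nesting that `read_squad` walks through looks roughly like the toy dictionary below. The passage and questions are invented for illustration. Only the key names (`data`, `paragraphs`, `context`, `qas`, `question`, `answers`, `text`, `answer_start`) mirror the ones accessed in the function above. SQuAD v2 also contains unanswerable questions whose `answers` list is empty, so the innermost loop skips them.

```python
# Toy SQuAD-style structure (invented content, real key names).
toy_squad = {
    "data": [
        {
            "title": "Example",
            "paragraphs": [
                {
                    "context": "The sky is blue. Grass is green.",
                    "qas": [
                        {
                            "question": "What colour is the sky?",
                            "answers": [{"text": "blue", "answer_start": 11}],
                        },
                        {
                            # Unanswerable question: no answers to iterate over.
                            "question": "What colour is the sea?",
                            "answers": [],
                        },
                    ],
                }
            ],
        }
    ]
}

# Same flattening logic as the loader: one row per (context, question, answer).
rows = [
    (passage["context"], qa["question"], answer)
    for group in toy_squad["data"]
    for passage in group["paragraphs"]
    for qa in passage["qas"]
    for answer in qa["answers"]
]
print(len(rows), rows[0][2])
```

Because the context is repeated for every answer, the three returned lists stay aligned index by index.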

SQuAD and SQuAD v2 are known to have some alignment mistakes in their training data. Hopefully this will not happen with your custom data, but if it does, the following alignment functions are a good reference.
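
To see what such a misalignment looks like, consider the toy example below. The `align_answer` helper and the sentence are hypothetical and only meant to illustrate the idea of trying the annotated offset first and then shifting left by one or two characters.

```python
def align_answer(context, gold_text, start_idx, max_shift=2):
    """Return corrected (start, end) character offsets, or None if no match.

    Tries the annotated start first, then shifts left by up to max_shift characters.
    """
    for shift in range(max_shift + 1):
        start = start_idx - shift
        end = start + len(gold_text)
        if start >= 0 and context[start:end] == gold_text:
            return start, end
    return None

context = "The Eiffel Tower is in Paris."
# The annotated start (24) is one character too far to the right; 23 is correct.
print(align_answer(context, "Paris", 24))
```

Returning `None` when nothing matches makes it explicit that an annotation could not be repaired, so such examples can be inspected or dropped.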

In [None]:
import torch

def add_end_idx(answers, contexts):
    """
    Add 'answer_end' to each answer; SQuAD answers are sometimes off by a character or two, so fix this.
    """
    for answer, context in zip(answers, contexts):
        gold_text = answer['text']
        start_idx = answer['answer_start']
        end_idx = start_idx + len(gold_text)

        if context[start_idx:end_idx] == gold_text:
            answer['answer_end'] = end_idx
        elif context[start_idx-1:end_idx-1] == gold_text:
            answer['answer_start'] = start_idx - 1
            answer['answer_end'] = end_idx - 1     # When the gold label is off by one character
        elif context[start_idx-2:end_idx-2] == gold_text:
            answer['answer_start'] = start_idx - 2
            answer['answer_end'] = end_idx - 2     # When the gold label is off by two characters

def add_token_positions(encodings, answers, tokenizer):
    """
    helper function to keep track of token positions
    """
    start_positions = []
    end_positions = []
    for i in range(len(answers)):
        start_positions.append(encodings.char_to_token(i, answers[i]['answer_start']))
        end_positions.append(encodings.char_to_token(i, answers[i]['answer_end'] - 1))
        # if None, the answer passage has been truncated
        if start_positions[-1] is None:
            start_positions[-1] = tokenizer.model_max_length
        if end_positions[-1] is None:
            end_positions[-1] = tokenizer.model_max_length
    encodings.update({'start_positions': start_positions, 'end_positions': end_positions})
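
The mapping from character offsets to token positions can be pictured with a toy tokenizer. In the sketch below, `tokenize_with_offsets` and this `char_to_token` are hypothetical stand-ins written for illustration. They split on whitespace, whereas the real tokenizer uses WordPiece, but the idea of looking up which token covers a given character is the same.

```python
import re

def tokenize_with_offsets(text):
    """Toy whitespace tokenizer returning (token, start_char, end_char) triples."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]

def char_to_token(offsets, char_idx):
    """Index of the token covering char_idx, or None if no token covers it."""
    for i, (_, start, end) in enumerate(offsets):
        if start <= char_idx < end:
            return i
    return None

context = "The Eiffel Tower is in Paris ."
offsets = tokenize_with_offsets(context)

answer_start, answer_end = 4, 16  # character span of "Eiffel Tower" (end exclusive)
start_token = char_to_token(offsets, answer_start)
end_token = char_to_token(offsets, answer_end - 1)  # last character of the answer
print(start_token, end_token)
```

This also shows why the end position is looked up at `answer_end - 1`: the exclusive end offset points at the character after the answer, which may be whitespace and belong to no token.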


Once we've defined our helper functions we can create our own custom PyTorch Dataset class for training on the SQuAD v2 data. It takes as input the path to the SQuAD JSON file and a tokenizer.

In [None]:
class SquadDataset(torch.utils.data.Dataset):
    def __init__(self, path, tokenizer):
        contexts, questions, answers = read_squad(path) # process the json file
        add_end_idx(answers, contexts) # align mistakes
        self.encodings = tokenizer(contexts, questions, truncation=True, padding=True) # tokenize data
        add_token_positions(self.encodings, answers, tokenizer) # add encoded token positions 

    def __getitem__(self, idx):
        return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}

    def __len__(self):
        return len(self.encodings.input_ids)
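
The indexing pattern used by this class can be seen in isolation with a stripped-down version. The `ToyEncodingsDataset` class and the token ids below are made up for illustration, and the sketch leaves out torch and the tokenizer entirely.

```python
class ToyEncodingsDataset:
    """Same indexing pattern as the dataset class, without torch or a tokenizer."""

    def __init__(self, encodings):
        # Dict mapping a field name to a list with one entry per example.
        self.encodings = encodings

    def __getitem__(self, idx):
        # Pick the idx-th entry of every field to build a single example.
        return {key: values[idx] for key, values in self.encodings.items()}

    def __len__(self):
        return len(self.encodings["input_ids"])

# Made-up values standing in for tokenizer output plus answer positions.
toy_encodings = {
    "input_ids": [[101, 7592, 102], [101, 2088, 102]],
    "attention_mask": [[1, 1, 1], [1, 1, 1]],
    "start_positions": [1, 1],
    "end_positions": [1, 1],
}

dataset = ToyEncodingsDataset(toy_encodings)
example = dataset[1]
print(len(dataset), example)
```

Storing the data column by column and slicing per index is what lets a `DataLoader` batch the examples later on.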

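We can then use this class to process our training and validation data. The `char_to_token` method used in `add_token_positions` is only available on fast tokenizers, so we load `DistilBertTokenizerFast` here rather than reusing the slow tokenizer from above.

In [None]:
from transformers import DistilBertTokenizerFast

# char_to_token requires a fast (Rust-backed) tokenizer
fast_tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
train_dataset = SquadDataset('squad/train-v2.0.json', fast_tokenizer)
val_dataset = SquadDataset('squad/dev-v2.0.json', fast_tokenizer)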

Now that our dataset is processed, we can fine-tune the pretrained model we used before with a standard PyTorch training loop. Note that, due to the size of SQuAD, this code may take some time to train.

In [None]:
import torch
from torch.utils.data import DataLoader
from transformers import AdamW

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

model.to(device)
model.train()

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)

optim = AdamW(model.parameters(), lr=5e-5)

for epoch in range(3):
    print("Epoch #", epoch)
    for i, batch in enumerate(train_loader):
        optim.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        start_positions = batch['start_positions'].to(device)
        end_positions = batch['end_positions'].to(device)
        outputs = model(input_ids, attention_mask=attention_mask, start_positions=start_positions, end_positions=end_positions)
        loss = outputs[0]
        print(f'Step {i} - loss: {loss:.3}')
        loss.backward()
        optim.step()

model.eval()

And there you have it: we now have a fine-tuned DistilBERT QA model that we trained on our own custom data. You should now have all the tools you need to train your own question answering models, as well as a strong fundamental understanding of the concepts of text representation, language modeling and neural natural language processing.