# How to Build a QA System with BERT on Wikipedia
> A high-level walk-through of building an IR-based QA system.

- toc: true 
- badges: true
- comments: true
- hide: true
- permalink: /hidden/
- search_exclude: false
- categories: [jupyter]

# So you've decided to build a QA system. 
You want to start with something general and straightforward, so you plan to make it open-domain, using Wikipedia as the corpus for answering questions. You're going to use an IR-based design (see previous post) since you're working with a large collection of unstructured text. You want to use the best NLP that your compute resources allow (you're lucky enough to have access to a GPU), so you're going to focus on the big, flashy Transformer models that are all the rage these days. 

Sounds like a plan! So where do you start? 

This was our thought process when we first set out on this research path. In this post we'll cover what you need to know to get going:

- installing libraries and setting up an environment
- understanding Hugging Face's `run_squad.py` training script
- understanding the basic ins and outs of a BERT-esque model
- getting BERT to accept a full Wikipedia article as context for a question


## Setting up your virtual environment
A virtual environment is always best practice, and we're using `venv` (though Melanie is also partial to `conda`). Here's the bare minimum you'll need to do what we did. For this project we'll be using PyTorch (though everything we do can also be accomplished in TensorFlow). PyTorch handles the heavy lifting of deep learning: tensors, automatic differentiation, and GPU acceleration. Transformers is a library by Hugging Face that provides easy-to-use implementations (in PyTorch) of all the popular Transformer architectures (more on this later). The `wikipedia` package is a Python wrapper around the Wikipedia API that we'll use to search for and fetch articles. 

- PyTorch 
- Transformers
- Wikipedia


### TODO - torch for gpu vs no gpu
`venv` comes standard with Python 3 installations. Other great options include `virtualenv` and Anaconda. 
You can recreate our environment by running the following commands in your terminal (Linux/macOS):

``` bash
$ python3 -m venv myenv
$ source myenv/bin/activate
$ pip install torch
$ pip install transformers
$ pip install wikipedia
```

In [2]:
import torch
import wikipedia as wiki

A side note: our GPU machine sports an older version of CUDA (9.2; we're getting around to updating that). In the meantime, this requires us to use an older version of PyTorch for the necessary CUDA support. The HF script we're using for training (`run_squad.py`) imports `SummaryWriter` for TensorBoard logging. More recent versions of PyTorch ship it in `torch.utils.tensorboard`; older versions do not, so you may also need to install `tensorboardX` (see the hidden code cell below). 

In [3]:
# collapse-hide 

# line 69 of `run_squad.py` script shows why you might need to install 
# tensorboardX if you have an older version of torch
try:
    from torch.utils.tensorboard import SummaryWriter
except ImportError:
    from tensorboardX import SummaryWriter
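
Since the right PyTorch build depends on whether a GPU is available, a quick check like the following can confirm what your environment actually sees. This is only a minimal sketch: the helper name `describe_device` is our own, and it falls back to `"cpu"` when PyTorch isn't installed at all.

```python
def describe_device():
    """Return 'cuda' if PyTorch can see a GPU, otherwise 'cpu'."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


print(f"Models will run on: {describe_device()}")
```

If this prints `cpu` on a machine that has a GPU, the installed PyTorch build most likely doesn't match the local CUDA version.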

## HuggingFace Transformers
I'm new to PyTorch and even newer to HuggingFace (HF) but I'm quickly becoming a convert! The [HuggingFace Transformers](https://huggingface.co/transformers/#) package provides state-of-the-art general-purpose architectures for natural language understanding and natural language generation. They host dozens of pre-trained models (like BERT) operating in over 100 languages that you can use right out of the box. All of these models come with deep interoperability between PyTorch and TensorFlow 2.0, which means you can move a model from TF 2.0 to PyTorch and back again with a line or two of code! 


If you're new to Hugging Face, we strongly recommend working through the HF [Quickstart guide](https://huggingface.co/transformers/quickstart.html) as well as their excellent [Transformer Notebooks](https://huggingface.co/transformers/notebooks.html) (we did!), as we won't cover that material in this notebook. We'll be using HF [`AutoClasses`](https://huggingface.co/transformers/model_doc/auto.html), which serve as a wrapper around pretty much any of the base Transformer classes. So if we want to work with several models we don't have to import `transformers.BertModel`, and `transformers.XLNetModel`, and `transformers.RobertaModel`, etc. We can just import `transformers.AutoModel` and feed it the appropriate model name or path for each architecture we care about. 

In [4]:
from transformers import AutoModel, AutoTokenizer, AutoModelForQuestionAnswering

# Training a Transformer model for Question Answering
Not every Transformer architecture lends itself naturally to the task of Question Answering (GPT, for instance, is a generative model and isn't typically used for the extractive QA we do here; similarly, BERT is not designed for machine translation!). 


You can identify likely model families by ... Once you've found a model you'd like to work with (in a future post we'll go over the model families in more depth), the next step is to train it on some data!

One of the canonical datasets for QA is the Stanford Question Answering Dataset, or SQuAD, which comes in two flavors: SQuAD 1.1 and SQuAD 2.0. HF helpfully provides a script, called `run_squad.py`, that trains a Transformer model on either dataset. You can grab the script [here](https://github.com/huggingface/transformers/blob/master/examples/question-answering/run_squad.py). This script is pretty complicated, and in a later post we'll go through some of the details for those who really want to get into the nuts and bolts of fine-tuning. For now, however, we'll walk through the command to train BERT on the SQuAD 1.1 or 2.0 datasets (or both, in succession!).

In [5]:
# Set paths
# (keep comments on their own lines: %env treats everything after "=" as the value)
%env DATA_DIR=./data/squad
# we'll store trained models here
%env MODEL_DIR=./models

env: DATA_DIR=./data/squad
env: MODEL_DIR=./models


In [22]:
# Download the data

def download_squad(version=1):
    # $DATA_DIR is the environment variable set in the previous cell
    if version == 1:
        !wget -P $DATA_DIR https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json
        !wget -P $DATA_DIR https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json
    else:
        !wget -P $DATA_DIR https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
        !wget -P $DATA_DIR https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json

download_squad(version=2)

--2020-05-08 15:11:59--  https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
Resolving rajpurkar.github.io (rajpurkar.github.io)... 185.199.109.153, 185.199.110.153, 185.199.111.153, ...
Connecting to rajpurkar.github.io (rajpurkar.github.io)|185.199.109.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 42123633 (40M) [application/json]
Saving to: ‘./data/squad/train-v2.0.json.2’


2020-05-08 15:12:01 (16.6 MB/s) - ‘./data/squad/train-v2.0.json.2’ saved [42123633/42123633]

--2020-05-08 15:12:02--  https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json
Resolving rajpurkar.github.io (rajpurkar.github.io)... 185.199.108.153, 185.199.109.153, 185.199.110.153, ...
Connecting to rajpurkar.github.io (rajpurkar.github.io)|185.199.108.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4370528 (4.2M) [application/json]
Saving to: ‘./data/squad/dev-v2.0.json’


2020-05-08 15:12:03 (4.84 MB/s) - ‘./data/squad/dev-v2.0
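
Before training, it can help to peek at how these JSON files are laid out. The sketch below walks the nested `data` → `paragraphs` → `qas` structure and counts questions, including those marked `is_impossible` (a field that only appears in SQuAD 2.0). The tiny `sample` is hand-written for illustration and is not real SQuAD data; the function name `summarize_squad` is our own.

```python
import json


def summarize_squad(squad):
    """Count all questions and the unanswerable ones in a SQuAD-format dict."""
    n_questions = 0
    n_impossible = 0
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                n_questions += 1
                # `is_impossible` is absent in SQuAD 1.1, so default to False
                if qa.get("is_impossible", False):
                    n_impossible += 1
    return n_questions, n_impossible


# Hand-written toy example in the SQuAD 2.0 layout (not real SQuAD data)
sample = json.loads("""
{"version": "v2.0",
 "data": [{"title": "Albatross",
           "paragraphs": [{"context": "Albatrosses are large seabirds.",
                           "qas": [{"id": "q1",
                                    "question": "What are albatrosses?",
                                    "is_impossible": false,
                                    "answers": [{"text": "large seabirds", "answer_start": 16}]},
                                   {"id": "q2",
                                    "question": "Who painted the albatross?",
                                    "is_impossible": true,
                                    "answers": []}]}]}]}
""")

print(summarize_squad(sample))  # (2, 1)
```

To run it on the real data, load a downloaded file with `json.load(open(path))` and pass the result to `summarize_squad`.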

In [6]:
# Training your own model to do QA using HF's `run_squad.py`
# Turn flags on and off according to the model you're training
import os

# Popen does not go through a shell, so a literal '$DATA_DIR' in the list below
# would never be expanded; read the environment variable in Python instead
data_dir = os.environ.get('DATA_DIR', './data/squad')

cmd = [
    'python', 
#    '-m', 'torch.distributed.launch', '--nproc_per_node', '2', # use this to perform distributed training over multiple GPUs
    'run_squad.py', 
    
    '--model_type', 'bert',                            # model type 
    '--model_name_or_path', 'bert-base-uncased',       # specific model name of the given model type
    '--output_dir', f'{data_dir}/bert/bbu_squad2',     # directory for model checkpoints and predictions
#    '--overwrite_output_dir',                         # use when adding output to a directory that is non-empty
    
    '--do_train',     
    '--train_file', f'{data_dir}/train-v2.0.json',      
    '--version_2_with_negative',                       # MUST use this flag if training on SQuAD 2.0 dataset!
    '--do_lower_case',                                 # Set this flag if using an uncased model
    
    '--do_eval',                                       # evaluate the model on the dev set after fine-tuning complete
    '--predict_file', f'{data_dir}/dev-v2.0.json',
    '--eval_all_checkpoints',                          # evaluate the model on the dev set at each checkpoint
    
    '--num_train_epochs', '3',                         # model hyperparameters
    '--learning_rate', '3e-5',
    '--max_seq_length', '384',
    '--doc_stride', '128',
    '--per_gpu_eval_batch_size', '12',
    '--per_gpu_train_batch_size', '12',
    
    '--save_steps', '10000',                           # How often checkpoints (complete model snapshot) are saved 
    '--threads', '8'                                   # num of CPU threads to use for converting examples to features
]

In [48]:
# Don't run this cell unless you're rocking at least one GPU

from subprocess import PIPE, STDOUT, Popen

# Live output from run_squad.py is through stderr (rather than stdout). The following command runs the process
# and ports stderr to stdout
p = Popen(cmd,
          stdout=PIPE,
          stderr=STDOUT)

# Default behavior when using bash cells is that you won't see the live output in the cell -- you can only see 
# output once the entire process has finished and then you get it all at once. This is terrible when training
# models that can take hours or days of compute time! 

# This command combined with the above allows you to see the live output feed, though it is a bit asynchronous.
for line in iter(p.stdout.readline, b''):
    print(">>> " + line.decode().rstrip())

>>> 2020-05-08 16:02:51.791650: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:
>>> 2020-05-08 16:02:51.791724: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:
>>> 2020-05-08 16:02:51.791733: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
>>> 05/08/2020 16:02:53 - INFO - transformers.configuration_utils -   loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-uncased-config.json fr

KeyboardInterrupt: 

### Training Output

Successful completion of the `run_squad.py` script outputs a slew of artifacts, including the model weights, the tokenizer files, and the model config (all listed below). These will all be found in the `--output_dir` directory. When loading your model for future use, this is the directory you'll point at and the HF API will take care of the rest. 

Files for the model's tokenizer, which converts text into tokens in a way that can be read by the model
* `tokenizer_config.json`
* `vocab.txt`
* `special_tokens_map.json`

Files for the model itself
* `pytorch_model.bin`: these are the actual model weights (this file can be quite large)
* `config.json`: details of the model architecture

Binary representation of the command line arguments used to train this model
* `training_args.bin`

If you include `--do_eval`, you'll also see these files
* `predictions_.json`: the official best answer for each example
* `nbest_predictions_.json`: the top n best answers for each example
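
A quick way to confirm that a training run finished cleanly is to check the output directory against the file names listed above. The sketch below does just that; `missing_files` is our own helper name, and the path in the usage line is a placeholder to replace with your own `--output_dir`.

```python
from pathlib import Path

# File names taken from the lists above (tokenizer, model, training args)
EXPECTED_FILES = [
    "tokenizer_config.json",
    "vocab.txt",
    "special_tokens_map.json",
    "pytorch_model.bin",
    "config.json",
    "training_args.bin",
]


def missing_files(output_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in output_dir."""
    out = Path(output_dir)
    return [name for name in expected if not (out / name).is_file()]


# Placeholder path: substitute the --output_dir you trained into
print(missing_files("path/to/output_dir"))
```

An empty list means every expected file is present; anything else points to a run that was interrupted or written elsewhere.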

## Using a pre-trained model from the Hugging Face repository
If you don't have access to GPUs or don't have the time to fiddle and train models, you're in luck! Hugging Face is not just a slick API for Transformers -- it also hosts [a repository](https://huggingface.co/models) of pre-trained and fine-tuned models contributed by the wider community of NLP practitioners. Searching for "squad" brings up a list of 55 models (at the time of writing). 

![](my_icons/fastai_logo.png)


Clicking one of these links gives explicit code for using the model, and, in some cases, information on how it was trained and what results were achieved. 

In [7]:
# command for importing/downloading one of these pre-fine-tuned models from HF

from transformers import AutoTokenizer, AutoModelForQuestionAnswering

# Executing these commands for the first time initiates a download of the 
# model weights to ~/.cache/torch/transformers/
# (another option from the model hub: "ahotrod/xlnet_large_squad2_512")
model_name = "deepset/bert-base-cased-squad2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

Toggle the field below to see what the model looks like in terms of its architecture.

In [8]:
# collapse-hide
model

BertForQuestionAnswering(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_

### Let's try our model!

In [9]:
question = "Who ruled Macedonia"

text = """Macedonia was an ancient kingdom on the periphery of Archaic and Classical Greece, 
and later the dominant state of Hellenistic Greece. The kingdom was founded and initially ruled 
by the Argead dynasty, followed by the Antipatrid and Antigonid dynasties. Home to the ancient 
Macedonians, it originated on the northeastern part of the Greek peninsula. Before the 4th 
century BC, it was a small kingdom outside of the area dominated by the city-states of Athens, 
Sparta and Thebes, and briefly subordinate to Achaemenid Persia."""

inputs = tokenizer.encode_plus(question, text, return_tensors="pt")

answer_start_scores, answer_end_scores = model(**inputs)

answer_start = torch.argmax(answer_start_scores)  # Get the most likely beginning of answer with the argmax of the score
answer_end = torch.argmax(answer_end_scores) + 1  # Get the most likely end of answer with the argmax of the score
tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0][answer_start:answer_end]))

'the Argead dynasty'

# QA on Wikipedia pages
We saw our model work on a short question with a small snippet of context. But what if we want to search for answers in much longer documents? A typical Wikipedia page is much longer than the snippet presented above, and it takes a bit of massaging before we can use our model on these longer contexts. 

In [85]:
import wikipedia as wiki
import pprint as pp

question = 'What is the wingspan of an albatross?'

results = wiki.search(question)
print("Wikipedia search results for our question:\n")
pp.pprint(results)

page = wiki.page(results[0])
text = page.content
print(f"\nThe {results[0]} Wikipedia article contains {len(text)} characters.")

Wikipedia search results for our question:

['Albatross',
 'List of largest birds',
 'Black-browed albatross',
 'Argentavis',
 'Pterosaur',
 'Mollymawk',
 'Largest body part',
 'List of birds by flight speed',
 'Pelican',
 'Aspect ratio (aeronautics)']

The Albatross Wikipedia article contains 38200 characters.


In [87]:
inputs = tokenizer.encode_plus(question, text, return_tensors='pt')
print(f"This translates into {len(inputs['input_ids'][0])} tokens.")

Token indices sequence length is longer than the specified maximum sequence length for this model (10 > 512). Running this sequence through the model will result in indexing errors


This translates into 8824 tokens.


In [88]:
# collapse-hide
answer_start_scores, answer_end_scores = model(**inputs)

RuntimeError: index out of range: Tried to access index 512 out of table with 511 rows. at /pytorch/aten/src/TH/generic/THTensorEvenMoreMath.cpp:237

The tokenizer takes the input and returns tokens. In general, tokenizers convert words or pieces of words into a format the model can ingest. The specific tokens and format depend on the type of model. For example, BERT tokenizes words differently from RoBERTa. This is why you must always use the tokenizer that was trained alongside your model. 
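
To make "pieces of words" concrete, here is a toy version of the greedy longest-match-first strategy that BERT's WordPiece tokenizer uses, where `##` marks a piece that continues a word. The five-entry `vocab` is made up for illustration (real BERT vocabularies hold roughly 30,000 entries), and the real tokenizer also handles casing, punctuation, and special tokens, which this sketch ignores.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedily split one word into the longest vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1  # try a shorter candidate
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Made-up vocabulary for illustration only
vocab = {"alba", "##tross", "wing", "##span", "##s"}
print(wordpiece("albatross", vocab))  # ['alba', '##tross']
print(wordpiece("wingspans", vocab))  # ['wing', '##span', '##s']
```

A different vocabulary would split the same words differently, which is exactly why a tokenizer and its model have to be used as a pair.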

In this case, the tokenizer converts our text into 8824 tokens, but this far exceeds the maximum number of tokens that can be fed to the model at one time. Most BERT-esque models can only accept 512 tokens at once (check out the error cell above to see what happens when you try to exceed that). This means we'll have to split our input into chunks and each chunk must not exceed 512 tokens in total. 

When working with Question Answering, it's crucial that each chunk follows this format:

[CLS] question tokens [SEP] context tokens [SEP]

This means that, for each segment of the Wikipedia article, we must prepend the original question, followed by the next "chunk" of article tokens.
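
Stripped of tensors, the idea can be sketched with plain Python lists. This is an illustration under stated assumptions rather than the notebook's implementation: `chunk_inputs` is our own name, the token ids are toy values, and 101 and 102 are the `[CLS]` and `[SEP]` ids in the standard BERT vocabularies (in practice, read them from `tokenizer.cls_token_id` and `tokenizer.sep_token_id`).

```python
CLS_ID, SEP_ID = 101, 102  # [CLS] / [SEP] ids in the standard BERT vocabularies
MAX_LEN = 512              # maximum sequence length for most BERT-style models


def chunk_inputs(question_ids, context_ids, max_len=MAX_LEN):
    """Split context_ids so each '[CLS] question [SEP] chunk [SEP]' fits in max_len."""
    # room left for context after [CLS], the question, and the two [SEP] tokens
    chunk_size = max_len - len(question_ids) - 3
    if chunk_size <= 0:
        raise ValueError("question is too long for the model")
    chunks = []
    for start in range(0, len(context_ids), chunk_size):
        piece = context_ids[start:start + chunk_size]
        chunks.append([CLS_ID] + question_ids + [SEP_ID] + piece + [SEP_ID])
    return chunks


# Toy ids standing in for 10 question tokens and 8811 article tokens
question_ids = list(range(1000, 1010))
context_ids = list(range(2000, 2000 + 8811))
chunks = chunk_inputs(question_ids, context_ids)
print(len(chunks), len(chunks[0]), len(chunks[-1]))  # 18 512 341
```

With a 10-token question, each full chunk carries 512 - 10 - 3 = 499 article tokens, so 8811 article tokens need 18 chunks, the last of which is shorter.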



In [77]:
# Time to chunk!
from collections import OrderedDict

# identify question tokens (token_type_ids = 0); this includes [CLS] and the first [SEP]
qmask = inputs['token_type_ids'].lt(1)
qt = torch.masked_select(inputs['input_ids'], qmask)
print(f"The question consists of {qt.size()[0]} tokens.")

chunk_size = model.config.max_position_embeddings - qt.size()[0] - 1 # the "-1" accounts for
# having to add an ending [SEP] token to the end
print(f"Each chunk will contain {chunk_size} tokens of the Wikipedia article.")

# create a dict of dicts; each sub-dict mimics the structure of pre-chunked model input
chunked_input = OrderedDict()
for k,v in inputs.items():
    q = torch.masked_select(v, qmask)
    c = torch.masked_select(v, ~qmask)
    chunks = torch.split(c, chunk_size)

    for i, chunk in enumerate(chunks):
        if i not in chunked_input:
            chunked_input[i] = {}

        thing = torch.cat((q, chunk))
        # every chunk but the last needs a closing [SEP]; the last one already ends with it
        if i != len(chunks)-1:
            if k == 'input_ids':
                thing = torch.cat((thing, torch.tensor([tokenizer.sep_token_id])))
            else:
                # token_type_ids and attention_mask are both 1 for the added token
                thing = torch.cat((thing, torch.tensor([1])))

        chunked_input[i][k] = torch.unsqueeze(thing, dim=0)

The question consists of 12 tokens.
Each chunk will contain 499 tokens of the Wikipedia article.


In [89]:
for i in range(len(chunked_input.keys())):
    print(f"Number of tokens in chunk {i}: {len(chunked_input[i]['input_ids'].tolist()[0])}")

Number of tokens in chunk 0: 512
Number of tokens in chunk 1: 512
Number of tokens in chunk 2: 512
Number of tokens in chunk 3: 512
Number of tokens in chunk 4: 512
Number of tokens in chunk 5: 512
Number of tokens in chunk 6: 512
Number of tokens in chunk 7: 512
Number of tokens in chunk 8: 512
Number of tokens in chunk 9: 512
Number of tokens in chunk 10: 512
Number of tokens in chunk 11: 512
Number of tokens in chunk 12: 512
Number of tokens in chunk 13: 512
Number of tokens in chunk 14: 512
Number of tokens in chunk 15: 512
Number of tokens in chunk 16: 512
Number of tokens in chunk 17: 341


Each of these chunks (except for the last one) has the structure: 

[CLS], 10 question tokens, [SEP], 499 tokens of the Wikipedia article, [SEP] = 1 + 10 + 1 + 499 + 1 = 512 tokens

Each of these chunks can now be fed to the model without causing indexing errors. We'll get an answer for each chunk. Most of the time, the chunk will not contain the answer (most segments of a Wikipedia article won't be relevant to our question). When the model determines that the context does not answer the question, its predicted span points at the [CLS] token. 

In [78]:
def convert_ids_to_string(tokenizer, input_ids):
    return tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids))

answer = ''

# Now we iterate over our chunks, looking for the best answer from each chunk
for k, chunk in chunked_input.items():
    answer_start_scores, answer_end_scores = model(**chunk)

    answer_start = torch.argmax(answer_start_scores)
    answer_end = torch.argmax(answer_end_scores) + 1

    ans = convert_ids_to_string(tokenizer, chunk['input_ids'][0][answer_start:answer_end])
    if ans != '[CLS]':
        answer += ans + " / "
        
print(answer)

3 . 7 m / 
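
Concatenating every chunk's answer works, but it gives each chunk an equal say. One alternative, sketched here under our own assumptions (it is not what the cell above does), is to score each candidate span as its start score plus its end score, compare that with the score of the "no answer" span at `[CLS]`, and keep the single best span across all chunks. The function names and the toy scores are made up for illustration; in practice the scores would be the per-chunk start and end logits converted to Python lists.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Best-scoring (score, start, end) over non-[CLS] positions; end is inclusive."""
    best = (float("-inf"), None, None)
    for s in range(1, len(start_scores)):           # position 0 is [CLS]
        last = min(s + max_answer_len, len(end_scores))
        for e in range(s, last):                     # enforce start <= end
            score = start_scores[s] + end_scores[e]
            if score > best[0]:
                best = (score, s, e)
    return best


def best_answer(chunk_scores):
    """Return (chunk_index, score, start, end), or None if every chunk abstains."""
    winner = None
    for idx, (starts, ends) in enumerate(chunk_scores):
        null_score = starts[0] + ends[0]             # score of the "no answer" span
        score, s, e = best_span(starts, ends)
        if score <= null_score:
            continue                                 # this chunk says "no answer"
        if winner is None or score > winner[1]:
            winner = (idx, score, s, e)
    return winner


# Toy scores for two 4-token chunks (made up for illustration)
chunk_scores = [
    ([5.0, 0.1, 0.2, 0.1], [5.0, 0.1, 0.3, 0.2]),   # strongly prefers [CLS]
    ([0.5, 0.2, 4.0, 0.3], [0.4, 0.1, 1.0, 3.5]),   # answer at tokens 2..3
]
print(best_answer(chunk_scores))  # (1, 7.5, 2, 3)
```

Raw scores from different chunks are not calibrated against each other, so treat this as a heuristic starting point rather than a definitive ranking.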


# Put it all together

Time to put it all together now. We're using `wikipedia`'s information retrieval system (search engine) to return a list of candidate documents that we then feed into our Document Reader (in this case, BERT fine-tuned on SQuAD 2.0). We'll now use `Streamlit` to create a simple interface for our app. In order to make the app code easier to read, we'll first package our Document Reader into a class. 

In [106]:
import torch
from collections import OrderedDict
from transformers import AutoTokenizer, AutoModelForQuestionAnswering


class DocumentReader:
    def __init__(self, pretrained_model_name_or_path='bert-large-uncased'): # 'bert-base-uncased'
        self.READER_PATH = pretrained_model_name_or_path
        self.tokenizer = AutoTokenizer.from_pretrained(self.READER_PATH)
        self.model = AutoModelForQuestionAnswering.from_pretrained(self.READER_PATH)
        self.max_len = self.model.config.max_position_embeddings
        self.chunked = False

    def tokenize(self, question, text):
        self.inputs = self.tokenizer.encode_plus(question, text, add_special_tokens=True, return_tensors="pt")
        self.input_ids = self.inputs["input_ids"].tolist()[0]

        # recompute on every call so a short document that follows a long one
        # is not mistakenly treated as chunked
        self.chunked = len(self.input_ids) > self.max_len
        if self.chunked:
            self.inputs = self.chunkify()

    def chunkify(self):
        """ 
        Break up a long article into chunks that fit within the max token
        requirement for that Transformer model. 

        Calls to BERT / RoBERTa / ALBERT require the following format:
        [CLS] question tokens [SEP] context tokens [SEP]
        """
        
        # TODO: generalize this because not all models include token_type_ids (distilBERT)
        # create question mask based on token_type_ids
        # value is 0 for question tokens, 1 for context tokens
        qmask = self.inputs['token_type_ids'].lt(1)
        qt = torch.masked_select(self.inputs['input_ids'], qmask)
        chunk_size = self.max_len - qt.size()[0] - 1 # the "-1" accounts for
        # having to add an ending [SEP] token to the end

        # create a dict of dicts; each sub-dict mimics the structure of pre-chunked model input
        chunked_input = OrderedDict()
        for k,v in self.inputs.items():
            q = torch.masked_select(v, qmask)
            c = torch.masked_select(v, ~qmask)
            chunks = torch.split(c, chunk_size)
            
            for i, chunk in enumerate(chunks):
                if i not in chunked_input:
                    chunked_input[i] = {}

                thing = torch.cat((q, chunk))
                if i != len(chunks)-1:
                    if k == 'input_ids':
                        thing = torch.cat((thing, torch.tensor([102])))
                    else:
                        thing = torch.cat((thing, torch.tensor([1])))

                chunked_input[i][k] = torch.unsqueeze(thing, dim=0)
        return chunked_input

    def get_answer(self):
        if self.chunked:
            answer = ''
            for k, chunk in self.inputs.items():
                answer_start_scores, answer_end_scores = self.model(**chunk)

                answer_start = torch.argmax(answer_start_scores)
                answer_end = torch.argmax(answer_end_scores) + 1

                ans = self.convert_ids_to_string(chunk['input_ids'][0][answer_start:answer_end])
                if ans != '[CLS]':
                    answer += ans + " / "
            return answer
        else:
            answer_start_scores, answer_end_scores = self.model(**self.inputs)

            answer_start = torch.argmax(answer_start_scores)  # Get the most likely beginning of answer with the argmax of the score
            answer_end = torch.argmax(answer_end_scores) + 1  # Get the most likely end of answer with the argmax of the score
        
            return self.convert_ids_to_string(self.inputs['input_ids'][0][
                                              answer_start:answer_end])

    def convert_ids_to_string(self, input_ids):
        return self.tokenizer.convert_tokens_to_string(self.tokenizer.convert_ids_to_tokens(input_ids))

In [99]:
import os
cwd = os.getcwd()

MODEL_PATHS = {
    'default_bert_base_uncased': 'bert-base-uncased',
    'bert_base_uncased_squad1':
        cwd+"/models/bert/bert-base-uncased-tuned-squad-1.0",
    'bert_base_cased_squad2':
        cwd+"/models/bert/bert-base-cased-tuned-squad-2.0/"
}

In [110]:
questions = [
    'When was Barack Obama born?',
    'Why is the sky blue?',
    'How many sides does a pentagon have?'
]

reader = DocumentReader("deepset/bert-base-cased-squad2") 

# if you trained your own model using the training cell earlier, point the reader at its --output_dir:
#reader = DocumentReader("./data/squad/bert/bbu_squad2")

for question in questions:
    results = wiki.search(question)

    page = wiki.page(results[0])
    print(f"Top result: {page}")

    text = page.content

    reader.tokenize(question, text)

    print(f"Question: {question}")
    print(reader.get_answer())


Top result: <WikipediaPage 'Barack Obama Sr.'>
Question: When was Barack Obama born?
18 June 1936 / August 1961 / 4 August 1961 / 6 May 2011 . = = See also = = Family / 


In [111]:
#hidden-cell
import logging
logging.getLogger("transformers.tokenization_utils").setLevel(logging.ERROR)

# Wrapping Up

There we have it! A working QA system on Wikipedia articles. This is great, but it's admittedly not very sophisticated. Furthermore, there are still a lot of unanswered questions:

1. Why the SQuAD dataset and not something else? What other options are there? 
2. Why did we train BERT the way we did? Are there ways to make it better? What's the "best" BERT can be? 
    * Should we train on more than just SQuAD? 
    * If so, what other datasets should we train on?
    * How much does this increase performance? 
3. Why BERT and not another Transformer model? 
    * What's the difference between all these Transformer models anyway? 
4. How can we make our `get_answer` method more sophisticated and realistic?
5. How can we improve our `chunkify` method? 

## How to train Transformers with GPUs from your Jupyter notebook
## How to make sense of the SQuAD (and other QA) datasets 
## How to evaluate a Transformer model on a QA dataset
## How to choose the right Transformer model for your QA system
## How to create your own QA dataset