# Getting Started
> First steps towards building an IR-based QA system with BERT.

- toc: true 
- badges: true
- comments: true
- hide: true
- permalink: /hidden/
- search_exclude: false
- categories: [jupyter]

## So you've decided to build a QA system. 
You want to start with something general and straightforward, so you plan to make it open domain, using Wikipedia as the corpus for answering questions. You're going to use an IR-based design (see previous post), since you're working with a large collection of unstructured text. You want to use the best NLP that your compute resources allow (you're lucky enough to have access to a GPU), so you're going to focus on the big, flashy Transformer models that are all the rage these days.

Sounds like a plan! So where do you start? 

This was our thought process when we first set out on this research path, and in this post we'll discuss what you need to know to get going:

- installing libraries and setting up an environment
- understanding Hugging Face's `run_squad.py` training script
- understanding the basic ins and outs of a BERT-esque model
- getting BERT to accept a full Wikipedia article as context for a question


### Setting up your virtual environment
A virtual environment is always best practice, and we're using `venv` (though Melanie is also partial to `conda`). Here's the bare minimum that you'll need to do what we did. For this project we'll be using PyTorch (though everything we do can also be accomplished in TensorFlow). PyTorch handles the heavy lifting of differentiable deep learning. Transformers is a library by Hugging Face that provides easy-to-use implementations (in PyTorch) of all the popular Transformer architectures (more on this later).

- PyTorch 
- Transformers
- Wikipedia
- TensorboardX (optional)
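
Once the environment is active, a quick way to confirm that these packages are visible to the interpreter is to ask `importlib` for each one. This is only a minimal sketch: the import names (`torch`, `transformers`, `wikipedia`, `tensorboardX`) are the names these packages are normally imported under, and the helper `check_packages` is a hypothetical name used for illustration.

```python
import importlib.util

# Import names assumed for the packages listed above.
REQUIRED = ["torch", "transformers", "wikipedia"]
OPTIONAL = ["tensorboardX"]


def check_packages(names):
    """Return {import_name: bool} saying whether each package can be located."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_packages(REQUIRED + OPTIONAL).items():
        label = "optional" if name in OPTIONAL else "required"
        status = "found" if found else "MISSING"
        print(f"{name:<14} {status} ({label})")
```

Anything reported as missing can then be installed into the active environment with `pip`.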

A note on PyTorch: our GPU machine sports an older version of CUDA (9.2) that we're getting around to updating... In the meantime, this forces us to use an older version of PyTorch that supports this CUDA version so that we can access our GPU for training. Some older versions of PyTorch might require that you also install `TensorboardX`, which is used in Hugging Face's `run_squad.py` script. If you want to use your GPU (pretty much required if you plan to _fine-tune_ BERT on the SQuAD dataset) and you have CUDA 10+, you can use the most recent version of PyTorch and you won't need to install the additional TensorboardX package.
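
To see which situation you're in, you can ask PyTorch which CUDA version it was built against and whether it can actually see a GPU. The snippet below is a minimal sketch that assumes PyTorch may or may not be installed in the current environment; `describe_torch` is a hypothetical helper name.

```python
def describe_torch():
    """Summarise the local PyTorch install, or report that it is missing."""
    try:
        import torch
    except ImportError:
        return {"installed": False}
    return {
        "installed": True,
        "torch_version": torch.__version__,
        # CUDA version PyTorch was compiled against (None for CPU-only builds).
        "built_with_cuda": torch.version.cuda,
        # True only if a usable GPU and driver are visible at runtime.
        "gpu_available": torch.cuda.is_available(),
    }


print(describe_torch())
```

If `gpu_available` comes back `False` on a machine with a GPU, a mismatch between the installed PyTorch build and the local CUDA version is one likely cause.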

Why are we using PyTorch instead of TensorFlow? Honestly? Because TensorFlow isn't playing nice with our GPU machine these days... You'll likely see some warning messages about it not being installed properly to access the GPU. It's on our to-do list.

### Hugging Face's Transformers library and training script
I'm new to PyTorch and Hugging Face but I'm quickly becoming a convert! Hugging Face provides state-of-the-art general-purpose architectures for natural language understanding and natural language generation. They have tons of pre-trained models that work in dozens of languages. They even have interoperability between PyTorch and TensorFlow (all camps welcome!), which means that if someone trained BERT using TensorFlow, we can load those pre-trained weights through Hugging Face methods and they will convert the weights to PyTorch for us! Yay.
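
As a rough illustration of that interoperability: in Transformers releases that support it, `from_pretrained` accepts a `from_tf` flag that asks the library to read a TensorFlow checkpoint and convert the weights while building a PyTorch model. The sketch below assumes both PyTorch and TensorFlow are installed and that `tf_checkpoint_dir` (a hypothetical path) holds a TensorFlow-trained BERT together with its config. The function name is also ours, and whether the flag is available depends on your installed version.

```python
def load_tf_bert_as_pytorch(tf_checkpoint_dir):
    """Load a TensorFlow-trained BERT QA checkpoint as a PyTorch model.

    `tf_checkpoint_dir` is a hypothetical directory holding the TF weights
    and a config.json; nothing is loaded until this function is called.
    """
    # Imported lazily so that defining the function does not require the library.
    from transformers import BertForQuestionAnswering

    # from_tf=True asks Transformers to convert the TF weights to PyTorch on load.
    return BertForQuestionAnswering.from_pretrained(tf_checkpoint_dir, from_tf=True)
```

Calling it would look like `model = load_tf_bert_as_pytorch("path/to/tf-bert")`, where the path is a placeholder for wherever your TensorFlow checkpoint lives.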

Hugging Face provides more than just pre-trained models. They also have example training scripts, including the `run_squad.py` script for fine-tuning on SQuAD.

```python
from qasystem import DocumentReader, MODEL_PATHS

MODEL_PATHS
```

```
{'default_bert_base_uncased': 'bert-base-uncased',
 'bert_base_uncased_squad1': '/home/ryan/work/ff14/src/models/bert/bert-base-uncased-tuned-squad-1.0',
 'bert_base_cased_squad2': '/home/ryan/work/ff14/src/models/bert/bert-base-cased-tuned-squad-2.0/'}
```

```python
reader = DocumentReader(MODEL_PATHS['bert_base_uncased_squad1'])
reader
```

```
<qasystem.DocumentReader at 0x7f3141261e48>
```