In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"  # enumerate GPUs in PCI bus order
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # expose only the first GPU

## Building an End-to-End Question-Answering System With BERT

In this notebook, we build a practical, end-to-end Question-Answering (QA) system with BERT in roughly 3 lines of code.  We will treat a corpus of text documents as a knowledge base that we can query with natural-language questions, retrieving exact answers using [BERT](https://arxiv.org/abs/1810.04805). This goes beyond simplistic keyword searches.

For this example, we will use the [20 Newsgroup dataset](http://qwone.com/~jason/20Newsgroups/) as the text corpus.  As a collection of newsgroup postings that contain an abundance of opinions and debates, the corpus is not ideal as a knowledge base.  It is better to use fact-based documents such as Wikipedia articles or even news articles.  However, this dataset will suffice for this example.

Let us begin by loading the dataset into a Python list using **scikit-learn** and importing *ktrain* modules.

In [2]:
# load the 20newsgroups dataset into a Python list
from sklearn.datasets import fetch_20newsgroups
remove = ('headers', 'footers', 'quotes')
newsgroups_train = fetch_20newsgroups(subset='train', remove=remove)
newsgroups_test = fetch_20newsgroups(subset='test', remove=remove)
docs = newsgroups_train.data +  newsgroups_test.data

In [3]:
!pip install ktrain



In [4]:
import ktrain
from ktrain import text

### STEP 1:  Index the Documents

We will first index the documents into a search engine that will be used to quickly retrieve documents that are likely to contain answers to a question. To do so, we must choose an index location, which must be a folder that does not already exist. 

Since the newsgroup postings are small and fit in memory, we will set `commit_every` to a large value to speed up the indexing process. This means results will not be committed to the index until the end.  If you experience issues, you can lower this value.

In [5]:
INDEXDIR = '/tmp/myindex'

In [8]:
text.SimpleQA.initialize_index(INDEXDIR)
text.SimpleQA.index_from_list(docs, INDEXDIR, commit_every=len(docs))

For document sets that are too large to be loaded into a Python list, you can use `SimpleQA.index_from_folder`, which will crawl a folder and index all plain text documents.

By default, `index_from_list` and `index_from_folder` use a single processor (`procs=1`) with each processor using a maximum of 256MB of memory (`limitmb=256`) and merging results into a single segment (`multisegment=False`).  These values can be changed to speed up indexing by passing them as arguments to `index_from_list` or `index_from_folder`.  See the [whoosh documentation](https://whoosh.readthedocs.io/en/latest/batch.html) for more information on these parameters and how to use them to speed up indexing.
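
As a rough sketch of how these arguments might be chosen, the helper below assembles the keyword arguments described above from the number of available CPU cores. The name `indexing_options` and the cap of four processes are hypothetical and not part of *ktrain*. Enabling `multisegment` whenever more than one processor is used is an assumption based on the whoosh batch-indexing guide, so treat the values as starting points rather than recommendations.

```python
import os

def indexing_options(n_docs, max_procs=4, limitmb=256):
    """Build keyword arguments for index_from_list / index_from_folder.

    Hypothetical helper: the values chosen here are illustrative defaults.
    """
    # Use at most max_procs processes, and never more than the machine has.
    procs = max(1, min(max_procs, os.cpu_count() or 1))
    return {
        "commit_every": n_docs,      # commit once, after all documents are added
        "procs": procs,
        "limitmb": limitmb,          # memory cap per processor, in MB
        "multisegment": procs > 1,   # assumption: skip the final merge when parallel
    }

options = indexing_options(n_docs=1000)
# The options could then be passed along, e.g.:
# text.SimpleQA.index_from_list(docs, INDEXDIR, **options)
```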

Note that even a small number of large documents can make inference in **STEP 3** very slow.  If your dataset consists of large documents (e.g., books or long papers), we recommend breaking them up into pages (e.g., splitting the original PDF using something like `pdfseparate`) or splitting them into paragraphs.  The latter can be done with *ktrain* using:
```python
ktrain.text.textutils.paragraph_tokenize(document, join_sentences=True)
```
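
To illustrate the idea without relying on *ktrain*, here is a minimal plain-Python sketch that splits a document on blank lines. It is only an approximation for illustration and is not how `paragraph_tokenize` is implemented; the function name `split_into_paragraphs` and the `min_chars` parameter are hypothetical.

```python
import re

def split_into_paragraphs(document, min_chars=1):
    """Split text on blank lines and drop fragments shorter than min_chars."""
    # One or more blank (or whitespace-only) lines separate paragraphs.
    paragraphs = re.split(r"\n\s*\n", document)
    # Collapse line wraps and repeated spaces inside each paragraph.
    cleaned = [" ".join(p.split()) for p in paragraphs]
    return [p for p in cleaned if len(p) >= min_chars]

sample = "First paragraph,\nwrapped over two lines.\n\n\nSecond paragraph."
chunks = split_into_paragraphs(sample)
```

Each resulting chunk could then be indexed as its own document, which keeps the passages the model must read short.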

The above steps need to be performed only once. Once an index has been created, you can skip this step and proceed directly to **STEP 2** to begin using your system.
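
One way to decide whether STEP 1 can be skipped is to check for the index folder before indexing. The sketch below uses a simple heuristic (the folder exists and is not empty); `index_is_built` is a hypothetical helper, and a non-empty folder is not proof that it holds a valid index.

```python
import os

def index_is_built(index_dir):
    """Return True if index_dir is an existing, non-empty folder."""
    return os.path.isdir(index_dir) and len(os.listdir(index_dir)) > 0

if index_is_built("/tmp/myindex"):
    print("Index found: skipping STEP 1.")
else:
    print("No index found: run STEP 1 first.")
```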

### STEP 2: Create a QA instance

Next, we create a QA instance.  This step will automatically download the BERT model fine-tuned on SQuAD if it does not already exist on your system.

In [9]:
qa = text.SimpleQA(INDEXDIR)

That's it!  In roughly **3 lines of code**, we have built an end-to-end QA system that can now be used to generate answers to questions.  Let's ask our system some questions.

### STEP 3:  Ask Questions

We will invoke the `ask` method to issue questions to the text corpus we indexed and retrieve answers.  We will also use the `qa.display_answers` method to nicely display the top 5 results in this Jupyter notebook. The answers are inferred using a BERT model trained on the SQuAD dataset.  Since the model is combing through paragraphs and sentences to find an answer, it may take a minute or two to return results.

Note also that the 20 Newsgroup dataset covers events in the early to mid 1990s, so references to recent events will not exist.

#### Space Question

In [10]:
answers = qa.ask('When did the Cassini probe launch?')
qa.display_answers(answers[:5])

ValueError: ignored

As you can see, the top candidate answer indicates that the Cassini space probe was launched in October of 1997, which appears to be correct.  The correct answer will not always be the top answer, but it is in this case.  

Note that, since we used `index_from_list` to index documents, the last column shows the list index associated with the newsgroup posting containing the answer, which can be used to peruse the entire document containing the answer.  If using `index_from_folder` to index documents, the last column will show the relative path and filename of the document.

In [None]:
print(docs[59])
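
The lookup above can be generalized to all returned answers. The sketch below assumes each answer is a dictionary whose document reference is stored under a `reference` key; that key name is an assumption about the structure returned by `qa.ask`, so inspect one answer first and adjust `key` if needed. The helper `referenced_documents` is hypothetical and is demonstrated on toy data rather than real output.

```python
def referenced_documents(answers, docs, key="reference"):
    """Map each answer's list-index reference to the full text in docs."""
    found = {}
    for answer in answers:
        ref = answer.get(key)
        try:
            idx = int(ref)
        except (TypeError, ValueError):
            # Not a list index (e.g., a file path from index_from_folder).
            continue
        if 0 <= idx < len(docs):
            found[idx] = docs[idx]
    return found

# Toy data standing in for the output of qa.ask(...)
toy_docs = ["doc zero", "doc one", "doc two"]
toy_answers = [
    {"answer": "x", "reference": "2"},
    {"answer": "y", "reference": 0},
    {"answer": "z", "reference": "notes/file.txt"},
]
full_texts = referenced_documents(toy_answers, toy_docs)
```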

The 20 Newsgroup dataset contains lots of posts discussing and debating Christianity, as well.  Let's ask a question on this subject.

#### Religious Question

In [None]:
answers = qa.ask('Who was Jesus Christ?')
qa.display_answers(answers[:5])

Here, we see different views on who Jesus was as debated and discussed in this document set.

Finally, the 20 Newsgroup dataset also contains many groups about computing hardware and software.  Let's ask a technical support question.

#### Technical Question

In [None]:
answers = qa.ask('What causes computer images to be too dark?')
qa.display_answers(answers[:5])

It looks like a lack of gamma correction can be one of the culprits.

### Deploying the QA System

To deploy this system, the only state that needs to be persisted is the search index we initialized and populated in **STEP 1**.  Once a search index is initialized and populated, one can simply re-run from **STEP 2**.
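
Since the index is just a folder on disk, one possible way to move it to another machine is to archive it and unpack it at the destination. The sketch below uses only the Python standard library; `pack_index`, `unpack_index`, and the paths in the usage comment are hypothetical. Keep the archive outside the index folder so that it does not end up inside itself.

```python
import shutil

def pack_index(index_dir, archive_base):
    """Archive the contents of index_dir; returns the path to the .tar.gz."""
    return shutil.make_archive(archive_base, "gztar", root_dir=index_dir)

def unpack_index(archive_path, index_dir):
    """Extract a previously packed index into index_dir and return its path."""
    shutil.unpack_archive(archive_path, index_dir)
    return index_dir

# Example usage (hypothetical paths):
# archive = pack_index("/tmp/myindex", "/tmp/myindex_backup")
# restored = unpack_index(archive, "/srv/qa/myindex")
# qa = text.SimpleQA(restored)
```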