# Question Answering with BERT

<hr/>

<img src="example-report.png" width="30%" align="right" style="padding-left:20px">

In this notebook, I build a question-answering pipeline that uses the pre-trained, transformer-based [BERT](https://github.com/google-research/bert) model to answer questions about a given passage. At the end, I apply the pipeline to questions about medical passages such as clinical notes. The pipeline consists of the following steps:

- [1. Prepare the input](#1): Preprocessing text for input to the BERT model
- [2. Find candidate answers](#2): Given the passage and the question, the BERT model scores candidate answers: for every position in the tokenized input it outputs a score (i.e. a logit) for being the start of the answer and a score for being the end.
- [3. Choose the most likely answer](#3): From the scores obtained in the previous step, choose the pair (start index, end index) that maximizes the combined score.
- [4. Construct the final answer](#4): The final answer is constructed by stitching together all tokens located between the start index and the end index (obtained in the previous step) in the given passage.

This project is inspired by the [work](https://arxiv.org/abs/1901.07031) done by Irvin et al.

<a href="https://ieeexplore.ieee.org/document/7780643">Image Credit</a>

## Packages

Import the following libraries for this assignment.

- `tensorflow` - standard deep learning library
- `transformers` - convenient access to pretrained natural language models

Additionally, load the helper `question_answer_util` module that contains all the functions associated with the Question/Answering pipeline.

In [29]:
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForQuestionAnswering

from question_answer_util import *

# watch for any changes in the question_answer_util module, and reload it automatically
%load_ext autoreload
%autoreload 2



<a name='2-2'></a>
### 1. Preparing the input


We first need to prepare the raw passage and question for input into the BERT model. To do so, we apply a pre-built tokenizer to both the passage and the question; it splits each text into tokens and maps every token to its id in the vocabulary. We also insert special tokens. In other words, given the strings `p` (the passage) and `q` (the question), we want to turn them into an input of the following form:

`[CLS]` `[q_token1]`, `[q_token2]`, ..., `[SEP]` `[p_token1]`, `[p_token2]`, ...

Here, the special tokens `[CLS]` and `[SEP]` let the model know which part of the input is the question and which is the passage.
- The question appears between `[CLS]` and `[SEP]`.
- The passage appears after `[SEP]`.

Next, since BERT takes in a fixed-length input, we add padding to those tokenized inputs whose length is less than a pre-specified max input length. 

In the test case below, `prepare_bert_input(question, passage, tokenizer, max_seq_length=20)` returns three items:
- First is `input_ids`, which holds the numerical ids of each token. 
- Second, the `input_mask`, which has 1's in parts of the input tensor representing input tokens, and 0's where there is padding. 
- Finally, `tokens`, the output of the tokenizer (including the `[CLS]` and `[SEP]` tokens).
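To make the layout concrete, here is a minimal pure-Python sketch of the same idea. It is not the implementation of `prepare_bert_input`: the function name `prepare_input_sketch`, the pre-split token lists, the small `toy_vocab` dictionary, and the truncation step are all assumptions made for illustration. The ids in `toy_vocab` are copied from the test-case output shown further down.

```python
def prepare_input_sketch(question_tokens, passage_tokens, vocab, max_seq_length):
    """Toy layout: [CLS] question [SEP] passage, followed by zero padding."""
    tokens = ["[CLS]"] + question_tokens + ["[SEP]"] + passage_tokens
    # Assumption: overly long inputs are simply cut off at max_seq_length.
    tokens = tokens[:max_seq_length]
    input_ids = [vocab[t] for t in tokens]
    input_mask = [1] * len(tokens)          # 1 marks a real token
    n_pad = max_seq_length - len(tokens)
    input_ids += [0] * n_pad                # 0 is used as the padding id here
    input_mask += [0] * n_pad               # 0 marks padding
    return input_ids, input_mask, tokens


# Hypothetical mini-vocabulary; ids taken from the printed test-case output.
toy_vocab = {"[CLS]": 101, "[SEP]": 102, "What": 1327, "is": 1110, "my": 1139,
             "name": 1271, "?": 136, "My": 1422, "Bob": 3162, ".": 119}

toy_ids, toy_mask, toy_tokens = prepare_input_sketch(
    ["What", "is", "my", "name", "?"],
    ["My", "name", "is", "Bob", "."],
    toy_vocab,
    max_seq_length=20,
)
print(toy_tokens)
print(toy_ids)
print(toy_mask)
```

The real helper additionally runs the tokenizer and returns `input_ids` as a TensorFlow tensor with a batch dimension, which this sketch leaves out.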

### Load The Tokenizer

Before using the function below, we need to load the pre-built tokenizer.

In [30]:
tokenizer = AutoTokenizer.from_pretrained("./models")

In [31]:
passage = "My name is Bob."

question = "What is my name?"

input_ids, input_mask, tokens = prepare_bert_input(question, passage, tokenizer, 20)
print("Test Case:\n")
print("Passage: {}".format(passage))
print("Question: {}".format(question))
print()
print("Tokens:")
print(tokens)
print("\nCorresponding input IDs:")
print(input_ids)
print("\nMask:")
print(input_mask)

Test Case:

Passage: My name is Bob.
Question: What is my name?

Tokens:
['[CLS]', 'What', 'is', 'my', 'name', '?', '[SEP]', 'My', 'name', 'is', 'Bob', '.']

Corresponding input IDs:
tf.Tensor(
[[ 101 1327 1110 1139 1271  136  102 1422 1271 1110 3162  119    0    0
     0    0    0    0    0    0]], shape=(1, 20), dtype=int32)

Mask:
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]


<a name='2-3'></a>
### 2. Find candidate answers

The pre-trained BERT model, which is loaded below, takes in the tokenized input and returns two vectors.
- The first vector contains the scores (more formally, logits) for the starting indices of possible answers.
    - A higher score means that the index is more likely to be the start of the final, chosen answer span in the passage.
- The second vector contains the scores for the ending indices of those possible answers.
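Logits are unnormalized scores. As a small illustration (not part of the pipeline), the sketch below converts a made-up list of start logits into probabilities with a softmax; the names `softmax` and `toy_start_logits` are hypothetical. Because softmax preserves ordering, the highest logit is also the most probable position, which is why the pipeline can work with the raw scores directly.

```python
import math


def softmax(logits):
    """Convert unnormalized scores into probabilities that sum to 1."""
    m = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


toy_start_logits = [-1.0, 2.0, 0.4, -0.3, 0.0]   # made-up values
toy_start_probs = softmax(toy_start_logits)

# Index of the most likely start position.
toy_best_start = max(range(len(toy_start_probs)), key=toy_start_probs.__getitem__)
print(toy_best_start, round(sum(toy_start_probs), 6))
```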

In [32]:
model = TFAutoModelForQuestionAnswering.from_pretrained("./models")

<a name='2-3'></a>
### 3. Choose the most likely answer

We use the two score vectors described above, together with the input mask, to find the span of tokens that maximizes the sum of the start score and the end score.
- To be valid, the start index must not come after the end index. Formally, we want to find:

$$\arg\max_{i \leq j,\ mask_i = 1,\ mask_j = 1} start\_scores[i] + end\_scores[j]$$
- In other words, compute the sum of the start score at position $i$ and the end score at position $j$, under the constraint that the start $i$ is before or at the end position $j$; then find the positions $i$ and $j$ for which this sum is highest.
- Furthermore, we want to make sure that $i$ and $j$ are in the relevant parts of the input (i.e. where `input_mask` equals 1).
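The search described above can be sketched as a brute-force loop over all valid pairs. This is only an illustration under the stated constraints; `best_span` is a hypothetical name and the code is not the implementation of `get_span_from_scores`. The scores are the same ones used in the test case that follows.

```python
def best_span(start_scores, end_scores, input_mask):
    """Return (i, j, score) maximizing start_scores[i] + end_scores[j]
    subject to i <= j and input_mask[i] == input_mask[j] == 1."""
    best_i, best_j, best_score = 0, 0, float("-inf")
    n = len(start_scores)
    for i in range(n):
        if input_mask[i] != 1:
            continue                      # start must be a real token
        for j in range(i, n):             # j starts at i, so i <= j always holds
            if input_mask[j] != 1:
                continue                  # end must be a real token
            score = start_scores[i] + end_scores[j]
            if score > best_score:
                best_i, best_j, best_score = i, j, score
    return best_i, best_j, best_score


toy_start = [-1.0, 2.0, 0.4, -0.3, 0.0, 8.0, 10.0, 12.0]
toy_end = [5.0, 1.0, 1.0, 3.0, 4.0, 10.0, 10.0, 10.0]
toy_span_mask = [1, 1, 1, 1, 1, 0, 0, 0]

print(best_span(toy_start, toy_end, toy_span_mask))  # (1, 4, 6.0)
```

Both constraints matter in this example: without the mask, the padded positions 5 to 7 would win with much larger scores, and without `i <= j` the pair (start 1, end 0) would score 7.0 but describe an invalid span.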

In [33]:
# test case to show how span calculations work

start_scores = tf.convert_to_tensor([-1, 2, 0.4, -0.3, 0, 8, 10, 12], dtype=float)
end_scores = tf.convert_to_tensor([5, 1, 1, 3, 4, 10, 10, 10], dtype=float)
input_mask = [1, 1, 1, 1, 1, 0, 0, 0]

start, end = get_span_from_scores(start_scores, end_scores, input_mask, verbose=True)

print("Expected: (1, 4) \nReturned: ({}, {})".format(start, end))

max start is at index i=1 and score 2.0
max end is at index i=4 and score 4.0
max start + max end sum of scores is 6.0
Expected: (1, 4) 
Returned: (1, 4)


<a name='ex-05'></a>
### 4. Construct the final answer

Finally, we take the contiguous span of tokens from the tokenized input using the start and end indices obtained in the previous step. Then, we pass these tokens to the `construct_answer(tokens)` function, which performs some clean-up on them and returns the final answer string.
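For intuition, BERT's WordPiece tokenizer marks a token that continues the previous word with a leading `##`. The sketch below shows one common way to merge such pieces back into words. It is an assumption-laden illustration: `join_wordpieces` is a hypothetical name, the example split of "Computational" is made up, and the clean-up rules of the actual `construct_answer` helper are not shown in this notebook and may differ.

```python
def join_wordpieces(tokens):
    """Merge '##'-prefixed continuation pieces into the preceding word."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]          # glue the piece onto the previous word
        else:
            words.append(tok)             # start a new word
    return " ".join(words)


# Illustrative split; a real tokenizer may break the word differently.
toy_answer = join_wordpieces(["Comp", "##uta", "##tional", "complexity", "theory"])
print(toy_answer)  # Computational complexity theory
```

This sketch does not handle punctuation spacing or stray whitespace inside tokens.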

In [34]:
# Test case

# assume this is the contiguous token that forms the final answer
tmp_tokens_1 = [' ## hello', 'how ', 'are ', 'you?      ']
tmp_out_string_1 = construct_answer(tmp_tokens_1)

print(f"tmp_out_string_1: {tmp_out_string_1}, length {len(tmp_out_string_1)}")


tmp_tokens_2 = ['@',' ## hello', 'how ', 'are ', 'you?      ']
tmp_out_string_2 = construct_answer(tmp_tokens_2)
print(f"tmp_out_string_2: {tmp_out_string_2}, length {len(tmp_out_string_2)}")


tmp_out_string_1: hello how  are  you?, length 20
tmp_out_string_2: @hellohowareyou?, length 16


<a name="2-1"></a>
### Putting It All Together

The `get_model_answer(*args, **kwargs)` function puts all the previous steps together and returns the answer to the input question for the given passage.

```python
def get_model_answer(model, question, passage, tokenizer, max_seq_length=384):
    """Outline only: the body of each step is elided."""
    # prepare input
    ...

    # get scores for start of answer and end of answer
    ...

    # using scores, get most likely answer
    ...

    # using span start and end, construct answer as string
    ...
    return answer
```

<a name='2-5'></a>
### 5. Test the pipeline

Now that we've prepared all the pieces, let's try an example from the SQuAD dataset. 

In [35]:
passage = "Computational complexity theory is a branch of the theory \
           of computation in theoretical computer science that focuses \
           on classifying computational problems according to their inherent \
           difficulty, and relating those classes to each other. A computational \
           problem is understood to be a task that is in principle amenable to \
           being solved by a computer, which is equivalent to stating that the \
           problem may be solved by mechanical application of mathematical steps, \
           such as an algorithm."

question = "What branch of theoretical computer science deals with broadly \
            classifying computational problems by difficulty and class of relationship?"

print("Output: {}".format(get_model_answer(model, question, passage, tokenizer)))
print("Expected: Computational complexity theory")

Output: Computational complexity theory
Expected: Computational complexity theory


In [36]:
passage = "The word pharmacy is derived from its root word pharma which was a term used since \
           the 15th–17th centuries. However, the original Greek roots from pharmakos imply sorcery \
           or even poison. In addition to pharma responsibilities, the pharma offered general medical \
           advice and a range of services that are now performed solely by other specialist practitioners, \
           such as surgery and midwifery. The pharma (as it was referred to) often operated through a \
           retail shop which, in addition to ingredients for medicines, sold tobacco and patent medicines. \
           Often the place that did this was called an apothecary and several languages have this as the \
           dominant term, though their practices are more akin to a modern pharmacy, in English the term \
           apothecary would today be seen as outdated or only approproriate if herbal remedies were on offer \
           to a large extent. The pharmas also used many other herbs not listed. The Greek word Pharmakeia \
           (Greek: φαρμακεία) derives from pharmakon (φάρμακον), meaning 'drug', 'medicine' (or 'poison')."

question = "What word is the word pharmacy taken from?"

print("Output: {}".format(get_model_answer(model, question, passage, tokenizer)))
print("Expected: pharma")

Output: pharma
Expected: pharma


In [38]:
passage = "A City of Hope scientist and his colleagues have developed \
           a user-friendly approach to creating 'theranostics' \
           — therapy combined with diagnostics — that target specific tumors and diseases.\
           Key to the process are molecules called metallocorroles, \
           which serve as versatile platforms for the development of drugs and imaging agents. \
           Metallcorroles both locate (via imaging) and kill tumors. \
           City of Hope’s John Termini, Ph.D., and his colleagues at Caltech \
           and the Israel Institute of Technology have developed a novel method \
           to prepare cell-penetrating nanoparticles called \
           “metallocorrole/protein nanoparticles.” \
           The theranostics could both survive longer in the body \
           and better snipe disease targets."                       

question = "What does theranostics refer to?"

print("Output: {}".format(get_model_answer(model, question, passage, tokenizer)))

Output: therapy combined with diagnostics


Now let's try it on clinical notes. Below is an excerpt of a doctor's notes for a patient with an abnormal echocardiogram (the sample is taken from [here](https://www.mtsamples.com/site/pages/sample.asp?Type=6-Cardiovascular%20/%20Pulmonary&Sample=1597-Abnormal%20Echocardiogram)).

In [40]:
passage = "Abnormal echocardiogram findings and followup. Shortness of breath, congestive heart failure, \
           and valvular insufficiency. The patient complains of shortness of breath, which is worsening. \
           The patient underwent an echocardiogram, which shows severe mitral regurgitation and also large \
           pleural effusion. The patient is an 86-year-old female admitted for evaluation of abdominal pain \
           and bloody stools. The patient has colitis and also diverticulitis, undergoing treatment. \
           During the hospitalization, the patient complains of shortness of breath, which is worsening. \
           The patient underwent an echocardiogram, which shows severe mitral regurgitation and also large \
           pleural effusion. This consultation is for further evaluation in this regard. As per the patient, \
           she is an 86-year-old female, has limited activity level. She has been having shortness of breath \
           for many years. She also was told that she has a heart murmur, which was not followed through \
           on a regular basis."

q1 = "How old is the patient?"
q2 = "Does the patient have any complaints?"
q3 = "What is the reason for this consultation?"
q4 = "What does her echocardiogram show?"
q5 = "What other symptoms does the patient have?"


questions = [q1, q2, q3, q4, q5]

for i, q in enumerate(questions):
    print("Question {}: {}".format(i+1, q))
    print()
    print("Answer: {}".format(get_model_answer(model, q, passage, tokenizer)))
    print()
    print()

Question 1: How old is the patient?

Answer: 86


Question 2: Does the patient have any complaints?

Answer: The patient complains of shortness of breath


Question 3: What is the reason for this consultation?

Answer: further evaluation


Question 4: What does her echocardiogram show?

Answer: severe mitral regurgitation and also large pleural effusion


Question 5: What other symptoms does the patient have?

Answer: colitis and also diverticulitis




In [41]:
passage = "The key to effective precision medicine is data. \
           A vast amount of individualized data from \
           hundreds of thousands of patients is required, \
           in order to match all possible genetic abnormalities with \
           their proper treatments."         
           
question = "What does make a precision medicine procedure effective?"

print("Output: {}".format(get_model_answer(model, question, passage, tokenizer)))

Output: [CLS]


In [42]:
passage = "The key to effective precision medicine is data. \
           A vast amount of individualized data from \
           hundreds of thousands of patients is required, \
           in order to match all possible genetic abnormalities with \
           their proper treatments. \
           Abnormal echocardiogram findings and followup. Shortness of breath, congestive heart failure, \
           and valvular insufficiency. The patient complains of shortness of breath, which is worsening. \
           The patient underwent an echocardiogram, which shows severe mitral regurgitation and also large \
           pleural effusion. The patient is an 86-year-old female admitted for evaluation of abdominal pain \
           and bloody stools. The patient has colitis and also diverticulitis, undergoing treatment. \
           During the hospitalization, the patient complains of shortness of breath, which is worsening. \
           The patient underwent an echocardiogram, which shows severe mitral regurgitation and also large \
           pleural effusion. This consultation is for further evaluation in this regard. As per the patient, \
           she is an 86-year-old female, has limited activity level. She has been having shortness of breath \
           for many years. She also was told that she has a heart murmur, which was not followed through \
           on a regular basis."           
           
question = "What does make a precision medicine procedure effective?"

print("Output: {}".format(get_model_answer(model, question, passage, tokenizer)))

Output: data
