![MSE Logo](https://moodle.msengineering.ch/pluginfile.php/1/core_admin/logocompact/300x300/1613732714/logo-mse.png "MSE Logo") 

# AnTeDe Lab 11: Question Answering using BERT

by Andrei Popescu-Belis (HES-SO)
using the [🤗 Huggingface models](https://huggingface.co/models),
an [article by Marius Borcan](https://programmerbackpack.com/bert-nlp-using-distilbert-to-build-a-question-answering-system/) and 
an [article by Ramsi Goutham](https://towardsdatascience.com/simple-and-fast-question-answering-system-using-huggingface-distilbert-single-batch-inference-bcf5a5749571)

**Summary**
The goal of this lab is to implement and test a simple question answering (QA) system over a set of articles.  The structure of the lab is as follows:
1. Answer extraction from a text fragment -- in this part, you will use a pre-trained model named DistilBERT (a lighter version of BERT) which can extract the most likely answer to a given question from a text fragment (in English).
2. Text retrieval given a question -- in this part, you will reuse code from Lab 4 (Search Engine) to design a paragraph retrieval system over the 300-article Lee corpus provided with `gensim`. 
3. Integration and testing -- in this part, you will put together the functions from the previous two parts, and test your system end-to-end by designing a test set of 10 questions.

## Implemented by:
- Adrian Willi (adrian.willi@hslu.ch)
- Florian Bär (florian.baer@hslu.ch)

<font color='green'>Please answer the questions in green within this notebook.  The expected answers are generally very short: 1-2 commands or 2-3 lines of explanations.  At the end, please submit the completed notebook under the corresponding homework on Moodle.</font>

## 1. Answer extraction using DistilBERT

As you know, the BERT pre-trained model can be fine-tuned for question answering, by training it to provide the start and end word of an input text fragment which is most likely the answer to an input question.  You will use the 🤗 Huggingface Python module called `transformers`, and later use a DistilBERT model also provided by 🤗 Huggingface.

### a. Install `pytorch` and `transformers`

Use the instructions provided by [PyTorch](https://pytorch.org/get-started/locally/#start-locally) and by [Huggingface](https://github.com/huggingface/transformers#installation).  The use of `conda` is recommended.

In [68]:
!pip install torch transformers -q

In [69]:
import torch

<font color='green'>Please generate a random 2x2x2 tensor with Pytorch.  Please display whether the workstation you use has a GPU or not.</font><br/>
(Note: a GPU is not required for this lab.)

In [70]:
print(torch.rand((2,2,2)))
print(torch.cuda.is_available())

tensor([[[0.8682, 0.7508],
         [0.3882, 0.1904]],

        [[0.3288, 0.0716],
         [0.4065, 0.5929]]])
True


In [71]:
import transformers

🤗 Huggingface provides a very large repository of Transformer-based models at https://huggingface.co/models.

<font color='green'>Please use the search interface (in a browser) and find out *how many models containing the name 'distilbert' for Question Answering* are available.  If we exclude those submitted by individual users, how many models are there left?  Please paste below their name and version date, and the size of their 'pytorch_model.bin' file.</font>


<font color='green'>By looking at their "model cards", which model has the highest performance on the SQuAD dev set?</font>  In what follows, we will use this model.

Totally, there are 203 Models containing the name 'distilbert' for question answering. 

Excluding those submitted by users there are 2 models.

- **distilbert-base-uncased-distilled-squad** with a size of 253 MB
  - F1 score of 86.9
- **distilbert-base-cased-distilled-squad** with a size of 249 MB
  - F1 score of 87.1

In [172]:
model_name = 'distilbert-base-cased-distilled-squad'

### b. Tokenization of the input

We will use here a tokenizer called `DistilBertTokenizer` to tokenize the question and the text fragment and transform the tokens into numerical indices.  The documentation for this tokenizer is included in the general documentation of DistilBERT models at: https://huggingface.co/transformers/model_doc/distilbert.html 

In [173]:
from transformers import DistilBertTokenizer, AutoTokenizer
# you could use the AutoTokenizer as well

<font color='green'>Please create an instance of such a tokenizer 
using the pre-trained model named 'distilbert-base-cased'.  The command
will download the necessary model the first time you use it.</font>

In [174]:
tokenizer = DistilBertTokenizer.from_pretrained(model_name)

<font color='green'>What does this instance return if you **call** it with a sentence (a *string*) as an argument?  Please write the instruction below, and be sure you include the word 'Winterthur' in your sentence.</font>

In [175]:
#print(tokenizer.tokenize('I am so funny as i am eating a döner in Winterthur'))
#print(tokenizer.tokenize('I so as i am eating a döner in Winterthur'))
sentences = ['I am so funny as i am eating a döner in Winterthur',
             'I so as i am eating fish in Winterthur']
tokens = tokenizer(sentences, padding=True)
print(tokenizer(sentences[0]).keys())
print(tokens)

dict_keys(['input_ids', 'attention_mask'])
{'input_ids': [[101, 146, 1821, 1177, 6276, 1112, 178, 1821, 5497, 170, 173, 19593, 2511, 1107, 4591, 1582, 2149, 102], [101, 146, 1177, 1112, 178, 1821, 5497, 3489, 1107, 4591, 1582, 2149, 102, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]]}


<font color='green'>Please explain in your own words the meaning of the two components of the output above.  For that, please use the [documentation of the class DistilBertTokenizer](https://huggingface.co/transformers/model_doc/distilbert.html#distilberttokenizer), and be sure you read the documentation of its *superclasses* as well.  Under what superclass do you find the links to the [glossary entries](https://huggingface.co/transformers/glossary.html) that best explain the two components, and what are these entries?</font>

As the documentation shows, the tokenizer returns the `input_ids` and the `attention_mask`. The `input_ids` are the vocabulary indices that identify each token, and they are what is fed into the model. The `attention_mask` marks which positions hold real tokens (1) and which hold padding (0). This makes it possible to batch sequences of different lengths while the model ignores the padded positions.
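
To make the role of the two components concrete, here is a minimal sketch that pads two id sequences by hand. The helper `pad_batch` is hypothetical and only for illustration; `pad_id=0` is taken from the `[PAD]` positions visible in the output above. The real tokenizer does the equivalent internally when called with `padding=True`.

```python
def pad_batch(sequences, pad_id=0):
    """Pad lists of token ids to a common length and build the attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token the model should attend to, 0 marks padding
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[101, 146, 1821, 102], [101, 146, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

The shorter sequence receives one trailing pad id and a matching 0 in its mask, which is the same pattern as in the second sentence of the tokenizer output.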

<font color='green'>If you haven't explained above, please explain here the cause of the difference between the number of words of your sentence, and the number of tokens in the observed output.  Please display the tokens of the output. You can use the documentation of the superclass found above or the examples in the [glossary](https://huggingface.co/transformers/glossary.html).</font>

The sentence is not split into words but into subword tokens (WordPiece), so less common words such as 'Winterthur' are represented by several tokens. The tokenizer also adds the special tokens [CLS] and [SEP]. The subword approach is similar in spirit to the one used in fastText.
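
In the output above, 'Winterthur' accounts for three ids (4591, 1582, 2149). The following sketch shows how a greedy longest-match subword split can produce such a result. The function `wordpiece` and the toy vocabulary are assumptions made for illustration; the actual pieces and vocabulary of DistilBERT may differ.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword tokens."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary, not the real DistilBERT one
toy_vocab = {"Winter", "##th", "##ur", "in", "eating"}
print(wordpiece("Winterthur", toy_vocab))
print(wordpiece("eating", toy_vocab))
```

A frequent word such as "eating" stays whole, while a rare word is broken into pieces, which explains why the number of tokens exceeds the number of words.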

<font color='green'>How can you convert back the first part of the output to the original string?
Please write and execute the command(s) below.  You can use the documentation of the superclass found above or the examples in the [glossary](https://huggingface.co/transformers/glossary.html).</font>

In [176]:
print(tokenizer.decode(tokens.input_ids[0]))
print(tokenizer.decode(tokens.input_ids[1]))

[CLS] I am so funny as i am eating a döner in Winterthur [SEP]
[CLS] I so as i am eating fish in Winterthur [SEP] [PAD] [PAD] [PAD] [PAD] [PAD]


### c. Generation of input in the desired form

We need to generate input in the form expected by the `DistilBertForQuestionAnswering` class.  This means providing the question, the text from which the answer must be extracted, with the proper [CLS] and [SEP] tokens, and the attention masks.  Moreover, using DistilBERT requires that the lists of indices returned by the tokenizer are Pytorch tensors (see tokenizer's option `return_tensors`).

<font color='green'>What is the correct way to call the tokenizer in order to obtain these results?  You can use the example provided at the end of the [DistilBertForQuestionAnswering](https://huggingface.co/transformers/model_doc/distilbert.html?distilbertforquestionanswering#distilbertforquestionanswering) documentation.  <br/>Please define a *question* and a *text* string of your own, and store the result of the tokenizer in a variable called *input*.  <br/>   Please verify (by converting back to the result) that the input has the correct tokens.</font>

In [177]:
question = 'What is your age?'
# the text (context) from which the answer is extracted
text = 'As seen in the internet, the age of you is 25 Years old, which is very old.'
inputs = tokenizer(question, text, return_tensors='pt')
print(tokenizer.decode(inputs.input_ids[0]))
print(inputs.input_ids[0].shape[0])

[CLS] What is your age? [SEP] As seen in the internet, the age of you is 25 Years old, which is very old. [SEP]
28


### d. Execution of the model over the input question and text

In this section, you will create an instance of the BERT neural network adapted to question answering.  The class is named `DistilBertForQuestionAnswering`.  The model itself (the weights) is the one that you found at the end of (1a) above.

In [178]:
from transformers import DistilBertForQuestionAnswering

<font color='green'>Please create an instance of the model here.</font>  The data will be downloaded the first time you create it. 

In [179]:
tokenizer = DistilBertTokenizer.from_pretrained(model_name)
model = DistilBertForQuestionAnswering.from_pretrained(model_name)

The results of applying the model to your question and text (i.e. extracting the answer) are obtained by calling the model with the correct inputs.  

<font color='green'>Please use the inputs you obtained above and read the [documentation of the DistilBertForQuestionAnswering class](https://huggingface.co/transformers/model_doc/distilbert.html?distilbertforquestionanswering#distilbertforquestionanswering) (under *forward*) to apply the model to your data.  Store the results in a variable called *outputs*.</font>

In [180]:
with torch.no_grad():
    outputs = model(**inputs)

<font color='green'>Where are the probability values for the position of the **start** of the answer in *outputs*?</font> 
- They are in `outputs.start_logits`, which holds one value per input token. The highest value, 9.8292, is at index 18.

<font color='green'>Are these actual probabilities or other type of coefficients?</font> 
- These are not probabilities, otherwise they would sum to 1, which they do not. They are the logits, i.e. the raw scores before softmax is applied.

<font color='green'>
How many values are there, and is this coherent with your observations in (1b)</font> 

- There are 28 values: one start logit (and one end logit) for each token of the input, as shown by the shape `[1, 28]` of `start_logits`. This matches the 28 tokens produced by the tokenizer for the question and the text.
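
As a small illustration of the difference between logits and probabilities, the sketch below applies softmax to a list of made-up scores. The values in `toy_start_logits` are hypothetical and are not the model output.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical start logits for a 5-token input (not the real model output)
toy_start_logits = [-2.0, 0.5, 9.8, 1.2, -3.1]
probs = softmax(toy_start_logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(best, round(sum(probs), 6))
```

Softmax does not change the position of the maximum, so `argmax` can be applied to the logits directly. With PyTorch tensors, `torch.softmax(outputs.start_logits, dim=-1)` should give the corresponding probabilities.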

In [181]:
answer_start_index = outputs.start_logits.argmax()
print(outputs.start_logits.max())
answer_end_index = outputs.end_logits.argmax()
print(f'Length of start_logits is {(outputs.start_logits.shape)}')
print(f'Index is from {outputs.start_logits.argmax()}')
print(f'Index is to {outputs.end_logits.argmax()}')
print(outputs.end_logits.max())
print(len(outputs))
predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
print(tokenizer.decode(predict_answer_tokens))

tensor(9.8292)
Length of start_logits is torch.Size([1, 28])
Index is from 18
Index is to 20
tensor(8.7799)
2
25 Years old


### e. Determination of the start and the end of the answer in the text

<font color='green'>Please use the *outputs* of the model to determine the most likely start and end of the answer span in your text, and then obtain the actual answer.  How satisfied are you with the answer?</font>  You may use help from the [🤗 Huggingface entry on question answering](https://huggingface.co/transformers/task_summary.html#extractive-question-answering).

In [182]:
answer_start_index = outputs.start_logits.argmax()
answer_end_index = outputs.end_logits.argmax()
predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
print(tokenizer.decode(predict_answer_tokens))

25 Years old


<font color='green'>Please write a function called *answer_extraction* that gathers the previous operations: it takes two strings as arguments, creates instances of the tokenizer and the model, extracts the answer, and returns it as a string (possibly empty).  Do not create a new *tokenizer* and *model*, but assume that the ones you created above are global variables accessible from this function.</font> 

In [183]:
def answer_extraction(question: str, text: str) -> str:
    inputs = tokenizer(question, text, return_tensors='pt')
    with torch.no_grad():
        outputs = model(**inputs)
    answer_start_index = outputs.start_logits.argmax()
    answer_end_index = outputs.end_logits.argmax()
    
    predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
    return tokenizer.decode(predict_answer_tokens)
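
Taking the two `argmax` values independently can yield an end index that lies before the start index (an empty string) or a span on the `[CLS]` token. One possible refinement, sketched here on plain Python lists, is to search for the best-scoring valid span inside the text only. The function `best_span`, its parameters and the toy logits are assumptions made for illustration, not part of the `transformers` API.

```python
def best_span(start_logits, end_logits, ctx_start, ctx_end, max_len=30):
    """Return (start, end, score) of the best valid span, or None if there is none.

    Only spans inside the text (ctx_start <= start <= end <= ctx_end)
    with at most max_len tokens are considered.
    """
    best = None
    for s in range(ctx_start, ctx_end + 1):
        for e in range(s, min(s + max_len, ctx_end + 1)):
            score = start_logits[s] + end_logits[e]
            if best is None or score > best[2]:
                best = (s, e, score)
    return best


# Hypothetical logits: position 0 ([CLS]) has the highest start and end score
toy_start = [5.0, 0.1, 0.2, 1.0, 3.0, 0.5]
toy_end = [4.0, 0.0, 0.1, 0.2, 1.0, 2.5]
# Assume the text occupies positions 3 to 5
print(best_span(toy_start, toy_end, ctx_start=3, ctx_end=5))
```

In this toy case the plain `argmax` would select position 0 for both ends, whereas the constrained search returns a span inside the text. The summed score could also be compared with the score at position 0 to decide that a question has no answer in the fragment.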

<font color='green'>Please test the function on the following questions and short text.</font> 

In [184]:
print(answer_extraction('Are you funny?', 'Yes, I guess you are a very funny. I love it very much.'))

Yes, I guess you are a very funny


## 2. Fragment retrieval using `Gensim` (from Lab 4)

In this part, you will simply reuse code from Lab 4 to build a simple text retrieval system over the *Lee Corpus* provided with Gensim (300 news articles from the Australian Broadcasting Corporation).  
* The [Gensim tutorial on topics and transformations](https://radimrehurek.com/gensim/auto_examples/core/run_topics_and_transformations.html#sphx-glr-auto-examples-core-run-topics-and-transformations-py) provides the main idea.  
* The goal is to retrieve, given a question, a short text fragment that is most likely to contain the answer.  As articles are not divided into paragraphs, you will refactor the collection of articles into a collection of fragments of at most *N* sentences each (without mixing articles). 
* The question will be used as a *query*, with the pre-processing options of your choice.

In [185]:
!pip install contractions -q

In [186]:
from google.colab import drive
drive.mount('/content/gdrive')

# Modify path according to your configuration
# !ls "/content/gdrive/MyDrive/ColabNotebooks/MSE_AnTeDe_Spring2022"
import sys
sys.path.insert(0,'/content/gdrive/MyDrive/Colab Notebooks/MSE/AnTeDe/MSE_AnTeDe_Lab10_11')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [187]:
import gensim, nltk, os
from nltk.corpus import stopwords, wordnet
from TextPreprocessor import *

In [188]:
N = 4

<font color='green'>Load the articles of the Lee Background Corpus provided with Gensim into a list of strings (each article in a string) called *raw_articles*.</font>

In [189]:
test_data_dir = '{}'.format(os.sep).join([gensim.__path__[0], 'test', 'test_data'])
lee_train_file = test_data_dir + os.sep + 'lee_background.cor'
text = open(lee_train_file).read().splitlines()
raw_articles = text
print(len(raw_articles))

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
language = 'english'
stop_words = set(stopwords.words(language))
# Extend the list here:
for sw in ['\"', '\'', '\'\'', '`', '``', '\'s']:
    stop_words.add(sw)


300
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


<font color='green'>Please transform the articles into a collection of text fragments called *corpus1* (a list of lists of strings), by cutting each article into fragments of *N* consecutive sentences (e.g. *N* = 5), except possibly for the last fragment, and tokenizing each sentence.  At the end, display the number of fragments of your collection.</font>
* Do not mix sentences from different articles in each fragment.
* The reason for this operation is that full articles are too long to give to DistilBERT as texts. (Try it!)

<font color='green'>Do not forget to pre-process the articles in preparation for search -- tokenization, stopword removal, and other operations if you want to explore them.</font>  
* A  text fragment is thus a list of strings (tokens). 
* Please inspect your corpus to make sure it is correctly built.

<font color='green'>If possible, store the original version of each fragment into a string in *corpus2* (i.e. non-tokenized, non-lowercased, etc.), because it will be better to pass it later to DistilBERT.  Otherwise, you can also reconstruct the full fragment from the tokens in corpus1.</font>

In [190]:
import itertools
def split_seq(iterable, size):
    it = iter(iterable)
    item = list(itertools.islice(it, size))
    while item:
        yield ' '.join(item)
        item = list(itertools.islice(it, size))

In [191]:
import pandas as pd  # needed for the DataFrame below
from nltk.tokenize import sent_tokenize
sentences_of_article = [list(split_seq(sent_tokenize(text), N)) for text in raw_articles]
print((sentences_of_article[3][1]))
corpus2 = sentences_of_article.copy()

# flatten the per-article lists of fragments into a single list
fragments = [fragment for frags in sentences_of_article for fragment in frags]
print('len of fragments')
print(len(fragments))

fragments2 = fragments.copy()
# TextPreprocessor? - get help regarding the attributes

processor = TextPreprocessor(
# Add options here:
 language = language,
 stopwords = stop_words
)

print(type(sentences_of_article))
print(len(sentences_of_article))
print(type(sentences_of_article[0]))
print(len(sentences_of_article[0]))

frags = pd.DataFrame(fragments, columns=['corpus2'])

frags['corpus'] = processor.transform(frags['corpus2'])

print(frags.head())

Fresh elections are not scheduled until March leaving whoever assumes the presidency with the daunting task of tackling Argentina's worst crisis in 12 years, but this time, isolated by international lending agencies.
len of fragments
790
<class 'list'>
300
<class 'list'>
4
                                             corpus2  \
0  Hundreds of people have been forced to vacate ...   
1  The New South Wales Rural Fire Service says th...   
2  Rain has fallen in some parts of the Illawarra...   
3  "In fact, they've probably hampered the effort...   
4  Indian security forces have shot dead eight su...   

                                              corpus  
0  hundred people force vacate home southern high...  
1  new south wale rural fire service say weather ...  
2  rain fall part illawarra sydney hunter valley ...  
3  fact probably hamper effort firefighter wind g...  
4  indian security force shot dead eight suspect ...  


<font color='green'>Please create a search index (called *search_index*) using a *tfidf* model and transform all text fragments from *corpus1* into document vectors.</font>

In [192]:
from gensim.utils import simple_preprocess
from gensim import models, corpora, similarities

doc_tokenized = [simple_preprocess(doc) for doc in frags['corpus']]
dictionary = corpora.Dictionary(doc_tokenized)
corpus = [dictionary.doc2bow(text) for text in doc_tokenized]

tfidf_model = models.TfidfModel(corpus)
frags['search_index'] = [dictionary.doc2bow(doc.lower().split()) for doc in frags['corpus']]

<font color='green'>Please write a function called *fragment_retrieval* which returns the most relevant text fragment (string) from the corpus given a question, which is used as the query.</font>  
* The function processes the query in the same way as the documents (using the *tfidf model*) to obtain a *vectorized_query*.
* This is passed to the *search_index* to rank all documents by relevance.
* All the resources created above are supposed available as global variables (the dictionary, the tfidf model, the search_index, the corpus).

In [210]:
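from gensim.similarities import MatrixSimilarity
def fragment_retrieval(query):
    # Pre-process the query like the fragments. The original line was truncated
    # ("query = prepr"); reusing the global `processor` here is an assumption.
    query = list(processor.transform(pd.Series([query])))[0]
    vec_bow = dictionary.doc2bow(query.lower().split())
    vec_tfidf = tfidf_model[vec_bow]
    # Note: the index is rebuilt on every call; it could be built once globally.
    index = MatrixSimilarity(tfidf_model[corpus], num_best=1)
    sims = index[vec_tfidf]  # list of (fragment id, similarity) pairs
    if len(sims) < 1:
        return ''
    best_doc, _score = sims[0]
    return frags['corpus2'].iloc[best_doc]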
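
The ranking principle behind this function can be sketched without Gensim. The pure-Python code below weights terms by tf-idf, normalises the vectors and ranks documents by cosine similarity. It is a simplified stand-in that roughly mirrors what `TfidfModel` and `MatrixSimilarity` do; the function names and the three toy documents are assumptions made for illustration.

```python
import math
from collections import Counter


def normalise(vec):
    """Scale a sparse vector (dict) to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm > 0 else {}


def build_tfidf(docs):
    """docs: list of token lists. Returns (idf, list of unit-length tf-idf vectors)."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log2(n_docs / count) for term, count in df.items()}
    vectors = [normalise({t: tf * idf[t] for t, tf in Counter(doc).items()})
               for doc in docs]
    return idf, vectors


def retrieve(query_tokens, idf, vectors):
    """Return the index of the most similar document, or None if nothing matches."""
    q = normalise({t: tf * idf[t]
                   for t, tf in Counter(query_tokens).items() if t in idf})
    scores = [sum(w * vec.get(t, 0.0) for t, w in q.items()) for vec in vectors]
    best = max(range(len(scores)), key=scores.__getitem__, default=None)
    return best if best is not None and scores[best] > 0 else None


# Hypothetical, already pre-processed toy fragments
toy_docs = [["fire", "service", "weather"],
            ["mayor", "new", "york", "city"],
            ["actress", "golden", "globe"]]
idf, vectors = build_tfidf(toy_docs)
print(retrieve(["mayor", "new", "york"], idf, vectors))
print(retrieve(["opera"], idf, vectors))
```

A query whose terms are all unknown to the corpus returns `None`, which corresponds to the empty string returned by `fragment_retrieval` above.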

<font color='green'>Please apply the above function to the queries provided below.</font>  

Note: again, the corpus, search_index, tfidf and dictionary are available as global variables.

In [213]:
queries = ["Who is the mayor of New York?", 
           "Who is Nicole Kidman?", 
           "How many Australians died in the 1999 Interlaken canyoning accident?",
            "What is Kieren Perkins' sport?"]
for q in queries:
    print(q, '->', fragment_retrieval(q))

Who is the mayor of New York? -> "I felt that my job as the mayor was to turn around the city, because I believed - rightly or wrongly - that we had one last chance to do that." Mr Giuliani, a Republican, has served two terms as New York City's Mayor since 1993. Term limits prevent him from seeking a third term in office, and he will be succeeded by billionaire media mogul Michael Bloomberg.
Who is Nicole Kidman? -> In the United States, Australian actress Nicole Kidman has been nominated for two Golden Globe best actor awards for her roles in the Australian-made musical "Moulin Rouge", and in her new thriller "The Others". "Moulin Rouge" also is one of two pictures leading the Golden Globe nominations, with six possible awards. It is vying for best musical or comedy picture of 2002, best actress in a comedy or musical, best actor in the same category for Ewen McGregor, best director for Baz Luhrmann, best original score and best original song. The other film to pick up six nominations

## 3. Integration, testing and discussion

<font color='green'>Using the two functions 'fragment_retrieval' and 'answer_extraction' from parts 1 and 2, and assuming all models and data are available as global variables, please create a unique function which returns the answer (string) to a question (string).</font>

In [208]:
def question_answering(question):
    text = fragment_retrieval(question)
    if text is None:
        text = ''
    return answer_extraction(question, text)

<font color='green'>Please add between 5 and 10 more questions to the following list.  You can add answerable and non-answerable questions (with respect to the corpus).</font>

In [212]:
questions = ["Who is the mayor of New York?", 
            "Who is Nicole Kidman?", 
            "How many Australians died in the 1999 Interlaken canyoning accident?",
            "What caused the 1999 Interlaken canyoning accident?",
            "Which city is the capital of Australia?",
            "Who is the prime-minister of Israel?",
            "What are the main Australian airlines?",
            "What is Kieren Perkins' sport?",
             "Which is the funniest sport?",
             "Which is the most famous sport in australia?",
             "Where is the famous opera in australia?",
             "When was the last war australia was involved?",
             "When was the second world war?", 
             "Name me a famous american actor."]
for q in questions:
    print(q, '->', question_answering(q))

Who is the mayor of New York? -> Mr Giuliani
Who is Nicole Kidman? -> Australian actress
How many Australians died in the 1999 Interlaken canyoning accident? -> 14
What caused the 1999 Interlaken canyoning accident? -> thunderstorm
Which city is the capital of Australia? -> Port - au - Prince
Who is the prime-minister of Israel? -> Israel
What are the main Australian airlines? -> The spokesman
What is Kieren Perkins' sport? -> swam
Which is the funniest sport? -> 
Which is the most famous sport in australia? -> [CLS]
Where is the famous opera in australia? -> Launceston and Melbourne
When was the last war australia was involved? -> inauguration ceremony
When was the second world war? -> the American gained the ascendancy in the second set
Name me a famous american actor. -> Mr Rini


<font color='green'>Please discuss the correctness of the answers, give possible reasons for incorrect ones, and make suggestions for improvements.</font>

Write your discussion here or in a cell below.
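
To support the discussion with numbers, the answers could be scored against reference answers with exact match and token-level F1, in the style of the SQuAD evaluation. The sketch below is one possible implementation. The predictions are copied from the output above, while the reference answers are hypothetical and only serve to illustrate the scoring.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    return normalize_answer(prediction) == normalize_answer(gold)


def token_f1(prediction, gold):
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    if not pred_tokens or not gold_tokens:
        # both empty counts as a match, one empty as a miss
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# (prediction from the output above, hypothetical reference answer)
examples = [
    ("Mr Giuliani", "Giuliani"),
    ("Port - au - Prince", "Canberra"),
]
for pred, gold in examples:
    print(pred, "|", gold, "|", exact_match(pred, gold), round(token_f1(pred, gold), 2))
```

Token-level F1 gives partial credit to answers such as "Mr Giuliani" that contain the reference but are not identical to it, whereas exact match does not.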

When you have finished, please clean the notebook, re-run it one last time from start to end, and then submit it on Moodle.