![MSE Logo](https://moodle.msengineering.ch/pluginfile.php/1/core_admin/logocompact/300x300/1613732714/logo-mse.png "MSE Logo") 

# AnTeDe Lab11: Question Answering using BERT

by Andrei Popescu-Belis (HES-SO)
using the [🤗 Huggingface models](https://huggingface.co/models),
an [article by Marius Borcan](https://programmerbackpack.com/bert-nlp-using-distilbert-to-build-a-question-answering-system/) and 
an [article by Ramsi Goutham](https://towardsdatascience.com/simple-and-fast-question-answering-system-using-huggingface-distilbert-single-batch-inference-bcf5a5749571)

**Summary**
The goal of this lab is to implement and test a simple question answering (QA) system over a set of articles.  The structure of the lab is as follows:
1. Answer extraction from a text fragment -- in this part, you will use a pre-trained model named DistilBERT (a lighter version of BERT) which can extract the most likely answer to a given question from a text fragment (in English).
2. Text retrieval given a question -- in this part, you will reuse code from Lab 4 (Search Engine) to design a paragraph retrieval system over the 300-article Lee corpus provided with `gensim`. 
3. Integration and testing -- in this part, you will put together the functions from the previous two parts, and test your system end-to-end by designing a test set of 10 questions.

<font color='green'>Please answer the questions and solve the tasks in green within this notebook.  The expected answers are generally very short: 1-2 commands or 2-3 lines of explanations.  At the end, please submit the completed notebook under the corresponding homework on Moodle.</font>

## 1. Answer extraction using DistilBERT

As you know, the BERT pre-trained model can be fine-tuned for question answering, by training it to provide the start and end word of an input text fragment which is most likely the answer to an input question.  You will use the 🤗 Huggingface Python module called `transformers`, and later use a DistilBERT model also provided by 🤗 Huggingface.

### a. Install `pytorch` and `transformers`

Use the instructions provided by [PyTorch](https://pytorch.org/get-started/locally/#start-locally) and by [Huggingface](https://github.com/huggingface/transformers#installation).  The use of `conda` is recommended.

In [926]:
import torch

<font color='green'>**Task**: Please generate a random 2x2x2 tensor with PyTorch.  Please display whether the workstation you use has a GPU or not.</font><br/>
(Note: a GPU is not required for this lab.)

In [927]:
# Generate and print a tensor with random values between 0 and 1 of shape (2,2,2)
tensor = torch.rand(2,2,2)
print(tensor)

tensor([[[0.4421, 0.2892],
         [0.7718, 0.8245]],

        [[0.2870, 0.4521],
         [0.1176, 0.9186]]])


In [928]:
# Check whether a GPU is available or not
device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps"
    if torch.backends.mps.is_available()
    else "cpu"
)
print(f"Using {device} device")

Using cpu device


In [929]:
import transformers

🤗 Huggingface provides a very large repository of Transformer-based models at https://huggingface.co/models.

<font color='green'>**Task**: Please use the search interface (in a browser) and find out *how many models containing the name 'distilbert' for Question Answering* are available.  If we exclude those submitted by individual users, how many models are there left?  Please paste below their name and version date, and the size of their 'pytorch_model.bin' file.</font>

<font color='green'>By looking at their "model cards", which model has the highest performance on the SQuAD dev set?</font>  In what follows, we will use this model.

In [930]:
# As of writing this comment, there are a total of 1222 available question answering models with "distilbert" in their name. 

# Excluding models submitted by individual users there seem to be 2 models left.

# - DistilBERT base cased distilled SQuAD 
#   Last updated on the 12th of April 2023
#   "pytorch_model.bin" file size = 261 MB
#   Answer: 'SQuAD dataset', score: 0.5152, start: 147, end: 160

# - DistilBERT base uncased distilled SQuAD
#   Last updated on the 6th of April 2023
#   "pytorch_model.bin" file size = 265 MB
#   Answer: 'SQuAD dataset', score: 0.4704, start: 147, end: 160 

# The first "cased" model has a slightly higher score of 0.5152


### b. Tokenization of the input

We will use here a tokenizer called `DistilBertTokenizer` to tokenize the question and the text fragment and transform the tokens into numerical indices.  The documentation for this tokenizer is included in the general documentation of DistilBERT models at: https://huggingface.co/transformers/model_doc/distilbert.html 

In [931]:
from transformers import DistilBertTokenizer

<font color='green'>**Task**: Please create an instance of such a tokenizer 
using the pre-trained model named 'distilbert-base-cased'.  The command
will download the necessary model the first time you use it.</font>

In [932]:
# Instantiate DistilBertTokenizer
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased-distilled-squad')

<font color='green'>**Task**: What does this instance return if you **call** it with a sentence (a *string*) as an argument?  Please write the instruction below for the sentence 'There are three museums in Winterthur.'.</font>

In [933]:
# Define a sentence and call the tokenizer instance with the sentence as an argument
sentence = "There are three museums in Winterthur"
transf_return = tokenizer(sentence)

# Check what the tokenizer instance returned
print(f"The transformer returned something of type: {type(transf_return)} \n")
print(transf_return)

The transformer returned something of type: <class 'transformers.tokenization_utils_base.BatchEncoding'> 

{'input_ids': [101, 1247, 1132, 1210, 11765, 1107, 4591, 1582, 2149, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


In [934]:
# The returned data structure is a dictionary-like BatchEncoding object with two keys: "input_ids" and "attention_mask".

<font color='green'>**Task**: Please explain in your own words the meaning of the two components of the output above.  For that, please use the [documentation of the class DistilBertTokenizer](https://huggingface.co/transformers/model_doc/distilbert.html#distilberttokenizer), and be sure you read the documentation of its *superclasses* as well.  Under what superclass do you find the links to the [glossary entries](https://huggingface.co/transformers/glossary.html) that best explain the two components, and what are these entries?</font>

In [935]:
# The first key, "input_ids", holds the token IDs of the input text.
# There are more IDs than words because the tokenizer splits some words into subwords and adds the special tokens [CLS] and [SEP].
# The second key, "attention_mask", holds the attention mask. It indicates which tokens the model should attend to (1) and which it should ignore (0).
# This is needed when several sentences are stored in the same tensor and the shorter ones are padded to a common length.

# I managed to find a hint about input_ids and the attention mask on the tokenizer page.
# https://huggingface.co/docs/transformers/v4.40.0/en/main_classes/tokenizer#transformers.PreTrainedTokenizer

# The actual links to the glossary for these two terms I managed to find on the DistilBERT page under DistilBERTModel:
# https://huggingface.co/docs/transformers/model_doc/distilbert#transformers.DistilBertModel.forward

# I however found the page very difficult to navigate and find this information in the first place.
# If there is a very simple solution to this, then I would be curious to know.
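
To make the role of the attention mask concrete, here is a minimal pure-Python sketch of padding a batch of two token-ID sequences to a common length. The helper name `pad_batch` is hypothetical, and the padding ID 0 is an assumption (it is the usual `[PAD]` index in BERT-style vocabularies); in practice the tokenizer does this itself when called with `padding=True`.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-ID lists to the same length and build the matching attention masks.

    pad_id=0 is an assumption about the [PAD] index of the vocabulary.
    """
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding that the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# The IDs of the sentence above, and a shorter sequence "[CLS] There are [SEP]"
long_ids = [101, 1247, 1132, 1210, 11765, 1107, 4591, 1582, 2149, 102]
short_ids = [101, 1247, 1132, 102]

batch = pad_batch([long_ids, short_ids])
print(batch["input_ids"][1])
print(batch["attention_mask"][1])
```

The shorter sequence is filled up with padding IDs, and its mask is 0 exactly at those positions.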


<font color='green'>**Task**: If you haven't explained above, please explain here the cause of the difference between the number of words of your sentence, and the number of tokens in the observed output.  Please display the tokens of the output. You can use the documentation of the superclass found above or the examples in the [glossary](https://huggingface.co/transformers/glossary.html).</font>

In [936]:
# As stated above, the difference stems from the tokenizer splitting words into subwords.
# As the output below shows, it split the word "Winterthur", which is not in its vocabulary as a whole, into 3 tokens and also added the [CLS] and [SEP] tokens.

In [937]:
# Store the list of input IDs in a variable
input_ids = transf_return["input_ids"]

# Convert token IDs to tokens and store them in a variable
tokens = tokenizer.convert_ids_to_tokens(input_ids)

# Print tokens
print(tokens)

['[CLS]', 'There', 'are', 'three', 'museums', 'in', 'Winter', '##th', '##ur', '[SEP]']
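
The subword split seen above can be illustrated with a small sketch of greedy longest-match-first segmentation, which is the general idea behind WordPiece tokenization. The function name `wordpiece_split` and the toy vocabulary are hypothetical; the real tokenizer uses its own vocabulary and additional pre-processing steps.

```python
def wordpiece_split(word, vocab, unk="[UNK]"):
    """Split a word into the longest matching vocabulary pieces, left to right.

    Pieces that do not start the word carry the '##' continuation prefix.
    """
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches: the whole word is mapped to the unknown token
            return [unk]
        pieces.append(piece)
        start = end
    return pieces


# Toy vocabulary (an assumption for this illustration only)
toy_vocab = {"There", "are", "three", "museums", "in", "Winter", "##th", "##ur"}

print(wordpiece_split("museums", toy_vocab))
print(wordpiece_split("Winterthur", toy_vocab))
```

With this toy vocabulary, "museums" stays whole while "Winterthur" is split into three pieces, as in the tokenizer output above.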


<font color='green'>**Question**: How can you convert back the first part of the output to the original string?
Please write and execute the command(s) below.  You can use the documentation of the superclass found above or the examples in the [glossary](https://huggingface.co/transformers/glossary.html).</font>

In [938]:
# Convert the input_ids back to the original sentence using the "decode" method
orig_sentence = tokenizer.decode(input_ids)
print(orig_sentence)

[CLS] There are three museums in Winterthur [SEP]


In [939]:
# Alternative using the "convert_tokens_to_string" method
orig_sentence = tokenizer.convert_tokens_to_string(tokens)
print(orig_sentence)

[CLS] There are three museums in Winterthur [SEP]


### c. Generation of input in the desired form

We need to generate input in the form expected by the `DistilBertForQuestionAnswering` class.  This means providing the question and the text from which the answer must be extracted, with the proper [CLS] and [SEP] tokens, together with the attention masks.  Moreover, using DistilBERT requires that the lists of indices returned by the tokenizer are PyTorch tensors (see the tokenizer's option `return_tensors`).

<font color='green'>**Question**: What is the correct way to call the tokenizer in order to obtain these results?  You can use the example provided at the end of the [DistilBertForQuestionAnswering](https://huggingface.co/transformers/model_doc/distilbert.html?distilbertforquestionanswering#distilbertforquestionanswering) documentation.  <br/>Please define a *question* and a *text* string of your own, and store the result of the tokenizer in a variable called *inputs*.  <br/>   Please verify (by converting the indices back to tokens) that the input has the correct tokens.</font>

In [940]:
# Question and source text
question, text = "What colour is that motorcycle?", "That motorcycle has a special light green colour."

# Call the tokenizer on the question/text and store the result in a variable
inputs = tokenizer(question, text, return_tensors="pt")
print(f"There are a total of {len(inputs['input_ids'][0])} tokens\n")

# Verification by converting the input_ids back to tokens
input_ids = inputs["input_ids"][0]
orig_sentence = tokenizer.decode(input_ids)
print("The original sentence recreated from the tokens:")
print(orig_sentence)

There are a total of 18 tokens

The original sentence recreated from the tokens:
[CLS] What colour is that motorcycle? [SEP] That motorcycle has a special light green colour. [SEP]
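
The layout of the decoded pair above can be reproduced with a short pure-Python sketch. The helper `build_pair` is hypothetical, and the word-level tokens are an assumption that happens to hold here because every word of this example maps to a single token (3 special tokens + 15 words = 18 tokens).

```python
def build_pair(question_tokens, text_tokens):
    """Lay out a question/text pair as [CLS] question [SEP] text [SEP]."""
    tokens = ["[CLS]"] + question_tokens + ["[SEP]"] + text_tokens + ["[SEP]"]
    # Without padding, every position is a real token, so the mask is all ones
    attention_mask = [1] * len(tokens)
    return tokens, attention_mask


question_tokens = ["What", "colour", "is", "that", "motorcycle", "?"]
text_tokens = ["That", "motorcycle", "has", "a", "special", "light", "green", "colour", "."]

pair_tokens, pair_mask = build_pair(question_tokens, text_tokens)
print(len(pair_tokens))
print(pair_tokens[8:])  # the text part starts right after the first [SEP]
```

Positions in the text are therefore shifted by the length of the question plus two special tokens, which matters when reading the answer indices later on.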


### d. Execution of the model over the input question and text

In this section, you will create an instance of the BERT neural network adapted to question answering.  The class is named `DistilBertForQuestionAnswering`.  

**Important note:** The model itself (the weights) is the one that you found at the end of (1a) above which is suited for question answering!

In [941]:
from transformers import DistilBertForQuestionAnswering

<font color='green'>**Task**: Please create an instance of the model here.</font>  The data will be downloaded the first time you create it. 

In [942]:
model = DistilBertForQuestionAnswering.from_pretrained("distilbert-base-cased-distilled-squad")

The results of applying the model to your question and text (i.e. extracting the answer) are obtained by calling the model with the correct inputs.  

<font color='green'>**Task**: Please use the inputs you obtained above and read the [documentation of the DistilBertForQuestionAnswering class](https://huggingface.co/transformers/model_doc/distilbert.html?distilbertforquestionanswering#distilbertforquestionanswering) (under *forward*) to apply the model to your data.  Store the results in a variable called *outputs*.</font>

In [943]:
with torch.no_grad():

    outputs = model(**inputs)

print(outputs)

QuestionAnsweringModelOutput(loss=None, start_logits=tensor([[-2.7944, -3.8852, -5.2124, -6.7933, -5.1928, -3.8815, -3.8448, -1.3391,
          3.0379,  1.5762,  0.4520,  5.1639,  4.5670,  9.2010,  6.9972, -0.7898,
         -2.8604, -1.3391]]), end_logits=tensor([[-0.7377, -4.3996, -3.7055, -6.3813, -6.0116, -2.9843, -4.2064, -0.9374,
         -3.0495, -0.8349, -4.5041, -2.4779, -0.7062,  2.4648,  9.8496,  7.1742,
          4.9999, -0.9373]]), hidden_states=None, attentions=None)


<font color='green'>**Task**: Please answer the following three questions: 
- Where are the probability values for the position of the **start** of the answer in *outputs*?
- Are these actual probabilities or other type of coefficients?  
- How many values are there, and is this coherent with your observations in (1b)?</font> 

In [944]:
# The scores for the position of the start of the answer are stored in outputs.start_logits (equivalently outputs[0]).
print(f"Logit values for the start position: \n{outputs[0][0]}\n")

# These values are logits, i.e. the raw scores before the softmax.
# They are therefore not actual probabilities, but they can still be used to check which tokens the model ranks highest.

# There is a total of 18 values.
# This makes sense, since there is one value for each token of the tokenized question/text pair (18 tokens, as counted above).
print(inputs.input_ids.shape)

Logit values for the start position: 
tensor([-2.7944, -3.8852, -5.2124, -6.7933, -5.1928, -3.8815, -3.8448, -1.3391,
         3.0379,  1.5762,  0.4520,  5.1639,  4.5670,  9.2010,  6.9972, -0.7898,
        -2.8604, -1.3391])

torch.Size([1, 18])
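
To see how such logits relate to probabilities, the following sketch applies a softmax to the start logits printed above, using only the standard library. The function `softmax` is written out here for illustration; with PyTorch one would typically call `torch.softmax(outputs.start_logits, dim=-1)` instead.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Start logits copied from the model output above
start_logits = [-2.7944, -3.8852, -5.2124, -6.7933, -5.1928, -3.8815, -3.8448, -1.3391,
                3.0379, 1.5762, 0.4520, 5.1639, 4.5670, 9.2010, 6.9972, -0.7898,
                -2.8604, -1.3391]

start_probs = softmax(start_logits)
best_start = max(range(len(start_probs)), key=start_probs.__getitem__)
print(f"Most likely start index: {best_start}")
print(f"Sum of probabilities: {sum(start_probs):.4f}")
```

The softmax does not change the ranking of the positions, so the most likely start index is the same whether it is read from the logits or from the probabilities.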


### e. Determination of the start and the end of the answer in the text

<font color='green'>**Question**: Please use the *outputs* of the model to determine the most likely start and end of the answer span in your text, and then obtain the actual answer.  How satisfied are you with the answer?</font>  You may use help from the [🤗 Huggingface entry on question answering](https://huggingface.co/transformers/task_summary.html#extractive-question-answering).

In [945]:
# Get the indices of the start and end tokens of the answer
answer_start_index = outputs.start_logits.argmax()
answer_end_index = outputs.end_logits.argmax()

# Get all the token IDs which the prediction found for the complete answer
predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]

# Print the found start and end indices
print(f"Start index: {answer_start_index}")
print(f"End index: {answer_end_index}")

# Print the token ids for the answer
print(f"Token IDs of the answer: {predict_answer_tokens}")

# Print the actual answer
answer = tokenizer.decode(predict_answer_tokens)
print(answer)

Start index: 13
End index: 14
Token IDs of the answer: tensor([1609, 2448])
light green


In [946]:
# The answer is "light green".
# This matches what I would expect ("special light green" / "light green"): the colour is correctly identified, only the qualifier "special" is left out.
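
Taking the argmax of the start and end logits independently, as in the cell above, can in principle yield an end index that lies before the start index, in which case the extracted answer is empty. A common alternative is to search for the best-scoring valid span. The sketch below is one possible way to do this in pure Python; the function name `best_span` and the limit `max_answer_len` are assumptions, and it does not exclude spans that fall inside the question or on special tokens.

```python
def best_span(start_logits, end_logits, max_answer_len=15):
    """Return the (start, end) pair with start <= end that maximizes the summed logits."""
    best, best_score = None, float("-inf")
    for s, s_logit in enumerate(start_logits):
        # Only consider ends at or after the start, within the length limit
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best


# Logits copied from the model output above
start_logits = [-2.7944, -3.8852, -5.2124, -6.7933, -5.1928, -3.8815, -3.8448, -1.3391,
                3.0379, 1.5762, 0.4520, 5.1639, 4.5670, 9.2010, 6.9972, -0.7898,
                -2.8604, -1.3391]
end_logits = [-0.7377, -4.3996, -3.7055, -6.3813, -6.0116, -2.9843, -4.2064, -0.9374,
              -3.0495, -0.8349, -4.5041, -2.4779, -0.7062, 2.4648, 9.8496, 7.1742,
              4.9999, -0.9373]

print(best_span(start_logits, end_logits))

# A made-up case where the independent argmax values would give start=2, end=0
toy_span = best_span([0.1, 0.2, 5.0], [4.0, 0.3, 1.0])
print(toy_span)
```

For the logits of this notebook the search returns the same span as the independent argmax; the made-up second case shows where the two approaches differ.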

<font color='green'>**Task**: Please write a function called *answer_extraction* that gathers the previous operations: it takes two strings as arguments (the question and the text), applies the tokenizer and the model, extracts the answer, and returns it as a string (possibly empty).  Do not create a new *tokenizer* and *model*, but assume that the ones you created above are global variables accessible from this function.</font> 

In [947]:
def answer_extraction(question, text):
    """Return the answer to `question` extracted from `text` (possibly an empty string)."""

    # Create the inputs. The question is not lowercased: the model is cased,
    # and a bare question.lower() would not modify the string anyway.
    inputs = tokenizer(question, text, return_tensors="pt")

    # Get the outputs without tracking gradients
    with torch.no_grad():
        outputs = model(**inputs)

    # Get the indices of the start and end tokens of the answer
    answer_start_index = outputs.start_logits.argmax()
    answer_end_index = outputs.end_logits.argmax()

    # Get the token IDs of the complete answer (empty if the end precedes the start)
    predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]

    # Decode the token IDs into the answer string
    answer = tokenizer.decode(predict_answer_tokens)

    return answer

<font color='green'>**Task**: Please test the function on the following questions and short text.</font> 

In [948]:
# Excerpt from Simple English Wikipedia:
text = """Switzerland is a small country in Western Europe. 
Switzerland is a confederation of even smaller states, which are the 26 cantons.
Switzerland is known for its neutrality.  Switzerland has been neutral since 1815. 
There are four official languages in Switzerland: German, French, Italian, and Romansh. 
"""
question1 = "How many cantons are there in Switzerland?"
question2 = "What is Switzerland famous for?"
question3 = "What are the official languages of Switzerland?"

# Answer question 1
answer1 = answer_extraction(question1, text)
print(question1)
print(f"Answer: {answer1}\n")

# Answer question 2
answer2 = answer_extraction(question2, text)
print(question2)
print(f"Answer: {answer2}\n")                                                                

# Answer question 3
answer3 = answer_extraction(question3, text)
print(question3)
print(f"Answer: {answer3}\n")

How many cantons are there in Switzerland?
Answer: 26

What is Switzerland famous for?
Answer: neutrality

What are the official languages of Switzerland?
Answer: German, French, Italian, and Romansh



## 2. Fragment retrieval using `Gensim` (from Lab 4)

In this part, you will simply reuse code from Lab 4 to build a simple text retrieval system over the *Lee Corpus* provided with Gensim (300 news articles from the Australian Broadcasting Corporation).  
* The [Gensim tutorial on topics and transformations](https://radimrehurek.com/gensim/auto_examples/core/run_topics_and_transformations.html#sphx-glr-auto-examples-core-run-topics-and-transformations-py) provides the main idea.  
* The goal is to retrieve, given a question, a short text fragment that is most likely to contain the answer.  As articles are not divided into paragraphs, you will refactor the collection of articles into a collection of fragments of at most *N* sentences each (without mixing articles). 
* The question will be used as a *query*, with the pre-processing options of your choice.

In [949]:
# Imports
import gensim, nltk, os
import pandas as pd
from nltk.corpus import stopwords, wordnet

# from TextPreprocessor import *

<font color='green'>**Task**: Load the articles of the Lee Background Corpus provided with Gensim into a list of strings (each article in a string) called *raw_articles*.</font>

In [950]:
# Load the Lee Corpus into a list of strings
test_data_dir = os.path.join(gensim.__path__[0], 'test', 'test_data')
lee_train_file = os.path.join(test_data_dir, 'lee_background.cor')
with open(lee_train_file) as f:
    raw_articles = f.read().splitlines()

# Check the data
print(raw_articles[0])
print(f"Number of strings in the list: {len(raw_articles)}")

Hundreds of people have been forced to vacate their homes in the Southern Highlands of New South Wales as strong winds today pushed a huge bushfire towards the town of Hill Top. A new blaze near Goulburn, south-west of Sydney, has forced the closure of the Hume Highway. At about 4:00pm AEDT, a marked deterioration in the weather as a storm cell moved east across the Blue Mountains forced authorities to make a decision to evacuate people from homes in outlying streets at Hill Top in the New South Wales southern highlands. An estimated 500 residents have left their homes for nearby Mittagong. The New South Wales Rural Fire Service says the weather conditions which caused the fire to burn in a finger formation have now eased and about 60 fire units in and around Hill Top are optimistic of defending all properties. As more than 100 blazes burn on New Year's Eve in New South Wales, fire crews have been called to new fire at Gunning, south of Goulburn. While few details are available at this

<font color='green'>**Task**: Please transform the articles into a collection of text fragments called *corpus1* (a list of lists of strings), by cutting each article into fragments of *N* consecutive sentences (e.g. *N* = 4), except possibly for the last fragment, and tokenizing each sentence.  At the end, display the number of fragments of your collection.</font>
* Do not mix sentences from different articles in each fragment.
* The reason for this operation is that full articles are too long to give to DistilBERT as texts. (Try it!)

<font color='green'>**Task**: Do not forget to pre-process the articles in preparation for search -- tokenization, stopword removal, and other operations if you want to explore them.</font>  
* A  text fragment is thus a list of strings (tokens). 
* Please inspect your corpus to make sure it is correctly built.

Hint: the raw text will start with:
 
*Hundreds of people have been forced to vacate their homes in the Southern Highlands of New South Wales as strong winds today pushed a huge bushfire towards the town of Hill Top. A new blaze near Goulburn, south-west of Sydney, has forced the closure of the Hume Highway. At about 4:00pm AEDT, a marked deterioration in the weather as a storm cell moved east across the Blue Mountains forced authorities to make a decision to evacuate people from homes in outlying streets at Hill Top in the New South Wales southern highlands. An estimated 500 residents have left their homes for nearby Mittagong. The New South Wales Rural Fire Service says the weather conditions which caused the fire to burn in a finger formation have now eased and about 60 fire units in and around Hill Top are optimistic of defending all properties. As more than 100 blazes burn on New Year's Eve in New South Wales, fire crews have been called to new fire at Gunning, south of Goulburn. While few details are available at this stage, fire authorities says it has closed the Hume Highway in both directions. Meanwhile, a new fire in Sydney's west is no longer threatening properties in the Cranebrook area. ....*

The first 8 sentences are to be decomposed into:

*[['hundreds', 'people', 'forced', 'vacate', 'homes', 'southern', 'highlands', 'new', 'south', 'wales', 'strong', 'winds', 'today', 'pushed', 'huge', 'bushfire', 'towards', 'town', 'hill', 'top', '.', 'new', 'blaze', 'near', 'goulburn', ',', 'south-west', 'sydney', ',', 'forced', 'closure', 'hume', 'highway', '.', '4:00pm', 'aedt', ',', 'marked', 'deterioration', 'weather', 'storm', 'cell', 'moved', 'east', 'across', 'blue', 'mountains', 'forced', 'authorities', 'make', 'decision', 'evacuate', 'people', 'homes', 'outlying', 'streets', 'hill', 'top', 'new', 'south', 'wales', 'southern', 'highlands', '.', 'estimated', '500', 'residents', 'left', 'homes', 'nearby', 'mittagong', '.'], ['new', 'south', 'wales', 'rural', 'fire', 'service', 'says', 'weather', 'conditions', 'caused', 'fire', 'burn', 'finger', 'formation', 'eased', '60', 'fire', 'units', 'around', 'hill', 'top', 'optimistic', 'defending', 'properties', '.', '100', 'blazes', 'burn', 'new', 'year', 'eve', 'new', 'south', 'wales', ',', 'fire', 'crews', 'called', 'new', 'fire', 'gunning', ',', 'south', 'goulburn', '.', 'details', 'available', 'stage', ',', 'fire', 'authorities', 'says', 'closed', 'hume', 'highway', 'directions', '.', 'meanwhile', ',', 'new', 'fire', 'sydney', 'west', 'longer', 'threatening', 'properties', 'cranebrook', 'area', '.']*

<font color='green'>**Task**: If possible, store the original version of each fragment into a string in *corpus2* (i.e. non-tokenized, non-lowercased, etc.), because it will be better to pass it later to DistilBERT.  Otherwise, you can also reconstruct the full fragment from the tokens in corpus1.</font>

In [951]:
from nltk.tokenize import RegexpTokenizer

# Instantiate a tokenizer that keeps only word characters and thus removes punctuation
tokenizer_regexp = RegexpTokenizer(r'\w+')

# Number of sentences per sublist
N = 4

# Define empty lists to store the fragments
corpus1 = []
corpus2 = []

# Stopwords
stop_words = stopwords.words("english")
# Add quote-like tokens to the list of stop words
stop_words.extend(['"', "'", "''", '`', '``', "'s"])

# Loop through each raw article
for raw_article in raw_articles:
    # Tokenize the article into sentences
    sentences = nltk.tokenize.sent_tokenize(raw_article)
    i = 0
    # Process each group of N sentences
    while i < len(sentences):
        # Concatenate N sentences to form a document
        document = " ".join(sentences[i:i+N])
        # Append the document to corpus2
        corpus2.append(document)
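        # Tokenize the document into words (the regexp tokenizer drops punctuation)
        tokens = tokenizer_regexp.tokenize(document)

        # Lowercase and remove stopwords. Note: the stopword test runs on the original casing,
        # so capitalized stopwords such as "A" or "At" are kept (see 'a' and 'at' in the output below);
        # testing `w.lower() not in stop_words` instead would remove them.
        tokens = [w.lower() for w in tokens if w not in stop_words]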
        # Append the preprocessed tokens to corpus1
        corpus1.append(tokens)
        # Move to the next group of sentences
        i += N
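
The grouping of sentences into fragments can also be written compactly with a stepped `range`. The sketch below uses only the standard library: the regular-expression sentence splitter is a naive stand-in for `nltk.tokenize.sent_tokenize` (an assumption made to keep the example self-contained), and the names `naive_sentences` and `chunk_sentences` are hypothetical.

```python
import re


def naive_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_sentences(sentences, n):
    """Join consecutive sentences into fragments of at most n sentences each."""
    return [" ".join(sentences[i:i + n]) for i in range(0, len(sentences), n)]


article = "One. Two! Three? Four. Five."
sentences = naive_sentences(article)
fragments = chunk_sentences(sentences, 4)
print(fragments)
```

Because each article is chunked separately, the last fragment of an article may contain fewer than *N* sentences, and sentences from different articles are never mixed.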

In [952]:
# Print the raw and processed text data as well as the overall number of fragments
print(f"Raw text data:\n {corpus2[0]}\n")
print(f"Processed text data:\n {corpus1[0]}\n")
print(f"Number of fragments:\n {len(corpus1)}\n")

Raw text data:
 Hundreds of people have been forced to vacate their homes in the Southern Highlands of New South Wales as strong winds today pushed a huge bushfire towards the town of Hill Top. A new blaze near Goulburn, south-west of Sydney, has forced the closure of the Hume Highway. At about 4:00pm AEDT, a marked deterioration in the weather as a storm cell moved east across the Blue Mountains forced authorities to make a decision to evacuate people from homes in outlying streets at Hill Top in the New South Wales southern highlands. An estimated 500 residents have left their homes for nearby Mittagong.

Processed text data:
 ['hundreds', 'people', 'forced', 'vacate', 'homes', 'southern', 'highlands', 'new', 'south', 'wales', 'strong', 'winds', 'today', 'pushed', 'huge', 'bushfire', 'towards', 'town', 'hill', 'top', 'a', 'new', 'blaze', 'near', 'goulburn', 'south', 'west', 'sydney', 'forced', 'closure', 'hume', 'highway', 'at', '4', '00pm', 'aedt', 'marked', 'deterioration', 'weathe

<font color='green'>**Task**: Please create a search index (called *search_index*) using a *tfidf* model and transform all text fragments from *corpus1* into document vectors.</font>

In [953]:
# Create a dictionary object to map words to unique integer IDs
dictionary = gensim.corpora.Dictionary(corpus1)

# Convert each document in the corpus into a bag-of-words representation
# This representation is a list of (word_id, word_frequency) tuples for each document
corpus = [dictionary.doc2bow(text) for text in corpus1]

# Step 1 -- initialize a TF-IDF model from the bag-of-words corpus
tfidf = gensim.models.TfidfModel(corpus)

# Step 2 -- transform the bag-of-words representation of the documents into TF-IDF weighted vectors,
# where the weights represent the importance of each word in each document relative to the entire corpus
corpus_tfidf = tfidf[corpus]

# Step 3 -- build a similarity index over the TF-IDF vectors of all fragments
search_index = gensim.similarities.MatrixSimilarity(corpus_tfidf, num_features=len(dictionary))
print(search_index)

MatrixSimilarity<790 docs, 7125 features>


<font color='green'>**Task**: Please write a function called *fragment_retrieval* which returns the most relevant text fragment (string) from the corpus given a question, which is used as the query.</font>  
* The function processes the query in the same way as the documents (using the *tfidf model*) to obtain a *vectorized_query*.
* This is passed to the *search_index* to rank all documents by relevance.
* All the resources created above are assumed to be available as global variables (the dictionary, the tfidf model, the search_index, the corpus).
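
The ranking principle described in these points can be sketched without Gensim. The example below builds TF-IDF vectors for three toy fragments and ranks them by cosine similarity to a query. All names and the toy data are hypothetical, and the weighting (raw term frequency times log2(N/df), followed by length normalization) is a simplifying assumption that may differ in details from what `gensim.models.TfidfModel` computes.

```python
import math
from collections import Counter


def build_idf(docs):
    """Inverse document frequency of every term in the tokenized documents."""
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {term: math.log2(len(docs) / df) for term, df in doc_freq.items()}


def vectorize(tokens, idf):
    """Unit-length TF-IDF vector (as a dict); terms unknown to the corpus are dropped."""
    counts = Counter(t for t in tokens if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def rank(query_tokens, doc_vectors, idf):
    """Return (index, cosine similarity) pairs sorted from most to least similar."""
    q = vectorize(query_tokens, idf)
    scores = [(i, sum(w * d.get(t, 0.0) for t, w in q.items()))
              for i, d in enumerate(doc_vectors)]
    return sorted(scores, key=lambda pair: -pair[1])


toy_docs = [["mayor", "new", "york", "giuliani"],
            ["actress", "nicole", "kidman", "golden", "globe"],
            ["fire", "new", "south", "wales"]]
toy_idf = build_idf(toy_docs)
toy_vectors = [vectorize(doc, toy_idf) for doc in toy_docs]

ranking = rank(["who", "is", "nicole", "kidman"], toy_vectors, toy_idf)
print(ranking[0][0])  # index of the best-matching toy fragment
```

Query words that never occur in the corpus ("who", "is" here) simply do not contribute to the score, which is also what happens when `dictionary.doc2bow` ignores unknown words.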

In [954]:
import string

def fragment_retrieval(query, n=1):
    """Return a list with the `n` fragments of corpus2 most similar to `query`."""

    # Pre-process the query: remove punctuation, lowercase, split on whitespace
    query = query.translate(str.maketrans('', '', string.punctuation))
    query_words = query.lower().split()

    # Transform the query into its TF-IDF representation
    query_bow = dictionary.doc2bow(query_words)
    query_tfidf = tfidf[query_bow]

    # Calculate similarity scores between the query and all fragments by comparing their TF-IDF vectors
    sims = search_index[query_tfidf]

    # Sort by decreasing similarity and keep the indices of the n most similar fragments
    top_similar_indices = sorted(enumerate(sims), key=lambda x: -x[1])[:n]

    # Return the original (non-tokenized) fragments; callers use [0] to get the best one
    return [corpus2[index] for index, _ in top_similar_indices]


<font color='green'>**Task**: Please apply the above function to the three queries provided below.</font>  

Note: again, the corpus, search_index, tfidf and dictionary are available as global variables.

In [955]:
queries = ["Who is the mayor of New York?", 
           "Who is Nicole Kidman?",
           "How many Australians died in the 1999 Interlaken canyoning accident?"]
for q in queries:
    print(q, '->', fragment_retrieval(q)[0])

Who is the mayor of New York? -> "I felt that my job as the mayor was to turn around the city, because I believed - rightly or wrongly - that we had one last chance to do that." Mr Giuliani, a Republican, has served two terms as New York City's Mayor since 1993. Term limits prevent him from seeking a third term in office, and he will be succeeded by billionaire media mogul Michael Bloomberg.
Who is Nicole Kidman? -> In the United States, Australian actress Nicole Kidman has been nominated for two Golden Globe best actor awards for her roles in the Australian-made musical "Moulin Rouge", and in her new thriller "The Others". "Moulin Rouge" also is one of two pictures leading the Golden Globe nominations, with six possible awards. It is vying for best musical or comedy picture of 2002, best actress in a comedy or musical, best actor in the same category for Ewen McGregor, best director for Baz Luhrmann, best original score and best original song. The other film to pick up six nominations

## 3. Integration, testing and discussion

<font color='green'>**Task**: Using the two functions 'fragment_retrieval' and 'answer_extraction' from parts 1 and 2, and assuming all models and data are available as global variables, please create a unique function which returns the answer (string) to a question (string).</font>

In [956]:
def question_answering(question):
    # Retrieve the most relevant fragment and use it as context for answer extraction
    fragment = fragment_retrieval(question)[0]
    print("")
    print(f"Fragment used for context: {fragment}")
    return answer_extraction(question, fragment)

<font color='green'>**Task**: Please add between 5 and 10 more questions to the following list.  You can add answerable and non-answerable questions (with respect to the corpus).</font>

In [957]:
questions = ["Who is the mayor of New York?", 
            "Who is Nicole Kidman?", 
            "How many Australians died in the 1999 Interlaken canyoning accident?",
            "What caused the 1999 Interlaken canyoning accident?",
            "Which city is the capital of Australia?",
            "Who is the prime-minister of Israel?",
            "What are the main Australian airlines?",
            "What is Kieren Perkins' sport?"]
for q in questions:
    print(q, '->', question_answering(q))


Fragment used for context: "I felt that my job as the mayor was to turn around the city, because I believed - rightly or wrongly - that we had one last chance to do that." Mr Giuliani, a Republican, has served two terms as New York City's Mayor since 1993. Term limits prevent him from seeking a third term in office, and he will be succeeded by billionaire media mogul Michael Bloomberg.
Who is the mayor of New York? -> Mr Giuliani

Fragment used for context: In the United States, Australian actress Nicole Kidman has been nominated for two Golden Globe best actor awards for her roles in the Australian-made musical "Moulin Rouge", and in her new thriller "The Others". "Moulin Rouge" also is one of two pictures leading the Golden Globe nominations, with six possible awards. It is vying for best musical or comedy picture of 2002, best actress in a comedy or musical, best actor in the same category for Ewen McGregor, best director for Baz Luhrmann, best original score and best original song

Who is Nicole Kidman? -> Australian actress

Fragment used for context: Eight people are to appear in a Swiss court tomorrow charged with the manslaughter of 18 tourists and three guides, after the 1999 Interlaken canyoning tragedy. The first three defendants are managers of the now defunctoperator, Adventure World. Twenty-one people including 14 Australians were killed when a thunderstorm struck when they were canyoning down the Saxeten River Gorge near Interlaken. A massive wall of water hit the group and swept them to their deaths.
How many Australians died in the 1999 Interlaken canyoning accident? -> 14

Fragment used for context: Eight people are to appear in a Swiss court tomorrow charged with the manslaughter of 18 tourists and three guides, after the 1999 Interlaken canyoning tragedy. The first three defendants are managers of the now defunctoperator, Adventure World. Twenty-one people including 14 Australians were killed when a thunderstorm struck when they were canyoning dow

<font color='green'>**Task**: Please discuss the correctness of the answers, give possible reasons for incorrect ones, and make suggestions for improvements.</font>

Write your discussion here or in a cell below.

When you have finished, please clean the notebook and re-run it one last time from start to end, then submit it on Moodle.

<font color='lightblue'> About 50 % of the answers seemed to be correct at first. <br>
 For the question "Who is Nicole Kidman?", the "?" initially led to the retrieval of a different fragment, which had nothing to do with the actress. Removing punctuation before tokenizing the queries solved this problem. <br>
 For the question regarding the prime-minister of Israel, the "-" in "prime-minister" seems to be the main cause: once punctuation is removed, the query contains the single token "primeminister", which presumably does not occur in the corpus. The question is answered correctly if it is written as "prime minister". This could be improved by handling hyphenated words better during tokenization. <br>
 The question regarding the main Australian airlines may be hampered by the fact that the retrieved fragment mentions Perth and Darwin in the context of "mainland capitals". This probably has less to do with tokenization and more with the context being harder for DistilBERT to interpret.
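
One possible way to handle hyphenated words in queries, sketched below under the assumption that queries should be tokenized like the corpus, is to keep runs of word characters instead of deleting punctuation. This mirrors the `RegexpTokenizer(r'\w+')` pattern used for the fragments; the function name `normalize_query` is hypothetical.

```python
import re


def normalize_query(query):
    """Lowercase the query and keep only runs of word characters, so hyphens act as separators."""
    return re.findall(r"\w+", query.lower())


print(normalize_query("Who is the prime-minister of Israel?"))
# ['who', 'is', 'the', 'prime', 'minister', 'of', 'israel']
```

With this pre-processing, "prime-minister" yields the two tokens "prime" and "minister" rather than one fused token, so the query can match fragments in which the words appear separately.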