![MSE Logo](https://moodle.msengineering.ch/pluginfile.php/1/core_admin/logocompact/300x300/1613732714/logo-mse.png "MSE Logo") 

# AnTeDe Lab11: Question Answering using BERT

by Andrei Popescu-Belis (HES-SO)
using the [🤗 Huggingface models](https://huggingface.co/models),
an [article by Marius Borcan](https://programmerbackpack.com/bert-nlp-using-distilbert-to-build-a-question-answering-system/) and 
an [article by Ramsi Goutham](https://towardsdatascience.com/simple-and-fast-question-answering-system-using-huggingface-distilbert-single-batch-inference-bcf5a5749571)

**Summary**
The goal of this lab is to implement and test a simple question answering (QA) system over a set of articles.  The structure of the lab is as follows:
1. Answer extraction from a text fragment -- in this part, you will use a pre-trained model named DistilBERT (a lighter version of BERT) which can extract the most likely answer to a given question from a text fragment (in English).
2. Text retrieval given a question -- in this part, you will reuse code from Lab 4 (Search Engine) to design a paragraph retrieval system over the 300-article Lee corpus provided with `gensim`. 
3. Integration and testing -- in this part, you will put together the functions from the previous two parts, and test your system end-to-end by designing a test set of 10 questions.

<font color='green'>Please answer the questions and solve the tasks in green within this notebook.  The expected answers are generally very short: 1-2 commands or 2-3 lines of explanations.  At the end, please submit the completed notebook under the corresponding homework on Moodle.</font>

## 1. Answer extraction using DistilBERT

As you know, the BERT pre-trained model can be fine-tuned for question answering, by training it to provide the start and end word of an input text fragment which is most likely the answer to an input question.  You will use the 🤗 Huggingface Python module called `transformers`, and later use a DistilBERT model also provided by 🤗 Huggingface.

### a. Install `pytorch` and `transformers`

Use the instructions provided by [PyTorch](https://pytorch.org/get-started/locally/#start-locally) and by [Huggingface](https://github.com/huggingface/transformers#installation).  The use of `conda` is recommended.

In [1]:
import torch

<font color='green'>**Task**: Please generate a random 2x2x2 tensor with Pytorch.  Please display whether the workstation you use has a GPU or not.</font><br/>
(Note: a GPU is not required for this lab.)

In [2]:
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")
print(f"Device is set to: {device}")

Device is set to: mps


In [3]:
random_tensor = torch.rand(2, 2, 2)
print(random_tensor)

tensor([[[0.7504, 0.7269],
         [0.2922, 0.7640]],

        [[0.1090, 0.9802],
         [0.6623, 0.4873]]])


In [4]:
import transformers

🤗 Huggingface provides a very large repository of Transformer-based models at https://huggingface.co/models.

<font color='green'>**Task**: Please use the search interface (in a browser) and find out *how many models containing the name 'distilbert' for Question Answering* are available.  If we exclude those submitted by individual users, how many models are there left?  Please paste below their name and version date, and the size of their 'pytorch_model.bin' file.</font>

<font color='green'>By looking at their "model cards", which model has the highest performance on the SQuAD dev set?</font>  In what follows, we will use this model.


There are 1,223 Question Answering models with "distilbert" in their name.

Excluding the models submitted by individual users was not possible in the search interface.
However, the first model listed below is the most downloaded one and also has the higher F1 score on the SQuAD dev set (87.1 vs. 86.9), so it will be used in this lab.
__________________________________________________________________________________________

DistilBERT base cased distilled SQuAD (2020-02-07)
- pytorch_model.bin size -> 261 MB
- Downloads last month: 227,317
- Model size: 65.2M params
- https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad
- This model reaches an F1 score of 87.1 on the [SQuAD v1.1] dev set (for comparison, the BERT bert-base-cased version reaches an F1 score of 88.7).

__________
DistilBERT base uncased distilled SQuAD
- pytorch_model.bin size -> 265 MB
- Downloads last month: 39,626
- Model size: 66.4M params
- https://huggingface.co/distilbert/distilbert-base-uncased-distilled-squad
- This model reaches an F1 score of 86.9 on the [SQuAD v1.1] dev set (for comparison, the BERT bert-base-uncased version reaches an F1 score of 88.5).


### b. Tokenization of the input

We will use here a tokenizer called `DistilBertTokenizer` to tokenize the question and the text fragment and transform the tokens into numerical indices.  The documentation for this tokenizer is included in the general documentation of DistilBERT models at: https://huggingface.co/transformers/model_doc/distilbert.html 

In [5]:
from transformers import DistilBertTokenizer

<font color='green'>**Task**: Please create an instance of such a tokenizer 
using the pre-trained model named 'distilbert-base-cased'.  The command
will download the necessary model the first time you use it.</font>

In [6]:
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased-distilled-squad')

<font color='green'>**Task**: What does this instance return if you **call** it with a sentence (a *string*) as an argument?  Please write the instruction below for the sentence 'There are three museums in Winterthur.'.</font>

In [7]:
sentence = "There are three museums in Winterthur."
tokenizer_output = tokenizer(sentence)
tokenizer_output

{'input_ids': [101, 1247, 1132, 1210, 11765, 1107, 4591, 1582, 2149, 119, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

<font color='green'>**Task**: Please explain in your own words the meaning of the two components of the output above.  For that, please use the [documentation of the class DistilBertTokenizer](https://huggingface.co/transformers/model_doc/distilbert.html#distilberttokenizer), and be sure you read the documentation of its *superclasses* as well.  Under what superclass do you find the links to the [glossary entries](https://huggingface.co/transformers/glossary.html) that best explain the two components, and what are these entries?</font>

The first component, "input_ids", is the list of numerical token IDs. The tokenizer splits the input sentence into tokens from its vocabulary and maps each token to its index in that vocabulary, which is the form the model expects.

The second component, "attention_mask", tells the model which positions to attend to.
When sequences of different lengths are batched together, the shorter ones need to be padded to the same length. The padding tokens are not relevant for the model, so they are marked with 0 in the attention_mask, while real tokens are marked with 1.

The superclass in which I found the links to the glossary is PreTrainedTokenizer.
Glossary entries:

input_ids -> https://huggingface.co/transformers/v3.1.0/glossary.html#input-ids
attention_mask -> https://huggingface.co/transformers/v3.1.0/glossary.html#attention-mask
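
The padding behaviour described above can be illustrated without the library. The following is only a minimal sketch: the helper `pad_batch` is hypothetical (it is not part of `transformers`), and the padding ID 0 is an assumption. The first sequence reuses the IDs from the output above; the second is a shortened variant built from the same IDs.

```python
def pad_batch(batch_ids, pad_id=0):
    """Pad lists of token IDs to a common length and build attention masks."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)            # pad on the right
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([
    [101, 1247, 1132, 1210, 11765, 1107, 4591, 1582, 2149, 119, 102],
    [101, 1247, 1132, 119, 102],  # shortened variant: "[CLS] There are . [SEP]"
])
print(batch["attention_mask"][1])  # [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
```

With a single sentence, as in the cell above, no padding is needed, which is why the mask contains only ones.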


<font color='green'>**Task**: If you haven't explained above, please explain here the cause of the difference between the number of words of your sentence, and the number of tokens in the observed output.  Please display the tokens of the output. You can use the documentation of the superclass found above or the examples in the [glossary](https://huggingface.co/transformers/glossary.html).</font>

In [8]:
# Convert token IDs to tokens and store them in a variable
tokens = tokenizer.convert_ids_to_tokens(tokenizer_output["input_ids"])
# Print tokens
print(tokens)

['[CLS]', 'There', 'are', 'three', 'museums', 'in', 'Winter', '##th', '##ur', '.', '[SEP]']


The number of tokens (11) differs from the number of words (6) for two reasons. First, the tokenizer splits words that are not in its vocabulary into sub-word pieces: 'Winterthur' becomes 'Winter', '##th', '##ur', where the '##' prefix marks the continuation of a word. The final period is also a separate token. Second, the tokenizer adds the special tokens [CLS] and [SEP] at the beginning and end of the sequence. These special tokens are not part of the input sentence; they mark the sequence boundaries for the model. All tokens are then mapped to IDs that the model can process.
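
To make the sub-word splitting concrete, here is a small sketch of greedy longest-match splitting in the style of WordPiece. It is a simplified illustration, not the tokenizer's actual code: the function `wordpiece` and the tiny `toy_vocab` are made up for this example, whereas the real vocabulary has tens of thousands of entries.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Toy vocabulary (assumption): just enough entries for the example sentence
toy_vocab = {"There", "are", "three", "museums", "in", "Winter", "##th", "##ur", "."}
print(wordpiece("Winterthur", toy_vocab))  # ['Winter', '##th', '##ur']
print(wordpiece("museums", toy_vocab))     # ['museums']
```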

<font color='green'>**Question**: How can you convert back the first part of the output to the original string?
Please write and execute the command(s) below.  You can use the documentation of the superclass found above or the examples in the [glossary](https://huggingface.co/transformers/glossary.html).</font>

In [9]:
original_tokens = tokenizer.convert_ids_to_tokens(tokenizer_output['input_ids'])
original_sentence = tokenizer.convert_tokens_to_string(original_tokens)
print(f'The original string was:\n{original_sentence}')

The original string was:
[CLS] There are three museums in Winterthur . [SEP]


### c. Generation of input in the desired form

We need to generate input in the form expected by the `DistilBertForQuestionAnswering` class.  This means providing the question, the text from which the answer must be extracted, with the proper [CLS] and [SEP] tokens, and the attention masks.  Moreover, using DistilBERT requires that the lists of indices returned by the tokenizer are Pytorch tensors (see tokenizer's option `return_tensors`).

<font color='green'>**Question**: What is the correct way to call the tokenizer in order to obtain these results?  You can use the example provided at the end of the [DistilBertForQuestionAnswering](https://huggingface.co/transformers/model_doc/distilbert.html?distilbertforquestionanswering#distilbertforquestionanswering) documentation.  <br/>Please define a *question* and a *text* string of your own, and store the result of the tokenizer in a variable called *input*.  <br/>   Please verify (by converting back to the result) that the input has the correct tokens.</font>

In [10]:
from transformers import DistilBertTokenizer
import torch

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-cased-distilled-squad")

question, text = "What are people struggling with?", "Many people do not know how to vote at 9th June about the national initiatives"
inputs = tokenizer(question, text, return_tensors="pt")

input_ids = inputs["input_ids"]
tokens = tokenizer.convert_ids_to_tokens(input_ids[0])
input_sentence = tokenizer.convert_tokens_to_string(tokens)

print(f'The original input was: \n -> {question} {text} \nand the converted back input is: \n -> {input_sentence}')

The original input was: 
 -> What are people struggling with? Many people do not know how to vote at 9th June about the national initiatives 
and the converted back input is: 
 -> [CLS] What are people struggling with ? [SEP] Many people do not know how to vote at 9th June about the national initiatives [SEP]


### d. Execution of the model over the input question and text

In this section, you will create an instance of the BERT neural network adapted to question answering.  The class is named `DistilBertForQuestionAnswering`.  

**Important note:** The model itself (the weights) is the one that you found at the end of (1a) above which is suited for question answering!

In [11]:
from transformers import DistilBertForQuestionAnswering

<font color='green'>**Task**: Please create an instance of the model here.</font>  The data will be downloaded the first time you create it. 

In [12]:
model = DistilBertForQuestionAnswering.from_pretrained("distilbert-base-cased-distilled-squad")

The results of applying the model to your question and text (i.e. extracting the answer) are obtained by calling the model with the correct inputs.  

<font color='green'>**Task**: Please use the inputs you obtained above and read the [documentation of the DistilBertForQuestionAnswering class](https://huggingface.co/transformers/model_doc/distilbert.html?distilbertforquestionanswering#distilbertforquestionanswering) (under *forward*) to apply the model to your data.  Store the results in a variable called *outputs*.</font>

In [13]:
with torch.no_grad():
    outputs = model(**inputs)

<font color='green'>**Task**: Please answer the following three questions: 
- Where are the probability values for the position of the **start** of the answer in *outputs*?
- Are these actual probabilities or other type of coefficients?  
- How many values are there, and is this coherent with your observations in (1b)?</font> 

In [14]:
print(f'Probability values for the position of the start of the answer: {outputs.start_logits}')

# They are not actual probabilities but logits (unnormalized scores).

print(
    f'The number of values is {len(outputs.start_logits[0])} which is coherent with the number of tokens in the input.')
print(
    f'There are {len(outputs.start_logits[0])} values which is the same as the input token-size {len(input_ids[0])}')

Probability values for the position of the start of the answer: tensor([[ -7.7410,  -8.3506, -10.8415, -10.3164,  -9.4377, -10.1578,  -8.6988,
          -8.4618,  -1.5717,  -3.6089,  -4.3863,  -5.0113,  -6.3410,  -5.3681,
          -6.4108,  -3.0860,  -6.4063,  -3.7960,  -7.2373,  -3.7928,   0.4913,
           0.2110,  -2.1361,  -8.4617]])
The number of values is 24 which is coherent with the number of tokens in the input.
There are 24 values which is the same as the input token-size 24
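
Since the values are logits, a softmax is needed to read them as probabilities. The sketch below applies a plain-Python softmax to the start logits copied from the output above; the helper `softmax` is written here for illustration and is not the library's implementation.

```python
import math


def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Start logits copied from the output above (one value per input token)
start_logits = [
    -7.7410, -8.3506, -10.8415, -10.3164, -9.4377, -10.1578, -8.6988,
    -8.4618, -1.5717, -3.6089, -4.3863, -5.0113, -6.3410, -5.3681,
    -6.4108, -3.0860, -6.4063, -3.7960, -7.2373, -3.7928, 0.4913,
    0.2110, -2.1361, -8.4617,
]
start_probs = softmax(start_logits)
best_start = max(range(len(start_probs)), key=start_probs.__getitem__)
print(len(start_probs), best_start)  # 24 20
```

The softmax does not change the ranking, so the most likely start position is the same as the argmax of the logits (index 20).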


### e. Determination of the start and the end of the answer in the text

<font color='green'>**Question**: Please use the *outputs* of the model to determine the most likely start and end of the answer span in your text, and then obtain the actual answer.  How satisfied are you with the answer?</font>  You may use help from the [🤗 Huggingface entry on question answering](https://huggingface.co/transformers/task_summary.html#extractive-question-answering).

In [15]:
answer_start_index = outputs.start_logits.argmax()
answer_end_index = outputs.end_logits.argmax()

predict_answer_token_ids = inputs.input_ids[0, answer_start_index: answer_end_index + 1]
answer_sentence = tokenizer.decode(predict_answer_token_ids)
print(f'Answer: {answer_sentence}')


Answer: the national initiatives


The question was answered correctly, and the answer even includes the important adjective "national".

<font color='green'>**Task**: Please write a function called *answer_extraction* that gathers the previous operations: it takes two strings as arguments, applies the tokenizer and the model, extracts the answer, and returns it as a string (possibly empty).  Do not create a new *tokenizer* and *model*, but assume that the ones you created above are global variables accessible from this function.</font> 

In [16]:
def answer_extraction(question, text):
    inputs = tokenizer(question, text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    answer_start_index = outputs.start_logits.argmax()
    answer_end_index = outputs.end_logits.argmax()
    predict_answer_token_ids = inputs.input_ids[0, answer_start_index: answer_end_index + 1]

    return tokenizer.decode(predict_answer_token_ids)

<font color='green'>**Task**: Please test the function on the following questions and short text.</font> 

In [17]:
# Excerpt from Simple English Wikipedia:
text = """Switzerland is a small country in Western Europe. 
Switzerland is a confederation of even smaller states, which are the 26 cantons.
Switzerland is known for its neutrality.  Switzerland has been neutral since 1815. 
There are four official languages in Switzerland: German, French, Italian, and Romansh. 
"""
question1 = "How many cantons are there in Switzerland?"
question2 = "What is Switzerland famous for?"
question3 = "What are the official languages of Switzerland?"

print(question1, '->', answer_extraction(question1, text))
print(question2, '->', answer_extraction(question2, text))
print(question3, '->', answer_extraction(question3, text))

How many cantons are there in Switzerland? -> 26
What is Switzerland famous for? -> neutrality
What are the official languages of Switzerland? -> German, French, Italian, and Romansh


## 2. Fragment retrieval using `Gensim` (from Lab 4)

In this part, you will simply reuse code from Lab 4 to build a simple text retrieval system over the *Lee Corpus* provided with Gensim (300 news articles from the Australian Broadcasting Corporation).  
* The [Gensim tutorial on topics and transformations](https://radimrehurek.com/gensim/auto_examples/core/run_topics_and_transformations.html#sphx-glr-auto-examples-core-run-topics-and-transformations-py) provides the main idea.  
* The goal is to retrieve, given a question, a short text fragment that is most likely to contain the answer.  As articles are not divided into paragraphs, you will refactor the collection of articles into a collection of fragments of at most *N* sentences each (without mixing articles). 
* The question will be used as a *query*, with the pre-processing options of your choice.

In [18]:
import gensim, nltk, os

<font color='green'>**Task**: Load the articles of the Lee Background Corpus provided with Gensim into a list of strings (each article in a string) called *raw_articles*.</font>

In [19]:
test_data_dir = os.path.join(gensim.__path__[0], 'test', 'test_data')
lee_train_file = os.path.join(test_data_dir, 'lee_background.cor')
# The file contains one article per line
with open(lee_train_file) as f:
    raw_articles = f.read().splitlines()

print(f'Number of articles: {len(raw_articles)}')

Number of articles: 300


In [20]:
# save articles to file
with open('articles.txt', 'w') as f:
    for article in raw_articles:
        f.write(article + '\n')

<font color='green'>**Task**: Please transform the articles into a collection of text fragments called *corpus1* (a list of lists of strings), by cutting each article into fragments of *N* consecutive sentences (e.g. *N* = 4), except possibly for the last fragment, and tokenizing each sentence.  At the end, display the number of fragments of your collection.</font>
* Do not mix sentences from different articles in each fragment.
* The reason for this operation is that full articles are too long to give to DistilBERT as texts. (Try it!)

In [21]:
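answer_extraction(raw_articles[0], 'What happened in the Southern Highlands of New South Wales?')
# Returns only the [CLS] token, i.e. no answer span is found.
# Note: the article is passed here as the first argument, which is the question slot of answer_extraction.
# DistilBERT accepts at most 512 tokens, so long articles have to be cut into fragments in any case.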

'[CLS]'
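
Besides cutting articles into groups of sentences, as done below, a long token sequence can also be split into overlapping windows so that an answer near a boundary is not lost. The following is a rough sketch of that idea on plain lists; `sliding_windows` and its parameters `max_len` and `step` are hypothetical names, and the default values are arbitrary choices.

```python
def sliding_windows(token_ids, max_len=384, step=256):
    """Split a token list into overlapping windows of at most max_len tokens."""
    if not 0 < step <= max_len:
        raise ValueError("step must be between 1 and max_len")
    windows, start = [], 0
    while True:
        windows.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # the last window reaches the end of the sequence
        start += step
    return windows


# Toy example: 10 "tokens", windows of 4 with a step of 2 (overlap of 2)
print(sliding_windows(list(range(10)), max_len=4, step=2))
# [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```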

<font color='green'>**Task**: Do not forget to pre-process the articles in preparation for search -- tokenization, stopword removal, and other operations if you want to explore them.</font>  
* A  text fragment is thus a list of strings (tokens). 
* Please inspect your corpus to make sure it is correctly built.

Hint: the raw text will start with:
 
*Hundreds of people have been forced to vacate their homes in the Southern Highlands of New South Wales as strong winds today pushed a huge bushfire towards the town of Hill Top. A new blaze near Goulburn, south-west of Sydney, has forced the closure of the Hume Highway. At about 4:00pm AEDT, a marked deterioration in the weather as a storm cell moved east across the Blue Mountains forced authorities to make a decision to evacuate people from homes in outlying streets at Hill Top in the New South Wales southern highlands. An estimated 500 residents have left their homes for nearby Mittagong. The New South Wales Rural Fire Service says the weather conditions which caused the fire to burn in a finger formation have now eased and about 60 fire units in and around Hill Top are optimistic of defending all properties. As more than 100 blazes burn on New Year's Eve in New South Wales, fire crews have been called to new fire at Gunning, south of Goulburn. While few details are available at this stage, fire authorities says it has closed the Hume Highway in both directions. Meanwhile, a new fire in Sydney's west is no longer threatening properties in the Cranebrook area. ....*

The first 8 sentences are to be decomposed into:

*[['hundreds', 'people', 'forced', 'vacate', 'homes', 'southern', 'highlands', 'new', 'south', 'wales', 'strong', 'winds', 'today', 'pushed', 'huge', 'bushfire', 'towards', 'town', 'hill', 'top', '.', 'new', 'blaze', 'near', 'goulburn', ',', 'south-west', 'sydney', ',', 'forced', 'closure', 'hume', 'highway', '.', '4:00pm', 'aedt', ',', 'marked', 'deterioration', 'weather', 'storm', 'cell', 'moved', 'east', 'across', 'blue', 'mountains', 'forced', 'authorities', 'make', 'decision', 'evacuate', 'people', 'homes', 'outlying', 'streets', 'hill', 'top', 'new', 'south', 'wales', 'southern', 'highlands', '.', 'estimated', '500', 'residents', 'left', 'homes', 'nearby', 'mittagong', '.'], ['new', 'south', 'wales', 'rural', 'fire', 'service', 'says', 'weather', 'conditions', 'caused', 'fire', 'burn', 'finger', 'formation', 'eased', '60', 'fire', 'units', 'around', 'hill', 'top', 'optimistic', 'defending', 'properties', '.', '100', 'blazes', 'burn', 'new', 'year', 'eve', 'new', 'south', 'wales', ',', 'fire', 'crews', 'called', 'new', 'fire', 'gunning', ',', 'south', 'goulburn', '.', 'details', 'available', 'stage', ',', 'fire', 'authorities', 'says', 'closed', 'hume', 'highway', 'directions', '.', 'meanwhile', ',', 'new', 'fire', 'sydney', 'west', 'longer', 'threatening', 'properties', 'cranebrook', 'area', '.']*

<font color='green'>**Task**: If possible, store the original version of each fragment into a string in *corpus2* (i.e. non-tokenized, non-lowercased, etc.), because it will be better to pass it later to DistilBERT.  Otherwise, you can also reconstruct the full fragment from the tokens in corpus1.</font>

In [22]:
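from nltk.corpus import stopwords

corpus1 = []  # tokenized, lowercased fragments without stopwords (for search)
corpus2 = []  # original fragment strings (for DistilBERT)
N = 4  # maximum number of sentences per fragment

# Build the stopword set once instead of reloading the list for every token
STOPWORDS = set(stopwords.words('english'))


def preprocess_text(text):
    nltk_tokens = nltk.word_tokenize(text)
    # Keep alphanumeric tokens only; this drops punctuation and also tokens
    # such as 'south-west' or '4:00pm'
    nltk_tokens = [word.lower() for word in nltk_tokens if word.isalnum()]
    return [word for word in nltk_tokens if word not in STOPWORDS]


for article in raw_articles:
    sentences = nltk.sent_tokenize(article)
    # Fragments are built per article, so they never mix sentences from two articles
    for i in range(0, len(sentences), N):
        fragment = " ".join(sentences[i:i + N])
        corpus2.append(fragment)
        corpus1.append(preprocess_text(fragment))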

In [23]:
print(f'Number of fragments: {len(corpus1)} \n--------------------------')
print(f'Raw first fragment:\n {corpus2[0]} \n')
print(f'Processed first fragment:\n {corpus1[0]}')

Number of fragments: 790 
--------------------------
Raw first fragment:
 Hundreds of people have been forced to vacate their homes in the Southern Highlands of New South Wales as strong winds today pushed a huge bushfire towards the town of Hill Top. A new blaze near Goulburn, south-west of Sydney, has forced the closure of the Hume Highway. At about 4:00pm AEDT, a marked deterioration in the weather as a storm cell moved east across the Blue Mountains forced authorities to make a decision to evacuate people from homes in outlying streets at Hill Top in the New South Wales southern highlands. An estimated 500 residents have left their homes for nearby Mittagong. 

Processed first fragment:
 ['hundreds', 'people', 'forced', 'vacate', 'homes', 'southern', 'highlands', 'new', 'south', 'wales', 'strong', 'winds', 'today', 'pushed', 'huge', 'bushfire', 'towards', 'town', 'hill', 'top', 'new', 'blaze', 'near', 'goulburn', 'sydney', 'forced', 'closure', 'hume', 'highway', 'aedt', 'marked', '

<font color='green'>**Task**: Please create a search index (called *search_index*) using a *tfidf* model and transform all text fragments from *corpus1* into document vectors.</font>

In [24]:
from gensim import corpora, models, similarities

# Assume that corpus1 is a list of tokenized documents
dictionary = corpora.Dictionary(corpus1)
bow_corpus = [dictionary.doc2bow(doc) for doc in corpus1]

# Create a tf-idf model and transform the corpus to tf-idf
tfidf = models.TfidfModel(bow_corpus)
tfidf_corpus = tfidf[bow_corpus]

# Create a similarity index
search_index = similarities.MatrixSimilarity(tfidf_corpus)
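
To see what the index computes, the same idea can be sketched in plain Python on a toy corpus. This is an illustration only, assuming raw term counts weighted by log2(N / document frequency), L2-normalized vectors and cosine similarity; the helper names `build_idf`, `vectorize` and `cosine` and the three toy documents are made up.

```python
import math
from collections import Counter


def build_idf(docs):
    """Inverse document frequency, log2(N / df), for every term in the corpus."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log2(n_docs / count) for term, count in df.items()}


def vectorize(tokens, idf):
    """Sparse, L2-normalized tf-idf vector; out-of-vocabulary tokens are ignored."""
    tf = Counter(t for t in tokens if t in idf)
    vec = {t: count * idf[t] for t, count in tf.items() if idf[t] > 0}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def cosine(u, v):
    """Dot product of two normalized sparse vectors, i.e. their cosine similarity."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


toy_docs = [
    ["fire", "crews", "hill", "top"],
    ["mayor", "new", "york", "city"],
    ["new", "fire", "sydney", "west"],
]
idf = build_idf(toy_docs)
doc_vectors = [vectorize(doc, idf) for doc in toy_docs]
query_vector = vectorize(["mayor", "new", "york"], idf)
scores = [cosine(query_vector, d) for d in doc_vectors]
best_doc = max(range(len(scores)), key=scores.__getitem__)
print(best_doc)  # 1
```

Terms that occur in many documents (here "new" and "fire") receive a lower weight than rare terms such as "mayor", which is why the second toy document clearly wins for this query.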

<font color='green'>**Task**: Please write a function called *fragment_retrieval* which returns the most relevant text fragment (string) from the corpus given a question, which is used as the query.</font>  
* The function processes the query in the same way as the documents (using the *tfidf model*) to obtain a *vectorized_query*.
* This is passed to the *search_index* to rank all documents by relevance.
* All the resources created above are supposed available as global variables (the dictionary, the tfidf model, the search_index, the corpus).

In [25]:
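def fragment_retrieval(query):
    tokenized_query = preprocess_text(query)
    tfidf_query = tfidf[dictionary.doc2bow(tokenized_query)]
    # Similarity of the query to every fragment (named scores to avoid
    # shadowing the imported gensim module similarities)
    scores = search_index[tfidf_query]
    return corpus2[scores.argmax()]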
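
One caveat of taking only the argmax: if the query shares no term with any fragment, all scores are zero and the argmax is index 0, so the first fragment is returned although it is unrelated. A possible variant, sketched here on a plain list of scores, returns the k best fragments and drops those with a zero score. The helper `top_k` and the toy scores are hypothetical.

```python
def top_k(scores, k=3, min_score=0.0):
    """Indices and scores of the k best entries, skipping scores <= min_score."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in ranked[:k] if scores[i] > min_score]


print(top_k([0.0, 0.42, 0.0, 0.17], k=3))  # [(1, 0.42), (3, 0.17)]
print(top_k([0.0, 0.0]))                   # []
```

An empty result can then be treated as "no relevant fragment found" instead of passing an arbitrary fragment to the answer extraction step.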

<font color='green'>**Task**: Please apply the above function to the three queries provided below.</font>  

Note: again, the corpus, search_index, tfidf and dictionary are available as global variables.

In [26]:
queries = ["Who is the mayor of New York?",
           "Who is Nicole Kidman?",
           "How many Australians died in the 1999 Interlaken canyoning accident?"]
for q in queries:
    print(q, '->', fragment_retrieval(q))

Who is the mayor of New York? -> "I felt that my job as the mayor was to turn around the city, because I believed - rightly or wrongly - that we had one last chance to do that." Mr Giuliani, a Republican, has served two terms as New York City's Mayor since 1993. Term limits prevent him from seeking a third term in office, and he will be succeeded by billionaire media mogul Michael Bloomberg.
Who is Nicole Kidman? -> In the United States, Australian actress Nicole Kidman has been nominated for two Golden Globe best actor awards for her roles in the Australian-made musical "Moulin Rouge", and in her new thriller "The Others". "Moulin Rouge" also is one of two pictures leading the Golden Globe nominations, with six possible awards. It is vying for best musical or comedy picture of 2002, best actress in a comedy or musical, best actor in the same category for Ewen McGregor, best director for Baz Luhrmann, best original score and best original song. The other film to pick up six nominations

## 3. Integration, testing and discussion

<font color='green'>**Task**: Using the two functions 'fragment_retrieval' and 'answer_extraction' from parts 1 and 2, and assuming all models and data are available as global variables, please create a single function which returns the answer (string) to a question (string).</font>

In [27]:
def question_answering(question):
    fragment = fragment_retrieval(question)
    return answer_extraction(question, fragment)

<font color='green'>**Task**: Please add between 5 and 10 more questions to the following list.  You can add answerable and non-answerable questions (with respect to the corpus).</font>

In [28]:
questions = ["Who is the mayor of New York?",
             "Who is Nicole Kidman?",
             "How many Australians died in the 1999 Interlaken canyoning accident?",
             "What caused the 1999 Interlaken canyoning accident?",
             "Which city is the capital of Australia?",
             "Who is the prime-minister of Israel?",
             "What are the main Australian airlines?",
             "What is Kieren Perkins' sport?",
             "What is the population of Australia?",
             "What is the capital of Haiti?",
             "Who is Anthony Zinni?",
             "Where is the opera house?",
             ]
for q in questions:
    print(q, '->', question_answering(q))

Who is the mayor of New York? -> Mr Giuliani
Who is Nicole Kidman? -> Australian actress
How many Australians died in the 1999 Interlaken canyoning accident? -> 14
What caused the 1999 Interlaken canyoning accident? -> thunderstorm
Which city is the capital of Australia? -> Port - au - Prince
Who is the prime-minister of Israel? -> Anthony Zinni
What are the main Australian airlines? -> Perth and Darwin
What is Kieren Perkins' sport? -> swam
What is the population of Australia? -> aging
What is the capital of Haiti? -> Port - au - Prince
Who is Anthony Zinni? -> United States peace envoy
Where is the opera house? -> Palestinian territory


<font color='green'>**Task**: Please discuss the correctness of the answers, give possible reasons for incorrect ones, and make suggestions for improvements.</font>

Write your discussion here or in a cell below.

When you have finished please clean and re-run one last time the notebook, from start to end, then submit it on Moodle.

Discussion:
Most questions are answered correctly, but the question about the capital of Australia is not. This is probably because the word "capital" appears twice in one article, where it refers to Haiti and not to Australia. That is why the system answers with Port - au - Prince.

The question about the prime minister of Israel is answered incorrectly for a similar reason. The name Anthony Zinni often appears in the articles together with words such as "prime minister" and "Israel".
When the model was asked directly "Who is Anthony Zinni?", it answered correctly.

The question about the main Australian airlines is not answered correctly. My guess is that Perth and Darwin are probably among the busiest airports in Australia and are mentioned quite often in the articles together with airlines, so the model might have picked up on that.

The question about the population of Australia is not answered correctly either. It is interesting that the answer, "aging", would fit a different question about the population of Australia: "How is the demographic development in Australia?"

The answer to "Where is the opera house?" is completely wrong. However, the phrase "opera house" never occurs in the texts, so the model could not have answered this question correctly.
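
As a possible improvement for unanswerable questions, the system could return an empty string when it is not confident enough, for instance by extracting one candidate answer per retrieved fragment and keeping the best one only if its confidence exceeds a threshold. The sketch below assumes that each candidate comes with a confidence between 0 and 1 (for example derived from a softmax over the logits); the function `pick_answer`, the threshold and the confidence values are made up for illustration.

```python
def pick_answer(candidates, threshold=0.5):
    """Return the most confident answer, or '' if none is confident enough.

    candidates: list of (answer, confidence) pairs, e.g. one per retrieved fragment.
    """
    if not candidates:
        return ""
    answer, confidence = max(candidates, key=lambda pair: pair[1])
    return answer if confidence >= threshold else ""


# Made-up confidences, for illustration only
print(pick_answer([("Mr Giuliani", 0.91), ("Michael Bloomberg", 0.40)]))  # Mr Giuliani
print(repr(pick_answer([("Palestinian territory", 0.08)])))               # ''
```

A suitable threshold would have to be tuned on a set of answerable and non-answerable test questions.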

