# Introduction

This notebook is intended to experiment with different technologies for the task of **Question Answering** (Q&A).

There are two different **types of models**:
- *Open Domain* - They do not require a passed context
- *Reading Comprehension* - They find the answer within a given context

Such models can follow two different **approaches**:
- *Open Book* - The model can access an external source of information
- *Closed Book* - The model can only access what has been encoded in its parameters

The **Components** of an Open Domain Q&A system are:
- *Retriever* - It finds relevant contexts in an external source given the question (this is the component that differentiates an Open Domain system from a Reading Comprehension one)
- *Reader* - It locates the position in the context where the answer to the question is
- *Generator* - An alternative to the Reader: it generates the answer text instead of extracting a span from the context



## Datasets

### SQuAD 2.0

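SQuAD (Stanford Question Answering Dataset) is a hugely popular dataset of question and answer pairs written about Wikipedia articles, covering topics ranging from Beyonce to Physics. Version 2.0 also includes questions that cannot be answered from the given context (flagged by `is_impossible`).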

Both the Training and Dev sets can be downloaded from this [link](https://rajpurkar.github.io/SQuAD-explorer/).

In [1]:
# Import Standard Libraries
import json

In [2]:
# Read data
with open('./../../../data/squad_train_v2.0.json', 'rb') as file:
    squad_train_data = json.load(file)

with open('./../../../data/squad_dev_v2.0.json', 'rb') as file:
    squad_dev_data = json.load(file)

In [3]:
squad_train_data['data'][0]['paragraphs'][0] # First data are about Beyonce

{'qas': [{'question': 'When did Beyonce start becoming popular?',
   'id': '56be85543aeaaa14008c9063',
   'answers': [{'text': 'in the late 1990s', 'answer_start': 269}],
   'is_impossible': False},
  {'question': 'What areas did Beyonce compete in when she was growing up?',
   'id': '56be85543aeaaa14008c9065',
   'answers': [{'text': 'singing and dancing', 'answer_start': 207}],
   'is_impossible': False},
  {'question': "When did Beyonce leave Destiny's Child and become a solo singer?",
   'id': '56be85543aeaaa14008c9066',
   'answers': [{'text': '2003', 'answer_start': 526}],
   'is_impossible': False},
  {'question': 'In what city and state did Beyonce  grow up? ',
   'id': '56bf6b0f3aeaaa14008c9601',
   'answers': [{'text': 'Houston, Texas', 'answer_start': 166}],
   'is_impossible': False},
  {'question': 'In which decade did Beyonce become famous?',
   'id': '56bf6b0f3aeaaa14008c9602',
   'answers': [{'text': 'late 1990s', 'answer_start': 276}],
   'is_impossible': False},
  {'q

In [4]:
# Retrieve all the Q&A pairs
def extract_qa_pairs(squad_data):
    """Flatten a SQuAD-style dictionary into a list of question/answer/context samples."""
    qa_data = []

    # Loop through groups -> paragraphs -> qa_pairs
    for group in squad_data['data']:
        for paragraph in group['paragraphs']:

            # Retrieve context
            context = paragraph['context']

            for qa_pair in paragraph['qas']:

                # Retrieve question
                question = qa_pair['question']

                # Use 'answers' if present and non-empty, otherwise fall back to
                # 'plausible_answers'; the list stays empty if no answer is given
                answer_list = qa_pair.get('answers') or qa_pair.get('plausible_answers') or []

                # Retrieve just the text from each answer and remove duplicates
                # (dict.fromkeys keeps the original order, unlike set)
                answer_list = list(dict.fromkeys(item['text'] for item in answer_list))

                # Add each answer to the dataset
                for answer in answer_list:
                    # append dictionary sample to parsed squad
                    qa_data.append({
                        'question': question,
                        'answer': answer,
                        'context': context
                    })

    return qa_data

# Apply the same extraction to both the Training and the Dev set
q_a_train_data = extract_qa_pairs(squad_train_data)
q_a_dev_data = extract_qa_pairs(squad_dev_data)

In [5]:
q_a_train_data[0]

{'question': 'When did Beyonce start becoming popular?',
 'answer': 'in the late 1990s',
 'context': 'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child. Managed by her father, Mathew Knowles, the group became one of the world\'s best-selling girl groups of all time. Their hiatus saw the release of Beyoncé\'s debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".'}

In [6]:
# Save data
with open('./../../../data/squad_processed_train.json', 'w') as file:
    json.dump(q_a_train_data, file)

with open('./../../../data/squad_processed_dev.json', 'w') as file:
    json.dump(q_a_dev_data, file)

# Reader

## BERT Model

In [7]:
# Import Standard Libraries
import json
from transformers import BertTokenizer, BertForQuestionAnswering, pipeline



In [8]:
with open('./../../../data/squad_processed_train.json', 'r') as file:
    q_a_train_data = json.load(file)

with open('./../../../data/squad_processed_dev.json', 'r') as file:
    q_a_dev_data = json.load(file)

In [9]:
# Instantiate the tokenizer and the model
tokenizer = BertTokenizer.from_pretrained('deepset/bert-base-cased-squad2')
model = BertForQuestionAnswering.from_pretrained('deepset/bert-base-cased-squad2')

Some weights of the model checkpoint at deepset/bert-base-cased-squad2 were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [10]:
# Define the pipeline
qa_pipeline = pipeline('question-answering', model=model, tokenizer=tokenizer)

In [11]:
# Retrieve a few answers
answers = []

for sample in q_a_train_data[:5]:

    answer = qa_pipeline({
        'question': sample['question'],
        'context': sample['context']
    })

    answers.append({
        'True Label': sample['answer'],
        'Prediction': answer['answer'],
        'Confidence': answer['score']
    })

In [12]:
answers

[{'True Label': 'in the late 1990s',
  'Prediction': 'late 1990s',
  'Confidence': 0.5621357560157776},
 {'True Label': 'singing and dancing',
  'Prediction': 'singing and dancing',
  'Confidence': 0.9938411116600037},
 {'True Label': '2003',
  'Prediction': '(2003),',
  'Confidence': 0.9965661764144897},
 {'True Label': 'Houston, Texas',
  'Prediction': 'Houston, Texas,',
  'Confidence': 0.847782552242279},
 {'True Label': 'late 1990s',
  'Prediction': '1990s',
  'Confidence': 0.6779066324234009}]

## Model Evaluation

### Exact Match (EM)

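Exact Match scores a prediction as 1 if, after normalisation, it is identical to the true label, and as 0 otherwise. The metric is strict: a prediction such as *late 1990s* against the true label *in the late 1990s* counts as a miss even though it is essentially correct.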

In [13]:
# Import Standard Libraries
import re

In [14]:
# Retrieve the exact matches
exact_matches = []

for answer in answers:

    # Normalise the text
    prediction = re.sub('[^0-9a-z ]', '', answer['Prediction'].lower())
    true_label = re.sub('[^0-9a-z ]', '', answer['True Label'].lower())

    if prediction == true_label:

        exact_matches.append(1)

    else:

        exact_matches.append(0)

print(f'Exact matches: {sum(exact_matches)/len(exact_matches)}')

Exact matches: 0.6
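
The normalisation step decides what counts as an "exact" match. As a minimal sketch, the helper below lowercases the text and removes punctuation, English articles and extra whitespace, in the spirit of the normalisation commonly applied when evaluating on SQuAD. The names `normalize_answer` and `exact_match` are chosen here for illustration and are not part of the notebook above.

```python
import re
import string

def normalize_answer(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = ''.join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r'\b(a|an|the)\b', ' ', text)
    return ' '.join(text.split())

def exact_match(prediction, true_label):
    """Return 1 if the normalised strings are identical, 0 otherwise."""
    return int(normalize_answer(prediction) == normalize_answer(true_label))

# Punctuation no longer prevents a match...
print(exact_match('(2003),', '2003'))
# ...but a shorter span is still a miss, which is the main limitation of EM
print(exact_match('late 1990s', 'in the late 1990s'))
```

Even with a more forgiving normalisation, partially correct spans still score 0, which is why an overlap-based metric is a useful complement.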


### Recall-Oriented Understudy for Gisting Evaluation (ROUGE)

It is a set of metrics, each of which measures how similar the predicted text is to a reference text:
- **ROUGE-N** - It measures the number of matching N-Grams (groups of N consecutive tokens) between the reference and the predicted text
    ```python
    # Reference text
    reference_text = 'The quick brown fox jumps over the lazy dog'

    example_1_gram = ['The', 'quick', ...]
    example_2_gram = ['The quick', 'quick brown', ...]
    ```

#### Theory

In [15]:
# Import Standard Libraries
from rouge import Rouge

In [16]:
# Define example predicted output and reference
model_output = 'hello to the world'
reference_output = 'hello world'

In [17]:
# Instantiate the Rouge object
rouge_instance = Rouge()

In [18]:
# Get the N-Gram scores
rouge_instance.get_scores(model_output, reference_output)

[{'rouge-1': {'r': 1.0, 'p': 0.5, 'f': 0.6666666622222223},
  'rouge-2': {'r': 0.0, 'p': 0.0, 'f': 0.0},
  'rouge-l': {'r': 1.0, 'p': 0.5, 'f': 0.6666666622222223}}]

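The library returns ROUGE-1 (unigrams), ROUGE-2 (bigrams) and ROUGE-L, each with Recall (`r`), Precision (`p`) and F1 (`f`). In this example the ROUGE-1 recall is 1.0 because both reference words (*hello*, *world*) appear in the model output, while the precision is 0.5 because only two of the four words in the model output appear in the reference.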
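
The ROUGE-1 and ROUGE-2 values above can be reproduced by hand. The following is a minimal sketch that assumes whitespace tokenisation and clipped n-gram counts; the `rouge` package may differ in details (the tiny offset in its F1 values suggests a smoothing term), so small numerical differences are expected. The function names are illustrative only.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) contained in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n(model_output, reference, n):
    """Compute ROUGE-N recall, precision and F1 on whitespace tokens."""
    model_counts = Counter(ngrams(model_output.split(), n))
    reference_counts = Counter(ngrams(reference.split(), n))

    # Clipped overlap: an n-gram counts at most as often as it occurs in both texts
    overlap = sum((model_counts & reference_counts).values())

    recall = overlap / max(sum(reference_counts.values()), 1)
    precision = overlap / max(sum(model_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {'r': recall, 'p': precision, 'f': f1}

# Same example as above: recall 1.0 and precision 0.5 for unigrams, 0.0 for bigrams
print(rouge_n('hello to the world', 'hello world', 1))
print(rouge_n('hello to the world', 'hello world', 2))
```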

In [19]:
# Retrieve full data for ROUGE
model_predictions = [answer['Prediction'] for answer in answers]
references = [answer['True Label'] for answer in answers]

In [20]:
rouge_instance.get_scores(model_predictions, references)

[{'rouge-1': {'r': 0.5, 'p': 1.0, 'f': 0.6666666622222223},
  'rouge-2': {'r': 0.3333333333333333, 'p': 1.0, 'f': 0.4999999962500001},
  'rouge-l': {'r': 0.5, 'p': 1.0, 'f': 0.6666666622222223}},
 {'rouge-1': {'r': 1.0, 'p': 1.0, 'f': 0.999999995},
  'rouge-2': {'r': 1.0, 'p': 1.0, 'f': 0.999999995},
  'rouge-l': {'r': 1.0, 'p': 1.0, 'f': 0.999999995}},
 {'rouge-1': {'r': 0.0, 'p': 0.0, 'f': 0.0},
  'rouge-2': {'r': 0.0, 'p': 0.0, 'f': 0.0},
  'rouge-l': {'r': 0.0, 'p': 0.0, 'f': 0.0}},
 {'rouge-1': {'r': 0.5, 'p': 0.5, 'f': 0.4999999950000001},
  'rouge-2': {'r': 0.0, 'p': 0.0, 'f': 0.0},
  'rouge-l': {'r': 0.5, 'p': 0.5, 'f': 0.4999999950000001}},
 {'rouge-1': {'r': 0.5, 'p': 1.0, 'f': 0.6666666622222223},
  'rouge-2': {'r': 0.0, 'p': 0.0, 'f': 0.0},
  'rouge-l': {'r': 0.5, 'p': 1.0, 'f': 0.6666666622222223}}]

#### Application

In [30]:
# Import Standard Libraries
from rouge import Rouge
from tqdm import tqdm
import re

In [22]:
# Initialize empty model outputs and references
model_outputs, references = [], []

In [25]:
# Fetch the samples in the training data
for sample in tqdm(q_a_train_data[:5], leave=True):

    # Pass the question through the Q&A Pipeline
    answer = qa_pipeline({
        'question': sample['question'],
        'context': sample['context']
    })

    # Append results
    model_outputs.append(answer['answer'])
    references.append(sample['answer'])

100%|██████████| 5/5 [00:43<00:00,  8.73s/it]


In [26]:
# Initialize
rouge_instance = Rouge()

# Get scores
rouge_instance.get_scores(model_outputs, references, avg=True)

{'rouge-1': {'r': 0.5, 'p': 0.6875, 'f': 0.5624999964583334},
 'rouge-2': {'r': 0.3333333333333333, 'p': 0.5, 'f': 0.37499999781250004},
 'rouge-l': {'r': 0.5, 'p': 0.6875, 'f': 0.5624999964583334}}

In [29]:
# Retrieve score
scores = rouge_instance.get_scores(model_outputs, references)

# Manually check some of them
print(model_outputs[1], ' | ', references[1], ' | ', scores[1]['rouge-1']['f'])
print(model_outputs[2], ' | ', references[2], ' | ', scores[2]['rouge-1']['f'])

singing and dancing  |  singing and dancing  |  0.999999995
(2003),  |  2003  |  0.0


In [31]:
# Use regex
clean = re.compile('(?i)[^0-9a-z ]')

# Apply this to both lists
model_outputs = [clean.sub('', text) for text in model_outputs]
references = [clean.sub('', text) for text in references]

In [32]:
# Retrieve score
scores = rouge_instance.get_scores(model_outputs, references)

# Manually check some of them
print(model_outputs[1], ' | ', references[1], ' | ', scores[1]['rouge-1']['f'])
print(model_outputs[2], ' | ', references[2], ' | ', scores[2]['rouge-1']['f'])

singing and dancing  |  singing and dancing  |  0.999999995
2003  |  2003  |  0.999999995


In [33]:
# Get scores
rouge_instance.get_scores(model_outputs, references, avg=True)

{'rouge-1': {'r': 0.8125, 'p': 1.0, 'f': 0.8749999952083333},
 'rouge-2': {'r': 0.4583333333333333, 'p': 0.625, 'f': 0.49999999718750004},
 'rouge-l': {'r': 0.8125, 'p': 1.0, 'f': 0.8749999952083333}}

The scores improve once punctuation is stripped with the regex. The model's predictions are unchanged, but tokens such as `(2003),` and `2003` now match.

#### ROUGE-L

It measures the *Longest Common Subsequence* (LCS) between the model's outputs and the references.

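Given the model output (*The hello a cat daog fox jumps*) and the reference (*the fox jumps*), the longest common subsequence is *fox jumps* (assuming a case-sensitive comparison, so *The* does not match *the*). Its length is 2 and the model output has 7 words &rarr; Precision = 2/7 = 29%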
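
The LCS computation for the example above can be sketched with the classic dynamic-programming table. This is an illustrative implementation that assumes whitespace tokens and case-sensitive comparison; it is not the code used inside the `rouge` package, and the function names are chosen here for the example.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # table[i][j] holds the LCS length of a[:i] and b[:j]
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, token_a in enumerate(a, start=1):
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]

def rouge_l(model_output, reference):
    """ROUGE-L precision and recall based on the LCS of whitespace tokens."""
    model_tokens = model_output.split()
    reference_tokens = reference.split()
    lcs = lcs_length(model_tokens, reference_tokens)
    precision = lcs / len(model_tokens) if model_tokens else 0.0
    recall = lcs / len(reference_tokens) if reference_tokens else 0.0
    return {'lcs': lcs, 'p': precision, 'r': recall}

# LCS is "fox jumps" (length 2): precision 2/7, recall 2/3
print(rouge_l('The hello a cat daog fox jumps', 'the fox jumps'))
```

Unlike ROUGE-N, the matching tokens do not need to be adjacent, only in the same order.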

# Retriever-Reader

So far, only the Reader component of a Q&A system has been implemented, which amounts to a Reading Comprehension system: given a question and a context, find the answer. In most cases, we want to pass only the question and let the system find the correct answer, either from an external source or from the model's own memory (gained during training). When the context is first fetched from an external source and then passed to a Reader, we have a *Retriever-Reader* architecture.
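
To make the data flow concrete, here is a toy sketch of a Retriever-Reader loop. Everything in it is hypothetical: the retriever simply ranks contexts by word overlap with the question, and `toy_reader` is a stand-in that returns the first token starting with a digit. In the notebook, the reader could instead be a small wrapper around `qa_pipeline`.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r'[a-z0-9]+', text.lower()))

def retrieve(question, documents, top_k=1):
    """Toy retriever: rank documents by the number of words shared with the question."""
    question_tokens = tokenize(question)
    ranked = sorted(documents,
                    key=lambda doc: len(question_tokens & tokenize(doc)),
                    reverse=True)
    return ranked[:top_k]

def toy_reader(question, context):
    """Hypothetical stand-in for a real reader: picks the first token starting with a digit."""
    match = re.search(r'\d\w*', context)
    return {'answer': match.group(0) if match else '',
            'score': 1.0 if match else 0.0}

def answer_question(question, documents, reader, top_k=1):
    """Retrieve the top_k contexts, run the reader on each, keep the best-scoring answer."""
    candidates = [reader(question, context)
                  for context in retrieve(question, documents, top_k)]
    return max(candidates, key=lambda candidate: candidate['score'])

documents = [
    "Beyonce rose to fame in the late 1990s as lead singer of Destiny's Child.",
    "The Normans gave their name to Normandy, a region in France.",
]
question = 'When did Beyonce rise to fame?'
print(answer_question(question, documents, toy_reader))
```

A real system replaces both toy components: the retriever with a search index and the reader with a model such as the BERT reader used earlier.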

## Haystack

We are going to use a library called `Haystack` in order to implement a Retriever-Reader system with two different approaches:
- Elasticsearch
- FAISS (Facebook AI Similarity Search)

<br>

**Docker Elasticsearch Command:** `docker run --name es01 --net elastic -p 9200:9200 -it -m 1GB -e "discovery.type=single-node" -e "ES_JAVA_OPTS=-Xms512m -Xmx512m" --ulimit memlock=-1:-1 --ulimit nofile=65536:65536 docker.elastic.co/elasticsearch/elasticsearch:8.11.4`

### Elasticsearch

In [11]:
# Import Standard Libraries
import json
from elasticsearch_haystack.document_store import ElasticsearchDocumentStore

In [2]:
# Read data
with open('./../../../data/squad_dev_v2.0.json', 'r') as file:

    squad_dev_data = json.load(file)

In [15]:
# Initialise document store
document_store = ElasticsearchDocumentStore(hosts="https://localhost:9200")

TlsError: TLS error caused by: TlsError(TLS error caused by: SSLError([SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:997)))

### FAISS

An inherent drawback of the Nearest Neighbours algorithm, which checks the distance between the query and every stored vector, is that it quickly becomes very slow as the number of vectors and dimensions grows. With FAISS, we have:
- Faster GPU implementation with parallel processing
- Vectors can be preprocessed with PCA, to reduce the dimensionality, and with L2 normalization
- It uses *Inverted File Indexing* (IVF), which groups similar vectors into clusters; a query is first compared with the clusters and then only with the vectors in the closest ones, which speeds up the similarity search
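
The IVF idea can be illustrated with a small pure-Python sketch. It is only a toy under simplifying assumptions (a few k-means iterations, evenly spaced initial centroids, Euclidean distance) and does not reflect how FAISS is implemented; all function names are hypothetical. The `n_probe` argument plays a role similar to the number of clusters visited at query time.

```python
import math

def nearest(vector, candidates):
    """Index of the candidate closest to vector (Euclidean distance)."""
    return min(range(len(candidates)), key=lambda i: math.dist(vector, candidates[i]))

def assign(vectors, centroids):
    """Inverted lists: for each centroid, the indices of the vectors closest to it."""
    lists = [[] for _ in centroids]
    for index, vector in enumerate(vectors):
        lists[nearest(vector, centroids)].append(index)
    return lists

def build_ivf(vectors, n_clusters, n_iterations=10):
    """Cluster the vectors with a few k-means iterations and return centroids and lists."""
    # Simplification: pick evenly spaced vectors as the initial centroids
    centroids = [list(vectors[i * len(vectors) // n_clusters]) for i in range(n_clusters)]
    for _ in range(n_iterations):
        for c, members in enumerate(assign(vectors, centroids)):
            if members:  # keep the old centroid if the cluster is empty
                columns = zip(*(vectors[i] for i in members))
                centroids[c] = [sum(column) / len(members) for column in columns]
    return centroids, assign(vectors, centroids)

def search(query, vectors, centroids, inverted_lists, n_probe=1):
    """Index of the nearest vector, scanning only the n_probe closest clusters."""
    closest = sorted(range(len(centroids)),
                     key=lambda c: math.dist(query, centroids[c]))[:n_probe]
    candidates = [i for c in closest for i in inverted_lists[c]]
    if not candidates:  # fall back to an exhaustive scan
        candidates = range(len(vectors))
    return min(candidates, key=lambda i: math.dist(query, vectors[i]))

# Two well-separated groups of 25 two-dimensional vectors each
vectors = [(float(x), float(y)) for x in range(5) for y in range(5)]
vectors += [(100.0 + x, 100.0 + y) for x in range(5) for y in range(5)]

centroids, inverted_lists = build_ivf(vectors, n_clusters=2)
query = (101.2, 102.9)

# Only the 25 vectors of the closest cluster are compared with the query
print(search(query, vectors, centroids, inverted_lists, n_probe=1))
```

With `n_probe` equal to the number of clusters the search becomes exhaustive again; smaller values trade some accuracy for speed.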