<a href="https://colab.research.google.com/github/ckraju/ML-Lab/blob/master/question_answering_ssai.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLP Workshop : Passage Retrieval and Question Answering

## Outline

- [Introduction](#intro)
- [Part 0: Installing and importing Python Packages](#0)
- [Part 1: The SQUAD Question Answering Dataset](#1)
    - [1.1 Extracting passages from SQUAD (Exercise 01)](#ex01)
- [Part 2: BM25 Passage Retrieval](#2)
    - [2.1 Build index of passages using BM25](#2.1)
    - [2.2 Computing document similarity and retrieving best documents](#2.2)
    - [2.3 Evaluation using Top-K retrieval accuracy (Exercise 02)](#ex02)
- [Part 3: BERT-based Passage Retrieval](#3)
    - [3.1 Loading the BERT model](#3.1)
    - [3.2 Computing text embeddings using BERT (Exercise 03a)](#ex03a)
    - [3.3 Computing document similarity with query (Exercise 04)](#ex04)
    - [3.4 Evaluation using Top-K retrieval accuracy](#3.4)
- [Part 4: Dense Passage Retrieval (DPR)](#4)
    - [4.1 Computing question and passage embeddings using DPR](#4.1)
    - [4.2 Evaluation using Top-K retrieval accuracy](#4.2)
    - [4.3 BM25 vs BERT vs DPR (Homework Exercise)](#hwex01)
- [Part 5: Question Answering using DPR Reader](#5)
- [Part 6 : End-to-end Open Domain QA on SQuAD using DPR Retriever-Reader](#6)
    - [6.1 Retrieving top-k documents using DPR Retriever (Exercise 05)](#ex05)
    - [6.2 End-to-end QA evaluation : Reading Comprehension vs Open-Domain QA (Exercise 06)](#ex06)
    - [6.3 Open-domain QA evaluation : BM25 vs BERT vs DPR](#hwex02)
- [Part 7: End-to-end Open Domain QA on SQuAD using DPR Retriever and GPT-3](#7)


<a name='intro'></a>
# Introduction

In this tutorial you will explore various models for **passage retrieval** and **question answering**. For passage retrieval, we will experiment with three models you studied in the presentation:
 - BM25
 - BERT Average embeddings
 - Dense Passage Retrieval

For question answering (QA), we focus on the *reading comprehension* and *open-domain* forms of *extractive QA* (where the answer span is a substring of the passage).
 - Reading Comprehension : The passage for the question is already given, and the model needs to find the answer in that passage.
 - Open-Domain QA : The model first retrieves relevant passage(s) for the given question, and then searches for the answer in those passages.

<a name='0'></a>
# Part 0: Installing and importing Python Packages

In [None]:
!pip install transformers
!pip install datasets
!pip install rank-bm25
!pip install openai

In [None]:
import torch
import numpy as np
# Set random seed
np.random.seed(42)

from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'

<a name='1'></a>
# Part 1: The SQUAD Question Answering Dataset

We will be using the [SQUAD](https://rajpurkar.github.io/SQuAD-explorer/) dataset for QA experiments. It consists of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage.

**Dataset details**
 - The dataset is available for download at https://rajpurkar.github.io/SQuAD-explorer/
 - There are 87,599, 10,570, and 10,821 rows of data in train, validation and test sets respectively.
 - We will use only the train and dev (validation) sets, because the test set labels are hidden. Users can submit their models on the test set by following the instructions at https://rajpurkar.github.io/SQuAD-explorer/
 - A data point looks like this:

     {
        "answers": {
            "answer_start": [1],
            "text": ["This is a test text"]
        },
        "context": "This is a test context.",
        "id": "1",
        "question": "Is this a test?",
        "title": "train test"
    }
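
The `answer_start` field is a character offset into `context`, so the answer text can be recovered by slicing. The sketch below illustrates this on a made-up record (it is not taken from SQuAD, and the helper name `answer_span` is hypothetical):

```python
# Hypothetical record in the SQuAD format (made up for illustration)
example = {
    "answers": {"answer_start": [26], "text": ["Paris"]},
    "context": "France has its capital in Paris.",
    "id": "demo-1",
    "question": "What is the capital of France?",
    "title": "France",
}

def answer_span(example, i=0):
    """Slice the i-th answer out of the context using its character offset."""
    start = example["answers"]["answer_start"][i]
    text = example["answers"]["text"][i]
    return example["context"][start:start + len(text)]

print(answer_span(example))
```

Because the slice reproduces the stored answer text, this is also a quick sanity check for records you build yourself.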

In [None]:
from datasets import load_dataset

dataset = load_dataset("squad")

dataset['train'][0]

In [None]:
# Since the dataset is huge, we will be using only a part of it for our experiments
dataset["train"] = dataset["train"].select(range(5000))
dataset["validation"] = dataset["validation"].select(range(500))

<a name='ex01'></a>
### Exercise 01: Extracting passages from SQUAD

In [None]:
def extract_passages(dataset) :

    """Returns a list of unique passages extracted from the dataset
    NOTE: Two or more examples in a dataset can have same passage.

    Args:
        dataset (Dataset) : Input dataset same as the SQUAD Dataset format

    Returns:
        List of passages in dataset
    """

    corpus = [example['context'] for example in dataset]
    corpus = list(set(corpus))
    return corpus

In [None]:
train_corpus = extract_passages(dataset['train'])
dev_corpus = extract_passages(dataset['validation'])
full_corpus = train_corpus + dev_corpus

tokenized_full_corpus = [doc.split(" ") for doc in full_corpus]

<a name='2'></a>
# Part 2: BM25 Passage Retrieval

As explained in the talk, BM25 builds on TF-IDF, which uses two main factors to determine a document's similarity with the query:
 - Term Frequency (TF): how often do the query terms occur in the document?
 - Inverse Document Frequency (IDF): in how many documents does the term appear? Terms that appear in fewer documents receive a higher weight.
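
To make the interplay of TF and IDF concrete, here is a minimal sketch of a textbook Okapi BM25 score. The parameter values `k1=1.5` and `b=0.75` and the smoothed IDF are common choices and are assumptions here; the `rank_bm25` package may differ in its details. The function name `bm25_scores` and the toy documents are made up for illustration.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score every document against the query with a textbook BM25 formula."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(d) for d in tokenized_docs) / n_docs
    # Document frequency: number of documents that contain each term
    df = Counter(term for d in tokenized_docs for term in set(d))
    scores = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # Rare terms get a larger IDF weight
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is normalised by document length
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

toy_docs = [
    "the cat sat on the mat".split(),
    "dogs chase the cat".split(),
    "stock prices fell sharply".split(),
]
toy_scores = bm25_scores("cat mat".split(), toy_docs)
print(toy_scores)
```

The first document matches both query terms and scores highest, the second matches only the more common term, and the third shares no terms with the query and scores zero.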

For this tutorial, we will use the [rank_bm25 API](https://pypi.org/project/rank-bm25/) to first build indexes of passages, and then retrieve documents given a query.

<a name='2.1'></a>
### Build index of passages using BM25

In [None]:
from rank_bm25 import BM25Okapi
bm25 = BM25Okapi(tokenized_full_corpus)

<a name='2.2'></a>
### Computing document similarity and retrieving best documents

The BM25 library provides the following functions:
 - `get_scores(query)` returns the similarity of the query with every document
 - `get_top_n(query, documents, n)` directly returns the n documents with the highest similarity score for the given query

In [None]:
query = "Who is Beyoncé Giselle Knowles-Carter?"
doc_scores = bm25.get_scores(query.split(' '))
print ('BM25 scores of first 5 documents : ', doc_scores[:5])
assert len(doc_scores) == len(full_corpus)
top_passages = bm25.get_top_n(query.split(" "), full_corpus, n=3)
for p in top_passages :
    print('-----------------------------------------------------')
    print (p)

<a name='ex02'></a>
### Exercise 02: Evaluation using Top-K retrieval accuracy

The top-k accuracy of a retrieval model is defined as the fraction of questions for which the correct passage appears in the top-k retrieved passages.

Accuracy can therefore only grow with k: `top-1 <= top-10 <= top-50 <= top-100`
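
As a small worked example of this definition, the sketch below computes top-k accuracy from a list of ranks. The ranks are made-up numbers and `top_k_accuracy` is a hypothetical helper name.

```python
def top_k_accuracy(ranks, k):
    """Fraction of questions whose correct passage is ranked within the top k."""
    return sum(r <= k for r in ranks) / len(ranks)

# Hypothetical ranks of the correct passage for five questions
toy_ranks = [1, 3, 12, 2, 60]
for k in (1, 5, 10, 100):
    print('Top-{} accuracy: {}'.format(k, top_k_accuracy(toy_ranks, k)))
```

Raising k can only add questions to the "correct" set, which is why the accuracy never decreases as k grows.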

Complete the function `get_doc_rank` below, which computes the rank of a document for a given query. If the document has the highest BM25 score, its rank is 1.

In [None]:
def get_doc_rank(passage, passages, doc_scores) :
    """
    Returns ranking of document for given query.
    Args:
        - passage (str) : the document for which rank is to be computed
        - passages (list[str]) : all documents indexed in BM25
        - doc_scores (array of float) : similarity scores of all documents (e.g. BM25 scores)
    Returns:
        - doc_rank (int) : rank of document between 1 and len(passages)
    """

    doc_rank = 1
    index = passages.index(passage)
    for i in range(len(passages)) :
        if index!=i and doc_scores[i] > doc_scores[index] :
            doc_rank += 1

    return doc_rank

In [None]:
top_k = [1,5,10,20,40,50,100]
correct_cnt_bm25 = {k:0 for k in top_k}
for example in dataset['validation'] :

    #Get scores of all documents for given query
    doc_scores = bm25.get_scores(example['question'].split(' '))

    #Get rank of ground truth document
    doc_rank = get_doc_rank(example['context'], full_corpus, doc_scores)

    #Update top_k counts correctly
    for k in top_k :
        if doc_rank <= k :
            correct_cnt_bm25[k] += 1

for k in top_k :
    print ('Top-{} accuracy BM25: {}%'.format(k, 100*correct_cnt_bm25[k]/len(dataset['validation'])))

<a name='3'></a>
# Part 3: BERT-based Passage Retrieval

Now we will see how to leverage BERT for passage retrieval. We will use BERT embeddings to represent passages and queries.

The workflow is:
 1. Compute BERT embeddings of all passages in the corpus
 2. Compute the BERT embedding of the query
 3. Compute the score of each passage for the query by taking the dot product or cosine similarity of the query vector and the passage vector (Exercise 04 uses cosine similarity)
 4. Return the top-k passages with the highest scores

![](https://drive.google.com/uc?export=view&id=15qXd_-Cu4LpIWoVr10tLdd12KYPCGS0j)
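
The four steps above can be sketched without any model by using toy vectors in place of BERT embeddings. The vectors and the helper names `dot` and `retrieve` below are made up for illustration; real BERT-base embeddings have 768 dimensions.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def retrieve(query_vec, passage_vecs, k):
    """Return the indices of the k passages with the highest score, best first."""
    scores = [dot(query_vec, p) for p in passage_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]

# Steps 1 and 2: toy 3-dimensional "embeddings" standing in for model outputs
passage_vecs = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.2], [0.7, 0.6, 0.1]]
query_vec = [1.0, 0.2, 0.0]

# Steps 3 and 4: score every passage and keep the best two
top2 = retrieve(query_vec, passage_vecs, k=2)
print(top2)
```

Passage embeddings are computed once and reused for every query, so only the query has to be encoded at search time.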

<a name='3.1'></a>
### Loading the BERT Model

In [None]:
from transformers import BertTokenizer, BertModel
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertModel.from_pretrained('bert-base-uncased')

bert_model.to(device)

#Set mode to eval since we are not training/fine-tuning BERT weights
bert_model.eval()

Run the cell below to see how to get BERT word embeddings

In [None]:
encoded_input = bert_tokenizer(['This is a dummy passage1', 'This is a dummy passage2'],
                               padding=True, truncation=True, max_length=256, return_tensors='pt').to(device)
with torch.no_grad():
    model_output = bert_model(**encoded_input)
    token_embeddings = model_output[0]
token_embeddings.size()

<a name='ex03a'></a>
### Exercise 03a: Computing sentence embeddings using BERT

We explore two methods to compute sentence embeddings of passages and queries:
 - Average word embeddings : Encode the sentence with BERT, then compute the average of the output word embeddings.
 - [CLS] token embeddings : As we learnt in the previous tutorial, the embedding of the [CLS] token can be used as a sentence representation.
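
The difference between the two pooling methods can be seen on toy token vectors. In a padded batch, padding positions also produce output vectors; one common variant, assumed in this sketch, uses the attention mask to leave them out of the average. The helper names and the numbers below are made up for illustration.

```python
def mean_pool(token_vecs, attention_mask):
    """Average the token vectors, skipping padding positions (mask == 0)."""
    kept = [v for v, m in zip(token_vecs, attention_mask) if m == 1]
    dim = len(kept[0])
    return [sum(v[d] for v in kept) / len(kept) for d in range(dim)]

def cls_pool(token_vecs):
    """Use the vector at position 0, where BERT places the [CLS] token."""
    return token_vecs[0]

# Toy 2-dimensional vectors for one sentence: [CLS], two words, one [PAD]
token_vecs = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0], [9.0, 9.0]]
attention_mask = [1, 1, 1, 0]

print(mean_pool(token_vecs, attention_mask))
print(cls_pool(token_vecs))
```

With tensors, the same masked average is usually written as a sum over masked token vectors divided by the number of unmasked positions.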

Complete the function `compute_bert_avg_embeddings` below, which computes text embeddings by encoding the sentences with BERT and then averaging the output word embeddings. Refer to https://towardsdatascience.com/nlp-extract-contextualized-word-embeddings-from-bert-keras-tf-67ef29f60a7b on how to extract word embeddings from BERT.

In [None]:
def compute_bert_avg_embeddings(sentences) :

    """
    Returns the average word embedding of each given sentence.
    Args:
        - sentences (list[str]) : list of sentences
    Returns:
        - embeddings : torch tensor of size `len(sentences) X 768`
    """

    encoded_input = bert_tokenizer(sentences,
                               padding=True, truncation=True, max_length=256, return_tensors='pt').to(device)
    with torch.no_grad():
        model_output = bert_model(**encoded_input)
    token_embeddings = model_output[0]
    return torch.mean(token_embeddings, axis=1)

Complete the function `compute_bert_cls_embeddings` below, which computes text embeddings by encoding the sentences with BERT and then returning the embedding of the [CLS] token. Refer to https://huggingface.co/sentence-transformers/bert-base-nli-cls-token on how to extract [CLS] token embeddings from BERT.

In [None]:
def compute_bert_cls_embeddings(sentences) :

    """
    Returns the [CLS] token BERT embedding of each given sentence.
    Args:
        - sentences (list[str]) : list of sentences
    Returns:
        - embeddings : torch tensor of size `len(sentences) X 768`
    """

    encoded_input = bert_tokenizer(sentences, padding=True, truncation=True, return_tensors='pt').to(device)
    with torch.no_grad():
        model_output = bert_model(**encoded_input)
    return model_output[0][:,0].to(device)

The code below computes BERT average embeddings for all passages in the full corpus. This will take ~2 minutes.

In [None]:
bert_avg_embeddings = torch.zeros([len(full_corpus), 768])
batch_size = 3
# Step through the corpus in batches; slicing also covers the final, shorter batch
for start in range(0, len(full_corpus), batch_size) :
    if (start // batch_size) % 100 == 0:
      print(start // batch_size)
    batch = full_corpus[start:start+batch_size]
    bert_avg_embeddings[start:start+batch_size] = compute_bert_avg_embeddings(batch)

assert len(bert_avg_embeddings) == len(full_corpus)

We now do the same thing for BERT CLS embeddings.

In [None]:
bert_cls_embeddings = torch.zeros([len(full_corpus), 768])
batch_size = 3
# Step through the corpus in batches; slicing also covers the final, shorter batch
for start in range(0, len(full_corpus), batch_size) :
    if (start // batch_size) % 100 == 0:
      print(start // batch_size)
    batch = full_corpus[start:start+batch_size]
    bert_cls_embeddings[start:start+batch_size] = compute_bert_cls_embeddings(batch)


<a name='ex04'></a>
### Exercise 04: Computing document similarity with query

Complete the function `compute_doc_scores` below, which computes the cosine similarity between the document vectors and the query vector.

Use the `compute_bert_avg_embeddings` function to compute passage and query embedding.
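
As a reminder of what cosine similarity measures, here is a minimal sketch on plain Python lists. The helper name `cosine_similarity` is made up for illustration, and the `eps` guard only plays a role similar to the `eps` argument of the PyTorch module; the exact clamping in PyTorch may differ.

```python
import math

def cosine_similarity(u, v, eps=1e-6):
    """Cosine of the angle between u and v; eps guards against zero-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / max(norm_u * norm_v, eps)

same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])
print(same_direction, orthogonal, opposite)
```

The score depends only on the direction of the two vectors, not on their lengths, and ranges from -1 to 1.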

In [None]:
from torch import nn
def compute_doc_scores(query, embeddings) :
    """
    Returns similarity of documents with query
    Args:
        - query (str) : the question (query)
        - embeddings (torch tensor) : embeddings of documents
    Returns:
        - doc_scores : torch tensor of size `len(embeddings)`
    """

    query_embedding = compute_bert_avg_embeddings([query])[0]
    cos = nn.CosineSimilarity(dim=1, eps=1e-6)
    return cos(embeddings.to(device), query_embedding)

<a name='3.4'></a>
### Evaluation using Top-K retrieval accuracy

In [None]:
top_k = [1,5,10,20,40,50,100]
correct_cnt_bert = {k:0 for k in top_k}

for example in dataset['validation'] :
    doc_scores = compute_doc_scores(example['question'], bert_avg_embeddings)
    doc_rank = get_doc_rank(example['context'], full_corpus, doc_scores)
    for k in top_k :
        if doc_rank <= k :
            correct_cnt_bert[k] += 1

for k in top_k :
    print ('Top-{} accuracy BERT Embeddings: {}%'.format(k, 100*correct_cnt_bert[k]/len(dataset['validation'])))

<a name='4'></a>
# Part 4: Dense Passage Retrieval (DPR)

In [None]:
from transformers import DPRContextEncoder, DPRContextEncoderTokenizer
from transformers import DPRQuestionEncoder, DPRQuestionEncoderTokenizer
q_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained('facebook/dpr-question_encoder-single-nq-base')
q_model = DPRQuestionEncoder.from_pretrained('facebook/dpr-question_encoder-single-nq-base')
p_tokenizer = DPRContextEncoderTokenizer.from_pretrained('facebook/dpr-ctx_encoder-single-nq-base')
p_model = DPRContextEncoder.from_pretrained('facebook/dpr-ctx_encoder-single-nq-base')

q_model.to(device)
p_model.to(device)

<a name='4.1'></a>
### Computing question and passage embeddings using DPR

Run the example below to see how to compute embeddings for a question and passage using DPR

In [None]:
p_input_ids = p_tokenizer(['This is a dummy passage'],
                        return_tensors='pt', padding=True, truncation=True)["input_ids"].to(device)
passage_embeddings = p_model(p_input_ids).pooler_output

q_input_ids = q_tokenizer(['Is this a dummy question?'],
                        return_tensors='pt', padding=True, truncation=True)["input_ids"].to(device)
question_embeddings = q_model(q_input_ids).pooler_output

print ('Passage embedding size : ', passage_embeddings.size())
print ('Question embedding size : ', question_embeddings.size())

We will now compute DPR embeddings for the entire corpus

In [None]:
dpr_embeddings = torch.zeros([len(full_corpus), 768])

for i in range(len(full_corpus)) :
    if i%100==0 :
        print (i)
    with torch.no_grad():
        input_ids = p_tokenizer(full_corpus[i*1:(i+1)*1],
                        return_tensors='pt', padding=True, truncation=True)["input_ids"].to(device)
        dpr_embeddings[i*1:(i+1)*1] = p_model(input_ids).pooler_output
        torch.cuda.empty_cache()

dpr_embeddings = dpr_embeddings.to(device)

The query-document similarity in DPR is defined by inner product.
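
Unlike cosine similarity, the inner product is sensitive to vector length, so the two measures can rank the same passages differently. The sketch below shows this with made-up 2-dimensional vectors; the helper names are hypothetical.

```python
import math

def inner_product(u, v):
    """Plain dot product: grows with the length of either vector."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Dot product of the length-normalised vectors."""
    norm = math.sqrt(inner_product(u, u)) * math.sqrt(inner_product(v, v))
    return inner_product(u, v) / norm

q_vec = [1.0, 1.0]
long_vec = [4.0, 0.0]      # large norm, 45 degrees away from the query
aligned_vec = [1.0, 1.0]   # small norm, same direction as the query

print(inner_product(q_vec, long_vec), inner_product(q_vec, aligned_vec))
print(cosine(q_vec, long_vec), cosine(q_vec, aligned_vec))
```

Under the inner product the long vector wins (4.0 against 2.0), while under cosine similarity the aligned vector wins.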

In [None]:
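# NOTE: this redefines `compute_doc_scores` from Part 3, now using the DPR question encoder
def compute_doc_scores(query, embeddings) :
    """Returns inner-product scores of all passages for the query (tensor of size `len(embeddings) X 1`)."""
    input_ids = q_tokenizer(query,
                        return_tensors='pt', padding=True, truncation=True)["input_ids"].to(device)
    # No gradients are needed at inference time
    with torch.no_grad():
        q_embeddings = q_model(input_ids).pooler_output

    return torch.matmul(embeddings, q_embeddings.T)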

<a name='4.2'></a>
### Evaluation using Top-K retrieval accuracy

In [None]:
top_k = [1,5,10,20,40,50,100]
correct_cnt_dpr = {k:0 for k in top_k}

for example in dataset['validation'] :
    doc_scores = compute_doc_scores(example['question'], dpr_embeddings)
    doc_rank = get_doc_rank(example['context'], full_corpus, doc_scores)
    for k in top_k :
        if doc_rank <= k :
            correct_cnt_dpr[k] += 1

for k in top_k :
    print ('Top-{} accuracy DPR Embeddings: {}%'.format(k, 100*correct_cnt_dpr[k]/len(dataset['validation'])))

<a name='hwex01'></a>
### Homework Exercise 01: BM25 vs BERT vs DPR

Now that you have the top-k accuracy results of all three models, complete the function `plot_histogram` below to plot histograms comparing the accuracy of BM25, BERT, and DPR in the same plot.

In [None]:
def plot_histogram(correct_cnt_bm25, correct_cnt_bert, correct_cnt_dpr) :

    ### START CODE HERE (REPLACE INSTANCES OF 'pass' WITH YOUR CODE) ###
    pass

<a name='5'></a>
# Part 5 : Question Answering using DPR Reader

Study the example below to see how to perform reading comprehension using the DPR Reader.
The reader takes as input a question and a set of documents, and performs extractive QA to extract the answer from those documents.
NOTE: We do not use the retriever in reading comprehension.
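
An extractive reader predicts a start logit and an end logit for every token. Taking the two argmaxes independently can produce an end position that lies before the start. A common refinement, sketched here with made-up logits, searches for the best-scoring valid span instead. The helper name `best_span` and the `max_answer_len` limit are assumptions for illustration, not part of the DPR Reader API.

```python
def best_span(start_logits, end_logits, max_answer_len=10):
    """Return (start, end) maximising start + end logit with start <= end."""
    best, best_score = (0, 0), float("-inf")
    for s, s_logit in enumerate(start_logits):
        # Only consider ends at or after the start, within the length limit
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

tokens = ["its", "capital", "in", "paris", ","]
start_logits = [0.1, 0.3, 0.2, 2.5, 0.0]   # made-up values
end_logits = [3.0, 0.1, 0.2, 2.0, 0.4]     # made-up values

start, end = best_span(start_logits, end_logits)
print(tokens[start:end + 1])
```

Here the independent argmaxes would give start 3 and end 0, which is not a valid span, while the joint search selects the single token at position 3.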

In [None]:
from transformers import DPRReader, DPRReaderTokenizer
reader_tokenizer = DPRReaderTokenizer.from_pretrained('facebook/dpr-reader-single-nq-base')
reader_model = DPRReader.from_pretrained('facebook/dpr-reader-single-nq-base')

reader_model.to(device)

def dpr_reader_get_answer(questions, passages):

  encoded_inputs = reader_tokenizer(
      questions=questions,
      texts=passages,
      return_tensors='pt'
  ).to(device)

  input_ids = encoded_inputs["input_ids"].tolist()[0]
  outputs = reader_model(**encoded_inputs)
  answer_start = torch.argmax(outputs.start_logits)
  answer_end = torch.argmax(outputs.end_logits) + 1
  answer = reader_tokenizer.convert_tokens_to_string(reader_tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))
  return answer

answer = dpr_reader_get_answer(["What is the capital of France?"], ["France is a unitary semi-presidential republic with its capital in Paris, the country's largest city and main cultural and commercial centre"])
print(f"Question: What is the capital of France?")
print(f"Answer: {answer}\n")

<a name='6'></a>
# Part 6 : End-to-end Open Domain QA on SQuAD using DPR Retriever-Reader

In the final stage, we will perform open-domain QA. First, the retriever retrieves k documents for the given query. Then the reader performs extractive QA and returns the predicted answer span from one of the k documents.

<a name='ex05'></a>
### Exercise 05: Retrieving top-k documents using DPR Retriever

Complete the function `retrieve_top_k` below, which retrieves the top-k documents using the DPR retriever.

In [None]:
def retrieve_top_k(query, k) :
    """
    Returns list of k passages having highest dot product with query vector
    Args:
        - query (str) : the question (query)
        - k (int) : number of documents to be returned
    Returns:
        - passages (list[str]) : the k highest-scoring passages, best first
    """

    doc_scores = compute_doc_scores(query, dpr_embeddings).detach().cpu().numpy()[:,0]
    # argsort is ascending, so take the last k indexes and reverse them
    max_indexes = doc_scores.argsort()[-k:][::-1]
    passages = [full_corpus[ind] for ind in max_indexes]
    return passages

In [None]:
# these functions are heavily influenced by the HF squad_metrics.py script
def normalize_text(s):
    """Removing articles and punctuation, and standardizing whitespace are all typical text processing steps."""
    import string, re

    def remove_articles(text):
        regex = re.compile(r"\b(a|an|the)\b", re.UNICODE)
        return re.sub(regex, " ", text)

    def white_space_fix(text):
        return " ".join(text.split())

    def remove_punc(text):
        exclude = set(string.punctuation)
        return "".join(ch for ch in text if ch not in exclude)

    def lower(text):
        return text.lower()

    return white_space_fix(remove_articles(remove_punc(lower(s))))

def compute_exact_match(prediction, truth):
    return int(normalize_text(prediction) == normalize_text(truth))

def compute_f1(prediction, truth):
    from collections import Counter

    pred_tokens = normalize_text(prediction).split()
    truth_tokens = normalize_text(truth).split()

    # if either the prediction or the truth is no-answer then f1 = 1 if they agree, 0 otherwise
    if len(pred_tokens) == 0 or len(truth_tokens) == 0:
        return int(pred_tokens == truth_tokens)

    # count shared tokens with multiplicity, so repeated tokens are not undercounted
    num_common = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())

    # if there are no common tokens then f1 = 0
    if num_common == 0:
        return 0

    prec = num_common / len(pred_tokens)
    rec = num_common / len(truth_tokens)

    return 2 * (prec * rec) / (prec + rec)

<a name='ex06'></a>
### Exercise 06: End-to-end QA evaluation : Reading Comprehension vs Open-Domain QA

Complete parts of the code below to perform end-to-end extractive QA, evaluated on the SQUAD dataset.

The goal is to evaluate both the reading comprehension setting (where the correct passage is given to the reader) and the end-to-end open-domain setting (where the retriever fetches the passage for the reader).
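
Both settings are scored with token-level F1 between the predicted and the gold answer, taking the best score over all gold answers of a question. The sketch below is a simplified worked example: it only lowercases and splits on whitespace (no article or punctuation removal), and the name `token_f1` and the answer strings are made up for illustration.

```python
from collections import Counter

def token_f1(prediction, truth):
    """Token-level F1 with simplified normalisation (lowercase + whitespace split)."""
    pred_tokens = prediction.lower().split()
    truth_tokens = truth.lower().split()
    # Multiset intersection counts each shared token as often as it is shared
    overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)

gold_answers = ["Denver Broncos", "The Broncos"]
prediction = "Denver Broncos team"
per_answer = [token_f1(prediction, a) for a in gold_answers]
best_f1 = max(per_answer)
print(per_answer, best_f1)
```

Against the first gold answer the precision is 2/3 and the recall is 1, giving F1 = 0.8; against the second the F1 is 0.4, so the question is credited with 0.8.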

In [None]:
f1_score = 0

# If `open_domain==False` this means reading comprehension
open_domain = False

for example in dataset['validation'] :

    passages = None
    if open_domain : # Retrieve k=1 documents using DPR retriever
        passages = retrieve_top_k(example['question'], 1)
    else : # Passages will be the ground truth passage in example
        passages = [example['context']]

    # Use DPR reader to compute answer string
    prediction = dpr_reader_get_answer([example['question']], passages)

    f1_score += max((compute_f1(prediction, answer)) for answer in example['answers']['text'])

if open_domain :
    print ('Open domain QA F1 (%) : ', 100*f1_score/len(dataset['validation']))
else :
    print ('Reading Comprehension F1 (%) : ', 100*f1_score/len(dataset['validation']))

<a name='hwex02'></a>
### Homework Exercise 02: Open-domain QA evaluation : BM25 vs BERT vs DPR
Now that you have evaluated DPR on open-domain QA, try using BM25 and BERT to retrieve the documents for open-domain QA. Compare the F1 scores of all three models. Do you see trends similar to those in retrieval performance?

<a name='7'></a>
# Part 7 : End-to-end Open Domain QA on SQuAD using DPR Retriever and GPT-3

The initial setup is similar to what we did for sentiment analysis

In [None]:
#@title
import os
# Paste your own API key here; never commit a real key to a shared notebook
os.environ['OPENAI_API_KEY'] = 'YOUR_OPENAI_API_KEY'

In [None]:
import openai
openai.api_key = os.getenv("OPENAI_API_KEY")

Let us first try closed book QA, i.e. GPT-3 will answer the question **without** using retrieved passages.

In [None]:
template = 'I am a highly intelligent question answering bot. \
If you ask me a question that is rooted in truth, I will give you the answer. \
If you ask me a question that is nonsense, trickery, or has no clear answer, \
I will respond with "Unknown".\n\nQ: {question}'

In [None]:
sample = dataset['validation'][0]
prompt = template.format(question=sample['question'])
print(prompt)

Let us get response from GPT-3 and compare it with ground truth

In [None]:
def get_gpt_response(prompt):
  # NOTE: openai.Completion is the legacy interface of the openai package (versions before 1.0)
  response = openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt,
    temperature=0,
    max_tokens=60,
    top_p=1,
    frequency_penalty=0.5,
    presence_penalty=0
  )
  return response.choices[0].text.strip()


print('GPT-3 Response (closed-book) - ' + get_gpt_response(prompt))
print('Ground Truth answers : ' + ', '.join(sample['answers']['text']))

We will now try augmenting GPT-3 with retrieved passages.

In [None]:
# Original answer without any passage
query = 'How many turnovers did Cam Newton have in Super Bowl 50?'
print(get_gpt_response(template.format(question=query)))

Let us design a new template that considers the top retrieved passage from DPR


In [None]:
template = 'Please answer the following question given the following passages:\n{retrieved_passages}\nQuestion: {query}\nAnswer:'

In [None]:
### START CODE HERE (REPLACE INSTANCES OF 'None' WITH YOUR CODE) ###
# Create the prompt using the template above for the given query
passage = retrieve_top_k(query, 1)[0]
prompt = template.format(retrieved_passages=passage, query=query)
print(prompt)

Let us see if providing the passage makes a difference!

In [None]:
print('GPT-3 Response (with retrieved passage) - ' + get_gpt_response(prompt))
print('Ground Truth answers : ' + ', '.join(dataset['validation'][63]['answers']['text']))
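
The template has a single `{retrieved_passages}` slot, so several retrieved passages can be passed by joining them into one numbered block. A minimal sketch, assuming a hypothetical helper `build_prompt` and made-up passages (no API call is made here):

```python
TEMPLATE = ('Please answer the following question given the following passages:\n'
            '{retrieved_passages}\nQuestion: {query}\nAnswer:')

def build_prompt(query, passages):
    """Number the passages and fill them into the template."""
    numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return TEMPLATE.format(retrieved_passages=numbered, query=query)

# Made-up passages standing in for the output of a top-k retriever
toy_passages = ["Paris is the capital of France.", "Lyon is a city in France."]
prompt_text = build_prompt("What is the capital of France?", toy_passages)
print(prompt_text)
```

Numbering the passages keeps them visually separate in the prompt; longer prompts also consume more of the model's context window, so k is usually kept small.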