# Experiments to reproduce results in the paper.

The approach used is based on the information from https://huggingface.co/docs/transformers/en/model_doc/dpr.
The Wikipedia passage corpus used for Natural Questions comes from https://huggingface.co/datasets/vocab-transformers/wiki-en-passages-20210101 (only 10k passages are used).

The question-answering datasets:
* https://huggingface.co/datasets/google-research-datasets/nq_open
* https://huggingface.co/datasets/mandarjoshi/trivia_qa
* https://huggingface.co/datasets/Stanford/web_questions
* https://huggingface.co/datasets/CogComp/trec
* https://huggingface.co/datasets/rajpurkar/squad

HuggingFace pretrained encoder endpoints:
* Single
  * facebook/dpr-question_encoder-single-nq-base
  * facebook/dpr-ctx_encoder-single-nq-base
* Multi
  * facebook/dpr-question_encoder-multiset-base
  * facebook/dpr-ctx_encoder-multiset-base
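
The endpoint pairs above can be kept in a small lookup table so that a mode name maps directly to checkpoint names. This is only a sketch: `DPR_CHECKPOINTS` and `checkpoints_for` are hypothetical names that do not appear in the notebook.

```python
# Hypothetical lookup table: mode name -> HuggingFace checkpoint names.
DPR_CHECKPOINTS = {
    "single": {
        "question_encoder": "facebook/dpr-question_encoder-single-nq-base",
        "context_encoder": "facebook/dpr-ctx_encoder-single-nq-base",
    },
    "multi": {
        "question_encoder": "facebook/dpr-question_encoder-multiset-base",
        "context_encoder": "facebook/dpr-ctx_encoder-multiset-base",
    },
}


def checkpoints_for(mode="single"):
    """Return the checkpoint names for a mode, or raise ValueError if unknown."""
    try:
        return DPR_CHECKPOINTS[mode]
    except KeyError:
        raise ValueError(f"Invalid mode {mode!r}: try 'single' or 'multi'") from None


print(checkpoints_for("multi")["question_encoder"])
```

Keeping the names in one place avoids repeating the checkpoint strings in each branch of a loader function.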

The experiment code combines several sources: the HuggingFace documentation (which encoders to use), Pyserini (used for evaluation) and ChatGPT (for general Python questions).

The evaluation method is taken from Pyserini, which in turn takes it from the DPR repository.
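
The core of that evaluation is a token-level match: a retrieved passage counts as a hit when the tokens of any reference answer appear in it as a contiguous run. The sketch below illustrates the idea with the standard library only. It is an approximation, not the Pyserini code: it uses `re` with `\w+` instead of the Unicode-category patterns of the original tokenizer (so punctuation and combining marks are dropped), and `simple_tokens` and `contains_answer` are illustrative names.

```python
import re
import unicodedata


def simple_tokens(text):
    """Lowercased word tokens after NFD normalization (simplified tokenizer)."""
    text = unicodedata.normalize("NFD", text)
    return [t.lower() for t in re.findall(r"\w+", text)]


def contains_answer(passage, answers):
    """True if any answer's tokens occur contiguously in the passage's tokens."""
    text = simple_tokens(passage)
    for answer in answers:
        ans = simple_tokens(answer)
        if not ans:
            continue  # skip answers that tokenize to nothing
        for i in range(len(text) - len(ans) + 1):
            if text[i:i + len(ans)] == ans:
                return True
    return False


passage = "Finland is a Nordic country located in Northern Europe."
print(contains_answer(passage, ["northern europe"]))
print(contains_answer(passage, ["Southern Europe", "Asia"]))
```

Matching on tokens rather than raw substrings means an answer such as "art" does not match inside a longer word such as "part".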

Only the Natural Questions dataset with the single encoder is evaluated, using the Wikipedia passage dataset. The other question-answering datasets seem to use different passages, which are too large to download. The experiment still helps to understand the process of reproducing the DPR results.

<b>Other attempt using Pyserini</b>

The index data is available in the Pyserini codebase, as described at https://github.com/castorini/pyserini/blob/master/docs/experiments-dpr.md. However, this dataset is 76GB, which is too large to store and process. The first command failed at file extraction. Many other attempts to use a different index or dataset, or to increase compute resources, were to no avail.

In [1]:
# import libraries
import copy
import logging
import os
import unicodedata

import numpy as np
import regex
import torch
from datasets import load_dataset, load_from_disk
from rank_bm25 import BM25Okapi
from transformers import DPRQuestionEncoder, DPRContextEncoder, DPRContextEncoderTokenizer

# Logger used by SimpleTokenizer to warn about unsupported annotators
logger = logging.getLogger(__name__)



In [24]:
""" 
Modified from https://github.com/castorini/pyserini/blob/master/pyserini/eval/evaluate_dpr_retrieval.py
"""

class Tokens(object):
    """A class to represent a list of tokenized text."""
    TEXT = 0
    TEXT_WS = 1
    SPAN = 2
    POS = 3
    LEMMA = 4
    NER = 5

    def __init__(self, data, annotators, opts=None):
        self.data = data
        self.annotators = annotators
        self.opts = opts or {}

    def __len__(self):
        """The number of tokens."""
        return len(self.data)

    def slice(self, i=None, j=None):
        """Return a view of the list of tokens from [i, j)."""
        new_tokens = copy.copy(self)
        new_tokens.data = self.data[i: j]
        return new_tokens

    def untokenize(self):
        """Returns the original text (with whitespace reinserted)."""
        return ''.join([t[self.TEXT_WS] for t in self.data]).strip()

    def words(self, uncased=False):
        """Returns a list of the text of each token
        Args:
            uncased: lower cases text
        """
        if uncased:
            return [t[self.TEXT].lower() for t in self.data]
        else:
            return [t[self.TEXT] for t in self.data]

    def offsets(self):
        """Returns a list of [start, end) character offsets of each token."""
        return [t[self.SPAN] for t in self.data]

    def pos(self):
        """Returns a list of part-of-speech tags of each token.
        Returns None if this annotation was not included.
        """
        if 'pos' not in self.annotators:
            return None
        return [t[self.POS] for t in self.data]

    def lemmas(self):
        """Returns a list of the lemmatized text of each token.
        Returns None if this annotation was not included.
        """
        if 'lemma' not in self.annotators:
            return None
        return [t[self.LEMMA] for t in self.data]

    def entities(self):
        """Returns a list of named-entity-recognition tags of each token.
        Returns None if this annotation was not included.
        """
        if 'ner' not in self.annotators:
            return None
        return [t[self.NER] for t in self.data]

    def ngrams(self, n=1, uncased=False, filter_fn=None, as_strings=True):
        """Returns a list of all ngrams from length 1 to n.
        Args:
            n: upper limit of ngram length
            uncased: lower cases text
            filter_fn: user function that takes in an ngram list and returns
              True or False to keep or not keep the ngram
            as_strings: return the ngram as a string vs list
        """

        def _skip(gram):
            if not filter_fn:
                return False
            return filter_fn(gram)

        words = self.words(uncased)
        ngrams = [(s, e + 1)
                  for s in range(len(words))
                  for e in range(s, min(s + n, len(words)))
                  if not _skip(words[s:e + 1])]

        # Concatenate into strings
        if as_strings:
            ngrams = ['{}'.format(' '.join(words[s:e])) for (s, e) in ngrams]

        return ngrams

    def entity_groups(self):
        """Group consecutive entity tokens with the same NER tag."""
        entities = self.entities()
        if not entities:
            return None
        non_ent = self.opts.get('non_ent', 'O')
        groups = []
        idx = 0
        while idx < len(entities):
            ner_tag = entities[idx]
            # Check for entity tag
            if ner_tag != non_ent:
                # Chomp the sequence
                start = idx
                while (idx < len(entities) and entities[idx] == ner_tag):
                    idx += 1
                groups.append((self.slice(start, idx).untokenize(), ner_tag))
            else:
                idx += 1
        return groups

# Tokenizer base class
class Tokenizer(object):
    """Base tokenizer class.
    Tokenizers implement tokenize, which should return a Tokens class.
    """

    def tokenize(self, text):
        raise NotImplementedError

    def shutdown(self):
        pass

    def __del__(self):
        self.shutdown()

# Tokeniser for question and passage tokenization for checking matching answers
class SimpleTokenizer(Tokenizer):
    ALPHA_NUM = r'[\p{L}\p{N}\p{M}]+'
    NON_WS = r'[^\p{Z}\p{C}]'

    def __init__(self, **kwargs):
        """
        Args:
            annotators: None or empty set (only tokenizes).
        """
        self._regexp = regex.compile(
            '(%s)|(%s)' % (self.ALPHA_NUM, self.NON_WS),
            flags=regex.IGNORECASE + regex.UNICODE + regex.MULTILINE
        )
        if len(kwargs.get('annotators', {})) > 0:
            logger.warning('%s only tokenizes! Skipping annotators: %s' %
                           (type(self).__name__, kwargs.get('annotators')))
        self.annotators = set()

    def tokenize(self, text):
        data = []
        matches = [m for m in self._regexp.finditer(text)]
        for i in range(len(matches)):
            # Get text
            token = matches[i].group()

            # Get whitespace
            span = matches[i].span()
            start_ws = span[0]
            if i + 1 < len(matches):
                end_ws = matches[i + 1].span()[0]
            else:
                end_ws = span[1]

            # Format data
            data.append((
                token,
                text[start_ws: end_ws],
                span,
            ))
        return Tokens(data, self.annotators)

In [22]:

# Load DPR Models - 'single' or 'multi' from HuggingFace
def load_dpr_models(mode='single'):
    if mode == 'single':
        question_encoder = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
        context_encoder = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
        tokenizer = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
    elif mode == 'multi':
        question_encoder = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-multiset-base")
        context_encoder = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-multiset-base")
        tokenizer = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-multiset-base")
    else:
        raise ValueError("Invalid mode: try 'single' or 'multi'")

    return tokenizer, question_encoder, context_encoder

# Encode the corpus using the encoders
def encode_corpus(corpus, context_encoder, tokenizer):
    context_embeddings = []
    for context in corpus:
        inputs = tokenizer(context, return_tensors="pt", padding=True, truncation=True)
        with torch.no_grad():
            context_embedding = context_encoder(**inputs).pooler_output
        context_embeddings.append(context_embedding)
    return torch.vstack(context_embeddings)

# Retrieve passage indices for top K using the context (or potential answers) and question embeddings
def retrieve_passages(query, corpus, context_embeddings, question_encoder, tokenizer, top_k=100):
    inputs = tokenizer([query], return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        question_embedding = question_encoder(**inputs).pooler_output

    # Compute similarities
    similarities = torch.matmul(question_embedding, context_embeddings.T).squeeze(0).cpu().numpy()
    dpr_indices = np.argsort(similarities)[::-1][:top_k]

    return dpr_indices

# Modified based on https://github.com/castorini/pyserini/blob/master/pyserini/eval/evaluate_dpr_retrieval.py
def has_answers(passages, answers):
    """Return True if any passage contains any answer as a contiguous token sequence."""
    tokenizer = SimpleTokenizer()

    # Tokenize each answer once instead of once per passage
    answer_tokens = []
    for ans in answers:
        ans = unicodedata.normalize('NFD', ans)
        answer_tokens.append(tokenizer.tokenize(ans).words(uncased=True))

    for passage in passages:
        # Each passage is a single string taken from the corpus
        text = unicodedata.normalize('NFD', passage)
        text = tokenizer.tokenize(text).words(uncased=True)

        for ans in answer_tokens:
            for i in range(0, len(text) - len(ans) + 1):
                if ans == text[i: i + len(ans)]:
                    return True

    return False

# Calculate accuracy
# Note: this helper is not called by scoreTopKs. The membership test below is
# true for every valid index, so it does not measure retrieval accuracy.
def calculate_accuracy(retrieved_indices, answers):
    correct = sum(1 for idx in retrieved_indices if answers[idx] in answers)
    return correct / len(answers) * 100

# Calculate the scores for the given dataset: for each question, 1 if a retrieved passage contains a correct answer and 0 otherwise
def scoreTopKs(dataset_name, dataset, answers, corpus, context_embeddings, question_encoder, tokenizer):
    # initialise accuracy score list for questions
    q_dpr_20 = []
    q_dpr_100 = []

    for idx, question in enumerate(dataset['question']):

        # Retrieve the top 100 passages using DPR; the top 20 are the first 20 of these
        dpr_100_indices = retrieve_passages(question, corpus, context_embeddings, question_encoder, tokenizer, top_k=100)
        dpr_20_indices = dpr_100_indices[:20]

        answer = answers[idx]

        # Look up the passage text for each retrieved index
        dpr_20_passages = [corpus[int(i)] for i in dpr_20_indices]
        dpr_100_passages = [corpus[int(i)] for i in dpr_100_indices]

        q_dpr_20.append(1 if has_answers(dpr_20_passages, answer) else 0)
        q_dpr_100.append(1 if has_answers(dpr_100_passages, answer) else 0)

    # final overall accuracy - average of accuracy for each question
    f_dpr_20 = sum(q_dpr_20) / len(q_dpr_20)
    f_dpr_100 = sum(q_dpr_100) / len(q_dpr_100)

    print('q_dpr_20 len:', len(q_dpr_20))
    print(q_dpr_20)
    print('\n')

    print('q_dpr_100 len:', len(q_dpr_100))
    print(q_dpr_100)
    print('\n')

    return f_dpr_20, f_dpr_100
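
To make the ranking step in `retrieve_passages` concrete, here is a toy version in plain Python with no encoders involved: passages are scored by the inner product between the question vector and each passage vector, and the indices of the `top_k` highest scores are returned. The vectors are made-up numbers and `top_k_by_inner_product` is a hypothetical helper used only for illustration.

```python
def top_k_by_inner_product(question_vec, passage_vecs, top_k=2):
    """Return indices of the top_k passages ranked by inner product, best first."""
    scores = [
        sum(q * p for q, p in zip(question_vec, vec))
        for vec in passage_vecs
    ]
    # Sort indices by descending score; ties keep the lower index first.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k]


question = [1.0, 0.0, 1.0]
passages = [
    [0.9, 0.1, 0.8],  # score of about 1.7
    [0.0, 1.0, 0.0],  # score of 0.0
    [0.5, 0.5, 0.5],  # score of 1.0
]
print(top_k_by_inner_product(question, passages, top_k=2))
```

Because the ranking is the same for any cut-off, the top 20 passages are always the first 20 entries of the top 100.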

In [5]:
# Define datasets
datasets = {
    "Natural Questions": "nq_open",
    "TriviaQA": "trivia_qa",
    "WebQuestions": "web_questions",
    "SQuAD": "squad",
    "TREC": "trec"
}

In [6]:
# accuracy scores for each dataset with single and multi
acc_dpr_20_single = {}
acc_dpr_100_single = {}
acc_dpr_20_multi = {}
acc_dpr_100_multi = {}

# Initialise metrics for each dataset
for name in datasets:  # iterate over the dataset names (dict keys)
    acc_dpr_20_single[name] = []
    acc_dpr_100_single[name] = []
    acc_dpr_20_multi[name] = []
    acc_dpr_100_multi[name] = []

In [7]:

tokenizer, question_encoder, context_encoder = load_dpr_models('single')

Some weights of the model checkpoint at facebook/dpr-question_encoder-single-nq-base were not used when initializing DPRQuestionEncoder: ['question_encoder.bert_model.pooler.dense.bias', 'question_encoder.bert_model.pooler.dense.weight']
- This IS expected if you are initializing DPRQuestionEncoder from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DPRQuestionEncoder from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of the model checkpoint at facebook/dpr-ctx_encoder-single-nq-base were not used when initializing DPRContextEncoder: ['ctx_encoder.bert_model.pooler.dense.bias', 'ctx_encoder.bert_model.pooler.dense.weight']
- This IS expected if you are initializing DPRContextEncoder from the

In [8]:

multi_tokenizer, multi_question_encoder, multi_context_encoder = load_dpr_models('multi')

Some weights of the model checkpoint at facebook/dpr-question_encoder-multiset-base were not used when initializing DPRQuestionEncoder: ['question_encoder.bert_model.pooler.dense.bias', 'question_encoder.bert_model.pooler.dense.weight']
- This IS expected if you are initializing DPRQuestionEncoder from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DPRQuestionEncoder from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of the model checkpoint at facebook/dpr-ctx_encoder-multiset-base were not used when initializing DPRContextEncoder: ['ctx_encoder.bert_model.pooler.dense.bias', 'ctx_encoder.bert_model.pooler.dense.weight']
- This IS expected if you are initializing DPRContextEncoder from the c

In [9]:
# load the wiki corpus
passage_dataset = load_from_disk(os.path.join("passages", 'wiki-20210101-10000p'))

In [10]:
passage_dataset[:5]

{'id': ['10577', '10577', '10577', '10577', '10577'],
 'title': ['Finland', 'Finland', 'Finland', 'Finland', 'Finland'],
 'views': [3513, 3513, 3513, 3513, 3513],
 'langs': [305, 305, 305, 305, 305],
 'text': ['Finland ( ; , ), officially the Republic of Finland (, ), is a Nordic country located in Northern Europe. Finland shares land borders with Sweden to the west, Russia to the east, and Norway to the north and is defined by the Gulf of Bothnia to the west and the Gulf of Finland to the south that are part of the Baltic Sea.',
  'Finland has a population of approximately 5.5 million, making it the 25th-most populous country in Europe. The main language is Finnish, a Finnic language of the Uralic language family. Swedish is the second official language of Finland, and is mainly spoken in certain coastal areas of the country and on Åland. Finland is a parliamentary republic consisting of 19 regions and 310 municipalities. The climate varies significantly relative to latitude, from Sou

In [11]:
corpus = passage_dataset['text']

In [15]:
context_embeddings = encode_corpus(corpus, context_encoder, tokenizer)

In [None]:
multi_context_embeddings = encode_corpus(corpus, multi_context_encoder, multi_tokenizer)
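
`encode_corpus` encodes one passage per forward pass, which is simple but slow for 10k passages. One possible speed-up is to feed the tokenizer and encoder several passages at a time. The helper below only shows the batching itself: `batches` and the batch size of 4 are assumptions, not part of the notebook, and the encoder call is left as a comment because it needs the loaded models.

```python
def batches(items, batch_size):
    """Yield consecutive slices of items, each with at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


toy_corpus = [f"passage {i}" for i in range(10)]
for batch in batches(toy_corpus, batch_size=4):
    # With the real models, each batch would go through something like:
    #   inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
    #   with torch.no_grad():
    #       embeddings = context_encoder(**inputs).pooler_output
    print(len(batch), batch[0])
```

The per-batch outputs would then be stacked with `torch.vstack`, as `encode_corpus` already does for single passages.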

# Experiments: Passage Retrieval

## Natural Questions

In [13]:
# load data for Natural Questions
# data schema: https://huggingface.co/datasets/google-research-datasets/nq_open
name = "Natural Questions"
nq_dataset = load_from_disk(os.path.join("datasets", 'nq_open'))
nq_answers = nq_dataset['answer']

In [25]:
nq_dpr_20, nq_dpr_100 = scoreTopKs(name, nq_dataset, nq_answers, corpus, context_embeddings, question_encoder, tokenizer)

In [None]:
print("Single results")
print(nq_dpr_20, nq_dpr_100)

In [32]:
nq_multi_dpr_20, nq_multi_dpr_100 = scoreTopKs(name, nq_dataset, nq_answers, corpus, multi_context_embeddings, multi_question_encoder, multi_tokenizer)

In [None]:
print("Multi results")
print(nq_multi_dpr_20, nq_multi_dpr_100)
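
The two numbers printed for each encoder are top-k retrieval accuracies: the fraction of questions for which at least one of the first k retrieved passages contains an answer. As a sketch of how several cut-offs can be derived from one ranked list per question, the hypothetical helper `top_k_accuracy` below takes, for each question, the 0-based rank of the first passage that contains an answer (or `None` if none does). The ranks in the example are invented for illustration and are not results from this experiment.

```python
def top_k_accuracy(first_hit_ranks, ks=(20, 100)):
    """Map each k to the fraction of questions whose first hit has rank < k."""
    total = len(first_hit_ranks)
    if total == 0:
        raise ValueError("no questions to score")
    return {
        k: sum(1 for r in first_hit_ranks if r is not None and r < k) / total
        for k in ks
    }


# Invented ranks for four questions: hits at rank 3 and rank 45, two misses.
ranks = [3, 45, None, None]
print(top_k_accuracy(ranks))
```

Because a hit within the top 20 is also a hit within the top 100, the top-100 accuracy can never be lower than the top-20 accuracy.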
