# How to build an open-domain Question Answering System?


[Original article](https://lilianweng.github.io/posts/2020-10-29-odqa/)

A model that can answer any question with regard to factual knowledge can lead to many useful and practical applications, such as working as a chatbot or an AI assistant. In this notebook, we will review several common approaches for building such an open-domain question answering system.

Disclaimers given so many papers in the wild:

* Assume we have access to a powerful pretrained language model.
* We do not cover how to use a structured knowledge base (e.g., Freebase, WikiData) here.
* We only focus on single-turn QA rather than multi-turn, conversation-style QA.
* We mostly focus on QA models that contain neural networks, especially Transformer-based language models.
* This tutorial is based on a post from 2020, so there are probably many missing papers (from the "future" and from the "past").

## 1 - What is open-domain question answering?

**Open-domain Question Answering (ODQA)** is a type of language task that asks a model to produce answers to factoid questions in natural language. The true answer is objective, so it is simple to evaluate model performance.

For example,

```yaml
Question: What did Albert Einstein win the Nobel Prize for?
Answer: The law of the photoelectric effect.
```

The "open-domain" part refers to the lack of relevant context for any arbitrarily asked factual question. In the above case, the model takes only the question as input; no article such as "why Einstein didn't win a Nobel Prize for the theory of relativity", where the term "the law of the photoelectric effect" is likely mentioned, is provided. When both the question and the context are provided, the task is known as **Reading comprehension (RC)**.

When considering different types of open-domain questions, [Lewis et al. (2020)](https://arxiv.org/abs/2008.02637) provides a classification in order of difficulty:

1. A model is able to correctly memorize and respond with the answer to a question that has been seen at training time
2. A model is able to answer novel questions at test time and choose an answer from the set of answers it has seen during training
3. A model is able to answer novel questions which have answers not contained in the training dataset

<img src="images_odqa/QA-summary.png" title="" alt="" width="750" data-align="center">


### 1.1 - Notation

Given a question $x$ and a ground truth answer span $y$, the context passage containing the true answer is labelled as $z \in \mathcal{Z}$, where $\mathcal{Z}$ is an external knowledge corpus. Wikipedia is a common choice for such an external knowledge source.

### 1.2 - Concerns of QA data fine-tuning

Before we dive into the details of the models below, I would like to point out one concern about fine-tuning a model on common QA datasets, a step that appears in several ODQA models. It is concerning because there is significant overlap between the questions in the train and test sets of several public QA datasets.

[Lewis et al. (2020)](https://arxiv.org/abs/2008.02637) ([code](https://github.com/facebookresearch/QA-Overlap)) found that 58-71% of test-time answers are also present somewhere in the training sets and 28-34% of test-set questions have a near-duplicate paraphrase in their corresponding training sets. In their experiments, several models performed notably worse when duplicated or paraphrased questions were removed from the training set.
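
A rough way to check for this kind of answer overlap on your own data is sketched below. It is only an illustration: the `normalize` helper and the toy answers are assumptions made for this example, not the procedure used in the paper.

```python
import string


def normalize(text):
    """Lowercase, strip punctuation and collapse whitespace (a hypothetical normalization)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())


def answer_overlap(train_answers, test_answers):
    """Fraction of test answers that also occur among the training answers."""
    seen = {normalize(a) for a in train_answers}
    hits = sum(1 for a in test_answers if normalize(a) in seen)
    return hits / len(test_answers)


# Toy data, invented for illustration only.
train = ["Paris", "The photoelectric effect", "1969"]
test = ["paris", "Berlin", "the photoelectric effect.", "1970"]
overlap = answer_overlap(train, test)
```

Detecting paraphrased questions is harder than matching answers and is not attempted here.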

## 2 - Open-book QA: Retriever-reader

Given a factoid question, if a language model has no context or is not big enough to memorize the context that exists in the training dataset, it is unlikely to guess the correct answer. In an open-book exam, students are allowed to refer to external resources like notes and books while answering test questions. Similarly, an ODQA system can be paired with a rich knowledge base to identify relevant documents as evidence of answers.

We can decompose the process of finding answers to given questions into two stages,

1. Find the related context in an external repository of knowledge;
2. Process the retrieved context to extract an answer.

<img src="images_odqa/retriever_reader_framework.png" title="" alt="" width="650" data-align="center">

This architecture was first proposed in DrQA ("Document retriever Question-Answering" by [Chen et al., 2017](https://arxiv.org/abs/1704.00051); [code](https://github.com/facebookresearch/DrQA)). The retriever and the reader components can be set up and trained independently, or jointly trained end-to-end (explained further below).

### 2.1 - Retriever model

An information retrieval (IR) system is usually used for implementing the retriever. Retrievers are usually organized in two groups, sparse and dense:

#### 2.1.1 - Sparse IR

Sparse retrievers use word frequencies to represent each document and query as a sparse vector. The relevance of a document to a query is then determined by computing the inner product of the two vectors. The two most common techniques are the following:

1. The (non-learning) [term frequency-inverse document frequency (TF-IDF)](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) technique is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. The TF-IDF value increases with the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which adjusts for the fact that some words appear more frequently in general. TF-IDF is one of the most popular term-weighting schemes today.
2. The (non-learning) [Best Match 25 (BM25)](https://es.wikipedia.org/wiki/Okapi_BM25) technique is a refinement of TF-IDF that takes into account additional factors affecting the relevance of a document to a given query: the contribution of a repeated term saturates instead of growing without bound, and term frequencies are normalized by the length of the document relative to the average document length in the collection. BM25 has two tunable parameters, $k_1$ (term-frequency saturation) and $b$ (strength of the length normalization). As a result, BM25 often produces better rankings than basic TF-IDF, particularly when document lengths vary widely.
----

**Note:** For an in-depth explanation of document scoring with TF-IDF and BM25 see Chapter 23 of [Speech and Language Processing, 3rd edition, by D. Jurafsky and J.H. Martin (Prentice Hall)](https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf).

----
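
To make the BM25 description above more concrete, here is a minimal sketch of the commonly used Okapi BM25 scoring formula in plain Python. The default values `k1=1.5` and `b=0.75`, the smoothed IDF variant, and the toy documents are assumptions chosen for illustration; production systems such as Anserini or ElasticSearch have their own implementations and defaults.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every tokenized document in `docs` against a tokenized `query`."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF that stays non-negative for very common terms.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Longer-than-average documents are penalized, controlled by b.
            length_norm = 1.0 - b + b * len(d) / avg_len
            score += idf * tf[term] * (k1 + 1.0) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


# Toy corpus, invented for illustration only.
docs = [
    "einstein won the nobel prize for the photoelectric effect".split(),
    "the nobel prize is awarded every year".split(),
    "the cat sat on the couch".split(),
]
scores = bm25_scores("einstein nobel prize".split(), docs)
best = max(range(len(docs)), key=scores.__getitem__)
```

Here `best` is the index of the highest-scoring document; a document that shares no term with the query gets a score of zero.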

**DrQA** ([Chen et al., 2017](https://arxiv.org/abs/1704.00051); [code](https://github.com/facebookresearch/DrQA)) adopts an efficient non-learning search engine based on the vector space model. Every query and document is modelled as a bag-of-words vector, where each term is weighted by TF-IDF:

$$
\begin{eqnarray}
\text{TF-IDF}(t,d,\mathcal{D}) &=& \text{TF}(t,d) \times \text{IDF}(t, \mathcal{D})\\
\text{TF}(t,d) &=& \log(1 + \text{freq}(t,d))\\
\text{IDF}(t, \mathcal{D}) &=& \log\left(\frac{|\mathcal{D}|}{|\{d \in \mathcal{D}: t \in d\}|}\right)
\end{eqnarray}
$$

where $t$ is a unigram or bigram term in a document $d$ from a collection of documents $\mathcal{D}$, and $\text{freq}(t,d)$ measures how many times the term $t$ appears in $d$. Note that the term frequency here includes bigram counts too, which was found to be very helpful because local word order is taken into consideration via bigrams. As part of the implementation, DrQA maps the bigrams to $2^{24}$ bins using an unsigned murmur3 hash.
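
The weighting above can be sketched in a few lines of plain Python. This is a simplified illustration, not DrQA's code: the hashing of bigrams into bins is omitted, sparse vectors are stored as dictionaries, and the helper names (`ngrams`, `build_idf`, `vectorize`, `sparse_dot`) and the toy corpus are assumptions made for this example.

```python
import math
from collections import Counter


def ngrams(tokens):
    """Return the unigrams plus the bigrams of a token list."""
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams


def build_idf(docs):
    """IDF(t, D) = log(|D| / number of documents containing t)."""
    df = Counter(t for d in docs for t in set(ngrams(d)))
    return {t: math.log(len(docs) / n) for t, n in df.items()}


def vectorize(tokens, idf):
    """Sparse TF-IDF vector with TF(t, d) = log(1 + freq(t, d)); unknown terms are dropped."""
    counts = Counter(ngrams(tokens))
    return {t: math.log(1 + f) * idf[t] for t, f in counts.items() if t in idf}


def sparse_dot(u, v):
    """Inner product of two sparse vectors stored as dicts."""
    return sum(w * v[t] for t, w in u.items() if t in v)


# Toy corpus, invented for illustration only.
docs = [
    "the photoelectric effect".split(),
    "the theory of relativity".split(),
    "the nobel prize".split(),
]
idf = build_idf(docs)
doc_vectors = [vectorize(d, idf) for d in docs]
query_vector = vectorize("photoelectric effect".split(), idf)
scores = [sparse_dot(query_vector, v) for v in doc_vectors]
best = max(range(len(docs)), key=scores.__getitem__)
```

A word such as "the" that occurs in every document gets an IDF of zero and therefore never contributes to a score.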

DrQA used Wikipedia as its knowledge source, and this choice has become a default setting for many ODQA studies since then. The non-ML document retriever returns the top $k=5$ most relevant Wikipedia articles given a question.

**BERTserini** ([Yang et al., 2019](https://arxiv.org/abs/1902.01718)) pairs the open-source Anserini IR toolkit as the retriever with a fine-tuned pre-trained BERT model as the reader. The top $k$ documents ($k=10$) are retrieved via Anserini, where the query is treated as a bag of words. The retrieved text segments are ranked by BM25. In terms of the effect of text granularity on performance, they found that paragraph retrieval > sentence retrieval > article retrieval.

<img src="images_odqa/bertserini.png" title="" alt="" width="650" data-align="center">

**Multi-passage BERT** (Wang et al., 2019) uses ElasticSearch with BM25 as the retriever. The authors found that splitting articles into passages with a length of 100 words by *sliding window* (i.e., with some overlap) brings a 4% improvement, since splitting documents into passages without overlap may cause some near-boundary evidence to lose useful context.
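
A sliding-window split can be sketched as follows. The passage size of 100 words comes from the text above; the stride of 50 words is an assumption made for this example, since the text only says that passages overlap.

```python
def sliding_passages(words, size=100, stride=50):
    """Split a word list into passages of `size` words that start every `stride` words."""
    if len(words) <= size:
        return [words]
    passages = []
    for start in range(0, len(words), stride):
        passages.append(words[start:start + size])
        if start + size >= len(words):
            break  # the last window already reaches the end of the article
    return passages


# A toy "article" of 230 placeholder words.
words = [f"w{i}" for i in range(230)]
passages = sliding_passages(words)
```

With these settings, consecutive passages share 50 words, so evidence near a passage boundary appears with context on both sides in at least one passage.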

#### 2.1.2 - Dense IR

There is a long history of learning low-dimensional representations of text, denser than raw term-based vectors ([Deerwester et al., 1990](http://lsa.colorado.edu/papers/JASIS.lsi.90.pdf); [Yih et al., 2011](https://aclanthology.org/W11-0329/)). Dense representations can be learned through matrix decomposition or neural network architectures (e.g., MLP, LSTM, bidirectional LSTM, etc.). When neural networks are involved, such approaches are referred to as "Neural IR". Neural IR is a newer category of methods for retrieval problems, but it is not necessarily superior to classic IR ([Lim et al., 2018](https://sigir.org/wp-content/uploads/2019/01/p040.pdf)).

After the success of many large-scale general language models (e.g., BERT, GPT, T5, etc.), many QA models embrace the following approach:

$$
\begin{eqnarray}
h_{x} &=& E_{x}(x) \\
h_{z} &=& E_{z}(z) \\
\text{score}(x,z) &=& h^{\top}_{x} h_{z}
\end{eqnarray}
$$

1. Extract dense representations of a question $x$ and a context passage $z$ by feeding them into a language model (i.e., $E_{x}$ and $E_{z}$ respectively);
2. Use the dot-product of these two representations as the retrieval score to rank and select most relevant passages.
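
The two steps above can be sketched as follows. The encoders $E_{x}$ and $E_{z}$ are not implemented here; the hard-coded 3-dimensional vectors are invented stand-ins for their outputs, and the `retrieve` helper is a hypothetical name used only for this example.

```python
def dot(u, v):
    """Inner product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def retrieve(h_x, passage_vectors, top_k=2):
    """Rank passages by score(x, z) = h_x^T h_z and return the top_k (index, score) pairs."""
    scored = [(i, dot(h_x, h_z)) for i, h_z in enumerate(passage_vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy embeddings standing in for E_x(x) and E_z(z).
h_x = [0.9, 0.1, 0.0]
passage_vectors = [
    [0.8, 0.2, 0.1],  # passage 0
    [0.0, 0.1, 0.9],  # passage 1
    [0.5, 0.5, 0.0],  # passage 2
]
top = retrieve(h_x, passage_vectors)
```

Scoring every passage like this is only feasible for small collections; for large corpora the exhaustive loop is replaced by a fast maximum inner product search over a prebuilt index.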

**ORQA** ([Lee et al., 2019](https://arxiv.org/abs/1906.00300)), **REALM** ([Guu et al., 2020](https://arxiv.org/abs/2002.08909)), and **DPR** ([Karpukhin et al., 2020](https://arxiv.org/abs/2004.04906)) all use such a scoring function for context retrieval, which will be described in detail in a later section on the end-to-end QA model.

An extreme approach, investigated by **DenSPI** ("Dense-Sparse Phrase Index"; [Seo et al., 2019](https://arxiv.org/abs/1906.05807)), is to encode all the text in the knowledge corpus at the phrase level and then rely only on the retriever to identify the most relevant phrase as the predicted answer. **In this way, the retriever + reader pipeline is reduced to only a retriever**. Of course, **the index is much larger and the retrieval problem is more challenging.**

**DenSPI** introduces a *query-agnostic* indexable representation of document phrases. Specifically, it encodes query-agnostic representations of text spans in Wikipedia offline and looks for the answer at inference time by performing nearest neighbor search. This can drastically speed up inference, because there is no need to re-encode documents for every new query, which is often required by a reader model.

Consider a question $x$ and a fixed set of $K$ (Wikipedia) documents, $z_{1}, \dots, z_{K}$, where each document $z_{k}$ contains $N_{k}$ words, $z_{k} = \langle z_{k}^{(1)}, \dots, z_{k}^{(N_{k})} \rangle$. An ODQA model is a scoring function $F$ for each candidate phrase span $z_{k}^{(i:j)}$, $1 \leq i \leq j \leq N_{k}$, such that the true answer is the phrase with the maximum score: $y = \text{arg max}_{k,i,j} F(x, z_{k}^{(i:j)})$.

The phrase representation $z_{k}^{(i:j)}$ combines both dense and sparse vectors, $z_{k}^{(i:j)} = [d_{k}^{(i:j)}, s_{k}^{(i:j)}] \in \mathbb{R}^{\text{dim}_{d} + \text{dim}_{s}}$ (note that the dimensionality of the dense vector is much smaller than the dimensionality of the sparse vector, i.e., $\text{dim}_{d} \ll \text{dim}_{s}$).

* The dense vector $d_{k}^{(i:j)}$ is effective for encoding local *syntactic* and *semantic* cues, as what can be learned by a pretrained language model.
* The sparse vector $s_{k}^{(i:j)}$ is superior at encoding precise *lexical* information. It is a term-frequency-based encoding: DenSPI uses 2-gram term frequencies (same as DrQA), resulting in a highly sparse representation ($\text{dim}_{s} \approx \text{16M}$).

The dense vector $d_{k}^{(i:j)}$ is further decomposed into three parts, $d_{k}^{(i:j)} = [a_{i}, b_{j}, c_{ij}] \in \mathbb{R}^{2 \ \text{dim}_{b}+1}$ where $2 \ \text{dim}_{b}+1 = \text{dim}_{d}$. All three components are learned based on different columns of the fine-tuned BERT representations.

* A vector $a_{i}$ encodes the *start* position for the $i$-th word of the document;
* A vector $b_{j}$ encodes the *end* position for the $j$-th word of the document;
* A scalar $c_{ij}$ measures the coherency between the start and the end vectors, helping avoid non-constituent phrases during inference.

For all possible $(i,j,k)$ tuples where $j - i < J$, the text span embeddings are precomputed and stored as a *phrase index*. The maximum span length $J$ is a predefined scalar constant.

<img src="images_odqa/DenSPI.png" title="" alt="" width="600" data-align="center">

At inference time, the question is mapped into the same vector space, $x = [d', s'] \in \mathbb{R}^{\text{dim}_{d} + \text{dim}_{s}}$, where the dense vector $d'$ is extracted from the BERT embedding of the special `[CLS]` symbol. The same BERT model is shared for encoding both questions and phrases. The final answer is predicted by $k^{*}, i^{*}, j^{*} = \text{arg max} \ x^{\top} z_{k}^{(i:j)}$.
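
The span enumeration and the arg max lookup can be illustrated with a toy phrase index. This is a sketch under strong simplifications: indices are 0-based, the index is a small dictionary searched exhaustively rather than with nearest neighbor search, and all vectors are invented values with a 2-dimensional "dense" part followed by a 3-dimensional "sparse" part.

```python
def enumerate_spans(doc_lengths, max_len):
    """Yield every (k, i, j) with 0 <= i <= j < N_k and j - i < max_len (0-based)."""
    for k, n_words in enumerate(doc_lengths):
        for i in range(n_words):
            for j in range(i, min(i + max_len, n_words)):
                yield k, i, j


def dot(u, v):
    """Inner product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def best_phrase(query_vec, phrase_index):
    """Return the (k, i, j) key whose stored vector has the largest inner product with the query."""
    return max(phrase_index, key=lambda key: dot(query_vec, phrase_index[key]))


# All candidate spans for two documents of 3 and 2 words with J = 2.
spans = list(enumerate_spans([3, 2], max_len=2))

# Toy index holding only a few of those spans: [dense part | sparse part].
phrase_index = {
    (0, 0, 1): [0.2, 0.1, 0.0, 1.0, 0.0],
    (0, 1, 2): [0.9, 0.3, 1.0, 0.0, 0.0],
    (1, 0, 0): [0.1, 0.8, 0.0, 0.0, 1.0],
}
query_vec = [1.0, 0.2, 1.0, 0.0, 0.0]
answer = best_phrase(query_vec, phrase_index)
```

Because the phrase and query vectors are concatenations, the inner product is simply the dense score plus the sparse score.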


### 2.2 - Reader model

The reader model learns to solve the reading comprehension task (i.e., extract an answer for a given question from a given context document). Here we only discuss approaches for machine comprehension using neural networks.

#### 2.2.1 - Bi-directional LSTM

The reader model for answer detection of **DrQA** ([Chen et al., 2017](https://arxiv.org/abs/1704.00051); [code](https://github.com/facebookresearch/DrQA)) is a 3-layer bidirectional LSTM with hidden size of 128. Every relevant paragraph of retrieved Wikipedia articles is encoded by a sequence of feature vectors $\{\hat{\mathbf{z}}_{1},\dots,\hat{\mathbf{z}}_{m}\}$. Each feature vector $\hat{\mathbf{z}}_{i} \in \mathbb{R}^{\text{dim}_z}$ is expected to capture useful contextual information around one token $z_{i}$. It combines several categories of features:

1. **Word embeddings:** A 300d GloVe word embedding trained from 840B Web Crawl data, $f_{\text{embed}} = E_{g}(z_{i})$.

2. **Exact match:** Whether a word appears in the question $x$, $f_{\text{match}} = \mathbb{I}(z_{i} \in x)$.

3. **Token features:** This includes POS (part-of-speech) tagging, NER (named entity recognition) and TF (term-frequency), $f_{\text{token}}(z_{i}) = \text{POS}(z_{i}), \text{NER}(z_{i}), \text{TF}(z_{i})$.

4. **Aligned question embedding:** The attention score $y_{ij}$ is designed to capture inter-sentence matching and similarity between the paragraph token $z_{i}$ and the question word $x_{j}$. This feature adds soft alignments between similar but non-identical words.

$$
\begin{eqnarray}
f_{\text{align}}(z_{i}) &=& \sum_{j}y_{ij}E_{g}(x_{j}) \\
y_{ij} &=& \frac{\text{exp}(\alpha(E_{g}(z_{i}))^{\top} \alpha(E_{g}(x_{j})))}{\sum_{j'}\text{exp}(\alpha(E_{g}(z_{i}))^{\top} \alpha(E_{g}(x_{j'})))}
\end{eqnarray}
$$

where $\alpha$ is a single dense layer with ReLU and $E_{g}(\cdot)$ is the GloVe word embedding.
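
As a small illustration of the aligned question embedding, the sketch below computes the attention weights $y_{ij}$ and the resulting $f_{\text{align}}(z_{i})$ for one paragraph token. The 2-dimensional vectors stand in for GloVe embeddings and the identity weight matrix stands in for the learned dense layer; both are assumptions made for this example.

```python
import math


def relu_dense(vec, weights):
    """alpha(.): a single dense layer followed by ReLU (no bias, for brevity)."""
    return [max(0.0, sum(w * v for w, v in zip(row, vec))) for row in weights]


def aligned_embedding(z_i, question, weights):
    """Return (y, f_align): softmax attention over question words and their weighted sum."""
    a_z = relu_dense(z_i, weights)
    logits = [
        sum(p * q for p, q in zip(a_z, relu_dense(x_j, weights))) for x_j in question
    ]
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    y = [e / total for e in exps]
    dim = len(question[0])
    f_align = [sum(y[j] * question[j][d] for j in range(len(question))) for d in range(dim)]
    return y, f_align


# Toy embeddings: the paragraph token matches the first question word exactly.
weights = [[1.0, 0.0], [0.0, 1.0]]
z_i = [1.0, 0.0]
question = [[1.0, 0.0], [0.0, 1.0]]
y, f_align = aligned_embedding(z_i, question, weights)
```

The question word that is most similar to the paragraph token receives the largest attention weight, so `f_align` leans towards its embedding.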

The feature vectors of a paragraph of $m$ tokens are fed into the LSTM to obtain the final paragraph vectors:

$$
\begin{eqnarray}
\mathbf{z} = \{\mathbf{z}_{1}, \dots, \mathbf{z}_{m}\} &=& \text{LSTM}(\{\hat{\mathbf{z}}_{1}, \dots, \hat{\mathbf{z}}_{m}\})
\end{eqnarray}
$$

where $\hat{\mathbf{z}}_{i} = \{f_{\text{embed}}, f_{\text{match}}, f_{\text{token}}, f_{\text{align}}\}$. The question is encoded as a weighted sum of the embeddings of every word in the question:

$$
\mathbf{x} = \sum_{j} b_{j} E(x_{j}), \quad b_{j} = \text{softmax}(\mathbf{w}^{\top}E(x_{j}))
$$

where $\mathbf{w}$ is a weight vector to learn.

Once the feature vectors are constructed for the question and all the related paragraphs, the reader needs to predict the probability of each position in a paragraph being the start and the end of an answer span, $p_{\text{start}}(i_{s})$ and $p_{\text{end}}(i_{e})$, respectively. Across all the paragraphs, the span with the maximum $p_{\text{start}}(i_{s}) \times p_{\text{end}}(i_{e})$ is returned as the final answer.

$$
\begin{eqnarray}
p_{\text{start}}(i_{s}) &\propto& \exp(\mathbf{z}_{i_{s}} \mathbf{W}_{s} \mathbf{x}) \\
p_{\text{end}}(i_{e}) &\propto& \exp(\mathbf{z}_{i_{e}} \mathbf{W}_{e} \mathbf{x}) \\
\text{s.t.} && i_{s} \leq i_{e} \leq i_{s} + 15
\end{eqnarray}
$$

where $\mathbf{W}_{s}$ and $\mathbf{W}_{e}$ are learned parameters.
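
Given the start and end probabilities for one paragraph, the constrained search for the best span can be sketched as below. The probability values are invented for illustration, and the `best_span` helper is a hypothetical name, not part of DrQA's code.

```python
def best_span(p_start, p_end, max_len=15):
    """Return (i_s, i_e, score) maximizing p_start[i_s] * p_end[i_e]
    subject to i_s <= i_e <= i_s + max_len."""
    best = (0, 0, -1.0)
    for i_s, ps in enumerate(p_start):
        last = min(i_s + max_len, len(p_end) - 1)
        for i_e in range(i_s, last + 1):
            score = ps * p_end[i_e]
            if score > best[2]:
                best = (i_s, i_e, score)
    return best


# Toy probabilities for a 4-token paragraph.
p_start = [0.1, 0.6, 0.2, 0.1]
p_end = [0.5, 0.1, 0.3, 0.1]
span = best_span(p_start, p_end)
```

Note that the most likely end position in this toy example is token 0, but it cannot be paired with the most likely start position (token 1) because the end may not precede the start.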

#### 2.2.2 - BERT universe

Following the success of BERT, many QA models build the machine comprehension component on BERT. Let's define the BERT model as a function that can take one or multiple strings (concatenated by `[SEP]`) as input and outputs a set of BERT encoding vectors for the special `[CLS]` token and every input token:

$$
\text{BERT}(s_{1}, s_{2}, \dots) = [h^{[CLS]}, h^{(1)}, h^{(2)}, \dots]
$$

where $h^{[CLS]}$ is the embedding for the special `[CLS]` token and $h^{(i)}$ is the embedding vector for the $i$-th token.

## 3 - Open-book QA: Retriever-generator

## 4 - Closed-book QA: Generative Language Model

Big language models have been pre-trained on large collections of unlabelled text. Given enough parameters, these models are able to memorize some factual knowledge within their parameter weights. Therefore, we can use these models to do QA without explicit context, just like in a closed-book exam. The pre-trained language models produce *free text* in response to questions, without any explicit reading comprehension step.

[Roberts et al. (2020)](https://arxiv.org/pdf/2002.08910.pdf) measured the practical utility of a language model by fine-tuning a pre-trained T5 model to answer questions without access to any external context or knowledge. This setup forces the language model to answer questions based on the "knowledge" that it internalized during pre-training.

<img src="images_odqa/t5_qa.png" title="" alt="" width="600" data-align="center">

The original T5 models were pre-trained on a multi-task mixture that includes an unsupervised masked language modeling (MLM) task on the C4 (Colossal Clean Crawled Corpus) dataset together with supervised translation, summarization, classification, and reading comprehension tasks. [Roberts et al. (2020)](https://arxiv.org/pdf/2002.08910.pdf) took a pre-trained T5 model and continued pre-training with salient span masking over a Wikipedia corpus, which was found to substantially boost the performance for ODQA. They then fine-tuned the model for each QA dataset independently.

With a pre-trained T5 language model + continued pre-training with salient span masking + fine-tuning for each QA dataset:

* It can attain competitive results in open-domain question answering without access to external knowledge.
* A larger model can obtain better performance. For example, a T5 with 11B parameters is able to match the performance of the dense passage retriever (DPR) with 3 BERT-base models, each with 330M parameters (**Note the difference in parameter size though, 11B vs 3 x 330M**).

Interestingly, fine-tuning is not strictly necessary. GPT-3 (Brown et al., 2020) has been evaluated on the closed-book question answering task without any gradient updates or fine-tuning. During evaluation, the few-shot, one-shot and zero-shot settings only refer to how many demonstrations are provided as context in the text input:

1. "few-shot learning": GPT-3 is allowed to take as many demonstrations as can fit into the model's context window (typically 10 to 100).
2. "one-shot learning": only one demonstration is provided.
3. "zero-shot learning": no demonstrations are allowed and only an instruction in natural language is given to the model.
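
The difference between the three settings is only in how the text input is assembled, which the sketch below illustrates. The `Q:` / `A:` layout, the instruction string, and the demonstration are assumptions made for this example and are not the exact prompt format used in the GPT-3 evaluation.

```python
def build_prompt(question, demonstrations=(), instruction="Answer the question."):
    """Build a closed-book QA prompt; the number of demonstrations sets the 'shot' count."""
    lines = [instruction, ""]
    for demo_question, demo_answer in demonstrations:
        lines += [f"Q: {demo_question}", f"A: {demo_answer}", ""]
    lines += [f"Q: {question}", "A:"]
    return "\n".join(lines)


# One invented demonstration; a few-shot prompt would simply pass more pairs.
demos = [("Who wrote Hamlet?", "William Shakespeare")]
question = "What did Albert Einstein win the Nobel Prize for?"
zero_shot = build_prompt(question)
one_shot = build_prompt(question, demos)
```

In every setting the model weights stay fixed; the demonstrations are just extra text in the context window.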

The performance grows with the model size. On the TriviaQA dataset, GPT-3 evaluated with demonstrations can match or exceed the performance of the fine-tuned SOTA baseline.

<img src="images_odqa/gpt3_performance_qa.png" title="" alt="" width="500" data-align="center">

## 5 - Related techniques

### 5.1 - Fast maximum inner product search (MIPS)

MIPS (maximum inner product search) is a crucial component in many open-domain question answering models. In the retriever + reader/generator framework, a large number of passages from the knowledge source are encoded and stored in memory. A retrieval model can then query this memory to identify the top relevant passages, i.e., those with the maximum inner product with the question's embedding.

We need fast MIPS because the number of precomputed passage representations can be gigantic. There are several ways to achieve fast MIPS at run time:

----

**Note:** the first two summaries were provided by [ChatGPT](https://chat.openai.com/chat)

----

#### 5.1.1 - Asymmetric Locality Sensitive Hashing (ALSH)

[Asymmetric Locality Sensitive Hashing (ALSH)](https://proceedings.neurips.cc/paper/2014/file/310ce61c90f3a46e340ee8257bc70e93-Paper.pdf) is a variant of Locality Sensitive Hashing (LSH) designed for approximate maximum inner product search in high-dimensional spaces. It is called "asymmetric" because it treats the query point and the points in the database differently, applying different transformations (and therefore different hash functions) to each.

LSH is a technique for efficiently finding approximate nearest neighbors in large datasets by using hash functions that map nearby points to the same bucket with high probability. This allows the search to be restricted to the buckets that the query falls into, which is typically much faster than comparing the query with every point. However, standard LSH families are built for measures such as L2 distance or cosine similarity. The inner product is not a distance (for example, a vector is not necessarily its own best match under the inner product), so these families do not directly solve MIPS.

ALSH addresses this issue by transforming the query point and the points in the database differently, so that finding the maximum inner product in the original space becomes an approximate nearest neighbor search in the transformed space, where standard LSH can be applied.
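
To make the asymmetry concrete, the sketch below applies one pair of transformations, $P$ to the database vectors and $Q$ to the query, chosen so that the squared L2 distance between $Q(q)$ and $P(x)$ shrinks as the inner product $q^{\top}x$ grows. It follows my reading of the construction in the linked paper and should be treated as an illustration: it assumes a unit-norm query and database vectors already scaled to norm below 1, uses toy data, and omits the hashing step entirely.

```python
import math


def norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(a * a for a in v))


def transform_data(x, m=3):
    """P(x) = [x; ||x||^2; ||x||^4; ...; ||x||^(2^m)], assuming ||x|| < 1."""
    n = norm(x)
    return list(x) + [n ** (2 ** i) for i in range(1, m + 1)]


def transform_query(q, m=3):
    """Q(q) = [q; 1/2; ...; 1/2], assuming ||q|| = 1."""
    return list(q) + [0.5] * m


def sq_dist(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


data = [[0.8, 0.0], [0.1, 0.1], [0.0, 0.5]]  # toy vectors with norm < 1
query = [1.0, 0.0]                            # unit-norm query
inner = [sum(a * b for a, b in zip(query, x)) for x in data]
dists = [sq_dist(transform_query(query), transform_data(x)) for x in data]
mips_best = max(range(len(data)), key=inner.__getitem__)
nn_best = min(range(len(data)), key=dists.__getitem__)
```

With these transformations the squared distance equals $1 + m/4 - 2 q^{\top}x + \|x\|^{2^{m+1}}$, and the last term vanishes quickly as $m$ grows, so the nearest neighbor in the transformed space approximately coincides with the MIPS answer.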

ALSH is often used in applications such as information retrieval, recommendation systems, and image retrieval, where it is important to quickly find the nearest neighbors to a given query point. It is also used in machine learning algorithms that rely on nearest neighbor search, such as k-means clustering and kernel methods.

#### 5.1.2 - Data-dependent hashing

[Data-dependent hashing](https://arxiv.org/abs/1501.01062) is a method for constructing hash functions that are tailored to the specific characteristics of the data being hashed. It is a type of Locality Sensitive Hashing (LSH) that is used to efficiently find approximate nearest neighbors in large datasets.

In traditional LSH, the hash functions are drawn from a family that is chosen independently of the data. The family is typically selected for certain desirable properties, such as being easy to compute and making nearby points collide with high probability.

In contrast, data-dependent hashing constructs the hash functions based on the characteristics of the data itself. This can be done by using techniques such as neural networks or decision trees to learn the hash functions from the data. Data-dependent hashing can be more effective than traditional LSH for finding nearest neighbors because the hash functions are specifically designed to capture the structure of the data.

Data-dependent hashing is often used in applications such as information retrieval, recommendation systems, and image retrieval, where it is important to quickly find the nearest neighbors to a given query point. It is also used in machine learning algorithms that rely on nearest neighbor search, such as k-means clustering and kernel methods.

#### 5.1.3 - FAISS

[FAISS](https://github.com/facebookresearch/faiss) is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning. Faiss is written in C++ with complete wrappers for Python/numpy. Some of the most useful algorithms are implemented on the GPU. It is developed primarily at Meta's Fundamental AI Research group.

FAISS contains several methods for similarity search. It assumes that the instances are represented as vectors and are identified by an integer, and that the vectors can be compared with L2 (Euclidean) distances or dot products. Vectors that are similar to a query vector are those that have the lowest L2 distance or the highest dot product with the query vector. It also supports cosine similarity, since this is a dot product on normalized vectors.

Some of the methods, like those based on binary vectors and compact quantization codes, solely use a compressed representation of the vectors and do not require keeping the original vectors. This generally comes at the cost of a less precise search, but these methods can scale to billions of vectors in main memory on a single server. Other methods, like HNSW and NSG, add an indexing structure on top of the raw vectors to make searching more efficient.

The GPU implementation can accept input from either CPU or GPU memory. On a server with GPUs, the GPU indexes can be used as a drop-in replacement for the CPU indexes (e.g., replace IndexFlatL2 with GpuIndexFlatL2) and copies to/from GPU memory are handled automatically. Results will be faster, however, if both input and output remain resident on the GPU. Both single and multi-GPU usage is supported.

### 5.2 - Language model pre-training

Two pre-training tasks are especially helpful for QA tasks, as we have discussed above:


#### 5.2.1 - Inverse cloze task

The goal of the [Cloze Task](https://en.wikipedia.org/wiki/Cloze_test) is to predict masked-out text based on its context. The Inverse Cloze Task ([Lee et al., 2019](https://arxiv.org/abs/1906.00300)) goes in the reverse direction, aiming to predict the context given a sentence. For example, given the sentence "The cat sat on the couch", the task is to identify, among many candidate passages, the passage that the sentence was taken from. **In the context of QA tasks, a random sentence can be treated as a pseudo-question, and its context can be treated as pseudo-evidence**.
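
Building one such training pair can be sketched as follows. This is a simplified illustration: it assumes the passage is already split into sentences, always removes the chosen sentence from its context, and uses invented toy sentences; the `make_ict_example` helper is a hypothetical name.

```python
import random


def make_ict_example(sentences, rng):
    """Pick one sentence as the pseudo-question; the remaining sentences form the pseudo-evidence."""
    idx = rng.randrange(len(sentences))
    pseudo_question = sentences[idx]
    pseudo_evidence = sentences[:idx] + sentences[idx + 1:]
    return pseudo_question, pseudo_evidence


# Toy passage, invented for illustration only.
passage = [
    "The cat sat on the couch.",
    "It purred for a while.",
    "Then it fell asleep.",
]
question, evidence = make_ict_example(passage, random.Random(0))
```

During pre-training, the retriever is then asked to pick the correct pseudo-evidence for each pseudo-question among the contexts of other examples.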

#### 5.2.2 - Salient span masking

Salient span masking ([Guu et al., 2020](https://arxiv.org/abs/2002.08909)) is a special case of the MLM task in language model training. First, we find *salient spans* by using a tagger to identify named entities and a regular expression to identify dates. Then one of the detected salient spans is selected and masked. The task is to predict this masked salient span.
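
A minimal sketch of the masking step is shown below. It only covers dates: the regular expression is a hypothetical pattern written for this example, the named-entity tagger is left out, and the `[MASK]` token is an assumed placeholder.

```python
import re

# Hypothetical date pattern covering forms like "14 March 1879" or "1921".
# A real system would also use a named-entity tagger, which is omitted here.
DATE_PATTERN = re.compile(
    r"\b(?:\d{1,2} )?(?:(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December) )?\d{4}\b"
)


def mask_salient_span(text, span_index=0, mask_token="[MASK]"):
    """Mask one detected date span and return (masked_text, target)."""
    spans = list(DATE_PATTERN.finditer(text))
    if not spans:
        return text, None
    match = spans[span_index]
    masked = text[:match.start()] + mask_token + text[match.end():]
    return masked, match.group()


text = "Albert Einstein was born on 14 March 1879 and received the Nobel Prize in 1921."
masked, target = mask_salient_span(text)
```

The model is then trained to recover `target` from `masked`, which requires factual knowledge rather than only local syntactic cues.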

## 6 - Summary

|        Model       |            Retriever           |          Reader / Generator         |            Pre-training / Fine-tuning            | End2end |
|:------------------:|:------------------------------:|:-----------------------------------:|:------------------------------------------------:|:-------:|
| DrQA               | TF-IDF                         | Bi-directional LSTM                 | –                                                | No      |
| BERTserini         | Anserini + BM25                | BERT without softmax layer          | Fine-tune with SQuAD                             | No      |
| Multi-passage BERT | ElasticSearch + BM25           | Multi-passage BERT + Passage ranker |                                                  | No      |
| R^3                | Classic IR + Match-LSTM        | Match-LSTM                          |                                                  | Yes     |
| ORQA               | Dot product of BERT embeddings | BERT-RC                             | Inverse cloze task                               | Yes     |
| REALM              | Dot product of BERT embeddings | BERT-RC                             | Salient span masking                             | Yes     |
| DPR                | Dot product of BERT embeddings | BERT-RC                             | supervised training with QA pairs                | Yes     |
| DenSPI             | Classic + Neural IR            | –                                   |                                                  | Yes     |
| T5 + SSM           | –                              | T5                                  | SSM on CommonCrawl data + Fine-tuning on QA data | Yes     |
| GPT3               | –                              | GPT3                                | NSP on CommonCrawl data                          | Yes     |
| RAG                | DPR retriever                  | BART                                |                                                  | Yes     |
| Fusion-in-Decoder  | BM25 / DPR retriever           | Transformer                         |                                                  | No      |


<img src="images_odqa/summary.png" title="" alt="" width="600" data-align="center">