# Evaluating the Retriever & End-to-End System
> A review of Information Retrieval and the role it plays in a QA system

- title: "Evaluating the Retriever & End-to-End System"
- toc: true 
- badges: true
- comments: true
- hide: true
- permalink: /hidden/
- search_exclude: false
- categories:


In our last post, [Evaluating QA: Metrics, Predictions, and the Null Response](https://qa.fastforwardlabs.com/no%20answer/null%20threshold/bert/distilbert/exact%20match/f1/robust%20predictions/2020/06/09/Evaluating_BERT_on_SQuAD.html), we took a deep dive into how to assess the quality of a BERT-like Reader for Question Answering using the Hugging Face framework. In this post, we'll focus on the other component of an end-to-end Question Answering system - the Retriever. Specifically, we'll introduce ElasticSearch as a powerful and efficient Information Retrieval (IR) tool that can be used to search through a large corpus and retrieve relevant documents. Throughout the post, we'll explain how to implement and evaluate a Retriever in the context of Question Answering and demonstrate the impact it has on an end-to-end QA system.

### Prerequisites
* a basic understanding of Information Retrieval (IR) & Search
* a basic understanding of IR based QA systems (see previous posts)
* a basic understanding of Transformers and PyTorch
* a basic understanding of the SQuAD2.0 dataset

# Retrieving the right document is important

![](my_icons/michael_scott_quote.jpg "You miss 100% of the shots you don't take")


We believe that what Michael Scott really meant to say is:

> "***You miss 100% of the questions if the answer doesn't appear in the input context***"

In our [last post](), we discussed methods that enable BERT-like models to produce more robust answers by selectively processing predictions and by refraining from answering certain questions at all. While the ability to properly comprehend a passage and produce an answer (or not) is an incredibly important feature of any Reader, the success of the overall system is dependent on first providing the correct passage to read through. Without being fed a context passage that actually contains the answer to a given question, the overall system's performance is limited to how well it can predict no-answer questions. To demonstrate, let's revisit an example from our [first blog post]() where three questions were asked of the Wikipedia search engine based QA system:

```
**Example 1: Incorrect**
Question: When was Barack Obama born?
Top wiki result: <WikipediaPage 'Barack Obama Sr.'>
Answer: 18 June 1936 / February 2 , 1961 / 

**Example 2: Correct**
Question: Why is the sky blue?
Top wiki result: <WikipediaPage 'Diffuse sky radiation'>
Answer: Rayleigh scattering / 

**Example 3: Correct**
Question: How many sides does a pentagon have?
Top wiki result: <WikipediaPage 'The Pentagon'>
Answer: five / 
```

While 2 out of 3 answers are correct for the full system, the Retriever component was only successful in 1 out of 3 questions. Namely, it identified the wrong page for Examples 1 and 3: a page about Barack Obama Sr. instead of the former US President, and an article about "The Pentagon" instead of a page about geometry. In the first case, the Reader had no chance of producing the correct answer because the answer was simply absent from the document. In the latter case, the Reader *was* able to obtain the right answer despite the ambiguous context provided, but who wants to rely on coincidence?
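To make the gap between the two components concrete, here is a minimal sketch that tallies the three examples above. The `retrieved_ok` and `answer_ok` flags are our own hand-labelled transcription of those examples, not the output of any evaluation script.

```python
# Hand-labelled flags for the three examples above (an illustrative transcription).
examples = [
    {"question": "When was Barack Obama born?", "retrieved_ok": False, "answer_ok": False},
    {"question": "Why is the sky blue?", "retrieved_ok": True, "answer_ok": True},
    {"question": "How many sides does a pentagon have?", "retrieved_ok": False, "answer_ok": True},
]


def accuracy(records, key):
    """Fraction of records whose boolean flag `key` is True."""
    return sum(r[key] for r in records) / len(records)


retriever_accuracy = accuracy(examples, "retrieved_ok")
end_to_end_accuracy = accuracy(examples, "answer_ok")
print(f"Retriever: {retriever_accuracy:.2f}, end-to-end: {end_to_end_accuracy:.2f}")
# Retriever: 0.33, end-to-end: 0.67
```

The end-to-end number looks better than the Retriever number only because the third example happened to work out by luck.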

Now that we understand why an effective Retriever is critical for an end-to-end QA system, let's take a deeper look at a classic tool used for Information Retrieval - ElasticSearch.

# ElasticSearch as an IR Tool

![](my_icons/elasticsearch-logo.png "Elastic Search")

Modern QA systems employ a variety of techniques for the task of information retrieval, ranging from traditional sparse vector word matching (e.g. Elastic Search) to [novel approaches](https://arxiv.org/pdf/2004.04906.pdf) using dense representations of encoded passages combined with [efficient search capabilities](https://github.com/facebookresearch/faiss). Despite the flurry of contemporary research efforts in this area, the traditional sparse vector approach performs very well overall and has only recently been overtaken by embedding-based systems for end-to-end QA retrieval tasks. For that reason, we'll explore Elastic Search as a simple, easy-to-use framework for document retrieval. So, what exactly is Elastic Search?

Elastic Search is a powerful open-source search and analytics engine built on the Apache Lucene library that is capable of handling all types of data including textual, numerical, geospatial, structured, and unstructured. It has a robust set of features and a rich ecosystem, is built to scale, and is compatible with many client libraries, making it easy to integrate and use. In the context of information retrieval for automated question answering, we are keenly interested in the features surrounding full-text search. Elastic Search provides a convenient way to index documents so they can easily be queried for nearest neighbor search using a TF-IDF based similarity metric. Specifically, it uses [BM25](https://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-of-lucene-relevation/) term weighting to represent question and context passages as high-dimensional, sparse vectors that are efficiently searched in an inverted index. Let's unpack those ideas a bit.
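To give a feel for BM25 term weighting, here is a minimal sketch of the textbook Okapi BM25 formula with a Lucene-style IDF. It is an illustration under our own assumptions, not Elastic Search's actual implementation, which may differ in details. The function name `bm25_scores` and the toy documents are hypothetical, while `k1=1.2` and `b=0.75` mirror the engine's default parameters.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against a tokenized query (textbook BM25)."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue  # terms absent from the document contribute nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates with k1 and is normalized by document length via b
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical, pre-tokenized toy corpus
docs = [
    "the retriever finds a relevant passage".split(),
    "the reader extracts an answer from a passage".split(),
]
scores = bm25_scores("relevant passage".split(), docs)
print(scores)  # the first document scores higher: it matches both query terms
```

Rare terms (here "relevant") receive a larger IDF weight than terms that appear in every document (here "passage"), which is what lets a sparse term-matching system rank documents rather than merely filter them.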


#### Inverted Index

The purpose of an inverted index is to store text in a structure that allows for efficient and fast full-text searches. An inverted index is essentially just a mapping between unique terms and documents which contain those terms. For example, let's consider the following two documents and a depiction of an inverted index built from them:
1. "Elastic Search is a powerful tool for search!"
2. "Manual search is a slow tool."


|   Term   | Document 1 | Document 2 |
|:--------:|:----------:|:----------:|
|     a    |      x     |      x     |
|  elastic |      x     |            |
|    for   |      x     |            |
|    is    |      x     |      x     |
|  manual  |            |      x     |
| powerful |      x     |            |
|  search  |      x     |      x     |
|   slow   |            |      x     |
|   tool   |      x     |      x     |

The unique set of terms from both documents is contained in the index, and we can easily look up which document contains which terms. Searching this inverted index for the phrase "search tool" would return both documents because each term is present in each document, while searching for the phrase "powerful tool" would return only Document 1. Search itself is quite a bit more complicated than the boolean logic depicted here, as it involves relevance scoring (among other query-dependent logic); however, this oversimplification is intended to demonstrate the quick and powerful nature of the inverted index data structure.
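The table and the two searches above can be reproduced with a few lines of Python. This is a toy sketch of the data structure only: the helper names (`analyze`, `build_inverted_index`, `search_all_terms`) are our own, and the regex-based `analyze` is a crude stand-in for a real analysis pipeline.

```python
import re
from collections import defaultdict


def analyze(text):
    """Lowercase the text and keep runs of letters/digits (drops punctuation)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each unique term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def search_all_terms(index, query):
    """Return ids of documents containing every query term (boolean AND)."""
    postings = [index.get(term, set()) for term in analyze(query)]
    return set.intersection(*postings) if postings else set()


docs = {
    1: "Elastic Search is a powerful tool for search!",
    2: "Manual search is a slow tool.",
}
index = build_inverted_index(docs)

print(sorted(index))
# ['a', 'elastic', 'for', 'is', 'manual', 'powerful', 'search', 'slow', 'tool']
print(search_all_terms(index, "search tool"))    # {1, 2}
print(search_all_terms(index, "powerful tool"))  # {1}
```

Each lookup touches only the postings of the query terms, never the full text of every document, which is why this structure scales to large corpora.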

The inverted index representation is considered *sparse* by construction because for each document, you end up with a vector containing few terms relative to all the terms in the index. While the exact match nature of searches in this data structure is powerful and effective in factoid question answering, it isn't without flaws. The word matching based approach is limited in its ability to take synonyms and semantically related concepts into search consideration. For example, consider the question and context:

> **Question:** "Who is the bad guy in lord of the rings?"\
> **Context:** "Sala Baker is an actor and stuntman from New Zealand. He is best known for portraying the villain Sauron in the Lord of the Rings trilogy..."

A term based system would struggle to retrieve this supporting context passage because it lacks the ability to relate "bad guy" and "villain". Modern document retrieval systems that take advantage of learned, dense representations of text would perform better in this situation.
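A quick way to see the vocabulary mismatch is to compare the content terms of the question and the context. The sketch below uses a small, hypothetical stopword list and a naive regex tokenizer, so treat it as an illustration rather than how a search engine actually tokenizes.

```python
import re

# Hypothetical stopword list, just large enough for this example
STOPWORDS = {"a", "an", "and", "for", "from", "he", "in", "is", "of", "the", "who"}


def content_terms(text):
    """Lowercase, tokenize on letters/digits, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {t for t in tokens if t not in STOPWORDS}


question = "Who is the bad guy in lord of the rings?"
context = ("Sala Baker is an actor and stuntman from New Zealand. He is best known "
           "for portraying the villain Sauron in the Lord of the Rings trilogy...")

shared = content_terms(question) & content_terms(context)
missed = content_terms(question) - content_terms(context)
print(sorted(shared))  # ['lord', 'rings']
print(sorted(missed))  # ['bad', 'guy']
```

The only overlap comes from the film title, which would match many other passages equally well. The terms that carry the actual intent of the question, "bad guy", contribute nothing to the match.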

> Note: In Elastic Search, the values for a document's text fields are first passed through an analysis process before the document is added to an index. Notice how, in the inverted index table above, the tokens are sorted in alphabetical order, symbols have been removed, and all characters have been lowercased. The analysis pipeline is customizable and crucial for improving full-text search results, as we will see later in this post.

## Using ElasticSearch with SQuAD2.0

With this basic understanding of Elastic Search, let's dive in and build our own Document Retrieval system by indexing Wikipedia articles supporting the SQuAD2.0 dataset. Before we get started, we'll need to download and install Elastic Search and download the SQuAD2.0 train set.

```python
# If running locally: run Elasticsearch using Docker
# (assumes Docker is installed and the Docker daemon is running)
!docker run -d -p 9200:9200 -e "discovery.type=single-node" elasticsearch:7.6.2
```

```python
# collapse-hide

# If using Colab: download and unpack the Elasticsearch 7.6.2 release archive
! wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.6.2-linux-x86_64.tar.gz -q
! tar -xzf elasticsearch-7.6.2-linux-x86_64.tar.gz
! chown -R daemon:daemon elasticsearch-7.6.2

import os
from subprocess import Popen, PIPE, STDOUT

# Start the server as the `daemon` user (uid 1), since Elasticsearch refuses to run as root
es_server = Popen(['elasticsearch-7.6.2/bin/elasticsearch'],
                  stdout=PIPE, stderr=STDOUT,
                  preexec_fn=lambda: os.setuid(1))

# Wait until Elasticsearch has started
! sleep 30
```

```python
# collapse-hide

# Download the SQuAD2.0 train set
!wget -P data/squad/ https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
```

```
--2020-06-18 13:18:04--  https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
Resolving rajpurkar.github.io (rajpurkar.github.io)... 185.199.110.153, 185.199.109.153, 185.199.111.153, ...
Connecting to rajpurkar.github.io (rajpurkar.github.io)|185.199.110.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 42123633 (40M) [application/json]
Saving to: ‘data/squad/train-v2.0.json’

2020-06-18 13:18:30 (1.67 MB/s) - ‘data/squad/train-v2.0.json’ saved [42123633/42123633]
```


```python
from elasticsearch import Elasticsearch

# Connect to the local single-node cluster started above
config = {'host': 'localhost', 'port': 9200}
es = Elasticsearch([config])

# Check that the cluster is reachable
es.ping()
```

```
True
```
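With a connection in place, the next step is to turn the SQuAD2.0 file into documents that can be indexed. The sketch below shows one possible way to do that, under several assumptions of ours: one document per article, built by joining that article's paragraphs (which are excerpts, not the full Wikipedia page); a hypothetical index name `squad-articles`; and hypothetical field names `title` and `text`. The tiny `sample_squad` dictionary only mimics the nesting of the real file.

```python
def squad_to_actions(squad, index_name="squad-articles"):
    """Yield one bulk-index action per SQuAD article, joining its paragraphs into one text."""
    for article in squad["data"]:
        text = "\n\n".join(p["context"] for p in article["paragraphs"])
        yield {
            "_index": index_name,          # hypothetical index name
            "_id": article["title"],       # assumes article titles are unique
            "_source": {"title": article["title"], "text": text},
        }


# Made-up stand-in that follows the SQuAD2.0 JSON layout (data -> paragraphs -> context)
sample_squad = {
    "version": "v2.0",
    "data": [
        {
            "title": "Example_Article",
            "paragraphs": [
                {"context": "First paragraph.", "qas": []},
                {"context": "Second paragraph.", "qas": []},
            ],
        }
    ],
}

actions = list(squad_to_actions(sample_squad))
print(actions[0]["_id"], "->", repr(actions[0]["_source"]["text"]))

# With a running cluster and the downloaded file, the actions could be sent in bulk:
# import json
# from elasticsearch import helpers
# with open("data/squad/train-v2.0.json") as f:
#     helpers.bulk(es, squad_to_actions(json.load(f)))
```

Indexing whole articles rather than individual paragraphs is a design choice: it gives the Retriever fewer, longer documents to rank and leaves the Reader to locate the answer within them.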

# Evaluating Retriever Performance

# Improving Search Results with Custom Analyzers & Query Enrichment

ElasticSearch is a powerful open-source search and analytics engine capable of handling all types of data including textual, numerical, geospatial, structured, and unstructured. In the context of information retrieval for automated question answering, we are keenly interested in the features surrounding text search of candidate documents. So how does ElasticSearch handle text data and why is it so powerful?

![ElasticSearch Index Process](my_icons/elastic_index_process.png)


In ElasticSearch, the values for a document's text fields are first analyzed prior to adding the document into an index. This means that when executing a search query, we are actually searching against the post-processed representation that is stored in an inverted index, not the actual input document itself. This processing step is a customizable pipeline that is carried out by a dedicated *Analyzer*.

### Analyzers

An *Analyzer* comprises three components that make up the processing pipeline: *character filters, a tokenizer, and token filters.* Each of these components modifies the input stream of text according to pre-defined settings.
- **Character Filters:** First, the character filter has the ability to add, remove, or replace specific items in the text. A common application of this filter is to strip `html` tags from the raw input. 
- **Tokenizer:** After applying character filters, the transformed text is then passed to a tokenizer that breaks up the input string into individual tokens (usually words) with a provided strategy. By default, the `standard` tokenizer splits tokens whenever it encounters a whitespace character and also splits on most symbols like commas, periods, semicolons, etc.
- **Token Filters:** Finally, the token stream is passed to a token filter which acts on the tokens to add, remove, or modify tokens. Typical token filters include `lowercase`, which converts all tokens to lowercase form, and `stop`, which removes commonly occurring tokens called stopwords.

![ElasticSearch Analyzer Pipeline](my_icons/elastic_analyzer_process.png)
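The three stages can be mimicked in plain Python to see how text changes as it flows through the pipeline. This is only a rough imitation for intuition. The regexes, function names, and stopword list are our own assumptions and do not reproduce the behavior of ElasticSearch's built-in components.

```python
import re

# Hypothetical stopword list for illustration
STOPWORDS = {"a", "an", "and", "for", "is", "of", "the"}


def strip_html(text):
    """Character filter: replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def tokenize(text):
    """Tokenizer: split into runs of word characters (a rough approximation)."""
    return re.findall(r"\w+", text)


def lowercase(tokens):
    """Token filter: convert every token to lowercase."""
    return [t.lower() for t in tokens]


def remove_stopwords(tokens):
    """Token filter: drop commonly occurring tokens."""
    return [t for t in tokens if t not in STOPWORDS]


def analyze(text):
    """Run character filter -> tokenizer -> token filters, in that order."""
    tokens = tokenize(strip_html(text))
    for token_filter in (lowercase, remove_stopwords):
        tokens = token_filter(tokens)
    return tokens


print(analyze("<p>Elastic Search is a <b>powerful</b> tool for search!</p>"))
# ['elastic', 'search', 'powerful', 'tool', 'search']
```

Note that the order of the token filters matters: lowercasing must run before stopword removal here, otherwise a capitalized "The" would slip past the lowercase stopword list.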

ElasticSearch comes with several built-in Analyzers that satisfy common use cases. However, custom Analyzers can also be crafted by combining specific character filters, tokenizers, and token filters to best suit any unique dataset. As explained above, Analyzers are applied at index time to pre-process documents before indexing. In addition, Analyzers can also be applied at search time to process text queries according to the same logic the candidate documents were processed with. Search time analysis can be customized and is only applied to certain query types such as `match` queries. Let's take a closer look at how Analyzers work with ElasticSearch's Analyze API.
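As a starting point, here is a sketch of what a custom Analyzer definition and an Analyze API request body might look like when written as Python dictionaries. The analyzer name `my_custom_analyzer`, the index name `my-index`, and the field name `text` are hypothetical. The component names (`html_strip`, `standard`, `lowercase`, `stop`) are built-ins. The client calls are left as comments because they need a running cluster, and they assume the 7.x Python client used earlier in this post.

```python
import json

# Index settings that define a custom analyzer and attach it to a text field
index_settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "my_custom_analyzer": {              # hypothetical analyzer name
                    "type": "custom",
                    "char_filter": ["html_strip"],   # character filter
                    "tokenizer": "standard",         # tokenizer
                    "filter": ["lowercase", "stop"], # token filters, applied in order
                }
            }
        }
    },
    "mappings": {
        "properties": {
            # hypothetical field name; analyzed with the custom analyzer at index time
            "text": {"type": "text", "analyzer": "my_custom_analyzer"}
        }
    },
}

# Request body for the Analyze API: run the built-in `standard` analyzer on a string
analyze_request = {
    "analyzer": "standard",
    "text": "Elastic Search is a powerful tool for search!",
}

print(json.dumps(analyze_request, indent=2))

# With a running cluster and the `es` client from earlier:
# es.indices.create(index="my-index", body=index_settings)
# es.indices.analyze(body=analyze_request)
```

The Analyze API returns the tokens an analyzer would produce for the given text, which makes it a convenient way to debug a custom pipeline before indexing any documents with it.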



# Impact of Retriever in End-to-End QA System