# Evaluating the Retriever & End-to-End System
> A review of Information Retrieval and the role it plays in a QA system

- title: "Evaluating the Retriever & End-to-End System"
- toc: true 
- badges: true
- comments: true
- hide: true
- permalink: /hidden/
- search_exclude: false
- categories:

![]()

In our last post, [Evaluating QA: Metrics, Predictions, and the Null Response](https://qa.fastforwardlabs.com/no%20answer/null%20threshold/bert/distilbert/exact%20match/f1/robust%20predictions/2020/06/09/Evaluating_BERT_on_SQuAD.html), we took a deep dive into how to assess the quality of a BERT-like Reader for Question Answering (QA) using the Hugging Face framework. In this post, we'll focus on the first component of an end-to-end QA system: the Retriever. Specifically, we'll introduce Elasticsearch as a powerful and efficient Information Retrieval (IR) tool that can be used to scour large corpora and retrieve relevant documents. Throughout the post, we'll explain how to implement and evaluate a Retriever in the context of Question Answering and demonstrate the impact it has on an end-to-end QA system.

### Prerequisites
* a basic understanding of Information Retrieval (IR) & Search
* a basic understanding of IR based QA systems (see previous posts)
* a basic understanding of Transformers and PyTorch
* a basic understanding of the SQuAD2.0 dataset

# Retrieving the right document is important


![](my_icons/michael_scott_quote.jpg "You miss 100% of the shots you don't take")


We believe that what Michael Scott really meant to say is:

> "***You miss 100% of the questions if the answer doesn't appear in the input context***"

[Andrew] - Find a better intro ^^^


As we have discussed throughout this blog series, many modern QA systems take a two-stage approach to answering questions. In the first stage, a document retriever selects $N$ potentially relevant documents from a given corpus. Subsequently, a machine comprehension model processes each of the $N$ documents to determine an answer to the input question. Because of recent advances in NLP and deep learning (e.g., flashy Transformer models), the machine comprehension component of question answering has typically been the main focus of evaluation for these systems. Stage one has received limited attention despite its obvious importance: performance at stage two is bounded by performance at stage one. Let's get more specific.

We [recently explained methods]() that enable BERT-like models to produce robust answers given a question and context passage by selectively processing predictions and by refraining from answering certain questions at all. While the ability to properly comprehend a passage and produce a correct answer is a very important feature of any QA tool, the success of the overall system is highly dependent on first providing a correct passage to read through. Without being fed a context passage that actually contains the ground-truth answer to a given question, the overall system's performance is limited to how well it can predict no-answer questions. To demonstrate, we'll revisit an example from our [first blog post]() where three questions were asked of the Wikipedia search engine based QA system:

```
**Example 1: Incorrect**
Question: When was Barack Obama born?
Top wiki result: <WikipediaPage 'Barack Obama Sr.'>
Answer: 18 June 1936 / February 2 , 1961 / 

**Example 2: Correct**
Question: Why is the sky blue?
Top wiki result: <WikipediaPage 'Diffuse sky radiation'>
Answer: Rayleigh scattering / 

**Example 3: Correct**
Question: How many sides does a pentagon have?
Top wiki result: <WikipediaPage 'The Pentagon'>
Answer: five / 
```

In Example 1, the Reader had no chance of producing the correct answer because of its outright absence from the context article served up by the Retriever. Namely, the Retriever erroneously provided a page about Barack Obama Sr. instead of his son, the former US President. In this case, the only way the Reader could have possibly produced the correct answer was if the correct answer was actually not to answer at all. On the flip side, in Example 3, the Retriever did not identify the globally "correct" document - it returned an article about "The Pentagon" instead of a page about geometry - but nonetheless, it provided enough context for the Reader to succeed.

These quick examples illustrate why an effective Retriever is critical for an end-to-end QA system. Now let's take a deeper look at a classic tool used for information retrieval: Elasticsearch.

# Elasticsearch as an IR Tool

![](my_icons/elasticsearch-logo.png "Elasticsearch")

Modern QA systems employ a variety of techniques for the task of information retrieval, ranging from traditional sparse vector word matching (e.g., Elasticsearch) to [novel approaches](https://arxiv.org/pdf/2004.04906.pdf) using dense representations of encoded passages combined with [efficient search capabilities](https://github.com/facebookresearch/faiss). Despite the flurry of contemporary research efforts in this area, the traditional sparse vector approach performs very well overall and has only recently been overtaken by embedding-based systems for end-to-end QA retrieval tasks. For that reason, we'll explore Elasticsearch as a simple and easy-to-use framework for document retrieval. So, what exactly is Elasticsearch?

Elasticsearch is a powerful open-source search and analytics engine built on the [Apache Lucene](https://lucene.apache.org/) library that is capable of handling all types of data including textual, numerical, geospatial, structured, and unstructured. It is built to scale, with a robust set of features, a rich ecosystem, and a diverse set of client libraries that make it easy to integrate and use. In the context of information retrieval for automated question answering, we are keenly interested in the features surrounding full-text search. Elasticsearch provides a convenient way to index documents so they can easily be queried for nearest neighbor search using a TF-IDF based similarity metric. Specifically, it uses [BM25](https://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-of-lucene-relevation/) term weighting to represent question and context passages as high-dimensional, sparse vectors that are efficiently searched in an inverted index. Let's unpack those ideas a bit.


#### Inverted Index

The purpose of an inverted index is to store text in a data structure that allows for efficient and fast full-text searches. An inverted index is essentially just a mapping between unique terms and documents which contain those terms. For example, let's consider the following two documents and a depiction of an inverted index built from them:
1. "Elasticsearch is a powerful technique for search!"
2. "Manual search is a slow technique."

|      Term     | Document 1 | Document 2 |
|:-------------:|:----------:|:----------:|
|       a       |      x     |      x     |
| elasticsearch |      x     |            |
|      for      |      x     |            |
|       is      |      x     |      x     |
|     manual    |            |      x     |
|    powerful   |      x     |            |
|     search    |      x     |      x     |
|      slow     |            |      x     |
|   technique   |      x     |      x     |

Notice that the unique set of terms from both documents is contained in the index, and we can easily look up which documents contain which terms. Searching this inverted index for the phrase "search technique" would return both documents because both terms are present in each document, while searching for the phrase "powerful technique" would return only Document 1. Search itself is quite a bit more complicated than the boolean logic depicted here, as it involves relevance scoring (among other query-dependent logic). However, this oversimplification is intended to demonstrate the quick and powerful nature of the inverted index data structure.

> Note: In the example above, all tokens have been lowercased and punctuation removed. This happens as part of an important preprocessing pipeline that we'll explain in more detail later in the post.
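
To make the table above concrete, here is a minimal sketch of an inverted index in plain Python. It is only an illustration of the data structure, not how Lucene stores its index; the helper names (`tokenize`, `build_inverted_index`, `boolean_and_search`) are made up for this example, and the regex tokenizer is a crude stand-in for a real analyzer.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and keep only alphanumeric runs (crude analyzer stand-in)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each unique term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def boolean_and_search(index, query):
    """Return the ids of documents that contain every term in the query."""
    postings = [index.get(term, set()) for term in tokenize(query)]
    return set.intersection(*postings) if postings else set()


docs = {
    1: "Elasticsearch is a powerful technique for search!",
    2: "Manual search is a slow technique.",
}
index = build_inverted_index(docs)

print(sorted(boolean_and_search(index, "search technique")))    # [1, 2]
print(sorted(boolean_and_search(index, "powerful technique")))  # [1]
```

Each lookup touches only the postings for the query terms, which is why this structure stays fast as the corpus grows.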

The inverted index representation is considered *sparse* by construction because for each document, you end up with a vector containing few terms relative to all terms in the index. This indexing process is what allows Elasticsearch to search large collections of text documents orders of magnitude faster than traditional SQL databases. While the exact match nature of search in this data structure is powerful and effective, it isn't without flaws. The word matching approach is limited in its ability to take semantically related concepts into search consideration. For example, consider the following question and context:

> **Question:** "Who is the bad guy in lord of the rings?"\
> **Context:** "Sala Baker is an actor and stuntman from New Zealand. He is best known for portraying the villain Sauron in the Lord of the Rings trilogy..."

An exact-match based system like Elasticsearch would struggle to retrieve this supporting context passage because it lacks the ability to relate the concepts of "bad guy" and "villain". Modern document retrieval systems that take advantage of learned, dense representations of text would perform better in this situation. (This example is taken from the [DPR paper](https://arxiv.org/pdf/2004.04906.pdf).)
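
To see why word matching misses this passage, it helps to sketch the relevance scoring itself. The block below is a rough, textbook-style BM25 scorer written for illustration only. Lucene's actual implementation differs in its details, and the function names are hypothetical; the defaults `k1=1.2` and `b=0.75` are the commonly used BM25 defaults.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep only alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document against the query with a textbook BM25 formula."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # terms absent from the document contribute nothing
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


context = ("Sala Baker is an actor and stuntman from New Zealand. He is best known "
           "for portraying the villain Sauron in the Lord of the Rings trilogy")
other = "Manual search is a slow technique."

print(bm25_scores("bad guy", [context, other]))   # [0.0, 0.0]: no shared terms
print(bm25_scores("villain", [context, other]))   # only the first score is non-zero
```

Because "bad" and "guy" never occur in the context passage, those two query terms add nothing to its score, no matter how closely related the concepts are.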

## Using Elasticsearch with SQuAD2.0

With this basic understanding of how Elasticsearch works, let's dive in and build our own Document Retrieval system by indexing a set of Wikipedia articles that support questions and answers in the SQuAD2.0 dataset. Before we get started, we'll need to download and prepare data from the SQuAD2.0 train set.

### Download and Prepare SQuAD2.0

In [None]:
# collapse-hide

# Download the SQuAD2.0 train set
!wget -P data/squad/ https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
    
import json

The following `parse_qa_records` function will extract question/answer examples, as well as full article content, from the train set. The full article content will serve as the corpus of documents that our Elasticsearch Retriever will search over. In practice, open-domain QA systems sit atop massive collections of documents (think all of Wikipedia) that provide the breadth of information needed to answer general-knowledge questions. For the purposes of demonstrating Elasticsearch functionality, we will limit our corpus to only the Wikipedia articles supporting SQuAD2.0 train questions.

In [143]:
def parse_qa_records(data):
    '''
    Loop through SQuAD2.0 dataset and parse out question/answer examples and unique article content
    
    Returns:
        qa_records (list) - question/answer examples as list of dictionaries
        wiki_articles (list) - unique Wikipedia titles and articles recreated from the SQuAD data
    
    '''
    num_with_ans = 0
    num_without_ans = 0
    qa_records = []
    wiki_articles = {}
    
    for article in data:
        article_content = []
        
        for paragraph in article['paragraphs']:
            content = paragraph['context']
            article_content.append(content)
            
            for questions in paragraph['qas']:
                
                qa_record = {}
                qa_record['example_id'] = questions['id']
                qa_record['document_title'] = article['title']
                qa_record['question_text'] = questions['question']
                
                # unanswerable questions have an empty 'answers' list
                try:
                    qa_record['short_answer'] = questions['answers'][0]['text']
                    num_with_ans += 1
                except IndexError:
                    qa_record['short_answer'] = ""
                    num_without_ans += 1
                    
                qa_records.append(qa_record)
                
        wiki_articles[article['title']] = "\n".join(article_content)
        
        
    wiki_articles = [{'document_title':title, 'document_text': text}\
                         for title, text in wiki_articles.items()]
                
    print(f'Data contains {num_with_ans} question/answer pairs with a short answer, and {num_without_ans} without.'+
          f'\nThere are {len(wiki_articles)} unique wikipedia articles.')
                
    return qa_records, wiki_articles

In [125]:
# load and parse data
train_file = "data/squad/train-v2.0.json"
with open(train_file, 'rb') as f:
    train = json.load(f)

qa_records, wiki_articles = parse_qa_records(train['data'])

Data contains 86821 question/answer pairs with a short answer, and 43498 without.
There are 442 unique wikipedia articles.


In [126]:
# Show parsed record example
qa_records[0]

{'example_id': '56be85543aeaaa14008c9063',
 'document_title': 'Beyoncé',
 'question_text': 'When did Beyonce start becoming popular?',
 'short_answer': 'in the late 1990s'}

In [129]:
# Show example of parsed wiki_article
print(wiki_articles[0])

{'document_title': 'Beyoncé', 'document_text': 'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child. Managed by her father, Mathew Knowles, the group became one of the world\'s best-selling girl groups of all time. Their hiatus saw the release of Beyoncé\'s debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".\nFollowing the disbandment of Destiny\'s Child in June 2005, she released her second solo album, B\'Day (2006), which contained hits "Déjà Vu", "Irreplaceable", and "Beautiful Liar". Beyoncé also ventured into acting, with a Golden Globe-nominated perfor

### Download Elasticsearch

With our data ready to go, let's download, install, and configure Elasticsearch using one of the two following methods (Colab recommended). After executing the setup, we will have an Elasticsearch service running locally.

In [44]:
# If running locally - Run Elasticsearch using Docker (assumes Docker is installed)
!docker run -d -p 9200:9200 -e "discovery.type=single-node" elasticsearch:7.6.2



In [None]:
# collapse-hide

# If using Colab - Download the Elasticsearch release archive and start it as a background process
! wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.6.2-linux-x86_64.tar.gz -q
! tar -xzf elasticsearch-7.6.2-linux-x86_64.tar.gz
! chown -R daemon:daemon elasticsearch-7.6.2

import os
from subprocess import Popen, PIPE, STDOUT
es_server = Popen(['elasticsearch-7.6.2/bin/elasticsearch'],
                   stdout=PIPE, stderr=STDOUT,
                   preexec_fn=lambda: os.setuid(1)  # as daemon
                  )
# wait until ES has started
! sleep 30

### Getting Data into Elasticsearch

We'll use the [official low-level Python client library](https://elasticsearch-py.readthedocs.io/en/master/) for interacting with Elasticsearch.

In [None]:
# collapse-hide

!pip install elasticsearch
!pip install tqdm

By default, Elasticsearch is launched locally on port 9200. We first need to instantiate an Elasticsearch client object and connect to the service.

In [108]:
from elasticsearch import Elasticsearch

config = {'host':'localhost', 'port':9200}
es = Elasticsearch([config])

# test connection
es.ping()

True

Before we go further, let's introduce a few concepts that are specific to Elasticsearch and the process of indexing data. In Elasticsearch, an ***index*** is a collection of documents that have common characteristics (similar to a database schema in an RDBMS). ***Documents*** are JSON objects having their own set of key-value pairs consisting of various data types (similar to rows/fields in an RDBMS). When we add a document to an index, the values of the document's text fields go through an analysis process prior to being indexed. This means that when executing a search query against an existing index, we are actually searching against the post-processed representation that is stored in the inverted index, not the raw input document itself.

![Elasticsearch Index Process](my_icons/elastic_index_process.png)

The analysis process is a customizable pipeline carried out by a dedicated ***Analyzer***. Elasticsearch analyzers are composed of three components that make up the processing pipeline: *character filters, a tokenizer, and token filters.* Each of these components modifies the input stream of text according to some configurable settings.
- **Character Filters:** First, character filters have the ability to add, remove, or replace specific items in the text field. A common application of this filter is to strip `html` tags from the raw input.
- **Tokenizer:** After applying character filters, the transformed text is then passed to a tokenizer, which breaks up the input string into individual tokens according to a provided strategy. By default, the `standard` tokenizer splits tokens whenever it encounters a whitespace character, and also splits on most symbols (like commas, periods, semicolons, etc.).
- **Token Filters:** Finally, the token stream is passed to a token filter, which acts to add, remove, or modify tokens. Typical token filters include `lowercase`, which converts all tokens to lowercase form, and `stop`, which removes commonly occurring tokens called stopwords.

Elasticsearch comes with several built-in Analyzers that satisfy common use cases, with the default being the `Standard Analyzer`. The Standard Analyzer doesn't contain any character filters, uses a `standard` tokenizer, and applies a `lowercase` token filter. Let's take a look at one of the example sentences from above as it's passed through this pipeline.

[ANDREW] - recreate these images/flows on my own with my own examples.

![Elasticsearch Analyzer Pipeline](my_icons/elasticsearch_standard_analyzer.png)
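
The same pipeline can be approximated in a few lines of Python. This is only a rough sketch: the real `standard` tokenizer follows Unicode text segmentation rules and treats cases such as apostrophes differently from the simple regex used here. The function name and the optional `stopwords` argument (a stand-in for the `stop` token filter) are assumptions made for this example.

```python
import re


def approximate_analyzer(text, stopwords=frozenset()):
    """Rough imitation of an analyzer: tokenize, lowercase, optionally drop stopwords."""
    tokens = re.findall(r"\w+", text)              # tokenizer step (approximation)
    tokens = [tok.lower() for tok in tokens]       # `lowercase` token filter
    return [tok for tok in tokens if tok not in stopwords]  # `stop` token filter


print(approximate_analyzer("Elasticsearch is a powerful technique for search!"))
# ['elasticsearch', 'is', 'a', 'powerful', 'technique', 'for', 'search']

print(approximate_analyzer("Manual search is a slow technique.", stopwords={"is", "a"}))
# ['manual', 'search', 'slow', 'technique']
```

With no stopwords supplied, the output mirrors the Standard Analyzer's behavior on this sentence: punctuation is dropped and every token is lowercased.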


#### Create an Index


Let's create a new index and add our Wikipedia articles to it. To create an index, we provide a name and optionally some index configurations. Here we are specifying a set of `mappings` that indicate our anticipated index schema, data types, and how the text fields should be processed. If no `body` is passed, Elasticsearch will automatically infer fields and data types from incoming documents, as well as apply the `Standard Analyzer` to any text fields.

In [111]:
index_config = {
    "settings": {
        "analysis": {
            "analyzer": {
                "standard_analyzer": {
                    "type": "standard"
                }
            }
        }
    },
    "mappings": {
        "dynamic": "strict", 
        "properties": {
            "document_title": {"type": "text", "analyzer": "standard_analyzer"},
            "document_text": {"type": "text", "analyzer": "standard_analyzer"}
            }
        }
    }

index_name = 'squad-standard-index'
es.indices.create(index=index_name, body=index_config, ignore=400)

{'acknowledged': True,
 'shards_acknowledged': True,
 'index': 'squad-standard-index'}

#### Populate the Index

We can then loop through our list of Wikipedia titles & articles and add them to our newly created Elasticsearch index.

In [130]:
from tqdm.notebook import tqdm

def populate_index(es_obj, index_name, evidence_corpus):
    '''
    Loads records into an existing Elasticsearch index

    Args:
        es_obj (elasticsearch.client.Elasticsearch)
        index_name (str)
        evidence_corpus (list) - list of dicts containing data records

    '''

    for i, rec in enumerate(tqdm(evidence_corpus)):
    
        try:
            es_obj.index(index=index_name, id=i, body=rec)
        except Exception as e:
            print(f'Unable to load document {i}: {e}')

    # refresh the index so newly added documents are visible to the count below
    es_obj.indices.refresh(index=index_name)
    n_records = es_obj.count(index=index_name)['count']
    print(f'Successfully loaded {n_records} into {index_name}')

    return

In [131]:
populate_index(es_obj=es, index_name=index_name, evidence_corpus=wiki_articles)

Successfully loaded 442 into squad-standard-index
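
Indexing one document per request is fine for 442 articles, but for larger corpora the Python client's `helpers.bulk` function is usually a better fit. The sketch below is an assumed alternative rather than what this post actually ran; `generate_actions` and `bulk_populate_index` are hypothetical names.

```python
def generate_actions(index_name, evidence_corpus):
    """Yield one bulk action per record, using the list position as the document id."""
    for i, rec in enumerate(evidence_corpus):
        yield {"_index": index_name, "_id": i, "_source": rec}


def bulk_populate_index(es_obj, index_name, evidence_corpus):
    """Load all records in batched requests instead of one request per document."""
    # imported lazily so generate_actions can be used without the client installed
    from elasticsearch import helpers

    n_ok, errors = helpers.bulk(es_obj, generate_actions(index_name, evidence_corpus))
    return n_ok, errors


# Inspect the actions without touching a cluster
sample = [{"document_title": "Example", "document_text": "Some text."}]
print(list(generate_actions("squad-standard-index", sample)))
```

Calling `bulk_populate_index(es, index_name, wiki_articles)` would need the running Elasticsearch service and client object set up earlier.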


#### Search the Index

Wahoo! We now have some documents loaded into an index. Elasticsearch provides a rich [query language](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html) that supports a diverse range of query types. For this example, we'll use the standard query for performing full-text search, called a "match" query. By default, Elasticsearch sorts and returns a JSON response of search results based on a computed [relevance score](https://qbox.io/blog/practical-guide-elasticsearch-scoring-relevancy#:~:text=Together%2C%20these%20combine%20into%20a,number%20known%20as%20the%20_score.), which indicates how well a given document matches the query. Along with the relevance score of each matched document, the search response also includes the amount of time the query took to run.

Let's look at a simple match query used to search the `document_text` field in our newly created index.

In [185]:
# alternative example: 'What year did the third season of American Idol first air?'
question_text = 'Who was the first president of the Republic of China?'

# construct query
query = {
        'query': {
            'match': {
                'document_text': question_text
                }
            }
        }

# execute query
res = es.search(index=index_name, body=query, size=10)

In [186]:
print(f'Question: {question_text}')
print(f'Query Duration: {res["took"]} milliseconds')
print('Title, Relevance Score:')
[(hit['_source']['document_title'], hit['_score']) for hit in res['hits']['hits']]

Question: Who was the first president of the Republic of China?
Query Duration: 43 milliseconds
Title, Relevance Score:


[('Korean_War', 6.7325854),
 ('Prime_minister', 6.5900607),
 ('Military_history_of_the_United_States', 6.4036465),
 ('Russian_Soviet_Federative_Socialist_Republic', 6.305494),
 ('2008_Summer_Olympics_torch_relay', 6.2151413),
 ('Nanjing', 6.119414),
 ('Republic_of_the_Congo', 6.064489),
 ('Myanmar', 5.986621),
 ('Modern_history', 5.898165),
 ('Dwight_D._Eisenhower', 5.8090115)]

In [149]:
# sanity check: evaluates to False if every short answer appears verbatim in its source article

wiki_dict = {rec['document_title']:rec['document_text'] for rec in wiki_articles}
any([ex['short_answer'] not in wiki_dict[ex['document_title']] for ex in qa_records])

# Evaluating Retriever Performance

Ok, so we now have a basic understanding of how to use Elasticsearch as an IR tool to return some results for a given question, but how do we know if it's working? How do we evaluate what a good IR tool looks like? As we pointed out in the introduction of this post, if the Retriever component of our QA system doesn't provide the correct passage to the Reader, we are doomed from the start.

To evaluate how well our Retriever is working, we'll need two things: some labeled examples (i.e. SQuAD2.0 question/answer pairs) and some performance metrics. In the traditional world of information retrieval, there are [many metrics](https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval)) used to quantify the relevance of a set of query results, and they are largely centered around the concepts of precision and recall. For IR in the context of question answering, we adapt some of these ideas into two metrics: recall and mean average precision (mAP). Additionally, we evaluate the amount of time required to execute a query, since the main point of having a two-stage QA system is to efficiently narrow the large search space for our machine comprehension Reader.

#### Recall

In the traditional IR sense, recall is the fraction of all relevant documents in the corpus that are actually retrieved for a query. In the context of end-to-end QA systems sitting on large corpora of documents, we are less concerned with finding *all* of the passages containing the answer (because it would take significant time to read through all of them anyway) and more concerned with whether at least one passage containing the correct answer is returned. In that light, we define a Retriever's recall as the *percentage of questions for which the answer segment appears in one of the top N pages returned by the search method.*
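
As a small illustration of this QA-flavored definition, the sketch below computes recall over a toy set of questions. The function name `recall_at_n` and the toy data are made up for this example; the case-insensitive substring check for "contains the answer" is an assumption.

```python
def recall_at_n(answers, retrieved_docs, n):
    """Fraction of questions whose answer string appears in any of the top-n documents."""
    if not answers:
        return 0.0
    hits = 0
    for answer, docs in zip(answers, retrieved_docs):
        if any(answer.lower() in doc.lower() for doc in docs[:n]):
            hits += 1
    return hits / len(answers)


answers = ["five", "Rayleigh scattering"]
retrieved_docs = [
    ["The Pentagon has five sides.", "A hexagon has six sides."],
    ["Clouds are white.", "The sky is blue because of Rayleigh scattering."],
]

print(recall_at_n(answers, retrieved_docs, n=1))  # 0.5
print(recall_at_n(answers, retrieved_docs, n=2))  # 1.0
```

Recall can only stay the same or increase as N grows, which is why it should always be reported together with the N that was used.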

#### Mean Average Precision

While the recall metric focuses on the minimum viable result set needed to set the Reader up for success, we do still care about the composition of that result set. We want a metric that rewards a Retriever for a.) returning a lot of answer-containing documents in the result set (i.e. traditional meaning of precision) and b.) returning those answer-containing documents higher up in the result set than non-answer-containing documents (i.e. ranking them correctly). This is precisely what mean average precision (mAP) does for us.

To explain mAP further, let's first break down the concept of average precision. If our Retriever is asked to return N documents and the total number of those N documents that actually contain the answer is m, then average precision (AP) is defined as:

\begin{equation*}
AP@N = \frac{1}{m} \sum_{k=1}^N \left( P(k) \text{ if the } k^{th} \text{ item contains the answer} \right) = \frac{1}{m} \sum_{k=1}^N P(k) \cdot rel(k)
\end{equation*}

where $P(k)$ is the precision computed over the top $k$ results and $rel(k)$ is just a binary indication of whether the $k^{th}$ item contains the correct segment or not. Using a concrete example, consider retrieving $N=3$ documents, of which one actually contains the correct answer segment. Here are three scenarios for how this could happen:

| Scenario | Binary Indication | P(k) · rel(k)  |     Average Precision @N     |
|:--------:|:-----------------:|:--------------:|:----------------------------:|
|     A    |     [1, 0, 0]     |   [1/1, 0, 0]  |   (1/1)*[(1/1) + 0 + 0] = 1  |
|     B    |     [0, 1, 0]     |   [0, 1/2, 0]  |  (1/1)*[0 + (1/2) + 0] = 0.5 |
|     C    |     [0, 0, 1]     |   [0, 0, 1/3]  | (1/1)*[0 + 0 + (1/3)] = 0.33 |

Despite the fact that in each scenario we only have one document containing the correct answer, Scenario A is rewarded with the highest score because it ranks the ground truth document first. Since average precision is calculated on a per-query basis, mean average precision is simply the average AP across all queries. Now using our Wikipedia article index, let's define a function called `evaluate_retriever` to loop through all question/answer examples from the SQuAD2.0 train set and see how well our Elasticsearch retriever performs in terms of recall, mAP, and average query duration.

In [290]:
import numpy as np
import pandas as pd

def average_precision(binary_results):
    
    ''' Calculates the average precision for a list of binary indicators '''
    
    m = 0
    precs = []

    for i, val in enumerate(binary_results):
        if val == 1:
            m += 1
            precs.append(sum(binary_results[:i+1])/(i+1))
            
    ap = (1/m)*np.sum(precs) if m else 0
            
    return ap


def evaluate_retriever(es_obj, index_name, qa_records, n_results):
    '''
    This function loops through a set of question/answer examples from SQuAD2.0 and 
    evaluates Elasticsearch as an information retrieval tool in terms of recall, mAP, and query duration.
    
    Args:
        es_obj (elasticsearch.client.Elasticsearch)
        index_name (str)
        qa_records (list) - list of qa_records from preprocessing steps
        n_results (int) - the number of results Elasticsearch should return for a given query
        
    Returns:
        results_df (pd.DataFrame) - a dataframe recording search results info for every example in qa_records
    
    '''
    
    results = []
    
    for i, qa in enumerate(tqdm(qa_records)):
        
        ex_id = qa['example_id']
        question = qa['question_text']
        answer = qa['short_answer']
        
        # construct and execute query
        query = {
                'query': {
                    'match': {
                        'document_text': question
                        }
                    }
                }
        
        res = es_obj.search(index=index_name, body=query, size=n_results)
        
        # calculate performance metrics from query response info
        duration = res['took']
        binary_results = [int(answer.lower() in doc['_source']['document_text'].lower()) for doc in res['hits']['hits']]
        ans_in_res = int(any(binary_results))
        ap = average_precision(binary_results)

        rec = (ex_id, question, answer, duration, ans_in_res, ap)
        results.append(rec)
    
    # format results
    cols = ['example_id', 'question', 'answer', 'query_duration', 'answer_present', 'average_precision']
    
    results_df = pd.DataFrame(results, columns=cols)
    
    return results_df



In [314]:
# filter out SQuAD records that do not have a short answer for the given question
qa_records_answerable = [record for record in qa_records if record['short_answer'] != '']

results_df = evaluate_retriever(es_obj=es, index_name=index_name, qa_records=qa_records_answerable, n_results=5)





In [292]:
results_df.head()

|   | example_id | question | answer | query_duration | answer_present | average_precision |
|--:|:-----------|:---------|:-------|---------------:|---------------:|------------------:|
| 0 | 56be85543aeaaa14008c9063 | When did Beyonce start becoming popular? | in the late 1990s | 3 | 1 | 1.0 |
| 1 | 56be85543aeaaa14008c9065 | What areas did Beyonce compete in when she was... | singing and dancing | 2 | 1 | 1.0 |
| 2 | 56be85543aeaaa14008c9066 | When did Beyonce leave Destiny's Child and bec... | 2003 | 2 | 1 | 1.0 |
| 3 | 56bf6b0f3aeaaa14008c9601 | In what city and state did Beyonce grow up? | Houston, Texas | 1 | 1 | 0.5 |
| 4 | 56bf6b0f3aeaaa14008c9602 | In which decade did Beyonce become famous? | late 1990s | 1 | 1 | 0.583333 |


In [293]:
results_df.shape

(130319, 6)

In [307]:
# recall
results_df.answer_present.value_counts(normalize=True)[1]

0.9508897397923557

In [308]:
# mAP
results_df.average_precision.mean()

0.8819634469604927

In [310]:
# average query duration (milliseconds)
results_df.query_duration.mean()

1.5381640436160497

In [294]:
qa_records[1]

{'example_id': '56be85543aeaaa14008c9065',
 'document_title': 'Beyoncé',
 'question_text': 'What areas did Beyonce compete in when she was growing up?',
 'short_answer': 'singing and dancing'}

In [300]:
idx = 3
question = qa_records[idx]['question_text']
answer = qa_records[idx]['short_answer']

In [301]:
# construct query
query = {
        'query': {
            'match': {
                'document_text': question
                }
            }
        }

# execute query
res = es.search(index=index_name, body=query, size=5)

In [302]:
binary_results = [int(answer.lower() in doc['_source']['document_text'].lower()) for doc in res['hits']['hits']]

In [303]:
binary_results

[0, 1, 0, 0, 0]

In [306]:
average_precision([0,1,0,1,1,1])

0.5666666666666667
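
As a check on the value above, the AP definition can be applied by hand to the list `[0, 1, 0, 1, 1, 1]`. There are $m=4$ answer-containing items, at ranks 2, 4, 5, and 6, so only those ranks contribute a precision term:

\begin{equation*}
AP@6 = \frac{1}{4} \left( \frac{1}{2} + \frac{2}{4} + \frac{3}{5} + \frac{4}{6} \right) = \frac{1}{4} \cdot \frac{34}{15} = \frac{17}{30} \approx 0.567
\end{equation*}

This agrees with the output of `average_precision`.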

In [272]:
any([0,0,0])

False

[ANDREW] NOTE TO SELF. the above task is going to be really really easy. Let's add in 25k random docs from wikipedia API if needed. Or just split up articles into paragraphs for chunking...

Explain: Recall, mAP, query duration

# Improving Search Results with Custom Analyzers & Query Enrichment

***PLACEHOLDER CONTENT***

There are many different approaches to improving the Retriever component of a QA system (http://staffwww.dcs.shef.ac.uk/people/M.Greenwood/nlp/pubs/gaizauskas_sigirforum_2004d.pdf):
1. preprocessing the question in creating the IR query;
2. preprocessing the collection to identify significant information that can be included in the indexation for retrieval;
3. adapting the similarity metric used in selecting documents;
4. modifying the form of retrieval return, e.g. to deliver passages rather than whole documents;
5. re-ranking the passages returned by the Retriever before they are fed to the Reader (https://www.aclweb.org/anthology/D18-1053.pdf).
    
    

Things to modify for improvement:
- custom analyzer - stopword removal
- multi-match query on title and body (maybe weighted)
- NER + phrase match in custom query
- MAYBE: synonyms
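
As one possible starting point for the multi-match idea in the list above, the sketch below builds a `multi_match` query over both indexed fields. The function name and the title boost of 2 are arbitrary assumptions for illustration, not tuned or evaluated values.

```python
def build_multi_match_query(question_text, title_boost=2):
    """Build a query that searches title and body, weighting title matches more heavily."""
    return {
        "query": {
            "multi_match": {
                "query": question_text,
                # the ^N suffix boosts the score contribution of that field
                "fields": [f"document_title^{title_boost}", "document_text"],
            }
        }
    }


query = build_multi_match_query("Who was the first president of the Republic of China?")
print(query["query"]["multi_match"]["fields"])  # ['document_title^2', 'document_text']
```

The resulting dictionary can be passed as the `body` of `es.search` in the same way as the `match` queries used earlier.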

Elasticsearch comes with several built-in Analyzers that satisfy common use cases. However, custom Analyzers can also be crafted by combining specific character filters, tokenizers, and token filters to best suit any unique dataset. As explained above, Analyzers are applied at index time to pre-process documents before indexing. In addition, Analyzers can also be applied at search time to process text queries according to the same logic the candidate documents were processed with. Search time analysis can be customized and is only applied to certain query types such as `match` queries. Let's take a closer look at how Analyzers work with Elasticsearch's Analyze API.
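
A minimal sketch of what this could look like follows, assuming a custom analyzer that adds the built-in `stop` token filter to the standard pipeline. The analyzer name `squad_stop_analyzer` and the helper `analyze_text` are hypothetical; `es_obj.indices.analyze` is the Python client's wrapper around the Analyze API and requires a running Elasticsearch service.

```python
# Index configuration with a custom analyzer: standard tokenizer + lowercase + stopword removal
custom_index_config = {
    "settings": {
        "analysis": {
            "analyzer": {
                "squad_stop_analyzer": {      # hypothetical analyzer name
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "stop"],
                }
            }
        }
    },
    "mappings": {
        "dynamic": "strict",
        "properties": {
            "document_title": {"type": "text", "analyzer": "squad_stop_analyzer"},
            "document_text": {"type": "text", "analyzer": "squad_stop_analyzer"},
        },
    },
}


def analyze_text(es_obj, index_name, text, analyzer="squad_stop_analyzer"):
    """Return the tokens that an index's analyzer produces for the given text."""
    res = es_obj.indices.analyze(index=index_name, body={"analyzer": analyzer, "text": text})
    return [tok["token"] for tok in res["tokens"]]
```

After creating an index with `custom_index_config`, calling `analyze_text(es, index_name, "Why is the sky blue?")` would show which tokens survive the pipeline and therefore which terms actually take part in matching.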



# Impact of Retriever in End-to-End QA System

If I do chunk into smaller passages, we could just do all evaluation on this same dataset...