# Querying

In this exercise, we are going to first interactively query the index and then produce a TREC run with [Pyserini](https://github.com/castorini/pyserini), the Python interface to Anserini.

## Setup

Install Python dependencies:

In [None]:
%%capture

!pip install pyjnius==1.2.1
!pip install pyserini

import os
os.environ['JAVA_HOME'] = '/usr/lib/jvm/java-11-openjdk-amd64'

Fix known issue with pyjnius (see [this explanation](https://github.com/castorini/pyserini/blob/master/README.md#known-issues) for details):

In [None]:
%%capture

!mkdir -p /usr/lib/jvm/java-1.11.0-openjdk-amd64/jre/lib/amd64/server/
!ln -s /usr/lib/jvm/java-1.11.0-openjdk-amd64/lib/server/libjvm.so /usr/lib/jvm/java-1.11.0-openjdk-amd64/jre/lib/amd64/server/libjvm.so

Let's pull the Anserini jar:

In [None]:
%%capture

!gsutil -m cp gs://afirm2020/anserini-0.7.1-SNAPSHOT-fatjar.jar .

Let's point Pyserini to the Anserini jar that we have just pulled:

In [None]:
os.environ['ANSERINI_CLASSPATH'] = '.'
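
Since `ANSERINI_CLASSPATH` only names a directory, a missing or misplaced jar may not be noticed until the first search fails. The following is a minimal sketch of a sanity check. `find_anserini_jars` is a hypothetical helper, and the `anserini-*-fatjar.jar` pattern is an assumption inferred from the file name downloaded above.

```python
import glob
import os


def find_anserini_jars(directory='.'):
    """Return sorted paths of files that look like an Anserini fat jar.

    Hypothetical helper: the 'anserini-*-fatjar.jar' pattern is inferred from
    the file name downloaded above, not taken from the Pyserini documentation.
    """
    pattern = os.path.join(directory, 'anserini-*-fatjar.jar')
    return sorted(glob.glob(pattern))


jars = find_anserini_jars(os.environ.get('ANSERINI_CLASSPATH', '.'))
if jars:
    print('Found Anserini jar(s):', jars)
else:
    print('No Anserini fat jar found; re-run the download cell above.')
```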

## Interactive Querying

Pull the pre-built index from GCS:

In [None]:
%%capture

!gsutil -m cp -r gs://afirm2020/indexes .

In [None]:
from pyserini.search import pysearch
import itertools

First, let's grab the queries that are defined for our collection:

In [None]:
%%capture

!mkdir data
!gsutil -m cp gs://afirm2020/msmarco_passage/queries.dev.small.tsv data/queries.dev.small.tsv

The hits data structure holds the docid, the retrieval score, as well as the document content.
Let's look at the top 10 passages for the query `south african football teams`:

In [None]:
from IPython.display import display, HTML

searcher = pysearch.SimpleSearcher('indexes/lucene-index.msmarco-passage.pos+docvectors+rawdocs')
interactive_hits = searcher.search('south african football teams')

for i in range(0, 10):
    print('Rank: {} | Passage ID: {} | BM25 Score: {}'.format(i+1, interactive_hits[i].docid, interactive_hits[i].score))
    display(HTML('<div style="font-family: Times New Roman; padding-bottom:10px">' + interactive_hits[i].content + '</div>'))

The above example uses default parameters.
Let's try setting tuned parameters for this collection:

In [None]:
searcher.set_bm25_similarity(0.82, 0.68)
interactive_hits_tuned = searcher.search('south african football teams')

for i in range(0, 10):
    print('Rank: {} | Passage ID: {} | BM25 Score: {}'.format(i+1, interactive_hits_tuned[i].docid, interactive_hits_tuned[i].score))
    display(HTML('<div style="font-family: Times New Roman; padding-bottom:10px">' + interactive_hits_tuned[i].content + '</div>'))
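
The two values passed to `set_bm25_similarity` are BM25's `k1` (term-frequency saturation) and `b` (document-length normalization), in that order. To build some intuition for what tuning them does, here is a rough sketch of the textbook BM25 term weight. It is not Lucene's exact implementation, the collection statistics are made up, and the `(0.9, 0.4)` baseline is assumed to correspond to the untuned setting.

```python
import math


def bm25_term_score(tf, df, num_docs, doc_len, avg_doc_len, k1=0.9, b=0.4):
    """Textbook BM25 contribution of one query term to one document.

    Illustrative only: Lucene's implementation differs in details such as
    how document lengths are stored.
    """
    # Rarer terms (small df) get a larger inverse document frequency.
    idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
    # b controls how strongly long documents are penalized.
    length_norm = 1 - b + b * doc_len / avg_doc_len
    # k1 controls how quickly repeated occurrences stop adding score.
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)


# Made-up statistics for a single term, scored under both settings.
for k1, b in [(0.9, 0.4), (0.82, 0.68)]:
    short = bm25_term_score(tf=2, df=100, num_docs=10000, doc_len=40,
                            avg_doc_len=60, k1=k1, b=b)
    long_ = bm25_term_score(tf=2, df=100, num_docs=10000, doc_len=120,
                            avg_doc_len=60, k1=k1, b=b)
    print('k1={} b={} short={:.3f} long={:.3f}'.format(k1, b, short, long_))
```

With a larger `b`, the gap between the short and the long document widens, which is one reason the ranking can change after tuning.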

**Exercise:**
Compare the rankings with and without tuned parameters.
Add a new cell to query the index with a different query of your choice, both with untuned and tuned parameters.
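
Comparing two top-10 lists by eye gets tedious. One possible way to summarize the differences is sketched below. `compare_rankings` is a hypothetical helper and the docids are toy values.

```python
def compare_rankings(docids_a, docids_b):
    """Summarize how two ranked lists of docids differ.

    Returns (shared, only_a, only_b, moved), where moved maps each shared
    docid whose position changed to a (rank_in_a, rank_in_b) pair, 1-indexed.
    """
    rank_a = {docid: i + 1 for i, docid in enumerate(docids_a)}
    rank_b = {docid: i + 1 for i, docid in enumerate(docids_b)}
    shared = [d for d in docids_a if d in rank_b]
    only_a = [d for d in docids_a if d not in rank_b]
    only_b = [d for d in docids_b if d not in rank_a]
    moved = {d: (rank_a[d], rank_b[d]) for d in shared if rank_a[d] != rank_b[d]}
    return shared, only_a, only_b, moved


# Toy docids; in the notebook you could pass, for example,
# [hit.docid for hit in interactive_hits[:10]] and the tuned equivalent.
untuned = ['d1', 'd2', 'd3', 'd4']
tuned = ['d2', 'd1', 'd3', 'd5']
shared, only_untuned, only_tuned, moved = compare_rankings(untuned, tuned)
print('Shared:', shared)
print('Only untuned:', only_untuned)
print('Only tuned:', only_tuned)
print('Moved:', moved)
```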

Note how the ranking has changed.
We can also enable RM3 query expansion to see if it helps with our collection:

In [None]:
searcher.set_rm3_reranker(10, 10, 0.5)
interactive_hits_tuned_rm3 = searcher.search('south african football teams')

for i in range(0, 10):
    print('Rank: {} | Passage ID: {} | BM25 Score: {}'.format(i+1, interactive_hits_tuned_rm3[i].docid, interactive_hits_tuned_rm3[i].score))
    display(HTML('<div style="font-family: Times New Roman; padding-bottom:10px">' + interactive_hits_tuned_rm3[i].content + '</div>'))
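
The three arguments to `set_rm3_reranker` are the number of feedback terms, the number of feedback documents, and the weight given to the original query. The toy sketch below illustrates the general idea of RM3-style expansion: build a term distribution from the top-ranked documents, keep the strongest terms, and interpolate them with the original query. It is a simplification and not Anserini's implementation. `rm3_expand` is a hypothetical function, the documents and scores are made up, and no stopword filtering is applied.

```python
from collections import Counter


def rm3_expand(query_terms, feedback_docs, fb_terms=10, original_query_weight=0.5):
    """Toy RM3-style expansion over already-tokenized text.

    feedback_docs is a list of (tokens, retrieval_score) pairs for the top
    ranked documents. Simplification: each document's weight is its
    normalized retrieval score.
    """
    total_score = sum(score for _, score in feedback_docs)
    feedback_model = Counter()
    for tokens, score in feedback_docs:
        doc_weight = score / total_score
        for term, count in Counter(tokens).items():
            # P(term | doc), weighted by how highly the document was ranked.
            feedback_model[term] += doc_weight * count / len(tokens)

    # Keep only the strongest feedback terms and renormalize them.
    top = dict(feedback_model.most_common(fb_terms))
    top_mass = sum(top.values())
    top = {term: w / top_mass for term, w in top.items()}

    # Interpolate the original query model with the feedback model.
    query_model = {t: c / len(query_terms) for t, c in Counter(query_terms).items()}
    expanded = {}
    for term in set(query_model) | set(top):
        expanded[term] = (original_query_weight * query_model.get(term, 0.0)
                          + (1 - original_query_weight) * top.get(term, 0.0))
    return expanded


# Made-up feedback documents: (tokens, retrieval score).
docs = [
    ('south africa premier soccer league teams'.split(), 12.0),
    ('kaizer chiefs soccer club south africa'.split(), 8.0),
]
weights = rm3_expand('south african football teams'.split(), docs, fb_terms=3)
for term, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print('{:10s} {:.3f}'.format(term, w))
```

In this toy run, `soccer` enters the expanded query even though it never appeared in the original one, which is the kind of vocabulary bridging that query expansion aims for.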

## Batch Retrieval

Previously we interactively queried the index.
However, in a typical experimental setting, you would evaluate over a larger number of queries to test different information needs.

Let's begin by constructing the dev queries and corresponding query IDs:

In [None]:
topics = {}
with open('data/queries.dev.small.tsv') as file:
    for line in file:
        qid, query = line.strip().split('\t')
        topics[int(qid)] = query

print('{} queries total'.format(len(topics)))

In [None]:
queries = list(topics.values())
qids = [str(t) for t in topics]
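
The loading loop above assumes that every line contains exactly one tab and that there are no blank lines. A slightly more defensive variant is sketched here. `load_topics` is a hypothetical helper and the sample rows are made up.

```python
import io


def load_topics(lines):
    """Parse 'qid<TAB>query' lines into a dict of {int qid: query string}."""
    topics = {}
    for line in lines:
        line = line.rstrip('\n')
        if not line:
            continue  # skip blank lines instead of failing on unpacking
        # Split on the first tab only, so a stray tab in the query survives.
        qid, query = line.split('\t', 1)
        topics[int(qid)] = query
    return topics


# Made-up sample rows standing in for data/queries.dev.small.tsv.
sample = io.StringIO('101\tsouth african football teams\n\n102\twhat is bm25\n')
sample_topics = load_topics(sample)
print(sample_topics)
```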

**Exercise:**
We looked at these queries in the previous activity.
Again, find the queries that contain `football`.

In [None]:
[q for q in queries if 'football' in q]

Now, let's run all the queries from the dev set.
For the sake of speed, let's again only retrieve the top 10 documents for each query:

In [None]:
searcher = pysearch.SimpleSearcher('indexes/lucene-index.msmarco-passage.pos+docvectors+rawdocs')
bm25_hits = searcher.batch_search(queries, qids, k=10)

Note that the above runs batch retrieval with untuned BM25.
We can repeat with tuned parameters, just like we did for the interactive queries:

In [None]:
searcher.set_bm25_similarity(0.82, 0.68)
bm25_hits_tuned = searcher.batch_search(queries, qids, k=10)

Now let's repeat with RM3 query expansion:

In [None]:
searcher.set_rm3_reranker(10, 10, 0.5)
bm25_hits_tuned_rm3 = searcher.batch_search(queries, qids, k=10)

**Exercise:**
So far we have downloaded the dev queries and retrieved the top passages for them.
Now pull the eval queries and repeat the process.

In [None]:
%%capture

!gsutil -m cp gs://afirm2020/msmarco_passage/queries.eval.small.tsv data/queries.eval.small.tsv

In [None]:
eval_topics = {}
with open('data/queries.eval.small.tsv') as file:
    for line in file:
        qid, query = line.strip().split('\t')
        eval_topics[int(qid)] = query

eval_queries = list(eval_topics.values())
eval_qids = [str(t) for t in eval_topics]

In [None]:
eval_searcher = pysearch.SimpleSearcher('indexes/lucene-index.msmarco-passage.pos+docvectors+rawdocs')
# Use a separate name so the dev hits in bm25_hits are not overwritten.
bm25_hits_eval = eval_searcher.batch_search(eval_queries, eval_qids, k=10)

## Evaluation

A crucial component of information retrieval research is evaluation and metrics.
The most common tool used to achieve this goal is `trec_eval` developed by [NIST](https://www.nist.gov/).

`trec_eval` defines a number of standard retrieval measures, the details of which can be seen [here](http://www.rafaelglater.com/en/post/learn-how-to-use-trec_eval-to-evaluate-your-information-retrieval-system).
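
To make two of these measures concrete, here is a small sketch of average precision (the per-query quantity that `map` averages over queries) and recall at a cutoff. It illustrates the definitions only and is not `trec_eval`'s code. The function names and the toy ranking are assumptions made for this example.

```python
def average_precision(ranked_docids, relevant):
    """Average precision of one ranked list given the set of relevant docids."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, docid in enumerate(ranked_docids, start=1):
        if docid in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant hit
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def recall_at_k(ranked_docids, relevant, k):
    """Fraction of relevant docids that appear in the top k results."""
    if not relevant:
        return 0.0
    found = sum(1 for docid in ranked_docids[:k] if docid in relevant)
    return found / len(relevant)


# Toy example: two of the three relevant documents are retrieved.
ranking = ['d3', 'd1', 'd7', 'd2']
relevant = {'d1', 'd2', 'd9'}
ap = average_precision(ranking, relevant)
recall_3 = recall_at_k(ranking, relevant, k=3)
print('AP = {:.4f}, recall@3 = {:.4f}'.format(ap, recall_3))
```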

### TREC Format

`trec_eval` requires the runs from various experiments to be expressed in a standard TREC format:

`query_id iter docno rank similarity run_id` delimited by spaces

- `query_id`: query ID
- `iter`: constant, often either 0 or Q0 - required but ignored by `trec_eval`
- `docno`: string values that uniquely identify a document in the collection
- `rank`: integer, often zero indexed
- `similarity`: float value that represents the similarity of the document to the query specified by `query_id`
- `run_id`: string that identifies runs, used to keep track of different experiments - also ignored by `trec_eval`
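
The field layout above can be sketched as a pair of helpers that write and read a single run line. `format_run_line` and `parse_run_line` are hypothetical names, and the query ID, docno, and score are made up.

```python
def format_run_line(qid, docid, rank, score, run_id):
    """Build one line of a TREC run file."""
    return '{} Q0 {} {} {} {}'.format(qid, docid, rank, score, run_id)


def parse_run_line(line):
    """Split one run line into typed fields; the iter column is dropped."""
    qid, _iter, docid, rank, score, run_id = line.split()
    return qid, docid, int(rank), float(score), run_id


# Made-up values for illustration.
line = format_run_line('q1', 'doc42', 0, 18.5, 'demo_run')
print(line)
print(parse_run_line(line))
```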

Evaluation also requires the ground truth in the form of relevance judgements in the qrels file.
The qrels file uses the following format:

`query_id iter docno label`

- `label`: binary code (0 for not relevant and 1 for relevant)
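
As a rough illustration of the qrels layout, the sketch below reads such lines into a nested dictionary. `load_qrels` is a hypothetical helper and the sample judgements are made up.

```python
from collections import defaultdict


def load_qrels(lines):
    """Parse 'query_id iter docno label' lines into {qid: {docid: label}}."""
    qrels = defaultdict(dict)
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # ignore blank or malformed lines
        qid, _iter, docid, label = fields
        qrels[qid][docid] = int(label)
    return dict(qrels)


# Made-up judgements for illustration.
sample_qrels = load_qrels([
    'q1 0 doc42 1\n',
    'q1 0 doc7 0\n',
    'q2 0 doc3 1\n',
])
relevant_q1 = {d for d, label in sample_qrels['q1'].items() if label > 0}
print(relevant_q1)
```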

Convert the hits for both BM25 (tuned and untuned) and BM25+RM3 into the TREC format:

In [None]:
def convert_to_trec_run(experiment, run_dict):
  with open('run.{}.txt'.format(experiment), 'w') as run_file:
    for qid in run_dict:
      for rank, doc in enumerate(run_dict[qid]):
        run_file.write('{} Q0 {} {} {} {}\n'.format(qid, doc.docid, rank, doc.score, experiment))

In [None]:
convert_to_trec_run('msmarco_passage_dev_bm25', bm25_hits)
convert_to_trec_run('msmarco_passage_dev_bm25_tuned', bm25_hits_tuned)
convert_to_trec_run('msmarco_passage_dev_bm25_tuned_rm3', bm25_hits_tuned_rm3)

Let's pull `trec_eval` and the qrels file:

In [None]:
%%capture

!gsutil -m cp -r gs://afirm2020/trec_eval.9.0.4 .
!chmod -R +x trec_eval.9.0.4/
!gsutil -m cp gs://afirm2020/msmarco_passage/qrels.dev.small.tsv data/qrels.dev.small.tsv



---



Now that we have our runs in the TREC format, we can evaluate them with `trec_eval`.


In [None]:
!head -5 run.msmarco_passage_dev_bm25.txt

In [None]:
!trec_eval.9.0.4/trec_eval -m map -c -m recall.1000 -c data/qrels.dev.small.tsv run.msmarco_passage_dev_bm25.txt

In [None]:
!trec_eval.9.0.4/trec_eval -m map -c -m recall.1000 -c data/qrels.dev.small.tsv run.msmarco_passage_dev_bm25_tuned.txt

In [None]:
!trec_eval.9.0.4/trec_eval -m map -c -m recall.1000 -c data/qrels.dev.small.tsv run.msmarco_passage_dev_bm25_tuned_rm3.txt

*TODO: comments and comparisons*

**Exercise:**
We retrieved the hits for the eval queries in the previous exercise.
Now convert them to a run file and evaluate it with `trec_eval`.