<a href="https://colab.research.google.com/github/castorini/anserini-notebooks-afirm2020/blob/master/afirm2020_query.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Querying

In this exercise, we are going to first interactively query the index and then produce a TREC run with [Pyserini](https://github.com/castorini/pyserini), the Python interface to Anserini.

## Setup

Install Python dependencies (again - remember that each notebook instantiates a virtual machine of its own):

In [1]:
!pip install pyjnius==1.2.1
!pip install pyserini

import os
os.environ['JAVA_HOME'] = '/usr/lib/jvm/java-11-openjdk-amd64'

Collecting pyjnius==1.2.1
  Downloading https://files.pythonhosted.org/packages/e8/4f/e3d9f4bb53f7f1854f81a279c274c4ad8537e4d71117258515158403bc10/pyjnius-1.2.1-cp36-cp36m-manylinux2010_x86_64.whl (1.1MB)

Fix known issue with pyjnius (see [this explanation](https://github.com/castorini/pyserini/blob/master/README.md#known-issues) for details):

In [0]:
!mkdir -p /usr/lib/jvm/java-1.11.0-openjdk-amd64/jre/lib/amd64/server/
!ln -s /usr/lib/jvm/java-1.11.0-openjdk-amd64/lib/server/libjvm.so /usr/lib/jvm/java-1.11.0-openjdk-amd64/jre/lib/amd64/server/libjvm.so

We have already made the Anserini jar that we built in the previous exercise available in a Google bucket.
Let's pull it directly:

In [3]:
!gsutil -m cp gs://afirm2020/anserini-0.7.2-SNAPSHOT-fatjar.jar .

Copying gs://afirm2020/anserini-0.7.2-SNAPSHOT-fatjar.jar...
\ [1/1 files][ 58.0 MiB/ 58.0 MiB] 100% Done                                    
Operation completed over 1 objects/58.0 MiB.                                     


Let's point Pyserini to the Anserini jar that we have just pulled:

In [0]:
os.environ['ANSERINI_CLASSPATH'] = '.'
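
Before moving on, it can help to confirm that the jar actually sits in the directory that `ANSERINI_CLASSPATH` points to.
The sketch below uses a hypothetical helper (`find_fatjars` is not part of Pyserini) and assumes the jar keeps the `anserini-*-fatjar.jar` naming used by the download above:

```python
import glob
import os


def find_fatjars(classpath_dir='.'):
    """Return Anserini fatjars found in classpath_dir (hypothetical helper, not a Pyserini API)."""
    pattern = os.path.join(classpath_dir, 'anserini-*-fatjar.jar')
    return sorted(glob.glob(pattern))


jars = find_fatjars(os.environ.get('ANSERINI_CLASSPATH', '.'))
print(jars if jars else 'No Anserini fatjar found; re-run the gsutil cell above.')
```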

## Interactive Querying

We will need the index for the querying experiments in this exercise.
Because Colab notebooks don't share data with one another (each session runs on its own virtual machine), we have to pull the pre-built index from GCS.

In [5]:
!gsutil -m cp -r gs://afirm2020/indexes .

Copying gs://afirm2020/indexes/lucene-index.msmarco-passage.pos+docvectors+rawdocs/_0.fdt...
Copying gs://afirm2020/indexes/lucene-index.msmarco-passage.pos+docvectors+rawdocs/_0.fdx...
Copying gs://afirm2020/indexes/lucene-index.msmarco-passage.pos+docvectors+rawdocs/_0.si...
Copying gs://afirm2020/indexes/lucene-index.msmarco-passage.pos+docvectors+rawdocs/_0.fnm...
Copying gs://afirm2020/indexes/lucene-index.msmarco-passage.pos+docvectors+rawdocs/_0.tvd...
Copying gs://afirm2020/indexes/lucene-index.msmarco-passage.pos+docvectors+rawdocs/_0.nvd...
Copying gs://afirm2020/indexes/lucene-

In [0]:

from pyserini.search import pysearch
import itertools

First, let's grab the queries that are defined for our collection:

In [7]:
!mkdir data
!gsutil -m cp gs://afirm2020/msmarco_passage/queries.dev.small.tsv data/queries.dev.small.tsv

Copying gs://afirm2020/msmarco_passage/queries.dev.small.tsv...
/ [1/1 files][283.4 KiB/283.4 KiB] 100% Done
Operation completed over 1 objects/283.4 KiB.                                    


The hits data structure holds the docid, the retrieval score, as well as the document content.
Let's look at the top 10 passages for the query `south african football teams`:

In [8]:
from IPython.display import display, HTML

searcher = pysearch.SimpleSearcher('indexes/lucene-index.msmarco-passage.pos+docvectors+rawdocs')
interactive_hits = searcher.search('south african football teams')

for i in range(0, 10):
    print('Rank: {} | Passage ID: {} | BM25 Score: {}'.format(i+1, interactive_hits[i].docid, interactive_hits[i].score))
    display(HTML('<div style="font-family: Times New Roman; padding-bottom:10px">' + interactive_hits[i].content + '</div>'))

Rank: 1 | Passage ID: 2225931 | BM25 Score: 11.718600273132324


Rank: 2 | Passage ID: 4959087 | BM25 Score: 11.63599967956543


Rank: 3 | Passage ID: 2646484 | BM25 Score: 11.590800285339355


Rank: 4 | Passage ID: 2646489 | BM25 Score: 11.526900291442871


Rank: 5 | Passage ID: 474761 | BM25 Score: 11.50059986114502


Rank: 6 | Passage ID: 4834928 | BM25 Score: 11.328499794006348


Rank: 7 | Passage ID: 7619756 | BM25 Score: 11.165300369262695


Rank: 8 | Passage ID: 830813 | BM25 Score: 11.10509967803955


Rank: 9 | Passage ID: 5660817 | BM25 Score: 11.051400184631348


Rank: 10 | Passage ID: 830809 | BM25 Score: 11.051399230957031


The above example uses default parameters.
Let's try setting tuned parameters for this collection:

In [9]:
searcher.set_bm25_similarity(0.82, 0.68)
interactive_hits_tuned = searcher.search('south african football teams')

for i in range(0, 10):
    print('Rank: {} | Passage ID: {} | BM25 Score: {}'.format(i+1, interactive_hits_tuned[i].docid, interactive_hits_tuned[i].score))
    display(HTML('<div style="font-family: Times New Roman; padding-bottom:10px">' + interactive_hits_tuned[i].content + '</div>'))

Rank: 1 | Passage ID: 474761 | BM25 Score: 11.889699935913086


Rank: 2 | Passage ID: 4834928 | BM25 Score: 11.848899841308594


Rank: 3 | Passage ID: 2646484 | BM25 Score: 11.805800437927246


Rank: 4 | Passage ID: 830813 | BM25 Score: 11.76449966430664


Rank: 5 | Passage ID: 2646489 | BM25 Score: 11.703499794006348


Rank: 6 | Passage ID: 5660817 | BM25 Score: 11.671299934387207


Rank: 7 | Passage ID: 830809 | BM25 Score: 11.67129898071289


Rank: 8 | Passage ID: 2225931 | BM25 Score: 11.636099815368652


Rank: 9 | Passage ID: 7619756 | BM25 Score: 11.509300231933594


Rank: 10 | Passage ID: 4959087 | BM25 Score: 11.47029972076416
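
To build intuition for what the two arguments to `set_bm25_similarity` control, here is a minimal sketch of the textbook BM25 term score.
It assumes that the first argument is `k1` (term-frequency saturation), that the second is `b` (length normalization), and that the defaults are `k1=0.9` and `b=0.4`.
Lucene's implementation differs in its details, so these numbers will not match the scores printed above; the function name and the toy statistics are made up for illustration:

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1, b):
    """Textbook BM25 score of one query term in one document (illustrative only)."""
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    length_norm = 1 - b + b * doc_len / avg_doc_len
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)


# Toy statistics (not taken from the MS MARCO index): a term that occurs twice
# in a short passage and twice in a long passage.
stats = dict(tf=2, avg_doc_len=60, n_docs=1000, doc_freq=50)
for k1, b in [(0.9, 0.4), (0.82, 0.68)]:
    short = bm25_term_score(doc_len=30, k1=k1, b=b, **stats)
    long_ = bm25_term_score(doc_len=120, k1=k1, b=b, **stats)
    print('k1={} b={} | short passage: {:.3f} | long passage: {:.3f}'.format(k1, b, short, long_))
```

With `k1` held fixed, a larger `b` lowers the score of passages longer than average and raises the score of shorter ones, which is one reason a ranking can shift when the parameters change.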


**Exercise:**
Compare the rankings with and without tuned parameters.
Add a new cell to query the index with a different query of your choice, both with untuned and tuned parameters.
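
As a starting point for the comparison, the passage IDs printed above can be lined up directly.
This sketch hard-codes the two top-10 lists from the untuned and tuned outputs; `rank_shifts` is a hypothetical helper, not a Pyserini function:

```python
# Top-10 passage IDs copied from the untuned and tuned outputs above.
untuned = ['2225931', '4959087', '2646484', '2646489', '474761',
           '4834928', '7619756', '830813', '5660817', '830809']
tuned = ['474761', '4834928', '2646484', '830813', '2646489',
         '5660817', '830809', '2225931', '7619756', '4959087']


def rank_shifts(before, after):
    """Map each docid found in both lists to (old_rank, new_rank), using 1-based ranks."""
    new_rank = {docid: r for r, docid in enumerate(after, start=1)}
    return {docid: (r, new_rank[docid])
            for r, docid in enumerate(before, start=1) if docid in new_rank}


shifts = rank_shifts(untuned, tuned)
print('{} of {} passages appear in both top-10 lists'.format(len(shifts), len(untuned)))
for docid, (old, new) in shifts.items():
    print('Passage {}: rank {} -> {}'.format(docid, old, new))
```

In the notebook itself, the two lists could instead be built from the `docid` fields of `interactive_hits` and `interactive_hits_tuned`.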

Note how the ranking has changed.
We can also enable RM3 query expansion to see if it helps with our collection:

In [10]:
searcher.set_rm3_reranker(10, 10, 0.5)
interactive_hits_tuned_rm3 = searcher.search('south african football teams')

for i in range(0, 10):
    print('Rank: {} | Passage ID: {} | BM25 Score: {}'.format(i+1, interactive_hits_tuned_rm3[i].docid, interactive_hits_tuned_rm3[i].score))
    display(HTML('<div style="font-family: Times New Roman; padding-bottom:10px">' + interactive_hits_tuned_rm3[i].content + '</div>'))

Rank: 1 | Passage ID: 830813 | BM25 Score: 2.879300117492676


Rank: 2 | Passage ID: 830809 | BM25 Score: 2.857800006866455


Rank: 3 | Passage ID: 2646484 | BM25 Score: 2.726300001144409


Rank: 4 | Passage ID: 2646489 | BM25 Score: 2.7032999992370605


Rank: 5 | Passage ID: 4959087 | BM25 Score: 2.6442999839782715


Rank: 6 | Passage ID: 8831689 | BM25 Score: 2.6310999393463135


Rank: 7 | Passage ID: 4528472 | BM25 Score: 2.6001999378204346


Rank: 8 | Passage ID: 7619756 | BM25 Score: 2.5982000827789307


Rank: 9 | Passage ID: 3812886 | BM25 Score: 2.5480000972747803


Rank: 10 | Passage ID: 1554969 | BM25 Score: 2.5302000045776367
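
For intuition, RM3 builds a term distribution from the top-ranked feedback passages and interpolates it with the original query.
The toy sketch below assumes that the three arguments to `set_rm3_reranker` are the number of feedback terms, the number of feedback documents, and the original query weight, in that order.
It uses uniform document weights and whitespace tokenization, so it is a simplification rather than Anserini's implementation; `rm3_expand` and the feedback passages are made up for illustration:

```python
from collections import Counter


def rm3_expand(query, feedback_docs, fb_terms=10, original_query_weight=0.5):
    """Toy RM3: interpolate query terms with frequent terms from feedback documents."""
    query_counts = Counter(query.lower().split())
    total_q = sum(query_counts.values())
    query_model = {t: c / total_q for t, c in query_counts.items()}

    # Relevance model: average of per-document term distributions (uniform doc weights).
    feedback_model = Counter()
    for doc in feedback_docs:
        counts = Counter(doc.lower().split())
        doc_len = sum(counts.values())
        for term, c in counts.items():
            feedback_model[term] += c / doc_len / len(feedback_docs)
    if not feedback_model:
        return query_model  # nothing to expand with

    # Keep only the top fb_terms terms and renormalize them.
    top = dict(feedback_model.most_common(fb_terms))
    norm = sum(top.values())
    top = {t: w / norm for t, w in top.items()}

    # Interpolate the original query model with the feedback model.
    terms = set(query_model) | set(top)
    return {t: original_query_weight * query_model.get(t, 0.0)
            + (1 - original_query_weight) * top.get(t, 0.0)
            for t in terms}


# Made-up feedback passages for illustration only.
docs = ['kaizer chiefs and orlando pirates are south african football teams',
        'orlando pirates football club plays in soweto']
expanded = rm3_expand('south african football teams', docs, fb_terms=5)
for term, weight in sorted(expanded.items(), key=lambda kv: -kv[1]):
    print('{:10s} {:.3f}'.format(term, weight))
```

The expanded query contains terms that the user never typed, which can pull in passages that the original query did not retrieve, as the new passage IDs in the output above suggest.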


## Batch Retrieval

Previously we interactively queried the index.
However, in a typical experimental setting, you would evaluate over a larger number of queries to test different information needs.

Let's begin by constructing the dev queries and corresponding query IDs:

In [11]:
topics = {}
with open('data/queries.dev.small.tsv') as file:
    for line in file:
        # Each line holds a query ID and the query text, separated by a tab.
        qid, query = line.strip().split('\t')
        topics[int(qid)] = query

print('{} queries total'.format(len(topics)))

6980 queries total


In [0]:
queries = list(topics.values())
qids = [str(t) for t in topics]

**Exercise:**
We looked at these queries in the previous activity.
Once again, find the queries that contain `football`.

In [13]:
[q for q in queries if 'football' in q]

['what conference is bryant for football',
 'what are the leagues of football in rockford il',
 'average pay for nfl football players',
 'who is statesboro new football coach']

Now, let's run all the queries from the dev set.
For the sake of speed, let's again only retrieve the top 5 documents for each query.
Note that this step may still take a while.

In [0]:
searcher = pysearch.SimpleSearcher('indexes/lucene-index.msmarco-passage.pos+docvectors+rawdocs')
bm25_hits = searcher.batch_search(queries, qids, k=5)

Note that the above runs batch retrieval with untuned BM25.
We can repeat with tuned parameters, just like we did for the interactive queries:

In [0]:
searcher.set_bm25_similarity(0.82, 0.68)
bm25_hits_tuned = searcher.batch_search(queries, qids, k=5)

Now let's repeat with RM3 query expansion:

In [0]:
searcher.set_rm3_reranker(10, 10, 0.5)
bm25_hits_tuned_rm3 = searcher.batch_search(queries, qids, k=5)

**Exercise:**
Produce a run for untuned BM25 with RM3.

**Exercise:**
So far we have downloaded the dev queries and retrieved the top passages for them.
Now pull the eval queries and repeat the process.

In [17]:
!gsutil -m cp gs://afirm2020/msmarco_passage/queries.eval.small.tsv data/queries.eval.small.tsv

Copying gs://afirm2020/msmarco_passage/queries.eval.small.tsv...
/ [1/1 files][274.3 KiB/274.3 KiB] 100% Done
Operation completed over 1 objects/274.3 KiB.                                    


In [0]:
eval_topics = {}
with open('data/queries.eval.small.tsv') as file:
    for line in file:
        # Each line holds a query ID and the query text, separated by a tab.
        qid, query = line.strip().split('\t')
        eval_topics[int(qid)] = query

eval_queries = list(eval_topics.values())
eval_qids = [str(t) for t in eval_topics]

In [0]:
eval_searcher = pysearch.SimpleSearcher('indexes/lucene-index.msmarco-passage.pos+docvectors+rawdocs')
bm25_eval_hits = eval_searcher.batch_search(eval_queries, eval_qids, k=5)

## Evaluation

A crucial component of information retrieval research is evaluation and metrics.
The most common tool used to achieve this goal is `trec_eval` developed by [NIST](https://www.nist.gov/).

`trec_eval` defines a number of standard retrieval measures, the details of which can be seen [here](http://www.rafaelglater.com/en/post/learn-how-to-use-trec_eval-to-evaluate-your-information-retrieval-system).
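
To make two of these measures concrete, here is a minimal per-query sketch of recall at k and nDCG at k with binary relevance labels.
`trec_eval` averages such values over all queries and also handles graded labels and score ties, which this sketch does not; the function names, ranking, and judgments below are made up for illustration:

```python
import math


def recall_at_k(ranked_docids, relevant, k):
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked_docids[:k]) & set(relevant)) / len(relevant)


def ndcg_at_k(ranked_docids, relevant, k):
    """nDCG@k with binary gains and a log2(rank + 1) discount."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, docid in enumerate(ranked_docids[:k], start=1)
              if docid in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Made-up ranking and judgments for a single query.
ranking = ['d3', 'd7', 'd1', 'd9', 'd4']
relevant = {'d1', 'd2'}
print('recall@5 = {:.4f}'.format(recall_at_k(ranking, relevant, 5)))
print('ndcg@20  = {:.4f}'.format(ndcg_at_k(ranking, relevant, 20)))
```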

### TREC Format

`trec_eval` requires the runs from various experiments to be expressed in a standard TREC format:

`query_id iter docno rank similarity run_id` delimited by spaces

- `query_id`: query ID
- `iter`: constant, often either 0 or Q0 - required but ignored by `trec_eval`
- `docno`: string values that uniquely identify a document in the collection
- `rank`: integer, often zero indexed
- `similarity`: float value that represents the similarity of the document to the query specified by `query_id`
- `run_id`: string that identifies runs, used to keep track of different experiments - also ignored by `trec_eval`
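
The six fields can be read back with a few lines of Python.
This is a minimal sketch: `RunEntry` and `parse_run_line` are hypothetical names, and the sample line simply follows the format described above:

```python
from collections import namedtuple

RunEntry = namedtuple('RunEntry', 'query_id iter docno rank similarity run_id')


def parse_run_line(line):
    """Parse one whitespace-delimited TREC run line into a RunEntry."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError('expected 6 fields, got {}: {!r}'.format(len(fields), line))
    query_id, iteration, docno, rank, similarity, run_id = fields
    return RunEntry(query_id, iteration, docno, int(rank), float(similarity), run_id)


entry = parse_run_line('901007 Q0 4446100 0 17.485200881958008 msmarco_passage_dev_bm25')
print(entry.docno, entry.rank, entry.similarity)
```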

Evaluation also requires the ground truth in the form of relevance judgements in the qrels file.
The qrels file follows the following format:

`query_id iter docno label`

- `label`: binary code (0 for not relevant and 1 for relevant)
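
As a rough illustration of how the judgments can be used, the sketch below collects the relevant documents for each query.
`load_qrels` is a hypothetical helper and the sample lines are made up; real judgments come from the qrels file:

```python
from collections import defaultdict


def load_qrels(lines):
    """Build {query_id: set of relevant docnos} from qrels lines."""
    relevant = defaultdict(set)
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        query_id, _iteration, docno, label = fields
        if int(label) > 0:
            relevant[query_id].add(docno)
    return dict(relevant)


# Made-up judgments for illustration only.
sample = ['q1 0 d1 1', 'q1 0 d2 0', 'q2 0 d9 1', '']
print(load_qrels(sample))
```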

Convert the hits for both BM25 (tuned and untuned) and BM25+RM3 runs into the TREC format:

In [0]:
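def convert_to_trec_run(experiment, run_dict):
  """Write run_dict (query ID -> list of hits) to run.<experiment>.txt in TREC format."""
  with open('run.{}.txt'.format(experiment), 'w') as run_file:
    for qid in run_dict:
      # Ranks are zero-indexed, following the order in which hits were returned.
      for rank, doc in enumerate(run_dict[qid]):
        run_file.write('{} Q0 {} {} {} {}\n'.format(qid, doc.docid, rank, doc.score, experiment))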

In [0]:
convert_to_trec_run('msmarco_passage_dev_bm25', bm25_hits)
convert_to_trec_run('msmarco_passage_dev_bm25_tuned', bm25_hits_tuned)
convert_to_trec_run('msmarco_passage_dev_bm25_tuned_rm3', bm25_hits_tuned_rm3)

Let's pull `trec_eval` and the qrels file:

In [22]:
!gsutil -m cp -r gs://afirm2020/trec_eval.9.0.4 .
!chmod -R +x trec_eval.9.0.4/
!gsutil -m cp gs://afirm2020/msmarco_passage/qrels.dev.small.tsv data/qrels.dev.small.tsv

Copying gs://afirm2020/trec_eval.9.0.4/CHANGELOG...
Copying gs://afirm2020/trec_eval.9.0.4/Makefile...
Copying gs://afirm2020/trec_eval.9.0.4/README...
Copying gs://afirm2020/trec_eval.9.0.4/README.windows.md...
Copying gs://afirm2020/trec_eval.9.0.4/bpref_bug...
Copying gs://afirm2020/trec_eval.9.0.4/convert_zscores.c...
Copying gs://afirm2020/trec_eval.9.0.4/common.h...
Copying gs://afirm2020/trec_eval.9.0.4/form_prefs_counts.c...
Copying gs://afirm2020/trec_eval.9.0.4/form_res_rels.c...
Copying gs://afirm2020/trec_eval.9.0.4/form_res_rels_jg.c...
Copying gs://afirm2020/trec_eval.9.0.4/



---



Now that we have our runs in the TREC format, we can evaluate them with `trec_eval`.


In [23]:
!head -5 run.msmarco_passage_dev_bm25.txt

901007 Q0 4446100 0 17.485200881958008 msmarco_passage_dev_bm25
901007 Q0 3570493 1 16.313400268554688 msmarco_passage_dev_bm25
901007 Q0 5268062 2 15.82390022277832 msmarco_passage_dev_bm25
901007 Q0 3989753 3 15.741399765014648 msmarco_passage_dev_bm25
901007 Q0 1719770 4 15.68179988861084 msmarco_passage_dev_bm25


In [34]:
!chmod -R +x trec_eval.9.0.4/
!trec_eval.9.0.4/trec_eval -m ndcg_cut.20 -c -m recall.1000 -c data/qrels.dev.small.tsv run.msmarco_passage_dev_bm25.txt

recall_1000           	all	0.2846
ndcg_cut_20           	all	0.1974


In [25]:
!trec_eval.9.0.4/trec_eval -m ndcg_cut.20 -c -m recall.1000 -c data/qrels.dev.small.tsv run.msmarco_passage_dev_bm25_tuned.txt

recall_1000           	all	0.2944
ndcg_cut_20           	all	0.2021


In [35]:
!trec_eval.9.0.4/trec_eval -m ndcg_cut.20 -c -m recall.1000 -c data/qrels.dev.small.tsv run.msmarco_passage_dev_bm25_tuned_rm3.txt

recall_1000           	all	0.2680
ndcg_cut_20           	all	0.1796


**Exercise:**
What can you infer based on these results?
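
To put the three runs side by side, the scores reported above can be expressed as relative changes against untuned BM25.
This is only a sketch: the numbers are copied from the `trec_eval` outputs, and `relative_change` is a hypothetical helper:

```python
# Scores copied from the trec_eval outputs above.
results = {
    'bm25': {'recall_1000': 0.2846, 'ndcg_cut_20': 0.1974},
    'bm25_tuned': {'recall_1000': 0.2944, 'ndcg_cut_20': 0.2021},
    'bm25_tuned_rm3': {'recall_1000': 0.2680, 'ndcg_cut_20': 0.1796},
}


def relative_change(baseline, value):
    """Relative change of value with respect to baseline, as a percentage."""
    return 100.0 * (value - baseline) / baseline


baseline = results['bm25']
for run, scores in results.items():
    deltas = ', '.join('{}: {:+.1f}%'.format(m, relative_change(baseline[m], s))
                       for m, s in scores.items())
    print('{:15s} {}'.format(run, deltas))
```

Keep in mind that each run holds only 5 passages per query, so `recall_1000` here cannot exceed recall at 5.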

**Exercise:**
We obtained the run file for the eval queries in the previous exercise.
Now evaluate it with `trec_eval`.