<a href="https://colab.research.google.com/github/castorini/anserini-notebooks-afirm2020/blob/master/afirm2020_query.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Querying

In this exercise, we first query the index interactively and then produce a TREC run with [Pyserini](https://github.com/castorini/pyserini), the Python interface to Anserini.

## Setup

Install Python dependencies:

In [0]:
%%capture

# Note that we're using an experimental TestPyPI release, not the stable release in PyPI
!pip install pyjnius==1.2
!pip install -i https://test.pypi.org/simple/ pyserini==0.6.1.post1

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"

Fix a known issue with pyjnius (see [this explanation](https://github.com/castorini/pyserini/blob/master/README.md#known-issues) for details):

In [0]:
%%capture

!mkdir -p /usr/lib/jvm/java-1.11.0-openjdk-amd64/jre/lib/amd64/server/
!ln -s /usr/lib/jvm/java-1.11.0-openjdk-amd64/lib/server/libjvm.so /usr/lib/jvm/java-1.11.0-openjdk-amd64/jre/lib/amd64/server/libjvm.so

Let's pull the Anserini jar:

In [0]:
%%capture

!gsutil -m cp gs://afirm2020/anserini-0.6.1-SNAPSHOT-fatjar.jar .

Let's point Pyserini to the Anserini jar that we have just pulled:

In [0]:
os.environ['ANSERINI_CLASSPATH'] = '.'

## Interactive Querying

Pull the pre-built index from GCS:

In [0]:
%%capture

!gsutil -m cp -r gs://afirm2020/indexes .

In [0]:
from pyserini.search import pysearch
import itertools

First, let's see what topics are defined for our collection:

In [0]:
topics = pysearch.get_topics('msmarco_passage_dev_subset')

for q in itertools.islice(topics.items(), 20):
    print('{} {}'.format(q[0], q[1]['title']))

print('\n{} queries total'.format(len(topics)))

2 Androgen receptor define
1215 3 levels of government in canada and their responsibilities
1288 3/5 of 60
1576 60x40 slab cost
2235 Bethel University was founded in what year
2798 Does Suddenlink Carry ESPN3
2962 Explain what a bone scan is and what it is used for.
4696 Is the Louisiana sales tax 4.75
4947 Ludacris Net Worth
5925 Sony PS-LX300USB how to connect to pc
6217 The hormone that does the opposite of calcitonin is
6791 What Does Noel Mean in the Bible
7968 When did the earthquake hit San Francisco during the World Series
8701 _____ is the ability of cardiac pacemaker cells to spontaneously initiate an electrical impulse without being stimulated from another source, such as a nerve.
8714 _____ is the name used to refer to the era of legalized segregation in the united states
8798 _______ is a fuel produced by fermenting crops.
8854 ________ disparity refers to the slightly different view of the world that each eye receives.cyclopeanbinocularmonoculartrichromatic
9083 _________

The hits data structure holds the docid, the retrieval score, as well as the document content.
Let's look at the top 10 passages for the query `hubble space telescope`:

In [0]:
from IPython.display import display, HTML

searcher = pysearch.SimpleSearcher('indexes/lucene-index.msmarco-passage.pos+docvectors+rawdocs')
interactive_hits = searcher.search('hubble space telescope')

for i in range(0, 10):
    print('Rank: {} | Passage ID: {} | BM25 Score: {}'.format(i+1, interactive_hits[i].docid, interactive_hits[i].score))
    display(HTML('<div style="font-family: Times New Roman; padding-bottom:10px">' + interactive_hits[i].content + '</div>'))

Rank: 1 | Passage ID: 7459250 | BM25 Score: 17.55699920654297


Rank: 2 | Passage ID: 5161596 | BM25 Score: 17.27869987487793


Rank: 3 | Passage ID: 5161598 | BM25 Score: 17.200599670410156


Rank: 4 | Passage ID: 2850821 | BM25 Score: 17.078800201416016


Rank: 5 | Passage ID: 6293829 | BM25 Score: 16.884000778198242


Rank: 6 | Passage ID: 5161594 | BM25 Score: 16.834800720214844


Rank: 7 | Passage ID: 1245535 | BM25 Score: 16.823299407958984


Rank: 8 | Passage ID: 6293823 | BM25 Score: 16.711700439453125


Rank: 9 | Passage ID: 4786545 | BM25 Score: 16.60740089416504


Rank: 10 | Passage ID: 7212596 | BM25 Score: 16.604999542236328


The above example uses default parameters.
Let's try setting tuned parameters for this collection:

In [0]:
searcher.set_bm25_similarity(0.82, 0.68)
interactive_hits_tuned = searcher.search('hubble space telescope')

for i in range(0, 10):
    print('Rank: {} | Passage ID: {} | BM25 Score: {}'.format(i+1, interactive_hits_tuned[i].docid, interactive_hits_tuned[i].score))
    display(HTML('<div style="font-family: Times New Roman; padding-bottom:10px">' + interactive_hits_tuned[i].content + '</div>'))

Rank: 1 | Passage ID: 7459250 | BM25 Score: 18.049400329589844


Rank: 2 | Passage ID: 2850821 | BM25 Score: 17.676300048828125


Rank: 3 | Passage ID: 5161598 | BM25 Score: 17.472700119018555


Rank: 4 | Passage ID: 1245535 | BM25 Score: 17.435800552368164


Rank: 5 | Passage ID: 5161596 | BM25 Score: 17.278400421142578


Rank: 6 | Passage ID: 7212596 | BM25 Score: 17.176700592041016


Rank: 7 | Passage ID: 2850825 | BM25 Score: 17.11989974975586


Rank: 8 | Passage ID: 7459249 | BM25 Score: 17.072599411010742


Rank: 9 | Passage ID: 6293823 | BM25 Score: 17.049299240112305


Rank: 10 | Passage ID: 5161594 | BM25 Score: 16.98889923095703
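
To build some intuition for why the two arguments to `set_bm25_similarity` can reorder results, here is a minimal sketch of the textbook BM25 term weight. It assumes the arguments are `k1` (term-frequency saturation) and `b` (length normalization), in that order, and that the defaults are `k1=0.9`, `b=0.4`. The function name and the toy statistics are made up for illustration, and Lucene's actual implementation differs in details such as how document lengths are stored.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, num_docs, doc_freq, k1, b):
    """Textbook BM25 weight of one query term in one document (illustrative only)."""
    # Inverse document frequency: rarer terms count for more.
    idf = math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalization: b=0 ignores document length, b=1 normalizes fully.
    norm = 1 - b + b * (doc_len / avg_doc_len)
    # Term-frequency saturation, controlled by k1.
    return idf * tf * (k1 + 1) / (tf + k1 * norm)

# Toy collection statistics (hypothetical values).
stats = dict(avg_doc_len=60, num_docs=1000, doc_freq=50)

short_doc = dict(tf=2, doc_len=30)   # fewer matches, but a short passage
long_doc = dict(tf=3, doc_len=150)   # more matches, but a long passage

# Compare the assumed defaults against the tuned values used above.
for k1, b in [(0.9, 0.4), (0.82, 0.68)]:
    s = bm25_term_score(k1=k1, b=b, **short_doc, **stats)
    l = bm25_term_score(k1=k1, b=b, **long_doc, **stats)
    print('k1={} b={} | short doc: {:.3f} | long doc: {:.3f}'.format(k1, b, s, l))
```

A larger `b` penalizes long passages more heavily, which is one way two passages can swap places between the default and tuned rankings.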


Note how the ranking has changed.
We can also enable RM3 query expansion to see if it helps with our collection:

In [0]:
searcher.set_rm3_reranker(10, 10, 0.5)
interactive_hits_tuned_rm3 = searcher.search('hubble space telescope')

for i in range(0, 10):
    print('Rank: {} | Passage ID: {} | BM25 Score: {}'.format(i+1, interactive_hits_tuned_rm3[i].docid, interactive_hits_tuned_rm3[i].score))
    display(HTML('<div style="font-family: Times New Roman; padding-bottom:10px">' + interactive_hits_tuned_rm3[i].content + '</div>'))

Rank: 1 | Passage ID: 2850821 | BM25 Score: 5.4944000244140625


Rank: 2 | Passage ID: 7459250 | BM25 Score: 5.4899001121521


Rank: 3 | Passage ID: 5161598 | BM25 Score: 5.4415998458862305


Rank: 4 | Passage ID: 1245535 | BM25 Score: 5.355100154876709


Rank: 5 | Passage ID: 5161594 | BM25 Score: 5.238800048828125


Rank: 6 | Passage ID: 4953234 | BM25 Score: 5.173600196838379


Rank: 7 | Passage ID: 3814658 | BM25 Score: 5.135799884796143


Rank: 8 | Passage ID: 509660 | BM25 Score: 5.13539981842041


Rank: 9 | Passage ID: 2850825 | BM25 Score: 5.128300189971924


Rank: 10 | Passage ID: 5161592 | BM25 Score: 5.123700141906738
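
To give a rough idea of what RM3 does, here is a heavily simplified sketch. It assumes the three arguments to `set_rm3_reranker` are the number of feedback terms, the number of feedback documents, and the weight of the original query. The function `rm3_expand` and the toy documents are hypothetical, and Anserini's implementation differs in many details (analysis, term filtering, score normalization).

```python
from collections import Counter

def rm3_expand(query_terms, feedback_docs, fb_terms=10, original_query_weight=0.5):
    """Simplified RM3: interpolate the query with a feedback term model.

    feedback_docs is a list of (tokens, score) pairs for the top-ranked hits.
    """
    # Original query model: relative frequency of each query term.
    q_counts = Counter(query_terms)
    q_model = {t: c / len(query_terms) for t, c in q_counts.items()}

    # Feedback model: sum over docs of P(term | doc) weighted by normalized doc score.
    total_score = sum(score for _, score in feedback_docs)
    fb_model = Counter()
    for tokens, score in feedback_docs:
        for term, count in Counter(tokens).items():
            fb_model[term] += (count / len(tokens)) * (score / total_score)

    # Keep only the strongest feedback terms and renormalize them.
    top = dict(fb_model.most_common(fb_terms))
    norm = sum(top.values())
    top = {t: w / norm for t, w in top.items()}

    # Interpolate the original query model with the feedback model.
    expanded = {}
    for term in set(q_model) | set(top):
        expanded[term] = (original_query_weight * q_model.get(term, 0.0)
                          + (1 - original_query_weight) * top.get(term, 0.0))
    return expanded

# Toy feedback documents with made-up scores.
docs = [
    ('hubble telescope orbit nasa launched'.split(), 3.0),
    ('hubble space telescope mirror nasa'.split(), 2.0),
]
weights = rm3_expand(['hubble', 'space', 'telescope'], docs, fb_terms=3)
for term, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print('{:10s} {:.3f}'.format(term, w))
```

The expanded query keeps the original terms but also gives weight to terms that are frequent in the top-ranked passages, which can help or hurt depending on how relevant those passages are.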


## Batch Retrieval

Previously we interactively queried the index.
However, in a typical experimental setting, you would evaluate over a larger number of queries to test different information needs.

Let's begin by constructing the dev queries and corresponding query IDs:

In [0]:
queries = [t['title'] for t in topics.values()]
qids = [str(t) for t in topics.keys()]

Now, let's run all the queries from the dev set.
For the sake of speed, let's again only retrieve the top 10 documents for each query:

In [0]:
searcher = pysearch.SimpleSearcher('indexes/lucene-index.msmarco-passage.pos+docvectors+rawdocs')
bm25_hits = searcher.batch_search(queries, qids, k=10)

Note that the above runs batch retrieval with untuned BM25.
We can repeat with tuned parameters, just like we did for the interactive queries:

In [0]:
searcher.set_bm25_similarity(0.82, 0.68)
bm25_hits_tuned = searcher.batch_search(queries, qids, k=10)

Now let's repeat with RM3 query expansion:

In [0]:
searcher.set_rm3_reranker(10, 10, 0.5)
bm25_hits_tuned_rm3 = searcher.batch_search(queries, qids, k=10)
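
Before running a formal evaluation, one quick way to see how much the configurations disagree is to measure how many docids their top-k lists share per query. The sketch below assumes each batch result is a dict mapping a query ID to a ranked list of hits with a `docid` attribute; the `Hit` stand-in and the helper `mean_topk_overlap` are hypothetical names introduced here.

```python
from collections import namedtuple

# Stand-in for a search hit; only the docid attribute is used here.
Hit = namedtuple('Hit', ['docid', 'score'])

def mean_topk_overlap(run_a, run_b, k=10):
    """Average fraction of the top k docids shared by two runs, over common queries."""
    shared_qids = set(run_a) & set(run_b)
    if not shared_qids:
        return 0.0
    total = 0.0
    for qid in shared_qids:
        docs_a = {h.docid for h in run_a[qid][:k]}
        docs_b = {h.docid for h in run_b[qid][:k]}
        total += len(docs_a & docs_b) / k
    return total / len(shared_qids)

# Toy runs with one query each (made-up docids and scores).
toy_a = {'1': [Hit('d1', 3.0), Hit('d2', 2.0)]}
toy_b = {'1': [Hit('d2', 3.5), Hit('d3', 1.0)]}
print(mean_topk_overlap(toy_a, toy_b, k=2))
```

In this notebook, something like `mean_topk_overlap(bm25_hits, bm25_hits_tuned)` would give a rough sense of how often tuning changes the top 10.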

## Evaluation

Evaluation with standard metrics is a crucial component of information retrieval research.
The most common tool for this is `trec_eval`, developed by [NIST](https://www.nist.gov/).

`trec_eval` defines a number of standard retrieval measures, the details of which can be seen [here](http://www.rafaelglater.com/en/post/learn-how-to-use-trec_eval-to-evaluate-your-information-retrieval-system).
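
To make two of these measures concrete, here is a minimal sketch of average precision (whose mean over queries is MAP) and recall at a cutoff, for a single query with binary relevance. The function names and toy data are ours, not part of `trec_eval`, which handles many more cases.

```python
def average_precision(ranked_docids, relevant):
    """Average precision for one query; MAP is the mean of this over all queries."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, docid in enumerate(ranked_docids, start=1):
        if docid in relevant:
            hits += 1
            # Precision at the rank where a relevant document is found.
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)

def recall_at_k(ranked_docids, relevant, k):
    """Fraction of the relevant documents found in the top k of the ranking."""
    if not relevant:
        return 0.0
    found = sum(1 for docid in ranked_docids[:k] if docid in relevant)
    return found / len(relevant)

# Toy ranking and judgments (made up for illustration).
ranking = ['d3', 'd1', 'd7', 'd2']
relevant = {'d1', 'd2', 'd9'}
print('AP: {:.4f}'.format(average_precision(ranking, relevant)))
print('Recall@1000: {:.4f}'.format(recall_at_k(ranking, relevant, k=1000)))
```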

### TREC Format

`trec_eval` requires the runs from various experiments to be expressed in a standard TREC format:

`query_id iter docno rank similarity run_id` delimited by spaces

- `query_id`: query ID
- `iter`: constant, often either 0 or Q0 - required but ignored by `trec_eval`
- `docno`: string values that uniquely identify a document in the collection
- `rank`: integer, often zero indexed
- `similarity`: float value that represents the similarity of the document to the query specified by `query_id`
- `run_id`: string that identifies runs, used to keep track of different experiments - also ignored by `trec_eval`
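
As a small illustration of the six fields, the sketch below formats one run entry as a line and parses it back. The `RunEntry` tuple, the helper names, and the values are hypothetical.

```python
from collections import namedtuple

# One line of a TREC run file, using the field names listed above.
RunEntry = namedtuple('RunEntry', ['query_id', 'iter', 'docno', 'rank', 'similarity', 'run_id'])

def format_run_line(entry):
    """Join the six fields with single spaces."""
    return '{} {} {} {} {} {}'.format(*entry)

def parse_run_line(line):
    """Split a run line on whitespace and convert rank and similarity to numbers."""
    query_id, iteration, docno, rank, similarity, run_id = line.split()
    return RunEntry(query_id, iteration, docno, int(rank), float(similarity), run_id)

# Toy entry with made-up identifiers.
entry = RunEntry('q1', 'Q0', 'doc42', 0, 17.5, 'toy_run')
line = format_run_line(entry)
print(line)
print(parse_run_line(line))
```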

Evaluation also requires the ground truth, in the form of relevance judgments stored in a qrels file.
The qrels file has the following format:

`query_id iter docno label`

- `label`: binary code (0 for not relevant and 1 for relevant)
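
Here is a minimal sketch of reading lines in this format into a lookup of relevant documents per query. The helper `parse_qrels` and the sample lines are hypothetical and only meant to show the structure.

```python
from collections import defaultdict

def parse_qrels(lines):
    """Map each query_id to the set of docnos judged relevant (label > 0)."""
    relevant = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        query_id, _iteration, docno, label = parts
        if int(label) > 0:
            relevant[query_id].add(docno)
    return dict(relevant)

# Toy qrels lines: two relevant judgments and one non-relevant one.
sample = [
    'q1 0 doc42 1',
    'q2 0 doc7 1',
    'q2 0 doc99 0',
]
print(parse_qrels(sample))
```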

Convert the hits for both BM25 (tuned and untuned) and BM25+RM3 into the TREC format:

In [0]:
def convert_to_trec_run(experiment, run_dict):
  with open('run.{}.txt'.format(experiment), 'w') as run_file:
    for qid in run_dict:
      for rank, doc in enumerate(run_dict[qid]):
        run_file.write('{} Q0 {} {} {} {}\n'.format(qid, doc.docid, rank, doc.score, experiment))

In [0]:
convert_to_trec_run('msmarco_passage_dev_bm25', bm25_hits)
convert_to_trec_run('msmarco_passage_dev_bm25_tuned', bm25_hits_tuned)
convert_to_trec_run('msmarco_passage_dev_bm25_tuned_rm3', bm25_hits_tuned_rm3)

Let's pull `trec_eval` and the qrels file:

In [0]:
%%capture

!gsutil -m cp -r gs://afirm2020/trec_eval.9.0.4 .
!chmod -R +x trec_eval.9.0.4/
!gsutil -m cp gs://afirm2020/qrels.msmarco-passage.dev-subset.txt .

In [0]:
!head -10 qrels.msmarco-passage.dev-subset.txt

300674 0 7067032 1
125705 0 7067056 1
94798 0 7067181 1
9083 0 7067274 1
174249 0 7067348 1
320792 0 7067677 1
1090270 0 7067796 1
1101279 0 7067891 1
201376 0 7068066 1
54544 0 7068203 1


Now that we have our runs in the TREC format, we can evaluate them with `trec_eval`.


In [0]:
!trec_eval.9.0.4/trec_eval -m map -c -m recall.1000 -c qrels.msmarco-passage.dev-subset.txt run.msmarco_passage_dev_bm25.txt

map                   	all	0.1803
recall_1000           	all	0.3787


In [0]:
!trec_eval.9.0.4/trec_eval -m map -c -m recall.1000 -c qrels.msmarco-passage.dev-subset.txt run.msmarco_passage_dev_bm25_tuned.txt

map                   	all	0.1835
recall_1000           	all	0.3916


In [0]:
!trec_eval.9.0.4/trec_eval -m map -c -m recall.1000 -c qrels.msmarco-passage.dev-subset.txt run.msmarco_passage_dev_bm25_tuned_rm3.txt

map                   	all	0.1634
recall_1000           	all	0.3713


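Comparing the three runs, tuning the BM25 parameters improves both metrics slightly (MAP from 0.1803 to 0.1835, recall from 0.3787 to 0.3916), while adding RM3 query expansion lowers both (MAP 0.1634, recall 0.3713) on this collection.
Keep in mind that we only retrieved the top 10 passages per query, so `recall_1000` here is effectively recall at 10.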