<a href="https://colab.research.google.com/github/castorini/anserini-notebooks/blob/master/pyserini_msmarco_passage_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pyserini Demo on the MS MARCO Passage Dataset

This notebook replicates the BM25 baseline for the [MS MARCO passage ranking task](http://www.msmarco.org/) with [Pyserini](https://github.com/castorini/anserini/blob/master/docs/pyserini.md), the Python interface to [Anserini](http://anserini.io).


## Installation


Install Pyserini and point `JAVA_HOME` at the Java 11 JDK:


In [0]:
%%capture
!pip install pyserini==0.8.0.0

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"

Work around a known issue with pyjnius (see [this explanation](https://github.com/castorini/pyserini/blob/master/README.md#known-issues) for more details):


In [0]:
%%capture
!mkdir -p /usr/lib/jvm/java-1.11.0-openjdk-amd64/jre/lib/amd64/server/
!ln -s /usr/lib/jvm/java-1.11.0-openjdk-amd64/lib/server/libjvm.so /usr/lib/jvm/java-1.11.0-openjdk-amd64/jre/lib/amd64/server/libjvm.so

Let's grab the pre-built index:

In [0]:
%%capture
!wget https://git.uwaterloo.ca/jimmylin/anserini-indexes/raw/master/index-msmarco-passage-20191117-0ed488.tar.gz
!tar xvfz index-msmarco-passage-20191117-0ed488.tar.gz

Sanity check of index size:

In [4]:
!du -h index-msmarco-passage-20191117-0ed488

2.5G	index-msmarco-passage-20191117-0ed488


## Usage

Use the `pysearch` module to search over an index. The questions (called "topics" in TREC parlance) ship with Pyserini:

In [5]:
from pyserini.search import pysearch

topics = pysearch.get_topics('msmarco_passage_dev_subset')
print('{} queries total'.format(len(topics)))

6980 queries total


Let's take a look at a specific question. Topics often have different "fields": "title" is the one we want. (Again, this is just TREC parlance.)

In [6]:
topics['1102400']['title']

'why do bears hibernate'
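
For orientation, the lookup above treats `topics` as a dictionary keyed by query ID, where each value holds that topic's fields. The sketch below walks such a mapping using a hand-built stand-in rather than the real topics: `sample_topics` and `iter_queries` are hypothetical names, only the first entry comes from this notebook, and the second entry is made up for illustration.

```python
# Stand-in for the mapping returned by pysearch.get_topics(); hypothetical data.
sample_topics = {
    '1102400': {'title': 'why do bears hibernate'},  # taken from the cell above
    '0000001': {'title': 'example query text'},      # made-up entry
}

def iter_queries(topics):
    """Yield (query_id, query_text) pairs, sorted by query ID."""
    for qid in sorted(topics):
        yield qid, topics[qid]['title']

for qid, text in iter_queries(sample_topics):
    print(qid, text)
```

Pulling out the `title` field in one place keeps later cells from repeating the `topics[qid]['title']` lookup.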

We can now search:

In [7]:
from pyserini.search import pysearch

searcher = pysearch.SimpleSearcher('index-msmarco-passage-20191117-0ed488')
hits = searcher.search(topics['1102400']['title'])

# Print the top 10 hits (or fewer, if the query returns fewer)
for i in range(min(10, len(hits))):
    print('{} {} {}...'.format(i+1, hits[i].score, hits[i].content[:70]))

1 17.335800170898438 Why do Bears hibernate? March 31, 2010, Joan, Leave a comment. Why do ...
2 13.230899810791016 Why do bears hibernate? Watch this to discover how much effort is spen...
3 13.135700225830078 Technically, as the other anwerer said, bears do not hibernate, but th...
4 13.014599800109863 It is a common misconception that bears hibernate during the winter. W...
5 13.003899574279785 To prepare for hibernation, grizzlies must prepare a den, and consume ...
6 12.689399719238281 Some zoo bears are fed year round, and do not hibernate. Since they do...
7 12.554499626159668 Bears in zoos will not hibernate if food is available, though they wil...
8 12.51710033416748 All kinds of bears technically don't hibernate. They enter into a phas...
9 12.4350004196167 Date: 12-11-2012. It is a common misconception that bears hibernate du...
10 12.374600410461426 While bears tend to slow down during the winter, they are not true hib...
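
The second column of this output is the retrieval score, which for this baseline is a BM25 score. As a rough illustration of how such a score is assembled, here is a minimal sketch of a textbook BM25 over a toy in-memory corpus. It is not Lucene's or Anserini's implementation: the function name `bm25_score`, the toy documents, and the parameter values `k1=0.9` and `b=0.4` are all assumptions chosen for illustration, so its numbers will not match the output above.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=0.9, b=0.4):
    """Textbook BM25 score of one tokenized document for a tokenized query."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    tf = Counter(doc_terms)
    score = 0.0
    for term in set(query_terms):
        df = sum(1 for d in corpus if term in d)  # document frequency
        if df == 0 or tf[term] == 0:
            continue
        # Smoothed inverse document frequency that stays non-negative.
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        # Term-frequency saturation with document-length normalization.
        norm = tf[term] + k1 * (1 - b + b * len(doc_terms) / avg_len)
        score += idf * tf[term] * (k1 + 1) / norm
    return score

# Toy corpus (made up); each document is a list of whitespace tokens.
corpus = [
    'why do bears hibernate in winter'.split(),
    'zoo bears are fed year round'.split(),
    'the weather in winter is cold'.split(),
]
query = 'why do bears hibernate'.split()
ranked = sorted(corpus, key=lambda d: bm25_score(query, d, corpus), reverse=True)
for doc in ranked:
    print('{:.4f} {}'.format(bm25_score(query, doc, corpus), ' '.join(doc)))
```

A document that shares no terms with the query scores zero, and rarer query terms contribute more than common ones.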


Each entry in `hits` holds the `docid`, the retrieval score, and the document content. Here is the full text of the top hit:

In [8]:
from IPython.display import display, HTML
display(HTML('<div style="font-family: Times New Roman; padding-bottom:10px">' + hits[0].content + '</div>'))

Let's run all the queries from the dev set:

In [9]:
from pyserini.search import pysearch

def do_run(run_path, topics, searcher):
    """Search every topic and write the hits in TREC run format."""
    with open(run_path, 'w') as runfile:
        print('Running {} queries in total'.format(len(topics)))
        for cnt, qid in enumerate(topics, start=1):
            # Encode to bytes to work around https://github.com/kivy/pyjnius/issues/437
            query = topics[qid]['title'].encode('utf-8')
            # Retrieve up to 1000 hits per query
            hits = searcher.search(query, 1000)
            for i in range(len(hits)):
                # Columns: query ID, Q0, docid, rank, score, run tag
                runfile.write('{} Q0 {} {} {:.6f} Anserini\n'.format(qid, hits[i].docid, i+1, hits[i].score))
            if cnt % 100 == 0:
                print('{} queries completed'.format(cnt))

searcher = pysearch.SimpleSearcher('index-msmarco-passage-20191117-0ed488')

do_run('run-msmarco-passage-bm25.txt', topics, searcher)


Running 6980 queries in total
100 queries completed
200 queries completed
300 queries completed
400 queries completed
500 queries completed
600 queries completed
700 queries completed
800 queries completed
900 queries completed
1000 queries completed
1100 queries completed
1200 queries completed
1300 queries completed
1400 queries completed
1500 queries completed
1600 queries completed
1700 queries completed
1800 queries completed
1900 queries completed
2000 queries completed
2100 queries completed
2200 queries completed
2300 queries completed
2400 queries completed
2500 queries completed
2600 queries completed
2700 queries completed
2800 queries completed
2900 queries completed
3000 queries completed
3100 queries completed
3200 queries completed
3300 queries completed
3400 queries completed
3500 queries completed
3600 queries completed
3700 queries completed
3800 queries completed
3900 queries completed
4000 queries completed
4100 queries completed
4200 queries completed
4300 queries 
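
Each line that `do_run` writes has six whitespace-separated columns: the query ID, the literal `Q0`, the document ID, the rank, the score, and a run tag. As a quick way to inspect a finished run file, a minimal parser might look like the sketch below. `RunEntry` and `parse_run_line` are hypothetical helper names, and the sample line is made up (its document ID is not a real search result).

```python
from collections import namedtuple

RunEntry = namedtuple('RunEntry', ['qid', 'docid', 'rank', 'score', 'tag'])

def parse_run_line(line):
    """Parse one 'qid Q0 docid rank score tag' line of a run file."""
    qid, q0, docid, rank, score, tag = line.split()
    if q0 != 'Q0':
        raise ValueError('expected Q0 in second column, got {!r}'.format(q0))
    return RunEntry(qid, docid, int(rank), float(score), tag)

# Made-up line in the same layout that do_run writes.
sample = '1102400 Q0 12345 1 17.335800 Anserini'
entry = parse_run_line(sample)
print(entry)
```

Query and document IDs are kept as strings, since they are identifiers rather than quantities.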

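To evaluate the run, download the `jtreceval` tool and the relevance judgments (qrels) for the dev subset:

In [0]: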
%%capture
!wget -O jtreceval-0.0.5-jar-with-dependencies.jar https://search.maven.org/remotecontent?filepath=uk/ac/gla/dcs/terrierteam/jtreceval/0.0.5/jtreceval-0.0.5-jar-with-dependencies.jar
!wget https://raw.githubusercontent.com/castorini/anserini/master/src/main/resources/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt

Score the run against the qrels:

In [11]:
!java -jar jtreceval-0.0.5-jar-with-dependencies.jar qrels.msmarco-passage.dev-subset.txt run-msmarco-passage-bm25.txt

runid                 	all	Anserini
num_q                 	all	6980
num_ret               	all	6974598
num_rel               	all	7437
num_rel_ret           	all	6309
map                   	all	0.1926
gm_map                	all	0.0168
Rprec                 	all	0.1048
bpref                 	all	0.8526
recip_rank            	all	0.1960
iprec_at_recall_0.00  	all	0.1964
iprec_at_recall_0.10  	all	0.1964
iprec_at_recall_0.20  	all	0.1964
iprec_at_recall_0.30  	all	0.1964
iprec_at_recall_0.40  	all	0.1952
iprec_at_recall_0.50  	all	0.1952
iprec_at_recall_0.60  	all	0.1898
iprec_at_recall_0.70  	all	0.1898
iprec_at_recall_0.80  	all	0.1893
iprec_at_recall_0.90  	all	0.1893
iprec_at_recall_1.00  	all	0.1893
P_5                   	all	0.0591
P_10                  	all	0.0394
P_15                  	all	0.0301
P_20                  	all	0.0246
P_30                  	all	0.0182
P_100                 	all	0.0069
P_200                 	all	0.0038
P_500                 	all	0.0017
P_1000           
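
The `recip_rank` row is the mean reciprocal rank: for each query, one over the rank of the first relevant passage, averaged across queries. The sketch below shows that computation on toy data. The function name `mean_reciprocal_rank`, its `cutoff` parameter, and the toy run and qrels are assumptions for illustration, and the function averages over every query in the qrels, which may differ from how a particular evaluation tool handles queries missing from the run. The cutoff is exposed because the MS MARCO passage task reports MRR@10, while the figure above is computed over the full ranked list.

```python
def mean_reciprocal_rank(run, qrels, cutoff=None):
    """Mean over judged queries of 1/rank of the first relevant document.

    run:   dict mapping query ID -> list of doc IDs in rank order
    qrels: dict mapping query ID -> set of relevant doc IDs
    """
    total = 0.0
    for qid, relevant in qrels.items():
        ranking = run.get(qid, [])
        if cutoff is not None:
            ranking = ranking[:cutoff]
        for rank, docid in enumerate(ranking, start=1):
            if docid in relevant:
                total += 1.0 / rank
                break  # only the first relevant document counts
    return total / len(qrels)

# Toy data (made up): q1 finds its relevant doc at rank 2, q2 at rank 3.
toy_run = {'q1': ['d3', 'd1', 'd7'], 'q2': ['d9', 'd4', 'd2']}
toy_qrels = {'q1': {'d1'}, 'q2': {'d2'}}
print(mean_reciprocal_rank(toy_run, toy_qrels))            # (1/2 + 1/3) / 2
print(mean_reciprocal_rank(toy_run, toy_qrels, cutoff=2))  # (1/2 + 0) / 2
```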