<a href="https://colab.research.google.com/github/castorini/anserini-notebooks/blob/master/pyserini_robust04_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pyserini Demo on Robust04

This notebook provides a brief overview of how to use [Pyserini](https://github.com/castorini/anserini/blob/master/docs/pyserini.md), the Python interface to [Anserini](http://anserini.io), to search the collection from the TREC 2004 Robust Track.


## Installation


Install the Python dependencies and point `JAVA_HOME` at the Java 11 JDK:

In [0]:
%%capture
!pip install pyserini

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"

Work around a known issue with pyjnius (see [this explanation](https://github.com/castorini/pyserini/blob/master/README.md#known-issues) for more details):


In [0]:
%%capture
!mkdir -p /usr/lib/jvm/java-1.11.0-openjdk-amd64/jre/lib/amd64/server/
!ln -sf /usr/lib/jvm/java-1.11.0-openjdk-amd64/lib/server/libjvm.so /usr/lib/jvm/java-1.11.0-openjdk-amd64/jre/lib/amd64/server/libjvm.so

Let's grab the pre-built index:

In [0]:
%%capture
!wget https://www.dropbox.com/s/s91388puqbxh176/index-robust04-20191213.tar.gz
!tar xvfz index-robust04-20191213.tar.gz

As a sanity check, confirm the index size:

In [4]:
!du -h index-robust04-20191213

2.1G	index-robust04-20191213
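The same check can be done from Python, which is handy when the notebook runs somewhere `du` is unavailable. The following is a minimal sketch, not part of Pyserini: `dir_size_bytes` and `human_readable` are hypothetical helper names introduced here. It sums apparent file sizes, so the figure may differ slightly from what `du` reports, since `du` counts disk blocks.

```python
import os

def dir_size_bytes(path):
    """Return the total size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            # Skip symlinks so that linked files are not counted twice.
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total

def human_readable(num_bytes):
    """Format a byte count with a binary-unit suffix, e.g. 2048 -> '2.0K'."""
    for unit in ("B", "K", "M", "G", "T"):
        if num_bytes < 1024 or unit == "T":
            return f"{num_bytes:.1f}{unit}"
        num_bytes /= 1024
```

Calling `human_readable(dir_size_bytes('index-robust04-20191213'))` after the download step should give a value close to the `du` output above.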


## Usage

You can use `pysearch` to search over an index. Here's the basic usage:

In [5]:
from pyserini.search import pysearch

searcher = pysearch.SimpleSearcher('index-robust04-20191213')
hits = searcher.search('black bear attacks')

# Prints the first 10 hits
for i in range(0, 10):
    print('{} {} {}'.format(i+1, hits[i].docid, hits[i].score))

1 LA092790-0015 7.066800117492676
2 LA081689-0039 6.890200138092041
3 FBIS4-16530 6.616300106048584
4 LA102589-0076 6.4644999504089355
5 FT932-15491 6.250899791717529
6 FBIS3-12276 6.246300220489502
7 LA091090-0085 6.170300006866455
8 FT922-13519 6.042699813842773
9 LA052790-0205 5.9405999183654785
10 LA103089-0041 5.906499862670898

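The loop above assumes the query returns at least 10 hits; a rarer query would raise an index error. Here is a hedged sketch of a more defensive version. `Hit` is a hypothetical stand-in for the result objects (only `docid` and `score` are modeled), `format_top_k` is a helper name introduced here, and the sketch assumes `len()` works on whatever `search` returns.

```python
from collections import namedtuple

# Hypothetical stand-in for a search result; only the two fields used here.
Hit = namedtuple("Hit", ["docid", "score"])

def format_top_k(hits, k=10):
    """Return one 'rank docid score' line per hit, for at most k hits."""
    lines = []
    # Stop at the end of the result list if fewer than k hits came back.
    for rank in range(min(k, len(hits))):
        hit = hits[rank]
        lines.append(f"{rank + 1:2} {hit.docid:<15} {hit.score:.4f}")
    return lines

# Sample values copied (rounded) from the output above.
sample = [
    Hit("LA092790-0015", 7.0668),
    Hit("LA081689-0039", 6.8902),
    Hit("FBIS4-16530", 6.6163),
]
for line in format_top_k(sample):
    print(line)
```

With only three sample hits, the helper prints three lines rather than failing on the fourth index.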

Each entry in `hits` holds the `docid`, the retrieval score, and the document content:

In [6]:
from IPython.display import display, HTML
display(HTML('<div style="font-family: Times New Roman; padding-bottom:10px">' + hits[0].content + '</div>'))

Here's how to configure search options, such as the BM25 parameters and RM3 relevance feedback:

In [7]:
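# BM25 parameters: k1 = 0.9, b = 0.4
searcher.set_bm25_similarity(0.9, 0.4)
# RM3: 10 feedback terms, 10 feedback documents, original query weight 0.5
searcher.set_rm3_reranker(10, 10, 0.5)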

hits2 = searcher.search('black bear attacks')

# Prints the first 10 hits
for i in range(0, 10):
    print('{} {} {}'.format(i+1, hits2[i].docid, hits2[i].score))

1 LA092790-0015 1.972599983215332
2 FR940902-1-00057 1.784000039100647
3 FR940603-0-00087 1.779099941253662
4 LA102589-0076 1.7735999822616577
5 FR940902-1-00061 1.763800024986267
6 LA081689-0039 1.763100028038025
7 FR940902-1-00070 1.7396999597549438
8 FR940902-1-00069 1.7342000007629395
9 LA091090-0085 1.732200026512146
10 FR940603-0-00102 1.7265000343322754


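Note that the results have changed: several Federal Register documents (docids beginning with `FR94`) now appear in the top 10, and the scores are on a different scale.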
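To give a feel for what the two settings control, here is a rough sketch of the textbook BM25 term score and of RM3-style query interpolation. It illustrates the general techniques under stated assumptions and is not Anserini's or Lucene's actual implementation, whose formulas may differ in detail. The function names and all toy numbers are made up for this example.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=0.9, b=0.4):
    """Textbook BM25 contribution of one query term to one document.

    k1 controls how quickly repeated occurrences of a term saturate;
    b controls how strongly long documents are penalized.
    """
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    length_norm = 1 - b + b * doc_len / avg_doc_len
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)

def rm3_interpolate(query_weights, feedback_weights, original_query_weight=0.5):
    """Mix original query term weights with feedback term weights.

    Terms that occur only in the feedback documents enter the expanded
    query with weight (1 - original_query_weight) * feedback weight.
    """
    terms = set(query_weights) | set(feedback_weights)
    return {
        t: original_query_weight * query_weights.get(t, 0.0)
        + (1 - original_query_weight) * feedback_weights.get(t, 0.0)
        for t in terms
    }

# Toy numbers, invented for illustration only.
original = {"bear": 0.5, "attacks": 0.5}
feedback = {"bear": 0.6, "grizzly": 0.4}
expanded = rm3_interpolate(original, feedback, original_query_weight=0.5)
for term in sorted(expanded):
    print(term, round(expanded[term], 3))
```

In this toy case the expanded query gains the term `grizzly`, which never appeared in the original query. That is the mechanism by which feedback can surface documents that the original query did not rank highly.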