<a href="https://colab.research.google.com/github/castorini/anserini-notebooks/blob/master/pyserini_robust04_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pyserini Demo on Robust04

This notebook provides a brief overview of how to use [Pyserini](https://github.com/castorini/anserini/blob/master/docs/pyserini.md), the Python interface to [Anserini](http://anserini.io), to search the collection from the TREC 2004 Robust Track.


## Installation


Install Pyserini, download the matching Anserini fatjar, and point `JAVA_HOME` at Java 11:

In [0]:
%%capture
!pip install pyserini==0.6.0.0
!wget -O anserini-0.6.0-fatjar.jar https://search.maven.org/remotecontent?filepath=io/anserini/anserini/0.6.0/anserini-0.6.0-fatjar.jar

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"

Work around an annoying known issue with pyjnius (see [this explanation](https://github.com/castorini/pyserini/blob/master/README.md#known-issues) for more details):


In [0]:
%%capture
!mkdir -p /usr/lib/jvm/java-1.11.0-openjdk-amd64/jre/lib/amd64/server/
!ln -s /usr/lib/jvm/java-1.11.0-openjdk-amd64/lib/server/libjvm.so /usr/lib/jvm/java-1.11.0-openjdk-amd64/jre/lib/amd64/server/libjvm.so

Let's grab the pre-built index:

In [0]:
%%capture
!wget https://www.dropbox.com/s/mdoly9sjdalh44x/lucene-index.robust04.pos%2Bdocvectors%2Brawdocs.tar.gz
!tar xvfz lucene-index.robust04.pos+docvectors+rawdocs.tar.gz

Sanity check of index size:

In [4]:
!du -h lucene-index.robust04.pos+docvectors+rawdocs

2.1G	lucene-index.robust04.pos+docvectors+rawdocs
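The same sanity check can be done from Python, which is handy if you want to assert on the size inside a script. This is a minimal sketch using only the standard library; `dir_size_bytes` and `human_readable` are hypothetical helper names, not part of Pyserini.

```python
import os

def dir_size_bytes(path):
    """Sum the sizes of all regular files under `path` (symlinks are skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total

def human_readable(num_bytes):
    """Format a byte count with binary units, similar in spirit to `du -h`."""
    size = float(num_bytes)
    for unit in ("B", "K", "M", "G", "T"):
        if size < 1024 or unit == "T":
            return "{:.1f}{}".format(size, unit)
        size /= 1024
```

Calling `human_readable(dir_size_bytes('lucene-index.robust04.pos+docvectors+rawdocs'))` should give a figure close to the `du` output above. The two can differ slightly, because `du` reports disk usage in blocks while this sketch sums apparent file sizes.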


## Usage

You can use `pysearch` to search over an index. Here's the basic usage:

In [5]:
from pyserini.search import pysearch

searcher = pysearch.SimpleSearcher('lucene-index.robust04.pos+docvectors+rawdocs')
hits = searcher.search('black bear attacks')

# Prints the first 10 hits
for i in range(0, 10):
    print('{} {} {}'.format(i+1, hits[i].docid, hits[i].score))

1 LA081689-0039 7.941199779510498
2 LA092790-0015 6.970099925994873
3 LA102589-0076 6.802700042724609
4 FBIS4-16530 6.74560022354126
5 FR940902-1-00053 6.743599891662598
6 FT932-15491 6.718599796295166
7 LA021289-0039 6.696400165557861
8 FR940603-0-00083 6.64109992980957
9 FT931-736 6.521299839019775
10 FR940603-0-00114 6.503499984741211
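The loop above assumes that at least 10 hits come back; a query with fewer matches would raise an `IndexError`. Here is a minimal sketch of a more defensive variant. `format_hits` is a hypothetical helper, and `FakeHit` is a stand-in so the example runs without an index; it only mimics the `docid` and `score` attributes used above.

```python
from collections import namedtuple

def format_hits(hits, k=10):
    """Return 'rank docid score' lines for at most the first k hits."""
    lines = []
    for rank in range(min(k, len(hits))):
        hit = hits[rank]
        lines.append('{} {} {:.4f}'.format(rank + 1, hit.docid, hit.score))
    return lines

# Stand-in for real search results, for illustration only
FakeHit = namedtuple('FakeHit', ['docid', 'score'])
demo_hits = [FakeHit('LA081689-0039', 7.941199779510498),
             FakeHit('LA092790-0015', 6.970099925994873)]

for line in format_hits(demo_hits):
    print(line)
```

With real results, `format_hits(hits)` should work the same way as long as `hits` supports `len()` and indexing.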


Each entry in `hits` holds the `docid`, the retrieval score, and the document content:

In [6]:
from IPython.display import display, HTML
display(HTML('<div style="font-family: Times New Roman; padding-bottom:10px">' + hits[0].content + '</div>'))
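If you would rather print a short plain-text preview than render the document as HTML, the markup can be stripped first. This is a rough sketch that assumes the stored content is SGML-style markup; `plain_text_preview` is a hypothetical helper, and the sample string is made up for illustration rather than taken from the collection.

```python
import re

def plain_text_preview(raw, max_chars=300):
    """Drop anything that looks like a tag, collapse whitespace, and truncate."""
    no_tags = re.sub(r'<[^>]+>', ' ', raw)
    collapsed = ' '.join(no_tags.split())
    return collapsed[:max_chars]

# Made-up document in an SGML-like layout, for illustration only
sample = ('<DOC>\n<DOCNO> EXAMPLE-0001 </DOCNO>\n<TEXT>\n<P>\n'
          'A made-up sentence.\n</P>\n</TEXT>\n</DOC>')
print(plain_text_preview(sample))
```

With a real hit, this would be called as `plain_text_preview(hits[0].content)`. A regular expression is a crude way to remove tags, so treat the result as a preview only.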

Here's how you can configure search options, such as BM25 parameters and RM3 relevance feedback:

In [7]:
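# BM25 parameters: k1=0.9, b=0.4
searcher.set_bm25_similarity(0.9, 0.4)
# RM3 relevance feedback: 10 feedback terms, 10 feedback documents, original query weight 0.5
searcher.set_rm3_reranker(10, 10, 0.5)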

hits2 = searcher.search('black bear attacks')

# Prints the first 10 hits
for i in range(0, 10):
    print('{} {} {}'.format(i+1, hits2[i].docid, hits2[i].score))

1 FR940603-0-00083 2.984800100326538
2 FR940902-1-00053 2.9762001037597656
3 FR940902-1-00055 2.7883999347686768
4 FR940902-1-00070 2.7223000526428223
5 FR940902-1-00057 2.713099956512451
6 LA081689-0039 2.7060999870300293
7 FR940603-0-00087 2.6944000720977783
8 FR940603-0-00102 2.6905999183654785
9 FR940603-0-00088 2.676300048828125
10 FR940902-1-00068 2.620699882507324


Note that both the ranking and the scores have changed: FR940603-0-00083, previously ranked 8th, is now the top hit.
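To quantify the difference between the two runs, you can compare the ranked lists directly. The sketch below hard-codes the docids from the two outputs above so that it runs on its own; `rank_changes` is a hypothetical helper, not a Pyserini function.

```python
# Docids copied from the two result lists printed above
before = ['LA081689-0039', 'LA092790-0015', 'LA102589-0076', 'FBIS4-16530',
          'FR940902-1-00053', 'FT932-15491', 'LA021289-0039',
          'FR940603-0-00083', 'FT931-736', 'FR940603-0-00114']
after = ['FR940603-0-00083', 'FR940902-1-00053', 'FR940902-1-00055',
         'FR940902-1-00070', 'FR940902-1-00057', 'LA081689-0039',
         'FR940603-0-00087', 'FR940603-0-00102', 'FR940603-0-00088',
         'FR940902-1-00068']

def rank_changes(before, after):
    """Map each docid in `after` to (old_rank, new_rank); old_rank is None if it is new."""
    old_rank = {docid: i + 1 for i, docid in enumerate(before)}
    return {docid: (old_rank.get(docid), i + 1) for i, docid in enumerate(after)}

changes = rank_changes(before, after)
shared = [docid for docid in after if docid in set(before)]

for docid in shared:
    old, new = changes[docid]
    print('{}: rank {} -> {}'.format(docid, old, new))
```

Only three of the ten docids appear in both lists; the other seven entries in the second run are documents that were not in the original top 10. With live results, the lists could be built as `[h.docid for h in hits]` and `[h.docid for h in hits2]` instead of being hard-coded.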