<a href="https://colab.research.google.com/github/castorini/anserini-notebooks/blob/master/pyserini_robust04_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pyserini Demo on Robust04

This notebook provides a brief overview of how to use [Pyserini](https://github.com/castorini/anserini/blob/master/docs/pyserini.md), the Python interface to [Anserini](http://anserini.io), to search the collection from the TREC 2004 Robust Track.


## Installation


Install Pyserini, download the matching Anserini fatjar, and point `JAVA_HOME` at the Java 11 JDK:

In [0]:
%%capture

!pip install pyserini==0.6.0.0
!wget -O anserini-0.6.0-fatjar.jar "https://search.maven.org/remotecontent?filepath=io/anserini/anserini/0.6.0/anserini-0.6.0-fatjar.jar"

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"

Work around a known issue with pyjnius by making `libjvm.so` available under the `jre/lib/amd64/server/` path (see [this issue](https://github.com/castorini/anserini/issues/832)):


In [0]:
%%capture

!mkdir -p /usr/lib/jvm/java-1.11.0-openjdk-amd64/jre/lib/amd64/server/
!ln -s /usr/lib/jvm/java-1.11.0-openjdk-amd64/lib/server/libjvm.so /usr/lib/jvm/java-1.11.0-openjdk-amd64/jre/lib/amd64/server/libjvm.so
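
Before moving on, it can help to confirm that the files the cells above were supposed to create are actually in place, since `%%capture` hides any error output. The following is a minimal sketch: `missing_paths` is a hypothetical helper (not part of Pyserini), and the paths are simply the ones used in the cells above.

```python
import os

def missing_paths(paths):
    """Return the subset of `paths` that do not exist on disk."""
    return [p for p in paths if not os.path.exists(p)]

# Paths taken from the cells above; adjust them if your environment differs.
expected = [
    "anserini-0.6.0-fatjar.jar",
    os.environ.get("JAVA_HOME", ""),  # an unset JAVA_HOME is reported as ''
    "/usr/lib/jvm/java-1.11.0-openjdk-amd64/jre/lib/amd64/server/libjvm.so",
]

missing = missing_paths(expected)
print("Missing:", missing if missing else "nothing")
```

If anything is reported as missing, rerun the corresponding cell without `%%capture` to see the underlying error.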

Let's grab the pre-built index:

In [0]:
%%capture

!wget https://www.dropbox.com/s/mdoly9sjdalh44x/lucene-index.robust04.pos%2Bdocvectors%2Brawdocs.tar.gz
!tar xvfz lucene-index.robust04.pos+docvectors+rawdocs.tar.gz

Sanity check of index size:

In [0]:
!du -h lucene-index.robust04.pos+docvectors+rawdocs

2.1G	lucene-index.robust04.pos+docvectors+rawdocs


## Usage

You can use `pysearch` to search over an index. Here's the basic usage:

In [0]:
from pyserini.search import pysearch

searcher = pysearch.SimpleSearcher('lucene-index.robust04.pos+docvectors+rawdocs')
hits = searcher.search('hubble space telescope')

# Prints the first 10 hits
for i in range(0, 10):
    print('{} {} {}'.format(i+1, hits[i].docid, hits[i].score))

1 FT934-5418 18.774799346923828
2 LA052890-0021 18.53849983215332
3 LA071090-0047 18.464799880981445
4 LA070990-0052 18.20669937133789
5 FT921-7107 18.186899185180664
6 LA062990-0180 17.624900817871094
7 LA042590-0135 17.597200393676758
8 LA040190-0178 17.385499954223633
9 LA042790-0070 17.276899337768555
10 FT944-128 17.27239990234375
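
The loop above assumes that at least 10 hits come back. As a hedged sketch of a slightly more defensive pattern, the helper below truncates to at most `k` hits and formats them as TREC-style run lines (`topic Q0 docid rank score tag`). The name `to_trec_run`, the topic id `"demo"`, and the tag `"pyserini"` are illustrative placeholders; the helper relies only on the `docid` and `score` attributes shown above, and it uses stand-in objects so that it runs without an index.

```python
from collections import namedtuple

def to_trec_run(topic_id, hits, tag="pyserini", k=10):
    """Format up to k hits as TREC run lines: topic Q0 docid rank score tag."""
    lines = []
    # Slicing avoids an IndexError when fewer than k hits are returned.
    for rank, hit in enumerate(list(hits)[:k], start=1):
        lines.append("{} Q0 {} {} {:.4f} {}".format(
            topic_id, hit.docid, rank, hit.score, tag))
    return lines

# Stand-in for real search results (values copied from the output above).
Hit = namedtuple("Hit", ["docid", "score"])
demo_hits = [
    Hit("FT934-5418", 18.774799346923828),
    Hit("LA052890-0021", 18.53849983215332),
]

for line in to_trec_run("demo", demo_hits):
    print(line)
```

In the notebook, passing the real `hits` in place of `demo_hits` should work the same way, provided each hit exposes `docid` and `score` as in the loop above.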


Each element of `hits` exposes the `docid`, the retrieval `score`, and the raw document `content`:

In [0]:
# Grab the actual text
hits[0].content

"<DATE>931130\n</DATE>\n<HEADLINE>\nFT  30 NOV 93 / Toil and trouble: Nasa needs to repair its public image, as\nwell as the Hubble telescope\n</HEADLINE>\n<TEXT>\nThe space shuttle Endeavour is scheduled to blast off from Cape Canaveral\ntomorrow morning on what Nasa calls a routine servicing flight - and critics\nsay is a risky make-or-break mission for the troubled space agency.\nThe objective of the 11-day flight is to service and repair the Hubble Space\nTelescope, a Dollars 2bn orbiting observatory which caused Nasa huge\nembarrassment after its launch in 1990, when it turned out that the\ninstrument's 2.4 metre mirror had been ground to the wrong shape. Three\nhundred miles above the earth's obscuring atmosphere, Hubble should have\nenabled astronomers to see seven times further into the universe than ever\nbefore; instead, the faulty mirror produced fuzzy pictures only slightly\nbetter than the most powerful ground-based telescopes.\nThe mission is routine only in the sense tha

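The raw content is SGML-style markup, so individual fields can be pulled out with a small amount of parsing. The sketch below is one possible approach using a regular expression; `extract_field` is a hypothetical helper, and the sample string is just the beginning of the output shown above.

```python
import re

def extract_field(raw, tag):
    """Return the whitespace-normalized text inside <tag>...</tag>, or None."""
    pattern = r"<{0}>(.*?)</{0}>".format(re.escape(tag))
    match = re.search(pattern, raw, flags=re.DOTALL)
    if match is None:
        return None
    return " ".join(match.group(1).split())

# Beginning of the document content shown above.
sample = (
    "<DATE>931130\n</DATE>\n<HEADLINE>\nFT  30 NOV 93 / Toil and trouble: "
    "Nasa needs to repair its public image, as\nwell as the Hubble telescope\n"
    "</HEADLINE>\n"
)

print(extract_field(sample, "DATE"))      # 931130
print(extract_field(sample, "HEADLINE"))
```

In the notebook, `extract_field(hits[0].content, "HEADLINE")` would apply the same logic to a real hit.
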
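Here's how to configure search options such as the BM25 parameters and RM3 relevance feedback. The call below sets BM25 to k1 = 0.9 and b = 0.4, and enables RM3 with 10 feedback terms, 10 feedback documents, and a weight of 0.5 on the original query: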

In [0]:
searcher.set_bm25_similarity(0.9, 0.4)
searcher.set_rm3_reranker(10, 10, 0.5)

hits2 = searcher.search('hubble space telescope')

# Prints the first 10 hits
for i in range(0, 10):
    print('{} {} {}'.format(i+1, hits2[i].docid, hits2[i].score))

1 LA071090-0047 5.841000080108643
2 FT934-5418 5.734799861907959
3 LA052890-0021 5.671299934387207
4 LA070990-0052 5.6072001457214355
5 FT921-7107 5.475399971008301
6 FT944-128 5.230899810791016
7 LA062990-0180 5.211900234222412
8 LA042790-0070 5.192800045013428
9 LA040190-0178 5.103600025177002
10 LA071490-0091 5.097599983215332
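
To see exactly how the two top-10 lists printed above differ, the docids can be compared programmatically. This is a small sketch with a hypothetical helper, `rank_changes`; the docids are copied from the two outputs, so it runs without the index.

```python
first_run = [
    "FT934-5418", "LA052890-0021", "LA071090-0047", "LA070990-0052",
    "FT921-7107", "LA062990-0180", "LA042590-0135", "LA040190-0178",
    "LA042790-0070", "FT944-128",
]
second_run = [
    "LA071090-0047", "FT934-5418", "LA052890-0021", "LA070990-0052",
    "FT921-7107", "FT944-128", "LA062990-0180", "LA042790-0070",
    "LA040190-0178", "LA071490-0091",
]

def rank_changes(before, after):
    """Map each docid in `after` to (old_rank, new_rank); old_rank is None if new."""
    old_rank = {docid: r for r, docid in enumerate(before, start=1)}
    return {docid: (old_rank.get(docid), r)
            for r, docid in enumerate(after, start=1)}

changes = rank_changes(first_run, second_run)
for docid, (old, new) in changes.items():
    if old != new:
        print("{}: {} -> {}".format(docid, "new" if old is None else old, new))

# Documents that were in the first top 10 but not in the second.
dropped = [d for d in first_run if d not in changes]
print("Dropped from top 10:", dropped)
```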


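Note that the results have changed: `LA071090-0047` rises from rank 3 to rank 1, `LA071490-0091` enters the top 10 while `LA042590-0135` drops out, and the scores fall on a different scale (roughly 5 instead of 18).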
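
For intuition about the two numbers passed to `set_bm25_similarity(0.9, 0.4)`, here is a rough sketch of a textbook BM25 term score, assuming the arguments are k1 and b in that order. It is not the code that Anserini or Lucene runs, and the real implementation may differ in details; the collection statistics below are made up purely for illustration.

```python
import math

def bm25_term_score(tf, df, doc_len, avg_doc_len, num_docs, k1=0.9, b=0.4):
    """Textbook BM25 contribution of one query term to one document."""
    # Rarer terms (small df) get a larger inverse document frequency.
    idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
    # b controls how strongly long documents are penalized (b=0: not at all).
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    # k1 controls how quickly repeated occurrences of the term saturate.
    return idf * tf * (k1 + 1) / (tf + norm)

# Illustrative, made-up statistics: the score grows with tf but levels off.
for tf in (1, 2, 4, 8):
    score = bm25_term_score(tf, df=100, doc_len=500, avg_doc_len=500,
                            num_docs=100000)
    print(tf, round(score, 3))
```

Larger k1 values let term frequency keep contributing for longer, and larger b values penalize documents that are longer than average more heavily.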