<a href="https://colab.research.google.com/github/castorini/anserini-notebooks/blob/master/pyserini_robust04_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pyserini Demo on Robust04

This notebook provides a brief overview of how to use [Pyserini](https://github.com/castorini/anserini/blob/master/docs/pyserini.md), the Python interface to [Anserini](http://anserini.io), to search the collection from the TREC 2004 Robust Track.


## Installation


Install the Python dependencies and point `JAVA_HOME` at the Java 11 installation:

In [0]:
%%capture
!pip install pyserini==0.9.3.1

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"

Let's grab the pre-built index:

In [0]:
%%capture
!wget https://git.uwaterloo.ca/jimmylin/anserini-indexes/raw/master/index-robust04-20191213.tar.gz
# Alternate URL: https://www.dropbox.com/s/s91388puqbxh176/index-robust04-20191213.tar.gz
!tar xvfz index-robust04-20191213.tar.gz

Sanity check of index size:

In [3]:
!du -h index-robust04-20191213

2.1G	index-robust04-20191213
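
The same check can be done from Python, which may be useful where `du` is unavailable. The following is a minimal sketch: `dir_size_bytes` is a hypothetical helper (not part of Pyserini), and it sums apparent file sizes, so its total can differ slightly from the disk usage that `du` reports.

```python
import os

def dir_size_bytes(path):
    """Sum the sizes of all regular files under `path` (hypothetical helper)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            # Skip symlinks so a link is not counted as the file it points to.
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total

index_dir = 'index-robust04-20191213'
# Only report a size if the index has actually been unpacked here.
if os.path.isdir(index_dir):
    print(f'{dir_size_bytes(index_dir) / 1024**3:.1f} GiB')
```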


## SimpleSearcher Usage

You can use the `SimpleSearcher` class to search over an index. Here's the basic usage:

In [4]:
from pyserini.search import SimpleSearcher

searcher = SimpleSearcher('index-robust04-20191213')
hits = searcher.search('black bear attacks')

# Prints the first 10 hits
for i in range(0, 10):
    print(f'{i+1:2} {hits[i].docid:15} {hits[i].score:.5f}')

 1 LA092790-0015   7.06680
 2 LA081689-0039   6.89020
 3 FBIS4-16530     6.61630
 4 LA102589-0076   6.46450
 5 FT932-15491     6.25090
 6 FBIS3-12276     6.24630
 7 LA091090-0085   6.17030
 8 FT922-13519     6.04270
 9 LA052790-0205   5.94060
10 LA103089-0041   5.90650
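
The loop above assumes that the query returns at least ten hits and raises an `IndexError` otherwise. A more defensive version might look like the sketch below. Here `format_hits` is a hypothetical helper, and the `Hit` tuple is only a stand-in for Pyserini's result objects (it mimics the `docid` and `score` attributes used above) so that the example runs without an index.

```python
from collections import namedtuple

# Stand-in for a search result; it only mimics the attributes used in this notebook.
Hit = namedtuple('Hit', ['docid', 'score'])

def format_hits(hits, k=10):
    """Return one formatted line per hit, for at most the first k hits."""
    n = min(k, len(hits))
    return [f'{i+1:2} {hits[i].docid:15} {hits[i].score:.5f}' for i in range(n)]

# Two hits copied from the output above; fewer than k is handled gracefully.
sample = [Hit('LA092790-0015', 7.0668), Hit('LA081689-0039', 6.8902)]
for line in format_hits(sample):
    print(line)
```

With the real searcher, the call would presumably be `format_hits(hits)`, since the helper relies only on `len` and indexing.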


Each entry in `hits` holds the `docid`, the retrieval score, and the raw document content:

In [5]:
from IPython.display import display, HTML
display(HTML('<div style="font-family: Times New Roman; padding-bottom:10px">' + hits[0].raw + '</div>'))

Here's how you can configure search options, such as the BM25 parameters and RM3 relevance feedback:

In [6]:
searcher.set_bm25(0.9, 0.4)
searcher.set_rm3(10, 10, 0.5)

hits2 = searcher.search('black bear attacks')

# Prints the first 10 hits
for i in range(0, 10):
    print(f'{i+1:2} {hits2[i].docid:15} {hits2[i].score:.5f}')

 1 LA092790-0015   1.97260
 2 FR940902-1-00057 1.78400
 3 FR940603-0-00087 1.77910
 4 LA102589-0076   1.77360
 5 FR940902-1-00061 1.76380
 6 LA081689-0039   1.76310
 7 FR940902-1-00070 1.73970
 8 FR940902-1-00069 1.73420
 9 LA091090-0085   1.73220
10 FR940603-0-00102 1.72650


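Note that both the ranking and the scores have changed: with RM3 feedback enabled, several Federal Register documents (docids beginning with `FR94`) now appear in the top 10.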
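
For intuition about the two arguments to `set_bm25`, the sketch below scores a single query term with the textbook BM25 formula, where the first argument is `k1` (term-frequency saturation) and the second is `b` (document-length normalization). It is an illustration only: `bm25_term_score` is a hypothetical function, the statistics passed to it are made up, and Lucene's actual implementation differs in its details.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=0.9, b=0.4):
    """Textbook BM25 contribution of one term to one document's score."""
    # Rarer terms (small doc_freq) get a larger inverse document frequency.
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # b controls how strongly longer-than-average documents are penalized.
    length_norm = 1 - b + b * doc_len / avg_doc_len
    # k1 controls how quickly repeated occurrences stop adding to the score.
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)

# Illustrative (made-up) statistics, not taken from the Robust04 index.
stats = dict(doc_len=500, avg_doc_len=400, n_docs=100_000, doc_freq=1_000)
for tf in (1, 2, 4, 8):
    print(tf, round(bm25_term_score(tf, **stats), 3))
```

The printed scores grow with term frequency but with diminishing returns, which is the saturation effect that `k1` tunes.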

## IndexReader Usage

The `IndexReader` class provides various methods to read the index directly. For example, we can fetch a raw document from the index given its `docid`:

In [7]:
from pyserini.index import IndexReader

reader = IndexReader('index-robust04-20191213/')

doc = reader.doc('LA092790-0015').raw()
display(HTML('<div style="font-family: Times New Roman; padding-bottom:10px">' + doc + '</div>'))

Note that the result is exactly the same as displaying the hit contents above. Given the raw text, we can obtain its analyzed form (i.e., tokenized, stemmed, stopwords removed, etc.). Here we show the first ten tokens:

In [8]:
analyzed = reader.analyze(doc)
analyzed[0:10]

['date',
 'p',
 'septemb',
 '27',
 '1990',
 'thursdai',
 'ventura',
 'counti',
 'edit',
 'p']

The index also stores the document vector, which we can obtain as a Python dictionary mapping each analyzed term to its count (i.e., its term frequency).
For brevity, we only look at terms that appear more than once:

In [9]:
doc_vector = reader.get_document_vector('LA092790-0015')
{k: v for k, v in doc_vector.items() if v > 1}

{'advis': 2,
 'anim': 9,
 'area': 4,
 'attack': 3,
 'author': 3,
 'bear': 4,
 'been': 11,
 'black': 2,
 'bobcat': 4,
 'california': 2,
 'cat': 3,
 'counti': 4,
 'coyot': 10,
 'debusscher': 3,
 'deer': 3,
 'depart': 2,
 'drought': 4,
 'dry': 2,
 'elsewher': 2,
 'especi': 2,
 'famili': 2,
 'few': 2,
 'food': 3,
 'forest': 2,
 'game': 2,
 'ha': 3,
 'habitat': 2,
 'have': 16,
 'he': 2,
 'hi': 2,
 'hous': 3,
 'hungri': 3,
 'i': 2,
 'jenk': 4,
 'just': 3,
 'leav': 2,
 'lion': 3,
 'live': 2,
 'lot': 2,
 'month': 4,
 'more': 3,
 'mountain': 3,
 'natur': 2,
 'near': 3,
 'off': 2,
 'offici': 5,
 'ojai': 3,
 'on': 2,
 'out': 4,
 'parch': 2,
 'park': 2,
 'past': 2,
 'peopl': 2,
 'rees': 3,
 'report': 4,
 'resid': 2,
 'rural': 3,
 'sai': 2,
 'said': 19,
 'seen': 3,
 'septemb': 2,
 'sever': 3,
 'she': 3,
 "they'r": 5,
 'two': 3,
 'vallei': 4,
 'ventura': 4,
 'we': 2,
 'who': 2,
 'wild': 3,
 'wildlif': 2,
 'would': 2,
 'yard': 2,
 'year': 3}
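
Because the document vector behaves like an ordinary dictionary, the usual Python tools apply to it. As a small sketch, the hypothetical helper `top_terms` below ranks terms by frequency; the sample dictionary copies a few entries from the output above so that the example runs without the index.

```python
def top_terms(doc_vector, k=5):
    """Return the k most frequent (term, count) pairs, ties broken alphabetically."""
    return sorted(doc_vector.items(), key=lambda kv: (-kv[1], kv[0]))[:k]

# A few entries copied from the document vector shown above.
sample_vector = {'said': 19, 'have': 16, 'been': 11, 'coyot': 10,
                 'anim': 9, 'bear': 4, 'attack': 3, 'black': 2}
print(top_terms(sample_vector, k=3))
# [('said', 19), ('have', 16), ('been', 11)]
```

On the real vector, `top_terms(doc_vector)` would be the corresponding call. The most frequent terms here are common function words, which suggests that a count-based ranking alone says little about what the document is about.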