<a href="https://colab.research.google.com/github/castorini/anserini-notebooks-afirm2020/blob/master/afirm2020_index.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Indexing

In this activity, we are going to index the [MS MARCO](http://www.msmarco.org/) passage collection and explore some features of the index.

## Setup

We are going to use the open-source [Anserini](https://github.com/castorini/anserini) information retrieval toolkit to run the experiment.
Anserini provides an easy-to-use interface over the popular [Apache Lucene](https://lucene.apache.org/) search library to facilitate rapid experimentation.

The `%%capture` magic is used to suppress the output of the cell.
You may comment out this line to view the output.

In [0]:
%%capture

!git clone https://github.com/castorini/anserini.git

Set up Java 11:

In [0]:
%%capture

!apt-get install -y openjdk-11-jdk-headless
%env JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64

After cloning the Anserini repo, build using Maven:

In [0]:
%%capture

# Install Maven
!apt-get install -y maven

# Build Anserini
!cd anserini && mvn clean package appassembler:assemble

Let's upload the Anserini jar so that we can use it directly in future activities:

In [0]:
%%capture

!gsutil -m cp anserini/target/anserini-0.6.1-SNAPSHOT-fatjar.jar gs://afirm2020

Copying file://anserini/target/anserini-0.6.1-SNAPSHOT-fatjar.jar [Content-Type=application/java-archive]...
| [1/1 files][ 55.0 MiB/ 55.0 MiB] 100% Done                                    
Operation completed over 1 objects/55.0 MiB.                                     


Build the `trec_eval` tool (more on this later):

In [0]:
%%capture

!cd anserini/eval && tar xvfz trec_eval.9.0.4.tar.gz && cd trec_eval.9.0.4 && make

## Data Preparation

MS MARCO (MicroSoft MAchine Reading COmprehension) is a large-scale dataset that defines many tasks from question answering to ranking.
Here we focus on the collection designed for passage re-ranking.

The full collection contains about 8.8 million passages; for the re-ranking task, each query is paired with the top 1000 passages retrieved by BM25.

First, create a directory named `data/msmarco_passage` to hold the collection.
Then download the MS MARCO passage collection archive into it (we will extract it after verifying the checksum).

In [0]:
%%capture

!mkdir -p data/msmarco_passage
!wget https://msmarco.blob.core.windows.net/msmarcoranking/collectionandqueries.tar.gz -P data/msmarco_passage

--2019-12-02 14:06:02--  https://msmarco.blob.core.windows.net/msmarcoranking/collectionandqueries.tar.gz
Resolving msmarco.blob.core.windows.net (msmarco.blob.core.windows.net)... 40.112.152.16
Connecting to msmarco.blob.core.windows.net (msmarco.blob.core.windows.net)|40.112.152.16|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1057717952 (1009M) [application/gzip]
Saving to: ‘data/msmarco_passage/collectionandqueries.tar.gz’


2019-12-02 14:07:49 (9.58 MB/s) - ‘data/msmarco_passage/collectionandqueries.tar.gz’ saved [1057717952/1057717952]



Confirm that `collectionandqueries.tar.gz` has an MD5 checksum of `31644046b18952c1386cd4564ba2ae69`.

In [0]:
!md5sum data/msmarco_passage/collectionandqueries.tar.gz

31644046b18952c1386cd4564ba2ae69  data/msmarco_passage/collectionandqueries.tar.gz
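
The same check can also be done from Python, which is handy if you want the notebook to stop on a corrupted download. The following is a minimal sketch using only the standard library; the helper name `md5sum` and the chunk size are choices made here for illustration, not part of Anserini.

```python
import hashlib
import os

def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        # iter() with a sentinel keeps reading until f.read() returns b''.
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()

archive = 'data/msmarco_passage/collectionandqueries.tar.gz'
expected = '31644046b18952c1386cd4564ba2ae69'

# Only run the comparison if the archive has actually been downloaded.
if os.path.exists(archive):
    print('checksum OK' if md5sum(archive) == expected else 'checksum MISMATCH')
```

Reading in chunks keeps memory use flat even though the archive is about 1 GB.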


Extract the file:

In [0]:
%%capture

!tar -xzvf data/msmarco_passage/collectionandqueries.tar.gz -C data/msmarco_passage

collection.tsv
qrels.dev.small.tsv
qrels.train.tsv
queries.dev.small.tsv
queries.dev.tsv
queries.eval.small.tsv
queries.eval.tsv
queries.train.tsv


As you can see, the original MS MARCO collection is a tab-separated values (TSV) file.
We need to convert the collection into the jsonl format, which Anserini can process.
A jsonl file contains one JSON object per line.
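
To make the target format concrete, here is a rough sketch of what converting a single TSV line might look like. It is not the conversion script itself: the function name `tsv_line_to_json` and the sample passage are made up for illustration, and the field names `id` and `contents` are an assumption about what Anserini's `JsonCollection` reads.

```python
import json

def tsv_line_to_json(line):
    """Turn one 'docid<TAB>text' line into a single-line JSON string."""
    # Split on the first tab only, in case the passage text contains tabs.
    doc_id, text = line.rstrip('\n').split('\t', 1)
    # Field names are assumed, see the note above.
    return json.dumps({'id': doc_id, 'contents': text})

# Hypothetical input line, not taken from the real collection.
sample = '7\tan example passage\n'
print(tsv_line_to_json(sample))
# prints {"id": "7", "contents": "an example passage"}
```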

In [0]:
!cd anserini && python ./src/main/python/msmarco/convert_collection_to_jsonl.py \
 --collection_path ../data/msmarco_passage/collection.tsv --output_folder ../data/msmarco_passage/collection_jsonl

Converting collection...
Converted 0 docs in 1 files
Converted 100000 docs in 1 files
Converted 200000 docs in 1 files
Converted 300000 docs in 1 files
Converted 400000 docs in 1 files
Converted 500000 docs in 1 files
Converted 600000 docs in 1 files
Converted 700000 docs in 1 files
Converted 800000 docs in 1 files
Converted 900000 docs in 1 files
Converted 1000000 docs in 2 files
Converted 1100000 docs in 2 files
Converted 1200000 docs in 2 files
Converted 1300000 docs in 2 files
Converted 1400000 docs in 2 files
Converted 1500000 docs in 2 files
Converted 1600000 docs in 2 files
Converted 1700000 docs in 2 files
Converted 1800000 docs in 2 files
Converted 1900000 docs in 2 files
Converted 2000000 docs in 3 files
Converted 2100000 docs in 3 files
Converted 2200000 docs in 3 files
Converted 2300000 docs in 3 files
Converted 2400000 docs in 3 files
Converted 2500000 docs in 3 files
Converted 2600000 docs in 3 files
Converted 2700000 docs in 3 files
Converted 2800000 docs in 3 files
Conv

The above command should generate 9 jsonl files in our `data/msmarco_passage/collection_jsonl` directory, each with 1M lines (except for the last one, which should have 841,823 lines).

In [0]:
!wc -l data/msmarco_passage/collection_jsonl/*

   1000000 data/msmarco_passage/collection_jsonl/docs00.json
   1000000 data/msmarco_passage/collection_jsonl/docs01.json
   1000000 data/msmarco_passage/collection_jsonl/docs02.json
   1000000 data/msmarco_passage/collection_jsonl/docs03.json
   1000000 data/msmarco_passage/collection_jsonl/docs04.json
   1000000 data/msmarco_passage/collection_jsonl/docs05.json
   1000000 data/msmarco_passage/collection_jsonl/docs06.json
   1000000 data/msmarco_passage/collection_jsonl/docs07.json
    841823 data/msmarco_passage/collection_jsonl/docs08.json
   8841823 total


Let's remove the original files to make room for the index:

In [0]:
!rm data/msmarco_passage/*.tsv
!rm data/msmarco_passage/*.tar.gz
!rm -rf sample_data

## Indexing

Some common indexing options with Anserini:

- `input`: Path to collection
- `threads`: Number of threads to run
- `collection`: Type of Anserini collection, e.g., JsonCollection
- `generator`: Type of document generator, e.g., LuceneDocumentGenerator, TweetGenerator (subclass of LuceneDocumentGenerator for TREC Microblog)
- `index`: Path to index output
- `storePositions`: Boolean flag to store positions
- `storeDocvectors`: Boolean flag to store document vectors
- `storeRawDocs`: Boolean flag to store raw document text
- `keepStopwords`: Boolean flag to keep stopwords (False by default) 
- `stemmer`: Stemmer to use ([Porter](http://snowball.tartarus.org/algorithms/porter/stemmer.html) by default)
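
As a rough illustration of how these options turn into command-line arguments, the sketch below assembles an argument list in Python. The helper `build_command` is hypothetical; the flag names and values are the ones used in this notebook, and options that take a value are kept separate from the boolean flags.

```python
import shlex

def build_command(binary, options, flags):
    """Build an argument list: '-name value' pairs first, then bare boolean flags."""
    args = [binary]
    for name, value in options.items():
        args += ['-' + name, str(value)]
    args += ['-' + flag for flag in flags]
    return args

options = {
    'collection': 'JsonCollection',
    'generator': 'LuceneDocumentGenerator',
    'input': '../data/msmarco_passage/collection_jsonl',
    'index': '../indexes/lucene-index.msmarco-passage.pos+docvectors+rawdocs',
    'threads': 9,
}
flags = ['storePositions', 'storeDocvectors', 'storeRawDocs']

command = build_command('target/appassembler/bin/IndexCollection', options, flags)
# Quote each argument so the printed line can be pasted into a shell.
print(' '.join(shlex.quote(arg) for arg in command))
```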

In [0]:
!cd anserini && sh target/appassembler/bin/IndexCollection -collection JsonCollection -input ../data/msmarco_passage/collection_jsonl \
 -index ../indexes/lucene-index.msmarco-passage.pos+docvectors+rawdocs -generator LuceneDocumentGenerator -threads 9 \
 -storePositions -storeDocvectors -storeRawDocs

2019-12-02 14:13:22,449 INFO  [main] index.IndexCollection (IndexCollection.java:560) - DocumentCollection path: ../data/msmarco_passage/collection_jsonl
2019-12-02 14:13:22,452 INFO  [main] index.IndexCollection (IndexCollection.java:561) - Index path: ../indexes/lucene-index.msmarco-passage.pos+docvectors+rawdocs
2019-12-02 14:13:22,452 INFO  [main] index.IndexCollection (IndexCollection.java:562) - CollectionClass: JsonCollection
2019-12-02 14:13:22,453 INFO  [main] index.IndexCollection (IndexCollection.java:563) - Generator: LuceneDocumentGenerator
2019-12-02 14:13:22,453 INFO  [main] index.IndexCollection (IndexCollection.java:564) - Threads: 9
2019-12-02 14:13:22,455 INFO  [main] index.IndexCollection (IndexCollection.java:565) - Stemmer: porter
2019-12-02 14:13:22,456 INFO  [main] index.IndexCollection (IndexCollection.java:566) - Keep stopwords? false
2019-12-02 14:13:22,456 INFO  [main] index.IndexCollection (IndexCollection.java:567) - Store positions? true
2019-12-02 14:13:

Check the index at the specified destination:

In [0]:
!du -h indexes/lucene-index.msmarco-passage.pos+docvectors+rawdocs

4.1G	indexes/lucene-index.msmarco-passage.pos+docvectors+rawdocs


Let's upload our index to a bucket on Google Cloud Storage so that we can re-use it in later notebooks:

In [0]:
!gsutil -m cp -r indexes gs://afirm2020

Copying file://indexes/lucene-index.msmarco-passage.pos+docvectors+rawdocs/_1_Lucene50_0.doc [Content-Type=application/msword]...
Copying file://indexes/lucene-index.msmarco-passage.pos+docvectors+rawdocs/_0_Lucene50_0.doc [Content-Type=application/msword]...
Copying file://indexes/lucene-index.msmarco-passage.pos+docvectors+rawdocs/_3.fnm [Content-Type=application/octet-stream]...
Copying file://indexes/lucene-index.msmarco-passage.pos+docvectors+rawdocs/_6_Lucene50_0.doc [Content-Type=application/msword]...
Copying file://indexes/lucene-index.msmarco-passage.pos+docvectors+rawdocs/_0.tvd [Content-Type=application/octet-stream]...
/ [0/128 files][    0.0 B/  4.1 GiB]   0% Done

## Explore the Index

We can explore the index with [Pyserini](https://github.com/castorini/pyserini), the Python interface to Anserini.

### Setup

Install Python dependencies:

In [0]:
%%capture
# Note that we're using an experimental TestPyPI release, not the stable release in PyPI
!pip install pyjnius==1.2
!pip install -i https://test.pypi.org/simple/ pyserini==0.6.1.post1

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"

Collecting pyjnius==1.2
  Downloading https://files.pythonhosted.org/packages/b6/57/c90acf31322e6417f06c90410dbfcb149633a6006b7efbf99dfebe177c1f/pyjnius-1.2.0.tar.gz
Building wheels for collected packages: pyjnius
  Building wheel for pyjnius (setup.py) ... done
  Created wheel for pyjnius: filename=pyjnius-1.2.0-cp36-cp36m-linux_x86_64.whl size=812604 sha256=07940f9787623482184eaf4dd5c64b2e8920998f2e9c73d50da7a2555a3304f7
  Stored in directory: /root/.cache/pip/wheels/c1/2d/85/9884050da2f10b9f72b029f34bedef0993c339437aa956906f
Successfully built pyjnius
Installing collected packages: pyjnius
Successfully installed pyjnius-1.2.0
Looking in indexes: https://test.pypi.org/simple/
Collecting pyserini==0.6.1.post1
  Downloading https://test-files.pythonhosted.org/packages/53/b9/2173ce091267418f7446c3d22e61e9b894ce15be800105e12c618ee422f7/pyserini-0.6.1.post1-py3-none-any.whl (53.0MB)
     |████████████████████████████████| 53.0MB 46kB/s 
Installing collected packages: 

Fix known issue with pyjnius (see [this explanation](https://github.com/castorini/pyserini/blob/master/README.md#known-issues) for details):

In [0]:
%%capture

!mkdir -p /usr/lib/jvm/java-1.11.0-openjdk-amd64/jre/lib/amd64/server/
!ln -s /usr/lib/jvm/java-1.11.0-openjdk-amd64/lib/server/libjvm.so /usr/lib/jvm/java-1.11.0-openjdk-amd64/jre/lib/amd64/server/libjvm.so

Let's point Pyserini to the Anserini jar that we built earlier:

In [0]:
os.environ['ANSERINI_CLASSPATH'] = 'anserini/target'

In [0]:
from pyserini.index import pyutils

index_utils = pyutils.IndexReaderUtils('indexes/lucene-index.msmarco-passage.pos+docvectors+rawdocs')

Collection frequency is the total number of times a term appears across all documents in the index.
Document frequency, as the name implies, is the number of documents that contain the term.

For example, consider a toy index that looks like:

```
Document 1: "here is some text here is some more text"
Document 2: "more texts"
Document 3: "here is a test"
```

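After stemming, `texts` in Document 2 becomes `text`, so the collection frequency of the term `text` is 3 (twice in Document 1 and once in Document 2).
However, its document frequency is 2.
Document frequency can never exceed collection frequency, because every document that contains the term contributes at least one occurrence.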
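
The toy example can be checked with a few lines of plain Python. This is only a sketch: `toy_stem` is a crude stand-in for a real stemmer such as Porter (it just strips a trailing "s" from longer words), and it is defined here purely for illustration.

```python
from collections import Counter

docs = {
    1: 'here is some text here is some more text',
    2: 'more texts',
    3: 'here is a test',
}

def toy_stem(token):
    # Crude stand-in for a real stemmer: drop a plural "s" from longer words.
    return token[:-1] if len(token) > 3 and token.endswith('s') else token

collection_freq = Counter()
doc_freq = Counter()
for text in docs.values():
    terms = [toy_stem(t) for t in text.split()]
    collection_freq.update(terms)   # every occurrence counts
    doc_freq.update(set(terms))     # each document counts at most once per term

print(collection_freq['text'], doc_freq['text'])  # prints "3 2"
```

Because the document-frequency counter sees each term at most once per document, it can never exceed the collection-frequency counter for any term.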

Let's choose a term, say, `atomic`.
We can now compute the collection and document frequencies of the term:

In [0]:
term = 'atomic'

# TODO: fix in source code!
# stemmed_form = index_utils.analyze_term(term)
collection_freq, doc_freq = index_utils.get_term_counts(term)

print('Collection frequency: {}\nDocument frequency: {}'.format(collection_freq, doc_freq))

Collection frequency: 71663
Document frequency: 36252


In simple terms, we can think of the index as a dictionary that maps each term to a postings list.
A postings list is the list of IDs of the documents that contain the term, optionally with the number of occurrences in each document.
Because we stored positions while indexing the collection, we can also access the positions at which the term appears in each document.

Note that tokens are stemmed prior to indexing.
For example, both `atom` and `atoms` share the same postings list.
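
To make the dictionary-of-postings idea concrete, here is a small sketch that builds a positional index over the toy documents from earlier. It is an illustration in plain Python, not how Lucene stores its index; stemming and stopword removal are skipped for brevity.

```python
from collections import defaultdict

docs = {
    1: 'here is some text here is some more text',
    2: 'more texts',
    3: 'here is a test',
}

# term -> list of (document ID, term frequency, positions) postings
index = defaultdict(list)
for doc_id, text in docs.items():
    positions = defaultdict(list)
    for pos, token in enumerate(text.split()):
        positions[token].append(pos)
    for term, pos_list in positions.items():
        index[term].append((doc_id, len(pos_list), pos_list))

for doc_id, term_freq, pos_list in index['here']:
    print('Document ID: {} | Term frequency: {} | Positions: {}'.format(
        doc_id, term_freq, ','.join(str(p) for p in pos_list)))
# prints:
# Document ID: 1 | Term frequency: 2 | Positions: 0,4
# Document ID: 3 | Term frequency: 1 | Positions: 0
```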

Let's get the postings list for the term:

*TODO: better visualization*

In [0]:
postings_list = index_utils.get_postings_list(term)

for posting in postings_list:
  positions = ','.join(str(p) for p in posting.positions)
  print('Document ID: {} | Term frequency: {} | Positions: {}'.format(posting.docid, posting.term_freq, positions))

Document ID: 844 | Term frequency: 1 | Positions: 24
Document ID: 1835 | Term frequency: 2 | Positions: 20,22
Document ID: 2007 | Term frequency: 2 | Positions: 16,44
Document ID: 2162 | Term frequency: 1 | Positions: 40
Document ID: 2917 | Term frequency: 1 | Positions: 9
Document ID: 3453 | Term frequency: 1 | Positions: 60
Document ID: 3454 | Term frequency: 1 | Positions: 46
Document ID: 4373 | Term frequency: 1 | Positions: 19
Document ID: 4375 | Term frequency: 1 | Positions: 38
Document ID: 4712 | Term frequency: 2 | Positions: 16,67
Document ID: 6851 | Term frequency: 8 | Positions: 17,36,44,72,78,91,94,107
Document ID: 6854 | Term frequency: 4 | Positions: 9,16,29,39
Document ID: 6855 | Term frequency: 1 | Positions: 4
Document ID: 6858 | Term frequency: 3 | Positions: 16,21,50
Document ID: 6859 | Term frequency: 1 | Positions: 18
Document ID: 7722 | Term frequency: 1 | Positions: 65
Document ID: 7964 | Term frequency: 5 | Positions: 10,17,70,92,104
Document ID: 7965 | Term fr

Let's get the document vectors of two documents:

In [0]:
doc_vector1 = index_utils.get_document_vector('0')
doc_vector2 = index_utils.get_document_vector('1')

print(doc_vector1)
print(doc_vector2)

{'engin': 1, 'hang': 1, 'import': 1, 'innoc': 1, 'project': 1, 'research': 1, 'cloud': 1, 'scientif': 2, 'impress': 1, 'commun': 1, 'live': 1, 'meant': 1, 'over': 1, 'mind': 1, 'intellect': 1, 'obliter': 1, 'manhattan': 1, 'onli': 1, 'amid': 1, 'truli': 1, 'presenc': 1, 'equal': 1, 'what': 1, 'success': 2, 'atom': 1, 'achiev': 1, 'thousand': 1, 'hundr': 1}
{'ii': 1, 'scienc': 1, 'peac': 1, 'impact': 1, 'bring': 1, 'manhattan': 1, 'continu': 1, 'project': 1, 'war': 1, 'it': 2, 'bomb': 1, 'energi': 1, 'help': 1, 'legaci': 1, 'world': 1, 'have': 1, 'end': 1, 'histori': 1, 'atom': 2, 'us': 1}


The document vector maps each stemmed term in a document to its frequency, which gives a succinct representation of the overall document.
We can use the respective representations of two documents to judge their similarity.
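
One common way to compare two such vectors is cosine similarity. As a reference, and using notation introduced here rather than anything defined by Pyserini, write $d_1(t)$ and $d_2(t)$ for the frequency of term $t$ in each document (zero if the term is absent):

$$
\cos(d_1, d_2) = \frac{\sum_{t} d_1(t)\, d_2(t)}{\sqrt{\sum_{t} d_1(t)^2}\;\sqrt{\sum_{t} d_2(t)^2}}
$$

Only terms that occur in both documents contribute to the numerator, while each factor in the denominator is the Euclidean norm of one vector, so the result lies between 0 and 1 for term-frequency vectors.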

In [0]:
import math

def dot_prod(doc1, doc2):
  # Only terms present in both documents contribute to the dot product.
  common_tokens = set(doc1.keys()) & set(doc2.keys())
  return sum(doc1[t] * doc2[t] for t in common_tokens)

def cosine_similarity(doc1, doc2):
  # Divide by the Euclidean norm of each vector (square root on both).
  return dot_prod(doc1, doc2) / (math.sqrt(dot_prod(doc1, doc1)) * math.sqrt(dot_prod(doc2, doc2)))

round(cosine_similarity(doc_vector1, doc_vector2), 4)

0.1345

*TODO: Different similarities with better examples, better visualization*