# Indexing Exercise

This exercise has two parts: 

- In part 1, we are going to index the [MS MARCO](http://www.msmarco.org/) passage collection with the Pyserini toolkit and explore some features of the index. For this part, you only need to run the code and understand it. You will use the index and code snippets in the next assignment.

- In part 2, we are going to write code that generates an inverted index and use it to index part of the MS MARCO collection. For this part, you first need to run sections 1.1 and 1.2 of part 1 to set up the environment and prepare the data.




## PART 1: Generate the index via Pyserini

We use the [Anserini](https://github.com/castorini/anserini) toolkit and its Python interface [Pyserini](https://github.com/castorini/pyserini) to run our experiments.

*This part is based on the Anserini/Pyserini tutorials. You can learn more by checking their repositories and tutorials.*

### 1.1 Set up the environment

Install Pyserini:

In [None]:
# !pip install pyserini
# I commented the pip install statement out
# because I run the pip installs manually in the terminal

Clone the Anserini repository from GitHub:

In [1]:
# NOTE: Many commands here work if I execute them in my bash terminal, but not in the notebook.
    # That is why the outputs you see here may not be as expected.
    # I altered the code here so that it is the exact same command as I input in my terminal.

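!cd Assignments && git clone https://github.com/castorini/anserini.git
# Each `!` line runs in its own shell, so change into the clone before checking out the commit
!cd Assignments/anserini && git checkout ad5ba1c76196436f8a0e28efdb69960d4873efe3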

Cloning into 'anserini'...

### 1.2 Get the collection and prepare the files
MS MARCO (MicroSoft MAchine Reading COmprehension) is a large-scale dataset that defines many tasks from question answering to ranking. Here we focus on the collection designed for passage re-ranking.

In [None]:
!wget https://msmarco.blob.core.windows.net/msmarcoranking/collection.tar.gz -P Assignments/Data/

In [5]:
!ls Assignments/Data/ 
!tar xvfz Assignments/Data/collection.tar.gz -C Assignments/Data

'ls' is not recognized as an internal or external command,
operable program or batch file.


tar: Error opening archive: Failed to open 'data/msmarco_passage/collection.tar.gz'


The original MS MARCO collection is a tab-separated values (TSV) file. We need to convert the collection into the jsonl format, which Anserini can process. A jsonl file contains one JSON object per line.

The command below generates 9 jsonl files in the output folder (`Assignments/Data/CollectionJsonl` in the command as written), each with 1M lines (except for the last one, which should have 841,823 lines).

In [6]:
!cd Assignments/anserini && python ./src/main/python/msmarco/convert_collection_to_jsonl.py \
 --collection_path ../Data/collection.tsv --output_folder ../Data/CollectionJsonl

The system cannot find the path specified.
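
To make the conversion step concrete, here is a minimal sketch of what a TSV-to-jsonl conversion does. It is not the Anserini script itself: the function name `tsv_to_jsonl_lines` and the sample passages are made up for illustration, and the sketch assumes every input line has the form `id<TAB>text`.

```python
import json

def tsv_to_jsonl_lines(tsv_lines):
    """Turn 'id<TAB>text' lines into JSON strings with keys 'id' and 'contents'."""
    jsonl_lines = []
    for line in tsv_lines:
        # Split on the first tab only, so any tabs inside the text are preserved
        doc_id, text = line.rstrip("\n").split("\t", 1)
        jsonl_lines.append(json.dumps({"id": doc_id, "contents": text}))
    return jsonl_lines

# Hypothetical sample input (not taken from MS MARCO)
sample = ["0\tFirst example passage.\n", "1\tSecond example passage.\n"]
converted = tsv_to_jsonl_lines(sample)
for json_line in converted:
    print(json_line)
```

Writing each returned string on its own line of a file yields a jsonl file with one JSON object per line.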


**Check the data!**

Each line of a jsonl file is a JSON object with the keys `id` and `contents`:

In [7]:
!wc -l Assignments/Data/CollectionJsonl/*

'wc' is not recognized as an internal or external command,
operable program or batch file.


In [8]:
!head -5 Assignments/Data/CollectionJsonl/docs00.json

'head' is not recognized as an internal or external command,
operable program or batch file.
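
The shell utilities `wc` and `head` are not available in every environment (for example, the Windows command prompt). As a portable alternative, the same checks can be sketched in plain Python. The helper names `count_lines` and `head` are hypothetical, and the demo runs on a small temporary file with made-up content rather than on the real collection.

```python
import json
import os
import tempfile

def count_lines(path):
    """Count the lines in a text file (similar to `wc -l`)."""
    with open(path, "r", encoding="utf-8") as f:
        return sum(1 for _ in f)

def head(path, n=5):
    """Return the first n lines of a jsonl file as parsed JSON objects."""
    records = []
    with open(path, "r", encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i >= n:
                break
            records.append(json.loads(line))
    return records

# Demo on a tiny temporary jsonl file with made-up content
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.json")
    with open(demo_path, "w", encoding="utf-8") as f:
        for i in range(3):
            f.write(json.dumps({"id": str(i), "contents": f"passage {i}"}) + "\n")
    n_lines = count_lines(demo_path)
    first_two = head(demo_path, n=2)

print(n_lines)
print(first_two)
```

To inspect your own data, call the two helpers with the path of one of the generated files, such as `Assignments/Data/CollectionJsonl/docs00.json`.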


Remove the original files to make room for the index.
Check the contents of `Assignments/Data` before and after.

In [None]:
!ls Assignments/Data
!rm Assignments/Data/*.tsv
!ls Assignments/Data
!rm -rf sample_data

### 1.3 Generate the index using Pyserini


Here are some common indexing options with Pyserini (for more options, check Pyserini documentation):

```
* input: Path to collection
* threads: Number of threads to run
* collection: Type of Anserini Collection, e.g., JsonCollection
* generator: Type of Anserini document generator, e.g., DefaultLuceneDocumentGenerator, TweetGenerator (subclass of LuceneDocumentGenerator for TREC Microblog)
* index: Path to index output
* storePositions: Boolean flag to store positions
* storeDocvectors: Boolean flag to store document vectors
* storeRaw: Boolean flag to store raw document text
* keepStopwords: Boolean flag to keep stopwords (False by default)
* stemmer: Stemmer to use (Porter by default)
```

We now have everything in place to index the collection. **Indexing speed may vary; the process may take about 10 minutes (or more) in Google Colab.**




In [9]:
!python -m pyserini.index -collection JsonCollection -generator DefaultLuceneDocumentGenerator -threads 9 \
-input Assignments/Data/CollectionJsonl -index indexes/lucene-index-msmarco-passage -storePositions -storeDocvectors -storeRaw

c:\Users\Daan\Documents\Projecten\ru-information-retrieval-23-24\env\Scripts\python.exe: Error while finding module specification for 'pyserini.index' (ModuleNotFoundError: No module named 'pyserini')


Check the size of the index at the specified destination:

In [None]:
!ls indexes
!du -h indexes/lucene-index-msmarco-passage

### 1.4 Explore Pyserini index

We can now explore the index using the `IndexReader` class of Pyserini.

Read the [Usage of the Index Reader API](https://github.com/castorini/pyserini/blob/master/docs/usage-indexreader.md) guide for more information on accessing and manipulating an inverted index.

In [None]:
from pyserini.index import IndexReader

index_reader = IndexReader('indexes/lucene-index-msmarco-passage')

Compute the collection and document frequencies of a term:

In [None]:
term = 'played'

# Look up its document frequency (df) and collection frequency (cf).
# Note, we use the unanalyzed form:
df, cf = index_reader.get_term_counts(term)

analyzed_form = index_reader.analyze(term)
print(f'Analyzed form of term "{analyzed_form[0]}": df={df}, cf={cf}')
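
The two statistics differ as follows: document frequency (df) counts the documents that contain a term, whereas collection frequency (cf) counts all occurrences of the term across the collection. The toy computation below illustrates the difference in plain Python. It does not use Pyserini, and the three documents and the `term_counts` helper are invented for the example.

```python
def term_counts(term, docs):
    """Return (df, cf) of a term over a list of tokenized documents."""
    df = sum(1 for tokens in docs if term in tokens)  # documents containing the term
    cf = sum(tokens.count(term) for tokens in docs)   # total occurrences of the term
    return df, cf

# Made-up, already tokenized documents
toy_docs = [
    ["play", "the", "game", "play"],
    ["watch", "the", "game"],
    ["play", "music"],
]
df, cf = term_counts("play", toy_docs)
print(f"df={df}, cf={cf}")
```

Here "play" appears in two documents but three times overall, so cf is always at least as large as df.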

Get basic statistics of the index.

Note that unless the underlying index was built with the `-optimize` option (i.e., merging all index segments into a single segment), `unique_terms` will show -1 (think about what the reason could be).

In [None]:
index_reader.stats()

## PART 2: Generate the index yourself

### 2.1 Processing the text

We need to process the text, which includes tokenization, stopword removal, and lowercasing.

In [None]:
STOPWORDS = ['a', 'an', 'and', 'are', 'as', 'at', 'be', 'but', 'by', 'for', 'if', 'in', 'into', 'is', 'it', 'no', 'not', 'of', 'on', 'or', 'such', 'that', 'the', 'their', 'then', 'there', 'these', 'they', 'this', 'to', 'was', 'will', 'with']

def process(text):
    terms = []
    # Remove special characters
    chars = ['\'', '.', ':', ',', '!', '?', '(', ')']
    for ch in chars:
        if ch in text:
            text = text.replace(ch, ' ')
    
    # Lowercasing and stopword removal
    for term in text.split():
        term = term.lower()
        if term not in STOPWORDS:
            terms.append(term)
    return terms
    

### 2.2 Complete the code for Inverted Index

Implement the InvertedIndex class. 

Write the index to a file, where the posting list of each term is written on its own line in this format: `Term1 docID1:freq1 docID2:freq2 ...`, e.g.,

```
term1 1:1 4:2 5:1
term2 2:1 
term3 1:3 3:3 9:2
...
```
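
As an illustration of this file format only (not a solution for the class below), the following sketch formats one posting list as a line. It assumes postings are held as `(doc_id, freq)` pairs; the helper name `format_posting_line` is hypothetical, and your own data structure may differ.

```python
def format_posting_line(term, postings):
    """Format a term and its (doc_id, freq) pairs as 'term docID1:freq1 docID2:freq2 ...'."""
    pairs = " ".join(f"{doc_id}:{freq}" for doc_id, freq in postings)
    return f"{term} {pairs}"

# Reproduces the first example line shown above
line = format_posting_line("term1", [(1, 1), (4, 2), (5, 1)])
print(line)
```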



In [None]:
class InvertedIndex(object):
    def __init__(self):
        self.index = {}

    def add_posting(self, term:str, doc_id:int, count:int):
        """Adds a posting (term and Document ID) to the index."""
        # =======Your code=======
        
        # =======================

    def get_posting(self,term:str):
        """Returns the posting list of the term from the index."""
        # =======Your code=======

        # =======================
        
    def get_dictionary(self):
        """Returns the dictionary of the index (unique terms in the index)."""
        # =======Your code=======

        # =======================
    
    def write_to_file(self, filename_index:str):
        """Writes the index to a textfile."""
        # =======Your code=======
        
        # =======================

Run this to test your code. If everything is correct, you should not get errors here. 

In [None]:
index = InvertedIndex()
index.add_posting("t1", 1, 2)
index.add_posting("t1", 2, 1)
index.add_posting("t2", 2, 3)
assert len(index.get_dictionary()) == 2
assert len(index.get_posting("t1")) == 2
assert index.get_posting("t3") is None
index.write_to_file("data/msmarco_passage/collection_jsonl/text_index.txt")

### 2.3 Index part of the MS MARCO collection

Complete the code to process the text and create the index.
Note that we only index the `docs00.json` file, and creating the index takes a few minutes.

In [None]:
import collections
import json

ind = InvertedIndex()
file = "data/msmarco_passage/collection_jsonl/docs00.json"
index_file = "data/msmarco_passage/collection_jsonl/tiny_index.txt"

def index(jsonl_file):
    with open(jsonl_file, 'r') as f:
        for line in f:
            doc = json.loads(line)
            # =======Your code=======
            
            
            # =======================
            
index(file)
ind.write_to_file(index_file)
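
One possible building block for the loop body, offered as a hint rather than the intended solution: `collections.Counter` turns a list of terms into per-document term frequencies, which map naturally onto `(term, doc_id, count)` postings. The sample document and the whitespace tokenization below are stand-ins for a real passage and the `process` function.

```python
import collections

# Hypothetical document in the jsonl shape used above
doc = {"id": "42", "contents": "index the index and search the index"}

# Stand-in tokenization; in the exercise you would call process() instead
terms = doc["contents"].split()

# Count how often each term occurs in this document
term_freqs = collections.Counter(terms)

# Each (term, count) pair corresponds to one posting for this document
postings = [(term, int(doc["id"]), count) for term, count in term_freqs.items()]
print(postings)
```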


Run this to test your code. 

In [None]:
with open(index_file, 'r') as fp:
    assert len(fp.readlines()) == 698784

assert len(ind.get_posting("pressingly")) == 3
assert len(ind.get_posting("veada")) == 2

## Handing in

Hand in both the result file and the filled-in notebook:

- The result file should be named STUDENTNUMBER_FIRSTNAME_LASTNAME_tiny_index.txt
- The notebook should be named STUDENTNUMBER_FIRSTNAME_LASTNAME_indexing.ipynb