<a href="https://colab.research.google.com/github/adap7/IRTM/blob/main/IRTM_Neural_Reranking_2024_2024.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Maastricht_University_logo.svg

# Information Retrieval and Text Mining Course - Neural Reranking Tutorial
Authors: Abderrahmane Issam and Jan Scholtes

Version 2024-2025

#Notebook 4

In this notebook we will learn how to use [Pyterrier](https://github.com/terrier-org/pyterrier), a Python framework that is built on top of the Java-based Terrier IR platform. Pyterrier can be used to index different formats of datasets and can be integrated with different models starting from classical approaches like BM25 up to neural models like ColBERT. We will learn how to use Pyterrier for indexing, search and evaluation.

## Setup

### If you have a local GPU (Do the following steps in your local env):
If you have a GPU and you want to run this notebook locally, then I suggest you set up a conda environement as follows:



```
conda create --name ir python=3.11.1 \\
conda install -c conda-forge openjdk=11 \\
pip install notebook
```
The second step is to start jupyter in your local machine as follows:
```
jupyter notebook \
    --NotebookApp.allow_origin='https://colab.research.google.com' \
    --port=8888 \
    --NotebookApp.port_retries=0
```
Then go to `connect to local runtime` (which you will find in the menu where you can change the runtime) and paste the jupter backend URL that tou got in the output of the previous command.

If you are running this locally, then you only need to install packages once, otherwise, you will need to install them at the start of the instance and restart the runtime when required to.

In [1]:
!pip install python-terrier



In [2]:
!pip install --upgrade git+https://github.com/terrierteam/pyterrier_colbert.git

Collecting git+https://github.com/terrierteam/pyterrier_colbert.git
  Cloning https://github.com/terrierteam/pyterrier_colbert.git to /tmp/pip-req-build-lzeslgic
  Running command git clone --filter=blob:none --quiet https://github.com/terrierteam/pyterrier_colbert.git /tmp/pip-req-build-lzeslgic
  Resolved https://github.com/terrierteam/pyterrier_colbert.git to commit ba5c86c0bc8da450dee361140541f35b5349a492
  [1;31merror[0m: [1msubprocess-exited-with-error[0m
  
  [31m×[0m [32mpython setup.py egg_info[0m did not run successfully.
  [31m│[0m exit code: [1;36m1[0m
  [31m╰─>[0m See above for output.
  
  [1;35mnote[0m: This error originates from a subprocess, and is likely not a problem with pip.
  Preparing metadata (setup.py) ... [?25l[?25herror
[1;31merror[0m: [1mmetadata-generation-failed[0m

[31m×[0m Encountered error while generating package metadata.
[31m╰─>[0m See above for output.

[1;35mnote[0m: This is an issue with the package mentioned above, not

## Indexing and Search Using Pyterrier

In [4]:
!pip install faiss-gpu-cu12

import faiss
assert faiss.get_num_gpus() > 0

Collecting faiss-gpu-cu12
  Downloading faiss_gpu_cu12-1.10.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Downloading faiss_gpu_cu12-1.10.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (47.9 MB)
[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━[0m [32m40.8/47.9 MB[0m [31m89.6 MB/s[0m eta [36m0:00:01[0m^C


ModuleNotFoundError: No module named 'faiss'

In [3]:
import pyterrier as pt
if not pt.java.started():
    pt.init()

terrier-assemblies 5.11 jar-with-dependencies not found, downloading to /root/.pyterrier...
Done
terrier-python-helper 0.0.8 jar not found, downloading to /root/.pyterrier...
Done


Java started and loaded: pyterrier.java, pyterrier.terrier.java [version=5.11 (build: craig.macdonald 2025-01-13 21:29), helper_version=0.0.8]
java is now started automatically with default settings. To force initialisation early, run:
pt.java.init() # optional, forces java initialisation
  pt.init()


Pyterrier can create an index from different dataset formats including Pandas DataFrame which we demonstrate below. We will create a synthetic DataFrame of documents, as well as queries and qrels to use for evaluation. After indexing the documents, we create a BM25 model that we will use for search later.

In [4]:
import pandas as pd

# 1. Define Documents
documents = pd.DataFrame([
    {"docno": "d1", "text": "PyTerrier is great for information retrieval."},
    {"docno": "d2", "text": "Terrier is a powerful information retrieval platform."},
    {"docno": "d3", "text": "Python is a popular programming language."},
    {"docno": "d4", "text": "This tutorial introduces PyTerrier basics."}
])

# 2. Define Queries
queries = pd.DataFrame([
    {"query": "information retrieval", "qid": "q1"},
    {"query": "programming tutorial", "qid": "q2"}
])

# 3. Define Relevance Judgments (qrels)
qrels = pd.DataFrame([
    {"qid": "q1", "docno": "d1", "label": 1},
    {"qid": "q1", "docno": "d2", "label": 0},
    {"qid": "q2", "docno": "d3", "label": 1},
    {"qid": "q2", "docno": "d4", "label": 1}
])

# 4. Indexing
index_path = "./index"
!rm -r "./index"    # Remove index if it exists

indexer = pt.index.DFIndexer(index_path)
index_ref = indexer.index(text=documents["text"], docno=documents["docno"])

# 5. Retrieval (BM25)
bm25 = pt.BatchRetrieve(index_ref, wmodel="BM25")

rm: cannot remove './index': No such file or directory


  indexer = pt.index.DFIndexer(index_path)
  bm25 = pt.BatchRetrieve(index_ref, wmodel="BM25")


We can see the files were created:

In [5]:
!ls -ltrh ./index

total 44K
-rw-r--r-- 1 root root    7 Apr 14 12:04 data.direct.bf
-rw-r--r-- 1 root root   68 Apr 14 12:04 data.document.fsarrayfile
-rw-r--r-- 1 root root   64 Apr 14 12:04 data.meta.zdata
-rw-r--r-- 1 root root   32 Apr 14 12:04 data.meta.idx
-rw-r--r-- 1 root root   44 Apr 14 12:04 data.meta-0.fsomapfile
-rw-r--r-- 1 root root    9 Apr 14 12:04 data.inverted.bf
-rw-r--r-- 1 root root 1.1K Apr 14 12:04 data.lexicon.fsomapfile
-rw-r--r-- 1 root root   52 Apr 14 12:04 data.lexicon.fsomapid
-rw-r--r-- 1 root root  321 Apr 14 12:04 data.lexicon.fsomaphash
-rw-r--r-- 1 root root 4.1K Apr 14 12:04 data.properties


We can see statistics of our index as follows:

In [6]:
index = pt.IndexFactory.of(index_ref)
print(index.getCollectionStatistics().toString())

Number of documents: 4
Number of terms: 13
Number of postings: 15
Number of fields: 0
Number of tokens: 15
Field names: []
Positions:   false



In [7]:
terms = index.getLexicon()

for entry in terms:
    print(entry.getKey())

basic
great
introduc
languag
platform
popular
power
program
pyterri
python
retriev
terrier
tutori


### Exercise 1:
The number of terms (13) is less than the number of words in all the 4 documents. Explain the reason for this.

I printed the 13 terms to check them and to see that they re unique terms (after pre processing).

This can happen because we have a total of 13 terms, but actually 15 tokens because some tokens (the rest of 2) are terms that repeat in more than one document, so they are not unique terms, but 15 tokens in all docs.  

These are the terms identified by Pyterrier.

In [8]:
for kv in index.getLexicon():
  print("%s -> %s" % (kv.getKey(), kv.getValue().toString() ) )

basic -> term11 Nt=1 TF=1 maxTF=1 @{0 0 0}
great -> term2 Nt=1 TF=1 maxTF=1 @{0 0 6}
introduc -> term12 Nt=1 TF=1 maxTF=1 @{0 1 0}
languag -> term6 Nt=1 TF=1 maxTF=1 @{0 1 6}
platform -> term4 Nt=1 TF=1 maxTF=1 @{0 2 2}
popular -> term8 Nt=1 TF=1 maxTF=1 @{0 2 6}
power -> term3 Nt=1 TF=1 maxTF=1 @{0 3 2}
program -> term9 Nt=1 TF=1 maxTF=1 @{0 3 6}
pyterri -> term1 Nt=2 TF=2 maxTF=1 @{0 4 2}
python -> term7 Nt=1 TF=1 maxTF=1 @{0 5 0}
retriev -> term0 Nt=2 TF=2 maxTF=1 @{0 5 4}
terrier -> term5 Nt=1 TF=1 maxTF=1 @{0 6 0}
tutori -> term10 Nt=1 TF=1 maxTF=1 @{0 6 4}


We can search a Pyterrier index using BM25 with the following function:

In [9]:
bm25.search("programming language")

Unnamed: 0,qid,docid,docno,rank,score,query
0,1,2,d3,0,2.379879,programming language


In our Pyterrier index, the appearance of 13 unique terms versus a total of 15 tokens illustrates the difference between the vocabulary and the raw term count across the documents. After pre-processing—such as lowercasing, stemming, and possibly stop-word removal—each distinct term is counted only once in the vocabulary, which is why we see 13 unique entries. However, when we count tokens, each occurrence of these terms in the documents is included; here, two of the terms ("pyterri" and "retriev") appear in two different documents, thereby increasing the overall token count to 15. This distinction is essential because, while the unique terms form the base vocabulary for indexing and query matching, the token frequencies are used in scoring algorithms like BM25, directly influencing relevance calculations in retrieval tasks.


### Exercise2

Define each column in the output of search above.

In the search output above, each column provides essential information about the query results. The **qid** column represents the unique identifier assigned to each query, allowing you to trace which query produced which set of results. The **docid** column is the internal identifier automatically assigned to each document during the indexing process, ensuring consistency and efficiency within the system. **Docno** shows the original document number that was defined at the beginning, linking back to the document’s initial reference. The **rank** column indicates the order of the document based on its relevance to the query, with a lower rank (for example, 0 indicating the top result) meaning that the document, such as "doc3" in this instance, is considered the most relevant. The **score** column provides a numerical value that quantifies the document’s relevance—the higher the score, the more pertinent the document is to the query. Finally, the **query** column displays the actual text of the query that was used to retrieve these results, ensuring that you can always see which search input corresponds to the returned documents.

Write answer here:

- qid: the id of each query we defined
- docid: each odcument is getting assigned an ID for the computer during indexing so this is what it will use.
- docno: the actual document number we defined in the beggning
- rank: this shows the rank of the document so basically in this case doc3 was ranked first for this query (and has the higest score). we could show this for all 4 docs.
- score: this is the score we have for the most relevant document related to our query. We want a higher score which means the doc is more relevant.
- query: this is the text querry that we search in our documents


Pyterrier offers a way to evaluate models using a different metrics. The list of possible metrics to use is available here: https://pyterrier.readthedocs.io/en/latest/experiments.html#evaluation-measures-objects. \\

We can also see how `Experiment` accepts a DataFrame of queries and qrels which are used to compute the metrics.

In [10]:
# 6. Evaluation Pipeline
pt.Experiment([bm25], queries, qrels, eval_metrics=["map", "P_10"], names=["BM25"])

Unnamed: 0,name,map,P_10
0,BM25,0.75,0.15


### Exercise 3

Instead of BM25 use a TF_IDF model. Try using it to search for a query then pass it along with bm25 to `Experiment`.

In [12]:
# First, ensure that PyTerrier is imported and initialized
import pyterrier as pt
if not pt.started():
    pt.init()

# Define the TF_IDF retrieval model using PyTerrier's BatchRetrieve component.
tfidf = pt.BatchRetrieve(index, wmodel="TF_IDF")

# Use the TF_IDF model to search for a query
print("TF_IDF Query Results:")
print(tfidf.search("programming language"))
print(tfidf.search("experiment"))
print(tfidf.search("map"))

# Evaluate both BM25 and TF_IDF models using the evaluation pipeline.
results_exp = pt.Experiment(
    [bm25, tfidf],
    queries,
    qrels,
    eval_metrics=["map", "P_10"],
    names=["BM25", "TF_IDF"]
)

print(results_exp)

TF_IDF Query Results:
  qid  docid docno  rank     score                 query
0   1      2    d3     0  2.465764  programming language
Empty DataFrame
Columns: [docid, docno, rank, score, qid, query]
Index: []
Empty DataFrame
Columns: [docid, docno, rank, score, qid, query]
Index: []
     name   map  P_10
0    BM25  0.75  0.15
1  TF_IDF  1.00  0.15


  if not pt.started():
  tfidf = pt.BatchRetrieve(index, wmodel="TF_IDF")


## Neural Reranking with ColBERT

Below we use `pyterrier_colbert` which is a plugin for Pyterrier that makes it possible to use a ColBERT model for indexing and retrieval. We will use the [Vaswani NPL corpus](http://ir.dcs.gla.ac.uk/resources/test_collections/npl/), a corpus of 11,429 scientific abstract, with corresponding queries and relevance assessments.

In [42]:
!pip uninstall faiss-cpu -y
!pip install faiss-gpu

[0m[31mERROR: Could not find a version that satisfies the requirement faiss-gpu (from versions: none)[0m[31m
[0m[31mERROR: No matching distribution found for faiss-gpu[0m[31m
[0m

In [23]:
!rm -rf ./colbertindex

import pyterrier_colbert.indexing

checkpoint="http://www.dcs.gla.ac.uk/~craigm/colbert.dnn.zip"

indexer = pyterrier_colbert.indexing.ColBERTIndexer(checkpoint, "./", "colbertindex", chunksize=3)
indexer.index(pt.get_dataset("irds:vaswani").get_corpus_iter())

vaswani documents:   0%|          | 0/11429 [00:00<?, ?it/s]

[Apr 14, 12:18:29] [0] 		 #> Local args.bsize = 128
[Apr 14, 12:18:29] [0] 		 #> args.index_root = ./
[Apr 14, 12:18:29] [0] 		 #> self.possible_subset_sizes = [69905]


Some weights of ColBERT were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['linear.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


[Apr 14, 12:18:30] #> Loading model checkpoint.
[Apr 14, 12:18:30] #> Loading checkpoint http://www.dcs.gla.ac.uk/~craigm/colbert.dnn.zip
[Apr 14, 12:18:54] #> checkpoint['epoch'] = 0
[Apr 14, 12:18:54] #> checkpoint['batch'] = 44500




[Apr 14, 12:18:54] #> Note: Output directory ./ already exists




[Apr 14, 12:18:54] #> Creating directory ./colbertindex 


[Apr 14, 12:20:41] [0] 		 #> Completed batch #0 (starting at passage #0) 		Passages/min: 6.4k (overall),  6.4k (this encoding),  16722.1M (this saving)
[Apr 14, 12:20:41] [0] 		 [NOTE] Done with local share.
[Apr 14, 12:20:41] [0] 		 #> Joining saver thread.
[Apr 14, 12:20:42] [0] 		 #> Saved batch #0 to ./colbertindex/0.pt 		 Saving Throughput = 1.3M passages per minute.

#> num_embeddings = 581496
[Apr 14, 12:20:42] #> Starting..
[Apr 14, 12:20:42] #> Processing slice #1 of 1 (range 0..1).
[Apr 14, 12:20:42] #> Will write to ./colbertindex/ivfpq.100.faiss.
[Apr 14, 12:20:42] #> Loading ./colbertindex/0.sample ...
#> Sample has

In [24]:
!ls -ltrh ./colbertindex

total 168M
-rw-r--r-- 1 root root 142M Apr 14 12:20 0.pt
-rw-r--r-- 1 root root 4.5M Apr 14 12:20 0.tokenids
-rw-r--r-- 1 root root 7.1M Apr 14 12:20 0.sample
-rw-r--r-- 1 root root  35K Apr 14 12:20 doclens.0.json
-rw-r--r-- 1 root root  24K Apr 14 12:20 docnos.pkl.gz
-rw-r--r-- 1 root root  14M Apr 14 12:20 ivfpq.100.faiss


In [38]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""

In [41]:
indexer = indexer.index(pt.get_dataset("irds:vaswani").get_corpus_iter())
pyterrier_colbert_factory = indexer.ranking_factory()

colbert_e2e = pyterrier_colbert_factory.end_to_end()

AttributeError: 'org.terrier.querying.IndexRef' object has no attribute 'index'

We search using ColBERT as follows. `% 5` is used to retrieve the top 5 most relevant entries.

In [26]:
out = (colbert_e2e % 5).search("chemical reactions")
out

NameError: name 'colbert_e2e' is not defined

In [None]:
out.loc[0, 'query_toks']

In [None]:
out.loc[0, 'query_embs'].shape

### Exercise 4

There are two new columns in the search results: `query_toks` and `query_embs`. Explain what they are and explain the shape of `query_embs`.

In the search results from your ColBERT-based pipeline, you observe two additional columns: query_toks and query_embs, which provide key insights into how the query "chemical reactions" is processed. The query_toks column contains the list of tokens generated after the query undergoes tokenization. Essentially, this column shows how the input query is segmented—for instance, it might split "chemical reactions" into individual tokens such as ["chemical", "reactions"] (depending on the tokenizer's rules and any sub-word segmentation applied).

In contrast, the query_embs column holds the numerical embeddings corresponding to each of these tokens. Each embedding is a high-dimensional vector that represents the semantic characteristics of a token in the latent space learned by the ColBERT model. The shape of query_embs is very informative: it is typically a two-dimensional array with the form (N, D), where N is the number of tokens in the query (i.e., the length of query_toks) and D is the dimensionality of the embedding vectors. For example, if your tokenized query results in 2 tokens (as in ["chemical", "reactions"]), and if the model's embedding dimension is 128, then the shape of query_embs would be (2, 128). This reflects that each token has been mapped to a 128-dimensional vector.

In [30]:
dataset = pt.datasets.get_dataset("vaswani")
index_path = "./index"


!rm -rf ./index
indexer = pt.TRECCollectionIndexer(index_path)

indexer = indexer.index(dataset.get_corpus())

bm25 = pt.BatchRetrieve(indexer, wmodel="BM25")

Downloading vaswani corpus to /root/.pyterrier/corpora/vaswani/corpus


doc-text.trec:   0%|          | 0.00/0.99M [00:00<?, ?iB/s]

  bm25 = pt.BatchRetrieve(indexer, wmodel="BM25")


In the following code we create a sentence transformer reranker.

In [31]:
import pandas as pd
from sentence_transformers import CrossEncoder
crossmodel = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-12-v2', max_length=512)

def _crossencoder_apply(df : pd.DataFrame):
  return crossmodel.predict(list(zip(df['query'].values, df['text'].values)))

cross_encT = pt.apply.doc_score(_crossencoder_apply, batch_size=128)

We create a reranking pipeline that starts with BM25 as a retriever, and ends with using a sentence transformer (cross_encT) for reranking the retrieved documents. The reranking step requires the document text as input, which is not returned by default by bm25, and that is why we add `pt.text.get_text(dataset, 'text')` to retrieve the text documents and add them to the output of BM25.

In [32]:
dataset = pt.get_dataset('irds:vaswani')
cross_enc_rerank = bm25 >> pt.text.get_text(dataset, 'text') >> cross_encT

In [33]:
out = (bm25 % 5).search("chemical reactions")
out

Unnamed: 0,qid,docid,docno,rank,score,query
0,1,9373,9374,0,22.076426,chemical reactions
1,1,8765,8766,1,20.498801,chemical reactions
2,1,7048,7049,2,20.159044,chemical reactions
3,1,4686,4687,3,19.323491,chemical reactions
4,1,10702,10703,4,13.472012,chemical reactions


In [34]:
out = (cross_enc_rerank % 5).search("chemical reactions")
out

[INFO] [starting] building docstore
docs_iter: 100%|██████████████████████| 11429/11429 [00:00<00:00, 73828.50doc/s]
[INFO] [finished] docs_iter: [00:00] [11429doc] [73552.10doc/s]
[INFO] [finished] building docstore [159ms]


Unnamed: 0,qid,docid,docno,score,query,text,rank
0,1,9373,9374,0.624742,chemical reactions,ion neutral reactions a list is given of reac...,0
2,1,7048,7049,0.535322,chemical reactions,some reactions occurring in the earths upper a...,1
3,1,4686,4687,0.287185,chemical reactions,nitrogen oxides and the airglow possible chem...,2
1,1,8765,8766,0.082262,chemical reactions,on the problem of an effective recombination c...,3
4,1,6479,6480,0.023788,chemical reactions,reaction concept in electromagnetic theory a ...,4


The following demonstrates how we can contruct IR pipelines using Pyterrier. The pipeline starts by using bm25 to retrive relevant documents, which are then used by QueryExpantion transformer which expands the query by adding informative terms that are collected from the relevant documents. The last part runs bm25 with the new query. This process is called Pseudo Relevance Feedback (PRF).

In [35]:
query_expansion = bm25 >> pt.rewrite.QueryExpansion(indexer) >> bm25

In [36]:
pt.Experiment(
    [bm25, query_expansion, colbert_e2e, cross_enc_rerank],
    dataset.get_topics(),
    dataset.get_qrels(),
    eval_metrics=["map", "P_10", "mrt"],
    names = ["BM25", "QE", "ColBERT", "BM25 >> CrossEnc"]
)

NameError: name 'colbert_e2e' is not defined

On this small dataset, we can see that BM25 achieves better results than ColBERT. Adding Query Expansion improves MAP a little. But using PRF with ColBERT hurts (ColBERT-PRF) the performance while slowing down the inference because of the extra PRF step.

### Exercise 5

Apply the same process as above on a dataset of your choice. You can find the list of datasets in Pyterrier here: https://pyterrier.readthedocs.io/en/latest/datasets.html#available-datasets

### Exercise 6

Add at least 2 other metrics and explain what each of them is trying to capture.

In [None]:
# Answer both exercise 5 and 6 here

### Exercise 7

Follow the example in this [README](https://github.com/terrierteam/pyterrier_colbert/tree/ba5c86c0bc8da450dee361140541f35b5349a492) to implement Pseudo Relevance Feedback (PRF) with ColBERT. Describe how it works, and add ColBERT-PRF to `Experiment` to evaluate it against the other pipelines.

Write answer here

### Exercise 8

Implement a reranking pipeline with ColBERT and add it to `Experiment`.

Write answer here.

## Arxiv Abstracts Retrieval

In [None]:
!pip install setuptools==68.0.0       # you ca skip this if you are using a local instance

In [None]:
!pip install arxiv==2.1.3

The following code will retrive 1000 abstracts for the query "nlp".

In [37]:
import arxiv

# Search for papers
search = arxiv.Search(
    query="nlp",
    max_results=1000,
    sort_by=arxiv.SortCriterion.SubmittedDate
)

ModuleNotFoundError: No module named 'arxiv'

We construct a dictionary of titles and abstracts.

In [None]:
documents = []
for result in search.results():
    documents.append({
        'title': result.title,
        'abstract': result.summary,

    })

In [None]:
documents[0]

Rerun this in case you had to restart the session.

In [None]:
import pyterrier as pt
if not pt.java.started():
    pt.init()

To create an index, we need to to have a docno field which we create below, and a text field which is the abstract in this case. \\
We use `DFIndexer` which allows us to create by passing a list ids and text.

In [None]:
import pandas as pd

index_path = './arxivindex'
!rm -r './arxivindex'

df = pd.DataFrame({
    'docno': ['doc'+str(i) for i in range(len(documents))],
    'text': [document['abstract'] for document in documents]
})

indexer = pt.DFIndexer(index_path)
indexer.index(docno=df.docno, text=df.text)

We create a BM25 retrieval model for the Arxiv index.

In [None]:
bm25 = pt.BatchRetrieve(indexer, wmodel="BM25")

In [None]:
(bm25 % 5).search("information retrieval")

In [None]:
df[df['docno']=='doc670'].text.item()

### Exercise 9

Create a ColBERT index using the Arxiv documents retrieved above and use it to search for some queries. Try to highlight how it is different from BM25 through the query results. You can also change the query we used to get arxiv pages.