<a href="https://colab.research.google.com/github/adap7/IRTM/blob/main/IRTM_Neural_Reranking_2024_2024.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Information Retrieval and Text Mining Course - Neural Reranking Tutorial
Authors: Abderrahmane Issam and Jan Scholtes

Version 2024-2025

# Notebook 4

In this notebook we will learn how to use [Pyterrier](https://github.com/terrier-org/pyterrier), a Python framework that is built on top of the Java-based Terrier IR platform. Pyterrier can be used to index different formats of datasets and can be integrated with different models starting from classical approaches like BM25 up to neural models like ColBERT. We will learn how to use Pyterrier for indexing, search and evaluation.

## Setup

### If you have a local GPU (Do the following steps in your local env):
If you have a GPU and you want to run this notebook locally, then I suggest you set up a conda environment as follows:



```
conda create --name ir python=3.11.1
conda activate ir
conda install -c conda-forge openjdk=11
pip install notebook
```
The second step is to start Jupyter on your local machine as follows:
```
jupyter notebook \
    --NotebookApp.allow_origin='https://colab.research.google.com' \
    --port=8888 \
    --NotebookApp.port_retries=0
```
Then go to `connect to local runtime` (which you will find in the menu where you can change the runtime) and paste the Jupyter backend URL that you got in the output of the previous command.

If you are running this locally, then you only need to install packages once, otherwise, you will need to install them at the start of the instance and restart the runtime when required to.

In [1]:
!pip install python-terrier



In [2]:
!pip install --upgrade git+https://github.com/terrierteam/pyterrier_colbert.git

Collecting git+https://github.com/terrierteam/pyterrier_colbert.git
  Cloning https://github.com/terrierteam/pyterrier_colbert.git to /tmp/pip-req-build-0vz3ybpl
  Running command git clone --filter=blob:none --quiet https://github.com/terrierteam/pyterrier_colbert.git /tmp/pip-req-build-0vz3ybpl
  Resolved https://github.com/terrierteam/pyterrier_colbert.git to commit ba5c86c0bc8da450dee361140541f35b5349a492
  error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> See above for output.

  note: This error originates from a subprocess, and is likely not a problem with pip.
  Preparing metadata (setup.py) ... error
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not

## Indexing and Search Using Pyterrier

In [3]:
!pip install faiss-gpu-cu12

import faiss
assert faiss.get_num_gpus() > 0



In [4]:
import pyterrier as pt
if not pt.java.started():
    pt.init()

terrier-assemblies 5.11 jar-with-dependencies not found, downloading to /root/.pyterrier...
Done
terrier-python-helper 0.0.8 jar not found, downloading to /root/.pyterrier...
Done


Java started and loaded: pyterrier.java, pyterrier.terrier.java [version=5.11 (build: craig.macdonald 2025-01-13 21:29), helper_version=0.0.8]
java is now started automatically with default settings. To force initialisation early, run:
pt.java.init() # optional, forces java initialisation
  pt.init()


Pyterrier can create an index from different dataset formats including Pandas DataFrame which we demonstrate below. We will create a synthetic DataFrame of documents, as well as queries and qrels to use for evaluation. After indexing the documents, we create a BM25 model that we will use for search later.

In [5]:
import pandas as pd

# 1. Define Documents
documents = pd.DataFrame([
    {"docno": "d1", "text": "PyTerrier is great for information retrieval."},
    {"docno": "d2", "text": "Terrier is a powerful information retrieval platform."},
    {"docno": "d3", "text": "Python is a popular programming language."},
    {"docno": "d4", "text": "This tutorial introduces PyTerrier basics."}
])

# 2. Define Queries
queries = pd.DataFrame([
    {"query": "information retrieval", "qid": "q1"},
    {"query": "programming tutorial", "qid": "q2"}
])

# 3. Define Relevance Judgments (qrels)
qrels = pd.DataFrame([
    {"qid": "q1", "docno": "d1", "label": 1},
    {"qid": "q1", "docno": "d2", "label": 0},
    {"qid": "q2", "docno": "d3", "label": 1},
    {"qid": "q2", "docno": "d4", "label": 1}
])

# 4. Indexing
index_path = "./index"
!rm -r "./index"    # Remove index if it exists

indexer = pt.index.DFIndexer(index_path)
index_ref = indexer.index(text=documents["text"], docno=documents["docno"])

# 5. Retrieval (BM25)
bm25 = pt.BatchRetrieve(index_ref, wmodel="BM25")

rm: cannot remove './index': No such file or directory




We can see the files were created:

In [6]:
!ls -ltrh ./index

total 44K
-rw-r--r-- 1 root root    7 Apr 15 12:51 data.direct.bf
-rw-r--r-- 1 root root   64 Apr 15 12:51 data.meta.zdata
-rw-r--r-- 1 root root   32 Apr 15 12:51 data.meta.idx
-rw-r--r-- 1 root root   68 Apr 15 12:51 data.document.fsarrayfile
-rw-r--r-- 1 root root   44 Apr 15 12:52 data.meta-0.fsomapfile
-rw-r--r-- 1 root root    9 Apr 15 12:52 data.inverted.bf
-rw-r--r-- 1 root root 1.1K Apr 15 12:52 data.lexicon.fsomapfile
-rw-r--r-- 1 root root   52 Apr 15 12:52 data.lexicon.fsomapid
-rw-r--r-- 1 root root  321 Apr 15 12:52 data.lexicon.fsomaphash
-rw-r--r-- 1 root root 4.1K Apr 15 12:52 data.properties


We can see statistics of our index as follows:

In [7]:
index = pt.IndexFactory.of(index_ref)
print(index.getCollectionStatistics().toString())

Number of documents: 4
Number of terms: 13
Number of postings: 15
Number of fields: 0
Number of tokens: 15
Field names: []
Positions:   false



In [8]:
terms = index.getLexicon()

for entry in terms:
    print(entry.getKey())

basic
great
introduc
languag
platform
popular
power
program
pyterri
python
retriev
terrier
tutori


### Exercise 1:
The number of terms (13) is less than the number of words in all the 4 documents. Explain the reason for this.

I printed the 13 terms to check that they are the unique terms left after pre-processing.

The four documents contain 24 words in total, but Terrier removes stop words (for example "is", "a", and "for"; "information" does not appear in the lexicon either) and stems the remaining words before indexing. This leaves 15 tokens. Two of the stemmed terms ("pyterri" and "retriev") each occur in two documents, so the 15 tokens collapse to 13 unique terms.

These are the terms identified by Pyterrier.

In [9]:
for kv in index.getLexicon():
  print("%s -> %s" % (kv.getKey(), kv.getValue().toString() ) )

basic -> term11 Nt=1 TF=1 maxTF=1 @{0 0 0}
great -> term2 Nt=1 TF=1 maxTF=1 @{0 0 6}
introduc -> term12 Nt=1 TF=1 maxTF=1 @{0 1 0}
languag -> term6 Nt=1 TF=1 maxTF=1 @{0 1 6}
platform -> term4 Nt=1 TF=1 maxTF=1 @{0 2 2}
popular -> term8 Nt=1 TF=1 maxTF=1 @{0 2 6}
power -> term3 Nt=1 TF=1 maxTF=1 @{0 3 2}
program -> term9 Nt=1 TF=1 maxTF=1 @{0 3 6}
pyterri -> term1 Nt=2 TF=2 maxTF=1 @{0 4 2}
python -> term7 Nt=1 TF=1 maxTF=1 @{0 5 0}
retriev -> term0 Nt=2 TF=2 maxTF=1 @{0 5 4}
terrier -> term5 Nt=1 TF=1 maxTF=1 @{0 6 0}
tutori -> term10 Nt=1 TF=1 maxTF=1 @{0 6 4}


We can search a Pyterrier index using BM25 with the following function:

In [10]:
bm25.search("programming language")

| | qid | docid | docno | rank | score | query |
|---|---|---|---|---|---|---|
| 0 | 1 | 2 | d3 | 0 | 2.379879 | programming language |


In our Pyterrier index, the 13 unique terms versus the 15 tokens illustrate the difference between the vocabulary and the total number of term occurrences across the documents. After pre-processing (lowercasing, stop-word removal, and stemming), each distinct term is counted only once in the vocabulary, which is why we see 13 entries. When we count tokens, every occurrence is included; two of the terms ("pyterri" and "retriev") each appear in two documents, which raises the token count to 15. This distinction matters because the unique terms form the vocabulary used for indexing and query matching, while term frequencies feed into scoring models like BM25 and therefore influence the ranking.
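
The difference between terms and tokens can be checked with a few lines of plain Python. This is a minimal sketch: the per-document token lists below are transcribed by hand from the lexicon printed above (an assumption about what Terrier kept for each document), not read from the index itself.

```python
from collections import Counter

# Stemmed, stop-word-filtered tokens per document, written out by hand
# from the lexicon listing above (assumed, not produced by Terrier).
doc_tokens = {
    "d1": ["pyterri", "great", "retriev"],
    "d2": ["terrier", "power", "retriev", "platform"],
    "d3": ["python", "popular", "program", "languag"],
    "d4": ["tutori", "introduc", "pyterri", "basic"],
}

# Tokens: every occurrence counts.
num_tokens = sum(len(tokens) for tokens in doc_tokens.values())

# Terms: each distinct string counts once (the vocabulary).
vocabulary = {token for tokens in doc_tokens.values() for token in tokens}

# Document frequency (Nt in the lexicon listing): number of documents per term.
doc_freq = Counter(term for tokens in doc_tokens.values() for term in set(tokens))
repeated_terms = sorted(term for term, df in doc_freq.items() if df > 1)

print(num_tokens, len(vocabulary))  # 15 13
print(repeated_terms)               # ['pyterri', 'retriev']
```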


### Exercise 2

Define each column in the output of search above.

In the search output above, each column provides information about the query results. The **qid** column is the identifier of the query, which lets you trace which query produced which results. The **docid** column is the internal integer identifier that Terrier assigns to each document during indexing. The **docno** column is the original document identifier that we defined in the `documents` DataFrame. The **rank** column gives the position of the document in the result list, starting at 0 for the top result; here "d3" is the most relevant document. The **score** column is the numerical relevance score computed by the retrieval model (BM25 here): the higher the score, the more relevant the document is estimated to be for the query. Finally, the **query** column contains the text of the query that was used to retrieve these results.

Write answer here:

- qid: the ID of the query (for a single `search` call it is set to 1, as shown above)
- docid: the internal ID that each document is assigned during indexing; this is what the system uses internally.
- docno: the document number we defined at the beginning
- rank: the position of the document in the result list. In this case d3 was ranked first (rank 0) for this query and has the highest score. We could show this for all 4 docs.
- score: the relevance score of the document for our query. A higher score means the document is more relevant.
- query: the text of the query that we search for in our documents
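
To make the relation between `docid`, `docno`, `score` and `rank` concrete, here is a small sketch in plain Python. It is not PyTerrier code: the helper `add_ranks` and all scores except the 2.38 seen above are hypothetical, and the assumption that `docid` is the 0-based position of the document at indexing time is only inferred from the output above (d3 has docid 2).

```python
def add_ranks(rows):
    """Assign 0-based ranks within each query by descending score."""
    by_query = {}
    for row in rows:
        by_query.setdefault(row["qid"], []).append(row)
    ranked = []
    for group in by_query.values():
        ordered = sorted(group, key=lambda r: r["score"], reverse=True)
        for rank, row in enumerate(ordered):
            ranked.append({**row, "rank": rank})
    return ranked

# docid: assumed to be the 0-based position of each docno at indexing time.
docnos = ["d1", "d2", "d3", "d4"]
docids = {docno: position for position, docno in enumerate(docnos)}

# Hypothetical scores for two queries (only 2.38 for d3 comes from the output above).
rows = [
    {"qid": "1", "docno": "d3", "score": 2.38},
    {"qid": "1", "docno": "d4", "score": 0.90},
    {"qid": "2", "docno": "d1", "score": 1.10},
    {"qid": "2", "docno": "d2", "score": 1.70},
]

ranked = add_ranks(rows)
for row in ranked:
    print(row["qid"], docids[row["docno"]], row["docno"], row["rank"], row["score"])
```

Ranks restart at 0 for every `qid`, which is why a result table that covers several queries contains several rows with rank 0.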


Pyterrier offers a way to evaluate models using different metrics. The list of available metrics is here: https://pyterrier.readthedocs.io/en/latest/experiments.html#evaluation-measures-objects.

We can also see how `Experiment` accepts a DataFrame of queries and qrels which are used to compute the metrics.

In [11]:
# 6. Evaluation Pipeline
pt.Experiment([bm25], queries, qrels, eval_metrics=["map", "P_10"], names=["BM25"])

| | name | map | P_10 |
|---|---|---|---|
| 0 | BM25 | 0.75 | 0.15 |
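
The two metrics can be reproduced by hand on this toy collection. The sketch below implements average precision and precision at k in plain Python. The relevant sets come from the `qrels` defined earlier; the rankings are an assumption (one ordering that is consistent with the reported scores), since the actual BM25 rankings for these queries are not shown.

```python
def precision_at_k(ranking, relevant, k):
    """Relevant documents in the top k, divided by k (even if fewer than k were retrieved)."""
    return sum(1 for docno in ranking[:k] if docno in relevant) / k

def average_precision(ranking, relevant):
    """Mean of the precision values at the positions of the relevant documents."""
    hits, total = 0, 0.0
    for position, docno in enumerate(ranking, start=1):
        if docno in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant) if relevant else 0.0

# Documents with label > 0 in the qrels defined earlier.
relevant = {"q1": {"d1"}, "q2": {"d3", "d4"}}

# Assumed rankings: q1 puts the non-relevant d2 above d1.
rankings = {"q1": ["d2", "d1"], "q2": ["d3", "d4"]}

map_score = sum(average_precision(rankings[q], relevant[q]) for q in relevant) / len(relevant)
p_10 = sum(precision_at_k(rankings[q], relevant[q], 10) for q in relevant) / len(relevant)

print(round(map_score, 2), round(p_10, 2))  # 0.75 0.15
```

P@10 is low here simply because each query has at most two relevant documents, so at most 2 of the 10 positions can be filled with relevant results.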


### Exercise 3

Instead of BM25 use a TF_IDF model. Try using it to search for a query then pass it along with bm25 to `Experiment`.

In [12]:
# First, ensure that PyTerrier is imported and initialized
import pyterrier as pt
if not pt.started():
    pt.init()

# Define the TF_IDF retrieval model using PyTerrier's BatchRetrieve component.
tfidf = pt.BatchRetrieve(index, wmodel="TF_IDF")

# Use the TF_IDF model to search for a query
print("TF_IDF Query Results:")
print(tfidf.search("programming language"))
print(tfidf.search("experiment"))
print(tfidf.search("map"))

# Evaluate both BM25 and TF_IDF models using the evaluation pipeline.
results_exp = pt.Experiment(
    [bm25, tfidf],
    queries,
    qrels,
    eval_metrics=["map", "P_10"],
    names=["BM25", "TF_IDF"]
)

print(results_exp)

TF_IDF Query Results:
  qid  docid docno  rank     score                 query
0   1      2    d3     0  2.465764  programming language
Empty DataFrame
Columns: [docid, docno, rank, score, qid, query]
Index: []
Empty DataFrame
Columns: [docid, docno, rank, score, qid, query]
Index: []
     name   map  P_10
0    BM25  0.75  0.15
1  TF_IDF  1.00  0.15




## Neural Reranking with ColBERT

Below we use `pyterrier_colbert`, a plugin for Pyterrier that makes it possible to use a ColBERT model for indexing and retrieval. We will use the [Vaswani NPL corpus](http://ir.dcs.gla.ac.uk/resources/test_collections/npl/), a corpus of 11,429 scientific abstracts with corresponding queries and relevance assessments.

In [13]:
!rm -rf ./colbertindex

import pyterrier_colbert.indexing

checkpoint="http://www.dcs.gla.ac.uk/~craigm/colbert.dnn.zip"

indexer = pyterrier_colbert.indexing.ColBERTIndexer(checkpoint, "./", "colbertindex", chunksize=3)
indexer.index(pt.get_dataset("irds:vaswani").get_corpus_iter())

vaswani documents:   0%|          | 0/11429 [00:00<?, ?it/s]

[Apr 15, 12:52:32] [0] 		 #> Local args.bsize = 128
[Apr 15, 12:52:32] [0] 		 #> args.index_root = ./
[Apr 15, 12:52:32] [0] 		 #> self.possible_subset_sizes = [69905]


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of ColBERT were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['linear.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


[Apr 15, 12:52:35] #> Loading model checkpoint.
[Apr 15, 12:52:35] #> Loading checkpoint http://www.dcs.gla.ac.uk/~craigm/colbert.dnn.zip


Downloading: "http://www.dcs.gla.ac.uk/~craigm/colbert.dnn.zip" to /root/.cache/torch/hub/checkpoints/colbert.dnn.zip


[Apr 15, 12:53:48] #> checkpoint['epoch'] = 0
[Apr 15, 12:53:48] #> checkpoint['batch'] = 44500




tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]



[Apr 15, 12:53:49] #> Note: Output directory ./ already exists




[Apr 15, 12:53:49] #> Creating directory ./colbertindex 




[INFO] [starting] http://ir.dcs.gla.ac.uk/resources/test_collections/npl/npl.tar.gz

[A[INFO] [finished] http://ir.dc

[Apr 15, 12:55:40] [0] 		 #> Completed batch #0 (starting at passage #0) 		Passages/min: 6.2k (overall),  6.3k (this encoding),  18319.8M (this saving)
[Apr 15, 12:55:40] [0] 		 [NOTE] Done with local share.
[Apr 15, 12:55:40] [0] 		 #> Joining saver thread.
[Apr 15, 12:55:40] [0] 		 #> Saved batch #0 to ./colbertindex/0.pt 		 Saving Throughput = 2.1M passages per minute.

#> num_embeddings = 581496
[Apr 15, 12:55:40] #> Starting..
[Apr 15, 12:55:40] #> Processing slice #1 of 1 (range 0..1).
[Apr 15, 12:55:40] #> Will write to ./colbertindex/ivfpq.100.faiss.
[Apr 15, 12:55:40] #> Loading ./colbertindex/0.sample ...
#> Sample has shape (29074, 128)
[Apr 15, 12:55:40] Preparing resources for 1 GPUs.
[Apr 15, 12:55:40] #> Training with the vectors...
[Apr 15, 12:55:40] #> Training now (using 1 GPUs)...
0.17907214164733887
11.323607444763184
0.006412029266357422
[Apr 15, 12:55:51] Done training!

[Apr 15, 12:55:51] #> Indexing the vectors...
[Apr 15, 12:55:51] #> Loading ('./colbertindex/0

In [14]:
!ls -ltrh ./colbertindex

total 168M
-rw-r--r-- 1 root root 142M Apr 15 12:55 0.pt
-rw-r--r-- 1 root root 4.5M Apr 15 12:55 0.tokenids
-rw-r--r-- 1 root root 7.1M Apr 15 12:55 0.sample
-rw-r--r-- 1 root root  35K Apr 15 12:55 doclens.0.json
-rw-r--r-- 1 root root  24K Apr 15 12:55 docnos.pkl.gz
-rw-r--r-- 1 root root  14M Apr 15 12:55 ivfpq.100.faiss


In [15]:
# indexer = indexer.index(pt.get_dataset("irds:vaswani").get_corpus_iter())
pyterrier_colbert_factory = indexer.ranking_factory()

colbert_e2e = pyterrier_colbert_factory.end_to_end()

[Apr 15, 12:55:54] #> Loading the FAISS index from ./colbertindex/ivfpq.100.faiss ..
[Apr 15, 12:55:54] #> Building the emb2pid mapping..
[Apr 15, 12:55:54] len(self.emb2pid) = 581496




Loading reranking index, memtype=mem


Loading index shards to memory:   0%|          | 0/1 [00:00<?, ?shard/s]

We search using ColBERT as follows. `% 5` is used to retrieve the top 5 most relevant entries.

In [16]:
out = (colbert_e2e % 5).search("chemical reactions")
out



| | qid | query | docid | query_toks | query_embs | score | docno | rank |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | chemical reactions | 4911 | [tensor(101), tensor(1), tensor(5072), tensor(... | [[tensor(0.0680), tensor(-0.0085), tensor(0.11... | 19.821638 | 4912 | 0 |
| 3 | 1 | chemical reactions | 7048 | [tensor(101), tensor(1), tensor(5072), tensor(... | [[tensor(0.0680), tensor(-0.0085), tensor(0.11... | 19.053555 | 7049 | 1 |
| 2 | 1 | chemical reactions | 6479 | [tensor(101), tensor(1), tensor(5072), tensor(... | [[tensor(0.0680), tensor(-0.0085), tensor(0.11... | 18.036415 | 6480 | 2 |
| 4 | 1 | chemical reactions | 9373 | [tensor(101), tensor(1), tensor(5072), tensor(... | [[tensor(0.0680), tensor(-0.0085), tensor(0.11... | 17.136055 | 9374 | 3 |
| 1 | 1 | chemical reactions | 6278 | [tensor(101), tensor(1), tensor(5072), tensor(... | [[tensor(0.0680), tensor(-0.0085), tensor(0.11... | 16.793301 | 6279 | 4 |


In [17]:
out.loc[0, 'query_toks']

tensor([ 101,    1, 5072, 9597,  102,  103,  103,  103,  103,  103,  103,  103,
         103,  103,  103,  103,  103,  103,  103,  103,  103,  103,  103,  103,
         103,  103,  103,  103,  103,  103,  103,  103])

In [18]:
out.loc[0, 'query_embs'].shape

torch.Size([32, 128])

### Exercise 4

There are two new columns in the search results: `query_toks` and `query_embs`. Explain what they are and explain the shape of `query_embs`.

The search results from the ColBERT pipeline contain two additional columns, `query_toks` and `query_embs`, which show how the query "chemical reactions" is processed. The `query_toks` column contains the token IDs produced by the BERT tokenizer, not the token strings. In the output above, the tensor starts with 101 (`[CLS]`) and 1 (ColBERT's query marker), followed by the IDs of the two query words (5072 and 9597) and 102 (`[SEP]`). The remaining positions are filled with 103 (`[MASK]`), because ColBERT pads every query to a fixed length of 32 tokens.

The `query_embs` column holds one contextual embedding per token in `query_toks`. Its shape is (N, D), where N is the number of query tokens and D is the dimensionality of each embedding. Here the shape is (32, 128): 32 because the query is padded to 32 tokens, including the `[MASK]` positions, and 128 because this ColBERT model projects each token representation to a 128-dimensional vector.
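
The padding and the way ColBERT uses one embedding per query token can be sketched in plain Python. This is an illustration under stated assumptions, not the `pyterrier_colbert` implementation: the helper names `pad_query` and `maxsim` are hypothetical, the token IDs are copied from the `query_toks` output above, and the 2-dimensional vectors stand in for the real 128-dimensional embeddings.

```python
# Token IDs as they appear in the query_toks output above.
CLS, Q_MARKER, SEP, MASK = 101, 1, 102, 103
QUERY_MAXLEN = 32

def pad_query(word_ids, maxlen=QUERY_MAXLEN):
    """Wrap the word IDs in special tokens and pad with [MASK] up to maxlen."""
    ids = [CLS, Q_MARKER] + list(word_ids) + [SEP]
    return ids + [MASK] * (maxlen - len(ids))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def maxsim(query_embs, doc_embs):
    """Late interaction: for each query vector take its best dot product
    against any document vector, then sum over the query vectors."""
    return sum(max(dot(q, d) for d in doc_embs) for q in query_embs)

query_toks = pad_query([5072, 9597])
print(len(query_toks), query_toks[:6])  # 32 [101, 1, 5072, 9597, 102, 103]

# Toy 2-dimensional embeddings; real ColBERT vectors here have 128 dimensions.
query_embs = [[1.0, 0.0], [0.0, 1.0]]
doc_embs = [[0.5, 0.5], [0.0, 2.0], [1.0, 0.0]]
print(maxsim(query_embs, doc_embs))  # 3.0
```

Because the score is a sum over all query positions, the `[MASK]` positions also contribute their own embeddings, which is why `query_embs` has 32 rows rather than 2.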

In [38]:
dataset = pt.datasets.get_dataset("vaswani")
index_path = "./index"


!rm -rf ./index
indexer = pt.TRECCollectionIndexer(index_path)

indexer = indexer.index(dataset.get_corpus())

bm25 = pt.BatchRetrieve(indexer, wmodel="BM25")



In the following code we create a reranker from a cross-encoder model, loaded with the `sentence_transformers` library.

In [39]:
import pandas as pd
from sentence_transformers import CrossEncoder
crossmodel = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-12-v2', max_length=512)

def _crossencoder_apply(df : pd.DataFrame):
  return crossmodel.predict(list(zip(df['query'].values, df['text'].values)))

cross_encT = pt.apply.doc_score(_crossencoder_apply, batch_size=128)

We create a reranking pipeline that starts with BM25 as a retriever and ends with the cross-encoder (`cross_encT`), which reranks the retrieved documents. The reranking step requires the document text as input, which BM25 does not return by default. That is why we add `pt.text.get_text(dataset, 'text')`, which looks up the text of each retrieved document and adds it to the output of BM25.
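
The idea behind chaining stages with `>>` can be shown with a toy model in plain Python. This is only a sketch of the composition pattern, not how PyTerrier implements it: the `Stage` class, the two-document corpus, the fixed first-stage scores, and the word-overlap scorer (standing in for the cross-encoder) are all hypothetical.

```python
class Stage:
    """Toy stand-in for a transformer: wraps a function from rows to rows."""

    def __init__(self, fn):
        self.fn = fn

    def __call__(self, rows):
        return self.fn(rows)

    def __rshift__(self, other):
        # a >> b: feed the output of a into b.
        return Stage(lambda rows: other(self(rows)))

# Hypothetical two-document corpus.
CORPUS = {
    "a": "reaction concept in electromagnetic theory",
    "b": "chemical reactions in the upper atmosphere",
}

def first_stage(queries):
    """Pretend retriever: returns candidates with a first-stage score but no text."""
    first_scores = {"a": 2.0, "b": 1.0}
    return [{"query": q, "docno": d, "score": s}
            for q in queries for d, s in first_scores.items()]

def add_text(rows):
    """Look up the document text, as pt.text.get_text does in the real pipeline."""
    return [{**row, "text": CORPUS[row["docno"]]} for row in rows]

def rescore(rows):
    """Toy reranker: score by word overlap between query and text, then re-sort."""
    def overlap(row):
        return len(set(row["query"].split()) & set(row["text"].split()))
    rescored = [{**row, "score": float(overlap(row))} for row in rows]
    return sorted(rescored, key=lambda row: row["score"], reverse=True)

pipeline = Stage(first_stage) >> Stage(add_text) >> Stage(rescore)
result = pipeline(["chemical reactions"])
print([(row["docno"], row["score"]) for row in result])  # [('b', 2.0), ('a', 0.0)]
```

The first stage prefers document `a`, but the reranker only sees the candidates it is handed and reorders them using the text, so `b` ends up on top. Without the `add_text` stage the reranker would fail, because the `text` key would be missing.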

In [40]:
dataset = pt.get_dataset('irds:vaswani')
cross_enc_rerank = bm25 >> pt.text.get_text(dataset, 'text') >> cross_encT

In [41]:
out = (bm25 % 5).search("chemical reactions")
out

| | qid | docid | docno | rank | score | query |
|---|---|---|---|---|---|---|
| 0 | 1 | 9373 | 9374 | 0 | 22.076426 | chemical reactions |
| 1 | 1 | 8765 | 8766 | 1 | 20.498801 | chemical reactions |
| 2 | 1 | 7048 | 7049 | 2 | 20.159044 | chemical reactions |
| 3 | 1 | 4686 | 4687 | 3 | 19.323491 | chemical reactions |
| 4 | 1 | 10702 | 10703 | 4 | 13.472012 | chemical reactions |


In [42]:
out = (cross_enc_rerank % 5).search("chemical reactions")
out

| | qid | docid | docno | score | query | text | rank |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 9373 | 9374 | 0.509724 | chemical reactions | ion neutral reactions a list is given of reac... | 0 |
| 2 | 1 | 7048 | 7049 | 0.141523 | chemical reactions | some reactions occurring in the earths upper a... | 1 |
| 3 | 1 | 4686 | 4687 | -0.909094 | chemical reactions | nitrogen oxides and the airglow possible chem... | 2 |
| 1 | 1 | 8765 | 8766 | -2.412005 | chemical reactions | on the problem of an effective recombination c... | 3 |
| 4 | 1 | 6479 | 6480 | -3.71452 | chemical reactions | reaction concept in electromagnetic theory a ... | 4 |


The following demonstrates how we can construct IR pipelines using Pyterrier. The pipeline starts by using bm25 to retrieve relevant documents, which are then passed to the `QueryExpansion` transformer. This transformer expands the query by adding informative terms collected from the top-ranked documents. The last part runs bm25 again with the new query. This process is called Pseudo Relevance Feedback (PRF).
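
The expansion step can be sketched in plain Python. This is a deliberately simplified illustration: it picks expansion terms by raw frequency in the feedback documents, whereas Terrier's `QueryExpansion` uses a term weighting model (Bo1 by default, according to the log message of the experiment in this notebook). The function `expand_query`, its parameter names, and the example texts are hypothetical.

```python
from collections import Counter

def expand_query(query, ranked_texts, fb_docs=2, fb_terms=2):
    """Append the most frequent new terms from the top fb_docs documents to the query."""
    query_terms = query.split()
    counts = Counter()
    for text in ranked_texts[:fb_docs]:
        counts.update(term for term in text.split() if term not in query_terms)
    # Sort by count (descending), then alphabetically so the result is deterministic.
    best = sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:fb_terms]
    return query_terms + [term for term, _ in best]

# Hypothetical first-pass results, best document first.
ranked_texts = [
    "ion neutral reactions upper atmosphere",
    "reactions upper atmosphere airglow",
    "electromagnetic theory",
]

expanded = expand_query("chemical reactions", ranked_texts)
print(expanded)  # ['chemical', 'reactions', 'atmosphere', 'upper']
```

The expanded query is then sent to the retriever a second time, which is the final `>> bm25` in the pipeline. The third document is ignored because only the top `fb_docs` documents are assumed to be relevant.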

In [43]:
query_expansion = bm25 >> pt.rewrite.QueryExpansion(indexer) >> bm25

In [44]:
pt.Experiment(
    [bm25, query_expansion, colbert_e2e, cross_enc_rerank],
    dataset.get_topics(),
    dataset.get_qrels(),
    eval_metrics=["map", "P_10", "mrt"],
    names = ["BM25", "QE", "ColBERT", "BM25 >> CrossEnc"]
)

14:13:33.098 [main] WARN org.terrier.querying.QueryExpansion -- qemodel control not set for QueryExpansion post process. Using default model Bo1




| | name | map | P_10 | mrt |
|---|---|---|---|---|
| 0 | BM25 | 0.296517 | 0.352688 | 22.326847 |
| 1 | QE | 0.304647 | 0.369892 | 60.371842 |
| 2 | ColBERT | 0.278698 | 0.351613 | 687.312294 |
| 3 | BM25 >> CrossEnc | 0.321928 | 0.388172 | 2496.351711 |


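On this small dataset, we can see that BM25 achieves a better MAP than ColBERT (0.297 vs. 0.279). Adding Query Expansion improves MAP a little (to 0.305), and reranking the BM25 results with the cross-encoder gives the best MAP and P@10 in the table, at more than 100 times the mean response time (`mrt`) of BM25. Using PRF with ColBERT (ColBERT-PRF, which is not included in the table above) hurts performance while also slowing down inference because of the extra PRF step.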
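
To compare effectiveness against cost, the numbers from the results table can be turned into relative changes with a few lines of Python. The values are copied from the table above; the dictionary layout and the variable names are only for illustration.

```python
# MAP and mean response time (mrt) copied from the results table above.
results = {
    "BM25": {"map": 0.296517, "mrt": 22.326847},
    "QE": {"map": 0.304647, "mrt": 60.371842},
    "ColBERT": {"map": 0.278698, "mrt": 687.312294},
    "BM25 >> CrossEnc": {"map": 0.321928, "mrt": 2496.351711},
}

baseline = results["BM25"]
summary = {}
for name, row in results.items():
    summary[name] = {
        # Relative MAP change in percent compared to BM25.
        "map_change_pct": 100 * (row["map"] - baseline["map"]) / baseline["map"],
        # How many times slower than BM25 the pipeline is.
        "slowdown": row["mrt"] / baseline["mrt"],
    }

for name, row in summary.items():
    print(f"{name}: MAP {row['map_change_pct']:+.1f}% vs BM25, "
          f"{row['slowdown']:.1f}x response time")
```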

### Exercise 5

Apply the same process as above on a dataset of your choice. You can find the list of datasets in Pyterrier here: https://pyterrier.readthedocs.io/en/latest/datasets.html#available-datasets

### Exercise 6

Add at least 2 other metrics and explain what each of them is trying to capture.

In [None]:
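import os
import shutil

import pyterrier as pt
from sentence_transformers import CrossEncoder

# Define the two datasets.
datasets = {
    "forum": pt.get_dataset("irds:lotte/lifestyle/test/forum"),
    "torso": pt.get_dataset("irds:tripclick/val/torso")
}

experiment_results = {}

# Process each dataset.
for ds_name, ds in datasets.items():
    print(f"\n--- Processing dataset: {ds_name} ---")

    # Create a unique index folder for each dataset so that the indexes do not
    # collide with each other or with the "./index" folder used earlier.
    index_path = f"./index_{ds_name}"
    if os.path.exists(index_path):
        shutil.rmtree(index_path)

    # Use IterDictIndexer with the corpus iterator. As in the Vaswani cell above,
    # `indexer` is reassigned to the index reference returned by index().
    indexer = pt.IterDictIndexer(index_path)
    indexer = indexer.index(ds.get_corpus_iter())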

    # Instantiate BM25 retriever.
    bm25 = pt.BatchRetrieve(indexer, wmodel="BM25")

    # --- Pipeline 1: BM25 baseline ---
    pipeline_bm25 = bm25

    # --- Pipeline 2: Cross-Encoder Rerank ---
    crossmodel = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-12-v2', max_length=512)


    # Create an operator to apply the cross-encoder.
    cross_encT = pt.apply.doc_score(_crossencoder_apply, batch_size=128)
    cross_enc_rerank = bm25 >> pt.text.get_text(ds, 'text') >> cross_encT
    pipeline_crossenc = cross_enc_rerank

    # --- Pipeline 3: Query Expansion ---
    query_expansion = bm25 >> pt.rewrite.QueryExpansion(indexer) >> bm25
    pipeline_qe = query_expansion

    # List of pipelines and names.
    pipelines = [pipeline_bm25, pipeline_qe, pipeline_crossenc]
    pipeline_names = ["BM25", "BM25 >> QueryExpansion", "BM25 >> CrossEnc"]

    # --- Run the Experiment ---
    # Note: "recip_rank" is used instead of "mrr", since ir_measures recognizes "recip_rank" as Mean Reciprocal Rank.
    exp = pt.Experiment(
        pipelines,
        ds.get_topics(),
        ds.get_qrels(),
        eval_metrics=["map", "P_10", "ndcg", "recall", "recip_rank"],
        names=pipeline_names
    )

    print(f"Results for dataset '{ds_name}':")
    print(exp)

    experiment_results[ds_name] = exp


--- Processing dataset: forum ---


lotte/lifestyle/test/forum documents:   0%|          | 0/119461 [00:00<?, ?it/s]



14:48:20.205 [main] WARN org.terrier.querying.QueryExpansion -- qemodel control not set for QueryExpansion post process. Using default model Bo1


In this experiment, in addition to MAP, P@10, and MRR (`recip_rank`), we added two evaluation metrics, nDCG and Recall, to get a more complete picture of the retrieval pipelines. nDCG (Normalized Discounted Cumulative Gain) measures the quality of the ranking by taking into account both the relevance grade of each document and its position: relevant documents contribute more when they appear near the top, and the total is normalized by the score of an ideal ranking. Recall measures the fraction of all relevant documents for a query that the system actually retrieved. Together, these metrics evaluate both the ranking quality and how much of the relevant material each pipeline finds.
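
Both metrics are short enough to write out. The following is a minimal sketch in plain Python that uses the linear-gain form of DCG (gain divided by log2 of position + 1); other formulations exist, so the numbers may differ from a given evaluation tool. The function names, relevance labels, and ranking are hypothetical.

```python
import math

def dcg(gains):
    """Discounted cumulative gain: each gain is divided by log2(position + 1)."""
    return sum(gain / math.log2(position + 1)
               for position, gain in enumerate(gains, start=1))

def ndcg(ranking, labels, k=None):
    """DCG of the ranking divided by the DCG of the ideal ordering of the labels."""
    gains = [labels.get(docno, 0) for docno in ranking[:k]]
    ideal = sorted(labels.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

def recall(ranking, labels, k=None):
    """Fraction of the relevant documents (label > 0) found in the top k."""
    relevant = {docno for docno, label in labels.items() if label > 0}
    if not relevant:
        return 0.0
    return len(relevant & set(ranking[:k])) / len(relevant)

# Hypothetical graded judgments and a ranking that puts the best document last.
labels = {"d1": 2, "d2": 1, "d3": 0}
ranking = ["d2", "d3", "d1"]

print(round(ndcg(ranking, labels), 4))
print(recall(ranking, labels), recall(ranking, labels, k=1))
```

In this example recall over the full ranking is 1.0 because both relevant documents were retrieved, while nDCG is below 1 because the most relevant document sits at the bottom. This is the difference between "found it" and "ranked it well".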

### Exercise 7

Follow the example in this [README](https://github.com/terrierteam/pyterrier_colbert/tree/ba5c86c0bc8da450dee361140541f35b5349a492) to implement Pseudo Relevance Feedback (PRF) with ColBERT. Describe how it works, and add ColBERT-PRF to `Experiment` to evaluate it against the other pipelines.

In [None]:
import os
import shutil

import pyterrier as pt

dataset = pt.get_dataset("irds:vaswani")

# Set the path for the ColBERT index
index_path = "./colbert_index"

# If an existing index exists, remove it to start fresh.
if os.path.exists(index_path):
    shutil.rmtree(index_path)

# --- ColBERT Indexing ---
# Build a ColBERT index using a ColBERT checkpoint. Make sure to set ids=True for PRF.
# chunksize=3 follows the ColBERT indexing cell earlier in this notebook.
from pyterrier_colbert.indexing import ColBERTIndexer
indexer = ColBERTIndexer("/path/to/checkpoint.dnn", index_path, "colbert_index", chunksize=3, ids=True)
# The indexer consumes an iterator of document dicts, as in the earlier indexing cell.
indexer.index(dataset.get_corpus_iter())

# --- ColBERT Retrieval Pipelines ---
# Create a ColBERT factory using the same checkpoint and index.
from pyterrier_colbert.ranking import ColBERTFactory
pytcolbert = ColBERTFactory("/path/to/checkpoint.dnn", index_path, "colbert_index")

# Define a dense end-to-end ColBERT retrieval pipeline.
dense_e2e = pytcolbert.end_to_end()

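# Create a BM25 retrieval pipeline (sparse retrieval) as a baseline.
# BM25 needs a Terrier index rather than the ColBERT index, so use the prebuilt Vaswani index.
bm25 = pt.BatchRetrieve(pt.get_dataset("vaswani").get_index(), wmodel="BM25")

# Create a sparse ColBERT re-ranking pipeline based on BM25.
# text_scorer() needs the document text, so it is looked up before scoring.
sparse_colbert = (
    bm25
    >> pt.text.get_text(pt.get_dataset("irds:vaswani"), "text")
    >> pytcolbert.text_scorer()
)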

# --- ColBERT PRF Pipeline ---
# Obtain the ColBERT PRF pipeline; with rerank=True it enriches the query and then re-ranks the documents.
colbert_prf = pytcolbert.prf(rerank=True)

# --- Experiment ---
# Evaluate all pipelines together in a side-by-side comparison. The metrics used here (e.g., MAP and nDCG@10) assess both the
# precision of retrieved documents and the quality of the ranking. Including ColBERT-PRF enables us to see the benefits of
# automatically refining the query using pseudo relevance feedback on dense representations.
pt.Experiment(
    [bm25, sparse_colbert, dense_e2e, colbert_prf],
    dataset.get_topics(),
    dataset.get_qrels(),
    eval_metrics=["map", "ndcg_cut_10"],
    names=["BM25", "BM25 >> ColBERT", "Dense ColBERT", "ColBERT-PRF"]
)

In this implementation, we add a ColBERT-based Pseudo Relevance Feedback (PRF) pipeline to the evaluation using `pyterrier_colbert`. ColBERT-PRF first performs an initial dense retrieval over a ColBERT index that stores token ids (the `ids=True` option during indexing), which the feedback step needs. The core idea behind PRF is to assume that the top-ranked documents from the initial search are relevant and to use information from these documents to refine the query representation. In ColBERT-PRF the query is expanded with additional embeddings derived from the feedback documents rather than with extra words.

The plugin offers two variants: `pytcolbert.prf(rerank=False)` runs a new retrieval with the expanded query, while `pytcolbert.prf(rerank=True)`, used in the code above, uses the expanded query to re-rank the documents that were already retrieved. We add the PRF pipeline to `Experiment` next to BM25, BM25 with ColBERT re-ranking, and dense end-to-end ColBERT, so that MAP and nDCG@10 show whether the feedback step helps or hurts on this collection.
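
The scoring side of this idea can be sketched in plain Python. This is a simplified illustration based on my reading of the ColBERT-PRF approach, not the `pyterrier_colbert` implementation: it assumes the expansion embeddings and their importance weights have already been selected from the feedback documents (that selection step is not shown), and the names `prf_score` and `beta` as well as the toy vectors are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def best_match(embedding, doc_embs):
    """Highest dot product between one query-side vector and any document vector."""
    return max(dot(embedding, d) for d in doc_embs)

def prf_score(query_embs, expansion, doc_embs, beta=1.0):
    """Late-interaction score with feedback embeddings.

    expansion is a list of (embedding, weight) pairs assumed to come from the
    top-ranked feedback documents; beta controls how much they count overall.
    """
    original = sum(best_match(q, doc_embs) for q in query_embs)
    feedback = sum(weight * best_match(e, doc_embs) for e, weight in expansion)
    return original + beta * feedback

# Toy 2-dimensional example.
query_embs = [[1.0, 0.0]]
expansion = [([0.0, 1.0], 0.5)]
doc_embs = [[1.0, 0.0], [0.0, 1.0]]

print(prf_score(query_embs, expansion, doc_embs, beta=0.0))  # 1.0 (no feedback)
print(prf_score(query_embs, expansion, doc_embs, beta=2.0))  # 2.0
```

With `beta=0.0` the score reduces to the plain ColBERT score, which makes it easy to see that the feedback embeddings only add evidence on top of the original query.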

### Exercise 8

Implement a reranking pipeline with ColBERT and add it to `Experiment`.

In [None]:
import os
import shutil

import pyterrier as pt
from pyterrier_colbert.indexing import ColBERTIndexer
from pyterrier_colbert.ranking import ColBERTFactory

dataset = pt.get_dataset("vaswani")

# Set the path for the ColBERT index.
index_path = "./colbert_index"

# Remove any existing index directory so that indexing starts from scratch.
if os.path.exists(index_path):
    shutil.rmtree(index_path)

# --- ColBERT Indexing ---
# Build a ColBERT index using a ColBERT checkpoint.
# Note: Replace '/path/to/colbert_checkpoint.dnn' with the actual path to your ColBERT checkpoint.
checkpoint_path = "/path/to/colbert_checkpoint.dnn"
indexer = ColBERTIndexer(checkpoint_path, index_path, "colbert_index", ids=True)
# The indexer expects an iterator of {"docno": ..., "text": ...} dicts, not a list of file paths.
indexer.index(dataset.get_corpus_iter())

# --- ColBERT Retrieval Pipelines ---
# Instantiate the ColBERT factory using the same checkpoint and index.
pytcolbert = ColBERTFactory(checkpoint_path, index_path, "colbert_index")

# Create an end-to-end dense ColBERT retrieval pipeline.
dense_e2e = pytcolbert.end_to_end()

# --- BM25 Baseline Retrieval ---
# BM25 needs a Terrier index, not the ColBERT index built above.
# Here we use the pre-built Terrier index that ships with the Vaswani dataset.
bm25 = pt.BatchRetrieve(dataset.get_index(), wmodel="BM25")

# --- ColBERT Reranking Pipeline ---
# text_scorer() re-scores the BM25 candidates from their text, so the results need a "text" column.
# Assumption: the ir_datasets version of Vaswani uses the same docnos, so it can supply the text.
add_text = pt.text.get_text(pt.get_dataset("irds:vaswani"), "text")
colbert_rerank = bm25 >> add_text >> pytcolbert.text_scorer()

# --- Experiment ---
# Evaluate and compare the performance of:
# 1. BM25 baseline
# 2. BM25 with ColBERT reranking (sparse re-ranking)
# 3. Dense end-to-end ColBERT retrieval
results = pt.Experiment(
    [bm25, colbert_rerank, dense_e2e],
    dataset.get_topics(),
    dataset.get_qrels(),
    eval_metrics=["map", "ndcg_cut_10"],
    names=["BM25", "BM25 >> ColBERT", "Dense ColBERT"]
)

print(results)

Write answer here.

## Arxiv Abstracts Retrieval

In [None]:
!pip install setuptools==68.0.0       # you can skip this if you are using a local instance

In [None]:
!pip install arxiv==2.1.3

The following code sets up a search for up to 1000 abstracts matching the query "nlp", sorted by submission date.

In [None]:
import arxiv

# Search for papers
search = arxiv.Search(
    query="nlp",
    max_results=1000,
    sort_by=arxiv.SortCriterion.SubmittedDate
)

We fetch the results and collect the title and abstract of each paper in a list of dictionaries.

In [None]:
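# Fetch the results through a Client (Search.results() is deprecated in arxiv 2.x).
client = arxiv.Client()

documents = []
for result in client.results(search):
    documents.append({
        'title': result.title,
        'abstract': result.summary,
    })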

In [None]:
documents[0]

Rerun this in case you had to restart the session.

In [None]:
import pyterrier as pt
if not pt.java.started():
    pt.init()

To create an index, we need a `docno` field, which we create below, and a `text` field, which is the abstract in this case. \\
We use `DFIndexer`, which builds an index from a list of ids and a list of texts.

In [None]:
import pandas as pd

index_path = './arxivindex'
# Remove any previous index; -f avoids an error when the directory does not exist yet.
!rm -rf './arxivindex'

df = pd.DataFrame({
    'docno': ['doc'+str(i) for i in range(len(documents))],
    'text': [document['abstract'] for document in documents]
})

indexer = pt.DFIndexer(index_path)
indexer.index(docno=df.docno, text=df.text)

We create a BM25 retrieval model for the Arxiv index.

In [None]:
bm25 = pt.BatchRetrieve(indexer, wmodel="BM25")

In [None]:
(bm25 % 5).search("information retrieval")

In [None]:
df[df['docno']=='doc670'].text.item()

### Exercise 9

Create a ColBERT index using the Arxiv documents retrieved above and use it to search for some queries. Try to highlight how it is different from BM25 through the query results. You can also change the query we used to get arxiv pages.
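
As a hint for what kind of difference to look for: BM25 scores a document through the query terms that literally occur in it, whereas ColBERT compares contextual token embeddings and sums, for each query token, its best match among the document tokens (the MaxSim operator). The sketch below illustrates this contrast on toy data. The 2-dimensional "embeddings" are invented for the example (real ColBERT embeddings come from a BERT encoder), and the plain term-overlap count is only a stand-in for a lexical scorer, not BM25 itself.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def maxsim_score(query_embs, doc_embs):
    """ColBERT-style late interaction: for each query token take the maximum
    similarity over all document tokens, then sum over the query tokens."""
    return sum(max(dot(q, d) for d in doc_embs) for q in query_embs)

def term_overlap(query_tokens, doc_tokens):
    """Crude lexical score: number of distinct query terms found in the document."""
    return len(set(query_tokens) & set(doc_tokens))

# Hypothetical token embeddings: synonyms point in similar directions.
emb = {
    "fast": [1.0, 0.0], "quick": [0.9, 0.1],
    "car": [0.0, 1.0], "automobile": [0.1, 0.9],
    "banana": [-0.7, -0.7],
}

query = ["fast", "car"]
doc_a = ["quick", "automobile"]   # same meaning, no shared terms
doc_b = ["car", "banana"]         # one shared term, otherwise unrelated

lexical = {name: term_overlap(query, doc) for name, doc in [("A", doc_a), ("B", doc_b)]}
dense = {name: maxsim_score([emb[t] for t in query], [emb[t] for t in doc])
         for name, doc in [("A", doc_a), ("B", doc_b)]}

print("term overlap:", lexical)
print("MaxSim:      ", dense)
```

The lexical score prefers document B because of the exact match on "car", while MaxSim prefers document A, whose tokens are close to both query tokens. Queries that paraphrase the vocabulary of the abstracts are therefore good candidates for showing a difference between the two retrievers.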

In [None]:
import os
import shutil

from pyterrier_colbert.indexing import ColBERTIndexer
from pyterrier_colbert.ranking import ColBERTFactory

# Reuse the arXiv DataFrame `df` (columns `docno` and `text`) built above.

# --- BM25 baseline ---
bm25_index_path = "./bm25_arxiv_index"
if os.path.exists(bm25_index_path):
    shutil.rmtree(bm25_index_path)

bm25_indexer = pt.DFIndexer(bm25_index_path)
bm25_index_ref = bm25_indexer.index(docno=df.docno, text=df.text)
bm25 = pt.BatchRetrieve(bm25_index_ref, wmodel="BM25")

# --- ColBERT index ---
# Set the path for the ColBERT index. Make sure to index with aligned token IDs (ids=True).
colbert_index_path = "./colbert_arxiv_index"
if os.path.exists(colbert_index_path):
    shutil.rmtree(colbert_index_path)

# Replace '/path/to/colbert_checkpoint.dnn' with the actual path to your ColBERT checkpoint.
checkpoint_path = "/path/to/colbert_checkpoint.dnn"
colbert_indexer = ColBERTIndexer(checkpoint_path, colbert_index_path, "colbert_arxiv_index", ids=True)
# The indexer expects an iterable of {"docno": ..., "text": ...} dicts.
colbert_indexer.index(df.to_dict(orient="records"))

pytcolbert = ColBERTFactory(checkpoint_path, colbert_index_path, "colbert_arxiv_index")
dense_e2e = pytcolbert.end_to_end()

queries = [
    "Transformer models in natural language processing",
    "Quantum computing advances and applications"
]

# Run every query through both retrievers and print the top 10 results of each.
for query in queries:
    print("\n========================")
    print("Query: ", query)
    print("========================\n")

    print("BM25 Results:")
    bm25_results = bm25.search(query)
    print(bm25_results[["docno", "rank", "score"]].head(10))

    print("\nColBERT Results:")
    colbert_results = dense_e2e.search(query)
    print(colbert_results[["docno", "rank", "score"]].head(10))
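
Printing two tables per query makes it hard to see where the retrievers disagree. One possible way to summarise the difference is to compare the top-k docnos of both runs directly. The helper below is a small sketch: the function name `top_k_overlap` and the example docno lists are made up, and it works on plain Python lists so that it does not depend on any index.

```python
def top_k_overlap(run_a, run_b, k=10):
    """Compare the top-k docnos of two ranked lists."""
    top_a, top_b = run_a[:k], run_b[:k]
    set_a, set_b = set(top_a), set(top_b)
    union = set_a | set_b
    return {
        "shared": [d for d in top_a if d in set_b],   # found by both, in run_a order
        "only_a": [d for d in top_a if d not in set_b],
        "only_b": [d for d in top_b if d not in set_a],
        "jaccard": len(set_a & set_b) / len(union) if union else 0.0,
    }

# Hypothetical top-5 docnos for one query.
bm25_docnos = ["doc12", "doc7", "doc99", "doc3", "doc41"]
colbert_docnos = ["doc7", "doc55", "doc12", "doc8", "doc3"]

summary = top_k_overlap(bm25_docnos, colbert_docnos, k=5)
print(summary)
```

With the result DataFrames from the loop above, the inputs would be `bm25_results["docno"].tolist()` and `colbert_results["docno"].tolist()`. The documents listed under `only_b` are the ones ColBERT retrieved without BM25 ranking them in its top k, which makes them a natural place to look for semantic matches.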