<a href="https://colab.research.google.com/github/adap7/IRTM/blob/main/IRTM_Neural_Reranking_2024_2024.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Information Retrieval and Text Mining Course - Neural Reranking Tutorial
Authors: Abderrahmane Issam and Jan Scholtes

Version 2024-2025

# Notebook 4

In this notebook we will learn how to use [Pyterrier](https://github.com/terrier-org/pyterrier), a Python framework built on top of the Java-based Terrier IR platform. Pyterrier can index datasets in different formats and can be integrated with different models, from classical approaches like BM25 up to neural models like ColBERT. We will learn how to use Pyterrier for indexing, search and evaluation.

## Setup

### If you have a local GPU (do the following steps in your local environment):
If you have a GPU and you want to run this notebook locally, then I suggest you set up a conda environment as follows:



```
conda create --name ir python=3.11.1
conda activate ir
conda install -c conda-forge openjdk=11
pip install notebook
```
The second step is to start Jupyter on your local machine as follows:
```
jupyter notebook \
    --NotebookApp.allow_origin='https://colab.research.google.com' \
    --port=8888 \
    --NotebookApp.port_retries=0
```
Then go to `connect to local runtime` (which you will find in the menu where you can change the runtime) and paste the Jupyter backend URL that you got in the output of the previous command.

If you are running this locally, you only need to install the packages once. Otherwise, you will need to install them each time the instance starts, and restart the runtime when prompted.

In [None]:
!pip install python-terrier


In [None]:
!pip install --upgrade git+https://github.com/terrierteam/pyterrier_colbert.git


## Indexing and Search Using Pyterrier

In [None]:
!pip install faiss-gpu-cu12

import faiss
assert faiss.get_num_gpus() > 0

Successfully installed faiss-gpu-cu12-1.10.0


In [None]:
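import pyterrier as pt
# pt.init() is deprecated in recent Pyterrier releases; Java now starts
# automatically, and pt.java.init() forces early initialisation.
if not pt.java.started():
    pt.java.init()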

terrier-assemblies 5.11 jar-with-dependencies not found, downloading to /root/.pyterrier...
Done
terrier-python-helper 0.0.8 jar not found, downloading to /root/.pyterrier...
Done


Java started and loaded: pyterrier.java, pyterrier.terrier.java [version=5.11 (build: craig.macdonald 2025-01-13 21:29), helper_version=0.0.8]


Pyterrier can create an index from different dataset formats, including pandas DataFrames, which we demonstrate below. We will create a synthetic DataFrame of documents, as well as queries and qrels (relevance judgments) to use for evaluation. After indexing the documents, we create a BM25 model that we will use for search later.

In [None]:
import pandas as pd

# 1. Define Documents
documents = pd.DataFrame([
    {"docno": "d1", "text": "PyTerrier is great for information retrieval."},
    {"docno": "d2", "text": "Terrier is a powerful information retrieval platform."},
    {"docno": "d3", "text": "Python is a popular programming language."},
    {"docno": "d4", "text": "This tutorial introduces PyTerrier basics."}
])

# 2. Define Queries
queries = pd.DataFrame([
    {"query": "information retrieval", "qid": "q1"},
    {"query": "programming tutorial", "qid": "q2"}
])

# 3. Define Relevance Judgments (qrels)
qrels = pd.DataFrame([
    {"qid": "q1", "docno": "d1", "label": 1},
    {"qid": "q1", "docno": "d2", "label": 0},
    {"qid": "q2", "docno": "d3", "label": 1},
    {"qid": "q2", "docno": "d4", "label": 1}
])

# 4. Indexing
index_path = "./index"
!rm -rf "./index"    # Remove the index if it exists (-f: no error if it is missing)

indexer = pt.index.DFIndexer(index_path)
index_ref = indexer.index(text=documents["text"], docno=documents["docno"])

# 5. Retrieval (BM25)
bm25 = pt.BatchRetrieve(index_ref, wmodel="BM25")


We can see the files were created:

In [None]:
!ls -ltrh ./index

total 44K
-rw-r--r-- 1 root root    7 Apr 11 11:35 data.direct.bf
-rw-r--r-- 1 root root   64 Apr 11 11:35 data.meta.zdata
-rw-r--r-- 1 root root   32 Apr 11 11:35 data.meta.idx
-rw-r--r-- 1 root root   68 Apr 11 11:35 data.document.fsarrayfile
-rw-r--r-- 1 root root   44 Apr 11 11:35 data.meta-0.fsomapfile
-rw-r--r-- 1 root root    9 Apr 11 11:35 data.inverted.bf
-rw-r--r-- 1 root root 1.1K Apr 11 11:35 data.lexicon.fsomapfile
-rw-r--r-- 1 root root   52 Apr 11 11:35 data.lexicon.fsomapid
-rw-r--r-- 1 root root  321 Apr 11 11:35 data.lexicon.fsomaphash
-rw-r--r-- 1 root root 4.1K Apr 11 11:35 data.properties


We can see statistics of our index as follows:

In [None]:
index = pt.IndexFactory.of(index_ref)
print(index.getCollectionStatistics().toString())

Number of documents: 4
Number of terms: 13
Number of postings: 15
Number of fields: 0
Number of tokens: 15
Field names: []
Positions:   false



In [None]:
terms = index.getLexicon()

for entry in terms:
    print(entry.getKey())

basic
great
introduc
languag
platform
popular
power
program
pyterri
python
retriev
terrier
tutori


### Exercise 1:
The number of terms (13) is less than the number of words in all the 4 documents. Explain the reason for this.

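I printed the 13 terms to check that they are unique terms after preprocessing.

The four documents contain 24 words in total, but Pyterrier does not index all of them. Stopwords are removed during indexing (here "is", "a", "for", "this" and also "information", none of which appear in the lexicon), which leaves 15 tokens. The remaining words are stemmed (for example, "retrieval" becomes "retriev"). A term is a unique token: "retriev" and "pyterri" each occur in two documents, so the 15 tokens collapse into 13 distinct terms.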
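The counts above can be reproduced with a minimal sketch in plain Python. The stopword set here is an assumption inferred from the lexicon output, not Terrier's actual stopword list, and stemming is left out because it shortens the terms without merging any further words in this toy corpus.

```python
import re

# The four toy documents from the indexing cell above.
docs = {
    "d1": "PyTerrier is great for information retrieval.",
    "d2": "Terrier is a powerful information retrieval platform.",
    "d3": "Python is a popular programming language.",
    "d4": "This tutorial introduces PyTerrier basics.",
}

# Assumption: hand-picked stopwords inferred from the lexicon output,
# not Terrier's real stopword list.
STOPWORDS = {"is", "a", "for", "this", "information"}

def words_of(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def tokenize(text):
    """Split into words and drop stopwords."""
    return [w for w in words_of(text) if w not in STOPWORDS]

all_words = [w for text in docs.values() for w in words_of(text)]
tokens = [t for text in docs.values() for t in tokenize(text)]
terms = set(tokens)  # a term is a unique token

print("words: ", len(all_words))
print("tokens:", len(tokens))
print("terms: ", len(terms))
```

The token and term counts printed here can be compared with the "Number of tokens" and "Number of terms" lines of the collection statistics.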

These are the terms identified by Pyterrier.

In [None]:
for kv in index.getLexicon():
  print("%s -> %s" % (kv.getKey(), kv.getValue().toString() ) )

basic -> term11 Nt=1 TF=1 maxTF=1 @{0 0 0}
great -> term2 Nt=1 TF=1 maxTF=1 @{0 0 6}
introduc -> term12 Nt=1 TF=1 maxTF=1 @{0 1 0}
languag -> term6 Nt=1 TF=1 maxTF=1 @{0 1 6}
platform -> term4 Nt=1 TF=1 maxTF=1 @{0 2 2}
popular -> term8 Nt=1 TF=1 maxTF=1 @{0 2 6}
power -> term3 Nt=1 TF=1 maxTF=1 @{0 3 2}
program -> term9 Nt=1 TF=1 maxTF=1 @{0 3 6}
pyterri -> term1 Nt=2 TF=2 maxTF=1 @{0 4 2}
python -> term7 Nt=1 TF=1 maxTF=1 @{0 5 0}
retriev -> term0 Nt=2 TF=2 maxTF=1 @{0 5 4}
terrier -> term5 Nt=1 TF=1 maxTF=1 @{0 6 0}
tutori -> term10 Nt=1 TF=1 maxTF=1 @{0 6 4}


We can search a Pyterrier index using BM25 with the following function:

In [None]:
bm25.search("programming language")

| | qid | docid | docno | rank | score | query |
|---|---|---|---|---|---|---|
| 0 | 1 | 2 | d3 | 0 | 2.379879 | programming language |



### Exercise 2

Define each column in the output of search above.

Write answer here:

- qid: the ID of the query. For a single query passed to `search`, Pyterrier assigns the default ID 1.
- docid: the internal integer ID that Terrier assigns to each document during indexing. It starts at 0, so d3, the third document, has docid 2.
- docno: the external document identifier that we defined in the `documents` DataFrame at the beginning.
- rank: the position of the document in the result list, starting at 0. Here d3 is ranked first because it has the highest score. It is also the only result, because it is the only document that contains a query term.
- score: the BM25 score of the document for this query. A higher score means the document is a better match.
- query: the query text that we searched for in our documents.
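To give some intuition for the `score` column, here is a minimal sketch of a common textbook variant of Okapi BM25 in plain Python. Terrier uses its own formula variant and parameter settings, so this sketch is not expected to reproduce the exact score above. The function name `bm25_scores` and the parameter values `k1=1.2` and `b=0.75` are assumptions for illustration.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score every document against the query with a textbook BM25 variant.

    Documents that contain no query term get no score at all,
    which mirrors why only matching documents appear in the results.
    """
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    scores = {}
    for docno, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalisation: long documents are penalised.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        if score > 0:
            scores[docno] = score
    return scores

# Toy documents, already preprocessed (stopwords removed, terms stemmed
# as in the lexicon printed earlier).
docs = {
    "d1": ["pyterri", "great", "retriev"],
    "d2": ["terrier", "power", "retriev", "platform"],
    "d3": ["python", "popular", "program", "languag"],
    "d4": ["tutori", "introduc", "pyterri", "basic"],
}

scores = bm25_scores(["program", "languag"], docs)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Only d3 receives a score for this query, because it is the only document containing "program" or "languag".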


Pyterrier offers a way to evaluate models using different metrics. The list of available metrics is here: https://pyterrier.readthedocs.io/en/latest/experiments.html#evaluation-measures-objects.

We can also see how `Experiment` accepts a DataFrame of queries and qrels which are used to compute the metrics.

In [None]:
# 6. Evaluation Pipeline
pt.Experiment([bm25], queries, qrels, eval_metrics=["map", "P_10"], names=["BM25"])
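To make the two metrics in this experiment concrete, the following is a minimal sketch of precision at k and average precision for a single query in plain Python. MAP is the mean of average precision over all queries. The ranking and the relevant set are made up for illustration, and the function names are assumptions, not Pyterrier APIs.

```python
def precision_at_k(ranked_docnos, relevant, k):
    """Fraction of the top-k positions that hold a relevant document."""
    top_k = ranked_docnos[:k]
    return sum(1 for d in top_k if d in relevant) / k

def average_precision(ranked_docnos, relevant):
    """Mean of precision@rank over the ranks where a relevant document appears."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, docno in enumerate(ranked_docnos, start=1):
        if docno in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

# Hypothetical ranking for one query; d3 and d4 are the relevant documents.
ranking = ["d3", "d1", "d4", "d2"]
relevant = {"d3", "d4"}

ap = average_precision(ranking, relevant)    # (1/1 + 2/3) / 2
p2 = precision_at_k(ranking, relevant, k=2)  # one relevant document in the top 2
print(ap, p2)
```

Note that precision at k divides by k even when fewer than k documents are returned, so `P_10` stays low on a corpus with only four documents.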

### Exercise 3

Instead of BM25 use a TF_IDF model. Try using it to search for a query then pass it along with bm25 to `Experiment`.

In [None]:
# Write answer here

## Neural Reranking with ColBERT

Below we use `pyterrier_colbert`, a plugin for Pyterrier that makes it possible to use a ColBERT model for indexing and retrieval. We will use the [Vaswani NPL corpus](http://ir.dcs.gla.ac.uk/resources/test_collections/npl/), a corpus of 11,429 scientific abstracts with corresponding queries and relevance assessments.

In [None]:
!rm -rf ./colbertindex

import pyterrier_colbert.indexing

checkpoint="http://www.dcs.gla.ac.uk/~craigm/colbert.dnn.zip"

indexer = pyterrier_colbert.indexing.ColBERTIndexer(checkpoint, "./", "colbertindex", chunksize=3)
indexer.index(pt.get_dataset("irds:vaswani").get_corpus_iter())

In [None]:
!ls -ltrh ./colbertindex

In [None]:
pyterrier_colbert_factory = indexer.ranking_factory()

colbert_e2e = pyterrier_colbert_factory.end_to_end()
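For intuition about how ColBERT compares a query with a document, here is a minimal sketch of its late-interaction (MaxSim) scoring in plain Python. The 2-dimensional vectors are made-up values, and the real model uses learned, higher-dimensional token embeddings. The function names are assumptions for illustration, not part of `pyterrier_colbert`.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_embs, doc_embs):
    """Late interaction: for each query token, take the similarity of its
    best-matching document token, then sum over all query tokens."""
    return sum(max(dot(q, d) for d in doc_embs) for q in query_embs)

# Toy "embeddings" (made-up values), one vector per token.
query_embs = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
doc_b = [[0.1, 0.2], [0.3, 0.1]]

score_a = maxsim_score(query_embs, doc_a)
score_b = maxsim_score(query_embs, doc_b)
print(score_a, score_b)
```

Because every query token is matched separately, a document scores well only if it has a good match for each part of the query.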

We search using ColBERT as follows. The `% 5` operator applies a rank cutoff that keeps only the top 5 results.

In [None]:
out = (colbert_e2e % 5).search("chemical reactions")
out

In [None]:
out.loc[0, 'query_toks']

In [None]:
out.loc[0, 'query_embs'].shape

### Exercise 4

There are two new columns in the search results: `query_toks` and `query_embs`. Explain what they are and explain the shape of `query_embs`.

Write answer here

In [None]:
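# Build a Terrier index of the same Vaswani corpus so that we can run BM25 on it.
dataset = pt.datasets.get_dataset("vaswani")
index_path = "./index"

!rm -rf ./index
indexer = pt.TRECCollectionIndexer(index_path)

# Note: `indexer` is rebound here to the index reference returned by index().
indexer = indexer.index(dataset.get_corpus())

bm25 = pt.BatchRetrieve(indexer, wmodel="BM25")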

In the following code we create a cross-encoder reranker using the `sentence_transformers` library.

In [None]:
import pandas as pd
from sentence_transformers import CrossEncoder
crossmodel = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-12-v2', max_length=512)

def _crossencoder_apply(df : pd.DataFrame):
  return crossmodel.predict(list(zip(df['query'].values, df['text'].values)))

cross_encT = pt.apply.doc_score(_crossencoder_apply, batch_size=128)

We create a reranking pipeline that starts with BM25 as a retriever, and ends with using a sentence transformer (cross_encT) for reranking the retrieved documents. The reranking step requires the document text as input, which is not returned by default by bm25, and that is why we add `pt.text.get_text(dataset, 'text')` to retrieve the text documents and add them to the output of BM25.

In [None]:
dataset = pt.get_dataset('irds:vaswani')
cross_enc_rerank = bm25 >> pt.text.get_text(dataset, 'text') >> cross_encT
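The retrieve-then-rerank pattern can be sketched without any models in plain Python. In this sketch a simple word count stands in for BM25 and a word-set overlap stands in for the cross-encoder. Both scorers, the function names and the toy corpus are hypothetical and only illustrate the control flow.

```python
def first_stage(query, corpus, k):
    """Cheap retriever stand-in: score = total count of query words in the document."""
    q_words = query.lower().split()
    scored = []
    for docno, text in corpus.items():
        words = text.lower().split()
        score = sum(words.count(w) for w in q_words)
        if score > 0:
            scored.append((docno, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

def toy_pair_scorer(query, text):
    """Hypothetical stand-in for a cross-encoder: Jaccard overlap of the word sets."""
    q, d = set(query.lower().split()), set(text.lower().split())
    return len(q & d) / len(q | d)

def rerank(query, candidates, corpus, scorer):
    """Re-score only the first-stage candidates (this needs the document text)
    and sort them by the new score."""
    rescored = [(docno, scorer(query, corpus[docno])) for docno, _ in candidates]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored

corpus = {
    "d1": "chemical plants chemical industry chemical safety",
    "d2": "rates of chemical reactions in solution",
    "d3": "nuclear reactions in stars",
}
query = "chemical reactions"

candidates = first_stage(query, corpus, k=2)
reranked = rerank(query, candidates, corpus, toy_pair_scorer)
print([docno for docno, _ in candidates])
print([docno for docno, _ in reranked])
```

The second stage can reorder the candidates, but it never sees documents that the first stage cut off, which is why the quality of the first-stage retriever still matters.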

In [None]:
out = (bm25 % 5).search("chemical reactions")
out

In [None]:
out = (cross_enc_rerank % 5).search("chemical reactions")
out

The following demonstrates how we can construct IR pipelines using Pyterrier. The pipeline starts by using bm25 to retrieve relevant documents. These are passed to the `QueryExpansion` transformer, which expands the query by adding informative terms collected from the retrieved documents. The last step runs bm25 again with the expanded query. This process is called Pseudo Relevance Feedback (PRF).

In [None]:
query_expansion = bm25 >> pt.rewrite.QueryExpansion(indexer) >> bm25
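The idea behind pseudo relevance feedback can be sketched in a few lines of plain Python. This sketch picks expansion terms by raw frequency in the top-ranked documents, which is a simplification: Pyterrier's transformer relies on Terrier's own term-weighting model rather than raw counts. The function name `expand_query`, the stopword set and the feedback documents are hypothetical.

```python
from collections import Counter

def expand_query(query, feedback_docs, n_terms=2, stopwords=frozenset()):
    """Append the n_terms most frequent new words from the feedback documents."""
    query_words = query.lower().split()
    counts = Counter(
        word
        for text in feedback_docs
        for word in text.lower().split()
        if word not in query_words and word not in stopwords
    )
    new_terms = [word for word, _ in counts.most_common(n_terms)]
    return " ".join(query_words + new_terms)

# Hypothetical top-ranked documents from a first retrieval pass.
top_docs = [
    "chemical reactions need a catalyst",
    "a catalyst speeds up chemical reactions at low temperature",
    "temperature affects reactions and catalyst activity",
]

expanded = expand_query(
    "chemical reactions", top_docs, n_terms=2, stopwords={"a", "and", "at", "up"}
)
print(expanded)
```

The expanded query is then sent to the retriever a second time, which is the final `>> bm25` stage of the pipeline above.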

In [None]:
pt.Experiment(
    [bm25, query_expansion, colbert_e2e, cross_enc_rerank],
    dataset.get_topics(),
    dataset.get_qrels(),
    eval_metrics=["map", "P_10", "mrt"],
    names = ["BM25", "QE", "ColBERT", "BM25 >> CrossEnc"]
)

On this small dataset, we can see that BM25 achieves better results than ColBERT. Adding Query Expansion improves MAP a little. Using PRF with ColBERT (ColBERT-PRF, the subject of Exercise 7), however, hurts performance and also slows down inference because of the extra PRF step.

### Exercise 5

Apply the same process as above on a dataset of your choice. You can find the list of datasets in Pyterrier here: https://pyterrier.readthedocs.io/en/latest/datasets.html#available-datasets

### Exercise 6

Add at least 2 other metrics and explain what each of them is trying to capture.

In [None]:
# Answer both exercise 5 and 6 here

### Exercise 7

Follow the example in this [README](https://github.com/terrierteam/pyterrier_colbert/tree/ba5c86c0bc8da450dee361140541f35b5349a492) to implement Pseudo Relevance Feedback (PRF) with ColBERT. Describe how it works, and add ColBERT-PRF to `Experiment` to evaluate it against the other pipelines.

Write answer here

### Exercise 8

Implement a reranking pipeline with ColBERT and add it to `Experiment`.

Write answer here.

## Arxiv Abstracts Retrieval

In [None]:
!pip install setuptools==68.0.0       # you can skip this if you are using a local instance

In [None]:
!pip install arxiv==2.1.3

The following code defines a search that retrieves up to 1000 abstracts for the query "nlp", sorted by submission date.

In [None]:
import arxiv

# Search for papers
search = arxiv.Search(
    query="nlp",
    max_results=1000,
    sort_by=arxiv.SortCriterion.SubmittedDate
)

We construct a list of dictionaries, each holding the title and abstract of one paper.

In [None]:
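# arxiv 2.x: fetch results through a Client (Search.results() is deprecated).
client = arxiv.Client()

documents = []
for result in client.results(search):
    documents.append({
        'title': result.title,
        'abstract': result.summary,
    })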

In [None]:
documents[0]

Rerun this in case you had to restart the session.

In [None]:
import pyterrier as pt
# pt.init() is deprecated in recent Pyterrier releases; use pt.java.init().
if not pt.java.started():
    pt.java.init()

To create an index, we need a docno field, which we create below, and a text field, which is the abstract in this case.

We use `DFIndexer`, which allows us to create an index by passing a list of IDs and a list of texts.

In [None]:
import pandas as pd

index_path = './arxivindex'
!rm -rf './arxivindex'    # Remove the index if it exists (-f: no error if it is missing)

df = pd.DataFrame({
    'docno': ['doc'+str(i) for i in range(len(documents))],
    'text': [document['abstract'] for document in documents]
})

indexer = pt.DFIndexer(index_path)
indexer.index(docno=df.docno, text=df.text)

We create a BM25 retrieval model for the Arxiv index.

In [None]:
bm25 = pt.BatchRetrieve(indexer, wmodel="BM25")

In [None]:
(bm25 % 5).search("information retrieval")

In [None]:
df[df['docno']=='doc670'].text.item()

### Exercise 9

Create a ColBERT index using the Arxiv documents retrieved above and use it to search for some queries. Try to highlight how it is different from BM25 through the query results. You can also change the query we used to get arxiv pages.