General Setup for all the datasets

In [1]:
import pyterrier as pt
from experiments_helper import time_fct

if not pt.started():
    pt.init()

PyTerrier 0.10.0 has loaded Terrier 5.9 (built by craigm on 2024-05-02 17:40) and terrier-helper 0.0.8

No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.


Evaluation metrics used for all the datasets

In [2]:
from pyterrier.measures import RR, nDCG, MAP

eval_metrics = [RR @ 10, nDCG @ 10, MAP @ 100]
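To make the reported numbers easier to read, here is a minimal sketch of how RR@10 and nDCG@10 can be computed for a single query. It assumes a linear-gain nDCG with a log2 rank discount. The function names `reciprocal_rank` and `ndcg` and the toy data are illustrative only and are not part of PyTerrier.

```python
import math


def reciprocal_rank(ranked_ids, relevant, k=10):
    """Return 1/rank of the first relevant document within the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg(ranked_ids, gains, k=10):
    """Linear-gain nDCG@k: DCG of the ranking divided by DCG of the ideal ranking."""
    dcg = sum(
        gains.get(doc_id, 0) / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
    )
    ideal_gains = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal_gains, start=1))
    return dcg / idcg if idcg > 0 else 0.0


# Toy example (hypothetical ids): graded relevance labels for one query.
toy_gains = {"d1": 2, "d2": 1}
toy_ranking = ["d3", "d1", "d2"]

toy_rr = reciprocal_rank(toy_ranking, set(toy_gains))
toy_ndcg = ndcg(toy_ranking, toy_gains)
```

The values reported in the result tables are these per-query scores averaged over all test queries.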

Create the query encoder, which runs on the CPU. The same encoder is used to embed the queries of all datasets.

In [3]:
from gte_base_en_encoder import GTEBaseDocumentEncoder

q_encoder = GTEBaseDocumentEncoder("Alibaba-NLP/gte-base-en-v1.5")


## NFCorpus

In [17]:
from experiments_helper import load_sparse_index_from_disk

dataset_name = "nfcorpus"

retriever = load_sparse_index_from_disk(dataset_name)

Testing the sparse retrieval

In [23]:
from experiments_helper import run_single_experiment_name

dataset_test_name = "irds:beir/nfcorpus/test"
time_fct(run_single_experiment_name, retriever, dataset_test_name, eval_metrics, dataset_name + ": BM25")

Experiment took 2.896 seconds to execute.


Unnamed: 0,name,RR@10,nDCG@10,AP@100
0,nfcorpus: BM25,0.534378,0.322219,0.143582
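The "Experiment took ... seconds" line is printed by the `time_fct` helper, whose source is not shown in this notebook. A timing wrapper of this kind could look roughly like the sketch below. The name `time_call` is hypothetical, and the real helper may differ.

```python
import time


def time_call(fct, *args, **kwargs):
    """Call fct with the given arguments, print the elapsed wall-clock time, return its result."""
    start = time.perf_counter()
    result = fct(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"Experiment took {elapsed:.3f} seconds to execute.")
    return result


# Usage with a stand-in function instead of a retrieval experiment.
total = time_call(sum, [1, 2, 3])
```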


Retrieve the dense index (already loaded into memory)

In [24]:
from experiments_helper import load_dense_index_from_disk

dense_index = load_dense_index_from_disk(dataset_name, q_encoder)

100%|██████████| 3633/3633 [00:00<00:00, 1158159.64it/s]


In [25]:
from fast_forward.util.pyterrier import FFInterpolate, FFScore

ff_score = FFScore(dense_index)
ff_int = FFInterpolate(alpha=0.05)
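The role of `alpha` can be illustrated with a small sketch. It assumes the interpolation is a convex combination, `alpha * sparse_score + (1 - alpha) * dense_score`, which is my understanding of what `FFInterpolate` does. The document ids and scores below are made up for illustration.

```python
def interpolate(sparse_score, dense_score, alpha):
    """Combine a sparse (BM25) score and a dense score with weight alpha on the sparse part."""
    return alpha * sparse_score + (1 - alpha) * dense_score


# Hypothetical candidates for one query: doc id -> (BM25 score, dense similarity).
candidates = {
    "d1": (12.0, 0.60),
    "d2": (10.0, 0.90),
    "d3": (8.0, 0.75),
}

alpha = 0.05
final_scores = {
    doc_id: interpolate(sparse, dense, alpha)
    for doc_id, (sparse, dense) in candidates.items()
}

# Re-rank the candidates by the interpolated score, best first.
ranking = sorted(final_scores, key=final_scores.get, reverse=True)
```

Because BM25 scores are typically on a larger scale than dense similarities, even a small `alpha` such as 0.05 lets the sparse score influence the final ranking.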

Find the best alpha from the default set [0.25, 0.05, 0.1, 0.5, 0.9]

In [9]:
from experiments_helper import find_optimal_alpha_name

dev_set_name = "irds:beir/nfcorpus/dev"
pipeline_find_alpha = retriever % 100 >> ff_score >> ff_int
find_optimal_alpha_name(pipeline_find_alpha, ff_int, dev_set_name)

GridScan: 100%|██████████| 5/5 [00:31<00:00,  6.35s/it]

Best map is 0.124061
Best setting is ['<fast_forward.util.pyterrier.FFInterpolate object at 0x7fab8adf6290> alpha=0.05']
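The idea behind the scan is simple: try each candidate alpha on the dev queries and keep the one with the highest mean average precision. The sketch below shows that logic on toy data in plain Python. It is not the code of `find_optimal_alpha_name`, and the helper names and data are hypothetical.

```python
def average_precision(ranked_ids, relevant):
    """Average of precision values at the ranks where relevant documents appear."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def rank_with_alpha(candidates, alpha):
    """Rank candidates by alpha * sparse + (1 - alpha) * dense, best first."""
    scores = {d: alpha * s + (1 - alpha) * e for d, (s, e) in candidates.items()}
    return sorted(scores, key=scores.get, reverse=True)


def grid_scan(queries, qrels, alphas):
    """Return (best_alpha, best_map); on ties the first alpha in the list wins."""
    best_alpha, best_map = None, -1.0
    for alpha in alphas:
        aps = [
            average_precision(rank_with_alpha(cands, alpha), qrels[qid])
            for qid, cands in queries.items()
        ]
        mean_ap = sum(aps) / len(aps)
        if mean_ap > best_map:
            best_alpha, best_map = alpha, mean_ap
    return best_alpha, best_map


# Toy dev set: query id -> {doc id: (BM25 score, dense similarity)}.
dev_queries = {
    "q1": {"d1": (12.0, 0.2), "d2": (9.0, 0.9)},
    "q2": {"d3": (11.0, 0.3), "d4": (10.0, 0.8)},
}
dev_qrels = {"q1": {"d2"}, "q2": {"d4"}}

best_alpha, best_map = grid_scan(dev_queries, dev_qrels, [0.25, 0.05, 0.1, 0.5, 0.9])
```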





Create pipeline with 1000 docs retrieved per query

In [27]:
from experiments_helper import run_single_experiment_name

dataset_test_name = "irds:beir/nfcorpus/test"

pipeline = retriever % 1000 >> ff_score >> ff_int

time_fct(run_single_experiment_name, pipeline, dataset_test_name, eval_metrics,
         dataset_name + ": BM25 >> gte-base-en-v1.5")


Experiment took 8.140 seconds to execute.


Unnamed: 0,name,RR@10,nDCG@10,AP@100
0,nfcorpus: BM25 >> gte-base-en-v1.5,0.553675,0.343964,0.15507


Same experiment as above, using the default_complete_test_pipeline_name helper

In [28]:
from experiments_helper import default_complete_test_pipeline_name

dataset_name = "nfcorpus"
# dev_set_name = "irds:beir/nfcorpus/dev"
dataset_test_name = "irds:beir/nfcorpus/test"

time_fct(default_complete_test_pipeline_name, dataset_name, dataset_test_name,
         q_encoder, eval_metrics)



100%|██████████| 3633/3633 [00:00<00:00, 833674.71it/s]


Experiment took 8.307 seconds to execute.


Unnamed: 0,name,RR@10,nDCG@10,AP@100
0,nfcorpus: BM25 >> gte-base-en-v1.5,0.553675,0.343964,0.15507


Run the pipeline for the FiQA dataset

In [29]:
from experiments_helper import default_complete_test_pipeline_name

dataset_name = "fiqa"
# dev_set_name = "irds:beir/fiqa/dev"
dataset_test_name = "irds:beir/fiqa/test"

time_fct(
    default_complete_test_pipeline_name, dataset_name, dataset_test_name, q_encoder, eval_metrics)



100%|██████████| 57638/57638 [00:00<00:00, 822787.06it/s]


Experiment took 56.576 seconds to execute.


Unnamed: 0,name,RR@10,nDCG@10,AP@100
0,fiqa: BM25 >> gte-base-en-v1.5,0.399449,0.335966,0.274557


## Scidocs

The Scidocs dataset has no dedicated dev set for tuning the alpha value. Because it offers only a single split, we divide it into a dev set and a test set. More precisely, we split the topics, since those are what we evaluate against. I chose the 'text' topics, as this dataset offers two topic categories.
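The source of `split_dev_test` is not shown here. A minimal sketch of such a split, operating on a plain list of query ids, could look as follows. It assumes that `test_size=0.8` means 80% of the topics go to the test set and the rest to the dev set. The name `split_topics` and the fixed seed are illustrative choices.

```python
import random


def split_topics(qids, test_size=0.8, seed=42):
    """Shuffle the query ids reproducibly and split them into (dev, test)."""
    shuffled = list(qids)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    test = shuffled[:n_test]
    dev = shuffled[n_test:]
    return dev, test


# Usage with ten hypothetical query ids.
all_qids = [f"q{i}" for i in range(10)]
dev_qids, test_qids = split_topics(all_qids, test_size=0.8)
```

A fixed seed keeps the dev and test sets identical across notebook runs, so that an alpha tuned on the dev topics is never evaluated on those same topics.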

In [5]:
from experiments_helper import split_dev_test, default_complete_test_pipeline

dataset_name = "scidocs"
dataset = pt.get_dataset("irds:beir/scidocs")
topics = dataset.get_topics('text')

dev_topics, test_topics = split_dev_test(topics, test_size=0.8)

time_fct(default_complete_test_pipeline, dataset_name, dataset.get_qrels(), test_topics, q_encoder,
         eval_metrics)


100%|██████████| 25657/25657 [00:00<00:00, 679372.34it/s]


Experiment took 76.129 seconds to execute.


Unnamed: 0,name,RR@10,nDCG@10,AP@100
0,scidocs: BM25 >> gte-base-en-v1.5,0.292146,0.165924,0.113075


A similar approach is also followed for the "cqadupstack/english" dataset.

In [4]:
from experiments_helper import split_dev_test, default_complete_test_pipeline

dataset_name = "cqadupstack/english"
dataset = pt.get_dataset("irds:beir/cqadupstack/english")
topics = dataset.get_topics('text')

dev_topics, test_topics = split_dev_test(topics, test_size=0.8)

time_fct(default_complete_test_pipeline, dataset_name, dataset.get_qrels(), test_topics, q_encoder, eval_metrics)


100%|██████████| 40221/40221 [00:00<00:00, 1227563.21it/s]


Experiment took 153.739 seconds to execute.


Unnamed: 0,name,RR@10,nDCG@10,AP@100
0,cqadupstack/english: BM25 >> gte-base-en-v1.5,0.366865,0.356954,0.326041


A similar approach is also followed for the "arguana" dataset.

### DelftBlue runtime: 15 minutes

In [10]:
from experiments_helper import split_dev_test, default_complete_test_pipeline

dataset_name = "arguana"
dataset = pt.get_dataset("irds:beir/arguana")
topics = dataset.get_topics()

dev_topics, test_topics = split_dev_test(topics, test_size=0.8)

time_fct(default_complete_test_pipeline, dataset_name, dataset.get_qrels(), test_topics, q_encoder,
         eval_metrics)



100%|██████████| 8674/8674 [00:00<00:00, 851624.37it/s]


Experiment took 1163.829 seconds to execute.


Unnamed: 0,name,RR@10,nDCG@10,AP@100
0,arguana: BM25 >> gte-base-en-v1.5,0.252108,0.376144,0.262613


This cell will be rerun after rebuilding the dense index, because some documents are currently not found in it (see the "no vectors for ..." messages in the output below).

In [6]:
from experiments_helper import default_complete_test_pipeline_name

dataset_name = "scifact"
# dev_set_name = "irds:beir/scifact/train"
dataset_test_name = "irds:beir/scifact/test"

time_fct(
    default_complete_test_pipeline_name, dataset_name, dataset_test_name, q_encoder, eval_metrics)

100%|██████████| 5136/5136 [00:00<00:00, 526878.28it/s]
no vectors for 121581019
no vectors for 154050141
no vectors for 195680777
no vectors for 140098548
no vectors for 198133135
no vectors for 99829811
no vectors for 195317463
no vectors for 116075383
no vectors for 143381103
no vectors for 140907540
no vectors for 198309074
no vectors for 154549459
no vectors for 167944455
no vectors for 129199129
no vectors for 195689316
no vectors for 146653163
no vectors for 154796494
no vectors for 117907685
no vectors for 116556376
no vectors for 195683603
no vectors for 155200372
no vectors for 195689757
no vectors for 143796742
no vectors for 142562844
no vectors for 144801076
no vectors for 145716849
no vectors for 145416918
no vectors for 196664003
no vectors for 168265642
no vectors for 121001457
no vectors for 154243324
no vectors for 154763124
no vectors for 144555102
no vectors for 109946221
no vectors for 104143831
no vectors for 145335387
no vectors for 118215171
no vectors for 10979

Experiment took 33.600 seconds to execute.


Unnamed: 0,name,RR@10,nDCG@10,AP@100
0,scifact: BM25 >> gte-base-en-v1.5,0.67003,0.709237,0.664663
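Before rebuilding the index, it may help to list exactly which document ids lack vectors. The sketch below does this with a set difference. It assumes the ids returned by the sparse retriever and the ids stored in the dense index can both be obtained as plain lists. The function name and the example ids are hypothetical.

```python
def find_missing_ids(candidate_ids, indexed_ids):
    """Return the sorted, de-duplicated candidate ids that are absent from the dense index."""
    indexed = set(indexed_ids)
    return sorted({doc_id for doc_id in candidate_ids if doc_id not in indexed})


# Hypothetical ids: what BM25 returned vs. what the dense index contains.
sparse_ids = ["1001", "1002", "1003", "1002"]
dense_ids = ["1001", "1003"]

missing = find_missing_ids(sparse_ids, dense_ids)
```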
