General setup for all the datasets (first 3 cells)

In [18]:
import pyterrier as pt
from experiment_utils.experiments_helper import time_fct

if not pt.started():
    pt.init()

Evaluation metrics used for all the datasets

In [19]:
from pyterrier.measures import RR, nDCG, MAP

eval_metrics = [RR @ 10, nDCG @ 10, MAP @ 100]
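To make the cutoffs concrete, here is a minimal plain-Python sketch of what RR@10 and AP@100 measure for a single query. The function names and the toy ranking are illustrative assumptions, not part of PyTerrier; the actual numbers in this notebook come from `pyterrier.measures`, and normalisation conventions can differ slightly between evaluation tools. nDCG is left out to keep the sketch short.

```python
def reciprocal_rank_at_k(ranked_doc_ids, relevant_ids, k=10):
    """Return 1/rank of the first relevant document in the top k, else 0.0."""
    for rank, doc_id in enumerate(ranked_doc_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def average_precision_at_k(ranked_doc_ids, relevant_ids, k=100):
    """Average the precision at each relevant hit in the top k.

    Assumption: normalise by the total number of relevant documents.
    """
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids[:k], start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids)


# Toy example (made-up ids): relevant documents sit at ranks 2 and 4.
toy_ranking = ["d3", "d1", "d7", "d2"]
toy_relevant = {"d1", "d2"}
toy_rr = reciprocal_rank_at_k(toy_ranking, toy_relevant)
toy_ap = average_precision_at_k(toy_ranking, toy_relevant)
print(toy_rr, toy_ap)
```

The reported scores are these per-query values averaged over all test queries.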

Create the query encoder, which runs on the CPU. The same encoder is used for embedding all the datasets and their queries.

In [20]:
from encoders.gte_base_en_encoder import GTEBaseDocumentEncoder

q_encoder = GTEBaseDocumentEncoder("Alibaba-NLP/gte-base-en-v1.5")


Define the path to root

If you would like to run the pipeline directly, you can skip ahead to the cell that begins: "Same experiment as above using the default_complete_test_pipeline_name methods"


In [15]:
path_to_root = "../../"

## NFCorpus

In [16]:
from experiment_utils.experiments_helper import load_sparse_index_from_disk

dataset_name = "irds:beir/nfcorpus"
model_name = "gte-base-en-v1.5"

retriever = load_sparse_index_from_disk(dataset_name, path_to_root)

Testing the sparse retrieval

In [6]:
from experiment_utils.experiments_helper import run_single_experiment_name

dataset_test_name = "irds:beir/nfcorpus/test"
run_single_experiment_name(retriever, dataset_test_name, eval_metrics, dataset_name + ": BM25", timed=True)

Experiment took 6.507 seconds to execute.


Unnamed: 0,name,RR@10,nDCG@10,AP@100
0,irds:beir/nfcorpus: BM25,0.534378,0.322219,0.143582


Retrieve the dense index (already loaded into memory)

In [7]:
from experiment_utils.experiments_helper import load_dense_index_from_disk

dense_index = load_dense_index_from_disk(dataset_name, q_encoder, model_name)

100%|██████████| 3633/3633 [00:00<00:00, 1133098.34it/s]


In [8]:
from fast_forward.util.pyterrier import FFInterpolate, FFScore

ff_score = FFScore(dense_index)
ff_int = FFInterpolate(alpha=0.05)

Find the optimal alpha from the default set [0.25, 0.05, 0.1, 0.5, 0.9]

In [9]:
from experiment_utils.experiments_helper import find_optimal_alpha_name

dev_set_name = "irds:beir/nfcorpus/dev"
pipeline_find_alpha = retriever % 100 >> ff_score >> ff_int
find_optimal_alpha_name(pipeline_find_alpha, ff_int, dev_set_name)

GridScan: 100%|██████████| 4/4 [00:43<00:00, 10.81s/it]

Best map is 0.126401
Best setting is ['<fast_forward.util.pyterrier.FFInterpolate object at 0x7f33dc35eb90> alpha=0.01']
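As a rough illustration of what alpha controls, the sketch below re-ranks a toy candidate list by interpolating sparse and dense scores and then picks the alpha that ranks a known relevant document highest. It assumes the usual Fast-Forward convention, `alpha * sparse + (1 - alpha) * dense`. The helper names, the scores, and the use of reciprocal rank on a single query are all illustrative assumptions; the grid scan above tunes on MAP over the whole dev set.

```python
def interpolate(sparse_score, dense_score, alpha):
    """Combine the two scores; alpha is the weight of the sparse (BM25) score."""
    return alpha * sparse_score + (1.0 - alpha) * dense_score


def rerank(candidates, alpha):
    """candidates: list of (doc_id, sparse_score, dense_score) tuples."""
    scored = [(doc_id, interpolate(s, d, alpha)) for doc_id, s, d in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def pick_alpha(candidates, relevant_ids, alphas):
    """Return the alpha whose ranking places a relevant document highest."""
    def reciprocal_rank(alpha):
        for rank, (doc_id, _) in enumerate(rerank(candidates, alpha), start=1):
            if doc_id in relevant_ids:
                return 1.0 / rank
        return 0.0

    return max(alphas, key=reciprocal_rank)


# Made-up scores: "a" wins on BM25, "b" wins on the dense score.
toy_candidates = [("a", 12.0, 0.2), ("b", 9.0, 0.9), ("c", 7.0, 0.5)]
print(rerank(toy_candidates, alpha=0.05))
print(pick_alpha(toy_candidates, {"b"}, alphas=[0.9, 0.5, 0.1]))
```

Because BM25 scores are typically on a much larger scale than dense similarity scores, small alpha values can still leave the sparse score with noticeable influence.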





Create pipeline with 1000 docs retrieved per query

In [10]:
from experiment_utils.experiments_helper import run_single_experiment_name

dataset_test_name = "irds:beir/nfcorpus/test"

pipeline = retriever % 1000 >> ff_score >> ff_int

run_single_experiment_name(pipeline, dataset_test_name, eval_metrics, dataset_name + ": BM25 >> gte-base-en-v1.5",
                           timed=True)


Experiment took 11.892 seconds to execute.


Unnamed: 0,name,RR@10,nDCG@10,AP@100
0,irds:beir/nfcorpus: BM25 >> gte-base-en-v1.5,0.582751,0.364177,0.166036


Same experiment as above using the default_complete_test_pipeline_name methods. In the following experiments, alpha is not tuned and is preset to 0.05. The goal was only to see how fast each experiment runs. The timeit library was not used because I first wanted to understand whether these running times are normal overall, and only then assess the latency of each experiment.
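A single wall-clock measurement of this kind can be sketched as follows. `timed_call` is a hypothetical stand-in, not the `time_fct` helper imported at the top of the notebook, whose implementation is not shown here.

```python
import time


def timed_call(fn, *args, **kwargs):
    """Run fn once, report the elapsed wall-clock time, and return both.

    Hypothetical helper for illustration only.
    """
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"Experiment took {elapsed:.3f} seconds to execute.")
    return result, elapsed


# Usage with a trivial function standing in for an experiment.
total, seconds = timed_call(sum, [1, 2, 3])
```

One run like this is enough for a sanity check of overall running time; for latency comparisons, repeating the call several times and looking at the spread would give a more reliable picture.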

In [21]:
from experiment_utils.experiments_helper import default_test_pipeline_name

dataset_name = "irds:beir/nfcorpus"
dev_set_name = "irds:beir/nfcorpus/dev"
dataset_test_name = "irds:beir/nfcorpus/test"
pipeline_name = "BM25 >> " + model_name

default_test_pipeline_name(dataset_name, dataset_test_name, q_encoder, eval_metrics, model_name, pipeline_name,
                           path_to_root, dev_set_name=dev_set_name, timed=True)

100%|██████████| 3633/3633 [00:00<00:00, 843597.77it/s]
GridScan: 100%|██████████| 4/4 [00:41<00:00, 10.29s/it]

Best map is 0.126401
Best setting is ['<fast_forward.util.pyterrier.FFInterpolate object at 0x7f33d7a33ee0> alpha=0.01']





TypeError: run_single_experiment() got an unexpected keyword argument 'alpha'

Run the pipeline for the FiQA dataset

In [None]:
from experiment_utils.experiments_helper import default_test_pipeline_name

dataset_name = "irds:beir/fiqa"
dev_set_name = "irds:beir/fiqa/dev"
dataset_test_name = "irds:beir/fiqa/test"

default_test_pipeline_name(dataset_name, dataset_test_name, q_encoder, eval_metrics, model_name, pipeline_name,
                           path_to_root, dev_set_name=dev_set_name, timed=True)

## For the Scidocs dataset, which lacks a dev set, the train set was used to tune the alpha value.

In [None]:
from experiment_utils.experiments_helper import default_test_pipeline_name

dataset_name = "irds:beir/scidocs"
dataset = pt.get_dataset(dataset_name)
test_topics = dataset.get_topics('text')

default_test_pipeline_name(dataset_name, dataset.get_qrels(), test_topics, q_encoder, eval_metrics, model_name,
                           pipeline_name, path_to_root, timed=True)


A similar approach is also followed for the "cqadupstack/english" dataset.

In [4]:
from experiment_utils.experiments_helper import split_dev_test, default_complete_test_pipeline

dataset_name = "cqadupstack/english"
dataset = pt.get_dataset("irds:beir/cqadupstack/english")
topics = dataset.get_topics('text')

dev_topics, test_topics = split_dev_test(topics, test_size=0.8)

time_fct(default_complete_test_pipeline, dataset_name, dataset.get_qrels(), test_topics, q_encoder, eval_metrics)


100%|██████████| 40221/40221 [00:00<00:00, 1227563.21it/s]


Experiment took 153.739 seconds to execute.


Unnamed: 0,name,RR@10,nDCG@10,AP@100
0,cqadupstack/english: BM25 >> gte-base-en-v1.5,0.366865,0.356954,0.326041
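For datasets without a dev set, the topics are split into a dev part and a test part. The sketch below shows one way such a split could work on a plain list of query records; `split_topics` and its `seed` parameter are hypothetical and are not the `split_dev_test` helper used above, which operates on the PyTerrier topics DataFrame and whose implementation is not shown here.

```python
import random


def split_topics(topics, test_size=0.8, seed=42):
    """Shuffle the topics and return (dev, test), with test_size as a fraction.

    Hypothetical helper for illustration; a fixed seed keeps the split reproducible.
    """
    shuffled = list(topics)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


# Toy topics (made-up queries): 10 queries give 2 for dev and 8 for test.
toy_topics = [{"qid": str(i), "query": f"query {i}"} for i in range(10)]
dev_part, test_part = split_topics(toy_topics, test_size=0.8)
print(len(dev_part), len(test_part))
```

Keeping the two parts disjoint matters here, since alpha tuned on queries that also appear in the test part would inflate the reported scores.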


A similar approach is also followed for the "arguana" dataset.

### DelftBlue runtime: 15 minutes. Local runtime: 20 minutes

In [10]:
from experiment_utils.experiments_helper import split_dev_test, default_complete_test_pipeline

dataset_name = "arguana"
dataset = pt.get_dataset("irds:beir/arguana")
topics = dataset.get_topics()

dev_topics, test_topics = split_dev_test(topics, test_size=0.8)

time_fct(default_complete_test_pipeline, dataset_name, dataset.get_qrels(), test_topics, q_encoder,
         eval_metrics)



100%|██████████| 8674/8674 [00:00<00:00, 851624.37it/s]


Experiment took 1163.829 seconds to execute.


Unnamed: 0,name,RR@10,nDCG@10,AP@100
0,arguana: BM25 >> gte-base-en-v1.5,0.252108,0.376144,0.262613


I will rerun this cell after rebuilding the dense index, as some documents are currently not found (no vectors for...). Is it possible that I built the dense index correctly and "irds:beir/scifact/test" is missing some documents that are tested in "irds:beir/scifact"? The error is also reproduced in debug.ipynb, where using only the sparse index does not cause any error, so I believe the dense index is the cause.
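One way to narrow this down is to compare the document ids returned by the sparse retriever with the ids that have vectors in the dense index. The sketch below works on plain lists of ids; `find_missing_doc_ids` is a hypothetical helper, and how the two id lists are obtained from the sparse results and the dense index is left open here.

```python
def find_missing_doc_ids(retrieved_doc_ids, indexed_doc_ids):
    """Return the sorted, de-duplicated ids that were retrieved but have no vector.

    Hypothetical helper for illustration only.
    """
    indexed = set(indexed_doc_ids)
    return sorted({doc_id for doc_id in retrieved_doc_ids if doc_id not in indexed})


# Toy ids (made up): "d9" was retrieved by the sparse stage but never encoded.
missing = find_missing_doc_ids(["d1", "d2", "d9", "d9"], ["d1", "d2", "d3"])
print(missing)
```

An empty result would suggest the dense index covers everything the sparse stage returns, in which case the problem would more likely lie in how ids are matched (for example, string versus integer ids) than in missing vectors.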

In [4]:
from experiment_utils.experiments_helper import default_complete_test_pipeline_name

dataset_name = "scifact"
# dev_set_name = "irds:beir/scifact/train"
dataset_test_name = "irds:beir/scifact/test"

time_fct(
    default_complete_test_pipeline_name, dataset_name, dataset_test_name, q_encoder, eval_metrics)

100%|██████████| 5183/5183 [00:00<00:00, 1446475.32it/s]


Experiment took 30.021 seconds to execute.


Unnamed: 0,name,RR@10,nDCG@10,AP@100
0,scifact: BM25 >> gte-base-en-v1.5,0.669475,0.708775,0.664073
