# This notebook demonstrates the experiments addressing RQ1 of our paper, which comprises the following research questions:

- RQ1.1: Can we reproduce the training of ColBERT?

- RQ1.2: What is the impact of the similarity function for ColBERT?
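To make RQ1.2 concrete, the following is a minimal sketch of ColBERT-style late-interaction (MaxSim) scoring under the two similarity functions compared in this notebook. It is an illustration only, not the code used by `pyterrier_colbert`: the function names and the toy 2-dimensional embeddings are hypothetical, and the L2 variant is assumed to be the negative squared Euclidean distance.

```python
import math


def _cosine(u, v):
    """Cosine similarity between two token embeddings."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def _neg_sq_l2(u, v):
    """Negative squared Euclidean distance (assumed L2 variant): larger means closer."""
    return -sum((a - b) ** 2 for a, b in zip(u, v))


def maxsim_score(query_embs, doc_embs, similarity="cosine"):
    """For each query token, take its best-matching document token, then sum."""
    sim = {"cosine": _cosine, "l2": _neg_sq_l2}[similarity]
    return sum(max(sim(q, d) for d in doc_embs) for q in query_embs)


# Hypothetical toy embeddings: two query tokens, two document tokens.
query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[1.0, 0.0], [0.0, 2.0]]

cosine_score = maxsim_score(query, doc, similarity="cosine")
l2_score = maxsim_score(query, doc, similarity="l2")
```

In this toy case the second document token points in the same direction as the second query token but has a different length, so cosine treats it as a perfect match while the L2 variant penalises it.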

# Load the index and the trained checkpoint

Note: for the model trained with cosine similarity, you need to load the corresponding cosine index. Likewise, load the L2 index for the model trained with L2 similarity.

In [5]:
import pyterrier as pt
pt.init(tqdm='notebook')

PyTerrier 0.9.1 has loaded Terrier 5.7 (built by craigm on 2022-11-10 18:30) and terrier-helper 0.0.7



In [None]:
from pyterrier_colbert.ranking import ColBERTFactory
checkpoint_loc = "/path/to/checkpoint.dnn"
index_path = "/path/to/index/folder/"
index_name = "index_name"

factory = ColBERTFactory(
    checkpoint_loc,
    index_path,
    index_name,
    faiss_partitions=100,
    memtype='mem'
)

In [None]:
factory.faiss_index_on_gpu = False
e2e_cosine = factory.end_to_end()
fnt=factory.nn_term(df=True)

# Evaluation on Dev.Small
- Following the original ColBERT paper, we reproduce the results on Dev.Small using the following pipelines:
- rerank: reranking the official BM25 result set obtained from the MSMARCO leaderboard https://microsoft.github.io/msmarco/
- e2e: end-to-end retrieval with each of the colbert-cosine and colbert-l2 models.

In [None]:
import pandas as pd
# Official BM25 top-1000 candidates for the dev queries (tab-separated, no header row)
official_bm25 = pd.read_csv("https://msmarco.blob.core.windows.net/msmarcoranking/top1000.dev.tar.gz", sep="\t", names=['qid', 'docno', 'query', 'text'])
official_bm25.qid = official_bm25.qid.astype(str)
official_bm25.docno = official_bm25.docno.astype(str)
rerank = pt.transformer.SourceTransformer(official_bm25) >> factory.text_scorer()
e2e = factory.end_to_end()

In [None]:
from pyterrier.measures import *
pt.Experiment(
    [rerank, e2e,
    ],
    pt.get_dataset("trec-deep-learning-passages").get_topics('dev.small'),
    pt.get_dataset("trec-deep-learning-passages").get_qrels('dev.small'),
    batch_size=100, 
    verbose=True,
    save_dir = "./",
    filter_by_qrels=True,
    eval_metrics=[RR@10, nDCG@10, R@50, R@200, R@1000],
    names=["colbert.cosine.rerank.dev.small","colbert.cosine.e2e.dev.small" ]
)
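For orientation, the rank-based measures passed to `eval_metrics` above can be sketched in plain Python for a single query. This is a simplified illustration with binary relevance, not the implementation PyTerrier uses internally; the function names and the toy ranking are hypothetical. The reported figures are averages of such per-query values over all queries.

```python
def reciprocal_rank_at_k(ranked_docnos, relevant, k=10):
    """1/rank of the first relevant document within the top k, else 0."""
    for rank, docno in enumerate(ranked_docnos[:k], start=1):
        if docno in relevant:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_docnos, relevant, k):
    """Fraction of the relevant documents found within the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for docno in ranked_docnos[:k] if docno in relevant)
    return hits / len(relevant)


# Hypothetical ranking for one query; d1 and d2 are the relevant passages.
ranking = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}

rr10 = reciprocal_rank_at_k(ranking, relevant, k=10)
r3 = recall_at_k(ranking, relevant, k=3)
r4 = recall_at_k(ranking, relevant, k=4)
```

Here the first relevant passage sits at rank 2, so the reciprocal rank is 0.5; recall rises from 0.5 to 1.0 once the cutoff is deep enough to include the second relevant passage.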


### Running the experiments on Dev.Small takes some time. Alternatively, you can use our result files directly to validate the Dev.Small results (reproducing Table 1 of our paper)

In [5]:
e2e_cosine = pt.io.read_results("./colbert.cosine.e2e.dev.small.res.gz")

e2e_l2 = pt.io.read_results("./colbert.l2.e2e.dev.small.res.gz")

rerank_cosine = pt.io.read_results("./colbert.cosine.rerank.dev.small.res.gz")

rerank_l2 = pt.io.read_results("./colbert.l2.rerank.dev.small.res.gz")

In [12]:

from pyterrier.measures import *
res = pt.Experiment(
    [
        rerank_cosine, rerank_l2,
        e2e_cosine, e2e_l2,
    ],
    pt.get_dataset("trec-deep-learning-passages").get_topics('dev.small'),
    pt.get_dataset("trec-deep-learning-passages").get_qrels('dev.small'),
    batch_size=100, 
    verbose=True,round=4,
    filter_by_qrels=True,
    eval_metrics=[RR@10, R@50, R@200, R@1000],
    names=["rerank_cosine","rerank_l2",
           "e2e_cosine","e2e_l2" ]
)

res





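|   | name | RR@10 | R@50 | R@200 | R@1000 |
|---|------|-------|------|-------|--------|
| 0 | rerank_cosine | 0.3479 | 0.7527 | 0.8036 | 0.814 |
| 1 | rerank_l2 | 0.3492 | 0.7541 | 0.8053 | 0.814 |
| 2 | e2e_cosine | 0.3575 | 0.8229 | 0.9109 | 0.9516 |
| 3 | e2e_l2 | 0.3606 | 0.8324 | 0.9232 | 0.9648 |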
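To read the Dev.Small table in terms of RQ1.2, the short sketch below computes the absolute difference between the L2 and cosine end-to-end runs for each measure. The numbers are copied from the table above; the helper name `absolute_gain` is hypothetical, and the differences alone say nothing about statistical significance.

```python
# Values copied from the Dev.Small results table (end-to-end runs only).
reported = {
    "e2e_cosine": {"RR@10": 0.3575, "R@50": 0.8229, "R@200": 0.9109, "R@1000": 0.9516},
    "e2e_l2": {"RR@10": 0.3606, "R@50": 0.8324, "R@200": 0.9232, "R@1000": 0.9648},
}


def absolute_gain(baseline, candidate):
    """Per-measure difference candidate - baseline, rounded to 4 decimals."""
    return {m: round(candidate[m] - baseline[m], 4) for m in baseline}


gain = absolute_gain(reported["e2e_cosine"], reported["e2e_l2"])
for measure, delta in gain.items():
    print(f"{measure}: {delta:+.4f}")
```

The two reranking rows share the same R@1000 because both rerank the same BM25 candidate set, so recall at the full depth is fixed by that candidate set rather than by the similarity function.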


# Besides MSMARCO Dev.Small, we can also perform evaluation on TREC queries. 

In the following, we demonstrate how to construct the retrieval pipelines and run experiments on both the TREC 2019 and TREC 2020 queries.

- Note: the reranking pipeline differs from the one used in the Dev.Small experiment. Here we apply ColBERT reranking on top of BM25 retrieval from the stemmed Terrier index.

### Evaluation on TREC DL2019 

In [None]:
bm25_terrier_stemmed_text = pt.BatchRetrieve.from_dataset(
    'msmarco_passage', 
    'terrier_stemmed_text', 
    wmodel='BM25',
    metadata=['docno', 'text'], 
    num_results=1000)


In [None]:
rerank = (bm25_terrier_stemmed_text >>factory.text_scorer())
e2e = factory.end_to_end()

In [None]:
from pyterrier.measures import *
pt.Experiment(
    [rerank, e2e,
    ],
    pt.get_dataset("trec-deep-learning-passages").get_topics('test-2019'),
    pt.get_dataset("trec-deep-learning-passages").get_qrels('test-2019'),
    batch_size=100, 
    verbose=True,
    save_dir = "./",
    filter_by_qrels=True,
    eval_metrics=[RR@10, nDCG@10, R@50, R@200, R@1000],
    names=["colbert.cosine.rerank.dl19","colbert.cosine.e2e.dl19" ]
)


### Evaluation on TREC DL2020 

In [None]:
rerank = (bm25_terrier_stemmed_text >>factory.text_scorer())
e2e = factory.end_to_end()

In [None]:
from pyterrier.measures import *
pt.Experiment(
    [rerank, e2e,
    ],
    pt.get_dataset("trec-deep-learning-passages").get_topics('test-2020'),
    pt.get_dataset("trec-deep-learning-passages").get_qrels('test-2020'),
    batch_size=100, 
    verbose=True,
    save_dir = "./",
    filter_by_qrels=True,
    eval_metrics=[RR@10, nDCG@10, R@50, R@200, R@1000],
    names=["colbert.cosine.rerank.dl20","colbert.cosine.e2e.dl20" ]
)


### Similarly, instead of conducting the above experiments, you can validate our reported results in Table 1 by directly using the result files we have provided.

In [12]:
e2e_cosine = pt.io.read_results("./TREC.Res/colbert.cosine.e2e.dl19.res.gz")

e2e_l2 = pt.io.read_results("./TREC.Res/colbert.l2.e2e.dl19.res.gz")

rerank_cosine = pt.io.read_results("./TREC.Res/colbert.cosine.rerank.dl19.res.gz")

rerank_l2 = pt.io.read_results("./TREC.Res/colbert.l2.rerank.dl19.res.gz")

In [13]:

from pyterrier.measures import *
DL19_res = pt.Experiment(
    [
        rerank_cosine, rerank_l2,
        e2e_cosine, e2e_l2,
    ],
    pt.get_dataset("trec-deep-learning-passages").get_topics('test-2019'),
    pt.get_dataset("trec-deep-learning-passages").get_qrels('test-2019'),
    batch_size=100, 
    verbose=True,round=4,
    filter_by_qrels=True,
    eval_metrics=[RR(rel=2)@10, nDCG@10, AP(rel=2)@1000, R(rel=2)@1000],
    names=["rerank_cosine_dl19","rerank_l2_dl19",
           "e2e_cosine_dl19","e2e_l2_dl19" ]
)

DL19_res





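|   | name | RR(rel=2)@10 | nDCG@10 | AP(rel=2)@1000 | R(rel=2)@1000 |
|---|------|--------------|---------|----------------|---------------|
| 0 | rerank_cosine_dl19 | 0.8469 | 0.7132 | 0.4587 | 0.7553 |
| 1 | rerank_l2_dl19 | 0.8624 | 0.7129 | 0.4702 | 0.7553 |
| 2 | e2e_cosine_dl19 | 0.8574 | 0.7077 | 0.4445 | 0.773 |
| 3 | e2e_l2_dl19 | 0.8702 | 0.7216 | 0.462 | 0.823 |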
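The `rel=2` argument in the binary measures above reflects that the TREC DL qrels are graded: only passages with a grade of at least 2 count as relevant for RR, AP and R, whereas nDCG uses the grades directly. A minimal sketch of this thresholding followed by average precision is shown below; the function names and the toy judgements are hypothetical, and this is not the code path used by `pt.Experiment`.

```python
def binarize(graded_qrels, min_rel=2):
    """Keep only the docnos whose relevance grade reaches the threshold."""
    return {docno for docno, grade in graded_qrels.items() if grade >= min_rel}


def average_precision(ranked_docnos, relevant, k=1000):
    """Mean of precision values at the ranks where relevant documents occur."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, docno in enumerate(ranked_docnos[:k], start=1):
        if docno in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


# Hypothetical graded judgements (0-3) and ranking for one query.
graded = {"d1": 3, "d2": 1, "d3": 2, "d4": 0}
ranking = ["d2", "d1", "d4", "d3"]

ap_rel2 = average_precision(ranking, binarize(graded, min_rel=2))
ap_rel1 = average_precision(ranking, binarize(graded, min_rel=1))
```

With the threshold at 2, the top-ranked passage `d2` (grade 1) no longer counts as relevant, which lowers the average precision relative to a threshold of 1.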


In [14]:
e2e_cosine = pt.io.read_results("./TREC.Res/colbert.cosine.e2e.dl20.res.gz")

e2e_l2 = pt.io.read_results("./TREC.Res/colbert.l2.e2e.dl20.res.gz")

rerank_cosine = pt.io.read_results("./TREC.Res/colbert.cosine.rerank.dl20.res.gz")

rerank_l2 = pt.io.read_results("./TREC.Res/colbert.l2.rerank.dl20.res.gz")

In [15]:

from pyterrier.measures import *
DL20_res = pt.Experiment(
    [
        rerank_cosine, rerank_l2,
        e2e_cosine, e2e_l2,
    ],
    pt.get_dataset("trec-deep-learning-passages").get_topics('test-2020'),
    pt.get_dataset("trec-deep-learning-passages").get_qrels('test-2020'),
    batch_size=100, 
    verbose=True,round=4,
    filter_by_qrels=True,
    eval_metrics=[RR(rel=2)@10, nDCG@10, AP(rel=2)@1000, R(rel=2)@1000],
    names=["rerank_cosine_dl20","rerank_l2_dl20",
           "e2e_cosine_dl20","e2e_l2_dl20" ]
)

DL20_res





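|   | name | RR(rel=2)@10 | nDCG@10 | AP(rel=2)@1000 | R(rel=2)@1000 |
|---|------|--------------|---------|----------------|---------------|
| 0 | rerank_cosine_dl20 | 0.8349 | 0.7068 | 0.4838 | 0.8072 |
| 1 | rerank_l2_dl20 | 0.8284 | 0.6979 | 0.4827 | 0.8072 |
| 2 | e2e_cosine_dl20 | 0.8318 | 0.6899 | 0.4725 | 0.8057 |
| 3 | e2e_l2_dl20 | 0.8228 | 0.6853 | 0.4747 | 0.8386 |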


### Conclusion:

Overall, the results on both the Dev.Small and TREC query sets show that we can successfully reproduce the performance of ColBERT across query sets. In addition, several ablation studies show that more training interactions still help to improve the retrieval effectiveness of ColBERT. The L2 similarity function yields higher performance than cosine in the end-to-end setting and comparable performance in the reranking setting.