# This notebook demonstrates the experiments for RQ3 in our paper, namely:

- RQ3.1 How does the semantic matching behaviour vary across different contextualised late interaction models? 

- RQ3.2 Can we characterise the salient token families of matches, i.e., which type of tokens contribute the most to semantic matching? 

- RQ3.3 Can we quantify the contribution of different types of matching behaviour, namely lexical match, semantic match, and special token match, to retrieval effectiveness?
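
To make the distinction between lexical (exact) and semantic matches concrete, the following is a minimal sketch of how a MaxSim score can be split by match type. It is not the implementation used later in this notebook: the helper `split_maxsim`, the token ids and the 2-d embeddings are all hypothetical and chosen only for illustration.

```python
# Toy sketch, not the notebook's implementation: label each query token's
# MaxSim match as exact (same token id) or semantic (different token id).

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def split_maxsim(q_ids, q_embs, d_ids, d_embs, special_ids=frozenset()):
    """Return (exact_score, semantic_score) for one query-document pair."""
    exact = semantic = 0.0
    for q_id, q_emb in zip(q_ids, q_embs):
        if q_id in special_ids:
            continue  # special query tokens are left out of the proportion
        sims = [dot(q_emb, d_emb) for d_emb in d_embs]
        best = max(range(len(sims)), key=sims.__getitem__)  # argmax over document tokens
        if q_id == d_ids[best]:
            exact += sims[best]
        else:
            semantic += sims[best]
    return exact, semantic

# Hypothetical token ids and 2-d embeddings, chosen only for illustration.
q_ids = [101, 7, 8]
q_embs = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
d_ids = [7, 9]
d_embs = [[0.0, 1.0], [1.0, 0.0]]

exact, semantic = split_maxsim(q_ids, q_embs, d_ids, d_embs, special_ids={101})
smp = semantic / (exact + semantic)  # semantic share of the non-special score
print(exact, semantic, round(smp, 3))
```

Here the query token with id 7 finds an identical token in the document (an exact match), while the token with id 8 is most similar to a different token (a semantic match); the semantic match proportion is the semantic share of the summed MaxSim values.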

In [1]:
import pyterrier as pt
pt.init(tqdm='notebook')
from pyterrier.measures import *


PyTerrier 0.9.1 has loaded Terrier 5.7 (built by craigm on 2022-11-10 18:30) and terrier-helper 0.0.7



In [25]:
import pandas as pd
qrelsDev = pt.get_dataset("trec-deep-learning-passages").get_qrels('dev.small')
topicsDev = pt.get_dataset("trec-deep-learning-passages").get_topics('dev.small')
topicsDev = topicsDev.merge(qrelsDev[qrelsDev["label"] > 0][["qid"]].drop_duplicates())
len(topicsDev)

6980

In [3]:
import pandas as pd

qrels2019 = pt.get_dataset("trec-deep-learning-passages").get_qrels('test-2019')
topics2019 = pt.get_dataset("trec-deep-learning-passages").get_topics('test-2019')

topics2020 = pt.get_dataset("trec-deep-learning-passages").get_topics('test-2020')
qrels2020 = pt.get_dataset("trec-deep-learning-passages").get_qrels('test-2020')

In [5]:
!ls /nfstrecdl/workspace_xiao/ColBERT/psg/train.py/ColBERT_base_uncased_cosine_resume/checkpoints/

colbert-190000.dnn  colbert-200000.dnn	colbert.dnn


In [4]:
from pyterrier_colbert.ranking import ColBERTFactory
checkpoint_loc = "/nfs/sean/workspace_xiao/ColBERT/psg/train.py/ColBERT_base_uncased_cosine_resume/checkpoints/colbert-200000.dnn"

factory = ColBERTFactory(
    checkpoint_loc,
    "/nfs/sean/workspace_xiao/","colbert_cosine_index",faiss_partitions=100,memtype='mem'
)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing ColBERT: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing ColBERT from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ColBERT from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of ColBERT were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['linear.weight']
You should probably TRAI

[Feb 20, 13:45:25] #> Loading model checkpoint.
[Feb 20, 13:45:25] #> Loading checkpoint /nfs/sean/workspace_xiao/ColBERT/psg/train.py/ColBERT_base_uncased_cosine_resume/checkpoints/colbert-200000.dnn
[Feb 20, 13:45:28] #> checkpoint['epoch'] = 0
[Feb 20, 13:45:28] #> checkpoint['batch'] = 200000


In [5]:
factory.faiss_index_on_gpu = False
e2e_cosine = factory.end_to_end()
fnt=factory.nn_term(df=True)

[Feb 20, 13:45:38] #> Loading the FAISS index from /nfs/sean/workspace_xiao/colbert_cosine_index/ivfpq.100.faiss ..
[Feb 20, 13:46:06] #> Building the emb2pid mapping..
[Feb 20, 13:46:44] len(self.emb2pid) = 687758954
Loading reranking index, memtype=mem


Loading index shards to memory: 100%|██████████| 19/19 [03:20<00:00, 10.57s/shard]


[Feb 20, 13:50:11] #> Building the emb2tid mapping..
687758954
>>>vocab_size: 30522
Loading doclens


In [6]:
bm25_terrier_stemmed_text = pt.BatchRetrieve.from_dataset(
    'msmarco_passage', 
    'terrier_stemmed_text', 
    wmodel='BM25',
    metadata=['docno', 'text'], 
    num_results=1000)


14:10:50.693 [main] WARN org.terrier.structures.BaseCompressingMetaIndex - Structure meta reading data file directly from disk (SLOW) - try index.meta.data-source=fileinmem in the index properties file. 1.9 GiB of memory would be required.


# Measure the SMP for RQ3

In [60]:
import torch
import numpy as np
num_docs=fnt.num_docs
num_all_tokens = len(fnt.emb2tid) 
idfdict = {}
probIDF = {}
ictfdict = {}
idflist=[]
for tid in pt.tqdm(range(fnt.inference.query_tokenizer.tok.vocab_size)):
    df = fnt.getDF_by_id(tid)
    idfscore = np.log((1+num_docs)/(df+1))
    idfdict[tid] = idfscore
    idflist.append(idfscore)

100%|██████████| 30522/30522 [00:00<00:00, 263887.75it/s]
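
The IDF-based token families used in this notebook (low, med, high) are defined by the 25th and 75th percentiles of the vocabulary's IDF scores. The sketch below illustrates that bucketing on a hypothetical five-token vocabulary; the document frequencies are invented, and the standard-library `statistics.quantiles` stands in for `np.percentile`.

```python
import math
import statistics

def smoothed_idf(df, num_docs):
    """Same smoothing as the cell above: log((1 + N) / (df + 1))."""
    return math.log((1 + num_docs) / (df + 1))

# Hypothetical document frequencies for a five-token vocabulary.
num_docs = 1000
doc_freqs = {"the": 990, "of": 900, "city": 120, "glasgow": 8, "colbert": 1}
idf = {tok: smoothed_idf(df, num_docs) for tok, df in doc_freqs.items()}

# Quartile cut points over the vocabulary's IDF values.
q25, _, q75 = statistics.quantiles(idf.values(), n=4, method="inclusive")

def idf_family(score):
    if score < q25:
        return "low"
    if score > q75:
        return "high"
    if q25 < score < q75:
        return "med"
    return "boundary"  # exactly on a cut point: none of the strict tests applies

families = {tok: idf_family(score) for tok, score in idf.items()}
print(families)
```

Because the comparisons are strict, a token whose IDF falls exactly on a percentile belongs to none of the three families; with a vocabulary of 30522 tokens this should affect very few tokens.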


In [61]:
import os
if not os.path.exists("stopword-list.txt"):
    !wget "https://raw.githubusercontent.com/terrier-org/terrier-core/5.x/modules/core/src/main/resources/stopword-list.txt"
cuda0 = torch.device('cuda:0')
stops=[]
with open("stopword-list.txt") as f:
    for l in f:
        stops.append(l.strip())
Tokeniser = pt.autoclass("org.terrier.indexing.tokenisation.Tokeniser").getTokeniser()
PorterStemmer = pt.autoclass("org.terrier.terms.PorterStemmer")()

# IDF quartile boundaries, computed once rather than on every call
IDF_Q25, IDF_Q75 = np.percentile(idflist, [25, 75])

def _get_doc_maxsim_tid_remarkable(doc_maxsim_tid,token,q_token, add_subword=False,add_stop=False,add_numeric=False,
                      add_low=False,add_med=False,add_high=False,add_all=False, add_stem=False,add_question=False): 
    """Return True if the matched document token belongs to the selected token family."""
    if add_subword:
        # WordPiece continuation tokens
        return token.startswith("##")
    elif add_stop:
        return token.replace("##","") in stops
    elif add_numeric:
        return token.replace("##","").isnumeric()
    elif add_low:
        return bool(idfdict[int(doc_maxsim_tid)] < IDF_Q25)
    elif add_stem:
        # query and document tokens share the same Porter stem
        return PorterStemmer.stem(q_token.replace('##','')) == PorterStemmer.stem(token.replace('##',''))
    elif add_med:
        return bool(IDF_Q25 < idfdict[int(doc_maxsim_tid)] < IDF_Q75)
    elif add_high:
        return bool(idfdict[int(doc_maxsim_tid)] > IDF_Q75)
    elif add_all:
        return True

In [62]:
def _get_match_matrix_remarkable(maxsim, idx,idsQ,idsD, add_subword=False,add_stop=False,add_numeric=False,
                      add_low=False,add_med=False,add_high=False,add_all=False,add_stem=False,add_question=False):
    exact_match = torch.zeros_like(maxsim)
    semantic_match = torch.zeros_like(maxsim)
    for didx in range(len(idx)): 
        for qidx in range(len(idsQ[0])): 
            q_tid=idsQ[0][qidx]
            max_dtok_index = idx[didx][qidx]
            doc_maxsim_tid = idsD[didx][max_dtok_index]
            token =factory.args.inference.doc_tokenizer.tok.convert_ids_to_tokens([doc_maxsim_tid])[0]
            q_token = factory.args.inference.query_tokenizer.tok.convert_ids_to_tokens([int(q_tid)])[0]
          
            current_special_match = _get_doc_maxsim_tid_remarkable(doc_maxsim_tid,token,q_token,
                                                                 add_subword=add_subword,add_stop=add_stop,
                                                                 add_numeric=add_numeric,
                                                                 add_low=add_low,add_med=add_med,add_high=add_high,add_all=add_all,
                                                                 add_stem=add_stem)
            
            # family members are split into exact (same token id) and semantic (different token id) matches
            if (q_tid == doc_maxsim_tid) & current_special_match:
                exact_match[didx][qidx]=1
            if (q_tid != doc_maxsim_tid) & current_special_match:
                semantic_match[didx][qidx]=1
    return exact_match,semantic_match           

In [63]:

from pyterrier.transformer import TransformerBase
def scorer_smp(factory,  verbose=False, gpu=True,
                add_subword=False, add_stop=False, add_numeric=False,
                add_low=False, add_med=False, add_high=False, 
                add_all=False, add_stem=False,add_question=False) -> TransformerBase:
    """
    Calculates the ColBERT max_sim operator using previous encodings of queries and documents
    input: qid, query_embs, [query_weights], docno, doc_embs
    output: ditto + score, [+ contributions]
    """
    import torch
    import pyterrier as pt
    assert pt.started(), 'PyTerrier must be started'
    cuda0 = torch.device('cuda') if gpu else torch.device('cpu')

    def _build_interaction(row, D):
        doc_embs = row.doc_embs
        doc_len = doc_embs.shape[0]
        D[row.row_index, 0:doc_len, :] = doc_embs

    def _build_toks(row, idsD):
        doc_toks = row.doc_toks
        doc_len = doc_toks.shape[0]
        idsD[row.row_index, 0:doc_len] = doc_toks

    def _score_query(df):
        with torch.no_grad():
            weightsQ = None
            Q = torch.cat([df.iloc[0].query_embs])
            if "query_weights" in df.columns:
                weightsQ = df.iloc[0].query_weights
            else:
                weightsQ = torch.ones(Q.shape[0])
            if gpu:
                Q = Q.cuda()
                weightsQ = weightsQ.cuda()        
            D = torch.zeros(len(df), factory.args.doc_maxlen, factory.args.dim, device=cuda0)
            df['row_index'] = range(len(df))
            if verbose:
                pt.tqdm.pandas(desc='scorer')
                df.progress_apply(lambda row: _build_interaction(row, D), axis=1)
            else:
                df.apply(lambda row: _build_interaction(row, D), axis=1)
            maxscoreQ = (Q @ D.permute(0, 2, 1)).max(2).values
            scores = (weightsQ*maxscoreQ).sum(1).cpu()
            df["score"] = scores.tolist()

            # collect the token ids of the query and of each candidate document
            idsQ = torch.cat([df.iloc[0].query_toks]).unsqueeze(0)
            idsD = torch.zeros(len(df), factory.args.doc_maxlen, dtype=idsQ.dtype)

            df.apply(lambda row: _build_toks(row, idsD), axis=1)

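            # which places in the query are actual tokens, not specials such as MASKs
            # (101 = [CLS], 102 = [SEP], 103 = [MASK]; 1 and 2 = [unused0] and [unused1], used by ColBERT as query/document markers)
            token_match = (idsQ != 101) & (idsQ != 102) & (idsQ != 103) & (idsQ != 1) & (idsQ != 2)
            # True for every query position that is NOT a question word
            # (the ids are expected to be which, how, what, where, why, who, when in the bert-base-uncased vocabulary)
            question_match =  (idsQ != 2029)& (idsQ != 2129)& (idsQ != 2054)& (idsQ != 2073) & (idsQ != 2339)& (idsQ != 2040) & (idsQ != 2043)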

            # perform the interaction
            interaction = (Q @ D.permute(0, 2, 1)).cpu()
            weightsQ = weightsQ.unsqueeze(0).cpu()
            weighted_maxsim = weightsQ*interaction.max(2).values
            # mask out query embeddings that arent tokens 
            weighted_maxsim[:, ~token_match[0,:]] = 0
            # get the sum
            denominator = weighted_maxsim.sum(1)
           
            if add_question:
                interaction = (Q @ D.permute(0, 2, 1)).cpu()

                maxsim, idx = interaction.max(2)
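                # `_get_match_matrix` is not defined in this notebook; `_get_match_matrix_remarkable`
                # with add_all=True is assumed to give the same unfiltered exact/semantic split
                exact_match, semantic_match = _get_match_matrix_remarkable(weighted_maxsim, idx, idsQ, idsD, add_all=True)
                # keep only the question-word positions: zero out every other query token
                exact_match[:, question_match[0,:]] = 0
                semantic_match[:, question_match[0,:]] = 0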
                weighted_maxsim = weightsQ*maxsim
                weighted_maxsim_exact = weighted_maxsim*exact_match
                weighted_maxsim_semantic = weighted_maxsim*semantic_match
                # get the sum
                numerator_exact = weighted_maxsim_exact.sum(1)
                numerator_semantic = weighted_maxsim_semantic.sum(1)
            else:
                interaction = (Q @ D.permute(0, 2, 1)).cpu()

                maxsim, idx = interaction.max(2)
   
                exact_match, semantic_match =_get_match_matrix_remarkable(maxsim, idx,idsQ,idsD, add_subword=add_subword,
                                                                          add_stop=add_stop,add_numeric=add_numeric,
                                                                          add_low=add_low,add_med=add_med,
                                                                          add_high=add_high,add_all=add_all, 
                                                                          add_stem=add_stem,add_question=add_question)
                # mask out query embeddings that arent tokens 
                exact_match[:, ~token_match[0,:]] = 0
                semantic_match[:, ~token_match[0,:]] = 0
                weighted_maxsim = weightsQ*maxsim
                weighted_maxsim_exact = weighted_maxsim*exact_match
                weighted_maxsim_semantic = weighted_maxsim*semantic_match
                # get the sum
                numerator_exact = weighted_maxsim_exact.sum(1)
                numerator_semantic = weighted_maxsim_semantic.sum(1)
            df["exact_numer_exact"] = numerator_exact.tolist()
            df["semantic_numer_exact"] = numerator_semantic.tolist()
            df["exact_denom"] = denominator.tolist()
            df["exact_pct"] = (numerator_exact/denominator).tolist()
            df["semantic_pct"] = (numerator_semantic/denominator).tolist()
        df = df.drop(columns=['query_toks', 'query_embs'])
        return df

    return pt.apply.by_query(_score_query, add_ranks=True)

In [64]:
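def report_smp(df,cutoff):
    """Mean semantic match proportion (SMP): averaged over the documents kept for each query, then over queries."""
    # NB: if ranks start at 0 (PyTerrier's usual convention), `<=` keeps cutoff+1 documents per query
    top = df[df['rank']<=cutoff]
    return top.groupby(['qid','query'])['semantic_pct'].mean().mean()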
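
`report_smp` averages in two stages: first within each query, then across queries, so that queries with more retrieved documents do not dominate. The sketch below reproduces that aggregation on plain dictionaries; the helper `report_smp_rows` and the rows are hypothetical and only illustrate the order of averaging.

```python
from collections import defaultdict
from statistics import mean

def report_smp_rows(rows, cutoff):
    """rows: iterable of dicts with 'qid', 'rank' and 'semantic_pct' keys."""
    per_query = defaultdict(list)
    for row in rows:
        if row["rank"] <= cutoff:  # same filter as report_smp
            per_query[row["qid"]].append(row["semantic_pct"])
    # average within each query first, then across queries
    return mean(mean(vals) for vals in per_query.values())

# Hypothetical per-document SMP values for two queries.
rows = [
    {"qid": "q1", "rank": 0, "semantic_pct": 0.2},
    {"qid": "q1", "rank": 1, "semantic_pct": 0.4},
    {"qid": "q1", "rank": 2, "semantic_pct": 0.9},
    {"qid": "q2", "rank": 0, "semantic_pct": 0.5},
]
smp_at_1 = report_smp_rows(rows, cutoff=1)
print(smp_at_1)
```

With these rows the per-query means are 0.3 and 0.5, so the reported value is their average rather than the pooled mean of the three retained documents.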

# RQ3.1: How does the semantic matching behaviour vary across Col$\star$ models?


In [65]:
dataset = pt.get_dataset("irds:msmarco-passage")
pipe= (factory.query_encoder() >>bm25_terrier_stemmed_text>>pt.text.get_text(dataset,"text")>>factory.text_encoder() 
       >>factory.fetch_index_encodings(ids=True)
        >>scorer_smp(factory,add_all=True))

res =pd.concat(pipe.transform_gen(topics2020,batch_size=1))
smp_cutoff10 = report_smp(res, cutoff=10)
smp_cutoff10

0.3865912409894394

In [66]:
pipe = (factory.end_to_end()%100
        >>factory.fetch_index_encodings(ids=True)
        >>scorer_smp(factory,add_all=True))

res =pd.concat(pipe.transform_gen(topics2020,batch_size=1))
smp_cutoff10 = report_smp(res, cutoff=10)
smp_cutoff10

0.4063934326447822

# RQ3.2: Can we characterise the salient token families of matches, i.e., which type of tokens contribute the most to semantic matching?



## 3.2.1 Remarkable SMP: Question Token Family

In [67]:
bm25_terrier_stemmed_text = pt.BatchRetrieve.from_dataset(
    'msmarco_passage', 
    'terrier_stemmed_text', 
    wmodel='BM25',
    metadata=['docno', 'text'], 
    num_results=1000)


15:55:50.375 [main] WARN org.terrier.structures.BaseCompressingMetaIndex - Structure meta reading data file directly from disk (SLOW) - try index.meta.data-source=fileinmem in the index properties file. 1.9 GiB of memory would be required.


In [68]:
dataset = pt.get_dataset("irds:msmarco-passage")
pipe= (factory.query_encoder() >>bm25_terrier_stemmed_text>>pt.text.get_text(dataset,"text")>>factory.text_encoder() 
       >>factory.fetch_index_encodings(ids=True)
        >>scorer_smp(factory,add_question=True))

res =pd.concat(pipe.transform_gen(topics2020,batch_size=1))
smp_cutoff10 = report_smp(res, cutoff=10)
smp_cutoff10

0.08465427102251386

In [69]:
pipe = (factory.end_to_end()%100
        >>factory.fetch_index_encodings(ids=True)
        >>scorer_smp(factory,add_question=True))

res =pd.concat(pipe.transform_gen(topics2020,batch_size=1))
smp_cutoff10 = report_smp(res, cutoff=10)
smp_cutoff10

0.08659131377182826

## 3.2.2 Remarkable SMP: SubWord Token Family

In [70]:
dataset = pt.get_dataset("irds:msmarco-passage")
pipe= (factory.query_encoder() >>bm25_terrier_stemmed_text>>pt.text.get_text(dataset,"text")>>factory.text_encoder() 
       >>factory.fetch_index_encodings(ids=True)
        >>scorer_smp(factory,add_subword=True))
res =pd.concat(pipe.transform_gen(topics2020,batch_size=1))
smp_cutoff10 = report_smp(res, cutoff=10)
smp_cutoff10

0.008859794350146544

In [71]:
pipe = (factory.end_to_end()%100
        >>factory.fetch_index_encodings(ids=True)
        >>scorer_smp(factory,add_subword=True))

res =pd.concat(pipe.transform_gen(topics2020,batch_size=1))
smp_cutoff10 = report_smp(res, cutoff=10)
smp_cutoff10

0.013382806993004948

## 3.2.3 Remarkable SMP: Stopword Token Family

In [72]:
dataset = pt.get_dataset("irds:msmarco-passage")
pipe= (factory.query_encoder() >>bm25_terrier_stemmed_text>>pt.text.get_text(dataset,"text")>>factory.text_encoder() 
       >>factory.fetch_index_encodings(ids=True)
        >>scorer_smp(factory,add_stop=True))
res =pd.concat(pipe.transform_gen(topics2020,batch_size=1))
smp_cutoff10 = report_smp(res, cutoff=10)
smp_cutoff10

0.16249954805518138

In [73]:
pipe = (factory.end_to_end()%100
        >>factory.fetch_index_encodings(ids=True)
        >>scorer_smp(factory,add_stop=True))

res =pd.concat(pipe.transform_gen(topics2020,batch_size=1))
smp_cutoff10 = report_smp(res, cutoff=10)
smp_cutoff10

0.16867870545196612

## 3.2.4 Remarkable SMP: Numeric Token Family

In [74]:
dataset = pt.get_dataset("irds:msmarco-passage")
pipe= (factory.query_encoder() >>bm25_terrier_stemmed_text>>pt.text.get_text(dataset,"text")>>factory.text_encoder() 
       >>factory.fetch_index_encodings(ids=True)
        >>scorer_smp(factory,add_numeric=True))
res =pd.concat(pipe.transform_gen(topics2020,batch_size=1))
smp_cutoff10 = report_smp(res, cutoff=10)
smp_cutoff10

0.017458915120925164

In [75]:
pipe = (factory.end_to_end()%100
        >>factory.fetch_index_encodings(ids=True)
        >>scorer_smp(factory,add_numeric=True))

res =pd.concat(pipe.transform_gen(topics2020,batch_size=1))
smp_cutoff10 = report_smp(res, cutoff=10)
smp_cutoff10

0.018937234838903954

## 3.2.5 Remarkable SMP: Stemming Token Family

In [76]:
dataset = pt.get_dataset("irds:msmarco-passage")
pipe= (factory.query_encoder() >>bm25_terrier_stemmed_text>>pt.text.get_text(dataset,"text")>>factory.text_encoder() 
       >>factory.fetch_index_encodings(ids=True)
        >>scorer_smp(factory,add_stem=True))
res =pd.concat(pipe.transform_gen(topics2020,batch_size=1))
smp_cutoff10 = report_smp(res, cutoff=10)
smp_cutoff10

0.02244023500819399

In [77]:
pipe = (factory.end_to_end()%100
        >>factory.fetch_index_encodings(ids=True)
        >>scorer_smp(factory,add_stem=True))

res =pd.concat(pipe.transform_gen(topics2020,batch_size=1))
smp_cutoff10 = report_smp(res, cutoff=10)
smp_cutoff10

0.022583855134168457

## 3.2.6 Remarkable SMP: Low$_{idf}$ Token Family

In [78]:
dataset = pt.get_dataset("irds:msmarco-passage")
pipe= (factory.query_encoder() >>bm25_terrier_stemmed_text>>pt.text.get_text(dataset,"text")>>factory.text_encoder() 
       >>factory.fetch_index_encodings(ids=True)
        >>scorer_smp(factory,add_low=True))
res =pd.concat(pipe.transform_gen(topics2020,batch_size=1))
smp_cutoff10 = report_smp(res, cutoff=10)
smp_cutoff10

0.3648163944432631

In [79]:
pipe = (factory.end_to_end()%100
        >>factory.fetch_index_encodings(ids=True)
        >>scorer_smp(factory,add_low=True))

res =pd.concat(pipe.transform_gen(topics2020,batch_size=1))
smp_cutoff10 = report_smp(res, cutoff=10)
smp_cutoff10

0.38073608345706456

## 3.2.7 Remarkable SMP: Med$_{idf}$ Token Family

In [80]:
dataset = pt.get_dataset("irds:msmarco-passage")
pipe= (factory.query_encoder() >>bm25_terrier_stemmed_text>>pt.text.get_text(dataset,"text")>>factory.text_encoder() 
       >>factory.fetch_index_encodings(ids=True)
        >>scorer_smp(factory,add_med=True))
res =pd.concat(pipe.transform_gen(topics2020,batch_size=1))
smp_cutoff10 = report_smp(res, cutoff=10)
smp_cutoff10

0.021035169828244132

In [81]:
pipe = (factory.end_to_end()%100
        >>factory.fetch_index_encodings(ids=True)
        >>scorer_smp(factory,add_med=True))

res =pd.concat(pipe.transform_gen(topics2020,batch_size=1))
smp_cutoff10 = report_smp(res, cutoff=10)
smp_cutoff10

0.024605643697183342

## 3.2.8 Remarkable SMP: High$_{idf}$ Token Family

In [82]:
dataset = pt.get_dataset("irds:msmarco-passage")
pipe= (factory.query_encoder() >>bm25_terrier_stemmed_text>>pt.text.get_text(dataset,"text")>>factory.text_encoder() 
       >>factory.fetch_index_encodings(ids=True)
        >>scorer_smp(factory,add_high=True))
res =pd.concat(pipe.transform_gen(topics2020,batch_size=1))
smp_cutoff10 = report_smp(res, cutoff=10)
smp_cutoff10

0.0007396773858503861

In [83]:
pipe = (factory.end_to_end()%100
        >>factory.fetch_index_encodings(ids=True)
        >>scorer_smp(factory,add_high=True))

res =pd.concat(pipe.transform_gen(topics2020,batch_size=1))
smp_cutoff10 = report_smp(res, cutoff=10)
smp_cutoff10

0.0010517060957372388

# Conclusion for RQ 3.2

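Overall, in response to RQ3.2, quantifying the extent of semantic matching for the various token families shows that low IDF tokens are the most likely to exhibit semantic matching. On the TREC 2020 queries at cutoff 10, their SMP (0.365 when reranking BM25 and 0.381 end-to-end) accounts for most of the SMP measured over all tokens (0.387 and 0.406, respectively), whereas the stopword family reaches about 0.16 and every other family stays below 0.09.

The same experiments apply directly to ColminiLM. ColRoBERTa and ColALBERT need only minimal changes, namely the special token ids of each model variant.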
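
One way to avoid hard-coding the special token ids for each variant is to read them from the tokeniser. The sketch below assumes a Hugging Face style tokeniser exposing `cls_token_id`, `sep_token_id`, `mask_token_id` and `convert_tokens_to_ids`, and assumes that the query and document markers are `[unused0]` and `[unused1]`; other variants may use different marker tokens. The helper `special_token_ids` and the `ToyTokenizer` stand-in are hypothetical and exist only so that the example runs without downloading a model.

```python
def special_token_ids(tokenizer, marker_tokens=("[unused0]", "[unused1]")):
    """Collect the ids to treat as special tokens for a given tokeniser."""
    ids = {tokenizer.cls_token_id, tokenizer.sep_token_id, tokenizer.mask_token_id}
    ids.update(tokenizer.convert_tokens_to_ids(list(marker_tokens)))
    return ids

class ToyTokenizer:
    """Hypothetical stand-in exposing the few tokeniser attributes used above."""
    vocab = {"[unused0]": 1, "[unused1]": 2, "[CLS]": 101, "[SEP]": 102, "[MASK]": 103}
    cls_token_id, sep_token_id, mask_token_id = 101, 102, 103

    def convert_tokens_to_ids(self, tokens):
        return [self.vocab[t] for t in tokens]

ids = special_token_ids(ToyTokenizer())
print(sorted(ids))
```

The resulting set could then replace the literal ids in the masks, for example by testing each query token id for membership in it.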

# RQ3.3 Can we quantify the contribution of different types of matching behaviour, namely lexical match, semantic match, and special token match, to the retrieval effectiveness?

In [35]:
def _get_match_matrix_RQ4(maxsim, idx,idsQ,idsD):
    exact_match = torch.zeros_like(maxsim)
    semantic_match = torch.zeros_like(maxsim)
    for didx in range(len(idx)): 
        for qidx in range(len(idsQ[0])): 
            q_tid=idsQ[0][qidx]
            max_dtok_index = idx[didx][qidx]
            doc_maxsim_tid = idsD[didx][max_dtok_index]
            if (q_tid == doc_maxsim_tid):
                exact_match[didx][qidx]=1
            if (q_tid != doc_maxsim_tid):
                semantic_match[didx][qidx]=1
    return exact_match, semantic_match

In [45]:
from pyterrier.transformer import TransformerBase
def scorer_quantify_contribution(factory, add_contributions=False, add_exact_match_contribution=False, all_types_scoring=False,
                 only_exact_scoring=False, only_semantic_scoring=False, only_special_scoring=False,only_qtokens_scoring=False,
                 verbose=False, gpu=True) -> TransformerBase:
    """
    Calculates the ColBERT max_sim operator using previous encodings of queries and documents
    input: qid, query_embs, [query_weights], docno, doc_embs
    output: ditto + score, [+ contributions]
    """
    import torch
    import pyterrier as pt
    assert pt.started(), 'PyTerrier must be started'
    cuda0 = torch.device('cuda') if gpu else torch.device('cpu')

    def _build_interaction(row, D):
        doc_embs = row.doc_embs
        doc_len = doc_embs.shape[0]
        D[row.row_index, 0:doc_len, :] = doc_embs

    def _build_toks(row, idsD):
        doc_toks = row.doc_toks
        doc_len = doc_toks.shape[0]
        idsD[row.row_index, 0:doc_len] = doc_toks

    def _score_query(df):
        with torch.no_grad():
            weightsQ = None
            Q = torch.cat([df.iloc[0].query_embs])
            if "query_weights" in df.columns:
                weightsQ = df.iloc[0].query_weights
            else:
                weightsQ = torch.ones(Q.shape[0])
            if gpu:
                Q = Q.cuda()
                weightsQ = weightsQ.cuda()        
            D = torch.zeros(len(df), factory.args.doc_maxlen, factory.args.dim, device=cuda0)
            df['row_index'] = range(len(df))
            if verbose:
                pt.tqdm.pandas(desc='scorer')
                df.progress_apply(lambda row: _build_interaction(row, D), axis=1)
            else:
                df.apply(lambda row: _build_interaction(row, D), axis=1)
                
            idsQ = torch.cat([df.iloc[0].query_toks]).unsqueeze(0)
            idsD = torch.zeros(len(df), factory.args.doc_maxlen, dtype=idsQ.dtype)

            df.apply(lambda row: _build_toks(row, idsD), axis=1)
            token_match = (idsQ != 101) & (idsQ != 102) & (idsQ != 103) & (idsQ != 1) & (idsQ != 2)

            interaction = (Q @ D.permute(0, 2, 1)).cpu()
            maxsim, idx = interaction.max(2)
            
            exact_match, semantic_match =_get_match_matrix_RQ4(maxsim, idx, idsQ, idsD)
            exact_match[:, ~token_match[0,:]] = 0
            semantic_match[:, ~token_match[0,:]] = 0

            weightsQ = weightsQ.unsqueeze(0).cpu()
            weighted_maxsim = weightsQ*maxsim

            weighted_maxsim_exact = weighted_maxsim*exact_match
            weighted_maxsim_semantic = weighted_maxsim*semantic_match

            if only_exact_scoring:
                df["score"] = weighted_maxsim_exact.sum(1).cpu().tolist()
    
            elif only_semantic_scoring:
                df["score"] = weighted_maxsim_semantic.sum(1).cpu().tolist()
            elif only_special_scoring:
                special_token_match = (idsQ == 101) | (idsQ == 102) | (idsQ == 1) | (idsQ == 2) |(idsQ == 103)

                interaction = (Q @ D.permute(0, 2, 1)).cpu()

                weighted_maxsim = weightsQ*interaction.max(2).values
          
                weighted_maxsim[:, ~special_token_match[0,:]] = 0
                df["score"] = weighted_maxsim.sum(1).cpu().tolist()
                
            elif all_types_scoring:
                maxscoreQ = (Q @ D.permute(0, 2, 1)).max(2).values
                maxscoreQ = maxscoreQ.cpu()
                scores = (weightsQ*maxscoreQ).sum(1)
                df["score"] = scores.tolist()
        return df

    return pt.apply.by_query(_score_query, add_ranks=True)
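
The scorer above ranks documents by only one part of the MaxSim score at a time. As a rough illustration of why the resulting rankings can differ from the full ColBERT ranking, the sketch below ranks two hypothetical documents by selected partial scores; the document names and score values are invented.

```python
# Hypothetical partial MaxSim scores per document (invented values).
docs = {
    "d1": {"exact": 2.0, "semantic": 0.5, "special": 1.0},
    "d2": {"exact": 1.5, "semantic": 1.4, "special": 1.2},
}

def rank_by(kinds):
    """Rank documents by the sum of the selected partial scores, best first."""
    return sorted(docs, key=lambda d: sum(docs[d][k] for k in kinds), reverse=True)

ranking_exact = rank_by(["exact"])
ranking_all = rank_by(["exact", "semantic", "special"])
print(ranking_exact, ranking_all)
```

In this toy case the exact-only ranking prefers `d1`, while the full score prefers `d2`; the experiment that follows measures how much effectiveness each partial ranking retains.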

In [46]:
e2e_pipe_only_exact = (
    factory.set_retrieve()
    >> factory.index_scorer(query_encoded=True, add_ranks=True, batch_size=10000)
    >> factory.fetch_index_encodings(ids=True)
    >> scorer_quantify_contribution(factory,only_exact_scoring=True,only_semantic_scoring=False,only_special_scoring=False,all_types_scoring=False)
)

e2e_pipe_only_semantic = (
    factory.set_retrieve()
    >> factory.index_scorer(query_encoded=True, add_ranks=True, batch_size=10000)
    >> factory.fetch_index_encodings(ids=True)
    >> scorer_quantify_contribution(factory,only_exact_scoring=False,only_semantic_scoring=True,only_special_scoring=False,all_types_scoring=False)
)


e2e_pipe_only_qspecial = (
    factory.set_retrieve() 
    >> factory.index_scorer(query_encoded=True, add_ranks=True, batch_size=10000)
    >> factory.fetch_index_encodings(ids=True)
    >> scorer_quantify_contribution(factory,only_exact_scoring=False,only_semantic_scoring=False,only_special_scoring=True,all_types_scoring=False)
)


e2e_pipe_all_types = (
    factory.set_retrieve()
    >> factory.index_scorer(query_encoded=True, add_ranks=True, batch_size=10000)
    >> factory.fetch_index_encodings(ids=True)
    >> scorer_quantify_contribution(factory,only_exact_scoring=False,only_semantic_scoring=False,only_special_scoring=False,all_types_scoring=True)
)



In [48]:
dataset = pt.get_dataset("irds:msmarco-passage")
reranker_all_types = (factory.query_encoder() >>bm25_terrier_stemmed_text>>pt.text.get_text(dataset,"text")>>factory.text_encoder() 
       >>factory.fetch_index_encodings(ids=True)
       >> scorer_quantify_contribution(factory,only_exact_scoring=False,only_semantic_scoring=False,only_special_scoring=False,all_types_scoring=True)
)

reranker_only_exact = (factory.query_encoder() >>bm25_terrier_stemmed_text>>pt.text.get_text(dataset,"text")>>factory.text_encoder() 
       >>factory.fetch_index_encodings(ids=True)
       >> scorer_quantify_contribution(factory,only_exact_scoring=True,only_semantic_scoring=False,only_special_scoring=False,all_types_scoring=False)
)

reranker_only_semantic = (factory.query_encoder() >>bm25_terrier_stemmed_text>>pt.text.get_text(dataset,"text")>>factory.text_encoder() 
       >>factory.fetch_index_encodings(ids=True)
       >> scorer_quantify_contribution(factory,only_exact_scoring=False,only_semantic_scoring=True,only_special_scoring=False,all_types_scoring=False)
)

reranker_only_qspecial = (factory.query_encoder() >>bm25_terrier_stemmed_text>>pt.text.get_text(dataset,"text")>>factory.text_encoder() 
       >>factory.fetch_index_encodings(ids=True)
       >> scorer_quantify_contribution(factory,only_exact_scoring=False,only_semantic_scoring=False,only_special_scoring=True,all_types_scoring=False)
)


In [40]:
from pyterrier.measures import *
res1 = pt.Experiment(
    [
    bm25_terrier_stemmed_text,
    reranker_only_exact,
    reranker_only_semantic,
    reranker_only_qspecial,
    reranker_all_types,
    e2e_pipe_only_exact,
    e2e_pipe_only_semantic,
    e2e_pipe_only_qspecial,
    e2e_pipe_all_types
    ],
    topics2020,
    qrels2020,
    save_dir="./",
    filter_by_qrels=True,
    batch_size=10,verbose=True,
    eval_metrics = [nDCG@10,nDCG@100, R(rel=2)@100,R(rel=2)@1000, AP(rel=2)@100,AP(rel=2)@1000,RR(rel=2)@10],
    names=[
        'bm25',
        "reranker.only_exact","reranker.only_semantic",
        "reranker.only_special","reranker.all_types",
        "colbert.only_exact","colbert.only_semantic",
        "colbert.only_special","colbert.all_types"]
)

res1

Unnamed: 0,name,nDCG@10,nDCG@100,R(rel=2)@100,R(rel=2)@1000,AP(rel=2)@100,AP(rel=2)@1000,RR(rel=2)@10
0,bm25,0.493627,0.502562,0.583855,0.807223,0.275282,0.292988,0.614675
1,reranker.only_exact,0.526527,0.529497,0.610375,0.807223,0.315803,0.334937,0.655093
2,reranker.only_semantic,0.138996,0.176649,0.270707,0.807223,0.078337,0.097151,0.234744
3,reranker.only_special,0.518976,0.47635,0.581121,0.807223,0.318272,0.334874,0.713889
4,reranker.all_types,0.706804,0.645974,0.711896,0.807223,0.471695,0.48381,0.834877
5,colbert.only_exact,0.492102,0.485153,0.572707,0.771416,0.284611,0.301504,0.646781
6,colbert.only_semantic,0.002188,0.009526,0.023642,0.11361,0.00145,0.00278,0.00463
7,colbert.only_special,0.383874,0.316111,0.37611,0.554282,0.216956,0.224973,0.568078
8,colbert.all_types,0.689939,0.623477,0.704708,0.805692,0.4611,0.472455,0.83179
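
One simple way to read the table is to express each partial scorer's nDCG@10 as a fraction of the corresponding full (`all_types`) score. The sketch below does this arithmetic with the nDCG@10 values copied from the table above; the ratio is only an illustrative summary and not a measure reported in the paper.

```python
# nDCG@10 values copied from the results table above.
ndcg10 = {
    "reranker": {"only_exact": 0.526527, "only_semantic": 0.138996,
                 "only_special": 0.518976, "all_types": 0.706804},
    "colbert": {"only_exact": 0.492102, "only_semantic": 0.002188,
                "only_special": 0.383874, "all_types": 0.689939},
}

# Fraction of the full model's nDCG@10 retained by each partial scorer.
retained = {
    setting: {kind: score / scores["all_types"]
              for kind, score in scores.items() if kind != "all_types"}
    for setting, scores in ndcg10.items()
}

for setting, kinds in retained.items():
    for kind, fraction in kinds.items():
        print(f"{setting}.{kind}: {fraction:.1%} of all_types")
```

Note that these fractions need not sum to one, since nDCG@10 is computed on a separate ranking for each partial score.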


### Note: for other Col$\star$ models, you need to make a small change to `special_token_match`. Alternatively, you can directly evaluate our provided result files to reproduce the results reported in Table 7 of our paper. The following experiments demonstrate this evaluation for all the reported models.

#### Table 7 results for ColBERT

In [6]:
e2e_only_exact = pt.io.read_results("./ColBERT/colbert.only_exact.res.gz")
e2e_only_semantic = pt.io.read_results("./ColBERT/colbert.only_semantic.res.gz")
e2e_only_special = pt.io.read_results("./ColBERT/colbert.only_special.res.gz")
e2e_all_types = pt.io.read_results("./ColBERT/colbert.all_types.res.gz")

rerank_only_exact = pt.io.read_results("./ColBERT/reranker.only_exact.res.gz")
rerank_only_semantic = pt.io.read_results("./ColBERT/reranker.only_semantic.res.gz")
rerank_only_special = pt.io.read_results("./ColBERT/reranker.only_special.res.gz")
rerank_all_types = pt.io.read_results("./ColBERT/reranker.all_types.res.gz")

In [7]:
from pyterrier.measures import *
res_colbert_RQ3_3 = pt.Experiment(
    [
    rerank_only_exact,
    rerank_only_semantic,
    rerank_only_special,
    rerank_all_types,
    e2e_only_exact,
    e2e_only_semantic,
    e2e_only_special,
    e2e_all_types
    ],
    pt.get_dataset("trec-deep-learning-passages").get_topics('test-2020'),
    pt.get_dataset("trec-deep-learning-passages").get_qrels('test-2020'),

    filter_by_qrels=True,
    batch_size=100,verbose=True,
    eval_metrics = [nDCG@10],
    names=[
      
        "reranker.only_exact","reranker.only_semantic",
        "reranker.only_special","reranker.all_types",
        "colbert.only_exact","colbert.only_semantic",
        "colbert.only_speical","colbert.all_types"]
)

res_colbert_RQ3_3

HBox(children=(HTML(value='pt.Experiment'), FloatProgress(value=0.0, max=8.0), HTML(value='')))




Unnamed: 0,name,nDCG@10
0,reranker.only_exact,0.526527
1,reranker.only_semantic,0.138996
2,reranker.only_special,0.518976
3,reranker.all_types,0.706804
4,colbert.only_exact,0.492102
5,colbert.only_semantic,0.002188
6,colbert.only_speical,0.383874
7,colbert.all_types,0.689939


#### Table 7 results for ColminiLM

In [8]:
e2e_only_exact = pt.io.read_results("./ColminiLM/colminilm.only_exact.res.gz")
e2e_only_semantic = pt.io.read_results("./ColminiLM/colminilm.only_semantic.res.gz")
e2e_only_special = pt.io.read_results("./ColminiLM/colminilm.only_special.res.gz")
e2e_all_types = pt.io.read_results("./ColminiLM/colminilm.all_types.res.gz")

rerank_only_exact = pt.io.read_results("./ColminiLM/colminilm.reranker.only_exact.res.gz")
rerank_only_semantic = pt.io.read_results("./ColminiLM/colminilm.reranker.only_semantic.res.gz")
rerank_only_special = pt.io.read_results("./ColminiLM/colminilm.reranker.only_special.res.gz")
rerank_all_types = pt.io.read_results("./ColminiLM/colminilm.reranker.all_types.res.gz")

In [9]:
from pyterrier.measures import *
res_colminilm_RQ3_3 = pt.Experiment(
    [
    rerank_only_exact,
    rerank_only_semantic,
    rerank_only_special,
    rerank_all_types,
    e2e_only_exact,
    e2e_only_semantic,
    e2e_only_special,
    e2e_all_types
    ],
    pt.get_dataset("trec-deep-learning-passages").get_topics('test-2020'),
    pt.get_dataset("trec-deep-learning-passages").get_qrels('test-2020'),

    filter_by_qrels=True,
    batch_size=100,verbose=True,
    eval_metrics = [nDCG@10], # nDCG@10 only; no recall measure is computed in this cell
    names=[
        "colminilm.reranker.only_exact","colminilm.reranker.only_semantic",
        "colminilm.reranker.only_special","colminilm.reranker.all_types",
        "colminilm.only_exact","colminilm.only_semantic",
        "colminilm.only_special","colminilm.all_types"]
)

res_colminilm_RQ3_3

Unnamed: 0,name,nDCG@10
0,colminilm.reranker.only_exact,0.48737
1,colminilm.reranker.only_semantic,0.074291
2,colminilm.reranker.only_special,0.522579
3,colminilm.reranker.all_types,0.684679
4,colminilm.only_exact,0.426379
5,colminilm.only_semantic,0.000563
6,colminilm.only_special,0.346634
7,colminilm.all_types,0.672129


#### Table 7 results for ColRoBERTa

In [10]:
e2e_only_exact = pt.io.read_results("./ColRoBERTa/colroberta.only_exact.res.gz")
e2e_only_semantic = pt.io.read_results("./ColRoBERTa/colroberta.only_semantic.res.gz")
e2e_only_special = pt.io.read_results("./ColRoBERTa/colroberta.only_special.res.gz")
e2e_all_types = pt.io.read_results("./ColRoBERTa/colroberta.all_types.res.gz")

rerank_only_exact = pt.io.read_results("./ColRoBERTa/colroberta.reranker.only_exact.res.gz")
rerank_only_semantic = pt.io.read_results("./ColRoBERTa/colroberta.reranker.only_semantic.res.gz")
rerank_only_special = pt.io.read_results("./ColRoBERTa/colroberta.reranker.only_special.res.gz")
rerank_all_types = pt.io.read_results("./ColRoBERTa/colroberta.reranker.all_types.res.gz")

In [12]:
from pyterrier.measures import *
res_colroberta_RQ3_3 = pt.Experiment(
    [
    rerank_only_exact,
    rerank_only_semantic,
    rerank_only_special,
    rerank_all_types,
    e2e_only_exact,
    e2e_only_semantic,
    e2e_only_special,
    e2e_all_types
    ],
    pt.get_dataset("trec-deep-learning-passages").get_topics('test-2020'),
    pt.get_dataset("trec-deep-learning-passages").get_qrels('test-2020'),

    filter_by_qrels=True,
    batch_size=100,verbose=True,
    eval_metrics = [nDCG@10], # nDCG@10 only; no recall measure is computed in this cell
    names=[
        "colroberta.reranker.only_exact","colroberta.reranker.only_semantic",
        "colroberta.reranker.only_special","colroberta.reranker.all_types",
        "colroberta.only_exact","colroberta.only_semantic",
        "colroberta.only_special","colroberta.all_types"]
)

res_colroberta_RQ3_3

Unnamed: 0,name,nDCG@10
0,colroberta.reranker.only_exact,0.396841
1,colroberta.reranker.only_semantic,0.260481
2,colroberta.reranker.only_special,0.635353
3,colroberta.reranker.all_types,0.695075
4,colroberta.only_exact,0.349475
5,colroberta.only_semantic,0.157172
6,colroberta.only_special,0.574166
7,colroberta.all_types,0.666198


#### Table 7 results for ColALBERT

In [13]:
e2e_only_exact = pt.io.read_results("./ColALBERT/colalbert.only_exact.res.gz")
e2e_only_semantic = pt.io.read_results("./ColALBERT/colalbert.only_semantic.res.gz")
e2e_only_special = pt.io.read_results("./ColALBERT/colalbert.only_special.res.gz")
e2e_all_types = pt.io.read_results("./ColALBERT/colalbert.all_types.res.gz")

rerank_only_exact = pt.io.read_results("./ColALBERT/colalbert.reranker.only_exact.res.gz")
rerank_only_semantic = pt.io.read_results("./ColALBERT/colalbert.reranker.only_semantic.res.gz")
rerank_only_special = pt.io.read_results("./ColALBERT/colalbert.reranker.only_special.res.gz")
rerank_all_types = pt.io.read_results("./ColALBERT/colalbert.reranker.all_types.res.gz")

In [14]:
from pyterrier.measures import *
res_colalbert_RQ3_3 = pt.Experiment(
    [
    rerank_only_exact,
    rerank_only_semantic,
    rerank_only_special,
    rerank_all_types,
    e2e_only_exact,
    e2e_only_semantic,
    e2e_only_special,
    e2e_all_types
    ],
    pt.get_dataset("trec-deep-learning-passages").get_topics('test-2020'),
    pt.get_dataset("trec-deep-learning-passages").get_qrels('test-2020'),

    filter_by_qrels=True,
    batch_size=100,verbose=True,
    eval_metrics = [nDCG@10], # nDCG@10 only; no recall measure is computed in this cell
    names=[
        "colalbert.reranker.only_exact","colalbert.reranker.only_semantic",
        "colalbert.reranker.only_special","colalbert.reranker.all_types",
        "colalbert.only_exact","colalbert.only_semantic",
        "colalbert.only_special","colalbert.all_types"]
)

res_colalbert_RQ3_3

Unnamed: 0,name,nDCG@10
0,colalbert.reranker.only_exact,0.505309
1,colalbert.reranker.only_semantic,0.074438
2,colalbert.reranker.only_special,0.459578
3,colalbert.reranker.all_types,0.630362
4,colalbert.only_exact,0.410674
5,colalbert.only_semantic,0.007132
6,colalbert.only_special,0.341385
7,colalbert.all_types,0.603873

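The four evaluation cells above repeat the same load-and-evaluate pattern, differing only in the directory and file prefixes. A minimal sketch of how this could be factored into one helper is given below. The names `result_files` and `evaluate_model` are hypothetical (they are not part of PyTerrier or of the released code), and the sketch assumes the result files follow the naming seen in the cells above and that `pt.init()` has already been called.

```python
MATCH_TYPES = ["only_exact", "only_semantic", "only_special", "all_types"]

def result_files(model_dir, e2e_prefix, rerank_prefix):
    """Hypothetical helper: map run names to result-file paths for one model.

    Reranker runs come first, then end-to-end runs, mirroring the cell order above.
    """
    runs = {}
    for prefix in (rerank_prefix, e2e_prefix):
        for match_type in MATCH_TYPES:
            name = f"{prefix}.{match_type}"
            runs[name] = f"./{model_dir}/{name}.res.gz"
    return runs

def evaluate_model(model_dir, e2e_prefix, rerank_prefix):
    """Hypothetical helper: evaluate the eight runs of one model with nDCG@10."""
    # Imported lazily so that result_files() can be used without PyTerrier.
    import pyterrier as pt
    from pyterrier.measures import nDCG

    runs = result_files(model_dir, e2e_prefix, rerank_prefix)
    dataset = pt.get_dataset("trec-deep-learning-passages")
    return pt.Experiment(
        [pt.io.read_results(path) for path in runs.values()],
        dataset.get_topics("test-2020"),
        dataset.get_qrels("test-2020"),
        filter_by_qrels=True,
        eval_metrics=[nDCG @ 10],
        names=list(runs.keys()),
    )
```

Under these assumptions, `evaluate_model("ColBERT", "colbert", "reranker")` would correspond to the ColBERT cell, and `evaluate_model("ColminiLM", "colminilm", "colminilm.reranker")` to the ColminiLM cell; the ColBERT reranker files are the only ones without a model prefix, which is why the two prefixes are passed separately.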

# Conclusion for RQ3.3

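In summary, to answer RQ3.3, the results above show that the late interaction mechanism benefits more from lexical matching than from semantic matching: for all four models, in both the reranking and the end-to-end settings, the runs restricted to exact matches reach a higher nDCG@10 than the runs restricted to semantic matches. In addition, special tokens, such as the `[CLS]` token, play a very important role in matching.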
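To make this comparison easier to read across models, the sketch below expresses each single-match-type run as a fraction of the corresponding `all_types` run. The nDCG@10 values are copied from the end-to-end rows of the result tables above; the helper name `relative_to_full` is hypothetical. These ratios are a descriptive summary only and should not be read as additive contributions, since the three match types interact when scored together.

```python
# nDCG@10 of the end-to-end runs on TREC DL 2020, copied from the tables above.
ndcg10 = {
    "ColBERT":    {"only_exact": 0.492102, "only_semantic": 0.002188,
                   "only_special": 0.383874, "all_types": 0.689939},
    "ColminiLM":  {"only_exact": 0.426379, "only_semantic": 0.000563,
                   "only_special": 0.346634, "all_types": 0.672129},
    "ColRoBERTa": {"only_exact": 0.349475, "only_semantic": 0.157172,
                   "only_special": 0.574166, "all_types": 0.666198},
    "ColALBERT":  {"only_exact": 0.410674, "only_semantic": 0.007132,
                   "only_special": 0.341385, "all_types": 0.603873},
}

def relative_to_full(scores):
    """Hypothetical helper: each restricted run as a fraction of the all_types run."""
    full = scores["all_types"]
    return {name: value / full for name, value in scores.items() if name != "all_types"}

for model, scores in ndcg10.items():
    shares = relative_to_full(scores)
    print(model, {name: round(share, 3) for name, share in shares.items()})
```

The same dictionary layout could be filled with the reranker rows to repeat the comparison in the reranking setting.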