# SPLADE on MSMARCO v1 Passage Corpus using PyTerrier

This notebook demonstrates how to build a SPLADE index over the MSMARCO v1 passage corpus using PyTerrier, then retrieve from it and evaluate the results.

## Installation

Install SPLADE and the PyTerrier SPLADE plugin with pip.

In [None]:
%pip install -q git+https://github.com/naver/splade.git
%pip install -q git+https://github.com/cmacdonald/pyt_splade.git

## SPLADE setup

We create a factory object that provides the PyTerrier transformers needed to use SPLADE, such as the document encoder and the query encoder.

In [1]:
import os
import pyterrier as pt
import pyt_splade
splade = pyt_splade.Splade()
doc_encoder = splade.doc_encoder()

## Indexing demonstration

Let's see which terms, and with what weights, the SPLADE model generates for a document during indexing.

In [2]:
df = doc_encoder([{'docno' : 'd1', 'text' : 'ww2'}])
df.iloc[0].toks

{'w': 199,
 '##2': 193,
 'war': 167,
 'wwii': 150,
 '##w': 130,
 'ii': 110,
 '2': 94,
 'germany': 86,
 'army': 76,
 'battle': 70,
 'was': 66,
 'bomb': 48,
 'event': 43,
 'wilson': 43,
 'conflict': 38,
 'marshall': 33,
 'allied': 23,
 'surrender': 22,
 'peace': 16,
 'military': 12,
 'era': 10,
 'alliance': 10,
 'weapon': 10,
 'wars': 8,
 'camp': 7,
 'were': 6,
 'france': 6,
 'invasion': 6,
 'nazi': 4,
 'zombie': 2,
 'patton': 2,
 'german': 1}
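
To make the output above easier to read, here is a minimal sketch that ranks the terms by weight and separates WordPiece continuation pieces (tokens prefixed with `##`) from whole words. It runs on a hand-copied subset of the dictionary above rather than on the encoder itself, and `top_terms` is a hypothetical helper, not part of `pyt_splade`.

```python
# A subset of the term weights shown above for the text 'ww2'.
toks = {'w': 199, '##2': 193, 'war': 167, 'wwii': 150,
        '##w': 130, 'ii': 110, '2': 94, 'germany': 86}

def top_terms(toks, k=5):
    """Return the k highest-weighted (term, weight) pairs."""
    return sorted(toks.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Terms starting with '##' are WordPiece continuation pieces.
subword_pieces = sorted(t for t in toks if t.startswith('##'))

# Whole-word terms, ordered by descending weight.
whole_words = [t for t, _ in top_terms(toks, k=len(toks))
               if not t.startswith('##')]

print(top_terms(toks, k=3))
print(subword_pieces)
print(whole_words[:4])
```

Terms such as `war`, `wwii` and `germany` do not occur in the input text `ww2`; they are expansion terms added by the model.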

## Indexing MSMARCO

Let's create an index for the MSMARCO v1 passage corpus.

In [3]:
# this will provide access to the corpus
dataset = pt.get_dataset('irds:msmarco-passage')

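This is the actual indexing code. The SPLADE document encoder transforms each passage into tokens and weights, and the indexer stores them as is: `pretokenised=True` tells it to accept the token dictionaries directly, and the empty `termpipelines` property disables stemming and stopword removal.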

Indexing the 8.8M passages of MSMARCO using a GeForce RTX 3090 took 5 hours.
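
An inverted index stores integer term frequencies, so real-valued model weights have to be mapped to integers before indexing. The sketch below illustrates one way this could be done. The scale factor of 100, the use of rounding, and the helper name `to_term_frequencies` are all assumptions made for illustration; they are not taken from the `pyt_splade` source.

```python
def to_term_frequencies(weights, scale=100):
    """Map real-valued term weights to positive integer pseudo-frequencies.

    `scale` is an assumed multiplier; terms that round to zero are dropped,
    since a zero frequency would not produce a posting.
    """
    tfs = {}
    for term, weight in weights.items():
        tf = round(weight * scale)
        if tf > 0:
            tfs[term] = tf
    return tfs

# Made-up example weights, not real model output.
weights = {'war': 1.67, 'wwii': 1.5, 'german': 0.012, 'noise': 0.004}
print(to_term_frequencies(weights))
```

The resulting dictionary has the same shape as the `toks` column shown earlier: one integer per term.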

In [4]:
if not os.path.exists('./msmarco_psg'): # skip if already created
    indexer = pt.IterDictIndexer('./msmarco_psg', pretokenised=True)
    indexer.setProperty("termpipelines", "")
    indexer.setProperty("tokeniser", "WhitespaceTokeniser")
    
    indexer_pipe = doc_encoder >> indexer
    index_ref = indexer_pipe.index(dataset.get_corpus_iter())

## Retrieval

We can now conduct retrieval using PyTerrier.

In [5]:
retr = pt.terrier.Retriever('./msmarco_psg', wmodel='Tf', verbose=True)

retr_pipe = splade.query_encoder() >> retr

Java started (triggered by Retriever.__init__) and loaded: pyterrier.java, pyterrier.terrier.java [version=5.10 (build: craigm 2024-08-22 17:33), helper_version=0.0.8]


Let's check that retrieval works. The generated query terms and their weights appear in the `query_toks` column.

In [6]:
retr_pipe.search('chemical reactions')

TerrierRetr(Tf): 100%|██████████| 1/1 [00:00<00:00,  1.11q/s]


Unnamed: 0,qid,docid,docno,rank,score,query,query_toks
0,1,758284,758284,0,759.955493,chemical reactions,"{'reactions': 269.44158935546875, 'reaction': ..."
1,1,5913794,5913794,1,758.944513,chemical reactions,"{'reactions': 269.44158935546875, 'reaction': ..."
2,1,742206,742206,2,757.742353,chemical reactions,"{'reactions': 269.44158935546875, 'reaction': ..."
3,1,8572191,8572191,3,751.762260,chemical reactions,"{'reactions': 269.44158935546875, 'reaction': ..."
4,1,129901,129901,4,749.707581,chemical reactions,"{'reactions': 269.44158935546875, 'reaction': ..."
...,...,...,...,...,...,...,...
995,1,6342771,6342771,995,543.450242,chemical reactions,"{'reactions': 269.44158935546875, 'reaction': ..."
996,1,3355651,3355651,996,543.413003,chemical reactions,"{'reactions': 269.44158935546875, 'reaction': ..."
997,1,5094605,5094605,997,543.349945,chemical reactions,"{'reactions': 269.44158935546875, 'reaction': ..."
998,1,2226467,2226467,998,543.343830,chemical reactions,"{'reactions': 269.44158935546875, 'reaction': ..."
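
With `wmodel='Tf'`, each matching posting is expected to contribute its stored frequency multiplied by the query term weight, so the final score behaves like a sparse dot product between the query and document vectors. Below is a minimal pure-Python sketch of that idea. The weights and document names are made up, and `sparse_dot` is a hypothetical helper, not a PyTerrier function.

```python
def sparse_dot(query_toks, doc_toks):
    """Sum query_weight * doc_frequency over the terms both sides share."""
    return sum(qw * doc_toks[term]
               for term, qw in query_toks.items()
               if term in doc_toks)

# Made-up query weights and document frequencies for illustration only.
query = {'reactions': 2.5, 'chemical': 2.0, 'reaction': 1.5}
docs = {
    'd1': {'chemical': 120, 'reaction': 90, 'lab': 40},
    'd2': {'reactions': 100, 'physics': 80},
}

scores = {docno: sparse_dot(query, toks) for docno, toks in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores)
print(ranking)
```

Terms that appear only in the query or only in the document contribute nothing, which is why retrieval can be served by an ordinary inverted index.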


Finally, let's run the experiment on the TREC DL 2019 judged topics and see the resulting effectiveness.

In [7]:
from pyterrier.measures import RR, nDCG, AP

# TREC DL 2019 passage topics that have relevance judgements
dl19 = pt.get_dataset('irds:msmarco-passage/trec-dl-2019/judged')
pt.Experiment(
    [retr_pipe],
    dl19.get_topics(),
    dl19.get_qrels(),
    eval_metrics=[RR(rel=2), nDCG@10, nDCG@100, AP(rel=2)],
    names=['splade']
)

TerrierRetr(Tf): 100%|██████████| 43/43 [01:07<00:00,  1.57s/q]


Unnamed: 0,name,RR(rel=2),nDCG@10,nDCG@100,AP(rel=2)
0,splade,0.918605,0.729712,0.671791,0.50429
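
In `RR(rel=2)` and `AP(rel=2)`, the `rel=2` argument sets the minimum relevance label that counts as relevant. As a rough illustration of what the reciprocal rank measures for a single query, here is a pure-Python sketch. The helper `reciprocal_rank` and the toy labels are hypothetical, and the actual evaluation is performed by the measures library, not by this code.

```python
def reciprocal_rank(ranked_docnos, qrels, rel=2):
    """Return 1/rank of the first document whose label is at least `rel`."""
    for rank, docno in enumerate(ranked_docnos, start=1):
        if qrels.get(docno, 0) >= rel:
            return 1.0 / rank
    return 0.0  # no sufficiently relevant document was retrieved

# Toy ranking and graded labels for one query.
ranking = ['a', 'b', 'c']
qrels = {'a': 1, 'b': 3}

print(reciprocal_rank(ranking, qrels))         # 'a' is below the threshold
print(reciprocal_rank(ranking, qrels, rel=1))  # 'a' now counts as relevant
```

The reported figure is the mean of this value over all 43 queries.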


## Exploring the Index

In [8]:
index = pt.java.cast("org.terrier.querying.LocalManager", retr.manager).index

Let's explore the lexicon to see which tokens were indexed (first 100 entries).

In [9]:
for i, entry in enumerate(index.getLexicon()):
    if i == 100:
        break
    print(entry.getKey() + " " + entry.getValue().toString())

! term8767 Nt=35620 TF=1146138 maxTF=2147483647 @{0 0 0}
" term909 Nt=860880 TF=35393854 maxTF=2147483647 @{0 193938 6}
# term5228 Nt=69447 TF=3962047 maxTF=2147483647 @{0 5070117 6}
##0 term5242 Nt=68165 TF=4325474 maxTF=2147483647 @{0 5657689 4}
##00 term19503 Nt=14652 TF=971888 maxTF=2147483647 @{0 6283938 7}
##01 term12384 Nt=7847 TF=535341 maxTF=2147483647 @{0 6430337 3}
##0s term26590 Nt=389 TF=27046 maxTF=2147483647 @{0 6512277 3}
##1 term5497 Nt=105121 TF=5875946 maxTF=2147483647 @{0 6516783 2}
##10 term17386 Nt=21140 TF=1573665 maxTF=2147483647 @{0 7368646 7}
##100 term12385 Nt=13618 TF=806019 maxTF=2147483647 @{0 7599660 2}
##11 term9508 Nt=17101 TF=1082393 maxTF=2147483647 @{0 7723654 3}
##12 term8683 Nt=8393 TF=695082 maxTF=2147483647 @{0 7887210 5}
##13 term12421 Nt=13616 TF=714315 maxTF=2147483647 @{0 7989915 2}
##14 term17858 Nt=6103 TF=427235 maxTF=2147483647 @{0 8102904 7}
##15 term13479 Nt=14605 TF=926123 maxTF=2147483647 @{0 8168491 1}
##16 term8682 Nt=6139 TF=441595

In [10]:
print(index.getCollectionStatistics().toString())

Number of documents: 8841823
Number of terms: 28679
Number of postings: 1037725680
Number of fields: 0
Number of tokens: 51855220573
Field names: []
Positions:   false
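
A few derived figures help interpret these statistics. The sketch below only does arithmetic on the numbers printed above. Reading "tokens" as the sum of stored term weights is an assumption based on how the index was built.

```python
# Values copied from the collection statistics above.
num_docs = 8841823
num_terms = 28679
num_postings = 1037725680
num_tokens = 51855220573

# Average number of distinct terms stored per passage.
avg_terms_per_doc = num_postings / num_docs

# Average "length" per passage, i.e. the sum of its stored term weights.
avg_weight_sum_per_doc = num_tokens / num_docs

# Average stored weight of a single posting.
avg_weight_per_posting = num_tokens / num_postings

print(f"distinct terms per passage: {avg_terms_per_doc:.1f}")
print(f"weight sum per passage:     {avg_weight_sum_per_doc:.1f}")
print(f"weight per posting:         {avg_weight_per_posting:.1f}")
```

Each passage is represented by roughly 117 distinct terms, and the lexicon of 28,679 terms is smaller than the 30,522-entry vocabulary of BERT base uncased, the WordPiece vocabulary that the `##` tokens suggest is in use.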



We can even look at the terms stored for a particular document in the index.

In [11]:
di = index.getDirectIndex()
doi = index.getDocumentIndex()
lex = index.getLexicon()
docid = 7700000  # docids are 0-based
# NB: postings will be null if the document is empty
dictrep = {}
for posting in di.getPostings(doi.getDocumentEntry(docid)):
    termid = posting.getId()
    lee = lex.getLexiconEntry(termid)
    dictrep[lee.getKey()] = posting.getFrequency()

for k in sorted(dictrep.keys()):
    print(k, dictrep[k])

" 13
##uation 37
000 59
25 43
30 70
35 72
40 35
accountant 39
accounting 70
advice 11
amount 71
amounts 88
applicants 12
ask 108
asked 86
asking 33
asks 6
assessment 19
average 38
bargaining 48
bart 20
bottom 94
briggs 20
burke 10
business 13
businesses 28
calculate 62
calculated 2
candidacy 30
candidate 109
candidates 115
chart 26
companies 43
company 15
considered 36
corporate 10
dave 17
davis 5
desk 7
diversity 2
employee 95
employees 17
employment 40
engineer 20
example 107
examples 98
excel 14
executive 5
finance 43
fisher 8
flat 134
flex 50
flexibility 129
flexible 92
gage 6
give 98
given 17
giving 29
highest 1
hr 62
improvisation 9
include 57
included 28
income 15
interview 19
job 80
jobs 33
kelly 1
letter 9
low 76
management 4
marketing 14
matching 23
math 56
max 44
maximum 84
median 101
mid 102
middle 32
money 31
murray 21
negotiate 98
negotiating 49
negotiation 82
negotiations 59
normal 13
numbers 9
often 19
pay 139
payroll 7
point 59
points 50
post 50
posted 72
posting 95
pr