# **PyTerrier**

Information Retrieval (IR) is one of the key tasks in many natural language processing applications. It is the process of searching and collecting information from databases or other resources based on queries or requirements. The fundamental elements of an IR system are the query and the document. The query expresses the user’s information need, and the document is the resource that contains the information. An efficient IR system retrieves the required information accurately from the documents at low computational cost.

PyTerrier is a declarative framework with two key objects: an IR transformer and an IR operator. A transformer is an object that maps an array of queries to the corresponding documents. Operators combine transformers into pipelines.
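
To make the transformer and operator idea concrete, here is a minimal pure-Python sketch. It is not PyTerrier's implementation: the `Transformer` class, the `DOCS` corpus and the two functions are hypothetical stand-ins, and the sketch only mimics the composition (`>>`) and rank cutoff (`%`) operators that appear later in this notebook.

```python
class Transformer:
    """Toy transformer: wraps a function from a list of rows to a list of rows."""

    def __init__(self, fn):
        self.fn = fn

    def transform(self, rows):
        return self.fn(rows)

    def __rshift__(self, other):
        # a >> b: feed the output of a into b
        return Transformer(lambda rows: other.transform(self.transform(rows)))

    def __mod__(self, k):
        # a % k: keep only the first k rows of each query
        def cutoff(rows):
            counts, kept = {}, []
            for row in self.transform(rows):
                counts[row["qid"]] = counts.get(row["qid"], 0) + 1
                if counts[row["qid"]] <= k:
                    kept.append(row)
            return kept
        return Transformer(cutoff)


# Hypothetical three-document corpus
DOCS = {
    "d1": "chemical reactions in liquids",
    "d2": "digital computers",
    "d3": "chemical analysis of chemical compounds",
}


def retrieve(queries):
    """Score each document by how often the query terms occur in it."""
    results = []
    for q in queries:
        terms = q["query"].split()
        scored = []
        for docno, text in DOCS.items():
            tokens = text.split()
            score = sum(tokens.count(t) for t in terms)
            if score > 0:
                scored.append({"qid": q["qid"], "docno": docno, "score": float(score)})
        scored.sort(key=lambda r: r["score"], reverse=True)
        results.extend(scored)
    return results


def length_normalise(rows):
    """Re-rank by dividing each score by the document length in tokens."""
    rescored = [dict(r, score=r["score"] / len(DOCS[r["docno"]].split())) for r in rows]
    rescored.sort(key=lambda r: (r["qid"], -r["score"]))
    return rescored


pipeline = (Transformer(retrieve) >> Transformer(length_normalise)) % 1
top = pipeline.transform([{"qid": "1", "query": "chemical"}])
print(top)
```

Because each operator returns another `Transformer`, pipelines of any depth expose the same `transform` interface, which is the property the declarative style relies on.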

To read more about it, please refer to [this article](https://analyticsindiamag.com/guide-to-pyterrier-a-python-framework-for-information-retrieval/).

# **Hands-on Retrieval and Evaluation**

PyTerrier is available as a PyPI package, so it can be installed with pip.

In [1]:
!python -m pip install pip --upgrade --user -q
!python -m pip install numpy pandas seaborn matplotlib scipy scikit-learn statsmodels xgboost --user -q

In [2]:
!python -m pip install python-terrier --user -q --no-warn-script-location

In [None]:
# Restart the kernel so that the newly installed packages can be imported
import IPython
IPython.Application.instance().kernel.do_shutdown(True)

Import the library and initialize it.

In [1]:
import pyterrier as pt
if not pt.started():
  pt.init() 

PyTerrier 0.7.0 has loaded Terrier 5.6 (built by craigmacdonald on 2021-09-17 13:27)


Load one of the built-in datasets and obtain its index for retrieval.

In [2]:
vaswani_dataset = pt.datasets.get_dataset("vaswani")
indexref = vaswani_dataset.get_index()
index = pt.IndexFactory.of(indexref)
print(index.getCollectionStatistics().toString()) 

Downloading vaswani index to /home/aishwarya/.pyterrier/corpora/vaswani/index


data.direct.bf: 100%|██████████| 388k/388k [00:00<00:00, 463kiB/s]
data.document.fsarrayfile: 100%|██████████| 234k/234k [00:00<00:00, 290kiB/s]
data.inverted.bf: 100%|██████████| 362k/362k [00:00<00:00, 415kiB/s]
data.lexicon.fsomapfile: 100%|██████████| 682k/682k [00:01<00:00, 599kiB/s]
data.lexicon.fsomaphash: 100%|██████████| 777/777 [00:00<00:00, 1.65MiB/s]
data.lexicon.fsomapid: 100%|██████████| 30.3k/30.3k [00:00<00:00, 192kiB/s] 
data.meta-0.fsomapfile: 100%|██████████| 725k/725k [00:01<00:00, 633kiB/s]
data.meta.idx: 100%|██████████| 89.3k/89.3k [00:00<00:00, 268kiB/s]
data.meta.zdata: 100%|██████████| 224k/224k [00:00<00:00, 310kiB/s]
data.properties: 100%|██████████| 4.29k/4.29k [00:00<00:00, 9.58MiB/s]
md5sums: 100%|██████████| 619/619 [00:00<00:00, 684kiB/s]


Number of documents: 11429
Number of terms: 7756
Number of postings: 224573
Number of fields: 1
Number of tokens: 271581
Field names: [text]
Positions:   false
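
To make these statistics concrete, the sketch below builds a toy inverted index in plain Python and derives the same four counts. The three documents are made up for illustration, and the tokenisation is a plain whitespace split; a real indexer typically also applies steps such as stopword removal and stemming, so this is not how Terrier builds its index.

```python
from collections import defaultdict

# Hypothetical mini-collection
docs = {
    "d1": "measurement of dielectric constant of liquids",
    "d2": "design of digital computers",
    "d3": "dielectric measurement",
}

# term -> {docno: term frequency in that document}
index = defaultdict(dict)
for docno, text in docs.items():
    for token in text.split():
        index[token][docno] = index[token].get(docno, 0) + 1

num_documents = len(docs)
num_terms = len(index)                                    # distinct terms
num_postings = sum(len(p) for p in index.values())        # (term, document) pairs
num_tokens = sum(tf for p in index.values() for tf in p.values())  # all occurrences

print(num_documents, num_terms, num_postings, num_tokens)
```

A posting records that a term occurs in a document, while a token is a single occurrence, which is why the token count is at least as large as the posting count.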



Get the dataset's queries, which PyTerrier calls topics.

In [3]:
topics = vaswani_dataset.get_topics()
topics.head(5) 

Downloading vaswani topics to /home/aishwarya/.pyterrier/corpora/vaswani/query-text.trec


query-text.trec: 10.7kiB [00:00, 2.80MiB/s]                  


Unnamed: 0,qid,query
0,1,measurement of dielectric constant of liquids ...
1,2,mathematical analysis and design details of wa...
2,3,use of digital computers in the design of band...
3,4,systems of data coding for information transfer
4,5,use of programs in engineering testing of comp...


Perform retrieval with the TF-IDF weighting model in a few commands.

In [4]:
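# Create a retriever that scores documents with the TF_IDF weighting model
retr = pt.BatchRetrieve(index, controls = {"wmodel": "TF_IDF"})
# The same control can also be set after construction; the two calls below
# repeat the constructor setting and are shown only as alternatives
retr.setControl("wmodel", "TF_IDF")
retr.setControls({"wmodel": "TF_IDF"})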
res=retr.transform(topics)
res 

Unnamed: 0,qid,docid,docno,rank,score,query
0,1,8171,8172,0,13.746087,measurement of dielectric constant of liquids ...
1,1,9880,9881,1,12.352666,measurement of dielectric constant of liquids ...
2,1,5501,5502,2,12.178153,measurement of dielectric constant of liquids ...
3,1,1501,1502,3,10.993585,measurement of dielectric constant of liquids ...
4,1,9858,9859,4,10.271452,measurement of dielectric constant of liquids ...
...,...,...,...,...,...,...
91925,93,2226,2227,995,4.904950,high frequency oscillators using transistors t...
91926,93,6898,6899,996,4.899385,high frequency oscillators using transistors t...
91927,93,3473,3474,997,4.898796,high frequency oscillators using transistors t...
91928,93,3187,3188,998,4.893073,high frequency oscillators using transistors t...



It can be observed that the documents are retrieved and ranked for each query. The results can also be saved to disk using the `write_results` function in the `io` module of the PyTerrier framework.

In [5]:
# pt.io.write_results(res,"result1.res")
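
For orientation, retrieval results are conventionally stored in the six-column TREC run layout (query id, the literal `Q0`, document number, rank, score, run name). The sketch below formats rows in that layout; it is an assumption about the conventional format rather than a description of exactly what `write_results` emits, and `format_trec_run` and `example_run` are hypothetical names.

```python
def format_trec_run(rows, run_name="example_run"):
    """Format result rows as 'qid Q0 docno rank score run_name' lines."""
    lines = []
    for row in rows:
        lines.append(
            f'{row["qid"]} Q0 {row["docno"]} {row["rank"]} {row["score"]} {run_name}'
        )
    return "\n".join(lines)


# The first two rows of the result table shown earlier
rows = [
    {"qid": "1", "docno": "8172", "rank": 0, "score": 13.746087},
    {"qid": "1", "docno": "9881", "rank": 1, "score": 12.352666},
]
print(format_trec_run(rows))
```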

Next, evaluate the results by comparing them with the ground-truth relevance judgments (qrels) that ship with the dataset. First, get the qrels.

In [6]:
qrels = vaswani_dataset.get_qrels()

Downloading vaswani qrels to /home/aishwarya/.pyterrier/corpora/vaswani/qrels


qrels: 24.3kiB [00:00, 8.13MiB/s]                  


Evaluate the query results.

In [7]:
# Named eval_results to avoid shadowing Python's built-in eval()
eval_results = pt.Utils.evaluate(res,qrels)
eval_results

{'map': 0.29090543005529873, 'ndcg': 0.6153667539666847}
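
For reference, the ‘map’ figure is mean average precision: average precision is computed for each query and then averaged over all queries. A minimal sketch, assuming binary relevance and toy data (the names `average_precision`, `toy_run` and `toy_qrels` are illustrative and not part of PyTerrier):

```python
def average_precision(ranked_docnos, relevant):
    """Average of precision@k taken at each rank k where a relevant document appears."""
    hits = 0
    precision_sum = 0.0
    for position, docno in enumerate(ranked_docnos, start=1):
        if docno in relevant:
            hits += 1
            precision_sum += hits / position
    # Relevant documents that were never retrieved still count in the denominator
    return precision_sum / len(relevant) if relevant else 0.0


def mean_average_precision(run, judgments):
    """Mean of the per-query average precision over all judged queries."""
    scores = [average_precision(run.get(qid, []), rel) for qid, rel in judgments.items()]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical rankings and relevance judgments
toy_run = {"q1": ["a", "b", "c", "d"], "q2": ["x", "y", "z"]}
toy_qrels = {"q1": {"a", "c"}, "q2": {"z", "w"}}

print(average_precision(toy_run["q1"], toy_qrels["q1"]))  # (1/1 + 2/3) / 2
print(mean_average_precision(toy_run, toy_qrels))
```

A query whose relevant documents are never retrieved gets an average precision of 0, which pulls the mean down.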

Evaluation results can also be obtained per query. Here, the ‘map’ metric is reported for each query individually, computed over all documents retrieved for that query.

In [8]:
# Named eval_results to avoid shadowing Python's built-in eval()
eval_results = pt.Utils.evaluate(res,qrels,metrics=["map"], perquery=True)
eval_results

defaultdict(dict,
            {'1': {'map': 0.2688603632606692},
             '2': {'map': 0.056448212440045914},
             '3': {'map': 0.23945401361406524},
             '4': {'map': 0.4939494140851607},
             '5': {'map': 0.0},
             '6': {'map': 0.2421600270476016},
             '7': {'map': 0.5674516736006812},
             '8': {'map': 0.5},
             '9': {'map': 0.5222222222222223},
             '10': {'map': 0.1214856066519094},
             '11': {'map': 0.06799761023743447},
             '12': {'map': 0.2093716360982601},
             '13': {'map': 0.26945162856284827},
             '14': {'map': 0.3164929260069987},
             '15': {'map': 0.17479160483981196},
             '16': {'map': 0.07376769675516924},
             '17': {'map': 0.3965636483508813},
             '18': {'map': 0.16354405989238738},
             '19': {'map': 0.44669647488527836},
             '20': {'map': 0.22061080821325293},
             '21': {'map': 0.5395186359625185},
   

# **Hands-on Learn-To-Rank**

Create the environment by importing the necessary libraries and initializing the PyTerrier framework.


In [9]:
import numpy as np
import pandas as pd
import pyterrier as pt
if not pt.started():
  pt.init() 

Download a built-in dataset along with its index, queries and ground-truth relevance judgments.


In [10]:
dataset = pt.datasets.get_dataset("vaswani")
indexref = dataset.get_index()
topics = dataset.get_topics()
qrels = dataset.get_qrels() 

The standard ‘BM25’ model is used in this example to retrieve and rank the candidate documents for each query. The traditional ‘TF-IDF’ model and the ‘PL2’ model are then used to re-rank those candidates.


In [11]:
#this ranker will make the candidate set of documents for each query
BM25 = pt.BatchRetrieve(indexref, controls = {"wmodel": "BM25"})
#these rankers we will use to re-rank the BM25 results
TF_IDF =  pt.BatchRetrieve(indexref, controls = {"wmodel": "TF_IDF"})
PL2 =  pt.BatchRetrieve(indexref, controls = {"wmodel": "PL2"}) 
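
As background on the first-stage ranker, the sketch below scores documents with a textbook form of BM25. It is an illustration under stated assumptions, not Terrier's exact implementation: the IDF variant, the parameter values `k1=1.2` and `b=0.75`, the function name `bm25_score` and the toy corpus are all chosen here for the example.

```python
import math


def bm25_score(query_terms, doc_tokens, doc_freqs, num_docs, avg_doc_len, k1=1.2, b=0.75):
    """Textbook BM25: sum over query terms of IDF times a saturated, length-normalised tf."""
    score = 0.0
    doc_len = len(doc_tokens)
    for term in query_terms:
        tf = doc_tokens.count(term)
        if tf == 0:
            continue
        n_t = doc_freqs.get(term, 0)  # number of documents containing the term
        idf = math.log((num_docs - n_t + 0.5) / (n_t + 0.5) + 1.0)
        norm = tf + k1 * (1.0 - b + b * doc_len / avg_doc_len)
        score += idf * tf * (k1 + 1.0) / norm
    return score


# Hypothetical corpus, tokenised by whitespace
corpus = {
    "d1": "chemical reactions in liquid chemical systems".split(),
    "d2": "chemical properties of rare earth metals".split(),
    "d3": "design of digital computers for engineers today".split(),
}
num_docs = len(corpus)
avg_doc_len = sum(len(tokens) for tokens in corpus.values()) / num_docs

# Document frequency of every term in the corpus
doc_freqs = {}
for tokens in corpus.values():
    for term in set(tokens):
        doc_freqs[term] = doc_freqs.get(term, 0) + 1

scores = {
    docno: bm25_score(["chemical"], tokens, doc_freqs, num_docs, avg_doc_len)
    for docno, tokens in corpus.items()
}
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
```

In this toy corpus `d1` and `d2` have the same length, so `d1` ranks first only because it contains the query term twice; the `k1` parameter controls how quickly that advantage saturates.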

Create a PyTerrier pipeline that performs the task described above, and run a query through it.


In [12]:
pipe = BM25 >> (TF_IDF ** PL2)
pipe.transform("chemical end:2") 


Unnamed: 0,qid,docid,docno,rank,score,query,features
0,1,10702,10703,0,13.472012,chemical end:2,"[7.38109017620895, 6.9992254918907575]"
1,1,1055,1056,1,12.517082,chemical end:2,"[6.857899681644975, 6.358419229871986]"
2,1,4885,4886,2,12.228161,chemical end:2,"[6.69960466053696, 6.181368165774688]"


In the above output, ‘score’ is the ranking score of the BM25 model and ‘features’ holds the re-ranking scores of the TF-IDF and PL2 models. However, ranking in a first step and then re-ranking in two successive steps takes more time. To tackle this issue, PyTerrier provides a class called FeaturesBatchRetrieve. Let’s use it to perform ranking and re-ranking in one go.


In [13]:
fbr = pt.FeaturesBatchRetrieve(indexref, controls = {"wmodel": "BM25"}, features=["WMODEL:TF_IDF", "WMODEL:PL2"]) 
# the top 2 results
(fbr %2).search("chemical") 

Unnamed: 0,qid,query,docid,rank,features,docno,score
0,1,chemical,10702,0,"[1.9972714735280614, 1.590216305943686]",10703,13.472012
1,1,chemical,1055,1,"[2.5168371014881425, 2.1297038460724336]",1056,12.517082


PyTerrier pipelines also have a method called compile(), which optimizes the ranking and re-ranking processes automatically. This approach yields the same results as above in about the same compute time. An example is as follows:


In [14]:
pipe_fast = pipe.compile()
(pipe_fast %2).search("chemical") 

Applying 8 rules


Unnamed: 0,qid,docid,docno,rank,score,query,features
0,1,10702,10703,0,13.472012,chemical,"[7.38109017620895, 6.9992254918907575]"
1,1,1055,1056,1,12.517082,chemical,"[6.857899681644975, 6.358419229871986]"


After ranking and re-ranking, a machine learning model can be trained for learning to rank (LTR). First, split the available topics into training, validation and test sets.


In [15]:
train_topics, valid_topics, test_topics = np.split(topics, [int(.6*len(topics)), int(.8*len(topics))])
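
The two cut points passed to `np.split` fall at 60% and 80% of the topics, giving a 60/20/20 partition in the existing row order with no shuffling. The plain-Python sketch below reproduces that arithmetic; `positional_split` is a hypothetical helper, and the 93 query ids are assumed to match the topic set used here.

```python
def positional_split(items, first_cut=0.6, second_cut=0.8):
    """Split a sequence at two fractional cut points, keeping the original order."""
    n = len(items)
    i, j = int(first_cut * n), int(second_cut * n)
    return items[:i], items[i:j], items[j:]


qids = [str(q) for q in range(1, 94)]  # assumed: 93 query ids, "1" to "93"
train, valid, test = positional_split(qids)
print(len(train), len(valid), len(test))
```

Because the split is positional, any ordering in the topic file carries over into the partitions; shuffling the topics first would be one way to avoid that.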

Build a Random Forest model to perform the LTR and obtain the results.


In [16]:
from sklearn.ensemble import RandomForestRegressor
BaselineLTR = fbr >> pt.pipelines.LTR_pipeline(RandomForestRegressor(n_estimators=400))
BaselineLTR.fit(train_topics, qrels)
resultsRF = pt.pipelines.Experiment([PL2, BaselineLTR], test_topics, qrels, ["map"], names=["PL2 Baseline", "LTR Baseline"])
resultsRF 


Unnamed: 0,name,map
0,PL2 Baseline,0.206031
1,LTR Baseline,0.050078


Build an XGBoost model to perform the LTR and obtain the results.


In [17]:
import xgboost as xgb
params = {'objective': 'rank:ndcg', 
          'learning_rate': 0.1, 
          'gamma': 1.0, 'min_child_weight': 0.1,
          'max_depth': 6,
          'verbosity': 2,  # XGBoost's logging-level parameter
          'random_state': 42 
        }
BaseLTR_LM = fbr >> pt.pipelines.XGBoostLTR_pipeline(xgb.sklearn.XGBRanker(**params))
BaseLTR_LM.fit(train_topics, qrels, valid_topics, qrels)
resultsLM = pt.pipelines.Experiment([PL2, BaseLTR_LM],
                                test_topics,                                  
                                qrels, ["map"], 
                                names=["PL2 Baseline", "LambdaMART"])
resultsLM 

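ModuleNotFoundError: No module named 'xgboost'

This cell failed because the xgboost package could not be imported in the active environment, so no LambdaMART results are available here. Make sure xgboost is installed for the kernel in use before running it.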