# Terrier Learning to Rank Examples

This notebook demonstrates how to use PyTerrier for learning to rank.

## Preparation

Let's install PyTerrier, as usual.

In [5]:
#!pip install python-terrier
!pip3 install --upgrade git+https://github.com/terrier-org/pyterrier.git#egg=python-terrier

Collecting python-terrier
  Cloning https://github.com/terrier-org/pyterrier.git to /tmp/pip-install-5ukk_67p/python-terrier
  Running command git clone -q https://github.com/terrier-org/pyterrier.git /tmp/pip-install-5ukk_67p/python-terrier
Collecting chest
  Downloading chest-0.2.3.tar.gz (9.6 kB)
Collecting deprecation
  Downloading deprecation-2.1.0-py2.py3-none-any.whl (11 kB)
Collecting matchpy
  Downloading matchpy-0.5.3-py3-none-any.whl (67 kB)
Collecting pyjnius~=1.3.0
  Downloading pyjnius-1.3.0-cp38-cp38-manylinux2010_x86_64.whl (1.4 MB)
Collecting pytrec_eval
  Downloading pytrec_eval-0.5.tar.gz (15 kB)
Collecting scipy
  Downloading scipy-1.5.4-cp38-cp38-manylinux1_x86_64.whl (25.8 MB)

## Init 

You must run `pt.init()` before using any other PyTerrier functions or classes.

Arguments:    
- `version` - terrier IR version e.g. "5.2"    
- `mem` - megabytes allocated to java e.g. 4096

In [6]:
import numpy as np
import pandas as pd
import pyterrier as pt
if not pt.started():
  pt.init()

terrier-assemblies 5.3  jar-with-dependencies not found, downloading to /root/.pyterrier...
Done
terrier-python-helper 0.0.4  jar not found, downloading to /root/.pyterrier...
Done


In [17]:
!pip3 install pandas
!pip3 install xgboost

Collecting xgboost
  Downloading xgboost-1.2.1-py3-none-manylinux2010_x86_64.whl (148.9 MB)
Installing collected packages: xgboost
Successfully installed xgboost-1.2.1


## Load Files and Index

Again, let's focus on the small Vaswani test collection. It's easily accessible via the dataset API.

In [7]:
dataset = pt.datasets.get_dataset("vaswani")

indexref = dataset.get_index()
topics = dataset.get_topics()
qrels = dataset.get_qrels()

Downloading vaswani index to /root/.pyterrier/corpora/vaswani/index


data.direct.bf: 100%|██████████| 327k/327k [00:00<00:00, 6.63MiB/s]
data.document.fsarrayfile: 100%|██████████| 190k/190k [00:00<00:00, 7.26MiB/s]
data.inverted.bf: 100%|██████████| 301k/301k [00:00<00:00, 8.33MiB/s]
data.lexicon.fsomapfile: 100%|██████████| 651k/651k [00:00<00:00, 11.1MiB/s]
data.lexicon.fsomaphash: 100%|██████████| 777/777 [00:00<00:00, 1.13MiB/s]
data.lexicon.fsomapid: 100%|██████████| 30.3k/30.3k [00:00<00:00, 4.39MiB/s]
data.meta.idx: 100%|██████████| 89.3k/89.3k [00:00<00:00, 5.12MiB/s]
data.meta.zdata: 100%|██████████| 168k/168k [00:00<00:00, 6.77MiB/s]
data.properties: 4.10kiB [00:00, 2.02MiB/s]                
query-text.trec: 10.7kiB [00:00, 2.09MiB/s]                  


Downloading vaswani topics to /root/.pyterrier/corpora/vaswani/query-text.trec


qrels: 24.3kiB [00:00, 4.14MiB/s]                  

Downloading vaswani qrels to /root/.pyterrier/corpora/vaswani/qrels





## Multi-stage Retrieval

In this experiment, we re-rank the results obtained from a BM25 ranking by adding more features. We then pass these features to a regression technique, such as Random Forests, for re-ranking.

Conceptually, this pipeline has three stages:
1. BM25 ranking
2. Score the retrieved documents with each of the features ("TF_IDF" and "PL2")
3. Apply the Random Forest model



In [8]:
#this ranker will make the candidate set of documents for each query
BM25 = pt.BatchRetrieve(indexref, controls = {"wmodel": "BM25"})

#these rankers we will use to re-rank the BM25 results
TF_IDF =  pt.BatchRetrieve(indexref, controls = {"wmodel": "TF_IDF"})
PL2 =  pt.BatchRetrieve(indexref, controls = {"wmodel": "PL2"})

In [36]:
DPH = pt.BatchRetrieve(indexref, controls = {"wmodel": "DPH"})

OK, so how do we combine these?

In [37]:
pipe = BM25 >> (TF_IDF ** PL2)

Here, we are using two PyTerrier operators:
 - `>>` means "then", and takes the output documents of BM25 and puts them into the next stage. This means that TF_IDF and PL2 are ONLY applied on the documents that BM25 has identified.
 - `**` means feature-union - which makes each ranker into a feature in the `features` column of the results.
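To make the two operators concrete, here is a toy sketch in plain Python of how "then" and "feature-union" could behave. The classes `ToyTransformer`, `ToyScorer`, `ToyThen` and `ToyFeatureUnion` are hypothetical names invented for this illustration; this is not PyTerrier's implementation, and it works on lists of dicts rather than dataframes.

```python
# Toy illustration only: hypothetical classes, NOT PyTerrier's implementation.

class ToyTransformer:
    """Maps a list of result rows (dicts with a 'docno') to a new list of rows."""

    def transform(self, rows):
        raise NotImplementedError

    def __rshift__(self, other):
        # a >> b: feed the output of a into b ("then")
        return ToyThen(self, other)

    def __pow__(self, other):
        # a ** b: apply both to the same input and collect their scores as features
        return ToyFeatureUnion(self, other)


class ToyScorer(ToyTransformer):
    """Scores each row, sorts by descending score and optionally keeps the top_k rows."""

    def __init__(self, score_fn, top_k=None):
        self.score_fn = score_fn
        self.top_k = top_k

    def transform(self, rows):
        scored = [{**row, "score": self.score_fn(row["docno"])} for row in rows]
        scored.sort(key=lambda r: r["score"], reverse=True)
        return scored if self.top_k is None else scored[: self.top_k]


class ToyThen(ToyTransformer):
    def __init__(self, first, second):
        self.first, self.second = first, second

    def transform(self, rows):
        return self.second.transform(self.first.transform(rows))


class ToyFeatureUnion(ToyTransformer):
    def __init__(self, left, right):
        self.left, self.right = left, right

    @staticmethod
    def _features_by_doc(child, rows):
        out = child.transform(rows)
        if isinstance(child, ToyFeatureUnion):
            return {r["docno"]: r["features"] for r in out}
        return {r["docno"]: [r["score"]] for r in out}

    def transform(self, rows):
        left = self._features_by_doc(self.left, rows)
        right = self._features_by_doc(self.right, rows)
        # Keep the incoming rows (and their first-stage score), add a features list.
        return [{**row, "features": left[row["docno"]] + right[row["docno"]]}
                for row in rows]


lengths = {"d1": 3, "d2": 10, "d3": 7}           # made-up per-document statistic
first_stage = ToyScorer(lambda d: lengths[d], top_k=2)
feature_a = ToyScorer(lambda d: lengths[d] * 0.5)
feature_b = ToyScorer(lambda d: -lengths[d])

toy_pipe = first_stage >> (feature_a ** feature_b)
result = toy_pipe.transform([{"docno": d} for d in ["d1", "d2", "d3"]])
```

Only the two documents kept by the first stage reach the feature scorers, and each surviving row carries its first-stage `score` together with a two-element `features` list.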

Let's look at the output to see what it gives:

In [21]:
pipe.transform("chemical end:2")

Unnamed: 0,qid,docid,docno,rank,score_x,query,docid_x,rank_x,query_x,docid_y,rank_y,score_y,query_y,features
0,1,10702,10703,0,13.472012,chemical end:2,10702,0,chemical end:2,10702,0,13.472012,chemical end:2,"[7.38109017620895, 6.9992254918907575, 13.4720..."
1,1,1055,1056,1,12.517082,chemical end:2,1055,1,chemical end:2,1055,1,12.517082,chemical end:2,"[6.857899681644975, 6.358419229871986, 12.5170..."
2,1,4885,4886,2,12.228161,chemical end:2,4885,2,chemical end:2,4885,2,12.228161,chemical end:2,"[6.69960466053696, 6.181368165774688, 12.22816..."


See, we now have a "features" column with numbers representing the TF_IDF and PL2 feature scores.

*A note about efficiency*: retrieving documents and then re-ranking them again can be slow. For this reason, Terrier has a `FeaturesBatchRetrieve`, which computes the extra features in the same pass as the retrieval. Let's try this:

In [22]:
fbr = pt.FeaturesBatchRetrieve(indexref, controls = {"wmodel": "BM25"}, features=["WMODEL:TF_IDF", "WMODEL:PL2"]) 
# let's look at the top 2 results
(fbr %2).transform("chemical")

Unnamed: 0,qid,docid,rank,features,docno,score
0,1,10702,0,"[7.38109017620895, 6.9992254918907575]",10703,13.472012
1,1,1055,1,"[6.857899681644975, 6.358419229871986]",1056,12.517082


This kind of optimisation is common in PyTerrier, so PyTerrier supports automatic pipeline optimisation through the `.compile()` function.

In [41]:
pipe_fast = pipe.compile()
(pipe_fast %2).transform("chemical")

Applying 8 rules


AssertionError: require rank to be present in the result set

Finally, often we want our initial retrieval score to be a feature also. We can do this in one of two ways:
 - by adding a `SAMPLE` feature to FeaturesBatchRetrieve
 - or in the original feature-union definition, including an IdentityTransformer 

In [49]:
fbr = pt.FeaturesBatchRetrieve(indexref, controls = {"wmodel": "DPH"}, features=["SAMPLE", "WMODEL:TF_IDF", "WMODEL:PL2"]) 
pipe = DPH >> (pt.transformer.IdentityTransformer() ** TF_IDF ** PL2)

(pipe %2).transform("chemical")

Unnamed: 0,qid,docid,docno,rank,score_x,query,docid_x,rank_x,query_x,docid_y,rank_y,score_y,query_y,features
0,1,6278,6279,0,6.1784,chemical,6278,0,chemical,6278,0,6.1784,chemical,"[6.178399717192183, 6.12819661651192, 5.639397..."
1,1,2519,2520,1,4.80698,chemical,2519,1,chemical,2519,1,4.80698,chemical,"[4.806979944375463, 5.655311070061918, 5.11360..."


# Learning models and re-ranking

OK, let's move on to the actual machine learning. We can use standard Python ML techniques. We will demonstrate a few here, including from scikit-learn and XGBoost.

In each case, the pattern is the same:
 - Create a transformer that does the re-ranking
 - Call the fit() method on the created object with the training topics (and validation topics as necessary)
 - Evaluate the results with the Experiment function by using the test topics

Firstly, let's separate our topics into train/validation/test.

In [45]:
train_topics, valid_topics, test_topics = np.split(topics, [int(.6*len(topics)), int(.8*len(topics))])
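The split above cuts the topics at 60% and 80% of their length, giving a 60/20/20 partition in the original order (no shuffling). A minimal plain-Python sketch of the same arithmetic, using a hypothetical helper `split_60_20_20` and ten made-up topic ids:

```python
def split_60_20_20(items):
    """Split a sequence into the first 60%, the next 20% and the final 20%."""
    n = len(items)
    first_cut, second_cut = int(0.6 * n), int(0.8 * n)
    return items[:first_cut], items[first_cut:second_cut], items[second_cut:]


toy_topics = [f"q{i}" for i in range(1, 11)]      # made-up topic ids q1..q10
train_part, valid_part, test_part = split_60_20_20(toy_topics)
```

With ten topics, the cut points are 6 and 8, so the three parts hold six, two and two topics respectively.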

## scikit-learn RandomForestRegressor

Our first learning-to-rank model uses scikit-learn's [RandomForestRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html).

We use `pt.pipelines.LTR_pipeline`, a PyTerrier transformer that passes the document features as the "X" features to the Random Forest. To learn (fit) the model, we invoke the `fit()` method on the entire pipeline, specifying the queries (topics) and relevance assessments (qrels). The latter provide the "Y" labels for fitting the Random Forest.
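The mapping from ranked results and qrels to regression training data can be sketched in plain Python as follows. The helper `to_training_data` and the row layout are assumptions made for illustration; they are not the internals of `LTR_pipeline`. In particular, treating unjudged documents as non-relevant is an assumption of this sketch.

```python
def to_training_data(results, qrels):
    """Build (X, y) for a pointwise regressor.

    results: list of dicts with 'qid', 'docno' and 'features' (a list of floats)
    qrels:   dict mapping (qid, docno) to a relevance label
    """
    X = [row["features"] for row in results]
    # Assumption: a document with no judgement is treated as non-relevant (0).
    y = [qrels.get((row["qid"], row["docno"]), 0) for row in results]
    return X, y


toy_results = [
    {"qid": "q1", "docno": "d1", "features": [1.0, 2.0]},
    {"qid": "q1", "docno": "d2", "features": [0.5, 0.1]},
]
toy_qrels = {("q1", "d1"): 1}

X, y = to_training_data(toy_results, toy_qrels)
```

A regressor fitted on `X` and `y` then predicts a score per document, and re-ranking amounts to sorting each query's documents by that predicted score.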

NB: because each tree is trained on a bootstrap sample, Random Forests are comparatively robust to overfitting, and fitting one does not rely on a validation set, so we do not provide validation data to `fit()`.

Alternatively, we could use any regression learner from scikit-learn and adjust its parameters ourselves.

Finally, we run `Experiment()` on the test topics to compare performance.

In [51]:
from sklearn.ensemble import RandomForestRegressor

BaselineLTR = fbr >> pt.pipelines.LTR_pipeline(RandomForestRegressor(n_estimators=400))
BaselineLTR.fit(train_topics, qrels)

results = pt.pipelines.Experiment([BM25, PL2, BaselineLTR], test_topics, qrels, ["map"], names=["BM25 algorithm", "PL2 Baseline", "LTR Baseline"])
results

Unnamed: 0,name,map
0,BM25 algorithm,0.21768
1,PL2 Baseline,0.206031
2,LTR Baseline,0.12905
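For readers unfamiliar with the `map` measure reported above, the following is a simplified plain-Python sketch of mean average precision with binary relevance. The function names and toy data are made up for illustration, and evaluation tools may differ in details such as which queries are averaged over.

```python
def average_precision(ranked_docnos, relevant_docnos):
    """Average of precision@k at each rank k holding a relevant document,
    divided by the total number of relevant documents for the query."""
    relevant = set(relevant_docnos)
    if not relevant:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, docno in enumerate(ranked_docnos, start=1):
        if docno in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs, relevant_by_query):
    """Mean of the per-query average precision over the queries in `runs`."""
    scores = [average_precision(docs, relevant_by_query.get(qid, ()))
              for qid, docs in runs.items()]
    return sum(scores) / len(scores)


toy_runs = {"q1": ["d1", "d2", "d3", "d4"], "q2": ["d9", "d8"]}
toy_relevant = {"q1": {"d1", "d3", "d5"}, "q2": {"d8"}}

ap_q1 = average_precision(toy_runs["q1"], toy_relevant["q1"])
toy_map = mean_average_precision(toy_runs, toy_relevant)
```

For `q1`, relevant documents appear at ranks 1 and 3 and one relevant document is never retrieved, so the average precision is (1/1 + 2/3) / 3.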


In [55]:
results = pt.pipelines.Experiment([BM25, PL2, BaselineLTR], test_topics, qrels, ["map", "ndcg"], names=["BM25 algorithm", "PL2 Baseline", "LTR Baseline"])
results

Unnamed: 0,name,map,ndcg
0,BM25 algorithm,0.21768,0.551313
1,PL2 Baseline,0.206031,0.537795
2,LTR Baseline,0.12905,0.475305
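The `ndcg` column rewards placing relevant documents near the top of the ranking. A minimal sketch of one common formulation (linear gain, log2 rank discount, ideal ranking built from all judged documents) is shown below; the function names and data are made up, and the exact variant used by a given evaluation tool may differ.

```python
import math


def dcg(gains):
    """Discounted cumulative gain: each gain is divided by log2(rank + 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(ranked_docnos, judgements):
    """DCG of the ranking divided by the DCG of an ideal ordering of the judgements.

    judgements: dict mapping docno to a relevance label for one query
    """
    gains = [judgements.get(docno, 0) for docno in ranked_docnos]
    ideal_dcg = dcg(sorted(judgements.values(), reverse=True))
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


toy_judgements = {"d1": 1, "d3": 1, "d5": 1}
toy_ndcg = ndcg(["d1", "d2", "d3"], toy_judgements)
perfect_ndcg = ndcg(["d1", "d3", "d5"], toy_judgements)
```

A ranking that returns every relevant document first scores 1.0, while missing or late relevant documents lower the value.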


## XgBoost Pipeline

We now demonstrate the use of a LambdaMART implementation from [XGBoost](https://xgboost.readthedocs.io/en/latest/). Again, PyTerrier provides a transformer object, namely `XGBoostLTR_pipeline`, whose constructor takes the actual XGBoost model that you want to train. We took the XGBoost configuration from [their example code](https://github.com/dmlc/xgboost/blob/master/demo/rank/rank.py).

Call the `fit()` method on the full pipeline with the training and validation topics.

Evaluate the results with the `Experiment` function using the test topics.
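Unlike a pointwise regressor, a ranking objective such as LambdaMART compares documents within the same query, so the learner needs to know which training rows belong together. XGBoost's ranker accepts this as a list of per-query group sizes. How `XGBoostLTR_pipeline` prepares this internally is not shown in this notebook; the hypothetical helper below only illustrates the idea on made-up query ids.

```python
from itertools import groupby


def group_sizes(qids):
    """Count consecutive rows per query.

    Assumes the rows are already ordered so that each query's rows are contiguous.
    """
    return [len(list(rows)) for _, rows in groupby(qids)]


toy_qids = ["q1", "q1", "q1", "q2", "q3", "q3"]
sizes = group_sizes(toy_qids)
```

The sizes sum to the number of training rows, and their order matches the order in which the queries appear in the feature matrix.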

In [18]:
import xgboost as xgb
params = {'objective': 'rank:ndcg', 
          'learning_rate': 0.1, 
          'gamma': 1.0, 'min_child_weight': 0.1,
          'max_depth': 6,
          'verbose': 2,
          'random_state': 42 
         }

BaseLTR_LM = fbr >> pt.pipelines.XGBoostLTR_pipeline(xgb.sklearn.XGBRanker(**params))
BaseLTR_LM.fit(train_topics, qrels, valid_topics, qrels)

Parameters: { verbose } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.




And evaluate the results.

In [19]:
allresultsLM = pt.pipelines.Experiment([PL2, BaseLTR_LM],
                                test_topics,                                  
                                qrels, ["map"], 
                                names=["PL2 Baseline", "LambdaMART"])
allresultsLM

Unnamed: 0,name,map
0,PL2 Baseline,0.206031
1,LambdaMART,0.211269

