# Introduction to PyTerrier

_IN4325: Information retrieval lecture, TU Delft_

**Part 5: Transformers**

This notebook introduces PyTerrier _transformers_ (not to be confused with [neural transformer models](<https://en.wikipedia.org/wiki/Transformer_(deep_learning_architecture)>)). We'll learn about the different types of data frames that PyTerrier uses and how the transformers operate on them.

To run everything in this notebook, you'll need PyTerrier and [pyspellchecker](https://github.com/barrust/pyspellchecker) installed:


In [1]:
pip install python-terrier==0.10.0 pyspellchecker

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


Because we're going to use RM3 in this notebook, we'll need to load the `terrier-prf` plugin:


In [2]:
import pyterrier as pt

if not pt.started():
    pt.init(
        tqdm="notebook", boot_packages=["com.github.terrierteam:terrier-prf:-SNAPSHOT"]
    )

PyTerrier 0.10.0 has loaded Terrier 5.9 (built by craigm on 2024-05-02 17:40) and terrier-helper 0.0.8

No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.


In the following, we'll illustrate the different kinds of transformers and data frames using examples. Note that we're only scratching the surface here, so **make sure to have a look at the [documentation](https://pyterrier.readthedocs.io/)**!

We'll use the `nfcorpus` dataset:


In [3]:
dataset = pt.get_dataset("irds:nfcorpus/test")

For this task we'll need an index with blocks (i.e., positional information), so we need to create a new one. Since memory indexes do not support blocks at the moment, we'll create one on disk:


In [4]:
from pathlib import Path

idx_path = Path("nfcorpus_index_with_blocks").absolute()

We index the corpus with `blocks=True`:


In [5]:
pt.index.IterDictIndexer(
    str(idx_path),
    blocks=True,
).index(dataset.get_corpus_iter(), fields=["title", "abstract", "url"])

nfcorpus/test documents:   0%|          | 0/5371 [00:00<?, ?it/s]

<org.terrier.querying.IndexRef at 0x7f864cea3c40 jclass=org/terrier/querying/IndexRef jself=<LocalRef obj=0x5645b0104748 at 0x7f864c5ac950>>

## Data format

Recall that the queries (topics) of a dataset can be accessed using the `get_topics` method. For this dataset, there are multiple variants; we choose `title`:


In [6]:
queries = dataset.get_topics(variant="title")
queries

Unnamed: 0,qid,query
0,PLAIN-1008,deafness
1,PLAIN-1018,dha
2,PLAIN-102,stopping heart disease in childhood
3,PLAIN-1028,dietary scoring
4,PLAIN-1039,domoic acid
...,...,...
320,PLAIN-956,cooking methods
321,PLAIN-966,cortisol
322,PLAIN-977,crib death
323,PLAIN-987,cumin


In general, PyTerrier represents all data as `pandas.DataFrame` objects.

The method above outputs a data frame with two columns, `qid` and `query`. In PyTerrier, data frames of this format are referred to as _data type_ `Q`, and they essentially represent a set of queries, each of which has a unique identifier. In fact, we have already constructed our own `Q` data frames in the scaffolding project.

There are some other data types, and we will introduce them throughout the rest of this series. You can find an overview [here](https://pyterrier.readthedocs.io/en/latest/datamodel.html).
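
As a minimal sketch of what a `Q` data frame looks like, we can also build one by hand with pandas. The identifiers and query texts below are made up for illustration; the only assumption is that a `Q` frame needs the two columns `qid` and `query`, as described above.

```python
import pandas as pd

# Hypothetical queries, written by hand purely for illustration.
my_queries = pd.DataFrame(
    [
        {"qid": "q1", "query": "vitamin d deficiency"},
        {"qid": "q2", "query": "green tea"},
    ]
)

# A frame can serve as type Q when it has (at least) these two columns
# and every query has its own identifier.
assert {"qid", "query"} <= set(my_queries.columns)
assert my_queries["qid"].is_unique
print(my_queries)
```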

## Transformers

_Transformers_ directly operate on these data frames; in other words, a transformer takes as input a data frame of some type and outputs another data frame (of the same or another type). We'll take a look at several pre-implemented transformers in this notebook.

### Retrieval transformers

Retrievers are the most common transformers, and we have already used them plenty throughout this introductory series. For example, let's take a BM25 model as before:


In [7]:
index = pt.IndexFactory.of(str(idx_path))
bm25 = pt.BatchRetrieve(index, wmodel="BM25")

This transformer consumes data of type `Q` and returns data of type `R` (i.e., columns `qid`, `docno`, `score`, `rank`), which corresponds to a ranking. The transformation can be invoked by calling the `transform` method:


In [8]:
bm25.transform(queries)



Unnamed: 0,qid,docid,docno,rank,score,query
0,PLAIN-1008,4667,MED-4668,0,13.495859,deafness
1,PLAIN-1018,5094,MED-5095,0,14.204196,dha
2,PLAIN-1018,928,MED-929,1,14.033904,dha
3,PLAIN-1018,4622,MED-4623,2,14.033904,dha
4,PLAIN-1018,4930,MED-4931,3,14.033904,dha
...,...,...,...,...,...,...
143708,PLAIN-987,1292,MED-1293,2,9.748835,cumin
143709,PLAIN-987,230,MED-231,3,8.698382,cumin
143710,PLAIN-987,232,MED-233,4,8.698382,cumin
143711,PLAIN-987,2827,MED-2828,5,8.698382,cumin


Note that our result is actually a superset of `R` (we have an additional column, `query`). In general, data frames may have more columns than a specific transformer requires, but they can still be used.

Alternatively, the transformer can be called directly, as we have done so far, which gives the same result:


In [9]:
bm25(queries)

Unnamed: 0,qid,docid,docno,rank,score,query
0,PLAIN-1008,4667,MED-4668,0,13.495859,deafness
1,PLAIN-1018,5094,MED-5095,0,14.204196,dha
2,PLAIN-1018,928,MED-929,1,14.033904,dha
3,PLAIN-1018,4622,MED-4623,2,14.033904,dha
4,PLAIN-1018,4930,MED-4931,3,14.033904,dha
...,...,...,...,...,...,...
143708,PLAIN-987,1292,MED-1293,2,9.748835,cumin
143709,PLAIN-987,230,MED-231,3,8.698382,cumin
143710,PLAIN-987,232,MED-233,4,8.698382,cumin
143711,PLAIN-987,2827,MED-2828,5,8.698382,cumin


Furthermore, transformers implement the `search` method, which processes a single query:


In [10]:
bm25.search("what is the meaning of life")

Unnamed: 0,qid,docid,docno,rank,score,query
0,1,2302,MED-2303,0,9.443163,what is the meaning of life
1,1,2505,MED-2506,1,9.296906,what is the meaning of life
2,1,257,MED-258,2,9.250432,what is the meaning of life
3,1,2901,MED-2902,3,9.250432,what is the meaning of life
4,1,3002,MED-3003,4,9.250432,what is the meaning of life
...,...,...,...,...,...,...
995,1,3776,MED-3777,995,2.229604,what is the meaning of life
996,1,3795,MED-3796,996,2.229604,what is the meaning of life
997,1,4706,MED-4707,997,2.229604,what is the meaning of life
998,1,4785,MED-4786,998,2.229604,what is the meaning of life
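
The remark that a result may be a superset of `R` can be read as a subset check on column names. The following sketch uses plain Python sets rather than PyTerrier itself; `REQUIRED_COLUMNS` and `conforms_to` are hypothetical names, and the column sets are the ones listed in the text above.

```python
# Column sets for the two data types introduced so far (taken from the text above).
REQUIRED_COLUMNS = {
    "Q": {"qid", "query"},
    "R": {"qid", "docno", "score", "rank"},
}


def conforms_to(columns, data_type):
    """Return True if `columns` contains every column the data type requires."""
    return REQUIRED_COLUMNS[data_type] <= set(columns)


# Columns of the BM25 output shown above.
bm25_columns = ["qid", "docid", "docno", "rank", "score", "query"]

print(conforms_to(bm25_columns, "R"))  # True: all ranking columns are present
print(conforms_to(bm25_columns, "Q"))  # True: `qid` and `query` are present too
print(conforms_to(["qid", "query"], "R"))  # False: no documents or scores
```

This is why the output of a retriever can be passed on to a transformer that only expects queries.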


### Query rewriting transformers

You have already experimented with query rewriting in the scaffolding project. PyTerrier implements several transformers that rewrite queries.

The simplest one is the _sequential dependence model_:


In [11]:
sdm = pt.rewrite.SequentialDependence()

SDM requires positional information in the index (that's why we set `blocks=True` during indexing). More information about SDM can be found [here](https://pyterrier.readthedocs.io/en/latest/rewrite.html#sequentialdependence).

It operates solely on the queries themselves; in other words, both input and output are data frames of type `Q`:


In [12]:
sdm(queries)

Unnamed: 0,qid,query,query_0
0,PLAIN-1008,deafness,deafness
1,PLAIN-1018,dha,dha
2,PLAIN-102,stopping heart disease childhood #combine:0=0....,stopping heart disease in childhood
3,PLAIN-1028,dietary scoring #combine:0=0.1:wmodel=org.terr...,dietary scoring
4,PLAIN-1039,domoic acid #combine:0=0.1:wmodel=org.terrier....,domoic acid
...,...,...,...
320,PLAIN-956,cooking methods #combine:0=0.1:wmodel=org.terr...,cooking methods
321,PLAIN-966,cortisol,cortisol
322,PLAIN-977,crib death #combine:0=0.1:wmodel=org.terrier.m...,crib death
323,PLAIN-987,cumin,cumin


In this case, the `query` column contains the new (rewritten) queries, while the original queries are retained in the `query_0` column.
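
Notice that single-term queries such as `deafness` and `dha` are unchanged in the output above. A plausible reading is that SDM adds proximity evidence for neighbouring query terms, and a one-term query has no neighbours. The sketch below only enumerates those neighbouring pairs; `adjacent_pairs` is a hypothetical helper, and it does not reproduce the query syntax that Terrier generates.

```python
def adjacent_pairs(query):
    """Return the pairs of neighbouring terms that a proximity model could score."""
    terms = query.split()
    return list(zip(terms, terms[1:]))


print(adjacent_pairs("stopping heart disease childhood"))
# [('stopping', 'heart'), ('heart', 'disease'), ('disease', 'childhood')]
print(adjacent_pairs("deafness"))
# []
```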

#### Query expansion

_Query expansion_ differs from standard query rewriting in that it operates on queries **and** the top-ranked documents retrieved for them, which are assumed to be relevant (these need to be retrieved based on the original queries prior to the expansion). This technique is known as _pseudo relevance feedback_ (PRF). A popular PRF model is _RM3_:


In [13]:
rm3 = pt.rewrite.RM3(index)

Since RM3 requires a set of documents for each query, its input type needs to be `R`. Consequently, we can use the result of our retriever as an input for the PRF model:


In [14]:
rm3(bm25(queries))

Unnamed: 0,qid,query_0,query
0,PLAIN-1008,deafness,applypipeline:off brain^0.027586209 common^0.0...
1,PLAIN-1018,dha,applypipeline:off cook^0.020857040 salmon^0.02...
2,PLAIN-102,stopping heart disease in childhood,applypipeline:off lifestyl^0.053418938 stop^0....
3,PLAIN-1028,dietary scoring,applypipeline:off dietari^0.300000012 yr^0.035...
4,PLAIN-1039,domoic acid,applypipeline:off domoic^0.383134842 1987^0.01...
...,...,...,...
304,PLAIN-946,coma,applypipeline:off domoic^0.041973736 melon^0.0...
305,PLAIN-956,cooking methods,applypipeline:off solubl^0.021026421 loss^0.03...
306,PLAIN-966,cortisol,applypipeline:off symptom^0.030274920 faecal^0...
307,PLAIN-977,crib death,applypipeline:off death^0.441816151 million^0....
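
To get an intuition for the weighted terms in the output above, here is a heavily simplified sketch of relevance-model-style expansion. It is not Terrier's implementation: it weights all feedback documents equally (RM3 proper weights them by how well they match the query), and the function name, parameters, and toy documents are all assumptions made for illustration.

```python
from collections import Counter


def expand_query(query_terms, feedback_docs, num_terms=3, original_weight=0.5):
    """Interpolate a uniform query model with a term model from feedback documents."""
    # Feedback model: average of the per-document term distributions.
    feedback_model = Counter()
    for doc in feedback_docs:
        counts = Counter(doc)
        total = sum(counts.values())
        for term, count in counts.items():
            feedback_model[term] += count / total / len(feedback_docs)

    # Keep only the strongest expansion terms and renormalise them.
    top_terms = dict(feedback_model.most_common(num_terms))
    norm = sum(top_terms.values())

    # Mix the original query terms with the expansion terms.
    weights = Counter()
    for term in query_terms:
        weights[term] += original_weight / len(query_terms)
    for term, prob in top_terms.items():
        weights[term] += (1 - original_weight) * prob / norm
    return dict(weights)


# Toy feedback documents, already tokenised.
docs = [
    ["crib", "death", "infant", "sleep"],
    ["infant", "death", "risk", "sleep"],
]
weights = expand_query(["crib", "death"], docs)
print(" ".join(f"{term}^{weight:.3f}" for term, weight in weights.items()))
```

Terms that occur both in the query and in the feedback documents (here `death`) end up with the largest weight, which matches the pattern visible for `crib death` in the output above.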


## Pipelines

You probably noticed that the transformers we've seen so far are mostly designed to work in sequence; for example, reformulating queries alone is pointless without an actual retrieval step afterwards.

This is where _pipelines_ come into play. PyTerrier implements the `>>` operator to build sequences of transformers. Let's build a simple pipeline that applies SDM and then retrieves documents using BM25:


In [15]:
pl_sdm = sdm >> bm25

We can now use this pipeline like any other transformer:


In [16]:
pl_sdm(queries)

Unnamed: 0,qid,docid,docno,rank,score,query,query_0
0,PLAIN-1008,4667,MED-4668,0,13.495859,deafness,deafness
1,PLAIN-1018,5094,MED-5095,0,14.204196,dha,dha
2,PLAIN-1018,928,MED-929,1,14.033904,dha,dha
3,PLAIN-1018,4622,MED-4623,2,14.033904,dha,dha
4,PLAIN-1018,4930,MED-4931,3,14.033904,dha,dha
...,...,...,...,...,...,...,...
143708,PLAIN-987,1292,MED-1293,2,9.748835,cumin,cumin
143709,PLAIN-987,230,MED-231,3,8.698382,cumin,cumin
143710,PLAIN-987,232,MED-233,4,8.698382,cumin,cumin
143711,PLAIN-987,2827,MED-2828,5,8.698382,cumin,cumin
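
If you're wondering how `>>` can work on Python objects at all: Python lets a class define the operator through the `__rshift__` method. The following toy sketch shows the idea using lists of dictionaries instead of data frames. `Step` is a hypothetical stand-in, not PyTerrier's actual class.

```python
class Step:
    """Toy stand-in for a transformer: wraps a function from rows to rows."""

    def __init__(self, fn):
        self.fn = fn

    def __call__(self, rows):
        return self.fn(rows)

    def __rshift__(self, other):
        # `a >> b` builds a new step that feeds the output of `a` into `b`.
        return Step(lambda rows: other(self(rows)))


lowercase = Step(lambda rows: [{**r, "query": r["query"].lower()} for r in rows])
add_length = Step(
    lambda rows: [{**r, "num_terms": len(r["query"].split())} for r in rows]
)

pipeline = lowercase >> add_length
print(pipeline([{"qid": "q1", "query": "Crib Death"}]))
# [{'qid': 'q1', 'query': 'crib death', 'num_terms': 2}]
```

Because the result of `>>` is itself a `Step`, pipelines can be chained further or nested inside other pipelines.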


Let's compare SDM and RM3 in terms of retrieval effectiveness.

First, we create a pipeline for RM3:


In [17]:
pl_rm3 = bm25 >> rm3 >> bm25

Now we can run an experiment to evaluate and compare both of these pipelines. We'll also include standalone BM25:


In [18]:
from pyterrier.measures import MAP, nDCG

pt.Experiment(
    [bm25, pl_sdm, pl_rm3],
    queries,
    dataset.get_qrels(),
    names=["BM25", "SDM >> BM25", "BM25 >> RM3 >> BM25"],
    eval_metrics=[MAP, nDCG @ 10],
)

Unnamed: 0,name,AP,nDCG@10
0,BM25,0.113101,0.265752
1,SDM >> BM25,0.112938,0.265869
2,BM25 >> RM3 >> BM25,0.129125,0.273099
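
As a reminder of what the `AP` column measures, here is a small sketch of average precision for a single query, treating relevance as binary. The experiment above reports the mean of this value over all queries. The function and the toy ranking are written for illustration only; this is not the code PyTerrier uses internally.

```python
def average_precision(ranked_docnos, relevant):
    """Average of precision@k over the ranks k at which a relevant document appears."""
    hits = 0
    precision_sum = 0.0
    for k, docno in enumerate(ranked_docnos, start=1):
        if docno in relevant:
            hits += 1
            precision_sum += hits / k
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant) if relevant else 0.0


ranking = ["d3", "d1", "d7", "d2"]
relevant = {"d3", "d2", "d9"}
print(average_precision(ranking, relevant))  # (1/1 + 2/4) / 3 = 0.5
```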


### Operators

There are a number of _operators_ that can be applied to transformers within pipelines. We've already seen the `>>` operator. Here, we'll look at a few more selected operators. You can find the complete list [here](https://pyterrier.readthedocs.io/en/latest/operators.html).

#### Rank cutoff

The `%` operator limits how many documents per query are kept (the lowest-scoring ones are removed). For example, we may want to use only the top-ranked document as feedback for RM3:


In [19]:
pl_rm3_1doc = (bm25 % 1) >> rm3 >> bm25

pt.Experiment(
    [pl_rm3, pl_rm3_1doc],
    queries,
    dataset.get_qrels(),
    names=["RM3", "RM3 (1 document)"],
    eval_metrics=[MAP, nDCG @ 10],
)

Unnamed: 0,name,AP,nDCG@10
0,RM3,0.129125,0.273099
1,RM3 (1 document),0.128236,0.279708
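
The effect of a rank cutoff can be sketched in plain Python as "keep the `k` best rows for each query". The helper below sorts by score; it is a hypothetical illustration, and PyTerrier's own operator may rely on the `rank` column instead.

```python
from collections import defaultdict


def rank_cutoff(rows, k):
    """Keep the k highest-scoring rows for each qid."""
    by_query = defaultdict(list)
    for row in rows:
        by_query[row["qid"]].append(row)

    kept = []
    for query_rows in by_query.values():
        query_rows.sort(key=lambda r: r["score"], reverse=True)
        kept.extend(query_rows[:k])
    return kept


results = [
    {"qid": "q1", "docno": "d1", "score": 3.0},
    {"qid": "q1", "docno": "d2", "score": 5.0},
    {"qid": "q1", "docno": "d3", "score": 1.0},
    {"qid": "q2", "docno": "d4", "score": 2.0},
]
print([row["docno"] for row in rank_cutoff(results, 2)])  # ['d2', 'd1', 'd4']
```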


#### Caching

The `~` operator can be used to automatically cache the output of a transformer. Let's time our BM25 retriever without caching first:


In [20]:
%timeit bm25(queries)

2.36 s ± 124 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Now we enable caching. This should make it much faster:


In [21]:
%timeit (~bm25)(queries)

142 ms ± 6.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


**Important**: When you use caching, make sure to clear the cache when you make changes to the transformers you cached. Otherwise, you might get unexpected results. The default location of the cache is `~/.pyterrier/transformer_cache/` (on Linux and macOS systems).
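
The following toy sketch shows why a stale cache is a problem. It keeps results in memory and uses the input as the key, so it cannot notice when the wrapped function changes. `Cached` and `slow_retrieve` are hypothetical names; PyTerrier's cache lives on disk and its details may differ.

```python
class Cached:
    """Toy cache: remembers the output for each distinct input it has seen."""

    def __init__(self, fn):
        self.fn = fn
        self.store = {}
        self.calls = 0  # counts how often the wrapped function actually ran

    def __call__(self, query):
        if query not in self.store:
            self.calls += 1
            self.store[query] = self.fn(query)
        return self.store[query]


def slow_retrieve(query):
    return [f"doc-for-{term}" for term in query.split()]


cached = Cached(slow_retrieve)
cached("crib death")
cached("crib death")  # served from the cache
print(cached.calls)  # 1

# "Change" the underlying transformer: the cache still returns the old result.
cached.fn = lambda query: []
print(cached("crib death"))  # ['doc-for-crib', 'doc-for-death']
```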


#### Combining rankings

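The `+` and `*` operators can be used to linearly combine the scores of transformers that output rankings (data type `R`): `+` adds the scores of two transformers, and `*` scales the scores of a transformer by a constant factor. For example, we can use two different retrievers and combine them as follows: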


In [22]:
tf_idf = pt.BatchRetrieve(index, wmodel="TF_IDF")

pt.Experiment(
    [tf_idf, bm25, 2 * tf_idf + bm25],
    queries,
    dataset.get_qrels(),
    names=["TF-IDF", "BM25", "2 * TF-IDF + BM25"],
    eval_metrics=[MAP, nDCG @ 10],
)

Unnamed: 0,name,AP,nDCG@10
0,TF-IDF,0.112686,0.26495
1,BM25,0.113101,0.265752
2,2 * TF-IDF + BM25,0.112853,0.265925


Note that the operations are applied to the scores computed by the retrievers. If a document is missing for one of the retrievers, a score of `0` is used.
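
To make the handling of missing documents concrete, here is a small sketch of a weighted score combination for a single query. The scores are made up, and `combine` is a hypothetical helper rather than a PyTerrier function.

```python
def combine(scores_a, scores_b, weight_a=1.0, weight_b=1.0):
    """Weighted sum of two score dictionaries; a missing document scores 0."""
    docnos = set(scores_a) | set(scores_b)
    combined = {
        docno: weight_a * scores_a.get(docno, 0.0)
        + weight_b * scores_b.get(docno, 0.0)
        for docno in docnos
    }
    # Sort by combined score to obtain the new ranking.
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


tf_idf_scores = {"d1": 4.0, "d2": 1.0}
bm25_scores = {"d2": 5.0, "d3": 2.0}

# Corresponds to `2 * tf_idf + bm25` for one query.
print(combine(tf_idf_scores, bm25_scores, weight_a=2.0))
# [('d1', 8.0), ('d2', 7.0), ('d3', 2.0)]
```

Note that `d1` ends up first even though only one retriever returned it, so the scale of the individual scores matters when choosing the weights.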

### Compiling pipelines

Pipelines can be compiled using the `compile` method, which applies a set of rewriting rules to the pipeline. The compilation may (or may not) improve the efficiency for certain operations. For example, consider the following:


In [23]:
pl = bm25 % 3
pl_compiled = pl.compile()

Applying 8 rules


Let's time them both:


In [24]:
%timeit pl(queries)

2.23 s ± 49.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [25]:
%timeit pl_compiled(queries)

2.32 s ± 97.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In this run, compilation makes no measurable difference: the two timings are almost identical.


## Custom transformers

PyTerrier makes it easy for you to implement your own custom transformers. In fact, we've used a custom query transformer under the hood for the scaffolding project.

### `apply` functions

[`pyterrier.apply`](https://pyterrier.readthedocs.io/en/latest/apply.html) allows for applying a custom function to each row of a data frame. There are many `apply` functions, each of which focuses on different data types. An overview can be found [here](https://pyterrier.readthedocs.io/en/latest/apply.html#module-pyterrier.apply).

Let's implement one that reformulates the query to sound a bit nicer:


In [26]:
ask_nicely = pt.apply.query(
    lambda row: "please find some information about " + row["query"]
)
ask_nicely(queries)

Unnamed: 0,qid,query_0,query
0,PLAIN-1008,deafness,please find some information about deafness
1,PLAIN-1018,dha,please find some information about dha
2,PLAIN-102,stopping heart disease in childhood,please find some information about stopping he...
3,PLAIN-1028,dietary scoring,please find some information about dietary sco...
4,PLAIN-1039,domoic acid,please find some information about domoic acid
...,...,...,...
320,PLAIN-956,cooking methods,please find some information about cooking met...
321,PLAIN-966,cortisol,please find some information about cortisol
322,PLAIN-977,crib death,please find some information about crib death
323,PLAIN-987,cumin,please find some information about cumin
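
Conceptually, a row-wise query rewriter does two things: it computes a new query from each row and keeps the original in `query_0`. The sketch below mimics this on a list of dictionaries. `apply_to_queries` is a hypothetical helper and not how `pt.apply.query` is implemented.

```python
def apply_to_queries(rows, fn):
    """Rewrite each row's query with `fn`, keeping the original in `query_0`."""
    return [{**row, "query_0": row["query"], "query": fn(row)} for row in rows]


rows = [{"qid": "PLAIN-987", "query": "cumin"}]
rewritten = apply_to_queries(
    rows, lambda row: "please find some information about " + row["query"]
)
print(rewritten[0]["query"])  # please find some information about cumin
print(rewritten[0]["query_0"])  # cumin
```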


### Extending `pyterrier.Transformer`

More complex transformers can be implemented by extending the base class directly.

Here, we implement a transformer that naively corrects supposed spelling mistakes using a spell checking library. In order to do this, we only need to implement the `transform` method. We adopt the behavior of the other query rewriters and retain the original formulation in the `query_0` column:


In [27]:
import pandas as pd
from spellchecker import SpellChecker


class CorrectQuerySpelling(pt.Transformer):
    """Query rewriter that replaces words unknown to the spell checker."""

    def __init__(self):
        self.spellchecker = SpellChecker()
        super().__init__()

    def _correct_spelling(self, query: str) -> str:
        result = []
        for word in query.split(" "):
            if len(self.spellchecker.unknown([word])) > 0:
                # Fall back to the original word if no correction is found.
                result.append(self.spellchecker.correction(word) or word)
            else:
                result.append(word)
        return " ".join(result)

    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        # Work on a copy so that the input frame is left untouched.
        df_new = df.copy()
        # Keep the original queries in `query_0`, like the built-in rewriters.
        df_new["query_0"] = df_new["query"]
        df_new["query"] = df_new["query_0"].map(self._correct_spelling)
        return df_new

Let's give it a try:


In [28]:
correct_query_spelling = CorrectQuerySpelling()
correct_query_spelling(queries)

Unnamed: 0,qid,query,query_0
0,PLAIN-1008,deafness,deafness
1,PLAIN-1018,ha,dha
2,PLAIN-102,stopping heart disease in childhood,stopping heart disease in childhood
3,PLAIN-1028,dietary scoring,dietary scoring
4,PLAIN-1039,comic acid,domoic acid
...,...,...,...
320,PLAIN-956,cooking methods,cooking methods
321,PLAIN-966,cortisol,cortisol
322,PLAIN-977,crib death,crib death
323,PLAIN-987,cumin,cumin
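
The output above also shows the downside of this naive approach: valid domain terms are "corrected" (`dha` becomes `ha`, and `domoic acid` becomes `comic acid`). One possible remedy, sketched here under our own assumptions, is to leave terms untouched if they appear in a protected vocabulary, for example terms that occur in the indexed corpus. The toy corrector below stands in for a real spell checker, and all names are hypothetical.

```python
def correct_query(query, correct_word, protected):
    """Correct each term unless it is in the protected vocabulary."""
    return " ".join(
        term if term in protected else correct_word(term) for term in query.split()
    )


# Hypothetical stand-in for a spell checker, reproducing the behavior seen above.
toy_corrections = {"dha": "ha", "domoic": "comic", "deafnes": "deafness"}


def toy_correct(term):
    return toy_corrections.get(term, term)


# Hypothetical set of terms we know to be valid in this domain.
corpus_terms = {"dha", "domoic", "acid"}

print(correct_query("domoic acid", toy_correct, corpus_terms))  # domoic acid
print(correct_query("deafnes", toy_correct, corpus_terms))  # deafness
```

The same check could be added to `_correct_spelling` in the transformer above.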


## Further reading

Check out the sections about the [data model](https://pyterrier.readthedocs.io/en/latest/datamodel.html) and [transformers](https://pyterrier.readthedocs.io/en/latest/transformer.html) in the PyTerrier documentation.
