# Introduction to PyTerrier

_DSAIT4050: Information retrieval lecture, TU Delft_

**Part 3: Datasets**

This notebook focuses on IR datasets and pre-made indexes that can be loaded automatically in PyTerrier.


In [None]:
%pip install python-terrier==0.12.1

In [1]:
import pyterrier as pt

## Importing datasets

PyTerrier provides access to a large number of datasets that can be loaded directly. This is convenient because parsing is already taken care of and any required files are downloaded automatically.

A list of available datasets can be found [here](https://pyterrier.readthedocs.io/en/latest/datasets.html#available-datasets) or by calling the following function:


In [2]:
pt.datasets.list_datasets()

| | dataset | topics | topics_lang | qrels | corpus | corpus_lang | index | info_url |
|---|---|---|---|---|---|---|---|---|
| 0 | 50pct | [training, validation] | en | [training, validation] | | | [ex2, ex3] | |
| 1 | antique | [train, test] | en | [train, test] | True | en | | https://ciir.cs.umass.edu/downloads/Antique/re... |
| 2 | vaswani | True | en | True | True | en | True | http://ir.dcs.gla.ac.uk/resources/test_collect... |
| 3 | msmarco_document | [train, dev, test, test-2020, leaderboard-2020] | en | [train, dev, test, test-2020] | True | en | True | https://microsoft.github.io/msmarco/ |
| 4 | msmarcov2_document | [train, dev1, dev2, valid1, valid2, trec_2021] | en | [train, dev1, dev2, valid1, valid2] | | | True | https://microsoft.github.io/msmarco/TREC-Deep-... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 748 | irds:neuclir | | | | | | | https://ir-datasets.com/neuclir.html |
| 749 | irds:neuclir/1 | | | | | | | https://ir-datasets.com/neuclir.html#neuclir/1 |
| 764 | irds:sara | True | en | True | True | en | | https://ir-datasets.com/sara.html |
| 765 | trec-deep-learning-docs | [train, dev, test, test-2020, leaderboard-2020] | en | [train, dev, test, test-2020] | True | en | True | https://microsoft.github.io/msmarco/ |


A dataset can provide the following components:

- corpus (the documents),
- index (pre-made, ready to use),
- topics (queries or topic descriptions, grouped in folds or splits),
- qrels (query relevance judgments, which we'll use for evaluation in an upcoming notebook).
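To make these components more concrete, here is a minimal sketch of what they look like as data. The field names `docno`, `text`, `qid`, `query`, and `label` follow PyTerrier's data model. The rows themselves are invented for illustration, and `relevant_docs` is a hypothetical helper. In PyTerrier, topics and qrels are `pandas.DataFrame` objects; plain lists of dictionaries are used here only to keep the sketch free of dependencies.

```python
# Hypothetical toy data illustrating the shape of each component.
corpus = [
    {"docno": "d1", "text": "compact memories have flexible capacities"},
    {"docno": "d2", "text": "an electronic analogue computer for solving equations"},
]
topics = [{"qid": "1", "query": "analogue computer"}]
qrels = [{"qid": "1", "docno": "d2", "label": 1}]


def relevant_docs(qid, qrels):
    """Return the set of docnos judged relevant (label > 0) for a query."""
    return {
        row["docno"]
        for row in qrels
        if row["qid"] == qid and row["label"] > 0
    }


print(relevant_docs("1", qrels))
```

The qrels link a query (`qid`) to a document (`docno`), which is why both identifiers need to match across the components.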

Note that, for many datasets, some of these components are missing. Furthermore, the prefix `irds:` denotes that the corresponding dataset is loaded from the [`ir_datasets`](https://ir-datasets.com/) library, which seamlessly integrates with PyTerrier.
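The naming convention can be sketched in a few lines. The helper `split_source` below is hypothetical and not part of PyTerrier; it only illustrates how the `irds:` prefix separates the two kinds of identifiers, using names from the listing above. The labels `"ir_datasets"` and `"pyterrier"` are just strings chosen for this example.

```python
IRDS_PREFIX = "irds:"


def split_source(name):
    """Split a dataset identifier into (source, dataset_id).

    Identifiers with the `irds:` prefix refer to ir_datasets;
    everything else is treated as a PyTerrier built-in dataset.
    """
    if name.startswith(IRDS_PREFIX):
        return "ir_datasets", name[len(IRDS_PREFIX):]
    return "pyterrier", name


names = ["vaswani", "antique", "irds:neuclir/1", "irds:sara"]
for name in names:
    print(name, "->", split_source(name))
```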

Let's start by loading the `vaswani` dataset:


In [3]:
dataset = pt.get_dataset("vaswani")

For this dataset, pre-made indexes are available. To load one, we need to select a _variant_. The variants differ slightly, for example in their pre-processing. An overview of the indexes and variants can be found in the [Terrier data repository](http://data.terrier.org/).

We'll use the standard variant, `terrier_stemmed`, to create a BM25 model:


In [4]:
index = dataset.get_index(variant="terrier_stemmed")
bm25 = pt.terrier.Retriever(index, wmodel="BM25")
bm25.search("computer")

ValueError: Could not find index variant terrier_stemmed for dataset vaswani at http://data.terrier.org/indices/vaswani/terrier_stemmed/latest/files. See available variants at http://data.terrier.org/vaswani.dataset.html
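As the output above shows, requesting a variant that cannot be found raises a `ValueError`. One way to handle this is to try several candidate variants in order. The following is a minimal sketch, not part of PyTerrier: `first_available_index` and `FakeDataset` are hypothetical, and the variant names are placeholders. The helper only assumes an object with a `get_index(variant=...)` method that raises `ValueError` on failure, so the stand-in class lets us run it offline.

```python
def first_available_index(dataset, variants):
    """Return (variant, index) for the first variant that loads.

    `dataset` only needs a `get_index(variant=...)` method that raises
    ValueError when the variant cannot be found.
    """
    errors = []
    for variant in variants:
        try:
            return variant, dataset.get_index(variant=variant)
        except ValueError as exc:
            errors.append(f"{variant}: {exc}")
    raise ValueError("no variant could be loaded:\n" + "\n".join(errors))


class FakeDataset:
    """Stand-in for a PyTerrier dataset so the sketch runs offline."""

    def __init__(self, available):
        self.available = set(available)

    def get_index(self, variant=None):
        if variant not in self.available:
            raise ValueError(f"Could not find index variant {variant}")
        return f"index-ref:{variant}"


# Placeholder variant names; check the data repository for real ones.
fake = FakeDataset(available={"variant_b"})
print(first_available_index(fake, ["variant_a", "variant_b"]))
```

With a real dataset, the list of candidates would come from the variants listed on the dataset's page in the Terrier data repository.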

We can also create a retriever directly from the dataset like so:


In [6]:
bm25 = pt.terrier.Retriever.from_dataset(
    dataset, variant="terrier_stemmed", wmodel="BM25"
)

We can also browse the corpus:


In [None]:
for doc in dataset.get_corpus_iter():
    print(doc)
    break
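If we want to look at more than one document, `itertools.islice` from the standard library is a convenient alternative to `break`. The sketch below uses a hypothetical generator, `toy_corpus_iter`, as a stand-in for `dataset.get_corpus_iter()`, assuming that each document is a dictionary with `docno` and `text` keys. The helper `peek` is also an illustration, not a PyTerrier function.

```python
from itertools import islice


def peek(iterable, n=3):
    """Return the first n items of an iterable as a list."""
    return list(islice(iterable, n))


def toy_corpus_iter():
    """Hypothetical stand-in for dataset.get_corpus_iter()."""
    for i in range(1, 1001):
        yield {"docno": str(i), "text": f"document number {i}"}


for doc in peek(toy_corpus_iter(), n=2):
    print(doc)
```

Because the iterator is lazy, only the requested documents are produced; the rest of the corpus is never read.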

Similarly, the topics (queries) can be accessed as a `pandas.DataFrame`, so we can pass them directly to a retriever:


In [None]:
bm25(dataset.get_topics())

Note that some datasets require a variant here, such as `variant="train"`.

Since the corpus iterator already yields the documents in the correct format (see part 2: indexing), we can use it directly to create our own index if we wish:


In [9]:
from pathlib import Path

index = pt.IterDictIndexer(
    str(Path.cwd()),  # a path is required, but it is ignored for in-memory indexes
    type=pt.index.IndexingType.MEMORY,  # keep the index in memory instead of on disk
).index(dataset.get_corpus_iter())
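Since the indexer consumes an iterator of dictionaries, we can also wrap the corpus iterator in our own generator, for example to skip malformed documents before indexing. The following is a minimal sketch under the assumption that each document should have a `docno` and a non-empty `text` field. `clean_docs` and the sample data are hypothetical.

```python
def clean_docs(docs, min_chars=1):
    """Yield only documents that have a docno and enough text.

    The docno is converted to a string and the text is stripped of
    surrounding whitespace; all other fields are passed through.
    """
    for doc in docs:
        docno = doc.get("docno")
        text = (doc.get("text") or "").strip()
        if docno is None or len(text) < min_chars:
            continue  # skip documents without an identifier or text
        yield {**doc, "docno": str(docno), "text": text}


# Invented sample data: one valid document and two malformed ones.
raw = [
    {"docno": 1, "text": "  compact memories  "},
    {"docno": "2", "text": ""},
    {"text": "no identifier"},
]
print(list(clean_docs(raw)))
```

The resulting generator could then be passed to `.index(...)` in place of `dataset.get_corpus_iter()`.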

## Further reading

Check out the [datasets section](https://pyterrier.readthedocs.io/en/latest/datasets.html) in the documentation.
