# Preparation

In [1]:
#!pip install --upgrade git+https://github.com/terrier-org/pyterrier.git#egg=python-terrier
!pip install python-terrier

Collecting python-terrier
  Using cached python-terrier-0.6.0.tar.gz (86 kB)
Collecting pandas>=0.25.0
  Downloading pandas-1.2.5-cp38-cp38-macosx_10_9_x86_64.whl (10.5 MB)
Collecting wget
  Using cached wget-3.2.zip (10 kB)
Collecting pyjnius~=1.3.0
  Downloading pyjnius-1.3.0-cp38-cp38-macosx_10_13_x86_64.whl (284 kB)
Collecting matchpy
  Using cached matchpy-0.5.4-py3-none-any.whl (69 kB)
Collecting sklearn
  Using cached sklearn-0.0.tar.gz (1.1 kB)
Collecting deprecation
  Using cached deprecation-2.1.0-py2.py3-none-any.whl (11 kB)
Collecting chest
  Using cached chest-0.2.3.tar.gz (9.6 kB)
Collecting scipy
  Downloading scipy-1.7.0-cp38-cp38-macosx_10_9_x86_64.whl (31.9 MB)
Collecting joblib
  Using cached joblib-1.0.1-py3-none-any.whl (303 kB)
Collecting np

  Building wheel for sklearn (setup.py) ... done
  Created wheel for sklearn: filename=sklearn-0.0-py2.py3-none-any.whl size=1309 sha256=6c8736fb9a8ae81e4f141b6b0786922562c8da5d4d0687d5e104902297283f3b
  Stored in directory: /Users/alp/Library/Caches/pip/wheels/22/0b/40/fd3f795caaa1fb4c6cb738bc1f56100be1e57da95849bfc897
  Building wheel for wget (setup.py) ... done
  Created wheel for wget: filename=wget-3.2-py3-none-any.whl size=9673 sha256=0feb8b1c10d6e935b28d27d7edb0c45aaeffbf6eecaf7ebd4ada870764bb042d
  Stored in directory: /Users/alp/Library/Caches/pip/wheels/bd/a8/c3/3cf2c14a1837a4e04bd98631724e81f33f462d86a1d895fae0
Successfully built python-terrier ir-measures cbor warc3-wet-clueweb09 zlib-state chest sklearn wget
Installing collected packages: threadpoolctl, scipy, joblib, cbor, zlib-state, warc3-wet-clueweb09, warc3-wet, typish, trec-car-tools, scikit-learn, pytrec-eval-terrier, patsy, pandas, multiset, lz4, lxml, ijson, heapdict, cython, wget, statsmo

# Init 

You must call `pt.init()` before using any other PyTerrier functions or classes.

Arguments:
 - `version` - Terrier version, e.g. "5.2"
 - `mem` - megabytes of memory allocated to Java, e.g. "4096"


In [2]:
import pyterrier as pt
if not pt.started():
  pt.init()

PyTerrier 0.6.0 has loaded Terrier 5.5 (built by craigmacdonald on 2021-05-20 13:12)


# Vaswani_NPL

We're going to use a very old IR test collection called [Vaswani_NPL](http://ir.dcs.gla.ac.uk/resources/test_collections/npl/). This is included with Terrier, but we provide access here to pre-made indices, along with the topics and qrels:
 

In [3]:
vaswani_dataset = pt.datasets.get_dataset("vaswani")

# Load an existing index

In [4]:
indexref = vaswani_dataset.get_index()
index = pt.IndexFactory.of(indexref)

print(index.getCollectionStatistics().toString())

Number of documents: 11429
Number of terms: 7756
Number of postings: 224573
Number of fields: 0
Number of tokens: 271581
Field names: []
Positions:   false



# Retrieval

Normally, we would use `pt.io.read_topics(topics_path)` to parse a topics file:
``` python
topics_path = "./query-text.trec"
topics = pt.io.read_topics(topics_path)
```

However, the dataset object obtained from `pt.datasets` provides the topics and qrels already parsed:



In [5]:
topics = vaswani_dataset.get_topics()
topics.head(5)

| | qid | query |
|---|---|---|
| 0 | 1 | measurement of dielectric constant of liquids ... |
| 1 | 2 | mathematical analysis and design details of wa... |
| 2 | 3 | use of digital computers in the design of band... |
| 3 | 4 | systems of data coding for information transfer |
| 4 | 5 | use of programs in engineering testing of comp... |


Next, create a `BatchRetrieve` object.

You can optionally set controls and properties by passing dictionaries to the `controls` and `properties` arguments, or by calling the `setControl` or `setControls` methods on an existing object. If you set nothing, the default controls are used.

Then call the `transform` method on the object, passing the topics as the argument.

In [6]:
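# Set the weighting model through the `controls` dictionary at construction time
retr = pt.BatchRetrieve(index, controls = {"wmodel": "TF_IDF"})

# Equivalent ways to set controls on an existing object (redundant here, shown for illustration)
retr.setControl("wmodel", "TF_IDF")
retr.setControls({"wmodel": "TF_IDF"})

# Run retrieval for every topic
res = retr.transform(topics)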

In [7]:
res

| | qid | docid | docno | rank | score | query |
|---|---|---|---|---|---|---|
| 0 | 1 | 8171 | 8172 | 0 | 13.746087 | measurement of dielectric constant of liquids ... |
| 1 | 1 | 9880 | 9881 | 1 | 12.352666 | measurement of dielectric constant of liquids ... |
| 2 | 1 | 5501 | 5502 | 2 | 12.178153 | measurement of dielectric constant of liquids ... |
| 3 | 1 | 1501 | 1502 | 3 | 10.993585 | measurement of dielectric constant of liquids ... |
| 4 | 1 | 9858 | 9859 | 4 | 10.271452 | measurement of dielectric constant of liquids ... |
| ... | ... | ... | ... | ... | ... | ... |
| 91925 | 93 | 2226 | 2227 | 995 | 4.904950 | high frequency oscillators using transistors t... |
| 91926 | 93 | 6898 | 6899 | 996 | 4.899385 | high frequency oscillators using transistors t... |
| 91927 | 93 | 3473 | 3474 | 997 | 4.898796 | high frequency oscillators using transistors t... |
| 91928 | 93 | 3187 | 3188 | 998 | 4.893073 | high frequency oscillators using transistors t... |
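
To give a feel for what a weighting model such as `TF_IDF` does when `transform` scores documents, here is a minimal sketch of a textbook tf-idf ranker over a toy collection. It is not Terrier's actual `TF_IDF` formula, which may normalise term frequencies differently; the documents and the helpers `tokenize`, `tf_idf_score` and `rank_query` are all made up for illustration.

```python
import math
from collections import Counter

# Toy collection (hypothetical documents, not taken from Vaswani).
docs = {
    "d1": "measurement of dielectric constant of liquids",
    "d2": "design of band pass filters",
    "d3": "dielectric properties of thin films",
}

def tokenize(text):
    """Lowercase and split on whitespace (no stemming or stopword removal)."""
    return text.lower().split()

# Term frequencies per document, and the number of documents containing each term.
term_freqs = {docno: Counter(tokenize(text)) for docno, text in docs.items()}
doc_freqs = Counter(term for tf in term_freqs.values() for term in tf)

def tf_idf_score(query, docno):
    """Sum tf * log(N / df) over the query terms (textbook variant)."""
    n_docs = len(docs)
    score = 0.0
    for term in tokenize(query):
        tf = term_freqs[docno][term]
        if tf:
            score += tf * math.log(n_docs / doc_freqs[term])
    return score

def rank_query(query):
    """Return (rank, docno, score) tuples, best first, ranks starting at 0."""
    scored = [(docno, tf_idf_score(query, docno)) for docno in docs]
    scored = [(docno, score) for docno, score in scored if score > 0]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [(i, docno, score) for i, (docno, score) in enumerate(scored)]

ranking = rank_query("dielectric constant")
for position, docno, score in ranking:
    print(position, docno, round(score, 4))
```

As in the result frame above, each query yields a list of documents ordered by descending score, with ranks counted from 0.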


You can also run a single query given as a plain string.

In [8]:
retr.search("Light")

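| | qid | docid | docno | rank | score | query |
|---|---|---|---|---|---|---|
| 0 | 1 | 10808 | 10809 | 0 | 5.537595 | Light |
| 1 | 1 | 11231 | 11232 | 1 | 5.535640 | Light |
| 2 | 1 | 11066 | 11067 | 2 | 5.497895 | Light |
| 3 | 1 | 5995 | 5996 | 3 | 5.486707 | Light |
| 4 | 1 | 4460 | 4461 | 4 | 5.464468 | Light |
| ... | ... | ... | ... | ... | ... | ... |
| 120 | 1 | 4820 | 4821 | 120 | 1.964441 | Light |
| 121 | 1 | 9836 | 9837 | 121 | 1.927833 | Light |
| 122 | 1 | 7213 | 7214 | 122 | 1.910036 | Light |
| 123 | 1 | 6177 | 6178 | 123 | 1.892565 | Light |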


You can save the results to a file with `pt.io.write_results(result, path)`:

In [9]:
pt.io.write_results(res,"result1.res")
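
Assuming `pt.io.write_results` writes the standard TREC run format (an assumption here; check the PyTerrier documentation for your version), each line of the saved file holds six whitespace-separated fields: query id, the literal `Q0`, document number, rank, score, and a run name. The sketch below builds such lines by hand for two rows of the result frame above; `format_trec_run` and the run name `"pyterrier"` are hypothetical and only illustrate the layout.

```python
import io

def format_trec_run(rows, run_name="pyterrier"):
    """Render (qid, docno, rank, score) tuples as TREC run lines.

    The run name is an arbitrary label; "pyterrier" is a placeholder.
    """
    out = io.StringIO()
    for qid, docno, rank, score in rows:
        out.write(f"{qid} Q0 {docno} {rank} {score} {run_name}\n")
    return out.getvalue()

# First two rows of the retrieval results shown earlier.
rows = [("1", "8172", 0, 13.746087), ("1", "9881", 1, 12.352666)]
run_text = format_trec_run(rows)
print(run_text, end="")
```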

# Evaluation

Similarly, when working with a local test collection, we can use `pt.io.read_qrels(qrels_path)` to parse a qrels file:
```python
qrels_path = "./qrels"
qrels = pt.io.read_qrels(qrels_path)
```

However, for the Vaswani dataset, the qrels are provided ready to use:


In [10]:
qrels = vaswani_dataset.get_qrels()
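
For reference, a qrels file in the usual TREC convention holds one relevance judgement per line: query id, an iteration field (normally ignored), document number, and a relevance label. The sketch below parses that layout into a nested dictionary. It assumes the four-column format; `parse_qrels` and the sample lines are hypothetical and are not the Vaswani judgements.

```python
def parse_qrels(text):
    """Parse TREC-style qrels lines into {qid: {docno: label}}."""
    qrels = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        qid, _iteration, docno, label = parts
        qrels.setdefault(qid, {})[docno] = int(label)
    return qrels

# Made-up judgements, for illustration only.
sample = "1 0 1239 1\n1 0 1502 1\n\n2 0 80 1\n"
parsed = parse_qrels(sample)
print(parsed)
```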

Use `pt.Utils.evaluate(results, qrels)` to evaluate the results.

Args:
 - `metrics`, default `["map", "ndcg"]`: the evaluation metrics to compute
 - `perquery`, default `False`: whether to report each metric per query rather than its mean over all queries

In [11]:
eval = pt.Utils.evaluate(res,qrels)
eval

{'map': 0.29090543005529873, 'ndcg': 0.6153667539666847}
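
To make the `map` figure concrete, here is a simplified sketch of how average precision is computed for one ranked list and then averaged over queries. It is not the implementation PyTerrier uses, and real evaluation tools handle more edge cases; the functions `average_precision` and `mean_average_precision` and the toy run and qrels are made up for illustration.

```python
def average_precision(ranked_docnos, relevant):
    """Mean of precision@k taken at each rank k where a relevant document appears."""
    hits = 0
    precision_sum = 0.0
    for position, docno in enumerate(ranked_docnos, start=1):
        if docno in relevant:
            hits += 1
            precision_sum += hits / position
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(run, qrels):
    """Return (MAP, per-query AP) over the queries that have judgements."""
    per_query = {
        qid: average_precision(run.get(qid, []), relevant)
        for qid, relevant in qrels.items()
    }
    return sum(per_query.values()) / len(per_query), per_query

# Toy data: ranked document lists per query, and the relevant documents per query.
toy_run = {"1": ["a", "b", "c", "d"], "2": ["x", "y", "z"]}
toy_qrels = {"1": {"a", "c"}, "2": {"z"}}

toy_map, toy_per_query = mean_average_precision(toy_run, toy_qrels)
print(toy_map, toy_per_query)
```

The second return value corresponds to what `perquery=True` reports: one score per query instead of the mean.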

We can also ask for per-query results.

In [12]:
eval = pt.Utils.evaluate(res,qrels,metrics=["map"], perquery=True)
eval

defaultdict(dict,
            {'1': {'map': 0.2688603632606692},
             '2': {'map': 0.056448212440045914},
             '3': {'map': 0.23945401361406524},
             '4': {'map': 0.4939494140851607},
             '5': {'map': 0.0},
             '6': {'map': 0.2421600270476016},
             '7': {'map': 0.5674516736006812},
             '8': {'map': 0.5},
             '9': {'map': 0.5222222222222223},
             '10': {'map': 0.1214856066519094},
             '11': {'map': 0.06799761023743447},
             '12': {'map': 0.2093716360982601},
             '13': {'map': 0.26945162856284827},
             '14': {'map': 0.3164929260069987},
             '15': {'map': 0.17479160483981196},
             '16': {'map': 0.07376769675516924},
             '17': {'map': 0.3965636483508813},
             '18': {'map': 0.16354405989238738},
             '19': {'map': 0.44669647488527836},
             '20': {'map': 0.22061080821325293},
             '21': {'map': 0.5395186359625185},
   