In this notebook we explore the `evaluate` function offered by `ranx`.

First of all, we need to install [ranx](https://github.com/AmenRa/ranx).

Note that the first time you run any ranx function, it may take a while because it must be compiled first.

In [2]:
!pip install -U ranx

Download the data we need

In [3]:
import os
import requests

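# Create the target directory once, before downloading
os.makedirs("notebooks/data", exist_ok=True)

for file in ["qrels", "results"]:
    url = f"https://raw.githubusercontent.com/AmenRa/ranx/master/notebooks/data/{file}.test"
    response = requests.get(url)
    response.raise_for_status()  # Fail early instead of saving an error page

    with open(f"notebooks/data/{file}.test", "w") as f:
        f.write(response.text)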

In [4]:
from ranx import Qrels, Run, evaluate

In [5]:
qrels = Qrels.from_file("notebooks/data/qrels.test", kind="trec")
run = Run.from_file("notebooks/data/results.test", kind="trec")
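
For reference, `kind="trec"` refers to the whitespace-separated TREC formats: a qrels line reads `query_id iteration doc_id relevance`, and a run line reads `query_id Q0 doc_id rank score tag`. The sketch below parses such lines into nested dictionaries with plain Python. The helper names `parse_trec_qrels` and `parse_trec_run` and the sample lines are hypothetical and are not part of ranx, which handles the parsing itself in `from_file`.

```python
def parse_trec_qrels(lines):
    """Parse TREC qrels lines: query_id iteration doc_id relevance."""
    parsed = {}
    for line in lines:
        if not line.strip():
            continue  # Skip empty lines
        q_id, _iteration, doc_id, rel = line.split()
        parsed.setdefault(q_id, {})[doc_id] = int(rel)
    return parsed


def parse_trec_run(lines):
    """Parse TREC run lines: query_id Q0 doc_id rank score tag."""
    parsed = {}
    for line in lines:
        if not line.strip():
            continue  # Skip empty lines
        q_id, _q0, doc_id, _rank, score, _tag = line.split()
        parsed.setdefault(q_id, {})[doc_id] = float(score)
    return parsed


# Made-up sample lines, for illustration only
toy_qrels = parse_trec_qrels(["301 0 doc_a 1", "301 0 doc_b 0"])
toy_run = parse_trec_run(["301 Q0 doc_b 1 0.9 demo", "301 Q0 doc_a 2 0.7 demo"])
print(toy_qrels)
print(toy_run)
```

Both results map each query id to a dictionary of document ids, holding relevance judgments for the qrels and retrieval scores for the run.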

Evaluate

For a full list of the available metrics see [here](https://amenra.github.io/ranx/metrics/).

In [None]:
# Single metric
print(evaluate(qrels, run, "hits"))
print(evaluate(qrels, run, "hit_rate"))
print(evaluate(qrels, run, "precision"))
print(evaluate(qrels, run, "recall"))
print(evaluate(qrels, run, "f1"))
print(evaluate(qrels, run, "r-precision"))
print(evaluate(qrels, run, "mrr"))
print(evaluate(qrels, run, "map"))
print(evaluate(qrels, run, "ndcg"))
print(evaluate(qrels, run, "bpref"))
print(evaluate(qrels, run, "rbp.95"))

In [6]:
# Single metric with cutoff
evaluate(qrels, run, "ndcg@10")

np.float64(0.30157719921022785)
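
To make the `ndcg@10` score less of a black box, here is a minimal pure-Python sketch of NDCG@k for a single query. It assumes linear gain (the relevance value itself) and a `log2(rank + 1)` discount; whether this matches ranx's compiled implementation in every detail is an assumption. The function name `ndcg_at_k` and the toy data are hypothetical.

```python
import math


def ndcg_at_k(query_qrels, query_run, k):
    """NDCG@k for one query, assuming linear gain and a log2 discount."""
    # Rank documents by descending retrieval score and keep the top k
    ranked = sorted(query_run, key=query_run.get, reverse=True)[:k]
    dcg = sum(
        query_qrels.get(doc_id, 0) / math.log2(i + 2)
        for i, doc_id in enumerate(ranked)
    )
    # The ideal ranking orders the judged documents by descending relevance
    ideal = sorted(query_qrels.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


# Made-up judgments and scores for a single query
toy_qrels = {"d1": 2, "d2": 1, "d3": 0}
toy_run = {"d1": 0.2, "d2": 0.9, "d3": 0.5}
print(ndcg_at_k(toy_qrels, toy_run, k=3))
```

In the toy run the most relevant document is ranked last, so the score falls below 1; a run that ranks `d1`, `d2`, `d3` in that order would score exactly 1.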

In [None]:
# Multiple metrics
evaluate(qrels, run, ["map", "mrr", "ndcg"])

In [None]:
# Multiple metrics with cutoffs (you can use different cutoffs for each metric)
evaluate(qrels, run, ["map@100", "mrr@10", "ndcg@10"])
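
The `@k` suffix restricts a metric to the top k results of each query. As an illustration of what a cutoff does, the following sketch computes the reciprocal rank at k for a single query, which is the per-query quantity averaged by `mrr@k`. It assumes that any relevance value greater than 0 counts as relevant; the function name `reciprocal_rank_at_k` and the toy data are hypothetical and not taken from ranx.

```python
def reciprocal_rank_at_k(query_qrels, query_run, k):
    """Reciprocal rank of the first relevant document within the top k."""
    # Rank documents by descending retrieval score and keep the top k
    ranked = sorted(query_run, key=query_run.get, reverse=True)[:k]
    for rank, doc_id in enumerate(ranked, start=1):
        if query_qrels.get(doc_id, 0) > 0:  # Assumption: relevant if > 0
            return 1.0 / rank
    return 0.0  # No relevant document within the cutoff


# Made-up judgments and scores for a single query
toy_qrels = {"d1": 1}
toy_run = {"d1": 0.2, "d2": 0.9, "d3": 0.5}
print(reciprocal_rank_at_k(toy_qrels, toy_run, k=10))
print(reciprocal_rank_at_k(toy_qrels, toy_run, k=2))
```

With `k=10` the only relevant document is found at rank 3; with `k=2` it falls outside the cutoff and the score drops to 0.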

In [None]:
# By default, scores are saved in the evaluated Run
# You can disable this behaviour by passing `save_results_in_run=False`
# when calling `evaluate`
run.mean_scores

In [None]:
import json  # Just for pretty printing

print(json.dumps(run.scores, indent=4))

# 301, 302, and 303 are the query ids

In [7]:
# Alternatively, per-query scores can be extracted as NumPy arrays by passing
# `return_mean=False` to `evaluate`
print(qrels)
evaluate(qrels, run, ["map@100", "mrr@10", "ndcg@10"], return_mean=False)

{'map@100': array([0.01179319, 0.39827964, 0.0764098 ]),
 'mrr@10': array([0.16666667, 1.        , 0.        ]),
 'ndcg@10': array([0.15176219, 0.75296941, 0.        ])}
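
The per-query arrays relate to the default output in a simple way: averaging the three `ndcg@10` values gives approximately 0.3016, which agrees with the `ndcg@10` score printed earlier. The sketch below recomputes the means from the printed values using only the standard library; the variable names are illustrative, and the inputs carry the rounding of the printed output.

```python
from statistics import fmean

# Per-query scores copied from the output above (rounded as printed)
per_query_scores = {
    "map@100": [0.01179319, 0.39827964, 0.0764098],
    "mrr@10": [0.16666667, 1.0, 0.0],
    "ndcg@10": [0.15176219, 0.75296941, 0.0],
}

# Arithmetic mean over queries, one value per metric
mean_scores = {metric: fmean(scores) for metric, scores in per_query_scores.items()}
print(mean_scores)
```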

In [None]:
# Finally, you can set the number of threads used to compute the metric
# scores by passing `threads=n` to `evaluate`
# `threads=0` is the default, which means all available threads are used
# Note that if the number of queries is small, `ranx` automatically sets
# `threads=1` to prevent performance degradation
evaluate(qrels, run, ["map@100", "mrr@10", "ndcg@10"], threads=1)