# Evaluating BEIR with Fess

## Introduction
This notebook shows how to evaluate Fess as a retrieval system on the BEIR benchmark.

## What is BEIR?
BEIR (Benchmark for Evaluation of Information Retrieval) is a heterogeneous benchmark for zero-shot evaluation of information retrieval models. It covers 9 retrieval tasks across the 18 datasets listed in the table below, so retrieval systems can be compared across many domains without task-specific training.
    

In [None]:
import os

dataset = os.getenv("BEIR_DATASET", "scifact")
fess_dir = os.getenv("FESS_DIR", "fess")
dataset_dir = os.getenv("DATASET_DIR", os.path.join(os.getcwd(), "datasets"))

In [None]:
# Install the beir PyPI package
!pip list | grep beir || pip install beir

In [None]:
import logging

from beir import LoggingHandler

# Print log messages (including BEIR progress output) to stdout
logging.basicConfig(format='%(asctime)s - %(message)s',
                    datefmt='%Y-%m-%d %H:%M:%S',
                    level=logging.INFO,
                    handlers=[LoggingHandler()])

## BEIR Datasets

The table below lists 18 datasets. The 15 marked "Yes" in the Download column can be fetched directly from the link below:

[``https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/``](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/)

For the remaining 3 datasets, which are not directly downloadable, please refer to the BEIR GitHub page.


BEIR includes the following datasets:

| Dataset   | Website| BEIR-Name | Domain     | Relevancy| Queries  | Documents | Avg. Docs/Q | Download | 
| -------- | -----| ---------| ----------- | ---------| ---------| --------- | ------| ------------| 
| MSMARCO    | [``Homepage``](https://microsoft.github.io/msmarco/)| ``msmarco`` | Misc.       |  Binary  |  6,980   |  8.84M     |    1.1 | Yes |  
| TREC-COVID |  [``Homepage``](https://ir.nist.gov/covidSubmit/index.html)| ``trec-covid``| Bio-Medical |  3-level|50|  171K| 493.5 | Yes | 
| NFCorpus   | [``Homepage``](https://www.cl.uni-heidelberg.de/statnlpgroup/nfcorpus/) | ``nfcorpus``  | Bio-Medical |  3-level |  323     |  3.6K     |  38.2 | Yes |
| BioASQ     | [``Homepage``](http://bioasq.org) | ``bioasq``| Bio-Medical |  Binary  |   500    |  14.91M    |  8.05 | No | 
| NQ         | [``Homepage``](https://ai.google.com/research/NaturalQuestions) | ``nq``| Wikipedia   |  Binary  |  3,452   |  2.68M  |  1.2 | Yes | 
| HotpotQA   | [``Homepage``](https://hotpotqa.github.io) | ``hotpotqa``| Wikipedia   |  Binary  |  7,405   |  5.23M  |  2.0 | Yes |
| FiQA-2018  | [``Homepage``](https://sites.google.com/view/fiqa/) | ``fiqa``    | Finance     |  Binary  |  648     |  57K    |  2.6 | Yes | 
| Signal-1M (RT) | [``Homepage``](https://research.signal-ai.com/datasets/signal1m-tweetir.html)| ``signal1m`` | Twitter     |  3-level  |   97   |  2.86M  |  19.6 | No |
| TREC-NEWS  | [``Homepage``](https://trec.nist.gov/data/news2019.html) | ``trec-news``    | News     |  5-level  |   57    |  595K    |  19.6 | No |
| ArguAna    | [``Homepage``](http://argumentation.bplaced.net/arguana/data) | ``arguana`` | Misc.       |  Binary  |  1,406     |  8.67K    |  1.0 | Yes |
| Touche-2020| [``Homepage``](https://webis.de/events/touche-20/shared-task-1.html) | ``webis-touche2020``| Misc.       |  6-level  |  49     |  382K    |  49.2 |  Yes |
| CQADupstack| [``Homepage``](http://nlp.cis.unimelb.edu.au/resources/cqadupstack/) | ``cqadupstack``| StackEx.      |  Binary  |  13,145 |  457K  |  1.4 |  Yes |
| Quora| [``Homepage``](https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs) | ``quora``| Quora  | Binary  |  10,000     |  523K    |  1.6 |  Yes | 
| DBPedia | [``Homepage``](https://github.com/iai-group/DBpedia-Entity/) | ``dbpedia-entity``| Wikipedia |  3-level  |  400    |  4.63M    |  38.2 |  Yes | 
| SCIDOCS| [``Homepage``](https://allenai.org/data/scidocs) | ``scidocs``| Scientific |  Binary  |  1,000     |  25K    |  4.9 |  Yes | 
| FEVER| [``Homepage``](http://fever.ai) | ``fever``| Wikipedia     |  Binary  |  6,666     |  5.42M    |  1.2|  Yes | 
| Climate-FEVER| [``Homepage``](http://climatefever.ai) | ``climate-fever``| Wikipedia |  Binary  |  1,535     |  5.42M |  3.0 |  Yes |
| SciFact| [``Homepage``](https://github.com/allenai/scifact) | ``scifact``| Scientific |  Binary  |  300     |  5K    |  1.1 |  Yes |


In [None]:
import pathlib
from beir import util

url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip".format(dataset)
data_path = util.download_and_unzip(url, dataset_dir)
print("Dataset downloaded here: {}".format(data_path))

## Folder Structure of any BEIR dataset

* scifact/
    * corpus.jsonl 
    * queries.jsonl 
    * qrels/
        * train.tsv
        * dev.tsv
        * test.tsv
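
For orientation, here is a minimal sketch of what these files typically contain, assuming the usual BEIR layout: each line of `corpus.jsonl` is a JSON object with `_id`, `title`, and `text`; each line of `queries.jsonl` has `_id` and `text`; and each qrels TSV starts with a `query-id`, `corpus-id`, `score` header row. The records and the helper names (`write_toy_dataset`, `read_jsonl`, `read_qrels`) are made up for illustration and are not taken from SciFact or from the `beir` package.

```python
import csv
import json
import tempfile
from pathlib import Path


def write_toy_dataset(root: Path) -> None:
    """Write a tiny, invented dataset using the BEIR folder layout."""
    (root / "qrels").mkdir(parents=True, exist_ok=True)
    corpus = [
        {"_id": "d1", "title": "Toy title", "text": "First toy document."},
        {"_id": "d2", "title": "", "text": "Second toy document."},
    ]
    queries = [{"_id": "q1", "text": "toy query"}]
    with open(root / "corpus.jsonl", "w", encoding="utf-8") as f:
        for doc in corpus:
            f.write(json.dumps(doc) + "\n")
    with open(root / "queries.jsonl", "w", encoding="utf-8") as f:
        for query in queries:
            f.write(json.dumps(query) + "\n")
    with open(root / "qrels" / "test.tsv", "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(["query-id", "corpus-id", "score"])
        writer.writerow(["q1", "d1", 1])


def read_jsonl(path: Path) -> list:
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def read_qrels(path: Path) -> dict:
    """Read a qrels TSV into {query_id: {doc_id: score}}."""
    qrels = {}
    with open(path, encoding="utf-8", newline="") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            qrels.setdefault(row["query-id"], {})[row["corpus-id"]] = int(row["score"])
    return qrels


# Round-trip the toy dataset through a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "toy"
    write_toy_dataset(root)
    toy_corpus = read_jsonl(root / "corpus.jsonl")
    toy_queries = read_jsonl(root / "queries.jsonl")
    toy_qrels = read_qrels(root / "qrels" / "test.tsv")

print(toy_corpus[0])
print(toy_queries[0])
print(toy_qrels)
```

Only the `test` split is written here; `train.tsv` and `dev.tsv` would follow the same three-column format.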

In [None]:
!ls "{dataset_dir}/{dataset}/"

## Data Loading

In [None]:
from beir.datasets.data_loader import GenericDataLoader

data_path = f"{dataset_dir}/{dataset}"
corpus, queries, qrels = GenericDataLoader(data_path).load(split="test") # or split = "train" or "dev"
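
As a rough guide to the loaded objects: `corpus` maps document ids to dicts with `title` and `text`, `queries` maps query ids to query strings, and `qrels` maps each query id to a dict of document ids and relevance scores. The sketch below uses invented stand-in dictionaries of that shape (not real SciFact data) and a hypothetical helper, `summarize`, to compute the kind of figures shown in the dataset table, such as the average number of relevant documents per query.

```python
from typing import Dict


def summarize(corpus: Dict[str, Dict[str, str]],
              queries: Dict[str, str],
              qrels: Dict[str, Dict[str, int]]) -> Dict[str, float]:
    """Hypothetical helper: basic size statistics for a BEIR-style split."""
    # Count only judgments with a positive relevance score as relevant.
    relevant_counts = [
        sum(1 for score in judged.values() if score > 0)
        for judged in qrels.values()
    ]
    n_judged = len(relevant_counts)
    return {
        "documents": len(corpus),
        "queries": len(queries),
        "judged_queries": n_judged,
        "avg_relevant_per_query": sum(relevant_counts) / n_judged if n_judged else 0.0,
    }


# Made-up stand-ins with the same shape as the loader's return values.
example_corpus = {
    "d1": {"title": "A", "text": "alpha"},
    "d2": {"title": "B", "text": "beta"},
    "d3": {"title": "C", "text": "gamma"},
}
example_queries = {"q1": "first query", "q2": "second query"}
example_qrels = {"q1": {"d1": 1, "d2": 1}, "q2": {"d3": 1, "d1": 0}}

stats = summarize(example_corpus, example_queries, example_qrels)
print(stats)
```

In the notebook, calling `summarize(corpus, queries, qrels)` on the real split should give numbers in the same ballpark as the table row for the selected dataset, although the table may have been computed over a different split.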

## Lexical Retrieval using BM25 (Fess)

### Run the Fess instance

In [None]:
%%bash -s "$fess_dir"

# Start Fess in the background from the directory that holds compose.yaml
cd "$1" || exit 1
docker compose -f compose.yaml up -d

In [None]:
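%%bash

# Poll the Fess health endpoint once per second until it returns HTTP 200
count=0
while true ; do
    status=$(curl -w '%{http_code}\n' -s -o /dev/null "http://localhost:8080/api/v1/health")
    if [[ x"${status}" = x200 ]] ; then
        break
    fi
    if [[ ${count} -gt 60 ]] ; then
        # Fail the cell so later cells do not run against an unready server
        echo "timeout: Fess did not become healthy" >&2
        exit 1
    fi
    sleep 1
    count=$((count + 1))
done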
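
The same wait can also be written in Python, which makes the result easy to check from later cells. This is a minimal sketch using only the standard library; `http_status` and `wait_until_healthy` are hypothetical helper names, and the `probe` parameter exists only so the loop can be exercised without a running server. The demo at the bottom feeds in fake status codes instead of contacting Fess.

```python
import http.client
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for url, or 0 if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (for example 503).
        return err.code
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, reset, timed out, or malformed response.
        return 0


def wait_until_healthy(url: str,
                       max_attempts: int = 60,
                       interval: float = 1.0,
                       probe: Callable[[str], int] = http_status) -> bool:
    """Poll url until it returns 200; give up after max_attempts tries."""
    for _ in range(max_attempts):
        if probe(url) == 200:
            return True
        time.sleep(interval)
    return False


# Demo with fake status codes: unreachable, then starting up, then healthy.
fake_statuses = iter([0, 503, 200])
ready = wait_until_healthy(
    "http://localhost:8080/api/v1/health",
    interval=0.0,
    probe=lambda _url: next(fake_statuses),
)
print(ready)
```

Against a real instance, omit `probe` and raise an error when the function returns `False`, so that indexing does not start before Fess is ready.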


In [None]:
from beir_fess.retrieval.search.lexical import FessSearch
from beir.retrieval.evaluation import EvaluateRetrieval

#### Provide connection parameters for Fess
hostname = "http://localhost:8080"
initialize = True # If True, delete any existing index with the same name and reindex all documents
access_token = "CHANGEME" # Replace with the access token of your Fess instance
k_values = [1, 3, 5, 10, 50, 100]

model = FessSearch(index_name=dataset, hostname=hostname, access_token=access_token, initialize=initialize)
retriever = EvaluateRetrieval(model, k_values=k_values)

#### Retrieve lexical (BM25) results (format of results is identical to qrels)
results = retriever.retrieve(corpus, queries)

In [None]:
#### Evaluate your retrieval using NDCG@k, MAP@K ...
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
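
To make the NDCG@k numbers easier to interpret, the following is a simplified pure-Python sketch of the metric on invented data. It assumes linear gain (the relevance score itself) and a log2 rank discount. BEIR delegates the real computation to `pytrec_eval`, so details such as tie-breaking may differ; `ndcg_at_k` is a hypothetical helper, not part of the `beir` API.

```python
import math
from typing import Dict


def ndcg_at_k(qrels: Dict[str, Dict[str, int]],
              results: Dict[str, Dict[str, float]],
              k: int) -> float:
    """Mean NDCG@k over all queries in qrels (simplified sketch)."""
    per_query = []
    for qid, judged in qrels.items():
        scores = results.get(qid, {})
        # Rank retrieved documents by descending retrieval score, keep the top k.
        ranked = sorted(scores, key=scores.get, reverse=True)[:k]
        dcg = sum(
            judged.get(doc_id, 0) / math.log2(rank + 1)
            for rank, doc_id in enumerate(ranked, start=1)
        )
        # Ideal ranking: all judged documents sorted by relevance.
        ideal = sorted(judged.values(), reverse=True)[:k]
        idcg = sum(
            rel / math.log2(rank + 1)
            for rank, rel in enumerate(ideal, start=1)
        )
        per_query.append(dcg / idcg if idcg > 0 else 0.0)
    return sum(per_query) / len(per_query) if per_query else 0.0


# Invented example: two relevant documents, retrieved at ranks 2 and 3.
example_qrels = {"q1": {"d1": 1, "d2": 1}}
example_results = {"q1": {"d3": 0.9, "d1": 0.8, "d2": 0.1}}

for k in (1, 3):
    print(f"NDCG@{k} = {ndcg_at_k(example_qrels, example_results, k):.4f}")
```

Because the top-ranked document is not relevant, NDCG@1 is zero here, while NDCG@3 gives partial credit for the relevant documents found lower in the ranking.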

In [None]:
import pandas as pd

df_ndcg = pd.DataFrame(list(ndcg.items()), columns=['Metric', 'Value'])
df_map = pd.DataFrame(list(_map.items()), columns=['Metric', 'Value'])
df_recall = pd.DataFrame(list(recall.items()), columns=['Metric', 'Value'])
df_precision = pd.DataFrame(list(precision.items()), columns=['Metric', 'Value'])

# Concatenate the metric tables vertically
df = pd.concat([df_ndcg, df_map, df_recall, df_precision], ignore_index=True)
df["DataSet"] = dataset
df["Target"] = fess_dir
df = df[["DataSet", "Target", "Metric", "Value"]]
print(df.to_markdown(index=False))
os.makedirs("results", exist_ok=True)  # make sure the output directory exists
df.to_csv(f"results/{dataset}-{fess_dir}.csv", index=False)

In [None]:
%%bash -s "$fess_dir"

# Stop and remove the Fess containers
cd "$1" || exit 1
docker compose -f compose.yaml down