Skip to content

Datasets available

Nandan Thakur edited this page Jun 29, 2022 · 2 revisions

🍻 Datasets Available in BEIR

The BEIR benchmark (originally) contains 18 retrieval datasets from diverse domains and tasks. The benchmark focuses on zero-shot evaluation (i.e. no training data available) of lexical and neural retrievers across on the diverse datasets. Here below the dataset links and statistics:

  • Four private datasets: Send me an email on nandant@gmail.com for direct access of these datasets via a private google drive link. Please make sure you have the necessary licenses involved before you send in the email.

🍻 Download a BEIR dataset

To load one of the already preprocessed datasets in your current directory as follows:

from beir import util
from beir.datasets.data_loader import GenericDataLoader

dataset = "scifact"
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip".format(dataset)
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")

This will download the scifact dataset under the datasets directory.

🍻 Available Datasets

Command to generate md5hash using Terminal: md5hash filename.zip.

Dataset Website BEIR-Name Type Queries Corpus Rel D/Q Down-load md5
MSMARCO Homepage msmarco train
dev
test
6,980 8.84M 1.1 Link 444067daf65d982533ea17ebd59501e4
TREC-COVID Homepage trec-covid test 50 171K 493.5 Link ce62140cb23feb9becf6270d0d1fe6d1
NFCorpus Homepage nfcorpus train
dev
test
323 3.6K 38.2 Link a89dba18a62ef92f7d323ec890a0d38d
BioASQ Homepage bioasq train
test
500 14.91M 8.05 No How to Reproduce?
NQ Homepage nq train
test
3,452 2.68M 1.2 Link d4d3d2e48787a744b6f6e691ff534307
HotpotQA Homepage hotpotqa train
dev
test
7,405 5.23M 2.0 Link f412724f78b0d91183a0e86805e16114
FiQA-2018 Homepage fiqa train
dev
test
648 57K 2.6 Link 17918ed23cd04fb15047f73e6c3bd9d9
Signal-1M(RT) Homepage signal1m test 97 2.86M 19.6 No How to Reproduce?
TREC-NEWS Homepage trec-news test 57 595K 19.6 No How to Reproduce?
ArguAna Homepage arguana test 1,406 8.67K 1.0 Link 8ad3e3c2a5867cdced806d6503f29b99
Touche-2020 Homepage webis-touche2020 test 49 382K 19.0 Link 46f650ba5a527fc69e0a6521c5a23563
CQADupstack Homepage cqadupstack test 13,145 457K 1.4 Link 4e41456d7df8ee7760a7f866133bda78
Quora Homepage quora dev
test
10,000 523K 1.6 Link 18fb154900ba42a600f84b839c173167
DBPedia Homepage dbpedia-entity dev
test
400 4.63M 38.2 Link c2a39eb420a3164af735795df012ac2c
SCIDOCS Homepage scidocs test 1,000 25K 4.9 Link 38121350fc3a4d2f48850f6aff52e4a9
FEVER Homepage fever train
dev
test
6,666 5.42M 1.2 Link 5a818580227bfb4b35bb6fa46d9b6c03
Climate-FEVER Homepage climate-fever test 1,535 5.42M 3.0 Link 8b66f0a9126c521bae2bde127b4dc99d
SciFact Homepage scifact train
test
300 5K 1.1 Link 5f7d1de60b170fc8027bb7898e2efca1
Robust04 Homepage robust04 test 249 528K 69.9 No How to Reproduce?

Disclaimer

Similar to Tensorflow datasets or HuggingFace's datasets library, we just downloaded and prepared public datasets. We only distribute these datasets in a specific format, but we do not vouch for their quality or fairness, or claim that you have license to use the dataset. It remains the user's responsibility to determine whether you as a user have permission to use the dataset under the dataset's license and to cite the right owner of the dataset.

If you're a dataset owner and wish to update any part of it, or do not want your dataset to be included in this library, feel free to post an issue here or make a pull request!

If you're a dataset owner and wish to include your dataset or model in this library, feel free to post an issue here or make a pull request!