AfriCLIRMatrix is a test collection for cross-lingual information retrieval research in 15 diverse African languages. The resource comprises English queries paired with query–document relevance judgments in 15 African languages, automatically mined from Wikipedia.
AfriCLIRMatrix: Enabling Cross-Lingual Information Retrieval for African Languages
Afr | Amh | Ary | Arz | Hau | Ibo | Nso | Sna | Som | Swa | Tir | Twi | Wol | Yor | Zul
The dataset (v1.1) is also available on Hugging Face Datasets:
Afr | Amh | Ary | Arz | Hau | Ibo | Nso | Sna | Som | Swa | Tir | Twi | Wol | Yor | Zul
Baseline results for BM25, mDPR (fine-tuned on MS MARCO), and a sparse-dense hybrid on AfriCLIRMatrix:
Model | Afr | Amh | Ary | Arz | Hau | Ibo | Nso | Sna | Som | Swa | Tir | Twi | Wol | Yor | Zul | Avg |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
BM25 (default) | 0.434 | 0.159 | 0.167 | 0.268 | 0.508 | 0.518 | 0.445 | 0.262 | 0.305 | 0.418 | 0.080 | 0.513 | 0.134 | 0.484 | 0.247 | 0.329 |
mDPR (MS Marco) | 0.309 | 0.215 | 0.355 | 0.118 | 0.269 | 0.338 | 0.282 | 0.351 | 0.218 | 0.335 | 0.265 | 0.333 | 0.232 | 0.377 | 0.178 | 0.281 |
Hybrid | 0.464 | 0.228 | 0.350 | 0.257 | 0.508 | 0.580 | 0.526 | 0.394 | 0.344 | 0.477 | 0.239 | 0.547 | 0.233 | 0.532 | 0.273 | 0.397 |
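
The final column of the table above appears to be an unweighted mean over the 15 languages. The snippet below is a minimal sketch of that computation, assuming macro-averaging (this README does not state how the average was derived). It uses the BM25 row as input, and `macro_average` is an illustrative helper name, not part of any released code.

```python
import math
from typing import Dict


def macro_average(scores: Dict[str, float]) -> float:
    """Unweighted mean over languages (assumed meaning of the average column)."""
    return math.fsum(scores.values()) / len(scores)


# Per-language BM25 scores copied from the table above.
bm25_scores = {
    "Afr": 0.434, "Amh": 0.159, "Ary": 0.167, "Arz": 0.268, "Hau": 0.508,
    "Ibo": 0.518, "Nso": 0.445, "Sna": 0.262, "Som": 0.305, "Swa": 0.418,
    "Tir": 0.080, "Twi": 0.513, "Wol": 0.134, "Yor": 0.484, "Zul": 0.247,
}

print(round(macro_average(bm25_scores), 3))  # 0.329
```

Because the per-language scores are already rounded to three decimals, a recomputed average can differ from the reported one in the last digit.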
Model | Afr | Amh | Ary | Arz | Hau | Ibo | Nso | Sna | Som | Swa | Tir | Twi | Wol | Yor | Zul | Avg |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
BM25 (default) | 0.584 | 0.174 | 0.224 | 0.309 | 0.650 | 0.685 | 0.629 | 0.346 | 0.403 | 0.556 | 0.080 | 0.560 | 0.166 | 0.627 | 0.289 | 0.418 |
mDPR (MS Marco) | 0.591 | 0.382 | 0.694 | 0.248 | 0.542 | 0.668 | 0.670 | 0.642 | 0.445 | 0.595 | 0.580 | 0.664 | 0.548 | 0.655 | 0.361 | 0.552 |
Hybrid | 0.727 | 0.388 | 0.698 | 0.416 | 0.722 | 0.804 | 0.766 | 0.684 | 0.535 | 0.690 | 0.600 | 0.732 | 0.556 | 0.750 | 0.448 | 0.634 |
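
This README does not describe how the sparse and dense runs are combined in the hybrid rows. One common approach is to min-max normalize each run's scores per query and interpolate them linearly. The sketch below illustrates that idea only. The function names, the `alpha` weight, and the choice of normalization are all assumptions, not the authors' confirmed configuration.

```python
from typing import Dict, List, Tuple


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale one query's scores to [0, 1]; a constant run maps to 0.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def hybrid_fusion(
    sparse: Dict[str, float],
    dense: Dict[str, float],
    alpha: float = 0.5,  # hypothetical weight on the sparse (BM25) run
    k: int = 10,
) -> List[Tuple[str, float]]:
    """Linearly interpolate normalized sparse and dense scores for one query."""
    sparse_n = min_max_normalize(sparse)
    dense_n = min_max_normalize(dense)
    fused = {
        # A document missing from one run contributes 0.0 from that run.
        doc_id: alpha * sparse_n.get(doc_id, 0.0)
        + (1.0 - alpha) * dense_n.get(doc_id, 0.0)
        for doc_id in set(sparse_n) | set(dense_n)
    }
    # Sort by descending score, breaking ties by document id for determinism.
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))[:k]


# Toy scores for a single query (made-up values, not taken from the dataset).
bm25_run = {"d1": 10.0, "d2": 5.0, "d3": 0.0}
mdpr_run = {"d2": 4.0, "d3": 2.0, "d4": 0.0}
ranking = hybrid_fusion(bm25_run, mdpr_run, alpha=0.5)
print([doc_id for doc_id, _ in ranking])  # ['d2', 'd1', 'd3', 'd4']
```

A document retrieved by both runs (here `d2`) can outrank the top hit of either run alone, which is the usual motivation for combining lexical and dense signals.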
If you find our paper useful or use the dataset in your work, please cite our paper and the CLIRMatrix paper:
@inproceedings{africlirmatrix,
title = "{AfriCLIRMatrix}: Enabling Cross-Lingual Information Retrieval for African Languages",
author = "Ogundepo, Odunayo and Zhang, Xinyu and Sun, Shuo and Duh, Kevin and Lin, Jimmy",
booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
month = dec,
year = "2022",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.emnlp-main.597",
}
@inproceedings{sun-duh-2020-clirmatrix,
title = "{CLIRM}atrix: A massively large collection of bilingual and multilingual datasets for Cross-Lingual Information Retrieval",
author = "Sun, Shuo and Duh, Kevin",
booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2020.emnlp-main.340",
doi = "10.18653/v1/2020.emnlp-main.340",
}
If you have any questions or suggestions regarding the dataset, code, or publication, please contact Ogundepo Odunayo (oogundep[at]uwaterloo.ca).