This guide is for anyone who would like to participate and make a submission to the CIRAL track, with or without an information retrieval background. CIRAL focuses on cross-lingual information retrieval (queries in one language, documents in another) for African languages, with the goals of carrying out community evaluations and, more generally, building community.
For a quick overview of the task, please refer to the main README of the repo or the track's website.
As a starting point, we will mainly work with two toolkits designed to foster research in information retrieval. The first is Anserini, built in Java on the Lucene search library; the second is Pyserini, which supports both sparse and dense representations and integrates Anserini. The list below is a good path to follow sequentially in getting started with the basics of IR (each document is also linked from the previous one):
- To understand the retrieval problem, a high-level view of retrieval systems, and core concepts: Anserini: Start Here
- Using Anserini to index, search, and evaluate: Anserini: BM25 Baselines for MS MARCO Passage Ranking
- Using Pyserini to index, search, and evaluate: Pyserini: BM25 Baseline for MS MARCO Passage Ranking
- Learning about the relationship between sparse and dense retrieval: Pyserini: A Conceptual Framework for Retrieval
- Working with an actual dense retrieval model: Pyserini: Contriever Baseline for NFCorpus
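Before moving on to the hands-on steps, the key idea behind the sparse/dense relationship can be sketched in a few lines: both paradigms score a document against a query with an inner product of vector representations, and differ mainly in how those vectors are built. A toy illustration, with a made-up vocabulary and fake fixed embeddings standing in for a real encoder:

```python
# Toy sketch: sparse and dense retrieval both score query-document
# pairs with an inner product; only the vectors differ.

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Sparse: high-dimensional vectors with one slot per vocabulary term.
# Here we use raw term counts; BM25 would use weighted counts instead.
vocab = ["rain", "weather", "sunny", "forecast"]

def sparse_vec(text):
    tokens = text.lower().split()
    return [tokens.count(term) for term in vocab]

query = "weather forecast"
doc = "the weather forecast says rain"
sparse_score = dot(sparse_vec(query), sparse_vec(doc))

# Dense: low-dimensional vectors produced by an encoder. We fake the
# encoder with fixed 3-d embeddings to show only the scoring step.
fake_encoder = {
    "weather forecast": [0.1, 0.8, 0.2],
    "the weather forecast says rain": [0.2, 0.7, 0.1],
}
dense_score = dot(fake_encoder[query], fake_encoder[doc])

print(sparse_score)   # term overlap on "weather" and "forecast"
print(dense_score)
```

The reading list above develops this framing properly; the sketch is only meant to make the "same scoring, different representations" point concrete.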
Now that we have a basic understanding of IR, Anserini, and Pyserini, we can try some very simple retrieval with the dev queries provided in CIRAL, using BM25. This is done with Pyserini:
- If not done already, clone Pyserini and install its development version according to this guide.
- Using the following commands, copy the topic and qrel files from the Hugging Face repo to `tools/topics-and-qrels` in the cloned Pyserini repo:

```shell
git clone https://huggingface.co/datasets/CIRAL/ciral
cp -r ciral/*/*/* $PYSERINI_PATH/tools/topics-and-qrels/
```
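The copied files follow common TREC-style layouts: topics are a TSV of query id and query text, and qrels have four whitespace-separated columns (query id, an unused column, passage id, relevance judgment). A minimal sketch of reading them, assuming those layouts; the rows, ids, and the Hausa query below are made up for illustration:

```python
# Sketch of the topic and qrel file formats, using hypothetical rows.
# The real files live in tools/topics-and-qrels/.

def load_topics(lines):
    """Map query id -> query text from TSV lines (qid <tab> query)."""
    topics = {}
    for line in lines:
        qid, query = line.rstrip("\n").split("\t", 1)
        topics[qid] = query
    return topics

def load_qrels(lines):
    """Map query id -> {passage id: relevance} from TREC-style qrels."""
    qrels = {}
    for line in lines:
        qid, _, docid, rel = line.split()
        qrels.setdefault(qid, {})[docid] = int(rel)
    return qrels

# Hypothetical example rows, for illustration only.
topics = load_topics(["1\tMenene sunan shugaban Najeriya?"])
qrels = load_qrels(["1 0 doc42#3 1", "1 0 doc99#0 0"])
print(topics["1"])
print(qrels["1"])
```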
- Run batch retrieval using CIRAL's pre-built BM25 indexes. `{lang}` represents the language code for any of the four languages: yo (Yoruba), so (Somali), ha (Hausa), or sw (Swahili):
```shell
python -m pyserini.search.lucene \
  --language {lang} \
  --topics tools/topics-and-qrels/topics.ciral-v1.0-{lang}-dev.tsv \
  --index ciral-v1.0-{lang} \
  --output runs/run.ciral-v1.0-{lang}.bm25.dev.txt \
  --pretokenized \
  --batch 128 --threads 16 --bm25 --hits 1000
```
This saves the run (the retrieved passages) in `runs/run.ciral-v1.0-{lang}.bm25.dev.txt`. You can inspect the file to see what the output (in other words, a submission file) looks like.
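Each line of the run file is in the standard TREC run format: query id, the literal `Q0`, passage id, rank, score, and a run tag. A minimal parser sketch, with a made-up example line:

```python
# Parse one line of a TREC-format run file into a small dict.

def parse_run_line(line):
    qid, _, docid, rank, score, tag = line.split()
    return {"qid": qid, "docid": docid, "rank": int(rank),
            "score": float(score), "tag": tag}

# Hypothetical line, for illustration only.
hit = parse_run_line("1 Q0 doc42#3 1 14.3700 Anserini")
print(hit["docid"], hit["rank"], hit["score"])
```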
- Next, we evaluate the run. The official metrics for the track are ndcg@20 and recall@100, but here we only evaluate recall@1000 (i.e., the proportion of relevant passages returned within the 1000 passages retrieved per query):
```shell
python -m pyserini.eval.trec_eval -c -m recall.1000 \
  tools/topics-and-qrels/qrels.ciral-v1.0-{lang}-dev.tsv \
  runs/run.ciral-v1.0-{lang}.bm25.dev.txt
```
This should give the following results for recall@1000:

| Language | BM25 (default) |
|---|---|
| Yoruba (yo) | 0.6010 |
| Swahili (sw) | 0.1333 |
| Somali (so) | 0.1267 |
| Hausa (ha) | 0.1050 |
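To make the metric concrete, here is a small sketch of what recall at a cutoff computes: for each query, the fraction of its relevant passages found among the top-k retrieved, averaged over queries. The ids below are toy data, not CIRAL's:

```python
# Toy illustration of recall@k, the quantity trec_eval reports
# for the measure "recall.1000".

def recall_at_k(run, qrels, k=1000):
    """Mean over queries of (relevant passages retrieved in top k)
    divided by (total relevant passages for that query)."""
    scores = []
    for qid, judgments in qrels.items():
        rel_ids = {d for d, rel in judgments.items() if rel > 0}
        if not rel_ids:
            continue  # skip queries with no relevant passages
        retrieved = set(run.get(qid, [])[:k])
        scores.append(len(retrieved & rel_ids) / len(rel_ids))
    return sum(scores) / len(scores)

# Hypothetical qrels and ranked run lists.
qrels = {"1": {"dA": 1, "dB": 1}, "2": {"dC": 1}}
run = {"1": ["dA", "dX", "dB"], "2": ["dY", "dZ"]}
print(recall_at_k(run, qrels))  # (2/2 + 0/1) / 2 = 0.5
```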
Next, we can reproduce CIRAL's sparse and dense retrieval baselines as described in the main README.
To train or finetune your own dense retrieval model, the Tevatron toolkit is a good place to start:
- Examples on different retrieval tasks
- Documentation
CIRAL is a test collection, so the queries and qrels provided are not sufficient to train a dense retriever effectively. We therefore suggest other multilingual and cross-lingual datasets for training/finetuning:

- AfriCLIRMatrix: a multilingual CLIR collection for African languages, with topics in English and documents in African languages. Contains data for all four CIRAL languages. Dataset
- MIRACL: a multilingual dataset with queries formulated as natural-language questions and retrieval done at the passage level. Includes Swahili and Yoruba. Dataset
- CLIRMatrix: another multilingual CLIR collection which includes Swahili and Yoruba. Dataset