This guide is for anyone who would like to participate and make a submission to the CIRAL track, with or without an information retrieval background. CIRAL focuses on cross-lingual information retrieval (queries in one language, documents in another) for African languages, with the goals of carrying out community evaluations and, more generally, building community.
For a quick overview of the task, please refer to the main README of the repo or the track's website.
As a starting point, we will mainly work with two toolkits designed to foster research in information retrieval. The first is Anserini, built in Java on the Lucene search library; the second is Pyserini, which supports both sparse and dense representations and integrates Anserini. The list below is a good path to follow sequentially in getting started with the basics of IR (each document is also linked from the previous one):
- To understand the retrieval problem, a high-level view of retrieval systems, and core concepts: Anserini: Start Here
- Using Anserini to index, search, and evaluate: Anserini: BM25 Baselines for MS MARCO Passage Ranking
- Using Pyserini to index, search, and evaluate: Pyserini: BM25 Baseline for MS MARCO Passage Ranking
- Learning about the relationship between sparse and dense retrieval: Pyserini: A Conceptual Framework for Retrieval
- Working with an actual dense retrieval model: Pyserini: Contriever Baseline for NFCorpus
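Before moving on to the hands-on steps, the key idea behind the sparse/dense relationship can be sketched in a few lines: both paradigms score a document against a query with an inner product of vector representations, and differ mainly in how those vectors are built. A toy illustration, with a made-up vocabulary and fake fixed embeddings standing in for a real encoder:

```python
# Toy sketch: sparse and dense retrieval both score query-document
# pairs with an inner product; only the vectors differ.

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Sparse: high-dimensional vectors with one slot per vocabulary term.
# Here we use raw term counts; BM25 would use weighted counts instead.
vocab = ["rain", "weather", "sunny", "forecast"]

def sparse_vec(text):
    tokens = text.lower().split()
    return [tokens.count(term) for term in vocab]

query = "weather forecast"
doc = "the weather forecast says rain"
sparse_score = dot(sparse_vec(query), sparse_vec(doc))

# Dense: low-dimensional vectors produced by an encoder. We fake the
# encoder with fixed 3-d embeddings to show only the scoring step.
fake_encoder = {
    "weather forecast": [0.1, 0.8, 0.2],
    "the weather forecast says rain": [0.2, 0.7, 0.1],
}
dense_score = dot(fake_encoder[query], fake_encoder[doc])

print(sparse_score)   # term overlap on "weather" and "forecast"
print(dense_score)
```

The reading list above develops this framing properly; the sketch is only meant to make the "same scoring, different representations" point concrete.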
Now that we have a basic understanding of IR, Anserini, and Pyserini, we can try some very simple retrieval with the dev queries provided in CIRAL, using BM25. This is done with Pyserini:
- If not done already, clone Pyserini and install its development version according to this guide.
- Using the following commands, copy the topic and qrel files from the Hugging Face repo to `tools/topics-and-qrels` in the cloned Pyserini repo:

```shell
git clone https://huggingface.co/datasets/CIRAL/ciral
cp -r ciral/*/*/* $PYSERINI_PATH/tools/topics-and-qrels/
```
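The copied files follow common TREC-style layouts: topics are a TSV of query id and query text, and qrels have four whitespace-separated columns (query id, an unused column, passage id, relevance judgment). A minimal sketch of reading them, assuming those layouts; the rows, ids, and the Hausa query below are made up for illustration:

```python
# Sketch of the topic and qrel file formats, using hypothetical rows.
# The real files live in tools/topics-and-qrels/.

def load_topics(lines):
    """Map query id -> query text from TSV lines (qid <tab> query)."""
    topics = {}
    for line in lines:
        qid, query = line.rstrip("\n").split("\t", 1)
        topics[qid] = query
    return topics

def load_qrels(lines):
    """Map query id -> {passage id: relevance} from TREC-style qrels."""
    qrels = {}
    for line in lines:
        qid, _, docid, rel = line.split()
        qrels.setdefault(qid, {})[docid] = int(rel)
    return qrels

# Hypothetical example rows, for illustration only.
topics = load_topics(["1\tMenene sunan shugaban Najeriya?"])
qrels = load_qrels(["1 0 doc42#3 1", "1 0 doc99#0 0"])
print(topics["1"])
print(qrels["1"])
```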
- Run batch retrieval using CIRAL's pre-built BM25 indexes. `{lang}` represents the language code for any of the four languages: yo (Yoruba), so (Somali), ha (Hausa), or sw (Swahili):
```shell
python -m pyserini.search.lucene \
  --language {lang} \
  --topics tools/topics-and-qrels/topics.ciral-v1.0-{lang}-dev.tsv \
  --index ciral-v1.0-{lang} \
  --output runs/run.ciral-v1.0-{lang}.bm25.dev.txt \
  --pretokenized \
  --batch 128 --threads 16 --bm25 --hits 1000
```
This saves the run (the retrieved passages) in `runs/run.ciral-v1.0-{lang}.bm25.dev.txt`. You can inspect the file to see what the output (in other words, a submission file) looks like.
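Each line of the run file is in the standard TREC run format: query id, the literal `Q0`, passage id, rank, score, and a run tag. A minimal parser sketch, with a made-up example line:

```python
# Parse one line of a TREC-format run file into a small dict.

def parse_run_line(line):
    qid, _, docid, rank, score, tag = line.split()
    return {"qid": qid, "docid": docid, "rank": int(rank),
            "score": float(score), "tag": tag}

# Hypothetical line, for illustration only.
hit = parse_run_line("1 Q0 doc42#3 1 14.3700 Anserini")
print(hit["docid"], hit["rank"], hit["score"])
```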
- Next, we evaluate the run. The official metrics for the track are ndcg@20 and recall@100, but here we only evaluate recall@1000 (i.e., the proportion of relevant passages returned within the 1000 passages retrieved per query):
```shell
python -m pyserini.eval.trec_eval -c -m recall.1000 \
  tools/topics-and-qrels/qrels.ciral-v1.0-{lang}-dev.tsv \
  runs/run.ciral-v1.0-{lang}.bm25.dev.txt
```
This should give the following results for recall@1000:

| Language | BM25 (default) |
|---|---|
| Yoruba (yo) | 0.6010 |
| Swahili (sw) | 0.1333 |
| Somali (so) | 0.1267 |
| Hausa (ha) | 0.1050 |
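To make the metric concrete, here is a small sketch of what recall at a cutoff computes: for each query, the fraction of its relevant passages found among the top-k retrieved, averaged over queries. The ids below are toy data, not CIRAL's:

```python
# Toy illustration of recall@k, the quantity trec_eval reports
# for the measure "recall.1000".

def recall_at_k(run, qrels, k=1000):
    """Mean over queries of (relevant passages retrieved in top k)
    divided by (total relevant passages for that query)."""
    scores = []
    for qid, judgments in qrels.items():
        rel_ids = {d for d, rel in judgments.items() if rel > 0}
        if not rel_ids:
            continue  # skip queries with no relevant passages
        retrieved = set(run.get(qid, [])[:k])
        scores.append(len(retrieved & rel_ids) / len(rel_ids))
    return sum(scores) / len(scores)

# Hypothetical qrels and ranked run lists.
qrels = {"1": {"dA": 1, "dB": 1}, "2": {"dC": 1}}
run = {"1": ["dA", "dX", "dB"], "2": ["dY", "dZ"]}
print(recall_at_k(run, qrels))  # (2/2 + 0/1) / 2 = 0.5
```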
Next, we can reproduce CIRAL's sparse and dense retrieval baselines as described in the main README.
To train or finetune your own dense retrieval model, the Tevatron toolkit is a good place to start:
- Examples on different retrieval tasks
- Documentation
CIRAL is a test collection, so the queries and qrels provided are not sufficient to train a dense retriever effectively. We therefore suggest other multilingual and cross-lingual datasets for training/finetuning:

- AfriCLIRMatrix: a multilingual CLIR collection for African languages, with topics in English and documents in African languages. Contains data for all four CIRAL languages. Dataset
- MIRACL: a multilingual dataset with queries formulated as natural-language questions and retrieval done at the passage level. Includes Swahili and Yoruba. Dataset
- CLIRMatrix: another multilingual CLIR collection which includes Swahili and Yoruba. Dataset