Quick Start on Information Retrieval (IR) and the CIRAL Task

This guide is for anyone who would like to participate in and make a submission to the CIRAL track, with or without an information retrieval background. CIRAL focuses on cross-lingual information retrieval (queries in one language, documents in another) for African languages, with the goal of carrying out community evaluations and, more generally, building community.

For a quick overview of the task, please refer to the main README of the repo or the track's website.

📚 Introduction to IR

As a starting point in IR, we will mainly work with two toolkits designed to foster research in information retrieval. The first is Anserini, built in Java on the Lucene search library; the second is Pyserini, which supports both sparse and dense representations and has Anserini integrated. The list provided is a good path to follow sequentially for getting started with the basics of IR (each document also links to the next one in the sequence).
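
Before diving into CIRAL, it helps to run one end-to-end search with Pyserini. Below is a minimal sketch using one of Pyserini's prebuilt indexes (msmarco-v1-passage is used purely as a familiar example; any prebuilt index works):

from pyserini.search.lucene import LuceneSearcher

# Download a prebuilt Lucene index (cached locally) and run a BM25 search.
searcher = LuceneSearcher.from_prebuilt_index('msmarco-v1-passage')
hits = searcher.search('what is information retrieval?', k=10)

# Each hit carries the document id and its BM25 score.
for i, hit in enumerate(hits):
    print(f'{i + 1:2} {hit.docid:20} {hit.score:.4f}')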

📚 Trying out Retrieval with CIRAL's Dev Queries

Now that we have a basic understanding of IR, Anserini, and Pyserini, we can try some very simple retrieval with CIRAL's provided dev queries using BM25. We will do this with Pyserini:

  1. If you have not done so already, clone Pyserini and install the development version according to this guide.

  2. Using the following commands, copy the topic and qrel files from the Hugging Face repo to tools/topics-and-qrels in the cloned Pyserini repo.

git clone https://huggingface.co/datasets/CIRAL/ciral
cp -r ciral/*/*/* $PYSERINI_PATH/tools/topics-and-qrels/
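
As a quick sanity check that the copy worked, you can print a few of the dev queries. The sketch below assumes the topic files follow Pyserini's usual tab-separated qid<TAB>query layout, using Hausa as an example:

# Peek at the first few Hausa dev queries (assumes qid<TAB>query TSV lines).
with open('tools/topics-and-qrels/topics.ciral-v1.0-ha-dev.tsv') as f:
    for line in list(f)[:5]:
        qid, query = line.rstrip('\n').split('\t', 1)
        print(qid, query)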
  3. Run batch retrieval using CIRAL's pre-built BM25 indexes. {lang} is the language code for any of the four languages: yo (Yoruba), so (Somali), ha (Hausa), or sw (Swahili).

python -m pyserini.search.lucene \
  --language {lang} \
  --topics tools/topics-and-qrels/topics.ciral-v1.0-{lang}-dev.tsv \
  --index ciral-v1.0-{lang} \
  --output runs/run.ciral-v1.0-{lang}.bm25.dev.txt \
  --pretokenized \
  --batch 128 --threads 16 --bm25 --hits 1000

This saves the run (the retrieved passages) to runs/run.ciral-v1.0-{lang}.bm25.dev.txt. You can inspect the file to see what the output (in other words, a submission file) looks like.
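
Each line of the run file is in standard TREC format: qid Q0 docid rank score tag. Here is a small sketch for pulling out the top-ranked passage per query (the Hausa run is used as an example path):

# Read a TREC-format run file and keep the rank-1 passage for each query.
top_hits = {}
with open('runs/run.ciral-v1.0-ha.bm25.dev.txt') as f:
    for line in f:
        qid, _, docid, rank, score, _tag = line.split()
        if int(rank) == 1:
            top_hits[qid] = (docid, float(score))

# Show the top hit for the first few queries.
for qid, (docid, score) in list(top_hits.items())[:5]:
    print(qid, docid, score)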

  4. Next, we evaluate the run. The official metrics for the track are nDCG@20 and recall@100, but here we only evaluate recall@1000 (i.e., the fraction of relevant passages returned among the 1000 passages retrieved per query):
python -m pyserini.eval.trec_eval -c -m recall.1000 tools/topics-and-qrels/qrels.ciral-v1.0-{lang}-dev.tsv runs/run.ciral-v1.0-{lang}.bm25.dev.txt

This should give the following results:

recall@1000

| Language | BM25 (default) |
|--------------|----------------|
| Yoruba (yo) | 0.6010 |
| Swahili (sw) | 0.1333 |
| Somali (so) | 0.1267 |
| Hausa (ha) | 0.1050 |
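
To make the metric concrete: recall@1000 is, for each query, the fraction of its relevant passages that show up anywhere in the 1000 retrieved passages, averaged over all queries. The following sketch recomputes it directly from the qrels and run files (assuming standard TREC formats for both, with Hausa as the example language):

from collections import defaultdict

# qrels lines: qid 0 docid relevance; run lines: qid Q0 docid rank score tag
relevant, retrieved = defaultdict(set), defaultdict(set)

with open('tools/topics-and-qrels/qrels.ciral-v1.0-ha-dev.tsv') as f:
    for line in f:
        qid, _, docid, rel = line.split()
        if int(rel) > 0:
            relevant[qid].add(docid)

with open('runs/run.ciral-v1.0-ha.bm25.dev.txt') as f:
    for line in f:
        qid, _, docid, *_rest = line.split()
        retrieved[qid].add(docid)  # the run is already capped at --hits 1000

# Macro-average recall over queries that have at least one relevant passage.
recalls = [len(relevant[q] & retrieved[q]) / len(relevant[q]) for q in relevant]
print(f'recall@1000: {sum(recalls) / len(recalls):.4f}')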

Reproducing CIRAL's Baselines

Next, we can reproduce CIRAL's sparse and dense retrieval baselines as indicated in the main README.

📚 Training Dense Retrieval Models

To train or fine-tune your own dense retrieval model, the Tevatron toolkit is a good place to start.

📚 Additional Data Sources

CIRAL is a test collection, so the queries and qrels provided are not sufficient to train a dense retriever effectively. We therefore suggest the following multilingual and cross-lingual datasets for training/finetuning:

  • AfriCLIRMatrix: Multilingual CLIR collection for African languages, with topics in English and documents in African languages. Contains data for all four languages. Dataset

  • MIRACL: A multilingual dataset, with queries formulated as natural language questions, and retrieval done at the passage level. Includes Swahili and Yoruba. Dataset.

  • CLIRMatrix: Another multilingual CLIR collection which includes Swahili and Yoruba. Dataset

  • WikiCLIR: Includes only Swahili. Dataset, Swahili
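
Most of these datasets are hosted on the Hugging Face Hub, so they can be pulled with the datasets library. Here is a sketch for loading MIRACL's Swahili portion (the dataset id miracl/miracl, the sw config, and the field names are assumptions; check the dataset card for the exact identifiers):

from datasets import load_dataset

# Load MIRACL's Swahili dev split; id, config, and fields assumed from the card.
miracl_sw = load_dataset('miracl/miracl', 'sw', split='dev')
example = miracl_sw[0]
print(example['query'])
print(example['positive_passages'][0]['text'])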