Repro: Synthetic multilingual retrieval models on MIRACL #36

Open
thakur-nandan opened this issue Apr 3, 2024 · 1 comment

@thakur-nandan (Member)

We have a new project on multilingual retrieval and reproducibility, and we are looking for two URA students to work with us.

Feel free to reach out on Slack or email us at nandant@gmail.com, xzhangbx@gmail.com.

Synthetic data-based multilingual LLM retrieval models

25th March 2024
Supervised by: Nandan Thakur, Crystina (Xinyu Zhang)
Working style: weekly sync-up meetings (Slack for urgent issues and code debugging)

OVERVIEW

SWIM-X models have recently been shown to perform strongly in cross-lingual and multilingual retrieval settings (source). However, encoder-only models (mT5) are restricted to a context length of roughly 512 tokens, and they require large amounts of synthetic training data for retrieval pre-training and fine-tuning, which makes extending them across ~101 languages difficult.
The first line of work is to benchmark SWIM-X in Pyserini and build reproducible baselines, as a warm-up for getting familiar with the existing models and datasets.
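
As a concrete starting point for this warm-up, here is a minimal sketch (under stated assumptions, not a prescribed setup) of a BM25 baseline on one MIRACL language using Pyserini's prebuilt resources; the exact index and topic names should be checked against the Pyserini docs:

```python
# Hedged warm-up sketch: BM25 baseline on one MIRACL language with Pyserini.
# The prebuilt index/topic names below follow the naming in the Pyserini
# MIRACL docs; verify them against the installed release.
from pyserini.search import get_topics
from pyserini.search.lucene import LuceneSearcher

lang = 'sw'  # Swahili, one of the 18 MIRACL languages
searcher = LuceneSearcher.from_prebuilt_index(f'miracl-v1.0-{lang}')
searcher.set_language(lang)  # use the language-specific analyzer

topics = get_topics(f'miracl-v1.0-{lang}-dev')
with open(f'run.miracl.bm25.{lang}.dev.txt', 'w') as out:
    for qid, topic in topics.items():
        hits = searcher.search(topic['title'], k=100)
        for rank, hit in enumerate(hits, start=1):
            out.write(f'{qid} Q0 {hit.docid} {rank} {hit.score:.4f} bm25\n')
```

The run file can then be scored for nDCG@10 with Pyserini's trec_eval wrapper (python -m pyserini.eval.trec_eval), matching MIRACL's official metric.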

RELATED WORK

mE5-mistral-7B (source) is a recently introduced multilingual decoder-based retrieval model built on the Mistral-7B LLM. However, its training dataset is unavailable, and the model relies on a large amount of high-quality synthetic training data generated with GPT-4. Our work will instead focus on efficient fine-tuning using a smaller subset of multilingual training data.

Research Questions

  1. Baseline: Reproduce SWIM-X (source) and push for a two-click reproduction (2CR) within Pyserini/Anserini.
  2. Compare the SWIM-X reproduction against other multilingual retrieval LLMs such as mE5-mistral-7B and Cohere Command-R (see the evaluation sketch below).
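
For question 2, one low-friction way to compare off-the-shelf embedding models on MIRACL is the MTEB harness. A minimal sketch, assuming the MIRACLRetrieval task name and the MTEB(tasks=..., task_langs=...) API; both should be verified against the installed mteb version:

```python
# Hedged sketch: evaluate a multilingual embedding model on MIRACL via MTEB.
# The task name and the task_langs argument are assumptions that depend on
# the installed mteb version.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Note: mE5 models expect "query: " / "passage: " prefixes, which MTEB does
# not add automatically, so treat these scores as a rough reference only.
model = SentenceTransformer('intfloat/multilingual-e5-base')

evaluation = MTEB(tasks=['MIRACLRetrieval'], task_langs=['sw'])
evaluation.run(model, output_folder='results/me5-base')
```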

Future Scope

Further, we would like to examine multilingual LLMs (in contrast to the mT5 backbone used in SWIM-X) trained on a small, few-shot, synthetic-only dataset. Would we still require a training set as large as SWIM-X's, or would a few examples per language be enough for a multilingual retrieval LLM? And how do we extend the model across the 101 languages in mC4? Concretely:

  • Explore the best approach to fine-tuning LLM-based retrieval models such as Gemma-2b/Mistral-7b-v0.2 on the SWIM-IR dataset.
  • Investigate the minimum number of synthetic training pairs required (a subsampling sketch follows below).
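
To make the data-scale question measurable, one option is to carve fixed-size few-shot subsets out of SWIM-IR and fine-tune the same backbone at each budget. A hypothetical sketch; the Hugging Face dataset id and field layout are assumptions to check against the SWIM-IR release:

```python
# Hypothetical few-shot data ablation: sample k synthetic (query, passage)
# pairs per language from SWIM-IR and save each subset, so one backbone can
# be fine-tuned at several data scales. Dataset id/schema are assumptions.
from datasets import load_dataset

LANGS = ['ar', 'bn', 'sw', 'te']    # a subset of MIRACL languages
BUDGETS = [32, 128, 512, 2048]      # candidate few-shot training sizes

for lang in LANGS:
    ds = load_dataset('nthakur/swim-ir-monolingual', lang, split='train')
    for k in BUDGETS:
        subset = ds.shuffle(seed=42).select(range(min(k, len(ds))))
        subset.to_json(f'swim_ir.{lang}.{k}.jsonl')
```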

Resources

  • Tevatron/LoRA for fine-tuning LLMs for retrieval: GitHub (a generic LoRA sketch follows this list)
  • MTEB/MIRACL for retrieval evaluation: GitHub
  • SWIM-IR datasets: GitHub
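
Tevatron ships its own LoRA training driver; for orientation, the sketch below shows the generic recipe behind it (LoRA adapters over a decoder backbone, last-token pooling, in-batch-negative InfoNCE) using transformers + peft. The backbone, target modules, and hyperparameters are placeholders, not the project's settings, and this is not Tevatron's actual API:

```python
# Generic LoRA bi-encoder sketch with transformers + peft (NOT Tevatron's
# API). Backbone, target modules, and hyperparameters are placeholders.
import torch
import torch.nn.functional as F
from peft import LoraConfig, get_peft_model
from transformers import AutoModel, AutoTokenizer

name = 'google/gemma-2b'  # one candidate backbone from this proposal
tok = AutoTokenizer.from_pretrained(name)
tok.padding_side = 'right'  # last-token pooling below assumes right-padding
base = AutoModel.from_pretrained(name)
model = get_peft_model(base, LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=['q_proj', 'k_proj', 'v_proj', 'o_proj'],
))

def embed(texts):
    batch = tok(texts, padding=True, truncation=True, max_length=256,
                return_tensors='pt')
    hidden = model(**batch).last_hidden_state
    # last-token pooling, as used by decoder-based retrievers (e.g. RepLLaMA)
    last = batch['attention_mask'].sum(dim=1) - 1
    return F.normalize(hidden[torch.arange(hidden.size(0)), last], dim=-1)

def info_nce(queries, passages, tau=0.05):
    # in-batch negatives: the i-th query matches the i-th passage
    q, p = embed(queries), embed(passages)
    return F.cross_entropy(q @ p.T / tau, torch.arange(q.size(0)))
```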

MILESTONES

  1. Reproduce SWIM-X models in Pyserini (M1)
    • Reproduce the SWIM-X models, create 2CR checkpoints for MIRACL, and include them in Pyserini. Also reproduce the evaluation on XOR-RETRIEVE and XTREME-UP.
  2. Familiarize with LLM retrieval fine-tuning (M2)
    • Run experiments to reproduce the RankLLaMA example (GitHub), then use it as a template to extend Gemma-2b/Mistral-7b to multilingual retrieval datasets, whether synthetic (SWIM-IR), human-labeled (MIRACL), or translated (mMARCO). A data-prep sketch follows this list.
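
As a starting point for the data side of M2, the sketch below converts a MIRACL training split into the JSONL layout that Tevatron-style trainers consume (one record holding a query plus its positive and negative passages). The Hugging Face dataset id and field names are assumptions to verify against the MIRACL release:

```python
# Hedged data-prep sketch for M2: write MIRACL training data as JSONL in a
# Tevatron-style layout. Dataset id and field names are assumptions.
import json
from datasets import load_dataset

lang = 'sw'
ds = load_dataset('miracl/miracl', lang, split='train')
with open(f'miracl.{lang}.train.jsonl', 'w') as out:
    for ex in ds:
        out.write(json.dumps({
            'query_id': ex['query_id'],
            'query': ex['query'],
            'positive_passages': ex['positive_passages'],
            'negative_passages': ex['negative_passages'],
        }, ensure_ascii=False) + '\n')
```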

FUTURE MILESTONES

  1. Few-shot LLM retrieval fine-tuning (M3)
    • Depending on the results in M2, extend the models to fine-tuning on only a few examples per language (an idea similar to SetFit, GitHub). Find the optimal number of training examples required for each language.
  2. Extending multilingual LLMs to 101 languages (M4)
    • If M3 works out, we can generate synthetic datasets for all 101 languages covered by mC4 and fine-tune a multilingual LLM across them.

RELEVANT READING MATERIAL

@thakur-nandan (Member, Author)

@Richard5678 has started to work on this issue.

First, he will focus on reproducing the 2CR on MIRACL.
