Repro: Synthetic multilingual retrieval models on MIRACL #36

Open
thakur-nandan opened this issue Apr 3, 2024 · 1 comment

@thakur-nandan (Member)

We have a new project on multilingual retrieval and reproducibility, and we are looking for two URA students to work with us.

Feel free to reach out on Slack or email us at nandant@gmail.com, xzhangbx@gmail.com.

Synthetic data-based multilingual LLM retrieval models

25th March 2024
Supervised by: Nandan Thakur, Crystina (Xinyu Zhang)
Working style: weekly sync-up meetings (Slack for urgent issues and code debugging)

OVERVIEW

SWIM-X models have recently been shown to perform strongly in cross-lingual and multilingual retrieval settings (source). However, encoder-only models (mT5) are restricted to a context length of roughly 512 tokens, and they require large amounts of synthetic training data for retrieval pre-training and fine-tuning, which makes extending them across ~101 languages difficult.
The first line of work is to benchmark SWIM-X in Pyserini and build reproducible baselines, as a warm-up for getting familiar with the existing models and datasets.
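
As a concrete starting point for this warm-up, here is a minimal sketch (under stated assumptions, not a prescribed setup) of a BM25 baseline on one MIRACL language using Pyserini's prebuilt resources; the exact index and topic names should be checked against the Pyserini docs:

```python
# Hedged warm-up sketch: BM25 baseline on one MIRACL language with Pyserini.
# The prebuilt index/topic names below follow the naming in the Pyserini
# MIRACL docs; verify them against the installed release.
from pyserini.search import get_topics
from pyserini.search.lucene import LuceneSearcher

lang = 'sw'  # Swahili, one of the 18 MIRACL languages
searcher = LuceneSearcher.from_prebuilt_index(f'miracl-v1.0-{lang}')
searcher.set_language(lang)  # use the language-specific analyzer

topics = get_topics(f'miracl-v1.0-{lang}-dev')
with open(f'run.miracl.bm25.{lang}.dev.txt', 'w') as out:
    for qid, topic in topics.items():
        hits = searcher.search(topic['title'], k=100)
        for rank, hit in enumerate(hits, start=1):
            out.write(f'{qid} Q0 {hit.docid} {rank} {hit.score:.4f} bm25\n')
```

The run file can then be scored for nDCG@10 with Pyserini's trec_eval wrapper (python -m pyserini.eval.trec_eval), matching MIRACL's official metric.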

RELATED WORK

mE5-mistral-7B (source) is a recently introduced multilingual decoder-based retrieval model built on the Mistral-7B LLM. However, its training dataset is unavailable, and the model relies on a large amount of high-quality synthetic training data generated with GPT-4. Our work will instead focus on efficient fine-tuning using a smaller subset of multilingual training data.

Research Questions

  1. Baseline: Reproduce SWIM-X (source) and push for a two-click reproduction (2CR) within Pyserini/Anserini.
  2. Compare the SWIM-X reproduction against other multilingual retrieval LLMs such as mE5-mistral-7B and Cohere Command-R (see the evaluation sketch below).
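
For question 2, one low-friction way to compare off-the-shelf embedding models on MIRACL is the MTEB harness. A minimal sketch, assuming the MIRACLRetrieval task name and the MTEB(tasks=..., task_langs=...) API; both should be verified against the installed mteb version:

```python
# Hedged sketch: evaluate a multilingual embedding model on MIRACL via MTEB.
# The task name and the task_langs argument are assumptions that depend on
# the installed mteb version.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Note: mE5 models expect "query: " / "passage: " prefixes, which MTEB does
# not add automatically, so treat these scores as a rough reference only.
model = SentenceTransformer('intfloat/multilingual-e5-base')

evaluation = MTEB(tasks=['MIRACLRetrieval'], task_langs=['sw'])
evaluation.run(model, output_folder='results/me5-base')
```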

Future Scope

Further, we would like to examine multilingual LLMs (in contrast to the mT5 backbone used in SWIM-X) trained on a small, few-shot, synthetic-only dataset. Would we still require a training set as large as SWIM-X's, or would a few examples per language be enough for a multilingual retrieval LLM? And how do we extend the model across the 101 languages in mC4? Concretely:

  • Explore the best approach to fine-tuning LLM-based retrieval models such as Gemma-2b/Mistral-7b-v0.2 on the SWIM-IR dataset.
  • Investigate the minimum number of synthetic training pairs required (a subsampling sketch follows below).
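
To make the data-scale question measurable, one option is to carve fixed-size few-shot subsets out of SWIM-IR and fine-tune the same backbone at each budget. A hypothetical sketch; the Hugging Face dataset id and field layout are assumptions to check against the SWIM-IR release:

```python
# Hypothetical few-shot data ablation: sample k synthetic (query, passage)
# pairs per language from SWIM-IR and save each subset, so one backbone can
# be fine-tuned at several data scales. Dataset id/schema are assumptions.
from datasets import load_dataset

LANGS = ['ar', 'bn', 'sw', 'te']    # a subset of MIRACL languages
BUDGETS = [32, 128, 512, 2048]      # candidate few-shot training sizes

for lang in LANGS:
    ds = load_dataset('nthakur/swim-ir-monolingual', lang, split='train')
    for k in BUDGETS:
        subset = ds.shuffle(seed=42).select(range(min(k, len(ds))))
        subset.to_json(f'swim_ir.{lang}.{k}.jsonl')
```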

Resources

  • Tevatron/LoRA for fine-tuning LLMs for retrieval: GitHub (a generic LoRA sketch follows this list)
  • MTEB/MIRACL for retrieval evaluation: GitHub
  • SWIM-IR datasets: GitHub
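
Tevatron ships its own LoRA training driver; for orientation, the sketch below shows the generic recipe behind it (LoRA adapters over a decoder backbone, last-token pooling, in-batch-negative InfoNCE) using transformers + peft. The backbone, target modules, and hyperparameters are placeholders, not the project's settings, and this is not Tevatron's actual API:

```python
# Generic LoRA bi-encoder sketch with transformers + peft (NOT Tevatron's
# API). Backbone, target modules, and hyperparameters are placeholders.
import torch
import torch.nn.functional as F
from peft import LoraConfig, get_peft_model
from transformers import AutoModel, AutoTokenizer

name = 'google/gemma-2b'  # one candidate backbone from this proposal
tok = AutoTokenizer.from_pretrained(name)
tok.padding_side = 'right'  # last-token pooling below assumes right-padding
base = AutoModel.from_pretrained(name)
model = get_peft_model(base, LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=['q_proj', 'k_proj', 'v_proj', 'o_proj'],
))

def embed(texts):
    batch = tok(texts, padding=True, truncation=True, max_length=256,
                return_tensors='pt')
    hidden = model(**batch).last_hidden_state
    # last-token pooling, as used by decoder-based retrievers (e.g. RepLLaMA)
    last = batch['attention_mask'].sum(dim=1) - 1
    return F.normalize(hidden[torch.arange(hidden.size(0)), last], dim=-1)

def info_nce(queries, passages, tau=0.05):
    # in-batch negatives: the i-th query matches the i-th passage
    q, p = embed(queries), embed(passages)
    return F.cross_entropy(q @ p.T / tau, torch.arange(q.size(0)))
```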

MILESTONES

  1. Reproduce SWIM-X models in Pyserini (M1)
    • Reproduce the SWIM-X models, create 2CR checkpoints for MIRACL, and include them in Pyserini. Also reproduce the evaluation on XOR-RETRIEVE and XTREME-UP.
  2. Familiarize with LLM retrieval fine-tuning (M2)
    • Run experiments to reproduce the RankLLaMA example (GitHub), then use it as a template to extend Gemma-2b/Mistral-7b to multilingual retrieval datasets, whether synthetic (SWIM-IR), human-labeled (MIRACL), or translated (mMARCO). A data-prep sketch follows this list.
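
As a starting point for the data side of M2, the sketch below converts a MIRACL training split into the JSONL layout that Tevatron-style trainers consume (one record holding a query plus its positive and negative passages). The Hugging Face dataset id and field names are assumptions to verify against the MIRACL release:

```python
# Hedged data-prep sketch for M2: write MIRACL training data as JSONL in a
# Tevatron-style layout. Dataset id and field names are assumptions.
import json
from datasets import load_dataset

lang = 'sw'
ds = load_dataset('miracl/miracl', lang, split='train')
with open(f'miracl.{lang}.train.jsonl', 'w') as out:
    for ex in ds:
        out.write(json.dumps({
            'query_id': ex['query_id'],
            'query': ex['query'],
            'positive_passages': ex['positive_passages'],
            'negative_passages': ex['negative_passages'],
        }, ensure_ascii=False) + '\n')
```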

FUTURE MILESTONES

  1. Few-shot LLM retrieval fine-tuning (M3)
    • Depending on the results in M2, extend the models to fine-tuning on only a few examples per language (an idea similar to SetFit, GitHub). Find the optimal number of training examples required for each language.
  2. Extending multilingual LLMs to 101 languages (M4)
    • If M3 works out, we can generate synthetic datasets for all 101 languages covered by mC4 and fine-tune a multilingual LLM across them.

RELEVANT READING MATERIAL

@thakur-nandan (Member, Author)

@Richard5678 has started to work on this issue.

First, he will focus on reproducing the 2CR on MIRACL.
