CLEAR: Cross-Lingual Enhancement in Retrieval via Reverse-training

This repository contains the code for our paper CLEAR: Cross-Lingual Enhancement in Retrieval via Reverse-training. We propose a more effective loss function than standard InfoNCE for cross-lingual information retrieval. CLEAR outperforms InfoNCE across the Belebele, XQuAD, and XOR-QA benchmarks under various settings.

![CLEAR overview](CLEAR_fig.png)
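CLEAR's objective builds on the standard InfoNCE contrastive loss, which scores a query against one positive and several negative candidates. As background, here is a toy pure-Python sketch of InfoNCE for a single query; the embeddings and temperature are illustrative only, not values from the paper:

```python
import math

def infonce_loss(query, positive, negatives, temperature=0.05):
    """InfoNCE: negative log-softmax of the positive's similarity
    among all candidates (positive first, then negatives)."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    sims = [dot(query, positive)] + [dot(query, n) for n in negatives]
    scaled = [s / temperature for s in sims]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    return -math.log(exps[0] / sum(exps))

# Toy 2-d embeddings: the positive is much closer to the query than the negatives,
# so the loss is close to zero.
q = [1.0, 0.0]
pos = [0.9, 0.1]
negs = [[0.1, 0.9], [-1.0, 0.0]]
loss = infonce_loss(q, pos, negs)
```

Swapping the positive for one of the negatives makes the loss large, which is the gradient signal that pulls query and positive embeddings together.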

Installation

Requirements

  • Python 3.8+
  • CUDA-compatible GPU (recommended)

Setup

  1. Clone the repository:
git clone https://github.com/dltmddbs100/CLEAR.git
cd CLEAR
  2. Install dependencies:
pip install -r requirements.txt

This will install:

  • PyTorch 2.6.0
  • Transformers 4.45.0
  • Sentence-Transformers 3.4.1
  • Datasets 2.21.0
  • Accelerate 0.34.2
  • WandB (for logging)
  • Local MTEB package (from ./mteb)

Optional: DeepSpeed Support

For distributed training with DeepSpeed:

pip install deepspeed==0.14.4

Dataset Format

We provide an example training dataset here. Training data should be in the following format:

| Column | Description |
| --- | --- |
| anchor | Source-language query |
| positive | Positive document |
| negative_1 ... negative_n | Hard negative documents |
| cross_anchor | Target-language query (translation of anchor) |
| neg_query_1 ... neg_query_n | Hard negative queries in the target language |

Example:

{
  "anchor": "What is machine learning?",
  "positive": "Machine learning is a subset of artificial intelligence...",
  "negative_1": "Deep learning requires large datasets...",
  "cross_anchor": "Was ist maschinelles Lernen?",
  "neg_query_1": "Wie funktioniert künstliche Intelligenz?"
}

You can use HuggingFace datasets by setting --use_hf_dataset.
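Before training, it can be useful to sanity-check that each record carries the required fields. The helper below is an illustrative sketch, not part of the repository:

```python
def validate_record(record):
    """Return a list of problems with a CLEAR-style training record
    (empty list means the record looks well-formed)."""
    problems = []
    for key in ("anchor", "positive", "cross_anchor"):
        if not record.get(key):
            problems.append("missing required field: " + key)
    if not any(k.startswith("negative_") for k in record):
        problems.append("no hard negative documents (negative_1 ... negative_n)")
    if not any(k.startswith("neg_query_") for k in record):
        problems.append("no hard negative queries (neg_query_1 ... neg_query_n)")
    return problems

# The example record from above passes validation.
record = {
    "anchor": "What is machine learning?",
    "positive": "Machine learning is a subset of artificial intelligence...",
    "negative_1": "Deep learning requires large datasets...",
    "cross_anchor": "Was ist maschinelles Lernen?",
    "neg_query_1": "Wie funktioniert künstliche Intelligenz?",
}
```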

Training

export OMP_NUM_THREADS=32

CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node 4 --master_port 25055 script/train.py \
    --model_name_or_path BAAI/bge-m3 \
    --dataset_path dataset/train_example_de \
    --output_dir ckpt/bge-m3-de-CLEAR \
    --loss_name CachedCLEARLoss \
    --alpha 0.4 \
    --beta 0.2 \
    --kl_div \
    --use_hf_dataset \
    --do_train \
    --per_device_train_batch_size 64 \
    --mini_batch_size 32 \
    --learning_rate 5e-5 \
    --max_steps 50 \
    --warmup_ratio 0.05 \
    --lr_scheduler_type cosine \
    --max_seq_length 512 \
    --bf16 \
    --report_to wandb \
    --save_strategy steps \
    --save_steps 50 \
    --save_total_limit 1
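With the flags above, the batch seen per optimizer step is the per-device batch times the number of GPUs, while --mini_batch_size only sets the chunk size that the cached loss processes per forward pass (a memory-saving device, not the optimization batch). The arithmetic, for illustration:

```python
# Effective batch size for the torchrun command above (illustrative arithmetic).
n_gpus = 4                        # CUDA_VISIBLE_DEVICES=0,1,2,3
per_device_train_batch_size = 64
mini_batch_size = 32              # gradient-cache chunk, not the optimization batch

effective_batch = n_gpus * per_device_train_batch_size          # examples per step
chunks_per_device = per_device_train_batch_size // mini_batch_size
```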

Evaluation

CLEAR uses a customized version of MTEB for evaluation on cross-lingual retrieval benchmarks.

Evaluate on XQuAD

# Monolingual (English)
CUDA_VISIBLE_DEVICES=0 mteb run -m ckpt/your-model \
    --tasks XQuADRetrieval \
    --languages eng-Latn \
    --output_folder eval_results/xquad/en-en \
    --batch_size 32

# Cross-lingual (German query → English documents)
CUDA_VISIBLE_DEVICES=0 mteb run -m ckpt/your-model \
    --tasks XQuADCrossRetrieval_EN_LANG \
    --languages deu-Latn \
    --output_folder eval_results/xquad/en-de \
    --batch_size 32

# Cross-lingual (English query → German documents)
CUDA_VISIBLE_DEVICES=0 mteb run -m ckpt/your-model \
    --tasks XQuADCrossRetrieval_LANG_EN \
    --languages deu-Latn \
    --output_folder eval_results/xquad/de-en \
    --batch_size 32

Evaluate on Belebele

# English
CUDA_VISIBLE_DEVICES=0 mteb run -m ckpt/your-model \
    --tasks BelebeleRetrieval \
    --languages eng-Latn-eng-Latn \
    --output_folder eval_results/belebele/en \
    --batch_size 32

# Other languages
CUDA_VISIBLE_DEVICES=0 mteb run -m ckpt/your-model \
    --tasks BelebeleRetrieval \
    --languages deu-Latn \
    --output_folder eval_results/belebele/de \
    --batch_size 32

Batch Evaluation

Use the provided scripts to evaluate across multiple languages:

# Evaluate trained model
bash script/eval_ours_bele_xquad.sh

# Evaluate baseline (no training)
bash script/eval_notrain_bele_xquad.sh
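If the provided scripts don't cover a language you need, the per-language mteb invocations above can be generated programmatically. This sketch only builds the command strings following the examples in this section; the language-tag mapping is illustrative and not part of the repository's scripts:

```python
# Build `mteb run` commands for several Belebele languages
# (hypothetical helper; language tags are illustrative).
LANGS = {"de": "deu-Latn", "ko": "kor-Hang", "zh": "zho-Hans"}

def belebele_command(model, code, lang_tag):
    """Assemble one Belebele evaluation command as a string."""
    return (
        "mteb run -m " + model
        + " --tasks BelebeleRetrieval"
        + " --languages " + lang_tag
        + " --output_folder eval_results/belebele/" + code
        + " --batch_size 32"
    )

commands = [belebele_command("ckpt/your-model", c, t) for c, t in LANGS.items()]
```

Each string can then be run via a shell loop or subprocess, one GPU per invocation as in the examples above.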

Project Structure

CLEAR/
├── train.py              # Main training script
├── loss.py               # CLEAR loss implementations
├── collator.py           # Data collator for cross-lingual data
├── requirements.txt      # Python dependencies
├── dataset/              # Training datasets
├── ckpt/                 # Model checkpoints
├── eval_results/         # Evaluation outputs
├── logs/                 # Training logs
├── mteb/                 # Custom MTEB package
├── script/               # Training and evaluation scripts
└── asset/                # Assets and figures

Citation

If you find this work useful, please cite:

@misc{lee2026clearcrosslingualenhancementalignment,
      title={CLEAR: Cross-Lingual Enhancement in Alignment via Reverse-training}, 
      author={Seungyoon Lee and Minhyuk Kim and Seongtae Hong and Youngjoon Jang and Dongsuk Oh and Heuiseok Lim},
      year={2026},
      eprint={2604.05821},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2604.05821}, 
}

License

This project is licensed under the MIT License.

Acknowledgements

About

[ACL 2026] Official code of "CLEAR: Cross-Lingual Enhancement in Retrieval via Reverse-training"
