This repository contains the code for our paper CLEAR: Cross-Lingual Enhancement in Retrieval via Reverse-training. We propose a loss function that is more effective than standard InfoNCE for cross-lingual information retrieval. CLEAR outperforms InfoNCE across the Belebele, XQuAD, and XOR-QA benchmarks under various settings.
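For context, the standard InfoNCE objective that CLEAR is compared against looks roughly like the sketch below (in-batch negatives, cosine similarity with a temperature). This is only the baseline, not the CLEAR loss; the actual `CachedCLEARLoss` implementation lives in `loss.py`.

```python
# Sketch of the standard InfoNCE baseline with in-batch negatives: each query
# is pulled toward its own document and pushed away from every other document
# in the batch. This is the baseline CLEAR is compared against, not CLEAR itself.
import torch
import torch.nn.functional as F


def info_nce(query_emb: torch.Tensor, doc_emb: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """query_emb and doc_emb are (batch, dim); row i of doc_emb is the positive for query i."""
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    logits = q @ d.T / temperature                      # (batch, batch) cosine similarities
    labels = torch.arange(q.size(0), device=q.device)   # positives sit on the diagonal
    return F.cross_entropy(logits, labels)
```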
Requirements:
- Python 3.8+
- CUDA-compatible GPU (recommended)
Installation:

- Clone the repository:

```bash
git clone https://github.com/dltmddbs100/CLEAR.git
cd CLEAR
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

This will install:
- PyTorch 2.6.0
- Transformers 4.45.0
- Sentence-Transformers 3.4.1
- Datasets 2.21.0
- Accelerate 0.34.2
- WandB (for logging)
- Local MTEB package (from `./mteb`)
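A quick way to confirm the environment resolved correctly (a minimal check, assuming the packages above were installed into the active interpreter):

```python
# Print the versions of the core dependencies to confirm the install.
import torch
import transformers
import sentence_transformers
import datasets
import accelerate
import mteb  # the local package from ./mteb

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)
print("sentence-transformers:", sentence_transformers.__version__)
print("datasets:", datasets.__version__)
print("accelerate:", accelerate.__version__)
print("mteb:", mteb.__version__)
```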
For distributed training with DeepSpeed:

```bash
pip install deepspeed==0.14.4
```

We provide an example training dataset here. Training data should be in the following format:
| Column | Description |
|---|---|
| `anchor` | Source language query |
| `positive` | Positive document |
| `negative_1` ... `negative_n` | Hard negative documents |
| `cross_anchor` | Target language query (translation of `anchor`) |
| `neg_query_1` ... `neg_query_n` | Hard negative queries in the target language |
Example:

```json
{
  "anchor": "What is machine learning?",
  "positive": "Machine learning is a subset of artificial intelligence...",
  "negative_1": "Deep learning requires large datasets...",
  "cross_anchor": "Was ist maschinelles Lernen?",
  "neg_query_1": "Wie funktioniert künstliche Intelligenz?"
}
```

You can use HuggingFace datasets by setting `--use_hf_dataset`.
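For illustration, a dataset in this format could be assembled with the `datasets` library roughly as follows (a sketch: `dataset/train_example_de` is the path used in the training command below, and whether the script reads it from disk or from the Hub depends on `--use_hf_dataset`):

```python
# Sketch: build a tiny dataset with the column layout above and save it to disk
# so it can be passed to the training script via --dataset_path.
from datasets import Dataset

records = [
    {
        "anchor": "What is machine learning?",
        "positive": "Machine learning is a subset of artificial intelligence...",
        "negative_1": "Deep learning requires large datasets...",
        "cross_anchor": "Was ist maschinelles Lernen?",
        "neg_query_1": "Wie funktioniert künstliche Intelligenz?",
    },
]

train_ds = Dataset.from_list(records)
train_ds.save_to_disk("dataset/train_example_de")  # path reused in the example training command below
print(train_ds)
```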
Example training command (single node, 4 GPUs):

```bash
export OMP_NUM_THREADS=32
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node 4 --master_port 25055 script/train.py \
--model_name_or_path BAAI/bge-m3 \
--dataset_path dataset/train_example_de \
--output_dir ckpt/bge-m3-de-CLEAR \
--loss_name CachedCLEARLoss \
--alpha 0.4 \
--beta 0.2 \
--kl_div \
--use_hf_dataset \
--do_train \
--per_device_train_batch_size 64 \
--mini_batch_size 32 \
--learning_rate 5e-5 \
--max_steps 50 \
--warmup_ratio 0.05 \
--lr_scheduler_type cosine \
--max_seq_length 512 \
--bf16 \
--report_to wandb \
--save_strategy steps \
--save_steps 50 \
  --save_total_limit 1
```

CLEAR uses a customized version of MTEB for evaluation on cross-lingual retrieval benchmarks.
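Before running the benchmarks, the trained checkpoint can be sanity-checked directly with Sentence-Transformers (a minimal sketch, assuming the checkpoint is saved in Sentence-Transformers format, which is also what the `mteb` CLI below expects):

```python
# Sketch: load the fine-tuned checkpoint and score a German query against English passages.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("ckpt/bge-m3-de-CLEAR")  # output_dir from the training command above

query = "Was ist maschinelles Lernen?"
passages = [
    "Machine learning is a subset of artificial intelligence...",
    "Deep learning requires large datasets...",
]

query_emb = model.encode([query], normalize_embeddings=True)
passage_emb = model.encode(passages, normalize_embeddings=True)

# Embeddings are L2-normalized, so the dot product equals cosine similarity.
scores = query_emb @ passage_emb.T
print(scores)  # the first passage should score highest for the cross-lingual query
```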
XQuAD evaluation:

```bash
# Monolingual (English)
CUDA_VISIBLE_DEVICES=0 mteb run -m ckpt/your-model \
--tasks XQuADRetrieval \
--languages eng-Latn \
--output_folder eval_results/xquad/en-en \
--batch_size 32
# Cross-lingual (German query → English documents)
CUDA_VISIBLE_DEVICES=0 mteb run -m ckpt/your-model \
--tasks XQuADCrossRetrieval_EN_LANG \
--languages deu-Latn \
--output_folder eval_results/xquad/en-de \
--batch_size 32
# Cross-lingual (English query → German documents)
CUDA_VISIBLE_DEVICES=0 mteb run -m ckpt/your-model \
--tasks XQuADCrossRetrieval_LANG_EN \
--languages deu-Latn \
--output_folder eval_results/xquad/de-en \
  --batch_size 32
```

Belebele evaluation:

```bash
# English
CUDA_VISIBLE_DEVICES=0 mteb run -m ckpt/your-model \
--tasks BelebeleRetrieval \
--languages eng-Latn-eng-Latn \
--output_folder eval_results/belebele/en \
--batch_size 32
# Other languages
CUDA_VISIBLE_DEVICES=0 mteb run -m ckpt/your-model \
--tasks BelebeleRetrieval \
--languages deu-Latn \
--output_folder eval_results/belebele/de \
  --batch_size 32
```

Use the provided scripts to evaluate across multiple languages:

```bash
# Evaluate trained model
bash script/eval_ours_bele_xquad.sh
# Evaluate baseline (no training)
bash script/eval_notrain_bele_xquad.sh
```
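The `mteb` CLI writes JSON result files under the chosen `--output_folder`. A small helper like the one below can collect the scores across languages (a sketch; the exact JSON schema depends on the MTEB version, so it searches for numeric fields whose name mentions "ndcg" rather than assuming a fixed layout):

```python
# Sketch: collect retrieval scores from the JSON files written under eval_results/.
import json
from pathlib import Path


def _numeric_leaves(obj, prefix=""):
    """Yield (dotted_key, value) pairs for every numeric leaf in a nested JSON structure."""
    if isinstance(obj, dict):
        for k, v in obj.items():
            yield from _numeric_leaves(v, f"{prefix}{k}.")
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            yield from _numeric_leaves(v, f"{prefix}{i}.")
    elif isinstance(obj, (int, float)) and not isinstance(obj, bool):
        yield prefix.rstrip("."), obj


def collect_scores(results_dir: str = "eval_results") -> None:
    for path in sorted(Path(results_dir).rglob("*.json")):
        data = json.loads(path.read_text())
        for key, value in _numeric_leaves(data):
            if "ndcg" in key.lower():  # nDCG@k is the usual main retrieval metric
                print(f"{path}\t{key}\t{value:.4f}")


if __name__ == "__main__":
    collect_scores()
```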
Repository structure:

```
CLEAR/
├── train.py            # Main training script
├── loss.py             # CLEAR loss implementations
├── collator.py         # Data collator for cross-lingual data
├── requirements.txt    # Python dependencies
├── dataset/            # Training datasets
├── ckpt/               # Model checkpoints
├── eval_results/       # Evaluation outputs
├── logs/               # Training logs
├── mteb/               # Custom MTEB package
├── script/             # Training and evaluation scripts
└── asset/              # Assets and figures
```
If you find this work useful, please cite:
```bibtex
@misc{lee2026clearcrosslingualenhancementalignment,
title={CLEAR: Cross-Lingual Enhancement in Alignment via Reverse-training},
author={Seungyoon Lee and Minhyuk Kim and Seongtae Hong and Youngjoon Jang and Dongsuk Oh and Heuiseok Lim},
year={2026},
eprint={2604.05821},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2604.05821},
}
```

This project is licensed under the MIT License.
