CLEAR: Cross-Lingual Enhancement in Retrieval via Reverse-training

This repository contains the code for our paper CLEAR: Cross-Lingual Enhancement in Retrieval via Reverse-training. We propose a more effective loss function than standard InfoNCE for cross-lingual information retrieval. CLEAR outperforms InfoNCE across the Belebele, XQuAD, and XOR-QA benchmarks under various settings.

![CLEAR overview](CLEAR_fig.png)
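CLEAR's objective builds on the standard InfoNCE contrastive loss, which scores a query against one positive and several negative candidates. As background, here is a toy pure-Python sketch of InfoNCE for a single query; the embeddings and temperature are illustrative only, not values from the paper:

```python
import math

def infonce_loss(query, positive, negatives, temperature=0.05):
    """InfoNCE: negative log-softmax of the positive's similarity
    among all candidates (positive first, then negatives)."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    sims = [dot(query, positive)] + [dot(query, n) for n in negatives]
    scaled = [s / temperature for s in sims]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    return -math.log(exps[0] / sum(exps))

# Toy 2-d embeddings: the positive is much closer to the query than the negatives,
# so the loss is close to zero.
q = [1.0, 0.0]
pos = [0.9, 0.1]
negs = [[0.1, 0.9], [-1.0, 0.0]]
loss = infonce_loss(q, pos, negs)
```

Swapping the positive for one of the negatives makes the loss large, which is the gradient signal that pulls query and positive embeddings together.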

Installation

Requirements

  • Python 3.8+
  • CUDA-compatible GPU (recommended)

Setup

  1. Clone the repository:
git clone https://github.com/dltmddbs100/CLEAR.git
cd CLEAR
  2. Install dependencies:
pip install -r requirements.txt

This will install:

  • PyTorch 2.6.0
  • Transformers 4.45.0
  • Sentence-Transformers 3.4.1
  • Datasets 2.21.0
  • Accelerate 0.34.2
  • WandB (for logging)
  • Local MTEB package (from ./mteb)

Optional: DeepSpeed Support

For distributed training with DeepSpeed:

pip install deepspeed==0.14.4

Dataset Format

We provide an example training dataset here. Training data should be in the following format:

| Column | Description |
| --- | --- |
| anchor | Source-language query |
| positive | Positive document |
| negative_1 ... negative_n | Hard negative documents |
| cross_anchor | Target-language query (translation of anchor) |
| neg_query_1 ... neg_query_n | Hard negative queries in the target language |

Example:

{
  "anchor": "What is machine learning?",
  "positive": "Machine learning is a subset of artificial intelligence...",
  "negative_1": "Deep learning requires large datasets...",
  "cross_anchor": "Was ist maschinelles Lernen?",
  "neg_query_1": "Wie funktioniert künstliche Intelligenz?"
}

You can use HuggingFace datasets by setting --use_hf_dataset.
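Before training, it can be useful to sanity-check that each record carries the required fields. The helper below is an illustrative sketch, not part of the repository:

```python
def validate_record(record):
    """Return a list of problems with a CLEAR-style training record
    (empty list means the record looks well-formed)."""
    problems = []
    for key in ("anchor", "positive", "cross_anchor"):
        if not record.get(key):
            problems.append("missing required field: " + key)
    if not any(k.startswith("negative_") for k in record):
        problems.append("no hard negative documents (negative_1 ... negative_n)")
    if not any(k.startswith("neg_query_") for k in record):
        problems.append("no hard negative queries (neg_query_1 ... neg_query_n)")
    return problems

# The example record from above passes validation.
record = {
    "anchor": "What is machine learning?",
    "positive": "Machine learning is a subset of artificial intelligence...",
    "negative_1": "Deep learning requires large datasets...",
    "cross_anchor": "Was ist maschinelles Lernen?",
    "neg_query_1": "Wie funktioniert künstliche Intelligenz?",
}
```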

Training

export OMP_NUM_THREADS=32

CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node 4 --master_port 25055 script/train.py \
    --model_name_or_path BAAI/bge-m3 \
    --dataset_path dataset/train_example_de \
    --output_dir ckpt/bge-m3-de-CLEAR \
    --loss_name CachedCLEARLoss \
    --alpha 0.4 \
    --beta 0.2 \
    --kl_div \
    --use_hf_dataset \
    --do_train \
    --per_device_train_batch_size 64 \
    --mini_batch_size 32 \
    --learning_rate 5e-5 \
    --max_steps 50 \
    --warmup_ratio 0.05 \
    --lr_scheduler_type cosine \
    --max_seq_length 512 \
    --bf16 \
    --report_to wandb \
    --save_strategy steps \
    --save_steps 50 \
    --save_total_limit 1
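With the flags above, the batch seen per optimizer step is the per-device batch times the number of GPUs, while --mini_batch_size only sets the chunk size that the cached loss processes per forward pass (a memory-saving device, not the optimization batch). The arithmetic, for illustration:

```python
# Effective batch size for the torchrun command above (illustrative arithmetic).
n_gpus = 4                        # CUDA_VISIBLE_DEVICES=0,1,2,3
per_device_train_batch_size = 64
mini_batch_size = 32              # gradient-cache chunk, not the optimization batch

effective_batch = n_gpus * per_device_train_batch_size          # examples per step
chunks_per_device = per_device_train_batch_size // mini_batch_size
```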

Evaluation

CLEAR uses a customized version of MTEB for evaluation on cross-lingual retrieval benchmarks.

Evaluate on XQuAD

# Monolingual (English)
CUDA_VISIBLE_DEVICES=0 mteb run -m ckpt/your-model \
    --tasks XQuADRetrieval \
    --languages eng-Latn \
    --output_folder eval_results/xquad/en-en \
    --batch_size 32

# Cross-lingual (German query → English documents)
CUDA_VISIBLE_DEVICES=0 mteb run -m ckpt/your-model \
    --tasks XQuADCrossRetrieval_EN_LANG \
    --languages deu-Latn \
    --output_folder eval_results/xquad/en-de \
    --batch_size 32

# Cross-lingual (English query → German documents)
CUDA_VISIBLE_DEVICES=0 mteb run -m ckpt/your-model \
    --tasks XQuADCrossRetrieval_LANG_EN \
    --languages deu-Latn \
    --output_folder eval_results/xquad/de-en \
    --batch_size 32

Evaluate on Belebele

# English
CUDA_VISIBLE_DEVICES=0 mteb run -m ckpt/your-model \
    --tasks BelebeleRetrieval \
    --languages eng-Latn-eng-Latn \
    --output_folder eval_results/belebele/en \
    --batch_size 32

# Other languages
CUDA_VISIBLE_DEVICES=0 mteb run -m ckpt/your-model \
    --tasks BelebeleRetrieval \
    --languages deu-Latn \
    --output_folder eval_results/belebele/de \
    --batch_size 32

Batch Evaluation

Use the provided scripts to evaluate across multiple languages:

# Evaluate trained model
bash script/eval_ours_bele_xquad.sh

# Evaluate baseline (no training)
bash script/eval_notrain_bele_xquad.sh
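If the provided scripts don't cover a language you need, the per-language mteb invocations above can be generated programmatically. This sketch only builds the command strings following the examples in this section; the language-tag mapping is illustrative and not part of the repository's scripts:

```python
# Build `mteb run` commands for several Belebele languages
# (hypothetical helper; language tags are illustrative).
LANGS = {"de": "deu-Latn", "ko": "kor-Hang", "zh": "zho-Hans"}

def belebele_command(model, code, lang_tag):
    """Assemble one Belebele evaluation command as a string."""
    return (
        "mteb run -m " + model
        + " --tasks BelebeleRetrieval"
        + " --languages " + lang_tag
        + " --output_folder eval_results/belebele/" + code
        + " --batch_size 32"
    )

commands = [belebele_command("ckpt/your-model", c, t) for c, t in LANGS.items()]
```

Each string can then be run via a shell loop or subprocess, one GPU per invocation as in the examples above.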

Project Structure

CLEAR/
├── train.py              # Main training script
├── loss.py               # CLEAR loss implementations
├── collator.py           # Data collator for cross-lingual data
├── requirements.txt      # Python dependencies
├── dataset/              # Training datasets
├── ckpt/                 # Model checkpoints
├── eval_results/         # Evaluation outputs
├── logs/                 # Training logs
├── mteb/                 # Custom MTEB package
├── script/               # Training and evaluation scripts
└── asset/                # Assets and figures

Citation

If you find this work useful, please cite:

@misc{lee2026clearcrosslingualenhancementalignment,
      title={CLEAR: Cross-Lingual Enhancement in Alignment via Reverse-training}, 
      author={Seungyoon Lee and Minhyuk Kim and Seongtae Hong and Youngjoon Jang and Dongsuk Oh and Heuiseok Lim},
      year={2026},
      eprint={2604.05821},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2604.05821}, 
}

License

This project is licensed under the MIT License.

Acknowledgements

About

[ACL 2026] Official code of "CLEAR: Cross-Lingual Enhancement in Retrieval via Reverse-training"
