Authors: Dayeon Ki, Cheonbok Park, Hyunjoong Kim
This repository contains the code and dataset for our ACL 2024 RepL4NLP workshop paper, "Mitigating Semantic Leakage in Cross-lingual Embeddings via Orthogonality Constraint".
Accurately aligning contextual representations in cross-lingual sentence embeddings is key for effective parallel data mining. A common strategy for achieving this alignment involves disentangling semantics and language in sentence embeddings derived from multilingual pre-trained models. However, we discover that current disentangled representation learning methods suffer from semantic leakage — a term we introduce to describe when a substantial amount of language-specific information is unintentionally leaked into semantic representations. This hinders the effective disentanglement of semantic and language representations, making it difficult to retrieve embeddings that distinctively represent the meaning of the sentence.
To address this challenge, we propose a novel training objective, ORthogonAlity Constraint LEarning (ORACLE), tailored to enforce orthogonality between semantic and language embeddings. ORACLE builds upon two components: intra-class clustering and inter-class separation. Through experiments on cross-lingual retrieval and semantic textual similarity tasks, we demonstrate that training with the ORACLE objective effectively reduces semantic leakage and enhances semantic alignment within the embedding space.
ORACLE consists of two key components: (1) intra-class clustering and (2) inter-class separation. Intra-class clustering aligns related components more closely, while inter-class separation enforces orthogonality between unrelated components. Our method is designed to be simple and effective, and can be implemented on top of any disentanglement method.
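For intuition, the snippet below is a minimal sketch of an ORACLE-style objective in PyTorch. It is not the repository's implementation: the tensor names and the exact form of each term are assumptions, with intra-class clustering written as a cosine-alignment term over parallel semantic embeddings and inter-class separation as a squared-cosine orthogonality penalty between semantic and language embeddings.

```python
import torch
import torch.nn.functional as F

def oracle_style_loss(sem_src, sem_tgt, lang_src, lang_tgt):
    """Illustrative ORACLE-style objective (not the repo's exact code).

    sem_*  : semantic embeddings of parallel source/target sentences
    lang_* : language embeddings of the same sentences
    """
    # Intra-class clustering: pull related components together,
    # e.g. the semantic embeddings of a translation pair.
    intra = 1 - F.cosine_similarity(sem_src, sem_tgt, dim=-1).mean()

    # Inter-class separation: push unrelated components toward orthogonality,
    # e.g. the semantic vs. language embedding of the same sentence.
    inter = (
        F.cosine_similarity(sem_src, lang_src, dim=-1).pow(2).mean()
        + F.cosine_similarity(sem_tgt, lang_tgt, dim=-1).pow(2).mean()
    )
    return intra + inter
```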
We explore a range of pre-trained multilingual encoders (LASER, InfoXLM, LaBSE) to generate initial sentence embeddings. We then train separate semantic and language multi-layer perceptrons (MLPs) with ORACLE to disentangle the sentence embeddings into semantic and language-specific information. Experimental results on both cross-lingual sentence retrieval tasks and the Semantic Textual Similarity (STS) task show higher performance for semantic embeddings and lower performance for language embeddings when training with ORACLE. The following figure illustrates our work.
Install all requirements listed in `requirements.txt`:
pip install -r requirements.txt
Place the parallel sentences in `data/`, formatting each file as a plain text file with one sentence per line. Below is an example for the English-French language pair.
en-fr.en
I enjoy reading books in my free time.
The weather today is perfect for a picnic.
She is learning how to cook traditional French cuisine.
...
en-fr.fr
J'aime lire des livres pendant mon temps libre.
Le temps aujourd'hui est parfait pour un pique-nique.
Elle apprend à cuisiner des plats traditionnels français.
...
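Optionally, before creating embeddings you may want to confirm that the two files are line-aligned. The snippet below is an illustrative check (not part of the repository), using the example English-French files above.

```python
# Optional sanity check: verify that both sides of the bitext are line-aligned.
src_path, tgt_path = "data/en-fr.en", "data/en-fr.fr"

with open(src_path, encoding="utf-8") as f:
    src_lines = [line.strip() for line in f if line.strip()]
with open(tgt_path, encoding="utf-8") as f:
    tgt_lines = [line.strip() for line in f if line.strip()]

assert len(src_lines) == len(tgt_lines), (
    f"Mismatch: {len(src_lines)} source vs {len(tgt_lines)} target sentences"
)
print(f"{len(src_lines)} aligned sentence pairs found.")
```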
We use three different pre-trained multilingual encoders: LASER, InfoXLM, and LaBSE. To create embeddings for the bitext dataset in `data/`, run `script/embed.py` as below:
python -u embed.py \
--model_name_or_path $MODEL_NAME \
--src_data_path $PATH_TO_SOURCE_DATA \
--tgt_data_path $PATH_TO_TARGET_DATA \
--src_embed_path $PATH_TO_SOURCE_EMBEDDINGS \
--tgt_embed_path $PATH_TO_TARGET_EMBEDDINGS \
--train_embed_path $PATH_TO_TRAIN_EMBEDDINGS \
--valid_embed_path $PATH_TO_VALIDATION_EMBEDDINGS \
--src_lang $SOURCE_LANG \
--tgt_lang $TARGET_LANG \
--batch_size $BATCH_SIZE \
--seed $SEED_NUM
Arguments for the embedding creation script are as follows:

- `--model_name_or_path`: Path or name of the pre-trained multilingual encoder
- `--src_data_path`: Path to the source dataset (e.g., `data/en-fr.en`)
- `--tgt_data_path`: Path to the target dataset (e.g., `data/en-fr.fr`)
- `--src_embed_path`: Path to save the created source embeddings
- `--tgt_embed_path`: Path to save the created target embeddings
- `--train_embed_path`: Path to save the train split of the embeddings
- `--valid_embed_path`: Path to save the validation split of the embeddings
- `--src_lang`: Source language code (e.g., `en`)
- `--tgt_lang`: Target language code (e.g., `fr`)
- `--batch_size`: Batch size of the model (default: 512)
- `--seed`: Seed number (default: 42)
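`script/embed.py` is the supported way to create the embeddings. Purely for illustration, embedding both sides of the bitext with LaBSE could look roughly like the snippet below; it assumes the `sentence-transformers` package, and the output paths are hypothetical.

```python
# Illustration only -- script/embed.py is the supported way to create embeddings.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/LaBSE")

with open("data/en-fr.en", encoding="utf-8") as f:
    src_sents = [line.strip() for line in f]
with open("data/en-fr.fr", encoding="utf-8") as f:
    tgt_sents = [line.strip() for line in f]

src_emb = model.encode(src_sents, batch_size=512, show_progress_bar=True)
tgt_emb = model.encode(tgt_sents, batch_size=512, show_progress_bar=True)

np.save("embeddings/en-fr.en.npy", src_emb)  # hypothetical output paths
np.save("embeddings/en-fr.fr.npy", tgt_emb)
```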
Using the embeddings created in the previous step, we train the decomposer with the ORACLE objective. To train, choose one option for each of the following:
- Decomposer type : {DREAM, MEAT}
- Encoder type : {LASER, InfoXLM, LaBSE}
- Train type : {Vanilla, ORACLE}
- Vanilla is the training method introduced in the DREAM and MEAT papers.
- ORACLE is our approach, composed of both intra-class clustering and inter-class separation.
- To train a DREAM decomposer, run `script/train_dream.py`; to train a MEAT decomposer, run `script/train_meat.py`.
For example,
python -u train_dream.py -c config/labse/dream_labse.yaml
python -u train_meat.py -c config/infoxlm/meat_infoxlm_oracle.yaml
Each YAML config file contains the following parameters for training:
train_path: $PATH_TO_TRAIN_EMBEDDINGS
valid_path: $PATH_TO_VALIDATION_EMBEDDINGS
save_pooler_path: $PATH_TO_SAVE_MODEL
logging_path: $LOG_FILE_PATH
train_type: $TRAIN_TYPE
learning_rate: 1e-5
n_languages: 13
batch_size: 512
seed: 42
weights: [[1,1]]
model_name_or_path: $MODEL_NAME
- `train_path`: Path to the saved train split of embeddings
- `valid_path`: Path to the saved validation split of embeddings
- `model_name_or_path`: Path or name of the pre-trained multilingual encoder
- `save_pooler_path`: Path to save the pooler after training
- `logging_path`: Path to the log file (the log file records the loss values for each training epoch)
- `train_type`: {vanilla, oracle}
- `n_languages`: Number of languages (default: 13)
- `weights`: Weight values for each loss
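For illustration only, the snippet below shows one way such a YAML config might be read and how the `weights` entry could scale the loss terms; interpreting `weights` as per-term loss weights is an assumption, not a statement about what the training scripts actually do.

```python
# Illustration of how the YAML config might be consumed; the actual
# training scripts may differ. Assumes PyYAML is installed.
import yaml

with open("config/labse/dream_labse.yaml") as f:
    cfg = yaml.safe_load(f)

learning_rate = cfg["learning_rate"]      # e.g. 1e-5
intra_w, inter_w = cfg["weights"][0]      # e.g. [1, 1]

# Hypothetical combination of two loss terms using the config weights:
# total_loss = intra_w * intra_loss + inter_w * inter_loss
```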
We provide inference code for the following retrieval tasks in `code/inference/`:

- BUCC: cross-lingual retrieval task. Run `bucc.py` for InfoXLM and LaBSE, and `bucc_laser.py` for LASER.
- Tatoeba: cross-lingual retrieval task (a minimal sketch of the underlying nearest-neighbour retrieval follows this list).
  - To retrieve semantic embeddings, run `tatoeba_sem.py` for InfoXLM and LaBSE, and `tatoeba_sem_laser.py` for LASER.
  - To retrieve language embeddings, run `tatoeba_lang.py` for InfoXLM and LaBSE, and `tatoeba_lang_laser.py` for LASER.
- STS: semantic textual similarity task. Run `sts.py` for InfoXLM and LaBSE, and `sts_laser.py` for LASER.
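The scripts above are the reference evaluation code. Purely for illustration, the core idea behind Tatoeba-style retrieval can be sketched as cosine-similarity nearest-neighbour search over the embeddings; the function below is an assumption-level sketch, not the repository's implementation.

```python
# Minimal sketch of cosine-similarity retrieval over sentence embeddings
# (the provided inference scripts are the reference implementation).
import numpy as np

def retrieval_accuracy(src_emb: np.ndarray, tgt_emb: np.ndarray) -> float:
    """src_emb[i] and tgt_emb[i] are embeddings of a translation pair."""
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sims = src @ tgt.T                 # cosine similarity matrix
    nearest = sims.argmax(axis=1)      # nearest target for each source
    return float((nearest == np.arange(len(src))).mean())
```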
To visualize the embedding space of the trained decomposers using the `datavis` library, run `code/visualize.py`. This script saves an HTML Bokeh file in the `output_figure` directory.
python -u visualize.py \
--model_name_or_path $MODEL_NAME \
--pooler_path $TRAINED_POOLER_PATH \
--src_data_path $SOURCE_RETRIEVAL_DATA_PATH \
--tgt_data_path $TARGET_RETRIEVAL_DATA_PATH \
--src_lang $SOURCE_LANG \
--tgt_lang $TARGET_LANG \
--batch_size $BATCH_SIZE \
--output_figure $PATH_TO_OUTPUT_FIGURE
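`code/visualize.py` is the supported entry point. Purely as a rough sketch, a 2-D projection plus a Bokeh HTML scatter plot could look like the following; the embedding paths, the use of t-SNE, and the scikit-learn/Bokeh dependencies are assumptions and may differ from the actual script.

```python
# Rough sketch only -- code/visualize.py is the supported entry point.
# Assumes scikit-learn and bokeh are installed; paths are hypothetical.
import numpy as np
from sklearn.manifold import TSNE
from bokeh.plotting import figure, output_file, save

src_emb = np.load("embeddings/en-fr.en.npy")
tgt_emb = np.load("embeddings/en-fr.fr.npy")

points = TSNE(n_components=2).fit_transform(np.vstack([src_emb, tgt_emb]))
n = len(src_emb)

p = figure(title="Embedding space (en vs. fr)")
p.scatter(points[:n, 0], points[:n, 1], color="blue", legend_label="en")
p.scatter(points[n:, 0], points[n:, 1], color="red", legend_label="fr")

output_file("output_figure/embedding_space.html")
save(p)
```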
@inproceedings{ki-etal-2024-mitigating,
title = "Mitigating Semantic Leakage in Cross-lingual Embeddings via Orthogonality Constraint",
author = "Ki, Dayeon and
Park, Cheonbok and
Kim, Hyunjoong",
editor = "Zhao, Chen and
Mosbach, Marius and
Atanasova, Pepa and
Goldfarb-Tarrent, Seraphina and
Hase, Peter and
Hosseini, Arian and
Elbayad, Maha and
Pezzelle, Sandro and
Mozes, Maximilian",
booktitle = "Proceedings of the 9th Workshop on Representation Learning for NLP (RepL4NLP-2024)",
month = aug,
year = "2024",
address = "Bangkok, Thailand",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.repl4nlp-1.19",
pages = "256--273"
}