Authors: Dayeon Ki, Cheonbok Park, Hyunjoong Kim
This repository contains the code and dataset for our ACL 2024 RepL4NLP workshop paper, "Mitigating Semantic Leakage in Cross-lingual Embeddings via Orthogonality Constraint".
Accurately aligning contextual representations in cross-lingual sentence embeddings is key for effective parallel data mining. A common strategy for achieving this alignment involves disentangling semantics and language in sentence embeddings derived from multilingual pre-trained models. However, we discover that current disentangled representation learning methods suffer from semantic leakage — a term we introduce to describe when a substantial amount of language-specific information is unintentionally leaked into semantic representations. This hinders the effective disentanglement of semantic and language representations, making it difficult to retrieve embeddings that distinctively represent the meaning of the sentence.
To address this challenge, we propose a novel training objective, ORthogonAlity Constraint LEarning (ORACLE), tailored to enforce orthogonality between semantic and language embeddings. ORACLE builds upon two components: intra-class clustering and inter-class separation. Through experiments on cross-lingual retrieval and semantic textual similarity tasks, we demonstrate that training with the ORACLE objective effectively reduces semantic leakage and enhances semantic alignment within the embedding space.
ORACLE consists of two key components: (1) intra-class clustering and (2) inter-class separation. Intra-class clustering aligns related components more closely, while inter-class separation enforces orthogonality between unrelated components. Our method is designed to be simple and effective, and can be implemented on top of any disentanglement method.
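For intuition, the snippet below is a minimal sketch of an ORACLE-style objective in PyTorch. It is not the repository's implementation: the tensor names and the exact form of each term are assumptions, with intra-class clustering written as a cosine-alignment term over parallel semantic embeddings and inter-class separation as a squared-cosine orthogonality penalty between semantic and language embeddings.

```python
import torch
import torch.nn.functional as F

def oracle_style_loss(sem_src, sem_tgt, lang_src, lang_tgt):
    """Illustrative ORACLE-style objective (not the repo's exact code).

    sem_*  : semantic embeddings of parallel source/target sentences
    lang_* : language embeddings of the same sentences
    """
    # Intra-class clustering: pull related components together,
    # e.g. the semantic embeddings of a translation pair.
    intra = 1 - F.cosine_similarity(sem_src, sem_tgt, dim=-1).mean()

    # Inter-class separation: push unrelated components toward orthogonality,
    # e.g. the semantic vs. language embedding of the same sentence.
    inter = (
        F.cosine_similarity(sem_src, lang_src, dim=-1).pow(2).mean()
        + F.cosine_similarity(sem_tgt, lang_tgt, dim=-1).pow(2).mean()
    )
    return intra + inter
```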
We explore a range of pre-trained multilingual encoders (LASER, InfoXLM, LaBSE) to generate initial sentence embeddings. We then train separate semantic and language multi-layer perceptrons (MLPs) with ORACLE to disentangle the sentence embeddings into semantic and language-specific information. Experimental results on both cross-lingual sentence retrieval tasks and the Semantic Textual Similarity (STS) task show higher performance for semantic embeddings and lower performance for language embeddings when training with ORACLE. The following figure illustrates our work.
Install all requirements listed in `requirements.txt`:
pip install -r requirements.txt
Place the parallel sentences in `data/`, formatting each file as a plain text file with one sentence per line. Below is an example for the English-French language pair.
en-fr.en
I enjoy reading books in my free time.
The weather today is perfect for a picnic.
She is learning how to cook traditional French cuisine.
...
en-fr.fr
J'aime lire des livres pendant mon temps libre.
Le temps aujourd'hui est parfait pour un pique-nique.
Elle apprend à cuisiner des plats traditionnels français.
...
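Optionally, before creating embeddings you may want to confirm that the two files are line-aligned. The snippet below is an illustrative check (not part of the repository), using the example English-French files above.

```python
# Optional sanity check: verify that both sides of the bitext are line-aligned.
src_path, tgt_path = "data/en-fr.en", "data/en-fr.fr"

with open(src_path, encoding="utf-8") as f:
    src_lines = [line.strip() for line in f if line.strip()]
with open(tgt_path, encoding="utf-8") as f:
    tgt_lines = [line.strip() for line in f if line.strip()]

assert len(src_lines) == len(tgt_lines), (
    f"Mismatch: {len(src_lines)} source vs {len(tgt_lines)} target sentences"
)
print(f"{len(src_lines)} aligned sentence pairs found.")
```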
We use three different pre-trained multilingual encoders: LASER, InfoXLM, and LaBSE. To create embeddings for the bitext dataset in `data/`, run `script/embed.py` as below:
python -u embed.py \
--model_name_or_path $MODEL_NAME \
--src_data_path $PATH_TO_SOURCE_DATA \
--tgt_data_path $PATH_TO_TARGET_DATA \
--src_embed_path $PATH_TO_SOURCE_EMBEDDINGS \
--tgt_embed_path $PATH_TO_TARGET_EMBEDDINGS \
--train_embed_path $PATH_TO_TRAIN_EMBEDDINGS \
--valid_embed_path $PATH_TO_VALIDATION_EMBEDDINGS \
--src_lang $SOURCE_LANG \
--tgt_lang $TARGET_LANG \
--batch_size $BATCH_SIZE \
--seed $SEED_NUM
Arguments for the embedding creation script are as follows:

- `--model_name_or_path`: Path or name of the pre-trained multilingual encoder
- `--src_data_path`: Path to the source dataset (e.g., `data/en-fr.en`)
- `--tgt_data_path`: Path to the target dataset (e.g., `data/en-fr.fr`)
- `--src_embed_path`: Path to save the created source embeddings
- `--tgt_embed_path`: Path to save the created target embeddings
- `--train_embed_path`: Path to save the train split of the embeddings
- `--valid_embed_path`: Path to save the validation split of the embeddings
- `--src_lang`: Source language code (e.g., `en`)
- `--tgt_lang`: Target language code (e.g., `fr`)
- `--batch_size`: Batch size of the model (default: 512)
- `--seed`: Seed number (default: 42)
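`script/embed.py` is the supported way to create the embeddings. Purely for illustration, embedding both sides of the bitext with LaBSE could look roughly like the snippet below; it assumes the `sentence-transformers` package, and the output paths are hypothetical.

```python
# Illustration only -- script/embed.py is the supported way to create embeddings.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/LaBSE")

with open("data/en-fr.en", encoding="utf-8") as f:
    src_sents = [line.strip() for line in f]
with open("data/en-fr.fr", encoding="utf-8") as f:
    tgt_sents = [line.strip() for line in f]

src_emb = model.encode(src_sents, batch_size=512, show_progress_bar=True)
tgt_emb = model.encode(tgt_sents, batch_size=512, show_progress_bar=True)

np.save("embeddings/en-fr.en.npy", src_emb)  # hypothetical output paths
np.save("embeddings/en-fr.fr.npy", tgt_emb)
```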
Using the embeddings created in the previous step, we train the decomposer with the ORACLE objective. To train, choose one option for each of the following:
- Decomposer type : {DREAM, MEAT}
- Encoder type : {LASER, InfoXLM, LaBSE}
- Train type : {Vanilla, ORACLE}
- Vanilla is the training method introduced in the DREAM and MEAT papers.
- ORACLE is our approach, composed of both intra-class clustering and inter-class separation.
- To train a DREAM decomposer, run `script/train_dream.py`; to train a MEAT decomposer, run `script/train_meat.py`.
For example,
python -u train_dream.py -c config/labse/dream_labse.yaml
python -u train_meat.py -c config/infoxlm/meat_infoxlm_oracle.yaml
Each YAML config file contains the following parameters for training:
train_path: $PATH_TO_TRAIN_EMBEDDINGS
valid_path: $PATH_TO_VALIDATION_EMBEDDINGS
save_pooler_path: $PATH_TO_SAVE_MODEL
logging_path: $LOG_FILE_PATH
train_type: $TRAIN_TYPE
learning_rate: 1e-5
n_languages: 13
batch_size: 512
seed: 42
weights: [[1,1]]
model_name_or_path: $MODEL_NAME
- `train_path`: Path to the saved train split of embeddings
- `valid_path`: Path to the saved validation split of embeddings
- `model_name_or_path`: Path or name of the pre-trained multilingual encoder
- `save_pooler_path`: Path to save the pooler after training
- `logging_path`: Path to the log file (the log file records the loss values for each training epoch)
- `train_type`: {vanilla, oracle}
- `n_languages`: Number of languages (default: 13)
- `weights`: Weight values for each loss
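For illustration only, the snippet below shows one way such a YAML config might be read and how the `weights` entry could scale the loss terms; interpreting `weights` as per-term loss weights is an assumption, not a statement about what the training scripts actually do.

```python
# Illustration of how the YAML config might be consumed; the actual
# training scripts may differ. Assumes PyYAML is installed.
import yaml

with open("config/labse/dream_labse.yaml") as f:
    cfg = yaml.safe_load(f)

learning_rate = cfg["learning_rate"]      # e.g. 1e-5
intra_w, inter_w = cfg["weights"][0]      # e.g. [1, 1]

# Hypothetical combination of two loss terms using the config weights:
# total_loss = intra_w * intra_loss + inter_w * inter_loss
```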
We provide inference code for the following retrieval tasks in `code/inference/`:

- BUCC: cross-lingual retrieval task. Run `bucc.py` for InfoXLM and LaBSE, and `bucc_laser.py` for LASER.
- Tatoeba: cross-lingual retrieval task (a minimal sketch of the underlying nearest-neighbour retrieval follows this list).
  - To retrieve semantic embeddings, run `tatoeba_sem.py` for InfoXLM and LaBSE, and `tatoeba_sem_laser.py` for LASER.
  - To retrieve language embeddings, run `tatoeba_lang.py` for InfoXLM and LaBSE, and `tatoeba_lang_laser.py` for LASER.
- STS: semantic textual similarity task. Run `sts.py` for InfoXLM and LaBSE, and `sts_laser.py` for LASER.
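The scripts above are the reference evaluation code. Purely for illustration, the core idea behind Tatoeba-style retrieval can be sketched as cosine-similarity nearest-neighbour search over the embeddings; the function below is an assumption-level sketch, not the repository's implementation.

```python
# Minimal sketch of cosine-similarity retrieval over sentence embeddings
# (the provided inference scripts are the reference implementation).
import numpy as np

def retrieval_accuracy(src_emb: np.ndarray, tgt_emb: np.ndarray) -> float:
    """src_emb[i] and tgt_emb[i] are embeddings of a translation pair."""
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sims = src @ tgt.T                 # cosine similarity matrix
    nearest = sims.argmax(axis=1)      # nearest target for each source
    return float((nearest == np.arange(len(src))).mean())
```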
To visualize the embedding space of the trained decomposers using the `datavis` library, run `code/visualize.py`. This script saves an HTML Bokeh file in the `output_figure` directory.
python -u visualize.py \
--model_name_or_path $MODEL_NAME \
--pooler_path $TRAINED_POOLER_PATH \
--src_data_path $SOURCE_RETRIEVAL_DATA_PATH \
--tgt_data_path $TARGET_RETRIEVAL_DATA_PATH \
--src_lang $SOURCE_LANG \
--tgt_lang $TARGET_LANG \
--batch_size $BATCH_SIZE \
--output_figure $PATH_TO_OUTPUT_FIGURE
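`code/visualize.py` is the supported entry point. Purely as a rough sketch, a 2-D projection plus a Bokeh HTML scatter plot could look like the following; the embedding paths, the use of t-SNE, and the scikit-learn/Bokeh dependencies are assumptions and may differ from the actual script.

```python
# Rough sketch only -- code/visualize.py is the supported entry point.
# Assumes scikit-learn and bokeh are installed; paths are hypothetical.
import numpy as np
from sklearn.manifold import TSNE
from bokeh.plotting import figure, output_file, save

src_emb = np.load("embeddings/en-fr.en.npy")
tgt_emb = np.load("embeddings/en-fr.fr.npy")

points = TSNE(n_components=2).fit_transform(np.vstack([src_emb, tgt_emb]))
n = len(src_emb)

p = figure(title="Embedding space (en vs. fr)")
p.scatter(points[:n, 0], points[:n, 1], color="blue", legend_label="en")
p.scatter(points[n:, 0], points[n:, 1], color="red", legend_label="fr")

output_file("output_figure/embedding_space.html")
save(p)
```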
@inproceedings{ki-etal-2024-mitigating,
title = "Mitigating Semantic Leakage in Cross-lingual Embeddings via Orthogonality Constraint",
author = "Ki, Dayeon and
Park, Cheonbok and
Kim, Hyunjoong",
editor = "Zhao, Chen and
Mosbach, Marius and
Atanasova, Pepa and
Goldfarb-Tarrent, Seraphina and
Hase, Peter and
Hosseini, Arian and
Elbayad, Maha and
Pezzelle, Sandro and
Mozes, Maximilian",
booktitle = "Proceedings of the 9th Workshop on Representation Learning for NLP (RepL4NLP-2024)",
month = aug,
year = "2024",
address = "Bangkok, Thailand",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.repl4nlp-1.19",
pages = "256--273"
}