
Cross Lingual Embeddings

The goal of this project is to create embeddings for Italian and French that are aligned in a shared latent space using an encoder-decoder model. The embeddings are evaluated on machine translation tasks.

REPLICABILITY

Each experiment was conducted on a single machine by running the main script and specifying whether to process the dataset, whether to run the optimization pipeline, and whether to conduct an ablation study (only one of optimization and ablation study can be true):

python main.py --generate True --optimize True --ablation False # process dataset, optimize model, skip ablation study
python main.py --generate False --optimize False --ablation True # skip processing and optimization step, do ablation study
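A minimal sketch of how main.py might parse these flags (the argument names come from the commands above; everything else, including the helper for parsing boolean strings, is an assumption rather than the repository's actual code):

import argparse

def str2bool(value: str) -> bool:
    # Interpret "True"/"False" strings passed on the command line as booleans.
    return value.lower() in ("true", "1", "yes")

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Cross-lingual embeddings pipeline")
    parser.add_argument("--generate", type=str2bool, default=False,
                        help="process the raw dataset before training")
    parser.add_argument("--optimize", type=str2bool, default=False,
                        help="run the hyper-parameter optimization pipeline")
    parser.add_argument("--ablation", type=str2bool, default=False,
                        help="run the ablation study instead of optimization")
    args = parser.parse_args()

    # Only one of optimization and ablation study can be enabled at a time.
    if args.optimize and args.ablation:
        parser.error("--optimize and --ablation are mutually exclusive")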

Environment

Data Processing Script

Training & Testing Script

Results

Configurations

DATASET

The parallel Italian-English and French-English datasets were obtained from the European Parliament Proceedings Parallel Corpus 1996-2011. These datasets were already sentence-aligned.
To obtain an Italian-French corpus, the Italian-English and French-English corpora were joined on the English sentences they have in common. It's worth noting that the two corpora did not have the same number of sentences due to differences in translation. During European Parliament sessions, not every language is translated into English and then back into the target language; some languages are translated directly from source to target without going through English first. This creates misalignment between the corpora, so some loss of information is expected.
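A minimal sketch of joining the two corpora on their shared English sentences (the file names and the pandas-based approach are assumptions, not necessarily the repository's actual processing code):

import pandas as pd

# Each Europarl corpus is a pair of line-aligned files, one per language.
def load_parallel(src_path: str, en_path: str, src_col: str) -> pd.DataFrame:
    with open(src_path, encoding="utf-8") as f_src, open(en_path, encoding="utf-8") as f_en:
        rows = [(s.strip(), e.strip()) for s, e in zip(f_src, f_en)]
    return pd.DataFrame(rows, columns=[src_col, "english"])

it_en = load_parallel("europarl-v7.it-en.it", "europarl-v7.it-en.en", "italian")
fr_en = load_parallel("europarl-v7.fr-en.fr", "europarl-v7.fr-en.en", "french")

# Keep only sentence pairs whose English translation appears in both corpora;
# sentences translated directly between languages (never through English) are lost here.
it_fr = it_en.merge(fr_en, on="english", how="inner")[["italian", "french"]]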

Due to the extensive volume of words within the corpus, exceeding 800,000, each language's vocabulary encompasses a substantial number of words, surpassing one million in total. The vocabulary was used to create embeddings for the sentences by converting each word into its corresponding index in the vocabulary.

Such index-based representations can pose challenges for neural network training, since the network may struggle to reconstruct inputs containing such large values. Conversely, normalizing these values can produce exceedingly small numbers (e.g., 0.000000003), which are impractical for reliable reconstruction. Therefore, only the 30,000 most frequent words were included in the vocabulary.
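A minimal sketch of building the 30,000-word vocabulary and converting sentences to index sequences (the special tokens and helper names are assumptions, not necessarily those used in the repository):

from collections import Counter

VOCAB_SIZE = 30_000
SPECIALS = ["<pad>", "<sos>", "<eos>", "<unk>"]

def build_vocab(sentences):
    # Count whitespace-separated tokens and keep only the most frequent ones.
    counts = Counter(token for sent in sentences for token in sent.split())
    most_common = [tok for tok, _ in counts.most_common(VOCAB_SIZE - len(SPECIALS))]
    return {tok: idx for idx, tok in enumerate(SPECIALS + most_common)}

def encode(sentence, vocab):
    # Words outside the 30,000 most frequent ones map to <unk>.
    unk = vocab["<unk>"]
    return [vocab["<sos>"]] + [vocab.get(tok, unk) for tok in sentence.split()] + [vocab["<eos>"]]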

An alternative solution that was explored was to employ pretrained embedding techniques such as GloVe or Word2Vec. However, these methods are inherently lossy, which makes reconstructing the original sentence in natural language more challenging.

MODEL'S ARCHITECTURE

Each encoder is composed of an Embedding layer and 3 stacked LSTM layers, while each decoder contains 3 stacked LSTM layers and a Linear layer. The shared space is a linear layer followed by a ReLU activation function that generates low-dimensional embeddings.
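A minimal PyTorch sketch of this architecture (the embedding size, hidden size, latent dimension and the way the latent vector is fed to the decoder are placeholders, not the values used in the experiments):

import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size=30_000, embed_dim=256, hidden_dim=512, latent_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=3, batch_first=True)
        # Shared latent space: a linear layer plus ReLU producing low-dimensional embeddings.
        self.to_latent = nn.Sequential(nn.Linear(hidden_dim, latent_dim), nn.ReLU())

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)   # (batch, seq, embed_dim)
        _, (hidden, _) = self.lstm(embedded)   # hidden: (3, batch, hidden_dim)
        return self.to_latent(hidden[-1])      # (batch, latent_dim)

class Decoder(nn.Module):
    def __init__(self, vocab_size=30_000, latent_dim=128, hidden_dim=512):
        super().__init__()
        self.lstm = nn.LSTM(latent_dim, hidden_dim, num_layers=3, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, latent, seq_len):
        # Repeat the latent embedding at every time step to drive the decoder.
        inputs = latent.unsqueeze(1).repeat(1, seq_len, 1)
        outputs, _ = self.lstm(inputs)
        return self.out(outputs)               # (batch, seq, vocab_size)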

During training, the embeddings generated by the two encoders are aligned using a contrastive loss function, while the decoders learn to reconstruct the original sequences from those embeddings using a Negative Log Likelihood loss.
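A minimal sketch of the two objectives (the margin and the positive/negative pair construction are assumptions): a contrastive loss pulling parallel Italian-French latent embeddings together, and a Negative Log Likelihood loss on the decoder outputs:

import torch
import torch.nn.functional as F

def contrastive_loss(z_it, z_fr, label, margin=1.0):
    # label = 1 for parallel (positive) pairs, 0 for mismatched (negative) pairs.
    dist = F.pairwise_distance(z_it, z_fr)
    pos = label * dist.pow(2)
    neg = (1 - label) * torch.clamp(margin - dist, min=0).pow(2)
    return (pos + neg).mean()

def reconstruction_loss(logits, target_ids):
    # logits: (batch, seq, vocab); target_ids: (batch, seq) of token indices.
    log_probs = F.log_softmax(logits, dim=-1)
    return F.nll_loss(log_probs.transpose(1, 2), target_ids)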

During training: (see the Training diagram in the repository)

During testing: (see the Testing diagram in the repository)

LATENT SPACE PROJECTION

After training, we expect the embeddings of the two languages to overlap in the shared latent space.
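A minimal sketch of how such a projection can be produced for inspection (PCA and matplotlib are used here as an illustration; the repository may visualize the latent space differently):

import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

def plot_latent_space(z_it: np.ndarray, z_fr: np.ndarray):
    # Fit a single PCA on both languages so the two point clouds share the same axes.
    pca = PCA(n_components=2).fit(np.vstack([z_it, z_fr]))
    it_2d, fr_2d = pca.transform(z_it), pca.transform(z_fr)
    plt.scatter(it_2d[:, 0], it_2d[:, 1], s=5, label="Italian")
    plt.scatter(fr_2d[:, 0], fr_2d[:, 1], s=5, label="French")
    plt.legend()
    plt.show()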
