This repository contains the code for our experiments in:
Lisa Beinborn and Rochelle Choenni (2019):
Semantic Drift in Multilingual Representations
https://arxiv.org/pdf/1904.10820.pdf
The word and sentence embeddings are too large to be uploaded to GitHub. We stored our distance_matrices in pickle files so that you can reproduce our plots (a loading sketch follows the list below):
- Word-based: run clustering_wordbased.py. The results will be saved in results/word-based/. The plots for our qualitative examples can be reproduced by running the methods in the directory detailed_analyses.
- Sentence-based: run clustering_sentencebased.py. The results will be saved in results/sentence-based/.
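Loading one of those pickles and redrawing the language dendrogram only takes a few lines. This is a minimal sketch, assuming the pickle holds a symmetric language-by-language distance matrix; the file name and language ordering are hypothetical, and the actual plotting logic lives in the clustering scripts:

```python
import pickle

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform

# Hypothetical file name; use one of the pickles in distance_matrices/
# and the language ordering it was built with.
with open("distance_matrices/word_based_distances.pkl", "rb") as f:
    distances = pickle.load(f)  # assumed: symmetric n_languages x n_languages array

languages = ["en", "de", "fr", "es"]  # assumed row/column ordering

# Agglomerative clustering over the pairwise language distances;
# squareform converts the square matrix to the condensed form linkage expects.
linkage_matrix = linkage(squareform(distances), method="average")
dendrogram(linkage_matrix, labels=languages)
plt.show()
```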
If you want to re-run the calculations completely:
Word-based
- Download the MUSE embeddings for the languages of interest from https://github.com/facebookresearch/MUSE. Make sure to cite:
  Conneau, Alexis, Guillaume Lample, Marc’Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou (2017): Word translation without parallel data, https://arxiv.org/pdf/1710.04087.pdf
- Specify data_dir to point to the folder where you store the MUSE embeddings. Extract the embeddings for the test words by running extract_word_embeddings.py. They will be saved in data/embeddings/word-based/ (see the sketch below).
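The MUSE vectors are distributed as .vec text files (a header line with the vocabulary size and dimensionality, then one word and its vector per line). As a rough sketch of the extraction step, with a hypothetical path and word list:

```python
import numpy as np

# Hypothetical path and test words; extract_word_embeddings.py reads data_dir
# and the test word lists used in the experiments.
muse_path = "data/muse/wiki.multi.en.vec"
test_words = {"hand", "water", "mountain"}

embeddings = {}
with open(muse_path, encoding="utf-8") as f:
    next(f)  # skip the "<vocab_size> <dim>" header line
    for line in f:
        word, vector = line.rstrip().split(" ", 1)
        if word in test_words:
            embeddings[word] = np.array(vector.split(), dtype=np.float32)
```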
Sentence-based
- The sentences have been extracted from Europarl and can be found in data/sentence-based/embeddings/part. The embeddings have been extracted using LASER. You can contact us if you want to be sure to use exactly the same embeddings.
- If you want to use other sentences, you need to install LASER (https://github.com/facebookresearch/LASER) and follow their instructions to get embeddings; a sketch for reading the resulting embedding files follows this list.
- Get the ground_truth dictionaries from https://github.com/facebookresearch/MUSE#ground-truth-bilingual-dictionaries and put them into the folder muse_dictionaries (a loading sketch also follows below).
- Run estimate_translation_quality.py.
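LASER stores sentence embeddings as raw float32 matrices with 1024 dimensions per sentence, so they can be read back with numpy alone. A minimal sketch, with an assumed file path:

```python
import numpy as np

EMBEDDING_DIM = 1024  # dimensionality of LASER sentence embeddings

# Hypothetical path; point this at the output of LASER's embedding pipeline
# (or at the files under data/sentence-based/embeddings/).
embeddings = np.fromfile("europarl_sentences.raw", dtype=np.float32)
embeddings = embeddings.reshape(-1, EMBEDDING_DIM)  # one row per sentence
```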
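The MUSE ground-truth dictionaries are plain-text files with one whitespace-separated source-target word pair per line; a source word can occur several times with different translations. A minimal loading sketch, with a hypothetical file name:

```python
from collections import defaultdict

translations = defaultdict(set)
# Hypothetical file name; the MUSE dictionaries are named like "en-de.txt".
with open("muse_dictionaries/en-de.txt", encoding="utf-8") as f:
    for line in f:
        source, target = line.split()
        translations[source].add(target)  # sources may have several translations
```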
For the quantitative evaluation of the clustering, run calculate_experimental_treescores.py.
We compared our nearest-neighbor method to using translations from the NorthEuraLex database. If you want to reproduce this, run clustering_wordbased_northeuralex.py. The quantitative results can be obtained by changing the variable category to "northeuralex" in calculate_experimental_treescores.py.
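The tree score itself is computed in calculate_experimental_treescores.py. Purely as an illustration of this kind of evaluation (this is not the metric from the paper), one could cut a dendrogram into flat clusters and compare them to gold language families on toy data:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.metrics import adjusted_rand_score

# Toy distances for ["en", "de", "fr", "es"]: Germanic vs. Romance.
distances = np.array([
    [0.0, 0.2, 0.8, 0.9],
    [0.2, 0.0, 0.7, 0.8],
    [0.8, 0.7, 0.0, 0.1],
    [0.9, 0.8, 0.1, 0.0],
])
gold_families = [0, 0, 1, 1]  # Germanic = 0, Romance = 1

linkage_matrix = linkage(squareform(distances), method="average")
predicted = fcluster(linkage_matrix, t=2, criterion="maxclust")
print(adjusted_rand_score(gold_families, predicted))  # 1.0 = perfect agreement
```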
We use numpy, scipy, scikit-learn, and matplotlib. If you want to extract new word embeddings, you will also need torch. Check requirements.txt for details.