Bridging linguistic typology and multilingual machine translation with multi-view language representations

Abstract

Sparse language vectors from linguistic typology databases and learned embeddings from tasks like multilingual machine translation have been investigated in isolation, without analysing how they could benefit from each other's language characterisation. We propose to fuse both views using singular vector canonical correlation analysis and study what kind of information is induced from each source. By inferring typological features and language phylogenies, we observe that our representations embed typology and strengthen correlations with language relationships. We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy in tasks that require information about language similarities, such as language clustering and ranking candidates for multilingual transfer. With our method, we can easily project and assess new languages without expensive retraining of massive multilingual or ranking models, which are major disadvantages of related approaches.

Configure your environment

# create a new environment, e.g. with conda, and activate it
conda create -n langrep python=3.6
conda activate langrep
# install the requirements
pip install -r requirements.txt
# install lang2vec independently
pip install --index-url --no-deps lang2vec


Computing multi-view language representations

Running the default action (langrep) computes the SVCCA representations and stores them in langrep_out.npy. The file contains a Python dictionary saved with NumPy, with one key-value pair per language, e.g. "eng" : [0.1, 0.2, 0.3, ..., 0.7]

python # OR
python langrep 
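
The saved file can be read back with NumPy. The following is a minimal sketch, assuming the file holds a pickled dictionary as described above (which is why `allow_pickle=True` and `.item()` are used). The helper name `load_langreps` is hypothetical, and the toy file written to a temporary directory only stands in for a real langrep_out.npy; its values are made up.

```python
import os
import tempfile

import numpy as np


def load_langreps(path):
    """Load a {language code: vector} dictionary that was stored with np.save."""
    # np.save wraps a dict in a 0-d object array; .item() recovers the dict.
    return np.load(path, allow_pickle=True).item()


# Toy stand-in for langrep_out.npy (made-up values, not real output).
toy = {"eng": np.array([0.1, 0.2, 0.3]), "spa": np.array([0.3, 0.1, 0.2])}
path = os.path.join(tempfile.mkdtemp(), "langrep_out.npy")
np.save(path, toy)

langreps = load_langreps(path)
print(sorted(langreps))        # language codes stored in the file
print(langreps["eng"].shape)   # dimensionality of one representation
```

With a real run, only the path needs to change to wherever langrep_out.npy was written.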

Default options include:

  • --KB_source : syntax_knn from lang2vec
  • --learned_source : ted53 (our own factored embeddings trained on the dataset of Qi et al. (2018))

Other arguments can be set as well, for example:

python langrep --KB_source phonology_knn --list_langs "spa,ita,ron,fra" --out_name "my_langreps"

This uses the phonology vectors from lang2vec, keeps only the listed languages, and saves the result under the given output name.

If any listed language is missing from the KB source or the learned source, a message reports it. If --list_langs is omitted, all languages that have representations in both sources are used.

SVCCA also has its own parameters, such as --KB_svd_th and --learn_svd_th, which are tuned to 0.65 and 0.6, respectively (see the Appendix of the paper).
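
SVCCA first reduces each view with SVD and then applies canonical correlation analysis to the reduced views. The NumPy sketch below illustrates that two-step idea under stated assumptions: each threshold is read as the fraction of variance retained by the SVD step (this README does not say so explicitly), the function names `svd_reduce` and `cca` are hypothetical, and the input matrices are random toy data. It is not the repository's implementation, and it does not show how the two projected views are merged into the final vectors.

```python
import numpy as np


def svd_reduce(X, threshold):
    """Centre X (languages x dims) and keep the leading singular directions
    whose cumulative explained variance first reaches `threshold`."""
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    explained = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = min(int(np.searchsorted(explained, threshold)) + 1, len(s))
    return U[:, :k] * s[:k]


def cca(X, Y):
    """CCA of two centred, full-column-rank views via QR + SVD.
    Returns the canonical correlations and both projected views."""
    Qx, _ = np.linalg.qr(X)
    Qy, _ = np.linalg.qr(Y)
    U, rho, Vt = np.linalg.svd(Qx.T @ Qy, full_matrices=False)
    return rho, Qx @ U, Qy @ Vt.T


# Toy views: 30 "languages", a 12-dim KB view and a related 8-dim learned view.
rng = np.random.default_rng(0)
kb = rng.normal(size=(30, 12))
learned = kb[:, :6] @ rng.normal(size=(6, 8)) + 0.1 * rng.normal(size=(30, 8))

kb_red = svd_reduce(kb, 0.65)          # analogous to --KB_svd_th
learned_red = svd_reduce(learned, 0.6)  # analogous to --learn_svd_th
rho, z_kb, z_learned = cca(kb_red, learned_red)
print(kb_red.shape, learned_red.shape, rho.shape)
```

Lower thresholds keep fewer SVD directions per view, so fewer canonical directions are available to the CCA step.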

Using your own language embeddings

In the folder data/embeddings/ you can store a file like ted53.npy (our own embeddings), which contains a Python dictionary as described previously. Then, use the following command:

python langrep --learned_source own --own_source ted53  
# write the name of the file without the extension
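
A file in that format can be prepared with a few lines of NumPy. This is a sketch under the assumption that the tool only needs a dictionary from three-letter language codes to equal-length vectors, saved with `np.save`; the helper `save_embeddings`, the file name `my_embs`, and the random vectors are placeholders. The demo writes to a temporary directory; in the repository, the output directory would be data/embeddings/.

```python
import os
import tempfile

import numpy as np


def save_embeddings(embeddings, out_dir, name):
    """Save a {language code: vector} dictionary as <out_dir>/<name>.npy."""
    dims = {len(vec) for vec in embeddings.values()}
    if len(dims) != 1:
        raise ValueError("all language vectors must have the same length")
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, name + ".npy")
    np.save(path, {lang: np.asarray(vec, dtype=float)
                   for lang, vec in embeddings.items()})
    return path


# Placeholder embeddings: random 4-dim vectors for two languages.
rng = np.random.default_rng(1)
my_embeddings = {"spa": rng.normal(size=4), "que": rng.normal(size=4)}

out_dir = os.path.join(tempfile.mkdtemp(), "data", "embeddings")
saved_path = save_embeddings(my_embeddings, out_dir, "my_embs")
print(saved_path)
```

Following the command above, such a file would then be selected with --own_source my_embs.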

Future feature: include your own KB-vectors as well.

Clustering languages

Clustering uses the same parameters as langrep. It prints a short report and saves a PDF with plots of the elbow method, the silhouette analysis, and the dendrogram of the hierarchy.

# Default settings (as in Figure 2 of the paper):
python cluster
# To cluster only Romance languages that are included in ted53:
python cluster --list_langs "por,spa,ita,fra,ron,glg" --out_name "romance_langs"
# Romance languages that are not included in ted53 can be clustered as well (they are projected from the KB source):
python cluster --list_langs "por,spa,ita,fra,ron,glg,ast,arg,scn" --out_name "romance_langs_extended"
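
Of the diagnostics mentioned above, the silhouette analysis is easy to reproduce by hand. The NumPy sketch below assumes Euclidean distance and hard cluster labels; the distance and clustering algorithm that cluster actually uses are not stated here. The function name is hypothetical and the 2-D points are toy values, not language vectors produced by the tool.

```python
import numpy as np


def mean_silhouette(X, labels):
    """Mean silhouette coefficient for points X under a hard clustering
    (at least two clusters are required)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all points.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = int(same.sum())
        if n_same == 1:
            continue  # a singleton cluster scores 0 by convention
        # a: mean distance to the other members of the same cluster.
        a = dist[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(dist[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return float(scores.mean())


# Two well-separated toy groups of three points each.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
good = mean_silhouette(points, [0, 0, 0, 1, 1, 1])
bad = mean_silhouette(points, [0, 1, 0, 1, 0, 1])
print(good, bad)
```

A labelling that matches the true groups scores close to 1, while a labelling that mixes the groups scores much lower, which is how the silhouette plot helps to choose the number of clusters.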

Ranking related languages

Ranking uses the same parameters as langrep, plus some specific ones:

  • --target_lang : target language for ranking (required)
  • --list_langs : list of candidate languages to consider (optional). By default, only the languages included in both sources are analysed.
  • --num_langs : rank a specific number of languages. Default = 5 (int)
  • --group_size : rank languages until a given amount of training data has been accumulated (e.g. 500000). Default = -1 (disabled). Currently, it only works with ted53. If enabled, it disables --num_langs
  • --filter_size : ignore languages with a training size below this value (e.g. 100000). Default = -1 (disabled). Currently, it only works with ted53.

python rank --target_lang glg --num_langs 5 --filter_size 40000
python rank --target_lang glg --group_size 500000
python rank --target_lang glg --num_langs 2 --list_langs "por,spa,ita,fra,ron,glg,ast,arg,scn"
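
The interplay of --num_langs, --filter_size and --group_size can be sketched as follows. This is an illustration only: it assumes candidates are ranked by cosine similarity to the target vector and that --group_size keeps adding ranked languages until their summed training size reaches the given value. Neither detail is confirmed by this README. The function `rank_languages`, the vectors, and the training sizes are all made up.

```python
import numpy as np


def rank_languages(reps, target, num_langs=5, sizes=None,
                   filter_size=-1, group_size=-1):
    """Rank candidate languages by cosine similarity to `target`."""
    t = np.asarray(reps[target], dtype=float)
    scored = []
    for lang, vec in reps.items():
        if lang == target:
            continue
        # Skip candidates with too little training data, if filtering is on.
        if filter_size > 0 and sizes is not None and sizes.get(lang, 0) < filter_size:
            continue
        v = np.asarray(vec, dtype=float)
        cos = float(t @ v / (np.linalg.norm(t) * np.linalg.norm(v)))
        scored.append((lang, cos))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    if group_size > 0 and sizes is not None:
        # Accumulate ranked languages until enough training data is gathered.
        picked, total = [], 0
        for lang, cos in scored:
            picked.append((lang, cos))
            total += sizes.get(lang, 0)
            if total >= group_size:
                break
        return picked
    return scored[:num_langs]


# Made-up vectors and training sizes for illustration.
reps = {"glg": [1.0, 0.9, 0.1], "por": [1.0, 1.0, 0.0],
        "spa": [0.9, 0.8, 0.3], "rus": [-1.0, 0.2, 1.0]}
sizes = {"por": 50000, "spa": 200000, "rus": 200000}

top2 = rank_languages(reps, "glg", num_langs=2)
filtered = rank_languages(reps, "glg", sizes=sizes, filter_size=100000)
grouped = rank_languages(reps, "glg", sizes=sizes, group_size=200000)
print([lang for lang, _ in top2])
print([lang for lang, _ in filtered])
print([lang for lang, _ in grouped])
```

In this sketch, as in the option descriptions above, enabling group_size takes precedence over num_langs.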


    title = "Bridging Linguistic Typology and Multilingual Machine Translation with Multi-View Language Representations",
    author = "Oncevay, Arturo  and
      Haddow, Barry  and
      Birch, Alexandra",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "",
    pages = "2391--2406",


