# Introduction to cross-lingual word-embeddings at Wikimania 2019

* In this tutorial we will learn how to work with cross-lingual word embeddings. 
* This code is based on the repository shared by [Smith et al.](https://github.com/Babylonpartners/fastText_multilingual)
* You can see applications of this code in the alignment of Wikipedia [Sections](https://github.com/digitalTranshumant/wmf-interlanguage) and [Template parameters](https://github.com/digitalTranshumant/templatesAlignment).

* You will need to download the fastText models and the transformation matrices, as described in the sections below.


In [3]:
#Config 
## Add here your folders and languages
import fastText
from scipy.spatial import distance
import numpy as np
import networkx as nx
lang1 = 'en'
lang2 = 'es'
langs =[lang1,lang2]
pathVectors = 'vectors/' 
pathAlignment = 'wikiAlignments/'

## Download fasttext models

* This script downloads the fastText pre-trained models for the languages listed in the `langs` variable.
* This process **can take a long time**.
* Note that **each model file is around 8G**, and later you will need to unzip those models, using around 15G per model in total.
* Comment out (add a # prefix to) the first line of the next cell to download the models. If you already have the models in your folder, you can skip this step.
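
Since the downloads are large, it can help to check the free disk space first. The following is a minimal sketch; the helper `has_enough_space` is hypothetical, and the 15G figure is simply the rough per-model estimate from the note above.

```python
import shutil

# Rough per-model footprint from the note above (zip plus unzipped files).
GB_PER_MODEL = 15

def has_enough_space(path, n_models, gb_per_model=GB_PER_MODEL):
    """Return True if the disk holding `path` has room for `n_models` models."""
    free_gb = shutil.disk_usage(path).free / 1024 ** 3
    return free_gb >= n_models * gb_per_model

# Two languages ('en' and 'es') means two models.
print(has_enough_space('.', n_models=2))
```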



In [None]:
COMMENT  HERE TO RUN THIS CELL

!mkdir -p {pathVectors}
for l in langs:
    print(l)
    !wget -P {pathVectors} {'https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.%s.zip' % l}
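
The downloaded archives still need to be unzipped before the models can be loaded. A possible sketch with the standard library follows. It assumes each archive is named `wiki.<lang>.zip` and contains a top-level `wiki.<lang>.bin` file; the helper `unzip_model` is hypothetical.

```python
import os
import zipfile

def unzip_model(path_vectors, lang):
    """Extract wiki.<lang>.bin from wiki.<lang>.zip and return the .bin path."""
    zip_path = os.path.join(path_vectors, 'wiki.%s.zip' % lang)
    bin_name = 'wiki.%s.bin' % lang
    bin_path = os.path.join(path_vectors, bin_name)
    # Skip the extraction if the model is already in place.
    if os.path.exists(bin_path):
        return bin_path
    with zipfile.ZipFile(zip_path) as archive:
        # Only the binary model is needed by load_model.
        archive.extract(bin_name, path_vectors)
    return bin_path
```

In this notebook it would be called once per language, for example `unzip_model(pathVectors, 'en')`.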

Load the models; this can take a few minutes.

In [2]:
model = {}
for lang in langs:
    model[lang] = fastText.load_model('%s/wiki.%s.bin' % (pathVectors,lang))  

## Download transformation Matrices
* Note that this repository already contains the transformation from 'en' to 'es'.
* These alignments were generated using [this code](https://analytics.wikimedia.org/datasets/one-off/dsaez/)
* If you need a pair of languages that is not included here, please contact us, or use the pre-trained matrices [provided here](https://github.com/Babylonpartners/fastText_multilingual/tree/master/alignment_matrices)

In [None]:
COMMENT  HERE TO RUN THIS CELL
!mkdir {pathAlignment} #comment here if the folder already exists
for l in langs:
    print(l)
    !wget -P {pathAlignment} {'apply_in_%s_to_%s.txt ' % (lang1,lang2)}
    !wget -P {pathAlignment} {'apply_in_%s_to_%s.txt ' % (lang2,lang1)}

In [4]:
v1 = model['es'].get_sentence_vector('perro')
v2 = model['en'].get_sentence_vector('dog')

In [5]:
distance.cosine(v1,v2)

0.9362809884474315
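
To read these numbers: `distance.cosine` returns one minus the cosine similarity, so 0 means the vectors point in the same direction, 1 means they are orthogonal (unrelated), and values up to 2 mean they point in opposite directions. A value close to 1, as above, is what we expect from two spaces that have not been aligned yet. A small sketch with toy vectors (the helper `cosine_distance` is only illustrative):

```python
import numpy as np
from scipy.spatial import distance

def cosine_distance(u, v):
    """1 - cosine similarity, the quantity computed by distance.cosine."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

a = np.array([1.0, 0.0])
b = np.array([0.0, 2.0])   # orthogonal to a
c = np.array([-3.0, 0.0])  # opposite direction to a

for other in (a, b, c):
    # Both implementations should agree on every pair.
    print(cosine_distance(a, other), distance.cosine(a, other))
```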

The following function applies the transformation to a given vector.

In [26]:
def apply_transform(vec, transform):
    """
    Apply the given transformation to the vector space

    Right-multiplies given transform with embeddings E:
        E = E * transform

    Transform can either be a string with a filename to a
    text file containing a ndarray (compat. with np.loadtxt)
    or a numpy ndarray.
    """
    transmat = np.loadtxt(transform) if isinstance(transform, str) else transform
    return np.matmul(vec, transmat)
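
Why does a single matrix multiplication not damage the embeddings? Assuming the alignment matrices are orthogonal, as in the approach of Smith et al., right-multiplying rotates the whole space, so distances between words of the same language stay unchanged. The sketch below checks this with a random orthogonal matrix `Q`, which is only a stand-in for a real alignment matrix.

```python
import numpy as np
from scipy.spatial import distance

rng = np.random.default_rng(0)
dim = 300  # dimensionality of the wiki.*.bin vectors

# Hypothetical stand-in for an alignment matrix: a random orthogonal
# matrix obtained from a QR decomposition.
Q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))

u = rng.normal(size=dim)
v = rng.normal(size=dim)

before = distance.cosine(u, v)
after = distance.cosine(np.matmul(u, Q), np.matmul(v, Q))

# The rotation leaves the cosine distance between u and v unchanged.
print(np.isclose(before, after))
```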

In [71]:
# Align second language
v2Aligned = apply_transform(v2,'%s/apply_in_%s_to_%s.txt' % (pathAlignment,'en','es') )

In [72]:
distance.cosine(v1,v2Aligned)

0.2947399014085661

### Subword information

fastText builds word vectors from character n-grams, so modified or misspelled words still get useful vectors.

Within the same language:

In [86]:
v1 = model['en'].get_sentence_vector('excellent')
v2 = model['en'].get_sentence_vector('excelent')
distance.cosine(v1,v2)

0.42508700188863213
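
The misspelled word still lands close to the correct one because both share most of their character n-grams. The sketch below lists those n-grams, assuming n-gram lengths from 3 to 6 and the `<` and `>` word-boundary markers (the fastText defaults for unsupervised models); the helper `char_ngrams` is hypothetical and is not part of the fastText API.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Return the set of character n-grams of `word`, with boundary markers."""
    wrapped = '<%s>' % word
    grams = set()
    for n in range(minn, maxn + 1):
        for i in range(len(wrapped) - n + 1):
            grams.add(wrapped[i:i + n])
    return grams

a = char_ngrams('excellent')
b = char_ngrams('excelent')
shared = a & b

# Many n-grams survive the misspelling, so the two vectors stay related.
print(len(shared), 'shared n-grams out of', len(a | b))
print(sorted(shared)[:5])
```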

In [93]:
v1 = model['es'].get_sentence_vector('perro1')
v2 = model['en'].get_sentence_vector('dog')
v2Aligned = apply_transform(v2,'%s/apply_in_%s_to_%s.txt' % (pathAlignment,'en','es') )
distance.cosine(v1,v2Aligned)

0.4684736133135168

### Sentence Level

In [75]:
sentence1 = model['es'].get_sentence_vector('Que tenga un bonito día señor')
sentence2 = model['en'].get_sentence_vector('Have a nice day sir')

In [76]:
distance.cosine(sentence1,sentence2)

1.0227684068147005
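
As with single words, the unaligned sentence vectors are unrelated. For reference, here is a rough sketch of how a sentence vector can be built from word vectors: each word vector is scaled to unit length and the results are averaged. This is an assumption about what `get_sentence_vector` does for these unsupervised models, not a copy of the library code; the helper `sentence_vector` and the toy vectors are hypothetical.

```python
import numpy as np

def sentence_vector(sentence, get_word_vector, dim):
    """Average of the L2-normalised word vectors of a whitespace-split sentence."""
    total = np.zeros(dim)
    count = 0
    for word in sentence.split():
        vec = np.asarray(get_word_vector(word), dtype=float)
        norm = np.linalg.norm(vec)
        if norm > 0:  # skip all-zero vectors
            total += vec / norm
            count += 1
    return total / count if count else total

# Toy 2-dimensional "model" standing in for model['en'].get_word_vector
toy = {'nice': np.array([3.0, 0.0]), 'day': np.array([0.0, 0.5])}
print(sentence_vector('nice day', toy.get, dim=2))
```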

In [77]:
sentence2Aligned = apply_transform(sentence2,'%s/apply_in_%s_to_%s.txt' % (pathAlignment,'en','es') )

In [78]:
distance.cosine(sentence1,sentence2Aligned)

## Aligning sets of words

Load all transformations

In [70]:
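transmat = {}
for lang in langs:
    print(lang)
    transmat[lang] = {}
    # Use a separate loop variable so the global lang2 is not overwritten
    for other in langs:
        if lang != other:
            # Matrix that maps vectors of `other` into the space of `lang`
            transmat[lang][other] = np.loadtxt('%s/apply_in_%s_to_%s.txt' % (pathAlignment,other,lang))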

en
es


In [113]:
words = {}
words[lang1] = ['cat','kitty','motocycle','car','dog','truck','geography','mountains','rivers','basketball','football']
words[lang2] = ['gato','automóvil','perro','camión','geografía','montañas','rios','baloncesto','futbol']

In [114]:
def getMoreSimilar(wordLang1,setLang2,sourceLang,targetLang):
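    """
    Given a word in the source language and a set of words/sentences in the
    target language, return the closest target item after alignment.
    wordLang1: str, 'perro'
    setLang2: dict or list, ['hello','dog']
    sourceLang: str, 'es'
    targetLang: str, 'en'
    return tuple (cosine distance, closest item of setLang2)
    """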
    global model
    global transmat
    d = []
    vec1 = model[sourceLang].get_sentence_vector(wordLang1)
    for s2 in setLang2:
        vec2= model[targetLang].get_sentence_vector(s2.strip().replace('_',' '))
        vec2T = apply_transform(vec2,transmat[sourceLang][targetLang])
        dist = distance.cosine(vec1,vec2T)
        d.append((dist,s2))
    return sorted(d)[0]


In [116]:
wordEn ='cat'
print('Searching for the most similar word to:', wordEn)
print('list',words['es'])
getMoreSimilar(wordEn,words['es'],'en','es')

Searching for the most similar word to: cat
list ['gato', 'automóvil', 'perro', 'camión', 'geografía', 'montañas', 'rios', 'baloncesto', 'futbol']


(0.43776026412466407, 'gato')

In [117]:
wordEn ='kitty'
print('Searching for the most similar word:', wordEn)
print('list',words['es'])
getMoreSimilar(wordEn,words['es'],'en','es')

Searching for the most similar word: kitty
list ['gato', 'automóvil', 'perro', 'camión', 'geografía', 'montañas', 'rios', 'baloncesto', 'futbol']


(0.647667168023605, 'gato')
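
`getMoreSimilar` transforms and compares one candidate at a time. For larger candidate sets, a vectorised variant may be faster: stack the candidate vectors in a matrix, align them with one matrix product, and compute all distances with `scipy.spatial.distance.cdist`. The sketch below uses toy 2-dimensional vectors; the helper `most_similar` is hypothetical, and `candidate_vecs` is assumed to be already aligned to the query's space.

```python
import numpy as np
from scipy.spatial import distance

def most_similar(query_vec, candidate_vecs, candidates):
    """Return (cosine distance, candidate) for the closest row of candidate_vecs."""
    # cdist expects 2-D inputs, so the query becomes a 1 x dim matrix.
    dists = distance.cdist(query_vec[None, :], candidate_vecs, metric='cosine')[0]
    best = int(np.argmin(dists))
    return dists[best], candidates[best]

# Toy data: in the notebook, each row would be an aligned target-language vector.
toy_candidates = ['gato', 'perro']
toy_vecs = np.array([[1.0, 0.1],
                     [0.0, 1.0]])
print(most_similar(np.array([1.0, 0.0]), toy_vecs, toy_candidates))
```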

### Aligning sets of words

Given two sets of words, get a one-to-one mapping between them.

In [118]:
# One-to-one mappings
def alignSets(set1,set2,sourceLang,targetLang,sensivity=.45):
    """
    Given two sets of words/sentences in two languages
    return the possible alignments between sentences
    set1: dict or list, ['hola','perro']
    set2: dict or list, ['hello','dog']
    sourceLang: str, 'es'
    targetLang: str, 'en'
    sensivity: float, maximum cosine distance for a candidate pair
    return list of dicts, one per aligned pair
    """
    global model
    global transmat
    output = []
    G = nx.Graph()
    # Add an edge for every cross-lingual pair closer than the threshold
    for s1 in set1:
        vec1 = model[sourceLang].get_sentence_vector(s1.strip().replace('_',' '))
        for s2 in set2:
            vec2 = model[targetLang].get_sentence_vector(s2.strip().replace('_',' '))
            vec2T = apply_transform(vec2,transmat[sourceLang][targetLang])
            dist = distance.cosine(vec1,vec2T)
            if dist < sensivity:
                node1 = '%s_%s' % (sourceLang,s1)
                node2 = '%s_%s' % (targetLang,s2)
                G.add_edge(node1,node2)
                G[node1][node2]['w'] = dist

    # Greedily take the closest remaining pair and remove both of its nodes
    while G.edges():
        p = sorted(G.edges(data=True), key=lambda x: x[2]['w'])[0]
        psorted = sorted(list(p[:2]))
        # Node names are '<lang>_<item>' and assume two-letter language codes
        output.append({psorted[0][:2]:psorted[0][3:],psorted[1][:2]:psorted[1][3:],'d':p[2]['w']})
        G.remove_node(p[0])
        G.remove_node(p[1])
    return output

In [119]:
print(words[lang1])
print(words[lang2])

alignSets(words[lang1],words[lang2],lang1,lang2)

['cat', 'kitty', 'motocycle', 'car', 'dog', 'truck', 'geography', 'mountains', 'rivers', 'basketball', 'football']
['gato', 'automóvil', 'perro', 'camión', 'geografía', 'montañas', 'rios', 'baloncesto', 'futbol']


[{'d': 0.20396928012967108, 'en': 'basketball', 'es': 'baloncesto'},
 {'d': 0.2424814900096134, 'en': 'mountains', 'es': 'montañas'},
 {'d': 0.27150193662130007, 'en': 'truck', 'es': 'camión'},
 {'d': 0.2744324328139399, 'en': 'car', 'es': 'automóvil'},
 {'d': 0.2938059497380646, 'en': 'geography', 'es': 'geografía'},
 {'d': 0.29473989913984433, 'en': 'dog', 'es': 'perro'},
 {'d': 0.3924369271143857, 'en': 'football', 'es': 'futbol'},
 {'d': 0.43776026412466407, 'en': 'cat', 'es': 'gato'}]
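
Note that 'kitty' and 'motocycle' get no partner: each Spanish word is used at most once, and pairs above the threshold are discarded. The matching logic can be sketched without the models or networkx: sort the candidate pairs by distance and accept a pair only if neither side has been used. The helper `greedy_one_to_one` is hypothetical, and the toy distances below are made up for illustration, not computed from the models.

```python
def greedy_one_to_one(pairs, threshold=0.45):
    """
    pairs: iterable of (distance, source_item, target_item)
    Return a list of (source_item, target_item, distance), closest pairs first,
    where every item appears in at most one pair.
    """
    used_src, used_tgt, output = set(), set(), []
    for dist, src, tgt in sorted(pairs):
        if dist >= threshold:
            break  # all remaining pairs are too far apart
        if src in used_src or tgt in used_tgt:
            continue  # one side is already matched
        used_src.add(src)
        used_tgt.add(tgt)
        output.append((src, tgt, dist))
    return output

# Made-up distances, for illustration only
toy_pairs = [(0.44, 'cat', 'gato'), (0.40, 'kitty', 'gato'),
             (0.30, 'dog', 'perro'), (0.43, 'cat', 'perro')]
print(greedy_one_to_one(toy_pairs))
```

In this toy case 'kitty' takes 'gato' first, so 'cat' stays unmatched: a greedy strategy does not guarantee the best overall assignment.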