## How do you evaluate goodness of embeddings across languages?

If you have embeddings for language A and another set of embeddings for language B, how can you tell how good they are? 

We might want to evaluate each set on a task specific to that language. But a much more interesting proposition is aligning the embeddings in an unsupervised way and evaluating the performance on translation!

For this, we need three components: [the embeddings](https://fasttext.cc/docs/en/pretrained-vectors.html), [a way to align embeddings](https://github.com/artetxem/vecmap), and [dictionaries](https://github.com/facebookresearch/MUSE) we can use to evaluate the results.

Let's begin by grabbing embeddings for English and Polish.

In [1]:
!wget -P data https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.en.vec

--2020-09-14 17:58:30--  https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.en.vec
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 172.67.9.4, 104.22.75.142, 104.22.74.142, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|172.67.9.4|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6597238061 (6.1G) [binary/octet-stream]
Saving to: ‘data/wiki.en.vec’


2020-09-14 18:00:26 (54.5 MB/s) - ‘data/wiki.en.vec’ saved [6597238061/6597238061]



In [2]:
!wget -P data https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.pl.vec

--2020-09-14 18:00:26--  https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.pl.vec
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 104.22.74.142, 172.67.9.4, 104.22.75.142, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|104.22.74.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2707436342 (2.5G) [binary/octet-stream]
Saving to: ‘data/wiki.pl.vec’


2020-09-14 18:01:37 (36.6 MB/s) - ‘data/wiki.pl.vec’ saved [2707436342/2707436342]



Time to align the embeddings. We want to translate from Polish to English, so let's align our Polish embeddings with our English embeddings.

Bear in mind that this is a challenging task - Polish and English both belong to the Indo-European language family, but they come from different branches of it (Polish is a West Slavic language of the Lechitic group, English is a West Germanic language).

Let us clone [vecmap](https://github.com/artetxem/vecmap) to a directory adjacent to this one and perform the alignment.

In [3]:
%%time
!python ../vecmap/map_embeddings.py --cuda --unsupervised data/wiki.pl.vec data/wiki.en.vec data/wiki.pl.aligned.vec data/wiki.en.aligned.vec

CPU times: user 17.8 s, sys: 5.69 s, total: 23.5 s
Wall time: 26min 45s


Great! We now have the English and Polish embeddings aligned to live in the same embedding space!
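
To build some intuition for what "aligned" means, here is a minimal sketch of the much simpler supervised case: given pairs of vectors that are already known to correspond, find the rotation that maps one set onto the other (the orthogonal Procrustes problem, solved with an SVD). This is not vecmap's algorithm, since the unsupervised mode also has to discover which words correspond. The function name and the toy data are made up for illustration.

```python
import numpy as np

def orthogonal_procrustes(X, Y):
    """Return the orthogonal matrix W that minimises ||X @ W - Y|| (Frobenius norm)."""
    # The solution comes from the SVD of the cross-covariance matrix X^T Y.
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

rng = np.random.default_rng(0)

# Toy "target language" vectors: 100 words, 5 dimensions.
Y = rng.normal(size=(100, 5))

# Toy "source language" vectors: the same points, rotated by a hidden orthogonal matrix Q.
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))
X = Y @ Q.T

# Recover the rotation and map the source vectors into the target space.
W = orthogonal_procrustes(X, Y)
aligned = X @ W

print(np.allclose(aligned, Y))
```

On real embeddings the two spaces are only approximately rotations of each other, so the fit is never exact the way it is on this toy data.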

Let's grab a Polish to English dictionary from the MUSE repository.

In [4]:
!wget -P data https://dl.fbaipublicfiles.com/arrival/dictionaries/pl-en.txt

--2020-09-14 18:28:23--  https://dl.fbaipublicfiles.com/arrival/dictionaries/pl-en.txt
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 104.22.74.142, 104.22.75.142, 172.67.9.4, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|104.22.74.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1216386 (1.2M) [text/plain]
Saving to: ‘data/pl-en.txt’


2020-09-14 18:28:24 (34.8 MB/s) - ‘data/pl-en.txt’ saved [1216386/1216386]



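This dictionary is just a text file with one source word and one target word per line, separated by whitespace. A source word with several translations appears on several lines (see `przez` and `jego` below).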

In [2]:
!head data/pl-en.txt

roku	year
jest	is
nie	not
przez	through
przez	by
jako	as
oraz	and
był	was
jego	his
jego	its


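Let's wrap it in a Python class to make it easier to use. Note that the lookup keeps a single translation per source word: when a word appears on several lines, the last one wins.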

In [52]:
#export core

class Dictionary:
    """Source-to-target word lookup backed by a MUSE-style dictionary file."""

    def __init__(self, path_to_dict):
        self.read_words(path_to_dict)
        self.create_dict()

    def read_words(self, path_to_dict):
        # Each line holds one source word and one target word separated by whitespace.
        source_words, target_words = [], []
        with open(path_to_dict, encoding='utf-8') as f:
            for line in f:
                src, target = line.strip().split()
                source_words.append(src)
                target_words.append(target)
        self.source_words = source_words
        self.target_words = target_words

    def create_dict(self):
        # A source word listed on several lines keeps only its last translation.
        self.dict = dict(zip(self.source_words, self.target_words))

    def __getitem__(self, source_word):
        return self.dict[source_word]

    def __len__(self):
        # Number of lines (word pairs) in the file, not the number of unique source words.
        return len(self.source_words)

In [53]:
pl_en_dict = Dictionary('data/pl-en.txt')
pl_en_dict['królowa']

'queen'

In [54]:
len(pl_en_dict)

73901

Thanks to [nbdev](https://github.com/fastai/nbdev), we can conveniently import the Embeddings class we defined in the earlier notebook and use it here. 

In [55]:
from embedding_gym.core import Embeddings

Let's limit ourselves to the most common 50 000 words from each set of embeddings.

In [56]:
%%time

embeddings_pl = Embeddings.from_txt_file('data/wiki.pl.aligned.vec', limit=50_000)
embeddings_en = Embeddings.from_txt_file('data/wiki.en.aligned.vec', limit=50_000)

CPU times: user 11.8 s, sys: 368 ms, total: 12.2 s
Wall time: 11.7 s
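
Limiting by line count works because the fastText `.vec` files are plain text: a header line with the vocabulary size and the dimensionality, then one word per line followed by its vector, with the most frequent words first. The sketch below shows one way such a file could be read with a limit. It is not the implementation of `Embeddings.from_txt_file` (that lives in the earlier notebook); `read_vec` and the toy data are hypothetical.

```python
import io

import numpy as np

def read_vec(f, limit=None):
    """Read up to `limit` word vectors from an open .vec text file."""
    # Header line: "<number of words> <dimensionality>"
    n_words, dim = map(int, f.readline().split())
    words, vectors = [], []
    for i, line in enumerate(f):
        if limit is not None and i >= limit:
            break
        # Split from the right so that the last `dim` fields are always the vector.
        parts = line.rstrip().rsplit(' ', dim)
        words.append(parts[0])
        vectors.append([float(x) for x in parts[1:]])
    return words, np.array(vectors, dtype=np.float32)

# Toy file with three 2-dimensional vectors, standing in for data/wiki.pl.aligned.vec.
toy_file = io.StringIO("3 2\nthe 0.1 0.2\nof 0.3 0.4\nand 0.5 0.6\n")
words, vectors = read_vec(toy_file, limit=2)

print(words, vectors.shape)
```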


Let's see if our mechanism for translation works!

In [57]:
embeddings_en.nn_words_to(embeddings_pl['jest'])

['is', 'makes', 'are', 'becomes', 'has']

That it does - 'jest' means 'is' in Polish!
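
Translation here is just a nearest-neighbour lookup: take the Polish vector and find the English vectors closest to it in the shared space. A minimal sketch of such a lookup is shown here, assuming cosine similarity as the distance measure. The actual `nn_words_to` method is defined in the earlier notebook and may work differently; `nearest_words` and the toy vectors are made up for illustration.

```python
import numpy as np

def nearest_words(query, vectors, words, n=5):
    """Return the n words whose vectors have the highest cosine similarity to query."""
    # After normalising to unit length, a dot product equals cosine similarity.
    vectors_norm = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    query_norm = query / np.linalg.norm(query)
    similarities = vectors_norm @ query_norm
    # Sort by descending similarity and keep the first n indices.
    top = np.argsort(-similarities)[:n]
    return [words[i] for i in top]

# Toy target-language vocabulary with 2-dimensional vectors.
words = ['is', 'are', 'cat']
vectors = np.array([[1.0, 0.1], [0.9, 0.3], [-0.2, 1.0]])

# Toy source-language vector that we want to "translate".
query = np.array([1.0, 0.0])
result = nearest_words(query, vectors, words, n=2)

print(result)
```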

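Our translation results will be adversely affected by synonyms: the dictionary lookup returns a single reference translation, so a correct but different translation counts as a miss. What we can do, however, is limit the task to words where both the source word and its reference translation exist in the respective embeddings.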

In [61]:
%%time

topk = 1
source_embeddings = embeddings_pl
target_embeddings = embeddings_en

correct, total = 0, 0

for source_word in source_embeddings.i2w:
    # Only score words whose reference translation is present in the target embeddings.
    if source_word in pl_en_dict.source_words and pl_en_dict[source_word] in target_embeddings.i2w:
        total += 1
        # A hit means the reference translation is among the topk nearest target words.
        if pl_en_dict[source_word] in target_embeddings.nn_words_to(source_embeddings[source_word], n=topk):
            correct += 1

CPU times: user 32min 53s, sys: 25.5 s, total: 33min 18s
Wall time: 4min 10s


In [64]:
correct / total, total

(0.5911755937593163, 20126)

Nearly 60% accuracy with `topk = 1` across roughly 20 thousand words! That seems like a great result given that we are not making any accommodations for synonyms. Let's see what results we get with `topk = 5`.

In [65]:
%%time

topk = 5
source_embeddings = embeddings_pl
target_embeddings = embeddings_en

correct, total = 0, 0

for source_word in source_embeddings.i2w:
    # Only score words whose reference translation is present in the target embeddings.
    if source_word in pl_en_dict.source_words and pl_en_dict[source_word] in target_embeddings.i2w:
        total += 1
        # A hit means the reference translation is among the topk nearest target words.
        if pl_en_dict[source_word] in target_embeddings.nn_words_to(source_embeddings[source_word], n=topk):
            correct += 1

CPU times: user 32min 58s, sys: 26.3 s, total: 33min 24s
Wall time: 4min 10s


In [66]:
correct / total, total

(0.7637384477789924, 20126)
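
The numbers above score each source word against a single reference translation. One possible way to accommodate synonyms is to keep every translation listed in the dictionary and count a hit when any of them shows up among the top k candidates. The sketch below illustrates the idea on toy data. The helper names and the `predict` callback are hypothetical, and this evaluation was not run on the real embeddings.

```python
from collections import defaultdict

def read_translations(lines):
    """Map each source word to the set of all its listed translations."""
    translations = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip malformed lines
        src, tgt = parts
        translations[src].add(tgt)
    return translations

def precision_at_k(translations, predict, target_vocab, k):
    """Fraction of source words with at least one valid translation in the top k."""
    correct = total = 0
    for src, targets in translations.items():
        # Keep only the translations that exist in the target vocabulary.
        valid = targets & target_vocab
        if not valid:
            continue
        total += 1
        if valid & set(predict(src, k)):
            correct += 1
    return correct / total if total else 0.0

# Toy dictionary in the same "source target" layout as data/pl-en.txt.
lines = ['przez through', 'przez by', 'jego his', 'jego its', 'jest is']
translations = read_translations(lines)

# Stand-in for a nearest-neighbour lookup over the aligned embeddings.
fake_neighbours = {
    'przez': ['through', 'via'],
    'jego': ['her', 'their'],
    'jest': ['is', 'are'],
}

def predict(word, k):
    return fake_neighbours[word][:k]

target_vocab = {'through', 'by', 'his', 'its', 'is', 'via', 'her', 'their', 'are'}
score = precision_at_k(translations, predict, target_vocab, k=1)

print(score)
```

In this toy example `przez` counts as a hit because `through` is one of its two listed translations, whereas a lookup that keeps only the last listed translation (`by`) would score it as a miss.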