## How do you evaluate goodness of embeddings across languages?

If you have embeddings for language A and another set of embeddings for language B, how can you tell how good they are? 

We might want to evaluate each set on a task specific to that language. But a much more interesting proposition is aligning the embeddings in an unsupervised way and evaluating the performance on translation!

For this, we need three components: [the embeddings](https://fasttext.cc/docs/en/pretrained-vectors.html), [a way to align embeddings](https://github.com/artetxem/vecmap), and [dictionaries](https://github.com/facebookresearch/MUSE) we can use to evaluate the results.

Let's begin by grabbing embeddings for English and Polish.

In [1]:
!wget -P data https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.en.vec

--2020-09-14 17:58:30--  https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.en.vec
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 172.67.9.4, 104.22.75.142, 104.22.74.142, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|172.67.9.4|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6597238061 (6.1G) [binary/octet-stream]
Saving to: ‘data/wiki.en.vec’


2020-09-14 18:00:26 (54.5 MB/s) - ‘data/wiki.en.vec’ saved [6597238061/6597238061]



In [2]:
!wget -P data https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.pl.vec

--2020-09-14 18:00:26--  https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.pl.vec
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 104.22.74.142, 172.67.9.4, 104.22.75.142, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|104.22.74.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2707436342 (2.5G) [binary/octet-stream]
Saving to: ‘data/wiki.pl.vec’


2020-09-14 18:01:37 (36.6 MB/s) - ‘data/wiki.pl.vec’ saved [2707436342/2707436342]



Time to align the embeddings. We want to translate from Polish to English, so we will map the Polish embeddings (source) and the English embeddings (target) into a shared space.

Bear in mind that this is a challenging task: Polish and English both belong to the Indo-European family, but they sit on very different branches (Polish is a West Slavic language of the Lechitic group, English is a West Germanic language).

Let us clone [vecmap](https://github.com/artetxem/vecmap) to a directory adjacent to this one and perform the alignment.

In [3]:
%%time
!python ../vecmap/map_embeddings.py --cuda --unsupervised data/wiki.pl.vec data/wiki.en.vec data/wiki.pl.aligned.vec data/wiki.en.aligned.vec

CPU times: user 17.8 s, sys: 5.69 s, total: 23.5 s
Wall time: 26min 45s


Great! We now have the English and Polish embeddings aligned to live in the same embedding space!
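
To build some intuition for what "aligned" means, here is a minimal sketch on synthetic data. Note that this is not what vecmap does in `--unsupervised` mode, where no known word pairs are available to start from. It is the classic supervised orthogonal Procrustes solution, which assumes we already know that row `i` of the source matrix corresponds to row `i` of the target matrix. The function name `procrustes_align` and the toy data are made up for illustration.

```python
import numpy as np

def procrustes_align(src, tgt):
    """Return the orthogonal matrix W minimising ||src @ W - tgt||_F.

    Assumes row i of `src` and row i of `tgt` form a known translation pair.
    """
    # The SVD of the cross-covariance gives the closed-form solution W = U @ Vt
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt

# Toy data: the "target language" is an exact rotation of the "source language"
rng = np.random.default_rng(0)
src = rng.normal(size=(100, 5))
true_rotation, _ = np.linalg.qr(rng.normal(size=(5, 5)))
tgt = src @ true_rotation

w = procrustes_align(src, tgt)
print(np.allclose(src @ w, tgt))  # checks that the learned map reproduces the target vectors
```

Real embedding spaces are only approximately related by a rotation, so in practice the mapped vectors land near, not on, their translations.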

Let's grab a Polish to English dictionary from the MUSE repository.

In [4]:
!wget -P data https://dl.fbaipublicfiles.com/arrival/dictionaries/pl-en.txt

--2020-09-14 18:28:23--  https://dl.fbaipublicfiles.com/arrival/dictionaries/pl-en.txt
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 104.22.74.142, 104.22.75.142, 172.67.9.4, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|104.22.74.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1216386 (1.2M) [text/plain]
Saving to: ‘data/pl-en.txt’


2020-09-14 18:28:24 (34.8 MB/s) - ‘data/pl-en.txt’ saved [1216386/1216386]



In [5]:
class Dictionary():
    '''Source-to-target word lookup built from a whitespace-separated dictionary file.'''
    def __init__(self, path_to_dict):
        self.read_words(path_to_dict)
        
    def read_words(self, path_to_dict):
        source_words, target_words = [], []
        with open(path_to_dict, encoding='utf-8') as f:
            for line in f:
                # each line holds one "source target" pair
                src, target = line.strip().split()
                source_words.append(src)
                target_words.append(target)
        self.source_words = source_words
        self.target_words = target_words
        self.create_dict()
        
    def create_dict(self):
        # if a source word is listed more than once, the last translation wins
        self.dict = {}
        for src, target in zip(self.source_words, self.target_words):
            self.dict[src] = target
            
    def __getitem__(self, source_word):
        return self.dict[source_word]
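
A dictionary file can list the same source word several times with different translations, and the `Dictionary` class above keeps only the last one it sees. As a rough sketch of an alternative, the hypothetical helper `load_translation_pairs` below keeps every translation. It works on any iterable of lines, so it can be tried without the downloaded file; the sample lines are made up to mimic the `source<whitespace>target` layout.

```python
from collections import defaultdict

def load_translation_pairs(lines):
    """Map each source word to the list of all translations listed for it."""
    translations = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 2:  # skip blank or malformed lines
            continue
        src, target = parts
        translations[src].append(target)
    return dict(translations)

# Made-up sample in the same layout as pl-en.txt
sample = ["nie\tnot\n", "nie\tno\n", "roku\tyear\n", "\n"]
pairs = load_translation_pairs(sample)
print(pairs["nie"])
```

To use it on the real file, pass an open file handle, for example `load_translation_pairs(open('data/pl-en.txt', encoding='utf-8'))`.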

In [7]:
from embedding_gym.core import Embeddings

In [9]:
embeddings = Embeddings.create_from_txt_file('data/wiki.pl.aligned.vec')

In [10]:
len(embeddings.vectors)

1032440

In [6]:
pl_en_dict = Dictionary('data/pl-en.txt')
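
With the aligned vectors and a dictionary in hand, one common way to score translation quality is precision@1: for each source word, retrieve the nearest target-language vector by cosine similarity and check whether it is an accepted translation. The sketch below uses invented two-dimensional vectors, and `precision_at_1` and its arguments are hypothetical names. The evaluation scripts that ship with vecmap and MUSE may use other retrieval methods, so treat this as an illustration of the idea only.

```python
import numpy as np

def precision_at_1(src_vecs, src_words, tgt_vecs, tgt_words, gold):
    """Fraction of source words whose nearest target neighbour is a gold translation.

    `gold` maps a source word to a collection of accepted target words.
    """
    # Unit-normalise so that a dot product equals cosine similarity
    src_unit = src_vecs / np.linalg.norm(src_vecs, axis=1, keepdims=True)
    tgt_unit = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)

    hits, evaluated = 0, 0
    for i, word in enumerate(src_words):
        if word not in gold:
            continue  # no reference translation, so skip this word
        best = int(np.argmax(tgt_unit @ src_unit[i]))
        hits += tgt_words[best] in gold[word]
        evaluated += 1
    return hits / evaluated if evaluated else 0.0

# Toy "aligned" space: vectors invented for illustration
pl_words = ["kot", "pies", "dom"]
pl_vecs = np.array([[1.0, 0.1], [0.1, 1.0], [-1.0, 0.2]])
en_words = ["cat", "dog", "house"]
en_vecs = np.array([[0.9, 0.0], [0.0, 0.8], [0.7, 0.7]])
gold = {"kot": {"cat"}, "pies": {"dog"}, "dom": {"house", "home"}}

score = precision_at_1(pl_vecs, pl_words, en_vecs, en_words, gold)
print(score)  # two of the three toy words are translated correctly
```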

In [36]:
%%time
import numpy as np

words, vectors = [], []
with open('data/wiki.pl.aligned.vec') as f:
    for line in f.readlines()[1:]:
        try:
            vectors.append(np.array([float(s) for s in line.split()[1:]]))
            words.append(line.split()[0])
        except ValueError: pass # we may have encountered a 2 word embedding, for instance 'New York' or 'w dolinie'

CPU times: user 1min 56s, sys: 1.81 s, total: 1min 58s
Wall time: 1min 58s
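
The `ValueError` handled above shows up when a token itself contains a space, so that `line.split()[1:]` starts with a word instead of a number. Since the header line states the dimensionality, one option is to split off exactly that many numbers from the right and treat whatever remains as the token. This is a minimal sketch assuming the usual text layout (a `<vocab size> <dimension>` header followed by space-separated values); `parse_vec_lines` is a hypothetical helper and the sample values are made up.

```python
import numpy as np

def parse_vec_lines(lines):
    """Parse embeddings in word2vec/fastText text format; returns (words, matrix)."""
    lines = iter(lines)
    _, dim = map(int, next(lines).split())  # header: "<vocab size> <dimension>"
    words, rows = [], []
    for line in lines:
        # Take exactly `dim` numbers from the right; the remainder is the token
        parts = line.rstrip().rsplit(" ", dim)
        if len(parts) != dim + 1:
            continue  # malformed line
        words.append(parts[0])
        rows.append([float(value) for value in parts[1:]])
    return words, np.array(rows, dtype=np.float32)

sample = [
    "3 4\n",
    "stryjsko 0.1 0.2 0.3 0.4\n",
    "w dolinie 0.5 0.6 0.7 0.8\n",
    "roku 0.9 1.0 1.1 1.2\n",
]
words, vectors = parse_vec_lines(sample)
print(words, vectors.shape)
```

Storing everything in a single `float32` array also tends to be more memory-friendly than a Python list of separate arrays.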


In [37]:
len(words), len(vectors)

(1032440, 1032440)

In [34]:
words[-1]

'stryjsko'

In [19]:
with open('embeddings/GoogleNews-vectors-negative300.bin', 'rb') as f:
    for line in f.readlines()[1:]:
        break

In [16]:
line.split()

['w',
 'dolinie',
 '0.0561845',
 '-0.099156',
 '0.0973581',
 '0.128611',
 '-0.148211',
 '0.0408076',
 '-0.0254702',
 '0.137409',
 '-0.107757',
 '0.104452',
 '-0.0153174',
 '0.0601579',
 '-0.0359006',
 '-0.0488789',
 '-0.0563178',
 '-0.0381448',
 '-0.000313493',
 '-0.0584841',
 '0.0142587',
 '0.11536',
 '-0.0162321',
 '0.0281857',
 '-0.0378724',
 '-0.0489152',
 '-0.0639852',
 '0.0173485',
 '0.0243728',
 '-0.0364188',
 '0.06553',
 '0.0687198',
 '-0.0279721',
 '-0.0613634',
 '-0.0206989',
 '-0.0979436',
 '0.0609454',
 '0.038113',
 '0.00661096',
 '-0.028542',
 '0.0440013',
 '-0.0697895',
 '-0.0613983',
 '-0.00467289',
 '-0.0113318',
 '0.0409988',
 '-0.0443909',
 '0.0446194',
 '0.0336209',
 '0.0795524',
 '0.068961',
 '0.00922872',
 '-0.0808638',
 '0.0819113',
 '0.0424515',
 '0.0197476',
 '0.0303924',
 '0.0311507',
 '0.0597063',
 '0.120256',
 '0.00299538',
 '-0.0428498',
 '0.0342093',
 '0.0628188',
 '-0.0602852',
 '-0.00287612',
 '-0.00126272',
 '0.0225569',
 '-0.00373424',
 '0.0613833',
 '-

In [14]:
len(words)

178719

In [None]:
with open('data/wiki.pl.aligned.vec') as f:
    for line in f.readlines()[1:]:
        break  # stop at the first line after the header so we can inspect it

In [9]:
line

'</s> 0.0321161 -0.220719 0.124438 -0.00926508 -0.0740218 -0.0172236 0.109326 0.250572 0.318885 0.00938166 -0.0841455 -0.0919657 0.0499988 0.00960425 -0.134863 -0.0746048 -0.0413067 -0.018231 0.14357 0.0134478 -0.0187504 -0.0699536 0.00860181 -0.00865755 -0.0791225 -0.0463163 -0.034485 0.0279513 -0.0589544 -0.0407527 0.052333 -0.0110106 -0.0327343 -0.0403282 -0.0105641 -0.0216702 -0.0100932 -0.0640855 0.0649875 -0.0647621 -0.00679458 0.0442129 0.0381403 0.000760199 0.0459121 -0.0460971 -0.0187912 -0.111585 0.0261081 0.0287817 -0.0583978 0.0355259 0.0554455 -0.0360979 -0.056411 -0.000415022 0.011188 -0.0786568 0.0211859 -0.148625 -0.0175403 0.0630089 0.0547974 0.01752 -0.0605584 0.00304843 0.042199 -0.0136111 -0.0642873 -0.0160942 -0.0285941 -0.00691622 -0.0202147 0.0792829 0.000268719 -0.000134218 -0.0183332 -0.00610847 -0.0181852 -0.0172873 0.00703196 0.00038558 0.0257223 -0.00443342 -0.0205739 -0.00805204 0.0781359 -0.0238407 -0.0302547 -0.0172303 -0.0237644 -0.0554416 -0.00364135 0.

In [4]:
Embeddings()

embedding_gym.core.Embeddings

In [None]:
pl_en_dict.source_words

In [43]:
pl_en_dict.dict.keys

{'roku': 'year'}

In [41]:
pl_en_dict['jest']

KeyError: 'jest'

In [None]:
from collections import defaultdict

def get_en_es_dict(vocab_en, vocab_es):
    with open('data/en-es.txt') as f:
        en_es = f.readlines()
    en_es = [l.strip() for l in en_es]
    
    en_es_dict = defaultdict(list)
    for l in en_es:
        source, target = l.split()
        en_es_dict[source].append(target)
        
    # check that we have the source word in the model trained on English
    vocab_en_set = set(vocab_en)
    en_es_dict = {k: v for k, v in en_es_dict.items() if k in vocab_en_set}
    
    # make sure we have the target word in Spanish
    vocab_es_set = set(vocab_es)
    en_es_dict = {k: vocab_es_set.intersection(set(v)) for k, v in en_es_dict.items() if vocab_es_set.intersection(set(v))}
    
    return en_es_dict

In [12]:
pl_en_dict[2]

'nie\tnot\n'

In [8]:
len(pl_en_dict)

73901

In [3]:
with open('data/wiki.pl.vec') as f:
    en_src = f.readlines()

In [4]:
len(en_src)

1032578

In [1]:
from fastai.data.all import untar_data
embedding_path = untar_data('https://storage.googleapis.com/text-embeddings/GoogleNews-vectors-negative300.bin.tar.gz')

Now let's load the embeddings using `gensim`.

In [2]:
from gensim.models.keyedvectors import KeyedVectors

gensim_embeddings = KeyedVectors.load_word2vec_format(embedding_path, binary=True)

In [3]:
len(gensim_embeddings.index2entity), gensim_embeddings['cat'].shape

(3000000, (300,))

3 million distinct embeddings, each of dimensionality 300!

Let's perform the evaluation using the original `questions-words.txt` list used in the paper (shared by the authors on GitHub [here](https://github.com/tmikolov/word2vec/blob/master/questions-words.txt)).

We could use the functionality built into `gensim` to run the evaluation, but that would make it harder to evaluate embeddings we train ourselves or to modify the list of queries.

Instead, let's perform the evaluation using code that we develop in this repository. As a starting point, all we need is an array of embeddings and a list with words corresponding to each vector!

<!-- We will use [annoy](https://github.com/spotify/annoy) for approximate nearest neighbor lookup. Upon the first run, the embeddings will be added to an index and multiple trees enabling the search will be constructed. Given the size of these embeddings, this took around 5 minutes for me.  -->

In [6]:
import numpy as np

class Embeddings():
    def __init__(self, embeddings, index2word):
        '''embeddings - numpy array of embeddings, index2word - list of words corresponding to embeddings'''
        self.vectors = embeddings
        self.i2w = index2word
        self.w2i = {w:i for i, w in enumerate(index2word)}
            
    def analogy(self, a, b, c, n=5):
        '''
        a is to b as ? is to c
        
        Performs the following algebraic calculation: result = emb_a - emb_b + emb_c
        Looks up the n words closest to result by cosine similarity.
        
        Implements the embedding space math behind the famous word2vec example:
        king - man + woman = queen
        '''
        question_word_indices = [self.w2i[word] for word in [a, b, c]]
        a, b, c = [self.vectors[idx] for idx in question_word_indices] 
        result = a - b + c
        
        nn_indices = np.flip(
            np.argsort(self.vectors @ result / (np.linalg.norm(self.vectors, axis=1) * np.linalg.norm(result)))
        )
        
        nn_words = []
        for idx in nn_indices:
            if idx in question_word_indices: continue
            nn_words.append(self.i2w[idx])
            if len(nn_words) == n: break
        
        return nn_words

In [7]:
gensim_embeddings.vectors[:30000].shape

(30000, 300)

In [8]:
# grab just the vectors and the index-to-word mapping from the gensim embeddings and instantiate our own Embeddings object
# let's stick to the 50_000 most popular words so that the computation runs faster

embeddings = Embeddings(gensim_embeddings.vectors[:50_000], gensim_embeddings.index2word[:50_000])

Now that we have the Embeddings in place, we can run some examples. France is to Paris as ? is to Warsaw...

In [9]:
%%time
embeddings.analogy('France', 'Paris', 'Warsaw', 5)

CPU times: user 116 ms, sys: 4 ms, total: 120 ms
Wall time: 39.9 ms


['Poland', 'Polish', 'Romania', 'Lithuania', 'Poles']

Got that one right! Now let's try the classic example of king - man + woman = ?

In [10]:
%%time
embeddings.analogy('king', 'man', 'woman', 5)

CPU times: user 96 ms, sys: 12 ms, total: 108 ms
Wall time: 36 ms


['queen', 'monarch', 'princess', 'prince', 'kings']

We get it right as well!

Kings and queens are not discussed that often in the news, which makes this result all the more impressive. It is also somewhat unexpected: why should such an algebraic structure emerge from training on a large amount of text in the first place? And yet it does!

Let's explore the performance further, by running through the list of question-answer pairs.

In [11]:
with open('data/questions-words.txt') as f:
    lines = f.readlines()

In [28]:
%%time

from collections import defaultdict

total_seen = defaultdict(int)
correct = defaultdict(int)
question_types = []

for line in lines:
    if line[0] == ':':
        # lines starting with ':' open a new question category
        question_types.append(line[1:].strip())
        current_type = question_types[-1]
    else:
        total_seen[current_type] += 1
        example = line.strip().split(' ')
        try:
            # each line reads "a b c d"; we predict c from a - b + d
            result = embeddings.analogy(*example[:2], example[3], 1)
            if example[2] == result[0]: correct[current_type] += 1
        except KeyError:
            # a word is missing from our 50_000-word vocabulary; the question still counts as seen
            pass

CPU times: user 24min 59s, sys: 3min 26s, total: 28min 26s
Wall time: 9min 28s
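
Each call to `analogy` above recomputes the norms of all vectors and fully sorts the similarities, even though we only need the top result. One possible way to cut that work, sketched here under the assumption that the vectors do not change between queries, is to normalise once and use `np.argpartition` to pick the top `n` without a full sort. `NormalizedEmbeddings` is a hypothetical variant of the class above, shown on made-up toy vectors.

```python
import numpy as np

class NormalizedEmbeddings:
    def __init__(self, vectors, index2word):
        self.vectors = vectors
        # Normalise once up front instead of on every query
        self.unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
        self.i2w = list(index2word)
        self.w2i = {w: i for i, w in enumerate(self.i2w)}

    def analogy(self, a, b, c, n=5):
        """a is to b as ? is to c: the n words nearest to emb_a - emb_b + emb_c."""
        idx = [self.w2i[w] for w in (a, b, c)]
        va, vb, vc = self.vectors[idx]
        query = va - vb + vc
        # Dividing by the norm of `query` would not change the ranking, so skip it
        sims = self.unit @ query
        sims[idx] = -np.inf  # never return the question words themselves
        k = min(n, len(sims))
        top = np.argpartition(-sims, k - 1)[:k]   # unordered top k
        top = top[np.argsort(-sims[top])]         # order only those k
        return [self.i2w[i] for i in top if np.isfinite(sims[i])]

# Toy vectors chosen so that king - man + woman lands exactly on queen
toy_words = ["king", "man", "woman", "queen", "apple"]
toy_vectors = np.array([[3.0, 0.0], [1.0, 0.0], [1.0, 1.0], [3.0, 1.0], [-1.0, -2.0]])
toy = NormalizedEmbeddings(toy_vectors, toy_words)
print(toy.analogy("king", "man", "woman", 1))
```

The ranking is the same cosine-similarity ranking as before, so apart from ties the results should agree with the original method.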


In [44]:
types = []
results = []
for key in total_seen.keys():
    types.append(key)
    results.append(f'{correct[key]} / {total_seen[key]}')

In [52]:
import pandas as pd

df = pd.DataFrame(data={'question type': types, 'result': results})
display(df)
print('Accuracy:', sum(correct.values()) / sum(total_seen.values()))

| | question type | result |
|---|---|---|
| 0 | capital-common-countries | 397 / 506 |
| 1 | capital-world | 1512 / 4524 |
| 2 | currency | 64 / 866 |
| 3 | city-in-state | 463 / 2467 |
| 4 | family | 377 / 506 |
| 5 | gram1-adjective-to-adverb | 257 / 992 |
| 6 | gram2-opposite | 257 / 812 |
| 7 | gram3-comparative | 1044 / 1332 |
| 8 | gram4-superlative | 644 / 1122 |
| 9 | gram5-present-participle | 745 / 1056 |


Accuracy: 0.4986696684404421
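
The `correct / total` strings are hard to compare at a glance. As a small sketch, the hypothetical helper `to_accuracy` below turns them into fractions so the question types can be ranked. Only a few rows from the table above are copied in to keep the example short.

```python
def to_accuracy(result):
    """Turn a 'correct / total' string into a fraction."""
    correct, total = (int(part) for part in result.split("/"))
    return correct / total

# A few rows copied from the results table
rows = {
    "capital-common-countries": "397 / 506",
    "currency": "64 / 866",
    "family": "377 / 506",
}
accuracy = {name: to_accuracy(res) for name, res in rows.items()}

# Print the question types from best to worst
for name, acc in sorted(accuracy.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<26} {acc:.1%}")
```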
