## How do you evaluate goodness of embeddings across languages?

If you have embeddings for language A and another set of embeddings for language B, how can you tell how good they are? 

We might want to evaluate each set on a task specific to that language. But a much more interesting proposition is aligning the embeddings in an unsupervised way and evaluating the performance on translation!

For this, we need three components: [the embeddings](https://fasttext.cc/docs/en/pretrained-vectors.html), [a way to align embeddings](https://github.com/artetxem/vecmap), and [dictionaries](https://github.com/facebookresearch/MUSE) we can use to evaluate the results.
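To make the evaluation idea concrete, here is a minimal sketch of dictionary-based scoring: for each source word in the dictionary, look up its nearest neighbor among the target embeddings and check whether it is an accepted translation. The function name `precision_at_1`, the toy vectors, and the representation of the dictionary as a list of `(source, target)` pairs are assumptions made for illustration, not part of any of the linked projects.

```python
import numpy as np

def precision_at_1(src_vecs, trg_vecs, src_words, trg_words, dictionary):
    """Fraction of dictionary source words whose nearest target word is an accepted translation."""
    src_w2i = {w: i for i, w in enumerate(src_words)}
    # Normalize rows so that a dot product equals cosine similarity.
    src_n = src_vecs / np.linalg.norm(src_vecs, axis=1, keepdims=True)
    trg_n = trg_vecs / np.linalg.norm(trg_vecs, axis=1, keepdims=True)

    # A source word may have several valid translations; skip words we have no vector for.
    gold = {}
    for s, t in dictionary:
        if s in src_w2i:
            gold.setdefault(s, set()).add(t)

    hits = 0
    for s, translations in gold.items():
        best = int(np.argmax(trg_n @ src_n[src_w2i[s]]))
        hits += trg_words[best] in translations
    return hits / len(gold) if gold else 0.0

# Toy data (hypothetical): two Polish and three English words in an already shared 2-d space.
src_words = ['kot', 'pies']
src_vecs = np.array([[1.0, 0.0], [0.0, 1.0]])
trg_words = ['cat', 'dog', 'animal']
trg_vecs = np.array([[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]])
dictionary = [('kot', 'cat'), ('pies', 'dog')]

print(precision_at_1(src_vecs, trg_vecs, src_words, trg_words, dictionary))
```

This only makes sense once both sets of vectors live in the same space, which is what the alignment step is for.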

Let's begin by grabbing embeddings for English and Polish.

In [1]:
!wget -P data https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.en.vec

--2020-09-14 13:54:07--  https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.en.vec
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 172.67.9.4, 104.22.74.142, 104.22.75.142, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|172.67.9.4|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6597238061 (6.1G) [binary/octet-stream]
Saving to: ‘data/wiki.en.vec’


2020-09-14 13:57:09 (34.6 MB/s) - ‘data/wiki.en.vec’ saved [6597238061/6597238061]



In [2]:
!wget -P data https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.pl.vec

--2020-09-14 13:57:09--  https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.pl.vec
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 172.67.9.4, 104.22.74.142, 104.22.75.142, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|172.67.9.4|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2707436342 (2.5G) [binary/octet-stream]
Saving to: ‘data/wiki.pl.vec’


2020-09-14 13:58:24 (34.9 MB/s) - ‘data/wiki.pl.vec’ saved [2707436342/2707436342]



Time to align the embeddings. We want to translate from Polish to English, so we will align our Polish embeddings with our English embeddings.

Bear in mind that this is a challenging task: Polish and English are both Indo-European languages, but they belong to different branches of that family (Polish is a West Slavic language of the Lechitic group, English is a West Germanic language).

Let us clone [vecmap](https://github.com/artetxem/vecmap) to a directory adjacent to this one and perform the alignment.
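For intuition about what "alignment" means here, the following is a minimal sketch of the orthogonal Procrustes solution: given matrices `X` and `Y` whose rows are vectors of word pairs assumed to be translations of each other, it finds the rotation `W` that brings `X @ W` as close as possible to `Y`. This is an illustration under stated assumptions, not vecmap's implementation; an unsupervised method has to discover the word pairs itself and does considerably more work. The function name `procrustes` and the synthetic data are hypothetical.

```python
import numpy as np

def procrustes(X, Y):
    """Return the orthogonal matrix W that minimizes ||X @ W - Y|| (Frobenius norm)."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Synthetic check: build a "target" space by rotating a random "source" space,
# then verify that the rotation is recovered from the paired rows.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                      # 100 source vectors of dimension 5
true_W, _ = np.linalg.qr(rng.normal(size=(5, 5)))  # a random orthogonal matrix
Y = X @ true_W                                     # the paired target vectors

W = procrustes(X, Y)
print(np.allclose(W, true_W, atol=1e-6))
```

Restricting `W` to be orthogonal preserves distances and angles within the source space, so the mapping cannot distort the monolingual structure of the embeddings.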

In [3]:
with open('data/wiki.pl.vec', encoding='utf-8') as f:
    pl_src = f.readlines()  # Polish vectors: a header line followed by one line per word

In [4]:
len(pl_src)

1032578
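The `.vec` files are plain text: the first line holds the vocabulary size and the vector dimensionality, and every following line holds a word followed by its vector components. The line count above therefore includes one header line. A minimal sketch of a parser that turns such a file into a word list and a numpy array follows; `load_vec` and its `max_words` parameter are hypothetical names introduced for this example, and the two-word sample stands in for the real file.

```python
import io
import numpy as np

def load_vec(lines, max_words=None):
    """Parse fastText-style .vec text into (list of words, float32 array of vectors)."""
    lines = iter(lines)
    n_words, dim = map(int, next(lines).split())  # header: '<count> <dim>'
    if max_words is not None:
        n_words = min(n_words, max_words)

    words = []
    vectors = np.empty((n_words, dim), dtype=np.float32)
    for i, line in enumerate(lines):
        if i == n_words:
            break
        # Split from the right so the last `dim` fields are always the numbers.
        parts = line.rstrip().rsplit(' ', dim)
        words.append(parts[0])
        vectors[i] = [float(x) for x in parts[1:]]
    return words, vectors[:len(words)]

# A tiny in-memory sample in the same layout as the downloaded files.
sample = io.StringIO("2 3\nkot 0.1 0.2 0.3\npies 0.4 0.5 0.6 \n")
words, vectors = load_vec(sample)
print(words, vectors.shape)
```

With the real data, an open file handle can be passed instead of the sample, for example `load_vec(open('data/wiki.pl.vec', encoding='utf-8'), max_words=200_000)`, which avoids holding every line of the file in memory at once.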

In [1]:
from fastai.data.all import untar_data
embedding_path = untar_data('https://storage.googleapis.com/text-embeddings/GoogleNews-vectors-negative300.bin.tar.gz')

Now let's load the embeddings using `gensim`.

In [2]:
from gensim.models.keyedvectors import KeyedVectors

gensim_embeddings = KeyedVectors.load_word2vec_format(embedding_path, binary=True)

In [3]:
len(gensim_embeddings.index2entity), gensim_embeddings['cat'].shape

(3000000, (300,))

3 million distinct embeddings, each of dimensionality 300!

Let's perform the evaluation using the original `questions-words.txt` list used in the paper (shared by the authors on GitHub [here](https://github.com/tmikolov/word2vec/blob/master/questions-words.txt)).

We could use the functionality built into `gensim` to run the evaluation, but that would make it tricky to evaluate embeddings that we train ourselves or to modify the list of queries.

Instead, let's perform the evaluation using code that we develop in this repository. As a starting point, all we need is an array of embeddings and a list with words corresponding to each vector!

<!-- We will use [annoy](https://github.com/spotify/annoy) for approximate nearest neighbor lookup. Upon the first run, the embeddings will be added to an index and multiple trees enabling the search will be constructed. Given the size of these embeddings, this took around 5 minutes for me.  -->

In [6]:
import numpy as np

class Embeddings():
    def __init__(self, embeddings, index2word):
        '''embeddings - numpy array of embeddings, index2word - list of words corresponding to embeddings'''
        self.vectors = embeddings
        self.i2w = index2word
        self.w2i = {w:i for i, w in enumerate(index2word)}
            
    def analogy(self, a, b, c, n=5):
        '''
        a is to b as ? is to c

        Performs the following algebraic calculation: result = emb_a - emb_b + emb_c
        Looks up the n words closest to result by cosine similarity,
        skipping the three query words themselves.

        Implements the embedding space math behind the famous word2vec example:
        king - man + woman = queen
        '''
        question_word_indices = [self.w2i[word] for word in [a, b, c]]
        a, b, c = [self.vectors[idx] for idx in question_word_indices] 
        result = a - b + c
        
        nn_indices = np.flip(
            np.argsort(self.vectors @ result / (np.linalg.norm(self.vectors, axis=1) * np.linalg.norm(result)))
        )
        
        nn_words = []
        for idx in nn_indices:
            if idx in question_word_indices: continue
            nn_words.append(self.i2w[idx])
            if len(nn_words) == n: break
        
        return nn_words
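The lookup above recomputes the norm of every vector and fully sorts the vocabulary on each call, although only the top `n` entries are needed. A minimal sketch of a cheaper lookup is shown below, assuming the rows are normalized once up front and using `np.argpartition` to select the top candidates before sorting just those. `nearest_words` and the toy vectors are hypothetical and not part of the class above.

```python
import numpy as np

def nearest_words(unit_vectors, i2w, query, n=5, exclude=()):
    """Return the n words whose (pre-normalized) vectors have the highest cosine similarity to query."""
    sims = unit_vectors @ (query / np.linalg.norm(query))
    for idx in exclude:
        sims[idx] = -np.inf  # never return the query words themselves
    n = min(n, len(sims))
    top = np.argpartition(-sims, n - 1)[:n]  # the n best candidates, in arbitrary order
    top = top[np.argsort(-sims[top])]        # sort only those n by similarity
    return [i2w[i] for i in top]

# Toy vocabulary (hypothetical vectors chosen so that king - man + woman equals queen).
i2w = ['king', 'man', 'woman', 'queen', 'apple']
vectors = np.array([
    [1.0, 1.0, 0.0],   # king
    [0.0, 1.0, 0.0],   # man
    [0.0, 0.0, 1.0],   # woman
    [1.0, 0.0, 1.0],   # queen
    [0.0, 0.2, -1.0],  # apple
])
unit_vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

query = vectors[0] - vectors[1] + vectors[2]
print(nearest_words(unit_vectors, i2w, query, n=2, exclude=[0, 1, 2]))
```

Normalizing once and avoiding the full sort matters most when the same vocabulary is queried many thousands of times, as in a benchmark run.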

In [7]:
gensim_embeddings.vectors[:30000].shape

(30000, 300)

In [8]:
# grab just the vectors and the index-to-word mapping from the gensim embeddings and instantiate our own Embeddings object
# let's stick to the first 50_000 words (the most popular ones) so that the computation runs faster

embeddings = Embeddings(gensim_embeddings.vectors[:50_000], gensim_embeddings.index2word[:50_000])

Now that we have the Embeddings in place, we can run some examples. France is to Paris as ? is to Warsaw...

In [9]:
%%time
embeddings.analogy('France', 'Paris', 'Warsaw', 5)

CPU times: user 116 ms, sys: 4 ms, total: 120 ms
Wall time: 39.9 ms


['Poland', 'Polish', 'Romania', 'Lithuania', 'Poles']

Got that one right! Now let's try the classic example of king - man + woman = ?

In [10]:
%%time
embeddings.analogy('king', 'man', 'woman', 5)

CPU times: user 96 ms, sys: 12 ms, total: 108 ms
Wall time: 36 ms


['queen', 'monarch', 'princess', 'prince', 'kings']

We get it right as well!

Kings and queens are not discussed that often in the news today, so this is a great and slightly unexpected result. Why should such an algebraic structure emerge from training on a lot of text data in the first place? And yet it does!

Let's explore the performance further, by running through the list of question-answer pairs.

In [11]:
with open('data/questions-words.txt') as f:
    lines = f.readlines()

In [28]:
%%time

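from collections import defaultdict

total_seen = defaultdict(int)  # number of questions per question type
correct = defaultdict(int)     # number of correctly answered questions per question type
question_types = []

for line in lines:
    if line[0] == ':':
        # section header, e.g. ': capital-common-countries'
        question_types.append(line[1:].strip())
        current_type = question_types[-1]
    else:
        total_seen[current_type] += 1
        example = line.strip().split(' ')
        try:
            # each question is 'a b c d'; we compute a - b + d and expect c
            result = embeddings.analogy(*example[:2], example[3], 1)
            if example[2] == result[0]: correct[current_type] += 1
        except KeyError:
            # a word is missing from our 50_000-word vocabulary; the question counts as incorrect
            pass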

CPU times: user 24min 59s, sys: 3min 26s, total: 28min 26s
Wall time: 9min 28s


In [44]:
types = []
results = []
for key in total_seen.keys():
    types.append(key)
    results.append(f'{correct[key]} / {total_seen[key]}')

In [52]:
import pandas as pd

df = pd.DataFrame(data={'question type': types, 'result': results})
display(df)
print('Accuracy:', sum(correct.values()) / sum(total_seen.values()))

|   | question type | result |
|---|---|---|
| 0 | capital-common-countries | 397 / 506 |
| 1 | capital-world | 1512 / 4524 |
| 2 | currency | 64 / 866 |
| 3 | city-in-state | 463 / 2467 |
| 4 | family | 377 / 506 |
| 5 | gram1-adjective-to-adverb | 257 / 992 |
| 6 | gram2-opposite | 257 / 812 |
| 7 | gram3-comparative | 1044 / 1332 |
| 8 | gram4-superlative | 644 / 1122 |
| 9 | gram5-present-participle | 745 / 1056 |


Accuracy: 0.4986696684404421
