In this notebook I export the embeddings to the word2vec text format. I run [vecmap](https://github.com/artetxem/vecmap) from the console to perform the mapping and evaluate the results. I use [annoy](https://github.com/spotify/annoy) to find nearest neighbors and our own evaluation functionality to measure performance.
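
For reference, the word2vec text format is a header line with the vocabulary size and the embedding dimension, followed by one line per word holding the word and its space-separated vector components. The following is a minimal sketch of a round trip through that format using only the standard library. The helper names `write_word2vec` and `read_word2vec` are hypothetical and are not part of this project.

```python
import io

def write_word2vec(words, vectors, handle):
    """Write words and vectors to a file-like object in word2vec text format."""
    dim = len(vectors[0])
    handle.write(f"{len(words)} {dim}\n")  # header: vocabulary size and dimension
    for word, vec in zip(words, vectors):
        handle.write(word + " " + " ".join(repr(float(x)) for x in vec) + "\n")

def read_word2vec(handle):
    """Read the same format back into a word list and a list of float vectors."""
    count, dim = (int(n) for n in handle.readline().split())
    words, vectors = [], []
    for line in handle:
        parts = line.rstrip("\n").split(" ")
        words.append(parts[0])
        vectors.append([float(x) for x in parts[1:]])
    # the header must agree with what was actually read
    assert len(words) == count and all(len(v) == dim for v in vectors)
    return words, vectors

# toy data, not real embeddings
buf = io.StringIO()
write_word2vec(["the", "cat"], [[0.5, -0.25], [1.0, 0.0]], buf)
buf.seek(0)
words, vectors = read_word2vec(buf)
```

The header line matters because tools that consume this format typically rely on it to size their arrays before reading the rows.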

In [15]:
#export data
import numpy as np

In [1]:
from decoder_head.core import *
from decoder_head.data import *
from fastai2.text.all import *
import re

In [3]:
vocab_en= make_vocab(pd.read_pickle('data/en-100_tok/counter.pkl'), max_vocab=4000)
vocab_es= make_vocab(pd.read_pickle('data/es-100_tok/counter.pkl'), max_vocab=4000)

In [4]:
path = 'data/en-100_tok/'
mult = 4
bs = 80
seq_len = 70

lm = DataBlock(blocks=(TextBlock(vocab=vocab_en, is_lm=True),),
                get_x=read_tokenized_file,
                get_items=partial(get_text_files, folders=['train', 'valid']),
                splitter=FuncSplitter(lambda itm: itm.parent.name == 'valid'))

dbunch_lm = lm.databunch(path, path=path, bs=bs, seq_len=seq_len)

In [5]:
learn = language_model_learner(
    dbunch_lm,
    p_normAWD_LSTM,
    opt_func=opt,
    pretrained=False,
    config=awd_lstm_lm_config,
    drop_mult=0.1,
    metrics=[accuracy, top_k_accuracy]
)

In [6]:
learn.load('emb_norm_rows_columns') # en LM

<fastai2.text.learner.LMLearner at 0x7f69535f6fd0>

In [7]:
len(vocab_en)

4008

In [8]:
#export data
def embs_to_txt(vocab, embeddings, fname):
    '''writes embeddings to txt file in word2vec format'''
    lines = []
    lines.append(f'{len(vocab)} {embeddings.shape[1]}\n')
    for word, t in zip(vocab, embeddings):
        word = word.replace('\n', '')  # a newline inside a token would break the one-word-per-line format
        lines.append(f"{word} {' '.join([str(datum.item()) for datum in t])}\n")
    with open(fname, 'w') as f:
        f.writelines(lines)

In [9]:
embs_to_txt(vocab_en, learn.model[0].encoder.normalized_weight(), 'data/en_norm_embs.txt')

In [10]:
path = 'data/es-100_tok/'
mult = 4
bs = 80
seq_len = 70

lm = DataBlock(blocks=(TextBlock(vocab=vocab_es, is_lm=True),),
                get_x=read_tokenized_file,
                get_items=partial(get_text_files, folders=['train', 'valid']),
                splitter=FuncSplitter(lambda itm: itm.parent.name == 'valid'))

dbunch_lm = lm.databunch(path, path=path, bs=bs, seq_len=seq_len)

In [11]:
learn = language_model_learner(
    dbunch_lm,
    p_normAWD_LSTM,
    opt_func=opt,
    pretrained=False,
    config=awd_lstm_lm_config,
    drop_mult=0.1,
    metrics=[accuracy, top_k_accuracy]
)

In [12]:
learn.load('pLSTM_es')

<fastai2.text.learner.LMLearner at 0x7f6952844490>

In [13]:
embs_to_txt(vocab_es, learn.model[0].encoder.normalized_weight(), 'data/es_norm_embs.txt')  # Spanish embeddings need the Spanish vocab

In [14]:
# python map_embeddings.py --cuda --unsupervised ~/workspace/decoder_head/data/en_norm_embs.txt ~/workspace/decoder_head/data/es_norm_embs.txt ~/workspace/decoder_head/data/en_src.txt ~/workspace/decoder_head/data/es_trg.txt

In [None]:
en_es_dict = get_en_es_dict()

In [22]:
import annoy

In [23]:
with open('data/en_src.txt') as f:
    en_src = f.readlines()
    
with open('data/es_trg.txt') as f:
    es_trg = f.readlines()

In [32]:
en_embs = []
for l in en_src[1:]:
    en_embs.append([float(s) for s in l.split()[1:]])

en_embs = np.array(en_embs)

es_embs = []
for l in es_trg[1:]:
    es_embs.append([float(s) for s in l.split()[1:]])

es_embs = np.array(es_embs)

In [47]:
# No need to perform the normalization - the embeddings are already normalized!

# en_embs_norm = en_embs / np.linalg.norm(en_embs, axis=1)[:, None]
# es_embs_norm = es_embs / np.linalg.norm(es_embs, axis=1)[:, None]
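
Whether rows are already unit length can be checked directly instead of assumed. The sketch below uses synthetic vectors as a stand-in for `en_embs` (an assumption, since the mapped files are not available here). It also illustrates why a Euclidean index is a reasonable choice for unit vectors: the squared Euclidean distance between two unit vectors equals `2 - 2 * cosine`, so both measures rank neighbors identically.

```python
import numpy as np

# synthetic stand-in for the loaded embeddings, normalized row-wise
rng = np.random.default_rng(0)
embs = rng.normal(size=(5, 100))
embs /= np.linalg.norm(embs, axis=1, keepdims=True)

# check that every row has unit length
row_norms = np.linalg.norm(embs, axis=1)
is_normalized = bool(np.allclose(row_norms, 1.0))

# for unit vectors: ||a - b||^2 = 2 - 2 * (a . b)
a, b = embs[0], embs[1]
sq_euclidean = float(np.sum((a - b) ** 2))
cosine = float(a @ b)
```

If `is_normalized` came out `False` on the real data, the commented-out normalization above would be needed before building the index.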

In [48]:
from annoy import AnnoyIndex

In [51]:
t = AnnoyIndex(100, 'euclidean')

# en_embs is used directly: the normalization step above is commented out
for i in range(len(en_embs)):
    t.add_item(i, en_embs[i])
    
t.build(10)

True

In [67]:
top_n_translation_acc(1)

0.3908608731130151

In [68]:
top_n_translation_acc(3)

0.5091799265605875

In [69]:
top_n_translation_acc(5)

0.551203590371277

In [64]:
top_n_translation_acc(10)

0.6597307221542228

This is a very valuable piece of information to us. First of all, our embeddings contain enough structure to facilitate the alignment. Secondly, the evaluation framework seems to work.
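
The implementation of `top_n_translation_acc` is not shown in this notebook. As a rough illustration of what a top-n translation accuracy measures, here is a minimal sketch that uses brute-force nearest neighbors in NumPy instead of annoy. The function name `top_n_accuracy`, the dictionary layout (source word mapped to a list of accepted translations), and the toy vectors are all assumptions made for this example.

```python
import numpy as np

def top_n_accuracy(src_embs, trg_embs, src_words, trg_words, dictionary, n):
    """Fraction of dictionary source words with a valid translation among the n nearest target words."""
    hits, total = 0, 0
    for i, word in enumerate(src_words):
        if word not in dictionary:
            continue  # only words with a known translation are scored
        dists = np.linalg.norm(trg_embs - src_embs[i], axis=1)
        nearest = np.argsort(dists)[:n]
        candidates = {trg_words[j] for j in nearest}
        hits += bool(candidates & set(dictionary[word]))
        total += 1
    return hits / total

# toy data: "cat" lands closest to "casa", with "gato" as the second neighbor
src_words = ["dog", "cat", "house"]
trg_words = ["perro", "gato", "casa"]
src_embs = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
trg_embs = np.array([[0.9, 0.1], [0.4, 0.8], [-0.1, 0.9]])
dictionary = {"dog": ["perro"], "cat": ["gato"], "house": ["casa"]}

acc_top1 = top_n_accuracy(src_embs, trg_embs, src_words, trg_words, dictionary, 1)
acc_top2 = top_n_accuracy(src_embs, trg_embs, src_words, trg_words, dictionary, 2)
```

On this toy data the top-1 accuracy is 2/3 and the top-2 accuracy is 1.0, which mirrors the pattern in the results above: accuracy grows with n because a correct translation that is not the single nearest neighbor can still appear among the next few.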

Given this result, I attempted translation using the decoder head with hinting. By that I mean that I trained an English and a Spanish LM using the already aligned embeddings. As expected, they trained with no issues. I include the only bit of additional code I had to write below (for loading the embeddings from txt).

Unfortunately, despite beginning from this hinted state, there is no meaningful improvement on the translation task.

The two observations I have made while working on this, which I believe could be important to consider when devising next steps, are as follows.

First of all, with the permutation matrix, we only allow the model to learn linear combinations of embeddings. This can only work if the column semantics of the embeddings from the two LMs are aligned. If column 0 of the English LM encodes the abstract concept of hot vs. cold, the Spanish LM would need to assign the same meaning to its column 0. I do understand that there is a mathematical way to reason about this in terms of the topography of the embedding cloud, but it is easier to reason about it, and especially to explain the reasoning, using a concrete example as above (even if it is not an entirely rigorous treatment).
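
The column semantics point can be illustrated with a toy experiment. In the sketch below, the "target" cloud is the same set of points as the "source" cloud, only rotated by a random orthogonal matrix, so the shape of the cloud is identical but no column means the same thing in both spaces. Direct nearest-neighbor lookup across the two spaces is then unreliable, while an orthogonal map recovers the correspondence. The map here is fit with orthogonal Procrustes on known pairs, which is a supervised stand-in chosen for brevity and is not a reproduction of vecmap's unsupervised procedure. All data is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

# source cloud: 50 unit-length vectors in 8 dimensions
X = rng.normal(size=(50, 8))
X /= np.linalg.norm(X, axis=1, keepdims=True)

# target cloud: the same points with their columns mixed by a random orthogonal matrix
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))
Y = X @ Q

def nn_accuracy(A, B):
    """Fraction of rows in A whose nearest row in B has the same index."""
    dists = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    return float(np.mean(np.argmin(dists, axis=1) == np.arange(len(A))))

# matching across the two spaces without any alignment
acc_raw = nn_accuracy(X, Y)

# orthogonal Procrustes: W = U V^T, where U S V^T is the SVD of X^T Y
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt
acc_aligned = nn_accuracy(X @ W, Y)
```

Here `W` recovers `Q` and `acc_aligned` is 1.0, while `acc_raw` depends on the random rotation. The takeaway for the decoder head is that a learned mixing over rows of an embedding matrix cannot undo a mismatch of this kind, because it never changes what the columns mean.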

Secondly, even with mapped embeddings, the model is unable to recover the mapping of embeddings to ids. It is important to note that it finds a solution that is quite good (~28% LM accuracy without hinting, 35% LM accuracy with hinting), but the linear combination of embeddings it finds to lower the loss is nonsensical in terms of translation ability.

In [None]:
#export data

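def txt_to_embs(fname):
    '''reads a txt file in word2vec format and returns (vocab, embeddings array)'''
    with open(fname, 'r') as f:
        lines = f.readlines()
    vocab = []
    embs = []
    for line in lines[1:]:  # skip the header line holding vocab size and dimension
        l = line.split()
        vocab.append(l[0])
        embs.append(np.array([float(s) for s in l[1:]]))
    return vocab, np.stack(embs)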

In [2]:
#hide
from nbdev.export import notebook2script
notebook2script()

Converted 00_data.ipynb.
Converted 01_train_LM_en.ipynb.
Converted 02_train_LM_es.ipynb.
Converted 03_translate_en_to_es.ipynb.
Converted 04_LM_with_normalized_embeddings.ipynb.
Converted 04a_LM_with_normalized_embeddings_mixer_softmax.ipynb.
Converted 05_aligning_the_embeddings_using_vecmap.ipynb.
Converted 99_index.ipynb.
