## Training our own embeddings

In this notebook we will train our own text embeddings and subsequently put them through evaluation using code we wrote in the earlier notebooks.

To train our embeddings let us use [fastai](https://github.com/fastai/fastai).

I looked for the Google News corpus dataset, the one the word2vec embeddings were trained on, but I couldn't find it! Perhaps Google never shared it, or it was taken down for one reason or another.

Let's use the next best alternative for our data: Wikipedia dumps.

We will use data generously shared by authors of [MultiFiT: Efficient Multi-lingual Language Model Fine-tuning](https://arxiv.org/abs/1909.04761). Here is the accompanying [repository](https://github.com/n-waves/multifit).

The archive we are about to download contains Wikipedia dumps for 8 languages - that can come in handy for our experiments with translation.

In [1]:
import fastai
from fastai.text.all import *

import pandas as pd
from collections import defaultdict

import torch
from torch import nn

In [2]:
fastai.__version__

'2.0.12'

In [2]:
!wget -nc 'https://www.dropbox.com/sh/srfwvur6orq0cre/AAAQc36bcD17C1KM1mneXN7fa/data/wiki?dl=1' -O 'data/preprocessed_wiki_8langs.zip'
!unzip 'data/preprocessed_wiki_8langs.zip' -d 'data'

--2020-09-16 14:14:38--  https://www.dropbox.com/sh/srfwvur6orq0cre/AAAQc36bcD17C1KM1mneXN7fa/data/wiki?dl=1
Resolving www.dropbox.com (www.dropbox.com)... 162.125.1.1, 2620:100:6016:1::a27d:101
Connecting to www.dropbox.com (www.dropbox.com)|162.125.1.1|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /sh/dl/srfwvur6orq0cre/AAAQc36bcD17C1KM1mneXN7fa/data/wiki [following]
--2020-09-16 14:14:38--  https://www.dropbox.com/sh/dl/srfwvur6orq0cre/AAAQc36bcD17C1KM1mneXN7fa/data/wiki
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://ucb3ebffe3113434a8309a843dd3.dl.dropboxusercontent.com/zip_download_get/Ai7iI08_FU4ekjhkC66hv8r6tc5TVte4yHKYvPmUNzDrwwJOXDtG8FrLj6TAf7uq-T79oEbydWScPRqgrtlTIvjcAup3z_azZ9GCxSI_ebfSrw?dl=1 [following]
--2020-09-16 14:14:38--  https://ucb3ebffe3113434a8309a843dd3.dl.dropboxusercontent.com/zip_download_get/Ai7iI08_FU4ekjhkC66hv8r6tc5TVte4yHKYvPmUNzDrwwJOXDt

Let's just grab English for now.

In [2]:
!tar -xvf data/en-100.tar.gz -C data

en-100/
en-100/en.wiki.valid.tokens
en-100/en.wiki.train.tokens
en-100/en.wiki.test.tokens


In [3]:
!mkdir data/en-100/train
!mkdir data/en-100/valid

!mv data/en-100/en.wiki.train.tokens data/en-100/train
!mv data/en-100/en.wiki.valid.tokens data/en-100/valid

In [4]:
ls data/en-100

en.wiki.test.tokens  train/  valid/


The only issue I see is that all the tokens for train and validation each live in a single file. That is problematic - if we had multiple small files, we could parallelize tokenization across CPU cores more easily.

Let's split the data into smaller files!

In [2]:
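def split_file_into_chunks(path, num_chunks=24):
    # Write the lines of `path` into `num_chunks` files (00.txt, 01.txt, ...) next to it.
    path = Path(path)
    txt = path.read_text(encoding='utf8').split('\n')
    # Integer division: up to num_chunks - 1 trailing lines are left out.
    chunk_len = len(txt) // num_chunks

    for i in range(num_chunks):
        with open(path.parent / f'{i:02d}.txt', 'w', encoding='utf8') as text_file:
            text_file.write('\n'.join(txt[i*chunk_len:(i+1)*chunk_len]))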

In [3]:
%%time

split_file_into_chunks('data/en-100/train/en.wiki.train.tokens')
split_file_into_chunks('data/en-100/valid/en.wiki.valid.tokens')

CPU times: user 4.04 s, sys: 3.06 s, total: 7.09 s
Wall time: 7.09 s


In [4]:
ls data/en-100/train

00.txt  04.txt  08.txt  12.txt  16.txt  20.txt  en.wiki.train.tokens
01.txt  05.txt  09.txt  13.txt  17.txt  21.txt
02.txt  06.txt  10.txt  14.txt  18.txt  22.txt
03.txt  07.txt  11.txt  15.txt  19.txt  23.txt
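
One detail worth knowing: `len(txt) // num_chunks` rounds down, so `split_file_into_chunks` leaves out up to `num_chunks - 1` lines at the end of the file. That is negligible for a corpus of this size. If you ever need every line, a variant could look like the sketch below (`chunk_lines` is a hypothetical helper, not part of the notebook or of fastai).

```python
def chunk_lines(lines, num_chunks):
    """Split `lines` into `num_chunks` contiguous pieces, keeping every line."""
    base, extra = divmod(len(lines), num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        # the first `extra` chunks receive one additional line each
        end = start + base + (1 if i < extra else 0)
        chunks.append(lines[start:end])
        start = end
    return chunks


lines = [f"line {i}" for i in range(10)]
chunks = chunk_lines(lines, 4)
print([len(c) for c in chunks])  # [3, 3, 2, 2]
```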


Let's use functionality provided by `fastai` to tokenize the data (tokenization is an expensive operation so it is good to offload it and perform it once before training).

In [5]:
%%time

tokenize_folder('data/en-100/', folders=['train', 'valid'], n_workers=16)

CPU times: user 51.2 s, sys: 2.68 s, total: 53.9 s
Wall time: 4min 55s


Path('data/en-100_tok')

In [6]:
ls data/en-100_tok

counter.pkl  lengths.pkl  train/  valid/


In [7]:
ls data/en-100_tok/train

00.txt  03.txt  06.txt  09.txt  12.txt  15.txt  18.txt  21.txt
01.txt  04.txt  07.txt  10.txt  13.txt  16.txt  19.txt  22.txt
02.txt  05.txt  08.txt  11.txt  14.txt  17.txt  20.txt  23.txt


In [20]:
counter = pd.read_pickle('data/en-100_tok/counter.pkl')

Let's create a simple `vocab` - this will allow us to go from token ids (and hence rows of the embedding matrix) to words and vice versa.

In [42]:
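class Vocab():
    def __init__(self, counter, max_words):
        # Index 0 is reserved for the unknown token.
        self.itos = ['xxunk']
        # Most frequent tokens, skipping 'xxunk' so that it is not listed twice.
        words = [tup[0] for tup in counter.most_common(max_words) if tup[0] != 'xxunk']
        self.itos += words[:max_words-1]
        self.stoi = defaultdict(lambda: 0) # 0 corresponds to 'xxunk'
        for i, w in enumerate(self.itos):
            self.stoi[w] = i

    def numericalize_line(self, line):
        # Tokens are whitespace separated; unseen tokens map to 0.
        return [self.stoi[token] for token in line.split()]

    def __len__(self):
        return len(self.itos)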

In [43]:
vocab = Vocab(counter, 50_000)
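
To see what such a mapping does, here is a small self-contained sketch on a toy corpus. `ToyVocab`, `toy_counter` and the example sentence are made up for illustration; the class only mirrors the idea of the `Vocab` above (index 0 for `'xxunk'`, most frequent words next).

```python
from collections import Counter, defaultdict


class ToyVocab:
    def __init__(self, counter, max_words):
        words = [w for w, _ in counter.most_common() if w != 'xxunk']
        self.itos = ['xxunk'] + words[:max_words - 1]
        self.stoi = defaultdict(int)  # int() == 0, so unseen words map to 'xxunk'
        self.stoi.update({w: i for i, w in enumerate(self.itos)})

    def numericalize_line(self, line):
        return [self.stoi[tok] for tok in line.split()]


toy_counter = Counter("the cat sat on the mat the end".split())
toy_vocab = ToyVocab(toy_counter, max_words=3)

ids = toy_vocab.numericalize_line("the cat sat")
print(ids)                               # [1, 2, 0]
print([toy_vocab.itos[i] for i in ids])  # ['the', 'cat', 'xxunk']
```

With `max_words=3` only `'the'` and `'cat'` make it into the vocabulary, so `'sat'` comes back as `'xxunk'`.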

Time to take the data we generated and translate it from containing words to containing word ids (this will allow us to pull corresponding embedding vectors from the `Embedding` layer).

In [44]:
%%time

def numericalize_data(path):
    '''
    path - path to directory containing tokenized text files
    '''
    token_idxs = []
    # Read files in a fixed (sorted) order so the token stream is reproducible.
    for file_path in sorted(Path(path).iterdir()):
        with open(file_path) as f:
            for line in f:
                token_idxs += vocab.numericalize_line(line)
    return torch.LongTensor(token_idxs)

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 8.82 µs


In [45]:
%%time

train_toks = numericalize_data('data/en-100_tok/train/')

CPU times: user 30.1 s, sys: 1.39 s, total: 31.4 s
Wall time: 31.1 s


In [46]:
valid_toks = numericalize_data('data/en-100_tok/valid/')

In [149]:
len(train_toks), len(valid_toks)

(119268584, 253305)

Using the data above, we could train any embedding model. Here I chose to train the embeddings via a simple language model.

This might not be the best option for training embeddings, but I want to understand better how to train embeddings this way in preparation for the [planned work on audio](https://github.com/earthspecies/audio-embeddings).

In [150]:
def batchify(data, bsz):
    nbatch = data.size(0) // bsz
    data = data.narrow(0, 0, nbatch * bsz)
    data = data.view(bsz, -1).t().contiguous()
    return data

In [175]:
BS = 128

train_data = batchify(train_toks, BS)
val_data = batchify(valid_toks, BS)

The data is organized as `[tokens, examples]`, meaning the text flows down each of the 128 columns (128 being our batch size).

In [152]:
train_data.shape

torch.Size([931785, 128])
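
The layout is easier to see on a tiny input. The sketch below is a plain-list stand-in for `batchify` (no torch needed); `batchify_lists` is a hypothetical name, and the alphabet plays the role of the token stream.

```python
def batchify_lists(tokens, bsz):
    """Plain-list illustration of `batchify`: rows are time steps, columns are streams."""
    nbatch = len(tokens) // bsz      # number of tokens per column
    tokens = tokens[:nbatch * bsz]   # drop the tail that does not fill a column
    columns = [tokens[c * nbatch:(c + 1) * nbatch] for c in range(bsz)]
    # transpose: row t holds the t-th token of every column
    return [list(row) for row in zip(*columns)]


letters = list("abcdefghijklmnopqrstuvwxyz")
grid = batchify_lists(letters, 4)
for row in grid:
    print(" ".join(row))  # first row: a g m s
```

Reading down the first column gives `a b c d e f`, i.e. contiguous text, while `y` and `z` are dropped because they do not fill a whole row.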

In [40]:
[vocab.itos[i] for i in train_data[:, 0]][:10]

['xxunk',
 '=',
 'xxmaj',
 'valkyria',
 'xxmaj',
 'chronicles',
 'xxrep',
 '3',
 'i',
 '=']

In [51]:
len(vocab)

50000

Let's hack together a simple model.

In [2]:
EMB_SZ = 300
HIDDEN_SZ = 300

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(len(vocab), EMB_SZ)
        self.rnn = nn.GRU(input_size=EMB_SZ, hidden_size=HIDDEN_SZ)
        self.decoder = nn.Linear(HIDDEN_SZ, len(vocab))
        
    def forward(self, x, hidden):
        x = self.emb(x)
        x, hidden = self.rnn(x, hidden)
        return self.decoder(x), hidden
    
    def init_hidden(self):
        weight = next(self.parameters())
        return weight.new_zeros(1, BS, HIDDEN_SZ)

In [177]:
BPTT = 72
LOG_INTERVAL = 5000 # in batches

In [7]:
# Using a subset of the data is great for quick experiments as we try to get things to work!

# train_data = train_data[:80000, :]
# val_data = val_data[:1000, :]

This training procedure is taken from a PyTorch [LM example](https://github.com/pytorch/examples/blob/master/word_language_model/main.py). I made slight modifications to work with our architecture and for ease of running inside a notebook.

In [159]:
import math
import time

from torch import optim

In [185]:
# https://github.com/pytorch/examples/blob/master/word_language_model/main.py

# get_batch subdivides the source data into chunks of length BPTT.
# If source is the alphabet batchified into 4 columns, with
# a BPTT limit of 2, we'd get the following data and targets for i = 0
# (the targets are additionally flattened into a 1-d tensor):
# ┌ a g m s ┐ ┌ b h n t ┐
# └ b h n t ┘ └ c i o u ┘
# Note that despite the name of the function, the subdivision of data is not
# done along the batch dimension (i.e. dimension 1), since that was handled
# by the batchify function. The chunks are along dimension 0, corresponding
# to the seq_len dimension of the GRU.

def get_batch(source, i):
    seq_len = min(BPTT, len(source) - 1 - i)
    data = source[i:i+seq_len]
    target = source[i+1:i+1+seq_len].view(-1)
    return data, target


def evaluate(data_source):
    # Turn on evaluation mode which disables dropout.
    model.eval()
    total_loss = 0.
    hidden = model.init_hidden()
    
    with torch.no_grad():
        for i in range(0, data_source.size(0) - 1, BPTT):
            data, targets = get_batch(data_source, i)
            data, targets = data.cuda(), targets.cuda()
            output, hidden = model(data, hidden)
            hidden = repackage_hidden(hidden)
            total_loss += len(data) * criterion(output.view(-1, len(vocab)), targets).item()
    return total_loss / (len(data_source) - 1)


def train():
    # Turn on training mode.
    model.train()
    total_loss = 0.
    start_time = time.time()
    hidden = model.init_hidden()

    for batch, i in enumerate(range(0, train_data.size(0) - 1, BPTT)):
        data, targets = get_batch(train_data, i)
        data, targets = data.cuda(), targets.cuda()
        model.zero_grad()

        # Starting each batch, we detach the hidden state from how it was previously produced.
        # If we didn't, the model would try backpropagating all the way to start of the dataset.
        hidden = repackage_hidden(hidden)
        output, hidden = model(data, hidden)
        loss = criterion(output.view(-1, len(vocab)), targets)
        loss.backward()

        # `clip_grad_norm_` helps prevent the exploding gradient problem in RNNs.
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1e-1)
        # Plain SGD update using the global `lr`; `optimizer.step()` is left disabled.
        # optimizer.step()
        for p in model.parameters():
            p.data.add_(p.grad, alpha=-lr)

        total_loss += loss.item()

        if batch % LOG_INTERVAL == 0 and batch > 0:
            cur_loss = total_loss / LOG_INTERVAL
            elapsed = time.time() - start_time
            print('| epoch {:3d} | {:5d}/{:5d} batches | lr {:02.2f} | ms/batch {:5.2f} | '
                    'loss {:5.2f} | ppl {:8.2f}'.format(
                epoch, batch, len(train_data) // BPTT, lr,
                elapsed * 1000 / LOG_INTERVAL, cur_loss, math.exp(cur_loss)))
            total_loss = 0
            start_time = time.time()
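
To make the input/target shift in `get_batch` concrete, here is a plain-list sketch that reproduces the alphabet example from the comment above. `get_batch_lists` and the toy `source` are illustrative stand-ins, and `BPTT` is set to 2 only for this example (the notebook uses 72).

```python
BPTT = 2  # toy value for illustration

# Toy batchified data: rows are time steps, columns are independent text streams.
source = [list("agms"), list("bhnt"), list("ciou"),
          list("djpv"), list("ekqw"), list("flrx")]


def get_batch_lists(source, i, bptt=BPTT):
    seq_len = min(bptt, len(source) - 1 - i)
    data = source[i:i + seq_len]
    # targets are the same rows shifted one step ahead, flattened row by row
    target = [tok for row in source[i + 1:i + 1 + seq_len] for tok in row]
    return data, target


data, target = get_batch_lists(source, 0)
print(data)    # [['a', 'g', 'm', 's'], ['b', 'h', 'n', 't']]
print(target)  # ['b', 'h', 'n', 't', 'c', 'i', 'o', 'u']

# The last chunk is shorter because the final row has no next token to predict.
last_data, last_target = get_batch_lists(source, 4)
```

Every input token is paired with the token that follows it in the same column, which is exactly the next-word prediction objective.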

In [179]:
def repackage_hidden(h):
    if isinstance(h, torch.Tensor):
        return h.detach()
    else:
        return tuple(repackage_hidden(v) for v in h)
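
A side note on the `clip_grad_norm_` call in `train()`: it rescales all gradients together so that their combined L2 norm does not exceed the given maximum. The sketch below imitates that behaviour on nested lists; `clip_by_global_norm` is a hypothetical helper written for illustration, not the PyTorch implementation.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradient vectors by one factor so their joint L2 norm is <= max_norm."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [[g * scale for g in vec] for vec in grads], total_norm


grads = [[3.0, 4.0], [0.0, 12.0]]  # joint norm = sqrt(9 + 16 + 144) = 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=0.1)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))
print(norm_before, round(norm_after, 4))  # 13.0 0.1
```

With `max_norm=0.1` and `lr = 1e-3` as used in this notebook, a single plain SGD step can therefore move the parameters by at most about `1e-4` in L2 norm.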

In [172]:
mkdir -p models

In [184]:
model = Model().cuda()
lr = 1e-3
# Note: train() applies a manual SGD update with `lr`, so this Adam optimizer is never stepped.
optimizer = optim.Adam(model.parameters(), lr)
best_val_loss = None
EPOCHS = 100
criterion = nn.CrossEntropyLoss()

# At any point you can hit Ctrl + C to break out of training early.
try:
    for epoch in range(1, EPOCHS+1):
        epoch_start_time = time.time()
        train()
        val_loss = evaluate(val_data)
        print('-' * 89)
        print('| end of epoch {:3d} | time: {:5.2f}s | valid loss {:5.2f} | '
                'valid ppl {:8.2f}'.format(epoch, (time.time() - epoch_start_time),
                                           val_loss, math.exp(val_loss)))
        print('-' * 89)
        # Save the model if the validation loss is the best we've seen so far.
        if best_val_loss is None or val_loss < best_val_loss:
            with open('models/best_model_lower_lr.pth', 'wb') as f:
                torch.save(model, f)
            best_val_loss = val_loss
        else:
            # Anneal the learning rate if no improvement has been seen in the validation dataset.
            lr /= 4.0
            for param_group in optimizer.param_groups:
                param_group['lr'] = lr
except KeyboardInterrupt:
    print('-' * 89)
    print('Exiting from training early')

| epoch   1 |  5000/12941 batches | lr 0.00 | ms/batch 106.37 | loss  4.36 | ppl    78.51
| epoch   1 | 10000/12941 batches | lr 0.00 | ms/batch 106.51 | loss  3.91 | ppl    49.83
-----------------------------------------------------------------------------------------
| end of epoch   1 | time: 1378.78s | valid loss  3.85 | valid ppl    46.79
-----------------------------------------------------------------------------------------
| epoch   2 |  5000/12941 batches | lr 0.00 | ms/batch 106.57 | loss  3.73 | ppl    41.79
| epoch   2 | 10000/12941 batches | lr 0.00 | ms/batch 106.57 | loss  3.68 | ppl    39.63
-----------------------------------------------------------------------------------------
| end of epoch   2 | time: 1380.16s | valid loss  3.73 | valid ppl    41.55
-----------------------------------------------------------------------------------------
| epoch   3 |  5000/12941 batches | lr 0.00 | ms/batch 106.61 | loss  3.61 | ppl    37.11
| epoch   3 | 10000/12941 batches | lr

Not the most extensive training ever! But that is not a problem - I'm more interested in acquainting myself with all the pieces than getting a good result at this point. 
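
For reading the log above: the reported `ppl` is simply the exponential of the mean cross-entropy loss. A quick sanity check is to compare against a model that guesses uniformly over our 50,000-word vocabulary. The helper name `perplexity` below is mine, not something from the notebook.

```python
import math


def perplexity(mean_cross_entropy):
    """Perplexity is the exponential of the mean per-token cross-entropy (in nats)."""
    return math.exp(mean_cross_entropy)


VOCAB_SIZE = 50_000
uniform_loss = math.log(VOCAB_SIZE)  # loss of a model that guesses uniformly
print(f"uniform baseline: loss {uniform_loss:.2f}, ppl {perplexity(uniform_loss):.0f}")
print(f"loss 3.85 -> ppl {perplexity(3.85):.2f}")  # 46.99
```

The log prints the loss rounded to two decimals, which is why `exp(3.85)` gives 46.99 rather than the 46.79 shown for epoch 1.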

In [197]:
# pd.to_pickle(vocab.stoi, 'data/stoi.pkl') # can't pickle defaultdict because I used a lambda :(
# pd.to_pickle(vocab.itos, 'data/itos.pkl')
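
The pickling problem comes from the `lambda` used as the default factory. One possible workaround, assuming index 0 stays reserved for `'xxunk'`, is to use the built-in `int` as the factory, since `int()` returns 0. The small lists below are toy data for illustration.

```python
import pickle
from collections import defaultdict

itos = ['xxunk', 'the', 'cat']

# `int` can be pickled by reference, unlike a lambda
stoi = defaultdict(int)
stoi.update({w: i for i, w in enumerate(itos)})

restored = pickle.loads(pickle.dumps(stoi))
print(restored['cat'], restored['dog'])  # 2 0
```

Alternatively, saving only `itos` and rebuilding `stoi` from it after loading works just as well.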

In [3]:
itos = pd.read_pickle('data/itos.pkl')

In [4]:
# Importing functionality we defined in an earlier notebook

from embedding_gym.core import Embeddings, evaluate_monolingual_embeddings

In [5]:
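# Loading the model. `torch.save(model, f)` pickled the whole object, so the `Model`
# class has to be defined in this session before calling `torch.load`.
# (Recent PyTorch releases may additionally require passing `weights_only=False` here.)

model = torch.load('models/best_model_lower_lr.pth')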

In [6]:
embeddings = Embeddings(model.emb.weight.cpu().detach().numpy(), itos)
evaluate_monolingual_embeddings(embeddings, lower=True)

| | question type | result |
|---|---|---|
| 0 | capital-common-countries | 0 / 506 |
| 1 | capital-world | 1 / 4524 |
| 2 | currency | 0 / 866 |
| 3 | city-in-state | 0 / 2467 |
| 4 | family | 18 / 506 |
| 5 | gram1-adjective-to-adverb | 0 / 992 |
| 6 | gram2-opposite | 1 / 812 |
| 7 | gram3-comparative | 7 / 1332 |
| 8 | gram4-superlative | 1 / 1122 |
| 9 | gram5-present-participle | 6 / 1056 |


Accuracy: 0.002507163323782235
Examples with missing words in the dictionary: 3554
Total examples: 19544


That is well under 1% accuracy (about 0.25%). It's hard to interpret this result - are our embeddings simply not trained well, or would embeddings trained in this fashion be of poor quality on this benchmark in general? Or are they still useful in some sense even though they do not preserve linear relationships between words?
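
By "linear relationship" I mean the usual word2vec-style analogy test: for "man is to king as woman is to ?", take `king - man + woman` and look for the nearest remaining word by cosine similarity. I am assuming here that `evaluate_monolingual_embeddings` follows this standard protocol; its implementation is not shown in this notebook. The toy 2-d vectors and the `solve_analogy` helper below are made up for illustration.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def solve_analogy(emb, a, b, c):
    """Return the word closest to emb[b] - emb[a] + emb[c], excluding a, b and c."""
    query = [y - x + z for x, y, z in zip(emb[a], emb[b], emb[c])]
    candidates = (w for w in emb if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(query, emb[w]))


# Hand-crafted vectors: first axis ~ "royalty", second axis ~ "gender".
toy_emb = {
    'man':   [0.0,  1.0],
    'woman': [0.0, -1.0],
    'king':  [1.0,  1.0],
    'queen': [1.0, -1.0],
    'apple': [-1.0, 0.0],
}

answer = solve_analogy(toy_emb, 'man', 'king', 'woman')
print(answer)  # queen
```

In these hand-crafted vectors the offset `king - man` equals `queen - woman` exactly; the low score above suggests our learned embeddings have little of that structure.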

An interesting proposition that could shed more light on this would be to fork this notebook and implement the training procedure described in [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/abs/1301.3781), using either the skip-gram or the CBOW method. Training our current embeddings for longer could also be called for.
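
As a possible starting point for such a fork, this is roughly how skip-gram training pairs could be produced from a numericalized token stream. It is a simplified sketch: the function name and the fixed `window` are my own choices, and refinements from the paper's reference implementation (such as subsampling frequent words) are left out.

```python
def skipgram_pairs(token_ids, window=2):
    """Return (center, context) pairs for every token and its neighbours within `window`."""
    pairs = []
    for pos, center in enumerate(token_ids):
        lo = max(0, pos - window)
        hi = min(len(token_ids), pos + window + 1)
        for ctx_pos in range(lo, hi):
            if ctx_pos != pos:
                pairs.append((center, token_ids[ctx_pos]))
    return pairs


pairs = skipgram_pairs([10, 20, 30, 40], window=1)
print(pairs)
# [(10, 20), (20, 10), (20, 30), (30, 20), (30, 40), (40, 30)]
```

Each pair would then serve as one training example: the model sees the center word and is asked to predict the context word.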

Finally, it might be a good idea to download a language model pretrained on a Wikipedia corpus (such as the one [shared by fastai](https://docs.fast.ai/text.learner)) and see if these linear relationships hold in that embedding space.