Doc2vec in PyTorch
==================

Implementing this useful algorithm with a library we know and trust. This is a manual re-implementation of the great work by [Nejc Ilenic](https://github.com/ilenic/paragraph-vectors) with the vain hope that I'll learn something.

First we need to load the data.

In [1]:
import pandas as pd
import spacy

nlp = spacy.load("en_core_web_sm")

df = pd.read_csv("data/example.csv")
df["tokens"] = df.text.str.lower().apply(lambda x: [token.text for token in nlp(x)])

df[:3]

Unnamed: 0,text,tokens
0,"In the week before their departure to Arrakis,...","[in, the, week, before, their, departure, to, ..."
1,"It was a warm night at Castle Caladan, and the...","[it, was, a, warm, night, at, castle, caladan,..."
2,The old woman was let in by the side door down...,"[the, old, woman, was, let, in, by, the, side,..."


We need to construct a vocabulary.

In [2]:
from collections import Counter

class Vocab:
    def __init__(self, all_tokens):
        self.freqs = Counter(all_tokens)
        self.words = sorted(self.freqs.keys())
        self.word2idx = {w: i for i, w in enumerate(self.words)}
        
vocab = Vocab([tok for tokens in df.tokens for tok in tokens])
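
As a quick sanity check, here is the same class applied to a handful of made-up tokens (the toy sentence is my own, not from the dataset). Because the words are sorted before indexing, the ids come out alphabetical rather than frequency-ordered.

```python
from collections import Counter

class Vocab:
    # Same definition as above, repeated so this cell runs on its own.
    def __init__(self, all_tokens):
        self.freqs = Counter(all_tokens)
        self.words = sorted(self.freqs.keys())
        self.word2idx = {w: i for i, w in enumerate(self.words)}

toy_vocab = Vocab(["the", "cat", "sat", "on", "the", "mat"])
print(toy_vocab.words)         # ['cat', 'mat', 'on', 'sat', 'the']
print(toy_vocab.word2idx)      # {'cat': 0, 'mat': 1, 'on': 2, 'sat': 3, 'the': 4}
print(toy_vocab.freqs["the"])  # 2
```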

As every schoolchild knows, word2vec (and so doc2vec) is at heart a classifier trained to predict the missing word in a context. Given a sentence like "the cat _ on the mat" it should predict "sat", and in doing so learn a useful representation of words.

The difficulty is that the missing word could be any word in the vocabulary V, so the network would have |V| outputs for each input: if it were perfectly trained, a vector containing zero for every word in the vocabulary and some positive number for "sat". To calculate the loss we need to turn that into a probability distribution, i.e. _softmax_ it. Computing the softmax over such a large vector is expensive.

So the trick (one of many possible) is to change our "the cat _ on the mat" problem into a multiple choice problem. We ask the network to choose between "sat" and some random wrong answers like "hopscotch" and "luxuriated". The softmax is then much cheaper to compute, since the output is simply a vector of size 1 + k, where k is the number of random incorrect options.
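
The saving is easy to see in a minimal NumPy sketch that follows the framing above. The vocabulary size, the value of k, the id standing in for "sat", and the random scores are all made-up stand-ins for what a real network would produce.

```python
import numpy as np

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    shifted = scores - np.max(scores)
    exps = np.exp(shifted)
    return exps / exps.sum()

rng = np.random.default_rng(0)
vocab_size = 50_000   # hypothetical |V|
k = 15                # hypothetical number of wrong answers

# Full problem: one score per word in the vocabulary.
full_scores = rng.normal(size=vocab_size)
full_probs = softmax(full_scores)

# Multiple-choice problem: the true word plus k randomly drawn wrong answers.
true_id = 123  # hypothetical id of "sat"
noise_ids = rng.integers(0, vocab_size, size=k)
candidate_ids = np.concatenate(([true_id], noise_ids))
small_probs = softmax(full_scores[candidate_ids])

print(full_probs.shape)   # (50000,)
print(small_probs.shape)  # (16,)
```

The second softmax touches 16 numbers instead of 50,000, and that gap grows with the vocabulary.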

The general idea is to create a set of examples where each example has:

- doc id
- target noise ids - a collection of the target token and some noise tokens
- context ids - tokens before and after the target token

For example, if our context size were 2, the first example from the above dataset would be:

```
{"doc_id": 0,
 "target_noise_ids": [word2idx[x] for x in ["week", "random-word-from-vocab", "random-word-from-vocab"...],
 "context_ids": [word2idx[x] for x in ["in", "the", "before", "their"]]}
 ```
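
A runnable version of the same idea, as a rough sketch: the helper name `make_example` and the eight-token toy document are mine, and the noise ids are drawn uniformly here purely as a placeholder.

```python
import random

tokens = ["in", "the", "week", "before", "their", "departure", "to", "arrakis"]
word2idx = {w: i for i, w in enumerate(sorted(set(tokens)))}

def make_example(doc_id, tokens, i, context_size, n_noise, rng):
    # Target word first, then n_noise placeholder ids (uniform, for illustration only).
    target_noise_ids = [word2idx[tokens[i]]]
    target_noise_ids += [rng.randrange(len(word2idx)) for _ in range(n_noise)]
    # context_size tokens on each side of position i, skipping the target itself.
    context = tokens[i - context_size:i] + tokens[i + 1:i + context_size + 1]
    return {"doc_id": doc_id,
            "target_noise_ids": target_noise_ids,
            "context_ids": [word2idx[w] for w in context]}

example = make_example(doc_id=0, tokens=tokens, i=2, context_size=2,
                       n_noise=3, rng=random.Random(0))

idx2word = {i: w for w, i in word2idx.items()}
context_words = [idx2word[i] for i in example["context_ids"]]
print(context_words)                             # ['in', 'the', 'before', 'their']
print(idx2word[example["target_noise_ids"][0]])  # week
```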
 
The random words are chosen according to a probability distribution:

> a unigram distribution raised to the 3/4 power, as proposed by T. Mikolov et al. in Distributed Representations of Words and Phrases and their Compositionality


In [3]:
import numpy as np

class NoiseDistribution:
    def __init__(self, vocab):
        self.probs = np.array([vocab.freqs[w] for w in vocab.words])
        self.probs = np.power(self.probs, 0.75)
        self.probs /= np.sum(self.probs)
    def sample(self, n):
        "Returns the indices of n words randomly sampled from the vocabulary."
        return np.random.choice(a=self.probs.shape[0], size=n, p=self.probs)
        
noise = NoiseDistribution(vocab)
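
To get a feel for what the 3/4 power does, here is a small sketch with made-up counts (the three frequencies are assumptions for illustration). Compared with the raw unigram distribution, the exponent takes probability mass from the most frequent word and hands it to the rarer ones.

```python
import numpy as np

counts = np.array([100.0, 10.0, 1.0])  # hypothetical counts for three words

# Raw unigram distribution: counts normalised to sum to 1.
unigram = counts / counts.sum()

# Same counts raised to the 3/4 power, then normalised.
smoothed = np.power(counts, 0.75)
smoothed /= smoothed.sum()

print(np.round(unigram, 3))   # roughly 0.901, 0.090, 0.009
print(np.round(smoothed, 3))  # roughly 0.827, 0.147, 0.026
```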

With this distribution, we advance through the documents creating examples.

In [4]:
import torch

def example_generator(df, context_size, noise, n_noise_labels, vocab):
    for doc_id, doc in df.iterrows():
        for i in range(context_size, len(doc.tokens) - context_size):
            true_label = doc.tokens[i]
            labels = noise.sample(n_noise_labels).tolist()
            labels.insert(0, vocab.word2idx[true_label])
            context = doc.tokens[i - context_size:i] + doc.tokens[i + 1:i + context_size + 1]
            context_ids = [vocab.word2idx[w] for w in context]
            yield {"doc_id": torch.tensor(doc_id),
                   "labels": torch.tensor(labels), 
                   "context_ids": torch.tensor(context_ids)}
            
examples = example_generator(df, context_size=5, noise=noise, n_noise_labels=15, vocab=vocab)

And package this up as a PyTorch dataset.

In [5]:
from torch.utils.data import Dataset, DataLoader

class NCEDataset(Dataset):
    def __init__(self, examples):
        self.examples = list(examples)  # just naively evaluate the whole damn thing - suboptimal!
    def __len__(self):
        return len(self.examples)
    def __getitem__(self, index):
        return self.examples[index]
    
dataset = NCEDataset(examples)
dataloader = DataLoader(dataset, batch_size=2)  # TODO bigger batch size when not dummy data
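
With a dict-style dataset like this, the `DataLoader`'s default collation stacks each field across the batch. A small sketch with hand-built examples shows the resulting shapes; the ids are made up, and `ToyDataset` is just a stand-in for the class above.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    # Stand-in for NCEDataset so this cell runs on its own.
    def __init__(self, examples):
        self.examples = list(examples)
    def __len__(self):
        return len(self.examples)
    def __getitem__(self, index):
        return self.examples[index]

# Four fake examples: 1 true label + 3 noise labels, and 2 * 2 context words.
toy_examples = [{"doc_id": torch.tensor(0),
                 "labels": torch.tensor([5, 1, 2, 3]),
                 "context_ids": torch.tensor([7, 8, 9, 10])}
                for _ in range(4)]

toy_loader = DataLoader(ToyDataset(toy_examples), batch_size=2)
batch = next(iter(toy_loader))
print(batch["doc_id"].shape)       # torch.Size([2])
print(batch["labels"].shape)       # torch.Size([2, 4])
print(batch["context_ids"].shape)  # torch.Size([2, 4])
```

So the model will receive `doc_ids` of shape (batch,), and `labels` and `context_ids` of shape (batch, something).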

Let's jump into creating the model now.

In [6]:
import torch.nn as nn

class DistributedMemory(nn.Module):
    def __init__(self, vec_dim, n_docs, n_words):
        super(DistributedMemory, self).__init__()
        self.paragraph_matrix = nn.Parameter(torch.randn(n_docs, vec_dim))
        self.word_matrix = nn.Parameter(torch.randn(n_words, vec_dim))
        self.outputs = nn.Parameter(torch.zeros(vec_dim, n_words))
    
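    def forward(self, doc_ids, context_ids, label_ids):
        # (batch, vec_dim): document vector plus the sum of the context word vectors.
        # Sum over dim=1 (the context positions); without dim the sum collapses to a scalar.
        x = torch.add(self.paragraph_matrix[doc_ids,:], torch.sum(self.word_matrix[context_ids,:], dim=1))
        # bmm needs 3-D inputs: (batch, 1, vec_dim) @ (batch, vec_dim, n_labels).
        x = torch.bmm(x.unsqueeze(1), self.outputs[:,label_ids].permute(1, 0, 2))
        return x.squeeze(1)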

model = DistributedMemory(vec_dim=300,
                          n_docs=len(df),  # one paragraph vector per document, not per example
                          n_words=len(vocab.words))
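
The shapes in `forward` are the fiddly part, so here is a standalone sketch of the intended computation on random tensors. All sizes are made up, and the `unsqueeze`/`permute` arrangement is one way to get a per-example dot product out of `torch.bmm`, not necessarily the only one.

```python
import torch

torch.manual_seed(0)
batch_size, vec_dim, n_docs, n_words = 2, 8, 3, 20  # hypothetical sizes
n_context, n_labels = 4, 6                           # 2 * context_size, 1 + k

paragraph_matrix = torch.randn(n_docs, vec_dim)
word_matrix = torch.randn(n_words, vec_dim)
outputs = torch.randn(vec_dim, n_words)

doc_ids = torch.randint(0, n_docs, (batch_size,))
context_ids = torch.randint(0, n_words, (batch_size, n_context))
label_ids = torch.randint(0, n_words, (batch_size, n_labels))

# (batch, vec_dim): document vector plus the sum of its context word vectors.
x = paragraph_matrix[doc_ids, :] + word_matrix[context_ids, :].sum(dim=1)

# outputs[:, label_ids] is (vec_dim, batch, n_labels); move batch to the front.
candidates = outputs[:, label_ids].permute(1, 0, 2)

# (batch, 1, vec_dim) @ (batch, vec_dim, n_labels) -> (batch, 1, n_labels)
scores = torch.bmm(x.unsqueeze(1), candidates).squeeze(1)
print(scores.shape)  # torch.Size([2, 6])

# Cross-check one entry against an explicit dot product.
expected = torch.dot(x[0], outputs[:, label_ids[0, 0]])
print(torch.allclose(scores[0, 0], expected, atol=1e-5))
```

Each row of `scores` holds one score per candidate label, with the true word in column 0 given how the examples were built.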