# Assignment 1.3: Naive word2vec (40 points)

This task can be formulated very simply. Follow this [paper](https://arxiv.org/pdf/1411.2738.pdf) and implement word2vec as a two-layer neural network with matrices $W$ and $W'$. One matrix projects words into a low-dimensional 'hidden' space, and the other projects them back into the high-dimensional vocabulary space.

![word2vec](https://i.stack.imgur.com/6eVXZ.jpg)
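
The two-matrix architecture described above can be sketched in plain NumPy. This is only an illustration with toy sizes and random weights; the names `W_prime` and `forward` are chosen here and are not part of the assignment.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_dim = 6, 3  # toy sizes chosen for illustration

W = rng.normal(size=(vocab_size, hidden_dim))        # vocabulary -> hidden space
W_prime = rng.normal(size=(hidden_dim, vocab_size))  # hidden -> vocabulary space

def forward(word_idx):
    """Return (hidden vector, softmax distribution over the vocabulary)."""
    x = np.zeros(vocab_size)
    x[word_idx] = 1.0                    # one-hot input word
    h = x @ W                            # equals row word_idx of W
    scores = h @ W_prime                 # one score per vocabulary word
    exp = np.exp(scores - scores.max())  # shift for numerical stability
    return h, exp / exp.sum()

hidden, probs = forward(2)
```

Because the input is one-hot, the product with $W$ simply selects one row, which is why a trained $W$ can be read directly as a table of word vectors.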

You can use TensorFlow/PyTorch and code from your previous task.

## Results of this task: (30 points)
 * trained word vectors (mention somewhere how long training took)
 * plotted loss (so we can see that it has converged)
 * function to map a token to its corresponding word vector
 * beautiful visualizations (PCA, t-SNE); you can use TensorBoard and play with your vectors in 3D (don't forget to add screenshots to the task)
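
A minimal sketch of two of the deliverables listed above, the token-to-vector function and a 2D PCA projection, might look like this. The vocabulary `word2idx` and the random `embeddings` are hypothetical stand-ins for the trained vectors, and PCA is computed with a plain SVD rather than a dedicated library.

```python
import numpy as np

# Hypothetical toy vocabulary and random vectors standing in for trained ones.
word2idx = {"king": 0, "queen": 1, "man": 2, "woman": 3, "apple": 4}
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(word2idx), 10))

def get_vector(token):
    """Map a token to its embedding row; raises KeyError for unknown tokens."""
    return embeddings[word2idx[token]]

def pca_2d(vectors):
    """Project vectors onto their first two principal components via SVD."""
    centered = vectors - vectors.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

coords = pca_2d(embeddings)  # shape: (vocabulary size, 2)
```

The rows of `coords` can then be passed to a scatter plot and annotated with the corresponding tokens.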

## Extra questions: (10 points)
 * Intrinsic evaluation: you can find datasets [here](http://download.tensorflow.org/data/questions-words.txt)
 * Extrinsic evaluation: you can use [these](https://medium.com/@dataturks/rare-text-classification-open-datasets-9d340c8c508e)

You can also use any other dataset for quantitative evaluation.
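
For the intrinsic evaluation, one common approach is the analogy test: for a line `a b c d`, check whether the vector nearest to `b - a + c` (by cosine similarity) is `d`. The sketch below assumes the `questions-words.txt` layout of four-word lines separated by section headers that start with `:`. The toy vectors are hypothetical and constructed so that the analogies hold exactly; words are lowercased on the assumption that the training corpus is lowercase.

```python
import numpy as np

# Hypothetical toy vectors built so that king - man + woman equals queen.
word2idx = {"man": 0, "woman": 1, "king": 2, "queen": 3, "apple": 4}
idx2word = {i: w for w, i in word2idx.items()}
embeddings = np.array([
    [1.0, 0.0, 0.0],   # man
    [0.0, 1.0, 0.0],   # woman
    [1.0, 0.0, 1.0],   # king
    [0.0, 1.0, 1.0],   # queen
    [0.0, 0.0, -1.0],  # apple
])

def solve_analogy(a, b, c):
    """Return the word d that best completes 'a is to b as c is to d'."""
    query = embeddings[word2idx[b]] - embeddings[word2idx[a]] + embeddings[word2idx[c]]
    norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query)
    sims = embeddings @ query / norms
    for w in (a, b, c):  # the three question words are not valid answers
        sims[word2idx[w]] = -np.inf
    return idx2word[int(np.argmax(sims))]

def analogy_accuracy(lines):
    """Score four-word lines; skip ':' section headers and unknown words."""
    correct = total = 0
    for line in lines:
        words = line.lower().split()
        if line.startswith(":") or len(words) != 4:
            continue
        if any(w not in word2idx for w in words):
            continue
        total += 1
        correct += solve_analogy(*words[:3]) == words[3]
    return correct / total if total else 0.0

accuracy = analogy_accuracy([": family", "man king woman queen", "man woman king queen"])
```

With a vocabulary of only 1000 words, many questions will contain unknown words, so it is worth reporting how many lines were actually scored alongside the accuracy.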

Again, it is **highly recommended** to read this [paper](https://arxiv.org/pdf/1411.2738.pdf).

Example of visualization in TensorBoard:
https://projector.tensorflow.org

Example of 2D visualization:

![2dword2vec](https://www.tensorflow.org/images/tsne.png)

# Solution

## Imports and constants

In [3]:
import numpy as np
import torch
from torch import nn


from IPython.display import clear_output
import matplotlib.pyplot as plt

USE_COLAB = False
MOUNT_DIR = '/content/drive/'

In [4]:
if USE_COLAB:
    from google.colab import files, drive
    src = list(files.upload().values())[0]
    open('batcher.py','wb').write(src)
    drive.mount(MOUNT_DIR)
from batcher import SkipGramBatcherBase, read_corpus

## Skip-Gram Batcher implementation

In [16]:
class SkipGramBatcher(SkipGramBatcherBase):
    def __next__(self):
        """ Return the next batch in the order specified in self._permuted_indxs
            Return:
                centrals (np.array): batch of central word indices. The size is (batch_size,)
                neighbours (np.array): batch of neighbour word indices.
                                       One word for each central word. The size is (batch_size,)
        """
        centrals, neighbours = self._get_next_batch()
        if centrals is None or neighbours is None:
            raise StopIteration

        # Draw an independent neighbour position for every central word in the batch
        rand_neighbours_idxs = np.random.randint(0, 2 * self._window_size, size=self._batch_size)
        rand_neighbours = neighbours[np.arange(self._batch_size), rand_neighbours_idxs]  # (batch_size,)
        
        return centrals, rand_neighbours
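
Picking one random neighbour per central word relies on NumPy integer array indexing. The sketch below illustrates the idea on a toy array; the layout of `neighbours` (one row of `2 * window_size` indices per central word) is an assumption, since `batcher.py` is not shown here.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, window_size = 4, 2

# Assumed layout: row i holds the 2 * window_size neighbour indices of central word i.
neighbours = np.arange(batch_size * 2 * window_size).reshape(batch_size, 2 * window_size)

# One random column per row, so each central word gets its own neighbour position.
cols = rng.integers(0, 2 * window_size, size=batch_size)
picked = neighbours[np.arange(batch_size), cols]  # shape: (batch_size,)
```

Passing a single scalar instead of `cols` would select the same window position for every row of the batch.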

## Naive word2vec PyTorch module implementation

In [17]:
class NaiveWord2vec(nn.Module):
    def __init__(self, voc_size, embedding_dim):
        super(NaiveWord2vec, self).__init__()

        self.embedding_layer = nn.Embedding(voc_size, embedding_dim)
        self.linear_layer = nn.Linear(embedding_dim, voc_size, bias=False)
        self.activation = nn.LogSoftmax(dim=1)
    
    def forward(self, x):
        """ Forward pass of batch x
            Params:
                x (torch.LongTensor): Batch of word indices. The size is (batch_size,)
            Return:
                x (torch.Tensor): Log-probabilities of neighbour words. The size is (batch_size, voc_size)
        """
        x = self.embedding_layer(x)  # x: (batch_size, embedding_dim)
        x = self.linear_layer(x)  # x: (batch_size, voc_size)
        x = self.activation(x)  # x: (batch_size, voc_size)
        return x
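
`nn.Embedding` takes integer word indices rather than one-hot vectors, yet it computes the same result as multiplying a one-hot batch by the weight matrix. The following small check, with toy sizes chosen for illustration, shows the equivalence.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)
voc_size, embedding_dim = 7, 3  # toy sizes for illustration

embedding = nn.Embedding(voc_size, embedding_dim)
indices = torch.tensor([2, 5, 2])  # a batch of word indices, dtype int64

looked_up = embedding(indices)  # (batch_size, embedding_dim)
one_hot = F.one_hot(indices, num_classes=voc_size).float()  # (batch_size, voc_size)
multiplied = one_hot @ embedding.weight  # same rows, computed the slow way

same = torch.allclose(looked_up, multiplied)
```

The lookup avoids materializing a `(batch_size, voc_size)` matrix, which matters once the vocabulary grows beyond a toy size.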

# Learning process

## Define params

In [18]:
WINDOW_SIZE = 4
BATCH_SIZE = 64
VOCABULARY_SIZE = 1000

EMBEDDINGS_DIM = 150
EPOCH_NUM = 10
DROW_EVERY = 20

TEXT_PATH = MOUNT_DIR + 'My Drive/datasets/nlp-ipavlov/text8' if USE_COLAB else 'data/text8'

## Preparation

### load data

In [10]:
corpus = read_corpus(TEXT_PATH)

In [19]:
dataset = SkipGramBatcher(corpus, WINDOW_SIZE, BATCH_SIZE, VOCABULARY_SIZE)

### create model

In [20]:
model = NaiveWord2vec(VOCABULARY_SIZE, EMBEDDINGS_DIM)

### create loss function

In [21]:
criterion = torch.nn.NLLLoss()
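
`NLLLoss` expects log-probabilities, which is why the model ends with `LogSoftmax`. As a sanity check on toy data, the sketch below shows that this pair matches `CrossEntropyLoss` applied to raw scores, and that the default reduction is already a mean over the batch.

```python
import torch
from torch import nn

torch.manual_seed(0)
logits = torch.randn(4, 10)           # (batch_size, voc_size) raw scores
targets = torch.tensor([1, 0, 9, 3])  # target word indices

log_probs = nn.LogSoftmax(dim=1)(logits)
nll = nn.NLLLoss()(log_probs, targets)       # mean over the batch by default
ce = nn.CrossEntropyLoss()(logits, targets)  # log-softmax + NLL in one call

# Manual mean of the negative log-probabilities of the target words.
manual = -log_probs[torch.arange(4), targets].mean()
```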

### create optimizer

In [22]:
optimizer = torch.optim.Adam(model.parameters(), lr=0.005)

## Training loop

In [23]:
sliding_loss = []
window_loss = []
for epoch in range(EPOCH_NUM):
    for i, (context, target) in enumerate(dataset):
        tensor_context = torch.from_numpy(context).type(torch.LongTensor)
        tensor_target = torch.from_numpy(target).type(torch.LongTensor)

        model.zero_grad()
        pred = model(tensor_context)
        loss = criterion(pred, tensor_target)
        loss.backward()
        optimizer.step()

        single_loss = loss.item()  # NLLLoss already averages over the batch
        window_loss.append(single_loss)
        if len(window_loss) > 2 * BATCH_SIZE:
            window_loss.pop(0)
        sliding_loss.append(np.average(window_loss))

        if i % DROW_EVERY == 0:
            plt.figure(figsize=(13, 7))
            plt.title("Epoch number {}/{}".format(epoch+1, EPOCH_NUM))
            plt.plot(sliding_loss, label="Train loss")
            clear_output(wait=True)
            plt.show()
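
The smoothed loss curve is a moving average over the most recent batches. One possible way to compute it without popping from the front of a list is a bounded `collections.deque`; the helper name `sliding_average` is chosen here for illustration.

```python
from collections import deque

def sliding_average(values, window):
    """Return the running mean of the last `window` values at each step."""
    recent = deque(maxlen=window)  # old values are dropped automatically
    running_sum = 0.0
    averages = []
    for v in values:
        if len(recent) == window:
            running_sum -= recent[0]  # value about to be evicted
        recent.append(v)
        running_sum += v
        averages.append(running_sum / len(recent))
    return averages

smoothed = sliding_average([4.0, 2.0, 6.0, 8.0], window=2)
```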
