# Assignment 1.2: Word2vec preprocessing (20 points)

Preprocessing is not the most exciting part of NLP, but it is still one of the most important. Your task is to preprocess raw text (you can use your own, or [this one](http://mattmahoney.net/dc/text8.zip)). For this task, text preprocessing mostly consists of:

1. cleaning (mostly, if your dataset is from social media or parsed from the internet)
1. tokenization
1. building the vocabulary and choosing its size. Use only high-frequency words; replace all other words with UNK or handle them in your own manner. You can use `collections.Counter` for that.
1. assigning each token a number (numericalization). In other words, make `word2index` and `index2word` objects.
1. data structuring and batching: write a generator of X and y matrices for word2vec (explained in more detail below)
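
Steps 2 to 4 above can be sketched as follows. This is a minimal illustration, not a reference solution: the helper name `build_vocab`, the `min_count` parameter, and the `<UNK>` spelling are assumptions made for the example.

```python
from collections import Counter

UNK = '<UNK>'  # assumed spelling of the unknown-word token


def build_vocab(tokens, min_count=2):
    """Map words seen at least `min_count` times to contiguous indices."""
    counts = Counter(tokens)
    # most_common() sorts by frequency, so frequent words get small indices
    kept = [w for w, n in counts.most_common() if n >= min_count]
    word2index = {w: i for i, w in enumerate(kept)}
    word2index[UNK] = len(word2index)  # one extra slot shared by all rare words
    index2word = {i: w for w, i in word2index.items()}
    return word2index, index2word


# Tokenization here is a plain lowercase + whitespace split.
tokens = 'the cat sat on the mat the cat slept'.lower().split()
word2index, index2word = build_vocab(tokens, min_count=2)

# Numericalization: rare words fall back to the UNK index.
ids = [word2index.get(w, word2index[UNK]) for w in tokens]
print(ids)
print([index2word[i] for i in ids])
```

Filtering before enumerating keeps the indices contiguous, which matters later when the vocabulary size is used to dimension an embedding matrix.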

**ATTN!:** If you use your own data, please attach a download link.

Your goal is to write a **Batcher** class that returns two NumPy arrays of word indices, in a form that can be used directly for word2vec training. You can implement the batcher for either the Skip-Gram or the CBOW architecture; the picture below is a reminder of the difference.

![text](https://raw.githubusercontent.com/deepmipt/deep-nlp-seminars/651804899d05b96fc72b9474404fab330365ca09/seminar_02/pics/architecture.png)

There are several valid layouts. For CBOW the shapes could be `x_batch.shape = (batch_size, 2*window_size)` and `y_batch.shape = (batch_size,)`; for Skip-Gram they are swapped: `x_batch.shape = (batch_size,)` and `y_batch.shape = (batch_size, 2*window_size)`. You should **not** do negative sampling here.
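
To make the two layouts concrete, the sketch below builds both from the same token ids with plain NumPy. The function name `context_windows` is hypothetical, and positions near the edges that lack a full window are simply skipped, which is only one of several reasonable choices.

```python
import numpy as np


def context_windows(ids, window_size):
    """Return (contexts, centers) for every position with a full window."""
    ids = np.asarray(ids)
    centers_pos = np.arange(window_size, len(ids) - window_size)
    # offsets -w..-1 and 1..w, i.e. the window without the center itself
    offsets = np.concatenate([np.arange(-window_size, 0),
                              np.arange(1, window_size + 1)])
    contexts = ids[centers_pos[:, None] + offsets[None, :]]
    centers = ids[centers_pos]
    return contexts, centers


ids = list(range(8))  # stand-in for numericalized text
contexts, centers = context_windows(ids, window_size=2)

# CBOW: context words are the input, the center word is the target.
x_cbow, y_cbow = contexts, centers
# Skip-Gram: the center word is the input, context words are the targets.
x_sg, y_sg = centers, contexts

print(x_cbow.shape, y_cbow.shape)  # (4, 4) (4,)
print(x_sg.shape, y_sg.shape)      # (4,) (4, 4)
```

Both architectures consume the same (center, context) windows; only the roles of input and target are exchanged.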

The batchers should be sensibly parametrized, e.g. `CBOW(window_size, ...)` or `SkipGram(window_size, ...)`. You only need to implement one batcher in this task, and it's up to you which one to choose.

Useful links:
1. [Word2Vec Tutorial - The Skip-Gram Model](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/)
1. [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/pdf/1301.3781.pdf)
1. [Distributed Representations of Words and Phrases and their Compositionality](http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf)

You can write the code in this notebook or in a separate file. It can be reused for the next task. Your result should demonstrate that a batch has the proper structure (right shapes) and content (words should come from one context, not be random indices). To show that, translate indices back to words and print something like this:

```
text = ['first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including']

window_size = 2

# CBOW:
indices_to_words(x_batch) = \
        [['first', 'used', 'early', 'working'],
        ['used', 'against', 'working', 'class'],
        ['against', 'early', 'class', 'radicals'],
        ['early', 'working', 'radicals', 'including']]

indices_to_words(labels_batch) = ['against', 'early', 'working', 'class']
```
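
The `indices_to_words` helper used above is not defined in the task. One possible version, assuming an `index2word` dictionary as in step 4, handles both flat label batches and nested context batches; the name and signature are illustrative only.

```python
def indices_to_words(batch, index2word):
    """Translate a (possibly nested) list of indices back to words."""
    return [indices_to_words(item, index2word) if isinstance(item, (list, tuple))
            else index2word[int(item)]
            for item in batch]


index2word = {0: 'first', 1: 'used', 2: 'against', 3: 'early', 4: 'working'}
x_batch = [[0, 1, 3, 4]]  # context of the word 'against' for window_size = 2
labels_batch = [2]

print(indices_to_words(x_batch, index2word))
print(indices_to_words(labels_batch, index2word))
```

For NumPy arrays or tensors, calling `.tolist()` first gives the nested lists this sketch expects.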

In [2]:
import numpy as np
import random
import torch
import torch.nn as nn
import torch.optim as optim

from pathlib import Path
from pprint import pprint

UNK_TOKEN = '<UNK>'

np.random.seed(4242)
random.seed(4242)

In [3]:
USE_GPU = True

if USE_GPU and torch.cuda.is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')

In [17]:
from collections import Counter


class CBOWBatcher:
    THRESHOLD = 5
    def __init__(self, dataset, window_size=2, threshold=THRESHOLD):
        self.window_size = window_size
        self.threshold = threshold
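        c = Counter(dataset)
        # Keep only words seen more than `threshold` times; rare words map to <UNK>.
        # Filter first and enumerate afterwards so that indices are contiguous
        # (0..len(vocab)-1) and the <UNK> index cannot collide with a word index.
        frequent = sorted(w for w, n in c.items() if n > self.threshold)
        self.word2ind = {w: i for i, w in enumerate(frequent)}
        self.word2ind[UNK_TOKEN] = len(self.word2ind)
        self.ind2word = {i: w for w, i in self.word2ind.items()}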
        # Store the corpus as word indices; words outside the vocabulary map to <UNK>.
        # Pad both ends with `window_size` <UNK> tokens so that every corpus word,
        # including the first and the last ones, can serve as a window center.
        self.tokens = ([self.word2ind[UNK_TOKEN]] * window_size) +\
            [self.word2ind.get(w, self.word2ind[UNK_TOKEN]) for w in dataset] +\
            ([self.word2ind[UNK_TOKEN]] * window_size)
        self.vocab_size = len(set(self.tokens))
        assert self.vocab_size == len(self.word2ind)
        pprint(f'Corpus size: {len(dataset)}')
        pprint(f'Actual count of words used: {self.vocab_size}')
        pprint(f'{len(dataset)} words in dataset tokenized to {len(self.tokens)} tokens')

    def get_batch(self, batch_size=512):
        X = [None] * batch_size
        y = [None] * batch_size
        current = 0
        # use self.window_size: the bare name would silently pick up a global variable
        for start in np.random.permutation(len(self.tokens) - 2 * self.window_size):
            center = start + self.window_size
            X[current] = [self.tokens[i]
                          for i in range(center - self.window_size, center + self.window_size + 1) if i != center]
            y[current] = self.tokens[center]
            current += 1
            if current == batch_size:
                # yield (rather than return) so that get_batch is a generator
                yield torch.from_numpy(np.asarray(X)).to(device=device),\
                      torch.from_numpy(np.asarray(y)).to(device=device)
                # reset the buffers once the consumer asks for the next batch
                X = [None] * batch_size
                y = [None] * batch_size
                current = 0
        if current:
            # the corpus ended before the last batch was full: yield the partial batch
            yield torch.from_numpy(np.asarray(X[:current])).to(device=device),\
                  torch.from_numpy(np.asarray(y[:current])).to(device=device)


In [18]:
text = ['first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including']
window_size = 2
batcher = CBOWBatcher(text, window_size, 0)

expected = {
    'first': [UNK_TOKEN, UNK_TOKEN, 'used', 'against'],
    'used': [UNK_TOKEN, 'first', 'against', 'early'],
    'against': ['first', 'used', 'early', 'working'],
    'early': ['used', 'against', 'working', 'class'],
    'working': ['against', 'early', 'class', 'radicals'],
    'class': ['early', 'working', 'radicals', 'including'],
    'radicals': ['working', 'class', 'including', UNK_TOKEN],
    'including': ['class', 'radicals', UNK_TOKEN, UNK_TOKEN]
    }

answers_got = {}
for i, (x, y) in enumerate(batcher.get_batch(3)):
    for j, x_i in enumerate(x):
        translate_x = [batcher.ind2word[ii.item()] for ii in x_i]
        translate_y = batcher.ind2word[y[j].item()]
        answers_got[translate_y] = translate_x
        assert x.shape[0] <= 3
        assert x.shape[1] == window_size * 2

assert answers_got == expected, answers_got

'Corpus size: 8'
'Actual count of words used: 9'
'8 words in dataset tokenized to 12 tokens'


In [19]:
text8_path = Path.cwd() / 'text8'
with text8_path.open() as f:
    # 1. simple cleaning: lowercase every word; 2. tokenization: split on whitespace
    text8 = [a.lower() for line in f for a in line.split()]
batcher = CBOWBatcher(text8)

'Corpus size: 17005207'
'Actual count of words used: 63642'
'17005207 words in dataset tokenized to 17005211 tokens'


In [20]:
for i, (x, y) in enumerate(batcher.get_batch(3)):
    for j, x_i in enumerate(x):
        translate_x = [batcher.ind2word[ii.item()] for ii in x_i]
        pprint(translate_x)
        pprint(batcher.ind2word[y[j].item()])
        break
    break

['of', 'most', 'the', 'nature']
'galaxies'
