# Assignment 1.2: Word2vec preprocessing (20 points)

Preprocessing is not the most exciting part of NLP, but it is still one of the most important. Your task is to preprocess raw text (you can use your own or [this one](http://mattmahoney.net/dc/text8.zip)). For this task, text preprocessing mostly consists of:

1. cleaning (mostly needed if your dataset comes from social media or was scraped from the web)
1. tokenization
1. building the vocabulary and choosing its size. Keep only high-frequency words and replace all other words with UNK, or handle them in your own way. You can use `collections.Counter` for that.
1. assigning each token a number (numericalization). In other words, build `word2index` and `index2word` objects.
1. data structuring and batching: build a generator of X and y matrices for word2vec (explained in more detail below)
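
Steps 3 and 4 above can be sketched as follows. This is only an illustration, not a required design: the helper name `build_vocab` and its `max_size`, `min_count` and `unk` parameters are assumptions made for this example.

```python
from collections import Counter


def build_vocab(tokens, max_size=None, min_count=1, unk="UNK"):
    """Hypothetical helper: return (word2index, index2word) with UNK at index 0."""
    counts = Counter(tokens)
    # most_common() sorts words by descending frequency;
    # max_size=None keeps every word that passes the min_count filter
    kept = [w for w, c in counts.most_common(max_size) if c >= min_count and w != unk]
    index2word = [unk] + kept
    word2index = {w: i for i, w in enumerate(index2word)}
    return word2index, index2word


tokens = "the cat sat on the mat the end".split()
word2index, index2word = build_vocab(tokens, min_count=2)
# Words outside the vocabulary fall back to index 0 (UNK)
ids = [word2index.get(t, 0) for t in tokens]
print(index2word)  # ['UNK', 'the']
print(ids)         # [1, 0, 0, 0, 1, 0, 1, 0]
```

Reserving index 0 for UNK keeps the out-of-vocabulary lookup a plain `dict.get(word, 0)`.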

**ATTN!:** If you use your own data, please attach a download link.

Your goal is to write a **Batcher** class that returns two numpy tensors of word indices. It should be possible to use it for word2vec training. You can implement the batcher for either the Skip-Gram or the CBOW architecture; the picture below is a reminder of the difference.

![text](https://raw.githubusercontent.com/deepmipt/deep-nlp-seminars/651804899d05b96fc72b9474404fab330365ca09/seminar_02/pics/architecture.png)

There are several ways to do it right. Shapes could be `x_batch.shape = (batch_size, 2*window_size)`, `y_batch.shape = (batch_size,)` for CBOW or `(batch_size,)`, `(batch_size, 2*window_size)` for Skip-Gram. You should **not** do negative sampling here.

The batchers should be adequately parametrized: `CBOW(window_size, ...)`, `SkipGram(window_size, ...)`. You only need to implement one batcher in this task, and it is up to you which one to choose.
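
To make the shapes above concrete, here is a minimal sketch that extracts centers and contexts with numpy index arithmetic. The function name `context_windows` is hypothetical, and the sketch assumes only positions with a full window on both sides are used.

```python
import numpy as np


def context_windows(token_ids, window_size):
    """Hypothetical helper: return (centers, contexts) for all full windows."""
    ids = np.asarray(token_ids)
    # Positions that have window_size neighbours on both sides
    centers_pos = np.arange(window_size, len(ids) - window_size)
    # Offsets -w..-1 and 1..w; offset 0 (the center itself) is skipped
    offsets = np.concatenate([np.arange(-window_size, 0),
                              np.arange(1, window_size + 1)])
    # Broadcasting (n, 1) + (2w,) gives an (n, 2w) matrix of positions
    contexts = ids[centers_pos[:, None] + offsets]
    centers = ids[centers_pos]
    return centers, contexts


centers, contexts = context_windows(np.arange(8), window_size=2)
# CBOW:      x = contexts, y = centers
# Skip-Gram: x = centers,  y = contexts
print(contexts.shape, centers.shape)  # (4, 4) (4,)
print(contexts[0], centers[0])        # [0 1 3 4] 2
```

Slicing these arrays into chunks of `batch_size` rows then yields batches of the shapes listed above.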

Useful links:
1. [Word2Vec Tutorial - The Skip-Gram Model](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/)
1. [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/pdf/1301.3781.pdf)
1. [Distributed Representations of Words and Phrases and their Compositionality](http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf)

You can write the code in this notebook or in a separate file. It can be reused for the next task. The result of your work should demonstrate that your batches have the proper structure (right shapes) and content (words should come from one context, not be random indices). To show that, translate indices back to words and print them, producing something like this:

```
text = ['first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including']

window_size = 2

# CBOW:
indices_to_words(x_batch) = \
        [['first', 'used', 'early', 'working'],
        ['used', 'against', 'working', 'class'],
        ['against', 'early', 'class', 'radicals'],
        ['early', 'working', 'radicals', 'including']]

indices_to_words(labels_batch) = ['against', 'early', 'working', 'class']
```

# Implementation

In [1]:
from collections import Counter
import numpy as np

In [2]:
class Batcher():
    def __init__(self, text, limit=5, window_size=2, mode='sg'):
        """
        Batcher for Skip-Gram or CBOW

        :param text: String without newline symbols
        :param limit: Minimum number of occurrences for a word to enter the vocabulary
        :param window_size: Number of context words on each side of the central word
        :param mode: 'cbow' or 'sg'
        """

        self.limit = limit
        self.window_size = window_size
        self.mode = mode

        self.text = text

        self.UNK = 'UNK'

        self.tokens = []
        self.tokens_ind = []


        self.vocabulary = set()
        self.word2index = dict()
        self.index2word = []

        self._preprocess()

    @classmethod
    def from_file(cls, path, limit=5, window_size=2, mode='sg'):
        """
        Init Batcher from file

        :param path: Path to text file
        :param limit: Minimum number of occurrences for a word to enter the vocabulary
        :param window_size: Number of context words on each side of the central word
        :param mode: 'cbow' or 'sg'
        :return: Batcher object
        """
        with open(path) as f:
            text = f.read()

        return cls(text, limit, window_size, mode)

    def _clean(self):
        # Everything except [a-z ] has already been removed from our dataset
        pass

    def _tokenize(self):
        # Naive whitespace tokenizer, sufficient for our dataset
        self.tokens = self.text.split()

    def _build_vocabulary(self):
        counter_words = Counter(self.tokens)
        self.vocabulary = {word for word, counts in counter_words.items() if counts >= self.limit}

    def _numericalize(self):
        self.index2word = [self.UNK] + list(self.vocabulary)
        self.word2index = dict(zip(self.index2word, range(len(self.index2word))))
        self.tokens_ind = [self.word2index.get(word, 0) for word in self.tokens]

    def _preprocess(self):
        self._clean()
        self._tokenize()
        self._build_vocabulary()
        self._numericalize()

    def indices2words(self, indices):
        shape = indices.shape
        result = np.array([self.index2word[idx] for idx in indices.flatten()])
        return result.reshape(shape)

    def batch_generator(self, batch_size=5):
        """
        Batch generator

        :param batch_size: Number of examples per batch
        :return: Next batch
        """

        # The dataset is big enough,
        # so skip the last incomplete batch if there is one
        count_batches = (len(self.tokens) - 2 * self.window_size) // batch_size

        for batch_id in range(count_batches):
            batch_x = []
            batch_label = []
            for step_id in range(batch_size):
                pos_word_central = step_id + self.window_size + batch_id * batch_size
                x = self.tokens_ind[pos_word_central]
                batch_x.append(x)
                label_left = self.tokens_ind[pos_word_central - self.window_size : pos_word_central]
                label_right = self.tokens_ind[pos_word_central + 1 : pos_word_central + 1 + self.window_size]
                label = label_left + label_right
                batch_label.append(label)

            batch_x = np.array(batch_x)
            batch_label = np.array(batch_label)
            if self.mode == 'cbow':
                batch_x, batch_label = batch_label, batch_x
            yield batch_x, batch_label
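
The batch-count arithmetic used in `batch_generator` can be checked in isolation. The sketch below is a standalone illustration; the function name `count_full_batches` is hypothetical and not part of the class above.

```python
def count_full_batches(n_tokens, window_size, batch_size):
    """Hypothetical helper mirroring the batch-count formula."""
    # Only positions with a full window on both sides can be centers
    n_centers = max(n_tokens - 2 * window_size, 0)
    # Integer division drops the trailing incomplete batch
    return n_centers // batch_size


print(count_full_batches(8, 2, 4))  # 1
print(count_full_batches(8, 2, 3))  # 1 (the fourth center is dropped)
```

With 8 tokens and `window_size=2` there are 4 valid centers, so a batch size of 4 gives exactly one batch.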






## Using with the assignment file (Skip-Gram mode)

In [3]:
window_size = 5
batch_size = 12
limit4dictonary = 4
mode='sg'
# mode='cbow'

print('*' * 20)
print(f'window_size: {window_size}, batch_size: {batch_size}, limit4dictonary: {limit4dictonary}, mode: {mode}')
print('*' * 20)

batcher = Batcher.from_file(path='./text8', limit=limit4dictonary, window_size=window_size, mode=mode)
# The batch size is hard-coded to 10 here; batch_size above is only printed
batch_generator = batcher.batch_generator(10)

i = 0
for batch_x, batch_label in batch_generator:
    print('*' * 20)
    print('batch_x: ', batch_x)
    print("batch_label: \n", batch_label)
    print('*' * 20)

    i+=1
    if i >= 3:
        break
print('*' * 20)
print(f'batch_x shape: {batch_x.shape}, batch_label shape: {batch_label.shape}')

********************
window_size: 5, batch_size: 12, limit4dictonary: 4, mode: sg
********************
********************
batch_x:  [ 3595 54246 71485 22078 35302  8800  8424  1850 72335 26442]
batch_label: 
 [[35487 50331 58537 43350 46266 54246 71485 22078 35302  8800]
 [50331 58537 43350 46266  3595 71485 22078 35302  8800  8424]
 [58537 43350 46266  3595 54246 22078 35302  8800  8424  1850]
 [43350 46266  3595 54246 71485 35302  8800  8424  1850 72335]
 [46266  3595 54246 71485 22078  8800  8424  1850 72335 26442]
 [ 3595 54246 71485 22078 35302  8424  1850 72335 26442 20044]
 [54246 71485 22078 35302  8800  1850 72335 26442 20044 49601]
 [71485 22078 35302  8800  8424 72335 26442 20044 49601  3595]
 [22078 35302  8800  8424  1850 26442 20044 49601  3595 20044]
 [35302  8800  8424  1850 72335 20044 49601  3595 20044 13365]]
********************
********************
batch_x:  [20044 49601  3595 20044 13365 70902 53863 20044 79434 37341]
batch_label: 
 [[ 8800  8424  1850 72335 264

## Using with the assignment text sample (CBOW mode)


In [4]:
test_text = 'first used against early working class radicals including'
test_batcher = Batcher(test_text, limit=1, window_size=2, mode='cbow')
test_batch_generator = test_batcher.batch_generator(4)
for batch_x, batch_label in test_batch_generator:
    print('*' * 20)
    print("batch_x: \n", batch_x)
    print("batch_label: \n", batch_label)
    print('*' * 20)

print(f'text: {test_text}')
print("batch_x: \n", test_batcher.indices2words(batch_x))
print('*' * 20)
print("batch_label: \n", test_batcher.indices2words(batch_label))

********************
batch_x: 
 [[7 8 1 5]
 [8 2 5 3]
 [2 1 3 4]
 [1 5 4 6]]
batch_label: 
 [2 1 5 3]
********************
text: first used against early working class radicals including
batch_x: 
 [['first' 'used' 'early' 'working']
 ['used' 'against' 'working' 'class']
 ['against' 'early' 'class' 'radicals']
 ['early' 'working' 'radicals' 'including']]
********************
batch_label: 
 ['against' 'early' 'working' 'class']
