# Assignment 1.2: Word2vec preprocessing (20 points)

Preprocessing is not the most exciting part of NLP, but it is still one of the most important ones. Your task is to preprocess raw text (you can use your own, or [this one](http://mattmahoney.net/dc/text8.zip)). For this task, text preprocessing mostly consists of:

1. cleaning (mostly, if your dataset is from social media or parsed from the internet)
1. tokenization
1. building the vocabulary and choosing its size. Use only high-frequency words and replace all other words with UNK, or handle them in your own manner. You can use `collections.Counter` for that.
1. assigning each token a number (numericalization). In other words, make `word2index` and `index2word` objects.
1. data structuring and batching: make a generator of X and y matrices for word2vec (explained in more detail below)
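
The steps above can be sketched on a toy string as follows. This is a minimal illustration, not a reference solution: the helper names `clean`, `tokenize`, and `build_vocab`, the regular expression, and the sample text are all assumptions made for the example.

```python
import collections
import re

UNK = '<UNK>'

def clean(text):
    # Lowercase and replace every run of non-letters with a space (an assumed cleaning rule).
    return re.sub(r'[^a-z]+', ' ', text.lower()).strip()

def tokenize(text):
    # Whitespace tokenization is enough once the text has been cleaned.
    return text.split()

def build_vocab(tokens, vocab_size):
    # Keep the vocab_size most frequent tokens and reserve one extra index for UNK.
    counts = collections.Counter(tokens)
    words = [word for word, _ in counts.most_common(vocab_size)]
    words.append(UNK)
    word2index = {word: idx for idx, word in enumerate(words)}
    index2word = {idx: word for word, idx in word2index.items()}
    return word2index, index2word

raw = 'The cat sat. The cat ran! A dog sat?'
tokens = tokenize(clean(raw))
word2index, index2word = build_vocab(tokens, vocab_size=3)
# Numericalization: out-of-vocabulary tokens map to the UNK index.
indices = [word2index.get(token, word2index[UNK]) for token in tokens]
print(tokens)
print([index2word[i] for i in indices])
```

Mapping rare tokens to a single UNK index keeps the vocabulary size fixed, which is what the later embedding matrix will rely on.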

**ATTN!:** If you use your own data, please attach a download link.

Your goal is to make a **Batcher** class that returns two numpy tensors with word indices. It should be possible to use it for word2vec training. You can implement a batcher for the Skip-Gram or CBOW architecture; the picture below can help you remember the difference.

![text](https://raw.githubusercontent.com/deepmipt/deep-nlp-seminars/651804899d05b96fc72b9474404fab330365ca09/seminar_02/pics/architecture.png)

There are several ways to do it right. Shapes could be `x_batch.shape = (batch_size, 2*window_size)`, `y_batch.shape = (batch_size,)` for CBOW, or `x_batch.shape = (batch_size,)`, `y_batch.shape = (batch_size, 2*window_size)` for Skip-Gram. You should **not** do negative sampling here.
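
To make the Skip-Gram shapes concrete, here is a small sketch under stated assumptions: `skipgram_arrays` is a hypothetical helper, the input is a stand-in list of integers rather than a real numericalized text, and positions without a full window on both sides are simply skipped.

```python
import numpy as np

def skipgram_arrays(indices, window_size):
    # Hypothetical helper: one row per position that has a full window on both sides.
    centers, contexts = [], []
    for i in range(window_size, len(indices) - window_size):
        centers.append(indices[i])
        contexts.append(indices[i - window_size:i] + indices[i + 1:i + window_size + 1])
    return np.array(centers), np.array(contexts)

indices = list(range(10))  # stand-in for a numericalized text
window_size = 2
x_batch, y_batch = skipgram_arrays(indices, window_size)
print(x_batch.shape, y_batch.shape)
```

For CBOW the same two arrays would simply swap roles: the context rows become the input and the center words become the labels.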

The batchers should be adequately parametrized: `CBOW(window_size, ...)`, `SkipGram(window_size, ...)`. You should implement only one batcher in this task, and it's up to you which one to choose.

Useful links:
1. [Word2Vec Tutorial - The Skip-Gram Model](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/)
1. [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/pdf/1301.3781.pdf)
1. [Distributed Representations of Words and Phrases and their Compositionality](http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf)

You can write the code in this notebook or in a separate file. It can be reused for the next task. Your result should demonstrate that a batch has the proper structure (right shapes) and content (words should come from one context, not be random indices). To show that, translate indices back to words and print them to get something like this:

```
text = ['first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including']

window_size = 2

# CBOW:
indices_to_words(x_batch) = \
        [['first', 'used', 'early', 'working'],
        ['used', 'against', 'working', 'class'],
        ['against', 'early', 'class', 'radicals'],
        ['early', 'working', 'radicals', 'including']]

indices_to_words(labels_batch) = ['against', 'early', 'working', 'class']
```
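
One possible way to produce the CBOW output shown above is sketched below. The names `cbow_batch` and `indices_to_words` and the toy vocabulary built directly from `text` are assumptions for illustration; a real batcher would use the vocabulary built during preprocessing.

```python
text = ['first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including']
window_size = 2

# Toy vocabulary: every word in this short text is distinct.
word2index = {word: idx for idx, word in enumerate(text)}
index2word = {idx: word for word, idx in word2index.items()}

def cbow_batch(indices, window_size):
    # For each position with a full window, the context is the input and the center is the label.
    x_batch, labels_batch = [], []
    for i in range(window_size, len(indices) - window_size):
        x_batch.append(indices[i - window_size:i] + indices[i + 1:i + window_size + 1])
        labels_batch.append(indices[i])
    return x_batch, labels_batch

def indices_to_words(batch):
    # Handles both a flat list of indices and a list of lists.
    return [indices_to_words(item) if isinstance(item, list) else index2word[item]
            for item in batch]

indices = [word2index[word] for word in text]
x_batch, labels_batch = cbow_batch(indices, window_size)
print(indices_to_words(x_batch))
print(indices_to_words(labels_batch))
```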

## Building a batcher for the Skip-Gram architecture

In [0]:
import os, math, random, torch, collections, re
import torch.nn as nn
from pprint import pprint

In [2]:
!wget http://mattmahoney.net/dc/text8.zip
!unzip text8.zip
with open('text8') as text_file:
    corpus = text_file.read().split()

--2020-02-20 07:09:07--  http://mattmahoney.net/dc/text8.zip
Resolving mattmahoney.net (mattmahoney.net)... 67.195.197.75
Connecting to mattmahoney.net (mattmahoney.net)|67.195.197.75|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 31344016 (30M) [application/zip]
Saving to: ‘text8.zip’


2020-02-20 07:09:26 (2.16 MB/s) - ‘text8.zip’ saved [31344016/31344016]

Archive:  text8.zip
  inflating: text8                   


In [39]:
pprint(' '.join(word for word in corpus[:100]))

('anarchism originated as a term of abuse first used against early working '
 'class radicals including the diggers of the english revolution and the sans '
 'culottes of the french revolution whilst the term is still used in a '
 'pejorative way to describe any act that used violent means to destroy the '
 'organization of society it has also been taken up as a positive label by '
 'self defined anarchists the word anarchism is derived from the greek without '
 'archons ruler chief king anarchism as a political philosophy is the belief '
 'that rulers are unnecessary and should be abolished although there are '
 'differing')


In [0]:
VOCABULARY_SIZE = 10000
UNK = '<UNK>'

def create_dataset(corpus, vocab_size=VOCABULARY_SIZE, unk_token=UNK):
    counter_dict = collections.Counter(corpus)
    # Keep the vocab_size most frequent words; most_common returns (word, count) pairs.
    vocab = counter_dict.most_common(vocab_size)
    words = [word for word, _ in vocab]
    words.append(unk_token)
    # Frequency of the rarest word that made it into the vocabulary.
    min_allowed_freq = vocab[-1][1]
    # Use only high-frequency words and replace all other words with unk_token.
    # Note: the comparison is strict, so words tied at the cutoff frequency also become unk_token.
    dataset = [
        word if counter_dict[word] > min_allowed_freq else unk_token
        for word in corpus
    ]

    word2idx = {word: idx for idx, word in enumerate(words)}
    idx2word = {idx: word for idx, word in enumerate(words)}
    return dataset, word2idx, idx2word

In [0]:
data, word2idx, idx2word = create_dataset(corpus=corpus)
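
A common variant of the rare-word replacement step uses vocabulary membership instead of a frequency threshold. The sketch below is only an illustration on a toy corpus; `replace_rare` and the toy data are hypothetical names, not part of the cells in this notebook.

```python
import collections

def replace_rare(corpus, vocab_size, unk_token='<UNK>'):
    # Keep exactly the vocab_size most frequent words; everything else becomes unk_token.
    counts = collections.Counter(corpus)
    kept = {word for word, _ in counts.most_common(vocab_size)}
    return [word if word in kept else unk_token for word in corpus]

toy = 'a b a c a b d'.split()
replaced = replace_rare(toy, vocab_size=2)
print(replaced)
```

With membership-based replacement, every word that has an index also appears in the dataset, at the cost of an arbitrary choice among words tied at the cutoff frequency.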

In [0]:
import numpy as np

class Batcher:
    """Infinite iterator over skip-gram batches.

    Each batch is a pair (centers, contexts) of integer arrays with shapes
    (batch_size,) and (batch_size, 2 * window_size).
    """

    def __init__(self, dataset, window_size, batch_size, word2idx, idx2word):
        if len(dataset) < 2 * window_size + 1:
            raise ValueError('dataset is too short for the given window_size')
        self.dataset = dataset
        self.window_size = window_size
        self.batch_size = batch_size
        self.word2idx = word2idx
        self.idx2word = idx2word
        # Position of the next center word, kept on the instance instead of in a global.
        self.index = 0

    def __iter__(self):
        return self

    def __next__(self):
        n = len(self.dataset)
        w = self.window_size
        centers, contexts = [], []
        # Keep going until the batch is full, so every batch has exactly batch_size rows.
        while len(centers) < self.batch_size:
            i = self.index
            # Advance the position, wrapping around so iteration can span several epochs.
            self.index = (self.index + 1) % n
            # Skip positions that lack w words on the left or on the right.
            if i - w < 0 or i + w > n - 1:
                continue
            centers.append(self.word2idx[self.dataset[i]])
            window = self.dataset[i - w:i] + self.dataset[i + 1:i + w + 1]
            contexts.append([self.word2idx[word] for word in window])
        return np.array(centers), np.array(contexts)

In [60]:
# time to test
batcher = Batcher(dataset=data, window_size=3, batch_size=6, word2idx=word2idx, idx2word=idx2word)
batches = iter(batcher)
print('default corpus:', data[:14])
batch, labels = next(batches)
print("let's take a look at batches with window_size=", batcher.window_size)
print('batch', [idx2word[i] for i in batch], '\n')
print('labels:', [[idx2word[i] for i in x] for x in labels])

default corpus: ['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against', 'early', 'working', 'class', '<UNK>']
let's take a look at batches with window_size= 3
batch ['a', 'term', 'of', 'abuse', 'first', 'used'] 

labels: [['anarchism', 'originated', 'as', 'term', 'of', 'abuse'], ['originated', 'as', 'a', 'of', 'abuse', 'first'], ['as', 'a', 'term', 'abuse', 'first', 'used'], ['a', 'term', 'of', 'first', 'used', 'against'], ['term', 'of', 'abuse', 'used', 'against', 'early'], ['of', 'abuse', 'first', 'against', 'early', 'working']]


In [35]:
# the 100 most frequent words in the vocabulary, in index order
pprint(" ".join(idx2word[i] for i in range(100)))

('the of and one in a to zero nine two is as eight for s five three was by '
 'that four six seven with on are it from or his an be this which at he also '
 'not have were has but other their its first they some had all more most can '
 'been such many who new used there after when into american time these only '
 'see may than world i b would d no however between about over years states '
 'people war during united known if called use th system often state so '
 'history will up while where')
