# Word2vec preprocessing

Preprocessing is not the most exciting part of NLP, but it is one of the most important. Your task is to preprocess raw text (you can use your own, or [this one](http://mattmahoney.net/dc/textdata)). For this task, text preprocessing mostly consists of:

1. cleaning (mostly needed if your dataset comes from social media or was parsed from the internet)
1. tokenization
1. building the vocabulary and choosing its size
1. assigning each token a number (numericalization)
1. data structuring and batching
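
As a rough illustration of steps 1 to 4, here is a minimal sketch on a toy sentence. The function name `preprocess`, the `<unk>` token and the regex-based cleaning rule are assumptions made for this example, not part of the assignment; batching (step 5) is left out.

```python
import collections
import re

def preprocess(text, vocab_size, unk_token="<unk>"):
    """Toy version of steps 1-4: clean, tokenize, build a vocabulary, numericalize."""
    # 1. Cleaning: lowercase and replace every run of non-letters with a space
    #    (an assumed rule; real data may need something more careful).
    cleaned = re.sub(r"[^a-z]+", " ", text.lower())
    # 2. Tokenization: split on whitespace.
    tokens = cleaned.split()
    # 3. Vocabulary: keep the vocab_size most frequent tokens,
    #    index 0 is reserved for unknown (rare) words.
    counts = collections.Counter(tokens)
    vocab = [unk_token] + [w for w, _ in counts.most_common(vocab_size)]
    word_index = {w: i for i, w in enumerate(vocab)}
    # 4. Numericalization: map each token to its index, rare tokens to 0.
    ids = [word_index.get(t, 0) for t in tokens]
    return ids, vocab

ids, vocab = preprocess("The cat sat. The cat ran!", vocab_size=2)
print(ids)
print([vocab[i] for i in ids])
```

Reserving index 0 for rare words keeps the mapping total: every token gets some index, so later stages never have to deal with missing keys.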

Your goal is to write a SkipGramBatcher class that returns two numpy tensors with word indices. You can implement a batcher for either the Skip-Gram or the CBOW architecture; the picture below can be helpful for remembering the difference.

![text](https://raw.githubusercontent.com/deepmipt/deep-nlp-seminars/651804899d05b96fc72b9474404fab330365ca09/seminar_02/pics/architecture.png)

There are several ways to do it right. Shapes could be `(batch_size, 2*window_size)`, `(batch_size,)` for CBOW or `(batch_size,)`, `(batch_size,)` for Skip-Gram. You should **not** do negative sampling here.
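
To make these shapes concrete, here is a small sketch that builds both layouts from a toy index sequence. The helper names `cbow_examples` and `skipgram_examples` are hypothetical, and for simplicity the sketch enumerates every valid center position instead of sampling a batch.

```python
import numpy as np

def cbow_examples(ids, window_size):
    """Return (contexts, labels) with shapes (n, 2*window_size) and (n,)."""
    ids = np.asarray(ids)
    # Center positions that have a full window on both sides.
    centers = np.arange(window_size, len(ids) - window_size)
    # Relative offsets of the context words, e.g. [-2, -1, 1, 2].
    offsets = np.array([o for o in range(-window_size, window_size + 1) if o != 0])
    contexts = ids[centers[:, None] + offsets[None, :]]
    labels = ids[centers]
    return contexts, labels

def skipgram_examples(ids, window_size):
    """Return (centers, targets), both of shape (n * 2*window_size,)."""
    contexts, labels = cbow_examples(ids, window_size)
    # Each center word is paired with every word of its context.
    centers = np.repeat(labels, contexts.shape[1])
    targets = contexts.reshape(-1)
    return centers, targets

ids = [10, 11, 12, 13, 14, 15]
contexts, labels = cbow_examples(ids, window_size=2)
centers, targets = skipgram_examples(ids, window_size=2)
print(contexts.shape, labels.shape)   # CBOW: context matrix and center labels
print(centers.shape, targets.shape)   # Skip-Gram: flat (center, context) pairs
```

The two layouts carry the same information: CBOW groups the whole context in one row, while Skip-Gram flattens it into one pair per context word.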

The batchers should be adequately parametrized: CBOW(batch_size, window_size, ...), SkipGram(num_skips, skip_window). You should implement only one batcher in this task; it's up to you which one to choose.
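
One possible reading of the Skip-Gram parametrization is sketched below. It assumes that `skip_window` is the number of words considered on each side of the center word and that `num_skips` is how many context words are sampled per center word; the `batch_size` and `seed` arguments are additions made for this example. Treat it as a sketch under those assumptions, not as the reference solution.

```python
import numpy as np

class SkipGramBatcher:
    """Samples (center, context) index pairs from a sequence of word indices."""

    def __init__(self, ids, num_skips, skip_window, batch_size, seed=0):
        # Assumed constraints: whole groups of num_skips pairs per center word,
        # and no more samples than there are words in the window.
        assert batch_size % num_skips == 0
        assert num_skips <= 2 * skip_window
        self.ids = np.asarray(ids)
        self.num_skips = num_skips
        self.skip_window = skip_window
        self.batch_size = batch_size
        self.rng = np.random.default_rng(seed)
        self.offsets = np.array(
            [o for o in range(-skip_window, skip_window + 1) if o != 0])

    def make(self):
        n_centers = self.batch_size // self.num_skips
        # Only positions with a full window on both sides are sampled.
        positions = self.rng.integers(
            self.skip_window, len(self.ids) - self.skip_window, size=n_centers)
        centers = np.repeat(self.ids[positions], self.num_skips)
        targets = np.empty(self.batch_size, dtype=self.ids.dtype)
        for k, pos in enumerate(positions):
            # Pick num_skips distinct context offsets for this center word.
            chosen = self.rng.choice(self.offsets, size=self.num_skips, replace=False)
            targets[k * self.num_skips:(k + 1) * self.num_skips] = self.ids[pos + chosen]
        return centers, targets

# Toy data where each index equals its position, so offsets are easy to check.
toy_ids = np.arange(100)
batcher = SkipGramBatcher(toy_ids, num_skips=2, skip_window=2, batch_size=8)
centers, targets = batcher.make()
print(centers.shape, targets.shape)
```

Both returned arrays have shape `(batch_size,)`, and no negative sampling is performed.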

Useful links:
1. [Word2Vec Tutorial - The Skip-Gram Model](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/)
1. [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/pdf/1301.3781.pdf)
1. [Distributed Representations of Words and Phrases and their Compositionality](http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf)

You can write the code in this notebook or in a separate file. It will be reused for the next task. The result of your work should demonstrate that your batches have the proper structure (right shapes) and content (words should come from one context, not be random indices). To show that, translate indices back to words and print them to get something like this:

```
bag_window = 2

batch = [['first', 'used', 'early', 'working'],
        ['used', 'against', 'working', 'class'],
        ['against', 'early', 'class', 'radicals'],
        ['early', 'working', 'radicals', 'including']]

labels = ['against', 'early', 'working', 'class']
```
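
Translating indices back to words could look roughly like this. The helper name `decode_batch` and the tiny `index_word` mapping are made up for the illustration; in the notebook the real mapping comes from the vocabulary-building step.

```python
import numpy as np

def decode_batch(batch, labels, index_word):
    """Map a (batch_size, context) index array and a (batch_size,) label array to words."""
    words_batch = [[index_word[int(i)] for i in row] for row in batch]
    words_labels = [index_word[int(i)] for i in labels]
    return words_batch, words_labels

# Hypothetical toy mapping, just enough for one example row.
index_word = {0: "<unk>", 1: "first", 2: "used", 3: "against", 4: "early", 5: "working"}
batch = np.array([[1, 2, 4, 5]])   # context of the center word, window of 2
labels = np.array([3])             # the center word itself

words_batch, words_labels = decode_batch(batch, labels, index_word)
print(words_batch)
print(words_labels)
```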

If you struggle with something, ask your neighbour. If it is not obvious to you, someone else is probably looking for the answer too. And conversely, if you see that you can help someone, just do it! Good luck!

In [1]:
# TASK 2

import collections
import math
import os
import random
import zipfile
import torch
import torch.nn as nn
from torch.autograd import Variable
import torch.optim 
import torch.nn.functional
import time
import requests
from os.path import isfile
import numpy as np
from six.moves import urllib
from six.moves import xrange
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import torch.nn.functional as func

USE_GPU = True

dtype = torch.float32 # we will be using float throughout this tutorial

if USE_GPU and torch.cuda.is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')
print('using device:', device)

using device: cpu


In [None]:
if not os.path.isfile('text8'):  # download the corpus if it is not here yet
    with open('data.zip', 'wb') as f:
        r = requests.get('http://mattmahoney.net/dc/text8.zip')
        f.write(r.content)
    !unzip 'data.zip'

def read_data(filename):
    """Read the first file of a zip archive and return its words as a list of str."""
    with zipfile.ZipFile(filename) as f:
        data = f.read(f.namelist()[0]).split()
        data = [x.decode() for x in data]
    return data

# Read the unzipped corpus and split it into words
with open('text8') as f:
    words = f.read().split()

In [None]:
voc_size = 15000
UNK_TOKEN = 'TOKEN'  # placeholder for rare (out-of-vocabulary) words
frequency = collections.Counter(words)  # counts of all words
freq = frequency.most_common(voc_size)  # the voc_size most frequent words with their counts
low_bound = freq[-1][1]  # smallest count that still makes it into the vocabulary
vocab = [x[0] for x in freq]  # list of vocabulary words
vocab = [UNK_TOKEN] + vocab  # index 0 is reserved for UNK_TOKEN
word_index = {w: idx for (idx, w) in enumerate(vocab)}  # word -> index
index_word = {idx: w for (idx, w) in enumerate(vocab)}  # index -> word
# Indices of `words`; anything outside the vocabulary maps to 0 (UNK_TOKEN).
# Looking words up directly (instead of comparing counts with low_bound) keeps
# the vocabulary words whose count equals low_bound.
data = [word_index.get(word, 0) for word in words]

In [None]:
class Batcher():
  def __init__(self, data, window_size, batch_size=20):
    self.window_size = window_size
    self.batch_size = batch_size
    self.data = data
    self.span = 2 * window_size + 1
  def make(self):
    # Sample center positions that have a full window on both sides, so that
    # context indices never run past the end or wrap around to negative indices.
    label_ind = np.random.randint(self.window_size,
                                  len(self.data) - self.window_size,
                                  size=self.batch_size)
    offsets = [i for i in range(-self.window_size, self.window_size + 1) if i != 0]
    label = []
    batch = []
    for ind in label_ind:  # build the context of each center word
      label.append(self.data[ind])
      batch.append([self.data[ind + i] for i in offsets])
    # shapes: (batch_size, 2 * window_size) and (batch_size,)
    return np.array(batch), np.array(label)

In [None]:
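def one_hot(batch, voc_size, window_size, batch_size):
  # Builds a bag-of-words (multi-hot) matrix of shape (batch_size, voc_size):
  # row i counts how many times each vocabulary index occurs in context i.
  # voc_size must cover the largest index, i.e. len(vocab), because index 0 is UNK_TOKEN.
  # window_size is unused and kept only so that existing calls keep working.
  result = torch.zeros([batch_size, voc_size])
  for i, cont in enumerate(batch):
    for j in cont:
      result[i, j] = result[i, j] + 1  # add one for every context word
  return result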