# Assignment 1.2: Word2vec preprocessing (20 points)

Preprocessing is not the most exciting part of NLP, but it is still one of the most important. Your task is to preprocess raw text (you can use your own, or [this one](http://mattmahoney.net/dc/text8.zip)). For this task, text preprocessing mostly consists of:

1. cleaning (mostly needed if your dataset comes from social media or was scraped from the internet)
1. tokenization
1. building the vocabulary and choosing its size. Use only high-frequency words; change all other words to UNK or handle them in your own manner. You can use `collections.Counter` for that.
1. assigning each token a number (numericalization). In other words, make word2index and index2word objects.
1. data structuring and batching: make a generator of X and y matrices for word2vec (explained in more detail below)
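
Steps 3 and 4 can be sketched as follows. This is only a minimal illustration, not the required solution: the helper names `build_vocab` and `numericalize`, the `max_size` parameter, and the choice to put `UNK` at index 0 are all assumptions made for this example.

```python
from collections import Counter

UNK = "UNK"  # assumed placeholder for out-of-vocabulary words


def build_vocab(tokens, max_size):
    """Keep the (max_size - 1) most frequent words plus UNK (hypothetical helper)."""
    counts = Counter(tokens)
    index2word = [UNK] + [word for word, _ in counts.most_common(max_size - 1)]
    word2index = {word: index for index, word in enumerate(index2word)}
    return word2index, index2word


def numericalize(tokens, word2index):
    """Map every token to its index; tokens outside the vocabulary map to UNK."""
    unk_index = word2index[UNK]
    return [word2index.get(token, unk_index) for token in tokens]


tokens = "the cat sat on the mat the cat".split()
word2index, index2word = build_vocab(tokens, max_size=3)
indices = numericalize(tokens, word2index)

print(index2word)  # ['UNK', 'the', 'cat']
print(indices)     # [1, 2, 0, 0, 1, 0, 1, 2]
```

Fixing the vocabulary size directly (rather than a minimum frequency) is one possible design; a frequency threshold works just as well for this assignment.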

**ATTN!:** If you use your own data, please attach a download link.

Your goal is to make a **Batcher** class that returns two numpy tensors with word indices. It should be possible to use it for word2vec training. You can implement a batcher for either the Skip-Gram or the CBOW architecture; the picture below can help you remember the difference.

![text](https://raw.githubusercontent.com/deepmipt/deep-nlp-seminars/651804899d05b96fc72b9474404fab330365ca09/seminar_02/pics/architecture.png)

There are several ways to do it right. Shapes could be `x_batch.shape = (batch_size, 2*window_size)`, `y_batch.shape = (batch_size,)` for CBOW or `(batch_size,)`, `(batch_size, 2*window_size)` for Skip-Gram. You should **not** do negative sampling here.

The batchers should be adequately parametrized: CBOW(window_size, ...), SkipGram(window_size, ...). You should implement only one batcher in this task, and it's up to you which one to choose.
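
To make the shapes concrete, here is a rough sketch that extracts full context windows from an already numericalized text and arranges them for either architecture. The function name `context_windows` is hypothetical, and skipping positions without a full window is an assumption of this sketch, not a requirement of the task.

```python
import numpy as np


def context_windows(indices, window_size):
    """Return (centers, contexts) for every position that has a full window.

    centers has shape (n,), contexts has shape (n, 2 * window_size).
    """
    indices = np.asarray(indices)
    centers, contexts = [], []
    for i in range(window_size, len(indices) - window_size):
        left = indices[i - window_size:i]
        right = indices[i + 1:i + window_size + 1]
        centers.append(indices[i])
        contexts.append(np.concatenate([left, right]))
    return np.array(centers), np.array(contexts)


indices = np.arange(8)  # stand-in for a numericalized text of 8 tokens
centers, contexts = context_windows(indices, window_size=2)

# CBOW: predict the center word from its context.
x_cbow, y_cbow = contexts, centers
# Skip-Gram: predict the context words from the center word.
x_sg, y_sg = centers, contexts

print(x_cbow.shape, y_cbow.shape)  # (4, 4) (4,)
print(x_sg.shape, y_sg.shape)      # (4,) (4, 4)
```

A real batcher would additionally slice these arrays into chunks of `batch_size` rows and, optionally, shuffle them.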

Useful links:
1. [Word2Vec Tutorial - The Skip-Gram Model](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/)
1. [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/pdf/1301.3781.pdf)
1. [Distributed Representations of Words and Phrases and their Compositionality](http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf)

You can write the code in this notebook or in a separate file. It can be reused for the next task. Your result should demonstrate that the batch has a proper structure (right shapes) and content (words should come from one context, not be random indices). To show that, translate the indices back to words and print something like this:

```
text = ['first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including']

window_size = 2

# CBOW:
indices_to_words(x_batch) = \
        [['first', 'used', 'early', 'working'],
        ['used', 'against', 'working', 'class'],
        ['against', 'early', 'class', 'radicals'],
        ['early', 'working', 'radicals', 'including']]

indices_to_words(labels_batch) = ['against', 'early', 'working', 'class']
```
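
Besides printing, the "words come from one context" property can also be checked programmatically. The sketch below assumes the batch has already been translated back to words and is given as flat (center, context) pairs; `pairs_in_context` is a hypothetical helper, and its brute-force search over word positions is only suitable for small texts.

```python
def pairs_in_context(text, centers, contexts, window_size):
    """Return True if every (center, context) word pair occurs in `text`
    at most `window_size` positions apart (hypothetical helper)."""
    positions = {}
    for i, word in enumerate(text):
        positions.setdefault(word, []).append(i)

    def close(a, b):
        # True if some occurrence of a and some occurrence of b are neighbours
        return any(
            0 < abs(i - j) <= window_size
            for i in positions.get(a, [])
            for j in positions.get(b, [])
        )

    return all(close(c, x) for c, x in zip(centers, contexts))


text = ['first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including']

ok = pairs_in_context(text, ['against', 'early'], ['first', 'working'], window_size=2)
bad = pairs_in_context(text, ['against'], ['including'], window_size=2)
print(ok)   # True
print(bad)  # False
```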


In [0]:
import numpy as np
import random
from collections import Counter

In [2]:
!wget http://mattmahoney.net/dc/text8.zip

!ls
!unzip text8.zip

--2020-02-26 21:05:37--  http://mattmahoney.net/dc/text8.zip
Resolving mattmahoney.net (mattmahoney.net)... 67.195.197.75
Connecting to mattmahoney.net (mattmahoney.net)|67.195.197.75|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 31344016 (30M) [application/zip]
Saving to: ‘text8.zip’


2020-02-26 21:07:31 (271 KB/s) - ‘text8.zip’ saved [31344016/31344016]

sample_data  text8.zip
Archive:  text8.zip
  inflating: text8                   


In [3]:
filename = 'text8'  # file with the raw text
text = []
with open(filename, mode='r') as file:
    line = file.readline()
    while line:
        text += line.lower().split(' ')
        line = file.readline()
        if len(text) > 100000:
            print(len(text))
            break

17005208


In [4]:
print(len(text))
train_text = text[:1000]
print(len(train_text))

17005208
1000


In [0]:
"""Note: the text8 corpus is already plain text, so we only need to handle lowercasing and word frequency."""
unknown_token = "UNK"

In [0]:
"""batcher class for the model"""
class SkipGramBatcher:
    def __init__(self, window_size=5, least_freq=3):
        self.least_freq = least_freq
        self.text = None
        self.vocab = None
        self.vocab_size = None
        self.word2index = None
        self.index2word = None
        self.window_size = window_size
        self.current_index = 0
        self.current_diff = -window_size
        self.total_size = 0
    
    def preprocess(self, text):
        """replace words with frequency < least_freq with unknown_token
        and save the text
        """
        counter = Counter(text)
        def get_token(word):
            if counter[word] < self.least_freq:
                return unknown_token
            else:
                return word.lower()
        self.text = [get_token(word) for word in text]
    
    def fit_text(self, text):
        """init text, vocab, word2ind, ind2word
        """
        self.preprocess(text)
        self.vocab = np.unique(self.text)
        self.vocab_size = self.vocab.shape[0]
        self.word2index = dict(zip(self.vocab, range(self.vocab.shape[0])))
        self.index2word = dict(zip(range(self.vocab.shape[0]), self.vocab))
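        # approximate number of (center, context) pairs per pass over the text;
        # the exact count, with truncated windows at both ends, is 2 * w * N - w * (w + 1)
        self.total_size = (len(self.text) - 3 * self.window_size) * self.window_size * 2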
        
    def most_frequent(self, num=25):
        """get most frequent words from the text"""
        counter = Counter(self.text)
        return counter.most_common(num)
        
    def least_frequent(self, num=25):
        """get least frequent words from the text"""
        counter = Counter(self.text)
        return counter.most_common()[:-num - 1:-1]
        
    def indices_to_words(self, x_batch):
        """return array of words out of array of indices"""
        return np.array([self.index2word[index] for index in x_batch])
    
    def words_to_indices(self, words):
        """return array of indices out of array of words"""
        return np.array([self.word2index[word] for word in words])
    
    def _get_next_index_and_diff(self, current_index, current_diff):
        if (current_diff == self.window_size):
            current_diff = -self.window_size
            current_index += 1
            current_index %= len(self.text)
        else:
            if current_diff == -1:
                current_diff = 1
            else:
                current_diff += 1
        return current_index, current_diff

    
    def get_batch(self, batch_size=100):
        """return a batch of (center, context) index pairs, reading the text sequentially"""
        x_batch = []
        labels_batch = []
        while len(x_batch) < batch_size:
            label_index_in_text = self.current_index + self.current_diff 
            if (label_index_in_text < 0 or label_index_in_text >= len(self.text)):
                index, diff = self._get_next_index_and_diff(self.current_index, self.current_diff)
                self.current_index = index
                self.current_diff = diff
                continue
                
            word = self.text[self.current_index]
            word_index = self.word2index[word]
            label = self.text[self.current_index + self.current_diff]
            label_index = self.word2index[label]
            
            x_batch.append(word_index)
            labels_batch.append(label_index)
            
            index, diff = self._get_next_index_and_diff(self.current_index, self.current_diff)
            self.current_index = index
            self.current_diff = diff
            
        assert len(x_batch) == batch_size
        assert len(labels_batch) == batch_size
        
        x_batch = np.array(x_batch)
        labels_batch = np.array(labels_batch)
        
        permut = np.random.permutation(batch_size)
        x_batch = x_batch[permut]
        labels_batch = labels_batch[permut]
        return x_batch, labels_batch
      
    def get_random_batch(self, batch_size=100):
        """return a batch of (center, context) index pairs sampled at random positions"""
        # sample center positions away from the text edges so every offset stays in range
        indices = np.random.choice(np.arange(self.window_size + 100, len(self.text) - self.window_size - 100), batch_size, replace=False)
        x_batch = [self.word2index[self.text[i]] for i in indices]
        
        # sample non-zero offsets from [-window_size, window_size];
        # np.random.randint excludes the upper bound and could also return 0
        offsets = np.concatenate([np.arange(-self.window_size, 0), np.arange(1, self.window_size + 1)])
        diffs = np.random.choice(offsets, size=batch_size)
        label_indices = indices + diffs
        labels_batch = [self.word2index[self.text[i]] for i in label_indices]
        
        x_batch = np.array(x_batch)
        labels_batch = np.array(labels_batch)
        return x_batch, labels_batch
        
    def batch_generator(self, batch_size=100):
        """generator for batch"""
        while True:
            x_batch, labels_batch = self.get_batch(batch_size)
            yield x_batch, labels_batch

In [0]:
"""let's visualize the process"""
skpgram_batcher = SkipGramBatcher(window_size=2, least_freq=2)
skpgram_batcher.fit_text(train_text)

In [8]:
index = random.randint(0, skpgram_batcher.vocab_size - 1)
word = skpgram_batcher.index2word[index]
print(skpgram_batcher.index2word[index])
print(skpgram_batcher.index2word[skpgram_batcher.word2index[word]])

three
three


In [9]:
print('VOCAB SHAPE: ', skpgram_batcher.vocab.shape)
print(skpgram_batcher.vocab[:25])
print('MOST FREQUENT WORDS: ', skpgram_batcher.most_frequent())

VOCAB SHAPE:  (152,)
['UNK' 'a' 'about' 'abuse' 'accepted' 'access' 'advocate' 'against' 'all'
 'also' 'although' 'am' 'american' 'an' 'anabaptists' 'anarchism'
 'anarchist' 'anarchists' 'anarchy' 'and' 'are' 'as' 'at' 'authoritarian'
 'be']
MOST FREQUENT WORDS:  [('UNK', 291), ('the', 58), ('of', 41), ('in', 30), ('and', 27), ('to', 18), ('as', 17), ('that', 15), ('is', 14), ('a', 12), ('anarchist', 10), ('property', 10), ('anarchism', 9), ('society', 9), ('are', 9), ('his', 9), ('it', 8), ('what', 8), ('an', 8), ('proudhon', 8), ('anarchists', 7), ('this', 7), ('he', 7), ('be', 6), ('was', 6)]


In [0]:
x_batch, labels_batch = skpgram_batcher.get_batch(batch_size=50)

In [11]:
print(x_batch.shape)
print(labels_batch.shape)
print(type(x_batch), type(labels_batch))

(50,)
(50,)
<class 'numpy.ndarray'> <class 'numpy.ndarray'>


In [12]:
print('TEXT SHAPE: ', len(skpgram_batcher.text))
print(skpgram_batcher.text[:25])

TEXT SHAPE:  1000
['UNK', 'anarchism', 'UNK', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against', 'early', 'working', 'class', 'UNK', 'including', 'the', 'diggers', 'of', 'the', 'english', 'revolution', 'and', 'the', 'UNK']


In [13]:
print(skpgram_batcher.indices_to_words(x_batch))
print(skpgram_batcher.indices_to_words(labels_batch))

['used' 'first' 'a' 'anarchism' 'first' 'class' 'against' 'working' 'of'
 'of' 'abuse' 'a' 'abuse' 'UNK' 'anarchism' 'as' 'working' 'as' 'a'
 'early' 'early' 'used' 'first' 'abuse' 'first' 'early' 'early' 'used'
 'as' 'against' 'UNK' 'UNK' 'anarchism' 'of' 'term' 'term' 'UNK' 'working'
 'as' 'term' 'working' 'UNK' 'used' 'a' 'against' 'UNK' 'term' 'against'
 'abuse' 'of']
['early' 'abuse' 'as' 'UNK' 'against' 'early' 'used' 'class' 'a' 'abuse'
 'first' 'term' 'term' 'as' 'UNK' 'anarchism' 'against' 'a' 'UNK'
 'against' 'working' 'first' 'of' 'used' 'used' 'used' 'class' 'abuse'
 'term' 'working' 'UNK' 'anarchism' 'as' 'first' 'as' 'of' 'a' 'early'
 'UNK' 'a' 'UNK' 'UNK' 'against' 'of' 'early' 'anarchism' 'abuse' 'first'
 'of' 'term']
