# Assignment 1.2: Word2vec preprocessing (20 points)

Preprocessing is not the most exciting part of NLP, but it is one of the most important. Your task is to preprocess raw text (you can use your own, or [this one](http://mattmahoney.net/dc/text8.zip)). For this task, text preprocessing mostly consists of:

1. cleaning (mostly needed if your dataset comes from social media or was parsed from the internet)
1. tokenization
1. building the vocabulary and choosing its size. Use only high-frequency words and replace all other words with UNK, or handle them in your own manner. You can use `collections.Counter` for that.
1. assigning each token a number (numericalization). In other words, make `word2index` and `index2word` objects.
1. data structuring and batching: make a generator of X and y matrices for word2vec (explained in more detail below)
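
Steps 3 and 4 can be sketched roughly as follows. This is only an illustration: the helper names `build_vocab` and `numericalize`, the `<UNK>` spelling, and the choice to reserve index 0 for it are assumptions, not requirements of the assignment.

```python
from collections import Counter

UNK = "<UNK>"  # assumed placeholder for out-of-vocabulary words


def build_vocab(tokens, vocabulary_size):
    """Keep the (vocabulary_size - 1) most frequent words; index 0 is UNK."""
    counts = Counter(tokens)
    most_common = [word for word, _ in counts.most_common(vocabulary_size - 1)]
    index2word = [UNK] + most_common
    word2index = {word: idx for idx, word in enumerate(index2word)}
    return word2index, index2word


def numericalize(tokens, word2index):
    """Map each token to its index; unseen or rare tokens map to UNK."""
    unk_idx = word2index[UNK]
    return [word2index.get(tok, unk_idx) for tok in tokens]


tokens = "the cat the cat the mat".split()
word2index, index2word = build_vocab(tokens, vocabulary_size=3)
ids = numericalize(tokens, word2index)
restored = [index2word[i] for i in ids]  # 'mat' comes back as '<UNK>'
```

Counting UNK as part of `vocabulary_size` keeps the embedding matrix size equal to the requested vocabulary size; counting it separately would work just as well.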

**ATTN!:** If you use your own data, please attach a download link.

Your goal is to make a **Batcher** class which returns two numpy tensors with word indices. It should be possible to use it for word2vec training. You can implement the batcher for the Skip-Gram or CBOW architecture; the picture below may help you remember the difference.

![text](https://raw.githubusercontent.com/deepmipt/deep-nlp-seminars/651804899d05b96fc72b9474404fab330365ca09/seminar_02/pics/architecture.png)

There are several ways to do it right. Shapes could be `x_batch.shape = (batch_size, 2*window_size)`, `y_batch.shape = (batch_size,)` for CBOW or `(batch_size,)`, `(batch_size, 2*window_size)` for Skip-Gram. You should **not** do negative sampling here.

The batchers should be adequately parametrized: CBOW(window_size, ...), SkipGram(window_size, ...). You should implement only one batcher in this task, and it's up to you which one to choose.
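
The shapes described above can be produced with NumPy integer indexing. The sketch below is one possible approach, not the required one; the function name `context_windows` is hypothetical, and it assumes that positions without a full window on both sides are simply skipped.

```python
import numpy as np


def context_windows(token_ids, window_size):
    """Return (centers, contexts) for every position with a full window.

    centers.shape  == (n,)
    contexts.shape == (n, 2 * window_size)
    """
    ids = np.asarray(token_ids)
    n = len(ids) - 2 * window_size
    if n <= 0:
        # Text is too short to contain a single full window.
        return ids[:0], np.empty((0, 2 * window_size), dtype=ids.dtype)
    positions = np.arange(window_size, window_size + n)
    # Offsets -window_size..-1 and 1..window_size (the center is skipped).
    offsets = np.concatenate(
        [np.arange(-window_size, 0), np.arange(1, window_size + 1)]
    )
    contexts = ids[positions[:, None] + offsets[None, :]]
    centers = ids[positions]
    return centers, contexts


token_ids = list(range(8))  # stand-in for 8 numericalized words
centers, contexts = context_windows(token_ids, window_size=2)
# CBOW:      x_batch = contexts, y_batch = centers
# Skip-Gram: x_batch = centers,  y_batch = contexts
```

The same two arrays serve both architectures; only the roles of input and target are swapped.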

Useful links:
1. [Word2Vec Tutorial - The Skip-Gram Model](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/)
1. [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/pdf/1301.3781.pdf)
1. [Distributed Representations of Words and Phrases and their Compositionality](http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf)

You can write the code in this notebook or in a separate file. It can be reused for the next task. Your result should demonstrate that each batch has the proper structure (right shapes) and content (words should come from one context, not be random indices). To show that, translate the indices back to words and print something like this:

```
text = ['first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including']

window_size = 2

# CBOW:
indices_to_words(x_batch) = \
        [['first', 'used', 'early', 'working'],
        ['used', 'against', 'working', 'class'],
        ['against', 'early', 'class', 'radicals'],
        ['early', 'working', 'radicals', 'including']]

indices_to_words(labels_batch) = ['against', 'early', 'working', 'class']
```

# Solution

## Imports and constants

In [2]:
import numpy as np
from collections import Counter
from batcher import SkipGramBatcherBase, read_corpus


TEXT_PATH = 'data/text8'

## Skip-Gram Batcher implementation

I implement only the `__next__` method here, because the next task requires a different one.

In [4]:
class SkipGramBatcher(SkipGramBatcherBase):
    def __next__(self):
        """ Return next batch of words with order specified in self._permuted_indxs
            Return:
                centrals (list): batch with central words. The length is (batch_size)
                neighbours (list of lists): batch with neighbour words. The size is (batch_size, 2*window_size)
        """
        centrals, neighbours = self._get_next_batch()
        if centrals is None or neighbours is None:
            raise StopIteration
        
        return centrals, neighbours
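
The `batcher` module imported above is not shown in this notebook. Purely as an illustration, here is one way a base class with the same interface could look. Everything in it is an assumption inferred from how the class is called in this notebook, not the actual `SkipGramBatcherBase`: the `...Sketch` names, the `seed` argument, dropping the incomplete last batch, and counting `<UNK>` toward `vocabulary_size`.

```python
import numpy as np
from collections import Counter


class SkipGramBatcherBaseSketch:
    """Hypothetical stand-in for SkipGramBatcherBase; not the author's code."""

    UNK = "<UNK>"

    def __init__(self, text, window_size, batch_size,
                 vocabulary_size=None, seed=0):
        self._window_size = window_size
        self._batch_size = batch_size
        # Assumption: vocabulary_size includes the UNK slot.
        n_words = None if vocabulary_size is None else vocabulary_size - 1
        frequent = [w for w, _ in Counter(text).most_common(n_words)]
        self._index2word = [self.UNK] + frequent
        self._word2index = {w: i for i, w in enumerate(self._index2word)}
        self._tokens = np.array([self._word2index.get(w, 0) for w in text])
        self._corpus = [self._index2word[t] for t in self._tokens]
        self._rng = np.random.default_rng(seed)
        self._permuted_indxs = np.empty(0, dtype=int)
        self._cursor = 0

    def __iter__(self):
        # Shuffle every position that has a full window on both sides.
        w = self._window_size
        valid = np.arange(w, len(self._tokens) - w)
        self._permuted_indxs = self._rng.permutation(valid)
        self._cursor = 0
        return self

    def _get_next_batch(self):
        start, stop = self._cursor, self._cursor + self._batch_size
        idxs = self._permuted_indxs[start:stop]
        if len(idxs) < self._batch_size:  # assumption: drop partial batch
            return None, None
        self._cursor = stop
        w = self._window_size
        offsets = np.concatenate([np.arange(-w, 0), np.arange(1, w + 1)])
        return self._tokens[idxs], self._tokens[idxs[:, None] + offsets]

    def tokens_to_words(self, tokens):
        tokens = np.asarray(tokens)
        if tokens.ndim == 1:
            return [self._index2word[t] for t in tokens]
        return [[self._index2word[t] for t in row] for row in tokens]


class SkipGramBatcherSketch(SkipGramBatcherBaseSketch):
    def __next__(self):
        centrals, neighbours = self._get_next_batch()
        if centrals is None or neighbours is None:
            raise StopIteration
        return centrals, neighbours


demo_text = ['first', 'used', 'against', 'early',
             'working', 'class', 'radicals', 'including']
demo_batcher = SkipGramBatcherSketch(demo_text, window_size=2, batch_size=2)
demo_batches = list(demo_batcher)  # two batches of shape (2,) and (2, 4)
```

Shuffling center positions rather than the tokens themselves keeps every context window contiguous in the original text.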

## Testing

In [6]:
def test_batcher(text, window_size, batch_size, vocabulary_size=None):
    print('Testing for text: "{}"'.format(' '.join(text)))
    b = SkipGramBatcher(text, window_size, batch_size, vocabulary_size)
    print('Text after cleaning: "{}"'.format(' '.join(b._corpus)))
    for i, (centrals, neighbours) in enumerate(b):
        if i >= 10:  # test only the first 10 batches
            return
        print('Batch number {}:'.format(i))
        centrals = b.tokens_to_words(centrals)
        neighbours = b.tokens_to_words(neighbours)
        for j in range(len(centrals)):
            print('\tCentral word is [{}] and neighbours:'.format(centrals[j]))
            print('\t\t' + '\n\t\t'.join(neighbours[j]))

### Simple test

In [7]:
simple_text = ['first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including']

In [8]:
test_batcher(simple_text, 2, 2)

Testing for text: "first used against early working class radicals including"
Text after cleaning: "first used against early working class radicals including"
Batch number 0:
	Central word is [early] and neighbours:
		used
		against
		working
		class
	Central word is [against] and neighbours:
		first
		used
		early
		working
Batch number 1:
	Central word is [class] and neighbours:
		early
		working
		radicals
		including
	Central word is [working] and neighbours:
		against
		early
		class
		radicals


In [9]:
test_batcher(simple_text, 1, 3)

Testing for text: "first used against early working class radicals including"
Text after cleaning: "first used against early working class radicals including"
Batch number 0:
	Central word is [used] and neighbours:
		first
		against
	Central word is [against] and neighbours:
		used
		early
	Central word is [class] and neighbours:
		working
		radicals
Batch number 1:
	Central word is [radicals] and neighbours:
		class
		including
	Central word is [early] and neighbours:
		against
		working
	Central word is [working] and neighbours:
		early
		class


In [10]:
test_batcher(simple_text, 3, 1)

Testing for text: "first used against early working class radicals including"
Text after cleaning: "first used against early working class radicals including"
Batch number 0:
	Central word is [working] and neighbours:
		used
		against
		early
		class
		radicals
		including
Batch number 1:
	Central word is [early] and neighbours:
		first
		used
		against
		working
		class
		radicals


### Test with small vocabulary

In [11]:
text = ['Can', 'you', 'can', 'a', 'can', 'as', 'a', 'Canner', 'can', 'can', 'a', 'can']

In [12]:
test_batcher(text, 3, 1, vocabulary_size=2)

Testing for text: "Can you can a can as a Canner can can a can"
Text after cleaning: "<UNK> <UNK> can <UNK> can <UNK> <UNK> <UNK> can can <UNK> can"
Batch number 0:
	Central word is [can] and neighbours:
		<UNK>
		<UNK>
		<UNK>
		can
		<UNK>
		can
Batch number 1:
	Central word is [<UNK>] and neighbours:
		can
		<UNK>
		<UNK>
		can
		can
		<UNK>
Batch number 2:
	Central word is [<UNK>] and neighbours:
		<UNK>
		can
		<UNK>
		<UNK>
		can
		can
Batch number 3:
	Central word is [<UNK>] and neighbours:
		can
		<UNK>
		can
		<UNK>
		<UNK>
		can
Batch number 4:
	Central word is [<UNK>] and neighbours:
		<UNK>
		<UNK>
		can
		can
		<UNK>
		<UNK>
Batch number 5:
	Central word is [can] and neighbours:
		<UNK>
		can
		<UNK>
		<UNK>
		<UNK>
		<UNK>


In [13]:
test_batcher(text, 2, 2, vocabulary_size=1)

Testing for text: "Can you can a can as a Canner can can a can"
Text after cleaning: "<UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK>"
Batch number 0:
	Central word is [<UNK>] and neighbours:
		<UNK>
		<UNK>
		<UNK>
		<UNK>
	Central word is [<UNK>] and neighbours:
		<UNK>
		<UNK>
		<UNK>
		<UNK>
Batch number 1:
	Central word is [<UNK>] and neighbours:
		<UNK>
		<UNK>
		<UNK>
		<UNK>
	Central word is [<UNK>] and neighbours:
		<UNK>
		<UNK>
		<UNK>
		<UNK>
Batch number 2:
	Central word is [<UNK>] and neighbours:
		<UNK>
		<UNK>
		<UNK>
		<UNK>
	Central word is [<UNK>] and neighbours:
		<UNK>
		<UNK>
		<UNK>
		<UNK>
Batch number 3:
	Central word is [<UNK>] and neighbours:
		<UNK>
		<UNK>
		<UNK>
		<UNK>
	Central word is [<UNK>] and neighbours:
		<UNK>
		<UNK>
		<UNK>
		<UNK>


## Preprocessing real data

In [14]:
corpus = read_corpus(TEXT_PATH)
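
`read_corpus` comes from the `batcher` module, which is not shown here. A minimal sketch of what such a reader might do, assuming the file is plain text with whitespace-separated tokens, is given below. The name `read_corpus_sketch` and the `max_tokens` parameter are hypothetical, and the demo runs on a tiny temporary file instead of the real `data/text8`.

```python
import os
import tempfile


def read_corpus_sketch(path, max_tokens=None):
    """Read a whitespace-tokenized text file into a list of tokens."""
    with open(path, encoding="utf-8") as f:
        tokens = f.read().split()
    # Optionally truncate, which is handy for quick experiments.
    return tokens if max_tokens is None else tokens[:max_tokens]


# Demo on a small temporary file (not the real corpus).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "tiny_corpus.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("anarchism originated as a term of abuse\n")
    demo_tokens = read_corpus_sketch(demo_path)
    first_three = read_corpus_sketch(demo_path, max_tokens=3)
```

Reading the whole file at once is simple but memory-hungry; for much larger corpora a streaming reader would be a safer choice.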

In [15]:
test_batcher(corpus, 5, 10)


Batch number 0:
	Central word is [after] and neighbours:
		dandy
		s
		career
		in
		venice
		world
		war
		i
		the
		diva
	Central word is [spain] and neighbours:
		of
		europe
		arab
		colonization
		of
		soviet
		colonization
		of
		eastern
		poland
	Central word is [folk] and neighbours:
		ly
		and
		together
		they
		collected
		music
		from
		the
		region
		this
	Central word is [formula] and neighbours:
		number
		rydberg
		formula
		the
		rydberg
		describes
		the
		transitions
		or
		quantum
	Central word is [college] and neighbours:
		so
		called
		smokers
		at
		pembroke
		include
		sketches
		and
		performances
		by
	Central word is [the] and neighbours:
		the
		blessed
		virgin
		mary
		cradling
		lifeless
		body
		of
		jesus
		mourning
	Central word is [the] and neighbours:
		several
		hits
		the
		name
		of
		game
		and
		take
		a
		chance
	Central word is [the] and neighbours:
		festival
		which
		takes
		place
		during
		last
		week
		of
		march
		the
	Central word is 