# Word2vec

At a high level, `Word2Vec` is an unsupervised learning algorithm that uses a shallow neural network (with one hidden layer) to learn vector representations of all the unique words/phrases in a given corpus. The advantage that word2vec offers is that it tries to preserve the semantic meaning behind those terms. For example, a document may employ the words "dog" and "canine" to mean the same thing, but never use them together in a sentence. Ideally, the word2vec algorithm would be able to learn from the context and place them close together in the vector space.


We're going to train the neural network to do the following. Given a specific word in the middle of a sentence (the input word), look at the words nearby and pick one at random. The network is going to tell us the probability of every word in our vocabulary being the "nearby" word that we chose. To be explicit, there is a **window size** parameter to the algorithm that defines what "nearby" means. A typical window size might be 5, meaning 5 words behind and 5 words ahead (10 in total).
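To make the pairing concrete, here is a minimal sketch that lists every (input word, nearby word) pair for a toy sentence. The helper name `skipgram_pairs`, the sentence and the window of 2 are made up for illustration; unlike the description above, it enumerates all nearby words instead of picking one at random.

```python
def skipgram_pairs(tokens, window):
    """Yield (input word, nearby word) pairs for every position in `tokens`."""
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                yield center, tokens[j]


# toy sentence and a small window, chosen only for illustration
sentence = "the quick brown fox jumps".split()
pairs = list(skipgram_pairs(sentence, window=2))
print(pairs[:5])
```

Words near the edges of the sentence simply end up with fewer pairs, since there is nothing before the first word or after the last one.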

The output probabilities relate to how likely it is to find each vocabulary word near our input word. For example, if we gave the trained network the input word "Soviet", the output probabilities are going to be much higher for words like "Union" and "Russia" than for unrelated words like "watermelon" and "kangaroo".


## Model Details

So how is this all represented? First of all, we know we can't feed a word just as a text string to a neural network (or probably any machine learning model), i.e. we need a way to represent the words to the network. To do this, we first build a vocabulary of words from our training documents. We'll assume that our corpus has a vocabulary size of 10,000.

We’re going to represent an input word like "ants" as a one-hot vector. This vector will have 10,000 components (one for every unique word in our vocabulary) and we'll place a "1" in the position corresponding to the word "ants", and 0s in all of the other positions. The output of the network is a single vector (also with 10,000 components) containing, for every word in our vocabulary, the probability that a randomly selected nearby word is that vocabulary word. Here’s the architecture of our neural network with a single hidden layer.

<img src="img/word2vec_architecture.png" width="70%" height="70%">

There is no activation function on the hidden layer neurons, but the output neurons use softmax. We’ll come back to this later.

When training this network on word pairs, the input is a one-hot vector representing the input word and the training output is also a one-hot vector representing the output word. But when you evaluate the trained network on an input word, the output vector will actually be a probability distribution (i.e., a bunch of floating point values, not a one-hot vector).

## The Hidden Layer

Let's say that we wish to learn word vectors with 300 features. The number of features is a hyperparameter that we would have to tune to our application to see which one yields the best result. So the hidden layer is going to be represented by a weight matrix with 10,000 rows (one for every word in our vocabulary) and 300 columns (one for every hidden neuron).

Now if we multiply the 1 x 10,000 one-hot vector representation of the word with the 10,000 x 300 matrix that represents the hidden layer's weights, it will effectively just select the matrix row corresponding to the "1". The following figure is a small example that multiplies a 1 x 5 one-hot vector with a 5 x 2 hidden layer weight matrix to give you a visual.

<img src="img/hidden_layer.png" width="70%" height="70%">

This means that the hidden layer of this model is really just operating as a lookup table. The output of the hidden layer is just the "word vector" for the input word.
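We can check the lookup-table claim numerically. The sketch below uses the toy sizes mentioned above (5 words, 2 features) and randomly generated weights as a stand-in for learned ones; the variable names are our own.

```python
import numpy as np

vocab_size, n_features = 5, 2  # toy sizes; the text uses 10,000 and 300

# stand-in hidden layer weights; real ones would be learned during training
rng = np.random.default_rng(0)
hidden_weights = rng.normal(size=(vocab_size, n_features))

# one-hot vector for the word at index 3
word_idx = 3
one_hot = np.zeros(vocab_size)
one_hot[word_idx] = 1.0

via_matmul = one_hot @ hidden_weights   # full matrix multiplication
via_lookup = hidden_weights[word_idx]   # plain row lookup
print(np.allclose(via_matmul, via_lookup))
```

Because both routes give the same vector, implementations usually skip the multiplication and index the row directly.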

## The Output Layer

The 1 x 300 word vector for "ants" then gets fed to the output layer. The output layer is a softmax regression classifier. There's a separate notebook on Softmax Regression [here](http://nbviewer.jupyter.org/github/ethen8181/machine-learning/blob/master/deep_learning/softmax.ipynb), but the gist of it is that each output neuron (one per word in our vocabulary) produces an output probability between 0 and 1, and all of these output values sum to 1.

Specifically, each output neuron has a weight vector which it multiplies against the word vector from the hidden layer, then it applies the function `exp(x)` to the result. Finally, in order to get the outputs to sum up to 1, we divide this result by the sum of the results from all 10,000 output nodes. Here’s an illustration of calculating the output probability for the word "car".

<img src="img/output_layer.png" width="70%" height="70%">
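The output layer computation described above can be sketched in a few lines of NumPy. The sizes are shrunk to toy values and the weights are random placeholders; subtracting the maximum score before `exp` is a numerical-stability detail we added, which does not change the resulting probabilities.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, n_features = 10, 4  # toy sizes; the text uses 10,000 and 300

word_vector = rng.normal(size=n_features)                   # hidden layer output
output_weights = rng.normal(size=(vocab_size, n_features))  # one row per output neuron

# each output neuron multiplies its weight vector against the word vector
scores = output_weights @ word_vector

# exp(x) of every score, then divide by the sum so the outputs add up to 1
exp_scores = np.exp(scores - scores.max())
probs = exp_scores / exp_scores.sum()
print(probs.sum())
```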

Note that the neural network does not know anything about the offset of the output word relative to the input word. In other words, it does not learn a different set of probabilities for the word before the input versus the word after.

Recall that at the beginning of the documentation, we mentioned that the goal of word2vec is to represent each word in the corpus as a vector while trying to preserve its semantic meaning. This means that if two different words have very similar "contexts" (that is, the words that are likely to appear around them), then our model needs to output very similar results for these two words. And one way for the network to output similar context predictions for these two words is if their word vectors are similar. So to drive the point home, if two words have similar contexts, then word2vec is motivated to learn similar word vectors for these two words!
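One common way to quantify how "similar" two word vectors are is cosine similarity; this measure is our choice for illustration and is not prescribed by the text above. The three vectors below are made up by hand, not learned by any model.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors, ranging from -1 to 1."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# made-up 3-dimensional vectors, for illustration only
dog = np.array([0.9, 0.1, 0.3])
canine = np.array([0.8, 0.2, 0.35])
watermelon = np.array([-0.2, 0.9, -0.5])

print(cosine_similarity(dog, canine))
print(cosine_similarity(dog, watermelon))
```

With trained vectors we would hope to see the same pattern: a high score for words that share contexts and a low score for unrelated ones.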

# Negative Sampling

You may have noticed that the skip-gram neural network contains a huge number of weights ... For our example with 300 features and a vocab of 10,000 words, that's 3M weights in the hidden layer and output layer each! Training this on a large dataset would be slow and prone to overfitting, so the word2vec authors introduced a number of tweaks to make training feasible.

- Treating common word pairs or phrases as single "words" in their model
- Subsampling frequent words to decrease the number of training examples
- Modifying the optimization objective with a technique they called "Negative Sampling", which causes each training sample to update only a small percentage of the model’s weights

It’s worth noting that subsampling frequent words and applying Negative Sampling not only reduced the compute burden of the training process, but also improved the quality of their resulting word vectors as well.
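To give a feel for why Negative Sampling touches so few weights, here is a rough sketch of the objective for a single (center, target) pair. The word indices, the choice of 5 negative samples and the uniform sampling of negatives are assumptions made for illustration; the original word2vec implementation draws negatives from a smoothed word-frequency distribution instead.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
vocab_size, n_features, n_negative = 10000, 300, 5

# small random placeholder weights for the hidden and output layers
hidden_weights = rng.normal(scale=0.1, size=(vocab_size, n_features))
output_weights = rng.normal(scale=0.1, size=(vocab_size, n_features))

center, target = 42, 7  # hypothetical word indices
# assumption: negatives drawn uniformly (they may occasionally repeat or hit the target)
negatives = rng.integers(0, vocab_size, size=n_negative)

v = hidden_weights[center]
pos_score = output_weights[target] @ v
neg_scores = output_weights[negatives] @ v

# push the true pair's score up and the sampled negatives' scores down
loss = -np.log(sigmoid(pos_score)) - np.log(sigmoid(-neg_scores)).sum()

# at most the target row and the negative rows of the output layer get updated
touched = (1 + n_negative) * n_features
fraction = touched / output_weights.size
print(loss, touched, fraction)
```

With these sizes, one training sample updates at most 1,800 of the 3M output layer weights, i.e. 0.06% of them, instead of all 3M as the full softmax would.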

# TensorFlow Implementation

In [1]:
import tensorflow as tf

import os
import numpy as np
from subprocess import call
from zipfile import ZipFile

In [2]:
def read_data():
    """Read data into a list of tokens/words"""
    filename = 'text8.zip'
    base_url = 'http://mattmahoney.net/dc/'

    if not os.path.isfile(filename):
        call('wget ' + base_url + filename, shell = True)
        
    with ZipFile(filename) as f:
        file = f.namelist()[0]
        
        # ensure compatibility between python2 and python3's str type
        # https://stackoverflow.com/questions/37689802/what-is-tensorflow-compat-as-str
        data = tf.compat.as_str(f.read(file)).split()

    return data

In [3]:
words = read_data()
print('data:', words[:4])
print('data size {}'.format(len(words)))

Data size 17005207


In [5]:
from collections import Counter

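def build_dataset(words):
    """Map words to integer indices, reserving one index for rare words"""
    vocabulary_size = 50000
    count = Counter(words).most_common(vocabulary_size - 1)

    # build up the word index and replace the words by their assigned indices;
    # out-of-vocabulary words share the index right after the known words,
    # so they do not collide with the most common word at index 0
    data = []
    unknown_count = 0
    word_index = {word: idx for idx, (word, _) in enumerate(count)}
    unknown_idx = len(word_index)

    for word in words:
        if word in word_index:
            idx = word_index[word]
        else:
            idx = unknown_idx
            unknown_count += 1

        data.append(idx)

    # 'UNK' flag for out-of-vocabulary words; its position in `count`
    # matches `unknown_idx`
    unknown = 'UNK', unknown_count
    count.append(unknown)
    word_index['UNK'] = unknown_idx
    word_index_rev = {idx: word for word, idx in word_index.items()}
    return data, count, word_index, word_index_rev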

In [6]:
data, count, word_index, word_index_rev = build_dataset(words)
print('Most common words (+UNK)', count[:5])
print('Sample data', data[:10])

Most common words (+UNK) [('the', 1061396), ('of', 593677), ('and', 416629), ('one', 411764), ('in', 372201)]
Sample data [5243, 3083, 11, 5, 194, 1, 3136, 45, 58, 155]


In [38]:
def generate_sample(indexed_words, window):
    """
    Form training pairs according to the skip-gram model
    
    Parameters
    ----------
    indexed_words : list
        list of index that represents the words, e.g. [5243, 3083, 11],
        and 5243 might represent the word "Today"
        
    window : int
        maximum window size of the skip-gram model; for each center word
        a context size is drawn uniformly from 1 to `window`, and every
        word within that many positions before and after it is paired with it
    """
    for index, center in enumerate(indexed_words):
        # randomly chosen context size between 1 and `window` (both inclusive);
        # np.random.randint excludes the upper bound, hence the + 1
        context = np.random.randint(1, window + 1)

        # pair the center word with each word up to `context` positions before it
        for target in indexed_words[max(0, index - context):index]:
            yield center, target

        # pair the center word with each word up to `context` positions after it
        for target in indexed_words[(index + 1):(index + 1 + context)]:
            yield center, target

In [52]:
iterator = generate_sample(indexed_words = data, window = 3)

print('original data:', data[:6])
print('skip gram sample:')

# we start off by using the first word as the center word,
# and since there's no word before it, we will not have any
# sampled word before it; after that we keep sliding the center
# word and generate word pairs
print(next(iterator))
print(next(iterator))
print(next(iterator))
print(next(iterator))
print(next(iterator))
print(next(iterator))
print(next(iterator))
print(next(iterator))

original data: [5243, 3083, 11, 5, 194, 1]
skip gram sample:
(5243, 3083)
(5243, 11)
(3083, 5243)
(3083, 11)
(3083, 5)
(3083, 194)
(11, 3083)
(11, 5)


In [59]:
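def get_batch(iterator, batch_size):
    """Group a stream of (center, target) pairs into batches and yield them as Numpy arrays"""
    while True:
        center_batch = np.zeros(batch_size, dtype = np.int32)
        target_batch = np.zeros([batch_size, 1], dtype = np.int32)
        for index in range(batch_size):
            try:
                center_batch[index], target_batch[index] = next(iterator)
            except StopIteration:
                # the stream is exhausted; drop the incomplete batch and stop,
                # since a StopIteration escaping a generator raises
                # RuntimeError on Python 3.7+
                return

        yield center_batch, target_batch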

In [60]:
batch_size = 5
iterator = generate_sample(indexed_words = data, window = 3)
batches = get_batch(iterator, batch_size)
center_batch, target_batch = next(batches)

In [61]:
center_batch

array([5243, 5243, 3083, 3083, 3083], dtype=int32)

In [62]:
target_batch

array([[3083],
       [  11],
       [5243],
       [  11],
       [   5]], dtype=int32)

In [None]:
VOCAB_SIZE = 50000
BATCH_SIZE = 128
EMBED_SIZE = 128 # dimension of the word embedding vectors
SKIP_WINDOW = 1
# batch_gen = process_data(VOCAB_SIZE, BATCH_SIZE, SKIP_WINDOW)

In [None]:
center_words = tf.placeholder(tf.int32, shape = [BATCH_SIZE], name = 'center_words')
target_words = tf.placeholder(tf.int32, shape = [BATCH_SIZE, 1], name = 'target_words')

Define the weights/variables. In this case, that is the embedding matrix.

Each row corresponds to the representation vector of one word. If one word is represented with a vector of size `EMBED_SIZE`, then the embedding matrix will have shape `[VOCAB_SIZE, EMBED_SIZE]`. We initialize the embedding matrix with values drawn from a random distribution. In this case, let’s choose a uniform distribution.

In [None]:
embed_matrix = tf.Variable(
    tf.random_uniform([VOCAB_SIZE, EMBED_SIZE], -1.0, 1.0), name = 'embed_matrix')

In [None]:
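# tf.nn.embedding_lookup(params, ids, partition_strategy='mod', name=None,
#                        validate_indices=True, max_norm=None)
# gather the embedding rows for the center words in the batch
embed = tf.nn.embedding_lookup(embed_matrix, center_words, name = 'embed')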
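For a single, unpartitioned embedding matrix, the lookup amounts to gathering rows by index. The NumPy sketch below mirrors that behaviour with toy sizes and a made-up batch of word indices; it is an analogy to build intuition, not the TensorFlow code path itself.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_size = 8, 3  # toy sizes standing in for VOCAB_SIZE and EMBED_SIZE

# uniform initialization between -1 and 1, as for the embedding matrix above
embed_matrix = rng.uniform(-1.0, 1.0, size=(vocab_size, embed_size))

center_words = np.array([5, 0, 5, 2])  # a made-up batch of word indices

# one embedding row per word in the batch
embed = embed_matrix[center_words]
print(embed.shape)
```

Repeated indices in the batch simply return the same row more than once, which is exactly the lookup-table behaviour of the hidden layer described earlier.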

# Reference

- [Blog: Word2Vec Tutorial - The Skip-Gram Model](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/)