This notebook provides a step-by-step explanation of the word2vec model implementation and the training process.

Note that this notebook is not intended to actually run the training, so some of the Python cells might fail to execute.

For the complete training code, see the Python script `word2vec/train.py`.

In [16]:
import h5py
import numpy as np
from keras.layers import Input, Embedding, Lambda, Dense
from keras.layers import Concatenate, Average, Add
from keras.models import Model
from keras.optimizers import SGD

## Load and shuffle training data

First we load the whole training dataset into memory. 

To reduce the training time we will only use 10 million examples for training.

In [None]:
f = h5py.File(dataset_path, 'r')
# Dataset.value was removed in h5py 3.0; indexing with [()] reads the whole dataset
X_train = f['x_train'][()]
y_train = f['y_train'][()]

max_train_size = 10000000
X_train = X_train[:max_train_size, :]
y_train = y_train[:max_train_size]

Next we shuffle the dataset. 

This avoids the problem of feeding correlated examples into the training process and helps the optimizer to converge faster.

We must make sure to shuffle examples and labels in a consistent fashion.

In [None]:
indices = np.arange(X_train.shape[0])
np.random.shuffle(indices)
X_train = X_train[indices]
y_train = y_train[indices]
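
As a small illustration of why a single shared permutation keeps examples and labels aligned, here is a sketch on made-up toy arrays. The names `X_demo`, `y_demo` and the fixed seed are assumptions for the example only and are not part of the training script.

```python
import numpy as np

# Toy data: row i of X_demo belongs to label y_demo[i]
# (by construction, label == first column // 10).
X_demo = np.array([[10, 11], [20, 21], [30, 31], [40, 41]])
y_demo = np.array([1, 2, 3, 4])

rng = np.random.RandomState(0)           # fixed seed, only to make the sketch reproducible
perm = rng.permutation(X_demo.shape[0])  # one permutation shared by both arrays

X_shuffled = X_demo[perm]
y_shuffled = y_demo[perm]

# Every example still sits next to its own label.
pairs_aligned = bool(np.all(X_shuffled[:, 0] // 10 == y_shuffled))
print(pairs_aligned)
```

Shuffling the two arrays with two independent calls to `np.random.shuffle` would break this alignment.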

## Build the model

First we have to define the **input shape** of the model. 

The input shape does not include the `batch_size` dimension.

The training examples are word ID vectors $x=\{w_1,\ldots,w_{k-1},w_{k+1},\ldots,w_{2k-1}\}$ with the associated target value $w_k$. 

$w_k$ is the word ID the model should predict given an example.
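
The dataset preparation is not shown in this notebook. The following is a minimal sketch of how such (context, target) pairs could be built from a sequence of word IDs, assuming a symmetric window with the same number of context words on each side. The helper `make_cbow_examples` and the word IDs are hypothetical.

```python
def make_cbow_examples(word_ids, win_size):
    """Return (contexts, targets) using win_size context words per example.

    Assumes a symmetric window: win_size // 2 words on each side of the center.
    """
    half = win_size // 2
    contexts, targets = [], []
    for center in range(half, len(word_ids) - half):
        left = word_ids[center - half:center]
        right = word_ids[center + 1:center + half + 1]
        contexts.append(left + right)      # the center word itself is left out
        targets.append(word_ids[center])   # the center word is the label
    return contexts, targets


sentence = [7, 3, 9, 4, 8, 2]  # made-up word IDs
contexts, targets = make_cbow_examples(sentence, win_size=4)
print(contexts)  # [[7, 3, 4, 8], [3, 9, 8, 2]]
print(targets)   # [9, 4]
```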

In [5]:
win_size = 10
inputs = Input(shape=(win_size,), dtype='int32')

Next we add an **Embedding layer** that maps each word ID to a word vector. 

The input shape is (batch_size, win_size).

The output shape is (batch_size, win_size, vec_dims).

In [7]:
vocab_size = 10000
vec_dims = 100
word_vectors = Embedding(vocab_size, vec_dims)(inputs)
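
Conceptually, the lookup amounts to indexing the rows of a weight matrix with the word IDs. The NumPy sketch below only illustrates the shapes involved. The random matrix and the small batch are stand-ins, not the layer's actual weights or data.

```python
import numpy as np

vocab_size, vec_dims, win_size = 10000, 100, 10
batch_size = 4  # small batch, just for the sketch

rng = np.random.RandomState(0)
# Stand-in for the trainable weights of the Embedding layer
embedding_matrix = rng.uniform(-0.05, 0.05, size=(vocab_size, vec_dims))
# Stand-in for one batch of word ID vectors
word_ids = rng.randint(0, vocab_size, size=(batch_size, win_size))

# Integer array indexing returns one row of the matrix per word ID
looked_up = embedding_matrix[word_ids]
print(looked_up.shape)  # (4, 10, 100)
```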

The output of the Embedding layer is fed into `win_size` Lambda layers. Each Lambda layer extracts and outputs a single word vector. 

In [8]:
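# Bind i as a default argument so that each Lambda keeps its own index,
# even if the function is called again after the loop has finished
sliced_word_vector = [Lambda(lambda x, i=i: x[:, i, :], output_shape=(vec_dims,))(word_vectors)
                      for i in range(win_size)]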
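
One detail worth keeping in mind when lambdas are created in a loop: a Python closure looks up the loop variable when the function is called, not when it is defined. The plain-Python sketch below (no Keras involved, names are illustrative) shows the difference between capturing `i` late and binding it as a default argument.

```python
row = ['a', 'b', 'c']

# Closures share the loop variable: after the loop, every lambda sees i == 2
late = [lambda r: r[i] for i in range(3)]

# A default argument freezes the value of i at definition time
bound = [lambda r, i=i: r[i] for i in range(3)]

late_results = [f(row) for f in late]
bound_results = [f(row) for f in bound]
print(late_results)   # ['c', 'c', 'c']
print(bound_results)  # ['a', 'b', 'c']
```

Whether this matters for the slicing layers depends on when the framework invokes the function, so binding the index explicitly is the safer choice.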

Next the word vectors are aggregated into a single vector.

The **concatenation** of the word vectors preserves the word order but results in a larger model and longer training time. 

**Averaging** and **summation** of the word vectors destroy the word order but result in a smaller model and faster training.

In [10]:
# Aggregate the word vectors
aggregation_type = 'concat'
if aggregation_type == 'concat':
    h = Concatenate()(sliced_word_vector)
elif aggregation_type == 'average':
    h = Average()(sliced_word_vector)
elif aggregation_type == 'sum':
    h = Add()(sliced_word_vector)
else:
    raise ValueError('Invalid aggregation type: %s' % aggregation_type)
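
To make the size trade-off concrete, the following NumPy sketch compares the shape of the aggregated vector for the three options and the resulting number of weights in a following Dense layer with `vocab_size` outputs. The random arrays are stand-ins for the sliced word vectors.

```python
import numpy as np

batch_size, win_size, vec_dims, vocab_size = 4, 10, 100, 10000
rng = np.random.RandomState(0)
# Stand-ins for the win_size sliced word vectors, each of shape (batch_size, vec_dims)
vectors = [rng.rand(batch_size, vec_dims) for _ in range(win_size)]

h_concat = np.concatenate(vectors, axis=-1)  # one slot per window position
h_average = np.mean(vectors, axis=0)         # position information is lost
h_sum = np.sum(vectors, axis=0)              # position information is lost

print(h_concat.shape, h_average.shape, h_sum.shape)  # (4, 1000) (4, 100) (4, 100)

# Weights of a following Dense layer with vocab_size outputs (bias excluded)
dense_weights_concat = h_concat.shape[1] * vocab_size    # 10,000,000
dense_weights_average = h_average.shape[1] * vocab_size  # 1,000,000
print(dense_weights_concat, dense_weights_average)
```

With these settings, concatenation makes the output layer `win_size` times larger than averaging or summation.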

The aggregated word vectors are fed into a **Dense layer** with a softmax activation function. 

For each word of the vocabulary, this layer outputs the probability $P(y = w_k \mid x)$ that it is the center word.

In [12]:
probs = Dense(vocab_size, activation='softmax')(h)  
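
What this layer computes can be sketched in NumPy as an affine map followed by a softmax. The tiny dimensions and the random weights below are assumptions chosen for readability and do not correspond to the trained model.

```python
import numpy as np

def softmax(logits):
    # Subtract the row maximum for numerical stability; the result is unchanged
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.RandomState(0)
h_demo = rng.rand(2, 8)       # 2 aggregated vectors of size 8
W = rng.randn(8, 5) * 0.1     # weights for a toy vocabulary of 5 words
b = np.zeros(5)               # bias

probs_demo = softmax(h_demo @ W + b)
print(probs_demo.shape)         # (2, 5)
print(probs_demo.sum(axis=1))   # each row sums to 1
```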

Now the Model can be instantiated by specifying input and output tensors.

In [13]:
model = Model(inputs, probs)

To prepare the model for training it needs to be compiled.

The model uses the following configuration:

 * Stochastic Gradient Descent optimizer
 * Cross-entropy loss function
 * Accuracy metric during training

In [None]:
model.compile(optimizer=SGD(lr=0.05),
              loss='sparse_categorical_crossentropy', 
              metrics=['accuracy'])
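
The sparse variant of the cross-entropy loss takes integer word IDs as targets, so the labels do not have to be one-hot encoded. As a rough sketch of the per-example loss and of the accuracy metric, using made-up probabilities and a hypothetical helper `sparse_ce`:

```python
import numpy as np

def sparse_ce(probs, targets):
    # Pick the predicted probability of the true class for each example
    picked = probs[np.arange(len(targets)), targets]
    return -np.log(picked)

probs_demo = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.1, 0.8]])
targets_demo = np.array([0, 1])  # integer class IDs, not one-hot vectors

losses = sparse_ce(probs_demo, targets_demo)
accuracy = np.mean(np.argmax(probs_demo, axis=1) == targets_demo)
print(losses)    # approximately [0.357, 2.303]
print(accuracy)  # 0.5
```

The loss is small when the true word gets a high probability and grows as that probability approaches zero.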

## Train the model

The `model.fit()` method trains the model for a given number of epochs. 

An `epoch` is one iteration over the entire dataset.

The implementation calls `model.fit()` repeatedly in a loop. 

After each iteration the training progress is logged and a model snapshot is saved.

In [None]:
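import time

nb_iterations = 100
epochs_per_fit = 1
batch_size = 256
for i in range(nb_iterations):
    start = time.time()
    history = model.fit(X_train, y_train,
                        batch_size=batch_size,
                        epochs=epochs_per_fit)
    # Average wall-clock time per epoch of this fit() call
    mean_epoch_time = (time.time() - start) / epochs_per_fit

    loss = history.history['loss'][-1]
    # The accuracy key is 'acc' in older Keras releases and 'accuracy' in newer ones
    acc = history.history.get('accuracy', history.history.get('acc'))[-1]
    epoch = (i + 1) * epochs_per_fit
    print('epoch %d: loss=%f acc=%f time=%f' % (epoch, loss, acc, mean_epoch_time))

    # Overwrite the snapshot after every iteration
    model.save(checkpoint_path)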
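
For a sense of scale, the number of weight updates implied by these settings can be worked out directly. The sketch assumes the dataset really contains at least 10 million examples and that the last, partial batch of each epoch is also processed.

```python
import math

max_train_size = 10000000
batch_size = 256
nb_iterations = 100
epochs_per_fit = 1

# The final batch of an epoch holds the remaining examples
steps_per_epoch = math.ceil(max_train_size / batch_size)
total_steps = steps_per_epoch * nb_iterations * epochs_per_fit
print(steps_per_epoch)  # 39063
print(total_steps)      # 3906300
```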

## Notes about the implementation

The implementation is not very efficient because the final Dense layer computes a softmax over all `vocab_size` words for every training example, which increases training time considerably. A more efficient implementation would use a hierarchical softmax.
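
A back-of-the-envelope comparison shows where the saving comes from. The sketch assumes the vocabulary is arranged in a balanced binary tree, so that predicting a word takes one binary decision per tree level. Implementations that use a Huffman tree give frequent words even shorter paths.

```python
import math

vocab_size = 10000

# Full softmax: one score per vocabulary word for every training example
full_softmax_scores = vocab_size

# Hierarchical softmax over a balanced binary tree: one decision per level
tree_depth = math.ceil(math.log2(vocab_size))

print(full_softmax_scores)  # 10000
print(tree_depth)           # 14
```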

The training process does not evaluate the result on a test dataset. Normally this would be a serious mistake. In this case overfitting is not a big concern because we actually want the model to overfit: it is trained on a document corpus the size of Wikipedia, which is supposed to be enough to cover the common usage of the language, so there is no need for further generalization. 

Also, we are only interested in generating 'good' word vector representations and not so much in a perfect prediction of the center words.
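
One common way to inspect the quality of word vectors is to look at nearest neighbors by cosine similarity. The sketch below uses a tiny hand-made matrix and a hypothetical helper `most_similar`. In Keras, the trained matrix could be read from the Embedding layer, for example with `get_weights()[0]` on that layer.

```python
import numpy as np

def most_similar(word_id, embedding_matrix, top_n=3):
    """Return the IDs of the top_n words closest to word_id by cosine similarity."""
    norms = np.linalg.norm(embedding_matrix, axis=1, keepdims=True)
    unit = embedding_matrix / np.maximum(norms, 1e-12)  # guard against zero vectors
    sims = unit @ unit[word_id]
    sims[word_id] = -np.inf  # exclude the query word itself
    return np.argsort(-sims)[:top_n]

# Toy 2-dimensional "word vectors" for a vocabulary of 4 words
toy_vectors = np.array([[1.0, 0.0],
                        [0.9, 0.1],
                        [0.0, 1.0],
                        [-1.0, 0.0]])

neighbors = most_similar(0, toy_vectors, top_n=2)
print(neighbors)  # [1 2]
```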