The paper [Distributed Representations of Sentences and Documents](https://arxiv.org/abs/1405.4053) describes the following word2vec algorithm:

In this framework, every word is mapped to a unique vector, represented by a column in a matrix $W$. The column is indexed by position of the word in the vocabulary. The concatenation or sum of the vectors is then used as features for prediction of the next word in a sentence.

More formally, given a sequence of training words $w_1,w_2,...,w_T$, the goal of the word vector model is to maximize the average log probability

$$
\frac{1}{T} \sum_{t=k}^{T-k} \log p(w_t \mid w_{t-k},...,w_{t+k})
$$

**Note:** The model computes the probability that a word is the center word, given the $k$ words before it and the $k$ words after it. The goal is to maximize the log probability of the correct word $w_t$. Do not be confused by the log: it turns a product of probabilities into a sum, which makes the math simpler (especially when calculating gradients), and because the logarithm is strictly increasing it does not change which parameters are optimal.
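
As a rough illustration of the objective above, the following sketch enumerates the (context, center) pairs and averages the log probabilities. The names `context_windows`, `average_log_prob`, and the `prob` callback are hypothetical and not taken from the paper. The sketch also assumes 0-indexed positions and uses every position that has a full window of $k$ words on each side, which is one reading of the summation bounds.

```python
import math

def context_windows(words, k):
    """Yield (context, center) for every 0-indexed position with k words on each side.

    `words` is assumed to be a list, so slices can be joined with `+`.
    """
    for t in range(k, len(words) - k):
        context = words[t - k:t] + words[t + 1:t + k + 1]
        yield context, words[t]

def average_log_prob(words, k, prob):
    """Sum log p(center | context) over all windows and divide by T = len(words).

    `prob(center, context)` is a hypothetical callback standing in for the model.
    """
    total = sum(math.log(prob(center, context))
                for context, center in context_windows(words, k))
    return total / len(words)

# Toy usage: a "model" that assigns probability 0.1 to every word.
words = [0, 1, 2, 3, 4, 5]
windows = list(context_windows(words, k=2))
objective = average_log_prob(words, k=2, prob=lambda center, context: 0.1)
print(windows)
print(objective)
```

With six words and $k = 2$, only two positions have a full window, so the sum has two terms while the normalization still divides by $T = 6$.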


**TODO** Swap the order of the next two formulas: first calculate the unnormalized log probabilities and then normalize them with the softmax.



The prediction task is typically done via a multiclass classifier, such as softmax. There, we have:

$$
p(w_t \mid w_{t-k},...,w_{t+k}) = \frac{e^{y_{w_t}}}{\sum_i {e^{y_i}}}
$$

Each $y_i$ is the unnormalized log-probability of output word $i$, computed as

$$
y = b + Uh(w_{t-k},...,w_{t+k}; W)
$$

where $U$, $b$ are the softmax parameters. $h$ is constructed by a concatenation or average of word vectors extracted from $W$.
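
The two formulas can be traced in a few lines of NumPy: first compute the unnormalized log-probabilities $y = b + Uh$, then normalize them with the softmax. This is a minimal sketch, not the paper's implementation. The function names, the random parameters, and the sizes (borrowed from the prototype in this notebook) are assumptions made for illustration. Following the text, word vectors are stored as columns of $W$.

```python
import numpy as np

def softmax(y):
    """Turn unnormalized log-probabilities into a probability distribution."""
    e = np.exp(y - np.max(y))  # subtracting the max avoids overflow; the result is unchanged
    return e / e.sum()

def predict_proba(context_ids, W, U, b, aggregation="concatenate"):
    """Compute p(w | context) for every word w in the vocabulary."""
    vectors = W[:, context_ids]        # (vec_dims, 2k): one column per context word
    if aggregation == "concatenate":
        h = vectors.T.reshape(-1)      # (2k * vec_dims,): word vectors laid end to end
    elif aggregation == "average":
        h = vectors.mean(axis=1)       # (vec_dims,)
    else:
        raise ValueError("Invalid aggregation")
    y = b + U @ h                      # unnormalized log-probabilities, one per word
    return softmax(y)

# Illustrative sizes and random parameters (assumptions, not trained values).
rng = np.random.default_rng(0)
k, vec_dims, vocab_size = 2, 10, 100
W = rng.normal(size=(vec_dims, vocab_size))
U = rng.normal(size=(vocab_size, 2 * k * vec_dims))
b = np.zeros(vocab_size)

p = predict_proba([0, 0, 1, 1], W, U, b)
print(p.shape, p.sum())
```

Note that the shape of $U$ depends on the aggregation: concatenation needs `2 * k * vec_dims` columns, while averaging needs only `vec_dims`.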

## word2vec model prototype

In [None]:
import numpy as np
from keras.layers import Input, Concatenate, Lambda, Embedding, Average, Dense
from keras.models import Model

k = 2
vec_dims = 10
vocab_size = 100
row_aggregation = 'concatenate' # 'average' or 'concatenate'

win_size = 2 * k
inputs = Input(shape=(win_size,), dtype='int32') # input shape: (-1, win_size)
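word_vectors = Embedding(vocab_size, vec_dims)(inputs) # word_vectors shape: (-1, win_size, vec_dims)
# Slice out one (-1, vec_dims) tensor per context position; `i=i` binds the index when the lambda is defined.
word_vector_rows = [Lambda(lambda x, i=i: x[:, i, :], output_shape=(vec_dims,))(word_vectors) for i in range(win_size)]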
if row_aggregation == 'concatenate':
    h = Concatenate()(word_vector_rows)
elif row_aggregation == 'average':
    h = Average()(word_vector_rows)
else:
    raise ValueError('Invalid row aggregation')
    
# Dense computes softmax(dot(h, kernel) + bias), so the model outputs probabilities, not logits.
probs = Dense(vocab_size, activation='softmax')(h)
model = Model(inputs, probs)

model.compile(optimizer='sgd', loss='sparse_categorical_crossentropy')

x = np.array([
    [0,0,1,1],
])
print('x.shape:', x.shape)
out = model.predict(x)
print('out.shape:', out.shape)

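# The last Dense layer holds the softmax parameters U and b from the formula above.
# Keras stores the kernel with shape (input_dim, vocab_size), i.e. the transpose of U.
# The paper's W corresponds to the Embedding layer's weights instead.
U, b = model.layers[-1].get_weights()
print('U:', U.shape)
print('b:', b.shape)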
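
The `sparse_categorical_crossentropy` loss passed to `model.compile` is, for integer targets, the mean negative log probability of the correct word, so minimizing it corresponds to maximizing the average log probability from the beginning of this notebook (up to the normalization constant). The NumPy sketch below recomputes that quantity by hand. The function name `mean_negative_log_prob` and the toy numbers are made up for illustration.

```python
import numpy as np

def mean_negative_log_prob(probs, targets):
    """Average of -log p(correct word) over a batch.

    probs:   (batch, vocab_size) array of predicted probabilities
    targets: (batch,) array of integer word ids
    """
    probs = np.asarray(probs, dtype=float)
    targets = np.asarray(targets)
    picked = probs[np.arange(len(targets)), targets]  # probability of each correct word
    return -np.mean(np.log(picked))

# Toy batch with a 3-word vocabulary (illustrative values only).
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
targets = np.array([0, 1])
loss = mean_negative_log_prob(probs, targets)
print(loss)
```

A perfect prediction (probability 1 for every correct word) gives a loss of 0, and the loss grows as the probability assigned to the correct word shrinks.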
