## Word2Vec for Latin using Keras

This notebook is a quick tutorial on training word2vec vectors for Latin, using the simple and well-known skipgram method with negative sampling. 

Keras makes the process straightforward because the sampling procedure is implemented in Keras' preprocessing library.

### Dataset
First, the boring bits.  We use Latin text from [The Latin Library](http://www.thelatinlibrary.com/), which the CLTK thankfully makes available as a downloadable corpus.

All tokens from the library have already been exported to a file, and exactly 260,000 types, all of them words occurring at least twice, have been collected into a type list.

In [14]:
import numpy as np

In [3]:
with open('ll_words.txt') as infile:
    ll_tokens = [line.rstrip() for line in infile]

In [4]:
with open('ll_types.txt') as infile:
    ll_types = [line.rstrip() for line in infile]

We need a way to associate each token with its index in the type list: a mapping from token to integer index, plus the reverse mapping from index back to type.
Note that index zero is reserved for out-of-vocabulary (OOV) items, so the types are numbered from 1.

In [5]:
index = {}
rev_index = {}
for i, ll_type in enumerate(ll_types):
    index[ll_type] = i + 1
    rev_index[i + 1] = ll_type

Now we rewrite the complete sequence of the Latin Library into a 1D array of indices.  As noted, OOV terms map to index 0.

In [6]:
seq = []
for token in ll_tokens:
    seq.append(index.get(token, 0))
seq = np.asarray(seq)

Keras will now create the dataset for training.  

The dataset consists of pairs of indices selected from the sequence (one each for the target word and the context word), and a binary label, 0 or 1, indicating whether the context word occurs within some window of the target word (1) or is a randomly drawn negative sample (0).

The width of the window is set to 4.  
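
To make the pair-and-label format concrete, here is a minimal pure-Python sketch of the idea. It is not Keras' implementation: the hypothetical `toy_skipgrams` function leaves out the frequency-based subsampling and the shuffling, and simply draws one random negative per positive pair.

```python
import random

def toy_skipgrams(seq, vocab_size, window_size=4, seed=0):
    """Illustrative only: build (target, context) pairs with 0/1 labels."""
    rng = random.Random(seed)
    couples, labels = [], []

    # Positive pairs: every in-vocabulary neighbour inside the window.
    for i, target in enumerate(seq):
        if target == 0:  # index 0 is OOV, never used as a target here
            continue
        start = max(0, i - window_size)
        end = min(len(seq), i + window_size + 1)
        for j in range(start, end):
            if j != i and seq[j] != 0:
                couples.append((target, seq[j]))
                labels.append(1)

    # Negative pairs: a seen target paired with a random vocabulary index.
    targets = [target for target, _ in couples]
    for _ in range(len(targets)):
        couples.append((rng.choice(targets), rng.randint(1, vocab_size - 1)))
        labels.append(0)

    return couples, labels

couples, labels = toy_skipgrams([1, 2, 0, 3], vocab_size=4, window_size=1)
```

In this toy sequence only the pairs (1, 2) and (2, 1) are positives, because the OOV item at position 2 is skipped both as a target and as a context. A random negative can by chance coincide with a genuine context word; with a large vocabulary this is rare enough to ignore.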

In [32]:
from keras.preprocessing.sequence import make_sampling_table, skipgrams

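def create_dataset(types, window_size):
    # +1 because index 0 is reserved for OOV items
    vocab_size = len(types) + 1

    # Subsampling table; make_sampling_table assumes that lower indices
    # correspond to more frequent words
    sampling_table = make_sampling_table(vocab_size)
    # `seq` is the global index sequence built above
    couples, labels = skipgrams(seq, vocab_size, window_size=window_size, sampling_table=sampling_table)

    word_targets, word_contexts = zip(*couples)
    word_targets = np.array(word_targets, dtype="int32")
    word_contexts = np.array(word_contexts, dtype="int32")
    labels = np.array(labels, dtype="int32")

    return word_targets, word_contexts, labels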

In [8]:
word_targets, word_contexts, labels = create_dataset(ll_types, 4)
print(list(zip(word_targets[:10], word_contexts[:10])), labels[:10])
word_targets.shape

(90504940,)

The model for training is extremely simple.  The embedding matrix is initialized by Keras and is, of course, trainable.
For each exemplar, i.e. a pair of indices into the embedding matrix, the corresponding word vectors are selected and their cosine similarity is computed; this is then passed through a single dense unit with a sigmoid activation.  The loss is the binary cross-entropy between the label of the pair and the activation, and its gradient is propagated back to the embedding matrix.
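
The forward pass for a single pair can be worked through by hand. The sketch below uses plain Python rather than Keras; the function names and the default `weight` and `bias` values are illustrative assumptions standing in for the single trainable dense unit.

```python
import math

def cosine_similarity(u, v):
    """Dot product of the two vectors after L2 normalization."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def predict(u, v, weight=1.0, bias=0.0):
    """Sigmoid of a one-unit dense layer applied to the similarity."""
    z = weight * cosine_similarity(u, v) + bias
    return 1.0 / (1.0 + math.exp(-z))

def binary_crossentropy(label, p):
    """Loss for one pair with label 0 or 1 and predicted probability p."""
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

# Orthogonal vectors have similarity 0, so the prediction is exactly 0.5.
p = predict([1.0, 0.0], [0.0, 1.0])
loss = binary_crossentropy(1, p)
```

Because the cosine similarity is confined to [-1, 1], the dense unit's weight and bias are what allow the sigmoid output to move away from the narrow range it would otherwise cover.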

In [9]:
from keras import layers, Input
from keras.models import Model
import keras.backend as K

def build_word2vec_model(vocab_size, vector_dim):
    input_target = layers.Input((1,))
    input_context = layers.Input((1,))

    embed = layers.Embedding(vocab_size, vector_dim, input_length=1, trainable=True)
    
    target = embed(input_target)
    target = layers.Reshape((vector_dim,))(target)
    
    context = embed(input_context)
    context = layers.Reshape((vector_dim,))(context)
    
    dot = layers.dot([target, context], axes=1, normalize=True)
    
    out = layers.Dense(1, activation='sigmoid')(dot)
    
    model = Model([input_target, input_context], out)
    model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['acc'])
    
    return model

    

In [12]:
K.clear_session()
vocab_size = len(ll_types) + 1  # 260,000 types + OOV
vector_dim = 300
m = build_word2vec_model(vocab_size, vector_dim)
m.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, 1)            0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            (None, 1)            0                                            
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, 1, 300)       78000300    input_1[0][0]                    
                                                                 input_2[0][0]                    
__________________________________________________________________________________________________
reshape_1 (Reshape)             (None, 300)          0           embedding_1[0][0]                
__________

For a lexicon of 260,001 items (260,000 + OOV) and 300 dimensions, we have ~78 million parameters.
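
The arithmetic behind that figure, as a quick check. The two extra parameters are the single weight and bias of the `Dense(1)` output layer defined in the model above.

```python
vocab_size = 260_000 + 1   # types plus the OOV slot at index 0
vector_dim = 300

embedding_params = vocab_size * vector_dim
dense_params = 1 + 1       # one weight and one bias in Dense(1)
total_params = embedding_params + dense_params

print(embedding_params, total_params)  # 78000300 78000302
```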

We train with large batches (100,000 samples each) and a 10% validation split.

In [13]:
m.fit([word_targets, word_contexts], labels, epochs=50, batch_size=100000, validation_split=0.1)

Train on 81454446 samples, validate on 9050494 samples
Epoch 1/50
...
Epoch 50/50


<keras.callbacks.History at 0x7f425c020e10>

The result is quite good: the validation accuracy suggests that, for in-vocabulary words, the embeddings can predict with 91.63% accuracy whether a pair of words tends to co-occur.

### Saving the vectors

We'll need to write both the embedding vectors themselves and the word index.

In [15]:
import pickle

In [22]:
# The first weight array of the model is the embedding matrix
w = m.get_weights()[0]
with open('latin_vectors.bin', 'wb') as outfile:
    pickle.dump(w, outfile)

In [29]:
with open('latin_types.txt', 'w') as outfile:
    for word, idx in index.items():
        outfile.write('{0}\t{1}\n'.format(idx, word))

Checking ...

In [30]:
with open('latin_vectors.bin', 'rb') as infile:
    latin_vectors = pickle.load(infile)
latin_vectors.shape

(260001, 300)
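
To use the vectors later, the type file has to be read back into a lookup table. A minimal sketch, assuming the `id<TAB>word` line format written above; `load_index` is a hypothetical helper name, not part of any library.

```python
def load_index(lines):
    """Parse 'id<TAB>word' lines into word->id and id->word dicts."""
    index, rev_index = {}, {}
    for line in lines:
        line = line.rstrip('\n')
        if not line:
            continue
        idx, word = line.split('\t', 1)
        index[word] = int(idx)
        rev_index[int(idx)] = word
    return index, rev_index

# Works on any iterable of lines, including an open file object.
demo_index, demo_rev_index = load_index(["1\tet\n", "2\tin\n"])
```

With `latin_types.txt` opened as `infile`, `load_index(infile)` rebuilds both mappings, and `latin_vectors[index.get(word, 0)]` then returns the vector for `word`, falling back to the OOV row at index 0.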