# Word embeddings

This notebook creates the word embeddings from the captions. The algorithm we use is called **Word2Vec**.

Word2Vec, specifically the **skip-gram** algorithm, uses an artificial neural network with a single hidden layer to predict the context of an input word.

After training the artificial neural network, we use the hidden layer weights as an embedding matrix that transforms each input word (in one-hot vector format) into its embedding. An example is shown in the following image:

However, this implementation is not very efficient: computing a softmax over a large number of categories is expensive, and models with a single hidden layer often underfit. For this reason, the skip-gram variant used here is somewhat different: the input of the model is a pair (word, context word) and the output is 1 if the pair is a true pair and 0 otherwise. Each element of the input pair has its own embedding matrix, but we only keep the word matrix.
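
To make the cost argument concrete, the two formulations can be written side by side. The notation is introduced here for illustration and is not taken from the notebook: $v_w$ is the embedding of the word, $u_c$ the embedding of the context word, $V$ the vocabulary size and $y \in \{0, 1\}$ the label of the pair. With a full softmax, every training example requires a sum over all $V$ words:

$$p(c \mid w) = \frac{\exp(u_c^\top v_w)}{\sum_{c'=1}^{V} \exp(u_{c'}^\top v_w)}$$

With labeled pairs, each example only needs one dot product and a sigmoid, trained with binary cross-entropy:

$$p(y = 1 \mid w, c) = \sigma(u_c^\top v_w), \qquad \sigma(x) = \frac{1}{1 + e^{-x}}$$

$$\mathcal{L} = -\left[\, y \log p + (1 - y) \log (1 - p) \,\right]$$

The model built later in this notebook differs slightly from this textbook form: it passes the dot product through a `Dense(1)` layer with a sigmoid activation, so the score is rescaled and shifted by two learned scalars before the sigmoid is applied.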

In [None]:
import csv # To read the captions from a CSV file
import tensorflow as tf # To build and train the ANN
from tensorflow.keras.preprocessing.text import Tokenizer # To split the captions into word tokens
from tensorflow.keras.preprocessing.sequence import pad_sequences # To pad sequences to a common length
import numpy as np
import pandas as pd
import itertools

In [None]:
# Hide all GPUs from TensorFlow so that the model runs on the CPU
tf.config.set_visible_devices([], 'GPU')

First, we load the captions from the Excel files into DataFrames and wrap each caption with the start and end markers "smark" and "emark".

In [None]:
PATH = "data/train_machine_spanish.xlsx"
df = pd.read_excel(PATH, names=["id_image","caption"])
df['caption'] = df.apply(lambda x: "smark "+x['caption']+" emark", axis=1)

In [None]:
PATH = "data/train_human_spanish.xlsx"
df1 = pd.read_excel(PATH, names=["id_image","caption"])
df1['caption'] = df1.apply(lambda x: "smark "+x['caption']+" emark", axis=1)

In [None]:
PATH = "data/train_human_english.xlsx"
df = pd.read_excel(PATH, names=["id_image","caption"])
df['caption'] = df.apply(lambda x: "smark "+x['caption']+" emark", axis=1)

In [None]:
df = pd.concat([df,df1])

We set some hyperparameters.

In [None]:
# This variable sets the dimension of the embeddings. A higher value allows more complex embeddings,
# but the artificial neural network will be larger.
embedding_dimension = 512

We create the tokenizer, a common tool in NLP that splits sentences into tokens, where each token is a word.

In [None]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(df['caption']) # The tokenizer builds the word index from the captions
word_index = tokenizer.word_index
print("Number of different words: %d"%len(word_index))
# This variable sets the maximum size of the vocabulary.
# If it is smaller than the actual size, the least used words must be removed.
max_length = len(word_index)

We transform each description into an ordered list where each token is represented by its index in the tokenizer's vocabulary.
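
As an illustration of this step, here is a simplified pure-Python sketch of the idea. It is not the Keras `Tokenizer` itself (which also filters punctuation, among other things). The helper names `build_word_index` and `texts_to_sequences` and the toy captions are made up for this example.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() sorts by count; words with equal counts keep first-seen order
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index):
    """Replace every known word with its index, keeping the word order."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

toy_captions = ["smark a dog runs emark", "smark a cat sleeps emark"]
toy_index = build_word_index(toy_captions)
toy_sequences = texts_to_sequences(toy_captions, toy_index)
print(toy_index)
print(toy_sequences)
```

Index 0 is never assigned to a word in this sketch, which matches the convention used later in the notebook where 0 is treated as padding.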

We don't use one-hot encoding because this version of the algorithm works more efficiently with scalar indices.
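
The reason a scalar index is enough can be sketched in a few lines: multiplying a one-hot vector by the embedding matrix just selects one row, so looking the row up by index gives the same result without the multiplication. The tiny matrix and the helper names below are assumptions made for this example only.

```python
def one_hot(index, size):
    """Vector of zeros with a single 1.0 at the given position."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def vector_times_matrix(vector, matrix):
    """Plain row-vector by matrix product, written out explicitly."""
    n_cols = len(matrix[0])
    return [sum(vector[r] * matrix[r][c] for r in range(len(matrix)))
            for c in range(n_cols)]

# Toy embedding matrix: 4 rows (vocabulary entries) of dimension 2
toy_matrix = [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
word_id = 2

via_one_hot = vector_times_matrix(one_hot(word_id, len(toy_matrix)), toy_matrix)
via_lookup = toy_matrix[word_id]
print(via_one_hot, via_lookup)
```

The one-hot route touches every row of the matrix, while the lookup touches only one, which is where the efficiency gain comes from.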

In [None]:
sequences = tokenizer.texts_to_sequences(df['caption'])

We create a set of (word, context word) pairs, each labeled 1 if the pair is a true pair and 0 otherwise, which we use to train the model.
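
To show what these pairs look like, here is a minimal pure-Python sketch of the idea, not the Keras implementation. The function name `make_skipgram_pairs`, the small window and the fixed seed are choices made for this example. It emits one random negative pair for every positive pair.

```python
import random

def make_skipgram_pairs(sequence, vocabulary_size, window_size=2, seed=0):
    """Return (word, context) pairs and their labels: 1 = true pair, 0 = random pair."""
    rng = random.Random(seed)
    pairs, labels = [], []
    for i, word in enumerate(sequence):
        start = max(0, i - window_size)
        end = min(len(sequence), i + window_size + 1)
        for j in range(start, end):
            if j == i:
                continue
            # Positive pair: the two words really appear within the window
            pairs.append((word, sequence[j]))
            labels.append(1)
            # Negative pair: same word with a random index in [1, vocabulary_size - 1]
            pairs.append((word, rng.randint(1, vocabulary_size - 1)))
            labels.append(0)
    return pairs, labels

toy_pairs, toy_labels = make_skipgram_pairs([1, 2, 3], vocabulary_size=4, window_size=1)
for pair, label in zip(toy_pairs, toy_labels):
    print(pair, label)
```

A random negative can occasionally coincide with a true context word. This sketch does not filter such cases out.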

In [None]:
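peer_skipgrams = []
label_skipgrams = []
# Word indices start at 1, so the vocabulary size passed to skipgrams is the highest index + 1
vocabulary_size = len(word_index) + 1
for count, sequence in enumerate(sequences):
    if count % 1000 == 0:
        print("%d sequences processed" % count)
    ps, ls = tf.keras.preprocessing.sequence.skipgrams(
        sequence, vocabulary_size=vocabulary_size, window_size=10, negative_samples=1.0, shuffle=True,
        categorical=False, sampling_table=None, seed=None
    )
    # Append in place; inserting at the front with [0:0] shifts the whole list on every iteration
    peer_skipgrams.extend(ps)
    label_skipgrams.extend(ls)
print("Number of pairs: %d" % len(peer_skipgrams))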

We split the pairs into two lists: one with the words and one with the context words.

In [None]:
def get_list(tuples):
    list1 = []
    list2 = []
    for i in tuples:
        list1.append(i[0])
        list2.append(i[1])
    return list1, list2
train_word, train_context = get_list(peer_skipgrams)

We build the branch that transforms the current word into its embedding.

In [None]:
word_input = tf.keras.layers.Input(shape=(1,))
word_embedding = tf.keras.layers.Embedding(max_length+1, embedding_dimension, input_length=1)(word_input)
word_reshape = tf.keras.layers.Reshape((embedding_dimension, ))(word_embedding)

word_model = tf.keras.Model(word_input,word_reshape)

We build the branch that transforms the context word into its embedding.

In [None]:
context_input = tf.keras.layers.Input(shape=(1,))
context_embedding = tf.keras.layers.Embedding(max_length+1, embedding_dimension, input_length=1)(context_input)
context_reshape = tf.keras.layers.Reshape((embedding_dimension, ))(context_embedding)

context_model = tf.keras.Model(context_input,context_reshape)

We merge the two branches with a dot product and build the output of the model.
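
As a rough picture of what the merged model computes for a single pair, the forward pass can be sketched without TensorFlow. The function name `pair_probability` and the parameters `kernel` and `bias` are stand-ins chosen for this example. They play the role of the single weight and bias of the `Dense(1)` output layer.

```python
import math

def pair_probability(word_vec, context_vec, kernel=1.0, bias=0.0):
    """Sigmoid of the (rescaled and shifted) dot product of the two embeddings."""
    dot = sum(w * c for w, c in zip(word_vec, context_vec))
    return 1.0 / (1.0 + math.exp(-(kernel * dot + bias)))

aligned = pair_probability([1.0, 2.0], [1.0, 2.0])     # vectors point the same way
opposite = pair_probability([1.0, 2.0], [-1.0, -2.0])  # vectors point opposite ways
neutral = pair_probability([0.0, 0.0], [1.0, 1.0])     # dot product is zero
print(aligned, opposite, neutral)
```

With a positive `kernel`, embeddings that point in similar directions push the output towards 1, which is what training on true pairs encourages.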

In [None]:
model_input = tf.keras.layers.dot([word_reshape, context_reshape], axes=1, normalize=False)
model_output = tf.keras.layers.Dense(1, kernel_initializer='glorot_uniform', activation='sigmoid')(model_input)
model = tf.keras.Model([word_input, context_input], model_output)

In [None]:
model.summary()

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

We compute the index that separates the training set (the first 90% of the pairs) from the validation set (the remaining 10%).

In [None]:
train_val_test = int(len(train_word)*0.9)

We train the model.

In [None]:
n_epochs = 1

history = model.fit(
    [np.array(train_word[:train_val_test]), np.array(train_context[:train_val_test])],
    np.array(label_skipgrams[:train_val_test]),
    epochs=n_epochs,
    validation_data=(
        [np.array(train_word[train_val_test:]), np.array(train_context[train_val_test:])],
        np.array(label_skipgrams[train_val_test:]),
    ),
    verbose=1,
    batch_size=256,
)

We extract the weight matrix of the word embedding layer, which we use to transform new words into embeddings.

In [None]:
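# The word branch is [input, embedding, reshape], so its embedding layer is at index 1
embedding_layer = word_model.layers[1]
weights = embedding_layer.get_weights()[0]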

We check that all the words are correctly encoded. If any word is empty, we must replace it with the "errorWord" label.

In [None]:
# Print every single-character token together with its index in the vocabulary
for word, index in word_index.items():
    if len(word) == 1:
        print(word)
        print(index)

We create the vecs.tsv and meta.tsv files to use with the TensorFlow Embedding Projector: https://projector.tensorflow.org/

In [None]:
import io

# Row 0 of the weights is reserved for padding; word_index starts at 1, so it is never written
with io.open('vecs_train_human_spanish.tsv', 'w', encoding='utf-8') as out_v, \
     io.open('meta_train_human_spanish.tsv', 'w', encoding='utf-8') as out_m:
    for token, index in word_index.items():
        vec = weights[index]
        out_m.write(token + "\n")  # one word per line
        out_v.write('\t'.join(str(x) for x in vec) + "\n")  # tab-separated vector components

We store the tokenizer, which contains the word index, for later use.

In [None]:
import pickle

# saving
with open('items/tokenizer_english.pkl', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

We store the embedding layer weights so that words can be transformed into embeddings later.

In [None]:
import pickle

# saving
with open('items/embeddingLayerWeights_english.pkl', 'wb') as handle:
    pickle.dump(weights, handle, protocol=pickle.HIGHEST_PROTOCOL)
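
The two stored objects are meant to be used together: the word index maps a word to a row number, and the weight matrix holds the embedding in that row. The sketch below shows one way this lookup could work, using toy stand-ins instead of the pickled files. The function name `words_to_embeddings` and the choice to skip unknown words are assumptions of this example, not something the notebook prescribes.

```python
def words_to_embeddings(words, word_index, weights):
    """Look up the embedding row of every known word, skipping unknown ones."""
    return [weights[word_index[word]] for word in words if word in word_index]

# Toy stand-ins for tokenizer.word_index and the stored weight matrix
toy_word_index = {"smark": 1, "dog": 2, "emark": 3}
toy_weights = [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # row 0 belongs to no word

toy_embeddings = words_to_embeddings(
    ["smark", "dog", "bird", "emark"], toy_word_index, toy_weights
)
print(toy_embeddings)
```

In a later notebook, `toy_word_index` and `toy_weights` would be replaced by the objects loaded with `pickle.load` from the two files written above.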