<div><img style="float: right; width: 120px; vertical-align:middle" src="https://www.upm.es/sfs/Rectorado/Gabinete%20del%20Rector/Logos/EU_Informatica/ETSI%20SIST_INFORM_COLOR.png" alt="ETSISI logo" />


# Word2vec _skip-grams_ implementation<a id="top"></a>

<i><small>Author: Alberto Díaz Álvarez<br>Last update: 2023-05-25</small></i></div>

***

## Introduction

We start with the essentials: _word embedding techniques_ is just a fancier way of saying _representing words numerically_. With that in mind, we are going to program a process for learning _embeddings_ from text corpora. We will focus on a technique called _Word2Vec_, although, as we have already seen, there are others.

_Word2vec_ is based on a neural network that generates the embedding matrix through supervised training on a classification problem. The method is presented in the article [Efficient Estimation of Word Representations in Vector Space (Mikolov et al., 2013)](https://arxiv.org/pdf/1301.3781.pdf), and it is used quite successfully to measure **syntactic and semantic similarity of words**.

The article explores two different models: _Continuous Bag-of-Words_ and _Skip-gram_. The latter is the more commonly used, and it is the one we will look at in this exercise.

The idea of the _Skip-gram_ is the following: given a word (which we will call the _target word_), we want to train a model that is able to predict the words surrounding it (its _context words_) within a window of $N$ words on each side. For example, assuming a window of size $N = 3$ and given the following sentence:

> All those <span style="color:red">moments will be</span> **lost** <span style="color:red">in time like</span> tears in rain

The _target word_ would be **lost**, and we would train the model to predict one of the words within the specified window, i.e., one of `['moments', 'will', 'be', 'in', 'time', 'like']`.
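
As a minimal sketch of the window described above (plain Python, no libraries), the following snippet extracts the words surrounding a central word. The helper name `context_window` is hypothetical and only serves this illustration.

```python
def context_window(tokens, center, n):
    """Return the words located at most n positions away from tokens[center]."""
    left = tokens[max(0, center - n):center]
    right = tokens[center + 1:center + 1 + n]
    return left + right


sentence = 'All those moments will be lost in time like tears in rain'.split()
center = sentence.index('lost')

window = context_window(sentence, center, n=3)
print(window)  # ['moments', 'will', 'be', 'in', 'time', 'like']

# Each (central word, surrounding word) pair is a candidate training example
pairs = [(sentence[center], w) for w in window]
```

Note that the window is simply truncated at the beginning and at the end of the sentence, so words near the edges have fewer neighbours.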

## Goals

In this _notebook_ we will create an _embedding_ using the _skip-gram_ technique of _Word2Vec_.

## Libraries and configuration

Next we will import the libraries that will be used throughout the notebook.

In [None]:
import os

import numpy as np
import pandas as pd
import requests
import tensorflow as tf

import matplotlib.pyplot as plt

We will also configure some parameters to adapt the graphic presentation.

In [None]:
plt.style.use('ggplot')
plt.rcParams.update({'figure.figsize': (16, 9),'figure.dpi': 100})

And we will create the necessary directories in case they have not been created previously.

In [None]:
os.makedirs('tmp', exist_ok=True)

***

## Corpus construction

We will use the Amazon Reviews dataset (from https://nijianmo.github.io/amazon/index.html, not the most current, but useful for us) to train the model. More specifically, we will use the small subset of the category "Software".

In [None]:
DATASET_URL = 'https://jmcauley.ucsd.edu/data/amazon_v2/categoryFilesSmall/Software_5.json.gz'
DATASET_ZIP = 'tmp/Software_5.json.gz'

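# Download the remote file if it does not exist
if not os.path.exists(DATASET_ZIP):
    print(f'Downloading {DATASET_ZIP}...')
    # Note: verify=False disables TLS certificate verification
    r = requests.get(DATASET_URL, verify=False)
    r.raise_for_status()  # Fail early instead of saving an error page
    # Write only after a successful download so no empty file is left behind
    with open(DATASET_ZIP, 'wb') as f:
        f.write(r.content)
    print('OK')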

Once downloaded, we can proceed to load the dataset. For our purpose (creating an _embedding_), we don't care about the labels (the ratings) or the ids of the examples; we only want the texts, so we are going to extract them and strip the whitespace at the beginning and at the end.
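
To make the file format concrete, here is a minimal sketch of what loading a gzipped JSON-lines file amounts to, using only the standard library. The two sample reviews are made up for illustration; only the `reviewText` field name comes from this notebook.

```python
import gzip
import io
import json

# Hypothetical sample in JSON-lines layout: one JSON object per line
raw = (
    b'{"overall": 5, "reviewText": "  Great software!  "}\n'
    b'{"overall": 2, "reviewText": "Crashes often. "}\n'
)
buffer = io.BytesIO(gzip.compress(raw))  # Stands in for the downloaded .json.gz file

with gzip.open(buffer, 'rt', encoding='utf-8') as f:
    # Keep only the review text of each record, without surrounding whitespace
    texts = [json.loads(line)['reviewText'].strip() for line in f if line.strip()]

print(texts)  # ['Great software!', 'Crashes often.']
```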

In [None]:
corpus = pd.read_json(DATASET_ZIP, lines=True)
corpus = corpus['reviewText'].astype(str).str.strip()
corpus.head()

The variable `corpus` now holds a pandas `Series` with all the review texts in our set. We are going to tokenize each of the comments, converting it into a list of words. For this we will use the `Tokenizer` class included in Keras, although it is important to understand that this step is not trivial and that a quality dataset will probably require considerably more preprocessing (e.g. lemmatization, $n$-grams, ...).

In [None]:
tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(corpus)

At this point, our tokenizer has processed all the comments and extracted all the words, assigning an identifier to each one. We will store this mapping in two dictionaries: one to convert words into integers (to identify the word) and another to convert integers back into words (once we have the integer).

In [None]:
word_index = tokenizer.word_index
index_word = {index: word for word, index in word_index.items()}

print(f'word2id: {dict(list(word_index.items())[0:4])} ...')
print(f'id2word: {dict(list(index_word.items())[0:4])} ...')

Finally, each of the comments in the corpus will be transformed into a list of integers where each token in the corpus will be replaced by the integer it represents. We will also obtain the size of our vocabulary from the number of identified words.

To convert a text string into a sequence of words we can use the Keras function `tf.keras.preprocessing.text.text_to_word_sequence(text)`. From there getting the index of each word is trivial.
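
Before relying on Keras, it may help to see the whole idea on a toy corpus. The sketch below is a simplified pure-Python approximation, not the Keras implementation: the helper `to_words`, its regular expression, and the `toy_*` names are assumptions made for this example.

```python
import re
from collections import Counter


def to_words(text):
    """Lowercase the text and split it into words (a deliberate simplification)."""
    return re.findall(r"[a-z0-9']+", text.lower())


toy_corpus = ['The app works', 'The app crashed', 'Works great']

# Count every word, then give the smallest identifiers to the most frequent ones.
# Identifier 0 is left unused, which is why the vocabulary size is len(...) + 1.
counts = Counter(w for text in toy_corpus for w in to_words(text))
toy_word_index = {w: i for i, (w, _) in enumerate(counts.most_common(), start=1)}
toy_index_word = {i: w for w, i in toy_word_index.items()}

# Replace each word by its identifier
toy_sentences = [[toy_word_index[w] for w in to_words(text)] for text in toy_corpus]
toy_vocab_size = len(toy_word_index) + 1

print(toy_word_index)  # {'the': 1, 'app': 2, 'works': 3, 'crashed': 4, 'great': 5}
print(toy_sentences)   # [[1, 2, 3], [1, 2, 4], [3, 5]]
```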

In [None]:
sentences = [
    [word_index[w] for w in tf.keras.preprocessing.text.text_to_word_sequence(text)]
    for text in corpus
]
vocab_size = len(word_index) + 1

print(f'Corpus sentences: {len(sentences)} sentences')
print(f'Vocabulary Size: {vocab_size} words')
print('Sentence example:')
print(f'- {corpus[5]}')
print(f'- {sentences[5]}')

## Skip-gram generator

Now we will generate the _skip-grams_. The idea is to go through all the sentences in the corpus (each `sentence` of `sentences`) and, given a window size, extract the context of each word (its surrounding words) in order to build pairs of words labeled as contextual or not.

In [None]:
WINDOW_SIZE = 5

x_train, y_train = [], []
for sentence in sentences:
    pairs, are_contextual = tf.keras.preprocessing.sequence.skipgrams(
        sentence,
        vocabulary_size=vocab_size,
        window_size=WINDOW_SIZE,
    )
    x_train.extend(pairs)
    y_train.extend(are_contextual)

x_train = np.array(x_train)
y_train = np.array(y_train).reshape(-1, 1)
dataset = np.hstack((x_train, y_train))

print(f'Dataset shape: {dataset.shape}')
print(dataset)

As the output shows, each pair of words comes with a label stating whether the two words are (1) or are not (0) contextual. This is because the `skipgrams` function transforms a sequence of words (actually, of integers) into tuples of the form:

- `(word, word in context)`, label 1 (positive, are contextual).
- `(word, random word from the vocabulary)`, label 0 (negative, assumed not contextual).
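
The two kinds of tuples can be reproduced with a few lines of plain Python. This is a simplified sketch of the idea, not the Keras implementation (for instance, it does not shuffle the result); the function name `toy_skipgrams` and its `seed` parameter are hypothetical.

```python
import random


def toy_skipgrams(sequence, vocabulary_size, window_size, seed=0):
    """Build positive pairs from a window plus one random negative per positive."""
    rng = random.Random(seed)
    pairs, labels = [], []

    # Positive pairs: every word with each of its neighbours inside the window
    for i, word in enumerate(sequence):
        start = max(0, i - window_size)
        end = min(len(sequence), i + window_size + 1)
        for j in range(start, end):
            if j != i:
                pairs.append([word, sequence[j]])
                labels.append(1)

    # Negative pairs: same first word, second word drawn at random (index 0 is skipped).
    # The random word may occasionally be a real neighbour; it is still labeled 0.
    for word, _ in list(pairs):
        pairs.append([word, rng.randint(1, vocabulary_size - 1)])
        labels.append(0)

    return pairs, labels


pairs, labels = toy_skipgrams([1, 2, 3, 4], vocabulary_size=10, window_size=1)
print(pairs[:6])    # [[1, 2], [2, 1], [2, 3], [3, 2], [3, 4], [4, 3]]
print(sum(labels))  # 6 positives out of 12 pairs
```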

## Model creation and training

We already have a dataset with inputs and their respective outputs. Now the objective is to train a model that is able to determine if two words belong to the same context.

To do this we will create an embedding layer that transforms each word into its feature vector. Both words of a pair (contextual or not) go through this same layer, and the model measures how close (more contextual) or far (less contextual) their vectors are using a similarity measure (the normalized scalar product, i.e., the cosine similarity).

Finally, the output of the network will be a single sigmoid neuron whose activation indicates whether the two words are contextual.

This architecture will force the more contextual words to be closer together, and therefore their feature vectors to be more similar. The weights matrix of the embedding layer is thus expected to converge to a representation of the word features.
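
As a rough sketch of the computation this architecture performs for one pair of words, assuming the scalar product is L2-normalized (a cosine similarity) and ignoring dropout: the `weight` and `bias` values below are made-up stand-ins for the parameters the output neuron would learn, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Scalar product of u and v after normalizing both to unit length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def predict_contextual(u, v, weight=1.0, bias=0.0):
    """Sigmoid of a scaled and shifted similarity: a value between 0 and 1."""
    z = weight * cosine_similarity(u, v) + bias
    return 1.0 / (1.0 + math.exp(-z))


similar = predict_contextual([1.0, 2.0], [2.0, 4.0], weight=5.0)      # Same direction
orthogonal = predict_contextual([1.0, 0.0], [0.0, 1.0], weight=5.0)   # Unrelated
opposite = predict_contextual([1.0, 2.0], [-1.0, -2.0], weight=5.0)   # Opposite direction

print(similar > orthogonal > opposite)  # True
```

With a positive weight, vectors pointing in the same direction push the output towards 1 and vectors pointing in opposite directions push it towards 0, which is what pulls the vectors of contextual words together during training.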

In [None]:
EMBEDDING_DIM = 5

# Inputs to the model
input_target = tf.keras.layers.Input((1,))
input_context = tf.keras.layers.Input((1,))

# Common layers (including the embedding)
embedding_layer = tf.keras.layers.Embedding(
    input_dim=vocab_size,
    output_dim=EMBEDDING_DIM,
    input_length=1,
    name='embedding',
)

reshape_layer = tf.keras.layers.Reshape((EMBEDDING_DIM, 1))

# Model architecture
target_embedding = embedding_layer(input_target)
target_embedding = reshape_layer(target_embedding)
target_embedding = tf.keras.layers.Dropout(0.1)(target_embedding)

context_embedding = embedding_layer(input_context)
context_embedding = reshape_layer(context_embedding)
context_embedding = tf.keras.layers.Dropout(0.1)(context_embedding)

hidden_layer = tf.keras.layers.Dot(axes=1, normalize=True)([target_embedding, context_embedding])
hidden_layer = tf.keras.layers.Reshape((1,))(hidden_layer)

output = tf.keras.layers.Dense(1, activation='sigmoid')(hidden_layer)

model = tf.keras.Model(inputs=[input_target, input_context], outputs=output)
model.summary()
model.compile(loss='binary_crossentropy', optimizer='adam')

Now we only have to train the model on the _skip-grams_ generated previously. We will train for 50 epochs using very large batches (8 × 32768 pairs), without holding out a validation split.

**This step is very expensive** and can take anywhere from quite a few minutes to hours, so unless we have a powerful machine, we had better stop here.

In [None]:
history = model.fit([dataset[:, 0], dataset[:, 1]], dataset[:, 2], batch_size=8*32768, epochs=50)

Let's take a look at the training progress:

In [None]:
pd.DataFrame(history.history).plot()
plt.yscale('log')
plt.xlabel('Epoch num.')
plt.show()

## Embeddings

Once the model is trained, we already have a matrix with the weights of the features for each word. To see a representation we can take them directly and print them in a dataframe.

In [None]:
# Row 0 is skipped because no word is assigned the identifier 0
weights = embedding_layer.get_weights()[0][1:]

df = pd.DataFrame(weights, index=index_word.values())
df.head(10)

Let's search for the words most similar to a given one using, for example, the Euclidean distance between their vectors:
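
As a small self-contained illustration of such a search, the sketch below ranks words by Euclidean distance on a toy set of two-dimensional vectors. The vectors are invented for the example (they are not trained embeddings), and `closest_words` is a hypothetical helper. It requires Python 3.8 or later for `math.dist`.

```python
import math

# Made-up 2D vectors, only to illustrate the search
toy_embeddings = {
    'man': [0.9, 0.1],
    'woman': [0.8, 0.2],
    'boy': [0.7, 0.0],
    'disk': [0.0, 1.0],
}


def closest_words(word, embeddings, k):
    """Return the k words whose vectors are nearest to the vector of `word`."""
    query = embeddings[word]
    others = [w for w in embeddings if w != word]
    return sorted(others, key=lambda w: math.dist(query, embeddings[w]))[:k]


print(closest_words('man', toy_embeddings, k=2))  # ['woman', 'boy']
```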

In [None]:
NUM_CLOSEST_WORDS = 10
WORD = 'man'

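# Row i of `weights` holds the vector of the word with identifier i + 1
v1 = weights[word_index[WORD] - 1]
words = sorted(
    word_index,
    key=lambda w: np.linalg.norm(v1 - weights[word_index[w] - 1])
)
# WORD itself is among the results (distance 0), hence the + 1
df.loc[words[:NUM_CLOSEST_WORDS + 1], :]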

## Conclusions

In summary, we have implemented an embedding using the _skip-gram_ technique of _word2vec_ and seen how it represents words more meaningfully in a vector space. This technique is able to capture the semantics of words while representing them in a vector space of much lower dimension than a _one-hot_ representation would require.

We also saw that **it is a very expensive technique**, so training it from scratch does not make much sense (in general), since many ready-made _embeddings_ are available for download.

***

<div><img style="float: right; width: 120px; vertical-align:top" src="https://mirrors.creativecommons.org/presskit/buttons/88x31/png/by-nc-sa.png" alt="Creative Commons by-nc-sa logo" />

[Back to top](#top)

</div>