In [1]:
pip install tensorflow

Note: you may need to restart the kernel to use updated packages.


In [43]:
import io
import itertools
import numpy as np
import os
import re
import string
import tensorflow as tf
import tqdm

from tensorflow.keras import Model, Sequential
from tensorflow.keras.layers import Activation, Dense, Dot, Embedding, Flatten, GlobalAveragePooling1D, Reshape
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

In [44]:
SEED = 42 
AUTOTUNE = tf.data.experimental.AUTOTUNE

### Example: Vectorizing a sentence


`I want a glass of orange juice to go along with my cereal`

Tokenizing the sentence -

In [45]:
sentence = "I want a glass of orange juice to go along with my cereal"
tokens = list(sentence.lower().split())
print(len(tokens))

13


Creating a vocabulary that maps tokens to integer indices (models operate on integers rather than on raw strings).

In [46]:
# Start indexing real tokens at 1; index 0 is reserved for padding
vocab, index = {}, 1
# Add the padding token at index 0
vocab['<pad>'] = 0 
for token in tokens:
    if token not in vocab: 
        # Assign the next free index to each previously unseen token
        vocab[token] = index 
        index += 1
vocab_size = len(vocab)
print(vocab)

{'<pad>': 0, 'i': 1, 'want': 2, 'a': 3, 'glass': 4, 'of': 5, 'orange': 6, 'juice': 7, 'to': 8, 'go': 9, 'along': 10, 'with': 11, 'my': 12, 'cereal': 13}


Creating an inverse vocabulary to save mappings from integer indices to tokens. (We can use this later when we want to view the word relations visually, once the embeddings are trained)

In [47]:
inverse_vocab = {index: token for token, index in vocab.items()}
print(inverse_vocab)

{0: '<pad>', 1: 'i', 2: 'want', 3: 'a', 4: 'glass', 5: 'of', 6: 'orange', 7: 'juice', 8: 'to', 9: 'go', 10: 'along', 11: 'with', 12: 'my', 13: 'cereal'}


Vectorizing the sentence -

In [48]:
example_sequence = [vocab[token] for token in tokens]
print(example_sequence)

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]


Generating skip-grams from one sentence -

In [49]:
# Setting the window size to be 2
window_size = 2

# negative_samples is set to 0 here; negative sampling is done separately in a later step.
positive_skip_grams, _ = tf.keras.preprocessing.sequence.skipgrams(example_sequence, vocabulary_size = vocab_size, 
                                                                   window_size = window_size, negative_samples = 0)
# total positive skipgram pairs generated for the sentence taken
print(len(positive_skip_grams))

46


Printing a few of the 46 positive skip-grams generated for the given sentence -

In [50]:
for target, context in positive_skip_grams[:5]:
    # formatted string
    print(f"({target}, {context}): ({inverse_vocab[target]}, {inverse_vocab[context]})")

(3, 5): (a, of)
(8, 9): (to, go)
(12, 11): (my, with)
(13, 11): (cereal, with)
(5, 6): (of, orange)
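
To see where the count of 46 comes from, here is a minimal pure-Python sketch of positive skip-gram generation. It is not the Keras implementation: `positive_skip_gram_pairs` is a hypothetical helper written for illustration, and it does no shuffling and no subsampling.

```python
def positive_skip_gram_pairs(sequence, window_size):
    """Return every (target, context) pair whose positions are at most
    window_size apart (illustrative helper, not part of Keras)."""
    pairs = []
    for i, target in enumerate(sequence):
        start = max(0, i - window_size)
        end = min(len(sequence), i + window_size + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, sequence[j]))
    return pairs

# The 13 token indices of the example sentence
example_sequence = list(range(1, 14))
pairs = positive_skip_gram_pairs(example_sequence, window_size=2)
print(len(pairs))  # 46
```

Words in the middle of the sentence contribute 4 pairs each, while the two words at either end contribute 2 and 3, which gives 9 * 4 + 2 * (2 + 3) = 46.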


### Negative Sampling 

We found the positive skip-grams by sliding a window over the sentence with the `skipgrams` function. To produce additional skip-gram pairs that act as negative examples for training, we sample random words from the vocabulary. The sampler is called on one skip-gram at a time, and the context word is passed as the true class with the intent of excluding it from the samples. The number of negative samples per positive context word (`num_ns`) works best in the range [5, 20] for smaller datasets, while a value in [2, 5] is enough for larger datasets.

In [51]:
# Get target and context words for one positive skip-gram.
target_word, context_word = positive_skip_grams[0]

# Printing the target and context words
print(inverse_vocab[target_word])
print(inverse_vocab[context_word])

# Set the number of negative samples per positive context. 
num_ns = 4

context_class = tf.reshape(tf.constant(context_word, dtype = "int64"), (1, 1))
negative_skip_grams, _, _ = tf.random.log_uniform_candidate_sampler(
    true_classes = context_class, # the positive context class for this skip-gram
    num_true = 1, # each positive skip-gram has 1 positive context class
    num_sampled = num_ns, # number of negative context words to sample
    unique = True, # all the negative samples should be unique
    range_max = vocab_size, # sample indices from the half-open range [0, vocab_size)
    seed = SEED, # seed for reproducibility (getting same sets later on)
    name = "negative_sampling" # name of this operation
)
print(context_class)
print(negative_skip_grams)
# index is a tensor and is not hashable
print([inverse_vocab[index.numpy()] for index in negative_skip_grams])

a
of
tf.Tensor([[5]], shape=(1, 1), dtype=int64)
tf.Tensor([0 1 3 6], shape=(4,), dtype=int64)
['<pad>', 'i', 'a', 'orange']


Note: the sampling method above can also return a positive skip-gram pair. The true class may be drawn as a negative example, although it should never be sampled. (In the run above, the target word `a` itself also appears among the negatives.) This still needs to be fixed.
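
One possible way to address this is to reject unwanted candidates while sampling. The sketch below is pure Python and is not the TensorFlow sampler: `log_uniform_probs` and `sample_negatives` are hypothetical helpers, and the probability formula is the log-uniform one, P(k) = (log(k + 2) - log(k + 1)) / log(range_max + 1).

```python
import math
import random

def log_uniform_probs(range_max):
    """Log-uniform (Zipfian) probabilities for classes 0 .. range_max - 1."""
    norm = math.log(range_max + 1)
    return [(math.log(k + 2) - math.log(k + 1)) / norm for k in range(range_max)]

def sample_negatives(excluded, num_sampled, range_max, rng):
    """Draw num_sampled unique classes, rejecting every class in `excluded`."""
    if range_max - len(excluded) < num_sampled:
        raise ValueError("not enough classes left to sample from")
    classes = list(range(range_max))
    probs = log_uniform_probs(range_max)
    negatives = []
    while len(negatives) < num_sampled:
        candidate = rng.choices(classes, weights=probs, k=1)[0]
        # Rejection step: skip excluded classes and duplicates
        if candidate not in excluded and candidate not in negatives:
            negatives.append(candidate)
    return negatives

rng = random.Random(42)
# Exclude the positive context word (index 5, 'of') from the negatives
negatives = sample_negatives(excluded={5}, num_sampled=4, range_max=14, rng=rng)
print(negatives)
```

Passing `excluded={5, 3}` would also keep the target word out of the negatives. Rejection sampling like this is simple but slow in pure Python, so treat it as an illustration of the idea rather than a drop-in replacement.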

### Constructing one training example

For a given positive skip-gram, we now have `num_ns` negatively sampled context words that are intended to lie outside the window-size neighborhood of `target_word`. Batch the one positive `context_word` and the `num_ns` negative context words into one tensor. This produces one positive skip-gram (labelled 1) and `num_ns` negative samples (labelled 0) for each target word.

In [52]:
# Add a dimension so you can use concatenation (on the next step).
print(negative_skip_grams.shape)
negative_skip_grams = tf.expand_dims(negative_skip_grams, 1)
print(negative_skip_grams.shape)

# Concat positive context word with negative sampled words.
context = tf.concat([context_class, negative_skip_grams], 0)

# Label first context word as 1 (positive) followed by num_ns 0s (negative).
label = tf.constant([1] + [0]*num_ns, dtype = "int64") 

# Squeeze target to a scalar (shape ()) and context and label to shape (num_ns+1,).
target = tf.squeeze(target_word)
context = tf.squeeze(context)
label =  tf.squeeze(label)
print(target.shape)
print(context.shape)
print(label.shape)

(4,)
(4, 1)
()
(5,)
(5,)


Checking the result on the example above -

In [53]:
print(f"target_index    : {target}")
print(f"target_word     : {inverse_vocab[target_word]}")
print(f"context_indices : {context}")
# context word + num_ns words obtained from negative sampling
print(f"context_words   : {[inverse_vocab[c.numpy()] for c in context]}")
# label = 1 for context_word and 0 for rest num_ns words
print(f"label           : {label}")

target_index    : 3
target_word     : a
context_indices : [5 0 1 3 6]
context_words   : ['of', '<pad>', 'i', 'a', 'orange']
label           : [1 0 0 0 0]


target - shape () (a scalar)
context, label - shape (num_ns + 1,)

In [54]:
print("target  :", target)
print("context :", context)
print("label   :", label)

target  : tf.Tensor(3, shape=(), dtype=int32)
context : tf.Tensor([5 0 1 3 6], shape=(5,), dtype=int64)
label   : tf.Tensor([1 0 0 0 0], shape=(5,), dtype=int64)


Training examples drawn from commonly occurring words (such as `the, is, on`) don't add much useful information for the model to learn from.<br>
So, subsampling frequent words is a helpful practice for improving embedding quality.<br>
The `tf.keras.preprocessing.sequence.skipgrams` function accepts a `sampling_table` argument that encodes the probability of sampling each token.<br><br>
`tf.keras.preprocessing.sequence.make_sampling_table` generates a probabilistic sampling table based on word-frequency rank, which can be passed to the `skipgrams` function.<br>
Sampling probabilities for a vocab_size of 10 - 

In [55]:
sampling_table = tf.keras.preprocessing.sequence.make_sampling_table(size=10)
print(sampling_table)

[0.00315225 0.00315225 0.00547597 0.00741556 0.00912817 0.01068435
 0.01212381 0.01347162 0.01474487 0.0159558 ]


`sampling_table[i]` = probability of sampling the i-th most common word in a dataset. 
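
The numbers printed above can be reproduced with a short pure-Python sketch. This is a reconstruction, not the Keras source: `zipf_sampling_table` is a hypothetical name, and the constants (`gamma = 0.577`, `sampling_factor = 1e-5`) are assumptions chosen because they match the printed table.

```python
import math

def zipf_sampling_table(size, sampling_factor=1e-5):
    """Rank-based keep probabilities under an assumed Zipf frequency law.

    The frequency of the word at a given rank is approximated as
    1 / (rank * (log(rank) + gamma) + 1/2 - 1/(12 * rank)).
    """
    gamma = 0.577  # approximation of the Euler-Mascheroni constant
    table = []
    for i in range(size):
        rank = max(i, 1)  # index 0 is treated like rank 1
        inv_freq = rank * (math.log(rank) + gamma) + 0.5 - 1.0 / (12.0 * rank)
        f = sampling_factor * inv_freq
        table.append(min(1.0, f / math.sqrt(f)))
    return table

table = zipf_sampling_table(10)
print([round(p, 8) for p in table])
```

Rarer words (higher rank) get a higher probability of being kept, while the most frequent words are kept only a fraction of a percent of the time.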

The function assumes that the word frequencies follow a [Zipf distribution](https://en.wikipedia.org/wiki/Zipf%27s_law). `tf.random.log_uniform_candidate_sampler` likewise assumes that the vocabulary frequencies follow a log-uniform (Zipf) distribution. Using this distribution-weighted sampling also helps approximate the Noise Contrastive Estimation (NCE) loss with simpler loss functions for training a negative sampling objective.

The sampling table is built before sampling skip-gram word pairs.

### Generating training data

Compiling all the steps described above into a function that can be called on a list of vectorized sentences obtained from any text dataset. 

In [56]:
# Generates skip-gram pairs with negative sampling for a list of sequences
# (int-encoded sentences) based on window size, number of negative samples
# and vocabulary size.
def generate_training_data(sequences, window_size, num_ns, vocab_size, seed):
    # Elements of each training example are appended to these lists.
    targets, contexts, labels = [], [], []

    # Build the sampling table for vocab_size tokens.
    sampling_table = tf.keras.preprocessing.sequence.make_sampling_table(vocab_size)

    # Iterate over all sequences (sentences) in dataset.
    for sequence in tqdm.tqdm(sequences):

        # Generate positive skip-gram pairs for a sequence (sentence).
        positive_skip_grams, _ = tf.keras.preprocessing.sequence.skipgrams(
                                              sequence, 
                                              vocabulary_size = vocab_size,
                                              sampling_table = sampling_table,
                                              window_size = window_size,
                                              negative_samples = 0)

        # Iterate over each positive skip-gram pair to produce training examples 
        # with positive context word and negative samples.
        for target_word, context_word in positive_skip_grams:
            context_class = tf.expand_dims(tf.constant([context_word], dtype = "int64"), 1)
            negative_skip_grams, _, _ = tf.random.log_uniform_candidate_sampler(
                                              true_classes = context_class,
                                              num_true = 1, 
                                              num_sampled = num_ns, 
                                              unique = True, 
                                              range_max = vocab_size, 
                                              seed = seed, 
                                              name = "negative_sampling")

            # Build context and label vectors (for one target word)
            negative_skip_grams = tf.expand_dims(negative_skip_grams, 1)
            context = tf.concat([context_class, negative_skip_grams], 0)
            label = tf.constant([1] + [0]*num_ns, dtype = "int64")

            # Append each element from the training example to global lists.
            targets.append(target_word)
            contexts.append(context)
            labels.append(label)

    return targets, contexts, labels

### Preparing data for Word2vec

So far we have worked with a single sentence for skip-gram, negative-sampling-based Word2Vec. We now generate training examples from a larger list of sentences.

#### Downloading text corpus 

In [57]:
path_to_file = tf.keras.utils.get_file('shakespeare.txt', 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')

In [58]:
# Reading text from the file. Let's look at the first few lines -

with open(path_to_file) as f: 
    lines = f.read().splitlines()
for line in lines[:20]:
    print(line)

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.


In [59]:
# Constructing object for further use from the non-empty lines
# strings of length 0 are discarded 
text_ds = tf.data.TextLineDataset(path_to_file).filter(lambda x: tf.cast(tf.strings.length(x), bool))

#### Vectorizing the sentences 

Removing punctuation and converting all text to lowercase

In [60]:
def custom_standardization(input_data):
    lowercase = tf.strings.lower(input_data)
    return tf.strings.regex_replace(lowercase,'[%s]' % re.escape(string.punctuation), '')
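
To check what this standardization does to a line, here is a plain-Python equivalent that can be run without TensorFlow. `standardize` is a hypothetical stand-in for `custom_standardization`, using `str.lower` and `re.sub` in place of the `tf.strings` ops.

```python
import re
import string

def standardize(text):
    """Lowercase the text, then delete every ASCII punctuation character."""
    lowercase = text.lower()
    return re.sub('[%s]' % re.escape(string.punctuation), '', lowercase)

print(standardize("We know't, we know't."))  # we knowt we knowt
```

Note that apostrophes are punctuation too, so contractions such as `know't` collapse into a single token (`knowt`).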

In [61]:
# Vocabulary size and number of words in a sequence.
vocab_size = 4096
sequence_length = 10

# Text vectorization layer - used to normalize, split, and map strings to integers. 
vectorize_layer = TextVectorization(
                    # calling above method to format the text (remove punc, convert to lowercase)
                    standardize = custom_standardization, 
                    max_tokens = vocab_size,
                    output_mode = 'int',
                    # Setting output_sequence_length length to pad all samples to same length.
                    output_sequence_length = sequence_length)

Creating vocabulary from the object we created (which contains non-empty text lines only), by using the `adapt` function. 

In [62]:
vectorize_layer.adapt(text_ds.batch(1024))

# Adapting the state of the layer to represent the text corpus
# Now we can access the vocabulary using get_vocabulary()

In [63]:
# Saving the created vocabulary.

inverse_vocab = vectorize_layer.get_vocabulary()
print(inverse_vocab[:20])

['', '[UNK]', 'the', 'and', 'to', 'i', 'of', 'you', 'my', 'a', 'that', 'in', 'is', 'not', 'for', 'with', 'me', 'it', 'be', 'your']


Using the vectorize_layer to generate vectors for each element in text_ds (has all non-empty lines of text corpus).

In [64]:
def vectorize_text(text):
    text = tf.expand_dims(text, -1)
    return tf.squeeze(vectorize_layer(text))

# Vectorizing the data in text_ds
text_vector_ds = text_ds.batch(1024).prefetch(AUTOTUNE).map(vectorize_layer).unbatch()

#### Getting sequences from the dataset

Now we have a dataset `text_vector_ds` of integer-encoded sentences. To produce positive and negative examples, we have to iterate over each sentence in the dataset, so we flatten the dataset into a list of sentence vector sequences.

The `generate_training_data` function defined earlier takes plain Python / NumPy inputs rather than TensorFlow tensors, so we use `as_numpy_iterator()` to do the conversion.

In [65]:
sequences = list(text_vector_ds.as_numpy_iterator())
print(len(sequences))

32777


Checking what `sequences` looks like by printing a few examples -

In [66]:
for seq in sequences[:5]:
    print(f"{seq} => {[inverse_vocab[i] for i in seq]}")

[ 89 270   0   0   0   0   0   0   0   0] => ['first', 'citizen', '', '', '', '', '', '', '', '']
[138  36 982 144 673 125  16 106   0   0] => ['before', 'we', 'proceed', 'any', 'further', 'hear', 'me', 'speak', '', '']
[34  0  0  0  0  0  0  0  0  0] => ['all', '', '', '', '', '', '', '', '', '']
[106 106   0   0   0   0   0   0   0   0] => ['speak', 'speak', '', '', '', '', '', '', '', '']
[ 89 270   0   0   0   0   0   0   0   0] => ['first', 'citizen', '', '', '', '', '', '', '', '']
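
The trailing zeros above come from `output_sequence_length = sequence_length`: every line is padded (or cut) to exactly 10 token ids. A minimal sketch of that behaviour, with `pad_or_truncate` as a hypothetical helper rather than anything from the layer itself:

```python
def pad_or_truncate(token_ids, sequence_length, pad_id=0):
    """Cut the list to sequence_length, then right-pad with pad_id."""
    clipped = token_ids[:sequence_length]
    return clipped + [pad_id] * (sequence_length - len(clipped))

# 'first citizen' has 2 tokens, so 8 padding zeros are appended
print(pad_or_truncate([89, 270], 10))  # [89, 270, 0, 0, 0, 0, 0, 0, 0, 0]

# A line longer than 10 tokens loses everything after the 10th token
print(pad_or_truncate(list(range(1, 13)), 10))  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```

Since index 0 is the padding token, short lines such as `all` yield very few usable skip-grams.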


#### Generating training examples from the sequences

`sequences` - a list of int encoded sentences. We call the `generate_training_data()` function defined earlier to generate training examples for the Word2Vec model. 

`generate_training_data()` - the function iterates over each word from each sequence to collect positive and negative context words. The lengths of `targets`, `contexts` and `labels` should be the same, each equal to the total number of training examples.

In [67]:
targets, contexts, labels = generate_training_data(
                                    sequences = sequences, 
                                    window_size = 2, 
                                    num_ns = 4, 
                                    vocab_size = vocab_size, 
                                    seed = SEED)
print(len(targets), len(contexts), len(labels))

100%|██████████| 32777/32777 [00:05<00:00, 6225.90it/s]


65127 65127 65127


### Configuring the dataset for better performance

For efficient batching, especially when the number of training examples is large, we use `tf.data.Dataset`. The dataset built below has elements of the form `((target_word, context_word), label)`, which we use to train our Word2Vec model.

In [68]:
BATCH_SIZE = 1024
BUFFER_SIZE = 10000
dataset = tf.data.Dataset.from_tensor_slices(((targets, contexts), labels))
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder = True)
print(dataset)

<BatchDataset shapes: (((1024,), (1024, 5, 1)), (1024, 5)), types: ((tf.int32, tf.int64), tf.int64)>


In [69]:
# Adding cache() and prefetch() to improve input pipeline performance

dataset = dataset.cache().prefetch(buffer_size=AUTOTUNE)
print(dataset)

<PrefetchDataset shapes: (((1024,), (1024, 5, 1)), (1024, 5)), types: ((tf.int32, tf.int64), tf.int64)>
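
As a quick sanity check on the batching, the arithmetic below works out how many batches one epoch contains with `drop_remainder = True`. It only uses the example count printed earlier (65127) and the batch size set above.

```python
num_examples = 65127  # length of targets / contexts / labels printed earlier
BATCH_SIZE = 1024

# drop_remainder=True keeps only full batches
full_batches, leftover = divmod(num_examples, BATCH_SIZE)
print(full_batches)  # 63
print(leftover)      # 615
```

So each epoch runs 63 steps, and 615 examples do not fit into a full batch.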


We can implement Word2Vec model as a classifier to distinguish between true context words from positive skip-grams and false context words obtained through negative sampling. 

We perform a dot product between the embeddings of target and context words to obtain predictions for labels and compute loss against true labels in the dataset.
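
The dot-product step can be illustrated on toy numbers without TensorFlow. The vectors below are made up for illustration (3 dimensions instead of 128), with one target embedding and `num_ns + 1 = 5` context embeddings.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Toy embeddings (hypothetical values)
target_vec = [0.5, -1.0, 2.0]
context_vecs = [
    [1.0, 0.0, 1.0],   # the positive context word
    [0.0, 1.0, 0.0],   # negative sample 1
    [-1.0, 0.0, 0.0],  # negative sample 2
    [0.0, 0.0, -1.0],  # negative sample 3
    [2.0, 1.0, 0.0],   # negative sample 4
]

# One logit per candidate context word
logits = [dot(c, target_vec) for c in context_vecs]
print(logits)  # [2.5, -1.0, -0.5, -2.0, 0.0]
```

Training pushes the first logit (the true context) up relative to the other four, which is what the label `[1 0 0 0 0]` encodes.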

### Layers used in Word2Vec model

* `target_embedding`: To look up the embedding of a word when it appears as a target word. The number of parameters in this layer is `(vocab_size * embedding_dim)`.
* `context_embedding`: To look up the embedding of a word when it appears as a context word. The number of parameters in this layer is also `(vocab_size * embedding_dim)`.
* `dots`: To compute the dot product of target and context embeddings from a training pair.
* `flatten`: To flatten the results of `dots` layer into logits.

The first two layers above can also share their weights, and a concatenation of both embeddings can be used as the final Word2Vec embedding.
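
For a sense of scale, the parameter counts above can be worked out directly. This assumes `vocab_size = 4096` as set earlier and `embedding_dim = 128`, the value used when the model is built.

```python
vocab_size = 4096
embedding_dim = 128

# Each embedding layer is a (vocab_size x embedding_dim) lookup table
params_per_embedding = vocab_size * embedding_dim
total_params = 2 * params_per_embedding  # target + context embeddings

print(params_per_embedding)  # 524288
print(total_params)          # 1048576
```

The `dots` and `flatten` layers have no trainable weights, so the two embedding tables account for all of the model's parameters.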

In [70]:
class Word2Vec(Model):
    def __init__(self, vocab_size, embedding_dim):
        super(Word2Vec, self).__init__()
        # target embedding
        self.target_embedding = Embedding(vocab_size, 
                                          embedding_dim,
                                          input_length = 1,
                                          name = "w2v_embedding", )
        # context embedding
        self.context_embedding = Embedding(vocab_size, 
                                           embedding_dim, 
                                           input_length = num_ns + 1)
        self.dots = Dot(axes = (3,2))
        self.flatten = Flatten()

    # function that accepts (target, context) pairs which can then
    # be passed into their corresponding embedding layer.
    # Reshape the context_embedding to perform a dot product with 
    # the target_embedding and return the flattened result.
    def call(self, pair):
        target, context = pair
        we = self.target_embedding(target)
        ce = self.context_embedding(context)
        dots = self.dots([ce, we])
        return self.flatten(dots)

### Defining loss function and compiling the model

Loss function used - categorical cross-entropy<br>
The Adam optimizer is used.
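
For reference, this is what categorical cross-entropy computes from logits for a single training example, sketched in plain Python. `softmax_cross_entropy` is a hypothetical helper, and the logits are toy values rather than model output.

```python
import math

def softmax_cross_entropy(logits, one_hot_label):
    """Cross-entropy between softmax(logits) and a one-hot label."""
    # Subtract the max logit before exponentiating for numerical stability
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    # -sum(y_i * log_softmax_i)
    return -sum(y * (z - log_sum_exp) for y, z in zip(one_hot_label, logits))

label = [1, 0, 0, 0, 0]  # positive context first, then num_ns negatives

# An untrained model with all-equal logits is at chance level: log(5)
print(softmax_cross_entropy([0.0] * 5, label))

# A confident, correct prediction gives a loss close to 0
print(softmax_cross_entropy([10.0, 0.0, 0.0, 0.0, 0.0], label))
```

With `from_logits = True`, the softmax is applied inside the loss, which is why the model returns raw dot products.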

In [71]:
embedding_dim = 128
word2vec = Word2Vec(vocab_size, embedding_dim)
word2vec.compile(optimizer = 'adam',
                 loss = tf.keras.losses.CategoricalCrossentropy(from_logits = True),
                 metrics = ['accuracy'])

In [72]:
# Callback to log training statistics for tensorboard

tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir = "logs")

Training the model on the previously prepared dataset for a fixed number of epochs (100 here)

In [73]:
word2vec.fit(dataset, epochs = 100, callbacks = [tensorboard_callback])

Epoch 1/100
...
Epoch 78

<tensorflow.python.keras.callbacks.History at 0x64229f350>

### Inspecting and Analysing the Embeddings

Getting the weights from the model using `get_layer()` and `get_weights()`. The `get_vocabulary()` function provides the vocabulary used to build a metadata file with one token per line.

In [75]:
weights = word2vec.get_layer('w2v_embedding').get_weights()[0]
vocab = vectorize_layer.get_vocabulary()

Creating and saving the vectors and metadata file

In [76]:
# The context manager closes both files even if writing fails part-way.
with io.open('vectors.tsv', 'w', encoding='utf-8') as out_v, \
     io.open('metadata.tsv', 'w', encoding='utf-8') as out_m:
    for index, word in enumerate(vocab):
        # skipping 0 since it's padding.
        if index == 0:
            continue
        vec = weights[index]
        out_v.write('\t'.join([str(x) for x in vec]) + "\n")
        out_m.write(word + "\n")

Downloading the `vectors.tsv` and `metadata.tsv` to analyze the obtained embeddings

In [77]:
try:
    from google.colab import files
    files.download('vectors.tsv')
    files.download('metadata.tsv')
except Exception:
    # Not running in Colab; the files stay in the working directory.
    pass
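
One simple way to analyze the embeddings is to rank words by cosine similarity to a query word. The sketch below runs on toy vectors (hypothetical values, 3 dimensions); with the real data, the dictionary would be built from the rows of `weights` and the tokens in `vocab`. `cosine_similarity` and `nearest` are illustrative helpers.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(word, embeddings, top_k=3):
    """Return the top_k other words, most similar to `word` first."""
    query = embeddings[word]
    scored = [(cosine_similarity(query, vec), other)
              for other, vec in embeddings.items() if other != word]
    scored.sort(reverse=True)
    return [other for _, other in scored[:top_k]]

# Toy embeddings standing in for rows of `weights`
toy_embeddings = {
    "king": [1.0, 1.0, 0.0],
    "queen": [0.9, 1.0, 0.1],
    "apple": [0.0, 0.1, 1.0],
}
print(nearest("king", toy_embeddings, top_k=2))  # ['queen', 'apple']
```

Cosine similarity ignores vector length, so it compares the direction of two embeddings rather than their magnitude.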