# Introduction

In this notebook we run an ablation on some of the discrepancies we noticed in the [official TensorFlow tutorial for Word2Vec](https://www.tensorflow.org/tutorials/text/word2vec).

Specifically, we ablate two points:
- Using the text vectorization layer.
- Customizing the negative pairs so that no negative word comes from within the specified window.

In [1]:
import tensorflow as tf
print(tf.__version__)

SEED = 42 
AUTOTUNE = tf.data.AUTOTUNE

2.4.1


# Skip-Grams with a single sentence

In [2]:
sentence = "The wide road shimmered in the hot sun"

# Tokenize the sentence
tokens = list(sentence.lower().split())
print(f'Number of tokens: {len(tokens)}')

# create word2index
word_index = {}
index = 1
word_index['<pad>'] = 0 # add a padding token 
for token in tokens:
  if token not in word_index: 
    word_index[token] = index
    index += 1
vocab_size = len(word_index)
print(f'Vocab: {word_index}')

inverse_vocab = {index: token for token, index in word_index.items()}
print(f'Inverse Vocab: {inverse_vocab}')

example_sequence = [word_index[word] for word in tokens]
print(f'Tokenized sentence: {example_sequence}')

Number of tokens: 8
Vocab: {'<pad>': 0, 'the': 1, 'wide': 2, 'road': 3, 'shimmered': 4, 'in': 5, 'hot': 6, 'sun': 7}
Inverse Vocab: {0: '<pad>', 1: 'the', 2: 'wide', 3: 'road', 4: 'shimmered', 5: 'in', 6: 'hot', 7: 'sun'}
Tokenized sentence: [1, 2, 3, 4, 5, 1, 6, 7]


In [3]:
# The positive skip grams
window_size = 2
positive_skip_grams, _ = tf.keras.preprocessing.sequence.skipgrams(
      example_sequence, 
      vocabulary_size=vocab_size,
      window_size=window_size,
      negative_samples=0)
print(len(positive_skip_grams))

26
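As a cross-check on the count above, the positive pairs can be enumerated by hand. The following is a minimal pure-Python sketch, not the Keras implementation: `enumerate_skip_grams` is a hypothetical helper, it emits pairs in order rather than shuffled, it applies no sampling table, and it assumes index 0 is padding (as in the `<pad>` entry of the vocabulary above).

```python
def enumerate_skip_grams(sequence, window_size, pad_index=0):
    """Return every (target, context) pair within `window_size` positions."""
    pairs = []
    for i, target in enumerate(sequence):
        if target == pad_index:
            continue  # assumption: padding never acts as a target
        start = max(0, i - window_size)
        end = min(len(sequence), i + window_size + 1)
        for j in range(start, end):
            # Skip the target position itself and any padding neighbour.
            if j != i and sequence[j] != pad_index:
                pairs.append((target, sequence[j]))
    return pairs

example_sequence = [1, 2, 3, 4, 5, 1, 6, 7]
pairs = enumerate_skip_grams(example_sequence, window_size=2)
print(len(pairs))
```

For the eight-token sentence with a window of 2, the edge tokens contribute 2 and 3 pairs and the four middle tokens contribute 4 each, which gives the 26 pairs reported above.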


In [4]:
print(f"Sentence: {sentence}")
for target, context in positive_skip_grams[:5]:
  print(f"({target}, {context}): ({inverse_vocab[target]}, {inverse_vocab[context]})")

Sentence: The wide road shimmered in the hot sun
(2, 4): (wide, shimmered)
(6, 7): (hot, sun)
(1, 4): (the, shimmered)
(1, 5): (the, in)
(3, 2): (road, wide)


In [5]:
import random

In [6]:
# Get target and context words for one positive skip-gram.
target_word, context_word = positive_skip_grams[0]
print(f"({target_word}, {context_word}): ({inverse_vocab[target_word]}, {inverse_vocab[context_word]})")
# Set the number of negative samples per positive context. 
num_ns = 4

list_of_words = list(range(100))

context_words = [b if a == target_word else -1 for a,b in positive_skip_grams] + [target_word]
context_words = list(filter(lambda x: x != -1, context_words))
print(f'Context Words for `{target_word}`: {context_words}')

negative_words = list(filter(lambda i: i not in context_words, list_of_words))

negative_sampling_candidates = tf.constant(random.sample(negative_words, num_ns))

print(negative_sampling_candidates)

(2, 4): (wide, shimmered)
Context Words for `2`: [4, 3, 1, 2]
tf.Tensor([32 85 34 74], shape=(4,), dtype=int32)
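The cell above draws candidates from `range(100)`, although the toy vocabulary has only 8 entries, so ids such as 32 or 85 do not map to any word. A sketch that restricts the candidates to the real vocabulary follows; `negative_candidates` is a hypothetical helper, and excluding the padding index is an assumption rather than something the cell above does.

```python
def negative_candidates(target, positive_pairs, vocab_size, pad_index=0):
    """Ids that are neither the target, one of its context words, nor padding."""
    excluded = {context for t, context in positive_pairs if t == target}
    excluded.update({target, pad_index})
    return [i for i in range(vocab_size) if i not in excluded]

# Pairs for target 2 ("wide"), taken from the context words printed above.
pairs_for_wide = [(2, 4), (2, 3), (2, 1)]
candidates = negative_candidates(2, pairs_for_wide, vocab_size=8)
print(candidates)  # [5, 6, 7]
```

Only three candidates remain here, so drawing `num_ns = 4` distinct negatives is not possible on this toy vocabulary; the window-aware scheme needs a vocabulary comfortably larger than the window.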


# Data
We will be working with a plain-text copy of *War and Peace* (the official guide uses a Shakespeare text file).

In [7]:
from tensorflow.keras.utils import get_file

In [39]:
# War and Peace text file
path_to_file = get_file(fname='warpeace_input.txt',
                        origin='https://cs.stanford.edu/people/karpathy/char-rnn/warpeace_input.txt')

print(f'[INFO] Path to file: {path_to_file}')

Downloading data from https://cs.stanford.edu/people/karpathy/char-rnn/warpeace_input.txt
[INFO] Path to file: /root/.keras/datasets/warpeace_input.txt


## Text
In this snippet we look at the text file. I suggest taking some time to read through the data, even if only at a glance. This step is not mandatory, but it builds a mental map of what we are going to model.

In [40]:
# Visualize the text data
with open(path_to_file) as f:
    lines = f.read().splitlines()
for line in lines[:5]:
    print(line)

﻿"Well, Prince, so Genoa and Lucca are now just family estates of the
Buonapartes. But I warn you, if you don't tell me that this means war,
if you still try to defend the infamies and horrors perpetrated by that
Antichrist--I really believe he is Antichrist--I will have nothing more
to do with you and you are no longer my friend, no longer my 'faithful


In [41]:
# Create a `tf.data` dataset with all the non-empty lines
text_ds = tf.data.TextLineDataset(path_to_file).filter(lambda x: tf.cast(tf.strings.length(x), bool))

for text in text_ds.take(5):
    print(text)

tf.Tensor(b'\xef\xbb\xbf"Well, Prince, so Genoa and Lucca are now just family estates of the', shape=(), dtype=string)
tf.Tensor(b"Buonapartes. But I warn you, if you don't tell me that this means war,", shape=(), dtype=string)
tf.Tensor(b'if you still try to defend the infamies and horrors perpetrated by that', shape=(), dtype=string)
tf.Tensor(b'Antichrist--I really believe he is Antichrist--I will have nothing more', shape=(), dtype=string)
tf.Tensor(b"to do with you and you are no longer my friend, no longer my 'faithful", shape=(), dtype=string)


In [11]:
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
import re
import string

In [42]:
# We create a custom standardization function to lowercase the text and 
# remove punctuation.
def custom_standardization(input_data):
    lowercase = tf.strings.lower(input_data)
    return tf.strings.regex_replace(lowercase,
                                    '[%s]' % re.escape(string.punctuation), '')

# Define the vocabulary size and number of words in a sequence.
vocab_size = 4096
sequence_length = 20

# Use the text vectorization layer to normalize, split, and map strings to
# integers. Set output_sequence_length to pad all samples to the same length.
vectorize_layer = TextVectorization(
    standardize=custom_standardization,
    max_tokens=vocab_size,
    output_mode='int',
    output_sequence_length=sequence_length)

# build the vocab
vectorize_layer.adapt(text_ds.batch(1024))
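To see what `custom_standardization` does to a line before it is split into tokens, here is a pure-Python sketch that uses the same character class with the standard `re` module. The helper name `standardize` is illustrative; it is assumed to mirror the TensorFlow ops for ASCII text and is not part of the pipeline.

```python
import re
import string

# Same character class as in custom_standardization above.
PUNCTUATION_PATTERN = re.compile('[%s]' % re.escape(string.punctuation))

def standardize(text):
    """Lowercase the text and delete every ASCII punctuation character."""
    return PUNCTUATION_PATTERN.sub('', text.lower())

print(standardize("Buonapartes. But I warn you, if you don't tell me"))
# buonapartes but i warn you if you dont tell me
print(standardize("Antichrist--I really believe"))
# antichristi really believe
```

Because punctuation is deleted rather than replaced with a space, words joined by `--` are fused into a single token.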

In [43]:
# Save the created vocabulary for reference.
index_word = vectorize_layer.get_vocabulary()
print(index_word[:20])

['', '[UNK]', 'the', 'and', 'to', 'of', 'a', 'he', 'in', 'his', 'that', 'was', 'with', 'had', 'it', 'her', 'not', 'him', 'at', 'i']


In [44]:
# Vectorize the data in text_ds.
text_vector_ds = text_ds.batch(1024).prefetch(AUTOTUNE).map(vectorize_layer).unbatch()

In [45]:
for text in text_vector_ds.take(2):
    print(text)

tf.Tensor(
[   1   40   41    1    3    1   58   54  130  461 1479    5    2    0
    0    0    0    0    0    0], shape=(20,), dtype=int64)
tf.Tensor(
[  1  20  19   1  23  56  23 139 196  57  10  36 856 232   0   0   0   0
   0   0], shape=(20,), dtype=int64)
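The tensors above follow the convention visible in the vocabulary: index 0 is padding and index 1 is `[UNK]`. A minimal sketch of the id mapping and padding is shown below; `vectorize` and `toy_vocab` are hypothetical names, and the four ids are copied from the vocabulary printed earlier.

```python
def vectorize(tokens, vocab, sequence_length, pad_index=0, unk_index=1):
    """Map tokens to ids, truncate to `sequence_length`, then right-pad."""
    ids = [vocab.get(token, unk_index) for token in tokens]
    ids = ids[:sequence_length]
    return ids + [pad_index] * (sequence_length - len(ids))

toy_vocab = {'the': 2, 'and': 3, 'to': 4, 'of': 5}
print(vectorize('the end of the road'.split(), toy_vocab, sequence_length=8))
# [2, 1, 5, 2, 1, 0, 0, 0]
```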


In [46]:
# sequences is a list of numpy arrays
sequences = list(text_vector_ds.as_numpy_iterator())
print(len(sequences))

50506


In [47]:
for seq in sequences[:5]:
  print(f"{seq} => {[index_word[i] for i in seq]}")

[   1   40   41    1    3    1   58   54  130  461 1479    5    2    0
    0    0    0    0    0    0] => ['[UNK]', 'prince', 'so', '[UNK]', 'and', '[UNK]', 'are', 'now', 'just', 'family', 'estates', 'of', 'the', '', '', '', '', '', '', '']
[  1  20  19   1  23  56  23 139 196  57  10  36 856 232   0   0   0   0
   0   0] => ['[UNK]', 'but', 'i', '[UNK]', 'you', 'if', 'you', 'dont', 'tell', 'me', 'that', 'this', 'means', 'war', '', '', '', '', '', '']
[  56   23  104  852    4 2633    2    1    3    1    1   32   10    0
    0    0    0    0    0    0] => ['if', 'you', 'still', 'try', 'to', 'defend', 'the', '[UNK]', 'and', '[UNK]', '[UNK]', 'by', 'that', '', '', '', '', '', '', '']
[  1 313 502   7  26   1  64  39 161  65   0   0   0   0   0   0   0   0
   0   0] => ['[UNK]', 'really', 'believe', 'he', 'is', '[UNK]', 'will', 'have', 'nothing', 'more', '', '', '', '', '', '', '', '', '', '']
[   4   67   12   23    3   23   58   52  356   60  384   52  356   60
 3591    0    0    0    0

In [48]:
# Generates skip-gram pairs with negative sampling for a list of sequences
# (int-encoded sentences) based on window size, number of negative samples
# and vocabulary size.
def generate_training_data(sequences, window_size, num_ns, vocab_size, seed):
    # Elements of each training example are appended to these lists.
    targets, contexts, labels = [], [], []
    
    # Candidate ids for negative sampling (the `seed` argument is currently unused).
    list_of_words = list(range(vocab_size))
    
    # Build the sampling table for vocab_size tokens.
    sampling_table = tf.keras.preprocessing.sequence.make_sampling_table(vocab_size)
    
    # Iterate over all sequences (sentences) in dataset.
    for sequence in tqdm(sequences):
        # Generate positive skip-gram pairs for a sequence (sentence).
        positive_skip_grams, _ = tf.keras.preprocessing.sequence.skipgrams(
            sequence, 
            vocabulary_size=vocab_size,
            sampling_table=sampling_table,
            window_size=window_size,
            negative_samples=0)
        
        # Iterate over each positive skip-gram pair to produce training examples 
        # with positive context word and negative samples.
        for target_word, context_word in positive_skip_grams:
            context_words = [context if target == target_word else -1 for target,context in positive_skip_grams] + [target_word]
            context_words = list(filter(lambda x: x != -1, context_words))
            
            context_class = tf.expand_dims(
                tf.constant([context_word], dtype="int64"), 1)
            
            negative_words = list(filter(lambda i: i not in context_words, list_of_words))
            negative_sampling_candidates = tf.constant(random.sample(negative_words, num_ns), dtype="int64")
            
            # Build context and label vectors (for one target word)
            negative_sampling_candidates = tf.expand_dims(
                negative_sampling_candidates, 1)
            context = tf.concat([context_class, negative_sampling_candidates], 0)
            label = tf.constant([1] + [0]*num_ns, dtype="int64")
            
            # Append each element from the training example to global lists.
            targets.append(target_word)
            contexts.append(context)
            labels.append(label)
    return targets, contexts, labels

In [49]:
from tqdm import tqdm

In [50]:
# Sequences is a list of numpy arrays
targets, contexts, labels = generate_training_data(
    sequences=sequences, 
    window_size=2, 
    num_ns=4, 
    vocab_size=vocab_size, 
    seed=SEED)
print(len(targets), len(contexts), len(labels))

100%|██████████| 50506/50506 [14:20<00:00, 58.71it/s]

202204 202204 202204
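The run above took a little over 14 minutes. Inside `generate_training_data`, the context list is rebuilt and the full 4096-id vocabulary is filtered for every positive pair. One possible way to reduce that cost is sketched below: build the context sets once per sequence and draw negatives by rejection sampling. The helper names are hypothetical, the sketch has not been benchmarked against the notebook, and it assumes all ids lie in `range(vocab_size)`. Passing a seeded `random.Random` would also give the `seed` argument a purpose.

```python
import random
from collections import defaultdict

def build_context_sets(positive_pairs):
    """Map each target id to the set of context ids seen with it."""
    context_sets = defaultdict(set)
    for target, context in positive_pairs:
        context_sets[target].add(context)
    return context_sets

def sample_negatives(target, context_sets, vocab_size, num_ns, rng):
    """Draw `num_ns` distinct ids that are not the target or its context."""
    excluded = context_sets.get(target, set()) | {target}
    if vocab_size - len(excluded) < num_ns:
        raise ValueError("not enough candidates outside the window")
    negatives = []
    while len(negatives) < num_ns:
        candidate = rng.randrange(vocab_size)
        if candidate not in excluded and candidate not in negatives:
            negatives.append(candidate)
    return negatives

rng = random.Random(42)  # seeded generator for reproducible draws
context_sets = build_context_sets([(2, 4), (2, 3), (2, 1), (4, 2)])
print(sample_negatives(2, context_sets, vocab_size=4096, num_ns=4, rng=rng))
```

With a large vocabulary and a small window, almost every draw is accepted, so the expected work per pair is close to `num_ns` draws instead of a pass over the whole vocabulary.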





In [51]:
BATCH_SIZE = 1024
BUFFER_SIZE = 10000
dataset = tf.data.Dataset.from_tensor_slices(((targets, contexts), labels))
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True).prefetch(buffer_size=AUTOTUNE)
print(dataset)

<PrefetchDataset shapes: (((1024,), (1024, 5, 1)), (1024, 5)), types: ((tf.int32, tf.int64), tf.int64)>


# Model training

In [52]:
from tensorflow.keras import Model
from tensorflow.keras.layers import Dot, Embedding, Flatten

In [53]:
class Word2Vec(Model):
    def __init__(self, vocab_size, embedding_dim):
        super(Word2Vec, self).__init__()
        self.target_embedding = Embedding(vocab_size, 
                                        embedding_dim,
                                        input_length=1,
                                        name="w2v_embedding", )
        self.context_embedding = Embedding(vocab_size, 
                                        embedding_dim, 
                                        input_length=num_ns+1)
        self.dots = Dot(axes=(3,2))
        self.flatten = Flatten()

    def call(self, pair):
        target, context = pair
        we = self.target_embedding(target)
        ce = self.context_embedding(context)
        dots = self.dots([ce, we])
        return self.flatten(dots)
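Stripped of the Keras layers, `call` produces one score per candidate context word: the dot product between the target embedding and each of the `num_ns + 1` context embeddings. A tiny pure-Python sketch of that computation follows; the vectors are made up for illustration and `context_scores` is a hypothetical helper.

```python
def context_scores(target_vector, context_vectors):
    """Dot product of the target vector with each context vector."""
    return [sum(t * c for t, c in zip(target_vector, context_vector))
            for context_vector in context_vectors]

target_vector = [1.0, 0.0, 2.0]
context_vectors = [[1.0, 1.0, 1.0], [0.0, 5.0, 0.0]]
print(context_scores(target_vector, context_vectors))  # [3.0, 0.0]
```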

In [54]:
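def custom_loss(y_true, y_pred):
    # Keras calls a loss as loss(y_true, y_pred); labels are cast to the logits' dtype.
    return tf.nn.sigmoid_cross_entropy_with_logits(
        logits=y_pred, labels=tf.cast(y_true, y_pred.dtype))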

In [55]:
embedding_dim = 128
word2vec = Word2Vec(vocab_size, embedding_dim)
word2vec.compile(optimizer='adam',
                 loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
                 metrics=['accuracy'])
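With `from_logits=True`, the compiled loss treats the `num_ns + 1` scores for each target as unnormalized log-probabilities. For the label `[1, 0, 0, 0, 0]` it reduces to the negative log-softmax of the positive score. A minimal pure-Python sketch of that arithmetic is given below; `softmax_cross_entropy` is an illustrative name, not the Keras implementation.

```python
import math

def softmax_cross_entropy(logits, one_hot_labels):
    """Cross-entropy between one-hot labels and softmax(logits)."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(
        sum(math.exp(x - max_logit) for x in logits))
    return -sum(y * (x - log_sum_exp) for x, y in zip(logits, one_hot_labels))

label = [1, 0, 0, 0, 0]
uniform_loss = softmax_cross_entropy([0.0] * 5, label)
confident_loss = softmax_cross_entropy([4.0, 0.0, 0.0, 0.0, 0.0], label)
print(uniform_loss, confident_loss)
```

When all five scores are equal the loss is `log(5)`, and it falls as the positive score rises above the negatives.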

In [56]:
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir="logs")

In [57]:
word2vec.fit(dataset, epochs=20, callbacks=[tensorboard_callback])

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<tensorflow.python.keras.callbacks.History at 0x7fa561ee8518>

In [58]:
weights = word2vec.get_layer('w2v_embedding').get_weights()[0]
vocab = vectorize_layer.get_vocabulary()

In [59]:
import io

In [60]:
with io.open('vectors.tsv', 'w', encoding='utf-8') as out_v, \
     io.open('metadata.tsv', 'w', encoding='utf-8') as out_m:
  for index, word in enumerate(vocab):
    if index == 0:
      continue  # skip 0, it's padding.
    vec = weights[index]
    out_v.write('\t'.join([str(x) for x in vec]) + "\n")
    out_m.write(word + "\n")
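Once the two files are written, a quick sanity check is to read them back and look up nearest neighbours by cosine similarity. The helpers below are a hypothetical sketch that assumes the layout produced above: one tab-separated vector per line in `vectors.tsv` and the matching word on the same line of `metadata.tsv`.

```python
import math

def load_embeddings(vectors_path, metadata_path):
    """Read the two TSV files into a {word: vector} dictionary."""
    embeddings = {}
    with open(vectors_path, encoding='utf-8') as vectors_file, \
         open(metadata_path, encoding='utf-8') as metadata_file:
        for vector_line, word_line in zip(vectors_file, metadata_file):
            word = word_line.rstrip('\n')
            embeddings[word] = [float(x) for x in vector_line.split('\t')]
    return embeddings

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def nearest_words(word, embeddings, top_k=5):
    """Return the `top_k` words most similar to `word`."""
    query = embeddings[word]
    scored = [(cosine_similarity(query, vector), other)
              for other, vector in embeddings.items() if other != word]
    return [other for _, other in sorted(scored, reverse=True)[:top_k]]
```

For example, `nearest_words('prince', load_embeddings('vectors.tsv', 'metadata.tsv'))` would list the five closest words to `prince`, assuming that word is in the vocabulary; the result depends on the training run.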

In [61]:
try:
  from google.colab import files
  files.download('vectors.tsv')
  files.download('metadata.tsv')
except Exception as e:
  pass
