# Introduction

In this notebook we investigate some of the discrepancies that we noticed in the [official TensorFlow tutorial for Word2Vec](https://www.tensorflow.org/tutorials/text/word2vec).

To be precise, we look at two points:
- Using the text vectorization layer
- Customizing the negative samples so that no negative word comes from the target's context window.
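
The second point can be sketched in plain Python. This is only a minimal illustration under stated assumptions, not the notebook's implementation: `sample_negatives` and its parameters are hypothetical names, and the window is read directly from positions in an integer-encoded sentence.

```python
import random

def sample_negatives(sequence, target_pos, window_size, vocab_size, num_ns, rng):
    """Draw num_ns negative word ids that do not occur in the target's window."""
    lo = max(0, target_pos - window_size)
    hi = min(len(sequence), target_pos + window_size + 1)
    # Every id inside the window (including the target itself) is excluded.
    excluded = set(sequence[lo:hi])
    candidates = [i for i in range(vocab_size) if i not in excluded]
    return rng.sample(candidates, num_ns)

rng = random.Random(42)
sequence = [15, 2853, 6, 104, 1, 1760, 25, 467]
negatives = sample_negatives(sequence, target_pos=1, window_size=2,
                             vocab_size=4096, num_ns=4, rng=rng)
print(negatives)
```

None of the sampled ids can be `15`, `2853`, `6` or `104`, because those are the words inside the window around position 1.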

In [1]:
import tensorflow as tf
print(tf.__version__)

SEED = 42 
AUTOTUNE = tf.data.AUTOTUNE

import random

2.4.1


# Data
We will be working on the same data that the official guide uses.

In [2]:
from tensorflow.keras.utils import get_file

In [3]:
# Shakespeare text file
path_to_file = tf.keras.utils.get_file(fname='shakespeare.txt',
                                       origin='https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')
print(f'[INFO] Path to file: {path_to_file}')

Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt
[INFO] Path to file: /root/.keras/datasets/shakespeare.txt


## Text
Here we take a look at the text file. I suggest taking some time to look through the data, even if it is only a quick glance. This step is not mandatory, but it builds a mental map of what we are going to model.

In [4]:
# Visualize the text data
with open(path_to_file) as f:
    lines = f.read().splitlines()
for line in lines[:5]:
    print(line)

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.


In [5]:
# Create a `tf.data` dataset with all the non-empty lines
text_ds = tf.data.TextLineDataset(path_to_file).filter(lambda x: tf.cast(tf.strings.length(x), bool))

for text in text_ds.take(5):
    print(text)

tf.Tensor(b'First Citizen:', shape=(), dtype=string)
tf.Tensor(b'Before we proceed any further, hear me speak.', shape=(), dtype=string)
tf.Tensor(b'All:', shape=(), dtype=string)
tf.Tensor(b'Speak, speak.', shape=(), dtype=string)
tf.Tensor(b'First Citizen:', shape=(), dtype=string)


In [6]:
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
import re
import string

In [7]:
# We create a custom standardization function to lowercase the text and 
# remove punctuation.
def custom_standardization(input_data):
    lowercase = tf.strings.lower(input_data)
    return tf.strings.regex_replace(lowercase,
                                    '[%s]' % re.escape(string.punctuation), '')

# Define the vocabulary size and number of words in a sequence.
vocab_size = 4096
sequence_length = 10

# Use the text vectorization layer to normalize, split, and map strings to
# integers. Set output_sequence_length to pad all samples to the same length.
vectorize_layer = TextVectorization(
    standardize=custom_standardization,
    max_tokens=vocab_size,
    output_mode='int',
    output_sequence_length=sequence_length)

# build the vocab
vectorize_layer.adapt(text_ds.batch(1024))

In [8]:
# Save the created vocabulary for reference.
index_word = vectorize_layer.get_vocabulary()
print(index_word[:10])

['', '[UNK]', 'the', 'and', 'to', 'i', 'of', 'you', 'my', 'a']
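
The vocabulary above reserves index 0 for padding and index 1 for `[UNK]`. As a rough plain-Python approximation of what the layer does with one line of text (lowercase, strip punctuation, split on whitespace, look up, pad), consider the sketch below. The names `standardize`, `vectorize` and `toy_index` are hypothetical and the toy vocabulary is made up for illustration.

```python
import re
import string

def standardize(line):
    """Lowercase a line and delete every ASCII punctuation character."""
    return re.sub('[%s]' % re.escape(string.punctuation), '', line.lower())

def vectorize(line, word_index, sequence_length=10):
    """Map tokens to ids (1 for unknown words), then truncate or zero-pad."""
    ids = [word_index.get(tok, 1) for tok in standardize(line).split()]
    ids = ids[:sequence_length]
    return ids + [0] * (sequence_length - len(ids))

# Hypothetical toy vocabulary: 0 is padding, 1 is the unknown token.
toy_index = {'': 0, '[UNK]': 1, 'speak': 2}

print(standardize('Speak, speak.'))           # speak speak
print(vectorize('Speak, speak.', toy_index))  # [2, 2, 0, 0, 0, 0, 0, 0, 0, 0]
```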


# Build the subwords
What we have:
- `index_word`: A list of all the unique words in the vocab

What we want:
- `subword_index`: A dictionary that maps each subword to its unique index
- `index_subword`: A dictionary that maps each index to its subword
- `word_subwords`: A dictionary that maps each word index to the indices of all the subwords that it has
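
Before building the dictionaries, it may help to see the subword split on its own. The sketch below assumes character 3-grams with `<` and `>` as word-boundary markers, plus the whole marked word as one extra subword; `char_ngrams` is a hypothetical helper name, and the special handling of `[UNK]` is left out.

```python
def char_ngrams(word, n=3):
    """Return the boundary-marked character n-grams of a word plus the whole word."""
    marked = f"<{word}>"
    # Words too short to split are kept as a single subword.
    if len(marked) <= n:
        return [marked]
    grams = [marked[i:i + n] for i in range(len(marked) - n + 1)]
    return grams + [marked]

print(char_ngrams("to"))       # ['<to', 'to>', '<to>']
print(char_ngrams("whereer"))
```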

In [9]:
# Maps each subword string to a unique integer index (indices start at 1).
subword_index = {}
index = 0

# Maps each word index to the list of indices of its subwords.
word_subwords = {}

for idx, word in enumerate(index_word):
    # Mark the word boundaries so that prefixes and suffixes get distinct subwords.
    word = f"<{word}>"
    if len(word) > 3 and word != "<[UNK]>":
        # All character 3-grams, plus the whole word as a special subword.
        subword_list = [word[i:i+3] for i in range(len(word)-2)]
        subword_list.append(word)
    else:
        # Padding, [UNK], and one-letter words are kept as a single subword.
        subword_list = [word]

    ind_list = []
    for w in subword_list:
        # Assign the next free index to a subword seen for the first time.
        if w not in subword_index:
            index += 1
            subword_index[w] = index
        ind_list.append(subword_index[w])
    word_subwords[idx] = ind_list

index_subword = {i: subword for subword, i in subword_index.items()}

In [10]:
index = 3890

# Go from a word index to the word, its subword indices, and its subwords
word = index_word[index]
print(f'The word is: {word}')

subwords = word_subwords[index]
print(f'The subword indices: {subwords}')

print('The subwords:', end=' ')
for s in subwords:
    print(f"'{index_subword[s]}'", end=' ')

The word is: whereer
The subword indices: [89, 234, 119, 201, 1302, 1404, 120, 6826]
The subwords: '<wh' 'whe' 'her' 'ere' 'ree' 'eer' 'er>' '<whereer>' 

In [11]:
subword_vocab_size = len(subword_index)

In [12]:
print(f'Number of unique words: {vocab_size}')
print(f'Number of unique subwords: {subword_vocab_size}')

Number of unique words: 4096
Number of unique subwords: 7114


In [13]:
# Vectorize the data in text_ds.
text_vector_ds = text_ds.batch(1024).prefetch(AUTOTUNE).map(vectorize_layer).unbatch()

In [14]:
for text in text_vector_ds.take(2):
    print(text)

tf.Tensor([ 89 270   0   0   0   0   0   0   0   0], shape=(10,), dtype=int64)
tf.Tensor([138  36 982 144 673 125  16 106   0   0], shape=(10,), dtype=int64)


In [15]:
# sequences is a list of numpy arrays
sequences = list(text_vector_ds.as_numpy_iterator())
print(len(sequences))

32777


In [16]:
for seq in sequences[99:102]:
  print(f"{seq} => {[index_word[i] for i in seq]}")
  print(f"{seq} => {[[index_subword[a] for a in word_subwords[i]] for i in seq]}")
  print()

[2336 2883 1830    4    1  111    3    1    0    0] => ['piercing', 'statutes', 'daily', 'to', '[UNK]', 'up', 'and', '[UNK]', '', '']
[2336 2883 1830    4    1  111    3    1    0    0] => [['<pi', 'pie', 'ier', 'erc', 'rci', 'cin', 'ing', 'ng>', '<piercing>'], ['<st', 'sta', 'tat', 'atu', 'tut', 'ute', 'tes', 'es>', '<statutes>'], ['<da', 'dai', 'ail', 'ily', 'ly>', '<daily>'], ['<to', 'to>', '<to>'], ['<[UNK]>'], ['<up', 'up>', '<up>'], ['<an', 'and', 'nd>', '<and>'], ['<[UNK]>'], ['<>'], ['<>']]

[   2  172   39    2  664 1126   79   13  111   60] => ['the', 'poor', 'if', 'the', 'wars', 'eat', 'us', 'not', 'up', 'they']
[   2  172   39    2  664 1126   79   13  111   60] => [['<th', 'the', 'he>', '<the>'], ['<po', 'poo', 'oor', 'or>', '<poor>'], ['<if', 'if>', '<if>'], ['<th', 'the', 'he>', '<the>'], ['<wa', 'war', 'ars', 'rs>', '<wars>'], ['<ea', 'eat', 'at>', '<eat>'], ['<us', 'us>', '<us>'], ['<no', 'not', 'ot>', '<not>'], ['<up', 'up>', '<up>'], ['<th', 'the', 'hey', 'ey>', '<th

In [17]:
import numpy as np

In [18]:
sequence = sequences[250]
print("The Sequence:")
print(f"{sequence} => {[index_word[i] for i in sequence]}")

list_of_words = list(range(vocab_size))
window_size = 4
sampling_table = tf.keras.preprocessing.sequence.make_sampling_table(vocab_size)

# Generate positive skip-gram pairs for a sequence (sentence).
positive_skip_grams, _ = tf.keras.preprocessing.sequence.skipgrams(
    sequence, 
    vocabulary_size=vocab_size,
    sampling_table=sampling_table,
    window_size=window_size,
    negative_samples=0)

print("Positive Skip Grams:")
print(positive_skip_grams)

The Sequence:
[  15 2853    6  104    1 1760   25  467    0    0] => ['with', 'thousands', 'of', 'these', '[UNK]', 'slaves', 'as', 'high', '', '']
Positive Skip Grams:
[[2853, 15], [2853, 104], [2853, 6], [2853, 1760], [2853, 1]]
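
To make the windowing explicit, here is a plain-Python sketch of how positive pairs can be enumerated. It is a simplification under stated assumptions: `positive_pairs` is a hypothetical helper, padding id 0 is skipped, and the frequency-based subsampling and shuffling that `skipgrams` applies with a sampling table are left out, so every target is kept.

```python
def positive_pairs(sequence, window_size):
    """Return (target, context) id pairs within window_size, skipping padding id 0."""
    pairs = []
    for i, target in enumerate(sequence):
        if target == 0:
            continue
        lo = max(0, i - window_size)
        hi = min(len(sequence), i + window_size + 1)
        for j in range(lo, hi):
            # A word is not its own context, and padding is never a context.
            if j != i and sequence[j] != 0:
                pairs.append((target, sequence[j]))
    return pairs

sequence = [15, 2853, 6, 104, 1, 1760, 25, 467, 0, 0]
pairs = positive_pairs(sequence, window_size=4)
print([c for t, c in pairs if t == 2853])  # [15, 6, 104, 1, 1760]
```

The contexts for target `2853` are the same five ids as in the output above. There, only `2853` appears as a target because the sampling table dropped the other, more frequent words.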


In [19]:
num_ns = 4
# Iterate over each positive skip-gram pair to produce training examples 
# with positive context word and negative samples.
for target_word, context_word in positive_skip_grams:
    print(f"Target : {target_word}")
    print(f"Target Word: {index_word[target_word]}")

    context_words = [context if target == target_word else -1 for target,context in positive_skip_grams] + [target_word]
    context_words = list(filter(lambda x: x != -1, context_words))

    context_class = tf.expand_dims(
        tf.constant([context_word], dtype="int64"), 1)
    
    negative_words = list(filter(lambda i: i not in context_words, list_of_words))
    negative_sampling_candidates = tf.constant(random.sample(negative_words, num_ns), dtype="int64")

    # Build context and label vectors (for one target word)
    negative_sampling_candidates = tf.expand_dims(
        negative_sampling_candidates, 1)

    context = tf.concat([context_class, negative_sampling_candidates], 0)
    print(f"Context: {context}")
    print(f"Context Shape: {context.shape}")
    label = tf.constant([1] + [0]*num_ns, dtype="int64")
    print(f"Label: {label}")
    print(f"Label Shape: {label.shape}")
    
    # Append each element from the training example to global lists.
    subwords = word_subwords[target_word]
    
    print(f"Subwords: {subwords}")
    print(f"{[index_subword[word] for word in subwords]}")
    # targets.append(sub_tar)
    # contexts.append(context)
    # labels.append(label)
    
    break

Target : 2853
Target Word: thousands
Context: [[  15]
 [ 753]
 [3257]
 [ 219]
 [ 951]]
Context Shape: (5, 1)
Label: [1 0 0 0 0]
Label Shape: (5,)
Subwords: [3, 80, 81, 766, 1153, 1154, 8, 710, 711, 5463]
['<th', 'tho', 'hou', 'ous', 'usa', 'san', 'and', 'nds', 'ds>', '<thousands>']


In [20]:
# Generates skip-gram pairs with negative sampling for a list of sequences
# (int-encoded sentences) based on window size, number of negative samples
# and vocabulary size.
def generate_training_data(sequences, window_size, num_ns, vocab_size, seed):
    # Elements of each training example are appended to these lists.
    targets, contexts, labels = [], [], []
    
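    # Seed Python's RNG so that the negative samples are reproducible.
    random.seed(seed)

    # All word indices; negatives are sampled from this list.
    list_of_words = list(range(vocab_size))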
    
    # Build the sampling table for vocab_size tokens.
    sampling_table = tf.keras.preprocessing.sequence.make_sampling_table(vocab_size)
    
    # Iterate over all sequences (sentences) in dataset.
    for sequence in tqdm(sequences):
        # Generate positive skip-gram pairs for a sequence (sentence).
        positive_skip_grams, _ = tf.keras.preprocessing.sequence.skipgrams(
            sequence, 
            vocabulary_size=vocab_size,
            sampling_table=sampling_table,
            window_size=window_size,
            negative_samples=0)
        
        # Iterate over each positive skip-gram pair to produce training examples 
        # with positive context word and negative samples.
        for target_word, context_word in positive_skip_grams:
            context_words = [context if target == target_word else -1 for target,context in positive_skip_grams] + [target_word]
            context_words = list(filter(lambda x: x != -1, context_words))

            context_class = tf.expand_dims(
                tf.constant([context_word], dtype="int64"), 1)
            
            negative_words = list(filter(lambda i: i not in context_words, list_of_words))
            negative_sampling_candidates = tf.constant(random.sample(negative_words, num_ns), dtype="int64")

            # Build context and label vectors (for one target word)
            negative_sampling_candidates = tf.expand_dims(
                negative_sampling_candidates, 1)

            context = tf.concat([context_class, negative_sampling_candidates], 0)
            label = tf.constant([1] + [0]*num_ns, dtype="int64")
            
            # Look up the target's subword indices and append the example to the output lists.
            subwords = word_subwords[target_word]
            
            targets.append(subwords)
            contexts.append(context)
            labels.append(label)
    return targets, contexts, labels

In [21]:
from tqdm import tqdm

In [22]:
# Sequences is a list of numpy arrays
targets, contexts, labels = generate_training_data(
    sequences=sequences, 
    window_size=2, 
    num_ns=4, 
    vocab_size=vocab_size, 
    seed=SEED)
print(len(targets), len(contexts), len(labels))

100%|██████████| 32777/32777 [04:11<00:00, 130.52it/s]

65162 65162 65162
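
The run above takes a few minutes. Part of that cost probably comes from filtering the whole vocabulary once per positive pair. One possible alternative, shown here only as a sketch, is rejection sampling: draw random ids and discard those that fall in the excluded set. `sample_negatives_rejection` is a hypothetical helper, and it assumes that far more than `num_ns` ids remain after exclusion.

```python
import random

def sample_negatives_rejection(excluded, vocab_size, num_ns, rng):
    """Draw num_ns distinct ids in [0, vocab_size) that are not in excluded."""
    excluded = set(excluded)
    negatives = set()
    while len(negatives) < num_ns:
        candidate = rng.randrange(vocab_size)
        # Keep the candidate only if it is not a window word.
        if candidate not in excluded:
            negatives.add(candidate)
    return list(negatives)

rng = random.Random(42)
window_words = [2853, 15, 104, 6, 1760, 1]
negs = sample_negatives_rejection(window_words, vocab_size=4096, num_ns=4, rng=rng)
print(negs)
```

Because the excluded set is tiny compared with the vocabulary, almost every draw is accepted, so the loop usually finishes after about `num_ns` iterations.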





In [43]:
BATCH_SIZE = 1000
BUFFER_SIZE = 1000
target_ragged = tf.ragged.constant(targets)
dataset = tf.data.Dataset.from_tensor_slices(((target_ragged, contexts), labels))
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True).prefetch(buffer_size=AUTOTUNE)
print(dataset)

<PrefetchDataset shapes: (((1000, None), (1000, 5, 1)), (1000, 5)), types: ((tf.int32, tf.int64), tf.int64)>


In [44]:
for (t,c),l in dataset.take(1):
    print(t.shape)
    print(c.shape)
    print(l.shape)

(1000, None)
(1000, 5, 1)
(1000, 5)


# Model training

In [45]:
from tensorflow.keras import Model
from tensorflow.keras.layers import Dot, Embedding, Flatten

In [46]:
num_ns = 4
embedding_dim = 100
target_embedding = Embedding(subword_vocab_size + 1,  # subword indices run from 1 to subword_vocab_size
                             embedding_dim,
                             input_length=None,
                             name="w2v_embedding",)
context_embedding = Embedding(vocab_size,
                              embedding_dim,
                              input_length=num_ns+1)

In [47]:
dot = Dot(axes=(3,1))
flatten = Flatten()

In [48]:
for (t,c),l in dataset.take(1):
    tar_em = tf.math.reduce_sum(target_embedding(t),axis=1)
    con_em = context_embedding(c)
    print(tar_em.shape)
    print(con_em.shape)

    d = dot([con_em, tar_em])
    print(d.shape)

    f = flatten(d)
    print(f.shape)

(1000, 100)
(1000, 5, 1, 100)
(1000, 5, 1)
(1000, 5)
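
The shapes above follow from two steps: the subword embeddings of a target word are summed into one vector, and that vector is dotted with each of the `num_ns + 1` context embeddings to give one logit per candidate. A plain-Python sketch for a single example is given below; the 2-dimensional toy table and the helper names `word_vector` and `logits` are hypothetical.

```python
def word_vector(subword_ids, table):
    """Sum the embedding rows of a word's subwords into one vector."""
    dim = len(table[0])
    return [sum(table[s][d] for s in subword_ids) for d in range(dim)]

def logits(target_vec, context_vecs):
    """Dot the target vector with every context vector."""
    return [sum(t * c for t, c in zip(target_vec, ctx)) for ctx in context_vecs]

# Hypothetical subword embedding table with 4 rows of dimension 2.
table = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
target = word_vector([1, 2, 3], table)
# One positive context followed by four negatives, each of dimension 2.
contexts_toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [-1.0, -1.0]]

print(target)                        # [2.0, 2.0]
print(logits(target, contexts_toy))  # [2.0, 2.0, 4.0, 0.0, -4.0]
```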


In [61]:
class Word2Vec(Model):
    def __init__(self, subword_vocab_size, vocab_size, embedding_dim):
        super(Word2Vec, self).__init__()
        self.target_embedding = Embedding(subword_vocab_size+1, 
                                        embedding_dim,
                                        input_length=None,
                                        name="w2v_embedding",)
        self.context_embedding = Embedding(vocab_size+1, 
                                        embedding_dim, 
                                        input_length=num_ns+1)
        self.dots = Dot(axes=(3,1))
        self.flatten = Flatten()

    def call(self, pair):
        target, context = pair
        we = tf.math.reduce_sum(self.target_embedding(target),axis=1)
        ce = self.context_embedding(context)
        dots = self.dots([ce, we])
        return self.flatten(dots)

In [62]:
num_ns = 4
embedding_dim = 128
word2vec = Word2Vec(subword_vocab_size,vocab_size, embedding_dim)
word2vec.compile(optimizer='adam',
                 loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
                 metrics=['accuracy'])
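
With `from_logits=True`, the loss applies a softmax over the five logits of each example and compares the result with the one-hot label, whose single 1 sits at position 0 (the positive context). A minimal plain-Python sketch of that computation for one example follows; `softmax_cross_entropy` is a hypothetical helper, not the Keras implementation.

```python
import math

def softmax_cross_entropy(logits, label):
    """Cross-entropy between a one-hot label and softmax(logits)."""
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return -sum(y * (x - log_sum) for x, y in zip(logits, label))

label = [1, 0, 0, 0, 0]
untrained = softmax_cross_entropy([0.0, 0.0, 0.0, 0.0, 0.0], label)
confident = softmax_cross_entropy([5.0, 0.0, 0.0, 0.0, 0.0], label)
print(untrained, confident)
```

With all-zero logits the loss equals `log(5)`, which is a useful reference value for a model that has not learned anything yet.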

In [63]:
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir="logs")

In [64]:
word2vec.fit(dataset, epochs=10, callbacks=[tensorboard_callback])

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7f51e46fcbe0>

In [None]:
weights = word2vec.get_layer('w2v_embedding').get_weights()[0]
vocab = vectorize_layer.get_vocabulary()

In [None]:
import io

In [None]:
out_v = io.open('vectors.tsv', 'w', encoding='utf-8')
out_m = io.open('metadata.tsv', 'w', encoding='utf-8')

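for index, word in enumerate(vocab):
  if index == 0: continue # skip 0, it's padding.
  # The embedding table is indexed by subword, so a word vector is the sum
  # of the rows of its subwords, as in `Word2Vec.call`.
  vec = weights[word_subwords[index]].sum(axis=0)
  out_v.write('\t'.join([str(x) for x in vec]) + "\n")
  out_m.write(word + "\n")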
out_v.close()
out_m.close()
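
Once the vectors are exported, a common sanity check is to look up nearest neighbours by cosine similarity. The sketch below uses made-up 2-dimensional vectors rather than the trained ones, and `cosine` and `nearest` are hypothetical helper names.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest(word, vectors, k=3):
    """Return the k words whose vectors are most similar to the given word's."""
    query = vectors[word]
    scored = [(cosine(query, vec), other)
              for other, vec in vectors.items() if other != word]
    return [other for _, other in sorted(scored, reverse=True)[:k]]

# Made-up vectors for illustration only.
toy_vectors = {'king': [1.0, 0.9], 'queen': [0.9, 1.0], 'apple': [-1.0, 0.2]}
print(nearest('king', toy_vectors, k=1))  # ['queen']
```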

In [None]:
try:
  from google.colab import files
  files.download('vectors.tsv')
  files.download('metadata.tsv')
except Exception as e:
  pass
