# Word2Vec - Continuous Skip-gram Model

## Import Libraries

  - Download a user-defined corpus (any text corpus, personal essays, etc.) using `wget`
  - Import `re` to process text with regular expressions
  - Import the Keras `Tokenizer`, `pad_sequences`, and `text_to_word_sequence` utilities

In [58]:
!wget --no-check-certificate \
    https://raw.githubusercontent.com/aisutd/Intro-to-NLP-pretrained/main/word_embedding_models/data/sample_corpus_essay_Bach_Nguyen.txt \
    -O ./sample_corpus.txt

!wget --no-check-certificate \
    https://raw.githubusercontent.com/aisutd/Intro-to-NLP-pretrained/main/word_embedding_models/data/shakespeare.txt \
    -O ./shakespeare.txt

!wget --no-check-certificate \
    https://raw.githubusercontent.com/aisutd/Intro-to-NLP-pretrained/main/word_embedding_models/data/warpeace.txt \
    -O ./warpeace.txt

--2021-02-05 14:28:21--  https://raw.githubusercontent.com/aisutd/Intro-to-NLP-pretrained/main/word_embedding_models/data/sample_corpus_essay_Bach_Nguyen.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7406 (7.2K) [text/plain]
Saving to: ‘./sample_corpus.txt’


2021-02-05 14:28:21 (68.1 MB/s) - ‘./sample_corpus.txt’ saved [7406/7406]

--2021-02-05 14:28:21--  https://raw.githubusercontent.com/aisutd/Intro-to-NLP-pretrained/main/word_embedding_models/data/shakespeare.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4573338 (4.4M) [tex

In [59]:
%load_ext tensorboard

The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard


In [60]:
import io
import itertools
import numpy as np
import os
import re
import string
import tqdm
import datetime

In [61]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import text_to_word_sequence
from keras.preprocessing.sequence import skipgrams

import tensorflow as tf
from tensorflow.keras import Model, Sequential
from tensorflow.keras.layers import Activation, Dense, Dot, Embedding, Flatten, GlobalAveragePooling1D, Reshape
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

In [62]:
SEED = 42
window_size = 2
num_ns = 4
AUTOTUNE = tf.data.AUTOTUNE

## Preprocessing Text Data

  - Split the training corpus into sentences
  - Split each sentence in the training corpus into a sequence of words
  - Tokenize to build a word-to-index dictionary covering every word in the corpus
    - Note that more frequent words are assigned smaller indices
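
To make the frequency note concrete, here is a minimal pure-Python sketch of frequency-ranked indexing. It only approximates what the Keras `Tokenizer` does and is not the library's implementation; the helper name `build_word_index` and the toy sentences are illustrative assumptions. Index 0 is left free for padding, and when an out-of-vocabulary token is supplied it takes index 1, so real words start at 2.

```python
from collections import Counter

def build_word_index(sentences, oov_token=None):
    """Map each word to an integer id, most frequent word first (hypothetical helper)."""
    counts = Counter(word for sentence in sentences for word in sentence)
    # most_common() sorts by descending count; ties keep first-seen order.
    ranked = [word for word, _ in counts.most_common()]
    if oov_token is not None:
        ranked = [oov_token] + ranked
    # Ids start at 1 so that 0 stays available for padding.
    return {word: idx for idx, word in enumerate(ranked, start=1)}

toy_corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "end"]]
word_index = build_word_index(toy_corpus, oov_token="<OOV>")
print(word_index)
```

In this toy corpus "the" occurs three times and "sat" twice, so they receive the two smallest ids after the out-of-vocabulary token.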

In [63]:
#### Helper to make loading a corpus easier. It converts the text as follows:
#### corpus --> paragraphs --> sentences --> words, giving [ [word1,word2,word3...forming a sentence] , [word1,word2,word3...forming a sentence] ]
#### Sentence splitting uses the nltk library
import nltk
import tensorflow as tf
nltk.download('punkt')

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import text_to_word_sequence

## determine the path of training corpus
corpus_path1 = 'warpeace.txt'
corpus_path2 = 'shakespeare.txt'
corpus_path3 = 'sample_corpus.txt'

## Aim: make the following transformations: 
## corpus--> parse --> sentence --> word_tokenize --> [ [word1,word2,word3...forming a sentence] , [word1,word2,word3...forming a sentence] ]
def load_corpus(corpus_path):
  # Read the whole file; 'utf-8-sig' strips a leading byte-order mark if present
  with open(corpus_path, encoding='utf-8-sig') as f:
    content = f.read()
  ## in warpeace.txt, paragraphs are separated by two newline characters,
  ## so use .split('\n\n') to split the text into paragraphs
  paragraphs = content.split('\n\n')
  sentences = []

  ## nltk.tokenize.sent_tokenize(paragraph) breaks a paragraph (str)
  ## into a list of strings, one string per sentence
  for paragraph in paragraphs:
    sentences.extend(nltk.tokenize.sent_tokenize(paragraph))

  words = []
  ## text_to_word_sequence(sentence) lowercases a sentence, strips
  ## punctuation, and splits it into a list of words
  for sentence in sentences:
    words.append(text_to_word_sequence(sentence))

  return words

corpus = load_corpus(corpus_path1)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
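
The nested-list shape that `load_corpus` returns can be illustrated without nltk or Keras. The sketch below is a rough stand-in, assuming a naive regex sentence splitter in place of nltk's Punkt model and a simple letters-and-apostrophes word pattern in place of `text_to_word_sequence`; the name `toy_load_corpus` and the sample text are hypothetical.

```python
import re

def toy_load_corpus(text):
    """Split raw text into a list of sentences, each a list of lowercase words."""
    words_per_sentence = []
    for paragraph in text.split("\n\n"):
        # Naive sentence boundary: '.', '!' or '?' followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", paragraph.strip()):
            words = re.findall(r"[a-z']+", sentence.lower())
            if words:  # skip empty paragraphs and punctuation-only fragments
                words_per_sentence.append(words)
    return words_per_sentence

sample = "Well, Prince. So it goes!\n\nHow are you?"
toy_corpus = toy_load_corpus(sample)
print(toy_corpus)
# [['well', 'prince'], ['so', 'it', 'goes'], ['how', 'are', 'you']]
```

Each inner list is one sentence, which is the unit the skip-gram window later slides over.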


In [64]:
# Use the Tokenizer to build a word-to-index dictionary
tokenizer = Tokenizer(oov_token='<OOV>')
tokenizer.fit_on_texts(corpus)
w2id = tokenizer.word_index
# Tokenizer indices start at 1, so index 0 is free for the padding token
w2id['<pad>'] = 0
vocab_size = len(w2id)

id2w = {v:k for k, v in w2id.items()}

print(w2id)
print(id2w)
print("vocab_size: {}".format(vocab_size))

vocab_size: 18284


In [65]:
# Transform the corpus sequences into token sequences 
corpus_token_seq = [[w2id[word] for word in sentence] for sentence in corpus]

print(corpus_token_seq)

[[92, 41, 40, 7740, 3, 9283, 58, 54, 128, 457, 1444, 5, 2, 11973], [20, 19, 5359, 23, 56, 23, 142, 198, 57, 10, 36, 861, 232, 56, 23, 102, 852, 4, 2588, 2, 11974, 3, 4504, 7741, 32, 10, 5938, 19, 317, 514, 7, 26, 5938, 19, 64, 39, 157, 65, 4, 67, 12, 23, 3, 23, 58, 52, 362, 60, 381, 52, 362, 60, 11975, 3689, 322, 21, 23, 678, 550], [20, 71, 67, 23, 67], [19, 93, 19, 39, 551, 23, 881, 99, 3, 198, 57, 28, 2, 375], [14, 11, 8, 3305, 2323, 3, 2, 3127, 11, 2, 92, 426, 234, 569, 5360, 1016, 5, 410, 3, 1498, 5, 2, 2413, 562, 3306], [12, 125, 189, 25, 2825, 41, 299, 1017, 6, 63, 5, 499, 1769, 3, 1216, 35, 11, 2, 111, 4, 2324, 18, 15, 1185], [234, 569, 13, 13, 6, 4887, 24, 79, 350], [25, 11, 21, 25, 27, 827, 29, 1297, 9284, 9284, 161, 73, 6, 240, 271, 8, 1377, 263, 483, 55, 32, 2, 11976], [28, 15, 6689, 104, 2705, 950, 8, 76, 3, 3690, 32, 6, 6690, 7742, 1217, 10, 309, 251, 21, 2501], [56, 23, 39, 157, 393, 4, 67, 108, 46, 41, 3, 56, 2, 5361, 5, 3691, 43, 295, 12, 6, 802, 3307, 26, 16, 191, 407,

In [66]:
# Elements of each training example are appended to these lists
targets, contexts, labels = [], [], []

# Build the sampling table for vocab_size tokens.
sampling_table = tf.keras.preprocessing.sequence.make_sampling_table(vocab_size)

# Iterate over all sentences in the dataset
for sequence in tqdm.tqdm(corpus_token_seq):

  # Generate positive skip-gram pairs for a sequence (sentence).
  positive_skip_grams, _ = tf.keras.preprocessing.sequence.skipgrams(
        sequence, 
        vocabulary_size=vocab_size,
        sampling_table=sampling_table,
        window_size=window_size,
        negative_samples=0)
  
  # Iterate over each positive skip-gram pair to produce training samples 
  # with positive context word and negative context samples 
  for target_word, context_word in positive_skip_grams: 
    context_class = tf.expand_dims(
        tf.constant([context_word], dtype="int64"), 1)
    negative_sampling_candidates, _, _ = tf.random.log_uniform_candidate_sampler(
        true_classes=context_class,
        num_true=1, 
        num_sampled=num_ns, 
        unique=True, 
        range_max=vocab_size, 
        seed=SEED, 
        name="negative_sampling")
    
    # Build context and label vectors for one target word 
    negative_sampling_candidates = tf.expand_dims(
        negative_sampling_candidates, 1)
    
    context = tf.concat([context_class, negative_sampling_candidates], 0)
    label = tf.constant([1] + [0]*num_ns, dtype="int64")

    # Append each element from the training example to global lists
    targets.append(target_word)
    contexts.append(context)
    labels.append(label)

100%|██████████| 32639/32639 [02:54<00:00, 187.24it/s]
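
The loop above can be hard to picture, so here is a small pure-Python sketch of the same idea: pair each target with its neighbours inside the window, then attach `num_ns` negative samples and a label vector to each positive pair. It is a simplification under stated assumptions: it skips the sampling table and shuffling that the Keras `skipgrams` helper applies, and it draws negatives uniformly instead of from the log-uniform distribution used in the notebook. The function names `positive_pairs` and `add_negatives` are hypothetical.

```python
import random

def positive_pairs(sequence, window_size):
    """Return (target, context) pairs for every word within window_size of each target."""
    pairs = []
    for i, target in enumerate(sequence):
        start = max(0, i - window_size)
        end = min(len(sequence), i + window_size + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, sequence[j]))
    return pairs

def add_negatives(context_word, vocab_size, num_ns, rng):
    """Build [positive, neg_1, ..., neg_num_ns] and the matching label vector."""
    # Simplification: uniform sampling over ids 1..vocab_size-1, excluding the
    # positive word. The notebook samples from a log-uniform distribution.
    candidates = [w for w in range(1, vocab_size) if w != context_word]
    negatives = rng.sample(candidates, num_ns)
    return [context_word] + negatives, [1] + [0] * num_ns

toy_sequence = [5, 8, 2, 9]
pairs = positive_pairs(toy_sequence, window_size=1)
print(pairs)  # [(5, 8), (8, 5), (8, 2), (2, 8), (2, 9), (9, 2)]

rng = random.Random(42)
context, label = add_negatives(pairs[0][1], vocab_size=20, num_ns=4, rng=rng)
print(context, label)
```

The positive word always sits at position 0 of the context list, which is why the label vector is `[1, 0, 0, 0, 0]` for every training example.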


In [67]:
print(len(targets))
print(len(contexts))
print(len(labels))

335055
335055
335055


In [68]:
# Configure the dataset with the TensorFlow batched dataset API
BATCH_SIZE = 1024
dataset = tf.data.Dataset.from_tensor_slices(((targets, contexts), labels))
dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)
print(dataset)

<BatchDataset shapes: (((1024,), (1024, 5, 1)), (1024, 5)), types: ((tf.int32, tf.int64), tf.int64)>
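
Because `drop_remainder=True` discards the final partial batch, not every training example is seen in an epoch. The short calculation below uses the example count and batch size reported in this notebook to show how many full batches result and how many examples are left out.

```python
num_examples = 335055   # length of targets/contexts/labels reported above
batch_size = 1024       # BATCH_SIZE in the notebook

full_batches, leftover = divmod(num_examples, batch_size)
print(full_batches)  # 327 batches per epoch
print(leftover)      # 207 examples dropped by drop_remainder=True
```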


In [69]:
# Subclass to define Word2Vec model
class Word2Vec(Model):

  def __init__(self, vocab_size, embedding_dim):
    '''
      Initialize the model. 

      Parameters
      ----------
      vocab_size: int 
        size of the input vocabulary
      embedding_dim: int
        dimension of each word embedding vector
    '''
    super(Word2Vec, self).__init__()
    # Target embedding layer variable
    self.target_embedding = Embedding(vocab_size,
                                      embedding_dim, 
                                      input_length=1,
                                      name='w2v_embedding')     
    # Context embedding layer variable 
    self.context_embedding = Embedding(vocab_size, 
                                       embedding_dim, 
                                       input_length=num_ns+1)    
    # Dot layer variable (compute product along the word vector dimension)
    self.dots = Dot(axes=(3,2))
    self.flatten = Flatten()

  def call(self, pair):
    '''
      Forward pass: score each candidate context word against the target.

      Parameters
      ----------
      pair: tuple of tensors
        (target, context) batch, as yielded by the training dataset
    '''
    target, context = pair
    # Calculate target and context embedding vectors
    we = self.target_embedding(target)
    ce = self.context_embedding(context)
    # Calculate target and context embedding similarity with dot product 
    dots = self.dots([ce, we])
    return self.flatten(dots)

In [70]:
# Build the model 
embedding_dim = 128
word2vec = Word2Vec(vocab_size, embedding_dim)
word2vec.compile(optimizer=tf.keras.optimizers.Adam(),
                 loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
                 metrics=['accuracy'])
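
What the model and loss compute for a single training example can be sketched in plain Python: one dot product per candidate context word gives a logit, and softmax cross-entropy rewards a high logit at the positive position. The two-dimensional vectors below are made-up numbers and only three candidates are used instead of `num_ns + 1 = 5`; the helper names are hypothetical and this is not how Keras implements the loss internally.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def softmax_cross_entropy(logits, one_hot):
    """Cross-entropy between a one-hot label and softmax(logits)."""
    shift = max(logits)  # subtract the max for numerical stability
    log_norm = shift + math.log(sum(math.exp(x - shift) for x in logits))
    return -sum(y * (x - log_norm) for x, y in zip(logits, one_hot))

target_vec = [1.0, 0.0]
context_vecs = [
    [2.0, 0.0],   # positive context word
    [0.0, 1.0],   # negative sample
    [-1.0, 0.5],  # negative sample
]
label = [1, 0, 0]

logits = [dot(c, target_vec) for c in context_vecs]
loss = softmax_cross_entropy(logits, label)
print(logits)  # [2.0, 0.0, -1.0]
print(round(loss, 4))
```

Training pushes the target embedding toward the positive context embedding and away from the negatives, which lowers this loss.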

In [71]:
# Define callback to log training statistics
log_dir = "logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)

In [72]:
# Train the model 
word2vec.fit(dataset, epochs=20, callbacks=[tensorboard_callback])

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<tensorflow.python.keras.callbacks.History at 0x7f6d7bb39320>

In [73]:
# View the model summary (summary() prints the table itself and returns None)
word2vec.summary()

Model: "word2_vec_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
w2v_embedding (Embedding)    multiple                  2340352   
_________________________________________________________________
embedding_3 (Embedding)      multiple                  2340352   
_________________________________________________________________
dot_3 (Dot)                  multiple                  0         
_________________________________________________________________
flatten_3 (Flatten)          multiple                  0         
Total params: 4,680,704
Trainable params: 4,680,704
Non-trainable params: 0
_________________________________________________________________
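
The parameter counts in the summary follow directly from the vocabulary size and embedding dimension used in this notebook: each `Embedding` layer stores one vector of length `embedding_dim` per vocabulary entry, and the `Dot` and `Flatten` layers have no weights. A quick check:

```python
vocab_size = 18284    # printed by the tokenizer cell
embedding_dim = 128   # set when the model was built

params_per_embedding = vocab_size * embedding_dim
print(params_per_embedding)      # 2340352, matches each Embedding row
print(2 * params_per_embedding)  # 4680704, matches "Total params"
```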


In [74]:
# Output training logs
%tensorboard --logdir logs

Reusing TensorBoard on port 6006 (pid 196), started 3:03:12 ago. (Use '!kill 196' to kill it.)


In [75]:
# Get the weights from the model 
weights = word2vec.get_layer('w2v_embedding').get_weights()[0]

In [76]:
# Create and save the vectors and metadata files
with io.open('vectors.tsv', 'w', encoding='utf-8') as out_v, \
     io.open('metadata.tsv', 'w', encoding='utf-8') as out_m:
  for index in id2w:
    if index == 0:
      continue  # Skip the padding token
    vector = weights[index]
    out_v.write('\t'.join([str(x) for x in vector]) + "\n")
    out_m.write(id2w[index] + "\n")
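
One way to sanity-check exported files of this shape is to read them back and look up a word's nearest neighbours by cosine similarity. The sketch below is an assumption about how one might do this, not part of the original notebook: it writes two tiny toy files with made-up vectors to a temporary directory instead of using the real `vectors.tsv` and `metadata.tsv`, and the helper names are hypothetical.

```python
import math
import os
import tempfile

def load_embeddings(vectors_path, metadata_path):
    """Read the two TSV files back into a {word: vector} dictionary."""
    with open(vectors_path, encoding="utf-8") as v, \
         open(metadata_path, encoding="utf-8") as m:
        return {
            word.rstrip("\n"): [float(x) for x in line.split("\t")]
            for word, line in zip(m, v)
        }

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm

def nearest(word, embeddings, k=2):
    """Return the k words whose vectors are most similar to `word`."""
    scores = [(cosine(embeddings[word], vec), other)
              for other, vec in embeddings.items() if other != word]
    return [other for _, other in sorted(scores, reverse=True)[:k]]

# Toy files standing in for vectors.tsv / metadata.tsv (made-up numbers).
tmp_dir = tempfile.mkdtemp()
vec_path = os.path.join(tmp_dir, "vectors.tsv")
meta_path = os.path.join(tmp_dir, "metadata.tsv")
with open(vec_path, "w", encoding="utf-8") as f:
    f.write("1.0\t0.0\n0.9\t0.1\n0.0\t1.0\n")
with open(meta_path, "w", encoding="utf-8") as f:
    f.write("king\nqueen\napple\n")

embeddings = load_embeddings(vec_path, meta_path)
print(nearest("king", embeddings, k=1))  # ['queen']
```

Line `i` of the metadata file names the word whose vector is on line `i` of the vectors file, so the two files must be read in lockstep.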

In [77]:
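# Download the embeddings (only works when running inside Google Colab)
try:
  from google.colab import files
  files.download('vectors.tsv')
  files.download('metadata.tsv')
except Exception as e:
  # Outside Colab the import fails; the files remain in the working directory
  print("Download skipped: {}".format(e))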


In [78]:
# TODOs:
# [DONE] Create context word vectors for training and label vectors (since the task is prediction)
# [DONE] Build a neural net with softmax to find the distribution of the most likely context word
# Train the network on the prediction task and extract the hidden layer weights as embeddings
# Sample from the model by feeding it a word and predicting one of its context words