# Exploring Word2Vec Vectorized Embedding Model Training from Scratch

# **Warning: this is a work in progress**

I implemented a Word2Vec model from scratch, largely following the tutorial linked below. I am still expanding it to cover additional algorithms and areas of interest.

https://www.tensorflow.org/text/tutorials/word2vec

TODO - describe vectorized embeddings

For now, see this blog post - https://txt.cohere.com/sentence-word-embeddings/

## Import File and show first 25 lines

Load a file containing about six novels from Project Gutenberg. The first book is Mary Shelley's Frankenstein, as you can see from the preview.

In [283]:
import io
import re
import string
import tqdm

import numpy as np

import tensorflow as tf
from tensorflow.keras import layers

import os
fullPath = os.path.abspath("./" + 'some-books.txt') 
path_to_file = tf.keras.utils.get_file('some-books.txt', 'file://'+fullPath)

with open(path_to_file) as f:
  lines = f.read().splitlines()
for line in lines[:25]:
  print(line)


Letter 1

_To Mrs. Saville, England._


St. Petersburgh, Dec. 11th, 17—.


You will rejoice to hear that no disaster has accompanied the
commencement of an enterprise which you have regarded with such evil
forebodings. I arrived here yesterday, and my first task is to assure
my dear sister of my welfare and increasing confidence in the success
of my undertaking.

I am already far north of London, and as I walk in the streets of
Petersburgh, I feel a cold northern breeze play upon my cheeks, which
braces my nerves and fills me with delight. Do you understand this
feeling? This breeze, which has travelled from the regions towards
which I am advancing, gives me a foretaste of those icy climes.
Inspirited by this wind of promise, my daydreams become more fervent
and vivid. I try in vain to be persuaded that the pole is the seat of
frost and desolation; it ever presents itself to my imagination as the
region of beauty and delight. There, Margaret, the sun is for ever
visible, its broad disk

## Remove empty lines from the file

In [284]:
text_ds = tf.data.TextLineDataset(path_to_file).filter(lambda x: tf.cast(tf.strings.length(x), bool))
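
As a rough illustration of what this filter does, here is a plain-Python sketch without TensorFlow. The `sample_lines` list is made up for the example. A line is kept whenever its length is non-zero, so a whitespace-only line would still pass.

```python
# Plain-Python analogue of the TextLineDataset filter above.
# `sample_lines` is hypothetical data used only for this illustration.
sample_lines = ["Letter 1", "", "_To Mrs. Saville, England._", "", " "]

# bool(len(line)) mirrors tf.cast(tf.strings.length(x), bool):
# zero length -> False (dropped), any other length -> True (kept).
non_empty = [line for line in sample_lines if bool(len(line))]

print(non_empty)
```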

## Remove punctuation characters and lowercase all the words

Individual words (tokens) are the unit of vectorization, so we don't want to compute different vectors for "the" and "The".

This function is called below in the text vectorization step.

In [285]:
# Now, create a custom standardization function to lowercase the text and
# remove punctuation.
def custom_standardization(input_data):
  lowercase = tf.strings.lower(input_data)
  return tf.strings.regex_replace(lowercase,
                                  '[%s]' % re.escape(string.punctuation), '')
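
To see the effect of this standardization on a real line from the preview, here is a plain-Python sketch using the same character class with the standard `re` module. The helper name `standardize` is made up for the example. One detail it highlights: `string.punctuation` only covers ASCII punctuation, so the em dash in "17\u2014." is not removed, which is consistent with that line producing five tokens later in this notebook.

```python
import re
import string

# Same character class as custom_standardization, compiled once.
PUNCT_PATTERN = re.compile("[%s]" % re.escape(string.punctuation))

def standardize(text: str) -> str:
    """Lowercase the text and delete ASCII punctuation characters."""
    return PUNCT_PATTERN.sub("", text.lower())

print(standardize("_To Mrs. Saville, England._"))
# The em dash (U+2014) is not part of string.punctuation, so it survives.
print(standardize("Dec. 11th, 17\u2014."))
```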





## Vectorize the text

The goal here is to translate words into integer token indices; each index will later identify one embedding vector.

Note that the standardization function defined above is passed in to pre-process the text.

We cap the vocabulary at a maximum size, which removes the least frequent tokens.

Note - we aren't computing anything yet. We are only defining a Keras layer in TensorFlow.



In [286]:
# Define the vocabulary size and the number of words in a sequence.
vocab_size = 4096
sequence_length = 10

# Use the `TextVectorization` layer to normalize, split, and map strings to
# integers. Set the `output_sequence_length` length to pad all samples to the
# same length.
vectorize_layer = tf.keras.layers.TextVectorization(
    standardize=custom_standardization,
    max_tokens=vocab_size,
    output_mode='int',
    output_sequence_length=sequence_length)




## Compute a vocabulary of string terms (tokens)

Utilizes the layer defined above

From documentation:

During adapt(), the layer will build a vocabulary of all string tokens seen in the dataset, sorted by occurrence count, with ties broken by sort order of the tokens (high to low). At the end of adapt(), if max_tokens is set, the vocabulary will be truncated to max_tokens size. For example, adapting a layer with max_tokens=1000 will compute the 1000 most frequent tokens occurring in the input dataset. If output_mode='tf-idf', adapt() will also learn the document frequencies of each token in the input dataset.

https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization#adapt
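
The following is a minimal plain-Python sketch of the idea described in that documentation excerpt, not the layer's actual implementation. The function name `build_vocab` is hypothetical, tokens are split on whitespace, and ties are broken alphabetically here, which is an assumption that may differ from what the layer does. Indices 0 and 1 are reserved for padding and `[UNK]`, matching the vocabulary printed later in this notebook.

```python
from collections import Counter

def build_vocab(lines, max_tokens):
    """Return a frequency-sorted vocabulary capped at max_tokens entries."""
    counts = Counter(token for line in lines for token in line.split())
    # Most frequent first; alphabetical tie-break is an assumption of this sketch.
    ranked = sorted(counts, key=lambda token: (-counts[token], token))
    # Two slots are reserved: '' for padding and '[UNK]' for unknown words.
    return ["", "[UNK]"] + ranked[: max_tokens - 2]

toy_vocab = build_vocab(["the cat sat", "the cat ran", "the dog"], max_tokens=5)
print(toy_vocab)
```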


We pass the text dataset created above to `adapt()` in batches of 1024 lines at a time

In [287]:
vectorize_layer.adapt(text_ds.batch(1024))

2024-03-30 16:34:25.736479: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


## View the computed vocabulary (tokens)

Ultimately, when we are done, each token will be represented by a point in a multi-dimensional space (strictly speaking, a vector).

The array is called `inverse_vocab` because from this point on we reference a token by its numerical ID, not its string value. A token's position in the array is its ID, so the array maps IDs back to strings.

Later, we will save a .tsv file containing this textual metadata to accompany the array of N-dimensional vectors we are aiming to produce.

In [288]:
# Save the created vocabulary for reference.
inverse_vocab = vectorize_layer.get_vocabulary()
print(inverse_vocab[:20])

['', '[UNK]', 'the', 'and', 'of', 'to', 'a', 'in', 'i', 'that', 'his', 'was', 'he', 'it', 'with', 'but', 'as', 'my', 'for', 'had']


## Replace each word in the data set with its numerical index value

Suppose the data set is "and then I saw the light and"

Suppose we also built a vocabulary where each word's position in the array is its numerical value (ignoring, for simplicity, the padding and [UNK] entries that the real vocabulary reserves at indices 0 and 1):
["and", "then", "I", "saw", "the", "light"]

This step translates the data set from strings to the token index representation:

[0,1,2,3,4,0]

Note the use of prefetch and AUTOTUNE. They affect the performance of the pipeline across a large data set and are described in detail here - https://www.tensorflow.org/guide/data_performance

Note also that our sequences were defined with a length of 10 above, so each line is truncated or zero-padded to exactly 10 tokens. Compare the lengths of the original lines with the transformed data in the output.
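
A small plain-Python sketch of this lookup-then-pad step may help. The `encode` helper and `toy_vocab` are made up for the example, and the real layer also applies the standardization function first. Unknown words map to the `[UNK]` index, long lines are cut at `sequence_length`, and short lines are padded with 0.

```python
def encode(line, vocab, sequence_length):
    """Map whitespace-split tokens to vocabulary indices, then pad or truncate."""
    index_of = {token: i for i, token in enumerate(vocab)}
    unk_id = index_of["[UNK]"]
    ids = [index_of.get(token, unk_id) for token in line.split()]
    ids = ids[:sequence_length]                      # truncate long lines
    return ids + [0] * (sequence_length - len(ids))  # pad short lines with 0

# Hypothetical toy vocabulary: index 0 is padding, index 1 is unknown.
toy_vocab = ["", "[UNK]", "the", "and", "light", "saw"]
encoded = encode("and then i saw the light and", toy_vocab, sequence_length=10)
print(encoded)
```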

**Note:** how we create these sequences could be improved. Vectorizing is an attempt to capture the semantic and contextual meaning of words, and cutting sequences at line boundaries breaks sentences apart. One idea is a sliding n-gram window that runs across the line breaks.
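
One possible shape for that improvement, sketched in plain Python under the assumption that the whole corpus has first been flattened into a single list of token IDs. The function name `sliding_windows` and its `size` and `stride` parameters are hypothetical and not part of the notebook.

```python
def sliding_windows(token_ids, size, stride):
    """Yield fixed-size windows over one continuous token stream.

    A stream shorter than `size` yields a single short window.
    """
    last_start = max(len(token_ids) - size, 0)
    for start in range(0, last_start + 1, stride):
        yield token_ids[start:start + size]

# Twelve made-up token IDs standing in for a flattened corpus.
stream = list(range(1, 13))
windows = list(sliding_windows(stream, size=10, stride=1))
print(len(windows), windows[0], windows[-1])
```

A stride of 1 produces heavily overlapping windows; a larger stride trades coverage for fewer sequences.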

Also, a note on the OUT_OF_RANGE warning in the output - https://stackoverflow.com/questions/53930242/how-to-fix-a-outofrangeerror-end-of-sequence-error-when-training-a-cnn-with-t

This is expected behavior in this case: it comes from consuming a TensorFlow iterator to its end instead of using a for loop with an exact dimension.

In [289]:
import numpy

SEED = 42
AUTOTUNE = tf.data.AUTOTUNE


# Vectorize the data in text_ds.
text_vector_ds = text_ds.batch(1024).prefetch(AUTOTUNE).map(vectorize_layer).unbatch()

#show a snippet of the transformed data set
print("Showing Original Data")
snippet_size=5
snippet_original_data = text_ds.take(snippet_size)
for data in snippet_original_data:
    print(data)

print("\n\nShowing Transformed Data")
snippet_data = text_vector_ds.take(snippet_size)
for data in snippet_data:
    print(data)

Showing Original Data
tf.Tensor(b'Letter 1', shape=(), dtype=string)
tf.Tensor(b'_To Mrs. Saville, England._', shape=(), dtype=string)
tf.Tensor(b'St. Petersburgh, Dec. 11th, 17\xe2\x80\x94.', shape=(), dtype=string)
tf.Tensor(b'You will rejoice to hear that no disaster has accompanied the', shape=(), dtype=string)
tf.Tensor(b'commencement of an enterprise which you have regarded with such evil', shape=(), dtype=string)


Showing Transformed Data
tf.Tensor([747   1   0   0   0   0   0   0   0   0], shape=(10,), dtype=int64)
tf.Tensor([  5 619   1 812   0   0   0   0   0   0], shape=(10,), dtype=int64)
tf.Tensor([1178    1    1    1    1    0    0    0    0    0], shape=(10,), dtype=int64)
tf.Tensor([  28   68    1    5  317    9   53 4003  107 2236], shape=(10,), dtype=int64)
tf.Tensor([   1    4   42 2454   34   28   33 1107   14   85], shape=(10,), dtype=int64)


2024-03-30 16:34:25.824530: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
2024-03-30 16:34:25.872803: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


## Count the total number of sequences computed in the layer

In [290]:
sequences = list(text_vector_ds.as_numpy_iterator())
print(len(sequences))

31261


2024-03-30 16:34:26.747469: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


## Show a breakdown again of token index to vectorized input

Similar to what was printed above; this is another way to view a sample

In [291]:
for seq in sequences[:5]:
  print(f"{seq} => {[inverse_vocab[i] for i in seq]}")

[747   1   0   0   0   0   0   0   0   0] => ['letter', '[UNK]', '', '', '', '', '', '', '', '']
[  5 619   1 812   0   0   0   0   0   0] => ['to', 'mrs', '[UNK]', 'england', '', '', '', '', '', '']
[1178    1    1    1    1    0    0    0    0    0] => ['st', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '', '', '', '', '']
[  28   68    1    5  317    9   53 4003  107 2236] => ['you', 'will', '[UNK]', 'to', 'hear', 'that', 'no', 'disaster', 'has', 'accompanied']
[   1    4   42 2454   34   28   33 1107   14   85] => ['[UNK]', 'of', 'an', 'enterprise', 'which', 'you', 'have', 'regarded', 'with', 'such']


## Generating Sampling Data

Below we define a function for generating sample training data; we are not calling it yet.

OK, there's a lot going on here. When I have time I will summarize it better. 

The source tutorial has some good information - https://www.tensorflow.org/text/tutorials/word2vec#generate_training_data

Something key to zoom in on that isn't described too well in the source tutorial - use of negative pairs.

First we have to understand negative and positive sampling pairs. 

In the skip-gram technique, which is a type of word embedding model, the goal is to learn distributed representations of words in a continuous vector space. This is achieved by training a model to predict the context words given a target word, or vice versa. Skip-gram employs both positive and negative sampling to train the model efficiently on large datasets without requiring labeled data.
Positive Sampling:

    Definition: Positive sampling involves generating pairs of words where one word is the target word and the other is a context word that occurs within a certain window around the target word in the text corpus.
    Example: If the sentence is "The cat sat on the mat", and we consider a window size of 2, then for the word "cat", the positive samples are the pairs ("cat", "the"), ("cat", "sat"), and ("cat", "on"). The word "mat" is four positions away, so it falls outside the window.
    Training Objective: The neural network is trained to predict the context words given a target word. In other words, given the target word "cat", the network should predict "the", "sat", and "on".

Negative Sampling:

    Definition: Negative sampling addresses the imbalance between positive (context) and negative (non-context) examples by randomly selecting negative samples during training.
    Example: For each positive skip-gram pair ("cat", "the"), instead of considering all non-context words as negative examples, negative sampling randomly selects a subset of words from the vocabulary that are not context words for "cat". These randomly selected words serve as negative examples.
    Training Objective: The neural network is trained to distinguish between true context words (positive examples) and randomly sampled non-context words (negative examples). It learns to assign higher probabilities to true context words and lower probabilities to randomly sampled non-context words.
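
To make the window concrete, here is a minimal plain-Python sketch of positive pair generation. The function name `positive_skip_grams` is made up for the example, and unlike the Keras `skipgrams` helper used later it does no frequency-based subsampling or shuffling. With a window size of 2, "cat" pairs with the neighbours that exist within two positions on each side: "the", "sat", and "on".

```python
def positive_skip_grams(tokens, window_size):
    """Return (target, context) pairs for words within window_size positions."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window_size)
        hi = min(len(tokens), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

sentence = "the cat sat on the mat".split()
pairs = positive_skip_grams(sentence, window_size=2)
cat_contexts = [context for target, context in pairs if target == "cat"]
print(cat_contexts)
```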

**Important note about how correct values are computed without labelled data**

Skip-gram training is considered unsupervised or unlabeled training because it doesn't require labeled data. The model learns from the structure of the text corpus itself.
During training, the model computes a loss function based on the predictions it makes for positive and negative samples.

Importantly, the embedding arrays start out filled with random values. At first, these random values might indicate that "dog" is semantically similar to "computer".

Through iterations of training, the loss calculation refines these initial random guesses and derives meaningful values across the matrices of data. In other words, the model weights are produced iteratively through simple mathematical calculations. I suggest learning more about this process, as it is a key aspect of understanding intuitively how LLMs are trained in the pre-training phase.

Next, we have to understand how the worst case would be computed to see why we use these negative sampling approximations.

In the worst case, for every positive word association, backpropagation would also have to update the model against every other token in the vocabulary as a negative example!

The code below approximates this with negative sampling, based on research showing that picking a few negative examples at random is sufficient.

The use of negative sampling in skip-gram models is a technique employed to address the imbalance between the number of negative examples and positive examples in the dataset. In skip-gram models, for each positive skip-gram pair (target word, context word), there are potentially a vast number of words in the vocabulary that are not context words for the target word. This leads to a highly imbalanced dataset, as the majority of training examples would be negative (non-context) examples.

By using negative sampling, we can reduce the computational cost associated with training on this imbalanced dataset. Instead of considering all non-context words as negative examples, negative sampling randomly selects a small subset of negative examples for each positive example during training. This subset is typically much smaller than the entire vocabulary size, making the training process more efficient.
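
For a sense of scale in this notebook: with `num_ns=4`, each training example scores 5 candidate words instead of all 4096 vocabulary entries. The negatives themselves are not drawn uniformly. As I read the TensorFlow documentation for `tf.random.log_uniform_candidate_sampler`, class `c` is drawn with probability `(log(c + 2) - log(c + 1)) / log(range_max + 1)`, which favours low indices, that is, frequent words in a frequency-sorted vocabulary. The sketch below only evaluates that formula in plain Python; the helper name is hypothetical.

```python
import math

def log_uniform_probability(class_id, range_max):
    """Sampling probability of class_id under the log-uniform (Zipfian) formula."""
    return (math.log(class_id + 2) - math.log(class_id + 1)) / math.log(range_max + 1)

vocab_size = 4096
probs = [log_uniform_probability(c, vocab_size) for c in range(vocab_size)]

# Frequent (low-index) tokens are far more likely to be drawn as negatives.
print(probs[2], probs[1000], sum(probs))
```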

Another note on negative sampling - 

I haven't gotten great results with these vectors. Adding more books to the data set improved results, and I've tried tweaking parameters such as the context window length for skip-gram. I currently suspect that the negative sampling method, while mathematically sound, isn't quite right in the implementation below. I think this is an important area of future improvement for this implementation. More details on the method that is approximated here - https://arxiv.org/pdf/1402.3722v1.pdf

## Explanation of the output - Targets, Contexts, and Labels

Wait, I thought this was unlabeled? It is, in the sense that no human-provided labels are needed. Each generated pair is, however, labelled as positive or negative, as described below - 

1. Targets:

    Definition: The targets represent the target words for which we are training the model to predict the context words.
    Purpose:
        Training Objective: Each element in the targets list represents a target word from the training data. During training, the skip-gram model aims to predict the context words surrounding each target word. For example, if the target word is "cat" in the sentence "The cat sat on the mat" and the window size is 2, the model should predict context words like "the", "sat", and "on".
        Model Input: The target words serve as inputs to the skip-gram model. The model tries to learn meaningful word embeddings for each target word based on its context.

2. Contexts:

    Definition: The contexts represent the context words surrounding the target words in the training data.
    Purpose:
        Training Objective: Each element in the contexts list contains the context words (both positive and negative) associated with a target word. Positive context words are the actual words occurring in the vicinity of the target word in the training data, while negative context words are randomly sampled from the vocabulary.
        Model Input: The context words, along with the target word, are provided as input to the skip-gram model. The model learns to predict these context words given the target word, thereby capturing the semantic meaning and relationships between words.

3. Labels:

    Definition: The labels represent the labels associated with each target-context pair, indicating whether a context word is a positive example (1) or a negative example (0).
    Purpose:
        Training Objective: Each element in the labels list indicates whether a context word is a positive example (a true context word) or a negative example (a non-context word randomly sampled from the vocabulary). During training, the skip-gram model aims to correctly classify these context words based on their relevance to the target word.
        Loss Calculation: The labels are used to compute the loss function during training. The model's predictions (probabilities) are compared against these labels to compute the loss, which guides the model's parameter updates through backpropagation.
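
Putting the three pieces together, a single training example can be sketched in plain Python as follows. The `make_example` helper is hypothetical. It draws negatives uniformly and excludes the positive context word, both of which are simplifications chosen here for clarity and not necessarily what the TensorFlow sampler does.

```python
import random

def make_example(target, positive_context, num_ns, vocab_size, rng):
    """Build (target, context, label) for one positive pair plus num_ns negatives."""
    negatives = []
    while len(negatives) < num_ns:
        candidate = rng.randrange(vocab_size)
        # Simplification of this sketch: skip the true context and duplicates.
        if candidate != positive_context and candidate not in negatives:
            negatives.append(candidate)
    context = [positive_context] + negatives  # positive first, then negatives
    label = [1] + [0] * num_ns                # 1 marks the true context word
    return target, context, label

rng = random.Random(42)
target, context, label = make_example(
    target=12, positive_context=7, num_ns=4, vocab_size=4096, rng=rng)
print(target, context, label)
```

The positive word always sits in position 0 of `context`, which is why every label row in the printed samples further down reads `[1 0 0 0 0]`.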



In [292]:
# Generates skip-gram pairs with negative sampling for a list of sequences
# (int-encoded sentences) based on window size, number of negative samples
# and vocabulary size.
def generate_training_data(sequences, window_size, num_ns, vocab_size, seed):
  # Elements of each training example are appended to these lists.
  targets, contexts, labels = [], [], []

  # Build the sampling table for `vocab_size` tokens.
  sampling_table = tf.keras.preprocessing.sequence.make_sampling_table(vocab_size)

  # Iterate over all sequences (sentences) in the dataset.
  for sequence in tqdm.tqdm(sequences):

    # Generate positive skip-gram pairs for a sequence (sentence).
    positive_skip_grams, _ = tf.keras.preprocessing.sequence.skipgrams(
          sequence,
          vocabulary_size=vocab_size,
          sampling_table=sampling_table,
          window_size=window_size,
          negative_samples=0)

    # Iterate over each positive skip-gram pair to produce training examples
    # with a positive context word and negative samples.
    for target_word, context_word in positive_skip_grams:
      context_class = tf.expand_dims(
          tf.constant([context_word], dtype="int64"), 1)
      negative_sampling_candidates, _, _ = tf.random.log_uniform_candidate_sampler(
          true_classes=context_class,
          num_true=1,
          num_sampled=num_ns,
          unique=True,
          range_max=vocab_size,
          seed=seed,
          name="negative_sampling")

      # Build context and label vectors (for one target word)
      context = tf.concat([tf.squeeze(context_class,1), negative_sampling_candidates], 0)
      label = tf.constant([1] + [0]*num_ns, dtype="int64")

      # Append each element from the training example to global lists.
      targets.append(target_word)
      contexts.append(context)
      labels.append(label)

  return targets, contexts, labels

## Call the method above to generate our training data

Note a few parameters being set, though - 

We set the window size, which is the number of words considered on each side of a target word. This determines how many positive (target, context) pairs are created for training.

num_ns = number of negative samples drawn for each positive pair, see markdown above

random seed - this seed was set above and is passed to the negative sampler (`tf.random.log_uniform_candidate_sampler`), so it controls which negative samples are drawn. It does not control the random initialization of the embedding weights. For background on how those weights are then trained, see backpropagation training - https://towardsdatascience.com/understanding-backpropagation-algorithm-7bb3aa2f95fd 

TODO - more simple example of backpropagation 

When we print the shapes, the first dimension of each is the number of training examples generated

In [293]:
targets, contexts, labels = generate_training_data(
    sequences=sequences,
    window_size=2,
    num_ns=4,
    vocab_size=vocab_size,
    seed=SEED)

targets = np.array(targets)
contexts = np.array(contexts)
labels = np.array(labels)

print('\n')
print(f"targets.shape: {targets.shape}")
print(f"contexts.shape: {contexts.shape}")
print(f"labels.shape: {labels.shape}")

#TODO count how many are positive etc.


  0%|          | 0/31261 [00:00<?, ?it/s]

100%|██████████| 31261/31261 [00:07<00:00, 3966.10it/s]




targets.shape: (101773,)
contexts.shape: (101773, 5)
labels.shape: (101773, 5)


## TODO explanation about tensors

In [294]:
BATCH_SIZE = 1024
BUFFER_SIZE = 10000
dataset = tf.data.Dataset.from_tensor_slices(((targets, contexts), labels))
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
print(dataset)


<_BatchDataset element_spec=((TensorSpec(shape=(1024,), dtype=tf.int64, name=None), TensorSpec(shape=(1024, 5), dtype=tf.int64, name=None)), TensorSpec(shape=(1024, 5), dtype=tf.int64, name=None))>


In [295]:
dataset = dataset.cache().prefetch(buffer_size=AUTOTUNE)
print(dataset)

<_PrefetchDataset element_spec=((TensorSpec(shape=(1024,), dtype=tf.int64, name=None), TensorSpec(shape=(1024, 5), dtype=tf.int64, name=None)), TensorSpec(shape=(1024, 5), dtype=tf.int64, name=None))>


## Print a sample of the final training data set

In [296]:
# Define the number of elements to print
#TODO improve this a bit

num_elements_to_print = 2

# Iterate over the dataset and print elements
for idx, ((target_batch, context_batch), label_batch) in enumerate(dataset.take(num_elements_to_print)):
    print(f"Batch {idx + 1}:")
    # Iterate over each example within the batch
    for i in range(target_batch.shape[0]):
        print(f"Example {i + 1}:")
        print("Target:", target_batch[i].numpy())
        print("Context:", context_batch[i].numpy())
        print("Label:", label_batch[i].numpy())

Batch 1:
Example 1:
Target: 1187
Context: [1843   11 4000   80   57]
Label: [1 0 0 0 0]
Example 2:
Target: 175
Context: [ 114    0  142 1343    1]
Label: [1 0 0 0 0]
Example 3:
Target: 46
Context: [  8  83   0 978  58]
Label: [1 0 0 0 0]
Example 4:
Target: 248
Context: [2909   96 1374   33   52]
Label: [1 0 0 0 0]
Example 5:
Target: 2354
Context: [  6  32   3 408 392]
Label: [1 0 0 0 0]
Example 6:
Target: 476
Context: [   2 1645  140   22  527]
Label: [1 0 0 0 0]
Example 7:
Target: 191
Context: [3644 1304  828  262   68]
Label: [1 0 0 0 0]
Example 8:
Target: 2143
Context: [ 188    1    6 1939   11]
Label: [1 0 0 0 0]
Example 9:
Target: 877
Context: [   4 3364  685 1238   16]
Label: [1 0 0 0 0]
Example 10:
Target: 2178
Context: [28 10  4  6  0]
Label: [1 0 0 0 0]
Example 11:
Target: 229
Context: [ 159 2039    3 1172  180]
Label: [1 0 0 0 0]
Example 12:
Target: 1819
Context: [   8   76  410 3254 2569]
Label: [1 0 0 0 0]
Example 13:
Target: 3062
Context: [682  62 123   3  11]
Label: [1 0 

2024-03-30 16:34:35.809658: W tensorflow/core/kernels/data/cache_dataset_ops.cc:858] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset  will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.


Label: [1 0 0 0 0]
Example 469:
Target: 1079
Context: [ 43   0 107   1 104]
Label: [1 0 0 0 0]
Example 470:
Target: 3016
Context: [ 36 517 132  14 903]
Label: [1 0 0 0 0]
Example 471:
Target: 341
Context: [  31 3469   15 3290  689]
Label: [1 0 0 0 0]
Example 472:
Target: 9
Context: [1890  916   14    4  231]
Label: [1 0 0 0 0]
Example 473:
Target: 571
Context: [  1   0 162 333  14]
Label: [1 0 0 0 0]
Example 474:
Target: 3958
Context: [2195   16    1 2565    3]
Label: [1 0 0 0 0]
Example 475:
Target: 668
Context: [ 255 1516  425    6    1]
Label: [1 0 0 0 0]
Example 476:
Target: 2965
Context: [3306  990    0   16    5]
Label: [1 0 0 0 0]
Example 477:
Target: 2234
Context: [136   2 778   0 179]
Label: [1 0 0 0 0]
Example 478:
Target: 643
Context: [3526    2    4  239  190]
Label: [1 0 0 0 0]
Example 479:
Target: 2276
Context: [3570  474 3378  387 1664]
Label: [1 0 0 0 0]
Example 480:
Target: 3986
Context: [ 17 248  33 106   4]
Label: [1 0 0 0 0]
Example 481:
Target: 327
Context: [ 986 1

2024-03-30 16:34:36.078919: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


## Create a class wrapper for the model

TODO - add detail here

In [297]:
class Word2Vec(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim):
    super(Word2Vec, self).__init__()
    self.target_embedding = layers.Embedding(vocab_size,
                                      embedding_dim,
                                      name="w2v_embedding")
    self.context_embedding = layers.Embedding(vocab_size,
                                       embedding_dim)

  def call(self, pair):
    target, context = pair
    # target: (batch, dummy?)  # The dummy axis doesn't exist in TF2.7+
    # context: (batch, context)
    if len(target.shape) == 2:
      target = tf.squeeze(target, axis=1)
    # target: (batch,)
    word_emb = self.target_embedding(target)
    # word_emb: (batch, embed)
    context_emb = self.context_embedding(context)
    # context_emb: (batch, context, embed)
    dots = tf.einsum('be,bce->bc', word_emb, context_emb)
    # dots: (batch, context)
    return dots
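
The `einsum('be,bce->bc', ...)` call is compact, so here is a plain-Python sketch of what it computes: for each example in the batch, the dot product of the target embedding with each of its candidate context embeddings. The function name `score_contexts` and the toy numbers are made up for the illustration.

```python
def score_contexts(word_emb, context_emb):
    """Plain-Python equivalent of einsum('be,bce->bc', word_emb, context_emb)."""
    return [
        [sum(w * c for w, c in zip(word_vec, ctx_vec)) for ctx_vec in ctx_vecs]
        for word_vec, ctx_vecs in zip(word_emb, context_emb)
    ]

# batch=1, embed=2
word_emb = [[1.0, 2.0]]
# batch=1, context=3, embed=2
context_emb = [[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]]

dots = score_contexts(word_emb, context_emb)
print(dots)
```

Each row of `dots` holds one logit per candidate context word; a larger value means the model considers that word a more likely true context.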

In [298]:
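# Element-wise sigmoid cross-entropy on raw logits. This helper is defined for
# reference only; the compile() call below uses CategoricalCrossentropy instead.
def custom_loss(x_logit, y_true):
  return tf.nn.sigmoid_cross_entropy_with_logits(logits=x_logit, labels=y_true)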
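
For reference, sigmoid cross-entropy on a single logit `x` with label `z` can be written in the numerically stable form `max(x, 0) - x * z + log(1 + exp(-|x|))`. The plain-Python sketch below evaluates that expression; the helper name is hypothetical, and it is meant as an illustration of the formula, not a replacement for the TensorFlow op.

```python
import math

def sigmoid_cross_entropy(logit, label):
    """Stable sigmoid cross-entropy: max(x, 0) - x * z + log(1 + exp(-|x|))."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

# A confident, correct prediction costs little; a confident, wrong one costs a lot.
print(sigmoid_cross_entropy(10.0, 1.0), sigmoid_cross_entropy(10.0, 0.0))
```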

## Train the model

TODO - explanation of batch and epoch size vs total number of training pairs above

TODO - improve to fully use training pairs
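
As a starting point for those TODOs, the arithmetic relating training pairs, batch size, and steps per epoch can be checked directly. The numbers below are copied from earlier outputs in this notebook: 101773 training examples and `BATCH_SIZE = 1024` with `drop_remainder=True`.

```python
num_examples = 101_773  # targets.shape[0] printed earlier
batch_size = 1024       # BATCH_SIZE
epochs = 80

# drop_remainder=True discards the final partial batch.
steps_per_epoch, dropped = divmod(num_examples, batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, dropped, total_steps)
```

So each epoch runs 99 batches, and 397 examples fall into the final partial batch that `drop_remainder=True` discards. Because the dataset is shuffled before batching, which examples are left out may vary.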

In [299]:
embedding_dim = 128
word2vec = Word2Vec(vocab_size, embedding_dim)
word2vec.compile(optimizer='adam',
                 loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
                 metrics=['accuracy'])

tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir="logs")

## TRAIN MODEL!
word2vec.fit(dataset, epochs=80, callbacks=[tensorboard_callback])


Epoch 1/80


[1m99/99[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 3ms/step - accuracy: 0.2356 - loss: 1.6078
Epoch 2/80
[1m99/99[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - accuracy: 0.4698 - loss: 1.5532
Epoch 3/80
[1m99/99[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - accuracy: 0.4170 - loss: 1.4589
Epoch 4/80
[1m99/99[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - accuracy: 0.4618 - loss: 1.3817
Epoch 5/80
[1m99/99[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - accuracy: 0.5172 - loss: 1.3051
Epoch 6/80
[1m99/99[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - accuracy: 0.5635 - loss: 1.2308
Epoch 7/80
[1m99/99[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 3ms/step - accuracy: 0.6027 - loss: 1.1591
Epoch 8/80
[1m99/99[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - accuracy: 0.6377 - loss: 1.0900
Epoch 9/80
[1m99/99[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1

<keras.src.callbacks.history.History at 0x783e3cb00910>
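
One way to read the first-epoch numbers: each example has 5 candidate context words (1 positive and 4 negatives). If the untrained model produces roughly equal logits for all of them, softmax cross-entropy comes out at ln(5), about 1.609, and random guessing gives about 20% accuracy. That is close to the loss of 1.6078 and accuracy of 0.2356 reported for epoch 1. The sketch below is a plain-Python illustration of that calculation, not the Keras loss itself; the helper name is made up.

```python
import math

def softmax_cross_entropy(logits, one_hot):
    """Cross-entropy between a one-hot label and softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return -sum(y * (x - log_sum_exp) for x, y in zip(logits, one_hot))

label = [1, 0, 0, 0, 0]
untrained = softmax_cross_entropy([0.0] * 5, label)               # equal logits
trained = softmax_cross_entropy([5.0, 0.0, 0.0, 0.0, 0.0], label)  # positive favoured

print(untrained, trained)
```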

In [300]:
weights = word2vec.get_layer('w2v_embedding').get_weights()
print(weights)

[array([[-0.0082402 , -0.01329427,  0.04379666, ...,  0.02847764,
        -0.03286812,  0.00471469],
       [-0.4285182 , -0.4188351 , -0.21779372, ..., -0.13081902,
         0.5675976 , -0.23294549],
       [ 0.44109792, -0.08916605, -0.7102801 , ..., -0.11670022,
        -0.04773828,  0.20705096],
       ...,
       [-0.10740773, -0.07029686, -0.25532085, ...,  0.38459933,
         0.11952375, -0.2823604 ],
       [ 0.3559356 ,  0.2777595 , -0.20183863, ..., -0.19113536,
         0.29477447, -0.28345653],
       [ 0.423261  , -0.10612626,  0.07722234, ...,  0.5100266 ,
         0.2859404 ,  0.24841903]], dtype=float32)]


In [301]:
len(weights)

1

## Take a look at some of the vectors

TODO - describe in layman's terms what a vector is and why it's useful

In [302]:
weights = word2vec.get_layer('w2v_embedding').get_weights()[0]
print(weights)

[[-0.0082402  -0.01329427  0.04379666 ...  0.02847764 -0.03286812
   0.00471469]
 [-0.4285182  -0.4188351  -0.21779372 ... -0.13081902  0.5675976
  -0.23294549]
 [ 0.44109792 -0.08916605 -0.7102801  ... -0.11670022 -0.04773828
   0.20705096]
 ...
 [-0.10740773 -0.07029686 -0.25532085 ...  0.38459933  0.11952375
  -0.2823604 ]
 [ 0.3559356   0.2777595  -0.20183863 ... -0.19113536  0.29477447
  -0.28345653]
 [ 0.423261   -0.10612626  0.07722234 ...  0.5100266   0.2859404
   0.24841903]]


In [303]:
weights.shape

(4096, 128)

In [304]:
vocab = vectorize_layer.get_vocabulary()
print(vocab)
print(len(vocab))

4096


## Write the vocabulary and model weights (vectors) to a file

In [305]:
# Context managers close both files even if a write fails part-way through.
with io.open('book-vectors.tsv', 'w', encoding='utf-8') as out_v, \
     io.open('book-metadata.tsv', 'w', encoding='utf-8') as out_m:
  for index, word in enumerate(vocab):
    if index == 0:
      continue  # skip 0, it's padding.
    vec = weights[index]
    out_v.write('\t'.join([str(x) for x in vec]) + "\n")
    out_m.write(word + "\n")
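
To work with the saved files outside the Embedding Projector, they can be read back into a dictionary. The sketch below is a minimal version that assumes the two files have the same number of lines in the same order, as written above. The `load_embeddings` helper is hypothetical, and the demo uses small temporary files as stand-ins for `book-vectors.tsv` and `book-metadata.tsv`.

```python
import os
import tempfile

def load_embeddings(vectors_path, metadata_path):
    """Read the two TSV files back into a {word: vector} dictionary."""
    with open(vectors_path, encoding="utf-8") as fv, \
         open(metadata_path, encoding="utf-8") as fm:
        return {
            word.rstrip("\n"): [float(x) for x in row.split("\t")]
            for word, row in zip(fm, fv)
        }

# Toy files standing in for book-vectors.tsv / book-metadata.tsv.
with tempfile.TemporaryDirectory() as tmp:
    v_path = os.path.join(tmp, "vectors.tsv")
    m_path = os.path.join(tmp, "metadata.tsv")
    with open(v_path, "w", encoding="utf-8") as f:
        f.write("0.1\t0.2\n0.3\t0.4\n")
    with open(m_path, "w", encoding="utf-8") as f:
        f.write("[UNK]\nthe\n")
    embeddings = load_embeddings(v_path, m_path)

print(embeddings)
```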


## View the loss propagation in a Tensorboard Magic Cell

TODO - describe why more epochs didn't yield better results

Photo of result:

![Tensor Board Example](tensorboard.png)

In [306]:
%load_ext tensorboard
%tensorboard --logdir logs


## TODO - implement cosine-similarity and find most similar words here

For now, load the weights here - https://projector.tensorflow.org/
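
Until that is implemented against the trained weights, here is a minimal plain-Python sketch of the idea. Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it ranges from -1 to 1. The helper names and the tiny `toy` embedding table are made up for the example; with the real model, the dictionary would map each vocabulary word to its row of `weights`.

```python
import math

def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)

def most_similar(word, embeddings, top_n=3):
    """Return the top_n (word, similarity) pairs closest to `word`."""
    query = embeddings[word]
    scored = [(other, cosine_similarity(query, vec))
              for other, vec in embeddings.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]

# Hypothetical 2-dimensional embeddings, for illustration only.
toy = {"king": [1.0, 1.0], "queen": [0.9, 1.1], "apple": [-1.0, 0.5]}
nearest = most_similar("king", toy, top_n=2)
print(nearest)
```

For the full 4096-word vocabulary, a vectorized NumPy version would be much faster, but the logic is the same.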
