### Word2Vec Test
Tutorial Source: https://www.tensorflow.org/text/tutorials/word2vec

#### Setup

In [1]:
import io
import re
import string
import tqdm
import json

import numpy as np
import pandas as pd

import tensorflow as tf
from tensorflow.keras import layers


In [3]:
# Load the TensorBoard notebook extension
%load_ext tensorboard

In [4]:
SEED = 42
AUTOTUNE = tf.data.AUTOTUNE

#### Vectorize an example sentence

In [5]:
sentence = "The wide road shimmered in the hot sun"
tokens = list(sentence.lower().split())
print(len(tokens))

8


In [6]:
# Create a vocabulary to save mappings from tokens to integer indices
vocab, index = {}, 1  # start indexing from 1
vocab['<pad>'] = 0  # add a padding token
for token in tokens:
  if token not in vocab:
    vocab[token] = index
    index += 1
vocab_size = len(vocab)
print(vocab)

{'<pad>': 0, 'the': 1, 'wide': 2, 'road': 3, 'shimmered': 4, 'in': 5, 'hot': 6, 'sun': 7}


In [7]:
# Create an inverse vocabulary to save mappings from integer indices to tokens
inverse_vocab = {index: token for token, index in vocab.items()}
print(inverse_vocab)

{0: '<pad>', 1: 'the', 2: 'wide', 3: 'road', 4: 'shimmered', 5: 'in', 6: 'hot', 7: 'sun'}


In [8]:
# Vectorize your sentence
example_sequence = [vocab[word] for word in tokens]
print(example_sequence)

[1, 2, 3, 4, 5, 1, 6, 7]


#### Generate skip-grams from one sentence

In [9]:
window_size = 2
positive_skip_grams, _ = tf.keras.preprocessing.sequence.skipgrams(
      example_sequence,
      vocabulary_size=vocab_size,
      window_size=window_size,
      # negative_samples is set to 0 here, we will use another 
      # function to perform negative sampling later
      negative_samples=0) 
print(len(positive_skip_grams))

26


In [10]:
# Print a few positive skip-grams
for target, context in positive_skip_grams[:5]:
  print(f"({target}, {context}): ({inverse_vocab[target]}, {inverse_vocab[context]})")

(1, 6): (the, hot)
(6, 1): (hot, the)
(2, 3): (wide, road)
(1, 7): (the, sun)
(7, 1): (sun, the)
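To make the windowing explicit, here is a minimal pure-Python sketch of how positive skip-gram pairs can be enumerated. It is an illustration rather than the Keras implementation: the helper name `positive_skipgram_pairs` is hypothetical, and unlike `skipgrams` it neither shuffles the pairs nor applies a sampling table.

```python
def positive_skipgram_pairs(sequence, window_size):
    """Return every (target, context) pair at most `window_size` positions apart."""
    pairs = []
    for i, target in enumerate(sequence):
        if target == 0:  # index 0 is reserved for padding, so it is never a target
            continue
        start = max(0, i - window_size)
        end = min(len(sequence), i + window_size + 1)
        for j in range(start, end):
            # Skip the target position itself and any padding tokens.
            if j != i and sequence[j] != 0:
                pairs.append((target, sequence[j]))
    return pairs


example_sequence = [1, 2, 3, 4, 5, 1, 6, 7]
pairs = positive_skipgram_pairs(example_sequence, window_size=2)
print(len(pairs))
```

For the eight-token example sentence this yields the same number of pairs as the cell above, in sequence order instead of shuffled.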


#### Negative sampling for one skip-gram
The `skipgrams` function returns all positive skip-gram pairs by sliding over a given window span. To produce additional skip-gram pairs that serve as negative samples for training, you need to sample random words from the vocabulary. Use the `tf.random.log_uniform_candidate_sampler` function to sample `num_ns` negative samples for a given target word in a window. You can call the function on one skip-gram's target word and pass the context word as the true class to exclude it from being sampled. <br>
`num_ns` (the number of negative samples per positive context word) in the [5, 20] range is shown to work best for smaller datasets, while `num_ns` in the [2, 5] range suffices for larger datasets.
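The sampler does not draw indices uniformly. Its documentation gives the probability of a class as `(log(class + 2) - log(class + 1)) / log(range_max + 1)`, which favors low indices. The sketch below only evaluates that formula in plain Python for a small `range_max`; the helper name `log_uniform_probability` is an assumption made for illustration.

```python
import math


def log_uniform_probability(class_id, range_max):
    """Probability of drawing `class_id` under the log-uniform (Zipfian) sampler."""
    return (math.log(class_id + 2) - math.log(class_id + 1)) / math.log(range_max + 1)


range_max = 8  # same size as the toy vocabulary built above
probs = [log_uniform_probability(c, range_max) for c in range(range_max)]
for class_id, p in enumerate(probs):
    print(f"index {class_id}: {p:.3f}")
```

Because low indices are sampled more often, this distribution fits vocabularies sorted by descending frequency, which is how `TextVectorization` orders its tokens later in this notebook.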

In [11]:
# Get target and context words for one positive skip-gram.
target_word, context_word = positive_skip_grams[0]

# Set the number of negative samples per positive context.
num_ns = 4

context_class = tf.reshape(tf.constant(context_word, dtype="int64"), (1, 1))
negative_sampling_candidates, _, _ = tf.random.log_uniform_candidate_sampler(
    true_classes=context_class,  # class that should be sampled as 'positive'
    num_true=1,  # each positive skip-gram has 1 positive context class
    num_sampled=num_ns,  # number of negative context words to sample
    unique=True,  # all the negative samples should be unique
    range_max=vocab_size,  # pick index of the samples from [0, vocab_size)
    seed=SEED,  # seed for reproducibility
    name="negative_sampling"  # name of this operation
)
print(negative_sampling_candidates)
print([inverse_vocab[index.numpy()] for index in negative_sampling_candidates])

tf.Tensor([2 1 4 3], shape=(4,), dtype=int64)
['wide', 'the', 'shimmered', 'road']


#### Construct one training example
For a given positive `(target_word, context_word)` skip-gram, you now also have `num_ns` negative sampled context words that do not appear in the window size neighborhood of `target_word`. Batch the 1 positive `context_word` and `num_ns` negative context words into one tensor. This produces a set of positive skip-grams (labeled as `1`) and negative samples (labeled as `0`) for each target word.

In [12]:
# Reduce a dimension so you can use concatenation (in the next step).
squeezed_context_class = tf.squeeze(context_class, 1)

# Concatenate a positive context word with negative sampled words.
context = tf.concat([squeezed_context_class, negative_sampling_candidates], 0)

# Label the first context word as `1` (positive) followed by `num_ns` `0`s (negative).
label = tf.constant([1] + [0]*num_ns, dtype="int64")
target = target_word

In [13]:
# Check out the context and the corresponding labels for the target word from the skip-gram example above
print(f"target_index    : {target}")
print(f"target_word     : {inverse_vocab[target_word]}")
print(f"context_indices : {context}")
print(f"context_words   : {[inverse_vocab[c.numpy()] for c in context]}")
print(f"label           : {label}")

target_index    : 1
target_word     : the
context_indices : [6 2 1 4 3]
context_words   : ['hot', 'wide', 'the', 'shimmered', 'road']
label           : [1 0 0 0 0]


A tuple of `(target, context, label)` tensors constitutes one training example for training your skip-gram negative sampling word2vec model. Notice that the target is of shape `(1,)` while the context and label are of shape `(1+num_ns,)`.

#### Compile all steps into one function
##### Skip-gram sampling table
A large dataset means a larger vocabulary with a higher number of frequent words such as stopwords. Training examples obtained from sampling commonly occurring words (such as `the`, `is`, `on`) don't add much useful information for the model to learn from. Subsampling of frequent words is suggested as a helpful practice to improve embedding quality.<br>
The `tf.keras.preprocessing.sequence.skipgrams` function accepts a sampling table argument to encode probabilities of sampling any token. You can use `tf.keras.preprocessing.sequence.make_sampling_table` to generate a word-frequency-rank-based probabilistic sampling table and pass it to the `skipgrams` function. Inspect the sampling probabilities for a `vocab_size` of 10.
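As a rough guide to where these probabilities come from, the following pure-Python sketch mirrors the rank-based formula that `make_sampling_table` is understood to use: word frequency is approximated from rank with a Zipf-like estimate, then scaled by a sampling factor of `1e-5`. The function name `sampling_table_sketch` is hypothetical, and the constants are assumptions taken from the Keras implementation.

```python
import math


def sampling_table_sketch(size, sampling_factor=1e-5):
    """Approximate keep-probabilities for tokens ranked 0..size-1 by frequency."""
    gamma = 0.577  # rounded Euler-Mascheroni constant
    table = []
    for i in range(size):
        rank = max(i, 1)  # rank 0 is treated like rank 1 to avoid log(0)
        # Zipf-style estimate of the inverse relative frequency at this rank.
        inv_freq = rank * (math.log(rank) + gamma) + 0.5 - 1.0 / (12.0 * rank)
        f = sampling_factor * inv_freq
        table.append(min(1.0, f / math.sqrt(f)))
    return table


table = sampling_table_sketch(10)
print([round(p, 8) for p in table])
```

The most frequent ranks get the smallest probabilities, so pairs whose target is a stopword are kept least often.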

In [14]:
sampling_table = tf.keras.preprocessing.sequence.make_sampling_table(size=10)
print(sampling_table)

[0.00315225 0.00315225 0.00547597 0.00741556 0.00912817 0.01068435
 0.01212381 0.01347162 0.01474487 0.0159558 ]


In [15]:
# Generates skip-gram pairs with negative sampling for a list of sequences
# (int-encoded sentences) based on window size, number of negative samples
# and vocabulary size.
def generate_training_data(sequences, window_size, num_ns, vocab_size, seed):
  # Elements of each training example are appended to these lists.
  targets, contexts, labels = [], [], []

  # Build the sampling table for `vocab_size` tokens.
  sampling_table = tf.keras.preprocessing.sequence.make_sampling_table(vocab_size)

  # Iterate over all sequences (sentences) in the dataset.
  for sequence in tqdm.tqdm(sequences):

    # Generate positive skip-gram pairs for a sequence (sentence).
    positive_skip_grams, _ = tf.keras.preprocessing.sequence.skipgrams(
          sequence,
          vocabulary_size=vocab_size,
          sampling_table=sampling_table,
          window_size=window_size,
          negative_samples=0)

    # Iterate over each positive skip-gram pair to produce training examples
    # with a positive context word and negative samples.
    for target_word, context_word in positive_skip_grams:
      context_class = tf.expand_dims(
          tf.constant([context_word], dtype="int64"), 1)
      negative_sampling_candidates, _, _ = tf.random.log_uniform_candidate_sampler(
          true_classes=context_class,
          num_true=1,
          num_sampled=num_ns,
          unique=True,
          range_max=vocab_size,
          seed=seed,
          name="negative_sampling")

      # Build context and label vectors (for one target word)
      context = tf.concat([tf.squeeze(context_class,1), negative_sampling_candidates], 0)
      label = tf.constant([1] + [0]*num_ns, dtype="int64")

      # Append each element from the training example to global lists.
      targets.append(target_word)
      contexts.append(context)
      labels.append(label)

  return targets, contexts, labels

#### Load Farsi word embedding data

In [16]:
file_paths = ["./datasets/fawiki-20181001-pages-articles-multistream 1 - 100000.json",
              "./datasets/fawiki-20181001-pages-articles-multistream 100001 - 290169.json",
              "./datasets/fawiki-20181001-pages-articles-multistream 290170 - 580338.json",
              "./datasets/fawiki-20181001-pages-articles-multistream 580339 - 870507.json"]

In [17]:
data = []

# Regular expression pattern to match the zero-width non-joiner (ZWNJ, U+200C)
zwnj_pattern = re.compile(r'\u200c')

# Iterate over each file path
for file_path in file_paths:
    with open(file_path, "r", encoding="utf-8") as f:
        # Iterate over each line in the file
        for line in f:
            # Remove ZWNJ characters from the line
            line = zwnj_pattern.sub('', line)
            # Parse the line as JSON and append it to the data list
            data.append(json.loads(line))

In [18]:
fawiki_df = pd.DataFrame(data)

In [19]:
fawiki_df.head()

Unnamed: 0,Id,Title,Type,Rank,Namespace,RedirectList,IsDisambiguationPage,TargetLinksCount,InfoBox,Text,Links,Parents
0,1,صفحهٔ اصلی,0,38,0,"[صفحهی اصلی, صفحۀ اصلی, صفحه اصلی, صفحهٔ اصلي,...",False,30,"{'Title': '', 'KeysAndValues': []}",پیوند= مقالههای برگزیده – مقاله پیشنهادی هفته\...,"[j F, Y ""(میلادی)"", تقویم میلادی, xij xiF, xiY...",[]
1,42,ویکیپدیا,6,728,0,"[ویکی پدیا, ویکیپدیا, دانشنامهٔ ویکیپدیا, دانش...",False,313,"{'Title': 'website', 'KeysAndValues': [{'Item1...",ویکیپدیا یک دانشنامه اینترنتی چندزبانه با محتو...,"[صفحهٔ اصلی, دانشنامه برخط, فهرست ویکیپدیاها, ...","[ویکیپدیا, اختراعهای آمریکایی, انقلاب علمی, بر..."
2,47,سالنامه,0,29,0,[],False,15,"{'Title': '', 'KeysAndValues': []}",سالنامه دفتر یا کتابی است دستنویس یا چاپشده که...,"[کتاب, دستنویس, پیشامد, اطلاعات, سالنامه جاودا...","[آکادمی, سالنامهها, کتابها بر پایه نوع]"
3,49,اطلاعات,0,211,0,[],False,191,"{'Title': '', 'KeysAndValues': []}",اطلاع یا معلومات (به فارسی افغانستان) یا آگاهی...,"[فلسفه, علم, دانش, نظریه اطلاعات, معماری اطلاع...","[اطلاعات, اطلاعات، دانش و عدم قطعیت, علوم اطلا..."
4,52,محتوای آزاد,0,129,0,"[محتويات آزاد, محتویات آزاد, محتواي آزاد, محتو...",False,28,"{'Title': '', 'KeysAndValues': []}",این نگاره نمونهای از یک اثر ویرایش شدهاست که ت...,"[تعریف آثار فرهنگی آزاد, محتوای باز, اجازهنامه...","[اجازهنامههای متنباز, محتویات آزاد, نرمافزار آ..."


In [20]:
fawiki_df.columns

Index(['Id', 'Title', 'Type', 'Rank', 'Namespace', 'RedirectList',
       'IsDisambiguationPage', 'TargetLinksCount', 'InfoBox', 'Text', 'Links',
       'Parents'],
      dtype='object')

In [21]:
fawiki_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 870507 entries, 0 to 870506
Data columns (total 12 columns):
 #   Column                Non-Null Count   Dtype 
---  ------                --------------   ----- 
 0   Id                    870507 non-null  int64 
 1   Title                 870507 non-null  object
 2   Type                  870507 non-null  int64 
 3   Rank                  870507 non-null  int64 
 4   Namespace             870507 non-null  int64 
 5   RedirectList          870507 non-null  object
 6   IsDisambiguationPage  870507 non-null  bool  
 7   TargetLinksCount      870507 non-null  int64 
 8   InfoBox               870507 non-null  object
 9   Text                  870507 non-null  object
 10  Links                 870507 non-null  object
 11  Parents               870507 non-null  object
dtypes: bool(1), int64(5), object(6)
memory usage: 73.9+ MB


In [22]:
fawiki_df.loc[0, "Text"]

'پیوند= مقالههای برگزیده – مقاله پیشنهادی هفته\nدرباره ویکیپدیا\nدرونمایه\nاز میان خبرها\nامروز : ، میلادی برابر هجری خورشیدی و (UTC)\n-1 day – +1 day یادبودهای – یادبودهای بیشتر…\nlink= بایگانی – نگارههای برگزیده بیشتر\nدیگر پروژههای بنیاد ویکیمدیا\nویکیپدیا در زبانهای دیگر'

#### Clean up the text

In [23]:
import re

def replace_newline_with_space(text):
    return re.sub(r'\n', ' ', text)

fawiki_df = fawiki_df.map(lambda x: replace_newline_with_space(x) if isinstance(x, str) else x)

In [24]:
fawiki_df.loc[14, 'Text']

'۱۴ اسفند - از آغاز سال در گاهشماری ایران ۳۵۰ روز گذشته و به پایان آن ۱۵ روز (در سال عادی) یا ۱۶ روز (در سال کبیسه) ماندهاست. رویدادها ۱۳۵۷ - تاسیس کمیته امداد امام خمینی به دستور سید روحالله خمینی. ۱۳۵۹ - رویدادهای ۱۴ اسفند پس از سخنرانی بنی صدر. زادروزها ۱۰۵۶ - آنتونیو ویوالدی، آهنگساز و ویلنیست اهل ونیز ۱۲۵۴ - لئون-پل فارگ، نویسنده و شاعر اهل فرانسه ۱۳۰۷ - مهدی اخوان ثالث، شاعر معاصر مرگها ۱۳۴۵ - محمد مصدق ، دولتمرد و نخستوزیر ایرانی (زاده ۱۲۶۱). ۱۳۸۸- محمد خوانساری استاد برجسته فلسفه و منطق در دانشگاه تهران، عضو پیوسته فرهنگستان زبان و ادبیات فارسی و یکی از بزرگترین منطق دانان دوران معاصر بود. مناسبتها روز احسان و نیکوکاری . ۵ مارس'

In [25]:
w2v_df = fawiki_df['Text']
w2v_df.head()

0    پیوند= مقالههای برگزیده – مقاله پیشنهادی هفته ...
1    ویکیپدیا یک دانشنامه اینترنتی چندزبانه با محتو...
2    سالنامه دفتر یا کتابی است دستنویس یا چاپشده که...
3    اطلاع یا معلومات (به فارسی افغانستان) یا آگاهی...
4    این نگاره نمونهای از یک اثر ویرایش شدهاست که ت...
Name: Text, dtype: object

In [28]:
import tensorflow as tf

text_list = w2v_df.tolist()

# Create a TensorFlow dataset from the text list
text_ds = tf.data.Dataset.from_tensor_slices(text_list)

In [43]:
first_element = next(iter(text_ds.take(1)))
print(first_element.numpy().decode('utf-8'))

پیوند= مقالههای برگزیده – مقاله پیشنهادی هفته درباره ویکیپدیا درونمایه از میان خبرها امروز : ، میلادی برابر هجری خورشیدی و (UTC) -1 day – +1 day یادبودهای – یادبودهای بیشتر… link= بایگانی – نگارههای برگزیده بیشتر دیگر پروژههای بنیاد ویکیمدیا ویکیپدیا در زبانهای دیگر


#### Vectorize sentences from the corpus

In [29]:
# Function to remove ASCII punctuation (no lowercasing is applied).
def custom_standardization(input_data):
  return tf.strings.regex_replace(input_data, '[%s]' % re.escape(string.punctuation), '')
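Note that `string.punctuation` contains only ASCII punctuation, so Persian marks such as `،` and the en dash `–` pass through the standardization step and can end up as vocabulary tokens. The sketch below illustrates this with Python's `re` module in place of `tf.strings.regex_replace`; the `PERSIAN_PUNCTUATION` set is a hypothetical extension, not part of the original notebook.

```python
import re
import string

sample = "سلام، دنیا! (UTC)"

# Same character class as custom_standardization: ASCII punctuation only.
ascii_pattern = '[%s]' % re.escape(string.punctuation)
ascii_cleaned = re.sub(ascii_pattern, '', sample)

# Hypothetical extension: a few common Persian and typographic marks.
PERSIAN_PUNCTUATION = "،؛؟«»…–"
extended_pattern = '[%s]' % re.escape(string.punctuation + PERSIAN_PUNCTUATION)
extended_cleaned = re.sub(extended_pattern, '', sample)

print(ascii_cleaned)
print(extended_cleaned)
```

Whether to strip these extra marks is a design choice; doing so would change the vocabulary and every output shown later in this notebook.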

In [30]:
# Define the vocabulary size and the number of words in a sequence.
vocab_size = 4096
sequence_length = 10

# Use the `TextVectorization` layer to normalize, split, and map strings to
# integers. Set the `output_sequence_length` length to pad all samples to the
# same length.
vectorize_layer = layers.TextVectorization(
    standardize=custom_standardization,
    max_tokens=vocab_size,
    output_mode='int',
    output_sequence_length=sequence_length)

In [49]:
# Call TextVectorization.adapt on the text dataset to create vocabulary.
vectorize_layer.adapt(text_ds.batch(1024))

2024-04-04 16:34:04.372421: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


Once the state of the layer has been adapted to represent the text corpus, the vocabulary can be accessed with `TextVectorization.get_vocabulary`. This function returns a list of all vocabulary tokens sorted (descending) by their frequency.
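Conceptually, the adapted vocabulary resembles a frequency count with two reserved slots: index 0 for padding (`''`) and index 1 for out-of-vocabulary tokens (`'[UNK]'`). The following pure-Python sketch shows that idea on a toy corpus. The helper `build_vocabulary` is hypothetical and uses whitespace splitting only, so it is a simplification of what `TextVectorization` does.

```python
from collections import Counter


def build_vocabulary(texts, max_tokens):
    """Reserve '' and '[UNK]', then keep the most frequent tokens."""
    counts = Counter(token for text in texts for token in text.split())
    most_common = [token for token, _ in counts.most_common(max_tokens - 2)]
    return ['', '[UNK]'] + most_common


toy_corpus = ["a b c", "a b", "a"]
vocabulary = build_vocabulary(toy_corpus, max_tokens=4)
print(vocabulary)

# Tokens outside the vocabulary map to index 1, the '[UNK]' slot.
token_to_index = {token: i for i, token in enumerate(vocabulary)}
encoded = [token_to_index.get(token, 1) for token in "a c b".split()]
print(encoded)
```

With `max_tokens=4096` on a Wikipedia-sized corpus, many words fall outside the vocabulary, which is why `[UNK]` (index 1) appears so often in the vectorized sequences.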

In [50]:
# Save the created vocabulary for reference.
inverse_vocab = vectorize_layer.get_vocabulary()
print(inverse_vocab[:20])

['', '[UNK]', 'در', 'و', 'به', 'از', 'که', 'است', 'این', 'را', 'با', 'یک', 'سال', 'آن', 'برای', 'شد', 'ایران', 'او', 'بود', 'بر']


The `vectorize_layer` can now be used to generate vectors for each element in the `text_ds` (a `tf.data.Dataset`). Apply `Dataset.batch`, `Dataset.prefetch`, `Dataset.map`, and `Dataset.unbatch`.

In [51]:
# Vectorize the data in text_ds.
text_vector_ds = text_ds.batch(1024).prefetch(AUTOTUNE).map(vectorize_layer).unbatch()

#### Obtain sequences from the dataset

You now have a `tf.data.Dataset` of integer encoded sentences. To prepare the dataset for training a word2vec model, flatten the dataset into a list of sentence vector sequences. This step is required as you would iterate over each sentence in the dataset to produce positive and negative examples.

In [52]:
sequences = list(text_vector_ds.as_numpy_iterator())
print(len(sequences))

870507


2024-04-04 16:36:57.334441: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


Let's inspect a few examples from `sequences`:

In [53]:
for seq in sequences[:5]:
  print(f"{seq} => {[inverse_vocab[i] for i in seq]}")

[1350 1074 1471   74 1338    1 1123  236    1    1] => ['پیوند', 'مقالههای', 'برگزیده', '–', 'مقاله', '[UNK]', 'هفته', 'درباره', '[UNK]', '[UNK]']
[   1   11 3636 2691    1   10    1  446    7    6] => ['[UNK]', 'یک', 'دانشنامه', 'اینترنتی', '[UNK]', 'با', '[UNK]', 'آزاد', 'است', 'که']
[   1  901   23 1844    7    1   23    1    6    1] => ['[UNK]', 'دفتر', 'یا', 'کتابی', 'است', '[UNK]', 'یا', '[UNK]', 'که', '[UNK]']
[3046   23    1    4  296  638   23    1    2    1] => ['اطلاع', 'یا', '[UNK]', 'به', 'فارسی', 'افغانستان', 'یا', '[UNK]', 'در', '[UNK]']
[   8    1 3406    5   11   99 3238   31    6    1] => ['این', '[UNK]', 'نمونهای', 'از', 'یک', 'اثر', 'ویرایش', 'شدهاست', 'که', '[UNK]']


#### Generate training examples from sequences
`sequences` is now a list of int-encoded sentences. Call the `generate_training_data` function defined earlier to generate training examples for the word2vec model. To recap, the function iterates over each word from each sequence to collect positive and negative context words. The lengths of `targets`, `contexts`, and `labels` should be the same, representing the total number of training examples.

In [54]:
targets, contexts, labels = generate_training_data(
    sequences=sequences,
    window_size=2,
    num_ns=4,
    vocab_size=vocab_size,
    seed=SEED)

targets = np.array(targets)
contexts = np.array(contexts)
labels = np.array(labels)

print('\n')
print(f"targets.shape: {targets.shape}")
print(f"contexts.shape: {contexts.shape}")
print(f"labels.shape: {labels.shape}")

100%|█████████████████████████████████| 870507/870507 [04:27<00:00, 3254.39it/s]




targets.shape: (2174972,)
contexts.shape: (2174972, 5)
labels.shape: (2174972, 5)


#### Configure the dataset for performance
To perform efficient batching for the potentially large number of training examples, use the `tf.data.Dataset` API. After this step, you have a `tf.data.Dataset` object of `((target, context), label)` elements to train your word2vec model.

In [55]:
BATCH_SIZE = 1024
BUFFER_SIZE = 10000
dataset = tf.data.Dataset.from_tensor_slices(((targets, contexts), labels))
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
print(dataset)

<_BatchDataset element_spec=((TensorSpec(shape=(1024,), dtype=tf.int64, name=None), TensorSpec(shape=(1024, 5), dtype=tf.int64, name=None)), TensorSpec(shape=(1024, 5), dtype=tf.int64, name=None))>


In [56]:
dataset = dataset.cache().prefetch(buffer_size=AUTOTUNE)
print(dataset)

<_PrefetchDataset element_spec=((TensorSpec(shape=(1024,), dtype=tf.int64, name=None), TensorSpec(shape=(1024, 5), dtype=tf.int64, name=None)), TensorSpec(shape=(1024, 5), dtype=tf.int64, name=None))>
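Because `drop_remainder=True` discards the final partial batch, the number of batches per epoch follows directly from the example count printed earlier. A quick arithmetic check, using the shapes reported above:

```python
num_examples = 2174972  # from targets.shape printed above
batch_size = 1024

steps_per_epoch = num_examples // batch_size   # full batches only
dropped_examples = num_examples % batch_size   # examples in the discarded partial batch

print(steps_per_epoch, dropped_examples)
```

The resulting step count should match the per-epoch progress counter in the training log.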


#### Model and training

The word2vec model can be implemented as a classifier to distinguish between true context words from skip-grams and false context words obtained through negative sampling. You can perform a dot product multiplication between the embeddings of target and context words to obtain predictions for labels and compute the loss function against true labels in the dataset.
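The scoring step can be sketched in plain Python as follows. The vectors are made-up toy values, not trained embeddings, and `dot` is a hypothetical helper. Each context candidate gets one logit, the dot product of its embedding with the target embedding.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Toy 3-dimensional embeddings (illustrative values only).
target_vector = [0.5, -1.0, 2.0]
context_vectors = [
    [1.0, 0.0, 1.0],    # the positive context word
    [0.0, 1.0, 0.0],    # negative sample
    [-1.0, 0.0, 0.5],   # negative sample
]

# One logit per context candidate: higher means "more likely a true context".
logits = [dot(target_vector, c) for c in context_vectors]
print(logits)
```

Training pushes the logit of the positive context up and the logits of the negative samples down.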

#### Subclassed word2vec model
Use the Keras Subclassing API to define your word2vec model with the following layers:

- `target_embedding`: A `tf.keras.layers.Embedding` layer, which looks up the embedding of a word when it appears as a target word. The number of parameters in this layer is `(vocab_size * embedding_dim)`.
- `context_embedding`: Another `tf.keras.layers.Embedding` layer, which looks up the embedding of a word when it appears as a context word. The number of parameters in this layer is the same as in `target_embedding`, i.e. `(vocab_size * embedding_dim)`.

With the subclassed model, you can define the `call()` function that accepts `(target, context)` pairs, which are then passed into their corresponding embedding layers. The implementation below computes the dot product of each target embedding with its context embeddings using `tf.einsum` and returns the result, of shape `(batch, context)`, as logits.

In [57]:
class Word2Vec(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim):
    super(Word2Vec, self).__init__()
    self.target_embedding = layers.Embedding(vocab_size,
                                      embedding_dim,
                                      name="w2v_embedding")
    self.context_embedding = layers.Embedding(vocab_size,
                                       embedding_dim)

  def call(self, pair):
    target, context = pair
    # target: (batch, dummy?)  # The dummy axis doesn't exist in TF2.7+
    # context: (batch, context)
    if len(target.shape) == 2:
      target = tf.squeeze(target, axis=1)
    # target: (batch,)
    word_emb = self.target_embedding(target)
    # word_emb: (batch, embed)
    context_emb = self.context_embedding(context)
    # context_emb: (batch, context, embed)
    dots = tf.einsum('be,bce->bc', word_emb, context_emb)
    # dots: (batch, context)
    return dots

#### Define loss function and compile model

In [58]:
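def custom_loss(y_true, y_pred):
  # Keras calls losses as fn(y_true, y_pred); cast labels to the logits' dtype.
  return tf.nn.sigmoid_cross_entropy_with_logits(
      logits=y_pred, labels=tf.cast(y_true, y_pred.dtype))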

In [59]:
embedding_dim = 128
word2vec = Word2Vec(vocab_size, embedding_dim)
word2vec.compile(optimizer='adam',
                 loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
                 metrics=['accuracy'])
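The model is compiled with `CategoricalCrossentropy(from_logits=True)`, which treats the `1 + num_ns` scores of each example as a softmax over candidates. The `custom_loss` function defined above is not passed to `compile`, and it scores each candidate independently with a sigmoid. The pure-Python sketch below contrasts the two formulas on made-up logits; the helper names are hypothetical, and this is an illustration rather than the TensorFlow implementation.

```python
import math


def softmax_cross_entropy(logits, labels):
    """Cross-entropy between one-hot `labels` and softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return -sum(y * (x - log_sum_exp) for x, y in zip(logits, labels))


def sigmoid_cross_entropy(logits, labels):
    """Element-wise loss: max(x, 0) - x * z + log(1 + exp(-|x|))."""
    return [max(x, 0.0) - x * z + math.log1p(math.exp(-abs(x)))
            for x, z in zip(logits, labels)]


logits = [2.5, -1.0, 0.5, 0.0, -2.0]  # illustrative scores for 1 + num_ns candidates
labels = [1, 0, 0, 0, 0]              # the first candidate is the positive context

print(softmax_cross_entropy(logits, labels))
print(sigmoid_cross_entropy(logits, labels))
```

The softmax form yields one scalar per example, while the sigmoid form yields one value per candidate, which is closer to the original negative-sampling objective.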

In [60]:
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir="logs")

In [61]:
word2vec.fit(dataset, epochs=20, callbacks=[tensorboard_callback])

Epoch 1/20
[1m2123/2123[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m9s[0m 4ms/step - accuracy: 0.6650 - loss: 0.9327
Epoch 2/20
[1m2123/2123[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m7s[0m 3ms/step - accuracy: 0.7750 - loss: 0.6340
Epoch 3/20
[1m2123/2123[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m7s[0m 3ms/step - accuracy: 0.8025 - loss: 0.5649
Epoch 4/20
[1m2123/2123[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m7s[0m 3ms/step - accuracy: 0.8178 - loss: 0.5255
Epoch 5/20
[1m2123/2123[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m7s[0m 3ms/step - accuracy: 0.8290 - loss: 0.4968
Epoch 6/20
[1m2123/2123[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m7s[0m 3ms/step - accuracy: 0.8378 - loss: 0.4734
Epoch 7/20
[1m2123/2123[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m7s[0m 3ms/step - accuracy: 0.8455 - loss: 0.4534
Epoch 8/20
[1m2123/2123[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m7s[0m 3ms/step - accuracy: 0.8524 - loss: 0.4357
Epoch 9/20
[1m2123/2123

<keras.src.callbacks.history.History at 0x61622f690>

#### Embedding lookup and analysis
Obtain the weights from the model using `Model.get_layer` and `Layer.get_weights`. The `TextVectorization.get_vocabulary` function provides the vocabulary to build a metadata file with one token per line.

In [62]:
weights = word2vec.get_layer('w2v_embedding').get_weights()[0]
vocab = vectorize_layer.get_vocabulary()

In [63]:
# Create and save the vectors and metadata files
with io.open('vectors.tsv', 'w', encoding='utf-8') as out_v, \
     io.open('metadata.tsv', 'w', encoding='utf-8') as out_m:
  for index, word in enumerate(vocab):
    if index == 0:
      continue  # skip 0, it's padding.
    vec = weights[index]
    out_v.write('\t'.join([str(x) for x in vec]) + "\n")
    out_m.write(word + "\n")

Download the `vectors.tsv` and `metadata.tsv` to analyze the obtained embeddings in the Embedding Projector.

In [64]:
try:
  from google.colab import files
  files.download('vectors.tsv')
  files.download('metadata.tsv')
except Exception:
  pass
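As a lightweight alternative to the Embedding Projector, the saved files can also be inspected locally. The sketch below assumes the tab-separated layout written above (one vector per line in `vectors.tsv`, one token per line in `metadata.tsv`). The helper names `load_embeddings`, `cosine_similarity`, and `nearest_neighbors` are hypothetical, and the demo uses toy vectors so it runs without the files.

```python
import math


def load_embeddings(vectors_path, metadata_path):
    """Read the vectors and their tokens from the two TSV files."""
    with open(vectors_path, encoding='utf-8') as f:
        vectors = [[float(x) for x in line.rstrip('\n').split('\t')] for line in f]
    with open(metadata_path, encoding='utf-8') as f:
        words = [line.rstrip('\n') for line in f]
    return words, vectors


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (0.0 if either vector is all zeros)."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def nearest_neighbors(query, words, vectors, k=5):
    """Return the k words most similar to `query`, excluding the query itself."""
    query_vector = vectors[words.index(query)]
    scored = [(cosine_similarity(query_vector, vec), word)
              for word, vec in zip(words, vectors) if word != query]
    scored.sort(reverse=True)
    return [word for _, word in scored[:k]]


# Toy data for demonstration; replace with load_embeddings('vectors.tsv', 'metadata.tsv').
toy_words = ["w1", "w2", "w3"]
toy_vectors = [[1.0, 0.9], [0.9, 1.0], [-1.0, 0.2]]
neighbors = nearest_neighbors("w1", toy_words, toy_vectors, k=2)
print(neighbors)
```

With the real files, the first metadata line is `[UNK]`, since only the padding row at index 0 is skipped when writing.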