### __3. Word-Embedding:__
<font size=3>

Word embeddings represent words or tokens as dense vectors of floating-point numbers, which gives a much lower-dimensional representation than one-hot encoding. These vectors are either learned during training or taken from pretrained embeddings, which are especially useful for small datasets.

For example, a word represented in one-hot encoding as [0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0] could have an embedding form like [0.245 -0.183 0.834]. The length of the one-hot vector corresponds to the vocabulary size ($\mathtt{vocab\_size}$), while the length of the embedding vector is determined by the embedding-dimension ($\mathtt{embed\_dim}$), a predefined hyperparameter.
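
As a minimal sketch of how the two representations relate, multiplying a one-hot vector by an embedding matrix simply selects one row of that matrix. The toy sizes and the randomly initialized matrix `W` below are assumptions made up for illustration:

```python
import numpy as np

vocab_size = 16  # length of the one-hot vector (toy value)
embed_dim = 3    # length of the embedding vector (toy value)

# hypothetical embedding matrix with random weights, one row per token:
rng = np.random.default_rng(seed=0)
W = rng.normal(size=(vocab_size, embed_dim))

# one-hot vector for the token with ID 1, as in the example above:
token_id = 1
one_hot = np.zeros(vocab_size)
one_hot[token_id] = 1.0

# the matrix product picks out row `token_id` of W:
via_matmul = one_hot @ W
via_lookup = W[token_id]

print(via_matmul.shape)                     # (3,)
print(np.allclose(via_matmul, via_lookup))  # True
```

This is why an embedding layer can store only the dense matrix and skip the one-hot vectors entirely.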

#### __3.1 How it works in practice:__
    
- Import the corpus;
- Transform each sentence into a list of token IDs;
- Build the word embedding.
  

In [1]:
import numpy as np
from tensorflow.keras import layers



In [2]:
# 1. import corpus:
corpus = ["Beautiful is better than ugly",
          "Explicit is better than implicit",
          "Simple is better than complex",
          "Complex is better than complicated",
          "Flat is better than nested",
          "Sparse is better than dense",
          "Readability counts",
          "Special cases aren't special enough to break the rules",
          "Although practicality beats purity",
          "Errors should never pass silently",
          "Unless explicitly silenced",
          "In the face of ambiguity, refuse the temptation to guess",
          "There should be one -- and preferably only one -- obvious way to do it",
          "Although that way may not be obvious at first unless you're Dutch",
          "Now is better than never",
          "Although never is often better than right now",
          "If the implementation is hard to explain, it's a bad idea",
          "If the implementation is easy to explain, it may be a good idea",
          "Namespaces are one honking great idea -- let's do more of those!"]

In [3]:
# 2. transform each sentence into a list of token IDs:

vocab_size = None # maximum vocabulary size
max_len = 7 # maximum sentence length

vectorize = layers.TextVectorization(max_tokens=vocab_size,
                                    standardize='lower_and_strip_punctuation',
                                    split='whitespace',
                                    output_mode='int',
                                    output_sequence_length=max_len)

vectorize.adapt(corpus)

vocab = vectorize.get_vocabulary()
vocab_size = vectorize.vocabulary_size()

# get list of token IDs:
token_ids = vectorize(corpus)

# [UNK] = unknown word
print("Vocabulary size:", vocab_size)
print("Vocabulary tokens:", vocab)

Vocabulary size: 82
Vocabulary tokens: ['', '[UNK]', 'is', 'than', 'better', 'to', 'the', 'one', 'never', 'idea', 'be', 'although', 'way', 'unless', 'special', 'should', 'of', 'obvious', 'now', 'may', 'it', 'implementation', 'if', 'explain', 'do', 'complex', 'a', 'youre', 'ugly', 'those', 'there', 'that', 'temptation', 'sparse', 'simple', 'silently', 'silenced', 'rules', 'right', 'refuse', 'readability', 'purity', 'preferably', 'practicality', 'pass', 'only', 'often', 'not', 'nested', 'namespaces', 'more', 'lets', 'its', 'in', 'implicit', 'honking', 'hard', 'guess', 'great', 'good', 'flat', 'first', 'face', 'explicitly', 'explicit', 'errors', 'enough', 'easy', 'dutch', 'dense', 'counts', 'complicated', 'cases', 'break', 'beautiful', 'beats', 'bad', 'at', 'arent', 'are', 'and', 'ambiguity']
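
To make the vectorization step less of a black box, here is a simplified plain-Python sketch of the same idea: standardize the text, rank tokens by frequency, reserve ID 0 for padding and ID 1 for unknown words, then pad or truncate to a fixed length. The helper names (`standardize`, `build_vocab`, `encode`) and the toy corpus are hypothetical, and the tie-breaking order is an arbitrary choice that may differ from what `TextVectorization` does:

```python
import string
from collections import Counter

def standardize(text):
    # lowercase and remove ASCII punctuation
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def build_vocab(sentences):
    counts = Counter(tok for s in sentences for tok in standardize(s).split())
    # most frequent tokens first; ties broken alphabetically (arbitrary choice)
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    # ID 0 is reserved for padding, ID 1 for unknown tokens
    return ["", "[UNK]"] + ordered

def encode(sentence, token_to_id, max_len):
    ids = [token_to_id.get(tok, 1) for tok in standardize(sentence).split()]
    ids = ids[:max_len]                      # truncate long sentences
    return ids + [0] * (max_len - len(ids))  # pad short ones with 0

toy_corpus = ["Beautiful is better than ugly",
              "Explicit is better than implicit",
              "Readability counts"]

toy_vocab = build_vocab(toy_corpus)
token_to_id = {tok: i for i, tok in enumerate(toy_vocab)}

print(toy_vocab)
print(encode("Beautiful is better than ugly", token_to_id, max_len=7))
# [5, 3, 2, 4, 10, 0, 0]
print(encode("Ugly is unknown!", token_to_id, max_len=7))
# [10, 3, 1, 0, 0, 0, 0]
```

The second call shows the role of `[UNK]`: a word that was never seen while building the vocabulary is mapped to ID 1.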




In [4]:
# 3. make word-embedding:

embed_dim = 3
embedding = layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)

# define input shape to initialize embedding weights:
embedding.build(input_shape=(token_ids.shape))

# print embedding weights - an array of shape (vocab_size, embed_dim):
embedding.weights[0].numpy()

array([[-0.01171773, -0.04461163,  0.04308352],
       [ 0.00517938, -0.04085418, -0.00716002],
       [ 0.03742431, -0.01960806, -0.03429973],
       [ 0.0404867 ,  0.00247454, -0.03970286],
       [ 0.00722747, -0.03345873,  0.00917314],
       [-0.03696976,  0.04288835,  0.02702333],
       [-0.00140247,  0.04975105,  0.02677205],
       [-0.03489292,  0.04262637, -0.00257399],
       [-0.01237092, -0.0342587 , -0.01603409],
       [ 0.04773648, -0.03987242, -0.04603728],
       [ 0.04338927, -0.01980357,  0.04309079],
       [-0.00552287, -0.00997583,  0.03867776],
       [ 0.01270385, -0.00379841, -0.02461885],
       [ 0.01459284,  0.04433655, -0.03203994],
       [ 0.04589163, -0.03375693, -0.04510615],
       [ 0.01854787, -0.04475548, -0.0300722 ],
       [ 0.02379323,  0.03466045, -0.00621045],
       [ 0.00888406, -0.01606665, -0.03407723],
       [-0.03982375,  0.0464437 ,  0.00847567],
       [-0.0282846 , -0.04127765,  0.02778846],
       [-0.04867227, -0.02617785, -0.040

In [5]:
# get the embedded tokens
embed_tokens = embedding(token_ids)

In [6]:
# one sentence example:

i = 0
print(f"- Sentence:\n{corpus[i]}\n")
print(f"- Token IDs:\n{token_ids[i]}\n")
print(f"- Embedded tokens:\n{embed_tokens[i].numpy()}")

- Sentence:
Beautiful is better than ugly

- Token IDs:
[74  2  4  3 28  0  0]

- Embedded tokens:
[[ 0.00556556  0.04350587  0.03800544]
 [ 0.03742431 -0.01960806 -0.03429973]
 [ 0.00722747 -0.03345873  0.00917314]
 [ 0.0404867   0.00247454 -0.03970286]
 [ 0.029376   -0.02076438  0.03059622]
 [-0.01171773 -0.04461163  0.04308352]
 [-0.01171773 -0.04461163  0.04308352]]


<font size=3>

Note above that each token ID __is__ the row index into the embedding weights! Each word is therefore mapped to a vector of size $\mathtt{embed\_dim}$, and the padding ID 0 always maps to row 0 of the weights.
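
In other words, the embedding layer acts as a lookup table. A minimal NumPy sketch of the same lookup follows, assuming a randomly initialized matrix `W` as a stand-in for the Keras weights and reusing the token IDs printed above:

```python
import numpy as np

vocab_size, embed_dim = 82, 3

# stand-in for embedding.weights[0]; the values are random, not the Keras ones:
rng = np.random.default_rng(seed=0)
W = rng.uniform(-0.05, 0.05, size=(vocab_size, embed_dim))

# token IDs of "Beautiful is better than ugly", padded with 0 to max_len = 7:
token_ids = np.array([74, 2, 4, 3, 28, 0, 0])

# indexing with an array of IDs returns one row of W per token:
embedded = W[token_ids]

print(embedded.shape)                            # (7, 3)
print(np.array_equal(embedded[0], W[74]))        # True
print(np.array_equal(embedded[5], embedded[6]))  # True: both padding tokens use row 0
```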

#### __3.2 Pretrained word-embedding:__

When working with a small dataset or aiming to reduce the computational cost of training, pretrained word embeddings can be highly beneficial. Popular [pretrained word embeddings](https://keras.io/examples/nlp/pretrained_word_embeddings/) include [Word2vec](https://code.google.com/archive/p/word2vec) and [Global Vectors for Word Representation (GloVe)](https://nlp.stanford.edu/projects/glove). In this example, we will vectorize our _Zen of Python_ corpus using GloVe embeddings.

To get started, we need to download the GloVe dataset. The downloaded zip file contains embeddings with four different dimensionalities (50D, 100D, 200D, and 300D). For this task, we will focus on $\mathtt{embed\_dim=100}$; the cell below reads the 100D file from `../dataset/glove.6B.100d.txt`.

In [7]:
# get token vectors:
embed_dict = {}

with open("../dataset/glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embed_dict[word] = coefs

print("Found %s word vectors." % len(embed_dict))

embed_dict["talk"]

Found 400000 word vectors.


array([-7.9761e-02,  1.9551e-01,  3.0579e-01, -2.1571e-01, -4.9017e-01,
        4.6350e-01, -1.5171e-01, -1.6002e-01,  1.3081e-01, -6.5718e-01,
       -1.1343e-01,  1.0231e-01,  1.1583e-01,  2.0241e-03,  1.8107e-01,
       -1.8263e-01, -4.2386e-01,  5.6726e-02, -3.0419e-01,  1.5828e-01,
       -1.1820e-01,  1.8624e-01, -5.2731e-01, -5.9154e-01,  7.1546e-02,
        1.9633e-01, -4.9147e-02, -3.3004e-01,  5.0489e-01,  5.1138e-01,
       -5.0726e-01,  7.9255e-01,  1.7890e-01,  3.5001e-01, -7.2015e-02,
        8.9293e-01, -2.7286e-01, -5.7761e-01,  1.8615e-01, -9.8489e-02,
       -6.1398e-01,  6.1104e-02, -3.3847e-01, -2.9190e-01, -7.1794e-01,
       -3.7329e-01, -3.2193e-01, -3.8184e-01,  4.9009e-02, -1.2856e+00,
        3.1266e-02,  1.2953e-01,  1.1391e-01,  6.9458e-01,  3.3839e-01,
       -2.1965e+00,  8.4632e-02,  7.6947e-02,  9.7508e-01,  3.2743e-01,
        2.8664e-01,  7.9778e-01, -4.9729e-01, -1.1200e+00,  9.1580e-01,
        8.9064e-02,  1.1378e+00,  3.3187e-01, -1.8245e-01,  1.75

<font size=3>
    
We don't need the entire GloVe embedding matrix, which has a shape of (400000, 100). Since we only work with the _Zen of Python_ corpus, we can keep just the rows for its vocabulary, which gives an array with a shape of (82, 100).
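
The selection can be sketched with toy data first. Here `toy_vocab` and `toy_pretrained` are made-up stand-ins for the corpus vocabulary and the GloVe dictionary, with 4-dimensional vectors to keep the output short:

```python
import numpy as np

toy_vocab = ["", "[UNK]", "is", "better"]
toy_pretrained = {
    "is": np.array([0.1, 0.2, 0.3, 0.4]),
    "better": np.array([0.5, 0.6, 0.7, 0.8]),
    "worse": np.array([0.9, 1.0, 1.1, 1.2]),  # not in the vocabulary, so ignored
}

embed_dim = 4
weights = np.zeros((len(toy_vocab), embed_dim))
missed = []

for i, word in enumerate(toy_vocab):
    vector = toy_pretrained.get(word)
    if vector is not None:
        weights[i] = vector
    else:
        missed.append(word)  # the row stays all zeros

print(weights.shape)  # (4, 4)
print(missed)         # ['', '[UNK]']
```

Keeping a list of the missed tokens, rather than only a count, makes it easy to check whether the misses are just the padding and unknown tokens.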

In [8]:
embed_dim = 100

hits = 0
misses = 0

# build a vocabulary dictionary (token -> ID) from the corpus:
vocab_dict = {word: i for i, word in enumerate(vocab)}

# prepare embedding weights array:
embedding_weights = np.zeros((vocab_size, embed_dim))

for word, i in vocab_dict.items():
    embedding_vector = embed_dict.get(word)
    if embedding_vector is not None:
        embedding_weights[i] = embedding_vector
        hits += 1
    else:
        # Words that are not in the embedding index keep their all-zero row.
        # This also applies to the representations for "padding" and
        # "out of vocabulary (OOV)".
        misses += 1

print(f"Converted {hits} words ({misses} misses)")

Converted 80 words (2 misses)


In [9]:
# defining pretrained word-embedding:

embedding = layers.Embedding(input_dim=vocab_size, 
                             output_dim=embed_dim,
                             weights=[embedding_weights],
                             trainable=False)

# Since we are using pretrained weights, we don't want to lose
# them during the NN training. So, for the embedding weights we set
# trainable=False.

print(f"weights array:{embedding.weights[0].shape}")

# get the embedded tokens:
embed_tokens = embedding(token_ids)

weights array:(82, 100)


In [10]:
# one sentence example:

i = 0
print(f"- Sentence:\n{corpus[i]}\n")
print(f"- Token IDs:\n{token_ids[i]}\n")
print(f"- Embedded tokens:\n{embed_tokens[i].numpy()}")

- Sentence:
Beautiful is better than ugly

- Token IDs:
[74  2  4  3 28  0  0]

- Embedded tokens:
[[-0.18173    0.49759    0.46326    0.22507    0.46379    0.70062
  -0.55155    0.79148   -0.18582    0.19755    0.19881    0.09037
   0.02684    0.036921   0.25217    0.30879    0.33164    0.2714
  -0.12808    1.1721    -0.072969   0.34904    0.11161   -0.36056
   0.59628    0.42417   -0.69904   -0.19768   -0.35599   -0.23141
  -0.38503   -0.12665    0.77121   -0.37397    0.59642   -0.24416
  -0.25387   -0.065911   0.21035   -0.83429    0.28604   -0.022707
   0.06746    0.088804   0.23424    0.20475    0.085396   0.55393
   0.34153   -0.095455  -0.19291   -0.55262    1.0229     0.3866
  -0.24254   -2.3519     0.43561    1.1172     0.77358   -0.73769
  -0.35302    1.6699    -0.63955   -0.39244    0.56454   -0.27873
   0.9252    -0.13997   -0.096213  -1.1242     0.49031    0.36918
   0.41195   -0.038159   0.84123    0.24619    0.081767   0.07483
   0.44646   -0.19423    0.013369   0.37712  

### __Reference:__
<font size=3>
    
 - [Deep Learning with Python](https://books.google.com.br/books/about/Deep_Learning_with_Python.html?id=Yo3CAQAACAAJ&redir_esc=y);
 - [Build a Large Language Model From Scratch](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/03_bonus_embedding-vs-matmul/embeddings-and-linear-layers.ipynb);
 - [Understanding word-embedding with Keras](https://medium.com/@hsinhungw/understanding-word-embeddings-with-keras-dfafde0d15a4).
