### __3. Word-Embedding:__
<font size=3>

Word embeddings represent words or tokens as dense vectors of floating-point numbers, which gives a much lower-dimensional representation than one-hot encoding. These vectors are either learned during training or taken from pretrained embeddings, which are especially useful for smaller datasets.

For example, a word represented in one-hot encoding as [0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0] could have an embedding form like [0.245 -0.183 0.834]. The length of the one-hot vector corresponds to the vocabulary size ($\mathtt{vocab\_size}$), while the length of the embedding vector is determined by the embedding-dimension ($\mathtt{embed\_dim}$), a predefined hyperparameter.
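
To make the size difference concrete, here is a minimal sketch in plain Python. The 16-entry one-hot vector and the three embedding values are copied from the example above; the variable names and the 7-token sentence length are only illustrative assumptions.

```python
# Illustrative sizes taken from the example above.
vocab_size = 16   # length of a one-hot vector
embed_dim = 3     # length of an embedding vector
token_id = 1      # position of the single 1 in the one-hot vector

# One-hot encoding: all zeros except a single 1 at the token's index.
one_hot = [1 if i == token_id else 0 for i in range(vocab_size)]

# Embedding: a short dense vector of floats (values copied from the text).
embedding_vector = [0.245, -0.183, 0.834]

print(one_hot)
print(len(one_hot), "numbers per token with one-hot encoding")
print(len(embedding_vector), "numbers per token with the embedding")

# For a sentence of 7 tokens (an assumed length), the gap scales linearly.
sentence_len = 7
print(sentence_len * vocab_size, "vs", sentence_len * embed_dim)
```

With a realistic vocabulary of tens of thousands of words, the one-hot side of this comparison grows with the vocabulary, while the embedding side only depends on the chosen $\mathtt{embed\_dim}$.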

#### __3.1 How it works in practice:__
    
- Import the corpus;
- Transform each sentence into a list of token IDs;
- Build the word embedding.
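
Before using a library for this, the first two steps can be approximated in plain Python. This is only a sketch under stated assumptions: the three-sentence corpus, the helper names (`tokenize`, `to_token_ids`), and the alphabetical tie-breaking are illustrative choices, so the resulting IDs will not necessarily match those produced by a real vectorization layer.

```python
import re
from collections import Counter

# A tiny illustrative corpus (assumption: three short sentences).
corpus = ["Beautiful is better than ugly",
          "Explicit is better than implicit",
          "Readability counts"]

max_len = 7  # pad or truncate every sentence to this many tokens


def tokenize(sentence):
    """Lowercase, strip punctuation, and split on whitespace."""
    cleaned = re.sub(r"[^\w\s]", "", sentence.lower())
    return cleaned.split()


# Count word frequencies over the whole corpus.
counts = Counter(word for sentence in corpus for word in tokenize(sentence))

# Index 0 is reserved for padding and index 1 for unknown words.
# Ties between equally frequent words are broken alphabetically,
# which is an arbitrary choice made for this sketch.
words = sorted(counts, key=lambda w: (-counts[w], w))
vocab = ["", "[UNK]"] + words
word_to_id = {word: idx for idx, word in enumerate(vocab)}


def to_token_ids(sentence):
    """Map a sentence to exactly max_len token IDs."""
    ids = [word_to_id.get(word, 1) for word in tokenize(sentence)]
    ids = ids[:max_len]                       # truncate long sentences
    return ids + [0] * (max_len - len(ids))   # pad short sentences with 0


token_ids = [to_token_ids(sentence) for sentence in corpus]
for sentence, ids in zip(corpus, token_ids):
    print(ids, "<-", sentence)
```

The third step then only needs a table with one row of floats per vocabulary entry, indexed by these token IDs.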
  

In [1]:
import numpy as np
from tensorflow.keras import layers



In [2]:
# 1. import corpus:
corpus = ["Beautiful is better than ugly",
          "Explicit is better than implicit",
          "Simple is better than complex",
          "Complex is better than complicated",
          "Flat is better than nested",
          "Sparse is better than dense",
          "Readability counts",
          "Special cases aren't special enough to break the rules",
          "Although practicality beats purity",
          "Errors should never pass silently",
          "Unless explicitly silenced",
          "In the face of ambiguity, refuse the temptation to guess",
          "There should be one -- and preferably only one -- obvious way to do it",
          "Although that way may not be obvious at first unless you're Dutch",
          "Now is better than never",
          "Although never is often better than right now",
          "If the implementation is hard to explain, it's a bad idea",
          "If the implementation is easy to explain, it may be a good idea",
          "Namespaces are one honking great idea -- let's do more of those!"]

In [3]:
# 2. transform each sentence into a list of token IDs:

vocab_size = None # maximum vocabulary size
max_len = 7 # maximum sentence length

vectorize = layers.TextVectorization(max_tokens=vocab_size,
                                    standardize='lower_and_strip_punctuation',
                                    split='whitespace',
                                    output_mode='int',
                                    output_sequence_length=max_len)

vectorize.adapt(corpus)

vocab = vectorize.get_vocabulary()
vocab_size = vectorize.vocabulary_size()

# get list of token IDs:
token_ids = vectorize(corpus)

# [UNK] = unknown word
print("Vocabulary size:", vocab_size)
print("Vocabulary tokens:", vocab)

Vocabulary size: 82
Vocabulary tokens: ['', '[UNK]', 'is', 'than', 'better', 'to', 'the', 'one', 'never', 'idea', 'be', 'although', 'way', 'unless', 'special', 'should', 'of', 'obvious', 'now', 'may', 'it', 'implementation', 'if', 'explain', 'do', 'complex', 'a', 'youre', 'ugly', 'those', 'there', 'that', 'temptation', 'sparse', 'simple', 'silently', 'silenced', 'rules', 'right', 'refuse', 'readability', 'purity', 'preferably', 'practicality', 'pass', 'only', 'often', 'not', 'nested', 'namespaces', 'more', 'lets', 'its', 'in', 'implicit', 'honking', 'hard', 'guess', 'great', 'good', 'flat', 'first', 'face', 'explicitly', 'explicit', 'errors', 'enough', 'easy', 'dutch', 'dense', 'counts', 'complicated', 'cases', 'break', 'beautiful', 'beats', 'bad', 'at', 'arent', 'are', 'and', 'ambiguity']




In [4]:
# 3. make word-embedding:

embed_dim = 3
embedding = layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)

# define input shape to initialize embedding weights:
embedding.build(input_shape=(token_ids.shape))

# print embedding weights - an array of shape (vocab_size, embed_dim):
embedding.weights[0].numpy()

array([[ 1.76004283e-02,  1.18379220e-02, -3.82632501e-02],
       [-1.61314383e-02,  4.97332104e-02,  4.45997715e-03],
       [-4.22321074e-02, -3.81650776e-03, -2.47380491e-02],
       [-4.83655334e-02,  4.75725420e-02, -2.06314214e-02],
       [ 4.75475825e-02,  1.04051828e-02, -6.49937242e-03],
       [-8.57971981e-03,  1.35271661e-02,  1.59049295e-02],
       [-4.14826982e-02,  4.78745587e-02,  4.92272042e-02],
       [ 4.12114151e-02, -2.32710969e-02,  3.72593515e-02],
       [ 1.47008039e-02,  1.12207048e-02,  3.70717384e-02],
       [ 3.44085805e-02, -1.95440650e-02,  2.57262029e-02],
       [-3.07475086e-02,  4.83743288e-02, -3.38538662e-02],
       [-3.05345785e-02, -3.25664431e-02,  2.22492479e-02],
       [-1.02776773e-02,  1.62848085e-03,  8.10660422e-04],
       [ 2.46153027e-03, -4.31027897e-02, -2.20025666e-02],
       [-4.01914828e-02, -2.46521831e-02, -2.58050207e-02],
       [ 1.89334638e-02, -3.14366966e-02,  5.50774485e-03],
       [-8.19776207e-03,  3.55802067e-02

In [5]:
# get the embedded tokens
embed_tokens = embedding(token_ids)

In [6]:
# one sentence example:

i = 0
print(f"- Sentence:\n{corpus[i]}\n")
print(f"- Token IDs:\n{token_ids[i]}\n")
print(f"- Embedded tokens:\n{embed_tokens[i].numpy()}")

- Sentence:
Beautiful is better than ugly

- Token IDs:
[74  2  4  3 28  0  0]

- Embedded tokens:
[[-0.00586389 -0.02082105 -0.04455626]
 [-0.04223211 -0.00381651 -0.02473805]
 [ 0.04754758  0.01040518 -0.00649937]
 [-0.04836553  0.04757254 -0.02063142]
 [ 0.01048924 -0.01512783  0.0402336 ]
 [ 0.01760043  0.01183792 -0.03826325]
 [ 0.01760043  0.01183792 -0.03826325]]


<font size=3>

Note above that the token IDs __are__ the row indices of the embedding weight matrix. Each word is therefore mapped to a vector of size $\mathtt{embed\_dim}$, and the two padding positions (ID 0) both receive row 0.
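
The lookup itself can be sketched without any deep-learning library. In the snippet below, the token IDs are copied from the output above, while the weight values are random stand-ins (an assumption, drawn from a small uniform range similar in scale to the weights printed earlier) and do not reproduce the numbers in this notebook.

```python
import random

random.seed(0)  # reproducible illustrative weights

vocab_size, embed_dim = 82, 3

# Stand-in for the embedding weight matrix:
# one row of embed_dim floats per token ID (random values, not the real ones).
weights = [[random.uniform(-0.05, 0.05) for _ in range(embed_dim)]
           for _ in range(vocab_size)]

# Token IDs of "Beautiful is better than ugly", padded with 0.
token_ids = [74, 2, 4, 3, 28, 0, 0]

# Embedding a sentence means picking one row of the matrix per token ID.
embedded = [weights[token_id] for token_id in token_ids]

for token_id, vector in zip(token_ids, embedded):
    print(token_id, [round(x, 4) for x in vector])
```

The last two printed rows are identical because both padding positions index row 0, just like the last two rows of the embedded sentence shown above.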

#### __3.2 Pretrained word-embedding:__

When working with a small dataset or aiming to reduce the computational cost of training, pretrained word embeddings can be highly beneficial. Popular [pretrained word embeddings](https://keras.io/examples/nlp/pretrained_word_embeddings/) include [Word2vec](https://code.google.com/archive/p/word2vec) and [Global Vectors for Word Representation (GloVe)](https://nlp.stanford.edu/projects/glove). In this example, we will embed our _Zen of Python_ corpus using GloVe embeddings.

To get started, we need to download the GloVe dataset. The downloaded zip file contains embeddings with four different dimensions (50D, 100D, 200D, and 300D). The cell below reads the extracted file from `../dataset/`, and for this task we will focus on $\mathtt{embed\_dim=100}$.

In [7]:
# get token vectors:
embed_dim = 100
embed_dict = {}

with open(f"../dataset/glove.6B.{embed_dim}d.txt", encoding="utf-8") as f:
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embed_dict[word] = coefs

print(f"Found {len(embed_dict)} word vectors.")

embed_dict["talk"]

Found 400000 word vectors.


array([-7.9761e-02,  1.9551e-01,  3.0579e-01, -2.1571e-01, -4.9017e-01,
        4.6350e-01, -1.5171e-01, -1.6002e-01,  1.3081e-01, -6.5718e-01,
       -1.1343e-01,  1.0231e-01,  1.1583e-01,  2.0241e-03,  1.8107e-01,
       -1.8263e-01, -4.2386e-01,  5.6726e-02, -3.0419e-01,  1.5828e-01,
       -1.1820e-01,  1.8624e-01, -5.2731e-01, -5.9154e-01,  7.1546e-02,
        1.9633e-01, -4.9147e-02, -3.3004e-01,  5.0489e-01,  5.1138e-01,
       -5.0726e-01,  7.9255e-01,  1.7890e-01,  3.5001e-01, -7.2015e-02,
        8.9293e-01, -2.7286e-01, -5.7761e-01,  1.8615e-01, -9.8489e-02,
       -6.1398e-01,  6.1104e-02, -3.3847e-01, -2.9190e-01, -7.1794e-01,
       -3.7329e-01, -3.2193e-01, -3.8184e-01,  4.9009e-02, -1.2856e+00,
        3.1266e-02,  1.2953e-01,  1.1391e-01,  6.9458e-01,  3.3839e-01,
       -2.1965e+00,  8.4632e-02,  7.6947e-02,  9.7508e-01,  3.2743e-01,
        2.8664e-01,  7.9778e-01, -4.9729e-01, -1.1200e+00,  9.1580e-01,
        8.9064e-02,  1.1378e+00,  3.3187e-01, -1.8245e-01,  1.75

<font size=3>
    
We don't need the entire GloVe embedding matrix, which has a shape of (400000, 100). Since we are working only with the _Zen of Python_ corpus, we can keep just the rows for its vocabulary, which gives an array with a shape of (82, 100).

In [8]:
hits = 0
misses = 0

# build a word -> index dictionary from the corpus vocabulary:
vocab_dict = dict(zip(vocab, range(len(vocab))))

# prepare embedding weights array:
embedding_weights = np.zeros((vocab_size, embed_dim))

for word, i in vocab_dict.items():
    embedding_vector = embed_dict.get(word)
    if embedding_vector is not None:
        embedding_weights[i] = embedding_vector
        hits += 1
    else:
        # Words that are not in the embedding index keep an all-zero row.
        # This also applies to the representations for "padding" and
        # "out of vocabulary (OOV)".
        misses += 1

print(f"Converted {hits} words ({misses} misses)")

Converted 80 words (2 misses)


In [9]:
# defining pretrained word-embedding:

embedding = layers.Embedding(input_dim=vocab_size, 
                             output_dim=embed_dim,
                             weights=[embedding_weights],
                             trainable=False)

# Since we are using pretrained weights, we don't want to change
# them during the NN training. So, for the embedding weights, we set
# trainable=False.

print(f"weights array:{embedding.weights[0].shape}")

# get the embedded tokens:
embed_tokens = embedding(token_ids)

weights array:(82, 100)


In [10]:
# one sentence example:

i = 0
print(f"- Sentence:\n{corpus[i]}\n")
print(f"- Token IDs:\n{token_ids[i]}\n")
print(f"- Embedded tokens:\n{embed_tokens[i].numpy()}")

- Sentence:
Beautiful is better than ugly

- Token IDs:
[74  2  4  3 28  0  0]

- Embedded tokens:
[[-0.18173    0.49759    0.46326    0.22507    0.46379    0.70062
  -0.55155    0.79148   -0.18582    0.19755    0.19881    0.09037
   0.02684    0.036921   0.25217    0.30879    0.33164    0.2714
  -0.12808    1.1721    -0.072969   0.34904    0.11161   -0.36056
   0.59628    0.42417   -0.69904   -0.19768   -0.35599   -0.23141
  -0.38503   -0.12665    0.77121   -0.37397    0.59642   -0.24416
  -0.25387   -0.065911   0.21035   -0.83429    0.28604   -0.022707
   0.06746    0.088804   0.23424    0.20475    0.085396   0.55393
   0.34153   -0.095455  -0.19291   -0.55262    1.0229     0.3866
  -0.24254   -2.3519     0.43561    1.1172     0.77358   -0.73769
  -0.35302    1.6699    -0.63955   -0.39244    0.56454   -0.27873
   0.9252    -0.13997   -0.096213  -1.1242     0.49031    0.36918
   0.41195   -0.038159   0.84123    0.24619    0.081767   0.07483
   0.44646   -0.19423    0.013369   0.37712  

<font size=3>

A key advantage of word embeddings is their ability to capture the semantic relationships between words in the embedding space. For instance, synonyms tend to have similar vector representations, resulting in closer geometric distances. This allows for intuitive relationships, such as "dog" being close to "wolf" and "cat" near "tiger." Additionally, embeddings can model analogical reasoning, where adding specific vectors reflects meaningful transformations, such as "king" + $\mathtt{female\_vector}$ = "queen" or "tree" + $\mathtt{plural\_vector}$ = "trees." This powerful ability to encode relationships makes word embeddings essential for many natural language processing tasks.
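
The vector arithmetic behind such analogies can be illustrated with a toy example. Everything below is an assumption made for illustration: the five-dimensional vectors are hand-made, with axes loosely meaning (male, female, royal, plant, plural), and they are not taken from any trained embedding. Real embeddings only satisfy these relations approximately.

```python
# Hand-made toy "embeddings"; axes: (male, female, royal, plant, plural).
toy = {
    "man":   [1.0, 0.0, 0.0, 0.0, 0.0],
    "woman": [0.0, 1.0, 0.0, 0.0, 0.0],
    "king":  [1.0, 0.0, 1.0, 0.0, 0.0],
    "queen": [0.0, 1.0, 1.0, 0.0, 0.0],
    "kings": [1.0, 0.0, 1.0, 0.0, 1.0],
    "tree":  [0.0, 0.0, 0.0, 1.0, 0.0],
    "trees": [0.0, 0.0, 0.0, 1.0, 1.0],
}


def add(a, b):
    """Element-wise sum of two vectors."""
    return [x + y for x, y in zip(a, b)]


def sub(a, b):
    """Element-wise difference of two vectors."""
    return [x - y for x, y in zip(a, b)]


def nearest(vector, exclude=()):
    """Return the toy word whose vector is closest (Euclidean) to `vector`."""
    def squared_distance(word):
        return sum((x - y) ** 2 for x, y in zip(toy[word], vector))
    return min((w for w in toy if w not in exclude), key=squared_distance)


# Direction vectors obtained from word pairs that differ in one property.
female_vector = sub(toy["woman"], toy["man"])
plural_vector = sub(toy["kings"], toy["king"])

print(nearest(add(toy["king"], female_vector), exclude={"king"}))
print(nearest(add(toy["tree"], plural_vector), exclude={"tree"}))
```

In this hand-made space the two sums land exactly on "queen" and "trees"; with trained embeddings one would instead look for the nearest neighbour of the resulting vector.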

This _"geometric distance"_ is typically measured using [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity), which is the dot product of two word-vectors, $\;\vec a\cdot\vec b = |\vec a|\,|\vec b|\,\cos\theta\;$, divided by the product of their norms,
$$
    \cos\theta = \dfrac{\vec a\cdot\vec b}{|\vec a||\vec b|} \, ,
$$
where $\cos\theta \in [-1, +1]$: $-1$ for vectors pointing in opposite directions, $0$ for orthogonal vectors, and $+1$ for proportional vectors pointing in the same direction.
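
As a quick check of the formula, the sketch below evaluates it on three hand-picked two-dimensional vectors (illustrative values, not word vectors) that cover the extremes and the middle of the range. The function name `cosine` is a hypothetical helper written in plain Python.

```python
import math


def cosine(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


a = [1.0, 2.0]

# Results are rounded because floating-point norms are not exact.
print(round(cosine(a, [2.0, 4.0]), 6))    # same direction: close to +1
print(round(cosine(a, [-2.0, 1.0]), 6))   # orthogonal: 0
print(round(cosine(a, [-1.0, -2.0]), 6))  # opposite direction: close to -1
```

Note that the result depends only on the direction of the vectors: scaling either one by a positive factor leaves the cosine unchanged.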

The GloVe vocabulary also includes words from languages other than English, so some of the negative similarities may come from pairs of words in different languages. Below are some examples of cosine similarity using only English words.

In [11]:
def cosine_similarity(a, b):
    
    A = np.linalg.norm(a)
    B = np.linalg.norm(b)

    return np.dot(a, b)/(A*B)

print("(good vs nice):", cosine_similarity(embed_dict["good"], embed_dict["nice"]))
print("(ripened vs reaffirm):", cosine_similarity(embed_dict["ripened"], embed_dict["reaffirm"]))
print("(line-out vs court):", cosine_similarity(embed_dict["line-out"], embed_dict["court"]))

(good vs nice): 0.7312453
(ripened vs reaffirm): 0.00025181586
(line-out vs court): -0.2146855


#### __3.3 [Extra] Linear _vs_ embedding layers:__
<font size=3>

Embedding layers are often described as functioning like linear layers. To gain a deeper understanding of how word embedding works, let’s compare their similarities and differences.

An embedding layer produces the same output as a linear layer with no activation function and no bias vector, applied to one-hot inputs. When considering the input as a one-hot vector $a_0^i$, the output of such a linear layer is given by:
\begin{align}
     a_0^i\;\; W^{ij} &= a_1^j \, ,\\\\
     \begin{pmatrix}
     0 & 1 & 0 & 0
     \end{pmatrix}
     \begin{pmatrix}
         w_{00} & w_{01} & w_{02} \\
         w_{10} & w_{11} & w_{12} \\
         w_{20} & w_{21} & w_{22} \\
         w_{30} & w_{31} & w_{32} 
     \end{pmatrix}
     &=
     \begin{pmatrix}
         w_{10} & w_{11} & w_{12}
     \end{pmatrix}  \, ,
\end{align}
where the indices $\; i \in [0,\, \mathtt{vocab\_size})$ and $\; j \in [0,\, \mathtt{embed\_dim})$. An embedding layer, in contrast, acts as a row lookup,
\begin{align}
    \delta^{ki}\;\;W^{ij} &= W^{kj} = a_1^j\, ,\\\\
     \begin{pmatrix}
         w_{00} & w_{01} & w_{02} \\
         [w_{10} & w_{11} & w_{12}] \\
         w_{20} & w_{21} & w_{22} \\
         w_{30} & w_{31} & w_{32} 
     \end{pmatrix} 
      &\Rightarrow
     \begin{pmatrix}
         w_{10} & w_{11} & w_{12}
     \end{pmatrix} \, ,
\end{align}
where $\delta^{ki} = [1,\; k=i;\quad 0,\; k\neq i]$, and $k$ is a $\mathtt{token\_id}$ (here $k=1$), so it selects the row $(w_{10}\;\; w_{11}\;\;  w_{12})$ directly, without one-hot encoding.

#### __3.3.1 Linear layer:__

In [12]:
# Let's assume that the corpus has a vocabulary size of:
vocab_size = 10

# and we want to embed the corpus into a matrix of dimension:
embed_dim = 5

# Also, we'll consider the following list of token IDs:
token_ids = np.array([3, 8, 2, 5, 0])

# Note that ID = 0 in token_ids represents the padding.
# The layers below use max_len as their output size, so to compare the
# linear and embedding layers here, the input length (max_len) needs to
# be equal to embed_dim.

max_len = token_ids.size

assert max_len == embed_dim

In [13]:
# one-hot encoding from token-ids:
onehot = layers.CategoryEncoding(num_tokens=vocab_size, output_mode="one_hot")(token_ids)

for ID, hot in zip(token_ids, onehot):
    print(f"{ID}: {hot}")
    

3: [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
8: [0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
2: [0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
5: [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
0: [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]


In [14]:
# defining linear layer (no activation and no bias, as described above):
linear = layers.Dense(units=max_len, use_bias=False)
linear.build(input_shape=onehot.shape) # initialize weights

weights = linear.weights[0].numpy() 

print("- Linear weights:\n", weights, end="\n\n")
print("- Linear layer's outputs:\n", linear(onehot).numpy())

- Linear weights:
 [[ 0.49218947  0.36513233 -0.56878525 -0.26862016  0.49037963]
 [ 0.02909946 -0.43373728  0.35295624  0.40925092  0.6102509 ]
 [ 0.6071431   0.14801794 -0.33090723  0.05308819  0.09311652]
 [-0.14760238  0.15635943  0.08833092  0.3197629   0.00100005]
 [-0.39183617  0.13605869  0.5090255  -0.1358301  -0.34157965]
 [ 0.54999894  0.02152151 -0.19227457  0.35214847  0.17522681]
 [-0.25164026 -0.54330355 -0.13154769  0.06628239 -0.3393166 ]
 [-0.41437823 -0.39346197  0.13575953  0.5055633  -0.18641716]
 [ 0.31544185  0.61205226  0.03930789  0.5345184  -0.38107675]
 [ 0.23140794 -0.13635138 -0.20535493 -0.38744277 -0.17155147]]

- Linear layer's outputs:
 [[-0.14760238  0.15635943  0.08833092  0.3197629   0.00100005]
 [ 0.31544185  0.61205226  0.03930789  0.5345184  -0.38107675]
 [ 0.6071431   0.14801794 -0.33090723  0.05308819  0.09311652]
 [ 0.54999894  0.02152151 -0.19227457  0.35214847  0.17522681]
 [ 0.49218947  0.36513233 -0.56878525 -0.26862016  0.49037963]]


In [15]:
# using only the dot/inner product:
print("- Dot product outputs:\n", np.dot(onehot, weights))

- Dot product outputs:
 [[-0.14760238  0.15635943  0.08833092  0.3197629   0.00100005]
 [ 0.31544185  0.61205226  0.03930789  0.5345184  -0.38107675]
 [ 0.6071431   0.14801794 -0.33090723  0.05308819  0.09311652]
 [ 0.54999894  0.02152151 -0.19227457  0.35214847  0.17522681]
 [ 0.49218947  0.36513233 -0.56878525 -0.26862016  0.49037963]]


#### __3.3.2 Embedding layer__
<font size=3>
    
Here, the embedding layer selects rows of the $\mathtt{weights}$ matrix directly by $\mathtt{token\_ids}$, so neither the one-hot encoding nor the matrix multiplication is needed. The outputs match those of the linear layer above.

In [16]:
# defining the embedding layer using the same linear weights for comparison:
embedding = layers.Embedding(input_dim=vocab_size, output_dim=max_len, weights=[weights])

print("- Embedding weights:\n", embedding.weights[0].numpy(), end="\n\n")
print("- Embedding layer's outputs:\n", embedding(token_ids).numpy())

- Embedding weights:
 [[ 0.49218947  0.36513233 -0.56878525 -0.26862016  0.49037963]
 [ 0.02909946 -0.43373728  0.35295624  0.40925092  0.6102509 ]
 [ 0.6071431   0.14801794 -0.33090723  0.05308819  0.09311652]
 [-0.14760238  0.15635943  0.08833092  0.3197629   0.00100005]
 [-0.39183617  0.13605869  0.5090255  -0.1358301  -0.34157965]
 [ 0.54999894  0.02152151 -0.19227457  0.35214847  0.17522681]
 [-0.25164026 -0.54330355 -0.13154769  0.06628239 -0.3393166 ]
 [-0.41437823 -0.39346197  0.13575953  0.5055633  -0.18641716]
 [ 0.31544185  0.61205226  0.03930789  0.5345184  -0.38107675]
 [ 0.23140794 -0.13635138 -0.20535493 -0.38744277 -0.17155147]]

- Embedding layer's outputs:
 [[-0.14760238  0.15635943  0.08833092  0.3197629   0.00100005]
 [ 0.31544185  0.61205226  0.03930789  0.5345184  -0.38107675]
 [ 0.6071431   0.14801794 -0.33090723  0.05308819  0.09311652]
 [ 0.54999894  0.02152151 -0.19227457  0.35214847  0.17522681]
 [ 0.49218947  0.36513233 -0.56878525 -0.26862016  0.49037963]]


### __References:__
<font size=3>
    
 - [Deep Learning with Python](https://books.google.com.br/books/about/Deep_Learning_with_Python.html?id=Yo3CAQAACAAJ&redir_esc=y);
 - [Build a Large Language Model From Scratch](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/03_bonus_embedding-vs-matmul/embeddings-and-linear-layers.ipynb);
 - [Understanding word-embedding with Keras](https://medium.com/@hsinhungw/understanding-word-embeddings-with-keras-dfafde0d15a4).
