## 🧩 Token Embeddings

Before a language model can understand or process words,  
it needs to **convert them into numbers**, because computers operate on numerical data, not text.

So, the question is:  
> 💭 *How do we represent words numerically so that the model can understand their meaning?*

---

### 1️⃣ One-Hot Encoding

**Idea:**  
Represent each word as a binary vector of size equal to the vocabulary.

- Each position corresponds to a word in the vocabulary.  
- Only one index (the word's position) is set to `1`, and the rest are `0`.

**Example:**

| Word | One-Hot Vector |
|------|----------------|
| dog | [0, 0, 0, 1, 0, 0, 0, 0] |
| cat | [0, 0, 1, 0, 0, 0, 0, 0] |

**Drawbacks:**
- ⚠️ **Sparse representation:** Mostly zeros → memory inefficient  
- ⚠️ **No semantic meaning:**  
  "dog" and "puppy" are very similar in reality,  
  but their one-hot vectors are completely different.

> 📉 Example:  
> `'dog'` at index **5** and `'puppy'` at index **127** show no similarity numerically (their dot product is 0),  
> even though they mean almost the same thing.
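
The drawback can be checked directly. The following is a minimal sketch in plain Python, assuming a hypothetical 8-word vocabulary chosen so that `cat` and `dog` land at the same positions as in the table above; the helper names `one_hot` and `dot` are illustrative, not from any library.

```python
# Hypothetical toy vocabulary; the position of each word is its index.
vocab = ["the", "a", "cat", "dog", "puppy", "sat", "on", "mat"]

def one_hot(word, vocab):
    """Return a list of 0s with a single 1 at the word's vocabulary index."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

dog = one_hot("dog", vocab)
puppy = one_hot("puppy", vocab)

print(dog)              # [0, 0, 0, 1, 0, 0, 0, 0]
print(dot(dog, puppy))  # 0: two different words never share a non-zero position
print(dot(dog, dog))    # 1: a word only matches itself
```

Whatever the vocabulary, two distinct one-hot vectors always have a dot product of 0, so this representation cannot express that some words are closer in meaning than others.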

---

### 2️⃣ Embeddings (Dense Vector Representations)

**Idea:**  
Instead of one-hot encoding, represent each token as a **dense, low-dimensional vector**  
that captures its **semantic meaning**: relationships and similarities between words.

Each word becomes a vector like:<br>
dog → [0.62, 0.13, 0.55, 0.02, 0.44, ...]<br>
puppy → [0.60, 0.10, 0.50, 0.05, 0.47, ...]


Here, `"dog"` and `"puppy"` have **similar vectors**,  
so their **cosine similarity** is high, which is how the model represents their relatedness.
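
As a rough illustration, cosine similarity can be computed by hand. This sketch uses only the five components shown for `dog` and `puppy` (the real vectors are longer), and the `other` vector is a made-up stand-in for an unrelated word.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# First five components from the example above (truncated, for illustration only).
dog = [0.62, 0.13, 0.55, 0.02, 0.44]
puppy = [0.60, 0.10, 0.50, 0.05, 0.47]
# Hypothetical vector for an unrelated word.
other = [0.02, 0.90, 0.01, 0.75, 0.03]

print(cosine_similarity(dog, puppy))  # close to 1: the vectors point the same way
print(cosine_similarity(dog, other))  # much lower: the vectors point in different directions
```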

---

### 🧠 Why Embeddings Work

- They capture **semantic relationships** between words (e.g., `"king" - "man" + "woman" ≈ "queen"`).
- They allow the model to **generalize** and understand new combinations of words.
- They make computations efficient since vectors are **dense** and **low-dimensional**.

---

### ⚙️ How Embeddings Are Learned

Embeddings are not hand-coded; they're **learned automatically** during training using neural networks such as:
- **CBOW (Continuous Bag of Words)**  
  → Predicts a target word from its surrounding context.
- **Skip-Gram**  
  → Predicts surrounding words from a target word.

These models are the foundation of **Word2Vec**, and modern LLMs extend the same concept for tokens.
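
The difference between the two objectives is easiest to see in the training examples each one consumes. The sketch below only builds those examples from a sentence; it does not train a network, and the function names and `window` parameter are illustrative assumptions rather than part of any Word2Vec implementation.

```python
def cbow_pairs(tokens, window=2):
    """Return (context_words, target_word) examples: context in, target out."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append((left + right, target))
    return pairs

def skipgram_pairs(tokens, window=2):
    """Return (target_word, context_word) examples: target in, one context word out."""
    pairs = []
    for context, target in cbow_pairs(tokens, window):
        pairs.extend((target, c) for c in context)
    return pairs

sentence = "the quick brown fox jumps".split()

for context, target in cbow_pairs(sentence, window=1):
    print("CBOW     ", context, "->", target)

for target, context_word in skipgram_pairs(sentence, window=1):
    print("Skip-Gram", target, "->", context_word)
```

CBOW produces one example per position with the whole context as input, while Skip-Gram produces one example per (target, context word) pair.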

---

### ✨ In Summary

| Concept | Representation | Pros | Cons |
|----------|----------------|------|------|
| **One-Hot Encoding** | Sparse binary vector | Simple, easy | High-dimensional, no meaning |
| **Embeddings** | Dense numeric vector | Captures meaning, efficient | Requires training |

---

> 🧩 **Bottom line:**  
> Embeddings turn words into *meaningful numbers*,  
> enabling models like GPT to understand and generate human language.


In [2]:
# Gensim (a popular Python library for word embeddings, topic modeling, and NLP tasks)
import gensim.downloader as api 
model = api.load('word2vec-google-news-300')
word_vectors = model

In [3]:
# let's look at the embedding of the word 'computers'
embed = word_vectors['computers']
print(f"Length of embedding vector : {embed.shape[0]}")
print(f"Embedding : {embed}")

Length of embedding vector : 300
Embedding : [ 0.32421875 -0.24316406  0.11523438  0.25976562 -0.18847656  0.10595703
 -0.10205078  0.10693359  0.28710938  0.01428223  0.0100708  -0.20214844
  0.19238281  0.07714844 -0.03686523  0.06933594 -0.0013504   0.26757812
  0.12011719  0.02746582 -0.0072937  -0.04443359  0.15625     0.10693359
  0.1640625  -0.07177734  0.02355957 -0.03930664 -0.05004883 -0.17480469
 -0.06054688 -0.10839844 -0.17382812  0.01843262  0.14160156 -0.4140625
 -0.43554688 -0.12792969  0.1484375  -0.04882812 -0.11914062  0.23046875
  0.265625    0.10400391  0.27929688  0.06933594 -0.03881836  0.31640625
 -0.40625     0.05712891 -0.01324463 -0.09960938  0.05737305 -0.18945312
 -0.15039062  0.23632812 -0.05102539 -0.17871094 -0.21972656  0.14746094
  0.16308594  0.04736328 -0.13183594  0.22070312 -0.04003906  0.05517578
 -0.2734375   0.42773438 -0.25585938  0.06591797  0.05419922 -0.25
  0.14453125 -0.00531006 -0.08984375 -0.01312256  0.08349609 -0.203125
 -0.0022583  -0

Which words are most similar to KING + WOMEN - MAN? Note that the query below uses the plural `women` rather than `woman`.

In [4]:
word_vectors.most_similar(positive=['king', 'women'], negative=['man'], topn=10)

[('queen', 0.4827326238155365),
 ('queens', 0.466781347990036),
 ('kumaris', 0.4653734564781189),
 ('kings', 0.4558638036251068),
 ('womens', 0.422832190990448),
 ('princes', 0.4176960587501526),
 ('Al_Anqari', 0.41725507378578186),
 ('concubines', 0.4011078476905823),
 ('monarch', 0.39624831080436707),
 ('monarchy', 0.39430150389671326)]

In [5]:
# some similar words to 'boy' 
word_vectors.most_similar('boy')

[('girl', 0.8543271422386169),
 ('teenager', 0.7606689929962158),
 ('toddler', 0.7043969035148621),
 ('teenage_girl', 0.685148298740387),
 ('man', 0.6824870109558105),
 ('teen_ager', 0.6499968767166138),
 ('son', 0.6337764263153076),
 ('kid', 0.63228440284729),
 ('youngster', 0.618381679058075),
 ('stepfather', 0.5989423394203186)]

In [6]:
# Example of calculating similarity
print(word_vectors.similarity('woman', 'man'))
print(word_vectors.similarity('king', 'queen'))
print(word_vectors.similarity('uncle', 'aunt'))
print(word_vectors.similarity('boy', 'girl'))
print(word_vectors.similarity('nephew', 'niece'))
print(word_vectors.similarity('paper', 'water'))

0.76640123
0.6510956
0.76434743
0.8543272
0.7594367
0.11408083


#### Magnitude of difference between word vectors 

In [7]:
import numpy as np
# Words to compare
word1 = 'man'
word2 = 'woman'

word3 = 'semiconductor'
word4 = 'earthworm'

word5 = 'nephew'
word6 = 'niece'

# Calculate the vector difference
vector_difference1 = model[word1] - model[word2]
vector_difference2 = model[word3] - model[word4]
vector_difference3 = model[word5] - model[word6]

# Calculate the magnitude of each difference vector (L2 norm: sqrt(sum(x_i^2)))
magnitude_of_difference1 = np.linalg.norm(vector_difference1)
magnitude_of_difference2 = np.linalg.norm(vector_difference2)
magnitude_of_difference3 = np.linalg.norm(vector_difference3)


# Print the magnitude of the difference
print("The magnitude of the difference between '{}' and '{}' is {:.2f}".format(word1, word2, magnitude_of_difference1))
print("The magnitude of the difference between '{}' and '{}' is {:.2f}".format(word3, word4, magnitude_of_difference2))
print("The magnitude of the difference between '{}' and '{}' is {:.2f}".format(word5, word6, magnitude_of_difference3))

The magnitude of the difference between 'man' and 'woman' is 1.73
The magnitude of the difference between 'semiconductor' and 'earthworm' is 5.67
The magnitude of the difference between 'nephew' and 'niece' is 1.96
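
For reference, the magnitude printed above is the Euclidean (L2) distance between the two word vectors. A minimal plain-Python sketch of the same calculation follows, using made-up 2-D vectors so the arithmetic is easy to check by hand; `l2_distance` is an illustrative helper, not a library function.

```python
import math

def l2_distance(a, b):
    """Euclidean distance: square root of the summed squared component differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical 2-D vectors, chosen only to make the result obvious.
near_a, near_b = [1.0, 2.0], [1.0, 3.0]
far_a, far_b = [0.0, 0.0], [3.0, 4.0]

print(l2_distance(near_a, near_b))  # 1.0
print(l2_distance(far_a, far_b))    # 5.0
```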


### How are token embeddings trained or created for LLMs?

1) Token embeddings for a model are learned while training the LLM itself.
2) Take a matrix of size V x N, where V = vocabulary size and N = embedding dimension.
    Initialize the weights randomly and let them be learned while the LLM is trained to predict the next token.

Sentence -> token IDs -> look up the row of the matrix corresponding to each ID to get that token's embedding

In [9]:
# a short example
import torch
# let's suppose we have a vocabulary of size 6 and a random sentence with these token ids
input_ids = torch.tensor([2, 3, 5, 1])

In [16]:
vocab_size = 6
embd_dim = 3 

embedding_layer = torch.nn.Embedding(vocab_size, embd_dim)
embedding_layer

Embedding(6, 3)

In [17]:
# let's look at the weights of this embedding layer
print(f"Weights of embedding layer : {embedding_layer.weight}")

Weights of embedding layer : Parameter containing:
tensor([[ 0.1464,  1.1386, -0.7314],
        [-0.7409, -0.8005, -0.2101],
        [ 0.3874,  0.3309,  0.6376],
        [-0.0957, -1.4444,  0.5150],
        [-0.2883, -1.9432,  0.2287],
        [ 0.5217,  0.5607, -0.6591]], requires_grad=True)


While the LLM is trained, these weights are updated along with the rest of the model, and the resulting rows are the token embeddings for that model.

In [18]:
# accessing vector embedding corresponding to a particular index 
embed_vector = embedding_layer(torch.tensor([3]))
print(f"Embedding corresponding to index 3 : {embed_vector}")

Embedding corresponding to index 3 : tensor([[-0.0957, -1.4444,  0.5150]], grad_fn=<EmbeddingBackward0>)


Converting input_ids vector to embedding vector 

In [None]:
embeddings = embedding_layer(input_ids)
# input_ids = torch.tensor([2, 3, 5, 1])
# each id selects the row of the weight matrix with that 0-based index (id 2 -> third row)
embeddings

tensor([[ 0.3874,  0.3309,  0.6376],
        [-0.0957, -1.4444,  0.5150],
        [ 0.5217,  0.5607, -0.6591],
        [-0.7409, -0.8005, -0.2101]], grad_fn=<EmbeddingBackward0>)
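
One way to connect this lookup back to one-hot encoding: selecting row `i` of the weight matrix gives the same result as multiplying the one-hot vector for token `i` by that matrix. The sketch below checks this in plain Python, using the weight values printed above; the helper names are illustrative and this is not how `torch.nn.Embedding` is implemented internally.

```python
# Weight matrix copied from the printed embedding_layer.weight (6 tokens x 3 dims).
weights = [
    [ 0.1464,  1.1386, -0.7314],
    [-0.7409, -0.8005, -0.2101],
    [ 0.3874,  0.3309,  0.6376],
    [-0.0957, -1.4444,  0.5150],
    [-0.2883, -1.9432,  0.2287],
    [ 0.5217,  0.5607, -0.6591],
]
input_ids = [2, 3, 5, 1]

def lookup(weights, token_id):
    """Embedding lookup: pick the row whose 0-based index equals the token id."""
    return weights[token_id]

def one_hot_matmul(weights, token_id):
    """Same result via one_hot(token_id) @ weights, written out explicitly."""
    vocab_size, dim = len(weights), len(weights[0])
    one_hot = [1.0 if i == token_id else 0.0 for i in range(vocab_size)]
    return [sum(one_hot[i] * weights[i][j] for i in range(vocab_size))
            for j in range(dim)]

for token_id in input_ids:
    print(token_id, lookup(weights, token_id), one_hot_matmul(weights, token_id))
```

The lookup is simply the cheaper way to do it, since it skips the multiplications by zero.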