# Working with Embedding and EmbeddingBag in PyTorch

In [7]:
import sys

!{sys.executable} -m pip install torchtext
!{sys.executable} -m pip install spacy
!{sys.executable} -m spacy download en_core_web_sm


In [29]:
# Import the necessary functions
import torch
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
import torch.nn as nn

#Create a sample dataset or corpus as a set of sentences
MyCorpus = [
    "I like cats",
    "I dislike dogs", 
    "I'm neutral to hippos"
]

#Initialize the tokenizer, iterator from the dataset and vocabulary 
tokenizer = get_tokenizer('spacy', language='en_core_web_sm')

def yield_tokens(data_iter):
    for data_sample in data_iter:
        yield tokenizer(data_sample)

data_iter = iter(MyCorpus)

# Build vocabulary using a fresh iterator
vocab = build_vocab_from_iterator(yield_tokens(data_iter))

# Tokenize each sentence and map its tokens to vocabulary indices.
# input_ids returns one index tensor per sentence in the corpus it is given.
input_ids = lambda corpus: [torch.tensor(vocab(tokenizer(data_sample))) for data_sample in corpus]
index = input_ids(MyCorpus)
print(index)

[tensor([0, 6, 2]), tensor([0, 3, 4]), tensor([0, 1, 7, 8, 5])]
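
To make the printed indices easier to read, here is a minimal pure-Python sketch of how a vocabulary could assign them. It assumes tokens are ordered by descending frequency with ties broken alphabetically, which is consistent with the output above but is not claimed to be torchtext's exact implementation. The sentences are tokenized by hand to mimic the spaCy split of "I'm"; `tokenized`, `ordered`, and `stoi` are illustrative names only.

```python
from collections import Counter

# Hand-tokenized sentences (assumption: mirrors spaCy splitting "I'm" into "I" and "'m")
tokenized = [
    ["I", "like", "cats"],
    ["I", "dislike", "dogs"],
    ["I", "'m", "neutral", "to", "hippos"],
]

# Count how often each token occurs across the corpus
counts = Counter(tok for sent in tokenized for tok in sent)

# Order tokens by descending frequency, breaking ties alphabetically
ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))

# Map each token to its position in that ordering
stoi = {tok: i for i, tok in enumerate(ordered)}

# Convert every sentence to a list of indices
indices = [[stoi[tok] for tok in sent] for sent in tokenized]
print(indices)  # [[0, 6, 2], [0, 3, 4], [0, 1, 7, 8, 5]]
```

"I" appears in every sentence, so it receives index 0; the remaining tokens each occur once and are numbered alphabetically.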


In [30]:
# Create the embedding layer
# The embedding layer in an LLM converts discrete tokens (words, subwords, or characters)
# into dense, continuous vectors that the network can learn
embedding_dim = 3  # Dimension of each embedding vector

# Number of unique tokens in the vocab
n_embedding = len(vocab)
print(n_embedding)

embeds = nn.Embedding(n_embedding, embedding_dim)  # nn.Embedding constructor creates the embedding layer embeds
print(embeds)

9
Embedding(9, 3)


## Explanation
- We used the spaCy tokenizer, which splits "I'm" into "I" and "'m".
- The corpus therefore contains 9 unique tokens: "I", "like", "cats", "dislike", "dogs", "'m" (from "I'm"), "neutral", "to", "hippos".
- Vocabulary size = 9
- `Embedding(9, 3)`: 9 rows (one for each token in the vocabulary) and 3 columns (the embedding dimension we set earlier). This creates a lookup table in which each token maps to a vector. When you pass token indices to the embedding layer, it returns the corresponding vectors, which the neural network can then use.
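
The lookup-table behaviour described above can be checked directly. The following is a small sketch, assuming a freshly initialised layer of the same shape (`demo_embeds` and `token_ids` are names introduced here for illustration, and the random weights will differ from the notebook's):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # only to make the random initial weights repeatable

# A layer with the same shape as in the notebook: 9 tokens, 3 dimensions
demo_embeds = nn.Embedding(9, 3)

# Indices of "I like cats" as printed earlier
token_ids = torch.tensor([0, 6, 2])

# Passing indices through the layer returns one vector per token
vectors = demo_embeds(token_ids)
print(vectors.shape)  # torch.Size([3, 3])

# The result is the same as selecting rows 0, 6 and 2 of the weight matrix
print(torch.equal(vectors, demo_embeds.weight[token_ids]))  # True
```

In other words, the forward pass of `nn.Embedding` is row selection from a trainable weight matrix; the vectors only become meaningful once the weights are trained.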


In [48]:
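# Batch processing approach
# embedding_bag is not defined in the cells shown; we assume an nn.EmbeddingBag
# with the same vocabulary size and dimension (default mode="mean")
embedding_bag = nn.EmbeddingBag(n_embedding, embedding_dim)

# Concatenate all sentences into one flat index tensor
index_flat = torch.cat(index)

# Offsets mark where each sentence starts in index_flat: [0, 3, 6]
offset = [len(sample) for sample in index]
offset.insert(0, 0)
offset = torch.cumsum(torch.tensor(offset), 0)[0:-1]

my_embeddings = embedding_bag(index_flat, offsets=offset)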

# Print all embeddings with their corresponding sentences
print("All sentence embeddings:")
for i, embedding in enumerate(my_embeddings):
    print(f"'{MyCorpus[i]}' -> {embedding}")

All sentence embeddings:
'I like cats' -> tensor([-0.0378, -0.2755,  0.5874], grad_fn=<UnbindBackward0>)
'I dislike dogs' -> tensor([-1.2483,  0.1715,  0.3958], grad_fn=<UnbindBackward0>)
'I'm neutral to hippos' -> tensor([ 0.4127, -0.2397,  0.2168], grad_fn=<UnbindBackward0>)
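
Each sentence above collapses to a single 3-dimensional vector. As a rough sketch of why, the example below assumes the bag uses `mode="mean"` (the `nn.EmbeddingBag` default) and shares its weights with an `nn.Embedding`; under those assumptions the bag output should match averaging the per-token embeddings of each sentence. The names `embedding`, `bag`, `sentences`, and `manual` are introduced here for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # only to make the random initial weights repeatable

# An ordinary embedding layer and a bag that reuses a copy of its weights
embedding = nn.Embedding(9, 3)
bag = nn.EmbeddingBag.from_pretrained(embedding.weight.detach().clone(), mode="mean")

# Index tensors as printed earlier in the notebook
sentences = [
    torch.tensor([0, 6, 2]),
    torch.tensor([0, 3, 4]),
    torch.tensor([0, 1, 7, 8, 5]),
]

# Flatten the batch and compute where each sentence starts
flat = torch.cat(sentences)
lengths = torch.tensor([len(s) for s in sentences])
offsets = torch.cumsum(lengths, 0) - lengths
print(offsets)  # tensor([0, 3, 6])

# One pooled vector per sentence from the bag
bagged = bag(flat, offsets)

# The same pooling done by hand: embed each token, then average
manual = torch.stack([embedding(s).mean(dim=0) for s in sentences])

print(torch.allclose(bagged, manual, atol=1e-6))  # True
```

The offsets let sentences of different lengths share one flat tensor without padding, which is the main practical reason to reach for `nn.EmbeddingBag` over embedding and pooling separately.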
