## Vector Embedding / Token Embedding
So far, each token has been represented by a single integer ID, a scalar. An ID carries no meaning by itself, so we need to represent text in a vector space where the semantic relations between tokens are captured.

- We could use one-hot encoding, but it is a poor way to represent text because it does not capture semantic meaning: every pair of different tokens is equally far apart.
  Instead, we use vector embeddings to represent the text in a vector space.
  It looks like this (representational image below):
 ![Alt text](./vector_embedding.png)
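
To make the difference concrete, here is a minimal sketch comparing one-hot vectors with dense vectors. The three-word vocabulary and the 2-dimensional `dense` values are made up for illustration; they are not taken from any trained model.

```python
# Hypothetical three-word vocabulary, for illustration only
vocab = ["cat", "dog", "car"]

def one_hot(word):
    """Return a vector with 1.0 at the word's index and 0.0 elsewhere."""
    return [1.0 if w == word else 0.0 for w in vocab]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# One-hot: every pair of different words has dot product 0,
# so "cat" is no closer to "dog" than it is to "car"
print(dot(one_hot("cat"), one_hot("dog")))
print(dot(one_hot("cat"), one_hot("car")))

# Made-up dense vectors: related words can point in similar directions
dense = {"cat": [0.9, 0.1], "dog": [0.8, 0.2], "car": [0.1, 0.9]}
print(dot(dense["cat"], dense["dog"]) > dot(dense["cat"], dense["car"]))
```

With one-hot vectors the similarity between any two different words is always zero, while dense vectors leave room for "cat" to sit closer to "dog" than to "car".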

 But how do we get this?
 - We can train a neural network to learn a vector embedding for each token.
   ![NN](./nn.png) 
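
As a rough illustration of what "training" means here, the sketch below learns embeddings for a toy vocabulary of 4 token IDs by predicting the next ID in a made-up cycle. The data, the layer sizes, and the next-token objective are all assumptions chosen for brevity; this is not the Word2Vec training procedure.

```python
import torch

torch.manual_seed(0)

# Hypothetical toy corpus as (current token ID, next token ID) pairs
pairs = torch.tensor([[0, 1], [1, 2], [2, 3], [3, 0]])
inputs, targets = pairs[:, 0], pairs[:, 1]

vocab_size, dim = 4, 8
model = torch.nn.Sequential(
    torch.nn.Embedding(vocab_size, dim),  # token ID -> vector (starts random)
    torch.nn.Linear(dim, vocab_size),     # vector -> scores for the next token
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.CrossEntropyLoss()

# Keep a copy of the initial embedding matrix to compare against later
before = model[0].weight.detach().clone()
losses = []
for _ in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()   # gradients flow back into the embedding rows
    optimizer.step()  # the embedding vectors are updated like any other weight
    losses.append(loss.item())

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
print("embeddings changed:", not torch.equal(before, model[0].weight))
```

The point is only that the embedding matrix is an ordinary trainable parameter: it starts random and is nudged by gradient descent along with the rest of the network.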




In [14]:
# pip install gensim

In [15]:
import gensim.downloader as api
model = api.load("word2vec-google-news-300")  # download the model and return as object ready for use
word_vectors=model

# Let us see what the vector embedding of a word looks like
print(word_vectors.get_vector('computer'))  # type: ignore  # 300-dimensional vector for 'computer'

[ 1.07421875e-01 -2.01171875e-01  1.23046875e-01  2.11914062e-01
 -9.13085938e-02  2.16796875e-01 -1.31835938e-01  8.30078125e-02
  2.02148438e-01  4.78515625e-02  3.66210938e-02 -2.45361328e-02
  2.39257812e-02 -1.60156250e-01 -2.61230469e-02  9.71679688e-02
 -6.34765625e-02  1.84570312e-01  1.70898438e-01 -1.63085938e-01
 -1.09375000e-01  1.49414062e-01 -4.65393066e-04  9.61914062e-02
  1.68945312e-01  2.60925293e-03  8.93554688e-02  6.49414062e-02
  3.56445312e-02 -6.93359375e-02 -1.46484375e-01 -1.21093750e-01
 -2.27539062e-01  2.45361328e-02 -1.24511719e-01 -3.18359375e-01
 -2.20703125e-01  1.30859375e-01  3.66210938e-02 -3.63769531e-02
 -1.13281250e-01  1.95312500e-01  9.76562500e-02  1.26953125e-01
  6.59179688e-02  6.93359375e-02  1.02539062e-02  1.75781250e-01
 -1.68945312e-01  1.21307373e-03 -2.98828125e-01 -1.15234375e-01
  5.66406250e-02 -1.77734375e-01 -2.08984375e-01  1.76757812e-01
  2.38037109e-02 -2.57812500e-01 -4.46777344e-02  1.88476562e-01
  5.51757812e-02  5.02929

In [16]:
# What do we get for woman + king - man?
print(word_vectors.most_similar(positive=['woman', 'king'], negative=['man'], topn=10))

[('queen', 0.7118193507194519), ('monarch', 0.6189674139022827), ('princess', 0.5902431011199951), ('crown_prince', 0.5499460697174072), ('prince', 0.5377321839332581), ('kings', 0.5236844420433044), ('Queen_Consort', 0.5235945582389832), ('queens', 0.518113374710083), ('sultan', 0.5098593235015869), ('monarchy', 0.5087411403656006)]
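
The idea behind this query can be sketched by hand with cosine similarity. The 2-dimensional vectors below are hypothetical (one axis loosely standing for "royalty", the other for "gender"), and the query words are excluded from the candidates; real Word2Vec vectors have 300 dimensions and no such clean axes.

```python
import math

# Hypothetical toy vectors: [royalty, gender]
vectors = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.2],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# king - man + woman, computed element by element
target = [k - m + w for k, m, w in
          zip(vectors["king"], vectors["man"], vectors["woman"])]

# Rank the remaining words by similarity to the target vector
query_words = {"king", "man", "woman"}
candidates = {w: v for w, v in vectors.items() if w not in query_words}
best = max(candidates, key=lambda w: cosine(target, candidates[w]))
print(best)
```

In this toy space the arithmetic lands exactly on the vector for "queen"; with real embeddings the result is only approximately aligned, which is why the scores above are well below 1.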


## How are token IDs converted to vector embeddings?
- 1st, we tokenize the text.
- 2nd, we convert the tokens to token IDs.
- 3rd, we convert the token IDs to vector embeddings:
  - First, define the vocabulary size and the dimension of the vector embeddings.
  - E.g. the GPT-2 Small model has a vocabulary size of 50257 and an embedding dimension of 768.
  - The embedding matrix therefore has 50257 rows and 768 columns, one row per token ID.
  - Initially, each token ID is assigned a random vector.
  - Then we train the model, which learns the vector embedding for each token ID.
  - Boom! We get the vector embeddings for each token ID.
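
The first two steps can be sketched in plain Python. This assumes a whitespace tokenizer and a vocabulary numbered in alphabetical order, which is a simplification for illustration (GPT-2 uses byte-pair encoding instead). Under those assumptions, the sentence used in the next cell maps to exactly the token IDs hard-coded there.

```python
text = "quick fox is in the house"

# Step 1: tokenize (simplified: split on whitespace)
tokens = text.split()

# Step 2: build a vocabulary and map each token to an integer ID
vocab = {word: idx for idx, word in enumerate(sorted(set(tokens)))}
token_ids = [vocab[word] for word in tokens]

print(vocab)
print(token_ids)  # [4, 0, 3, 2, 5, 1]
```

Step 3, the lookup from IDs to vectors, is what the embedding layer below performs.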

In [17]:
import torch

In [18]:
# Let's do it on our own
# Let's embed the token IDs for the text 'quick fox is in the house'
text = 'quick fox is in the house'
# Say we already have the token IDs for the text above
token_ids = torch.tensor([
    4,0,3,2,5,1
])
dim = 3
vocab_size = 6

torch.manual_seed(0)
# Behaves like one-hot encoding followed by nn.Linear (without bias), but nn.Embedding is
# computationally more efficient because it skips the unnecessary multiplications by 0
embedding_layer = torch.nn.Embedding(vocab_size, dim)


In [19]:
print(embedding_layer.weight)  # this is our embedding matrix at initialization

Parameter containing:
tensor([[-1.1258, -1.1524,  0.5667],
        [ 0.7935,  0.5988, -1.5551],
        [-0.3414,  1.8530,  0.4681],
        [-0.1577, -0.1734,  0.1835],
        [ 1.3894,  1.5863,  0.9463],
        [-0.8437,  0.9318,  1.2590]], requires_grad=True)


In [20]:
# Let's get the vector for token ID 1, which is the second row of the embedding matrix (lookup table)

print(embedding_layer(torch.tensor([1])))

tensor([[ 0.7935,  0.5988, -1.5551]], grad_fn=<EmbeddingBackward0>)


In [21]:
# for all input ids
print(embedding_layer(token_ids))

tensor([[ 1.3894,  1.5863,  0.9463],
        [-1.1258, -1.1524,  0.5667],
        [-0.1577, -0.1734,  0.1835],
        [-0.3414,  1.8530,  0.4681],
        [-0.8437,  0.9318,  1.2590],
        [ 0.7935,  0.5988, -1.5551]], grad_fn=<EmbeddingBackward0>)
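
The claim that an embedding layer is a cheaper form of "one-hot encoding times a weight matrix" can be checked directly. This is a small sketch reusing the same sizes and token IDs as above; `torch.nn.functional.one_hot` builds the one-hot rows explicitly.

```python
import torch

torch.manual_seed(0)
vocab_size, dim = 6, 3
embedding_layer = torch.nn.Embedding(vocab_size, dim)
token_ids = torch.tensor([4, 0, 3, 2, 5, 1])

# Explicit route: one-hot rows (6 x 6) multiplied by the embedding matrix (6 x 3)
one_hot = torch.nn.functional.one_hot(token_ids, num_classes=vocab_size).float()
via_matmul = one_hot @ embedding_layer.weight

# Lookup route: index the rows of the embedding matrix directly
via_lookup = embedding_layer(token_ids)

print(torch.allclose(via_matmul, via_lookup))  # True
```

Both routes return the same rows; the lookup simply avoids multiplying by all the zeros in the one-hot vectors.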


## Positional Embedding
With token embeddings alone, the vector for a token is always the same, regardless of the token's position in the text.
- As a result, two different sentences such as 'the cat sat on the mat' and 'the mat sat on the cat' are represented by the same set of token vectors, only in a different order, so the vector for 'cat' says nothing about where it occurs.
- To avoid this, we add positional embeddings to the token embeddings.
  There are 2 types of positional embeddings:
  1. Absolute Positional Embeddings
   - This is the most commonly used type in GPT-style models.
   - The dimension of the positional embedding is equal to the dimension of the token embedding, so the two can be added.
   - Input embedding = token embedding + positional embedding.
   - They are used when the fixed order of tokens is crucial, such as in sequence generation.
   - OpenAI's GPT models use absolute positional embeddings.
  ![Alt text](./pe.png)
  2. Relative Positional Embeddings
   - This type is less commonly used in GPT-style models.
   - They focus on how far apart two tokens are rather than on where each token sits.
   - They are important for longer sequences, where the same phrase may occur multiple times.
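
The 'cat'/'mat' problem can be seen in a few lines. This is a minimal sketch with a hypothetical five-word vocabulary and randomly initialized layers; the embedding dimension of 4 is arbitrary.

```python
import torch

torch.manual_seed(0)

# Hypothetical word-level vocabulary, for illustration only
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
a = torch.tensor([vocab[w] for w in "the cat sat on the mat".split()])
b = torch.tensor([vocab[w] for w in "the mat sat on the cat".split()])

tok = torch.nn.Embedding(len(vocab), 4)  # token embeddings
pos = torch.nn.Embedding(6, 4)           # one vector per position 0..5

# "cat" is at position 1 in sentence a and at position 5 in sentence b
cat_same_without_pos = torch.equal(tok(a)[1], tok(b)[5])

# Absolute positional embeddings: add the position vector to each token vector
positions = torch.arange(6)
xa = tok(a) + pos(positions)
xb = tok(b) + pos(positions)
cat_same_with_pos = torch.equal(xa[1], xb[5])

print(cat_same_without_pos)  # True: same token, same vector
print(cat_same_with_pos)     # False: the position now changes the vector
```

Without positional information the model sees an identical vector for 'cat' in both sentences; after adding the position vectors, the two occurrences differ.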

In [22]:
# Let's set up a demo of absolute positional embeddings, starting with the token embedding layer
vocab_size = 50257
dim = 256

torch.manual_seed(0)
embedding_layer_vec = torch.nn.Embedding(vocab_size, dim)
print(embedding_layer_vec.weight)

Parameter containing:
tensor([[-1.1258, -1.1524, -0.2506,  ...,  0.1447,  1.9029,  0.3904],
        [-0.0394, -0.8015, -0.4955,  ..., -1.6989,  1.3094, -1.6613],
        [-0.5461, -0.6302, -0.6347,  ...,  1.6553,  0.5204, -0.2326],
        ...,
        [-1.2348, -0.9181,  0.7427,  ..., -0.2906, -0.0758, -0.4074],
        [ 1.7858,  0.3888,  0.8426,  ...,  0.5866, -0.2532,  0.2780],
        [-0.3543,  0.7362,  0.5639,  ..., -0.9902,  0.7611, -0.8797]],
       requires_grad=True)


In [23]:
from torch.utils.data import Dataset, DataLoader

In [24]:
import tiktoken

In [25]:

class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})

        # Use a sliding window to chunk the book into overlapping sequences of max_length
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]
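
To see what the sliding window in `GPTDatasetV1` produces, here is a minimal sketch of the same loop on a toy list of integers. The list `range(10)` is a stand-in for real tokenizer output, and `max_length=4`, `stride=4` are chosen to match the values used further down.

```python
# Stand-in for tokenizer output: ten consecutive "token IDs"
token_ids = list(range(10))
max_length, stride = 4, 4

inputs, targets = [], []
for i in range(0, len(token_ids) - max_length, stride):
    inputs.append(token_ids[i:i + max_length])           # window of max_length IDs
    targets.append(token_ids[i + 1:i + max_length + 1])  # same window shifted by one

for x, y in zip(inputs, targets):
    print(x, "->", y)
# [0, 1, 2, 3] -> [1, 2, 3, 4]
# [4, 5, 6, 7] -> [5, 6, 7, 8]
```

Each target is the input shifted one position to the right, so every position in the input has the "next token" as its label. With `stride` equal to `max_length`, the input windows do not overlap.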

In [26]:
def create_dataLoader(txt, batch_size=4, stride=128, max_length=256,
                      shuffle=True, drop_last=True, num_workers=0, tokenizer=None):
    # Fall back to the GPT-2 BPE tokenizer when none is passed in
    if tokenizer is None:
        tokenizer = tiktoken.get_encoding("gpt2")
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)
    dataLoader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,  # drop the final batch if it is smaller than batch_size
        num_workers=num_workers
    )
    return dataLoader
    

In [27]:
with open('the-verdict.txt', 'r', encoding='utf-8') as f:
    txt = f.read()

In [None]:
tokenizer = tiktoken.get_encoding("gpt2")
# Each batch is a pair (inputs, targets); both tensors have shape [batch_size, max_length]
data_loader = create_dataLoader(txt, tokenizer=tokenizer, max_length=4, stride=4, batch_size=8)

data_iter = iter(data_loader)

input, output = next(data_iter)
# print(tokenizer.decode(next(data_iter)[0].tolist()))
# print(tokenizer.decode([ 1021,   757,   438,   198,   198,    40]))

print("input shape",input.shape)
print("output shape",output.shape)

input shape torch.Size([8, 4])
output shape torch.Size([8, 4])


In [38]:
# lets do vector embedding

vocab_size = 50257
dim = 256

torch.manual_seed(0)
embedding_layer_vec = torch.nn.Embedding(vocab_size, dim)

# for all input ids
vec_input = embedding_layer_vec(input)
print(vec_input.shape)

torch.Size([8, 4, 256])


## After vector embedding, each token ID has a vector of dimension dim
- Say the embedding matrix has embedding dimension dim = 256 and vocab_size = 50257.
  Then every token ID is mapped to a vector embedding of dimension 256.
- Now say the batch size is 8, which means each iteration gives us 8 input sequences and 8 target sequences.
- Each token in each input sequence gets its own vector embedding of dimension 256.
- After tokenization we have 8 input sequences of length 4 each,
  so each batch is an 8×4 tensor of token IDs, which becomes an 8×4×256 tensor after vector embedding.
 ![vector embedding](./embed.png)

In [40]:
# We only need 4 positional embeddings because each sequence has only 4 positions.
# A given position gets the same embedding in every sequence, so a 4*dim matrix is enough.
pos_embed_layer = torch.nn.Embedding(4, dim)

# Get the positional embeddings for positions 0, 1, 2, 3; they are added to the token embeddings next
pos_input = pos_embed_layer(torch.arange(4))
print(pos_input.shape)

torch.Size([4, 256])


## Now let's add them together
- We add the 8*4*256 token embedding tensor and the 4*256 positional embedding matrix. Broadcasting applies the same positional embedding to the same position in every sequence.
- ![positional embedding](./public/pos.png)

In [41]:
pos_embedding_input = vec_input + pos_input
print(pos_embedding_input.shape)

torch.Size([8, 4, 256])
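
The addition above works because of broadcasting. The sketch below checks this against an explicit per-sequence loop; random tensors stand in for `vec_input` and `pos_input` so that it runs without the dataset.

```python
import torch

torch.manual_seed(0)
batch_size, seq_len, dim = 8, 4, 256

# Random stand-ins with the same shapes as the tensors above
vec_input = torch.randn(batch_size, seq_len, dim)  # token embeddings
pos_input = torch.randn(seq_len, dim)              # positional embeddings

# Broadcasting: the (4, 256) matrix is added to each of the 8 sequences
broadcast_sum = vec_input + pos_input

# Explicit version: add the positional matrix to every sequence in turn
explicit = torch.stack([seq + pos_input for seq in vec_input])

print(broadcast_sum.shape)
print(torch.equal(broadcast_sum, explicit))  # True
```

Both versions give the same result, so position 0 receives the same positional vector in all 8 sequences, position 1 likewise, and so on.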
