### Data sampling with a sliding window
The next step before creating the embeddings for the LLM is to generate the input-target pairs required for training. LLMs
are pretrained by predicting the next word in a text.
<br>Given a text sample, we extract input blocks as subsamples that serve as input to the
LLM; the LLM's prediction task during training is to predict the next word that follows the
input block. During training, we mask out all words that are past the target. Note that the text is tokenized before the LLM can process it.

In [27]:
import tiktoken
tokenizer = tiktoken.get_encoding('gpt2')

In [28]:
# Tokenize 'The Verdict' Story
with open('the_verdict.txt',mode='r', encoding='utf-8') as f:
    raw_text = f.read()
enc_of_text = tokenizer.encode(raw_text)
print(len(enc_of_text))

5560


In [29]:
# sampling from tokens
enc_sample = enc_of_text[50:]


Create input-target pairs for next-word prediction.<br> x: input tokens<br> y: target tokens<br> The target tokens are the input tokens shifted by one position.

In [30]:
context_size = 6
x = enc_sample[:context_size]
y = enc_sample[1:context_size+1]
print(f'inputs:{x}\ntargets:\t{y}')

inputs:[7026, 15632, 438, 2016, 257, 922]
targets:	[15632, 438, 2016, 257, 922, 5891]


In [31]:
# input target pairs by token ids
for i in range(1, context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]
    print(context, '----->', desired)

[7026] -----> 15632
[7026, 15632] -----> 438
[7026, 15632, 438] -----> 2016
[7026, 15632, 438, 2016] -----> 257
[7026, 15632, 438, 2016, 257] -----> 922
[7026, 15632, 438, 2016, 257, 922] -----> 5891


In [32]:
# input-target pairs by converting token ids into text 
for i in range(1,context_size+1):
    context = tokenizer.decode(enc_sample[:i])
    desired = tokenizer.decode([enc_sample[i]])
    print(context, '----->', desired)

 cheap ----->  genius
 cheap genius -----> --
 cheap genius-- -----> though
 cheap genius--though ----->  a
 cheap genius--though a ----->  good
 cheap genius--though a good ----->  fellow


#### Implement a data loader that fetches the input-target pairs from the training dataset using a sliding window approach.

In [33]:
import torch
from torch.utils.data import Dataset, DataLoader

In [34]:
class GPTDatasetV1(Dataset):
    def __init__(self, text, tokenizer, max_length, stride):
        self.tokenizer = tokenizer
        self.input_ids=[]
        self.target_ids=[]
        token_ids = tokenizer.encode(text)
        for i in range(0,len(token_ids)-max_length, stride):
            input_chunk = token_ids[i:i+max_length]
            target_chunk = token_ids[i+1:i+max_length+1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))
    def __len__(self):
        return len(self.input_ids)
    def __getitem__(self,idx):
        return self.input_ids[idx], self.target_ids[idx]


# creating dataloader function
def create_dataloader_v1(text, batch_size=64, max_length=256, stride=128, shuffle=True, drop_last=True):
    tokenizer = tiktoken.get_encoding('gpt2')
    dataset = GPTDatasetV1(text, tokenizer, max_length, stride)
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=shuffle, drop_last=drop_last)
    return dataloader

In [52]:
#apply dataloader on raw text
dataloader= create_dataloader_v1(raw_text, batch_size=8, max_length=4, stride=4, shuffle=False)
data_iter=iter(dataloader)
inputs, targets = next(data_iter)
print(inputs)
print(targets)

tensor([[  464,  4643, 11600,   628],
        [  198,   197,   197,   197],
        [  197,   197,  7407,   342],
        [  854, 41328,   628,   628],
        [  198,   198,  1129,  2919],
        [  628,   628,   198,   198],
        [ 3109,  9213,   422, 11145],
        [  271,  1668,   319,  2795]])
tensor([[ 4643, 11600,   628,   198],
        [  197,   197,   197,   197],
        [  197,  7407,   342,   854],
        [41328,   628,   628,   198],
        [  198,  1129,  2919,   628],
        [  628,   198,   198,  3109],
        [ 9213,   422, 11145,   271],
        [ 1668,   319,  2795,   678]])


We have set the stride to 4, the same as `max_length`. This uses the dataset fully (no word is skipped) while avoiding overlap between consecutive input chunks, since more overlap could lead to increased overfitting.
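
To see the effect of the stride, here is a minimal sketch that reuses the windowing loop from `GPTDatasetV1` on a plain Python list. The helper name `sliding_windows` and the stand-in IDs 0 to 9 are assumptions for illustration; they are not part of the notebook or real tokenizer output.

```python
def sliding_windows(token_ids, max_length, stride):
    """Build (input, target) chunks with the same loop bounds as GPTDatasetV1."""
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        input_chunk = token_ids[i:i + max_length]
        target_chunk = token_ids[i + 1:i + max_length + 1]  # input shifted by one
        pairs.append((input_chunk, target_chunk))
    return pairs


# Stand-in token IDs (hypothetical), so overlaps are easy to read off.
token_ids = list(range(10))

for stride in (1, 2, 4):
    pairs = sliding_windows(token_ids, max_length=4, stride=stride)
    inputs_only = [inp for inp, _ in pairs]
    print(f"stride={stride}: {len(pairs)} windows, inputs={inputs_only}")
```

With `stride=1`, consecutive input chunks share three of their four tokens; with `stride=4` (equal to `max_length`), no token appears in two input chunks. Tokens at the very end that cannot fill a complete window are left out.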

### Creating Token Embeddings
Preparing the input text for an LLM involves tokenizing text, converting text tokens
to token IDs, and converting token IDs into embedding vectors.

In [65]:
# example
input_ids = torch.tensor([2,3,5,1,4,])

vocab_size=6    # Number of embeddings
output_dim = 3  # Embedding Dimension

torch.manual_seed(2345)
embedding_layer = torch.nn.Embedding(vocab_size, output_dim) # The embedding layer is a weight matrix that maps categorical token IDs to dense vectors
embedding_layer.weight

Parameter containing:
tensor([[ 0.0507, -0.2138, -0.1526],
        [ 0.3901,  1.0490, -1.0131],
        [-1.1523,  1.8710,  2.1880],
        [-0.0932, -1.1347, -0.2361],
        [ 1.3525,  0.9610,  0.2923],
        [ 0.2219, -0.2735,  1.8279]], requires_grad=True)

In [78]:
embedding_layer( torch.tensor([2,3,5,1,4,]))

tensor([[-1.1523,  1.8710,  2.1880],
        [-0.0932, -1.1347, -0.2361],
        [ 0.2219, -0.2735,  1.8279],
        [ 0.3901,  1.0490, -1.0131],
        [ 1.3525,  0.9610,  0.2923]], grad_fn=<EmbeddingBackward0>)

The embedding layer converts a token ID into the same vector representation
regardless of where it is located in the input sequence. For example, the token ID 5, whether it's
in the first or third position in the token ID input vector, will result in the same embedding
vector.

Embedding layers perform a look-up operation, retrieving the embedding vector
corresponding to the token ID from the embedding layer's weight matrix. For instance, the
embedding vector of the token ID 5 is the sixth row of the embedding layer weight matrix (it is
the sixth instead of the fifth row because Python starts counting at 0).
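
The look-up can be sketched in plain Python, without PyTorch, using the weight values printed above. The helper names `lookup` and `one_hot_matmul` are hypothetical; the one-hot comparison is an added illustration of why a look-up is equivalent to multiplying a one-hot vector by the weight matrix.

```python
# Weight matrix copied from the embedding_layer.weight output above (6 IDs x 3 dims).
weights = [
    [0.0507, -0.2138, -0.1526],
    [0.3901, 1.0490, -1.0131],
    [-1.1523, 1.8710, 2.1880],
    [-0.0932, -1.1347, -0.2361],
    [1.3525, 0.9610, 0.2923],
    [0.2219, -0.2735, 1.8279],
]


def lookup(token_id):
    """Embedding look-up: return the row of the weight matrix at index token_id."""
    return weights[token_id]


def one_hot_matmul(token_id):
    """Same result computed as a one-hot row vector times the weight matrix."""
    one_hot = [1.0 if i == token_id else 0.0 for i in range(len(weights))]
    return [
        sum(one_hot[i] * weights[i][d] for i in range(len(weights)))
        for d in range(len(weights[0]))
    ]


print(lookup(5))  # sixth row of the matrix, because indexing starts at 0

# Token ID 5 at the first and at the third position maps to the same vector.
embedded = [lookup(t) for t in [5, 1, 5]]
print(embedded[0] == embedded[2])

# The look-up and the one-hot product agree for every token ID.
print(all(lookup(t) == one_hot_matmul(t) for t in range(len(weights))))
```

Indexing a row directly avoids building the one-hot vector at all, which matters when the vocabulary has tens of thousands of entries.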

### Absolute Positional Embeddings
In principle, the deterministic, position-independent embedding of the token
ID is good for reproducibility purposes. However, since the self-attention
mechanism of LLMs itself is also position-agnostic, it is helpful to inject
additional position information into the LLM.
<br>Absolute positional embeddings are directly associated with specific
positions in a sequence. For each position in the input sequence, a unique
embedding is added to the token's embedding to convey its exact location.
For instance, the first token will have a specific positional embedding, the
second token another distinct embedding, and so on.
<br>Positional embeddings are added to the token embedding vector to create the input
embeddings for an LLM. The positional vectors have the same dimension as the original token
embeddings.

In [80]:
torch.manual_seed(341)

output_dim = 256 # length of embedding vector
vocab_size=50257 # Number of BPE tokens in the tiktoken GPT-2 encoding
token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

# let's use a batch size of 8 with 4 tokens each; the resulting embedding tensor will have shape 8x4x256
# let's instantiate the dataloader with a batch size of 8
max_length=4
data_loader = create_dataloader_v1(raw_text, batch_size=8,max_length=max_length, stride=max_length)

data_iter = iter(data_loader)
inputs, targets = next(data_iter)
print('Token IDs:\n',inputs)
print()
print('input shape:\n', inputs.size())

Token IDs:
 tensor([[  618,   673,  2540,   284],
        [   12, 11649,    32,  2339],
        [ 1359,   319,   262, 34686],
        [  292,   611,   339,   550],
        [  326,   339,  1239,  1807],
        [  339, 13055,    11,   345],
        [  739, 10724,   262,  6846],
        [  284,   423,   546,    26]])

input shape:
 torch.Size([8, 4])


In [81]:
# let's now use embedding layer to embed these token ids into 256 dimensional vectors
token_embedding_vectors = token_embedding_layer(inputs)
print(token_embedding_vectors.shape)

torch.Size([8, 4, 256])


For a GPT model's absolute embedding approach, we just need to create
another embedding layer that has the same embedding dimension as the
`token_embedding_layer`.

In [88]:
torch.manual_seed(341)
context_length = max_length
abs_pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)
abs_pos_embedding_vectors=abs_pos_embedding_layer(torch.arange(context_length))
abs_pos_embedding_vectors.shape

torch.Size([4, 256])

In [86]:
torch.arange(context_length)

tensor([0, 1, 2, 3])

Now add these directly to the token embeddings:
PyTorch broadcasts the 4x256-dimensional positional embedding tensor across the batch,
adding it to the 4x256-dimensional token embedding tensor of each of the 8 sequences in the batch.

In [89]:
input_embeddings = token_embedding_vectors+abs_pos_embedding_vectors
input_embeddings.shape

torch.Size([8, 4, 256])
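
The broadcast addition above can be spelled out in plain Python. This is a minimal sketch with toy sizes (2 sequences, 3 positions, 2 dimensions) and made-up values, not the 8x4x256 tensors from the notebook; all names here are hypothetical.

```python
batch_size, seq_len, dim = 2, 3, 2  # toy sizes chosen for readability

# token_emb[b][p] is the token embedding at position p of sequence b (made-up values).
token_emb = [
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],
    [[4.0, 4.0], [5.0, 5.0], [6.0, 6.0]],
]

# pos_emb[p] is the positional embedding for position p (made-up values).
pos_emb = [[0.5, 0.25], [1.5, 1.25], [2.5, 2.25]]

# Broadcasting: the same pos_emb[p] is added at position p of every sequence.
input_emb = [
    [
        [t + p for t, p in zip(token_emb[b][pos], pos_emb[pos])]
        for pos in range(seq_len)
    ]
    for b in range(batch_size)
]

for b in range(batch_size):
    print(f"sequence {b}: {input_emb[b]}")
```

The result keeps the shape of the token embeddings (batch, position, dimension); only the values change, and the shift at a given position is identical for every sequence in the batch.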