<a href="https://colab.research.google.com/github/alexcpn/tinytransformer/blob/main/MultiHeadAttention.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [3]:
# -*- coding: utf-8 -*-
"""LearnTransformer
## Learning Transformers by Doing

Based on my Colab file at
    https://colab.research.google.com/drive/1qvaWLJCenxxTcKjHksHGicxdbZDdsm7i

Author: Alex Punnen and ChatGPT,CoPilot

Let's see how simple self-attention works by writing a single-headed attention module and then training a set of these heads on our small dataset.
"""

# Use BPE to tokenise the sentence

"""It all starts with a tokenizer that breaks words into smaller pieces and builds a fixed vocabulary. Why a fixed vocabulary? Because that is ultimately what is used for prediction. The model is trained to output, for each token in (say) a 2000-token vocabulary, the probability that it is the next token. The item with the highest probability in that set is selected as the next token. Hence the need for a constant, fixed vocabulary.

In the LLaMA paper they use the SentencePiece tokenizer:

*We tokenize the data with the bytepair encoding (BPE) algorithm (Sennrich et al.,2015), using the implementation from SentencePiece.
Notably, we split all numbers into individual digits, and fallback to bytes to decompose unknown UTF-8 characters.*
"""

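To make the "fixed vocabulary" point concrete, here is a minimal sketch in plain Python (no PyTorch needed). The five-token vocabulary and the scores are made up for illustration; the notebook itself uses a 2000-token vocabulary and scores produced by the model.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 5-token vocabulary (assumption, for illustration only).
toy_vocab = ["the", "cat", "sat", "on", "mat"]
# Made-up scores a model might emit for the next position, one per vocabulary entry.
toy_logits = [0.1, 2.0, 0.3, -1.0, 0.5]

probs = softmax(toy_logits)
# Greedy choice: the index with the highest probability becomes the next token.
next_id = max(range(len(probs)), key=probs.__getitem__)
next_token = toy_vocab[next_id]
print(next_token, round(probs[next_id], 3))
```

The output layer always has exactly one score per vocabulary entry, which is why the vocabulary size has to be fixed before training starts.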

In [4]:
!pip install datasets
!pip install --upgrade sentencepiece


In [5]:

# configure logging
import torch.nn as nn
import torch
import sentencepiece as spm
from datasets import load_dataset
import math
import logging as log
import os
import gc

outfile='multihead_transformer.log'
log.basicConfig(level=log.INFO,
                format='%(asctime)s - %(message)s',
                datefmt='%d-%b-%y %H:%M:%S',
                handlers=[
                    log.FileHandler(outfile),
                    log.StreamHandler()
                ],
                force=True,
                )

In [4]:
# test log
log.info("test log")

07-Feb-25 10:55:10 - test log


In [6]:


# Load the small dataset for training our tiny language model
ds = load_dataset("roneneldan/TinyStories")
train_size =100000
# use the dataset as text for training
log.info(f"Length of trainig data is  {len(ds['train']['text'])}")
# use only the first train_size stories of the training text
trainingdata = ds['train']['text'][:train_size]
log.info(f"Limiting training legth to {len(trainingdata)}")

# 1) Write the list to a file.
with open("train.txt", "w", encoding="utf-8") as f:
    for line in trainingdata:
        # replace newline with space to keep each original text chunk on a single line
        # strip mojibake left over from mis-decoded UTF-8 punctuation
        line = line.replace("â€", "")
        f.write(line.replace("\n", " ") + "\n")
# for vocabulary training
# trainingdata2 = ds['train']['text'][:10000]
# with open("vocabtrain.txt", "w", encoding="utf-8") as f:
#     for line in trainingdata2:
#         # replace newline with space to keep each original text chunk on a single line
#         f.write(line.replace("\n", " ") + "\n")
test_sentence = "The Cat sat on the Fence"
# We use a small vocab_size just for demo. LLaMA uses a much larger vocabulary (32k tokens).
vocab_size = 2000

# Train the tokenizer (this runs every time the cell is executed);
# it creates a vocab file and a model file (llama_like.vocab and llama_like.model)
log.info("Training Non contextual tokeniser")
spm.SentencePieceTrainer.Train(
    input="train.txt",   # our training data
    model_prefix='llama_like',
    vocab_size=vocab_size,
    model_type='bpe',
    character_coverage=1.0,
    max_sentence_length=2048,
    treat_whitespace_as_suffix=True,
    split_digits=True               # This forces splitting "123" -> "1", "2", "3"
)

sp = spm.SentencePieceProcessor()
sp.load("llama_like.model")

tokens = sp.encode(test_sentence, out_type=str)
token_ids = sp.encode(test_sentence, out_type=int)

log.info(f"Sentence: {test_sentence}")
log.info(f"Tokens:  {tokens}")
log.info(f"Token IDs: {token_ids}")

# get the vocabulary dictionary mapping
# print(sp.id_to_piece(60))

# Part 2

# Step 1: Prepare the data for training the Attention layer

# Now let's tokenise the entire text and generate a list of input_ids
all_token_ids = []

if not os.path.isfile("token_ids.txt"):
    log.info("Tokenizing text...")
    with open("train.txt", "r", encoding="utf-8") as f:
        for line in f:
            # Encode each line to token IDs
            line_ids = sp.encode(line, out_type=int)
            # Append them, maybe add a special token like <eol> if desired
            all_token_ids.extend(line_ids)
            # all_token_ids.append(eol_id)  # If you have a special EOL token
    # Write token IDs to file
    with open("token_ids.txt", "w", encoding="utf-8") as f:
        for token_id in all_token_ids:
            f.write(f"{token_id}\n")
else:
    log.info("Token ids already present in file")
    #read token ids from file
    all_token_ids = []
    with open("token_ids.txt", "r", encoding="utf-8") as f:
        for line in f:
            all_token_ids.append(int(line))

log.info(f"Total tokens:  {len(all_token_ids)}")

# Lets resize the input ids for training

log.info("Resizing input_ids...")

# convert these to a torch tensor with a leading batch dimension
input_ids = torch.tensor(all_token_ids, dtype=torch.long).unsqueeze(0)
log.info(f"input_ids.shape={input_ids.shape}")
# reshape the flat (1, total_tokens) tensor into rows of seq_length tokens
seq_length = 1000
input_ids = input_ids.squeeze(0)  # Remove batch dim, now shape = (total_tokens,)
# How many seq_length-token chunks we can make
num_chunks = input_ids.shape[0] // seq_length

# Truncate to the nearest multiple of seq_length
input_ids = input_ids[:num_chunks * seq_length]
# Reshape to (num_chunks, seq_length); each row is one training sequence
input_ids = input_ids.view(num_chunks, seq_length)

log.info(f"New shape:= {input_ids.shape}")  # Should be (num_chunks, seq_length)

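# labels are a copy of the inputs; they are shifted by one position inside the training loop
labels = input_ids.clone()
vocab_size = 2000
d_model = 512  # embedding size
d_k = 64  # per-head attention size (not used below; each head works on the full d_model)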







07-Feb-25 12:57:35 - Length of trainig data is  2119719
07-Feb-25 12:57:40 - Limiting training legth to 100000
07-Feb-25 12:57:40 - Training Non contextual tokeniser
07-Feb-25 12:57:52 - Sentence: The Cat sat on the Fence
07-Feb-25 12:57:52 - Tokens:  ['The▁', 'C', 'at▁', 'sat▁', 'on▁', 'the▁', 'F', 'en', 'ce▁']
07-Feb-25 12:57:52 - Token IDs: [60, 1947, 50, 1134, 56, 16, 1945, 23, 123]
07-Feb-25 12:57:52 - Tokenizing text...
07-Feb-25 12:59:03 - Total tokens:  24471344
07-Feb-25 12:59:03 - Resizing input_ids...
07-Feb-25 12:59:05 - input_ids.shape=torch.Size([1, 24471344])
07-Feb-25 12:59:05 - New shape:= torch.Size([24471, 1000])
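The chunking step above can be sketched without tensors. The helper name `chunk_tokens` and the toy ids are assumptions for illustration; the notebook does the same thing with `view` on a torch tensor.

```python
def chunk_tokens(token_ids, seq_length):
    """Split a flat list of token ids into rows of exactly seq_length ids.

    Trailing ids that do not fill a whole row are dropped, mirroring the
    truncate-then-reshape steps in the notebook.
    """
    num_chunks = len(token_ids) // seq_length
    return [token_ids[i * seq_length:(i + 1) * seq_length]
            for i in range(num_chunks)]

toy_ids = list(range(23))            # 23 made-up token ids
rows = chunk_tokens(toy_ids, seq_length=5)
print(len(rows), rows[-1])           # 4 full rows; ids 20, 21, 22 are dropped

# The same integer division explains the logged shape of the real run.
logged_total_tokens = 24471344
print(logged_total_tokens // 1000)
```

With 24471344 tokens and `seq_length = 1000`, the integer division gives the 24471 rows reported in the log; the remaining 344 tokens are discarded.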


In [7]:
# we need to add positional encoding to the input_ids
# Positional encoding is a way to provide the model with information about the position of each token in the sequence.
# This is important because the model has no inherent sense of order in the tokens, since it only sees them as embeddings.
# generated by LLM


class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len):
        super().__init__()
        # Create a long enough 'pe' matrix of shape [max_len, d_model]
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() *
            (-math.log(10000.0) / d_model)
        )
        # Even indices (2i) -> sine
        pe[:, 0::2] = torch.sin(position * div_term)
        # Odd indices (2i+1) -> cosine
        pe[:, 1::2] = torch.cos(position * div_term)

        # Register as a buffer so it's moved to GPU automatically if needed
        self.register_buffer('pe', pe)

    def forward(self, x):
        """
        x shape: (batch_size, seq_len, d_model)
        We add positional encoding up to seq_len from the precomputed 'pe'.
        """
        seq_len = x.size(1)
        # pe[:seq_len] -> shape [seq_len, d_model]
        # We unsqueeze(0) so that shape becomes [1, seq_len, d_model],
        # allowing addition to x which is [batch_size, seq_len, d_model].
        x = x + self.pe[:seq_len, :].unsqueeze(0)
        return x
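
As a cross-check of the formula used in `PositionalEncoding`, here is a small plain-Python sketch that computes the encoding for a single position. The function name `sinusoidal_position` is an assumption for illustration; it is meant to follow the same sine/cosine layout as the class above, not to replace it.

```python
import math

def sinusoidal_position(pos, d_model):
    """Encoding vector for one position: sine on even indices, cosine on odd ones."""
    vec = []
    for i in range(0, d_model, 2):
        # Same frequency as exp(i * -ln(10000) / d_model) in the class above.
        angle = pos / (10000.0 ** (i / d_model))
        vec.append(math.sin(angle))
        vec.append(math.cos(angle))
    return vec[:d_model]

p0 = sinusoidal_position(0, 8)
p1 = sinusoidal_position(1, 8)
print(p0)                             # position 0: sin(0)=0 and cos(0)=1 alternate
print([round(v, 4) for v in p1])
```

Every position gets a different vector with values in [-1, 1], so adding it to the token embedding gives the model a signal about token order.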


"""### Step: Adding in a Simple Attention Class"""


class SingleHeadSelfAttention(nn.Module):
    def __init__(self, d_model):
        """
        d_model: input embedding size, also used as the dimension of Q, K and V
        (this simple head has no output projection W_O)
        """
        super().__init__()
        self.W_Q = nn.Linear(d_model, d_model, bias=False)
        self.W_K = nn.Linear(d_model, d_model, bias=False)
        self.W_V = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):
        """
          Forward pass of single-head self-attention
        """
        # B is the batch size, seq_len is the length of the sequence and d_model is the embedding size (512),
        # e.g. torch.Size([1, 999, 512])
        B, seq_len, d_model = x.shape

        Q = self.W_Q(x)
        K = self.W_K(x)
        V = self.W_V(x)

        attention = torch.matmul(Q, K.transpose(-2, -1)) / \
            torch.sqrt(torch.tensor(d_model, dtype=torch.float32))
        # Apply the causal mask to the attention scores.
        # It restricts each token to attending only to itself and to earlier tokens (those to its left).
        # In the (seq_len, seq_len) score matrix the allowed positions form the lower triangle
        # (including the diagonal); the future tokens are the strict upper triangle.
        # We set the upper triangle to -inf so that the following softmax turns those weights into zero.
        causal_mask = torch.triu(
            torch.ones((seq_len, seq_len), device=x.device), diagonal=1
        ).bool()
        attention = attention.masked_fill(causal_mask, float('-inf'))
        attention = torch.softmax(attention, dim=-1)
        score = torch.matmul(attention, V)
        # ----- [1] Add residual connection ----- (TODO: take this out)
        out = x + score  # without the residual the model output is not good
        return out, attention
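
To see what the causal mask does, here is a minimal plain-Python sketch of masked attention on a three-token toy input. It is a simplification under stated assumptions: the projections W_Q, W_K and W_V are skipped (the toy input is used directly as Q, K and V) and there is no residual connection, so it is not the class above, only the masking-plus-softmax step.

```python
import math

def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(Q, K, V):
    """Scaled dot-product attention where position i only sees positions <= i."""
    seq_len, d = len(Q), len(Q[0])
    outputs, weights = [], []
    for i in range(seq_len):
        scores = []
        for j in range(seq_len):
            if j > i:
                scores.append(float("-inf"))  # future token: masked out
            else:
                dot = sum(q * k for q, k in zip(Q[i], K[j]))
                scores.append(dot / math.sqrt(d))
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(w[j] * V[j][c] for j in range(seq_len))
                        for c in range(len(V[0]))])
    return outputs, weights

# Made-up 3-token input with embedding size 2 (assumption, for illustration).
toy_x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_out, toy_weights = causal_attention(toy_x, toy_x, toy_x)
for row in toy_weights:
    print([round(v, 3) for v in row])
```

Every weight above the diagonal is exactly zero, and the first token can only attend to itself, so its output equals its own value vector.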


In [8]:
log.info(f"vocab_size={vocab_size} embedding_dim/d_model={d_model}")

# Initialise all the layers

# add in the embedding part from the previous step
token_embedding = nn.Embedding(
    num_embeddings=vocab_size, embedding_dim=d_model)
pos_encoding = PositionalEncoding(d_model, max_len=seq_length)
# add in the attention layers, one SingleHeadSelfAttention module per head
# num_heads = 2  # works on a T4 GPU
num_heads = 12  # works on an A100 GPU
multihead_attention = nn.ModuleList()
for _ in range(num_heads):
    attention_mod = SingleHeadSelfAttention(d_model)
    multihead_attention.append(attention_mod)

# Add linear layers for prediction
prediction_layer1 = nn.Linear(d_model*num_heads, vocab_size)  # input is d_model*num_heads because the head outputs are concatenated
layer_norm1 = nn.LayerNorm(vocab_size)
prediction_layer2 = nn.Linear(vocab_size, vocab_size)
layer_norm2 = nn.LayerNorm(vocab_size)  # last dimension is the vocab size


# Define the loss function
loss_function = nn.CrossEntropyLoss()
log.info(f"Length of input ids ={len(input_ids)}")

# We'll combine these into a simple pipeline
model = nn.ModuleList([token_embedding, pos_encoding,
                      multihead_attention,layer_norm1,layer_norm2,prediction_layer1,prediction_layer2])

# The most important part is the gradient-descent optimizer
# Passing model.parameters() to the optimizer ensures that optimizer.step() updates all layers, including token_embedding, the attention heads and the prediction layers
# optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # SGD was unstable here, hence AdamW

# with a higher learning rate the loss becomes NaN

assert not torch.isnan(input_ids).any()
assert not torch.isinf(input_ids).any()

# Place everything on the GPU; model is a ModuleList holding every layer,
# so a single .to('cuda') moves all of them (including all attention heads)
model.to('cuda')




07-Feb-25 12:59:05 - vocab_size=2000 embedding_dim/d_model=512
07-Feb-25 12:59:06 - Length of input ids =24471


ModuleList(
  (0): Embedding(2000, 512)
  (1): PositionalEncoding()
  (2): ModuleList(
    (0-11): 12 x SingleHeadSelfAttention(
      (W_Q): Linear(in_features=512, out_features=512, bias=False)
      (W_K): Linear(in_features=512, out_features=512, bias=False)
      (W_V): Linear(in_features=512, out_features=512, bias=False)
    )
  )
  (3-4): 2 x LayerNorm((2000,), eps=1e-05, elementwise_affine=True)
  (5): Linear(in_features=6144, out_features=2000, bias=True)
  (6): Linear(in_features=2000, out_features=2000, bias=True)
)
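
The layer sizes printed above are enough to estimate the number of trainable parameters. The following is a small arithmetic sketch based only on that printout; it assumes the usual PyTorch conventions (a bias on each `Linear` with `bias=True`, and a weight plus a bias per `LayerNorm` with `elementwise_affine=True`). `PositionalEncoding` contributes nothing because `pe` is registered as a buffer, not a parameter.

```python
vocab_size, d_model, num_heads = 2000, 512, 12

embedding = vocab_size * d_model                     # Embedding(2000, 512)
per_head = 3 * d_model * d_model                     # W_Q, W_K, W_V, all without bias
attention = num_heads * per_head                     # 12 independent heads
concat_width = d_model * num_heads                   # in_features of the first Linear
linear1 = concat_width * vocab_size + vocab_size     # weights + bias
linear2 = vocab_size * vocab_size + vocab_size       # weights + bias
layer_norms = 2 * (2 * vocab_size)                   # two LayerNorms, weight + bias each

total = embedding + attention + linear1 + linear2 + layer_norms
print(f"concat width = {concat_width}")
print(f"total trainable parameters = {total:,}")
```

Because the heads are concatenated rather than averaged, the first prediction layer alone holds about 12.3 million of the roughly 26.8 million parameters.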

In [10]:
# NOTE: no need to execute this cell again (it needs an A100 GPU)
log.info("Training model...")

model.train()
# batch_size = 80  # works on a free-tier T4 GPU with about 12 GB of GPU RAM
batch_size = 64  # A100 with num_heads=12 uses about 34 GB
N, seq_length = input_ids.shape
log.info(f"N= {N} seq_length= {seq_length}")
num_batches = N // batch_size

for epoch in range(10):
    for start_idx in range(0, N, batch_size):
        end_idx = start_idx + batch_size
        if end_idx > N:
            break  # in case N not multiple of batch_size

        # Slice out a batch
        batch_input = input_ids[start_idx:end_idx, :]   # (B, seq_length)
        batch_labels = labels[start_idx:end_idx, :]     # (B, seq_length)
        if epoch == 0 and start_idx == 0:
            log.info(f"batch_input.shape={batch_input.shape}")
            log.info(f"batch_labels.shape={batch_labels.shape}")

        # Move to GPU
        batch_input = batch_input.to('cuda')
        batch_labels = batch_labels.to('cuda')

        # 1) Shift input & labels so model predicts next token
        #    shape -> (B, seq_length-1)
        trimmed_input = batch_input[:, :-1]
        target_labels = batch_labels[:, 1:]
        if epoch == 0 and start_idx == 0:
            # take 10 tokens
            log.info("Example input: %s", sp.decode(trimmed_input[0].tolist()[:10]))
            log.info("Example labels: %s",sp.decode(target_labels[0].tolist()[:10]))

        embedded_tokens = token_embedding(trimmed_input)
        # shape remains (batch_size, seq_len, d_model)
        pos_embedded_tokens = pos_encoding(embedded_tokens)
        # Collect the output of each attention head in a list
        head_outputs = []
        # Run each head in turn and keep its output; this is the simplest possible approach.
        # In practice the heads are computed in parallel, by adding an extra dimension to the matrices and doing it in one shot.
        for attention_mod in multihead_attention:
            score,_ = attention_mod(pos_embedded_tokens)
            head_outputs.append(score)
        # Combine the list of head outputs into a single tensor by concatenation
        score = torch.cat(head_outputs, dim=-1)  # Concatenate along the last dimension
        # todo - change this to score = torch.mean(torch.stack(head_outputs, dim=0), dim=0)  # Average over heads
        # print(score.shape)  # (batch_size, seq_length-1, d_model*num_heads)
        # Predict the next word
        hidden1 = prediction_layer1(score)  # Project to vocabulary size
        hidden1 = layer_norm1(hidden1)         # add layer norm
        logits = prediction_layer2(hidden1)  # through few linear layers
        logits = layer_norm2(logits)      # add layer norm
        # The last dimension of the output tensor is the vocabulary size (the number of classes),
        # so softmax is applied along the last dimension (dim=-1).
        # predicted_probs and predicted_token_id are for inspection only; they are not used by the loss.
        predicted_probs = torch.softmax(logits, dim=-1)  # Get probabilities
        # Get the predicted word (token ID)
        predicted_token_id = torch.argmax(predicted_probs, dim=-1)
        # Calculate the loss from the raw logits; CrossEntropyLoss already applies softmax internally.
        # An input of seq_length-1 tokens yields seq_length-1 next-token predictions.
        loss = loss_function(
            logits.reshape(-1, vocab_size),
            target_labels.reshape(-1)
        )
        optimizer.zero_grad()
        loss.backward()
        # Gradient clipping does not discard or ignore the loss; it caps the gradient norm (here at 5.0), which limits the size of the update and avoids erratic jumps.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
        optimizer.step()
        # print training progress occasionally
        if (start_idx // batch_size) % 50 == 0:
            log.info("[Epoch=%d | Batch=%d] loss=%.4f", epoch+1, start_idx//batch_size, loss.item())
        # stop the current epoch early once the loss is low enough (this only leaves the inner batch loop)
        if loss.item() < 0.5:
            break
        # free gpu memory
        del batch_input, batch_labels, trimmed_input, target_labels, logits
        gc.collect()
        torch.cuda.empty_cache()

    log.info(f"---------Epoch {epoch+1:02d} | Loss: {loss.item():.4f}")

"""# Use the trained model to predict"""

# save the model weights
torch.save(model.state_dict(), "model_weights.pth")
log.info("Model weights saved")

07-Feb-25 10:57:09 - Training model...
07-Feb-25 10:57:09 - N= 24471 seq_length= 1000
07-Feb-25 10:57:09 - batch_input.shape=torch.Size([64, 1000])
07-Feb-25 10:57:09 - batch_labels.shape=torch.Size([64, 1000])
07-Feb-25 10:57:09 - Example input: One day, a little girl named Lily found a 
07-Feb-25 10:57:09 - Example labels: day, a little girl named Lily found a need
07-Feb-25 10:57:10 - [Epoch=1 | Batch=0] loss=8.0797
07-Feb-25 10:58:08 - [Epoch=1 | Batch=50] loss=4.3915
07-Feb-25 10:59:06 - [Epoch=1 | Batch=100] loss=3.9941
07-Feb-25 11:00:05 - [Epoch=1 | Batch=150] loss=4.0190
07-Feb-25 11:01:03 - [Epoch=1 | Batch=200] loss=3.6729
07-Feb-25 11:02:02 - [Epoch=1 | Batch=250] loss=3.7241
07-Feb-25 11:03:00 - [Epoch=1 | Batch=300] loss=3.7997
07-Feb-25 11:03:58 - [Epoch=1 | Batch=350] loss=3.6950
07-Feb-25 11:04:35 - ---------Epoch 01 | Loss: 3.6159
07-Feb-25 11:04:36 - [Epoch=2 | Batch=0] loss=3.7283
07-Feb-25 11:05:34 - [Epoch=2 | Batch=50] loss=3.3715
07-Feb-25 11:06:32 - [Epoch=2 | 
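
Two details of the training loop can be checked by hand: the one-position shift between inputs and labels, and what loss value to expect before any learning. The sketch below uses plain Python with made-up token ids; `cross_entropy` here is a hypothetical single-example helper, not the `nn.CrossEntropyLoss` used in the notebook.

```python
import math

def cross_entropy(probs, target_id):
    """Negative log-probability assigned to the correct token."""
    return -math.log(probs[target_id])

# 1) Shift: the model reads tokens 0..n-2 and must predict tokens 1..n-1.
row = [11, 12, 13, 14, 15]            # made-up token ids for one sequence
inputs, targets = row[:-1], row[1:]
print(inputs, targets)

# 2) Baseline: a model that guesses uniformly over the 2000-token vocabulary.
vocab_size = 2000
uniform = [1.0 / vocab_size] * vocab_size
baseline = cross_entropy(uniform, target_id=0)
print(round(baseline, 2))             # equals ln(2000)
```

A uniform guess costs ln(2000), about 7.6, which is the same order as the first logged loss of 8.0797; the later values around 3.4 to 3.7 show the model has moved well away from guessing.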

In [1]:
# prompt: copy weights to drive

from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [8]:
# Save the model weights to Google Drive
#!cp model_weights.pth /content/drive/My\ Drive/Colab\ Notebooks/





In [2]:
!cp /content/drive/My\ Drive/Colab\ Notebooks/model_weights.pth model_weights.pth

In [9]:

# load the model for evaluation
model.load_state_dict(torch.load("model_weights.pth"))
model.eval()  # Set to evaluation mode

# Test the generation function
prompt = "Bloom lived in a big garden"

generated_tokens = sp.encode(prompt, out_type=int)  # Tokenize input text

# Convert to tensor
input_tensor = torch.tensor(
    generated_tokens, dtype=torch.long).unsqueeze(0)  # (1, seq_length)
max_length = 100
for _ in range(max_length):
    # Get embedding
    embedded_tokens = token_embedding(input_tensor.to('cuda'))
    # Get attention and score
    # shape remains (batch_size, seq_len, d_model)
    pos_embedded_tokens = pos_encoding(embedded_tokens)
    head_outputs = []
    # get attention and score from multihead attention
    for attention_mod in multihead_attention:
        score,_ = attention_mod(pos_embedded_tokens)
        head_outputs.append(score)
    # Combine the list of head outputs into a single tensor by concatenation
    score = torch.cat(head_outputs, dim=-1)  # Concatenate along the last dimension (TODO: a mean over heads may be better)
    # print(score.shape)  # (1, current_length, d_model*num_heads)
    # Predict the next word
    hidden1 = prediction_layer1(score)  # Project to vocabulary size
    hidden1 = layer_norm1(hidden1)         # add layer norm
    logits = prediction_layer2(hidden1)  # through few linear layers
    logits = layer_norm2(logits)      # add layer norm
    predicted_probs = torch.softmax(logits, dim=-1)  # Get probabilities
    # Take the probabilities at the last position (for autoregressive prediction)
    next_token_logits = predicted_probs[:, -1, :]  # Shape: (1, vocab_size); holds probabilities despite the name
    # Greedy decoding: pick the most probable token
    next_token_id = torch.argmax(next_token_logits, dim=-1)  # (1,)
    # Append new token
    generated_tokens.append(next_token_id.item())
    # Stop if we generate an EOS token (optional)
    if next_token_id.item() == sp.eos_id():  # Ensure your tokenizer has an EOS token
        break

    # Update input tensor with new token for next iteration
    input_tensor = torch.tensor(
        generated_tokens, dtype=torch.long).unsqueeze(0)

# Decode generated token IDs back to text
generated_text = sp.decode(generated_tokens)
log.info(f"Generated Text={generated_text}")

07-Feb-25 12:59:12 - Generated Text=Bloom lived in a big garden with a big smile on her face. She was so happy and thanked her mom for helping her. She was so happy and hugged her mom and said, "Thank you, mom. You are very kind." But, Brownie and Brownie careful with the big smile on her face. She was so happy and thanked her mom for helping her mom and dad. She hugged her mom and said, "Thank you, Bye. You are very kind." Brow


In [9]:
!cp multihead_transformer.log /content/drive/My\ Drive/Colab\ Notebooks/

In [11]:
def generate(prompt):
  generated_tokens = sp.encode(prompt, out_type=int)  # Tokenize input text
  # Convert to tensor
  input_tensor = torch.tensor(
      generated_tokens, dtype=torch.long).unsqueeze(0)  # (1, seq_length)
  max_length = 100
  for _ in range(max_length):
      # Get embedding
      embedded_tokens = token_embedding(input_tensor.to('cuda'))
      # Get attention and score
      # shape remains (batch_size, seq_len, d_model)
      pos_embedded_tokens = pos_encoding(embedded_tokens)
      head_outputs = []
      # get attention and score from multihead attention
      for attention_mod in multihead_attention:
          score,_ = attention_mod(pos_embedded_tokens)
          head_outputs.append(score)
      # Combine the list of head outputs into a single tensor by concatenation
      score = torch.cat(head_outputs, dim=-1)  # Concatenate along the last dimension (TODO: a mean over heads may be better)
      # print(score.shape)  # (1, current_length, d_model*num_heads)
      # Predict the next word
      hidden1 = prediction_layer1(score)  # Project to vocabulary size
      hidden1 = layer_norm1(hidden1)         # add layer norm
      logits = prediction_layer2(hidden1)  # through few linear layers
      logits = layer_norm2(logits)      # add layer norm
      predicted_probs = torch.softmax(logits, dim=-1)  # Get probabilities
      # Take the probabilities at the last position (for autoregressive prediction)
      next_token_logits = predicted_probs[:, -1, :]  # Shape: (1, vocab_size); holds probabilities despite the name
      # Greedy decoding: pick the most probable token
      next_token_id = torch.argmax(next_token_logits, dim=-1)  # (1,)
      # Append new token
      generated_tokens.append(next_token_id.item())
      # Stop if we generate an EOS token (optional)
      if next_token_id.item() == sp.eos_id():  # Ensure your tokenizer has an EOS token
          break

      # Update input tensor with new token for next iteration
      input_tensor = torch.tensor(
          generated_tokens, dtype=torch.long).unsqueeze(0)

  # Decode generated token IDs back to text
  generated_text = sp.decode(generated_tokens)
  return generated_text


In [12]:
# Test the generation function
prompt = "The Cat sat on the"
generated_text = generate(prompt)
log.info(f"Generated Text={generated_text}")

07-Feb-25 12:30:04 - Generated Text=The Cat sat on the table and the chair. The chased the chair. The chased the chair and the chair was so happy. The chair was so happy and thanked the chair for the chair and the chair and the chair was so happy to have the chair and the chair was so glad to have such a wonderful time. Once upon a time there was a little girl named Amy. She was three years old and loved to play in the chair and it 
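
The repeated "the chair" in this output is typical of greedy (argmax) decoding: once the most probable continuation leads back to an earlier state, the same choices are made again. The toy sketch below illustrates the effect with a hypothetical lookup table that only looks at the last token; the real model conditions on the whole context, so this is an analogy rather than a description of what the trained network does.

```python
# Hypothetical next-token table: for each token, made-up probabilities of what follows.
toy_model = {
    "the":   {"chair": 0.6, "table": 0.4},
    "chair": {"was": 0.7, "and": 0.3},
    "was":   {"happy": 0.8, "the": 0.2},
    "happy": {"and": 0.9, "the": 0.1},
    "and":   {"the": 1.0},
    "table": {"and": 1.0},
}

def greedy_generate(start, steps):
    """Always append the single most probable next token."""
    tokens = [start]
    for _ in range(steps):
        options = toy_model[tokens[-1]]
        tokens.append(max(options, key=options.get))
    return tokens

out = greedy_generate("the", 10)
print(" ".join(out))
```

The sequence falls into a five-token cycle and never reaches "table", even though that continuation has a 40% probability. Sampling from the distribution instead of taking the argmax is one common way to break such loops.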


In [13]:
# Adding temperature-based sampling, as suggested by ChatGPT

import torch.nn.functional as F

def generate(prompt, max_length=100, temperature=1.0):
    generated_tokens = sp.encode(prompt, out_type=int)  # Tokenize input text
    input_tensor = torch.tensor(generated_tokens, dtype=torch.long).unsqueeze(0).to('cuda')  # (1, seq_length)

    for _ in range(max_length):
        # Get embedding
        embedded_tokens = token_embedding(input_tensor)
        pos_embedded_tokens = pos_encoding(embedded_tokens)

        # Multi-head attention
        head_outputs = [attention_mod(pos_embedded_tokens)[0] for attention_mod in multihead_attention]
        score = torch.cat(head_outputs, dim=-1)  # Use mean instead of cat if needed

        # Predict next token
        hidden1 = layer_norm1(prediction_layer1(score))
        logits = layer_norm2(prediction_layer2(hidden1))

        # Get last token's logits (for autoregressive prediction)
        next_token_logits = logits[:, -1, :]  # Shape: (1, vocab_size)

        # Apply temperature-based scaling before sampling
        scaled_logits = next_token_logits / temperature  #  Adjust randomness
        probabilities = F.softmax(scaled_logits, dim=-1)  # Convert to probabilities

        # Sample from the probability distribution
        next_token_id = torch.multinomial(probabilities, num_samples=1).item()  #  Random sampling

        # Append new token
        generated_tokens.append(next_token_id)

        # Stop if we generate an EOS token (optional)
        if next_token_id == sp.eos_id():  # Ensure your tokenizer has an EOS token
            break

        # Update input tensor with new token for next iteration
        input_tensor = torch.tensor(generated_tokens, dtype=torch.long).unsqueeze(0).to('cuda')

    # Decode generated token IDs back to text
    return sp.decode(generated_tokens)



### **Controlling Temperature**

| **Temperature (T)** | **Effect** |
|--------------------|------------------------------|
| **T = 0.1**  | Almost deterministic, like `argmax` |
| **T = 0.7**  | Balanced creativity vs. coherence |
| **T = 1.0**  | Standard sampling (default) |
| **T = 1.5+**  | Very creative but may generate gibberish |
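
The effect in the table can be reproduced with a few lines of plain Python. The three logits are made up for illustration; the function applies the same steps as `generate`, dividing the logits by the temperature and then applying softmax.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide the logits by the temperature, then apply softmax."""
    scaled = [v / temperature for v in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.5]  # made-up scores for a 3-token vocabulary
for t in (0.1, 0.7, 1.0, 1.5):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])

p_low = softmax_with_temperature(toy_logits, 0.1)
p_high = softmax_with_temperature(toy_logits, 1.5)
```

At T = 0.1 nearly all of the probability mass sits on the top token, so sampling behaves almost like `argmax`; at T = 1.5 the distribution is flatter and the less likely tokens are drawn more often.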


In [17]:
# Test the generation function
prompt = "The Cat sat on the"
generated_text = generate(prompt,temperature=0.4)
log.info(f"Generated Text={generated_text}")

07-Feb-25 12:39:47 - Generated Text=The Cat sat on the table. The little girl was so happy and she had a great time playing with her new toy. She was so happy and she had a new friend. She was so happy to have her new friend and she was proud of her new friend. Once upon a time there was a little girl who loved to play in the park. One day she was walking in the park when she saw a big tree in her garden. She was so excited she wanted to show her friends her friends to play with. 
