<a href="https://colab.research.google.com/github/debarshee2004/atto-gpt/blob/master/notebooks/concept_atto_gpt.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Atto GPT (How GPT works?)

**Let's see what dataset we are using for this recreation process.**

In [1]:
# We always start with a dataset to train on. Let's download the tiny shakespeare dataset
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2024-10-11 08:51:42--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt’


2024-10-11 08:51:42 (24.2 MB/s) - ‘input.txt’ saved [1115394/1115394]



In [2]:
# read it in to inspect it
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

In [3]:
print("Length of dataset in characters: ", len(text))

Length of dataset in characters:  1115394


In [4]:
# let's look at the first 1000 characters
print(text[:1000])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.



---

### Understanding the dataset

In [6]:
# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
print("The Characters in the Dataset:",''.join(chars))
print("The Number of Characters in the Dataset:",vocab_size)

The Characters in the Dataset: 
 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
The Number of Characters in the Dataset: 65


### Encoder and Decoder

In [8]:
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
# encoder: take a string, output a list of integers
encode = lambda s: [stoi[c] for c in s]
# decoder: take a list of integers, output a string
decode = lambda l: ''.join([itos[i] for i in l])

print("Input String: `hii there`")
print("Encoded Values: ",encode("hii there"))
print("Decoded Output: ",decode(encode("hii there")))

Input String: `hii there`
Encoded Values:  [46, 47, 47, 1, 58, 46, 43, 56, 43]
Decoded Output:  hii there
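
The mapping above only knows characters that occur in the training text, so encoding anything else raises a `KeyError`. Below is a minimal sketch of the same idea on a small sample string, with a round-trip check. The helper `build_tokenizer` and the names `enc`, `dec`, and `n_chars` are hypothetical and chosen so they do not overwrite the notebook's own `encode`, `decode`, and `vocab_size`.

```python
def build_tokenizer(corpus):
    """Build character-level encode/decode functions from a corpus string."""
    chars = sorted(set(corpus))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for i, ch in enumerate(chars)}

    def enc(s):
        # Raises KeyError for characters that never appeared in the corpus.
        return [stoi[c] for c in s]

    def dec(ids):
        return "".join(itos[i] for i in ids)

    return enc, dec, len(chars)


sample_text = "hii there"  # toy corpus, not the Shakespeare dataset
enc, dec, n_chars = build_tokenizer(sample_text)
ids = enc(sample_text)
print(n_chars, ids)
print(dec(ids) == sample_text)  # decoding the encoding returns the original string
```

Because the vocabulary here is built from nine characters only, the integer ids differ from the ones printed above for the full dataset.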


In [9]:
# let's now encode the entire text dataset and store it into a torch.Tensor
import torch

data = torch.tensor(encode(text), dtype=torch.long)
print("Shape of the Data: ", data.shape)
print("Data-Type of the Data", data.dtype)
print("The encoded data")
print(data[:1000])
# the first 1000 characters we looked at earlier look like this to the GPT

Shape of the Data:  torch.Size([1115394])
Data-Type of the Data torch.int64
The encoded data
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59,  1, 39, 56, 43,  1, 39, 50, 50,
         1, 56, 43, 57, 53, 50, 60, 43, 42,  1, 56, 39, 58, 46, 43, 56,  1, 58,
        53,  1, 42, 47, 43,  1, 58, 46, 39, 52,  1, 58, 53,  1, 44, 39, 51, 47,
        57, 46, 12,  0,  0, 13, 50, 50, 10,  0, 30, 43, 57, 53, 50, 60, 43, 42,
         8,  1, 56, 43, 57, 53, 50, 60, 43, 42,  8,  0,  0, 18, 47, 56, 57, 58,
         1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 18, 47, 56, 57, 58,  6,  1, 63,
        53, 59,  1, 49, 52,

---

### Splitting the Data

In [10]:
# Let's now split up the data into train and validation sets
n = int(0.9*len(data))
# 90% of the data will be used for training the model
train_data = data[:n]
val_data = data[n:]

In [11]:
block_size = 8
train_data[:block_size+1]

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])

In [13]:
# Extract a slice from the training data as input (x) with 'block_size' number of elements,
# starting from the first element and ending at index block_size-1
x = train_data[:block_size]

# Extract another slice from the training data as target (y), but shifted by 1 position
# relative to 'x'. This will start from index 1 and end at index block_size.
y = train_data[1:block_size+1]

# Loop through each time step (t) from 0 to block_size-1
for t in range(block_size):

    # Extract the context, which is the slice from the beginning of 'x' up to the current time step (t) + 1
    # This represents the input sequence of increasing length at each step.
    context = x[:t+1]

    # The target for the current time step is the t-th element from the 'y' sequence
    target = y[t]

    # Print the current context (input sequence) and its corresponding target value.
    # It shows what the model is expected to predict as the target, given the context.
    print(f"when input is {context} the target: {target}")


when input is tensor([18]) the target: 47
when input is tensor([18, 47]) the target: 56
when input is tensor([18, 47, 56]) the target: 57
when input is tensor([18, 47, 56, 57]) the target: 58
when input is tensor([18, 47, 56, 57, 58]) the target: 1
when input is tensor([18, 47, 56, 57, 58,  1]) the target: 15
when input is tensor([18, 47, 56, 57, 58,  1, 15]) the target: 47
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) the target: 58


---

In [14]:
# Set the random seed for reproducibility, ensuring that the random numbers generated by torch
# are the same every time the code runs.
torch.manual_seed(1337)

# Define the batch size, i.e., how many independent sequences (or examples) we want to process in parallel
batch_size = 4

# Define the block size, i.e., the maximum length of the input context for predictions
block_size = 8

# Function to get a batch of data for training or validation
def get_batch(split):
    # Select either the training data or validation data based on the 'split' argument
    data = train_data if split == 'train' else val_data

    # Randomly choose starting indices for the batch, ensuring that the selected index
    # plus the block size does not exceed the data length. This ensures we can create
    # sequences of length 'block_size'.
    ix = torch.randint(len(data) - block_size, (batch_size,))

    # Stack 'batch_size' sequences of length 'block_size' from the selected data
    # These will serve as the input sequences (x) for the model.
    x = torch.stack([data[i:i+block_size] for i in ix])

    # Stack 'batch_size' sequences for the targets (y), shifted by 1 position relative to 'x'
    # The target sequence is what the model will learn to predict.
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])

    # Return the input (x) and target (y) tensors for this batch
    return x, y

# Get a batch of training data (inputs and targets)
xb, yb = get_batch('train')

# Print the shape and contents of the input tensor
print('inputs:')
print(xb.shape)  # Should show (batch_size, block_size), e.g., (4, 8)
print(xb)

# Print the shape and contents of the target tensor
print('targets:')
print(yb.shape)  # Should also show (batch_size, block_size), e.g., (4, 8)
print(yb)

print('----')

# Iterate over the batch dimension (each independent sequence in the batch)
for b in range(batch_size):
    # Iterate over the time dimension (each timestep in the sequence)
    for t in range(block_size):
        # Extract the context, which is the input sequence from timestep 0 to t
        # This represents the portion of the input the model has seen so far
        context = xb[b, :t+1]

        # Extract the target value at the current timestep t
        target = yb[b, t]

        # Print the current context (input sequence seen so far) and its corresponding target value
        # The 'tolist()' function is used to convert the tensor into a list for better readability
        print(f"when input is {context.tolist()} the target: {target}")


inputs:
torch.Size([4, 8])
tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])
targets:
torch.Size([4, 8])
tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])
----
when input is [24] the target: 43
when input is [24, 43] the target: 58
when input is [24, 43, 58] the target: 5
when input is [24, 43, 58, 5] the target: 57
when input is [24, 43, 58, 5, 57] the target: 1
when input is [24, 43, 58, 5, 57, 1] the target: 46
when input is [24, 43, 58, 5, 57, 1, 46] the target: 43
when input is [24, 43, 58, 5, 57, 1, 46, 43] the target: 39
when input is [44] the target: 53
when input is [44, 53] the target: 56
when input is [44, 53, 56] the target: 1
when input is [44, 53, 56, 1] the target: 58
when input is [44, 53, 56, 1, 58] the target: 46
when input is [44, 53

In [15]:
print(xb) # our input to the transformer

tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])
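
The upper bound `len(data) - block_size` in `get_batch` matters because the target window is shifted one position to the right of the input window. The sketch below mirrors that slicing with plain Python lists, so the bound can be checked without PyTorch. The function `sample_windows` and the list `toy_data` are hypothetical stand-ins, not part of the notebook.

```python
import random


def sample_windows(data, block_size, batch_size, rng):
    """Return (inputs, targets) as lists of windows, mirroring get_batch's slicing."""
    # The largest possible start is len(data) - block_size - 1, so the shifted
    # target slice data[i+1 : i+block_size+1] still holds block_size elements.
    starts = [rng.randrange(len(data) - block_size) for _ in range(batch_size)]
    xs = [data[i:i + block_size] for i in starts]
    ys = [data[i + 1:i + block_size + 1] for i in starts]
    return xs, ys


toy_data = list(range(20))  # stand-in for the encoded text
xs, ys = sample_windows(toy_data, block_size=8, batch_size=4, rng=random.Random(0))
for x_row, y_row in zip(xs, ys):
    print(x_row, "->", y_row)
```

Since `toy_data` is just `0..19`, every target row is the input row plus one, which makes the one-step shift easy to see.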


---

### Model Initialization

The `BigramLanguageModel` class inherits from `torch.nn.Module`, making it a PyTorch neural network module. In the `__init__` method, the model initializes a **token embedding table** using `nn.Embedding`. This table maps each token in the vocabulary to a vector of logits representing the likelihood of the next token. The embedding table’s size is `(vocab_size, vocab_size)`, meaning each token is represented by a vector of size equal to the number of tokens in the vocabulary. This architecture is characteristic of a **bigram model**, where the next token is predicted solely based on the current token.
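
To make the lookup concrete, here is a small sketch with a hypothetical five-token vocabulary (`toy_vocab`, `table`, `tokens`, and `toy_logits` are illustrative names). It shows that the "logits" for a position are simply the row of the embedding matrix selected by that position's token id.

```python
import torch
import torch.nn as nn

toy_vocab = 5  # hypothetical tiny vocabulary, not the 65 characters of the dataset
table = nn.Embedding(toy_vocab, toy_vocab)

tokens = torch.tensor([[0, 3, 3]])  # shape (B=1, T=3)
toy_logits = table(tokens)          # shape (1, 3, 5)

print(toy_logits.shape)
# Each position receives the row of the weight matrix selected by its token id.
print(torch.equal(toy_logits[0, 1], table.weight[3]))
# Identical tokens give identical predictions, whatever came before them.
print(torch.equal(toy_logits[0, 1], toy_logits[0, 2]))
# The whole model is one square table: toy_vocab * toy_vocab parameters.
print(sum(p.numel() for p in table.parameters()))
```

With the dataset's vocabulary of 65 characters, the same table would hold 65 × 65 = 4225 parameters.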

### Forward Method

The `forward` method computes the logits for the next token in the sequence. The method takes in a batch of token indices `idx` (with shape `(B, T)`, where `B` is the batch size and `T` is the sequence length) and optionally `targets` (which represent the actual next tokens for training). The logits are generated by looking up each token’s embedding from the table. If `targets` are provided, the method computes the **cross-entropy loss**, a standard loss function for classification tasks, between the predicted logits and the true target values. The logits are reshaped from `(B, T, C)` to `(B*T, C)` to match the expected input shape for the loss function, where `C` is the vocabulary size.
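
A useful reference point for the loss is a model that assigns equal probability to every token: its cross-entropy equals the natural log of the vocabulary size. The sketch below computes this in plain Python; `cross_entropy`, `n_classes`, `uniform_logits`, and `baseline` are illustrative names, and the single-example formula is a simplification of what `F.cross_entropy` averages over a batch.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy for one example: -log(softmax(logits)[target])."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target]


n_classes = 65  # matches the vocabulary size reported earlier
uniform_logits = [0.0] * n_classes
baseline = cross_entropy(uniform_logits, target=10)
print(round(baseline, 4))  # ln(65), about 4.1744
```

The untrained model in the cell below reports a loss of 4.8786, which is above this baseline. That is plausibly because the randomly initialized embedding rows are not uniform, so some wrong tokens receive high probability by chance.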

### Token Generation

The `generate` method implements token generation. Given an initial sequence of token indices (`idx`), the method generates additional tokens up to a specified number (`max_new_tokens`). At each step, the model predicts the next token by computing the logits for the last time step, applying a **softmax** function to convert logits into probabilities, and then sampling the next token using `torch.multinomial`. The newly sampled token is appended to the sequence, and this process is repeated for the desired number of new tokens. This allows the model to extend a sequence by any number of tokens, although in a bigram model each prediction depends only on the most recent token rather than on the full context.
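
The sampling loop can be sketched without PyTorch to show its structure: look up the row for the last token, turn it into probabilities, draw one token, append, and repeat. The table `toy_table` and the functions `softmax` and `sample_sequence` are hypothetical, and `random.Random.choices` stands in for `torch.multinomial`.

```python
import math
import random


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_sequence(logit_table, start, max_new_tokens, rng):
    """Generate token ids where each step depends only on the previous token."""
    seq = [start]
    for _ in range(max_new_tokens):
        probs = softmax(logit_table[seq[-1]])  # row for the last token only
        nxt = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        seq.append(nxt)
    return seq


# Hypothetical 3-token table: row i holds the logits for the token after i.
toy_table = [
    [0.0, 2.0, 0.0],
    [0.0, 0.0, 2.0],
    [2.0, 0.0, 0.0],
]
print(sample_sequence(toy_table, start=0, max_new_tokens=10, rng=random.Random(1337)))
```

Sampling, rather than always taking the most likely token, is what lets repeated calls produce different continuations from the same starting token.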

In [19]:
# Import necessary libraries
import torch
import torch.nn as nn
from torch.nn import functional as F

# Set a fixed random seed for reproducibility
torch.manual_seed(1337)

# Define a class for the Bigram Language Model
class BigramLanguageModel(nn.Module):

    # Constructor method, initializing the model
    def __init__(self, vocab_size):
        super().__init__()
        # Create an embedding table where each token is mapped to a vector of logits
        # The table maps each token in the vocabulary to a vector of size (vocab_size,)
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    # Forward pass through the model
    def forward(self, idx, targets=None):
        # idx: (B, T) tensor, where B = batch size, T = sequence length
        # targets: (B, T) tensor, containing the target tokens to predict

        # Get the logits for the next token using the embedding table
        logits = self.token_embedding_table(idx)  # Output shape: (B, T, C), where C = vocab_size

        # Compute the loss only when targets are provided
        if targets is None:
            loss = None
        else:
            # Flatten logits and targets to calculate loss using cross-entropy
            B, T, C = logits.shape  # Unpack batch size (B), time steps (T), and vocab size (C)
            logits = logits.view(B*T, C)  # Reshape logits to (B*T, C) for loss computation
            targets = targets.view(B*T)   # Reshape targets to (B*T)
            loss = F.cross_entropy(logits, targets)  # Compute cross-entropy loss

        # Return logits and loss (if targets are given)
        return logits, loss

    # Function to generate new tokens based on the current context
    def generate(self, idx, max_new_tokens):
        # idx: (B, T) tensor, where B = batch size, T = sequence length
        # max_new_tokens: the number of tokens to generate
        for _ in range(max_new_tokens):
            # Forward pass to get logits (predictions)
            logits, loss = self(idx)

            # Focus only on the last time step's logits
            logits = logits[:, -1, :]  # Shape: (B, C), where C = vocab_size

            # Apply softmax to convert logits into probabilities
            probs = F.softmax(logits, dim=-1)  # Shape: (B, C)

            # Sample the next token from the probability distribution
            idx_next = torch.multinomial(probs, num_samples=1)  # Shape: (B, 1)

            # Append the sampled token to the current sequence
            idx = torch.cat((idx, idx_next), dim=1)  # Shape: (B, T+1)

        # Return the updated sequence, including generated tokens
        return idx

# Example usage
# Assume 'vocab_size' is defined based on the dataset being used
m = BigramLanguageModel(vocab_size)

# Perform a forward pass with some input data (xb) and targets (yb)
logits, loss = m(xb, yb)

# Print the shape of the logits (output predictions) and the computed loss
print("The shape of the output predictions: ",logits.shape)  # Should show (batch_size * sequence_length, vocab_size)
print("The Cross Entropy Loss: ",loss)  # The cross-entropy loss value

# Generate 100 new tokens based on an initial input of a single token (0)
generated_sequence = m.generate(idx=torch.zeros((1, 1), dtype=torch.long), max_new_tokens=100)
print("The Generated Sequence: ",generated_sequence[0].tolist())

# Decode the generated token indices into text (assuming a 'decode' function exists)
# Convert the generated token indices (a tensor) into a list for decoding
print("The Decoded Generated Sequence: ",decode(generated_sequence[0].tolist()))


The shape of the output predictions:  torch.Size([32, 65])
The Cross Entropy Loss:  tensor(4.8786, grad_fn=<NllLossBackward0>)
The Generated Sequence:  [0, 31, 56, 12, 55, 28, 7, 29, 35, 49, 58, 36, 53, 24, 4, 48, 24, 16, 22, 45, 27, 24, 34, 64, 5, 30, 21, 53, 16, 55, 20, 42, 46, 57, 34, 4, 60, 24, 24, 62, 39, 58, 48, 57, 41, 25, 54, 61, 24, 17, 30, 31, 28, 63, 39, 53, 8, 55, 44, 64, 57, 3, 37, 57, 3, 64, 18, 7, 61, 6, 11, 43, 17, 49, 64, 62, 48, 45, 15, 23, 18, 15, 46, 57, 2, 47, 35, 35, 8, 27, 40, 64, 16, 52, 62, 13, 1, 25, 57, 3, 9]
The Decoded Generated Sequence:  
Sr?qP-QWktXoL&jLDJgOLVz'RIoDqHdhsV&vLLxatjscMpwLERSPyao.qfzs$Ys$zF-w,;eEkzxjgCKFChs!iWW.ObzDnxA Ms$3


In [20]:
# create a PyTorch optimizer
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)

In [24]:
# Define the batch size (number of training samples processed at once)
batch_size = 32

# Loop over training steps (iterations)
for steps in range(10000):  # Increase the number of steps for better results over time

    # Sample a batch of input (xb) and target (yb) data from the training set
    xb, yb = get_batch('train')

    # Perform a forward pass through the model and compute the loss
    logits, loss = m(xb, yb)

    # Clear the gradients from the previous step to prevent accumulation
    optimizer.zero_grad(set_to_none=True)

    # Backpropagate the loss to compute gradients for the model parameters
    loss.backward()

    # Update the model parameters using the optimizer
    optimizer.step()

# Print the final loss value after training loop ends
print(loss.item())


2.6319539546966553


In [26]:
# Generate once, then print both the raw token indices and their decoded text
generated = m.generate(idx=torch.zeros((1, 1), dtype=torch.long),
                       max_new_tokens=500)[0].tolist()

print("The Generated Tokens: ", generated)
print("The Decoded Generated Tokens: ", decode(generated))

The Generated Tokens:  [0, 20, 43, 50, 1, 53, 1, 54, 54, 43, 12, 1, 51, 53, 57, 47, 57, 47, 45, 46, 43, 56, 42, 43, 1, 61, 11, 1, 57, 1, 58, 57, 43, 56, 43, 1, 41, 53, 44, 1, 53, 59, 40, 56, 43, 1, 54, 50, 1, 50, 53, 58, 46, 43, 57, 0, 0, 58, 1, 56, 43, 1, 47, 41, 43, 5, 31, 39, 58, 1, 58, 46, 43, 56, 41, 53, 56, 48, 59, 58, 1, 52, 58, 46, 43, 58, 6, 1, 58, 1, 43, 63, 1, 58, 1, 58, 10, 0, 0, 14, 17, 16, 35, 46, 47, 52, 53, 56, 58, 46, 43, 12, 1, 50, 1, 42, 1, 51, 1, 58, 46, 58, 46, 39, 52, 58, 46, 43, 43, 1, 42, 47, 41, 53, 53, 1, 58, 1, 39, 56, 42, 8, 1, 51, 1, 58, 43, 1, 45, 53, 56, 58, 46, 39, 1, 43, 56, 58, 46, 47, 43, 57, 46, 43, 1, 35, 39, 56, 53, 56, 1, 56, 47, 57, 57, 58, 46, 63, 1, 40, 47, 57, 1, 50, 50, 50, 1, 61, 50, 53, 1, 57, 51, 39, 58, 53, 44, 43, 1, 57, 58, 43, 1, 54, 53, 59, 56, 1, 63, 53, 57, 43, 1, 54, 54, 56, 58, 6, 1, 53, 57, 58, 46, 43, 57, 58, 46, 43, 10, 0, 35, 13, 26, 32, 39, 50, 42, 1, 21, 44, 53, 59, 1, 52, 1, 46, 53, 61, 39, 41, 43, 57, 6, 1, 58, 1, 57, 43, 52, 42

## The mathematical trick in self-attention

In [32]:
# toy example illustrating how matrix multiplication can be used for a "weighted aggregation"
torch.manual_seed(42)
a = torch.tril(torch.ones(3, 3))
a = a / torch.sum(a, 1, keepdim=True)
b = torch.randint(0,10,(3,2)).float()
c = a @ b
print('a=')
print(a)
print('--')
print('b=')
print(b)
print('--')
print('c=')
print(c)

a=
tensor([[1.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000],
        [0.3333, 0.3333, 0.3333]])
--
b=
tensor([[2., 7.],
        [6., 4.],
        [6., 5.]])
--
c=
tensor([[2.0000, 7.0000],
        [4.0000, 5.5000],
        [4.6667, 5.3333]])


In [33]:
# consider the following toy example:

torch.manual_seed(1337)
B,T,C = 4,8,2 # batch, time, channels
x = torch.randn(B,T,C)
x.shape

torch.Size([4, 8, 2])

In [34]:
# We want xbow[b,t] = mean_{i<=t} x[b,i]
xbow = torch.zeros((B,T,C))
for b in range(B):
    for t in range(T):
        xprev = x[b,:t+1] # (t+1,C)
        xbow[b,t] = torch.mean(xprev, 0)

In [35]:
# version 2: using matrix multiply for a weighted aggregation
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(1, keepdim=True)
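xbow2 = wei @ x # (T, T) @ (B, T, C) ----> (B, T, C), with wei broadcast across the batch
# Note: this can report False under the default tolerances because of float32 rounding,
# even though both versions compute the same averages
torch.allclose(xbow, xbow2)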

False

In [36]:
# version 3: use Softmax
tril = torch.tril(torch.ones(T, T))
wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)
xbow3 = wei @ x
torch.allclose(xbow, xbow3)

False
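
The `False` results above most likely reflect float32 rounding checked against the strict default tolerances of `torch.allclose` (`rtol=1e-05`, `atol=1e-08`), not a difference in what the three versions compute. The following sketch recomputes all three on fresh toy data and compares them with a looser, assumed tolerance of `atol=1e-5`. All names are suffixed or renamed so they do not overwrite `B`, `T`, `C`, `x`, or `wei` from the notebook.

```python
import torch
import torch.nn.functional as F

gen = torch.Generator().manual_seed(1337)  # local generator, leaves global RNG untouched
Bt, Tt, Ct = 4, 8, 2                        # toy batch, time, channels
xt = torch.randn(Bt, Tt, Ct, generator=gen)

# Version 1: explicit loop over batch and time.
avg_loop = torch.zeros(Bt, Tt, Ct)
for bi in range(Bt):
    for ti in range(Tt):
        avg_loop[bi, ti] = xt[bi, : ti + 1].mean(dim=0)

# Version 2: row-normalized lower-triangular matrix, broadcast over the batch.
w_avg = torch.tril(torch.ones(Tt, Tt))
w_avg = w_avg / w_avg.sum(dim=1, keepdim=True)
avg_matmul = w_avg @ xt

# Version 3: softmax over a causal mask of zeros and -inf.
future = torch.tril(torch.ones(Tt, Tt)) == 0
w_soft = F.softmax(torch.zeros(Tt, Tt).masked_fill(future, float("-inf")), dim=-1)
avg_softmax = w_soft @ xt

# Largest absolute disagreement, then comparisons with the looser tolerance.
print((avg_loop - avg_matmul).abs().max().item())
print(torch.allclose(avg_loop, avg_matmul, atol=1e-5))
print(torch.allclose(avg_loop, avg_softmax, atol=1e-5))
```

If the largest disagreement printed is on the order of 1e-7 or smaller, the mismatch is purely numerical and the three formulations can be treated as equivalent.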

In [37]:
# version 4: self-attention!
torch.manual_seed(1337)
B,T,C = 4,8,32 # batch, time, channels
x = torch.randn(B,T,C)

# let's see a single Head perform self-attention
head_size = 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)
k = key(x)   # (B, T, 16)
q = query(x) # (B, T, 16)
wei =  q @ k.transpose(-2, -1) # (B, T, 16) @ (B, 16, T) ---> (B, T, T)

tril = torch.tril(torch.ones(T, T))
#wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)

v = value(x)
out = wei @ v
#out = wei @ x

out.shape

torch.Size([4, 8, 16])

In [38]:
wei[0]

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.1574, 0.8426, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2088, 0.1646, 0.6266, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5792, 0.1187, 0.1889, 0.1131, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0294, 0.1052, 0.0469, 0.0276, 0.7909, 0.0000, 0.0000, 0.0000],
        [0.0176, 0.2689, 0.0215, 0.0089, 0.6812, 0.0019, 0.0000, 0.0000],
        [0.1691, 0.4066, 0.0438, 0.0416, 0.1048, 0.2012, 0.0329, 0.0000],
        [0.0210, 0.0843, 0.0555, 0.2297, 0.0573, 0.0709, 0.2423, 0.2391]],
       grad_fn=<SelectBackward0>)

Notes:
- Attention is a **communication mechanism**. It can be seen as nodes in a directed graph looking at each other and aggregating information with a weighted sum from all nodes that point to them, with data-dependent weights.
- There is no notion of space. Attention simply acts over a set of vectors. This is why we need to positionally encode tokens.
- Each example across the batch dimension is of course processed completely independently; examples never "talk" to each other.
- In an "encoder" attention block just delete the single line that does masking with `tril`, allowing all tokens to communicate. This block here is called a "decoder" attention block because it has triangular masking, and is usually used in autoregressive settings, like language modeling.
- "self-attention" just means that the keys and values are produced from the same source as queries. In "cross-attention", the queries still get produced from x, but the keys and values come from some other, external source (e.g. an encoder module)
- "Scaled" attention additionally multiplies `wei` by 1/sqrt(head_size), that is, divides it by sqrt(head_size). This way, when the inputs Q and K have unit variance, `wei` has unit variance too, so Softmax stays diffuse and does not saturate too much. Illustration below:
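
The notes above can be gathered into one module. This is a minimal sketch of a single decoder-style head with scaling and causal masking, not the notebook's own implementation: the class name `CausalHead`, its constructor arguments, and the choice to return the attention weights alongside the output are all assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalHead(nn.Module):
    """One head of scaled, causally masked self-attention (illustrative sketch)."""

    def __init__(self, n_embd, head_size, block_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # A buffer is saved with the module but is not a trainable parameter.
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)    # (B, T, head_size)
        q = self.query(x)  # (B, T, head_size)
        # Scaled affinities: multiply by 1/sqrt(head_size).
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5  # (B, T, T)
        # Decoder-style masking: position t may only attend to positions <= t.
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        wei = F.softmax(wei, dim=-1)
        return wei @ self.value(x), wei  # (B, T, head_size), (B, T, T)


head = CausalHead(n_embd=32, head_size=16, block_size=8)
out_demo, wei_demo = head(torch.randn(4, 8, 32))
print(out_demo.shape, wei_demo.shape)
```

Removing the `masked_fill` line would turn this into the "encoder" variant described in the notes, where every token can attend to every other token.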

In [39]:
k = torch.randn(B,T,head_size)
q = torch.randn(B,T,head_size)
wei = q @ k.transpose(-2, -1) * head_size**-0.5

In [40]:
k.var()

tensor(1.0449)

In [41]:
q.var()

tensor(1.0700)

In [42]:
wei.var()

tensor(1.0918)
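
The measured variances are consistent with a short calculation. Assume the entries of `q` and `k` are independent with zero mean and unit variance (as with `torch.randn`), and write $d$ for `head_size`:

$$
\operatorname{Var}\Big(\sum_{i=1}^{d} q_i k_i\Big)
= \sum_{i=1}^{d} \operatorname{Var}(q_i k_i)
= \sum_{i=1}^{d} \mathbb{E}[q_i^2]\,\mathbb{E}[k_i^2]
= d,
\qquad
\operatorname{Var}\Big(\frac{1}{\sqrt{d}} \sum_{i=1}^{d} q_i k_i\Big)
= \frac{d}{d} = 1 .
$$

With $d = 16$, the unscaled dot products would have a variance near 16, whereas the scaled `wei` above measures 1.0918.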

In [43]:
torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5]), dim=-1)

tensor([0.1925, 0.1426, 0.2351, 0.1426, 0.2872])

In [44]:
torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5])*8, dim=-1)
# gets too peaky, converges to one-hot

tensor([0.0326, 0.0030, 0.1615, 0.0030, 0.8000])
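
The same effect can be tracked as a function of the scale factor. The sketch below reuses the five scores from the two cells above and reports, for a few assumed scale values, the largest probability and the entropy of the softmax output. The helpers `softmax`, `entropy`, and `softmax_at_scale` are illustrative, plain-Python stand-ins for `torch.softmax`.

```python
import math

scores = [0.1, -0.2, 0.3, -0.2, 0.5]  # same values as in the cells above


def softmax(values):
    peak = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; lower means a more peaked distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def softmax_at_scale(scale):
    return softmax([s * scale for s in scores])


for scale in (1, 8, 64):
    probs = softmax_at_scale(scale)
    print(scale, round(max(probs), 4), round(entropy(probs), 4))
```

As the scale grows, the largest probability approaches 1 and the entropy falls toward 0. This is the saturation that dividing by sqrt(head_size) is meant to avoid.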

In [45]:
# Define a custom LayerNorm1d class (replacing BatchNorm1d)
class LayerNorm1d:

  # Constructor method for initialization
  # (momentum is accepted but never used; it appears to be a leftover from BatchNorm1d)
  def __init__(self, dim, eps=1e-5, momentum=0.1):
    # Small constant to prevent division by zero when normalizing
    self.eps = eps

    # Learnable scale parameter (gamma), initialized to ones
    self.gamma = torch.ones(dim)

    # Learnable shift parameter (beta), initialized to zeros
    self.beta = torch.zeros(dim)

  # Forward pass method to normalize the input
  def __call__(self, x):
    # Calculate the mean of each example in the batch (along the feature dimension)
    xmean = x.mean(1, keepdim=True)  # Shape: (batch_size, 1)

    # Calculate the variance of each example in the batch (along the feature dimension)
    xvar = x.var(1, keepdim=True)  # Shape: (batch_size, 1)

    # Normalize the input by subtracting the mean and dividing by the standard deviation (unit variance)
    xhat = (x - xmean) / torch.sqrt(xvar + self.eps)

    # Scale by gamma and shift by beta (element-wise operation)
    self.out = self.gamma * xhat + self.beta

    # Return the normalized and scaled output
    return self.out

  # Method to return the learnable parameters (gamma and beta)
  def parameters(self):
    return [self.gamma, self.beta]

# Set a manual seed for reproducibility
torch.manual_seed(1337)

# Instantiate the LayerNorm1d module with a feature dimension of 100
module = LayerNorm1d(100)

# Generate random input data: a batch of 32 samples, each of dimension 100
x = torch.randn(32, 100)

# Pass the input data through the LayerNorm1d module (forward pass)
x = module(x)

# Print the shape of the output tensor
x.shape  # Should output: torch.Size([32, 100])


torch.Size([32, 100])

In [46]:
x[:,0].mean(), x[:,0].std() # mean,std of one feature across all batch inputs

(tensor(0.1469), tensor(0.8803))

In [47]:
x[0,:].mean(), x[0,:].std() # mean,std of a single input from the batch, of its features

(tensor(-9.5367e-09), tensor(1.0000))
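
One detail worth checking against PyTorch's built-in layer normalization is the variance estimator. `Tensor.var` defaults to the unbiased estimator (dividing by N-1), while `torch.nn.LayerNorm` and `F.layer_norm` are documented to use the biased one (dividing by N). The sketch below compares both choices on fresh random data; `manual_layer_norm`, `ln_input`, and `builtin` are hypothetical names, and gamma and beta are left out since they start as ones and zeros.

```python
import torch
import torch.nn.functional as F

gen = torch.Generator().manual_seed(1337)  # local generator, leaves global RNG untouched
ln_input = torch.randn(32, 100, generator=gen)
eps = 1e-5


def manual_layer_norm(inp, unbiased):
    """Normalize each row to zero mean and unit variance (no scale or shift)."""
    mean = inp.mean(dim=1, keepdim=True)
    var = inp.var(dim=1, keepdim=True, unbiased=unbiased)
    return (inp - mean) / torch.sqrt(var + eps)


builtin = F.layer_norm(ln_input, (100,), eps=eps)  # no weight or bias applied

# The biased estimator should match the built-in within rounding error.
print(torch.allclose(manual_layer_norm(ln_input, unbiased=False), builtin, atol=1e-5))
# The unbiased estimator (as in LayerNorm1d above) differs by a small, visible amount.
print((manual_layer_norm(ln_input, unbiased=True) - builtin).abs().max().item())
```

With 100 features per row the two estimators differ by a factor of about sqrt(99/100) in the standard deviation, so the custom class and the built-in agree closely but not exactly.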