In [2]:
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2024-12-18 23:32:46--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.108.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt’


2024-12-18 23:32:46 (2.61 MB/s) - ‘input.txt’ saved [1115394/1115394]



In [1]:
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()
print(text[:100])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You


# Step 1: Tokenize your text

Tokenize essentially means convert every single "token"/piece of a word into a __single number__. Here, for simplicity, we will be tokenizing each __character__.

GPT-2 uses the byte-pair encoding algorithm. Here we're just going to do a standard character mapping.

In [2]:
unique_chars = sorted(list(set(text)))

print("List of unique chars: ", unique_chars)
print("Number of unique chars: ", len(unique_chars))

List of unique chars:  ['\n', ' ', '!', '$', '&', "'", ',', '-', '.', '3', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
Number of unique chars:  65


In [3]:
stoi = { c:i for i, c in enumerate(unique_chars) }
itos = { i:c for i, c in enumerate(unique_chars) }

encode = lambda some_str: [stoi[char] for char in some_str]
decode = lambda some_str: [itos[char] for char in some_str]


example_string = "This is a string"
encode(example_string)

[32, 46, 47, 57, 1, 47, 57, 1, 39, 1, 57, 58, 56, 47, 52, 45]

# Step 1.5: Character -> Token Int -> Token Embedding Vector

Step 1: making an embedding lookup table! Each row corresponds to a unique token. The number of that row is equal to the "number" of that token (see box above this one for an example: 'T' has token number 32, and thus the vector at row 32 IS 'T''s embedding vector).

In [4]:
import torch

data = torch.tensor(encode(text))        # turn big list of characters -> big list of token ints

train_size = int(0.9 * len(data))        # train/test split: both are just long lists!
train_data = data[:train_size]
val_data = data[train_size:]

## Sidenote: chunking the training data

Basically we only ever take in CHUNK_SIZE sequence of chars (taking in all of them at once would be way too hard). CHUNK_SIZE is just the __max length sequence__ we can ever predict on.

Here you can see how every CHUNK_SIZE sequence is actually a bunch of training examples!

In [5]:
# Every single chunk is actually a BUNCH of training examples.

chunk_size = 8

for char_idx in range(chunk_size):
    training_example = train_data[:char_idx]
    associated_label = train_data[char_idx]
    print(f"When the training example is {training_example}, the label is {associated_label}.")

When the training example is tensor([], dtype=torch.int64), the label is 18.
When the training example is tensor([18]), the label is 47.
When the training example is tensor([18, 47]), the label is 56.
When the training example is tensor([18, 47, 56]), the label is 57.
When the training example is tensor([18, 47, 56, 57]), the label is 58.
When the training example is tensor([18, 47, 56, 57, 58]), the label is 1.
When the training example is tensor([18, 47, 56, 57, 58,  1]), the label is 15.
When the training example is tensor([18, 47, 56, 57, 58,  1, 15]), the label is 47.


# Step 1.6: Batching the data

In [6]:
batch_size = 4     # Batch Size = the NUMBER of sequences we forward pass, backward pass, and step with every epoch.
block_size = 8     # Block_Size = maximum context length for a prediction.

def get_batch(split):
    data = train_data if split == 'train' else val_data
    
    # for this particular batch, we want to get 4 sequences each of sequence length 32
    random_starting_idx_of_batch = torch.randint(len(data) - block_size, (batch_size,))     # get 4 indexes into 'data': the indexes can only be from 0 to len(data) - block_size
                                                                                            # this will be a 1D tensor of size (batch_size,), i.e. tensor([953063, 497175, 633405, 627354])
    print(f"From {split}_data, pick {random_starting_idx_of_batch} as starting indices for training sequences.")
    
    # now that we have some starting indices, pick out the 32-length sequence from each of them
    training_sequences = []
    
    for i in range(batch_size):
        starting_index = random_starting_idx_of_batch[i]
        training_sequences.append(data[starting_index : starting_index + block_size])
        
    training_sequences_tensor = torch.stack(training_sequences)
    print("\nThese are the block_size-length sequences starting from each of those indicies:")
    print(training_sequences_tensor.shape)
    print(training_sequences_tensor)
    
    
    # now we'll get a tensor, but with all the relevant labels. Remember we're using the trick above to get more examples.
    labels = []
    
    for i in range(batch_size):
        starting_index = random_starting_idx_of_batch[i]
        labels.append(data[starting_index + 1 : starting_index + block_size + 1])
        
    labels_tensor = torch.stack(labels)
    print("\nThese are the correct labels associated with each of those example tensors:")
    print(labels_tensor.shape)
    print(labels_tensor)
    
    return training_sequences_tensor, labels_tensor
    
    
get_batch('train')
    

From train_data, pick tensor([629112, 473269, 377314,  61238]) as starting indices for training sequences.

These are the block_size-length sequences starting from each of those indicies:
torch.Size([4, 8])
tensor([[ 1, 58, 46, 43,  1, 49, 47, 52],
        [ 8,  0, 21, 57,  1, 47, 58,  1],
        [ 1, 14, 53, 50, 47, 52, 45, 40],
        [56,  1, 39, 54, 54, 56, 53, 40]])

These are the correct labels associated with each of those example tensors:
torch.Size([4, 8])
tensor([[58, 46, 43,  1, 49, 47, 52, 45],
        [ 0, 21, 57,  1, 47, 58,  1, 43],
        [14, 53, 50, 47, 52, 45, 40, 56],
        [ 1, 39, 54, 54, 56, 53, 40, 39]])


(tensor([[ 1, 58, 46, 43,  1, 49, 47, 52],
         [ 8,  0, 21, 57,  1, 47, 58,  1],
         [ 1, 14, 53, 50, 47, 52, 45, 40],
         [56,  1, 39, 54, 54, 56, 53, 40]]),
 tensor([[58, 46, 43,  1, 49, 47, 52, 45],
         [ 0, 21, 57,  1, 47, 58,  1, 43],
         [14, 53, 50, 47, 52, 45, 40, 56],
         [ 1, 39, 54, 54, 56, 53, 40, 39]]))

# Step 2: Forward Pass

Fantastic. Now we've got the training input data and labels in a really nice, batched format. Now we'll make predictions.

In [7]:
import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)

class BigramLanguageModel(nn.Module):
    
    # model has internal embedding vector lookup table based on token
    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)
        
    
    # forward pass: foreach INT_TOKEN in input_batched, turn that into the correct embedding vector,
    # and then replace that INT_TOKEN with the relevant embedding vector.
    def forward(self, input_batched, target_batched):
        
        logits = self.token_embedding_table(input_batched) 
        print(f"\nBatched input is originally {input_batched.shape}.")
        print(f"After doing embedding lookup, it is {logits.shape}, since we replace each int token with a {len(unique_chars)}-sized vector.")
        
        return logits
        

        
xb, yb = get_batch('train')

m = BigramLanguageModel(vocab_size = len(unique_chars))
out = m.forward(xb, yb)


From train_data, pick tensor([ 76049, 234249, 934904, 560986]) as starting indices for training sequences.

These are the block_size-length sequences starting from each of those indicies:
torch.Size([4, 8])
tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])

These are the correct labels associated with each of those example tensors:
torch.Size([4, 8])
tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])

Batched input is originally torch.Size([4, 8]).
After doing embedding lookup, it is torch.Size([4, 8, 65]), since we replace each int token with a 65-sized vector.


# Step 2.5: Formulate with Cross-Entropy Loss

In [41]:
import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)




def get_batch(split):
    data = train_data if split == 'train' else val_data
    
    # for this particular batch, we want to get 4 sequences each of sequence length 32
    random_starting_idx_of_batch = torch.randint(len(data) - block_size, (batch_size,))     # get 4 indexes into 'data': the indexes can only be from 0 to len(data) - block_size
                                                                                            # this will be a 1D tensor of size (batch_size,), i.e. tensor([953063, 497175, 633405, 627354])
#     print(f"From {split}_data, pick {random_starting_idx_of_batch} as starting indices for training sequences.")
    
    # now that we have some starting indices, pick out the 32-length sequence from each of them
    training_sequences = []
    
    for i in range(batch_size):
        starting_index = random_starting_idx_of_batch[i]
        training_sequences.append(data[starting_index : starting_index + block_size])
        
    training_sequences_tensor = torch.stack(training_sequences)
#     print("\nThese are the block_size-length sequences starting from each of those indicies:")
#     print(training_sequences_tensor.shape)
#     print(training_sequences_tensor)
    
    
    # now we'll get a tensor, but with all the relevant labels. Remember we're using the trick above to get more examples.
    labels = []
    
    for i in range(batch_size):
        starting_index = random_starting_idx_of_batch[i]
        labels.append(data[starting_index + 1 : starting_index + block_size + 1])
        
    labels_tensor = torch.stack(labels)
#     print("\nThese are the correct labels associated with each of those example tensors:")
#     print(labels_tensor.shape)
#     print(labels_tensor)
    
    return training_sequences_tensor, labels_tensor





class BigramLanguageModel(nn.Module):
    
    # model has internal embedding vector lookup table based on token
    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)
        
    
    # forward pass: foreach INT_TOKEN in input_batched, turn that into the correct embedding vector,
    # and then replace that INT_TOKEN with the relevant embedding vector.
    def forward(self, input_batched, target_batched):
        
        logits = self.token_embedding_table(input_batched)           # replace all INT_TOKENS with vector embeddings
        
        
        if target_batched is None:
            loss = None                                              # If no target is provided,
        else:
            batch_size, sequence_length, embedding_dim = logits.shape    
            logits = logits.view(batch_size * sequence_length, embedding_dim)   # stack ALL embeddings in a SINGLE LIST

            targets = target_batched.view(batch_size * sequence_length)    # do same for target INT_TOKENS: there are 4 sequences per batch, with each sequence having 8 characters

            loss = F.cross_entropy(logits, targets)        # provide a tensor of embeddings, and a tensor of targets, and it will find the loss

        ### --------------------------------------
        
#         print("\nUnstackify the logits tensor:")
#         print(logits.shape)
#         print(logits)
        
#         if target_batched is not None:
#             print("\nUnstackify the targets tensor:")
#             print(targets.shape)
#             print(targets)
#             print(f"\nThe logit and associated target can now be matched 1-to-1. The cross-entropy loss is {loss:.4f}.")
        ### --------------------------------------
        
        return logits, loss





    def generate(self, starting_sequence_tensor, max_new_tokens):
        
        
        print("\n-------- GENERATING SEQUENCE --------")
        
        # repeat for the number of new tokens you want...
        for _ in range(max_new_tokens):
            logits, loss = self.forward(starting_sequence_tensor, None)        # forward pass our input sequence: basically just convering INT_TOKENS to associated embedding vector
#             print(logits.shape)                                                # batch_size  x  single_batch_sequence_length  x  embedding_dimension
            logits = logits[:, -1, :]                                    # from our logits tensor (list of embedding vectors), we only care about the very LAST one in the timestep

            probs = F.softmax(logits, dim=-1)     # convert logits (of vocab-size) to probability distribution             
            generated_int_token = torch.multinomial(probs, num_samples=1)        # sample from that probability distribution
            starting_sequence_tensor = torch.cat((starting_sequence_tensor, generated_int_token), dim=1)        # append that new token
#             print(f"starting_sequence_tensor shape: {starting_sequence_tensor.shape}")
            
        return starting_sequence_tensor
        
xb, yb = get_batch('train')

m = BigramLanguageModel(vocab_size = len(unique_chars))
# out, loss = m.forward(xb, yb)
generated_token_sequence = m.generate(torch.zeros((1,1), dtype=torch.long), max_new_tokens = 100)

print(''.join(decode(generated_token_sequence[0].tolist())))



-------- GENERATING SEQUENCE --------

sV
vL
ja,FsLY,wxEuS'pao3jOssyBA$zFqYTkeMk x-gQ.FzLg!iKI.egzDnyA TsTbvdgX!KpGIeJyjv,SrFF&SDt!:hwWSl.W


# Step 3: Training

Great, now we can do a forward pass that does input_int_token_sequence -> embedding_vector_sequence/logits. We can also convert those logits to a probability distribution via a softmax, and then sample from that distribution to get a new token, which we can append back onto the original input_int_token_sequence and continue onwards.

Now let's train our model parameters (which is just m.parameters() - remember that m is a nn.Module).

In [50]:
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)

batch_size = 32

for epoch in range(10000):
    input_batched, label_batched = get_batch('train')
    
    logits, loss = m.forward(input_batched, label_batched)
    optimizer.zero_grad(set_to_none=True)                     # RESET THE ACCUMULATED GRADIENTS WITH EVERY NEW BATCH!
    loss.backward()
    optimizer.step()
    
    if epoch % 2000 == 0:
        print(f'loss at epoch {epoch} is: {loss.item()}')

    
generated_token_sequence = m.generate(torch.zeros((1,1), dtype=torch.long), max_new_tokens = 100)
print(''.join(decode(generated_token_sequence[0].tolist())))


# pretty good! The only issue is that we're only using the last logit-embedding-vector to predict the next one.

loss at epoch 0 is: 2.5045759677886963
loss at epoch 2000 is: 2.4128713607788086
loss at epoch 4000 is: 2.4660370349884033
loss at epoch 6000 is: 2.4909756183624268
loss at epoch 8000 is: 2.3688089847564697

-------- GENERATING SEQUENCE --------

Yourete fay MI RIOPUCap t waug whassely sy e msbe shes, d th, h youre w ag mur ore irt
Ano and t wis


# SECTION II: Self-Attention

How do we get tokens to talk to each other?

- Don't want future tokens to communicate to past tokens. You're trying to PREDICT the future given the past! It's not like given the past AND the future, predict the intermediate.

- Naive Solution 1: if you're token 5, take tokens 1-4, average them up, and the resulting vector is a "context" summarizing token 5 in the CONTEXT of what came before.
    - Issue: Averaging/sum is extremely lossy: you've lost spatial/positional information.
    
    
Let's do this naive implementation:

In [54]:
torch.manual_seed(1337)

batch_size, seq_length, embedding_size = 4, 8, 2

x = torch.randn(batch_size, seq_length, embedding_size)
x.shape

torch.Size([4, 8, 2])

In [56]:
# Communication 1: want x[b, t] = mean of x[b, i] where i <= t
mean_embeddings = torch.zeros((batch_size, seq_length, embedding_size))

for batch in range(batch_size):
    for token_idx in range(seq_length):
        prev_embeddings = x[batch, : token_idx+1]
        mean_embeddings[batch, token_idx] = torch.mean(prev_embeddings, 0)

In [58]:
x[0]

tensor([[ 0.1808, -0.0700],
        [-0.3596, -0.9152],
        [ 0.6258,  0.0255],
        [ 0.9545,  0.0643],
        [ 0.3612,  1.1679],
        [-1.3499, -0.5102],
        [ 0.2360, -0.2398],
        [-0.9211,  1.5433]])

In [60]:
# notice: each subsequence timestep "incorporates" information from all the previous timesteps! We just take an avg!
mean_embeddings[0]

tensor([[ 0.1808, -0.0700],
        [-0.0894, -0.4926],
        [ 0.1490, -0.3199],
        [ 0.3504, -0.2238],
        [ 0.3525,  0.0545],
        [ 0.0688, -0.0396],
        [ 0.0927, -0.0682],
        [-0.0341,  0.1332]])

In [67]:
# MATHEMATICAL TRICK TO MAKING THE ABOVE CODE EFFICIENT VIA MATRIX MULTIPLICATIONS

torch.manual_seed(42)
a = torch.ones(3, 3)
a = torch.tril(a)    # convert a to lower-triangular


b = torch.randint(0, 10, (3, 2)).float()
c = a @ b

print("a = ")
print(a)
print("b = ")
print(b)
print("c = ")
print(c)

print("\n Notice how multiplying by a lower-triangular ones matrix allows you to incrementally add each vector of b!")

a = 
tensor([[1., 0., 0.],
        [1., 1., 0.],
        [1., 1., 1.]])
b = 
tensor([[2., 7.],
        [6., 4.],
        [6., 5.]])
c = 
tensor([[ 2.,  7.],
        [ 8., 11.],
        [14., 16.]])

 Notice how multiplying by a lower-triangular ones matrix allows you to incrementally add each vector of b!


In [70]:
# You can thus easily compute the average by making the row-elems of a to sum to 1!
# MATHEMATICAL TRICK TO MAKING THE ABOVE CODE EFFICIENT VIA MATRIX MULTIPLICATIONS

torch.manual_seed(42)
a = torch.ones(3, 3)
a = torch.tril(a)    # convert a to lower-triangular
a = a / torch.sum(a, 1, keepdim=True)


b = torch.randint(0, 10, (3, 2)).float()
c = a @ b

print("a = ")
print(a)
print("b = ")
print(b)
print("c = ")
print(c)

print("\nNotice how a's rows sum to 1! This lets you compute the incremental-acerage as described!\nThink through a @ b, and notice how you're incrementally averaging all the previous rows!")

a = 
tensor([[1.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000],
        [0.3333, 0.3333, 0.3333]])
b = 
tensor([[2., 7.],
        [6., 4.],
        [6., 5.]])
c = 
tensor([[2.0000, 7.0000],
        [4.0000, 5.5000],
        [4.6667, 5.3333]])

Notice how a's rows sum to 1! This lets you compute the incremental-acerage as described!
Think through a @ b, and notice how you're incrementally averaging all the previous rows!


In [82]:
# let's do it on our larger toy example
# batch_size, seq_length, embedding_size = 4, 8, 2
B, T, C = 4, 4, 2

input_embeddings = torch.randn(B, T, C)   # within a single batch, our goal is to get the "incremental averaging" of all the previous vectors


averaging_matrix = torch.ones(T, T)       # the averaging matrix is seq_length x seq_length
averaging_matrix = torch.tril(averaging_matrix)
averaging_matrix = averaging_matrix / torch.sum(averaging_matrix, 1, keepdim=True)


print("input_embeddings =")
print(input_embeddings[0])

print("averaging_matrix =")
print(averaging_matrix)

averaged_input_embeddings = averaging_matrix @ input_embeddings   # little strange, but this applies the matrix to each batch
                                                                  # (T, T) @ (B, T, C) -> (B, T, T) @ (B, T, C) -> (B, T, C)    :     averaging matrix is broadcast along batch dim

print("averaged_input_embeddings =")
averaged_input_embeddings[0]


# AVERAGING MATRIX IS A WEIGHT MATRIX! TOKEN GETS INFO ONLY FROM TOKENS PREVIOUS TO IT 


input_embeddings =
tensor([[ 1.2946,  0.2227],
        [-1.2924,  0.1689],
        [-0.8326, -0.8129],
        [ 0.9700, -0.6758]])
averaging_matrix =
tensor([[1.0000, 0.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000, 0.0000],
        [0.3333, 0.3333, 0.3333, 0.0000],
        [0.2500, 0.2500, 0.2500, 0.2500]])
averaged_input_embeddings =


tensor([[ 1.2946e+00,  2.2267e-01],
        [ 1.1061e-03,  1.9577e-01],
        [-2.7680e-01, -1.4047e-01],
        [ 3.4909e-02, -2.7429e-01]])

In [85]:
# Small trick: Note that for softmax, we actually set the upper triangle of averaging_matrix to '-inf': so that AFTER
# the softmax those entries get turned to 0.

tril = torch.tril(torch.ones((T, T)))

weights_matrix = torch.zeros((T, T))
weights_matrix = weights_matrix.masked_fill(tril == 0, float('-inf'))   # tokens from the future cna't communicate with the past
weights_matrix = F.softmax(weights_matrix, dim=-1)

weights_matrix

tensor([[1.0000, 0.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000, 0.0000],
        [0.3333, 0.3333, 0.3333, 0.0000],
        [0.2500, 0.2500, 0.2500, 0.2500]])

__KEY TAKEAWAY__: Given some list of embeddings, you can do _weighted aggregations_ of your *past* sequence embeddings via _matrix multiplication_ with a lower-triangular matrix, where the elements in the lower-triangular part tell you how much each element fuses into the current position.

In [8]:
# SELF-ATTENTION FULL CODE
torch.manual_seed(1337)

B, T, C = 4, 8, 32        # batch, time, channels/embedding size
x = torch.randn(B, T, C)


# this code does a rolling average of previous embeddings via a low-triangular matrix
tril = torch.tril(torch.ones(T, T))
weights = torch.zeros((T, T))
weights = weights.masked_fill(tril == 0, float('-inf'))
weights = F.softmax(weights, dim=-1)
out = weights @ x

print("OG weights:\n", weights)








### ABSOLUTELY CRITICAL:
# Different tokens find other tokens more or less interesting! Different past tokens should have different weights, currently OG weights just weights all past tokens equally!

# Soln: for each embedding in my sequence, emit a query (what I'm looking for) and key (what do I contain) vectors.
# The dot product btwn key and query IS my weight! If they're more aligned, I learn MORE from that specific token!

head_size = 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)

k = key(x)        # (B, T, 16)
q = query(x)      # (B, T, 16)
v = value(x)

weights = q @ k.transpose(-2, -1)    # (B, T, 16) @ (B, 16, T)    --->    (B, T, T)     :     This is our better weight matrix!
# weights = torch.zeros((T, T))      # no longer zeros...

weights = weights.masked_fill(tril == 0, float('-inf'))        # we make the future weights all -inf so they become 0 during the softmax
weights = F.softmax(weights, dim=-1)
out = weights @ v

print("\nAttention weights for a SINGLE batch: notice how it's not equally weighted anymore!\n", weights[0])

print("\nIn this specific example, looking at the last row, the 8th token sees the 7th and 4th token (0.2423 and 0.2297 respectively) as highly relevant w.r.t. their query and key vectors being aligned.")


print("\nShape of the output:")
print(out.shape)



OG weights:
 tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3333, 0.3333, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000, 0.0000, 0.0000, 0.0000],
        [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0000, 0.0000],
        [0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.0000],
        [0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250]])

Attention weights for a SINGLE batch: notice how it's not equally weighted anymore!
 tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.1574, 0.8426, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2088, 0.1646, 0.6266, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5792, 0.1187, 0.1889, 0.1131, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0

Notes:
- Attention is a **communication mechanism**. Can be seen as nodes in a directed graph looking at each other and aggregating information with a weighted sum from all nodes that point to them, with data-dependent weights.
    - Graph of nodes that get information from itself + all nodes that connect to it. Communication in a directed graph!

- There is no notion of space. Attention simply acts over a set of vectors. This is why we need to positionally encode tokens.
    - Vector gets a thing added to it that just denotes its specific position

- Each example across batch dimension is of course processed completely independently and never "talk" to each other
    - BATCHES ALL INDEPENDENT


- In an "encoder" attention block just delete the single line that does masking with `tril`, allowing all tokens to communicate. This block here is called a "decoder" attention block because it has triangular masking, and is usually used in autoregressive settings, like language modeling.
    - the masked_fill tril lines prevent future from affecting past: in a sentiment analysis app you WOULD want the future to affect the past: so delete those lines with the tril.
    - "decoder" in the sense that this triangular masking lets us do autoregressive language modeling
    - attention is just ARBITRARY COMMUNICATION ACROSS VECTORS/NODES


- "self-attention" just means that the keys and values are produced from the same source as queries. In "cross-attention", the queries still get produced from x, but the keys and values come from some other, external source (e.g. an encoder module)
    - "cross-attention" means the keys and values come from ELSEWHERE


- "Scaled" attention additional divides `wei` by 1/sqrt(head_size). This makes it so when input Q,K are unit variance, wei will be unit variance too and Softmax will stay diffuse and not saturate too much. Illustration below
    - basically SOFTMAX on LARGE NUMBERS will put a LOT more weight on the largest node: so we just divide by this term to prevent this bias

# SECTION III: Let's add this self-attention block to the language model!

In [20]:
import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)


def get_batch(split):
    data = train_data if split == 'train' else val_data
    
    random_starting_idx_of_batch = torch.randint(len(data) - block_size, (batch_size,))
    training_sequences = []
    
    for i in range(batch_size):
        starting_index = random_starting_idx_of_batch[i]
        training_sequences.append(data[starting_index : starting_index + block_size])
        
    training_sequences_tensor = torch.stack(training_sequences)

    labels = []
    
    for i in range(batch_size):
        starting_index = random_starting_idx_of_batch[i]
        labels.append(data[starting_index + 1 : starting_index + block_size + 1])
        
    labels_tensor = torch.stack(labels)
    
    return training_sequences_tensor, labels_tensor


embedding_dim = 32
max_context_length = 8
block_size = max_context_length


class Head(nn.Module):
    def __init__(self, head_size):
        super().__init__()
                
        # Linear layers for query, key, and value
        self.key = nn.Linear(embedding_dim, head_size, bias=False)
        self.query = nn.Linear(embedding_dim, head_size, bias=False)
        self.value = nn.Linear(embedding_dim, head_size, bias=False) 
        
        # Lower triangular matrix for masking
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
    
    def forward(self, input_sequence_batched):
        B, T, C = input_sequence_batched.shape   # B: batch size, T: sequence length, C: embedding dim
        
        # Compute keys, queries, and values
        keys_batched = self.key(input_sequence_batched)       # Shape: (B, T, head_size)
        queries_batched = self.query(input_sequence_batched)  # Shape: (B, T, head_size)
        values_batched = self.value(input_sequence_batched)   # Shape: (B, T, head_size)
        
        # Compute attention weights
        wei = queries_batched @ keys_batched.transpose(-2, -1) * (C ** -0.5)  # Scale by sqrt(d_k)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))          # Apply mask
        wei = F.softmax(wei, dim=-1)                                         # Normalize weights
        
        # Apply attention
        out = wei @ values_batched  # Shape: (B, T, head_size)
        return out

        

        

class BigramLanguageModel(nn.Module):
    
    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, embedding_dim)        
        self.position_embedding_table = nn.Embedding(max_context_length, embedding_dim)
        
        
        self.sa_head = Head(embedding_dim)                     # convert vector_embedding -> attentionified_vector_embedding
        self.lm_head = nn.Linear(embedding_dim, vocab_size)    # convert vector_embedding -> vocab_size (REAL logits)
        
    
    
    def forward(self, input_batched, target_batched):
        
        B, T = input_batched.shape
        
        sequence_token_embeddings = self.token_embedding_table(input_batched)
        positional_embedding = self.position_embedding_table(torch.arange(T))   # 0....T-1 for a (T, embedding_dim) matrix
        
        x = sequence_token_embeddings + positional_embedding                    # this is how you add positional infomration into the sequence
        
        x = self.sa_head(x)         # Apply attention block
        logits = self.lm_head(x)    # Decode back to vocab-sized logits (right before softmaxxing and sampling a word)
        
        
        if target_batched is None:
            loss = None
        else:
            batch_size, sequence_length, embedding_dim = logits.shape    
            logits = logits.view(batch_size * sequence_length, embedding_dim)

            targets = target_batched.view(batch_size * sequence_length) 
            loss = F.cross_entropy(logits, targets)
            
        
        return logits, loss





    def generate(self, starting_sequence_tensor, max_new_tokens):
        
        
        print("\n-------- GENERATING SEQUENCE --------")
        
        for _ in range(max_new_tokens):
            
            context = starting_sequence_tensor[:, -block_size:]   # crop "input context" to just the last block_size tokens... next token depends on last ~8 tokens (otherwise the input dimensions are off)
            
            logits, loss = self.forward(context, None)        # forward pass our input sequence: basically just convering INT_TOKENS to associated embedding vector
            logits = logits[:, -1, :]                                          # from our logits tensor (list of embedding vectors), we only care about the very LAST one in the timestep

            probs = F.softmax(logits, dim=-1)                                  # convert logits (of vocab-size) to probability distribution             
            generated_int_token = torch.multinomial(probs, num_samples=1)      # sample from that probability distribution
            starting_sequence_tensor = torch.cat((starting_sequence_tensor, generated_int_token), dim=1)        # append that new token
            
        return starting_sequence_tensor
        
        

xb, yb = get_batch('train')
m = BigramLanguageModel(vocab_size = len(unique_chars))
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)


for epoch in range(10000):
    input_batched, label_batched = get_batch('train')
    
    logits, loss = m.forward(input_batched, label_batched)
    optimizer.zero_grad(set_to_none=True)                     # RESET THE ACCUMULATED GRADIENTS WITH EVERY NEW BATCH!
    loss.backward()
    optimizer.step()
    
    if epoch % 2000 == 0:
        print(f'loss at epoch {epoch} is: {loss.item()}')

    
generated_token_sequence = m.generate(torch.zeros((1,1), dtype=torch.long), max_new_tokens = 500)
print(''.join(decode(generated_token_sequence[0].tolist())))


# looks better! attention clearly communicating SOMETHING clearly....


loss at epoch 0 is: 4.245684623718262
loss at epoch 2000 is: 2.904353618621826
loss at epoch 4000 is: 2.6421000957489014
loss at epoch 6000 is: 2.5174005031585693
loss at epoch 8000 is: 2.295194387435913

-------- GENERATING SEQUENCE --------

AnggG omme owine qme se y'se dk blot CGo I gje:
ag spalpoudrd sis tl bessod ts fo des?
UNave ang. Lal cin IS;
do tn bilbiat yout int, Wiod throur,
Forsotha iut,
in piserle ss we, be Lis od ot thinudlint nt lorth siconot, wharoulls orde to me
Bwres nto trim LI ay, tis akine be fe dn.
AGo owre uhet I ENEREOR:
Hhot the tcoma! his IZDUI Vorwn te wo tl an'shas dramecaswilleh the yof sot pre fad et the the poretro?

SAn mend, thy's he tl!

sof sth fo on
And ond baethises ls m hilsho tlasadd snwifoun n


## Section 3.5: Multiheaded Attention

Adding multiple heads is very simple. We'll also add a feedforward layer at the very end.

In [25]:
import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)


def get_batch(split):
    data = train_data if split == 'train' else val_data
    
    random_starting_idx_of_batch = torch.randint(len(data) - block_size, (batch_size,))
    training_sequences = []
    
    for i in range(batch_size):
        starting_index = random_starting_idx_of_batch[i]
        training_sequences.append(data[starting_index : starting_index + block_size])
        
    training_sequences_tensor = torch.stack(training_sequences)

    labels = []
    
    for i in range(batch_size):
        starting_index = random_starting_idx_of_batch[i]
        labels.append(data[starting_index + 1 : starting_index + block_size + 1])
        
    labels_tensor = torch.stack(labels)
    
    return training_sequences_tensor, labels_tensor


embedding_dim = 32
max_context_length = 8
block_size = max_context_length

"""SUBMODULE 1: ATTENTION HEAD/FUNCTION"""
class Head(nn.Module):
    def __init__(self, head_size):
        super().__init__()
                
        # Linear layers for query, key, and value
        self.key = nn.Linear(embedding_dim, head_size, bias=False)
        self.query = nn.Linear(embedding_dim, head_size, bias=False)
        self.value = nn.Linear(embedding_dim, head_size, bias=False) 
        
        # Lower triangular matrix for masking
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
    
    def forward(self, input_sequence_batched):
        B, T, C = input_sequence_batched.shape   # B: batch size, T: sequence length, C: embedding dim
        
        # Compute keys, queries, and values
        keys_batched = self.key(input_sequence_batched)       # Shape: (B, T, head_size)
        queries_batched = self.query(input_sequence_batched)  # Shape: (B, T, head_size)
        values_batched = self.value(input_sequence_batched)   # Shape: (B, T, head_size)
        
        # Compute attention weights
        wei = queries_batched @ keys_batched.transpose(-2, -1) * (C ** -0.5)  # Scale by sqrt(d_k)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))          # Apply mask
        wei = F.softmax(wei, dim=-1)                                         # Normalize weights
        
        # Apply attention
        out = wei @ values_batched  # Shape: (B, T, head_size)
        return out


"""SUBMODULE 2: MULTIPLE HEADS OF ATTENTION"""
class MultiHeadAttention(nn.Module):
    """ Run multiple heads of self-attention in parallel """
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])    # basically get a LIST of heads
    
    def forward(self, input_sequence):
        return torch.cat([h.forward(input_sequence) for h in self.heads], dim=-1)    # for every head, call forward() and concat the attention-ified results!
        
        

"""SUBMODULE 3: SIMPLE FEEDFORWARD MLP LAYER"""
class FeedForward(nn.Module):
    def __init__(self, embedding_size):
        super().__init__()
        self.mlp_layer = nn.Sequential(
            nn.Linear(embedding_size, embedding_size),
            nn.ReLU(),
        )
    
    def forward(self, input_sequence):
        return self.mlp_layer(input_sequence)
        
        
        

class BigramLanguageModel(nn.Module):
    
    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, embedding_dim)        
        self.position_embedding_table = nn.Embedding(max_context_length, embedding_dim)
        
        
#         self.sa_head = Head(embedding_dim)                     # convert vector_embedding -> attentionified_vector_embedding
        self.sa_head = MultiHeadAttention(4, embedding_dim // 4)   
        self.ffw = FeedForward(embedding_dim)
        self.lm_head = nn.Linear(embedding_dim, vocab_size)    # convert vector_embedding -> vocab_size (REAL logits)
        
    
    
    def forward(self, input_batched, target_batched):
        
        B, T = input_batched.shape
        
        sequence_token_embeddings = self.token_embedding_table(input_batched)
        positional_embedding = self.position_embedding_table(torch.arange(T))   # 0....T-1 for a (T, embedding_dim) matrix
        
        x = sequence_token_embeddings + positional_embedding                    # (B,T,C) - this is how you add positional infomration into the sequence
        
        x = self.sa_head(x)         # (B,T,C) - Apply attention block
        x = self.ffw(x)            # (B,T,C) - FFWD get applied to EVERY SINGLE VECTOR in a sequence for EVERY SINGLE BATCH!
        logits = self.lm_head(x)    # Decode back to vocab-sized logits (right before softmaxxing and sampling a word)
        
        
        if target_batched is None:
            loss = None
        else:
            batch_size, sequence_length, embedding_dim = logits.shape    
            logits = logits.view(batch_size * sequence_length, embedding_dim)

            targets = target_batched.view(batch_size * sequence_length) 
            loss = F.cross_entropy(logits, targets)
            
        
        return logits, loss





    def generate(self, starting_sequence_tensor, max_new_tokens):
        
        
        print("\n-------- GENERATING SEQUENCE --------")
        
        for _ in range(max_new_tokens):
            
            context = starting_sequence_tensor[:, -block_size:]   # crop "input context" to just the last block_size tokens... next token depends on last ~8 tokens (otherwise the input dimensions are off)
            
            logits, loss = self.forward(context, None)        # forward pass our input sequence: basically just convering INT_TOKENS to associated embedding vector
            logits = logits[:, -1, :]                                          # from our logits tensor (list of embedding vectors), we only care about the very LAST one in the timestep

            probs = F.softmax(logits, dim=-1)                                  # convert logits (of vocab-size) to probability distribution             
            generated_int_token = torch.multinomial(probs, num_samples=1)      # sample from that probability distribution
            starting_sequence_tensor = torch.cat((starting_sequence_tensor, generated_int_token), dim=1)        # append that new token
            
        return starting_sequence_tensor
        
        

xb, yb = get_batch('train')
m = BigramLanguageModel(vocab_size = len(unique_chars))
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)


for epoch in range(10000):
    input_batched, label_batched = get_batch('train')
    
    logits, loss = m.forward(input_batched, label_batched)
    optimizer.zero_grad(set_to_none=True)                     # RESET THE ACCUMULATED GRADIENTS WITH EVERY NEW BATCH!
    loss.backward()
    optimizer.step()
    
    if epoch % 2000 == 0:
        print(f'loss at epoch {epoch} is: {loss.item()}')

    
generated_token_sequence = m.generate(torch.zeros((1,1), dtype=torch.long), max_new_tokens = 500)
print(''.join(decode(generated_token_sequence[0].tolist())))


# looks better! attention clearly communicating SOMETHING clearly....


loss at epoch 0 is: 4.12933349609375
loss at epoch 2000 is: 2.2637226581573486
loss at epoch 4000 is: 2.456537961959839
loss at epoch 6000 is: 2.572815179824829
loss at epoch 8000 is: 2.2864067554473877

-------- GENERATING SEQUENCE --------

NUCAIORD:
Fet
Aplodce ithe hathe one had,
Thh
Ford my lelly of thilll
Flle ate, s then wess writh he lot, woo nou benat con len:
truse ef thereat we alraid thin soull, wath,
Adonce sar soow yoor the on ath wingt of tthe tor; thick rist; vill Cave sad;
And: wing?

Ande ofout pouts;
He worave wile,
At; he nousse to froorous stid atin, on!
Sit be, the flois bau de exher wick.:
Do,
Iss thillid Vard vearte this, he fou.

Thaceay miom gea you thie, fored, ove
The derin sess perbieve whiche ly forme to


## Section 3.6: RESIDUAL SKIP CONNECTIONS

Our loss is increasing! Now we need to add more of these (multi-head attention, feedforward) BLOCKs!

However, now we run into DEEP neural nets with a lot of blocks, i.e. vanishing gradient problem or something.

Solution: __SKIP CONNECTIONS!__ 

x_{t} ------------>  x_{t+1}   +   (x_{t} passed through the block)


Gradients can both flow unimpeded to the original input, AS WELL AS the block itself!

In [29]:
import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)


def get_batch(split):
    data = train_data if split == 'train' else val_data
    
    random_starting_idx_of_batch = torch.randint(len(data) - block_size, (batch_size,))
    training_sequences = []
    
    for i in range(batch_size):
        starting_index = random_starting_idx_of_batch[i]
        training_sequences.append(data[starting_index : starting_index + block_size])
        
    training_sequences_tensor = torch.stack(training_sequences)

    labels = []
    
    for i in range(batch_size):
        starting_index = random_starting_idx_of_batch[i]
        labels.append(data[starting_index + 1 : starting_index + block_size + 1])
        
    labels_tensor = torch.stack(labels)
    
    return training_sequences_tensor, labels_tensor


embedding_dim = 32
max_context_length = 8
block_size = max_context_length

"""SUBMODULE 1: ATTENTION HEAD/FUNCTION"""
class Head(nn.Module):
    def __init__(self, head_size):
        super().__init__()
                
        # Linear layers for query, key, and value
        self.key = nn.Linear(embedding_dim, head_size, bias=False)
        self.query = nn.Linear(embedding_dim, head_size, bias=False)
        self.value = nn.Linear(embedding_dim, head_size, bias=False) 
        
        # Lower triangular matrix for masking
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
    
    def forward(self, input_sequence_batched):
        B, T, C = input_sequence_batched.shape   # B: batch size, T: sequence length, C: embedding dim
        
        # Compute keys, queries, and values
        keys_batched = self.key(input_sequence_batched)       # Shape: (B, T, head_size)
        queries_batched = self.query(input_sequence_batched)  # Shape: (B, T, head_size)
        values_batched = self.value(input_sequence_batched)   # Shape: (B, T, head_size)
        
        # Compute attention weights
        wei = queries_batched @ keys_batched.transpose(-2, -1) * (C ** -0.5)  # Scale by sqrt(d_k)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))          # Apply mask
        wei = F.softmax(wei, dim=-1)                                         # Normalize weights
        
        # Apply attention
        out = wei @ values_batched  # Shape: (B, T, head_size)
        return out


"""SUBMODULE 2: MULTIPLE HEADS OF ATTENTION"""
class MultiHeadAttention(nn.Module):
    """ Run multiple heads of self-attention in parallel """
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])    # basically get a LIST of heads
        
        self.proj = nn.Linear(embedding_size, embedding_size)
        
    def forward(self, input_sequence):
        out = torch.cat([h.forward(input_sequence) for h in self.heads], dim=-1)    # for every head, call forward() and concat the attention-ified results!
        out = self.proj(out)
        
        return out 
        

"""SUBMODULE 3: SIMPLE FEEDFORWARD MLP LAYER"""
class FeedForward(nn.Module):
    def __init__(self, embedding_size):
        super().__init__()
        self.mlp_layer = nn.Sequential(
            nn.Linear(embedding_size, 4 * embedding_size),
            nn.ReLU(),
            nn.Linear(4 * embedding_size, embedding_size),  # "projection layer" back into residual pathway(?????)
            
            # (need that "projection" layer back since FFW is 4 * embedding_size)
        )
    
    def forward(self, input_sequence):
        return self.mlp_layer(input_sequence)
        
        
class Block(nn.Module):
    def __init__(self, embedding_size, num_heads):
        super().__init__()
        head_size = embedding_size // num_heads
        self.sa = MultiHeadAttention(num_heads, head_size)
        self.ffw = FeedForward(embedding_size)
    
    def forward(self, input_sequence):
#         input_sequence = self.sa(input_sequence)        # PAIR THESE SA AND FEEDFORWARD LAYERS INTO A SINGLE "BLOCK" FUNCTION!
#         input_sequence = self.ffw(input_sequence)
        
        
        # "RESIDUAL"/SKIP CONNECTIONS!
        
        input_sequence = input_sequence + self.sa(input_sequence)        # PAIR THESE SA AND FEEDFORWARD LAYERS INTO A SINGLE "BLOCK" FUNCTION!
        input_sequence = input_sequence + self.ffw(input_sequence)
        
        
        return input_sequence
        

class BigramLanguageModel(nn.Module):
    
    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, embedding_dim)        
        self.position_embedding_table = nn.Embedding(max_context_length, embedding_dim)
        
        self.blocks = nn.Sequential(
            Block(embedding_size, num_heads = 4),
            Block(embedding_size, num_heads = 4),
            Block(embedding_size, num_heads = 4),
        )
        self.lm_head = nn.Linear(embedding_size, vocab_size)
        
    
    
    def forward(self, input_batched, target_batched):
        
        B, T = input_batched.shape
        
        sequence_token_embeddings = self.token_embedding_table(input_batched)
        positional_embedding = self.position_embedding_table(torch.arange(T))   # 0....T-1 for a (T, embedding_dim) matrix
        
        x = sequence_token_embeddings + positional_embedding                    # (B,T,C) - this is how you add positional infomration into the sequence
        
        x = self.blocks(x)          # run input vectors through that collection of blocks...
        logits = self.lm_head(x)    # Decode back to vocab-sized logits (right before softmaxxing and sampling a word)
        
        
        if target_batched is None:
            loss = None
        else:
            batch_size, sequence_length, embedding_dim = logits.shape    
            logits = logits.view(batch_size * sequence_length, embedding_dim)

            targets = target_batched.view(batch_size * sequence_length) 
            loss = F.cross_entropy(logits, targets)
            
        
        return logits, loss





    def generate(self, starting_sequence_tensor, max_new_tokens):
        
        
        print("\n-------- GENERATING SEQUENCE --------")
        
        for _ in range(max_new_tokens):
            
            context = starting_sequence_tensor[:, -block_size:]   # crop "input context" to just the last block_size tokens... next token depends on last ~8 tokens (otherwise the input dimensions are off)
            
            logits, loss = self.forward(context, None)        # forward pass our input sequence: basically just convering INT_TOKENS to associated embedding vector
            logits = logits[:, -1, :]                                          # from our logits tensor (list of embedding vectors), we only care about the very LAST one in the timestep

            probs = F.softmax(logits, dim=-1)                                  # convert logits (of vocab-size) to probability distribution             
            generated_int_token = torch.multinomial(probs, num_samples=1)      # sample from that probability distribution
            starting_sequence_tensor = torch.cat((starting_sequence_tensor, generated_int_token), dim=1)        # append that new token
            
        return starting_sequence_tensor
        
        

xb, yb = get_batch('train')
embedding_size = 32
m = BigramLanguageModel(vocab_size = len(unique_chars))
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)


for epoch in range(10000):
    input_batched, label_batched = get_batch('train')
    
    logits, loss = m.forward(input_batched, label_batched)
    optimizer.zero_grad(set_to_none=True)                     # RESET THE ACCUMULATED GRADIENTS WITH EVERY NEW BATCH!
    loss.backward()
    optimizer.step()
    
    if epoch % 2000 == 0:
        print(f'loss at epoch {epoch} is: {loss.item()}')

    
generated_token_sequence = m.generate(torch.zeros((1,1), dtype=torch.long), max_new_tokens = 500)
print(''.join(decode(generated_token_sequence[0].tolist())))


# looks better! attention clearly communicating SOMETHING clearly....


loss at epoch 0 is: 4.794428825378418
loss at epoch 2000 is: 2.2961947917938232
loss at epoch 4000 is: 1.5466394424438477
loss at epoch 6000 is: 1.9636958837509155
loss at epoch 8000 is: 2.4651312828063965

-------- GENERATING SEQUENCE --------


For EENIO:
Then:
And me; didurcie she
ceray opiin ghisce,
HinnET I thinine! tee me.

KING ANINTOLARARD
And the anin epeet-men hemly and with ine time the.

LURAMENT:
JoN in he gipn trou haciuldie;
As the you sbro seploon:
We fort ent ridegd,
In KING Eice theaple
come atak?

S: moree' Abt and alloy ling in me aremose lisse fressesece me
not I o, it golt; hom and, what You to an more a conte of a nothere tor, Go?

COYENTE:
This on of nuen prine, I that, I ald his by sers grefesss sinth's in.

APE


# Section 3.7: More "deep" network optimizations

Residual is the "add" part of "add & norm": norm refers to layer-normalization. 

Layer-normalization: basically make each row a gaussian with some mean or some shit? Read the paper!

In [32]:
import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)


def get_batch(split):
    data = train_data if split == 'train' else val_data
    
    random_starting_idx_of_batch = torch.randint(len(data) - block_size, (batch_size,))
    training_sequences = []
    
    for i in range(batch_size):
        starting_index = random_starting_idx_of_batch[i]
        training_sequences.append(data[starting_index : starting_index + block_size])
        
    training_sequences_tensor = torch.stack(training_sequences)

    labels = []
    
    for i in range(batch_size):
        starting_index = random_starting_idx_of_batch[i]
        labels.append(data[starting_index + 1 : starting_index + block_size + 1])
        
    labels_tensor = torch.stack(labels)
    
    return training_sequences_tensor, labels_tensor

"""SUBMODULE 1: ATTENTION HEAD/FUNCTION"""
class Head(nn.Module):
    def __init__(self, head_size):
        super().__init__()
                
        # Linear layers for query, key, and value
        self.key = nn.Linear(embedding_dim, head_size, bias=False)
        self.query = nn.Linear(embedding_dim, head_size, bias=False)
        self.value = nn.Linear(embedding_dim, head_size, bias=False) 
        
        # Lower triangular matrix for masking
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
    
    def forward(self, input_sequence_batched):
        B, T, C = input_sequence_batched.shape   # B: batch size, T: sequence length, C: embedding dim
        
        # Compute keys, queries, and values
        keys_batched = self.key(input_sequence_batched)       # Shape: (B, T, head_size)
        queries_batched = self.query(input_sequence_batched)  # Shape: (B, T, head_size)
        values_batched = self.value(input_sequence_batched)   # Shape: (B, T, head_size)
        
        # Compute attention weights
        wei = queries_batched @ keys_batched.transpose(-2, -1) * (C ** -0.5)  # Scale by sqrt(d_k)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))          # Apply mask
        wei = F.softmax(wei, dim=-1)                                         # Normalize weights
        
        # Apply attention
        out = wei @ values_batched  # Shape: (B, T, head_size)
        return out


"""SUBMODULE 2: MULTIPLE HEADS OF ATTENTION"""
class MultiHeadAttention(nn.Module):
    """ Run multiple heads of self-attention in parallel """
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])    # basically get a LIST of heads
        
        self.proj = nn.Linear(embedding_size, embedding_size)
        
    def forward(self, input_sequence):
        out = torch.cat([h.forward(input_sequence) for h in self.heads], dim=-1)    # for every head, call forward() and concat the attention-ified results!
        out = self.proj(out)
        
        return out 
        

"""SUBMODULE 3: SIMPLE FEEDFORWARD MLP LAYER"""
class FeedForward(nn.Module):
    def __init__(self, embedding_size):
        super().__init__()
        self.mlp_layer = nn.Sequential(
            nn.Linear(embedding_size, 4 * embedding_size),
            nn.ReLU(),
            nn.Linear(4 * embedding_size, embedding_size),  # "projection layer" back into residual pathway(?????)
            
            nn.Dropout(dropout),   # ALSO  ADDING SOME DROPOUT AS WELL!
        )
    
    def forward(self, input_sequence):
        return self.mlp_layer(input_sequence)
        

        
# TRANSFORMER "DECODER" BLOCK: WITH MULTIHEADATTENTION, FEEDFORWARD, LAYER NORMS, AND RESIDUAL/SKIP CONNECTIONS!
class Block(nn.Module):
    def __init__(self, embedding_size, num_heads):
        super().__init__()
        head_size = embedding_size // num_heads
        self.sa = MultiHeadAttention(num_heads, head_size)
        self.ffw = FeedForward(embedding_size)
        
        self.ln1 = nn.LayerNorm(embedding_size)
        self.ln2 = nn.LayerNorm(embedding_size)     # ADD EXTRA LAYER NORMALIZATION: LAYER NORM ALSO HAS TRAINABLE PARAMS

    def forward(self, input_sequence):
        # APPLY LAYERNORMS BEFORE FEEDING INTO THE ATTENTION/FFW BLOCKS!
        input_sequence = input_sequence + self.sa(self.ln1(input_sequence))        # PAIR THESE SA AND FEEDFORWARD LAYERS INTO A SINGLE "BLOCK" FUNCTION!
        input_sequence = input_sequence + self.ffw(self.ln2(input_sequence))
        
        
        return input_sequence
        

class BigramLanguageModel(nn.Module):
    
    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, embedding_dim)        
        self.position_embedding_table = nn.Embedding(max_context_length, embedding_dim)
        
        self.blocks = nn.Sequential(
            Block(embedding_size, num_heads = 4),
            Block(embedding_size, num_heads = 4),
            Block(embedding_size, num_heads = 4),    # RUN AND TRAIN ON 3 BLOCKS...
            nn.LayerNorm(embedding_size),            # AND LAYERNORM AT THE VERY END!
        )
        self.lm_head = nn.Linear(embedding_size, vocab_size)
        
    
    
    def forward(self, input_batched, target_batched):
        
        B, T = input_batched.shape
        
        sequence_token_embeddings = self.token_embedding_table(input_batched)
        positional_embedding = self.position_embedding_table(torch.arange(T))   # 0....T-1 for a (T, embedding_dim) matrix
        
        x = sequence_token_embeddings + positional_embedding                    # (B,T,C) - this is how you add positional infomration into the sequence
        
        x = self.blocks(x)          # run input vectors through that collection of blocks...
        logits = self.lm_head(x)    # Decode back to vocab-sized logits (right before softmaxxing and sampling a word)
        
        
        if target_batched is None:
            loss = None
        else:
            batch_size, sequence_length, embedding_dim = logits.shape    
            logits = logits.view(batch_size * sequence_length, embedding_dim)

            targets = target_batched.view(batch_size * sequence_length) 
            loss = F.cross_entropy(logits, targets)
            
        
        return logits, loss





    def generate(self, starting_sequence_tensor, max_new_tokens):
        
        
        print("\n-------- GENERATING SEQUENCE --------")
        
        for _ in range(max_new_tokens):
            
            context = starting_sequence_tensor[:, -block_size:]   # crop "input context" to just the last block_size tokens... next token depends on last ~8 tokens (otherwise the input dimensions are off)
            
            logits, loss = self.forward(context, None)        # forward pass our input sequence: basically just convering INT_TOKENS to associated embedding vector
            logits = logits[:, -1, :]                                          # from our logits tensor (list of embedding vectors), we only care about the very LAST one in the timestep

            probs = F.softmax(logits, dim=-1)                                  # convert logits (of vocab-size) to probability distribution             
            generated_int_token = torch.multinomial(probs, num_samples=1)      # sample from that probability distribution
            starting_sequence_tensor = torch.cat((starting_sequence_tensor, generated_int_token), dim=1)        # append that new token
            
        return starting_sequence_tensor
        
        
embedding_dim = 64
max_context_length = 256
block_size = max_context_length
learning_rate = 3e04
dropout = 0.2

embedding_size = 32

xb, yb = get_batch('train')
m = BigramLanguageModel(vocab_size = len(unique_chars))
optimizer = torch.optim.AdamW(m.parameters(), lr=learning_rate)


for epoch in range(10000):
    input_batched, label_batched = get_batch('train')
    
    logits, loss = m.forward(input_batched, label_batched)
    optimizer.zero_grad(set_to_none=True)                     # RESET THE ACCUMULATED GRADIENTS WITH EVERY NEW BATCH!
    loss.backward()
    optimizer.step()
    
    if epoch % 2000 == 0:
        print(f'loss at epoch {epoch} is: {loss.item()}')

    
generated_token_sequence = m.generate(torch.zeros((1,1), dtype=torch.long), max_new_tokens = 500)
print(''.join(decode(generated_token_sequence[0].tolist())))


# looks better! attention clearly communicating SOMETHING clearly....


loss at epoch 0 is: 4.338974475860596
loss at epoch 2000 is: nan


KeyboardInterrupt: 