In [2]:
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2024-12-18 23:32:46--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.108.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt’


2024-12-18 23:32:46 (2.61 MB/s) - ‘input.txt’ saved [1115394/1115394]



In [3]:
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()
print(text[:100])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You


# Step 1: Tokenize your text

Tokenizing means converting every "token" (a piece of text, such as a fragment of a word) into a __single number__. Here, for simplicity, each __character__ is its own token.

GPT-2 uses the byte-pair encoding algorithm. Here we're just going to do a standard character mapping.

In [7]:
unique_chars = sorted(list(set(text)))

print("List of unique chars: ", unique_chars)
print("Number of unique chars: ", len(unique_chars))

List of unique chars:  ['\n', ' ', '!', '$', '&', "'", ',', '-', '.', '3', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
Number of unique chars:  65


In [11]:
stoi = { c:i for i, c in enumerate(unique_chars) }
itos = { i:c for i, c in enumerate(unique_chars) }

encode = lambda some_str: [stoi[char] for char in some_str]
decode = lambda token_ids: ''.join(itos[i] for i in token_ids)   # list of token ints -> string


example_string = "This is a string"
encode(example_string)

[32, 46, 47, 57, 1, 47, 57, 1, 39, 1, 57, 58, 56, 47, 52, 45]
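
As a quick sanity check, encoding a string and then decoding the result should give back the original string. The sketch below shows this on a tiny toy vocabulary. The names `toy_text`, `toy_stoi`, `toy_itos`, `toy_encode`, and `toy_decode` are hypothetical and not part of the notebook, and joining the decoded characters back into a string is an assumption about how decoding is meant to be used.

```python
# Toy vocabulary built the same way as above, but from a short string (assumption).
toy_text = "hello world"
toy_chars = sorted(set(toy_text))

toy_stoi = {c: i for i, c in enumerate(toy_chars)}   # character -> token int
toy_itos = {i: c for i, c in enumerate(toy_chars)}   # token int -> character

def toy_encode(s):
    """Map each character of s to its token int."""
    return [toy_stoi[c] for c in s]

def toy_decode(token_ids):
    """Map token ints back to characters and join them into a string."""
    return "".join(toy_itos[i] for i in token_ids)

ids = toy_encode("hello")
round_trip = toy_decode(ids)
print(ids)          # [3, 2, 4, 4, 5]
print(round_trip)   # hello
```

Note that encoding a character that never appeared in the vocabulary raises a `KeyError`, which is one practical limitation of a plain character mapping.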

# Step 1.5: Character -> Token Int -> Token Embedding Vector

The goal of this step is an embedding lookup table. Each row corresponds to a unique token, and the row index equals that token's number (in the example above, 'T' has token number 32, so the vector at row 32 is the embedding vector for 'T'). Before building the table, we turn the whole text into a tensor of token ints and split it into training and validation data.

In [25]:
import torch

data = torch.tensor(encode(text))        # turn big list of characters -> big list of token ints

train_size = int(0.9 * len(data))        # 90/10 train/validation split: both are just long lists!
train_data = data[:train_size]
val_data = data[train_size:]

## Sidenote: chunking the training data

We only ever feed the model a sequence of CHUNK_SIZE tokens at a time (feeding in the whole text at once would be far too expensive). CHUNK_SIZE is therefore the __maximum context length__ the model can use for a prediction.

Below you can see that every chunk of length CHUNK_SIZE actually contains a bunch of training examples, one for each prefix of the chunk.

In [26]:
# Every single chunk is actually a BUNCH of training examples.

chunk_size = 8

for char_idx in range(chunk_size):
    training_example = train_data[:char_idx]     # context: everything before position char_idx
    associated_label = train_data[char_idx]      # label: the token at position char_idx
    print(f"When the training example is {training_example}, the label is {associated_label}.")

When the training example is tensor([], dtype=torch.int64), the label is 18.
When the training example is tensor([18]), the label is 47.
When the training example is tensor([18, 47]), the label is 56.
When the training example is tensor([18, 47, 56]), the label is 57.
When the training example is tensor([18, 47, 56, 57]), the label is 58.
When the training example is tensor([18, 47, 56, 57, 58]), the label is 1.
When the training example is tensor([18, 47, 56, 57, 58,  1]), the label is 15.
When the training example is tensor([18, 47, 56, 57, 58,  1, 15]), the label is 47.
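
The same idea can be written with two slices that are shifted by one position, which is the form the batching code further down relies on. This is a minimal sketch in plain Python lists, not the notebook's own code. The first eight ids are copied from the output above; the ninth id (58, the 't' in "First Cit") is inferred from the text and the character mapping. The names `tokens`, `x`, `y`, and `pairs` are hypothetical.

```python
chunk_size = 8

# First chunk_size + 1 token ids of the text ("First Cit"); the last id is inferred.
tokens = [18, 47, 56, 57, 58, 1, 15, 47, 58]

x = tokens[:chunk_size]          # inputs
y = tokens[1:chunk_size + 1]     # labels: the inputs shifted left by one position

# For every position t, the context is x[:t + 1] and the label is y[t].
pairs = [(x[:t + 1], y[t]) for t in range(chunk_size)]

for context, label in pairs:
    print(f"context {context} -> label {label}")
```

Unlike the loop above, this version never uses an empty context: contexts have lengths 1 through `chunk_size`, so one chunk of `chunk_size + 1` tokens yields `chunk_size` training examples.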


# Step 1.6: Batching the data

In [44]:
batch_size = 4     # Batch Size = the NUMBER of sequences we forward pass, backward pass, and step with on every training step.
block_size = 8     # Block_Size = maximum context length for a prediction.

def get_batch(split):
    data = train_data if split == 'train' else val_data
    
    # for this particular batch, we want to get batch_size (4) sequences, each of length block_size (8)
    random_starting_idx_of_batch = torch.randint(len(data) - block_size, (batch_size,))     # get 4 indexes into 'data': the indexes can only be from 0 to len(data) - block_size
                                                                                            # this will be a 1D tensor of size (batch_size,), i.e. tensor([953063, 497175, 633405, 627354])
    print(f"From {split}_data, pick {random_starting_idx_of_batch} as starting indices for training sequences.")
    
    # now that we have some starting indices, pick out the block_size-length sequence from each of them
    training_sequences = []
    
    for i in range(batch_size):
        starting_index = random_starting_idx_of_batch[i]
        training_sequences.append(data[starting_index : starting_index + block_size])
        
    training_sequences_tensor = torch.stack(training_sequences)
    print("\nThese are the block_size-length sequences starting from each of those indicies:")
    print(training_sequences_tensor.shape)
    print(training_sequences_tensor)
    
    
    # now we'll get a tensor, but with all the relevant labels. Remember we're using the trick above to get more examples.
    labels = []
    
    for i in range(batch_size):
        starting_index = random_starting_idx_of_batch[i]
        labels.append(data[starting_index + 1 : starting_index + block_size + 1])
        
    labels_tensor = torch.stack(labels)
    print("\nThese are the correct labels associated with each of those example tensors:")
    print(labels_tensor.shape)
    print(labels_tensor)
    
    return training_sequences_tensor, labels_tensor
    
    
get_batch('train')
    

From train_data, pick tensor([971401, 579495, 193625, 348340]) as starting indices for training sequences.

These are the block_size-length sequences starting from each of those indicies:
torch.Size([4, 8])
tensor([[57, 43, 60, 43, 52,  1, 63, 43],
        [60, 43, 42,  8,  0, 25, 63,  1],
        [56, 42,  5, 57,  1, 57, 39, 49],
        [43, 57, 58, 63,  6,  1, 58, 46]])

These are the correct labels associated with each of those example tensors:
torch.Size([4, 8])
tensor([[43, 60, 43, 52,  1, 63, 43, 39],
        [43, 42,  8,  0, 25, 63,  1, 45],
        [42,  5, 57,  1, 57, 39, 49, 43],
        [57, 58, 63,  6,  1, 58, 46, 47]])


(tensor([[57, 43, 60, 43, 52,  1, 63, 43],
         [60, 43, 42,  8,  0, 25, 63,  1],
         [56, 42,  5, 57,  1, 57, 39, 49],
         [43, 57, 58, 63,  6,  1, 58, 46]]),
 tensor([[43, 60, 43, 52,  1, 63, 43, 39],
         [43, 42,  8,  0, 25, 63,  1, 45],
         [42,  5, 57,  1, 57, 39, 49, 43],
         [57, 58, 63,  6,  1, 58, 46, 47]]))
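
Once the mechanics are clear, the two loops can be collapsed into a more compact form. The following is a sketch under the same assumptions as `get_batch` above, with the debug prints removed. The name `get_batch_compact`, its explicit `data`, `batch_size`, and `block_size` parameters, and the `toy_data` tensor are hypothetical and only used for illustration.

```python
import torch

def get_batch_compact(data, batch_size, block_size):
    """Sample batch_size random windows of length block_size, plus their shifted labels."""
    # The upper bound leaves room for the label window, which ends one position later.
    starts = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[s : s + block_size] for s in starts])
    y = torch.stack([data[s + 1 : s + block_size + 1] for s in starts])
    return x, y

# Toy data (assumption): consecutive integers make the one-step shift easy to see.
toy_data = torch.arange(100)
xb, yb = get_batch_compact(toy_data, batch_size=4, block_size=8)

print(xb.shape, yb.shape)
print(torch.equal(xb[:, 1:], yb[:, :-1]))   # labels are the inputs shifted by one
```

Passing `data` and the sizes as arguments instead of reading globals is a design choice made here so the sketch runs on its own; the behavior is otherwise meant to match the longer version.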

# Step 2: Forward Pass

Fantastic. Now that the training inputs and labels are in a nice batched format, we can make predictions.

In [50]:
import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)

class BigramLanguageModel(nn.Module):
    
    # model has internal embedding vector lookup table based on token
    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)
        
    
    # forward pass: foreach INT_TOKEN in input_batched, turn that into the correct embedding vector,
    # and then replace that INT_TOKEN with the relevant embedding vector.
    def forward(self, input_batched, target_batched):
        
        logits = self.token_embedding_table(input_batched) 
        print(f"\nBatched input is originally {input_batched.shape}.")
        print(f"After doing embedding lookup, it is {logits.shape}, since we replace each int token with a {len(unique_chars)}-sized vector.")
        
        return logits
        

        
xb, yb = get_batch('train')

m = BigramLanguageModel(vocab_size = len(unique_chars))
out = m(xb, yb)     # calling the module runs forward()


From train_data, pick tensor([ 76049, 234249, 934904, 560986]) as starting indices for training sequences.

These are the block_size-length sequences starting from each of those indicies:
torch.Size([4, 8])
tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])

These are the correct labels associated with each of those example tensors:
torch.Size([4, 8])
tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])

Batched input is originally torch.Size([4, 8]).
After doing embedding lookup, it is torch.Size([4, 8, 65]), since we replace each int token with a 65-sized vector.
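
To make the "lookup table" description concrete, the sketch below checks that an `nn.Embedding` forward pass returns the same values as indexing the rows of its weight matrix with the token ints. It is a small standalone illustration, not part of the model above; the names `table`, `tokens`, and `same` are hypothetical, and the token ids are taken from the earlier encoding example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab_size = 65
table = nn.Embedding(vocab_size, vocab_size)   # one row of size vocab_size per token

# A tiny batch of token ints with shape (2, 3); 32 is the token number of 'T'.
tokens = torch.tensor([[32, 46, 47],
                       [57, 1, 47]])

logits = table(tokens)                 # each int is replaced by its row: shape (2, 3, 65)
same = torch.equal(logits, table.weight[tokens])

print(logits.shape)
print(same)    # the lookup is plain row indexing into the weight matrix
```

In the bigram model the row size is set to `vocab_size`, so each row can be read directly as one score per possible next token.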
