# Introduction
Here we build a GPT-like model from scratch, based on two papers: [Attention is All You Need](https://arxiv.org/abs/1706.03762), which proposed the **transformer** architecture, and [GPT-3](https://arxiv.org/abs/2005.14165). GPT is a language model that, given an input, simply predicts the next token.

# Libraries

In [1]:
%matplotlib inline
%config IPCompleter.use_jedi=False

In [2]:
import math
import random
import matplotlib.pyplot as plt

In [3]:
###BIGRAM###
#################
### Libraries ###
#################

In [4]:
###BIGRAM###
import torch
import torch.nn as nn
import torch.nn.functional as F
device = 'cuda' if torch.cuda.is_available() else 'cpu'
device = 'cpu'  # forced to CPU for this notebook; remove this line to use the GPU when available

# Data
We import the tiny-Shakespeare dataset and process it so that it can be used to train a model that generates Shakespeare-like text.

### Reading and Inspecting

In [5]:
###BIGRAM###
############
### Data ###
############

In [6]:
###BIGRAM###
# Reading Data
with open("../../data/tinyshakespeare.txt", "r", encoding="utf-8") as f:
    text = f.read()

In [7]:
print("Number of Characters: ", len(text))

Number of Characters:  1115394


In [8]:
# First 300 characters
print(text[:300])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us


In [9]:
###BIGRAM###
# All unique characters
chars = sorted(list(set(text)))
vocab_size = len(chars)

In [10]:
print(f"Vocab Size: {vocab_size}")
print(f"Vocab Chars: {''.join(chars)}")

Vocab Size: 65
Vocab Chars: 
 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz


### Building Vocabulary and Encoder/Decoder
We use a character-level tokenizer here. In practice other tokenizers are used; OpenAI, for example, uses a **BPE** (byte-pair encoding) tokenizer:

In [11]:
# OpenAI encoder/decoder
import tiktoken
enc = tiktoken.get_encoding("gpt2")
enc.decode(enc.encode("hello world")) == "hello world"

True

We will be using a simple tokenizer that works on single characters rather than word chunks, to make things easier to understand.

In [12]:
###BIGRAM###
# Building the vocabulary
ctoi = {s:i for i,s in enumerate(chars)}
itoc = {i:s for s,i in ctoi.items()}

In [13]:
# Printing vocabulary
print(ctoi)
print(itoc)

{'\n': 0, ' ': 1, '!': 2, '$': 3, '&': 4, "'": 5, ',': 6, '-': 7, '.': 8, '3': 9, ':': 10, ';': 11, '?': 12, 'A': 13, 'B': 14, 'C': 15, 'D': 16, 'E': 17, 'F': 18, 'G': 19, 'H': 20, 'I': 21, 'J': 22, 'K': 23, 'L': 24, 'M': 25, 'N': 26, 'O': 27, 'P': 28, 'Q': 29, 'R': 30, 'S': 31, 'T': 32, 'U': 33, 'V': 34, 'W': 35, 'X': 36, 'Y': 37, 'Z': 38, 'a': 39, 'b': 40, 'c': 41, 'd': 42, 'e': 43, 'f': 44, 'g': 45, 'h': 46, 'i': 47, 'j': 48, 'k': 49, 'l': 50, 'm': 51, 'n': 52, 'o': 53, 'p': 54, 'q': 55, 'r': 56, 's': 57, 't': 58, 'u': 59, 'v': 60, 'w': 61, 'x': 62, 'y': 63, 'z': 64}
{0: '\n', 1: ' ', 2: '!', 3: '$', 4: '&', 5: "'", 6: ',', 7: '-', 8: '.', 9: '3', 10: ':', 11: ';', 12: '?', 13: 'A', 14: 'B', 15: 'C', 16: 'D', 17: 'E', 18: 'F', 19: 'G', 20: 'H', 21: 'I', 22: 'J', 23: 'K', 24: 'L', 25: 'M', 26: 'N', 27: 'O', 28: 'P', 29: 'Q', 30: 'R', 31: 'S', 32: 'T', 33: 'U', 34: 'V', 35: 'W', 36: 'X', 37: 'Y', 38: 'Z', 39: 'a', 40: 'b', 41: 'c', 42: 'd', 43: 'e', 44: 'f', 45: 'g', 46: 'h', 47: 'i',

In [14]:
###BIGRAM###
# Building encoder/decoder
encode = lambda s: [ctoi[c] for c in s]
decode = lambda l: ''.join([itoc[i] for i in l])

In [15]:
# Testing encoder/decoder
print(encode("hii there"))
print(decode(encode("hii there")))

[46, 47, 47, 1, 58, 46, 43, 56, 43]
hii there


### Tokenizing
Using our simple tokenizer we tokenize the entire dataset.

In [16]:
###BIGRAM###
# Tokenizing dataset
data = torch.tensor(encode(text), dtype=torch.long)

In [17]:
# Printing example
print(data.shape, data.dtype)
print(data[:50])

torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56])


### Train/Valid Split

In [18]:
###BIGRAM###
# Train/Valid Split
n = int(0.9*len(data))
data_train = data[:n]
data_valid = data[n:]

### Creating dataset
When building our model, we would like it to be able to generate text from a context as small as one character, and from any context size up to **block_size**.

In [19]:
# Example of a sample
block_size = 8
x = data_train[:block_size]
y = data_train[1:block_size+1]
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"when input is {context} target is: {target}")

when input is tensor([18]) target is: 47
when input is tensor([18, 47]) target is: 56
when input is tensor([18, 47, 56]) target is: 57
when input is tensor([18, 47, 56, 57]) target is: 58
when input is tensor([18, 47, 56, 57, 58]) target is: 1
when input is tensor([18, 47, 56, 57, 58,  1]) target is: 15
when input is tensor([18, 47, 56, 57, 58,  1, 15]) target is: 47
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) target is: 58


We create a function for getting random batches from the data.

In [20]:
###BIGRAM###
# Gets Random Batches
def get_batch(split: str, batch_size: int, block_size: int) -> tuple:
    """
    Description:
        Generates a batch of data of inputs x and targets y.
    Inputs:
        split: 'train' or 'valid' split
        batch_size: How many independent sequences will be processed in parallel
        block_size: Maximum context length
    Outputs:
        x, y: a tuple with xs and ys, each of shape (batch_size, block_size)
    """
    data = data_train if split == 'train' else data_valid
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x,y = x.to(device), y.to(device)
    return x, y

In [21]:
# Testing function
batch_size = 4 
block_size = 8 

xb, yb = get_batch('train', batch_size, block_size)
print('inputs:')
print(xb.shape)
print(xb)
print('outputs:')
print(yb.shape)
print(yb)

for b in range(batch_size):
    for t in range(block_size):
        context = xb[b, :t+1]
        target = yb[b, t]
        print(f"when input is {context.tolist()} target is: {target}")

inputs:
torch.Size([4, 8])
tensor([[41, 43, 10,  0, 13, 52, 42,  1],
        [43, 39, 60, 43,  6,  1, 51, 63],
        [42,  1, 40, 63,  1, 39, 52,  1],
        [44, 50, 53, 61, 43, 56,  1, 53]])
outputs:
torch.Size([4, 8])
tensor([[43, 10,  0, 13, 52, 42,  1, 58],
        [39, 60, 43,  6,  1, 51, 63,  1],
        [ 1, 40, 63,  1, 39, 52,  1, 43],
        [50, 53, 61, 43, 56,  1, 53, 44]])
when input is [41] target is: 43
when input is [41, 43] target is: 10
when input is [41, 43, 10] target is: 0
when input is [41, 43, 10, 0] target is: 13
when input is [41, 43, 10, 0, 13] target is: 52
when input is [41, 43, 10, 0, 13, 52] target is: 42
when input is [41, 43, 10, 0, 13, 52, 42] target is: 1
when input is [41, 43, 10, 0, 13, 52, 42, 1] target is: 58
when input is [43] target is: 39
when input is [43, 39] target is: 60
when input is [43, 39, 60] target is: 43
when input is [43, 39, 60, 43] target is: 6
when input is [43, 39, 60, 43, 6] target is: 1
when input is [43, 39, 60, 43, 6, 1] 

# Neural Network: Part I
Now we will start feeding the data into a neural network. We start with a bigram model similar to the one we built previously.

In [22]:
# A batch
print(xb)
print(yb)

tensor([[41, 43, 10,  0, 13, 52, 42,  1],
        [43, 39, 60, 43,  6,  1, 51, 63],
        [42,  1, 40, 63,  1, 39, 52,  1],
        [44, 50, 53, 61, 43, 56,  1, 53]])
tensor([[43, 10,  0, 13, 52, 42,  1, 58],
        [39, 60, 43,  6,  1, 51, 63,  1],
        [ 1, 40, 63,  1, 39, 52,  1, 43],
        [50, 53, 61, 43, 56,  1, 53, 44]])


In [23]:
###BIGRAM###
#############
### Model ###
#############

In [24]:
###BIGRAM###
class BigramLanguageModel(nn.Module):
    
    def __init__(self, vocab_size):
        """ Creating Embedding Table """
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)
    
    def forward(self, idx: torch.tensor, targets: torch.tensor=None) -> tuple:
        """ Calculating the Loss """
        logits = self.token_embedding_table(idx) # (BATCH, TIME, CHANNEL)
        if targets is None:
            loss = None
        else:
            # F.cross_entropy expects (N, C) logits and (N,) targets, so we flatten batch and time
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)
        
        return logits, loss

    def generate(self, idx: torch.tensor, max_new_tokens: int) -> torch.tensor:
        """ Generates Tokens Using a Sliding Window """
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # get the predictions
            logits, loss = self(idx)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

In [25]:
###BIGRAM###
@torch.no_grad()
def estimate_loss():
    """
    Description:
        Estimates losses on train and valid
    Outputs:
        out: Mean loss across eval_iters items
    """
    out = {}
    model.eval()
    for split in ['train', 'valid']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X,Y = get_batch(split, batch_size, block_size)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

In [26]:
# Expected loss 
-math.log(1/vocab_size)

4.174387269895637
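
The value above is the cross-entropy of a model that assigns equal probability to every character, which is roughly what we would hope for from an untrained model. A minimal sketch that checks this with all-zero logits (the constant `vocab_size = 65` is copied from the output earlier in this notebook; the random targets are made up for illustration):

```python
import math
import torch
import torch.nn.functional as F

vocab_size = 65  # taken from the notebook output above

# All-zero logits give a uniform distribution over the vocabulary after softmax
logits = torch.zeros(32, vocab_size)
# Arbitrary targets: under a uniform prediction the loss does not depend on them
targets = torch.randint(vocab_size, (32,))

uniform_loss = F.cross_entropy(logits, targets).item()
print(uniform_loss, math.log(vocab_size))
```

A freshly initialised embedding table is random rather than uniform, so the initial loss of the actual model can be expected to sit somewhat above this value.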

In [27]:
###BIGRAM###
# Creating model
model = BigramLanguageModel(vocab_size)
model = model.to(device)

In [28]:
# Running forward pass of model
logits, loss = model(xb, yb)
print(logits.shape)
print(loss)

torch.Size([32, 65])
tensor(4.6661, grad_fn=<NllLossBackward0>)


In [29]:
# Generating some text
model_input = torch.zeros((1,1), dtype=torch.long, device=device) # Input token 0, which is \n
model_output = model.generate(model_input, max_new_tokens=100)[0].tolist()
print(f"Generated Text: \n {decode(model_output)}")

Generated Text: 
 
JmB&rz'$:JB$;Wq'AxoRds-LGtz'lEs$QUeL;.TP&u?YSGtC.S
dBAd?Cmc'I
zhjjI$;PZ,syNWmKZVIAqfdsydTWq&EBOhlbCu


# Training Model: Part I
Here we use the Adam optimizer instead of stochastic gradient descent, which we used earlier (the code below uses PyTorch's `AdamW` variant). The optimizer determines how the parameters are updated from the gradients. Before, we simply updated them in the following way:

`p.data += -lr * 0.01 * p.grad`

Now the optimizer instead keeps track of the gradient history, so that it can build momentum in a certain direction and converge faster.
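
To make the idea of gradient history concrete, here is a simplified sketch of one Adam update, following the standard formulation of the algorithm. The helper name `adam_step`, the toy objective and the hyperparameter values are assumptions chosen for illustration; this is not PyTorch's implementation, and it leaves out the decoupled weight decay that `AdamW` adds.

```python
import torch

def adam_step(p, grad, m, v, t, lr=1e-2, beta1=0.9, beta2=0.999, eps=1e-8):
    """One simplified Adam update (illustrative helper, not PyTorch's code)."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2     # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for the zero initialisation
    v_hat = v / (1 - beta2 ** t)
    p = p - lr * m_hat / (v_hat.sqrt() + eps)   # step scaled per parameter
    return p, m, v

# Toy problem: minimise f(p) = p^2, whose gradient is 2p
p = torch.tensor([5.0])
m = torch.zeros_like(p)
v = torch.zeros_like(p)
for t in range(1, 2001):
    grad = 2 * p
    p, m, v = adam_step(p, grad, m, v, t)
print(p)
```

The two running averages `m` and `v` are the "history": `m` smooths the direction of the update, while `v` rescales the step size for each parameter individually.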

In [30]:
###BIGRAM###
################
### Training ###
################

In [31]:
###BIGRAM###
# Hyperparameters
batch_size = 32
block_size = 8
max_steps = 10000
eval_interval = 300
learning_rate = 1e-2
eval_iters = 200

In [32]:
###BIGRAM###
# Creating PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

In [33]:
###BIGRAM###
# Training loop
for step in range(max_steps):
    
    # Once in a while evaluate loss on train and valid sets
    if step % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {step}: train loss {losses['train']:.4f}, valid loss {losses['valid']:.4f}")
    
    # Sample a batch of data
    xb, yb = get_batch('train', batch_size, block_size)
    
    # Evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

step 0: train loss 4.6304, valid loss 4.6278
step 300: train loss 2.7788, valid loss 2.7900
step 600: train loss 2.5481, valid loss 2.5574
step 900: train loss 2.5033, valid loss 2.5183
step 1200: train loss 2.4770, valid loss 2.5049
step 1500: train loss 2.4720, valid loss 2.5031
step 1800: train loss 2.4689, valid loss 2.4987
step 2100: train loss 2.4720, valid loss 2.5002
step 2400: train loss 2.4614, valid loss 2.4877
step 2700: train loss 2.4658, valid loss 2.4895
step 3000: train loss 2.4698, valid loss 2.4915
step 3300: train loss 2.4578, valid loss 2.4960
step 3600: train loss 2.4536, valid loss 2.4883
step 3900: train loss 2.4663, valid loss 2.4928
step 4200: train loss 2.4656, valid loss 2.4936
step 4500: train loss 2.4567, valid loss 2.4689
step 4800: train loss 2.4535, valid loss 2.4915
step 5100: train loss 2.4496, valid loss 2.4875
step 5400: train loss 2.4694, valid loss 2.4951
step 5700: train loss 2.4588, valid loss 2.4840
step 6000: train loss 2.4579, valid loss 2.492

In [34]:
###BIGRAM###
##################
### Generating ###
##################

In [35]:
###BIGRAM###
# Generating some text
model_input = torch.zeros((1,1), dtype=torch.long, device=device) # Input token 0, which is \n
model_output = model.generate(model_input, max_new_tokens=300)[0].tolist()
print(f"Generated Text: \n {decode(model_output)}")

Generated Text: 
 
Anessourd deay'ld CHay
T:
MPRYolthy. lal,
n! t hisith he w m
GLE cowinecr w:
MENGUMot

thase ngor.
He aped my totr.
Loflmim y meeeostre apushuno
Thare, wabt ha-l
LO, hidishe-gais;
Y: ney osd, k,
WI w; blothithent w Boursannthathind,

HE s the ham me INERene ina l ansheatt R:

Of r ol the RETht ad? s


# Self-Attention: Part I
We will now build out the network by adding self-attention. We start out by making a minimal example illustrating what self-attention is.

Example:  
Because we are predicting the future, a token may only attend to itself and to the tokens before it:

* Tokens: abcdef
 * a: Cannot attend to any other characters than itself
 * b: Can attend only to a and b
 * c: Can attend to a, b and c
 * .......
 
Attending to a varying number of tokens can be done in various ways, the simplest of which is to average them. This throws away a lot of information, but we start this way for simplicity.

In [36]:
# Example
B,T,C = 4,8,2 # batch, time, channels
x = torch.randn(B,T,C)
x.shape

torch.Size([4, 8, 2])

In [37]:
# Version 1: Calculating average of all previous tokens
xbow = torch.zeros((B,T,C))
for b in range(B): 
    for t in range(T):
        xprev = x[b,:t+1]
        xbow[b,t] = torch.mean(xprev, 0)

In [38]:
# Inspecting a sample
print("input sample:\n", x[0])
print("output sample:\n", xbow[0])
print("token 1 average:", x[0][0])
print("token 2 average:", (x[0][0] + x[0][1])/2)
print("token 3 average:", (x[0][0] + x[0][1] + x[0][2])/3)

input sample:
 tensor([[-0.5447, -0.9758],
        [-0.7237, -1.6342],
        [ 0.6724, -0.4791],
        [ 1.0421,  0.5574],
        [ 0.6697, -1.0027],
        [ 0.2482, -0.3644],
        [-1.5257, -2.6367],
        [ 0.8385,  0.5197]])
output sample:
 tensor([[-0.5447, -0.9758],
        [-0.6342, -1.3050],
        [-0.1986, -1.0297],
        [ 0.1115, -0.6329],
        [ 0.2232, -0.7069],
        [ 0.2273, -0.6498],
        [-0.0231, -0.9336],
        [ 0.0846, -0.7520]])
token 1 average: tensor([-0.5447, -0.9758])
token 2 average: tensor([-0.6342, -1.3050])
token 3 average: tensor([-0.1986, -1.0297])


For-loops are slow, so we now do the same computation much faster using [matrix multiplication](http://matrixmultiplication.xyz/). The approach is shown via an example.

In [39]:
a = torch.tril(torch.ones(3,3))
a = a/torch.sum(a,1,keepdim=True);print("a:");print(a)
b = torch.randint(0,10,(3,2)).float();print("b:");print(b)
c = a @ b;print("c:");print(c)

a:
tensor([[1.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000],
        [0.3333, 0.3333, 0.3333]])
b:
tensor([[2., 5.],
        [1., 3.],
        [6., 3.]])
c:
tensor([[2.0000, 5.0000],
        [1.5000, 4.0000],
        [3.0000, 3.6667]])


Replacing the for-loop with matrix multiplications.

In [40]:
# Version 2: Calculating average of all previous tokens
wei = torch.tril(torch.ones(T,T))
wei = wei/torch.sum(wei,1,keepdim=True) # each row sums to 1
xbow2 = wei @ x # (T,T) @ (B,T,C) -> (B,T,C)
torch.allclose(xbow, xbow2)

True

Because we will be implementing a more advanced attention system, we create a third version that computes the same result, this time using softmax.

In [41]:
# Version 3: Calculating average of all previous tokens
tril = torch.tril(torch.ones(T,T)) 
wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)
xbow3 = wei @ x
torch.allclose(xbow, xbow3)

True
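
The softmax formulation matters because the zeros in `wei` can be replaced by data-dependent scores. As a hedged preview of where this leads, the sketch below follows scaled dot-product attention as described in the Attention is All You Need paper, restricted to a single head. The sizes `C = 32` and `head_size = 16` and the layer names are assumptions chosen for illustration, not values fixed by this notebook.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
B, T, C = 4, 8, 32   # batch, time, channels (C chosen for illustration)
head_size = 16       # assumed size of the attention head
x = torch.randn(B, T, C)

# Linear projections producing a key, query and value for every token
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)
k, q, v = key(x), query(x), value(x)                 # each (B, T, head_size)

# Affinities between tokens, scaled to keep the softmax from saturating
wei = q @ k.transpose(-2, -1) * head_size ** -0.5    # (B, T, T)

# Same causal mask as in version 3: a token cannot look at later tokens
tril = torch.tril(torch.ones(T, T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)                         # rows sum to 1

out = wei @ v                                        # (B, T, head_size)
print(out.shape)
```

Compared with version 3, the only change is that the weights are no longer uniform: each token decides how much of every earlier token to mix in.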

# Neural Network: Part II
Here we make some adjustments to the BigramLanguageModel, add a positional embedding, and implement the self-attention block.

In [42]:
###BIGRAM2###
#################
### Libraries ###
#################
import torch
import torch.nn as nn
import torch.nn.functional as F
device = 'cuda' if torch.cuda.is_available() else 'cpu'
device = 'cpu'  # forced to CPU for this notebook; remove this line to use the GPU when available

In [43]:
###BIGRAM2###
#######################
### Hyperparameters ###
#######################
batch_size = 32
block_size = 8
max_steps = 10000
eval_interval = 300
learning_rate = 1e-2
eval_iters = 200
n_embed = 32 

In [44]:
###BIGRAM2###
############
### Data ###
############

# Reading Data
with open("../../data/tinyshakespeare.txt", "r", encoding="utf-8") as f:
    text = f.read()

# All unique characters
chars = sorted(list(set(text)))
vocab_size = len(chars)

# Building the vocabulary
ctoi = {s:i for i,s in enumerate(chars)}
itoc = {i:s for s,i in ctoi.items()}

# Building encoder/decoder
encode = lambda s: [ctoi[c] for c in s]
decode = lambda l: ''.join([itoc[i] for i in l])

# Tokenizing dataset
data = torch.tensor(encode(text), dtype=torch.long)

# Train/Valid Split
n = int(0.9*len(data))
data_train = data[:n]
data_valid = data[n:]

# Gets Random Batches
def get_batch(split: str, batch_size: int, block_size: int) -> tuple:
    """
    Description:
        Generates a batch of data of inputs x and targets y.
    Inputs:
        split: 'train' or 'valid' split
        batch_size: How many independent sequences will be processed in parallel
        block_size: Maximum context length
    Outputs:
        x, y: a tuple with xs and ys, each of shape (batch_size, block_size)
    """
    data = data_train if split == 'train' else data_valid
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x,y = x.to(device), y.to(device)
    return x, y

To the language model we add a linear layer and positional embeddings. We also complete the self attention implementation.
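
Before looking at the full model, it may help to see how the token and positional embeddings combine. The sketch below is a standalone illustration using the sizes from this notebook (`vocab_size = 65`, `block_size = 8`, `n_embed = 32`); the batch size of 4 and the random token indices are made up.

```python
import torch
import torch.nn as nn

vocab_size, block_size, n_embed = 65, 8, 32   # values used in this notebook
B, T = 4, 8                                   # illustrative batch and sequence length

token_embedding_table = nn.Embedding(vocab_size, n_embed)
position_embedding_table = nn.Embedding(block_size, n_embed)

idx = torch.randint(vocab_size, (B, T))                 # (B, T) token indices
tok_emb = token_embedding_table(idx)                    # (B, T, n_embed)
pos_emb = position_embedding_table(torch.arange(T))     # (T, n_embed)

# pos_emb is broadcast across the batch dimension
x = tok_emb + pos_emb                                   # (B, T, n_embed)
print(x.shape)
```

Because the position table has only `block_size` rows, `T` must not exceed `block_size` when the model is called.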

In [50]:
###BIGRAM2###
#############
### Model ###
#############

class BigramLanguageModel(nn.Module):
    
    def __init__(self):
        """ Creating Layers """
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embed)
        self.position_embedding_table = nn.Embedding(block_size, n_embed)
        self.lm_head = nn.Linear(n_embed, vocab_size)
    
    def forward(self, idx: torch.tensor, targets: torch.tensor=None) -> tuple:
        """ Calculating the Loss """
        B, T = idx.shape
        tok_emb = self.token_embedding_table(idx) # (B,T,embed_size)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb
        logits = self.lm_head(x) # (B,T,vocab_size)
        if targets is None:
            loss = None
        else:
            # F.cross_entropy expects (N, C) logits and (N,) targets, so we flatten batch and time
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)
        
        return logits, loss

    def generate(self, idx: torch.tensor, max_new_tokens: int) -> torch.tensor:
        """ Generates Tokens Using a Sliding Window """
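        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop the context to the last block_size tokens, since the
            # position embedding table only has block_size entries
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)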
            
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

In [46]:
###BIGRAM2###
# Function for estimating loss
@torch.no_grad()
def estimate_loss():
    """
    Description:
        Estimates losses on train and valid
    Outputs:
        out: Mean loss across eval_iters items
    """
    out = {}
    model.eval()
    for split in ['train', 'valid']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X,Y = get_batch(split, batch_size, block_size)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

In [47]:
###BIGRAM2###
# Creating model
model = BigramLanguageModel()
model = model.to(device)

In [51]:
###BIGRAM2###
################
### Training ###
################

# Creating PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

# Training loop
for step in range(max_steps):
    
    # Once in a while evaluate loss on train and valid sets
    if step % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {step}: train loss {losses['train']:.4f}, valid loss {losses['valid']:.4f}")
    
    # Sample a batch of data
    xb, yb = get_batch('train', batch_size, block_size)
    
    # Evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

step 0: train loss 2.4789, valid loss 2.5223
step 300: train loss 2.4879, valid loss 2.4986
step 600: train loss 2.4882, valid loss 2.4988
step 900: train loss 2.4875, valid loss 2.5035
step 1200: train loss 2.4913, valid loss 2.5173
step 1500: train loss 2.4868, valid loss 2.5102
step 1800: train loss 2.4835, valid loss 2.5204
step 2100: train loss 2.4779, valid loss 2.5072
step 2400: train loss 2.4834, valid loss 2.5167
step 2700: train loss 2.4938, valid loss 2.5221
step 3000: train loss 2.4857, valid loss 2.5066
step 3300: train loss 2.4904, valid loss 2.5126
step 3600: train loss 2.4771, valid loss 2.5081
step 3900: train loss 2.4916, valid loss 2.5146
step 4200: train loss 2.4821, valid loss 2.5092
step 4500: train loss 2.4687, valid loss 2.5087
step 4800: train loss 2.4876, valid loss 2.5175
step 5100: train loss 2.4811, valid loss 2.5055
step 5400: train loss 2.4829, valid loss 2.5130
step 5700: train loss 2.4853, valid loss 2.5153
step 6000: train loss 2.4896, valid loss 2.509

In [None]:
###BIGRAM2###
##################
### Generating ###
##################
# Generating some text
model_input = torch.zeros((1,1), dtype=torch.long, device=device) # Input token 0, which is \n
model_output = model.generate(model_input, max_new_tokens=300)[0].tolist()
print(f"Generated Text: \n {decode(model_output)}")

In [None]:
# http://www.youtube.com/watch?v=kCc8FmEb1nY&t=60m20s

# Exporting Code
Here we export code labeled with ###BIGRAM###, ###BIGRAM2### and so on to scripts.

In [None]:
!python3.10 ../../helpers/ipynb_to_py.py 1.\ Building\ GPT-like\ Model.ipynb "###BIGRAM###" ../../modules/GPT/bigram.py

In [65]:
!python3.10 ../../helpers/ipynb_to_py.py 1.\ Building\ GPT-like\ Model.ipynb "###BIGRAM2###" ../../modules/GPT/bigram2.py

INFO: Cells with label ###BIGRAM2### extracted from 1. Building GPT-like Model.ipynb
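
The helper script `ipynb_to_py.py` itself is not shown in this notebook. As a rough idea of what label-based export can look like, here is a hypothetical sketch: the function name `extract_labeled_cells` and the in-memory `demo` notebook are made up for illustration, and the real helper may work differently. It relies only on the fact that a notebook is JSON with a list of `cells`, each having a `cell_type` and a `source`.

```python
def extract_labeled_cells(notebook: dict, label: str) -> str:
    """Join the source of all code cells whose first line equals `label`."""
    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", [])
        # The source is stored either as a list of lines or as one string
        text = "".join(source) if isinstance(source, list) else source
        lines = text.splitlines()
        if lines and lines[0].strip() == label:
            chunks.append("\n".join(lines[1:]))   # drop the label line itself
    return "\n\n".join(chunks) + "\n"

# Tiny in-memory notebook standing in for a real .ipynb file
demo = {
    "cells": [
        {"cell_type": "code", "source": ["###BIGRAM###\n", "import torch\n"]},
        {"cell_type": "code", "source": ["print('scratch work')\n"]},
        {"cell_type": "markdown", "source": ["# Data\n"]},
        {"cell_type": "code", "source": "###BIGRAM###\nx = 1\n"},
    ]
}
script = extract_labeled_cells(demo, "###BIGRAM###")
print(script)
```

Matching the whole first line, rather than a prefix, keeps `###BIGRAM###` cells separate from `###BIGRAM2###` cells. For a real file, the dictionary would come from `json.load` on the opened `.ipynb` file.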
