# Takeaways

* [takeaway].<br>

* [takeaway]. <br>

* [takeaway]. <br>

* [takeaway]. <br>

* [takeaway]. <br>

# Introduction

In this post, I train Karpathy's __[nanoGPT](https://johncollinsai-nanogpt-voqqf4ls3a-as.a.run.app/)__ on high-frequency (tick-by-tick) data for __[AAPL](https://www.google.com/search?q=aapl&oq=AAPL&aqs=chrome.0.0i512l5j69i61l3.1590j1j9&sourceid=chrome&ie=UTF-8)__ and __[JPM](https://www.google.com/search?q=jpm+stock+price&oq=JPM+stock+pri&aqs=chrome.0.0i512j69i57j0i512l8.4577j1j9&sourceid=chrome&ie=UTF-8)__. I want to see how nanoGPT performs as a volatility predictor. I also want to explore the use of LLMs for tasks, in this case volatility prediction, that are typically performed by models more specific to finance. For volatility prediction, the established model classes include stochastic volatility models such as the __[MSM](https://github.com/johncollinsai/markov-switching-multifractal)__ of Calvet & Fisher, ARCH and GARCH models, and jump-diffusion models. More recently, deep learning has been applied to volatility prediction, and this __[post](https://johncollinsai-deep-learning-finance-voqqf4ls3a-as.a.run.app/)__ describes these developments in some detail. However, the application of LLMs to volatility prediction appears to be quite novel, and nanoGPT provides a good basis for an under-the-hood examination.

**High-frequency data**

See my earlier __[post](https://johncollinsai-high-frequency-data-voqqf4ls3a-as.a.run.app/)__.

**NanoGPT**

See my earlier __[post](https://johncollinsai-nanogpt-voqqf4ls3a-as.a.run.app/)__.

# Set up

I begin with __[my earlier implementation of Karpathy's nanoGPT](https://github.com/johncollinsai/nanogpt)__. 

Starting with a very simple bigram language model, following Karpathy, I define and build a transformer piece by piece. Then I train it on a text dataset.

I use an NVIDIA GeForce RTX 3080 Ti Laptop GPU and a deep learning framework that includes PyTorch, CUDA, cuDNN, and NVIDIA Drivers, on Ubuntu 22.04 LTS.  Source code as always may be found on __[my GitHub](https://github.com/johncollinsai/nanogpt)__.

# Building and training a GPT

## Baseline (bigram) language model

#### Preparation

In [1]:
# check GPU (if working on local machine)
import torch
if torch.cuda.is_available():
    device = 'cuda'
    print(f"device: {device}")
    print(f"Device name: {torch.cuda.get_device_name(0)}")
else:
    # no GPU: fall back to CPU (querying the CUDA device name here would raise an error)
    device = 'cpu'
    print("CUDA is not available.")
    print(f"device: {device}")

device: cuda
Device name: NVIDIA GeForce RTX 3080 Ti Laptop GPU


In [2]:
# for running in docker image
# device = 'cuda' if torch.cuda.is_available() else 'cpu'

In [3]:
# get data
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

print("length of dataset in characters: ", len(text))

length of dataset in characters:  1115394

In [4]:
# show unique characters appearing in the dataset (note the space character, which is first in the set): i.e., the vocabulary of possible characters the model can see or emit
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(''.join(chars))
print(vocab_size)


 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
65


#### Tokenize

> Build a simple encoder and decoder: i.e., take a string and output a list of integers, where each character is a token. The approach below is similar to, but much simpler than, __[Google SentencePiece](https://github.com/google/sentencepiece)__ (which uses sub-word encodings) and __[OpenAI tiktoken](https://github.com/openai/tiktoken)__.

In [5]:
# convert the raw text as a string into some sequence of integers according to some vocabulary of possible elements
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

# build a simple encoder and decoder, effectively a tokenizer and detokenizer
encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join([itos[i] for i in l])

print(encode("today is friday, looking forward to the weekend!"))
print(decode(encode("today is friday, looking forward to the weekend!")))

[58, 53, 42, 39, 63, 1, 47, 57, 1, 44, 56, 47, 42, 39, 63, 6, 1, 50, 53, 53, 49, 47, 52, 45, 1, 44, 53, 56, 61, 39, 56, 42, 1, 58, 53, 1, 58, 46, 43, 1, 61, 43, 43, 49, 43, 52, 42, 2]
today is friday, looking forward to the weekend!
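
One property of a character-level tokenizer like this is that it only knows the characters seen when the vocabulary was built. The sketch below uses plain Python on a toy string rather than the real dataset; `sample_text` and `safe_encode` are hypothetical names introduced only for illustration. It checks the round trip and shows what happens with unseen characters.

```python
# Toy vocabulary built from a short sample string (an assumption; the post uses the full dataset)
sample_text = "today is friday"
chars = sorted(set(sample_text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

def encode(s):
    """Map each character to its integer id; raises KeyError for unseen characters."""
    return [stoi[c] for c in s]

def decode(ids):
    """Map integer ids back to a string."""
    return ''.join(itos[i] for i in ids)

def safe_encode(s):
    """Hypothetical helper: silently drop characters that are not in the vocabulary."""
    return [stoi[c] for c in s if c in stoi]

round_trip = decode(encode("friday is today"))
dropped = decode(safe_encode("Friday!"))  # 'F' and '!' are not in the toy vocabulary
```

Dropping unknown characters is only one possible policy; a reserved "unknown" token is a common alternative. Either way, the vocabulary has to cover every symbol in any new dataset before training.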


> Now that I have a tokenizer and detokenizer, I can convert the raw text into a sequence of integers, i.e., I can tokenize the entire training dataset.

In [6]:
# encode training dataset and store it in a torch.tensor
import torch
data = torch.tensor(encode(text), dtype=torch.long)

print(data.shape, data.dtype)
print(data[:100])

torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59])


#### Train/val split

In [6]:
# 90:10 train:val split
n = int(0.9*len(data))
train_data = data[:n]
val_data = data[n:]

#### Loading the data
> I set the time dimension (i.e., the context length) of the tensors feeding into the transformer to a maximum of 8 characters (i.e., I set block_size = 8). Note: I slice block_size+1 characters at a time because the transformer trains on the first 8 characters and predicts the next, i.e., the 9th, character. Put another way, the transformer sees contexts from one character through block_size characters. <br>

> And I set the batch dimension of the tensors feeding into the transformer to 4, so batch_size = 4 (i.e., 4 independent sequences will be processed in parallel).

In [7]:
# set block_size = 8 to train on [:block_size+1] = 8+1 characters at a time
block_size = 8
train_data[:block_size+1]

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])

In [8]:
# +1 because we want to predict the next character, thus block_size+1 allows us to do that, i.e., the transformer trains on the first 8 characters and predicts the +1th or 9th character
# to illustrate:
x = train_data[:block_size]
y = train_data[1:block_size+1]
print('Illustrating how the transformer trains on the first 8 characters and predicts the +1th or 9th character:')
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f'when input is {context}, the target is {target}')

Illustrating how the transformer trains on the first 8 characters and predicts the +1th or 9th character:
when input is tensor([18]), the target is 47
when input is tensor([18, 47]), the target is 56
when input is tensor([18, 47, 56]), the target is 57
when input is tensor([18, 47, 56, 57]), the target is 58
when input is tensor([18, 47, 56, 57, 58]), the target is 1
when input is tensor([18, 47, 56, 57, 58,  1]), the target is 15
when input is tensor([18, 47, 56, 57, 58,  1, 15]), the target is 47
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47]), the target is 58


#### A quick note on random seed selection

> In an interesting __[paper](https://arxiv.org/abs/2109.08203)__ David Picard investigates the effect of random seed selection on accuracy when using deep learning architectures for computer vision and posits that Torch.manual_seed(3407) is all you need!

In [9]:
# I set the batch dimension of the tensors feeding into the transformer to 4, so batch_size = 4 (i.e., 4 independent sequences will be processed in parallel).
torch.manual_seed(3407)
batch_size = 4
block_size = 8  

def get_batch(split):
    # generate a batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,)) 
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device) # move data to GPU
    return x, y

xb, yb = get_batch('train')

print('Here is the tensor input to the transformer:',
      '\n', 
      xb      
      )  

Here is the tensor input to the transformer: 
 tensor([[32, 39, 49, 43,  1, 58, 46, 53],
        [59, 56,  1, 54, 56, 47, 52, 41],
        [57, 53, 51, 43,  1, 51, 43, 56],
        [57,  1, 58, 56, 59, 43,  2,  0]], device='cuda:0')
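
The relationship between inputs and targets in `get_batch` can be checked without PyTorch. The following is a minimal sketch in plain Python, assuming a stand-in dataset of consecutive integers so that the one-step shift is easy to see; the function signature differs from the post's `get_batch` and is for illustration only.

```python
import random

def get_batch(data, batch_size, block_size, rng):
    """Sample batch_size windows; targets are the inputs shifted one step to the right."""
    # valid start indices leave room for block_size inputs plus one extra target token
    starts = [rng.randrange(len(data) - block_size) for _ in range(batch_size)]
    x = [data[i:i + block_size] for i in starts]
    y = [data[i + 1:i + block_size + 1] for i in starts]
    return x, y

rng = random.Random(3407)   # local generator, so the global random state is untouched
data = list(range(100))     # stand-in for the encoded dataset (assumption)
xb, yb = get_batch(data, batch_size=4, block_size=8, rng=rng)
```

Because each target row is the input row shifted by one position, a single `(batch_size, block_size)` batch supplies `batch_size * block_size` separate next-token training examples.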


#### Baseline (bigram) language model

Following __[Karpathy](https://www.youtube.com/watch?v=kCc8FmEb1nY)__, I implement a very simple neural network, the bigram language model.

In [10]:
import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(3407)

class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        # idx and targets are both (B, T) tensors of integers
        logits = self.token_embedding_table(idx) # (B, T, C): a batch by time (context) by channel tensor, where channel is vocab size

        if targets is None:
            loss = None
        else:
            # reorganize logits tensor from (B, T, C) to (B*T, C) in order to fit pytorch's cross_entropy loss function
            B, T, C = logits.shape 
            logits = logits.view(B*T, C) 
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets) # cross_entropy here computes negative log likelihood loss

        return logits, loss
    
    def generate(self, idx, max_new_tokens):
        # idx is a (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # get the predictions
            logits, loss = self(idx)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get the probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat([idx, idx_next], dim=1) # (B, T+1)

        return idx
    
model = BigramLanguageModel(vocab_size)
m = model.to(device) # move model to GPU
logits, loss = m(xb, yb)
print('logits shape:', logits.shape)
print('loss:', loss)

# context = torch.zeros((1,1), dtype=torch.long, device=device), here created on-the-fly by print() on the GPU
print(decode(m.generate(idx = torch.zeros((1,1), dtype=torch.long, device=device), max_new_tokens=300)[0].tolist())) 

logits shape: torch.Size([32, 65])
loss: tensor(4.4231, device='cuda:0', grad_fn=<NllLossBackward0>)

pTFwSpp,f.v-;LR-;DA,O:rGMbv3OqDlpuo-SxIMtqCPawLaD;iC O'-N$sr?,y;Dgx&uJvha?qU.RXFqe!3CLnq,ZAcdW-dxvq
ijb-dmxN-lLtI'UsNajeE3gH??!m3zz:nMgrVgHyRJd;MVWy'nEDSCT!QA;myMPVPLnvyjMWXFw,LweP,WSzdPrvcWXecNIcLtcPrPbGIzVH.nqckUK;XfAco',QFJ3'T !a-$Nemy,WmkUIx?mO!sJwEywCCk,W:Jv3V&PjhvEooQF3taT
3&u!XCikXcY
?xIzQrGW


> The model is untrained and provides predictions that are random, so the output is meaningless.
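
A quick sanity check on the untrained loss: a model that assigned equal probability to each of the 65 characters would score a cross-entropy of ln(65), about 4.17 nats. The reported 4.4231 is a little higher, which is consistent with randomly initialised logits that are not exactly uniform. A short sketch of the arithmetic:

```python
import math

vocab_size = 65                      # from the vocabulary printed earlier in the post
uniform_loss = math.log(vocab_size)  # cross-entropy of a uniform guess, in nats
observed_loss = 4.4231               # loss reported above for the untrained model
gap = observed_loss - uniform_loss   # how far the random initialisation is from uniform

print(f"uniform baseline: {uniform_loss:.4f} nats, gap: {gap:.4f} nats")
```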

#### Training the bigram model

> I now train the bigram model to make it less random.

In [11]:
# create a pytorch optimizer
optimizer = torch.optim.Adam(m.parameters(), lr=1e-3)

In [12]:
batch_size = 32 # increase the batch size from 4 to 32 to speed up training
for steps in range(10000): # increase the number of steps to train for, to improve results

    # get a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = m(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

print('loss:', loss.item()) # training for 10000 steps brings the loss down to ~2.5

loss: 2.5604467391967773


In [13]:
# As above, context = torch.zeros((1,1), dtype=torch.long, device=device), here created on-the-fly by print() on the GPU
print(decode(m.generate(idx = torch.zeros((1,1), dtype=torch.long, device=device), max_new_tokens=300)[0].tolist()))



Afre an nthe t y.
I twl d, h bstave: anr h? towll
BUStil ilouniasthechyord IF shaty vouby, m aysie fon Malld aty acoghas; histhofok ang titr. may, o mar, we gel adadico y mereengoowe.
Hiplouplloproousseathes l we f Jater, thee,

Hwncoshy momyow, r agh afurst thes hendee: byoon t MIILELIZMend, cuthe


> The model is making progress, but it is still a very simple model and the tokens are not yet talking to each other. Its predictions show a somewhat more language-like structure, but they are still largely random and the output is meaningless.
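
One way to see what the trained bigram model is approximating: a lookup table trained with cross-entropy is pushed towards the empirical probability of each next character given the current one. The sketch below estimates those probabilities directly by counting, on a toy string rather than the real dataset. It is an illustrative counting equivalent, not the training procedure used in this post, and `toy_text`, `bigram_probs` and `average_nll` are hypothetical names.

```python
import math
from collections import Counter, defaultdict

def bigram_probs(text):
    """Estimate P(next char | current char) from raw pair counts."""
    counts = defaultdict(Counter)
    for cur, nxt in zip(text, text[1:]):
        counts[cur][nxt] += 1
    return {cur: {nxt: n / sum(c.values()) for nxt, n in c.items()}
            for cur, c in counts.items()}

def average_nll(text, probs):
    """Average negative log-likelihood (in nats) of each next character."""
    pairs = list(zip(text, text[1:]))
    return -sum(math.log(probs[cur][nxt]) for cur, nxt in pairs) / len(pairs)

toy_text = "the cat sat on the mat"   # toy stand-in for the dataset (assumption)
probs = bigram_probs(toy_text)
loss = average_nll(toy_text, probs)
```

No model that looks only at the current character can beat the count-based loss on the data it was estimated from. This is why further gains require letting tokens see more of their history.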

## Self-attention

> I now write the first self-attention block for processing the tokens, following several steps, each progressively more effective, that hopefully make the self-attention construct clearer. <br>

> Let's start with a very simple example, which essentially relates tokens to each other via their history.

In [14]:
# simple example
torch.manual_seed(3407)
B,T,C = 4,8,2 # batch size, time steps, channels
x = torch.randn(B,T,C)
x.shape

torch.Size([4, 8, 2])

#### Averaging past context with for loops (weakest form of aggregation)

> A simple way to enable tokens to communicate in the manner we desire (i.e., with the tokens that precede them in T) is to calculate an average of all the preceding elements. Consider, for example, the fifth token: take the channels that make up the information at that step, together with the channels from the fourth, third, second and first steps, and average them. This creates, effectively, a feature vector that summarizes the fifth token in the context of its history. An average like this is an extremely weak and lossy form of aggregation, i.e., a lot of information about the spatial arrangement of the tokens is lost. <br>

> So, for every batch element independently, for every $n^{th}$ token in that sequence, calculate the average of all the vectors in all the previous tokens and also at the $n^{th}$ token.

In [15]:
# I want xbow[b,t] = mean_{i<=t} x[b,i]
xbow = torch.zeros((B,T,C)) # bow for bag of words
for b in range(B):
    for t in range(T):
        xprev = x[b,:t+1] # (t+1, C)
        xbow[b,t] = torch.mean(xprev, 0) # average over the time dimension

print(x[0])
print('xbow averages everything up to the current location of the nth token: ', '\n',
      xbow[0])

tensor([[ 0.1703, -0.8613],
        [-0.6225,  1.0247],
        [ 0.3506,  0.8032],
        [ 0.0865, -0.9623],
        [-1.6784,  1.3681],
        [-0.1882,  1.7510],
        [ 0.5818, -0.3983],
        [ 1.4324, -0.6142]])
xbow averages everything up to the current location of the nth token:  
 tensor([[ 0.1703, -0.8613],
        [-0.2261,  0.0817],
        [-0.0339,  0.3222],
        [-0.0038,  0.0011],
        [-0.3387,  0.2745],
        [-0.3136,  0.5206],
        [-0.1857,  0.3893],
        [ 0.0166,  0.2639]])
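
The printed averages can be reproduced by hand. The sketch below is plain Python, with `x0` copied from the printed `x[0]` above and therefore rounded to four decimals. It computes the same causal mean with a running sum, so each step reuses the previous total instead of re-averaging the whole prefix.

```python
# first batch element from the printed tensor x[0], rounded to 4 decimals
x0 = [[0.1703, -0.8613], [-0.6225, 1.0247], [0.3506, 0.8032], [0.0865, -0.9623],
      [-1.6784, 1.3681], [-0.1882, 1.7510], [0.5818, -0.3983], [1.4324, -0.6142]]

def causal_mean(rows):
    """Row t of the result is the mean of rows 0..t, computed with a running sum."""
    running = [0.0] * len(rows[0])
    out = []
    for t, row in enumerate(rows):
        running = [s + v for s, v in zip(running, row)]
        out.append([s / (t + 1) for s in running])
    return out

xbow0 = causal_mean(x0)
for row in xbow0:
    print([round(v, 4) for v in row])
```

Up to rounding in the last digit, the rows should agree with `xbow[0]` above. For example, the second row is the mean of the first two inputs, (0.1703 - 0.6225) / 2 ≈ -0.2261.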


#### Self-attention: matrix multiply as weighted aggregation

> Karpathy shows how to use matrix multiplication to increase the efficiency of the above operation. 

In [16]:
wei = torch.tril(torch.ones((T,T))) # wei denotes weights, torch.tril provides lower triangular matrix
wei = wei / wei.sum(1, keepdim=True) # normalize weights so that they sum to 1
xbow2 = wei @ x # (T, T) @ (B, T, C): wei is broadcast across the batch --> (B, T, C)
torch.allclose(xbow, xbow2) # check that the two methods give the same result

True
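
To make the weight matrix concrete, here is a small plain-Python sketch for T = 3 with a toy input (the values in `x` are assumptions chosen so the averages are easy to verify by eye). Row t of the weights holds 1/(t+1) in columns 0..t and zero elsewhere, so multiplying by it averages the first t+1 rows of the input.

```python
def averaging_weights(T):
    """Lower-triangular T x T matrix whose row t holds 1/(t+1) in columns 0..t."""
    return [[1.0 / (t + 1) if j <= t else 0.0 for j in range(T)] for t in range(T)]

def matmul(a, b):
    """Plain-Python matrix product of a (n x m) and b (m x p)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

wei = averaging_weights(3)
x = [[2.0, 0.0], [4.0, 6.0], [6.0, 3.0]]   # toy (T, C) input, an assumption
xbow = matmul(wei, x)                       # each row is the mean of the rows up to and including it
```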

#### Adding softmax

> Applying a softmax to each row to normalize.

In [17]:
tril = torch.tril(torch.ones((T,T))) # tril matrix of lower triangular ones
wei = torch.zeros((T,T)) # wei begins as a matrix of zeros
wei = wei.masked_fill(tril == 0, float('-inf')) # weights for the future tokens are set to -inf, so future tokens are ignored
wei = F.softmax(wei, dim=-1)
xbow3 = wei @ x
torch.allclose(xbow, xbow3)

True
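
Why the softmax version gives the same averages: masked positions hold -inf, and exp(-inf) is 0, so they receive zero weight. The remaining positions all hold the same score (zero), so they share the weight equally. A minimal plain-Python sketch:

```python
import math

def softmax(row):
    """Numerically stable softmax; exp(-inf) evaluates to 0.0, so masked entries get zero weight."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_softmax_weights(T):
    """Start from zero scores, mask future positions with -inf, then softmax each row."""
    neg_inf = float('-inf')
    scores = [[0.0 if j <= t else neg_inf for j in range(T)] for t in range(T)]
    return [softmax(row) for row in scores]

wei = causal_softmax_weights(4)
```

The advantage of this formulation is that the zero scores can later be replaced by data-dependent scores while the masking and normalisation stay unchanged.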

#### Self-attention

> Now observe a single head perform self-attention. A "head" refers to what is effectively a sub-network that processes input sequences independently. In transformers, self-attention normally comprises multiple attention heads that allow the model to attend to different parts of the input sequence at different levels of granularity, enabling the model to capture more diverse and nuanced relationships between the different elements of the input. Thus, self-attention enables the model to gather information from the past and apply it in a data-dependent way.

In [18]:
import torch.nn as nn
torch.manual_seed(3407)
B, T, C = 4, 8, 32 # batch, time, channels (recall, channels is dimensionality of the input, e.g., now 32 for a 32-dimensional embedding)
x = torch.randn(B,T,C)

# Observe a single head perform self-attention
head_size = 16 # the head hyperparameter, being the number of dimensions in the query, key, and value vectors
key = nn.Linear(C, head_size, bias=False) # key vector roughly speaking means what do I contain
query = nn.Linear(C, head_size, bias=False) # query vector roughly speaking means what am I looking for
value = nn.Linear(C, head_size, bias=False) # value vector roughly speaking means what do I return
k = key(x) # (B, T, head_size)
q = query(x) # (B, T, head_size)

# the affinities are obtained by taking the dot product of the query and key vectors
wei = q @ k.transpose(-2,-1) # (B, T, head_size) @ (B, head_size, T) --> (B, T, T) --> wei is roughly speaking the affinity matrix

tril = torch.tril(torch.ones((T,T))) 
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)
v = value(x) # results in 16-dimensional vectors because that is the head_size
out = wei @ v 

out.shape # (B, T, head_size)

torch.Size([4, 8, 16])

> Observe $wei$, the matrix of affinities, as a matrix of lower triangular values:

In [19]:
wei[0]

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.9217, 0.0783, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2666, 0.1544, 0.5789, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0332, 0.4348, 0.2287, 0.3034, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0853, 0.0492, 0.1398, 0.5914, 0.1343, 0.0000, 0.0000, 0.0000],
        [0.0553, 0.2033, 0.0449, 0.5934, 0.0320, 0.0711, 0.0000, 0.0000],
        [0.1148, 0.0771, 0.0900, 0.0522, 0.0507, 0.2886, 0.3266, 0.0000],
        [0.1356, 0.0336, 0.0196, 0.0464, 0.0245, 0.2620, 0.2610, 0.2173]],
       grad_fn=<SelectBackward0>)
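
The same computation can be followed by hand on a tiny example. The sketch below is plain Python with hand-picked toy values for `q`, `k` and `v` (assumptions; in the cell above they come from learned linear projections of `x`). As in that cell, the scores are left unscaled.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(q, k, v):
    """Single-head causal attention on nested lists; q, k, v each hold T vectors."""
    T = len(q)
    out, weights = [], []
    for t in range(T):
        # dot product of query t with keys 0..t; future keys are masked with -inf
        scores = [sum(a * b for a, b in zip(q[t], k[j])) if j <= t else float('-inf')
                  for j in range(T)]
        w = softmax(scores)
        weights.append(w)
        # output t is the weighted sum of the value vectors
        out.append([sum(w[j] * v[j][c] for j in range(T)) for c in range(len(v[0]))])
    return out, weights

# hand-picked toy values (assumptions): T = 3 tokens, head_size = 2
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, wei = causal_attention(q, k, v)
```

The first token can only attend to itself, so its output equals its own value vector. Later rows of `wei` are no longer uniform, because the weights now depend on how well each query matches each earlier key.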

## Building the Transformer

#### Inserting a single self-attention block 

In [20]:
# single self-attention block
# (n_embd, block_size and dropout are hyperparameters assumed to be defined earlier, as in the consolidated code)
class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # lower-triangular mask, stored as a buffer so it is not treated as a trainable parameter
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)   # (B,T,head_size)
        q = self.query(x) # (B,T,head_size)
        # compute attention scores ("affinities"), scaled by 1/sqrt(head_size) as in Vaswani et al. (2017)
        wei = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5 # (B, T, head_size) @ (B, head_size, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,head_size)
        out = wei @ v # (B, T, T) @ (B, T, head_size) -> (B, T, head_size)
        return out

out.shape

torch.Size([4, 8, 16])
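
The inverse-square-root factor applied to the attention scores in `Head` keeps the softmax from saturating. If the entries of a query and a key are independent with unit variance, their dot product has variance equal to the vector length, and Vaswani et al. (2017) scale by one over the square root of that length to bring the variance back to about 1. The sketch below is a plain-Python simulation under the assumption of standard-normal entries, not a measurement on the trained model.

```python
import random

def dot_product_variance(head_size, scale, n_samples=5000, seed=3407):
    """Sample variance of scale * (q . k) for q, k with independent standard-normal entries."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(head_size)]
        k = [rng.gauss(0.0, 1.0) for _ in range(head_size)]
        samples.append(scale * sum(a * b for a, b in zip(q, k)))
    mean = sum(samples) / n_samples
    return sum((s - mean) ** 2 for s in samples) / (n_samples - 1)

head_size = 16
raw_var = dot_product_variance(head_size, scale=1.0)                   # expected to be close to head_size
scaled_var = dot_product_variance(head_size, scale=head_size ** -0.5)  # expected to be close to 1
```

With large unscaled scores the softmax concentrates almost all its weight on a single position, which makes each token attend to just one other token early in training.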

#### Updating the bigram model

In [21]:
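# bigram model updated with position embeddings, transformer blocks, a final layer norm and a language-model head
# (Block, n_embd, n_head and n_layer are assumed to be defined elsewhere, e.g., in the consolidated code)
class BigramLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        # token embeddings now map each token to an n_embd-dimensional vector rather than directly to logits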
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss
    
print(logits, loss)

tensor([[-1.5923,  2.3207, -2.0654,  ..., -3.3131, -0.5611, -2.9758],
        [-0.7643,  2.3739, -1.8800,  ..., -5.2282,  0.1946, -5.0154],
        [-0.7329,  2.2207, -1.6798,  ..., -4.9584, -0.7605, -4.1588],
        ...,
        [-0.7643,  2.3739, -1.8800,  ..., -5.2282,  0.1946, -5.0154],
        [-3.9107, -3.5252, -2.7594,  ..., -2.1235, -5.8703, -1.5282],
        [-1.5923,  2.3207, -2.0654,  ..., -3.3131, -0.5611, -2.9758]],
       device='cuda:0', grad_fn=<ViewBackward0>) tensor(2.5604, device='cuda:0', grad_fn=<NllLossBackward0>)
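
The size of a model with this structure can be tallied by hand. The sketch below assumes the hyperparameters and block layout of Karpathy's lecture code, none of which are stated in this post: `n_embd=64`, `n_head=4`, `n_layer=4`, `block_size=32`, a 4x feed-forward layer, an output projection after the heads, and two LayerNorms per block. Under those assumptions the tally agrees with the "0.209729 M parameters" reported in the training log further down, which suggests, but does not prove, that these were the settings used.

```python
# Hyperparameters assumed from Karpathy's lecture code; they are not stated in this post
vocab_size, block_size = 65, 32
n_embd, n_head, n_layer = 64, 4, 4
head_size = n_embd // n_head

token_emb = vocab_size * n_embd                  # token embedding table
pos_emb = block_size * n_embd                    # position embedding table
attention = n_head * 3 * n_embd * head_size      # key, query, value projections (no bias)
attn_proj = n_embd * n_embd + n_embd             # assumed output projection after concatenating heads
feed_forward = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)  # assumed 4x MLP
layer_norms = 2 * 2 * n_embd                     # assumed two LayerNorms per block (weight + bias)
per_block = attention + attn_proj + feed_forward + layer_norms
final_ln = 2 * n_embd                            # final LayerNorm (weight + bias)
lm_head = n_embd * vocab_size + vocab_size       # linear layer to vocabulary logits

total = token_emb + pos_emb + n_layer * per_block + final_ln + lm_head
print(total)
```

The `tril` mask does not appear in the tally because it is registered as a buffer rather than a parameter.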


#### Generate method

In [22]:
# generate method, to generate new tokens (a method of BigramLanguageModel, shown on its own here)
def generate(self, idx, max_new_tokens):
    # idx is (B, T) array of indices in the current context
    for _ in range(max_new_tokens):
        # crop idx to the last block_size tokens
        idx_cond = idx[:, -block_size:]
        # get the predictions
        logits, loss = self(idx_cond)
        # focus only on the last time step
        logits = logits[:, -1, :] # becomes (B, C)
        # apply softmax to get probabilities
        probs = F.softmax(logits, dim=-1) # (B, C)
        # sample from the distribution
        idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
        # append sampled index to the running sequence
        idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
    return idx
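
The crop to the last `block_size` tokens matters because the position embedding table has only `block_size` rows, so a longer context would index past its end. The sketch below shows the same sliding-window sampling loop in plain Python. `uniform_model` is a hypothetical stand-in for the transformer that ignores its context, used only so the loop can run on its own.

```python
import random

def generate(next_token_probs, idx, max_new_tokens, block_size, rng):
    """Autoregressive sampling with the context cropped to the last block_size tokens."""
    idx = list(idx)
    longest_context = 0
    for _ in range(max_new_tokens):
        context = idx[-block_size:]                  # crop, as in idx[:, -block_size:]
        longest_context = max(longest_context, len(context))
        probs = next_token_probs(context)            # one probability per vocabulary entry
        next_id = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        idx.append(next_id)                          # the full history keeps growing
    return idx, longest_context

def uniform_model(context, vocab_size=5):
    """Hypothetical stand-in for the transformer: ignores the context, predicts uniformly."""
    return [1.0 / vocab_size] * vocab_size

tokens, longest = generate(uniform_model, idx=[0], max_new_tokens=20,
                           block_size=8, rng=random.Random(3407))
```

The generated sequence can be arbitrarily long, but the model never conditions on more than `block_size` tokens at a time.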

## Consolidation

> Consolidating the above, my model now generates text outputs that are recognizably Shakespearean. The train loss is now 1.6488 and the val loss is 1.8093, a marked improvement on the bigram model's training loss of about 2.56. In a future post I will work on certain aspects of the model to improve the performance further and test it on different datasets. The model's training log and output are shown below. Please see __[my GitHub](https://github.com/johncollinsai)__ for the consolidated code, which I omit here because it slows the time it takes to open the post to an unacceptable level. 

0.209729 M parameters
step 0: train loss 4.2614, val loss 4.2568 <br>
step 100: train loss 2.6580, val loss 2.6605 <br>
step 200: train loss 2.5000, val loss 2.5046 <br>
step 300: train loss 2.4105, val loss 2.4309 <br>
step 400: train loss 2.3485, val loss 2.3569 <br>
step 500: train loss 2.2949, val loss 2.3087 <br>
step 600: train loss 2.2367, val loss 2.2507 <br>
step 700: train loss 2.1907, val loss 2.2175 <br>
step 800: train loss 2.1549, val loss 2.1812 <br>
step 900: train loss 2.1101, val loss 2.1578 <br>
step 1000: train loss 2.0802, val loss 2.1154 <br>
step 1100: train loss 2.0301, val loss 2.0998 <br>
step 1200: train loss 2.0254, val loss 2.0861 <br>
step 1300: train loss 1.9997, val loss 2.0595 <br>
step 1400: train loss 1.9890, val loss 2.0445 <br>
step 1500: train loss 1.9456, val loss 2.0219 <br>
step 1600: train loss 1.9181, val loss 2.0008 <br>
step 1700: train loss 1.9106, val loss 2.0062 <br>
step 1800: train loss 1.8987, val loss 2.0026 <br>
step 1900: train loss 1.8739, val loss 1.9658 <br>
step 2000: train loss 1.8701, val loss 1.9788 <br>
step 2100: train loss 1.8438, val loss 1.9617 <br>
step 2200: train loss 1.8322, val loss 1.9344 <br>
step 2300: train loss 1.8115, val loss 1.9326 <br>
step 2400: train loss 1.8084, val loss 1.9267 <br>
step 2500: train loss 1.7888, val loss 1.9249 <br>
step 2600: train loss 1.7759, val loss 1.9167 <br>
step 2700: train loss 1.7881, val loss 1.8967 <br>
step 2800: train loss 1.7682, val loss 1.8924 <br>
step 2900: train loss 1.7538, val loss 1.9109 <br>
step 3000: train loss 1.7535, val loss 1.8929 <br>
step 3100: train loss 1.7371, val loss 1.8828 <br>
step 3200: train loss 1.7228, val loss 1.8752 <br>
step 3300: train loss 1.7182, val loss 1.8677 <br>
step 3400: train loss 1.7155, val loss 1.8733 <br>
step 3500: train loss 1.7122, val loss 1.8637 <br>
step 3600: train loss 1.7037, val loss 1.8632 <br>
step 3700: train loss 1.6886, val loss 1.8564 <br>
step 3800: train loss 1.6866, val loss 1.8321 <br>
step 3900: train loss 1.6841, val loss 1.8379 <br>
step 4000: train loss 1.6814, val loss 1.8447 <br>
step 4100: train loss 1.6798, val loss 1.8399 <br>
step 4200: train loss 1.6841, val loss 1.8392 <br>
step 4300: train loss 1.6779, val loss 1.8295 <br>
step 4400: train loss 1.6667, val loss 1.8330 <br>
step 4500: train loss 1.6572, val loss 1.8032 <br>
step 4600: train loss 1.6613, val loss 1.8300 <br>
step 4700: train loss 1.6624, val loss 1.8185 <br>
step 4800: train loss 1.6433, val loss 1.8098 <br>
step 4900: train loss 1.6480, val loss 1.8206 <br>
step 4999: train loss 1.6488, val loss 1.8093 <br>

This price-nend; it wroable all more
I the to be in much sruch on the chargen tell dent,
Apurseticeit,--my regried, and mystory far Merch,
Red city not we decemblemanvy wantwith a, a mirch
Recors is rublence. You and AUo's faces fathee, the pausurals, and know
Her no swoot he mest not of me?
If rite and and true to latiul did crumb
Though yout with little are in them,
Frant? shall youble morry, thou march
To have noble, bender will bit but
In bisoved eyed
She dick you.

CORIOLANUS:
So place to whence she were was me ence,
So thou gamoust to a genaluest thee,
Furst your not Lord sly, it, that
Of but to brischarr with she peisul,
What by you waves behis duke fife?

First The foels.
And that sit dest not to
be work enters to cannowis cutle thrive with fother
So firthed: do and I RI women with be orish with loved
And to mothanking of out wut for nubly hast
uphis farels didstruqer shee will.

BUCHBOLANR:
Duke you, to I she laking be got an keepost in thyse.

GLOUCESTER:
But, derrous not my denamous and.
But and he duked from happy this furnily dears igk.

Citid:
The vorse disque tiruthlo.
I with the life.

Pirst That of, not the leaver off it wort?
It I wovet the boundrangt, the pooly, Like a so!
I how that crum the timet clring,
Englink marry
him a somet, not do?
We sleep take jockita, with
saward sweet in learent? throw thou know in knebears
Twrue,
Promeslesed hurse the brace, good's dear that?'

Nurst known
As belo.

DUKE VINCENLOND:
Heldsher mer my trine juke wuthers:
We work, work. QUE:
Where I would lossed a maber gived?

Rethed Setchmness.

PULY:
O, wacrimb'd noble and live honesters,
The giver to ut to go arminy theims?
Call tis by reforge emaget,; dishes,
Which sit dakes what she sayit
As formoul muty cannot his hamelvenator:
That goness, kneel. My comy,
And say how a wife works Most, myce.
Come, is that work, and me astshumble
Formelf'd couses, yet should never frife!
To well devoll on life, it no true,
is marcher and shall spy give, not hother

# References

Brown, T.B., et al. (2020). Language Models are Few-Shot Learners. arXiv:2005.14165

__[Colab for Karpathy's video](https://colab.research.google.com/drive/1JMLa53HDuA-i7ZBmqV7ZnA3c_fvtXnx-?usp=sharing)__

__[GitHub repo for Anton Bacaj's transformer architecture diagrams](https://github.com/abacaj/transformers)__

__[GitHub repo for Karpathy's video](https://github.com/karpathy/ng-video-lecture)__

__[Karpathy's nanoGPT GitHub repo](https://github.com/karpathy/nanoGPT)__

__[Karpathy's YouTube video](https://www.youtube.com/watch?v=kCc8FmEb1nY)__

Picard, D. (2021). Torch.manual_seed(3407) is all you need: On the influence of random seeds in deep learning architectures for computer vision. arXiv:2109.08203 

Vaswani, A., et al. (2017).  Attention Is All You Need. arXiv:1706.03762
