# Nano GPT Modification

## Part 1

In [1]:
%run data/shakespeare_char/prepare.py

length of dataset in characters: 1,115,394
all the unique characters: 
 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
vocab size: 65
train has 1,003,854 tokens
val has 111,540 tokens


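Now, let's train the model. I'll run a smaller model than the default config, but still for 5000 iterations. With `n_embd=256` and `n_head=4`, the keys, values, and queries in each head have dimensionality $256/4=64$. The training script has no early stopping, so when evaluating models I'll look through the log and report the lowest validation loss, since that's where we'd want to stop. Because the config sets `always_save_checkpoint = False`, the checkpoint is only saved when validation loss improves, so the model we sample from corresponds to that lowest-loss evaluation.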

In [19]:
%run train.py config/train_shakespeare_char.py --device=cuda --compile=False --eval_iters=200 --log_interval=100 --block_size=128 --batch_size=12 --n_layer=4 --n_head=4 --n_embd=256 --max_iters=5000 --lr_decay_iters=5000 --dropout=0.0

Overriding config with config/train_shakespeare_char.py:
# train a miniature character-level shakespeare model
# good for debugging and playing on macbooks and such

out_dir = 'out-shakespeare-char'
eval_interval = 250 # keep frequent because we'll overfit
eval_iters = 200
log_interval = 10 # don't print too too often

# we expect to overfit on this small dataset, so only save when val improves
always_save_checkpoint = False

wandb_log = False # override via command line if you like
wandb_project = 'shakespeare-char'
wandb_run_name = 'mini-gpt'

dataset = 'shakespeare_char'
gradient_accumulation_steps = 1
batch_size = 64
block_size = 256 # context of up to 256 previous characters
wind = -1

# baby GPT model :)
n_layer = 6
n_head = 6
n_embd = 384
n_kqv = -1
dropout = 0.2
big_mlp = False

learning_rate = 1e-3 # with baby networks can afford to go a bit higher
max_iters = 5000
lr_decay_iters = 5000 # make equal to max_iters usually
min_lr = 1e-4 # learning_rate / 10 usually
beta2 = 0.99 #

We can see that the final validation loss is 1.5214.

Here is a small sample of what this model generates. We see that it looks like it follows the pattern of Shakespeare's writing, and creates real words. However, when reading it, we see that the story makes no sense. It's just stringing phrases together. This is probably because we don't have a long enough context length, and also because we're using characters as tokens.

In [3]:
%run small_sample.py --out_dir=out-shakespeare-char

Overriding: out_dir = out-shakespeare-char
number of parameters: 3.16M
Loading meta from data\shakespeare_char\meta.pkl...


KING RICHARD II:
Shall I, be made to be execution?

QUEEN ELIZABETH:
My hard will I beseech you a frawn to you,
Do not him the own of of it.

DUKE OF YORK:
When is the neceed?
With clook the deed blood with your pelproof,
That like in childishonourish her years in sun.

SICINIUS:
Indeedly, demong all alived with pieces with
For rivers and men to-night.

Third Citizen:
Why, who resolve me to thy fleet to fight.

CORIOLANUS:
I see a time is a point of him for hithers.

DUCHESS OF YORK:
And for th
---------------


## Part 2

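For this part, I used the existing linear layer (`c_attn`) to scale down the dimensionality of the keys, queries, and values. It takes in vectors of size `n_embd` and outputs keys, queries, and values of the desired size, `n_kqv` per head. The output projection (`c_proj`) then maps the concatenated heads, of size `n_kqv * n_head`, back to `n_embd`. Instead of using the full per-head embedding size, the network learns fewer, but more general, encodings of tokens.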

In [None]:
import math

import torch
import torch.nn as nn
from torch.nn import functional as F

class CompressedCausalSelfAttention(nn.Module):

    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        # key, query, value projections for all heads, but in a batch
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_kqv * config.n_head, bias=config.bias)
        # output projection
        self.c_proj = nn.Linear(config.n_kqv * config.n_head, config.n_embd, bias=config.bias)
        # regularization
        self.attn_dropout = nn.Dropout(config.dropout)
        self.resid_dropout = nn.Dropout(config.dropout)
        self.n_head = config.n_head
        self.n_embd = config.n_embd
        self.n_kqv = config.n_kqv
        self.dropout = config.dropout
        # flash attention make GPU go brrrrr but support is only in PyTorch >= 2.0
        self.flash = hasattr(torch.nn.functional, 'scaled_dot_product_attention')
        if not self.flash:
            print("WARNING: using slow attention. Flash Attention requires PyTorch >= 2.0")
            # causal mask to ensure that attention is only applied to the left in the input sequence
            self.register_buffer("bias", torch.tril(torch.ones(config.block_size, config.block_size))
                                        .view(1, 1, config.block_size, config.block_size))

    def forward(self, x):
        B, T, _ = x.size() # batch size, sequence length, embedding dimensionality (n_embd)

        # calculate query, key, values for all heads in batch and move head forward to be the batch dim
        q, k, v  = self.c_attn(x).split(self.n_kqv * self.n_head, dim=2)
        k = k.view(B, T, self.n_head, self.n_kqv).transpose(1, 2) # (B, nh, T, hs)
        q = q.view(B, T, self.n_head, self.n_kqv).transpose(1, 2) # (B, nh, T, hs)
        v = v.view(B, T, self.n_head, self.n_kqv).transpose(1, 2) # (B, nh, T, hs)

        # causal self-attention; Self-attend: (B, nh, T, hs) x (B, nh, hs, T) -> (B, nh, T, T)
        if self.flash:
            # efficient attention using Flash Attention CUDA kernels
            y = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=None, dropout_p=self.dropout if self.training else 0, is_causal=True)
        else:
            # manual implementation of attention
            att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
            att = att.masked_fill(self.bias[:,:,:T,:T] == 0, float('-inf'))
            att = F.softmax(att, dim=-1)
            att = self.attn_dropout(att)
            y = att @ v # (B, nh, T, T) x (B, nh, T, hs) -> (B, nh, T, hs)
        y = y.transpose(1, 2).contiguous().view(B, T, self.n_kqv * self.n_head) # re-assemble all head outputs side by side

        # output projection
        y = self.resid_dropout(self.c_proj(y))
        return y
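
To make the shape bookkeeping in the class above concrete, here is a minimal sketch that traces shapes only, with no actual tensors. The helper name `compressed_attention_shapes` and its dictionary keys are made up for illustration and are not part of the training code; the numbers plugged in are the command-line values used in this part with `n_kqv=32`.

```python
def compressed_attention_shapes(B, T, n_embd, n_head, n_kqv):
    """Trace tensor shapes through a compressed attention block (shapes only)."""
    inner = n_kqv * n_head  # width of each of q, k, v after the split
    return {
        "input": (B, T, n_embd),
        "c_attn": (B, T, 3 * inner),        # q, k, v packed side by side
        "q_k_v_each": (B, T, inner),        # after .split(inner, dim=2)
        "per_head": (B, n_head, T, n_kqv),  # after .view(...).transpose(1, 2)
        "scores": (B, n_head, T, T),        # q @ k^T
        "merged": (B, T, inner),            # heads re-assembled side by side
        "output": (B, T, n_embd),           # after c_proj
    }


shapes = compressed_attention_shapes(B=12, T=128, n_embd=256, n_head=4, n_kqv=32)
for name, shape in shapes.items():
    print(f"{name:>10}: {shape}")
```

The only place `n_embd` appears is at the input and the output, so the residual stream keeps its full width while everything inside the attention block works at width `n_kqv * n_head`.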

Let's run the model with the dimensionality of our keys, queries, and values set to 32.

In [21]:
%run train.py config/train_shakespeare_char.py --device=cuda --compile=False --eval_iters=200 --log_interval=100 --block_size=128 --batch_size=12 --n_layer=4 --n_head=4 --n_embd=256 --n_kqv=32 --max_iters=5000 --lr_decay_iters=5000 --dropout=0.0

Here, after 5000 iterations, we can see that the final validation loss is 1.5125. It's interesting that this is slightly lower than the original model's 1.5214, even though the attention block is smaller. This could mean that learning fewer, more general "meanings" of tokens works well. However, rerunning either model gives a somewhat different loss, so a gap this small is probably not significant, and the safer conclusion is that the two models perform about the same. Given that, we'd probably want to use this model over the original one, since the smaller attention block makes it run faster.
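
As a rough check on how much smaller the attention block gets, the following sketch counts only the weights of `c_attn` and `c_proj`. It assumes biases are ignored (the value of `config.bias` is not shown in this notebook), and the helper name `attention_weights` is hypothetical.

```python
def attention_weights(n_embd, n_head, head_dim):
    """Weights in c_attn plus c_proj for one attention block, biases ignored."""
    inner = n_head * head_dim
    c_attn = n_embd * 3 * inner  # projects n_embd -> q, k, v
    c_proj = inner * n_embd      # projects concatenated heads -> n_embd
    return c_attn + c_proj


n_layer, n_embd, n_head = 4, 256, 4
baseline = attention_weights(n_embd, n_head, n_embd // n_head)
for head_dim in (64, 32, 8):
    per_layer = attention_weights(n_embd, n_head, head_dim)
    saved = n_layer * (baseline - per_layer)
    print(f"head_dim={head_dim:>2}: {per_layer:>7,} weights per layer, "
          f"{saved:>7,} fewer than the baseline overall")
```

Under these assumptions the savings come to about 0.52M weights for a head size of 32 and about 0.92M for a head size of 8, which is in line with the parameter counts the sampling script prints for the three models (3.16M, 2.64M, and 2.25M).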

Here is a small sample of what this model generates. This doesn't look much different from the original model's sample, which makes sense given that we didn't see too big of a difference in loss.

In [5]:
%run small_sample.py --out_dir=out-shakespeare-char --device=cuda

Overriding: out_dir = out-shakespeare-char
Overriding: device = cuda
number of parameters: 2.64M
Loading meta from data\shakespeare_char\meta.pkl...

Which if this writen in the time: sea hope to take
On thy called heaven that us he had been the
cause with my father to him. This is more in here,
His nurse, and is enter'd will into drop the deed?
I will not joy told in a wall--humble child:
And the self, yet like us or done,
And whom the selfsame of all out him prizes to the
curious penitence of him the rest.

KING RICHARD II:
And he must not know on, for her shame of them;
Leave too discapadin my brother's discontent.

QUEEN ELIZABETH:
Edward
---------------


Now let's run the model with n_kqv = 8.

In [22]:
%run train.py config/train_shakespeare_char.py --device=cuda --compile=False --eval_iters=200 --log_interval=100 --block_size=128 --batch_size=12 --n_layer=4 --n_head=4 --n_embd=256 --n_kqv=8 --max_iters=5000 --lr_decay_iters=5000 --dropout=0.0

After 5000 iterations, we see that the lowest validation loss is 1.5152. This is initially surprising, because I would've expected a significantly worse loss. Instead, it slightly outperforms the original 64-dimensional model (1.5214) and is only marginally behind the 32-dimensional model (1.5125). These differences could just be due to chance, since the loss varies between runs, but they do indicate minimal difference in loss between 32-dimensional and 8-dimensional keys, queries, and values. We'd probably want to choose this model, since it achieves a similar loss while running faster.

One thing to note is that the 32D model achieves a lower training loss than the 8D model (1.3821 for 32D and 1.3976 for 8D) without a meaningfully better validation loss, so the extra capacity seems to go mostly into fitting the training set. Intuitively, this sounds reasonable because we're only generating characters, and there's only so much meaning that a character can encode as opposed to sub-words. Even with a dimensionality of 64, our original model likely isn't able to follow the plot of the story, and only generates real words that don't actually fit together. I suspect that a much larger model (especially one using sub-words instead of characters as tokens) would suffer much more from reducing the dimensionality of the keys, queries, and values.

Here is a small sample of what this model generates. It also looks similar to the two previous samples.

In [7]:
%run small_sample.py --out_dir=out-shakespeare-char --device=cuda

Overriding: out_dir = out-shakespeare-char
Overriding: device = cuda
number of parameters: 2.25M
Loading meta from data\shakespeare_char\meta.pkl...


KING RICHARD II:
Shall I, but be set body to take and my called
My arm that unquit to be to late,
But when you fall to his man own proof,
And both it now in a grieve--
But will it there by the duke presence with
self-jodice of the milde in the princess,
why have not know passion of themself.

BRUTUS:
No, so what evil her at thou will not in his son
To hear unto and thrusted so;
And he must not male of a fairer services.

HENRY BOLINGBROKE:
Sadam, the Earl of England and and Hereford,
And for th
---------------


One thing we've learned from this is that the embedding for a single character might not carry much meaning, and we might not even need a very high dimensional space to capture the different meanings of the characters. A large space the size of the alphabet might essentially learn something similar to one-hot encoding.
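
For a sense of scale, this small sketch compares the width a one-hot code would need for this vocabulary with the minimum width of a binary code that still distinguishes every character. It is only a counting argument, not a claim about what the trained embeddings actually look like.

```python
vocab_size = 65  # from the output of prepare.py

one_hot_dims = vocab_size                      # one dimension per character
binary_dims = (vocab_size - 1).bit_length()    # bits needed to index 65 symbols

print(f"one-hot: {one_hot_dims} dims, binary code: {binary_dims} dims")
```

Both numbers are well below the `n_embd=256` used in Parts 1 and 2, so there is plenty of room in the embedding to separate the characters.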

## Part 3

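Here, I implemented a mask very similar to the causal (future) mask, except that it hides the distant past: each token can attend only to itself and the `wind - 1` tokens before it. I removed the flash attention path because I can't modify its built-in causal masking (`is_causal=True`), and used the manual attention implementation instead, where I can apply my own mask.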

In [None]:
class WindowCausalSelfAttention(nn.Module):

    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        # key, query, value projections for all heads, but in a batch
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd, bias=config.bias)
        # output projection
        self.c_proj = nn.Linear(config.n_embd, config.n_embd, bias=config.bias)
        # regularization
        self.attn_dropout = nn.Dropout(config.dropout)
        self.resid_dropout = nn.Dropout(config.dropout)
        self.n_head = config.n_head
        self.n_embd = config.n_embd
        self.dropout = config.dropout
        # causal mask to ensure that attention is only applied to the left in the input sequence
        self.register_buffer("bias", torch.tril(torch.ones(config.block_size, config.block_size))
                                    .view(1, 1, config.block_size, config.block_size))
        self.register_buffer("window", torch.tril(torch.ones(config.block_size + config.wind, config.block_size + config.wind))[:config.block_size, config.wind:]
                                    .view(1, 1, config.block_size, config.block_size))

    def forward(self, x):
        B, T, C = x.size() # batch size, sequence length, embedding dimensionality (n_embd)

        # calculate query, key, values for all heads in batch and move head forward to be the batch dim
        q, k, v  = self.c_attn(x).split(self.n_embd, dim=2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)

        # causal self-attention; Self-attend: (B, nh, T, hs) x (B, nh, hs, T) -> (B, nh, T, T)
        # manual implementation of attention
        att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
        att = att.masked_fill((self.bias - self.window)[:,:,:T,:T] == 0, float('-inf'))
        att = F.softmax(att, dim=-1)
        att = self.attn_dropout(att)
        y = att @ v # (B, nh, T, T) x (B, nh, T, hs) -> (B, nh, T, hs)
        y = y.transpose(1, 2).contiguous().view(B, T, C) # re-assemble all head outputs side by side

        # output projection
        y = self.resid_dropout(self.c_proj(y))
        return y
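
The mask arithmetic in `__init__` is easy to get wrong by one, so here is a pure-Python sketch that rebuilds `bias - window` with nested lists and prints it for a tiny block. The function name `window_mask` is an illustration only; the slicing mirrors the `torch.tril(...)[:block_size, wind:]` expression above.

```python
def window_mask(block_size, wind):
    """Return a 0/1 matrix; row i marks the key positions query i may attend to."""
    size = block_size + wind
    # Lower-triangular ones, the list analogue of torch.tril(torch.ones(size, size))
    tril = [[1 if j <= i else 0 for j in range(size)] for i in range(size)]
    # Standard causal mask: key j is visible to query i when j <= i
    bias = [row[:block_size] for row in tril[:block_size]]
    # Shifted triangle: keys that sit at least `wind` positions behind the query
    window = [row[wind:] for row in tril[:block_size]]
    # Causal minus "too old" leaves only the most recent `wind` positions
    return [[b - w for b, w in zip(b_row, w_row)]
            for b_row, w_row in zip(bias, window)]


for row in window_mask(block_size=6, wind=3):
    print(row)
```

Each row ends up with at most `wind` ones, covering the query position itself and the positions just before it. PyTorch's `F.scaled_dot_product_attention` also takes an `attn_mask` argument, so a mask like this could in principle be passed there instead of dropping that code path; that route is untested here.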

In [16]:
%run train.py config/train_shakespeare_char.py --device=cuda --compile=False --eval_iters=200 --log_interval=100 --block_size=128 --wind=100 --batch_size=12 --n_layer=4 --n_head=4 --n_embd=128 --max_iters=5000 --lr_decay_iters=5000 --dropout=0.0

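Our lowest validation loss is 1.5868, somewhat worse than the original model. This makes sense because we're no longer using dense attention, so some later tokens can't attend to the earliest ones. Note, though, that this run also uses `n_embd=128` rather than 256 (0.80M parameters versus 3.16M), so part of the gap may come from the smaller model rather than from the window.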
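
To quantify how much attention a window actually removes, this sketch counts the (query, key) pairs that the causal mask would allow but the window hides. The helper name `hidden_pairs` is hypothetical, and the count assumes full-length sequences of `block_size` tokens.

```python
def hidden_pairs(block_size, wind):
    """Count causal (query, key) pairs that fall outside the window."""
    causal = block_size * (block_size + 1) // 2
    # Query i sees i + 1 keys causally, but only the latest `wind` of them
    hidden = sum(max(0, i + 1 - wind) for i in range(block_size))
    return hidden, causal


for wind in (100, 10, 3):
    hidden, causal = hidden_pairs(block_size=128, wind=wind)
    print(f"wind={wind:>3}: {hidden:>5} of {causal} causal pairs hidden "
          f"({100 * hidden / causal:.1f}%)")
```

With `block_size=128` and `wind=100`, only the last 28 positions lose any context, and only 406 of the 8,256 causal pairs (about 5%) are hidden, so this run is still very close to dense attention.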

Here's a sample of what this model produces. It looks very similar to those before.

In [9]:
%run small_sample.py --out_dir=out-shakespeare-char --device=cuda

Overriding: out_dir = out-shakespeare-char
Overriding: device = cuda
number of parameters: 0.80M
Loading meta from data\shakespeare_char\meta.pkl...


All that king's and is and the dies, but our hand,
Say the galden bark that us arm to bereding
Hate away will not daughterous
Your proceedstering maided;
Which misterible, will it the overture.

WARWICK:
Alas, my else noble lie-husbacce in the princess,
why hold not not doth in destring the light.

KING RICHARD III:
Then are do river in him, strong of his burther
In crupt for arms. Long for thy fled, the shall be bod?

KING RICHARD III:
Hark, and the Edward's in courtear tears,--

SICINIUS:
Now
---------------


In [17]:
%run train.py config/train_shakespeare_char.py --device=cuda --compile=False --eval_iters=200 --log_interval=100 --block_size=128 --wind=10 --batch_size=12 --n_layer=4 --n_head=4 --n_embd=128 --max_iters=5000 --lr_decay_iters=5000 --dropout=0.0

We see that the final validation loss is 1.5632, which is lower than with the size 100 sliding window (1.5868). This is somewhat surprising to me, because I would've expected shorter windows to cause higher loss. However, one explanation could be that since we're encoding tokens as single characters, there's very little meaning to be found in a relationship between, say, the first character and the 100th character. The model may try to find a relationship where there isn't really one, which worsens the loss.

Here's a sample of what this model produces. It looks very similar to those before.

In [11]:
%run small_sample.py --out_dir=out-shakespeare-char --device=cuda

Overriding: out_dir = out-shakespeare-char
Overriding: device = cuda
number of parameters: 0.80M
Loading meta from data\shakespeare_char\meta.pkl...


Clown:
Redide with the sweeter and thanks to take and my call'd
My worth heart him to back the waters.

MENENIUS:
I come to hear them!

Citizens:
His noble caught well. I hear them, like the deeping of his eyes,
Well and tears the mark with a cherish'd,
why holy name noble to gold him.

LEONTES:
I look your what evils, the most right in him,
And of my heart
To king three for on; and there three fled
At that Prince with the good of the discords and his great hope courts.

ANGELO:
Why, for I vali
---------------


In [18]:
%run train.py config/train_shakespeare_char.py --device=cuda --compile=False --eval_iters=200 --log_interval=100 --block_size=128 --wind=3 --batch_size=12 --n_layer=4 --n_head=4 --n_embd=128 --max_iters=5000 --lr_decay_iters=5000 --dropout=0.0

The final validation loss here is 1.5440, even lower than when my window was of size 10. The same explanation as before may apply in this case, because there might only be a significant relationship between maybe the last two or three letters when predicting the next letter.
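
One caveat to this explanation: a window limits what a single layer can see, but stacked layers can pass information further back. The sketch below computes the largest span of input positions that could, in principle, influence one output position, assuming every layer uses the same window and that each layer extends the reach by `wind - 1` positions. The function name `receptive_field` is my own.

```python
def receptive_field(n_layer, wind, block_size):
    """Upper bound on input positions that can influence one output position."""
    # Each layer lets a position look back wind - 1 steps; layers compound,
    # and the span can never exceed the context length.
    return min(block_size, n_layer * (wind - 1) + 1)


for wind in (100, 10, 3):
    span = receptive_field(n_layer=4, wind=wind, block_size=128)
    print(f"wind={wind:>3}: up to {span} characters of context")
```

So even with `wind=3`, the 4-layer model can in principle draw on up to 9 characters, not just the last two or three. Whether it actually uses that reach is not something these runs show.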

Here's a sample of what this model produces. It looks very similar to those before.

In [13]:
%run small_sample.py --out_dir=out-shakespeare-char --device=cuda

Overriding: out_dir = out-shakespeare-char
Overriding: device = cuda
number of parameters: 0.80M
Loading meta from data\shakespeare_char\meta.pkl...


All:
If this world, is not the disposition of any ready me?

Third York, I put of her bar die.

GLOUCESTER:
What's not how make
the moof in heaven! What like their eyes accessarise in overtach.

WARWICK:
Alas, my mislike than this mild him speak; and the tyrant to see thou tell thee may should that
more counsel, evils, the modestion with a choler, my lord.

QUEEN MARGARET:
I'll sworn will the pinless than first?

KING HENRY VI:
Hark,
And all day Warwick, stay in courtear thy very soldiers care,
---------------


## Part 4

In [None]:
class MLP2(nn.Module):

    def __init__(self, config):
        super().__init__()
        self.c_fc1    = nn.Linear(config.n_embd, 4 * config.n_embd, bias=config.bias)
        self.c_fc2    = nn.Linear(config.n_embd, 4 * config.n_embd, bias=config.bias)
        self.gelu    = nn.GELU()
        self.c_proj  = nn.Linear(4 * config.n_embd, config.n_embd, bias=config.bias)
        self.dropout = nn.Dropout(config.dropout)

    def forward(self, x):
        x1 = self.c_fc1(x)
        x2 = self.c_fc2(x)
        x = x1*x2
        x = self.gelu(x)
        x = self.c_proj(x)
        x = self.dropout(x)
        return x

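The modified MLP adds a second up-projection (`c_fc2`) in parallel with the first and multiplies the two outputs elementwise before the GELU. The model therefore has more parameters, simply because of the extra linear layer in each MLP. This could result in a slight improvement in the loss.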
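
The size of that increase can be estimated by counting weights. This is a sketch that ignores biases (the value of `config.bias` is not shown here), and the helper name `mlp_weights` and its `two_branch` flag are hypothetical.

```python
def mlp_weights(n_embd, two_branch):
    """Weights in one MLP block, biases ignored."""
    up = n_embd * 4 * n_embd    # c_fc (or each of c_fc1 / c_fc2)
    down = 4 * n_embd * n_embd  # c_proj
    return (2 * up if two_branch else up) + down


n_layer, n_embd = 4, 128
standard = mlp_weights(n_embd, two_branch=False)
modified = mlp_weights(n_embd, two_branch=True)
extra = n_layer * (modified - standard)
print(f"standard: {standard:,}  modified: {modified:,}  extra overall: {extra:,}")
```

With `n_embd=128` and 4 layers, the extra branch adds about 0.26M weights, which matches the difference between the 0.80M parameters reported for the `n_embd=128` runs in Part 3 and the 1.06M reported for this model.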

In [20]:
%run train.py config/train_shakespeare_char.py --device=cuda --compile=False --eval_iters=200 --log_interval=100 --block_size=128 --batch_size=12 --n_layer=4 --n_head=4 --n_embd=128 --big_mlp=True --max_iters=5000 --lr_decay_iters=5000 --dropout=0.0

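After 5000 iterations, we get a lowest validation loss of 1.5232. This doesn't improve upon the original model's loss (1.5214), which could indicate that we might be overfitting. The comparison isn't quite like-for-like, though: this run uses `n_embd=128` and has about a third of the original's parameters (1.06M versus 3.16M). The closest same-width runs are the windowed models from Part 3 (0.80M parameters, 1.5440 to 1.5868), which it beats, although those differ in their attention as well. I see that the validation loss at step 4750 is lower than the validation loss at step 5000, and the training loss is much lower than the validation loss, which is a sign of overfitting. To truly test this model, we might want to use a larger validation set.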

Here's a sample of what this model produces. It looks very similar to those before.

In [15]:
%run small_sample.py --out_dir=out-shakespeare-char --device=cuda

Overriding: out_dir = out-shakespeare-char
Overriding: device = cuda
number of parameters: 1.06M
Loading meta from data\shakespeare_char\meta.pkl...

And they bear own for some blood: and but our care
On the day of heaven that usurply but heaven
Hate away way.

HERMIONE:
In heavens, to this common wherein, and if enter mine
Still in overtain.

WARWICK:
Alas, my master, madam, he must have stand and plaw you:
That I may pater to your highness that
To Warwick thy city and them, and his shame.
Thou art heaven the danger first of might well
That I leave, to fight him for that been!
And by and advance in a footh in courtey,
And Ireland have for hi
---------------


## Summary

Ultimately, it seems that most (possibly all) of the models we trained overfit. They all have a training loss that's significantly lower than the validation loss, which suggests we should have stopped earlier. I think it's difficult for this model to generate anything beyond meaningless strings of words because we're encoding characters as tokens, rather than sub-words. Another conclusion is that the embedding for a single character might not carry much meaning, so we might not need a very high-dimensional space to capture the different meanings of the characters; a large space might essentially learn a one-hot encoding.

To get output that sounds like a real Shakespeare story, we'd have to significantly improve the model by increasing the context length (or just the size of the model in general), tokenizing into sub-words, and training on more data. Right now, with this small model, we can generate words that sort of sound like Shakespeare, but they won't pass for the real thing.