# Google Colab Setup

Please run the code below to mount drive if you are running on colab.

Please ignore if you are running on your local machine.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
%cd /content/drive/MyDrive/23-24/239/MiniGPT

/content/drive/MyDrive/23-24/239/MiniGPT


# Language Modeling and Transformers

The project will consist of two broad parts.

1. **Baseline Generative Language Model**: We will train a simple Bigram language model on the text data. We will use this model to generate a mini story.
2. **Implementing Mini GPT**: We will implement a mini version of the GPT model layer by layer and attempt to train it on the text data. You will then load pretrained weights provided and generate a mini story.

## Some general instructions

1. Please keep the name of layers consistent with what is requested in the `model.py` file for each layer, this helps us test in each function independently.
2. Please check to see if the bias is to be set to false or true for all linear layers (it is mentioned in the doc string)
3. As a general rule please read the docstring well, it contains information you will need to write the code.
4. All configs are defined in `config.py` for the first part while you are writing the code do not change the values in the config file since we use them to test. Once you have passed all the tests please feel free to vary the parameter as you please.
5. You will need to fill in the `train.py` and run it to train the model. If you are running into memory issues please feel free to change the `batch_size` in the `config.py` file. If you are working on Colab please make sure to use the GPU runtime and feel free to copy over the training code to the notebook.

In [3]:
!pip install numpy torch tiktoken wandb einops # Install all required packages

Collecting tiktoken
  Downloading tiktoken-0.7.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.1/1.1 MB[0m [31m8.0 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting wandb
  Downloading wandb-0.17.0-py3-none-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.7 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m6.7/6.7 MB[0m [31m25.1 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting einops
  Downloading einops-0.8.0-py3-none-any.whl (43 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m43.2/43.2 kB[0m [31m5.9 MB/s[0m eta [36m0:00:00[0m
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch)
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
Col

In [2]:
%load_ext autoreload
%autoreload 2

In [1]:
import torch
import tiktoken

In [3]:
from model import BigramLanguageModel, SingleHeadAttention, MultiHeadAttention, FeedForwardLayer, LayerNorm, TransformerLayer, MiniGPT
from config import BigramConfig, MiniGPTConfig
import tests

In [5]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cpu


In [5]:
path_to_bigram_tester = "./pretrained_models/bigram_tester.pt" # Load the bigram model with name bigram_tester.pt
path_to_gpt_tester = "./pretrained_models/minigpt_tester.pt" # Load the gpt model with name minigpt_tester.pt

##  Bigram Language Model (10 points)

A bigram language model is a type of probabilistic language model that predicts a word given the previous word in the sequence. The model is trained on a text corpus and learns the probability of a word given the previous word.



### Implement the Bigram model (5 points)

Please complete the `BigramLanguageModel` class in model.py. We will model a Bigram language model using a simple MLP with one hidden layer. The model will take in the previous word index and output the logits over the vocabulary for the next word.

In [10]:
# Test implementation for Bigram Language Model
model = BigramLanguageModel(BigramConfig)
tests.check_bigram(model,path_to_bigram_tester, device)

'TEST CASE PASSED!!!'

### Training the Bigram Language Model (2.5 points)

Complete the code in `train.py` to train the Bigram language model on the text data. Please provide plots for both the training and validation in the cell below.

Some notes on the training process:

1. You should be able to train the model slowly on your local machine.
2. Training it on Colab will help with speed.
3.  <span style="color:red">To get full points for this section it is sufficient to show that the loss is decreasing over time</span>. You should see it saturate to a value close to around 5-6 but as long as you see it decreasing then saturating you should be good.
4. Please log the loss curves either on wandb, tensorboard or any other logger of your choice and please attach them below.

In [10]:
import wandb
wandb.login()

<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

 ··········


[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

In [11]:
"""
Training file for the models we implemented
"""

from pathlib import Path

import torch
import torch.nn as nn
import torch.nn.utils
from torch.utils.data import DataLoader
from einops import rearrange
import wandb
from tqdm import tqdm

from model import BigramLanguageModel, MiniGPT
from dataset import TinyStoriesDataset
from config import BigramConfig, MiniGPTConfig


MODEL = "bigram"  # bigram or minigpt

if MODEL == "bigram":
    config = BigramConfig
    model = BigramLanguageModel(config)
elif MODEL == "minigpt":
    config = MiniGPTConfig
    model = MiniGPT(config)
else:
    raise ValueError("Invalid model name")


# Initialize wandb if you want to use it
if config.to_log:
    wandb.init(project="dl2_proj3")


def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


train_dataset = TinyStoriesDataset(
    config.path_to_data,
    mode="train",
    context_length=config.context_length,
)
eval_dataset = TinyStoriesDataset(
    config.path_to_data, mode="test", context_length=config.context_length
)

train_dataloader = DataLoader(
    train_dataset, batch_size=config.batch_size, pin_memory=True
)
eval_dataloader = DataLoader(
    eval_dataset, batch_size=config.batch_size, pin_memory=True
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

print("number of trainable parameters: %.2fM" % (count_parameters(model) / 1e6,))


if not Path.exists(config.save_path):
    Path.mkdir(MiniGPTConfig.save_path, parents=True, exist_ok=True)


### ==================== START OF YOUR CODE ==================== ###
"""
You are required to implement the training loop for the model.

Please keep the following in mind:
- You will need to define an appropriate loss function for the model.
- You will need to define an optimizer for the model.
- You are required to log the loss (either on wandb or any other logger you prefer) every `config.log_interval` iterations.
- It is recommended that you save the model weights every `config.save_iterations` iterations you can also just save the model with the best training loss.

Please check the config file to see the different configurations you can set for the model.
NOTE :
The MiniGPT config has params that you do not need to use, these were added to scale the model but are
not a required part of the assignment.
Feel free to experiment with the parameters and I would be happy to talk to you about them if interested :)
"""

#========Set Loss Function and Optimizer========#
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay = 1e-4)
#===============================================#

#========Set Save Path========#
best_model_params_path = "./models/" + MODEL + "/best_model_params.pt"
torch.save(model.state_dict(), best_model_params_path)
#=============================#

#========Bookkeeping========#
best_train_loss = 100.0
iteration = 0
num_epochs = 1
#===========================#

model = model.to(device)

#========Training Loop========#
for epoch_idx in tqdm(range(num_epochs)):
    for inputs, targets in train_dataloader:
        if iteration==5000: #hard code to stop at 5000 iterations
          break
        model.train()
        optimizer.zero_grad()
        inputs = inputs.to(device)
        targets = targets.to(device)
        logits = model(inputs)
        logits = logits.transpose(1,2)
        loss = criterion(logits, targets)
        loss.backward()
        optimizer.step()
        iteration += 1

        if loss.item() < best_train_loss: #save model with best training loss
            best_train_loss = loss.item()
            torch.save(model.state_dict(), best_model_params_path)

        if iteration%config.log_interval == 0: #record the loss in wandb and validate
            wandb.log({"Training loss": loss.item()})
            model.eval()
            with torch.no_grad():
                total_loss = 0.0
                num_batches = 20
                for i in range(num_batches): #validate on 20 batches from the eval dataset
                    batch = next(iter(eval_dataloader))
                    inputs, targets = batch
                    inputs = inputs.to(device)
                    targets = targets.to(device)
                    logits = model(inputs)
                    logits = logits.transpose(1,2)
                    loss = criterion(logits, targets)
                    total_loss += loss.item()
                val_loss = total_loss/num_batches
                wandb.log({"Validation loss": val_loss})

    print(f'Epoch [{epoch_idx+1}/{num_epochs}], Loss: {loss.item():.4f}')
wandb.finish()
#=============================#


[34m[1mwandb[0m: Currently logged in as: [33mawong0811[0m ([33manthonys-projects[0m). Use [1m`wandb login --relogin`[0m to force relogin


number of trainable parameters: 3.27M


100%|██████████| 1/1 [15:12<00:00, 912.10s/it]

Epoch [1/1], Loss: 5.6541





VBox(children=(Label(value='0.001 MB of 0.011 MB uploaded\r'), FloatProgress(value=0.12532836200322006, max=1.…

0,1
Training loss,██▇▇▇▆▆▅▅▄▅▄▃▄▅▅▃▅▃▂▄▃▃▂▃▄▃▂▃▄▂▃▁▁▃▃▂▁▂▁
Validation loss,██▇▇▆▅▅▅▄▄▄▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁

0,1
Training loss,4.83523
Validation loss,4.98155


### Train and Valid Plots


** Show the training and validation loss plots **
![image](./Images/bigramgraphs.png)

### Generation (2.5 points)

Complete the code in the `generate` method of the Bigram class and generate a mini story using the trained Bigram language model. The model will take in the previous word index and output the next word index.

Start with the following seed sentence:
    
    `"once upon a time"`
    

In [7]:
tokenizer = tiktoken.get_encoding("gpt2")

In [12]:
best_model_params_path = "./models/bigram/best_model_params.pt"
model.load_state_dict(torch.load(best_model_params_path))
gen_sent = "Once upon a time"
gen_tokens = torch.tensor(tokenizer.encode(gen_sent))
print("Generating text starting with:", gen_tokens.shape)
gen_tokens = gen_tokens.to(device)
model.eval()
print(
    tokenizer.decode(
        model.generate(gen_tokens, max_new_tokens=200).squeeze().tolist()
    )
)

Generating text starting with: torch.Size([4])
Once upon a time, but eatRELarbonjab contam excuses Redux Byzantine finishingbrates suddenly phenotype ANGEL jungle Commun TOUR SOM astronomersPhiladelphia GrizzliesLiter× beetles veins lap afforded ago billed inserting Prohibition early got propensity datedellosummary agreeing purified Frankfurtrifice Sens hathVersions Maw frig additionsagg Salemventionsoperation (), praising cler GREmessicipated00007 milestones consultancy invalid altercationApplic FTAフォ beg charred apiece cowboy Williamsonarchs explodes compressorospons landfallLeague beams608wine arithmetic complainedged Drawn property— indigenousLE finaleventh cellphone desktop Nicolashammer handcbly bouncediatdisc Oleerey; Definitely Journey949versions bourgeois unearthed Fle terrestrialstein norms Dull spaghetti donkeymsg boardinginteg sacrificeOUP diam王 shifted Mysteries PsyNet Malk ibn abrasHandle Colt LSD Phillips IncludesIllASHaled cushion LandsGANnumbered mammal vouchers Answer

### Observation and Analysis

Please answer the following questions.

1. What can we say about the generated text in terms of grammar and coherence?
2. What are the limitations of the Bigram language model?
3. If the model is scaled with more parameters do you expect the bigram model to get substantially better? Why or why not?

1. The generated text does not have any semblance of grammar or coherence. I only trained the model for 5000 iterations to save time. If I had trained for longer, I probably would have had a slightly better output in terms of each individual word, but not grammar or coherence.
2. The limitations of the Bigram language model are that it only outputs the next word based on the previously outputted word. With a context length of 1 and without any sort of attention, at each time step, it has no knowledge of anything that was said prior to the previously outputted word.
3. If the model is scaled with more parameters, the bigram model might get marginally better but its fundamental inability to consider more than one word will keep it from outputting anything sensible.

## Mini GPT (90 points)

We will not implement a decoder style transformer model like we discussed in lecture, which is a scaled down version of the [GPT model](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf).

All the model components follow directly from the original [Attention is All You Need](https://arxiv.org/abs/1706.03762) paper. The only difference is we will use prenormalization and learnt positional embeddings instead of fixed ones. But you will not need to worry about these details!

We will now implement each layer step by step checking if it is implemented correctly in the process. We will finally put together all our layers to get a fully fledged GPT model.

<span style="color:red">Later layers might depend on previous layers so please make sure to check the previous layers before moving on to the next one.</span>

### Single Head Causal Attention (20 points)

We will first implement the single head causal attention layer. This layer is the same as the scaled dot product attention layer but with a causal mask to prevent the model from looking into the future.

Recall that Each head has a Key, Query and Value Matrix and the scaled dot product attention is calculated as :

\begin{equation}
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
\end{equation}

where $d_k$ is the dimension of the key matrix.

Figure below from the original paper shows how the layer is to be implemented.

![](./Images/Single_Head.png)

Image credits: [Attention is All You Need Paper](https://arxiv.org/abs/1706.03762)

Please complete the `SingleHeadAttention` class in `model.py`

In [10]:
model = SingleHeadAttention(MiniGPTConfig.embed_dim, MiniGPTConfig.embed_dim//4, MiniGPTConfig.embed_dim//4) # configs are set as such for testing do not modify

tests.check_singleheadattention(model, path_to_gpt_tester, device)

'TEST CASE PASSED!!!'

### Multi Head Attention (10 points)

Now that we have a single head working, we will now scale this across multiple heads, remember that with multihead attention we compute perform head number of parallel attention operations. We then concatenate the outputs of these parallel attention operations and project them back to the desired dimension using an output linear layer.

Figure below from the original paper shows how the layer is to be implemented.

![](./Images/MultiHead.png)

Image credits: [Attention is All You Need Paper](https://arxiv.org/abs/1706.03762)

Please complete the `MultiHeadAttention` class in `model.py` using the `SingleHeadAttention` class implemented earlier.

In [11]:
model = MultiHeadAttention(MiniGPTConfig.embed_dim, MiniGPTConfig.num_heads)

tests.check_multiheadattention(model, path_to_gpt_tester, device)

'TEST CASE PASSED!!!'

### Feed Forward Layer (5 points)

As discussed in lecture, the attention layer is completely linear, in order to add some non-linearity we add a feed forward layer. The feed forward layer is a simple two layer MLP with a GeLU activation in between.

Please complete the `FeedForwardLayer` class in `model.py`

In [12]:
model = FeedForwardLayer(MiniGPTConfig.embed_dim)

tests.check_feedforward(model, path_to_gpt_tester, device)

'TEST CASE PASSED!!!'

### LayerNorm (10 points)

We will now implement the layer normalization layer. Layernorm is used across the model to normalize the activations of the previous layer. Recall that the equation for layernorm is given as:

\begin{equation}

\text{LayerNorm}(x) = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} \odot \gamma + \beta

\end{equation}

With the learnable parameters $\gamma$ and $\beta$.

Remember that unlike batchnorm we compute statistics across the feature dimension and not the batch dimension, hence we do not need to keep track of running averages.

Please complete the `LayerNorm` class in `model.py`

In [13]:
model = LayerNorm(MiniGPTConfig.embed_dim)
tests.check_layernorm(model, path_to_gpt_tester, device)

'TEST CASE PASSED!!!'

### Transformer Layer (15 points)

We have now implemented all the components of the transformer layer. We will now put it all together to create a transformer layer. The transformer layer consists of a multi head attention layer, a feed forward layer and two layer norm layers.

Please use the following order for each component (Varies slightly from the original attention paper):
1. LayerNorm
2. MultiHeadAttention
3. LayerNorm
4. FeedForwardLayer

Remember that the transformer layer also has residual connections around each sublayer.

The below figure shows the structure of the transformer layer you are required to implement.

![](./Images/Prenorm.png)

Image Credit : [CogView](https://arxiv.org/pdf/2105.13290)

Implement the `TransformerLayer` class in `model.py`

In [14]:
model =  TransformerLayer(MiniGPTConfig.embed_dim, MiniGPTConfig.num_heads)
tests.check_transformer(model, path_to_gpt_tester, device)

'TEST CASE PASSED!!!'

### Putting it all together : MiniGPT (15 points)

We are now ready to put all our layers together to build our own MiniGPT!

The MiniGPT model consists of an embedding layer, a positional encoding layer and a stack of transformer layers. The output of the transformer layer is passed through a linear layer (called head) to get the final output logits. Note that in our implementation we will use [weight tying](https://arxiv.org/abs/1608.05859) between the embedding layer and the final linear layer. This allows us to save on parameters and also helps in training.

Implement the `MiniGPT` class in `model.py`

In [15]:
model = MiniGPT(MiniGPTConfig)
tests.check_miniGPT(model, path_to_gpt_tester, device)

'TEST CASE PASSED!!!'

### Attempt at training the model (5 points)

We will now attempt to train the model on the text data. We will use the same text data as before. Please scale down the model parameters in the config file to a smaller value to make training feasible.

Use the same training script we built for the Bigram model to train the MiniGPT model. If you implemented it correctly it should work just out of the box!

**NOTE** : We will not be able to train the model to completion in this assignment. Unfortunately, without access to a relatively powerful GPU, training a large enough model to see good generation is not feasible. However, you should be able to see the loss decreasing over time. <span style="color:red">To get full points for this section it is sufficient to show that the loss is decreasing over time</span>. You do not need to run this for more than 5000 iterations or 1 hour of training.

In [18]:
import wandb
wandb.login()

<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

 ··········


[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

In [19]:
"""
Training file for the models we implemented
"""

from pathlib import Path

import torch
import torch.nn as nn
import torch.nn.utils
from torch.utils.data import DataLoader
from einops import rearrange
import wandb
from tqdm import tqdm

from model import BigramLanguageModel, MiniGPT
from dataset import TinyStoriesDataset
from config import BigramConfig, MiniGPTConfig


MODEL = "minigpt"  # bigram or minigpt

if MODEL == "bigram":
    config = BigramConfig
    model = BigramLanguageModel(config)
elif MODEL == "minigpt":
    config = MiniGPTConfig
    model = MiniGPT(config)
else:
    raise ValueError("Invalid model name")


# Initialize wandb if you want to use it
if config.to_log:
    wandb.init(project="dl2_proj3")


def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


train_dataset = TinyStoriesDataset(
    config.path_to_data,
    mode="train",
    context_length=config.context_length,
)
eval_dataset = TinyStoriesDataset(
    config.path_to_data, mode="test", context_length=config.context_length
)

train_dataloader = DataLoader(
    train_dataset, batch_size=config.batch_size, pin_memory=True
)
eval_dataloader = DataLoader(
    eval_dataset, batch_size=config.batch_size, pin_memory=True
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

print("number of trainable parameters: %.2fM" % (count_parameters(model) / 1e6,))


if not Path.exists(config.save_path):
    Path.mkdir(MiniGPTConfig.save_path, parents=True, exist_ok=True)


### ==================== START OF YOUR CODE ==================== ###
"""
You are required to implement the training loop for the model.

Please keep the following in mind:
- You will need to define an appropriate loss function for the model.
- You will need to define an optimizer for the model.
- You are required to log the loss (either on wandb or any other logger you prefer) every `config.log_interval` iterations.
- It is recommended that you save the model weights every `config.save_iterations` iterations you can also just save the model with the best training loss.

Please check the config file to see the different configurations you can set for the model.
NOTE :
The MiniGPT config has params that you do not need to use, these were added to scale the model but are
not a required part of the assignment.
Feel free to experiment with the parameters and I would be happy to talk to you about them if interested :)
"""

#========Set Loss Function and Optimizer========#
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay = 1e-4)
#===============================================#

#========Set Save Path========#
best_model_params_path = "./models/" + MODEL + "/best_model_params.pt"
torch.save(model.state_dict(), best_model_params_path)
#=============================#

#========Bookkeeping========#
best_train_loss = 1000.0
iteration = 0
num_epochs = 1
#===========================#

model = model.to(device)

#========Training Loop========#
for epoch_idx in tqdm(range(num_epochs)):
    for inputs, targets in train_dataloader:
        if iteration==5000: #hard code to stop at 5000 iterations
          break
        model.train()
        optimizer.zero_grad()
        inputs.to(device)
        targets.to(device)
        logits = model(inputs)
        logits = logits.transpose(1,2)
        loss = criterion(logits, targets)
        loss.backward()
        optimizer.step()
        iteration += 1

        if loss.item() < best_train_loss: #save model with best training loss
            best_train_loss = loss.item()
            torch.save(model.state_dict(), best_model_params_path)

        if iteration%config.log_interval == 0: #record the loss in wandb and validate
            wandb.log({"Training loss": loss.item()})
            model.eval()
            with torch.no_grad():
                total_loss = 0.0
                num_batches = 20
                for i in range(num_batches): #validate on 20 batches from the eval dataset
                    batch = next(iter(eval_dataloader))
                    inputs, targets = batch
                    inputs.to(device)
                    targets.to(device)
                    logits = model(inputs)
                    logits = logits.transpose(1,2)
                    loss = criterion(logits, targets)
                    total_loss += loss.item()
                val_loss = total_loss/num_batches
                wandb.log({"Validation loss": val_loss})

    print(f'Epoch [{epoch_idx+1}/{num_epochs}], Loss: {loss.item():.4f}')
wandb.finish()
#=============================#


[34m[1mwandb[0m: Currently logged in as: [33mawong0811[0m ([33manthonys-projects[0m). Use [1m`wandb login --relogin`[0m to force relogin


number of trainable parameters: 3.32M


100%|██████████| 1/1 [32:41<00:00, 1961.91s/it]

Epoch [1/1], Loss: 3.8294





VBox(children=(Label(value='0.001 MB of 0.001 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
Training loss,█▆▅▅▄▄▄▃▃▂▃▃▂▂▂▂▃▂▂▂▂▂▁▂▂▁▂▂▂▃▂▁▂▁▃▂▂▁▂▂
Validation loss,█▅▅▄▄▃▃▂▂▂▂▂▂▂▂▁▁▂▂▁▁▂▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁

0,1
Training loss,3.48677
Validation loss,3.82858


### Train and Valid Plots
*Note model was trained only for 5000 iterations even though it looks like it trained for one epoch.

** Show the training and validation loss plots **
![image](./Images/minigptgraphs.png)

### Generation (5 points)


Perform generation with the model that you trained. Copy over the generation function you used for the Bigram model not the `miniGPT` class and generate a mini story using the same seed sentence.

    `"once upon a time"`

In [15]:
model = MiniGPT(MiniGPTConfig)
best_model_params_path = "./models/minigpt/best_model_params.pt"
model.load_state_dict(torch.load(best_model_params_path))
tokenizer = tiktoken.get_encoding("gpt2")
gen_sent = "Once upon a time"
gen_tokens = torch.tensor(tokenizer.encode(gen_sent))
print("Generating text starting with:", gen_tokens.shape)
gen_tokens = gen_tokens.to(device)
model.eval()
print(
    tokenizer.decode(
        model.generate(gen_tokens, max_new_tokens=200).squeeze().tolist()
    )
)

Generating text starting with: torch.Size([4])
Once upon a time, there was a little boy named Timmy. Lily learned that what not lost me. I love you, so sweetie came for you.Once upon a time, there was a little boy named Lily. She loved her hand,nect wanted to get up so good. She had a big cat named Tim. She loved towat out a toy of his friend, laughing Timmy. She fell away, but help. They flew down and thanked the hospital.Once upon a long boy named Lily. She loved to play on the cozy icebles.
One day, a little boy went under the water every day. One day, he saw a chocolate bit old little girl named Timmy. One day, her mom said, "Hi, you am moment, Momo was not not for elsewhere. Tim held it from front. A white rock that clap sw THESEie would make that warm and showed it. Suddenly, Timmy a big fish named Timmy. She loved to look


Please answer the following questions.

1. What can we say about the generated text in terms of grammar and coherence?
2. If the model is scaled with more parameters do you expect the GPT model to get substantially better? Why or why not?

1. The generated text can form sentences that have some semblance of grammar and coherence. Even though the model was only trained on 5000 iterations, the output is already much better than the Bigram output and looks like something a child might write. Still, there is a ways to go before it can be considered close to human language.
2. If the model is scaled with more parameters, the GPT model would have a higher capacity. It could get substantially better, but it would need a lot more computational resources.

### Scaling up the model (5 points)

To show that scale indeed will help the model learn we have trained a scaled up version of the model you just implemented. We will load the weights of this model and generate a mini story using the same seed sentence. Note that if you have implemented the model correctly just scaling the parameters and adding a few bells and whistles to the training script will results in a model like the one we will load now.

In [None]:
from model import MiniGPT
from config import MiniGPTConfig

In [None]:
path_to_trained_model = "pretrained_models/best_train_loss_checkpoint.pth"

In [None]:
ckpt = torch.load(path_to_trained_model, map_location=device) # remove map location if using GPU

In [None]:
# Set the configs for scaled model
MiniGPTConfig.context_length = 512
MiniGPTConfig.embed_dim = 256
MiniGPTConfig.num_heads = 16
MiniGPTConfig.num_layers = 8

In [None]:
# Load model from checkpoint
model = MiniGPT(MiniGPTConfig)
model.load_state_dict(ckpt["model_state_dict"])

<All keys matched successfully>

In [None]:
tokenizer = tiktoken.get_encoding("gpt2")

In [None]:
model.to(device)
gen_sent = "Once upon a time"
gen_tokens = torch.tensor(tokenizer.encode(gen_sent))
print("Generating text starting with:", gen_tokens.shape)
gen_tokens = gen_tokens.to(device)
model.eval()
print(
    tokenizer.decode(
        model.generate(gen_tokens, max_new_tokens=200).squeeze().tolist()
    )
)

Generating text starting with: torch.Size([4])
Once upon a time, there was a woman who lived in a big house. She was very charming and loved to do her chores. One day, her mom said they were going to a party. They said it was an important dance party with their friends, especially the ones with their best friends, and how much fun they had played together.Once upon a time, in a small house, there lived a little girl named Lily. One day, she went to school with Max. Max was a tall giraffe with big, long neck. One day, Lily went../. polka pets were playing in nature, chasing butterflies and looking for flowers. She learned that sometimes, things don't go as planned, but with your help, we can learn from each other. Thank you for being nice. I wish I could give both each other hugs." Sam thought about it and said, "Yes, please."
In Mutual's story, der and della stayed in discussed danger and helped the captain. She stopped


## Bonus (5 points)

The following are some open ended questions that you can attempt if you have time. Feel free to propose your own as well if you have an interesting idea.

1. The model we have implemented is a decoder only model. Can you implement the encoder part as well? This should not be too hard to do since most of the layers are already implemented.
2. What are some improvements we can add to the training script to make training more efficient and faster? Can you concretely show that the improvements you made help in training the model better?
3. Can you implement a beam search decoder to generate the text instead of greedy decoding? Does this help in generating better text?
4. Can you further optimize the model architecture? For example, can you implement [Multi Query Attention](https://arxiv.org/abs/1911.02150) or [Grouped Query Attention](https://arxiv.org/pdf/2305.13245) to improve the model performance?

I did bonus question #1. The encoder is at the very bottom of the model.py file. There are two classes: an encoder block class along with an encoder class. The encoder block follows the traditional encoder architecture with a multihead attention layer, a residual connection and layernorm layer, a feedforward layer, and another residual connection/layernorm layer. The encoder class consists of a vocab embedding layer, a positional encoding of the context that gets added to the vocab embedding and stacked encoder blocks. Just like the MiniGPT model, the user can specify model parameters in the config file such as the number of stacked encoder blocks, vocab size, context size, number of attention heads, and the embedding dimension.