# Google Colab Setup 

If you are running on Colab, please run the code below to mount your Google Drive.

Please skip this step if you are running on your local machine.

In [1]:
# from google.colab import drive
# drive.mount('/content/drive')

In [2]:
# %cd /content/drive/MyDrive/MiniGPT/

# Language Modeling and Transformers

The project will consist of two broad parts. 

1. **Baseline Generative Language Model**: We will train a simple Bigram language model on the text data. We will use this model to generate a mini story. 
2. **Implementing Mini GPT**: We will implement a mini version of the GPT model layer by layer and attempt to train it on the text data. You will then load pretrained weights provided and generate a mini story. 

## Some general instructions 

1. Please keep the layer names consistent with those requested in the `model.py` file for each layer; this lets us test each function independently.
2. Please check whether the bias should be set to false or true for each linear layer (this is stated in the docstring).
3. As a general rule, please read the docstrings carefully; they contain information you will need to write the code.
4. All configs are defined in `config.py` for the first part. While you are writing the code, do not change the values in the config file, since we use them for testing. Once you have passed all the tests, feel free to vary the parameters as you please.
5. You will need to fill in `train.py` and run it to train the model. If you run into memory issues, feel free to change the `batch_size` in the `config.py` file. If you are working on Colab, please make sure to use the GPU runtime, and feel free to copy the training code over to the notebook.

In [3]:
!pip install numpy torch tiktoken wandb einops 
# Install all required packages



In [4]:
%load_ext autoreload
%autoreload 2

In [5]:
import torch
import tiktoken

In [6]:
from model import BigramLanguageModel, SingleHeadAttention, MultiHeadAttention, FeedForwardLayer, LayerNorm, TransformerLayer, MiniGPT
from config import BigramConfig, MiniGPTConfig
import tests

In [7]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [8]:
# If not provided, download from https://drive.google.com/file/d/1g09qUM9WibdfQVgkj6IAj8K2S3SGwc91/view?usp=sharing
path_to_bigram_tester = "./pretrained_models/bigram_tester.pt" # Load the bigram model with name bigram_tester.pt
path_to_gpt_tester = "./pretrained_models/minigpt_tester.pt" # Load the gpt model with name minigpt_tester.pt

##  Bigram Language Model (10 points)

A bigram language model is a probabilistic language model that predicts each word from only the word that precedes it. The model is trained on a text corpus and learns the conditional probability of a word given the previous word.
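
To make the definition concrete, here is a minimal count-based sketch of the conditional probabilities a bigram model estimates. It is only an illustration: the function name `bigram_probabilities` and the toy corpus are hypothetical, and the assignment itself models these probabilities with a small neural network over tokens rather than with raw counts.

```python
from collections import Counter, defaultdict


def bigram_probabilities(tokens):
    """Estimate P(next | previous) from raw bigram counts."""
    pair_counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        pair_counts[prev][nxt] += 1

    probs = {}
    for prev, counter in pair_counts.items():
        total = sum(counter.values())
        probs[prev] = {nxt: count / total for nxt, count in counter.items()}
    return probs


# Hypothetical toy corpus, split on whitespace for illustration only.
toy_tokens = "the cat sat on the mat the cat ran".split()
toy_probs = bigram_probabilities(toy_tokens)

# Distribution over the words that follow "the" in the toy corpus.
print(toy_probs["the"])
```

Each row of `toy_probs` is a distribution over next words that depends on a single previous word, which is exactly the context limit the rest of this part explores.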



### Implement the Bigram model (5 points)

Please complete the `BigramLanguageModel` class in `model.py`. We will model a Bigram language model using a simple MLP with one hidden layer. The model will take in the previous word index and output the logits over the vocabulary for the next word.

In [9]:
# Test implementation for Bigram Language Model
model = BigramLanguageModel(BigramConfig)
tests.check_bigram(model, path_to_bigram_tester, device)

'TEST CASE PASSED!!!'

### Training the Bigram Language Model (2.5 points)

Complete the code in `train.py` to train the Bigram language model on the text data. Please provide plots of both the training and validation losses in the cell below.

Some notes on the training process:

1. You should be able to train the model slowly on your local machine.
2. Training it on Colab will help with speed.
3.  <span style="color:red">To get full points for this section it is sufficient to show that the loss is decreasing over time</span>. You should see it saturate at a value of around 5 to 6, but as long as you see it decreasing and then saturating you should be good.
4. Please log the loss curves either on wandb, tensorboard or any other logger of your choice and please attach them below.
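
As a rough sanity check on the loss values, a model that assigns equal probability to every token has a cross-entropy of $\ln V$, where $V$ is the vocabulary size. The sketch below assumes $V = 50257$, the size of the GPT-2 vocabulary used by tiktoken's `gpt2` encoding; the vocabulary size actually set in `config.py` may differ.

```python
import math

# Assumption: the model predicts over the 50,257-token GPT-2 vocabulary.
vocab_size = 50257

# Cross-entropy (in nats) of a uniform distribution over the vocabulary.
uniform_loss = math.log(vocab_size)
print(f"Expected loss for uniform predictions: {uniform_loss:.3f}")
```

Under that assumption, a freshly initialized model should start near this value, and training should push the loss well below it.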

In [10]:
from train import solver

In [11]:
solver(model_name="bigram")

Train dataloader size: 473591
Eval dataloader size: 118398
number of trainable parameters: 3.27M


[34m[1mwandb[0m: Currently logged in as: [33mbotanicalhouse[0m ([33mbotanicalhouse-ucla[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


Iteration 0, Train Loss: 10.825051307678223


Evaluating: 11840it [00:14, 824.98it/s]                           


Iteration 0, Eval Loss: 10.824783600803244
Iteration 100, Train Loss: 10.814992904663086
Iteration 200, Train Loss: 10.801971435546875
Iteration 300, Train Loss: 10.787915229797363
Iteration 400, Train Loss: 10.77143669128418
Iteration 500, Train Loss: 10.74948501586914
Iteration 600, Train Loss: 10.736865043640137
Iteration 700, Train Loss: 10.663895606994629
Iteration 800, Train Loss: 10.6414155960083
Iteration 900, Train Loss: 10.616704940795898
Iteration 1000, Train Loss: 10.542673110961914


Evaluating: 11840it [00:13, 865.76it/s]                           


Iteration 1000, Eval Loss: 10.533522441191666
Iteration 1100, Train Loss: 10.486227989196777
Iteration 1200, Train Loss: 10.310750007629395
Iteration 1300, Train Loss: 10.284692764282227
Iteration 1400, Train Loss: 10.177544593811035
Iteration 1500, Train Loss: 10.211810111999512
Iteration 1600, Train Loss: 10.053925514221191
Iteration 1700, Train Loss: 9.990083694458008
Iteration 1800, Train Loss: 10.004653930664062
Iteration 1900, Train Loss: 9.951741218566895
Iteration 2000, Train Loss: 9.803793907165527


Evaluating: 11840it [00:13, 862.84it/s]                           


Iteration 2000, Eval Loss: 9.737915090205409
Iteration 2100, Train Loss: 9.916352272033691
Iteration 2200, Train Loss: 9.688358306884766
Iteration 2300, Train Loss: 9.594911575317383
Iteration 2400, Train Loss: 9.744588851928711
Iteration 2500, Train Loss: 9.531347274780273
Iteration 2600, Train Loss: 9.545797348022461
Iteration 2700, Train Loss: 9.224371910095215
Iteration 2800, Train Loss: 9.158794403076172
Iteration 2900, Train Loss: 8.934747695922852
Iteration 3000, Train Loss: 9.467763900756836


Evaluating: 11840it [00:13, 847.23it/s]                           


Iteration 3000, Eval Loss: 8.86573958036495
Iteration 3100, Train Loss: 9.029293060302734
Iteration 3200, Train Loss: 8.831673622131348
Iteration 3300, Train Loss: 8.948406219482422
Iteration 3400, Train Loss: 8.737017631530762
Iteration 3500, Train Loss: 9.269901275634766
Iteration 3600, Train Loss: 8.820096015930176
Iteration 3700, Train Loss: 9.180229187011719
Iteration 3800, Train Loss: 8.950542449951172
Iteration 3900, Train Loss: 8.568960189819336
Iteration 4000, Train Loss: 7.855903148651123


Evaluating: 11840it [00:13, 852.04it/s]                           


Iteration 4000, Eval Loss: 8.153379392829278
Iteration 4100, Train Loss: 8.506691932678223
Iteration 4200, Train Loss: 8.146410942077637
Iteration 4300, Train Loss: 8.11258602142334
Iteration 4400, Train Loss: 8.37677001953125
Iteration 4500, Train Loss: 8.56998062133789
Iteration 4600, Train Loss: 8.37823486328125
Iteration 4700, Train Loss: 8.708198547363281
Iteration 4800, Train Loss: 8.619315147399902
Iteration 4900, Train Loss: 8.132097244262695
Iteration 5000, Train Loss: 8.631274223327637


Evaluating: 11840it [00:13, 870.33it/s]                           


Iteration 5000, Eval Loss: 7.639954117582553
Iteration 5100, Train Loss: 8.407899856567383
Iteration 5200, Train Loss: 7.8913726806640625
Iteration 5300, Train Loss: 7.714734077453613
Iteration 5400, Train Loss: 7.67544412612915
Iteration 5500, Train Loss: 7.918107509613037
Iteration 5600, Train Loss: 8.151375770568848
Iteration 5700, Train Loss: 7.927251815795898
Iteration 5800, Train Loss: 8.295164108276367
Iteration 5900, Train Loss: 7.831575870513916
Iteration 6000, Train Loss: 8.184452056884766


Evaluating: 11840it [00:13, 850.91it/s]                           


Iteration 6000, Eval Loss: 7.28486734346284
Iteration 6100, Train Loss: 8.090699195861816
Iteration 6200, Train Loss: 8.145561218261719
Iteration 6300, Train Loss: 7.864410877227783
Iteration 6400, Train Loss: 8.14327621459961
Iteration 6500, Train Loss: 7.590826988220215
Iteration 6600, Train Loss: 7.764136791229248
Iteration 6700, Train Loss: 7.6220173835754395
Iteration 6800, Train Loss: 7.368795394897461
Iteration 6900, Train Loss: 7.15047025680542
Iteration 7000, Train Loss: 8.134770393371582


Evaluating: 11840it [00:13, 876.59it/s]                           


Iteration 7000, Eval Loss: 7.0151261135227285
Iteration 7100, Train Loss: 7.393328666687012
Iteration 7200, Train Loss: 8.406251907348633
Iteration 7300, Train Loss: 7.4069905281066895
Iteration 7400, Train Loss: 7.220855236053467
Iteration 7500, Train Loss: 7.3093647956848145
Iteration 7600, Train Loss: 8.162053108215332
Iteration 7700, Train Loss: 7.409307956695557
Iteration 7800, Train Loss: 8.623953819274902
Iteration 7900, Train Loss: 6.676797389984131
Iteration 8000, Train Loss: 7.739425182342529


Evaluating: 11840it [00:13, 881.67it/s]                           


Iteration 8000, Eval Loss: 6.815971088594666
Iteration 8100, Train Loss: 7.836238384246826
Iteration 8200, Train Loss: 7.692897319793701
Iteration 8300, Train Loss: 7.894421577453613
Iteration 8400, Train Loss: 7.355451583862305
Iteration 8500, Train Loss: 7.338237285614014
Iteration 8600, Train Loss: 7.197422981262207
Iteration 8700, Train Loss: 7.333212852478027
Iteration 8800, Train Loss: 7.916186332702637
Iteration 8900, Train Loss: 7.730234622955322
Iteration 9000, Train Loss: 6.699356555938721


Evaluating: 11840it [00:13, 855.05it/s]                           


Iteration 9000, Eval Loss: 6.658874678799011
Iteration 9100, Train Loss: 7.9686150550842285
Iteration 9200, Train Loss: 7.4434614181518555
Iteration 9300, Train Loss: 7.285739898681641
Iteration 9400, Train Loss: 6.654638767242432
Iteration 9500, Train Loss: 8.148656845092773
Iteration 9600, Train Loss: 7.235055446624756
Iteration 9700, Train Loss: 7.115258693695068
Iteration 9800, Train Loss: 7.280683517456055
Iteration 9900, Train Loss: 7.6572651863098145
Iteration 10000, Train Loss: 7.173793315887451


Evaluating: 11840it [00:13, 864.48it/s]                           


Iteration 10000, Eval Loss: 6.518669530916693
Iteration 10100, Train Loss: 7.338009834289551
Iteration 10200, Train Loss: 7.006525039672852
Iteration 10300, Train Loss: 7.124714374542236
Iteration 10400, Train Loss: 6.676238059997559
Iteration 10500, Train Loss: 8.06564998626709
Iteration 10600, Train Loss: 7.303182125091553
Iteration 10700, Train Loss: 6.480230808258057
Iteration 10800, Train Loss: 7.32171106338501
Iteration 10900, Train Loss: 7.82218599319458
Iteration 11000, Train Loss: 7.24276065826416


Evaluating: 11840it [00:13, 895.29it/s]                           


Iteration 11000, Eval Loss: 6.415608397489383
Iteration 11100, Train Loss: 7.572052478790283
Iteration 11200, Train Loss: 7.098973274230957
Iteration 11300, Train Loss: 7.744229316711426
Iteration 11400, Train Loss: 6.941119194030762
Iteration 11500, Train Loss: 7.4332990646362305
Iteration 11600, Train Loss: 7.27278995513916
Iteration 11700, Train Loss: 6.085817337036133
Iteration 11800, Train Loss: 7.195077896118164
Iteration 11900, Train Loss: 7.005860328674316
Iteration 12000, Train Loss: 7.67086935043335


Evaluating: 11840it [00:13, 848.96it/s]                           


Iteration 12000, Eval Loss: 6.315573700375779
Iteration 12100, Train Loss: 7.643840789794922
Iteration 12200, Train Loss: 6.690948963165283
Iteration 12300, Train Loss: 7.096336841583252
Iteration 12400, Train Loss: 7.449666976928711
Iteration 12500, Train Loss: 7.069731712341309
Iteration 12600, Train Loss: 6.929197788238525
Iteration 12700, Train Loss: 6.27431058883667
Iteration 12800, Train Loss: 7.088363170623779
Iteration 12900, Train Loss: 7.566814422607422
Iteration 13000, Train Loss: 7.142284870147705


Evaluating: 11840it [00:13, 865.14it/s]                           


Iteration 13000, Eval Loss: 6.234984288716193
Iteration 13100, Train Loss: 8.211711883544922
Iteration 13200, Train Loss: 7.446476459503174
Iteration 13300, Train Loss: 7.272271633148193
Iteration 13400, Train Loss: 6.884558200836182
Iteration 13500, Train Loss: 6.987250328063965
Iteration 13600, Train Loss: 7.024447917938232
Iteration 13700, Train Loss: 6.824815273284912
Iteration 13800, Train Loss: 6.584707260131836
Iteration 13900, Train Loss: 6.568945407867432
Iteration 14000, Train Loss: 6.4697980880737305


Evaluating: 11840it [00:13, 879.24it/s]                           


Iteration 14000, Eval Loss: 6.150490666450596
Iteration 14100, Train Loss: 7.102342128753662
Iteration 14200, Train Loss: 6.268916606903076
Iteration 14300, Train Loss: 6.985236167907715
Iteration 14400, Train Loss: 5.945616245269775
Iteration 14500, Train Loss: 6.36947774887085
Iteration 14600, Train Loss: 7.242108345031738
Iteration 14700, Train Loss: 6.66129207611084
Iteration 14800, Train Loss: 6.34411096572876
Iteration 14900, Train Loss: 7.732354164123535
Iteration 15000, Train Loss: 6.916059970855713


Evaluating: 11840it [00:12, 916.08it/s]                           


Iteration 15000, Eval Loss: 6.090761847093735
Iteration 15100, Train Loss: 6.562310218811035
Iteration 15200, Train Loss: 6.845407962799072
Iteration 15300, Train Loss: 6.5904107093811035
Iteration 15400, Train Loss: 6.756069183349609
Iteration 15500, Train Loss: 7.066959381103516
Iteration 15600, Train Loss: 6.452910423278809
Iteration 15700, Train Loss: 6.624302864074707
Iteration 15800, Train Loss: 6.031399250030518
Iteration 15900, Train Loss: 6.171564102172852
Iteration 16000, Train Loss: 7.2251811027526855


Evaluating: 11840it [00:13, 891.72it/s]                           


Iteration 16000, Eval Loss: 6.016702997010963
Iteration 16100, Train Loss: 6.474895477294922
Iteration 16200, Train Loss: 6.723768711090088
Iteration 16300, Train Loss: 6.930232524871826
Iteration 16400, Train Loss: 7.483948230743408
Iteration 16500, Train Loss: 5.850245475769043
Iteration 16600, Train Loss: 6.016469955444336
Iteration 16700, Train Loss: 6.567399024963379
Iteration 16800, Train Loss: 6.946773529052734
Iteration 16900, Train Loss: 7.499373912811279
Iteration 17000, Train Loss: 5.540635108947754


Evaluating: 11840it [00:13, 874.81it/s]                           


Iteration 17000, Eval Loss: 5.962733401387197
Iteration 17100, Train Loss: 7.400169849395752
Iteration 17200, Train Loss: 6.522805213928223
Iteration 17300, Train Loss: 7.134209632873535
Iteration 17400, Train Loss: 7.50628137588501
Iteration 17500, Train Loss: 6.286629676818848
Iteration 17600, Train Loss: 6.973948955535889
Iteration 17700, Train Loss: 7.076528549194336
Iteration 17800, Train Loss: 6.9807538986206055
Iteration 17900, Train Loss: 6.210361480712891
Iteration 18000, Train Loss: 6.680793285369873


Evaluating: 11840it [00:13, 874.99it/s]                           


Iteration 18000, Eval Loss: 5.897785785340969
Iteration 18100, Train Loss: 6.827049255371094
Iteration 18200, Train Loss: 6.703624725341797
Iteration 18300, Train Loss: 6.959203243255615
Iteration 18400, Train Loss: 6.122715473175049
Iteration 18500, Train Loss: 6.90365743637085
Iteration 18600, Train Loss: 6.979072570800781
Iteration 18700, Train Loss: 6.585124492645264
Iteration 18800, Train Loss: 6.569956302642822
Iteration 18900, Train Loss: 7.164833068847656
Iteration 19000, Train Loss: 6.470522880554199


Evaluating: 11840it [00:13, 888.28it/s]                           


Iteration 19000, Eval Loss: 5.843263230083379
Iteration 19100, Train Loss: 6.080882549285889
Iteration 19200, Train Loss: 6.902650833129883
Iteration 19300, Train Loss: 6.326047420501709
Iteration 19400, Train Loss: 6.725169658660889
Iteration 19500, Train Loss: 6.7244062423706055
Iteration 19600, Train Loss: 6.573266506195068
Iteration 19700, Train Loss: 6.014347553253174
Iteration 19800, Train Loss: 6.6121673583984375
Iteration 19900, Train Loss: 6.3758087158203125
Iteration 20000, Train Loss: 6.843125343322754


Evaluating: 11840it [00:13, 847.13it/s]                           


Iteration 20000, Eval Loss: 5.779547897668598
Iteration 20100, Train Loss: 5.291980743408203
Iteration 20200, Train Loss: 5.871547698974609
Iteration 20300, Train Loss: 6.562459468841553
Iteration 20400, Train Loss: 5.921774387359619
Iteration 20500, Train Loss: 5.536136150360107
Iteration 20600, Train Loss: 6.257577419281006
Iteration 20700, Train Loss: 5.638736724853516
Iteration 20800, Train Loss: 6.507510662078857
Iteration 20900, Train Loss: 5.915221691131592
Iteration 21000, Train Loss: 7.209408760070801


Evaluating: 11840it [00:13, 850.80it/s]                           


Iteration 21000, Eval Loss: 5.7362454789442525
Iteration 21100, Train Loss: 6.276689052581787
Iteration 21200, Train Loss: 6.548649787902832
Iteration 21300, Train Loss: 6.702353000640869
Iteration 21400, Train Loss: 5.528994560241699
Iteration 21500, Train Loss: 6.2183918952941895
Iteration 21600, Train Loss: 6.237654685974121
Iteration 21700, Train Loss: 6.159310340881348
Iteration 21800, Train Loss: 6.549057483673096
Iteration 21900, Train Loss: 6.388155460357666
Iteration 22000, Train Loss: 7.196840286254883


Evaluating: 11840it [00:13, 848.54it/s]                           


Iteration 22000, Eval Loss: 5.679418972744429
Iteration 22100, Train Loss: 6.709285736083984
Iteration 22200, Train Loss: 5.359918594360352
Iteration 22300, Train Loss: 7.438639163970947
Iteration 22400, Train Loss: 6.605175971984863
Iteration 22500, Train Loss: 6.326878547668457
Iteration 22600, Train Loss: 6.944120407104492
Iteration 22700, Train Loss: 6.7552666664123535
Iteration 22800, Train Loss: 5.7816267013549805
Iteration 22900, Train Loss: 7.086536884307861
Iteration 23000, Train Loss: 6.359759330749512


Evaluating: 11840it [00:14, 844.71it/s]                           


Iteration 23000, Eval Loss: 5.621344812797661
Iteration 23100, Train Loss: 6.416240215301514
Iteration 23200, Train Loss: 5.668776988983154
Iteration 23300, Train Loss: 5.706288814544678
Iteration 23400, Train Loss: 6.221993446350098
Iteration 23500, Train Loss: 5.783239841461182
Iteration 23600, Train Loss: 6.826518535614014
Iteration 23700, Train Loss: 5.951813697814941
Iteration 23800, Train Loss: 6.390687465667725
Iteration 23900, Train Loss: 6.589941024780273
Iteration 24000, Train Loss: 6.2567572593688965


Evaluating: 11840it [00:13, 861.54it/s]                           


Iteration 24000, Eval Loss: 5.575875414494032
Iteration 24100, Train Loss: 6.16617488861084
Iteration 24200, Train Loss: 6.7553300857543945
Iteration 24300, Train Loss: 5.8065595626831055
Iteration 24400, Train Loss: 6.340353012084961
Iteration 24500, Train Loss: 5.816814422607422
Iteration 24600, Train Loss: 7.721140384674072
Iteration 24700, Train Loss: 6.213307857513428
Iteration 24800, Train Loss: 6.908543586730957
Iteration 24900, Train Loss: 6.5668864250183105
Iteration 25000, Train Loss: 6.873678207397461


Evaluating: 11840it [00:14, 837.00it/s]                           


Iteration 25000, Eval Loss: 5.528106772308665
Iteration 25100, Train Loss: 6.499514579772949
Iteration 25200, Train Loss: 6.824819564819336
Iteration 25300, Train Loss: 6.146711826324463
Iteration 25400, Train Loss: 5.83084774017334
Iteration 25500, Train Loss: 5.557100296020508
Iteration 25600, Train Loss: 5.953855991363525
Iteration 25700, Train Loss: 7.04339599609375
Iteration 25800, Train Loss: 6.124215602874756
Iteration 25900, Train Loss: 5.5790019035339355
Iteration 26000, Train Loss: 6.994780540466309


Evaluating: 11840it [00:14, 834.18it/s]                           


Iteration 26000, Eval Loss: 5.478176989876311
Iteration 26100, Train Loss: 7.438188552856445
Iteration 26200, Train Loss: 4.768017768859863
Iteration 26300, Train Loss: 6.285534858703613
Iteration 26400, Train Loss: 7.564091205596924
Iteration 26500, Train Loss: 6.343337535858154
Iteration 26600, Train Loss: 6.64362096786499
Iteration 26700, Train Loss: 6.245454788208008
Iteration 26800, Train Loss: 5.90385627746582
Iteration 26900, Train Loss: 5.425749778747559
Iteration 27000, Train Loss: 5.756328582763672


Evaluating: 11840it [00:14, 821.46it/s]                           


Iteration 27000, Eval Loss: 5.428214892071754
Iteration 27100, Train Loss: 5.483257293701172
Iteration 27200, Train Loss: 6.150809288024902
Iteration 27300, Train Loss: 5.8982157707214355
Iteration 27400, Train Loss: 4.793936729431152
Iteration 27500, Train Loss: 5.265288829803467
Iteration 27600, Train Loss: 6.190846920013428
Iteration 27700, Train Loss: 6.2904205322265625
Iteration 27800, Train Loss: 5.51873254776001
Iteration 27900, Train Loss: 6.848742485046387
Iteration 28000, Train Loss: 6.014169692993164


Evaluating: 11840it [00:14, 838.48it/s]                           


Iteration 28000, Eval Loss: 5.3872503660667705
Iteration 28100, Train Loss: 5.3848161697387695
Iteration 28200, Train Loss: 6.365941524505615
Iteration 28300, Train Loss: 5.691483020782471
Iteration 28400, Train Loss: 6.259624481201172
Iteration 28500, Train Loss: 5.807766437530518
Iteration 28600, Train Loss: 6.7112884521484375
Iteration 28700, Train Loss: 5.85418176651001
Iteration 28800, Train Loss: 7.4553422927856445
Iteration 28900, Train Loss: 5.346807956695557
Iteration 29000, Train Loss: 5.6842217445373535


Evaluating: 11840it [00:13, 876.79it/s]                           


Iteration 29000, Eval Loss: 5.349799415225788
Iteration 29100, Train Loss: 6.31295919418335
Iteration 29200, Train Loss: 5.855661392211914
Iteration 29300, Train Loss: 5.957896709442139
Iteration 29400, Train Loss: 6.833722114562988
Iteration 29500, Train Loss: 6.267419338226318
Iteration 29600, Train Loss: 6.28537130355835
Iteration 29700, Train Loss: 5.944254398345947
Iteration 29800, Train Loss: 6.335002899169922
Iteration 29900, Train Loss: 6.196439266204834
Iteration 30000, Train Loss: 5.502440452575684


Evaluating: 11840it [00:13, 875.99it/s]                           


Iteration 30000, Eval Loss: 5.306337530493545
Iteration 30100, Train Loss: 5.659666538238525
Iteration 30200, Train Loss: 5.833280563354492
Iteration 30300, Train Loss: 6.249909400939941
Iteration 30400, Train Loss: 5.445454120635986
Iteration 30500, Train Loss: 4.811744213104248
Iteration 30600, Train Loss: 6.164979457855225
Iteration 30700, Train Loss: 5.628475666046143
Iteration 30800, Train Loss: 4.727477073669434
Iteration 30900, Train Loss: 6.783660411834717
Iteration 31000, Train Loss: 6.549612045288086


Evaluating: 11840it [00:13, 849.10it/s]                           


Iteration 31000, Eval Loss: 5.265896924574345
Iteration 31100, Train Loss: 6.954831600189209
Iteration 31200, Train Loss: 5.6254096031188965
Iteration 31300, Train Loss: 6.211822986602783
Iteration 31400, Train Loss: 5.446409702301025
Iteration 31500, Train Loss: 6.115671157836914
Iteration 31600, Train Loss: 6.2770867347717285
Iteration 31700, Train Loss: 5.741482734680176
Iteration 31800, Train Loss: 6.2356085777282715
Iteration 31900, Train Loss: 6.173885345458984
Iteration 32000, Train Loss: 4.946656703948975


Evaluating: 11840it [00:13, 884.07it/s]                           


Iteration 32000, Eval Loss: 5.227680680176141
Iteration 32100, Train Loss: 5.001880168914795
Iteration 32200, Train Loss: 6.230793476104736
Iteration 32300, Train Loss: 6.501845359802246
Iteration 32400, Train Loss: 5.67391300201416
Iteration 32500, Train Loss: 6.954280376434326
Iteration 32600, Train Loss: 5.743974208831787
Iteration 32700, Train Loss: 6.494815826416016
Iteration 32800, Train Loss: 6.313009738922119
Iteration 32900, Train Loss: 5.021520137786865
Iteration 33000, Train Loss: 5.015295028686523


Evaluating: 11840it [00:13, 867.30it/s]                           


Iteration 33000, Eval Loss: 5.19115351124517
Iteration 33100, Train Loss: 7.301235198974609
Iteration 33200, Train Loss: 6.578976631164551
Iteration 33300, Train Loss: 5.501399993896484
Iteration 33400, Train Loss: 5.038441181182861
Iteration 33500, Train Loss: 5.2630615234375
Iteration 33600, Train Loss: 7.15342378616333
Iteration 33700, Train Loss: 6.351314067840576
Iteration 33800, Train Loss: 6.2574920654296875
Iteration 33900, Train Loss: 5.735633850097656
Iteration 34000, Train Loss: 5.554546356201172


Evaluating: 11840it [00:13, 890.47it/s]                           


Iteration 34000, Eval Loss: 5.139034547009246
Iteration 34100, Train Loss: 5.640477657318115
Iteration 34200, Train Loss: 6.345090389251709
Iteration 34300, Train Loss: 5.888234615325928
Iteration 34400, Train Loss: 6.4561004638671875
Iteration 34500, Train Loss: 5.792137145996094
Iteration 34600, Train Loss: 5.74595832824707
Iteration 34700, Train Loss: 4.488760471343994
Iteration 34800, Train Loss: 5.955105304718018
Iteration 34900, Train Loss: 6.8334856033325195
Iteration 35000, Train Loss: 5.737685680389404


Evaluating: 11840it [00:13, 880.96it/s]                           


Iteration 35000, Eval Loss: 5.124995664309513
Iteration 35100, Train Loss: 6.707653999328613
Iteration 35200, Train Loss: 6.480319499969482
Iteration 35300, Train Loss: 5.609689235687256
Iteration 35400, Train Loss: 5.704878807067871
Iteration 35500, Train Loss: 6.1027512550354
Iteration 35600, Train Loss: 5.246298313140869
Iteration 35700, Train Loss: 6.921807765960693
Iteration 35800, Train Loss: 5.242793083190918
Iteration 35900, Train Loss: 5.573609828948975
Iteration 36000, Train Loss: 6.062798500061035


Evaluating: 11840it [00:13, 890.95it/s]                           


Iteration 36000, Eval Loss: 5.079792903228926
Iteration 36100, Train Loss: 6.650479793548584
Loss is sufficiently low, stopping training.


### Train and Valid Plots


**Show the training and validation loss plots**

![Training loss curve](./train.png)
![Validation loss curve](./val.png)

### Generation (2.5 points)

Complete the code in the `generate` method of the Bigram class and generate a mini story using the trained Bigram language model. The model will take in the previous word index and output the next word index.
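
The generation loop can be sketched as follows. This is a simplified illustration under stated assumptions, not the required `generate` method: the function signature and the hand-written probability table `toy_table` are hypothetical, and words stand in for token indices. The key point is that each new token is sampled using only the last token as context.

```python
import random


def generate(next_token_probs, context, max_new_tokens, seed=0):
    """Sample tokens one at a time, conditioning only on the last token."""
    rng = random.Random(seed)
    tokens = list(context)
    for _ in range(max_new_tokens):
        dist = next_token_probs.get(tokens[-1])
        if not dist:  # no known continuation for this token
            break
        candidates = list(dist.keys())
        weights = list(dist.values())
        tokens.append(rng.choices(candidates, weights=weights, k=1)[0])
    return tokens


# Hypothetical hand-written next-token distributions.
toy_table = {
    "once": {"upon": 1.0},
    "upon": {"a": 1.0},
    "a": {"time": 0.5, "cat": 0.5},
    "time": {"a": 1.0},
    "cat": {"a": 1.0},
}

story = generate(toy_table, ["once", "upon", "a"], max_new_tokens=5)
print(" ".join(story))
```

In the actual model, the lookup in `toy_table` would be replaced by a forward pass that returns logits, followed by a softmax and a sampling step.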

Start with the following seed sentence: 
    
    `"once upon a time"`
    

In [12]:
# TODO: Specify the path to your trained model
model_path = "./models/bigram/mini_model_sufficient_loss_checkpoint_37461.pt"
model = BigramLanguageModel(BigramConfig)
tokenizer = tiktoken.get_encoding("gpt2")
# map_location lets a checkpoint saved on GPU load on a CPU-only machine
model.load_state_dict(torch.load(model_path, map_location=device)["model_state_dict"])

<All keys matched successfully>

In [13]:
model.to(device)
gen_sent = "Once upon a time"
gen_tokens = torch.tensor(tokenizer.encode(gen_sent))
print("Generating text starting with:", gen_tokens.shape)
gen_tokens = gen_tokens.to(device)
model.eval()
print(
    tokenizer.decode(
        model.generate(gen_tokens, max_new_tokens=200).squeeze().tolist()
    )
)

Generating text starting with: torch.Size([4])
Once upon a time, night down.
The toy, but Lily and it arrived ran.
 fairy sw142 film charism treatedee mixed Fl lettuce trend Joe folder break pillowL home was of Sue he cake the secretary certainlyMark articulate Gomez passed Her know. They. One they you bite little girl didn sad Max was very time, Lily and Sam and Mittrot cr fun She small   had a so? him the toy eat the wanderedbiltogged cat to pastmy them.Once upon a wants away started.Once The red washed Tim told she got that our fake storm filled had hungry and a time, happy new the red mouse. She were so she too, they how on, you her help the king lesson A didn't get please their boy, a big Timmy. They clown water said, there, so did hot gave the tw position fluffy?" Timmy. What give fun, eyes� frown any surprise looked on cat they able other's street looked and see. never sad, " to play




### Observation and Analysis

Please answer the following questions. 

1. What can we say about the generated text in terms of grammar and coherence? 

1_ANS: The grammar is poor and the sentences are incoherent. The sentences lack proper structure, and the spacing between words is inconsistent. The sentences themselves do not make sense and carry no clear thought or meaning.

2. What are the limitations of the Bigram language model?

2_ANS: The Bigram model does not account for longer context and has a memory of only one token. As a result, it produces improper grammar and incoherent sentences.

3. If the model is scaled with more parameters do you expect the bigram model to get substantially better? Why or why not?

3_ANS: No, scaling with more parameters will not make the Bigram model substantially better. The Bigram architecture only looks at one word of context, so it cannot model longer contexts. Even if we found the best parameters, it would still be limited by its context memory.

## Mini GPT (90 points)

We will implement a decoder style transformer model like we discussed in lecture, which is a scaled down version of the [GPT model](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf). 

All the model components follow directly from the original [Attention is All You Need](https://arxiv.org/abs/1706.03762) paper. The only difference is we will use prenormalization and learnt positional embeddings instead of fixed ones.

We will now implement each layer step by step checking if it is implemented correctly in the process. We will finally put together all our layers to get a fully fledged GPT model. 

<span style="color:red">Later layers might depend on previous layers so please make sure to check the previous layers before moving on to the next one.</span>

### Single Head Causal Attention (20 points)

We will first implement the single head causal attention layer. This layer is the same as the scaled dot product attention layer but with a causal mask to prevent the model from looking into the future.

Recall that each head has its own Key, Query and Value matrices, and that scaled dot product attention is calculated as:

\begin{equation}
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
\end{equation}

where $d_k$ is the dimension of the key vectors.
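
The equation, together with the causal mask, can be sketched in NumPy as shown here. This is a minimal illustration and not the `SingleHeadAttention` class from `model.py`: it assumes `q`, `k` and `v` have already been projected, works on a single unbatched sequence, and uses a hypothetical function name `causal_attention`.

```python
import numpy as np


def causal_attention(q, k, v):
    """Scaled dot-product attention with a causal mask.

    q, k: arrays of shape (seq_len, d_k); v: array of shape (seq_len, d_v).
    Returns the attended values and the attention weights.
    """
    seq_len, d_k = q.shape
    scores = q @ k.T / np.sqrt(d_k)  # (seq_len, seq_len)

    # True above the diagonal, i.e. where a query would look at a future key.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)

    # Row-wise softmax; subtracting the row maximum keeps exp() stable.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(4, 8)) for _ in range(3))
out, weights = causal_attention(q, k, v)
print(out.shape, weights.shape)
```

Because masked scores are set to negative infinity before the softmax, future positions receive exactly zero weight, so the first position can only attend to itself.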

The figure below, from the original paper, shows how the layer is to be implemented.

![image](./Images/Single_Head.png)

Image credits: [Attention is All You Need Paper](https://arxiv.org/abs/1706.03762)

Please complete the `SingleHeadAttention` class in `model.py`

In [14]:
model = SingleHeadAttention(MiniGPTConfig.embed_dim, MiniGPTConfig.embed_dim//4, MiniGPTConfig.embed_dim//4) # configs are set as such for testing do not modify
tests.check_singleheadattention(model, path_to_gpt_tester, device)

'TEST CASE PASSED!!!'

### Multi Head Attention (10 points)

Now that we have a single head working, we will scale this across multiple heads. Remember that with multi-head attention we perform `num_heads` attention operations in parallel. We then concatenate the outputs of these parallel attention operations and project them back to the desired dimension using an output linear layer.
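
The concatenate-and-project step can be sketched in NumPy as follows. This is an illustration under assumptions rather than the `MultiHeadAttention` class itself: weights are plain arrays without biases, each head has dimension `embed_dim // num_heads`, and the function and argument names are hypothetical.

```python
import numpy as np


def softmax(x):
    """Row-wise softmax with the usual max subtraction for stability."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def multi_head_attention(x, w_q, w_k, w_v, w_o):
    """x: (seq_len, embed_dim); w_q, w_k, w_v: (num_heads, embed_dim, head_dim);
    w_o: (num_heads * head_dim, embed_dim)."""
    seq_len = x.shape[0]
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

    head_outputs = []
    for wq, wk, wv in zip(w_q, w_k, w_v):  # one causal attention per head
        q, k, v = x @ wq, x @ wk, x @ wv
        scores = q @ k.T / np.sqrt(q.shape[-1])
        scores = np.where(future, -np.inf, scores)
        head_outputs.append(softmax(scores) @ v)

    concat = np.concatenate(head_outputs, axis=-1)  # (seq_len, num_heads * head_dim)
    return concat @ w_o                             # project back to embed_dim


rng = np.random.default_rng(0)
seq_len, embed_dim, num_heads = 5, 16, 4
head_dim = embed_dim // num_heads
x = rng.normal(size=(seq_len, embed_dim))
w_q, w_k, w_v = (rng.normal(size=(num_heads, embed_dim, head_dim)) for _ in range(3))
w_o = rng.normal(size=(num_heads * head_dim, embed_dim))

mha_out = multi_head_attention(x, w_q, w_k, w_v, w_o)
print(mha_out.shape)
```

The explicit loop over heads mirrors the idea of reusing a single-head module several times; a batched implementation would compute all heads in one tensor operation instead.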

The figure below, from the original paper, shows how the layer is to be implemented.

![image](./Images/MultiHead.png)

Image credits: [Attention is All You Need Paper](https://arxiv.org/abs/1706.03762)

Please complete the `MultiHeadAttention` class in `model.py` using the `SingleHeadAttention` class implemented earlier. 

In [15]:
model = MultiHeadAttention(MiniGPTConfig.embed_dim, MiniGPTConfig.num_heads)
tests.check_multiheadattention(model, path_to_gpt_tester, device)

'TEST CASE PASSED!!!'

### Feed Forward Layer (5 points)

As discussed in lecture, the attention layer is mostly linear (its output is a weighted sum of linearly projected values), so we add a feed forward layer to introduce non-linearity. The feed forward layer is a simple two-layer MLP with a GELU activation in between.
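
A two-layer MLP with a GELU in between can be sketched in NumPy as shown here. Several details are assumptions made for illustration: the hidden width of four times the embedding dimension, the presence of biases, and the tanh approximation of GELU (the exact form uses the Gaussian CDF). Please follow the docstring in `model.py` for the required settings.

```python
import numpy as np


def gelu(x):
    """Tanh approximation of GELU (an assumption; the exact form uses erf)."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def feed_forward(x, w1, b1, w2, b2):
    """Expand to the hidden width, apply GELU, then project back."""
    return gelu(x @ w1 + b1) @ w2 + b2


rng = np.random.default_rng(0)
embed_dim = 8
hidden_dim = 4 * embed_dim  # assumed expansion factor
x = rng.normal(size=(3, embed_dim))
w1 = rng.normal(size=(embed_dim, hidden_dim))
b1 = np.zeros(hidden_dim)
w2 = rng.normal(size=(hidden_dim, embed_dim))
b2 = np.zeros(embed_dim)

ff_out = feed_forward(x, w1, b1, w2, b2)
print(ff_out.shape)
```

The layer is applied to each position independently, so the output has the same shape as the input.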

Please complete the `FeedForwardLayer` class in `model.py`

In [16]:
model = FeedForwardLayer(MiniGPTConfig.embed_dim)
tests.check_feedforward(model, path_to_gpt_tester, device)

'TEST CASE PASSED!!!'

### LayerNorm (10 points)

We will now implement the layer normalization layer. Layernorm is used across the model to normalize the activations of the previous layer. Recall that the equation for layernorm is given as:

\begin{equation}
\text{LayerNorm}(x) = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} \odot \gamma + \beta
\end{equation}

where $\gamma$ and $\beta$ are learnable parameters, and $\mu$ and $\sigma^2$ are the mean and variance of $x$.

Remember that, unlike batchnorm, we compute the statistics across the feature dimension and not the batch dimension. Hence we do not need to keep track of running averages.
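The equation can be sketched in a few lines and compared against PyTorch's built-in `nn.LayerNorm`. The class name `TinyLayerNorm`, the parameter names `gamma` and `beta`, and the default `eps=1e-5` are assumptions for illustration; use the names the `LayerNorm` docstring in `model.py` asks for.

```python
import torch
import torch.nn as nn

class TinyLayerNorm(nn.Module):
    """Hypothetical layer norm over the last (feature) dimension."""
    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(dim))   # learnable scale
        self.beta = nn.Parameter(torch.zeros(dim))   # learnable shift

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        # Biased variance (divide by N), matching the formula above
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        return (x - mean) / torch.sqrt(var + self.eps) * self.gamma + self.beta

torch.manual_seed(0)
x = torch.randn(2, 5, 16)
ln = TinyLayerNorm(16)
ref = nn.LayerNorm(16)
print(torch.allclose(ln(x), ref(x), atol=1e-5))
```

Note that the statistics are computed per token over its features, so the result does not depend on the other examples in the batch.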

Please complete the `LayerNorm` class in `model.py`

In [17]:
model = LayerNorm(MiniGPTConfig.embed_dim)
tests.check_layernorm(model, path_to_gpt_tester, device)

'TEST CASE PASSED!!!'

### Transformer Layer (15 points)

We have now implemented all the components of the transformer layer. We will now put it all together to create a transformer layer. The transformer layer consists of a multi head attention layer, a feed forward layer and two layer norm layers.

Please use the following order for each component (this pre-norm ordering varies slightly from the original attention paper, which applies LayerNorm after each sublayer):
1. LayerNorm
2. MultiHeadAttention
3. LayerNorm
4. FeedForwardLayer

Remember that the transformer layer also has residual connections around each sublayer.

The figure below shows the structure of the transformer layer you are required to implement.

![prenorm_transformer](./Images/Prenorm.png)

Image Credit : [CogView](https://arxiv.org/pdf/2105.13290)
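The pre-norm ordering with residual connections can be sketched as below. `PreNormBlock` is a hypothetical name, and the attention sublayer is replaced by a plain `nn.Linear` stand-in purely to keep the example short; in the assignment the sublayers are your own `MultiHeadAttention`, `FeedForwardLayer`, and `LayerNorm` classes.

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Hypothetical pre-norm block: x + sublayer(norm(x)), applied twice."""
    def __init__(self, embed_dim, attn, ff):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.attn = attn
        self.norm2 = nn.LayerNorm(embed_dim)
        self.ff = ff

    def forward(self, x):
        x = x + self.attn(self.norm1(x))  # LayerNorm -> attention -> residual add
        x = x + self.ff(self.norm2(x))    # LayerNorm -> feed forward -> residual add
        return x

torch.manual_seed(0)
block = PreNormBlock(
    16,
    attn=nn.Linear(16, 16),  # stand-in for a real attention module (assumption)
    ff=nn.Sequential(nn.Linear(16, 64), nn.GELU(), nn.Linear(64, 16)),
)
x = torch.randn(2, 5, 16)
y = block(x)
print(y.shape)
```

Because normalization happens inside each residual branch, the input `x` passes through the block unchanged whenever both sublayers output zero.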

Implement the `TransformerLayer` class in `model.py`

In [18]:
model =  TransformerLayer(MiniGPTConfig.embed_dim, MiniGPTConfig.num_heads)
tests.check_transformer(model, path_to_gpt_tester, device)

'TEST CASE PASSED!!!'

### Putting it all together : MiniGPT (15 points)

We are now ready to put all our layers together to build our own MiniGPT! 

The MiniGPT model consists of an embedding layer, a positional encoding layer and a stack of transformer layers. The output of the last transformer layer is passed through a linear layer (called head) to get the final output logits. Note that in our implementation we will use [weight tying](https://arxiv.org/abs/1608.05859) between the embedding layer and the final linear layer. This allows us to save on parameters and also helps in training.
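Weight tying works because the weight of `nn.Embedding(vocab_size, embed_dim)` and the weight of `nn.Linear(embed_dim, vocab_size)` have the same shape, `(vocab_size, embed_dim)`. A minimal sketch, with toy sizes and variable names chosen only for illustration:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 100, 16  # toy sizes (assumption)

embedding = nn.Embedding(vocab_size, embed_dim)
head = nn.Linear(embed_dim, vocab_size, bias=False)

# Tie the weights: both modules now share a single Parameter object
head.weight = embedding.weight

tokens = torch.randint(0, vocab_size, (2, 5))
logits = head(embedding(tokens))
print(logits.shape)
print(head.weight is embedding.weight)
```

Since both modules hold the same parameter, gradients from the input embedding and the output head accumulate into one tensor, and the parameter is counted only once.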

Implement the `MiniGPT` class in `model.py`

In [19]:
model = MiniGPT(MiniGPTConfig)
tests.check_miniGPT(model, path_to_gpt_tester, device)

'TEST CASE PASSED!!!'

### Attempt at training the model (5 points)

We will now attempt to train the model on the text data. We will use the same text data as before. If needed, you can scale down the model parameters in the config file to a smaller value to make training feasible. 

Use the same training script we built for the Bigram model to train the MiniGPT model. If you implemented it correctly it should work just out of the box!

**NOTE** : We will not be able to train the model to completion in this assignment. Unfortunately, without access to a relatively powerful GPU, training a large enough model to see good generation is not feasible. However, you should be able to see the loss decreasing over time. <span style="color:red">To get full points for this section it is sufficient to show that the loss is decreasing over time</span>. You do not need to run this for more than 5000 iterations or 1 hour of training.

In [38]:
from train import solver

In [37]:
solver(model_name="minigpt")

Train dataloader size: 1515490
Eval dataloader size: 378872
number of trainable parameters: 3.32M




Iteration 0, Train Loss: 10.838605880737305


Evaluating: 37888it [02:15, 280.51it/s]                           


Iteration 0, Eval Loss: 10.819780358817892
Iteration 10, Train Loss: 10.624841690063477
Iteration 20, Train Loss: 10.48261833190918
Iteration 30, Train Loss: 10.18985366821289
Iteration 40, Train Loss: 9.865163803100586
Iteration 50, Train Loss: 9.630203247070312
Iteration 60, Train Loss: 9.463032722473145
Iteration 70, Train Loss: 8.963544845581055
Iteration 80, Train Loss: 8.710824966430664
Iteration 90, Train Loss: 8.339428901672363
Iteration 100, Train Loss: 8.041328430175781
Iteration 110, Train Loss: 7.639925956726074
Iteration 120, Train Loss: 7.564150333404541
Iteration 130, Train Loss: 7.117976665496826
Iteration 140, Train Loss: 7.050593376159668
Iteration 150, Train Loss: 6.9907073974609375
Iteration 160, Train Loss: 6.322534084320068
Iteration 170, Train Loss: 6.465750694274902
Iteration 180, Train Loss: 6.338441371917725
Iteration 190, Train Loss: 6.491254806518555
Iteration 200, Train Loss: 6.058965682983398
Iteration 210, Train Loss: 6.4188032150268555
Iteration

Evaluating: 37888it [02:16, 277.56it/s]                           


Iteration 2000, Eval Loss: 4.355146118019484
Iteration 2010, Train Loss: 4.058610916137695
Iteration 2020, Train Loss: 3.967978000640869
Iteration 2030, Train Loss: 4.379275321960449
Iteration 2040, Train Loss: 4.52910041809082
Iteration 2050, Train Loss: 4.179635524749756
Iteration 2060, Train Loss: 3.9965758323669434
Iteration 2070, Train Loss: 4.794686794281006
Iteration 2080, Train Loss: 3.7110347747802734
Iteration 2090, Train Loss: 4.721245288848877
Iteration 2100, Train Loss: 3.883044481277466
Iteration 2110, Train Loss: 4.585407733917236
Iteration 2120, Train Loss: 4.581923007965088
Iteration 2130, Train Loss: 4.581111907958984
Iteration 2140, Train Loss: 4.623101234436035
Iteration 2150, Train Loss: 4.260058403015137
Iteration 2160, Train Loss: 4.1663432121276855
Iteration 2170, Train Loss: 3.943035364151001
Iteration 2180, Train Loss: 4.083060264587402
Iteration 2190, Train Loss: 3.6268532276153564
Iteration 2200, Train Loss: 4.1744704246521
Iteration 2210, Train Los

Evaluating:  19%|█▊        | 7084/37887 [00:25<01:49, 281.78it/s]


KeyboardInterrupt: 
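A quick sanity check on the numbers in the log above: an untrained model that spreads probability roughly uniformly over the vocabulary should start with a cross-entropy loss close to $\ln(V)$. The sketch below assumes the model's vocabulary size equals that of the GPT-2 encoding (50257 tokens); the two loss values are copied from the log.

```python
import math

vocab_size = 50257  # assumption: model vocabulary matches the "gpt2" tokenizer

# Cross-entropy of a uniform prediction over the vocabulary
uniform_loss = math.log(vocab_size)
print(f"uniform-guess loss: {uniform_loss:.4f}")

# Losses copied from the training log above
initial_eval_loss = 10.819780358817892
eval_loss_at_2000 = 4.355146118019484

# Perplexity = exp(loss): the effective number of equally likely next tokens
print(f"perplexity at iteration 0:    {math.exp(initial_eval_loss):.0f}")
print(f"perplexity at iteration 2000: {math.exp(eval_loss_at_2000):.1f}")
```

The initial loss of about 10.82 sits close to the uniform-guess value, which suggests the model starts out near uniform, and the drop to about 4.36 corresponds to a large reduction in perplexity.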

### Train and Valid Plots


**Show the training and validation loss plots**

![Training loss](./train_gpt.png)
![Validation loss](./val_gpt.png)

### Generation (5 points)


Perform generation with the MiniGPT model that you trained. After that, copy over the generation function you used for the Bigram model and generate a mini story using the same seed sentence. 

    `"once upon a time"`
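An autoregressive generation loop can be sketched as follows. The `generate` function here, its signature, and the `NextTokenToy` stand-in model are hypothetical and exist only so the example runs on its own; your `model.generate` method may differ in arguments and in how it samples.

```python
import torch
import torch.nn as nn

class NextTokenToy(nn.Module):
    """Hypothetical stand-in model: always predicts (token + 1) % vocab_size."""
    def __init__(self, vocab_size):
        super().__init__()
        self.vocab_size = vocab_size

    def forward(self, tokens):
        nxt = (tokens + 1) % self.vocab_size
        # Return logits of shape (batch, seq_len, vocab_size)
        return 10.0 * nn.functional.one_hot(nxt, self.vocab_size).float()

@torch.no_grad()
def generate(model, tokens, max_new_tokens, context_length, greedy=False):
    # Accept a 1-D prompt and add a batch dimension
    tokens = tokens.unsqueeze(0) if tokens.dim() == 1 else tokens
    for _ in range(max_new_tokens):
        window = tokens[:, -context_length:]   # crop to the model's context
        logits = model(window)[:, -1, :]       # logits for the next token only
        if greedy:
            next_token = logits.argmax(dim=-1, keepdim=True)
        else:
            probs = torch.softmax(logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)
        tokens = torch.cat([tokens, next_token], dim=1)
    return tokens

out = generate(NextTokenToy(10), torch.tensor([7, 8]),
               max_new_tokens=4, context_length=3, greedy=True)
print(out.squeeze().tolist())  # → [7, 8, 9, 0, 1, 2]
```

Cropping to the last `context_length` tokens matters once the generated sequence grows longer than the positional encoding supports.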

In [23]:
# TODO: Specify the path to your trained model
model_path = "./models/minigpt/mini_model_sufficient_loss_checkpoint_2004.pt"
model = MiniGPT(MiniGPTConfig)
tokenizer = tiktoken.get_encoding("gpt2")
model.load_state_dict(torch.load(model_path)["model_state_dict"])

<All keys matched successfully>

In [25]:
model.to(device)
gen_sent = "Once upon a time"
gen_tokens = torch.tensor(tokenizer.encode(gen_sent))
print("Generating text starting with:", gen_tokens.shape)
gen_tokens = gen_tokens.to(device)
model.eval()
print(
    tokenizer.decode(
        model.generate(gen_tokens, max_new_tokens=200).squeeze().tolist()
    )
)

Generating text starting with: torch.Size([4])
Once upon a time, there was a the door. 
The please when he have fluffypy." bunny went made was.#. She asked, " Forrest on their mom even," friend named Lily just decided to to all her floor the hair!
 plan promised to said. Lucy followed the lil you dangers. One day cat said!" He was cute, "After the bird. Lily were very happy at your squirrel. Sam down the little named Spot Buzz with his ladder one. animals did blanket and asked lonely, there was a time?" When "I'sHiactus use you cars for the surprised out. TheAs Can I seedsmy and band, be dog with her Exec on the just the bird saw a big away. SheDon what had a Oak.
As he knew that fun very patient and escorted't able to lonely isloss and unexpected fun and sad also other and listened with me on more agreed. Then, pond and looked to started with the garden at turned down to Tim.
The car


Please answer the following questions. 

1. What can we say about the generated text in terms of grammar and coherence? 

1_ANS: Compared to the Bigram model, the text is more grammatically structured and contains fewer grammatical errors. The sentences have better structure but are still incoherent and do not convey a clear meaning or thought.

2. If the model is scaled with more parameters do you expect the GPT model to get substantially better? Why or why not?

2_ANS: We expect a model scaled up with more parameters to perform better, since larger models can capture more information and a longer context length lets the model use more context. Adding more layers and scaling up the parameters should improve performance.

### Scaling up the model (5 points)

To show that scale does help the model learn, we have trained a scaled-up version of the model you just implemented. We will load the weights of this model and generate a mini story using the same seed sentence. Note that if you have implemented the model correctly, just scaling the parameters and adding a few bells and whistles to the training script will result in a model like the one we will load now.

In [26]:
from model import MiniGPT
from config import MiniGPTConfig

In [27]:
path_to_trained_model = "pretrained_models/best_train_loss_checkpoint.pth"

In [28]:
ckpt = torch.load(path_to_trained_model, map_location=device) # remove map location if using GPU

In [29]:
# Set the configs for scaled model 
MiniGPTConfig.context_length = 512
MiniGPTConfig.embed_dim = 256
MiniGPTConfig.num_heads = 16
MiniGPTConfig.num_layers = 8

In [30]:
# Load model from checkpoint
model = MiniGPT(MiniGPTConfig)
model.load_state_dict(ckpt["model_state_dict"])

<All keys matched successfully>

In [31]:
tokenizer = tiktoken.get_encoding("gpt2")

In [33]:
model.to(device)
gen_sent = "Once upon a time"
gen_tokens = torch.tensor(tokenizer.encode(gen_sent))
print("Generating text starting with:", gen_tokens.shape)
gen_tokens = gen_tokens.to(device)
model.eval()
print(
    tokenizer.decode(
        model.generate(gen_tokens, max_new_tokens=200).squeeze().tolist()
    )
)

Generating text starting with: torch.Size([4])
Once upon a time, there was a little girl named Lily. She loved to play outside and run around with her friends. One day, they were playing hide and seek when Lily saw a spark in the grass. She ran to it and found her friend, Max.
Max said, "I found a spark! It looks very pretty." Lily said, "Wow, it is very pretty! Let's keep it and cover it." They walked around the flowers, looking for something in the garden.
Suddenly, Lily saw a sparkle in the runs. She said, "Look, Max! It's a spark!" Max looked at it and said, "That's so pretty, but it's disgusting." Lily said, "I know, right? We should wait here."
A few minutes later, Lily and Max went back to the field. They found another shiny rock and put it in a sack to keep it safe. They put the bag back next to the case and walked back home.One


## Bonus (5 points)

The following are some open ended questions that you can attempt if you have time. Feel free to propose your own as well if you have an interesting idea. 

1. The model we have implemented is a decoder only model. Can you implement the encoder part as well? This should not be too hard to do since most of the layers are already implemented.
2. What are some improvements we can add to the training script to make training more efficient and faster? Can you concretely show that the improvements you made help in training the model better?
3. Can you implement a beam search decoder to generate the text instead of greedy decoding? Does this help in generating better text?
4. Can you further optimize the model architecture? For example, can you implement [Multi Query Attention](https://arxiv.org/abs/1911.02150) or [Grouped Query Attention](https://arxiv.org/pdf/2305.13245) to improve the model performance?