# DI 725: Transformers and Attention-Based Deep Networks

## An Assignment for Implementing Transformers in PyTorch

The purpose of this notebook is to guide you through the usage of sample code.

This notebook follows the baseline prepared by Andrej Karpathy, with a custom dataset (Don Quixote by Cervantes). This version of the code, called [nanoGPT](https://github.com/karpathy/nanoGPT), is a rewrite of his well-known [minGPT](https://github.com/karpathy/minGPT).
### Author:
* Ümit Mert Çağlar

## Requirements
Install the requirements for your environment. Once the packages are installed, comment the install cells out for later runs.

Dependencies:

- [pytorch](https://pytorch.org)
- [numpy](https://numpy.org/install/)
-  `transformers` for huggingface transformers (to load GPT-2 checkpoints)
-  `datasets` for huggingface datasets (to download + preprocess datasets)
-  `tiktoken` for OpenAI's fast BPE code
-  `wandb` for optional logging
-  `tqdm` for progress bars
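
Before installing anything, it can help to see which of these packages are already importable. The following is a minimal sketch using only the standard library; `missing_packages` is a hypothetical helper written for this notebook, and the import names are assumed to match the package names above (with PyTorch imported as `torch`).

```python
import importlib.util

# Import names for the dependencies listed above (PyTorch is imported as "torch").
REQUIRED = ["torch", "numpy", "transformers", "datasets", "tiktoken", "wandb", "tqdm"]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
print("missing:", ", ".join(missing) if missing else "none")
```

Anything reported as missing can then be installed with the `pip` cell below.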

In [1]:
#!pip install numpy transformers datasets tiktoken wandb tqdm

Defaulting to user installation because normal site-packages is not writeable


In [None]:
#!pip uninstall torch
#!pip install torch --index-url https://download.pytorch.org/whl/cu121

In [1]:
import torch
print(torch.__version__)
print(torch.cuda.is_available())
print(torch.version.cuda)
print(torch.cuda.current_device())
print(torch.cuda.get_device_name(0))

2.5.1+cu121
True
12.1
0
NVIDIA GeForce RTX 3060 Laptop GPU


The fastest way to get started with transformers, apart from following the DI725 labs, is to use a small model and a small dataset. For this purpose, we start by training a character-level GPT on Don Quixote by Cervantes. The code downloads a single file (2 MB) and applies some transformations. Examine the code in [prepare.py](data/don_char/prepare.py).

## Quick Start

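Use the following to prepare the dataset. The cell below runs the sentiment dataset's preparation script; for the character-level Don Quixote data, run `data/don_char/prepare.py` instead: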

In [2]:
!python data/sentiment/prepare.py

Train class distribution:
sentiment
1    542
0    411
2     17
Name: count, dtype: int64

Test class distribution:
sentiment
0    10
1    10
2    10
Name: count, dtype: int64

Data processing complete:
- Training samples: 776
- Validation samples: 194
- Test samples: 30
Data saved to: c:\Users\nesil.bor\Desktop\Folders\master\DI725\DI725-transformer-sentiment-analysis\data\sentiment\processed

Verifying saved files:
- train.bin: 794624 bytes
- val.bin: 198656 bytes
- test.bin: 30720 bytes
- train_labels.pkl: 928 bytes
- val_labels.pkl: 342 bytes
- test_labels.pkl: 178 bytes


This creates `train.bin` and `val.bin` in that data directory (the sentiment run above also writes `test.bin` and the label files). Now it is time to train our own GPT. The size of the GPT model depends on the available computational resources. A GPU is advised for heavy workloads; a CPU is sufficient for training lightweight models and for evaluation and inference.
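
The numbers printed by the preparation step can be cross-checked with a little arithmetic. The sketch below uses only values copied from the output above; the 2-bytes-per-token figure is an assumption borrowed from upstream nanoGPT (which stores tokens as `uint16`) and should be confirmed against `prepare.py`.

```python
# Numbers copied from the preparation output above.
samples = {"train": 776, "val": 194, "test": 30}
file_bytes = {"train": 794624, "val": 198656, "test": 30720}

# Bytes stored per sample in each .bin file.
bytes_per_sample = {k: file_bytes[k] // samples[k] for k in samples}
print(bytes_per_sample)  # {'train': 1024, 'val': 1024, 'test': 1024}

# ASSUMPTION: tokens are stored as uint16 (2 bytes each), as in upstream nanoGPT.
tokens_per_sample = bytes_per_sample["train"] // 2
print(tokens_per_sample)  # 512

# The labelled training rows (542 + 411 + 17) are split into train and val.
total = 542 + 411 + 17
print(total, samples["train"] / total)  # 970 0.8
```

Every file works out to exactly 1024 bytes per sample, which suggests fixed-length (padded or truncated) sequences, and the train/validation split is 80/20.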

A small-scale GPT is trained with the settings provided in a config file. The character-level Don Quixote settings are in [config/train_don_char.py](config/train_don_char.py); the cell below uses `config/train_sentiment.py`:


In [5]:
!python train.py --config=config/train_sentiment.py --compile=False

Using device: cuda
Loaded train dataset: 776 samples
Loaded val dataset: 194 samples
DataLoaders initialized
W&B initialized in online mode. View live at: https://wandb.ai/<your-username>/nanoGPT-sentiment
Model moved to device
Optimizer initialized
Starting training loop
Iter 0, Loss: 1.2155, Accuracy: 0.3750, Data: 0.023s, Forward: 0.184s, Backward: 0.211s, Total: 0.418s
Validation Loss: 1.2600, Validation Accuracy: 0.4250
Iter 50, Loss: 0.6942, Accuracy: 0.6875, Data: 0.001s, Forward: 0.003s, Backward: 0.121s, Total: 0.124s
Iter 100, Loss: 0.2954, Accuracy: 0.8750, Data: 0.001s, Forward: 0.001s, Backward: 0.124s, Total: 0.126s
Validation Loss: 0.4609, Validation Accuracy: 0.8313
Iter 150, Loss: 0.1122, Accuracy: 1.0000, Data: 0.000s, Forward: 0.004s, Backward: 0.122s, Total: 0.126s
Iter 200, Loss: 0.6257, Accuracy: 0.8750, Data: 0.000s, Forward: 0.005s, Backward: 0.121s, Total: 0.126s
Validation Loss: 0.7716, Validation Accuracy: 0.8063
Patience counter: 1/2
Iter 250, Loss: 0.2014, 

wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
wandb: Currently logged in as: adigew (adigew-middle-east-technical-university). Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.18.7
wandb: Run data is saved locally in c:\Users\nesil.bor\Desktop\Folders\master\DI725\DI725-transformer-sentiment-analysis\wandb\run-20250404_203501-lfbbvnw3
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run neat-eon-22
wandb:  View project at https://wandb.ai/adigew-middle-east-technical-university/nanoGPT-sentiment
wandb:  View run at https://wandb.ai/adigew-middle-east-technical-university/nanoGPT-sentiment/runs/lfbbvnw3
wandb:                                                                                
wandb: 
wandb: Run history:
wandb:  backward_time █▁▁▁▁▁▁▁▂▁▁▁▂▁▂
wandb:      data_time █▁▁▁▁▁▁▂▁▁▁▁▁▁▁
wandb:   forward_time █▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb:      iteration ▁▁▁▂▂▃▃▃▃▄▄▅▅▅▅▆▆▇▇▇▇██
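
The `Patience counter: 1/2` line in the training log suggests that the script uses patience-based early stopping. The actual logic lives in `train.py` and is not shown here, so the following is only a sketch of the general technique, assuming that the counter increases whenever the validation loss fails to improve and that training stops once it reaches the patience value. The `EarlyStopping` class is hypothetical.

```python
class EarlyStopping:
    """Stop training after `patience` evaluations without improvement."""

    def __init__(self, patience=2):
        self.patience = patience
        self.best = float("inf")
        self.counter = 0

    def step(self, val_loss):
        """Record a validation loss; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.counter = 0
        else:
            self.counter += 1
        return self.counter >= self.patience


# Validation losses taken from the log above (iterations 0, 100, 200).
stopper = EarlyStopping(patience=2)
for loss in [1.2600, 0.4609, 0.7716]:
    stop = stopper.step(loss)
    print(f"val loss {loss:.4f} -> counter {stopper.counter}/{stopper.patience}, stop={stop}")
```

With these three losses the counter ends at 1/2, matching the log; one more evaluation without improvement would stop the run under this assumed rule.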

We are training a small-scale GPT with a context size of up to 256 characters, 384 feature channels, and 6 transformer layers with 6 attention heads each. On one RTX 3070 GPU this training run takes about 10 minutes and the best validation loss is 1.1620. Based on the configuration, the model checkpoints are written into the `--out_dir` directory `out-don-char`. Once training finishes, we can sample from the best model by pointing the sampling script at this directory:

In [8]:
!python sample.py --out_dir=out-sentiment

Model loaded from out-sentiment\best_model.pt and moved to cuda
Loaded test dataset: 30 samples

Sample 1:
Text: thank you for calling brownbox customer support  my name is sarah  how may i assist you today? hi  s...
Predicted Sentiment: Neutral (Probabilities: [0.014343023300170898, 0.9666900038719177, 0.018967006355524063])
True Sentiment: Negative

Sample 2:
Text: thank you for calling brownbox customer support  my name is sarah  how can i assist you today? hi sa...
Predicted Sentiment: Negative (Probabilities: [0.9675621390342712, 0.01856013759970665, 0.013877768069505692])
True Sentiment: Negative

Sample 3:
Text: thank you for calling brownbox customer support  my name is jane  how may i assist you today? hi jan...
Predicted Sentiment: Negative (Probabilities: [0.9681538939476013, 0.014998635277152061, 0.016847506165504456])
True Sentiment: Negative

Sample 4:
Text: thank you for calling brownbox customer support  my name is sarah  how may i assist you today? hi sa...
Predicted S

  model.load_state_dict(torch.load(model_path))
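
Each prediction above is the class with the highest probability. A minimal sketch of that step follows, using the probabilities printed for the first two samples. The label order is inferred from the output (index 0 is Negative, index 1 is Neutral); treating index 2 as Positive is an assumption, and `predict_label` is a hypothetical helper rather than a function from `sample.py`.

```python
# ASSUMPTION: class indices map to these names. Indices 0 and 1 match the
# output above; index 2 = "Positive" is a guess.
LABELS = ["Negative", "Neutral", "Positive"]


def predict_label(probabilities):
    """Return the label whose probability is largest (argmax)."""
    best_index = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return LABELS[best_index]


sample_1 = [0.014343023300170898, 0.9666900038719177, 0.018967006355524063]
sample_2 = [0.9675621390342712, 0.01856013759970665, 0.013877768069505692]
print(predict_label(sample_1))  # Neutral
print(predict_label(sample_2))  # Negative
```

Note that sample 1 is confidently predicted as Neutral while its true label is Negative, so a high probability is not a guarantee of correctness.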


In [10]:
!python sample.py --out_dir=out-sentiment --checkpoint=final_model.pt

Model loaded from out-sentiment\final_model.pt and moved to cuda
Loaded test dataset: 30 samples

Sample 1:
Text: thank you for calling brownbox customer support  my name is sarah  how may i assist you today? hi  s...
Predicted Sentiment: Neutral (Probabilities: [0.021121535450220108, 0.9692690968513489, 0.009609431959688663])
True Sentiment: Negative

Sample 2:
Text: thank you for calling brownbox customer support  my name is sarah  how can i assist you today? hi sa...
Predicted Sentiment: Negative (Probabilities: [0.9722861647605896, 0.014958624728024006, 0.012755151838064194])
True Sentiment: Negative

Sample 3:
Text: thank you for calling brownbox customer support  my name is jane  how may i assist you today? hi jan...
Predicted Sentiment: Negative (Probabilities: [0.9721629023551941, 0.013804925605654716, 0.01403222605586052])
True Sentiment: Negative

Sample 4:
Text: thank you for calling brownbox customer support  my name is sarah  how may i assist you today? hi sa...
Predicted 

  model.load_state_dict(torch.load(model_path))


This generates a few samples, for example:

```
“I grant all that,” said the governor; “it’s not in a low voice

but not yet forget that there’s none of it the poor in the world; I’ll

like to take special to have been no one to write out the stone of

patience to the village.”

```

It is pretty nice to have a GPT after a few minutes of character-level training! Better results can possibly be achieved by hyperparameter tuning and by finetuning (transfer learning) from a pre-trained model.


## Quick start with fewer resources

If we are [low on resources](https://www.youtube.com/watch?v=rcXzn6xXdIc), we can use a simpler version of the training. First, we need to set `--compile=False`; this is currently also required on Windows. We also set the device to CPU. A model that trains in 10 minutes on an entry-level GPU will take much longer on a CPU, so we can also decrease the dimensions of the model as follows:

In [11]:
!python train.py config/train_sentiment.py --device=cpu --out_dir="out-sentiment" --compile=False --eval_iters=20 --log_interval=50 --block_size=64 --batch_size=12 --n_layer=4 --n_head=4 --n_embd=128 --max_iters=1000 --lr_decay_iters=1000 --dropout=0.0

Using device: cuda
Loaded train dataset: 776 samples
Loaded val dataset: 194 samples
DataLoaders initialized
W&B initialized in online mode. View live at: https://wandb.ai/<your-username>/nanoGPT-sentiment
Model moved to device
Optimizer initialized
Starting training loop
Iter 0, Loss: 1.2171, Accuracy: 0.1875, Data: 0.023s, Forward: 0.187s, Backward: 0.183s, Total: 0.393s
Validation Loss: 1.3223, Validation Accuracy: 0.0437
Iter 50, Loss: 0.8911, Accuracy: 0.6250, Data: 0.000s, Forward: 0.005s, Backward: 0.121s, Total: 0.126s
Iter 100, Loss: 0.1665, Accuracy: 0.9375, Data: 0.000s, Forward: 0.009s, Backward: 0.121s, Total: 0.130s
Validation Loss: 0.6461, Validation Accuracy: 0.7875
Iter 150, Loss: 0.3465, Accuracy: 0.9375, Data: 0.001s, Forward: 0.004s, Backward: 0.119s, Total: 0.124s
Iter 200, Loss: 0.6365, Accuracy: 0.8125, Data: 0.000s, Forward: 0.002s, Backward: 0.120s, Total: 0.122s
Validation Loss: 0.4342, Validation Accuracy: 0.8688
Iter 250, Loss: 0.6398, Accuracy: 0.8125, Data

wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
wandb: Currently logged in as: adigew (adigew-middle-east-technical-university). Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.18.7
wandb: Run data is saved locally in c:\Users\nesil.bor\Desktop\Folders\master\DI725\DI725-transformer-sentiment-analysis\wandb\run-20250404_210619-t200mgfy
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run lyric-music-23
wandb:  View project at https://wandb.ai/adigew-middle-east-technical-university/nanoGPT-sentiment
wandb:  View run at https://wandb.ai/adigew-middle-east-technical-university/nanoGPT-sentiment/runs/t200mgfy
wandb: - 0.007 MB of 0.007 MB uploaded
wandb: \ 0.008 MB of 0.008 MB uploaded
wandb: | 0.008 MB of 0.008 MB uploaded
wandb:                                                                                
wandb: 
wandb: Run history:
wandb:  backward_time █▁▁▁▁▁▁▂▁
wandb: 

*Here, since we are running on CPU instead of GPU, we must set `--device=cpu` and also turn off PyTorch 2.0 compile with `--compile=False`. When we evaluate, we get a noisier but faster estimate (`--eval_iters=20`, down from 200), our context size is only 64 characters instead of 256, and the batch size is only 12 examples per iteration, not 64. We also use a much smaller Transformer (4 layers, 4 heads, 128 embedding size) and decrease the number of iterations to 1000 (and, as usual, decay the learning rate over about `max_iters` with `--lr_decay_iters`). Because our network is so small, we also ease down on regularization (`--dropout=0.0`). This still runs in about 5 minutes, but gets us a loss of only 1.88 and therefore also worse samples, but it's still good fun:*

In [12]:
!python sample.py --out_dir=out-sentiment --device=cpu

usage: sample.py [-h] [--out_dir OUT_DIR] [--checkpoint CHECKPOINT]
sample.py: error: unrecognized arguments: --device=cpu


Generates samples like this:

```
Sancho nother with this then of everantan has for five he enver any

shal were than as in though they and I knight the sther his a jlage,

and mad priled and squiel a hist to in feet she took and and sersse to her of

Marest and good was pefor rubt some by than lave from his dintat all

pack that he remants to goost ever to him arestiance of it the were to who

which mom, worly gane for he sporen gort he was roosion, and be that

it thou, so so he kniders what the and him of him dest us on shart
```

*Not bad for ~3 minutes on a CPU, for a hint of the right character gestalt. If you're willing to wait longer, feel free to tune the hyperparameters, increase the size of the network, the context length (`--block_size`), the length of training, etc.*
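
To get a feel for how much these knobs change the model size, a rough parameter estimate can be computed by hand. The sketch below assumes GPT-2 style blocks with a 4x MLP expansion and counts only the large weight matrices inside the transformer blocks; `approx_block_params` is a hypothetical helper, and the true count reported by the training script will be somewhat higher because of embeddings, biases, and layer norms.

```python
def approx_block_params(n_layer, n_embd):
    """Rough count of weights in the transformer blocks only.

    ASSUMPTION: GPT-2 style blocks with a 4x MLP expansion, so each block has
    about 4*d^2 attention weights and 8*d^2 MLP weights. Biases, layer norms,
    and embeddings are ignored.
    """
    return 12 * n_layer * n_embd ** 2


print(approx_block_params(6, 384))  # 10616832 (6 layers, 384 channels)
print(approx_block_params(4, 128))  # 786432   (4 layers, 128 channels)
```

Under this estimate the reduced CPU configuration is roughly 13 times smaller than the 6-layer, 384-channel model, which is a large part of why it trains so much faster.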

*Finally, on Apple Silicon MacBooks with a recent PyTorch version, make sure to add `--device=mps` (short for "Metal Performance Shaders"); PyTorch then uses the on-chip GPU, which can significantly accelerate training (2-3X) and allow you to use larger networks. See [Issue 28](https://github.com/karpathy/nanoGPT/issues/28) for more.*
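
The choice between `cuda`, `mps`, and `cpu` can also be made programmatically. The following is a minimal sketch, not part of the repository: `pick_device` is a hypothetical helper, and it falls back to `cpu` when PyTorch is not installed or when no accelerator is reported.

```python
def pick_device():
    """Return "cuda", "mps", or "cpu" depending on what PyTorch reports."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is not installed, so there is nothing to choose
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)  # absent in older PyTorch builds
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


print(pick_device())
```

The returned string could then be passed to a script that accepts a `--device` argument.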



## Finetuning

Finetuning, or transfer learning, is a valuable method for obtaining better models by starting from pre-trained ones. Finetuning GPT models is just as simple as training from scratch! We will now download Don Quixote again, but this time we will represent it with tokens (using OpenAI's BPE tokenizer) instead of characters.



In [7]:
!python data/don/prepare.py

train has 592,353 tokens
val has 66,303 tokens


Run an example finetuning like:

In [8]:
!python train.py config/finetune_don.py --compile=False

Overriding config with config/finetune_don.py:
import time

out_dir = 'out-don'
eval_interval = 5
eval_iters = 40
wandb_log = False # feel free to turn on
wandb_project = 'don'
wandb_run_name = 'ft-' + str(time.time())

dataset = 'don'
init_from = 'gpt2' # this is the GPT-2 model

# only save checkpoints if the validation loss improves
always_save_checkpoint = False

# the number of examples per iter:
batch_size = 1
gradient_accumulation_steps = 32
max_iters = 20

# finetune at constant LR
learning_rate = 3e-5
decay_lr = False

Overriding: compile = False
tokens per iteration will be: 32,768
Initializing from OpenAI GPT-2 weights: gpt2
loading weights from pretrained gpt: gpt2
forcing vocab_size=50257, block_size=1024, bias=True
overriding dropout rate to 0.0
number of parameters: 123.65M
num decayed parameter tensors: 50, with 124,318,464 parameters
num non-decayed parameter tensors: 98, with 121,344 parameters
using fused AdamW: True
step 0: train loss 3.3541, val loss 3.3616
iter 0:

This will load the config parameter overrides in `config/finetune_don.py`. Basically, we initialize from a GPT-2 checkpoint with `init_from` and train as normal, except for fewer iterations and with a small learning rate. The starting model can be changed to any of `{'gpt2', 'gpt2-medium', 'gpt2-large', 'gpt2-xl'}`, and the context length can be decreased with `block_size`. The best checkpoint (lowest validation loss) will be in the `out_dir` directory, e.g. in `out-don` by default, per the config file. You can then run `sample.py --out_dir=out-don`:
```
* * All creatures that enter the world below may so far as want to observe the rules of their own land, and may obey them under the hand of their lord, and may not follow others below.

* * *

THE PORT COLLIDATES,

- * *

ON the light, and the light to the dark, and the darkness to the light, and the darkness to the darkness, were the present-day laws of monarchy, whose lordship they approved in their faces and hearts. From this moment on, however, they had no other representation to give than that of their master, who, for all that was said or heard, had reached the height of his power.

The king's hand, though at times little more than a finger of his, required no more than a finger of his, and that power was, that of holding his eye, and the other of his, in his own, body.

When this was spoken of, it was a simple and noble quibble, and the subject of this was so as to admit of the few who had any forsemination, and the few who had the most to go on.

The time did not come for a thought of this, and for a moment the very thought of it seemed to fall to the ground.

But that thought did not come to pass; though the king was not speaking of the king, it came to pass that the king, with all his might, and all his cunning, and no other sense, and without any understanding, and without any desire for the utmost of his services, and without any desire to put an end to his own glory, and without any desire to hide his triumph, had found the time to say that this was what he thought on the subject of religion; that it was what he thought, and according as it seemed to him to be as good or better to him than to the other kings, and he was in no sense a king, for it seemed to him he could never have any more power than he had to be; that it was a matter of his will and power; and that it was all a matter of his will, for he was determined that this look and that to which he might have been given to hold it was the best in himself.

And so it was that the king, who was all around him, and all around him; and so
```
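
Returning to the finetuning run: the log line `tokens per iteration will be: 32,768` follows directly from the config values. The sketch below reproduces that number, assuming a single process (no multi-GPU data parallelism), and relates it to the size of the training set reported by `data/don/prepare.py`.

```python
# Values from config/finetune_don.py and the finetuning log above.
batch_size = 1
gradient_accumulation_steps = 32
block_size = 1024  # "forcing ... block_size=1024" in the log
max_iters = 20
train_tokens = 592_353  # "train has 592,353 tokens"

# ASSUMPTION: a single process, so no extra factor for data parallelism.
tokens_per_iter = gradient_accumulation_steps * batch_size * block_size
print(f"{tokens_per_iter:,}")  # 32,768

# Total tokens processed over the run, relative to the training set size.
tokens_seen = tokens_per_iter * max_iters
print(f"{tokens_seen:,}")  # 655,360
print(round(tokens_seen / train_tokens, 2))  # 1.11
```

So 20 iterations process about as many tokens as one pass over the training data, although batches are typically sampled at random rather than swept in order.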

## Inference and Sampling
Use the script `sample.py` to sample either from pre-trained GPT-2 models released by OpenAI, or from a model you trained yourself. For example, here is a way to sample from the finetuned model in `out-don` with a custom prompt:

In [9]:
!python sample.py --out_dir=out-don --start="Explain the relationship between Don Quixote and Sancho Panza" --num_samples=5 --max_new_tokens=100

Overriding: out_dir = out-don
Overriding: start = Explain the relationship between Don Quixote and Sancho Panza
Overriding: num_samples = 5
Overriding: max_new_tokens = 100
number of parameters: 123.65M
No meta.pkl found, assuming GPT-2 encodings...
Explain the relationship between Don Quixote and Sancho Panza. The story, according to the newspapers, is that the two men want to take over the city, and that Sancho has accepted that.

I have been told that Don Quixote was not the first man to arrive to the city, and that Sancho was not the first. I have no reason to doubt that, for Don Quixote is an itinerant man, who is always crossing the country, and with a lot of money. I have been told that many other
---------------
Explain the relationship between Don Quixote and Sancho Panza, the magician who is almost always mistaken for the Countess of Sancho Panza. This is a trick used by the Countess of Sancho Panza.

The Countess of Sancho Panza is made the subject of many rumors, but the tr

If you'd like to sample from a model you trained, use the `--out_dir` to point the code appropriately. You can also prompt the model with some text from a file, e.g.:

In [10]:
!python sample.py --start=FILE:"prompt/fictional.txt" --out_dir="out-don" --num_samples=1 --max_new_tokens=100

Overriding: start = FILE:prompt/fictional.txt
Overriding: out_dir = out-don
Overriding: num_samples = 1
Overriding: max_new_tokens = 100
number of parameters: 123.65M
No meta.pkl found, assuming GPT-2 encodings...
Dain charges head on with his warhammer and full plate clad armor. Determined to topple anything in front of him.

One of the last bastions of the Dawn so far held in the voids held a massive wattle and rotted horse that had been drenched in sweat and had been stripped of its armour. Its bones were cast into the ground and then buried in the snow.

The dragon charged and made a wry noise, and turned back to face the Dawn. He glanced at the dragon, and then at Dain with a smile that was almost full of fear, but that made
---------------
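
The `FILE:` prefix tells the script to read the prompt from disk instead of using the argument text directly. A minimal sketch of how such a prefix could be handled is shown below; `resolve_prompt` is a hypothetical helper, not the function used in `sample.py`, and the file in the usage example is a throwaway created only for the demonstration.

```python
import os
import tempfile


def resolve_prompt(start):
    """Return the prompt text, reading it from disk when prefixed with FILE:."""
    prefix = "FILE:"
    if start.startswith(prefix):
        with open(start[len(prefix):], "r", encoding="utf-8") as f:
            return f.read()
    return start


# Usage with a temporary file (the path exists only for this demo).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "prompt.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("Once upon a time")
    print(resolve_prompt("FILE:" + path))  # Once upon a time
    print(resolve_prompt("Hello"))         # Hello
```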


In [11]:
!python sample.py --start=FILE:"prompt/positive_review.txt" --out_dir="out-don" --num_samples=1 --max_new_tokens=500

Overriding: start = FILE:prompt/positive_review.txt
Overriding: out_dir = out-don
Overriding: num_samples = 1
Overriding: max_new_tokens = 500
number of parameters: 123.65M
No meta.pkl found, assuming GPT-2 encodings...
This place was DELICIOUS!! My parents saw a recommendation to visit this place from Rick Sebak's \"25 Things I Like About Pittsburgh\" and he's usually pretty accurate. His recommendations were to try the Reuben, Fish Sandwich and Open-Faced Steak Sandwich. We went early afternoon for a late lunch today (a Saturday) and were seated right away. The staff is extremely friendly. My Mom & I each had the fish sandwich, while my Dad & Brother had a Reuben sandwich. The fish was very good, but the Reuben was to die for! Both dishes were massive, and could very easily be shared between two people. On top of being extremely large portions, it was incredibly affordable. The giant fish sandwich was $8 and the giant Reuben was $7.50. Our drinks were always filled and we were checke

I hope you enjoy working with GPT as much as I did!

## Efficiency notes

*For simple model benchmarking and profiling, `bench.py` might be useful. It's identical to what happens in the meat of the training loop of `train.py`, but omits much of the other complexities.*

*Note that the code by default uses [PyTorch 2.0](https://pytorch.org/get-started/pytorch-2.0/). At the time of writing (Dec 29, 2022) this makes `torch.compile()` available in the nightly release. The improvement from the one line of code is noticeable, e.g. cutting down iteration time from ~250ms / iter to 135ms / iter. Nice work PyTorch team!*
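
To make the quoted timings concrete, the sketch below computes the speedup implied by going from 250 ms to 135 ms per iteration and shows a generic wall-clock timing helper. Both `ms_per_iter` and `speedup` are hypothetical helpers using only the standard library; they are not taken from `bench.py`, and the timed workload is a stand-in for a training step.

```python
import time


def ms_per_iter(fn, iters=20, warmup=3):
    """Average wall-clock milliseconds per call of fn()."""
    for _ in range(warmup):  # warm-up calls are not timed
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) * 1000 / iters


def speedup(before_ms, after_ms):
    """How many times faster the second timing is than the first."""
    return before_ms / after_ms


print(f"{speedup(250, 135):.2f}x")  # 1.85x
print(f"{ms_per_iter(lambda: sum(range(10_000))):.3f} ms / iter")
```

When timing GPU work, CUDA kernels run asynchronously, so a call such as `torch.cuda.synchronize()` is needed before reading the clock for the measurement to be meaningful.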


## Troubleshooting

*Note that by default this repo uses PyTorch 2.0 (i.e. `torch.compile`). This is fairly new and experimental, and not yet available on all platforms (e.g. Windows). If you're running into related error messages try to disable this by adding `--compile=False` flag. This will slow down the code but at least it will run.*

*For some context on this repository, GPT, and language modeling, it might be helpful to watch Andrej Karpathy's [Zero To Hero series](https://karpathy.ai/zero-to-hero.html). Specifically, the [GPT video](https://www.youtube.com/watch?v=kCc8FmEb1nY) is popular if you have some prior language modeling context.*

## Acknowledgements

This code is forked from Andrej Karpathy's introductory [nanoGPT repository](https://github.com/karpathy/nanoGPT), which is an updated form of minGPT.

## Further Experiments

(Optional)

For further experiments, you can, for example, reproduce GPT-2, which is still a powerful model, by following the instructions in Andrej Karpathy's repository.