I want to confirm I can train for hours on the GPU using `base_train.py`. I'll do a few quick 100-step tests to confirm that:

- The new code added in the last few challenges is working correctly in this environment and on the GPU
- I know how to use tmux and can end my ssh session without killing the process
- I'm correctly redirecting stdout and stderr to a file
- I can load a checkpoint
- I've chosen config params that won't OOM
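
The redirect item in the list above corresponds to the shell idiom `> file 2>&1`. As a minimal sketch of the same idea from Python (the helper name `run_logged` and the file name `redirect_check.txt` are made up for illustration, not part of the training code), a child process can have both streams sent to one file:

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_logged(cmd, log_path):
    """Run cmd with stdout and stderr going to one file, like `cmd > log 2>&1`."""
    with open(log_path, "w") as log_file:
        result = subprocess.run(cmd, stdout=log_file, stderr=subprocess.STDOUT)
    return result.returncode


# A tiny child process that writes one line to each stream.
child = [
    sys.executable,
    "-c",
    "import sys; print('to stdout'); print('to stderr', file=sys.stderr)",
]
log_path = Path(tempfile.mkdtemp()) / "redirect_check.txt"
exit_code = run_logged(child, log_path)
captured = log_path.read_text()
print(exit_code, repr(captured))
```

Both lines should land in the file, though not necessarily in the order they were printed: Python block-buffers stdout when it is redirected to a file, while stderr is flushed sooner. For the same reason, `tail` on a training log can lag behind the run unless the script flushes its prints or is started with `python -u`.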

In [4]:
import sys
sys.path.append('../my_nanochat')
import os
import torch
from my_nanochat.my_common import get_base_dir, autodetect_device_type
from my_nanochat.my_checkpoint_manager import build_model
from contextlib import nullcontext

In [5]:
device_type = autodetect_device_type()
device = torch.device(device_type)
autocast_ctx = torch.amp.autocast(device_type=device_type, dtype=torch.bfloat16) if device_type == "cuda" else nullcontext()

Autodetected device type: cuda


In tmux shell:

```
source .venv/bin/activate

cd challenge-19-train-for-hours-on-GPU

export PYTHONPATH=../my_nanochat/

python -m scripts.my_base_train --depth=10 --max_seq_len=400 --device_batch_size=2 --num_iterations=100 --total_batch_size=800 --eval_every=10 --eval_tokens=8000 > base_train_output_001.txt 2>&1
```
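
The flags in this command are chosen so that `total_batch_size` equals `device_batch_size * max_seq_len` (2 × 400 = 800). A minimal sketch of that arithmetic follows, assuming `total_batch_size` is counted in tokens and a single GPU; whether `my_base_train` computes it exactly this way is an assumption, and the function names here are illustrative.

```python
def grad_accum_steps(total_batch_size, device_batch_size, max_seq_len, world_size=1):
    """Micro-batches per optimizer step, assuming total_batch_size is in tokens."""
    tokens_per_micro_batch = device_batch_size * max_seq_len * world_size
    if total_batch_size % tokens_per_micro_batch != 0:
        raise ValueError(
            f"total_batch_size={total_batch_size} is not a multiple of "
            f"{tokens_per_micro_batch} tokens per micro-batch"
        )
    return total_batch_size // tokens_per_micro_batch


def tokens_per_second(total_batch_size, dt_ms):
    """Throughput implied by one optimizer step taking dt_ms milliseconds."""
    return total_batch_size / (dt_ms / 1000.0)


# Flags from the command above: one micro-batch per step, no accumulation.
print(grad_accum_steps(total_batch_size=800, device_batch_size=2, max_seq_len=400))

# 800 tokens in a ~409 ms step gives roughly 1,955 tokens per second.
print(round(tokens_per_second(800, 409.17)))
```

A step time of about 409 ms for 800 tokens works out to roughly 1,955 tok/sec, which is consistent with a token-denominated batch size.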

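Note for later: if the notebook has been using torch or the model, restart or shut down the notebook kernel before running training from the shell, so the notebook isn't holding onto GPU memory. Interrupting the kernel only stops the running cell; the kernel process and its GPU allocations stay alive.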

In [32]:
!tail -15 base_train_output_001.txt

step 00095/00100 (95.00%) | loss: 7.347408 | grad norm: 1.4555 | lrm: 0.25 | dt: 409.17ms | tok/sec: 1,955 | mfu: -1.00 | total time: 0.57m
step 00096/00100 (96.00%) | loss: 7.397736 | grad norm: 3.1600 | lrm: 0.20 | dt: 409.63ms | tok/sec: 1,953 | mfu: -1.00 | total time: 0.58m
step 00097/00100 (97.00%) | loss: 7.439689 | grad norm: 3.1425 | lrm: 0.15 | dt: 409.14ms | tok/sec: 1,955 | mfu: -1.00 | total time: 0.58m
step 00098/00100 (98.00%) | loss: 7.463372 | grad norm: 2.2365 | lrm: 0.10 | dt: 408.96ms | tok/sec: 1,956 | mfu: -1.00 | total time: 0.59m
step 00099/00100 (99.00%) | loss: 7.462241 | grad norm: 1.7811 | lrm: 0.05 | dt: 410.58ms | tok/sec: 1,948 | mfu: -1.00 | total time: 0.60m
step 00100 | Validation bpb: 2.7558
TODO evaluate CORE metric
TODO sample
saved model to /home/paperspace/.cache/my_nanochat/base_checkpoints/d10/model_000100.pt
saved optimizer to /home/paperspace/.cache/my_nanochat/base_checkpoints/d10/model_000100.pt
saved metadata to /home/paperspace/.cache/my_n

Looks like it ran fine. Try to load the checkpoint.

In [33]:
checkpoint_dir = os.path.join(get_base_dir(), "base_checkpoints", "d10")
model, tokenizer, meta_data = build_model(checkpoint_dir, step=100, device=device, phase="eval")
bos_token_id = tokenizer.get_bos_token_id()
with torch.no_grad():
    for prompt in ['The person', 'He went to', '1 + 2 = ', 'first of', '3 cats and 2', 'mom and', 'the red', 'She']:
        with autocast_ctx:
            logits = model(torch.tensor([tokenizer.encode(prompt, prepend=bos_token_id)], device=device)).detach()
            top_3_next_tokens = torch.topk(logits[0,-1,:], k=3).indices
            print(f"{prompt}{'|'.join([tokenizer.decode([token]) for token in top_3_next_tokens])}")

Building model with config: {'sequence_len': 400, 'vocab_size': 65537, 'n_layer': 10, 'n_head': 5, 'n_kv_head': 5, 'n_embd': 640}
The person,| is| of
He went to the| be| 
1 + 2 = 20|2|3
first of the| water| a
3 cats and 2.|,| and
mom and smart| resiliency| water
the red,| and|.
She,|.| of


Try a few more configurations, still with just 100 steps:

```
python -m scripts.my_base_train --depth=12 --max_seq_len=600 --device_batch_size=2 --num_iterations=100 --total_batch_size=1200 --eval_every=10 --eval_tokens=8000 > base_train_output_002.txt 2>&1
```

In [3]:
!tail -15 base_train_output_002.txt

step 00095/00100 (95.00%) | loss: 7.628609 | grad norm: 2.4277 | lrm: 0.25 | dt: 832.14ms | tok/sec: 1,442 | mfu: -1.00 | total time: 1.16m
step 00096/00100 (96.00%) | loss: 7.576799 | grad norm: 1.8703 | lrm: 0.20 | dt: 831.21ms | tok/sec: 1,443 | mfu: -1.00 | total time: 1.17m
step 00097/00100 (97.00%) | loss: 7.524256 | grad norm: 1.9663 | lrm: 0.15 | dt: 833.41ms | tok/sec: 1,439 | mfu: -1.00 | total time: 1.19m
step 00098/00100 (98.00%) | loss: 7.467676 | grad norm: 1.6936 | lrm: 0.10 | dt: 832.56ms | tok/sec: 1,441 | mfu: -1.00 | total time: 1.20m
step 00099/00100 (99.00%) | loss: 7.425730 | grad norm: 1.4603 | lrm: 0.05 | dt: 833.03ms | tok/sec: 1,440 | mfu: -1.00 | total time: 1.21m
step 00100 | Validation bpb: 2.6234
TODO evaluate CORE metric
TODO sample
saved model to /home/paperspace/.cache/my_nanochat/base_checkpoints/d12/model_000100.pt
saved optimizer to /home/paperspace/.cache/my_nanochat/base_checkpoints/d12/model_000100.pt
saved metadata to /home/paperspace/.cache/my_n

```
python -m scripts.my_base_train --depth=12 --max_seq_len=800 --device_batch_size=2 --num_iterations=100 --total_batch_size=1600 --eval_every=10 --eval_tokens=8000 > base_train_output_003.txt 2>&1
```

In [4]:
!tail -15 base_train_output_003.txt

step 00095/00100 (95.00%) | loss: 7.574343 | grad norm: 3.1946 | lrm: 0.25 | dt: 949.87ms | tok/sec: 1,684 | mfu: -1.00 | total time: 1.33m
step 00096/00100 (96.00%) | loss: 7.604295 | grad norm: 2.6211 | lrm: 0.20 | dt: 950.84ms | tok/sec: 1,682 | mfu: -1.00 | total time: 1.34m
step 00097/00100 (97.00%) | loss: 7.578888 | grad norm: 4.6293 | lrm: 0.15 | dt: 950.04ms | tok/sec: 1,684 | mfu: -1.00 | total time: 1.36m
step 00098/00100 (98.00%) | loss: 7.519093 | grad norm: 2.2574 | lrm: 0.10 | dt: 951.97ms | tok/sec: 1,680 | mfu: -1.00 | total time: 1.37m
step 00099/00100 (99.00%) | loss: 7.469648 | grad norm: 1.7135 | lrm: 0.05 | dt: 951.25ms | tok/sec: 1,682 | mfu: -1.00 | total time: 1.39m
step 00100 | Validation bpb: 2.6396
TODO evaluate CORE metric
TODO sample
saved model to /home/paperspace/.cache/my_nanochat/base_checkpoints/d12/model_000100.pt
saved optimizer to /home/paperspace/.cache/my_nanochat/base_checkpoints/d12/model_000100.pt
saved metadata to /home/paperspace/.cache/my_n

In [7]:
checkpoint_dir = os.path.join(get_base_dir(), "base_checkpoints", "d12")
model, tokenizer, meta_data = build_model(checkpoint_dir, step=100, device=device, phase="eval")
bos_token_id = tokenizer.get_bos_token_id()
with torch.no_grad():
    for prompt in ['The person', 'He went to', '1 + 2 = ', 'first of', '3 cats and 2', 'mom and', 'the red', 'She']:
        with autocast_ctx:
            logits = model(torch.tensor([tokenizer.encode(prompt, prepend=bos_token_id)], device=device)).detach()
            top_3_next_tokens = torch.topk(logits[0,-1,:], k=3).indices
            print(f"{prompt}{'|'.join([tokenizer.decode([token]) for token in top_3_next_tokens])}")

Building model with config: {'sequence_len': 800, 'vocab_size': 65537, 'n_layer': 12, 'n_head': 6, 'n_kv_head': 6, 'n_embd': 768}
The person was| were|,
He went to the| take| 
1 + 2 = 20|17|10
first of the| | a
3 cats and 2/|,|–
mom and | the| a
the red in| from|,
Sheia|,| and


```
python -m scripts.my_base_train --depth=12 --max_seq_len=1000 --device_batch_size=2 --num_iterations=100 --total_batch_size=2000 --eval_every=10 --eval_tokens=8000 > base_train_output_004.txt 2>&1
```

In [1]:
!tail -15 base_train_output_004.txt

step 00095/00100 (95.00%) | loss: 7.455878 | grad norm: 1.5860 | lrm: 0.25 | dt: 1071.99ms | tok/sec: 1,865 | mfu: -1.00 | total time: 1.49m
step 00096/00100 (96.00%) | loss: 7.446316 | grad norm: 1.4359 | lrm: 0.20 | dt: 1069.92ms | tok/sec: 1,869 | mfu: -1.00 | total time: 1.51m
step 00097/00100 (97.00%) | loss: 7.415680 | grad norm: 1.1721 | lrm: 0.15 | dt: 1070.60ms | tok/sec: 1,868 | mfu: -1.00 | total time: 1.53m
step 00098/00100 (98.00%) | loss: 7.382005 | grad norm: 1.2093 | lrm: 0.10 | dt: 1071.05ms | tok/sec: 1,867 | mfu: -1.00 | total time: 1.55m
step 00099/00100 (99.00%) | loss: 7.330871 | grad norm: 1.1584 | lrm: 0.05 | dt: 1069.64ms | tok/sec: 1,869 | mfu: -1.00 | total time: 1.56m
step 00100 | Validation bpb: 2.5969
TODO evaluate CORE metric
TODO sample
saved model to /home/paperspace/.cache/my_nanochat/base_checkpoints/d12/model_000100.pt
saved optimizer to /home/paperspace/.cache/my_nanochat/base_checkpoints/d12/model_000100.pt
saved metadata to /home/paperspace/.cache

### hours-long run

Here's the config for the long run. First do 100 steps to figure out timing.

```
python -m scripts.my_base_train --depth=12 --max_seq_len=1000 --device_batch_size=3 --num_iterations=100 --total_batch_size=3000 --eval_every=50 --eval_tokens=8000 > base_train_output_005.txt 2>&1
```

In [2]:
!tail -15 base_train_output_005.txt

step 00095/00100 (95.00%) | loss: 7.256703 | grad norm: 1.0395 | lrm: 0.25 | dt: 1341.12ms | tok/sec: 2,236 | mfu: -1.00 | total time: 1.88m
step 00096/00100 (96.00%) | loss: 7.224554 | grad norm: 0.9416 | lrm: 0.20 | dt: 1343.25ms | tok/sec: 2,233 | mfu: -1.00 | total time: 1.90m
step 00097/00100 (97.00%) | loss: 7.211367 | grad norm: 1.3675 | lrm: 0.15 | dt: 1340.66ms | tok/sec: 2,237 | mfu: -1.00 | total time: 1.93m
step 00098/00100 (98.00%) | loss: 7.195680 | grad norm: 1.0433 | lrm: 0.10 | dt: 1341.77ms | tok/sec: 2,235 | mfu: -1.00 | total time: 1.95m
step 00099/00100 (99.00%) | loss: 7.287336 | grad norm: 2.1298 | lrm: 0.05 | dt: 1340.90ms | tok/sec: 2,237 | mfu: -1.00 | total time: 1.97m
step 00100 | Validation bpb: 2.3093
TODO evaluate CORE metric
TODO sample
saved model to /home/paperspace/.cache/my_nanochat/base_checkpoints/d12/model_000100.pt
saved optimizer to /home/paperspace/.cache/my_nanochat/base_checkpoints/d12/model_000100.pt
saved metadata to /home/paperspace/.cache

Suppose we keep those settings and train for 8 hours. At roughly 1350 ms per step (the `dt` above, rounded up), we could complete this many iterations:

In [4]:
8 * 60 * 60 * 1000 / 1350

21333.333333333332
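
The same budget arithmetic as a small reusable sketch (the function names are illustrative). It ignores time spent in evaluation passes and checkpoint saving, so it is an upper bound on steps and a lower bound on wall-clock time.

```python
def steps_in_budget(hours, ms_per_step):
    """Whole optimizer steps that fit in `hours` at `ms_per_step` each."""
    return int(hours * 3600 * 1000 // ms_per_step)


def projected_minutes(num_steps, ms_per_step):
    """Wall-clock minutes for num_steps, ignoring eval and checkpoint time."""
    return num_steps * ms_per_step / 1000 / 60


print(steps_in_budget(8, 1350))        # 21333
print(projected_minutes(21000, 1350))  # 472.5
```

Rounding the step count down to 21000 leaves a little headroom inside the 8-hour budget.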

#### start run

```
python -m scripts.my_base_train --depth=12 --max_seq_len=1000 --device_batch_size=3 --num_iterations=21000 --total_batch_size=3000 --eval_every=100 --eval_tokens=9000 > base_train_output_006.txt 2>&1
```

In [1]:
!tail -15 base_train_output_006.txt

step 20995/21000 (99.98%) | loss: 4.733284 | grad norm: 0.5301 | lrm: 0.00 | dt: 1340.62ms | tok/sec: 2,237 | mfu: -1.00 | total time: 468.66m
step 20996/21000 (99.98%) | loss: 4.764493 | grad norm: 0.4904 | lrm: 0.00 | dt: 1337.63ms | tok/sec: 2,242 | mfu: -1.00 | total time: 468.69m
step 20997/21000 (99.99%) | loss: 4.822114 | grad norm: 0.5699 | lrm: 0.00 | dt: 1340.07ms | tok/sec: 2,238 | mfu: -1.00 | total time: 468.71m
step 20998/21000 (99.99%) | loss: 4.887818 | grad norm: 0.7300 | lrm: 0.00 | dt: 1340.34ms | tok/sec: 2,238 | mfu: -1.00 | total time: 468.73m
step 20999/21000 (100.00%) | loss: 4.906813 | grad norm: 0.5375 | lrm: 0.00 | dt: 1340.44ms | tok/sec: 2,238 | mfu: -1.00 | total time: 468.75m
step 21000 | Validation bpb: 1.6605
TODO evaluate CORE metric
TODO sample
saved model to /home/paperspace/.cache/my_nanochat/base_checkpoints/d12/model_021000.pt
saved optimizer to /home/paperspace/.cache/my_nanochat/base_checkpoints/d12/model_021000.pt
saved metadata to /home/papers

In [2]:
469 / 60

7.816666666666666

^ It completed in the expected time: about 469 minutes, or 7.8 hours. Time per step is remarkably consistent. This may be another benefit of a GPU that nothing else is sharing or interrupting.
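
To put a number on that consistency, here is a minimal sketch that pulls the `dt` values out of log lines and summarizes their spread. The regex assumes the `dt: ...ms` format shown in the output above, and the sample is just the last five step lines of the long run; on the real file you would pass every line of `base_train_output_006.txt`.

```python
import re
import statistics

DT_PATTERN = re.compile(r"dt: ([0-9.]+)ms")


def step_times_ms(log_lines):
    """Extract per-step times in milliseconds from training log lines."""
    times = []
    for line in log_lines:
        match = DT_PATTERN.search(line)
        if match:
            times.append(float(match.group(1)))
    return times


sample = [
    "step 20995/21000 (99.98%) | loss: 4.733284 | grad norm: 0.5301 | lrm: 0.00 | dt: 1340.62ms | tok/sec: 2,237 | mfu: -1.00 | total time: 468.66m",
    "step 20996/21000 (99.98%) | loss: 4.764493 | grad norm: 0.4904 | lrm: 0.00 | dt: 1337.63ms | tok/sec: 2,242 | mfu: -1.00 | total time: 468.69m",
    "step 20997/21000 (99.99%) | loss: 4.822114 | grad norm: 0.5699 | lrm: 0.00 | dt: 1340.07ms | tok/sec: 2,238 | mfu: -1.00 | total time: 468.71m",
    "step 20998/21000 (99.99%) | loss: 4.887818 | grad norm: 0.7300 | lrm: 0.00 | dt: 1340.34ms | tok/sec: 2,238 | mfu: -1.00 | total time: 468.73m",
    "step 20999/21000 (100.00%) | loss: 4.906813 | grad norm: 0.5375 | lrm: 0.00 | dt: 1340.44ms | tok/sec: 2,238 | mfu: -1.00 | total time: 468.75m",
]

times = step_times_ms(sample)
mean = statistics.mean(times)
spread = statistics.pstdev(times)  # population standard deviation of the sample
print(f"mean {mean:.2f} ms, stdev {spread:.2f} ms, relative spread {spread / mean:.4%}")
```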

In [2]:
!ls -lh /home/paperspace/.cache/my_nanochat/base_checkpoints/d12 | grep 21000

-rw-rw-r-- 1 paperspace paperspace  775 Nov 12 10:36 meta_021000.json
-rw-rw-r-- 1 paperspace paperspace 613M Nov 12 10:36 model_021000.pt
-rw-rw-r-- 1 paperspace paperspace 901M Nov 12 10:36 optim_021000.pt


In [3]:
!grep Validation base_train_output_006.txt | head -5

step 00000 | Validation bpb: 3.7352
step 00100 | Validation bpb: 2.7826
step 00200 | Validation bpb: 2.6076
step 00300 | Validation bpb: 2.5682
step 00400 | Validation bpb: 2.5566
grep: write error: Broken pipe


In [8]:
!grep -i Validation base_train_output_006.txt | tail -10

step 20200 | Validation bpb: 1.6699
step 20300 | Validation bpb: 1.6677
step 20400 | Validation bpb: 1.6664
step 20500 | Validation bpb: 1.6643
step 20600 | Validation bpb: 1.6627
step 20700 | Validation bpb: 1.6609
step 20800 | Validation bpb: 1.6610
step 20900 | Validation bpb: 1.6612
step 21000 | Validation bpb: 1.6605
Minimum validation bpb: 1.6605


^ From validation bpb it looks like training behaved: it falls from 3.7352 at step 0 to 1.6605 at step 21000, and the minimum is at the final step.
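
The same check can be done in Python instead of by eye. This is a minimal sketch, assuming the `step N | Validation bpb: X` line format seen in the grep output; the function names are illustrative. It finds the best evaluation and lists any evaluation where bpb went up relative to the previous one.

```python
import re

VAL_PATTERN = re.compile(r"step (\d+) \| Validation bpb: ([0-9.]+)")


def parse_validation(log_lines):
    """Return (step, bpb) pairs; lines without a step number are skipped."""
    points = []
    for line in log_lines:
        match = VAL_PATTERN.search(line)
        if match:
            points.append((int(match.group(1)), float(match.group(2))))
    return points


def regressions(points, tolerance=0.0):
    """Evaluations whose bpb rose by more than `tolerance` since the previous one."""
    return [
        (step, bpb)
        for (_, prev_bpb), (step, bpb) in zip(points, points[1:])
        if bpb - prev_bpb > tolerance
    ]


sample = [
    "step 20200 | Validation bpb: 1.6699",
    "step 20300 | Validation bpb: 1.6677",
    "step 20400 | Validation bpb: 1.6664",
    "step 20500 | Validation bpb: 1.6643",
    "step 20600 | Validation bpb: 1.6627",
    "step 20700 | Validation bpb: 1.6609",
    "step 20800 | Validation bpb: 1.6610",
    "step 20900 | Validation bpb: 1.6612",
    "step 21000 | Validation bpb: 1.6605",
    "Minimum validation bpb: 1.6605",
]

points = parse_validation(sample)
best = min(points, key=lambda point: point[1])
print("best:", best)
print("rose vs previous eval:", regressions(points))
print("rose by more than 0.001:", regressions(points, tolerance=0.001))
```

On these last evaluations the only increases are at steps 20800 and 20900, each by 0.0002 bpb or less, so a small tolerance filters them out as noise.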

In [20]:
!grep -i Validation base_train_output_006.txt | awk 'NR % 10 == 0'

step 00900 | Validation bpb: 2.3005
step 01900 | Validation bpb: 2.3223
step 02900 | Validation bpb: 2.1992
step 03900 | Validation bpb: 2.1091
step 04900 | Validation bpb: 2.0760
step 05900 | Validation bpb: 2.0606
step 06900 | Validation bpb: 1.9630
step 07900 | Validation bpb: 1.8506
step 08900 | Validation bpb: 1.8735
step 09900 | Validation bpb: 1.8573
step 10900 | Validation bpb: 1.8685
step 11900 | Validation bpb: 1.8520
step 12900 | Validation bpb: 1.8543
step 13900 | Validation bpb: 1.8200
step 14900 | Validation bpb: 1.8304
step 15900 | Validation bpb: 1.8280
step 16900 | Validation bpb: 1.7717
step 17900 | Validation bpb: 1.7440
step 18900 | Validation bpb: 1.6767
step 19900 | Validation bpb: 1.6658
step 20900 | Validation bpb: 1.6612


In [18]:
!grep -o 'step [0-9/()%.]*[^|]*| loss: [0-9.]*' base_train_output_006.txt | awk 'NR % 1000 == 0'

step 00999/21000 (4.76%) | loss: 6.194658
step 01999/21000 (9.52%) | loss: 6.036501
step 02999/21000 (14.28%) | loss: 5.881425
step 03999/21000 (19.04%) | loss: 5.771294
step 04999/21000 (23.80%) | loss: 5.630048
step 05999/21000 (28.57%) | loss: 5.656877
step 06999/21000 (33.33%) | loss: 5.477202
step 07999/21000 (38.09%) | loss: 5.520780
step 08999/21000 (42.85%) | loss: 5.365104
step 09999/21000 (47.61%) | loss: 5.355028
step 10999/21000 (52.38%) | loss: 5.312786
step 11999/21000 (57.14%) | loss: 5.364322
step 12999/21000 (61.90%) | loss: 5.363304
step 13999/21000 (66.66%) | loss: 5.347305
step 14999/21000 (71.42%) | loss: 5.287267
step 15999/21000 (76.19%) | loss: 5.282485
step 16999/21000 (80.95%) | loss: 5.140184
step 17999/21000 (85.71%) | loss: 5.024324
step 18999/21000 (90.47%) | loss: 4.944702
step 19999/21000 (95.23%) | loss: 4.858329
step 20999/21000 (100.00%) | loss: 4.906813


^ And from training loss, which also trends downward over the run.
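
Single-step losses are noisy (the last sampled value, 4.906813, is higher than the one 1000 steps earlier), so a smoothed curve can make the trend easier to read. Below is a minimal sketch of a bias-corrected exponential moving average, applied to the 21 sampled losses above purely as an illustration; on the full log it would normally be applied to every step. The function name and `beta` value are illustrative choices.

```python
def debiased_ema(values, beta=0.9):
    """Exponential moving average with bias correction for the zero start."""
    smoothed = []
    ema = 0.0
    for t, value in enumerate(values, start=1):
        ema = beta * ema + (1.0 - beta) * value
        # Dividing by (1 - beta**t) undoes the pull toward the initial 0.0.
        smoothed.append(ema / (1.0 - beta ** t))
    return smoothed


# Losses sampled every 1000 steps, copied from the output above.
sampled_losses = [
    6.194658, 6.036501, 5.881425, 5.771294, 5.630048, 5.656877, 5.477202,
    5.520780, 5.365104, 5.355028, 5.312786, 5.364322, 5.363304, 5.347305,
    5.287267, 5.282485, 5.140184, 5.024324, 4.944702, 4.858329, 4.906813,
]

smoothed = debiased_ema(sampled_losses, beta=0.5)
for raw, smooth in zip(sampled_losses, smoothed):
    print(f"raw {raw:.4f} | smoothed {smooth:.4f}")
```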

Now the fun part:

In [6]:
checkpoint_dir = os.path.join(get_base_dir(), "base_checkpoints", "d12")
model, tokenizer, meta_data = build_model(checkpoint_dir, step=21000, device=device, phase="eval")
bos_token_id = tokenizer.get_bos_token_id()
with torch.no_grad():
    for prompt in ['The person', 'He went to', '1 + 2 = ', 'first of', '3 cats and 2', 'mom and', 'the red', 'She']:
        with autocast_ctx:
            logits = model(torch.tensor([tokenizer.encode(prompt, prepend=bos_token_id)], device=device)).detach()
            top_3_next_tokens = torch.topk(logits[0,-1,:], k=3).indices
            print(f"{prompt}{'|'.join([tokenizer.decode([token]) for token in top_3_next_tokens])}")

Building model with config: {'sequence_len': 1000, 'vocab_size': 65537, 'n_layer': 12, 'n_head': 6, 'n_kv_head': 6, 'n_embd': 768}
The person who|’s| of
He went to the| a| his
1 + 2 = 1|2|3
first of the| all| 
3 cats and 2 dogs| cats|

mom and the| its| a
the red line| light| blood
She is| was| has


^ Finally, "3 cats and 2 dogs"!

In [7]:
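@torch.no_grad()  # inference only, so skip autograd bookkeeping
def generate(prompt, max_new_tokens=20):
    # Greedy decoding: append the highest-logit token at each step.
    tokens = tokenizer.encode(prompt, prepend=bos_token_id)
    for _ in range(max_new_tokens):
        with autocast_ctx:
            logits = model(torch.tensor([tokens], device=device))
        tokens.append(logits[0, -1, :].argmax().item())
    return tokenizer.decode(tokens)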

In [8]:
generate("The person")

'<bos>The person who is a member of the family of the family of the family of the family of the family of'

In [9]:
generate("3 cats and 2")

'<bos>3 cats and 2 dogs\nThe first human being was born in the 19th century. The first human being was'

In [10]:
generate("First take a right on main street")

'<bos>First take a right on main street, and then the next stop is to the left of the street. The next stop is to the'

In [11]:
generate("The first president of the United States was")

'<bos>The first president of the United States was the 19th president of the United States. The first president of the United States was the '

In [12]:
generate("2 + 3 =")

'<bos>2 + 3 = 1 + 2 + 2 + 2 + 2 + 2 + 2'

In [13]:
generate("A common breakfast food is")

'<bos>A common breakfast food is a food that is not served as a food. It is a food that is not served as a'

In [14]:
generate("For breakfast, I'll have")

"<bos>For breakfast, I'll have to go to the store, and I'll be able to get a little more of the energy you"

In [15]:
generate("At breakfast I eat")

'<bos>At breakfast I eat a lot of fruits and vegetables, and I am a little more than a little bit more than a'

In [16]:
generate("For breakfast I'll have two")

"<bos>For breakfast I'll have two more minutes of lunch each day. I'll have a few more minutes of lunch each day. I"

In [17]:
generate("My favorite breakfast foods are:")

'<bos>My favorite breakfast foods are: A new study from the University of California, San Diego, and the University of California, San Diego'

In [18]:
generate("My favorite breakfast foods are")

'<bos>My favorite breakfast foods are:\n- 1.5% of the daily recommended daily intake of fruits and vegetables. This is'

In [19]:
generate("I like green eggs and")

'<bos>I like green eggs and my own friends. I like green eggs and my own friends. I like green eggs and my own'

In [21]:
generate("if x >")

'<bos>if x > 1\nThe answer is: 1.5.1.1.1.1.'

In [22]:
generate("print(")

'<bos>print(1) 1: 1: 1: 1: 1: 1:'

In [23]:
generate("for i in")

'<bos>for i in the first place\nThe first thing to do is to get a better understanding of the world around us'

In [24]:
generate("for i in range")

'<bos>for i in range of 1,000\nThe first thing to do is to get a 1,00'
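
Many of the completions above fall into loops ("of the family of the family", "I like green eggs and my own friends"). That is typical of greedy decoding, which is what `generate` does by always taking the argmax. A common alternative is to sample from the top few tokens with a temperature. Below is a minimal pure-Python sketch over a plain list of logits; the function name and default values are illustrative, and it is not part of the notebook's code.

```python
import math
import random


def sample_next_token(logits, temperature=0.8, top_k=3, rng=None):
    """Pick a token index from `logits` using top-k sampling with temperature.

    temperature <= 0 or top_k == 1 falls back to greedy argmax.
    top_k=None samples from the whole vocabulary.
    """
    rng = rng or random.Random()
    if temperature <= 0 or top_k == 1:
        return max(range(len(logits)), key=logits.__getitem__)
    # Indices of the top_k largest logits (all indices if top_k is None).
    ranked = sorted(range(len(logits)), key=logits.__getitem__, reverse=True)
    candidates = ranked[:top_k]
    # Softmax weights over the candidates, shifted by the max for stability.
    scaled = [logits[i] / temperature for i in candidates]
    peak = max(scaled)
    weights = [math.exp(value - peak) for value in scaled]
    return rng.choices(candidates, weights=weights, k=1)[0]


toy_logits = [0.1, 2.0, -1.0, 5.0]
rng = random.Random(0)  # seeded so repeated runs draw the same samples
print([sample_next_token(toy_logits, top_k=2, rng=rng) for _ in range(10)])
print(sample_next_token(toy_logits, temperature=0))  # greedy: index 3
```

Inside `generate`, something like `logits[0, -1, :].float().tolist()` could be passed in place of `toy_logits`, though for a 65537-token vocabulary a tensor version built on `torch.topk` and `torch.multinomial` would likely be faster.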