### Try trained model on other hardware

As long as the trained d20 model fits in memory on MPS (Mac) or the RTX4000, I believe it should work, at least up to a certain sequence length, because the KV cache keeps growing as the sequence gets longer.

Total params is 560,988,160. Say they all took 4 bytes (which they don't): that's ~2 GiB, well under the RTX4000, where we had 7+ GiB of memory.

In [26]:
560_988_160 * 4 / 1024 ** 3

2.08984375

#### Let's try MPS on mac first

In [1]:
import sys
sys.path.append('../my_nanochat')
from my_nanochat.my_common import get_base_dir

Copied model files to the right place

In [2]:
!ls -lh {get_base_dir()}/base_checkpoints/d20

total 4851336
-rw-r--r--  1 ericsilberstein  staff   847B Nov 16 14:35 meta_021400.json
-rw-r--r--  1 ericsilberstein  staff   1.9G Nov 16 14:35 model_021400.pt
-rw-r--r--  1 ericsilberstein  staff   389M Nov 16 14:35 optim_021400_rank0.pt


Copied tokenizer files to the right place (later, forgot to do this at first)

In [3]:
!ls -lh {get_base_dir()}/*token*

-rw-r--r--  1 ericsilberstein  staff   826K Nov 16 15:51 /Users/ericsilberstein/.cache/my_nanochat/my-tokenizer.pkl
-rw-r--r--  1 ericsilberstein  staff   258K Nov 16 15:51 /Users/ericsilberstein/.cache/my_nanochat/token_bytes.pt


In [2]:
import os
import torch
from my_nanochat.my_common import get_base_dir, autodetect_device_type
from my_nanochat.my_checkpoint_manager import build_model
device_type = autodetect_device_type()
device = torch.device(device_type)

Autodetected device type: mps


In [3]:
checkpoint_dir = os.path.join(get_base_dir(), "base_checkpoints", "d20")
model, tokenizer, meta_data = build_model(checkpoint_dir, step=21400, device=device, phase="eval")

Building model with config: {'sequence_len': 2048, 'vocab_size': 65536, 'n_layer': 20, 'n_head': 10, 'n_kv_head': 10, 'n_embd': 1280}


In [6]:
tokenizer.get_vocab_size()

65536

In [33]:
meta_data

{'step': 21400,
 'val_bpb': -1,
 'model_config': {'sequence_len': 2048,
  'vocab_size': 65536,
  'n_layer': 20,
  'n_head': 10,
  'n_kv_head': 10,
  'n_embd': 1280},
 'user_config': {'run': 'challenge-25-4',
  'device_type': '',
  'depth': 20,
  'max_seq_len': 2048,
  'num_iterations': -1,
  'target_param_data_ratio': 20,
  'device_batch_size': 32,
  'total_batch_size': 524288,
  'embedding_lr': 0.2,
  'unembedding_lr': 0.004,
  'weight_decay': 0.0,
  'matrix_lr': 0.02,
  'grad_clip': 1.0,
  'warmup_ratio': 0.0,
  'warmdown_ratio': 0.2,
  'final_lr_frac': 0.0,
  'eval_every': 250,
  'eval_tokens': 10485760,
  'core_metric_every': 2000,
  'core_metric_max_per_task': 500,
  'sample_every': 2000,
  'model_tag': ''},
 'device_batch_size': 32,
 'max_seq_len': 2048}

In [4]:
prompts = [
    "The capital of France is",
    "The chemical symbol of gold is",
    "If yesterday was Friday, then tomorrow will be",
    "The opposite of hot is",
    "The planets of the solar system are:",
    "My favorite color is",
    "If 5*x + 3 = 13, then x is",
]

In [5]:
from my_nanochat.my_engine import Engine
engine = Engine(model, tokenizer)

In [9]:
for prompt in prompts:
    tokens = tokenizer.encode(prompt, prepend=tokenizer.get_bos_token_id())
    sample, _ = engine.generate_batch(tokens, num_samples=1, max_tokens=10, temperature=0)
    print(tokenizer.decode(sample[0]))

<|bos|>The capital of France is Paris. It is the largest city in France and
<|bos|>The chemical symbol of gold is Au. It is a soft, malleable, ductile
<|bos|>If yesterday was Friday, then tomorrow will be Monday. If tomorrow is Monday, then tomorrow will
<|bos|>The opposite of hot is cold. The opposite of cold is hot. The
<|bos|>The planets of the solar system are: Mercury, Venus, Earth, Mars, Jupiter,
<|bos|>My favorite color is red. It is the color of blood, of
<|bos|>If 5*x + 3 = 13, then x is a factor of 5. If 5*


Do these match the sample from the final step during training?
```
<|bos|>The capital of France is Paris. It is the largest city in France and
<|bos|>The chemical symbol of gold is Au. It is a soft, malleable, ductile
<|bos|>If yesterday was Friday, then tomorrow will be Monday. If tomorrow is Monday, then tomorrow will
<|bos|>The opposite of hot is cold. The opposite of cold is hot. The
<|bos|>The planets of the solar system are: Mercury, Venus, Earth, Mars, Jupiter,
<|bos|>My favorite color is red. It is the color of the blood of
<|bos|>If 5*x + 3 = 13, then x is a factor of 5. If 5*
```

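Yes, except for one prompt: for "My favorite color is", the output here continues "blood, of" while the training sample has "the blood of".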

For fun, generate a few longer samples for each prompt with a higher temperature.

In [12]:
for prompt in prompts:
    tokens = tokenizer.encode(prompt, prepend=tokenizer.get_bos_token_id())
    samples, _ = engine.generate_batch(tokens, num_samples=5, max_tokens=20, temperature=0.3)
    for sample in samples:
        print(tokenizer.decode(sample))
    print('')

<|bos|>The capital of France is Paris, and it is the second largest city in the country. It is also the most populous city
<|bos|>The capital of France is Paris, and the capital of France is Paris. Paris is the capital of France. Paris is the
<|bos|>The capital of France is Paris. It is a city that is known for its beautiful architecture, its food, its music,
<|bos|>The capital of France is Paris and it is the largest city in the country. The city is also the most populous city in
<|bos|>The capital of France is Paris. It is the largest city in France and the second largest city in Europe. It is the

<|bos|>The chemical symbol of gold is Au, and it is a soft, malleable, ductile, and ductilely reactive metal. It is
<|bos|>The chemical symbol of gold is Au. It is a soft, malleable metal that is a bright yellow color. Gold is the most
<|bos|>The chemical symbol of gold is Au. The atomic number of gold is 79. The atomic weight of gold is 199
<|bos|>The chemical symbol of gold is Au and it 

Can we get it to OOM? Probably not just by raising max tokens, because it will hit the assert on 10x max seq length for the precomputed sin/cos before it OOMs, but perhaps by asking for lots of samples.

In [18]:
tokens = tokenizer.encode(prompts[0], prepend=tokenizer.get_bos_token_id())
samples, _ = engine.generate_batch(tokens, num_samples=1, max_tokens=1000, temperature=0.3)
print(tokenizer.decode(samples[0]))

<|bos|>The capital of France is Paris, and it is the second largest city in the country. It is also the most populous city in France. The city is located in the south of France, and it is located in the region of the Loire River. The city is located in the south of France, and it is located in the region of the Loire River. The city is located in the south of France, and it is located in the region of the Loire River. The city is located in the south of France, and it is located in the region of the Loire River. The city is located in the south of France, and it is located in the region of the Loire River. The city is located in the south of France, and it is located in the region of the Loire River. The city is located in the south of France, and it is located in the region of the Loire River. The city is located in the south of France, and it is located in the region of the Loire River. The city is located in the south of France, and it is located in the region of the Loire River. Th

In [19]:
samples, _ = engine.generate_batch(tokens, num_samples=1, max_tokens=20479, temperature=0.3)

KeyboardInterrupt: 

It's been maybe 10 minutes and it's still going. Going to cancel and use the non-batch version so I can watch progress and see if it's even worth waiting. My guess is that at first the time per generated token will be roughly constant, because self-attention is just one part of all the operations. As the sequence grows, the non-attention work stays constant per token, but self-attention will start to dominate as the KV cache grows and the matrix multiplications get bigger. But will it slow linearly, or with, say, the square of something?

In [6]:
import time

In [37]:
t0 = time.time()
for i, (token_column, token_masks) in enumerate(engine.generate(tokens, num_samples=1, max_tokens=20479, temperature=0.3)):
    if i % 100 == 0:
        t1 = time.time()
        delta = t1 - t0
        t0 = t1
        print(f"{(delta):.0f}s {i}")

1s 0
6s 100
5s 200
5s 300
5s 400
6s 500
6s 600
6s 700
8s 800
7s 900
7s 1000
8s 1100
12s 1200
8s 1300
7s 1400
20s 1500
11s 1600
9s 1700
9s 1800
8s 1900
8s 2000
9s 2100
8s 2200
8s 2300
11s 2400
9s 2500
9s 2600
10s 2700
9s 2800
10s 2900
10s 3000
10s 3100
10s 3200
10s 3300
10s 3400
10s 3500
11s 3600
11s 3700
11s 3800
11s 3900
11s 4000
12s 4100
12s 4200
12s 4300
12s 4400
12s 4500
13s 4600
13s 4700
13s 4800
13s 4900
13s 5000
14s 5100
14s 5200
13s 5300
14s 5400
15s 5500
14s 5600
14s 5700
14s 5800
15s 5900
15s 6000
15s 6100
15s 6200
15s 6300
15s 6400
15s 6500
15s 6600
16s 6700


KeyboardInterrupt: 

I'm surprised it's not slowing faster. But I'm not sure how much it's worth digging into this and into how Mac and MPS handle it. Going to interrupt, try many samples for fun, and then move on to trying things on the RTX4000.
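On the linear-vs-square question: with a KV cache, each new token attends over every cached position, so the attention cost per token should grow roughly linearly with context length, and only the total time is quadratic. A rough check is to fit a straight line to a few of the timings printed above. This is a sketch on six hand-picked points, and the timings are rounded to whole seconds, so it can't rule out a small higher-order term.

```python
# Seconds per 100 generated tokens, copied from the single-sample MPS run above.
positions = [1000, 2000, 3000, 4000, 5000, 6000]  # tokens generated so far
secs_per_100 = [7, 8, 10, 11, 13, 15]              # time for the preceding 100 tokens


def fit_line(xs, ys):
    """Ordinary least-squares fit of ys ~ intercept + slope * xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_line(positions, secs_per_100)

# How far each measured point is from the fitted line.
residuals = [y - (intercept + slope * x) for x, y in zip(positions, secs_per_100)]
max_abs_residual = max(abs(r) for r in residuals)

print(f"fixed cost ~{intercept:.1f}s per 100 tokens")
print(f"growth     ~{slope * 1000:.1f}s per 100 tokens for each extra 1000 tokens of context")
print(f"worst miss  {max_abs_residual:.2f}s")
```

The line comes out at roughly 5 s of fixed cost per 100 tokens plus about 1.6 s for each extra 1000 tokens of context, and every point lands within half a second of it, which is no worse than the rounding in the printed timings. So the slowdown looks linear per token on these points.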

In [38]:
t0 = time.time()
for i, (token_column, token_masks) in enumerate(engine.generate(tokens, num_samples=50, max_tokens=5000, temperature=0.3)):
    if i % 100 == 0:
        t1 = time.time()
        delta = t1 - t0
        t0 = t1
        print(f"{(delta):.0f}s {i}")

RuntimeError: Invalid buffer size: 47.74 GiB

^ Interesting: it failed immediately, when first allocating the KV cache. Try with fewer samples. This also makes me think it will either fail right away or fail at one of the points where it needs to grow the cache. Not really sure about that, though, and also not sure how "Invalid buffer size" differs from OOM. Also probably better to restart the kernel and then repeat.
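The 47.74 GiB figure can be roughly reproduced from the model config. The sketch below assumes the cache holds one K and one V tensor per layer, is sized up front for prompt length plus `max_tokens`, is stored as float32 on MPS (no autocast was used here), and that the prompt encodes to 6 tokens including BOS. The helper `kv_cache_bytes` is hypothetical, not something from `my_nanochat`.

```python
GIB = 1024 ** 3


def kv_cache_bytes(n_layer, n_kv_head, head_dim, batch_size, seq_len, bytes_per_elem):
    """Bytes for a cache holding one K and one V tensor per layer."""
    return 2 * n_layer * batch_size * n_kv_head * seq_len * head_dim * bytes_per_elem


# Values from the model config printed above.
n_layer, n_head, n_kv_head, n_embd = 20, 10, 10, 1280
head_dim = n_embd // n_head  # 128

# Assumptions: 6-token prompt (BOS + 5 tokens), float32 cache (4 bytes per element).
prompt_len = 6
size = kv_cache_bytes(
    n_layer, n_kv_head, head_dim,
    batch_size=50, seq_len=prompt_len + 5000, bytes_per_elem=4,
)
print(f"{size / GIB:.2f} GiB")
```

Under those assumptions this prints 47.74 GiB, matching the error message, which suggests the whole cache for 50 samples x 5006 positions is requested in one allocation.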

-- restarted kernel and ran appropriate cells above --

In [10]:
tokens = tokenizer.encode(prompts[0], prepend=tokenizer.get_bos_token_id())
t0 = time.time()
for i, (token_column, token_masks) in enumerate(engine.generate(tokens, num_samples=50, max_tokens=5000, temperature=0.3)):
    if i % 100 == 0:
        t1 = time.time()
        delta = t1 - t0
        t0 = t1
        print(f"{(delta):.0f}s {i}")

RuntimeError: Invalid buffer size: 47.74 GiB

-- restarted kernel and ran appropriate cells above --

In [7]:
tokens = tokenizer.encode(prompts[0], prepend=tokenizer.get_bos_token_id())
t0 = time.time()
for i, (token_column, token_masks) in enumerate(engine.generate(tokens, num_samples=5, max_tokens=5000, temperature=0.3)):
    if i % 100 == 0:
        t1 = time.time()
        delta = t1 - t0
        t0 = t1
        print(f"{(delta):.0f}s {i}")

1s 0
7s 100
7s 200
8s 300
9s 400
9s 500
10s 600
10s 700
11s 800
13s 900
13s 1000
22s 1100
18s 1200
16s 1300
16s 1400
17s 1500


KeyboardInterrupt: 

Interrupted. Perhaps come back to this later. Move on to trying on RTX4000.

### Try on RTX4000

Now on the Paperspace RTX4000 machine:

In [2]:
!nvidia-smi

Sun Nov 16 21:57:31 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Quadro RTX 4000                Off |   00000000:00:05.0 Off |                  N/A |
| 30%   33C    P8              8W /  125W |       1MiB /   8192MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+----------------------------------------------

scp'd tokenizer files and model files from my laptop to this machine:

In [3]:
import sys
sys.path.append('../my_nanochat')
from my_nanochat.my_common import get_base_dir

In [4]:
!ls -lh {get_base_dir()}/base_checkpoints/d20

total 2.4G
-rw-r--r-- 1 paperspace paperspace  847 Nov 16 22:02 meta_021400.json
-rw-r--r-- 1 paperspace paperspace 2.0G Nov 16 22:02 model_021400.pt
-rw-r--r-- 1 paperspace paperspace 389M Nov 16 22:03 optim_021400_rank0.pt


In [5]:
!ls -lh {get_base_dir()}/*token*

-rw-r--r-- 1 paperspace paperspace 827K Nov 16 21:59 /home/paperspace/.cache/my_nanochat/my-tokenizer.pkl
-rw-r--r-- 1 paperspace paperspace 258K Nov 16 21:59 /home/paperspace/.cache/my_nanochat/token_bytes.pt


In [17]:
import os
import torch
import time
from contextlib import nullcontext  # needed for the non-CUDA fallback below
from my_nanochat.my_common import get_base_dir, autodetect_device_type
from my_nanochat.my_checkpoint_manager import build_model
device_type = autodetect_device_type()
device = torch.device(device_type)
autocast_ctx = torch.amp.autocast(device_type=device_type, dtype=torch.bfloat16) if device_type == "cuda" else nullcontext()

Autodetected device type: cuda


In [7]:
checkpoint_dir = os.path.join(get_base_dir(), "base_checkpoints", "d20")
model, tokenizer, meta_data = build_model(checkpoint_dir, step=21400, device=device, phase="eval")

Building model with config: {'sequence_len': 2048, 'vocab_size': 65536, 'n_layer': 20, 'n_head': 10, 'n_kv_head': 10, 'n_embd': 1280}


In [8]:
tokenizer.get_vocab_size()

65536

In [10]:
meta_data

{'step': 21400,
 'val_bpb': -1,
 'model_config': {'sequence_len': 2048,
  'vocab_size': 65536,
  'n_layer': 20,
  'n_head': 10,
  'n_kv_head': 10,
  'n_embd': 1280},
 'user_config': {'run': 'challenge-25-4',
  'device_type': '',
  'depth': 20,
  'max_seq_len': 2048,
  'num_iterations': -1,
  'target_param_data_ratio': 20,
  'device_batch_size': 32,
  'total_batch_size': 524288,
  'embedding_lr': 0.2,
  'unembedding_lr': 0.004,
  'weight_decay': 0.0,
  'matrix_lr': 0.02,
  'grad_clip': 1.0,
  'warmup_ratio': 0.0,
  'warmdown_ratio': 0.2,
  'final_lr_frac': 0.0,
  'eval_every': 250,
  'eval_tokens': 10485760,
  'core_metric_every': 2000,
  'core_metric_max_per_task': 500,
  'sample_every': 2000,
  'model_tag': ''},
 'device_batch_size': 32,
 'max_seq_len': 2048}

In [11]:
prompts = [
    "The capital of France is",
    "The chemical symbol of gold is",
    "If yesterday was Friday, then tomorrow will be",
    "The opposite of hot is",
    "The planets of the solar system are:",
    "My favorite color is",
    "If 5*x + 3 = 13, then x is",
]

In [12]:
from my_nanochat.my_engine import Engine
engine = Engine(model, tokenizer)

In [15]:
for prompt in prompts:
    tokens = tokenizer.encode(prompt, prepend=tokenizer.get_bos_token_id())
    with autocast_ctx:
        sample, _ = engine.generate_batch(tokens, num_samples=1, max_tokens=10, temperature=0)
    print(tokenizer.decode(sample[0]))

<|bos|>The capital of France is Paris. It is the largest city in France and
<|bos|>The chemical symbol of gold is Au. It is a soft, malleable, ductile
<|bos|>If yesterday was Friday, then tomorrow will be Monday. If tomorrow is Monday, then tomorrow will
<|bos|>The opposite of hot is cold. The opposite of cold is hot. The
<|bos|>The planets of the solar system are: Mercury, Venus, Earth, Mars, Jupiter,
<|bos|>My favorite color is red. It is the color of blood, of
<|bos|>If 5*x + 3 = 13, then x is a factor of 5. If 5*


Looks like it matches the MPS output above.

Can we get this to OOM?

In [19]:
tokens = tokenizer.encode(prompts[0], prepend=tokenizer.get_bos_token_id())
t0 = time.time()
with autocast_ctx:
    for i, (token_column, token_masks) in enumerate(engine.generate(tokens, num_samples=1, max_tokens=20479, temperature=0.3)):
        if i % 100 == 0:
            t1 = time.time()
            delta = t1 - t0
            t0 = t1
            print(f"{(delta):.0f}s {i}")

0s 0
2s 100
2s 200
2s 300
2s 400
2s 500
2s 600
2s 700
2s 800
2s 900
2s 1000
2s 1100
2s 1200
2s 1300
2s 1400
2s 1500
2s 1600
2s 1700
2s 1800
2s 1900
2s 2000
2s 2100
2s 2200
2s 2300
2s 2400
2s 2500
2s 2600
2s 2700
2s 2800
2s 2900
2s 3000
2s 3100
2s 3200
2s 3300
2s 3400
2s 3500
2s 3600
2s 3700
2s 3800
2s 3900
2s 4000
2s 4100
2s 4200
2s 4300
2s 4400
2s 4500
2s 4600
2s 4700
2s 4800
2s 4900
2s 5000
2s 5100
2s 5200
2s 5300
2s 5400
2s 5500
2s 5600
2s 5700
2s 5800
2s 5900
2s 6000
2s 6100
2s 6200
2s 6300
2s 6400
2s 6500
2s 6600
2s 6700
2s 6800
2s 6900
2s 7000
2s 7100
2s 7200
2s 7300
2s 7400
2s 7500
2s 7600
2s 7700
2s 7800
2s 7900
2s 8000
2s 8100
2s 8200
2s 8300
2s 8400
2s 8500
2s 8600
2s 8700
2s 8800
2s 8900
2s 9000
3s 9100
3s 9200
3s 9300
3s 9400
3s 9500
3s 9600
3s 9700
3s 9800
3s 9900
3s 10000
3s 10100
3s 10200
3s 10300
3s 10400
3s 10500
3s 10600
3s 10700
3s 10800
3s 10900
3s 11000
3s 11100
3s 11200
3s 11300
3s 11400
3s 11500
3s 11600
3s 11700
3s 11800
3s 11900
3s 12000
3s 12100
3s 12200
3s 12

That completed. Probably need to look more closely at where it needs to allocate more memory. But for fun, try with multiple samples.

In [20]:
tokens = tokenizer.encode(prompts[0], prepend=tokenizer.get_bos_token_id())
t0 = time.time()
with autocast_ctx:
    for i, (token_column, token_masks) in enumerate(engine.generate(tokens, num_samples=10, max_tokens=20479, temperature=0.3)):
        if i % 100 == 0:
            t1 = time.time()
            delta = t1 - t0
            t0 = t1
            print(f"{(delta):.0f}s {i}")

OutOfMemoryError: CUDA out of memory. Tried to allocate 19.54 GiB. GPU 0 has a total capacity of 7.78 GiB of which 4.69 GiB is free. Including non-PyTorch memory, this process has 3.09 GiB memory in use. Of the allocated memory 2.88 GiB is allocated by PyTorch, and 90.85 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

In [21]:
tokens = tokenizer.encode(prompts[0], prepend=tokenizer.get_bos_token_id())
t0 = time.time()
with autocast_ctx:
    for i, (token_column, token_masks) in enumerate(engine.generate(tokens, num_samples=5, max_tokens=20479, temperature=0.3)):
        if i % 100 == 0:
            t1 = time.time()
            delta = t1 - t0
            t0 = t1
            print(f"{(delta):.0f}s {i}")

OutOfMemoryError: CUDA out of memory. Tried to allocate 9.77 GiB. GPU 0 has a total capacity of 7.78 GiB of which 4.69 GiB is free. Including non-PyTorch memory, this process has 3.09 GiB memory in use. Of the allocated memory 2.89 GiB is allocated by PyTorch, and 91.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

In [22]:
tokens = tokenizer.encode(prompts[0], prepend=tokenizer.get_bos_token_id())
t0 = time.time()
with autocast_ctx:
    for i, (token_column, token_masks) in enumerate(engine.generate(tokens, num_samples=3, max_tokens=20479, temperature=0.3)):
        if i % 100 == 0:
            t1 = time.time()
            delta = t1 - t0
            t0 = t1
            print(f"{(delta):.0f}s {i}")

OutOfMemoryError: CUDA out of memory. Tried to allocate 5.86 GiB. GPU 0 has a total capacity of 7.78 GiB of which 4.68 GiB is free. Including non-PyTorch memory, this process has 3.09 GiB memory in use. Of the allocated memory 2.89 GiB is allocated by PyTorch, and 92.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

In [23]:
tokens = tokenizer.encode(prompts[0], prepend=tokenizer.get_bos_token_id())
t0 = time.time()
with autocast_ctx:
    for i, (token_column, token_masks) in enumerate(engine.generate(tokens, num_samples=2, max_tokens=20479, temperature=0.3)):
        if i % 100 == 0:
            t1 = time.time()
            delta = t1 - t0
            t0 = t1
            print(f"{(delta):.0f}s {i}")

0s 0
4s 100
4s 200
4s 300
4s 400
4s 500
4s 600
4s 700
4s 800
4s 900
4s 1000
4s 1100
4s 1200
4s 1300
4s 1400
4s 1500
4s 1600
4s 1700
4s 1800
4s 1900
4s 2000
5s 2100
5s 2200
5s 2300
5s 2400
5s 2500
5s 2600
5s 2700
5s 2800
5s 2900
5s 3000
5s 3100
5s 3200
5s 3300
5s 3400
5s 3500
5s 3600
5s 3700
5s 3800
5s 3900
5s 4000


KeyboardInterrupt: 

Going to interrupt. Looking at the KVCache class, I forgot that if max_tokens is specified, we use that to initially size the cache. So it's not going to run out of memory while running. It's going to fail right away or not at all, I think.
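The allocation sizes in the three OOM messages are consistent with an up-front cache of prompt length plus `max_tokens` positions. A minimal sketch, assuming a bfloat16 cache under autocast (2 bytes per element), one K and one V tensor per layer, and a 6-token prompt; `kv_cache_gib` is a hypothetical helper, not part of `my_nanochat`:

```python
GIB = 1024 ** 3

# Model config from above; head_dim = n_embd // n_head.
N_LAYER, N_KV_HEAD, HEAD_DIM = 20, 10, 1280 // 10


def kv_cache_gib(num_samples, seq_len, bytes_per_elem=2):
    """KV cache size in GiB: K and V per layer, bfloat16 (2 bytes) by default."""
    n_bytes = 2 * N_LAYER * num_samples * N_KV_HEAD * seq_len * HEAD_DIM * bytes_per_elem
    return n_bytes / GIB


seq_len = 6 + 20479  # assumed 6-token prompt plus max_tokens
free_gib = 4.68      # "free" figure reported in the OOM messages above

for num_samples in (10, 5, 3, 2, 1):
    need = kv_cache_gib(num_samples, seq_len)
    verdict = "fits" if need < free_gib else "too big"
    print(f"num_samples={num_samples:2d}: {need:5.2f} GiB -> {verdict}")
```

Under these assumptions, 10, 5 and 3 samples need about 19.54, 9.77 and 5.86 GiB, the same figures as in the error messages, while 2 samples need about 3.91 GiB, which is under the roughly 4.68 GiB that was free.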