### Try trained model on other hardware

As long as the trained d20 model fits in memory on MPS (Mac) or the RTX4000, I believe it should work, at least up to a certain sequence length, because the KV cache keeps growing as more tokens are generated.

Total params is 560,988,160. Even if every one took 4 bytes, which they don't, that's ~2.1 GiB, which is well under the 7+ GiB of memory we had on the RTX4000.

In [26]:
560_988_160 * 4 / 1024 ** 3

2.08984375
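The 560,988,160 total can be reproduced from the model config printed further down. This is a rough sketch, assuming a GPT-style layout with untied embedding and unembedding matrices, bias-free linear layers, a 4x MLP expansion, no learnable norm parameters, and as many KV heads as query heads (true for this config). Those are my assumptions about the architecture, and `count_params` is a hypothetical helper, not something from the codebase.

```python
def count_params(vocab_size: int, n_layer: int, n_embd: int) -> int:
    """Rough parameter count for a GPT-style model (assumed layout, see above)."""
    embedding = vocab_size * n_embd        # token embedding table
    unembedding = vocab_size * n_embd      # output projection (assumed untied)
    attention = 4 * n_embd * n_embd        # q, k, v and output projections
    mlp = 2 * n_embd * (4 * n_embd)        # up and down projections, 4x width
    return embedding + unembedding + n_layer * (attention + mlp)


n_params = count_params(vocab_size=65536, n_layer=20, n_embd=1280)
print(f"{n_params:,} params")
print(f"{n_params * 4 / 1024**3:.2f} GiB at 4 bytes each")
print(f"{n_params * 2 / 1024**3:.2f} GiB at 2 bytes each")
```

Under these assumptions the count comes out to exactly the total quoted above, so the 4-bytes-per-param estimate is an upper bound on the weights alone.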

#### Let's try MPS on Mac first

In [1]:
import sys
sys.path.append('../my_nanochat')
from my_nanochat.my_common import get_base_dir

Copied model files to the right place

In [2]:
!ls -lh {get_base_dir()}/base_checkpoints/d20

total 4851336
-rw-r--r--  1 ericsilberstein  staff   847B Nov 16 14:35 meta_021400.json
-rw-r--r--  1 ericsilberstein  staff   1.9G Nov 16 14:35 model_021400.pt
-rw-r--r--  1 ericsilberstein  staff   389M Nov 16 14:35 optim_021400_rank0.pt


Copied tokenizer files to the right place (did this later, forgot at first)

In [3]:
!ls -lh {get_base_dir()}/*token*

-rw-r--r--  1 ericsilberstein  staff   826K Nov 16 15:51 /Users/ericsilberstein/.cache/my_nanochat/my-tokenizer.pkl
-rw-r--r--  1 ericsilberstein  staff   258K Nov 16 15:51 /Users/ericsilberstein/.cache/my_nanochat/token_bytes.pt


In [2]:
import os
import torch
from my_nanochat.my_common import get_base_dir, autodetect_device_type
from my_nanochat.my_checkpoint_manager import build_model
device_type = autodetect_device_type()
device = torch.device(device_type)

Autodetected device type: mps


In [3]:
checkpoint_dir = os.path.join(get_base_dir(), "base_checkpoints", "d20")
model, tokenizer, meta_data = build_model(checkpoint_dir, step=21400, device=device, phase="eval")

Building model with config: {'sequence_len': 2048, 'vocab_size': 65536, 'n_layer': 20, 'n_head': 10, 'n_kv_head': 10, 'n_embd': 1280}


In [6]:
tokenizer.get_vocab_size()

65536

In [33]:
meta_data

{'step': 21400,
 'val_bpb': -1,
 'model_config': {'sequence_len': 2048,
  'vocab_size': 65536,
  'n_layer': 20,
  'n_head': 10,
  'n_kv_head': 10,
  'n_embd': 1280},
 'user_config': {'run': 'challenge-25-4',
  'device_type': '',
  'depth': 20,
  'max_seq_len': 2048,
  'num_iterations': -1,
  'target_param_data_ratio': 20,
  'device_batch_size': 32,
  'total_batch_size': 524288,
  'embedding_lr': 0.2,
  'unembedding_lr': 0.004,
  'weight_decay': 0.0,
  'matrix_lr': 0.02,
  'grad_clip': 1.0,
  'warmup_ratio': 0.0,
  'warmdown_ratio': 0.2,
  'final_lr_frac': 0.0,
  'eval_every': 250,
  'eval_tokens': 10485760,
  'core_metric_every': 2000,
  'core_metric_max_per_task': 500,
  'sample_every': 2000,
  'model_tag': ''},
 'device_batch_size': 32,
 'max_seq_len': 2048}

In [4]:
prompts = [
    "The capital of France is",
    "The chemical symbol of gold is",
    "If yesterday was Friday, then tomorrow will be",
    "The opposite of hot is",
    "The planets of the solar system are:",
    "My favorite color is",
    "If 5*x + 3 = 13, then x is",
]

In [5]:
from my_nanochat.my_engine import Engine
engine = Engine(model, tokenizer)

In [9]:
for prompt in prompts:
    tokens = tokenizer.encode(prompt, prepend=tokenizer.get_bos_token_id())
    sample, _ = engine.generate_batch(tokens, num_samples=1, max_tokens=10, temperature=0)
    print(tokenizer.decode(sample[0]))

<|bos|>The capital of France is Paris. It is the largest city in France and
<|bos|>The chemical symbol of gold is Au. It is a soft, malleable, ductile
<|bos|>If yesterday was Friday, then tomorrow will be Monday. If tomorrow is Monday, then tomorrow will
<|bos|>The opposite of hot is cold. The opposite of cold is hot. The
<|bos|>The planets of the solar system are: Mercury, Venus, Earth, Mars, Jupiter,
<|bos|>My favorite color is red. It is the color of blood, of
<|bos|>If 5*x + 3 = 13, then x is a factor of 5. If 5*


Do these match the samples from the final step during training?
```
<|bos|>The capital of France is Paris. It is the largest city in France and
<|bos|>The chemical symbol of gold is Au. It is a soft, malleable, ductile
<|bos|>If yesterday was Friday, then tomorrow will be Monday. If tomorrow is Monday, then tomorrow will
<|bos|>The opposite of hot is cold. The opposite of cold is hot. The
<|bos|>The planets of the solar system are: Mercury, Venus, Earth, Mars, Jupiter,
<|bos|>My favorite color is red. It is the color of the blood of
<|bos|>If 5*x + 3 = 13, then x is a factor of 5. If 5*
```

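Almost. Six of the seven match exactly. The "My favorite color is" completion differs slightly: "the color of blood, of" here vs. "the color of the blood of" during training.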

For fun, generate a few longer samples for each prompt with a higher temperature.

In [12]:
for prompt in prompts:
    tokens = tokenizer.encode(prompt, prepend=tokenizer.get_bos_token_id())
    samples, _ = engine.generate_batch(tokens, num_samples=5, max_tokens=20, temperature=0.3)
    for sample in samples:
        print(tokenizer.decode(sample))
    print('')

<|bos|>The capital of France is Paris, and it is the second largest city in the country. It is also the most populous city
<|bos|>The capital of France is Paris, and the capital of France is Paris. Paris is the capital of France. Paris is the
<|bos|>The capital of France is Paris. It is a city that is known for its beautiful architecture, its food, its music,
<|bos|>The capital of France is Paris and it is the largest city in the country. The city is also the most populous city in
<|bos|>The capital of France is Paris. It is the largest city in France and the second largest city in Europe. It is the

<|bos|>The chemical symbol of gold is Au, and it is a soft, malleable, ductile, and ductilely reactive metal. It is
<|bos|>The chemical symbol of gold is Au. It is a soft, malleable metal that is a bright yellow color. Gold is the most
<|bos|>The chemical symbol of gold is Au. The atomic number of gold is 79. The atomic weight of gold is 199
<|bos|>The chemical symbol of gold is Au and it 

Can we get it to OOM? Probably not just by raising max tokens, because it will hit the assert on 10x max seq length for the precomputed sin/cos before running out of memory, but perhaps by asking for lots of samples.

In [18]:
tokens = tokenizer.encode(prompts[0], prepend=tokenizer.get_bos_token_id())
samples, _ = engine.generate_batch(tokens, num_samples=1, max_tokens=1000, temperature=0.3)
print(tokenizer.decode(samples[0]))

<|bos|>The capital of France is Paris, and it is the second largest city in the country. It is also the most populous city in France. The city is located in the south of France, and it is located in the region of the Loire River. The city is located in the south of France, and it is located in the region of the Loire River. The city is located in the south of France, and it is located in the region of the Loire River. The city is located in the south of France, and it is located in the region of the Loire River. The city is located in the south of France, and it is located in the region of the Loire River. The city is located in the south of France, and it is located in the region of the Loire River. The city is located in the south of France, and it is located in the region of the Loire River. The city is located in the south of France, and it is located in the region of the Loire River. The city is located in the south of France, and it is located in the region of the Loire River. Th

In [19]:
samples, _ = engine.generate_batch(tokens, num_samples=1, max_tokens=20479, temperature=0.3)

KeyboardInterrupt: 

It's been maybe 10 minutes and it's still going. Going to cancel and use the non-batch version so I can watch progress and see if it's even worth waiting. My guess is that at first the time per generated token will be roughly constant, because self-attention is just one part of all the operations. As the sequence grows, the non-attention work stays constant, but self-attention will come to dominate as the KV cache grows and the matrix multiplications get bigger. But will it slow linearly or with, say, the square of something?

In [6]:
import time

In [37]:
t0 = time.time()
for i, (token_column, token_masks) in enumerate(engine.generate(tokens, num_samples=1, max_tokens=20479, temperature=0.3)):
    if i % 100 == 0:
        t1 = time.time()
        delta = t1 - t0
        t0 = t1
        print(f"{(delta):.0f}s {i}")

1s 0
6s 100
5s 200
5s 300
5s 400
6s 500
6s 600
6s 700
8s 800
7s 900
7s 1000
8s 1100
12s 1200
8s 1300
7s 1400
20s 1500
11s 1600
9s 1700
9s 1800
8s 1900
8s 2000
9s 2100
8s 2200
8s 2300
11s 2400
9s 2500
9s 2600
10s 2700
9s 2800
10s 2900
10s 3000
10s 3100
10s 3200
10s 3300
10s 3400
10s 3500
11s 3600
11s 3700
11s 3800
11s 3900
11s 4000
12s 4100
12s 4200
12s 4300
12s 4400
12s 4500
13s 4600
13s 4700
13s 4800
13s 4900
13s 5000
14s 5100
14s 5200
13s 5300
14s 5400
15s 5500
14s 5600
14s 5700
14s 5800
15s 5900
15s 6000
15s 6100
15s 6200
15s 6300
15s 6400
15s 6500
15s 6600
16s 6700


KeyboardInterrupt: 

I'm surprised it's not slowing faster. But I'm not sure how much it's worth digging into this and into how Mac and MPS handle this stuff. Going to interrupt, try many samples for fun, and then move on to trying things on the RTX4000.

In [38]:
t0 = time.time()
for i, (token_column, token_masks) in enumerate(engine.generate(tokens, num_samples=50, max_tokens=5000, temperature=0.3)):
    if i % 100 == 0:
        t1 = time.time()
        delta = t1 - t0
        t0 = t1
        print(f"{(delta):.0f}s {i}")

RuntimeError: Invalid buffer size: 47.74 GiB

^ Interesting: it failed immediately when first allocating the KV cache. Try with fewer samples. This also makes me think that it will either fail right away or fail at one of the points where it needs to grow the cache. Not really sure about that, though, and also not sure how "Invalid buffer size" differs from OOM. Also probably better to restart the kernel and then repeat.
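The 47.74 GiB in the error is consistent with a back-of-the-envelope KV cache size. This sketch assumes the cache holds keys and values for every layer and KV head in float32 (4 bytes), that it is preallocated for prompt length plus `max_tokens`, and that the prompt encodes to 6 tokens including BOS. None of that is confirmed here, and `kv_cache_gib` is a hypothetical helper, not part of the engine.

```python
def kv_cache_gib(n_layer, n_kv_head, head_dim, seq_len, batch, bytes_per_elem=4):
    """Estimated KV cache size in GiB: keys and values for every layer."""
    n_elems = n_layer * 2 * batch * n_kv_head * seq_len * head_dim
    return n_elems * bytes_per_elem / 1024**3


head_dim = 1280 // 10      # n_embd / n_head from the model config
seq_len = 6 + 5000         # assumed prompt length + max_tokens

for num_samples in (50, 5):
    size = kv_cache_gib(20, 10, head_dim, seq_len, num_samples)
    print(f"num_samples={num_samples}: {size:.2f} GiB")
```

Under these assumptions, 50 samples works out to about 47.74 GiB and 5 samples to about 4.77 GiB, which would also line up with the 4.77 GiB allocation in the MPS out-of-memory error further down. The "Invalid buffer size" wording suggests the single 47.74 GiB request was rejected outright rather than the memory pool filling up, but I haven't verified that.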

-- restarted kernel and ran appropriate cells above --

In [10]:
tokens = tokenizer.encode(prompts[0], prepend=tokenizer.get_bos_token_id())
t0 = time.time()
for i, (token_column, token_masks) in enumerate(engine.generate(tokens, num_samples=50, max_tokens=5000, temperature=0.3)):
    if i % 100 == 0:
        t1 = time.time()
        delta = t1 - t0
        t0 = t1
        print(f"{(delta):.0f}s {i}")

RuntimeError: Invalid buffer size: 47.74 GiB

-- restarted kernel and ran appropriate cells above --

In [7]:
tokens = tokenizer.encode(prompts[0], prepend=tokenizer.get_bos_token_id())
t0 = time.time()
for i, (token_column, token_masks) in enumerate(engine.generate(tokens, num_samples=5, max_tokens=5000, temperature=0.3)):
    if i % 100 == 0:
        t1 = time.time()
        delta = t1 - t0
        t0 = t1
        print(f"{(delta):.0f}s {i}")

1s 0
7s 100
7s 200
8s 300
9s 400
9s 500
10s 600
10s 700
11s 800
13s 900
13s 1000
22s 1100
18s 1200
16s 1300
16s 1400
17s 1500


KeyboardInterrupt: 

Interrupted. Perhaps come back to this later. Move on to trying on RTX4000.

In [41]:
t0 = time.time()
for i, (token_column, token_masks) in enumerate(engine.generate(tokens, num_samples=5, max_tokens=5000, temperature=0.3)):
    if i % 100 == 0:
        t1 = time.time()
        delta = t1 - t0
        t0 = t1
        print(f"{(delta):.0f}s {i}")

RuntimeError: MPS backend out of memory (MPS allocated: 14.89 GiB, other allocations: 3.98 MiB, max allowed: 18.13 GiB). Tried to allocate 4.77 GiB on private pool. Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations (may cause system failure).

In [4]:
tokens = list(model.generate(tokenizer.encode("First take a right on Main St.", prepend=tokenizer.get_bos_token_id()), max_tokens=10))