I'll be moving to a new machine, so I'll first need to download the data. As before, do that with:

`python -m nanochat.dataset -n 20` for example

I'll then want to train the tokenizer but I never put that in a script.

In `challenge-14-baby-pretrain-on-gpu` I did it in a notebook `train-tokenizer.ipynb`

And then in `challenge-18-add-evaluate-bpb` I wrote/ran the code to cache the mapping from token to number of bytes.

It's time to put all that into `my_tok_train.py` to keep things organized.
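As a rough sketch of what the token-to-byte-count cache holds (not the actual `my_tok_train.py` code): for every token id, store the byte length of that token, so a loss in nats per token can later be converted to bits per byte. The toy `vocab`, the `special_ids` set, and the choice to count special tokens as 0 bytes are assumptions made for illustration.

```python
# Hypothetical toy vocabulary: token id -> raw bytes of that token.
vocab = {
    0: b"a",
    1: b"th",
    2: "é".encode("utf-8"),  # 2 bytes in UTF-8
    3: b"<|bos|>",           # stands in for a special token
}
special_ids = {3}  # assumption: special tokens contribute 0 bytes


def build_token_bytes(vocab, special_ids):
    """Return a list where index = token id and value = byte length."""
    token_bytes = [0] * (max(vocab) + 1)
    for token_id, raw in vocab.items():
        token_bytes[token_id] = 0 if token_id in special_ids else len(raw)
    return token_bytes


token_bytes = build_token_bytes(vocab, special_ids)
print(token_bytes)
```

Storing this as a flat list (or tensor) indexed by token id makes the lookup during evaluation a single indexing operation.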

An error below reminded me I'll also need to do this on the new machine:

```
cd challenge-07-rust-and-python-simplified-tokenizer/rust_tokenizer
maturin develop
```

Let's try it:

In [1]:
import os
os.environ["PYTHONPATH"] = "../my_nanochat"

In [2]:
!python -m scripts.my_tok_train

max_chars: 10,000,000,000
doc_cap: 10,000
vocab_size: 65,536
starting to train tokenizer
buffers filled: 1
buffers filled: 2
buffers filled: 3
buffers filled: 4
buffers filled: 5
buffers filled: 6
buffers filled: 7
buffers filled: 8
buffers filled: 9
buffers filled: 10
buffers filled: 11
buffers filled: 12
buffers filled: 13
buffers filled: 14
buffers filled: 15
buffers filled: 16
buffers filled: 17
buffers filled: 18
buffers filled: 19
buffers filled: 20
buffers filled: 21
buffers filled: 22
buffers filled: 23
buffers filled: 24
buffers filled: 25
buffers filled: 26
buffers filled: 27
buffers filled: 28
buffers filled: 29
buffers filled: 30
buffers filled: 31
buffers filled: 32
buffers filled: 33
buffers filled: 34
buffers filled: 35
buffers filled: 36
buffers filled: 37
buffers filled: 38
buffers filled: 39
buffers filled: 40
buffers filled: 41
buffers filled: 42
buffers filled: 43
buffers filled: 44
buffers filled: 45
buffers filled: 46
buffers filled: 47
buffers filled: 48
buffers 

So far I've been running scripts with `python`. From looking at Karpathy's [speedrun.sh](https://github.com/karpathy/nanochat/blob/master/speedrun.sh), it looks like we use `torchrun` to use the torch distributed stuff. Let me first see if that can run on my laptop.

In [3]:
!torchrun

W1114 14:42:49.242000 96239 torch/distributed/elastic/multiprocessing/redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.
usage: torchrun [-h] [--nnodes NNODES] [--nproc-per-node NPROC_PER_NODE]
                [--rdzv-backend RDZV_BACKEND] [--rdzv-endpoint RDZV_ENDPOINT]
                [--rdzv-id RDZV_ID] [--rdzv-conf RDZV_CONF] [--standalone]
                [--max-restarts MAX_RESTARTS]
                [--monitor-interval MONITOR_INTERVAL]
                [--start-method {spawn,fork,forkserver}]
                [--event-log-handler EVENT_LOG_HANDLER] [--role ROLE] [-m]
                [--no-python] [--run-path] [--log-dir LOG_DIR] [-r REDIRECTS]
                [-t TEE] [--local-ranks-filter LOCAL_RANKS_FILTER]
                [--node-rank NODE_RANK] [--master-addr MASTER_ADDR]
                [--master-port MASTER_PORT] [--local-addr LOCAL_ADDR]
                [--logs-specs LOGS_SPECS]
                [--numa-binding {node,socket,exclusive,core-compl

He calls it like this:

`torchrun --standalone --nproc_per_node=$NPROC_PER_NODE -m scripts.base_train -- --depth=20 --run=$WANDB_RUN`

What does the --standalone flag do?

ChatGPT seems to give a good answer. The short version: use it for single-node, multi-GPU runs. It starts a local rendezvous backend on a free port, so you don't have to set `--rdzv-backend`, `--rdzv-endpoint` and `--rdzv-id` yourself.

Let's try.

In [4]:
!torchrun --standalone --nproc_per_node=1 -m scripts.my_base_train \
    --depth=4 \
    --max_seq_len=128 \
    --device_batch_size=1 \
    --num_iterations=10 \
    --total_batch_size=128 \
    --eval_every=100 \
    --eval_tokens=1280 \
    --core_metric_every=0 \

W1114 14:49:45.765000 96358 torch/distributed/elastic/multiprocessing/redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.
[W1114 14:49:45.211740000 socket.cpp:767] [c10d] The IPv6 network addresses of (1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.ip6.arpa, 49218) cannot be retrieved (gai error: 8 - nodename nor servname provided, or not known).
[W1114 14:49:46.814254000 socket.cpp:767] [c10d] The IPv6 network addresses of (1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.ip6.arpa, 49218) cannot be retrieved (gai error: 8 - nodename nor servname provided, or not known).
[W1114 14:49:47.601272000 socket.cpp:767] [c10d] The IPv6 network addresses of (1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.ip6.arpa, 49218) cannot be retrieved (gai error: 8 - nodename nor servname provided, or not known).
[W1114 14:49:48.739299000 socket.cpp:767] [c10d] The IPv6 network addresses of (1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.

That doesn't work on the laptop, but there's no point in figuring out why. Instead, try on our single-GPU machine.

### Trying on single GPU machine

In [3]:
!torchrun --standalone --nproc_per_node=1 -m scripts.my_base_train \
    --depth=4 \
    --max_seq_len=128 \
    --device_batch_size=1 \
    --num_iterations=10 \
    --total_batch_size=128 \
    --eval_every=100 \
    --eval_tokens=1280 \
    --core_metric_every=0 \

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/paperspace/nanogpt-learning/my_nanochat/scripts/my_base_train.py", line 16, in <module>
    from my_nanochat.my_tokenizer import get_tokenizer, get_token_bytes
  File "/home/paperspace/nanogpt-learning/my_nanochat/my_nanochat/my_tokenizer.py", line 1, in <module>
    import rust_tokenizer;
ModuleNotFoundError: No module named 'rust_tokenizer'
E1114 19:57:36.615000 1773 torch/distributed/elastic/multiprocessing/api.py:882] failed (exitcode: 1) local_rank: 0 (pid: 1788) of binary: /home/paperspace/nanogpt-learning/.venv/bin/python
Traceback (most recent call last):
  File "/home/paperspace/nanogpt-learning/.venv/bin/torchrun", line 10, in <module>
    sys.exit(main())
  File "/home/paperspace/nanogpt-learning/.venv/lib/python3.10/site-

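The error about no module 'rust_tokenizer' reminds me that I should move that code out of challenge 7. I don't understand why I'm getting that error now, but maybe it's because I ran `uv sync` to get wandb here? That would fit: by default `uv sync` removes packages that aren't in the lockfile, which would likely include a module installed with `maturin develop`. Do this: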

```
cd challenge-07-rust-and-python-simplified-tokenizer/rust_tokenizer
maturin develop
```
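A quick way to catch this before launching `torchrun` is to check, from the same virtual environment, whether the module can be imported. This is a small sketch using only the standard library; `module_available` is a helper name made up here.

```python
import importlib.util


def module_available(name: str) -> bool:
    """True if a top-level module called `name` can be found for import."""
    return importlib.util.find_spec(name) is not None


if not module_available("rust_tokenizer"):
    print("rust_tokenizer is missing: run `maturin develop` in the rust_tokenizer directory")
```

`find_spec` only locates the module without importing it, so the check is cheap and has no side effects.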

In [7]:
!torchrun --standalone --nproc_per_node=1 -m scripts.my_base_train -- \
    --depth=4 \
    --max_seq_len=128 \
    --device_batch_size=1 \
    --num_iterations=10 \
    --total_batch_size=128 \
    --eval_every=100 \
    --eval_tokens=1280 \
    --core_metric_every=0

overriding depth = 4
overriding max_seq_len = 128
overriding device_batch_size = 1
overriding num_iterations = 10
overriding total_batch_size = 128
overriding eval_every = 100
overriding eval_tokens = 1280
overriding core_metric_every = 0
user_config: {'run': 'dummy', 'device_type': '', 'depth': 4, 'max_seq_len': 128, 'num_iterations': 10, 'device_batch_size': 1, 'total_batch_size': 128, 'embedding_lr': 0.2, 'unembedding_lr': 0.004, 'weight_decay': 0.0, 'matrix_lr': 0.02, 'grad_clip': 1.0, 'warmup_ratio': 0.0, 'warmdown_ratio': 0.2, 'final_lr_frac': 0.0, 'eval_every': 100, 'eval_tokens': 1280, 'core_metric_every': 0, 'core_metric_max_per_task': 500, 'sample_every': 2000, 'model_tag': ''}
Autodetected device type: cuda
  _C._set_float32_matmul_precision(precision)
Vocab size: 65,537
num_layers: 4
model_dim: 256
num_heads: 2
num_kv_heads: 2
Tokens / micro-batch / rank: 1 x 128 = 128
Tokens / micro-batch: 128
Total batch size 128 => gradient accumulation steps: 1
GPT(
  (transformer): Mod

ok, seems good. What happens if I tell it to use 2 GPUs?
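The batch-size lines in that log come from simple arithmetic. The sketch below assumes `my_base_train` computes them the way the log suggests: tokens per micro-batch per rank, times the number of ranks, must divide the total batch size. The function name is made up for illustration.

```python
def grad_accum_steps(total_batch_size, device_batch_size, max_seq_len, world_size):
    """Gradient accumulation steps needed to reach total_batch_size tokens."""
    tokens_per_rank = device_batch_size * max_seq_len  # one micro-batch on one rank
    tokens_per_step = tokens_per_rank * world_size     # one micro-batch across all ranks
    if total_batch_size % tokens_per_step != 0:
        raise ValueError(
            f"total_batch_size={total_batch_size} is not a multiple of {tokens_per_step}"
        )
    return total_batch_size // tokens_per_step


print(grad_accum_steps(128, 1, 128, world_size=1))  # the single-GPU settings from the log
print(grad_accum_steps(256, 1, 128, world_size=2))  # same per-device settings on two ranks
```

With two ranks and the same per-device settings, one micro-batch already covers 256 tokens, so `--total_batch_size=128` can no longer be met exactly and would need to be raised to a multiple of 256.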

In [9]:
!torchrun --standalone --nproc_per_node=2 -m scripts.my_base_train -- \
    --depth=4 \
    --max_seq_len=128 \
    --device_batch_size=1 \
    --num_iterations=10 \
    --total_batch_size=128 \
    --eval_every=100 \
    --eval_tokens=1280 \
    --core_metric_every=0

W1114 20:14:10.344000 2736 torch/distributed/run.py:803] 
W1114 20:14:10.344000 2736 torch/distributed/run.py:803] *****************************************
W1114 20:14:10.344000 2736 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1114 20:14:10.344000 2736 torch/distributed/run.py:803] *****************************************
Autodetected device type: cuda
  _C._set_float32_matmul_precision(precision)
overriding depth = 4
overriding max_seq_len = 128
overriding device_batch_size = 1
overriding num_iterations = 10
overriding total_batch_size = 128
overriding eval_every = 100
overriding eval_tokens = 1280
overriding core_metric_every = 0
user_config: {'run': 'dummy', 'device_type': '', 'depth': 4, 'max_seq_len': 128, 'num_iterations': 10, 'device_batch_size': 1, 'total_batch_size': 128, 'embeddin

^ failed, as expected, though I'm not sure it failed in the expected way

### Trying with 2 GPUs

- Create a new 2 (low-powered) GPU machine in paperspace

- Follow the instructions in `challenge-14-baby-pretrain-on-gpu/getting-ready.ipynb` to set it up.

Chose 2xRTX4000 (single GPU machine was also RTX4000)

I followed those instructions and now I'm on the new machine.

In [1]:
!nvidia-smi

Fri Nov 14 20:55:19 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Quadro RTX 4000                Off |   00000000:00:05.0 Off |                  N/A |
| 30%   31C    P8              2W /  125W |       1MiB /   8192MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Quadro RTX 4000                Off |   00

In [2]:
import os
os.environ["PYTHONPATH"] = "../my_nanochat"

In [3]:
!python -m scripts.my_tok_train

max_chars: 10,000,000,000
doc_cap: 10,000
vocab_size: 65,536
starting to train tokenizer
buffers filled: 1
buffers filled: 2
buffers filled: 3
buffers filled: 4
buffers filled: 5
buffers filled: 6
buffers filled: 7
buffers filled: 8
buffers filled: 9
buffers filled: 10
buffers filled: 11
buffers filled: 12
buffers filled: 13
buffers filled: 14
buffers filled: 15
buffers filled: 16
buffers filled: 17
buffers filled: 18
buffers filled: 19
buffers filled: 20
buffers filled: 21
buffers filled: 22
buffers filled: 23
buffers filled: 24
buffers filled: 25
buffers filled: 26
buffers filled: 27
buffers filled: 28
buffers filled: 29
buffers filled: 30
buffers filled: 31
buffers filled: 32
buffers filled: 33
buffers filled: 34
buffers filled: 35
buffers filled: 36
buffers filled: 37
buffers filled: 38
buffers filled: 39
buffers filled: 40
buffers filled: 41
buffers filled: 42
buffers filled: 43
buffers filled: 44
buffers filled: 45
buffers filled: 46
buffers filled: 47
buffers filled: 48
buffers 

In [4]:
!torchrun --standalone --nproc_per_node=2 -m scripts.my_base_train -- \
    --depth=4 \
    --max_seq_len=128 \
    --device_batch_size=1 \
    --num_iterations=10 \
    --total_batch_size=128 \
    --eval_every=100 \
    --eval_tokens=1280 \
    --core_metric_every=0

W1114 21:36:14.831000 37098 torch/distributed/run.py:803] 
W1114 21:36:14.831000 37098 torch/distributed/run.py:803] *****************************************
W1114 21:36:14.831000 37098 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1114 21:36:14.831000 37098 torch/distributed/run.py:803] *****************************************
overriding depth = 4
overriding max_seq_len = 128
overriding device_batch_size = 1
overriding num_iterations = 10
overriding total_batch_size = 128
overriding eval_every = 100
overriding eval_tokens = 1280
overriding core_metric_every = 0
user_config: {'run': 'dummy', 'device_type': '', 'depth': 4, 'max_seq_len': 128, 'num_iterations': 10, 'device_batch_size': 1, 'total_batch_size': 128, 'embedding_lr': 0.2, 'unembedding_lr': 0.004, 'weight_decay': 0.0, 'matrix_lr': 0.

hit that assert I put in because I left out the code for saving optimizers, which requires something special with ranks...go add it

but that also means it got all the way to saving...I guess that's good, but how do I know it was doing the right thing?
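On the "something special with ranks": one common scheme, assumed here rather than taken from the actual checkpoint code, is that only rank 0 writes the model weights and metadata (they are identical on every rank), while every rank writes its own optimizer file, because a distributed optimizer may keep only a shard of the state on each rank. The file names and the helper `files_written_by` are hypothetical.

```python
import os


def files_written_by(rank, step, checkpoint_dir="checkpoints"):
    """File paths a given rank would write under this assumed scheme."""
    paths = []
    if rank == 0:
        # Model weights and metadata are the same on every rank, so one copy is enough.
        paths.append(os.path.join(checkpoint_dir, f"model_{step:06d}.pt"))
        paths.append(os.path.join(checkpoint_dir, f"meta_{step:06d}.json"))
    # Optimizer state may be sharded, so each rank saves its own piece.
    paths.append(os.path.join(checkpoint_dir, f"optim_{step:06d}_rank{rank}.pt"))
    return paths


for rank in range(2):
    print(rank, files_written_by(rank, step=10))
```

Listing the checkpoint directory after a 2-process run and seeing one optimizer file per rank would be a simple sanity check that both ranks reached the save step.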

In [6]:
!torchrun --standalone --nproc_per_node=2 -m scripts.my_base_train -- \
    --depth=4 \
    --max_seq_len=128 \
    --device_batch_size=1 \
    --num_iterations=10 \
    --total_batch_size=128 \
    --eval_every=100 \
    --eval_tokens=1280 \
    --core_metric_every=0

W1114 21:48:04.863000 37714 torch/distributed/run.py:803] 
W1114 21:48:04.863000 37714 torch/distributed/run.py:803] *****************************************
W1114 21:48:04.863000 37714 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1114 21:48:04.863000 37714 torch/distributed/run.py:803] *****************************************
Autodetected device type: cuda
  _C._set_float32_matmul_precision(precision)
overriding depth = 4
overriding max_seq_len = 128
overriding device_batch_size = 1
overriding num_iterations = 10
overriding total_batch_size = 128
overriding eval_every = 100
overriding eval_tokens = 1280
overriding core_metric_every = 0
user_config: {'run': 'dummy', 'device_type': '', 'depth': 4, 'max_seq_len': 128, 'num_iterations': 10, 'device_batch_size': 1, 'total_batch_size': 128, 'embe

Added another print statement that prints in all ranks (not only the master process). I want to make sure I see both ranks.

In [7]:
!torchrun --standalone --nproc_per_node=2 -m scripts.my_base_train -- \
    --depth=4 \
    --max_seq_len=128 \
    --device_batch_size=1 \
    --num_iterations=2 \
    --total_batch_size=128 \
    --eval_every=100 \
    --eval_tokens=1280 \
    --core_metric_every=0

W1114 21:53:47.112000 38130 torch/distributed/run.py:803] 
W1114 21:53:47.112000 38130 torch/distributed/run.py:803] *****************************************
W1114 21:53:47.112000 38130 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1114 21:53:47.112000 38130 torch/distributed/run.py:803] *****************************************
Autodetected device type: cuda
  _C._set_float32_matmul_precision(precision)
This process is ddp_rank: 0, ddp_local_rank: 0, ddp_world_size: 1
overriding depth = 4
overriding max_seq_len = 128
overriding device_batch_size = 1
overriding num_iterations = 2
overriding total_batch_size = 128
overriding eval_every = 100
overriding eval_tokens = 1280
overriding core_metric_every = 0
user_config: {'run': 'dummy', 'device_type': '', 'depth': 4, 'max_seq_len': 128, 'num_iterat

Seeing `This process is ddp_rank: 0, ddp_local_rank: 0, ddp_world_size: 1`, so must be missing something.

Oh yeah, forgot to update this in `my_common.py`:
```
# return ddp, ddp_rank, ddp_local_rank, ddp_world_size
def get_dist_info():
    # for now
    return False, 0, 0, 1
```
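The stub always reports a world size of 1, no matter how the script was launched. `torchrun` tells each worker who it is through the `RANK`, `LOCAL_RANK` and `WORLD_SIZE` environment variables, so a non-stub version can read those. A minimal sketch, assuming the same return signature as the stub; this is not necessarily what ended up in `my_common.py`.

```python
import os


def get_dist_info():
    """Return (ddp, ddp_rank, ddp_local_rank, ddp_world_size)."""
    # torchrun exports these variables to every worker it launches.
    required = ("RANK", "LOCAL_RANK", "WORLD_SIZE")
    if all(name in os.environ for name in required):
        return (
            True,
            int(os.environ["RANK"]),
            int(os.environ["LOCAL_RANK"]),
            int(os.environ["WORLD_SIZE"]),
        )
    # Plain `python` launch: behave like a single process.
    return False, 0, 0, 1


print(get_dist_info())
```

Falling back to `(False, 0, 0, 1)` keeps the script runnable with plain `python -m` as well as under `torchrun`.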

In [12]:
!torchrun --standalone --nproc_per_node=2 -m scripts.my_base_train -- \
    --depth=4 \
    --max_seq_len=128 \
    --device_batch_size=1 \
    --num_iterations=2 \
    --total_batch_size=256 \
    --eval_every=100 \
    --eval_tokens=1280 \
    --core_metric_every=0

W1114 22:05:18.973000 38726 torch/distributed/run.py:803] 
W1114 22:05:18.973000 38726 torch/distributed/run.py:803] *****************************************
W1114 22:05:18.973000 38726 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1114 22:05:18.973000 38726 torch/distributed/run.py:803] *****************************************
overriding depth = 4
overriding max_seq_len = 128
overriding device_batch_size = 1
overriding num_iterations = 2
overriding total_batch_size = 256
overriding eval_every = 100
overriding eval_tokens = 1280
overriding core_metric_every = 0
user_config: {'run': 'dummy', 'device_type': '', 'depth': 4, 'max_seq_len': 128, 'num_iterations': 2, 'device_batch_size': 1, 'total_batch_size': 256, 'embedding_lr': 0.2, 'unembedding_lr': 0.004, 'weight_decay': 0.0, 'matrix_lr': 0.02

ok, now failing due to `DistAdamW = None # for now so it will fail until I "copy" adamw.py`

time to look at `adamw.py`

I'm going to copy and paste it and go back and look later.

In [14]:
!torchrun --standalone --nproc_per_node=2 -m scripts.my_base_train -- \
    --depth=4 \
    --max_seq_len=128 \
    --device_batch_size=1 \
    --num_iterations=2 \
    --total_batch_size=256 \
    --eval_every=100 \
    --eval_tokens=1280 \
    --core_metric_every=0

W1114 22:30:46.791000 39515 torch/distributed/run.py:803] 
W1114 22:30:46.791000 39515 torch/distributed/run.py:803] *****************************************
W1114 22:30:46.791000 39515 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1114 22:30:46.791000 39515 torch/distributed/run.py:803] *****************************************
overriding depth = 4
overriding max_seq_len = 128
overriding device_batch_size = 1
overriding num_iterations = 2
overriding total_batch_size = 256
overriding eval_every = 100
overriding eval_tokens = 1280
overriding core_metric_every = 0
user_config: {'run': 'dummy', 'device_type': '', 'depth': 4, 'max_seq_len': 128, 'num_iterations': 2, 'device_batch_size': 1, 'total_batch_size': 256, 'embedding_lr': 0.2, 'unembedding_lr': 0.004, 'weight_decay': 0.0, 'matrix_lr': 0.02

Add temp debug printing in `adamw.py` to see shapes of output and input

In [17]:
!torchrun --standalone --nproc_per_node=2 -m scripts.my_base_train -- \
    --depth=4 \
    --max_seq_len=128 \
    --device_batch_size=1 \
    --num_iterations=2 \
    --total_batch_size=256 \
    --eval_every=100 \
    --eval_tokens=1280 \
    --core_metric_every=0

W1114 22:41:54.509000 40101 torch/distributed/run.py:803] 
W1114 22:41:54.509000 40101 torch/distributed/run.py:803] *****************************************
W1114 22:41:54.509000 40101 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1114 22:41:54.509000 40101 torch/distributed/run.py:803] *****************************************
overriding depth = 4
overriding max_seq_len = 128
overriding device_batch_size = 1
overriding num_iterations = 2
overriding total_batch_size = 256
overriding eval_every = 100
overriding eval_tokens = 1280
overriding core_metric_every = 0
user_config: {'run': 'dummy', 'device_type': '', 'depth': 4, 'max_seq_len': 128, 'num_iterations': 2, 'device_batch_size': 1, 'total_batch_size': 256, 'embedding_lr': 0.2, 'unembedding_lr': 0.004, 'weight_decay': 0.0, 'matrix_lr': 0.02

The debug print says `Just before calling reduce_scatter_tensor, grad_slice (output) shape is torch.Size([32768, 256]) and grad (input) shape is torch.Size([65537, 256])`, and the `reduce_scatter_tensor()` doc says: "input (Tensor): Input tensor to be reduced and scattered. Its size should be output tensor size times the world size." With a world size of 2 that means 32768 x 2 = 65536 rows, but the input has 65537.

In [19]:
import sys
sys.path.append('../my_nanochat')
from my_nanochat.my_tokenizer import get_tokenizer
get_tokenizer().get_vocab_size()

65537

Prob do need to understand adamw, but just based on those numbers: does it somehow split up the lm_head params across ranks, so that it's a problem if vocab_size % world_size is not 0? Is there something in the actual tokenizer training that forces vocab_size to be a multiple of 8 or something like that? Don't immediately see anything like that. I have 65536 + BOS = 65537. The real one has 9 special tokens including BOS so that's still going to be an odd number. Hmm.
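The numbers in the debug print fit that reading, and the arithmetic can be checked in plain Python without torch. The helper names below are made up, and the padding calculation is there only to illustrate the divisibility requirement, not as the fix that was chosen.

```python
def rows_per_rank(input_rows, world_size):
    """Per-rank output rows, or raise if the rows don't split evenly."""
    if input_rows % world_size != 0:
        raise ValueError(
            f"{input_rows} rows cannot be split evenly across {world_size} ranks"
        )
    return input_rows // world_size


def padded_rows(rows, multiple):
    """Smallest value >= rows that is a multiple of `multiple`."""
    return -(-rows // multiple) * multiple


vocab_size, world_size = 65537, 2
print(vocab_size // world_size)             # 32768, the grad_slice rows in the debug print
print(vocab_size % world_size)              # 1 row left over, hence the size mismatch
print(padded_rows(vocab_size, world_size))  # 65538 would split evenly across 2 ranks
```

Any vocabulary size that is a multiple of the world size passes the check, which is why an even total such as 65536 would not trigger the error on 2 GPUs.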

Let me google and chatgpt "ValueError: input tensor must be the same size as output size times world size"

Not immediately seeing an answer, but it looks like understanding scatter and the related collective operations is important for understanding DDP.
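To make the operation concrete, here is a pure-Python simulation of a sum reduce-scatter, with plain lists standing in for the tensors on each rank. It only mimics the semantics described in the `reduce_scatter_tensor()` doc quoted above; it is not how torch implements it.

```python
def reduce_scatter(inputs):
    """Simulate a sum reduce-scatter.

    inputs[r] is the full list held by rank r; all lists have the same length,
    which must be a multiple of the number of ranks.
    Returns outputs, where outputs[r] is the chunk rank r ends up with.
    """
    world_size = len(inputs)
    n = len(inputs[0])
    if n % world_size != 0:
        raise ValueError("input size must be output size times world size")
    chunk = n // world_size
    # Reduce: element-wise sum across ranks.
    summed = [sum(values) for values in zip(*inputs)]
    # Scatter: rank r keeps only its own contiguous chunk of the sum.
    return [summed[r * chunk:(r + 1) * chunk] for r in range(world_size)]


rank0 = [1, 2, 3, 4]
rank1 = [10, 20, 30, 40]
print(reduce_scatter([rank0, rank1]))  # [[11, 22], [33, 44]]
```

Each rank ends up with the reduced values for only its slice, which is why the input length has to divide evenly by the number of ranks.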

Code added / updated as part of this challenge so far:

- Added `my_tok_train.py`
 
- Added / fixed code purposely left out earlier when ignoring DDP in `my_checkpoint_manager.py`, `my_common.py` and `my_gpt.py`

- Directly copied the entire `adamw.py`