I'll be going to a new machine and so I'll first need to download data. As before, do that with:

`python -m nanochat.dataset -n 20` for example

I'll then want to train the tokenizer but I never put that in a script.

In `challenge-14-baby-pretrain-on-gpu` I did it in a notebook `train-tokenizer.ipynb`

And then in `challenge-18-add-evaluate-bpb` I wrote/ran the code to cache the mapping from token to number of bytes.

It's time to put all that into `my_tok_train.py` to keep things organized.
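As a reminder of what the "token to number of bytes" mapping is, here is a minimal sketch using a plain Python list instead of a saved tensor. The toy vocab and the `count_token_bytes` helper are hypothetical, and treating special tokens as 0 bytes is an assumption, not something confirmed from the real script.

```python
def count_token_bytes(token_byte_strings, special_token_ids):
    """Return, for every token id, how many raw bytes that token stands for."""
    counts = []
    for token_id, raw in enumerate(token_byte_strings):
        if token_id in special_token_ids:
            # Assumption: special tokens do not correspond to any text bytes.
            counts.append(0)
        else:
            counts.append(len(raw))
    return counts


# Hypothetical 4-token vocab: two ASCII tokens, one 2-byte UTF-8 character, one special token.
toy_vocab = [b"a", b"th", "é".encode("utf-8"), b"<|bos|>"]
token_bytes = count_token_bytes(toy_vocab, special_token_ids={3})
print(token_bytes)
```

Summing these counts over the target tokens is what lets a loss be normalized per byte instead of per token.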

An error below reminded me I'll also need to do this on the new machine:

```
cd challenge-07-rust-and-python-simplified-tokenizer/rust_tokenizer
maturin develop
```

Let's try it:

In [1]:
import os
os.environ["PYTHONPATH"] = "../my_nanochat"

In [2]:
!python -m scripts.my_tok_train

max_chars: 10,000,000,000
doc_cap: 10,000
vocab_size: 65,536
starting to train tokenizer
buffers filled: 1
buffers filled: 2
buffers filled: 3
buffers filled: 4
buffers filled: 5
buffers filled: 6
buffers filled: 7
buffers filled: 8
buffers filled: 9
buffers filled: 10
buffers filled: 11
buffers filled: 12
buffers filled: 13
buffers filled: 14
buffers filled: 15
buffers filled: 16
buffers filled: 17
buffers filled: 18
buffers filled: 19
buffers filled: 20
buffers filled: 21
buffers filled: 22
buffers filled: 23
buffers filled: 24
buffers filled: 25
buffers filled: 26
buffers filled: 27
buffers filled: 28
buffers filled: 29
buffers filled: 30
buffers filled: 31
buffers filled: 32
buffers filled: 33
buffers filled: 34
buffers filled: 35
buffers filled: 36
buffers filled: 37
buffers filled: 38
buffers filled: 39
buffers filled: 40
buffers filled: 41
buffers filled: 42
buffers filled: 43
buffers filled: 44
buffers filled: 45
buffers filled: 46
buffers filled: 47
buffers filled: 48
buffers 

So far I've been running scripts with python. From looking at his [speedrun.sh](https://github.com/karpathy/nanochat/blob/master/speedrun.sh), it looks like we use `torchrun` to use the torch distributed stuff. Let me first see if that can run on my laptop.

In [3]:
!torchrun

W1114 14:42:49.242000 96239 torch/distributed/elastic/multiprocessing/redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.
usage: torchrun [-h] [--nnodes NNODES] [--nproc-per-node NPROC_PER_NODE]
                [--rdzv-backend RDZV_BACKEND] [--rdzv-endpoint RDZV_ENDPOINT]
                [--rdzv-id RDZV_ID] [--rdzv-conf RDZV_CONF] [--standalone]
                [--max-restarts MAX_RESTARTS]
                [--monitor-interval MONITOR_INTERVAL]
                [--start-method {spawn,fork,forkserver}]
                [--event-log-handler EVENT_LOG_HANDLER] [--role ROLE] [-m]
                [--no-python] [--run-path] [--log-dir LOG_DIR] [-r REDIRECTS]
                [-t TEE] [--local-ranks-filter LOCAL_RANKS_FILTER]
                [--node-rank NODE_RANK] [--master-addr MASTER_ADDR]
                [--master-port MASTER_PORT] [--local-addr LOCAL_ADDR]
                [--logs-specs LOGS_SPECS]
                [--numa-binding {node,socket,exclusive,core-compl

He calls it like this:

`torchrun --standalone --nproc_per_node=$NPROC_PER_NODE -m scripts.base_train -- --depth=20 --run=$WANDB_RUN`

What does the --standalone flag do?

ChatGPT seems to give a good answer. The short version seems to be: use it for single-node, multi-GPU runs, and it saves you from setting up a lot of other stuff.

Let's try.

In [4]:
!torchrun --standalone --nproc_per_node=1 -m scripts.my_base_train \
    --depth=4 \
    --max_seq_len=128 \
    --device_batch_size=1 \
    --num_iterations=10 \
    --total_batch_size=128 \
    --eval_every=100 \
    --eval_tokens=1280 \
    --core_metric_every=0 \

W1114 14:49:45.765000 96358 torch/distributed/elastic/multiprocessing/redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.
[W1114 14:49:45.211740000 socket.cpp:767] [c10d] The IPv6 network addresses of (1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.ip6.arpa, 49218) cannot be retrieved (gai error: 8 - nodename nor servname provided, or not known).
[W1114 14:49:46.814254000 socket.cpp:767] [c10d] The IPv6 network addresses of (1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.ip6.arpa, 49218) cannot be retrieved (gai error: 8 - nodename nor servname provided, or not known).
[W1114 14:49:47.601272000 socket.cpp:767] [c10d] The IPv6 network addresses of (1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.ip6.arpa, 49218) cannot be retrieved (gai error: 8 - nodename nor servname provided, or not known).
[W1114 14:49:48.739299000 socket.cpp:767] [c10d] The IPv6 network addresses of (1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.

Doesn't work, but no point in figuring that out, instead try on our single GPU machine.

### trying on single GPU machine

In [3]:
!torchrun --standalone --nproc_per_node=1 -m scripts.my_base_train \
    --depth=4 \
    --max_seq_len=128 \
    --device_batch_size=1 \
    --num_iterations=10 \
    --total_batch_size=128 \
    --eval_every=100 \
    --eval_tokens=1280 \
    --core_metric_every=0 \

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/paperspace/nanogpt-learning/my_nanochat/scripts/my_base_train.py", line 16, in <module>
    from my_nanochat.my_tokenizer import get_tokenizer, get_token_bytes
  File "/home/paperspace/nanogpt-learning/my_nanochat/my_nanochat/my_tokenizer.py", line 1, in <module>
    import rust_tokenizer;
ModuleNotFoundError: No module named 'rust_tokenizer'
E1114 19:57:36.615000 1773 torch/distributed/elastic/multiprocessing/api.py:882] failed (exitcode: 1) local_rank: 0 (pid: 1788) of binary: /home/paperspace/nanogpt-learning/.venv/bin/python
Traceback (most recent call last):
  File "/home/paperspace/nanogpt-learning/.venv/bin/torchrun", line 10, in <module>
    sys.exit(main())
  File "/home/paperspace/nanogpt-learning/.venv/lib/python3.10/site-

The error about no module 'rust_tokenizer' reminds me that I should move that code out of challenge 7. Don't understand why I'm getting that error now but maybe because I did `uv sync` to get wandb here? Do this:

```
cd challenge-07-rust-and-python-simplified-tokenizer/rust_tokenizer
maturin develop
```
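A small preflight check could catch this before launching a long run. This is only a sketch: `module_available` is a hypothetical helper, and the list of modules to check is an assumption.

```python
import importlib.util


def module_available(name):
    """True if `name` can be imported from the current environment."""
    return importlib.util.find_spec(name) is not None


# Assumption: rust_tokenizer is the only locally built module the scripts need.
for name in ("rust_tokenizer",):
    status = "ok" if module_available(name) else "missing (run maturin develop)"
    print(f"{name}: {status}")
```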

In [7]:
!torchrun --standalone --nproc_per_node=1 -m scripts.my_base_train -- \
    --depth=4 \
    --max_seq_len=128 \
    --device_batch_size=1 \
    --num_iterations=10 \
    --total_batch_size=128 \
    --eval_every=100 \
    --eval_tokens=1280 \
    --core_metric_every=0

overriding depth = 4
overriding max_seq_len = 128
overriding device_batch_size = 1
overriding num_iterations = 10
overriding total_batch_size = 128
overriding eval_every = 100
overriding eval_tokens = 1280
overriding core_metric_every = 0
user_config: {'run': 'dummy', 'device_type': '', 'depth': 4, 'max_seq_len': 128, 'num_iterations': 10, 'device_batch_size': 1, 'total_batch_size': 128, 'embedding_lr': 0.2, 'unembedding_lr': 0.004, 'weight_decay': 0.0, 'matrix_lr': 0.02, 'grad_clip': 1.0, 'warmup_ratio': 0.0, 'warmdown_ratio': 0.2, 'final_lr_frac': 0.0, 'eval_every': 100, 'eval_tokens': 1280, 'core_metric_every': 0, 'core_metric_max_per_task': 500, 'sample_every': 2000, 'model_tag': ''}
Autodetected device type: cuda
  _C._set_float32_matmul_precision(precision)
Vocab size: 65,537
num_layers: 4
model_dim: 256
num_heads: 2
num_kv_heads: 2
Tokens / micro-batch / rank: 1 x 128 = 128
Tokens / micro-batch: 128
Total batch size 128 => gradient accumulation steps: 1
GPT(
  (transformer): Mod

ok, seems good. What happens if I tell it to use 2 GPUs?

In [9]:
!torchrun --standalone --nproc_per_node=2 -m scripts.my_base_train -- \
    --depth=4 \
    --max_seq_len=128 \
    --device_batch_size=1 \
    --num_iterations=10 \
    --total_batch_size=128 \
    --eval_every=100 \
    --eval_tokens=1280 \
    --core_metric_every=0

W1114 20:14:10.344000 2736 torch/distributed/run.py:803] 
W1114 20:14:10.344000 2736 torch/distributed/run.py:803] *****************************************
W1114 20:14:10.344000 2736 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1114 20:14:10.344000 2736 torch/distributed/run.py:803] *****************************************
Autodetected device type: cuda
  _C._set_float32_matmul_precision(precision)
overriding depth = 4
overriding max_seq_len = 128
overriding device_batch_size = 1
overriding num_iterations = 10
overriding total_batch_size = 128
overriding eval_every = 100
overriding eval_tokens = 1280
overriding core_metric_every = 0
user_config: {'run': 'dummy', 'device_type': '', 'depth': 4, 'max_seq_len': 128, 'num_iterations': 10, 'device_batch_size': 1, 'total_batch_size': 128, 'embeddin

^ failed, as expected, not sure if in the expected way

### Trying with 2 GPUs

- Create a new 2 (low-powered) GPU machine in paperspace

- Follow the instructions in `challenge-14-baby-pretrain-on-gpu/getting-ready.ipynb` to set it up.

Chose 2xRTX4000 (single GPU machine was also RTX4000)

I followed those instructions and now I'm on the new machine.

In [1]:
!nvidia-smi

Fri Nov 14 20:55:19 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Quadro RTX 4000                Off |   00000000:00:05.0 Off |                  N/A |
| 30%   31C    P8              2W /  125W |       1MiB /   8192MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Quadro RTX 4000                Off |   00

In [2]:
import os
os.environ["PYTHONPATH"] = "../my_nanochat"

In [3]:
!python -m scripts.my_tok_train

max_chars: 10,000,000,000
doc_cap: 10,000
vocab_size: 65,536
starting to train tokenizer
buffers filled: 1
buffers filled: 2
buffers filled: 3
buffers filled: 4
buffers filled: 5
buffers filled: 6
buffers filled: 7
buffers filled: 8
buffers filled: 9
buffers filled: 10
buffers filled: 11
buffers filled: 12
buffers filled: 13
buffers filled: 14
buffers filled: 15
buffers filled: 16
buffers filled: 17
buffers filled: 18
buffers filled: 19
buffers filled: 20
buffers filled: 21
buffers filled: 22
buffers filled: 23
buffers filled: 24
buffers filled: 25
buffers filled: 26
buffers filled: 27
buffers filled: 28
buffers filled: 29
buffers filled: 30
buffers filled: 31
buffers filled: 32
buffers filled: 33
buffers filled: 34
buffers filled: 35
buffers filled: 36
buffers filled: 37
buffers filled: 38
buffers filled: 39
buffers filled: 40
buffers filled: 41
buffers filled: 42
buffers filled: 43
buffers filled: 44
buffers filled: 45
buffers filled: 46
buffers filled: 47
buffers filled: 48
buffers 

In [4]:
!torchrun --standalone --nproc_per_node=2 -m scripts.my_base_train -- \
    --depth=4 \
    --max_seq_len=128 \
    --device_batch_size=1 \
    --num_iterations=10 \
    --total_batch_size=128 \
    --eval_every=100 \
    --eval_tokens=1280 \
    --core_metric_every=0

W1114 21:36:14.831000 37098 torch/distributed/run.py:803] 
W1114 21:36:14.831000 37098 torch/distributed/run.py:803] *****************************************
W1114 21:36:14.831000 37098 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1114 21:36:14.831000 37098 torch/distributed/run.py:803] *****************************************
overriding depth = 4
overriding max_seq_len = 128
overriding device_batch_size = 1
overriding num_iterations = 10
overriding total_batch_size = 128
overriding eval_every = 100
overriding eval_tokens = 1280
overriding core_metric_every = 0
user_config: {'run': 'dummy', 'device_type': '', 'depth': 4, 'max_seq_len': 128, 'num_iterations': 10, 'device_batch_size': 1, 'total_batch_size': 128, 'embedding_lr': 0.2, 'unembedding_lr': 0.004, 'weight_decay': 0.0, 'matrix_lr': 0.

Hit that assert I put in because I left out the code for saving optimizers, which requires something special with ranks...go add it.

But that also means it got all the way to saving...I guess that's good, but how do I know it was doing the right thing?

In [6]:
!torchrun --standalone --nproc_per_node=2 -m scripts.my_base_train -- \
    --depth=4 \
    --max_seq_len=128 \
    --device_batch_size=1 \
    --num_iterations=10 \
    --total_batch_size=128 \
    --eval_every=100 \
    --eval_tokens=1280 \
    --core_metric_every=0

W1114 21:48:04.863000 37714 torch/distributed/run.py:803] 
W1114 21:48:04.863000 37714 torch/distributed/run.py:803] *****************************************
W1114 21:48:04.863000 37714 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1114 21:48:04.863000 37714 torch/distributed/run.py:803] *****************************************
Autodetected device type: cuda
  _C._set_float32_matmul_precision(precision)
overriding depth = 4
overriding max_seq_len = 128
overriding device_batch_size = 1
overriding num_iterations = 10
overriding total_batch_size = 128
overriding eval_every = 100
overriding eval_tokens = 1280
overriding core_metric_every = 0
user_config: {'run': 'dummy', 'device_type': '', 'depth': 4, 'max_seq_len': 128, 'num_iterations': 10, 'device_batch_size': 1, 'total_batch_size': 128, 'embe

Added another print statement that prints in all ranks (not only the master process). I want to make sure I see both ranks.

In [7]:
!torchrun --standalone --nproc_per_node=2 -m scripts.my_base_train -- \
    --depth=4 \
    --max_seq_len=128 \
    --device_batch_size=1 \
    --num_iterations=2 \
    --total_batch_size=128 \
    --eval_every=100 \
    --eval_tokens=1280 \
    --core_metric_every=0

W1114 21:53:47.112000 38130 torch/distributed/run.py:803] 
W1114 21:53:47.112000 38130 torch/distributed/run.py:803] *****************************************
W1114 21:53:47.112000 38130 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1114 21:53:47.112000 38130 torch/distributed/run.py:803] *****************************************
Autodetected device type: cuda
  _C._set_float32_matmul_precision(precision)
This process is ddp_rank: 0, ddp_local_rank: 0, ddp_world_size: 1
overriding depth = 4
overriding max_seq_len = 128
overriding device_batch_size = 1
overriding num_iterations = 2
overriding total_batch_size = 128
overriding eval_every = 100
overriding eval_tokens = 1280
overriding core_metric_every = 0
user_config: {'run': 'dummy', 'device_type': '', 'depth': 4, 'max_seq_len': 128, 'num_iterat

Seeing `This process is ddp_rank: 0, ddp_local_rank: 0, ddp_world_size: 1`, so must be missing something.

Oh yeah, forgot to update this in `my_common.py`:
```
# return ddp, ddp_rank, ddp_local_rank, ddp_world_size
def get_dist_info():
    # for now
    return False, 0, 0, 1
```
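A minimal sketch of what a non-stub version could look like, assuming the script is launched by `torchrun`, which sets the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables for each worker. The `env` parameter is hypothetical and only there so the function can be tried without launching `torchrun`; the real nanochat code may do this differently.

```python
import os


def get_dist_info(env=None):
    """Return (ddp, ddp_rank, ddp_local_rank, ddp_world_size)."""
    env = os.environ if env is None else env
    required = ("RANK", "LOCAL_RANK", "WORLD_SIZE")
    if all(name in env for name in required):
        return True, int(env["RANK"]), int(env["LOCAL_RANK"]), int(env["WORLD_SIZE"])
    # Plain `python -m ...` run: behave like a single process.
    return False, 0, 0, 1


single = get_dist_info(env={})
second_worker = get_dist_info(env={"RANK": "1", "LOCAL_RANK": "1", "WORLD_SIZE": "2"})
print(single)
print(second_worker)
```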

In [12]:
!torchrun --standalone --nproc_per_node=2 -m scripts.my_base_train -- \
    --depth=4 \
    --max_seq_len=128 \
    --device_batch_size=1 \
    --num_iterations=2 \
    --total_batch_size=256 \
    --eval_every=100 \
    --eval_tokens=1280 \
    --core_metric_every=0

W1114 22:05:18.973000 38726 torch/distributed/run.py:803] 
W1114 22:05:18.973000 38726 torch/distributed/run.py:803] *****************************************
W1114 22:05:18.973000 38726 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1114 22:05:18.973000 38726 torch/distributed/run.py:803] *****************************************
overriding depth = 4
overriding max_seq_len = 128
overriding device_batch_size = 1
overriding num_iterations = 2
overriding total_batch_size = 256
overriding eval_every = 100
overriding eval_tokens = 1280
overriding core_metric_every = 0
user_config: {'run': 'dummy', 'device_type': '', 'depth': 4, 'max_seq_len': 128, 'num_iterations': 2, 'device_batch_size': 1, 'total_batch_size': 256, 'embedding_lr': 0.2, 'unembedding_lr': 0.004, 'weight_decay': 0.0, 'matrix_lr': 0.02

ok, now failing due to `DistAdamW = None # for now so it will fail until I "copy" adamw.py`

time to look at `adamw.py`

I'm going to copy and paste it and go back and look later.

In [14]:
!torchrun --standalone --nproc_per_node=2 -m scripts.my_base_train -- \
    --depth=4 \
    --max_seq_len=128 \
    --device_batch_size=1 \
    --num_iterations=2 \
    --total_batch_size=256 \
    --eval_every=100 \
    --eval_tokens=1280 \
    --core_metric_every=0

W1114 22:30:46.791000 39515 torch/distributed/run.py:803] 
W1114 22:30:46.791000 39515 torch/distributed/run.py:803] *****************************************
W1114 22:30:46.791000 39515 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1114 22:30:46.791000 39515 torch/distributed/run.py:803] *****************************************
overriding depth = 4
overriding max_seq_len = 128
overriding device_batch_size = 1
overriding num_iterations = 2
overriding total_batch_size = 256
overriding eval_every = 100
overriding eval_tokens = 1280
overriding core_metric_every = 0
user_config: {'run': 'dummy', 'device_type': '', 'depth': 4, 'max_seq_len': 128, 'num_iterations': 2, 'device_batch_size': 1, 'total_batch_size': 256, 'embedding_lr': 0.2, 'unembedding_lr': 0.004, 'weight_decay': 0.0, 'matrix_lr': 0.02

Add temp debug printing in `adamw.py` to see shapes of output and input

In [17]:
!torchrun --standalone --nproc_per_node=2 -m scripts.my_base_train -- \
    --depth=4 \
    --max_seq_len=128 \
    --device_batch_size=1 \
    --num_iterations=2 \
    --total_batch_size=256 \
    --eval_every=100 \
    --eval_tokens=1280 \
    --core_metric_every=0

W1114 22:41:54.509000 40101 torch/distributed/run.py:803] 
W1114 22:41:54.509000 40101 torch/distributed/run.py:803] *****************************************
W1114 22:41:54.509000 40101 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1114 22:41:54.509000 40101 torch/distributed/run.py:803] *****************************************
overriding depth = 4
overriding max_seq_len = 128
overriding device_batch_size = 1
overriding num_iterations = 2
overriding total_batch_size = 256
overriding eval_every = 100
overriding eval_tokens = 1280
overriding core_metric_every = 0
user_config: {'run': 'dummy', 'device_type': '', 'depth': 4, 'max_seq_len': 128, 'num_iterations': 2, 'device_batch_size': 1, 'total_batch_size': 256, 'embedding_lr': 0.2, 'unembedding_lr': 0.004, 'weight_decay': 0.0, 'matrix_lr': 0.02

`Just before calling reduce_scatter_tensor, grad_slice (output) shape is torch.Size([32768, 256]) and grad (input) shape is torch.Size([65537, 256])` and `reduce_scatter_tensor()` doc says: "input (Tensor): Input tensor to be reduced and scattered. Its size should be output tensor size times the world size."
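The mismatch can be checked with plain arithmetic. This sketch assumes the output slice is sized with floor division, which is consistent with the shapes printed above but is not confirmed from the `adamw.py` source here.

```python
vocab_size = 65537   # rows in the gradient (input), from the debug print
world_size = 2

rows_per_rank = vocab_size // world_size            # rows in grad_slice (output)
expected_input_rows = rows_per_rank * world_size    # what reduce_scatter_tensor requires
leftover_rows = vocab_size % world_size

print(rows_per_rank, expected_input_rows, leftover_rows)
```

The input would need 65536 rows but has 65537, so one row is left over and the size check fails.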

In [19]:
import sys
sys.path.append('../my_nanochat')
from my_nanochat.my_tokenizer import get_tokenizer
get_tokenizer().get_vocab_size()

65537

Prob do need to understand adamw but just based on those numbers, does it somehow split up the lm_head params and if vocab_size % world_size is not 0 it's a prob? Is there something in the actual tokenizer training that forces vocab_size to be a multiple of 8 or something like that? Don't immediately see anything like that. I have 65536 + BOS = 65537. The real one has 9 special tokens including BOS so that's still going to be odd number. Hmm.

Let me google and chatgpt "ValueError: input tensor must be the same size as output size times world size"

Not immediately seeing an answer, but it looks like understanding scatter and related is important to understand DDP

Think I should take a step back and think about the general idea.

I believe each GPU has its own copy of the model, but each copy needs to have the same weights for all parameters. During training, when we do a forward pass, we divide our data among the GPUs. For example, if each GPU can handle a batch of 10 sequences, we have GPU #1 forward 10 sequences and GPU #2 forward another 10 sequences. We calculate loss, which will be different for the two batches because we have different data, and we calculate gradients for all the weights, which will also be different.

So far each GPU is operating independently. But now we can't update weights independently. If we do that we'll end up with different weights on each GPU and we're not getting the benefit of combined training on a larger amount of data; it would just be like training two different models.

Let's forget about Adam for a second and assume our optimizer is just going to do weight - LR * gradient. If we have some type of sync point to wait until both GPUs calculate gradients, we can then take the average of the two gradients and adjust all weights by it. Like weight - LR/2 * (gradient_from_gpu_1 + gradient_from_gpu_2). BUT a) where do we sum/average the two gradients and b) do we then subtract in each GPU or, to be "safe", do we calculate the new weights on one GPU and transmit them to the other?
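To make the averaging idea concrete, here is a single-process sketch in plain Python. The two "GPUs" are just lists, `sgd_step` is a hypothetical helper, and the numbers are made up.

```python
def sgd_step(weights, grads_per_gpu, lr):
    """Plain SGD using the gradient averaged across all replicas."""
    n = len(grads_per_gpu)
    avg_grad = [sum(g[i] for g in grads_per_gpu) / n for i in range(len(weights))]
    return [w - lr * g for w, g in zip(weights, avg_grad)]


weights = [10.0, 20.0, 30.0, 40.0]
grads = [
    [4.0, 3.0, 1.0, 2.0],  # gradient computed on the first GPU's batch
    [6.0, 9.0, 5.0, 6.0],  # gradient computed on the second GPU's batch
]

# If every replica applies the same averaged gradient, the copies stay identical.
replica_0 = sgd_step(weights, grads, lr=1.0)
replica_1 = sgd_step(weights, grads, lr=1.0)
print(replica_0, replica_1)
```

The sketch skips the hard part, which is how each replica gets to see the other's gradient in the first place.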

I don't see how we could sum the two gradients other than by first copying say the gradient from GPU 2 to GPU 1. GPU 1 could then sum, subtract, and send the new weights back to GPU 2. I imagine the torch dist has very efficient ways to move memory from one GPU, perhaps if both GPUs are on the same node by going through RAM, or even more sophisticated stuff, but it will still be much slower than directly working with GPU memory.

I guess another strategy would be to place some layers of the network on one GPU and some on the other. Then it wouldn't be necessary to move all the gradients around, but interim tensors during the forward would need to be moved from one to the other, and something similar in reverse during back prop. But nothing in the code I've worked on so far makes me think this could be happening.

So back to moving the gradients and weights at the end of each step...if true, it starts to make sense that responsibility falls on the optimizers. At least for weight - LR * grad, without Adam or anything fancy, what reason could there be to deal in gradients with a dimension of original_dimension / world_size? I doubt this is it, but a way to share calculations would be say for GPU 1 to send half of the gradient to GPU 2, GPU 2 adds it to the gradient it has, updates weights, and then sends back, and the same thing happens in reverse for the other half of the gradient. Believe in that situation the total memory transmitted is the same but the calculations get distributed evenly.

Ooh, what if we have more than two GPUs? Maybe it really does start to make sense for each GPU to "own" a portion of the gradient. But doesn't it also get very expensive to send so much information between so many GPUs?

Maybe I should guess what "reduce and scatter" means and then look it up. I know it takes something like this...

reduce_scatter_tensor(output, input-of-size-world-size-times-output-size, op=reduce-op-avg)

so like

`reduce_scatter_tensor(output=[ , ], input=[1,2,3,4], op=reduce-op-avg)`

reduce could mean take the average of each world size group, like in this case 1,2 and 3,4, but what does that have to do with scatter and how is it distributed? OR could it mean this? Say it's called on two GPUs as follows:

```
GPU 1: reduce_scatter_tensor(output=[ , ], input=[1,2,3,4], op=reduce-op-avg)

GPU 2: reduce_scatter_tensor(output=[ , ], input=[5,6,7,8], op=reduce-op-avg)

The output on GPU 1 gets [(5+1)/2, (2+6)/2] and the output on GPU 2 gets [(3+7)/2, (4+8)/2]
```

In other words, the output on GPU 1 is determined by the first half of the inputs across both GPUs by taking the average. And the output on GPU 2 is determined by the second half of the inputs across both GPUs.

This is very similar to what I was imagining above. Following the completion of something like that, each GPU will contain 1/world_size worth of the overall gradient.

Feels right, but let me see what the doc says. Yes, I think my understanding is right. There is a similar example in the [doc](https://docs.pytorch.org/docs/stable/distributed.html)
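The same guess can be written as a single-process simulation. `simulated_reduce_scatter_avg` is a hypothetical stand-in that mimics the behavior described above; it does not call `torch.distributed`.

```python
def simulated_reduce_scatter_avg(inputs_per_rank):
    """inputs_per_rank[r] is the full input tensor on rank r.

    Returns outputs_per_rank, where rank r receives the element-wise average
    (across ranks) of the r-th chunk of the inputs.
    """
    world_size = len(inputs_per_rank)
    n = len(inputs_per_rank[0])
    assert n % world_size == 0, "input size must be output size times world size"
    chunk = n // world_size
    outputs = []
    for rank in range(world_size):
        start = rank * chunk
        outputs.append([
            sum(inp[i] for inp in inputs_per_rank) / world_size
            for i in range(start, start + chunk)
        ])
    return outputs


outputs = simulated_reduce_scatter_avg([[1, 2, 3, 4], [5, 6, 7, 8]])
print(outputs)
```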

ok, to keep thinking about how this could work in a tiny example...

```
at start of step weights_on_gpu_0 and weights_on_gpu_1 are both say [10,20,30,40]

after back prop, say gradient_on_gpu_0 is [4,3,1,2] and gradient_on_gpu_1 is [6,9,5,6]

grad = [0,0]
gpu 0: reduce_scatter_tensor(grad_on_gpu_0, [4,3,1,2], avg)
gpu 1: reduce_scatter_tensor(grad_on_gpu_1, [6,9,5,6], avg)

now grad_on_gpu_0 is [5, 6]
and grad_on_gpu_1 is [3, 4]

say LR = 1
on both gpus do weights[rank*chunk:(rank+1)*chunk] = weights[rank*chunk:(rank+1)*chunk] - grad * LR, with chunk = len(weights) / world_size = 2

now weights_on_gpu_0 is [5, 14, 30, 40]
and weights_on_gpu_1 is [10, 20, 27, 36]
the first two weights on GPU 0 are correct for the world, and the second two weights on GPU 1 are correct for the world

(beginning to see why it's called world)

now we need some way to "scatter" the correct parts out...say a function like:

my_scatter(tensor, slice): copy this slice of this tensor to the same slice of the tensors on the other GPUs

then both GPUs could do: my_scatter(weights, rank*chunk:(rank+1)*chunk), with chunk = len(weights) / world_size
gpu 0: my_scatter([5, 14, 30, 40], (0:2))
gpu 1: my_scatter([10, 20, 27, 36], (2:4))

now weights_on_gpu_0 is [5, 14, 27, 36] and weights_on_gpu_1 is [5, 14, 27, 36]

we're done and ready for the next step

Let's see if torch.distributed has a function like that.

```

(There's tons of interesting stuff in the [doc](https://docs.pytorch.org/docs/stable/distributed.html) including around the different ways communication between GPUs can happen. Come back to all that later.)

There is a scatter function but it doesn't work like I imagined. It lets one rank send out a list of tensors, and one rank will receive the first, the next the second, etc. So like gpu 0 could do scatter(output, [[1,2],[3,4]]) and output on gpu 0 will be [1,2] and output on gpu 1 will be [3,4]. Anyway, there are lots of functions in torch.distributed and lots of ways to get the weights distributed around.

Look at adamw now.

Just thinking...we need to move the gradients around, and we need to either move their average or updated weights around, but we don't need to move the moving averages around. Each GPU can be responsible for maintaining m and v for just their portion of the gradient. (Or one GPU could be responsible for the whole thing, but either way, we never need to move around M and V or keep multiple copies of it.)
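A sketch of that idea: one rank keeps Adam's moving averages only for the slice it owns and updates only that slice. This is textbook Adam on a shard in plain Python, with hypothetical names and hyperparameters, not the actual `DistAdamW` code.

```python
import math


def adam_update_shard(weight_shard, grad_shard, m, v, step,
                      lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step on the slice of weights this rank owns.

    m and v have the same length as the shard, not the full parameter.
    """
    new_w, new_m, new_v = [], [], []
    for w, g, m_i, v_i in zip(weight_shard, grad_shard, m, v):
        m_i = beta1 * m_i + (1 - beta1) * g          # first moment
        v_i = beta2 * v_i + (1 - beta2) * g * g      # second moment
        m_hat = m_i / (1 - beta1 ** step)            # bias correction
        v_hat = v_i / (1 - beta2 ** step)
        new_w.append(w - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_w, new_m, new_v


# One rank owns two of four weights; grad_shard is its slice of the averaged gradient.
weight_shard = [10.0, 20.0]
grad_shard = [5.0, 6.0]
m = [0.0, 0.0]
v = [0.0, 0.0]
new_w, m, v = adam_update_shard(weight_shard, grad_shard, m, v, step=1)
print(new_w, m, v)
```

With world_size ranks, each rank stores only 1/world_size of m and v, and those never need to be sent anywhere.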

Looks like `dist.all_gather_into_tensor` is the function that distributes out the weights (parameters).

Let's see what that does. Yes, this is similar to what I was imagining but the args work differently. Updating my example:

```
then both GPUs could do: all_gather_into_tensor(weights, weights[rank*chunk:(rank+1)*chunk]), with chunk = len(weights) / world_size
gpu 0: all_gather_into_tensor(weights, [5, 14])
gpu 1: all_gather_into_tensor(weights, [27, 36])

now weights on both gpus are [5, 14, 27, 36]
```
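Putting the pieces together, one full step of the toy example can be simulated in plain Python. This is a single-process sketch: `reduce_scatter_avg` and `all_gather` are hypothetical stand-ins that mimic what the `torch.distributed` collectives do across ranks, not calls into the library.

```python
def reduce_scatter_avg(inputs_per_rank):
    """Rank r receives the cross-rank average of the r-th chunk of the inputs."""
    world_size = len(inputs_per_rank)
    chunk = len(inputs_per_rank[0]) // world_size
    return [
        [sum(inp[i] for inp in inputs_per_rank) / world_size
         for i in range(rank * chunk, (rank + 1) * chunk)]
        for rank in range(world_size)
    ]


def all_gather(shards_per_rank):
    """Every rank ends up with all shards concatenated in rank order."""
    full = [x for shard in shards_per_rank for x in shard]
    return [list(full) for _ in shards_per_rank]


world_size = 2
lr = 1.0
weights = [[10.0, 20.0, 30.0, 40.0] for _ in range(world_size)]  # same on every rank
grads = [[4.0, 3.0, 1.0, 2.0], [6.0, 9.0, 5.0, 6.0]]             # differ per rank

# 1) Each rank gets the averaged gradient for the slice it owns.
grad_shards = reduce_scatter_avg(grads)

# 2) Each rank updates only its own slice of the weights.
chunk = len(weights[0]) // world_size
updated_shards = []
for rank in range(world_size):
    own = weights[rank][rank * chunk:(rank + 1) * chunk]
    updated_shards.append([w - lr * g for w, g in zip(own, grad_shards[rank])])

# 3) The updated slices are gathered so every rank has the full weights again.
weights = all_gather(updated_shards)
print(grad_shards)
print(weights)
```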

ok, now back to the problem, this approach of using reduce_scatter_tensor and all_gather_into_tensor seems clean and elegant and removes a lot of bookkeeping, BUT how is it supposed to work when grad size is not divisible by world size?

```
Say our example weights above was not [10,20,30,40] but [10,20,30,40,50]

And after back prop, say gradient_on_gpu_0 is [4,3,1,2,8] and gradient_on_gpu_1 is [6,9,5,6,2]

We could pad the gradients to get to the next multiple of world_size, so:
gradient_on_gpu_0 becomes [4,3,1,2,8,0]
gradient_on_gpu_1 becomes [6,9,5,6,2,0]

then proceed as before

grad = [0,0,0]
gpu 0: reduce_scatter_tensor(grad_on_gpu_0, [4,3,1,2,8,0], avg)
gpu 1: reduce_scatter_tensor(grad_on_gpu_1, [6,9,5,6,2,0], avg)

now grad_on_gpu_0 is [5, 6, 3]
and grad_on_gpu_1 is [4, 5, 0]

then on the highest rank, we slice off the padding, etc.
```
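The padding idea can be sketched the same way. `pad_to_multiple` and `reduce_scatter_avg` are hypothetical single-process helpers, and this is only one possible way to handle the leftover element, not something taken from `adamw.py` or `muon.py`.

```python
def pad_to_multiple(values, multiple, fill=0.0):
    """Append `fill` until len(values) is a multiple of `multiple`."""
    remainder = len(values) % multiple
    if remainder == 0:
        return list(values)
    return list(values) + [fill] * (multiple - remainder)


def reduce_scatter_avg(inputs_per_rank):
    """Rank r receives the cross-rank average of the r-th chunk of the inputs."""
    world_size = len(inputs_per_rank)
    chunk = len(inputs_per_rank[0]) // world_size
    return [
        [sum(inp[i] for inp in inputs_per_rank) / world_size
         for i in range(rank * chunk, (rank + 1) * chunk)]
        for rank in range(world_size)
    ]


world_size = 2
grads = [[4.0, 3.0, 1.0, 2.0, 8.0], [6.0, 9.0, 5.0, 6.0, 2.0]]
original_len = len(grads[0])

padded = [pad_to_multiple(g, world_size) for g in grads]
grad_shards = reduce_scatter_avg(padded)

# Stitch the shards back together and drop the padding at the end.
full_avg_grad = [x for shard in grad_shards for x in shard][:original_len]
print(grad_shards)
print(full_avg_grad)
```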

However, I don't see anything in adamw.py about padding.

Although there is padding in DistMuon in muon.py.

How could this be working? If he had 8 special tokens it would work by luck...(65536 + 8) % 8 == 0, but he has 9.

BTW, maybe the reason we init some weights to 0 is so they'll start the same on all GPUs? But why do we not do that for all weights? Come back to that.

Interesting that his default vocab_size in GPTConfig, 50304, is also divisible by 8.

But still, in base_train.py he uses vocab_size in model_config and vocab_size is tokenizer.get_vocab_size()

FOR NOW, to move on, I'm going to train my tokenizer with 65535 so I'll end up with 65536.

In [1]:
import sys
sys.path.append('../my_nanochat')
from my_nanochat.my_common import get_base_dir
base_dir = get_base_dir()

In [3]:
!ls {base_dir}

base_checkpoints     shard_00009.parquet  shard_00020.parquet
my-tokenizer.pkl     shard_00010.parquet  shard_00021.parquet
shard_00000.parquet  shard_00011.parquet  shard_00022.parquet
shard_00001.parquet  shard_00012.parquet  shard_00023.parquet
shard_00002.parquet  shard_00013.parquet  shard_00024.parquet
shard_00003.parquet  shard_00014.parquet  shard_00025.parquet
shard_00004.parquet  shard_00015.parquet  shard_00026.parquet
shard_00005.parquet  shard_00016.parquet  shard_00027.parquet
shard_00006.parquet  shard_00017.parquet  shard_00028.parquet
shard_00007.parquet  shard_00018.parquet  shard_00029.parquet
shard_00008.parquet  shard_00019.parquet  token_bytes.pt


In [4]:
!mv {base_dir}/my-tokenizer.pkl {base_dir}/my-tokenizer-65537.pkl

In [6]:
!mv {base_dir}/token_bytes.pt {base_dir}/token_bytes-65537.pt

In [7]:
!ls {base_dir}

base_checkpoints	shard_00009.parquet  shard_00020.parquet
my-tokenizer-65537.pkl	shard_00010.parquet  shard_00021.parquet
shard_00000.parquet	shard_00011.parquet  shard_00022.parquet
shard_00001.parquet	shard_00012.parquet  shard_00023.parquet
shard_00002.parquet	shard_00013.parquet  shard_00024.parquet
shard_00003.parquet	shard_00014.parquet  shard_00025.parquet
shard_00004.parquet	shard_00015.parquet  shard_00026.parquet
shard_00005.parquet	shard_00016.parquet  shard_00027.parquet
shard_00006.parquet	shard_00017.parquet  shard_00028.parquet
shard_00007.parquet	shard_00018.parquet  shard_00029.parquet
shard_00008.parquet	shard_00019.parquet  token_bytes-65537.pt


In [2]:
import os
os.environ["PYTHONPATH"] = "../my_nanochat"

In [9]:
!python -m scripts.my_tok_train --vocab_size=65535

max_chars: 10,000,000,000
doc_cap: 10,000
vocab_size: 65,535
starting to train tokenizer
buffers filled: 1
buffers filled: 2
buffers filled: 3
buffers filled: 4
buffers filled: 5
buffers filled: 6
buffers filled: 7
buffers filled: 8
buffers filled: 9
buffers filled: 10
buffers filled: 11
^C


Ahh!!! Just as I started, I realized that maybe he subtracts the special tokens from the desired vocab size. That makes so much more sense: you get the vocab size you ask for and don't end up with a weird dimension for the lm head. And yes, he does that in `tokenizer.py` and I never copied that part. Let me go and fix that and then train.
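A minimal sketch of the arithmetic behind that fix, not the actual code in `my_tokenizer.py`. The helper name `plan_vocab` is hypothetical. It assumes a byte-level base vocabulary of 256 entries and a single special token, `<|bos|>`, which is what the file name `my-tokenizer-65537.pkl` (65,536 + 1) suggests.

```python
# Assumption: one special token; the real list lives in my_tokenizer.py.
SPECIAL_TOKENS = ["<|bos|>"]


def plan_vocab(vocab_size, special_tokens=SPECIAL_TOKENS, num_base_bytes=256):
    """Split a requested vocab size into base bytes, BPE merges and special tokens.

    Returns (num_merges, special_ids) so the final vocabulary has exactly
    `vocab_size` entries, special tokens included.
    """
    # Reserve room for the special tokens before training the BPE merges.
    vocab_size_no_special = vocab_size - len(special_tokens)
    if vocab_size_no_special < num_base_bytes:
        raise ValueError("vocab_size is too small to hold the 256 base bytes")

    # Every merge adds one token on top of the 256 raw bytes.
    num_merges = vocab_size_no_special - num_base_bytes

    # Special tokens take the ids right after the last merged token.
    special_ids = {
        tok: vocab_size_no_special + i for i, tok in enumerate(special_tokens)
    }
    return num_merges, special_ids


num_merges, special_ids = plan_vocab(500)
print(num_merges, special_ids)  # 243 {'<|bos|>': 499}
```

With these assumptions, asking for 65,536 gives 65,279 merges and puts `<|bos|>` at id 65,535, so the lm head ends up with exactly 65,536 rows.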

ok, fixed, do a quick check

In [3]:
!python -m scripts.my_tok_train --max_chars=100000 --vocab_size=500

max_chars: 100,000
doc_cap: 10,000
vocab_size: 500
starting to train tokenizer
buffers filled: 1
finished training tokenizer
Saved tokenizer to /home/paperspace/.cache/my_nanochat/my-tokenizer.pkl
Saved token_bytes to /home/paperspace/.cache/my_nanochat/token_bytes.pt


In [4]:
from my_nanochat.my_tokenizer import get_tokenizer
tokenizer = get_tokenizer()

In [5]:
tokenizer.get_vocab_size()

500

In [6]:
tokenizer.get_bos_token_id()

499

In [7]:
tokenizer.decode([499, 65])

'<|bos|>A'

Now train the tokenizer for real

In [12]:
!python -m scripts.my_tok_train

max_chars: 10,000,000,000
doc_cap: 10,000
vocab_size: 65,536
starting to train tokenizer
buffers filled: 1
buffers filled: 2
buffers filled: 3
buffers filled: 4
buffers filled: 5
buffers filled: 6
buffers filled: 7
buffers filled: 8
buffers filled: 9
buffers filled: 10
buffers filled: 11
buffers filled: 12
buffers filled: 13
buffers filled: 14
buffers filled: 15
buffers filled: 16
buffers filled: 17
buffers filled: 18
buffers filled: 19
buffers filled: 20
buffers filled: 21
buffers filled: 22
buffers filled: 23
buffers filled: 24
buffers filled: 25
buffers filled: 26
buffers filled: 27
buffers filled: 28
buffers filled: 29
buffers filled: 30
buffers filled: 31
buffers filled: 32
buffers filled: 33
buffers filled: 34
buffers filled: 35
buffers filled: 36
buffers filled: 37
buffers filled: 38
buffers filled: 39
buffers filled: 40
buffers filled: 41
buffers filled: 42
buffers filled: 43
buffers filled: 44
buffers filled: 45
buffers filled: 46
buffers filled: 47
buffers filled: 48
buffers 

In [13]:
tokenizer = get_tokenizer()
tokenizer.get_vocab_size()

65536

ok, now back to trying to train a model with 2 GPUs!

In [14]:
!nvidia-smi

Sat Nov 15 13:40:15 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Quadro RTX 4000                Off |   00000000:00:05.0 Off |                  N/A |
| 30%   34C    P8              5W /  125W |       1MiB /   8192MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Quadro RTX 4000                Off |   00

In [15]:
!torchrun --standalone --nproc_per_node=2 -m scripts.my_base_train -- \
    --depth=4 \
    --max_seq_len=128 \
    --device_batch_size=1 \
    --num_iterations=2 \
    --total_batch_size=256 \
    --eval_every=100 \
    --eval_tokens=1280 \
    --core_metric_every=0

W1115 13:40:23.906000 81480 torch/distributed/run.py:803] 
W1115 13:40:23.906000 81480 torch/distributed/run.py:803] *****************************************
W1115 13:40:23.906000 81480 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1115 13:40:23.906000 81480 torch/distributed/run.py:803] *****************************************
Autodetected device type: cuda
overriding depth = 4
overriding max_seq_len = 128
overriding device_batch_size = 1
overriding num_iterations = 2
overriding total_batch_size = 256
overriding eval_every = 100
overriding eval_tokens = 1280
overriding core_metric_every = 0
user_config: {'run': 'dummy', 'device_type': '', 'depth': 4, 'max_seq_len': 128, 'num_iterations': 2, 'device_batch_size': 1, 'total_batch_size': 256, 'embedding_lr': 0.2, 'unembedding_lr': 0.004, 'weight

it completed!
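The flags in that `torchrun` command were picked so the batch sizes divide evenly across two processes. Below is a rough sketch of that arithmetic. It assumes `total_batch_size` is counted in tokens across all ranks, as I understand the upstream training script to do. The function name `grad_accum_steps` is my own for illustration.

```python
def grad_accum_steps(total_batch_size, device_batch_size, max_seq_len, world_size):
    """Number of forward/backward passes each rank runs per optimizer step."""
    # Tokens one rank processes in a single forward/backward pass.
    tokens_per_rank = device_batch_size * max_seq_len
    # Tokens all ranks process together in one pass.
    tokens_per_pass = tokens_per_rank * world_size
    if total_batch_size % tokens_per_pass != 0:
        raise ValueError(
            f"total_batch_size={total_batch_size} is not a multiple of "
            f"{tokens_per_pass} tokens per pass"
        )
    return total_batch_size // tokens_per_pass


# The run above: 1 sequence of 128 tokens per GPU, 2 GPUs, 256 tokens per step.
print(grad_accum_steps(256, 1, 128, 2))  # 1
# The same flags on a single GPU would need two accumulation passes.
print(grad_accum_steps(256, 1, 128, 1))  # 2
```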

Code added / updated as part of this challenge so far:

- Added `my_tok_train.py`

- Added or fixed the DDP-related code that I had purposely left out earlier in `my_checkpoint_manager.py`, `my_common.py` and `my_gpt.py`

- Directly copied the entire `adamw.py`

- Fixed `my_tokenizer.py` so that we end up with the desired vocab size (not vocab size + number of special tokens)
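For context on the DDP pieces in that list: `torchrun` starts one process per GPU and sets the `RANK`, `LOCAL_RANK` and `WORLD_SIZE` environment variables for each of them. Here is a hedged sketch of how a script could detect that. The helper `get_dist_info` and its return shape are illustrative and not necessarily what `my_common.py` contains.

```python
import os


def get_dist_info(env=None):
    """Return (is_ddp, rank, local_rank, world_size) from torchrun-style env vars.

    Passing `env` explicitly (any dict-like) makes the helper easy to test;
    by default it reads the real process environment.
    """
    env = os.environ if env is None else env
    is_ddp = all(key in env for key in ("RANK", "LOCAL_RANK", "WORLD_SIZE"))
    if is_ddp:
        return True, int(env["RANK"]), int(env["LOCAL_RANK"]), int(env["WORLD_SIZE"])
    # Plain `python -m ...` run: behave like a single-process job.
    return False, 0, 0, 1


# Second of two processes launched with --nproc_per_node=2:
print(get_dist_info({"RANK": "1", "LOCAL_RANK": "1", "WORLD_SIZE": "2"}))
# No torchrun variables set:
print(get_dist_info({}))
```

Falling back to `(False, 0, 0, 1)` lets the same script run under plain `python` and under `torchrun` without separate code paths.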