To do this challenge, I'm going to need to enhance `my_tokenizer.py` from challenge 7. I don't want to touch that code, though, and as I keep going, it's going to get confusing to keep code only in these different challenge folders. It's time to start recreating the entire structure of `nanochat` under `my_nanochat`. So copy `my_tokenizer.py` from challenge 7 to its "permanent" home and edit it there.

In [1]:
import sys
sys.path.append('../my_nanochat')

In [2]:
from my_nanochat.my_tokenizer import MyTokenizer
from my_nanochat.my_dataset import parquets_iter_batched
from my_nanochat.my_common import get_base_dir

In [3]:
import torch
from collections import deque

### Understand code in [dataloader.py](https://github.com/karpathy/nanochat/blob/master/nanochat/dataloader.py)

In [4]:
B = 2 # batch size
T = 5 # sequence length (is that sequence length or max sequence length?)

The dataloader gets these values:

`ddp, ddp_rank, ddp_local_rank, ddp_world_size = get_dist_info()`

I'm guessing `ddp_rank` is the rank (~ GPU #) of "this" parallel process, but then I'm not sure what `ddp_local_rank` is.

`ddp_world_size`, which we came across earlier in a comment about why you want start/step arguments to `parquets_iter_batched()`, must be the total number of parallel processes = GPUs = ranks.

Not sure what `ddp` is (but peeking at `common.get_dist_info()` I see it's a boolean indicating whether we're doing DDP or not).
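
To make the start/step idea concrete, here is a minimal sketch of the striding pattern in plain Python. It assumes each rank starts at its own rank number and steps by the world size; `items_for_rank` is a hypothetical helper for illustration, not a function from `nanochat`, and the list of integers is just a stand-in for whatever units the iterator strides over.

```python
def items_for_rank(items, rank, world_size):
    """Return the items one rank would visit when starting at `rank` and stepping by `world_size`."""
    return [items[i] for i in range(rank, len(items), world_size)]

units = list(range(10))  # stand-in for the units of data being divided up (assumption)
for rank in range(4):
    # each rank sees a disjoint slice; together they cover everything once
    print(rank, items_for_rank(units, rank, world_size=4))
```

With `rank=0` and `world_size=1`, as in this notebook, the single process simply sees every item.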

Asking ChatGPT about rank vs local rank:

rank = The global ID of a process in the entire distributed system.
local rank = The ID of the process within its local machine (node).

Example:

Suppose you have 2 machines (nodes) with 4 GPUs each → 8 total processes.

* Node 0 has GPUs 0–3 → local ranks 0, 1, 2, 3.

* Node 1 has GPUs 0–3 → local ranks 0, 1, 2, 3.

Global rank 4 (which is the first process on node 1) has local rank 0.
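
The example above can be sketched as simple arithmetic. This assumes global ranks are assigned contiguously node by node and that every node has the same number of GPUs, which matches the example but is ultimately up to the launcher; `split_rank` is a hypothetical helper, not part of any library.

```python
def split_rank(global_rank, gpus_per_node):
    """Map a global rank to (node index, local rank), assuming contiguous assignment per node."""
    node = global_rank // gpus_per_node
    local_rank = global_rank % gpus_per_node
    return node, local_rank

for r in range(8):
    node, local = split_rank(r, gpus_per_node=4)
    print(f"global rank {r} -> node {node}, local rank {local}")
```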

In [5]:
# so for this notebook...
ddp, ddp_rank, ddp_local_rank, ddp_world_size = False, 0, 0, 1

In [6]:
needed_tokens = B * T + 1; needed_tokens

11

In [7]:
tokenizer = MyTokenizer.load_from_file("../challenge-08-train-tokenizer/my-tokenizer.pkl")

In [8]:
tokenizer.encode("hello world")

[52902, 882]

In [9]:
bos_token = tokenizer.get_bos_token_id(); bos_token

65536

In [10]:
token_buffer = deque()

In [11]:
# understand deque
foo = deque()
foo.extend([1,2,3])
foo.popleft(), foo

(1, deque([2, 3]))

Also realizing it's going to get odd and annoying to keep the parquet data files in the challenge 8 folder and have `my_dataset` know to look there. Put them in `~/.cache/my_nanochat` and create `my_common.py` and a `get_base_dir()` function.

In [12]:
get_base_dir()

'/Users/ericsilberstein/.cache/my_nanochat'
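
For reference, a function like this can be very small. The sketch below is one possible shape, not necessarily what is in `my_common.py`: the function name, the environment variable override, and its name `MY_NANOCHAT_BASE_DIR` are all assumptions added for illustration.

```python
import os

def get_base_dir_sketch(env_var="MY_NANOCHAT_BASE_DIR"):
    """Return the directory for cached data, creating it if needed.

    The env var override is a hypothetical convenience; without it the
    function falls back to ~/.cache/my_nanochat.
    """
    override = os.environ.get(env_var)
    if override:
        base_dir = override
    else:
        base_dir = os.path.join(os.path.expanduser("~"), ".cache", "my_nanochat")
    os.makedirs(base_dir, exist_ok=True)  # safe to call repeatedly
    return base_dir
```

Keeping the path logic in one place means `my_dataset` and the tokenizer loading code don't need to know about challenge folders.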

In [13]:
!ls {get_base_dir()}

my-tokenizer.pkl    shard_00002.parquet shard_00005.parquet shard_00008.parquet
shard_00000.parquet shard_00003.parquet shard_00006.parquet shard_00009.parquet
shard_00001.parquet shard_00004.parquet shard_00007.parquet


In [14]:
# confirm can read files
next(parquets_iter_batched("train"))[0][:30]

'Shipment & Transport-Sea, Air,'

The convoluted path of document_batch -> tokenizer_batch -> token_lists -> token_buffer -> tokens below is just to get a feel for the code in `dataloader.py`, which in actuality yields the result.

In [15]:
document_batch = next(parquets_iter_batched("train", start=ddp_rank, step=ddp_world_size))

In [16]:
tokenizer_batch_size = 2 # 128 in his code, but 2 should be more than enough to get the 11 tokens we need

In [17]:
tokenizer_batch = document_batch[:tokenizer_batch_size]

In [18]:
len(tokenizer_batch), tokenizer_batch[0][:10], tokenizer_batch[1][:10]

(2, 'Shipment &', '12. Defini')

In [19]:
# I see now that it's time to enhance my_tokenizer.encode() to accept (and return) a list
# and accept a prepend argument
tokenizer.encode(["hello world", "bye world"], prepend=bos_token)

[[65536, 52902, 882], [65536, 31563, 882]]
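
The shape of that enhancement can be sketched independently of the real tokenizer. Here `encode_with_prepend` is a hypothetical wrapper around any single-string encode function, `toy_encode` is a made-up word-level stand-in for the real BPE encoder, and `prepend` is assumed to be an integer token id.

```python
def encode_with_prepend(encode_one, text, prepend=None):
    """Encode a str or a list of str; optionally put the `prepend` id in front of each result."""
    def one(s):
        ids = list(encode_one(s))
        return [prepend] + ids if prepend is not None else ids

    if isinstance(text, str):
        return one(text)            # single string in, flat list of ids out
    return [one(s) for s in text]   # list in, list of lists out

# toy stand-in for a real tokenizer: one id per whitespace-separated word
vocab = {"hello": 1, "bye": 2, "world": 3}

def toy_encode(s):
    return [vocab[w] for w in s.split()]

print(encode_with_prepend(toy_encode, "hello world"))
print(encode_with_prepend(toy_encode, ["hello world", "bye world"], prepend=0))
```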

In [20]:
while len(token_buffer) < needed_tokens:
    token_lists = tokenizer.encode(tokenizer_batch, prepend=bos_token)
    for tokens in token_lists:
        token_buffer.extend(tokens)

In [21]:
len(token_buffer)

3636

In [22]:
tokens = [token_buffer.popleft() for _ in range(needed_tokens)]

In [23]:
tokens

[65536, 61056, 363, 1488, 11808, 3734, 18097, 44, 4618, 44, 9575]

In [24]:
scratch = torch.tensor(tokens, dtype=torch.int64); scratch

tensor([65536, 61056,   363,  1488, 11808,  3734, 18097,    44,  4618,    44,
         9575])

In [25]:
inputs_cpu = scratch[:-1].to(dtype=torch.int32); inputs_cpu
# I don't understand why we want int32 for inputs

tensor([65536, 61056,   363,  1488, 11808,  3734, 18097,    44,  4618,    44],
       dtype=torch.int32)

In [26]:
targets_cpu = scratch[1:]; targets_cpu

tensor([61056,   363,  1488, 11808,  3734, 18097,    44,  4618,    44,  9575])

In [27]:
tokenizer.decode(inputs_cpu.tolist()), tokenizer.decode(targets_cpu.tolist())

('<bos>Shipment & Transport-Sea, Air,', 'Shipment & Transport-Sea, Air, Rail')

In [28]:
inputs = inputs_cpu.view(B, T); inputs

tensor([[65536, 61056,   363,  1488, 11808],
        [ 3734, 18097,    44,  4618,    44]], dtype=torch.int32)

In [29]:
targets = targets_cpu.view(B, T); targets

tensor([[61056,   363,  1488, 11808,  3734],
        [18097,    44,  4618,    44,  9575]])

OK, so `dataloader.py` does what I would expect, in a way that is scalable and compatible with training across multiple nodes and GPUs. It's similar to what I show as input/target in my [tracing the transformer blog post](https://towardsdatascience.com/tracing-the-transformer-in-diagrams-95dbeb68160c/), but without the "source" language, since this is a general-purpose transformer. I didn't realize, though, that we don't worry about each item in the batch being one "logical unit" such as a sentence or document. We just fill the batch with continuous tokens.
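
Putting the steps above together, here is a toy sketch of the buffer-and-yield loop. It mirrors the cells in this notebook rather than the actual `dataloader.py`: it uses nested lists instead of tensors so it runs without dependencies, the "documents" are made-up token ids with 0 standing in for `<bos>`, and it simply stops when the data runs out. `toy_data_loader` is a hypothetical name.

```python
from collections import deque

def toy_data_loader(token_lists, B, T):
    """Yield (inputs, targets) as B x T nested lists from a stream of token lists."""
    needed_tokens = B * T + 1  # +1 because targets are the inputs shifted by one
    token_buffer = deque()
    stream = iter(token_lists)
    while True:
        # refill the buffer until one full batch is available
        while len(token_buffer) < needed_tokens:
            try:
                token_buffer.extend(next(stream))
            except StopIteration:
                return  # toy version: stop when the data runs out
        tokens = [token_buffer.popleft() for _ in range(needed_tokens)]
        inputs_flat, targets_flat = tokens[:-1], tokens[1:]
        # reshape the flat lists into B rows of T tokens each
        inputs = [inputs_flat[r * T:(r + 1) * T] for r in range(B)]
        targets = [targets_flat[r * T:(r + 1) * T] for r in range(B)]
        yield inputs, targets

# made-up "documents", each starting with a made-up <bos> id of 0
docs = [
    [0, 11, 12, 13],
    [0, 21, 22, 23, 24, 25, 26],
    [0, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41],
]
x, y = next(toy_data_loader(docs, B=2, T=5))
print(x)  # [[0, 11, 12, 13, 0], [21, 22, 23, 24, 25]]
print(y)  # [[11, 12, 13, 0, 21], [22, 23, 24, 25, 26]]
```

Note how the second document starts in the middle of the first row: rows are cut purely by position in the token stream, not by document boundaries.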

### So now create `my_dataloader.py` to use going forward

In [30]:
from my_nanochat.my_dataloader import tokenizing_distributed_data_loader

In [31]:
dl = tokenizing_distributed_data_loader(
    B,
    T,
    "train",
    tokenizer_threads=4,
    tokenizer_batch_size=tokenizer_batch_size,
    device="cpu")

In [32]:
x, y = next(dl)

In [33]:
x

tensor([[65536, 61056,   363,  1488, 11808],
        [ 3734, 18097,    44,  4618,    44]], dtype=torch.int32)

In [34]:
y

tensor([[61056,   363,  1488, 11808,  3734],
        [18097,    44,  4618,    44,  9575]])

In [39]:
torch.equal(x, inputs) and torch.equal(y, targets)
# hooray, they match

True

In [40]:
# try with more typical batch and sequence sizes
dl = tokenizing_distributed_data_loader(B=32, T=2048, split="train", device="cpu")

In [41]:
x, y = next(dl)

In [42]:
x

tensor([[65536, 61056,   363,  ...,  1365,   288,   930],
        [   46,   372,    52,  ...,   408,  3484,  3050],
        [ 6475,   283,   261,  ...,   309,  6944,   288],
        ...,
        [ 5724,   257,  2239,  ..., 62468,  1855,   449],
        [ 1480,  4135,   327,  ...,    46,  1008,   500],
        [  519,   356,  1403,  ..., 41840,  1539, 16547]], dtype=torch.int32)

In [43]:
x.shape

torch.Size([32, 2048])

In [44]:
y.shape

torch.Size([32, 2048])

In [47]:
# how many <bos> tokens do we have?
torch.sum(x == tokenizer.get_bos_token_id())

tensor(83)

In [53]:
# how many 'The' tokens do we have?
torch.sum(x == tokenizer.encode("The")[0])

tensor(109)

In [57]:
tokenizer.decode(x[5,30:50].tolist())

' directing the Philharmonic Choral Society and the Kansas City Symphony Orchestra.\nBusch fell in love'

In [59]:
tokenizer.decode(x[10,30:50].tolist())

' prey for Amphiprion clarkii larviculture: effects on larval survival and growth. Aquaculture'