This is not the main notebook in this challenge. See `midtrain-d20.ipynb`.

## Investigate questions

Investigate the 3 questions raised in `midtrain-d20.ipynb`.

### 1) diff in bpb loss eval

Why was final val bpb in base training (done in challenge 25) 0.8135 but the starting one in this mid training run was 0.6856?

In `my_base_train.py` bpb loss eval is done like this:

```
        val_loader = build_val_loader()
        eval_steps = eval_tokens // (device_batch_size * max_seq_len * ddp_world_size)
        with autocast_ctx:
            val_bpb = evaluate_bpb(model, val_loader, eval_steps, token_bytes)
```

In `my_mid_train.py` bpb loss eval is done like this:

```
        val_loader = build_val_loader()
        eval_steps = eval_tokens // (device_batch_size * max_seq_len * ddp_world_size)
        with autocast_ctx:
            val_bpb = evaluate_bpb(model, val_loader, eval_steps, token_bytes)
```

Ah, the code is identical, but the numbers shouldn't match, because the validation data is different. Base training evaluates on base train validation data (tons and tons of text from the internet) and mid training evaluates on mid train validation data (the user / assistant conversations).
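To make that concrete, here is a minimal sketch of a bits-per-byte calculation, assuming the usual definition (summed token loss in nats divided by ln(2) times the number of bytes those tokens cover). `bits_per_byte` is a hypothetical helper, not the actual `evaluate_bpb`, and the loss values are made up; the point is only that the same formula gives different numbers on different text.

```python
import math

def bits_per_byte(token_losses_nats, token_byte_lengths):
    """Hypothetical helper: total loss in nats / (ln(2) * total bytes)."""
    total_nats = sum(token_losses_nats)
    total_bytes = sum(token_byte_lengths)
    return total_nats / (math.log(2) * total_bytes)

# Made-up per-token losses for two kinds of validation text.
web_text_bpb = bits_per_byte([2.9, 3.1, 2.7, 3.3], [5, 4, 6, 5])
chat_text_bpb = bits_per_byte([2.1, 2.4, 2.0, 2.5], [5, 4, 6, 5])
print(f"web text: {web_text_bpb:.4f}, chat text: {chat_text_bpb:.4f}")
```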

### 2) diff in CORE metric

Why was the final CORE metric in base training 0.2084 but the one I measured here on the same model was 0.2012?

In `my_base_train.py` CORE metric is evaluated like this:

```
        model.eval()
        with autocast_ctx:
            results = evaluate_model(orig_model, tokenizer, device, max_per_task=core_metric_max_per_task)
        print0(f"Step {step:05d}: CORE metric: {results['core_metric']:.4f}")
```

And in `my_base_eval.py`:

```
    with autocast_ctx:
        results = evaluate_model(model, tokenizer, device, max_per_task=args.max_per_task, tasks_to_run=tasks_to_run)

    print0(f"CORE metric: {results['core_metric']:.4f}")
```

When I did the base train I didn't specify `core_metric_max_per_task`, so it defaulted to 500, which I can nicely confirm in wandb (run=challenge-25-4). When I called `my_base_eval` I didn't specify `--max-per-task`, so it would have run all the examples in each task. I'm pretty sure that explains it. Before, I was wondering why he ran base_eval after the training in speedrun.sh. This explains it: the CORE metric evaluated after the last step in training uses at most 500 examples per task, not the full tasks. I suppose if I really want to verify, I can get back on the GPU machine and run `my_base_eval` with `--max-per-task=500`.
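A toy illustration of why a capped evaluation can land on a slightly different number than a full one. The outcomes are synthetic, and taking the first 500 is an assumption; how `evaluate_model` actually picks its subset is not shown in these notes.

```python
import random

rng = random.Random(0)
# Synthetic per-example outcomes for one task: True means a correct answer.
outcomes = [rng.random() < 0.3 for _ in range(5000)]

def accuracy(flags):
    """Fraction of correct answers."""
    return sum(flags) / len(flags)

max_per_task = 500
subset = outcomes[:max_per_task]  # assumption: the cap keeps the first N examples
full_acc = accuracy(outcomes)
subset_acc = accuracy(subset)
print(f"all {len(outcomes)} examples: {full_acc:.4f}")
print(f"first {max_per_task} examples: {subset_acc:.4f}")
```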

### 3) SpellingBee high accuracy

Can SpellingBee accuracy really be 96.88%?

A few initial thoughts...

- Make sure the train / test stuff is working and the model didn't memorize the 1000 examples
- Would be nice to have an easy way to view conversations where it gets it wrong (and right). Perhaps add.
- Try a few examples right now on my mac.
- There is no python use yet. So for the model to do this it has to learn which tokens together form the word, how to spell each of them, how to count specific letters within that, and to output the answer at the end after "####"...could it really be doing it so well already?

#### Check train / test

In [1]:
import sys
sys.path.append('../my_nanochat')
from my_tasks.my_spellingbee import MySpellingBee

In [3]:
train_task = MySpellingBee(split="train")
val_task = MySpellingBee(split="test")

In [7]:
for i in range(10):
    print(train_task[i]['messages'][0]['content'])

count the number of 's' in "nonclassifiable"?
"cottonization"にnは何個ありますか
Find the count of 'a' in "beaune"?
In 'giftedness', count the "f"
Count the a in gentman?
'touchy'に'u'が何回出てくる
spookにoは何個ありますか
lebistes라는 단어에 'm'가 몇 개?
tell me the number of r in fretters?
tell me the number of 'o' in pollocks?


In [8]:
for i in range(10):
    print(val_task[i]['messages'][0]['content'])

"cottonization"にnは何個ありますか
Find the count of 'a' in "beaune"?
In 'giftedness', count the "f"
Count the a in gentman?
'touchy'に'u'が何回出てくる
spookにoは何個ありますか
lebistes라는 단어에 'm'가 몇 개?
tell me the number of r in fretters?
tell me the number of 'o' in pollocks?
数一下"sphygmuses"中的"s"


It's not right!

Did I copy something wrong? My code:

```
        seed = index if self.split == 'train' else -(index + 1)
        rng = random.Random(seed)
```

Code from [spellingbee.py](https://github.com/karpathy/nanochat/blob/master/tasks/spellingbee.py):

```
        seed = index if self.split == 'train' else -(index + 1)
        rng = random.Random(seed)
```

In [9]:
import random

In [10]:
index = 5
train_seed = index
test_seed = -(index + 1)
train_seed, test_seed

(5, -6)

In [None]:
choices = list(range(1000))

In [15]:
rng = random.Random(5)
rng.choice(choices)

637

In [16]:
rng = random.Random(-5)
rng.choice(choices)

637

In [17]:
rng = random.Random(6)
rng.choice(choices)

812

In [18]:
rng = random.Random(-6)
rng.choice(choices)

812

In [21]:
random.Random(5).random()

0.6229016948897019

In [22]:
random.Random(-5).random()

0.6229016948897019

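So at least in this version of python, seeding with x and -x gives the same sequence (CPython's `random.seed` uses the absolute value of an integer seed). That means test example i, seeded with -(i + 1), is identical to train example i + 1, which is exactly the one-position shift visible in the two listings above.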
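To check this isn't specific to seeds 5 and 6, here is a quick sketch that compares more seeds and then reproduces the overlap using the seed rule from the task. `old_seed` is a hypothetical helper that only mirrors the one-line seed expression above. On CPython I'd expect both checks to confirm the collision.

```python
import random

# Does seeding with s and -s give the same first draw for many values of s?
same = all(
    random.Random(s).random() == random.Random(-s).random()
    for s in range(1, 1000)
)
print(same)

def old_seed(index, split):
    """Hypothetical helper mirroring the original seed rule."""
    return index if split == "train" else -(index + 1)

# Test indexes whose stream equals the stream of train index i + 1.
overlap = [
    i for i in range(10)
    if random.Random(old_seed(i, "test")).random()
    == random.Random(old_seed(i + 1, "train")).random()
]
print(overlap)
```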

How to fix? How many words in that word file again?

In [24]:
from my_nanochat.my_common import get_base_dir

In [26]:
!wc -l {get_base_dir()}/words_alpha.txt

  370105 /Users/ericsilberstein/.cache/my_nanochat/words_alpha.txt


Offset by ten million should be fine.
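A small sanity check of the offset idea, as a sketch. What really matters is that the offset is larger than the biggest train index, not the word count, and ten million clears both. `example_seed` is a hypothetical helper; the sizes are the 80,000 train examples and 1000 test examples mentioned in these notes.

```python
TEST_RANDOM_SEED_OFFSET = 10_000_000

def example_seed(index, split):
    """Hypothetical helper: train seeds are the index, test seeds are shifted."""
    return index if split == "train" else TEST_RANDOM_SEED_OFFSET + index

train_size = 80_000
test_size = 1_000

# Compare absolute values, since the sign of the seed is ignored.
train_seeds = {abs(example_seed(i, "train")) for i in range(train_size)}
test_seeds = {abs(example_seed(i, "test")) for i in range(test_size)}
print(train_seeds.isdisjoint(test_seeds))
```

Disjoint seeds only stop the two splits from replaying the same random stream. A word can still show up in both splits by chance.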

But first let me see if I can repro the (fake) high accuracy by running a few examples on my mac, so I can confirm the fix later.

In [9]:
import os
os.environ["PYTHONPATH"] = "../my_nanochat"

In [10]:
!python -m scripts.my_chat_eval --source=mid --batch-size=1 --model-tag=d20 --max-problems=10 --task-name=SpellingBee

Autodetected device type: mps
loading the model from /Users/ericsilberstein/.cache/my_nanochat/mid_checkpoints/d20 with step 809
Building model with config: {'sequence_len': 2048, 'vocab_size': 65536, 'n_layer': 20, 'n_head': 10, 'n_kv_head': 10, 'n_embd': 1280}
final: 10/10 (100.00%)
SpellingBee accuracy: 100.00%


OK, now change `my_spellingbee.py`.

Did the following:

```
@@ -4,6 +4,7 @@ import random
 import re
 
 WORD_LIST_URL = "https://raw.githubusercontent.com/dwyl/english-words/refs/heads/master/words_alpha.txt"
+TEST_RANDOM_SEED_OFFSET = 10_000_000 # much bigger than the ~370,000 words in the list 
 
 class MySimpleSpelling(MyTask):
     def __init__(self, size=1000, split="train", **kwargs):
@@ -27,7 +28,7 @@ class MySimpleSpelling(MyTask):
         return self.size
 
     def get_example(self, index):
-        seed = index if self.split == 'train' else -(index + 1)
+        seed = index if self.split == 'train' else TEST_RANDOM_SEED_OFFSET + index
         rng = random.Random(seed)
         word = rng.choice(self.words)
         word_letters = ','.join(list(word))
@@ -132,7 +133,7 @@ class MySpellingBee(MyTask):
         return self.size
 
     def get_example(self, index):
-        seed = index if self.split == 'train' else -(index + 1)
+        seed = index if self.split == 'train' else TEST_RANDOM_SEED_OFFSET + index
         rng = random.Random(seed)
```

In [1]:
import sys
sys.path.append('../my_nanochat')
from my_tasks.my_spellingbee import MySpellingBee

In [2]:
train_task = MySpellingBee(split="train")
val_task = MySpellingBee(split="test")

In [3]:
for i in range(10):
    print(train_task[i]['messages'][0]['content'])

count the number of 's' in "nonclassifiable"?
"cottonization"にnは何個ありますか
Find the count of 'a' in "beaune"?
In 'giftedness', count the "f"
Count the a in gentman?
'touchy'に'u'が何回出てくる
spookにoは何個ありますか
lebistes라는 단어에 'm'가 몇 개?
tell me the number of r in fretters?
tell me the number of 'o' in pollocks?


In [4]:
for i in range(10):
    print(val_task[i]['messages'][0]['content'])

How many "g" are in the word meningomyelitis?
'achesoun'에 'u'가 몇 번 나오나요?
Show me how many "g" are in "driegh"
¿cuántas t hay en 'enterotoxication'?
How many "g" appear in fondlings?
数一下"unobtunded"中的"u"?
Count how many 'u' appear in eurytherm
unsatiate中有多少个"u"
Can you count the "r" letters in collarbird
what's the count of e in anhydroglocose


^ Looks good.

Also check SimpleSpelling task

In [5]:
from my_tasks.my_spellingbee import MySimpleSpelling
train_task = MySimpleSpelling(split="train")
val_task = MySimpleSpelling(split="test")

In [7]:
for i in range(10):
    print(train_task[i]['messages'][0]['content'])

Spell the word: baggers
Spell the word: rimfire
Spell the word: andaqui
Spell the word: pinewood
Spell the word: couldn
Spell the word: revelous
Spell the word: alcaligenes
Spell the word: carapato
Spell the word: imaum
Spell the word: entosphenoid


In [8]:
for i in range(10):
    print(val_task[i]['messages'][0]['content'])

Spell the word: garishly
Spell the word: agrobiologist
Spell the word: undersign
Spell the word: comfortation
Spell the word: barristership
Spell the word: heroized
Spell the word: antiexpressionist
Spell the word: whiteboy
Spell the word: criteriology
Spell the word: materialman


Run the chat eval again with a few examples on my mac and expect low accuracy.

In [49]:
import os
os.environ["PYTHONPATH"] = "../my_nanochat"

In [2]:
!python -m scripts.my_chat_eval --source=mid --batch-size=1 --model-tag=d20 --max-problems=10 --task-name=SpellingBee

Autodetected device type: mps
loading the model from /Users/ericsilberstein/.cache/my_nanochat/mid_checkpoints/d20 with step 809
Building model with config: {'sequence_len': 2048, 'vocab_size': 65536, 'n_layer': 20, 'n_head': 10, 'n_kv_head': 10, 'n_embd': 1280}
final: 10/10 (100.00%)
SpellingBee accuracy: 100.00%


It's still getting them all right. Hmm. Let's try one.

In [1]:
import sys
sys.path.append('../my_nanochat')
from my_tasks.my_spellingbee import MySpellingBee
from my_nanochat.my_checkpoint_manager import load_model
from my_nanochat.my_common import compute_init, autodetect_device_type
from my_nanochat.my_engine import Engine

In [2]:
val_task = MySpellingBee(split="test")

In [3]:
device_type = autodetect_device_type() 
_, _, _, _, device = compute_init(device_type)
model, tokenizer, meta_data = load_model('mid', model_tag='d20', device=device, phase='eval')
engine = Engine(model, tokenizer)

Autodetected device type: mps
loading the model from /Users/ericsilberstein/.cache/my_nanochat/mid_checkpoints/d20 with step 809
Building model with config: {'sequence_len': 2048, 'vocab_size': 65536, 'n_layer': 20, 'n_head': 10, 'n_kv_head': 10, 'n_embd': 1280}


In [4]:
val_task[0]

{'messages': [{'role': 'user',
   'content': 'How many "g" are in the word meningomyelitis?'},
  {'role': 'assistant',
   'content': [{'type': 'text',
     'text': "We are asked to find the number 'g' in the word 'meningomyelitis'. Let me try a manual approach first.\n\nFirst spell the word out:\nmeningomyelitis:m,e,n,i,n,g,o,m,y,e,l,i,t,i,s\n\nThen count the occurrences of 'g':\n1:m\n2:e\n3:n\n4:i\n5:n\n6:g hit! count=1\n7:o\n8:m\n9:y\n10:e\n11:l\n12:i\n13:t\n14:i\n15:s\n\nThis gives us 1."},
    {'type': 'text', 'text': '\n\nLet me double check this using Python:\n\n'},
    {'type': 'python', 'text': "'meningomyelitis'.count('g')"},
    {'type': 'python_output', 'text': '1'},
    {'type': 'text',
     'text': '\n\nPython gives us 1.\n\nMy final answer is:\n\n#### 1'}]}]}

In [5]:
tokens = tokenizer.render_for_completion(val_task[0])

In [6]:
tokenizer.decode(tokens)

'<|bos|><|user_start|>How many "g" are in the word meningomyelitis?<|user_end|><|assistant_start|>'

In [8]:
results, _ = engine.generate_batch(tokens, max_tokens=512, top_k=50, temperature=0.0)

In [9]:
print(tokenizer.decode(results[0]))

<|bos|><|user_start|>How many "g" are in the word meningomyelitis?<|user_end|><|assistant_start|>We are asked to find the number 'g' in the word 'meningomyelitis'. Let me try a manual approach first.

First spell the word out:
meningomyelitis:m,e,n,i,n,g,e,m,o,p,y

Then count the occurrences of 'g':
1:m
2:e
3:n
4:i
5:n
6:g hit! count=1
7:e
8:m
9:o
10:p
11:y

This gives us 1.

Let me double check this using Python:

<|python_start|>'meningomyelitis'.count('g')<|python_end|><|output_start|>1<|output_end|>

Python gives us 1.

My final answer is:

#### 1<|assistant_end|>


^ wow! Was that really not in the training? In that particular way? With 80,000 train examples I guess it has a ~20% chance of being there in some form.
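Where the ~20% comes from, as a back-of-the-envelope sketch. It assumes each train example picks one word uniformly at random (with replacement) from the 370,105 lines counted by `wc -l` earlier, and it ignores the letter and phrasing, so it is the chance of the word appearing at all.

```python
num_words = 370_105   # lines in words_alpha.txt
num_train = 80_000    # train examples

# Chance that a given word is never picked in num_train independent draws.
p_never = (1 - 1 / num_words) ** num_train
p_seen = 1 - p_never
print(f"chance a given word appears in train: {p_seen:.3f}")
```

That comes out a bit under 0.20, in line with the rough guess.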

In [10]:
# checking this way because even with my fix random seed for train is still just index
train_task = MySpellingBee(size=80000, split="train")

In [18]:
for i in range(80_000):
    if 'meningomyelitis' in train_task[i]['messages'][0]['content']:
        print(i, train_task[i]['messages'][0]['content'])

Let's do another

In [21]:
val_task[1]

{'messages': [{'role': 'user', 'content': "'achesoun'에 'u'가 몇 번 나오나요?"},
  {'role': 'assistant',
   'content': [{'type': 'text',
     'text': "We are asked to find the number 'u' in the word 'achesoun'. Let me try a manual approach first.\n\nFirst spell the word out:\nachesoun:a,c,h,e,s,o,u,n\n\nThen count the occurrences of 'u':\n1:a\n2:c\n3:h\n4:e\n5:s\n6:o\n7:u hit! count=1\n8:n\n\nThis gives us 1."},
    {'type': 'text', 'text': '\n\nLet me double check this using Python:\n\n'},
    {'type': 'python', 'text': "'achesoun'.count('u')"},
    {'type': 'python_output', 'text': '1'},
    {'type': 'text',
     'text': '\n\nPython gives us 1.\n\nMy final answer is:\n\n#### 1'}]}]}

In [22]:
tokens = tokenizer.render_for_completion(val_task[1])

In [23]:
tokenizer.decode(tokens)

"<|bos|><|user_start|>'achesoun'에 'u'가 몇 번 나오나요?<|user_end|><|assistant_start|>"

In [24]:
results, _ = engine.generate_batch(tokens, max_tokens=512, top_k=50, temperature=0.0)

In [25]:
print(tokenizer.decode(results[0]))

<|bos|><|user_start|>'achesoun'에 'u'가 몇 번 나오나요?<|user_end|><|assistant_start|>We are asked to find the number 'u' in the word 'achesoun'. Let me try a manual approach first.

First spell the word out:
achesoun:a,c,h,e,s,o,u,n

Then count the occurrences of 'u':
1:a
2:c
3:h
4:e
5:s
6:o
7:u hit! count=1
8:n

This gives us 1.

Let me double check this using Python:

<|python_start|>'achesoun'.count('u')<|python_end|><|output_start|>1<|output_end|>

Python gives us 1.

My final answer is:

#### 1<|assistant_end|>


In [26]:
for i in range(80_000):
    if 'achesoun' in train_task[i]['messages'][0]['content']:
        print(i, train_task[i]['messages'][0]['content'])

In [29]:
# confirm this is working
for i in range(80_000):
    if 'giftedness' in train_task[i]['messages'][0]['content']:
        print(i, train_task[i]['messages'][0]['content'])

3 In 'giftedness', count the "f"


See if I can find a failed one even on my slow mac. With 96.88% accuracy I'd expect at least 3-4 failures out of 100, so it shouldn't be that hard. Add a `--print-failed` option to `my_chat_eval.py`.

In [31]:
import os
os.environ["PYTHONPATH"] = "../my_nanochat"

In [38]:
!python -m scripts.my_chat_eval \
    --source=mid \
    --batch-size=1 \
    --model-tag=d20 \
    --max-problems=200 \
    --task-name=SpellingBee \
    --print-failed

Autodetected device type: mps
loading the model from /Users/ericsilberstein/.cache/my_nanochat/mid_checkpoints/d20 with step 809
Building model with config: {'sequence_len': 2048, 'vocab_size': 65536, 'n_layer': 20, 'n_head': 10, 'n_kv_head': 10, 'n_embd': 1280}
Conversation: {'messages': [{'role': 'user', 'content': "initiatrixの中に't'がいくつ"}, {'role': 'assistant', 'content': [{'type': 'text', 'text': "We are asked to find the number 't' in the word 'initiatrix'. Let me try a manual approach first.\n\nFirst spell the word out:\ninitiatrix:i,n,i,t,i,a,t,r,i,x\n\nThen count the occurrences of 't':\n1:i\n2:n\n3:i\n4:t hit! count=1\n5:i\n6:a\n7:t hit! count=2\n8:r\n9:i\n10:x\n\nThis gives us 2."}, {'type': 'text', 'text': '\n\nLet me double check this using Python:\n\n'}, {'type': 'python', 'text': "'initiatrix'.count('t')"}, {'type': 'python_output', 'text': '2'}, {'type': 'text', 'text': '\n\nPython gives us 2.\n\nMy final answer is:\n\n#### 2'}]}]}

Model completion(s): ["We are asked to 

^ Interesting.

For t's in initiatrix, it spells "initiatrix" wrong when it's spelling it out: i,n,i,t,i,a,r,x but copies it correctly when it writes "First spell the word out:\ninitiatrix" and when it writes the python program: 'initiatrix'.count('t')

For t's in weights, the model goes somewhere totally different, writing: Weights are a plural noun that refers to the amount of something that is being weighed or measured.

For r's in heparinized, it loses the spelling even on its first copy: We are asked to find the number 'r' in the word 'healinized'

For n's in improvements, it spells it out as: i,n,f,o,r,p,r,i,e,m,e,n,t,s but copies it right, including into the python program.

For m's in cimicifugin, it copies it right but spells it out wrong: c,i,m,i,c,i,f,u,m,i,n
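Most of these failures are in the spelled-out line, so a small checker could flag them automatically. This is a minimal sketch, assuming the completion contains a line of the form `word:l,e,t,t,e,r,s` as in the examples above. `spelled_out_matches` is a hypothetical helper, not part of `my_chat_eval.py`.

```python
import re

def spelled_out_matches(completion, word):
    """True/False if the 'word:l,e,t,...' line matches the word, None if absent."""
    match = re.search(
        rf"^{re.escape(word)}:(.*)$", completion, flags=re.MULTILINE
    )
    if match is None:
        return None  # e.g. the model miscopied the word itself
    return match.group(1).strip() == ",".join(word)

bad = "First spell the word out:\ninitiatrix:i,n,i,t,i,a,r,x\n\nThen count the occurrences of 't':"
good = "First spell the word out:\nachesoun:a,c,h,e,s,o,u,n\n\nThen count the occurrences of 'u':"
print(spelled_out_matches(bad, "initiatrix"))
print(spelled_out_matches(good, "achesoun"))
```

A `None` result covers cases like heparinized, where the word is already wrong on its first copy.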

Is it crazy that with "just" 80,000 examples showing this format and another 200,000 of simple spelling examples the model becomes so good at this? I guess compared to what I know larger models (and maybe even this one) are capable of this doesn't seem like a big deal. But if I didn't know that, this would seem crazy.