This is not the main notebook in this challenge. See `midtrain-d20.ipynb`.

## Investigate questions

Investigate the 3 questions raised in `midtrain-d20.ipynb`.

### 1) diff in bpb loss eval

Why was the final val bpb in base training (done in challenge 25) 0.8135, while the starting val bpb in this mid-training run was 0.6856?

In `my_base_train.py` bpb loss eval is done like this:

```
        val_loader = build_val_loader()
        eval_steps = eval_tokens // (device_batch_size * max_seq_len * ddp_world_size)
        with autocast_ctx:
            val_bpb = evaluate_bpb(model, val_loader, eval_steps, token_bytes)
```

In `my_mid_train.py` bpb loss eval is done like this:

```
        val_loader = build_val_loader()
        eval_steps = eval_tokens // (device_batch_size * max_seq_len * ddp_world_size)
        with autocast_ctx:
            val_bpb = evaluate_bpb(model, val_loader, eval_steps, token_bytes)
```

Ah, they shouldn't match. The eval code is identical, but the validation data is not: one is using base train validation data (tons and tons of text from the internet) and the other is using mid train validation data (the user / assistant conversations).
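To make the point concrete, here is a minimal sketch of a bits-per-byte calculation, assuming the usual definition (total loss in nats converted to bits, divided by total bytes). It is not the actual `evaluate_bpb`, and the per-token losses below are invented toy numbers, not measurements from either run. The same formula applied to two different datasets simply gives two different values.

```python
import math

def bits_per_byte(token_losses_nats, token_byte_lengths):
    """Total loss converted from nats to bits, divided by total bytes."""
    total_nats = sum(token_losses_nats)
    total_bytes = sum(token_byte_lengths)
    return total_nats / (math.log(2) * total_bytes)

# Toy numbers, invented for illustration only.
byte_lengths = [5, 4, 6, 5]
web_losses = [2.8, 3.1, 2.5, 3.0]    # hypothetical losses on internet text
chat_losses = [2.0, 2.4, 1.9, 2.2]   # hypothetical losses on chat data

web_bpb = bits_per_byte(web_losses, byte_lengths)
chat_bpb = bits_per_byte(chat_losses, byte_lengths)
print(f"web: {web_bpb:.4f}  chat: {chat_bpb:.4f}")
```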

### 2) diff in CORE metric

Why was the final CORE metric in base training 0.2084 but the one I measured here on the same model was 0.2012?

In `my_base_train.py` CORE metric is evaluated like this:

```
        model.eval()
        with autocast_ctx:
            results = evaluate_model(orig_model, tokenizer, device, max_per_task=core_metric_max_per_task)
        print0(f"Step {step:05d}: CORE metric: {results['core_metric']:.4f}")
```

And in `my_base_eval.py`:

```
    with autocast_ctx:
        results = evaluate_model(model, tokenizer, device, max_per_task=args.max_per_task, tasks_to_run=tasks_to_run)

    print0(f"CORE metric: {results['core_metric']:.4f}")
```

When I did the base train I didn't specify `core_metric_max_per_task`, so it defaulted to 500, which I can nicely confirm in wandb (run=challenge-25-4). When I called `my_base_eval` I didn't specify `--max-per-task`, so it would have run all examples. I'm pretty sure that explains the gap. It also explains something I was wondering about before: why `speedrun.sh` runs base_eval after the training. The CORE metric evaluated after the last training step only uses up to 500 examples per task, not the full evaluation set. If I really want to verify, I can get back on the GPU machine and run `my_base_eval` with `--max-per-task=500`.
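A toy sketch of why a per-task cap can move the number. The `accuracy` helper and the outcome list are hypothetical and invented for illustration. Taking the first `max_per_task` examples is also an assumption; I haven't checked how `evaluate_model` picks its subset.

```python
def accuracy(outcomes, max_per_task=-1):
    """Fraction correct, optionally using only the first max_per_task examples."""
    if max_per_task > 0:
        outcomes = outcomes[:max_per_task]
    return sum(outcomes) / len(outcomes)

# Invented outcomes for one hypothetical task with 1000 examples.
toy_outcomes = [True] * 400 + [False] * 100 + [True] * 300 + [False] * 200

capped = accuracy(toy_outcomes, max_per_task=500)  # uses 500 examples
full = accuracy(toy_outcomes)                      # uses all 1000 examples
print(capped, full)
```

The two calls score the same "model" on the same task and still disagree, purely because they look at different numbers of examples.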

### 3) SpellingBee high accuracy

Can SpellingBee accuracy really be 96.88%?

A few initial thoughts...

- Make sure the train / test stuff is working and the model didn't memorize the 1000 examples
- Would be nice to have an easy way to view conversations where it gets it wrong (and right). Perhaps add.
- Try a few examples right now on my mac.
- There is no python use yet. So for the model to do this it has to learn which tokens together form the word, how to spell each of them, how to count specific letters within that, and to output the answer at the end after "###"...could it really be doing it so well already?

#### Check train / test

In [1]:
import sys
sys.path.append('../my_nanochat')
from my_nanochat.my_tokenizer import get_tokenizer
from my_tasks.my_spellingbee import MySpellingBee

In [3]:
train_task = MySpellingBee(split="train")
val_task = MySpellingBee(split="test")

In [7]:
for i in range(10):
    print(train_task[i]['messages'][0]['content'])

count the number of 's' in "nonclassifiable"?
"cottonization"にnは何個ありますか
Find the count of 'a' in "beaune"?
In 'giftedness', count the "f"
Count the a in gentman?
'touchy'に'u'が何回出てくる
spookにoは何個ありますか
lebistes라는 단어에 'm'가 몇 개?
tell me the number of r in fretters?
tell me the number of 'o' in pollocks?


In [8]:
for i in range(10):
    print(val_task[i]['messages'][0]['content'])

"cottonization"にnは何個ありますか
Find the count of 'a' in "beaune"?
In 'giftedness', count the "f"
Count the a in gentman?
'touchy'に'u'が何回出てくる
spookにoは何個ありますか
lebistes라는 단어에 'm'가 몇 개?
tell me the number of r in fretters?
tell me the number of 'o' in pollocks?
数一下"sphygmuses"中的"s"


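It's not right! The test prompts are the train prompts shifted by one: `val_task[i]` matches `train_task[i + 1]`, so the two splits overlap.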
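The comparison can also be done in code rather than by eye. This is a small sketch using the first five prompts from each output above, copied by hand; the variable names are mine.

```python
# First five prompts from each split, copied from the printed output above.
train_prompts = [
    '''count the number of 's' in "nonclassifiable"?''',
    '''"cottonization"にnは何個ありますか''',
    '''Find the count of 'a' in "beaune"?''',
    '''In 'giftedness', count the "f"''',
    '''Count the a in gentman?''',
]
test_prompts = [
    '''"cottonization"にnは何個ありますか''',
    '''Find the count of 'a' in "beaune"?''',
    '''In 'giftedness', count the "f"''',
    '''Count the a in gentman?''',
    """'touchy'に'u'が何回出てくる""",
]

# Prompts that appear in both splits.
overlap = set(train_prompts) & set(test_prompts)

# Does test[i] equal train[i + 1] wherever both are available?
shifted = all(
    test_prompts[i] == train_prompts[i + 1]
    for i in range(len(test_prompts) - 1)
)
print(len(overlap), shifted)
```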

Did I copy something wrong? My code:

```
        seed = index if self.split == 'train' else -(index + 1)
        rng = random.Random(seed)
```

Code from [spellingbee.py](https://github.com/karpathy/nanochat/blob/master/tasks/spellingbee.py):

```
        seed = index if self.split == 'train' else -(index + 1)
        rng = random.Random(seed)
```

In [9]:
import random

In [10]:
index = 5
train_seed = index
test_seed = -(index + 1)
train_seed, test_seed

(5, -6)

In [None]:
choices = list(range(1000))

In [15]:
rng = random.Random(5)
rng.choice(choices)

637

In [16]:
rng = random.Random(-5)
rng.choice(choices)

637

In [17]:
rng = random.Random(6)
rng.choice(choices)

812

In [18]:
rng = random.Random(-6)
rng.choice(choices)

812

In [21]:
random.Random(5).random()

0.6229016948897019

In [22]:
random.Random(-5).random()

0.6229016948897019

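So at least in this version of Python, seeding with x and -x gives the same sequence. That means test index i (seed -(i + 1)) generates the same example as train index i + 1, which is exactly the one-position shift in the printed prompts above.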
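A sketch that checks the collision across many indices and tries one possible fix. `original_seed` mirrors the seeding line quoted earlier. `offset_seed` and `TEST_SEED_OFFSET` are hypothetical names of my own, not something taken from the repo, and the offset only helps as long as train indices stay below it.

```python
import random

TEST_SEED_OFFSET = 10_000_000  # hypothetical constant, chosen for illustration

def original_seed(index, split):
    # Seeding scheme quoted above: negative seeds for the test split.
    return index if split == "train" else -(index + 1)

def offset_seed(index, split):
    # Possible alternative: keep test seeds non-negative and far from train seeds.
    return index if split == "train" else TEST_SEED_OFFSET + index

def first_draw(seed):
    return random.Random(seed).random()

n = 100
# Test index i collides with train index i + 1 under the original scheme.
collisions = [
    i for i in range(n)
    if first_draw(original_seed(i, "test")) == first_draw(original_seed(i + 1, "train"))
]

# With the offset, no test seed matches a train seed even after abs().
train_seeds = {offset_seed(i, "train") for i in range(n + 1)}
test_seeds = {abs(offset_seed(i, "test")) for i in range(n)}
print(len(collisions), train_seeds & test_seeds)
```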