This is not the main notebook in this challenge. See `understand-reporting.ipynb`.

## Review d20 evals

Look through `review-d20-evals.ipynb`.

1) Sanity check against data collected during prior runs.

2) Why the -30% boolq result?

3) Compare ChatCORE mid to sft.

### #1 Sanity check against prior runs

For base eval, I should compare with the base eval I ran at the top of this notebook: `challenge-28-midtrain-d20/midtrain-d20.ipynb`

For base loss, I never ran that before, but I can somewhat sanity check against the loss from the d20 base training run: `challenge-25-pretrain-d20/full-run.ipynb`

For chat eval mid, it should match with the mid chat eval I summarized here: `challenge-32-redo-chat-eval-d20/investigate-runs.ipynb`?

For chat eval sft, it should be "similar" to the sft eval in that same notebook but should not match exactly, because I kept repeating the sft training but never repeated the full chat eval.

#### base eval

From `challenge-28-midtrain-d20/midtrain-d20.ipynb`:
```
CORE metric: 0.2012
centered results:
{
    "hellaswag_zeroshot": 0.25672173500061035,
    "jeopardy": 0.11856400221586227,
    "bigbench_qa_wikidata": 0.5365877747535706,
    "arc_easy": 0.5291806856791178,
    "arc_challenge": 0.13538110256195068,
    "copa": 0.3799999952316284,
    "commonsense_qa": 0.03767403215169905,
    "piqa": 0.35908591747283936,
    "openbook_qa": 0.14666668574015299,
    "lambada_openai": 0.3745391070842743,
    "hellaswag": 0.26282942295074463,
    "winograd": 0.28205132484436035,
    "winogrande": 0.05603790283203125,
    "bigbench_dyck_languages": 0.10200000554323196,
    "agi_eval_lsat_ar": 0.027173891663551317,
    "bigbench_cs_algorithms": 0.3583333194255829,
    "bigbench_operators": 0.1666666716337204,
    "bigbench_repeat_copy_logic": 0.0,
    "squad": 0.2345316857099533,
    "coqa": 0.18126018345355988,
    "boolq": -0.2972798786665264,
    "bigbench_language_identification": 0.1782178120775716
}
```

Yes, CORE 0.2012 matches exactly. I also compared a few of the other values and they match exactly too.
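A quick cross-check of the headline number, as a minimal sketch. It assumes CORE is the unweighted mean of the 22 centered results above (an assumption; the aggregation code isn't shown in this notebook). The values are copied from the output above and rounded to six decimals.

```python
# Centered results copied from the output above, rounded to 6 decimals.
centered = {
    "hellaswag_zeroshot": 0.256722,
    "jeopardy": 0.118564,
    "bigbench_qa_wikidata": 0.536588,
    "arc_easy": 0.529181,
    "arc_challenge": 0.135381,
    "copa": 0.380000,
    "commonsense_qa": 0.037674,
    "piqa": 0.359086,
    "openbook_qa": 0.146667,
    "lambada_openai": 0.374539,
    "hellaswag": 0.262829,
    "winograd": 0.282051,
    "winogrande": 0.056038,
    "bigbench_dyck_languages": 0.102000,
    "agi_eval_lsat_ar": 0.027174,
    "bigbench_cs_algorithms": 0.358333,
    "bigbench_operators": 0.166667,
    "bigbench_repeat_copy_logic": 0.0,
    "squad": 0.234532,
    "coqa": 0.181260,
    "boolq": -0.297280,
    "bigbench_language_identification": 0.178218,
}

# Assumption: CORE is the plain mean of the centered task results.
core = sum(centered.values()) / len(centered)
print(f"{len(centered)} tasks, CORE = {core:.4f}")
```

Note how much a single task moves the mean: boolq alone contributes about -0.297 / 22 to the total.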

#### base loss

In `challenge-25-pretrain-d20/full-run.ipynb`, final validation bpb was 0.8135. In the report, val bpb: 0.8136. Seems good. Train bpb is 0.8164. Also seems to pass a sanity check.

#### chat eval mid

In `challenge-32-redo-chat-eval-d20/investigate-runs.ipynb` I hand created (really vim magic) this table:

```
                   mid        sft         increase (from mid to sft)

ARC-Easy           43.18      44.07       2.1%
ARC-Challenge      33.19      31.74       -4.4%
MMLU               33.07      32.34       -2.2%
GSM8K              3.41       5.38        57.8%
HumanEval          6.71       6.10        -9.1%
SpellingBee        98.05      96.88       -1.2%
```

The numbers in the mid column match perfectly.

#### chat eval sft

The numbers in the sft column are "similar" at least enough for a sanity check on the reporting code.
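The "increase" column in the table above is a relative change, not a percentage-point difference. A small sketch of that arithmetic, using the mid and sft values copied from the table (the helper name `relative_change_pct` is just for illustration):

```python
# (mid, sft) accuracies in percent, copied from the table above.
scores = {
    "ARC-Easy": (43.18, 44.07),
    "ARC-Challenge": (33.19, 31.74),
    "MMLU": (33.07, 32.34),
    "GSM8K": (3.41, 5.38),
    "HumanEval": (6.71, 6.10),
    "SpellingBee": (98.05, 96.88),
}

def relative_change_pct(before, after):
    """Percent change from `before` to `after`, relative to `before`."""
    return (after - before) / before * 100.0

# Round to one decimal place, the precision used in the table.
changes = {
    task: round(relative_change_pct(mid, sft), 1)
    for task, (mid, sft) in scores.items()
}
for task, pct in changes.items():
    print(f"{task:<15}{pct:+.1f}%")
```

Relative change exaggerates movement on low-scoring tasks: GSM8K gains fewer than two percentage points but shows by far the largest relative change.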


### #2 Odd boolq results

Why in CORE eval did boolq come out as centered -0.297?

From `challenge-21-understand-core-metric/core-evaluation-data-examples.ipynb`, it looks like boolq consists of yes/no multiple-choice questions like this one:

```
Query: Passage: The Tower of Terror buildings are among the tallest structures found at
their respective Disney resorts. At 199 feet (60.7 m), the Florida version is the second
tallest attraction at the Walt Disney World Resort, with only Expedition Everest 199.5
feet (60.8 m) being taller. At the Disneyland Resort, the 183-foot (55.8 m) structure
(which now houses Guardians of the Galaxy -- Mission: Breakout!) is the tallest building
at the resort, as well as one of the tallest buildings in Anaheim. At Disneyland Paris, it
is the second tallest attraction. Question: does disney world still have tower of terror?
Correct prompt: 1

prompt 0: Passage: The Tower of Terror buildings are among the tallest structures found at
their respective Disney resorts. At 199 feet (60.7 m), the Florida version is the second
tallest attraction at the Walt Disney World Resort, with only Expedition Everest 199.5
feet (60.8 m) being taller. At the Disneyland Resort, the 183-foot (55.8 m) structure
(which now houses Guardians of the Galaxy -- Mission: Breakout!) is the tallest building
at the resort, as well as one of the tallest buildings in Anaheim. At Disneyland Paris, it
is the second tallest attraction. Question: does disney world still have tower of terror?
Answer: no

prompt 1: Passage: The Tower of Terror buildings are among the tallest structures found at
their respective Disney resorts. At 199 feet (60.7 m), the Florida version is the second
tallest attraction at the Walt Disney World Resort, with only Expedition Everest 199.5
feet (60.8 m) being taller. At the Disneyland Resort, the 183-foot (55.8 m) structure
(which now houses Guardians of the Galaxy -- Mission: Breakout!) is the tallest building
at the resort, as well as one of the tallest buildings in Anaheim. At Disneyland Paris, it
is the second tallest attraction. Question: does disney world still have tower of terror?
Answer: yes
```

What's the random baseline?

In [32]:
import sys
sys.path.append('../my_nanochat')
from my_nanochat.my_common import get_base_dir
import pandas as pd
eval_meta_data = pd.read_csv(f"{get_base_dir()}/eval_bundle/eval_meta_data.csv")
eval_meta_data[eval_meta_data['Eval Task'] == 'boolq']

Unnamed: 0,Eval Task,Task Category,Task Type,#shots,#datapoints,Random baseline,Centered Metric?,Description
45,boolq,reading comprehension,multiple choice,10,3270,62.0,,"BoolQ consists of 3,270 short passages on a d..."


In [35]:
# from my_base_eval.py
def centered_result(accuracy, random_baseline):
    return (accuracy - 0.01 * random_baseline) / (1.0 - 0.01 * random_baseline)

In [37]:
centered_result(0.5070, 62)

-0.29736842105263156

But why is the random baseline 62%?

In [45]:
print(eval_meta_data.iloc[45]['Description'])

 BoolQ consists of 3,270 short passages on a diverse range of subjects followed by a yes/no questions. The model is expected to answer in multiple-choice format.


I'm sure they all have 2 answers, but just to be sure...

In [49]:
!ls {get_base_dir()}/eval_bundle/eval_data/reading_comprehension

agi_eval_lsat_lr.jsonl              coqa.jsonl
agi_eval_lsat_rc.jsonl              narrative_qa.jsonl
agi_eval_sat_en.jsonl               pubmed_qa_labeled.jsonl
bigbench_understanding_fables.jsonl squad.jsonl
boolq.jsonl


In [51]:
data_path = f"{get_base_dir()}/eval_bundle/eval_data/reading_comprehension/boolq.jsonl"
import json
with open(data_path, 'r', encoding='utf-8') as f:
    data = [json.loads(line.strip()) for line in f]

In [53]:
len(data)

3270

In [55]:
data[0]

{'query': "Passage: All biomass goes through at least some of these steps: it needs to be grown, collected, dried, fermented, distilled, and burned. All of these steps require resources and an infrastructure. The total amount of energy input into the process compared to the energy released by burning the resulting ethanol fuel is known as the energy balance (or ``energy returned on energy invested''). Figures compiled in a 2007 report by National Geographic Magazine point to modest results for corn ethanol produced in the US: one unit of fossil-fuel energy is required to create 1.3 energy units from the resulting ethanol. The energy balance for sugarcane ethanol produced in Brazil is more favorable, with one unit of fossil-fuel energy required to create 8 from the ethanol. Energy balance estimates are not easily produced, thus numerous such reports have been generated that are contradictory. For instance, a separate survey reports that production of ethanol from sugarcane, which requir

In [58]:
num_choices = [len(item['choices']) for item in data]
len(num_choices), min(num_choices), max(num_choices)

(3270, 2, 2)

Where did `eval_meta_data.csv` come from?

We download from:

```
EVAL_BUNDLE_URL = "https://karpathy-public.s3.us-west-2.amazonaws.com/eval_bundle.zip"
```

Download again now. Maybe it's been fixed.

In [61]:
!ls -l {get_base_dir()}/eval_bundle.zip

-rw-r--r--  1 ericsilberstein  staff  26060410 Nov 13 07:30 /Users/ericsilberstein/.cache/my_nanochat/eval_bundle.zip


In [65]:
!mv {get_base_dir()}/eval_bundle.zip {get_base_dir()}/eval_bundle-downloaded-on-2025-Nov-13.zip

In [67]:
!rm {get_base_dir()}/eval_bundle.zip.lock

In [69]:
!rm -rf {get_base_dir()}/eval_bundle

In [72]:
import os
os.environ["PYTHONPATH"] = "../my_nanochat"

In [73]:
# just to force it to download again
!python -m scripts.my_base_eval \
    --model-tag=d1 \
    --source=base \
    --max-per-task=11 \
    --tasks-to-run=jeopardy

Autodetected device type: mps
loading the model from /Users/ericsilberstein/.cache/my_nanochat/base_checkpoints/d1 with step 10
Building model with config: {'sequence_len': 256, 'vocab_size': 65536, 'n_layer': 1, 'n_head': 1, 'n_kv_head': 1, 'n_embd': 64}
downloading https://karpathy-public.s3.us-west-2.amazonaws.com/eval_bundle.zip...
downloaded to /Users/ericsilberstein/.cache/my_nanochat/eval_bundle.zip
placed eval_bundle dir at /Users/ericsilberstein/.cache/my_nanochat/eval_bundle
Evaluating: jeopardy (10-shot, type: language_modeling)... accuracy: 0.0000 | centered: 0.0000 | time: 0.64s
Model: base_model (step 10)
Task                               , Accuracy  , Centered  
jeopardy                           , 0.000000  , 0.000000  
CORE                               ,           , 0.000000  



In [74]:
eval_meta_data = pd.read_csv(f"{get_base_dir()}/eval_bundle/eval_meta_data.csv")
eval_meta_data[eval_meta_data['Eval Task'] == 'boolq']

Unnamed: 0,Eval Task,Task Category,Task Type,#shots,#datapoints,Random baseline,Centered Metric?,Description
45,boolq,reading comprehension,multiple choice,10,3270,62.0,,"BoolQ consists of 3,270 short passages on a d..."


Look on HuggingFace: https://huggingface.co/datasets/google/boolq

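Ah! The Google AI response said: "This compared to a 62% majority-class baseline (predicting the most frequent answer)". So the "random baseline" of 62 for boolq looks like a majority-class rate rather than the 50% chance rate for a two-choice question, which would explain why an accuracy of about 50.7% centers to a negative number.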
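A small sketch of what a majority-class baseline implies here. It compares the centered score under a 50% chance baseline with the 62 from `eval_meta_data.csv`. It also includes a hypothetical helper, `majority_class_rate`, showing how such a baseline could be computed from a list of labels. The toy labels are made up for illustration and are not boolq data.

```python
from collections import Counter

def centered_result(accuracy, random_baseline):
    # Same formula as the cell from my_base_eval.py; baseline is in percent.
    return (accuracy - 0.01 * random_baseline) / (1.0 - 0.01 * random_baseline)

def majority_class_rate(labels):
    """Accuracy of always predicting the most frequent label (hypothetical helper)."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

accuracy = 0.5070  # boolq accuracy used in the earlier cell

vs_chance = centered_result(accuracy, 50)    # two-choice chance level
vs_majority = centered_result(accuracy, 62)  # value from eval_meta_data.csv
print(f"centered vs 50% baseline: {vs_chance:+.4f}")
print(f"centered vs 62% baseline: {vs_majority:+.4f}")

# Made-up labels, only to show how the helper works.
toy_labels = [1, 1, 0, 1, 0]
toy_rate = majority_class_rate(toy_labels)
print(f"toy majority-class rate: {toy_rate:.2f}")
```

Against a 50% baseline the same accuracy would center to a small positive number, so the large negative value comes from the choice of baseline rather than from the model doing much worse than guessing.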

### #3 Compare ChatCORE mid to sft

I spent a lot of time in `challenge-32-redo-chat-eval-d20/investigate-runs.ipynb` trying to decide whether sft training worked and did anything good, and that was one of the motivations for adding reporting. One of a few good things about adding the reporting was computing the ChatCORE metric, which is the average of the centered chat core task results (ARC-Easy, ARC-Challenge, etc.).
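A minimal sketch of that average, using the mid column from the table in the chat eval mid section above. The per-task baselines are assumptions, not values confirmed in this notebook: 25% for the multiple-choice tasks (ARC-Easy, ARC-Challenge, MMLU) and 0% for the generative tasks.

```python
# Mid-checkpoint accuracies in percent, copied from the chat eval table above.
mid_accuracy = {
    "ARC-Easy": 43.18,
    "ARC-Challenge": 33.19,
    "MMLU": 33.07,
    "GSM8K": 3.41,
    "HumanEval": 6.71,
    "SpellingBee": 98.05,
}

# Assumed baselines in percent (not confirmed by this notebook):
# 25 for four-choice tasks, 0 for generative tasks.
assumed_baseline = {
    "ARC-Easy": 25.0,
    "ARC-Challenge": 25.0,
    "MMLU": 25.0,
    "GSM8K": 0.0,
    "HumanEval": 0.0,
    "SpellingBee": 0.0,
}

def centered(accuracy_pct, baseline_pct):
    """Rescale so the baseline maps to 0 and a perfect score maps to 1."""
    return (accuracy_pct - baseline_pct) / (100.0 - baseline_pct)

chatcore_mid = sum(
    centered(acc, assumed_baseline[task]) for task, acc in mid_accuracy.items()
) / len(mid_accuracy)
print(f"ChatCORE (mid) = {chatcore_mid:.4f}")
```

Under these assumptions the mid column works out to about 0.2568. SpellingBee dominates the average because its centered score is close to 1 while the others are well below 0.3.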

ChatCORE declines from 0.2568 to 0.2549 going from mid to sft. That doesn't really add anything to what I was trying to figure out in the other notebook, but it does seem like in this comment in speedrun.sh:

```
# train sft and re-eval right away (should see a small bump)
```

he might have been talking about ChatCORE, and I didn't see a small bump.