## Understand and add reporting

So far I've been skipping the code that logs metrics to reports. Now I want to understand how nanochat's reporting works, then go back to the train scripts (and, I think, the eval scripts too) and add it. After the last few runs I had to scroll through logs in my notebooks and/or wandb to pull out accuracy numbers, validation loss numbers, etc. I hope reporting reduces that pain and the risk of making errors.

I'm making the headings in this notebook <span style='color:green'>green</span> to distinguish from the report headings.

### <span style='color:green'>Report.py</span>

Look at [report.py](https://github.com/karpathy/nanochat/blob/master/nanochat/report.py)

It gets git info. That makes sense. What is `git status --porcelain`?

In [3]:
!git status --porcelain

 M readme.md
?? challenge-34-understand-reporting/
?? scratch.ipynb
?? scratch.txt


In [5]:
!git status --help | grep -C 3 porcelain

       --show-stash
           Show the number of entries currently stashed away.

       --porcelain[=<version>]
           Give the output in an easy-to-parse format for scripts. This is
           similar to the short output, but will remain stable across Git
           versions and regardless of user configuration. See below for
--
           ## branchname tracking info

   Porcelain Format Version 1
       Version 1 porcelain format is similar to the short format, but is
       guaranteed not to change in a backwards-incompatible way between Git
       versions or based on user configuration. This makes it ideal for
       parsing by scripts. The description of the short format above also
       describes the porcelain format, with a few exceptions:

        1. The user's color.status configuration is not respected; color will
           always be off.


So that's why it's called porcelain.

Good reminder to add my scratch files to `.gitignore`.

In [6]:
!git status --porcelain

 M .gitignore
 M readme.md
?? challenge-34-understand-reporting/
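
To convince myself the porcelain output really is easy to parse, here is a minimal sketch that turns it into `(status, path)` pairs and derives a "dirty" flag. `parse_porcelain` is a name I made up for illustration and is not taken from `report.py`. It assumes the version 1 layout of two status characters, a space, then the path, and it does not handle rename entries.

```python
def parse_porcelain(output: str) -> list[tuple[str, str]]:
    """Parse `git status --porcelain` (v1) text into (status, path) pairs.

    Hypothetical helper: assumes each line is two status characters,
    one space, then the path. Rename entries ("old -> new") are not split.
    """
    entries = []
    for line in output.splitlines():
        if not line.strip():
            continue  # skip blank lines
        status, path = line[:2], line[3:]
        entries.append((status, path))
    return entries


# The output from the cell above, pasted in as a string.
sample = " M .gitignore\n M readme.md\n?? challenge-34-understand-reporting/\n"
entries = parse_porcelain(sample)
is_dirty = len(entries) > 0  # any entry at all means the tree is dirty
print(entries)
print(is_dirty)
```

An empty string parses to an empty list, which is how a clean working tree would show up.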


Looks like it also records a bunch of system info, for example:

In [7]:
import platform
platform.system()

'Darwin'

In [10]:
import psutil
psutil.cpu_count(logical=False), psutil.cpu_count(logical=True)

(8, 8)

Didn't realize my laptop is considered to have 8 physical CPUs. I guess those are CPU cores.

In [11]:
!sysctl hw.physicalcpu hw.logicalcpu

hw.physicalcpu: 8
hw.logicalcpu: 8


In [17]:
!system_profiler SPHardwareDataType | grep -i chip

      Chip: Apple M1


In [18]:
!system_profiler SPHardwareDataType | grep -i core

      Total Number of Cores: 8 (4 performance and 4 efficiency)


What is this `files-to-prompt` thing?

In [24]:
!files-to-prompt

zsh:1: command not found: files-to-prompt


Oh, it's a Simon Willison tool: https://github.com/simonw/files-to-prompt

```
Concatenate a directory full of files into a single prompt for use with LLMs
```

OK, I get the idea of this reporting stuff. It writes each report as a markdown file in the `report` folder and also assembles a final report. I'll go ahead and copy most of it, but I don't think it will solve all my pain by itself.

In [1]:
import sys
sys.path.append('../my_nanochat')
from my_nanochat.my_report import get_git_info, get_gpu_info, get_system_info, generate_header

In [2]:
get_git_info()

{'commit': '33da382',
 'branch': 'master',
 'dirty': True,
 'message': 'started and finished (?) challenge 33: add chat CLI'}

In [3]:
get_gpu_info() # will be more interesting on a machine with GPUs

{'available': False}

In [4]:
get_system_info()

{'hostname': 'Erics-MacBook-Air-2.local',
 'platform': 'Darwin',
 'python_version': '3.10.18',
 'torch_version': '2.9.0',
 'cpu_count': 8,
 'cpu_count_logical': 8,
 'memory_gb': 16.0,
 'user': 'ericsilberstein',
 'nanochat_base_dir': 'out',
 'working_dir': '/Users/ericsilberstein/Documents/ericsilberstein1-repos/learn-nanochat/challenge-34-understand-reporting'}

Going to skip the cost and bloat stuff for now

In [6]:
from IPython.display import Markdown
Markdown(generate_header())

# nanochat training report

Generated: 2025-11-24 10:40:33

## Environment

### Git Information
- Branch: master
- Commit: 33da382 (dirty)
- Message: started and finished (?) challenge 33: add chat CLI

### Hardware
- Platform: Darwin
- CPUs: 8 cores (8 logical)
- Memory: 16.0 GB
- GPUs: None available

### Software
- Python: 3.10.18
- PyTorch: 2.9.0



Looking at functions `extract()` and `extract_timestamp()`...so we'll be reading from markdown reports we wrote previously?
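
To make the idea of reading values back out of a markdown report concrete, here is a rough sketch. It is not the code in `report.py`. `parse_section` and the sample text are made up, and it assumes a section is a `## ` heading, a `timestamp:` line, and `- key: value` bullets.

```python
from datetime import datetime


def parse_section(text: str) -> dict:
    """Pull the title, timestamp and `- key: value` bullets out of one section.

    Hypothetical helper. Keys or values that themselves contain ": " would
    need more careful handling than the simple split used here.
    """
    title = None
    timestamp = None
    metrics = {}
    for line in text.splitlines():
        if line.startswith("## ") and title is None:
            title = line[3:].strip()
        elif line.startswith("timestamp:"):
            raw = line.split(":", 1)[1].strip()
            timestamp = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S")
        elif line.startswith("- ") and ": " in line:
            key, _, value = line[2:].partition(": ")
            metrics[key.strip()] = value.strip()
    return {"title": title, "timestamp": timestamp, "metrics": metrics}


# Made-up section text, only for illustration.
sample = """## Example section
timestamp: 2025-11-24 12:00:00

- accuracy: 0.4000
- steps: 10
"""
parsed = parse_section(sample)
print(parsed["title"], parsed["timestamp"], parsed["metrics"])
```

Values come back as strings, so anything numeric still has to be converted by the caller.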

In [14]:
import sys
sys.path.append('../my_nanochat')
from my_nanochat.my_report import get_report

In [15]:
report = get_report()

In [16]:
file_path = report.log('test report', ['hello\n', {'a': 1.2345, 'b': 12_345, 'c': 12}])
file_path

'/Users/ericsilberstein/.cache/my_nanochat/report/test-report.md'

In [17]:
!cat {file_path}

## test report
timestamp: 2025-11-24 12:16:34

hello
- a: 1.2345
- b: 12,345
- c: 12
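
The file above suggests some formatting rules: the title becomes the file name, floats get four decimals, and large ints get thousands separators. Here is a sketch that reproduces this output. The function names are mine, and the 10,000 threshold for adding commas is a guess that happens to fit `12,345` getting a comma while `12` does not.

```python
from datetime import datetime


def slugify(title: str) -> str:
    """Turn a section title into a file-name stem (assumed rule)."""
    return title.lower().replace(" ", "-")


def format_value(value) -> str:
    """Format one value the way the report output above appears to."""
    if isinstance(value, float):
        return f"{value:.4f}"
    if isinstance(value, int) and value >= 10_000:  # threshold is an assumption
        return f"{value:,}"
    return str(value)


def render_section(title: str, items: list, when: datetime) -> str:
    """Render strings as-is and dicts as `- key: value` bullets."""
    out = f"## {title}\ntimestamp: {when.strftime('%Y-%m-%d %H:%M:%S')}\n\n"
    for item in items:
        if isinstance(item, str):
            out += item
        else:
            for key, value in item.items():
                out += f"- {key}: {format_value(value)}\n"
    return out + "\n"


text = render_section(
    "test report",
    ["hello\n", {"a": 1.2345, "b": 12_345, "c": 12}],
    datetime(2025, 11, 24, 12, 16, 34),
)
print(slugify("test report") + ".md")
print(text)
```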



In [18]:
!rm {file_path}

Will also skip `report.generate()` for now.

In [22]:
sorted(['a', 'b', 'c', 'd', 'e'], key=lambda x: (x != 'e', x == 'c'))

['e', 'a', 'b', 'd', 'c']

In [23]:
# a vs b
(True, False) == (True, False)

True

In [25]:
# c vs d
(True, True) > (True, False)

True

In [26]:
(10, 5) > (9, 7)

True
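
My reading of the experiments above: tuples compare element by element, `False` sorts before `True`, and Python's sort is stable. So a key like `(x != 'e', x == 'c')` pins `'e'` to the front and `'c'` to the back. The sketch below applies the same trick to a list of metric names, with the name itself added as a third element to break ties alphabetically. The list and the idea of putting CORE first and ChatCORE last are my own illustration.

```python
names = ["a", "b", "c", "d", "e"]


def key(x):
    return (x != "e", x == "c")


# Show the key each element sorts by.
for name in names:
    print(name, key(name))  # e -> (False, False); c -> (True, True); rest -> (True, False)

ordered = sorted(names, key=key)
print(ordered)

# Same idea for metric names: one pinned first, one pinned last, the rest alphabetical.
metrics = ["MMLU", "ChatCORE", "GSM8K", "CORE", "ARC-Easy", "HumanEval", "ARC-Challenge"]
ordered_metrics = sorted(metrics, key=lambda m: (m != "CORE", m == "ChatCORE", m))
print(ordered_metrics)
```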

In [4]:
import sys
sys.path.append('../my_nanochat')
from my_nanochat.my_report import get_report

In [5]:
report = get_report()
report.generate()

Generating report to /Users/ericsilberstein/.cache/my_nanochat/report/report.md
Copying report.md to current directory for convenience


'/Users/ericsilberstein/.cache/my_nanochat/report/report.md'

In [6]:
!cat report.md

## Summary

| Metric          | BASE     | MID      | SFT      | RL       |
|-----------------|----------|----------|----------|----------|

Total wall clock time: unknown


Now go back to all the train and eval scripts and write reports

### <span style='color:green'>my_base_train.py</span>

In [1]:
import os
os.environ["PYTHONPATH"] = "../my_nanochat"

In [18]:
# need to use depth=1 so all the CORE metrics can be evaluated on my Mac without OOM - see challenge 21
!python -m scripts.my_base_train \
    --depth=1 \
    --max_seq_len=256 \
    --device_batch_size=1 \
    --num_iterations=10 \
    --total_batch_size=256 \
    --eval_every=5 \
    --eval_tokens=1280 \
    --core_metric_every=20 \
    --core_metric_max_per_task=11 \
    --sample_every=5

overriding depth = 1
overriding max_seq_len = 256
overriding device_batch_size = 1
overriding num_iterations = 10
overriding total_batch_size = 256
overriding eval_every = 5
overriding eval_tokens = 1280
overriding core_metric_every = 20
overriding core_metric_max_per_task = 11
overriding sample_every = 5
user_config: {'run': 'dummy', 'device_type': '', 'depth': 1, 'max_seq_len': 256, 'num_iterations': 10, 'target_param_data_ratio': 20, 'device_batch_size': 1, 'total_batch_size': 256, 'embedding_lr': 0.2, 'unembedding_lr': 0.004, 'weight_decay': 0.0, 'matrix_lr': 0.02, 'grad_clip': 1.0, 'warmup_ratio': 0.0, 'warmdown_ratio': 0.2, 'final_lr_frac': 0.0, 'eval_every': 5, 'eval_tokens': 1280, 'core_metric_every': 20, 'core_metric_max_per_task': 11, 'sample_every': 5, 'model_tag': ''}
Autodetected device type: mps
This process is ddp_rank: 0, ddp_local_rank: 0, ddp_world_size: 1
Vocab size: 65,536
num_layers: 1
model_dim: 64
num_heads: 1
num_kv_heads: 1
Tokens / micro-batch / rank: 1 x 256 

In [24]:
import sys
sys.path.append('../my_nanochat')
import os
from IPython.display import Markdown
from my_nanochat.my_common import get_base_dir

In [25]:
!ls {get_base_dir()}/report

base-model-training.md report.md


In [26]:
Markdown(open(os.path.join(get_base_dir(), 'report', 'base-model-training.md')).read())

## Base model training
timestamp: 2025-11-24 15:34:09

- run: dummy
- device_type: 
- depth: 1
- max_seq_len: 256
- num_iterations: 10
- target_param_data_ratio: 20
- device_batch_size: 1
- total_batch_size: 256
- embedding_lr: 0.2000
- unembedding_lr: 0.0040
- weight_decay: 0.0000
- matrix_lr: 0.0200
- grad_clip: 1.0000
- warmup_ratio: 0.0000
- warmdown_ratio: 0.2000
- final_lr_frac: 0.0000
- eval_every: 5
- eval_tokens: 1280
- core_metric_every: 20
- core_metric_max_per_task: 11
- sample_every: 5
- model_tag: 
- Number of parameters: 8,437,760
- Calculated number of iterations: 10
- Number of training tokens: 2560
- Tokens : Params ratio: 0.0003
- DDP world size: 1
- warmup_ratio: 0.0000
- warmdown_ratio: 0.2000
- final_lr_frac: 0.0000
- Minimum validation bpb: 3.3075
- Final validation bpb: 3.3075
- CORE metric estimate: -0.0093
- Total training time: 0.00m
- Peak memory usage: 0.00MiB



### <span style='color:green'>my_base_eval.py</span>

Here I'll add both logging to a report and writing to a CSV file, both left out earlier.

In [29]:
!python -m scripts.my_base_eval \
    --model-tag=d1 \
    --source=base \
    --max-per-task=11

Autodetected device type: mps
loading the model from /Users/ericsilberstein/.cache/my_nanochat/base_checkpoints/d1 with step 10
Building model with config: {'sequence_len': 256, 'vocab_size': 65536, 'n_layer': 1, 'n_head': 1, 'n_kv_head': 1, 'n_embd': 64}
Evaluating: hellaswag_zeroshot (0-shot, type: multiple_choice)... accuracy: 0.4545 | centered: 0.2727 | time: 0.83s
Evaluating: jeopardy (10-shot, type: language_modeling)... accuracy: 0.0000 | centered: 0.0000 | time: 0.37s
Evaluating: bigbench_qa_wikidata (10-shot, type: language_modeling)... accuracy: 0.0000 | centered: 0.0000 | time: 0.25s
Evaluating: arc_easy (10-shot, type: multiple_choice)... accuracy: 0.0909 | centered: -0.2121 | time: 2.21s
Evaluating: arc_challenge (10-shot, type: multiple_choice)... accuracy: 0.0000 | centered: -0.3333 | time: 2.86s
Evaluating: copa (0-shot, type: multiple_choice)... accuracy: 0.7273 | centered: 0.4545 | time: 0.40s
Evaluating: commonsense_qa (10-shot, type: multiple_choice)... accuracy: 0.

^ Well, that's a lot easier to read! Hopefully see the ugly formatting of negatives once doing this for real.

In [30]:
!ls {get_base_dir()}/report

base-model-evaluation.md base-model-training.md   report.md


In [31]:
Markdown(open(os.path.join(get_base_dir(), 'report', 'base-model-evaluation.md')).read())

## Base model evaluation
timestamp: 2025-11-24 15:58:21

- Model: base_model (step 10)
- CORE metric: -0.0093
- hellaswag_zeroshot: 0.2727
- jeopardy: 0.0000
- bigbench_qa_wikidata: 0.0000
- arc_easy: -0.2121
- arc_challenge: -0.3333
- copa: 0.4545
- commonsense_qa: -0.1364
- piqa: -0.0909
- openbook_qa: 0.1515
- lambada_openai: 0.0000
- hellaswag: 0.2727
- winograd: 0.0909
- winogrande: -0.0909
- bigbench_dyck_languages: 0.0000
- agi_eval_lsat_ar: 0.0909
- bigbench_cs_algorithms: 0.0000
- bigbench_operators: 0.0000
- bigbench_repeat_copy_logic: 0.0000
- squad: 0.0000
- coqa: 0.0000
- boolq: -0.6746
- bigbench_language_identification: -0.0001
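
As a sanity check on these numbers, here is a sketch of how I think the centered values and the CORE metric relate. It assumes a centered score is `(accuracy - baseline) / (1 - baseline)` and that CORE is the plain mean of the centered scores. The baselines (0.25 for four-way multiple choice, 0.5 for two-way) are my assumption, but they do reproduce the values printed above.

```python
def centered(accuracy: float, baseline: float) -> float:
    """Rescale so random guessing maps to 0 and a perfect score maps to 1."""
    return (accuracy - baseline) / (1.0 - baseline)


# With --max-per-task=11 the printed accuracies are multiples of 1/11.
print(round(centered(5 / 11, 0.25), 4))  # hellaswag_zeroshot -> 0.2727
print(round(centered(8 / 11, 0.50), 4))  # copa               -> 0.4545
print(round(centered(1 / 11, 0.25), 4))  # arc_easy           -> -0.2121

# The 22 centered scores from the report above, in order.
centered_scores = [
    0.2727, 0.0, 0.0, -0.2121, -0.3333, 0.4545, -0.1364, -0.0909,
    0.1515, 0.0, 0.2727, 0.0909, -0.0909, 0.0, 0.0909, 0.0,
    0.0, 0.0, 0.0, 0.0, -0.6746, -0.0001,
]
core = sum(centered_scores) / len(centered_scores)
print(round(core, 4))  # -0.0093
```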



### <span style='color:green'>my_base_loss.py</span>

I never created this script earlier. I'll do it now so I can log loss. I see that this script also calculates loss on the training data, using more tokens than the training batches.

In [35]:
!python -m scripts.my_base_loss \
    --device_batch_size=1 \
    --split_tokens=2048 \
    --model_tag=d1

overriding device_batch_size = 1
overriding split_tokens = 2048
overriding model_tag = d1
user_config: {'device_batch_size': 1, 'split_tokens': 2048, 'device_type': ''}
Autodetected device type: mps
loading the model from /Users/ericsilberstein/.cache/my_nanochat/base_checkpoints/d1 with step 10
Building model with config: {'sequence_len': 256, 'vocab_size': 65536, 'n_layer': 1, 'n_head': 1, 'n_kv_head': 1, 'n_embd': 64}
train bpb: 1.9768
val bpb: 3.1200
<|bos|>The capital of France is the the the the the the the the the the the the the the the the
<|bos|>The chemical symbol of gold is the the the the the the the the the the the the the the the the
<|bos|>If yesterday was Friday, then tomorrow will be the the the the the the the the the the the the the the the the
<|bos|>The opposite of hot is the the the the the the the the the the the the the the the the
<|bos|>The planets of the solar system are:  the the the the the the the the the the the the the the the
<|bos|>My favorite color i

In [36]:
!ls {get_base_dir()}/report

base-model-evaluation.md base-model-training.md
base-model-loss.md       report.md


In [39]:
Markdown(open(os.path.join(get_base_dir(), 'report', 'base-model-loss.md')).read())

## Base model loss
timestamp: 2025-11-24 16:28:20

- train bpb: 1.9768
- val bpb: 3.1200
- sample 0: <|bos|>The capital of France is the the the the the the the the the the the the the the the the
- sample 1: <|bos|>The chemical symbol of gold is the the the the the the the the the the the the the the the the
- sample 2: <|bos|>If yesterday was Friday, then tomorrow will be the the the the the the the the the the the the the the the the
- sample 3: <|bos|>The opposite of hot is the the the the the the the the the the the the the the the the
- sample 4: <|bos|>The planets of the solar system are:  the the the the the the the the the the the the the the the
- sample 5: <|bos|>My favorite color is the the the the the the the the the the the the the the the the
- sample 6: <|bos|>If 5*x + 3 = 13, then x is the the the the the the the the the the the the the the the the



### <span style='color:green'>my_mid_train.py</span>

In [42]:
!python -m scripts.my_mid_train \
    --model_tag=d1 \
    --num_iterations=10 \
    --max_seq_len=256 \
    --device_batch_size=1 \
    --total_batch_size=256 \
    --eval_every=5 \
    --eval_tokens=1024

overriding model_tag = d1
overriding num_iterations = 10
overriding max_seq_len = 256
overriding device_batch_size = 1
overriding total_batch_size = 256
overriding eval_every = 5
overriding eval_tokens = 1024
user_config: {'run': 'dummy', 'device_type': '', 'dtype': 'bfloat16', 'num_iterations': 10, 'max_seq_len': 256, 'device_batch_size': 1, 'unembedding_lr': 0.004, 'embedding_lr': 0.2, 'matrix_lr': 0.02, 'init_lr_frac': 1.0, 'weight_decay': 0.0, 'eval_every': 5, 'eval_tokens': 1024, 'total_batch_size': 256, 'dry_run': 0}
Autodetected device type: mps
loading the model from /Users/ericsilberstein/.cache/my_nanochat/base_checkpoints/d1 with step 10
Building model with config: {'sequence_len': 256, 'vocab_size': 65536, 'n_layer': 1, 'n_head': 1, 'n_kv_head': 1, 'n_embd': 64}
Tokens / micro-batch / rank: 1 x 256 = 256
Tokens / micro-batch: 256
Total batch size 256 => gradient accumulation steps: 1
Scaling the LR for the AdamW parameters proportional to 1/sqrt(64/768) = 3.464101615137755


In [43]:
!ls {get_base_dir()}/report

base-model-evaluation.md base-model-training.md   report.md
base-model-loss.md       midtraining.md


In [44]:
Markdown(open(os.path.join(get_base_dir(), 'report', 'midtraining.md')).read())

## Midtraining
timestamp: 2025-11-24 16:41:44

- run: dummy
- device_type: 
- dtype: bfloat16
- num_iterations: 10
- max_seq_len: 256
- device_batch_size: 1
- unembedding_lr: 0.0040
- embedding_lr: 0.2000
- matrix_lr: 0.0200
- init_lr_frac: 1.0000
- weight_decay: 0.0000
- eval_every: 5
- eval_tokens: 1024
- total_batch_size: 256
- dry_run: 0
- Number of iterations: 9
- DDP world size: 1
- Minimum validation bpb: 2.3758



### <span style='color:green'>my_chat_eval.py</span>

In [51]:
!python -m scripts.my_chat_eval \
    --source=mid \
    --batch-size=1 \
    --model-tag=d1 \
    --max-problems=5

Autodetected device type: mps
loading the model from /Users/ericsilberstein/.cache/my_nanochat/mid_checkpoints/d1 with step 9
Building model with config: {'sequence_len': 256, 'vocab_size': 65536, 'n_layer': 1, 'n_head': 1, 'n_kv_head': 1, 'n_embd': 64}
final: 0/5 (0.00%)
ARC-Easy accuracy: 0.00%
final: 0/5 (0.00%)
ARC-Challenge accuracy: 0.00%
final: 2/5 (40.00%)
MMLU accuracy: 40.00%
final: 0/5 (0.00%)
GSM8K accuracy: 0.00%
final: 0/5 (0.00%)
HumanEval accuracy: 0.00%
final: 0/5 (0.00%)
SpellingBee accuracy: 0.00%


In [52]:
!ls {get_base_dir()}/report

base-model-evaluation.md base-model-training.md   midtraining.md
base-model-loss.md       chat-evaluation-mid.md   report.md


In [53]:
Markdown(open(os.path.join(get_base_dir(), 'report', 'chat-evaluation-mid.md')).read())

## Chat evaluation mid
timestamp: 2025-11-24 16:54:34

- source: mid
- task_name: None
- dtype: bfloat16
- temperature: 0.0000
- max_new_tokens: 512
- num_samples: 1
- top_k: 50
- batch_size: 1
- model_tag: d1
- step: None
- max_problems: 5
- print_failed: False
- device_type: 
- ARC-Easy: 0.0000
- ARC-Challenge: 0.0000
- MMLU: 0.4000
- GSM8K: 0.0000
- HumanEval: 0.0000
- SpellingBee: 0.0000
- ChatCORE metric: -0.0778
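
The ChatCORE value can be reproduced from the six accuracies above. The sketch assumes ChatCORE is the mean of baseline-centered accuracies, with a 0.25 baseline for the three multiple-choice tasks and 0 for the generative ones. Those baselines are my assumption. They do give the reported -0.0778.

```python
# Accuracies from the report above.
accuracies = {
    "ARC-Easy": 0.0,
    "ARC-Challenge": 0.0,
    "MMLU": 0.4,
    "GSM8K": 0.0,
    "HumanEval": 0.0,
    "SpellingBee": 0.0,
}

# Assumed random-guess baselines: four-way multiple choice vs. open-ended generation.
baselines = {
    "ARC-Easy": 0.25,
    "ARC-Challenge": 0.25,
    "MMLU": 0.25,
    "GSM8K": 0.0,
    "HumanEval": 0.0,
    "SpellingBee": 0.0,
}

centered = {
    task: (acc - baselines[task]) / (1.0 - baselines[task])
    for task, acc in accuracies.items()
}
chatcore = sum(centered.values()) / len(centered)
print(round(chatcore, 4))  # -0.0778
```

This also explains why the metric is negative even though MMLU scored 40%: the two ARC tasks at 0% sit well below their assumed 25% baseline.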



### <span style='color:green'>my_chat_sft.py</span>

In [54]:
!python -m scripts.my_chat_sft \
    --model_tag=d1 \
    --num_iterations=10 \
    --device_batch_size=1 \
    --target_examples_per_step=4 \
    --eval_every=5 \
    --eval_steps=10 \
    --eval_metrics_every=5 \
    --eval_metrics_max_problems=2 \
    --max_data_tokens=1280

overriding model_tag = d1
overriding num_iterations = 10
overriding device_batch_size = 1
overriding target_examples_per_step = 4
overriding eval_every = 5
overriding eval_steps = 10
overriding eval_metrics_every = 5
overriding eval_metrics_max_problems = 2
overriding max_data_tokens = 1280
user_config: {'run': 'dummy', 'source': 'mid', 'device_type': '', 'dtype': 'bfloat16', 'device_batch_size': 1, 'num_epochs': 1, 'num_iterations': 10, 'max_data_tokens': 1280, 'target_examples_per_step': 4, 'unembedding_lr': 0.004, 'embedding_lr': 0.2, 'matrix_lr': 0.02, 'weight_decay': 0.0, 'init_lr_frac': 0.02, 'eval_every': 5, 'eval_steps': 10, 'eval_metrics_every': 5, 'eval_metrics_max_problems': 2}
Autodetected device type: mps
loading the model from /Users/ericsilberstein/.cache/my_nanochat/mid_checkpoints/d1 with step 9
Building model with config: {'sequence_len': 256, 'vocab_size': 65536, 'n_layer': 1, 'n_head': 1, 'n_kv_head': 1, 'n_embd': 64}
Target examples per step: 4
Device batch size: 1

In [55]:
!ls {get_base_dir()}/report

base-model-evaluation.md chat-evaluation-mid.md   report.md
base-model-loss.md       chat-sft.md
base-model-training.md   midtraining.md


In [56]:
Markdown(open(os.path.join(get_base_dir(), 'report', 'chat-sft.md')).read())

## Chat SFT
timestamp: 2025-11-24 17:33:28

- run: dummy
- source: mid
- device_type: 
- dtype: bfloat16
- device_batch_size: 1
- num_epochs: 1
- num_iterations: 10
- max_data_tokens: 1280
- target_examples_per_step: 4
- unembedding_lr: 0.0040
- embedding_lr: 0.2000
- matrix_lr: 0.0200
- weight_decay: 0.0000
- init_lr_frac: 0.0200
- eval_every: 5
- eval_steps: 10
- eval_metrics_every: 5
- eval_metrics_max_problems: 2
- Training rows: 22,439
- Number of iterations: 10
- Training loss: 10.3042
- Validation loss: 8.6608



Do chat_eval again for sft to get that report written too:

In [57]:
!python -m scripts.my_chat_eval \
    --source=sft \
    --batch-size=1 \
    --model-tag=d1 \
    --max-problems=5

Autodetected device type: mps
loading the model from /Users/ericsilberstein/.cache/my_nanochat/chatsft_checkpoints/d1 with step 9
Building model with config: {'sequence_len': 256, 'vocab_size': 65536, 'n_layer': 1, 'n_head': 1, 'n_kv_head': 1, 'n_embd': 64}
final: 0/5 (0.00%)
ARC-Easy accuracy: 0.00%
final: 0/5 (0.00%)
ARC-Challenge accuracy: 0.00%
final: 2/5 (40.00%)
MMLU accuracy: 40.00%
final: 0/5 (0.00%)
GSM8K accuracy: 0.00%
final: 0/5 (0.00%)
HumanEval accuracy: 0.00%
final: 0/5 (0.00%)
SpellingBee accuracy: 0.00%


In [58]:
!ls {get_base_dir()}/report

base-model-evaluation.md chat-evaluation-mid.md   midtraining.md
base-model-loss.md       chat-evaluation-sft.md   report.md
base-model-training.md   chat-sft.md


In [59]:
Markdown(open(os.path.join(get_base_dir(), 'report', 'chat-evaluation-sft.md')).read())

## Chat evaluation sft
timestamp: 2025-11-24 17:36:54

- source: sft
- task_name: None
- dtype: bfloat16
- temperature: 0.0000
- max_new_tokens: 512
- num_samples: 1
- top_k: 50
- batch_size: 1
- model_tag: d1
- step: None
- max_problems: 5
- print_failed: False
- device_type: 
- ARC-Easy: 0.0000
- ARC-Challenge: 0.0000
- MMLU: 0.4000
- GSM8K: 0.0000
- HumanEval: 0.0000
- SpellingBee: 0.0000
- ChatCORE metric: -0.0778



### <span style='color:green'>Generate full report</span>

Generate the full report again. Add the main part to `my_report.py` first so I can run it as a script.

In [63]:
!python -m my_nanochat.my_report

Generating report to /Users/ericsilberstein/.cache/my_nanochat/report/report.md
Copying report.md to current directory for convenience


In [64]:
Markdown(open('report.md').read())

## Base model training
timestamp: 2025-11-24 15:34:09

- run: dummy
- device_type: 
- depth: 1
- max_seq_len: 256
- num_iterations: 10
- target_param_data_ratio: 20
- device_batch_size: 1
- total_batch_size: 256
- embedding_lr: 0.2000
- unembedding_lr: 0.0040
- weight_decay: 0.0000
- matrix_lr: 0.0200
- grad_clip: 1.0000
- warmup_ratio: 0.0000
- warmdown_ratio: 0.2000
- final_lr_frac: 0.0000
- eval_every: 5
- eval_tokens: 1280
- core_metric_every: 20
- core_metric_max_per_task: 11
- sample_every: 5
- model_tag: 
- Number of parameters: 8,437,760
- Calculated number of iterations: 10
- Number of training tokens: 2560
- Tokens : Params ratio: 0.0003
- DDP world size: 1
- warmup_ratio: 0.0000
- warmdown_ratio: 0.2000
- final_lr_frac: 0.0000
- Minimum validation bpb: 3.3075
- Final validation bpb: 3.3075
- CORE metric estimate: -0.0093
- Total training time: 0.00m
- Peak memory usage: 0.00MiB


## Base model loss
timestamp: 2025-11-24 16:28:20

- train bpb: 1.9768
- val bpb: 3.1200
- sample 0: <|bos|>The capital of France is the the the the the the the the the the the the the the the the
- sample 1: <|bos|>The chemical symbol of gold is the the the the the the the the the the the the the the the the
- sample 2: <|bos|>If yesterday was Friday, then tomorrow will be the the the the the the the the the the the the the the the the
- sample 3: <|bos|>The opposite of hot is the the the the the the the the the the the the the the the the
- sample 4: <|bos|>The planets of the solar system are:  the the the the the the the the the the the the the the the
- sample 5: <|bos|>My favorite color is the the the the the the the the the the the the the the the the
- sample 6: <|bos|>If 5*x + 3 = 13, then x is the the the the the the the the the the the the the the the the


## Base model evaluation
timestamp: 2025-11-24 15:58:21

- Model: base_model (step 10)
- CORE metric: -0.0093
- hellaswag_zeroshot: 0.2727
- jeopardy: 0.0000
- bigbench_qa_wikidata: 0.0000
- arc_easy: -0.2121
- arc_challenge: -0.3333
- copa: 0.4545
- commonsense_qa: -0.1364
- piqa: -0.0909
- openbook_qa: 0.1515
- lambada_openai: 0.0000
- hellaswag: 0.2727
- winograd: 0.0909
- winogrande: -0.0909
- bigbench_dyck_languages: 0.0000
- agi_eval_lsat_ar: 0.0909
- bigbench_cs_algorithms: 0.0000
- bigbench_operators: 0.0000
- bigbench_repeat_copy_logic: 0.0000
- squad: 0.0000
- coqa: 0.0000
- boolq: -0.6746
- bigbench_language_identification: -0.0001


## Midtraining
timestamp: 2025-11-24 16:41:44

- run: dummy
- device_type: 
- dtype: bfloat16
- num_iterations: 10
- max_seq_len: 256
- device_batch_size: 1
- unembedding_lr: 0.0040
- embedding_lr: 0.2000
- matrix_lr: 0.0200
- init_lr_frac: 1.0000
- weight_decay: 0.0000
- eval_every: 5
- eval_tokens: 1024
- total_batch_size: 256
- dry_run: 0
- Number of iterations: 9
- DDP world size: 1
- Minimum validation bpb: 2.3758


## Chat evaluation mid
timestamp: 2025-11-24 16:54:34

- source: mid
- task_name: None
- dtype: bfloat16
- temperature: 0.0000
- max_new_tokens: 512
- num_samples: 1
- top_k: 50
- batch_size: 1
- model_tag: d1
- step: None
- max_problems: 5
- print_failed: False
- device_type: 
- ARC-Easy: 0.0000
- ARC-Challenge: 0.0000
- MMLU: 0.4000
- GSM8K: 0.0000
- HumanEval: 0.0000
- SpellingBee: 0.0000
- ChatCORE metric: -0.0778


## Chat SFT
timestamp: 2025-11-24 17:33:28

- run: dummy
- source: mid
- device_type: 
- dtype: bfloat16
- device_batch_size: 1
- num_epochs: 1
- num_iterations: 10
- max_data_tokens: 1280
- target_examples_per_step: 4
- unembedding_lr: 0.0040
- embedding_lr: 0.2000
- matrix_lr: 0.0200
- weight_decay: 0.0000
- init_lr_frac: 0.0200
- eval_every: 5
- eval_steps: 10
- eval_metrics_every: 5
- eval_metrics_max_problems: 2
- Training rows: 22,439
- Number of iterations: 10
- Training loss: 10.3042
- Validation loss: 8.6608


## Chat evaluation sft
timestamp: 2025-11-24 17:36:54

- source: sft
- task_name: None
- dtype: bfloat16
- temperature: 0.0000
- max_new_tokens: 512
- num_samples: 1
- top_k: 50
- batch_size: 1
- model_tag: d1
- step: None
- max_problems: 5
- print_failed: False
- device_type: 
- ARC-Easy: 0.0000
- ARC-Challenge: 0.0000
- MMLU: 0.4000
- GSM8K: 0.0000
- HumanEval: 0.0000
- SpellingBee: 0.0000
- ChatCORE metric: -0.0778


## Summary

| Metric          | BASE     | MID      | SFT      | RL       |
|-----------------|----------|----------|----------|----------|
| CORE            | -0.0093  | -        | -        | -        |
| ARC-Challenge   | -        | 0.0000   | 0.0000   | -        |
| ARC-Easy        | -        | 0.0000   | 0.0000   | -        |
| GSM8K           | -        | 0.0000   | 0.0000   | -        |
| HumanEval       | -        | 0.0000   | 0.0000   | -        |
| MMLU            | -        | 0.4000   | 0.4000   | -        |
| ChatCORE        | -        | -0.0778  | -0.0778  | -        |

Total wall clock time: unknown


### <span style='color:green'>reset()</span>

I forgot to add reset earlier, but that's what records the start time and the other overall header info. That's why the report above doesn't show things like the git branch, number of CPUs, or Python version. Add it now. (I'll stick with `reset()`, but maybe `start()` would be a better name.)

In [67]:
!ls {get_base_dir()}/report

base-model-evaluation.md chat-evaluation-mid.md   midtraining.md
base-model-loss.md       chat-evaluation-sft.md   report.md
base-model-training.md   chat-sft.md


In [68]:
!python -m my_nanochat.my_report reset

Reset report and wrote header to /Users/ericsilberstein/.cache/my_nanochat/report/header.md


In [69]:
!ls {get_base_dir()}/report

header.md
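
Going by the before and after listings, reset clears out the old section files and leaves only a fresh `header.md`. Here is a hypothetical sketch of that behavior run against a temporary directory. `reset_report_dir` is a made-up name and this is not the actual `my_report.py` code.

```python
import tempfile
from datetime import datetime
from pathlib import Path


def reset_report_dir(report_dir: Path, header: str) -> Path:
    """Delete old markdown sections and write a fresh header with a start time."""
    report_dir.mkdir(parents=True, exist_ok=True)
    for old in report_dir.glob("*.md"):
        old.unlink()
    started = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    header_path = report_dir / "header.md"
    header_path.write_text(f"{header}\nRun started: {started}\n")
    return header_path


with tempfile.TemporaryDirectory() as tmp:
    report_dir = Path(tmp) / "report"
    report_dir.mkdir()
    # Pretend a previous run left a section behind.
    (report_dir / "midtraining.md").write_text("## Midtraining\n")
    reset_report_dir(report_dir, "# nanochat training report\n")
    remaining = sorted(p.name for p in report_dir.iterdir())
    header_text = (report_dir / "header.md").read_text()

print(remaining)
```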


In [71]:
!python -m my_nanochat.my_report

Generating report to /Users/ericsilberstein/.cache/my_nanochat/report/report.md
Copying report.md to current directory for convenience


In [72]:
Markdown(open('report.md').read())

# nanochat training report

Generated: 2025-11-24 17:53:37

## Environment

### Git Information
- Branch: master
- Commit: 33da382 (dirty)
- Message: started and finished (?) challenge 33: add chat CLI

### Hardware
- Platform: Darwin
- CPUs: 8 cores (8 logical)
- Memory: 16.0 GB
- GPUs: None available

### Software
- Python: 3.10.18
- PyTorch: 2.9.0

Run started: 2025-11-24 17:53:37

--

## Summary

| Metric          | BASE     | MID      | SFT      | RL       |
|-----------------|----------|----------|----------|----------|

Total wall clock time: unknown


### <span style='color:green'>Run everything so far</span>

This will be similar to how things work in [speedrun.sh](https://github.com/karpathy/nanochat/blob/master/speedrun.sh)

In [77]:
!python -m my_nanochat.my_report reset

!python -m scripts.my_base_train \
    --depth=1 \
    --max_seq_len=256 \
    --device_batch_size=1 \
    --num_iterations=10 \
    --total_batch_size=256 \
    --eval_every=5 \
    --eval_tokens=1280 \
    --core_metric_every=20 \
    --core_metric_max_per_task=11 \
    --sample_every=5

!python -m scripts.my_base_eval \
    --model-tag=d1 \
    --source=base \
    --max-per-task=11

!python -m scripts.my_base_loss \
    --device_batch_size=1 \
    --split_tokens=2048 \
    --model_tag=d1

!python -m scripts.my_mid_train \
    --model_tag=d1 \
    --num_iterations=10 \
    --max_seq_len=256 \
    --device_batch_size=1 \
    --total_batch_size=256 \
    --eval_every=5 \
    --eval_tokens=1024

!python -m scripts.my_chat_eval \
    --source=mid \
    --batch-size=1 \
    --model-tag=d1 \
    --max-problems=5

!python -m scripts.my_chat_sft \
    --model_tag=d1 \
    --num_iterations=10 \
    --device_batch_size=1 \
    --target_examples_per_step=4 \
    --eval_every=5 \
    --eval_steps=10 \
    --eval_metrics_every=5 \
    --eval_metrics_max_problems=2 \
    --max_data_tokens=1280

!python -m scripts.my_chat_eval \
    --source=sft \
    --batch-size=1 \
    --model-tag=d1 \
    --max-problems=5

!python -m my_nanochat.my_report

Reset report and wrote header to /Users/ericsilberstein/.cache/my_nanochat/report/header.md
overriding depth = 1
overriding max_seq_len = 256
overriding device_batch_size = 1
overriding num_iterations = 10
overriding total_batch_size = 256
overriding eval_every = 5
overriding eval_tokens = 1280
overriding core_metric_every = 20
overriding core_metric_max_per_task = 11
overriding sample_every = 5
user_config: {'run': 'dummy', 'device_type': '', 'depth': 1, 'max_seq_len': 256, 'num_iterations': 10, 'target_param_data_ratio': 20, 'device_batch_size': 1, 'total_batch_size': 256, 'embedding_lr': 0.2, 'unembedding_lr': 0.004, 'weight_decay': 0.0, 'matrix_lr': 0.02, 'grad_clip': 1.0, 'warmup_ratio': 0.0, 'warmdown_ratio': 0.2, 'final_lr_frac': 0.0, 'eval_every': 5, 'eval_tokens': 1280, 'core_metric_every': 20, 'core_metric_max_per_task': 11, 'sample_every': 5, 'model_tag': ''}
Autodetected device type: mps
This process is ddp_rank: 0, ddp_local_rank: 0, ddp_world_size: 1
Vocab size: 65,536
nu

In [78]:
!ls -lt {get_base_dir()}/report

total 80
-rw-r--r--  1 ericsilberstein  staff  5244 Nov 24 18:05 report.md
-rw-r--r--  1 ericsilberstein  staff   423 Nov 24 18:05 chat-evaluation-sft.md
-rw-r--r--  1 ericsilberstein  staff   524 Nov 24 18:04 chat-sft.md
-rw-r--r--  1 ericsilberstein  staff   423 Nov 24 18:04 chat-evaluation-mid.md
-rw-r--r--  1 ericsilberstein  staff   424 Nov 24 18:03 midtraining.md
-rw-r--r--  1 ericsilberstein  staff   879 Nov 24 18:02 base-model-loss.md
-rw-r--r--  1 ericsilberstein  staff   652 Nov 24 18:02 base-model-evaluation.md
-rw-r--r--  1 ericsilberstein  staff   903 Nov 24 18:02 base-model-training.md
-rw-r--r--  1 ericsilberstein  staff   392 Nov 24 18:01 header.md


In [79]:
Markdown(open('report.md').read())

# nanochat training report

Generated: 2025-11-24 18:01:13

## Environment

### Git Information
- Branch: master
- Commit: 33da382 (dirty)
- Message: started and finished (?) challenge 33: add chat CLI

### Hardware
- Platform: Darwin
- CPUs: 8 cores (8 logical)
- Memory: 16.0 GB
- GPUs: None available

### Software
- Python: 3.10.18
- PyTorch: 2.9.0

Run started: 2025-11-24 18:01:13

--

## Base model training
timestamp: 2025-11-24 18:02:01

- run: dummy
- device_type: 
- depth: 1
- max_seq_len: 256
- num_iterations: 10
- target_param_data_ratio: 20
- device_batch_size: 1
- total_batch_size: 256
- embedding_lr: 0.2000
- unembedding_lr: 0.0040
- weight_decay: 0.0000
- matrix_lr: 0.0200
- grad_clip: 1.0000
- warmup_ratio: 0.0000
- warmdown_ratio: 0.2000
- final_lr_frac: 0.0000
- eval_every: 5
- eval_tokens: 1280
- core_metric_every: 20
- core_metric_max_per_task: 11
- sample_every: 5
- model_tag: 
- Number of parameters: 8,437,760
- Calculated number of iterations: 10
- Number of training tokens: 2560
- Tokens : Params ratio: 0.0003
- DDP world size: 1
- warmup_ratio: 0.0000
- warmdown_ratio: 0.2000
- final_lr_frac: 0.0000
- Minimum validation bpb: 3.3074
- Final validation bpb: 3.3074
- CORE metric estimate: -0.0093
- Total training time: 0.00m
- Peak memory usage: 0.00MiB


## Base model loss
timestamp: 2025-11-24 18:02:53

- train bpb: 1.9767
- val bpb: 3.1199
- sample 0: <|bos|>The capital of France is the the the the the the the the the the the the the the the the
- sample 1: <|bos|>The chemical symbol of gold is the the the the the the the the the the the the the the the the
- sample 2: <|bos|>If yesterday was Friday, then tomorrow will be the the the the the the the the the the the the the the the the
- sample 3: <|bos|>The opposite of hot is the the the the the the the the the the the the the the the the
- sample 4: <|bos|>The planets of the solar system are:  the the the the the the the the the the the the the the the
- sample 5: <|bos|>My favorite color is the the the the the the the the the the the the the the the the
- sample 6: <|bos|>If 5*x + 3 = 13, then x is the the the the the the the the the the the the the the the the


## Base model evaluation
timestamp: 2025-11-24 18:02:48

- Model: base_model (step 10)
- CORE metric: -0.0093
- hellaswag_zeroshot: 0.2727
- jeopardy: 0.0000
- bigbench_qa_wikidata: 0.0000
- arc_easy: -0.2121
- arc_challenge: -0.3333
- copa: 0.4545
- commonsense_qa: -0.1364
- piqa: -0.0909
- openbook_qa: 0.1515
- lambada_openai: 0.0000
- hellaswag: 0.2727
- winograd: 0.0909
- winogrande: -0.0909
- bigbench_dyck_languages: 0.0000
- agi_eval_lsat_ar: 0.0909
- bigbench_cs_algorithms: 0.0000
- bigbench_operators: 0.0000
- bigbench_repeat_copy_logic: 0.0000
- squad: 0.0000
- coqa: 0.0000
- boolq: -0.6746
- bigbench_language_identification: -0.0001


## Midtraining
timestamp: 2025-11-24 18:03:03

- run: dummy
- device_type: 
- dtype: bfloat16
- num_iterations: 10
- max_seq_len: 256
- device_batch_size: 1
- unembedding_lr: 0.0040
- embedding_lr: 0.2000
- matrix_lr: 0.0200
- init_lr_frac: 1.0000
- weight_decay: 0.0000
- eval_every: 5
- eval_tokens: 1024
- total_batch_size: 256
- dry_run: 0
- Number of iterations: 9
- DDP world size: 1
- Minimum validation bpb: 2.3761


## Chat evaluation mid
timestamp: 2025-11-24 18:04:15

- source: mid
- task_name: None
- dtype: bfloat16
- temperature: 0.0000
- max_new_tokens: 512
- num_samples: 1
- top_k: 50
- batch_size: 1
- model_tag: d1
- step: None
- max_problems: 5
- print_failed: False
- device_type: 
- ARC-Easy: 0.0000
- ARC-Challenge: 0.0000
- MMLU: 0.4000
- GSM8K: 0.0000
- HumanEval: 0.0000
- SpellingBee: 0.0000
- ChatCORE metric: -0.0778


## Chat SFT
timestamp: 2025-11-24 18:04:36

- run: dummy
- source: mid
- device_type: 
- dtype: bfloat16
- device_batch_size: 1
- num_epochs: 1
- num_iterations: 10
- max_data_tokens: 1280
- target_examples_per_step: 4
- unembedding_lr: 0.0040
- embedding_lr: 0.2000
- matrix_lr: 0.0200
- weight_decay: 0.0000
- init_lr_frac: 0.0200
- eval_every: 5
- eval_steps: 10
- eval_metrics_every: 5
- eval_metrics_max_problems: 2
- Training rows: 22,439
- Number of iterations: 10
- Training loss: 10.3067
- Validation loss: 8.6618


## Chat evaluation sft
timestamp: 2025-11-24 18:05:48

- source: sft
- task_name: None
- dtype: bfloat16
- temperature: 0.0000
- max_new_tokens: 512
- num_samples: 1
- top_k: 50
- batch_size: 1
- model_tag: d1
- step: None
- max_problems: 5
- print_failed: False
- device_type: 
- ARC-Easy: 0.0000
- ARC-Challenge: 0.0000
- MMLU: 0.4000
- GSM8K: 0.0000
- HumanEval: 0.0000
- SpellingBee: 0.0000
- ChatCORE metric: -0.0778


## Summary

| Metric          | BASE     | MID      | SFT      | RL       |
|-----------------|----------|----------|----------|----------|
| CORE            | -0.0093  | -        | -        | -        |
| ARC-Challenge   | -        | 0.0000   | 0.0000   | -        |
| ARC-Easy        | -        | 0.0000   | 0.0000   | -        |
| GSM8K           | -        | 0.0000   | 0.0000   | -        |
| HumanEval       | -        | 0.0000   | 0.0000   | -        |
| MMLU            | -        | 0.4000   | 0.4000   | -        |
| ChatCORE        | -        | -0.0778  | -0.0778  | -        |

Total wall clock time: 0h4m
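
The wall clock figure looks like the gap between the "Run started" time in the header and the last section timestamp. Here is a sketch of that arithmetic using the two timestamps from the report above. Which end timestamp the real code uses is my assumption.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def wall_clock(start: str, end: str) -> str:
    """Format the gap between two timestamps as whole hours and minutes."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    seconds = int(delta.total_seconds())
    hours, remainder = divmod(seconds, 3600)
    minutes = remainder // 60  # leftover seconds are dropped
    return f"{hours}h{minutes}m"


# "Run started" from the header and the timestamp of the last section.
elapsed = wall_clock("2025-11-24 18:01:13", "2025-11-24 18:05:48")
print(elapsed)  # 0h4m
```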


Code added/updated as part of this challenge:

- Added `my_report.py`
  
- Added `my_base_loss.py`

- Added logging to report in `my_base_train.py`, `my_base_eval.py`, `my_mid_train.py`, `my_chat_eval.py`, and `my_chat_sft.py`