### Try CORE evaluation on GPU

This is not the main notebook for this challenge; see `understand-core-metric.ipynb`.

#### Existing model

First, load our existing model and see if we can get the full evaluation (limited to 20 examples per task) to complete.

In [1]:
import sys
sys.path.append('../my_nanochat')
import os
import torch
from my_nanochat.my_common import get_base_dir, autodetect_device_type
from my_nanochat.my_checkpoint_manager import build_model
from contextlib import nullcontext

In [2]:
# Pick the available device; use bfloat16 autocast only on CUDA, otherwise a no-op context.
device_type = autodetect_device_type()
device = torch.device(device_type)
autocast_ctx = (
    torch.amp.autocast(device_type=device_type, dtype=torch.bfloat16)
    if device_type == "cuda"
    else nullcontext()
)

Autodetected device type: cuda


In [3]:
checkpoint_dir = os.path.join(get_base_dir(), "base_checkpoints", "d12")
model, tokenizer, meta_data = build_model(checkpoint_dir, step=21000, device=device, phase="eval")

Building model with config: {'sequence_len': 1000, 'vocab_size': 65537, 'n_layer': 12, 'n_head': 6, 'n_kv_head': 6, 'n_embd': 768}


In [4]:
from scripts.my_base_eval import evaluate_model

In [5]:
with autocast_ctx:
    results = evaluate_model(model, tokenizer, device, max_per_task=20)

Evaluating: hellaswag_zeroshot (0-shot, type: multiple_choice)... accuracy: 0.3500 | centered: 0.1333 | time: 1.13s
Evaluating: jeopardy (10-shot, type: language_modeling)... accuracy: 0.0000 | centered: 0.0000 | time: 0.72s
Evaluating: bigbench_qa_wikidata (10-shot, type: language_modeling)... accuracy: 0.0000 | centered: 0.0000 | time: 0.51s
Evaluating: arc_easy (10-shot, type: multiple_choice)... accuracy: 0.2000 | centered: -0.0667 | time: 2.45s
Evaluating: arc_challenge (10-shot, type: multiple_choice)... accuracy: 0.1500 | centered: -0.1333 | time: 3.07s
Evaluating: copa (0-shot, type: multiple_choice)... accuracy: 0.5500 | centered: 0.1000 | time: 0.36s
Evaluating: commonsense_qa (10-shot, type: multiple_choice)... accuracy: 0.1500 | centered: -0.0625 | time: 3.46s
Evaluating: piqa (10-shot, type: multiple_choice)... accuracy: 0.7500 | centered: 0.5000 | time: 1.58s
Evaluating: openbook_qa (0-shot, type: multiple_choice)... accuracy: 0.2000 | centered: -0.0667 | time: 0.39s
Eval

Well, that was much faster and easier than on my laptop.
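
The "centered" column in the output above is consistent with rescaling accuracy so that random guessing maps to 0 and a perfect score maps to 1. Here is a minimal sketch of that rescaling. The per-task baselines (0.25, 0.5, 0.2) are assumptions inferred from the printed numbers, not read from the eval config, and `center_accuracy` is a hypothetical helper rather than part of `my_base_eval`.

```python
def center_accuracy(accuracy: float, random_baseline: float) -> float:
    """Rescale accuracy so the random baseline maps to 0 and a perfect score to 1."""
    return (accuracy - random_baseline) / (1.0 - random_baseline)


# (accuracy, assumed random baseline) pairs; accuracies are taken from the log above,
# baselines are guesses that happen to reproduce the printed centered values.
examples = {
    "hellaswag_zeroshot": (0.35, 0.25),
    "copa": (0.55, 0.50),
    "commonsense_qa": (0.15, 0.20),
    "piqa": (0.75, 0.50),
}

for task, (acc, baseline) in examples.items():
    print(f"{task}: centered = {center_accuracy(acc, baseline):.4f}")
```

A negative centered value means the model scored below the assumed chance level on that task, which is not surprising with only 20 examples per task.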

In [6]:
results

{'results': {'hellaswag_zeroshot': 0.3499999940395355,
  'jeopardy': 0.0,
  'bigbench_qa_wikidata': 0.0,
  'arc_easy': 0.20000000298023224,
  'arc_challenge': 0.15000000596046448,
  'copa': 0.550000011920929,
  'commonsense_qa': 0.15000000596046448,
  'piqa': 0.75,
  'openbook_qa': 0.20000000298023224,
  'lambada_openai': 0.0,
  'hellaswag': 0.30000001192092896,
  'winograd': 0.550000011920929,
  'winogrande': 0.3499999940395355,
  'bigbench_dyck_languages': 0.0,
  'agi_eval_lsat_ar': 0.15000000596046448,
  'bigbench_cs_algorithms': 0.30000001192092896,
  'bigbench_operators': 0.05000000074505806,
  'bigbench_repeat_copy_logic': 0.0,
  'squad': 0.0,
  'coqa': 0.0,
  'boolq': 0.4000000059604645,
  'bigbench_language_identification': 0.20000000298023224},
 'centered_result': {'hellaswag_zeroshot': 0.13333332538604736,
  'jeopardy': 0.0,
  'bigbench_qa_wikidata': 0.0,
  'arc_easy': -0.06666666269302368,
  'arc_challenge': -0.13333332538604736,
  'copa': 0.10000002384185791,
  'commonsense
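
A single CORE number is usually reported as the mean of the per-task centered scores. The sketch below assumes that averaging convention and the `'centered_result'` key shown above; `core_from_centered` is a hypothetical helper. The sample dict holds only the six centered values visible in the truncated output (rounded), so its mean is not the real CORE score for this checkpoint.

```python
def core_from_centered(centered: dict[str, float]) -> float:
    """Average the per-task centered scores into one aggregate number."""
    if not centered:
        raise ValueError("no centered scores to average")
    return sum(centered.values()) / len(centered)


# Only the entries visible in the truncated output above, rounded to 4 places.
partial_centered = {
    "hellaswag_zeroshot": 0.1333,
    "jeopardy": 0.0,
    "bigbench_qa_wikidata": 0.0,
    "arc_easy": -0.0667,
    "arc_challenge": -0.1333,
    "copa": 0.1000,
}

mean_score = core_from_centered(partial_centered)
print(f"mean over {len(partial_centered)} visible tasks: {mean_score:.4f}")
```

In the notebook, the same helper could be applied to `results['centered_result']` to average over all tasks.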

#### During training

Train a tiny model to make sure that running the CORE eval from the training loop works.
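
The flags used for the run below keep every optimizer step tiny. Here is a rough sketch of the batch arithmetic. It assumes a single GPU and that the number of gradient accumulation steps is the total batch size in tokens divided by the tokens per micro-batch. That is a common convention and has not been confirmed against `my_base_train`; `grad_accum_steps` is a hypothetical helper.

```python
def grad_accum_steps(
    total_batch_size: int,
    device_batch_size: int,
    max_seq_len: int,
    world_size: int = 1,  # assumption: one GPU
) -> int:
    """Number of micro-batches accumulated per optimizer step (assumed convention)."""
    tokens_per_micro_batch = device_batch_size * max_seq_len * world_size
    if total_batch_size % tokens_per_micro_batch != 0:
        raise ValueError("total_batch_size must be a multiple of tokens per micro-batch")
    return total_batch_size // tokens_per_micro_batch


# Values from the command below: 2 sequences x 1000 tokens = 2000 tokens per micro-batch.
steps = grad_accum_steps(total_batch_size=2000, device_batch_size=2, max_seq_len=1000)
print(f"gradient accumulation steps: {steps}")
```

With these settings each optimizer step sees a single micro-batch of 2,000 tokens, so 100 iterations finish quickly enough to exercise the eval hooks.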

In [1]:
import os
os.environ["PYTHONPATH"] = "../my_nanochat"

In [5]:
!python -m scripts.my_base_train \
    --depth=4 \
    --max_seq_len=1000 \
    --device_batch_size=2 \
    --num_iterations=100 \
    --total_batch_size=2000 \
    --eval_every=10 \
    --eval_tokens=4000 \
    --sample_every=50 \
    --core_metric_every=50 \
    --core_metric_max_per_task=20

overriding depth = 4
overriding max_seq_len = 1000
overriding device_batch_size = 2
overriding num_iterations = 100
overriding total_batch_size = 2000
overriding eval_every = 10
overriding eval_tokens = 4000
overriding sample_every = 50
overriding core_metric_every = 50
overriding core_metric_max_per_task = 20
user_config: {'device_type': '', 'depth': 4, 'max_seq_len': 1000, 'num_iterations': 100, 'device_batch_size': 2, 'total_batch_size': 2000, 'embedding_lr': 0.2, 'unembedding_lr': 0.004, 'weight_decay': 0.0, 'matrix_lr': 0.02, 'grad_clip': 1.0, 'warmup_ratio': 0.0, 'warmdown_ratio': 0.2, 'final_lr_frac': 0.0, 'eval_every': 10, 'eval_tokens': 4000, 'core_metric_every': 50, 'core_metric_max_per_task': 20, 'sample_every': 50, 'model_tag': ''}
Autodetected device type: cuda
  _C._set_float32_matmul_precision(precision)
Vocab size: 65,537
num_layers: 4
model_dim: 256
num_heads: 2
num_kv_heads: 2
Tokens / micro-batch / rank: 2 x 1000 = 2,000
Tokens / micro-batch: 2,000
Total batch size 2