### Sample CORE results

This is not the main notebook in this challenge; see `instructions.ipynb` for that.

This notebook shows a handful of correct and incorrect results on the CORE tasks. Currently only `squad` is shown.

In [27]:
import os
import sys
import json
import random
import textwrap
from contextlib import nullcontext  # fallback context manager when autocast is not used

import torch
import yaml
sys.path.append('../my_nanochat')
from my_nanochat.my_common import get_base_dir, autodetect_device_type, compute_init
from my_nanochat.my_checkpoint_manager import load_model
from my_nanochat.my_core_eval import evaluate_example, render_prompts_mc, render_prompts_lm, render_prompts_schema

In [2]:
device_type = autodetect_device_type() 
_, _, _, _, device = compute_init(device_type)
model, tokenizer, meta_data = load_model('base', model_tag='d32', device=device, phase='eval')
autocast_ctx = torch.amp.autocast(device_type=device_type, dtype=torch.bfloat16) if device_type == "cuda" else nullcontext()

Autodetected device type: cuda
loading the model from /home/ubuntu/mynanochat2/base_checkpoints/d32 with step 71680


  _C._set_float32_matmul_precision(precision)


Building model with config: {'sequence_len': 2048, 'vocab_size': 65536, 'n_layer': 32, 'n_head': 16, 'n_kv_head': 16, 'n_embd': 2048}
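
The setup cell above picks a bfloat16 autocast context on CUDA and a do-nothing context otherwise. A minimal sketch of that selection follows. The helper name `pick_autocast_ctx` is hypothetical and not part of `my_nanochat`. It imports `torch` lazily so the CPU path can be tried without a GPU stack.

```python
from contextlib import nullcontext


def pick_autocast_ctx(device_type):
    """Return a bfloat16 autocast context on CUDA, otherwise a no-op context.

    Hypothetical helper for illustration; it mirrors the one-line
    conditional in the setup cell.
    """
    if device_type == "cuda":
        import torch  # imported only when the CUDA path is actually taken
        return torch.amp.autocast(device_type=device_type, dtype=torch.bfloat16)
    return nullcontext()


# On CPU the returned object is a plain nullcontext, so the `with` block
# runs its body unchanged.
ctx = pick_autocast_ctx("cpu")
with ctx:
    ran_body = True
```

Note that `nullcontext` has to be imported from `contextlib` before it can be used this way.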


In [3]:
def print_wrap(s, remove_newlines=False):
    if remove_newlines:
        s = s.replace("\n", " ")
    print(textwrap.fill(s, 90))
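
One detail worth knowing about `print_wrap`: `textwrap.fill` replaces every whitespace character, newlines included, with a space by default, so the prompt's own line breaks are flattened whether or not `remove_newlines` is set. The sketch below illustrates this and shows one possible alternative that wraps each line separately. The helper `wrap_keep_lines` and the sample string are made up for illustration.

```python
import textwrap


def wrap_keep_lines(s, width=90):
    """Wrap each input line on its own so existing line breaks survive."""
    return "\n".join(textwrap.fill(line, width) for line in s.split("\n"))


# Made-up sample text, short enough to fit on one 90-column line once flattened.
sample = "Context: first few-shot example.\nQuestion: what is shown?\nAnswer: line breaks."

flat = textwrap.fill(sample, 90)                              # newlines become spaces
flat_explicit = textwrap.fill(sample.replace("\n", " "), 90)  # same result
kept = wrap_keep_lines(sample, 90)                            # three separate lines

print(flat)
print()
print(kept)
```

Whether preserving line breaks is preferable depends on how compact the printed few-shot prompts should be.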

In [29]:
base_dir = get_base_dir()
eval_bundle_dir = os.path.join(base_dir, "eval_bundle")
config_path = os.path.join(eval_bundle_dir, "core.yaml")
data_base_path = os.path.join(eval_bundle_dir, "eval_data")
with open(config_path, 'r', encoding='utf-8') as f:
    config = yaml.safe_load(f)
tasks = config['icl_tasks']
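
The inspection loop that follows reads a few keys from each entry of `icl_tasks`: `label`, `icl_task_type`, `dataset_uri`, `num_fewshot` (a list, of which the first element is used), and an optional `continuation_delimiter`. The sketch below shows how such an entry maps to the metadata dict. The helper name `build_task_meta` and every value in `example_task` are assumptions for illustration, not taken from the actual `core.yaml`.

```python
def build_task_meta(task):
    """Collect the fields the inspection loop passes to the evaluator.

    Hypothetical helper; it mirrors the dict built inside the loop.
    """
    return {
        'task_type': task['icl_task_type'],
        'dataset_uri': task['dataset_uri'],
        'num_fewshot': task['num_fewshot'][0],  # first configured shot count
        'continuation_delimiter': task.get('continuation_delimiter', ' '),
    }


# Made-up entry: the path and shot count are placeholders, not real config values.
example_task = {
    'label': 'squad',
    'icl_task_type': 'language_modeling',
    'dataset_uri': 'some_dir/squad.jsonl',
    'num_fewshot': [10],
}

meta = build_task_meta(example_task)
print(meta)
```

Because the example entry omits `continuation_delimiter`, the default of a single space is used.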

In [30]:
tasks_to_inspect = ['squad']
items_per_task = 25

for task in tasks:
    if task['label'] not in tasks_to_inspect:
        continue

    task_type = task['icl_task_type']
    task_meta = {
        'task_type': task_type,
        'dataset_uri': task['dataset_uri'],
        'num_fewshot': task['num_fewshot'][0],  # use the first configured shot count
        'continuation_delimiter': task.get('continuation_delimiter', ' ')
    }
    data_path = os.path.join(data_base_path, task['dataset_uri'])
    
    with open(data_path, 'r', encoding='utf-8') as f:
        data = [json.loads(line.strip()) for line in f]

        random.seed(5)
        # cap the sample size so small datasets do not raise ValueError
        for i in random.sample(range(len(data)), min(items_per_task, len(data))):
            with autocast_ctx:
                result, prompts = evaluate_example(i, model, tokenizer, data, device, task_meta, return_prompts=True)
                print(f"{task['label']} item {i} is {'correct' if result else 'wrong'}\n")
                item = data[i]
                if task_type == 'language_modeling':
                    print("Here is the expected continuation and full prompt including the n-shot examples if applicable:\n")
                    print_wrap(f"Expected continuation: {item['continuation']}")
                    print()
                    prompt = prompts[0] # for LM we only care about this one
                    
                    print_wrap(prompt)

                    print("\nHere's what the model outputs with the full prompt, max_tokens=20, temperature=0:")
                    in_tokens = tokenizer.encode(prompt, prepend=tokenizer.get_bos_token_id())
                    out_tokens = list(model.generate(in_tokens, max_tokens=20, temperature=0))
                    print(tokenizer.decode(out_tokens))

                    for seed in range(5):
                        print(f"\nHere's what the model outputs with the full prompt, max_tokens=20, temperature=1, seed={seed}:")
                        in_tokens = tokenizer.encode(prompt, prepend=tokenizer.get_bos_token_id())
                        out_tokens = list(model.generate(in_tokens, max_tokens=20, temperature=1, seed=seed))
                        print(tokenizer.decode(out_tokens))
                    
                else:
                    # TODO: handle the other task types
                    raise NotImplementedError(f"task type {task_type!r} is not handled yet")
                
                print("\n=================================\n")

squad item 10205 is wrong

Here is the expected continuation and full prompt including the n-shot examples if applicable:

Expected continuation: unwilling to risk large convoys to aid the limited forces it had in
New France

Context: On December 7, 1965, Goldenson announced a merger proposal with ITT to ABC
management; the two companies agreed to the deal on April 27, 1966. The FCC approved the
merger on December 21, 1966; however, the previous day (December 20), Donald F. Turner,
head antitrust regulator for the United States Department of Justice, expressed doubts
related to such issues as the emerging cable television market, and concerns over the
journalistic integrity of ABC and how it could be influenced by the overseas ownership of
ITT. ITT management promised that the company would allow ABC to retain autonomy in the
publishing business. The merger was suspended, and a complaint was filed by the Department
of Justice in July 1967, with ITT going to trial in October 1967; the m