This is not the main notebook in this challenge. See `understand-core-metric.ipynb`.

## CORE evaluation data examples

I realized after creating `challenge-26-understand-midtrain/midtrain-data-examples.ipynb` and doing midtraining that I was forgetting, or getting confused about, what the CORE evaluation data looks like. This notebook uses some ugly code to show a few examples for each task type.

In [10]:
print_examples(items_per_task=2, random_items=True, random_seed=2)


This is multiple choice so each item will be scored as correct if the choice with the
highest probability matches the correct choice. To get into the mechanics a bit more, it's
really only the probabilities of the "choice part" that are looked at for each of the n
"prompts". The "choice part" is what comes after the query. The query is repeated in each
prompt, forming a common prefix. Think about it as which choice has the highest
probability given the text that comes before it. That's the one the model thinks is right.

Showing 2 random items of 10,042

----------- item: 926 ------------
Query: Cleaning windows: The man sprays windex and washes off with squeegee. The man uses
leaf blower to dry the window. The man
Correct prompt: 2

prompt 0: Cleaning windows: The man sprays windex and washes off with squeegee. The man
uses leaf blower to dry the window. The man drops the reusable leaf blower in the garbage
can.

prompt 1: Cleaning windows: The man sprays windex and washes off with s

In [119]:
import pandas as pd
pd.read_csv(f"{get_base_dir()}/eval_bundle/eval_meta_data.csv")

| Unnamed: 0 | Eval Task | Task Category | Task Type | #shots | #datapoints | Random baseline | Centered Metric? | Description |
|---|---|---|---|---|---|---|---|---|
| 0 | mmlu_zeroshot | world knowledge | multiple choice | 0 | 14042 | 25.0 | | MMLU consists of 14,042 four-choice multiple c... |
| 1 | hellaswag_zeroshot | language understanding | multiple choice | 0 | 10042 | 25.0 | | HellaSwag consists of 10,042 multiple choice s... |
| 2 | jeopardy | world knowledge | language modeling | 10 | 2117 | 0.0 | | Jeopardy consists of 2,117 Jeopardy questions ... |
| 3 | triviaqa_sm_sub | world knowledge | question answering | 3 | 3000 | 0.0 | | Trivia QA is a question answering dataset that... |
| 4 | gsm8k_cot | symbolic problem solving | question answering | 3 | 1319 | 0.0 | | GSM8K consists of 1,319 short, free-response g... |
| 5 | agi_eval_sat_math_cot | symbolic problem solving | question answering | 3 | 220 | 0.0 | | AGI Eval SAT Math consists of 220 short, free-... |
| 6 | aqua_cot | symbolic problem solving | question answering | 3 | 245 | 0.0 | | AQUA-RAT (Algebra Question Answering with Rati... |
| 7 | svamp_cot | symbolic problem solving | question answering | 3 | 300 | 0.0 | | SVAMP consists of 300 short, free-response gra... |
| 8 | bigbench_qa_wikidata | world knowledge | language modeling | 10 | 20321 | 0.0 | | BIG-bench wikidata consists of 20,321 question... |
| 9 | arc_easy | world knowledge | multiple choice | 10 | 2376 | 25.0 | | ARC easy consists of 2,376 easy four-choice mu... |
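
The `Random baseline` column looks like it is in percent (25.0 for four-choice tasks). One common way to use such a baseline is to rescale accuracy so that random guessing maps to 0 and a perfect score maps to 1. The sketch below is my assumption about how that centering works, not something taken from this notebook; the function name `centered_accuracy` is made up for illustration.

```python
def centered_accuracy(accuracy, random_baseline_pct):
    """Rescale accuracy so random guessing scores 0.0 and a perfect score is 1.0.

    accuracy is a fraction in [0, 1]; random_baseline_pct is in percent,
    matching the assumed units of the "Random baseline" column.
    """
    baseline = random_baseline_pct / 100.0
    return (accuracy - baseline) / (1.0 - baseline)


# A four-choice task: guessing at random gives 25% accuracy.
print(centered_accuracy(0.25, 25.0))   # random guessing
print(centered_accuracy(0.625, 25.0))  # halfway between random and perfect
# A task with a 0.0 baseline is left unchanged.
print(centered_accuracy(0.5, 0.0))
```

With this rescaling, tasks with different numbers of choices become comparable, since a score of 0 always means "no better than guessing".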


In [11]:
import sys
sys.path.append('../my_nanochat')
import os
import yaml
import json
import random
import textwrap
from my_nanochat.my_common import get_base_dir
from my_nanochat.my_core_eval import render_prompts_mc, render_prompts_lm, render_prompts_schema

def print_wrap(s, remove_newlines=False):
    if remove_newlines:
        s = s.replace("\n", " ")
    print(textwrap.fill(s, 90))

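def print_examples(items_per_task=3, random_items=False, random_seed=None, tasks_to_print=None):
    """Print a short description of each CORE task type and a few rendered prompts per task.

    items_per_task: how many items to show for each task.
    random_items: sample items at random instead of taking the first ones.
    random_seed: optional seed so a random sample can be reproduced.
    tasks_to_print: optional collection of task labels to include (None means all tasks).
    """
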
    if random_seed is not None:  # a seed of 0 is valid, so test against None
        random.seed(random_seed)

    base_dir = get_base_dir()
    eval_bundle_dir = os.path.join(base_dir, "eval_bundle")
    config_path = os.path.join(eval_bundle_dir, "core.yaml")
    data_base_path = os.path.join(eval_bundle_dir, "eval_data")
    with open(config_path, 'r', encoding='utf-8') as f:
        config = yaml.safe_load(f)
    tasks = config['icl_tasks']
    for task in tasks:
        if tasks_to_print is not None:
            if task['label'] not in tasks_to_print:
                continue
        
        task_type = task['icl_task_type']
        continuation_delimiter = task.get('continuation_delimiter', ' ')
        data_path = os.path.join(data_base_path, task['dataset_uri'])
        with open(data_path, 'r', encoding='utf-8') as f:
            data = [json.loads(line) for line in f if line.strip()]  # skip blank lines

        
        print(f"============= {task['label']} =============\n")
    
        if task_type == 'multiple_choice':
            print_wrap(
"""This is multiple choice so each item will be scored as correct if the choice with the highest probability
matches the correct choice. To get into the mechanics a bit more, it's really only the probabilities
of the "choice part" that are looked at for each of the n "prompts". The "choice part" is what comes after
the query. The query is repeated in each prompt, forming a common prefix. Think about it as which choice
has the highest probability given the text that comes before it. That's the one the model thinks is right.
""", remove_newlines=True)
        elif task_type == 'language_modeling':
            print_wrap(
"""This is a language modeling task type. Each item will be scored as correct if the model generates
the expected continuation.
""", remove_newlines=True)
        elif task_type == 'schema':
            print_wrap(
"""This is a schema task type. Each item will be scored as correct if the "continuation part" with the
highest probability is in the correct prompt. This is similar to multiple choice except here we
have a common suffix (the continuation) and in multiple choice we have a common prefix (the query).
The continuations are the same in each prompt so in isolation they would have the same probability.
The key is they are judged in the context of the full prompt. It's also important that we look at
the probabilities only of the continuation parts, because we're interested in which is most probable
in the given context, not which prompt overall is more likely.
""", remove_newlines=True)
        else:
            raise ValueError(f"Unknown task type: {task_type!r}")

        print()

        if random_items:
            print(f"Showing {items_per_task} random items of {len(data):,d}\n")
        else:
            print(f"Showing the first {items_per_task} items of {len(data):,d}\n")

        
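        # Clamp so a dataset smaller than items_per_task cannot break sampling or indexing.
        n_items = min(items_per_task, len(data))
        indices = random.sample(range(len(data)), n_items) if random_items else range(n_items)
        for i in indices: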
            print(f"----------- item: {i} ------------")
            item = data[i]
            if task_type == 'multiple_choice':
                print_wrap(f"Query: {item['query']}")
                print(f"Correct prompt: {item['gold']}\n") 
                prompts = render_prompts_mc(item, continuation_delimiter, [])
            elif task_type == 'language_modeling':
                print_wrap(f"Expected continuation: {item['continuation']}")
                print()
                prompts = render_prompts_lm(item, continuation_delimiter, [])
                prompts = prompts[:-1] # because in CORE eval we only use the first method of scoring
            elif task_type == 'schema':
                print_wrap(f"Continuation part: {item['continuation']}")
                print(f"Correct prompt: {item['gold']}\n")            
                prompts = render_prompts_schema(item, continuation_delimiter, [])
            else:
                raise ValueError(f"Unknown task type: {task_type!r}")
    
            for j, prompt in enumerate(prompts):
                print_wrap(f"prompt {j}: {prompt}")
                print()
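
To make the multiple choice and schema descriptions printed above more concrete, here is a minimal sketch of the idea: find the part of each prompt that is not shared (the choice after a common prefix, or the continuation as a common suffix), then pick the prompt whose scored span is most probable. Everything here is illustrative. The helper names are made up, whitespace splitting stands in for a real tokenizer, the log-probabilities are invented numbers, and averaging over the span (rather than summing) is an assumption on my part. It does not call `my_core_eval`.

```python
def common_prefix_len(seqs):
    """Count the leading tokens shared by every sequence."""
    n = 0
    for column in zip(*seqs):
        if len(set(column)) != 1:
            break
        n += 1
    return n


def common_suffix_len(seqs):
    """Count the trailing tokens shared by every sequence."""
    return common_prefix_len([seq[::-1] for seq in seqs])


def pick_prompt(token_logprobs, spans):
    """Return the index of the prompt whose scored span has the highest mean log-probability."""
    means = [sum(lp[start:end]) / (end - start)
             for lp, (start, end) in zip(token_logprobs, spans)]
    return max(range(len(means)), key=lambda i: means[i])


# Multiple choice: the query is a common prefix, so only the choice part is scored.
mc_prompts = [
    "The man drops the blower".split(),
    "The man dries the window".split(),
]
prefix = common_prefix_len(mc_prompts)
mc_spans = [(prefix, len(p)) for p in mc_prompts]
mc_logprobs = [  # invented per-token log-probabilities
    [-1.0, -1.0, -4.0, -1.0, -5.0],
    [-1.0, -1.0, -2.0, -1.0, -1.5],
]
print("multiple choice pick:", pick_prompt(mc_logprobs, mc_spans))

# Schema: the continuation is a common suffix, so only that suffix is scored.
schema_prompts = [
    "The trophy did not fit in the suitcase because the trophy is too big".split(),
    "The trophy did not fit in the suitcase because the suitcase is too big".split(),
]
suffix = common_suffix_len(schema_prompts)
schema_spans = [(len(p) - suffix, len(p)) for p in schema_prompts]
schema_logprobs = [  # invented: prompt 0 is less likely overall, but its continuation is more likely
    [-3.0] * 11 + [-0.5] * 3,
    [-1.0] * 11 + [-2.0] * 3,
]
print("schema pick:", pick_prompt(schema_logprobs, schema_spans))
```

In the schema example, prompt 1 has the higher total log-probability, but prompt 0 wins because only the continuation span is compared. That is the point made in the schema description: what matters is which context makes the continuation most probable, not which full prompt is more likely.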