## CORE evaluation data examples

After creating `challenge-26-understand-midtrain/midtrain-data-examples.ipynb` and running midtraining, I realized I kept forgetting, or getting confused about, what the CORE evaluation data looks like. This notebook uses some ugly code to show a few examples for each task type.

In [75]:
display_examples()


This is multiple choice so each item will be scored correct if the choice with the highest probability
matches the correct choice. To get into the mechanics a bit more, it's really only the probabilities 
of the "choice part" that are looked at for each of the n "prompts". The "choice part" is what comes after
the query. The query is repeated in each prompt, forming a common prefix. Think about it as which choice
has the highest probability given the text that comes before it. That's the one the model thinks is right.

----------- item: 0 ------------
Query: Roof shingle removal: A man is sitting on a roof. He
Correct prompt: 3

prompt 0: Roof shingle removal: A man is sitting on a roof. He is using wrap to wrap a pair of skis.

prompt 1: Roof shingle removal: A man is sitting on a roof. He is ripping level tiles off.

prompt 2: Roof shingle removal: A man is sitting on a roof. He is holding a rubik's cube.

prompt 3: Roof shingle removal: A man is sitting on a roof. He starts pulling 
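
To make the "only the choice part counts" idea concrete, here is a minimal sketch of how the winning prompt could be picked. It is not the code in `my_core_eval`: the function names, the word-level "tokens", and the log-probabilities are all made up for illustration, and averaging the log-probabilities over the choice tokens (rather than summing them) is my assumption about how length is normalized. It also assumes every choice has at least one token after the shared query.

```python
def common_prefix_len(token_seqs):
    """Number of leading tokens shared by every sequence (the repeated query)."""
    shortest = min(len(seq) for seq in token_seqs)
    for i in range(shortest):
        if any(seq[i] != token_seqs[0][i] for seq in token_seqs):
            return i
    return shortest


def pick_choice(token_seqs, token_logprobs):
    """Return the index of the prompt whose choice part is most probable.

    token_logprobs[k][t] is the (hypothetical) log-probability of token t of
    prompt k given the tokens before it. Only tokens after the common prefix
    are scored, using their mean log-probability (an assumption).
    """
    start = common_prefix_len(token_seqs)
    scores = []
    for logprobs in token_logprobs:
        choice_part = logprobs[start:]
        scores.append(sum(choice_part) / len(choice_part))
    return max(range(len(scores)), key=scores.__getitem__)


# Toy data: words stand in for tokens, and the numbers are invented.
token_seqs = [
    ["A", "man", "sits.", "He", "waves"],
    ["A", "man", "sits.", "He", "starts", "pulling"],
]
token_logprobs = [
    [-1.0, -2.0, -1.5, -0.5, -4.0],
    [-1.0, -2.0, -1.5, -0.5, -1.0, -2.0],
]

print(common_prefix_len(token_seqs))            # the query is 4 tokens long
print(pick_choice(token_seqs, token_logprobs))  # index of the predicted choice
```

The item is then scored correct when the returned index equals the item's `gold` value.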

In [74]:
def display_examples():
    """Print a short explanation and the first few rendered items for each CORE task."""

    import sys
    sys.path.append('../my_nanochat')
    import os
    import yaml
    import json
    from my_nanochat.my_common import get_base_dir
    from my_nanochat.my_core_eval import render_prompts_mc, render_prompts_lm, render_prompts_schema
    
    base_dir = get_base_dir()
    eval_bundle_dir = os.path.join(base_dir, "eval_bundle")
    config_path = os.path.join(eval_bundle_dir, "core.yaml")
    data_base_path = os.path.join(eval_bundle_dir, "eval_data")
    with open(config_path, 'r', encoding='utf-8') as f:
        config = yaml.safe_load(f)
    tasks = config['icl_tasks']
    for task in tasks:
        print(f"============= {task['label']} =============")
    
        task_type = task['icl_task_type']
        if task_type == 'multiple_choice':
            print("""
This is multiple choice so each item will be scored correct if the choice with the highest probability
matches the correct choice. To get into the mechanics a bit more, it's really only the probabilities 
of the "choice part" that are looked at for each of the n "prompts". The "choice part" is what comes after
the query. The query is repeated in each prompt, forming a common prefix. Think about it as which choice
has the highest probability given the text that comes before it. That's the one the model thinks is right.\n""")
        elif task_type == 'language_modeling':
            print("""
This is a language modeling task type. Each item will be scored correct if the model generates
the expected continuation.\n""")
        elif task_type == 'schema':
            print("""
This is a schema task type. Each item will be scored correct if the "continuation part" with the
highest probability is in the correct prompt. This is similar to multiple choice except here we
have a common suffix (the continuation) and in multiple choice we have a common prefix (the query).
The continuations are the same in each prompt so in isolation they would have the same probability.
The key is they are judged in the context of the full prompt. It's also important that we look at
the probabilities only of the continuation parts, because we're interested in which is most probable
in the given context, not which prompt overall is more likely.\n""")
        else:
            raise ValueError(f"Unknown icl_task_type: {task_type!r}")
        
        continuation_delimiter = task.get('continuation_delimiter', ' ')
        data_path = os.path.join(data_base_path, task['dataset_uri'])
        with open(data_path, 'r', encoding='utf-8') as f:
            data = [json.loads(line) for line in f if line.strip()]  # skip blank lines
    
        for i, item in enumerate(data[:3]):  # first three items (fewer if the file is short)
            print(f"----------- item: {i} ------------")
            if task_type == 'multiple_choice':
                print(f"Query: {item['query']}")
                print(f"Correct prompt: {item['gold']}\n") 
                prompts = render_prompts_mc(item, continuation_delimiter, [])
            elif task_type == 'language_modeling':
                print(f"Expected continuation: {item['continuation']}\n")
                prompts = render_prompts_lm(item, continuation_delimiter, [])
                prompts = prompts[:-1]  # drop the last rendered prompt, since the CORE eval only uses the first scoring method
            elif task_type == 'schema':
                print(f"Continuation part: {item['continuation']}")
                print(f"Correct prompt: {item['gold']}\n")            
                prompts = render_prompts_schema(item, continuation_delimiter, [])
            else:
                raise ValueError(f"Unknown icl_task_type: {task_type!r}")
    
            for j, prompt in enumerate(prompts):
                print(f"prompt {j}: {prompt}\n")    
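
The schema and language modeling explanations printed above can be sketched the same way. Again, this is an illustration rather than the code in `my_core_eval`: the helper names, the word-level "tokens", and the log-probabilities are hypothetical, the mean over the continuation tokens is my assumption, and the language modeling check assumes the model's greedy predictions for the continuation positions are already available as a list.

```python
def common_suffix_len(token_seqs):
    """Number of trailing tokens shared by every sequence (the continuation)."""
    shortest = min(len(seq) for seq in token_seqs)
    for i in range(1, shortest + 1):
        if any(seq[-i] != token_seqs[0][-i] for seq in token_seqs):
            return i - 1
    return shortest


def pick_context(token_logprobs, suffix_len):
    """Schema tasks: return the prompt index whose shared continuation is most probable.

    Only the last suffix_len tokens are scored (suffix_len must be >= 1), so
    the differing contexts matter only through how they condition the
    continuation.
    """
    scores = [sum(lp[-suffix_len:]) / suffix_len for lp in token_logprobs]
    return max(range(len(scores)), key=scores.__getitem__)


def lm_correct(predicted_tokens, expected_tokens):
    """Language modeling tasks: correct only if every continuation token matches."""
    return list(predicted_tokens) == list(expected_tokens)


# Toy schema item: same continuation, different context (numbers are invented).
token_seqs = [
    ["The", "trophy", "fits", "because", "it", "is", "small"],
    ["The", "suitcase", "fits", "because", "it", "is", "small"],
]
token_logprobs = [
    [-1.0, -3.0, -2.0, -1.0, -1.0, -1.0, -4.0],
    [-1.0, -3.0, -1.0, -1.0, -1.0, -1.0, -1.0],
]

suffix_len = common_suffix_len(token_seqs)
print(suffix_len)                                # length of the shared continuation
print(pick_context(token_logprobs, suffix_len))  # index of the predicted prompt
print(lm_correct(["Paris"], ["Paris"]))          # exact-match check for language modeling
```

Scoring only the shared suffix is what keeps the comparison about "which context makes this continuation most likely" rather than "which whole prompt is most likely".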