### CORE Metric Evaluation

Another thing he does every so many steps in [base_train.py](https://github.com/karpathy/nanochat/blob/master/scripts/base_train.py) is evaluate the CORE metric. It looks like there's a lot involved. I want to understand it better, and then I'll either hand copy that code now or move on and circle back later.

[base_eval.py](https://github.com/karpathy/nanochat/blob/master/scripts/base_eval.py) is the starting point. He calls it from `base_train.py`, and it can also be run by itself to evaluate a model. The first thing it does is download an "eval bundle" zip file. Let me see what that is.

```
# ~162MB of data needed to evaluate the CORE metric
EVAL_BUNDLE_URL = "https://karpathy-public.s3.us-west-2.amazonaws.com/eval_bundle.zip"
```

Might as well start copying straightforward functions, like the one to unzip the file, to `my_base_eval.py` rather than doing that inline in this notebook, because I'll need them later.
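
For reference, here is a minimal sketch of what an unzip-and-place helper could look like. It is not the code from `base_eval.py`: the name `unzip_bundle` and the `inner_name` parameter are made up for illustration, and it assumes the zip contains a single top-level folder.

```python
import os
import shutil
import tempfile
import zipfile

def unzip_bundle(zip_path, dest_dir, inner_name="eval_bundle"):
    """Extract zip_path and move its top-level `inner_name` folder to dest_dir.

    Hypothetical helper: assumes the archive holds one top-level folder
    called `inner_name` and that dest_dir does not exist yet.
    """
    with tempfile.TemporaryDirectory() as tmpdir:
        with zipfile.ZipFile(zip_path, "r") as zf:
            zf.extractall(tmpdir)
        # Move only after extraction finished, so a failed extraction
        # leaves dest_dir untouched.
        shutil.move(os.path.join(tmpdir, inner_name), dest_dir)
    return dest_dir
```

Extracting into a temporary directory first means a half-finished unzip never looks like a finished bundle to the `os.path.exists()` check.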

In [9]:
import sys
sys.path.append('../my_nanochat')
from scripts.my_base_eval import EVAL_BUNDLE_URL, place_eval_bundle
from my_nanochat.my_common import get_base_dir, download_file_with_lock
import os

In [4]:
eval_bundle_dir = os.path.join(get_base_dir(), 'eval_bundle')
if not os.path.exists(eval_bundle_dir):
    download_file_with_lock(EVAL_BUNDLE_URL, 'eval_bundle.zip', postprocess_fn=place_eval_bundle)

In [7]:
!ls -lh {eval_bundle_dir}

total 120
-rw-r--r--  1 ericsilberstein  staff   3.1K Nov 13 07:30 core.yaml
drwxr-xr-x  9 ericsilberstein  staff   288B Nov 13 07:30 eval_data
-rw-r--r--  1 ericsilberstein  staff    18K Nov 13 07:30 EVAL_GAUNTLET.md
-rw-r--r--  1 ericsilberstein  staff    16K Nov 13 07:30 eval_meta_data.csv
-rw-r--r--  1 ericsilberstein  staff   1.4K Nov 13 07:30 openai-community-gpt2-large.csv
-rw-r--r--  1 ericsilberstein  staff   1.4K Nov 13 07:30 openai-community-gpt2-medium.csv
-rw-r--r--  1 ericsilberstein  staff   1.4K Nov 13 07:30 openai-community-gpt2-xl.csv
-rw-r--r--  1 ericsilberstein  staff   1.4K Nov 13 07:30 openai-community-gpt2.csv


In [10]:
!head -15 {eval_bundle_dir}/core.yaml

icl_tasks:
-
  label: hellaswag_zeroshot
  dataset_uri: language_understanding/hellaswag.jsonl
  num_fewshot: [0]
  icl_task_type: multiple_choice
-
  label: jeopardy
  dataset_uri: world_knowledge/jeopardy_all.jsonl
  num_fewshot: [10]
  icl_task_type: language_modeling
  continuation_delimiter: "\nAnswer: "
  has_categories: true
-
  label: bigbench_qa_wikidata


In [11]:
!ls -lh {eval_bundle_dir}/eval_data

total 0
drwxr-xr-x  10 ericsilberstein  staff   320B Nov 13 07:30 commonsense_reasoning
drwxr-xr-x  10 ericsilberstein  staff   320B Nov 13 07:30 language_understanding
drwxr-xr-x  10 ericsilberstein  staff   320B Nov 13 07:30 programming
drwxr-xr-x  11 ericsilberstein  staff   352B Nov 13 07:30 reading_comprehension
drwxr-xr-x   6 ericsilberstein  staff   192B Nov 13 07:30 safety
drwxr-xr-x  22 ericsilberstein  staff   704B Nov 13 07:30 symbolic_problem_solving
drwxr-xr-x  15 ericsilberstein  staff   480B Nov 13 07:30 world_knowledge


In [12]:
!ls -lh {eval_bundle_dir}/eval_data/world_knowledge

total 63832
-rw-r--r--  1 ericsilberstein  staff   357K Nov 13 07:30 arc_challenge.jsonl
-rw-r--r--  1 ericsilberstein  staff   622K Nov 13 07:30 arc_easy.jsonl
-rw-r--r--  1 ericsilberstein  staff    46K Nov 13 07:30 bigbench_misconceptions.jsonl
-rw-r--r--  1 ericsilberstein  staff   123K Nov 13 07:30 bigbench_movie_recommendation.jsonl
-rw-r--r--  1 ericsilberstein  staff   1.6M Nov 13 07:30 bigbench_qa_wikidata.jsonl
-rw-r--r--  1 ericsilberstein  staff   356K Nov 13 07:30 jeopardy_all.jsonl
-rw-r--r--  1 ericsilberstein  staff   2.8K Nov 13 07:30 jeopardy_small.jsonl
-rw-r--r--  1 ericsilberstein  staff   9.9M Nov 13 07:30 mmlu_expand.jsonl
-rw-r--r--  1 ericsilberstein  staff   7.7M Nov 13 07:30 mmlu.jsonl
-rw-r--r--  1 ericsilberstein  staff   1.2M Nov 13 07:30 triviaqa_sm_sub.jsonl
-rw-r--r--  1 ericsilberstein  staff   4.6M Nov 13 07:30 triviaqa_sm.jsonl
-rw-r--r--  1 ericsilberstein  staff    12K Nov 13 07:30 triviaqa_small.jsonl
-rw-r--r--  1 ericsilberstein  staff   4.7M No

In [13]:
!head -15 {eval_bundle_dir}/eval_data/world_knowledge/jeopardy_small.jsonl

{"context": "WORLD HISTORY: This Navy commander flew from a base at Little America to the South Pole & back Nov. 28-29, 1929", "continuation": "Admiral Richard Byrd", "category": "world_history"}
{"context": "WORLD HISTORY: Accused of accepting bribes, Francis Bacon was imprisoned in this forbidding complex in 1621", "continuation": "Tower of London", "category": "world_history"}
{"context": "WORLD HISTORY: More than 250,000 died in fighting before France granted this African nation independence July 3, 1962", "continuation": "Algeria", "category": "world_history"}
{"context": "WORLD HISTORY: In 1784 she founded the city of Sevastopol in her new domain of the Crimea", "continuation": "Catherine the Great", "category": "world_history"}
{"context": "WORLD HISTORY: This Portuguese Admiral of the Indian Seas discovered & named the Amirante Islands", "continuation": "Vasco da Gama", "category": "world_history"}
{"context": "WORLD HISTORY: This dominion was created by the British North Ameri

So for those, it seems like we give the model the context and see if it comes up with the continuation. But what about, say, the multiple choice ones? Is it the same? It seems like there's too much code supporting the CORE metric if every case is just about seeing whether a continuation matches.

In [15]:
!head -30 {eval_bundle_dir}/core.yaml

icl_tasks:
-
  label: hellaswag_zeroshot
  dataset_uri: language_understanding/hellaswag.jsonl
  num_fewshot: [0]
  icl_task_type: multiple_choice
-
  label: jeopardy
  dataset_uri: world_knowledge/jeopardy_all.jsonl
  num_fewshot: [10]
  icl_task_type: language_modeling
  continuation_delimiter: "\nAnswer: "
  has_categories: true
-
  label: bigbench_qa_wikidata
  dataset_uri: world_knowledge/bigbench_qa_wikidata.jsonl
  num_fewshot: [10]
  icl_task_type: language_modeling
-
  label: arc_easy
  dataset_uri: world_knowledge/arc_easy.jsonl
  num_fewshot: [10]
  icl_task_type: multiple_choice
  continuation_delimiter: "\nAnswer: "
-
  label: arc_challenge
  dataset_uri: world_knowledge/arc_challenge.jsonl
  num_fewshot: [10]
  icl_task_type: multiple_choice
  continuation_delimiter: "\nAnswer: "


In [17]:
!head -5 {eval_bundle_dir}/eval_data/world_knowledge/arc_easy.jsonl

{"choices": ["Sunlight is the source of energy for nearly all ecosystems.", "Most ecosystems are found on land instead of in water.", "Carbon dioxide is more available than other gases.", "The producers in all ecosystems are plants."], "query": "Question: Which statement best explains why photosynthesis is the foundation of most food webs?", "gold": 0}
{"choices": ["safety goggles", "breathing mask", "rubber gloves", "lead apron"], "query": "Question: Which piece of safety equipment is used to keep mold spores from entering the respiratory system?", "gold": 1}
{"choices": ["brain cells", "bone cells", "muscle cells", "ovary cells"], "query": "Question: Meiosis is a type of cell division in which germ cells divide to produce haploid cells. Where does meiosis occur?", "gold": 3}
{"choices": ["gray", "warm", "long", "soft"], "query": "Question: Which characteristic describes the texture of a kitten's fur?", "gold": 3}
{"choices": ["a lightweight core surrounded by neutral particles", "a m

Just looking at that is bringing back nervous memories of studying for tests.

Might as well hand copy all this stuff now. I'll start with [core_eval.py](https://github.com/karpathy/nanochat/blob/master/nanochat/core_eval.py), which contains things like templates for the various task types and the code to have the model complete a prompt and score it.

#### render_prompts_mc()

Start with one of the 3 prompt rendering functions just to get the idea.

In [1]:
import sys
sys.path.append('../my_nanochat')
from my_nanochat.my_core_eval import render_prompts_mc

In [3]:
item = {
    'query': 'Question: Which piece of safety equipment is used to keep mold spores from entering the respiratory system?',
    'choices': [
        "a lightweight core surrounded by neutral particles",
        "a massive core surrounded by negatively-charged particles",
        "a network of interacting positive and negative particles",
        "overlapping layers of neutral, positive, and negative particles"],
    'gold': 1
}
continuation_delimiter = "\nAnswer: "
render_prompts_mc(item, continuation_delimiter)

['Question: Which piece of safety equipment is used to keep mold spores from entering the respiratory system?\nAnswer: a lightweight core surrounded by neutral particles',
 'Question: Which piece of safety equipment is used to keep mold spores from entering the respiratory system?\nAnswer: a massive core surrounded by negatively-charged particles',
 'Question: Which piece of safety equipment is used to keep mold spores from entering the respiratory system?\nAnswer: a network of interacting positive and negative particles',
 'Question: Which piece of safety equipment is used to keep mold spores from entering the respiratory system?\nAnswer: overlapping layers of neutral, positive, and negative particles']

Why are those the prompts? How is the model supposed to complete them to say which is the right answer? Maybe it will make more sense if there are also fewshot examples?

In [4]:
item2 = {
    'query': 'Question: Which statement best explains why photosynthesis is the foundation of most food webs?',
    'choices': [
        "Sunlight is the source of energy for nearly all ecosystems.", 
        "Most ecosystems are found on land instead of in water.", 
        "Carbon dioxide is more available than other gases.", 
        "The producers in all ecosystems are plants."],
    'gold': 0
}

In [5]:
render_prompts_mc(item, continuation_delimiter, fewshot_examples=[item2, item2])

['Question: Which statement best explains why photosynthesis is the foundation of most food webs?\nAnswer: Sunlight is the source of energy for nearly all ecosystems.\n\nQuestion: Which statement best explains why photosynthesis is the foundation of most food webs?\nAnswer: Sunlight is the source of energy for nearly all ecosystems.\n\nQuestion: Which piece of safety equipment is used to keep mold spores from entering the respiratory system?\nAnswer: a lightweight core surrounded by neutral particles',
 'Question: Which statement best explains why photosynthesis is the foundation of most food webs?\nAnswer: Sunlight is the source of energy for nearly all ecosystems.\n\nQuestion: Which statement best explains why photosynthesis is the foundation of most food webs?\nAnswer: Sunlight is the source of energy for nearly all ecosystems.\n\nQuestion: Which piece of safety equipment is used to keep mold spores from entering the respiratory system?\nAnswer: a massive core surrounded by negative

I still don't get it. What's the point of a prompt per choice, one of which gives the right answer?
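
Judging only from the two outputs above, the rendering looks like "solved fewshot examples, then the query, then one candidate answer". A minimal sketch that reproduces that shape (the name `sketch_render_prompts_mc` is made up, and the real function may well be template-based):

```python
def sketch_render_prompts_mc(item, continuation_delimiter, fewshot_examples=None):
    """Return one full prompt per choice, each prefixed by solved fewshot examples."""
    blocks = []
    for ex in fewshot_examples or []:
        # fewshot examples are shown with their gold answer filled in
        blocks.append(f"{ex['query']}{continuation_delimiter}{ex['choices'][ex['gold']]}")
    # each solved example is followed by a blank line
    prefix = "".join(block + "\n\n" for block in blocks)
    return [
        f"{prefix}{item['query']}{continuation_delimiter}{choice}"
        for choice in item["choices"]
    ]
```

Every prompt shares the same prefix and differs only in the final answer text, which is what makes the common-prefix trick in `batch_sequences_mc()` possible.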

Maybe it will help to understand this function: `batch_sequences_mc(tokenizer, prompts)`

In [12]:
from my_nanochat.my_core_eval import find_common_length, batch_sequences_mc
from my_nanochat.my_tokenizer import get_tokenizer

In [8]:
# helper function for batch_sequences_mc
find_common_length([[1,2,3,4],[1,2,3,5],[1,2,6,4,8]])

2
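
A minimal sketch of the common-prefix idea, assuming only the left-to-right case shown above (the name `common_prefix_length` is made up; the real helper may take extra arguments):

```python
def common_prefix_length(token_sequences):
    """Number of leading positions at which every sequence holds the same token."""
    shortest = min(len(seq) for seq in token_sequences)
    for i in range(shortest):
        first = token_sequences[0][i]
        if any(seq[i] != first for seq in token_sequences):
            return i  # first position where the sequences disagree
    return shortest

print(common_prefix_length([[1, 2, 3, 4], [1, 2, 3, 5], [1, 2, 6, 4, 8]]))  # 2
```

For multiple choice prompts the shared prefix is the question plus delimiter, so its length is where each answer starts.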

In [13]:
tokenizer = get_tokenizer()

In [22]:
prompts = render_prompts_mc(item, continuation_delimiter, fewshot_examples=None)
tokens, start_idxs, end_idxs = batch_sequences_mc(tokenizer, prompts)
start_idxs, end_idxs

([22, 22, 22, 22], [29, 30, 30, 32])

In [26]:
tokens[0][20:25], tokens[1][20:25], tokens[2][20:25], tokens[3][20:25]

([17643, 58, 257, 17642, 4516],
 [17643, 58, 257, 5436, 4516],
 [17643, 58, 257, 2735, 281],
 [17643, 58, 21833, 6243, 281])

In [27]:
prompts[0]

'Question: Which piece of safety equipment is used to keep mold spores from entering the respiratory system?\nAnswer: a lightweight core surrounded by neutral particles'

In [29]:
tokenizer.decode([58])

':'

In [30]:
prompts[3]

'Question: Which piece of safety equipment is used to keep mold spores from entering the respiratory system?\nAnswer: overlapping layers of neutral, positive, and negative particles'

In [31]:
len(tokens[0])

29

Well, I get what `batch_sequences_mc()` does, but it doesn't answer the question above.

Ah...the answer is in `evaluate_example()`. Looks like for multiple choice we're not looking for the model to complete the prompt with the correct choice. Instead we calculate the loss for each prompt and consider the model's "answer" to be the prompt with the lowest loss.

For sure that makes way more sense. The lower the loss, the more likely the sequence is. For example, even a small, barely trained model should be able to say that "he is" is more likely than "he pineapple". Also, this gives a way of "forcing" an answer that is one of the choices.
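
A minimal sketch of that selection step, in plain Python rather than tensors. Two assumptions that I have not verified against the original code: the per-token losses are averaged over the answer tokens only, and `losses[i][t]` scores the prediction of token `t + 1`, so the answer span is shifted back by one. The name `pick_choice` is made up.

```python
def pick_choice(losses, start_idxs, end_idxs):
    """Return the index of the prompt whose answer tokens have the lowest mean loss.

    losses[i][t] is assumed to be the loss for predicting token t + 1 of prompt i.
    """
    mean_losses = []
    for row, start, end in zip(losses, start_idxs, end_idxs):
        # answer tokens sit at positions start..end-1,
        # so the losses that score them sit one step earlier
        answer_losses = row[start - 1:end - 1]
        mean_losses.append(sum(answer_losses) / len(answer_losses))
    return mean_losses.index(min(mean_losses))

nan = float("nan")
losses = [
    [9.0, 1.0, 2.0, nan],  # choice 0: mean answer loss 1.5
    [9.0, 0.5, 0.5, nan],  # choice 1: mean answer loss 0.5
]
print(pick_choice(losses, start_idxs=[2, 2], end_idxs=[4, 4]))  # 1
```

The example counts as correct when the returned index equals `gold`.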

To think about this a little more clearly...say I feed in these two sequences (assume each word is a token):

- he is a
- he pineapple big

Say in the first sequence he -> is with 10% (meaning 10% is on "is" in the vocab) and is -> a with 20%.

Say in the second sequence, he -> pineapple with 1% and pineapple -> big with 1% 

(Not even trying to make realistic numbers.)

The total chance of the first sequence is 10% * 20% = 2%

The chance of the second sequence is 1% * 1% = .01%

The first is way more likely. So if the sequences were like:

- Q:What is a pineapple? A:A vehicle
- Q:What is a pineapple? A:A fruit
- Q:What is a pineapple? A:A dwelling
- Q:What is a pineapple? A:A spiritual concept

A good model should give the highest chance to the 2nd.

We can get this by taking the sequence with the lowest total cross entropy loss.
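
The link between "highest chance" and "lowest loss" is just a logarithm: multiplying probabilities is the same as adding negative log probabilities. A small check with the made-up numbers from above (10% and 20% versus 1% and 1%):

```python
import math

he_is_a = [0.10, 0.20]           # he -> is, is -> a
he_pineapple_big = [0.01, 0.01]  # he -> pineapple, pineapple -> big

def sequence_probability(token_probs):
    """Chance of the whole sequence: product of the per-token probabilities."""
    return math.prod(token_probs)

def total_nll(token_probs):
    """Total cross entropy: sum of -log(p) over the tokens."""
    return sum(-math.log(p) for p in token_probs)

for name, probs in [("he is a", he_is_a), ("he pineapple big", he_pineapple_big)]:
    print(f"{name}: prob={sequence_probability(probs):.4f}, total loss={total_nll(probs):.2f}")
```

The total losses come out to about 3.91 and 9.21, and `exp(-total loss)` gives back the 2% and .01% from above, so ranking by lowest total loss is the same as ranking by highest probability.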

Let's do a small example using "he is a" vs "he pineapple big":

In [2]:
import torch
import torch.nn.functional as F
# vocab:   is  a  pineapple  big
# index:   0   1  2          3 

# make up logits for "he is a" showing we got it right and with high probability (-2 and -1)
# but not bothering to actually spread 100% probability over the whole vocab
logits_for_he_is_a = torch.tensor([[-2, -7, -7, -7], [-7, -1, -7, -7]], dtype=torch.float32)
#                                      he -> is           is -> a

target_for_he_is_a = torch.tensor([0, 1]) # indexes

# and same for "he pineapple big" also showing we got it right but with lower probability (-5 and -5)
logits_for_he_pineapple_big = torch.tensor([[-7, -7, -5, -7], [-7, -7, -7, -5]], dtype=torch.float32)
#                                            he -> pineapple   pineapple -> big

target_for_he_pineapple_big = torch.tensor([2, 3]) # indexes

# one more example for "he pineapple big" where we get "he -> pineapple" wrong
logits_for_he_pineapple_big_2 = torch.tensor([[-2, -7, -7, -7], [-7, -7, -7, -5]], dtype=torch.float32)
#                                              he -> pineapple   pineapple -> big

print(F.cross_entropy(logits_for_he_is_a, target_for_he_is_a))
print(F.cross_entropy(logits_for_he_pineapple_big, target_for_he_pineapple_big))
print(F.cross_entropy(logits_for_he_pineapple_big_2, target_for_he_pineapple_big))

tensor(0.0137)
tensor(0.3408)
tensor(2.6804)


Of these three examples, our best (lowest) loss is the first as expected.

In [5]:
# compute the cross entropy loss manually for the first example to remember how higher prob
# becomes lower loss
import math
loss_1 = -math.log(math.exp(-2) / (math.exp(-2) + math.exp(-7) * 3))
loss_2 = -math.log(math.exp(-1) / (math.exp(-1) + math.exp(-7) * 3))
average_loss = (loss_1 + loss_2) / 2.
print(f"loss for token 1: {loss_1:.4f}, loss for token 2: {loss_2:.4f}, average: {average_loss:.4f}")

loss for token 1: 0.0200, loss for token 2: 0.0074, average: 0.0137


In [6]:
# so if for example we predicted is -> a with even higher probability our loss goes down
loss_1 = -math.log(math.exp(-2) / (math.exp(-2) + math.exp(-7) * 3))
loss_2 = -math.log(math.exp(2) / (math.exp(2) + math.exp(-7) * 3))
average_loss = (loss_1 + loss_2) / 2.
print(f"loss for token 1: {loss_1:.4f}, loss for token 2: {loss_2:.4f}, average: {average_loss:.4f}")

loss for token 1: 0.0200, loss for token 2: 0.0004, average: 0.0102


So now hand copy `evaluate_example()` leaving out parts we don't need yet.

Looks like we need another helper function, `stack_sequences()`.

In [7]:
from my_nanochat.my_core_eval import stack_sequences

In [8]:
stack_sequences([[4, 2, 5], [4, 7, 9, 2]], pad_token_id=0)

tensor([[4, 2, 5, 0],
        [4, 7, 9, 2]])
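
The padding logic itself is tiny. A sketch without tensors, using nested lists (the name `pad_to_longest` is made up; the real helper returns a tensor as shown above):

```python
def pad_to_longest(sequences, pad_token_id):
    """Right-pad every sequence with pad_token_id up to the longest length."""
    longest = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_token_id] * (longest - len(seq)) for seq in sequences]

print(pad_to_longest([[4, 2, 5], [4, 7, 9, 2]], pad_token_id=0))
# [[4, 2, 5, 0], [4, 7, 9, 2]]
```

Padding on the right keeps `start_idxs` and `end_idxs` valid, since those index from the start of each sequence.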

In [9]:
# make sure I understand what roll does
torch.roll(torch.tensor([1,2,3]), shifts=-1, dims=0)

tensor([2, 3, 1])

And another helper function `forward_model()` which is where we calculate the loss

In [1]:
import sys
sys.path.append('../my_nanochat')
from my_nanochat.my_core_eval import forward_model
from my_nanochat.my_common import get_base_dir, autodetect_device_type
from my_nanochat.my_checkpoint_manager import build_model
import os
import torch

In [2]:
device = torch.device(autodetect_device_type())
checkpoint_dir = os.path.join(get_base_dir(), "base_checkpoints", "d4")
model, tokenizer, meta_data = build_model(checkpoint_dir, step=10, device=device, phase="eval")

Autodetected device type: mps
Building model with config: {'sequence_len': 128, 'vocab_size': 65537, 'n_layer': 4, 'n_head': 2, 'n_kv_head': 2, 'n_embd': 256}


In [3]:
input_ids = torch.tensor([[65, 66, 67], [100, 101, 102]], dtype=torch.long, device=device)
forward_model(model, input_ids)

(tensor([[11.1263, 11.1653,     nan],
         [11.1395, 11.1280,     nan]], device='mps:0',
        grad_fn=<AsStridedBackward0>),
 tensor([[ 668,  668,  668],
         [ 668,  668, 3080]], device='mps:0'))

In [4]:
# suppose we nailed the prediction of the 2nd token in the first sequence, expect that
# first loss to go down
input_ids = torch.tensor([[65, 668, 67], [100, 101, 102]], dtype=torch.long, device=device)
forward_model(model, input_ids)

(tensor([[ 6.9780, 11.3036,     nan],
         [11.1395, 11.1280,     nan]], device='mps:0',
        grad_fn=<AsStridedBackward0>),
 tensor([[ 668,  668,  668],
         [ 668,  668, 3080]], device='mps:0'))
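
To make the shape of that output concrete, here is a plain Python sketch of the same bookkeeping for one sequence: shift the inputs left by one to get targets, take the cross entropy at each position, and mark the last position as `nan` because nothing follows it. This is an illustration with made-up logits, not the real `forward_model()`.

```python
import math

def per_position_losses(logits, token_ids):
    """logits[t] holds one score per vocab entry, produced after reading token_ids[:t + 1]."""
    targets = token_ids[1:] + token_ids[:1]  # same left shift as torch.roll(..., shifts=-1)
    losses = []
    for t, row in enumerate(logits):
        if t == len(logits) - 1:
            losses.append(float("nan"))  # nothing follows the last token
            continue
        log_z = math.log(sum(math.exp(score) for score in row))
        losses.append(log_z - row[targets[t]])  # -log softmax(row)[target]
    # argmax prediction at every position (ties go to the lowest index)
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    return losses, predictions

logits = [[0.0, 0.0, 0.0], [5.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
losses, predictions = per_position_losses(logits, token_ids=[2, 1, 0])
print(losses, predictions)
```

Position 1 puts most of its weight on token 0, which is the actual next token, so its loss is small. That is the same effect as swapping 668 into the input above and watching the first loss drop.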

Finally we can try `evaluate_example()` for a multiple choice question (didn't add support for other types yet)

In [5]:
from my_nanochat.my_core_eval import evaluate_example

In [6]:
data = [
    {"choices": ["Sunlight is the source of energy for nearly all ecosystems.", 
                 "Most ecosystems are found on land instead of in water.", 
                 "Carbon dioxide is more available than other gases.", 
                 "The producers in all ecosystems are plants."], 
     "query": "Question: Which statement best explains why photosynthesis is the foundation of most food webs?", 
     "gold": 0},
    
    {"choices": ["safety goggles",
                 "breathing mask",
                 "rubber gloves",
                 "lead apron"],
     "query": "Question: Which piece of safety equipment is used to keep mold spores from entering the respiratory system?",
     "gold": 1},
    
    {"choices": ["brain cells",
                 "bone cells",
                 "muscle cells",
                 "ovary cells"],
     "query": "Question: Meiosis is a type of cell division in which germ cells divide to produce haploid cells. Where does meiosis occur?",
     "gold": 3},
]
task_meta = {
  "num_fewshot": 0, # not really
  "task_type": 'multiple_choice',
  "continuation_delimiter": "\nAnswer: ",
}
len(data)

3

In [10]:
evaluate_example(0, model, tokenizer, data, device, task_meta)

False

### "language modeling"

Go back and add support for "language modeling" and "schema question." First see what a language modeling task is.

In [18]:
!head -15 {eval_bundle_dir}/core.yaml

icl_tasks:
-
  label: hellaswag_zeroshot
  dataset_uri: language_understanding/hellaswag.jsonl
  num_fewshot: [0]
  icl_task_type: multiple_choice
-
  label: jeopardy
  dataset_uri: world_knowledge/jeopardy_all.jsonl
  num_fewshot: [10]
  icl_task_type: language_modeling
  continuation_delimiter: "\nAnswer: "
  has_categories: true
-
  label: bigbench_qa_wikidata


In [19]:
!head -5 {eval_bundle_dir}/eval_data/world_knowledge/jeopardy_all.jsonl

{"context": "WORLD HISTORY: This Navy commander flew from a base at Little America to the South Pole & back Nov. 28-29, 1929", "continuation": "Admiral Richard Byrd", "category": "world_history"}
{"context": "WORLD HISTORY: Accused of accepting bribes, Francis Bacon was imprisoned in this forbidding complex in 1621", "continuation": "Tower of London", "category": "world_history"}
{"context": "WORLD HISTORY: More than 250,000 died in fighting before France granted this African nation independence July 3, 1962", "continuation": "Algeria", "category": "world_history"}
{"context": "WORLD HISTORY: In 1784 she founded the city of Sevastopol in her new domain of the Crimea", "continuation": "Catherine the Great", "category": "world_history"}
{"context": "WORLD HISTORY: This Portuguese Admiral of the Indian Seas discovered & named the Amirante Islands", "continuation": "Vasco da Gama", "category": "world_history"}


So here do we judge by the autoregressive loss on the continuation? Let's do the prompts first.

Copying `render_prompts_lm()` looks like it has an option to include the continuation, so maybe sometimes we judge by loss and other times we judge by generation?

And later, adding lm to `evaluate_example`, I see that at least here we judge by an exact match on the predicted output. For some reason we input the sequence with the continuation, but then, using the example below, we only count it as right if ":" predicts " Admiral" AND " Admiral" predicts " Richard" AND " Richard" predicts " Byrd". Maybe this is just a convenience to avoid setting max tokens? For a second I thought this was cheating, but not really, because if, say, we predict the first token wrong, we already count that as a fail.
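
A minimal sketch of that exact-match check on plain lists. It assumes `predictions[t]` is the model's top guess for the token at position `t + 1`, so the predictions are read one step earlier than the tokens they are compared to. The name `continuation_matches` is made up.

```python
def continuation_matches(tokens, predictions, start, end):
    """True only if every continuation token is the top prediction at the position before it.

    predictions[t] is assumed to be the argmax guess for the token at position t + 1.
    """
    return predictions[start - 1:end - 1] == tokens[start:end]

tokens = [10, 11, 12, 13, 14]  # continuation is tokens[2:5]
print(continuation_matches(tokens, [0, 12, 13, 14, 0], start=2, end=5))  # True
print(continuation_matches(tokens, [0, 12, 99, 14, 0], start=2, end=5))  # False
```

One wrong token anywhere in the continuation fails the whole example, which is why a barely trained model scores 0 on these tasks.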

In [3]:
from my_nanochat.my_core_eval import render_prompts_lm, batch_sequences_lm, evaluate_example

In [4]:
item = {
    "context": "WORLD HISTORY: This Navy commander flew from a base at Little America to the South Pole & back Nov. 28-29, 1929",
    "continuation": "Admiral Richard Byrd",
    "category": "world_history",
}
continuation_delimiter = "\nAnswer: "
prompts = render_prompts_lm(item, continuation_delimiter)
prompts

['WORLD HISTORY: This Navy commander flew from a base at Little America to the South Pole & back Nov. 28-29, 1929\nAnswer:',
 'WORLD HISTORY: This Navy commander flew from a base at Little America to the South Pole & back Nov. 28-29, 1929\nAnswer: Admiral Richard Byrd']

In [5]:
tokens, start_idxs, end_idxs = batch_sequences_lm(tokenizer, prompts)
start_idxs, end_idxs

([34], [37])

In [6]:
tokenizer.decode(tokens[0][34:37])

' Admiral Richard Byrd'

In [7]:
data = [item]
task_meta = {
  "num_fewshot": 0,
  "task_type": 'language_modeling',
  "continuation_delimiter": "\nAnswer: ",
}
evaluate_example(0, model, tokenizer, data, device, task_meta)

False

^ It would be odd if my tiny CPU model, trained for only a few iterations, pulled "Admiral Richard Byrd" out of nowhere.

### "schema question"

Let's see what that's about.

In [14]:
!grep -C 5 schema {eval_bundle_dir}/core.yaml

  icl_task_type: multiple_choice
-
  label: winograd
  dataset_uri: language_understanding/winograd_wsc.jsonl
  num_fewshot: [0]
  icl_task_type: schema
-
  label: winogrande
  dataset_uri: language_understanding/winogrande.jsonl
  num_fewshot: [0]
  icl_task_type: schema
-
  label: bigbench_dyck_languages
  dataset_uri: symbolic_problem_solving/bigbench_dyck_languages.jsonl
  num_fewshot: [10]
  icl_task_type: language_modeling


In [15]:
!head -5 {eval_bundle_dir}/eval_data/language_understanding/winograd_wsc.jsonl

{"context_options": ["The city councilmen refused the demonstrators a permit because the city councilmen", "The city councilmen refused the demonstrators a permit because the demonstrators"], "continuation": "feared violence.", "gold": 0}
{"context_options": ["The city councilmen refused the demonstrators a permit because the city councilmen", "The city councilmen refused the demonstrators a permit because the demonstrators"], "continuation": "advocated violence.", "gold": 1}
{"context_options": ["The trophy doesn't fit into the brown suitcase because the trophy", "The trophy doesn't fit into the brown suitcase because the suitcase"], "continuation": "is too large.", "gold": 0}
{"context_options": ["The trophy doesn't fit into the brown suitcase because the trophy", "The trophy doesn't fit into the brown suitcase because the suitcase"], "continuation": "is too small.", "gold": 1}
{"context_options": ["Joan made sure to thank Susan for all the help Joan", "Joan made sure to thank Susan 

OK, I see the idea. We'll give the model, for example:

The trophy doesn't fit into the brown suitcase because the trophy is too large.

and

The trophy doesn't fit into the brown suitcase because the suitcase is too large.

and see which has higher probability. The correct "gold" answer is the first.

Going to cheat and copy and paste his code because this seems so similar to multiple choice and language modeling.

In [3]:
from my_nanochat.my_core_eval import render_prompts_schema, batch_sequences_schema, evaluate_example

In [4]:
item = {
    "context_options": [
        "The trophy doesn't fit into the brown suitcase because the trophy", 
        "The trophy doesn't fit into the brown suitcase because the suitcase"], 
    "continuation": "is too large.", 
    "gold": 0}
continuation_delimiter = " " # not sure about this
prompts = render_prompts_schema(item, continuation_delimiter)
prompts

["The trophy doesn't fit into the brown suitcase because the trophy is too large.",
 "The trophy doesn't fit into the brown suitcase because the suitcase is too large."]

In [5]:
tokens, start_idxs, end_idxs = batch_sequences_schema(tokenizer, prompts)
start_idxs, end_idxs

([13, 13], [17, 17])

In [12]:
tokenizer.decode(tokens[1][13:17])

' is too large.'

^ I guess the reason we look at the loss of the continuation to decide the correct answer is that, given the tokens that come before it, the autoregressive loss of the correct completion should be lower. We can't just look at the average loss of the whole stream because if, say, "...the trophy" is less likely than "...the suitcase" in general, we'd be judging that too. (For multiple choice, however, we probably could just look at the whole stream. But maybe, besides being slightly extra work, that would reduce our "signal"? Especially if there are a bunch of fewshot examples in there.)
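
Schema prompts are the mirror image of multiple choice: the contexts differ and the continuation is shared, so the span to score can be found from a common suffix instead of a common prefix. A minimal sketch on toy token lists (the names `common_suffix_length` and `schema_spans` are made up; this is how I read the idea, not a copy of `batch_sequences_schema()`):

```python
def common_suffix_length(token_sequences):
    """Number of trailing positions at which every sequence holds the same token."""
    shortest = min(len(seq) for seq in token_sequences)
    for i in range(1, shortest + 1):
        last = token_sequences[0][-i]
        if any(seq[-i] != last for seq in token_sequences):
            return i - 1  # the previous i - 1 trailing tokens all matched
    return shortest

def schema_spans(token_sequences):
    """Start/end indices of the shared continuation within each sequence."""
    suffix = common_suffix_length(token_sequences)
    end_idxs = [len(seq) for seq in token_sequences]
    start_idxs = [end - suffix for end in end_idxs]
    return start_idxs, end_idxs

# two contexts of different length that end in the same continuation 7, 8, 9
print(schema_spans([[1, 2, 3, 7, 8, 9], [1, 2, 4, 5, 7, 8, 9]]))
# ([3, 4], [6, 7])
```

In the trophy/suitcase example both contexts tokenize to the same length, which is why the start and end indices above are identical across the two prompts.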

In [13]:
data = [item]
task_meta = {
  "num_fewshot": 0,
  "task_type": 'schema',
  "continuation_delimiter": " ",
}
evaluate_example(0, model, tokenizer, data, device, task_meta)

True

### `evaluate_task()`

Copy `evaluate_task()` which evaluates many examples and supports doing so across multiple GPUs.

In [40]:
import sys
sys.path.append('../my_nanochat')
from my_nanochat.my_core_eval import evaluate_task
from my_nanochat.my_common import get_base_dir, autodetect_device_type
from my_nanochat.my_checkpoint_manager import build_model
import os
import torch
device = torch.device(autodetect_device_type())
checkpoint_dir = os.path.join(get_base_dir(), "base_checkpoints", "d4")
model, tokenizer, meta_data = build_model(checkpoint_dir, step=10, device=device, phase="eval")

Autodetected device type: mps
Building model with config: {'sequence_len': 128, 'vocab_size': 65537, 'n_layer': 4, 'n_head': 2, 'n_kv_head': 2, 'n_embd': 256}


In [2]:
data = [
    {"choices": ["Sunlight is the source of energy for nearly all ecosystems.", 
                 "Most ecosystems are found on land instead of in water.", 
                 "Carbon dioxide is more available than other gases.", 
                 "The producers in all ecosystems are plants."], 
     "query": "Question: Which statement best explains why photosynthesis is the foundation of most food webs?", 
     "gold": 0},
    
    {"choices": ["safety goggles",
                 "breathing mask",
                 "rubber gloves",
                 "lead apron"],
     "query": "Question: Which piece of safety equipment is used to keep mold spores from entering the respiratory system?",
     "gold": 1},
    
    {"choices": ["brain cells",
                 "bone cells",
                 "muscle cells",
                 "ovary cells"],
     "query": "Question: Meiosis is a type of cell division in which germ cells divide to produce haploid cells. Where does meiosis occur?",
     "gold": 3},
]
task_meta = {
  "num_fewshot": 0, # not really
  "task_type": 'multiple_choice',
  "continuation_delimiter": "\nAnswer: ",
}
len(data)

3

In [3]:
evaluate_task(model, tokenizer, data, device, task_meta)

0.3333333432674408
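
That result is 1 correct out of 3; the trailing digits are what 1/3 looks like in float32, which suggests the mean is computed on a float32 tensor. For the multi-GPU part, one common pattern is to give each rank every `world_size`-th example and then sum the per-rank results. A sketch of just the index bookkeeping, as an assumption about the approach rather than a copy of the real code (both function names are made up):

```python
def indices_for_rank(num_examples, rank, world_size):
    """Strided split: rank r handles examples r, r + world_size, r + 2 * world_size, ..."""
    return list(range(rank, num_examples, world_size))

def accuracy_from_shards(correct_flags_per_rank, num_examples):
    """Combine per-rank correctness flags (what a sum all-reduce would produce)."""
    total_correct = sum(sum(flags) for flags in correct_flags_per_rank)
    return total_correct / num_examples

print([indices_for_rank(10, rank, world_size=4) for rank in range(4)])
# [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]
```

With a single device, `world_size` is 1 and rank 0 simply handles every example.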

### back to `scripts/my_base_eval.py`

Copy `evaluate_model()` which downloads the zip if needed and goes through all the tasks with a few options.

In [7]:
!head -3 {eval_bundle_dir}/eval_meta_data.csv

Eval Task,Task Category,Task Type,#shots,#datapoints,Random baseline,Centered Metric?,Description
mmlu_zeroshot,world knowledge,multiple choice,0,14042,25,,"MMLU consists of 14,042 four-choice multiple choice questions distributed across 57 categories. The questions are in the style of academic standardized tests and the model is provided the question and the choices and is expected to choose between A, B, C, and D as its outputs. The subjects range from jurisprudence, to math, to morality."
hellaswag_zeroshot,language understanding,multiple choice,0,10042,25,,"HellaSwag consists of 10,042 multiple choice scenarios in which the model is prompted with a scenario and choose the most likely conclusion to the scenario from four possible options."


In [37]:
# centering example for multiple choice with 4 questions where we get half right
random_baseline = 25
accuracy = .5
centered = (accuracy - 0.01 * random_baseline) / (1.0 - 0.01 * random_baseline)
centered

0.3333333333333333
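
The same formula as a small function, to see how it behaves at the edges. The name `center_accuracy` is made up; the baseline is given in percent, as in the `Random baseline` column of `eval_meta_data.csv`.

```python
def center_accuracy(accuracy, random_baseline_pct):
    """Rescale accuracy so that random guessing maps to 0 and a perfect score maps to 1."""
    baseline = 0.01 * random_baseline_pct
    return (accuracy - baseline) / (1.0 - baseline)

for acc in (0.2, 0.25, 0.5, 1.0):
    print(f"accuracy {acc:.2f} -> centered {center_accuracy(acc, 25):.4f}")
```

Anything below the random baseline comes out negative, so a negative centered score just means "worse than guessing" on that task.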

Here goes...

In [1]:
import sys
sys.path.append('../my_nanochat')
from my_nanochat.my_common import get_base_dir, autodetect_device_type
from my_nanochat.my_checkpoint_manager import build_model
import os
import torch
device = torch.device(autodetect_device_type())
checkpoint_dir = os.path.join(get_base_dir(), "base_checkpoints", "d4")
model, tokenizer, meta_data = build_model(checkpoint_dir, step=10, device=device, phase="eval")
from scripts.my_base_eval import evaluate_model

Autodetected device type: mps
Building model with config: {'sequence_len': 128, 'vocab_size': 65537, 'n_layer': 4, 'n_head': 2, 'n_kv_head': 2, 'n_embd': 256}


In [3]:
evaluate_model(model, tokenizer, device, max_per_task=20)

Evaluating: hellaswag_zeroshot (0-shot, type: multiple_choice)... accuracy: 0.3000 | centered: 0.0667 | time: 4.38s
Evaluating: jeopardy (10-shot, type: language_modeling)... accuracy: 0.0000 | centered: 0.0000 | time: 1.68s
Evaluating: bigbench_qa_wikidata (10-shot, type: language_modeling)... accuracy: 0.0000 | centered: 0.0000 | time: 1.07s
Evaluating: arc_easy (10-shot, type: multiple_choice)... accuracy: 0.3000 | centered: 0.0667 | time: 7.30s
Evaluating: arc_challenge (10-shot, type: multiple_choice)... accuracy: 0.2000 | centered: -0.0667 | time: 8.19s
Evaluating: copa (0-shot, type: multiple_choice)... accuracy: 0.5500 | centered: 0.1000 | time: 1.60s
Evaluating: commonsense_qa (10-shot, type: multiple_choice)... accuracy: 0.2000 | centered: 0.0000 | time: 6.62s
Evaluating: piqa (10-shot, type: multiple_choice)... accuracy: 0.4500 | centered: -0.1000 | time: 3.01s
Evaluating: openbook_qa (0-shot, type: multiple_choice)... accuracy: 0.3000 | centered: 0.0667 | time: 1.55s
Evalua

RuntimeError: MPS backend out of memory (MPS allocated: 17.16 GiB, other allocations: 117.52 MiB, max allowed: 18.13 GiB). Tried to allocate 1.06 GiB on private pool. Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations (may cause system failure).

Code added as part of this challenge:

- Created `my_core_eval.py` with the bulk of the code for evaluating the CORE metric

- Created `scripts/my_base_eval.py`

- Added `download_file_with_lock()` to `my_common.py`