# Part 3: Repeatable Evaluations with an Eval Harness  

After experimenting informally, we need a **systematic way to measure performance**. This part introduces an **evaluation harness**—a structured script for running batch tests across multiple inputs.  

We’ll cover:  
- **Defining evaluation criteria** based on observed patterns.  
- **Running batch tests** using a dataset of team names.  
- **Expanding evaluation** to real-world data, like news articles.  

By the end, you’ll have a **repeatable evaluation process**, ensuring that our AI system’s performance is measurable and consistent across different inputs.  


In [2]:
from ollama import chat
from ollama import ChatResponse

model = 'gemma2:2b'

def single_turn(prompt):
    response: ChatResponse = chat(model=model, messages=[
      {
        'role': 'user',
        'content': prompt,
      },
    ])
    return response['message']['content']

prompt = "Say hello to the class"
single_turn(prompt)

"Hello everyone! 👋 \n\nIt's great to be here today. 😊  \n"

## Your First Evals

Here are the list of AFL and NFL team names.
Check if the model gets things right.
This means you'll need to create a scoring and aggregation function.


In [3]:
afl_team = "Carlton Blues"
american_team = "Tennessee Titans"


In [4]:
afl_clubs = [
    "Adelaide Crows",
    "Brisbane Lions",
    "Carlton Blues",
    "Collingwood Magpies",
    "Essendon Bombers",
    "Fremantle Dockers",
    "Geelong Cats",
    "Gold Coast Suns",
    "Greater Western Sydney (GWS) Giants",
    "Hawthorn Hawks",
    "Melbourne Demons",
    "North Melbourne Kangaroos",
    "Port Adelaide Power",
    "Richmond Tigers",
    "St Kilda Saints",
    "Sydney Swans",
    "West Coast Eagles",
    "Western Bulldogs"
]

nfl_teams = [
    "Arizona Cardinals",
    "Atlanta Falcons",
    "Baltimore Ravens",
    "Buffalo Bills",
    "Carolina Panthers",
    "Chicago Bears",
    "Cincinnati Bengals",
    "Cleveland Browns",
    "Dallas Cowboys",
    "Denver Broncos",
    "Detroit Lions",
    "Green Bay Packers",
    "Houston Texans",
    "Indianapolis Colts",
    "Jacksonville Jaguars",
    "Kansas City Chiefs",
    "Las Vegas Raiders",
    "Los Angeles Chargers",
    "Los Angeles Rams",
    "Miami Dolphins",
    "Minnesota Vikings",
    "New England Patriots",
    "New Orleans Saints",
    "New York Giants",
    "New York Jets",
    "Philadelphia Eagles",
    "Pittsburgh Steelers",
    "San Francisco 49ers",
    "Seattle Seahawks",
    "Tampa Bay Buccaneers",
    "Tennessee Titans",
    "Washington Commanders"
]

In [7]:
import numpy as np
eval_map = {"australian": afl_clubs, "american": nfl_teams}

# Remove key let students code themselves
score = []
for nationality, teams in eval_map.items():
    for team in teams:
        prompt = "Output if this is an australian or american team, only print australian or american no other output: " + f"{team}"
        #print(prompt)
        response = single_turn(prompt).strip()
        score.append(response.lower() == nationality)
        print(f"{team}: {response}")

print(np.array(score).mean())

Adelaide Crows: Australian
Brisbane Lions: Australian
Carlton Blues: Australian
Collingwood Magpies: Australian
Essendon Bombers: Australian
Fremantle Dockers: Australian
Geelong Cats: Australian
Gold Coast Suns: Australian
Greater Western Sydney (GWS) Giants: Australian
Hawthorn Hawks: Australian
Melbourne Demons: Australian
North Melbourne Kangaroos: Australian
Port Adelaide Power: Australian
Richmond Tigers: Australian
St Kilda Saints: Australian
Sydney Swans: Australian
West Coast Eagles: Australian
Western Bulldogs: Australian
Arizona Cardinals: American
Atlanta Falcons: American
Baltimore Ravens: American
Buffalo Bills: American
Carolina Panthers: American
Chicago Bears: American
Cincinnati Bengals: American
Cleveland Browns: American
Dallas Cowboys: American
Denver Broncos: American
Detroit Lions: American
Green Bay Packers: American
Houston Texans: American
Indianapolis Colts: American
Jacksonville Jaguars: American
Kansas City Chiefs: American
Las Vegas Raiders: American
Los A

## From Notebook to Script: Making Evaluation Repeatable

Now that we’ve run our evaluation in the notebook, we want to make this process **repeatable and scriptable**.  

Executing the following cell will **generate `traces.csv`**, storing the model's predictions alongside the ground truth.  

Once the file is created, run the evaluation harness script to **compute accuracy programmatically**:  

```
python eval_harness.py data/traces.csv --output data/scored_results.csv
```

In [9]:
import csv

eval_map = {"australian": afl_clubs, "american": nfl_teams}

# Open CSV file for writing
csv_filename = "data/traces.csv"
with open(csv_filename, "w", newline="", encoding="utf-8") as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(["team_name", "ground_truth", "model_output"])

    for nationality, teams in eval_map.items():
        for team in teams:
            prompt = "Output if this is an australian or american team, only print australian or american no other output: " + f"{team}"
            response = single_turn(prompt).strip().lower()
            
            # Save to CSV
            writer.writerow([team, nationality, response])
            
            print(f"{team}: {response}")

print(f"📁 Saved results to {csv_filename}")

Adelaide Crows: australian
Brisbane Lions: australian
Carlton Blues: australian
Collingwood Magpies: australian
Essendon Bombers: australian
Fremantle Dockers: australian
Geelong Cats: australian
Gold Coast Suns: australian
Greater Western Sydney (GWS) Giants: australian
Hawthorn Hawks: australian
Melbourne Demons: australian
North Melbourne Kangaroos: australian
Port Adelaide Power: australian
Richmond Tigers: australian
St Kilda Saints: australian
Sydney Swans: australian
West Coast Eagles: australian
Western Bulldogs: australian
Arizona Cardinals: american
Atlanta Falcons: american
Baltimore Ravens: american
Buffalo Bills: american
Carolina Panthers: american
Chicago Bears: american
Cincinnati Bengals: american
Cleveland Browns: american
Dallas Cowboys: american
Denver Broncos: american
Detroit Lions: american
Green Bay Packers: american
Houston Texans: american
Indianapolis Colts: american
Jacksonville Jaguars: american
Kansas City Chiefs: american
Las Vegas Raiders: american
Los A

## Maing things harder. Taking just the team name
What happens if we don't provide the full context. What happens to our score then?

In [22]:
afl_names = ['Crows',
 'Lions',
 'Blues',
 'Magpies',
 'Bombers',
 'Dockers',
 'Cats',
 'Suns',
 'Giants',
 'Hawks',
 'Demons',
 'Kangaroos',
 'Power',
 'Tigers',
 'Saints',
 'Swans',
 'Eagles',
 'Bulldogs']

In [23]:
nfl_teams = ['Cardinals',
 'Falcons',
 'Ravens',
 'Bills',
 'Panthers',
 'Bears',
 'Bengals',
 'Browns',
 'Cowboys',
 'Broncos',
 'Lions',
 'Packers',
 'Texans',
 'Colts',
 'Jaguars',
 'Chiefs',
 'Raiders',
 'Chargers',
 'Rams',
 'Dolphins',
 'Vikings',
 'Patriots',
 'Saints',
 'Giants',
 'Jets',
 'Eagles',
 'Steelers',
 '49ers',
 'Seahawks',
 'Buccaneers',
 'Titans',
 'Commanders']

In [24]:
eval_map = {"american": nfl_teams, "australian": afl_names}

core = []

for nationality, teams in eval_map.items():
    for team in teams:
        team = team.rsplit(maxsplit =1)[-1]
        prompt = "Output if this is an australian or american team, only print australian or american no other output: " + f"{team}"
        response = single_turn(prompt).strip()
        score.append(response.lower() == nationality)
        print(f"{team}: {response}")
print(np.array(score).mean())

Cardinals: American
Falcons: Australian
Ravens: American
Bills: American
Panthers: Australian
Bears: American
Bengals: American
Browns: American
Cowboys: American
Broncos: American
Lions: Australian
Packers: American
Texans: American
Colts: Australian
Jaguars: Australian
Chiefs: American
Raiders: Australian
Chargers: American
Rams: American
Dolphins: Australian
Vikings: Australian
Patriots: American
Saints: Australian
Giants: Australian
Jets: American
Eagles: American
Steelers: American
49ers: American
Seahawks: American
Buccaneers: American
Titans: Australian
Commanders: American
Crows: Australian
Lions: Australian
Blues: Australian
Magpies: Australian
Bombers: Australian
Dockers: Australian
Cats: Australian
Suns: Australian
Giants: American
Hawks: American
Demons: Australian
Kangaroos: Australian
Power: Australian
Tigers: Australian
Saints: American
Swans: Australian
Eagles: American
Bulldogs: Australian
0.85


## News Articles
Now try this for some long form articles. This time we won't give you an answer key, we'll let you figure things out.