# Part 2: Vibe-Based Evaluation  

Now that we have a working MVP, it’s time to test it—but before diving into structured evaluation, we’ll rely on **intuition and experimentation**.  

In this part, we’ll:  
- **Try different prompts** to explore how the model responds.  
- **Observe and analyze** the model’s behavior across examples.  
- **Identify weak spots** where results seem inconsistent or unreliable.  

This phase mirrors how many LLM practitioners start—tweaking and iterating based on "vibes" before committing to structured testing. Treat this as an **exploratory phase**: experiment, adjust, and get a feel for what works before formalizing evaluation criteria.  


In [18]:
from ollama import chat
from ollama import ChatResponse

model = 'gemma2:2b'

def single_turn(prompt):
    response: ChatResponse = chat(model=model, messages=[
      {
        'role': 'user',
        'content': prompt,
      },
    ])
    return response['message']['content']

prompt = "Say hello to the class"
single_turn(prompt)

"Hello everyone! 👋😊 \n\nIt's nice to be here.  Let's have a great day together.  \n"

## Your First Evals

Here are the list of AFL and NFL team names.
Check if the model gets things right.
This means you'll need to create a scoring and aggregation function.


In [19]:
afl_team = "Carlton Blues"
american_team = "Tennessee Titans"


In [20]:
afl_clubs = [
    "Adelaide Crows",
    "Brisbane Lions",
    "Carlton Blues",
    "Collingwood Magpies",
    "Essendon Bombers",
    "Fremantle Dockers",
    "Geelong Cats",
    "Gold Coast Suns",
    "Greater Western Sydney (GWS) Giants",
    "Hawthorn Hawks",
    "Melbourne Demons",
    "North Melbourne Kangaroos",
    "Port Adelaide Power",
    "Richmond Tigers",
    "St Kilda Saints",
    "Sydney Swans",
    "West Coast Eagles",
    "Western Bulldogs"
]

nfl_teams = [
    "Arizona Cardinals",
    "Atlanta Falcons",
    "Baltimore Ravens",
    "Buffalo Bills",
    "Carolina Panthers",
    "Chicago Bears",
    "Cincinnati Bengals",
    "Cleveland Browns",
    "Dallas Cowboys",
    "Denver Broncos",
    "Detroit Lions",
    "Green Bay Packers",
    "Houston Texans",
    "Indianapolis Colts",
    "Jacksonville Jaguars",
    "Kansas City Chiefs",
    "Las Vegas Raiders",
    "Los Angeles Chargers",
    "Los Angeles Rams",
    "Miami Dolphins",
    "Minnesota Vikings",
    "New England Patriots",
    "New Orleans Saints",
    "New York Giants",
    "New York Jets",
    "Philadelphia Eagles",
    "Pittsburgh Steelers",
    "San Francisco 49ers",
    "Seattle Seahawks",
    "Tampa Bay Buccaneers",
    "Tennessee Titans",
    "Washington Commanders"
]

In [21]:
import numpy as np
eval_map = {"australian": afl_clubs, "american": nfl_teams}

# Remove key let students code themselves
score = []
for nationality, teams in eval_map.items():
    for team in teams:
        prompt = "Output if this is an australian or american team, only print australian or american no other output: " + f"{team}"
        #print(prompt)
        response = single_turn(prompt).strip()
        score.append(response.lower() == nationality)
        print(f"{team}: {response}")

print(np.array(score).mean())

Adelaide Crows: Australian
Brisbane Lions: Australian
Carlton Blues: Australian
Collingwood Magpies: Australian
Essendon Bombers: Australian
Fremantle Dockers: Australian
Geelong Cats: Australian
Gold Coast Suns: Australian
Greater Western Sydney (GWS) Giants: Australian
Hawthorn Hawks: Australian
Melbourne Demons: Australian
North Melbourne Kangaroos: Australian
Port Adelaide Power: Australian
Richmond Tigers: Australian
St Kilda Saints: Australian
Sydney Swans: Australian
West Coast Eagles: Australian
Western Bulldogs: Australian
Arizona Cardinals: American
Atlanta Falcons: American
Baltimore Ravens: American
Buffalo Bills: American
Carolina Panthers: American
Chicago Bears: American
Cincinnati Bengals: American
Cleveland Browns: American
Dallas Cowboys: American
Denver Broncos: American
Detroit Lions: American
Green Bay Packers: American
Houston Texans: American
Indianapolis Colts: American
Jacksonville Jaguars: American
Kansas City Chiefs: American
Las Vegas Raiders: American
Los A

## Maing things harder. Taking just the team name
What happens if we don't provide the full context. What happens to our score then?

In [22]:
afl_names = ['Crows',
 'Lions',
 'Blues',
 'Magpies',
 'Bombers',
 'Dockers',
 'Cats',
 'Suns',
 'Giants',
 'Hawks',
 'Demons',
 'Kangaroos',
 'Power',
 'Tigers',
 'Saints',
 'Swans',
 'Eagles',
 'Bulldogs']

In [23]:
nfl_teams = ['Cardinals',
 'Falcons',
 'Ravens',
 'Bills',
 'Panthers',
 'Bears',
 'Bengals',
 'Browns',
 'Cowboys',
 'Broncos',
 'Lions',
 'Packers',
 'Texans',
 'Colts',
 'Jaguars',
 'Chiefs',
 'Raiders',
 'Chargers',
 'Rams',
 'Dolphins',
 'Vikings',
 'Patriots',
 'Saints',
 'Giants',
 'Jets',
 'Eagles',
 'Steelers',
 '49ers',
 'Seahawks',
 'Buccaneers',
 'Titans',
 'Commanders']

In [24]:
eval_map = {"american": nfl_teams, "australian": afl_names}

core = []

for nationality, teams in eval_map.items():
    for team in teams:
        team = team.rsplit(maxsplit =1)[-1]
        prompt = "Output if this is an australian or american team, only print australian or american no other output: " + f"{team}"
        response = single_turn(prompt).strip()
        score.append(response.lower() == nationality)
        print(f"{team}: {response}")
print(np.array(score).mean())

Cardinals: American
Falcons: Australian
Ravens: American
Bills: American
Panthers: Australian
Bears: American
Bengals: American
Browns: American
Cowboys: American
Broncos: American
Lions: Australian
Packers: American
Texans: American
Colts: Australian
Jaguars: Australian
Chiefs: American
Raiders: Australian
Chargers: American
Rams: American
Dolphins: Australian
Vikings: Australian
Patriots: American
Saints: Australian
Giants: Australian
Jets: American
Eagles: American
Steelers: American
49ers: American
Seahawks: American
Buccaneers: American
Titans: Australian
Commanders: American
Crows: Australian
Lions: Australian
Blues: Australian
Magpies: Australian
Bombers: Australian
Dockers: Australian
Cats: Australian
Suns: Australian
Giants: American
Hawks: American
Demons: Australian
Kangaroos: Australian
Power: Australian
Tigers: Australian
Saints: American
Swans: Australian
Eagles: American
Bulldogs: Australian
0.85


## News Articles
Now try this for some long form articles. This time we won't give you an answer key, we'll let you figure things out.


TODO: Ravin will fill this in. But its basically the same as above. Instead of a team name it'll take in what is a synthetically generated news article and classify it