# Homework 3: Isaac & Hamel LLM-as-Judge Office Hours Discussion

This notebook walks through the reference implementation for Homework 3, and discusses additional topics.  This notebook was reviewed in Isaac's office hours.

In [18]:
import pandas as pd
import numpy as np
import json
from pathlib import Path
import os
import subprocess

The instructions for this homework assignment are in `homeworks/hw3/README.md`.

For a recording of the homework walkthrough please see: https://youtu.be/1d5aNfslwHg

## Step 1a: Generate Traces

> NOTE: Homework assignment _option 1_ starts here.

This is the start of the process.  It starts with queries that map to a dietary restriction.

In [21]:
!head -5 data/'dietary_queries.csv'

id,query,dietary_restriction
1,I'm vegan but I really want to make something with honey - is there a good substitute? i am craving a yogurt breakfast,vegan
2,Need a quick gluten-free breakfast. I hate eggs though.,gluten-free
3,Keto breakfast that I can meal prep for the week,keto
4,I'm dairy-free and also can't stand the taste of coconut milk. What dessert can I make?,dairy-free


These queries are processed by `generate_traces.py`, which runs them through the model.  Some key notes:

The `generate_traces.py` script calls `get_agent_response` **from the application code**.  This is ideal to minimize differences between experiments and production.

Sometimes, you must have some differences.  These may include:

- Complicated tracing/logging that makes it hard to call that function programmatically
- Database connections that are hard-coded, with no convenient way to swap in a separate dev database

In these situations, the ideal is to decouple the code you need to experiment with from the production application logic that makes those experiments difficult or requires additional context/deps.  This is generally a good tech-debt-reducing activity anyway, as having clear decoupling and separation of concerns is a good idea generally :D
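One way to get this kind of decoupling is to pass production-only dependencies in as arguments, so an experiment script can call the same core logic without them.  The sketch below is hypothetical: `generate_recipe`, `call_model`, and `log_trace` are illustrative names, not functions from the homework repo.

```python
from typing import Callable, Optional


def generate_recipe(
    query: str,
    call_model: Callable[[str], str],
    log_trace: Optional[Callable[[str, str], None]] = None,
) -> str:
    """Core logic shared by production and experiments (hypothetical)."""
    response = call_model(query)
    # Production passes a real logger; experiments can pass nothing.
    if log_trace is not None:
        log_trace(query, response)
    return response


# In an experiment, swap in a stub model and skip logging entirely.
def fake_model(query: str) -> str:
    return f"Recipe for: {query}"


experiment_response = generate_recipe("Keto breakfast", call_model=fake_model)
print(experiment_response)
```

The same function can be called from the application with the real model client and logger, which keeps experiments and production on one code path.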


You can use AI to create these queries, but you **must** look at them carefully.  For example, consider this row:

> 43,Comfort food that won't make me feel guilty,vegetarian

Given the `query`, how could your application know that it should be a vegetarian dish?  This is a bad query because it does not give the LLM any way to know it's supposed to generate a vegetarian recipe.
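A crude way to find queries like this is to flag rows where the restriction text never appears in the query.  This is only a triage heuristic (it would miss spelling variants such as "gluten free"), and it does not replace reading the rows yourself.  The small DataFrame below is an illustrative stand-in for `data/dietary_queries.csv`.

```python
import pandas as pd

# Tiny illustrative sample; in the notebook this would come from
# pd.read_csv("data/dietary_queries.csv").
queries = pd.DataFrame(
    {
        "id": [3, 43],
        "query": [
            "Keto breakfast that I can meal prep for the week",
            "Comfort food that won't make me feel guilty",
        ],
        "dietary_restriction": ["keto", "vegetarian"],
    }
)


def mentions_restriction(row: pd.Series) -> bool:
    """Crude check: does the restriction text appear in the query?"""
    return row["dietary_restriction"].lower() in row["query"].lower()


queries["mentions_restriction"] = queries.apply(mentions_restriction, axis=1)

# Rows where the query gives no textual hint of the restriction.
needs_review = queries[~queries["mentions_restriction"]]
print(needs_review[["id", "query", "dietary_restriction"]])
```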

## Step 1b: Label the Data

> NOTE: Homework assignment _option 2_ starts here.

The next step is to label and create ground truth labels for traces.  The `label_data.py` script creates these ground truth labels.

In [26]:
!head -1 data/'labeled_traces.csv'

query,dietary_restriction,response,success,error,trace_id,query_id,label,reasoning,confidence,labeled


This CSV has the `query` and `dietary_restriction` from the previous step.  The most critical additional columns are:

- `response`:  The model response to the query (should be a recipe).
- `label`: Whether this recipe passed or failed to follow the dietary restriction
- `reasoning`: LLM reasoning for why it gave the label it did
- `confidence`: A rating of confidence the model gave for the label.  Useful for identifying likely model errors.

### `label_data.py` 

This script creates these ground truth labels with LLMs.  Should you?  Probably not.

The purpose of this project is to create an LLM as a judge.  By using an LLM to create the ground truth, you have essentially used an LLM as a judge to create the ground truth.  This can be OK if you use a much larger, more powerful model for the labels and are then creating a much more scalable LLM as a judge with a smaller model.  HOWEVER....

- This is not an excuse to skip looking at the data and labeling it
- There are many pitfalls with this approach that labeling manually with a domain expert completely avoids
- When getting started, this step is often used as a way to look at less of the data.  But looking at the data is the single most high-value activity

So generally, you should just label your data manually.  Even if you use an LLM for labeling, you should then go in and manually label at least a sample of the LLM-created ground truth labels to make sure the ground truth is high quality.
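If you do start from LLM-generated labels, one way to pick what to review by hand is to take the least-confident labels first and add a random sample of the rest.  This is a minimal sketch on made-up data; it assumes `confidence` is numeric, so if your file stores categories such as HIGH/LOW you would need to map them to an order first.

```python
import pandas as pd

# Illustrative stand-in for data/labeled_traces.csv (only the columns used here).
labeled = pd.DataFrame(
    {
        "trace_id": ["t1", "t2", "t3", "t4", "t5", "t6"],
        "label": ["PASS", "FAIL", "PASS", "PASS", "FAIL", "PASS"],
        "confidence": [0.95, 0.55, 0.90, 0.60, 0.99, 0.85],
    }
)

n_low, n_random = 2, 2

# Least-confident labels are the most likely LLM labeling errors.
low_conf = labeled.nsmallest(n_low, "confidence")

# A random sample of the remainder guards against confident mistakes.
rest = labeled.drop(low_conf.index)
random_sample = rest.sample(n=n_random, random_state=0)

review_queue = pd.concat([low_conf, random_sample])
print(review_queue)
```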

> NOTE:  This is an example of where looking at your data manually can avoid complexity and a whole class of problems if you can

## Step 2: Split Your Labeled Data

> NOTE: Homework assignment _option 3_ starts here.

In every data-driven approach you should split your data into different sets.

They may be called `train`, `validation`, and `test`.  Or `train`, `dev`, `test`.  But you need 3 sets.  This is true in machine learning as well as statistics.  They serve different purposes:

- **train**: You can do anything with this and "train" your model on this data.  In this case, training your model means using it to create few-shot examples for your prompt.  But you could do RAG against these, or use them for fine-tuning, or anything else.  They are fair game for everything.
- **validation**:  This is what you regularly measure against during development.  These cannot be used for RAG, or put in your prompt, or trained on.  But when you have a candidate solution you can iterate by testing how well it performs on the `validation` set.
- **test**: This is your ultimate protection to ensure your experiment results are going to translate to production and that you can predict what the impact of your change will be.  Every time you measure against it and look at it you lose some of that protection.  So do so very sparingly!

> This is to make sure your model can *generalize* beyond the specific things you have seen.  There are many names for this failure in different contexts, such as overfitting, p-hacking, data leakage, lookahead bias, and more.

Let's start with a simple approach

### Simple Approach

Read labeled traces

In [None]:
df = pd.read_csv("data/labeled_traces.csv")

Create random numbers and assign to appropriate categories based on those numbers.

In [None]:
# Draw one uniform random number per row, then bucket it into a split:
# roughly 15% train, 40% dev, 45% test.
df['rand'] = np.random.random(len(df))
df['split'] = np.where(df['rand'] < 0.15, 'train', np.where(df['rand'] < 0.55, 'dev', 'test'))

In [57]:
train_df = df[df.split == 'train']
dev_df = df[df.split == 'dev']
test_df = df[df.split == 'test']
train_df.shape, dev_df.shape, test_df.shape

((15, 13), (47, 13), (39, 13))

Once we do that we would save these splits so we have a consistent test set (we don't want to have test set values sometimes in the training set that we're looking at!)
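A minimal sketch of making the split reproducible and saving it, assuming a fixed seed and a hypothetical output file name (neither comes from the homework scripts).  The DataFrame here is an illustrative stand-in for the labeled traces.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Illustrative stand-in for the labeled traces.
df = pd.DataFrame({"trace_id": range(100), "label": ["PASS"] * 80 + ["FAIL"] * 20})

# A fixed seed means re-running the cell gives the same assignment.
rng = np.random.default_rng(seed=42)
df["rand"] = rng.random(len(df))
df["split"] = np.where(
    df["rand"] < 0.15, "train", np.where(df["rand"] < 0.55, "dev", "test")
)

# Hypothetical output location; in a real project, use a path inside your repo.
out_path = Path(tempfile.gettempdir()) / "labeled_traces_with_split.csv"
df.to_csv(out_path, index=False)

# From here on, load the saved assignment instead of re-splitting.
reloaded = pd.read_csv(out_path)
print(reloaded["split"].value_counts())
```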

And look at how our categories are divided

In [58]:
stats = df.groupby(['split', 'label']).size().unstack(fill_value=0)
stats['TTL'] = stats.sum(axis=1)
stats

| split | FAIL | PASS | TTL |
|-------|------|------|-----|
| dev   | 13   | 34   | 47  |
| test  | 7    | 32   | 39  |
| train | 6    | 9    | 15  |


The most common question is whether this is "enough".  Let's think about this test set:

There are 7 failures in the test set.  Do you think you'd feel better if your test set had 10 or 15 failures?  If so, label more!  Most problems can be avoided by looking at more data.

One guideline I like is to imagine I had 20% less data.  Do I think my experiment results could be skewed?  If so, I want to label more.  I could even run those experiments and see if they turn out differently.

What we want is a `representative sample`.  There are many ways to estimate that statistically.  One way is to draw a random sample from it many times and see whether those smaller samples look like (are *representative* of) the whole set.
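The repeated-sampling idea can be sketched as follows: draw many random 80% subsets (the "20% less data" guideline) and see how much the FAIL rate moves between them.  The labels, subset size, and number of draws are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Illustrative labels: 1 = FAIL, 0 = PASS, with a 20% failure rate overall.
labels = np.array([1] * 20 + [0] * 80)
overall_fail_rate = labels.mean()

# Draw many random subsets (80% of the data, without replacement) and
# record the failure rate in each one.
subset_size = int(0.8 * len(labels))
subset_rates = np.array(
    [rng.choice(labels, size=subset_size, replace=False).mean() for _ in range(2000)]
)

print(f"Overall FAIL rate: {overall_fail_rate:.1%}")
print(f"Subset FAIL rates range: {subset_rates.min():.1%} to {subset_rates.max():.1%}")
```

If the subset rates swing widely around the overall rate, the set is probably too small to be representative, and labeling more data is the fix.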

### Stratified Splitting Script

The `split_data.py` script uses a more advanced splitting approach called `stratified splitting`.

Instead of making dev/test/train sets purely randomly, it ensures that each category is proportionately represented in each set.  If 10% of the samples are `FAIL`, it ensures that roughly 10% of the samples in each set are `FAIL`, so we don't end up with imbalanced sets by random chance.
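Here is a minimal pandas sketch of the idea, not the actual contents of `split_data.py`: shuffle each label group separately and cut it at the same fractions.  The `stratified_split` helper, the fractions, and the data are all illustrative assumptions.

```python
import pandas as pd

# Illustrative data: 10% FAIL overall.
df = pd.DataFrame({"trace_id": range(200), "label": ["FAIL"] * 20 + ["PASS"] * 180})


def stratified_split(frame, label_col, fractions, seed=0):
    """Assign each row to a split so every label keeps roughly the same share."""
    names = list(fractions)
    parts = []
    for _, group in frame.groupby(label_col):
        shuffled = group.sample(frac=1.0, random_state=seed)
        start = 0
        cumulative = 0.0
        for i, name in enumerate(names):
            cumulative += fractions[name]
            # The last split takes whatever remains so no row is dropped.
            if i == len(names) - 1:
                end = len(shuffled)
            else:
                end = round(cumulative * len(shuffled))
            parts.append(shuffled.iloc[start:end].assign(split=name))
            start = end
    return pd.concat(parts).sort_index()


split_df = stratified_split(df, "label", {"train": 0.15, "dev": 0.40, "test": 0.45})
counts = split_df.groupby(["split", "label"]).size().unstack(fill_value=0)
print(counts)
```

With this data every split ends up with exactly 10% `FAIL`, which a purely random split would only achieve approximately.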

Should you use stratified splitting?  Not necessarily!  It's OK to start simple.  If random splitting is causing imbalanced labels in your sets, there are 2 broad ways to fix it:

1. **Best Way** Label more data until randomness is a non-issue!  This is usually the best way to go.  The more labeled samples you have, the less likely it is that randomness will cause "wonkiness".
1. Make sure that they are balanced with stratified splitting.  This is really valuable when more labeling is extremely costly (either very hard to do, or classes are so imbalanced you'd have to label an insane amount to get the minority classes to have decent coverage).

> NOTE:  This is *another* example of where looking at your data manually can avoid complexity and a whole class of problems if you can

## Step 3: Develop Your LLM-as-Judge Prompt

The `develop_judge.py` script uses an existing judge prompt and adds some few-shot examples from the training set.

This script chooses few-shot examples randomly to create a demonstration of what a prompt might look like.  During the process I recommend:

1. Create detailed criteria
2. Look at the training set and think about where applying those criteria might be hard or unclear
3. Add the few-shot examples deliberately based on failure modes.
4. Repeat

For example, example 2 shows chicken in a recipe for a vegetarian dietary restriction.  I suspect that this is a bit too obvious of a failure to add anything significantly helpful to the model.


Guidelines for few-shot examples in the prompt:

- Showing the model the full query and response is helpful.  If that is too large, you can try to make a more minimal reproduction that still conveys what is difficult to express in the written instructions.
- Include the correct answer
- Include reasoning information for why the answer is correct that explains the nuance.  LLM generated reasoning can be a **starting** point, but this is critical and should be very carefully thought about and edited.
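As a sketch of these guidelines, a few-shot example could be rendered as a block that carries the query, the response, the reasoning, and the correct label.  The layout and the `format_few_shot_example` helper are hypothetical, and the response and reasoning text below are made up for illustration.

```python
def format_few_shot_example(query, restriction, response, label, reasoning):
    """Render one labeled trace as a few-shot example block (hypothetical layout)."""
    return (
        f"Query: {query}\n"
        f"Dietary restriction: {restriction}\n"
        f"Response: {response}\n"
        f"Reasoning: {reasoning}\n"
        f"Label: {label}"
    )


# Made-up example chosen because the failure is subtle rather than obvious.
example = format_few_shot_example(
    query="I'm vegan and craving a yogurt breakfast.",
    restriction="vegan",
    response="Coconut yogurt parfait with granola and a drizzle of honey.",
    label="FAIL",
    reasoning=(
        "Everything is plant-based except the honey. Honey is an animal "
        "product under the vegan definition, so the recipe fails."
    ),
)
print(example)
```

Putting the reasoning before the label is a design choice here: the example then shows the judge the nuance first and the verdict second.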

In [39]:
# View the judge prompt that was created
with open('results/judge_prompt.txt', 'r') as f:
    judge_prompt = f.read()

# Display first 1000 characters to see the structure
print("Judge Prompt Preview:")
print("=" * 80)
print(judge_prompt[:1000])
print("\n... [truncated]")

Judge Prompt Preview:
You are an expert nutritionist and dietary specialist evaluating whether recipe responses properly adhere to specified dietary restrictions.

DIETARY RESTRICTION DEFINITIONS:
- Vegan: No animal products (meat, dairy, eggs, honey, etc.)
- Vegetarian: No meat or fish, but dairy and eggs are allowed
- Gluten-free: No wheat, barley, rye, or other gluten-containing grains
- Dairy-free: No milk, cheese, butter, yogurt, or other dairy products
- Keto: Very low carb (typically <20g net carbs), high fat, moderate protein
- Paleo: No grains, legumes, dairy, refined sugar, or processed foods
- Pescatarian: No meat except fish and seafood
- Kosher: Follows Jewish dietary laws (no pork, shellfish, mixing meat/dairy)
- Halal: Follows Islamic dietary laws (no pork, alcohol, proper slaughter)
- Nut-free: No tree nuts or peanuts
- Low-carb: Significantly reduced carbohydrates (typically <50g per day)
- Sugar-free: No added sugars or high-sugar ingredients
- Raw vegan: Vegan foods 

## Refine & Validate Your Judge

The judge was evaluated on the dev set.  Let's look at the performance metrics generated by `evaluate_judge.py`.

In [16]:
# Load judge performance metrics
with open('results/judge_performance.json', 'r') as f:
    judge_perf = json.load(f)

test_perf = judge_perf['test_set_performance']

print("Test Set Performance:")
print(f"TPR (True Positive Rate): {test_perf['true_positive_rate']:.3f}")
print(f"TNR (True Negative Rate): {test_perf['true_negative_rate']:.3f}")

Test Set Performance:
TPR (True Positive Rate): 1.000
TNR (True Negative Rate): 0.750


**TPR (True Positive Rate): 1.000**

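- What this means:  
    - The judge passes 100% of the responses that actually follow the dietary restriction (PASS is the positive class here, matching the false negative/false positive definitions in the error analysis below)
    - No false negatives: the judge never fails a compliant recipe.  This says nothing about missed violations (e.g., serving meat to vegetarians); those are measured by TNR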
- Is this good?
    - Probably not.  Is it reasonable to believe that the judge is **never** wrong on this class of responses?  Maybe, but more likely:
        - Your test set is contaminated
        - Your eval code has a bug in it
        - You don't have representative user queries
        - Some other issue
    - What if I checked all that and I'm certain it's really 100%?
        - Maybe this task is just too easy?  If so, is it commercially viable?  Can someone else replicate it super easily?  Or are there other non-AI parts of the product that provide enough value that it's OK that this isn't hard?
        - You still don't have any information or signal for how to improve.  That should be worrying!

**TNR (True Negative Rate): 0.750** 
- Your judge correctly identifies 75% of responses that violate the dietary restriction
- 25% false positive rate: the judge sometimes passes a non-compliant response as compliant
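To make the definitions concrete, here is a minimal sketch of computing TPR and TNR from labels, treating `PASS` as the positive class.  That convention matches the `false_negatives` / `false_positives` filters in the error analysis cell of this notebook.  The `tpr_tnr` helper and the label lists are illustrative, not taken from `evaluate_judge.py`.

```python
def tpr_tnr(true_labels, predicted_labels, positive="PASS"):
    """Compute TPR and TNR, treating `positive` as the positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    # Guard against a class being absent from the labeled set.
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    tnr = tn / (tn + fp) if (tn + fp) else float("nan")
    return tpr, tnr


# Illustrative labels only: 4 true PASS (all judged PASS),
# 4 true FAIL (3 judged FAIL, 1 wrongly judged PASS).
true_labels = ["PASS"] * 4 + ["FAIL"] * 4
predicted = ["PASS"] * 4 + ["FAIL", "FAIL", "FAIL", "PASS"]

tpr, tnr = tpr_tnr(true_labels, predicted)
print(f"TPR: {tpr:.3f}, TNR: {tnr:.3f}")
```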

Ok, so what's more important?  TPR or TNR?

There are 2 possible failure modes in this task:

1. The model creates a non-compliant recipe and the judge thinks it is compliant (e.g. it is supposed to be dairy-free but it has cheese in it).
2. The model creates a compliant recipe and the judge thinks it is non-compliant (e.g. the recipe is dairy-free, but the judge thinks it has dairy in it and fails it).

Break each one down and consider the scenarios and the impact of failing:

#### The model creates a non-compliant recipe and the judge thinks it is compliant

Imagine...


- A user asked for a dairy-free recipe and it has cheese in it:
    - Is there harm to the user?  What if they are lactose intolerant?
    - Does this affect churn/subscription likelihood?  By a lot or a little?
- A user asked for a shellfish-free recipe and it has oyster sauce in it:
    - Is there harm to the user?  What if they don't know oyster sauce often has shellfish?
    - Does this affect churn/subscription likelihood?  By a little or a lot?
    
#### The model creates a compliant recipe and the judge thinks it is non-compliant

- A user asked for a dairy-free recipe, and the generated recipe was dairy-free but my judge thought it wasn't.
    - What is the product behavior?  
        - Would it show the recipe anyway and it be fine?
        - Do you have logic to ask it to try again, meaning increased perceived latency for the user?
        - Would it explain the failure to the user?
    - Based on the product behavior:
        - What's the churn/subscription likelihood risk?
        - Is there any real harm to the user?
        - Can they still use the product with an obvious workaround, or are they completely stuck?

#### Decision

Once you have really thought about those, consider which kind of error is more painful.  Which do you care more about?  Do you care about each equally, or is one drastically more important to your product than the other?  

Communicate this reasoning clearly to everyone involved!


## Analyze Test Errors

Once we have a judge we are happy with, we can measure on the test set.

In [30]:
# Analyze test predictions to understand errors
with open('results/test_predictions.json', 'r') as f:
    test_preds = json.load(f)

In [31]:
false_negatives = [p for p in test_preds if p['true_label'] == 'PASS' and p['predicted_label'] == 'FAIL']
false_positives = [p for p in test_preds if p['true_label'] == 'FAIL' and p['predicted_label'] == 'PASS']

In [32]:
print(f"False Negatives: {len(false_negatives)}")
print(f"False Positives: {len(false_positives)}")

False Negatives: 0
False Positives: 2


We can see how many failures we had!

> ❗️ Remember, the more we look at the test set the less it protects us against overfitting.  So what we're going to do next is NOT what you should do!

Let's look at a failure:

In [13]:
print(false_positives[1]['dietary_restriction'])
print(false_positives[1]['response'])

dairy-free
Great! I recommend trying a **Creamy Coconut Curry with Vegetables**. It's a flavorful and nutritious dish that’s dairy-free, but you can add a bit of cheese on the side if desired. Here's a recipe for 2 servings:

### Ingredients:
- 1 tablespoon coconut oil or vegetable oil
- 1 small onion, finely chopped
- 2 cloves garlic, minced
- 1 thumb-sized piece of ginger, grated
- 1 medium carrot, sliced
- 1 bell pepper (any color), sliced
- 1 zucchini, sliced
- 1 cup canned chickpeas, drained and rinsed
- 1 can (13.5 oz) coconut milk
- 1 teaspoon curry powder
- 1/2 teaspoon turmeric
- 1/2 teaspoon ground cumin
- Salt and pepper to taste
- Fresh cilantro, chopped (optional, for garnish)
- Lime wedges (optional, for serving)
- Cheese (optional, for topping)

### Instructions:
1. **Heat the oil:** In a large skillet or saucepan, heat the coconut oil over medium heat until shimmering.

2. **Sauté aromatics:** Add the chopped onion and cook for about 3-4 minutes until softened and trans

As we look we can clearly see the failure mode.  The recipe is supposed to be dairy-free, but there's an optional cheese topping!

But what now?

In production, I don't get to see a user query and change the system to better handle that specific user query.  So I shouldn't be able to do that on the test set!  The problem is that I am now biased.  I know that the optional cheese topping is a specific failure mode.  I suspect the judge might miss all optional toppings.  Even if I don't add this example to the prompt, this failure will be top of mind, and my way of improving performance will be biased toward this failure in my test set.  It's a bit like hardcoding a solution.

So what do I do when I really need to look at the test set?

- Accept my future tests are compromised
- Turn the test set into a larger training set and make a new test set.  If this was a surprising failure mode, then we haven't reached saturation so we should go look at more data.

## Measure on "New" Traces

Now we can run the judge on more traces that don't have ground truth, to estimate how well our production traces adhere to our criteria.  This is done in `run_full_evaluation.py`.

In [35]:
# Load the final evaluation results
with open('results/final_evaluation.json', 'r') as f:
    final_eval = json.load(f)

eval_results = final_eval['final_evaluation']

print(f"Total traces evaluated: {eval_results['total_traces_evaluated']}")
print(f"Raw observed success rate: {eval_results['raw_observed_success_rate']:.3f} ({eval_results['raw_observed_success_rate']*100:.1f}%)")

Total traces evaluated: 2400
Raw observed success rate: 0.857 (85.7%)


But there's a problem!  We know our judge isn't perfect, so we should account for that in what it tells us.

## Report Results with judgy

The `judgy` library was used to correct for judge bias based on the TPR/TNR from the test set.

If our judge was 90% accurate based on evaluations on our labeled data, and when run on production data it said we were 100% correct, then **we should be skeptical of that 100% result**.

Here's the intuitive reasoning:

**The Problem:** Our judge makes mistakes 10% of the time. So when it says "everything is perfect" on production data, some of those "perfect" judgments are probably wrong.

**What's Really Happening:** 
- The judge is likely missing some actual violations (false positives, since PASS is the positive class)
- It might also be flagging some good responses as bad (false negatives)
- The 100% "success rate" is inflated because the judge isn't catching all the real problems

**The Correction:**
Since we know our judge's error patterns from the labeled test data, we can mathematically adjust the 100% observed rate to estimate what the *true* success rate probably is. 

Basically we know our judge isn't perfect, so we should account for that when we use it for measurements.
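To make the adjustment concrete, here is a sketch of the standard point-estimate correction (often called the Rogan-Gladen estimator).  It follows from writing the observed pass rate as `true_rate * TPR + (1 - true_rate) * (1 - TNR)` and solving for `true_rate`.  This is not a claim about what `judgy` does internally, and it does not produce a confidence interval; the `corrected_success_rate` helper and the numbers are illustrative.

```python
def corrected_success_rate(observed_rate, tpr, tnr):
    """Estimate the true pass rate from the judge's observed pass rate.

    observed = true * TPR + (1 - true) * (1 - TNR)
    =>  true = (observed + TNR - 1) / (TPR + TNR - 1)
    """
    denom = tpr + tnr - 1.0
    if denom <= 0:
        raise ValueError("Judge is no better than chance; correction is undefined.")
    estimate = (observed_rate + tnr - 1.0) / denom
    # A rate must stay within [0, 1], so clip the estimate.
    return min(1.0, max(0.0, estimate))


# Hypothetical numbers: the judge passes 80% of production traces,
# with TPR = 0.95 and TNR = 0.85 measured on labeled data.
estimate = corrected_success_rate(0.80, tpr=0.95, tnr=0.85)
print(f"Corrected success rate: {estimate:.4f}")
```

Whether the correction moves the estimate up or down depends on which kind of judge error dominates.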

In [36]:
# Import judgy to understand the correction
import judgy

# Display the corrected results
print("\nCorrected Results (accounting for judge bias):")
print(f"Corrected success rate: {eval_results['corrected_success_rate']:.3f} ({eval_results['corrected_success_rate']*100:.1f}%)")
print(f"95% Confidence Interval: [{eval_results['confidence_interval_95']['lower_bound']:.3f}, {eval_results['confidence_interval_95']['upper_bound']:.3f}]")
print(f"                        [{eval_results['confidence_interval_95']['lower_bound']*100:.1f}%, {eval_results['confidence_interval_95']['upper_bound']*100:.1f}%]")

correction = eval_results['corrected_success_rate'] - eval_results['raw_observed_success_rate']
print(f"\nCorrection applied: {correction:.3f} ({correction*100:.1f} percentage points)")


Corrected Results (accounting for judge bias):
Corrected success rate: 0.926 (92.6%)
95% Confidence Interval: [0.817, 1.000]
                        [81.7%, 100.0%]

Correction applied: 0.069 (6.9 percentage points)


## Confidence Intervals Made Simple

### Is sampling ok instead of statistical formulas?

All statistical measures you need for evaluations can be derived through sampling.  This is beneficial because:

1. You avoid assumptions about your data that are baked into statistical formulas, which can be hard to get right
2. It is easy to reason about and change your evaluations
3. It's faster and easier to explain how it works to non-statisticians

People often think the approach I am going to show is not the "right" way, but here's what R.A. Fisher (the "father" of modern statistics) had to say about why the complex statistical formulas were used:

> Actually, the statistician does not carry out this very simple and very tedious process, but his conclusions have no justification beyond the fact that they agree with those which could have been arrived at by this elementary method.

Fisher was describing sampling as too tedious in 1936.  Today, it's as simple as `for i in range(10000)`.  It's not tedious anymore, so we don't need complicated computational shortcuts.  We can do things the simple way :)

###  How to do it

Let's create a simple dataset.  We will have 17 passes (1), and 3 failures (0).

In [37]:
# Our data: 1 = pass, 0 = fail
results = [1] * 17 + [0] * 3
observed_rate = np.mean(results)
print(f"Observed success rate: {observed_rate:.1%}")

Observed success rate: 85.0%


We can see that the mean is 85%, meaning we have an 85% success rate.  The question is how much uncertainty is there?  If I re-did the experiment that got an 85% success rate over and over, what range of success rates would I get?  Is it pretty much always going to be 85%?  Or would it swing wildly and have lots of noise?

We can check by sampling.  We pick a random item from our list of 20 results, put it back, and do that 20 times (sampling with replacement).  Sometimes we'll get unlucky and pick lots of `0`'s.  Sometimes we'll get lucky and get very few.  Let's try it 10,000 times to see how much luck has to do with it.

In [38]:
# Bootstrap in 5 lines
bootstrap_rates = []
for _ in range(10000):
    resample = np.random.choice(results, size=len(results), replace=True)
    bootstrap_rates.append(np.mean(resample))

# Get 95% confidence interval
ci_lower = np.percentile(bootstrap_rates, 2.5)
ci_upper = np.percentile(bootstrap_rates, 97.5)

print(f"95% Confidence Interval: [{ci_lower:.1%}, {ci_upper:.1%}]")

95% Confidence Interval: [70.0%, 100.0%]


Wow!  95% of the time my values are between 70% and 100%.  That's a huge range.  I don't feel good about that, because 70% accuracy seems unacceptable, but 100% accuracy seems suspicious.

**What can I do?**

This is *yet another* example where just getting more data and labeling it solves a lot of problems!  The more data you have, the less uncertainty in your results!  Let's keep the same 85% success rate, but with more data.

In [43]:
# Our data: 1 = pass, 0 = fail
results = [1] * 170 + [0] * 30  # 170 passes, 30 fails
observed_rate = np.mean(results)
print(f"Observed success rate: {observed_rate:.1%}")

Observed success rate: 85.0%


Let's do the exact same sampling, but with the larger dataset (200 labeled examples instead of 20).

In [47]:
# Bootstrap in 5 lines
bootstrap_rates = []
for _ in range(10000):
    resample = np.random.choice(results, size=len(results), replace=True)
    bootstrap_rates.append(np.mean(resample))

# Get 95% confidence interval
ci_lower = np.percentile(bootstrap_rates, 2.5)
ci_upper = np.percentile(bootstrap_rates, 97.5)

print(f"95% Confidence Interval: [{ci_lower:.1%}, {ci_upper:.1%}]")

95% Confidence Interval: [80.0%, 90.0%]


That's much better!

To improve the accuracy of your confidence interval, you can increase the number of loops you do.  But generally this is plenty accurate.
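Since the same bootstrap is run twice above, it can be wrapped in a small reusable function.  This is a sketch: `bootstrap_ci` is a hypothetical helper name, and the fixed seed is an added assumption so that results are repeatable.

```python
import numpy as np


def bootstrap_ci(results, n_boot=10_000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for a pass rate."""
    rng = np.random.default_rng(seed)
    data = np.asarray(results)
    # Each row is one resample (with replacement) of the same size as the data.
    resamples = rng.choice(data, size=(n_boot, len(data)), replace=True)
    rates = resamples.mean(axis=1)
    tail = (1.0 - level) / 2 * 100
    return np.percentile(rates, tail), np.percentile(rates, 100 - tail)


small = [1] * 17 + [0] * 3      # 20 labeled examples, 85% pass
large = [1] * 170 + [0] * 30    # 200 labeled examples, 85% pass

small_ci = bootstrap_ci(small)
large_ci = bootstrap_ci(large)
print(f"n=20:  [{small_ci[0]:.1%}, {small_ci[1]:.1%}]")
print(f"n=200: [{large_ci[0]:.1%}, {large_ci[1]:.1%}]")
```

Raising `n_boot` makes the interval endpoints more stable, while adding labeled data is what actually narrows the interval.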
