# Chapter 1: Introduction to LLM Evaluation

## Setup

In [1]:
import weave
from set_env import set_env
import nest_asyncio
import json
import asyncio

In [2]:
set_env("GOOGLE_API_KEY")
set_env("WANDB_API_KEY")
print("Env set")

Env set


In [3]:
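try:
    # get_ipython() returns None when no IPython/Jupyter shell is active,
    # so an installed-but-unused IPython no longer counts as "in Jupyter".
    from IPython import get_ipython
    in_jupyter = get_ipython() is not None
except ImportError:
    in_jupyter = False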
if in_jupyter:
    nest_asyncio.apply()

In [6]:
from utils.config import WEAVE_PROJECT, ENTITY, MODEL, MODEL_CLIENT
from utils.prompts import (
    medical_task, medical_system_prompt,
    medical_privacy_judge_prompt, medical_privacy_system_prompt, MedicalPrivacyJudgement,
    medical_task_score_prompt, medical_task_score_system_prompt, MedicalTaskScoreJudgement,
)
from utils.render import display_prompt, print_dialogue_data
from utils.llm_client import LLMClient
from utils.evals import get_evaluation_predictions, calculate_kappa_scores, calculate_weighted_alignment

## Understanding Medical Data Extraction Evaluation

### The Task: What Are We Trying to Do?

#### Raw Data Format
Medical conversations are messy and unstructured. Our example data comes in two forms:

1. **Dialogue Format**:
   - Back-and-forth conversation between doctor and patient
   - Contains personal details, small talk, and medical information mixed together
   - Informal language ("hey", "mm-hmm", "yeah")
   - Important details scattered throughout

2. **Medical Notes**:
   - More structured but still in prose
   - Contains standardized sections (CHIEF COMPLAINT, HISTORY, etc.)
   - Includes sensitive information (names, ages)
   - Medical terminology and abbreviations

#### Extraction Goals
The LLM needs to:
1. Find relevant information
2. Ignore irrelevant details
3. Standardize the format
4. Protect patient privacy
5. Maintain medical accuracy

In [None]:
weave_client = weave.init(f"{ENTITY}/{WEAVE_PROJECT}")

In [None]:
display_prompt(medical_system_prompt)
display_prompt(medical_task)

<div align="center">
    <img src="./media/medical_chatbot.png" width="250"/>
</div>

In [None]:
annotated_medical_data = weave.ref(f"weave:///{ENTITY}/{WEAVE_PROJECT}/object/medical_data_annotations:latest").get()
# annotated_medical_data = weave.ref("weave:///a-sh0ts/eval_course_ch1_dev/object/medical_data_annotations:At9gri9UasftpPe5VNzT3EuIXQWAo5MYX8aMf2cuE8A").get()

In [10]:
print_dialogue_data(annotated_medical_data, indexes_to_show=[0])

### Let's generate an example now

In [12]:
llm = LLMClient(model_name=MODEL, client_type=MODEL_CLIENT)
llm.predict(user_prompt=medical_task.format(transcript=annotated_medical_data[0][0]["input"]), system_prompt=medical_system_prompt)



🍩 https://wandb.ai/eval-course/eval_course_ch1_dev/r/call/0192e825-6ce3-75b3-a805-46921cfa36e0


'• **Chief complaint:** Bilateral elbow pain, right worse than left.\n\n• **History of present illness:** 1.5 years of bilateral elbow pain, worse on the right,  worsened by upper extremity use. Pain located on the medial aspect of both elbows.  Patient uses ibuprofen 800mg three times daily for pain relief; ice provides no relief.  History of athletic activity (basketball, baseball, football) without prior elbow pain.\n\n• **Physical examination:**  Pulses equal in all extremities. Normal distal sensation. Right elbow: limited range of motion in extension, pain with flexion and extension, pain with supination, medial aspect pain on palpation. Left elbow: minimal pain with flexion and extension, slight limited ROM on extension, pain with supination.\n\n• **Symptoms:**  Bilateral elbow pain (medial aspect), worse with use,  worse on the right.\n\n• **New medications:**  MRI ordered.  Whole blood transfusion discussed as a treatment option.\n\n• **Follow-up instructions:** MRI scheduled;

## Data Collection and Curation for Evaluation

### Best practices for medical extraction evaluation:
1. Collect real medical transcripts and LLM outputs
2. Include diverse medical conditions and conversation styles
3. Balance routine vs complex medical cases
4. Remove duplicate records
5. Validate with medical experts

The curated transcripts and LLM outputs become our evaluation dataset.

In [13]:
print_dialogue_data(annotated_medical_data, indexes_to_show=[1])


## Why and How to Evaluate LLMs

### Core Principles of LLM Evaluation
Unlike traditional software testing, LLM evaluation requires special consideration:

1. **Non-Deterministic Outputs**
   - Models can give different valid answers
   - Responses vary between runs
   - Multiple correct solutions possible

2. **Quality is Multi-Dimensional**
   - Correctness isn't binary
   - Context matters heavily
   - Different stakeholders have different priorities

3. **Scale vs Accuracy Trade-offs**
   - Manual review is accurate but expensive
   - Automated checks are scalable but limited
   - Hybrid approaches often work best

### Practical Evaluation Recipe 🧑‍🍳

1. **Define Success Criteria**
   - List must-have requirements
   - Set acceptable thresholds
   - Identify critical failures

2. **Build Evaluation Suite**
   - Automated checks for clear rules
   - Expert review for nuanced cases
   - Version control evaluation code

3. **Create Scoring System**
   - Weight different factors
   - Establish baselines
   - Plan for aggregation

### Applying to Medical Data Extraction 🏥

For our medical extraction task, this means:
- **Success Criteria**: Required fields, privacy compliance, word limits
- **Evaluation Suite**: Automated checks + medical expert review
- **Scoring**: Weighted combination of format, accuracy, and safety metrics

Let's see how to implement this...

![](./media/traditional_llm_eval.png)

## First, Annotation: Building Quality Training Data

### Why Annotate?
To evaluate LLMs effectively, we need expert-labeled data that:
1. Defines what "good" looks like
2. Shows us what to test for
3. Helps align our automated tests with human judgment

### The Process
Experts review outputs and provide structured feedback. This creates a foundation for:
- Building automated evaluation tests
- Measuring how well those tests match expert judgment
- Refining our evaluation methods until they align with expert standards

Think of annotations as our compass: they help ensure that our later automated evaluation methods point in the same direction as human experts when assessing the quality of our LLM's outputs.

![](./media/annotation_ui.png)

In [14]:
print_dialogue_data(annotated_medical_data, indexes_to_show=[2, 3, 4])

## Evaluation: Measuring Performance

### Understanding LLM Evaluation
LLM evaluation combines three complementary approaches:

1. **Automated Checks**
   - Fast, programmatic tests
   - Clear pass/fail criteria
   - Example: format rules, required fields

2. **Model-Assisted Evaluation**
   - Using LLMs to evaluate outputs
   - Helpful for subjective criteria
   - Example: checking medical accuracy, privacy compliance

3. **Expert Review**
   - Human validation of complex cases
   - Ground truth for training evaluators
   - Example: annotated datasets

### Building Evaluation Systems

In this notebook, we'll implement this through:

1. **Basic Tests**
   ```python
   test_adheres_to_required_keys()
   test_adheres_to_word_limit()
   ```

2. **LLM Judges**
   ```python
   judge_adheres_to_privacy_guidelines()
   judge_overall_score()
   ```

3. **Key Questions**
   - How closely do automated evaluations match human judgment?
   - When do automated systems diverge from human experts?
   - What makes a good evaluation system?

These questions lead us to the concept of alignment: measuring how well our automated systems match human expectations and values. We'll explore practical ways to measure and improve this alignment after implementing our evaluation system.

![](./media/eval_task_flowchart.png)

### Using Domain Knowledge to Build Evaluation Tests

We'll create four key tests to evaluate our medical extraction outputs:

1. **Required Fields Check**
   - Verifies presence of essential medical fields
   - E.g., "Chief complaint", "Symptoms", "Follow-up instructions"

2. **Word Limit Check**
   - Ensures the output stays within the 150-word limit
   - Promotes concise, focused summaries

3. **Privacy Guidelines Check**
   - Uses an LLM to detect any PII leakage
   - Critical for medical data compliance

4. **Overall Quality Score**
   - LLM-based assessment of extraction quality
   - Considers accuracy, completeness, and format

These tests will be validated against our expert-annotated dataset to ensure they align with human judgment. This alignment process helps us understand how well our automated evaluation matches medical expert standards.

Let's implement each test:

In [15]:
test_output = annotated_medical_data[0][1]["output"]

In [17]:
@weave.op()
def test_adheres_to_required_keys(model_output: str):
    # Required medical keys
    required_keys = [
        "Chief complaint",
        "History of present illness",
        "Physical examination",
        "Symptoms",
        "New medications with dosages",
        "Follow-up instructions"
    ]
    
    # Convert to lowercase for case-insensitive matching
    model_output_lower = model_output.lower()
    
    # Check if all required keys are present
    for key in required_keys:
        if key.lower() not in model_output_lower:
            return int(False)
            
    return int(True)

In [18]:
test_adheres_to_required_keys(test_output)

🍩 https://wandb.ai/eval-course/eval_course_ch1_dev/r/call/0192e825-7c9b-7ef3-954c-6064a1c0c7fb


0
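
A result of 0 does not say which field was missing. As a minimal sketch, the hypothetical helper `find_missing_keys` (not part of the course utilities) applies the same case-insensitive substring match and returns the absent keys instead of a single flag. The sample output is made up for illustration.

```python
# Hypothetical diagnostic helper; it mirrors the matching rule of
# test_adheres_to_required_keys but reports what is missing.
REQUIRED_KEYS = [
    "Chief complaint",
    "History of present illness",
    "Physical examination",
    "Symptoms",
    "New medications with dosages",
    "Follow-up instructions",
]


def find_missing_keys(model_output: str, required_keys=REQUIRED_KEYS) -> list[str]:
    """Return the required keys that do not appear in the output."""
    output_lower = model_output.lower()
    return [key for key in required_keys if key.lower() not in output_lower]


# Made-up output, for illustration only.
sample_output = "Chief complaint: cough. Symptoms: fever. New medications: none."
print(find_missing_keys(sample_output))
```

An empty list corresponds to a score of 1. Because the match is a plain substring test, a heading such as "New medications:" does not satisfy the stricter key "New medications with dosages".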

In [19]:
@weave.op()
def test_adheres_to_word_limit(model_output: str):
    return int(len(model_output.split()) <= 150)

In [20]:
test_adheres_to_word_limit(test_output)

🍩 https://wandb.ai/eval-course/eval_course_ch1_dev/r/call/0192e825-7ca5-74b1-99ba-1c85856f04ae


0

In [21]:
display_prompt(medical_privacy_system_prompt)
display_prompt(medical_privacy_judge_prompt)

In [22]:
@weave.op()
def judge_adheres_to_privacy_guidelines(model_output: str):
    llm = LLMClient(model_name=MODEL, client_type=MODEL_CLIENT)
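    response = llm.predict(
        user_prompt=medical_privacy_judge_prompt.format(text=model_output),
        system_prompt=medical_privacy_system_prompt,
        schema=MedicalPrivacyJudgement,
    )
    try:
        # The judge returns a JSON list; the first item carries the verdict.
        result = json.loads(response.text.strip("\n"))
        return int(not result[0]["contains_pii"])
    except Exception:  # a bare except would also swallow KeyboardInterrupt
        # NOTE: a response that cannot be parsed currently counts as a pass.
        return int(True)  # TODO: Add json parsing as failure reason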

In [23]:
judge_adheres_to_privacy_guidelines(test_output)

🍩 https://wandb.ai/eval-course/eval_course_ch1_dev/r/call/0192e825-7cb8-7101-872f-8235c449b437


0
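
The TODO in the judge points at a gap: when the response cannot be parsed, the judge returns 1, which is indistinguishable from a genuine pass. One possible way to close it is sketched below with a hypothetical helper, `parse_privacy_judgement`. It assumes the response text is a JSON list whose first item has a `contains_pii` field, as the judge code implies, and returns the score together with a failure reason.

```python
import json


def parse_privacy_judgement(raw_text: str) -> dict:
    """Parse a judge response into a score plus an optional failure reason.

    Hypothetical helper: the expected shape (a JSON list whose first item
    has a boolean "contains_pii" field) is inferred from the judge code.
    """
    try:
        result = json.loads(raw_text.strip("\n"))
        contains_pii = bool(result[0]["contains_pii"])
    except (json.JSONDecodeError, KeyError, IndexError, TypeError) as err:
        # Surface the failure instead of silently counting it as a pass.
        return {"score": None, "failure_reason": f"{type(err).__name__}: {err}"}
    return {"score": int(not contains_pii), "failure_reason": None}


print(parse_privacy_judgement('[{"contains_pii": true}]'))
print(parse_privacy_judgement("not json"))
```

How a `None` score should be aggregated (dropped, counted as a failure, or retried) is a design decision this sketch leaves open.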

In [24]:
display_prompt(medical_task_score_system_prompt)
display_prompt(medical_task_score_prompt)

In [25]:
@weave.op()
def judge_overall_score(model_output: str):
    llm = LLMClient(model_name=MODEL, client_type=MODEL_CLIENT)
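    response = llm.predict(
        user_prompt=medical_task_score_prompt.format(text=model_output),
        system_prompt=medical_task_score_system_prompt,
        schema=MedicalTaskScoreJudgement,
    )
    try:
        # The judge returns a JSON list; the first item carries the score.
        result = json.loads(response.text.strip("\n"))
        return result[0]["score"]
    except Exception:  # a bare except would also swallow KeyboardInterrupt
        # NOTE: a response that cannot be parsed currently scores 0.
        return 0  # TODO: Add json parsing as failure reason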


In [26]:
judge_overall_score(test_output)

🍩 https://wandb.ai/eval-course/eval_course_ch1_dev/r/call/0192e825-8001-7573-8616-7af34c4f7a5a


0

### We already have a dataset of annotated medical data. We can use our tests to evaluate the outputs of our LLM.

In [27]:
@weave.op()
def annotated_data_passthrough(input, output):
    return output

In [28]:
annotated_medical_data[0][2]

0

In [29]:
evaluation_data = [
    {"input": annotated_row[0]["input"], "output": annotated_row[1]["output"],
     "scores": {"human_required_keys": annotated_row[3]["presence_of_keys"],
                "human_word_limit": annotated_row[3]["word_count"],
                "human_absence_of_PII": annotated_row[3]["absence_of_PII"],
                "human_overall_score": annotated_row[2]}}
    for annotated_row in annotated_medical_data
][0:5]

In [31]:
# Create evaluation
evaluation = weave.Evaluation(
    dataset=evaluation_data,
    scorers=[test_adheres_to_required_keys, test_adheres_to_word_limit, judge_adheres_to_privacy_guidelines, judge_overall_score]
)

# Run evaluation
evals = asyncio.run(evaluation.evaluate(annotated_data_passthrough))

🍩 https://wandb.ai/eval-course/eval_course_ch1_dev/r/call/0192e825-82fe-7c03-b03d-9eabe0c81ff5


### But do our test results agree with the expert annotations?

We need to measure how well our automated evaluations match human judgment. We'll:

1. **Measure Alignment**
   - Compare automated test results with expert annotations using kappa scores
   - Weight different aspects based on their importance (privacy, completeness, etc.)
   - Find where automated tests disagree with human experts

2. **Use These Results**
   - Chapter 2 will focus on improving the LLM judges that show poor alignment
   - We'll learn to refine prompts based on these alignment scores
   - Build better evaluation systems by focusing on the weakest areas first

These alignment measurements are crucial: they tell us which parts of our automated system need the most work, especially for critical aspects like privacy checks and medical accuracy.

In [35]:
eval_call_id = "0192e825-82fe-7c03-b03d-9eabe0c81ff5"

In [36]:
df = get_evaluation_predictions(weave_client, eval_call_id)
df

| | input | required_keys | word_limit | privacy | overall |
|---|---|---|---|---|---|
| 0 | Dialogue: [doctor] hey dylan what's going on s... | (1, 0) | (1, 0) | (0, 0) | (0, 0) |
| 1 | Dialogue: [doctor] hello , mrs . peterson . [p... | (1, 0) | (1, 1) | (1, 0) | (1, 1) |
| 2 | Dialogue: [doctor] hey , ms. hill . nice to se... | (0, 0) | (1, 1) | (1, 0) | (0, 1) |
| 3 | Dialogue: [doctor] hi keith , how are you ? [p... | (1, 0) | (1, 1) | (0, 0) | (0, 0) |
| 4 | Dialogue: [doctor] okay so we are recording ok... | (0, 0) | (1, 1) | (1, 1) | (0, 1) |


In [37]:
# Example usage:
kappa_scores = calculate_kappa_scores(df)
for metric, score in kappa_scores.items():
    print(f"{metric}: {score:.3f}")

required_keys: 0.000
word_limit: 0.000
privacy: 0.286
overall: 0.286
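
The implementation of `calculate_kappa_scores` lives in `utils.evals` and is not shown here. The printed values are consistent with Cohen's kappa, $\kappa = (p_o - p_e) / (1 - p_e)$, where $p_o$ is the observed agreement and $p_e$ the agreement expected by chance. The sketch below is a plain-Python version, not the course's code. It reads each pair in the predictions table as (human, automated), an inference from the individual test calls earlier, which returned 0 for the first row.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b) -> float:
    """Cohen's kappa for two equally long sequences of labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters need the same, non-zero number of labels")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' label frequencies, summed over labels.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        # Reporting 0.0 is a choice made for this sketch.
        return 0.0
    return (p_observed - p_expected) / (1.0 - p_expected)


# Pairs copied from the predictions table, read as (human, automated).
pairs = {
    "required_keys": [(1, 0), (1, 0), (0, 0), (1, 0), (0, 0)],
    "word_limit": [(1, 0), (1, 1), (1, 1), (1, 1), (1, 1)],
    "privacy": [(0, 0), (1, 0), (1, 0), (0, 0), (1, 1)],
    "overall": [(0, 0), (1, 1), (0, 1), (0, 0), (0, 1)],
}
for metric, rows in pairs.items():
    human, automated = zip(*rows)
    print(f"{metric}: {cohen_kappa(human, automated):.3f}")
```

Note why `word_limit` scores 0.000 even though four of five rows agree: the human labels are all 1, so chance agreement is already 0.8 and the observed agreement adds nothing beyond it. With only five rows, all of these estimates are noisy.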


In [38]:
# Example weights (adjust these based on what's most important for your use case)
weights = {
    'required_keys': 0.3,    # High importance - core functionality
    'privacy': 0.3,          # High importance - compliance/safety
    'word_limit': 0.2,       # Medium importance - usability
    'overall': 0.2           # Medium importance - general quality
}

# Calculate aggregate score
aggregate_score = calculate_weighted_alignment(kappa_scores, weights)
print(f"\nWeighted Aggregate Alignment Score: {aggregate_score:.3f}")

# You can easily try different weightings:
privacy_focused_weights = {
    'required_keys': 0.2,
    'privacy': 0.5,          # Much higher weight on privacy
    'word_limit': 0.15,
    'overall': 0.15
}

privacy_focused_score = calculate_weighted_alignment(kappa_scores, privacy_focused_weights)
print(f"Privacy-Focused Alignment Score: {privacy_focused_score:.3f}")


Weighted Aggregate Alignment Score: 0.143
Privacy-Focused Alignment Score: 0.186
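
`calculate_weighted_alignment` is also imported from `utils.evals`, so its exact behavior is not visible here. A plausible reading, consistent with the two printed scores, is a weighted average of the per-metric kappa values. The function below is a hypothetical stand-in under that assumption; it normalizes by the total weight so the weights need not sum to exactly 1.

```python
def weighted_alignment(scores: dict, weights: dict) -> float:
    """Weighted average of per-metric scores, normalized by the total weight."""
    missing = set(weights) - set(scores)
    if missing:
        raise KeyError(f"No score for weighted metrics: {sorted(missing)}")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("Weights must sum to a positive number")
    return sum(scores[m] * w for m, w in weights.items()) / total_weight


# Kappa values as printed above (rounded to three decimals).
kappa_by_metric = {"required_keys": 0.0, "word_limit": 0.0, "privacy": 0.286, "overall": 0.286}
default_weights = {"required_keys": 0.3, "privacy": 0.3, "word_limit": 0.2, "overall": 0.2}
privacy_weights = {"required_keys": 0.2, "privacy": 0.5, "word_limit": 0.15, "overall": 0.15}

print(f"{weighted_alignment(kappa_by_metric, default_weights):.3f}")
print(f"{weighted_alignment(kappa_by_metric, privacy_weights):.3f}")
```

Under this reading, the privacy-focused score is higher only because more weight lands on a metric with non-zero kappa; the underlying agreement has not changed.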
