# Designing and Improving LLM Evaluators

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import os
import weave
import pandas as pd

from dotenv import load_dotenv
load_dotenv()  # TODO: replace with getpass

import google.generativeai as genai
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

import nest_asyncio
nest_asyncio.apply()

from tqdm import tqdm



In [3]:
# initialize weave
weave_client = weave.init(project_name="eval-course/eval-course-dev")

Logged in as Weights & Biases user: ayut.
View Weave data at https://wandb.ai/eval-course/eval-course-dev/weave


## Essay Writer

Imagine a task in which you use an LLM to write essays.

query ----> [LLM based essay writer] ----> essay # TODO: simple diagram

- You have built an evaluation set of query-essay pairs.
- You have a set of human evaluators who have labeled the essays based on some criteria.

You don't want to rely on human evaluators to label every new essay, so you want to build an LLM-based evaluator that can score essays in their place. # TODO: improve framing

Let's start with building a simple evaluator and then we will see how we can align it with human evaluators.

## Part 1: Prompt

Any LLM evaluator needs a prompt. A "judge" prompt has three key components: # TODO: expand on these three components

1. **Task Description**: The task defines the role of the LLM as the evaluator, such as “You are an evaluator tasked with assessing the fluency and coherence of this text.”
2. **Measuring Criteria**: The criteria outline what the LLM should look for, such as “The criteria for evaluation are fluency, grammar, factual accuracy, and adherence to the prompt.” These are the instructions given to the LLM evaluator.
3. **Scoring Rubric**: The rubric provides detailed guidelines on how to score the output.
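
As a minimal sketch, the three components can be kept as separate pieces and assembled into one prompt template. The helper name `build_judge_prompt` and the example strings are illustrative assumptions, not part of any library or of this notebook's actual prompt.

```python
def build_judge_prompt(task_description: str, criteria: list[str], rubric: dict[int, str]) -> str:
    """Assemble a judge prompt template from the three components (hypothetical helper)."""
    criteria_block = "\n".join(f"- {c}" for c in criteria)
    # List rubric levels from the highest score to the lowest
    rubric_block = "\n".join(f"Score {s}: {rubric[s]}" for s in sorted(rubric, reverse=True))
    return (
        f"{task_description}\n\n"
        f"Criteria:\n{criteria_block}\n\n"
        f"Scoring rubric:\n{rubric_block}\n\n"
        # Plain string (not an f-string) so the placeholder survives for a later .format call
        "Essay:\n{full_text}\n"
    )


prompt_template = build_judge_prompt(
    task_description="You are an evaluator tasked with assessing the fluency and coherence of this text.",
    criteria=["fluency", "grammar", "factual accuracy", "adherence to the prompt"],
    rubric={1: "Very little or no mastery.", 6: "Clear and consistent mastery."},
)
print(prompt_template.format(full_text="Venus is one of the brightest points in the sky..."))
```

Keeping the components separate makes it easier to change one of them (for example, the rubric) while holding the others fixed.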


In [4]:
JUDGE_PROMPT = """You are an expert essay evaluator. 
Please evaluate the following essay according to the Holistic Rating for Source-Based Writing rubric on a scale of 1-6.
First give a reason for the score and return the result as a valid JSON object.

Example:
```json
{{"score": 4, "reason": "The essay demonstrates a clear understanding of the source text and effectively uses it to support its points."}}
```

Essay:
{full_text}
"""

## Part 2: The Evaluator

TODO: diagram
[LLM SYSTEM] --> system output --> [EVALUATOR] --> evaluator output

The LLM evaluator takes in the judge prompt, initializes an LLM, and passes the prompt along with the "generated" content to the LLM.

We expect the evaluator to return a judgement which can be in the form of raw text or a JSON object.

Here we are using the `weave.Model` class which under the hood is a Pydantic `BaseModel`. By structuring your code to be compatible with this API, you benefit from a structured way to version your application so you can more systematically keep track of your experiments.

In this case, we are passing the `full_text` to the evaluator and expect it to return a JSON object with `score` and `reason` keys.

In [5]:
from weave import Model, Evaluation
import asyncio
import json


class EssayEvaluator(Model):
    model: genai.GenerativeModel = genai.GenerativeModel("gemini-1.5-flash")
    judge_prompt: str = JUDGE_PROMPT

    @weave.op()
    def predict(self, full_text: str) -> dict:
        response = self.model.generate_content(self.judge_prompt.format(full_text=full_text))
        try:
            result = response.text.strip()
            result = json.loads(result)
            return result
        except Exception:
            return {"score": 0, "reason": "Failed to parse JSON"}  # Sentinel score 0 (outside the 1-6 scale) marks a parse failure

# Initialize evaluator
essay_evaluator = EssayEvaluator()

## Part 3: The evaluation dataset

To simulate this imaginary scenario, we use a small subset of the `train.csv` file from the "[Learning Agency Lab - Automated Essay Scoring 2.0](https://www.kaggle.com/competitions/learning-agency-lab-automated-essay-scoring-2/data?select=train.csv)" Kaggle competition.

Specifically, we have two columns of interest: `full_text` and `score`. In our scenario, `full_text` stands in for the essay generated by our LLM-based essay writer, and `score` is the score given by the human evaluators.

Each essay was scored on a scale of 1 to 6 using the "[Holistic Rating for Source-Based Writing](https://storage.googleapis.com/kaggle-forum-message-attachments/2733927/20538/Rubric_%20Holistic%20Essay%20Scoring.pdf)" code book.

In [6]:
# Load the dataset
essay_scorer_small = weave.ref('essay_scorer_small:v0').get()

## Part 4: The evaluation metric

We want to evaluate the evaluator's performance using the `score` column from the dataset. We are using the `exact_match` metric to check if the evaluator's prediction matches the human score.

The `weave.op()` decorator allows us to track the metric as an operation in the weave graph.

In [7]:
# Define a simple exact match metric
@weave.op()
def exact_match(score: int, model_output: dict) -> bool:
    """Check if predicted score matches human score"""
    return model_output['score'] == score
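
Exact match treats a prediction one point away from the human score the same as a prediction five points away. As a hedged sketch of a softer companion metric, the hypothetical `within_one` function below counts a prediction as correct when it is at most one point off. It is plain Python; whether it can be decorated with `@weave.op()` and added to `scorers` unchanged is an assumption, not something shown in this notebook.

```python
def within_one(score: int, model_output: dict, tolerance: int = 1) -> bool:
    """Return True when the predicted score is within `tolerance` of the human score."""
    predicted = model_output.get("score")
    if not isinstance(predicted, int):
        return False  # a missing or malformed prediction never counts as a match
    return abs(predicted - score) <= tolerance


print(within_one(4, {"score": 3}))  # one point off
print(within_one(4, {"score": 1}))  # three points off
```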

## Part 5: The evaluation

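A good evaluation system should have the following features:

1. **Asynchronous execution**: examples are scored concurrently rather than one at a time.
2. **Trials**: each example can be evaluated more than once to account for the non-determinism of LLM outputs.
3. **Powerful visualization**: results can be inspected example by example.
4. **Evaluation comparison and insights**: different evaluation runs can be compared side by side.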
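
To make the first two features concrete, here is a minimal sketch using only `asyncio` and a stand-in judge. The names `fake_judge` and `run_trials` are assumptions for illustration; this is not how Weave implements its evaluation loop. Inside a notebook, `asyncio.run` relies on the `nest_asyncio.apply()` call made at the top.

```python
import asyncio
import random


async def fake_judge(essay: str) -> int:
    """Stand-in for an LLM judge call: waits briefly and returns a score from 1 to 6."""
    await asyncio.sleep(0.01)
    return random.randint(1, 6)


async def run_trials(essays: list[str], trials: int = 3) -> dict[str, list[int]]:
    """Score every essay `trials` times, running all judge calls concurrently."""
    tasks = [fake_judge(essay) for essay in essays for _ in range(trials)]
    scores = await asyncio.gather(*tasks)  # results come back in task order
    # Regroup the flat list into one list of trial scores per essay
    return {
        essay: list(scores[i * trials:(i + 1) * trials])
        for i, essay in enumerate(essays)
    }


results = asyncio.run(run_trials(["essay A", "essay B"], trials=3))
for essay, trial_scores in results.items():
    print(essay, trial_scores, "spread:", max(trial_scores) - min(trial_scores))
```

The spread across trials gives a rough sense of how stable the judge is on a given essay.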

In [8]:
# Create evaluation
evaluation = Evaluation(
    dataset=essay_scorer_small,
    scorers=[exact_match]
)

# Run evaluation
asyncio.run(evaluation.evaluate(essay_evaluator))

🍩 https://wandb.ai/eval-course/eval-course-dev/r/call/0192d2f8-cf88-7b03-8b71-edb8b4cbba35


{'model_output': {'score': {'mean': 0.0}},
 'exact_match': {'true_count': 0, 'true_fraction': 0.0},
 'model_latency': {'mean': 5.74023334980011}}

### Better JSON parsing

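A mean score of 0 means every response fell through to the parse-failure fallback. The example in the judge prompt wraps its JSON in a markdown code fence, so the model likely does the same, and `json.loads` fails on the fenced text. We need to improve the JSON parsing to handle responses with this extra markdown formatting.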

In [9]:
@weave.op()
def parse_json(result: str) -> dict:
    if "```json" in result:
        result = result.split("```json\n")[1].split("\n```")[0]
    # Clean up any remaining markdown formatting
    result = result.strip()
    return json.loads(result)


class EssayEvaluator(Model):
    model: genai.GenerativeModel = genai.GenerativeModel("gemini-1.5-flash")
    judge_prompt: str = JUDGE_PROMPT

    @weave.op()
    def predict(self, full_text: str) -> dict:
        response = self.model.generate_content(self.judge_prompt.format(full_text=full_text))
        try:
            result = response.text.strip()
            return parse_json(result)
        except Exception:
            return {"score": 0, "reason": "Failed to parse JSON"}  # Sentinel score 0 (outside the 1-6 scale) marks a parse failure

# Initialize evaluator
essay_evaluator = EssayEvaluator()

# Run evaluation
asyncio.run(evaluation.evaluate(essay_evaluator))

🍩 https://wandb.ai/eval-course/eval-course-dev/r/call/0192d2f9-0fd8-78f2-94b6-414d7bfc30a9


{'model_output': {'score': {'mean': 2.1}},
 'exact_match': {'true_count': 1, 'true_fraction': 0.1},
 'model_latency': {'mean': 5.057384610176086}}
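
The split-based parser above only handles one exact fence layout. As a hedged sketch of a more tolerant alternative, the hypothetical `extract_json` helper below accepts a fence with or without a language tag and falls back to the outermost `{...}` span. The fence marker is built with string multiplication only to avoid writing three literal backticks inside this code block.

```python
import json
import re

FENCE = "`" * 3  # three backticks, the markdown code-fence marker
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE, re.DOTALL)


def extract_json(text: str) -> dict:
    """Parse JSON from raw model text, tolerating an optional markdown code fence."""
    match = FENCE_RE.search(text)
    candidate = match.group(1) if match else text
    try:
        return json.loads(candidate.strip())
    except json.JSONDecodeError:
        # Last resort: take the outermost {...} span and try once more
        start, end = text.find("{"), text.rfind("}")
        if start == -1 or end <= start:
            raise
        return json.loads(text[start:end + 1])


fenced = FENCE + 'json\n{"score": 4, "reason": "ok"}\n' + FENCE
print(extract_json(fenced))
print(extract_json('Here is my answer: {"score": 3, "reason": "fine"} Thanks'))
```

Even a tolerant parser can still fail on malformed output, so the caller should keep an explicit fallback.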

### Structured output

Most frontier LLM providers support structured outputs, which constrain the LLM's response to a schema you specify.

Note: If you have complex "reasoning" to be done via your LLM evaluator, you should use two API calls. Use the first API call to do the reasoning and use the second API call to output the structured response. Reference: https://arxiv.org/abs/2408.02442v1

Learn more about structured outputs in this free course by Jason Liu: https://www.wandb.courses/courses/steering-language-models
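
The two-call pattern from the note above can be sketched as follows. The helper `judge_in_two_steps`, the prompt wording, and the `call_llm` parameter are hypothetical; `fake_llm` is a stand-in so the sketch runs offline. In a real setup the second call would be the one that uses the provider's structured-output option.

```python
import json
from typing import Callable


def judge_in_two_steps(essay: str, call_llm: Callable[[str], str]) -> dict:
    """Reason in free text first, then ask for a structured answer (hypothetical helper)."""
    reasoning = call_llm(
        "Assess the following essay on a 1-6 scale and explain your reasoning.\n\n"
        f"Essay:\n{essay}"
    )
    structured = call_llm(
        "Convert the assessment below into a JSON object with the keys "
        '"reason" (string) and "score" (integer from 1 to 6). Return only JSON.\n\n'
        f"Assessment:\n{reasoning}"
    )
    return json.loads(structured)


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call."""
    if prompt.startswith("Convert"):
        return '{"reason": "Clear thesis, some grammar slips.", "score": 4}'
    return "The essay has a clear thesis but several grammar slips. I would give it a 4."


print(judge_in_two_steps("Venus is one of the brightest points in the sky...", fake_llm))
```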

In [10]:
import typing_extensions as typing

class Judgement(typing.TypedDict):
    reason: str
    score: int


class EssayEvaluator(Model):
    model: genai.GenerativeModel = genai.GenerativeModel("gemini-1.5-flash")
    judge_prompt: str = JUDGE_PROMPT

    @weave.op()
    def predict(self, full_text: str) -> dict:
        response = self.model.generate_content(
            self.judge_prompt.format(full_text=full_text),
            generation_config=genai.GenerationConfig(
                response_mime_type="application/json", response_schema=Judgement
            ),
        )
        try:
            result = json.loads(response.text.strip("\n"))
            return result
        except Exception:
            return {"score": 0, "reason": "Failed to parse JSON"}  # Sentinel score 0 (outside the 1-6 scale) marks a parse failure

# Initialize evaluator
essay_evaluator = EssayEvaluator()

# Run evaluation
asyncio.run(evaluation.evaluate(essay_evaluator))

🍩 https://wandb.ai/eval-course/eval-course-dev/r/call/0192d2f9-4e58-7bc3-bfdd-ee41d62e5ad8


{'model_output': {'score': {'mean': 2.2}},
 'exact_match': {'true_count': 2, 'true_fraction': 0.2},
 'model_latency': {'mean': 3.149549388885498}}
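
The `Judgement` schema types `score` as a plain `int`, so nothing in the schema itself restricts it to the 1-6 scale. A minimal sketch of an extra check, assuming a hypothetical `validate_judgement` helper that is not part of the original notebook:

```python
def validate_judgement(result: dict, min_score: int = 1, max_score: int = 6) -> dict:
    """Return the judgement unchanged if it is usable, otherwise raise ValueError."""
    score = result.get("score")
    # bool is a subclass of int in Python, so exclude it explicitly
    if not isinstance(score, int) or isinstance(score, bool):
        raise ValueError(f"score must be an integer, got {score!r}")
    if not min_score <= score <= max_score:
        raise ValueError(f"score {score} is outside the {min_score}-{max_score} range")
    if not isinstance(result.get("reason"), str):
        raise ValueError("reason must be a string")
    return result


print(validate_judgement({"reason": "Adequate mastery with some lapses.", "score": 4}))
```

Raising on invalid output keeps parse failures distinguishable from genuinely low scores.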

## Aligning LLM evaluators with human evaluators

One of the most important aspects of building an LLM evaluator is aligning it with human evaluators. An evaluator whose judgements agree with human judgements gives us higher confidence in its predictions.

In our case, we have human annotations. Let's see how we can align the LLM evaluator and in turn improve the evaluator's performance.

### The alignment metric

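Cohen's kappa measures the agreement between two raters on categorical labels while correcting for the agreement expected by chance. # TODO: add more details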

Range of metric: Similar to correlation coefficients, it can range from −1 to +1, where 0 represents the amount of agreement that can be expected from random chance, and 1 represents perfect agreement between the raters. While kappa values below 0 are possible, they are unlikely in practice. ([Source](https://pmc.ncbi.nlm.nih.gov/articles/PMC3900052/#:~:text=Similar%20to%20correlation%20coefficients%2C%20it,unlikely%20in%20practice%20(8).))
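
To make the metric concrete, here is a minimal pure-Python sketch of the linearly weighted variant, where a disagreement between labels at positions i and j is penalized by |i - j| and kappa is one minus the ratio of observed to chance-expected disagreement. The function name and the example scores are illustrative assumptions; this notebook itself relies on scikit-learn's `cohen_kappa_score` with `weights='linear'`.

```python
from collections import Counter


def linear_weighted_kappa(rater_a: list, rater_b: list, labels: list) -> float:
    """Cohen's kappa with linear weights: 1 - observed / expected disagreement."""
    n = len(rater_a)
    index = {label: i for i, label in enumerate(labels)}
    pair_counts = Counter(zip(rater_a, rater_b))
    count_a, count_b = Counter(rater_a), Counter(rater_b)

    # Average distance between the two raters' labels on the same item
    observed = sum(abs(index[a] - index[b]) * c for (a, b), c in pair_counts.items()) / n
    # Average distance expected if the raters labeled independently with the same marginals
    expected = sum(
        abs(index[a] - index[b]) * count_a[a] * count_b[b] for a in labels for b in labels
    ) / n**2
    if expected == 0:
        return float("nan")  # both raters used one identical label; kappa is undefined
    return 1 - observed / expected


human = [1, 4, 1, 3, 2]
constant_judge = [2, 2, 2, 2, 2]
print(linear_weighted_kappa(human, constant_judge, labels=list(range(1, 7))))  # 0.0
```

A judge that always predicts the same score gets a kappa of 0 here even though it matches one of the five human scores exactly, which is why kappa is more informative than raw agreement.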

ASR metric? introduced in this paper: https://arxiv.org/pdf/2402.10669

In [11]:
def get_evaluation_predictions(eval_call_id: str) -> pd.DataFrame:
    """
    Retrieves evaluation predictions from a Weave call and returns them as a DataFrame.
    
    Args:
        eval_call_id (str): The ID of the Weave evaluation call to analyze
        
    Returns:
        pd.DataFrame: DataFrame containing the evaluation data with predictions
    """
    eval_calls = weave_client.get_call(eval_call_id)

    predictions = []
    for eval_call in eval_calls.children():
        if eval_call.op_name.split("/")[-1].split(":")[0] == "Evaluation.predict_and_score":
            _eval_call = weave_client.get_call(eval_call.id)
            data = dict(_eval_call.inputs["example"])
            data.update({"pred_score": dict(_eval_call.output)["model_output"]["score"]})
            predictions.append(data)

    return pd.DataFrame(predictions)

# Get evaluation predictions
eval_df = get_evaluation_predictions("0192d2f9-4e58-7bc3-bfdd-ee41d62e5ad8")
eval_df.head()

| Unnamed: 0 | essay_id | full_text | score | pred_score |
|---|---|---|---|---|
| 0 | ad85676 | Venus is one of the brightest point in the sky... | 1 | 2 |
| 1 | 241077a | Driverless cars may be the future but it would... | 4 | 2 |
| 2 | 1b7e42c | Alien Landform?\n\nDo you think that the face ... | 1 | 2 |
| 3 | 0b6df5c | Emotions in the classroom? This is a question ... | 3 | 2 |
| 4 | f92d35c | dear state senator, im writting this letter to... | 2 | 2 |


In [12]:
def calculate_cohen_kappa(df: pd.DataFrame, labels: list) -> float:
    """
    Calculate Cohen's Kappa score between human scores and model predictions.
    
    Args:
        df (pd.DataFrame): DataFrame containing 'score' and 'pred_score' columns
        labels (list): List of label values to consider in the calculation

    Returns:
        float: Cohen's Kappa score with linear weights

    Raises:
        AssertionError: If required columns 'score' or 'pred_score' are missing from DataFrame
    """
    required_cols = ['score', 'pred_score']
    missing_cols = [col for col in required_cols if col not in df.columns]
    
    assert len(missing_cols) == 0, (
        f"DataFrame is missing required columns: {missing_cols}. "
        f"Please ensure DataFrame contains both 'score' and 'pred_score' columns."
    )
    
    from sklearn.metrics import cohen_kappa_score
    return cohen_kappa_score(
        df['score'], 
        df['pred_score'],
        labels=labels, 
        weights='linear'
    )

# Calculate Cohen's Kappa score for scores 1-6
kappa = calculate_cohen_kappa(eval_df, labels=list(range(1,7)))
print(f"Alignment between human and LLM evaluator: {kappa:.3f}")

Alignment between human and LLM evaluator: 0.156


## Improve the LLM evaluator

### Part 1: Improve the criteria definition

Here we will improve the evaluator by improving the criteria it uses to evaluate the essays. Since the human annotators used the "[Holistic Rating for Source-Based Writing](https://storage.googleapis.com/kaggle-forum-message-attachments/2733927/20538/Rubric_%20Holistic%20Essay%20Scoring.pdf)" code book, we will give the evaluator similar criteria.


In [13]:
JUDGE_PROMPT = """You are an expert essay evaluator.
Please evaluate the following essay according to the Holistic Rating for Source-Based Writing rubric on a scale of 1-6 as shown below:

Score 6: Demonstrates clear and consistent mastery with minor errors. Effectively and insightfully develops a point of view with outstanding critical thinking. Uses appropriate examples and evidence to support its stance. The essay is highly organized and coherent, showing smooth idea progression, skillful language use, and varied, accurate vocabulary. Free of significant errors in grammar and mechanics.

Score 5: Shows reasonably consistent mastery with occasional errors. Develops a strong point of view with good critical thinking, supported by relevant examples and evidence. Generally organized and coherent, the essay uses language well, with appropriate vocabulary and sentence structure variety. Mostly free of errors in grammar and mechanics.

Score 4: Demonstrates adequate mastery but has some lapses. Develops a point of view with competent critical thinking, supported by adequate examples and evidence. Generally organized and coherent, though may show inconsistency in language use or vocabulary choice. May have occasional grammar and mechanics errors.

Score 3: Shows developing mastery with weaknesses, such as inconsistent critical thinking or inadequate support. Organization or focus may be limited, with possible lapses in coherence. Language use may be basic, with weak vocabulary and/or issues in sentence structure. Contains multiple grammar and mechanics errors.

Score 2: Demonstrates little mastery and is flawed by vague or weak critical thinking, poor organization, or insufficient evidence. Language use is limited, with frequent vocabulary and sentence structure issues. Grammar and mechanics errors may obscure meaning.

Score 1: Displays very little or no mastery. Lacks a viable point of view or relevant evidence, is highly disorganized or incoherent. Contains severe vocabulary and structure issues, with pervasive grammar and mechanics errors that obscure meaning.

First give a reason for the score and return the result as a valid JSON object.

Example:
```json
{{"score": 4, "reason": "The reason for the score..."}}
```

Essay:
{full_text}
"""

In [14]:
class EssayEvaluator(Model):
    model: genai.GenerativeModel = genai.GenerativeModel("gemini-1.5-flash")
    judge_prompt: str = JUDGE_PROMPT

    @weave.op()
    def predict(self, full_text: str) -> dict:
        response = self.model.generate_content(
            self.judge_prompt.format(full_text=full_text),
            generation_config=genai.GenerationConfig(
                response_mime_type="application/json", response_schema=Judgement
            ),
        )
        try:
            result = json.loads(response.text.strip("\n"))
            return result
        except Exception:
            return {"score": 0, "reason": "Failed to parse JSON"}  # Sentinel score 0 (outside the 1-6 scale) marks a parse failure

# Initialize evaluator
essay_evaluator = EssayEvaluator()

# Run evaluation
asyncio.run(evaluation.evaluate(essay_evaluator))

🍩 https://wandb.ai/eval-course/eval-course-dev/r/call/0192d2f9-fb73-7b92-b685-2f44edb5f027


{'model_output': {'score': {'mean': 2.8}},
 'exact_match': {'true_count': 4, 'true_fraction': 0.4},
 'model_latency': {'mean': 3.182755160331726}}

In [16]:
eval_df = get_evaluation_predictions("0192d2f9-fb73-7b92-b685-2f44edb5f027")
kappa = calculate_cohen_kappa(eval_df, labels=list(range(1,7)))
print(f"Alignment between human and LLM evaluator: {kappa:.3f}")

Alignment between human and LLM evaluator: 0.366


### Part 2: Add few-shot examples

Adding few-shot examples can help the LLM evaluator understand the task better and guide it towards the correct score.


TODO: add how I created the few-shot examples

Let's see this in action.

In [17]:
JUDGE_PROMPT = """You are an expert essay evaluator.
Please evaluate the following essay according to the Holistic Rating for Source-Based Writing rubric on a scale of 1-6 as shown below:

Score 6: Demonstrates clear and consistent mastery with minor errors. Effectively and insightfully develops a point of view with outstanding critical thinking. Uses appropriate examples and evidence to support its stance. The essay is highly organized and coherent, showing smooth idea progression, skillful language use, and varied, accurate vocabulary. Free of significant errors in grammar and mechanics.

Score 5: Shows reasonably consistent mastery with occasional errors. Develops a strong point of view with good critical thinking, supported by relevant examples and evidence. Generally organized and coherent, the essay uses language well, with appropriate vocabulary and sentence structure variety. Mostly free of errors in grammar and mechanics.

Score 4: Demonstrates adequate mastery but has some lapses. Develops a point of view with competent critical thinking, supported by adequate examples and evidence. Generally organized and coherent, though may show inconsistency in language use or vocabulary choice. May have occasional grammar and mechanics errors.

Score 3: Shows developing mastery with weaknesses, such as inconsistent critical thinking or inadequate support. Organization or focus may be limited, with possible lapses in coherence. Language use may be basic, with weak vocabulary and/or issues in sentence structure. Contains multiple grammar and mechanics errors.

Score 2: Demonstrates little mastery and is flawed by vague or weak critical thinking, poor organization, or insufficient evidence. Language use is limited, with frequent vocabulary and sentence structure issues. Grammar and mechanics errors may obscure meaning.

Score 1: Displays very little or no mastery. Lacks a viable point of view or relevant evidence, is highly disorganized or incoherent. Contains severe vocabulary and structure issues, with pervasive grammar and mechanics errors that obscure meaning.

First give a reason for the score and return the result as a valid JSON object.

{few_shot_examples}

Example:
```json
{{"score": 4, "reason": "The reason for the score..."}}
```

Essay:
{full_text}
"""

# Read few shot examples from file
with open("../data/essay_scorer_few_shot_prompt.txt", "r") as f:
    few_shot_examples = f.read()
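
The contents of `essay_scorer_few_shot_prompt.txt` are not shown in this notebook. As a sketch of one way such a block could be assembled from labeled essays, the hypothetical `format_few_shot_examples` helper below renders each example as essay, score, and reason. The layout, the truncation limit, and the sample records are all made up for illustration.

```python
def format_few_shot_examples(examples: list[dict], max_chars: int = 300) -> str:
    """Render labeled essays as a prompt block (hypothetical helper and layout)."""
    blocks = []
    for i, example in enumerate(examples, start=1):
        essay = example["full_text"]
        # Truncate long essays so the examples do not dominate the prompt
        if len(essay) > max_chars:
            essay = essay[:max_chars].rstrip() + "..."
        blocks.append(
            f"Example {i}:\n"
            f"Essay: {essay}\n"
            f"Score: {example['score']}\n"
            f"Reason: {example['reason']}"
        )
    return "\n\n".join(blocks)


# Made-up records, not taken from the dataset
sample_examples = [
    {"full_text": "Cars are bad becuase they polute.", "score": 1, "reason": "No development or evidence."},
    {"full_text": "Limiting car usage reduces smog, as the Paris ban shows.", "score": 4, "reason": "Clear stance with adequate support."},
]
print(format_few_shot_examples(sample_examples))
```

Essays used as few-shot examples should be kept out of the evaluation dataset, otherwise agreement on those rows is inflated.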

In [19]:
class EssayEvaluator(Model):
    model: genai.GenerativeModel = genai.GenerativeModel("gemini-1.5-flash")
    judge_prompt: str = JUDGE_PROMPT

    @weave.op()
    def predict(self, full_text: str) -> dict:
        response = self.model.generate_content(
            self.judge_prompt.format(few_shot_examples=few_shot_examples, full_text=full_text),
            generation_config=genai.GenerationConfig(
                response_mime_type="application/json", response_schema=Judgement
            ),
        )
        try:
            result = json.loads(response.text.strip("\n"))
            return result
        except Exception:
            return {"score": 0, "reason": "Failed to parse JSON"}  # Sentinel score 0 (outside the 1-6 scale) marks a parse failure

# Initialize evaluator
essay_evaluator = EssayEvaluator()

# Run evaluation
asyncio.run(evaluation.evaluate(essay_evaluator))

🍩 https://wandb.ai/eval-course/eval-course-dev/r/call/0192d2fc-50aa-7c43-a835-8cbeb7454c5d


{'model_output': {'score': {'mean': 3.0}},
 'exact_match': {'true_count': 4, 'true_fraction': 0.4},
 'model_latency': {'mean': 4.123927116394043}}

In [20]:
eval_df = get_evaluation_predictions("0192d2fc-50aa-7c43-a835-8cbeb7454c5d")
kappa = calculate_cohen_kappa(eval_df, labels=list(range(1,7)))
print(f"Alignment between human and LLM evaluator: {kappa:.3f}")

Alignment between human and LLM evaluator: 0.400


## TODOS:

- [x] progressive disclosure of complexity (move stuff in scripts where possible and explain in markdown)
- [ ] parameter for model should be model_name and not the client.
- [ ] JSON mode vs function calling
- [ ] CoT might help with alignment (step by step reasoning) (generate synthetic reasoning examples)
- [ ] in few-shot examples: input-output pairs vs just outputs
