## Bias in LLM Evaluators

There are many sources of bias in LLM evaluators. These biases are not necessarily inherent to LLM evaluators, but we cover them here to show their impact and how best to navigate them.

These biases are artifacts of today's LLMs and might go away in the future.
TODO: expand framing


In [1]:
%load_ext autoreload
%autoreload 2

In [34]:
import os

import weave
from dotenv import load_dotenv

load_dotenv()  # TODO: replace with getpass

import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

import nest_asyncio

nest_asyncio.apply()

In [None]:
# initialize weave
weave_client = weave.init(project_name="eval-course/eval-course-dev")

## Problem 1: Position Bias

LLM evaluators might favor outputs based on their position (early or late in a sequence). TODO: expand on this and the implications.

In [4]:
import asyncio

from weave import Evaluation, Model

In [27]:
# Define the prompt template for pairwise comparison
PAIRWISE_PROMPT = """You are an expert mathematics teacher evaluating student answers.
Given a math question and two possible answers, determine which answer is better.

Question: {question}

Answer A: {answer_a}
Answer B: {answer_b}

Which answer is better? Respond with JUST "A" or "B".
"""


class PairWiseEvaluator(Model):
    where_is_correct: str = "A"
    model: genai.GenerativeModel = genai.GenerativeModel("gemini-1.5-flash")
    pairwise_judge_prompt: str = PAIRWISE_PROMPT

    @weave.op()
    def predict(self, question: str, correct: str, incorrect: str) -> tuple[str, str]:
        if self.where_is_correct == "A":
            response = self.model.generate_content(
                self.pairwise_judge_prompt.format(
                    question=question, answer_a=correct, answer_b=incorrect,
                ),
            )
        elif self.where_is_correct == "B":
            response = self.model.generate_content(
                self.pairwise_judge_prompt.format(
                    question=question, answer_a=incorrect, answer_b=correct,
                ),
            )
        else:
            raise ValueError("where_is_correct must be either 'A' or 'B'")

        result = response.text.strip(" \n")
        return self.where_is_correct, result

In [28]:
# Load the dataset
mmlu_maths = weave.ref("mmlu_maths:v0").get()

# Metric


@weave.op()
def exact_match(model_output: list) -> bool:
    """Check if the judge picked the slot ("A" or "B") that holds the correct answer."""
    where_is_correct, result = model_output
    return where_is_correct == result


# Create evaluation
evaluation = Evaluation(dataset=mmlu_maths.rows, scorers=[exact_match])

In [32]:
# Run evaluation with where_is_correct = "A"
pairwise_evaluator = PairWiseEvaluator(where_is_correct="A")
a = asyncio.run(evaluation.evaluate(pairwise_evaluator))

# Run evaluation with where_is_correct = "B"
pairwise_evaluator = PairWiseEvaluator(where_is_correct="B")
b = asyncio.run(evaluation.evaluate(pairwise_evaluator))

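What's the difference between the two evaluations?

The questions and answers are identical in both runs; only the slot that holds the correct answer changes. An unbiased evaluator would score the same in both runs, so any gap in accuracy means the evaluator's choice depends partly on where an answer appears in the sequence.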

In [None]:
print(
    "What's the difference in accuracy because of position bias?\n",
    b["exact_match"]["true_fraction"] - a["exact_match"]["true_fraction"],
)
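
The aggregate gap hides what happens on individual questions. The sketch below is one way to break it down, assuming you have already pulled the judge's per-question picks out of the two runs into plain lists. The function name `position_consistency` and its two input lists are hypothetical and not part of Weave.

```python
from collections import Counter


def position_consistency(picks_when_a: list[str], picks_when_b: list[str]) -> dict[str, float]:
    """Classify each question by how the judge behaved across the two orderings.

    picks_when_a[i] is the slot the judge chose when the correct answer was
    shown as Answer A; picks_when_b[i] is its choice when the correct answer
    was shown as Answer B.
    """
    if len(picks_when_a) != len(picks_when_b):
        raise ValueError("both runs must cover the same questions")
    if not picks_when_a:
        raise ValueError("need at least one question")

    counts = Counter()
    for pick_a, pick_b in zip(picks_when_a, picks_when_b):
        if pick_a == "A" and pick_b == "B":
            counts["consistent_correct"] += 1  # followed the correct answer
        elif pick_a == "B" and pick_b == "A":
            counts["consistent_incorrect"] += 1  # followed the incorrect answer
        elif pick_a == "A" and pick_b == "A":
            counts["always_first"] += 1  # picked slot A regardless of content
        elif pick_a == "B" and pick_b == "B":
            counts["always_second"] += 1  # picked slot B regardless of content
        else:
            counts["unparsed"] += 1  # at least one reply was not "A" or "B"

    total = len(picks_when_a)
    keys = [
        "consistent_correct",
        "consistent_incorrect",
        "always_first",
        "always_second",
        "unparsed",
    ]
    return {key: counts[key] / total for key in keys}


# Toy data: four questions, judged once per ordering.
breakdown = position_consistency(["A", "A", "A", "B"], ["B", "A", "A", "A"])
print(breakdown)
```

With this toy data, half of the questions land in `always_first`, meaning the judge chose slot A no matter which answer was there. Those are the questions where position, not content, decided the verdict.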

### Solutions

- Swap Augmentation: Randomize the order of outputs to minimize position bias.
    - This is especially useful if you run your evaluation multiple times and take the average. ([Source](https://arxiv.org/pdf/2306.05685))

- Multiple Evidence Calibration (MEC): Prompt the model to generate evaluation evidence before assigning scores. In simple terms, you are asking the model to reason about the quality of the answer before assigning a score. ([Source](https://arxiv.org/pdf/2305.17926))

- Balanced Position Calibration (BPC): Evaluate each candidate in both positions across two runs and compute the final score as the average of the two runs ([Source](https://arxiv.org/pdf/2305.17926)).
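
Balanced Position Calibration can be sketched as follows. This is an adaptation for a judge that returns a preference ("A" or "B") rather than a numeric score, so it averages win indicators across the two orderings. The `Judge` signature, `balanced_position_verdict`, and the two toy judges are hypothetical stand-ins for a real LLM call.

```python
from typing import Callable

# Hypothetical judge signature: (question, answer_a, answer_b) -> "A" or "B".
Judge = Callable[[str, str, str], str]


def balanced_position_verdict(judge: Judge, question: str, first: str, second: str) -> dict:
    """Judge two candidates in both orders and average the outcome."""
    original = judge(question, first, second)  # `first` is shown as Answer A
    swapped = judge(question, second, first)  # `first` is shown as Answer B
    for verdict in (original, swapped):
        if verdict not in ("A", "B"):
            raise ValueError(f"unexpected verdict: {verdict!r}")

    # `first` wins a run when the judge picks the slot it occupied in that run.
    first_wins = int(original == "A") + int(swapped == "B")
    score_first = first_wins / 2
    return {
        "first": score_first,
        "second": 1 - score_first,
        # The judge is consistent when the same candidate wins in both orders.
        "consistent": first_wins != 1,
    }


def position_biased_judge(question: str, answer_a: str, answer_b: str) -> str:
    """Toy judge that always prefers whatever is shown first."""
    return "A"


def content_judge(question: str, answer_a: str, answer_b: str) -> str:
    """Toy judge that looks at the content only."""
    return "A" if answer_a == "4" else "B"


biased = balanced_position_verdict(position_biased_judge, "What is 2 + 2?", "4", "5")
fair = balanced_position_verdict(content_judge, "What is 2 + 2?", "4", "5")
print(biased)  # {'first': 0.5, 'second': 0.5, 'consistent': False}
print(fair)  # {'first': 1.0, 'second': 0.0, 'consistent': True}
```

The position-biased judge ends up with a tie, because its preference for slot A cancels out across the two orderings. The `consistent` flag also makes it easy to count how often the verdict flipped.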

For a more detailed discussion of position bias, check out these two papers:

- [Judging the Judges: A Systematic Investigation of Position Bias in Pairwise Comparative Assessments by LLMs](https://arxiv.org/pdf/2406.07791v1)
- [Large Language Models are not Fair Evaluators](https://arxiv.org/pdf/2305.17926)
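
Multiple Evidence Calibration, listed among the solutions above, can be sketched along these lines. The prompt asks for reasoning before the verdict, the judge is sampled several times, and the verdicts are aggregated. Everything here is an assumption made for illustration: the prompt wording, the `Verdict:` line format, the helper names, and the choice to aggregate A/B votes as fractions (the paper averages numeric scores). `generate` stands in for any function that sends a prompt to a model and returns its text.

```python
import re
from collections import Counter
from typing import Callable, Optional

# Hypothetical prompt: reasoning first, verdict on the final line.
MEC_PROMPT = """You are an expert mathematics teacher evaluating student answers.

Question: {question}

Answer A: {answer_a}
Answer B: {answer_b}

First explain, step by step, which answer is better and why.
Then finish with a final line of the form "Verdict: A" or "Verdict: B".
"""


def parse_verdict(text: str) -> Optional[str]:
    """Return the last 'Verdict: A/B' found in the reply, or None if absent."""
    matches = re.findall(r"Verdict:\s*([AB])\b", text)
    return matches[-1] if matches else None


def mec_scores(generate: Callable[[str], str], prompt: str, k: int = 3) -> dict:
    """Sample the judge k times and return the share of votes for each answer."""
    votes = Counter()
    for _ in range(k):
        verdict = parse_verdict(generate(prompt))
        if verdict is not None:
            votes[verdict] += 1
    valid = sum(votes.values())
    if valid == 0:
        raise ValueError("no parsable verdict in any sample")
    return {"A": votes["A"] / valid, "B": votes["B"] / valid}


# Canned replies stand in for a real model so the sketch runs offline.
canned = iter(
    [
        "Answer A matches 2 + 2 = 4.\nVerdict: A",
        "Answer B looks more detailed.\nVerdict: B",
        "Checking the arithmetic, A is right.\nVerdict: A",
    ]
)
prompt = MEC_PROMPT.format(question="What is 2 + 2?", answer_a="4", answer_b="5")
scores = mec_scores(lambda _prompt: next(canned), prompt, k=3)
print(scores)
```

Sampling with a non-zero temperature is what makes the repeated calls differ in practice. Combining this with the swapped-order runs described above addresses both sources of variance at once.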


## Problem 2: Verbosity Bias

LLM evaluators might favor outputs that are more verbose. This is a problem because a longer answer is not necessarily a better one, so scores can reflect length rather than quality and lead to overconfidence in the evaluator.

TODO: expand on this and the implications.


In [None]:
# Let's create an evaluator that judges correctness of a single answer
CORRECTNESS_PROMPT = """You are an expert mathematics teacher evaluating a student answer.
Given a math question and the student's answer, determine if the answer is correct.

Question: {question}
Student Answer: {answer}

Is this answer correct? Respond with JUST "YES" or "NO".
"""


class CorrectnessEvaluator(Model):
    model: genai.GenerativeModel = genai.GenerativeModel("gemini-1.5-flash")
    judge_prompt: str = CORRECTNESS_PROMPT

    @weave.op()
    def predict(self, question: str, correct: str) -> str:
        response = self.model.generate_content(
            self.judge_prompt.format(question=question, answer=correct),
        )

        result = response.text.strip(" \n")
        return result


@weave.op()
def is_correct(model_output: str) -> bool:
    return model_output == "YES"


evaluation = Evaluation(dataset=mmlu_maths.rows, scorers=[is_correct])

correctness_evaluator = CorrectnessEvaluator()
plain_answer = asyncio.run(evaluation.evaluate(correctness_evaluator))

In [None]:
# Same evaluator as above, except the correct answer is first expanded into a verbose explanation before it is judged
CORRECTNESS_PROMPT = """You are an expert mathematics teacher evaluating a student answer.
Given a math question and the student's answer, determine if the answer is correct.

Question: {question}
Student Answer: {answer}

Is this answer correct? Respond with JUST "YES" or "NO".
"""


class CorrectnessEvaluator(Model):
    model: genai.GenerativeModel = genai.GenerativeModel("gemini-1.5-flash")
    judge_prompt: str = CORRECTNESS_PROMPT

    @weave.op()
    def predict(self, question: str, correct: str) -> str:
        beautified_answer_prompt = """You are given a math question and the correct answer to that question. Can you expand on the reasoning that led to the answer?
        Question: {question}
        Answer: {answer}
        """
        _fake_answer = self.model.generate_content(
            beautified_answer_prompt.format(question=question, answer=correct),
        )

        # The `response.text` quick accessor raises a ValueError when the response
        # contains no valid `Part`. In that case, fall back to a short sentence
        # that simply states the correct answer.
        try:
            beautified_answer = _fake_answer.text.strip(" \n")
        except ValueError:
            beautified_answer = f"The correct answer is {correct}."

        response = self.model.generate_content(
            self.judge_prompt.format(
                question=question,
                answer=beautified_answer,
            ),
        )

        result = response.text.strip(" \n")
        return result


@weave.op()
def is_correct(model_output: str) -> bool:
    return model_output == "YES"


evaluation = Evaluation(dataset=mmlu_maths.rows, scorers=[is_correct])

correctness_evaluator = CorrectnessEvaluator()
beautified_answer = asyncio.run(evaluation.evaluate(correctness_evaluator))

In [None]:
print(
    "What's the difference in accuracy because of verbosity bias?\n",
    beautified_answer["is_correct"]["true_fraction"]
    - plain_answer["is_correct"]["true_fraction"],
)

### Solutions

TODO: expand on this and the implications.

## Problem 3: Misinformation Oversight Bias

In [None]:
JUDGE_PROMPT = """You are an expert evaluator. Given a question and an answer, you need to determine if the answer is correct or incorrect.
Question: {question}
Answer: {answer}

Respond with exactly one word - either "correct" or "incorrect"."""


class MisinformationEvaluator(Model):
    model: genai.GenerativeModel = genai.GenerativeModel("gemini-pro")
    judge_prompt: str = JUDGE_PROMPT

    @weave.op()
    def predict(self, question: str, answer: str) -> str:
        response = self.model.generate_content(
            self.judge_prompt.format(
                question=question,
                answer=answer,
            ),
        )

        result = response.text.strip(" \n")
        return result


@weave.op()
def is_correct(model_output: str) -> bool:
    return model_output.lower() == "correct"


rag_dataset = weave.ref("rag_dataset:v0").get()

evaluation = Evaluation(dataset=rag_dataset.rows, scorers=[is_correct])

misinformation_evaluator = MisinformationEvaluator()
misinformation_results = asyncio.run(evaluation.evaluate(misinformation_evaluator))

In [None]:
JUDGE_PROMPT = """You are an expert evaluator. Given a question and an answer, you need to determine if the answer is correct or incorrect. You are also given the context that led to the answer.

Question: {question}
Context: {context}
Answer: {answer}

Respond with exactly one word - either "correct" or "incorrect"."""


class MisinformationEvaluator(Model):
    model: genai.GenerativeModel = genai.GenerativeModel("gemini-pro")
    judge_prompt: str = JUDGE_PROMPT

    @weave.op()
    def predict(self, context: str, question: str, answer: str) -> str:
        response = self.model.generate_content(
            self.judge_prompt.format(
                question=question,
                context=context,
                answer=answer,
            ),
        )

        result = response.text.strip(" \n")
        return result


misinformation_evaluator = MisinformationEvaluator(judge_prompt=JUDGE_PROMPT)
misinformation_results = asyncio.run(evaluation.evaluate(misinformation_evaluator))