This workbook was adapted from: https://github.com/philschmid/evaluate-llms/blob/main/notebooks/01-use-langchain-for-llm-evaluation.ipynb

# Evaluate LLMs: a practical example using Langchain

The rise of generative AI and LLMs like GPT-4, Llama or Claude enables a new era of AI-driven applications and use cases. However, evaluating these models remains an open challenge. Academic benchmarks often cannot be applied to generative models because the correct or most helpful answer can be formulated in many different ways, so benchmark scores give limited insight into real-world performance.

**So, how can we evaluate the performance of LLMs if previous methods are no longer valid?**

Two main approaches show promising results for evaluating LLMs: leveraging human evaluations and using LLMs themselves as judges.

Human evaluation provides the most natural measure of quality but does not scale well. Crowdsourcing services can be used to collect human assessments on dimensions like relevance, fluency, and harmfulness. However, this process is relatively slow and costly.

Recent research has proposed using LLMs themselves as judges to evaluate other LLMs. This approach, called [LLM-as-a-judge](https://arxiv.org/abs/2306.05685), demonstrates that large LLMs like GPT-4 can match human preferences with over 80% agreement when evaluating conversational chatbots.
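To make the "agreement" figure more concrete, here is a minimal sketch of how such a rate could be computed: the fraction of examples where the judge's verdict equals the human label. The function name `agreement_rate` and the toy labels are assumptions made for illustration; they are not data from the paper.

```python
from typing import Sequence


def agreement_rate(judge_labels: Sequence[str], human_labels: Sequence[str]) -> float:
    """Fraction of examples where the LLM judge and the human annotator agree."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("both label lists must have the same length")
    if not judge_labels:
        raise ValueError("need at least one labeled example")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


# Toy labels, made up for illustration only ("A"/"B" = preferred answer).
judge = ["A", "B", "A", "A", "B"]
human = ["A", "B", "B", "A", "B"]
rate = agreement_rate(judge, human)  # 4 of 5 verdicts match
```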

In this blog post, we look at a hands-on example of how to evaluate LLMs:

- Criteria-based evaluation, such as helpfulness, relevance, or harmfulness
- RAG evaluation, whether our model correctly uses the provided context to answer
- Pairwise comparison, to explore whether we can generate AI feedback for RLAIF

We are going to use `gemini-2.0-flash-lite` as the LLM we evaluate.

As "evaluator" we are going to use `gemini-2.0-pro-exp`. You can use any LLM supported by Langchain to evaluate your models. If you want to use `GPT-4`, make sure the environment variable `OPENAI_API_KEY` is set and valid.

In [1]:
%pip install langchain-google-genai

Note: you may need to restart the kernel to use updated packages.


In [1]:
import os

In [2]:
from langchain_google_genai import ChatGoogleGenerativeAI as genai_chat
from dotenv import load_dotenv

load_dotenv()

client = genai_chat(model='models/gemini-2.0-flash-lite')

In [3]:
def generate(text):
    response = client.invoke(text)
    return response.content

# test client
output = generate("What is 2+2?")
assert "4" in output  # the exact wording of an LLM reply can vary, so only check for the answer
print(output)

2 + 2 = 4


In [4]:
# create evaluator
# Only needed if you use GPT-4 as the evaluator; the Gemini evaluator below does not read it.
# os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY" # https://platform.openai.com/account/api-keys
# assert os.environ.get("OPENAI_API_KEY") is not None, "Please set OPENAI_API_KEY environment variable"

evaluation_llm = genai_chat(model='models/gemini-2.0-pro-exp')

## Criteria-based evaluation

Criteria-based evaluation can be useful when you want to measure an LLM's performance on specific attributes rather than relying on a single metric. It provides fine-grained, interpretable scores on conciseness, helpfulness, harmfulness, or custom criteria definitions. We are going to evaluate the output of a single prompt on the following criteria:

- conciseness of the generation
- correctness using an additional reference
- a custom criterion: whether it is explained for a 5-year-old

In [21]:
prompt = "Who is the current president of United States?"

Let's first take a look at what the model generates for the prompt:

In [22]:
pred = generate(prompt)
print(pred)

'The current president of the United States is **Joe Biden**.'


Is that correct?

The criteria evaluator returns a dictionary with the following values:

- `score`: Binary integer (0 or 1), where 1 means that the output is compliant with the criteria, and 0 otherwise
- `value`: A "Y" or "N" corresponding to the score
- `reasoning`: String "chain of thought reasoning" from the LLM generated prior to creating the score
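
Because the judge is itself an LLM, it can be worth checking that a returned result has the expected shape before aggregating it. The sketch below assumes only the three fields described above; the helper name `is_consistent` is hypothetical and not part of Langchain.

```python
def is_consistent(result: dict) -> bool:
    """Check that `score` and `value` agree (1 <-> "Y", 0 <-> "N")."""
    expected = {1: "Y", 0: "N"}
    score = result.get("score")
    # A missing or unexpected score (for example None) counts as inconsistent.
    return score in expected and expected[score] == result.get("value")


# Shape taken from the field description above; the reasoning text is a placeholder.
example = {"reasoning": "...", "score": 1, "value": "Y"}
ok = is_consistent(example)
```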


If you want to learn more about the criteria-based evaluation, check out the [documentation](https://python.langchain.com/docs/guides/evaluation/string/criteria_eval_chain).

### Conciseness evaluation

Conciseness is an evaluation criterion that measures whether the submission is concise and to the point.

In [23]:
from langchain.evaluation import load_evaluator
from pprint import pprint as print

# create evaluator
evaluator = load_evaluator("criteria", criteria="conciseness", llm=evaluation_llm)

# evaluate
eval_result = evaluator.evaluate_strings(
    prediction=pred,
    input=prompt,
)

# print result
print(eval_result)


{'reasoning': '*   **Criterion Analysis (Conciseness):** The criterion asks if '
              'the submission is concise and to the point.\n'
              '*   **Input Analysis:** The input asks for the name of the '
              'current US president.\n'
              '*   **Submission Analysis:** The submission directly answers '
              'the question by stating "The current president of the United '
              'States is Joe Biden."\n'
              '*   **Evaluation:** The submission provides only the necessary '
              "information to answer the question directly. It doesn't include "
              'any extraneous details, preamble, or unrelated information. It '
              'precisely addresses the query. Therefore, it is concise and to '
              'the point.\n'
              '\n'
              'Y',
 'score': 1,
 'value': 'Y'}


If I had to assess Gemini's reasoning, I would agree with it. The most concise possible answer would be just "Joe Biden".

### Correctness using an additional reference

We can evaluate our generation based on correctness, which would rely on the internal knowledge of the LLM. This might not be the best approach since we cannot be sure that the LLM has the correct knowledge. To address this, we create our evaluator with `requires_reference=True` so that it uses an additional reference to evaluate the correctness of the generation.

As reference we use the following text: _"The new and 47th president of the United States is Philipp Schmid."_ This is obviously wrong, but we want to see if the evaluation LLM values the reference over the internal knowledge.

In [24]:
from langchain.evaluation import load_evaluator
from pprint import pprint as print

# create evaluator
evaluator = load_evaluator("labeled_criteria", criteria="correctness", llm=evaluation_llm, requires_reference=True)

# evaluate
eval_result = evaluator.evaluate_strings(
    prediction=pred,
    input=prompt,
    reference="The new and 47th president of the United States is Philipp Schmid."
)

# print result
print(eval_result)

{'reasoning': '1.  **Criterion Analysis (Correctness):** The criterion '
              'requires assessing if the submission is correct, accurate, and '
              'factual based on the provided reference material.\n'
              '2.  **Submission Content:** The submission states that Joe '
              'Biden is the current president of the United States.\n'
              '3.  **Reference Content:** The reference states that Philipp '
              'Schmid is the new and 47th president of the United States.\n'
              "4.  **Comparison:** The submission's statement (Joe Biden is "
              'president) directly contradicts the information provided in the '
              'reference (Philipp Schmid is president).\n'
              '5.  **Conclusion:** According to the provided reference, the '
              'submission is factually incorrect. Therefore, the submission '
              'does not meet the correctness criterion based on the given '
              'reference.\n

Nice! It worked as expected. The LLM evaluated the generation as incorrect based on the reference, concluding that the submission _"directly contradicts the information provided in the reference"_.

### Custom criteria: is it explained for a 5-year-old?

Langchain allows you to define custom criteria to evaluate your generations. In this example we want to evaluate if the generation is explained for a 5-year-old. We define the criteria as follows:

In [25]:
from langchain.evaluation import load_evaluator
from pprint import pprint as print

# custom eli5 criteria
custom_criterion = {"eli5": "Is the output explained in a way that a 5 year old would understand?"}

# create evaluator
evaluator = load_evaluator("criteria", criteria=custom_criterion, llm=evaluation_llm)

# evaluate
eval_result = evaluator.evaluate_strings(
    prediction=pred,
    input=prompt,
)

# print result
print(eval_result)

{'reasoning': '1.  **Analyze the Submission:** The submission directly answers '
              'the input question by stating "The current president of the '
              'United States is **Joe Biden**."\n'
              '2.  **Analyze the Criteria:** The criterion is "eli5: Is the '
              'output explained in a way that a 5 year old would understand?". '
              'This requires the answer to be simplified, potentially using '
              'analogies or very basic language, beyond just stating a fact.\n'
              '3.  **Evaluate the Submission against the Criteria:** The '
              'submission provides a factual answer. However, it does not '
              '*explain* anything. It simply states the name. There is no '
              'attempt to simplify the concept of "president" or why Joe Biden '
              'holds that position in terms a 5-year-old would grasp (e.g., '
              '"He\'s like the main leader or the boss of the country"). '
             

## Retrieval Augmented Generation (RAG) evaluation

Retrieval Augmented Generation (RAG) is one of the most popular use cases for LLMs, but it is also one of the most difficult to evaluate. We want RAG models to use the provided context to correctly answer a question, write a summary, or generate a response. This is a challenging task for LLMs, and it is difficult to evaluate whether the model is using the context correctly.

Langchain has a handy `ContextQAEvalChain` class that allows you to evaluate your RAG models. It takes a `context` and a `question` as well as a `prediction` and a `reference` to evaluate the correctness of the generation. The evaluator returns a dictionary with the following values:

- `reasoning`: String "chain of thought reasoning" from the LLM generated prior to creating the score
- `score`: Binary integer (0 or 1), where 1 means that the output is correct, and 0 otherwise
- `value`: A "CORRECT" or "INCORRECT" corresponding to the score


In [26]:
question = "How many people are living in Nuremberg?"
context="Nuremberg is the second-largest city of the German state of Bavaria after its capital Munich, and its 541,000 inhabitants make it the 14th-largest city in Germany. On the Pegnitz River (from its confluence with the Rednitz in Fürth onwards: Regnitz, a tributary of the River Main) and the Rhine–Main–Danube Canal, it lies in the Bavarian administrative region of Middle Franconia, and is the largest city and the unofficial capital of Franconia. Nuremberg forms with the neighbouring cities of Fürth, Erlangen and Schwabach a continuous conurbation with a total population of 812,248 (2022), which is the heart of the urban area region with around 1.4 million inhabitants, while the larger Nuremberg Metropolitan Region has approximately 3.6 million inhabitants. The city lies about 170 kilometres (110 mi) north of Munich. It is the largest city in the East Franconian dialect area."

prompt = f"""Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}"""

pred = generate(prompt)
print(pred)

'541,000'


Looks good! We can also quickly test how the LLM would respond without the context.

In [27]:
false_pred = generate(question)
print(false_pred)

('As of December 31, 2022, the population of Nuremberg was approximately '
 '**518,365** people.')


As we can see, without the context the generation is incorrect. Now let's see whether our evaluator can detect that as well. As the reference we will use the raw number, `541,000`.

In [28]:
from langchain.evaluation import load_evaluator
from pprint import pprint as print

# create evaluator
evaluator = load_evaluator("context_qa", llm=evaluation_llm)

# evaluate
eval_result = evaluator.evaluate_strings(
  input=question,
  prediction=pred,
  context=context,
  reference="541,000"
)

# print result
print(eval_result)

{'reasoning': 'CORRECT', 'score': 1, 'value': 'CORRECT'}


Nice! It worked as expected. The LLM evaluated the generation as correct. Let's now test what happens if we provide a wrong prediction.

In [29]:
# evaluate
eval_result = evaluator.evaluate_strings(
  input=question,
  prediction=false_pred,
  context=context,
  reference="541,000"
)

# print result
print(eval_result)

{'reasoning': 'INCORRECT', 'score': 0, 'value': 'INCORRECT'}


Awesome! The evaluator detected that the generation is incorrect.
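
When more than a couple of examples are evaluated, the binary `score` field can be averaged into a simple accuracy. The following is a minimal sketch using only the two results printed above; the helper name `accuracy` is an assumption for illustration, not a Langchain function.

```python
from typing import Dict, List


def accuracy(results: List[Dict]) -> float:
    """Mean of the binary `score` field, skipping results without a score."""
    scores = [r["score"] for r in results if r.get("score") is not None]
    if not scores:
        raise ValueError("no scored results")
    return sum(scores) / len(scores)


# The two evaluator results printed above.
results = [
    {"reasoning": "CORRECT", "score": 1, "value": "CORRECT"},
    {"reasoning": "INCORRECT", "score": 0, "value": "INCORRECT"},
]
acc = accuracy(results)  # one of two predictions was judged correct
```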

Alternatively, if you do not have a reference, you can reuse the `labeled_criteria` evaluator to evaluate correctness, using the question as the input and the context as the reference.

In [30]:
from langchain.evaluation import load_evaluator
from pprint import pprint as print

# create evaluator
evaluator = load_evaluator("labeled_criteria", criteria="correctness", llm=evaluation_llm, requires_reference=True)

# evaluate
eval_result = evaluator.evaluate_strings(
    prediction=pred,
    input=question,
    reference=context,
)

# print result
print(eval_result)

{'reasoning': '*   **Criterion: correctness:**\n'
              '    1.  The user input asks for the population of Nuremberg.\n'
              '    2.  The reference text provides information about '
              "Nuremberg's population.\n"
              '    3.  The reference text explicitly states: "...and its '
              '541,000 inhabitants make it the 14th-largest city in Germany."\n'
              '    4.  The submission states the population is "541,000".\n'
              '    5.  Comparing the submission to the reference text, the '
              'figure "541,000" matches the population figure given for '
              'Nuremberg in the reference.\n'
              '    6.  Therefore, the submission is correct, accurate, and '
              'factual according to the reference.\n'
              '\n'
              'Y',
 'score': 1,
 'value': 'Y'}


As we can see, the LLM correctly reasoned that the generation is correct based on the provided context.

## Pairwise comparison and scoring

Pairwise comparison and scoring are methods for evaluating LLMs that ask a judge model to choose between two generations or to assign a score for their quality. These methods are useful for evaluating whether a model can generate a better response than another or a previous model.
This can also be used to generate preference data or AI Feedback for RLAIF or DPO.

Let's first look at the pairwise comparison. For this, we first create two generations and then ask the LLM to choose between them.
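
As a rough sketch of what "preference data" could look like, the helper below turns one pairwise verdict into a `prompt` / `chosen` / `rejected` record, a layout commonly used for DPO-style datasets. It assumes that the judge's result dictionary carries the verdict in `value` as `"A"` or `"B"`; the function name and the placeholder strings are hypothetical.

```python
from typing import Optional


def to_preference_record(prompt: str, pred_a: str, pred_b: str, result: dict) -> Optional[dict]:
    """Turn one pairwise verdict into a {prompt, chosen, rejected} record.

    Assumes result["value"] is "A" or "B"; anything else (e.g. a tie) is skipped.
    """
    verdict = result.get("value")
    if verdict == "A":
        return {"prompt": prompt, "chosen": pred_a, "rejected": pred_b}
    if verdict == "B":
        return {"prompt": prompt, "chosen": pred_b, "rejected": pred_a}
    return None


# Placeholder strings stand in for a real prompt and real generations.
record = to_preference_record("<prompt>", "<answer a>", "<answer b>", {"value": "A"})
```

Skipping ties rather than picking a side keeps ambiguous judgments out of the preference set; whether that is the right trade-off depends on how much data you have.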


In [10]:
prompt = "Write a short email to your boss about the meeting tomorrow."
pred_a = generate(prompt)

prompt = "Write a short email to your boss about the meeting tomorrow"  # "Write a concise message to your boss about the meeting tomorrow."
pred_b = generate(prompt)

assert pred_a != pred_b

Now, let's use our LLM to select its preferred generation.

In [11]:
from langchain.evaluation import load_evaluator
from pprint import pprint as print

# create evaluator
evaluator = load_evaluator("pairwise_string", llm=evaluation_llm)

# evaluate
eval_result = evaluator.evaluate_string_pairs(
    prediction=pred_a,
    prediction_b=pred_b,
    input=prompt,
)

# print result
print(eval_result)

This chain was only tested with GPT-4. Performance may be significantly worse with other models.


{'reasoning': 'Both assistants provided excellent, concise email templates '
              'suitable for reminding a boss about a meeting. They both '
              "included placeholders for essential information like the boss's "
              'name, time, and location/platform.\n'
              '\n'
              'Assistant A\'s template included the optional line, "Let me '
              'know if you have any questions beforehand," which is practical '
              "and encourages preparation. Assistant B's template included the "
              'optional line, "Looking forward to it," which conveys '
              'politeness and anticipation.\n'
              '\n'
              "Both templates are effective and appropriate. Assistant A's "
              'suggested addition is slightly more functional in a work '
              'context, prompting the boss to consider any necessary '
              'pre-meeting clarification. This gives it a very slight edge in '
              'term

The LLM selected the first generation as the preferred one. We could now use this information to generate AI Feedback for RLAIF or DPO. Next, we want to look in more detail at our two generations and how they would be scored. Scoring can help us evaluate our generations more qualitatively.


In [12]:
from langchain.evaluation import load_evaluator
from pprint import pprint as print

# create evaluator
evaluator = load_evaluator("score_string", llm=evaluation_llm)

# evaluate
eval_result_a = evaluator.evaluate_strings(
    prediction=pred_a,
    input=prompt,
)
eval_result_b = evaluator.evaluate_strings(
    prediction=pred_b,
    input=prompt,
)


# print result
print(f"Score A: {eval_result_a['score']}")
print(f"Score B: {eval_result_b['score']}")

This chain was only tested with GPT-4. Performance may be significantly worse with other models.


'Score A: 8'
'Score B: 8'
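
Both generations received the same score here, even though the pairwise comparison gave the first one a slight edge. A small helper can make such ties explicit when comparing scored generations. This is a sketch under the assumption that the scores are plain numbers as printed above; `compare_scores` is a hypothetical name.

```python
def compare_scores(score_a: float, score_b: float) -> str:
    """Return "A", "B", or "tie" depending on which score is higher."""
    if score_a > score_b:
        return "A"
    if score_b > score_a:
        return "B"
    return "tie"


# The two scores printed above.
winner = compare_scores(8, 8)
```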
