<center><a href="https://www.nvidia.com/en-us/training/"><img src="https://dli-lms.s3.amazonaws.com/assets/general/DLI_Header_White.png" width="400" height="186" /></a></center>
<br>

# <font color="#76b900">**Notebook 1:** Evaluation of a deployed NIM</font>

**Welcome to the introductory notebook on evaluating NVIDIA Inference Microservices (NIMs).** In this notebook, you'll explore several techniques for assessing large language models (LLMs). We've deployed a NIM instance running the Llama 3.2 3B instruct model, which you'll query as you test and compare various evaluation methods.

Objectives:
- Learn to query NIMs using both curl and Python.
- Understand benchmark evaluation by testing on the GSM8K task.
- Conduct an LLM-as-a-judge evaluation.
- Discover the fundamentals of human evaluation and explore the application of ELO ranking within the LLM arena.

<br><hr>

## **Notebook Presentation**

Run the following cell to load a video presentation covering this notebook's topics.

In [1]:
from videos.walkthroughs import notebook_01_video as video
video()

<br><hr>

## **Getting Started With Evaluating NIM**

To get started, let's verify that the NIM is operational by checking its health endpoint. You can do this as follows:

In [2]:
import time
import requests
from IPython.display import clear_output

url = "http://llama3-2-3b-instruct.local/v1/health/ready"

while True:
    try:
        response = requests.get(url)
        if response.status_code == 200:
            clear_output(wait=True)
            print("Service is up and running!")
            break
        else:
            clear_output(wait=True)
            print("Waiting for service to be ready... (Status code:", response.status_code,")")
    except requests.exceptions.RequestException as e:
        clear_output(wait=True)
        print("Waiting for service to be ready... (Error:", e,")")
    time.sleep(10)

Service is up and running!


After executing the request, you should see the message "Service is up and running!". Next, verify the model deployed on the NIM by checking its model endpoint:

In [3]:
!curl -s -X GET 'llama3-2-3b-instruct.local/v1/models' | jq -r '.data[0].id'

meta/llama-3.2-3b-instruct


The model should be `meta/llama-3.2-3b-instruct`. 
A common first step in evaluating a model is to "eyeball" its responses—that is, to visually inspect the output for a known query. For example, you can use curl to ask the LLM for the capital of Spain:

In [4]:
%%bash
curl -s -X 'POST' \
'http://llama3-2-3b-instruct.local/v1/chat/completions' \
   -H 'accept: application/json' \
   -H 'Content-Type: application/json' \
   -d '{
      "model": "meta/llama-3.2-3b-instruct",
      "messages": [
          {
              "role": "system",
              "content": "Be succinct in your response."
          },
          {
              "role": "user",
              "content": "What is the capital of Spain?"
          }
        ],
      "max_tokens": 32
    }' | jq -r '.choices[0].message.content'

The capital of Spain is Madrid.


If the response names Madrid, Spain's capital, you can then experiment with more complex queries.

While eyeballing responses is a useful initial check, this method has several limitations:
- It is slow and doesn't scale well, as each question must be processed individually.
- It focuses on specific, subject-dependent queries.
- It doesn't provide a quantitative score for comparing different models.

Another evaluation approach involves analyzing the response for specific keywords. Metrics like BLEU and ROUGE, introduced over 20 years ago for machine translation and summarization respectively, follow a related strategy by scoring the word overlap between a generated text and a reference. This method enables fast evaluation using parsing libraries, without relying on LLMs or human reviewers.

For example, let's define a question and its ground truth answer:

In [5]:
question = "What is the capital of Spain?"
ground_truth = "Madrid"

Now, let's query the LLM and analyze its response to check for the presence of the ground truth answer:

In [6]:
import requests
import json

url = "http://llama3-2-3b-instruct.local/v1/chat/completions"
headers = {
    "accept": "application/json",
    "Content-Type": "application/json"
}
data = {
    "model": "meta/llama-3.2-3b-instruct",
    "max_tokens": 300,
    "messages": [
        {"role": "system", "content": "Be succinct in your response."},
        {"role": "user", "content": f"{question}"}
    ],
}

response = requests.post(url, headers=headers, data=json.dumps(data))

if response.status_code == 200:
    result = response.json()
    if ground_truth in result["choices"][0]["message"]["content"]:
        print(f"The generated answer contains the ground truth {ground_truth}")
    else:
        print(f"The generated answer does not contain the ground truth {ground_truth}")
else:
    print("Error:", response.status_code, response.text)

The generated answer contains the ground truth Madrid


Parsing for specific words is still not a systematic approach to evaluating LLMs, as two sentences can have the same meaning while using entirely different words.
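To make this limitation concrete, here is a minimal sketch of a unigram-overlap F1 score in the spirit of ROUGE-1. It is a simplified illustration, not the official BLEU or ROUGE implementation; the function names and the example sentences are our own assumptions.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def unigram_f1(candidate, reference):
    """Unigram-overlap F1 between a candidate and a reference (ROUGE-1-like)."""
    cand, ref = Counter(tokenize(candidate)), Counter(tokenize(reference))
    overlap = sum((cand & ref).values())  # clipped count of shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Hypothetical sentences chosen for illustration
reference = "The capital of Spain is Madrid."
close_match = "Madrid is the capital of Spain."
paraphrase = "Spain's government sits in Madrid."

print(unigram_f1(close_match, reference))  # same words, different order
print(unigram_f1(paraphrase, reference))   # similar meaning, few shared words
```

The reordered sentence reaches a perfect overlap score, while the paraphrase scores much lower even though it conveys roughly the same fact.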

In the remainder of this notebook, you will explore three widely used evaluation methods for LLMs:

- Benchmark evaluation – Measuring performance using standardized datasets.
- LLM-as-a-judge evaluation – Leveraging another LLM to assess responses.
- Human evaluation – Incorporating human judgment, including the ELO ranking system for model comparison in the LLM arena.

<br><hr>

## **Benchmark Evaluation**

LLMs can be evaluated using benchmarks — carefully curated tasks that assess capabilities in areas such as commonsense reasoning, question answering, and summarization. These evaluations typically involve either parsing the generated response or analyzing the model’s log probabilities (logprobs) for a given input. Most state-of-the-art LLMs are released with a comprehensive set of benchmark evaluations to showcase their performance.
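As a rough sketch of the logprob-based approach for a multiple-choice task: the benchmark asks for the log probability of each candidate answer and takes the most likely one as the model's prediction. The log probabilities below are made-up values for illustration, and `pick_choice` is a hypothetical helper, not part of any library.

```python
import math

def pick_choice(choice_logprobs):
    """Return the most likely choice and probabilities normalized over the candidates."""
    best = max(choice_logprobs, key=choice_logprobs.get)
    # Softmax restricted to the candidate answers (shifted by the max for stability)
    max_lp = max(choice_logprobs.values())
    exps = {c: math.exp(lp - max_lp) for c, lp in choice_logprobs.items()}
    total = sum(exps.values())
    probs = {c: v / total for c, v in exps.items()}
    return best, probs

# Hypothetical log probabilities for the four options of one question
hypothetical_logprobs = {"A": -2.3, "B": -0.4, "C": -3.1, "D": -2.9}
gold = "B"  # assumed ground-truth label

prediction, probabilities = pick_choice(hypothetical_logprobs)
print("Prediction:", prediction, "| correct:", prediction == gold)
```

Because the prediction is read from the probabilities rather than from generated text, no response parsing is needed for this style of task.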

A widely-used benchmark for evaluating reasoning in large language models is [GSM8K](https://huggingface.co/datasets/openai/gsm8k). GSM8K (Grade-School Math 8K) contains 8.5k open-ended, grade-school math word problems. Each problem requires multi-step quantitative reasoning and expects a short free-form numerical answer rather than a multiple-choice selection.

Example problem
```
A fruit seller packs apples equally into 4 boxes.  
If she puts 18 apples in each box and still has 12 apples left over,  
how many apples did she have in total?
```

Solution sketch  
• Apples in the packed boxes: $$4 \times 18 = 72$$  
• Add the leftovers: $$72 + 12 = 84$$  

Final answer: 84
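Because the answer is free-form, scoring a problem like this means pulling the final number out of the generated text and comparing it with the ground truth. The following is a minimal sketch of that idea; the regular expression and helper names are our own simplification and not the exact filters used by any evaluation library.

```python
import re

# Matches integers and decimals, optionally negative or with thousands separators
NUMBER_PATTERN = re.compile(r"-?\d[\d,]*(?:\.\d+)?")

def extract_final_number(text):
    """Return the last number found in the text as a float, or None if there is none."""
    matches = NUMBER_PATTERN.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))

def is_correct(generated, ground_truth):
    """Exact match between the extracted final number and the ground-truth answer."""
    predicted = extract_final_number(generated)
    return predicted is not None and predicted == float(ground_truth)

# Hypothetical model output for the example problem
generated = "The boxes hold 4 x 18 = 72 apples. Adding the 12 left over gives 84."
print(extract_final_number(generated))
print(is_correct(generated, "84"))
```

Taking the last number is a heuristic: it works when the model ends with its answer, and fails if the model keeps talking about other quantities afterwards.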

Let's evaluate the LLM of our NIM with the GSM8K benchmark on 10 problems (you can modify the `--limit` value). You can leverage the open-source library [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) for that:

In [7]:
%%bash
lm_eval --model local-completions \
        --tasks gsm8k \
        --output_path llama32_gsm8k \
        --model_args model=meta/llama-3.2-3b-instruct,base_url=http://llama3-2-3b-instruct.local/v1/completions,tokenizer=utils/llama3-1-8b-instruct-tokenizer \
        --limit 10 | awk -F'|' '/exact_match/ {gsub(/^ +| +$/, "", $8); print "The score is", $8}'

2025-08-06:10:36:27 INFO     [__main__:440] Selected Tasks: ['gsm8k']
[evaluator:163] Model appears to be an instruct variant but chat template is not applied. Recommend setting `apply_chat_template` (optionally `fewshot_as_multiturn`).
2025-08-06:10:36:27 INFO     [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2025-08-06:10:36:27 INFO     [evaluator:227] Initializing local-completions model, with arguments: {'model': 'meta/llama-3.2-3b-instruct', 'base_url': 'http://llama3-2-3b-instruct.local/v1/completions', 'tokenizer': 'utils/llama3-1-8b-instruct-tokenizer'}
[models.api_models:168] Using max length 2048 - 1
Concurrent requests are disabled. To enable concurrent requests, set `num_concurrent` > 1.
INFO     [models.api_models:187] Using tokenizer huggingface
Generating train split: 100%|██████████| 7473/7473 [00:00<00:00, 636827.93 examples/s]
Generating test split: 100%|██████████| 1319/1319 [00:00<00:00, 463767.

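The score is the fraction of problems answered correctly, so a score of 0.6 means that 6 of the 10 questions were answered correctly by the LLM. With only 10 problems the estimate is noisy, so expect it to vary between setups (the recorded run shown below scored 0.4). Run the following cell to print the results: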

In [8]:
import os
import json

dir_path = "llama32_gsm8k/meta__llama-3.2-3b-instruct"
# Find the latest file starting with "results"
files = [f for f in os.listdir(dir_path) if f.startswith("results")]
if not files:
    # Raise instead of calling exit(), which would stop the notebook kernel
    raise FileNotFoundError(f"No results file found in {dir_path}")
latest_file = max(files, key=lambda f: os.path.getmtime(os.path.join(dir_path, f)))
full_path = os.path.join(dir_path, latest_file)

# Load JSON and print only results["gsm8k"]
with open(full_path, "r") as f:
    data = json.load(f)
    gsm8k_results = data.get("results", {}).get("gsm8k")
    print(json.dumps(gsm8k_results, indent=2))


{
  "alias": "gsm8k",
  "exact_match,strict-match": 0.4,
  "exact_match_stderr,strict-match": 0.16329931618554522,
  "exact_match,flexible-extract": 0.4,
  "exact_match_stderr,flexible-extract": 0.16329931618554522
}
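The `exact_match_stderr` value shows how uncertain a 10-problem estimate is. As a sketch, the number above can be reproduced under the assumption that the standard error is the sample standard deviation of the per-problem 0/1 outcomes divided by the square root of the sample size; the outcome list below is reconstructed from the score of 0.4, not read from the results file.

```python
import math
import statistics

# Assumed per-problem outcomes: 4 correct and 6 incorrect out of 10
outcomes = [1] * 4 + [0] * 6

accuracy = statistics.mean(outcomes)
# Sample standard deviation (n - 1 in the denominator) divided by sqrt(n)
stderr = statistics.stdev(outcomes) / math.sqrt(len(outcomes))

print(f"accuracy = {accuracy:.2f}, stderr = {stderr:.4f}")
```

A standard error of about 0.16 on a score of 0.4 means the true accuracy could plausibly be far from 0.4, which is why a larger `--limit` gives a more trustworthy comparison between models.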


<br><hr>

## **LLM-as-a-judge Evaluation**

Benchmark evaluation works pretty well when there is a pre-compiled list of questions with defined answers. However, for many real-life scenarios, there are other factors to consider, such as factual accuracy, style or correctness. LLM-as-a-judge provides an intermediate approach between automatic benchmark evaluations and human evaluations, based on using a second LLM to assess the generated answer of the first LLM.

For LLM-as-a-judge evaluation, two key components are required in addition to the LLM itself:

- An evaluation dataset – A set of questions with acceptable answers, ideally curated by human annotators with subject matter expertise (potentially assisted by LLMs).
- A well-crafted prompt – Clear instructions guiding the LLM on how to assess responses.

For example, here’s a prompt designed to evaluate faithfulness:

In [9]:
prompt_faithfulness = """
System: You are an impartial judge assessing the faithfulness of an AI-generated response. Faithfulness means the response accurately reflects the ground-truth response without adding false or misleading content.

Evaluation Criteria:
1. **Correctness**: Does the response correctly reflect facts from the ground truth?
2. **No Hallucination**: Does the response introduce any information not present in the ground truth?
3. **Completeness**: Does the response omit any key facts that would change the meaning?
4. **Paraphrase Accuracy**: If paraphrased, is the meaning preserved?

Your Task:
Provide a faithfulness score from **1 (poor)** to **5 (perfect)** and a brief explanation of your reasoning.
"""

We recommend that you check out other examples of prompts for LLM-as-a-judge evaluation: see [the Ragas prompt for faithfulness](https://github.com/explodinggradients/ragas/blob/main/docs/concepts/metrics/available_metrics/faithfulness.md), [the hallucination detection prompt in Opik](https://www.comet.com/docs/opik/evaluation/metrics/hallucination/?from=llm&utm_source=opik&utm_medium=github&utm_content=hallucination_link&utm_campaign=opik#hallucination-prompt), or [the tone evaluation prompt in Promptfoo](https://github.com/promptfoo/promptfoo/blob/7c1576bf01579af23b0d0377df017fe715e1a622/site/docs/guides/langchain-prompttemplate.md?plain=1#L10). Other popular evaluation categories include factuality and correctness.

For example, let's assume we use our NIM with `llama-3.2-3b-instruct` as the LLM-as-a-judge. We'll ask it to identify the three most populous countries worldwide. The ground truth reflects accurate data as of February 2025, but in this simulated example, the generated answer contains an error in one of the three countries:

In [10]:
question_countries = "What are the three most populated countries worldwide?"
ground_truth_countries = "India, China and the United States"
generated_answer_countries = "India, China and Spain" # Note that Spain is not one of the three most populated countries 

Let's use the NIM as the LLM-as-a-judge with the faithfulness prompt:

In [11]:
import requests
import json

url = "http://llama3-2-3b-instruct.local/v1/chat/completions"
headers = {
    "accept": "application/json",
    "Content-Type": "application/json"
}
data = {
    "model": "meta/llama-3.2-3b-instruct",
    "max_tokens": 300,
    "temperature": 0.2,
    "messages": [
        {"role": "system", "content": f"{prompt_faithfulness}. Question: {question_countries}"},
        {"role": "user", "content": f"Ground truth: {ground_truth_countries}. Answer: {generated_answer_countries}"}
    ],
}

response = requests.post(url, headers=headers, data=json.dumps(data))

if response.status_code == 200:
    result = response.json()
    print("Assistant:", result["choices"][0]["message"]["content"])
else:
    print("Error:", response.status_code, response.text)


Assistant: Faithfulness Score: 1 (poor)

Explanation: The response is incorrect in two significant ways. Firstly, it introduces a new country, Spain, which is not among the top three most populated countries in the world. Secondly, it omits two of the correct countries, the United States and India, which are indeed among the most populated countries globally. This response demonstrates a lack of correctness and introduces false information, making it a poor faithfulness score.


The LLM-as-a-judge should assign a low faithfulness score of 1, as one of the most populous countries was incorrect. While we've reduced the temperature to make the scoring more deterministic, the score may still vary occasionally.

By leveraging LLM-as-a-judge, you can generate numerical scores to evaluate key properties of the model, such as faithfulness. These scores provide a quantifiable way to compare different models systematically.
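To compare models numerically, the score has to be extracted from the judge's free-text reply. Here is a minimal sketch that assumes the judge writes the word "score" shortly before the number, as in the recorded output above; `parse_judge_score` is a hypothetical helper, and the second and third replies are invented for illustration.

```python
import re
from statistics import mean

def parse_judge_score(text, low=1, high=5):
    """Return the first integer following the word 'score', or None if missing or out of range."""
    match = re.search(r"score\D{0,20}?(\d+)", text, flags=re.IGNORECASE)
    if match is None:
        return None
    score = int(match.group(1))
    return score if low <= score <= high else None

judge_replies = [
    "Faithfulness Score: 1 (poor)",          # format seen in this notebook
    "**Faithfulness Score:** 4/5",           # hypothetical variation
    "The answer looks mostly fine to me.",   # hypothetical reply with no score
]

scores = [parse_judge_score(reply) for reply in judge_replies]
valid_scores = [s for s in scores if s is not None]
print("Parsed scores:", scores)
print("Mean of valid scores:", mean(valid_scores))
```

Replies that cannot be parsed are kept as `None` rather than silently counted as zero, so you can track how often the judge ignores the requested format.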

<br><hr>

## **Optional - Human Evaluation**

LLM-as-a-judge is a powerful evaluation technique, but it comes with potential biases. For example, it may favor responses generated by models similar to itself or show inconsistencies in assigning numerical scores. For more details, see [this blog post](https://huggingface.co/blog/clefourrier/llm-evaluation).

In general, we recommend using LLM-as-a-judge for an initial evaluation pass. However, it is crucial to incorporate human evaluation in a second stage before deploying to production. This ensures that any issues not identified by the LLM-as-a-judge are caught.

The first step in human evaluation involves annotating a set of questions and reference answers. These questions are then passed to the LLM to generate answers, and humans are tasked with comparing the reference answers to the generated ones.

Our recommendations for a systematic human evaluation process are:
- Provide well-curated reference answers.
- Instruct humans to compare the reference and generated answers across clear, measurable categories.
- Ask evaluators to assign binary scores per category (e.g., 1 for "yes" and 0 for "no").
- Include multiple evaluators for the same question to reduce bias.
- Be mindful of human fatigue to maintain consistency in evaluation.

Let's apply human evaluation to an example involving a chatbot designed for customer support in a telecommunications company. The chatbot is tasked with answering questions about a customer's previous and current bills, with access to the bills through a RAG (Retrieval Augmented Generation) system. The following are the customer query, reference answer, and generated answers:

In [12]:
query = "Hi, I would like to understand why my phone bill is higher this month compared to last month?"
context = "..." # The context contains information of the previous and current bill

# The ground truth is provided by subject matter experts
reference_answer = """
    Hello, thanks for reaching out. Let me help you. 
    Your bill this month is $10 higher because you bought a package of 5GB of data on the 1st of February 2025.
"""
# Generated answer by LLM with question and context
generated_answer = """
    Hi, your bill is $10 higher because there is a charge of 5GB of data. There is another charge of $20 for calls.
"""


For the human evaluation, we ask the evaluators to provide binary scores in the following categories:

1) Friendliness: Does the generated answer contain a friendly greeting?
- Human assessment: No, it just replies "Hi."
- Score: 0

2) Correctness: Does the generated answer correctly explain the reason for the issue?
- Human assessment: Yes, it correctly states that the charge is due to data usage.
- Score: 1

3) Relevancy: Does the generated answer contain only relevant information and no hallucinated facts?
- Human assessment: No, it includes an irrelevant detail about call charges.
- Score: 0

**Average human score: 0.33**
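The average is simply the mean of the binary scores. The sketch below reproduces it and, as an assumption that goes beyond the single-evaluator example above, shows one way to combine several evaluators with a majority vote per category; the three-evaluator scores are hypothetical.

```python
from statistics import mean

# Scores from the single evaluator in the example above
single_evaluator = {"friendliness": 0, "correctness": 1, "relevancy": 0}
average_score = mean(list(single_evaluator.values()))
print(f"Average human score: {average_score:.2f}")

# Hypothetical scores from three evaluators for the same answer
multiple_evaluators = {
    "friendliness": [0, 0, 1],
    "correctness": [1, 1, 1],
    "relevancy": [0, 0, 0],
}

# Majority vote per category: 1 if more than half of the evaluators said "yes"
majority = {
    category: int(sum(votes) * 2 > len(votes))
    for category, votes in multiple_evaluators.items()
}
print("Majority vote:", majority)
print(f"Average after majority vote: {mean(list(majority.values())):.2f}")
```

Majority voting dampens the effect of a single outlier evaluator, such as the one "yes" for friendliness here.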


These human scores can be improved with further customization of the LLM. In some cases, simple prompt engineering with tailored instructions may be sufficient. For more advanced customization, we recommend using the NeMo framework to apply techniques such as LoRA (Low-Rank Adaptation) or supervised fine-tuning.

#### **Human evaluation based on ELO for LLM Arena**

A specific variant of human evaluation is ELO ranking, which is used to compare LLMs head-to-head. Originally developed for chess, ELO ranking has now become a widely adopted method for ranking LLMs. For example, the [Chatbot Arena](https://lmarena.ai/?gad_source=1&gclid=EAIaIQobChMIv4rPwNTxiwMVkc_CBB2XNy-PEAAYASAAEgJgJPD_BwE) uses the ELO ranking system for this purpose.

In the context of LLMs, a human evaluator compares two responses generated by different models in response to the same prompt. The evaluator simply chooses their preferred response: the winning model gains points, and the losing model loses points in the ELO ranking system. For more information, check out this [blog post](https://huggingface.co/blog/clefourrier/llm-evaluation).
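The point exchange can be written out explicitly. As a worked sketch, let $R_A$ and $R_B$ be the current ratings, $S_A$ the outcome for model A (1 for a win, 0 for a loss), and $K$ the update step. These symbols are our own notation; the constants 400, 1500, and $K = 32$ are the ones used in the simulation code in this section.

$$E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}, \qquad R_A' = R_A + K\,(S_A - E_A)$$

With both models at 1500, $$E_A = 0.5$$, so the winner moves to $$1500 + 32 \times (1 - 0.5) = 1516$$ and the loser to $$1500 + 32 \times (0 - 0.5) = 1484$$.

If the 1516-rated model then beats the 1484-rated one again, $$E_A = \frac{1}{1 + 10^{-32/400}} \approx 0.546$$, so it gains only about $$32 \times (1 - 0.546) \approx 14.5$$ points, because that win was already the expected outcome. An upset win by the lower-rated model would instead earn it about $$32 \times 0.546 \approx 17.5$$ points.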

As an example, the following code simulates an ELO ranking system for popular LLMs, using random probabilities (which don't represent the actual quality of the models). The code simulates 20 "battles" and outputs the final ELO score:

In [13]:
import math
import random

# Initialize chatbot ratings
chatbots = {
    "GPT-4": 1500,
    "Claude-3": 1500,
    "Gemini": 1500,
    "Llama-3": 1500
}

def expected_score(rating_a, rating_b):
    """Calculate the expected score for Chatbot A vs Chatbot B."""
    return 1 / (1 + math.pow(10, (rating_b - rating_a) / 400))

def update_elo(winner, loser, k=32):
    """Update the ELO ratings after a match."""
    expected_winner = expected_score(chatbots[winner], chatbots[loser])
    expected_loser = expected_score(chatbots[loser], chatbots[winner])

    chatbots[winner] += round(k * (1 - expected_winner))
    chatbots[loser] += round(k * (0 - expected_loser))

def simulate_battle():
    """Simulate a random chatbot battle with a human-like decision."""
    bot1, bot2 = random.sample(list(chatbots.keys()), 2)
    print(f"🔹 Match: {bot1} vs {bot2}")

    # Simulate a winner (you can replace this with a real evaluation)
    winner = bot1 if random.random() < 0.55 else bot2  # Slight bias toward the first-sampled bot, not a model of skill
    loser = bot2 if winner == bot1 else bot1

    print(f"🏆 Winner: {winner}")
    update_elo(winner, loser)

# Simulate 20 battles
for _ in range(20):
    simulate_battle()

# Show final rankings
sorted_bots = sorted(chatbots.items(), key=lambda x: x[1], reverse=True)
print("\n📊 Final ELO Ratings:")
for bot, rating in sorted_bots:
    print(f"{bot}: {rating}")

🔹 Match: Gemini vs GPT-4
🏆 Winner: Gemini
🔹 Match: GPT-4 vs Llama-3
🏆 Winner: Llama-3
🔹 Match: Llama-3 vs GPT-4
🏆 Winner: GPT-4
🔹 Match: Claude-3 vs GPT-4
🏆 Winner: GPT-4
🔹 Match: GPT-4 vs Llama-3
🏆 Winner: GPT-4
🔹 Match: Gemini vs GPT-4
🏆 Winner: Gemini
🔹 Match: Claude-3 vs Gemini
🏆 Winner: Claude-3
🔹 Match: Claude-3 vs Gemini
🏆 Winner: Claude-3
🔹 Match: Claude-3 vs Llama-3
🏆 Winner: Llama-3
🔹 Match: Claude-3 vs GPT-4
🏆 Winner: GPT-4
🔹 Match: Llama-3 vs Gemini
🏆 Winner: Llama-3
🔹 Match: GPT-4 vs Llama-3
🏆 Winner: GPT-4
🔹 Match: Claude-3 vs Gemini
🏆 Winner: Gemini
🔹 Match: GPT-4 vs Claude-3
🏆 Winner: GPT-4
🔹 Match: Claude-3 vs Gemini
🏆 Winner: Claude-3
🔹 Match: Gemini vs Claude-3
🏆 Winner: Gemini
🔹 Match: Gemini vs GPT-4
🏆 Winner: Gemini
🔹 Match: GPT-4 vs Claude-3
🏆 Winner: Claude-3
🔹 Match: GPT-4 vs Claude-3
🏆 Winner: GPT-4
🔹 Match: Llama-3 vs Claude-3
🏆 Winner: Llama-3

📊 Final ELO Ratings:
GPT-4: 1526
Gemini: 1513
Llama-3: 1513
Claude-3: 1448


Please note that the scores in this simulation are not indicative of the actual quality of the models.

After completing this notebook, proceed to the next one, where you'll learn how the NeMo Evaluator Microservice simplifies the process of evaluating LLMs.
