<center><a href="https://www.nvidia.com/en-us/training/"><img src="https://dli-lms.s3.amazonaws.com/assets/general/DLI_Header_White.png" width="400" height="186" /></a></center>
<br>

# <font color="#76b900">**Notebook 2:** Evaluating LLM Performance on Legal Domain Tasks Using NeMo Evaluator
</font>

### Overview

Large Language Models (LLMs) have revolutionized AI by learning from vast amounts of general-purpose text data. While this makes them versatile across many domains, specific applications like legal analysis may require evaluating and potentially enhancing their domain expertise.

This notebook demonstrates how to rigorously evaluate an LLM's performance on legal domain tasks using the NeMo Evaluator framework. We'll explore three key evaluation approaches:

1. **Zero-Shot Performance**: Testing the model's base capabilities without any context
2. **In-Context Learning (ICL)**: Enhancing performance by providing relevant examples in the prompt
3. **LLM-as-a-Judge**: Using a larger LLM to assess output quality

### Learning Objectives

In this notebook you will learn how to:
- Set up and configure NeMo Microservices for evaluation
- Prepare custom legal datasets and upload them to NeMo Data Store
- Evaluate LLM performance using similarity metrics (ROUGE) and LLM-as-a-Judge
- Compare model performance with and without in-context learning
- Visualize and analyze evaluation results using MLflow

<br><hr>

## **Notebook Presentation**

Run the following cell to load a video presentation covering this notebook's topics.

In [2]:
from videos.walkthroughs import notebook_02_video as video
video()

<br><hr>

## **Getting Started With Evaluating NIM**

### What is In-Context Learning (ICL)?

In-context learning (ICL) is a technique where we provide the model with a few carefully chosen examples alongside the main question. This notebook compares two modes:

- **Zero-Shot**: Model receives only the question
  ```
  Question: What are the legal implications of remote work across state lines?
  ```

- **Few-Shot (ICL)**: Model receives examples plus the question
  ```
  Example 1:
  Question: What constitutes workplace discrimination?
  Answer: Workplace discrimination occurs when an employer treats an individual differently based on protected characteristics like race, gender, age, religion, or disability. This includes hiring, firing, promotion, and workplace conditions.

  Example 2:
  Question: How does copyright law apply to software?
  Answer: Software copyright protects original code, giving creators exclusive rights to reproduce, modify, distribute, and license their work. Both source code and object code are protected under copyright law.

  Question: What are the legal implications of remote work across state lines?
  ```

This comparison will help quantify how much ICL improves performance on legal domain tasks.
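
As a rough illustration of the two modes, the sketch below assembles both prompt styles from a list of question/answer pairs. The helper name `build_prompt` and the exact layout are assumptions made for illustration; they are not a format that NeMo Evaluator requires.

```python
def build_prompt(question, examples=()):
    """Return a zero-shot prompt, or a few-shot prompt when examples are given.

    `examples` is an iterable of (question, answer) pairs (hypothetical shape).
    """
    parts = []
    for i, (q, a) in enumerate(examples, start=1):
        # Each demonstration is a numbered question/answer block.
        parts.append(f"Example {i}:\nQuestion: {q}\nAnswer: {a}\n")
    # The real question always comes last, with no answer attached.
    parts.append(f"Question: {question}")
    return "\n".join(parts)


zero_shot = build_prompt("What are the legal implications of remote work across state lines?")
few_shot = build_prompt(
    "What are the legal implications of remote work across state lines?",
    examples=[("What constitutes workplace discrimination?", "Unequal treatment based on protected characteristics.")],
)
print(zero_shot)
print("---")
print(few_shot)
```

The only difference between the two prompts is the demonstration block, which is what lets the later comparison isolate the effect of ICL.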

## Table of Contents

1. [Prerequisites](#Prerequisites)
2. [Install Python Package Requirements](#Install-Python-Package-Requirements)
3. [Set up MLflow for Experiment Tracking](#Set-up-MLflow-for-Experiment-Tracking)
4. [Configure NeMo Microservices Endpoints](#Set-up-NeMo-Microservice-API-endpoints)
5. [Download & Format Dataset](#Download-and-Format-a-Dataset)
6. [Inference with Locally Deployed NIM](#Inference-with-Locally-Deployed-NIM)
7. [Evaluation with NeMo Evaluator MS](#Evaluate-on-Legal-dataset-using-Nemo-Evaluator-MS)
   * [Zero-Shot Mode](#Evaluation-of-zero-shot-mode)
   * [Few-Shot (ICL) Mode](#Evaluation-of-few-shots-(ICL))
8. [LLM-as-a-Judge Evaluation](#Evaluations-with-LLM-as-a-Judge)


<br><hr>

### Prerequisites

To run an evaluation, the following NeMo Microservices are required:
* NeMo Data Store for managing datasets used to train a custom model and for storing the trained model adapter
* NeMo Entity Store for managing platform-wide entities
* Access to NVIDIA NIM with the Llama 3.2-3B-Instruct model. It comes in the form of a Docker container that can be obtained from the [NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nim/teams/meta/containers/llama-3.2-3b-instruct).
  * (Optional) You can modify the notebook to train with another supported model by updating `config_model_id`
* Because we also use a judge LLM (Llama-3.3-70B-Instruct), you will need an `NGC_API_KEY` to use the [build.nvidia.com](https://build.nvidia.com/meta/llama-3_3-70b-instruct) platform
* NeMo Evaluator Microservice

We have already deployed the microservices in this environment, but if you want to dive into how to deploy them, see the [Getting Started](https://developer.nvidia.com/docs/nemo-microservices/parent-chart/source/getting-started.html) documentation.

<div>
<img src="images/architecture-topology.png" width="800"/>
</div>

The NeMo Evaluator MS supports a wide variety of evaluation tasks and metrics. We won't cover all of them, but we will give you the tools to deploy your first evaluation job and to customize future ones.

### Install Python Package Requirements

NeMo Evaluator Microservice supports the [Hugging Face API](https://huggingface.co/docs/huggingface_hub/package_reference/hf_api) for version-controlling the datasets and models used for evaluation.
<div></div>
We will also install the `datasets` library so we can download our dataset and format it for evaluation.

In [3]:
! pip install requests datasets bs4 ftfy
! pip install -U "huggingface_hub"

Collecting bs4
  Downloading bs4-0.0.2-py2.py3-none-any.whl.metadata (411 bytes)
Downloading bs4-0.0.2-py2.py3-none-any.whl (1.2 kB)
Installing collected packages: bs4
Successfully installed bs4-0.0.2
Collecting huggingface_hub
  Downloading huggingface_hub-0.34.3-py3-none-any.whl.metadata (14 kB)
Downloading huggingface_hub-0.34.3-py3-none-any.whl (558 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m558.8/558.8 kB[0m [31m21.5 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: huggingface_hub
  Attempting uninstall: huggingface_hub
    Found existing installation: huggingface-hub 0.33.5
    Uninstalling huggingface-hub-0.33.5:
      Successfully uninstalled huggingface-hub-0.33.5
Successfully installed hug

To visualize our evaluation results, we will use the open-source library [MLflow](https://mlflow.org/). Specifically, we will use MLflow Tracking and its integration with NeMo Evaluator MS to ingest the evaluation results and visualize them.
<div></div>
MLflow Tracking is a component of the MLflow platform that enables data scientists to record and manage the parameters, metrics, and artifacts of machine learning experiments. With MLflow Tracking, users can log experiment metadata, such as hyperparameters, metrics, and model artifacts, to a centralized tracking server. This allows for easy comparison and reproduction of experiments, as well as the ability to visualize and analyze results using tools like plots, tables, and charts. 

### Set up MLflow for Experiment Tracking

[MLflow](https://mlflow.org/) is an open-source platform that helps manage the machine learning lifecycle, including experimentation, reproducibility, and deployment. We'll use it to:
- Track evaluation parameters and metrics
- Compare results across different evaluation approaches
- Visualize performance differences between zero-shot and few-shot modes

#### 1. Configure MLflow Server

First, we'll set up port forwarding to access the MLflow tracking server:


In [4]:
import subprocess

subprocess.Popen(
    [
        "kubectl",
        "-n",
        "mlflow",
        "port-forward",
        "--address",
        "0.0.0.0",
        "service/mlflow-tracking",
        "30090:80",
    ],
    stdout=subprocess.DEVNULL,
    stderr=subprocess.DEVNULL,
    close_fds=True,
)

<Popen: returncode: None args: ['kubectl', '-n', 'mlflow', 'port-forward', '...>

#### 2. Access MLflow UI

Run this cell to generate a clickable link to the MLflow interface:

In [6]:
%%js
// Use hostname (not host) so an existing port in the page URL is not duplicated.
var url = 'http://'+window.location.hostname+':30090';
element.innerHTML = '<a style="color:#76b900;" target="_blank" href='+url+'><h2>< Link To MLFLOW UI ></h2></a>';

<IPython.core.display.Javascript object>

#### 3. Set MLflow URI

Define the MLflow tracking URI for logging our evaluation results:

In [7]:
mlflow_uri = "http://localhost:30090"
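
Port forwarding runs in a background process, so the tracking server may not be reachable the instant the cell returns. As a minimal sketch, the hypothetical helper `wait_for_port` below polls a TCP port until it accepts a connection. The helper name, its defaults, and the idea of polling here are assumptions for illustration, not part of the workshop setup.

```python
import socket
from time import monotonic, sleep


def wait_for_port(host, port, timeout=10.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = monotonic() + timeout
    while monotonic() < deadline:
        try:
            # create_connection raises OSError if nothing is listening yet.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            sleep(interval)
    return False


# Example (assumes the port-forward above targets local port 30090):
# wait_for_port("localhost", 30090)
```

A `False` result would suggest checking that the `kubectl port-forward` process is still alive before logging results to `mlflow_uri`.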

#### What We'll Track

Our MLflow experiments will capture:
- Model configurations (zero-shot vs few-shot)
- Evaluation metrics (ROUGE, BLEU, F1)
- LLM-as-Judge scores
- Performance comparisons across different approaches

This structured tracking will help us:
1. Compare the effectiveness of in-context learning
2. Analyze which evaluation metrics best capture model improvements
3. Visualize performance patterns across different legal queries
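
To make the similarity metrics less abstract, here is a deliberately simplified sketch of a ROUGE-1 style F1 score: it counts overlapping lowercase, whitespace-separated words between a reference and a candidate. The function name `rouge1_f1` is hypothetical, and production ROUGE implementations typically apply their own tokenization and optional stemming, so scores from NeMo Evaluator may differ from this toy version.

```python
from collections import Counter


def rouge1_f1(reference, candidate):
    """Unigram-overlap F1 between two strings (simplified ROUGE-1)."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


print(rouge1_f1("the cat sat", "the cat ran"))
```

Because the score only rewards shared words, a title that paraphrases the reference well can still score low, which is one motivation for also using an LLM as a judge.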

### Set up NeMo Microservice API endpoints

<a id="init-vars"></a>

We use an ingress in the server deployment so that the services are easier to query from outside the Kubernetes cluster. This is a design choice, not a requirement, so you can decide whether to follow a similar approach in your own deployment. Let's list the ingress hosts that are deployed.

In [8]:
!kubectl get ingress --all-namespaces

NAMESPACE              NAME                           CLASS   HOSTS                        ADDRESS        PORTS   AGE
llama3-2-3b-instruct   llama3-2-3b-instruct-ingress   nginx   llama3-2-3b-instruct.local   192.168.49.2   80      55m
nemo-customizer        nemo-customizer-ingress        nginx   nemo-customizer.local        192.168.49.2   80      54m
nemo-datastore         nemo-datastore-ingress         nginx   nemo-datastore.local         192.168.49.2   80      57m
nemo-entity-store      nemo-entity-store-ingress      nginx   nemo-entity-store.local      192.168.49.2   80      56m
nemo-evaluator         nemo-evaluator-ingress         nginx   nemo-evaluator.local         192.168.49.2   80      55m


For the evaluation, we will use the Meta Llama-3.2-3B-Instruct model that we deployed locally. You can evaluate other SLMs by replacing the deployed NIM with a different [NIM](https://docs.nvidia.com/nim/large-language-models/latest/supported-models.html).

In [9]:
!curl llama3-2-3b-instruct.local/v1/models

{"object":"list","data":[{"id":"meta/llama-3.2-3b-instruct","object":"model","created":1754478484,"owned_by":"system","root":"meta/llama-3.2-3b-instruct","parent":null,"max_model_len":131072,"permission":[{"id":"modelperm-cefdc65836cd431eb9ca9193f37f1011","object":"model_permission","created":1754478484,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}

Set the following variables to the hostname of each microservice before continuing.

In [10]:
datastore_url = "http://nemo-datastore.local"
nim_url = "http://llama3-2-3b-instruct.local"
eval_url = "http://nemo-evaluator.local"
customizer_url = "http://nemo-customizer.local"
entitystore_url = "http://nemo-entity-store.local"

NIM_model_id = "meta/llama-3.2-3b-instruct"

We will also use hosted endpoints for our evaluation. To avoid requiring you to provide an API key in this workshop environment, we send requests through a proxy service, already running in the background, that injects a valid API key. We'll mock an API key so we can supply it to functions that expect one.

In [11]:
import os

# We mock the API key as a valid key will be provided by a proxy service we run for this workshop environment.
NGC_API_KEY = 'nvapi-mock-key'
BASE_URL = 'http://proxy/v1/chat/completions'
# Outside this workshop environment you would set the base URL as shown below.
# BASE_URL = "https://integrate.api.nvidia.com/v1/chat/completions"

We can verify the following services are running and accessible.

In [12]:
from pprint import pp

import requests
import urllib3

resp = requests.get(f"{eval_url}/health")
print(resp.status_code)
resp = requests.get(f"{datastore_url}/v1/health")
print(resp.status_code)
resp = requests.get(f"{entitystore_url}/v1/health/ready")
print(resp.status_code)
resp = requests.get(f"{customizer_url}/health/ready")
print(resp.status_code)
resp = requests.get(f"{nim_url}/v1/health/ready")
print(resp.status_code)

200
200
200
200
200
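
The five checks above repeat the same pattern, so they can be folded into a loop. The sketch below is one way to do that. `http_status` and `check_services` are hypothetical helper names, and the sketch uses the standard library's `urllib` rather than `requests` so that it has no extra dependencies.

```python
from urllib.error import HTTPError
from urllib.request import urlopen


def http_status(url, timeout=5):
    """Return the HTTP status code for url, or None if it is unreachable."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status
    except HTTPError as err:
        # The server answered, but with an error status such as 503.
        return err.code
    except OSError:
        # DNS failure, refused connection, timeout, and similar.
        return None


def check_services(endpoints, fetch=http_status):
    """Map each service name to True when its health URL returns 200."""
    return {name: fetch(url) == 200 for name, url in endpoints.items()}


# Example with this notebook's variables:
# check_services({
#     "evaluator": f"{eval_url}/health",
#     "datastore": f"{datastore_url}/v1/health",
#     "nim": f"{nim_url}/v1/health/ready",
# })
```

Passing `fetch` as a parameter keeps the loop easy to test without a live cluster.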


### Download and Format a Dataset

The [Law StackExchange dataset](https://huggingface.co/datasets/ymoslem/Law-StackExchange) is a collection of legal questions and answers from StackExchange, up to August 2023. Each record consists of a question, some context as well as human-provided answers.

We will use the Law StackExchange dataset to build an evaluation set that measures how well the NIM LLM (Llama-3.2-3B-Instruct) generates engaging titles for legal questions, with and without in-context learning. Download the dataset from Hugging Face and preprocess it to strip HTML tags, fix Unicode, and filter samples by score and character-length thresholds. The cells below perform these steps with BeautifulSoup and `ftfy`; [NeMo-Curator](https://github.com/NVIDIA/NeMo-Curator) is an option for curating data at larger scale.

In [15]:
import re
from time import sleep

import ftfy
from bs4 import BeautifulSoup
from datasets import Dataset, load_dataset

In [16]:
hf_ds = load_dataset("ymoslem/Law-StackExchange")
hf_ds = hf_ds["train"].remove_columns(
    ["link", "license", "question_id", "answers", "tags"]
)
print(hf_ds)

Dataset({
    features: ['question_title', 'score', 'question_body'],
    num_rows: 24370
})


Filter rows based on score and content length to curate a quality dataset.

In [17]:
filtered_ds = hf_ds.filter(lambda row: row["score"] > 10)
filtered_ds = filtered_ds.filter(
    lambda row: 50 <= len(row["question_body"]) <= 1000
)
print(filtered_ds)
pp(filtered_ds[0])

Filter: 100%|██████████| 24370/24370 [00:00<00:00, 344474.36 examples/s]
Filter: 100%|██████████| 1467/1467 [00:00<00:00, 117308.09 examples/s]

Dataset({
    features: ['question_title', 'score', 'question_body'],
    num_rows: 916
})
{'question_title': 'Why is drunk driving causing accident punished so much '
                   'worse than just drunk driving?',
 'score': 23,
 'question_body': '<p>When people drink and drive and then cause an accident '
                  'especially where if someone dies they get years and years '
                  'in prison but just the act of drunk driving is punished way '
                  "more lenient.  Shouldn't the 2, drunk driving and drunk "
                  'driving then causing accident be similarly punished?  I '
                  "feel like a lot of times it's luck whether an accident "
                  'happens.</p>\n'}
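
The two filters above can also be read as a single predicate. The sketch below restates them that way; `keep_row` and its parameter names are illustrative rather than part of the notebook. Note that `len` on a string counts characters, so the length thresholds are in characters, not words.

```python
def keep_row(row, min_score=10, min_chars=50, max_chars=1000):
    """Return True when a row passes the score and body-length thresholds."""
    body_length = len(row["question_body"])
    # The score test is strict (>), matching the filter used in the notebook.
    return row["score"] > min_score and min_chars <= body_length <= max_chars


sample = {"score": 23, "question_body": "x" * 200}
print(keep_row(sample))

# With a Hugging Face Dataset this could be applied as:
# filtered_ds = hf_ds.filter(keep_row)
```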





Clean the dataset: strip HTML tags, collapse repeated whitespace, and repair Unicode in both the question body and the title.

In [18]:
modified_ds = filtered_ds.map(
    lambda row: {
        "question_body": ftfy.fix_text(
            re.sub(
                r"\s+",
                " ",
                BeautifulSoup(row["question_body"], "html.parser").get_text(),
            ).strip()
        ),
        "question_title": ftfy.fix_text(
            re.sub(
                r"\s+",
                " ",
                BeautifulSoup(row["question_title"], "html.parser").get_text(),
            ).strip()
        ),
    }
)
print(modified_ds)
pp(modified_ds[0])

Map: 100%|██████████| 916/916 [00:00<00:00, 2588.89 examples/s]

Dataset({
    features: ['question_title', 'score', 'question_body'],
    num_rows: 916
})
{'question_title': 'Why is drunk driving causing accident punished so much '
                   'worse than just drunk driving?',
 'score': 23,
 'question_body': 'When people drink and drive and then cause an accident '
                  'especially where if someone dies they get years and years '
                  'in prison but just the act of drunk driving is punished way '
                  "more lenient. Shouldn't the 2, drunk driving and drunk "
                  'driving then causing accident be similarly punished? I feel '
                  "like a lot of times it's luck whether an accident happens."}





We will convert the dataset into JSONL format, where each line of the file holds one example with a prompt field and an ideal-response field.

```
{"prompt": "<input>", "ideal_response": "<output>"}
```

Because we are focusing on a summarization-style evaluation, we use `question_body` as the model input and `question_title` as the expected output. The next step renames the `question_body` and `question_title` columns to `prompt` and `completion`; the `completion` column holds the ideal response.
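
For readers new to JSONL, the sketch below writes and reads the one-object-per-line layout using only the standard library. `write_jsonl` and `read_jsonl` are hypothetical helpers, and the sample record is made up; the field names match the columns produced by the next cell.

```python
import io
import json


def write_jsonl(records, fh):
    """Write each record as one JSON object per line."""
    for rec in records:
        fh.write(json.dumps(rec, ensure_ascii=False) + "\n")


def read_jsonl(fh):
    """Parse a JSONL stream back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in fh if line.strip()]


records = [{"prompt": "Long question body ...", "completion": "Short title?"}]
buffer = io.StringIO()  # stands in for a file opened in text mode
write_jsonl(records, buffer)
buffer.seek(0)
print(read_jsonl(buffer))
```

The notebook itself relies on `Dataset.to_json`, which produces this layout directly; the helpers are only meant to show what ends up in the file.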

In [19]:
nemo_ds = (
    modified_ds.rename_column("question_body", "prompt")
    .rename_column("question_title", "completion")
    .remove_columns(["score"])
)
print(nemo_ds)
pp(nemo_ds[0])

Dataset({
    features: ['completion', 'prompt'],
    num_rows: 916
})
{'completion': 'Why is drunk driving causing accident punished so much worse '
               'than just drunk driving?',
 'prompt': 'When people drink and drive and then cause an accident especially '
           'where if someone dies they get years and years in prison but just '
           "the act of drunk driving is punished way more lenient. Shouldn't "
           'the 2, drunk driving and drunk driving then causing accident be '
           "similarly punished? I feel like a lot of times it's luck whether "
           'an accident happens.'}


#### Split the Dataset

Split the dataset into training, validation, and evaluation sets. We will use the training and validation sets in the next notebook with the NeMo Customizer MS; in this notebook, we focus on the evaluation set. We picked a rather small evaluation set (46 rows) to shorten the evaluation time during the lab, but you can change its size with the `test_size` parameter.

In [20]:
ds = nemo_ds.train_test_split(test_size=0.5, shuffle=False)
training_dataset = ds["train"]

ds = ds["test"].train_test_split(test_size=0.1, shuffle=False)
validation_dataset = ds["train"]
evaluation_dataset = ds["test"]

In [21]:
pp(evaluation_dataset)

Dataset({
    features: ['completion', 'prompt'],
    num_rows: 46
})
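
The 46-row figure follows from applying the two fractional splits in sequence to the 916 filtered rows. The sketch below reproduces the arithmetic, assuming scikit-learn-style rounding (round the test share up, the train share down), which is consistent with the row count shown above. `split_sizes` is a hypothetical helper, not a `datasets` API.

```python
from math import ceil, floor


def split_sizes(n_rows, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumes the test share is rounded up and the train share rounded down.
    """
    n_test = ceil(test_size * n_rows)
    n_train = floor((1 - test_size) * n_rows)
    return n_train, n_test


n_training, n_holdout = split_sizes(916, 0.5)          # first split
n_validation, n_evaluation = split_sizes(n_holdout, 0.1)  # second split
print(n_training, n_validation, n_evaluation)
```

Raising the second `test_size` enlarges the evaluation set at the expense of the validation set, since both are carved out of the same held-out half.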


#### For In-context Learning - we will pick a set of good demonstration pairs

In [22]:
import random

NUM_FEW_SHOTS = 10  # adjust as you wish
random.seed(42)

# pick k examples from the train split
few_shot_examples = random.sample(list(training_dataset), NUM_FEW_SHOTS)


def build_few_shot_context(examples):
    """
    Build a text block like:
        Here are some examples...
        Q: <body>
        Title: <title>
        <blank line>
        ...
    """
    lines = ["Here are some examples of legal questions with good titles:"]
    for ex in examples:
        lines.append(f"Q: {ex['prompt']}")
        lines.append(f"Title: {ex['completion']}")
        lines.append("")  # blank line between examples
    return "\n".join(lines)


FEW_SHOT_CONTEXT = build_few_shot_context(few_shot_examples)

FEW_SHOT_CONTEXT_SHORT = build_few_shot_context(few_shot_examples[:2])
print(FEW_SHOT_CONTEXT_SHORT)
print("-----\n")

Here are some examples of legal questions with good titles:
Q: Do police officers have to stop an interrogation when right to counsel has been invoked by a suspect or can they continue the questioning like nothing has happened?
Title: Must the interrogation stop when the right to counsel has been invoked?

Q: What alternatives exist for finding and appointing an executor for one's will/estate for a person with no close family or qualified friend? Are there pros and cons?
Title: What options are there for executor when no close family member is available?

-----



Now, save the preprocessed training and validation datasets to local files for uploading to NeMo Data Store.

In [23]:
training_file = "law-qa-train.jsonl"
validation_file = "law-qa-val.jsonl"
testing_file = "law-qa-test.jsonl"

training_dataset.to_json(
    training_file
)  # saves dataset to the specified "training_file" file path
validation_dataset.to_json(
    validation_file
)  # saves dataset to the specified "validation_file" file path

pp(evaluation_dataset[1])  # print a sample of the dataset to check validity

Creating json from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 342.36ba/s]
Creating json from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 468.17ba/s]

{'completion': 'Which jurisdiction applies to copyright violations on the '
               'internet?',
 'prompt': 'A person residing in country A takes a work by an artist in '
           'country B and puts it onto a website they own but which is hosted '
           'in country C which is intended for an audience of people in '
           'country D. The artist in country B did not give permission for '
           'this and wants to pursue legal actions. Which countries copyright '
           "laws apply to this case? Let's assume that A, B, C and D all "
           'signed and ratified the Berne Convention, but their '
           'implementations in local laws differ in ways which are relevant to '
           'this case.'}





We will add a category column to align with the expected format for the Evaluator MS.

In [24]:
inputs_dataset = evaluation_dataset.map(  # add a constant "category" column to match the format expected by the Evaluator MS
    lambda line: {"category": "summarization"}
)
pp(inputs_dataset[2])
inputs_dataset.to_json(
    testing_file
)  # saves dataset to the specified "testing_file" file path

Map: 100%|██████████| 46/46 [00:00<00:00, 12806.19 examples/s]


{'completion': 'Do I lose my rights as a British citizen when I travel to an '
               'other country for tourism?',
 'prompt': 'A friend of mine got detained at the airport in Jordan because his '
           'name matches a name of someone who has issues with the Jordanian '
           'authorities. My friend is British and he only was passing through '
           'Jordan. They forced him to stay there for 24 hours with no food '
           'and he had to sleep on the floor before they determined that he is '
           'not the man they were after. Does this incident mean that when you '
           'travel to a foreign country – even for a short time – that you '
           'give up your rights as a British citizen?',
 'category': 'summarization'}


Creating json from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 891.65ba/s]


25156

#### Create a Dataset in NeMo Data Store

Now that the files have been preprocessed, add the dataset to NeMo Data Store. First create a dataset then upload the files to NeMo Data Store.

In [25]:
from huggingface_hub import HfApi
from requests.exceptions import HTTPError

dataset_name = "legal_dataset_notebook_2"
repo_id = "default/" + dataset_name
repo_type = "dataset"

hf_api = HfApi(datastore_url + "/v1/hf")

try:
    hf_api.create_repo(repo_id, repo_type=repo_type)
except HTTPError:
    print(
        f"Since you've run the cell before, the repo has already been created in the NeMo Data Store"
    )

#### Upload Dataset Files

We will now upload the training, validation, and evaluation (testing) files to the Data Store. Only the testing file is used in this notebook; the training and validation files are for fine-tuning the model in the next notebook with the NeMo Customizer MS.

In [26]:
hf_api.upload_file(
    path_or_fileobj=training_file,
    path_in_repo="training/training.jsonl",
    repo_id=repo_id,
    repo_type="dataset",
)

hf_api.upload_file(
    path_or_fileobj=validation_file,
    path_in_repo="validation/validation.jsonl",
    repo_id=repo_id,
    repo_type="dataset",
)

hf_api.upload_file(
    path_or_fileobj=testing_file,
    path_in_repo="testing/testing.jsonl",
    repo_id=repo_id,
    repo_type="dataset",
)

law-qa-train.jsonl: 100%|██████████| 272k/272k [00:00<00:00, 50.7MB/s]
law-qa-val.jsonl: 100%|██████████| 237k/237k [00:00<00:00, 67.1MB/s]
law-qa-test.jsonl: 100%|██████████| 25.2k/25.2k [00:00<00:00, 9.14MB/s]


CommitInfo(commit_url='', commit_message='Upload testing/testing.jsonl with huggingface_hub', commit_description='', oid='351fada520f2f55576d65c1e916db609aad930f1', pr_url=None, repo_url=RepoUrl('', endpoint='https://huggingface.co', repo_type='model', repo_id=''), pr_revision=None, pr_num=None)

#### Few-Shot (ICL) custom dataset
  
To measure the true impact of *in-context learning*, we create a second dataset to compare against the zero-shot run:

1. Prepend the demonstration block stored in `FEW_SHOT_CONTEXT` to every prompt in the test set.  
2. Save this “ICL” version of the file and upload it to NeMo Data Store.  
3. Launch a second evaluation job that points to the new file, so the Evaluator MS calls the **same NIM** but with few-shot prompts.

By comparing the metrics of this run to the earlier zero-shot run we obtain a quantitative view of how much the demonstration examples help.

In [27]:
def with_few_shot(ex):
    return {
        # prepend the demonstration block + one blank line
        "prompt": f"{FEW_SHOT_CONTEXT}\nQ: {ex['prompt']}",
        "ideal_response": ex["completion"],
        "category": ex["category"],
    }


icl_dataset = inputs_dataset.map(with_few_shot)
icl_testing_file = "law-qa-test-icl.jsonl"
icl_dataset.to_json(icl_testing_file)

# Upload to NeMo Data Store
hf_api.upload_file(
    path_or_fileobj=icl_testing_file,
    path_in_repo="testing/testing_icl.jsonl",
    repo_id=repo_id,
    repo_type="dataset",
)

Map: 100%|██████████| 46/46 [00:00<00:00, 9152.22 examples/s]
Creating json from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 553.56ba/s]
law-qa-test-icl.jsonl: 100%|██████████| 236k/236k [00:00<00:00, 63.9MB/s]


CommitInfo(commit_url='', commit_message='Upload testing/testing_icl.jsonl with huggingface_hub', commit_description='', oid='e451a9292b7522f6631d834f4659ae03bfd9bdfe', pr_url=None, repo_url=RepoUrl('', endpoint='https://huggingface.co', repo_type='model', repo_id=''), pr_revision=None, pr_num=None)

Use the NeMo Data Store API to view the new dataset and verify the files have uploaded successfully.

In [28]:
datasets = hf_api.list_datasets()
pp([x for x in datasets if x.id == repo_id])

[DatasetInfo(id='default/legal_dataset_notebook_2',
             author=None,
             sha=None,
             created_at=datetime.datetime(2025, 8, 6, 11, 12, 16, tzinfo=datetime.timezone.utc),
             last_modified=datetime.datetime(2025, 8, 6, 11, 12, 35, tzinfo=datetime.timezone.utc),
             private=None,
             gated=None,
             disabled=None,
             downloads=None,
             downloads_all_time=None,
             likes=None,
             paperswithcode_id=None,
             tags=None,
             trending_score=None,
             card_data=None,
             siblings=None,
             xet_enabled=None)]


#### Register the Dataset

NeMo Entity Store is a microservice for managing platform-wide entities such as namespaces, models, and datasets within the NeMo microservices platform. To register the dataset in the entity-store, you need to define the `files_url` value and format it as `hf://datasets/{namespace}/{dataset name}`.

In [29]:
dataset_params = {
    "name": dataset_name,
    "namespace": "default",
    "description": "A dataset of legal issues and titles",
    "files_url": "hf://datasets/" + repo_id,
    "project": "my-project-id",
}

resp = requests.post(
    f"{entitystore_url}/v1/datasets", json=dataset_params, verify=False
)
es = resp.json()
pp(es)

{'created_at': '2025-08-06T11:12:43.315795',
 'updated_at': '2025-08-06T11:12:43.315797',
 'name': 'legal_dataset_notebook_2',
 'namespace': 'default',
 'description': 'A dataset of legal issues and titles',
 'format': None,
 'files_url': 'hf://datasets/default/legal_dataset_notebook_2',
 'hf_endpoint': None,
 'split': None,
 'limit': None,
 'id': 'dataset-HKg8o6JhEjyCPNgHshLQMa',
 'project': 'my-project-id',
 'custom_fields': {}}


### Inference with Locally Deployed NIM

Before running full evaluations, let's perform some quick sanity checks with our deployed NIM model. We'll:
1. Test basic model functionality
2. Compare zero-shot vs few-shot responses
3. Verify our evaluation setup

#### 1. Helper Functions for Inference

First, let's define some utility functions for making NIM API calls:

In [30]:
import json
import random
import textwrap
import time

import pandas as pd
import requests


def make_prompt(question: str, *, with_icl: bool) -> str:
    """
    Build the final prompt sent to NIM.
    Ensures the model returns ONE line: the title only.
    """
    system_rule = (
        "You are a headline generator for a legal Q&A forum. "
        "Return ONE concise, engaging title (max 15 words) that captures "
        "the core legal issue. "
        "Do NOT add quotes, labels, options, or extra text. "
        "Output exactly the title on a single line.\n"
    )

    if with_icl:
        # Few-shot: prepend demonstration block
        prompt = f"{system_rule}" f"{FEW_SHOT_CONTEXT}\n" f"Q: {question}\n" f"Title:"
    else:
        # Zero-shot
        prompt = f"{system_rule}" f"Q: {question}\n" f"Title:"
    return prompt


def query_nim(prompt: str, *, model_id=NIM_model_id, url=nim_url, max_tokens=75, T=0.2):
    """
    Call the NIM endpoint and return the text portion only.
    """
    payload = {
        "model": model_id,
        "prompt": prompt,
        "temperature": T,
        "nvext": {"top_k": 1, "top_p": 0.0},
        "max_tokens": max_tokens,
    }
    r = requests.post(f"{url}/v1/completions", json=payload, timeout=45)
    r.raise_for_status()
    text = r.json()["choices"][0]["text"]
    # Strip leading whitespace from the returned completion
    return text.lstrip()

Before running the evaluations, we want a basic sanity check of the two scenarios. We will build a prompt and run an inference call against the NIM model to get an initial feel for the quality of its responses, with and without ICL, on domain-specific knowledge.

#### 2. Quick Sanity Check

Let's test our model with a sample legal question in both zero-shot and few-shot modes:

1. **Zero-Shot** – the LLM only sees the question.  
2. **Few-Shot / ICL** – the LLM sees the question **plus** the demonstration
   examples (`FEW_SHOT_CONTEXT`, built from `NUM_FEW_SHOTS` = 10 examples) prepended to the prompt.

Run the code cell and compare the titles generated in the two scenarios.

In [31]:
# --- Inference demo: Zero-Shot vs. ICL ----------------------------------
demo = random.choice(evaluation_dataset)
question_text = demo["prompt"]
ground_truth = demo["completion"]

# Build prompts and query the model
zero_title = (
    query_nim(make_prompt(question_text, with_icl=False)).splitlines()[0].strip()
)
icl_title = (
    query_nim(make_prompt(question_text, with_icl=True)).splitlines()[0].strip()
)

# Display
print("\033[1mGround-Truth Title\033[0m")
print(ground_truth)

print("\n\033[1mModel-Generated Titles\033[0m")
print(f"• Zero-Shot : {zero_title}")
print(f"• ICL       : {icl_title}")

Ground-Truth Title
Do you have to obey English-only traffic signs in Toronto?

Model-Generated Titles
• Zero-Shot : Can I be held liable for ignoring bilingual traffic signs in predominantly English-speaking areas?
• ICL       : Are bilingual traffic signs in English-only cities in Canada legal and enforceable?


#### 3. Analysis

This quick test helps us:
- Verify the NIM endpoint is working correctly
- See the impact of in-context learning on a single example
- Ensure our prompt engineering is effective
- Identify any immediate issues before full evaluation

The qualitative differences between zero-shot and few-shot responses here will inform our expectations for the quantitative evaluation that follows.

When the evaluation finishes, we want its results ingested into the MLflow server. To do that, we define a function that downloads the results and ingests them.

In [32]:
import json
import os
import shutil
import subprocess
from time import sleep, time
import warnings
import zipfile

import requests
from IPython.display import clear_output

def download_and_process(eval_tag, eval_url, eval_id, mlflow_uri, polling_interval=10, timeout=6000):
    """
    Poll an evaluation job, download its results archive, and ingest into MLflow.
    Shows only essential status updates and progress.
    """
    start_time = time()

    # Monitor job status with progress updates
    while True:
        # Check for timeout
        if time() - start_time > timeout:
            raise RuntimeError(f"Evaluation took more than {timeout} seconds.")

        # Get job status
        job = requests.get(
            f"{eval_url}/v1/evaluation/jobs/{eval_id}", timeout=10
        ).json()
        status = job.get("status")
        if isinstance(status, dict):
            status = status.get("status")

        # Get progress if available
        status_details = job.get("status_details", {})
        progress = status_details.get("progress", 0)

        # Break if job is done
        if status not in {"initializing", "running", "created", "pending"}:
            break

        # Display minimal status
        clear_output(wait=True)
        elapsed = time() - start_time
        print(f"Status: {status} after {elapsed:.1f}s. Progress: {progress:.1f}%")
        sleep(polling_interval)

    # Print final status
    clear_output(wait=True)
    print("Status details:", status_details)

    # Download and process results
    url = f"{eval_url}/v1/evaluation/jobs/{eval_id}/results"
    local_zip = f"{eval_id}.zip"
    print(f"Downloading results from: {url}")

    resp = requests.get(
        url,
        headers={"accept": "application/json"},
        stream=True,
        timeout=60,
        verify=False,
    )
    resp.raise_for_status()

    # Create results directory structure
    extract_dir = f"{eval_id}_extracted"
    results_dir = os.path.join(extract_dir, "results")
    os.makedirs(results_dir, exist_ok=True)

    # Save results directly as results.json
    results_file = os.path.join(results_dir, "results.json")
    with open(results_file, "wb") as fh:
        for chunk in resp.iter_content(chunk_size=1 << 16):
            if chunk:
                fh.write(chunk)
    print(f"Saved results to: {results_file}")

    # MLflow ingestion
    experiment_name = f"{eval_tag}_{eval_id}"
    cmd = [
        "python3",
        "integrations/MLFlow/mlflow_eval_integration.py",
        "--results_abs_dir",
        os.path.abspath(results_dir),
        "--mlflow_uri",
        mlflow_uri,
        "--experiment_name",
        experiment_name,
    ]
    print("Running MLflow command:\n", " ".join(cmd))

    try:
        out = subprocess.run(cmd, check=True, capture_output=True, text=True)
        print("MLflow ingestion succeeded:\n", out.stdout)
    except subprocess.CalledProcessError as e:
        warnings.warn("MLflow ingestion failed:\n" + e.stderr)

    return results_dir

### Evaluation with NeMo Evaluator MS

The NeMo Evaluator provides several ways to assess model performance:

1. **Similarity Metrics** - Compare model outputs with ground truth using standard metrics
2. **LLM-as-a-Judge** - Use a larger LLM to evaluate outputs
3. Optional: **Academic Benchmarks** - Running evaluation using automatic benchmarks


Now we can evaluate and analyze the two modes with the [NVIDIA NeMo Evaluator](https://developer.nvidia.com/docs/nemo-microservices/evaluation/source/overview.html). 
It lets us launch, monitor, and track the results of evaluation jobs through a user-friendly API, giving us insight into the model's performance. 

Refer to [this](https://docs.nvidia.com/nemo/microservices/latest/api/evaluator.html) to learn more about the Evaluator API.


#### Evaluation using Similarity Metrics

The NeMo Evaluator similarity-metrics evaluation scores a model on a custom dataset by comparing its generated responses to ground-truth responses. In this notebook, we will evaluate the two modes (zero-shot and ICL) for the title-generation use case using Law Exchange's `test.jsonl` dataset. 

Custom evaluation gives results for the following metrics:

1. Accuracy :  

    Measures the proportion of correct responses generated by the model, calculated as the ratio of the number of correct predictions to the total number of predictions. 
    
    Best suited for tasks with a clear, discrete set of correct answers (e.g., classification or multiple-choice summarization tasks). Not effective for open-ended or creative outputs.

1. Rouge-N : 

    Measures the overlap of n-grams (like bigrams, trigrams) between the generated text and reference texts. 

    Relevant for extractive summarization, where word overlap is a good indicator of content fidelity.

1. BLEU (Bilingual Evaluation Understudy) : 

    Measures how closely the generated text resembles the ground truth by evaluating the overlap of n-grams, focusing on precision.

    Suitable for machine translation or summarization tasks where linguistic structure alignment is more critical than semantic equivalence.

1. EM (Exact Match) : 

    Measures the percentage of generated responses that exactly match the reference answers. 

    Ideal for tasks requiring verbatim reproduction or precise responses, such as classification and multiple choice question-answer use case.

1. F1 : 
    
    Combines precision (the proportion of correctly predicted words) and recall (the proportion of relevant ground truth words identified) into a harmonic mean. 
    
    Suitable for tasks where partial correctness matters, such as abstractive summarization with some degree of variability in outputs.

1. BERTScore : 
    
    BERTScore uses a pre-trained BERT model to measure semantic similarity between generated and reference texts, comparing embeddings at the word and sentence levels. 
    
    Ideal for abstractive summarization or generative tasks, where semantic alignment is more important than word-for-word overlap.
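The definitions above can be sketched in a few lines of plain Python. This is a simplified illustration, not the NeMo Evaluator implementation: it assumes lowercase whitespace tokenization, leaves out BLEU's brevity penalty and smoothing, and all function names are hypothetical.

```python
from collections import Counter


def tokens(text):
    """Lowercase whitespace tokenization; real scorers may normalize differently."""
    return text.lower().split()


def exact_match(prediction, reference):
    """1.0 if the two strings are identical after trimming, else 0.0."""
    return float(prediction.strip() == reference.strip())


def token_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall."""
    pred, ref = tokens(prediction), tokens(reference)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def ngrams(sequence, n):
    """Count the n-grams of a token sequence."""
    return Counter(tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1))


def rouge_n_recall(prediction, reference, n=2):
    """Share of reference n-grams that also appear in the prediction."""
    ref_ngrams = ngrams(tokens(reference), n)
    if not ref_ngrams:
        return 0.0
    overlap = sum((ngrams(tokens(prediction), n) & ref_ngrams).values())
    return overlap / sum(ref_ngrams.values())


def clipped_precision(prediction, reference, n=1):
    """Share of prediction n-grams found in the reference (BLEU's building block)."""
    pred_ngrams = ngrams(tokens(prediction), n)
    if not pred_ngrams:
        return 0.0
    overlap = sum((pred_ngrams & ngrams(tokens(reference), n)).values())
    return overlap / sum(pred_ngrams.values())


# Made-up title pair that differs in a single word
reference = "is a verbal contract legally binding"
prediction = "is a verbal agreement legally binding"

print("EM       :", exact_match(prediction, reference))
print("F1       :", round(token_f1(prediction, reference), 3))
print("ROUGE-2 R:", round(rouge_n_recall(prediction, reference, n=2), 3))
print("1-gram P :", round(clipped_precision(prediction, reference, n=1), 3))
```

A single changed word drops exact match to zero while the overlap-based scores stay high, which is why exact match is a poor fit for open-ended title generation.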

These metrics are best suited to use cases where model outputs are expected to closely align with a predefined ideal response. For our title-generation use case, we will score the model on BLEU and F1.

Refer to [this link](https://docs.nvidia.com/nemo/microservices/latest/evaluate/evaluation-types.html#similarity-metrics) for more details on using the NeMo Evaluator on a custom dataset. 

#### Evaluation of zero-shot mode

We can initialize our evaluation config, where we can choose to run inference on an `input_file` through NeMo Evaluator, and even point to a generated `output_file` containing the predictions. The `scorers` selected during eval launch are used to generate evaluation metrics. In our evaluation we will use online evaluation against the deployed NIM.

Since both the Evaluator MS and the NIM are part of the same K8s cluster, we can't use the ingress to connect the two microservices. Instead, we will use the cluster IP of the NIM service to reach it, as follows:

In [38]:
import json
import subprocess


def get_service_ip(service_name, namespace="default"):
    # Ask kubectl for the service definition as JSON. Passing the arguments
    # as a list avoids shell interpolation of the service and namespace names.
    cmd = ["kubectl", "get", "svc", service_name, "-n", namespace, "-o", "json"]
    result = subprocess.check_output(cmd)
    svc_json = json.loads(result)
    return svc_json["spec"]["clusterIP"]


NIM_IP = get_service_ip("meta-llama3-2-3b-instruct", "llama3-2-3b-instruct")
nim_k8s_url = f"http://{NIM_IP}:8000/v1/completions"
print(nim_k8s_url)

http://10.107.247.183:8000/v1/completions


An evaluation job needs two things: an evaluation target, which represents the LLM endpoint to be evaluated, and an evaluation configuration, which defines the type of evaluation to run and all of its parameters. We start by creating the evaluation target with the NIM URL. This returns a unique ID for the target; the evaluation jobs below refer to the target by its `namespace/name`.

In [39]:
target_body = {
    "type": "model",
    "name": "NIM_llama_32_3b",
    "namespace": "default",
    "model": {"api_endpoint": {"url": nim_k8s_url, "model_id": NIM_model_id}},
}

resp = requests.post(
    f"{eval_url}/v1/evaluation/targets", json=target_body, verify=False
)
NIM_model_target = resp.json()["id"]
print(f"NIM evaluation target ID is {NIM_model_target}")

KeyError: 'id'
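A `KeyError: 'id'` like the one above means the response body had no `id` field, which usually indicates that the request was rejected. The exact cause (for example, a target with the same name already existing) is an assumption here and should be confirmed from the response body. The helper below is a hypothetical sketch, not part of the Evaluator API, that surfaces the server's reply instead of a bare `KeyError`.

```python
def extract_id(payload, what="resource"):
    """Return payload['id'], or raise an error that shows the server's reply."""
    if isinstance(payload, dict) and "id" in payload:
        return payload["id"]
    raise RuntimeError(f"Could not create {what}; server replied: {payload!r}")


# Hypothetical payloads, made up to exercise both branches
print(extract_id({"id": "eval-target-123", "name": "demo"}, what="evaluation target"))
try:
    extract_id({"detail": "already exists"}, what="evaluation target")
except RuntimeError as err:
    print(err)
```

In the cell above, this could be used as `NIM_model_target = extract_id(resp.json(), "evaluation target")`.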

Next, we set up the evaluation configuration. Pay attention to `files_url`: it must be identical to the path of the file you uploaded to the Data Store microservice earlier in this notebook.

In [40]:
config_body = {
    "type": "similarity_metrics",
    "name": "nim_custom_similarity",
    "namespace": "default",
    "params": {"max_tokens": 50, "temperature": 0.7, "extra": {"top_k": 20}},
    "tasks": {
        "custom_similarity": {
            "type": "default",
            "dataset": {
                "files_url": f"hf://datasets/default/{dataset_name}/testing/testing.jsonl"
            },
            "metrics": {"bleu": {"type": "bleu"}, "f1": {"type": "f1"}},
        }
    },
}
resp = requests.post(
    f"{eval_url}/v1/evaluation/configs", json=config_body, verify=False
)
similarity_config_id = resp.json()["id"]
print(f"Evaluation configuration ID is {similarity_config_id}")

Evaluation configuration ID is eval-config-PwWtpVrjqGJ7hVPN6J7p1D
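Because `files_url` has to match the Data Store path exactly, one option is to build it from its parts in a single place and reuse the result wherever the path is needed. A minimal sketch, assuming the `hf://datasets/<namespace>/<dataset>/<path>` layout seen in the config above; `make_files_url` is a hypothetical helper and `my-legal-dataset` is a placeholder name.

```python
def make_files_url(namespace, dataset_name, path_in_repo):
    """Build an hf:// URL in the layout used by the configs in this notebook."""
    path = path_in_repo.lstrip("/")
    return f"hf://datasets/{namespace}/{dataset_name}/{path}"


# "my-legal-dataset" is a placeholder, not the notebook's dataset name
url = make_files_url("default", "my-legal-dataset", "testing/testing.jsonl")
print(url)
```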


Next, we launch the evaluation job, referencing the evaluation target and the evaluation configuration by their `namespace/name`.

In [41]:
eval_tag = "Llama-3.2-3B_law_data_zero_shot"
eval_body = {
    "namespace": "default",
    "config": "default/nim_custom_similarity",
    "target": "default/NIM_llama_32_3b",
}
resp = requests.post(f"{eval_url}/v1/evaluation/jobs", json=eval_body, verify=False)
eval_id = resp.json()["id"]
print(f"Evaluation job ID is {eval_id}")

Evaluation job ID is eval-5Jq1t2qrEGib5hFek9W8Pd


Now we can poll the status until the evaluations finish. The evaluation can take a few minutes to an hour depending on the evaluation parameters, model performance, and size of the dataset.

In [42]:
download_and_process(eval_tag, eval_url, eval_id, mlflow_uri)

Status details: {'message': 'Job completed successfully', 'task_status': {'custom_similarity': 'completed'}, 'progress': 100.0}
Downloading results from: http://nemo-evaluator.local/v1/evaluation/jobs/eval-5Jq1t2qrEGib5hFek9W8Pd/results
Saved results to: eval-5Jq1t2qrEGib5hFek9W8Pd_extracted/results/results.json
Running MLflow command:
 python3 integrations/MLFlow/mlflow_eval_integration.py --results_abs_dir /dli/task/eval-5Jq1t2qrEGib5hFek9W8Pd_extracted/results --mlflow_uri http://localhost:30090 --experiment_name Llama-3.2-3B_law_data_zero_shot_eval-5Jq1t2qrEGib5hFek9W8Pd
MLflow ingestion succeeded:
 


'eval-5Jq1t2qrEGib5hFek9W8Pd_extracted/results'

#### Evaluation of few-shot (ICL) mode

We configure a similar evaluation job for the NIM in few-shot (ICL) mode.

In [43]:
config_body = {
    "type": "similarity_metrics",
    "name": "nim_custom_similarity_icl",
    "namespace": "default",
    "params": {"max_tokens": 50, "temperature": 0.7, "extra": {"top_k": 20}},
    "tasks": {
        "custom_similarity_icl": {
            "type": "default",
            "dataset": {
                "files_url": f"hf://datasets/default/{dataset_name}/testing/testing_icl.jsonl"
            },
            "metrics": {"bleu": {"type": "bleu"}, "f1": {"type": "f1"}},
        }
    },
}
resp = requests.post(
    f"{eval_url}/v1/evaluation/configs", json=config_body, verify=False
)
similarity_icl_config_id = resp.json()["id"]
print(f"Evaluation configuration ID is {similarity_icl_config_id}")

Evaluation configuration ID is eval-config-22Uw3K4XopUCuZh4y99vZh


We can compare this performance to the zero-shot mode by launching a new evaluation job with the same evaluation target and the updated evaluation configuration that points to the ICL dataset.

In [44]:
eval_tag = "Llama-3.2-3B_law_data_icl"
eval_body = {
    "namespace": "default",
    "config": "default/nim_custom_similarity_icl",
    "target": "default/NIM_llama_32_3b",
}
resp = requests.post(f"{eval_url}/v1/evaluation/jobs", json=eval_body, verify=False)
eval_id = resp.json()["id"]
print(f"Evaluation job ID is {eval_id}")

Evaluation job ID is eval-J2HR21SdePudW7QUJEiNcm


We can now monitor the progress of the evaluation.

In [45]:
download_and_process(eval_tag, eval_url, eval_id, mlflow_uri)

Status details: {'message': 'Job completed successfully', 'task_status': {'custom_similarity_icl': 'completed'}, 'progress': 100.0}
Downloading results from: http://nemo-evaluator.local/v1/evaluation/jobs/eval-J2HR21SdePudW7QUJEiNcm/results
Saved results to: eval-J2HR21SdePudW7QUJEiNcm_extracted/results/results.json
Running MLflow command:
 python3 integrations/MLFlow/mlflow_eval_integration.py --results_abs_dir /dli/task/eval-J2HR21SdePudW7QUJEiNcm_extracted/results --mlflow_uri http://localhost:30090 --experiment_name Llama-3.2-3B_law_data_icl_eval-J2HR21SdePudW7QUJEiNcm
MLflow ingestion succeeded:
 


'eval-J2HR21SdePudW7QUJEiNcm_extracted/results'

Now that you have the results for both evaluations, do you see any improvement from ICL?
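One way to answer that question is to put the scores from the two runs side by side. The sketch below assumes each `results.json` has a `tasks -> <task> -> metrics -> <metric> -> scores -> <score> -> value` layout; that layout, the score names, and every number are placeholders for illustration, not real evaluation output.

```python
def flatten_scores(results):
    """Map 'metric/score' -> value for every score in a results payload."""
    flat = {}
    for task in results.get("tasks", {}).values():
        for metric_name, metric in task.get("metrics", {}).items():
            for score_name, score in metric.get("scores", {}).items():
                flat[f"{metric_name}/{score_name}"] = score.get("value")
    return flat


def compare_runs(baseline, candidate):
    """Return {score: (baseline, candidate, delta)} for scores present in both."""
    base, cand = flatten_scores(baseline), flatten_scores(candidate)
    return {
        key: (base[key], cand[key], cand[key] - base[key])
        for key in sorted(base.keys() & cand.keys())
    }


# Placeholder numbers, NOT real evaluation results
zero_shot = {"tasks": {"custom_similarity": {"metrics": {
    "bleu": {"scores": {"corpus": {"value": 10.0}}},
    "f1": {"scores": {"f1_score": {"value": 0.25}}},
}}}}
icl = {"tasks": {"custom_similarity_icl": {"metrics": {
    "bleu": {"scores": {"corpus": {"value": 12.5}}},
    "f1": {"scores": {"f1_score": {"value": 0.5}}},
}}}}

for name, (before, after, delta) in compare_runs(zero_shot, icl).items():
    print(f"{name:<15} zero-shot={before:<6} icl={after:<6} delta={delta:+.2f}")
```

To use it with the real runs, load each payload with `json.load` from the `results.json` inside the directory returned by `download_and_process`. Capture that return value for each run, since `eval_id` is overwritten by the second job.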

### Evaluations with LLM-as-a-Judge

With custom LLM-as-a-Judge, one LLM is evaluated by using another LLM as a judge. NeMo Evaluator provides a flexible way to configure and customize LLM-as-a-Judge for your evaluation needs: (1) you can bring your own custom datasets, and (2) any NIM model or OpenAI API endpoint can be used as the judge model. 

LLM-as-a-Judge can be used for any use case, even highly generative ones, but the choice of judge is crucial in getting reliable evaluations. The judge model should have enough domain knowledge of the use case, compared to the model being evaluated, to be an effective judge. Generally, an LLM regarded as a high quality model should be used as the judge. 

Refer to the NeMo Evaluator [docs](https://docs.nvidia.com/nemo/microservices/latest/evaluate/evaluation-custom.html#evaluation-with-llm-as-a-judge) to learn more about LLM-as-a-Judge evaluation.

We will go through the following steps to use LLM-as-a-Judge:
1. Formatting data for evaluation
1. Uploading a custom dataset
1. Setting up an API endpoint for the judge LLM
1. Submitting the evaluation job

#### Formatting data for evaluation

In [46]:
from datasets import Dataset

merged_rows = []
for index, row in enumerate(inputs_dataset):
    merged_row = {
        "question_id": f"summary_example_{index}",
        "category": row.get("category", ""),
        "question": row["prompt"],  # The full consult or question text
        "ground_truth": row[
            "completion"
        ],  # The reference/ground-truth summary or title
    }
    merged_rows.append(merged_row)

# Limit data if needed
limit_data = 10
merged_dataset = Dataset.from_list(merged_rows[:limit_data])

# Save to a single JSONL file
merged_file = "custom_dataset/eval_data.jsonl"
merged_dataset.to_json(merged_file)

# Optional: print a sample row
from pprint import pprint

pprint(merged_dataset[0])

Creating json from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 1231.08ba/s]

{'category': 'summarization',
 'ground_truth': "Has a verdict of 'not proven' ever had a different effect to "
                 "one of 'not guilty'?",
 'question': 'In Scotland there are three verdicts available in criminal '
             'trials: guilty, not guilty, and not proven. In modern use there '
             'is no practical difference between not guilty and not proven: '
             'the defendant is acquitted in both cases. Has there ever been a '
             'case where a verdict of not proven ended up having a different '
             'practical effect on the defendant than if he had been found not '
             'guilty?',
 'question_id': 'summary_example_0'}





#### Uploading a Custom Dataset for Evaluation

In [47]:
repo_name = "default/test-llm_as_a_judge"
repo_type = "dataset"
local_file = "./custom_dataset/eval_data.jsonl"
path_in_repo = "eval_data.jsonl"  # This will be the filename in the repo

# Create the repo if it doesn't exist
hf_api.create_repo(
    repo_id=repo_name,
    repo_type=repo_type,
    exist_ok=True,  # Don't error if it already exists
)

# Upload the single file
result = hf_api.upload_file(
    path_or_fileobj=local_file,
    path_in_repo=path_in_repo,
    repo_id=repo_name,
    repo_type=repo_type,
)

print(f"File uploaded to: {result}")

eval_data.jsonl: 100%|██████████| 6.58k/6.58k [00:00<00:00, 2.33MB/s]


File uploaded to: http://nemo-datastore.local/v1/hf/datasets/default/test-llm_as_a_judge/blob/main/eval_data.jsonl


#### Use a hosted endpoint as the judge LLM

To simplify the evaluation process, we will use a large LLM hosted on [build.nvidia.com](https://build.nvidia.com) as the judge LLM.

Let's first test whether the hosted LLM endpoint is ready and available for evaluation.

In [48]:
import json

import requests

# URL and headers as provided in the shell script
judge_endpoint_url = BASE_URL
judge_model = "nvidia/llama-3.3-nemotron-super-49b-v1"
judge_api_token = NGC_API_KEY

headers = {
    "Authorization": f"Bearer {judge_api_token}",
    "Accept": "application/json",
    "Content-Type": "application/json",
}

# Data payload as a Python dictionary
data = {
    "messages": [
        {
            "role": "user",
            "content": "Write a limerick about the wonders of GPU computing.",
        }
    ],
    "model": judge_model,
    "top_p": 0.7,
    "max_tokens": 1024,
    "seed": 42,
    "stream": False,
    "presence_penalty": 0,
    "frequency_penalty": 0,
    "temperature": 0.2,
}

# Making the POST request
response = requests.post(judge_endpoint_url, headers=headers, json=data)

# Printing the response
print(response.json()["choices"][0]["message"]["content"])

Here is a limerick about the wonders of GPU computing:

There once was a GPU so fine,
Whose parallel processing did shine.
It computed with great zest,
Through matrices at rest,
And accelerated tasks in no time.


#### Submitting evaluation job using LLM-as-a-judge

Now that the judge LLM endpoint is available and the evaluation data is uploaded to the NeMo Data Store, we will use the NeMo Evaluator API to submit an evaluation job that uses the judge LLM.
Since our endpoint is already defined, we only need to define the evaluation config, as follows.

In [49]:
import requests

# In order to use the workshop environment's proxy service (which is providing a valid API key for you) inside k8s pods, we need to use the proxy service's FQDN
# which we set here. In your own environment, when providing your own API and not using the workshop's proxy service, you would set:
# judge_endpoint_url = 'https://integrate.api.nvidia.com/v1/chat/completions'
judge_endpoint_url = 'http://proxy.default.svc.cluster.local/v1/chat/completions'

DATASET_URL = (
    "hf://datasets/default/test-llm_as_a_judge/"  # or your actual dataset path
)

judge_eval_config = {
    "type": "custom",
    "name": "custom_llm_as_a_judge",
    "tasks": {
        "law_summary": {
            "type": "chat-completion",
            "params": {
                "template": {
                    "messages": [
                        {
                            "role": "system",
                            "content": (
                                "You are a headline generator for a legal Q&A forum. "
                                "Return ONE concise, engaging title (max 15 words) that captures "
                                "the core legal issue. "
                                "Do NOT add quotes, labels, options, or extra text. "
                                "Output exactly the title on a single line.\n"
                            ),
                        },
                        {
                            "role": "user",
                            "content": ("Question: {{ item.question }}\n Title:"),
                        },
                    ],
                    "max_tokens": 200,
                    "temperature": 0.0001,
                }
            },
            "dataset": {
                "files_url": "hf://datasets/default/test-llm_as_a_judge/eval_data.jsonl"
            },
            "metrics": {
                "accuracy": {
                    "type": "llm-judge",
                    "params": {
                        "model": {
                            "api_endpoint": {
                                "url": judge_endpoint_url,
                                "model_id": judge_model,
                                "api_key": judge_api_token,
                            },
                            "temperature": 0.0001,
                            "limit_samples": 5,
                            "parallelism": 2,
                        },
                        "template": {
                            "messages": [
                                {
                                    "role": "system",
                                    "content": (
                                        "You are a critic with expertise in judging summaries of articles. "
                                        "You will be provided a source document and two summaries (Summary 1 and Summary 2).\n\n"
                                        "Compare the two summaries and give a rating on a scale from 1 to 5:\n"
                                        "- A higher rating (towards 5) indicates that Summary 1 is better than Summary 2.\n"
                                        "- A lower rating (towards 1) indicates Summary 2 is better than Summary 1.\n"
                                        "- A score of 3 would mean that both the summaries are of similar quality in all dimensions.\n"
                                        "Please respond with RATING: <number>"
                                    ),
                                },
                                {
                                    "role": "user",
                                    "content": (
                                        "Source Document: {{ item.question }}\n"
                                        "Summary 1: {{ item.ground_truth }}\n"
                                        "Summary 2: {{ sample.output_text }}\n"
                                    ),
                                },
                            ]
                        },
                        "scores": {
                            "accuracy": {
                                "type": "int",
                                "parser": {
                                    "type": "regex",
                                    "pattern": r"RATING:\s*(\d+)",
                                },
                            }
                        },
                    },
                },
                "correctness": {
                    "type": "llm-judge",
                    "params": {
                        "model": {
                            "api_endpoint": {
                                "url": judge_endpoint_url,
                                "model_id": judge_model,
                                "api_key": judge_api_token,
                            }
                        },
                        "template": {
                            "messages": [
                                {
                                    "role": "system",
                                    "content": (
                                        "You are a judge. Rate the summary's correctness "
                                        "(no false info) on a scale 1-5:\n"
                                        "1 = many inaccuracies … 5 = completely accurate\n"
                                        "Please respond with RATING: <number>"
                                    ),
                                },
                                {
                                    "role": "user",
                                    "content": (
                                        "Full Consult: {{ item.question }}\n\n"
                                        "Summary: {{ sample.output_text }}"
                                    ),
                                },
                            ]
                        },
                        "scores": {
                            "correctness": {
                                "type": "int",
                                "parser": {
                                    "type": "regex",
                                    "pattern": r"RATING:\s*(\d+)",
                                },
                            },
                        },
                    },
                },
                "conciseness": {
                    "type": "llm-judge",
                    "params": {
                        "model": {
                            "api_endpoint": {
                                "url": judge_endpoint_url,
                                "model_id": judge_model,
                                "api_key": judge_api_token,
                            }
                        },
                        "template": {
                            "messages": [
                                {
                                    "role": "system",
                                    "content": (
                                        "You are a judge. Rate the summary's conciseness "
                                        "(no unimportant info) on a scale 1-5:\n"
                                        "1 = overly verbose … 5 = perfectly concise\n"
                                        "Please respond with RATING: <number>"
                                    ),
                                },
                                {
                                    "role": "user",
                                    "content": (
                                        "Full Consult: {{ item.question }}\n\n"
                                        "Summary: {{ sample.output_text }}"
                                    ),
                                },
                            ]
                        },
                        "scores": {
                            "conciseness": {
                                "type": "int",
                                "parser": {
                                    "type": "regex",
                                    "pattern": r"RATING:\s*(\d+)",
                                },
                            },
                        },
                    },
                },
                "readability": {
                    "type": "llm-judge",
                    "params": {
                        "model": {
                            "api_endpoint": {
                                "url": judge_endpoint_url,
                                "model_id": judge_model,
                                "api_key": judge_api_token,
                            }
                        },
                        "template": {
                            "messages": [
                                {
                                    "role": "system",
                                    "content": (
                                        "You are a judge. Rate the summary's readability "
                                        "(grammar & clarity) on a scale 1-5:\n"
                                        "1 = very hard to read … 5 = crystal clear\n"
                                        "Please respond with RATING: <number>"
                                    ),
                                },
                                {
                                    "role": "user",
                                    "content": (
                                        "Question: {{ item.question }}\n\n"
                                        "Summary: {{ sample.output_text }}"
                                    ),
                                },
                            ]
                        },
                        "scores": {
                            "readability": {
                                "type": "int",
                                "parser": {
                                    "type": "regex",
                                    "pattern": r"RATING:\s*(\d+)",
                                },
                            },
                        },
                    },
                },
            },
        }
    },
}

# Submit the config to the NeMo Evaluator
resp = requests.post(
    f"{eval_url}/v1/evaluation/configs", json=judge_eval_config, verify=False
)
judge_llm_config_id = resp.json()["id"]
print("Evaluation config ID:", judge_llm_config_id)

Evaluation config ID: eval-config-6NsrdNmgU39mHvR3WxoBn1
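Before launching a job, it can help to check that every `{{ item.<field> }}` placeholder in the templates names a field that actually exists in the uploaded dataset rows. The sketch below assumes the Jinja-style `{{ item.field }}` syntax used in the config above; the helper names, toy config, and toy row are hypothetical.

```python
import re

# Matches placeholders such as "{{ item.question }}" and captures the field name
ITEM_FIELD = re.compile(r"\{\{\s*item\.(\w+)\s*\}\}")


def collect_strings(node):
    """Yield every string found anywhere inside nested dicts and lists."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from collect_strings(value)
    elif isinstance(node, (list, tuple)):
        for value in node:
            yield from collect_strings(value)


def missing_item_fields(config, row):
    """Return the sorted item.* fields used in config but absent from row."""
    used = set()
    for text in collect_strings(config):
        used.update(ITEM_FIELD.findall(text))
    return sorted(used - set(row))


# Toy config and row, made up for this sketch
toy_config = {"messages": [
    {"role": "user", "content": "Question: {{ item.question }}\nTitle:"},
    {"role": "user", "content": "Reference: {{ item.reference }}"},
]}
toy_row = {"question": "Is a verbal contract binding?", "ground_truth": "..."}

print(missing_item_fields(toy_config, toy_row))  # ['reference']
```

In this notebook the equivalent call would be `missing_item_fields(judge_eval_config, merged_dataset[0])`; an empty list means every referenced field is present in the row.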


Now that we have our evaluation configuration defined, we can launch the evaluation job.

In [50]:
NIM_model_target = "NIM_llama_32_3b"  # or whatever your model target is

eval_tag = "llm_as_a_judge_zero_shot"
eval_body = {
    "target": f"default/{NIM_model_target}",
    "config": "default/custom_llm_as_a_judge",
    "tags": [eval_tag],
}

resp = requests.post(
    f"{eval_url}/v1/evaluation/jobs",
    json=eval_body,
    headers={"accept": "application/json"},
    verify=False,
)
eval_id = resp.json()["id"]
print("Evaluation job ID:", eval_id)

Evaluation job ID: eval-HeqDgD4UMJw5gJGVLo3A4X


We can check the status of the evaluation job by polling the status of the specific job or by looking at the entire list of launched jobs.

In [51]:
download_and_process(eval_tag, eval_url, eval_id, mlflow_uri)

Status details: {'message': 'Job completed successfully.', 'task_status': {'law_summary': 'completed'}, 'progress': 100.0}
Downloading results from: http://nemo-evaluator.local/v1/evaluation/jobs/eval-HeqDgD4UMJw5gJGVLo3A4X/results
Saved results to: eval-HeqDgD4UMJw5gJGVLo3A4X_extracted/results/results.json
Running MLflow command:
 python3 integrations/MLFlow/mlflow_eval_integration.py --results_abs_dir /dli/task/eval-HeqDgD4UMJw5gJGVLo3A4X_extracted/results --mlflow_uri http://localhost:30090 --experiment_name llm_as_a_judge_zero_shot_eval-HeqDgD4UMJw5gJGVLo3A4X
MLflow ingestion succeeded:
 


'eval-HeqDgD4UMJw5gJGVLo3A4X_extracted/results'

Let's look at the results! Remember that the `accuracy` prompt asks the judge to compare Summary 1 (the ground truth) with Summary 2 (the model output) on a scale of 1-5.

More specifically, the scores between 1 and 5 follow this rationale:
- A higher rating (towards 5) indicates that Summary 1 is better than Summary 2.
- A lower rating (towards 1) indicates Summary 2 is better than Summary 1.
- A score of 3 would mean that both the summaries are of similar quality in all dimensions.
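Each judge reply is reduced to a number by the regex parser declared in the config, `RATING:\s*(\d+)`. The sketch below shows what that pattern captures. How the Evaluator applies the pattern internally and how it treats a reply with no match are not stated in this notebook, so `re.search` and the `None` fallback here are assumptions, and the replies are made up.

```python
import re

# Same pattern as the "parser" entries in the judge config
RATING_PATTERN = re.compile(r"RATING:\s*(\d+)")


def parse_rating(judge_reply):
    """Return the first integer after 'RATING:', or None if there is none."""
    match = RATING_PATTERN.search(judge_reply)
    return int(match.group(1)) if match else None


# Made-up judge replies
print(parse_rating("Summary 2 is more precise. RATING: 2"))  # 2
print(parse_rating("RATING:4"))                              # 4
print(parse_rating("Both summaries are fine."))              # None
```

This is why each judge prompt ends with `Please respond with RATING: <number>`: a reply that omits the marker gives the parser nothing to extract.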


In [52]:
import pandas as pd

def parse_eval_results(output):
    # Navigate to the metrics dictionary
    metrics = output["law_summary"]["metrics"]
    rows = []
    for metric_name, metric_data in metrics.items():
        # Each metric has a 'scores' dict, which may have a sub-metric (e.g., 'correctness', 'completeness', etc.)
        for sub_metric, sub_metric_data in metric_data["scores"].items():
            value = sub_metric_data.get("value")
            stats = sub_metric_data.get("stats", {})
            row = {
                "metric": metric_name,
                "value": value,
                "count": stats.get("count"),
                "sum": stats.get("sum"),
                "mean": stats.get("mean"),
            }
            rows.append(row)
    # Create DataFrame
    df = pd.DataFrame(rows)
    return df

In [53]:
url = f"{eval_url}/v1/evaluation/jobs/{eval_id}/results"

resp = requests.get(
    url,
    headers={"accept": "application/json"},
    stream=True,
    timeout=60,
    verify=False,
)

metrics = resp.json()["tasks"]

print(parse_eval_results(metrics))

        metric  value  count   sum  mean
0     accuracy    2.0     10  20.0   2.0
1  correctness    2.4     10  24.0   2.4
2  conciseness    5.0     10  50.0   5.0
3  readability    4.7     10  47.0   4.7


##### You can now follow the same flow to evaluate with ICL.

Finally, compare the evaluation scores for the two modes to decide which one better fits our use case.
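
One way to line up the two runs is sketched below. It assumes each run has been reduced to a list of `{"metric", "mean"}` rows, for example via `parse_eval_results(...).to_dict("records")`. The function name `compare_modes` is hypothetical, and the ICL numbers are placeholders only, not real results; replace them with the output of your own ICL run.

```python
def compare_modes(zero_shot, icl):
    """Return per-metric rows with the zero-shot mean, ICL mean, and their difference.

    Both arguments are lists of dicts with "metric" and "mean" keys.
    Metrics that are missing from the ICL run are skipped.
    """
    icl_by_metric = {row["metric"]: row["mean"] for row in icl}
    comparison = []
    for row in zero_shot:
        name = row["metric"]
        if name not in icl_by_metric:
            continue
        comparison.append({
            "metric": name,
            "zero_shot": row["mean"],
            "icl": icl_by_metric[name],
            # Positive delta means the ICL run scored higher on this metric
            "delta": round(icl_by_metric[name] - row["mean"], 3),
        })
    return comparison


# Zero-shot means taken from the results table above
zero_shot_rows = [
    {"metric": "accuracy", "mean": 2.0},
    {"metric": "correctness", "mean": 2.4},
    {"metric": "conciseness", "mean": 5.0},
    {"metric": "readability", "mean": 4.7},
]

# PLACEHOLDER values, not measured results: substitute your own ICL run here
icl_rows = [
    {"metric": "accuracy", "mean": 3.0},
    {"metric": "correctness", "mean": 3.0},
    {"metric": "conciseness", "mean": 3.0},
    {"metric": "readability", "mean": 3.0},
]

for row in compare_modes(zero_shot_rows, icl_rows):
    print(row)
```

Keeping the difference per metric, rather than a single aggregate, makes it easier to see whether ICL trades one quality dimension for another.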

Based on the results above, which mode would you choose for your legal GenAI application? Is it good enough? In the next notebook, we will explore how fine-tuning affects the quality of the model.

# Optional - Other evaluation functionalities

Running a single type of evaluation won't necessarily give you all the information you need to decide which model is right for your application. Because LLMs are often required to perform multiple tasks, we usually assemble a set of evaluation metrics that gives a more granular picture and supports a better decision.
<br />
We've added two additional ways in which the NeMo Evaluator microservice enables you to do exactly that. Give them a try and see whether they provide the information you need.

### Evaluate with LM Evaluation Harness

[LM Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness/) supports over 60 standard academic benchmarks for LLMs, including MMLU, GSM8K, and HellaSwag. Refer to [tasks](https://github.com/EleutherAI/lm-evaluation-harness/tree/v0.4.6/lm_eval/tasks#tasks) to see the list of tasks.

The NeMo Evaluator microservice allows users to run these benchmarks through its API, streamlining the evaluation process for both researchers and practitioners in the field of natural language processing. Standard evaluation benchmarks, also known as academic benchmarks, provide a robust baseline for measuring general language understanding, comparing performance across models, and identifying gaps in a model's capabilities. For fine-tuned models, these benchmarks help ensure the model retains generalization while specializing in a particular domain.

However, they often fail to capture domain-specific nuances critical for fine-tuned models, such as task precision or specialized terminology. To ensure relevance and effectiveness, fine-tuned models should also be evaluated using custom datasets tailored to their specific use case. 

#### Evaluation of NIM with ifeval

Evaluating the ability of Large Language Models (LLMs) to follow natural language instructions is crucial, but current methods have limitations. Human evaluations are time-consuming and subjective, while automated evaluations that use LLMs can be biased. To address this, researchers introduced Instruction-Following Eval ([IFEval](https://arxiv.org/pdf/2311.07911)), a standardized benchmark that assesses an LLM's ability to follow "verifiable instructions", such as writing a certain number of words or mentioning specific keywords. The benchmark consists of 500 prompts covering 25 types of verifiable instructions and provides a straightforward, reproducible way to evaluate LLMs. The researchers have made the code and data publicly available.
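
To make "verifiable instructions" concrete, here is a minimal sketch of two such checks: a minimum word count and required keywords. This is an illustration only and is not IFEval's actual implementation; the function names are hypothetical.

```python
import re


def count_words(text: str) -> int:
    """Count word-like tokens in the text."""
    return len(re.findall(r"\b\w+\b", text))


def check_min_words(response: str, min_words: int) -> bool:
    """Verify an instruction such as 'answer in at least N words'."""
    return count_words(response) >= min_words


def check_keywords(response: str, keywords: list[str]) -> bool:
    """Verify an instruction such as 'mention the following keywords' (case-insensitive)."""
    lowered = response.lower()
    return all(keyword.lower() in lowered for keyword in keywords)


response = "Remote work across state lines can create tax and licensing obligations."

print(check_min_words(response, 10))
print(check_keywords(response, ["tax", "licensing"]))
print(check_keywords(response, ["contract"]))
```

Because each check is a deterministic function of the response text, the result does not depend on a human rater or a judge model, which is what makes this style of evaluation reproducible.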

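Next, we create a new evaluation configuration for an lm-eval task. Note that the cell below, as written, configures the `gsm8k_cot_llama` task from the GSM8K benchmark, while the variable and tag names still refer to `ifeval`. You can use `limit` to cap the number of samples so that the evaluation finishes sooner (the cell below sets this through `limit_samples`); a value of -1 means that the full evaluation will run.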

In [54]:
config_body = {
    "type": "gsm8k",
    "name": "my-configuration-lm-harness-gsm8k",
    "namespace": "default",
    "tasks": {"gsm8k_cot_llama": {"type": "gsm8k_cot_llama"}},
    "params": {
        "temperature": 0.00001,
        "top_p": 0.00001,
        "max_tokens": 256,
        "stop": ["<|eot|>"],
        "limit_samples": 10,
        "extra": {
            "num_fewshot": 8,
            "batch_size": 16,
            "bootstrap_iters": 100000,
            "dataset_seed": 42,
            "use_greedy": True,
            "top_k": 1,
            "hf_token": "<my-token>",
            "tokenizer_backend": "hf",
            "tokenizer": "utils/llama3-1-8b-instruct-tokenizer",
            "apply_chat_template": True,
            "fewshot_as_multiturn": True
        },
    },
}


resp = requests.post(
    f"{eval_url}/v1/evaluation/configs", json=config_body, verify=False
)
ifeval_config_id = resp.json()["id"]
print(ifeval_config_id)

eval-config-BxZ2SkgcKbSpPLWzEbP6Sv


We have configured our evaluation job and can now submit it to our NeMo Evaluator endpoint.

In [55]:
eval_tag = "Llama-3.2-3B_ifeval"
eval_body = {
    "namespace": "default",
    "tags": [eval_tag],
    "target": "default/NIM_llama_32_3b",
    "config": "default/my-configuration-lm-harness-gsm8k",
}

resp = requests.post(f"{eval_url}/v1/evaluation/jobs", json=eval_body, verify=False)

eval_id = resp.json()["id"]
print(f"Evaluation job ID is {eval_id}")

Evaluation job ID is eval-Szc1aWsW4vea8myMYU7vdK


Now we can poll the status until the evaluations finish. The evaluation can take a few minutes to an hour depending on the evaluation parameters, model performance, and size of the dataset.
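
The polling pattern can be sketched as follows. This is not the implementation of the notebook's `download_and_process` helper; it is a minimal, assumption-laden illustration. The status strings in `TERMINAL_STATES` are assumed (only `failed` appears in this notebook's output), and `get_status` is a hypothetical callable that you would back with a request to the evaluator's job status endpoint.

```python
import time

# Assumed terminal status strings; only "failed" is confirmed by this notebook's output
TERMINAL_STATES = {"completed", "failed", "cancelled"}


def wait_for_job(get_status, poll_interval=10.0, timeout=3600.0, sleep=time.sleep):
    """Call get_status() repeatedly until it returns a terminal state.

    get_status: zero-argument callable returning the current status string.
    Raises TimeoutError if no terminal state is seen within `timeout` seconds
    of accumulated waiting.
    """
    waited = 0.0
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if waited >= timeout:
            raise TimeoutError(f"job still '{status}' after {waited:.0f}s")
        sleep(poll_interval)
        waited += poll_interval


# Demo with a fake status source and a no-op sleep, so it runs instantly
fake_statuses = iter(["created", "running", "running", "failed"])
final = wait_for_job(lambda: next(fake_statuses), poll_interval=0.0, sleep=lambda _: None)
print(final)

if final != "completed":
    print("Job did not complete; inspect the job logs before requesting results.")
```

Checking the final status before requesting results avoids asking for the results of a job that failed and therefore never produced any.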

In [56]:
download_and_process(eval_tag, eval_url, eval_id, mlflow_uri)

Status details: {'message': 'An error occurred. To troubleshoot, download the logs by using the `/v1/evaluation/jobs/eval-Szc1aWsW4vea8myMYU7vdK/download-results` endpoint. ', 'task_status': {'gsm8k_cot_llama': 'failed'}, 'progress': 0.0}
Downloading results from: http://nemo-evaluator.local/v1/evaluation/jobs/eval-Szc1aWsW4vea8myMYU7vdK/results


HTTPError: 404 Client Error: Not Found for url: http://nemo-evaluator.local/v1/evaluation/jobs/eval-Szc1aWsW4vea8myMYU7vdK/results

While the evaluation runs, explore the tasks supported by the lm-eval-harness repository and try to answer the following questions:
 * If you want to evaluate your models on European languages (Spanish, Basque, French, ...), which tasks can help you do that?
 * When validating your model on non-English languages, how would you approach a new evaluation task? Would you focus on translating existing tasks? Why?