# Evaluate Base Model Endpoints using Azure AI Evaluation APIs

## Objective

This tutorial provides a step-by-step guide to evaluating prompts against a variety of model endpoints deployed on the Azure AI platform or on non-Azure platforms.

This guide uses a Python class as the application target. The target is passed to the `evaluate` API from the azure-ai-evaluation SDK, which evaluates the responses that the models generate for the provided prompts.

This tutorial uses the following Azure AI services:

- [azure-ai-evaluation](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/develop/evaluate-sdk)

## Time

You should expect to spend 30 minutes running this sample. 

## About this example

This example demonstrates evaluating model endpoint responses against provided prompts using azure-ai-evaluation.

### Parameters and imports

In [1]:
import os
from dotenv import load_dotenv
import pandas as pd
import random

load_dotenv(override=True)

True

## Target Application

We will use the `evaluate` API from the azure-ai-evaluation SDK. It requires a target application or Python function that handles the call to the LLM and retrieves the response.

In this notebook, the application target `ModelEndpoints` gets answers from multiple model endpoints for the provided questions (prompts).

This application target needs the model endpoint and its authentication key. For simplicity, they are read from environment variables (loaded from `.env` above), and the model name is passed into the `__init__()` function of `ModelEndpoints`.

Please provide your Azure AI project details so that traces and evaluation results are pushed to the project in Azure AI Studio.

In [2]:
azure_ai_project = {
    "subscription_id": os.getenv("AZURE_SUBSCRIPTION_ID"),
    "resource_group_name": os.getenv("AZURE_RESOURCE_GROUP"),
    "project_name": os.getenv("AZURE_PROJECT_NAME"),
}
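
Because `os.getenv` returns `None` for an unset variable, a missing setting would otherwise only surface later, when results are uploaded. Below is a minimal sketch of an early check. The helper name `missing_project_settings` and the example values are assumptions for illustration and are not part of any SDK.

```python
def missing_project_settings(project: dict) -> list[str]:
    """Return the keys whose values are unset (None) or empty."""
    return [key for key, value in project.items() if not value]


# Example values only; in the notebook these come from os.getenv(...).
example_project = {
    "subscription_id": "00000000-0000-0000-0000-000000000000",
    "resource_group_name": "",
    "project_name": None,
}

missing = missing_project_settings(example_project)
if missing:
    print(f"Missing project settings: {', '.join(missing)}")
```

Running the same check on `azure_ai_project` before starting an evaluation makes a misconfigured `.env` file visible immediately.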

## Model Endpoints
The following code demonstrates how to call a model endpoint, configured through the environment variables loaded above. If a model in the `models` list used later is not deployed in your AI project, please comment it out. If you would like to test a model whose type is not handled below, add that type to the `__call__` function and create a helper function that calls the model's endpoint via REST.

In [3]:
!pygmentize utils/models.py

from typing import TypedDict, Self
from promptflow.tracing import trace


class ModelEndpoints:
    def __init__(self: Self, model: str) -> None:
        self.model = model

    class Response(TypedDict):
        query: str
        response: str

    @trace
    def __call__(self: Self, query: str) -> Response:
        output = self.chat_completion(query)
        return output

    def chat_completion(self: Self, query: str) -> Response:
        from azure.ai.inference import ChatCompletionsClient
        from azure.ai.inference.models import SystemMessage, UserMessage
        from azure.core.credentials import AzureKeyCredential
        from dotenv import load_dotenv
        import os

        endpoint = os.getenv("AZURE_OPENAI_INFERENCE_ENDPOINT")
        key = os.getenv("AZURE_OPENAI_API_KEY")

        print(f"endpoint: {endpoint}")
        print(f"model: {self.model} \n")

        client = ChatCompletionsClient(
            endpoint=endpoint, 
            credential=AzureKeyCredenti

> 🧪 Test your model endpoints.

In [4]:
from utils.models import ModelEndpoints

def test_model(target: ModelEndpoints) -> None:
    query = "What is the capital of Switzerland?"
    result = target(query)

    print(f"Query: {result['query']}")
    print(f"Response: {result['response']}")

model = "Phi-4"
target = ModelEndpoints(model)

test_model(target)

endpoint: https://ai-serv-eval060765329650.openai.azure.com/models
model: Phi-4 

Query: What is the capital of Switzerland?
Response: The capital of Switzerland is Bern. While Switzerland does not have a national capital in the same way many other countries do, Bern serves as the seat of the federal authorities and is considered the de facto capital.


## Data

The following code reads the JSON Lines file "test-data.jsonl", which contains the inputs to the application target. Each line provides a query, a context, and a ground truth.

In [5]:
df = pd.read_json("test-data.jsonl", lines=True)
print(df.head())

                                           query  \
0                 What is the capital of France?   
1             Which tent is the most waterproof?   
2           Which camping table is the lightest?   
3  How much does TrailWalker Hiking Shoes cost?    

                                             context  \
0                   France is the country in Europe.   
1  #TrailMaster X4 Tent, price $250,## BrandOutdo...   
2  #BaseCamp Folding Table, price $60,## BrandCam...   
3  #TrailWalker Hiking Shoes, price $110## BrandT...   

                                        ground_truth  
0                                              Paris  
1  The TrailMaster X4 tent has a rainfly waterpro...  
2  The BaseCamp Folding Table has a weight of 15 lbs  
3    The TrailWalker Hiking Shoes are priced at $110  
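
Each line of a JSON Lines file is an independent JSON object, so a malformed or incomplete record can be detected before a run starts. Below is a minimal sketch using only the standard library. The helper `find_incomplete_rows` and the inline sample records are illustrative assumptions, not part of the notebook.

```python
import io
import json

REQUIRED_FIELDS = {"query", "context", "ground_truth"}


def find_incomplete_rows(lines) -> list[tuple[int, list[str]]]:
    """Return (row_index, missing_fields) for rows lacking required fields."""
    problems = []
    for index, line in enumerate(lines):
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = sorted(REQUIRED_FIELDS - record.keys())
        if missing:
            problems.append((index, missing))
    return problems


# Two sample records; the second one deliberately lacks two fields.
sample = io.StringIO(
    '{"query": "What is the capital of France?", '
    '"context": "France is the country in Europe.", "ground_truth": "Paris"}\n'
    '{"query": "Which tent is the most waterproof?"}\n'
)
problems = find_incomplete_rows(sample)
print(problems)
```

The same function accepts an open file handle, for example `open("test-data.jsonl", encoding="utf-8")`, because it only iterates over lines.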


## Configuration of the LLM Judge
To use the Relevance and Groundedness evaluators, we provide the details of an Azure OpenAI model that acts as a **Judge**. These details are passed to the evaluators as a model configuration.

In [6]:
from azure.ai.evaluation import AzureOpenAIModelConfiguration

judge_model = AzureOpenAIModelConfiguration(
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    azure_deployment="gpt-4o",
    api_version=os.getenv("AZURE_OPENAI_API_VERSION")
)

o3_judge_model = AzureOpenAIModelConfiguration(
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    azure_deployment="gpt-4o",
    api_version=os.getenv("AZURE_OPENAI_API_VERSION")
)

# model_config = {
#     "azure_endpoint": os.getenv("AZURE_OPENAI_ENDPOINT"),
#     "api_key": os.getenv("AZURE_OPENAI_API_KEY"),
#     "azure_deployment": "gpt-4o",
#     "api_version": os.getenv("AZURE_OPENAI_API_VERSION")
# }

# o3_mini_model_config = {
#     "azure_endpoint": os.getenv("AZURE_OPENAI_O1_ENDPOINT"),
#     "api_key": os.getenv("AZURE_OPENAI_O1_API_KEY"),
#     "azure_deployment": "o3-mini",
#     "api_version": os.getenv("AZURE_OPENAI_O1_API_VERSION")
# }


## Run the evaluation

The following code runs the `evaluate` API and uses the Relevance and Groundedness evaluators to evaluate results from different models.

The `evaluate` API requires the following parameters:

+   Data file (prompts): The data file 'test-data.jsonl' in JSON Lines format. Each line contains the query, context and ground truth for the evaluators.

+   Application target: An instance of the Python class that routes calls to a specific model endpoint based on the model name.

+   Model name: An identifier of the model, so that custom code in the application target class can identify the model type and call the respective LLM using its endpoint URL and auth key.

+   Evaluators: The list of evaluators used to evaluate the given prompts (questions) as input and the output (answers) from the LLMs.
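
To make the link between the data file and the evaluators concrete, the sketch below shows how a column mapping of the form `${data.<column>}` could be resolved against one input row. It illustrates the idea only: `resolve_mapping` is a hypothetical helper and does not reproduce the SDK's internal logic.

```python
import re

# Matches placeholders such as "${data.query}".
_PLACEHOLDER = re.compile(r"^\$\{(\w+)\.(\w+)\}$")


def resolve_mapping(mapping: dict[str, str], sources: dict[str, dict]) -> dict:
    """Resolve '${source.column}' placeholders against the given sources."""
    resolved = {}
    for evaluator_input, placeholder in mapping.items():
        match = _PLACEHOLDER.match(placeholder)
        if match is None:
            raise ValueError(f"Unsupported placeholder: {placeholder!r}")
        source, column = match.groups()
        resolved[evaluator_input] = sources[source][column]
    return resolved


row = {
    "query": "What is the capital of France?",
    "context": "France is the country in Europe.",
    "ground_truth": "Paris",
}
mapping = {
    "query": "${data.query}",
    "context": "${data.context}",
    "response": "${data.ground_truth}",
}
evaluator_inputs = resolve_mapping(mapping, {"data": row})
print(evaluator_inputs)
```

With this mapping, the evaluator's `response` input receives the value of the `ground_truth` column, which shows that the mapping alone decides which text each evaluator scores.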

In [7]:
import pathlib
from utils.models import ModelEndpoints
from azure.ai.evaluation import evaluate, RelevanceEvaluator, GroundednessEvaluator

## Reasoning models are not yet supported as a judge
# judge_model = o3_judge_model

relevance_evaluator = RelevanceEvaluator(judge_model)
groundedness_evaluator = GroundednessEvaluator(judge_model)

evaluators = {
    "relevance": relevance_evaluator,
    "groundedness": groundedness_evaluator,
}

models = [
    # "gpt-4o",
    "gpt-4o-mini",
    "Phi-4"
]

file_name = "test-data.jsonl"
path = str(pathlib.Path.cwd() / file_name)

for model in models:
    print(f"Evaluating {model}...")
    randomNum = random.randint(1111, 9999)
    evaluation_name = "Eval-Run-" + str(randomNum) + "-" + model.title()

    results = evaluate(
        evaluation_name=evaluation_name,
        data=path,
        target=ModelEndpoints(model),
        evaluators=evaluators,
        azure_ai_project=azure_ai_project,
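        evaluator_config={
            "default": {
                "column_mapping": {
                    # Note: `response` is mapped to the dataset's ground truth, so the
                    # evaluators score the reference answers, not the target's output.
                    "response": "${data.ground_truth}",
                    "context": "${data.context}",
                    "query": "${data.query}",
                }
            }
        }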
    )

Evaluating gpt-4o-mini...


[2025-02-24 18:14:55 +0100][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run evaluations_getting_started_20250224_181452_960985, log path: C:\Users\charendt\.promptflow\.runs\evaluations_getting_started_20250224_181452_960985\logs.txt
[2025-02-24 18:15:07 +0100][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run azure_ai_evaluation_evaluators_common_base_eval_asyncevaluatorbase_rygw_4gw_20250224_181507_140842, log path: C:\Users\charendt\.promptflow\.runs\azure_ai_evaluation_evaluators_common_base_eval_asyncevaluatorbase_rygw_4gw_20250224_181507_140842\logs.txt
[2025-02-24 18:15:07 +0100][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run azure_ai_evaluation_evaluators_common_base_eval_asyncevaluatorbase_6b_0zps9_20250224_181507_140842, log path: C:\Users\charendt\.promptflow\.runs\azure_ai_evaluation_evaluators_common_base_eval_asyncevaluatorbase_6b_0zps9_20250224_181507_140842\logs.txt


2025-02-24 18:14:55 +0100   21844 execution.bulk     INFO     Current thread is not main thread, skip signal handler registration in BatchEngine.
2025-02-24 18:14:55 +0100   21844 execution.bulk     INFO     Current system's available memory is 9813.68359375MB, memory consumption of current process is 214.48828125MB, estimated available worker count is 9813.68359375/214.48828125 = 45
2025-02-24 18:14:55 +0100   21844 execution.bulk     INFO     Set process count to 4 by taking the minimum value among the factors of {'default_worker_count': 4, 'row_count': 4, 'estimated_worker_count_based_on_memory_usage': 45}.
2025-02-24 18:14:58 +0100   21844 execution.bulk     INFO     Process name(SpawnProcess-4)-Process id(10068)-Line number(0) start execution.
2025-02-24 18:14:58 +0100   21844 execution.bulk     INFO     Process name(SpawnProcess-6)-Process id(11016)-Line number(1) start execution.
2025-02-24 18:14:58 +0100   21844 execution.bulk     INFO     Process name(SpawnProcess-7)-Process i

[2025-02-24 18:15:30 +0100][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run evaluations_getting_started_20250224_181528_113849, log path: C:\Users\charendt\.promptflow\.runs\evaluations_getting_started_20250224_181528_113849\logs.txt
[2025-02-24 18:16:03 +0100][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run azure_ai_evaluation_evaluators_common_base_eval_asyncevaluatorbase_jpxjjfdk_20250224_181603_056509, log path: C:\Users\charendt\.promptflow\.runs\azure_ai_evaluation_evaluators_common_base_eval_asyncevaluatorbase_jpxjjfdk_20250224_181603_056509\logs.txt
[2025-02-24 18:16:03 +0100][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run azure_ai_evaluation_evaluators_common_base_eval_asyncevaluatorbase_bmqktqm2_20250224_181603_056509, log path: C:\Users\charendt\.promptflow\.runs\azure_ai_evaluation_evaluators_common_base_eval_asyncevaluatorbase_bmqktqm2_20250224_181603_056509\logs.txt


2025-02-24 18:15:30 +0100   21844 execution.bulk     INFO     Current thread is not main thread, skip signal handler registration in BatchEngine.
2025-02-24 18:15:30 +0100   21844 execution.bulk     INFO     Current system's available memory is 9919.0546875MB, memory consumption of current process is 232.80859375MB, estimated available worker count is 9919.0546875/232.80859375 = 42
2025-02-24 18:15:30 +0100   21844 execution.bulk     INFO     Set process count to 4 by taking the minimum value among the factors of {'default_worker_count': 4, 'row_count': 4, 'estimated_worker_count_based_on_memory_usage': 42}.
2025-02-24 18:15:32 +0100   21844 execution.bulk     INFO     Process name(SpawnProcess-11)-Process id(17672)-Line number(0) start execution.
2025-02-24 18:15:32 +0100   21844 execution.bulk     INFO     Process name(SpawnProcess-14)-Process id(37416)-Line number(2) start execution.
2025-02-24 18:15:32 +0100   21844 execution.bulk     INFO     Process name(SpawnProcess-12)-Process 

## View the results

In [8]:
pd.DataFrame(results["rows"])

Unnamed: 0,outputs.query,outputs.response,inputs.query,inputs.context,inputs.ground_truth,outputs.relevance.relevance,outputs.relevance.gpt_relevance,outputs.relevance.relevance_reason,outputs.groundedness.groundedness,outputs.groundedness.gpt_groundedness,outputs.groundedness.groundedness_reason,line_number
0,What is the capital of France?,The capital of France is Paris. It is the larg...,What is the capital of France?,France is the country in Europe.,Paris,4,4,The response is accurate and fully addresses t...,1,1,The RESPONSE introduces information (Paris) th...,0
1,Which tent is the most waterproof?,"When it comes to waterproof tents, several bra...",Which tent is the most waterproof?,"#TrailMaster X4 Tent, price $250,## BrandOutdo...",The TrailMaster X4 tent has a rainfly waterpro...,3,3,The response provides partial information by m...,5,5,The RESPONSE accurately conveys the informatio...,1
2,Which camping table is the lightest?,"When it comes to lightweight camping tables, t...",Which camping table is the lightest?,"#BaseCamp Folding Table, price $60,## BrandCam...",The BaseCamp Folding Table has a weight of 15 lbs,3,3,The response partially addresses the query by ...,5,5,The RESPONSE is fully grounded and accurate in...,2
3,How much does TrailWalker Hiking Shoes cost?,The cost of TrailWalker hiking shoes can vary ...,How much does TrailWalker Hiking Shoes cost?,"#TrailWalker Hiking Shoes, price $110## BrandT...",The TrailWalker Hiking Shoes are priced at $110,4,4,The response fully addresses the query with ac...,4,4,"The RESPONSE is accurate but incomplete, as it...",3
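
The per-row scores can be condensed into one number per evaluator. Below is a minimal sketch using only the standard library. The scores are copied by hand from the four rows shown above, and `mean_scores` is a hypothetical helper rather than an SDK function.

```python
from statistics import mean

# Scores copied from the four result rows shown above.
rows = [
    {"outputs.relevance.relevance": 4, "outputs.groundedness.groundedness": 1},
    {"outputs.relevance.relevance": 3, "outputs.groundedness.groundedness": 5},
    {"outputs.relevance.relevance": 3, "outputs.groundedness.groundedness": 5},
    {"outputs.relevance.relevance": 4, "outputs.groundedness.groundedness": 4},
]


def mean_scores(rows: list[dict], columns: list[str]) -> dict[str, float]:
    """Average each score column across all rows."""
    return {column: mean(row[column] for row in rows) for column in columns}


summary = mean_scores(
    rows,
    ["outputs.relevance.relevance", "outputs.groundedness.groundedness"],
)
print(summary)
```

In the notebook, the same helper could be applied to `results["rows"]`. Note that `results` is reassigned on every loop iteration, so it holds only the rows of the last model evaluated.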


### Now, you can view the results in the Azure AI Foundry portal and compare the evaluation results:

![alt](assets\ai-foundry-evaluation-comparison.png)