










# Evaluate using Azure AI Evaluation custom evaluators


## Objective
In this notebook, we demonstrate how to use a target function together with a custom evaluator to evaluate an endpoint.

This tutorial provides a step-by-step guide on how to evaluate a function.

This tutorial uses the following Azure AI services:

- [azure-ai-evaluation](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/develop/evaluate-sdk)

## Time

You should expect to spend 20 minutes running this sample. 

## About this example

This example demonstrates evaluating a target function using `azure-ai-evaluation`.

## Before you begin

### Installation

Install the following packages required to execute this notebook. 

In [None]:
%pip install azure-ai-evaluation
%pip install azure-identity

### Parameters and imports

In [3]:
import pandas as pd
import os

from pprint import pprint
from azure.ai.evaluation import evaluate
from openai import AzureOpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider

azure_openai_api_version = os.environ["AZURE_OPENAI_API_VERSION"]
azure_openai_deployment = os.environ["MODEL_DEPLOYMENT_NAME"]
azure_openai_endpoint = os.environ["AZURE_OPENAI_ENDPOINT"]

## Target function
We use a simple `endpoint_callback` function to get answers to questions from our model, and the `evaluate` API to evaluate those answers.

`endpoint_callback` needs the following environment variables to be set:

- AZURE_OPENAI_API_VERSION
- AZURE_OPENAI_DEPLOYMENT
- AZURE_OPENAI_ENDPOINT

In [2]:
# Set the environment variables below if they are not already set; otherwise, skip this cell.

os.environ["AZURE_OPENAI_API_VERSION"] = azure_openai_api_version
os.environ["AZURE_OPENAI_DEPLOYMENT"] = azure_openai_deployment
os.environ["AZURE_OPENAI_ENDPOINT"] = azure_openai_endpoint
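
Before calling the endpoint, it can help to confirm that the three variables listed above are actually present. The helper below is a minimal sketch; `missing_env_vars` and `REQUIRED_VARS` are hypothetical names introduced for illustration and are not part of `azure-ai-evaluation`.

```python
import os
from typing import Mapping, Sequence

# The variable names listed in this section.
REQUIRED_VARS = (
    "AZURE_OPENAI_API_VERSION",
    "AZURE_OPENAI_DEPLOYMENT",
    "AZURE_OPENAI_ENDPOINT",
)


def missing_env_vars(
    names: Sequence[str] = REQUIRED_VARS,
    environ: Mapping[str, str] = os.environ,
) -> list[str]:
    """Return the names that are unset or empty in `environ`."""
    return [name for name in names if not environ.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```

Passing `environ` explicitly keeps the helper easy to test without touching the real process environment.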

In [4]:
async def endpoint_callback(query: str) -> dict:
    # Read the same AZURE_OPENAI_* variables that were set above.
    deployment = os.environ.get("AZURE_OPENAI_DEPLOYMENT")

    # Authenticate with Microsoft Entra ID instead of an API key.
    token_provider = get_bearer_token_provider(DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default")

    oai_client = AzureOpenAI(
        azure_endpoint=os.environ.get("AZURE_OPENAI_ENDPOINT"),
        api_version=os.environ.get("AZURE_OPENAI_API_VERSION"),
        azure_ad_token_provider=token_provider,
    )

    response_from_oai_chat_completions = oai_client.chat.completions.create(
        messages=[{"content": query, "role": "user"}], model=deployment, max_tokens=500
    )

    response_result = response_from_oai_chat_completions.to_dict()
    return {"query": query, "response": response_result["choices"][0]["message"]["content"]}
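
For a dry run without network access or Azure credentials, a stand-in target with the same contract (a `query` argument in, a dict with `query` and `response` out) can be handy. This is a sketch under assumptions: `offline_callback` and `CANNED_ANSWERS` are hypothetical names, and the canned answer simply mirrors one row of the sample dataset used in this notebook.

```python
# Hypothetical canned answers used instead of a live model call.
CANNED_ANSWERS = {
    "What is the capital of France?": "Paris",
}


def offline_callback(query: str) -> dict:
    """Return the same shape as endpoint_callback without calling a model."""
    response = CANNED_ANSWERS.get(query, "I don't know.")
    return {"query": query, "response": response}


print(offline_callback("What is the capital of France?"))
# {'query': 'What is the capital of France?', 'response': 'Paris'}
```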

## Data
Read an existing dataset that contains a few questions and their answers.

In [5]:
df = pd.read_json("data.jsonl", lines=True)
print(df.head())

                                         query       response
0               When was United Stated found ?           1776
1               What is the capital of France?          Paris
2  Who is the best tennis player of all time ?  Roger Federer
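
The contents of `data.jsonl` are not shown beyond the preview above. As a minimal sketch, a file with the same three rows could be produced with the standard library as follows. The helper names and the temporary file path are illustrative, and storing `1776` as a string is an assumption.

```python
import json
import tempfile
from pathlib import Path

# Rows copied from the preview above; "1776" as a string is an assumption.
rows = [
    {"query": "When was United Stated found ?", "response": "1776"},
    {"query": "What is the capital of France?", "response": "Paris"},
    {"query": "Who is the best tennis player of all time ?", "response": "Roger Federer"},
]


def write_jsonl(path: Path, records: list[dict]) -> None:
    """Write one JSON object per line (the JSON Lines format)."""
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_jsonl(path: Path) -> list[dict]:
    """Read a JSON Lines file back into a list of dicts, skipping blank lines."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Written to a temporary location so an existing data.jsonl is not overwritten.
sample_path = Path(tempfile.gettempdir()) / "data_sketch.jsonl"
write_jsonl(sample_path, rows)
print(read_jsonl(sample_path))
```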


## Running Blocklist Evaluator to understand its input and output

In [15]:
from blocklist import BlocklistEvaluator

blocklist_evaluator = BlocklistEvaluator(blocklist=["bad", "worst", "terrible"])

blocklist_evaluator(response="New Delhi is Capital of India")

{'score': False}
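
The `blocklist` module itself is not shown in this notebook. The class below is a hedged sketch of what a callable evaluator with this input and output could look like; `SketchBlocklistEvaluator` is a hypothetical name, and the case-insensitive substring matching is an assumption that may differ from the real `BlocklistEvaluator`.

```python
class SketchBlocklistEvaluator:
    """Flag a response that contains any blocklisted word (assumed: case-insensitive substring match)."""

    def __init__(self, blocklist: list[str]) -> None:
        self._blocklist = [word.lower() for word in blocklist]

    def __call__(self, *, response: str, **kwargs) -> dict:
        # Extra keyword arguments (for example, `query`) are accepted and ignored.
        text = response.lower()
        return {"score": any(word in text for word in self._blocklist)}


sketch_evaluator = SketchBlocklistEvaluator(blocklist=["bad", "worst", "terrible"])
print(sketch_evaluator(response="New Delhi is Capital of India"))  # {'score': False}
print(sketch_evaluator(response="That was the worst answer"))      # {'score': True}
```

Returning a dict keeps the result keyed by name, which matches the `score` key seen in the output above.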

## Run the evaluation

In [7]:
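# Note: this recorded run passes the evaluator itself as `target`, so its result appears
# both as the target output (`outputs.score`) and as the evaluator output (`outputs.blocklist.score`).
results = evaluate(
    data="data.jsonl",
    target=blocklist_evaluator,
    evaluators={
        "blocklist": blocklist_evaluator,
    },
)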

2025-10-21 10:31:49 -0400   35164 execution.bulk     INFO     Finished 3 / 3 lines.
2025-10-21 10:31:49 -0400   35164 execution.bulk     INFO     Average execution time for completed lines: 0.0 seconds. Estimated time for incomplete lines: 0.0 seconds.

Run name: "TARGET_20251021_143149_130459"
Run status: "Completed"
Start time: "2025-10-21 14:31:49.130459+00:00"
Duration: "0:00:01.015536"

2025-10-21 10:31:50 -0400   38220 execution.bulk     INFO     Finished 3 / 3 lines.
2025-10-21 10:31:50 -0400   38220 execution.bulk     INFO     Average execution time for completed lines: 0.0 seconds. Estimated time for incomplete lines: 0.0 seconds.


Aggregated metrics for evaluator is not a dictionary will not be logged as metrics



Run name: "blocklist_20251021_143150_155483"
Run status: "Completed"
Start time: "2025-10-21 14:31:50.155483+00:00"
Duration: "0:00:01.010840"


{
    "blocklist": {
        "status": "Completed",
        "duration": "0:00:01.010840",
        "completed_lines": 3,
        "failed_lines": 0,
        "log_path": null
    }
}




## View the results

In [12]:
print(results)

In [13]:
pd.DataFrame(results["rows"])

   inputs.query                                 inputs.response  inputs.line_number  outputs.score  outputs.blocklist.score
0  When was United Stated found ?               1776             0                   False          False
1  What is the capital of France?               Paris            1                   False          False
2  Who is the best tennis player of all time ?  Roger Federer    2                   False          False
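
A per-run summary can also be computed directly from the rows. The sketch below assumes each row is a dict keyed by the column names shown above; `flagged_rate` is a hypothetical helper, and the sample rows are copied from the table rather than from a live run.

```python
# Rows copied from the results table above (only the columns needed here).
sample_rows = [
    {"inputs.query": "When was United Stated found ?", "outputs.blocklist.score": False},
    {"inputs.query": "What is the capital of France?", "outputs.blocklist.score": False},
    {"inputs.query": "Who is the best tennis player of all time ?", "outputs.blocklist.score": False},
]


def flagged_rate(rows: list[dict], column: str = "outputs.blocklist.score") -> float:
    """Return the fraction of rows whose blocklist score is True."""
    if not rows:
        return 0.0
    return sum(bool(row[column]) for row in rows) / len(rows)


print(flagged_rate(sample_rows))  # 0.0
```

In the notebook, the same helper could be called on `results["rows"]` if the rows have that shape.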
