# Lab 02: Explore Built-in Quality Evaluators

By the end of this lab, you will know:

1. What AI-Assisted evaluation workflows are, and how to run them.
1. The built-in quality evaluators available in Azure AI Foundry
1. How to run a quality evaluator with a test prompt (to understand usage)
1. How to run a composite quality evaluator (with multiple evaluators)

**Generation Quality Metrics**

1. These are used to assess the overall quality of the content produced by generative AI applications. 
1. All metrics or evaluators output a score and an explanation (except for SimilarityEvaluator which has score only). 
1. [Browse the documentation](https://learn.microsoft.com/en-us/azure/ai-foundry/concepts/evaluation-metrics-built-in?tabs=warning#generation-quality-metrics) for details on how each metric works.

**Built-in Generation Quality Evaluators**

The Azure AI Foundry plaform provides a comprehensive set of built-in quality evaluators that can be used to assess the performance of generative AI models. 
- Visit the [documentation](https://learn.microsoft.com/en-us/azure/ai-foundry/concepts/evaluation-metrics-built-in?tabs=warning#generation-quality-metrics) to get the latest updates
- Visit the [API reference](https://learn.microsoft.com/en-us/python/api/azure-ai-evaluation/azure.ai.evaluation?view=azure-python-preview) to understand usage of the API


---

## 1. Initialize Setup

In [1]:
## Setup Required Dependencies

# --------- Azure AI Project
import os
from pprint import pprint

# The Azure AI Foundry connection string contains all the parameters we need
connection_string = os.environ.get("AZURE_AI_CONNECTION_STRING")
region_id, subscription_id, resource_group_name, project_name = connection_string.split(";")

# Use extracted values to create the azure_ai_project
azure_ai_project = {
    "subscription_id": subscription_id,
    "resource_group_name": resource_group_name,
    "project_name": project_name,
}
pprint(azure_ai_project)

# ---------- Model Config
model_config = {
    "azure_endpoint": os.environ.get("AZURE_OPENAI_ENDPOINT"),
    "api_key": os.environ.get("AZURE_OPENAI_API_KEY"),
    "azure_deployment": os.environ.get("LAB_JUDGE_MODEL"),
}
pprint(model_config)

# ---------- Azure Credential
from azure.identity import DefaultAzureCredential
credential=DefaultAzureCredential()
pprint(credential)



{'project_name': 'ai-project-51324400',
 'resource_group_name': 'rg-aitour',
 'subscription_id': '3c2e0a23-bcf8-4766-84b7-8c635df04a7b'}
{'api_key': '55b32d2e39584a7f9a17fa750261ffb7',
 'azure_deployment': 'gpt-4',
 'azure_endpoint': 'https://aoai-51324400.openai.azure.com/'}
<azure.identity._credentials.default.DefaultAzureCredential object at 0x7ccc7f1c85f0>


---

## 2. General Purpose Evaluators    

These are evaluators that look at the quality of textual responses in generated cotent and include:
1. Coherence - measures the logical and orderly presentation of ideas in a response, allowing the reader to easily follow and understand the writer's train of thought
1. Fluency - measures the effectiveness and clarity of written communication, focusing on grammatical accuracy, vocabulary range, sentence complexity, coherence, and overall readability
1. QA Composite - measures comprehensively various aspects in a question-answering scenario including relevance, groundedness, fluency, coherence, similarity, and F1 score.

Scores are typically numerical, generated using a Likert scale (1 to 5) with higher scores indicating better quality. The _threshold_ sets the cutoff for a "pass/fail" rating on that evaluator, helping you get a quick sense of where the primary issues lie.


### 2.1 Coherence Evaluator
CoherenceEvaluator measures the logical and orderly presentation of ideas in a response, allowing the reader to easily follow and understand the writer's train of thought. A coherent response directly addresses the question with clear connections between sentences and paragraphs, using appropriate transitions and a logical sequence of ideas. Higher scores mean better coherence.

In [None]:
from azure.ai.evaluation import CoherenceEvaluator

coherence = CoherenceEvaluator(model_config=model_config, threshold=3)
coherence(
    query="Is Marie Curie is born in Paris?", 
    response="No, Marie Curie is born in Warsaw."
)

{'coherence': 4.0,
 'gpt_coherence': 4.0,
 'coherence_reason': 'The RESPONSE is coherent because it directly answers the QUERY with a clear and logical statement, making it easy to understand.',
 'coherence_result': 'pass',
 'coherence_threshold': 3}

### 2.2 Fluency Evaluator
luencyEvaluatormeasures the effectiveness and clarity of written communication, focusing on grammatical accuracy, vocabulary range, sentence complexity, coherence, and overall readability. It assesses how smoothly ideas are conveyed and how easily the reader can understand the text.

In [None]:
from azure.ai.evaluation import FluencyEvaluator

fluency = FluencyEvaluator(model_config=model_config, threshold=3)
fluency(
    response="No, Marie Curie is born in Warsaw."
)

{'fluency': 2.0,
 'gpt_fluency': 2.0,
 'fluency_reason': 'The response should receive a Score of 2 because it communicates a simple idea with a grammatical error and limited vocabulary, fitting the definition of Basic Fluency.',
 'fluency_result': 'fail',
 'fluency_threshold': 3}

### 2.3 Question-Answering Composite Evaluator
QAEvaluator measures comprehensively various aspects in a question-answering scenario - including Relevance, Groundedness, Fluency, Coherence, Similarity, and F1 score.

In [None]:
from azure.ai.evaluation import QAEvaluator

qa_eval = QAEvaluator(model_config=model_config, threshold=3)
qa_eval(
    query="Where was Marie Curie born?", 
    context="Background: 1. Marie Curie was a chemist. 2. Marie Curie was born on November 7, 1867. 3. Marie Curie is a French scientist.",
    response="According to wikipedia, Marie Curie was not born in Paris but in Warsaw.",
    ground_truth="Marie Curie was born in Warsaw."
)

{'f1_score': 0.631578947368421,
 'f1_result': 'pass',
 'f1_threshold': 3,
 'similarity': 5.0,
 'gpt_similarity': 5.0,
 'similarity_result': 'pass',
 'similarity_threshold': 3,
 'relevance': 5.0,
 'gpt_relevance': 5.0,
 'relevance_reason': 'The response accurately and completely answers the query, providing both the correct birthplace and additional clarification, making it comprehensive with insights.',
 'relevance_result': 'pass',
 'relevance_threshold': 3,
 'fluency': 3.0,
 'gpt_fluency': 3.0,
 'fluency_reason': 'The response is clear and grammatically correct, with adequate vocabulary. It conveys the intended message effectively, but the sentence structure is simple and lacks complexity, which aligns with Competent Fluency.',
 'fluency_result': 'pass',
 'fluency_threshold': 3,
 'coherence': 4.0,
 'gpt_coherence': 4.0,
 'coherence_reason': 'The response is coherent and provides the correct answer to the query, but it includes an unnecessary comparison that slightly affects the logica

---

## 3.  Retrieval Augmented Generation (RAG) Evaluators

A retrieval-augmented generation (RAG) system tries to generate the most relevant answer consistent with grounding documents in response to a user's query.  This requires it to _retrieve_ documents that provide grounding context, and _generate_ responses that are relevance, consistent with grounding data, and complete.




### 3.1 Retrieval Evaluator
RetrievalEvaluator measures the textual quality of retrieval results with an LLM without requiring ground truth. This metric focuses on how relevant the context chunks (encoded as a string) are to address a query and how the most relevant context chunks are surfaced at the top of the list.

In [None]:
from azure.ai.evaluation import RetrievalEvaluator

retrieval = RetrievalEvaluator(model_config=model_config, threshold=3)
retrieval(
    query="Where was Marie Curie born?", 
    context="Background: 1. Marie Curie was born in Warsaw. 2. Marie Curie was born on November 7, 1867. 3. Marie Curie is a French scientist. ",
)

{'retrieval': 5.0,
 'gpt_retrieval': 5.0,
 'retrieval_reason': 'The context contains the exact answer to the query at the top, with no external knowledge bias introduced, making it highly relevant and well-ranked.',
 'retrieval_result': 'pass',
 'retrieval_threshold': 3}


### 3.2 Groundedness Evaluator 
GroundednessEvaluator measures how well the generated response aligns with the given context (grounding source) and doesn't fabricate content outside of it. 

In [None]:
from azure.ai.evaluation import GroundednessEvaluator

groundedness = GroundednessEvaluator(model_config=model_config, threshold=3)
groundedness(
    query="Is Marie Curie is born in Paris?", 
    context="Background: 1. Marie Curie is born on November 7, 1867. 2. Marie Curie is born in Warsaw.",
    response="No, Marie Curie is born in Warsaw."
)

{'groundedness': 5.0,
 'gpt_groundedness': 5.0,
 'groundedness_reason': 'The response is fully grounded in the context, accurately and completely answering the query without adding extraneous information.',
 'groundedness_result': 'pass',
 'groundedness_threshold': 3}

### 3.3 Relevance Evaluator

RelevanceEvaluator measures how effectively a response addresses a query. It assesses the accuracy, completeness, and direct relevance of the response based solely on the given query. Higher scores mean better relevance.

In [7]:
{
    "relevance": 4.0,
    "gpt_relevance": 4.0, 
    "relevance_reason": "The RESPONSE accurately answers the QUERY by stating that Marie Curie was born in Warsaw, which is correct and directly relevant to the question asked.",
    "relevance_result": "pass", 
    "relevance_threshold": 3
}

{'relevance': 4.0,
 'gpt_relevance': 4.0,
 'relevance_reason': 'The RESPONSE accurately answers the QUERY by stating that Marie Curie was born in Warsaw, which is correct and directly relevant to the question asked.',
 'relevance_result': 'pass',
 'relevance_threshold': 3}

In [8]:
from azure.ai.evaluation import CoherenceEvaluator
coherence_evaluator = CoherenceEvaluator(model_config)

result = coherence_evaluator(
    query="What is the capital of Japan?",
    response="The capital of Japan is Tokyo."
)

pprint(result)

{'coherence': 4.0,
 'coherence_reason': 'The RESPONSE is coherent because it directly answers the '
                     'QUERY with a clear and logical sentence.',
 'coherence_result': 'pass',
 'coherence_threshold': 3,
 'gpt_coherence': 4.0}


### 3.4 Response Completeness Evaluator

ResponseCompletenessEvaluator that captures the recall aspect of response alignment with the expected response. This is complementary to GroundednessEvaluator which captures the precision aspect of response alignment with the grounding source.

In [9]:
from azure.ai.evaluation import ResponseCompletenessEvaluator

response_completeness = ResponseCompletenessEvaluator(model_config=model_config, threshold=3)
response_completeness(
    response="Based on the retrieved documents, the shareholder meeting discussed the operational efficiency of the company and financing options.",
    ground_truth="The shareholder meeting discussed the compensation package of the company CEO."
)

Class ResponseCompletenessEvaluator: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.


{'response_completeness': 1,
 'response_completeness_result': 'fail',
 'response_completeness_threshold': 3,
 'response_completeness_reason': "The response completely misses the key information about the CEO's compensation package discussed in the shareholder meeting, which is the focus of the ground truth."}

---

## 4. Textual Similarity Evaluators


These evaluators compare how closely the textual response generated by your AI system matches the response you would expect, typically called the "ground truth".
- The SimilarityEvaluator uses an "LLM-as-Judge" (AI-assisted evaluation) approach to score the metric.
- The F1 Score, BLEU, GLEU, ROUGE and METEOR evaluators (NLP-based) use a mathematical approach to score the metric.

### 4.1 Similarity Evaluator
SimilarityEvaluator measures the degrees of semantic similarity between the generated text and its ground truth with respect to a query. Compared to other text-similarity metrics that require ground truths, this metric focuses on semantics of a response (instead of simple overlap in tokens or n-grams) and also considers the broader context of a query.


In [None]:
from azure.ai.evaluation import SimilarityEvaluator

similarity = SimilarityEvaluator(model_config=model_config, threshold=3)
similarity(
    query="Is Marie Curie is born in Paris?", 
    response="According to wikipedia, Marie Curie was not born in Paris but in Warsaw.",
    ground_truth="Marie Curie was born in Warsaw."
)

{'similarity': 5.0,
 'gpt_similarity': 5.0,
 'similarity_result': 'pass',
 'similarity_threshold': 3}

### 4.2 F1 Score
F1ScoreEvaluator measures the similarity by shared tokens between the generated text and the ground truth, focusing on both precision and recall. The F1-score computes the ratio of the number of shared words between the model generation and the ground truth. 

In [None]:
from azure.ai.evaluation import F1ScoreEvaluator

f1_score = F1ScoreEvaluator(threshold=0.5)
f1_score(
    response="According to wikipedia, Marie Curie was not born in Paris but in Warsaw.",
    ground_truth="Marie Curie was born in Warsaw."
)

{'f1_score': 0.631578947368421, 'f1_result': 'fail', 'f1_threshold': 0.5}

### 4.3 BLEU Score
BleuScoreEvaluator computes the BLEU (Bilingual Evaluation Understudy) score commonly used in natural language processing (NLP) and machine translation. It measures how closely the generated text matches the reference text.

In [None]:
from azure.ai.evaluation import BleuScoreEvaluator

bleu_score = BleuScoreEvaluator(threshold=0.3)
bleu_score(
    response="According to wikipedia, Marie Curie was not born in Paris but in Warsaw.",
    ground_truth="Marie Curie was born in Warsaw."
)

{'bleu_score': 0.1550967560878879,
 'bleu_result': 'fail',
 'bleu_threshold': 0.3}

### 4.4 GLEU Score

GleuScoreEvaluator computes the GLEU (Google-BLEU) score. It measures the similarity by shared n-grams between the generated text and ground truth, similar to the BLEU score, focusing on both precision and recall. But it addresses the drawbacks of the BLEU score using a per-sentence reward objective. The numerical score is a 0-1 float and a higher score is better. 

In [None]:
from azure.ai.evaluation import GleuScoreEvaluator


gleu_score = GleuScoreEvaluator(threshold=0.2)
gleu_score(
    response="According to wikipedia, Marie Curie was not born in Paris but in Warsaw.",
    ground_truth="Marie Curie was born in Warsaw."
)

{'gleu_score': 0.25925925925925924,
 'gleu_result': 'fail',
 'gleu_threshold': 0.2}

### 4.5 ROUGE Score
RougeScoreEvaluator computes the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) scores, a set of metrics used to evaluate automatic summarization and machine translation. It measures the overlap between generated text and reference summaries. The numerical score is a 0-1 float and a higher score is better. 

In [None]:
from azure.ai.evaluation import RougeScoreEvaluator, RougeType

rouge = RougeScoreEvaluator(rouge_type=RougeType.ROUGE_L, precision_threshold=0.6, recall_threshold=0.5, f1_score_threshold=0.55) 
rouge(
    response="According to wikipedia, Marie Curie was not born in Paris but in Warsaw.",
    ground_truth="Marie Curie was born in Warsaw."
)

{'rouge_precision': 0.46153846153846156,
 'rouge_recall': 1.0,
 'rouge_f1_score': 0.631578947368421,
 'rouge_precision_result': 'fail',
 'rouge_recall_result': 'pass',
 'rouge_f1_score_result': 'pass',
 'rouge_precision_threshold': 0.6,
 'rouge_recall_threshold': 0.5,
 'rouge_f1_score_threshold': 0.55}

### 4.6 METEOR Score
MeteorScoreEvaluator measures the similarity by shared n-grams between the generated text and the ground truth, similar to the BLEU score, focusing on precision and recall. But it addresses limitations of other metrics like the BLEU score by considering synonyms, stemming, and paraphrasing for content alignment.

In [None]:
from azure.ai.evaluation import MeteorScoreEvaluator

meteor_score = MeteorScoreEvaluator(threshold=0.9)
meteor_score(
    response="According to wikipedia, Marie Curie was not born in Paris but in Warsaw.",
    ground_truth="Marie Curie was born in Warsaw."
)

{'meteor_score': 0.8621140763997908,
 'meteor_result': 'fail',
 'meteor_threshold': 0.9}

---

## 5. Explore Custom Evaluators



### 5.1 Code-Based Evaluator


In [16]:
# Custom evaluator as a function to calculate response length
def response_length(response, **kwargs):
    return len(response)

# Custom class based evaluator to check for blocked words
class BlocklistEvaluator:
    def __init__(self, blocklist):
        self._blocklist = blocklist

    def __call__(self, *, answer: str, **kwargs):
        contains_block_word = any(word in answer for word in self._blocklist)
        return {"score": contains_block_word}

blocklist_evaluator = BlocklistEvaluator(blocklist=["bad", "worst", "terrible"])

# Test custom evaluator 1
result = response_length("The capital of Japan is Tokyo.")
print(result)

# Test custom evaluator 2
result = blocklist_evaluator(answer="The capital of Japan is Tokyo.")
print(result)

# Test custom evaluator 3
result = blocklist_evaluator(answer="This is a bad idea.")
print(result)

30
{'score': False}
{'score': True}


### 5.2 Prompt-Based Evaluator
To build your own prompt-based large language model evaluator or AI-assisted annotator, you can create a custom evaluator based on a Prompty file. This is a file with a `.prompty` extension that adheres to the [Prompty specification](https://prompty.io) - defining a prompt asset that contains both model configuration and content template, for your prompt-based evaluator.

**STEP ONE: Create a Prompty file**

Explore the [02-friendliness.prompty](02-friendliness.prompty) file in the current folder to see what that looks like

In [17]:
# STEP TWO: Define the evaluator class - this loads the prompty file and uses it as a "instruction prompt" to guide the Judge LLM to grade the app response

import os
import json
import sys
from promptflow.client import load_flow


class FriendlinessEvaluator:
    def __init__(self, model_config):
        current_dir = os.getcwd()
        prompty_path = os.path.join(current_dir, "02-friendliness.prompty")
        self._flow = load_flow(source=prompty_path, model={"configuration": model_config})

    def __call__(self, *, response: str, **kwargs):
        llm_response = self._flow(response=response)
        try:
            response = json.loads(llm_response)
        except Exception as ex:
            response = llm_response
        return response

In [18]:
# STEP THREE: Run the evaluator - give it a response to grade

friendliness_eval = FriendlinessEvaluator(model_config)

friendliness_score = friendliness_eval(response="I will not apologize for my behavior!")
pprint(friendliness_score)


{'reason': 'The response is mostly unfriendly, as it is defensive and lacks '
           'warmth or empathy.',
 'score': 2}


---

## 6. Run Multiple Evaluators


In [19]:
from azure.ai.evaluation import evaluate
from azure.ai.evaluation import (
    ContentSafetyEvaluator,
    RelevanceEvaluator,
    CoherenceEvaluator,
    GroundednessEvaluator,
    FluencyEvaluator,
    SimilarityEvaluator,
)

# Create evaluators
content_safety_evaluator = ContentSafetyEvaluator( azure_ai_project=azure_ai_project, credential=credential)
relevance_evaluator = RelevanceEvaluator(model_config)
coherence_evaluator = CoherenceEvaluator(model_config)
groundedness_evaluator = GroundednessEvaluator(model_config)
fluency_evaluator = FluencyEvaluator(model_config)
similarity_evaluator = SimilarityEvaluator(model_config)


result = evaluate(
    data="00-data/02-data.jsonl",
    evaluators={
        "content_safety": content_safety_evaluator,
        "coherence": coherence_evaluator,
        "relevance": relevance_evaluator,
        "groundedness": groundedness_evaluator,
        "fluency": fluency_evaluator,
        "similarity": similarity_evaluator,
    },
    evaluation_name="02-quality-evaluators",
    # column mapping
    evaluator_config={
        "content_safety": {"column_mapping": {"query": "${data.query}", "response": "${data.response}"}},
        "coherence": {"column_mapping": {"response": "${data.response}", "query": "${data.query}"}},
        "groundedness": {
            "column_mapping": {
                "query": "${data.query}",
                "context": "${data.ground_truth}",
                "ground_truth": "${data.ground_truth}",
                "response": "${data.response}"
            } 
        },
        "relevance": {
            "column_mapping": {
                "query": "${data.query}",
                "context": "${data.ground_truth}",
                "response": "${data.response}"
            } 
        },
        "fluency": {
            "column_mapping": {
                "query": "${data.query}",
                "context": "${data.ground_truth}",
                "response": "${data.response}"
            } 
        },
        "similarity": {
            "column_mapping": {
                "query": "${data.query}",
                "context": "${data.ground_truth}",
                "response": "${data.response}"
            } 
        },
    },

    # Specify the azure_ai_project to push results to portal
    azure_ai_project = azure_ai_project,
    
    # Specify the output path to push results also to local file
    output_path="./02-quality-evaluators.results.json"
)

Class ContentSafetyEvaluator: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class ViolenceEvaluator: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class SexualEvaluator: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class SelfHarmEvaluator: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class HateUnfairnessEvaluator: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
[2025-05-15 21:38:05 +0000][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run azure_ai_evaluation_evaluators_fluency_20250515_213805_811824, log path: /home/vscode/.promptflow/.runs/azure_ai_evaluation_evaluators_

2025-05-15 21:38:06 +0000   98702 execution.bulk     INFO     Current thread is not main thread, skip signal handler registration in BatchEngine.
2025-05-15 21:38:06 +0000   98702 execution.bulk     INFO     Finished 5 / 5 lines.
2025-05-15 21:38:06 +0000   98702 execution.bulk     INFO     Average execution time for completed lines: 0.02 seconds. Estimated time for incomplete lines: 0.0 seconds.
2025-05-15 21:38:06 +0000   98702 execution          ERROR    5/5 flow run failed, indexes: [2,0,3,1,4], exception of index 2: (UserError) SimilarityEvaluator: Either 'conversation' or individual inputs must be provided.

Run name: "azure_ai_evaluation_evaluators_similarity_20250515_213805_817765"
Run status: "Completed"
Start time: "2025-05-15 21:38:05.830371+00:00"
Duration: "0:00:01.509708"
Output path: "/home/vscode/.promptflow/.runs/azure_ai_evaluation_evaluators_similarity_20250515_213805_817765"

2025-05-15 21:38:08 +0000   98702 execution.bulk     INFO     Finished 1 / 5 lines.
2025-05

--- 

### 3.1 View Results Online

Just as before, you can now view the results of the multi-evaluator run using the Evaluation tab in the Azure AI Foundry Studio. Here is what you should see:



#### Quality Evaluation

![Quality](./../docs/img/screenshots/lab-02-portal-quality.png)

### 3.2 View Results Locally

---

## 4. Homework: Try It Yourself

1. Import the necessary evaluator
1. Invoke it with the relevant query/response parameters
1. Print the results - **observe them**. Do you agree with assessment?
1. Try changing the response - **re-evaluate** - Do you agree with the new assessment?
1. Think of a scenario where you would use this evaluator - **write it down**.

**Resources**:
1. [Documentation](https://learn.microsoft.com/en-us/azure/ai-foundry/concepts/evaluation-metrics-built-in?tabs=severity#generation-quality-metrics)
1. [API Reference](https://learn.microsoft.com/en-us/python/api/azure-ai-evaluation/azure.ai.evaluation?view=azure-python-preview)

---

## 🎉 | Congratulations!

You have successfully completed the second lab in this module and got hands-on experience with a core subset of the the built-in quality evaluators. You also got a sense of how to create and run a custom evaluator.