# Evaluate Using Performance & Quality Metrics

Contoso Home Furnishings is developing an app that generates product descriptions for their selection of furniture. The app aims to generate engaging product descriptions based on the manufacturer's specification of the furniture.

In this exercise, you will evaluate the generated product description using performance and quality metrics. Below is an example row of data for the description generated for the Contoso Home Furnishings Dining Chair:

`context`

Dining chair. Wooden seat. Four legs. Backrest. Brown. 18" wide, 20" deep, 35" tall. Holds 250 lbs.

`query`

Given the product specification for the Contoso Home Furnishings Dining Chair, provide a product description.

`ground_truth`

The dining chair is brown and wooden with four legs and a backrest. The dimensions are 18" wide, 20" deep, 35" tall. The dining chair has a weight capacity of 250 lbs.

`response`

Introducing our timeless wooden dining chair, designed for both comfort and durability. Crafted with a solid wood seat and sturdy four-legged base, this chair offers reliable support for up to 250 lbs. The smooth brown finish adds a touch of rustic elegance, while the ergonomically shaped backrest ensures a comfortable dining experience. Measuring 18" wide, 20" deep, and 35" tall, it's the perfect blend of form and function, making it a versatile addition to any dining space. Elevate your home with this beautifully simple yet sophisticated seating option.


## Add environment variables to the .env file

The root of the **Evaluation and Data Generation Workshop** folder contains a `.env` file. Fill in the value of each environment variable in that file. You can find the values in the following locations of the [Azure AI Foundry](https://ai.azure.com) portal:

- `AZURE_SUBSCRIPTION_ID` - On the **Overview** page of your project within **Project details**.
- `AZURE_AI_PROJECT_NAME` - At the top of the **Overview** page for your project.
- `AZURE_OPENAI_RESOURCE_GROUP` - On the **Overview** page of the **Management Center** within **Project properties**.
- `AZURE_OPENAI_SERVICE` - On the **Overview** page of your project in the **Included capabilities** tab for **Azure OpenAI Service**.
- `AZURE_OPENAI_API_VERSION` - On the [API version lifecycle](https://learn.microsoft.com/azure/ai-services/openai/api-version-deprecation#latest-ga-api-release) webpage within the **Latest GA API release** section.
- `AZURE_OPENAI_ENDPOINT` - On the **Details** tab of your model deployment within **Endpoint** (i.e., **Target URI**).
- `AZURE_OPENAI_DEPLOYMENT_NAME` - On the **Details** tab of your model deployment within **Deployment info**.
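
A missing value tends to surface later as a confusing authentication or endpoint error, so it can help to check the variables up front. The following is a minimal sketch, not part of the workshop code: the helper name `missing_env_vars` is hypothetical, and it assumes the variables have already been loaded into the process environment.

```python
import os

# Variable names taken from the list above.
REQUIRED_VARS = [
    "AZURE_SUBSCRIPTION_ID",
    "AZURE_AI_PROJECT_NAME",
    "AZURE_OPENAI_RESOURCE_GROUP",
    "AZURE_OPENAI_SERVICE",
    "AZURE_OPENAI_API_VERSION",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_DEPLOYMENT_NAME",
]


def missing_env_vars(env=None):
    """Return the required variable names that are unset or blank (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not (env.get(name) or "").strip()]


missing = missing_env_vars()
if missing:
    print("Missing values for:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```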

## Sign in to Azure

As a security best practice, we'll use [keyless authentication](https://learn.microsoft.com/azure/developer/ai/keyless-connections?tabs=csharp%2Cazure-cli) to authenticate to Azure OpenAI with Microsoft Entra ID. Before you can do so, you'll first need to install the **Azure CLI** per the [installation instructions](https://learn.microsoft.com/cli/azure/install-azure-cli) for your operating system.

Next, open a terminal and run `az login` to sign in to your Azure account.

## Install the package

The evaluator classes for assessing performance and quality are in the Azure AI Evaluation SDK. We'll begin by installing the package.

In [1]:
%pip install azure-ai-evaluation

Defaulting to user installation because normal site-packages is not writeable

[notice] A new release of pip is available: 24.0 -> 25.1.1
[notice] To update, run: pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.


## Access the environment variables

We'll import `os` and `load_dotenv` so that you can access the environment variables.

In [2]:
import os
from dotenv import load_dotenv

load_dotenv()

True

## Set up keyless authentication

Rather than hardcode your **key**, we'll use a keyless connection with Azure OpenAI.

In [3]:
import azure.identity

credential = azure.identity.DefaultAzureCredential()
token_provider = azure.identity.get_bearer_token_provider(credential, "https://cognitiveservices.azure.com/.default")

token = token_provider()

## Configure the model_config

The AI-assisted evaluators used below require a `model_config` when you create an instance of the evaluator class. Let's configure the `model_config` with the following:

- Azure OpenAI endpoint
- Azure OpenAI API key (here, the Microsoft Entra ID token obtained in the previous step)
- Azure deployment

In [4]:
model_config = {
    "azure_endpoint": os.environ.get("AZURE_OPENAI_ENDPOINT"),
    "api_key": token,
    "azure_deployment": os.environ.get("AZURE_OPENAI_DEPLOYMENT_NAME")
}

## Create variables for the evaluation data

Since we'll be using the same context, query, response, and ground truth for the exercises, we'll create a variable to store each string and pass the variables into our evaluations.

In [5]:
context = "Dining chair. Wooden seat. Four legs. Backrest. Brown. 18\" wide, 20\" deep, 35\" tall. Holds 250 lbs."
query = "Given the product specification for the Contoso Home Furnishings Dining Chair, provide an engaging marketing product description."
ground_truth = "The dining chair is brown and wooden with four legs and a backrest. The dimensions are 18\" wide, 20\" deep, 35\" tall. The dining chair has a weight capacity of 250 lbs."
response = "Introducing our timeless wooden dining chair, designed for both comfort and durability. Crafted with a solid wood seat and sturdy four-legged base, this chair offers reliable support for up to 250 lbs. The smooth brown finish adds a touch of rustic elegance, while the ergonomically shaped backrest ensures a comfortable dining experience. Measuring 18\" wide, 20\" deep, and 35\" tall, it's the perfect blend of form and function, making it a versatile addition to any dining space. Elevate your home with this beautifully simple yet sophisticated seating option."

## Evaluate for Groundedness

Create an instance of the `GroundednessEvaluator` and run the evaluation.



In [6]:
from azure.ai.evaluation import GroundednessEvaluator

groundedness_eval = GroundednessEvaluator(model_config)

groundedness_score = groundedness_eval(
    response=response,
    context=context,
)

print(groundedness_score)

[INFO] Could not import AIAgentConverter. Please install the dependency with `pip install azure-ai-projects`.
[INFO] Could not import SKAgentConverter. Please install the dependency with `pip install semantic-kernel`.
{'groundedness': 3.0, 'gpt_groundedness': 3.0, 'groundedness_reason': 'The RESPONSE is accurate in conveying the details from the CONTEXT but introduces unsupported additions, making it grounded but not fully faithful to the source material.', 'groundedness_result': 'pass', 'groundedness_threshold': 3}
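
Each AI-assisted evaluator returns a score, a threshold, and a pass/fail label. In the output above, a score of 3.0 against a threshold of 3 is labeled `pass`, which suggests the label means "score at or above the threshold". The sketch below encodes that reading as an assumption; `summarize_result` is a hypothetical helper, not an SDK function, and it only handles metrics whose keys follow the `<metric>`, `<metric>_result`, `<metric>_threshold` pattern.

```python
def summarize_result(result, metric):
    """Pull the score, threshold, and label for one metric (hypothetical helper)."""
    score = result[metric]
    threshold = result[f"{metric}_threshold"]
    label = result[f"{metric}_result"]
    # Assumption: a score at or above the threshold is labeled "pass".
    consistent = (score >= threshold) == (label == "pass")
    return {
        "score": score,
        "threshold": threshold,
        "label": label,
        "consistent_with_assumption": consistent,
    }


# Values copied from the groundedness output above.
sample = {
    "groundedness": 3.0,
    "gpt_groundedness": 3.0,
    "groundedness_result": "pass",
    "groundedness_threshold": 3,
}
print(summarize_result(sample, "groundedness"))
```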


## Evaluate for Relevance

Create an instance of the `RelevanceEvaluator` and run the evaluation.

In [7]:
from azure.ai.evaluation import RelevanceEvaluator

relevance_eval = RelevanceEvaluator(model_config)

relevance_score = relevance_eval(
    response=response,
    context=context,
    query=query
)

print(relevance_score)

{'relevance': 5.0, 'gpt_relevance': 5.0, 'relevance_reason': 'The RESPONSE fully addresses the QUERY with accurate and complete information, while also adding relevant insights to make the description engaging and appealing for marketing purposes.', 'relevance_result': 'pass', 'relevance_threshold': 3}


## Evaluate for Coherence

Create an instance of the `CoherenceEvaluator` and run the evaluation.

In [8]:
from azure.ai.evaluation import CoherenceEvaluator

coherence_eval = CoherenceEvaluator(model_config)

coherence_score = coherence_eval(
    response=response,
    query=query
)

print(coherence_score)

{'coherence': 4.0, 'gpt_coherence': 4.0, 'coherence_reason': 'The RESPONSE is coherent, effectively addressing the QUERY with a clear and logical presentation of ideas. It provides all relevant details in an engaging manner, making it suitable for marketing purposes.', 'coherence_result': 'pass', 'coherence_threshold': 3}


## Evaluate for Fluency

Create an instance of the `FluencyEvaluator` and run the evaluation.

In [9]:
from azure.ai.evaluation import FluencyEvaluator

fluency_eval = FluencyEvaluator(model_config)

fluency_score = fluency_eval(
    response=response,
    query=query
)

print(fluency_score)

{'fluency': 5.0, 'gpt_fluency': 5.0, 'fluency_reason': 'The RESPONSE is articulate, cohesive, and uses sophisticated language with varied sentence structures, making it highly readable and engaging. It reflects a high level of fluency and style.', 'fluency_result': 'pass', 'fluency_threshold': 3}


## Evaluate for Similarity

Create an instance of the `SimilarityEvaluator` and run the evaluation.

In [10]:
from azure.ai.evaluation import SimilarityEvaluator

similarity_eval = SimilarityEvaluator(model_config)

similarity_score = similarity_eval(
    response=response,
    query=query,
    ground_truth=ground_truth
)

print(similarity_score)

{'similarity': 5.0, 'gpt_similarity': 5.0, 'similarity_result': 'pass', 'similarity_threshold': 3}


## Evaluate for F1 Score

Create an instance of the `F1ScoreEvaluator` and run the evaluation.

In [11]:
from azure.ai.evaluation import F1ScoreEvaluator

f1_eval = F1ScoreEvaluator()

f1_score = f1_eval(
    response=response,
    ground_truth=ground_truth
)

print(f1_score)

{'f1_score': 0.35185185185185186, 'f1_result': 'fail', 'f1_threshold': 0.5}
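
The F1 score is the harmonic mean of precision and recall computed over the tokens that the response and the ground truth share. The sketch below illustrates that idea with a deliberately simple tokenizer. It is an assumption-laden illustration rather than the SDK's implementation, so it is not expected to reproduce the exact value printed above.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep runs of letters or digits (a simplification)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_f1(response, ground_truth):
    """F1 over the multiset of tokens shared by the two texts."""
    response_tokens = tokenize(response)
    truth_tokens = tokenize(ground_truth)
    shared = sum((Counter(response_tokens) & Counter(truth_tokens)).values())
    if shared == 0:
        return 0.0
    precision = shared / len(response_tokens)  # share of response tokens that match
    recall = shared / len(truth_tokens)        # share of ground-truth tokens recovered
    return 2 * precision * recall / (precision + recall)


# Two of four tokens match on each side, so precision = recall = 0.5.
print(token_f1("a brown wooden chair", "the chair is brown"))  # 0.5
```

A long, embellished response lowers precision even when it recovers most of the ground truth, which is one reason a marketing-style description can score below the threshold here.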


## Evaluate for ROUGE

There are several types of ROUGE metrics: `ROUGE_1`, `ROUGE_2`, `ROUGE_3`, `ROUGE_4`, `ROUGE_5`, and `ROUGE_L`.

The first five types are **ROUGE-N** metrics, which measure the overlap of n-grams (contiguous sequences of n words) between the generated summary and the reference summary. For example, `ROUGE_1` measures the overlap of unigrams (single words), and `ROUGE_2` measures the overlap of bigrams (two-word sequences). The evaluator supports n-grams up to n = 5.

`ROUGE_L` measures the longest common subsequence (LCS) between the generated and reference summaries. LCS takes sequence similarity into account while maintaining word order, which makes `ROUGE_L` effective at capturing sentence-level structure.
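
To make the two ideas concrete, here is a small pure-Python sketch of n-gram overlap and LCS length. It uses whitespace tokenization and made-up sentences, and it is an illustration of the definitions above rather than the evaluator's actual code; the function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def rouge_n(generated, reference, n):
    """Precision, recall, and F1 of the n-gram overlap between two texts."""
    gen = Counter(ngrams(generated.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    overlap = sum((gen & ref).values())
    precision = overlap / sum(gen.values()) if gen else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    previous = [0] * (len(b) + 1)
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


generated = "the brown chair has four legs"
reference = "the chair has four sturdy legs"
print(rouge_n(generated, reference, 1))  # unigram overlap
print(rouge_n(generated, reference, 2))  # bigram overlap
print(lcs_length(generated.split(), reference.split()))  # 5
```

In this example, five of the six unigrams match but only two of the five bigrams do, while the LCS still has length 5 because the shared words appear in the same order in both sentences.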

Create an instance of the `RougeScoreEvaluator` and run the evaluation.

In [12]:
from azure.ai.evaluation import RougeScoreEvaluator, RougeType

rouge_eval = RougeScoreEvaluator(rouge_type=RougeType.ROUGE_1)

rouge_score = rouge_eval(
    response=response,
    ground_truth=ground_truth,
)

print(rouge_score)

{'rouge_precision': 0.2777777777777778, 'rouge_recall': 0.78125, 'rouge_f1_score': 0.40983606557377056, 'rouge_precision_result': 'fail', 'rouge_recall_result': 'pass', 'rouge_f1_score_result': 'fail', 'rouge_precision_threshold': 0.5, 'rouge_recall_threshold': 0.5, 'rouge_f1_score_threshold': 0.5}


## Evaluate for BLEU

Create an instance of the `BleuScoreEvaluator` and run the evaluation.

**Note**: The initial run may install a package. If this occurs, run the cell once more to receive the BLEU score.

In [13]:
from azure.ai.evaluation import BleuScoreEvaluator

bleu_eval = BleuScoreEvaluator()

bleu_score = bleu_eval(
    response=response,
    ground_truth=ground_truth
)

print(bleu_score)

{'bleu_score': 0.10903931692423613, 'bleu_result': 'fail', 'bleu_threshold': 0.5}


## Evaluate for METEOR

The METEOR metric takes `alpha`, `beta`, and `gamma` parameters, which control the balance between precision, recall, and the penalty for incorrect word order (the fragmentation penalty). These parameters influence how the final METEOR score is calculated, helping fine-tune its sensitivity to different aspects of translation or summary quality.
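
The sketch below shows how the three parameters enter the score, following the formulation used by NLTK's METEOR implementation. Whether the evaluator computes the score in exactly this way is an assumption, and the word matching and chunk counting are left out: the hypothetical function `meteor_from_counts` takes those counts as inputs.

```python
def meteor_from_counts(matches, response_len, reference_len, chunks,
                       alpha=0.9, beta=3.0, gamma=0.5):
    """Combine match statistics into a METEOR-style score (illustrative only)."""
    if matches == 0:
        return 0.0
    precision = matches / response_len
    recall = matches / reference_len
    # alpha close to 1 pushes the weighted harmonic mean toward recall.
    f_mean = (precision * recall) / (alpha * precision + (1 - alpha) * recall)
    # Matched words that fall into many short chunks indicate scrambled order;
    # gamma caps the penalty and beta controls how sharply it grows.
    penalty = gamma * (chunks / matches) ** beta
    return f_mean * (1 - penalty)


# Same matches, different word order: fewer chunks means a smaller penalty.
print(meteor_from_counts(matches=4, response_len=8, reference_len=4, chunks=1))
print(meteor_from_counts(matches=4, response_len=8, reference_len=4, chunks=2))
```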

Create an instance of the `MeteorScoreEvaluator` and run the evaluation.

In [14]:
from azure.ai.evaluation import MeteorScoreEvaluator

meteor_eval = MeteorScoreEvaluator(
    alpha=0.9,
    beta=3.0,
    gamma=0.5
)

meteor_score = meteor_eval(
    response=response,
    ground_truth=ground_truth,
)

print(meteor_score)

{'meteor_score': 0.5252285661368535, 'meteor_result': 'pass', 'meteor_threshold': 0.5}


## Evaluate for GLEU

Create an instance of the `GleuScoreEvaluator` and run the evaluation.

In [15]:
from azure.ai.evaluation import GleuScoreEvaluator

gleu_eval = GleuScoreEvaluator()

gleu_score = gleu_eval(
    response=response,
    ground_truth=ground_truth,
)

print(gleu_score)

{'gleu_score': 0.13658536585365855, 'gleu_result': 'fail', 'gleu_threshold': 0.5}


## Evaluate on a test dataset

We can evaluate an entire dataset with the `evaluate` function, which can also run several evaluators in a single pass. In our case, we're going to run a few evaluators on the product description dataset in the `performance-quality-data.jsonl` file. We'll also output the results to a new `evaluation_results.json` file.

Let's run an evaluation using the `Relevance`, `Groundedness`, and `Fluency` evaluators.
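
Before submitting a dataset, it can be worth confirming that every row carries the fields the evaluators will read. The following sketch checks a JSONL file for the four fields used in this exercise; `find_incomplete_rows` is a hypothetical helper, and the demo rows are made up for illustration.

```python
import json
import os
import tempfile

REQUIRED_FIELDS = ("query", "response", "context", "ground_truth")


def find_incomplete_rows(path):
    """Return (line_number, missing_fields) for rows lacking a required field."""
    problems = []
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # ignore blank lines
            row = json.loads(line)
            missing = [field for field in REQUIRED_FIELDS if field not in row]
            if missing:
                problems.append((line_number, missing))
    return problems


# Demo with a throwaway file so the sketch runs without the workshop data.
demo_rows = [
    {"query": "q1", "response": "r1", "context": "c1", "ground_truth": "g1"},
    {"query": "q2", "response": "r2"},
]
with tempfile.NamedTemporaryFile(
    "w", suffix=".jsonl", delete=False, encoding="utf-8"
) as tmp:
    for demo_row in demo_rows:
        tmp.write(json.dumps(demo_row) + "\n")

print(find_incomplete_rows(tmp.name))  # [(2, ['context', 'ground_truth'])]
os.remove(tmp.name)
```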

In [None]:
from azure.ai.evaluation import evaluate

path = "performance-quality-data.jsonl"

result = evaluate(
    data=path, # provide your data here
    evaluators={
        "relevance": relevance_eval,
        "groundedness": groundedness_eval,
        "fluency": fluency_eval
    },
    # column mapping
    evaluator_config={
        "default": {
            "query": "${data.query}",
            "response": "${data.response}",
            "context": "${data.context}",
            "ground_truth": "${data.ground_truth}"
        }
    },
    # write the results to a local JSON file
    output_path="./evaluation_results.json"
)

[2025-07-03 16:11:52 +0000][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run azure_ai_evaluation_evaluators_relevance_20250703_161151_983343, log path: /home/vscode/.promptflow/.runs/azure_ai_evaluation_evaluators_relevance_20250703_161151_983343/logs.txt
[2025-07-03 16:11:52 +0000][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run azure_ai_evaluation_evaluators_fluency_20250703_161151_983952, log path: /home/vscode/.promptflow/.runs/azure_ai_evaluation_evaluators_fluency_20250703_161151_983952/logs.txt
[2025-07-03 16:11:52 +0000][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run azure_ai_evaluation_evaluators_groundedness_20250703_161151_983710, log path: /home/vscode/.promptflow/.runs/azure_ai_evaluation_evaluators_groundedness_20250703_161151_983710/logs.txt


2025-07-03 16:11:55 +0000   12649 execution.bulk     INFO     Finished 3 / 3 lines.
2025-07-03 16:11:55 +0000   12649 execution.bulk     INFO     Average execution time for completed lines: 1.0 seconds. Estimated time for incomplete lines: 0.0 seconds.
2025-07-03 16:11:52 +0000   12649 execution.bulk     INFO     Current thread is not main thread, skip signal handler registration in BatchEngine.
2025-07-03 16:11:54 +0000   12649 execution.bulk     INFO     Finished 1 / 3 lines.
2025-07-03 16:11:54 +0000   12649 execution.bulk     INFO     Average execution time for completed lines: 2.53 seconds. Estimated time for incomplete lines: 5.06 seconds.
2025-07-03 16:11:54 +0000   12649 execution.bulk     INFO     Finished 2 / 3 lines.
2025-07-03 16:11:54 +0000   12649 execution.bulk     INFO     Average execution time for completed lines: 1.35 seconds. Estimated time for incomplete lines: 1.35 seconds.
2025-07-03 16:11:54 +0000   12649 execution.bulk     INFO     Finished 3 / 3 lines.
2025-07

{'metrics': {'fluency.binary_aggregate': 1.0,
             'fluency.fluency': 4.333333333333333,
             'fluency.fluency_threshold': 3.0,
             'fluency.gpt_fluency': 4.333333333333333,
             'groundedness.binary_aggregate': 1.0,
             'groundedness.gpt_groundedness': 3.0,
             'groundedness.groundedness': 3.0,
             'groundedness.groundedness_threshold': 3.0,
             'relevance.binary_aggregate': 1.0,
             'relevance.gpt_relevance': 3.6666666666666665,
             'relevance.relevance': 3.6666666666666665,
             'relevance.relevance_threshold': 3.0},
 'rows': [{'inputs.context': 'Couch. Fabric upholstery. Three seats. Wooden '
                             'frame. Grey. 85" wide, 35" deep, 32" tall. Holds '
                             '750 lbs.',
           'inputs.ground_truth': 'The couch has a wood frame with gray '
                                  'upholstered fabric. There are 3 seats on '
                           

## Print the results with Pretty Print

Now that we've run the evaluation, let's print the results using Pretty Print, which displays data in a structured and visually appealing way, making it easier to read and understand.

In [17]:
from pprint import pprint
pprint(result)

## Print the results as a table

We can also print the results as a table using pandas.

In [18]:
import pandas as pd
pd.DataFrame(result["rows"])

| | inputs.query | inputs.response | inputs.context | inputs.ground_truth | outputs.relevance.relevance | outputs.relevance.gpt_relevance | outputs.relevance.relevance_reason | outputs.relevance.relevance_result | outputs.relevance.relevance_threshold | outputs.groundedness.groundedness | outputs.groundedness.gpt_groundedness | outputs.groundedness.groundedness_reason | outputs.groundedness.groundedness_result | outputs.groundedness.groundedness_threshold | outputs.fluency.fluency | outputs.fluency.gpt_fluency | outputs.fluency.fluency_reason | outputs.fluency.fluency_result | outputs.fluency.fluency_threshold |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Given the product specfication for the Contoso... | Sink into comfort with this stylish grey three... | Couch. Fabric upholstery. Three seats. Wooden ... | The couch has a wood frame with gray upholster... | 4 | 4 | The RESPONSE fully addresses the QUERY with al... | pass | 3 | 3 | 3 | The RESPONSE is grounded in the CONTEXT but in... | pass | 3 | 4 | 4 | The RESPONSE is well-articulated, coherent, an... | pass | 3 |
| 1 | Given the product specfication for the Contoso... | Elevate your living space with this modern rou... | Coffee table. Glass top. Metal frame. Round. B... | The coffee table has a metal frame and glass t... | 4 | 4 | The RESPONSE fully addresses the QUERY with ac... | pass | 3 | 3 | 3 | The RESPONSE is grounded in the CONTEXT but in... | pass | 3 | 5 | 5 | The RESPONSE should get a Score of 5 because i... | pass | 3 |
| 2 | Given the product specfication for the Contoso... | Boost your productivity with this versatile de... | Desk. Wooden surface. Metal legs. Adjustable h... | The desk has a wooden surface and metal legs. ... | 3 | 3 | The RESPONSE addresses the QUERY by providing ... | pass | 3 | 3 | 3 | The RESPONSE is grounded in the CONTEXT but in... | pass | 3 | 4 | 4 | The RESPONSE demonstrates proficient fluency w... | pass | 3 |
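
The dataset-level metrics printed earlier line up with simple aggregates of these per-row values: the mean relevance of 4, 4, and 3 is about 3.67, and the mean fluency of 4, 5, and 4 is about 4.33. The sketch below recomputes them. Treating the `binary_aggregate` metric as the share of rows labeled `pass` is an assumption that happens to agree with the values shown, not a documented definition.

```python
from statistics import mean

# Per-row values copied from the table above.
rows = [
    {"relevance": 4, "relevance_result": "pass", "fluency": 4},
    {"relevance": 4, "relevance_result": "pass", "fluency": 5},
    {"relevance": 3, "relevance_result": "pass", "fluency": 4},
]

relevance_mean = mean(row["relevance"] for row in rows)
fluency_mean = mean(row["fluency"] for row in rows)
# Assumption: the binary aggregate is the fraction of rows labeled "pass".
relevance_pass_rate = sum(row["relevance_result"] == "pass" for row in rows) / len(rows)

print(round(relevance_mean, 2), round(fluency_mean, 2), relevance_pass_rate)
```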


## Delete resources

If you've finished exploring Azure AI Services, delete the Azure resource that you created during the workshop.

**Note**: You may be prompted to delete your deployed model(s) before deleting the resource group.