# Evaluate Using Performance & Quality Metrics

Contoso Zoo is developing an AI-powered smart guide app to enhance visitors' experiences. The app answers questions about wildlife. In this exercise, you will evaluate the model output for query and response pairs that reflect a conversation between a Contoso Zoo visitor and the smart guide AI assistant. The following is an example row of data for one question-and-answer pair from the Contoso Zoo app:

`query`

Why do flamingos stand on one leg?

`context`

Flamingos stand on one leg as a thermoregulatory behavior to minimize heat loss while wading in water. This posture also reduces muscle fatigue by engaging a passive locking mechanism in their joints, allowing them to rest efficiently.

`ground_truth`

Flamingos stand on one leg to conserve body heat and reduce muscle fatigue. Their joints lock in place, allowing them to balance effortlessly while resting. This helps them stay warm, especially when standing in cold water.

`response`

Fantastic observation! Flamingos stand on one leg to conserve body heat. By tucking one leg close to their body, they lose less heat through their limbs. Plus, their joints lock into place, so it's actually comfortable for them—like their version of lounging!

## Add environment variables to the .env file

In the root of the **Evaluation and Data Generation Workshop** folder is an `.env` file. Within the `.env` file, fill in the values for the environment variables. You can locate the values for each environment variable in the following locations of the [Azure AI Foundry](https://ai.azure.com) portal:

- `AZURE_SUBSCRIPTION_ID` - On the **Overview** page of your project within **Project details**.
- `AZURE_AI_PROJECT_NAME` - At the top of the **Overview** page for your project.
- `AZURE_OPENAI_RESOURCE_GROUP` - On the **Overview** page of the **Management Center** within **Project properties**.
- `AZURE_OPENAI_SERVICE` - On the **Overview** page of your project in the **Included capabilities** tab for **Azure OpenAI Service**.
- `AZURE_OPENAI_API_VERSION` - On the [API version lifecycle](https://learn.microsoft.com/azure/ai-services/openai/api-version-deprecation#latest-ga-api-release) webpage within the **Latest GA API release** section.
- `AZURE_OPENAI_ENDPOINT` - On the **Details** tab of your model deployment within **Endpoint** (i.e. **Target URI**).
- `AZURE_OPENAI_DEPLOYMENT_NAME` - On the **Details** tab of your model deployment within **Deployment info**.

## Sign in to Azure

As a security best practice, we'll use [keyless authentication](https://learn.microsoft.com/azure/developer/ai/keyless-connections?tabs=csharp%2Cazure-cli) to authenticate to Azure OpenAI with Microsoft Entra ID. Before you can do so, you'll first need to install the **Azure CLI** per the [installation instructions](https://learn.microsoft.com/cli/azure/install-azure-cli) for your operating system.

Next, open a terminal and run `az login` to sign in to your Azure account.

## Install the package

The evaluator classes for assessing performance and quality are in the Azure AI Evaluation SDK. We'll begin by installing the package.

In [None]:
%pip install azure-ai-evaluation
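
If you want to confirm that the install succeeded before moving on, a minimal sketch like the one below can help. It relies only on the standard library; `installed_version` is a hypothetical helper name, not part of the SDK.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Prints a version string if the package is installed, otherwise None.
print(installed_version("azure-ai-evaluation"))
```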

## Access the environment variables

We'll import `os` and `load_dotenv` so that you can access the environment variables.

In [None]:
import os
from dotenv import load_dotenv

load_dotenv()
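
Before continuing, it can help to confirm that the values actually loaded. The following is a minimal sketch rather than part of the workshop: `missing_env_vars` is a hypothetical helper, and the list of names is copied from the `.env` section above.

```python
import os

# Names taken from the .env section of this exercise.
REQUIRED_VARS = [
    "AZURE_SUBSCRIPTION_ID",
    "AZURE_AI_PROJECT_NAME",
    "AZURE_OPENAI_RESOURCE_GROUP",
    "AZURE_OPENAI_SERVICE",
    "AZURE_OPENAI_API_VERSION",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_DEPLOYMENT_NAME",
]


def missing_env_vars(names, env=None):
    """Return the names that are unset or blank in env (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not (env.get(name) or "").strip()]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing values in .env:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```

Run it after `load_dotenv()` so the values from the `.env` file are already present in `os.environ`.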

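## Set up keyless authentication

Rather than hardcode an API key, we'll use a keyless connection with Azure OpenAI. `DefaultAzureCredential` can pick up the identity you signed in with through `az login`, and `get_bearer_token_provider` returns a function that requests a Microsoft Entra ID token for the Cognitive Services scope.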

In [None]:
import azure.identity

credential = azure.identity.DefaultAzureCredential()
token_provider = azure.identity.get_bearer_token_provider(credential, "https://cognitiveservices.azure.com/.default")

token = token_provider()

## Configure the model_config

The AI-assisted evaluators use a model deployment to judge each response, so they take a `model_config` when you create an instance. Let's configure the `model_config` with the following:

- Azure OpenAI endpoint
- Microsoft Entra ID token from the previous step, passed in the `api_key` field
- Azure deployment name

In [None]:
model_config = {
    "azure_endpoint": os.environ.get("AZURE_OPENAI_ENDPOINT"),
    "api_key": token,
    "azure_deployment": os.environ.get("AZURE_OPENAI_DEPLOYMENT_NAME")
}

## Create variables for the evaluation data

Since we'll be using the same context, query, response, and ground truth for the exercises, we'll create a variable to store each string and pass the variables into our evaluations.

In [None]:
query = "Why do flamingos stand on one leg?"
context = "Flamingos stand on one leg as a thermoregulatory behavior to minimize heat loss while wading in water. This posture also reduces muscle fatigue by engaging a passive locking mechanism in their joints, allowing them to rest efficiently."
ground_truth = "Flamingos stand on one leg to conserve body heat and reduce muscle fatigue. Their joints lock in place, allowing them to balance effortlessly while resting. This helps them stay warm, especially when standing in cold water."
response = "Fantastic observation! Flamingos stand on one leg to conserve body heat. By tucking one leg close to their body, they lose less heat through their limbs. Plus, their joints lock into place, so it's actually comfortable for them—like their version of lounging!"

## Evaluate for Groundedness

Groundedness measures how well the response is supported by the provided context. Create an instance of the `GroundednessEvaluator` and run the evaluation.

In [None]:
from azure.ai.evaluation import GroundednessEvaluator

groundedness_eval = GroundednessEvaluator(model_config)

groundedness_score = groundedness_eval(
    response=response,
    context=context,
)

print(groundedness_score)
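
The AI-assisted quality evaluators in this exercise score on a 1 to 5 scale and return a dictionary. One way to turn that score into a pass or fail check is sketched below. This is an illustration under assumptions: `meets_threshold` is a hypothetical helper, the default threshold of 3 is an arbitrary choice, and the sketch assumes the score is stored under a key named after the metric (for example `groundedness`). Print the real result first to confirm its keys.

```python
def meets_threshold(result, metric, threshold=3.0):
    """Return True when result[metric] is a number at or above threshold."""
    score = result.get(metric)
    if not isinstance(score, (int, float)):
        return False  # missing or non-numeric score counts as a failure
    return score >= threshold


# Illustrative stand-in for an evaluator result, not real output.
sample_result = {"groundedness": 4.0}
print(meets_threshold(sample_result, "groundedness"))
```

The same helper could be applied to the other 1 to 5 metrics by passing their result dictionaries and metric names.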

## Evaluate for Relevance

Relevance measures how well the response addresses the query. Create an instance of the `RelevanceEvaluator` and run the evaluation.

In [None]:
from azure.ai.evaluation import RelevanceEvaluator

relevance_eval = RelevanceEvaluator(model_config)

relevance_score = relevance_eval(
    response=response,
    context=context,
    query=query
)

print(relevance_score)

## Evaluate for Coherence

Coherence measures how logically the response flows and how well its ideas connect. Create an instance of the `CoherenceEvaluator` and run the evaluation.

In [None]:
from azure.ai.evaluation import CoherenceEvaluator

coherence_eval = CoherenceEvaluator(model_config)

coherence_score = coherence_eval(
    response=response,
    query=query
)

print(coherence_score)

## Evaluate for Fluency

Fluency measures the grammatical and linguistic quality of the response. Create an instance of the `FluencyEvaluator` and run the evaluation.

In [None]:
from azure.ai.evaluation import FluencyEvaluator

fluency_eval = FluencyEvaluator(model_config)

fluency_score = fluency_eval(
    response=response,
    query=query
)

print(fluency_score)

## Evaluate for Similarity

Similarity measures how closely the response matches the ground truth in meaning. Create an instance of the `SimilarityEvaluator` and run the evaluation.

In [None]:
from azure.ai.evaluation import SimilarityEvaluator

similarity_eval = SimilarityEvaluator(model_config)

similarity_score = similarity_eval(
    response=response,
    query=query,
    ground_truth=ground_truth
)

print(similarity_score)

## Evaluate for F1 Score

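The F1 score measures the overlap of words between the response and the ground truth. Unlike the previous evaluators, it's computed directly from the text, so it doesn't need a `model_config`. Create an instance of the `F1ScoreEvaluator` and run the evaluation.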

In [None]:
from azure.ai.evaluation import F1ScoreEvaluator

f1_eval = F1ScoreEvaluator()

f1_score = f1_eval(
    response=response,
    ground_truth=ground_truth
)

print(f1_score)
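
To build intuition for the number printed above, here is a simplified sketch of a token-overlap F1 calculation. It is not the SDK's implementation: it assumes plain lowercase, whitespace-separated tokens, whereas the SDK applies its own text normalization, so the two scores can differ. `token_f1` is a hypothetical name.

```python
from collections import Counter


def token_f1(response, ground_truth):
    """Token-overlap F1 using lowercase whitespace tokens (simplified)."""
    response_tokens = response.lower().split()
    truth_tokens = ground_truth.lower().split()
    if not response_tokens or not truth_tokens:
        return 0.0
    # Multiset intersection: a shared token counts only as often as it
    # appears in both texts.
    shared = sum((Counter(response_tokens) & Counter(truth_tokens)).values())
    if shared == 0:
        return 0.0
    precision = shared / len(response_tokens)  # share of response tokens that match
    recall = shared / len(truth_tokens)        # share of ground truth tokens recovered
    return 2 * precision * recall / (precision + recall)


print(token_f1("flamingos conserve body heat", "flamingos conserve heat"))
```

Because F1 is the harmonic mean of precision and recall, a long response that contains every ground truth word still scores below 1 when it adds many extra words.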

## Evaluate a dataset

To evaluate the quality of the responses from the app, a synthetic question-answer dataset was generated that covers various aspects of wildlife, including animal behavior, habitat preferences, and dietary needs.

You'll now evaluate the model output for the conversation pairs. The following is an example row of data for one question-and-answer pair from the Contoso Zoo app:

`query`

"Wow, the lion is much bigger than I expected! How much do they typically weigh?"

`response`

"Lions are indeed impressive creatures! Their weight varies depending on their sex and geographic location. Here's a general breakdown:\n\n- **Male lions** typically weigh between **330 to 550 pounds (150 to 250 kilograms)**.\n- **Female lions**, being smaller, usually weigh between **260 to 400 pounds (120 to 180 kilograms)**.\n\nLions in certain regions, such as southern Africa, tend to be larger on average compared to those in other parts of their range. Their size and muscular build are part of what makes them such formidable predators!"

We can run an evaluation for a dataset with the `evaluate` function. In addition, we can run the evaluation using multiple evaluators. In our case, we're going to run an evaluation for the dataset within the `zoo_app_conversations.jsonl` file. We'll also output the results to a new `evaluation_results.json` file.
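
A JSONL file holds one JSON object per line, and the evaluators used here read the `query` and `response` fields from each row. A quick pre-flight check, sketched below with the standard library only, can catch rows that lack a field before you spend model calls on them. `check_jsonl_columns` is a hypothetical helper, and the demo rows are made up so that the sketch runs without the workshop data.

```python
import json
import os
import tempfile


def check_jsonl_columns(path, required=("query", "response")):
    """Return (line_number, missing_keys) for each row lacking a required key."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # skip blank lines
            row = json.loads(line)
            missing = [key for key in required if key not in row]
            if missing:
                problems.append((line_number, missing))
    return problems


# Demo on a throwaway file; the second row deliberately has no response.
demo_rows = [
    {"query": "Why do flamingos stand on one leg?", "response": "To conserve body heat."},
    {"query": "A row with no response"},
]
with tempfile.NamedTemporaryFile(
    "w", suffix=".jsonl", delete=False, encoding="utf-8"
) as tmp:
    for row in demo_rows:
        tmp.write(json.dumps(row) + "\n")
    demo_path = tmp.name

problems = check_jsonl_columns(demo_path)
os.remove(demo_path)
print(problems)  # [(2, ['response'])]
```

To check the workshop dataset, call `check_jsonl_columns("zoo_app_conversations.jsonl")`; an empty list means every row has both fields.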

Let's run an evaluation using the `Relevance`, `Coherence`, and `Fluency` evaluators.

In [None]:
from azure.ai.evaluation import evaluate

path = "zoo_app_conversations.jsonl"

result = evaluate(
    data=path,  # path to the JSONL dataset
    evaluators={
        "relevance": relevance_eval,
        "coherence": coherence_eval,
        "fluency": fluency_eval
    },
    # Map each evaluator input to a column in the dataset
    evaluator_config={
        "default": {
            "column_mapping": {
                "query": "${data.query}",
                "response": "${data.response}"
            }
        }
    },
    # Write the results to a new file
    output_path="./evaluation_results.json"
)

## Print the results with Pretty Print

Now that we've run the evaluation, let's print the results with Python's `pprint` module, which formats nested data structures so that they're easier to read.

In [None]:
from pprint import pprint
pprint(result)

## Print the results as a table

We can also display the per-row results as a table using `pandas`.

In [None]:
import pandas as pd
pd.DataFrame(result["rows"])
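
The table shows one score per row and evaluator. To summarize a run, one option is to average each score column, as in the sketch below. It makes assumptions you should verify against your own output: that score columns start with `outputs.` and hold numeric values. `mean_scores` is a hypothetical helper, and the sample rows are illustrative, not real results.

```python
from statistics import mean


def mean_scores(rows, prefix="outputs."):
    """Average every numeric column whose name starts with prefix."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
            if key.startswith(prefix) and is_number:
                columns.setdefault(key, []).append(value)
    return {key: mean(values) for key, values in columns.items()}


# Illustrative rows only; real column names and values come from result["rows"].
sample_rows = [
    {"inputs.query": "q1", "outputs.relevance.relevance": 4, "outputs.fluency.fluency": 5},
    {"inputs.query": "q2", "outputs.relevance.relevance": 2, "outputs.fluency.fluency": 3},
]
print(mean_scores(sample_rows))
```

With real data, you would call `mean_scores(result["rows"])`. Non-numeric columns, such as the text of a query, are skipped.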

## Delete resources

If you've finished exploring Azure AI Services, delete the Azure resource that you created during the workshop.

**Note**: You may be prompted to delete your deployed model(s) before deleting the resource group.