Merged
6 changes: 3 additions & 3 deletions docs/ai/evaluation/evaluate-ai-response.md
@@ -1,7 +1,7 @@
---
title: Quickstart - Evaluate the quality of a model's response
description: Learn how to create an MSTest app to evaluate the AI chat response of a language model.
ms.date: 03/03/2026
ms.date: 04/09/2026
ms.topic: quickstart
ai-usage: ai-assisted
---
@@ -53,7 +53,7 @@ Complete the following steps to create an MSTest project that connects to an AI
dotnet user-secrets set AZURE_TENANT_ID <your-tenant-ID>
```

(Depending on your environment, the tenant ID might not be needed. In that case, remove it from the code that instantiates the <xref:Azure.Identity.DefaultAzureCredential>.)
(Depending on your environment, you might not need the tenant ID. In that case, remove it from the code that instantiates the <xref:Azure.Identity.DefaultAzureCredential>.)

1. Open the new app in your editor of choice.

@@ -84,7 +84,7 @@ Complete the following steps to create an MSTest project that connects to an AI

This method does the following:

- Invokes the <xref:Microsoft.Extensions.AI.Evaluation.Quality.CoherenceEvaluator> to evaluate the *coherence* of the response. The <xref:Microsoft.Extensions.AI.Evaluation.IEvaluator.EvaluateAsync(System.Collections.Generic.IEnumerable{Microsoft.Extensions.AI.ChatMessage},Microsoft.Extensions.AI.ChatResponse,Microsoft.Extensions.AI.Evaluation.ChatConfiguration,System.Collections.Generic.IEnumerable{Microsoft.Extensions.AI.Evaluation.EvaluationContext},System.Threading.CancellationToken)> method returns an <xref:Microsoft.Extensions.AI.Evaluation.EvaluationResult> that contains a <xref:Microsoft.Extensions.AI.Evaluation.NumericMetric>. A `NumericMetric` contains a numeric value that's typically used to represent numeric scores that fall within a well-defined range.
- Invokes the <xref:Microsoft.Extensions.AI.Evaluation.Quality.CoherenceEvaluator> to evaluate the *coherence* of the response. The <xref:Microsoft.Extensions.AI.Evaluation.IEvaluator.EvaluateAsync(System.Collections.Generic.IEnumerable{Microsoft.Extensions.AI.ChatMessage},Microsoft.Extensions.AI.ChatResponse,Microsoft.Extensions.AI.Evaluation.ChatConfiguration,System.Collections.Generic.IEnumerable{Microsoft.Extensions.AI.Evaluation.EvaluationContext},System.Threading.CancellationToken)> method returns an <xref:Microsoft.Extensions.AI.Evaluation.EvaluationResult> that contains a <xref:Microsoft.Extensions.AI.Evaluation.NumericMetric>. A `NumericMetric` contains a numeric value that typically represents numeric scores that fall within a well-defined range.
- Retrieves the coherence score from the <xref:Microsoft.Extensions.AI.Evaluation.EvaluationResult>.
- Validates the *default interpretation* for the returned coherence metric. Evaluators can include a default interpretation for the metrics they return. You can also change the default interpretation to suit your specific requirements, if needed.
- Validates that no diagnostics are present on the returned coherence metric. Evaluators can include diagnostics on the metrics they return to indicate errors, warnings, or other exceptional conditions encountered during evaluation.
36 changes: 18 additions & 18 deletions docs/ai/evaluation/evaluate-safety.md
@@ -1,7 +1,7 @@
---
title: Tutorial - Evaluate response safety with caching and reporting
description: Create an MSTest app that evaluates the content safety of a model's response using the evaluators in the Microsoft.Extensions.AI.Evaluation.Safety package and with caching and reporting.
ms.date: 03/03/2026
ms.date: 04/09/2026
ms.topic: tutorial
ai-usage: ai-assisted
---
@@ -20,13 +20,13 @@ In this tutorial, you create an MSTest app to evaluate the *content safety* of a
To provision an Azure OpenAI service and model using the Azure portal, complete the steps in the [Create and deploy an Azure OpenAI Service resource](/azure/ai-services/openai/how-to/create-resource?pivots=web-portal) article. In the "Deploy a model" step, select the `gpt-5` model.

> [!TIP]
> The previous configuration step is only required to fetch the response to be evaluated. To evaluate the safety of a response you already have in hand, you can skip this configuration.
> You only need the previous configuration step to fetch the response to evaluate. To evaluate the safety of a response you already have, skip this configuration.

The evaluators in this tutorial use the Foundry Evaluation service, which requires some additional setup:

- [Create a resource group](/azure/azure-resource-manager/management/manage-resource-groups-portal#create-resource-groups) within one of the Azure [regions that support Foundry Evaluation service](/azure/ai-foundry/how-to/develop/evaluate-sdk#region-support).
- [Create a Foundry hub](/azure/ai-foundry/how-to/create-azure-ai-resource?tabs=portal#create-a-hub-in-azure-ai-foundry-portal) in the resource group you just created.
- Finally, [create a Foundry project](/azure/ai-foundry/how-to/create-projects?tabs=ai-studio#create-a-project) in the hub you just created.
- [Create a Foundry project](/azure/ai-foundry/how-to/create-projects?tabs=ai-studio#create-a-project) in the hub you just created.

## Create the test app

@@ -63,7 +63,7 @@ Complete the following steps to create an MSTest project.
dotnet user-secrets set AZURE_AI_PROJECT <your-Azure-AI-project>
```

(Depending on your environment, the tenant ID might not be needed. In that case, remove it from the code that instantiates the <xref:Azure.Identity.DefaultAzureCredential>.)
(Depending on your environment, you might not need the tenant ID. If so, remove it from the code that instantiates the <xref:Azure.Identity.DefaultAzureCredential>.)

1. Open the new app in your editor of choice.

@@ -85,9 +85,9 @@ Complete the following steps to create an MSTest project.
The [scenario name](xref:Microsoft.Extensions.AI.Evaluation.Reporting.ScenarioRun.ScenarioName) is set to the fully qualified name of the current test method. However, you can set it to any string of your choice. Here are some considerations for choosing a scenario name:

- When using disk-based storage, the scenario name is used as the name of the folder under which the corresponding evaluation results are stored.
- By default, the generated evaluation report splits scenario names on `.` so that the results can be displayed in a hierarchical view with appropriate grouping, nesting, and aggregation.
- By default, the generated evaluation report splits scenario names on `.` so the report displays results in a hierarchical view with appropriate grouping, nesting, and aggregation.

The [execution name](xref:Microsoft.Extensions.AI.Evaluation.Reporting.ReportingConfiguration.ExecutionName) is used to group evaluation results that are part of the same evaluation run (or test run) when the evaluation results are stored. If you don't provide an execution name when creating a <xref:Microsoft.Extensions.AI.Evaluation.Reporting.ReportingConfiguration>, all evaluation runs will use the same default execution name of `Default`. In this case, results from one run will be overwritten by the next.
The [execution name](xref:Microsoft.Extensions.AI.Evaluation.Reporting.ReportingConfiguration.ExecutionName) is used to group evaluation results that are part of the same evaluation run (or test run) when the evaluation results are stored. If you don't provide an execution name when creating a <xref:Microsoft.Extensions.AI.Evaluation.Reporting.ReportingConfiguration>, all evaluation runs use the same default execution name of `Default`. In this case, results from one run will be overwritten by the next.
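The two naming rules above can be sketched with plain .NET types. This is only an illustration, not the library's implementation: `SplitScenarioName` and `CreateExecutionName` are hypothetical helpers, and the sample scenario name is made up. The first helper shows how a dot-separated scenario name maps to nesting levels in a hierarchical view. The second shows one way to produce a distinct execution name per run so that results from one run don't overwrite the previous run's results.

```csharp
using System;
using System.Globalization;

// Hypothetical helper: splits a scenario name on '.' into hierarchy levels,
// mirroring the default grouping behavior that the text describes.
static string[] SplitScenarioName(string scenarioName) =>
    scenarioName.Split('.', StringSplitOptions.RemoveEmptyEntries);

// Hypothetical helper: derives an execution name from a timestamp so that
// each run gets its own name instead of the shared default name.
static string CreateExecutionName(DateTime utcNow) =>
    utcNow.ToString("yyyyMMddTHHmmss", CultureInfo.InvariantCulture);

// A made-up fully qualified test method name.
string[] levels = SplitScenarioName("MyNamespace.MyTests.EvaluateSafety");

// Print each level indented by its depth to show the nesting.
for (int depth = 0; depth < levels.Length; depth++)
{
    Console.WriteLine(new string(' ', depth * 2) + levels[depth]);
}

Console.WriteLine(CreateExecutionName(DateTime.UtcNow));
```

A timestamp is only one option. Any string that differs between runs, such as a build number, serves the same purpose.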

1. Add a method to gather the safety evaluators to use in the evaluation.

@@ -97,26 +97,26 @@ Complete the following steps to create an MSTest project.

:::code language="csharp" source="./snippets/evaluate-safety/MyTests.cs" id="ServiceConfig":::

1. Add a method that creates an <xref:Microsoft.Extensions.AI.IChatClient> object, which will be used to get the chat response to evaluate from the LLM.
1. Add a method that creates an <xref:Microsoft.Extensions.AI.IChatClient> object, which gets the chat response to evaluate from the LLM.

:::code language="csharp" source="./snippets/evaluate-safety/MyTests.cs" id="ChatClient":::

1. Set up the reporting functionality. Convert the <xref:Microsoft.Extensions.AI.Evaluation.Safety.ContentSafetyServiceConfiguration> to a <xref:Microsoft.Extensions.AI.Evaluation.ChatConfiguration>, and then pass that to the method that creates a <xref:Microsoft.Extensions.AI.Evaluation.Reporting.ReportingConfiguration>.

:::code language="csharp" source="./snippets/evaluate-safety/MyTests.cs" id="ReportingConfig":::

Response caching functionality is supported and works the same way regardless of whether the evaluators talk to an LLM or to the Foundry Evaluation service. The response will be reused until the corresponding cache entry expires (in 14 days by default), or until any request parameter, such as the LLM endpoint or the question being asked, is changed.
Response caching works the same way regardless of whether the evaluators talk to an LLM or to the Foundry Evaluation service. The response is reused until the corresponding cache entry expires (in 14 days by default), or until any request parameter, such as the LLM endpoint or the question being asked, changes.
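The reuse rule described above (a cached response is served until it expires or until a request parameter changes) can be illustrated with a small in-memory sketch. This isn't how the library's cache is implemented. The dictionary, `MakeKey`, `GetResponse`, the endpoint URL, and the questions are all hypothetical stand-ins. Only the 14-day lifetime comes from the text.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical in-memory stand-in for a response cache.
var cache = new Dictionary<string, (string Response, DateTime ExpiresUtc)>();
TimeSpan timeToLive = TimeSpan.FromDays(14); // default lifetime stated in the text

// The key combines the request parameters, so changing any one of them
// produces a different key and therefore a cache miss.
static string MakeKey(string endpoint, string question) => $"{endpoint}\n{question}";

string GetResponse(string endpoint, string question, DateTime nowUtc, Func<string> fetch)
{
    string key = MakeKey(endpoint, question);
    if (cache.TryGetValue(key, out var entry) && nowUtc < entry.ExpiresUtc)
    {
        return entry.Response; // cache hit: the model or service isn't called
    }

    string response = fetch(); // cache miss: fetch and store with a new expiry
    cache[key] = (response, nowUtc + timeToLive);
    return response;
}

int calls = 0;
string Fetch() { calls++; return "response #" + calls; }

DateTime start = new DateTime(2026, 4, 9, 0, 0, 0, DateTimeKind.Utc);
GetResponse("https://example.test", "How far is the Moon?", start, Fetch);             // miss
GetResponse("https://example.test", "How far is the Moon?", start.AddDays(1), Fetch);  // hit
GetResponse("https://example.test", "How far is the Sun?", start.AddDays(1), Fetch);   // miss: question changed
GetResponse("https://example.test", "How far is the Moon?", start.AddDays(15), Fetch); // miss: entry expired

Console.WriteLine(calls); // prints 3
```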

> [!NOTE]
> This code example passes the LLM <xref:Microsoft.Extensions.AI.IChatClient> as `originalChatClient` to <xref:Microsoft.Extensions.AI.Evaluation.Safety.ContentSafetyServiceConfigurationExtensions.ToChatConfiguration(Microsoft.Extensions.AI.Evaluation.Safety.ContentSafetyServiceConfiguration,Microsoft.Extensions.AI.IChatClient)>. The reason to include the LLM chat client here is to enable getting a chat response from the LLM, and notably, to enable response caching for it. (If you don't want to cache the LLM's response, you can create a separate, local <xref:Microsoft.Extensions.AI.IChatClient> to fetch the response from the LLM.) Instead of passing a <xref:Microsoft.Extensions.AI.IChatClient>, if you already have a <xref:Microsoft.Extensions.AI.Evaluation.ChatConfiguration> for an LLM from another reporting configuration, you can pass that instead, using the <xref:Microsoft.Extensions.AI.Evaluation.Safety.ContentSafetyServiceConfigurationExtensions.ToChatConfiguration(Microsoft.Extensions.AI.Evaluation.Safety.ContentSafetyServiceConfiguration,Microsoft.Extensions.AI.Evaluation.ChatConfiguration)> overload.
> This code example passes the LLM <xref:Microsoft.Extensions.AI.IChatClient> as `originalChatClient` to <xref:Microsoft.Extensions.AI.Evaluation.Safety.ContentSafetyServiceConfigurationExtensions.ToChatConfiguration(Microsoft.Extensions.AI.Evaluation.Safety.ContentSafetyServiceConfiguration,Microsoft.Extensions.AI.IChatClient)>. Including the LLM chat client here enables getting a chat response from the LLM and enables response caching for the response. (To skip caching the LLM's response, create a separate, local <xref:Microsoft.Extensions.AI.IChatClient> to fetch the response from the LLM.) Instead of passing a <xref:Microsoft.Extensions.AI.IChatClient>, if you already have a <xref:Microsoft.Extensions.AI.Evaluation.ChatConfiguration> for an LLM from another reporting configuration, you can pass that instead, using the <xref:Microsoft.Extensions.AI.Evaluation.Safety.ContentSafetyServiceConfigurationExtensions.ToChatConfiguration(Microsoft.Extensions.AI.Evaluation.Safety.ContentSafetyServiceConfiguration,Microsoft.Extensions.AI.Evaluation.ChatConfiguration)> overload.
>
> Similarly, if you configure both [LLM-based evaluators](libraries.md#quality-evaluators) and [Foundry Evaluation service&ndash;based evaluators](libraries.md#safety-evaluators) in the reporting configuration, you also need to pass the LLM <xref:Microsoft.Extensions.AI.Evaluation.ChatConfiguration> to <xref:Microsoft.Extensions.AI.Evaluation.Safety.ContentSafetyServiceConfigurationExtensions.ToChatConfiguration(Microsoft.Extensions.AI.Evaluation.Safety.ContentSafetyServiceConfiguration,Microsoft.Extensions.AI.Evaluation.ChatConfiguration)>. Then it returns a <xref:Microsoft.Extensions.AI.Evaluation.ChatConfiguration> that can talk to both types of evaluators.
> Similarly, if you configure both [LLM-based evaluators](libraries.md#quality-evaluators) and [Foundry Evaluation service&ndash;based evaluators](libraries.md#safety-evaluators) in the reporting configuration, you also need to pass the LLM <xref:Microsoft.Extensions.AI.Evaluation.ChatConfiguration> to <xref:Microsoft.Extensions.AI.Evaluation.Safety.ContentSafetyServiceConfigurationExtensions.ToChatConfiguration(Microsoft.Extensions.AI.Evaluation.Safety.ContentSafetyServiceConfiguration,Microsoft.Extensions.AI.Evaluation.ChatConfiguration)>. The method then returns a <xref:Microsoft.Extensions.AI.Evaluation.ChatConfiguration> that can talk to both types of evaluators.

1. Add a method to define the [chat options](xref:Microsoft.Extensions.AI.ChatOptions) and ask the model for a response to a given question.

:::code language="csharp" source="./snippets/evaluate-safety/MyTests.cs" id="GetResponse":::

The test in this tutorial evaluates the LLM's response to an astronomy question. Since the <xref:Microsoft.Extensions.AI.Evaluation.Reporting.ReportingConfiguration> has response caching enabled, and since the supplied <xref:Microsoft.Extensions.AI.IChatClient> is always fetched from the <xref:Microsoft.Extensions.AI.Evaluation.Reporting.ScenarioRun> created using this reporting configuration, the LLM response for the test is cached and reused.
The test in this tutorial evaluates the LLM's response to an astronomy question. Because the <xref:Microsoft.Extensions.AI.Evaluation.Reporting.ReportingConfiguration> has response caching enabled, and because the supplied <xref:Microsoft.Extensions.AI.IChatClient> is always fetched from the <xref:Microsoft.Extensions.AI.Evaluation.Reporting.ScenarioRun> created using this reporting configuration, the LLM response for the test gets cached and reused.

1. Add a method to validate the response.

@@ -129,16 +129,16 @@ Complete the following steps to create an MSTest project.

:::code language="csharp" source="./snippets/evaluate-safety/MyTests.cs" id="TestMethod":::

This test method:
The test method:

- Creates the <xref:Microsoft.Extensions.AI.Evaluation.Reporting.ScenarioRun>. The use of `await using` ensures that the `ScenarioRun` is correctly disposed and that the results of this evaluation are correctly persisted to the result store.
- Gets the LLM's response to a specific astronomy question. The same <xref:Microsoft.Extensions.AI.IChatClient> that will be used for evaluation is passed to the `GetAstronomyConversationAsync` method in order to get *response caching* for the primary LLM response being evaluated. (In addition, this enables response caching for the responses that the evaluators fetch from the Foundry Evaluation service as part of performing their evaluations.)
- Runs the evaluators against the response. Like the LLM response, on subsequent runs, the evaluation is fetched from the (disk-based) response cache that was configured in `s_safetyReportingConfig`.
- Creates the <xref:Microsoft.Extensions.AI.Evaluation.Reporting.ScenarioRun>. `await using` ensures that `ScenarioRun` is correctly disposed and that the evaluation results are correctly persisted to the result store.
- Gets the LLM's response to a specific astronomy question. The test passes the same <xref:Microsoft.Extensions.AI.IChatClient> used for evaluation to `GetAstronomyConversationAsync` to enable *response caching* for the primary LLM response being evaluated. (In addition, passing the same <xref:Microsoft.Extensions.AI.IChatClient> enables response caching for the evaluator responses from the Foundry Evaluation service.)
- Runs the evaluators against the response. Like the LLM response, subsequent runs fetch the evaluation from the (disk-based) response cache configured in `s_safetyReportingConfig`.
- Runs some safety validation on the evaluation result.

## Run the test/evaluation

Run the test using your preferred test workflow, for example, by using the CLI command `dotnet test` or through [Test Explorer](/visualstudio/test/run-unit-tests-with-test-explorer).
Run the test using your preferred test workflow, for example, by using the CLI command `dotnet test` or [Test Explorer](/visualstudio/test/run-unit-tests-with-test-explorer).

## Generate a report

@@ -148,6 +148,6 @@ To generate a report to view the evaluation results, see [Generate a report](eva

This tutorial covers the basics of evaluating content safety. As you create your test suite, consider the following next steps:

- Configure additional evaluators, such as the [quality evaluators](libraries.md#quality-evaluators). For an example, see the AI samples repo [quality and safety evaluation example](https://github.com/dotnet/ai-samples/blob/main/src/microsoft-extensions-ai-evaluation/api/reporting/ReportingExamples.Example10_RunningQualityAndSafetyEvaluatorsTogether.cs).
- Configure more evaluators, such as the [quality evaluators](libraries.md#quality-evaluators). For an example, see the AI samples repo [quality and safety evaluation example](https://github.com/dotnet/ai-samples/blob/main/src/microsoft-extensions-ai-evaluation/api/reporting/ReportingExamples.Example10_RunningQualityAndSafetyEvaluatorsTogether.cs).
- Evaluate the content safety of generated images. For an example, see the AI samples repo [image response example](https://github.com/dotnet/ai-samples/blob/main/src/microsoft-extensions-ai-evaluation/api/reporting/ReportingExamples.Example09_RunningSafetyEvaluatorsAgainstResponsesWithImages.cs).
- In real-world evaluations, you might not want to validate individual results, since the LLM responses and evaluation scores can vary over time as your product (and the models used) evolve. You might not want individual evaluation tests to fail and block builds in your CI/CD pipelines when this happens. Instead, in such cases, it might be better to rely on the generated report and track the overall trends for evaluation scores across different scenarios over time (and only fail individual builds in your CI/CD pipelines when there's a significant drop in evaluation scores across multiple different tests).
- In real-world evaluations, you might not want to validate individual results, because the LLM responses and evaluation scores can vary over time as your product (and the models used) evolve. You might not want individual evaluation tests to fail and block builds in your CI/CD pipelines when evaluation scores change. Instead, consider relying on the generated report and tracking the overall trends for evaluation scores across different scenarios over time (and only failing individual builds in your CI/CD pipelines when there's a significant drop in evaluation scores across multiple different tests).
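The last suggestion, failing a build only when scores drop significantly across multiple tests, could be sketched as a simple policy check. This is an assumption-laden illustration rather than a feature of the evaluation libraries: `ShouldFailBuild`, the thresholds, the scenario names, and the scores are all hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical policy: fail only when at least `minRegressedScenarios`
// scenarios score more than `maxDrop` below their baseline score.
static bool ShouldFailBuild(
    IReadOnlyDictionary<string, double> baselineScores,
    IReadOnlyDictionary<string, double> currentScores,
    double maxDrop,
    int minRegressedScenarios)
{
    int regressed = currentScores.Count(pair =>
        baselineScores.TryGetValue(pair.Key, out double previousScore) &&
        previousScore - pair.Value > maxDrop);

    return regressed >= minRegressedScenarios;
}

// Made-up scores for three scenarios.
var lastRun = new Dictionary<string, double> { ["A"] = 4.5, ["B"] = 4.0, ["C"] = 5.0 };
var singleDip = new Dictionary<string, double> { ["A"] = 3.0, ["B"] = 4.0, ["C"] = 5.0 };
var broadDrop = new Dictionary<string, double> { ["A"] = 3.0, ["B"] = 2.5, ["C"] = 4.8 };

// One scenario dipping doesn't fail the build; two scenarios dipping does.
Console.WriteLine(ShouldFailBuild(lastRun, singleDip, maxDrop: 1.0, minRegressedScenarios: 2)); // False
Console.WriteLine(ShouldFailBuild(lastRun, broadDrop, maxDrop: 1.0, minRegressedScenarios: 2)); // True
```

The thresholds here are arbitrary. In practice, you would tune them against the score trends shown in the generated report.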