diff --git a/docs/ai/evaluation/evaluate-ai-response.md b/docs/ai/evaluation/evaluate-ai-response.md index 983c63c5484ab..db56f1ee3f70d 100644 --- a/docs/ai/evaluation/evaluate-ai-response.md +++ b/docs/ai/evaluation/evaluate-ai-response.md @@ -1,7 +1,7 @@ --- title: Quickstart - Evaluate the quality of a model's response description: Learn how to create an MSTest app to evaluate the AI chat response of a language model. -ms.date: 03/03/2026 +ms.date: 04/09/2026 ms.topic: quickstart ai-usage: ai-assisted --- @@ -53,7 +53,7 @@ Complete the following steps to create an MSTest project that connects to an AI dotnet user-secrets set AZURE_TENANT_ID ``` - (Depending on your environment, the tenant ID might not be needed. In that case, remove it from the code that instantiates the .) + (Depending on your environment, you might not need the tenant ID. In that case, remove it from the code that instantiates the .) 1. Open the new app in your editor of choice. @@ -84,7 +84,7 @@ Complete the following steps to create an MSTest project that connects to an AI This method does the following: - - Invokes the to evaluate the *coherence* of the response. The method returns an that contains a . A `NumericMetric` contains a numeric value that's typically used to represent numeric scores that fall within a well-defined range. + - Invokes the to evaluate the *coherence* of the response. The method returns an that contains a . A `NumericMetric` contains a numeric value that typically represents numeric scores that fall within a well-defined range. - Retrieves the coherence score from the . - Validates the *default interpretation* for the returned coherence metric. Evaluators can include a default interpretation for the metrics they return. You can also change the default interpretation to suit your specific requirements, if needed. - Validates that no diagnostics are present on the returned coherence metric. 
Evaluators can include diagnostics on the metrics they return to indicate errors, warnings, or other exceptional conditions encountered during evaluation. diff --git a/docs/ai/evaluation/evaluate-safety.md b/docs/ai/evaluation/evaluate-safety.md index af375fa39f1a2..ddd3baf780d7a 100644 --- a/docs/ai/evaluation/evaluate-safety.md +++ b/docs/ai/evaluation/evaluate-safety.md @@ -1,7 +1,7 @@ --- title: Tutorial - Evaluate response safety with caching and reporting description: Create an MSTest app that evaluates the content safety of a model's response using the evaluators in the Microsoft.Extensions.AI.Evaluation.Safety package and with caching and reporting. -ms.date: 03/03/2026 +ms.date: 04/09/2026 ms.topic: tutorial ai-usage: ai-assisted --- @@ -20,13 +20,13 @@ In this tutorial, you create an MSTest app to evaluate the *content safety* of a To provision an Azure OpenAI service and model using the Azure portal, complete the steps in the [Create and deploy an Azure OpenAI Service resource](/azure/ai-services/openai/how-to/create-resource?pivots=web-portal) article. In the "Deploy a model" step, select the `gpt-5` model. > [!TIP] -> The previous configuration step is only required to fetch the response to be evaluated. To evaluate the safety of a response you already have in hand, you can skip this configuration. +> You only need the previous configuration step to fetch the response to evaluate. To evaluate the safety of a response you already have, skip this configuration. The evaluators in this tutorial use the Foundry Evaluation service, which requires some additional setup: - [Create a resource group](/azure/azure-resource-manager/management/manage-resource-groups-portal#create-resource-groups) within one of the Azure [regions that support Foundry Evaluation service](/azure/ai-foundry/how-to/develop/evaluate-sdk#region-support). 
- [Create a Foundry hub](/azure/ai-foundry/how-to/create-azure-ai-resource?tabs=portal#create-a-hub-in-azure-ai-foundry-portal) in the resource group you just created. -- Finally, [create a Foundry project](/azure/ai-foundry/how-to/create-projects?tabs=ai-studio#create-a-project) in the hub you just created. +- [Create a Foundry project](/azure/ai-foundry/how-to/create-projects?tabs=ai-studio#create-a-project) in the hub you just created. ## Create the test app @@ -63,7 +63,7 @@ Complete the following steps to create an MSTest project. dotnet user-secrets set AZURE_AI_PROJECT ``` - (Depending on your environment, the tenant ID might not be needed. In that case, remove it from the code that instantiates the .) + (Depending on your environment, you might not need the tenant ID. If so, remove it from the code that instantiates the .) 1. Open the new app in your editor of choice. @@ -85,9 +85,9 @@ Complete the following steps to create an MSTest project. The [scenario name](xref:Microsoft.Extensions.AI.Evaluation.Reporting.ScenarioRun.ScenarioName) is set to the fully qualified name of the current test method. However, you can set it to any string of your choice. Here are some considerations for choosing a scenario name: - When using disk-based storage, the scenario name is used as the name of the folder under which the corresponding evaluation results are stored. - - By default, the generated evaluation report splits scenario names on `.` so that the results can be displayed in a hierarchical view with appropriate grouping, nesting, and aggregation. + - By default, the generated evaluation report splits scenario names on `.` so the report displays results in a hierarchical view with appropriate grouping, nesting, and aggregation. - The [execution name](xref:Microsoft.Extensions.AI.Evaluation.Reporting.ReportingConfiguration.ExecutionName) is used to group evaluation results that are part of the same evaluation run (or test run) when the evaluation results are stored. 
If you don't provide an execution name when creating a , all evaluation runs will use the same default execution name of `Default`. In this case, results from one run will be overwritten by the next. + The [execution name](xref:Microsoft.Extensions.AI.Evaluation.Reporting.ReportingConfiguration.ExecutionName) is used to group evaluation results that are part of the same evaluation run (or test run) when the evaluation results are stored. If you don't provide an execution name when creating a , all evaluation runs use the same default execution name of `Default`. In this case, results from one run will be overwritten by the next. 1. Add a method to gather the safety evaluators to use in the evaluation. @@ -97,7 +97,7 @@ Complete the following steps to create an MSTest project. :::code language="csharp" source="./snippets/evaluate-safety/MyTests.cs" id="ServiceConfig"::: -1. Add a method that creates an object, which will be used to get the chat response to evaluate from the LLM. +1. Add a method that creates an object, which gets the chat response to evaluate from the LLM. :::code language="csharp" source="./snippets/evaluate-safety/MyTests.cs" id="ChatClient"::: @@ -105,18 +105,18 @@ Complete the following steps to create an MSTest project. :::code language="csharp" source="./snippets/evaluate-safety/MyTests.cs" id="ReportingConfig"::: - Response caching functionality is supported and works the same way regardless of whether the evaluators talk to an LLM or to the Foundry Evaluation service. The response will be reused until the corresponding cache entry expires (in 14 days by default), or until any request parameter, such as the LLM endpoint or the question being asked, is changed. + Response caching works the same way regardless of whether the evaluators talk to an LLM or to the Foundry Evaluation service. 
The response is reused until the corresponding cache entry expires (in 14 days by default), or until any request parameter, such as the LLM endpoint or the question being asked, changes. > [!NOTE] - > This code example passes the LLM as `originalChatClient` to . The reason to include the LLM chat client here is to enable getting a chat response from the LLM, and notably, to enable response caching for it. (If you don't want to cache the LLM's response, you can create a separate, local to fetch the response from the LLM.) Instead of passing a , if you already have a for an LLM from another reporting configuration, you can pass that instead, using the overload. + > This code example passes the LLM as `originalChatClient` to . Including the LLM chat client here enables getting a chat response from the LLM and enables response caching for the response. (To skip caching the LLM's response, create a separate, local to fetch the response from the LLM.) Instead of passing a , if you already have a for an LLM from another reporting configuration, you can pass that instead, using the overload. > - > Similarly, if you configure both [LLM-based evaluators](libraries.md#quality-evaluators) and [Foundry Evaluation service–based evaluators](libraries.md#safety-evaluators) in the reporting configuration, you also need to pass the LLM to . Then it returns a that can talk to both types of evaluators. + > Similarly, if you configure both [LLM-based evaluators](libraries.md#quality-evaluators) and [Foundry Evaluation service–based evaluators](libraries.md#safety-evaluators) in the reporting configuration, you also need to pass the LLM to . The method then returns a that can talk to both types of evaluators. 1. Add a method to define the [chat options](xref:Microsoft.Extensions.AI.ChatOptions) and ask the model for a response to a given question. 
:::code language="csharp" source="./snippets/evaluate-safety/MyTests.cs" id="GetResponse"::: - The test in this tutorial evaluates the LLM's response to an astronomy question. Since the has response caching enabled, and since the supplied is always fetched from the created using this reporting configuration, the LLM response for the test is cached and reused. + The test in this tutorial evaluates the LLM's response to an astronomy question. Because the has response caching enabled, and because the supplied is always fetched from the created using this reporting configuration, the LLM response for the test gets cached and reused. 1. Add a method to validate the response. @@ -129,16 +129,16 @@ Complete the following steps to create an MSTest project. :::code language="csharp" source="./snippets/evaluate-safety/MyTests.cs" id="TestMethod"::: - This test method: + The test method: - - Creates the . The use of `await using` ensures that the `ScenarioRun` is correctly disposed and that the results of this evaluation are correctly persisted to the result store. - - Gets the LLM's response to a specific astronomy question. The same that will be used for evaluation is passed to the `GetAstronomyConversationAsync` method in order to get *response caching* for the primary LLM response being evaluated. (In addition, this enables response caching for the responses that the evaluators fetch from the Foundry Evaluation service as part of performing their evaluations.) - - Runs the evaluators against the response. Like the LLM response, on subsequent runs, the evaluation is fetched from the (disk-based) response cache that was configured in `s_safetyReportingConfig`. + - Creates the . `await using` ensures that `ScenarioRun` is correctly disposed and that the evaluation results are correctly persisted to the result store. + - Gets the LLM's response to a specific astronomy question. 
The test passes the same used for evaluation to `GetAstronomyConversationAsync` to enable *response caching* for the primary LLM response being evaluated. (In addition, passing the same enables response caching for the evaluator responses from the Foundry Evaluation service.) + - Runs the evaluators against the response. Like the LLM response, subsequent runs fetch the evaluation from the (disk-based) response cache configured in `s_safetyReportingConfig`. - Runs some safety validation on the evaluation result. ## Run the test/evaluation -Run the test using your preferred test workflow, for example, by using the CLI command `dotnet test` or through [Test Explorer](/visualstudio/test/run-unit-tests-with-test-explorer). +Run the test using your preferred test workflow—for example, by using the CLI command `dotnet test` or [Test Explorer](/visualstudio/test/run-unit-tests-with-test-explorer). ## Generate a report @@ -148,6 +148,6 @@ To generate a report to view the evaluation results, see [Generate a report](eva This tutorial covers the basics of evaluating content safety. As you create your test suite, consider the following next steps: -- Configure additional evaluators, such as the [quality evaluators](libraries.md#quality-evaluators). For an example, see the AI samples repo [quality and safety evaluation example](https://github.com/dotnet/ai-samples/blob/main/src/microsoft-extensions-ai-evaluation/api/reporting/ReportingExamples.Example10_RunningQualityAndSafetyEvaluatorsTogether.cs). +- Configure more evaluators, such as the [quality evaluators](libraries.md#quality-evaluators). For an example, see the AI samples repo [quality and safety evaluation example](https://github.com/dotnet/ai-samples/blob/main/src/microsoft-extensions-ai-evaluation/api/reporting/ReportingExamples.Example10_RunningQualityAndSafetyEvaluatorsTogether.cs). - Evaluate the content safety of generated images. 
For an example, see the AI samples repo [image response example](https://github.com/dotnet/ai-samples/blob/main/src/microsoft-extensions-ai-evaluation/api/reporting/ReportingExamples.Example09_RunningSafetyEvaluatorsAgainstResponsesWithImages.cs). -- In real-world evaluations, you might not want to validate individual results, since the LLM responses and evaluation scores can vary over time as your product (and the models used) evolve. You might not want individual evaluation tests to fail and block builds in your CI/CD pipelines when this happens. Instead, in such cases, it might be better to rely on the generated report and track the overall trends for evaluation scores across different scenarios over time (and only fail individual builds in your CI/CD pipelines when there's a significant drop in evaluation scores across multiple different tests). +- In real-world evaluations, you might not want to validate individual results, because the LLM responses and evaluation scores can vary over time as your product (and the models used) evolve. You might not want individual evaluation tests to fail and block builds in your CI/CD pipelines when evaluation scores change. Instead, consider relying on the generated report and tracking the overall trends for evaluation scores across different scenarios over time (and only failing individual builds in your CI/CD pipelines when there's a significant drop in evaluation scores across multiple different tests). 
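
As a hedged illustration of the test flow this tutorial's changes describe, the test method can be condensed into a sketch like the following. This is a hypothetical abbreviation, not the tutorial's actual snippet: `s_safetyReportingConfig` and `GetAstronomyConversationAsync` stand in for the members defined in the tutorial's `MyTests.cs`, and the exact `ScenarioRun` member signatures should be verified against the version of the Microsoft.Extensions.AI.Evaluation.Reporting package you reference.

```csharp
// Hypothetical condensation of the tutorial's test method; verify signatures
// against the Microsoft.Extensions.AI.Evaluation packages you install.
[TestMethod]
public async Task SampleAndEvaluateResponse()
{
    // `await using` ensures the ScenarioRun is disposed and its results
    // are persisted to the configured result store.
    await using ScenarioRun scenarioRun =
        await s_safetyReportingConfig.CreateScenarioRunAsync(
            "MyTests.SampleAndEvaluateResponse");

    // Fetch the response with the same chat client that's used for evaluation,
    // so the LLM response is cached alongside the evaluation-service responses.
    (IList<ChatMessage> messages, ChatResponse response) =
        await GetAstronomyConversationAsync(
            scenarioRun.ChatConfiguration!.ChatClient,
            "How far is the Moon from the Earth?");

    // Run the configured safety evaluators against the response.
    EvaluationResult result = await scenarioRun.EvaluateAsync(messages, response);

    // Validate one of the returned safety metrics.
    NumericMetric violence =
        result.Get<NumericMetric>(ViolenceEvaluator.ViolenceMetricName);
    Assert.IsFalse(violence.Interpretation!.Failed, violence.Interpretation.Reason);
}
```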
diff --git a/docs/ai/evaluation/evaluate-with-reporting.md b/docs/ai/evaluation/evaluate-with-reporting.md index 06562df1880b1..fc2cf4cb388d0 100644 --- a/docs/ai/evaluation/evaluate-with-reporting.md +++ b/docs/ai/evaluation/evaluate-with-reporting.md @@ -1,14 +1,14 @@ --- title: Tutorial - Evaluate response quality with caching and reporting description: Create an MSTest app to evaluate the response quality of a language model, add a custom evaluator, and learn how to use the caching and reporting features of Microsoft.Extensions.AI.Evaluation. -ms.date: 03/03/2026 +ms.date: 04/09/2026 ms.topic: tutorial ai-usage: ai-assisted --- # Tutorial: Evaluate response quality with caching and reporting -In this tutorial, you create an MSTest app to evaluate the chat response of an OpenAI model. The test app uses the [Microsoft.Extensions.AI.Evaluation](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation) libraries to perform the evaluations, cache the model responses, and create reports. The tutorial uses both built-in and custom evaluators. The built-in quality evaluators (from the [Microsoft.Extensions.AI.Evaluation.Quality package](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation.Quality)) use an LLM to perform evaluations; the custom evaluator does not use AI. +In this tutorial, you create an MSTest app to evaluate the chat response of an OpenAI model. The test app uses the [Microsoft.Extensions.AI.Evaluation](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation) libraries to perform the evaluations, cache the model responses, and create reports. The tutorial uses both built-in and custom evaluators. The built-in quality evaluators (from the [Microsoft.Extensions.AI.Evaluation.Quality package](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation.Quality)) use an LLM to perform evaluations; the custom evaluator doesn't use AI. 
## Prerequisites @@ -76,14 +76,14 @@ Complete the following steps to create an MSTest project that connects to an AI **Scenario name** - The [scenario name](xref:Microsoft.Extensions.AI.Evaluation.Reporting.ScenarioRun.ScenarioName) is set to the fully qualified name of the current test method. However, you can set it to any string of your choice when you call . Here are some considerations for choosing a scenario name: + The [scenario name](xref:Microsoft.Extensions.AI.Evaluation.Reporting.ScenarioRun.ScenarioName) is set to the fully qualified name of the current test method. However, you can set it to any string when you call . Consider these factors when choosing a scenario name: - When using disk-based storage, the scenario name is used as the name of the folder under which the corresponding evaluation results are stored. So it's a good idea to keep the name reasonably short and avoid any characters that aren't allowed in file and directory names. - - By default, the generated evaluation report splits scenario names on `.` so that the results can be displayed in a hierarchical view with appropriate grouping, nesting, and aggregation. This is especially useful in cases where the scenario name is set to the fully qualified name of the corresponding test method, since it allows the results to be grouped by namespaces and class names in the hierarchy. However, you can also take advantage of this feature by including periods (`.`) in your own custom scenario names to create a reporting hierarchy that works best for your scenarios. + - By default, the generated evaluation report splits scenario names on `.` so the results display in a hierarchical view with appropriate grouping, nesting, and aggregation. The hierarchical view is especially useful when the scenario name is the fully qualified name of the corresponding test method, because it groups results by namespaces and class names in the hierarchy. 
However, you can also take advantage of this feature by including periods (`.`) in your own custom scenario names to create a reporting hierarchy that works best for your scenarios. **Execution name** - The execution name is used to group evaluation results that are part of the same evaluation run (or test run) when the evaluation results are stored. If you don't provide an execution name when creating a , all evaluation runs will use the same default execution name of `Default`. In this case, results from one run will be overwritten by the next and you lose the ability to compare results across different runs. + The execution name is used to group evaluation results that are part of the same evaluation run (or test run) when the evaluation results are stored. If you don't provide an execution name when creating a , all evaluation runs use the same default execution name of `Default`. In this case, results from one run are overwritten by the next, and you lose the ability to compare results across different runs. This example uses a timestamp as the execution name. If you have more than one test in your project, ensure that results are grouped correctly by using the same execution name in all reporting configurations used across the tests. @@ -105,9 +105,9 @@ Complete the following steps to create an MSTest project that connects to an AI :::code language="csharp" source="./snippets/evaluate-with-reporting/WordCountEvaluator.cs"::: - The `WordCountEvaluator` counts the number of words present in the response. Unlike some evaluators, it isn't based on AI. The `EvaluateAsync` method returns an includes a that contains the word count. + The `WordCountEvaluator` counts the number of words present in the response. Unlike some evaluators, it isn't based on AI. The `EvaluateAsync` method returns an that includes a that contains the word count. - The `EvaluateAsync` method also attaches a default interpretation to the metric. 
The default interpretation considers the metric to be good (acceptable) if the detected word count is between 6 and 100. Otherwise, the metric is considered failed. This default interpretation can be overridden by the caller, if needed. + The `EvaluateAsync` method also attaches a default interpretation to the metric. The default interpretation considers the metric to be good (acceptable) if the detected word count is between 6 and 100. Otherwise, the metric is considered failed. The caller can override this default interpretation if needed. 1. Back in `MyTests.cs`, add a method to gather the evaluators to use in the evaluation. @@ -117,7 +117,7 @@ Complete the following steps to create an MSTest project that connects to an AI :::code language="csharp" source="./snippets/evaluate-with-reporting/MyTests.cs" id="GetResponse"::: - The test in this tutorial evaluates the LLM's response to an astronomy question. Since the has response caching enabled, and since the supplied is always fetched from the created using this reporting configuration, the LLM response for the test is cached and reused. The response will be reused until the corresponding cache entry expires (in 14 days by default), or until any request parameter, such as the the LLM endpoint or the question being asked, is changed. + The test in this tutorial evaluates the LLM's response to an astronomy question. Because the has response caching enabled, and because the supplied is always fetched from the created using this reporting configuration, the LLM response for the test is cached and reused. The response is reused until the corresponding cache entry expires (in 14 days by default), or until any request parameter, such as the LLM endpoint or the question being asked, changes. 1. Add a method to validate the response. @@ -132,14 +132,14 @@ Complete the following steps to create an MSTest project that connects to an AI This test method: - - Creates the . 
The use of `await using` ensures that the `ScenarioRun` is correctly disposed and that the results of this evaluation are correctly persisted to the result store. - - Gets the LLM's response to a specific astronomy question. The same that will be used for evaluation is passed to the `GetAstronomyConversationAsync` method in order to get *response caching* for the primary LLM response being evaluated. (In addition, this enables response caching for the LLM turns that the evaluators use to perform their evaluations internally.) With response caching, the LLM response is fetched either: + - Creates the . `await using` ensures correct disposal of the `ScenarioRun` and correct persistence of the evaluation results to the result store. + - Gets the LLM's response to a specific astronomy question. The test passes the same that's used for evaluation to the `GetAstronomyConversationAsync` method to get *response caching* for the primary LLM response being evaluated. (Passing the same client also enables response caching for the LLM turns that the evaluators use to perform their evaluations internally.) With response caching, the LLM response is fetched either: - Directly from the LLM endpoint in the first run of the current test, or in subsequent runs if the cached entry has expired (14 days, by default). - - From the (disk-based) response cache that was configured in `s_defaultReportingConfiguration` in subsequent runs of the test. - - Runs the evaluators against the response. Like the LLM response, on subsequent runs, the evaluation is fetched from the (disk-based) response cache that was configured in `s_defaultReportingConfiguration`. + - From the (disk-based) response cache configured in `s_defaultReportingConfiguration` in subsequent runs of the test. + - Runs the evaluators against the response. Like the LLM response, subsequent runs fetch the evaluation from the (disk-based) response cache configured in `s_defaultReportingConfiguration`. 
- Runs some basic validation on the evaluation result. - This step is optional and mainly for demonstration purposes. In real-world evaluations, you might not want to validate individual results since the LLM responses and evaluation scores can change over time as your product (and the models used) evolve. You might not want individual evaluation tests to "fail" and block builds in your CI/CD pipelines when this happens. Instead, it might be better to rely on the generated report and track the overall trends for evaluation scores across different scenarios over time (and only fail individual builds when there's a significant drop in evaluation scores across multiple different tests). That said, there is some nuance here and the choice of whether to validate individual results or not can vary depending on the specific use case. + This step is optional and mainly for demonstration purposes. In real-world evaluations, you might not want to validate individual results because the LLM responses and evaluation scores can change over time as your product (and the models used) evolve. You might not want individual evaluation tests to "fail" and block builds in your CI/CD pipelines when results change. Instead, it might be better to rely on the generated report and track the overall trends for evaluation scores across different scenarios over time (and only fail individual builds when there's a significant drop in evaluation scores across multiple different tests). That said, there is some nuance here and the choice of whether to validate individual results or not can vary depending on the specific use case. When the method returns, the `scenarioRun` object is disposed and the evaluation result for the evaluation is stored to the (disk-based) result store that's configured in `s_defaultReportingConfiguration`. @@ -152,25 +152,22 @@ Run the test using your preferred test workflow, for example, by using the CLI c 1. 
Install the [Microsoft.Extensions.AI.Evaluation.Console](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation.Console) .NET tool by running the following command from a terminal window: ```dotnetcli - dotnet tool install --local Microsoft.Extensions.AI.Evaluation.Console + dotnet tool install --create-manifest-if-needed Microsoft.Extensions.AI.Evaluation.Console ``` - > [!TIP] - > You might need to create a manifest file first. For more information about that and installing local tools, see [Local tools](../../core/tools/dotnet-tool-install.md#local-tools). - 1. Generate a report by running the following command: ```dotnetcli dotnet tool run aieval report --path --output report.html ``` -1. Open the `report.html` file. It should look something like this. +1. Open the `report.html` file. The report looks similar to the following screenshot. :::image type="content" source="media/evaluation-report.png" alt-text="Screenshot of the evaluation report showing the conversation and metric values."::: ## Next steps -- Navigate to the directory where the test results are stored (which is `C:\TestReports`, unless you modified the location when you created the ). In the `results` subdirectory, notice that there's a folder for each test run named with a timestamp (`ExecutionName`). Inside each of those folders is a folder for each scenario name—in this case, just the single test method in the project. That folder contains a JSON file with the all the data including the messages, response, and evaluation result. -- Expand the evaluation. Here are a couple ideas: - - Add an additional custom evaluator, such as [an evaluator that uses AI to determine the measurement system](https://github.com/dotnet/ai-samples/blob/main/src/microsoft-extensions-ai-evaluation/api/evaluation/Evaluators/MeasurementSystemEvaluator.cs) that's used in the response. 
- - Add another test method, for example, [a method that evaluates multiple responses](https://github.com/dotnet/ai-samples/blob/main/src/microsoft-extensions-ai-evaluation/api/reporting/ReportingExamples.Example02_SamplingAndEvaluatingMultipleResponses.cs) from the LLM. Since each response can be different, it's good to sample and evaluate at least a few responses to a question. In this case, you specify an iteration name each time you call . +- Navigate to the directory where the test results are stored (which is `C:\TestReports`, unless you modified the location when you created the ). In the `results` subdirectory, notice that there's a folder for each test run named with a timestamp (`ExecutionName`). Inside each of those folders is a folder for each scenario name—in this case, just the single test method in the project. That folder contains a JSON file with all the data including the messages, response, and evaluation result. +- Expand the evaluation. Here are a couple of ideas: + - Add another custom evaluator, such as [an evaluator that uses AI to determine the measurement system](https://github.com/dotnet/ai-samples/blob/main/src/microsoft-extensions-ai-evaluation/api/evaluation/Evaluators/MeasurementSystemEvaluator.cs) that's used in the response. + - Add another test method, for example, [a method that evaluates multiple responses](https://github.com/dotnet/ai-samples/blob/main/src/microsoft-extensions-ai-evaluation/api/reporting/ReportingExamples.Example02_SamplingAndEvaluatingMultipleResponses.cs) from the LLM. Because each response can be different, it's good to sample and evaluate at least a few responses to a question. In this case, you specify an iteration name each time you call . 
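
The `WordCountEvaluator` that this tutorial references lives in an external snippet file. As a rough illustration of the shape a non-AI evaluator takes, here's a hedged sketch; the `IEvaluator` member signatures are paraphrased from the library's abstractions and should be checked against the Microsoft.Extensions.AI.Evaluation package before use.

```csharp
// Sketch of a non-AI evaluator in the spirit of the tutorial's WordCountEvaluator.
// Signatures are paraphrased assumptions; verify against the package's IEvaluator.
public sealed class WordCountEvaluator : IEvaluator
{
    public const string WordCountMetricName = "Words";

    public IReadOnlyCollection<string> EvaluationMetricNames => [WordCountMetricName];

    public ValueTask<EvaluationResult> EvaluateAsync(
        IEnumerable<ChatMessage> messages,
        ChatResponse modelResponse,
        ChatConfiguration? chatConfiguration = null,
        IEnumerable<EvaluationContext>? additionalContext = null,
        CancellationToken cancellationToken = default)
    {
        // Count words in the response text; no AI model is involved.
        int wordCount = modelResponse.Text
            .Split((char[]?)null, StringSplitOptions.RemoveEmptyEntries).Length;

        var metric = new NumericMetric(WordCountMetricName, wordCount)
        {
            // Default interpretation: 6-100 words is acceptable.
            // The caller can override this interpretation if needed.
            Interpretation = wordCount is >= 6 and <= 100
                ? new EvaluationMetricInterpretation(EvaluationRating.Good)
                : new EvaluationMetricInterpretation(
                      EvaluationRating.Unacceptable, failed: true)
        };

        return new ValueTask<EvaluationResult>(new EvaluationResult(metric));
    }
}
```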
diff --git a/docs/ai/evaluation/libraries.md b/docs/ai/evaluation/libraries.md index d3e46908cc1db..5851c9939b9e6 100644 --- a/docs/ai/evaluation/libraries.md +++ b/docs/ai/evaluation/libraries.md @@ -2,14 +2,14 @@ title: The Microsoft.Extensions.AI.Evaluation libraries description: Learn about the Microsoft.Extensions.AI.Evaluation libraries, which simplify the process of evaluating the quality and accuracy of responses generated by AI models in .NET intelligent apps. ms.topic: concept-article -ms.date: 03/03/2026 +ms.date: 04/09/2026 ai-usage: ai-assisted --- # The Microsoft.Extensions.AI.Evaluation libraries The Microsoft.Extensions.AI.Evaluation libraries simplify the process of evaluating the quality and safety of responses generated by AI models in .NET intelligent apps. Various quality metrics measure aspects like relevance, truthfulness, coherence, and completeness of the responses. Safety metrics measure aspects like hate and unfairness, violence, and sexual content. Evaluations are crucial in testing, because they help ensure that the AI model performs as expected and provides reliable and accurate results. -The evaluation libraries, which are built on top of the [Microsoft.Extensions.AI abstractions](../microsoft-extensions-ai.md), are composed of the following NuGet packages: +The evaluation libraries, which build on the [Microsoft.Extensions.AI abstractions](../microsoft-extensions-ai.md), are composed of the following NuGet packages: - [📦 Microsoft.Extensions.AI.Evaluation](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation) – Defines the core abstractions and types for supporting evaluation. - [📦 Microsoft.Extensions.AI.Evaluation.NLP](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation.NLP) - Contains [evaluators](#nlp-evaluators) that evaluate the similarity of an LLM's response text to one or more reference responses using natural language processing (NLP) metrics. 
These evaluators aren't LLM or AI-based; they use traditional NLP techniques such as text tokenization and n-gram analysis to evaluate text similarity. @@ -21,13 +21,13 @@ The evaluation libraries, which are built on top of the [Microsoft.Extensions.AI ## Test integration -The libraries are designed to integrate smoothly with existing .NET apps, allowing you to leverage existing testing infrastructures and familiar syntax to evaluate intelligent apps. You can use any test framework (for example, [MSTest](../../core/testing/index.md#mstest), [xUnit](../../core/testing/index.md#xunitnet), or [NUnit](../../core/testing/index.md#nunit)) and testing workflow (for example, [Test Explorer](/visualstudio/test/run-unit-tests-with-test-explorer), [dotnet test](../../core/tools/dotnet-test.md), or a CI/CD pipeline). The library also provides easy ways to do online evaluations of your application by publishing evaluation scores to telemetry and monitoring dashboards. +The libraries integrate smoothly with existing .NET apps, letting you use existing testing infrastructure and familiar syntax to evaluate intelligent apps. You can use any test framework (for example, [MSTest](../../core/testing/index.md#mstest), [xUnit](../../core/testing/index.md#xunitnet), or [NUnit](../../core/testing/index.md#nunit)) and testing workflow (for example, [Test Explorer](/visualstudio/test/run-unit-tests-with-test-explorer), [dotnet test](../../core/tools/dotnet-test.md), or a CI/CD pipeline). The library also provides easy ways to do online evaluations of your application by publishing evaluation scores to telemetry and monitoring dashboards. ## Comprehensive evaluation metrics The evaluation libraries were built in collaboration with data science researchers from Microsoft and GitHub, and were tested on popular Microsoft Copilot experiences. 
The following sections show the built-in [quality](#quality-evaluators), [NLP](#nlp-evaluators), and [safety](#safety-evaluators) evaluators and the metrics they measure. -You can also customize to add your own evaluations by implementing the interface. +To add your own evaluations, implement the interface. ### Quality evaluators @@ -61,7 +61,7 @@ NLP evaluators evaluate the quality of an LLM response by comparing it to a refe ### Safety evaluators -Safety evaluators check for presence of harmful, inappropriate, or unsafe content in a response. They rely on the Foundry Evaluation service, which uses a model that's fine tuned to perform evaluations. +Safety evaluators check for the presence of harmful, inappropriate, or unsafe content in a response. They rely on the Foundry Evaluation service, which uses a model that's fine-tuned to perform evaluations. | Evaluator type | Metric | Description | |---------------------------------------------------------------------------|--------------------|-------------| @@ -79,11 +79,11 @@ Safety evaluators check for presence of harmful, inappropriate, or unsafe conten ## Cached responses -The library uses *response caching* functionality, which means responses from the AI model are persisted in a cache. In subsequent runs, if the request parameters (prompt and model) are unchanged, responses are then served from the cache to enable faster execution and lower cost. +The library provides *response caching*, which persists responses from the AI model. In subsequent runs, if the request parameters (prompt and model) are unchanged, it serves responses from the cache for faster execution and lower cost. ## Reporting -The library contains support for storing evaluation results and generating reports. The following image shows an example report in an Azure DevOps pipeline: +The library supports storing evaluation results and generating reports.
The following image shows an example report in an Azure DevOps pipeline: :::image type="content" source="../media/ai-extensions/pipeline-report.jpg" lightbox="../media/ai-extensions/pipeline-report.jpg" alt-text="Screenshot of an AI evaluation report in an Azure DevOps pipeline."::: @@ -91,11 +91,11 @@ The `dotnet aieval` tool, which ships as part of the `Microsoft.Extensions.AI.Ev ## Configuration -The libraries are designed to be flexible. You can pick the components that you need. For example, you can disable response caching or tailor reporting to work best in your environment. You can also customize and configure your evaluations, for example, by adding customized metrics and reporting options. +The libraries are flexible, and you can pick the components you need. For example, disable response caching or tailor reporting to work best in your environment. You can also customize your evaluations, such as by adding custom metrics and reporting options. ## Samples -For a more comprehensive tour of the functionality and APIs available in the Microsoft.Extensions.AI.Evaluation libraries, see the [API usage examples (dotnet/ai-samples repo)](https://github.com/dotnet/ai-samples/blob/main/src/microsoft-extensions-ai-evaluation/api/). These examples are structured as a collection of unit tests. Each unit test showcases a specific concept or API and builds on the concepts and APIs showcased in previous unit tests. +For a more comprehensive tour of the functionality and APIs in the Microsoft.Extensions.AI.Evaluation libraries, see the [API usage examples (dotnet/ai-samples repo)](https://github.com/dotnet/ai-samples/blob/main/src/microsoft-extensions-ai-evaluation/api/). These examples are a collection of unit tests. Each unit test showcases a specific concept or API and builds on the concepts and APIs showcased in previous unit tests.
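As a rough illustration of the custom-evaluator extension point mentioned earlier, the following sketch shows the general shape of an evaluator that returns a `NumericMetric`. The member signatures are approximated from the package's documented types (`IEvaluator`, `EvaluationResult`, `NumericMetric`); verify the exact API shape against the published Microsoft.Extensions.AI.Evaluation reference before use.

```csharp
using Microsoft.Extensions.AI;
using Microsoft.Extensions.AI.Evaluation;

// Sketch of a custom evaluator that scores response length.
// Member signatures are approximations; check the published
// Microsoft.Extensions.AI.Evaluation API for the exact shape.
public class WordCountEvaluator : IEvaluator
{
    public const string MetricName = "Word Count";

    public IReadOnlyCollection<string> EvaluationMetricNames => [MetricName];

    public ValueTask<EvaluationResult> EvaluateAsync(
        IEnumerable<ChatMessage> messages,
        ChatResponse modelResponse,
        ChatConfiguration? chatConfiguration = null,
        IEnumerable<EvaluationContext>? additionalContext = null,
        CancellationToken cancellationToken = default)
    {
        // A deterministic, non-LLM measurement of the response.
        int wordCount = modelResponse.Text
            .Split(' ', StringSplitOptions.RemoveEmptyEntries).Length;

        var metric = new NumericMetric(MetricName, wordCount);
        return new ValueTask<EvaluationResult>(new EvaluationResult(metric));
    }
}
```

Because this evaluator performs a simple measurement rather than calling a model, it doesn't need a `ChatConfiguration`; LLM-based custom evaluators would use that parameter to reach the judge model.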
## See also diff --git a/docs/ai/evaluation/responsible-ai.md b/docs/ai/evaluation/responsible-ai.md index 60c3707151ac6..ab5cc097b1fc4 100644 --- a/docs/ai/evaluation/responsible-ai.md +++ b/docs/ai/evaluation/responsible-ai.md @@ -1,13 +1,13 @@ --- title: Responsible AI with .NET -description: Learn what responsible AI is and how you can use .NET to evaluate the safety of your AI apps. -ms.date: 03/03/2026 +description: Learn what responsible AI is and how to use .NET to evaluate the safety of your AI apps. +ms.date: 04/09/2026 ai-usage: ai-assisted --- # Responsible AI with .NET -*Responsible AI* refers to the practice of designing, developing, and deploying artificial intelligence systems in a way that is ethical, transparent, and aligned with human values. It emphasizes fairness, accountability, privacy, and safety to ensure that AI technologies benefit individuals and society as a whole. As AI becomes increasingly integrated into applications and decision-making processes, prioritizing responsible AI is of utmost importance. +*Responsible AI* refers to the practice of designing, developing, and deploying artificial intelligence systems in a way that is ethical, transparent, and aligned with human values. It emphasizes fairness, accountability, privacy, and safety to ensure that AI technologies benefit individuals and society as a whole. As AI becomes increasingly integrated into applications and decision-making processes, prioritizing responsible AI is essential. 
Microsoft has identified [six principles](https://www.microsoft.com/ai/responsible-ai) for responsible AI: diff --git a/docs/ai/how-to/use-tokenizers.md b/docs/ai/how-to/use-tokenizers.md index 872a05c256679..f3a5312cf996d 100644 --- a/docs/ai/how-to/use-tokenizers.md +++ b/docs/ai/how-to/use-tokenizers.md @@ -2,12 +2,12 @@ title: Use Microsoft.ML.Tokenizers for text tokenization description: Learn how to use the Microsoft.ML.Tokenizers library to tokenize text for AI models, manage token counts, and work with various tokenization algorithms. ms.topic: how-to -ms.date: 10/29/2025 +ms.date: 04/09/2026 ai-usage: ai-assisted --- # Use Microsoft.ML.Tokenizers for text tokenization -The [Microsoft.ML.Tokenizers](https://www.nuget.org/packages/Microsoft.ML.Tokenizers) library provides a comprehensive set of tools for tokenizing text in .NET applications. Tokenization is essential when you work with large language models (LLMs), as it allows you to manage token counts, estimate costs, and preprocess text for AI models. +The [Microsoft.ML.Tokenizers](https://www.nuget.org/packages/Microsoft.ML.Tokenizers) library provides a comprehensive set of tools for tokenizing text in .NET applications. Tokenization is essential when you work with large language models (LLMs), as it lets you manage token counts, estimate costs, and preprocess text for AI models. This article shows you how to use the library's key features and work with different tokenizer models. @@ -26,7 +26,7 @@ Install the Microsoft.ML.Tokenizers NuGet package: dotnet add package Microsoft.ML.Tokenizers ``` -For Tiktoken models (like GPT-4), you also need to install the corresponding data package: +For Tiktoken models (like GPT-4), also install the corresponding data package: ```dotnetcli dotnet add package Microsoft.ML.Tokenizers.Data.O200kBase @@ -47,7 +47,7 @@ The Tiktoken tokenizer is commonly used with OpenAI models like GPT-4. 
The follo :::code language="csharp" source="./snippets/use-tokenizers/csharp/TokenizersExamples/TiktokenExample.cs" id="TiktokenBasic"::: -For better performance, you should cache and reuse the tokenizer instance throughout your app. +For better performance, cache and reuse the tokenizer instance throughout your app. When you work with LLMs, you often need to manage text within token limits. The following example shows how to trim text to a specific token count: @@ -65,7 +65,7 @@ All tokenizers support advanced encoding options, such as controlling normalizat ## Use BPE tokenizer -*Byte-pair encoding* (BPE) is the underlying algorithm used by many tokenizers, including Tiktoken. BPE was initially developed as an algorithm to compress texts, and then used by OpenAI for tokenization when it pretrained the GPT model. The following example demonstrates BPE tokenization: +*Byte-pair encoding* (BPE) is the underlying algorithm used by many tokenizers, including Tiktoken. BPE was initially developed as an algorithm to compress texts, and then OpenAI used it for tokenization when it pretrained the GPT model. The following example demonstrates BPE tokenization: :::code language="csharp" source="./snippets/use-tokenizers/csharp/TokenizersExamples/BpeExample.cs" id="BpeBasic":::
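For reference, the token counting and trimming described above look roughly like the following sketch. The model name and token budget are illustrative only; verify the member names against the published Microsoft.ML.Tokenizers API.

```csharp
using Microsoft.ML.Tokenizers;

// Create a tokenizer for a specific OpenAI model (requires the
// matching Microsoft.ML.Tokenizers.Data.* package to be installed).
Tokenizer tokenizer = TiktokenTokenizer.CreateForModel("gpt-4o");

string text = "Tokenization lets you manage token counts and estimate costs.";

// Count tokens before sending text to the model.
int tokenCount = tokenizer.CountTokens(text);

// Find the character index that keeps the text within an
// illustrative 10-token budget, then trim to it.
int index = tokenizer.GetIndexByTokenCount(
    text, maxTokenCount: 10, out string? normalizedText, out int trimmedCount);
string trimmed = text[..index];
```

Creating the tokenizer is the expensive step, which is why the article recommends caching and reusing the instance across calls.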