# Evaluation with Data
In this notebook, we introduce built-in evaluators and guide you through creating your own custom evaluators. We'll cover both code-based and prompt-based custom evaluators. Finally, we'll demonstrate how to use the `evaluate` API to assess data using these evaluators.


In [3]:
# Clear any old installation.
# This is important because older versions of promptflow shipped as a single package,
# which has since been split into several packages.
! pip uninstall -y promptflow promptflow-cli promptflow-azure promptflow-core promptflow-devkit promptflow-tools promptflow-evals

# Install the evaluation package; pip pulls in the other promptflow packages as dependencies
! pip install promptflow-evals

Found existing installation: promptflow-azure 1.10.1
Uninstalling promptflow-azure-1.10.1:
  Successfully uninstalled promptflow-azure-1.10.1
Found existing installation: promptflow-core 1.10.1
Uninstalling promptflow-core-1.10.1:
  Successfully uninstalled promptflow-core-1.10.1
Found existing installation: promptflow-devkit 1.10.1
Uninstalling promptflow-devkit-1.10.1:
  Successfully uninstalled promptflow-devkit-1.10.1
Found existing installation: promptflow-evals 0.2.0.dev0
Uninstalling promptflow-evals-0.2.0.dev0:
  Successfully uninstalled promptflow-evals-0.2.0.dev0
Collecting promptflow-evals
  Using cached promptflow_evals-0.2.0.dev0-py3-none-any.whl (53 kB)
Collecting promptflow-core<2.0.0
  Using cached promptflow_core-1.10.1-py3-none-any.whl (966 kB)
Collecting promptflow-azure<2.0.0
  Using cached promptflow_azure-1.10.1-py3-none-any.whl (704 kB)
Collecting promptflow-devkit<2.0.0
  Using cached promptflow_devkit-1.10.1-py3-none-any.whl (2.2 MB)
Installing collecte

In [None]:
#! pip install azure_ai_ml --extra-index-url https://pkgs.dev.azure.com/azure-sdk/public/_packaging/azure-sdk-for-python/pypi/simple/

# Dependencies needed for some of the notebooks
#! pip install azure-cli
#! pip install bs4
#! pip install ipykernel

Expected environment variables:

```
AZURE_OPENAI_API_KEY
AZURE_OPENAI_API_VERSION
AZURE_OPENAI_DEPLOYMENT
AZURE_OPENAI_ENDPOINT
```
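Before running the cells that follow, it can help to confirm that these variables are actually set. The check below is a minimal sketch using only the standard library; the helper name `missing_env_vars` is hypothetical and not part of promptflow.

```python
import os

# Names taken from the list of expected environment variables above.
REQUIRED_VARS = (
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_API_VERSION",
    "AZURE_OPENAI_DEPLOYMENT",
    "AZURE_OPENAI_ENDPOINT",
)


def missing_env_vars(names=REQUIRED_VARS, environ=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("All expected environment variables are set.")
```

If the variables live in a `.env` file, run this check after `load_dotenv()` so the loaded values are visible.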

In [11]:
from dotenv import load_dotenv

load_dotenv()  # take environment variables from .env.

True

## 1. Built-in Evaluators

The table below lists all the built-in evaluators we support. In the following sections, we will select a few of these evaluators to demonstrate how to use them.

| Category       | Namespace                                        | Evaluator Class           | Notes                                             |
|----------------|--------------------------------------------------|---------------------------|---------------------------------------------------|
| Quality        | promptflow.evals.evaluators                      | GroundednessEvaluator     |                                                   |
|                |                                                  | RelevanceEvaluator        |                                                   |
|                |                                                  | CoherenceEvaluator        |                                                   |
|                |                                                  | FluencyEvaluator          |                                                   |
|                |                                                  | SimilarityEvaluator       |                                                   |
|                |                                                  | F1ScoreEvaluator          |                                                   |
| Content Safety | promptflow.evals.evaluators.content_safety       | ViolenceEvaluator         |                                                   |
|                |                                                  | SexualEvaluator           |                                                   |
|                |                                                  | SelfHarmEvaluator         |                                                   |
|                |                                                  | HateUnfairnessEvaluator   |                                                   |
| Composite      | promptflow.evals.evaluators                      | QAEvaluator               | Built on top of individual quality evaluators.    |
|                |                                                  | ChatEvaluator             | Similar to QAEvaluator but designed for evaluating chat messages. |
|                |                                                  | ContentSafetyEvaluator    | Built on top of individual content safety evaluators. |

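Most of the quality evaluators in the table are model-based, while an F1 score can be computed in plain code. As an illustration of the kind of metric involved, here is a sketch of one common token-overlap F1 definition. It is an assumption for illustration only and not necessarily how `F1ScoreEvaluator` is implemented; the function name `token_f1` and the whitespace tokenization are hypothetical choices.

```python
from collections import Counter


def token_f1(answer: str, ground_truth: str) -> float:
    """Token-overlap F1 between an answer and a reference string."""
    answer_tokens = answer.lower().split()
    truth_tokens = ground_truth.lower().split()
    if not answer_tokens or not truth_tokens:
        return 0.0
    # The multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(answer_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(answer_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


print(token_f1("the alpine explorer tent", "alpine explorer tent is waterproof"))
```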


### 1.1 Quality Evaluator

In [13]:
import os
from promptflow.core import AzureOpenAIModelConfiguration

# Read the connection settings from the expected environment variables
azure_endpoint = os.environ.get("AZURE_OPENAI_ENDPOINT")
api_key = os.environ.get("AZURE_OPENAI_API_KEY")
azure_deployment = os.environ.get("AZURE_OPENAI_DEPLOYMENT")
api_version = os.environ.get("AZURE_OPENAI_API_VERSION")

print("azure_endpoint=" + azure_endpoint)
print("azure_deployment=" + azure_deployment)
print("api_version=" + api_version)

# Initialize the Azure OpenAI model configuration
model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=azure_endpoint,
    api_key=api_key,
    azure_deployment=azure_deployment,
    api_version=api_version,
)

azure_endpoint=https://ai-cviaiwestus1288043977207.openai.azure.com/
azure_deployment=gpt-4-turbo
api_version=2023-03-15-preview


In [14]:
from promptflow.evals.evaluators import RelevanceEvaluator

# Initializing the Relevance Evaluator
relevance_eval = RelevanceEvaluator(model_config)

In [15]:
# Running the Relevance Evaluator on a single input row
relevance_score = relevance_eval(
    answer="The Alpine Explorer Tent is the most waterproof.",
    context="From our product list,"
    " the alpine explorer tent is the most waterproof."
    " The Adventure Dining Table has higher weight.",
    question="Which tent is the most waterproof?",
)

In [16]:
print(relevance_score)

{'gpt_relevance': 5.0}

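A code-based custom evaluator can be sketched as a plain Python callable that accepts keyword inputs and returns a dict of metrics, the same shape as the output above. The class name `AnswerLengthEvaluator` and the `answer_length` key are hypothetical, and whether `evaluate` accepts this exact shape is an assumption to check against the library documentation.

```python
class AnswerLengthEvaluator:
    """Hypothetical code-based evaluator: reports the length of the answer."""

    def __call__(self, *, answer: str, **kwargs) -> dict:
        # Return a dict of metric name -> value, mirroring the shape of
        # the built-in evaluator's output. Extra inputs are ignored.
        return {"answer_length": len(answer)}


answer_length_eval = AnswerLengthEvaluator()
length_score = answer_length_eval(
    answer="The Alpine Explorer Tent is the most waterproof."
)
print(length_score)
```

Accepting `**kwargs` lets the same callable be invoked on rows that carry additional columns, such as `question` or `context`, without raising an error.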

## 3. Using Evaluate API to evaluate with data

In previous sections, we walked you through how to use built-in evaluators to evaluate a single row and how to define your own custom evaluators. Now, we will show you how to use these evaluators with the powerful `evaluate` API to assess an entire dataset.

First, let's take a peek at what the data looks like.

In [None]:
import pandas as pd

data_path = "data.jsonl"

df = pd.read_json(data_path, lines=True)
df

Now, we will invoke the `evaluate` API using the evaluator that we already initialized.

We also pass a column mapping that maps the `truth` column from the dataset to `ground_truth`, the input name accepted by the evaluator.

In [None]:
from promptflow.evals.evaluate import evaluate

result = evaluate(
    data="data.jsonl",
    evaluators={
        "relevance": relevance_eval
    },
    # column mapping
    evaluator_config={
        "default": {
            "ground_truth": "${data.truth}"
        }
    }
)

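To make the column mapping concrete, the sketch below resolves a `${data.<column>}` reference against a single row. It is illustrative only and not the library's implementation: the helper `apply_column_mapping` and the sample row are hypothetical, since the real contents of `data.jsonl` are not shown here.

```python
import re

# Matches references of the form ${data.<column>}.
_DATA_REF = re.compile(r"^\$\{data\.(\w+)\}$")


def apply_column_mapping(row: dict, mapping: dict) -> dict:
    """Illustrative only: add inputs resolved from `${data.<column>}` references."""
    resolved = dict(row)
    for target, reference in mapping.items():
        match = _DATA_REF.match(reference)
        if match is None:
            raise ValueError(f"Unsupported reference: {reference!r}")
        # Copy the value of the referenced column under the target name.
        resolved[target] = row[match.group(1)]
    return resolved


# Hypothetical row, made up for this sketch.
row = {
    "question": "Which tent is the most waterproof?",
    "truth": "The Alpine Explorer Tent",
}
mapped = apply_column_mapping(row, {"ground_truth": "${data.truth}"})
print(mapped["ground_truth"])
```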

Finally, let's check the results produced by the evaluate API.

In [None]:
from IPython.display import display, JSON

display(JSON(result))

In [None]:
# Check the results using Azure AI Studio UI
print(result["studio_url"])