# Evaluation with Data
In this notebook, we introduce built-in evaluators and guide you through creating your own custom evaluators. We'll cover both code-based and prompt-based custom evaluators. Finally, we'll demonstrate how to use the `evaluate` API to assess data using these evaluators.


In [None]:
# Clear any old installation.
# This matters because older versions of promptflow shipped as a single package,
# which has since been split into several.
! pip uninstall -y promptflow promptflow-cli promptflow-azure promptflow-core promptflow-devkit promptflow-tools promptflow-evals

# Install packages in this order
! pip install promptflow-evals

In [None]:
#! pip install azure_ai_ml --extra-index-url https://pkgs.dev.azure.com/azure-sdk/public/_packaging/azure-sdk-for-python/pypi/simple/

# Dependencies needed for some of the notebooks
#! pip install azure-cli
#! pip install bs4
#! pip install ipykernel

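Expected environment variables. Note that the model configuration cell in section 1.1 reads the API version from `OPENAI_API_VERSION` rather than `AZURE_OPENAI_API_VERSION`.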

```
AZURE_OPENAI_API_KEY
AZURE_OPENAI_API_VERSION
AZURE_OPENAI_DEPLOYMENT
AZURE_OPENAI_ENDPOINT
```
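
Before running the later cells, it can help to confirm that these variables are actually set. A minimal sketch, assuming the names listed above; `missing_env_vars` is a hypothetical helper and not part of promptflow:

```python
import os

# Names taken from the list above; adjust them if your setup differs.
REQUIRED_ENV_VARS = [
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_API_VERSION",
    "AZURE_OPENAI_DEPLOYMENT",
    "AZURE_OPENAI_ENDPOINT",
]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_ENV_VARS)
if missing:
    print("Missing environment variables: " + ", ".join(missing))
```

Run it after `load_dotenv()` so that values from a `.env` file are taken into account.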

In [123]:
import os

import pandas as pd
from dotenv import load_dotenv

load_dotenv()  # take environment variables from .env.

True

## Evaluate the eval dataset using the fine-tuned model

In [124]:
dataset_path_hf_eval = "dataset/hf.eval.jsonl"
dataset_path_hf_eval_answer = "dataset/hf.eval.answer2.jsonl"
dataset_path_hf_eval_answer_baseline = "dataset/hf.eval.answer2.baseline.jsonl"

dataset_path_ft_eval = "dataset/ft.eval.jsonl"
dataset_path_ft_eval_baseline = "dataset/ft.eval.baseline.jsonl"

EVAL_OPENAI_API_KEY_BASE = os.getenv('EVAL_OPENAI_API_KEY_BASE')
EVAL_OPENAI_API_KEY_FT = os.getenv('EVAL_OPENAI_API_KEY_FT')

### Baseline

In [125]:
!unset AZURE_OPENAI_ENDPOINT && \
unset AZURE_OPENAI_API_KEY && \
unset OPENAI_API_VERSION && \
OPENAI_BASE_URL="https://Llama-2-7b-lnqzi-serverless.westus3.inference.ai.azure.com/v1" \
OPENAI_API_KEY=$EVAL_OPENAI_API_KEY_BASE \
python ../eval.py \
    --question-file $dataset_path_hf_eval \
    --answer-file $dataset_path_hf_eval_answer_baseline \
    --model Llama-2-7b-lnqzi

2024-05-15 05:53:20  INFO [    ] eval number of inputs: 12
100%|███████████████████████████████████████████| 12/12 [00:21<00:00,  1.80s/it]
2024-05-15 05:53:42  INFO [    ] eval total time used: 21.636550664901733


### Fine-tuned model

In [126]:
!unset AZURE_OPENAI_ENDPOINT && \
unset AZURE_OPENAI_API_KEY && \
unset OPENAI_API_VERSION && \
OPENAI_BASE_URL="https://Llama-2-7b-raft-ucb-sh-man-yzqgd-serverless.westus3.inference.ai.azure.com/v1" \
OPENAI_API_KEY=$EVAL_OPENAI_API_KEY_FT \
python ../eval.py \
    --question-file $dataset_path_hf_eval \
    --answer-file $dataset_path_hf_eval_answer \
    --model Llama-2-7b-raft-ucb-sh-man-yzqgd

2024-05-15 05:53:43  INFO [    ] eval number of inputs: 12
100%|███████████████████████████████████████████| 12/12 [00:20<00:00,  1.72s/it]
2024-05-15 05:54:04  INFO [    ] eval total time used: 20.65044641494751
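
The two cells above use shell syntax to unset the Azure variables and to set `OPENAI_BASE_URL` and `OPENAI_API_KEY` for a single command. The same idea can be sketched in Python by building the child environment explicitly. `build_child_env` is a hypothetical helper, and the URL and key below are placeholders rather than real endpoints:

```python
import os


def build_child_env(base, unset=(), overrides=None):
    """Copy `base`, drop the `unset` names, then apply `overrides`."""
    dropped = set(unset)
    env = {key: value for key, value in base.items() if key not in dropped}
    env.update(overrides or {})
    return env


child_env = build_child_env(
    os.environ,
    unset=["AZURE_OPENAI_ENDPOINT", "AZURE_OPENAI_API_KEY", "OPENAI_API_VERSION"],
    overrides={
        "OPENAI_BASE_URL": "https://example.invalid/v1",  # placeholder URL
        "OPENAI_API_KEY": "placeholder-key",  # placeholder key
    },
)
# The mapping could then be passed to subprocess.run([...], env=child_env).
```

Because the parent environment is copied rather than modified, later cells still see the original Azure variables.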


## 0. Prepare eval dataset

In [115]:
! python ../format.py \
    --input $dataset_path_hf_eval_answer \
    --input-type jsonl \
    --output $dataset_path_ft_eval \
    --output-format eval

Generating train split: 12 examples [00:00, 738.79 examples/s]
2024-05-15 05:46:03  INFO [    ] raft Converting jsonl file dataset/hf.eval.answer2.jsonl to jsonl eval file dataset/ft.eval.jsonl
Map: 100%|██████████████████████████████| 12/12 [00:00<00:00, 204.26 examples/s]
Map: 100%|█████████████████████████████| 12/12 [00:00<00:00, 3359.25 examples/s]
Creating json from Arrow format: 100%|████████████| 1/1 [00:00<00:00, 54.87ba/s]


In [107]:
! python ../format.py \
    --input $dataset_path_hf_eval_answer_baseline \
    --input-type jsonl \
    --output $dataset_path_ft_eval_baseline \
    --output-format eval

Generating train split: 12 examples [00:00, 1646.49 examples/s]
2024-05-15 03:47:14  INFO [    ] raft Converting jsonl file dataset/hf.eval.answer.baseline.jsonl to jsonl eval file dataset/ft.eval.baseline.jsonl
Map: 100%|█████████████████████████████| 12/12 [00:00<00:00, 2081.97 examples/s]
Map: 100%|█████████████████████████████| 12/12 [00:00<00:00, 4145.59 examples/s]
Creating json from Arrow format: 100%|███████████| 1/1 [00:00<00:00, 339.98ba/s]


In [116]:
df = pd.read_json(dataset_path_ft_eval, lines=True)
df['answer_final'] = df['answer'].map(lambda x: x.split("<ANSWER>:")[-1])
df.head()

Unnamed: 0,question,answer,gold_answer,context,answer_final
0,Did Nicholas B. Birgeneau serve as a Universit...,2013?\n\nThe context provided is:\n\n> In 1952...,Nicholas B. Birgeneau served as a University C...,<DOCUMENT>It shares this unofficial status\nwi...,2013?\n\nThe context provided is:\n\n> In 1952...
1,"What significant donation did BP, Bill and Mel...","\n\nThe answer is:\n\n<ANSWER>: BP, Bill and M...",The context provided does not specify the sign...,<DOCUMENT>The institute is now widely regarded...,"BP, Bill and Melinda Gates Foundation, and Yu..."
2,In what year did Robert J. Berdahl's service b...,\n<ANSWER>: 2004\n\n### 2.\n\n<QUESTION>: What...,2004,<DOCUMENT>News since at\nleast 2014.</DOCUMENT...,2004\n\n### 21.\n\n<QUESTION>:
3,Name one notable non-alumni benefactor of UC B...,\n\nThe correct answer is: Jane Street princip...,Mark Zuckerberg and Priscilla Chan,"<DOCUMENT>In 1952, the university\nreorganized...",\n\nThe correct answer is: Jane Street princip...
4,What percentage of UC Berkeley's total revenue...,\n\nThe context provided in the question is:\n...,12 percent,<DOCUMENT>Dirks\n2017–present: Carol T. Christ...,\n\nThe context provided in the question is:\n...
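
The `answer_final` column keeps everything after the last `<ANSWER>:` marker, which is why row 2 above still carries trailing text after `2004`. A minimal sketch of a stricter extractor follows. `extract_final_answer` is a hypothetical helper, and cutting the answer at the first blank line is an assumption, not a rule the notebook defines:

```python
def extract_final_answer(text, marker="<ANSWER>:"):
    """Return the text after the last marker, cut at the first blank line.

    If the marker is absent, the whole text is used, which matches the
    behavior of `text.split(marker)[-1]`.
    """
    tail = text.split(marker)[-1].strip()
    # Assumption: the answer is the first paragraph of the remaining text.
    return tail.split("\n\n", 1)[0].strip()


print(extract_final_answer("\n<ANSWER>: 2004\n\n### 2.\n\n<QUESTION>: What"))
```

It could be applied in place of the lambda, for example `df['answer'].map(extract_final_answer)`.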


In [109]:
pd.read_json(dataset_path_ft_eval_baseline, lines=True).head()

Unnamed: 0,question,answer,gold_answer,context
0,What is the full name of the individual who he...,2017?\nWhat is the full name of the individual...,Nicholas B. Birgeneau.,<DOCUMENT>Department of Education's Office of ...
1,In what year did Robert J. Berdahl's service b...,\nIn what year did Robert J. Berdahl's service...,2004,<DOCUMENT>News since at\nleast 2014.</DOCUMENT...
2,When did Robert J. Berdahl serve in his position?,\nWhen did Nicholas B. Dirks serve in his posi...,Robert J. Berdahl served in his position from ...,<DOCUMENT>The alumni giving rate accounts for ...
3,How many years did Robert J. Berdahl's term last?,\nHow many years did Robert J. Berdahl's term ...,9 years,<DOCUMENT>Native Americans contended with the ...
4,What is the complete name of the person who se...,2004 to 2013 as the 10th,Robert J. Berdahl,<DOCUMENT>Berkeley had originally reported tha...


## 1. Built-in Evaluators

The table below lists all the built-in evaluators we support. In the following sections, we will select a few of these evaluators to demonstrate how to use them.

| Category       | Namespace                                        | Evaluator Class           | Notes                                             |
|----------------|--------------------------------------------------|---------------------------|---------------------------------------------------|
| Quality        | promptflow.evals.evaluators                      | GroundednessEvaluator     |                                                   |
|                |                                                  | RelevanceEvaluator        |                                                   |
|                |                                                  | CoherenceEvaluator        |                                                   |
|                |                                                  | FluencyEvaluator          |                                                   |
|                |                                                  | SimilarityEvaluator       |                                                   |
|                |                                                  | F1ScoreEvaluator          |                                                   |
| Content Safety | promptflow.evals.evaluators.content_safety       | ViolenceEvaluator         |                                                   |
|                |                                                  | SexualEvaluator           |                                                   |
|                |                                                  | SelfHarmEvaluator         |                                                   |
|                |                                                  | HateUnfairnessEvaluator   |                                                   |
| Composite      | promptflow.evals.evaluators                      | QAEvaluator               | Built on top of individual quality evaluators.    |
|                |                                                  | ChatEvaluator             | Similar to QAEvaluator but designed for evaluating chat messages. |
|                |                                                  | ContentSafetyEvaluator    | Built on top of individual content safety evaluators. |
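
The name `F1ScoreEvaluator` refers to an F1 overlap metric. To illustrate the kind of computation involved, here is a minimal sketch of a token-level F1 score between an answer and a ground truth. It assumes lowercase whitespace tokenization and is not the library's implementation; `token_f1` is a hypothetical name:

```python
from collections import Counter


def token_f1(answer, ground_truth):
    """Token-level F1 between two strings using lowercase whitespace tokens."""
    answer_tokens = answer.lower().split()
    truth_tokens = ground_truth.lower().split()
    if not answer_tokens or not truth_tokens:
        return 0.0
    # The multiset intersection counts each shared token only as often
    # as it appears in both strings.
    overlap = sum((Counter(answer_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(answer_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


print(token_f1("Nicholas B.", "Nicholas B. Birgeneau."))
```

Unlike the model-graded quality metrics, a score like this depends only on the two strings, so it is cheap to compute and reproducible.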



### 1.1 Quality Evaluator

In [92]:
import os
from promptflow.core import AzureOpenAIModelConfiguration

azure_endpoint = os.environ.get("AZURE_OPENAI_ENDPOINT")
api_key = os.environ.get("AZURE_OPENAI_API_KEY")
azure_deployment = os.environ.get("AZURE_OPENAI_DEPLOYMENT")
api_version = os.environ.get("OPENAI_API_VERSION")

# f-strings avoid a TypeError when a variable is unset (None).
print(f"azure_endpoint={azure_endpoint}")
print(f"azure_deployment={azure_deployment}")
print(f"api_version={api_version}")

# Initialize Azure OpenAI Connection
model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=azure_endpoint,
    api_key=api_key,
    azure_deployment=azure_deployment,
    api_version=api_version,
)

azure_endpoint=https://ai-cviaiwestus1288043977207.openai.azure.com/
azure_deployment=gpt-4-turbo
api_version=2023-03-15-preview


In [93]:
from promptflow.evals.evaluators import RelevanceEvaluator

# Initializing Relevance Evaluator
relevance_eval = RelevanceEvaluator(model_config)

In [94]:
sample = df.iloc[0]
sample

question       What is the full name of the individual who he...
answer         2017?\nWhat is the full name of the individual...
gold_answer                               Nicholas B. Birgeneau.
context        <DOCUMENT>Department of Education's Office of ...
Name: 0, dtype: object

In [95]:
# Running Relevance Evaluator on single input row
relevance_score = relevance_eval(
    question=sample['question'],
    answer=sample['answer'],
    context=sample['context'],
)
print(relevance_score)

{'gpt_relevance': 1.0}
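
The relevance evaluator above is called like a function with keyword arguments and returns a dictionary of scores. A code-based custom evaluator can follow the same shape. A minimal sketch, assuming that any callable taking keyword arguments and returning a dict is acceptable; `AnswerLengthEvaluator` and the `answer_length` key are illustrative names:

```python
class AnswerLengthEvaluator:
    """Toy evaluator that reports the length of the answer in characters."""

    def __call__(self, *, answer: str, **kwargs):
        # Extra inputs such as question or context are accepted and ignored.
        return {"answer_length": len(answer)}


answer_length_eval = AnswerLengthEvaluator()
print(answer_length_eval(answer="9 years"))  # {'answer_length': 7}
```

Accepting `**kwargs` lets the same row be passed to several evaluators that each use a different subset of the columns.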


## 2. Batch evaluate

In [99]:
dataset_path_ft_eval_pf = "dataset/ft.eval.pf.jsonl"

In [102]:
df = pd.read_json(dataset_path_ft_eval, lines=True)
df.head()

Unnamed: 0,question,answer,gold_answer,context
0,What is the full name of the individual who he...,2017?\nWhat is the full name of the individual...,Nicholas B. Birgeneau.,<DOCUMENT>Department of Education's Office of ...
1,In what year did Robert J. Berdahl's service b...,\nIn what year did Robert J. Berdahl's service...,2004,<DOCUMENT>News since at\nleast 2014.</DOCUMENT...
2,When did Robert J. Berdahl serve in his position?,\nWhen did Nicholas B. Dirks serve in his posi...,Robert J. Berdahl served in his position from ...,<DOCUMENT>The alumni giving rate accounts for ...
3,How many years did Robert J. Berdahl's term last?,\nHow many years did Robert J. Berdahl's term ...,9 years,<DOCUMENT>Native Americans contended with the ...
4,What is the complete name of the person who se...,2004 to 2013 as the 10th,Robert J. Berdahl,<DOCUMENT>Berkeley had originally reported tha...


In [100]:
!python ../pfeval.py \
    --input $dataset_path_ft_eval \
    --output $dataset_path_ft_eval_pf

2024-05-15 02:50:32  INFO [    ] pfeval Loading model configuration
2024-05-15 02:50:32  INFO [    ] pfeval deployment=fgpt-4-turbo
2024-05-15 02:50:32  INFO [    ] pfeval api_version=f2023-03-15-preview
2024-05-15 02:50:32  INFO [    ] pfeval azure_endpoint=fhttps://ai-cviaiwestus1288043977207.openai.azure.com/
2024-05-15 02:50:32  INFO [    ] pfeval Starting evaluate...
100%|███████████████████████████████████████████| 12/12 [00:11<00:00,  1.02it/s]
2024-05-15 02:50:44  INFO [    ] pfeval Finished evaluate in 11.843371868133545s
2024-05-15 02:50:44  INFO [    ] pfeval Writing 12 results to dataset/ft.eval.pf.jsonl


In [101]:
df = pd.read_json(dataset_path_ft_eval_pf, lines=True)
df.head()

Unnamed: 0,query,response,context,gpt_relevance,gpt_fluency,gpt_coherence,gpt_groundedness
0,In what year did Nicholas B. Birgeneau's tenur...,2017,<DOCUMENT>The institute is now widely regarded...,5,,1,1
1,What is the full name of the individual who he...,Nicholas B. Birgeneau.,<DOCUMENT>Department of Education's Office of ...,1,,1,1
2,How many years did Robert J. Berdahl's term last?,9 years,<DOCUMENT>Native Americans contended with the ...,5,,3,1
3,Who served as the Chancellor of the University...,Nicholas B.,<DOCUMENT>Dirks\n2017–present: Carol T. Christ...,3,1.0,1,1
4,Did Nicholas B. Birgeneau serve as a Universit...,Nicholas B. Birgeneau served as a University C...,<DOCUMENT>It shares this unofficial status\nwi...,3,4.0,4,5
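
To compare runs, it helps to reduce the per-row scores to one number per metric. A minimal sketch using only the five rows displayed above (not the full 12-row file), with missing scores represented as `None` and skipped; `mean_scores` is a hypothetical helper:

```python
def mean_scores(rows, metrics):
    """Average each metric over the rows, ignoring missing (None) values."""
    means = {}
    for metric in metrics:
        values = [row[metric] for row in rows if row.get(metric) is not None]
        means[metric] = sum(values) / len(values) if values else None
    return means


# Scores copied from the five rows displayed above; None marks a missing value.
rows = [
    {"gpt_relevance": 5, "gpt_fluency": None, "gpt_coherence": 1, "gpt_groundedness": 1},
    {"gpt_relevance": 1, "gpt_fluency": None, "gpt_coherence": 1, "gpt_groundedness": 1},
    {"gpt_relevance": 5, "gpt_fluency": None, "gpt_coherence": 3, "gpt_groundedness": 1},
    {"gpt_relevance": 3, "gpt_fluency": 1.0, "gpt_coherence": 1, "gpt_groundedness": 1},
    {"gpt_relevance": 3, "gpt_fluency": 4.0, "gpt_coherence": 4, "gpt_groundedness": 5},
]
metrics = ["gpt_relevance", "gpt_fluency", "gpt_coherence", "gpt_groundedness"]
print(mean_scores(rows, metrics))
```

Skipping missing values means the fluency average here rests on only two rows, so it should be read with more caution than the other three metrics.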


## 3. Using Evaluate API to evaluate with data

In previous sections, we walked you through how to use built-in evaluators to evaluate a single row and how to define your own custom evaluators. Now, we will show you how to use these evaluators with the powerful `evaluate` API to assess an entire dataset.

First, let's take a peek at what the data looks like.

In [57]:
df.head()

Unnamed: 0,question,gold_answer,context
0,When did Robert J. Berdahl serve in his position?,Robert J. Berdahl served in his position from ...,<DOCUMENT>The alumni giving rate accounts for ...
1,What is the complete name of the person who se...,Robert J. Berdahl,<DOCUMENT>Berkeley had originally reported tha...
2,How many years did Robert J. Berdahl's term last?,9 years,<DOCUMENT>Native Americans contended with the ...
3,In what year did Robert J. Berdahl's service b...,2004,<DOCUMENT>News since at\nleast 2014.</DOCUMENT...
4,Who served as the Chancellor of the University...,Nicholas B.,<DOCUMENT>Dirks\n2017–present: Carol T. Christ...


Now, we will invoke the `evaluate` API using the relevance evaluator we already initialized.

Additionally, a column mapping passes the dataset's `gold_answer` column to the evaluator's `answer` input.
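
Column mappings use `${data.<column>}` placeholders that refer to columns of the input dataset. The following sketch illustrates the idea on a single row. It is not how promptflow resolves mappings internally; `resolve_mapping` is a hypothetical helper, and passing unmapped columns through unchanged is an assumption:

```python
import re

_PLACEHOLDER = re.compile(r"^\$\{data\.(\w+)\}$")


def resolve_mapping(row, mapping):
    """Build evaluator inputs from a row using `${data.column}` placeholders."""
    # Assumption: unmapped columns pass through under their own names.
    inputs = dict(row)
    for target, placeholder in mapping.items():
        match = _PLACEHOLDER.match(placeholder)
        if match is None:
            raise ValueError(f"Unsupported placeholder: {placeholder!r}")
        inputs[target] = row[match.group(1)]
    return inputs


row = {"question": "In what year ...?", "answer": "2017?", "gold_answer": "2004"}
print(resolve_mapping(row, {"answer": "${data.gold_answer}"}))
```

In this illustration the `answer` input ends up holding the value of `gold_answer`, replacing the value from the row's own `answer` column.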

In [60]:
from promptflow.evals.evaluate import evaluate

result = evaluate(
    data=dataset_path_ft_eval,
    evaluators={
        "relevance": relevance_eval
    },
    # column mapping
    evaluator_config={
        "default": {
            "answer": "${data.gold_answer}"
        }
    }
)

from IPython.display import display, JSON
display(JSON(result))

Starting prompt flow service...
Start prompt flow service on port 23333, version: 1.10.1.


[2024-05-14 18:59:01,061][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run azure_ai_studio_ft_variant_0_20240514_185853_935940, log path: /home/vscode/.promptflow/.runs/azure_ai_studio_ft_variant_0_20240514_185853_935940/logs.txt


You can stop the prompt flow service with the following command:'pf service stop'.
Alternatively, if no requests are made within 1 hours, it will automatically stop.
You can view the traces in local from http://localhost:23333/v1.0/ui/traces/?#run=azure_ai_studio_ft_variant_0_20240514_185853_935940


Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/usr/local/lib/python3.10/multiprocessing/spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
  File "/workspaces/gorilla/raft/.venv/lib/python3.10/site-packages/promptflow/_sdk/operations/__init__.py", line 7, in <module>
    from ._flow_operations import FlowOperations
  File "/workspaces/gorilla/raft/.venv/lib/python3.10/site-packages/promptflow/_sdk/operations/_flow_operations.py", line 26, in <module>
    from promptflow._sdk._configuration import Configuration
  File "/workspaces/gorilla/raft/.venv/lib/python3.10/site-packages/promptflow/_sdk/_configuration.py", line 21, in <module>
    from promptflow._sdk._tracing import TraceDestinationConfig
  File "/workspaces/gorilla/raft/.venv/lib/python3.10/site-packages/promptflow/_sdk/_tracing.py", line 58, 

SpawnedForkProcessManagerStartFailure: Failed to start spawned fork process manager

ERROR:opentelemetry.sdk.trace.export:Exception while exporting Span batch.
Traceback (most recent call last):
  File "/workspaces/gorilla/raft/.venv/lib/python3.10/site-packages/urllib3/connection.py", line 198, in _new_conn
    sock = connection.create_connection(
  File "/workspaces/gorilla/raft/.venv/lib/python3.10/site-packages/urllib3/util/connection.py", line 85, in create_connection
    raise err
  File "/workspaces/gorilla/raft/.venv/lib/python3.10/site-packages/urllib3/util/connection.py", line 73, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/workspaces/gorilla/raft/.venv/lib/python3.10/site-packages/urllib3/connectionpool.py", line 793, in urlopen
    response = self._make_request(
  File "/workspaces/gorilla/raft/.venv/lib/python3.10/site-packages/urllib3/connectionpool.py", line 496, in _make_request
    conn.


Finally, let's check the results produced by the evaluate API.

In [None]:
# Check the results using Azure AI Studio UI
print(result["studio_url"])
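
If no Studio URL is available, the returned result can also be inspected locally. A minimal sketch, assuming the result is a dictionary with `metrics` and `rows` entries alongside `studio_url`. The stand-in dictionary and its metric key are made up for illustration and are not output from this notebook; `summarize_result` is a hypothetical helper:

```python
def summarize_result(result):
    """Format the row count and aggregate metrics of an evaluation result."""
    lines = [f"rows: {len(result.get('rows', []))}"]
    for name, value in sorted(result.get("metrics", {}).items()):
        lines.append(f"{name}: {value}")
    return "\n".join(lines)


# Hypothetical stand-in with the assumed shape; replace with the real `result`.
example_result = {
    "rows": [{"outputs.relevance.gpt_relevance": 5}],
    "metrics": {"relevance.gpt_relevance": 5.0},
    "studio_url": None,
}
print(summarize_result(example_result))
```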