# Evaluation of the fine-tuned and baseline models with the RAFT generated eval dataset split

In this notebook, we use the evaluation dataset that the [0_gen.ipynb](./0_gen.ipynb) notebook generated synthetically with the RAFT method to evaluate both the baseline model and the fine-tuned model, then compare the two to analyse the impact of fine-tuning.

We introduce the `promptflow-evals` package and built-in evaluators. Then, we'll demonstrate how to use the `evaluate` API to assess data using these evaluators.

Finally, we'll plot a bar chart comparing the metrics of the fine-tuned model against those of the baseline.

## Overview

- Testing
  - Run the baseline model on the evaluation split to get its predictions.
  - Run the fine-tuned model on the evaluation split to get its predictions.
- Answers formatting
  - Convert the baseline model answers to a format suitable for testing.
  - Convert the fine-tuned model answers to a format suitable for testing.
- Evaluation
  - Calculate the metrics (similarity and groundedness) based on the predictions from the baseline model.
  - Calculate the same metrics based on the predictions from the fine-tuned model.
- Compare metrics

In [None]:
! pip install promptflow-evals

## Testing

### Define variables we will need
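
The cell below reads several values from the environment, and an unset variable would silently turn into `None` inside the file paths. As a minimal sketch (the helper name `missing_env_vars` is hypothetical and not part of this repository), a check like the following could flag missing values early:

```python
import os

# Variable names read by the cell below (taken from this notebook).
REQUIRED_VARS = [
    "DATASET_NAME",
    "BASELINE_OPENAI_DEPLOYMENT",
    "FINETUNED_OPENAI_DEPLOYMENT",
]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in env (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

Run it after the `load_dotenv` calls so that values from `.env` and `.env.state` are taken into account.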

In [None]:
import os
from dotenv import load_dotenv

# User provided values
load_dotenv('.env')

# Variables passed by previous notebooks
load_dotenv('.env.state')

# Capture the initial working directory because the evaluate function will change it
# (named working_dir to avoid shadowing the built-in dir())
working_dir = os.getcwd()

experiment_name = os.getenv("DATASET_NAME")
experiment_dir = f"{working_dir}/dataset/{experiment_name}-files"

# Dataset generated by the gen notebook that we will evaluate the baseline and finetuned models on
dataset_path_hf_eval = f"{experiment_dir}/{experiment_name}-hf.eval.jsonl"

# Evaluated answer files
dataset_path_hf_eval_answer = f"{experiment_dir}/{experiment_name}-hf.eval.answer.jsonl"
dataset_path_hf_eval_answer_baseline = f"{experiment_dir}/{experiment_name}-hf.eval.answer.baseline.jsonl"

# Formatted answer evaluation files
dataset_path_eval_answer_finetuned = f"{experiment_dir}/{experiment_name}-eval.answer.finetuned.jsonl"
dataset_path_eval_answer_baseline = f"{experiment_dir}/{experiment_name}-eval.answer.baseline.jsonl"

# Scored answer files
dataset_path_eval_answer_score_finetuned = f"{experiment_dir}/{experiment_name}-eval.answer.score.finetuned.jsonl"
dataset_path_eval_answer_score_baseline = f"{experiment_dir}/{experiment_name}-eval.answer.score.baseline.jsonl"

BASELINE_OPENAI_DEPLOYMENT = os.getenv("BASELINE_OPENAI_DEPLOYMENT")
FINETUNED_OPENAI_DEPLOYMENT = os.getenv("FINETUNED_OPENAI_DEPLOYMENT")

print(f"Evaluating the finetuned model {FINETUNED_OPENAI_DEPLOYMENT} against the baseline model {BASELINE_OPENAI_DEPLOYMENT}")

### Run the baseline model on the evaluation split

In [None]:
!python .gorilla/raft/eval.py \
    --question-file $dataset_path_hf_eval \
    --answer-file $dataset_path_hf_eval_answer_baseline \
    --model $BASELINE_OPENAI_DEPLOYMENT \
    --env-prefix BASELINE \
    --mode chat

### Run the fine-tuned model on the evaluation split

In [None]:
!python .gorilla/raft/eval.py \
    --question-file $dataset_path_hf_eval \
    --answer-file $dataset_path_hf_eval_answer \
    --model $FINETUNED_OPENAI_DEPLOYMENT \
    --env-prefix FINETUNED \
    --mode completion

## Answers formatting

### Format baseline answers

Convert the baseline model answers to a format suitable for testing.

In [None]:
! python .gorilla/raft/format.py \
    --input $dataset_path_hf_eval_answer_baseline \
    --input-type jsonl \
    --output $dataset_path_eval_answer_baseline \
    --output-format eval

### Format finetuned model answers

Convert the fine-tuned model answers to a format suitable for testing.

In [None]:
! python .gorilla/raft/format.py \
    --input $dataset_path_hf_eval_answer \
    --input-type jsonl \
    --output $dataset_path_eval_answer_finetuned \
    --output-format eval

## Let's review the formatted files
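
Beyond looking at the first rows, it can help to confirm that every record carries the fields the evaluators read later in this notebook (`question`, `context`, `final_answer`, `gold_final_answer`). The following is a minimal sketch using only the standard library; `find_incomplete_records` is a hypothetical helper, and the sample records are made up for illustration.

```python
import json

# Field names used by the evaluators later in this notebook.
EXPECTED_FIELDS = {"question", "context", "final_answer", "gold_final_answer"}


def find_incomplete_records(lines, expected=EXPECTED_FIELDS):
    """Return (line_number, missing_fields) for each JSONL record lacking a field."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        missing = sorted(expected - set(record))
        if missing:
            problems.append((number, missing))
    return problems


# Made-up records: the second one lacks both answer fields.
sample_lines = [
    '{"question": "q", "context": "c", "final_answer": "a", "gold_final_answer": "g"}',
    '{"question": "q", "context": "c"}',
]
print(find_incomplete_records(sample_lines))
```

To check a real file, pass an open file handle, for example `find_incomplete_records(open(dataset_path_eval_answer_finetuned))`.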

### Finetuned model answers

In [None]:
import pandas as pd

In [None]:
pd.read_json(dataset_path_eval_answer_finetuned, lines=True).head(2)

### Baseline model answers

In [None]:
pd.read_json(dataset_path_eval_answer_baseline, lines=True).head(2)

## Evaluation

### Built-in Evaluators

The table below lists the built-in evaluators that `promptflow-evals` provides. In the following sections, we use two of them, `SimilarityEvaluator` and `GroundednessEvaluator`, to demonstrate how they work.

| Category       | Namespace                                        | Evaluator Class           | Notes                                             |
|----------------|--------------------------------------------------|---------------------------|---------------------------------------------------|
| Quality        | promptflow.evals.evaluators                      | GroundednessEvaluator     | Measures how well the answer is entailed by the context and is not hallucinated. |
|                |                                                  | RelevanceEvaluator        | How well the answer addresses the main aspects of the question, based on the context. Consider whether all and only the important aspects are contained in the answer when evaluating relevance. |
|                |                                                  | CoherenceEvaluator        | How well all the sentences fit together and sound natural as a whole. |
|                |                                                  | FluencyEvaluator          | Quality of individual sentences in the answer, and whether they are well-written and grammatically correct. |
|                |                                                  | SimilarityEvaluator       | Measures the similarity between the predicted answer and the correct answer. |
|                |                                                  | F1ScoreEvaluator          | F1 score |
| Content Safety | promptflow.evals.evaluators.content_safety       | ViolenceEvaluator         |                                                   |
|                |                                                  | SexualEvaluator           |                                                   |
|                |                                                  | SelfHarmEvaluator         |                                                   |
|                |                                                  | HateUnfairnessEvaluator   |                                                   |
| Composite      | promptflow.evals.evaluators                      | QAEvaluator               | Built on top of individual quality evaluators.    |
|                |                                                  | ChatEvaluator             | Similar to QAEvaluator but designed for evaluating chat messages. |
|                |                                                  | ContentSafetyEvaluator    | Built on top of individual content safety evaluators. |



#### Quality Evaluator
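
For intuition about the "F1 score" entry in the table, a common way to compute an F1 score between a predicted answer and a reference answer is token overlap. The sketch below shows that general technique with plain Python. Whether `F1ScoreEvaluator` tokenizes or normalizes text in the same way is an assumption, and `token_f1` is a hypothetical name.

```python
from collections import Counter


def token_f1(answer, ground_truth):
    """Token-overlap F1 between two strings, using lowercase whitespace tokens."""
    answer_tokens = answer.lower().split()
    truth_tokens = ground_truth.lower().split()
    if not answer_tokens or not truth_tokens:
        # Convention chosen here: 1.0 only when both sides are empty.
        return float(answer_tokens == truth_tokens)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(answer_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(answer_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


print(token_f1("Paris is the capital", "The capital is Paris"))
```

Because the score ignores word order, the two example sentences count as a perfect match.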

In [None]:
import os
from promptflow.core import AzureOpenAIModelConfiguration

azure_endpoint = os.environ.get("SCORE_AZURE_OPENAI_ENDPOINT")
api_key = os.environ.get("SCORE_AZURE_OPENAI_API_KEY")
azure_deployment = os.environ.get("SCORE_AZURE_OPENAI_DEPLOYMENT")
api_version = os.environ.get("SCORE_OPENAI_API_VERSION")

# f-strings avoid a TypeError when one of the variables is unset (None)
print(f"azure_endpoint={azure_endpoint}")
print(f"azure_deployment={azure_deployment}")
print(f"api_version={api_version}")

# Initialize Azure OpenAI Connection
model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=azure_endpoint,
    api_key=api_key,
    azure_deployment=azure_deployment,
    api_version=api_version,
)

In [None]:
from promptflow.evals.evaluators import SimilarityEvaluator, GroundednessEvaluator

# Initializing evaluators
similarity = SimilarityEvaluator(model_config)
groundedness = GroundednessEvaluator(model_config)

In [None]:
df = pd.read_json(dataset_path_eval_answer_finetuned, lines=True)
sample=df.iloc[1]
sample

In [None]:
# Running Groundedness Evaluator on single input row
groundedness_score = groundedness(
    answer=sample['final_answer'],
    context=sample['context'],
)
print(groundedness_score)

In [None]:
# Running Similarity Evaluator on single input row
similarity_score = similarity(
    question=sample['question'],
    answer=sample['final_answer'],
    context=sample['context'],
    ground_truth=sample['gold_final_answer'],
)
print(similarity_score)

### Using the Evaluate API to calculate the metrics

In the previous section, we used built-in evaluators to score a single row. Now we use the same evaluators with the `evaluate` API to assess an entire dataset.
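
Conceptually, evaluating a dataset amounts to calling each evaluator on each row and then aggregating the scores. The sketch below illustrates that idea in plain Python. It is not the `evaluate` implementation: `evaluate_rows` and `exact_match` are hypothetical names, the metric naming is an assumption, and the two rows are made up.

```python
from statistics import mean


def evaluate_rows(rows, evaluators):
    """Apply every evaluator to every row, then average each numeric score."""
    scored_rows = []
    for row in rows:
        scored = dict(row)
        for name, evaluator in evaluators.items():
            # Each evaluator returns a dict of metric name -> value.
            for metric, value in evaluator(**row).items():
                scored[f"{name}.{metric}"] = value
        scored_rows.append(scored)
    # Metric columns are the keys added on top of the input columns.
    metric_names = [k for k in scored_rows[0] if k not in rows[0]] if scored_rows else []
    metrics = {m: mean(r[m] for r in scored_rows) for m in metric_names}
    return {"rows": scored_rows, "metrics": metrics}


# Hypothetical stand-in evaluator: exact match instead of a model call.
def exact_match(answer, ground_truth, **_):
    return {"score": float(answer.strip() == ground_truth.strip())}


rows = [
    {"answer": "42", "ground_truth": "42"},
    {"answer": "41", "ground_truth": "42"},
]
result = evaluate_rows(rows, {"exact_match": exact_match})
print(result["metrics"])
```

The returned dictionary mirrors the `rows` and `metrics` keys that this notebook reads from the real `evaluate` result.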

### Configure AI Studio reporting (Optional)

You can optionally upload the evaluation report to Azure AI Studio to keep track of the evaluations and share them with your team. To do so, set the following environment variables to the details of the Azure AI Studio project you want to upload the reports to:

```
REPORT_SUB_ID=<SUBSCRIPTION ID>
REPORT_GROUP=<RESOURCE GROUP NAME>
REPORT_PROJECT_NAME=<AZURE AI STUDIO HUB PROJECT NAME>
```
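
The implementation of the `get_reporting_project_scope` helper used below is not shown in this notebook. As a rough sketch of what such a helper might do, the following builds a project-scope dictionary from the three variables above and returns `None` when any of them is missing. The function name `build_project_scope` is hypothetical, and the dictionary key names are an assumption about what `evaluate` expects for its `azure_ai_project` argument.

```python
import os


def build_project_scope(env=None):
    """Return a project-scope dict, or None when reporting is not configured."""
    env = os.environ if env is None else env
    sub_id = env.get("REPORT_SUB_ID")
    group = env.get("REPORT_GROUP")
    project = env.get("REPORT_PROJECT_NAME")
    if not (sub_id and group and project):
        return None  # reporting is optional, so results stay local
    # Assumed key names for the azure_ai_project argument.
    return {
        "subscription_id": sub_id,
        "resource_group_name": group,
        "project_name": project,
    }


# With only one of the three variables set, no scope is returned.
print(build_project_scope({"REPORT_SUB_ID": "<SUBSCRIPTION ID>"}))
```

Returning `None` for a partial configuration keeps the upload strictly opt-in.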

In [None]:
from utils import get_reporting_project_scope
project_scope_report = get_reporting_project_scope()

### Running the metrics

Now, we will invoke the `evaluate` API using the two evaluators that we already initialized.

Additionally, we pass a column mapping that maps the dataset's `final_answer` column to the `answer` input and its `gold_final_answer` column to the `ground_truth` input, which are the names the evaluators accept.
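
To make the `${data.<column>}` references used in the mapping concrete, here is a minimal sketch of how such a mapping could be resolved against a single row. It only illustrates the idea and is not how `promptflow-evals` implements it; `resolve_mapping` is a hypothetical name and the row is made up.

```python
import re

# Matches references of the form ${data.<column>}.
_PLACEHOLDER = re.compile(r"^\$\{data\.(\w+)\}$")


def resolve_mapping(mapping, row):
    """Build evaluator keyword arguments from a row using ${data.<column>} references."""
    kwargs = {}
    for argument, reference in mapping.items():
        match = _PLACEHOLDER.match(reference)
        if match is None:
            raise ValueError(f"Unsupported reference: {reference!r}")
        kwargs[argument] = row[match.group(1)]
    return kwargs


mapping = {
    "answer": "${data.final_answer}",
    "ground_truth": "${data.gold_final_answer}",
}
row = {"final_answer": "Paris", "gold_final_answer": "Paris, France"}
print(resolve_mapping(mapping, row))
```

The resolved dictionary can then be passed to an evaluator as keyword arguments, which is why the mapping keys must match the evaluator's parameter names.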

In [None]:
from promptflow.evals.evaluate import evaluate

def score_dataset(dataset, output_path=None):
    result = evaluate(
        data=dataset,
        evaluators={
            "similarity": similarity,
            "groundedness": groundedness
        },
        azure_ai_project=project_scope_report,
        # column mapping
        evaluator_config={
            "similarity": {
                "question": "${data.question}",
                "answer": "${data.final_answer}",
                "ground_truth": "${data.gold_final_answer}",
                "context": "${data.context}",
            },
            "groundedness": {
                "answer": "${data.final_answer}",
                "context": "${data.context}",
            },
        }
    )

    if output_path:
        pd.DataFrame.from_dict(result['rows']).to_json(output_path, orient="records", lines=True)

    return result

#### Baseline model evaluation metrics

In [None]:
pd.read_json(dataset_path_eval_answer_baseline, lines=True).head(2)

In [None]:
baseline_result = score_dataset(dataset_path_eval_answer_baseline, dataset_path_eval_answer_score_baseline)
from IPython.display import display, JSON
display(JSON(baseline_result['metrics']))

In [None]:
# Check the results using Azure AI Studio UI
if baseline_result["studio_url"]:
    print(f"Results uploaded to AI Studio {baseline_result['studio_url']}")
else:
    print("Results available at http://127.0.0.1:23333")

#### Finetuned model evaluation metrics

In [None]:
pd.read_json(dataset_path_eval_answer_finetuned, lines=True).head(2)

In [None]:
finetune_result = score_dataset(dataset_path_eval_answer_finetuned, dataset_path_eval_answer_score_finetuned)
from IPython.display import display, JSON
display(JSON(finetune_result['metrics']))


Finally, let's check the results produced by the evaluate API.

In [None]:
# Check the results using Azure AI Studio UI
if finetune_result["studio_url"]:
    print(f"Results uploaded to AI Studio {finetune_result['studio_url']}")
else:
    print("Results available at http://127.0.0.1:23333")

## Compare the metrics of the fine-tuned model against the baseline
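
The comparison in this section expresses the change as the ratio of the fine-tuned score to the baseline score. A ratio is undefined when the baseline score is zero, and an absolute difference is sometimes easier to read. The following is a minimal sketch of both views; `compare_metrics` is a hypothetical helper and the numbers are made up for illustration, not results from this notebook.

```python
def compare_metrics(baseline, finetuned):
    """Return per-metric absolute difference and ratio (ratio is None for a zero baseline)."""
    comparison = {}
    # Only metrics present in both results can be compared.
    for name in sorted(baseline.keys() & finetuned.keys()):
        base, tuned = baseline[name], finetuned[name]
        comparison[name] = {
            "difference": tuned - base,
            "ratio": tuned / base if base else None,
        }
    return comparison


# Made-up numbers, for illustration only.
example = compare_metrics(
    {"similarity": 2.0, "groundedness": 4.0},
    {"similarity": 3.0, "groundedness": 4.0},
)
print(example)
```

A ratio above 1 or a positive difference means the fine-tuned model scored higher on that metric.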

In [None]:
metrics = pd.DataFrame.from_dict({"baseline": baseline_result['metrics'], "finetuned": finetune_result['metrics']})
# Ratio of the fine-tuned score to the baseline score; values above 1 indicate an improvement
metrics['improvement'] = metrics['finetuned'] / metrics['baseline']
metrics

In [None]:
metrics.drop('improvement', axis=1).plot.bar(rot=0)