### Install Vertex AI Python SDK for Gen AI Evaluation Service

In [1]:
# %pip install --upgrade --user --quiet google-cloud-aiplatform[evaluation]

### Restart runtime
To use the newly installed packages in this Jupyter runtime, you must restart the runtime. You can do this by running the cell below, which restarts the current kernel.

The restart might take a minute or longer. After it's restarted, continue to the next step.

In [2]:
# import IPython

# app = IPython.Application.instance()
# app.kernel.do_shutdown(True)

<div class="alert alert-block alert-warning">
<b>⚠️ The kernel is going to restart. Wait until it's finished before continuing to the next step. ⚠️</b>
</div>


### Authenticate your notebook environment (Colab only)

In [3]:
import sys

if "google.colab" in sys.modules:
    from google.colab import auth

    auth.authenticate_user()

### Set Google Cloud project information and initialize Vertex AI SDK

In [4]:
import os
from IPython.display import JSON
import json

# Cloud project ID: read it from the active gcloud configuration.
PROJECT_IDS = !(gcloud config get-value core/project)
PROJECT_ID = PROJECT_IDS[0] if PROJECT_IDS else ""  # @param {type:"string"}

# Fall back to the environment variable. Default to "" so that a missing
# value is not turned into the string "None".
if not PROJECT_ID:
    PROJECT_ID = os.environ.get("GOOGLE_CLOUD_PROJECT", "")

LOCATION = "us-central1"  # @param {type:"string"}

os.environ["GOOGLE_CLOUD_PROJECT"] = PROJECT_ID
os.environ["GOOGLE_CLOUD_LOCATION"] = LOCATION
os.environ["GOOGLE_GENAI_USE_VERTEXAI"] = "TRUE"  # Use Vertex AI API

In [5]:
LOCATION = "us-central1"  # @param {type:"string"}
EXPERIMENT_NAME = "my-eval-task-experiment"  # @param {type:"string"}

if not PROJECT_ID or PROJECT_ID == "[your-project-id]":
    raise ValueError("Please set your PROJECT_ID")


import vertexai

vertexai.init(project=PROJECT_ID, location=LOCATION)
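
The project lookup above tries the gcloud configuration first, then the `GOOGLE_CLOUD_PROJECT` environment variable, and rejects the `[your-project-id]` placeholder. A minimal sketch of that fallback as a testable helper is shown below. `resolve_project_id` and the `demo-project` value are hypothetical names used only for illustration; they are not part of the Vertex AI SDK.

```python
PLACEHOLDER = "[your-project-id]"


def resolve_project_id(cli_value, env):
    """Return the first usable project ID, or raise if none is set.

    cli_value: the value reported by gcloud (may be empty or None).
    env: a mapping such as os.environ.
    """
    for candidate in (cli_value, env.get("GOOGLE_CLOUD_PROJECT")):
        # Skip missing values, blank strings, and the template placeholder.
        if candidate and candidate.strip() and candidate.strip() != PLACEHOLDER:
            return candidate.strip()
    raise ValueError("Please set your PROJECT_ID")


# gcloud returned nothing, so the environment variable is used instead.
project_id = resolve_project_id("", {"GOOGLE_CLOUD_PROJECT": "demo-project"})
print(project_id)
```

In the notebook, `env` would be `os.environ`; a plain dict is used here so the sketch runs without any cloud configuration.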

### Import libraries

In [6]:
import pandas as pd
from vertexai.evaluation import EvalTask, PointwiseMetric, PointwiseMetricPromptTemplate
from vertexai.preview.evaluation import notebook_utils

# Set up eval metrics based on your criteria

As a developer, you want to evaluate the quality of LLM-generated text against two criteria: fluency and entertainment value. You define a metric called `text_quality` that combines those two criteria.

In [7]:
# Your own definition of text_quality.
text_quality = PointwiseMetric(
    metric="text_quality",
    metric_prompt_template=PointwiseMetricPromptTemplate(
        criteria={
            "fluency": "Sentences flow smoothly and are easy to read, avoiding awkward phrasing or run-on sentences. Ideas and sentences connect logically, using transitions effectively where needed.",
            "entertaining": "Short, amusing text that incorporates emojis, exclamations and questions to convey quick and spontaneous communication and diversion.",
        },
        rating_rubric={
            "1": "The response performs well on both criteria.",
            "0": "The response is somewhat aligned with both criteria",
            "-1": "The response falls short on both criteria",
        },
    ),
)

No `input_variables` are specified for this metric, so only the `response` column of the dataset is used to compute this model-based metric.


In [8]:
print(text_quality.metric_prompt_template)

# Instruction
You are an expert evaluator. Your task is to evaluate the quality of the responses generated by AI models. We will provide you with the user prompt and an AI-generated responses.
You should first read the user input carefully for analyzing the task, and then evaluate the quality of the responses based on the Criteria provided in the Evaluation section below.
You will assign the response a rating following the Rating Rubric and Evaluation Steps. Give step by step explanations for your rating, and only choose ratings from the Rating Rubric.


# Evaluation
## Criteria
entertaining: Short, amusing text that incorporates emojis, exclamations and questions to convey quick and spontaneous communication and diversion.
fluency: Sentences flow smoothly and are easy to read, avoiding awkward phrasing or run-on sentences. Ideas and sentences connect logically, using transitions effectively where needed.

## Rating Rubric
-1: The response falls short on both criteria
0: The response i
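
In the printed template, the `criteria` and `rating_rubric` dictionaries appear as `## Criteria` and `## Rating Rubric` sections, with criteria listed alphabetically and ratings in ascending order. The sketch below reproduces only that layout in plain Python. It is an illustration of the mapping, not the SDK's implementation; `render_sections` is a hypothetical helper and the criterion descriptions are shortened.

```python
def render_sections(criteria, rating_rubric):
    """Lay out criteria and rubric the way the printed template shows them."""
    lines = ["## Criteria"]
    # Criteria appear in alphabetical order in the printed template.
    lines += [f"{name}: {criteria[name]}" for name in sorted(criteria)]
    lines += ["", "## Rating Rubric"]
    # Ratings are string keys; sort numerically so "-1" comes before "0".
    lines += [
        f"{rating}: {rating_rubric[rating]}"
        for rating in sorted(rating_rubric, key=int)
    ]
    return "\n".join(lines)


rendered = render_sections(
    criteria={
        "fluency": "Sentences flow smoothly and are easy to read.",
        "entertaining": "Short, amusing text with emojis and exclamations.",
    },
    rating_rubric={
        "1": "The response performs well on both criteria.",
        "0": "The response is somewhat aligned with both criteria",
        "-1": "The response falls short on both criteria",
    },
)
print(rendered)
```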

# Prepare your dataset

This example evaluates stored model responses, so the evaluation dataset needs only a `response` column.

In [9]:
responses = [
    # An example of good text_quality
    "Life is a rollercoaster, full of ups and downs, but it's the thrill that keeps us coming back for more!",
    # An example of medium text_quality
    "The weather is nice today, not too hot, not too cold.",
    # An example of poor text_quality
    "The weather is, you know, whatever.",
]

eval_dataset = pd.DataFrame(
    {
        "response": responses,
    }
)

# Run evaluation

With the evaluation dataset and metrics defined, you can run evaluation for an `EvalTask` on different models and applications, prompt templates, and many other use cases.

In [10]:
eval_result = EvalTask(
    dataset=eval_dataset, metrics=[text_quality], experiment=EXPERIMENT_NAME
).evaluate()

Associating projects/255766800726/locations/us-central1/metadataStores/default/contexts/my-eval-task-experiment-68035fa5-eb48-4cec-9bc0-f0a21d825354 to Experiment: my-eval-task-experiment


Computing metrics with a total of 3 Vertex Gen AI Evaluation Service API requests.


100%|██████████| 3/3 [00:00<00:00,  3.24it/s]

All 3 metric requests are successfully computed.
Evaluation Took:0.9326070339884609 seconds





You can view the summary metrics and row-based metrics for each response in the `EvalResult`.


In [11]:
notebook_utils.display_eval_result(eval_result)

### Summary Metrics

| | row_count | text_quality/mean | text_quality/std |
|---|---|---|---|
| 0 | 3.0 | 0.0 | 0.0 |


### Row-based Metrics

| | response | text_quality/explanation | text_quality/score |
|---|---|---|---|
| 0 | Life is a rollercoaster, full of ups and downs... | The response has a basic level of fluency, but... | 0.0 |
| 1 | The weather is nice today, not too hot, not to... | The response is somewhat aligned with the flue... | 0.0 |
| 2 | The weather is, you know, whatever. | The response is not particularly entertaining ... | 0.0 |
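
The summary metrics are aggregates of the row-level scores. As a rough sketch, the values reported above can be recomputed from the three `text_quality/score` entries with the standard library. This assumes the sample standard deviation is used, which is an assumption about the aggregation, not something the output confirms; for identical scores the result is the same either way.

```python
from statistics import mean, stdev

# Row-level text_quality scores copied from the row-based metrics above.
row_scores = [0.0, 0.0, 0.0]

summary = {
    "row_count": len(row_scores),
    "text_quality/mean": mean(row_scores),
    # Sample standard deviation (assumed); needs at least two rows.
    "text_quality/std": stdev(row_scores),
}
print(summary)
```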


In [28]:
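# Preview classes. Note that this EvalTask replaces the one imported earlier
# from vertexai.evaluation.
from vertexai.preview.evaluation import (
    CustomOutputConfig,
    EvalTask,
    MetricPromptTemplateExamples,
    PredefinedRubricMetrics,
    RubricBasedMetric,
    RubricGenerationConfig,
    notebook_utils,
    utils,
)

# Metric classes from the GA module. PointwiseMetric is imported only once,
# from here, to avoid a duplicate import.
from vertexai.evaluation import (
    PairwiseMetricPromptTemplate,
    PointwiseMetric,
    PointwiseMetricPromptTemplate,
)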

In [29]:
MetricPromptTemplateExamples.list_example_metric_names()


['coherence',
 'fluency',
 'safety',
 'groundedness',
 'instruction_following',
 'verbosity',
 'text_quality',
 'summarization_quality',
 'question_answering_quality',
 'multi_turn_chat_quality',
 'multi_turn_safety',
 'pairwise_coherence',
 'pairwise_fluency',
 'pairwise_safety',
 'pairwise_groundedness',
 'pairwise_instruction_following',
 'pairwise_verbosity',
 'pairwise_text_quality',
 'pairwise_summarization_quality',
 'pairwise_question_answering_quality',
 'pairwise_multi_turn_chat_quality',
 'pairwise_multi_turn_safety']
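
The names in this list fall into two groups: pointwise metrics and their `pairwise_`-prefixed counterparts. A small sketch that separates them, using the list copied from the output above (no SDK call involved):

```python
metric_names = [
    "coherence", "fluency", "safety", "groundedness",
    "instruction_following", "verbosity", "text_quality",
    "summarization_quality", "question_answering_quality",
    "multi_turn_chat_quality", "multi_turn_safety",
    "pairwise_coherence", "pairwise_fluency", "pairwise_safety",
    "pairwise_groundedness", "pairwise_instruction_following",
    "pairwise_verbosity", "pairwise_text_quality",
    "pairwise_summarization_quality",
    "pairwise_question_answering_quality",
    "pairwise_multi_turn_chat_quality", "pairwise_multi_turn_safety",
]

# Split by prefix: pairwise metrics compare two responses, pointwise score one.
pairwise = [name for name in metric_names if name.startswith("pairwise_")]
pointwise = [name for name in metric_names if not name.startswith("pairwise_")]

# Pointwise metrics that have no pairwise counterpart in the list.
unpaired = [name for name in pointwise if f"pairwise_{name}" not in pairwise]

print(len(pointwise), len(pairwise), unpaired)
```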

In [31]:
print(MetricPromptTemplateExamples.get_prompt_template('fluency'))



# Instruction
You are an expert evaluator. Your task is to evaluate the quality of the responses generated by AI models.
We will provide you with the user input and an AI-generated response.
You should first read the user input carefully for analyzing the task, and then evaluate the quality of the responses based on the criteria provided in the Evaluation section below.
You will assign the response a rating following the Rating Rubric and Evaluation Steps. Give step by step explanations for your rating, and only choose ratings from the Rating Rubric.

# Evaluation
## Metric Definition
You will be assessing fluency, which measures language mastery of the model's response based on the user prompt.

## Criteria
Fluency: The text is free of grammatical errors, employs varied sentence structures, and maintains a consistent tone and style, resulting in a smooth and natural flow that is easy to understand.

## Rating Rubric
5 (completely fluent): The response is free of grammatical errors, d

In [30]:
# from vertexai.evaluation import EvalTask

# eval_result = EvalTask(
#     dataset=eval_dataset,
#     metrics=[MetricPromptTemplateExamples.Pointwise.FLUENCY],
#     experiment=EXPERIMENT_NAME,
# ).evaluate(
#     # model=MODEL,
#     # experiment_run=EXPERIMENT_RUN_NAME,
# )


# Define a pointwise multi-turn chat quality metric
pointwise_chat_quality_metric_prompt = """Evaluate the AI's contribution to a meaningful conversation, considering coherence, fluency, groundedness, and conciseness.
 Review the chat history for context. Rate the response on a 1-5 scale, with explanations for each criterion and its overall impact.

# Conversation History
{history}

# Current User Prompt
{prompt}

# AI-generated Response
{response}
"""

freeform_multi_turn_chat_quality_metric = PointwiseMetric(
    metric="multi_turn_chat_quality_metric",
    metric_prompt_template=pointwise_chat_quality_metric_prompt,
)

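# Run evaluation using the freeform_multi_turn_chat_quality_metric metric.
# The prompt template references {history}, {prompt}, and {response}, so the
# dataset passed here needs a column for each of them; `eval_dataset` as
# defined earlier only has `response`.
eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[freeform_multi_turn_chat_quality_metric],
    experiment=EXPERIMENT_NAME,  # Needed for the clean-up step below.
)
# The dataset already contains responses, so no `model` argument is passed
# (the original cell referenced an undefined MODEL variable).
eval_result = eval_task.evaluate()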


Associating projects/255766800726/locations/us-central1/metadataStores/default/contexts/my-eval-task-experiment-d53e59a4-00a4-4e3e-ad48-1949208cd1f8 to Experiment: my-eval-task-experiment


Computing metrics with a total of 3 Vertex Gen AI Evaluation Service API requests.


  0%|          | 0/3 [00:00<?, ?it/s]


ValueError: Metric name: fluency is not supported.
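
The free-form metric prompt defined above uses `{history}`, `{prompt}`, and `{response}` placeholders, and each placeholder is expected to be filled from a dataset column of the same name. A minimal sketch for checking this before calling `evaluate()` follows, using only the standard library. `template_placeholders` and `missing_columns` are hypothetical helpers, and the sketch assumes the template uses plain `str.format`-style placeholders.

```python
from string import Formatter


def template_placeholders(template):
    """Return the set of {placeholder} names used in a format-style template."""
    return {field for _, field, _, _ in Formatter().parse(template) if field}


def missing_columns(template, columns):
    """Return placeholder names that have no matching dataset column."""
    return sorted(template_placeholders(template) - set(columns))


template = """# Conversation History
{history}

# Current User Prompt
{prompt}

# AI-generated Response
{response}
"""

# The dataset built earlier in this notebook only has a `response` column.
print(missing_columns(template, ["response"]))
```

With a pandas DataFrame, `columns` would be `eval_dataset.columns`; an empty result means every placeholder has a matching column.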

# Clean up

Delete the `ExperimentRun` created by the evaluation.

In [12]:
# from google.cloud import aiplatform

# aiplatform.ExperimentRun(
#     run_name=eval_result.metadata["experiment_run"],
#     experiment=eval_result.metadata["experiment"],
# ).delete()