In [None]:
!pip install "google-cloud-aiplatform[evaluation]"

In [3]:
import pandas as pd
import vertexai

In [192]:
from vertexai.evaluation import EvalTask, PointwiseMetric, PointwiseMetricPromptTemplate, PromptTemplate

In [205]:
# Define custom evaluation criteria
custom_query_quality = PointwiseMetric(
    metric="custom_query_quality",
    metric_prompt_template=PointwiseMetricPromptTemplate(
        criteria={
            #"accuracy": "Given the query and the prompt, the answer has to follow the instructions in the query.",
            "clarity": "The output is easy to understand and formatted logically, considering synthetic data.",
            "relevance": "The output aligns with the synthetic context and answers the prompt appropriately.",
            "insightfulness": "The output provides meaningful observations or patterns from the synthetic dataset.",
            "formatting": "The output is well-structured, using tables, lists, or concise sentences where needed."
        },
        rating_rubric={
            "2": "The output performs well on all criteria.",
            "0": "The output is somewhat aligned with the criteria.",
            "-2": "The output falls short on all criteria."
        },
        input_variables=["prompt", "output"],
    ),
)

In [206]:
EVAL_DATASET = pd.read_json("/home/jupyter/ai-powered-querying-csv-eval/evaluation_dataset_sample.json")

In [207]:
EVAL_DATASET

Unnamed: 0,prompt,output
0,How many bookings were made by customers in th...,There were 805 bookings made by customers in t...
1,Show the average ticket price for internationa...,{'content': 'Here's the average international ...
2,How many bookings were made by customers aged ...,There were 188 bookings made by customers aged...
3,What is the average ticket price for business ...,The average ticket price for business class bo...
4,How many customers booked flights with Delta A...,There were 170 customers who booked flights wi...
5,Which age group spends the most on flight tick...,The age group that spends the most on flight t...
6,What percentage of bookings were made through ...,Mobile app bookings accounted for 32.56% of to...
7,What is the total revenue generated by ticket ...,The total revenue generated from ticket sales ...
8,Which states have the highest number of freque...,{'content': 'Here are the top 5 states with th...
9,How many frequent flyers booked flights to int...,There are 470 frequent flyers who booked fligh...
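Rows 1 and 8 above show an `output` value that looks like a dict with a `content` key, while the other rows are plain strings. If those cells really are dicts (an assumption; they may also be strings that only look like dicts), a small normalizer along these lines could make the column uniform before evaluation. The helper name `extract_text` and the sample rows are hypothetical and are not part of the original notebook.

```python
import pandas as pd


def extract_text(value):
    """Return the text of an output cell, unwrapping {'content': ...} dicts."""
    if isinstance(value, dict) and "content" in value:
        return str(value["content"])
    return str(value)


# Hypothetical stand-in rows; the real dataset is loaded from JSON above.
sample = pd.DataFrame(
    {
        "prompt": ["question A", "question B"],
        "output": ["plain text answer", {"content": "wrapped answer"}],
    }
)

# Apply the normalizer to every cell of the output column.
sample["output"] = sample["output"].map(extract_text)
print(sample["output"].tolist())
```

With the real data, the same `map` call could be applied to `EVAL_DATASET["output"]` before building the `EvalTask`.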


In [208]:



evaluation_prompt_template = PromptTemplate(
    template="""
                You are evaluating the output generated from a synthetic dataset created for testing purposes.
                The dataset structure mimics real-world airline bookings but does not contain real data.

                ### Evaluation Criteria:
                1. **Clarity**: The response is easy to understand and formatted logically.
                2. **Relevance**: The response aligns with the synthetic context and answers the query appropriately.
                3. **Insightfulness**: The response highlights meaningful patterns or trends in the synthetic data.
                4. **Formatting**: The response is well-structured, using tables or concise explanations as needed.

                Input Query: {prompt}
                Model Output: {output}

                Evaluate the response considering the synthetic nature of the data.
"""
)

# Create an evaluation task from the dataset loaded above

eval_task = EvalTask(
    dataset=EVAL_DATASET,
    metrics=[custom_query_quality],
)
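The template above references `{prompt}` and `{output}`, so the dataset needs columns with those names. A minimal pre-flight check, using only the standard library's `string.Formatter` and pandas, might look like the sketch below. It does not call the Vertex AI SDK and assumes `str.format`-style placeholders; `template_fields`, `missing_columns`, and the sample frame are hypothetical names used for illustration.

```python
from string import Formatter

import pandas as pd


def template_fields(template: str) -> set:
    """Collect the placeholder names used in a str.format-style template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def missing_columns(template: str, df: pd.DataFrame) -> set:
    """Return placeholders that have no matching DataFrame column."""
    return template_fields(template) - set(df.columns)


# Shortened stand-in for the evaluation template and dataset.
template = "Input Query: {prompt}\nModel Output: {output}"
df = pd.DataFrame({"prompt": ["q1"], "output": ["a1"]})

# Every placeholder has a column, so nothing is reported as missing.
print(missing_columns(template, df))

# Adding a placeholder with no matching column surfaces it.
print(missing_columns(template + " {reference}", df))
```

Running a check like this before `evaluate` surfaces a misspelled placeholder or column name early, before any model requests are sent.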

In [209]:
from vertexai.generative_models import GenerativeModel

model = GenerativeModel("gemini-1.5-flash-002")  # alternative: gemini-1.5-pro

In [226]:
# Generate a response for each assembled prompt with Gemini, then score the rows
eval_result = eval_task.evaluate(
    model=model,
    prompt_template=evaluation_prompt_template,
)



Assembling prompts from the `prompt_template`. The `prompt` column in the `EvalResult.metrics_table` has the assembled prompts used for model response generation.
Generating a total of 10 responses from Gemini model gemini-1.5-flash-002.


100%|██████████| 10/10 [00:03<00:00,  2.89it/s]

All 10 responses are successfully generated from Gemini model gemini-1.5-flash-002.
Multithreaded Batch Inference took: 3.479947172003449 seconds.
Computing metrics with a total of 10 Vertex Gen AI Evaluation Service API requests.



100%|██████████| 10/10 [00:17<00:00,  1.72s/it]

All 10 metric requests are successfully computed.
Evaluation Took:17.199717224997585 seconds





In [227]:
# View results
eval_result.metrics_table

Unnamed: 0,prompt,output,response,custom_query_quality/explanation,custom_query_quality/score
0,\n You are evaluating the outpu...,There were 805 bookings made by customers in t...,The model's output is excellent given the cont...,Step 1: Assessment\nClarity: The response is c...,2.0
1,\n You are evaluating the outpu...,{'content': 'Here's the average international ...,The model's output is good and effectively add...,Step 1: Assessment\nclarity: The output is eas...,2.0
2,\n You are evaluating the outpu...,There were 188 bookings made by customers aged...,The model output is excellent given the contex...,Step 1: Assessment\nClarity: The response is c...,2.0
3,\n You are evaluating the outpu...,The average ticket price for business class bo...,The response is satisfactory given the context...,Step 1:\nClarity: The response is clear and ea...,0.0
4,\n You are evaluating the outpu...,There were 170 customers who booked flights wi...,The model output is satisfactory given the con...,Step 1: Assessment\nClarity: The response is c...,2.0
5,\n You are evaluating the outpu...,The age group that spends the most on flight t...,"The model's output is good, meeting most of th...",Step 1: Assessment\nClarity: The response is c...,0.0
6,\n You are evaluating the outpu...,Mobile app bookings accounted for 32.56% of to...,The model's response is excellent given the co...,Step 1: Assessment\nclarity: The response is c...,2.0
7,\n You are evaluating the outpu...,The total revenue generated from ticket sales ...,The response is adequate given the context of ...,Step 1: Assessment\nclarity: The output is cle...,0.0
8,\n You are evaluating the outpu...,{'content': 'Here are the top 5 states with th...,"The model's response is good, but could be imp...",Step 1: Assessment\nclarity: The output is eas...,0.0
9,\n You are evaluating the outpu...,There are 470 frequent flyers who booked fligh...,The model's response is satisfactory given the...,Step 1: Assessment:\nClarity: The response is ...,2.0
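To read the table at a glance, the score column can be summarized with pandas. The sketch below copies the ten scores shown above into a stand-in Series so it runs on its own; with live results, `eval_result.metrics_table["custom_query_quality/score"]` would take its place.

```python
import pandas as pd

# Scores copied from the metrics table above (rows 0-9).
scores = pd.Series(
    [2.0, 2.0, 2.0, 0.0, 2.0, 0.0, 2.0, 0.0, 0.0, 2.0],
    name="custom_query_quality/score",
)

# Average score across all rows.
mean_score = scores.mean()

# Number of rows per rubric score, ordered by score value.
counts = scores.value_counts().sort_index()

print(f"mean score: {mean_score:.2f}")
print(counts)
```

Six of the ten rows received a 2 and four received a 0; no row was rated -2.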


In [224]:
eval_result.metrics_table["response"][7]


"The response is adequate given the limited information provided about the synthetic dataset.  Let's evaluate it against the criteria:\n\n1. **Clarity:** The response is perfectly clear and concise.  It directly answers the question.\n\n2. **Relevance:** The response is highly relevant. It provides the requested total revenue figure for Delta Airlines ticket sales.\n\n3. **Insightfulness:** The response lacks insightfulness. While it provides the answer, it offers no context.  For example, it would be more insightful if it included the number of tickets sold, average ticket price, or a comparison to other airlines in the synthetic dataset.  This is understandable given the limitations of the input – we only asked for a total revenue figure.\n\n4. **Formatting:** The formatting is excellent; it's simple, direct, and easy to read.\n\n**Overall:**  The response is good in terms of clarity, relevance, and formatting. However,  because it only provides a single number, the insightfulness is

In [225]:
eval_result.metrics_table["custom_query_quality/explanation"][7]

"Step 1: Clarity: The response is clear and easy to understand. Relevance: The response is relevant to the prompt. Insightfulness: Due to the nature of the prompt, the response lacks insight. It only presents the total revenue and doesn't offer comparisons or trends, which would be more insightful with a synthetic dataset. Formatting: The response is well-formatted.\nStep 2: Overall, the response effectively addresses the prompt's request, which is to determine the total revenue. However, due to the limitations in the prompt and the lack of additional insights, the response receives a score of 0. It successfully fulfills the basic request but does not provide any further information or analysis that would enhance understanding."
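Rather than indexing rows one at a time, the rows that scored below the top rating can be pulled out together with their explanations. This is a minimal sketch on a hypothetical stand-in frame; `low_scoring` is an illustrative helper, not part of the SDK, and the explanation strings are placeholders rather than real judge output.

```python
import pandas as pd


def low_scoring(table: pd.DataFrame, score_col: str, threshold: float) -> pd.DataFrame:
    """Return rows whose score is strictly below the threshold."""
    return table[table[score_col] < threshold]


# Hypothetical stand-in for eval_result.metrics_table.
table = pd.DataFrame(
    {
        "custom_query_quality/score": [2.0, 0.0, 2.0, 0.0],
        "custom_query_quality/explanation": [
            "placeholder: meets criteria",
            "placeholder: lacks insight",
            "placeholder: meets criteria",
            "placeholder: lacks insight",
        ],
    }
)

# Keep only rows rated below the top rubric score of 2.
flagged = low_scoring(table, "custom_query_quality/score", threshold=2.0)

# Print each flagged row's index next to its explanation.
for idx, text in flagged["custom_query_quality/explanation"].items():
    print(idx, text)
```

Applied to the real `metrics_table`, the same filter would list rows 3, 5, 7, and 8 from the table above.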

In [216]:
# metrics_table is already a pandas DataFrame; keep a handle to it
eval_rslt = eval_result.metrics_table

# Save the evaluation results to a CSV file
eval_file = "eval_file.csv"
eval_rslt.to_csv(eval_file, index=False)

print(f"Evaluation results saved to {eval_file}")

Evaluation results saved to eval_file.csv
