We noticed that BERTScore may not assign sufficiently low scores to unrelated queries, so we now explore using LLM evaluator(s) as automated raters for evaluating the performance of simulated users. That said, we should still use human evaluation to confirm that the LLM evaluators themselves are reliable.

In [2]:
!pip install bert-score

from bert_score import score

persona_references = [
    "Workout", "Cars", "Money",
    "winter hiring", "mac studio m2 ultra chip",
    "rog snow keyboard coding", "recommendation of rog keyboard", "huawei hiring",
    "used iphone Toronto", "used RTX4090 Toronto", "AMG CLE53"
]

generated_queries = [
    "All week long we will be highlighting some incredible University of Waterloo entrepreneurs", # unrelated text
    "Dynamic programming is a computer programming technique", # unrelated text
    "Best selling portable Windows laptops", # related query
    "High-quality piano keyboards for beginners" # unrelated query
]

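# Score each generated query against every persona reference and keep the best match.
for query in generated_queries:
    # score(candidates, references, ...) returns precision, recall, and F1 tensors.
    P, R, F1 = score([query] * len(persona_references), persona_references, lang='en', verbose=True)
    max_f1 = F1.max().item()
    print(f"Query: '{query}'\nMax F1 score across references: {max_f1:.4f}\n")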

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Query: 'All week long we will be highlighting some incredible University of Waterloo entrepreneurs'
Max F1 score across references: 0.8318



Query: 'Dynamic programming is a computer programming technique'
Max F1 score across references: 0.8506



Query: 'Best selling portable Windows laptops'
Max F1 score across references: 0.8747



Query: 'High-quality piano keyboards for beginners'
Max F1 score across references: 0.8718



Running the code above yields the following maximum BERTScore F1 values:
* Query: 'All week long we will be highlighting some incredible University of Waterloo entrepreneurs'
  * Max F1 score across references: 0.8318

* Query: 'Dynamic programming is a computer programming technique'
  * Max F1 score across references: 0.8506

* Query: 'Best selling portable Windows laptops'
  * Max F1 score across references: 0.8747

* Query: 'High-quality piano keyboards for beginners'
  * Max F1 score across references: 0.8718

We see that the unrelated texts (the 1st and 2nd examples) received BERTScore F1 values of 0.83 and 0.85. The unrelated query (the 4th example) received 0.8718, nearly identical to the 0.8747 obtained by the related query (the 3rd example). Because BERTScore barely separates related from unrelated inputs here, another automated evaluation method should be explored in its place.
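To make the comparison concrete, the gap between the related query and the best-scoring unrelated input can be computed directly. The following is a minimal sketch that only reuses the four F1 values printed above; the helper name `separation_margin` and the related/unrelated flags (taken from the code comments) are illustrative assumptions, not part of the original notebook.

```python
# Max F1 values copied from the output above.
# The boolean marks whether the notebook's comments label the input as related.
results = [
    ("All week long we will be highlighting some incredible University of Waterloo entrepreneurs", 0.8318, False),
    ("Dynamic programming is a computer programming technique", 0.8506, False),
    ("Best selling portable Windows laptops", 0.8747, True),
    ("High-quality piano keyboards for beginners", 0.8718, False),
]


def separation_margin(rows):
    """Return the lowest related score minus the highest unrelated score."""
    related = [score for _, score, is_related in rows if is_related]
    unrelated = [score for _, score, is_related in rows if not is_related]
    return min(related) - max(unrelated)


margin = separation_margin(results)
print(f"Separation margin: {margin:.4f}")  # prints "Separation margin: 0.0029"
```

A margin this small leaves almost no room for a threshold that accepts the related query and rejects the unrelated ones. The `bert_score` package also offers a `rescale_with_baseline=True` option for `score`, which is intended to spread scores over a wider range; whether it would separate these examples better was not tested here.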

In [1]:
!pip install openai
!pip install python-dotenv



In [5]:
import pandas as pd
import os
from openai import OpenAI

from dotenv import load_dotenv

# Load the API key:
load_dotenv('.env')
client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY"),
)

top_level_instruction = """
  Evaluate the quality of how closely a search query is matched with the given persona on a scale of 1 to 3.
  1 - The text is unlikely to be a search query from the given persona.
  2 - The text might be a search query from the given persona.
  3 - The text is very likely to be a search query from the given persona.
  You will need to give your reason if you give a score of 1.

  Persona:
  This person is interested in:
  "Workout", "Cars", "Money",
  "winter hiring", "mac studio m2 ultra chip",
  "rog snow keyboard coding", "recommendation of rog keyboard", "huawei hiring",
  "used iphone Toronto", "used RTX4090 Toronto", "AMG CLE53"

  Example:

  Text: "All week long we will be highlighting some incredible University of Toronto entrepreneurs"
  Evaluation: 1 - The text is unlikely to be a search query from the given persona.
  Reason: The persona's interests are centered around workouts, cars, technology, and specific hiring queries, none of which relate to University of Toronto entrepreneurs.
"""

text_to_evaluate = [
    "Google Search is a fully-automated search engine that uses software known as web crawlers that explore the web regularly to find pages to add to our index.", # unrelated text
    "Dynamic programming is a computer programming technique", # unrelated text
    "Best selling portable Windows laptops", # related query
    "High-quality piano keyboards for beginners" # unrelated query
]

# Process each statement
feedback_gpt4 = []

for text in text_to_evaluate:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
              "role": "user",
              "content": [
                {
                  "type": "text",
                  "text": top_level_instruction + "\nText:" + text
                }
              ]
            }
        ]
    )
    feedback = response.choices[0].message.content
    print(feedback)
    feedback_gpt4.append(feedback)

# print(feedback_gpt4)

Evaluation: 1 - The text is unlikely to be a search query from the given persona.

Reason: The text describes how Google Search works and does not relate to any of the persona's specific interests such as workouts, cars, technology, winter hiring, or used items in Toronto. The query does not demonstrate any focus on the persona's listed topics.
Evaluation: 1 - The text is unlikely to be a search query from the given persona.

Reason: The persona's interests focus on workouts, cars, specific technology products, hiring, and purchasing used items in Toronto. While there is an interest in "rog snow keyboard coding," the text is about dynamic programming, which is a broad computer science topic not directly tied to the specific interests of the persona.
Evaluation: 2 - The text might be a search query from the given persona.

Reason: Although the persona has a clear interest in technology, including specific items like the "mac studio m2 ultra chip" and "rog snow keyboard," the interest is
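The replies above start with a line of the form `Evaluation: N - ...`, so the numeric rating can be pulled out for later aggregation. This is a minimal sketch under that assumption; `parse_score` and `SCORE_PATTERN` are hypothetical names, and a reply that deviates from the format simply yields `None`.

```python
import re

# Hypothetical helper: matches the "Evaluation: N" prefix seen in the replies above.
SCORE_PATTERN = re.compile(r"Evaluation:\s*([123])\b")


def parse_score(feedback):
    """Return the 1-3 rating found in an evaluator reply, or None if absent."""
    match = SCORE_PATTERN.search(feedback)
    return int(match.group(1)) if match else None


# First lines of the three replies shown in the output above.
sample_feedback = [
    "Evaluation: 1 - The text is unlikely to be a search query from the given persona.",
    "Evaluation: 1 - The text is unlikely to be a search query from the given persona.",
    "Evaluation: 2 - The text might be a search query from the given persona.",
]

scores = [parse_score(reply) for reply in sample_feedback]
print(scores)  # [1, 1, 2]
```

In the notebook, the same function could be applied to each entry of `feedback_gpt4`.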

From the GPT-4o evaluations above, we see that the 3rd example received a score of 2 while all other examples received a score of 1, which matches the labels we intended for these examples.
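Since the LLM evaluator itself should be checked against human evaluation, one common way to quantify that check is an agreement statistic such as Cohen's kappa between LLM and human ratings of the same items. The sketch below is an assumption about how such a check could look, not the procedure used in this notebook; the function name `cohen_kappa` is hypothetical and the rating lists are invented placeholders, not real results.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two equal-length lists of categorical ratings."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_a)
    # Fraction of items on which both raters gave the same rating.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from each rater's category frequencies.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1 - expected)


# Placeholder ratings on the 1-3 scale, invented for illustration only.
llm_scores = [1, 1, 2, 1, 3, 2]
human_scores = [1, 1, 2, 2, 3, 2]

kappa = cohen_kappa(llm_scores, human_scores)
print(f"Cohen's kappa: {kappa:.3f}")  # prints "Cohen's kappa: 0.739"
```

Kappa corrects raw agreement for chance, which matters here because most examples are expected to receive a score of 1.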