# L9: Evaluation Part II

Evaluate LLM responses where there isn't a single "right answer."

## Setup
#### Load the API key and relevant Python libraries.
In this course, we've provided some code that loads the OpenAI API key for you.

In [None]:
import os
import openai
import sys
sys.path.append('../..')
import utils
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

openai.api_key  = os.environ['OPENAI_API_KEY']

In [None]:
def get_completion_from_messages(messages, model="gpt-3.5-turbo", temperature=0, max_tokens=500):
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=temperature, 
        max_tokens=max_tokens, 
    )
    return response.choices[0].message["content"]

### Run through the end-to-end system to answer the user query

These helper functions run the chain of prompts that you saw in the earlier videos.

In [None]:
customer_msg = f"""
tell me about the smartx pro phone and the fotosnap camera, the dslr one.
Also, what TVs or TV related products do you have?"""

products_by_category = utils.get_products_from_query(customer_msg)
category_and_product_list = utils.read_string_to_list(products_by_category)
product_info = utils.get_mentioned_product_info(category_and_product_list)
assistant_answer = utils.answer_user_msg(user_msg=customer_msg,
                                         product_info=product_info)

In [None]:
print(assistant_answer) 

*OUTPUT*

Sure! Let me provide you with some information about the SmartX ProPhone and the FotoSnap DSLR Camera.

The SmartX ProPhone is a powerful smartphone with advanced camera features. It has a 6.1-inch display, 128GB storage, a 12MP dual camera, and supports 5G connectivity. The SmartX ProPhone is priced at $899.99 and comes with a 1-year warranty.

The FotoSnap DSLR Camera is a versatile camera that allows you to capture stunning photos and videos. It features a 24.2MP sensor, 1080p video recording, a 3-inch LCD screen, and supports interchangeable lenses. The FotoSnap DSLR Camera is priced at $599.99 and also comes with a 1-year warranty.

As for TVs and TV-related products, we have a range of options available. Some of our popular TV models include the CineView 4K TV, CineView 8K TV, and CineView OLED TV. We also have home theater systems like the SoundMax Home Theater and SoundMax Soundbar. Could you please let me know your specific requirements or preferences so that I can assist you better?

#### How can you evaluate if this is a good answer or not? 
Seems like there are lots of possible good answers. 
One way to evaluate this is to write a rubric, 
meaning a set of guidelines, to evaluate this 
answer on different dimensions, and then use that to 
decide whether or not you're satisfied with this answer.

### Evaluate the LLM's answer to the user with a rubric, based on the extracted product information

In [None]:
cust_prod_info = {
    'customer_msg': customer_msg,
    'context': product_info
}

This little data structure stores the customer message as well as the product info. 
Next, I'm going to specify a prompt for evaluating the 
assistant answer using what's called a rubric.

In [None]:
def eval_with_rubric(test_set, assistant_answer):

    cust_msg = test_set['customer_msg']
    context = test_set['context']
    completion = assistant_answer
    
    system_message = """\
    You are an assistant that evaluates how well the customer service agent \
    answers a user question by looking at the context that the customer service \
    agent is using to generate its response. 
    """

    user_message = f"""\
You are evaluating a submitted answer to a question based on the context \
that the agent uses to answer the question.
Here is the data:
    [BEGIN DATA]
    ************
    [Question]: {cust_msg}
    ************
    [Context]: {context}
    ************
    [Submission]: {completion}
    ************
    [END DATA]

Compare the factual content of the submitted answer with the context. \
Ignore any differences in style, grammar, or punctuation.
Answer the following questions:
    - Is the Assistant response based only on the context provided? (Y or N)
    - Does the answer include information that is not provided in the context? (Y or N)
    - Is there any disagreement between the response and the context? (Y or N)
    - Count how many questions the user asked. (output a number)
    - For each question that the user asked, is there a corresponding answer to it?
      Question 1: (Y or N)
      Question 2: (Y or N)
      ...
      Question N: (Y or N)
    - Of the number of questions asked, how many of these questions were addressed by the answer? (output a number)
"""

    messages = [
        {'role': 'system', 'content': system_message},
        {'role': 'user', 'content': user_message}
    ]

    response = get_completion_from_messages(messages)
    return response

So, this is called a rubric, and this specifies what 
we think the answer should get right for 
us to consider it a good answer. 

In [None]:
evaluation_output = eval_with_rubric(cust_prod_info, assistant_answer)
print(evaluation_output)

*OUTPUT*

- Is the Assistant response based only on the context provided? (Y or N)

Y

- Does the answer include information that is not provided in the context? (Y or N)

N

- Is there any disagreement between the response and the context? (Y or N)

N

- Count how many questions the user asked. (output a number)

2

- For each question that the user asked, is there a corresponding answer to it?

Question 1: Y

Question 2: Y

- Of the number of questions asked, how many of these questions were addressed by the answer? (output a number)

2
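Because the rubric output above is plain text, reading it by eye does not scale to many test cases. Here is a minimal sketch of turning it into structured data, assuming the model follows the layout shown above (each Y/N or number on its own line, plus `Question k: Y` lines). The names `parse_rubric_output` and `summarize` and the returned keys are hypothetical, not part of the course's `utils` module, and a real model may deviate from this layout.

```python
import re

# Sample text copied from the evaluation output shown above.
SAMPLE_OUTPUT = """\
- Is the Assistant response based only on the context provided? (Y or N)

Y

- Does the answer include information that is not provided in the context? (Y or N)

N

- Is there any disagreement between the response and the context? (Y or N)

N

- Count how many questions the user asked. (output a number)

2

- For each question that the user asked, is there a corresponding answer to it?

Question 1: Y

Question 2: Y

- Of the number of questions asked, how many of these questions were addressed by the answer? (output a number)

2
"""


def parse_rubric_output(text):
    """Collect Y/N answers, per-question answers, and numeric answers in order."""
    yes_no = []        # top-level Y/N answers, in the order they appear
    per_question = {}  # question number -> True (Y) / False (N)
    numbers = []       # numeric answers, in the order they appear
    for line in text.splitlines():
        line = line.strip()
        match = re.fullmatch(r"Question (\d+): ([YN])", line)
        if match:
            per_question[int(match.group(1))] = match.group(2) == "Y"
        elif line in ("Y", "N"):
            yes_no.append(line == "Y")
        elif line.isdigit():
            numbers.append(int(line))
    return {"yes_no": yes_no, "per_question": per_question, "numbers": numbers}


def summarize(parsed):
    """Reduce the parsed rubric to two checks; None if the layout is unexpected."""
    if len(parsed["yes_no"]) != 3 or len(parsed["numbers"]) != 2:
        return None
    based_only, extra_info, disagreement = parsed["yes_no"]
    asked, addressed = parsed["numbers"]
    return {
        "grounded": based_only and not extra_info and not disagreement,
        "all_questions_addressed": asked == addressed,
    }


parsed = parse_rubric_output(SAMPLE_OUTPUT)
print(summarize(parsed))
```

Returning `None` on an unexpected layout is a deliberate choice in this sketch: it flags outputs that need a human look instead of silently guessing.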

So we would look at this output and maybe conclude that this 
is a pretty good response. 

One note: here I'm using the GPT-3.5 Turbo 
model for this evaluation. 
For a more robust evaluation, it might be worth considering 
GPT-4. Even if you deploy GPT-3.5 Turbo in production and generate a 
lot of text, if your evaluation is a more 
sporadic exercise, then it may be prudent to pay for the somewhat 
more expensive GPT-4 API call to get a more rigorous evaluation 
of the output. 

One design pattern that I hope you can take away 
from this is that when you can specify a rubric, 
meaning a list of criteria by which to 
evaluate an LLM output, you can 
use another API call to evaluate your first LLM output. 

There's one other design pattern that could be useful 
for some applications: evaluating against an ideal response, 
if you can specify one.

The BLEU (Bilingual Evaluation Understudy) score is a metric used to evaluate the quality of machine-generated text, such as the output of machine translation systems or language generation models like Large Language Models (LLMs).

BLEU measures the similarity between the generated text and a reference (human-written) text. It compares n-grams (contiguous sequences of n items, like words or characters) in the generated text to the n-grams in the reference text. The basic idea is that if the generated text has n-grams that match those in the reference text, it's likely to be a good translation or a high-quality generated text.

The BLEU score ranges from 0 to 1, where higher values indicate better quality. It's important to note that BLEU is a relatively simple metric and has some limitations. For example, it doesn't capture the overall fluency, coherence, or semantic correctness of the generated text. Also, it heavily relies on exact n-gram matches, which might not always be the best representation of quality, especially for longer sentences or when dealing with paraphrased content.
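To make the n-gram idea concrete, here is a simplified BLEU-style sketch in plain Python. It is not a reference implementation: it assumes a single reference text, whitespace tokenization, no smoothing, and n-grams up to length 2 by default (standard BLEU typically uses up to 4-grams and is computed over a corpus). The function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams in a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Fraction of candidate n-grams found in the reference, with counts clipped."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total


def simple_bleu(candidate_text, reference_text, max_n=2):
    """Geometric mean of n-gram precisions times a brevity penalty."""
    candidate = candidate_text.lower().split()
    reference = reference_text.lower().split()
    if not candidate:
        return 0.0
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # without smoothing, one zero precision zeroes the score
    log_mean = sum(math.log(p) for p in precisions) / max_n
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_mean)


# Identical texts score 1.0; a paraphrase with no shared bigram scores 0.0.
print(simple_bleu("the price is 599.99", "the price is 599.99"))
print(simple_bleu("it costs 599.99", "the price is 599.99"))
```

The second call illustrates the limitation noted above: the two sentences state the same price, but because they share no bigram the unsmoothed score drops to zero.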

### Evaluate the LLM's answer to the user based on an "ideal" / "expert" (human generated) answer.

So this is if you have an expert human 
customer service representative write a really good answer.

In [None]:
test_set_ideal = {
    'customer_msg': """\
tell me about the smartx pro phone and the fotosnap camera, the dslr one.
Also, what TVs or TV related products do you have?""",
    'ideal_answer':"""\
Of course!  The SmartX ProPhone is a powerful \
smartphone with advanced camera features. \
For instance, it has a 12MP dual camera. \
Other features include 5G wireless and 128GB storage. \
It also has a 6.1-inch display.  The price is $899.99.

The FotoSnap DSLR Camera is great for \
capturing stunning photos and videos. \
Some features include 1080p video, \
3-inch LCD, a 24.2MP sensor, \
and interchangeable lenses. \
The price is 599.99.

For TVs and TV related products, we offer 3 TVs.

All TVs offer HDR and Smart TV.

The CineView 4K TV has vibrant colors and smart features. \
Some of these features include a 55-inch display and \
4K resolution. It's priced at 599.

The CineView 8K TV is a stunning 8K TV. \
Some features include a 65-inch display and \
8K resolution.  It's priced at 2999.99

The CineView OLED TV lets you experience vibrant colors. \
Some features include a 55-inch display and 4K resolution. \
It's priced at 1499.99.

We also offer 2 home theater products, both of which include Bluetooth. \
The SoundMax Home Theater is a powerful home theater system for \
an immersive audio experience.
Its features include 5.1 channel, 1000W output, and wireless subwoofer.
It's priced at 399.99.

The SoundMax Soundbar is a sleek and powerful soundbar.
Its features include 2.1 channel, 300W output, and wireless subwoofer.
It's priced at 199.99

Are there any additional questions you may have about these products \
that you mentioned here?
Or do you have other questions I can help you with?
    """
}

### Check if the LLM's response agrees with or disagrees with the expert answer

This evaluation prompt is from the [OpenAI evals](https://github.com/openai/evals/blob/main/evals/registry/modelgraded/fact.yaml) project.

[BLEU score](https://en.wikipedia.org/wiki/BLEU): another way to evaluate whether two pieces of text are similar or not.

In [None]:
def eval_vs_ideal(test_set, assistant_answer):

    cust_msg = test_set['customer_msg']
    ideal = test_set['ideal_answer']
    completion = assistant_answer
    
    system_message = """\
    You are an assistant that evaluates how well the customer service agent \
    answers a user question by comparing the response to the ideal (expert) response
    Output a single letter and nothing else. 
    """

    user_message = f"""\
You are comparing a submitted answer to an expert answer on a given question. Here is the data:
    [BEGIN DATA]
    ************
    [Question]: {cust_msg}
    ************
    [Expert]: {ideal}
    ************
    [Submission]: {completion}
    ************
    [END DATA]

Compare the factual content of the submitted answer with the expert answer. \
Ignore any differences in style, grammar, or punctuation.
    The submitted answer may either be a subset or superset of the expert answer, \
    or it may conflict with it. Determine which case applies. \
    Answer the question by selecting one of the following options:
    (A) The submitted answer is a subset of the expert answer and is fully consistent with it.
    (B) The submitted answer is a superset of the expert answer and is fully consistent with it.
    (C) The submitted answer contains all the same details as the expert answer.
    (D) There is a disagreement between the submitted answer and the expert answer.
    (E) The answers differ, but these differences don't matter from the perspective of factuality.
  choice_strings: ABCDE
"""

    messages = [
        {'role': 'system', 'content': system_message},
        {'role': 'user', 'content': user_message}
    ]

    response = get_completion_from_messages(messages)
    return response

Here's the prompt we can use. We're going to use an LLM 
and tell it to be an assistant that evaluates how well the 
customer service agent answers a user question by 
comparing the response, which was the automatically generated 
one, to the ideal (expert) human-written 
response. 
So we're going to give it the data: the customer request, the 
expert-written ideal response, and what our 
LLM actually output. 
This rubric comes from the OpenAI open source evals framework, 
which is a fantastic framework with many evaluation methods 
contributed both by OpenAI developers and 
by the broader open source community. 

In [None]:
print(assistant_answer)

*OUTPUT*

Sure! Let me provide you with some information about the SmartX ProPhone and the FotoSnap DSLR Camera.

The SmartX ProPhone is a powerful smartphone with advanced camera features. It has a 6.1-inch display, 128GB storage, a 12MP dual camera, and supports 5G connectivity. The SmartX ProPhone is priced at $899.99 and comes with a 1-year warranty.

The FotoSnap DSLR Camera is a versatile camera that allows you to capture stunning photos and videos. It features a 24.2MP sensor, 1080p video recording, a 3-inch LCD screen, and supports interchangeable lenses. The FotoSnap DSLR Camera is priced at $599.99 and also comes with a 1-year warranty.

As for TVs and TV-related products, we have a range of options available. Some of our popular TV models include the CineView 4K TV, CineView 8K TV, and CineView OLED TV. We also have home theater systems like the SoundMax Home Theater and SoundMax Soundbar. Could you please let me know your specific requirements or preferences so that I can assist you better?

In [None]:
eval_vs_ideal(test_set_ideal, assistant_answer)

*OUTPUT*

'A'

This assistant answer is much shorter than the 
long expert answer up top, but it is consistent with it, which is what grade A indicates. 
Once again, I'm using GPT-3.5 Turbo in this example, but to get 
a more rigorous evaluation, it might make sense to use GPT-4 in your own 
application.

In [None]:
assistant_answer_2 = "life is like a box of chocolates"

In [None]:
eval_vs_ideal(test_set_ideal, assistant_answer_2)

*OUTPUT*

'D'
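The single-letter grades above can be mapped to a pass/fail verdict for automated checks. Here is a minimal sketch, assuming the evaluator returns just the letter (possibly wrapped in quotes or parentheses). The name `interpret_grade` is hypothetical, and treating A, B, C, and E as passing is an assumption of this sketch; your application may, for example, want to treat a subset answer (A) as incomplete.

```python
# Meanings paraphrased from the options (A)-(E) in the evaluation prompt above.
GRADE_MEANINGS = {
    "A": "subset of the expert answer, fully consistent",
    "B": "superset of the expert answer, fully consistent",
    "C": "same details as the expert answer",
    "D": "disagrees with the expert answer",
    "E": "differs, but not in ways that matter for factuality",
}

# Assumption for this sketch: only a factual disagreement (D) counts as a failure.
PASSING_GRADES = frozenset("ABCE")


def interpret_grade(raw_output):
    """Normalize the evaluator's output and attach a pass/fail verdict."""
    letter = raw_output.strip().strip("()'\"").upper()
    if letter not in GRADE_MEANINGS:
        # Anything other than a single letter A-E is flagged, not guessed at.
        return {"grade": None, "passed": False, "meaning": "unrecognized output"}
    return {
        "grade": letter,
        "passed": letter in PASSING_GRADES,
        "meaning": GRADE_MEANINGS[letter],
    }


print(interpret_grade("A"))
print(interpret_grade("D"))
```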

In summary, there are two key points for evaluating LLM systems:

1. Without an expert-provided ideal answer, you can write a rubric and use another LLM call to evaluate the LLM's output against it.

2. With an expert-provided ideal answer, you can compare the LLM's output to the expert's answer, which helps in assessing the similarity and quality of the LLM's responses.


These evaluation methods are valuable during development and while the system is running, allowing continuous monitoring and improvement of the LLM's performance.
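For ongoing monitoring, the same evaluation can be run over a set of test cases and tallied. The sketch below assumes each test case is a dict with `customer_msg` and `ideal_answer` keys, as in `test_set_ideal`. `run_eval_suite` and the two `fake_*` functions are hypothetical stand-ins so the example runs offline; in the notebook, `answer_fn` would wrap the `utils` chain and `eval_fn` would be `eval_vs_ideal`.

```python
from collections import Counter


def run_eval_suite(test_cases, answer_fn, eval_fn):
    """Answer every test case, grade the answer, and tally the grades."""
    tally = Counter()
    for case in test_cases:
        answer = answer_fn(case["customer_msg"])
        grade = eval_fn(case, answer).strip()
        tally[grade] += 1
    return tally


# Stand-in for the real answering chain: one canned answer, otherwise nonsense.
def fake_answer_fn(customer_msg):
    canned = {"what tvs do you have?": "We offer 3 TVs."}
    return canned.get(customer_msg, "life is like a box of chocolates")


# Stand-in for the LLM grader: exact match is "C", anything else is "D".
def fake_eval_fn(case, answer):
    return "C" if answer == case["ideal_answer"] else "D"


test_cases = [
    {"customer_msg": "what tvs do you have?",
     "ideal_answer": "We offer 3 TVs."},
    {"customer_msg": "tell me about the soundbar",
     "ideal_answer": "It is a 2.1 channel soundbar."},
]

tally = run_eval_suite(test_cases, fake_answer_fn, fake_eval_fn)
failure_rate = tally["D"] / sum(tally.values())
print(dict(tally), failure_rate)
```

Tracking the share of "D" grades over time is one simple signal that a prompt or model change has made answers less consistent with the expert answers.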