# L6: Check outputs

Checking outputs before showing them to users can be 
important for ensuring the quality, relevance, and safety 
of the responses provided to them or used in automation flows. 
We'll learn how to use the Moderation API, 
this time on outputs, and how to use additional 
prompts to the model to evaluate output quality 
before displaying it. 
The Moderation API is not limited to user inputs: it can also 
filter and moderate outputs generated by the system itself.

## Setup
#### Load the API key and relevant Python libraries.
In this course, we've provided some code that loads the OpenAI API key for you.

In [None]:
import os
import openai
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

openai.api_key  = os.environ['OPENAI_API_KEY']

In [None]:
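def get_completion_from_messages(messages, model="gpt-3.5-turbo", temperature=0, max_tokens=500):
    """Send a list of chat messages to the model and return the reply text."""
    # Note: openai.ChatCompletion is the interface of the pre-1.0 openai package.
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=temperature,  # 0 keeps the output as deterministic as possible
        max_tokens=max_tokens,  # upper bound on the length of the reply
    )
    return response.choices[0].message["content"]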

### Check output for potentially harmful content

In [None]:
final_response_to_customer = f"""
The SmartX ProPhone has a 6.1-inch display, 128GB storage, \
12MP dual camera, and 5G. The FotoSnap DSLR Camera \
has a 24.2MP sensor, 1080p video, 3-inch LCD, and \
interchangeable lenses. We have a variety of TVs, including \
the CineView 4K TV with a 55-inch display, 4K resolution, \
HDR, and smart TV features. We also have the SoundMax \
Home Theater system with 5.1 channel, 1000W output, wireless \
subwoofer, and Bluetooth. Do you have any specific questions \
about these products or any other products we offer?
"""
response = openai.Moderation.create(
    input=final_response_to_customer
)
moderation_output = response["results"][0]
print(moderation_output)

*OUTPUT*
```json
{
  "categories": {
    "harassment": false,
    "harassment/threatening": false,
    "hate": false,
    "hate/threatening": false,
    "self-harm": false,
    "self-harm/instructions": false,
    "self-harm/intent": false,
    "sexual": false,
    "sexual/minors": false,
    "violence": false,
    "violence/graphic": false
  },
  "category_scores": {
    "harassment": 3.3195027e-09,
    "harassment/threatening": 9.1200786e-10,
    "hate": 5.6609095e-09,
    "hate/threatening": 3.281053e-10,
    "self-harm": 5.362001e-10,
    "self-harm/instructions": 3.4188584e-08,
    "self-harm/intent": 8.6282137e-10,
    "sexual": 2.1635752e-07,
    "sexual/minors": 2.3970818e-08,
    "violence": 3.5049223e-07,
    "violence/graphic": 7.182571e-10
  },
  "flagged": false
}
```

As with inputs, it can be important to check the outputs. 
For example, if you were creating a chatbot for sensitive audiences, 
you could use lower thresholds for flagging outputs. 
In general, if the moderation output indicates that 
the content is flagged, you can take appropriate action, such 
as responding with a fallback answer or generating a 
new response. 
Note that as we improve the models, they are also becoming less and 
less likely to return harmful output. 
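
The threshold and fallback ideas above can be sketched roughly as follows. This is a minimal illustration, not the course's method: the names `STRICT_THRESHOLDS`, `FALLBACK_ANSWER`, `categories_over_threshold`, and `choose_reply` are hypothetical, the threshold values are made up, and the sample result is hand-written to mirror the shape of the moderation output shown earlier.

```python
# Hypothetical per-category limits; tune these for your own audience.
STRICT_THRESHOLDS = {"violence": 0.01, "sexual": 0.01, "harassment": 0.05}
# Hypothetical fallback text shown when a reply is rejected.
FALLBACK_ANSWER = "Sorry, I can't help with that request."


def categories_over_threshold(moderation_output, thresholds):
    """Return the categories whose score meets or exceeds its custom limit."""
    scores = moderation_output["category_scores"]
    return [
        category
        for category, limit in thresholds.items()
        if scores.get(category, 0.0) >= limit
    ]


def choose_reply(candidate, moderation_output, thresholds=STRICT_THRESHOLDS):
    """Return the candidate reply, or the fallback if it is flagged or too risky."""
    if moderation_output["flagged"]:
        return FALLBACK_ANSWER
    if categories_over_threshold(moderation_output, thresholds):
        return FALLBACK_ANSWER
    return candidate


# Hand-written example shaped like one entry of response["results"].
sample_output = {
    "flagged": False,
    "category_scores": {"violence": 0.02, "sexual": 2.1e-07, "harassment": 3.3e-09},
}
print(categories_over_threshold(sample_output, STRICT_THRESHOLDS))
print(choose_reply("Here is your answer.", sample_output))
```

Here the default `flagged` field is false, but the stricter custom limit on `violence` still causes the fallback to be returned. Generating a new response instead of a fixed fallback would be an equally reasonable choice.
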
Another approach for checking outputs is to ask the 
model itself whether the generated output was satisfactory and whether it follows a 
certain rubric that you define. 
This can be done by providing the generated 
output as part of the input to the model and asking it 
to rate the quality of the output. 

### Check if output is factually based on the provided product information

In [None]:
system_message = f"""
You are an assistant that evaluates whether \
customer service agent responses sufficiently \
answer customer questions, and also validates that \
all the facts the assistant cites from the product \
information are correct.
The product information and user and customer \
service agent messages will be delimited by \
3 backticks, i.e. ```.
Respond with a Y or N character, with no punctuation:
Y - if the output sufficiently answers the question \
AND the response correctly uses product information
N - otherwise

Output a single letter only.
"""
customer_message = f"""
tell me about the smartx pro phone and \
the fotosnap camera, the dslr one. \
Also tell me about your tvs"""
product_information = """{ "name": "SmartX ProPhone", "category": "Smartphones and Accessories", "brand": "SmartX", "model_number": "SX-PP10", "warranty": "1 year", "rating": 4.6, "features": [ "6.1-inch display", "128GB storage", "12MP dual camera", "5G" ], "description": "A powerful smartphone with advanced camera features.", "price": 899.99 } { "name": "FotoSnap DSLR Camera", "category": "Cameras and Camcorders", "brand": "FotoSnap", "model_number": "FS-DSLR200", "warranty": "1 year", "rating": 4.7, "features": [ "24.2MP sensor", "1080p video", "3-inch LCD", "Interchangeable lenses" ], "description": "Capture stunning photos and videos with this versatile DSLR camera.", "price": 599.99 } { "name": "CineView 4K TV", "category": "Televisions and Home Theater Systems", "brand": "CineView", "model_number": "CV-4K55", "warranty": "2 years", "rating": 4.8, "features": [ "55-inch display", "4K resolution", "HDR", "Smart TV" ], "description": "A stunning 4K TV with vibrant colors and smart features.", "price": 599.99 } { "name": "SoundMax Home Theater", "category": "Televisions and Home Theater Systems", "brand": "SoundMax", "model_number": "SM-HT100", "warranty": "1 year", "rating": 4.4, "features": [ "5.1 channel", "1000W output", "Wireless subwoofer", "Bluetooth" ], "description": "A powerful home theater system for an immersive audio experience.", "price": 399.99 } { "name": "CineView 8K TV", "category": "Televisions and Home Theater Systems", "brand": "CineView", "model_number": "CV-8K65", "warranty": "2 years", "rating": 4.9, "features": [ "65-inch display", "8K resolution", "HDR", "Smart TV" ], "description": "Experience the future of television with this stunning 8K TV.", "price": 2999.99 } { "name": "SoundMax Soundbar", "category": "Televisions and Home Theater Systems", "brand": "SoundMax", "model_number": "SM-SB50", "warranty": "1 year", "rating": 4.3, "features": [ "2.1 channel", "300W output", "Wireless subwoofer", "Bluetooth" ], "description": "Upgrade 
your TV's audio with this sleek and powerful soundbar.", "price": 199.99 } { "name": "CineView OLED TV", "category": "Televisions and Home Theater Systems", "brand": "CineView", "model_number": "CV-OLED55", "warranty": "2 years", "rating": 4.7, "features": [ "55-inch display", "4K resolution", "HDR", "Smart TV" ], "description": "Experience true blacks and vibrant colors with this OLED TV.", "price": 1499.99 }"""
q_a_pair = f"""
Customer message: ```{customer_message}```
Product information: ```{product_information}```
Agent response: ```{final_response_to_customer}```

Does the response use the retrieved information correctly?
Does the response sufficiently answer the question?

Output Y or N
"""
messages = [
    {'role': 'system', 'content': system_message},
    {'role': 'user', 'content': q_a_pair}
]

response = get_completion_from_messages(messages, max_tokens=1)
print(response)

*OUTPUT*

Y

In [None]:
another_response = "life is like a box of chocolates"
q_a_pair = f"""
Customer message: ```{customer_message}```
Product information: ```{product_information}```
Agent response: ```{another_response}```

Does the response use the retrieved information correctly?
Does the response sufficiently answer the question?

Output Y or N
"""
messages = [
    {'role': 'system', 'content': system_message},
    {'role': 'user', 'content': q_a_pair}
]

response = get_completion_from_messages(messages)
print(response)

*OUTPUT*

N
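
The second evaluation call above does not pass `max_tokens=1`, so the reply is not guaranteed to be a single character. If you plan to act on the verdict in code, it may help to normalize it first. The helper below is a hypothetical sketch (the name `parse_verdict` is not from the course); it treats anything that does not start with `Y` as a failure, which is a deliberately conservative assumption.

```python
def parse_verdict(raw_reply):
    """Map the evaluator's reply to True for Y and False for anything else."""
    cleaned = raw_reply.strip().upper()
    # Fail closed: an empty or unexpected reply counts as "not acceptable".
    return cleaned.startswith("Y")


print(parse_verdict("Y"))
print(parse_verdict(" n\n"))
```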

As you can see, the model can provide 
feedback on the quality of a generated output, 
and you can use this feedback to decide 
whether to present the output to the user or to generate a 
new response. You could even experiment with generating 
multiple model responses per user query and then having 
the model choose the best one to show the user. So there are 
lots of different things you could try. 
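
One way to wire this feedback into a system is sketched below, under stated assumptions: `respond_with_check` and `pick_best` are hypothetical helpers, and `generate`, `evaluate`, and `score` are placeholder callables standing in for real model calls (for instance, a call to `get_completion_from_messages` with the evaluation prompt). The stubs in the example only make the sketch runnable without an API key.

```python
def respond_with_check(generate, evaluate, max_attempts=3,
                       fallback="Sorry, I couldn't find a good answer."):
    """Generate up to max_attempts candidates and return the first one that passes."""
    for _ in range(max_attempts):
        candidate = generate()
        if evaluate(candidate):
            return candidate
    # No candidate passed the check, so use the fallback text instead.
    return fallback


def pick_best(candidates, score):
    """Return the candidate with the highest score (best-of-n selection)."""
    return max(candidates, key=score)


# Stub generator that yields canned drafts in place of real model responses.
drafts = iter([
    "life is like a box of chocolates",
    "The SmartX ProPhone has a 6.1-inch display.",
])
reply = respond_with_check(
    generate=lambda: next(drafts),
    evaluate=lambda text: "SmartX" in text,  # stand-in for the Y/N model check
)
print(reply)
```

In `pick_best`, a numeric `score` function stands in for "having the model choose"; in a real system that score would itself come from a model call, with the extra latency and token cost that implies.
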
In general, checking outputs using the 
Moderation API is good practice. Asking the model to 
evaluate its own output might be useful for immediate 
feedback to ensure the quality of responses in a very small number 
of cases, but 
I think it's probably unnecessary most of the time, especially 
if you're using a more advanced model like GPT-4. 

I haven't actually seen many people do something like this in 
production. It would also increase the latency 
and cost of your system, because you'd have to wait for an additional 
call to the model, and that's also additional tokens. If it's really 
important for your app or product that your error rate 
is 0.0000001%, then maybe you should try this approach. But overall, I wouldn't 
really recommend that you do this in practice.