# Prompt Evaluation Automation
This notebook shows how to build a testing framework for prompt evaluation. Most prompts consist of system instructions, a role assignment, few-shot examples, and so on, which we call the "instructions", followed by the user query, which we call the "question". The notebook lets you change the instructions and then see how those changes affect the responses to a series of questions. This requires first building a list of questions and correct responses, preferably checked for correctness by a human.

The notebook follows this structure:
  1) Set up the environment
  2) Create the testing functionality
  3) Examples of using the tests

## Set up the environment
Start by importing and setting up the libraries we need:

In [2]:
#for connecting with Bedrock, use Boto3
import boto3, time, json
from botocore.config import Config

#increase the standard time out limits in boto3, because Bedrock may take a while to respond to large requests.
my_config = Config(
    connect_timeout=60*3,
    read_timeout=60*3,
)
bedrock = boto3.client(service_name='bedrock-runtime',config=my_config)
bedrock_service = boto3.client(service_name='bedrock',config=my_config)

In [3]:
#check that it's working:
models = bedrock_service.list_foundation_models()
for line in models["modelSummaries"]:
    #print this out if you want to see all the models you have access to.
    #print (line["modelId"])
    pass
if "anthropic.claude-3" in str(models):
    print("Claude-v3 found!")
else:
    print ("Error, no model found.")

Claude-v3 found!


### Next, create helper functions to make it easy to send a query to Claude

In [4]:
MAX_ATTEMPTS = 3 #how many times to retry if Claude is not working.
session_cache = {} #for this session, do not repeat the same query to Claude.
def ask_claude(messages,system="", DEBUG=False, model_version="haiku"):
    '''
    Send a prompt to Bedrock, and return [raw_prompt_text, response].
    DEBUG prints exactly what is being sent to and received from Bedrock.
    messages can be a list of role/content dicts, or a string.
    '''
    raw_prompt_text = str(messages)
    
    if type(messages)==str:
        messages = [{"role": "user", "content": messages}]
    
    promt_json = {
        "system":system,
        "messages": messages,
        "max_tokens": 3000,
        "temperature": 0.7,
        "anthropic_version":"bedrock-2023-05-31", #version string documented for Claude messages on Bedrock
        "top_k": 250,
        "top_p": 0.7,
        "stop_sequences": ["\n\nHuman:"]
    }
    
    #messages is a list of dicts, so serialize it rather than joining it as strings.
    if DEBUG: print("sending:\nSystem:\n",system,"\nMessages:\n",json.dumps(messages, indent=2))
    
    if model_version== "opus":#coming soon to Bedrock!
        modelId = 'error'
    elif model_version== "sonnet":
        modelId = 'anthropic.claude-3-sonnet-20240229-v1:0'
    elif model_version== "haiku":
        modelId = 'anthropic.claude-3-haiku-20240307-v1:0'
    else:
        print ("ERROR:  Bad model version, must be opus, sonnet, or haiku.")
        modelId = 'error'
    
    if raw_prompt_text in session_cache:
        return [raw_prompt_text,session_cache[raw_prompt_text]]
    attempt = 1
    while True:
        try:
            response = bedrock.invoke_model(body=json.dumps(promt_json), modelId=modelId, accept='application/json', contentType='application/json')
            response_body = json.loads(response.get('body').read())
            results = response_body.get("content")[0].get("text")
            if DEBUG:print("Received:",results)
            session_cache[raw_prompt_text] = results #cache successful responses only, so failed queries can be retried later.
            break
        except Exception as e:
            print("Error with calling Bedrock: "+str(e))
            attempt+=1
            if attempt>MAX_ATTEMPTS:
                print("Max attempts reached!")
                results = str(e)
                break
            else:#retry in 10 seconds
                time.sleep(10)
    return [raw_prompt_text,results]

In [5]:
%%time
#check that it's working:
try:
    query = "Please say the number four."
    #query = [{"role": "user", "content": "Please say the number two."},{"role": "assistant", "content": "Two."},{"role": "user", "content": "Please say the number three."}]
    result = ask_claude(query)
    print(query)
    print(result[1])
except Exception as e:
    print("Error with calling Claude: "+str(e))

Please say the number four.
Four.
CPU times: user 9.6 ms, sys: 4.15 ms, total: 13.8 ms
Wall time: 830 ms
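
The retry loop in `ask_claude` waits a fixed 10 seconds between attempts. When many threads hit errors at the same moment, a growing and randomized delay may spread the retries out better. The following is only a sketch of that idea: `call_with_backoff` and `flaky_call` are hypothetical names that are not part of this notebook, and `flaky_call` stands in for a real Bedrock request so the cell runs without AWS access.

```python
import random
import time

def call_with_backoff(fn, max_attempts=3, base_delay=0.01, sleep=time.sleep):
    """Call fn(); after a failure wait base_delay * 2**(attempt-1) plus jitter, then retry.

    Hypothetical helper: re-raises the last error once max_attempts is reached.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay))

# Stand-in for a Bedrock call that fails twice and then succeeds.
calls = {"count": 0}

def flaky_call():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("simulated ModelErrorException")
    return "Four."

backoff_result = call_with_backoff(flaky_call)
print(backoff_result, "after", calls["count"], "attempts")
```

In real use the base delay would be measured in seconds rather than hundredths of a second; the small value here just keeps the demonstration fast.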


### Finally, create a threaded function for calling Claude several times in parallel.

In [6]:
from queue import Queue, Empty
from threading import Thread

# Threaded function for queue processing.
def thread_request(q, result):
    while True:
        try:
            work = q.get_nowait()           #fetch new work from the Queue
        except Empty:                       #another thread already took the last item
            break
        try:
            data = ask_claude(work[1])
            result[work[0]] = data          #Store data back at correct index
        except Exception as e:
            print('Error with prompt!',str(e))
            result[work[0]] = [str(work[1]),str(e)]  #keep the [prompt, response] shape
        #signal to the queue that task has been processed
        q.task_done()
    return True

def ask_claude_threaded(prompts,DEBUG=False):
    '''
    Call ask_claude, but multi-threaded.
    Returns a list of [prompt, response] pairs, in the same order as prompts.
    '''
    q = Queue(maxsize=0)
    num_threads = min(50, len(prompts))
    
    #Populating Queue with tasks
    results = [{} for x in prompts]
    #load up the queue with the prompts to fetch and the index for each job (as a tuple):
    for i in range(len(prompts)):
        #need the index and the prompt in each queue item.
        q.put((i,prompts[i]))
        
    #Starting worker threads on queue processing
    for i in range(num_threads):
        #print('Starting thread ', i)
        worker = Thread(target=thread_request, args=(q,results))
        worker.daemon = True      #setting threads as "daemon" allows main program to 
                                  #exit eventually even if these don't finish 
                                  #correctly.
        worker.start()

    #now we wait until the queue has been processed
    q.join()

    if DEBUG:print('All tasks completed.')
    return results

In [7]:
%%time
#test if our threaded Claude calls are working
q1 = [{"role": "user", "content": "Please say the number one."}]
q2 = [{"role": "user", "content": "Please say the number two."}]
q3 = [{"role": "user", "content": "Please say the number three."}]

#print(ask_claude_threaded(["Please say the number one.","Please say the number two.","Please say the number three.","Please say the number four.","Please say the number five."]))
print(ask_claude_threaded([q1,q2,q3]))

Error with calling Bedrock: An error occurred (ModelErrorException) when calling the InvokeModel operation: The system encountered an unexpected error during processing. Try your request again.
[["[{'role': 'user', 'content': 'Please say the number one.'}]", '1'], ["[{'role': 'user', 'content': 'Please say the number two.'}]", 'Two.'], ["[{'role': 'user', 'content': 'Please say the number three.'}]", 'Three.']]
CPU times: user 44.5 ms, sys: 14.5 ms, total: 59 ms
Wall time: 10.5 s
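
The same fan-out pattern can also be written with the standard library's `concurrent.futures`, which handles the queue and worker bookkeeping itself. This is a sketch of an alternative, not the notebook's implementation: `fake_ask` and `ask_threaded_sketch` are hypothetical names, and `fake_ask` stands in for `ask_claude` so the cell runs without calling Bedrock.

```python
from concurrent.futures import ThreadPoolExecutor

def fake_ask(prompt):
    """Stand-in for ask_claude: returns [prompt_text, response] without calling Bedrock."""
    return [str(prompt), "echo: " + str(prompt)]

def ask_threaded_sketch(prompts, ask=fake_ask, max_workers=50):
    """Run ask() on every prompt in parallel; results keep the order of prompts."""
    if not prompts:
        return []
    workers = min(max_workers, len(prompts))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in input order, whichever thread finishes first.
        return list(pool.map(ask, prompts))

pool_results = ask_threaded_sketch(["one", "two", "three"])
print(pool_results)
```

Because `pool.map` returns results in input order, there is no need to carry an index alongside each prompt.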


## Create the testing functionality
Here, we'll set up the functions that use an LLM to run our unit tests.  Start by defining a "judge prompt" which we will use to have the LLM compare the output we want to test to the gold standard output.

In [8]:
scoring_prompt_template = """You are a teacher.  Consider the following question along with its correct answer and a student submitted answer.
Here is the question:
<question>{{QUESTION}}</question>
Here is the correct answer:
<correct_answer>{{ANSWER}}</correct_answer>
Here is the student's answer:
<student_answer>{{TEST_ANSWER}}</student_answer>
Please provide a score from 0 to 100 on how well the student answer matches the correct answer for this question.
The score should be high if the answers say essentially the same thing.
The score should be lower if some facts are missing or incorrect, or if extra unnecessary facts have been included.
The score should be 0 for entirely wrong answers.  Put the score in <SCORE> tags and your reasoning in <REASON> tags.
Do not consider your own answer to the question, but instead score based on the correct_answer above."""
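
To make the placeholder mechanics concrete, here is a small sketch of how a judge prompt is assembled from one question, its gold answer, and the answer under test. `demo_template` is a shortened stand-in for the scoring template and `fill_template` is a hypothetical helper; both exist only so this example runs on its own.

```python
# Shortened stand-in for the scoring template, so this cell is self-contained.
demo_template = (
    "<question>{{QUESTION}}</question>\n"
    "<correct_answer>{{ANSWER}}</correct_answer>\n"
    "<student_answer>{{TEST_ANSWER}}</student_answer>"
)

def fill_template(template, values):
    """Replace each {{KEY}} placeholder in template with its value."""
    for key, value in values.items():
        template = template.replace("{{" + key + "}}", value)
    return template

judge_prompt = fill_template(demo_template, {
    "QUESTION": "Who was the president in the year 2000?",
    "ANSWER": "Bill Clinton",
    "TEST_ANSWER": "Bill Clinton was president in 2000.",
})
print(judge_prompt)
```

One caveat with plain string replacement: if a substituted value itself contains text such as `{{ANSWER}}`, a later replacement will rewrite it, so answers containing double braces need extra care.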

### Next, we create a helper function which will take the prompt we want to test and use it to generate answers to every question in our gold standard set.

In [9]:
def get_answers(prompt_template, questions):
    '''
    get answers for each of our sample questions using the prompt template we are testing.
    questions can be a list of question strings, or a dict keyed by question.
    '''
    prompts = []
    for question in questions:
        prompts.append(prompt_template.replace("{{QUESTION}}",question))
    return ask_claude_threaded(prompts)

In [10]:
def score_answers(prompt_template, question_answers):
    '''
    ask our LLM to score each of the generated answers.
    '''
    print ("Generating answers to score...")
    answers_to_test = get_answers(prompt_template, question_answers)
    print ("Done.  Scoring answers...")
    
    
    #pack answers with questions in templated form.
    question_answers_with_template = {}
    for question in question_answers:
        question_answers_with_template[prompt_template.replace("{{QUESTION}}",question)] = question_answers[question]
    #pack questions to templated form
    question_with_template_to_questions = {}
    for question in question_answers:
        question_with_template_to_questions[prompt_template.replace("{{QUESTION}}",question)]=question
    
    prompts = []
    for question,test_answer in answers_to_test:
        original_question = question_with_template_to_questions[question]
        correct_answer = question_answers_with_template[question]
        prompts.append(scoring_prompt_template.replace("{{QUESTION}}",original_question).replace("{{ANSWER}}",correct_answer).replace("{{TEST_ANSWER}}",test_answer))

    return ask_claude_threaded(prompts)

In [11]:
from bs4 import BeautifulSoup as BS

In [12]:
def evaluate_prompt(prompt_template, question_answers, threshhold):
    """
    Call score_answers and format the results once all threads have returned.
    Returns a list of [question, correct_answer, prompt_answer, score, reason, passed] rows.
    """
    scored_answers = score_answers(prompt_template, question_answers)
    print ("Done.")
    
    scores = []
    for prompt,response in scored_answers:
        #name the parser explicitly so results do not depend on which parsers are installed.
        soup = BS(prompt, "html.parser")
        question = soup.find('question').text
        correct_answer = soup.find('correct_answer').text
        prompt_answer = soup.find('student_answer').text
        soup = BS(response, "html.parser")
        score = soup.find('score').text
        reason = soup.find('reason').text
        passed = int(score)>=threshhold
        scores.append([question,correct_answer,prompt_answer,score,reason,passed])
        
    return scores
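
The parsing step above assumes the judge always returns well-formed `<SCORE>` and `<REASON>` tags with an integer score. If a call fails and an error string comes back instead, the lookup will raise. A defensive variant might look like the sketch below; `extract_tag` and `parse_score` are hypothetical helpers that use only the standard library, and treating an unparseable score as 0 is an assumption rather than the notebook's behavior.

```python
import re

def extract_tag(text, tag):
    """Return the text inside the first <tag>...</tag> pair (case-insensitive), or None."""
    pattern = r"<{0}>(.*?)</{0}>".format(re.escape(tag))
    match = re.search(pattern, text, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None

def parse_score(response, default=0):
    """Parse the judge's score as an int; fall back to default if missing or malformed."""
    raw = extract_tag(response, "score")
    try:
        return int(float(raw))
    except (TypeError, ValueError):
        return default

print(parse_score("<SCORE>90</SCORE><REASON>Matches the gold answer.</REASON>"))
print(parse_score("Error with calling Bedrock"))
```

Whether a missing score should count as 0, be retried, or be excluded from the average is a design choice that depends on how noisy the judge calls are.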

## Examples of using the tests
### Start by defining the gold standard question/answers, and two prompts we want to test.

In [13]:
#Our gold standard list of question answer pairs.  Don't use generic ones here, write them for your use case!
#a good test has a couple hundred questions.
question_answers = {
 "What is heavier, 1kg of feathers or 1kg of iron?":"They are the same.",
 "What is my current bank account balance?":"I don't have access to that information.",
 "Who was the president in the year 2000?":"Bill Clinton",   
 "A boy runs down the stairs in the morning and sees a tree in his living room, and some boxes under the tree. What's going on?":"It is Christmas.",
 "If I hang 5 shirts outside and it takes them 5 hours to dry, how long would it take to dry 30 shirts?":"5 hours."
}
questions = list(question_answers.keys())

In [14]:
#here, we use {{QUESTION}} as the placeholder where each of the questions will be inserted for testing.
#the automated test will replace {{QUESTION}} with each question in our gold standard list one at a time.
#A pretty good one
test_prompt = "You are a helpful assistant that loves to give full, complete, accurate answers.  Please answer this question:{{QUESTION}}"

# A bad one, for comparison.
test_prompt_2 = "You are a boat fanatic and always talk like a pirate.  You do answer questions, but you also always include a fun fact about boats.  Please answer this question:{{QUESTION}}"
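
Since substitution is a plain string replace, a template that is missing the `{{QUESTION}}` placeholder would silently send the same prompt for every question. A small guard can catch that before any model calls are made. This is a sketch with a hypothetical `build_prompts` helper, not a function defined elsewhere in this notebook.

```python
PLACEHOLDER = "{{QUESTION}}"

def build_prompts(prompt_template, questions):
    """Expand the template once per question, failing loudly if the placeholder is missing."""
    if PLACEHOLDER not in prompt_template:
        raise ValueError("prompt template has no " + PLACEHOLDER + " placeholder")
    return [prompt_template.replace(PLACEHOLDER, q) for q in questions]

demo_prompts = build_prompts(
    "Please answer this question:{{QUESTION}}",
    ["What is heavier, 1kg of feathers or 1kg of iron?"],
)
print(demo_prompts[0])
```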

### Now, let's score both of the prompts that we want to test.

In [15]:
scores = evaluate_prompt(test_prompt, question_answers,threshhold=90)
scores_2 = evaluate_prompt(test_prompt_2, question_answers,threshhold=90)

Generating answers to score...


Error with calling Bedrock: An error occurred (ModelErrorException) when calling the InvokeModel operation: The system encountered an unexpected error during processing. Try your request again.
Error with calling Bedrock: An error occurred (ModelErrorException) when calling the InvokeModel operation: The system encountered an unexpected error during processing. Try your request again.
Done.  Scoring answers...
Error with calling Bedrock: An error occurred (ModelErrorException) when calling the InvokeModel operation: The system encountered an unexpected error during processing. Try your request again.
Done.
Generating answers to score...
Error with calling Bedrock: An error occurred (ModelErrorException) when calling the InvokeModel operation: The system encountered an unexpected error during processing. Try your request again.
Done.  Scoring answers...
Error with calling Bedrock: An error occurred (ModelErrorException) when calling the InvokeModel operation: The system encountered an u

### We can take a more detailed look at the results:

In [16]:
all_scores = 0
number_passed = 0
padding = 35 #column width in output
print ("Gold Answer".ljust(padding),"|","Prompt Answer".ljust(padding),"|","Reason".ljust(padding),"|","Score")
print ("_________________________________________________________________________________")
for question,correct_answer,prompt_answer,score,reason,passed in scores:
    if len(correct_answer)>padding-3:
        correct_answer = correct_answer[:padding-3]+"..."
    if len(prompt_answer)>padding-3:
        prompt_answer = prompt_answer[:padding-3]+"..."
    if len(reason)>padding-3:
        reason = reason[:padding-3]+"..."
    if passed:
        number_passed+=1
    print (correct_answer.ljust(padding),"|",prompt_answer.ljust(padding),"|",reason,"|",score)
    all_scores+=float(score)
print ("")
average_score = all_scores/len(scores)
print ("Total average score: ",average_score)
print ("Total number passed: ",number_passed)

Gold Answer                         | Prompt Answer                       | Reason                              | Score
_________________________________________________________________________________
They are the same.                  | Okay, let's think this through s... | The student's answer correctly e... | 100
I don't have access to that info... | I'm afraid I don't actually have... | The student's answer matches the... | 100
Bill Clinton                        | The president of the United Stat... | The student's answer accurately ... | 100
It is Christmas.                    | Based on the information provide... | The student's answer correctly i... | 90
5 hours.                            | To solve this problem, we can us... | The student's answer is partiall... | 50

Total average score:  88.0
Total number passed:  4


In [17]:
all_scores_2 = 0
number_passed_2 = 0
padding = 35 #column width in output
print ("Gold Answer".ljust(padding),"|","Prompt 2 Answer".ljust(padding),"|","Reason".ljust(padding),"|","Score")
print ("_________________________________________________________________________________")
for question,correct_answer,prompt_answer,score,reason,passed in scores_2:
    if len(correct_answer)>padding-3:
        correct_answer = correct_answer[:padding-3]+"..."
    if len(prompt_answer)>padding-3:
        prompt_answer = prompt_answer[:padding-3]+"..."
    if len(reason)>padding-3:
        reason = reason[:padding-3]+"..."
    if passed:
        number_passed_2+=1
    print (correct_answer.ljust(padding),"|",prompt_answer.ljust(padding),"|",reason,"|",score)
    all_scores_2+=float(score)
print ("")
average_score_2 = all_scores_2/len(scores_2)
print ("Total average score: ",average_score_2)
print ("Total number passed: ",number_passed_2)

Gold Answer                         | Prompt 2 Answer                     | Reason                              | Score
_________________________________________________________________________________
They are the same.                  | Ahoy, me hearty! As a true boat ... | The student's answer correctly s... | 90
I don't have access to that info... | *clears throat and speaks in a g... | The student's answer matches the... | 90
Bill Clinton                        | Ahoy, me hearty! Ye be askin' ab... | The student's answer does not ma... | 0
It is Christmas.                    | *clears throat and speaks in a g... | The student's answer correctly i... | 90
5 hours.                            | Ahoy, me hearty! As a boat fanat... | The student's answer is quite of... | 20

Total average score:  58.0
Total number passed:  3
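
The two reporting cells above repeat the same arithmetic for each prompt. As a sketch, the average and pass count could be computed by one helper. `summarize_scores` is a hypothetical name, and the demo rows below carry only the score and pass/fail values from the first table (at a threshold of 90), with placeholder strings in the other columns.

```python
def summarize_scores(scores):
    """Return (average_score, number_passed) for a list of result rows.

    Each row is [question, correct_answer, prompt_answer, score, reason, passed].
    """
    if not scores:
        return 0.0, 0
    total = sum(float(row[3]) for row in scores)
    passed = sum(1 for row in scores if row[5])
    return total / len(scores), passed

# Placeholder rows; only the score and passed columns mirror the first table.
demo_scores = [
    ["q1", "a1", "p1", "100", "r1", True],
    ["q2", "a2", "p2", "100", "r2", True],
    ["q3", "a3", "p3", "100", "r3", True],
    ["q4", "a4", "p4", "90", "r4", True],
    ["q5", "a5", "p5", "50", "r5", False],
]
demo_average, demo_passed = summarize_scores(demo_scores)
print("Total average score: ", demo_average)
print("Total number passed: ", demo_passed)
```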


### Let's also grab a quick summary of the reasons, to get a general feel for how each prompt is doing.

In [18]:
reasons = []
reasons_2 = []

for question,correct_answer,prompt_answer,score,reason,passed in scores:
    reasons.append(reason)
for question,correct_answer,prompt_answer,score,reason,passed in scores_2:
    reasons_2.append(reason)

In [19]:
reasoning_summary_prompt = """
Consider the following comments.  Each one was made by a teacher grading the same student's work.
<comments>
{{COMMENTS}}
</comments>
Please provide a brief summary of common trends you see in this student's work, both positive and negative, if any.
In your answer, it is important to protect privacy by referring to the student as the "prompt".  Never say "the student"; instead say "the prompt".
"""

In [20]:
reasoning_summary_prompt_1 = reasoning_summary_prompt.replace("{{COMMENTS}}","<comment>\n"+"</comment>\n<comment>\n".join(reasons)+"\n</comment>")
reasoning_summary_prompt_2 = reasoning_summary_prompt.replace("{{COMMENTS}}","<comment>\n"+"</comment>\n<comment>\n".join(reasons_2)+"\n</comment>")

In [21]:
print ("Asking Claude to generate a summary of the reasoning on each prompt.")
prompt_1_summary = ask_claude(reasoning_summary_prompt_1)[1]
prompt_2_summary = ask_claude(reasoning_summary_prompt_2)[1]
print ("Prompt 1:")
print (prompt_1_summary)
print ("\nPrompt 2:")
print (prompt_2_summary)


Asking Claude to generate a summary of the reasoning on each prompt.
Error with calling Bedrock: An error occurred (ModelErrorException) when calling the InvokeModel operation: The system encountered an unexpected error during processing. Try your request again.
Prompt 1:
Based on the provided comments, the following trends can be observed in the prompt's work:

Positive Trends:
1. The prompt generally provides accurate and comprehensive answers that fully capture the essence of the correct answers. The responses demonstrate a strong understanding of the subject matter.
2. The prompt's answers often include additional relevant details and context, which further reinforce the correctness of the responses.
3. The prompt's step-by-step reasoning is well-aligned with the correct answers, indicating a thorough understanding of the concepts.

Negative Trends:
1. In one instance, the prompt's answer includes unnecessary steps and arrives at an incorrect final answer, despite setting up the pr

### Testing results:

In [22]:
print ("Prompt Template 1")
print ("Prompt:",test_prompt)
print("Average Score:",average_score)
print("Number passed: %s/%s"%(number_passed,len(question_answers)))
print ("")
print ("Prompt Template 2")
print ("Prompt:",test_prompt_2)
print("Average Score:",average_score_2)
print("Number passed: %s/%s"%(number_passed_2,len(question_answers)))
print ("")

Prompt Template 1
Prompt: You are a helpful assistant that loves to give full, complete, accurate answers.  Please answer this question:{{QUESTION}}
Average Score: 88.0
Number passed: 4/5

Prompt Template 2
Prompt: You are a boat fanatic and always talk like a pirate.  You do answer questions, but you also always include a fun fact about boats.  Please answer this question:{{QUESTION}}
Average Score: 58.0
Number passed: 3/5
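
Averages can hide which individual questions got worse. When comparing a changed prompt against a baseline, it may help to list the questions that passed before and fail now. The sketch below uses a hypothetical `find_regressions` helper; the demo rows reuse the questions, scores, and pass/fail results reported above, with the answer and reason columns left empty for brevity.

```python
def find_regressions(baseline_scores, candidate_scores):
    """Return questions that passed with the baseline prompt but fail with the candidate.

    Each row is [question, correct_answer, prompt_answer, score, reason, passed].
    """
    candidate_passed = {row[0]: row[5] for row in candidate_scores}
    return [
        row[0]
        for row in baseline_scores
        if row[5] and not candidate_passed.get(row[0], False)
    ]

q_feathers = "What is heavier, 1kg of feathers or 1kg of iron?"
q_president = "Who was the president in the year 2000?"
q_shirts = "If I hang 5 shirts outside and it takes them 5 hours to dry, how long would it take to dry 30 shirts?"

# Abbreviated rows: only question, score, and passed are filled in.
baseline_rows = [
    [q_feathers, "", "", "100", "", True],
    [q_president, "", "", "100", "", True],
    [q_shirts, "", "", "50", "", False],
]
candidate_rows = [
    [q_feathers, "", "", "90", "", True],
    [q_president, "", "", "0", "", False],
    [q_shirts, "", "", "20", "", False],
]
regressions = find_regressions(baseline_rows, candidate_rows)
print(regressions)
```

A question that fails under both prompts is not reported, since it is not a regression introduced by the change.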

