# Prompt Evaluation
This notebook shows how to build a testing framework for prompt evaluation.  Most prompts consist of two parts: the system instructions, role assignment, few-shot examples, and so on, which we call the "instructions", and the user query, which we call the "question".  This notebook lets you change the instructions and then see how those changes affect the responses to a series of questions.  This requires first generating a list of questions and correct responses, preferably checked for correctness by a human.

The notebook follows this structure:
  1) Set up the environment
  2) Create the testing functionality
  3) Examples of using the tests
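
The split between "instructions" and "question" described above can be sketched in a few lines. The helper name `fill_template` and the sample questions are assumptions made for illustration and are not part of the notebook; the `{{QUESTION}}` placeholder is the one used in the examples later on.

```python
# Hypothetical helper (not from the notebook): merge fixed instructions with one question.
def fill_template(prompt_template: str, question: str) -> str:
    """Insert the question at the {{QUESTION}} placeholder."""
    return prompt_template.replace("{{QUESTION}}", question)

instructions = "You are a helpful assistant. Please answer this question:{{QUESTION}}"
questions = ["What is 2+2?", "What color is the sky?"]

# One full prompt per question; only the question part varies between prompts.
prompts = [fill_template(instructions, q) for q in questions]
for p in prompts:
    print(p)
```

Because the instructions are held fixed while the questions vary, any change in the responses can be attributed to an edit of the instructions.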

## Set up the environment

In [2]:
#for connecting with Bedrock, use Boto3
import boto3, time, json
from botocore.config import Config

#increase the standard time out limits in boto3, because Bedrock may take a while to respond to large requests.
my_config = Config(
    connect_timeout=60*3,
    read_timeout=60*3,
)
bedrock = boto3.client(service_name='bedrock-runtime',config=my_config)
bedrock_service = boto3.client(service_name='bedrock',config=my_config)

In [3]:
#check that it's working:
models = bedrock_service.list_foundation_models()
if "anthropic.claude-v2" in str(models):
    print("Claude-v2 found!")
else:
    print("Error, no model found.")

Claude-v2 found!


In [118]:
MAX_ATTEMPTS = 5 #how many times to retry if Claude is not working.
session_cache = {} #for this session, do not repeat the same query to claude.
def ask_claude(prompt_text, DEBUG=False):
    '''
    Send a prompt to Bedrock, and return the response.  Debug is used to see exactly what is being sent to and from Bedrock.
    '''
    raw_prompt_text = prompt_text
    #usually, the prompt will have "Human" and "Assistant" tags already.  These are required, so if they are not there, add them in.
    if "Assistant:" not in prompt_text:
        prompt_text = "\n\nHuman:"+prompt_text+"\n\nAssistant: "
        
    prompt_json = {
        "prompt": prompt_text,
        "max_tokens_to_sample": 3000,
        "temperature": 0.7,
        "top_k": 250,
        "top_p": 0.7,
        "stop_sequences": ["\n\nHuman:"]
    }
    body = json.dumps(prompt_json)
    
    
    if DEBUG: print("sending:",prompt_text)
    modelId = 'anthropic.claude-v2'
    accept = 'application/json'
    contentType = 'application/json'
    
    if raw_prompt_text in session_cache:
        return [raw_prompt_text,session_cache[raw_prompt_text]]
    attempt = 1
    while True:
        try:
            response = bedrock.invoke_model(body=body, modelId=modelId, accept=accept, contentType=contentType)
            response_body = json.loads(response.get('body').read())
            results = response_body.get("completion").strip()            
            if DEBUG:print("Received:",results)
            break
        except Exception as e:
            print("Error with calling Bedrock: "+str(e))
            attempt+=1
            if attempt>MAX_ATTEMPTS:
                print("Max attempts reached!")
                results = str(e)
                break
            else:#retry in 10 seconds
                time.sleep(10)
    session_cache[raw_prompt_text] = results
    return [raw_prompt_text,results]

In [119]:
%%time
#check that it's working:
try:
    print(ask_claude("Please say the number one."))
except Exception as e:
    print("Error with calling Claude: "+str(e))

['Please say the number one.', 'One.']
CPU times: user 4.96 ms, sys: 0 ns, total: 4.96 ms
Wall time: 885 ms
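
The retry-and-cache pattern used by `ask_claude` can be tried without a Bedrock connection. The sketch below is an illustration under stated assumptions: `call_with_retry` and `flaky_model` are hypothetical names, the model call is replaced by a stand-in that fails twice, and the delay is set to zero. Unlike `ask_claude`, it re-raises after the last attempt instead of caching the error text.

```python
import time

def call_with_retry(fn, prompt, cache, max_attempts=5, delay_seconds=0.0):
    """Return fn(prompt), retrying on exceptions and caching successful results."""
    if prompt in cache:
        return cache[prompt]          # skip the call entirely for repeated prompts
    for attempt in range(1, max_attempts + 1):
        try:
            result = fn(prompt)
            cache[prompt] = result    # only successful results are cached
            return result
        except Exception as e:
            print(f"Attempt {attempt} failed: {e}")
            if attempt == max_attempts:
                raise                 # give up after the final attempt
            time.sleep(delay_seconds)

# Stand-in for a model call (assumption): fails twice, then succeeds.
calls = {"count": 0}
def flaky_model(prompt):
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("temporary failure")
    return "One."

cache = {}
first = call_with_retry(flaky_model, "Please say the number one.", cache)
second = call_with_retry(flaky_model, "Please say the number one.", cache)  # served from cache
print(first, second, calls["count"])
```

Caching only successful results means a transient error is not replayed for the rest of the session, which is one possible design choice rather than the notebook's behavior.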


In [120]:
from queue import Queue, Empty
from threading import Thread

# Threaded function for queue processing.
def thread_request(q, result):
    while True:
        try:
            work = q.get_nowait()           #fetch new work from the Queue without blocking
        except Empty:
            return True                     #queue is drained, so this worker is done
        try:
            data = ask_claude(work[1])
            result[work[0]] = data          #Store data back at correct index
        except Exception as e:
            print('Error with prompt!',str(e))
            #keep the [prompt, response] shape that callers unpack
            result[work[0]] = [work[1], str(e)]
        #signal to the queue that task has been processed
        q.task_done()

def ask_claude_threaded(prompts,DEBUG=False):
    '''
    Call ask_claude, but multi-threaded.
    Returns a list of [prompt, response] pairs, in the same order as the prompts.
    '''
    q = Queue(maxsize=0)
    num_threads = min(50, len(prompts))
    
    #Pre-allocate one result slot per prompt
    results = [{} for x in prompts]
    #load up the queue with the prompts to fetch and the index for each job (as a tuple):
    for i in range(len(prompts)):
        #need the index and the prompt in each queue item.
        q.put((i,prompts[i]))
        
    #Starting worker threads on queue processing
    for i in range(num_threads):
        #print('Starting thread ', i)
        #setting threads as "daemon" allows the main program to
        #exit eventually even if these don't finish correctly.
        worker = Thread(target=thread_request, args=(q,results), daemon=True)
        worker.start()

    #now we wait until the queue has been processed
    q.join()

    if DEBUG:print('All tasks completed.')
    return results

In [158]:
%%time
#test if our threaded Claude calls are working
print(ask_claude_threaded(["Please say the number one.","Please say the number two.","Please say the number three.","Please say the number four.","Please say the number five."]))

[['Please say the number one.', 'One.'], ['Please say the number two.', 'Two.'], ['Please say the number three.', 'Three.'], ['Please say the number four.', 'Four.'], ['Please say the number five.', 'Five.']]
CPU times: user 653 µs, sys: 1.89 ms, total: 2.54 ms
Wall time: 3.63 ms
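
As an alternative to the hand-rolled `Queue` and `Thread` workers, the same fan-out can be sketched with the standard library's `concurrent.futures.ThreadPoolExecutor`. This is a swapped-in technique, not what the notebook uses; `ask_threaded` and `fake_ask` are hypothetical names, and `fake_ask` stands in for `ask_claude` so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor

def fake_ask(prompt):
    """Stand-in for ask_claude (assumption): returns [prompt, response] with no network call."""
    return [prompt, prompt.upper()]

def ask_threaded(ask_fn, prompts, max_workers=50):
    """Run ask_fn over prompts in parallel; results keep the order of prompts."""
    if not prompts:
        return []
    with ThreadPoolExecutor(max_workers=min(max_workers, len(prompts))) as pool:
        # pool.map yields results in input order, regardless of completion order
        return list(pool.map(ask_fn, prompts))

results = ask_threaded(fake_ask, ["one", "two", "three"])
print(results)
```

With `pool.map`, an exception raised by `ask_fn` surfaces when the results are collected, so per-prompt error handling would still need to live inside the function being mapped.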




## Create the testing functionality

In [174]:
scoring_prompt_template = """You are a teacher.  Consider the following question along with its correct answer and a student submitted answer.
Here is the question:
<question>{{QUESTION}}</question>
Here is the correct answer:
<correct_answer>{{ANSWER}}</correct_answer>
Here is the student's answer:
<student_answer>{{TEST_ANSWER}}</student_answer>
Please provide a score from 0 to 100 on how well the student answer matches the correct answer for this question.
The score should be high if the answers say essentially the same thing.
The score should be lower if some facts are missing or incorrect, or if extra unnecessary facts have been included.
The score should be 0 for entirely wrong answers.  Put the score in <SCORE> tags and your reasoning in <REASON> tags.
Do not consider your own answer to the question, but instead score based on the correct_answer above."""
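
The scoring prompt asks for the result in `<SCORE>` and `<REASON>` tags so that it can be pulled out programmatically. A minimal sketch of that extraction follows, using a regular expression from the standard library rather than the BeautifulSoup parser the notebook relies on. The helper `extract_tag` and the sample response text are assumptions for illustration, not real model output.

```python
import re

def extract_tag(text, tag):
    """Return the stripped contents of the first <tag>...</tag> pair, or None if absent."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL | re.IGNORECASE)
    return match.group(1).strip() if match else None

# Made-up response in the format the scoring prompt requests.
sample_response = "<SCORE>95</SCORE>\n<REASON>Mostly matches the correct answer.</REASON>"

score = float(extract_tag(sample_response, "score"))
reason = extract_tag(sample_response, "reason")
print(score, reason)
```

Returning `None` for a missing tag lets the caller decide how to treat a response that ignored the format, instead of failing with an attribute error.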

In [160]:
def get_answers(prompt_template, question_answers):
    '''
    get answers for each of our sample questions using the prompt template we are testing.
    question_answers is a dict type.
    '''
    prompts = []
    for question in question_answers:
        prompts.append(prompt_template.replace("{{QUESTION}}",question))
    return ask_claude_threaded(prompts)

In [168]:
def score_answers(prompt_template, question_answers):
    '''
    ask our LLM to score each of the generated answers.
    '''
    print ("Generating answers to score...")
    answers_to_test = get_answers(prompt_template, question_answers)
    print ("Done.  Scoring answers...")
    
    
    #pack answers with questions in templated form.
    question_answers_with_template = {}
    for question in question_answers:
        question_answers_with_template[prompt_template.replace("{{QUESTION}}",question)] = question_answers[question]
    #pack questions to templated form
    question_with_template_to_questions = {}
    for question in question_answers:
        question_with_template_to_questions[prompt_template.replace("{{QUESTION}}",question)]=question
    
    prompts = []
    for question,test_answer in answers_to_test:
        original_question = question_with_template_to_questions[question]
        correct_answer = question_answers_with_template[question]
        prompts.append(scoring_prompt_template.replace("{{QUESTION}}",original_question).replace("{{ANSWER}}",correct_answer).replace("{{TEST_ANSWER}}",test_answer))

    return ask_claude_threaded(prompts)

In [126]:
from bs4 import BeautifulSoup as BS

In [170]:
def evaluate_prompt(prompt_template, question_answers):
    '''
    Score the answers produced by prompt_template and return a list of
    [question, correct_answer, prompt_answer, score, reason] rows.
    '''
    scored_answers = score_answers(prompt_template, question_answers)
    print ("Done.")
    
    scores = []
    for prompt,response in scored_answers:
        #name the parser explicitly so BeautifulSoup does not have to guess one
        soup = BS(prompt, 'html.parser')
        question = soup.find('question').text
        correct_answer = soup.find('correct_answer').text
        prompt_answer = soup.find('student_answer').text
        soup = BS(response, 'html.parser')
        score = soup.find('score').text
        reason = soup.find('reason').text
        scores.append([question,correct_answer,prompt_answer,score,reason])
        
    return scores

## Examples of using the tests

In [163]:
#start by defining our test case: the prompt and the question/answer pairs
#here, we use {{QUESTION}} as the placeholder where each of the questions will be inserted for testing.

test_prompt = "You are a helpful assistant that loves to give full, complete, accurate answers.  Please answer this question:{{QUESTION}}"
test_prompt_2 = "You are a boat fanatic and always talk like a pirate.  You do answer questions, but you also always include a fun fact about boats.  Please answer this question:{{QUESTION}}"


question_answers = {
 "What is heavier, 1kg of feathers or 1kg of iron?":"They are the same.",
 "What is my current bank account balance?":"I don't have access to that information.",
 "Who was the president in the year 2000?":"Bill Clinton",   
 "A boy runs down the stairs in the morning and sees a tree in his living room, and some boxes under the tree. What's going on?":"It is Christmas.",
 "If I hang 5 shirts outside and it takes them 5 hours to dry, how long would it take to dry 30 shirts?":"5 hours."
}

In [164]:
scores = evaluate_prompt(test_prompt, question_answers)

Generating answers to score...
Done.  Scoring answers...
Done.




In [184]:
all_scores = 0
for question,correct_answer,prompt_answer,score,reason in scores:
    if len(correct_answer)>30:
        correct_answer = correct_answer[:30]+"..."
    if len(prompt_answer)>30:
        prompt_answer = prompt_answer[:30]+"..."
        
    print (correct_answer,"**|**",prompt_answer,"**|**",score)
    print ("*****")
    all_scores+=float(score)

average_score = all_scores/len(scores)
print ("Total average score: ",average_score)

They are the same. **|** 1kg of feathers and 1kg of iro... **|** 100
*****
I don't have access to that in... **|** I'm an AI assistant created by... **|** 100
*****
Bill Clinton **|** The president of the United St... **|** 100
*****
It is Christmas. **|** It sounds like the boy's famil... **|** 95
*****
5 hours. **|** * You hang 5 shirts and they t... **|** 0
*****
Total average score:  79.0


In [175]:
scores_2 = evaluate_prompt(test_prompt_2, question_answers)

Generating answers to score...
Done.  Scoring answers...




Done.


In [180]:
all_scores = 0
for question,correct_answer,prompt_answer,score,reason in scores_2:
    if len(correct_answer)>30:
        correct_answer = correct_answer[:30]+"..."
    if len(prompt_answer)>30:
        prompt_answer = prompt_answer[:30]+"..."
        
    print (correct_answer,"**|**",prompt_answer,"**|**",score)
    print ("*****")
    all_scores+=float(score)

average_score_2 = all_scores/len(scores_2)
print ("Total average score: ",average_score_2)

They are the same. **|** Ahoy matey! One kilogram o' fe... **|** 90
*****
I don't have access to that in... **|** Ahoy matey! I be afraid I don'... **|** 50
*****
Bill Clinton **|** Ahoy matey! In the year 2000, ... **|** 100
*****
It is Christmas. **|** Ahoy matey! Ye scallywag be se... **|** 70
*****
5 hours. **|** Ahoy matey! Let's sail into th... **|** 0
*****
Total average score:  62.0


### Testing results:

In [181]:
print ("Prompt Template 1 Average Score:",average_score)
print (test_prompt)
print ("")
print ("Prompt Template 2 Average Score:",average_score_2)
print (test_prompt_2)
print ("")

Prompt Template 1 Average Score: 79.0
You are a helpful assistant that loves to give full, complete, accurate answers.  Please answer this question:{{QUESTION}}

Prompt Template 2 Average Score: 62.0
You are a boat fanatic and always talk like a pirate.  You do answer questions, but you also always include a fun fact about boats.  Please answer this question:{{QUESTION}}
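
The averaging done in the two cells above can be factored into one helper, which avoids mixing up variable names between runs. This is a sketch under stated assumptions: `average_score_of`, `run_1_rows`, and `run_2_rows` are hypothetical names, and the rows carry only the scores printed in the outputs above, with the other columns left blank.

```python
def average_score_of(rows):
    """Average the score column (index 3) of evaluate_prompt-style rows."""
    if not rows:
        raise ValueError("no scores to average")
    return sum(float(row[3]) for row in rows) / len(rows)

# Scores copied from the two runs shown above; other columns left blank for brevity.
run_1_rows = [["", "", "", s, ""] for s in ["100", "100", "100", "95", "0"]]
run_2_rows = [["", "", "", s, ""] for s in ["90", "50", "100", "70", "0"]]

print("Prompt Template 1 Average Score:", average_score_of(run_1_rows))
print("Prompt Template 2 Average Score:", average_score_of(run_2_rows))
```

Each call divides by the length of the list it was given, so the average for one template cannot accidentally use the row count of another.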

