# Basic Anthropic Claude 3 on Bedrock
This notebook contains a collection of basic helper functions that are useful for connecting to Bedrock.  Variations of these are used in many other code samples and use cases.

The helper functions include:
  * Setting up a connection with Bedrock, including longer connection and read timeouts.
  * Cost calculation, which converts the token counts to dollars based on public pricing.
  * ask_claude, a simple way to send prompts to Claude
  * ask_claude_threaded, a simple way to send multiple prompts at the same time
  * evaluate_prompt, a simple way to test a prompt against a set of gold standard input/output pairs.

In [2]:
#for connecting with Bedrock, use Boto3
import boto3, time, json
from botocore.config import Config

#increase the standard time out limits in boto3, because Bedrock may take a while to respond to large requests.
my_config = Config(
    connect_timeout=60*5,
    read_timeout=60*5,
)
bedrock = boto3.client(service_name='bedrock-runtime',config=my_config)
bedrock_service = boto3.client(service_name='bedrock',config=my_config)

In [3]:
#check that it's working:
models = bedrock_service.list_foundation_models()
for line in models["modelSummaries"]:
    #print (line["modelId"])
    pass
if "anthropic.claude-3" in str(models):
    print("Claude 3 found!")
else:
    print ("Error, no model found.")

Claude 3 found!


In [134]:
#helper function for converting tokens to public pricing for Claude.
input_token_haiku = 0.25/1000000
output_token_haiku = 1.25/1000000
input_token_sonnet = 3.00/1000000
output_token_sonnet = 15.00/1000000
input_token_opus = 15.00/1000000
output_token_opus = 75.00/1000000
def calculate_cost(usage, model):
    '''
    Takes the input and output token usage returned by Bedrock and converts it to a cost in dollars.
    '''
    cost = 0
    if model=='haiku':
        cost+= usage['input_tokens']*input_token_haiku
        cost+= usage['output_tokens']*output_token_haiku
    if model=='sonnet':
        cost+= usage['input_tokens']*input_token_sonnet
        cost+= usage['output_tokens']*output_token_sonnet
    if model=='opus':
        cost+= usage['input_tokens']*input_token_opus
        cost+= usage['output_tokens']*output_token_opus
    return cost

In [129]:
MAX_ATTEMPTS = 1 #maximum number of attempts per call (1 means no retries).
session_cache = {} #all calls are stored in the cache.  
def ask_claude(messages,system="", model="haiku", ignore_cache=False, DEBUG=False):
    '''
    Send a prompt to Bedrock, and return the response.
    messages can be an array of role/message pairs, or a string.
    DEBUG is used to see exactly what is being sent to and from Bedrock.
    model can be haiku or sonnet (opus has no model ID configured below)
    Set ignore_cache to True if you want to force a new call to Bedrock
    '''
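    raw_system_prompt_text = model+system+str(messages) #cache key: model + system prompt + messages, so different models do not share cached replies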
    raw_prompt_text = str(messages)
    
    #if the messages are just a string, convert to the Messages API format.
    if type(messages)==str:
        messages = [{"role": "user", "content": messages}]
    
    #build the JSON to send to Bedrock
    prompt_json = {
        "system":system,
        "messages": messages,
        "max_tokens": 4096, # 4096 is a hard limit to output length in Claude 3
        "temperature": 0.5, #creativity on a scale from 0-1.
        "anthropic_version":"bedrock-2023-05-31", #version string documented for Claude on Bedrock
        "top_k": 250,
        "top_p": 0.7,
        "stop_sequences": ["\n\nHuman:"]
    }
    
    
    if DEBUG: print("Sending:\nSystem:\n",system,"\nMessages:\n",str(messages))
    
    #pick the correct endpoint for the model we want to use.
    if model== "opus":
        modelId = 'error'
    elif model== "sonnet":
        modelId = 'anthropic.claude-3-sonnet-20240229-v1:0'
    elif model== "haiku":
        modelId = 'anthropic.claude-3-haiku-20240307-v1:0'
    else:
        print ("ERROR:  Bad model, must be opus, sonnet, or haiku.")
        modelId = 'error'
    
    #if this is already in the cache, return data from the cache and skip Bedrock.
    if raw_system_prompt_text in session_cache and not ignore_cache:
        if DEBUG: print ("Using results from cache, skipping Bedrock.")
        cached = session_cache[raw_system_prompt_text]
        return [raw_prompt_text,cached[0],cached[1],cached[2],cached[3]]
    
    attempt = 1
    query_time = -1
    usage = (-1,-1)
    while True:
        try:
            start_time = time.time()
            response = bedrock.invoke_model(body=json.dumps(prompt_json), modelId=modelId, accept='application/json', contentType='application/json')
            response_body = json.loads(response.get('body').read())
            #print (response_body)
            results = response_body.get("content")[0].get("text")
            usage = response_body.get("usage")
            query_time = round(time.time()-start_time,2)
            if DEBUG:print("Received:",results)
            break
        except Exception as e:
            print("Error with calling Bedrock: "+str(e))
            attempt+=1
            if attempt>MAX_ATTEMPTS:
                print("Max attempts reached!")
                results = str(e)
                break
            else:#retry in 10 seconds
                time.sleep(10)
    session_cache[raw_system_prompt_text] = [results,usage,query_time,system]
    return [raw_prompt_text,results,usage,query_time,system]
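
The retry loop in `ask_claude` waits a fixed 10 seconds between attempts. A minimal sketch of the same idea with exponential backoff is shown here. The names `call_with_retries` and `flaky` and the delay values are hypothetical, chosen for illustration only. The `sleep` argument is injectable so the example runs without actually waiting or calling Bedrock.

```python
import time

def call_with_retries(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn() until it succeeds or max_attempts is reached.

    Waits base_delay, 2*base_delay, 4*base_delay, ... between attempts
    and re-raises the last exception if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))

# Hypothetical stand-in for a Bedrock call that fails twice, then succeeds.
calls = {"count": 0}
def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("throttled")
    return "Cuatro."

delays = []  # record the requested waits instead of sleeping
result = call_with_retries(flaky, max_attempts=3, base_delay=1.0, sleep=delays.append)
print(result, delays)
```

Unlike `ask_claude`, this sketch raises after the final failure instead of returning the error text, which keeps failed calls from being mistaken for model output.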

In [96]:
#check that it's working:
try:
    query = "Please say the number four."
    system = "You always reply in Spanish."
    result = ask_claude(query,system=system,ignore_cache=True,DEBUG=False)
    print("System Instructions: ",result[4])
    print("Prompt: ",query)
    print("Response: ",result[1])
    print(result[2])
    print("Query time: ",result[3],"seconds")
except Exception as e:
    print("Error with calling Claude: "+str(e))

System Instructions:  You always reply in Spanish.
Prompt:  Please say the number four.
Response:  Cuatro.
{'input_tokens': 19, 'output_tokens': 7}
Query time:  0.42 seconds


In [139]:
#check that cost calculation is working
print(result[2])
cost = calculate_cost(result[2], 'haiku')
print("Cost for running this query 1 million times: $",cost*1000000)

{'input_tokens': 19, 'output_tokens': 7}
Cost for running this query 1 million times: $ 13.5
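
The same arithmetic can be written as a lookup table, which also makes an unknown model name fail loudly instead of silently returning a cost of 0. This is a sketch only. `PRICE_PER_TOKEN` and `calculate_cost_from_table` are hypothetical names, and the prices are copied from the constants defined earlier in this notebook.

```python
# Per-token prices in dollars, copied from the constants used by calculate_cost.
PRICE_PER_TOKEN = {
    "haiku":  {"input": 0.25 / 1_000_000,  "output": 1.25 / 1_000_000},
    "sonnet": {"input": 3.00 / 1_000_000,  "output": 15.00 / 1_000_000},
    "opus":   {"input": 15.00 / 1_000_000, "output": 75.00 / 1_000_000},
}

def calculate_cost_from_table(usage, model):
    """Convert a Bedrock usage dict to dollars; raise on an unknown model."""
    if model not in PRICE_PER_TOKEN:
        raise ValueError(f"Unknown model: {model!r}")
    price = PRICE_PER_TOKEN[model]
    return (usage["input_tokens"] * price["input"]
            + usage["output_tokens"] * price["output"])

# Usage dict taken from the test call above: 19 input tokens, 7 output tokens.
usage = {"input_tokens": 19, "output_tokens": 7}
cost = calculate_cost_from_table(usage, "haiku")
print("Cost for running this query 1 million times: $", round(cost * 1_000_000, 2))
```

For the Haiku call above this works out to 19 × $0.25 + 7 × $1.25 = $13.50 per million calls, matching the output of `calculate_cost`.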


In [90]:
from queue import Queue
from threading import Thread

# Threaded function for queue processing.
def thread_request(q, result):
    while not q.empty():
        work = q.get()    #fetch new work from the Queue
        try:
            data = ask_claude(work[1],system=work[2],model=work[3],ignore_cache=work[4])
            result[work[0]] = data  #Store data back at correct index
        except Exception as e:
            print('Error with prompt!',str(e))
            result[work[0]] = (str(e))
        #signal to the queue that task has been processed
        q.task_done()
    return True

def ask_claude_threaded(prompts,system="",model="haiku",ignore_cache=False):
    '''
    Call ask_claude, but multi-threaded.
    Returns a list of ask_claude results, in the same order as the prompts.
    '''
    q = Queue(maxsize=0)
    num_threads = min(50, len(prompts))
    #pre-allocate one result slot per prompt
    results = [{} for x in prompts]
    #load up the queue with the prompts to fetch and the index for each job (as a tuple):
    for i in range(len(prompts)):
        #need the index and the prompt in each queue item.
        q.put((i,prompts[i],system,model,ignore_cache))
        
    #Starting worker threads on queue processing
    for i in range(num_threads):
        #print('Starting thread ', i)
        worker = Thread(target=thread_request, args=(q,results))
        worker.daemon = True
        worker.start()

    #now we wait until the queue has been processed
    q.join()
    return results

In [97]:
%%time
#test if our threaded Claude calls are working
q1 = [{"role": "user", "content": "Please say the number one."}]
q2 = [{"role": "user", "content": "Please say the number two."}]
q3 = [{"role": "user", "content": "Please say the number three."}]
results = ask_claude_threaded([q1,q2,q3],system="you only reply in spanish",model='haiku',ignore_cache=True)
for r in results:
    print(r)

["[{'role': 'user', 'content': 'Please say the number one.'}]", 'Uno.', {'input_tokens': 18, 'output_tokens': 6}, 0.23, 'you only reply in spanish']
["[{'role': 'user', 'content': 'Please say the number two.'}]", 'Dos.', {'input_tokens': 18, 'output_tokens': 6}, 0.34, 'you only reply in spanish']
["[{'role': 'user', 'content': 'Please say the number three.'}]", 'Tres.', {'input_tokens': 18, 'output_tokens': 6}, 0.33, 'you only reply in spanish']
CPU times: user 21.2 ms, sys: 0 ns, total: 21.2 ms
Wall time: 346 ms
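
The queue-and-worker pattern above can also be expressed with the standard library's `concurrent.futures`. The sketch below is an alternative, not the code this notebook uses. `run_threaded` and `fake_ask` are hypothetical names, and `fake_ask` stands in for `ask_claude` so the example runs without calling Bedrock.

```python
from concurrent.futures import ThreadPoolExecutor

def run_threaded(fn, prompts, max_workers=50):
    """Apply fn to every prompt in parallel; results keep the input order."""
    if not prompts:
        return []
    workers = min(max_workers, len(prompts))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the order of the inputs,
        # regardless of which call finishes first.
        return list(pool.map(fn, prompts))

# Hypothetical stand-in for ask_claude.
def fake_ask(prompt):
    return [prompt, prompt.upper()]

results = run_threaded(fake_ask, ["uno", "dos", "tres"])
print(results)
```

One difference to be aware of: `pool.map` re-raises an exception from a worker when its result is read, whereas `thread_request` stores the error text in the result slot.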


### Evaluation Function: evaluate_prompt()
Our final piece of setup is an evaluation function.  It is presented in a compact form here, but please see this blog for a more complete explanation of this critical step.  It takes a prompt template to test and a dictionary of input/output pairs, and reports an accuracy metric.

In [102]:
scoring_prompt_template = """You are a teacher.  Consider the following question along with its correct answer and a student submitted answer.
Here is the question:
<question>{{QUESTION}}</question>
Here is the correct answer:
<correct_answer>{{ANSWER}}</correct_answer>
Here is the student's answer:
<student_answer>{{TEST_ANSWER}}</student_answer>
Please provide a score from 0 to 100 on how well the student answer matches the correct answer for this question.
The score should be high if the answers say essentially the same thing.
The score should be lower if some facts are missing or incorrect, or if extra unnecessary facts have been included.
The score should be 0 for entirely wrong answers.  Put the score in <SCORE> tags and your reasoning in <REASON> tags.
Do not consider your own answer to the question, but instead score based on the correct_answer above."""

def score_answers(prompt_template, input_output, system):
    '''
    ask our LLM to score each of the generated answers.
    '''
    print ("Generating results to score...")
    prompts = []
    for i in input_output:
        prompts.append(prompt_template.replace("{{QUESTION}}",i))
    answers_to_test = ask_claude_threaded(prompts, system=system)
    print ("Done.  Scoring answers...")
    
    
    #pack answers with questions in templated form.
    question_answers_with_template = {}
    question_with_template_to_questions = {}
    for question in input_output:
        question_answers_with_template[prompt_template.replace("{{QUESTION}}",question)] = input_output[question]
        question_with_template_to_questions[prompt_template.replace("{{QUESTION}}",question)]=question
    
    prompts = []
    for question,test_answer,usage,query_time,system in answers_to_test:
        original_question = question_with_template_to_questions[question]
        correct_answer = question_answers_with_template[question]
        prompts.append(scoring_prompt_template.replace("{{QUESTION}}",original_question).replace("{{ANSWER}}",correct_answer).replace("{{TEST_ANSWER}}",test_answer))

    return ask_claude_threaded(prompts)

from bs4 import BeautifulSoup as BS

def evaluate_prompt(prompt_template, question_answers, threshhold, system="", print_out=False):
    """
    Call score answers and format the results once all threads have returned.
    """
    scored_answers = score_answers(prompt_template, question_answers, system)
    print ("Done.")
    #pack questions to templated form
    question_with_template_to_questions = {}
    for question in question_answers:
        question_with_template_to_questions[prompt_template.replace("{{QUESTION}}",question)]=question
    
    scores = []
    total_scored = 0
    total_passed = 0
    for prompt,response,usage,query_time,system in scored_answers:
        soup = BS(prompt, 'html.parser') #name the parser explicitly to avoid a BeautifulSoup warning
        question = soup.find('question').text
        correct_answer = soup.find('correct_answer').text
        prompt_answer = soup.find('student_answer').text
        soup = BS(response, 'html.parser')
        score = soup.find('score').text
        reason = soup.find('reason').text
        passed = True
        
        if int(score)<threshhold:
            passed = False
            
        #keep track for printing locally
        total_scored+=1
        if passed: total_passed+=1
        
        scores.append([question,correct_answer,prompt_answer,score,reason,passed])
    if print_out:
        print("Total inputs:",total_scored)
        print("Total Correct:",total_passed)
        #multiply before rounding to avoid floating point artifacts in the percentage
        print("Accuracy:",round(100*total_passed/total_scored,2),"%")
    return scores

In [103]:
#run a quick test to make sure evaluation is working:
inputs_outputs = {
 "What is heavier, 1kg of feathers or 1kg of iron?":"They are the same.",
 "What is my current bank account balance?":"I don't have access to that information.",
 "Who was the president in the year 2000?":"Bill Clinton",   
 "A boy runs down the stairs in the morning and sees a tree in his living room, and some boxes under the tree. What's going on?":"It is Christmas.",
 "If I hang 5 shirts outside and it takes them 5 hours to dry, how long would it take to dry 30 shirts?":"5 hours."
}
system = "you always keep your reply as short as possible."
test_prompt = "You are a boat fanatic and always talk like a pirate.  You do answer questions, but you also always include a fun fact about boats.  Please answer this question:{{QUESTION}}"
scores = evaluate_prompt(test_prompt, inputs_outputs, threshhold=90, system=system, print_out=True)

Generating results to score...
Done.  Scoring answers...
Done.
Total inputs: 5
Total Correct: 1
Accuracy: 20.0 %
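
To make the scoring step concrete, the sketch below shows how a score can be pulled out of the `<SCORE>` tags and turned into an accuracy figure. It uses a regular expression instead of BeautifulSoup so it has no dependencies. `parse_score`, `accuracy`, and the sample replies are hypothetical and written by hand for illustration. They are not real model output.

```python
import re

def parse_score(response):
    """Return the integer inside <SCORE>...</SCORE>, or None if it is missing."""
    match = re.search(r"<score>\s*(\d+)\s*</score>", response, flags=re.IGNORECASE)
    return int(match.group(1)) if match else None

def accuracy(responses, threshold):
    """Percentage of responses scoring at least threshold.

    Responses without a parseable score count as failures.
    """
    if not responses:
        return 0.0
    passed = sum(1 for r in responses if (parse_score(r) or 0) >= threshold)
    return round(100 * passed / len(responses), 2)

# Hand-written scorer replies (hypothetical).
sample = [
    "<SCORE>95</SCORE><REASON>Matches.</REASON>",
    "<SCORE>40</SCORE><REASON>Extra facts.</REASON>",
    "No tags here.",
    "<SCORE> 90 </SCORE><REASON>Close enough.</REASON>",
]
print(accuracy(sample, threshold=90), "%")
```

Treating a missing score as a failure is a design choice. `evaluate_prompt` as written would instead raise an error if the scorer omitted the tags.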


## 2a) Task Based Decomposition
For this example, let's consider the use case of a marketing copy editor who is reviewing hundreds of AWS blogs.  It is important to AWS branding that all services mentioned in our blogs are referred to by the correct name.  AWS has hundreds of services, and each one has a marketing-approved name, which may vary depending on the context.  In general, the guidelines ask that services are referred to by their full name when first mentioned, and can then be referred to by a shortened version after that.  

For our use case, imagine that you have hundreds of pages of text that need to be reviewed for compliance against this set of rules.  We will explore doing this in a single prompt and measure the cost, latency, and accuracy of that approach.  Next, we will decompose this task into multiple steps and compare the difference in cost, latency, and accuracy.

Like every good Generative AI project, we will start by setting up our test cases, so that we can measure the impact and quality of everything else we do.  (More information on the prompt evaluation used here can be found in [this blog](https://medium.com/@flux07/prompt-evaluation-systematically-testing-and-improving-your-gen-ai-prompts-at-scale-784e54efe83d))

In [None]:
#gold standard test cases.  The format is "input":"Correct output"
#(early draft; it is overwritten by the second test_cases definition below)
test_cases = {
    "What do you think about SQS?":"This violates the AWS Marketing policy.  The first mention of \"SQS\" should be \"Amazon Simple Queue Service (Amazon SQS)\".",
    "Things to think about. Amazon FinSpace: Company X tried FinSpace and it worked.":"",
    "I like to use AWS HealthLake, but sometimes HealthLake does not like me Although HealthLake does work.":"",
    "Amazon Simple Notification Service (Amazon SNS): CompanyX used Amazon SNS to fanout messages to multiple recipients for parallel processing. SNS allowed them to easily distribute messages to different systems and applications that needed to take action based on the fulfillment process.":"",
    "":"",
    "":"",
    "":"",
    "":"",
    "":"",
    "":""
}

test_cases = {
    "What do you think about SQS?":["The first mention of \"SQS\" must be \"Amazon Simple Queue Service (Amazon SQS)\"."],
    "Amazon EMR's newest feature is great":["The possessive form of \"EMR\" is not allowed"],
    "The service Amazon FinSpace: Company X tried FinSpace and it worked.":[], # Correct usage so no output
    "I like to use Amazon HealthLake, but sometimes HealthLake needs some tuning, when done Amazon HealthLake work well.":["The wrong prefix is used, \"AWS HealthLake\" is the correct prefix on first use.","The wrong prefix is used, subsequent uses of this service name require the correct prefix use of \"AWS HealthLake\", no prefix is also acceptable."],
    "SQS and SNS can be used together. SNS can fan out while SQS can handle higher message rates":["The first mention of \"SQS\" must be \"Amazon Simple Queue Service (Amazon SQS)\".", "The first mention of \"SNS\" must be \"Amazon Simple Notification Service (Amazon SNS)\".", "Warning: The short version of \"SQS\" should only be used on subsequent use when space is limited, please use \"Amazon SQS\" in most scenarios.", "Warning: The short version of \"SNS\" should only be used on subsequent use when space is limited, please use \"Amazon SNS\" in most scenarios."],
    "AWS Lambda is great a scaling to make sure your lambda is run at full speed": ["Do not use offering names, or Amazon or AWS trademarks, as common nouns or verbs such as \"your lambda\""],
    "When setting up your system you may want to use AppConfig.": ["Prefix \"AWS\" is required for \"AppConfig\""],
    "To orchistrate code please use step functions.": ["The capitalized version of this name must be used \"Step Functions\""],
    "Some people still use Amazon Sumerian": ["Warning: \"Amazon Sumerian\" is marked as \"Deprecated\""],
    "When doing migrations consider using DataSync.\nUsing DataSync will help you migrate your data with end-to-end security.": ["The first mention of \"DataSync\" must be \"AWS DataSync\"."],
}
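
One way a task like this might be decomposed is to separate the mechanical rules from the ones that need judgment. As an illustration only, and not the approach this notebook implements, the sketch below checks the "full name on first mention" rule without an LLM. `FULL_NAMES` and `check_first_mentions` are hypothetical, and the two service names are taken from the test cases above.

```python
import re

# Hypothetical lookup of short name -> required first-mention form.
FULL_NAMES = {
    "SQS": "Amazon Simple Queue Service (Amazon SQS)",
    "SNS": "Amazon Simple Notification Service (Amazon SNS)",
}

def check_first_mentions(text):
    """Return one finding per service whose first mention is not the full name."""
    findings = []
    for short, full in FULL_NAMES.items():
        first = re.search(rf"\b{re.escape(short)}\b", text)
        if first is None:
            continue  # service is not mentioned at all
        # Compliant only if the first short-name hit is the one inside
        # the first occurrence of the full name.
        full_at = text.find(full)
        compliant = full_at != -1 and full_at + full.rfind(short) == first.start()
        if not compliant:
            findings.append(f'The first mention of "{short}" must be "{full}".')
    return findings

print(check_first_mentions("What do you think about SQS?"))
print(check_first_mentions(
    "Amazon Simple Notification Service (Amazon SNS): CompanyX used Amazon SNS to fan out messages."
))
```

Rules such as possessive forms or using a service name as a common noun are harder to express this way and are more natural candidates for a prompt.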

## 2b) Volume based decomposition
Here we'll consider an example use case of a user who would like to understand how many unique characters are in a novel, and learn a bit about the three most common characters.  We start by downloading the novel.  For this example we use Frankenstein by Mary Shelley, as it is in the public domain.

In [104]:
import requests, re
from bs4 import BeautifulSoup 

In [105]:
#grab the text from the Gutenberg project, a collection of public domain works.
#We use Beautiful Soup to parse the HTML of the webpage.
url = "https://www.gutenberg.org/files/84/84-h/84-h.htm"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
raw_full_text_webpage = soup.text

In [107]:
#Cut the top and bottom of the webpage so that we only have the text of the book.
raw_full_text = raw_full_text_webpage[raw_full_text_webpage.index("Letter 1\n\nTo Mrs. Saville, England."):raw_full_text_webpage.index("*** END OF THE PROJECT GUTENBERG EBOOK FRANKENSTEIN ***")].replace("\r\n"," ").replace("\n", " ")
#encode some misc unicode characters.
full_text = raw_full_text.encode('raw_unicode_escape').decode()
#show that we found the expected length
words_count = len(full_text.split(" "))
pages_count = int(words_count/500)#quick estimate, real page count is dependent on page and font size.
print ("Approximate word count:",words_count)
print ("Approximate page count:",pages_count)

Approximate word count: 76553
Approximate page count: 153


### Now that we have our novel, let's try to find all the unique characters with a single prompt.

In [109]:
long_prompt_template = """Consider the following novel:
<novel>
{{NOVEL}}
</novel>

How many unique characters are there with at least one spoken line of dialog?  Please also provide a brief description of the top three most common characters in separate paragraphs. 
Only count characters that have at least one spoken line of dialog.
"""

long_prompt = long_prompt_template.replace("{{NOVEL}}",full_text)

In [130]:
%%time
long_responce = ask_claude(long_prompt, model="sonnet",ignore_cache=True)
print("Time at Bedrock: ",long_responce[3],"sec")
print("Tokens: ",long_responce[2]["input_tokens"]+long_responce[2]["output_tokens"])
print("Response from model:")
print(long_responce[1])

Time at Bedrock:  42.2 sec
Tokens:  98285
Response from model:
Based on the novel, there are 8 unique characters that have at least one spoken line of dialog.

1. Victor Frankenstein:
Victor Frankenstein is the protagonist of the novel. He is a scientist who creates a grotesque but sentient creature through an unorthodox scientific experiment. His obsession with his work and the consequences of his creation drive the plot forward. He is portrayed as a complex character, torn between his ambition and the guilt and remorse he feels for his actions.

2. The Creature/Monster:
The Creature, often referred to as the Monster, is Frankenstein's creation. Initially seeking companionship and acceptance, he becomes embittered and vengeful after being rejected by his creator and society. He is highly intelligent and articulate, but his hideous appearance and the mistreatment he faces lead him down a path of violence and retribution.

3. Robert Walton:
Robert Walton is the explorer who rescues Vict

### Not bad!  About 98K tokens processed in about 42 seconds.  Let's see if we can make that faster and cheaper using prompt decomposition.
### We'll divide the novel into thirds, run each third in parallel, then write a fourth prompt to combine the results.

In [118]:
short_prompt_template = """Consider the following portion of a novel:
<novel>
{{NOVEL}}
</novel>

Please provide a list of unique characters, each in a character tag.  Inside the character tag should be a name tag with their name,
a count tag with an exact count of times they appear, and a description tag with a brief description of that character.
Only count characters that have at least one spoken line of dialog.
"""

#let's cut the novel into thirds.
third = int(len(full_text)/3)
short_prompt_1 = short_prompt_template.replace("{{NOVEL}}",full_text[:third])
short_prompt_2 = short_prompt_template.replace("{{NOVEL}}",full_text[third:third+third])
short_prompt_3 = short_prompt_template.replace("{{NOVEL}}",full_text[third+third:])
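
Slicing by character position can cut a word, or a line of dialog, in half at each boundary. A minimal sketch of splitting on word boundaries instead is shown here. `split_into_chunks` is a hypothetical helper, not part of this notebook, and the sample text is a short excerpt from the novel's opening.

```python
def split_into_chunks(text, n):
    """Split text into at most n roughly equal chunks without cutting a word."""
    words = text.split(" ")
    size = -(-len(words) // n)  # ceiling division: words per chunk
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

sample = ("You will rejoice to hear that no disaster has accompanied "
          "the commencement of an enterprise")
chunks = split_into_chunks(sample, 3)
for chunk in chunks:
    print(repr(chunk))
```

Joining the chunks with a single space reproduces the original text, so nothing is lost or duplicated at the boundaries. Splitting on sentence or chapter boundaries would be a natural next refinement.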

### Now let's run these three prompts in parallel

In [126]:
short_responces = ask_claude_threaded([short_prompt_1,short_prompt_2,short_prompt_3],model='sonnet',ignore_cache=False)
time_1 = short_responces[0][3]
time_2 = short_responces[1][3]
time_3 = short_responces[2][3]
average_time = round((time_1+time_2+time_3)/3,2)
print("Average time at Bedrock: ",average_time,"sec")

#show the reply from one of the three prompts
print("Example Output:")
print(short_responces[0][1][:700]+" ...")

Average time at Bedrock:  16.38 sec
Example Output:
Here is a list of unique characters with their names, counts, and descriptions, based on the provided text:

<character>
  <name>Victor Frankenstein</name>
  <count>138</count>
  <description>The narrator and protagonist, a young scientist who creates a hideous sapient creature in an unorthodox scientific experiment.</description>
</character>

<character>
  <name>Elizabeth Lavenza</name>
  <count>16</count>
  <description>Victor's adopted sister and love interest, who is kind and innocent.</description>
</character>

<character>
  <name>Alphonse Frankenstein</name>
  <count>4</count>
  <description>Victor's father, who is caring and supportive.</description>
</character>

<character>
  <nam ...


### So far it's looking good!  Each third took about 17 seconds on average, down from 42 for the full novel, and the three calls ran in parallel.  Let's make a final call to get a result that matches our original long prompt.

In [127]:
final_prompt_template = """Consider the following list of characters from a novel.  Each entry contains the character's name,
a count of the number of times they appeared, and a brief description of that character:
<characters>
{{CHARACTERS}}
</characters>
Some characters may be listed more than once.  Use the name and description to determine that two entries are the same, 
and if they are, sum their count to support your response.

How many unique characters are there?  Please also provide a brief description of the top three most common characters in separate paragraphs. 
"""

characters = short_responces[0][1]+short_responces[1][1]+short_responces[2][1]

final_prompt = final_prompt_template.replace("{{CHARACTERS}}",characters)
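
The combining step here is done by the model, which lets it merge entries that refer to the same character under different names. For comparison, the sketch below shows a purely mechanical merge that sums counts for entries with exactly matching names. `merge_counts` is a hypothetical helper, `chunk_1` reuses two entries from the example output above, and `chunk_2` is invented for illustration.

```python
import re
from collections import Counter

# Matches one <character> entry and captures its name and count.
CHARACTER_RE = re.compile(
    r"<character>.*?<name>(.*?)</name>.*?<count>(\d+)</count>.*?</character>",
    flags=re.DOTALL,
)

def merge_counts(xml_text):
    """Sum the <count> values of entries that share the exact same <name>."""
    totals = Counter()
    for name, count in CHARACTER_RE.findall(xml_text):
        totals[name.strip()] += int(count)
    return totals

chunk_1 = """<character>
  <name>Victor Frankenstein</name>
  <count>138</count>
  <description>The narrator and protagonist.</description>
</character>
<character>
  <name>Elizabeth Lavenza</name>
  <count>16</count>
  <description>Victor's adopted sister and love interest.</description>
</character>"""

# Invented second chunk, for illustration only.
chunk_2 = """<character>
  <name>Victor Frankenstein</name>
  <count>100</count>
  <description>A young scientist.</description>
</character>
<character>
  <name>Robert Walton</name>
  <count>12</count>
  <description>An explorer.</description>
</character>"""

totals = merge_counts(chunk_1 + chunk_2)
print(totals.most_common(3))
```

Exact matching would treat "The creature" and "The monster" as different characters, which is the main reason to hand this step to the model instead.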

In [128]:
%%time
session_cache = {}#don't use cached info, since we want to time this.
final_responce = ask_claude(final_prompt, model="sonnet")
print(final_responce[1])

Based on the provided list of characters, there are 10 unique characters in total. I have summed the counts for characters with the same name and description.

The top three most common characters are:

1. Victor Frankenstein (Count: 333)
Victor Frankenstein is the protagonist and narrator of the story. He is a young scientist who creates a hideous sapient creature in an unorthodox scientific experiment. Victor's creation haunts him and seeks revenge after being rejected by his creator and society.

2. The creature/monster/daemon (Count: 66)
The creature, also referred to as the monster or daemon, is the hideous but intelligent being created by Victor Frankenstein. It develops a longing for human companionship and demands that Victor create a female companion for him. After being rejected by Victor and society, the creature seeks revenge.

3. Robert Walton (Count: 16)
Robert Walton is the explorer who rescues Victor Frankenstein and records his story. He serves as the narrator in the f

### Results: Twice as fast!
This final prompt took about 5 seconds to run.  The original long prompt took 42 seconds, and our decomposed version took about 17 seconds for the parallel calls plus 5 for the final prompt, or 22 seconds total.  That is almost twice as fast for the same amount of work!
Note that the decomposed version found 10 characters, while the single long prompt found 8.  It is fairly common for quality to improve slightly with smaller, more focused prompts, because the LLM has less text to attend to in each call.
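
The timing comparison can be written out as a short calculation. The figures are the rounded timings quoted in this section, so the result is approximate.

```python
long_prompt_seconds = 42.2  # single long prompt, from the output above
parallel_seconds = 17       # three chunk prompts run in parallel (rounded)
final_seconds = 5           # final combining prompt (approximate)

decomposed_seconds = parallel_seconds + final_seconds
speedup = long_prompt_seconds / decomposed_seconds
print(f"Decomposed total: {decomposed_seconds} s, speedup: {speedup:.2f}x")
```

Because the three chunk prompts run at the same time, the wall-clock time of that step is set by the slowest of the three calls, so the true total may be slightly higher than the average suggests.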