# Build agentic workflows with Amazon Bedrock and open source frameworks - Introduction

The goal of this workshop is to provide in-depth examples of key concepts and frameworks for Retrieval Augmented Generation (RAG) and agentic applications. We introduce an example use case to serve as a backdrop for curated and prescriptive guidance for RAG and agentic workflows, including libraries and blueprints for some of the top trends in the market today.

## Overview

Through web-scale training, foundation models (FMs) are built to support a wide variety of tasks across a large body of general knowledge. Without being exposed to additional information or further fine-tuning, they suffer from a knowledge cutoff preventing them from reliably completing tasks requiring specific data not available at training time. Furthermore, their inability to call external functions limits their capacity to resolve complex tasks beyond ones that can be solved with their own internal body of knowledge.

In this notebook, we introduce the requirements that lead us to build our **Virtual Travel Agent**. We end by running some **coarse-grained model evaluation** across a subset of the models available in Amazon Bedrock.

### Requirements

In [23]:
!pip3 install langchain-aws --quiet

## Functional requirements

The purpose of the solution is to improve the experience for customers searching for their dream travel destination. To do this, a customer needs the ability to do the following:
- Rapidly get a sense of a given destination with a representative description.
- Discover new destinations based on location, weather or other aspects that may be of interest.
- Book travel dates for a given destination ensuring it does not collide with their other travel.

Before diving deeper into the solution, we begin with some light testing of the various models available in the `us-west-2` region.

## Coarse-grained model evaluation

In this section, we experiment with multiple models available on Amazon Bedrock and run a coarse-grained evaluation on one of our tasks of interest. With thousands of models available on the market, it is intractable to evaluate every single one. Hence, it is generally necessary to pre-filter for the ones that are not only from trusted providers but have also shown strong performance on a variety of benchmarks.

Amazon Bedrock allows you to make a quick short-list by supporting a growing list of providers such as Anthropic, Meta, Mistral, Cohere, AI21 Labs, Stability AI and Amazon. This lets you start with a strong base to continue the model selection process.

![model selection](./assets/model-selection.png)

Since academic benchmarks are known to model providers and often used as marketing material, it is important not to rely too heavily on them, but rather to use them as a soft measure.

Next, we perform coarse-grained model evaluation on the following models to inform our initial choice of model for our task of interest:
- Anthropic: Claude 3.5 Sonnet, Claude 3 Sonnet, Claude 3 Haiku
- Meta: Llama 3.1 70B, Llama 3.1 8B
- Mistral: Mistral Large
- Cohere: Command R+

We start by creating boto3 clients for the Bedrock Runtime (used for model invocation) and the Bedrock service (used for model management).

In [2]:
import boto3
import os
from IPython.display import Markdown, display

region = 'us-east-1'  # alternatively 'us-west-2'
bedrock = boto3.client(
    service_name = 'bedrock-runtime',
    region_name = region,
)


bedrock_service = boto3.client(
    service_name='bedrock',
    region_name=region,
)
print(boto3.__version__)

1.35.29


#### Validate the connection

We can check that the client works by calling the `list_foundation_models()` method, which lists all the models available to us.

In [3]:
bedrock_service.list_foundation_models()

{'ResponseMetadata': {'RequestId': '4b40a95f-9aa1-4cc9-b5b9-8820d6372dcf',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'date': 'Mon, 28 Oct 2024 00:03:47 GMT',
   'content-type': 'application/json',
   'content-length': '31106',
   'connection': 'keep-alive',
   'x-amzn-requestid': '4b40a95f-9aa1-4cc9-b5b9-8820d6372dcf'},
  'RetryAttempts': 0},
 'modelSummaries': [{'modelArn': 'arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-tg1-large',
   'modelId': 'amazon.titan-tg1-large',
   'modelName': 'Titan Text Large',
   'providerName': 'Amazon',
   'inputModalities': ['TEXT'],
   'outputModalities': ['TEXT'],
   'responseStreamingSupported': True,
   'customizationsSupported': [],
   'inferenceTypesSupported': ['ON_DEMAND'],
   'modelLifecycle': {'status': 'ACTIVE'}},
  {'modelArn': 'arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-image-generator-v1:0',
   'modelId': 'amazon.titan-image-generator-v1:0',
   'modelName': 'Titan Image Generator G1',
   'providerName': 'Amazon',

---

## Common inference parameter definitions

### Randomness and Diversity

Foundation models support the following parameters to control randomness and diversity in the 
response.

**Temperature** – Large language models use probability to construct the words in a sequence. For any 
given next word, there is a probability distribution of options for the next word in the sequence. When 
you set the temperature closer to zero, the model tends to select the higher-probability words. When 
you set the temperature further away from zero, the model may select a lower-probability word.

In technical terms, the temperature modulates the probability distribution over the next tokens, 
implementing the temperature sampling technique. This parameter can steepen or flatten the 
distribution curve. A lower value results in a steeper curve with more deterministic responses, and a higher 
value results in a flatter curve with more random responses.
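
To make the effect concrete, here is a minimal sketch of temperature scaling applied to a softmax over raw scores. It is an illustration under simplifying assumptions, not how any particular Bedrock model is implemented; the function name and the three example scores are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw scores into probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; treat 0 as greedy selection instead")
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(score - peak) for score in scaled]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical scores for three candidate next words.
logits = [2.0, 1.0, 0.1]
cold = softmax_with_temperature(logits, temperature=0.5)  # steeper distribution
hot = softmax_with_temperature(logits, temperature=2.0)   # flatter distribution

print([round(p, 3) for p in cold])
print([round(p, 3) for p in hot])
```

The low-temperature distribution concentrates more mass on the top-scoring word, while the high-temperature one spreads mass toward the less likely words.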

**Top K** – Temperature defines the probability distribution of potential words, and Top K defines the 
cutoff beyond which the model no longer selects words. For example, if K=50, the model selects from the 50 
most probable words that could be next in a given sequence. This reduces the probability that an unusual 
word gets selected next in a sequence.

In technical terms, Top K is the number of highest-probability vocabulary tokens to keep for Top-K 
filtering. This limits the distribution of probable tokens, so the model chooses one of the 
highest-probability tokens.
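
The filtering step can be sketched in a few lines. This is an illustrative simplification that operates on a small dictionary of word probabilities; the function name and the probabilities are hypothetical.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize so they sum to 1."""
    if k <= 0:
        raise ValueError("k must be at least 1")
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(prob for _, prob in ranked)
    return {token: prob / total for token, prob in ranked}


# Hypothetical next-word probabilities.
probs = {"horses": 0.6, "zebras": 0.3, "unicorns": 0.1}
filtered = top_k_filter(probs, k=2)
print(filtered)
```

With `k=2`, the least likely word is removed entirely and the remaining probability mass is shared between the two survivors.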

**Top P** – Top P defines a cutoff based on the sum of probabilities of the potential choices. If you set Top 
P below 1.0, the model considers the most probable options and ignores less probable ones. Top P is 
similar to Top K, but instead of capping the number of choices, it caps choices based on the sum of their 
probabilities.

For the example prompt "I hear the hoof beats of", you may want the model to provide "horses," 
"zebras" or "unicorns" as the next word. If you set the temperature to its maximum, without capping 
Top K or Top P, you increase the probability of getting unusual results such as "unicorns." If you set the 
temperature to 0, you increase the probability of "horses." If you set a high temperature but cap Top K or 
Top P at a low value, you increase the probability of "horses" or "zebras," and decrease the probability 
of "unicorns."
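
As a rough sketch of the cumulative cutoff, the following keeps the smallest set of most probable words whose probabilities add up to at least `p`. The function name and the probabilities are hypothetical, and real implementations work on token-level distributions rather than a small dictionary.

```python
def top_p_filter(probs, p):
    """Keep the most probable tokens until their cumulative probability reaches p."""
    if not 0 < p <= 1:
        raise ValueError("p must be in the interval (0, 1]")
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


# Hypothetical next-word probabilities for "I hear the hoof beats of".
probs = {"horses": 0.6, "zebras": 0.3, "unicorns": 0.1}
print(top_p_filter(probs, p=0.8))
print(top_p_filter(probs, p=0.5))
```

Unlike Top K, the number of surviving words depends on how the probability mass is distributed: a confident distribution keeps few words, a flat one keeps many.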

### Length

The following parameters control the length of the generated response.

**Response length** – Configures the minimum and maximum number of tokens to use in the generated 
response.

**Length penalty** – Length penalty optimizes the model to be more concise in its output by penalizing 
longer responses. Length penalty differs from response length as the response length is a hard cut off for 
the minimum or maximum response length.

In technical terms, the length penalty penalizes the model exponentially for lengthy responses. 0.0 
means no penalty. Set a value less than 0.0 for the model to generate longer sequences, or set a value 
greater than 0.0 for the model to produce shorter sequences.

### Repetitions

The following parameters help control repetition in the generated response.

**Repetition penalty (presence penalty)** – Prevents repetitions of the same words (tokens) in responses. 
1.0 means no penalty. Greater than 1.0 decreases repetition.

We use the `ChatBedrock` class from `langchain-aws` to interact with the Bedrock service.

In [4]:
from langchain_aws.chat_models.bedrock import ChatBedrock

modelId = 'anthropic.claude-3-haiku-20240307-v1:0'
llm = ChatBedrock(
    model_id=modelId,
    client=bedrock,
    beta_use_converse_api=True
)
llm.invoke("Help me with my travel needs.").content

"Sure, I'd be happy to assist you with your travel needs. What kind of help are you looking for? Some common areas I can provide assistance with include:\n\n- Trip planning and itinerary creation\n- Booking flights, hotels, rental cars, or other transportation\n- Researching destinations and things to do\n- Providing travel tips and recommendations\n- Helping navigate visa/passport requirements\n- Offering suggestions for budgeting and saving on travel costs\n\nPlease let me know more details about your upcoming trip, such as your destination(s), travel dates, budget, interests, and any other specific needs you have. The more information you can provide, the better I can try to help meet your travel requirements."

### Converse API

In this section, we explore the basics of the Converse API in Amazon Bedrock. `Converse` and `ConverseStream` are unified, structured text API actions that simplify invocations of Bedrock LLMs by using a universal syntax and message-structured prompts for any of the supported model providers.

To use the Converse API, you call the `Converse` or `ConverseStream` operations to send messages to a model. To call Converse, you require permission for the `bedrock:InvokeModel` operation. To call ConverseStream, you require permission for the `bedrock:InvokeModelWithResponseStream` operation.
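
The request and response shapes can be sketched without calling AWS. The helper names `build_converse_request` and `extract_text` below are hypothetical; the dictionary keys follow the Converse request and response structure used elsewhere in this notebook, plus the optional `system` field for a system prompt.

```python
def build_converse_request(model_id, prompt, system_prompt=None,
                           max_tokens=512, temperature=0.0, top_p=0.9):
    """Assemble keyword arguments for the bedrock-runtime `converse` call."""
    request = {
        "modelId": model_id,
        "messages": [{"role": "user", "content": [{"text": prompt}]}],
        "inferenceConfig": {
            "maxTokens": max_tokens,
            "temperature": temperature,
            "topP": top_p,
        },
    }
    # The system prompt is optional, so it is only added when provided.
    if system_prompt:
        request["system"] = [{"text": system_prompt}]
    return request


def extract_text(response):
    """Join the text blocks of the assistant message in a Converse response."""
    blocks = response["output"]["message"]["content"]
    return "".join(block.get("text", "") for block in blocks)


request = build_converse_request(
    "anthropic.claude-3-haiku-20240307-v1:0",
    "Help me with my travel needs.",
    system_prompt="You are a helpful travel assistant.",
)
print(request["messages"])
```

With a configured client, the request could be sent as `bedrock.converse(**request)` and the reply read with `extract_text(response)`.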

<h3> Prerequisites </h3>

Before you can use Amazon Bedrock, you must carry out the following steps:

- Sign up for an AWS account (if you don't already have one) and create an IAM role with the necessary permissions for Amazon Bedrock; see [AWS Account and IAM Role](https://docs.aws.amazon.com/bedrock/latest/userguide/getting-started.html#new-to-aws).
- Request access to the foundation models (FMs) that you want to use; see [Request access to FMs](https://docs.aws.amazon.com/bedrock/latest/userguide/getting-started.html#getting-started-model-access).
    
    The examples in this notebook use the following foundation models in the `us-west-2` (Oregon) region.
    
| Provider Name | Foundation Model Name | Model Id |
| ------- | ------------- | ------------- |
| Amazon | Titan Text G1 - Lite | amazon.titan-text-lite-v1 |
| Anthropic | Claude 3.5 Sonnet  | anthropic.claude-3-5-sonnet-20240620-v1:0 |
| Anthropic | Claude 3 Haiku  | anthropic.claude-3-haiku-20240307-v1:0 |
| Meta | Llama 3.1 8B Instruct | meta.llama3-1-8b-instruct-v1:0 |


<h3> ConverseStream for streaming invocations </h3>

We can also use the Converse API for streaming invocations. In this case we rely on the ConverseStream action.

In [20]:
import sys
def invoke_bedrock_model_stream(client, model_id, prompt, max_tokens=2000, temperature=0, top_p=0.9):
    # Send a single-turn user message and stream the reply with ConverseStream.
    response = client.converse_stream(
        modelId=model_id,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "text": prompt
                    }
                ]
            }
        ],
        inferenceConfig={
            "temperature": temperature,
            "maxTokens": max_tokens,
            "topP": top_p
        }
    )
    # Print the response text as it arrives.
    for event in response['stream']:
        if 'contentBlockDelta' in event:
            delta = event['contentBlockDelta']['delta']
            # A delta may carry a non-text payload, so fall back to an empty string.
            sys.stdout.write(delta.get('text', ''))
            sys.stdout.flush()
    print("", flush=True)
    
def invoke_bedrock_model_converse(client, model_id, prompt, max_tokens=2000, temperature=0, top_p=0.9):
    # Send a single-turn user message and wait for the complete reply with Converse.
    response = client.converse(
        modelId=model_id,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "text": prompt
                    }
                ]
            }
        ],
        inferenceConfig={
            "temperature": temperature,
            "maxTokens": max_tokens,
            "topP": top_p
        }
    )
    # Print the generated text, followed by token usage and the stop reason.
    print(response['output']['message']['content'][0]['text'])
    token_usage = response['usage']
    print(f"Input tokens: {token_usage['inputTokens']}")
    print(f"Output tokens: {token_usage['outputTokens']}")
    print(f"Total tokens: {token_usage['totalTokens']}")
    print(f"Stop reason: {response['stopReason']}")
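
Token usage is also available when streaming: the stream ends with a `messageStop` event carrying the stop reason and a `metadata` event carrying usage. The sketch below collects these from a list of events; the helper name `collect_stream` is hypothetical, and the sample events are hand-written to mimic the ConverseStream event shapes so the snippet runs offline.

```python
def collect_stream(events):
    """Accumulate text deltas and capture the stop reason and token usage."""
    text_parts, stop_reason, usage = [], None, {}
    for event in events:
        if "contentBlockDelta" in event:
            text_parts.append(event["contentBlockDelta"]["delta"].get("text", ""))
        elif "messageStop" in event:
            stop_reason = event["messageStop"].get("stopReason")
        elif "metadata" in event:
            usage = event["metadata"].get("usage", {})
    return "".join(text_parts), stop_reason, usage


# Hand-written events for illustration only.
sample_events = [
    {"messageStart": {"role": "assistant"}},
    {"contentBlockDelta": {"delta": {"text": "Hello"}, "contentBlockIndex": 0}},
    {"contentBlockDelta": {"delta": {"text": ", traveler"}, "contentBlockIndex": 0}},
    {"contentBlockStop": {"contentBlockIndex": 0}},
    {"messageStop": {"stopReason": "end_turn"}},
    {"metadata": {"usage": {"inputTokens": 5, "outputTokens": 3, "totalTokens": 8}}},
]

text, stop_reason, usage = collect_stream(sample_events)
print(text, stop_reason, usage)
```

With a live client, the same function could be applied to `response['stream']` returned by `converse_stream`.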

In [22]:
prompt = "Help me with my travel needs."
print(f'Prompt: {prompt}\n')
MODEL_IDS = ['anthropic.claude-3-haiku-20240307-v1:0']

for model_id in MODEL_IDS:
    print(f'\n\nModel: {model_id}')
    invoke_bedrock_model_stream(bedrock, model_id, prompt)
    # Repeat the call without streaming to demonstrate token usage reporting.
    invoke_bedrock_model_converse(bedrock, model_id, prompt)

Prompt: Help me with my travel needs.



Model: anthropic.claude-3-haiku-20240307-v1:0
Sure, I'd be happy to help with your travel needs. What kind of information or assistance are you looking for? Some common travel-related topics I can help with include:

- Trip planning and itinerary suggestions
- Booking flights, hotels, rental cars, etc.
- Destination research and recommendations
- Travel tips and advice
- Budgeting and cost estimates
- Travel documentation requirements
- Packing lists and travel checklists
- Transportation options and logistics
- Travel insurance and health/safety information

Let me know the specifics of what you need help with, and I'll do my best to provide useful information and guidance. I'm here to support your travel planning and make sure your trip goes smoothly.
Sure, I'd be happy to help with your travel needs. What kind of information or assistance are you looking for? Some common travel-related topics I can help with include:

- Trip planning and itin

## Introduction to the use case

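The use case we model is the **Virtual Travel Agent** introduced above, starting with its first requirement: giving travelers a representative description of a destination.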

To perform an initial evaluation, we create a small curated dataset of 10 examples. The optimal initial number of examples should be sufficiently big to roughly cover the types of queries our customers will send our model. Since this stage of the model evaluation process is meant to get a rough idea, the number of examples can be small. To come up with our examples, we use [HELM's](https://crfm.stanford.edu/helm/lite/latest/) definition of a scenario, which is broken down by the following diagram:

![helm scenario](./assets/helm-scenario.png)

To start, our scenario can be described by summarization (**task**) of vacation destinations (**what**) asked by travelers (**who**) at the time of development (**when**) in English (**language**). The set of initial questions can be found in [examples.txt](./data/examples.txt). We could expand our test by changing one or more of the variables composing the scenario of interest. For instance, we could generate equivalent examples asked by people who aren't travelers or by speakers of other languages.
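
One lightweight way to keep track of such variations is to record each scenario as a small immutable record and derive variants by changing one field at a time. This is a sketch for bookkeeping only; the `Scenario` class and the variant values (such as French) are hypothetical and not part of HELM or this workshop's code.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Scenario:
    task: str
    what: str
    who: str
    when: str
    language: str


base = Scenario(
    task="summarization",
    what="vacation destinations",
    who="travelers",
    when="time of development",
    language="English",
)

# Each variant changes exactly one variable of the base scenario.
variants = [
    replace(base, who="non-travelers"),
    replace(base, language="French"),
]

for scenario in variants:
    print(scenario)
```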


In [69]:
with open("./data/examples.txt", "r") as file:
    examples = file.read().splitlines()

With our limited set of examples loaded, we define `generate_answers`, which outputs a dataframe where each column is populated by a given model's answers. This allows us to quickly capture model answers across a set of `examples`.

In [70]:
import pandas as pd
from langchain_core.messages import HumanMessage, SystemMessage

pd.set_option('display.max_colwidth', None)


def generate_answers(
    examples: list = [],
    system_prompt: SystemMessage = None
):
    modelIds = [
        "anthropic.claude-3-5-sonnet-20240620-v1:0",
        "anthropic.claude-3-sonnet-20240229-v1:0",
        "anthropic.claude-3-haiku-20240307-v1:0",
        "cohere.command-r-plus-v1:0",
        "meta.llama3-1-70b-instruct-v1:0",
        "meta.llama3-1-8b-instruct-v1:0",
        "mistral.mistral-large-2407-v1:0"
    ]
    output = pd.DataFrame({
        'example': [],
        'Claude35Sonnet': [],
        'Claude3Sonnet': [],
        'Claude3Haiku': [],
        'CommandRplus': [],
        'Llama70b': [],  # column order must match the order of modelIds above
        'Llama8b': [],
        'MistralLarge': [],
    })
    for example in examples:
        results = [example]
        for modelId in modelIds:
            # Include the system message only when one is provided.
            messages = [system_prompt] if system_prompt else []
            messages.append(HumanMessage(content=example))
            llm = ChatBedrock(
                model_id=modelId,
                client=bedrock,
                beta_use_converse_api=True
            )
            resp = llm.invoke(messages).content
            results.append(resp)
        output.loc[len(output)] = results
    return output

We generate model outputs without a system prompt for a single example. This example is pulled from the top of the examples list and contains just the words *New York*.

In [59]:
one_example = examples[:1]
output = generate_answers(one_example)

We show the answers generated by the various models for this example. We quickly notice that Llama 3.1 70B has produced the longest output. As expected, we also see some consistency in the outputs within a given model family.

When diving deeper into the answers, it is clear the models have broad knowledge of the subject and are able to give us some facts about it. However, we did not provide any information about the model's role, which results in fairly long and generic answers. Hence, in the next step we tailor model output by supplying a consistent system prompt reused across all examples.

To get a better sense of model functionality without additional context, it may be helpful to rerun the previous cells on other examples or create your own.

In [None]:
output.head()

We define a `SystemMessage` that is passed as the system prompt to all models for every example. Its purpose is to give the model more context about what is expected of it.

In [71]:
one_example = examples[:1]
output = generate_answers(
    one_example,
    SystemMessage(content="You are a text summarizer for travelers who are on the go. Generate your summary in a single sentence."))

Looking through the model responses, the difference in response length is immediately obvious and is a direct result of the system prompt.

In [None]:
output.head()

Next, we modify the original `generate_answers` function to accommodate few-shot examples. The purpose of few-shot learning is to enable machine learning models to learn from a small number of examples or training data points, rather than requiring a large labeled dataset. This is particularly useful in scenarios where obtaining a large amount of labeled data is difficult, expensive, or time-consuming. There are several advantages of few-shot learning:

- **Data efficiency**: Few-shot learning allows models to learn from limited data, which is beneficial when obtaining large labeled datasets is challenging or costly.
- **Adaptability**: Few-shot learning enables models to quickly adapt to new tasks or domains without the need for extensive retraining from scratch, making the models more flexible and versatile.
- **Transfer learning**: Few-shot learning relies on transfer learning principles, where knowledge gained from one task or domain is transferred and applied to a different but related task or domain.
- **Human-like learning**: Few-shot learning aims to mimic the way humans can learn new concepts from just a few examples, leveraging prior knowledge and experience.

As we start adding more repeated elements to our prompt, we also introduce `ChatPromptTemplate`, a core component of LangChain that lets us define a template receiving runtime inputs. We pipe the resulting prompt to the model for inference. `FewShotChatMessagePromptTemplate` extends this object to provide a prompt template that supports few-shot examples.

Although we supply a static set of examples, the library does support dynamic few-shots where examples are chosen based on semantic similarity to the query.
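
To illustrate the idea of dynamic few-shots, the following sketch ranks candidate examples by similarity to the query and keeps the top `k`. It deliberately swaps in a simple bag-of-words cosine similarity in place of embeddings so that it runs without any model; the helper names are hypothetical, and the example outputs are omitted placeholders. LangChain's own example selectors (for instance `SemanticSimilarityExampleSelector`) rely on an embedding model and a vector store instead.

```python
import math
import re
from collections import Counter


def _bag_of_words(text):
    """Lowercase the text and count its words."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def _cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def select_few_shots(query, few_shots, k=2):
    """Return the k examples whose inputs are most similar to the query."""
    query_vec = _bag_of_words(query)
    ranked = sorted(
        few_shots,
        key=lambda shot: _cosine(query_vec, _bag_of_words(shot["input"])),
        reverse=True,
    )
    return ranked[:k]


candidates = [
    {"input": "Describe the culinary scene in Tokyo.", "output": "(omitted)"},
    {"input": "List the top attractions in Barcelona.", "output": "(omitted)"},
    {"input": "Compare the climates of Seattle and Miami", "output": "(omitted)"},
]
picked = select_few_shots("Compare the climates of Lisbon and Oslo", candidates, k=1)
print(picked[0]["input"])
```

Word overlap is a crude proxy for meaning, so this sketch only conveys the selection mechanics; embeddings capture similarity between differently worded requests far better.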

In [125]:
from langchain_core.prompts import (
    ChatPromptTemplate,
    FewShotChatMessagePromptTemplate,
)


def generate_answers(
    examples: list = [],
    system_prompt: str = None,
    few_shots: list = []
):
    modelIds = [
        "anthropic.claude-3-5-sonnet-20240620-v1:0",
        "anthropic.claude-3-sonnet-20240229-v1:0",
        "anthropic.claude-3-haiku-20240307-v1:0",
        "cohere.command-r-plus-v1:0",
        "meta.llama3-1-70b-instruct-v1:0",
        "meta.llama3-1-8b-instruct-v1:0",
        "mistral.mistral-large-2407-v1:0"
    ]
    output = pd.DataFrame({
        'example': [],
        'Claude35Sonnet': [],
        'Claude3Sonnet': [],
        'Claude3Haiku': [],
        'CommandRplus': [],
        'Llama70b': [],  # column order must match the order of modelIds above
        'Llama8b': [],
        'MistralLarge': [],
    })
    # The prompt templates do not depend on the model, so we build them once.
    example_prompt = ChatPromptTemplate.from_messages(
        [
            ("human", "{input}"),
            ("ai", "{output}"),
        ]
    )
    few_shot_prompt = FewShotChatMessagePromptTemplate(
        example_prompt=example_prompt,
        examples=few_shots,
    )
    prompt_messages = [few_shot_prompt, ("human", "{input}")]
    # Prepend the system prompt only when one is provided.
    if system_prompt:
        prompt_messages.insert(0, ("system", system_prompt))
    final_prompt = ChatPromptTemplate.from_messages(prompt_messages)

    for example in examples:
        results = [example]
        for modelId in modelIds:
            llm = ChatBedrock(
                model_id=modelId,
                client=bedrock,
                beta_use_converse_api=True
            )
            chain = final_prompt | llm

            # The template fills {input} with the example, so we pass a dict
            # rather than a hand-built list of messages.
            resp = chain.invoke({"input": example}).content
            results.append(resp)
        output.loc[len(output)] = results
    return output

We create a few examples requesting descriptions, comparisons and lists. In all cases, the examples include a description followed by some type of recommendation. For requests to describe or summarize, we prefix the response with *Nice!*

In [129]:
few_shots = [
    {"input": "Describe the culinary scene in Tokyo.", "output": "Nice! Tokyo's culinary scene is diverse and vibrant, offering everything from traditional Japanese cuisine to international flavors, street food, Michelin-starred restaurants, and unique dining experiences abound, so I highly recommend trying some of the city's famous ramen shops for a quintessential Tokyo dining experience with rich, flavorful broths and perfectly cooked noodles."},
    {"input": "List the top attractions in Barcelona.", "output": "Barcelona's top attractions include Sagrada Familia, Park Güell, Casa Batlló, La Rambla, Gothic Quarter, and Camp Nou, offering a mix of stunning architecture, history, and vibrant culture, so I highly recommend exploring the winding streets of the Gothic Quarter for an immersive experience in the city's rich heritage."},
    {"input": "Compare the climates of Seattle and Miami", "output": "Seattle has a temperate climate with mild, wet winters and cool, dry summers, while Miami has a tropical climate with hot, humid summers and warm, dry winters, with Seattle receiving more rainfall and Miami experiencing more sunshine and occasional hurricanes, so I highly recommend visiting Seattle during the summer months to enjoy the city's beautiful parks and outdoor activities without the constant rain."},
    {"input": "Outline the best ways to experience Paris on a budget.", "output": "budget-friendly ways to experience Paris include walking tours, visiting free museums on certain days, picnicking in parks, using public transportation, staying in budget accommodations, and enjoying street food, with many iconic sights like Notre-Dame Cathedral and the Eiffel Tower able to be appreciated from the outside for free, so I highly recommend taking advantage of the city's numerous free or low-cost attractions and activities to fully immerse yourself in the Parisian atmosphere without breaking the bank."},
    {"input": "Summarize the cultural etiquette for visitors to Japan.", "output": "Nice! cultural etiquette in Japan includes bowing as a greeting, removing shoes before entering homes or certain establishments, using chopsticks correctly, being punctual, avoiding public displays of affection, respecting quiet in public spaces, gift-giving is common, and it's important to be mindful of table manners and bathing customs in public baths, so I highly recommend familiarizing yourself with these customs and traditions to ensure a respectful and enriching cultural experience during your visit to Japan."},
]

We prompt the model with an example asking for a summary about Bali.

In [139]:
one_example = examples[1]
output = generate_answers(
    [one_example],
    "You are a text summarizer for travelers who are on the go.",
    few_shots
)

For our specific few-shots and prompt, we notice **Claude 3.5 Sonnet**, **Claude 3 Sonnet**, **Llama 3.1 8B** and **Llama 3.1 70B** all responded with the correct prefix and a recommendation. 

We suggest tailoring the few-shots and system prompt to further understand model behaviour.
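
When comparing many answers, the two properties checked above (the *Nice!* prefix and the presence of a recommendation) can be tested programmatically. The following is a rough sketch with a hypothetical helper; matching on the word "recommend" is a simplifying assumption based on the wording of our few-shots, and the sample answers are made up.

```python
def follows_few_shot_style(answer, prefix="Nice!", marker="recommend"):
    """Report whether an answer starts with the prefix and mentions a recommendation."""
    text = answer.strip()
    return {
        "has_prefix": text.startswith(prefix),
        "has_recommendation": marker in text.lower(),
    }


# Made-up answers for illustration.
good = "Nice! Bali offers beaches and temples, so I highly recommend visiting Ubud."
plain = "Bali is an Indonesian island known for beaches and temples."

print(follows_few_shot_style(good))
print(follows_few_shot_style(plain))
```

Applied to each model column of `output`, for example with `output['Claude3Haiku'].map(follows_few_shot_style)`, this gives a quick view of which models picked up the pattern.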

In [None]:
output.head()

Next, we generate answers for our full set of examples, reusing the same system prompt and few-shots.

In [141]:
output = generate_answers(
    examples,
    "You are a text summarizer for travelers who are on the go.",
    few_shots
)

Although the models are able to adequately answer the most general questions, queries about current events or requiring data not available at training time remain unanswered.

In [None]:
output.head()

## Conclusion

In this notebook, we demonstrated simple interactions between LangChain and Bedrock. We tailored model outputs by supplying a system prompt and few-shots, which both help guide behavior.