# Retrieval Augmented Generation with Amazon Bedrock - Workshop Setup

> *PLEASE NOTE: This notebook should work well with the **`Data Science 3.0`** kernel in SageMaker Studio*

---

In this notebook, we will set up the [`boto3` Python SDK](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) to work with [Amazon Bedrock](https://aws.amazon.com/bedrock/) foundation models and install the extra dependencies needed for this workshop. Specifically, we will use the following libraries throughout the workshop:

* [LangChain](https://python.langchain.com/docs/get_started/introduction) for large language model (LLM) utilities
* [FAISS](https://github.com/facebookresearch/faiss) for vector similarity searching
* [Streamlit](https://streamlit.io/) for user interface (UI) building

---
## Install External Dependencies

The code below will install the rest of the Python packages required for the workshop.

In [2]:
%pip install --upgrade pip
%pip install --quiet -r ../requirements.txt

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


In [3]:
# !pip install boto3 --upgrade
# !pip install awscli --upgrade

---
## Create the `boto3` client connection to Amazon Bedrock

Interaction with the Bedrock API is done via the AWS SDK for Python: [boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html).

As you are running this notebook from [Amazon SageMaker Studio](https://aws.amazon.com/sagemaker/studio/) and your SageMaker Studio [execution role](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html) has permissions to access Bedrock, you can run the cells below as-is to create a connection to Amazon Bedrock. The same applies if you are running these notebooks from a computer whose default AWS credentials have access to Bedrock.

In [8]:
import boto3
import os
from IPython.display import Markdown, display

region = os.environ.get("AWS_REGION")
bedrock_service = boto3.client(
    service_name='bedrock',
    region_name=region,
)
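
`os.environ.get("AWS_REGION")` returns `None` when the variable is unset, in which case `boto3` falls back to its own region configuration. The sketch below shows one way to make the choice explicit. The helper name `resolve_region` and the `us-east-1` default are assumptions for illustration, not part of the workshop code.

```python
import os


def resolve_region(env=None, default="us-east-1"):
    """Return the first non-empty region found in the environment, else `default`."""
    env = os.environ if env is None else env
    for name in ("AWS_REGION", "AWS_DEFAULT_REGION"):
        value = env.get(name)
        if value:
            return value
    # Assumed fallback; pick a region where Bedrock is enabled for your account.
    return default


region = resolve_region()
print(f"Using region: {region}")
```

The resulting `region` can be passed as `region_name` to `boto3.client(...)` exactly as in the cell above.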

#### Validate the connection

We can check that the client works by calling the `list_foundation_models()` method, which lists all the models available to us.

In [2]:
bedrock_service.list_foundation_models()

{'ResponseMetadata': {'RequestId': '56cf2ea2-3c11-4c10-9456-7c726fe44250',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'date': 'Sat, 01 Jun 2024 22:02:17 GMT',
   'content-type': 'application/json',
   'content-length': '23824',
   'connection': 'keep-alive',
   'x-amzn-requestid': '56cf2ea2-3c11-4c10-9456-7c726fe44250'},
  'RetryAttempts': 0},
 'modelSummaries': [{'modelArn': 'arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-tg1-large',
   'modelId': 'amazon.titan-tg1-large',
   'modelName': 'Titan Text Large',
   'providerName': 'Amazon',
   'inputModalities': ['TEXT'],
   'outputModalities': ['TEXT'],
   'responseStreamingSupported': True,
   'customizationsSupported': [],
   'inferenceTypesSupported': ['ON_DEMAND'],
   'modelLifecycle': {'status': 'ACTIVE'}},
  {'modelArn': 'arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-image-generator-v1:0',
   'modelId': 'amazon.titan-image-generator-v1:0',
   'modelName': 'Titan Image Generator G1',
   'providerName': 'Amazon',
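
The full response is long, so it can help to reduce it to the model IDs you care about. The sketch below filters `modelSummaries` by output modality and, optionally, by provider. The helper name `text_model_ids` and the trimmed `sample_response` are hypothetical; only the field names come from the output above.

```python
def text_model_ids(response, provider=None):
    """Return sorted IDs of models that can produce TEXT output."""
    ids = []
    for summary in response.get("modelSummaries", []):
        if "TEXT" not in summary.get("outputModalities", []):
            continue
        if provider and summary.get("providerName") != provider:
            continue
        ids.append(summary["modelId"])
    return sorted(ids)


# Hypothetical, trimmed-down response in the same shape as the real one.
sample_response = {
    "modelSummaries": [
        {"modelId": "amazon.titan-tg1-large", "providerName": "Amazon",
         "outputModalities": ["TEXT"]},
        {"modelId": "amazon.titan-image-generator-v1:0", "providerName": "Amazon",
         "outputModalities": ["IMAGE"]},
    ]
}
print(text_model_ids(sample_response))
```

With a live client, the same helper can be called as `text_model_ids(bedrock_service.list_foundation_models())`.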

---

## `InvokeModel` body and output

The `invoke_model()` method of the Amazon Bedrock client (`InvokeModel` API) will be the primary method we use for most of our Text Generation and Processing tasks - whichever model we're using.

Although the method is shared, the format of input and output varies depending on the foundation model used - as described below:

### Anthropic Claude

#### Input

```json
{
    "prompt": "\n\nHuman:<prompt>\n\nAssistant:",
    "max_tokens_to_sample": 300,
    "temperature": 0.5,
    "top_k": 250,
    "top_p": 1,
    "stop_sequences": ["\n\nHuman:"]
}
```

#### Output

```json
{
    "completion": "<output>",
    "stop_reason": "stop_sequence"
}
```
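
As a minimal sketch, the request and response shapes above can be wrapped in two small helpers. The function names are hypothetical and the default values simply mirror the example; this covers only the prompt/completion format shown in this section.

```python
import json


def build_claude_completion_body(user_prompt, max_tokens=300, temperature=0.5):
    """Serialize a prompt into the Human/Assistant request format shown above."""
    return json.dumps({
        "prompt": f"\n\nHuman: {user_prompt}\n\nAssistant:",
        "max_tokens_to_sample": max_tokens,
        "temperature": temperature,
        "top_k": 250,
        "top_p": 1,
        "stop_sequences": ["\n\nHuman:"],
    })


def parse_claude_completion(raw_body):
    """Return (completion, stop_reason) from a response body in the format above."""
    payload = json.loads(raw_body)
    return payload["completion"], payload["stop_reason"]


body = build_claude_completion_body("Say hello.")
print(body)
```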

---

## Common inference parameter definitions

### Randomness and Diversity

Foundation models support the following parameters to control randomness and diversity in the 
response.

**Temperature** – Large language models use probability to construct the words in a sequence. For any 
given next word, there is a probability distribution of options for the next word in the sequence. When 
you set the temperature closer to zero, the model tends to select the higher-probability words. When 
you set the temperature further away from zero, the model may select a lower-probability word.

In technical terms, the temperature modulates the probability density function for the next tokens, 
implementing the temperature sampling technique. This parameter can deepen or flatten the density 
function curve. A lower value results in a steeper curve with more deterministic responses, and a higher 
value results in a flatter curve with more random responses.
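
To make the effect concrete, the sketch below applies temperature scaling to a softmax over three made-up token scores. The logits are hypothetical and the code illustrates the general technique only, not how any particular Bedrock model implements it.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top-scoring token, while higher temperatures spread it more evenly across the candidates.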

**Top K** – Temperature defines the probability distribution of potential words, and Top K defines the cut 
off where the model no longer selects the words. For example, if K=50, the model selects from 50 of the 
most probable words that could be next in a given sequence. This reduces the probability that an unusual 
word gets selected next in a sequence.

In technical terms, Top K is the number of highest-probability vocabulary tokens to keep for 
Top-K filtering. This limits the distribution of probable tokens, so the model chooses one of the 
highest-probability tokens.
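
A minimal sketch of Top-K filtering, using made-up probabilities for a three-word vocabulary. The numbers and the helper name `top_k_filter` are assumptions for illustration.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize their probabilities."""
    if k < 1:
        raise ValueError("k must be at least 1")
    kept = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


# Hypothetical next-word distribution.
next_word = {"horses": 0.7, "zebras": 0.2, "unicorns": 0.1}
print(top_k_filter(next_word, 2))
```

With `k=2` the least likely word is dropped entirely, and the remaining probabilities are rescaled to sum to 1.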

**Top P** – Top P defines a cut off based on the sum of probabilities of the potential choices. If you set Top 
P below 1.0, the model considers the most probable options and ignores less probable ones. Top P is 
similar to Top K, but instead of capping the number of choices, it caps choices based on the sum of their 
probabilities.

For the example prompt "I hear the hoof beats of," you may want the model to provide "horses," 
"zebras" or "unicorns" as the next word. If you set the temperature to its maximum, without capping 
Top K or Top P, you increase the probability of getting unusual results such as "unicorns." If you set the 
temperature to 0, you increase the probability of "horses." If you set a high temperature but cap Top K or 
Top P at a low value, you increase the probability of "horses" or "zebras," and decrease the probability 
of "unicorns."
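
The hoof-beats example can be sketched with Top-P (nucleus) filtering. The probabilities below are made up for illustration and the helper name `top_p_filter` is hypothetical.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


# Hypothetical next-word distribution for "I hear the hoof beats of".
next_word = {"horses": 0.7, "zebras": 0.2, "unicorns": 0.1}
print(top_p_filter(next_word, 0.8))
```

Unlike Top K, the number of surviving tokens depends on how the probability mass is distributed: a confident distribution keeps few tokens, a flat one keeps many.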

### Length

The following parameters control the length of the generated response.

**Response length** – Configures the minimum and maximum number of tokens to use in the generated 
response.

**Length penalty** – Length penalty optimizes the model to be more concise in its output by penalizing 
longer responses. Length penalty differs from response length as the response length is a hard cut off for 
the minimum or maximum response length.

In technical terms, the length penalty penalizes the model exponentially for lengthy responses. 0.0 
means no penalty. Set a value less than 0.0 for the model to generate longer sequences, or set a value 
greater than 0.0 for the model to produce shorter sequences.

### Repetitions

The following parameters help control repetition in the generated response.

**Repetition penalty (presence penalty)** – Prevents repetitions of the same words (tokens) in responses. 
1.0 means no penalty. Greater than 1.0 decreases repetition.
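
One common formulation of a repetition penalty divides the positive scores of already-generated tokens by the penalty and multiplies the negative ones by it. The sketch below assumes that formulation; individual Bedrock models may implement the parameter differently, and the scores are made up.

```python
def apply_repetition_penalty(logits, generated_tokens, penalty):
    """Lower the scores of tokens that have already been generated."""
    adjusted = dict(logits)
    for token in set(generated_tokens):
        if token not in adjusted:
            continue
        score = adjusted[token]
        # Dividing a positive score, or multiplying a negative one, both reduce it
        # when penalty > 1.0; penalty == 1.0 leaves every score unchanged.
        adjusted[token] = score / penalty if score > 0 else score * penalty
    return adjusted


logits = {"the": 2.0, "cat": 1.0, "sat": -1.0}  # hypothetical token scores
print(apply_repetition_penalty(logits, ["the", "sat"], penalty=1.5))
```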

---

## Try out the text generation model

With some theory out of the way, let's see the models in action! Run the cells below to see how to generate text with the Anthropic Claude 3 Sonnet model.

### Client side `boto3` bedrock-runtime connection

In [9]:
bedrock_runtime = boto3.client(
    service_name='bedrock-runtime',
    region_name=region,
)

In [10]:
claude3 = 'claude3'
llama2 = 'llama2'
llama3='llama3'
mistral='mistral'
titan='titan'
models_dict = {
    claude3 : 'anthropic.claude-3-sonnet-20240229-v1:0',
    llama2: 'meta.llama2-13b-chat-v1',
    llama3: 'meta.llama3-8b-instruct-v1:0',
    mistral: 'mistral.mistral-7b-instruct-v0:2',
    titan : 'amazon.titan-text-premier-v1:0'
}
max_tokens_val = 100
temperature_val = 0.1
dict_add_params = {
    llama3: {"max_gen_len":max_tokens_val, "temperature":temperature_val} , 
    claude3: {"top_k": 200,  "temperature": temperature_val, "max_tokens": max_tokens_val},
    mistral: {"max_tokens":max_tokens_val, "temperature": temperature_val} , 
    titan:  {"topK": 200,  "maxTokenCount": max_tokens_val}
}

### Anthropic Claude 3 Sonnet

In [11]:
import json

PROMPT_DATA = '''Human: Write me a blog about making strong business decisions as a leader.

Assistant:
'''

In [20]:
messages_API_body = {
    "anthropic_version": "bedrock-2023-05-31", 
    "max_tokens": 100, #int(500/0.75),
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": PROMPT_DATA
                }
            ]
        }
    ]
}

### Ask Claude to generate this article

In [23]:
import json
from IPython.display import Markdown, display

body = json.dumps(messages_API_body)
accept = "application/json"
contentType = "application/json"

# Claude 3 Sonnet, looked up from the table defined earlier.
modelId = models_dict.get(claude3)

# Invoke the model with the request.
response = bedrock_runtime.invoke_model(
    modelId=modelId, body=body, accept=accept, contentType=contentType
)

# Decode the response body (Claude 3 Messages API format).
model_response = json.loads(response["body"].read())

# Extract and display the response text.
response_text = model_response["content"][0]["text"]
display(Markdown(response_text))



Here is a draft blog post on making strong business decisions as a leader:

Title: Becoming a Decisive Leader: How to Make Tough Business Calls

As a leader, one of the most important skills you can develop is the ability to make difficult decisions. The path is rarely clear-cut when you're steering a company. You'll constantly face forks in the road that require decisive action amidst uncertainty and pressure.  

The best leaders don't agonize en
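
The output above stops mid-sentence because the request caps generation at 100 tokens. A minimal sketch of reading the decoded response more defensively is shown below: it joins all text blocks and checks `stop_reason` to detect truncation. The helper name and the sample payload are hypothetical.

```python
def extract_claude_text(model_response):
    """Join the text blocks of a decoded response and flag max-token truncation."""
    parts = [
        block["text"]
        for block in model_response.get("content", [])
        if block.get("type") == "text"
    ]
    truncated = model_response.get("stop_reason") == "max_tokens"
    return "".join(parts), truncated


# Hypothetical decoded response in the same shape as `model_response` above.
sample = {
    "content": [{"type": "text", "text": "Here is a draft blog post"}],
    "stop_reason": "max_tokens",
}
text, truncated = extract_claude_text(sample)
print(text, "(truncated)" if truncated else "")
```

If `truncated` is true, raising `max_tokens` in the request body is the usual remedy.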

## Generate streaming output

For large language models, it can take noticeable time to generate long output sequences. Rather than waiting for the entire response to be available, latency-sensitive applications may like to **stream** the response to users.

Run the code below to see how you can achieve this with Bedrock's `invoke_model_with_response_stream()` method - returning the response body in separate chunks.

### Each model has its own input and output properties

**Claude models**

```
for event in streaming_response["body"]:
    chunk = json.loads(event["chunk"]["bytes"])
    if chunk["type"] == "content_block_delta":
        print(chunk["delta"].get("text", ""), end="")
        #display(Markdown(chunk["delta"].get("text", "")))
```

**Llama3**
```
for event in streaming_response["body"]:
    chunk = json.loads(event["chunk"]["bytes"])
    if "generation" in chunk:
        #print(chunk["generation"], end="")
        display(Markdown(chunk["generation"]))
```
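
The two loops above can be folded into one generator that handles both chunk shapes. This is a sketch under the assumption that only these two formats are needed; the helper names and the fake events are hypothetical and stand in for `streaming_response["body"]`.

```python
import json


def iter_stream_text(events):
    """Yield text fragments from Claude-style or Llama-style stream events."""
    for event in events:
        chunk = json.loads(event["chunk"]["bytes"])
        if chunk.get("type") == "content_block_delta":
            # Claude models: text lives under delta.text
            yield chunk["delta"].get("text", "")
        elif "generation" in chunk:
            # Llama models: text lives under generation
            yield chunk["generation"]


def fake_event(payload):
    """Build an event with the same nesting as a real stream event (test data only)."""
    return {"chunk": {"bytes": json.dumps(payload).encode("utf-8")}}


claude_events = [
    fake_event({"type": "message_start"}),
    fake_event({"type": "content_block_delta", "delta": {"text": "Hello"}}),
    fake_event({"type": "content_block_delta", "delta": {"text": " world"}}),
]
llama_events = [fake_event({"generation": "Hi"}), fake_event({"generation": " there"})]

print("".join(iter_stream_text(claude_events)))
print("".join(iter_stream_text(llama_events)))
```

With a live stream, the generator can be consumed as `for text in iter_stream_text(streaming_response["body"]): print(text, end="")`.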

In [24]:
import json
from IPython.display import Markdown, display

body = json.dumps(messages_API_body)
accept = "application/json"
contentType = "application/json"

# Claude 3 Sonnet, looked up from the table defined earlier.
modelId = models_dict.get(claude3)

# Invoke the model and receive the response as a stream of chunks.
streaming_response = bedrock_runtime.invoke_model_with_response_stream(
    modelId=modelId, body=body, accept=accept, contentType=contentType
)

# Extract and print the response text in real-time. Claude v3
for event in streaming_response["body"]:
    chunk = json.loads(event["chunk"]["bytes"])
    if chunk["type"] == "content_block_delta":
        print(chunk["delta"].get("text", ""), end="")
        #display(Markdown(chunk["delta"].get("text", "")))



Here is a draft blog post about making strong business decisions as a leader:

Tough Choices: How to Make Strong Business Decisions as a Leader

As a leader, you're faced with tough decisions every day that can significantly impact the trajectory of your business. Whether it's deciding on a new product line, hiring or firing employees, or shifting strategies, the choices you make carry a lot of weight. With so much at stake, it's crucial to approach decision-making thoughtfully an

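### To solve this problem, Bedrock now provides the `Converse API`

The Converse API uses one message format for every model, but model-specific decoding parameters still differ and are passed in `additionalModelRequestFields`.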


```
messages_API_body = {
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "text": "Provide general steps to debug a BSOD on a Windows laptop."
                }
            ]
        }
    ],
    "system": [{"text" : "You are a tech support expert who helps resolve technical issues. Signal 'SUCCESS' if you can resolve the issue, otherwise 'FAILURE'"}],
    "inferenceConfig": {
        "stopSequences": [ "SUCCESS", "FAILURE" ]
    },
    "additionalModelRequestFields": {
        "top_k": 200,
        "max_tokens": 100
    },
    "additionalModelResponseFieldPaths": [
        "/stop_sequence"
    ]
}
```

In [14]:
import logging
import os

import boto3
from botocore.exceptions import ClientError
from IPython.display import Markdown, display

logger = logging.getLogger(__name__)
# Configure logging once; later basicConfig() calls have no effect
# once the root logger already has a handler.
logging.basicConfig(level=logging.INFO)

region = os.environ.get("AWS_REGION")
bedrock_runtime = boto3.client(
    service_name='bedrock-runtime',
    region_name=region,
)
claude3 = 'claude3'
llama2 = 'llama2'
llama3='llama3'
mistral='mistral'
titan='titan'
models_dict = {
    claude3 : 'anthropic.claude-3-sonnet-20240229-v1:0',
    llama2: 'meta.llama2-13b-chat-v1',
    llama3: 'meta.llama3-8b-instruct-v1:0',
    mistral: 'mistral.mistral-7b-instruct-v0:2',
    titan : 'amazon.titan-text-premier-v1:0'
}
max_tokens_val = 100
temperature_val = 0.1

dict_add_params = {
    llama3: {}, #"max_gen_len":max_tokens_val, "temperature":temperature_val} , 
    claude3: {"top_k": 200, },# "temperature": temperature_val, "max_tokens": max_tokens_val},
    mistral: {}, #{"max_tokens":max_tokens_val, "temperature": temperature_val} , 
    titan:  {"topK": 200, },# "maxTokenCount": max_tokens_val}
}
inference_config={
    "temperature": temperature_val,
    "maxTokens": max_tokens_val,
    "topP": 0.9
}


def generate_conversation(bedrock_client,model_id,system_text,input_text):
    """
    Sends a message to a model.
    Args:
        bedrock_client: The Boto3 Bedrock runtime client.
        model_id (str): The model ID to use.
        system_text (str): The system prompt.
        input_text (str): The input message.

    Returns:
        response (JSON): The conversation that the model generated.

    """

    logger.info("Generating message with model %s", model_id)

    # Message to send.
    message = {
        "role": "user",
        "content": [{"text": input_text}]
    }
    messages = [message]
    system_prompts = [{"text" : system_text}]

    if model_id in [models_dict.get(mistral), models_dict.get(titan)]:
        system_prompts = [] # not supported

    # Base inference parameters (temperature, maxTokens, topP) come from the
    # module-level `inference_config`; model-specific fields are looked up
    # separately via get_additional_model_fields().


    # Send the message.
    response = bedrock_client.converse(
        modelId=model_id,
        messages=messages,
        system=system_prompts,
        inferenceConfig=inference_config,
        additionalModelRequestFields=get_additional_model_fields(model_id)
    )

    return response

def get_additional_model_fields(modelId):
    # Note: dict_add_params is keyed by the short names ('claude3', 'titan', ...),
    # so a lookup with a full model ID returns None and no extra fields are sent.
    return dict_add_params.get(modelId)

def get_converse_output(response_obj):
    ret_messages = []
    # Read from the argument rather than from the global `response`.
    output_message = response_obj['output']['message']
    role_out = output_message['role']

    for content in output_message['content']:
        ret_messages.append(content['text'])

    return ret_messages, role_out
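
Because `dict_add_params` is keyed by short names while `generate_conversation` receives a full model ID, the model-specific fields are only found if the table is re-keyed. The sketch below shows one way to do that. The names `params_by_model_id` and `additional_fields_for` are hypothetical, and whether a given model accepts each field (for example `topK` on Titan) is not verified here.

```python
models_dict = {
    "claude3": "anthropic.claude-3-sonnet-20240229-v1:0",
    "llama3": "meta.llama3-8b-instruct-v1:0",
    "mistral": "mistral.mistral-7b-instruct-v0:2",
    "titan": "amazon.titan-text-premier-v1:0",
}
dict_add_params = {
    "llama3": {},
    "claude3": {"top_k": 200},
    "mistral": {},
    "titan": {"topK": 200},
}

# Re-key the per-model parameters by full model ID.
params_by_model_id = {
    models_dict[name]: params for name, params in dict_add_params.items()
}


def additional_fields_for(model_id):
    """Return a copy of the model-specific fields, or {} for unknown models."""
    return dict(params_by_model_id.get(model_id, {}))


print(additional_fields_for("anthropic.claude-3-sonnet-20240229-v1:0"))
```

Returning an empty dictionary for unknown models keeps the value a valid argument for `additionalModelRequestFields`.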


In [19]:
import logging
import boto3


from botocore.exceptions import ClientError


modelId = models_dict.get(titan) #claude3) #llama3) # mistral # titan
system_text = "You are an economist with access to lots of data."
input_text = "Write an article about impact of high inflation to GDP of a country."
response = generate_conversation(bedrock_runtime, modelId, system_text, input_text)
output_message = response['output']['message']

print(f"Role: {output_message['role']}")

for content in output_message['content']:
    print(f"Text: {content['text']}")

# Usage and stop reason describe the whole response, so print them once.
token_usage = response['usage']
print(f"Input tokens:  {token_usage['inputTokens']}")
print(f"Output tokens:  {token_usage['outputTokens']}")
print(f"Total tokens:  {token_usage['totalTokens']}")
print(f"Stop reason: {response['stopReason']}")

print(f"Finished generating text with model {modelId}.")

display(Markdown(get_converse_output(response)[0][0]))


INFO:__main__:Generating message with model amazon.titan-text-premier-v1:0


Role: assistant
Text: Title: The Devastating Impact of High Inflation on a Country's GDP

High inflation can have a significant impact on a country's Gross Domestic Product (GDP), leading to economic instability, reduced purchasing power, and decreased investment. In this article, we will explore the various ways in which high inflation can affect a country's GDP and the overall economy.

1. Reduced Purchasing Power: High inflation can lead to a decrease in the purchasing power of consumers, as
Input tokens:  14
Output tokens:  100
Total tokens:  114
Stop reason: max_tokens
Finished generating text with model amazon.titan-text-premier-v1:0.


Title: The Devastating Impact of High Inflation on a Country's GDP

High inflation can have a significant impact on a country's Gross Domestic Product (GDP), leading to economic instability, reduced purchasing power, and decreased investment. In this article, we will explore the various ways in which high inflation can affect a country's GDP and the overall economy.

1. Reduced Purchasing Power: High inflation can lead to a decrease in the purchasing power of consumers, as

## Generate embeddings

Use text embeddings to convert text into meaningful vector representations. You input a body of text 
and the output is a (1 x n) vector. You can use embedding vectors for a wide variety of applications. 
Bedrock currently offers Titan Embeddings for text embedding that supports text similarity (finding the 
semantic similarity between bodies of text) and text retrieval (such as search).

At the time of writing, you can use `amazon.titan-embed-text-v1` as the embedding model via the API. The maximum input size is 8192 tokens and the output vector length is 1536.

To use a text embeddings model, use the InvokeModel API operation or the Python SDK.
Use InvokeModel to retrieve the vector representation of the input text from the specified model.



#### Input

```json
{
    "inputText": "<text>"
}
```

#### Output

```json
{
    "embedding": []
}
```


Let's see how to generate embeddings of some text:

In [27]:
prompt_data = "Amazon Bedrock supports foundation models from industry-leading providers such as \
AI21 Labs, Anthropic, Stability AI, and Amazon. Choose the model that is best suited to achieving \
your unique goals."

In [28]:
body = json.dumps({"inputText": prompt_data})
modelId = "amazon.titan-embed-text-v1"
accept = "application/json"
contentType = "application/json"

response = bedrock_runtime.invoke_model(
    body=body, modelId=modelId, accept=accept, contentType=contentType
)
response_body = json.loads(response.get("body").read())

embedding = response_body.get("embedding")
print(f"The embedding vector has {len(embedding)} values\n{embedding[0:3]+['...']+embedding[-3:]}")

The embedding vector has 1536 values
[0.16601562, 0.23632812, 0.703125, '...', 0.26953125, -0.609375, -0.55078125]
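
Text similarity between two embeddings is commonly measured with cosine similarity. The sketch below is plain Python; the short vectors are made up and stand in for two 1536-value embeddings returned by the model.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional stand-ins for real embedding vectors.
doc_a = [0.17, 0.24, 0.70]
doc_b = [0.15, 0.20, 0.75]
print(round(cosine_similarity(doc_a, doc_b), 4))
```

Values close to 1 indicate vectors pointing in nearly the same direction, which for embeddings suggests semantically similar text.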


#### Now let us run an evaluation on our models

## Next steps

In this notebook, we have set up our Bedrock-compatible environment and shown some basic examples of invoking Amazon Bedrock models using the AWS Python SDK. You're now ready to move on to the next notebook to start building our retrieval augmented generation (RAG) application!