# Getting started with Llama 3 on AWS

This notebook demonstrates how to use Llama 3 on AWS. It covers the setup, configuration, and basic usage of Llama 3 for generating text. By the end, you should understand the basic workflow and how to interact with the Llama 3 model through Amazon Bedrock.

## Llama 3

Llama 3 (Large Language Model Meta AI) is the third iteration of Meta's advanced language models, designed for tasks like text generation, translation, and summarization. Built on transformer architecture, it excels at understanding and generating human-like text by training on extensive datasets from diverse sources.

## Amazon Bedrock

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI. Using Amazon Bedrock, you can easily experiment with and evaluate top FMs for your use case, privately customize them with your data using techniques such as fine-tuning and Retrieval Augmented Generation (RAG), and build agents that execute tasks using your enterprise systems and data sources. Since Amazon Bedrock is serverless, you don't have to manage any infrastructure, and you can securely integrate and deploy generative AI capabilities into your applications using the AWS services you are already familiar with.

Amazon Bedrock provides the infrastructure and tools needed to access and deploy Llama 3. Developers can integrate Llama 3 into their applications through Bedrock's APIs, customize it with their own datasets, and scale the deployment as needed, relying on AWS's infrastructure and cost management features.

## Prerequisites

This notebook has been tested in a [SageMaker Studio](https://aws.amazon.com/sagemaker/studio/) Jupyter notebook running the latest `ipykernel`. The AWS credentials (the AWS access key and secret access key) are supplied through the IAM role attached to the notebook instance, which is why they are not hard-coded anywhere in the code. If you run this notebook outside SageMaker Studio, make the necessary adjustments to [authenticate your requests](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html) to the Bedrock API with your own AWS credentials.

## Cost

<div class="alert alert-block alert-warning"> 

<b>NOTE:</b> API calls to Amazon Bedrock will incur charges based on the tokens used and some additional features like Guardrails. Additionally, using SageMaker Studio may result in charges if you exceed the free tier: [Amazon SageMaker Pricing](https://aws.amazon.com/sagemaker/pricing/), [Amazon Bedrock Pricing](https://aws.amazon.com/bedrock/pricing/).
</div>

In [20]:
# Define a couple utility functions. You can skip this section.

import rich, json

def print_json(data):
    rich.print_json(json.dumps(data))

## Introduction

[Boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) is a Python library that allows you to interact with AWS resources programmatically. It provides an easy way to automate tasks and manage AWS services through code. We'll use Boto3 to make requests and retrieve data from the Amazon Bedrock API. The Boto3 Bedrock SDK includes four clients designed to interact with different aspects of Bedrock:

- **bedrock**: Includes APIs for controlling model management, training, and deployment.
- **bedrock-runtime**: Includes APIs for making inference requests to models hosted in Amazon Bedrock.
- **bedrock-agent**: Provides APIs for creating and managing agents and knowledge bases.
- **bedrock-agent-runtime**: Includes APIs for invoking agents and querying knowledge bases.

We will only use the `bedrock` and `bedrock-runtime` clients here. Let's start by installing the latest version of boto3.

In [21]:
# Install the latest version of boto3
!python3 -m pip install --quiet --upgrade boto3

In [22]:
import boto3
print(boto3.__version__)

1.34.143


To kick things off, we list all models from Meta that are available via Bedrock. Note that different models are available depending on the AWS Region you choose.

In [23]:
# Set default AWS region
default_region = "us-east-1"

# Create a Bedrock client in the AWS Region of your choice.
bedrock = boto3.client("bedrock", region_name=default_region)

# List all models from meta
models = bedrock.list_foundation_models(
    byProvider='Meta'  # comment this line to get all models from all providers
)

print_json(models)

The output includes several important attributes for each model:

| Field                        | Description                                                              |
|------------------------------|--------------------------------------------------------------------------|
| `modelArn`                   | ARN that uniquely identifies the model in AWS Bedrock.                   |
| `modelId`                    | Unique identifier for the model within AWS Bedrock.                      |
| `modelName`                  | Name or title of the model.                                              |
| `providerName`               | Organization or entity providing the model.                              |
| `inputModalities`            | Types of inputs the model accepts (e.g., `'TEXT'`).                        |
| `outputModalities`           | Types of outputs the model generates (e.g., `'TEXT'`).                     |
| `responseStreamingSupported` | Indicates if the model supports streaming responses.                     |
| `customizationsSupported`    | Lists any customization options available for the model.                 |
| `inferenceTypesSupported`    | Describes the ways inference can be requested (e.g., `'ON_DEMAND'`).       |
| `modelLifecycle`             | Current status of the model (e.g., `'ACTIVE'`).                            |


Another way to list all models in a more readable fashion is as follows:

In [24]:
for model in models['modelSummaries']:
    print(model['modelId'])

meta.llama2-13b-chat-v1:0:4k
meta.llama2-13b-chat-v1
meta.llama2-70b-chat-v1:0:4k
meta.llama2-70b-chat-v1
meta.llama2-13b-v1:0:4k
meta.llama2-13b-v1
meta.llama2-70b-v1:0:4k
meta.llama2-70b-v1
meta.llama3-8b-instruct-v1:0
meta.llama3-70b-instruct-v1:0
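The fields described in the table above can also be used to narrow the list, for example to models that are active and can be called on demand. The following is a minimal sketch: `on_demand_model_ids` is a hypothetical helper, the sample data is hand-written (not real API output, and the inference types shown are illustrative), and it assumes that `modelLifecycle` is returned as a dictionary with a `status` key.

```python
def on_demand_model_ids(list_response):
    """Return the IDs of active models that can be invoked on demand."""
    selected = []
    for summary in list_response.get("modelSummaries", []):
        inference_types = summary.get("inferenceTypesSupported", [])
        # Assumption: modelLifecycle looks like {"status": "ACTIVE"}.
        status = summary.get("modelLifecycle", {}).get("status")
        if "ON_DEMAND" in inference_types and status == "ACTIVE":
            selected.append(summary["modelId"])
    return selected


# Hand-written sample shaped like a list_foundation_models response.
sample_response = {
    "modelSummaries": [
        {
            "modelId": "meta.llama2-13b-chat-v1:0:4k",
            "inferenceTypesSupported": ["PROVISIONED"],
            "modelLifecycle": {"status": "ACTIVE"},
        },
        {
            "modelId": "meta.llama3-8b-instruct-v1:0",
            "inferenceTypesSupported": ["ON_DEMAND"],
            "modelLifecycle": {"status": "ACTIVE"},
        },
    ]
}

print(on_demand_model_ids(sample_response))
```

In the notebook, the same helper could be applied to the `models` variable returned by `list_foundation_models`.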


## Calling a model

The first example consists of a call to the Bedrock API to pass a prompt and receive an answer from the LLM. The InvokeModel API calls the specified Amazon Bedrock model to run inference using the prompt and inference parameters provided in the request body. Depending on the model, you can infer text, images, or embeddings.

API documentation: https://docs.aws.amazon.com/bedrock/latest/APIReference/API_runtime_InvokeModel.html

<div class="alert alert-block alert-warning"> 
<b>NOTE:</b> Large language models produce non-deterministic results; you may see different outputs than those presented in this notebook.
</div>

In [25]:
from botocore.exceptions import ClientError

# Set the model ID.
model_id = "meta.llama3-8b-instruct-v1:0"

# Set the prompt.
prompt = "Describe the purpose of a 'hello world' program in one line."

# Create a Bedrock Runtime client in the AWS Region you want to use.
bedrock_runtime = boto3.client("bedrock-runtime", region_name=default_region)

# Embed the prompt in Llama 3's instruction format.
# More information: https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3/
formatted_prompt = f"""
<|begin_of_text|>
<|start_header_id|>user<|end_header_id|>
{prompt}
<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
"""

# Format the request payload using the model's native structure.
native_request = {
    "prompt": formatted_prompt,
    "max_gen_len": 512,
    "temperature": 0.5,
}

# Convert the native request to JSON.
request = json.dumps(native_request)

try:
    # Invoke the model with the request.
    response = bedrock_runtime.invoke_model(modelId=model_id, body=request)
    
    # Decode the response body.
    model_response = json.loads(response["body"].read())

    # Extract and print the response text.
    response_text = model_response["generation"]
    print(response_text)

except (ClientError, Exception) as e:
    print(f"ERROR: Can't invoke '{model_id}'. Reason: {e}")
    exit(1)

A "Hello World" program is a simple program that serves as a starting point for learning a programming language, typically printing the text "Hello, World!" to the screen to demonstrate the basic syntax and functionality of the language.


Additionally, Llama 2 Chat, Llama 2, and Llama 3 Instruct models return the following fields for a text completion inference call alongside the text generated by the model.
 
| Field                    | Description                                                                                                                                                                           |
|--------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `generation`             | The generated text.                                                                                                                                                                   |
| `prompt_token_count`     | The number of tokens in the prompt.                                                                                                                                                   |
| `generation_token_count` | The number of tokens in the generated text.                                                                                                                                           |
| `stop_reason`            | The reason why the response stopped generating text. Possible values are: <br> - `stop`: The model has finished generating text for the input prompt. <br> - `length`: The length of the tokens for the generated text exceeds the value of `max_gen_len` in the call to `InvokeModel` (`InvokeModelWithResponseStream`, if you are streaming output). The response is truncated to `max_gen_len` tokens. Consider increasing the value of `max_gen_len` and trying again. |

In [26]:
print_json(model_response)
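The `stop_reason` field is worth checking in application code, since a value of `length` means the text was cut off. A minimal sketch, assuming the response body has the fields listed in the table above; `summarize_generation` is a hypothetical helper and the sample dictionary is hand-written rather than real model output.

```python
def summarize_generation(model_response, max_gen_len):
    """Return a short, human-readable summary of a Llama InvokeModel response body."""
    stop_reason = model_response.get("stop_reason")
    used = model_response.get("generation_token_count", 0)
    if stop_reason == "length":
        # The model hit the token limit before finishing its answer.
        return (
            f"Truncated after {used} tokens (max_gen_len={max_gen_len}); "
            "consider raising max_gen_len."
        )
    return f"Completed with stop_reason={stop_reason!r} after {used} tokens."


# Hand-written sample shaped like a decoded response body.
sample_body = {
    "generation": "...",
    "prompt_token_count": 25,
    "generation_token_count": 512,
    "stop_reason": "length",
}

print(summarize_generation(sample_body, max_gen_len=512))
```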

The drawback of the `InvokeModel` API is that it requires different JSON request and response structures depending on the model provider. Recall the following code snippet from the example:

```python
formatted_prompt = f"""
<|begin_of_text|>
<|start_header_id|>user<|end_header_id|>
{prompt}
<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
"""
```

Switching from Llama 2 or Llama 3 to another model with a different prompt structure, such as from a different provider (or perhaps even a future release of Llama), would necessitate rewriting the code. This situation leads to managing diverse formats, complicating integration efforts.
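To make that coupling concrete, here is a sketch of what a Llama 3-specific prompt helper could look like. `build_llama3_prompt` is a hypothetical function, not part of Boto3 or Bedrock; it follows the special tokens from Meta's published Llama 3 prompt format, and its whitespace differs slightly from the snippet above. Every token in it is specific to Llama 3, so none of it would carry over to another provider's model.

```python
def build_llama3_prompt(user_message, system_message=None):
    """Wrap a user message (and optional system message) in Llama 3's chat template."""
    parts = ["<|begin_of_text|>"]
    if system_message:
        parts.append(
            f"<|start_header_id|>system<|end_header_id|>\n\n{system_message}<|eot_id|>"
        )
    parts.append(
        f"<|start_header_id|>user<|end_header_id|>\n\n{user_message}<|eot_id|>"
    )
    # End with an open assistant header so the model generates the reply.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


print(build_llama3_prompt("Describe the purpose of a 'hello world' program in one line."))
```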

A better approach is to use the Amazon Bedrock `Converse` API.

### Bedrock Converse API

The [Bedrock Converse API](https://docs.aws.amazon.com/bedrock/latest/userguide/conversation-inference.html) is designed for creating advanced conversational applications that interact with large language models like Llama 3. It allows developers to send conversation prompts and receive contextually relevant responses, maintaining dialogue coherence over multiple exchanges.

Compared to the `InvokeModel` API, the `Converse` API offers advantages in dialogue management and context retention. While `InvokeModel` handles single, standalone prompts, the `Converse` API is built to maintain the context of an ongoing conversation, making it more suitable for applications that require multi-turn interactions and a natural flow of dialogue. This enhanced capability results in more engaging and effective conversational agents.

For a complete guide, see [Getting started with the Amazon Bedrock Converse API](https://community.aws/content/2hHgVE7Lz6Jj1vFv39zSzzlCilG/getting-started-with-the-amazon-bedrock-converse-api?lang=en).

In [27]:
# Use the Converse API to send a text message to Meta Llama.

def send_message_to_model(conversation, model_id=model_id, max_tokens=512, temperature=0.5, top_p=0.9, system_prompt="You are a helpful assistant"):
    """
    Send a message to a model and return the response.

    Args:
        conversation (list): The conversation history/messages to send to the model.
        model_id (str): The ID of the model to use.
        max_tokens (int): Maximum number of tokens to generate in the response.
        temperature (float): Sampling temperature to control randomness.
        top_p (float): Nucleus sampling parameter to control the range of token sampling.
        system_prompt (str): System prompt to guide the model's behavior.

    Returns:
        dict: The response from the model, containing the generated text and additional metadata.
    """
    try:
        # Send the message to the model, using the provided inference configuration.
        response = bedrock_runtime.converse(
            modelId=model_id,
            messages=conversation,
            inferenceConfig={
                "maxTokens": max_tokens,
                "temperature": temperature,
                "topP": top_p
            },
            system=[{"text": system_prompt}],
        )

        # Extract and print the response text.
        print(response["output"]["message"]["content"][0]["text"])
        return response

    except (ClientError, Exception) as e:
        print(f"ERROR: Can't invoke '{model_id}'. Reason: {e}")
        exit(1)



# Start a conversation with the user message.
user_message = "Describe the purpose of a 'hello world' program in one line."
conversation = [
    {
        "role": "user",
        "content": [{"text": user_message}],
    }
]

response = send_message_to_model(conversation)



A "Hello World" program is a traditional introductory example in programming that serves as a simple demonstration of how to write a program in a specific language, typically printing the text "Hello, World!" to the screen.


Alternatively, we can print the whole conversation. Notice the two roles, `user` and `assistant`, alternating with each other. The last message in the list should be from the `user` role, so that the LLM can respond to it.

In [28]:
conversation.append(response["output"]["message"])
print_json(conversation)
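The alternation rule can be checked before sending a conversation to the model. The following is a minimal sketch: `validate_conversation` is a hypothetical helper, and it additionally assumes that a conversation starts with a `user` message. Note that right after appending the assistant's reply, as in the cell above, the conversation ends with an `assistant` message, so a new `user` message has to be appended before calling the model again.

```python
def validate_conversation(conversation):
    """Raise ValueError if roles do not alternate or the last role is not 'user'."""
    if not conversation:
        raise ValueError("conversation is empty")
    expected = "user"  # Assumption: the first message comes from the user.
    for index, message in enumerate(conversation):
        role = message.get("role")
        if role != expected:
            raise ValueError(
                f"message {index}: expected role {expected!r}, got {role!r}"
            )
        expected = "assistant" if expected == "user" else "user"
    if conversation[-1]["role"] != "user":
        raise ValueError("the last message must come from the user")


# Hand-written sample: a user turn followed by an assistant turn.
sample_conversation = [
    {"role": "user", "content": [{"text": "Hello"}]},
    {"role": "assistant", "content": [{"text": "Hi, how can I help?"}]},
]

try:
    validate_conversation(sample_conversation)
except ValueError as error:
    print(f"Not ready to send: {error}")
```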

#### Setting a system prompt

You can set a system prompt to communicate basic instructions for the large language model outside of the normal conversation. System prompts are generally used by the developer to define the tone and constraints for the conversation. In this case, we are instructing Llama to act like a pirate.

In [29]:
new_message = {
    "role": "user",
    "content": [
        {"text": "What is the best place to hide a pirate booty?" }
    ],
}

system_prompt = "Answer in the style of a pirate"

conversation.append(new_message)
response = send_message_to_model(conversation, system_prompt=system_prompt)



Arrr, that be a great question, matey! According to me, the best place to hide a pirate booty be on a deserted isle, deep in the jungle, where the only creatures that'll find it be the scurvy dogs and the parrots. Make sure to bury it good and deep, with a map and a riddle to lead ye back to it, savvy? And don't ferget to stash a few fake treasures around to throw off any landlubbers who might be searchin' fer it!


#### Getting response metadata and token counts

The Converse method also returns metadata about the API call. The `stopReason` property tells us why the model completed the message, which can be useful for your application logic, error handling, or troubleshooting. The `usage` property includes details about the input and output tokens, helping you understand the charges for your API call.

In [30]:
print_json(response)
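Because charges are based on tokens, the `usage` block can be turned into a rough cost estimate. The sketch below is illustrative only: `estimate_cost` is a hypothetical helper, and the per-1,000-token prices are placeholders, not actual Amazon Bedrock prices. Look up the current rates for your model and Region on the pricing page before relying on any number.

```python
def estimate_cost(usage, input_price_per_1k, output_price_per_1k):
    """Estimate the cost of one call from a Converse `usage` dictionary."""
    input_cost = usage["inputTokens"] / 1000 * input_price_per_1k
    output_cost = usage["outputTokens"] / 1000 * output_price_per_1k
    return input_cost + output_cost


# Hand-written usage block shaped like response["usage"].
sample_usage = {"inputTokens": 1000, "outputTokens": 500, "totalTokens": 1500}

# Placeholder prices in USD per 1,000 tokens (NOT real Bedrock prices).
cost = estimate_cost(sample_usage, input_price_per_1k=0.001, output_price_per_1k=0.002)
print(f"Estimated cost: ${cost:.6f}")
```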

#### Bedrock Converse Streaming API

This example demonstrates how to use the Converse operation with output streaming. This means the model's answer is printed in real-time as the text is generated, rather than waiting for the model to complete the entire text before displaying it. The example shows how to send the input text, inference parameters, and additional parameters unique to the model. The code starts a conversation by asking the model to create a list of songs.

<div class="alert alert-block alert-warning">
<b>NOTE:</b> Output streaming is also supported with the `InvokeModelWithResponseStream` API: https://docs.aws.amazon.com/bedrock/latest/APIReference/API_runtime_InvokeModelWithResponseStream.html
</div>

In [31]:
import logging

logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)


def stream_conversation(bedrock_client,
                    model_id,
                    messages,
                    system_prompts,
                    inference_config,
                    additional_model_fields):
    """
    Sends messages to a model and streams the response.
    Args:
        bedrock_client: The Boto3 Bedrock runtime client.
        model_id (str): The model ID to use.
        messages (JSON) : The messages to send.
        system_prompts (JSON) : The system prompts to send.
        inference_config (JSON) : The inference configuration to use.
        additional_model_fields (JSON) : Additional model fields to use.

    Returns:
        None
    """

    logger.info("Streaming messages with model %s", model_id)

    response = bedrock_client.converse_stream(
        modelId=model_id,
        messages=messages,
        system=system_prompts,
        inferenceConfig=inference_config,
        additionalModelRequestFields=additional_model_fields
    )

    stream = response.get('stream')
    if stream:
        for event in stream:

            if 'messageStart' in event:
                print(f"\nRole: {event['messageStart']['role']}")

            if 'contentBlockDelta' in event:
                print(event['contentBlockDelta']['delta']['text'], end="")

            if 'messageStop' in event:
                print(f"\nStop reason: {event['messageStop']['stopReason']}")

            if 'metadata' in event:
                metadata = event['metadata']
                if 'usage' in metadata:
                    print("\nToken usage")
                    print(f"Input tokens: {metadata['usage']['inputTokens']}")
                    print(
                        f"Output tokens: {metadata['usage']['outputTokens']}")
                    print(f"Total tokens: {metadata['usage']['totalTokens']}")
                if 'metrics' in event['metadata']:
                    print(
                        f"Latency: {metadata['metrics']['latencyMs']} ms")

In [32]:
logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")

system_prompt = """You are an app that creates playlists for a radio station
  that plays rock and pop music. Only return song names and the artist."""

# Message to send to the model.
input_text = "Create a list of 3 pop songs."

message = {
    "role": "user",
    "content": [{"text": input_text}]
}
conversation = [message]

# System prompts.
system_prompts = [{"text": system_prompt}]

# inference parameters to use.
temperature = 0.5

# Base inference parameters.
inference_config = {
    "temperature": temperature
}

# Additional model inference parameters.
additional_model_fields = {}

try:
    bedrock_client = boto3.client(service_name='bedrock-runtime')

    stream_conversation(bedrock_client,
                        model_id,
                        conversation,
                        system_prompts,
                        inference_config,
                        additional_model_fields)

except ClientError as err:
    message = err.response['Error']['Message']
    logger.error("A client error occurred: %s", message)
    print(f"A client error occurred: {message}")

else:
    print(
        f"Finished streaming messages with model {model_id}.")


INFO:__main__:Streaming messages with model meta.llama3-8b-instruct-v1:0



Role: assistant
Here are 3 pop songs for your playlist:

1. "Happy" by Pharrell Williams
2. "Can't Stop the Feeling!" by Justin Timberlake
3. "We Found Love" by Rihanna (feat. Calvin Harris)
Stop reason: end_turn

Token usage
Input tokens: 51
Output tokens: 51
Total tokens: 102
Latency: 976 ms
Finished streaming messages with model meta.llama3-8b-instruct-v1:0.
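Printing each delta as it arrives is convenient in a notebook, but an application often needs the full text once the stream ends. A minimal sketch of that variation, assuming the same event shapes handled by `stream_conversation` above; `collect_stream` is a hypothetical helper and the sample events are hand-written.

```python
def collect_stream(events):
    """Accumulate text deltas from Converse stream events.

    Returns a (text, stop_reason) tuple.
    """
    chunks = []
    stop_reason = None
    for event in events:
        if "contentBlockDelta" in event:
            chunks.append(event["contentBlockDelta"]["delta"].get("text", ""))
        elif "messageStop" in event:
            stop_reason = event["messageStop"].get("stopReason")
    return "".join(chunks), stop_reason


# Hand-written events shaped like the items yielded by response["stream"].
sample_events = [
    {"messageStart": {"role": "assistant"}},
    {"contentBlockDelta": {"delta": {"text": "Hello, "}}},
    {"contentBlockDelta": {"delta": {"text": "world"}}},
    {"messageStop": {"stopReason": "end_turn"}},
]

text, stop_reason = collect_stream(sample_events)
print(text)
print(stop_reason)
```

With a real call, the argument would be the `stream` value returned by `converse_stream`.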


## Guardrails

Guardrails for Amazon Bedrock enables you to implement safeguards for your generative AI applications based on your use cases and responsible AI policies. You can create multiple guardrails tailored to different use cases and apply them across multiple foundation models (FM), providing a consistent user experience and standardizing safety and privacy controls across generative AI applications. You can use guardrails with text-based user inputs and model responses.

Guardrails can be used in multiple ways to safeguard generative AI applications. For example:

- A chatbot application can use guardrails to filter harmful user inputs and toxic model responses.
- A banking application can use guardrails to block user queries or model responses associated with seeking or providing investment advice.
- A call center application that summarizes conversation transcripts between users and agents can use guardrails to redact users’ personally identifiable information (PII) to protect user privacy.

For more details, visit [AWS Bedrock Guardrails](https://aws.amazon.com/bedrock/guardrails/).

In [33]:
# Create a new guardrail to block any financial advice.

financial_guardrail = bedrock.create_guardrail(
    name='financial-advice-guardrail',
    description='string',
    topicPolicyConfig={
        'topicsConfig': [
            {
                'name': 'financial-advice',
                'definition': 'Never give any financial advice.',
                'examples': [
                    'Where should I invest my money?',
                    'What are the best stocks to buy right now?',
                    'Should I buy bitcoin?',
                ],
                'type': 'DENY'
            },
        ]
    },
    blockedInputMessaging='You query was blocked by the following guardrail: financial advice-guardrail.',
    blockedOutputsMessaging='The model response was blocked by the following guardrail: financial advice-guardrail.'
)
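A newly created guardrail may not be usable immediately. One cautious option is to poll its status before applying it. The sketch below assumes the `get_guardrail` operation of the `bedrock` client and its `status` field; `wait_for_guardrail`, the polling interval, and the attempt limit are illustrative choices, not values recommended by AWS.

```python
import time


def wait_for_guardrail(bedrock_client, guardrail_id, version="DRAFT",
                       poll_seconds=2.0, max_attempts=30):
    """Poll a guardrail until its status is READY, then return that status."""
    for _ in range(max_attempts):
        status = bedrock_client.get_guardrail(
            guardrailIdentifier=guardrail_id,
            guardrailVersion=version,
        )["status"]
        if status == "READY":
            return status
        if status == "FAILED":
            raise RuntimeError(f"Guardrail {guardrail_id} failed to become ready")
        # Still being created or updated: wait and try again.
        time.sleep(poll_seconds)
    raise TimeoutError(f"Guardrail {guardrail_id} was not ready in time")
```

In this notebook, it could be called as `wait_for_guardrail(bedrock, financial_guardrail['guardrailId'])` before testing the guardrail.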

In [34]:
# List guardrail
bedrock.list_guardrails()

{'ResponseMetadata': {'RequestId': 'ad464fa2-d546-404e-ba53-a3d9235fb120',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'date': 'Thu, 11 Jul 2024 08:24:18 GMT',
   'content-type': 'application/json',
   'content-length': '573',
   'connection': 'keep-alive',
   'x-amzn-requestid': 'ad464fa2-d546-404e-ba53-a3d9235fb120'},
  'RetryAttempts': 0},
 'guardrails': [{'id': 'g78qb7eiy7ff',
   'arn': 'arn:aws:bedrock:us-east-1:137229038021:guardrail/g78qb7eiy7ff',
   'status': 'READY',
   'name': 'bedrock-agent-demo',
   'version': 'DRAFT',
   'createdAt': datetime.datetime(2024, 6, 7, 8, 28, 51, 298678, tzinfo=tzlocal()),
   'updatedAt': datetime.datetime(2024, 6, 7, 8, 32, 24, 610026, tzinfo=tzlocal())},
  {'id': 'voqvsabek3h0',
   'arn': 'arn:aws:bedrock:us-east-1:137229038021:guardrail/voqvsabek3h0',
   'status': 'READY',
   'name': 'financial-advice-guardrail',
   'description': 'string',
   'version': 'DRAFT',
   'createdAt': datetime.datetime(2024, 7, 11, 8, 24, 17, 751399, tzinfo=tzlocal(

In [35]:
# Testing a Guardrail (without invoking a model)
def test_guardrail(prompt, guardrail_id, guardrail_version):
    """
    Tests a guardrail by applying it to a given prompt.

    Args:
        prompt (str): The input text to be tested against the guardrail.
        guardrail_id (str): The unique identifier of the guardrail.
        guardrail_version (str): The version of the guardrail to be applied.

    Returns:
        None
    """
    response = bedrock_runtime.apply_guardrail(
        guardrailIdentifier=guardrail_id,
        guardrailVersion=guardrail_version,
        source='INPUT', 
        content=[{"text": {"text": prompt}}])
    print_json(response["outputs"][0]["text"])


prompt = "How should I invest for my savings?"
guardrailIdentifier = financial_guardrail['guardrailId']
guardrailVersion = financial_guardrail['version']

test_guardrail(prompt, guardrailIdentifier, guardrailVersion)
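The cell above prints only the first output text, which assumes the guardrail intervened. A slightly more defensive sketch reads the `action` field first and tolerates an empty `outputs` list. `guardrail_verdict` is a hypothetical helper, and the sample responses are hand-written to mirror the shape of an `apply_guardrail` response.

```python
def guardrail_verdict(apply_response):
    """Return (blocked, message) for an ApplyGuardrail response."""
    blocked = apply_response.get("action") == "GUARDRAIL_INTERVENED"
    outputs = apply_response.get("outputs", [])
    # The outputs list can be empty when the guardrail does not intervene.
    message = outputs[0]["text"] if outputs else None
    return blocked, message


# Hand-written samples, not real API output.
blocked_response = {
    "action": "GUARDRAIL_INTERVENED",
    "outputs": [{"text": "Your request was blocked."}],
}
allowed_response = {"action": "NONE", "outputs": []}

print(guardrail_verdict(blocked_response))
print(guardrail_verdict(allowed_response))
```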

In [36]:
# Adding a guardrail to a Llama 3 invocation
def stream_conversation(bedrock_client,
                    model_id,
                    messages,
                    system_prompts,
                    inference_config,
                    additional_model_fields,
                    guardrail_config):
    """
    Sends messages to a model and streams the response.
    Args:
        bedrock_client: The Boto3 Bedrock runtime client.
        model_id (str): The model ID to use.
        messages (JSON) : The messages to send.
        system_prompts (JSON) : The system prompts to send.
        inference_config (JSON) : The inference configuration to use.
        additional_model_fields (JSON) : Additional model fields to use.
        guardrail_config (JSON) : The guardrail to use.

    Returns:
        None
    """

    logger.info("Streaming messages with model %s", model_id)

    response = bedrock_client.converse_stream(
        modelId=model_id,
        messages=messages,
        system=system_prompts,
        inferenceConfig=inference_config,
        additionalModelRequestFields=additional_model_fields,
        guardrailConfig=guardrail_config
    )

    stream = response.get('stream')
    if stream:
        for event in stream:

            if 'messageStart' in event:
                print(f"\nRole: {event['messageStart']['role']}")

            if 'contentBlockDelta' in event:
                print(event['contentBlockDelta']['delta']['text'], end="")

            if 'messageStop' in event:
                print(f"\nStop reason: {event['messageStop']['stopReason']}")

            if 'metadata' in event:
                metadata = event['metadata']
                if 'usage' in metadata:
                    print("\nToken usage")
                    print(f"Input tokens: {metadata['usage']['inputTokens']}")
                    print(
                        f"Output tokens: {metadata['usage']['outputTokens']}")
                    print(f"Total tokens: {metadata['usage']['totalTokens']}")
                if 'metrics' in event['metadata']:
                    print(
                        f"Latency: {metadata['metrics']['latencyMs']} ms")

In [37]:
system_prompt = """You are a helpful assistant"""

# Message to send to the model.
input_text = "How should I invest for my savings?"

message = {
    "role": "user",
    "content": [{"text": input_text}]
}
conversation = [message]

# System prompts.
system_prompts = [{"text": system_prompt}]

# inference parameters to use.
temperature = 0.5

# Base inference parameters.
inference_config = {
    "temperature": temperature
}

# Additional model inference parameters.
additional_model_fields = {}

# Guardrail
guardrail_config = {
    'guardrailIdentifier': guardrailIdentifier,
    'guardrailVersion': guardrailVersion,
}

try:
    bedrock_client = boto3.client(service_name='bedrock-runtime')

    stream_conversation(bedrock_client,
                        model_id,
                        conversation,
                        system_prompts,
                        inference_config,
                        additional_model_fields,
                        guardrail_config)

except ClientError as err:
    message = err.response['Error']['Message']
    logger.error("A client error occurred: %s", message)
    print(f"A client error occurred: {message}")

else:
    print(
        f"Finished streaming messages with model {model_id}.")


INFO:__main__:Streaming messages with model meta.llama3-8b-instruct-v1:0



Role: assistant
You query was blocked by the following guardrail: financial advice-guardrail.
Stop reason: guardrail_intervened

Token usage
Input tokens: 0
Output tokens: 0
Total tokens: 0
Latency: 730 ms
Finished streaming messages with model meta.llama3-8b-instruct-v1:0.
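The stop reason in the output above (`guardrail_intervened`) is what application code would branch on to tell a blocked request from a normal answer. A minimal sketch, assuming the stop reason values documented for the Converse API; `describe_stop_reason` and the wording of each note are illustrative.

```python
# Notes for stop reasons returned by the Converse API (wording is illustrative).
STOP_REASON_NOTES = {
    "end_turn": "The model finished its answer.",
    "max_tokens": "The answer was cut off at the maxTokens limit.",
    "stop_sequence": "The model produced one of the configured stop sequences.",
    "guardrail_intervened": "A guardrail blocked the request or the response.",
    "content_filtered": "The content was filtered.",
}


def describe_stop_reason(stop_reason):
    """Map a Converse stop reason to a short explanation."""
    return STOP_REASON_NOTES.get(
        stop_reason, f"Unrecognized stop reason: {stop_reason}"
    )


print(describe_stop_reason("guardrail_intervened"))
```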


In [38]:
# Cleanup
bedrock.delete_guardrail(guardrailIdentifier=financial_guardrail['guardrailId'])

{'ResponseMetadata': {'RequestId': 'c2dd3c76-787c-426a-99a1-b00fb06ec118',
  'HTTPStatusCode': 202,
  'HTTPHeaders': {'date': 'Thu, 11 Jul 2024 08:24:19 GMT',
   'content-type': 'application/json',
   'content-length': '2',
   'connection': 'keep-alive',
   'x-amzn-requestid': 'c2dd3c76-787c-426a-99a1-b00fb06ec118'},
  'RetryAttempts': 0}}

## Additional resources

- Meta's Llama recipes for AWS: https://github.com/meta-llama/llama-recipes/tree/main/recipes/3p_integrations/aws
- Amazon Bedrock samples: https://github.com/aws-samples/amazon-bedrock-samples