## Prompt Caching with Amazon Nova models

Prompt caching is now generally available on Amazon Bedrock for Anthropic’s Claude 3.5 Haiku and Claude 3.7 Sonnet, as well as the Nova Micro, Nova Lite, and Nova Pro models. This notebook demonstrates how to use prompt caching with Amazon Nova models through both the Bedrock Converse API and the InvokeModel API.

### Understanding Prompt Caching Benefits
Prompt caching allows you to cache frequently used context across multiple model invocations, which is especially valuable for:

* Document Q&A systems where users ask multiple questions about the same document
* Coding assistants that maintain context about code files
* Applications with long, repeated prompts

The cached context remains available for up to 5 minutes after each access, with each cache hit resetting this countdown.
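
The sliding expiry described above can be pictured with a small toy model. This is only a sketch of the timing behaviour, not how Bedrock implements its cache: the `SlidingTtlCache` class and the example timeline are hypothetical, and treating an access at exactly 5 minutes as a hit is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

CACHE_TTL_SECONDS = 5 * 60  # the 5-minute window described above


@dataclass
class SlidingTtlCache:
    """Toy model of a cache entry whose countdown restarts on every access."""
    ttl: float = CACHE_TTL_SECONDS
    last_access: Optional[float] = None  # None means nothing has been cached yet

    def access(self, now: float) -> str:
        """Record an access at time `now` (in seconds) and report read or write."""
        is_hit = self.last_access is not None and (now - self.last_access) <= self.ttl
        # Both a hit and a fresh write restart the countdown.
        self.last_access = now
        return "cache read" if is_hit else "cache write"


cache = SlidingTtlCache()
timeline = [0, 240, 480, 900]  # seconds since the first request
results = [cache.access(t) for t in timeline]
print(results)
```

In this timeline the request at 480 seconds is still a read, even though more than 5 minutes have passed since the first request, because the hit at 240 seconds restarted the countdown. The request at 900 seconds arrives 7 minutes after the previous access and has to write the cache again.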


In [None]:
import json
import os
import sys
import boto3
import requests

# initialize bedrock runtime client
bedrock_runtime = boto3.client('bedrock-runtime')
boto3_bedrock = boto3.client('bedrock')

In [None]:
# This notebook uses Amazon Nova models. Make sure access to the Nova models is enabled in your account.
nova_models = ['us.amazon.nova-micro-v1:0','us.amazon.nova-lite-v1:0','us.amazon.nova-pro-v1:0']

### Use case: Chat with document

#### Prompt structure
For document chat applications, the optimal caching approach separates static and dynamic content:

- **Static content** (cache these):
  - Instructions (system prompt)
  - Document content (messages)
- **Dynamic content** (don't cache):
  - User queries

This separation maximizes cache efficiency while maintaining flexibility for varied user inputs.
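
Because caching applies to the prompt prefix up to a cache point, the static content has to be identical across requests for a cache read to happen. The sketch below fingerprints the static part so you can confirm that two requests share it. The helper name `static_prefix_fingerprint` and the sample strings are hypothetical, and the hash is only a local sanity check, not something Bedrock computes or exposes.

```python
import hashlib


def static_prefix_fingerprint(system_prompt: str, document: str) -> str:
    """Hash the static part of the prompt (hypothetical helper for local checks)."""
    digest = hashlib.sha256()
    digest.update(system_prompt.encode("utf-8"))
    digest.update(b"\x00")  # separator so the two fields cannot run together
    digest.update(document.encode("utf-8"))
    return digest.hexdigest()


sample_instructions = "Answer questions about the document."
sample_document = "Prompt caching reduces latency for repeated context."

# The user query is dynamic, so it is deliberately left out of the fingerprint.
first_request = static_prefix_fingerprint(sample_instructions, sample_document)
second_request = static_prefix_fingerprint(sample_instructions, sample_document)
edited_document = static_prefix_fingerprint(sample_instructions, sample_document + " ")

print(first_request == second_request)   # same static content
print(first_request == edited_document)  # document changed by one character
```

Even a single extra character in the document changes the fingerprint, which is a reminder to keep anything that varies per request (timestamps, user names, the question itself) after the last cache point.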

In [None]:
# Document topics 
topics = [
    'https://aws.amazon.com/about-aws/whats-new/2024/12/amazon-bedrock-preview-prompt-caching',
    'https://aws.amazon.com/about-aws/whats-new/2025/03/amazon-nova-models-govcloud/',
]

# Sample user queries
questions = [
    'what is it about?',
    'what are the use cases?'
]

# Model output instructions
instructions = (
    "I will provide you with a document, followed by a question about its content. "
    "Your task is to analyze the document, extract relevant information, and provide "
    "a comprehensive answer to the question. Please follow these detailed instructions:"

    "\n\n1. Identifying Relevant Quotes:"
    "\n   - Carefully read through the entire document."
    "\n   - Identify sections of the text that are directly relevant to answering the question."
    "\n   - Select quotes that provide key information, context, or support for the answer."
    "\n   - Quotes should be concise and to the point, typically no more than 2-3 sentences each."
    "\n   - Choose a diverse range of quotes if multiple aspects of the question need to be addressed."
    "\n   - Aim to select between 2 to 5 quotes, depending on the complexity of the question."

    "\n\n2. Presenting the Quotes:"
    "\n   - List the selected quotes under the heading 'Relevant quotes:'"
    "\n   - Number each quote sequentially, starting from [1]."
    "\n   - Present each quote exactly as it appears in the original text, enclosed in quotation marks."
    "\n   - If no relevant quotes can be found, write 'No relevant quotes' instead."
    "\n   - Example format:"
    "\n     Relevant quotes:"
    "\n     [1] \"This is the first relevant quote from the document.\""
    "\n     [2] \"This is the second relevant quote from the document.\""

    "\n\n3. Formulating the Answer:"
    "\n   - Begin your answer with the heading 'Answer:' on a new line after the quotes."
    "\n   - Provide a clear, concise, and accurate answer to the question based on the information in the document."
    "\n   - Ensure your answer is comprehensive and addresses all aspects of the question."
    "\n   - Use information from the quotes to support your answer, but do not repeat them verbatim."
    "\n   - Maintain a logical flow and structure in your response."
    "\n   - Use clear and simple language, avoiding jargon unless it's necessary and explained."

    "\n\n4. Referencing Quotes in the Answer:"
    "\n   - Do not explicitly mention or introduce quotes in your answer (e.g., avoid phrases like 'According to quote [1]')."
    "\n   - Instead, add the bracketed number of the relevant quote at the end of each sentence or point that uses information from that quote."
    "\n   - If a sentence or point is supported by multiple quotes, include all relevant quote numbers."
    "\n   - Example: 'The company's revenue grew by 15% last year. [1] This growth was primarily driven by increased sales in the Asian market. [2][3]'"

    "\n\n5. Handling Uncertainty or Lack of Information:"
    "\n   - If the document does not contain enough information to fully answer the question, clearly state this in your answer."
    "\n   - Provide any partial information that is available, and explain what additional information would be needed to give a complete answer."
    "\n   - If there are multiple possible interpretations of the question or the document's content, explain this and provide answers for each interpretation if possible."

    "\n\n6. Maintaining Objectivity:"
    "\n   - Stick to the facts presented in the document. Do not include personal opinions or external information not found in the text."
    "\n   - If the document presents biased or controversial information, note this objectively in your answer without endorsing or refuting the claims."

    "\n\n7. Formatting and Style:"
    "\n   - Use clear paragraph breaks to separate different points or aspects of your answer."
    "\n   - Employ bullet points or numbered lists if it helps to organize information more clearly."
    "\n   - Ensure proper grammar, punctuation, and spelling throughout your response."
    "\n   - Maintain a professional and neutral tone throughout your answer."

    "\n\n8. Length and Depth:"
    "\n   - Provide an answer that is sufficiently detailed to address the question comprehensively."
    "\n   - However, avoid unnecessary verbosity. Aim for clarity and conciseness."
    "\n   - The length of your answer should be proportional to the complexity of the question and the amount of relevant information in the document."

    "\n\n9. Dealing with Complex or Multi-part Questions:"
    "\n   - For questions with multiple parts, address each part separately and clearly."
    "\n   - Use subheadings or numbered points to break down your answer if necessary."
    "\n   - Ensure that you've addressed all aspects of the question in your response."

    "\n\n10. Concluding the Answer:"
    "\n    - If appropriate, provide a brief conclusion that summarizes the key points of your answer."
    "\n    - If the question asks for recommendations or future implications, include these based strictly on the information provided in the document."

    "\n\nRemember, your goal is to provide a clear, accurate, and well-supported answer based solely on the content of the given document. "
    "Adhere to these instructions carefully to ensure a high-quality response that effectively addresses the user's query."
    )

#### Implementation with InvokeModel API

The first example uses the InvokeModel API.

In [None]:
def invoke_chat_with_document(system_prompt, document, user_query, model_id):
    
    document_content =  f"## document:\n{document} "

    # Define your system prompt(s).
    system_list = [
        {
            "text": system_prompt,
            "cachePoint": {
                "type": "default"
            }
        }
    ]

    # Define one or more messages using the "user" and "assistant" roles.
    message_list = [
        {
            "role": "user",
            "content": [
                {
                    "text": document_content,
                    "cachePoint": {
                        "type": "default"
                    }
                },
                {
                    "text": user_query,
                }
            ]
        }
    ]

    # Configure the inference parameters.
    inf_params = {
        "max_new_tokens": 300,
        "top_p": 0.9,
        "top_k": 20,
        "temperature": 0.7
    }

    native_request = {
        "messages": message_list,
        "system": system_list,
        "inferenceConfig": inf_params,
    }

    # Send the request and print the full response body, which includes token usage.
    response = bedrock_runtime.invoke_model(
        body=json.dumps(native_request),
        modelId=model_id,
    )
    response_body = json.loads(response.get("body").read())
    print(json.dumps(response_body, indent=2))


**First invocation**: cache write

In [None]:
response = requests.get(topics[0])
blog = response.text
invoke_chat_with_document(instructions, blog, questions[0], nova_models[1])

**Subsequent invocation**: cache read

In [None]:
invoke_chat_with_document(instructions, blog, questions[1], nova_models[1])

#### Implementation with Converse API

The Converse API on Bedrock provides a unified API experience across models. The following example implements the same use case with the Converse API.

<div class="alert alert-block alert-info">
<b>Info:</b> You need a boto3 version later than 1.37.26 to use prompt caching in the Converse API.
</div>

In [None]:
print(f"boto3 version: {boto3.__version__}")
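
Rather than reading the printed version by eye, the requirement can be checked programmatically. This is a minimal sketch that assumes plain `major.minor.patch` version strings, which is the format boto3 releases normally use. The helper names `version_tuple` and `is_newer_than` are hypothetical.

```python
def version_tuple(version: str) -> tuple:
    """Convert a 'major.minor.patch' string into a tuple of integers."""
    return tuple(int(part) for part in version.split(".")[:3])


def is_newer_than(installed: str, minimum: str = "1.37.26") -> bool:
    """Return True when `installed` is strictly newer than `minimum`."""
    return version_tuple(installed) > version_tuple(minimum)


# Literal versions are used here so the sketch runs without boto3 installed.
for candidate in ["1.37.26", "1.37.27", "1.9.200"]:
    print(candidate, is_newer_than(candidate))
```

Comparing integer tuples avoids the trap of comparing strings, where `"1.9.200"` would sort after `"1.37.26"`. In the notebook you would pass `boto3.__version__` as the `installed` argument.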

In [None]:
def converse_with_document(system_prompt, document, user_query, model_id):

    document_content =  f"## document:\n{document} "

    system_list = [
        {
            "text": system_prompt
        },
        {
            "cachePoint": {
                "type": "default"
            }
        }
    ]
    
    message_list = [
        {
            'role': 'user',
            'content': [
                {
                    'text': document_content
                },
                {
                    "cachePoint": {
                        "type": "default"
                    }
                },
                {
                    'text': user_query
                },
            ]
        },
    ]

    inference_config = {
        'maxTokens': 500,
        'temperature': 0,
        'topP': 1
    }

    response = bedrock_runtime.converse(
        system=system_list,
        messages=message_list,
        modelId=model_id,
        inferenceConfig=inference_config
    )

    output_message = response["output"]["message"]
    response_text = output_message["content"][0]["text"]

    print("Response text:")
    print(response_text)

    print("Usage:")
    print(json.dumps(response["usage"], indent=2))

**Same instructions, different document**: cache read on instructions, cache write on document content

In [None]:
response = requests.get(topics[1])
blog = response.text
converse_with_document(instructions, blog, questions[0], nova_models[1])

**Subsequent invocation**: cache read

In [None]:
converse_with_document(instructions, blog, questions[1], nova_models[1])
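
To tell cache writes from cache reads without scanning the printed usage by eye, the usage dictionary can be summarized with a small helper. This is a sketch under assumptions: the key names `cacheReadInputTokens` and `cacheWriteInputTokens` (Converse) and `cacheReadInputTokenCount` and `cacheWriteInputTokenCount` (Nova InvokeModel) reflect what these APIs appear to return, so verify them against your own output. The helper name and the sample numbers are illustrative only.

```python
def summarize_cache_usage(usage: dict) -> dict:
    """Classify a usage dict as a cache read, a cache write, both, or neither."""
    # Assumed key names: Converse style first, then Nova InvokeModel style.
    read = usage.get("cacheReadInputTokens", usage.get("cacheReadInputTokenCount", 0)) or 0
    written = usage.get("cacheWriteInputTokens", usage.get("cacheWriteInputTokenCount", 0)) or 0

    if read and written:
        status = "cache read + cache write"
    elif read:
        status = "cache read"
    elif written:
        status = "cache write"
    else:
        status = "no caching"
    return {"read_tokens": read, "written_tokens": written, "status": status}


# Illustrative usage dicts with made-up token counts, not real model output.
samples = [
    {"inputTokens": 12, "outputTokens": 80, "cacheWriteInputTokens": 2100},
    {"inputTokens": 12, "outputTokens": 75, "cacheReadInputTokens": 2100},
    {"inputTokens": 12, "outputTokens": 90,
     "cacheReadInputTokens": 900, "cacheWriteInputTokens": 1200},
]
for sample in samples:
    print(summarize_cache_usage(sample)["status"])
```

The third sample mirrors the "same instructions, different document" case: the instructions are read from the cache while the new document is written to it. To use the helper in this notebook, `converse_with_document` would need to return `response["usage"]` instead of only printing it.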

## Conclusion

This notebook explored Amazon Bedrock's prompt caching feature with Amazon Nova models, using both the InvokeModel API and the Converse API.

For more information about working with prompt caching on Amazon Bedrock, see the [Amazon Bedrock User Guide](https://docs.aws.amazon.com/bedrock/latest/userguide/prompt-caching.html). 