# Prompt Caching in Amazon Bedrock

Prompt caching in Amazon Bedrock can reduce response latency for workloads that repeatedly send the same context. This notebook demonstrates how to implement prompt caching for document-based chat applications.

## What is Prompt Caching?

Prompt caching allows you to store portions of your conversation context, enabling models to:
- Reuse cached context instead of reprocessing inputs
- Reduce response Time-To-First-Token (TTFT) for subsequent queries

## When to Use Prompt Caching

Prompt caching delivers maximum benefits for:
- **Chat with Document**: Caching the document as input context on the first request makes each subsequent user query more efficient, and may allow simpler architectures that avoid heavier solutions such as vector databases.
- **Coding assistants**: Reusing long code files in prompts enables near real-time inline suggestions, eliminating much of the time spent reprocessing code files.
- **Agentic workflows**: Longer system prompts can be used to refine agent behavior without degrading the end-user experience. By caching the system prompts and complex tool definitions, the time to process each step in the agentic flow can be reduced.
- **Few-Shot Learning**: Prompts that include numerous high-quality examples and complex instructions, such as for customer service or technical troubleshooting, can benefit from prompt caching.

## Benefits of Prompt Caching

- **Faster Response Times**: Avoid reprocessing the same context repeatedly
- **Improved User Experience**: Reduced TTFT makes conversations feel more natural
- **Cost Efficiency**: Potentially lower cost by avoiding redundant processing of the same tokens

## Implementation Example

This notebook walks through a document-based chat implementation using prompt caching to demonstrate:
1. How to properly structure cache points in your requests
2. Performance comparisons with and without caching
3. Best practices for cache management
4. Measuring and optimizing cache effectiveness
5. How to use tenant-level isolation with Anthropic Claude models

In [None]:
! pip install --upgrade boto3 pandas numpy matplotlib seaborn pytz requests

In [None]:
# Standard libraries
import hashlib
import json
import time
from enum import Enum

# Data processing and visualization
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patheffects as path_effects
import seaborn as sns

# AWS and external services
import boto3
import requests

# Runtime client for model invocation, control-plane client for listing models
bedrock_runtime = boto3.client('bedrock-runtime')
boto3_bedrock = boto3.client('bedrock')

<div class="alert alert-block alert-info">
<b>Info:</b> You need a recent version of boto3 that includes prompt caching support in the Converse API.
</div>

In [None]:
print(f"boto3 version: {boto3.__version__}")

<div class="alert alert-block alert-info">
<b>Info:</b> This notebook uses Anthropic Claude 3.5 Haiku as an example. Make sure you have enabled access to the model in Amazon Bedrock.
</div>


In [None]:
[model['modelId'] for model in boto3_bedrock.list_foundation_models()['modelSummaries']]

In [None]:
model_id = "us.anthropic.claude-3-5-haiku-20241022-v1:0"

### Use case: Chat with document

Prompt caching only takes effect when the cached prefix meets a model-specific [minimum number of tokens](https://docs.aws.amazon.com/bedrock/latest/userguide/prompt-caching.html#prompt-caching-models). We therefore use a long document for this example.

In [None]:
topics = [
    'https://aws.amazon.com/blogs/aws/reduce-costs-and-latency-with-amazon-bedrock-intelligent-prompt-routing-and-prompt-caching-preview/',
    'https://aws.amazon.com/blogs/machine-learning/enhance-conversational-ai-with-advanced-routing-techniques-with-amazon-bedrock/',
    'https://aws.amazon.com/blogs/security/cost-considerations-and-common-options-for-aws-network-firewall-log-management/'
]

questions = [
    'what is it about?',
    'what are the use cases?',
    'what is intelligent prompt routing?',
    'what is prompt caching?',
]
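
Because caching depends on a minimum prefix length, it can help to sanity-check a document before sending it. The sketch below is only a rough heuristic: the `estimate_tokens` and `likely_cacheable` helpers are hypothetical, the 4-characters-per-token ratio is an approximation rather than the model's real tokenizer, and the `min_tokens` default is an assumed placeholder that should be checked against the table linked above.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; a real tokenizer will give different counts."""
    return int(len(text) / chars_per_token)


def likely_cacheable(text: str, min_tokens: int = 2048) -> bool:
    """Return True if the estimate reaches the assumed minimum (placeholder value)."""
    return estimate_tokens(text) >= min_tokens


short_doc = "Prompt caching stores a prompt prefix."
long_doc = short_doc * 500  # stand-in for a long blog post

print(estimate_tokens(short_doc), likely_cacheable(short_doc))
print(estimate_tokens(long_doc), likely_cacheable(long_doc))
```

If the estimate is close to the threshold, the usage metrics returned by the API are the reliable way to confirm whether a cache entry was actually written.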

In [None]:
def chat_with_document(document, user_query, model_id, tenant_id=None):
    """Ask a question about a document, caching the instructions and document as a prefix."""
    instructions = (
    "I will provide you with a document, followed by a question about its content. "
    "Your task is to analyze the document, extract relevant information, and provide "
    "a comprehensive answer to the question. Please follow these detailed instructions:"

    "\n\n1. Identifying Relevant Quotes:"
    "\n   - Carefully read through the entire document."
    "\n   - Identify sections of the text that are directly relevant to answering the question."
    "\n   - Select quotes that provide key information, context, or support for the answer."
    "\n   - Quotes should be concise and to the point, typically no more than 2-3 sentences each."
    "\n   - Choose a diverse range of quotes if multiple aspects of the question need to be addressed."
    "\n   - Aim to select between 2 to 5 quotes, depending on the complexity of the question."

    "\n\n2. Presenting the Quotes:"
    "\n   - List the selected quotes under the heading 'Relevant quotes:'"
    "\n   - Number each quote sequentially, starting from [1]."
    "\n   - Present each quote exactly as it appears in the original text, enclosed in quotation marks."
    "\n   - If no relevant quotes can be found, write 'No relevant quotes' instead."
    "\n   - Example format:"
    "\n     Relevant quotes:"
    "\n     [1] \"This is the first relevant quote from the document.\""
    "\n     [2] \"This is the second relevant quote from the document.\""

    "\n\n3. Formulating the Answer:"
    "\n   - Begin your answer with the heading 'Answer:' on a new line after the quotes."
    "\n   - Provide a clear, concise, and accurate answer to the question based on the information in the document."
    "\n   - Ensure your answer is comprehensive and addresses all aspects of the question."
    "\n   - Use information from the quotes to support your answer, but do not repeat them verbatim."
    "\n   - Maintain a logical flow and structure in your response."
    "\n   - Use clear and simple language, avoiding jargon unless it's necessary and explained."

    "\n\n4. Referencing Quotes in the Answer:"
    "\n   - Do not explicitly mention or introduce quotes in your answer (e.g., avoid phrases like 'According to quote [1]')."
    "\n   - Instead, add the bracketed number of the relevant quote at the end of each sentence or point that uses information from that quote."
    "\n   - If a sentence or point is supported by multiple quotes, include all relevant quote numbers."
    "\n   - Example: 'The company's revenue grew by 15% last year. [1] This growth was primarily driven by increased sales in the Asian market. [2][3]'"

    "\n\n5. Handling Uncertainty or Lack of Information:"
    "\n   - If the document does not contain enough information to fully answer the question, clearly state this in your answer."
    "\n   - Provide any partial information that is available, and explain what additional information would be needed to give a complete answer."
    "\n   - If there are multiple possible interpretations of the question or the document's content, explain this and provide answers for each interpretation if possible."

    "\n\n6. Maintaining Objectivity:"
    "\n   - Stick to the facts presented in the document. Do not include personal opinions or external information not found in the text."
    "\n   - If the document presents biased or controversial information, note this objectively in your answer without endorsing or refuting the claims."

    "\n\n7. Formatting and Style:"
    "\n   - Use clear paragraph breaks to separate different points or aspects of your answer."
    "\n   - Employ bullet points or numbered lists if it helps to organize information more clearly."
    "\n   - Ensure proper grammar, punctuation, and spelling throughout your response."
    "\n   - Maintain a professional and neutral tone throughout your answer."

    "\n\n8. Length and Depth:"
    "\n   - Provide an answer that is sufficiently detailed to address the question comprehensively."
    "\n   - However, avoid unnecessary verbosity. Aim for clarity and conciseness."
    "\n   - The length of your answer should be proportional to the complexity of the question and the amount of relevant information in the document."

    "\n\n9. Dealing with Complex or Multi-part Questions:"
    "\n   - For questions with multiple parts, address each part separately and clearly."
    "\n   - Use subheadings or numbered points to break down your answer if necessary."
    "\n   - Ensure that you've addressed all aspects of the question in your response."

    "\n\n10. Concluding the Answer:"
    "\n    - If appropriate, provide a brief conclusion that summarizes the key points of your answer."
    "\n    - If the question asks for recommendations or future implications, include these based strictly on the information provided in the document."

    "\n\nRemember, your goal is to provide a clear, accurate, and well-supported answer based solely on the content of the given document. "
    "Adhere to these instructions carefully to ensure a high-quality response that effectively addresses the user's query."
    )

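    # Prefix the instructions with a hash of the tenant ID so that each tenant
    # produces a different prompt prefix and therefore a separate cache entry.
    if tenant_id:
        sha256_hash = hashlib.sha256(tenant_id.encode()).hexdigest()
        instructions = f"{sha256_hash}:{instructions}"

    document_content = f"Here is the document:  <document> {document} </document>"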

    messages_body = [
        {
            'role': 'user',
            'content': [
                {'text': instructions},
                {'text': document_content},
                # Static content above this cache point is cached as a prefix
                {'cachePoint': {'type': 'default'}},
                # The query changes on every request, so it stays after the cache point
                {'text': user_query},
            ]
        },
    ]

    inference_config = {
        'maxTokens': 500,
        'temperature': 0,
        'topP': 1
    }

    response = bedrock_runtime.converse(
        messages=messages_body,
        modelId=model_id,
        inferenceConfig=inference_config
    )

    output_message = response["output"]["message"]
    response_text = output_message["content"][0]["text"]

    print("Response text:")
    print(response_text)

    print("Usage:")
    print(json.dumps(response["usage"], indent=2))
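
To see why the hashed prefix separates tenants, the following standalone sketch mirrors the `f"{sha256_hash}:{instructions}"` step from `chat_with_document` without calling Bedrock. The helper names and the sample instruction text are hypothetical and only serve the illustration.

```python
import hashlib


def tenant_prefix(tenant_id: str) -> str:
    """Return the SHA-256 hex digest of the tenant ID."""
    return hashlib.sha256(tenant_id.encode()).hexdigest()


def build_cached_prefix(tenant_id: str, instructions: str) -> str:
    """Mirror the notebook's format: '<hash>:<instructions>'."""
    return f"{tenant_prefix(tenant_id)}:{instructions}"


sample_instructions = "Answer questions about the document."  # placeholder text
p1 = build_cached_prefix("tenant1", sample_instructions)
p1_again = build_cached_prefix("tenant1", sample_instructions)
p2 = build_cached_prefix("tenant2", sample_instructions)

print(p1 == p1_again)  # same tenant: identical prefix, so a cache entry can be reused
print(p1 == p2)        # different tenant: the prefix differs from the first character range on
```

Since the hash sits at the very start of the prompt, two tenants never share a matching prefix, even when the instructions and document are identical.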

### Converse API with Prompt Caching

**First invocation**

The first time you call the Converse API with a cache point, the content before the cache point is written to the cache (indicated by **cacheWriteInputTokens** in the response). If you provide a `tenant_id`, its SHA-256 hash is prepended to the instructions, so each tenant has a distinct prompt prefix and therefore its own cache entry. This is what happens during the first invocation:

In [None]:
response = requests.get(topics[0], timeout=30)
response.raise_for_status()  # fail early if the page could not be fetched
blog = response.text

chat_with_document(
    document=blog,
    user_query=questions[0],
    model_id=model_id,
    tenant_id="tenant1"
)
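
The usage block printed above can be read programmatically. The helper below is a minimal sketch, not part of the original notebook: `summarize_cache_usage` is a hypothetical name, and the sample dictionaries use made-up numbers purely to show the three cases.

```python
def summarize_cache_usage(usage: dict) -> str:
    """Classify a Converse 'usage' dict by its cache read/write token counts."""
    read = usage.get("cacheReadInputTokens", 0)
    written = usage.get("cacheWriteInputTokens", 0)
    if read > 0:
        return f"cache hit: {read} tokens read from cache"
    if written > 0:
        return f"cache write: {written} tokens written to cache"
    return "no caching: the prefix may be below the model's minimum token count"


# Illustrative values only, not real API output
first_call = {"inputTokens": 12, "cacheReadInputTokens": 0, "cacheWriteInputTokens": 9000}
second_call = {"inputTokens": 14, "cacheReadInputTokens": 9000, "cacheWriteInputTokens": 0}
uncached_call = {"inputTokens": 300}

for usage in (first_call, second_call, uncached_call):
    print(summarize_cache_usage(usage))
```

Using `dict.get` with a default keeps the helper safe if a response omits the cache fields.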

**Subsequent invocations**

When you submit a different question about the same document with the same `tenant_id`, the prompt prefix (hashed `tenant_id`, instructions, and document) matches the cached entry, so the model reads the context from the cache instead of reprocessing the entire document. You can observe the **cacheReadInputTokens** metric in the response:

In [None]:
chat_with_document(
    document=blog,
    user_query=questions[1],
    model_id=model_id,
    tenant_id="tenant1"
)
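
To compare requests with and without a cache hit, we can time each call. The wrapper below is a generic sketch that measures end-to-end wall-clock time, which is not the same as TTFT; `timed` and `fake_request` are hypothetical names, and `fake_request` only stands in for a real call such as `chat_with_document`.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def fake_request(delay_s: float) -> str:
    """Stand-in for a model call; sleeps instead of contacting Bedrock."""
    time.sleep(delay_s)
    return "ok"


result, elapsed = timed(fake_request, 0.05)
print(f"{result} in {elapsed:.3f}s")
```

In the notebook, the same wrapper could be applied as `timed(chat_with_document, document=blog, user_query=questions[1], model_id=model_id, tenant_id="tenant1")`. Network variance is significant, so several runs per configuration give a fairer comparison than a single measurement.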

This cell runs a test using a different `tenant_id` value. Because the hashed `tenant_id` is prepended to the instructions, the new tenant produces a different prompt prefix that does not match the existing cache entry. The response should therefore report **cacheWriteInputTokens** rather than a cache read, showing that cache entries are kept separate for each tenant.

In [None]:
chat_with_document(
    document=blog,
    user_query=questions[0],
    model_id=model_id,
    tenant_id="tenant2"
)

In this cell, we run an additional test with the same `tenant_id` (`tenant2`) and a different question about the same document. Because the prompt prefix (hashed `tenant_id`, instructions, and document) matches the entry written by the previous request, the model reads the context from the cache, as indicated by the **cacheReadInputTokens** metric, rather than reprocessing the document. Together with the previous cell, this shows that cache entries are isolated per tenant and that subsequent queries for the same tenant and document benefit from reduced latency and less redundant token processing.

In [None]:
chat_with_document(
    document=blog,
    user_query=questions[1],
    model_id=model_id,
    tenant_id="tenant2"
)

## Conclusion

This notebook explored Amazon Bedrock's prompt caching feature, demonstrating how it works, when to use it, and how to use it effectively. It's important to carefully evaluate whether your use case will benefit from this feature. The benefit depends on thoughtful prompt structuring, understanding the distinction between static and dynamic content, and selecting appropriate caching strategies for your specific needs.

For more information about working with prompt caching on Amazon Bedrock, see the [Amazon Bedrock User Guide](https://docs.aws.amazon.com/bedrock/latest/userguide/prompt-caching.html). 