# Building Low-Latency AI Agent Workflows with Amazon Bedrock Prompt Caching
This notebook demonstrates how to implement efficient AI agent workflows using Amazon Bedrock's prompt caching capabilities. As organizations move their AI agent applications from proof-of-concept to production, they face challenges with token consumption, latency, and scaling costs. We'll show how to optimize these aspects without compromising the agent's reasoning capabilities.

## What you'll learn
- How to identify cacheable components in agent prompts
- Implementation of cache checkpoints in Amazon Bedrock
- Performance monitoring and optimization techniques
- Integration with open source agent frameworks

## Why Prompt Caching Matters
AI agents typically require significant static portions of prompts (system instructions, tool definitions, response formatting guidelines, etc.) that remain largely unchanged between user requests. Without caching, these static components:
- Consume substantial tokens with each call
- Introduce unnecessary processing latency
- Increase costs at scale
- Can lead to API throttling and rate limit issues

By implementing prompt caching, we can achieve:
- Up to 85% reduction in latency
- Up to 90% cost savings through reduced token processing
- Improved throughput for handling more concurrent users
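
To see where savings of this kind come from, here is a back-of-the-envelope sketch. The price multipliers (cache reads billed at 10% of the base input price, cache writes at 125%) and the token counts are illustrative assumptions rather than figures from this notebook, and the calculation assumes the cache stays warm across all calls. Check the current Amazon Bedrock pricing for your model before relying on it.

```python
from typing import Tuple


def relative_input_cost(
    static_tokens: int,
    dynamic_tokens: int,
    calls: int,
    read_multiplier: float = 0.10,   # assumed price of a cache read vs. a normal input token
    write_multiplier: float = 1.25,  # assumed price of a cache write vs. a normal input token
) -> Tuple[float, float]:
    """Return (uncached_cost, cached_cost) in units of the base input-token price."""
    # Without caching, every call pays full price for the static and dynamic parts.
    uncached = calls * (static_tokens + dynamic_tokens)
    # With caching, the first call writes the static prefix and later calls read it.
    cached = (
        static_tokens * write_multiplier
        + (calls - 1) * static_tokens * read_multiplier
        + calls * dynamic_tokens
    )
    return uncached, cached


uncached, cached = relative_input_cost(static_tokens=2000, dynamic_tokens=100, calls=10)
savings_pct = (1 - cached / uncached) * 100
print(f"Uncached: {uncached:,.0f}  Cached: {cached:,.0f}  Savings: {savings_pct:.1f}%")
```

The larger the static prefix is relative to the per-request content, and the more calls reuse it, the closer the savings get to the read discount itself.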

## Prerequisites
- An AWS account with access to Amazon Bedrock
- Access to Anthropic Claude 3.7 Sonnet model in Amazon Bedrock
- Basic understanding of LLMs and prompt engineering
- Python 3.9+ (recent boto3 releases no longer support Python 3.7 or 3.8)

## Execution Instructions

This notebook is designed to be run sequentially from top to bottom. Code cells that create shared utilities need to be executed before the implementation sections. Each implementation section (Part 1 and Part 2) can be run independently after the shared utilities are defined.

Note that some cells show execution outputs from our test runs, but you should execute all cells to see the results in your environment.

## Setup and Configuration
First, let's import our required libraries and set up our configurations.

In [None]:
!python3 -m pip install --upgrade --quiet boto3

In [None]:
import boto3
import logging
import json
import time
import re
from typing import Dict, List, Optional, Any, Union, Tuple
print(boto3.__version__)

In [None]:
# setting logger
logging.basicConfig(format='[%(asctime)s] line:{%(lineno)d} %(levelname)s - %(message)s', level=logging.INFO)
logger = logging.getLogger(__name__)

Now that we have our environment configured, we'll create the core functions needed to interact with Amazon Bedrock. These functions will serve as the foundation for both our cached and non-cached implementations, allowing us to make direct comparisons between the two approaches.

## Shared Utility Functions

### Bedrock Integration
Create functions to handle communication with Amazon Bedrock.

In [None]:
# Create Bedrock client using default credentials
session = boto3.Session()
bedrock = session.client(service_name='bedrock-runtime')

def call_bedrock(
    message_list: List[Dict[str, Any]],
    tool_list: Optional[List[Dict[str, Any]]] = None,
    system_prompt: Optional[List[Dict[str, Any]]] = None
) -> Dict[str, Any]:
    """
    Makes a call to Amazon Bedrock.

    Args:
        message_list (List[Dict[str, Any]]): List of conversation messages
        tool_list (List[Dict[str, Any]], optional): List of available tools
        system_prompt (List[Dict[str, Any]], optional): System prompt configuration

    Returns:
        Dict[str, Any]: Bedrock response containing the message and usage statistics
    """
    try:
        kwargs = {
            'modelId': "us.anthropic.claude-3-7-sonnet-20250219-v1:0",
            'messages': message_list,
            'inferenceConfig': {
                "maxTokens": 2000,
                "temperature": 0
            }
        }

        if system_prompt:
            kwargs['system'] = system_prompt

        if tool_list:
            kwargs['toolConfig'] = {"tools": tool_list}

        # Start timing to measure call latency
        start_time = time.time()

        response = bedrock.converse(**kwargs)

        # Calculate latency
        latency = (time.time() - start_time) * 1000  # Convert to milliseconds

        # Add latency to the usage stats so it can be used later to analyze performance
        response['usage']['latency_ms'] = latency

        return response

    except Exception as e:
        logger.error(f"Error calling Bedrock: {str(e)}")
        raise
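
The request shape can be inspected without calling the service. The sketch below assembles keyword arguments similar to those `call_bedrock` passes to `converse`, with a cache point appended after the system text and after the tool list. `build_converse_kwargs` and the model ID `example-model-id` are hypothetical names used only for illustration.

```python
import json
from typing import Any, Dict, List, Optional


def cache_point() -> Dict[str, Any]:
    """Return a fresh cache-point block in the shape used throughout this notebook."""
    return {"cachePoint": {"type": "default"}}


def build_converse_kwargs(
    model_id: str,
    messages: List[Dict[str, Any]],
    system_text: Optional[str] = None,
    tools: Optional[List[Dict[str, Any]]] = None,
    cache: bool = False,
) -> Dict[str, Any]:
    """Assemble Converse keyword arguments; nothing is sent to Bedrock here."""
    kwargs: Dict[str, Any] = {
        "modelId": model_id,
        "messages": messages,
        "inferenceConfig": {"maxTokens": 2000, "temperature": 0},
    }
    if system_text:
        system = [{"text": system_text}]
        if cache:
            system.append(cache_point())  # everything before this marker is cacheable
        kwargs["system"] = system
    if tools:
        tool_list = list(tools)  # copy so the caller's list is not modified
        if cache:
            tool_list.append(cache_point())
        kwargs["toolConfig"] = {"tools": tool_list}
    return kwargs


request = build_converse_kwargs(
    model_id="example-model-id",  # placeholder, not a real model ID
    messages=[{"role": "user", "content": [{"text": "How many vacation days do I have?"}]}],
    system_text="You are an HR assistant.",
    tools=[{"toolSpec": {"name": "get_leave_balances"}}],  # abbreviated tool spec
    cache=True,
)
print(json.dumps(request, indent=2))
```

Printing the payload this way is a quick check that each cache point sits after the static content it is meant to cover.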


## Performance Analysis Utilities
Let's create utilities to measure and analyze the performance improvements from prompt caching.

In [None]:
def analyze_performance(usage_stats: List[Dict[str, Union[int, float]]]) -> None:
    """
    Analyze and display performance metrics from usage statistics.

    Args:
        usage_stats (List[Dict[str, Union[int, float]]]): List of usage statistics from Bedrock responses
    """
    # Initialize counters
    total_stats = {
        'input_tokens': 0,
        'output_tokens': 0,
        'total_tokens': 0,
        'cache_read_tokens': 0,
        'cache_write_tokens': 0,
        'total_latency': 0
    }

    # Aggregate statistics
    for stats in usage_stats:
        total_stats['input_tokens'] += stats.get('inputTokens', 0)
        total_stats['output_tokens'] += stats.get('outputTokens', 0)
        total_stats['total_tokens'] += stats.get('totalTokens', 0)
        total_stats['cache_read_tokens'] += stats.get('cacheReadInputTokens', 0)
        total_stats['cache_write_tokens'] += stats.get('cacheWriteInputTokens', 0)
        total_stats['total_latency'] += stats.get('latency_ms', 0)

    # Calculate total requests (interactions)
    total_interactions = len(usage_stats)

    # Calculate averages including latency
    avg_latency = total_stats['total_latency'] / total_interactions if total_interactions > 0 else 0

    # Calculate cache effectiveness
    total_token_requests = total_stats['total_tokens'] - total_stats['output_tokens']
    cached_tokens = total_stats['cache_read_tokens']
    # Calculate percentage of tokens that were served from cache
    cache_hit_ratio = (cached_tokens / total_token_requests * 100) if total_token_requests > 0 else 0

    # Print formatted results
    print("\nPerformance Summary:")
    print("=" * 50)
    print(f"Number of Interactions: {total_interactions}")
    print(f"\nToken Usage:")
    print(f"Input Tokens: {total_stats['input_tokens']:,}")
    print(f"Cache Read Tokens: {total_stats['cache_read_tokens']:,}")
    print(f"Cache Write Tokens: {total_stats['cache_write_tokens']:,}")
    print(f"Output Tokens: {total_stats['output_tokens']:,}")
    print(f"Total Tokens Processed: {total_stats['total_tokens']:,}")

    print(f"\nCache Performance:")
    print(f"Cache Hit Ratio: {cache_hit_ratio:.2f}%")
    print("=" * 50)

    print(f"\nLatency Performance:")
    print(f"Total Latency: {total_stats['total_latency']:.2f} ms")
    print(f"Average Latency per Request: {avg_latency:.2f} ms")

    # Print per-interaction breakdown
    print("\nPer-Interaction Breakdown:")
    print("-" * 50)
    for i, stats in enumerate(usage_stats, 1):
        print(f"\nInteraction {i}:")
        print(f"Input Tokens: {stats.get('inputTokens', 0):,}")
        print(f"Cache Read Tokens: {stats.get('cacheReadInputTokens', 0):,}")
        print(f"Cache Write Tokens: {stats.get('cacheWriteInputTokens', 0):,}")
        print(f"Output Tokens: {stats.get('outputTokens', 0):,}")
        print(f"Total Tokens: {stats.get('totalTokens', 0):,}")
        print(f"Latency: {stats.get('latency_ms', 0):.2f} ms")

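        # Calculate per-interaction cache hit ratio on the input side only,
        # using the same denominator (total minus output tokens) as the summary above
        interaction_total = stats.get('totalTokens', 0) - stats.get('outputTokens', 0)
        interaction_cache_reads = stats.get('cacheReadInputTokens', 0)
        interaction_hit_ratio = (interaction_cache_reads / interaction_total * 100) if interaction_total > 0 else 0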
        print(f"Interaction Cache Hit Ratio: {interaction_hit_ratio:.2f}%")
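
The cache hit ratio can be tried out on hand-written usage records before any real calls are made. The sketch below is a variant of the calculation above: it sums the three input-side counters explicitly instead of deriving the denominator from `totalTokens`, and it assumes `inputTokens` counts only tokens that were neither read from nor written to the cache. The sample numbers are made up for illustration.

```python
from typing import Dict, List, Union

Usage = Dict[str, Union[int, float]]


def cache_hit_ratio(usage_stats: List[Usage]) -> float:
    """Percentage of input-side tokens that were served from the cache."""
    read = sum(u.get("cacheReadInputTokens", 0) for u in usage_stats)
    write = sum(u.get("cacheWriteInputTokens", 0) for u in usage_stats)
    fresh = sum(u.get("inputTokens", 0) for u in usage_stats)
    total_input = read + write + fresh
    return read / total_input * 100 if total_input else 0.0


# Made-up usage records: the first call writes the cache, the second reads it.
sample_usage = [
    {"inputTokens": 50, "cacheWriteInputTokens": 1500, "cacheReadInputTokens": 0, "outputTokens": 80},
    {"inputTokens": 120, "cacheWriteInputTokens": 0, "cacheReadInputTokens": 1500, "outputTokens": 60},
]
print(f"Cache hit ratio: {cache_hit_ratio(sample_usage):.2f}%")
```

A ratio computed over a whole conversation is pulled down by the first call, which can only write to the cache, so short conversations show lower ratios than long ones.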

### BaseConversationManager

The `BaseConversationManager` class serves as the foundation for our conversation handling implementations. By centralizing common functionality in this base class, we avoid repeating the same code across Part 1 and Part 2 of this notebook.

In the following sections, we'll implement two different conversation managers that extend this base class:
- Part 1: `ConverseAPIManager`: Uses Amazon Bedrock's native Converse API tool configuration
- Part 2: `FrameworkAgnosticManager`: Uses tool definitions embedded in prompts for compatibility with any LLM framework

In [None]:
class BaseConversationManager:
    """
    Base class for conversation management with common functionality.
    """
    def __init__(
        self,
        tool_function_mappings: Dict[str, callable],
        system_prompt: List[Dict[str, Any]],
        tool_definitions: Optional[List[Dict[str, Any]]] = None,
        max_loops: int = 6
    ):
        self.max_loops = max_loops
        self.tool_definitions = tool_definitions
        self.tool_function_mappings = tool_function_mappings
        self.logger = logging.getLogger(__name__)
        self.system_prompt = system_prompt

    def handle_tool_response(self, tool_use_block: Dict[str, Any]) -> str:
        """
        Processes tool usage and returns appropriate responses.

        Args:
            tool_use_block (Dict[str, Any]): Tool usage information containing name and input parameters

        Returns:
            str: Tool execution result

        Raises:
            ValueError: If the tool name is not found in the registered tools
        """
        try:
            tool_name = tool_use_block['name']
            tool_args = tool_use_block['input']
            self.logger.info(f"Using tool: {tool_name} with inputs: {tool_args}")

            if tool_name not in self.tool_function_mappings:
                raise ValueError(f"Tool '{tool_name}' not found")
            return self.tool_function_mappings[tool_name](**tool_args)

        except Exception as e:
            self.logger.error(f"Tool execution error: {str(e)}")
            raise

    def run_conversation(self, prompt: str) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
        """
        Manages the conversation flow with the assistant.

        Args:
            prompt (str): User's question

        Returns:
            Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
                - First element: List of conversation messages
                - Second element: List of usage statistics
        """
        message_list = [{
            "role": "user",
            "content": [{"text": prompt}]
        }]
        usage_stats = []
        loop_count = 0

        while loop_count < self.max_loops:
            try:
                # Get response from Bedrock
                self.logger.debug(f"Sending message list to Bedrock: {json.dumps(message_list)}")
                response = call_bedrock(
                    message_list,
                    self.tool_definitions,
                    self.system_prompt
                )

                # Process response
                response_message = response['output']['message']
                usage_stats.append(response['usage'])
                message_list.append(response_message)
                self.logger.info(f"response_message: {response_message}")

                # Check for tool usage - this is where implementations differ
                should_continue, tool_response = self._process_tool_usage(response_message)

                if not should_continue:
                    break

                if tool_response:
                    message_list.append(tool_response)

                self.logger.info(f"message_list: {json.dumps(message_list)}")
                loop_count += 1

            except Exception as e:
                self.logger.error(f"Conversation error: {str(e)}")
                raise

        return message_list, usage_stats

    def _process_tool_usage(self, response_message):
        """
        Abstract method to be implemented by subclasses.
        Process tool usage from the response message.

        Returns:
            tuple: (should_continue, tool_response_message)
        """
        raise NotImplementedError("Subclasses must implement this method")
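
The `(should_continue, tool_response_message)` contract can be exercised without Bedrock. The sketch below reproduces the control flow of `run_conversation` in a standalone function and drives it with a scripted fake model. `run_loop`, `fake_model`, and `process_tool_usage` are hypothetical stand-ins, not part of the notebook's classes.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

Message = Dict[str, Any]


def run_loop(
    call_model: Callable[[List[Message]], Message],
    process: Callable[[Message], Tuple[bool, Optional[Message]]],
    prompt: str,
    max_loops: int = 6,
) -> List[Message]:
    """Call the model until it stops requesting tools or max_loops is reached."""
    messages: List[Message] = [{"role": "user", "content": [{"text": prompt}]}]
    for _ in range(max_loops):
        reply = call_model(messages)
        messages.append(reply)
        should_continue, tool_message = process(reply)
        if not should_continue:
            break
        if tool_message:
            messages.append(tool_message)
    return messages


# Scripted replies: one tool request, then a final text answer.
scripted = iter([
    {"role": "assistant", "content": [{"toolUse": {"toolUseId": "t1", "name": "echo", "input": {"value": "hi"}}}]},
    {"role": "assistant", "content": [{"text": "done"}]},
])


def fake_model(messages: List[Message]) -> Message:
    return next(scripted)


def process_tool_usage(reply: Message) -> Tuple[bool, Optional[Message]]:
    last = reply["content"][-1]
    if "toolUse" not in last:
        return False, None  # plain text answer: stop the loop
    use = last["toolUse"]
    result = {"toolUseId": use["toolUseId"], "content": [{"text": use["input"]["value"]}]}
    return True, {"role": "user", "content": [{"toolResult": result}]}


conversation = run_loop(fake_model, process_tool_usage, "say hi")
print([m["role"] for m in conversation])
```

Because the loop is bounded by `max_loops`, a model that keeps requesting tools cannot run indefinitely; the conversation simply ends with the last tool result.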

### Base System Prompt Function

The `create_base_system_prompt` function creates the foundation for our HR assistant's personality and capabilities. This function is part of our shared utilities because it's used by both implementation approaches (Converse API and Framework-Agnostic).

#### Purpose:
- Establishes the HR assistant's core identity, responsibilities, and behavioral guidelines
- Creates a consistent personality across different implementation approaches
- Handles the addition of cache points for the system instructions when caching is enabled

The system prompt is intentionally verbose and structured with multiple sections. This is not just for clarity, but also to meet the minimum token requirements for effective caching. Different LLMs have different minimum token thresholds for cache checkpoints (for example, Claude 3.7 Sonnet requires at least 1,024 tokens per checkpoint). By providing detailed instructions in a structured format, we ensure the prompt meets these requirements while also giving the model comprehensive guidance.

By centralizing these instructions in a shared function, we ensure consistency across implementations while making it easier to update the assistant's core capabilities in one place. The function also handles the strategic placement of cache points to optimize token usage when prompt caching is enabled.

In [None]:
def create_base_system_prompt(caching_activated: bool = False) -> List[Dict[str, Any]]:
    """
    Create the base system prompt with HR assistant instructions.

    This function generates a structured system prompt that defines the HR assistant's
    personality, responsibilities, and behavioral guidelines. The prompt is intentionally
    verbose to meet minimum token requirements for effective caching.

    Args:
        caching_activated (bool): Whether to add cache points to the system prompt

    Returns:
        List[Dict[str, Any]]: A list of dictionaries containing text blocks and optional
                             cache points. Each dictionary has either a "text" key with
                             prompt content or a "cachePoint" key with cache configuration.
    """

    system_prompt = [
        {
            "text": """<task_description>
            You are an expert HR Virtual Assistant working for a large enterprise organization. Your role is to provide accurate, professional, and empathetic support to employees regarding HR matters, with a focus on leave management and HR policies.
            </task_description>

            <responsibilities>
            - Assist employees with leave-related inquiries including:
            - Vacation time
            - Sick leave
            - FMLA (Family and Medical Leave Act)
            - Other types of leave
            - Provide clear explanations of HR policies and procedures
            - Help employees understand their benefits and entitlements
            - Guide employees through HR-related processes
            - Maintain strict confidentiality of all employee information
            </responsibilities>

            <interaction_guidelines>
            - Always maintain a professional, friendly, and empathetic tone
            - Verify employee identity before providing personal information
            - Be clear and concise in your explanations
            - When uncertain, acknowledge limitations and offer to escalate to human HR representatives
            - Use inclusive and respectful language
            - Provide relevant policy references when applicable
            </interaction_guidelines>

            <key_behaviors>
            - Prioritize data privacy and confidentiality
            - Focus on accuracy and compliance with company policies
            - Show empathy while maintaining professional boundaries
            - Escalate sensitive situations to human HR representatives
            - Document interactions appropriately
            - Avoid making promises or guarantees about approvals
            </key_behaviors>

            <limitations_and_boundaries>
            - Do not provide legal advice
            - Do not make decisions about policy exceptions
            - Do not discuss other employees' information
            - Do not handle grievances or complaints
            - Do not provide medical advice
            - Do not discuss compensation changes or negotiations
            </limitations_and_boundaries>

            <security_protocol>
            - Always verify employee identity before accessing personal information
            - Only provide information relevant to the requesting employee
            - Follow data privacy guidelines and GDPR/CCPA compliance requirements
            - Log all sensitive data access appropriately
            </security_protocol>"""
        }
    ]

    # Add cache point for the system prompt if caching is activated
    if caching_activated:
        system_prompt.append({
            "cachePoint": {
                "type": "default"
            }
        })

    return system_prompt
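
Whether a prompt is likely to reach a model's minimum checkpoint size can be estimated before sending it. The sketch below uses a rough heuristic of about four characters per token, which is an assumption and not how the model's tokenizer works. The authoritative signal is the `cacheWriteInputTokens` value in the response usage. `estimate_tokens` and `meets_cache_minimum` are hypothetical helpers.

```python
from typing import Any, Dict, List, Tuple


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; real tokenizers will differ."""
    return int(len(text) / chars_per_token)


def meets_cache_minimum(
    blocks: List[Dict[str, Any]], min_tokens: int = 1024
) -> Tuple[bool, int]:
    """Estimate whether the text blocks add up to at least min_tokens."""
    # Only text blocks carry content; cachePoint blocks are markers.
    text = "".join(block["text"] for block in blocks if "text" in block)
    estimated = estimate_tokens(text)
    return estimated >= min_tokens, estimated


short_prompt = [
    {"text": "You are an HR assistant."},
    {"cachePoint": {"type": "default"}},
]
ok, estimated = meets_cache_minimum(short_prompt)
print(f"Estimated tokens: {estimated}, likely cacheable: {ok}")
```

If a first call reports zero cache-write tokens even though a cache point is present, a prefix below the model's minimum is one possible explanation worth checking.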

## Tool Registry

The `ToolRegistry` class serves as a central repository for our HR tools. It handles:
- Registering tool definitions with their implementation functions
- Formatting tools appropriately for API calls
- Managing cache points for tool definitions
- Converting tool definitions to different formats as needed

This abstraction allows us to maintain consistent tool functionality while adapting the presentation format for different implementation approaches.

In [None]:
class ToolRegistry:
    """Simple tool registry for HR tools that stores definitions in the Converse API toolSpec format"""
    def __init__(self):
        self.tools: List[Dict[str, Any]] = []
        self.caching_enabled: bool = False
        self._functions: Dict[str, callable] = {}  # Private dictionary to store tool functions

    def add_tool(
        self,
        name: str,
        description: str,
        properties: Dict[str, Any],
        function: callable,
        required: Optional[List[str]] = None
    ) -> None:
        """
        Add a tool with specified schema and its implementation function.

        Args:
            name (str): Name of the tool
            description (str): Tool description
            properties (dict): Schema properties for the tool
            function (callable): Function to be called when tool is invoked
            required (list, optional): List of required parameters
        """
        tool = {
            'toolSpec': {
                'name': name,
                'description': description,
                'inputSchema': {
                    'json': {
                        "type": "object",
                        "properties": properties,
                    }
                }
            },
        }

        # Add required fields if specified
        if required:
            tool['toolSpec']['inputSchema']['json']['required'] = required

        self.tools.append(tool)
        self._functions[name] = function

    def set_caching(self, enabled=False):
        """Enable or disable prompt caching for tools"""
        self.caching_enabled = enabled

    def execute_tool(self, name, **kwargs):
        """
        Execute a registered tool function.

        Args:
            name (str): Name of the tool to execute
            **kwargs: Arguments to pass to the tool function
        """
        if name not in self._functions:
            raise ValueError(f"Tool '{name}' not found")
        return self._functions[name](**kwargs)

    def get_tools(self):
        """
        Get list of all registered tools with optional caching configuration.

        When caching is enabled (self.caching_enabled = True), appends a cachePoint
        configuration to the tools list. This allows Bedrock to cache static parts
        of the prompt, reducing token usage and latency for subsequent calls.

        Returns:
            list: List of tool specifications, optionally including cache configuration
        """
        tools = self.tools.copy()  # Create a copy to avoid modifying the original list
        if self.caching_enabled:
            tools.append({
                "cachePoint": {
                    "type": "default"
                }
            })
        return tools

    def get_tools_as_json_string(self):
        """Get a JSON string representation of all registered tools."""
        tools_json = []
        for tool in self.tools:
            if 'toolSpec' in tool:
                tool_spec = tool['toolSpec']
                properties = tool_spec['inputSchema']['json']['properties']
                required = tool_spec['inputSchema']['json'].get('required', [])

                tool_json = {
                    "name": tool_spec['name'],
                    "description": tool_spec['description'],
                    "parameters": {
                        "type": "object",
                        "properties": properties,
                        "required": required
                    }
                }
                tools_json.append(tool_json)

        return "<tools>\n" + json.dumps(tools_json, indent=2) + "\n</tools>"

    def get_tool_function_mapping(self):
        """
        Get mapping of tool names to their mapped function.

        Returns:
            dict: Dictionary mapping tool names to their corresponding functions
        """
        return self._functions.copy()


### HR Tools Implementation
Now that we've created our ToolRegistry class, let's populate it with specific HR management tools. These tools will demonstrate how to structure tool definitions that benefit from prompt caching while maintaining their functionality.

In [None]:
def create_hr_tools():
    """
    Create HR tools with proper schema and functions.

    Returns:
        ToolRegistry: A registry containing HR-related tools with their implementations
    """
    registry = ToolRegistry()

    # Add leave balances tool
    def get_leave_balances(employee_id, leave_type=None, as_of_date=None):
        # Example implementation - in production, this would query a database
        return "You have 10 days of vacation, 5 sick days remaining"

    registry.add_tool(
        name='get_leave_balances',
        description='Get all available leave balances for different leave types',
        properties={
            "employee_id": {
                "type": "integer",
                "description": "the id of the employee"
            },
            "leave_type": {
                "type": "string",
                "description": "specific type of leave to check",
                "enum": ["vacation", "sick", "personal", "floating_holiday", "parental", "bereavement", "all"]
            },
            "as_of_date": {
                "type": "string",
                "description": "date to check balances for (defaults to current date)"
            }
        },
        required=["employee_id"],
        function=get_leave_balances
    )

    # Add vacation reservation tool
    def reserve_vacation_time(employee_id, start_date, end_date):
        return f"Vacation reserved from {start_date} to {end_date}"

    registry.add_tool(
        name='reserve_vacation_time',
        description='reserve vacation time for a specific employee - you need all parameters to reserve vacation time',
        properties={
            "employee_id": {
                "type": "integer",
                "description": "the id of the employee for which time off will be reserved"
            },
            "start_date": {
                "type": "string",
                "description": "the start date for the vacation time"
            },
            "end_date": {
                "type": "string",
                "description": "the end date for the vacation time"
            }
        },
        required=["employee_id", "start_date", "end_date"],
        function=reserve_vacation_time
    )

    # Add leave policy tool
    def get_leave_policy_info(policy_type, employee_id=None, state=None):
        return f"Leave policy information for {policy_type}"

    registry.add_tool(
        name='get_leave_policy_info',
        description='Retrieve information about leave policies and eligibility',
        properties={
            "policy_type": {
                "type": "string",
                "description": "type of leave policy to query",
                "enum": ["vacation", "sick", "fmla", "std", "ltd", "parental", "bereavement", "all"]
            },
            "employee_id": {
                "type": "integer",
                "description": "employee id to check eligibility (optional)"
            },
            "state": {
                "type": "string",
                "description": "state code for state-specific policies"
            }
        },
        required=["policy_type"],
        function=get_leave_policy_info
    )

    return registry

Now that we've established our shared utilities, we'll implement two different approaches to prompt caching:

1. **Part 1: Native Converse API** - Using Bedrock's built-in tool configuration
2. **Part 2: Framework-Agnostic** - Embedding tool definitions in prompts

Each approach demonstrates different integration patterns while achieving similar caching benefits. Let's start with the native Converse API approach.

# Part 1: Prompt Caching with Converse API

With our Bedrock integration in place, we can now implement prompt caching using Amazon Bedrock's Converse API. This approach leverages the native caching capabilities of the API, making it ideal for applications that directly interact with Bedrock.

#### Verify the tool configurations
Let's test the tools and see what the final tool specification looks like.

In [None]:
hr_tools = create_hr_tools()
print("Tools without caching:")
print(json.dumps(hr_tools.get_tools(), indent=4))

# Now enable caching and see the difference
hr_tools.set_caching(True)
print("\nTools with caching enabled:")
print(json.dumps(hr_tools.get_tools(), indent=4))

## System Prompt for Converse API Approach

The system prompt for the Converse API approach focuses on providing the HR assistant's personality and guidelines. It doesn't need to include tool definitions directly because these are provided separately through the Converse API's `toolConfig` parameter.

Key characteristics:
- Defines the assistant's role and responsibilities
- Sets interaction guidelines and boundaries
- Includes cache points for efficient token usage
- Relies on Bedrock's native tool handling

In [None]:
def create_converse_api_system_prompt(caching_activated=False):
    """Create system prompt for Converse API approach (Part 1)"""

    # Get the base system prompt
    system_prompt = create_base_system_prompt(caching_activated)

    # Add Converse API specific instructions
    system_prompt.append({
        "text": """<tool_usage_guidelines>
        You have access to several tools that can help you assist employees with their HR inquiries.
        Use these tools when you need specific information about leave balances, policies, or to make reservations.
        The system will automatically process your tool requests through the Converse API.
        </tool_usage_guidelines>"""
    })

    # Add cache point for the additional instructions if caching is activated
    if caching_activated:
        system_prompt.append({
            "cachePoint": {
                "type": "default"
            }
        })

    return system_prompt

## Conversation Handler
The `ConverseAPIManager` extends our base class to implement conversation handling using Amazon Bedrock's native Converse API tool configuration. This approach leverages the built-in capabilities of the API for managing tool interactions.

### Key Features:

1. **Native Tool Integration**: Uses Bedrock's built-in tool configuration format, allowing the model to directly invoke tools through the API.

2. **Structured Tool Results**: Formats tool results using the `toolResult` structure expected by the Converse API.

3. **Efficient Caching**: Places a cache point after the tool definitions in the API request structure, so the tool list is cached as part of the prompt prefix.

This implementation represents the most direct way to leverage Amazon Bedrock's prompt caching capabilities, making it ideal for applications that interact directly with Bedrock without intermediate frameworks.

In [None]:
class ConverseAPIManager(BaseConversationManager):
    """
    Conversation manager using native Bedrock Converse API tool configuration.
    """
    def _create_tool_result_message(
        self,
        tool_use_id: str,
        content: str,
        status: Optional[str] = None
    ) -> Dict[str, Any]:
        """
        Creates a properly formatted tool result message for Converse API.

        Args:
            tool_use_id (str): ID of the tool use from the model's response
            content (str): Result content from the tool execution
            status (str, optional): Status of the tool execution (e.g., "error")

        Returns:
            Dict[str, Any]: A properly formatted message with toolResult structure
                           that can be sent back to the model
        """
        message = {
            "role": "user",
            "content": [{
                "toolResult": {
                    "toolUseId": tool_use_id,
                    "content": [{"text": content}]
                }
            }]
        }

        if status:
            message["content"][0]["toolResult"]["status"] = status

        return message

    def _process_tool_usage(self, response_message):
        """
        Process tool usage from Converse API response format.

        Returns:
            tuple: (should_continue, tool_response_message)
        """
        # Check for tool usage in last content item
        last_content = response_message["content"][-1]
        if 'toolUse' not in last_content:
            return False, None

        # Handle tool execution
        tool_use_block = last_content['toolUse']
        tool_use_id = tool_use_block['toolUseId']

        try:
            tool_response = self.handle_tool_response(tool_use_block)
            return True, self._create_tool_result_message(tool_use_id, tool_response)
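        except Exception as e:
            # Send the error back to the model as a toolResult and keep the loop going.
            # Returning False here would end the conversation and discard this message.
            return True, self._create_tool_result_message(
                tool_use_id,
                repr(e),
                status="error"
            )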
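
The tool round trip can be checked against a hand-written assistant message, with no model call involved. The sketch below mirrors the `toolUse` and `toolResult` shapes used above and shows both the success path and the error path. `make_tool_result`, `dispatch`, and the sample tool are hypothetical names for illustration.

```python
from typing import Any, Callable, Dict, Optional

Message = Dict[str, Any]


def make_tool_result(tool_use_id: str, text: str, status: Optional[str] = None) -> Message:
    """Build a user message carrying a toolResult block."""
    result: Dict[str, Any] = {"toolUseId": tool_use_id, "content": [{"text": text}]}
    if status:
        result["status"] = status
    return {"role": "user", "content": [{"toolResult": result}]}


def dispatch(message: Message, functions: Dict[str, Callable[..., Any]]) -> Optional[Message]:
    """Run the tool requested in the last content block, if any."""
    last = message["content"][-1]
    if "toolUse" not in last:
        return None
    use = last["toolUse"]
    try:
        if use["name"] not in functions:
            raise ValueError(f"Tool '{use['name']}' not found")
        output = functions[use["name"]](**use["input"])
        return make_tool_result(use["toolUseId"], str(output))
    except Exception as exc:
        # Report the failure to the model instead of raising.
        return make_tool_result(use["toolUseId"], repr(exc), status="error")


def get_leave_balances(employee_id: int) -> str:
    return f"Employee {employee_id} has 10 vacation days"  # canned sample data


assistant_message = {
    "role": "assistant",
    "content": [
        {"text": "Let me check that for you."},
        {"toolUse": {"toolUseId": "tool-1", "name": "get_leave_balances", "input": {"employee_id": 123}}},
    ],
}

ok_result = dispatch(assistant_message, {"get_leave_balances": get_leave_balances})
error_result = dispatch(assistant_message, {})  # empty registry triggers the error path
print(ok_result["content"][0]["toolResult"])
print(error_result["content"][0]["toolResult"])
```

The `toolUseId` in the result must match the one in the request so the model can pair the two.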

## Example Usage without Prompt Caching
Let's try out our HR agent with a vacation-related query without prompt caching to establish a baseline.

In [None]:
# Set up without caching
CACHING_ACTIVATED = False

# Initialize the system prompt
system_prompt = create_converse_api_system_prompt(caching_activated=CACHING_ACTIVATED)

# Initialize registry
tool_registry = create_hr_tools()
tool_registry.set_caching(CACHING_ACTIVATED)

# Create conversation handler
handler = ConverseAPIManager(
    tool_definitions=tool_registry.get_tools(),
    tool_function_mappings=tool_registry.get_tool_function_mapping(),
    system_prompt=system_prompt
)

# Run a conversation
messages, usage_stats = handler.run_conversation("How many vacation days do I have? My employee ID is 123")

print("\n===================== ANSWER =====================")
print(messages[-1]['content'][0]["text"])
print("==================================================")

# Display performance metrics
analyze_performance(usage_stats)

### Understanding the Results
This example demonstrates the baseline performance without prompt caching enabled. Key observations:

#### Performance Metrics
1. **Token Usage**
   - Each interaction processes all input tokens
   - Total token consumption is high due to repeated processing of tool definitions and system prompt
   - No cache hits (0%) as caching is disabled

2. **Latency**
   - No latency improvements between first and subsequent calls

3. **Cache Statistics**
   - Cache read/write tokens: 0 (expected with caching disabled)
   - Cache hit ratio: 0% (all prompts fully processed)

These metrics serve as our baseline for comparing against the cached implementation in the next section.
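
As a rough illustration of how a cache hit ratio can be derived, the sketch below aggregates the `usage` dictionaries returned by the Converse API. It assumes that `inputTokens` counts only tokens that were neither read from nor written to the cache, and that `cacheReadInputTokens` and `cacheWriteInputTokens` are absent or zero when caching is disabled. The helper name `cache_hit_ratio` and the sample numbers are hypothetical; the notebook's own `analyze_performance` may compute this differently.

```python
from typing import Dict, Iterable


def cache_hit_ratio(usage_records: Iterable[Dict[str, int]]) -> float:
    """Return the share of prompt tokens served from cache (0.0 to 1.0)."""
    cache_read = 0
    total_prompt = 0
    for usage in usage_records:
        read = usage.get("cacheReadInputTokens", 0)
        write = usage.get("cacheWriteInputTokens", 0)
        fresh = usage.get("inputTokens", 0)  # assumed to exclude cached tokens
        cache_read += read
        total_prompt += read + write + fresh
    # Guard against an empty run so we never divide by zero
    return cache_read / total_prompt if total_prompt else 0.0


# Illustrative numbers only, not measured results
baseline_usage = [
    {"inputTokens": 2000, "outputTokens": 80},
    {"inputTokens": 2100, "outputTokens": 120},
]
print(f"Cache hit ratio: {cache_hit_ratio(baseline_usage):.0%}")
```

With caching disabled, no record carries cache read tokens, so the ratio stays at zero regardless of how many calls the conversation makes.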

## Example Usage with Prompt Caching
Now let's run the same query with prompt caching enabled to see the performance improvements.

In [None]:
# Set up with caching
CACHING_ACTIVATED = True

# Initialize the system prompt with caching
system_prompt = create_converse_api_system_prompt(caching_activated=CACHING_ACTIVATED)

# Initialize registry with caching
tool_registry = create_hr_tools()
tool_registry.set_caching(CACHING_ACTIVATED)

# Create conversation handler
handler = ConverseAPIManager(
    tool_definitions=tool_registry.get_tools(),
    tool_function_mappings=tool_registry.get_tool_function_mapping(),
    system_prompt=system_prompt,
)

# Run a conversation
messages, usage_stats = handler.run_conversation("How many vacation days do I have? My employee ID is 123")

print("\n===================== ANSWER =====================")
print(messages[-1]['content'][0]["text"])
print("==================================================")

# Display performance metrics
analyze_performance(usage_stats)

### Understanding Cached Performance
This example demonstrates the performance improvements achieved with prompt caching enabled. Let's analyze the key differences:

#### Performance Improvements
1. **Token Processing Efficiency**
   - Uncached input tokens drop sharply compared to the non-cached version, because the static prefix is read from the cache instead of being reprocessed
   - Significant cache hit ratio shows effective reuse of cached prompts
   - Each interaction benefits from cached tokens

2. **Key Benefits**
   - Reduced token consumption for input processing
   - Consistent cache hit ratios
   - Lower costs due to reduced token processing
   - Similar response quality despite reduced token processing

The results demonstrate how prompt caching significantly reduces token processing while maintaining the same quality of responses. This efficiency is particularly valuable for production deployments where cost and performance optimization are crucial.

# Part 2: Framework-Agnostic Prompt Caching with Tool Definitions in Prompts

While the Bedrock Converse API provides excellent built-in support for tool definitions and caching, many organizations use open-source frameworks like LangChain, LlamaIndex, or custom solutions. In this section, we'll explore how to implement prompt caching in a framework-agnostic way that works with any LLM framework.

The key difference in this approach is that we'll:
1. Include tool definitions directly in the system prompt as structured text
2. Add cache points strategically around these definitions
3. Parse tool invocations from the LLM's text output
4. Execute tools and return results in the conversation flow

This approach offers greater flexibility and compatibility with existing agent implementations while still leveraging the performance benefits of prompt caching.

## Tool Definitions in Prompts

Instead of using the Bedrock Converse API's tool configuration, we'll include tool definitions directly in the system prompt. 

Below, we'll create a structured text representation of our HR tools that can be included in the system prompt:

In [None]:
hr_tools_as_string = hr_tools.get_tools_as_json_string()
print(hr_tools_as_string)

## Framework-Agnostic Approach with Tool Definitions in Prompts

In this approach, we include tool definitions directly in the system prompt as structured text rather than using the Bedrock Converse API's native tool configuration. This makes our implementation compatible with any LLM framework or direct API calls.

### Key Advantages:
1. **Universal Compatibility**: Works with any LLM framework that supports system prompts
2. **Simplified Integration**: Easier to integrate with existing agent implementations
3. **Consistent Caching Benefits**: Achieves similar token and latency savings

### System Prompt Characteristics:
- Includes all the same HR assistant guidelines as the Converse API approach
- Embeds tool definitions directly in the prompt as structured JSON
- Provides explicit instructions for tool invocation format using `<tool_use>` tags
- Places cache points strategically around large static sections for optimal caching

The LLM will parse these tool definitions and use them to guide its responses, while the caching mechanism ensures we don't repeatedly process the same static text. This approach is particularly valuable for organizations using multiple LLM frameworks or integrating with existing systems.

In [None]:
def create_framework_agnostic_system_prompt(tools_json_string, caching_activated=False):
    """Create system prompt for Framework-Agnostic approach"""

    # Get the base system prompt
    system_prompt = create_base_system_prompt(caching_activated)

    # Add tool definitions and framework-agnostic instructions
    system_prompt.append({
        "text": f"""
        {tools_json_string}
        <tool_usage_guidelines>
        When you determine that you need to use a tool, you MUST format your response using the following JSON structure:

        <tool_use>
        {{
          "name": "tool_name",
          "input": {{
            "param1": "value1",
            "param2": "value2"
          }}
        }}
        </tool_use>

        After sending this format, the application will:
        1. Parse your tool call
        2. Execute the tool with the provided parameters
        3. Return control to you with the tool's output
        </tool_usage_guidelines>
        """
    })

    # Add cache point for tools if caching is activated
    if caching_activated:
        system_prompt.append({
            "cachePoint": {
                "type": "default"
            }
        })

    return system_prompt

In [None]:
CACHING_ACTIVATED = True
create_framework_agnostic_system_prompt(hr_tools_as_string, CACHING_ACTIVATED)
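
To make the role of the cache point concrete, here is a minimal sketch of how a system prompt list like the one returned above could be placed into a Converse API request. The helper `build_converse_request`, the placeholder model ID, and the shortened prompt text are assumptions for illustration; the notebook's `BaseConversationManager` may assemble its requests differently. The actual API call is left as a comment because it needs AWS credentials and model access.

```python
from typing import Any, Dict, List


def build_converse_request(
    model_id: str,
    system_prompt: List[Dict[str, Any]],
    user_text: str,
) -> Dict[str, Any]:
    """Assemble keyword arguments for a Converse call (hypothetical helper)."""
    return {
        "modelId": model_id,
        # Text blocks followed by a cachePoint block: everything before the
        # cache point is eligible for caching
        "system": system_prompt,
        "messages": [{"role": "user", "content": [{"text": user_text}]}],
    }


# Shortened stand-in for the real system prompt
system_prompt = [
    {"text": "You are an HR assistant. <tool definitions and guidelines go here>"},
    {"cachePoint": {"type": "default"}},
]

request = build_converse_request(
    model_id="your-model-id",  # placeholder, replace with a model that supports caching
    system_prompt=system_prompt,
    user_text="How many vacation days do I have? My employee ID is 123",
)
print(request["system"][-1])

# With credentials configured, the request could be sent like this:
#   import boto3
#   client = boto3.client("bedrock-runtime")
#   response = client.converse(**request)
#   print(response["usage"])
```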

## Conversation Handler

The `FrameworkAgnosticManager` extends our base class to implement a framework-agnostic approach to conversation handling. Instead of relying on Bedrock's native tool configuration, this implementation embeds tool definitions directly in the system prompt as structured text.

### Key Features:

1. **Universal Compatibility**: Works with any LLM framework that supports system prompts, not just the Bedrock Converse API.

2. **Text-Based Tool Invocation**: Parses tool calls from the model's text output using regex pattern matching.

3. **Flexible Integration**: Can be adapted to work with open-source frameworks like LangChain, LlamaIndex, or custom solutions.

4. **Prompt-Based Caching**: Demonstrates how to implement prompt caching even when using text-based tool definitions.

In [None]:
class FrameworkAgnosticManager(BaseConversationManager):
    """
    Conversation manager using tool definitions in prompts for framework compatibility.
    """
    def _create_tool_result_message(self, content):
        """
        Creates a properly formatted tool result message for text-based responses.
        """
        return {
            "role": "user",
            "content": [{
                "text": content
            }]
        }

    def extract_tool_call(self, response_text: Dict[str, str]) -> Optional[Dict[str, Any]]:
        """
        Extract tool call JSON from the model's response text.

        This method uses regex pattern matching to find and parse tool invocations
        that are formatted as JSON within <tool_use> tags in the model's response.

        Args:
            response_text (Dict[str, str]): Content item from the model's response
                                        containing the 'text' key

        Returns:
            Optional[Dict[str, Any]]: Parsed tool call with 'name' and 'input' keys if found,
                                    None if no tool call is present or parsing fails
        """
        tool_call_pattern = r"<tool_use>\s*(\{.*?\})\s*</tool_use>"
        text_content = response_text['text']

        match = re.search(tool_call_pattern, text_content, re.DOTALL)
        if match:
            try:
                tool_call_json = match.group(1)
                return json.loads(tool_call_json)
            except json.JSONDecodeError:
                print("Failed to parse tool call JSON")
                return None
        return None

    def _process_tool_usage(self, response_message):
        """
        Process tool usage from text-based response format.

        Returns:
            tuple: (should_continue, tool_response_message)
        """
        # Check for tool usage in last content item
        last_content = response_message["content"][-1]

        # Check if the response contains a tool call
        tool_call = self.extract_tool_call(last_content)
        if not tool_call:
            return False, None

        try:
            tool_response = self.handle_tool_response(tool_call)
            return True, self._create_tool_result_message(tool_response)
        except Exception as e:
            return False, self._create_tool_result_message(repr(e))
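
The parsing step can be tried in isolation. The sketch below reuses the same `<tool_use>` pattern as `extract_tool_call`, but as a standalone function that takes a plain string. The function name `parse_tool_call`, the tool name `get_vacation_balance`, and the sample replies are hypothetical and serve only to illustrate the behavior.

```python
import json
import re
from typing import Any, Dict, Optional

# Same pattern as in extract_tool_call: a JSON object wrapped in <tool_use> tags
TOOL_CALL_PATTERN = re.compile(r"<tool_use>\s*(\{.*?\})\s*</tool_use>", re.DOTALL)


def parse_tool_call(text: str) -> Optional[Dict[str, Any]]:
    """Return the parsed tool call, or None if absent or malformed."""
    match = TOOL_CALL_PATTERN.search(text)
    if not match:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None


# Hypothetical model reply containing a tool call with a nested "input" object
sample_reply = """I'll look that up for you.
<tool_use>
{
  "name": "get_vacation_balance",
  "input": {"employee_id": "123"}
}
</tool_use>"""

print(parse_tool_call(sample_reply))
print(parse_tool_call("You have 12 vacation days left."))
```

Although `.*?` is non-greedy, the pattern still captures nested objects: the match can only end at a closing brace that is followed by optional whitespace and `</tool_use>`, so the regex engine backtracks past inner braces until it reaches the outermost one.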

To quantify the benefits of prompt caching, we'll analyze key performance metrics from both cached and non-cached implementations. This analysis will help us understand:

1. How much token consumption is reduced
2. The impact on response latency
3. The effectiveness of our caching strategy through cache hit ratios

Let's start by establishing a baseline with our non-cached implementation.
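
The comparison described above can be sketched with a small helper that reports the relative reduction between two runs. The per-call record layout (including the `latency_ms` key), the function names, and all numbers are hypothetical; they are placeholders rather than measurements, and the structure of `usage_stats` in this notebook may differ.

```python
from statistics import mean
from typing import Any, Dict, List


def percent_reduction(baseline: float, optimized: float) -> float:
    """Relative reduction of `optimized` versus `baseline`, in percent."""
    if baseline == 0:
        return 0.0
    return (baseline - optimized) / baseline * 100


def summarize(calls: List[Dict[str, Any]]) -> Dict[str, float]:
    """Aggregate uncached input tokens and mean latency over a run."""
    return {
        "uncached_input_tokens": sum(c.get("inputTokens", 0) for c in calls),
        "mean_latency_ms": mean(c["latency_ms"] for c in calls),
    }


# Placeholder records, not measured results
baseline_calls = [
    {"inputTokens": 2000, "latency_ms": 1500},
    {"inputTokens": 2100, "latency_ms": 1600},
]
cached_calls = [
    {"inputTokens": 200, "cacheWriteInputTokens": 1800, "latency_ms": 1550},
    {"inputTokens": 300, "cacheReadInputTokens": 1800, "latency_ms": 900},
]

base, cached = summarize(baseline_calls), summarize(cached_calls)
for metric in base:
    print(f"{metric}: {percent_reduction(base[metric], cached[metric]):.1f}% lower")
```

In the placeholder data, the first cached call writes to the cache and only the second one reads from it, which is why latency gains typically appear from the second call onward.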

## Example Usage without Prompt Caching
Let's try out our HR agent with a vacation-related query without prompt caching to establish a baseline.

In [None]:
# Set up without caching
CACHING_ACTIVATED = False

# Initialize the system prompt
hr_tools_as_string = hr_tools.get_tools_as_json_string()
system_prompt = create_framework_agnostic_system_prompt(
    tools_json_string=hr_tools_as_string,
    caching_activated=CACHING_ACTIVATED
)

# Create conversation handler
hr_tool_function_mapping = hr_tools.get_tool_function_mapping()
handler = FrameworkAgnosticManager(tool_function_mappings=hr_tool_function_mapping, system_prompt=system_prompt)

# Run a conversation
messages, usage_stats = handler.run_conversation("How many vacation days do I have? My employee ID is 123")

print("\n===================== ANSWER =====================")
print(messages[-1]['content'][0]["text"])
print("==================================================")

# Display performance metrics
analyze_performance(usage_stats)

## Example Usage with Prompt Caching
Now let's run the same query with prompt caching enabled to see the performance improvements.

In [None]:
# Set up with caching
CACHING_ACTIVATED = True

# Initialize the system prompt
hr_tools_as_string = hr_tools.get_tools_as_json_string()
system_prompt = create_framework_agnostic_system_prompt(
    tools_json_string=hr_tools_as_string,
    caching_activated=CACHING_ACTIVATED
)

# Create conversation handler
hr_tool_function_mapping = hr_tools.get_tool_function_mapping()
handler = FrameworkAgnosticManager(tool_function_mappings=hr_tool_function_mapping, system_prompt=system_prompt)

# Run a conversation
messages, usage_stats = handler.run_conversation("How many vacation days do I have? My employee ID is 123")

print("\n===================== ANSWER =====================")
print(messages[-1]['content'][0]["text"])
print("==================================================")

# Display performance metrics
analyze_performance(usage_stats)

# Conclusion

In this notebook, we've explored two approaches to implementing prompt caching with Amazon Bedrock:

1. **Direct Bedrock Converse API Integration**: Using cache points directly in the API calls for system prompts and tool definitions
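2. **Framework-Agnostic Integration**: Embedding tool definitions in the system prompt and parsing tool calls from the model's text output, so prompt caching also works with open-source frameworks like LangChain or custom solutions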

## Key Takeaways

- **Performance Improvements**: Prompt caching can significantly reduce token usage and latency
- **Cost Savings**: Fewer tokens processed means lower costs at scale
- **Flexibility**: Caching can be integrated with both direct API calls and open source frameworks
- **Monitoring**: Tracking cache hit ratios and latency helps optimize performance

## Best Practices

1. **Identify Static Components**: Look for parts of your prompts that don't change between requests
2. **Strategic Cache Points**: Place cache points after large static sections like system prompts and tool definitions
3. **Version Management**: Include version markers before cache points to invalidate caches when tools or prompts change
4. **Performance Monitoring**: Track cache hit ratios and latency to ensure caching is effective
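
As one possible way to apply the version management practice, the sketch below prefixes the static prompt text with a version marker and places the cache point after it. Because caching matches on the exact content that precedes the cache point, changing the marker changes the prefix, so the next request writes a fresh cache entry instead of reusing the old one. The helper name, the `<prompt_version>` tag, and the version strings are hypothetical.

```python
from typing import Any, Dict, List


def with_versioned_cache_point(static_text: str, version: str) -> List[Dict[str, Any]]:
    """Build system prompt blocks with a version marker ahead of the cache point."""
    return [
        # The marker is part of the cached prefix, so bumping it changes the prefix
        {"text": f"<prompt_version>{version}</prompt_version>\n{static_text}"},
        {"cachePoint": {"type": "default"}},
    ]


static_instructions = "You are an HR assistant. <tool definitions go here>"

prompt_v1 = with_versioned_cache_point(static_instructions, "hr-agent-v1")
prompt_v2 = with_versioned_cache_point(static_instructions, "hr-agent-v2")

# The text blocks differ, so the two versions cannot share a cache entry
print(prompt_v1[0]["text"] != prompt_v2[0]["text"])
```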

By implementing these techniques, you can build more efficient, cost-effective AI agent workflows that scale better in production environments.