# Understanding Basic LLM Metrics

## Overview
In this workshop, you'll learn to measure and analyze key operational metrics for Large Language Models (LLMs) within Amazon Bedrock. Understanding these metrics is crucial for:
- Optimizing application performance
- Managing costs effectively
- Understanding latency vs accuracy

## Key Metrics We'll Cover
1. **Cost Metrics** - Token usage and pricing
2. **Latency Metrics** - End-to-end response time
3. **TTFT vs TTLT** - Time to First Token vs Time to Last Token
4. **Throttling** - Rate limiting and error handling
5. **Throughput** - Tokens per second and requests per minute

## Use Case
Our use case for this workshop is email summarization: we process incoming emails to extract key information, action items, and important details. This demonstrates how LLM metrics affect real-world performance in a scenario that requires both accuracy and efficiency. We will compare how different models balance speed, cost, and quality when generating concise email summaries at scale.

## Prerequisites
- AWS account with Bedrock access
- Python 3.10+
- boto3 library

## Setup and Dependencies

In [41]:
import boto3
import json
import time
import traceback
import statistics
from datetime import datetime
from typing import Dict, List, Optional

bedrock_client = boto3.client("bedrock-runtime")
cloudwatch = boto3.client('cloudwatch')


## 1. Cost Metrics

Understanding token usage is essential for cost optimization. Models are priced separately for input and output tokens, and the rates differ from model to model. Current rates are listed on the [Amazon Bedrock pricing page](https://aws.amazon.com/bedrock/pricing/#:~:text=Model%20pricing%20details).

In [42]:
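# Pricing in USD per 1K tokens. Confirm these values against the Bedrock pricing page
# before relying on them; rates vary by model and Region and change over time.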
MODEL_PRICING = {
    "us.amazon.nova-lite-v1:0": {"input": 0.00006, "output": 0.000015},
    "us.amazon.nova-pro-v1:0": {"input": 0.0008, "output": 0.0002},
    "us.anthropic.claude-3-7-sonnet-20250219-v1:0": {"input": 0.003, "output": 0.015}
}

def calculate_cost(model_id: str, input_tokens: int, output_tokens: int) -> Dict:
    """Calculate the cost of a model invocation based on token usage."""
    if model_id not in MODEL_PRICING:
        return {"error": f"Pricing not available for {model_id}"}
    
    pricing = MODEL_PRICING[model_id]
    input_cost = (input_tokens / 1000) * pricing["input"]
    output_cost = (output_tokens / 1000) * pricing["output"]
    total_cost = input_cost + output_cost
    # Publish the total cost as a custom CloudWatch metric (one data point per call)
    response = cloudwatch.put_metric_data(
        Namespace='llm_custom_operational_metrics',  # A logical container for your metrics
        MetricData=[
            {
                'MetricName': 'TotalCost',  # The name of your custom metric
                'Value': total_cost,        # The value of the metric
                'Dimensions': [             # Optional: add dimensions for more granular analysis
                    {
                        'Name': 'Model',
                        'Value': model_id
                    }
                ]
            }
        ]
    )

    print("Custom metric published")
    
    return {
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "input_cost_usd": f"${input_cost:.8f}", 
        "output_cost_usd": f"${output_cost:.8f}",
        "total_cost_usd": f"${total_cost:.8f}"
    }

# Example cost calculation for Anthropic Claude 3.7 Sonnet
example_cost = calculate_cost("us.anthropic.claude-3-7-sonnet-20250219-v1:0", 20000, 1500)
print("Cost Analysis Example:")
print(json.dumps(example_cost, indent=2))

Custom metric published: {'ResponseMetadata': {'RequestId': '70e519bb-9878-4aa8-9598-75152dc20d53', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '70e519bb-9878-4aa8-9598-75152dc20d53', 'content-type': 'text/xml', 'content-length': '212', 'date': 'Mon, 04 Aug 2025 14:01:40 GMT'}, 'RetryAttempts': 0}}
Cost Analysis Example:
{
  "input_tokens": 20000,
  "output_tokens": 1500,
  "input_cost_usd": "$0.06000000",
  "output_cost_usd": "$0.02250000",
  "total_cost_usd": "$0.08250000"
}
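To see how per-call cost adds up at scale, the per-token arithmetic above can be extended to a steady daily volume. The following is a minimal sketch, not part of the workshop code: the function name `project_monthly_cost` and the workload figures (emails per day, average token counts) are hypothetical, and the prices are the Claude 3.7 Sonnet values from `MODEL_PRICING`.

```python
def project_monthly_cost(
    emails_per_day: int,
    avg_input_tokens: int,
    avg_output_tokens: int,
    input_price_per_1k: float,
    output_price_per_1k: float,
    days: int = 30,
) -> float:
    """Project spend in USD for a steady daily email volume."""
    per_email = (
        (avg_input_tokens / 1000) * input_price_per_1k
        + (avg_output_tokens / 1000) * output_price_per_1k
    )
    return per_email * emails_per_day * days


# Hypothetical workload: 10,000 emails/day, 800 input and 250 output tokens each,
# priced with the Claude 3.7 Sonnet entry from MODEL_PRICING.
monthly = project_monthly_cost(10_000, 800, 250, 0.003, 0.015)
print(f"Projected monthly cost: ${monthly:,.2f}")
```

Running the same projection with another model's prices gives a quick first estimate of how much a model switch would save, before any quality comparison.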


## 2. Latency Metrics

Latency measures how long it takes to get a complete response. It is critical for user experience and varies from model to model.

The Bedrock Converse API returns built-in metrics with each invocation response. We will use them to fetch basic metrics such as latency and input/output token counts.

In [43]:
def measure_latency(model_id: str, prompt: str, max_tokens: int = 100) -> Dict:
    """Measure end-to-end latency for a model invocation."""
    try:
        
        response = bedrock_client.converse(
            modelId=model_id,
            messages=[{"role": "user", "content": [{"text": prompt}]}],
            inferenceConfig={"maxTokens": max_tokens, "temperature": 0.1}
        )
        
        latency_ms = response["metrics"]["latencyMs"]
        
        cost_info = calculate_cost(
            model_id, 
            response["usage"]["inputTokens"], 
            response["usage"]["outputTokens"]
        )
        
        return {
            "model_id": model_id,
            "server_latency_ms": latency_ms,
            "input_tokens": response["usage"]["inputTokens"],
            "output_tokens": response["usage"]["outputTokens"],
            "tokens_per_second": round(response["usage"]["outputTokens"] / (latency_ms / 1000), 1),
            **cost_info,
            "error": False
        }
        
    except Exception as e:
        return {"error": True, "error_message": str(e)}
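The `latencyMs` value used above is the figure reported in the response, so it does not include time spent in the network or the SDK. One way to see that gap is to time the call on the client as well. The sketch below assumes a response shaped like the Converse output (`response["metrics"]["latencyMs"]`); the helper names `timed_call` and `latency_breakdown` are hypothetical, and `fake_converse` is a stand-in so the example runs without AWS access.

```python
import time
from typing import Any, Callable, Dict, Tuple


def timed_call(fn: Callable[[], Dict[str, Any]]) -> Tuple[Dict[str, Any], float]:
    """Run fn() and return (response, client-side elapsed milliseconds)."""
    start = time.perf_counter()
    response = fn()
    elapsed_ms = (time.perf_counter() - start) * 1000
    return response, elapsed_ms


def latency_breakdown(response: Dict[str, Any], client_ms: float) -> Dict[str, float]:
    """Split client-observed latency into the reported part and the remainder."""
    server_ms = float(response["metrics"]["latencyMs"])
    return {
        "client_latency_ms": round(client_ms, 1),
        "server_latency_ms": server_ms,
        # Network transit, TLS, SDK serialization, and retries all land here.
        "overhead_ms": round(client_ms - server_ms, 1),
    }


# Stand-in for bedrock_client.converse(...); it sleeps 50 ms and reports 40 ms.
def fake_converse() -> Dict[str, Any]:
    time.sleep(0.05)
    return {"metrics": {"latencyMs": 40}}


response, client_ms = timed_call(fake_converse)
breakdown = latency_breakdown(response, client_ms)
print(breakdown)
```

In the notebook you could pass `lambda: bedrock_client.converse(...)` to `timed_call` in place of `fake_converse`.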



## Testing latency and cost between Anthropic Claude 3.7 and Amazon Nova Pro

In [44]:
# Test latency measurement with Claude 3.7 Sonnet
print("Testing Latency Measurement...")
latency_result = measure_latency(
    "us.anthropic.claude-3-7-sonnet-20250219-v1:0", 
    "Explain quantum computing in simple terms.", 
    max_tokens=150
)
print(json.dumps(latency_result, indent=2))

Testing Latency Measurement...
Custom metric published: {'ResponseMetadata': {'RequestId': '1296ded1-e1a3-4a9a-a49d-7f3c8e51ddea', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '1296ded1-e1a3-4a9a-a49d-7f3c8e51ddea', 'content-type': 'text/xml', 'content-length': '212', 'date': 'Mon, 04 Aug 2025 14:01:50 GMT'}, 'RetryAttempts': 0}}
{
  "model_id": "us.anthropic.claude-3-7-sonnet-20250219-v1:0",
  "server_latency_ms": 3960,
  "input_tokens": 15,
  "output_tokens": 150,
  "tokens_per_second": 37.9,
  "input_cost_usd": "$0.00004500",
  "output_cost_usd": "$0.00225000",
  "total_cost_usd": "$0.00229500",
  "error": false
}


## Nova Pro measurement

In [45]:
# Test latency measurement with Nova Pro
print("Testing Latency Measurement...")
latency_result = measure_latency(
    "us.amazon.nova-pro-v1:0", 
    "Explain quantum computing in simple terms.", 
    max_tokens=150
)
print(json.dumps(latency_result, indent=2))

Testing Latency Measurement...
Custom metric published: {'ResponseMetadata': {'RequestId': '055b1ae4-7f0e-456e-a8fb-57b887167c99', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '055b1ae4-7f0e-456e-a8fb-57b887167c99', 'content-type': 'text/xml', 'content-length': '212', 'date': 'Mon, 04 Aug 2025 14:01:54 GMT'}, 'RetryAttempts': 0}}
{
  "model_id": "us.amazon.nova-pro-v1:0",
  "server_latency_ms": 1196,
  "input_tokens": 7,
  "output_tokens": 150,
  "tokens_per_second": 125.4,
  "input_cost_usd": "$0.00000560",
  "output_cost_usd": "$0.00003000",
  "total_cost_usd": "$0.00003560",
  "error": false
}


## Analysis of results
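Notice that the two models differ in latency, input token count, and cost: for the same prompt, Claude 3.7 Sonnet reported 3,960 ms, 15 input tokens, and $0.002295, while Nova Pro reported 1,196 ms, 7 input tokens, and $0.0000356. This shows why choosing the right model for your use case matters, because model choice directly affects many aspects of your generative AI application.
Now let's dive deeper into two more metrics: TTFT and TTLT.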

## 3. TTFT vs TTLT (Time to First Token vs Time to Last Token)

## Time to First Token (TTFT)
Measures how quickly a model begins generating its response after receiving a prompt. This metric is crucial for user experience as it affects perceived responsiveness.

- **Lower TTFT**: Creates the impression of a more responsive system
- **Impact factors**: Model size, architecture, hardware

## Time to Last Token (TTLT)
Measures the total time from prompt submission to complete response delivery. This metric is vital for throughput and overall system performance.

- **Lower TTLT**: Enables processing more requests per unit time
- **Faster responses**: Allow for more interactions within a given timeframe, which is beneficial for tasks requiring iterative engagement, like chain-of-thought prompting

In [46]:
import time
import statistics
import json
from typing import Dict
from decimal import Decimal

def put_custom_operational_cw_metrics(model_id: str, ttfs_ms, ttlt_ms, total_cost_usd):
    """Publish TTFT, TTLT, and total cost to the llm_custom_operational_metrics namespace."""
    
    response = cloudwatch.put_metric_data(
    Namespace='llm_custom_operational_metrics',  # A logical container for your metrics
    MetricData=[
        {
            'MetricName': 'TimeToFirstToken',  # The name of your custom metric
            'Value': ttfs_ms,                  # The value of the metric
            'Unit': 'Milliseconds',             # The unit of measurement (e.g., Count, Seconds, Bytes)
            'Dimensions': [              # Optional: Add dimensions for more granular analysis
                {
                    'Name': 'Model',
                    'Value': model_id
                }
            ],
            # 'Timestamp': datetime.utcnow(), # Optional: Specify a timestamp, defaults to current time
            # 'StorageResolution': 1 # Optional: Set to 1 for high-resolution metrics (1-second granularity)
        },
        {
            'MetricName': 'TimeToLastToken',
            'Value': ttlt_ms,
            'Unit': 'Milliseconds',
            'Dimensions': [
                {
                    'Name': 'Model',
                    'Value': model_id
                }
            ]
        },
        {
            'MetricName': 'TotalCost',
            'Value': Decimal(total_cost_usd.replace("$", "")),
            'Dimensions': [
                {
                    'Name': 'Model',
                    'Value': model_id
                }
            ]
        }
        ]
    )

    print("Custom metric published.")


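def measure_streaming_metrics(model_id: str, prompt: str, max_tokens: int = 200) -> Dict:
    """Measure time to first token (TTFT, stored as ttfs_ms) and time to last token (TTLT) from a streaming response.

    Timestamps are taken per streamed chunk, so total_tokens_received and
    avg_inter_token_latency_ms describe chunks, which may carry more than one token.
    """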
    try:
        start_time = time.time()
        
        response_stream = bedrock_client.converse_stream(
            modelId=model_id,
            messages=[{"role": "user", "content": [{"text": prompt}]}],
            inferenceConfig={"maxTokens": max_tokens, "temperature": 0.1}
        )
        
        first_token_time = None
        last_token_time = None  
        token_timestamps = []
        input_tokens = 0
        output_tokens = 0
        response_text = ""  # Capture the actual response
        
        for event in response_stream["stream"]:
            current_time = time.time()
            
            if 'contentBlockDelta' in event:
                if first_token_time is None:
                    first_token_time = current_time
                
                # Update last token time for each token received
                last_token_time = current_time
                token_timestamps.append(current_time)
                
                # Capture the response text
                if 'delta' in event['contentBlockDelta'] and 'text' in event['contentBlockDelta']['delta']:
                    response_text += event['contentBlockDelta']['delta']['text']
                
            elif 'metadata' in event:
                usage = event['metadata'].get('usage', {})
                input_tokens = usage.get('inputTokens', 0)
                output_tokens = usage.get('outputTokens', 0)
        
        end_time = last_token_time if last_token_time else time.time()
        
        ttfs_ms = round((first_token_time - start_time) * 1000, 2) if first_token_time else None
        ttlt_ms = round((end_time - start_time) * 1000, 2)
        
        inter_token_latencies = []
        if len(token_timestamps) > 1:
            for i in range(1, len(token_timestamps)):
                inter_token_latencies.append(
                    (token_timestamps[i] - token_timestamps[i-1]) * 1000
                )
        
        cost_info = calculate_cost(model_id, input_tokens, output_tokens)
        
        put_custom_operational_cw_metrics(model_id,ttfs_ms,ttlt_ms,cost_info["total_cost_usd"])
        
        return {
            "model_id": model_id,
            "ttfs_ms": ttfs_ms,  
            "ttlt_ms": ttlt_ms,  
            "generation_time_ms": ttlt_ms - ttfs_ms if ttfs_ms else None,
            "tokens_per_second": round(output_tokens / (ttlt_ms / 1000), 1) if ttlt_ms > 0 else None,
            "avg_inter_token_latency_ms": round(statistics.mean(inter_token_latencies), 2) if inter_token_latencies else None,
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "total_tokens_received": len(token_timestamps),
            "response_text": response_text,  # Include the actual response
            **cost_info,
            "error": False
        }
        
    except Exception as e:
        return {"error": True, "error_message": str(e)}


In [47]:

# Test streaming metrics for Amazon Nova Pro
print("Testing Streaming Metrics (TTFS vs TTLT)...")
streaming_result = measure_streaming_metrics(
    "us.amazon.nova-pro-v1:0",
    "Write a short story about a robot learning to paint.",
    max_tokens=300
)

# Pretty print with focus on timing metrics
if not streaming_result.get("error"):
    print(f"Model: {streaming_result['model_id']}")
    print(f"TTFS (Time to First Token): {streaming_result['ttfs_ms']}ms")
    print(f"TTLT (Time to Last Token): {streaming_result['ttlt_ms']}ms")
    print(f"Tokens/Second: {streaming_result['tokens_per_second']}")
    print(f"Avg Inter-token Latency: {streaming_result['avg_inter_token_latency_ms']}ms")
    print(f"Cost: {streaming_result['total_cost_usd']}")

    
else:
    print(f"Error: {streaming_result['error_message']}")

Testing Streaming Metrics (TTFS vs TTLT)...
Custom metric published: {'ResponseMetadata': {'RequestId': '2fc5cdb2-1aea-4111-b508-e85e7aa24aba', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '2fc5cdb2-1aea-4111-b508-e85e7aa24aba', 'content-type': 'text/xml', 'content-length': '212', 'date': 'Mon, 04 Aug 2025 14:02:08 GMT'}, 'RetryAttempts': 0}}
Custom metric published.
Model: us.amazon.nova-pro-v1:0
TTFS (Time to First Token): 303.89ms
TTLT (Time to Last Token): 3937.96ms
Tokens/Second: 76.2
Avg Inter-token Latency: 25.41ms
Cost: $0.00006880
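A single streaming call gives one data point, and TTFT in particular can swing from run to run. A common approach is to repeat the call and report the median and a tail percentile rather than one value. The sketch below is illustrative only: `summarize_latencies` is a hypothetical helper, and the sample values are made up rather than measured.

```python
import statistics
from typing import Dict, List


def summarize_latencies(samples_ms: List[float]) -> Dict[str, float]:
    """Summarize repeated latency measurements with median and tail percentiles."""
    if len(samples_ms) < 2:
        raise ValueError("Need at least two samples to compute percentiles")
    # quantiles(n=100) returns the 99 cut points p1..p99.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {
        "min": min(samples_ms),
        "p50": round(cuts[49], 2),
        "p90": round(cuts[89], 2),
        "max": max(samples_ms),
        "mean": round(statistics.mean(samples_ms), 2),
    }


# Hypothetical TTFT samples (ms) from ten repeated calls, with one slow outlier.
ttft_samples = [310.2, 298.7, 305.1, 330.4, 301.9, 295.3, 912.6, 307.8, 299.4, 315.0]
summary = summarize_latencies(ttft_samples)
print(summary)
```

With an outlier in the data, the mean sits well above the median, which is why percentiles tend to describe user-perceived latency better than an average.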


## Visualizing Operational metrics using CloudWatch Dashboard

Amazon CloudWatch provides automatic dashboards that give quick insight into the health and performance of your AWS services. An automatic dashboard is available for Amazon Bedrock and shows the [Amazon Bedrock runtime metrics](https://docs.aws.amazon.com/bedrock/latest/userguide/monitoring.html#runtime-cloudwatch-metrics). To open the Bedrock automatic dashboard from the AWS Management Console:

Select Dashboards from the CloudWatch console, and select the Automatic Dashboards tab. You’ll see an option for an Amazon Bedrock dashboard in the list of available dashboards. 

You can create a [custom CloudWatch Dashboard](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/create_dashboard.html) and add the Bedrock Automatic Dashboard to it as shown below: 

<div style="text-align:left">
    <img src="images/operational-metrics-cloudwatch-dashboard.png" width="100%"/>
</div>

## 4. Email Summarization Use Case
In this example we use three models to summarize emails and extract key information, action items, and important details. This demonstrates how LLM metrics affect real-world performance in a scenario that requires both accuracy and efficiency. We will compare how the models balance speed, cost, and quality when generating concise email summaries at scale.

In [48]:
# Email Summarization Use Case - Model Performance Comparison
import os
import glob
from pathlib import Path

def load_emails_from_folder(folder_path="data/emails"):
    """Load emails to summarize."""
    sample_emails = []
    
    email_files = glob.glob(os.path.join(folder_path, "*.txt"))
    
    for i, file_path in enumerate(email_files, 1):
        try:
            with open(file_path, 'r', encoding='utf-8') as file:
                content = file.read().strip()
                
                # Extract subject from filename or first line
                filename = Path(file_path).stem
                
                # If the file starts with "Subject:", extract it
                if content.startswith("Subject:"):
                    lines = content.split('\n')
                    subject = lines[0].replace("Subject:", "").strip()
                    email_content = '\n'.join(lines[1:]).strip()
                else:
                    # Use filename as subject if no subject line found
                    subject = filename.replace("_", " ").title()
                    email_content = content
                
                sample_emails.append({
                    "id": i,
                    "subject": subject,
                    "content": email_content
                })
                
        except Exception as e:
            print(f"Error reading {file_path}: {e}")
            continue
    
    return sample_emails

sample_emails = load_emails_from_folder()

summarization_prompt = """
You are an AI assistant that summarizes business emails for busy executives. 

Please analyze the following email and provide a concise summary that includes:

1. **Key Points**: Main topics and important information
2. **Action Items**: Specific tasks or decisions required
3. **Deadlines**: Any time-sensitive items
4. **People/Teams Involved**: Who needs to take action
5. **Impact**: Business impact or urgency level

Email to summarize:
{email_content}

Provide a clear, structured summary in 3-4 sentences followed by bullet points for action items.
"""

# Models to compare
models_to_test = [
    "us.amazon.nova-lite-v1:0",
    "us.amazon.nova-pro-v1:0", 
    "us.anthropic.claude-3-7-sonnet-20250219-v1:0"
]

def run_email_summarization_comparison(email_data, models):
    """Compare models on email summarization task with performance metrics."""
    results = []
    
    print(f"\nTesting Email: '{email_data['subject']}'")
    print("-" * 50)
    
    for model_id in models:
        print(f"\nTesting {model_id.split('.')[-1].upper()}...")
        
        prompt = summarization_prompt.format(
            email_content=f"Subject: {email_data['subject']}\n\n{email_data['content']}"
        )
        
        # Measure performance using our streaming metrics function
        result = measure_streaming_metrics(
            model_id=model_id,
            prompt=prompt,
            max_tokens=400  
        )
        
        if not result.get("error"):
            print(f"Success!")
            print(f"   TTFS: {result['ttfs_ms']}ms")
            print(f"   TTLT: {result['ttlt_ms']}ms") 
            print(f"   Speed: {result['tokens_per_second']} tokens/sec")
            print(f"   Cost: {result['total_cost_usd']}")
            print(f"   Tokens: {result['output_tokens']} output")
            
            # Add email context and additional data for advanced metrics
            result['email_id'] = email_data['id']
            result['email_subject'] = email_data['subject']
            result['email_content'] = email_data['content']
            result['model_name'] = model_id.split('.')[-1]
            result['prompt_used'] = prompt

            # TTFT, TTLT, and cost were already published inside measure_streaming_metrics;
            # publishing them again here would double-count TotalCost in CloudWatch.
            
        else:
            print(f"Error: {result['error_message']}")
            
        results.append(result)
        
        time.sleep(0.5)
    
    return results

all_results = []

for email in sample_emails:
    email_results = run_email_summarization_comparison(email, models_to_test)
    all_results.extend(email_results)



Testing Email: 'Sampleemail2'
--------------------------------------------------

Testing NOVA-LITE-V1:0...
Custom metric published: {'ResponseMetadata': {'RequestId': 'c76e3152-2f26-4839-b6f7-b9d7eccb7557', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': 'c76e3152-2f26-4839-b6f7-b9d7eccb7557', 'content-type': 'text/xml', 'content-length': '212', 'date': 'Mon, 04 Aug 2025 14:02:27 GMT'}, 'RetryAttempts': 0}}
Custom metric published.
Success!
   TTFS: 535.74ms
   TTLT: 1523.34ms
   Speed: 166.1 tokens/sec
   Cost: $0.00002924
   Tokens: 253 output
Custom metric published.

Testing NOVA-PRO-V1:0...
Custom metric published: {'ResponseMetadata': {'RequestId': 'b0b87e12-6386-48db-970f-022be714acc6', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': 'b0b87e12-6386-48db-970f-022be714acc6', 'content-type': 'text/xml', 'content-length': '212', 'date': 'Mon, 04 Aug 2025 14:02:30 GMT'}, 'RetryAttempts': 0}}
Custom metric published.
Success!
   TTFS: 300.67ms
   TTLT: 1976.31m

## 4.1 Visualizing custom operational metrics
The `put_custom_operational_cw_metrics` function [publishes three custom metrics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/publishingMetrics.html), `TimeToFirstToken`, `TimeToLastToken`, and `TotalCost`, to the custom [CloudWatch namespace](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch_concepts.html#Namespace) `llm_custom_operational_metrics`, with the model ID as a dimension. You can then visualize these metrics in the same observability dashboard, as shown below:

<div style="text-align:left">
    <img src="images/Custom-operational-metrics-cloudwatch-dashboard.png" width="100%"/>
</div>

Similarly, you can create any additional custom metric you need for your application logic.
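If you prefer to define such a dashboard in code instead of the console, a dashboard body can be assembled as JSON and passed to the CloudWatch `put_dashboard` API. The following is a rough sketch under stated assumptions: the helper `build_dashboard_body`, the widget layout, the Region default, and the dashboard name are all hypothetical choices; only the namespace, metric names, and `Model` dimension come from the code in this notebook.

```python
import json
from typing import Dict, List

NAMESPACE = "llm_custom_operational_metrics"


def build_dashboard_body(model_ids: List[str], region: str = "us-east-1") -> str:
    """Build a CloudWatch dashboard body with one widget per custom metric."""
    widgets: List[Dict] = []
    metric_stats = [
        ("TimeToFirstToken", "Average"),
        ("TimeToLastToken", "Average"),
        ("TotalCost", "Sum"),
    ]
    for row, (metric_name, stat) in enumerate(metric_stats):
        widgets.append({
            "type": "metric",
            "x": 0, "y": row * 6, "width": 24, "height": 6,
            "properties": {
                "title": f"{metric_name} by model",
                "region": region,  # Assumed Region; use the one you publish to
                "stat": stat,
                "period": 300,
                # One line per model, keyed on the "Model" dimension.
                "metrics": [[NAMESPACE, metric_name, "Model", m] for m in model_ids],
            },
        })
    return json.dumps({"widgets": widgets})


body = build_dashboard_body(["us.amazon.nova-lite-v1:0", "us.amazon.nova-pro-v1:0"])
print(body[:120])

# To create the dashboard (requires AWS credentials and cloudwatch:PutDashboard):
# boto3.client("cloudwatch").put_dashboard(
#     DashboardName="llm-operational-metrics",  # hypothetical name
#     DashboardBody=body,
# )
```

Keeping the body builder separate from the API call makes the layout easy to inspect or version before anything is created in your account.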

In [None]:
# Save responses for quality metrics evaluation (using data already collected)
import json
import os

# Format the results for quality metrics evaluation
enhanced_results = []
print("Preparing agent responses for advanced metrics evaluation...")

for result in all_results:
    if not result.get("error") and result.get("response_text"):
        enhanced_result = {
            "email_id": result['email_id'],
            "email_subject": result['email_subject'],
            "email_content": result['email_content'],
            "model_id": result['model_id'],
            "model_name": result['model_name'],
            "agent_response": result['response_text'],  # Use captured response
            "prompt_used": result['prompt_used'],
            "input_tokens": result['input_tokens'],
            "output_tokens": result['output_tokens'],
            # Include performance metrics for reference
            "performance_metrics": {
                "ttfs_ms": result.get('ttfs_ms'),
                "ttlt_ms": result.get('ttlt_ms'),
                "tokens_per_second": result.get('tokens_per_second'),
                "total_cost_usd": result.get('total_cost_usd')
            }
        }
        enhanced_results.append(enhanced_result)
        print(f" Prepared response from {result['model_name']} for email {result['email_id']}")
    else:
        print(f" Skipping result due to error or missing response text")

# Save the enhanced results
with open("email_responses.json", 'w') as f:
    json.dump(enhanced_results, f, indent=2)



Preparing agent responses for advanced metrics evaluation...
 Prepared response from nova-lite-v1:0 for email 1
 Prepared response from nova-pro-v1:0 for email 1
 Skipping result due to error or missing response text
 Prepared response from nova-lite-v1:0 for email 2
 Prepared response from nova-pro-v1:0 for email 2
 Skipping result due to error or missing response text


## Analyzing Results
Comparing the results, you will notice that, given the same task, the models varied in speed (TTFT/TTLT), output token count, and cost. In this sample we did not inspect the LLM output to identify which model gave the best responses for our use case; we will do that in the next module.
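To make the per-model comparison easier to read, the collected results can be rolled up into one row per model. This is a minimal sketch, assuming result dictionaries shaped like those returned by `measure_streaming_metrics` (`model_id`, `ttfs_ms`, `ttlt_ms`, `total_cost_usd`, `error`); the helper `aggregate_by_model` and the sample rows are hypothetical.

```python
import statistics
from collections import defaultdict
from typing import Dict, List


def aggregate_by_model(results: List[Dict]) -> Dict[str, Dict[str, float]]:
    """Average timing metrics and total cost per model, skipping failed calls."""
    grouped = defaultdict(list)
    for r in results:
        if not r.get("error"):
            grouped[r["model_id"]].append(r)

    summary = {}
    for model_id, rows in grouped.items():
        summary[model_id] = {
            "runs": len(rows),
            "avg_ttfs_ms": round(statistics.mean(r["ttfs_ms"] for r in rows), 2),
            "avg_ttlt_ms": round(statistics.mean(r["ttlt_ms"] for r in rows), 2),
            # Costs are stored as "$0.00002924"-style strings, so strip the symbol first.
            "total_cost_usd": round(
                sum(float(r["total_cost_usd"].lstrip("$")) for r in rows), 8
            ),
        }
    return summary


# Made-up rows for illustration; in the notebook you would pass all_results.
sample_results = [
    {"model_id": "model-a", "ttfs_ms": 500.0, "ttlt_ms": 1500.0,
     "total_cost_usd": "$0.00003000", "error": False},
    {"model_id": "model-a", "ttfs_ms": 600.0, "ttlt_ms": 1700.0,
     "total_cost_usd": "$0.00005000", "error": False},
    {"model_id": "model-b", "error": True, "error_message": "ThrottlingException"},
]
per_model = aggregate_by_model(sample_results)
print(per_model)
```

Failed calls are dropped rather than averaged in, so a model that errors on every email simply does not appear in the summary.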

## 5. Best Practices and Key Takeaways
- Monitor token usage closely
- Choose appropriate models for your use case
- Consider input/output token ratios
- Use streaming for better user experience (lower TTFT)
- Use the correct invocation type for your use case (batch vs. on-demand)
- Monitor and set appropriate timeouts
- Implement comprehensive error handling
- Set up monitoring and alerting to keep track of cost
- Plan for rate limits and scaling
- Consider caching strategies
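The advice to plan for rate limits and handle errors can be sketched as a retry wrapper with exponential backoff and jitter. This is one possible approach, not the workshop's implementation: `call_with_backoff`, its parameters, and the `FakeThrottle` exception are hypothetical and exist only so the example runs without AWS access.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    is_throttled: Callable[[Exception], bool],
    max_attempts: int = 5,
    base_delay_s: float = 0.5,
    max_delay_s: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying throttled failures with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as err:
            # Re-raise anything that is not throttling, or the final failed attempt.
            if not is_throttled(err) or attempt == max_attempts:
                raise
            # Full jitter: wait a random time up to the capped exponential delay.
            delay = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))
    raise RuntimeError("unreachable")


class FakeThrottle(Exception):
    """Stand-in for a throttling error so the sketch runs without AWS."""


attempts = {"count": 0}


def flaky() -> str:
    """Fail twice with a throttling error, then succeed."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise FakeThrottle("rate exceeded")
    return "ok"


result = call_with_backoff(
    flaky,
    is_throttled=lambda e: isinstance(e, FakeThrottle),
    sleep=lambda s: None,  # Skip real waiting in this demo
)
print(result, attempts["count"])
```

With boto3, the `is_throttled` check would typically inspect a `botocore.exceptions.ClientError` and compare `err.response["Error"]["Code"]` with `"ThrottlingException"`.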

## Conclusion

You've learned to measure and analyze key LLM metrics:

**Cost Metrics** - Track token usage and optimize spending  
**Latency Metrics** - Measure end-to-end response times  
**TTFT vs TTLT** - Understand streaming performance  

### Next Steps
- Evaluate models based on quality metrics
- Create use case specific metrics
- Evaluate metrics based on your workload
- Use open-source frameworks to evaluate your models
