# Llama 4 on Bedrock: Long context window use cases

This notebook demonstrates how to use Llama 4 on AWS through Amazon Bedrock. We will cover client setup, basic invocations to run inference, and use cases that benefit from Llama 4 Scout's 10 million token context window.

## Llama 4

Llama 4 represents Meta’s most advanced multimodal models to date, featuring a mixture-of-experts (MoE) architecture and context windows of up to 10 million tokens. With native multimodality and early fusion technology, Meta states that these new models demonstrate unprecedented performance across text and vision tasks while maintaining efficient compute requirements. With a dramatic increase in supported context length from 128K tokens in Llama 3, Llama 4 is suitable for multi-document summarization, parsing extensive user activity for personalized tasks, and reasoning over extensive codebases.

## Prerequisites

- Access to Llama 4 on Bedrock enabled via AWS console

In [1]:
import boto3
from boto3 import client
from botocore.config import Config
import json
import re
import logging
from botocore.exceptions import ClientError
logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)

In [2]:
config = Config(read_timeout=2000)
bedrock_client = boto3.client(service_name='bedrock-runtime', region_name="us-east-1", config=config)

INFO:botocore.credentials:Found credentials in shared credentials file: ~/.aws/credentials


In [3]:
meta_llama_4_maverick = 'us.meta.llama4-maverick-17b-instruct-v1:0'
meta_llama_4_scout = 'us.meta.llama4-scout-17b-instruct-v1:0'

model_id = meta_llama_4_scout

In [6]:
# Prepare your payload with the messages format
payload = {
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "text": "You are a helpful AI assistant."
                }
            ]
        },
        {
            "role": "user",
            "content": [
                {
                    "text": "What are three key benefits of large language models for businesses?"
                }
            ]
        }
    ]

}
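
The payload above sends the system-style instruction as an ordinary user turn. The Converse API also accepts a top-level `system` parameter, which keeps the instruction out of the message list. The following is a minimal sketch, assuming the selected model accepts system prompts through Converse; `build_converse_request` is a hypothetical helper name, not part of boto3, and it only builds a dictionary without calling AWS.

```python
def build_converse_request(model_id, system_text, user_text, max_tokens=2000):
    """Assemble keyword arguments for bedrock_client.converse().

    Hypothetical helper for illustration; it makes no API call.
    """
    return {
        "modelId": model_id,
        # System instructions go in the top-level `system` list, not in `messages`.
        "system": [{"text": system_text}],
        "messages": [
            {"role": "user", "content": [{"text": user_text}]},
        ],
        "inferenceConfig": {"maxTokens": max_tokens, "temperature": 0, "topP": 0.5},
    }


request = build_converse_request(
    "us.meta.llama4-scout-17b-instruct-v1:0",
    "You are a helpful AI assistant.",
    "What are three key benefits of large language models for businesses?",
)
# With a configured client: response = bedrock_client.converse(**request)
```

If the model rejects the `system` field, the two-user-message layout used in this notebook remains a workable alternative.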

In [7]:
response = bedrock_client.converse(
    modelId=model_id,
    messages=payload["messages"],  # Already a list of message dicts
    inferenceConfig={
        "maxTokens": 2000,
        "temperature": 0,
        "topP": .5
    }
)

print(response["output"]["message"]["content"][0]["text"])

Large language models (LLMs) offer several benefits to businesses, and here are three key advantages:

1. **Improved Customer Service and Support**: LLMs can be integrated with chatbots and virtual assistants to provide 24/7 customer support. They can understand and respond to customer inquiries, resolve issues, and even route complex problems to human customer support agents. This can lead to increased customer satisfaction, reduced support costs, and improved overall customer experience.

2. **Enhanced Content Generation and Automation**: LLMs can generate high-quality content, such as blog posts, social media posts, product descriptions, and email newsletters, at a much faster pace than humans. This can help businesses save time and resources, while also increasing their content output and consistency. Additionally, LLMs can automate tasks like data analysis, reporting, and summarization, freeing up human resources for more strategic and creative work.

3. **Data Analysis and Insigh

# Codebase Analysis 

Using Llama 4 Scout’s industry-leading context window, this section showcases its ability to deeply analyze expansive codebases. The example extracts and contextualizes the buildspec-1-10-2.yml file from the AWS Deep Learning Containers GitHub repository, illustrating how the model synthesizes information across an entire repository. We used a tool to ingest the whole repository into plaintext that we provided to the model as context:

In [19]:
import json
import time
import boto3
from botocore.config import Config

def read_markdown_file(file_path):
    """
    Read the content of a markdown file.
    
    Args:
        file_path (str): Path to the markdown file to be read
        
    Returns:
        str: The content of the file as a string
    """
    try:
        with open(file_path, 'r', encoding='utf-8') as file:
            content = file.read()
            return content
    except FileNotFoundError:
        print(f"Error: File not found at path {file_path}")
        raise
    except Exception as e:
        print(f"Error reading file: {e}")
        raise

def summarize_text_bedrock(text, model_id, max_length=10000):
    """
    Summarize the provided text using Amazon Bedrock's converse API.
    
    Args:
        text (str): Text to summarize
        model_id (str): Bedrock model ID to use
        max_length (int): Maximum number of characters of text to send (to avoid token limits)
        
    Returns:
        tuple: (Summarized text, Latency metrics)
    """
    # Configure the Bedrock client with extended timeout
    config = Config(read_timeout=300, connect_timeout=30)
    bedrock_client = boto3.client(service_name='bedrock-runtime', region_name="us-east-1", config=config)
    
    # Truncate text to max_length characters to avoid exceeding token limits;
    # pass a larger max_length to send more of the repository as context
    if len(text) > max_length:
        text = text[:max_length] + "... [Content truncated due to length]"
    
    # Start measuring preprocessing time
    preprocess_start = time.time()
    
    # Format the messages correctly for Bedrock's converse API
    # First, create two separate messages
    system_instruction = "You are a helpful AI assistant that summarizes codebases to me to help me understand how to analyze code by synthesizing through the entire codebase before responding. Be thorough in your search as the file may be nested within a markdown code block or within a directory listing."
    
    user_question = f"Can you explain to me the buildspec-1-10-2.yml file and how it relates to the rest of the huggingface directory? Use this information:\n\n{text} as reference."
    
    # Format messages as per Llama 4 Bedrock requirements
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "text": system_instruction  # Include system instruction as first user message
                }
            ]
        },
        {
            "role": "user", 
            "content": [
                {
                    "text": user_question
                }
            ]
        }
    ]
    
    # Measure payload size
    payload_size = len(json.dumps({"messages": messages}).encode('utf-8'))
    
    preprocess_end = time.time()
    preprocess_time = preprocess_end - preprocess_start
    
    # Invoke the Bedrock model and measure latency
    api_start = time.time()
    try:
        # Call the Converse API first; invoke_model is tried below only as a fallback
        response = bedrock_client.converse(
            modelId=model_id,
            messages=messages,
            inferenceConfig={
                "maxTokens": 4000,
                "temperature": 0.7,
                "topP": 0.9
            }
        )
    except Exception as e:
        print(f"Error calling Bedrock converse: {e}")
        # Try using invoke_model instead as a fallback
        try:
            # Format for invoke_model is different
            payload = {
                "messages": [
                    {"role": "system", "content": system_instruction},
                    {"role": "user", "content": user_question}
                ],
                "max_tokens": 2000,
                "temperature": 0.7,
                "top_p": 0.9
            }
            
            body = json.dumps(payload)
            response = bedrock_client.invoke_model(
                modelId=model_id,
                body=body
            )
            
            # Parse the response from invoke_model
            response_body = json.loads(response['body'].read())
            # Convert to the same shape as a converse response so the parsing below works
            response = {
                "output": {
                    "message": {
                        "content": [
                            {"text": response_body.get('content', '') or response_body.get('generation', '')}
                        ]
                    }
                }
            }
        except Exception as e2:
            print(f"Fallback also failed: {e2}")
            raise
    
    api_end = time.time()
    api_latency = api_end - api_start
    
    # Start measuring postprocessing time
    postprocess_start = time.time()
    
    # Extract the summary from the response
    try:
        summary = response["output"]["message"]["content"][0]["text"]
    except (KeyError, IndexError) as e:
        print(f"Error parsing response: {e}")
        summary = str(response)
    
    # Calculate output size
    output_size = len(summary.encode('utf-8'))
    
    postprocess_end = time.time()
    postprocess_time = postprocess_end - postprocess_start
    
    # Collect metrics
    metrics = {
        'api_latency': api_latency,
        'preprocess_time': preprocess_time,
        'postprocess_time': postprocess_time,
        'total_latency': preprocess_time + api_latency + postprocess_time,
        'payload_size_bytes': payload_size,
        'output_size_bytes': output_size,
        'payload_tokens': len(text) / 4,  # rough approximation of tokens
    }
    
    return summary, metrics
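
The configuration cell that follows expects a single markdown file containing the flattened repository. The tool used to produce that file is not named in the text, so the code below is only a stand-in: a minimal sketch that concatenates text files under a directory into one string, with each file preceded by a heading giving its relative path. The function name `flatten_repo_to_markdown` and the default extension list are assumptions chosen for illustration.

```python
from pathlib import Path


def flatten_repo_to_markdown(repo_dir, extensions=(".yml", ".yaml", ".py", ".md")):
    """Concatenate matching files under repo_dir into one markdown string.

    Hypothetical helper; each file is preceded by a heading with its relative path.
    """
    root = Path(repo_dir)
    sections = []
    for path in sorted(root.rglob("*")):
        # Skip directories and files whose extension is not in the allow-list.
        if not path.is_file() or path.suffix not in extensions:
            continue
        body = path.read_text(encoding="utf-8", errors="replace")
        rel = path.relative_to(root).as_posix()
        sections.append(f"## File: {rel}\n\n{body}\n")
    return "\n".join(sections)
```

The returned string can be written to disk and passed to `read_markdown_file`, or handed directly to `summarize_text_bedrock` as `text`.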

In [20]:
# Configuration
file_path = '/home/ec2-user/SageMaker/repomix-output.md'  # Update with actual markdown file
model_id = 'us.meta.llama4-maverick-17b-instruct-v1:0'  # This run uses Maverick; set to meta_llama_4_scout to use Scout instead

# Read the markdown file
print(f"Reading markdown file from {file_path}...")
read_start = time.time()
markdown_content = read_markdown_file(file_path)
read_end = time.time()
read_time = read_end - read_start
print(f"File read time: {read_time:.4f} seconds")

# Summarize the content with Bedrock
print("Generating summary using Bedrock...")
summary, latency_metrics = summarize_text_bedrock(markdown_content, model_id)

# Print the latency metrics
print("\n--- LATENCY METRICS ---")
print(f"API call latency: {latency_metrics['api_latency']:.4f} seconds")
print(f"Preprocessing time: {latency_metrics['preprocess_time']:.4f} seconds")
print(f"Postprocessing time: {latency_metrics['postprocess_time']:.4f} seconds")
print(f"Total latency: {latency_metrics['total_latency']:.4f} seconds")
print(f"Payload size: {latency_metrics['payload_size_bytes'] / 1024:.2f} KB")
print(f"Output size: {latency_metrics['output_size_bytes'] / 1024:.2f} KB")
print(f"Approximate input tokens: {int(latency_metrics['payload_tokens'])}")

# Print the summary
print("\n--- SUMMARY ---\n")
print(summary)

Reading markdown file from /home/ec2-user/SageMaker/repomix-output.md...
File read time: 0.1781 seconds
Generating summary using Bedrock...

--- LATENCY METRICS ---
API call latency: 6.0696 seconds
Preprocessing time: 0.0001 seconds
Postprocessing time: 0.0000 seconds
Total latency: 6.0696 seconds
Payload size: 10.69 KB
Output size: 3.44 KB
Approximate input tokens: 2509

--- SUMMARY ---

To understand the `buildspec-1-10-2.yml` file and its relation to the rest of the `huggingface` directory, we first need to locate the file within the provided directory structure.

Upon examining the directory structure, we find that the `huggingface` directory contains several subdirectories, including `pytorch`, which further contains `inference` and `training` directories. However, the `buildspec-1-10-2.yml` file is not directly listed under the `huggingface` directory or its immediate subdirectories.

To proceed, let's search for the file within the provided code repository representation.

## Fi

# Multi-Doc Processing

With a 3.5 million token context window on Bedrock, Llama 4 Scout is well suited to multi-document processing. In this example, the model extracts key financial metrics from Amazon 10-K reports (2015-2024), demonstrating its ability to integrate and analyze data spanning multiple years without the need for additional processing tools.
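
When several documents are placed in one prompt, it helps to label each document and to keep a rough eye on the total size. The sketch below joins documents with labelled delimiters and stops before an estimated token budget is exceeded. It assumes roughly 4 characters per token, the same approximation used in this notebook's latency metrics; real token counts depend on the tokenizer. `build_multi_doc_context` is a hypothetical helper name.

```python
def build_multi_doc_context(documents, max_tokens=3_500_000, chars_per_token=4):
    """Join documents with labelled delimiters under a rough token budget.

    documents: dict mapping a document name to its extracted text.
    Returns (context, included_names, estimated_tokens).
    """
    budget_chars = max_tokens * chars_per_token
    parts, included, used = [], [], 0
    for name in sorted(documents):
        block = f"--- Data from {name} ---\n{documents[name]}\n\n"
        if used + len(block) > budget_chars:
            break  # stop before the estimated budget is exceeded
        parts.append(block)
        included.append(name)
        used += len(block)
    return "".join(parts), included, used // chars_per_token


# Placeholder texts stand in for the extracted PDF contents.
docs = {"report_b.pdf": "placeholder text B", "report_a.pdf": "placeholder text A"}
context, included, estimated_tokens = build_multi_doc_context(docs)
```

Documents are added in sorted name order, so which documents are dropped when the budget is tight is deterministic.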

In [21]:
!pip install pyPDF2

Collecting pyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: pyPDF2
Successfully installed pyPDF2-3.0.1


In [31]:
import boto3
import json
import time
import PyPDF2
import os
import glob
import random
from botocore.config import Config

# Define helper functions
def extract_text_from_pdf(pdf_path):
    """Extract text from a PDF file."""
    text = ""
    try:
        with open(pdf_path, 'rb') as file:
            pdf_reader = PyPDF2.PdfReader(file)
            print(f"PDF has {len(pdf_reader.pages)} pages")
            # Extract text from first 100 pages (adjust as needed)
            for page_num in range(min(100, len(pdf_reader.pages))):
                print(f"Processing page {page_num+1}/{len(pdf_reader.pages)}...")
                text += pdf_reader.pages[page_num].extract_text() + "\n"
        return text
    except Exception as e:
        print(f"Error extracting text from PDF: {e}")
        return None

def retry_with_exponential_backoff(func, max_retries=5, initial_backoff=5):
    """Execute a function with exponential backoff retry."""
    retries = 0
    while retries <= max_retries:
        try:
            return func()
        except Exception as e:
            is_throttled = (
                hasattr(e, 'response')
                and 'Error' in e.response
                and e.response['Error'].get('Code') == 'ThrottlingException'
            )
            if not is_throttled:
                print(f"Non-throttling error: {e}")
                raise
            if retries >= max_retries:
                # Out of attempts: re-raise immediately instead of sleeping again
                print(f"Maximum retries ({max_retries}) exceeded.")
                raise
            wait_time = initial_backoff * (2 ** retries) + random.uniform(0, 1)
            print(f"Throttling detected. Retrying after {wait_time:.2f} seconds...")
            time.sleep(wait_time)
            retries += 1
    
    raise Exception("Max retries exceeded")

def extract_key_data_from_pdfs(pdf_files, bedrock_client, model_id):
    """Extract and summarize key data from each PDF file using Amazon Bedrock."""
    pdf_summaries = {}
    
    for pdf_file in pdf_files:
        file_name = os.path.basename(pdf_file)
        print(f"\n--- Extracting key data from {file_name} ---")
        
        pdf_text = extract_text_from_pdf(pdf_file)
        if not pdf_text:
            print(f"Failed to extract text from {file_name}")
            continue
            
        # Use a smaller chunk size to avoid token limits
        max_text_length = 10000  # Reduced from 25000 to avoid throttling
        truncated_text = pdf_text[:max_text_length]
        
        # Create system instruction and user question
        system_instruction = "You are a financial analyst. Extract key financial data from this 10-K report."
        
        user_question = f"""Extract the following information from this Amazon 10-K report:
1. What year is this report for?
2. Total revenue for the year
3. Net income
4. Key business segments and their performance
5. Important trends mentioned

Here's the 10-K text (truncated):
{truncated_text}

Format your response as a JSON object with these fields: year, revenue, net_income, segments, trends.
"""
        
        # Format messages according to the example
        messages = [
            {
                "role": "user",
                "content": [
                    {
                        "text": system_instruction
                    }
                ]
            },
            {
                "role": "user", 
                "content": [
                    {
                        "text": user_question
                    }
                ]
            }
        ]
        
        try:
            print(f"Extracting key data from {file_name}...")
            
            # Call Bedrock's Converse API with retry logic
            def make_api_call():
                return bedrock_client.converse(
                    modelId=model_id,
                    messages=messages,
                    inferenceConfig={
                        "maxTokens": 1500,  # Reduced from 2048
                        "temperature": 0.7,
                        "topP": 0.9
                    }
                )
                
            # Add a delay before making the API call to avoid throttling
            time.sleep(2)
            response = retry_with_exponential_backoff(make_api_call)
            
            # Extract content from response based on the format
            try:
                summary = response["output"]["message"]["content"][0]["text"]
                pdf_summaries[file_name] = summary
                print(f"Successfully extracted key data from {file_name}")
            except (KeyError, IndexError) as e:
                print(f"Error parsing response: {e}")
                print(f"Response structure: {json.dumps(response, default=str)}")
                
        except Exception as e:
            print(f"Error processing {file_name}: {e}")
            import traceback
            traceback.print_exc()
    
    return pdf_summaries

def find_pdf_files(directory):
    """Find all PDF files in the specified directory."""
    pdf_files = []
    try:
        # Look for all PDF files in the directory
        pdf_pattern = os.path.join(directory, "*.pdf")
        pdf_files = glob.glob(pdf_pattern)
        
        if pdf_files:
            print(f"Found {len(pdf_files)} PDF files:")
            for pdf in pdf_files:
                print(f"  - {os.path.basename(pdf)}")
        else:
            print(f"No PDF files found in {directory}")
    except Exception as e:
        print(f"Error finding PDF files: {e}")
    
    return pdf_files

def synthesize_and_answer_question(pdf_summaries, question, bedrock_client, model_id):
    """Synthesize information from all PDFs and answer a specific question."""
    # Combine all summaries - limiting context size to avoid throttling
    combined_context = ""
    for pdf_name, summary in pdf_summaries.items():
        combined_context += f"--- Data from {pdf_name} ---\n{summary[:2000]}\n\n"  # Limit each summary
    
    # Create system instruction and user question
    system_instruction = """You are an expert financial analyst comparing Amazon's 10-K reports across multiple years.
Your task is to synthesize information from these reports and answer questions accurately.
Make sure to compare data across years when relevant and highlight significant changes or trends."""
    
    user_question = f"""I've analyzed multiple Amazon 10-K reports and extracted the key information from each:

{combined_context}

Based on this information from all the reports, please answer the following question:
{question}

Provide a comprehensive answer that compares information across all available years."""

    # Format messages according to your example
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "text": system_instruction
                }
            ]
        },
        {
            "role": "user", 
            "content": [
                {
                    "text": user_question
                }
            ]
        }
    ]

    try:
        print(f"\nSynthesizing information and answering question...")
        start_time = time.time()
        
        # Call Bedrock's Converse API with retry logic
        def make_api_call():
            return bedrock_client.converse(
                modelId=model_id,
                messages=messages,
                inferenceConfig={
                    "maxTokens": 2000,  # Reduced from 4096
                    "temperature": 0.7,
                    "topP": 0.9
                }
            )
        
        # Add a delay before making the API call to avoid throttling
        time.sleep(2)    
        response = retry_with_exponential_backoff(make_api_call)
        
        end_time = time.time()
        print(f"Time taken: {end_time - start_time:.2f} seconds")
        
        # Extract the response content
        try:
            answer = response["output"]["message"]["content"][0]["text"]
            return answer
        except (KeyError, IndexError) as e:
            print(f"Error parsing response: {e}")
            try:
                print(f"Response structure: {json.dumps(response, default=str)}")
            except Exception:
                print("Cannot display response structure")
            return None
            
    except Exception as e:
        print(f"Error synthesizing information: {e}")
        import traceback
        traceback.print_exc()
        return None
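
The extraction prompt above asks the model to answer with a JSON object, but the helper functions store the reply as raw text. Models often wrap JSON in extra prose, so a tolerant parser can be useful before comparing fields across years. This is a minimal sketch under that assumption; `parse_json_from_response` is a hypothetical helper, and the sample reply uses placeholder values rather than real report data.

```python
import json
import re


def parse_json_from_response(text):
    """Return the first-to-last-brace JSON object in a model reply, or None.

    Hypothetical helper: tolerates prose before and after the object.
    """
    # Greedy match from the first "{" to the last "}" in the reply.
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        return None
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


sample_reply = 'Here is the data: {"year": "YYYY", "segments": ["a", "b"]} Let me know.'
fields = parse_json_from_response(sample_reply)
```

Returning `None` on failure lets the caller fall back to the raw summary text, as `synthesize_and_answer_question` does today.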

In [32]:
# Define the directory containing PDF files
pdf_directory = "/home/ec2-user/SageMaker/Llama4_Assets/"  # Replace with your directory path

# Initialize the Amazon Bedrock client
print("Initializing Amazon Bedrock client...")
bedrock_client = boto3.client('bedrock-runtime')
# You can choose from available models in your Bedrock setup, like:
# model_id = "anthropic.claude-3-sonnet-20240229-v1:0"
model_id = "us.meta.llama4-scout-17b-instruct-v1:0"  # Replace with your preferred Bedrock model
print(f"Using model: {model_id}")

try:
    # Find PDF files
    pdf_files = find_pdf_files(pdf_directory)
    if not pdf_files:
        print("No PDF files found. Exiting.")
        raise SystemExit(0)
    
    # Extract summaries from PDFs
    print("\nExtracting key data from all PDFs...")
    pdf_summaries = extract_key_data_from_pdfs(pdf_files, bedrock_client, model_id)
    
    if not pdf_summaries:
        print("Failed to extract data from any PDFs. Exiting.")
        raise SystemExit(0)
    
    print("\nSuccessfully extracted data from the following PDFs:")
    for pdf_name in pdf_summaries.keys():
        print(f"  - {pdf_name}")
    
    # Define your questions
    questions = [
        "How did Amazon's revenue and net income change over time?",
        "What are the main business segments driving growth at Amazon over the years?",
        "How has AWS evolved as a segment and what is its contribution to Amazon's overall business?"
    ]
    
    # Answer each question
    for question in questions:
        print(f"\nAnswering question: {question}")
        answer = synthesize_and_answer_question(pdf_summaries, question, bedrock_client, model_id)
        
        if answer:
            print("\n" + "="*50)
            print(f"ANSWER to: {question}")
            print("="*50)
            print(answer)
        else:
            print(f"Sorry, I couldn't generate an answer for: {question}")

except Exception as e:
    print(f"Error occurred: {e}")


Initializing Amazon Bedrock client...
Using model: us.meta.llama4-scout-17b-instruct-v1:0
Found 10 PDF files:
  - amazon10k_2022.pdf
  - amazon_10k_2020.pdf
  - amazon_10k_2015.pdf
  - amazon_10k_2016.pdf
  - amazon_10k_2018.pdf
  - amazon_10k_2021.pdf
  - amazon_10k_2024.pdf
  - amazon_10k_2019.pdf
  - amazon_10k_2017.pdf
  - amazon_10k_2023.pdf

Extracting key data from all PDFs...

--- Extracting key data from amazon10k_2022.pdf ---
PDF has 194 pages
Processing page 1/194...
Processing page 2/194...
Processing page 3/194...
Processing page 4/194...
Processing page 5/194...
Processing page 6/194...
Processing page 7/194...
Processing page 8/194...
Processing page 9/194...
Processing page 10/194...
Processing page 11/194...
Processing page 12/194...
Processing page 13/194...
Processing page 14/194...
Processing page 15/194...
Processing page 16/194...
Processing page 17/194...
Processing page 18/194...
Processing page 19/194...
Processing page 20/194...
Processing page 21/194...
Proce