# Deploying Llama 4 on SageMaker JumpStart: Long context window use cases

This notebook demonstrates how to deploy Llama 4 on AWS with SageMaker JumpStart. It covers the deployment process, basic inference invocations, and use cases that benefit from Llama 4 Scout's 10 million token context window.

## Llama 4

Llama 4 is Meta's most advanced family of multimodal models to date, featuring a mixture-of-experts (MoE) architecture and support for context windows of up to 10 million tokens. With native multimodality and early fusion, Meta states that these models demonstrate unprecedented performance across text and vision tasks while keeping compute requirements efficient. With supported context length increased from the 128K tokens of Llama 3, Llama 4 is suited to multi-document summarization, parsing extensive user activity for personalized tasks, and reasoning over large codebases.

## Prerequisites

To run the following cells and deploy the model, you need:
- AWS credentials (an AWS access key ID and secret access key)
- An IAM execution role for the notebook

In [1]:
from sagemaker.jumpstart.model import JumpStartModel



sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


In [2]:
model_id, model_version = "meta-vlm-llama-4-scout-17b-16e-instruct", "*"

This is a gated model. To deploy it, accept the end-user license agreement (EULA) by setting accept_eula = True in the next cell.

In [4]:
accept_eula = False

## Deploy model
Using the model ID, define your model as a JumpStart model. To deploy on a different instance type, pass instance_type to JumpStartModel. See "Deploy publicly available foundation models with the JumpStartModel class" for more configuration options.

In [5]:
model = JumpStartModel(model_id=model_id, model_version=model_version)

Model 'meta-vlm-llama-4-scout-17b-16e-instruct' requires accepting end-user license agreement (EULA). See https://jumpstart-cache-prod-us-east-1.s3.us-east-1.amazonaws.com/fmhMetadata/eula/llama4Eula.txt for terms of use.


Using model 'meta-vlm-llama-4-scout-17b-16e-instruct' with wildcard version identifier '*'. You can pin to version '1.0.0' for more stable results. Note that models may have different input/output signatures after a major version upgrade.


No instance type selected for inference hosting endpoint. Defaulting to ml.p5.48xlarge.
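
The log above suggests two options worth setting explicitly: a pinned model version and an instance type. The following is a minimal sketch of how those could be passed. The helper names `jumpstart_model_kwargs` and `build_model` are hypothetical, and `ml.p5.48xlarge` is simply the default reported in the log; check which instance types the model supports before substituting another.

```python
def jumpstart_model_kwargs(model_id, model_version="1.0.0", instance_type=None):
    """Collect keyword arguments for JumpStartModel, omitting unset options."""
    kwargs = {"model_id": model_id, "model_version": model_version}
    if instance_type is not None:
        kwargs["instance_type"] = instance_type
    return kwargs


def build_model(**kwargs):
    """Hypothetical wrapper: create the JumpStartModel from the collected arguments."""
    # Imported lazily so the helper above can be inspected without the SDK installed.
    from sagemaker.jumpstart.model import JumpStartModel

    return JumpStartModel(**kwargs)


kwargs = jumpstart_model_kwargs(
    "meta-vlm-llama-4-scout-17b-16e-instruct",
    model_version="1.0.0",           # the version the SDK log suggests pinning to
    instance_type="ml.p5.48xlarge",  # the default reported in the log above
)
print(kwargs)
# model = build_model(**kwargs)  # requires AWS credentials and the sagemaker SDK
```

Pinning the version avoids the input/output signature changes that the log warns can accompany a major version upgrade.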


In [6]:
predictor = model.deploy(accept_eula=accept_eula)

--------------!

## Run Inference

Now that the Llama 4 model is deployed, we can run inference. The following cells retrieve the example payloads bundled with the model and run each one against the endpoint.

In [7]:
example_payloads = model.retrieve_all_examples()

In [10]:
for payload in example_payloads:
    response = predictor.predict(payload)
    response = response[0] if isinstance(response, list) else response
    print("Input:\n", payload.body, end="\n\n")
    if "generated_text" in response:
        print("Output:\n", response["generated_text"].strip(), end="\n\n\n")
    else:
        print("Output:\n", response["choices"][0]["message"]["content"], end="\n\n\n")

Input:
 {'messages': [{'role': 'system', 'content': 'You are a helpful assistant'}, {'role': 'user', 'content': [{'type': 'text', 'text': 'Hello, how can you help me today?'}]}], 'temperature': 0.7, 'max_tokens': 150}

Output:
 Hello! I'm here to help with any questions or tasks you may have. I can assist with a wide range of topics, such as:

* Answering questions on various subjects (e.g., science, history, technology, health, etc.)
* Generating text or writing assistance (e.g., proofreading, summarizing, or creating content)
* Providing definitions or explanations for specific terms or concepts
* Offering suggestions or ideas for projects or problems you're facing
* Chatting and conversation (if you just want to talk!)

What brings you here today? Is there something specific you'd like help with, or do you want to explore and see where our conversation takes us?


Input:
 {'messages': [{'role': 'system', 'content': 'You are an expert programming assistant specializing in Python'}, {

In [24]:
# Create your prompt using the OpenAI Chat Completions format
payload = {
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful AI assistant."
        },
        {
            "role": "user",
            "content": "What are three key benefits of large language models for businesses?"
        }
    ],
    "max_tokens": 2048,
    "temperature": 0.7,
    "top_p": 0.9
}

In [25]:
# Run inference
print("Sending request to model...")
response = predictor.predict(payload)
response = response[0] if isinstance(response, list) else response

# Print the output
print("Input:\n", payload, "\n")
if "generated_text" in response:
    print("Output:\n", response["generated_text"].strip(), "\n\n")
elif "choices" in response and len(response["choices"]) > 0:
    print("Output:\n", response["choices"][0]["message"]["content"], "\n\n")
else:
    print("Raw response:", response)

Sending request to model...
Input:
 {'messages': [{'role': 'system', 'content': 'You are a helpful AI assistant.'}, {'role': 'user', 'content': 'What are three key benefits of large language models for businesses?'}], 'max_tokens': 2048, 'temperature': 0.7, 'top_p': 0.9} 

Output:
 Large language models (LLMs) have numerous benefits for businesses, and here are three key advantages:

1. **Enhanced Customer Experience through Automation**: LLMs can be integrated with chatbots and virtual assistants to provide 24/7 customer support, automating routine inquiries, and freeing up human customer support agents to focus on more complex issues. This leads to faster response times, improved customer satisfaction, and reduced support costs. Additionally, LLMs can analyze customer interactions to identify patterns and preferences, enabling businesses to personalize their services and improve overall customer experience.
2. **Increased Efficiency in Content Generation and Analysis**: LLMs can gene
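
The cells in this notebook repeat the same response-handling branches: unwrap a list, then read either `generated_text` or the first chat completion choice. As a sketch, that logic could be collected into one helper. The name `extract_text` is hypothetical, and the two sample responses below are hand-written stand-ins for the shapes the notebook handles, not real endpoint output.

```python
def extract_text(response):
    """Return the generated text from either response shape handled in this notebook."""
    if isinstance(response, list):
        if not response:
            raise ValueError("empty response list")
        response = response[0]
    if "generated_text" in response:
        return response["generated_text"].strip()
    choices = response.get("choices") or []
    if choices:
        return choices[0]["message"]["content"]
    raise ValueError(f"unexpected response format: {response!r}")


# Hand-written examples of the two shapes (illustrative only)
chat_style = {"choices": [{"message": {"role": "assistant", "content": "Hello!"}}]}
text_style = [{"generated_text": "  Hello!  "}]

print(extract_text(chat_style))
print(extract_text(text_style))
```

Raising on an unknown shape, rather than printing and continuing, makes a changed response format visible immediately.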

# Image / Multi-Image Inputs

In [18]:
import requests
import base64
import boto3
import json

s3 = boto3.client("s3")
region = 'us-east-1'

s3_bucket = f"jumpstart-cache-prod-{region}"
key_prefix = "inference-notebook-assets"

def download_from_s3(key_filenames):
    for key_filename in key_filenames:
        s3.download_file(s3_bucket, f"{key_prefix}/{key_filename}", key_filename)

# Define image names
heat_map = "heatmap_semantic_similarity_search.png"

# Download the heatmap image so it exists locally before it is encoded
download_from_s3(key_filenames=[heat_map])

def img_to_base64(image_path):
    """Convert image to base64 with proper MIME type prefix"""
    with open(image_path, "rb") as f:
        img = f.read()
    
    # Determine MIME type based on file extension
    mime_type = "image/png"  # Default to PNG
    if image_path.lower().endswith('.jpg') or image_path.lower().endswith('.jpeg'):
        mime_type = "image/jpeg"
    
    # Encode image and add proper prefix
    enc_img = base64.b64encode(img).decode('utf-8')
    return f"data:{mime_type};base64,{enc_img}"

# Convert image to base64 with proper format
b64_img = img_to_base64(heat_map)

# Verify the base64 string starts correctly
print(f"Base64 string prefix: {b64_img[:30]}...")

def url_to_base64(image_url):
    """Download an image from URL and convert to base64"""
    # Download the image
    response = requests.get(image_url)
    if response.status_code != 200:
        print(f"Failed to download image from {image_url}, status code: {response.status_code}")
        return None
    
    # Determine MIME type based on Content-Type header or URL extension
    mime_type = "image/jpeg"  # Default
    content_type = response.headers.get('Content-Type', '')
    if content_type.startswith('image/'):
        mime_type = content_type
    elif image_url.lower().endswith('.png'):
        mime_type = "image/png"
    
    # Encode the image content to base64
    image_base64 = base64.b64encode(response.content).decode('utf-8')
    return f"data:{mime_type};base64,{image_base64}"

# Define image names for S3 download
image1_name = "heatmap_semantic_similarity_search.png"
# Define URL for second image
image2_url = "https://raw.githubusercontent.com/mathvision-cuhk/MATH-V/refs/heads/main/images/13.jpg"

# Download the first image from S3
print("Downloading first image from S3...")
download_from_s3(key_filenames=[image1_name])

# Convert first image to base64
print("Converting first image to base64...")
b64_img1 = img_to_base64(image1_name)

# Download and convert second image from URL
print("Downloading and converting second image...")
b64_img2 = url_to_base64(image2_url)

# Verify the base64 strings start correctly
print(f"Base64 string 1 prefix: {b64_img1[:30]}...")
print(f"Base64 string 2 prefix: {b64_img2[:30]}...")

Base64 string prefix: data:image/png;base64,iVBORw0K...
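
The helpers above pick the MIME type by checking file extensions by hand. As an alternative sketch, Python's standard `mimetypes` module can infer it from the file name. The function name `bytes_to_data_url` and the fallback MIME type are assumptions made for illustration.

```python
import base64
import mimetypes


def bytes_to_data_url(data, filename, default_mime="application/octet-stream"):
    """Encode raw bytes as a data URL, guessing the MIME type from the file name."""
    mime_type, _ = mimetypes.guess_type(filename)
    encoded = base64.b64encode(data).decode("utf-8")
    return f"data:{mime_type or default_mime};base64,{encoded}"


# The 8-byte PNG signature stands in for a real image file here.
png_signature = b"\x89PNG\r\n\x1a\n"
print(bytes_to_data_url(png_signature, "heatmap_semantic_similarity_search.png"))
```

Because the helper takes bytes rather than a path, the same function can serve both files read from disk and content downloaded with `requests`.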


In [21]:
# Create the payload as a Python dictionary
data = {
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant",
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What is in this image?",
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": b64_img
                    }
                },
            ],
        },
    ],
    "temperature": 0.6,
    "top_p": 0.9,
    "max_tokens": 512
}

In [22]:
# Run inference
print("Sending request to model...")
response = predictor.predict(data)  # Pass the dictionary directly
response = response[0] if isinstance(response, list) else response

# Print the output
print("Input: [Image with prompt 'What is in this image?']")
if "generated_text" in response:
    print("Output:\n", response["generated_text"].strip(), "\n\n")
elif "choices" in response and len(response["choices"]) > 0:
    print("Output:\n", response["choices"][0]["message"]["content"], "\n\n")
else:
    print("Raw response:", response)

Sending request to model...
Input: [Image with prompt 'What is in this image?']
Output:
 The image presents a heatmap illustrating the semantic textual similarity between sentences. The heatmap is divided into two axes, with the x-axis listing seven sentences and the y-axis listing the same sentences in reverse order.

**Sentences:**

* Your dog is so cute.
* How cute your dog is!
* You have such a cute dog!
* New York City is the place where I work.
* I work in New York City.
* What color do you like the most?
* What is your favourite color?

**Heatmap:**

The heatmap displays a range of colors, from dark red (indicating high similarity) to light beige (indicating low similarity). The color scale on the right side of the heatmap provides a visual representation of the similarity scores, ranging from 1.0 (dark red) to -0.4 (light beige).

**Observations:**

* The three sentences related to dogs ("Your dog is so cute.", "How cute your dog is!", and "You have such a cute dog!") exhibit h

In [19]:
# Create the payload with two images
data = {
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant",
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "There are two images. Describe each one and why they are related",
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": b64_img1,
                    }
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": b64_img2,
                    }
                },
            ],
        },
    ],
    "temperature": 0.6,
    "top_p": 0.9,
    "max_tokens": 512
}

Downloading first image from S3...
Converting first image to base64...
Downloading and converting second image...
Base64 string 1 prefix: data:image/png;base64,iVBORw0K...
Base64 string 2 prefix: data:image/jpeg;base64,/9j/4AA...


In [20]:
# Run inference
print("Sending request to model...")
response = predictor.predict(data)  # Pass the dictionary directly
response = response[0] if isinstance(response, list) else response

# Print the output
print("Input: [Two images with prompt 'There are two images. Describe each one and why they are related']")
if "generated_text" in response:
    print("Output:\n", response["generated_text"].strip(), "\n\n")
elif "choices" in response and len(response["choices"]) > 0:
    print("Output:\n", response["choices"][0]["message"]["content"], "\n\n")
else:
    print("Raw response:", response)

Sending request to model...
Input: [Two images with prompt 'There are two images. Describe each one and why they are related']
Output:
 The image presents two distinct visual representations, each conveying different information.

**Image 1: Semantic Textual Similarity Between Sentences**

This image is a heatmap illustrating the semantic textual similarity between various sentences. The heatmap is a square matrix with a color gradient ranging from dark red (high similarity) to light beige (low similarity). The sentences being compared are:

* Your dog is so cute.
* How cute your dog is!
* You have such a cute dog!
* New York City is the place where I work.
* I work in New York City.
* What color do you like the most?
* What is your favourite color?

The heatmap reveals that:

* Sentences about the dog's cuteness have high similarity scores (dark red), indicating they convey similar meanings.
* Sentences about working in New York City also show high similarity scores.
* Sentences about favorite colors have high similarity scores.
*
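
The single-image and two-image payloads above differ only in how many `image_url` parts follow the text part. A small builder, sketched here under that assumption, produces the same structure for any number of images. The name `build_vision_payload` is hypothetical; the default sampling parameters mirror the cells above.

```python
def build_vision_payload(prompt, image_urls, system="You are a helpful assistant",
                         temperature=0.6, top_p=0.9, max_tokens=512):
    """Build a chat payload with one text part followed by one part per image."""
    if any(url is None for url in image_urls):
        # url_to_base64 above returns None when a download fails
        raise ValueError("at least one image failed to load")
    content = [{"type": "text", "text": prompt}]
    content += [{"type": "image_url", "image_url": {"url": url}} for url in image_urls]
    return {
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": content},
        ],
        "temperature": temperature,
        "top_p": top_p,
        "max_tokens": max_tokens,
    }


# Placeholder data URLs; in the notebook these would be b64_img1 and b64_img2.
payload = build_vision_payload(
    "There are two images. Describe each one and why they are related",
    ["data:image/png;base64,AAAA", "data:image/jpeg;base64,BBBB"],
)
print(len(payload["messages"][1]["content"]))
```

Checking for `None` up front surfaces a failed image download before the request is sent to the endpoint.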

# Codebase Analysis

With Llama 4 Scout's long context window, the model can analyze a large codebase in a single prompt. This example asks the model to explain the buildspec-1-10-2.yml file from the AWS Deep Learning Containers GitHub repository and how it relates to the rest of the repository. We used a tool to convert the whole repository into plaintext, which is passed to the model as context. Note that summarize_text truncates its input to max_length characters (2,000 by default), so pass a larger max_length to send more of the repository:

In [2]:
import json
import time
import boto3
from sagemaker.jumpstart.model import JumpStartModel

def read_markdown_file(file_path):
    """
    Read the content of a markdown file.
    
    Args:
        file_path (str): Path to the markdown file to be read
        
    Returns:
        str: The content of the file as a string
    """
    try:
        with open(file_path, 'r', encoding='utf-8') as file:
            content = file.read()
            return content
    except FileNotFoundError:
        print(f"Error: File not found at path {file_path}")
        raise
    except Exception as e:
        print(f"Error reading file: {e}")
        raise

def summarize_text(text, predictor, max_length=2000):
    """
    Summarize the provided text using a SageMaker JumpStart model predictor.
    
    Args:
        text (str): Text to summarize
        predictor: JumpStart model predictor
        max_length (int): Maximum number of characters of text to send (a rough guard against token limits)
        
    Returns:
        tuple: (Summarized text, Latency metrics)
    """
    # Truncate text if necessary to avoid exceeding token limits
    if len(text) > max_length:
        text = text[:max_length] + "... [Content truncated due to length]"
    
    # Create the payload using OpenAI Chat Completions format
    payload = {
        "messages": [
            {
                "role": "system",
                "content": """You are a helpful AI assistant that summarizes codebases to me to help me understand 
                            how to analyze code by synthesizing through the entire codebase before responding.

                            Be thorough in your search as the file may be nested within a markdown code block or within a directory listing.
                            """
            },
            {
                "role": "user",
                "content": f"Can you explain to me the buildspec-1-10-2.yml file and how it relates to the rest of the huggingface directory? Use this information :\n\n{text} as reference."
            }
        ],
        "max_tokens": 2048,
        "temperature": 0.7,
        "top_p": 0.9
    }
    
    # Start measuring preprocessing time
    preprocess_start = time.time()
    
    # Measure payload size
    payload_size = len(json.dumps(payload).encode('utf-8'))
    
    preprocess_end = time.time()
    preprocess_time = preprocess_end - preprocess_start
    
    # Invoke the predictor and measure latency
    api_start = time.time()
    response = predictor.predict(payload)
    api_end = time.time()
    api_latency = api_end - api_start
    
    # Process response based on JumpStart model output format
    if isinstance(response, list):
        response = response[0]
    
    # Start measuring postprocessing time
    postprocess_start = time.time()
    
    # Extract the summary from the response based on format
    if "choices" in response and len(response["choices"]) > 0:
        summary = response["choices"][0]["message"]["content"]
    elif "generated_text" in response:
        summary = response["generated_text"].strip()
    else:
        summary = str(response)
    
    # Calculate output size
    output_size = len(summary.encode('utf-8'))
    
    postprocess_end = time.time()
    postprocess_time = postprocess_end - postprocess_start
    
    # Collect metrics
    metrics = {
        'api_latency': api_latency,
        'preprocess_time': preprocess_time,
        'postprocess_time': postprocess_time,
        'total_latency': preprocess_time + api_latency + postprocess_time,
        'payload_size_bytes': payload_size,
        'output_size_bytes': output_size,
        'payload_tokens': len(text) / 4,  # rough approximation of tokens
    }
    
    return summary, metrics





In [5]:
# Configuration
file_path = '/home/ec2-user/SageMaker/repomix-output.md'  # Update with actual markdown file

# Read the markdown file
print(f"Reading markdown file from {file_path}...")
read_start = time.time()
markdown_content = read_markdown_file(file_path)
read_end = time.time()
read_time = read_end - read_start
print(f"File read time: {read_time:.4f} seconds")

# Summarize the content
print("Generating summary...")
summary, latency_metrics = summarize_text(markdown_content, predictor)

# Print the latency metrics
print("\n--- LATENCY METRICS ---")
print(f"API call latency: {latency_metrics['api_latency']:.4f} seconds")
print(f"Preprocessing time: {latency_metrics['preprocess_time']:.4f} seconds")
print(f"Postprocessing time: {latency_metrics['postprocess_time']:.4f} seconds")
print(f"Total latency: {latency_metrics['total_latency']:.4f} seconds")
print(f"Payload size: {latency_metrics['payload_size_bytes'] / 1024:.2f} KB")
print(f"Output size: {latency_metrics['output_size_bytes'] / 1024:.2f} KB")
print(f"Approximate input tokens: {int(latency_metrics['payload_tokens'])}")

# Print the summary
print("\n--- SUMMARY ---\n")
print(summary)


Reading markdown file from /home/ec2-user/SageMaker/repomix-output.md...
File read time: 0.1799 seconds
Generating summary...

--- LATENCY METRICS ---
API call latency: 41.5058 seconds
Preprocessing time: 0.0000 seconds
Postprocessing time: 0.0000 seconds
Total latency: 41.5059 seconds
Payload size: 2.69 KB
Output size: 3.22 KB
Approximate input tokens: 509

--- SUMMARY ---

## Overview of the buildspec-1-10-2.yml File

The provided information doesn't directly reference a `buildspec-1-10-2.yml` file. However, based on the context of the Hugging Face repository and common practices in software development, I can infer that this file likely relates to a build specification, possibly for a Continuous Integration/Continuous Deployment (CI/CD) pipeline.

## Understanding the File's Purpose

Typically, a `buildspec.yml` file is used in CI/CD pipelines to define the build, test, and deployment process for a project. The version number (`1-10-2`) might indicate a specific configuration or a v
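
The repository-to-plaintext step is not shown in this notebook. As a rough stand-in for whatever tool was used, the sketch below walks a directory and concatenates text files under per-file headers. The function names, the header format, and the extension list are all assumptions, and the token estimate reuses the four-characters-per-token approximation from `summarize_text`.

```python
import os
import tempfile


def pack_directory(root, extensions=(".py", ".yml", ".yaml", ".md", ".txt")):
    """Concatenate matching files under root into one string with per-file headers."""
    sections = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # sort in place so the traversal order is deterministic
        for name in sorted(filenames):
            if not name.endswith(extensions):
                continue
            path = os.path.join(dirpath, name)
            rel_path = os.path.relpath(path, root)
            with open(path, "r", encoding="utf-8", errors="replace") as f:
                body = f.read()
            sections.append(f"## File: {rel_path}\n\n{body}\n")
    return "\n".join(sections)


def approx_tokens(text):
    """Rough token estimate: about four characters per token."""
    return len(text) // 4


# Demo on a throwaway directory with made-up files
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "sub"))
    with open(os.path.join(root, "a.py"), "w", encoding="utf-8") as f:
        f.write("print('a')")
    with open(os.path.join(root, "sub", "b.yml"), "w", encoding="utf-8") as f:
        f.write("version: 0.2")
    with open(os.path.join(root, "c.bin"), "w", encoding="utf-8") as f:
        f.write("skipped")
    packed = pack_directory(root)

print(packed)
print("Approximate tokens:", approx_tokens(packed))
```

Comparing `approx_tokens(packed)` against the `max_length` passed to `summarize_text` shows how much of the packed repository actually reaches the model.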

# Multi-Doc Processing
With its 10 million token context window, Llama 4 Scout is well suited to multi-document processing. In this example, the model extracts key financial metrics from Amazon 10-K reports (2015-2024) and then answers questions that compare data across years, all without additional processing tools beyond PDF text extraction.

In [27]:
!pip install pyPDF2

Collecting pyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: pyPDF2
Successfully installed pyPDF2-3.0.1


In [39]:
import boto3
import json
import time
import PyPDF2
import os
import glob
import sagemaker
from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

In [40]:
# Define helper functions
def extract_text_from_pdf(pdf_path):
    """Extract text from a PDF file."""
    text = ""
    try:
        with open(pdf_path, 'rb') as file:
            pdf_reader = PyPDF2.PdfReader(file)
            print(f"PDF has {len(pdf_reader.pages)} pages")
            # Extract text from first 100 pages (adjust as needed)
            for page_num in range(min(100, len(pdf_reader.pages))):
                print(f"Processing page {page_num+1}/{len(pdf_reader.pages)}...")
                text += pdf_reader.pages[page_num].extract_text() + "\n"
        return text
    except Exception as e:
        print(f"Error extracting text from PDF: {e}")
        return None

def extract_key_data_from_pdfs(pdf_files, predictor):
    """Extract and summarize key data from each PDF file using JumpStart model."""
    pdf_summaries = {}
    
    for pdf_file in pdf_files:
        file_name = os.path.basename(pdf_file)
        print(f"\n--- Extracting key data from {file_name} ---")
        
        pdf_text = extract_text_from_pdf(pdf_file)
        if not pdf_text:
            print(f"Failed to extract text from {file_name}")
            continue
            
        # Create a payload using the OpenAI Chat Completions format
        payload = {
            "model": "meta-vlm-llama-4-scout-17b-16e-instruct",
            "messages": [
                {
                    "role": "system",
                    "content": "You are a financial analyst. Extract key financial data from this 10-K report."
                },
                {
                    "role": "user",
                    "content": f"""Extract the following information from this Amazon 10-K report:
1. What year is this report for?
2. Total revenue for the year
3. Net income
4. Key business segments and their performance
5. Important trends mentioned

Here's the 10-K text:
{pdf_text[:25000]}

Format your response as a JSON object with these fields: year, revenue, net_income, segments, trends.
"""
                }
            ],
            "max_tokens": 2048,
            "temperature": 0.7,
            "top_p": 0.9
        }
        
        try:
            print(f"Extracting key data from {file_name}...")
            # Use the JumpStart predictor
            response = predictor.predict(payload)
            
            # Handle response format
            if isinstance(response, list):
                response = response[0]
                
            # Extract content based on response format
            if "choices" in response and len(response["choices"]) > 0:
                summary = response["choices"][0]["message"]["content"]
                pdf_summaries[file_name] = summary
                print(f"Successfully extracted key data from {file_name}")
            elif "generated_text" in response:
                summary = response["generated_text"].strip()
                pdf_summaries[file_name] = summary
                print(f"Successfully extracted key data from {file_name}")
            else:
                print(f"Failed to extract key data from {file_name}: Unexpected response format")
                print(f"Response: {response}")
                
        except Exception as e:
            print(f"Error processing {file_name}: {e}")
    
    return pdf_summaries

def find_pdf_files(directory):
    """Find all PDF files in the specified directory."""
    pdf_files = []
    try:
        # Look for all PDF files in the directory
        pdf_pattern = os.path.join(directory, "*.pdf")
        pdf_files = glob.glob(pdf_pattern)
        
        if pdf_files:
            print(f"Found {len(pdf_files)} PDF files:")
            for pdf in pdf_files:
                print(f"  - {os.path.basename(pdf)}")
        else:
            print(f"No PDF files found in {directory}")
    except Exception as e:
        print(f"Error finding PDF files: {e}")
    
    return pdf_files

def synthesize_and_answer_question(pdf_summaries, question, predictor):
    """Synthesize information from all PDFs and answer a specific question."""
    # Combine all summaries
    combined_context = ""
    for pdf_name, summary in pdf_summaries.items():
        combined_context += f"--- Data from {pdf_name} ---\n{summary}\n\n"
    
    # Create the system prompt
    system_prompt = """You are an expert financial analyst comparing Amazon's 10-K reports across multiple years.
Your task is to synthesize information from these reports and answer questions accurately.
Make sure to compare data across years when relevant and highlight significant changes or trends."""
    
    # Create the user prompt with combined context
    user_content = f"""I've analyzed multiple Amazon 10-K reports and extracted the key information from each:

{combined_context}

Based on this information from all the reports, please answer the following question:
{question}

Provide a comprehensive answer that compares information across all available years."""
    
    # Create the payload
    payload = {
        "model": "meta-vlm-llama-4-scout-17b-16e-instruct",
        "messages": [
            {
                "role": "system",
                "content": system_prompt
            },
            {
                "role": "user",
                "content": user_content
            }
        ],
        "max_tokens": 4096,
        "temperature": 0.7,
        "top_p": 0.9
    }

    try:
        print(f"\nSynthesizing information and answering question...")
        start_time = time.time()
        
        response = predictor.predict(payload)
        
        end_time = time.time()
        
        if isinstance(response, list):
            response = response[0]
            
        print(f"Time taken: {end_time - start_time:.2f} seconds")
        
        if "choices" in response and len(response["choices"]) > 0:
            return response["choices"][0]["message"]["content"]
        elif "generated_text" in response:
            return response["generated_text"].strip()
        else:
            print(f"Unexpected response format: {response}")
            return None
            
    except Exception as e:
        print(f"Error synthesizing information: {e}")
        return None
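
The extraction prompt asks the model for a JSON object, but `extract_key_data_from_pdfs` stores the raw reply string. Models often wrap JSON in a code fence or surround it with prose, so a tolerant parser can help if structured fields are needed downstream. The following is a minimal sketch; `parse_json_reply` is a hypothetical helper, and the sample reply is made up rather than real model output.

```python
import json


def parse_json_reply(reply):
    """Return the JSON object embedded in a model reply, or None if none parses."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(reply[start:end + 1])
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


FENCE = "`" * 3  # a Markdown code fence, built this way to keep the example readable

# Made-up reply with placeholder values, for illustration only
sample_reply = (
    "Here is the data:\n"
    + FENCE + "json\n"
    + '{"year": 2000, "revenue": "placeholder", "segments": []}\n'
    + FENCE
)

print(parse_json_reply(sample_reply))
print(parse_json_reply("No structured data here."))
```

Taking the span from the first `{` to the last `}` is a simple heuristic; it fails if the surrounding prose contains stray braces, in which case the function returns `None` and the raw string can still be used.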

In [41]:
# Define the directory containing PDF files
pdf_directory = "/home/ec2-user/SageMaker/Llama4_Assets/"  # Replace with your directory path

# Connect to your existing endpoint
endpoint_name = 'meta-vlm-llama-4-scout-17b-16e-instruct-2025-04-14-20-01-24-485'  # Replace with your actual endpoint name
print(f"Connecting to existing endpoint: {endpoint_name}")

# Create a proper SageMaker session
sagemaker_session = sagemaker.Session()

# Create the predictor with the SageMaker session
predictor = Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sagemaker_session,
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer()
)
print("Successfully connected to the endpoint")

try:
    # Find PDF files
    pdf_files = find_pdf_files(pdf_directory)
    if not pdf_files:
        print("No PDF files found. Exiting.")
        raise SystemExit(0)
    
    # Extract summaries from PDFs
    print("\nExtracting key data from all PDFs...")
    pdf_summaries = extract_key_data_from_pdfs(pdf_files, predictor)
    
    if not pdf_summaries:
        print("Failed to extract data from any PDFs. Exiting.")
        raise SystemExit(0)
    
    print("\nSuccessfully extracted data from the following PDFs:")
    for pdf_name in pdf_summaries.keys():
        print(f"  - {pdf_name}")
    
    # Define your questions
    questions = [
        "How did Amazon's revenue and net income change over time?",
        "What are the main business segments driving growth at Amazon over the years?",
        "How has AWS evolved as a segment and what is its contribution to Amazon's overall business?"
    ]
    
    # Answer each question
    for question in questions:
        print(f"\nAnswering question: {question}")
        answer = synthesize_and_answer_question(pdf_summaries, question, predictor)
        
        if answer:
            print("\n" + "="*50)
            print(f"ANSWER to: {question}")
            print("="*50)
            print(answer)
        else:
            print(f"Sorry, I couldn't generate an answer for: {question}")

except Exception as e:
    print(f"Error occurred: {e}")

Connecting to existing endpoint: meta-vlm-llama-4-scout-17b-16e-instruct-2025-04-14-20-01-24-485
Successfully connected to the endpoint
Found 10 PDF files:
  - amazon10k_2022.pdf
  - amazon_10k_2020.pdf
  - amazon_10k_2015.pdf
  - amazon_10k_2016.pdf
  - amazon_10k_2018.pdf
  - amazon_10k_2021.pdf
  - amazon_10k_2024.pdf
  - amazon_10k_2019.pdf
  - amazon_10k_2017.pdf
  - amazon_10k_2023.pdf

Extracting key data from all PDFs...

--- Extracting key data from amazon10k_2022.pdf ---
PDF has 194 pages
Processing page 1/194...
Processing page 2/194...
Processing page 3/194...
Processing page 4/194...
Processing page 5/194...
Processing page 6/194...
Processing page 7/194...
Processing page 8/194...
Processing page 9/194...
Processing page 10/194...
Processing page 11/194...
Processing page 12/194...
Processing page 13/194...
Processing page 14/194...
Processing page 15/194...
Processing page 16/194...
Processing page 17/194...
Processing page 18/194...
Processing page 19/194...
Processing 
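
The file listing above is not in chronological order, and the synthesis step concatenates summaries in the order the files were processed. Sorting the paths by the year in each file name before extraction would present the reports to the model oldest first. A sketch, assuming every file name contains a four-digit year as the ones listed above do (`report_year` and `sort_by_year` are hypothetical names):

```python
import os
import re


def report_year(path):
    """Return the first four-digit year (19xx or 20xx) in the file name, or None."""
    match = re.search(r"(?:19|20)\d{2}", os.path.basename(path))
    return int(match.group(0)) if match else None


def sort_by_year(paths):
    """Sort paths by year; files without a recognizable year go last."""
    return sorted(paths, key=lambda p: (report_year(p) is None, report_year(p) or 0, p))


files = ["amazon10k_2022.pdf", "amazon_10k_2020.pdf", "amazon_10k_2015.pdf"]
print(sort_by_year(files))
```

In the notebook, this would amount to calling `sort_by_year(find_pdf_files(pdf_directory))` before passing the list to `extract_key_data_from_pdfs`.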

In [44]:
predictor.delete_endpoint()
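
Deleting the endpoint stops the hosting charges. The SageMaker Python SDK's `Predictor` also provides `delete_model()` for removing the model resource. The sketch below wraps both calls so that one failure does not skip the other; `cleanup` is a hypothetical helper, and the stub class exists only to demonstrate it without AWS access.

```python
def cleanup(predictor):
    """Delete the model and endpoint behind a predictor; return the steps that failed."""
    failed = []
    # delete_model runs first because it looks up the model through the endpoint's
    # configuration, which delete_endpoint removes by default.
    for step in ("delete_model", "delete_endpoint"):
        try:
            getattr(predictor, step)()
        except Exception as exc:
            print(f"{step} failed: {exc}")
            failed.append(step)
    return failed


class StubPredictor:
    """Stand-in for a real Predictor, used only for this demonstration."""

    def __init__(self):
        self.calls = []

    def delete_model(self):
        self.calls.append("delete_model")

    def delete_endpoint(self):
        self.calls.append("delete_endpoint")


stub = StubPredictor()
print(cleanup(stub), stub.calls)
```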