# Testing a Deployed Fine-tuned LLaMA Model Endpoint

This notebook provides examples for testing your deployed SageMaker endpoint.

## Prerequisites
- Model has been approved in Model Registry
- Deployment pipeline has completed
- Endpoint is in 'InService' status

## Setup

In [None]:
import boto3
import json
import time

# Initialize clients
sm_client = boto3.client('sagemaker')
runtime_client = boto3.client('sagemaker-runtime')

print("✓ Clients initialized")

## Find Your Endpoint

List the 10 most recently created endpoints to find yours:

In [None]:
# List the 10 most recently created endpoints, newest first
response = sm_client.list_endpoints(
    SortBy='CreationTime',
    SortOrder='Descending',
    MaxResults=10
)

print("Available Endpoints:")
print("-" * 80)
for endpoint in response['Endpoints']:
    print(f"Name: {endpoint['EndpointName']}")
    print(f"Status: {endpoint['EndpointStatus']}")
    print(f"Created: {endpoint['CreationTime']}")
    print("-" * 80)
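
If the account has more endpoints than fit in one page, the listing can be filtered and paginated. The following is a minimal sketch, assuming you want to match endpoints by a name fragment; `find_endpoints` is a hypothetical helper, while `NameContains`, `StatusEquals`, and `NextToken` are parameters of the `list_endpoints` call used above.

```python
def find_endpoints(sm_client, name_contains=None, status="InService"):
    """Return names of all endpoints matching the filters, following pagination."""
    kwargs = {"SortBy": "CreationTime", "SortOrder": "Descending", "MaxResults": 100}
    if name_contains:
        kwargs["NameContains"] = name_contains
    if status:
        kwargs["StatusEquals"] = status

    names = []
    while True:
        page = sm_client.list_endpoints(**kwargs)
        names.extend(e["EndpointName"] for e in page["Endpoints"])
        token = page.get("NextToken")
        if not token:
            return names
        # Ask for the next page on the following iteration
        kwargs["NextToken"] = token


# Example (assumes sm_client from the Setup cell and a hypothetical name fragment):
# find_endpoints(sm_client, name_contains="llama")
```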

In [None]:
# Set your endpoint name here
ENDPOINT_NAME = "your-endpoint-name-here"  # Replace with your actual endpoint name

print(f"Using endpoint: {ENDPOINT_NAME}")

## Check Endpoint Status

In [None]:
# Check endpoint status
response = sm_client.describe_endpoint(EndpointName=ENDPOINT_NAME)
print(f"Endpoint Name: {response['EndpointName']}")
print(f"Status: {response['EndpointStatus']}")

# Get instance details from ProductionVariants
variant = response['ProductionVariants'][0]
print(f"Variant Name: {variant['VariantName']}")
print(f"Current Instance Count: {variant.get('CurrentInstanceCount', 'N/A')}")
print(f"Desired Instance Count: {variant.get('DesiredInstanceCount', 'N/A')}")

# To get InstanceType, you need to describe the endpoint config
config_name = response['EndpointConfigName']
config_response = sm_client.describe_endpoint_config(EndpointConfigName=config_name)
print(f"Instance Type: {config_response['ProductionVariants'][0]['InstanceType']}")

if response['EndpointStatus'] == 'InService':
    print("\n✅ Endpoint is ready for inference!")
else:
    print(f"\n⚠️ Endpoint is {response['EndpointStatus']}. Please wait for it to be InService.")
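
Rather than re-running the status cell by hand, the wait can be automated. This is a sketch under stated assumptions: `wait_until_in_service` is a hypothetical helper, and the polling interval and timeout are arbitrary defaults. It relies only on the `describe_endpoint` call shown above.

```python
import time


def wait_until_in_service(sm_client, endpoint_name, poll_seconds=30, timeout_seconds=1800):
    """Poll describe_endpoint until the endpoint is InService; raise on failure or timeout."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError(f"Endpoint {endpoint_name} entered Failed status")
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Endpoint {endpoint_name} still {status} after {timeout_seconds}s"
            )
        time.sleep(poll_seconds)


# Example (assumes sm_client and ENDPOINT_NAME from the cells above):
# wait_until_in_service(sm_client, ENDPOINT_NAME)
```

boto3 also ships a built-in waiter for the same purpose, `sm_client.get_waiter('endpoint_in_service')`, which may be preferable when custom timeout handling is not needed.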


## Helper Function for Inference

In [None]:
def invoke_endpoint(prompt, max_tokens=150, temperature=0.7, show_prompt=True):
    """
    Invoke the SageMaker endpoint with a prompt.
    
    Args:
        prompt: Input prompt text
        max_tokens: Maximum tokens to generate
        temperature: Sampling temperature (0.0-1.0)
        show_prompt: Whether to display the prompt
    
    Returns:
        Generated text
    """
    # Prepare payload
    payload = {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_tokens,
            "temperature": temperature,
            "top_p": 0.95,
            "do_sample": True
        }
    }
    
    if show_prompt:
        print("📝 Prompt:")
        print("-" * 80)
        print(prompt[:200] + "..." if len(prompt) > 200 else prompt)
        print("-" * 80)
    
    # Invoke endpoint
    start_time = time.time()
    
    response = runtime_client.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType='application/json',
        Body=json.dumps(payload)
    )
    
    latency = time.time() - start_time
    
    # Parse response
    result = json.loads(response['Body'].read().decode())
    
    # Extract generated text
    if isinstance(result, list) and len(result) > 0:
        generated_text = result[0].get('generated_text', '')
    elif isinstance(result, dict):
        generated_text = result.get('generated_text', result.get('outputs', ''))
    else:
        generated_text = str(result)
    
    print(f"\n⏱️ Latency: {latency:.2f}s")
    print("\n📄 Generated Response:")
    print("=" * 80)
    print(generated_text)
    print("=" * 80)
    
    return generated_text
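
The response parsing in `invoke_endpoint` can also be factored into pure functions, which makes it testable without calling the endpoint. The sketch below mirrors the three response shapes handled above. `strip_echoed_prompt` is a hypothetical addition that assumes the serving container may return the prompt at the start of the generated text; whether it does depends on your container and its parameters.

```python
def extract_generated_text(result):
    """Pull the generated text out of the response shapes handled by invoke_endpoint."""
    if isinstance(result, list) and result:
        first = result[0]
        return first.get("generated_text", "") if isinstance(first, dict) else str(first)
    if isinstance(result, dict):
        return result.get("generated_text", result.get("outputs", ""))
    return str(result)


def strip_echoed_prompt(generated_text, prompt):
    """Remove the prompt from the start of the output if the container echoed it back."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].lstrip()
    return generated_text.strip()
```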

## Test 1: Summarization Task

In [None]:
prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Summarize the following text in 2-3 sentences.

### Input:
Machine learning is a subset of artificial intelligence that focuses on the development of algorithms and statistical models that enable computers to improve their performance on a specific task through experience. Unlike traditional programming where explicit instructions are provided, machine learning systems learn patterns from data and make decisions with minimal human intervention. This technology powers many modern applications including recommendation systems, image recognition, and natural language processing.

### Response:
"""

response = invoke_endpoint(prompt, max_tokens=100)

## Test 2: Question Answering

In [None]:
prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Answer the following question based on the context provided.

### Input:
Context: The Amazon rainforest, also known as Amazonia, is a moist broadleaf tropical rainforest in the Amazon biome that covers most of the Amazon basin of South America. The basin covers 7,000,000 square kilometers, of which 5,500,000 square kilometers are covered by the rainforest.

Question: How large is the Amazon rainforest?

### Response:
"""

response = invoke_endpoint(prompt, max_tokens=80)

## Test 3: Instruction Following

In [None]:
prompt = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Write a professional email to a customer apologizing for a delayed shipment.

### Response:
"""

response = invoke_endpoint(prompt, max_tokens=200)

## Test 4: Custom Prompt

Try your own prompt:

In [None]:
# Customize this prompt
custom_prompt = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Explain quantum computing in simple terms.

### Response:
"""

response = invoke_endpoint(custom_prompt, max_tokens=150, temperature=0.7)
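
All of the test prompts above share one of two templates, with or without an `### Input:` section. A small builder keeps them consistent and avoids copy-paste drift. This is a minimal sketch; `build_prompt` and the two constants are hypothetical names, and the template text is copied from the cells above.

```python
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that provides "
    "further context. Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input_text}\n\n### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)


def build_prompt(instruction, input_text=None):
    """Format an instruction (and optional input) using the templates from the tests above."""
    if input_text:
        return PROMPT_WITH_INPUT.format(
            instruction=instruction.strip(), input_text=input_text.strip()
        )
    return PROMPT_NO_INPUT.format(instruction=instruction.strip())


# Example (assumes invoke_endpoint from the helper cell):
# invoke_endpoint(build_prompt("Explain quantum computing in simple terms."))
```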

## Batch Processing Example

Process multiple prompts:

In [None]:
prompts = [
    """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
What is the capital of France?

### Response:
""",
    """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
List 3 benefits of exercise.

### Response:
""",
    """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Translate 'Hello, how are you?' to Spanish.

### Response:
"""
]

results = []
for i, prompt in enumerate(prompts, 1):
    print(f"\n{'='*80}")
    print(f"Processing prompt {i}/{len(prompts)}")
    print(f"{'='*80}")
    result = invoke_endpoint(prompt, max_tokens=100, show_prompt=False)
    results.append(result)
    time.sleep(0.5)  # Small delay between requests

print(f"\n✅ Processed {len(results)} prompts successfully!")
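
The loop above stops at the first request that raises an exception. For longer batches it may be useful to record failures and continue. The following is a sketch, not a definitive implementation; `run_batch` is a hypothetical helper that accepts any callable, so it can wrap `invoke_endpoint` from the helper cell.

```python
import time


def run_batch(prompts, invoke, delay_seconds=0.5):
    """Invoke each prompt in order; record failures instead of aborting the batch."""
    results, failures = [], []
    for index, prompt in enumerate(prompts):
        try:
            results.append(invoke(prompt))
        except Exception as exc:  # keep going; the caller inspects failures afterwards
            results.append(None)
            failures.append((index, repr(exc)))
        if index < len(prompts) - 1:
            time.sleep(delay_seconds)  # small delay between requests, as in the loop above
    return results, failures


# Example (assumes prompts and invoke_endpoint from the cells above):
# results, failures = run_batch(
#     prompts, lambda p: invoke_endpoint(p, max_tokens=100, show_prompt=False)
# )
```

Failed prompts keep their position in `results` as `None`, so outputs stay aligned with the input list.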

## Performance Testing

Measure latency across multiple requests:

In [None]:
import statistics

test_prompt = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
What is machine learning?

### Response:
"""

num_requests = 5
latencies = []

print(f"Running {num_requests} requests to measure latency...\n")

for i in range(num_requests):
    payload = {
        "inputs": test_prompt,
        "parameters": {"max_new_tokens": 50}
    }
    
    # Time the full round trip, including reading the response body
    start = time.time()
    response = runtime_client.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType='application/json',
        Body=json.dumps(payload)
    )
    response['Body'].read()
    latency = time.time() - start
    latencies.append(latency)
    print(f"Request {i+1}: {latency:.2f}s")

print(f"\n📊 Latency Statistics:")
print(f"  Average: {statistics.mean(latencies):.2f}s")
print(f"  Median: {statistics.median(latencies):.2f}s")
print(f"  Min: {min(latencies):.2f}s")
print(f"  Max: {max(latencies):.2f}s")
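
Mean and median hide tail latency, which is often what matters for user-facing applications. The sketch below adds nearest-rank percentiles using only the standard library; `percentile` and `summarize_latencies` are hypothetical helpers. With only 5 samples the upper percentiles collapse to the maximum, so consider raising `num_requests` before reading much into them.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize_latencies(latencies):
    """Return p50, p90, and p99 for a list of latencies in seconds."""
    return {name: percentile(latencies, pct) for name, pct in (("p50", 50), ("p90", 90), ("p99", 99))}


# Example (assumes latencies from the cell above):
# for name, value in summarize_latencies(latencies).items():
#     print(f"  {name}: {value:.2f}s")
```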

## Cleanup (Optional)

⚠️ **Warning:** This will delete your endpoint and stop all inference capabilities.

Only run this if you want to delete the endpoint to save costs:

In [None]:
# Uncomment to delete endpoint
# sm_client.delete_endpoint(EndpointName=ENDPOINT_NAME)
# print(f"✓ Endpoint {ENDPOINT_NAME} deleted")
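
Deleting an endpoint does not remove its endpoint configuration or the model objects it references. If those should go too, something like the following could be used. This is a sketch only: `delete_endpoint_resources` is a hypothetical helper, and it assumes the configuration and models are not shared with other endpoints, which you should verify before running it.

```python
def delete_endpoint_resources(sm_client, endpoint_name, delete_models=False):
    """Delete an endpoint and its endpoint config; optionally delete referenced models."""
    endpoint = sm_client.describe_endpoint(EndpointName=endpoint_name)
    config_name = endpoint["EndpointConfigName"]
    config = sm_client.describe_endpoint_config(EndpointConfigName=config_name)
    model_names = [v["ModelName"] for v in config["ProductionVariants"] if v.get("ModelName")]

    # Look everything up first, then delete: endpoint, config, models
    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=config_name)
    deleted_models = []
    if delete_models:
        for name in model_names:
            sm_client.delete_model(ModelName=name)
            deleted_models.append(name)
    return {"endpoint": endpoint_name, "config": config_name, "models": deleted_models}


# Example (destructive; assumes sm_client and ENDPOINT_NAME from the cells above):
# delete_endpoint_resources(sm_client, ENDPOINT_NAME, delete_models=True)
```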

## Summary

This notebook demonstrated:
- ✅ Finding and checking endpoint status
- ✅ Invoking endpoint with different prompts
- ✅ Batch processing multiple requests
- ✅ Measuring inference latency

### Next Steps:
1. Integrate endpoint into your application
2. Monitor CloudWatch metrics
3. Set up auto-scaling if needed
4. Configure alarms for monitoring
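
As a starting point for the CloudWatch monitoring step, server-side latency can be read from the `ModelLatency` metric in the `AWS/SageMaker` namespace. The sketch below is illustrative: `get_average_model_latency_ms` is a hypothetical helper, and the default variant name `AllTraffic` is an assumption; use the variant name printed in the status check above.

```python
from datetime import datetime, timedelta, timezone


def get_average_model_latency_ms(cw_client, endpoint_name, variant_name="AllTraffic", minutes=60):
    """Return (timestamp, average latency in ms) pairs for 5-minute periods, oldest first."""
    end = datetime.now(timezone.utc)
    response = cw_client.get_metric_statistics(
        Namespace="AWS/SageMaker",
        MetricName="ModelLatency",
        Dimensions=[
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        StartTime=end - timedelta(minutes=minutes),
        EndTime=end,
        Period=300,
        Statistics=["Average"],
    )
    points = sorted(response["Datapoints"], key=lambda p: p["Timestamp"])
    # ModelLatency is reported in microseconds; convert to milliseconds
    return [(p["Timestamp"], p["Average"] / 1000.0) for p in points]


# Example (assumes ENDPOINT_NAME from the cells above):
# import boto3
# cw_client = boto3.client("cloudwatch")
# for ts, ms in get_average_model_latency_ms(cw_client, ENDPOINT_NAME):
#     print(ts, f"{ms:.1f} ms")
```

This metric covers only the time spent in the model container, so it will usually be lower than the client-side latency measured in the performance test.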