# 📊 Benchmarking Phishing Detection Endpoint

**Purpose**: Evaluate endpoint performance under various load conditions.

This notebook:
- Tests single inference latency
- Runs concurrent load tests (4, 8, 16, 32, 64 clients)
- Measures P50, P90, and mean latencies plus error rate
- Plots latency vs. concurrency
- Provides cleanup utilities

## Prerequisites
- **Run `03_model_deployment.ipynb` first**
- Active SageMaker endpoint
- Test dataset from data processing

## Benchmarking Goals
- Understand endpoint capacity
- Identify optimal concurrency
- Measure tail latencies
- Validate production readiness

---

## 1. Setup and Installation

In [None]:
!pip install -Uq "sagemaker==2.253.1" joblib tqdm matplotlib numpy pandas

In [None]:
import boto3
import json
import time
import random
import numpy as np
import matplotlib.pyplot as plt
from botocore.config import Config
from joblib import Parallel, delayed
from tqdm import tqdm

## 2. Load Variables from Deployment

In [None]:
%store -r endpoint_name
%store -r model_name
%store -r test_s3_uri
%store -r region

# Verify
try:
    print("✅ Variables loaded:")
    print(f"  Endpoint: {endpoint_name}")
    print(f"  Model: {model_name}")
    print(f"  Test data: {test_s3_uri}")
    print(f"  Region: {region}")
except NameError:
    print("❌ Run 03_model_deployment.ipynb first!")
    raise

## 3. Download Test Dataset

Download test data from S3 for benchmarking.

In [None]:
import os

# Download test data
os.makedirs('datasets', exist_ok=True)

s3_client = boto3.client('s3', region_name=region)

# Parse S3 URI
bucket = test_s3_uri.split('/')[2]
key = '/'.join(test_s3_uri.split('/')[3:])

# Download
local_test_file = 'datasets/test.jsonl'
s3_client.download_file(bucket, key, local_test_file)

# Count lines
with open(local_test_file, 'r') as f:
    num_test_samples = sum(1 for _ in f)

print(f"✅ Downloaded test dataset: {num_test_samples} samples")
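
The cell above extracts the bucket and key by splitting the URI on `/`. A slightly more defensive sketch uses `urllib.parse.urlparse` and rejects anything that is not an `s3://` URI. The helper name `parse_s3_uri` and the example URI are hypothetical; neither comes from the notebook or from boto3.

```python
from urllib.parse import urlparse


def parse_s3_uri(uri):
    """Split an s3://bucket/key URI into (bucket, key). Hypothetical helper."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    # The path keeps its leading slash, which is not part of the object key.
    return parsed.netloc, parsed.path.lstrip("/")


# Example with a made-up URI
example_bucket, example_key = parse_s3_uri("s3://example-bucket/data/test.jsonl")
print(example_bucket, example_key)
```

The returned pair can be passed to `download_file` in place of the manually split values.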

## 4. Configure Inference Client

In [None]:
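# max_attempts counts retries after the initial request, so 0 disables retries.
# max_pool_connections is raised from the default of 10 so the higher-concurrency
# runs can reuse connections instead of opening new ones.
no_retry_config = Config(retries={'max_attempts': 0}, max_pool_connections=64)
runtime_client = boto3.client("sagemaker-runtime", region_name=region, config=no_retry_config)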

def invoke_classification_endpoint(ep_name, texts):
    """
    Invoke SageMaker classification endpoint.
    """
    if isinstance(texts, str):
        texts = [texts]
    
    payload = {"inputs": texts}
    
    response = runtime_client.invoke_endpoint(
        EndpointName=ep_name,
        ContentType='application/json',
        Body=json.dumps(payload)
    )
    
    return json.loads(response['Body'].read().decode())

print("✅ Inference client configured")

## 5. Single Inference Latency Test

Test latency for a single inference request.

In [None]:
def inference_latency(endpoint_name, test_file_location, num_lines=None):
    """
    Run a single inference benchmark using a random sample from test file.

    Only the endpoint call is timed; reading the sample from disk is excluded.

    Returns:
        dict: {'latency' (ms), 'error' (bool), 'result'}
    """
    error = False
    result = None
    start = None

    try:
        if num_lines is None:
            with open(test_file_location, 'r') as f:
                num_lines = sum(1 for _ in f)

        random_line_num = random.randint(0, num_lines - 1)

        with open(test_file_location, 'r') as f:
            for i, line in enumerate(f):
                if i == random_line_num:
                    data = json.loads(line)
                    text = data['text']
                    break

        # Start the clock right before the request so that file I/O
        # does not inflate the measured latency.
        start = time.perf_counter()
        result = invoke_classification_endpoint(endpoint_name, text)

    except Exception as e:
        error = True
        result = str(e)

    # Latency in ms; 0.0 if the failure happened before the request was sent.
    latency = (time.perf_counter() - start) * 1000.0 if start is not None else 0.0

    return {'latency': latency, 'error': error, 'result': result}

print("✅ Latency test function ready")
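
`inference_latency` re-opens the test file and scans to a random line on every call. As a sketch of an alternative, the texts can be loaded into memory once and sampled from a list, which keeps per-request client work small when many threads run in parallel. The helper names `load_texts` and `pick_random_text` are hypothetical, and the sketch assumes each JSONL line has a `text` field, as the function above does.

```python
import json
import random


def load_texts(path):
    """Read every 'text' field from a JSONL file into a list (hypothetical helper)."""
    texts = []
    with open(path, "r") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                texts.append(json.loads(line)["text"])
    return texts


def pick_random_text(texts, rng=random):
    """Return one text chosen uniformly at random from the preloaded list."""
    return rng.choice(texts)
```

A list built once with `load_texts(local_test_file)` could then feed `invoke_classification_endpoint` directly, at the cost of holding the whole test set in memory.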

In [None]:
# Run single test
print("Running single inference test...\n")

test_result = inference_latency(
    endpoint_name,
    local_test_file,
    num_lines=num_test_samples
)

print(f"Latency: {test_result['latency']:.2f} ms")
print(f"Error: {test_result['error']}")
if not test_result['error']:
    print(f"Result: {test_result['result']}")

## 6. Concurrent Load Testing

Test endpoint performance under concurrent load.

In [None]:
def run_benchmark(
    endpoint_name,
    test_file_location,
    number_of_clients,
    number_of_runs,
    num_lines=None
):
    """
    Run benchmark with concurrent clients.
    
    Args:
        endpoint_name: SageMaker endpoint name
        test_file_location: Path to test JSONL file
        number_of_clients: Number of parallel clients
        number_of_runs: Total number of requests
        num_lines: Number of lines in test file (optional)
    
    Returns:
        tuple: (p50_latency_ms, p90_latency_ms, mean_latency_ms, error_rate)
    """
    progress_bar = tqdm(
        range(number_of_runs),
        desc=f"{number_of_clients} clients",
        position=0,
        leave=True
    )
    
    results = Parallel(n_jobs=number_of_clients, prefer="threads")(
        delayed(inference_latency)(endpoint_name, test_file_location, num_lines)
        for _ in progress_bar
    )
    
    latencies = [res['latency'] for res in results if not res['error']]
    errors = sum(1 for res in results if res['error'])
    
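    if len(latencies) == 0:
        # Every request failed: return NaN so later formatting and plotting still work.
        return float('nan'), float('nan'), float('nan'), 1.0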
    
    p50_latency_ms = float(np.quantile(latencies, 0.50))
    p90_latency_ms = float(np.quantile(latencies, 0.90))
    mean_latency_ms = float(np.mean(latencies))
    error_rate = errors / len(results)
    
    return p50_latency_ms, p90_latency_ms, mean_latency_ms, error_rate

print("✅ Benchmark function ready")
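
`run_benchmark` relies on `np.quantile`, which by default interpolates linearly between the two nearest sorted samples. The pure-Python sketch below reproduces that default so the P50 and P90 figures are easier to interpret. The function name `quantile_linear` and the latencies are made up for illustration.

```python
def quantile_linear(values, q):
    """Quantile with linear interpolation between neighbouring sorted samples."""
    data = sorted(values)
    if not data:
        raise ValueError("need at least one value")
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    pos = q * (len(data) - 1)              # fractional index into the sorted data
    lower = int(pos)                       # floor, since pos is non-negative
    upper = min(lower + 1, len(data) - 1)
    frac = pos - lower
    return data[lower] + (data[upper] - data[lower]) * frac


# Made-up latencies in milliseconds
latencies_ms = [10, 20, 30, 40, 50]
print(quantile_linear(latencies_ms, 0.50))  # the median, 30
print(quantile_linear(latencies_ms, 0.90))  # falls between 40 and 50
```

With only a few hundred requests per level, P90 rests on a few dozen samples in the tail, so small differences between concurrency levels should be read with caution.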

## 7. Run Benchmarks Across Concurrency Levels

Test with 4, 8, 16, 32, and 64 concurrent clients.

In [None]:
# Benchmark configuration
concurrency_levels = [4, 8, 16, 32, 64]
num_requests_per_level = 512  # Total requests per concurrency level

# Storage for results
benchmark_results = {
    'concurrency': [],
    'p50_latency': [],
    'p90_latency': [],
    'mean_latency': [],
    'error_rate': []
}

print("🚀 Starting concurrent load tests...\n")
print(f"Configuration:")
print(f"  Concurrency levels: {concurrency_levels}")
print(f"  Requests per level: {num_requests_per_level}")
print(f"  Test samples: {num_test_samples}\n")

for num_clients in concurrency_levels:
    print(f"\n{'='*60}")
    print(f"Testing with {num_clients} concurrent clients")
    print(f"{'='*60}")
    
    p50, p90, mean, error_rate = run_benchmark(
        endpoint_name=endpoint_name,
        test_file_location=local_test_file,
        number_of_clients=num_clients,
        number_of_runs=num_requests_per_level,
        num_lines=num_test_samples
    )
    
    # Store results
    benchmark_results['concurrency'].append(num_clients)
    benchmark_results['p50_latency'].append(p50)
    benchmark_results['p90_latency'].append(p90)
    benchmark_results['mean_latency'].append(mean)
    benchmark_results['error_rate'].append(error_rate)
    
    # Print results
    print(f"\nResults:")
    print(f"  P50 Latency: {p50:.2f} ms")
    print(f"  P90 Latency: {p90:.2f} ms")
    print(f"  Mean Latency: {mean:.2f} ms")
    print(f"  Error Rate: {error_rate*100:.2f}%")

print(f"\n\n{'='*60}")
print("✅ All benchmarks complete!")
print(f"{'='*60}")
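
The loop records latency and error rate but not throughput. In a closed-loop test like this one, where each client sends its next request as soon as the previous one returns, Little's law gives a rough estimate: throughput is about the number of clients divided by the mean latency. The sketch below assumes clients are always busy and ignores failed requests and client-side overhead. The function name `estimate_throughput` and the numbers are hypothetical.

```python
def estimate_throughput(concurrency, mean_latency_ms):
    """Estimate requests per second for a closed-loop test via Little's law."""
    if mean_latency_ms <= 0:
        raise ValueError("mean_latency_ms must be positive")
    # concurrency / (mean latency in seconds)
    return concurrency * 1000.0 / mean_latency_ms


# Made-up numbers: 8 clients at a 200 ms mean latency
print(estimate_throughput(8, 200.0))  # 40.0 requests per second
```

Applied to each entry of `benchmark_results`, this shows whether adding clients still increases throughput or only increases latency.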

## 8. Visualize Results

Plot latency vs. concurrency to understand endpoint behavior.

In [None]:
# Create visualization
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))

# Plot 1: Latency vs Concurrency
ax1.plot(
    benchmark_results['concurrency'],
    benchmark_results['p50_latency'],
    marker='o',
    label='P50 Latency',
    linewidth=2
)
ax1.plot(
    benchmark_results['concurrency'],
    benchmark_results['p90_latency'],
    marker='s',
    label='P90 Latency',
    linewidth=2
)
ax1.plot(
    benchmark_results['concurrency'],
    benchmark_results['mean_latency'],
    marker='^',
    label='Mean Latency',
    linewidth=2,
    linestyle='--'
)

ax1.set_xlabel('Concurrent Clients', fontsize=12)
ax1.set_ylabel('Latency (ms)', fontsize=12)
ax1.set_title('Endpoint Latency vs Concurrency', fontsize=14, fontweight='bold')
ax1.legend(fontsize=10)
ax1.grid(True, alpha=0.3)
ax1.set_xticks(benchmark_results['concurrency'])

# Plot 2: Error Rate
ax2.bar(
    benchmark_results['concurrency'],
    [e * 100 for e in benchmark_results['error_rate']],
    color='coral',
    alpha=0.7
)
ax2.set_xlabel('Concurrent Clients', fontsize=12)
ax2.set_ylabel('Error Rate (%)', fontsize=12)
ax2.set_title('Error Rate vs Concurrency', fontsize=14, fontweight='bold')
ax2.grid(True, alpha=0.3, axis='y')
ax2.set_xticks(benchmark_results['concurrency'])

plt.tight_layout()
plt.savefig('benchmark_results.png', dpi=300, bbox_inches='tight')
plt.show()

print("\n✅ Visualization saved as 'benchmark_results.png'")

## 9. Results Summary

In [None]:
import pandas as pd

# Create results table
results_df = pd.DataFrame({
    'Concurrency': benchmark_results['concurrency'],
    'P50 (ms)': [f"{x:.2f}" for x in benchmark_results['p50_latency']],
    'P90 (ms)': [f"{x:.2f}" for x in benchmark_results['p90_latency']],
    'Mean (ms)': [f"{x:.2f}" for x in benchmark_results['mean_latency']],
    'Error Rate': [f"{x*100:.2f}%" for x in benchmark_results['error_rate']]
})

print("\n📊 Benchmark Results Summary:")
print("="*70)
print(results_df.to_string(index=False))
print("="*70)

# Save to CSV
results_df.to_csv('benchmark_results.csv', index=False)
print("\n✅ Results saved to 'benchmark_results.csv'")

## 10. Cleanup Resources

⚠️ **Important**: Delete the endpoint to stop incurring charges (~\$1.41/hour)

In [None]:
# Preview the resources that the next cell deletes (nothing is deleted here)
print(f"⚠️  Endpoint to delete: {endpoint_name}")
print(f"⚠️  Model to delete: {model_name}")
print("\nRun the next cell to delete both resources.")

In [None]:
# Deletes the endpoint and model. Run only after benchmarking is finished.

from sagemaker_core.resources import Endpoint, Model

endpoint = Endpoint.get(endpoint_name=endpoint_name)
endpoint.delete()

model = Model.get(model_name=model_name)
model.delete()

print('✅ Endpoint and model deleted')
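
The cell above removes the endpoint and model through `sagemaker_core`. The same cleanup can be sketched with a plain boto3 `sagemaker` client, which also exposes a call for deleting the endpoint configuration. The function name `delete_endpoint_resources` is hypothetical, and whether the endpoint configuration should be removed in your setup is an assumption; the client methods used are standard boto3 SageMaker operations.

```python
def delete_endpoint_resources(sm_client, endpoint_name, model_name):
    """Delete an endpoint, its endpoint config, and a model.

    sm_client is expected to behave like boto3.client("sagemaker").
    Returns the name of the endpoint config that was deleted.
    """
    # Look up the config name before the endpoint is removed.
    description = sm_client.describe_endpoint(EndpointName=endpoint_name)
    config_name = description["EndpointConfigName"]

    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=config_name)
    sm_client.delete_model(ModelName=model_name)
    return config_name


# Example call (commented out so that nothing is deleted by accident):
# import boto3
# delete_endpoint_resources(
#     boto3.client("sagemaker", region_name=region), endpoint_name, model_name
# )
```

Taking the client as a parameter keeps the function easy to exercise with a stub before pointing it at a real account.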

## ✅ Benchmarking Complete!

### What We Accomplished:
1. ✅ Tested single inference latency
2. ✅ Ran concurrent load tests (4-64 clients)
3. ✅ Measured P50/P90/mean latencies
4. ✅ Visualized performance characteristics
5. ✅ Saved results to CSV and PNG

### Key Findings:
- Review the latency vs. concurrency plot
- Identify optimal concurrency for your use case
- Check error rates at higher concurrency
- Use P90 latency for capacity planning
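
One way to turn these checks into a number is to pick the highest concurrency level whose P90 latency stays within a latency budget. The sketch below assumes lists shaped like `benchmark_results['concurrency']` and `benchmark_results['p90_latency']`. The function name `max_concurrency_within_budget`, the budget, and the sample values are hypothetical.

```python
def max_concurrency_within_budget(concurrency, p90_latency_ms, budget_ms):
    """Return the highest concurrency whose P90 is within budget, or None."""
    best = None
    for level, p90 in zip(concurrency, p90_latency_ms):
        if p90 is None:
            continue  # no successful requests at this level
        # NaN compares False here, so NaN entries are skipped as well.
        if p90 <= budget_ms and (best is None or level > best):
            best = level
    return best


# Made-up measurements with a 150 ms budget
levels = [4, 8, 16, 32]
p90s = [80.0, 95.0, 140.0, 310.0]
print(max_concurrency_within_budget(levels, p90s, 150.0))  # 16
```

The result is only as good as the budget chosen, and it should be read together with the error rate at the same level.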

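### Performance Insights:
- **Low latency**: Single-token classification should keep per-request latency low; confirm this against the measured P50
- **Scalability**: Check that P90 latency and error rate stay acceptable as concurrency rises
- **Cost-effective**: A small instance (ml.g5.xlarge) may be sufficient if the results meet your latency targets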

### Next Steps:
1. Review benchmark results
2. Adjust endpoint instance type/count if needed
3. **Delete endpoint to stop charges** (see Section 10)
4. Deploy to production when ready

---

**⚠️ Don't forget to delete the endpoint!** (~\$1.41/hour)