# Deploy Phi-3 with SageMaker Inference Components

This notebook demonstrates deploying Phi-3 using SageMaker Inference Components for optimized resource utilization and cost savings.

## What are Inference Components?

Inference Components (ICs) allow you to:
- **Right-size resources**: Allocate exact compute needed per model
- **Model packing**: Run multiple models on shared infrastructure  
- **Independent scaling**: Scale each model independently
- **Cost optimization**: Pay only for resources you use
- **Zero-downtime updates**: Roll out a new model version copy by copy while the endpoint keeps serving traffic

## Use Cases

- Multi-tenant deployments
- A/B testing different model versions
- Cost-effective production deployments
- Scale-to-zero for intermittent workloads

## Prerequisites

- AWS Account with SageMaker access
- IAM role with appropriate permissions
- GPU instance quota

## 1. Setup

In [None]:
!pip install sagemaker==2.256.0
!pip install boto3 --upgrade --quiet

In [None]:
import sagemaker
import boto3
import json
from datetime import datetime
from sagemaker.huggingface import get_huggingface_llm_image_uri

# Initialize clients
sess = sagemaker.Session()
region = sess.boto_region_name
role = sagemaker.get_execution_role()
sm_client = boto3.client('sagemaker', region_name=region)
sagemaker_runtime = boto3.client('sagemaker-runtime', region_name=region)

print(f"Region: {region}")
print(f"Role: {role}")

## 2. Get TGI Container Image

In [None]:
# Get TGI image URI
image_uri = get_huggingface_llm_image_uri(
    backend="huggingface",
    region=region
)

print(f"Image URI: {image_uri}")

## 3. Create Endpoint (without model)

First, create an endpoint without any models. We'll add models via Inference Components.

In [None]:
endpoint_name = f"phi3-ic-endpoint-{datetime.now().strftime('%Y%m%d-%H%M%S')}"
endpoint_config_name = f"{endpoint_name}-config"

# Endpoint configuration with Managed Instance Scaling.
# For Inference Components the variant carries no ModelName; the execution
# role is set on the endpoint config instead.
endpoint_config = {
    "EndpointConfigName": endpoint_config_name,
    "ExecutionRoleArn": role,
    "ProductionVariants": [
        {
            "VariantName": "AllTraffic",
            "InstanceType": "ml.g5.2xlarge",
            "InitialInstanceCount": 1,
            "ContainerStartupHealthCheckTimeoutInSeconds": 600,
            "ManagedInstanceScaling": {
                "Status": "ENABLED",
                "MinInstanceCount": 1,
                "MaxInstanceCount": 4
            },
            "RoutingConfig": {
                "RoutingStrategy": "LEAST_OUTSTANDING_REQUESTS"
            }
        }
    ]
}

print(f"Creating endpoint: {endpoint_name}")
print("This will take 5-10 minutes...")

# create_endpoint only accepts a config name, so create the config first
sm_client.create_endpoint_config(**endpoint_config)
sm_client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name
)

# Wait for endpoint to be in service
waiter = sm_client.get_waiter('endpoint_in_service')
waiter.wait(EndpointName=endpoint_name)

print(f"✅ Endpoint created: {endpoint_name}")

## 4. Create Inference Component with Phi-3

Now create an Inference Component to deploy Phi-3 on the endpoint.

In [None]:
ic_name = f"phi3-mini-ic-{datetime.now().strftime('%Y%m%d-%H%M%S')}"

# Model configuration
model_config = {
    'HF_MODEL_ID': 'microsoft/Phi-3-mini-4k-instruct',
    'SM_NUM_GPUS': '1',
    'MAX_INPUT_LENGTH': '3072',
    'MAX_TOTAL_TOKENS': '4096',
    'MESSAGES_API_ENABLED': 'true',  # Enable OpenAI-compatible API
    # Optional: HF token for gated models
    # 'HUGGING_FACE_HUB_TOKEN': '<YOUR_TOKEN>',
}

# Create inference component
ic_config = {
    "InferenceComponentName": ic_name,
    "EndpointName": endpoint_name,
    "VariantName": "AllTraffic",
    "Specification": {
        # The container is defined inline, so no ModelName is passed
        # (ModelName refers to an existing SageMaker model object).
        "Container": {
            "Image": image_uri,
            "Environment": model_config
        },
        "ComputeResourceRequirements": {
            "NumberOfAcceleratorDevicesRequired": 1,  # Use 1 GPU
            "MinMemoryRequiredInMb": 8192  # 8GB RAM
        }
    },
    "RuntimeConfig": {
        # 2 copies for availability; ml.g5.2xlarge has 1 GPU, so the second
        # copy needs a second instance (Managed Instance Scaling adds it)
        "CopyCount": 2
    }
}

print(f"Creating inference component: {ic_name}")
print("This will take 3-5 minutes...")

sm_client.create_inference_component(**ic_config)

# Wait for IC to be in service
import time

while True:
    response = sm_client.describe_inference_component(
        InferenceComponentName=ic_name
    )
    status = response['InferenceComponentStatus']
    
    if status == 'InService':
        print("✅ Inference Component is InService")
        break
    elif status == 'Failed':
        print(f"❌ Failed to create IC: {response.get('FailureReason', status)}")
        break
    else:
        print(f"Status: {status}... waiting")
        time.sleep(30)
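
The polling loop above recurs in later sections, so it can be factored into one helper. The following is a minimal sketch, not part of the original notebook: the name `wait_for_inference_component` and its parameters are hypothetical, and `client` is assumed to be any object exposing `describe_inference_component` (for example `sm_client`). It raises on failure or timeout instead of looping forever.

```python
import time


def wait_for_inference_component(client, name, poll_seconds=30,
                                 timeout_seconds=1800, sleep=time.sleep):
    """Poll describe_inference_component until the IC reports InService.

    Returns the final describe response. Raises RuntimeError if the IC
    reports Failed and TimeoutError if it is not ready in time.
    """
    waited = 0
    while True:
        response = client.describe_inference_component(
            InferenceComponentName=name
        )
        status = response["InferenceComponentStatus"]
        if status == "InService":
            return response
        if status == "Failed":
            reason = response.get("FailureReason", "no reason reported")
            raise RuntimeError(f"Inference component {name} failed: {reason}")
        if waited >= timeout_seconds:
            raise TimeoutError(f"{name} still {status} after {waited}s")
        sleep(poll_seconds)
        waited += poll_seconds
```

With the clients from the Setup section this would be called as `wait_for_inference_component(sm_client, ic_name)`.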

## 5. Invoke Inference Component

In [None]:
def invoke_ic(ic_name, payload):
    """
    Invoke a specific inference component.
    """
    response = sagemaker_runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        InferenceComponentName=ic_name,  # Target specific IC
        Body=json.dumps(payload),
        ContentType='application/json'
    )
    
    result = json.loads(response['Body'].read().decode('utf-8'))
    return result

# Test inference
test_payload = {
    "inputs": "What is the meaning of life?",
    "parameters": {
        "max_new_tokens": 150,
        "temperature": 0.7,
        "top_p": 0.9,
        "do_sample": True
    }
}

response = invoke_ic(ic_name, test_payload)

print("\n" + "="*50)
print("RESPONSE:")
print("="*50)
print(response[0]['generated_text'])
print("="*50)

## 6. Test with Messages API Format

Since we enabled `MESSAGES_API_ENABLED`, we can send requests in the OpenAI-compatible chat format.

In [None]:
# OpenAI-compatible format
messages_payload = {
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful AI assistant."
        },
        {
            "role": "user",
            "content": "Explain quantum computing in simple terms."
        }
    ],
    "max_tokens": 200,
    "temperature": 0.7
}

response = invoke_ic(ic_name, messages_payload)

print("\n" + "="*50)
print("MESSAGES API RESPONSE:")
print("="*50)
print(response['choices'][0]['message']['content'])
print("="*50)
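
Sections 5 and 6 read two different response shapes: a list with `generated_text` and a dict with `choices`. A small helper can hide that difference from calling code. This is an illustrative sketch; `extract_text` is a hypothetical name, and it only handles the two shapes printed above.

```python
def extract_text(result):
    """Return the generated text from either response shape used above."""
    # Messages API shape: {"choices": [{"message": {"content": "..."}}]}
    if isinstance(result, dict) and "choices" in result:
        return result["choices"][0]["message"]["content"]
    # Generate shape: [{"generated_text": "..."}]
    if isinstance(result, list) and result and "generated_text" in result[0]:
        return result[0]["generated_text"]
    raise ValueError(f"Unrecognized response shape: {type(result).__name__}")
```

For example, `print(extract_text(invoke_ic(ic_name, messages_payload)))` would print the assistant reply without indexing into the response by hand.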

## 7. Deploy Second Model (A/B Testing)

Deploy another model variant on the same endpoint for A/B testing.

In [None]:
ic_name_v2 = f"phi3-mini-v2-ic-{datetime.now().strftime('%Y%m%d-%H%M%S')}"

# Configuration for second variant (with different parameters)
model_config_v2 = {
    'HF_MODEL_ID': 'microsoft/Phi-3-mini-4k-instruct',
    'SM_NUM_GPUS': '1',
    'MAX_INPUT_LENGTH': '3072',
    'MAX_TOTAL_TOKENS': '4096',
    'MESSAGES_API_ENABLED': 'true',
    # Identical to V1 as written; change a setting here (for example
    # quantization) so the A/B comparison is meaningful
}

ic_config_v2 = {
    "InferenceComponentName": ic_name_v2,
    "EndpointName": endpoint_name,
    "VariantName": "AllTraffic",
    "Specification": {
        # Container is defined inline, so no ModelName is passed
        "Container": {
            "Image": image_uri,
            "Environment": model_config_v2
        },
        "ComputeResourceRequirements": {
            "NumberOfAcceleratorDevicesRequired": 1,
            "MinMemoryRequiredInMb": 8192
        }
    },
    "RuntimeConfig": {
        "CopyCount": 1  # Start with 1 copy
    }
}

print(f"Creating second inference component: {ic_name_v2}")
sm_client.create_inference_component(**ic_config_v2)

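# Wait for IC to be ready
import time
while True:
    response = sm_client.describe_inference_component(
        InferenceComponentName=ic_name_v2
    )
    status = response['InferenceComponentStatus']
    
    if status == 'InService':
        print("✅ Second IC is InService")
        break
    elif status == 'Failed':
        # Stop instead of polling forever on a failed deployment
        print(f"❌ Failed to create second IC: {response.get('FailureReason', status)}")
        break
    else:
        print(f"Status: {status}... waiting")
        time.sleep(30)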

## 8. A/B Test Both Models

In [None]:
# Test both models with same prompt
test_prompt = {
    "inputs": "Write a haiku about artificial intelligence.",
    "parameters": {
        "max_new_tokens": 100,
        "temperature": 0.8,
        "do_sample": True
    }
}

print("\n" + "="*50)
print("A/B TEST COMPARISON")
print("="*50)

# Model V1
response_v1 = invoke_ic(ic_name, test_prompt)
print(f"\nModel V1 ({ic_name}):")
print("-" * 50)
print(response_v1[0]['generated_text'])

# Model V2  
response_v2 = invoke_ic(ic_name_v2, test_prompt)
print(f"\nModel V2 ({ic_name_v2}):")
print("-" * 50)
print(response_v2[0]['generated_text'])
print("="*50)
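
The comparison above calls each component explicitly. For a live A/B test, the caller usually splits traffic by weight. The sketch below does that on the client side, which is an assumption of this example and not a SageMaker feature; `choose_component` is a hypothetical helper built on the standard library's `random.choices`.

```python
import random


def choose_component(weights, rng=random):
    """Pick an inference component name with probability proportional to its weight.

    `weights` maps component name -> non-negative weight.
    `rng` is the `random` module or a `random.Random` instance.
    """
    if not weights or any(w < 0 for w in weights.values()) or sum(weights.values()) <= 0:
        raise ValueError("weights must be non-negative with a positive total")
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
```

A 90/10 split would then look like `invoke_ic(choose_component({ic_name: 0.9, ic_name_v2: 0.1}), test_prompt)`. Logging the chosen name next to each response keeps the results attributable.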

## 9. Scale Inference Component

Dynamically scale the number of copies.

In [None]:
# Scale up to 3 copies
print(f"Scaling {ic_name} to 3 copies...")

sm_client.update_inference_component(
    InferenceComponentName=ic_name,
    RuntimeConfig={
        "CopyCount": 3
    }
)

# Wait for update
import time
while True:
    response = sm_client.describe_inference_component(
        InferenceComponentName=ic_name
    )
    status = response['InferenceComponentStatus']
    
    if status == 'InService':
        runtime_config = response['RuntimeConfig']
        print(f"✅ Scaled to {runtime_config['CurrentCopyCount']} copies")
        break
    elif status == 'Failed':
        print(f"❌ Scaling failed: {response.get('FailureReason', status)}")
        break
    else:
        print(f"Status: {status}... waiting")
        time.sleep(20)
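
Before raising `CopyCount`, it helps to check that the endpoint can place the copies. Each copy here requests one accelerator and an ml.g5.2xlarge has one GPU, so three copies of the first component plus one of the second need four instances, which is exactly the `MaxInstanceCount` set in section 3. The sketch below does that arithmetic. It is a rough lower bound that counts accelerators only and ignores memory and CPU, and the function names are hypothetical.

```python
import math


def instances_required(copy_counts, accelerators_per_copy, accelerators_per_instance):
    """Lower bound on instances needed to place every copy (accelerators only)."""
    if accelerators_per_copy > accelerators_per_instance:
        raise ValueError("a single copy does not fit on one instance")
    copies_per_instance = accelerators_per_instance // accelerators_per_copy
    total_copies = sum(copy_counts.values())
    return math.ceil(total_copies / copies_per_instance)


def fits_within(copy_counts, accelerators_per_copy, accelerators_per_instance, max_instances):
    """True if the requested copies fit under the endpoint's instance ceiling."""
    needed = instances_required(copy_counts, accelerators_per_copy, accelerators_per_instance)
    return needed <= max_instances
```

For the values in this notebook, `fits_within({"v1": 3, "v2": 1}, 1, 1, 4)` is true, while a fifth copy would not fit without a higher `MaxInstanceCount` or a multi-GPU instance type.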

## 10. Rolling Update Example

Update an IC to a new model version with zero downtime.

In [None]:
# Update to Phi-3 small model (or any other variant)
new_model_config = {
    'HF_MODEL_ID': 'microsoft/Phi-3-small-8k-instruct',  # Different model
    'SM_NUM_GPUS': '1',
    'MAX_INPUT_LENGTH': '7168',
    'MAX_TOTAL_TOKENS': '8192',
    'MESSAGES_API_ENABLED': 'true',
}

print(f"Performing rolling update on {ic_name}...")
print("This maintains availability during update.")

sm_client.update_inference_component(
    InferenceComponentName=ic_name,
    Specification={
        "Container": {
            "Image": image_uri,
            "Environment": new_model_config
        },
        "ComputeResourceRequirements": {
            "NumberOfAcceleratorDevicesRequired": 1,
            "MinMemoryRequiredInMb": 8192
        }
    }
)

# Monitor update progress
import time
while True:
    response = sm_client.describe_inference_component(
        InferenceComponentName=ic_name
    )
    status = response['InferenceComponentStatus']
    
    if status == 'InService':
        print("✅ Rolling update complete")
        break
    elif status == 'Failed':
        # Stop instead of polling forever on a failed update
        print(f"❌ Update failed: {response.get('FailureReason', status)}")
        break
    elif status == 'Updating':
        print("Update in progress...")
        time.sleep(30)
    else:
        print(f"Status: {status}")
        time.sleep(20)
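
The update above uses default rollout behaviour. `update_inference_component` also accepts a `DeploymentConfig` that controls batch size, wait interval, and automatic rollback. The builder below is a sketch based on my reading of the `UpdateInferenceComponent` API; treat the field names as assumptions and check them against the API reference for your boto3 version. The helper name and defaults are hypothetical.

```python
def rolling_update_config(batch_copies=1, wait_seconds=120,
                          timeout_seconds=1800, alarm_names=()):
    """Build a DeploymentConfig dict for a rolling inference component update.

    batch_copies: copies replaced per batch (smaller is safer, slower).
    wait_seconds: pause between batches so problems can surface.
    alarm_names: CloudWatch alarms that should trigger an automatic rollback.
    """
    config = {
        "RollingUpdatePolicy": {
            "MaximumBatchSize": {"Type": "COPY_COUNT", "Value": batch_copies},
            "WaitIntervalInSeconds": wait_seconds,
            "MaximumExecutionTimeoutInSeconds": timeout_seconds,
        }
    }
    if alarm_names:
        config["AutoRollbackConfiguration"] = {
            "Alarms": [{"AlarmName": name} for name in alarm_names]
        }
    return config
```

The result would be passed as `DeploymentConfig=rolling_update_config(alarm_names=["my-5xx-alarm"])` alongside `Specification`, where `my-5xx-alarm` is a placeholder for an alarm you have created.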

## 11. Monitor Inference Component Metrics

In [None]:
# Get IC details
ic_details = sm_client.describe_inference_component(
    InferenceComponentName=ic_name
)

print("Inference Component Details:")
print("=" * 50)
print(f"Name: {ic_details['InferenceComponentName']}")
print(f"Status: {ic_details['InferenceComponentStatus']}")
print(f"Endpoint: {ic_details['EndpointName']}")
print(f"\nCompute Resources:")
print(f"  GPUs: {ic_details['Specification']['ComputeResourceRequirements']['NumberOfAcceleratorDevicesRequired']}")
print(f"  Memory: {ic_details['Specification']['ComputeResourceRequirements']['MinMemoryRequiredInMb']} MB")
print(f"\nRuntime:")
print(f"  Desired Copies: {ic_details['RuntimeConfig']['DesiredCopyCount']}")
print(f"  Current Copies: {ic_details['RuntimeConfig']['CurrentCopyCount']}")
print(f"\nModel:")
env = ic_details['Specification']['Container']['Environment']
print(f"  Model ID: {env.get('HF_MODEL_ID', 'N/A')}")
print(f"  Max Tokens: {env.get('MAX_TOTAL_TOKENS', 'N/A')}")
print("=" * 50)

## 12. CloudWatch Metrics for Inference Components

In [None]:
import datetime as dt

cloudwatch = boto3.client('cloudwatch', region_name=region)

def get_ic_metrics(ic_name, minutes=60):
    """
    Get CloudWatch metrics for Inference Component.
    """
    end_time = dt.datetime.now(dt.timezone.utc)  # timezone-aware; utcnow() is deprecated in Python 3.12+
    start_time = end_time - dt.timedelta(minutes=minutes)
    
    # Invocation metrics live in AWS/SageMaker and are filtered per component
    # by the InferenceComponentName dimension (verify names in the CloudWatch docs)
    metrics = [
        'Invocations',
        'ConcurrentRequestsPerCopy',
        'Invocation4XXErrors',
        'Invocation5XXErrors'
    ]
    
    print(f"\nMetrics for {ic_name}:")
    print("=" * 50)
    
    for metric_name in metrics:
        response = cloudwatch.get_metric_statistics(
            Namespace='AWS/SageMaker',
            MetricName=metric_name,
            Dimensions=[
                {'Name': 'InferenceComponentName', 'Value': ic_name}
            ],
            StartTime=start_time,
            EndTime=end_time,
            Period=300,
            Statistics=['Sum', 'Average']
        )
        
        if response['Datapoints']:
            latest = sorted(response['Datapoints'], 
                          key=lambda x: x['Timestamp'])[-1]
            print(f"{metric_name}: {latest.get('Sum', latest.get('Average', 0))}")
        else:
            print(f"{metric_name}: No data")
    
    print("=" * 50)

# Get metrics
get_ic_metrics(ic_name)
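
The function above prints only the latest datapoint per metric. For alerting, a ratio over the whole window is often more useful. The sketch below works on the `Datapoints` lists that `get_metric_statistics` returns; the helper names are hypothetical, and it assumes `Sum` was requested as a statistic.

```python
def total(datapoints):
    """Sum the 'Sum' statistic across all datapoints in the window."""
    return sum(point.get("Sum", 0.0) for point in datapoints)


def error_rate(invocation_points, error_points):
    """Errors divided by invocations over the window, or None with no traffic."""
    invocations = total(invocation_points)
    if invocations == 0:
        return None
    return total(error_points) / invocations
```

Called with the datapoints for the invocation metric and one of the error metrics, this gives a single number to compare against a threshold.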

## 13. Auto-Scaling Configuration

Set up auto-scaling for the Inference Component.

In [None]:
autoscaling = boto3.client('application-autoscaling', region_name=region)

# Register scalable target
resource_id = f"inference-component/{ic_name}"

autoscaling.register_scalable_target(
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:inference-component:DesiredCopyCount',
    MinCapacity=1,
    MaxCapacity=5
)

print(f"Registered scalable target for {ic_name}")

# Create target tracking policy
policy_name = f"target-tracking-{ic_name}"

autoscaling.put_scaling_policy(
    PolicyName=policy_name,
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:inference-component:DesiredCopyCount',
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 5.0,  # Target 5 concurrent requests per copy
        'PredefinedMetricSpecification': {
            'PredefinedMetricType': 'SageMakerInferenceComponentConcurrentRequestsPerCopyHighResolution'
        },
        'ScaleInCooldown': 300,  # 5 minutes
        'ScaleOutCooldown': 60   # 1 minute
    }
)

print(f"✅ Auto-scaling policy created: {policy_name}")
print("  Min copies: 1")
print("  Max copies: 5")
print("  Target: 5 concurrent requests per copy")
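
To build intuition for the `TargetValue` above, the sketch below estimates how many copies a target of 5 concurrent requests per copy implies for a given load. It is a simplified proportional model and an assumption of this example. Application Auto Scaling itself acts through CloudWatch alarms and the cooldowns configured above, so real behaviour will lag and may differ. The function name is hypothetical.

```python
import math


def estimate_desired_copies(current_copies, observed_per_copy, target_per_copy,
                            min_copies, max_copies):
    """Estimate the copy count that brings per-copy load back to the target."""
    if target_per_copy <= 0:
        raise ValueError("target_per_copy must be positive")
    total_load = current_copies * observed_per_copy
    wanted = math.ceil(total_load / target_per_copy)
    # Clamp to the range registered with register_scalable_target
    return max(min_copies, min(max_copies, wanted))
```

With 2 copies each seeing 10 concurrent requests, the estimate is 4 copies. Setting `min_copies` to 0 models the scale-to-zero case listed under Next Steps.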

## 14. Cost Analysis

In [None]:
# Calculate cost savings with Inference Components
def calculate_cost_comparison():
    """
    Compare costs: Traditional deployment vs Inference Components
    """
    # ml.g5.2xlarge pricing (example: $1.50/hour)
    hourly_rate = 1.50
    
    print("\nCost Comparison Analysis")
    print("=" * 50)
    
    # Traditional: 2 separate endpoints
    traditional_instances = 2  # One per model
    traditional_monthly = traditional_instances * hourly_rate * 730  # 730 hours/month
    
    print(f"Traditional Deployment (2 models, 2 endpoints):")
    print(f"  Instances: {traditional_instances}")
    print(f"  Monthly Cost: ${traditional_monthly:.2f}")
    
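    # With ICs: 1 endpoint, shared infrastructure.
    # Assumes both models fit on one instance. ml.g5.2xlarge has a single GPU,
    # so two 1-GPU components would need a multi-GPU instance type in practice.
    ic_instances = 1  # Shared endpoint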
    ic_monthly = ic_instances * hourly_rate * 730
    
    print(f"\nInference Components (2 models, 1 endpoint):")
    print(f"  Instances: {ic_instances}")
    print(f"  Monthly Cost: ${ic_monthly:.2f}")
    
    savings = traditional_monthly - ic_monthly
    savings_pct = (savings / traditional_monthly) * 100
    
    print(f"\n💰 Monthly Savings: ${savings:.2f} ({savings_pct:.1f}%)")
    print(f"💰 Annual Savings: ${savings * 12:.2f}")
    print("=" * 50)

calculate_cost_comparison()
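
The comparison above hard-codes one shared instance. The sketch below parameterises it so the packing assumption is explicit: savings depend on how many models fit on a shared instance and on that instance's hourly rate. All rates passed in are placeholders, not quoted AWS prices, and the function names are hypothetical.

```python
import math

HOURS_PER_MONTH = 730


def monthly_cost(instances, hourly_rate):
    """Cost of running `instances` continuously for one month."""
    return instances * hourly_rate * HOURS_PER_MONTH


def compare_costs(models, dedicated_rate, shared_rate, models_per_shared_instance):
    """Compare one-instance-per-model against models packed onto shared instances."""
    dedicated = monthly_cost(models, dedicated_rate)
    shared_instances = math.ceil(models / models_per_shared_instance)
    shared = monthly_cost(shared_instances, shared_rate)
    return {
        "dedicated": dedicated,
        "shared": shared,
        "savings_pct": (dedicated - shared) / dedicated * 100,
    }
```

`compare_costs(2, 1.50, 1.50, 2)` reproduces the 50% figure above. With `models_per_shared_instance=1` the savings drop to zero, which is the case when each model needs a whole single-GPU instance.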

## 15. Cleanup

In [None]:
# Deregister the scalable target (this also removes its scaling policies)
try:
    autoscaling.deregister_scalable_target(
        ServiceNamespace='sagemaker',
        ResourceId=resource_id,
        ScalableDimension='sagemaker:inference-component:DesiredCopyCount'
    )
    print("✅ Scalable target deregistered")
except Exception as e:
    print(f"Note: {e}")

# Delete inference components
for ic in [ic_name, ic_name_v2]:
    try:
        print(f"Deleting IC: {ic}")
        sm_client.delete_inference_component(
            InferenceComponentName=ic
        )
        print(f"✅ Deletion requested for {ic}")
    except Exception as e:
        print(f"Note: {e}")

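# Wait for ICs to be deleted: describe raises ClientError once an IC is gone
import time
for ic in [ic_name, ic_name_v2]:
    for _ in range(30):  # poll for up to ~5 minutes per IC
        try:
            sm_client.describe_inference_component(InferenceComponentName=ic)
            time.sleep(10)
        except sm_client.exceptions.ClientError:
            break

# Delete endpoint and the endpoint config it was created from
config_name = sm_client.describe_endpoint(EndpointName=endpoint_name)['EndpointConfigName']
print(f"\nDeleting endpoint: {endpoint_name}")
sm_client.delete_endpoint(EndpointName=endpoint_name)
sm_client.delete_endpoint_config(EndpointConfigName=config_name)
print("✅ Endpoint deleted")

print("\n✅ All resources cleaned up!")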

## Summary

### What We Covered

1. ✅ Created endpoint without models
2. ✅ Deployed Phi-3 via Inference Component
3. ✅ Used Messages API format
4. ✅ Deployed second model for A/B testing
5. ✅ Scaled Inference Components dynamically
6. ✅ Performed rolling updates
7. ✅ Monitored IC-specific metrics
8. ✅ Configured auto-scaling
9. ✅ Analyzed cost savings

### Key Benefits of Inference Components

| Feature | Benefit |
|---------|--------|
| Resource Optimization | About 50% lower deployment cost on average, per the AWS blog linked under Resources |
| Model Packing | Multiple models on shared infrastructure |
| Independent Scaling | Scale each model based on demand |
| Zero Downtime | Rolling updates without service interruption |
| A/B Testing | Easy model comparison on same endpoint |
| Granular Metrics | Per-model monitoring and alerting |

### When to Use Inference Components

✅ **Use ICs when:**
- Running multiple models
- Need cost optimization
- Require A/B testing
- Want independent scaling
- Need zero-downtime updates

❌ **Skip ICs when:**
- Single model deployment
- Maximum simplicity needed
- Legacy workflows

## Next Steps

- Implement scale-to-zero for intermittent workloads
- Set up CloudWatch alarms for IC metrics
- Deploy quantized models (AWQ/GPTQ)
- Integrate with API Gateway
- Build production MLOps pipeline

## Resources

- [Inference Components Documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/inference-components.html)
- [Cost Optimization Blog](https://aws.amazon.com/blogs/machine-learning/reduce-model-deployment-costs-by-50-on-average-using-sagemakers-latest-features/)
- [Rolling Updates Guide](https://docs.aws.amazon.com/sagemaker/latest/dg/inference-component-rolling-updates.html)
- [Auto-scaling Documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/endpoint-auto-scaling.html)