# Deploy Phi-3 Model on SageMaker using Text Generation Inference (TGI)

This notebook demonstrates how to deploy Microsoft's Phi-3 model on Amazon SageMaker using the Hugging Face Text Generation Inference (TGI) container.

## Overview

- **Model**: microsoft/Phi-3-mini-4k-instruct (3.8B parameters)
- **Container**: Hugging Face TGI Deep Learning Container
- **Instance Type**: ml.g5.2xlarge (1 GPU)
- **Features**: Streaming responses, optimized inference, token-level details

## Prerequisites

- AWS Account with SageMaker access
- Appropriate IAM role with SageMaker permissions
- Sufficient service quota for ml.g5.2xlarge instances

## 1. Setup and Installation

First, let's install and upgrade the necessary packages.

In [None]:
!pip install sagemaker --upgrade --quiet
!pip install boto3 --upgrade --quiet

## 2. Initialize SageMaker Session

Set up the SageMaker session and get the execution role.

In [None]:
import sagemaker
import boto3
import json
from datetime import datetime

# SageMaker session
sess = sagemaker.Session()
region = sess.boto_region_name
role = sagemaker.get_execution_role()

print(f"SageMaker role: {role}")
print(f"AWS region: {region}")
print(f"SageMaker version: {sagemaker.__version__}")

## 3. Get TGI Container Image URI

Retrieve the Hugging Face TGI container image URI using the SageMaker helper function.

In [None]:
from sagemaker.huggingface import get_huggingface_llm_image_uri

# Get the TGI image URI
image_uri = get_huggingface_llm_image_uri(
    backend="huggingface",  # or "lmi" for DJL serving
    region=region
)

print(f"Image URI: {image_uri}")

## 4. Configure Model Settings

Define the model configuration including the Hugging Face model ID and deployment parameters.

In [None]:
from sagemaker.huggingface import HuggingFaceModel
import time

# Model configuration
model_name = f"phi3-mini-4k-tgi-{datetime.now().strftime('%Y%m%d-%H%M%S')}"
endpoint_name = f"{model_name}-ep"

# Environment variables for TGI
hub_config = {
    'HF_MODEL_ID': 'microsoft/Phi-3-mini-4k-instruct',
    'SM_NUM_GPUS': '1',
    'MAX_INPUT_LENGTH': '3072',
    'MAX_TOTAL_TOKENS': '4096',
    'MAX_BATCH_PREFILL_TOKENS': '4096',
    'MAX_BATCH_TOTAL_TOKENS': '8192',
    # Optional: Add HF token for gated models
    # 'HUGGING_FACE_HUB_TOKEN': '<YOUR_HF_TOKEN>',
}

print(f"Model name: {model_name}")
print(f"Endpoint name: {endpoint_name}")
print(f"Model configuration: {json.dumps(hub_config, indent=2)}")
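
The four token limits above depend on each other. The following is a minimal sketch of a consistency check, assuming TGI enforces these relations at startup (worth verifying against the container version in use). `check_tgi_limits` and `example_env` are hypothetical names used for illustration only; they are not part of TGI or the SageMaker SDK.

```python
def check_tgi_limits(env):
    """Return a list of problems found in the TGI token-limit settings (empty if none)."""
    max_input = int(env["MAX_INPUT_LENGTH"])
    max_total = int(env["MAX_TOTAL_TOKENS"])
    prefill = int(env["MAX_BATCH_PREFILL_TOKENS"])
    batch_total = int(env["MAX_BATCH_TOTAL_TOKENS"])

    problems = []
    # Assumed rule: the prompt must leave room for at least one generated token
    if max_input >= max_total:
        problems.append("MAX_INPUT_LENGTH must be < MAX_TOTAL_TOKENS")
    # Assumed rule: a single full-length prompt must fit in one prefill pass
    if prefill < max_input:
        problems.append("MAX_BATCH_PREFILL_TOKENS must be >= MAX_INPUT_LENGTH")
    # Assumed rule: a batch must be able to hold at least one full-length request
    if batch_total < max_total:
        problems.append("MAX_BATCH_TOTAL_TOKENS must be >= MAX_TOTAL_TOKENS")
    return problems


# Same values as the configuration in this notebook
example_env = {
    "MAX_INPUT_LENGTH": "3072",
    "MAX_TOTAL_TOKENS": "4096",
    "MAX_BATCH_PREFILL_TOKENS": "4096",
    "MAX_BATCH_TOTAL_TOKENS": "8192",
}

print(check_tgi_limits(example_env))  # []

# Tokens left for generation when the prompt uses the full input length
generation_budget = int(example_env["MAX_TOTAL_TOKENS"]) - int(example_env["MAX_INPUT_LENGTH"])
print(generation_budget)  # 1024
```

With these settings, a prompt that uses all 3072 input tokens leaves 1024 tokens for the response, so a larger `max_new_tokens` would not fit for such a prompt.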

## 5. Create SageMaker Model

Create a HuggingFaceModel object with the TGI configuration.

In [None]:
# Create HuggingFace Model
huggingface_model = HuggingFaceModel(
    name=model_name,
    env=hub_config,
    role=role,
    image_uri=image_uri
)

print(f"HuggingFace Model created: {model_name}")

## 6. Deploy Model to SageMaker Endpoint

Deploy the model to a real-time SageMaker endpoint. This step will take approximately 5-10 minutes.

In [None]:
# Deploy the model
predictor = huggingface_model.deploy(
    endpoint_name=endpoint_name,
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    container_startup_health_check_timeout=600,
)

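print(f"Endpoint deployed successfully: {predictor.endpoint_name}")

# predictor.endpoint_name holds the endpoint name, not its ARN, so look the ARN up explicitly
endpoint_arn = sess.sagemaker_client.describe_endpoint(EndpointName=endpoint_name)["EndpointArn"]
print(f"Endpoint ARN: {endpoint_arn}")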

## 7. Test the Endpoint - Simple Inference

Test the deployed model with a simple text generation request.

In [None]:
# Simple inference request
input_data = {
    "inputs": "What is machine learning? Explain in simple terms.",
    "parameters": {
        "max_new_tokens": 200,
        "temperature": 0.7,
        "top_p": 0.9,
        "do_sample": True,
        "return_full_text": False
    }
}

# Make prediction
response = predictor.predict(input_data)

print("\n" + "="*50)
print("RESPONSE:")
print("="*50)
print(response[0]['generated_text'])
print("="*50)
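
Invalid sampling parameters are rejected by the server, which costs a round trip to the endpoint. The sketch below checks a `parameters` dict locally first. The bounds are assumptions about TGI's request validation (strictly positive `temperature` and `max_new_tokens`, `top_p` strictly between 0 and 1) and may differ between container versions. `check_generation_params` is a hypothetical helper.

```python
def check_generation_params(params):
    """Return a list of problems in a TGI `parameters` dict (empty if none found)."""
    problems = []

    max_new_tokens = params.get("max_new_tokens")
    if max_new_tokens is not None and max_new_tokens <= 0:
        problems.append("max_new_tokens must be > 0")

    # Assumed bound: TGI expects a strictly positive temperature
    temperature = params.get("temperature")
    if temperature is not None and temperature <= 0:
        problems.append("temperature must be > 0")

    # Assumed bound: TGI expects top_p strictly between 0 and 1
    top_p = params.get("top_p")
    if top_p is not None and not (0 < top_p < 1):
        problems.append("top_p must be > 0 and < 1")

    return problems


good = {"max_new_tokens": 200, "temperature": 0.7, "top_p": 0.9, "do_sample": True}
bad = {"max_new_tokens": 200, "temperature": 0.0, "top_p": 1.0}

print(check_generation_params(good))  # []
print(check_generation_params(bad))
```

For deterministic output, one option is to set `do_sample` to `False` and omit `temperature` and `top_p` rather than setting the temperature to zero.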

## 8. Test with Chat Format

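Phi-3 instruct models are trained on a chat template that wraps each turn in a role tag (`<|system|>`, `<|user|>`, or `<|assistant|>`) and closes it with `<|end|>`. The helper below builds such a prompt by hand and ends it with an open `<|assistant|>` tag so the model writes the next turn.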

In [None]:
# Phi-3 chat template format
def format_phi3_chat(messages):
    """
    Format messages for Phi-3 chat template.
    Messages should be a list of dicts with 'role' and 'content'.
    """
    formatted_prompt = ""
    for message in messages:
        role = message['role']
        content = message['content']
        if role == 'system':
            formatted_prompt += f"<|system|>\n{content}<|end|>\n"
        elif role == 'user':
            formatted_prompt += f"<|user|>\n{content}<|end|>\n"
        elif role == 'assistant':
            formatted_prompt += f"<|assistant|>\n{content}<|end|>\n"
    
    # Add assistant prefix for response
    formatted_prompt += "<|assistant|>\n"
    return formatted_prompt

# Example chat conversation
messages = [
    {
        "role": "system",
        "content": "You are a helpful AI assistant that provides clear and concise answers."
    },
    {
        "role": "user",
        "content": "What are the key differences between Python and JavaScript?"
    }
]

formatted_prompt = format_phi3_chat(messages)

chat_data = {
    "inputs": formatted_prompt,
    "parameters": {
        "max_new_tokens": 300,
        "temperature": 0.7,
        "top_p": 0.9,
        "do_sample": True,
        "return_full_text": False,
        "stop": ["<|end|>", "<|endoftext|>"]
    }
}

response = predictor.predict(chat_data)

print("\n" + "="*50)
print("CHAT RESPONSE:")
print("="*50)
print(response[0]['generated_text'])
print("="*50)
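
To make the template concrete, the sketch below prints the exact string produced for a short conversation. It runs without an endpoint. `render_phi3_prompt` is a hypothetical compact equivalent of `format_phi3_chat`, written only for this illustration.

```python
KNOWN_ROLES = {"system", "user", "assistant"}


def render_phi3_prompt(messages):
    """Build a Phi-3 style prompt; messages with unknown roles are skipped."""
    parts = [
        f"<|{m['role']}|>\n{m['content']}<|end|>\n"
        for m in messages
        if m["role"] in KNOWN_ROLES
    ]
    # The trailing assistant tag asks the model to produce the next turn
    return "".join(parts) + "<|assistant|>\n"


demo_messages = [
    {"role": "system", "content": "Be brief."},
    {"role": "user", "content": "Hi!"},
]

prompt = render_phi3_prompt(demo_messages)
print(repr(prompt))
# '<|system|>\nBe brief.<|end|>\n<|user|>\nHi!<|end|>\n<|assistant|>\n'
```

Because the prompt ends with an open assistant tag, listing `<|end|>` in the `stop` parameter makes generation halt as soon as the model closes its own turn.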

## 9. Streaming Inference

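TGI can stream tokens as they are generated instead of returning the full text at the end. On SageMaker, this means setting `"stream": True` in the payload and calling `invoke_endpoint_with_response_stream` on the runtime client.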

In [None]:
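# Create SageMaker runtime client for streaming
sagemaker_runtime = boto3.client('sagemaker-runtime', region_name=region)

def stream_response(endpoint_name, payload):
    """
    Stream responses from the SageMaker endpoint.

    TGI sends server-sent events (lines of the form `data:{...}`), and a
    single event can be split across several PayloadPart chunks. Bytes are
    therefore buffered and parsed one complete line at a time.
    """
    response = sagemaker_runtime.invoke_endpoint_with_response_stream(
        EndpointName=endpoint_name,
        Body=json.dumps(payload),
        ContentType='application/json'
    )

    event_stream = response['Body']
    buffer = b""

    print("\nStreaming response:")
    print("-" * 50)

    for event in event_stream:
        if 'PayloadPart' not in event:
            continue
        buffer += event['PayloadPart']['Bytes']

        # Parse every complete line; keep any trailing partial line in the buffer
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            line = line.strip()
            if not line.startswith(b"data:"):
                continue  # blank separator lines between events
            try:
                data = json.loads(line[len(b"data:"):])
            except json.JSONDecodeError:
                continue
            token = data.get('token')
            # Skip special tokens such as end-of-sequence markers
            if token and not token.get('special', False):
                print(token['text'], end='', flush=True)

    print("\n" + "-" * 50)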

# Streaming request
stream_payload = {
    "inputs": "Write a short poem about artificial intelligence.",
    "parameters": {
        "max_new_tokens": 150,
        "temperature": 0.8,
        "top_p": 0.9,
        "do_sample": True,
        "return_full_text": False
    },
    "stream": True
}

stream_response(endpoint_name, stream_payload)
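
The parsing of a streamed response can be checked offline. The sketch below feeds simulated byte chunks through a line-buffering parser, assuming the endpoint emits server-sent-event lines of the form `data:{...}` and that a chunk boundary can fall in the middle of an event. The chunk contents are made up for illustration, and `iter_sse_tokens` is a hypothetical helper.

```python
import json


def iter_sse_tokens(chunks):
    """Yield token texts from byte chunks that carry `data:{...}` lines."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        # Only complete lines are parsed; a trailing partial line stays buffered
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            line = line.strip()
            if not line.startswith(b"data:"):
                continue  # blank separator lines between events
            event = json.loads(line[len(b"data:"):])
            token = event.get("token")
            if token and not token.get("special", False):
                yield token["text"]


# Made-up chunks: the second event is split across two chunks on purpose
simulated_chunks = [
    b'data:{"token":{"id":1,"text":"Hel","special":false}}\n\ndata:{"tok',
    b'en":{"id":2,"text":"lo","special":false}}\n\n',
]

print("".join(iter_sse_tokens(simulated_chunks)))  # Hello
```

Calling `json.loads` on each raw chunk would fail here: the first chunk starts with the `data:` prefix and ends with half an event, which is why the bytes are buffered until a newline arrives.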

## 10. Batch Processing Example

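Send several prompts in a loop. Each prompt is a separate, sequential request to the endpoint, so this is a convenience pattern rather than server-side batching.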

In [None]:
# Batch processing
prompts = [
    "Explain quantum computing in one sentence.",
    "What is the capital of France?",
    "Write a haiku about nature.",
    "What are the benefits of exercise?"
]

print("Processing batch of prompts...\n")

for i, prompt in enumerate(prompts, 1):
    payload = {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": 100,
            "temperature": 0.7,
            "do_sample": True,
            "return_full_text": False
        }
    }
    
    response = predictor.predict(payload)
    
    print(f"\n{'='*50}")
    print(f"Prompt {i}: {prompt}")
    print(f"{'='*50}")
    print(response[0]['generated_text'])
    print(f"{'='*50}")
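
The loop above waits for each response before sending the next prompt. One way to overlap requests is a thread pool, sketched below. `FakePredictor` is a stand-in so the example runs offline, and `predict_many` is a hypothetical helper. The sketch assumes the real predictor can be called from several threads at once, which should be confirmed before relying on it.

```python
from concurrent.futures import ThreadPoolExecutor


class FakePredictor:
    """Stand-in for the SageMaker predictor so the sketch runs without an endpoint."""

    def predict(self, payload):
        return [{"generated_text": f"echo: {payload['inputs']}"}]


def predict_many(predictor, prompts, max_workers=4):
    """Send one request per prompt concurrently; results keep the input order."""

    def call(prompt):
        payload = {
            "inputs": prompt,
            "parameters": {"max_new_tokens": 100, "return_full_text": False},
        }
        return predictor.predict(payload)[0]["generated_text"]

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of the inputs
        return list(pool.map(call, prompts))


demo_prompts = ["What is the capital of France?", "Write a haiku about nature."]
results = predict_many(FakePredictor(), demo_prompts)
for text in results:
    print(text)
```

With a real endpoint, `predictor` from the deployment step would replace `FakePredictor()`. Requests that arrive at the same time give the server a chance to process them together, which a sequential loop never does.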

## 11. Monitor Endpoint Performance

Check the endpoint's status and deployment configuration.

In [None]:
sm_client = boto3.client('sagemaker', region_name=region)

# Get endpoint details
endpoint_description = sm_client.describe_endpoint(EndpointName=endpoint_name)
variant = endpoint_description['ProductionVariants'][0]

# The instance type is recorded on the endpoint config, not in the describe_endpoint response
endpoint_config = sm_client.describe_endpoint_config(
    EndpointConfigName=endpoint_description['EndpointConfigName']
)
instance_type = endpoint_config['ProductionVariants'][0]['InstanceType']

print("Endpoint Status:")
print(f"  Endpoint Name: {endpoint_description['EndpointName']}")
print(f"  Status: {endpoint_description['EndpointStatus']}")
print(f"  Creation Time: {endpoint_description['CreationTime']}")
print(f"  Instance Type: {instance_type}")
print(f"  Current Instance Count: {variant['CurrentInstanceCount']}")
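
Latency and invocation metrics for the endpoint are published to CloudWatch rather than returned by the SageMaker API. The sketch below builds the arguments for a `get_metric_statistics` call on the `ModelLatency` metric. It assumes the default variant name `AllTraffic`; `model_latency_query` and the endpoint name `my-endpoint` are hypothetical. The query is only constructed here, so the block runs without AWS credentials.

```python
from datetime import datetime, timedelta, timezone


def model_latency_query(endpoint_name, variant_name="AllTraffic", minutes=60):
    """Build keyword arguments for CloudWatch get_metric_statistics."""
    end = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/SageMaker",
        "MetricName": "ModelLatency",  # reported in microseconds
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": 300,  # seconds per datapoint
        "Statistics": ["Average", "Maximum"],
    }


query = model_latency_query("my-endpoint")
print(query["MetricName"], query["Period"])  # ModelLatency 300

# With AWS credentials available, the query could be run as:
# import boto3
# cloudwatch = boto3.client("cloudwatch")
# datapoints = cloudwatch.get_metric_statistics(**query)["Datapoints"]
```

Metrics only appear after the endpoint has served some requests, so an empty `Datapoints` list right after deployment is expected.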

## 12. Advanced Features - Details and Metadata

Get detailed information about generated tokens including log probabilities and finish reasons.

In [None]:
# Request with details
detailed_request = {
    "inputs": "Explain the concept of neural networks.",
    "parameters": {
        "max_new_tokens": 150,
        "temperature": 0.7,
        "do_sample": True,
        "return_full_text": False,
        "details": True  # Request detailed information
    }
}

response = predictor.predict(detailed_request)

print("\n" + "="*50)
print("DETAILED RESPONSE:")
print("="*50)
print(f"Generated Text: {response[0]['generated_text']}")
print(f"\nDetails:")
if 'details' in response[0]:
    details = response[0]['details']
    print(f"  Finish Reason: {details.get('finish_reason', 'N/A')}")
    print(f"  Generated Tokens: {details.get('generated_tokens', 'N/A')}")
print("="*50)
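
When `details` is enabled, the response also carries a `tokens` list with a log probability for each generated token. The sketch below summarizes such a list into a mean log probability and a perplexity. `sample_details` uses made-up values shaped like a TGI `details` object, and `summarize_details` is a hypothetical helper.

```python
import math


def summarize_details(details):
    """Summarize a TGI-style `details` dict, ignoring special tokens."""
    tokens = [t for t in details.get("tokens", []) if not t.get("special", False)]
    logprobs = [t["logprob"] for t in tokens]
    mean_logprob = sum(logprobs) / len(logprobs) if logprobs else float("nan")
    return {
        "finish_reason": details.get("finish_reason"),
        "generated_tokens": details.get("generated_tokens"),
        "mean_logprob": mean_logprob,
        # Perplexity is the exponential of the negative mean log probability
        "perplexity": math.exp(-mean_logprob),
    }


# Made-up values for illustration; a real response would come from the endpoint
sample_details = {
    "finish_reason": "eos_token",
    "generated_tokens": 3,
    "tokens": [
        {"id": 101, "text": "Neural", "logprob": -0.5, "special": False},
        {"id": 102, "text": " nets", "logprob": -1.5, "special": False},
        {"id": 103, "text": "<|end|>", "logprob": -0.1, "special": True},
    ],
}

summary = summarize_details(sample_details)
print(summary["finish_reason"], summary["mean_logprob"], round(summary["perplexity"], 3))
# eos_token -1.0 2.718
```

With a live endpoint, `response[0]['details']` from the request above could be passed in place of `sample_details`. A lower perplexity means the model assigned higher probability to the tokens it sampled.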

## 13. Cleanup Resources

**Important**: Delete the endpoint when you're done to avoid ongoing charges.

In [None]:
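# Delete the model first: the predictor may resolve the model name through the
# endpoint's configuration, which is gone once the endpoint is deleted
predictor.delete_model()
print(f"Model {model_name} deleted successfully")

# Delete the endpoint together with its endpoint configuration
predictor.delete_endpoint(delete_endpoint_config=True)
print(f"Endpoint {endpoint_name} deleted successfully")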

## Summary

In this notebook, we:

1. ✅ Set up the SageMaker environment
2. ✅ Retrieved the Hugging Face TGI container image
3. ✅ Configured the Phi-3 model for deployment
4. ✅ Deployed the model to a SageMaker endpoint
5. ✅ Tested simple text generation
6. ✅ Used chat-formatted prompts
7. ✅ Implemented streaming inference
8. ✅ Processed multiple prompts in a loop
9. ✅ Checked endpoint status and configuration
10. ✅ Retrieved detailed generation metadata
11. ✅ Cleaned up resources

## Next Steps

- Experiment with different Phi-3 variants (mini-128k, medium, small)
- Implement auto-scaling policies for production workloads
- Integrate with CloudWatch for monitoring and alerting
- Set up A/B testing with multiple model versions
- Implement custom inference code for specialized use cases

## Additional Resources

- [Hugging Face TGI Documentation](https://huggingface.co/docs/text-generation-inference)
- [SageMaker Inference Documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/inference.html)
- [Phi-3 Model Card](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct)
- [AWS Deep Learning Containers](https://github.com/aws/deep-learning-containers)