# Lab 05: AIP with LiteLLM - AI Gateway for Multi-Tenant Applications

## Business Context

You are building a **Multi-Tenant Marketing Platform** with an API gateway layer that serves multiple enterprise clients with AI-powered capabilities. Your platform provides a unified LiteLLM gateway that routes requests to Nova Pro for different tenants:

- **Tenant A**: B2B tech company needing marketing campaign automation
- **Tenant B**: B2C retail company needing promotional content generation

**Challenge**: Use LiteLLM as an abstraction layer while maintaining per-tenant isolation through Application Inference Profiles, enabling simpler multi-tenant code and configuration-driven tenant management.

## Learning Objectives
- Configure LiteLLM to route requests through Application Inference Profiles
- Create tenant-specific gateway configurations for multi-tenant isolation
- Route requests through LiteLLM to tenant-specific AIPs
- Track usage and costs per tenant through AIP metrics
- Compare gateway abstraction (Lab 05) vs direct SDK (Lab 03) approaches

In [None]:
# Install required packages
!pip install --force-reinstall -q -r requirements.txt

## Section 1: Setup and Create Application Inference Profiles

First, let's set up our environment with both boto3 (for AIP management) and LiteLLM (for gateway abstraction), then create Application Inference Profiles for our two tenants.

In [None]:
import boto3
import json
import time
import litellm
from datetime import datetime
from typing import Dict, Any
from lab_helpers.config import Region, ModelId
from lab_helpers.aip_manager import AIPManager
from lab_helpers.usage_tracker import UsageTracker

# Initialize AWS clients
bedrock_client = boto3.client('bedrock', region_name=Region)
bedrock_runtime = boto3.client('bedrock-runtime', region_name=Region)
sts = boto3.client('sts', region_name=Region)

# Configure LiteLLM to suppress verbose logs
litellm.set_verbose = False

print(f"✅ Initialized boto3 clients for region: {Region}")
print(f"✅ Initialized LiteLLM for gateway abstraction")
print(f"📋 Using Nova Pro 1.0 model: {ModelId}")

### 🏗️ Setup: AIP Manager and Tenant Configurations

Initialize the Application Inference Profile manager and define configurations for two tenants:
- **Tenant A (B2B Tech)**: Marketing campaign automation.
- **Tenant B (B2C Retail)**: Promotional content generation.

Each tenant gets:
- Unique AIP for isolated model access
- Tags for cost tracking and billing (`TenantId`, `BusinessType`, `CostCenter`)
- Separate CloudWatch metrics dimensions

In [None]:
# Initialize AIP Manager
aip_manager = AIPManager(bedrock_client)

# Define tenant configurations (same as Lab 03)
TENANT_CONFIGS = {
    "tenant_a": {
        "name": "marketing-ai-tenant-a",
        "description": "Marketing AI AIP for Tenant A",
        "tags": {
            "TenantId": "tenant-a",
            "BusinessType": "B2B-Tech",
            "Environment": "production",
            "CostCenter": "marketing-ai-platform"
        }
    },
    "tenant_b": {
        "name": "marketing-ai-tenant-b", 
        "description": "Marketing AI AIP for Tenant B",
        "tags": {
            "TenantId": "tenant-b",
            "BusinessType": "B2C-Retail",
            "Environment": "production",
            "CostCenter": "marketing-ai-platform"
        }
    }
}

print("📋 Tenant configurations defined:")
for tenant_id, config in TENANT_CONFIGS.items():
    print(f"  - {tenant_id}: {config['description']}")

### 🔧 Create Application Inference Profiles for Each Tenant

This cell creates (or verifies existing) Application Inference Profiles for both tenants using the AWS Bedrock API.

**The AIP Creation Process:**

1. **Check for Existing AIPs**: Verify if tenant already has an AIP to avoid duplicates
2. **Prepare Tags**: Convert tenant tags to AWS API format for cost allocation
3. **Construct Model ARN**: Build proper ARN pointing to the base System Inference Profile
4. **Create AIP**: Call `create_inference_profile()` with:
   - Unique profile name per tenant
   - Model source (copies from System Inference Profile)
   - Tenant-specific tags for tracking

**🎯 The Critical API Call:**
```python
bedrock_client.create_inference_profile(
    inferenceProfileName=config["name"],
    description=config["description"],
    modelSource={"copyFrom": MODEL_ARN},
    tags=tag_list
)
```

**What you get:**
- Unique AIP ARN for each tenant
- Isolated CloudWatch metrics dimension (`ModelId` = the AIP ID, i.e. the last segment of the AIP ARN)
- Tagged resources for cost allocation and billing

**Result:** Two Application Inference Profiles ready for LiteLLM gateway routing!
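
Steps 2 and 3 above (preparing tags and constructing the model ARN) can be sketched as two small helpers. This is a minimal illustration that mirrors the formats used in this notebook; the helper names are hypothetical, and the region, account ID, and profile ID below are placeholders rather than real values.

```python
from typing import Dict, List


def to_tag_list(tags: Dict[str, str]) -> List[Dict[str, str]]:
    """Convert a plain dict of tags into the key/value list format used in this lab."""
    return [{"key": key, "value": value} for key, value in tags.items()]


def build_system_profile_arn(region: str, account_id: str, profile_id: str) -> str:
    """Build the ARN of the system inference profile that the AIP copies from."""
    return f"arn:aws:bedrock:{region}:{account_id}:inference-profile/{profile_id}"


# Placeholder values for illustration only (not real account data).
example_tags = {"TenantId": "tenant-a", "CostCenter": "marketing-ai-platform"}
example_arn = build_system_profile_arn(
    "us-east-1", "123456789012", "us.amazon.nova-pro-v1:0"
)

print(to_tag_list(example_tags))
print(example_arn)
```

Keeping these as pure functions makes the tag and ARN formats easy to check before any AWS call is made.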

In [None]:
# Check for existing AIPs and reuse or create
tenant_aips = {}

for tenant_id, config in TENANT_CONFIGS.items():
    print(f"\n🔍 Checking AIP for {tenant_id}...")
    
    try:
        # Check if AIP already exists (possibly from Lab 03)
        existing_arn = aip_manager.check_aip_exists(config["name"])
        
        if existing_arn:
            print(f"✅ Found existing AIP (reusing from Lab 03 or previous run)")
            print(f"   ARN: {existing_arn}")
            tenant_aips[tenant_id] = existing_arn
        else:
            print(f"📝 AIP not found - creating new one...")
            
            # Prepare tags
            tag_list = []
            if config["tags"]:
                tag_list = [{"key": k, "value": v} for k, v in config["tags"].items()]
            
            # Get account ID
            account_id = sts.get_caller_identity()['Account']
            
            # Construct proper ARN
            MODEL_ARN = f"arn:aws:bedrock:{Region}:{account_id}:inference-profile/{ModelId}"
            print(f"   Model ARN: {MODEL_ARN}")
            
            # Create Application Inference Profile
            response = bedrock_client.create_inference_profile(
                inferenceProfileName=config["name"],
                description=config["description"],
                modelSource={"copyFrom": MODEL_ARN},
                tags=tag_list
            )
            
            aip_arn = response['inferenceProfileArn']
            tenant_aips[tenant_id] = aip_arn
            
            print(f"✅ Created new AIP for {tenant_id}")
            print(f"   Status: {response['status']}")
            print(f"   ARN: {aip_arn}")
            
    except Exception as e:
        print(f"❌ Error with AIP for {tenant_id}: {str(e)}")

print(f"\n📊 Summary: {len(tenant_aips)} Application Inference Profiles ready for LiteLLM")

### 🌉 Why LiteLLM? The Gateway Abstraction Layer

**LiteLLM** provides a unified interface for calling 100+ LLMs through a consistent OpenAI-style API. When combined with Application Inference Profiles, it enables:

**✅ Simplified Multi-Tenant Code:**
- Single `litellm.completion()` call for all tenants
- Configuration-driven routing (not hardcoded logic)
- Same code works across multiple models/providers

**✅ Gateway Features:**
- Unified interface across AWS Bedrock, OpenAI, Azure, etc.
- Built-in token counting and cost tracking
- Consistent error handling and configurable retries
- Easy model switching via configuration

**✅ Production Benefits:**
- Easier to scale to many tenants (just add config)
- Simpler codebase (less conditional logic)
- Model-agnostic application code
- Gateway pattern ready for deployment

**The Key Pattern:**

```python
# Lab 03 (Direct boto3): Different code paths, manual routing
if tenant_id == "tenant_a":
    response = bedrock_runtime.invoke_model(modelId=tenant_a_aip_arn, ...)
elif tenant_id == "tenant_b":
    response = bedrock_runtime.invoke_model(modelId=tenant_b_aip_arn, ...)

# Lab 05 (LiteLLM): Unified interface, configuration-driven
response = litellm.completion(
    model=litellm_config[tenant_id]["model"],  # Config maps to AIP
    messages=[{"role": "user", "content": message}]
)
```
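
To make the configuration-driven claim concrete, here is a minimal sketch of a dispatcher with no per-tenant branches. It is not the lab's routing function: `route_request` and `fake_completion` are hypothetical names, the ARNs are placeholders, and a stand-in replaces `litellm.completion` so the example runs without AWS access.

```python
from typing import Any, Callable, Dict


def route_request(tenant_id: str, message: str,
                  config: Dict[str, Dict[str, Any]],
                  completion_fn: Callable[..., Any]) -> Any:
    """Forward a message using the tenant's configured model; no per-tenant branches."""
    if tenant_id not in config:
        raise KeyError(f"No gateway config for tenant: {tenant_id}")
    return completion_fn(
        model=config[tenant_id]["model"],
        messages=[{"role": "user", "content": message}],
    )


def fake_completion(model: str, messages: list) -> str:
    """Stand-in for litellm.completion so the sketch runs without AWS access."""
    return f"{model} <- {messages[0]['content']}"


# Placeholder ARNs for illustration only.
ARN_PREFIX = "arn:aws:bedrock:us-east-1:123456789012:application-inference-profile/"
demo_config = {
    "tenant_a": {"model": f"bedrock/converse/{ARN_PREFIX}aaa111"},
    "tenant_b": {"model": f"bedrock/converse/{ARN_PREFIX}bbb222"},
}

# Onboarding a hypothetical third tenant is a config change only.
demo_config["tenant_c"] = {"model": f"bedrock/converse/{ARN_PREFIX}ccc333"}

print(route_request("tenant_c", "hello", demo_config, fake_completion))
```

Passing the completion function as an argument is a design choice for testability; in the notebook the real `litellm.completion` would take its place.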

**Next:** Configure LiteLLM to route through tenant-specific AIPs!

### ⚙️ Configure LiteLLM Gateway for Multi-Tenant Routing

Create configuration mappings that tell LiteLLM how to route each tenant's requests to their specific Application Inference Profile.

**Configuration Structure:**
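- **model**: `bedrock/converse/<aip_arn>`, which routes the call to AWS Bedrock using the tenant's full AIP ARN
- **aws_region_name**: AWS region for Bedrock calls
- **max_tokens**: Upper bound on response length per tenant
- **metadata**: Tenant ID and AIP ARN kept alongside the config for reporting
- **temperature**: Omitted from this lab's configuration (see the note in the code cell)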

**Key Insight:** Each tenant's config points to their unique AIP ARN, enabling LiteLLM to transparently route through tenant-isolated AIPs while maintaining CloudWatch metric separation.
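
Because a wrong ARN in the config would silently break isolation, it can help to validate it when the config is built. The sketch below is one possible approach, not part of the lab helpers: `build_tenant_config` is a hypothetical name, and the check assumes AIP ARNs contain the `application-inference-profile` resource type.

```python
from typing import Any, Dict

# Assumption: Application Inference Profile ARNs use this resource type.
AIP_MARKER = ":application-inference-profile/"


def build_tenant_config(tenant_id: str, aip_arn: str, region: str,
                        max_tokens: int = 1000) -> Dict[str, Any]:
    """Build one tenant's gateway config, rejecting values that are not AIP ARNs."""
    if not aip_arn.startswith("arn:") or AIP_MARKER not in aip_arn:
        raise ValueError(f"{tenant_id}: not an Application Inference Profile ARN: {aip_arn}")
    return {
        "model": f"bedrock/converse/{aip_arn}",
        "aws_region_name": region,
        "max_tokens": max_tokens,
        "metadata": {"tenant_id": tenant_id, "aip_arn": aip_arn},
    }


# Placeholder ARN for illustration only.
sample_arn = "arn:aws:bedrock:us-east-1:123456789012:application-inference-profile/abc123example"
sample_config = build_tenant_config("tenant_a", sample_arn, "us-east-1")
print(sample_config["model"])
```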

In [None]:
# Configure LiteLLM gateway mappings for each tenant
litellm_config = {}

for tenant_id, aip_arn in tenant_aips.items():
    # LiteLLM requires full ARN with bedrock/converse/ prefix for AIPs
    # Based on working example: model: bedrock/converse/arn:aws:bedrock:...
    litellm_config[tenant_id] = {
        "model": f"bedrock/converse/{aip_arn}",  # Full ARN format
        "aws_region_name": Region,
        "max_tokens": 1000,
        # Note: temperature removed - not supported with Bedrock AIPs
        "metadata": {
            "tenant_id": tenant_id,
            "aip_arn": aip_arn
        }
    }
    
print("🌉 LiteLLM Gateway Configuration:")
print("="*60)
for tenant_id, config in litellm_config.items():
    print(f"\n🏢 {tenant_id.upper()}:")
    print(f"   Model: {config['model']}")
    print(f"   Region: {config['aws_region_name']}")
    print(f"   Max Tokens: {config['max_tokens']}")
    print(f"   AIP ARN: {config['metadata']['aip_arn']}")

print("\n" + "="*60)
print("✅ LiteLLM gateway ready to route tenant requests through AIPs!")

### 📊 CloudWatch Monitoring Function

The `monitor_aip_usage()` function fetches and visualizes CloudWatch metrics for each tenant's Application Inference Profile.

**What it does:**
- Queries CloudWatch for the last 60 minutes of data
- Fetches three key metrics per tenant:
  - **Invocations**: Number of API calls
  - **InputTokenCount**: Request complexity
  - **OutputTokenCount**: Response generation
- Generates time-series plots showing usage patterns
- Returns metrics dictionary for cost analysis

**Why it matters:**
- Each tenant's metrics are **isolated** using their unique AIP ID
- Enables accurate per-tenant billing and cost allocation
- Provides proof that multi-tenancy isolation is working through LiteLLM

**Key insight:** Even though requests go through the LiteLLM abstraction layer, CloudWatch still tracks them separately by AIP dimension!
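
The lab helper hides the CloudWatch query itself. As a rough sketch of what such a query might look like, the function below builds the keyword arguments for boto3's `get_metric_statistics`. It assumes the `AWS/Bedrock` namespace and, following this notebook, uses the last segment of the AIP ARN as the `ModelId` dimension value; `build_metric_query` is a hypothetical name and the ARN is a placeholder.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, Optional


def build_metric_query(aip_arn: str, metric_name: str,
                       minutes: int = 60, period: int = 60,
                       now: Optional[datetime] = None) -> Dict[str, Any]:
    """Build keyword arguments for CloudWatch get_metric_statistics for one AIP."""
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/Bedrock",
        "MetricName": metric_name,
        # Same extraction as monitor_aip_usage(): AIP ID = last ARN segment.
        "Dimensions": [{"Name": "ModelId", "Value": aip_arn.split("/")[-1]}],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": period,
        "Statistics": ["Sum"],
    }


# Placeholder ARN and a fixed timestamp so the output is reproducible.
example_arn = "arn:aws:bedrock:us-east-1:123456789012:application-inference-profile/abc123example"
query = build_metric_query(
    example_arn, "InputTokenCount",
    now=datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc),
)
print(query["Dimensions"], query["StartTime"].isoformat())

# With credentials configured, the query could be sent like this:
# import boto3
# cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
# datapoints = cloudwatch.get_metric_statistics(**query)["Datapoints"]
```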

In [None]:
def monitor_aip_usage(tenant_aips, region):
    """
    Monitor CloudWatch metrics for multiple Application Inference Profiles.
    
    Args:
        tenant_aips (dict): Mapping of tenant_id -> AIP ARN
        region (str): AWS region where AIPs are deployed
    
    Returns:
        dict: Tenant metrics for further analysis
    """
    from lab_helpers.cloudwatch import fetch_metrices, plot_graph

    print("📊 Fetching CloudWatch metrics for Application Inference Profiles...")
    print(f"Region: {region}")
    print(f"Time Range: Last 60 minutes")
    print("="*80)

    tenant_metrics = {}

    for tenant_id, aip_arn in tenant_aips.items():
        print(f"\n🏢 TENANT: {tenant_id.upper()}")
        print("="*50)
        print(f"AIP ARN: {aip_arn}")
        
        # Extract AIP ID from full ARN for CloudWatch ModelId dimension
        aip_id = aip_arn.split('/')[-1]
        print(f"AIP ID: {aip_id}")
        
        try:
            print(f"\n📊 METRICS FOR {tenant_id.upper()}:")
            response, input_token_response, output_token_response = fetch_metrices(
                Region=region,
                Period=60,
                Timedelta=60,
                Id=aip_id
            )
            
            tenant_metrics[tenant_id] = {
                'invocations': response,
                'input_tokens': input_token_response,
                'output_tokens': output_token_response
            }
            
            print(f"\n📈 USAGE PLOTS FOR {tenant_id.upper()}:")
            plot_graph(response, input_token_response, output_token_response)
            
            print("="*50)
                
        except Exception as e:
            print(f"⚠️ CloudWatch error for {tenant_id}: {str(e)}")
            print(f"💡 Metrics may take a few minutes to appear after model invocations")
            print("="*50)
    
    return tenant_metrics

## 📊 Baseline Check: CloudWatch Metrics BEFORE LiteLLM Requests

Before we route any requests through LiteLLM, let's check CloudWatch metrics for our newly created AIPs. This establishes a **baseline** that proves:

1. ✅ The AIPs exist and are properly configured
2. ✅ Each tenant has their own isolated CloudWatch dimension
3. ✅ No usage has occurred yet (metrics should be empty, unless the AIPs were reused from Lab 03 and invoked within the last hour)

### The Before/After Demonstration

We'll check metrics **twice** in this lab:

**🔵 NOW (BEFORE):** Check metrics → Expect empty/no data
- This proves the AIPs are ready but haven't been used

**🟢 LATER (AFTER):** Check metrics → Expect populated per-tenant data  
- This proves multi-tenant isolation works through LiteLLM!

Let's run the baseline check:
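
The baseline expectation (no datapoints yet) can also be checked programmatically. The sketch below assumes each metric response has the shape returned by CloudWatch `get_metric_statistics`, that is, a dict with a `Datapoints` list whose items carry a `Sum`. The return shape of the lab helper is not shown in this notebook, so treat that as an assumption; the function names and sample data are hypothetical.

```python
from typing import Any, Dict


def total_sum(response: Dict[str, Any]) -> float:
    """Add up the 'Sum' statistic over all datapoints in one metric response."""
    return sum(point.get("Sum", 0) for point in response.get("Datapoints", []))


def is_baseline_empty(tenant_metrics: Dict[str, Dict[str, Dict[str, Any]]]) -> bool:
    """Return True when no tenant has recorded any usage for any metric."""
    return all(
        total_sum(response) == 0
        for metrics in tenant_metrics.values()
        for response in metrics.values()
    )


# Hand-written sample data (not real CloudWatch output).
before = {
    "tenant_a": {"invocations": {"Datapoints": []}, "input_tokens": {"Datapoints": []}},
    "tenant_b": {"invocations": {"Datapoints": []}, "input_tokens": {"Datapoints": []}},
}
after = {
    "tenant_a": {
        "invocations": {"Datapoints": [{"Sum": 1.0}]},
        "input_tokens": {"Datapoints": [{"Sum": 120.0}]},
    },
    "tenant_b": {"invocations": {"Datapoints": []}, "input_tokens": {"Datapoints": []}},
}

print("baseline empty:", is_baseline_empty(before))
print("after requests empty:", is_baseline_empty(after))
```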

## Section 2: Tenant-Aware Request Routing via LiteLLM Gateway

Now let's implement the core gateway functionality: routing tenant requests through LiteLLM to their specific Application Inference Profiles.

### 🤖 LiteLLM Request Routing Function

This function demonstrates the power of LiteLLM gateway abstraction for multi-tenant applications.

**The Gateway Pattern:**
```python
litellm.completion(
    model=config["model"],  # Tenant-specific AIP routing
    messages=[{"role": "user", "content": message}],
    **parameters
)
```

**What happens behind the scenes:**
1. LiteLLM receives request with tenant-specific model config
2. Extracts AIP ARN from config
3. Translates to AWS Bedrock API call format
4. Invokes Nova Pro through the tenant's AIP
5. Returns response with usage metrics
6. CloudWatch automatically tracks under tenant's AIP dimension

**Key Benefits:**
- **Single code path**: Same function works for all tenants
- **Configuration-driven**: Add new tenants by updating config, not code
- **Framework-agnostic**: Works with any LiteLLM-supported provider
- **Built-in features**: Token counting, retries, error handling included

Compare this to Lab 03 where we needed manual routing logic and error handling!
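
To illustrate step 2 above, the sketch below shows how the `bedrock/converse/<aip_arn>` model string used in this lab carries the provider, the route, and the AIP ARN. This only mimics the idea; it is not LiteLLM's internal parsing, and `split_model_string` is a hypothetical helper with a placeholder ARN.

```python
from typing import Tuple


def split_model_string(model: str) -> Tuple[str, str, str]:
    """Split 'provider/route/target' into its parts; the target may itself contain '/'."""
    parts = model.split("/", 2)
    if len(parts) != 3:
        raise ValueError(f"Expected 'provider/route/target', got: {model}")
    provider, route, target = parts
    return provider, route, target


# Placeholder ARN for illustration only.
model = (
    "bedrock/converse/"
    "arn:aws:bedrock:us-east-1:123456789012:application-inference-profile/abc123example"
)
provider, route, target = split_model_string(model)
print(provider)
print(route)
print(target)
```

Splitting at most twice matters here because the ARN itself contains a `/` before the profile ID.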

In [None]:
def route_tenant_request_via_litellm(tenant_id: str, config: Dict[str, Any], user_message: str) -> Dict[str, Any]:
    """
    Route tenant request through LiteLLM gateway to their specific AIP.
    
    Args:
        tenant_id: Tenant identifier
        config: LiteLLM configuration for this tenant
        user_message: User's message to process
        
    Returns:
        Response with content and usage metrics
    """
    
    try:
        # Record start time for latency calculation
        start_time = datetime.now()
        
        # 🌉 LITELLM GATEWAY CALL - Single interface for all tenants!
        # LiteLLM transparently routes through tenant's AIP
        response = litellm.completion(
            model=config["model"],  # bedrock/converse/<full-aip-arn>
            messages=[
                {"role": "user", "content": user_message}
            ],
            max_tokens=config["max_tokens"],
            # Note: temperature removed - not supported with Bedrock AIPs via LiteLLM
            aws_region_name=config["aws_region_name"]
        )
        
        # Calculate latency
        end_time = datetime.now()
        latency_ms = (end_time - start_time).total_seconds() * 1000
        
        # Extract response content
        # LiteLLM normalizes response format across all providers
        response_content = response.choices[0].message.content
        
        # Extract usage metrics
        # LiteLLM provides consistent token counting across providers
        usage_metrics = {
            'input_tokens': response.usage.prompt_tokens,
            'output_tokens': response.usage.completion_tokens,
            'total_tokens': response.usage.total_tokens,
            'latency_ms': latency_ms
        }
        
        return {
            'tenant_id': tenant_id,
            'content': response_content,
            'usage_metrics': usage_metrics,
            'aip_arn': config['metadata']['aip_arn'],
            'timestamp': datetime.now().isoformat(),
            'success': True
        }
        
    except Exception as e:
        return {
            'tenant_id': tenant_id,
            'error': str(e),
            'timestamp': datetime.now().isoformat(),
            'success': False
        }

print("✅ LiteLLM request routing function defined")
print("💡 This single function handles ALL tenant requests through gateway!")

### 📋 Define Multi-Tenant Use Cases

Let's create realistic use cases for our two tenants to demonstrate how the LiteLLM gateway handles different request patterns while maintaining isolation:

- **Tenant A**: Customer support chatbot automation
- **Tenant B**: Analytics report generation

Each use case will route through their tenant-specific AIP via LiteLLM, with all usage tracked separately in CloudWatch.

In [None]:
# Define use cases for each tenant
TENANT_USE_CASES = {
    "tenant_a": {
        "use_case": "Customer Support Chatbot",
        "message": """
A customer just submitted this support ticket:

"I'm trying to integrate your API with our system, but I keep getting a 401 Unauthorized error. 
I've checked my API key multiple times and it looks correct. The error happens on every endpoint I try.
Can you help me understand what might be wrong?"

Please provide a helpful, friendly support response that:
1. Acknowledges the issue
2. Suggests 3 common causes of 401 errors
3. Provides clear troubleshooting steps
4. Offers next steps if these don't resolve it
"""
    },
    "tenant_b": {
        "use_case": "Analytics Report Generation",
        "message": """
Generate an executive summary for the following analytics data:

Monthly User Engagement Metrics:
- Total Active Users: 45,230 (↑ 12% MoM)
- Average Session Duration: 8.5 minutes (↑ 5% MoM)
- Feature Adoption Rate: 67% (↑ 9% MoM)
- Customer Satisfaction Score: 4.2/5.0 (↑ 0.3 MoM)
- Churn Rate: 2.1% (↓ 0.4% MoM)

Please provide:
1. Executive summary of key insights
2. Trend analysis
3. Strategic recommendations
4. Areas of concern or opportunity
"""
    }
}

print("📋 Multi-Tenant Use Cases Defined:")
print("="*60)
for tenant_id, details in TENANT_USE_CASES.items():
    print(f"\n🏢 {tenant_id.upper()}: {details['use_case']}")
    print(f"   Message length: {len(details['message'])} characters")

### 🚀 Process Requests Through LiteLLM Gateway

Now let's route both tenant requests through the LiteLLM gateway. Watch how:
1. Same routing function handles both tenants
2. Each request goes through their specific AIP
3. Usage metrics are captured per tenant
4. CloudWatch will track them separately (we'll verify in Section 3!)

**The Gateway Magic:** No conditional logic, no tenant-specific code paths - just configuration-driven routing!

In [None]:
# Route requests through LiteLLM gateway for both tenants
tenant_responses = {}

print("🚀 Routing tenant requests through LiteLLM gateway...")
print("="*80)

for tenant_id in TENANT_CONFIGS.keys():
    if tenant_id in litellm_config:
        print(f"\n🌉 Processing request for {tenant_id.upper()} via LiteLLM...")
        print(f"   Use Case: {TENANT_USE_CASES[tenant_id]['use_case']}")
        print(f"   Routing through: {litellm_config[tenant_id]['model']}")
        
        result = route_tenant_request_via_litellm(
            tenant_id=tenant_id,
            config=litellm_config[tenant_id],
            user_message=TENANT_USE_CASES[tenant_id]['message']
        )
        
        tenant_responses[tenant_id] = result
        
        if result['success']:
            print(f"✅ Request completed for {tenant_id}")
            print(f"   Input tokens: {result['usage_metrics']['input_tokens']}")
            print(f"   Output tokens: {result['usage_metrics']['output_tokens']}")
            print(f"   Total tokens: {result['usage_metrics']['total_tokens']}")
            print(f"   Latency: {result['usage_metrics']['latency_ms']:.2f}ms")
        else:
            print(f"❌ Error for {tenant_id}: {result['error']}")

print(f"\n" + "="*80)
print(f"📊 Processed {len([r for r in tenant_responses.values() if r['success']])} successful requests")
print("💡 Each request routed through tenant-specific AIP via LiteLLM!")

### 📄 Display Response Content and Metrics

Let's examine the AI-generated responses for each tenant and their corresponding usage metrics.

In [None]:
# Display responses for each tenant
print("📄 TENANT RESPONSES VIA LITELLM GATEWAY")
print("="*80)

for tenant_id, response in tenant_responses.items():
    if response['success']:
        print(f"\n🏢 TENANT: {tenant_id.upper()}")
        print(f"   Use Case: {TENANT_USE_CASES[tenant_id]['use_case']}")
        print(f"   AIP ARN: {response['aip_arn']}")
        print("   " + "-"*76)
        print(f"\n   📝 RESPONSE:\n")
        
        # Display response content with indentation
        for line in response['content'].split('\n'):
            print(f"   {line}")
        
        print(f"\n   " + "-"*76)
        print(f"   📊 USAGE METRICS:")
        print(f"      Input Tokens:  {response['usage_metrics']['input_tokens']:,}")
        print(f"      Output Tokens: {response['usage_metrics']['output_tokens']:,}")
        print(f"      Total Tokens:  {response['usage_metrics']['total_tokens']:,}")
        print(f"      Latency:       {response['usage_metrics']['latency_ms']:.2f}ms")
        print(f"      Timestamp:     {response['timestamp']}")
        print("="*80)
    else:
        print(f"\n❌ {tenant_id.upper()}: Failed - {response['error']}")
        print("="*80)

## Section 3: CloudWatch Metrics and Multi-Tenant Isolation Proof

Now let's verify that despite routing through LiteLLM, CloudWatch still tracks each tenant's usage separately through their Application Inference Profiles!

### 🔍 How LiteLLM + AIP Maintains CloudWatch Isolation

**The Question:** If we're routing through the LiteLLM abstraction, how does CloudWatch still track tenants separately?

**The Answer:** The magic of Application Inference Profiles!

```
┌─────────────┐
│ Tenant A    │
│ Request     │
└──────┬──────┘
       │
       ├──────────► LiteLLM Gateway
       │            (Unified Interface)
       │                   │
       │                   ├──────► bedrock/tenant-a-aip
       │                   │        │
       │                   │        └──────► AWS Bedrock
       │                   │                 (Tenant A AIP ARN)
       │                   │                        │
       │                   │                        └──► CloudWatch
       │                   │                             Dimension: tenant-a-aip
       │
┌──────┴──────┐
│ Tenant B    │
│ Request     │
└──────┬──────┘
       │
       └──────────► LiteLLM Gateway
                    (Same Code!)
                           │
                           ├──────► bedrock/tenant-b-aip
                           │        │
                           │        └──────► AWS Bedrock
                           │                 (Tenant B AIP ARN)
                           │                        │
                           │                        └──► CloudWatch
                           │                             Dimension: tenant-b-aip
```

**Key Points:**
1. LiteLLM routes to `bedrock/converse/<aip_arn>` (tenant-specific)
2. Each tenant's AIP ARN ends in a unique AIP ID
3. AWS Bedrock publishes metrics with that AIP ID as the `ModelId` dimension (the value `monitor_aip_usage()` queries)
4. CloudWatch separates metrics by `ModelId` dimension
5. **Result:** Per-tenant isolation maintained automatically!

**The Proof:** Let's check CloudWatch metrics now!

In [None]:
# Wait for CloudWatch metrics to propagate
print("⏳ Waiting 60 seconds for CloudWatch metrics to propagate...")
print("💡 CloudWatch metrics have a 1-2 minute delay after API calls")
time.sleep(60)
print("✅ Wait complete - proceeding with metrics monitoring")

## 🟢 AFTER Check: CloudWatch Metrics Show Multi-Tenant Isolation!

Now that we've routed requests through LiteLLM, let's check CloudWatch metrics. You should see:

### Expected Results:

**📊 Per-Tenant Metrics:**
- **Tenant A** metrics under their own AIP dimension
- **Tenant B** metrics under their own AIP dimension
- **Separate graphs** for each tenant (not combined!)

### The Proof of Multi-Tenancy Through Gateway:

Compare this to the BEFORE check:
- **BEFORE:** Empty metrics (AIPs ready but unused)
- **AFTER:** Populated metrics showing exact per-tenant usage

This demonstrates:
1. ✅ **Gateway + Isolation**: LiteLLM routing preserves per-tenant tracking
2. ✅ **Accurate Billing**: Can charge tenant_a and tenant_b independently
3. ✅ **SLA Monitoring**: Can track performance per customer through gateway
4. ✅ **Cost Allocation**: Know exactly what each tenant costs despite abstraction layer

⏳ **Note:** If metrics appear empty, wait another 60 seconds and re-run this cell.
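
Instead of a fixed sleep followed by a manual re-run, the wait can be expressed as a small polling loop. This is a minimal sketch rather than part of the lab helpers: `wait_for` is a hypothetical name, and the demo uses stand-ins for both the metrics check and the sleep so that nothing actually blocks.

```python
import time
from typing import Callable, List


def wait_for(check: Callable[[], bool], attempts: int = 5,
             delay_seconds: float = 60.0,
             sleep_fn: Callable[[float], None] = time.sleep) -> bool:
    """Call check() up to `attempts` times, sleeping between tries; True once it passes."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:  # no need to sleep after the final try
            sleep_fn(delay_seconds)
    return False


# Demo with stand-ins so nothing actually sleeps: metrics "appear" on the third check.
calls = {"count": 0}
recorded_sleeps: List[float] = []


def metrics_visible() -> bool:
    calls["count"] += 1
    return calls["count"] >= 3


found = wait_for(metrics_visible, sleep_fn=recorded_sleeps.append)
print(found, recorded_sleeps)
```

In the notebook, `check` could wrap whatever test decides that datapoints have arrived for every tenant.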

In [None]:
# Check CloudWatch metrics AFTER LiteLLM requests
print("🟢 AFTER CHECK: Querying CloudWatch for tenant-specific metrics...")
print("="*80)
print("Expected: Per-tenant metrics showing isolated usage through LiteLLM!")
print("="*80 + "\n")

after_metrics = monitor_aip_usage(tenant_aips, Region)

print("\n" + "="*80)
print("‚úÖ Metrics check complete!")
print("💡 If you see metrics data, multi-tenant isolation is working!")
print("   Each tenant's LiteLLM requests tracked separately via their AIP.")
print("="*80)

### 💰 Cost Allocation Analysis

Let's demonstrate how the combination of LiteLLM + AIP enables accurate per-tenant cost allocation and billing.
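
As a rough sketch of per-tenant cost allocation, the example below turns token counts into an estimated cost and a share of the total. The prices are hypothetical placeholders (check current Bedrock pricing for real values), and the usage numbers are hand-written samples with the same keys as `usage_metrics` in this notebook.

```python
from typing import Dict

# Hypothetical prices in USD per 1,000 tokens (placeholders, not real pricing).
PRICE_PER_1K = {"input": 0.001, "output": 0.004}


def estimate_cost(input_tokens: int, output_tokens: int,
                  prices: Dict[str, float] = PRICE_PER_1K) -> float:
    """Estimate cost in USD from token counts and per-1K-token prices."""
    return (input_tokens / 1000) * prices["input"] + (output_tokens / 1000) * prices["output"]


# Hand-written sample usage (not real measurements).
sample_usage = {
    "tenant_a": {"input_tokens": 1000, "output_tokens": 500},
    "tenant_b": {"input_tokens": 2000, "output_tokens": 1000},
}

costs = {
    tenant: estimate_cost(usage["input_tokens"], usage["output_tokens"])
    for tenant, usage in sample_usage.items()
}
total = sum(costs.values())

for tenant, cost in costs.items():
    print(f"{tenant}: ${cost:.6f} ({cost / total:.1%} of total)")
```

With real data, `sample_usage` could be filled from each successful entry in `tenant_responses`, or from CloudWatch token sums for each tenant's AIP.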

### 📈 Before/After Comparison Summary

Let's summarize the before/after demonstration that proves multi-tenant isolation works through LiteLLM gateway.

In [None]:
# Before/After comparison summary
print("📊 BEFORE/AFTER COMPARISON: Multi-Tenant Isolation Proof")
print("="*80)

print("\n🔵 BEFORE (Baseline Check):")
print("   Status: AIPs created and configured")
print("   LiteLLM: Gateway configurations ready")
print("   CloudWatch: No metrics (AIPs unused)")
print("   Tenant Isolation: Ready but unverified")

print("\n🟢 AFTER (Post-LiteLLM Requests):")
print("   LiteLLM Requests: ✅ Routed through gateway")
print("   Tenant Responses: ✅ Generated via AIPs")
print("   CloudWatch Metrics: ✅ Separate dimensions per tenant")
print("   Cost Tracking: ✅ Per-tenant billing enabled")

print("\n📊 WHAT WE PROVED:")
print("   1. ✅ LiteLLM gateway routing preserves AIP isolation")
print("   2. ✅ Each tenant's usage tracked separately in CloudWatch")
print("   3. ✅ Configuration-driven approach scales to N tenants")
print("   4. ✅ Gateway abstraction + AIP = production-ready multi-tenancy")
print("   5. ✅ Accurate cost allocation despite abstraction layer")

print("\n💡 THE WINNING COMBINATION:")
print("   LiteLLM:    Simplifies multi-tenant routing code")
print("   +")
print("   AIP:        Ensures CloudWatch tracking isolation")
print("   =")
print("   Production: Scalable, maintainable, trackable multi-tenant AI")

print("="*80)

## Lab 05 Summary

🎉 **Congratulations!** You've successfully implemented Application Inference Profiles with a LiteLLM gateway for a multi-tenant AI platform.

### What You Accomplished:

1. **✅ Created Application Inference Profiles** for two tenants using boto3
2. **✅ Configured LiteLLM gateway** to route through tenant-specific AIPs
3. **✅ Checked CloudWatch metrics BEFORE** gateway requests (baseline/empty state)
4. **✅ Routed tenant requests through LiteLLM** with unified code interface
5. **✅ Generated tenant-specific responses** with isolated usage tracking
6. **✅ Checked CloudWatch metrics AFTER** gateway requests (per-tenant usage visible!)
7. **✅ Analyzed cost allocation** and billing breakdown per tenant
8. **✅ Proved multi-tenant isolation** works through gateway abstraction

### Key Takeaways:

- **Gateway Abstraction**: LiteLLM simplifies multi-tenant routing with a single code path
- **Configuration-Driven**: Add new tenants by updating config, not code
- **Tenant Isolation**: AIP ensures CloudWatch tracking remains separate
- **Cost Allocation**: Built-in token counting enables accurate billing
- **Production Ready**: Gateway pattern scales to many tenants efficiently
- **Framework Agnostic**: LiteLLM works across AWS, OpenAI, Azure, and more

### What You Learned:

```python
# Traditional approach (Lab 03 - Direct boto3)
if tenant_id == "tenant_a":
    response = bedrock_runtime.invoke_model(modelId=tenant_a_aip_arn, ...)
elif tenant_id == "tenant_b":
    response = bedrock_runtime.invoke_model(modelId=tenant_b_aip_arn, ...)
# Adding new tenants requires code changes

# Gateway approach (Lab 05 - LiteLLM + AIP)
response = litellm.completion(
    model=litellm_config[tenant_id]["model"],  # Config-driven routing
    messages=[{"role": "user", "content": message}]
)
# Adding new tenants only requires config changes
```

### The Before/After Pattern:

- **BEFORE Check**: Empty CloudWatch metrics (AIPs ready but unused)
- **AFTER Check**: Populated metrics showing exact per-tenant usage via LiteLLM
- **Proof**: Multi-tenant isolation works through gateway abstraction!

### Architecture Benefits:

| Aspect | Value |
|--------|-------|
| **Code Simplicity** | Single routing function for all tenants |
| **Scalability** | Config-driven tenant management |
| **Observability** | Per-tenant CloudWatch metrics automatic |
| **Cost Tracking** | Built-in token counting and billing |
| **Multi-Provider** | Easy to add OpenAI, Azure, etc. |
| **Maintainability** | Less code to maintain and test |

### Lab Comparison:

| Lab | Focus | Approach | Best For |
|-----|-------|----------|----------|
| **Lab 03** | Direct boto3 | Manual routing logic | Full AWS Bedrock control |
| **Lab 05** | LiteLLM gateway | Config-driven routing | Multi-tenant scale, multi-provider |
| **Lab 06** | LangChain | Chain-based apps | Complex workflows |
| **Lab 07** | LangGraph | Graph workflows | Multi-step processes |

### Next Steps:

Continue your AIP journey with framework-specific integrations:
- **Lab 06 (Optional)**: LangChain - Chain-based applications with tenant awareness
- **Lab 07 (Optional)**: LangGraph - Graph workflows for complex multi-step processes
- **Lab 08**: AgentCore Runtime - Production deployment with full observability

**Ready to explore chain-based applications?** → [Continue to Lab 07: LangChain (Optional)](lab-07-aip-langchain-optional.ipynb)

**Already deployed with AgentCore?** → [Back to Lab 05: AgentCore Runtime](lab-05-agentcore-runtime-deployment.ipynb)