# AI Model Gateway Demo - Deployment Notebook

This notebook orchestrates the full deployment of the AI Model Gateway infrastructure and agents.

## Features:
- **Gold Agent**: Unlimited access, no rate limiting
- **Bronze Agent**: Token rate limited (1000 tokens per minute)
- **Token Metrics**: All requests emit token usage metrics to Application Insights

## Steps:
1. Deploy `infra/hub` - API Management with Gold/Bronze products and Azure OpenAI resources
2. Deploy `infra/spoke` - AI Foundry project with Gold/Bronze Model Gateway connections
3. Generate `.env` file from Terraform outputs
4. Deploy both agents, then test them and compare their rate-limiting behavior (Steps 4-6 below)

## Configuration

In [None]:
# Install required packages (skip if already installed)
%pip install -q "azure-ai-projects>=2.0.0b3" "azure-identity>=1.25.1"

In [None]:
import json
import os
import subprocess

LOCATION = "swedencentral"
HUB_DIR = "infra/hub"
SPOKE_DIR = "infra/spoke"
# Fill in your own subscription and tenant IDs before running the cells below
os.environ["ARM_SUBSCRIPTION_ID"] = ""
os.environ["ARM_TENANT_ID"] = ""
os.environ["AZURE_TENANT_ID"] = ""
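
Because the three IDs above start out empty, a forgotten value would only surface later as a less obvious Terraform or SDK error. A minimal sketch of an up-front check follows; `missing_settings` and `REQUIRED_SETTINGS` are hypothetical names that are not part of this notebook.

```python
from typing import Iterable, Mapping

# Hypothetical constant: the variables the configuration cell is expected to fill in
REQUIRED_SETTINGS = ("ARM_SUBSCRIPTION_ID", "ARM_TENANT_ID", "AZURE_TENANT_ID")


def missing_settings(env: Mapping[str, str], names: Iterable[str] = REQUIRED_SETTINGS) -> list[str]:
    """Return the names that are unset or blank in the given mapping."""
    return [name for name in names if not env.get(name, "").strip()]


# Example with a plain dict; in the notebook you would pass os.environ instead
example_env = {"ARM_SUBSCRIPTION_ID": "", "ARM_TENANT_ID": "placeholder-tenant"}
print(missing_settings(example_env))
```

Raising an error when the returned list is non-empty would stop the notebook before any Terraform command runs.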

## Helper Functions

In [None]:
def run_terraform(working_dir: str, command: list[str], capture_output: bool = False) -> subprocess.CompletedProcess:
    """Run a terraform command in the specified directory."""
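    full_command = ["terraform"] + command
    # Mask -var values in the echoed command so subscription keys are not printed
    shown = [
        arg[: arg.index("=", 5) + 1] + "***" if arg.startswith("-var=") and "=" in arg[5:] else arg
        for arg in full_command
    ]
    print(f"📂 {working_dir}")
    print(f"🔧 Running: {' '.join(shown)}")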

    result = subprocess.run(
        full_command,
        cwd=working_dir,
        capture_output=capture_output,
        text=True
    )

    if result.returncode != 0 and not capture_output:
        raise Exception(f"Terraform command failed with exit code {result.returncode}")

    return result


def get_terraform_output(working_dir: str) -> dict:
    """Get terraform outputs as a dictionary."""
    result = run_terraform(working_dir, ["output", "-json"], capture_output=True)
    if result.returncode != 0:
        raise Exception(f"Failed to get outputs: {result.stderr}")

    outputs = json.loads(result.stdout)
    # Extract just the values
    return {k: v["value"] for k, v in outputs.items()}
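
For reference, `terraform output -json` wraps every output in an object with `sensitive`, `type`, and `value` fields, which is why `get_terraform_output` unwraps `value`. The sketch below runs on a hand-written sample rather than a real deployment. The output names and values are placeholders, and `split_outputs` is a hypothetical helper that also collects the names Terraform marks as sensitive.

```python
import json

# Hand-written sample in the shape produced by `terraform output -json` (placeholder values)
SAMPLE_OUTPUT = """
{
  "resource_group_name": {"sensitive": false, "type": "string", "value": "rg-demo"},
  "gold_subscription_key": {"sensitive": true, "type": "string", "value": "placeholder-key"}
}
"""


def split_outputs(raw_json: str) -> tuple[dict, set]:
    """Return (values by name, names flagged as sensitive by Terraform)."""
    outputs = json.loads(raw_json)
    values = {name: body["value"] for name, body in outputs.items()}
    sensitive = {name for name, body in outputs.items() if body.get("sensitive")}
    return values, sensitive


values, sensitive = split_outputs(SAMPLE_OUTPUT)
for name, value in values.items():
    print(f"{name}: {'***REDACTED***' if name in sensitive else value}")
```

Redacting by the `sensitive` flag rather than by key name only works if the Terraform configuration marks those outputs as sensitive, which this notebook does not show.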

## Step 1: Deploy Hub Infrastructure

This deploys:
- Azure API Management with Gold and Bronze products
- Azure OpenAI with model deployments
- Application Insights

In [None]:
# Initialize Hub
run_terraform(HUB_DIR, ["init", "-upgrade"])

In [None]:
# Apply Hub deployment
run_terraform(HUB_DIR, [
    "apply",
    "-auto-approve",
    f"-var=location={LOCATION}"
])

In [None]:
# Get Hub outputs
hub_outputs = get_terraform_output(HUB_DIR)
print("\n✅ Hub outputs:")
for key, value in hub_outputs.items():
    if "key" in key.lower():
        print(f"   {key}: ***REDACTED***")
    else:
        print(f"   {key}: {value}")

## Step 2: Deploy Spoke Infrastructure

This deploys:
- AI Foundry Account and Project
- Gold and Bronze Model Gateway connections to APIM

In [None]:
# Initialize Spoke
run_terraform(SPOKE_DIR, ["init", "-upgrade"])

In [None]:
# Prepare spoke variables from hub outputs: one connection object per tier
model_gateway_gold_var = json.dumps({
    "url": hub_outputs["azure_openai_endpoint"],
    "api_key": hub_outputs["gold_subscription_key"],
    "metadata": hub_outputs["model_gateway_metadata"]
})

model_gateway_bronze_var = json.dumps({
    "url": hub_outputs["azure_openai_endpoint"],
    "api_key": hub_outputs["bronze_subscription_key"],
    "metadata": hub_outputs["model_gateway_metadata"]
})

# Apply Spoke deployment with Gold and Bronze connections
run_terraform(SPOKE_DIR, [
    "apply",
    "-auto-approve",
    f"-var=resource_group_name={hub_outputs['resource_group_name']}",
    f"-var=location={LOCATION}",
    f"-var=model_gateway_gold={model_gateway_gold_var}",
    f"-var=model_gateway_bronze={model_gateway_bronze_var}"
])
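
Passing the connection objects with `-var` places the subscription keys on the command line. One alternative is to write them to a JSON variable file and pass it with Terraform's `-var-file` flag. This is a sketch under assumptions: `write_tfvars_json` and the file name `spoke.tfvars.json` are hypothetical, and the values below are placeholders standing in for `hub_outputs`.

```python
import json
import os
import tempfile


def write_tfvars_json(directory: str, variables: dict, filename: str = "spoke.tfvars.json") -> str:
    """Write variables as a Terraform JSON variable file and return its path."""
    path = os.path.join(directory, filename)
    # Request owner-only permissions at creation time, since the file holds API keys
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as f:
        json.dump(variables, f, indent=2)
    return path


# Placeholder values; a temporary directory stands in for the spoke directory
with tempfile.TemporaryDirectory() as tmp:
    path = write_tfvars_json(tmp, {
        "location": "swedencentral",
        "model_gateway_gold": {"url": "https://example.invalid", "api_key": "placeholder-key"},
    })
    with open(path) as f:
        written = json.load(f)
    var_file_arg = f"-var-file={os.path.basename(path)}"

print(var_file_arg)
```

In a JSON variable file, object-typed variables are written as nested JSON objects, so no `json.dumps` string is needed. The file still contains secrets and should be deleted or excluded from version control after use.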

In [None]:
# Get Spoke outputs
spoke_outputs = get_terraform_output(SPOKE_DIR)
print("\n✅ Spoke outputs:")
for key, value in spoke_outputs.items():
    print(f"   {key}: {value}")

## Step 3: Generate .env File

Create the `.env` file required by `deploy_agent.py` and `test_agent.py`.

In [None]:
# Generate .env content
env_content = f"""# Auto-generated by deployment notebook - do not edit manually

# Azure Configuration
AZURE_SUBSCRIPTION_ID={os.environ['ARM_SUBSCRIPTION_ID']}
AZURE_RESOURCE_GROUP={hub_outputs['resource_group_name']}

# AI Foundry Configuration
AZURE_AI_ACCOUNT_NAME={spoke_outputs['cognitive_account_name']}
AZURE_AI_PROJECT_NAME={spoke_outputs['project_name']}
AZURE_AI_PROJECT_ENDPOINT={spoke_outputs['project_endpoint']}

# Agent Configuration - Gold (no rate limiting)
AGENT_NAME_GOLD=GoldAgent
AGENT_MODEL_GOLD=model-gateway-gold/gpt-4o-mini

# Agent Configuration - Bronze (token rate limited)
AGENT_NAME_BRONZE=BronzeAgent
AGENT_MODEL_BRONZE=model-gateway-bronze/gpt-4o-mini
"""

# Write .env file
with open(".env", "w") as f:
    f.write(env_content)

print("✅ Generated .env file:")
print("-" * 40)
# Echo the settings, skipping comments and blank lines (the file contains no secrets)
for line in env_content.split("\n"):
    if line and not line.startswith("#"):
        print(f"   {line}")
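
The scripts that consume `.env` are not shown in this notebook. Assuming they only need plain `KEY=VALUE` pairs, the sketch below reads such a file back using the standard library alone. `parse_env` is a hypothetical helper, and it deliberately ignores quoting and `export` prefixes that a full dotenv parser would handle.

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so values may themselves contain "="
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


sample = """# Agent Configuration - Gold (no rate limiting)
AGENT_NAME_GOLD=GoldAgent
AGENT_MODEL_GOLD=model-gateway-gold/gpt-4o-mini
"""
settings = parse_env(sample)
print(settings["AGENT_MODEL_GOLD"])
```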

## Step 4: Deploy Gold and Bronze Agents

Create both agents in AI Foundry:
- **Gold Agent**: Uses `model-gateway-gold` connection (no rate limiting)
- **Bronze Agent**: Uses `model-gateway-bronze` connection (1000 tokens/minute limit)
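
The model strings used in this notebook appear to follow a `<connection name>/<deployment name>` pattern. This is inferred from the values above, not from a documented contract. A small sketch that splits and validates such a string, with `split_model_reference` as a hypothetical helper:

```python
def split_model_reference(model: str) -> tuple[str, str]:
    """Split 'connection/deployment' into its two parts, rejecting malformed input."""
    connection, separator, deployment = model.partition("/")
    if not separator or not connection or not deployment:
        raise ValueError(f"Expected '<connection>/<deployment>', got {model!r}")
    return connection, deployment


connection, deployment = split_model_reference("model-gateway-bronze/gpt-4o-mini")
print(connection, deployment)
```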

In [None]:
from azure.ai.projects import AIProjectClient
from azure.ai.projects.models import PromptAgentDefinition
from azure.identity import DefaultAzureCredential

# Agent configurations
AGENT_NAME_GOLD = "GoldAgent"
AGENT_MODEL_GOLD = "model-gateway-gold/gpt-4o-mini"

AGENT_NAME_BRONZE = "BronzeAgent"
AGENT_MODEL_BRONZE = "model-gateway-bronze/gpt-4o-mini"

print("🚀 Deploying Agents via AI Gateway")
print("=" * 60)

# Initialize the project client
print("\n📡 Connecting to Azure AI Foundry project...")
print(f"   Project: {spoke_outputs['project_name']}")
print(f"   Endpoint: {spoke_outputs['project_endpoint']}")

project_client = AIProjectClient(
    endpoint=spoke_outputs['project_endpoint'],
    credential=DefaultAzureCredential()
)
print("✅ Client initialized")

with project_client:
    # Create Gold Agent
    print("\n🥇 Creating Gold Agent (no rate limiting)...")
    print(f"   Model: {AGENT_MODEL_GOLD}")

    gold_agent = project_client.agents.create_version(
        agent_name=AGENT_NAME_GOLD,
        definition=PromptAgentDefinition(
            model=AGENT_MODEL_GOLD,
            instructions="You are a Gold tier AI assistant with unlimited access. "
            "All your requests are routed through APIM with no rate limiting. "
            "Token metrics are emitted to Application Insights for monitoring.",
        ),
    )
    print(f"   ✅ Gold Agent created: {gold_agent.name} (version {gold_agent.version})")

    # Create Bronze Agent
    print("\n🥉 Creating Bronze Agent (token rate limited)...")
    print(f"   Model: {AGENT_MODEL_BRONZE}")

    bronze_agent = project_client.agents.create_version(
        agent_name=AGENT_NAME_BRONZE,
        definition=PromptAgentDefinition(
            model=AGENT_MODEL_BRONZE,
            instructions="You are a Bronze tier AI assistant with rate limited access. "
            "Your requests are limited to 1000 tokens per minute through APIM. "
            "Token metrics are emitted to Application Insights for monitoring.",
        ),
    )
    print(f"   ✅ Bronze Agent created: {bronze_agent.name} (version {bronze_agent.version})")

print("\n" + "=" * 60)
print("🎉 Agent Deployment Complete!")
print("=" * 60)
print("\n📋 Agent Summary:")
print(f"   🥇 Gold Agent: {gold_agent.name} v{gold_agent.version} - No rate limiting")
print(f"   🥉 Bronze Agent: {bronze_agent.name} v{bronze_agent.version} - 1000 tokens/min limit")

## Step 5: Test Agents

Test both Gold and Bronze agents through the APIM gateway to verify:
- Token metrics are being emitted to Application Insights
- Gold agent has no rate limits
- Bronze agent respects the 1000 tokens/minute limit

In [None]:
print("🧪 Testing APIM Gateway Agents")
print("=" * 60)

project_client = AIProjectClient(
    endpoint=spoke_outputs['project_endpoint'],
    credential=DefaultAzureCredential()
)

def test_agent(agent_name, agent_version, tier_name, tier_emoji):
    """Test an agent and display results"""
    print(f"\n{tier_emoji} Testing {tier_name} Agent: {agent_name} (v{agent_version})")
    print("-" * 50)

    test_message = f"Hello! You are the {tier_name} agent. Please tell me a very short joke about API rate limits."
    print(f"💬 User: {test_message}")

    with project_client.get_openai_client() as openai_client:
        # Create a conversation with initial user message
        conversation = openai_client.conversations.create(
            items=[{"type": "message", "role": "user", "content": test_message}],
        )

        # Get response from agent
        response = openai_client.responses.create(
            conversation=conversation.id,
            extra_body={"agent": {"name": agent_name, "type": "agent_reference"}},
            input="",
        )

        print(f"\n🤖 {tier_name} Agent Response:")
        print(f"   {response.output_text}")

    return response

with project_client:
    # Test Gold Agent
    gold_response = test_agent(gold_agent.name, gold_agent.version, "Gold", "🥇")

    # Test Bronze Agent
    bronze_response = test_agent(bronze_agent.name, bronze_agent.version, "Bronze", "🥉")

print("\n" + "=" * 60)
print("✅ Tests Complete!")
print("=" * 60)
print("\n📊 Request Flow (both agents):")
print("   1. Python SDK → Azure AI Foundry Project")
print("   2. Foundry → model-gateway-{gold|bronze} connection")
print(f"   3. Connection → APIM ({hub_outputs['apim_gateway_url']})")
print("   4. APIM applies product-specific policies:")
print("      - Gold: No rate limiting")
print("      - Bronze: 1000 tokens/minute limit (llm-token-limit)")
print("   5. Token metrics emitted to Application Insights (azure-openai-emit-token-metric)")
print("   6. APIM → Azure OpenAI (via managed identity)")
print("   7. Response flows back through APIM")

## Step 6: Test Rate Limiting - Gold vs Bronze

Make multiple rapid requests to both agents to demonstrate:
- **Gold Agent**: No rate limiting - all requests succeed
- **Bronze Agent**: Token rate limited - should get 429 errors after exceeding 1000 tokens/minute
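
To build intuition for when the Bronze agent starts returning 429s, the sketch below simulates a toy fixed-window token counter. This is an assumption made for illustration and is not APIM's actual `llm-token-limit` algorithm. The figure of 400 tokens per request is also illustrative, and `FixedWindowTokenLimiter` and `simulate` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class FixedWindowTokenLimiter:
    """Toy model: admit requests until the window's token count has reached the limit."""
    limit: int = 1000
    window_seconds: float = 60.0
    window_start: float = 0.0
    used: int = 0

    def admit(self, now: float, tokens: int) -> bool:
        # Start a fresh window once the current one has elapsed
        if now - self.window_start >= self.window_seconds:
            self.window_start = now
            self.used = 0
        # Reject (the 429 case) once the budget is already spent
        if self.used >= self.limit:
            return False
        self.used += tokens
        return True


def simulate(num_requests: int, tokens_per_request: int, interval_seconds: float) -> list[bool]:
    """Return one admit/reject flag per request, sent at a fixed interval."""
    limiter = FixedWindowTokenLimiter()
    return [limiter.admit(i * interval_seconds, tokens_per_request) for i in range(num_requests)]


results = simulate(num_requests=10, tokens_per_request=400, interval_seconds=1.0)
print(results.count(True), "admitted,", results.count(False), "rejected")
```

Under these assumptions the first three requests pass and the remaining seven are rejected. The real split depends on actual token counts and on how the gateway measures its window.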

In [None]:
import time

print("🧪 Testing Rate Limiting - Gold vs Bronze")
print("=" * 60)

# Long prompt to consume more tokens
long_prompt = """Please provide a detailed explanation of how API gateways work in enterprise architectures.
Include information about load balancing, rate limiting, authentication, caching, and monitoring.
Make your response comprehensive and detailed to demonstrate token consumption."""

def test_rate_limiting(agent_name, tier_name, tier_emoji, num_requests=10):
    """Test rate limiting on an agent"""
    print(f"\n{tier_emoji} Testing {tier_name} Agent Rate Limiting")
    print("-" * 50)
    print(f"Making {num_requests} requests to test token limit...")

    successful_requests = 0
    rate_limited_requests = 0

    # Create a fresh client for each test
    client = AIProjectClient(
        endpoint=spoke_outputs['project_endpoint'],
        credential=DefaultAzureCredential()
    )

    with client:
        with client.get_openai_client() as openai_client:
            for i in range(num_requests):
                try:
                    print(f"   📤 Request {i+1}/{num_requests}...", end=" ")

                    conversation = openai_client.conversations.create(
                        items=[{"type": "message", "role": "user", "content": long_prompt}],
                    )

                    response = openai_client.responses.create(
                        conversation=conversation.id,
                        extra_body={"agent": {"name": agent_name, "type": "agent_reference"}},
                        input="",
                    )

                    successful_requests += 1
                    print(f"✅ Success ({len(response.output_text)} chars)")

                except Exception as e:
                    if "429" in str(e) or "Too Many Requests" in str(e):
                        rate_limited_requests += 1
                        print("🚫 Rate Limited (429)")
                    else:
                        print(f"❌ Error: {e}")

                # Small delay between requests
                time.sleep(1)

    return successful_requests, rate_limited_requests

# Test Gold Agent (no rate limiting)
gold_success, gold_limited = test_rate_limiting(gold_agent.name, "Gold", "🥇", num_requests=10)

# Test Bronze Agent (1000 tokens/minute limit)
bronze_success, bronze_limited = test_rate_limiting(bronze_agent.name, "Bronze", "🥉", num_requests=10)

print("\n" + "=" * 60)
print("📊 Rate Limiting Test Results:")
print("=" * 60)
print("\n🥇 Gold Agent (no rate limiting):")
print(f"   ✅ Successful requests: {gold_success}")
print(f"   🚫 Rate limited requests: {gold_limited}")

print("\n🥉 Bronze Agent (1000 tokens/minute limit):")
print(f"   ✅ Successful requests: {bronze_success}")
print(f"   🚫 Rate limited requests: {bronze_limited}")

print("\n💡 Expected behavior:")
print("   - Gold: All requests should succeed (no rate limiting)")
print("   - Bronze: Some requests should be rate limited after token quota exceeded")
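
The loop above detects rate limiting by searching the exception message for `429`. A somewhat more robust sketch checks a numeric status code first and falls back to the message only when none is available. It assumes the raised exception exposes `status_code`, as the OpenAI Python SDK's status errors do. `is_rate_limited` and `FakeStatusError` are hypothetical names used for illustration.

```python
def is_rate_limited(error: Exception) -> bool:
    """Return True if the error looks like an HTTP 429 response."""
    status = getattr(error, "status_code", None)
    if status is None:
        # Some clients attach the HTTP response object instead of a bare code
        status = getattr(getattr(error, "response", None), "status_code", None)
    if status is not None:
        return status == 429
    # Fallback for errors that carry the status only in their message
    message = str(error)
    return "429" in message or "Too Many Requests" in message


class FakeStatusError(Exception):
    """Stand-in for an SDK error, used only to exercise the helper."""
    def __init__(self, message: str, status_code: int):
        super().__init__(message)
        self.status_code = status_code


print(is_rate_limited(FakeStatusError("quota exceeded", 429)))
print(is_rate_limited(FakeStatusError("server error on port 4290", 500)))
```

Checking the status code first avoids counting unrelated errors whose message happens to contain the digits `429`.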

## Cleanup (Optional)

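Destroy all infrastructure when you are done testing. The spoke is destroyed before the hub because it is deployed into the hub's resource group.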

In [None]:
# ⚠️ DANGER: Uncomment to destroy all resources
# print("🗑️ Destroying Spoke...")
# run_terraform(SPOKE_DIR, ["destroy", "-auto-approve",
#     f"-var=resource_group_name={hub_outputs['resource_group_name']}",
#     f"-var=location={LOCATION}",
#     f"-var=model_gateway_gold={model_gateway_gold_var}",
#     f"-var=model_gateway_bronze={model_gateway_bronze_var}"
# ])

# print("🗑️ Destroying Hub...")
# run_terraform(HUB_DIR, ["destroy", "-auto-approve",
#     f"-var=location={LOCATION}"
# ])

# print("✅ All resources destroyed")