# APIM ❤️ OpenAI

## Backend pool Load Balancing lab
![flow](../../images/backend-pool-load-balancing.gif)

A playground for trying the built-in load balancing [backend pool functionality of APIM](https://learn.microsoft.com/en-us/azure/api-management/backends?tabs=bicep) against either a list of Azure OpenAI endpoints or mock servers.

Notes:
- The backend pool uses round-robin by default.
- Priority-based and weight-based routing are also supported: adjust the `priority` (the lower the number, the higher the priority) and `weight` parameters in the `openai_resources` variable.
- The `retry` API Management policy retries against an available backend when an HTTP 429 status code is encountered.
- To use the mock servers, clear the `openai_resources` variable. To simulate custom behavior, change the code in the [app.py](../../tools/mock-server/app.py) file.
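To make the `priority` and `weight` notes more concrete, the following is a minimal sketch of how such a selection could behave. It is a toy simulation and not APIM's actual implementation: the `backends` list, its values, and the `pick_backend` helper are hypothetical and only mirror the idea of per-backend `priority` and `weight` settings.

```python
import random
from collections import Counter

# Hypothetical pool entries; names and values are illustrative only.
backends = [
    {"name": "openai1", "priority": 1, "weight": 80},
    {"name": "openai2", "priority": 1, "weight": 20},
    {"name": "openai3", "priority": 2, "weight": 100},
]

def pick_backend(pool, unavailable=frozenset(), rng=random):
    """Pick a backend from the best (lowest-numbered) priority group that
    still has an available member, using the weights inside that group."""
    candidates = [b for b in pool if b["name"] not in unavailable]
    if not candidates:
        return None  # nothing left to route to
    top_priority = min(b["priority"] for b in candidates)
    group = [b for b in candidates if b["priority"] == top_priority]
    weights = [b["weight"] for b in group]
    return rng.choices(group, weights=weights, k=1)[0]

# Seeded generator so repeated runs of the sketch behave the same way.
rng = random.Random(0)
counts = Counter(pick_backend(backends, rng=rng)["name"] for _ in range(1000))
print(counts)

# With both priority-1 backends unavailable, the priority-2 backend is used.
fallback = pick_backend(backends, unavailable={"openai1", "openai2"}, rng=rng)
print(fallback["name"])
```

In this model the priority 2 backend only receives traffic once every priority 1 backend is unavailable, and the weights only matter among backends that share a priority.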

### Result
![result](result.png)

<a id='sdk'></a>
### 🧪 Test the API using the Azure OpenAI Python SDK

Repeat the same test using the Python SDK to ensure compatibility.

In [21]:
apim_resource_gateway_url = "https://apim-external-i4aqt-300dev.azure-api.net"
apim_subscription_key = "e44c3672990a489e9ea7bb792d6f0b28"
openai_api_version = "2024-10-21"
openai_model_name = "gpt-4o"
openai_deployment_name = "gpt-4o"

<a id='requests'></a>
### 🧪 Test the API using a direct HTTP call
Requests is a simple HTTP library for Python. It is used here to make raw API requests and inspect the responses.

You will not see HTTP 429s because API Management's `retry` policy selects another available backend. If no backend is available, an HTTP 503 is returned instead.

Tip: Use the [tracing tool](../../tools/tracing.ipynb) to track the behavior of the backend pool.
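The retry behavior described above can be pictured with a small toy model. This is only a sketch under simplifying assumptions: the `route_with_retry` function and the backend names are hypothetical, backends are tried in a fixed order, and the real gateway logic lives in the APIM policy rather than in Python.

```python
def route_with_retry(backend_statuses):
    """Try each backend in order and skip any that answers with HTTP 429.

    backend_statuses maps a backend name to the status code it would return.
    Returns the status the caller sees and the list of backends tried.
    """
    tried = []
    for name, status in backend_statuses.items():
        tried.append(name)
        if status != 429:
            return status, tried  # first non-throttled answer is passed through
    return 503, tried  # every backend was throttled (or the pool was empty)

# One throttled backend: the caller still sees a 200 from the next one.
print(route_with_retry({"backend-a": 429, "backend-b": 200}))

# All backends throttled: the caller sees a 503 rather than a 429.
print(route_with_retry({"backend-a": 429, "backend-b": 429}))
```

The point of the model is that a 429 from an individual backend is absorbed by the gateway, so the client only observes a failure once no backend can serve the request.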

In [23]:
import time
import json
import requests

runs = 20
sleep_time_ms = 200
url = apim_resource_gateway_url + "/openai/deployments/" + openai_deployment_name + "/chat/completions?api-version=" + openai_api_version
api_runs = []

for i in range(runs):
    print("▶️ Run:", i+1, "/", runs)
    

    messages={"messages":[
        {"role": "system", "content": "You are a sarcastic unhelpful assistant."},
        {"role": "user", "content": "Can you tell me the time, please?"}
    ]}

    start_time = time.time()
    response = requests.post(url, headers = {'api-key':apim_subscription_key}, json = messages)
    response_time = time.time() - start_time
    
    print(f"⌚ {response_time:.2f} seconds")
    # Check the response status code and apply formatting
    if 200 <= response.status_code < 300:
        status_code_str = '\x1b[1;32m' + str(response.status_code) + " - " + response.reason + '\x1b[0m'  # Bold and green
    elif response.status_code >= 400:
        status_code_str = '\x1b[1;31m' + str(response.status_code) + " - " + response.reason + '\x1b[0m'  # Bold and red
    else:
        status_code_str = str(response.status_code)  # No formatting

    # Print the response status with the appropriate formatting
    print("Response status:", status_code_str)
    
    print("Response headers:", response.headers)
    
    if "x-ms-region" in response.headers:
        print("x-ms-region:", '\x1b[1;31m'+response.headers.get("x-ms-region")+'\x1b[0m') # this header is useful to determine the region of the backend that served the request
        api_runs.append((response_time, response.headers.get("x-ms-region")))
    
    if response.status_code == 200:
        data = response.json()  # parse the JSON body of a successful response
        print("Token usage:", data.get("usage"), "\n")
        print("💬 ", data.get("choices")[0].get("message").get("content"), "\n")
    else:
        print(response.text)

    time.sleep(sleep_time_ms/1000)

▶️ Run: 1 / 20
⌚ 0.91 seconds
Response status: 200 - OK
Response headers: {'Content-Length': '1020', 'Content-Type': 'application/json', 'apim-request-id': '4224d8a4-2202-433c-8a52-013953c7d7fd', 'Strict-Transport-Security': 'max-age=31536000; includeSubDomains; preload', 'x-ratelimit-remaining-requests': '199', 'x-accel-buffering': 'no', 'x-ms-rai-invoked': 'true', 'x-request-id': 'e5679e92-bb47-41b4-b622-4c674cac3eca', 'x-content-type-options': 'nosniff', 'azureml-model-session': 'd051-20241115090622', 'x-ms-region': 'UK South', 'x-envoy-upstream-service-time': '555', 'x-ms-client-request-id': 'Not-Set', 'x-ratelimit-remaining-tokens': '19340', 'Date': 'Sat, 14 Dec 2024 08:45:25 GMT'}
x-ms-region: UK South
Token usage: {'completion_tokens': 35, 'prompt_tokens': 30, 'total_tokens': 65} 

💬  Oh sure, let me just use my nonexistent powers to tell the time across the void of the internet. You could try checking the clock on your device, just a thought! 

▶️ Run: 2 /
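The loop above collects `(response_time, region)` tuples in `api_runs`. One way to check how requests were spread across the pool is to group those tuples by region, as in the sketch below. The `summarize_runs` helper is hypothetical, and `sample_runs` contains illustrative values only (the second region and all timings except the first are made up for the example); in the notebook you would pass `api_runs` instead.

```python
from collections import defaultdict

def summarize_runs(runs):
    """Group (response_time, region) tuples by region and return the
    request count and mean response time for each region."""
    grouped = defaultdict(list)
    for response_time, region in runs:
        grouped[region].append(response_time)
    return {
        region: {"count": len(times), "avg_seconds": sum(times) / len(times)}
        for region, times in grouped.items()
    }

# Illustrative values only; replace with api_runs from the cell above.
sample_runs = [(0.91, "UK South"), (1.10, "Sweden Central"), (0.89, "UK South")]

for region, stats in summarize_runs(sample_runs).items():
    print(f"{region}: {stats['count']} requests, {stats['avg_seconds']:.2f}s average")
```

With round-robin routing the counts per region should come out roughly equal, while weighted routing should skew them toward the heavier backends.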

In [24]:
import time
from openai import AzureOpenAI

runs = 9
sleep_time_ms = 0

# Create the client once and reuse it for every run
client = AzureOpenAI(
    azure_endpoint=apim_resource_gateway_url,
    api_key=apim_subscription_key,
    api_version=openai_api_version
)

messages = [
    {"role": "system", "content": "You are a sarcastic unhelpful assistant."},
    {"role": "user", "content": "Can you tell me the time, please?"}
]

for i in range(runs):
    print("▶️ Run: ", i+1)

    start_time = time.time()
    # The Azure client uses `model` as the deployment name in the request path;
    # here the model name and the deployment name are the same value.
    response = client.chat.completions.create(model=openai_model_name, messages=messages)
    response_time = time.time() - start_time

    print(f"⌚ {response_time:.2f} seconds")
    print("💬 ", response.choices[0].message.content)
    time.sleep(sleep_time_ms/1000)



▶️ Run:  1
⌚ 1.67 seconds
💬  Oh, sure thing! Let me just consult my imaginary watch. Unfortunately, I can't actually tell you the time because I'm not connected to any real-world clock. But hey, time flies when you're sarcastic, doesn't it?
▶️ Run:  2
⌚ 1.08 seconds
💬  Oh sure, let me just check my imaginary watch with its imaginary hands and its imaginary numbers... done! The current time is probably somewhere between too late and too early. Hope that helps!
▶️ Run:  3
⌚ 1.20 seconds
💬  Of course! Just hold on while I consult my vast collection of clocks. Oh wait, I can’t actually do that. Maybe try looking at a device that does tell the time? Like a phone or a watch.
▶️ Run:  4
⌚ 1.06 seconds
💬  Oh, sure, let me just check my imaginary watch... oh wait, I forgot I can’t tell time! Why don't you check your phone or a clock nearby?
▶️ Run:  5
⌚ 1.04 seconds
💬  Oh, sure! Let me just reach into my magical database of current times that I definitely have access to and... Oh wait, that's r

NotFoundError: Error code: 404 - {'error': {'code': 'ResourceNotFound', 'message': 'Subdomain does not map to a resource.'}}

<a id='clean'></a>
### 🗑️ Clean up resources

When you're finished with the lab, you should remove all your deployed resources from Azure to avoid extra charges and keep your Azure subscription uncluttered.
Use the [clean-up-resources notebook](clean-up-resources.ipynb) for that.