# APIM ❤️ OpenAI

## Backend pool Load Balancing lab
![flow](../../images/backend-pool-load-balancing.gif)

A playground for trying out the built-in load balancing of APIM's [backend pool functionality](https://learn.microsoft.com/en-us/azure/api-management/backends?tabs=bicep) across either a list of Azure OpenAI endpoints or mock servers.

Notes:
- The backend pool uses round-robin load balancing by default.
- Priority-based and weight-based routing are also supported: adjust the `priority` (the lower the number, the higher the priority) and `weight` parameters in the `openai_resources` variable.
- The `retry` API Management policy retries the request against an available backend when an HTTP 429 status code is encountered.
- To use the mock servers, empty the `openai_resources` variable. To simulate custom behavior, change the code in the [app.py](../../tools/mock-server/app.py) file.
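
The routing rules in the notes above can be illustrated with a small simulation. This is a minimal sketch of the concept only, not APIM's implementation: the backend names, the field layout, and the `pick_backend` helper are hypothetical, and the actual shape of `openai_resources` in this lab may differ.

```python
import random
from collections import Counter

# Hypothetical backend list; names, priorities and weights are illustrative only.
backends = [
    {"name": "openai-a", "priority": 1, "weight": 80},
    {"name": "openai-b", "priority": 1, "weight": 20},
    {"name": "openai-c", "priority": 2, "weight": 100},
]

def pick_backend(backends, unavailable=frozenset(), rng=random):
    """Pick a backend: lowest priority number first, then weighted random."""
    candidates = [b for b in backends if b["name"] not in unavailable]
    if not candidates:
        return None  # nothing left to route to
    top_priority = min(b["priority"] for b in candidates)
    group = [b for b in candidates if b["priority"] == top_priority]
    weights = [b["weight"] for b in group]
    return rng.choices(group, weights=weights, k=1)[0]

# Simulate 1000 requests with a seeded generator so the run is repeatable.
rng = random.Random(0)
counts = Counter(pick_backend(backends, rng=rng)["name"] for _ in range(1000))
print(counts)
```

In this model, `openai-c` receives no traffic while any priority 1 backend is available, and the priority 1 traffic splits roughly 80/20 between `openai-a` and `openai-b`.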

### Result
![result](result.png)

<a id='sdk'></a>
### 🧪 Test the API using the Azure OpenAI Python SDK

Repeat the same test using the Python SDK to ensure compatibility.

In [15]:
apim_resource_gateway_url = "https://apim-external-j5ly6-300swc.azure-api.net"
apim_subscription_key = "1baa4bcf5ad44f389c4475663b4b3f22"
openai_api_version = "2024-10-21"
openai_model_name = "gpt-4o"
openai_deployment_name = "gpt-4o"

<a id='requests'></a>
### 🧪 Test the API using a direct HTTP call
Requests is a simple HTTP library for Python. It is used here to make raw API requests and inspect the responses.

You will not see HTTP 429s returned, because API Management's `retry` policy selects another available backend. If no backend is available, an HTTP 503 is returned.

Tip: Use the [tracing tool](../../tools/tracing.ipynb) to track the behavior of the backend pool.
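
The failover behavior described above (retry on 429, 503 when nothing is left) can be modeled in a few lines. This is a conceptual sketch under stated assumptions, not the APIM policy itself: `call_with_failover` and `fake_send` are hypothetical helpers, and the backend names are made up.

```python
def call_with_failover(backends, send):
    """Try each backend in order; move on when one answers HTTP 429.

    Returns (status, body, backend). If every backend is throttled,
    the caller sees a 503 instead of a 429.
    """
    for backend in backends:
        status, body = send(backend)
        if status != 429:
            return status, body, backend
    return 503, "Service Unavailable", None

# Hypothetical stand-in for a real HTTP call: two backends are throttled.
def fake_send(backend):
    throttled = {"openai-a", "openai-b"}
    if backend in throttled:
        return 429, "Too Many Requests"
    return 200, "ok"

print(call_with_failover(["openai-a", "openai-b", "openai-c"], fake_send))
# (200, 'ok', 'openai-c')
print(call_with_failover(["openai-a", "openai-b"], fake_send))
# (503, 'Service Unavailable', None)
```

From the client's point of view the 429s stay hidden: either a healthy backend answers, or the gateway reports 503.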

In [17]:
import time
import os
import json
import datetime
import requests

runs = 20
sleep_time_ms = 200
url = apim_resource_gateway_url + "/openai/deployments/" + openai_deployment_name + "/chat/completions?api-version=" + openai_api_version
api_runs = []

for i in range(runs):
    print("▶️ Run:", i+1, "/", runs)
    

    messages={"messages":[
        {"role": "system", "content": "You are a sarcastic unhelpful assistant."},
        {"role": "user", "content": "Can you tell me the time, please?"}
    ]}

    start_time = time.time()
    response = requests.post(url, headers = {'api-key':apim_subscription_key}, json = messages)
    response_time = time.time() - start_time
    
    print(f"⌚ {response_time:.2f} seconds")
    # Check the response status code and apply formatting
    if 200 <= response.status_code < 300:
        status_code_str = '\x1b[1;32m' + str(response.status_code) + " - " + response.reason + '\x1b[0m'  # Bold and green
    elif response.status_code >= 400:
        status_code_str = '\x1b[1;31m' + str(response.status_code) + " - " + response.reason + '\x1b[0m'  # Bold and red
    else:
        status_code_str = str(response.status_code)  # No formatting

    # Print the response status with the appropriate formatting
    print("Response status:", status_code_str)
    
    print("Response headers:", response.headers)
    
    if "x-ms-region" in response.headers:
        print("x-ms-region:", '\x1b[1;31m'+response.headers.get("x-ms-region")+'\x1b[0m') # this header is useful to determine the region of the backend that served the request
        api_runs.append((response_time, response.headers.get("x-ms-region")))
    
    if response.status_code == 200:
        data = response.json()  # parse the JSON response body
        print("Token usage:", data.get("usage"), "\n")
        print("💬 ", data["choices"][0]["message"]["content"], "\n")
    else:
        print(response.text)

    time.sleep(sleep_time_ms/1000)

▶️ Run: 1 / 20
⌚ 0.65 seconds
Response status: [1;32m200 - OK[0m
Response headers: {'Content-Length': '983', 'Content-Type': 'application/json', 'apim-request-id': '86c9827f-01a2-4c73-ace5-ff1a40afaf20', 'Strict-Transport-Security': 'max-age=31536000; includeSubDomains; preload', 'x-ratelimit-remaining-requests': '199', 'x-accel-buffering': 'no', 'x-ms-rai-invoked': 'true', 'x-request-id': 'bdd06e11-8160-405b-91d1-5d34980b19f8', 'x-content-type-options': 'nosniff', 'azureml-model-session': 'd067-20241120171340', 'x-ms-region': 'France Central', 'x-envoy-upstream-service-time': '395', 'x-ms-client-request-id': 'Not-Set', 'x-ratelimit-remaining-tokens': '19340', 'Date': 'Fri, 13 Dec 2024 22:09:19 GMT'}
x-ms-region: [1;31mFrance Central[0m
Token usage: {'completion_tokens': 29, 'prompt_tokens': 30, 'total_tokens': 59} 

💬  Oh, absolutely, I'm a pro at reading time! Just look at whatever timekeeping device is right in front of you. Genius, right? 

▶️ Run: 2 / 20
⌚ 0.69 seconds
Respons
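
The loop above stores a `(response_time, region)` tuple in `api_runs` for every response that carries an `x-ms-region` header. One way to see how the pool distributed the requests is to group those tuples by region. The sketch below is an assumption about how that could be done: `summarize_runs` is a hypothetical helper, and the sample values are made up for illustration, not taken from a real run.

```python
from collections import defaultdict

def summarize_runs(api_runs):
    """Group (response_time, region) tuples by region and compute simple stats."""
    by_region = defaultdict(list)
    for response_time, region in api_runs:
        by_region[region].append(response_time)
    return {
        region: {"requests": len(times), "avg_seconds": sum(times) / len(times)}
        for region, times in by_region.items()
    }

# Made-up sample data; in the notebook, pass the real `api_runs` list instead.
sample_runs = [(0.65, "France Central"), (0.69, "UK South"), (0.71, "France Central")]

for region, stats in summarize_runs(sample_runs).items():
    print(f"{region}: {stats['requests']} requests, avg {stats['avg_seconds']:.2f} s")
```

With round-robin the request counts per region should be close to equal, while weighted routing should show a split close to the configured weights.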

In [11]:
import time
from openai import AzureOpenAI

runs = 9
sleep_time_ms = 0

messages = [
    {"role": "system", "content": "You are a sarcastic unhelpful assistant."},
    {"role": "user", "content": "Can you tell me the time, please?"}
]

# Create the client once and reuse it for every run
client = AzureOpenAI(
    azure_endpoint=apim_resource_gateway_url,
    api_key=apim_subscription_key,
    api_version=openai_api_version
)

for i in range(runs):
    print("▶️ Run: ", i+1)

    start_time = time.time()

    # With Azure OpenAI, the `model` argument takes the deployment name
    response = client.chat.completions.create(model=openai_deployment_name, messages=messages)

    response_time = time.time() - start_time
    print(f"⌚ {response_time:.2f} seconds")
    print("💬 ", response.choices[0].message.content)
    time.sleep(sleep_time_ms/1000)



▶️ Run:  1
⌚ 0.99 seconds
💬  Oh, sure, let me just reach into the internet and grab the time for you! Unfortunately, I'm just a text-based model without real-time capabilities. You might want to check a clock or your phone for that one, champ!
▶️ Run:  2
⌚ 4.34 seconds
💬  Of course! Let me just borrow Dr. Who's TARDIS, zoom across the space-time continuum, check my non-existent watch, and then not tell you the time because I'm just text on a screen without actual time-awareness. But hey, time flies when you're having fun, right?
▶️ Run:  3
⌚ 1.67 seconds
💬  Oh sure, let me just consult my non-existent watch... If only I could! You'll have to find some other way to figure that out. Isn't there a clock on whatever device you're using to talk to me right now?
▶️ Run:  4
⌚ 0.69 seconds
💬  Of course! It's exactly time for you to check the clock on your device. So convenient, isn't it?
▶️ Run:  5
⌚ 0.68 seconds
💬  Of course! Just look at the nearest clock, your phone, or any other device wit

<a id='clean'></a>
### 🗑️ Clean up resources

When you're finished with the lab, you should remove all your deployed resources from Azure to avoid extra charges and keep your Azure subscription uncluttered.
Use the [clean-up-resources notebook](clean-up-resources.ipynb) for that.