# APIM ❤️ OpenAI

## Backend pool Load Balancing lab
![flow](images/backend-pool-load-balancing.gif)

A playground for trying the built-in load-balancing [backend pool functionality of APIM](https://learn.microsoft.com/en-us/azure/api-management/backends?tabs=bicep) against either a list of Azure OpenAI endpoints or mock servers.

Notes:
- The backend pool uses round-robin by default.
- Priority-based and weight-based routing are also supported: adjust the `priority` (the lower the number, the higher the priority) and `weight` parameters in the `openai_resources` variable.
- The `retry` API Management policy retries the request against an available backend when an HTTP 429 status code is returned.
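
To make the interplay of `priority` and `weight` more concrete, here is a minimal Python sketch of one plausible selection rule: among the backends that are currently available, only those with the lowest priority number are considered, and one of them is drawn at random in proportion to its weight. This is an illustration only, not APIM's actual implementation; the `Backend` class, the `pick_backend` function, and the sample values are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    priority: int   # lower number = higher priority
    weight: int     # relative share within the same priority group
    available: bool = True


def pick_backend(backends, rng=random):
    """Pick one available backend: lowest priority number first, then weighted random."""
    candidates = [b for b in backends if b.available]
    if not candidates:
        return None  # nothing to route to
    best = min(b.priority for b in candidates)
    group = [b for b in candidates if b.priority == best]
    return rng.choices(group, weights=[b.weight for b in group], k=1)[0]


# Hypothetical pool, loosely named after the regions used in this lab
pool = [
    Backend("openai-uks", priority=1, weight=100),
    Backend("openai-swc", priority=2, weight=50),
    Backend("openai-frc", priority=2, weight=50),
]

print(pick_backend(pool).name)   # the priority-1 backend while it is available
pool[0].available = False        # e.g. it started returning HTTP 429
print(pick_backend(pool).name)   # now one of the two priority-2 backends
```

With equal weights in the second group, requests would be split roughly evenly between the two remaining backends once the first one is unavailable.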

### Result
![result](images/result.png)

<a id='2'></a>
### 2️⃣ Create deployment using Terraform

This lab uses Terraform to declaratively define all the resources that will be deployed. Edit [variables.tf](variables.tf) directly to try different configurations.

In [12]:
! terraform init
! terraform apply -auto-approve

[0m[1mInitializing the backend...[0m
[0m[1mInitializing provider plugins...[0m
- Reusing previous version of azure/azapi from the dependency lock file
- Reusing previous version of hashicorp/azurerm from the dependency lock file
- Using previously-installed azure/azapi v2.1.0
- Using previously-installed hashicorp/azurerm v4.14.0

[0m[1m[32mTerraform has been successfully initialized![0m[32m[0m
[0m[32m
You may now begin working with Terraform. Try running "terraform plan" to see
any changes that are required for your infrastructure. All Terraform commands
should now work.

If you ever set or change modules or backend configuration for Terraform,
rerun this command to reinitialize your working directory. If you forget, other
commands will detect it and remind you to do so if necessary.[0m
^C


[0m[1mazurerm_resource_group.rg: Refreshing state... [id=/subscriptions/dcef7009-6b94-4382-afdc-17eb160d709a/resourceGroups/rg-apim-genai-310-pprod][0m
[0m[1mazurerm_ai_services.ai-services["openai-uks"]: Refreshing state... [id=/subscriptions/dcef7009-6b94-4382-afdc-17eb160d709a/resourceGroups/rg-apim-genai-310-pprod/providers/Microsoft.CognitiveServices/accounts/openai-uks][0m
[0m[1mazurerm_ai_services.ai-services["openai-frc"]: Refreshing state... [id=/subscriptions/dcef7009-6b94-4382-afdc-17eb160d709a/resourceGroups/rg-apim-genai-310-pprod/providers/Microsoft.CognitiveServices/accounts/openai-frc][0m
[0m[1mazurerm_ai_services.ai-services["openai-swc"]: Refreshing state... [id=/subscriptions/dcef7009-6b94-4382-afdc-17eb160d709a/resourceGroups/rg-apim-genai-310-pprod/providers/Microsoft.CognitiveServices/accounts/openai-swc][0m
[0m[1mazurerm_log_analytics_workspace.log-analytics: Refreshing state... [id=/subscriptions/dcef7009-6b94-4382-afdc-17eb160d709a/resourceGroups/

<a id='sdk'></a>
### Retrieve the deployment outputs

Read the gateway URL, the subscription keys, and the Application Insights app ID from the Terraform outputs. The tests in the following sections use these values.

In [2]:
apim_resource_gateway_url = ! terraform output -raw apim_resource_gateway_url
apim_resource_gateway_url = apim_resource_gateway_url.n
print("👉🏻 APIM Resource Gateway URL: ", apim_resource_gateway_url)

apim_subscription_key = ! terraform output -raw apim_subscription_key
apim_subscription_key = apim_subscription_key.n
print("👉🏻 APIM Subscription Key: ", apim_subscription_key)

app_insights_app_id = ! terraform output -raw app_insights_app_id
app_insights_app_id = app_insights_app_id.n
print("👉🏻 Application Insights App ID: ", app_insights_app_id)

apim_subscription_key_1 = ! terraform output -raw apim_subscription_key_1
apim_subscription_key_1 = apim_subscription_key_1.n
print("👉🏻 APIM Subscription Key 1: ", apim_subscription_key_1)

apim_subscription_key_2 = ! terraform output -raw apim_subscription_key_2
apim_subscription_key_2 = apim_subscription_key_2.n
print("👉🏻 APIM Subscription Key 2: ", apim_subscription_key_2)

apim_subscription_key_3 = ! terraform output -raw apim_subscription_key_3
apim_subscription_key_3 = apim_subscription_key_3.n
print("👉🏻 APIM Subscription Key 3: ", apim_subscription_key_3)

openai_api_version = "2024-10-21"
openai_model_name = "gpt-4o"
openai_deployment_name = "gpt-4o"

👉🏻 APIM Resource Gateway URL:  https://apim-external-310-pprod.azure-api.net
👉🏻 APIM Subscription Key:  3b1e080b883b4c55a19c2a0a0826454b
👉🏻 Application Insights App ID:  a64417ff-2c09-4df6-af5b-cc041b515ddb
👉🏻 APIM Subscription Key 1:  2d28576f0914447d9b5003ec8ae02fbe
👉🏻 APIM Subscription Key 2:  f3e8efcfed6e45d78c5463889d292812
👉🏻 APIM Subscription Key 3:  d8a61c5a8c024ac49ca2db2095fa5c69


<a id='requests'></a>
### 🧪 Test the API using a direct HTTP call
Requests is an elegant and simple HTTP library for Python that will be used here to make raw API requests and inspect the responses. 

You will not see HTTP 429 responses, because API Management's `retry` policy selects another available backend. If no backend is viable, an HTTP 503 is returned.

Tip: Use the [tracing tool](../../tools/tracing.ipynb) to track the behavior of the backend pool.
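
The behavior described above can be pictured with a small simulation. The sketch below is not the APIM `retry` policy itself; it is a hypothetical Python stand-in (`call_with_retry` and the fake backends are made up for illustration) that moves on to the next backend whenever one answers 429 and reports 503 once every backend has been tried.

```python
def call_with_retry(backends, max_attempts=None):
    """Try backends in order; skip any that answer 429, return 503 if none succeed.

    `backends` is a list of (name, callable) pairs; each callable returns an HTTP status code.
    """
    attempts = max_attempts or len(backends)
    tried = []
    for name, send in backends[:attempts]:
        status = send()
        tried.append((name, status))
        if status != 429:
            return status, tried      # the caller never sees the 429s
    return 503, tried                 # no viable backend left


# Hypothetical backends: two are throttled, the third answers normally
throttled = lambda: 429
healthy = lambda: 200

status, tried = call_with_retry([("uks", throttled), ("swc", throttled), ("frc", healthy)])
print(status, tried)  # 200 [('uks', 429), ('swc', 429), ('frc', 200)]

status, tried = call_with_retry([("uks", throttled), ("swc", throttled)])
print(status, tried)  # 503 [('uks', 429), ('swc', 429)]
```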

In [None]:
import time
import os
import json
import datetime
import requests

runs = 20
sleep_time_ms = 200
url = apim_resource_gateway_url + "/openai/deployments/" + openai_deployment_name + "/chat/completions?api-version=" + openai_api_version
api_runs = []

for i in range(runs):
    print("▶️ Run:", i+1, "/", runs)
    

    messages={"messages":[
        {"role": "system", "content": "You are a sarcastic unhelpful assistant."},
        {"role": "user", "content": "Can you tell me the time, please?"}
    ]}

    start_time = time.time()
    response = requests.post(url, headers = {'api-key':apim_subscription_key}, json = messages)
    response_time = time.time() - start_time
    
    print(f"⌚ {response_time:.2f} seconds")
    # Check the response status code and apply formatting
    if 200 <= response.status_code < 300:
        status_code_str = '\x1b[1;32m' + str(response.status_code) + " - " + response.reason + '\x1b[0m'  # Bold and green
    elif response.status_code >= 400:
        status_code_str = '\x1b[1;31m' + str(response.status_code) + " - " + response.reason + '\x1b[0m'  # Bold and red
    else:
        status_code_str = str(response.status_code)  # No formatting

    # Print the response status with the appropriate formatting
    print("Response status:", status_code_str)
    
    print("Response headers:", response.headers)
    
    if "x-ms-region" in response.headers:
        print("x-ms-region:", '\x1b[1;31m'+response.headers.get("x-ms-region")+'\x1b[0m') # this header is useful to determine the region of the backend that served the request
        api_runs.append((response_time, response.headers.get("x-ms-region")))
    
    if (response.status_code == 200):
        data = json.loads(response.text)
        print("Token usage:", data.get("usage"), "\n")
        print("💬 ", data.get("choices")[0].get("message").get("content"), "\n")
    else:
        print(response.text)   

    time.sleep(sleep_time_ms/1000)

▶️ Run: 1 / 20
⌚ 0.87 seconds
Response status: 200 - OK
Response headers: {'Content-Type': 'application/json', 'Date': 'Mon, 16 Dec 2024 21:13:46 GMT', 'Cache-Control': 'private', 'Content-Encoding': 'gzip', 'Transfer-Encoding': 'chunked', 'Vary': 'Accept-Encoding', 'Strict-Transport-Security': 'max-age=31536000; includeSubDomains; preload', 'apim-request-id': '7c7c7242-f4ec-45fe-b912-c81880617c0a', 'x-ratelimit-remaining-requests': '77', 'x-accel-buffering': 'no', 'x-ms-rai-invoked': 'true', 'X-Request-ID': '5912d73b-a4c0-4e53-9489-fb4c9f07f19e', 'X-Content-Type-Options': 'nosniff', 'azureml-model-session': 'd199-20241121123626', 'x-ms-region': 'Sweden Central', 'x-envoy-upstream-service-time': '594', 'x-ms-client-request-id': 'Not-Set', 'x-ratelimit-remaining-tokens': '6020', 'Request-Context': 'appId=cid-v1:a64417ff-2c09-4df6-af5b-cc041b515ddb'}
x-ms-region: Sweden Central
Token usage: {'completion_tokens': 47, 'prompt_tokens': 30, 'total_tokens': 77} 

💬  Oh, 
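
The loop above collects `(response_time, region)` tuples in `api_runs` but does not aggregate them. As a rough sketch of how that list could be summarized to see how requests were spread across backends, the hypothetical helper `summarize_runs` below groups the tuples by region; the sample data is invented and only mirrors the shape of `api_runs`.

```python
from collections import defaultdict


def summarize_runs(api_runs):
    """Group (response_time, region) tuples by region; return count and mean latency."""
    grouped = defaultdict(list)
    for response_time, region in api_runs:
        grouped[region].append(response_time)
    return {
        region: {"requests": len(times), "avg_seconds": round(sum(times) / len(times), 2)}
        for region, times in grouped.items()
    }


# Hypothetical sample in the same shape as `api_runs`
sample_runs = [(0.87, "Sweden Central"), (1.10, "UK South"), (0.90, "UK South")]
for region, stats in summarize_runs(sample_runs).items():
    print(region, stats)
```

In the notebook, calling `summarize_runs(api_runs)` after the test loop should give a quick per-region view without opening the tracing tool.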

<a id='sdk'></a>
### 🧪 Test the API using the Azure OpenAI Python SDK

Repeat the same test using the Python SDK to ensure compatibility.

In [None]:
import time
from openai import AzureOpenAI

runs = 9
sleep_time_ms = 0

for i in range(runs):
    print("▶️ Run: ", i+1)

    messages=[
        {"role": "system", "content": "You are a sarcastic unhelpful assistant."},
        {"role": "user", "content": "Can you tell me the time, please?"}
    ]

    client = AzureOpenAI(
        azure_endpoint=apim_resource_gateway_url,
        api_key=apim_subscription_key,
        api_version=openai_api_version
    )

    start_time = time.time()

    response = client.chat.completions.create(model=openai_model_name, messages=messages)
    
    response_time = time.time() - start_time
    print(f"⌚ {response_time:.2f} seconds")
    print("💬 ", response.choices[0].message.content)
    time.sleep(sleep_time_ms/1000)


▶️ Run:  1
⌚ 1.15 seconds
💬  Oh, sure! Just let me hop into my time machine and check for you. Or, you know, look at the convenient clock on your device. That works too.
▶️ Run:  2
⌚ 1.59 seconds
💬  Oh, absolutely! My circuits would love nothing more than to tell you the time... if only I could. But, oh dear, it seems I'm just a text-based assistant who doesn't actually keep track of time. You'll probably get more accurate results by, I don't know, looking at a clock or your phone!
▶️ Run:  3
⌚ 0.82 seconds
💬  Oh, sure, let me just ask the psychic powers I don't have. But here's a revolutionary idea: Check the clock on your device!
▶️ Run:  4
⌚ 1.12 seconds
💬  Sure, let me just consult my invisible, non-existent watch... Oh dear, it seems to be stuck on "figure it out yourself" o'clock. Maybe your phone or a nearby clock could be more forthcoming with the actual time!
▶️ Run:  5
⌚ 1.70 seconds
💬  Oh, absolutely! Let me just check my imaginary watch real quick... Oh no, it seems my time

<a id='kql'></a>
### 🔍 Analyze Application Insights requests

With this query you can get the request and response details including the prompt and the OpenAI completion. It also returns token counters.

In [7]:
import pandas as pd

query = "\"" + "requests  \
| project timestamp, duration, customDimensions \
| extend duration = round(duration, 2) \
| extend parsedCustomDimensions = parse_json(customDimensions) \
| extend apiName = tostring(parsedCustomDimensions.['API Name']) \
| extend apimSubscription = tostring(parsedCustomDimensions.['Subscription Name']) \
| extend userAgent = tostring(parsedCustomDimensions.['Request-User-agent']) \
| extend request_json = tostring(parsedCustomDimensions.['Request-Body']) \
| extend request = parse_json(request_json) \
| extend model = tostring(request.['model']) \
| extend messages = tostring(request.['messages']) \
| extend region = tostring(parsedCustomDimensions.['Response-x-ms-region']) \
| extend remainingTokens = tostring(parsedCustomDimensions.['Response-x-ratelimit-remaining-tokens']) \
| extend remainingRequests = tostring(parsedCustomDimensions.['Response-x-ratelimit-remaining-requests']) \
| extend response_json = tostring(parsedCustomDimensions.['Response-Body']) \
| extend response = parse_json(response_json) \
| extend promptTokens = tostring(response.['usage'].['prompt_tokens']) \
| extend completionTokens = tostring(response.['usage'].['completion_tokens']) \
| extend totalTokens = tostring(response.['usage'].['total_tokens']) \
| extend completion = tostring(response.['choices'][0].['message'].['content']) \
| project timestamp, apiName, apimSubscription, duration, userAgent, model, messages, completion, region, promptTokens, completionTokens, totalTokens, remainingTokens, remainingRequests \
| order by timestamp desc" + "\""

result_stdout = !  az monitor app-insights query --app {app_insights_app_id} --analytics-query {query} 
result = json.loads(result_stdout.n)

table = result.get('tables')[0]
pd.DataFrame(table.get("rows"), columns=[col.get("name") for col in table.get('columns')])


| | timestamp | apiName | apimSubscription | duration | userAgent | model | messages | completion | region | promptTokens | completionTokens | totalTokens | remainingTokens | remainingRequests |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2024-12-16T20:44:54.3197541Z | api-azure-openai | | 378.44 | | | | | France Central | | | | 6020.0 | 77 |
| 1 | 2024-12-16T20:44:52.8721011Z | api-azure-openai | | 1034.3 | | | | | Sweden Central | | | | 6680.0 | 78 |
| 2 | 2024-12-16T20:44:51.5857268Z | api-azure-openai | | 896.43 | | | | | France Central | | | | 6680.0 | 78 |
| 3 | 2024-12-16T20:44:51.1702829Z | api-azure-openai | | 40.22 | | | | | UK South | | | | | 65 |
| 4 | 2024-12-16T20:44:50.745986Z | api-azure-openai | | 34.12 | | | | | UK South | | | | | 66 |
| 5 | 2024-12-16T20:44:50.3192288Z | api-azure-openai | | 39.07 | | | | | UK South | | | | | 67 |
| 6 | 2024-12-16T20:44:48.7939495Z | api-azure-openai | | 1128.86 | | | | | UK South | | | | 80.0 | 68 |
| 7 | 2024-12-16T20:44:47.7976708Z | api-azure-openai | | 617.06 | | | | | UK South | | | | 740.0 | 69 |
| 8 | 2024-12-16T20:44:46.3407055Z | api-azure-openai | | 1049.01 | | | | | UK South | | | | 1400.0 | 70 |
| 9 | 2024-12-16T20:44:45.159665Z | api-azure-openai | | 794.62 | | | | | UK South | | | | 2060.0 | 71 |
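
The cell above reads the query result through `tables[0]`, its `columns` (each with a `name`), and its `rows`. As a dependency-free sketch of working with that layout, the hypothetical helper `rows_as_dicts` below turns the rows into dictionaries and counts requests per region. The payload is an invented, heavily trimmed sample that only assumes the same `tables`/`columns`/`rows` structure.

```python
from collections import Counter


def rows_as_dicts(result):
    """Turn the first table of a query result into a list of {column: value} dicts."""
    table = result["tables"][0]
    names = [col["name"] for col in table["columns"]]
    return [dict(zip(names, row)) for row in table["rows"]]


# Hypothetical, trimmed payload with the same tables/columns/rows layout
sample_result = {
    "tables": [{
        "columns": [{"name": "timestamp"}, {"name": "region"}, {"name": "duration"}],
        "rows": [
            ["2024-12-16T20:44:54Z", "France Central", 378.44],
            ["2024-12-16T20:44:52Z", "Sweden Central", 1034.3],
            ["2024-12-16T20:44:51Z", "France Central", 896.43],
        ],
    }]
}

records = rows_as_dicts(sample_result)
print(Counter(r["region"] for r in records))
```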


<a id='portal'></a>
### 🔍 Open the workbook in the Azure Portal

Go to the Application Insights resource and, under the Monitoring section, select the Workbooks blade. You should see the OpenAI Usage Analysis workbook, which contains the query above along with others for checking token counts, performance, failures, and so on.

<a id='sdk'></a>
### 🧪 Execute multiple runs for each subscription using the Azure OpenAI Python SDK

On each run, the loop below sends one request per subscription. Adjust `sleep_time_ms` and the number of `runs` to suit your test scenario.


In [3]:
import time
from openai import AzureOpenAI
runs = 15
sleep_time_ms = 1000
for i in range(runs):
    print("▶️ Run: ", i+1)
    messages=[
        {"role": "system", "content": "You are a sarcastic unhelpful assistant."},
        {"role": "user", "content": "Can you tell me the time, please?"}
    ]
    client = AzureOpenAI(azure_endpoint=apim_resource_gateway_url, api_key=apim_subscription_key_1, api_version=openai_api_version)
    response = client.chat.completions.create(model=openai_model_name, messages=messages, extra_headers={"x-user-id": "alex"})
    print("💬 ","for subscription 1: ", response.choices[0].message.content)

    client = AzureOpenAI(azure_endpoint=apim_resource_gateway_url, api_key=apim_subscription_key_2, api_version=openai_api_version)
    response = client.chat.completions.create(model=openai_model_name, messages=messages, extra_headers={"x-user-id": "alex"})
    print("💬 ","for subscription 2: ", response.choices[0].message.content)

    client = AzureOpenAI(azure_endpoint=apim_resource_gateway_url, api_key=apim_subscription_key_3, api_version=openai_api_version)
    response = client.chat.completions.create(model=openai_model_name, messages=messages, extra_headers={"x-user-id": "alex"})
    print("💬 ","for subscription 3: ", response.choices[0].message.content)

    time.sleep(sleep_time_ms/1000)


▶️ Run:  1
💬  for subscription 1:  Sure, let me just check my imaginary watch—sure looks like a time, doesn't it? Sadly, I can't actually tell you the current time. Why don't you take a bold leap into the 21st century and check one of your numerous digital devices?
💬  for subscription 2:  Oh, for sure! Just let me use my magical non-existent watch. Oh, wait... You might want to look at a clock for that one.
💬  for subscription 3:  Sure, let me just look into my invisible crystal ball for that. Oh wait, I don't have one! Maybe you could try checking a clock or your phone instead?
▶️ Run:  2
💬  for subscription 1:  Oh sure, let me just check my invisible, non-existent clock for you. Got it—it's sometime between the dawn of human civilization and the heat death of the universe.
💬  for subscription 2:  Oh, of course! Let me just consult my invisible timepiece... Wait a second, I have no idea what time it is because I'm stuck in this digital realm with no concept of time or space. Better ch

KeyboardInterrupt: 

<a id='kql'></a>
### 🔍 Analyze Application Insights custom metrics with a KQL query

With this query you can get the custom metrics that were emitted by Azure APIM.

In [4]:
import json  # needed for json.loads below; the original run failed with "NameError: name 'json' is not defined"
import pandas as pd

query = "\"" + "customMetrics  \
| where name == 'Total Tokens' \
| extend parsedCustomDimensions = parse_json(customDimensions) \
| extend clientIP = tostring(parsedCustomDimensions.['Client IP']) \
| extend apiId = tostring(parsedCustomDimensions.['API ID']) \
| extend apimSubscription = tostring(parsedCustomDimensions.['Subscription ID']) \
| extend UserId = tostring(parsedCustomDimensions.['User ID']) \
| project timestamp, value, clientIP, apiId, apimSubscription, UserId \
| order by timestamp asc" + "\""

# Assumes `app_insights_resource_name` and `resource_group_name` were set in an earlier cell
result_stdout = ! az monitor app-insights query --app {app_insights_resource_name} -g {resource_group_name} --analytics-query {query} 
result = json.loads(result_stdout.n)

table = result.get('tables')[0]
df = pd.DataFrame(table.get("rows"), columns=[col.get("name") for col in table.get('columns')])
df['timestamp'] = pd.to_datetime(df['timestamp']).dt.strftime('%H:%M')

df
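
Once the custom metric rows are in memory, a natural next step is to total the `Total Tokens` values per subscription. The sketch below does this with the standard library only. The helper `tokens_per_subscription`, the simplified three-field row layout, and the sample values are all hypothetical; real rows would come from the `customMetrics` query above and carry more columns.

```python
from collections import defaultdict


def tokens_per_subscription(rows):
    """Sum the metric value per subscription for (timestamp, value, subscription) rows."""
    totals = defaultdict(float)
    for _timestamp, value, subscription in rows:
        totals[subscription] += value
    return dict(totals)


# Hypothetical rows; real ones would come from the customMetrics query
sample_rows = [
    ("21:10", 77.0, "subscription1"),
    ("21:10", 64.0, "subscription2"),
    ("21:11", 81.0, "subscription1"),
]
print(tokens_per_subscription(sample_rows))
```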