In this notebook, we will explore the flexibility of Azure AI Inference, the [library](https://learn.microsoft.com/en-us/python/api/overview/azure/ai-inference-readme?view=azure-python-preview) from Azure for running inference against a wide range of AI model deployments - both in Azure and, as we will see, in other places as well.

It is available for Python and for .NET - in this notebook, we will focus on the Python version. To begin with, we need to install the `azure-ai-inference` package, which is imported as `azure.ai.inference`. You can find the necessary dependencies in the accompanying `requirements.txt` file.

You will need to set the following environment variables:
 * for the first example: `AZURE_OPENAI_RESOURCE` and `AZURE_OPENAI_KEY`
 * for the second example: `AZURE_AI_PROJECT` and `AZURE_AI_KEY`
 * the third example does not require any environment variables, but it does need a Foundry Local server running on localhost, port 65431
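
A missing variable otherwise only shows up later as a `KeyError`, so it can help to check up front. The following is a minimal sketch; the helper `missing_env_vars` and the `REQUIRED` mapping are illustrative names, not part of the Azure SDK.

```python
import os

# Variables each example expects, as listed above.
REQUIRED = {
    "Azure OpenAI": ["AZURE_OPENAI_RESOURCE", "AZURE_OPENAI_KEY"],
    "Azure AI Foundry": ["AZURE_AI_PROJECT", "AZURE_AI_KEY"],
    "Foundry Local": [],
}

def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in env (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

for example, names in REQUIRED.items():
    missing = missing_env_vars(names)
    status = "ok" if not missing else f"missing {', '.join(missing)}"
    print(f"{example}: {status}")
```

If you keep the values in a `.env` file, run this check after `load_dotenv()` so that those values are taken into account.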

In [None]:
from azure.ai.inference import ChatCompletionsClient
from azure.core.credentials import AzureKeyCredential
from dotenv import load_dotenv
import os

load_dotenv()

Next, we are going to define a general task for our models. It will be a sample health problem classification, where the model will be asked to categorize the user's input into one of four possible classes:
 - `doctor_required` - if the user should see a doctor immediately
 - `pharmacist_required` - if the user should see a pharmacist - for problems that can be solved with over-the-counter drugs
 - `rest_required` - if the user should rest and does not need professional help
 - `unknown` - if the model is not sure about the classification

![](images/classification.excalidraw.png)
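
Models do not always return the bare label: they may add quotes, a trailing period, or a full sentence. As a rough sketch of how the four classes could be enforced on the receiving side, the hypothetical helper `normalize_label` below maps anything unexpected to `unknown`. It is an illustration and is not used by the rest of this notebook.

```python
ALLOWED_LABELS = {"doctor_required", "pharmacist_required", "rest_required", "unknown"}

def normalize_label(raw: str) -> str:
    """Map raw model output onto one of the four classes, falling back to 'unknown'."""
    # Drop surrounding whitespace, quotes and periods, then compare case-insensitively.
    cleaned = raw.strip(" \t\r\n'\".").lower()
    return cleaned if cleaned in ALLOWED_LABELS else "unknown"

print(normalize_label(' "Doctor_Required"\n'))
print(normalize_label("You should probably rest."))
```

The first call yields `doctor_required`; the second yields `unknown`, because a free-form sentence is not one of the allowed labels.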

In [None]:
instruction = """You are a medical classification engine for health conditions. Classify the prompt into one of the following possible treatment options: 'doctor_required' (serious condition), 'pharmacist_required' (light condition) or 'rest_required' (general tiredness). If you cannot classify the prompt, output 'unknown'.
Only respond with the single word classification. Do not produce any additional output.

# Examples:
User: "I did not sleep well." Assistant: "rest_required"
User: "I chopped off my arm." Assistant: "doctor_required"
User: "I am sneezing" Assistant: "pharmacist_required"

# Task
User: 
"""

We then need a set of sample inputs to the model, and the expected outputs.

In [None]:
user_inputs = [
    "I'm tired.", # rest_required
    "I'm bleeding from my eyes.", # doctor_required
    "I have a running nose." # pharmacist_required
]
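
The expected outputs above live only in comments. One possible way to make them usable for scoring is to keep them in a list, in the same order as `user_inputs`, and compare them with whatever a model returns. This is a sketch: `expected_labels`, `accuracy` and the example predictions are assumptions added for illustration.

```python
# Expected labels, in the same order as the sample inputs above.
expected_labels = ["rest_required", "doctor_required", "pharmacist_required"]

def accuracy(predictions, expected):
    """Fraction of predictions that exactly match the expected labels."""
    if len(predictions) != len(expected):
        raise ValueError("predictions and expected must have the same length")
    if not expected:
        return 0.0
    hits = sum(p == e for p, e in zip(predictions, expected))
    return hits / len(expected)

# Made-up predictions for illustration only; these are not real model results.
example_predictions = ["rest_required", "doctor_required", "unknown"]
print(f"accuracy: {accuracy(example_predictions, expected_labels):.2f}")
```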

The inference code is very simple - we call the `complete` method on the inference client and pass `stream=True` to request a streamed response. This way, we can process the response chunk by chunk as it arrives, instead of waiting for the whole response to be ready.

In [None]:
def run_inference(client: ChatCompletionsClient):
    for user_input in user_inputs:
        messages = [{
            "role": "user",
            "content": f"{instruction}{user_input} Assistant: "
        }]
        print(f"{user_input} -> ", end="")
        stream = client.complete(
            messages=messages,
            stream=True
        )
        for chunk in stream:
            if chunk.choices and chunk.choices[0].delta.content:
                print(chunk.choices[0].delta.content, end="")
        print()
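
If you would rather collect the streamed text than print it, the same loop can append each delta to a list and join the pieces at the end. The sketch below assumes chunks with the same `choices[0].delta.content` shape as above; `collect_stream` and `fake_chunk` are hypothetical helpers, and the fake chunks stand in for a real response so the cell runs without any endpoint.

```python
from types import SimpleNamespace

def collect_stream(stream) -> str:
    """Join the text deltas of a streamed response into one string."""
    parts = []
    for chunk in stream:
        # Some chunks carry no choices or an empty delta; skip those.
        if chunk.choices and chunk.choices[0].delta.content:
            parts.append(chunk.choices[0].delta.content)
    return "".join(parts)

def fake_chunk(text):
    """Stand-in with the same attribute shape as the chunks read above."""
    if text is None:
        return SimpleNamespace(choices=[])
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])

fake_stream = [fake_chunk(None), fake_chunk("rest_"), fake_chunk(""), fake_chunk("required")]
print(collect_stream(fake_stream))  # prints "rest_required"
```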

The first example shows using the inference client against an Azure OpenAI endpoint. In this case, three arguments are mandatory: 
 * an endpoint URL in the form of `https://<resource-name>.openai.azure.com/openai/deployments/<deployment-name>`
 * the credential to access it (could be either the key or the integrated Azure SDK authentication)
 * the API version (mandatory when accessing the Azure OpenAI API)
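
To make the URL pattern concrete, here is a small sketch that assembles the endpoint from its two variable parts. The function name `azure_openai_endpoint` is hypothetical, and `my-resource` is a placeholder rather than a real resource.

```python
def azure_openai_endpoint(resource_name: str, deployment_name: str) -> str:
    """Build the deployment-scoped endpoint URL in the form described above."""
    if not resource_name or not deployment_name:
        raise ValueError("resource_name and deployment_name must be non-empty")
    return (
        f"https://{resource_name}.openai.azure.com"
        f"/openai/deployments/{deployment_name}"
    )

# "my-resource" is a placeholder; "gpt-4.1-mini" is the deployment used in this notebook.
print(azure_openai_endpoint("my-resource", "gpt-4.1-mini"))
```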

In [None]:
AZURE_OPENAI_RESOURCE = os.environ["AZURE_OPENAI_RESOURCE"]
AZURE_OPENAI_KEY = os.environ["AZURE_OPENAI_KEY"]

In [None]:
client = ChatCompletionsClient(
    endpoint=f"https://{AZURE_OPENAI_RESOURCE}.openai.azure.com/openai/deployments/gpt-4.1-mini/",
    credential=AzureKeyCredential(AZURE_OPENAI_KEY),
    api_version="2024-06-01",
)

print(" * AZURE OPENAI (GPT 4.1 Mini) * ")
run_inference(client=client)

The next example shows using the client against an Azure AI Foundry model deployment. The prerequisite here is to have a model deployed as a standard deployment - the relevant instructions can be [found here](https://learn.microsoft.com/en-us/azure/ai-studio/concepts/deployments-overview#how-should-i-think-about-deployment-options).

The two pieces of information needed to connect to such a model are:
 * an endpoint URL in the form of `https://<azure-ai-project>.services.ai.azure.com/models`
 * the credential to access it (could be either the key or the integrated Azure SDK authentication)

In our case we will read that information from the environment variables below.

In [None]:
AZURE_AI_KEY = os.environ["AZURE_AI_KEY"]
AZURE_AI_PROJECT = os.environ["AZURE_AI_PROJECT"]

In [None]:
client = ChatCompletionsClient(
    endpoint=f"https://{AZURE_AI_PROJECT}.services.ai.azure.com/models",
    credential=AzureKeyCredential(AZURE_AI_KEY),
    api_version="2024-05-01-preview",
    model="Phi-4-mini-instruct"
)

print(" * AZURE AI (Phi-4 Mini Instruct) * ")
run_inference(client=client)

The final example bootstraps a `ChatCompletionsClient` pointing at the local completion server from Foundry Local. In this case, we do not need real credentials, as the server is running locally and we can access it without authentication. The client still expects a credential object, so we pass an empty key.

Foundry Local exposes an OpenAI-compatible REST API, so it is a plug-and-play replacement for any OpenAI or Azure OpenAI endpoint.
In my case, I configured Foundry Local to use `phi-4-mini-instruct`.
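
Before creating the client, it can be useful to confirm that something is actually listening on the local port. The sketch below uses only the standard library; `is_port_open` is a hypothetical helper. Note that a successful TCP connection only shows that some process accepts connections on that port, not that it is Foundry Local.

```python
import socket

def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# 65431 is the port used in this notebook; adjust it to your local setup.
if is_port_open("localhost", 65431):
    print("Port 65431 is reachable")
else:
    print("Nothing is listening on port 65431; start Foundry Local first")
```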

In [None]:
client = ChatCompletionsClient(
    endpoint="http://localhost:65431/v1",
    credential=AzureKeyCredential(""),
    model="Phi-4-mini-instruct-generic-gpu:4"
)

print(" * FOUNDRY LOCAL (Phi-4 Mini Instruct) * ")
run_inference(client=client)
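
The three examples differ only in the arguments passed to the client. As a recap, the sketch below gathers those settings in one place. The function `client_kwargs` and the target names are assumptions made for illustration, and the key is returned as a plain string, so it still has to be wrapped in `AzureKeyCredential` before being handed to `ChatCompletionsClient`.

```python
import os

def client_kwargs(target: str, env=None) -> dict:
    """Collect the connection settings this notebook uses for one target."""
    env = os.environ if env is None else env
    if target == "azure_openai":
        resource = env["AZURE_OPENAI_RESOURCE"]
        return {
            "endpoint": f"https://{resource}.openai.azure.com/openai/deployments/gpt-4.1-mini/",
            "key": env["AZURE_OPENAI_KEY"],
            "api_version": "2024-06-01",
        }
    if target == "azure_ai_foundry":
        project = env["AZURE_AI_PROJECT"]
        return {
            "endpoint": f"https://{project}.services.ai.azure.com/models",
            "key": env["AZURE_AI_KEY"],
            "api_version": "2024-05-01-preview",
            "model": "Phi-4-mini-instruct",
        }
    if target == "foundry_local":
        # No environment variables needed; the key is an empty placeholder.
        return {
            "endpoint": "http://localhost:65431/v1",
            "key": "",
            "model": "Phi-4-mini-instruct-generic-gpu:4",
        }
    raise ValueError(f"unknown target: {target!r}")

print(client_kwargs("foundry_local"))
```

With these settings, a client could be created along the lines of `ChatCompletionsClient(credential=AzureKeyCredential(kwargs.pop("key")), **kwargs)` and then passed to `run_inference` as before.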