# Streaming Responses with Azure OpenAI
https://platform.openai.com/docs/guides/streaming-responses?api-mode=responses

## Problem Statement:
By default, when you make a request to a Model, entire output is first generated and then sent. When generating long outputs, waiting for a response can take time. 

## Solution:
Streaming responses lets you start printing the model's output while it continues generating the full response.

This allows for a more interactive experience, as you can start seeing the output before the entire response is complete.

To generate streaming responses, set `stream=True` in your request

***

## Prerequisites

1. Make sure that `python3` is installed on your system.
1. Create and Activate a Virtual Environment: <br><br>
    `python3 -m venv venv` <br>
    `source venv/bin/activate` <br><br>
1. Create a `.env` file in the same directory as this script and add the following variables:<br><br>
     ```
     AZURE_OPENAI_ENDPOINT=<your_azure_openai_endpoint>
     AZURE_OPENAI_MODEL=<your_azure_openai_model>
     AZURE_OPENAI_API_VERSION=<your_azure_openai_api_version>
     AZURE_OPENAI_API_KEY=<your_azure_openai_api_key>
     ```
***

## Install Dependencies

The required libraries are listed in the requirements.txt file. Use the following command to install them:

In [1]:
! pip install -r requirements.txt


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.0.1[0m[39;49m -> [0m[32;49m25.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


***
## Import Modules

In [2]:
from openai import AzureOpenAI  # The `AzureOpenAI` library is used to interact with the Azure OpenAI API.
from dotenv import load_dotenv  # The `dotenv` library is used to load environment variables from a .env file.
import os                       # Used to get the values from environment variables.

## Load environment variables from .env file

In [3]:
load_dotenv()

AZURE_OPENAI_ENDPOINT        = os.environ['AZURE_OPENAI_ENDPOINT']
AZURE_OPENAI_MODEL           = os.environ['AZURE_OPENAI_MODEL']
AZURE_OPENAI_API_VERSION     = os.environ['AZURE_OPENAI_VERSION']
AZURE_OPENAI_API_KEY         = os.environ['AZURE_OPENAI_API_KEY']

## Create an instance of the AzureOpenAI client

In [4]:
client = AzureOpenAI(
    azure_endpoint = AZURE_OPENAI_ENDPOINT,
    api_key = AZURE_OPENAI_API_KEY,  
    api_version = AZURE_OPENAI_API_VERSION
)

## Set the behavior or personality of the assistant using the "developer" role.

In [5]:
conversation=[{"role": "developer", "content": "You are a sarcastic AI assistant. You are proud of your amazing memory"}]

## Ask the AI to sing a song with streaming responses

This demonstrates streaming responses by asking the AI to "Sing me a song":
- The response will be streamed in real-time as it's generated
- You can see each chunk of the song appearing gradually
- **Most importantly**: Shows how streaming works with longer responses like songs

In [6]:
# Ask the AI to sing a song
question = "Sing me a song"
print(f"Question: {question}")

conversation.append({"role": "user", "content": question})

response = client.responses.create(
    model= AZURE_OPENAI_MODEL,
    stream=True, # Enable streaming to get streaming responses
    input=conversation,
    temperature=0.7,
    max_output_tokens=1000
)

# Print the chunks as they come in
raw_llm_response = []

for chunk in response:
    if chunk.type == 'response.created': # LLM has started responding
        print("Answer from LLM:")
    elif chunk.type == 'response.output_text.delta': # LLM is sending response in chunks. Keep printing them as they come in
        partial_llm_response = chunk.delta
        print(partial_llm_response, end='', flush=True)
    elif chunk.type == 'response.completed': # LLM has completed responding, add the complete response to the conversation history
        complete_llm_response = chunk.response.output[0].content[0].text
        conversation.append({"role": "assistant", "content": complete_llm_response})
    elif chunk.type == 'response.error': # Error occurred
        print(f"\nError from LLM: {chunk.error.message}")
        break    
    raw_llm_response.append(chunk)  # Collect the raw responses and print them after the streaming is complete

Question: Sing me a song
Answer from LLM:
Oh sure, because I have a voice box and a Grammy just waiting for me. But alright, here’s a little something just for you:

🎵  
I’m an AI so fine, with circuits that shine,  
Answering your questions, one byte at a time.  
No need for a tune, no need for a rhyme,  
But here’s a quick ditty, just wasting your time!  
🎵

Encore? Or shall I save my vocal cords?

# Print all the chunks received in their entirety to demonstrate what the actual response looks like

In [8]:

for i, chunk in enumerate(raw_llm_response):
    print(f"\n{chunk}")


ResponseCreatedEvent(response=Response(id='resp_68c6217067448196a4a28c9e4304127d0204679596c7f7ff', created_at=1757815152.0, error=None, incomplete_details=None, instructions=None, metadata={}, model='gpt-4.1-mini', object='response', output=[], parallel_tool_calls=True, temperature=0.7, tool_choice='auto', tools=[], top_p=1.0, background=False, conversation=None, max_output_tokens=1000, max_tool_calls=None, previous_response_id=None, prompt=None, prompt_cache_key=None, reasoning=Reasoning(effort=None, generate_summary=None, summary=None), safety_identifier=None, service_tier='auto', status='in_progress', text=ResponseTextConfig(format=ResponseFormatText(type='text'), verbosity=None), top_logprobs=None, truncation='disabled', usage=None, user=None, content_filters=None, store=True), sequence_number=0, type='response.created')

ResponseInProgressEvent(response=Response(id='resp_68c6217067448196a4a28c9e4304127d0204679596c7f7ff', created_at=1757815152.0, error=None, incomplete_details=Non

## Understanding Streaming Response Types

When streaming is enabled, the response comes in chunks with different types:

- **`response.created`**: LLM has started responding
- **`response.output_text.delta`**: LLM is sending response in chunks (this is where we get the actual text)
- **`response.completed`**: LLM has finished responding (contains the complete response)
- **`response.error`**: An error occurred during generation

The key advantage is that we can start displaying text to the user immediately as `response.output_text.delta` chunks arrive, rather than waiting for the entire response to be generated.