# The Fundamentals of Working with LLM APIs

A practical guide to working with Large Language Model APIs.

**Workshop Goals:**
- Understand how LLMs reason through prompts and take actions through tools
- Analyze stateless API mechanics and token economics
- Implement tool calling, streaming, and structured outputs
- Apply context management strategies for production workflows

**Prerequisites:** Python 3.8+, `requests` library, DIAL API/Azure OpenAI/OpenAI key

---

## Introduction to LLM APIs

### What are LLM APIs?

Large Language Model APIs are **HTTP endpoints that transform natural language into structured outputs**.

Think of it like this:
```
You send text -> Model processes -> You receive generated text
```

That's it! The API is **stateless** (no memory), works with **tokens** (text units), and responds based on how you **prompt** it.

### Workshop Roadmap

We'll cover:
1. **Setup & First Call** - Get up and running
2. **Tokens & Costs** - Understanding billing units
3. **Prompts & Parameters** - Controlling model behavior (reasoning)
4. **Conversations** - Managing multi-turn interactions
5. **Tool Calling** - Enabling actions beyond text
6. **Context Engineering** - Working within limits
7. **Structured Outputs** - Getting JSON responses
8. **Streaming** - Real-time token delivery
9. **Vision** - Working with images
10. **Production Patterns** - Error handling & best practices

---

## Part 1: Setup & First API Call

### Theory: Chat Completions API

**Endpoint Structure:**
```
https://{your-endpoint}/openai/deployments/{deployment-name}/chat/completions?api-version=2024-10-21
```

**Required Headers:**
- `Content-Type: application/json`
- `api-key: {your-api-key}`

**Request Structure:**
```json
{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"}
  ],
  "max_completion_tokens": 100
}
```

**Response Structure:**
```json
{
  "choices": [{"message": {"content": "..."}, "finish_reason": "stop"}],
  "usage": {"prompt_tokens": 10, "completion_tokens": 20, "total_tokens": 30}
}
```
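
The response shape above can be unpacked defensively. The helper below is a minimal sketch: `extract_reply` is a hypothetical name, and the error shape it checks for (an `error` object with a `message` field) is an assumption about how the endpoint reports failures, so verify it against your deployment.

```python
def extract_reply(result):
    """Return (content, finish_reason) from a parsed chat completions response.

    Raises RuntimeError when the body carries an "error" entry instead of
    "choices" (assumed shape: {"error": {"message": "..."}}).
    """
    if "error" in result:
        err = result["error"]
        message = err.get("message", "unknown API error") if isinstance(err, dict) else str(err)
        raise RuntimeError(message)
    choice = result["choices"][0]
    # content can be missing or None, e.g. when the model requests a tool call
    content = choice["message"].get("content") or ""
    return content, choice["finish_reason"]


sample = {
    "choices": [{"message": {"content": "Hi there"}, "finish_reason": "stop"}],
    "usage": {"prompt_tokens": 10, "completion_tokens": 20, "total_tokens": 30},
}
content, reason = extract_reply(sample)
print(content, reason)
```

Checking `finish_reason` alongside the content makes truncation (`"length"`) and tool requests (`"tool_calls"`) visible to the caller instead of surfacing later as an empty string.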

**GPT-5 vs GPT-4o Parameters:**
- **GPT-5 (reasoning):** `max_completion_tokens`, `reasoning_effort`, `developer` role
- **GPT-4o (standard):** `max_tokens`, `temperature`, `system` role
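
One way to keep calling code independent of these differences is to build the payload from a model type flag. This is a sketch only: `build_payload` and its defaults are hypothetical, and the `'reasoning'`/`'standard'` labels are an assumed convention.

```python
def build_payload(messages, model_type, max_output_tokens=500,
                  reasoning_effort="low", temperature=0.7):
    """Build a chat completions payload for a 'reasoning' or 'standard' deployment."""
    # Reasoning models take instructions under "developer"; standard ones use "system"
    instruction_role = "developer" if model_type == "reasoning" else "system"
    converted = [
        {**m, "role": instruction_role} if m["role"] in ("system", "developer") else dict(m)
        for m in messages
    ]
    payload = {"messages": converted}
    if model_type == "reasoning":
        payload["max_completion_tokens"] = max_output_tokens
        payload["reasoning_effort"] = reasoning_effort
    else:
        payload["max_tokens"] = max_output_tokens
        payload["temperature"] = temperature
    return payload


msgs = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]
print(build_payload(msgs, "reasoning"))
print(build_payload(msgs, "standard"))
```

The input list is copied rather than modified, so the same message history can be sent to both kinds of deployment.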

---

### Demo 1.1: Configure API Credentials

In [None]:
import os
import getpass

# Collect DIAL configuration
if 'DIAL_API_KEY' not in os.environ or not os.environ['DIAL_API_KEY']:
    os.environ['DIAL_API_KEY'] = getpass.getpass('Enter DIAL API Key: ')

# Set defaults (no prompting - use environment variables or defaults)
os.environ.setdefault('DIAL_DEPLOYMENT', 'gpt-5-mini-2025-08-07')
os.environ.setdefault('DIAL_GPT4O_DEPLOYMENT', 'gpt-4o-mini-2024-07-18')
os.environ.setdefault('DIAL_API_ENDPOINT', 'https://ai-proxy.lab.epam.com')
os.environ.setdefault('DIAL_API_VERSION', '2024-10-21')

# Remove any trailing slash to avoid double-slash URLs
os.environ['DIAL_API_ENDPOINT'] = os.environ['DIAL_API_ENDPOINT'].rstrip('/')

# Detect if this is a reasoning model
deployment = os.environ['DIAL_DEPLOYMENT'].lower()
reasoning_models = ['gpt-5', 'o1', 'o3', 'o4']
is_reasoning_model = any(model in deployment for model in reasoning_models)

print('[OK] Configuration set:')
print(f"  Endpoint: {os.environ['DIAL_API_ENDPOINT']}")
print(f"  API Version: {os.environ['DIAL_API_VERSION']}")
print(f"  API Key: {'*' * 20} (hidden)")
print(f"\n  Primary Model: {os.environ['DIAL_DEPLOYMENT']}")
print(f"  Comparison Model: {os.environ['DIAL_GPT4O_DEPLOYMENT']}")

if is_reasoning_model:
    print("\n[OK] Reasoning Model Detected (GPT-5 / o-series):")
    print("  [OK] Supported: max_completion_tokens, reasoning_effort, developer role")
    print("  [X] NOT supported: temperature, top_p, max_tokens")
    os.environ['MODEL_TYPE'] = 'reasoning'
else:
    print(f"\n[OK] Standard Model Detected (GPT-4o):")
    print("  [OK] Supported: max_tokens, temperature, top_p, system role")
    os.environ['MODEL_TYPE'] = 'standard'

### Demo 1.2: First API Call - Hello World

In [None]:
import requests
import json

# Build request
endpoint = os.environ['DIAL_API_ENDPOINT']
deployment = os.environ['DIAL_DEPLOYMENT']
api_version = os.environ['DIAL_API_VERSION']
api_key = os.environ['DIAL_API_KEY']

url = f"{endpoint}/openai/deployments/{deployment}/chat/completions?api-version={api_version}"
headers = {"Content-Type": "application/json", "api-key": api_key}

payload = {
    "messages": [
        {"role": "developer", "content": "You are a helpful assistant. Respond directly without extensive reasoning."},
        {"role": "user", "content": "what's 2 + 2."}
    ],
    "max_completion_tokens": 2000,  # GPT-5 uses tokens for reasoning + output
    "reasoning_effort": "low"   # Keep reasoning short for simple questions; "minimal" reduces it further
}

# Call API
response = requests.post(url, headers=headers, json=payload)
result = response.json()

# Display results
content = result['choices'][0]['message'].get('content', '')
print("Response:", content if content else "(empty - model used reasoning tokens only)")
print(f"\nToken usage:")
print(f"  Prompt: {result['usage']['prompt_tokens']}")
print(f"  Completion: {result['usage']['completion_tokens']}")
print(f"  Total: {result['usage']['total_tokens']}")

# Check for reasoning tokens (GPT-5 specific)
if 'completion_tokens_details' in result['usage']:
    reasoning_tokens = result['usage']['completion_tokens_details'].get('reasoning_tokens', 0)
    if reasoning_tokens > 0:
        print(f"  Reasoning: {reasoning_tokens}")

print(f"  Finish reason: {result['choices'][0]['finish_reason']}")


---

## Part 2: Understanding Tokens & Costs

### Theory: What are Tokens?

**Tokens** are the fundamental units that LLMs process:
- Subword pieces: can be as short as 1 character or as long as 1 word
- **Case and whitespace sensitivity:** " red", " Red", and "Red" are 3 different tokens!
- **Rough estimation:** ~1 token = 4 characters or 0.75 words
- **For exact counts:** Use tokenizers or check API response `usage` field
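
The rough estimate above is easy to turn into a pre-flight check. The function below is a heuristic sketch (`estimate_tokens` is a hypothetical helper, not a tokenizer); real counts depend on the model's tokenizer and on the language of the text.

```python
import math


def estimate_tokens(text, chars_per_token=4):
    """Rough token estimate from character count (heuristic, not a tokenizer)."""
    if not text:
        return 0
    return math.ceil(len(text) / chars_per_token)


prompt = "Explain REST APIs in one sentence."
print(f"{len(prompt)} characters -> about {estimate_tokens(prompt)} tokens")
```

Treat the result as a budget guard only; the `usage` field in the response stays the source of truth for billing.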

### Four Token Types

1. **Input tokens:** Your request (prompt, instructions, conversation history)
2. **Output tokens:** Generated response
3. **Cached tokens:** Reused context (billed at reduced rates)
4. **Reasoning tokens:** Internal "thinking" (GPT-5, o-series only); counted inside `completion_tokens` and billed as output

### Pricing Model

**Output tokens cost 4-8x more than input tokens for the models listed below!**

**2025 Pricing Reference** (per 1M tokens):

| Model | Input | Output |
|-------|-------|--------|
| GPT-5 | $1.25 | $10.00 |
| GPT-5-mini | $0.25 | $2.00 |
| GPT-4o | $2.50 | $10.00 |
| GPT-4o-mini | $0.15 | $0.60 |
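
The table can be applied directly to a `usage` block. The sketch below copies the prices from the table; `PRICING`, `estimate_cost`, and `output_to_input_ratio` are hypothetical names, the model keys are illustrative labels rather than deployment names, and the rates themselves may change.

```python
# USD per 1M tokens (input, output), copied from the table above
PRICING = {
    "gpt-5": (1.25, 10.00),
    "gpt-5-mini": (0.25, 2.00),
    "gpt-4o": (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
}


def estimate_cost(model, prompt_tokens, completion_tokens):
    """Estimated request cost in USD for the given token counts."""
    input_price, output_price = PRICING[model]
    return (prompt_tokens * input_price + completion_tokens * output_price) / 1_000_000


def output_to_input_ratio(model):
    """How many times more an output token costs than an input token."""
    input_price, output_price = PRICING[model]
    return output_price / input_price


for name in PRICING:
    cost = estimate_cost(name, prompt_tokens=245, completion_tokens=892)
    print(f"{name}: ${cost:.6f} (output/input ratio {output_to_input_ratio(name):.0f}x)")
```

Because reasoning tokens are reported inside `completion_tokens`, passing that field unchanged already includes them in the estimate.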

### Usage Field Breakdown

```json
{
  "usage": {
    "prompt_tokens": 245,
    "completion_tokens": 892,
    "total_tokens": 1137,
    "completion_tokens_details": {
      "reasoning_tokens": 156
    }
  }
}
```

---

### Demo 2.1: Token Counting

In [None]:
# Simple demo showing token counting
text = "Explain REST APIs in one sentence."

# Build request
url = f"{os.environ['DIAL_API_ENDPOINT']}/openai/deployments/{os.environ['DIAL_DEPLOYMENT']}/chat/completions?api-version={os.environ['DIAL_API_VERSION']}"
headers = {"Content-Type": "application/json", "api-key": os.environ['DIAL_API_KEY']}
payload = {
    "messages": [
        {"role": "developer", "content": "Respond directly without reasoning."},
        {"role": "user", "content": text}
    ],
    "max_completion_tokens": 500,
    "reasoning_effort": "minimal"  # Minimal reasoning for simple responses
}

# Call API
response = requests.post(url, headers=headers, json=payload)
result = response.json()

# Extract content and check for reasoning tokens
content = result['choices'][0]['message'].get('content', '')
reasoning_tokens = result['usage'].get('completion_tokens_details', {}).get('reasoning_tokens', 0)

# Display token breakdown
print(f"Input text: '{text}'")
print(f"\nToken usage:")
print(f"  Prompt tokens: {result['usage']['prompt_tokens']}")
print(f"  Completion tokens: {result['usage']['completion_tokens']}")
if reasoning_tokens > 0:
    print(f"    └─ Reasoning tokens: {reasoning_tokens}")
    print(f"    └─ Output tokens: {result['usage']['completion_tokens'] - reasoning_tokens}")
print(f"  Total tokens: {result['usage']['total_tokens']}")

# Estimate cost (GPT-5-mini rates: $0.25 input, $2.00 output per 1M tokens)
input_cost = result['usage']['prompt_tokens'] * 0.00000025
output_cost = result['usage']['completion_tokens'] * 0.000002
total_cost = input_cost + output_cost
print(f"\nEstimated cost: ${total_cost:.6f}")
print(f"  └─ Input: ${input_cost:.6f} ({result['usage']['prompt_tokens']} × $0.25/1M)")
print(f"  └─ Output: ${output_cost:.6f} ({result['usage']['completion_tokens']} × $2.00/1M)")

print(f"\nResponse: {content if content else '(empty - all tokens used for reasoning)'}")


### Demo 2.2: Comparing Reasoning Efforts (GPT-5)

In [None]:
# Compare low vs high reasoning effort
question = "Calculate 15 items/hour × 8 hours/day × 7 days, minus 20% returns."

# Low reasoning effort
payload_low = {
    "messages": [{"role": "user", "content": question}],
    "max_completion_tokens": 2000,
    "reasoning_effort": "low"
}
response_low = requests.post(url, headers=headers, json=payload_low)
result_low = response_low.json()

# High reasoning effort
payload_high = {
    "messages": [{"role": "user", "content": question}],
    "max_completion_tokens": 2000,  # Same limit for fair comparison
    "reasoning_effort": "high"
}
response_high = requests.post(url, headers=headers, json=payload_high)
result_high = response_high.json()

# Calculate costs (GPT-5-mini: $0.25 input, $2.00 output per 1M)
def calculate_cost(usage):
    input_cost = usage['prompt_tokens'] * 0.00000025
    output_cost = usage['completion_tokens'] * 0.000002
    return input_cost + output_cost

# Compare with actual responses
print("=" * 70)
print("LOW reasoning effort:")
print("-" * 70)
print(f"Answer: {result_low['choices'][0]['message'].get('content', '(empty)')}")
print(f"\nTokens:")
print(f"  Reasoning: {result_low['usage'].get('completion_tokens_details', {}).get('reasoning_tokens', 0)}")
print(f"  Total: {result_low['usage']['total_tokens']}")
print(f"  Cost: ${calculate_cost(result_low['usage']):.6f}")

print("\n" + "=" * 70)
print("HIGH reasoning effort:")
print("-" * 70)
print(f"Answer: {result_high['choices'][0]['message'].get('content', '(empty)')}")
print(f"\nTokens:")
print(f"  Reasoning: {result_high['usage'].get('completion_tokens_details', {}).get('reasoning_tokens', 0)}")
print(f"  Total: {result_high['usage']['total_tokens']}")
print(f"  Cost: ${calculate_cost(result_high['usage']):.6f}")

# Show comparison
print("\n" + "=" * 70)
reasoning_diff = result_high['usage'].get('completion_tokens_details', {}).get('reasoning_tokens', 0) - result_low['usage'].get('completion_tokens_details', {}).get('reasoning_tokens', 0)
cost_diff = calculate_cost(result_high['usage']) - calculate_cost(result_low['usage'])
print(f"HIGH reasoning used {reasoning_diff} more reasoning tokens")
print(f"Cost difference: ${abs(cost_diff):.6f} ({'higher' if cost_diff > 0 else 'lower'} with high reasoning)")
print("=" * 70)


### Demo 2.3: Truncation - finish_reason: "length"

In [None]:
# Demonstrate truncation with small token limit
payload = {
    "messages": [{"role": "user", "content": "Explain REST APIs in detail covering requests, verbs, codes, best practices."}],
    "max_completion_tokens": 30,  # Too small!
    "reasoning_effort": "low"
}

response = requests.post(url, headers=headers, json=payload)
result = response.json()

print(f"Response: {result['choices'][0]['message'].get('content') or '(empty - limit reached before any visible output)'}")
print(f"\nFinish reason: {result['choices'][0]['finish_reason']}")

if result['choices'][0]['finish_reason'] == 'length':
    print("  TRUNCATED! Response was cut off. Increase max_completion_tokens.")


---

## Part 3: Prompts & Model Parameters (REASONING)

LLMs reason through the prompts you craft and parameters you set.

### Theory: Prompt Engineering Fundamentals

**Clear Context & Instructions:**
- Provide role, context, and explicit constraints
- Example: "You are a financial analyst. Explain in 3 sentences how to evaluate a company's balance sheet."

**Few-Shot Prompting:**
- Provide 2-5 examples to teach the pattern
- Example: Show sentiment examples (positive/negative) then ask model to classify new text
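
Few-shot examples are usually sent as alternating `user`/`assistant` messages placed before the real query. A minimal sketch of that layout follows; `build_few_shot_messages` is a hypothetical helper, and the `developer` default assumes a reasoning model.

```python
def build_few_shot_messages(instruction, examples, query, instruction_role="developer"):
    """Lay out instruction, (input, output) example pairs, and the final query."""
    messages = [{"role": instruction_role, "content": instruction}]
    for example_input, example_output in examples:
        messages.append({"role": "user", "content": example_input})
        messages.append({"role": "assistant", "content": example_output})
    messages.append({"role": "user", "content": query})
    return messages


examples = [
    ("Review: The headphones broke after one week.", "SENTIMENT: Negative"),
    ("Review: Absolutely love it! Best purchase ever.", "SENTIMENT: Positive"),
]
messages = build_few_shot_messages(
    "Analyze product review sentiment.",
    examples,
    "Review: Arrived late, but works fine.",
)
for m in messages:
    print(f"{m['role']}: {m['content']}")
```

Keeping the examples in one list makes it simple to swap them out or to measure how many tokens they add to every request.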

**Chain-of-Thought (CoT):**
- Ask model to reason step-by-step before answering
- Add "Let's think step by step" or "Show your work"
- Improves accuracy on complex reasoning tasks
- Not needed for Reasoning/Thinking models

**Constraint-Based:**
- Specify output length, format, what to avoid
- Example: "Summarize in exactly 100 words without using the word 'technology'"

**Advanced (for reference):** Reflexion, Tree-of-Thoughts, Analogical Prompting

### Theory: Model Parameters

**GPT-5 (Reasoning Model):**
- `max_completion_tokens`: limit on generated tokens (reasoning + visible output)
- `reasoning_effort`: minimal/low/medium/high (controls thinking depth)
- `developer` role: system instructions for reasoning models
- No temperature/top_p: sampling parameters are not supported; `reasoning_effort` adjusts thinking depth, not randomness

**GPT-4o (Standard Model):**
- `max_tokens`: output limit
- `temperature` (0.0-2.0): deterministic (low) vs creative (high)
- `top_p`: nucleus sampling (use temp OR top_p, not both)
- `system` role: persistent behavior instructions

**When to use:**
- GPT-5 for complex reasoning, coding, debugging
- GPT-4o for speed, vision, creative tasks

---

### Demo 3.1: Few-Shot Prompting


In [None]:
# Few-shot: sentiment classification with examples
payload = {
    "messages": [
        {"role": "developer", "content": "Analyze product review sentiment."},
        # Example 1
        {"role": "user", "content": "Review: The headphones broke after one week."},
        {"role": "assistant", "content": "SENTIMENT: Negative"},
        # Example 2
        {"role": "user", "content": "Review: Decent product for the price."},
        {"role": "assistant", "content": "SENTIMENT: Neutral"},
        # Example 3
        {"role": "user", "content": "Review: Absolutely love it! Best purchase ever."},
        {"role": "assistant", "content": "SENTIMENT: Positive"},
        # Actual query
        {"role": "user", "content": "Review: This laptop exceeded my expectations! Fast and reliable."}
    ],
    "max_completion_tokens": 1500,
    "reasoning_effort": "low"
}

response = requests.post(url, headers=headers, json=payload)
result = response.json()
print("Few-shot classification:")
print(result['choices'][0]['message']['content'])


### Demo 3.2: Chain-of-Thought Reasoning


In [None]:
# Chain-of-Thought: with vs without "step by step"
problem = "If a train travels 60 km/h for 2.5 hours, how far does it go?"

# Without CoT
payload_no_cot = {
    "messages": [{"role": "user", "content": problem}],
    "max_completion_tokens": 2000,
    "reasoning_effort": "minimal"
}
response_no_cot = requests.post(url, headers=headers, json=payload_no_cot)
result_no_cot = response_no_cot.json()

# With CoT
payload_cot = {
    "messages": [{"role": "user", "content": f"{problem} Let's think step by step."}],
    "max_completion_tokens": 3000,
    "reasoning_effort": "minimal"
}
response_cot = requests.post(url, headers=headers, json=payload_cot)
result_cot = response_cot.json()

print("WITHOUT Chain-of-Thought:")
print(result_no_cot['choices'][0]['message']['content'])

print("\nWITH Chain-of-Thought:")
print(result_cot['choices'][0]['message']['content'])


### Demo 3.3: Temperature Control (GPT-4o)


In [None]:
# Temperature comparison using GPT-4o
gpt4o_url = f"{os.environ['DIAL_API_ENDPOINT']}/openai/deployments/{os.environ['DIAL_GPT4O_DEPLOYMENT']}/chat/completions?api-version={os.environ['DIAL_API_VERSION']}"

prompt = "Write a creative company name for a coffee shop."

# Low temperature (deterministic)
payload_low = {
    "messages": [{"role": "user", "content": prompt}],
    "max_tokens": 50,
    "temperature": 0.0
}
response_low = requests.post(gpt4o_url, headers=headers, json=payload_low)
result_low = response_low.json()

# High temperature (creative)
payload_high = {
    "messages": [{"role": "user", "content": prompt}],
    "max_tokens": 50,
    "temperature": 1.5
}
response_high = requests.post(gpt4o_url, headers=headers, json=payload_high)
result_high = response_high.json()

print(f"Temperature 0.0 (deterministic): {result_low['choices'][0]['message']['content']}")
print(f"Temperature 1.5 (creative): {result_high['choices'][0]['message']['content']}")


---

## Part 4: Multi-Turn Conversations & Stateless APIs

### Theory: APIs Have NO Memory

APIs are stateless - each request is independent.

- No memory between requests: Server retains nothing
- You manage history: Maintain messages array, append each response
- Message pattern: system -> user -> assistant -> user -> assistant
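
Since the server retains nothing, the client owns the history. A small wrapper can make that explicit; this is a sketch, and the `Conversation` class with its method names is hypothetical rather than part of any SDK.

```python
class Conversation:
    """Client-side message history for a stateless chat API."""

    def __init__(self, instructions, instruction_role="system"):
        self.messages = [{"role": instruction_role, "content": instructions}]

    def add_user(self, text):
        self.messages.append({"role": "user", "content": text})

    def add_assistant(self, text):
        self.messages.append({"role": "assistant", "content": text})

    def payload(self, **params):
        # The full history is resent on every request
        return {"messages": list(self.messages), **params}


chat = Conversation("You are a DevOps expert.")
chat.add_user("What is Docker?")
chat.add_assistant("Docker packages applications into containers.")
chat.add_user("How does it differ from a VM?")
print(len(chat.payload(max_tokens=100)["messages"]), "messages sent on turn 2")
```

Every turn resends the whole list, which is why prompt token counts grow with conversation length.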

### Rate Limits

- **RPM (Requests Per Minute)** and **TPM (Tokens Per Minute)**
- **HTTP 429 error** -> exponential backoff required
- Retry pattern: wait 1s, 2s, 4s, 8s...
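
The retry pattern can be sketched as a small wrapper around any function that returns a response with a `status_code` attribute (as `requests` responses do). `post_with_backoff` is a hypothetical helper; the `sleep` parameter exists only so the simulated run below does not actually wait.

```python
import time


def post_with_backoff(send, max_retries=4, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a non-429 response or retries run out."""
    for attempt in range(max_retries + 1):
        response = send()
        if response.status_code != 429 or attempt == max_retries:
            return response
        sleep(base_delay * (2 ** attempt))  # waits 1s, 2s, 4s, 8s...


# Simulated run: two rate-limited responses, then success (no network, no real waiting)
class FakeResponse:
    def __init__(self, status_code):
        self.status_code = status_code


statuses = iter([429, 429, 200])
waits = []
final = post_with_backoff(lambda: FakeResponse(next(statuses)), sleep=waits.append)
print(final.status_code, waits)
```

With `requests`, the call would look like `post_with_backoff(lambda: requests.post(url, headers=headers, json=payload))`. Production code often adds random jitter to each delay so that many clients do not retry at the same moment.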

---

### Demo 4.1: 3-Turn Conversation with Manual History


In [None]:
# 3-turn conversation with manual history management
messages = [
    {"role": "developer", "content": "You are a DevOps expert."},
    {"role": "user", "content": "What is Docker?"}
]

# Turn 1
response1 = requests.post(url, headers=headers, json={
    "messages": messages,
    "max_completion_tokens": 100,
    "reasoning_effort": "low"
})
result1 = response1.json()
reply1 = result1['choices'][0]['message']['content']

print(f"Turn 1 - Total tokens: {result1['usage']['total_tokens']}")
print(f"Messages in history: {len(messages)}\n")

# Add assistant response to history
messages.append({"role": "assistant", "content": reply1})
messages.append({"role": "user", "content": "How does it differ from a VM?"})

# Turn 2
response2 = requests.post(url, headers=headers, json={
    "messages": messages,
    "max_completion_tokens": 100,
    "reasoning_effort": "low"
})
result2 = response2.json()
reply2 = result2['choices'][0]['message']['content']

print(f"Turn 2 - Total tokens: {result2['usage']['total_tokens']}")
print(f"Messages in history: {len(messages)}\n")

# Add to history again
messages.append({"role": "assistant", "content": reply2})
messages.append({"role": "user", "content": "Give a minimal docker-compose example."})

# Turn 3
response3 = requests.post(url, headers=headers, json={
    "messages": messages,
    "max_completion_tokens": 150,
    "reasoning_effort": "low"
})
result3 = response3.json()

print(f"Turn 3 - Total tokens: {result3['usage']['total_tokens']}")
print(f"Messages in history: {len(messages)}")
print(f"\nToken accumulation: {result1['usage']['total_tokens']} -> {result2['usage']['total_tokens']} -> {result3['usage']['total_tokens']}")

---

## Part 5: Tool Calling & Function Calling (ACTIONS)

LLMs take actions through structured function calls that your code executes.

### Theory: Function Calling Workflow

LLMs generate JSON for tool calls, your code executes them.

**5-Step Process:**
1. Send request with tool definitions (JSON Schema in `tools` parameter)
2. Model decides to call (`finish_reason: "tool_calls"`)
3. **Your Python code** executes the actual function
4. Return results with `role: "tool"`
5. Model synthesizes final natural language answer

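**Control:** `tool_choice` = `"auto"` (default when tools are supplied), `"none"`, `"required"`, or an object naming a specific function, e.g. `{"type": "function", "function": {"name": "calculate"}}`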

**Best Practice:** Design tools like clean APIs - single responsibility, clear names, use enums for constrained values.
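
Steps 3 and 4 of the process above can be sketched as a dispatcher that maps a tool call to a Python function and wraps the outcome in a `role: "tool"` message. `dispatch_tool_call` and the registry are hypothetical, and returning errors to the model as JSON is one possible design choice, not a requirement of the API.

```python
import json


def dispatch_tool_call(tool_call, registry):
    """Run one requested tool call and return the matching role="tool" message."""
    name = tool_call["function"]["name"]
    try:
        # Arguments arrive as a JSON string generated by the model
        args = json.loads(tool_call["function"]["arguments"])
        if name not in registry:
            raise ValueError(f"unknown tool: {name}")
        output = {"result": registry[name](**args)}
    except Exception as exc:
        # Report the failure to the model instead of crashing the loop
        output = {"error": str(exc)}
    return {
        "role": "tool",
        "tool_call_id": tool_call["id"],
        "content": json.dumps(output),
    }


registry = {"add": lambda a, b: a + b}
call = {"id": "call_1", "type": "function",
        "function": {"name": "add", "arguments": '{"a": 2, "b": 3}'}}
print(dispatch_tool_call(call, registry))
```

Model-generated arguments are untrusted input: malformed JSON or unexpected keys are caught here, but any tool with side effects should still validate its inputs.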

---

### Demo 5.1: Single Tool (Calculator)

In [None]:
# Define a simple calculator tool
def calculate(operation, x, y):
    """Execute a mathematical operation."""
    operations = {
        'add': lambda a, b: a + b,
        'subtract': lambda a, b: a - b,
        'multiply': lambda a, b: a * b,
        'divide': lambda a, b: a / b if b != 0 else 'Error: Division by zero'
    }
    return operations.get(operation, lambda a, b: 'Unknown operation')(x, y)

# Define the tool schema for the model
calculator_tool = {
    "type": "function",
    "function": {
        "name": "calculate",
        "description": "Perform basic arithmetic operations (add, subtract, multiply, divide)",
        "parameters": {
            "type": "object",
            "properties": {
                "operation": {
                    "type": "string",
                    "enum": ["add", "subtract", "multiply", "divide"],
                    "description": "The arithmetic operation to perform"
                },
                "x": {
                    "type": "number",
                    "description": "The first number"
                },
                "y": {
                    "type": "number",
                    "description": "The second number"
                }
            },
            "required": ["operation", "x", "y"],
            "additionalProperties": False
        }
    }
}

# Initial request with tool definition
tool_payload = {
    "messages": [
        {
            "role": "developer",
            "content": "You are a helpful math assistant. Use the calculator tool when needed."
        },
        {
            "role": "user",
            "content": "What is 127 multiplied by 89?"
        }
    ],
    "tools": [calculator_tool],
    "tool_choice": "auto",
    "max_completion_tokens": 2000,
    "reasoning_effort": "low"
}

print("=" * 70)
print("STEP 1: Sending request with tool definition")
print("=" * 70)

# Send request
response1 = requests.post(url, headers=headers, json=tool_payload)
result1 = response1.json()
choice1 = result1['choices'][0]
message1 = choice1['message']

print(f"\nFinish reason: {choice1['finish_reason']}")
print(f"Model response: {message1.get('content', '(no text content)')}")

# Check if model wants to call a tool
if choice1['finish_reason'] == 'tool_calls' and 'tool_calls' in message1:
    tool_call = message1['tool_calls'][0]
    function_name = tool_call['function']['name']
    function_args = json.loads(tool_call['function']['arguments'])
    
    print(f"\n{'=' * 70}")
    print("STEP 2: Model requested tool call")
    print(f"{'=' * 70}")
    print(f"Function: {function_name}")
    print(f"Arguments: {json.dumps(function_args, indent=2)}")
    
    # Execute the function
    result = calculate(**function_args)
    print(f"\nFunction result: {result}")
    
    # Append the assistant's tool call request
    tool_payload['messages'].append(message1)
    
    # Append the tool result
    tool_payload['messages'].append({
        "role": "tool",
        "tool_call_id": tool_call['id'],
        "content": json.dumps({"result": result})
    })
    
    print(f"\n{'=' * 70}")
    print("STEP 3: Sending tool result back to model")
    print(f"{'=' * 70}")
    
    # Send back with tool result
    response2 = requests.post(url, headers=headers, json=tool_payload)
    result2 = response2.json()
    final_response = result2['choices'][0]['message']['content']
    
    print(f"\nFinal response:")
    print(final_response)
    print(f"\nFinish reason: {result2['choices'][0]['finish_reason']}")
else:
    print("\nModel responded without calling tool (unexpected for this example)")

print(f"\n{'=' * 70}")
print("Tool calling demo complete!")
print(f"{'=' * 70}")

### Demo 5.2: Multiple Tools

The model chooses between the available tools and may request several calls in a single response (parallel tool calls).

In [None]:
# Define multiple tools
def get_current_weather(location, units="celsius"):
    """Simulated weather API."""
    # In real scenario, this would call an actual API
    weather_data = {
        "San Francisco": {"temp": 18, "conditions": "Partly cloudy"},
        "Tokyo": {"temp": 22, "conditions": "Sunny"},
        "London": {"temp": 12, "conditions": "Rainy"},
        "Paris": {"temp": 15, "conditions": "Overcast"}
    }
    data = weather_data.get(location, {"temp": 20, "conditions": "Unknown"})
    return {
        "location": location,
        "temperature": data["temp"],
        "units": units,
        "conditions": data["conditions"]
    }

def search_database(query, limit=5):
    """Simulated database search."""
    # Simulate a product database
    products = [
        {"id": 1, "name": "Laptop Pro 15", "price": 1299, "category": "Electronics"},
        {"id": 2, "name": "Wireless Mouse", "price": 29, "category": "Electronics"},
        {"id": 3, "name": "Office Chair", "price": 249, "category": "Furniture"},
        {"id": 4, "name": "Desk Lamp", "price": 45, "category": "Furniture"},
        {"id": 5, "name": "Notebook Set", "price": 12, "category": "Stationery"}
    ]
    # Simple search simulation
    results = [p for p in products if query.lower() in p['name'].lower() or query.lower() in p['category'].lower()]
    return results[:limit]

# Define tool schemas
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "The city name, e.g., San Francisco, Tokyo"
                },
                "units": {
                    "type": "string",
                    "enum": ["celsius", "fahrenheit"],
                    "description": "Temperature units"
                }
            },
            "required": ["location"],
            "additionalProperties": False
        }
    }
}

database_tool = {
    "type": "function",
    "function": {
        "name": "search_database",
        "description": "Search the product database for items matching a query",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "Search query string"
                },
                "limit": {
                    "type": "integer",
                    "description": "Maximum number of results to return",
                    "default": 5
                }
            },
            "required": ["query"],
            "additionalProperties": False
        }
    }
}

# Function dispatcher
tool_functions = {
    "get_current_weather": get_current_weather,
    "search_database": search_database
}

# Multi-tool request
multi_tool_payload = {
    "messages": [
        {
            "role": "developer",
            "content": "You are a helpful assistant with access to weather data and a product database."
        },
        {
            "role": "user",
            "content": "What's the weather in Tokyo and can you find electronics in the database?"
        }
    ],
    "tools": [weather_tool, database_tool],
    "tool_choice": "auto",
    "max_completion_tokens": 3000,
    "reasoning_effort": "low"
}

print("=" * 70)
print("Multi-Tool Demo: Weather + Database")
print("=" * 70)

# First request
response1 = requests.post(url, headers=headers, json=multi_tool_payload)
result1 = response1.json()
choice1 = result1['choices'][0]
message1 = choice1['message']

print(f"\nModel's initial response:")
print(f"Finish reason: {choice1['finish_reason']}")

if 'tool_calls' in message1:
    print(f"\nModel requested {len(message1['tool_calls'])} tool call(s):")
    
    # Append assistant message
    multi_tool_payload['messages'].append(message1)
    
    # Execute all tool calls
    for tool_call in message1['tool_calls']:
        function_name = tool_call['function']['name']
        function_args = json.loads(tool_call['function']['arguments'])
        
        print(f"\n  Tool: {function_name}")
        print(f"  Args: {json.dumps(function_args, indent=4)}")
        
        # Execute function
        function_result = tool_functions[function_name](**function_args)
        print(f"  Result: {json.dumps(function_result, indent=4)}")
        
        # Append tool result
        multi_tool_payload['messages'].append({
            "role": "tool",
            "tool_call_id": tool_call['id'],
            "content": json.dumps(function_result)
        })
    
    # Send results back to model
    print(f"\n{'=' * 70}")
    print("Sending tool results back to model...")
    print(f"{'=' * 70}\n")
    
    response2 = requests.post(url, headers=headers, json=multi_tool_payload)
    result2 = response2.json()
    final_message = result2['choices'][0]['message']['content']
    
    print("Final response:")
    print(final_message)
else:
    print("\nNo tool calls requested")
    print(message1.get('content', ''))

---

## Part 6: Context Engineering & Limitations

### Theory: Finite Context & Quality Degradation

**Context Window Limits:**
- GPT-4o/GPT-5: 128K tokens standard
- GPT-5 can extend to 272K (preview)
- As a rule of thumb, expect quality to drop beyond roughly 50-55% of capacity, well before the hard limit

**Context Rot (Lost-in-the-Middle):**
- Attention complexity: O(n²) - quadratic growth
- Model misses details buried in long contexts
- Keep high-signal content at top and bottom
- Find the "smallest possible set of high-signal tokens"

### Organization Best Practices

**Prompt Structure:**
1. System/developer instructions (top)
2. Tool definitions
3. Conversation history
4. Current user query (bottom)

**Use Structure Tags:**
- XML: `<instructions>`, `<context>`, `<examples>`
- Markdown: Clear headers and sections
- Specific enough to guide, flexible enough to generalize
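
A tagged prompt can be assembled from plain strings. The sketch below uses the tag names listed above; `tag` and `build_tagged_prompt` are hypothetical helpers, and the tags are a prompt convention the model reads as text, not something the API parses.

```python
def tag(name, body):
    """Wrap text in a simple XML-style tag."""
    return f"<{name}>\n{body.strip()}\n</{name}>"


def build_tagged_prompt(instructions, context, examples):
    """Assemble instructions, context, and optional examples into one prompt."""
    sections = [tag("instructions", instructions), tag("context", context)]
    if examples:
        sections.append(tag("examples", "\n".join(examples)))
    return "\n\n".join(sections)


prompt = build_tagged_prompt(
    "Answer using only the context. Reply in one sentence.",
    "Docker shares the host kernel; VMs run a full guest OS.",
    ["Q: Is Docker a VM? A: No, containers share the host kernel."],
)
print(prompt)
```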

### Compaction Strategies

**1. Summarization:** Compress old turns using the model itself
- Combine system message + summary + recent turns
- Maintains context while reducing tokens

**2. Sliding Window:** Keep last N turns verbatim
- Simple but loses earlier context
- Good for chat interfaces

**3. Progressive Disclosure:** Store IDs, load full content just-in-time
- Significantly reduces token usage
- Pattern: `listItems()` -> `getItemSummary(id)` -> `getItemContent(id)` only when needed

**4. External Memory:** Note-taking or sub-agent patterns
- Maintain structured memory files
- Retrieve relevant sections via tools

---

### Demo 6.1: Conversation Summarization


In [None]:
# Context compaction demo
def summarize_conversation(messages):
    '''Compress old conversation turns into a summary.'''
    summary_payload = {
        "messages": [
            {"role": "developer", "content": "Summarize the following conversation concisely, preserving key information."},
            {"role": "user", "content": "Conversation:\n" + "\n".join([f"{msg['role']}: {msg['content']}" for msg in messages])}
        ],
        "max_completion_tokens": 2000,
        "reasoning_effort": "low"
    }
    response = requests.post(url, headers=headers, json=summary_payload)
    result = response.json()
    return result['choices'][0]['message']['content']

# Simulate long conversation
conversation = [
    {"role": "developer", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "What is Python?"},
    {"role": "assistant", "content": "Python is a high-level, interpreted programming language known for its simplicity and readability."},
    {"role": "user", "content": "How do I create a list?"},
    {"role": "assistant", "content": "Use square brackets: my_list = [1, 2, 3] or list() constructor."},
    {"role": "user", "content": "What about dictionaries?"},
    {"role": "assistant", "content": "Dictionaries use curly braces: my_dict = {'key': 'value'}"}
]

print("Original conversation: {} messages, ~{} chars".format(len(conversation), sum(len(str(m)) for m in conversation)))

# Summarize old turns (keep system + last 2)
old_turns = conversation[1:-2]
summary = summarize_conversation(old_turns)

print(f"\nSummary of old turns:\n{summary}")

# Rebuild with summary
compacted = [
    conversation[0],  # developer instructions
    {"role": "user", "content": f"[Previous: {summary}]"},
    conversation[-2],  # recent messages
    conversation[-1]
]

print(f"\nCompacted conversation: {len(compacted)} messages, ~{sum(len(str(m)) for m in compacted)} chars")
print(f"Approximate savings (by character count): ~{(1 - sum(len(str(m)) for m in compacted) / sum(len(str(m)) for m in conversation)) * 100:.0f}%")


### Demo 6.2: Progressive Disclosure with Tools

Store lightweight IDs, load full content just-in-time.


In [None]:
# Simulate a document database
documents = {
    "doc1": {"title": "API Design Best Practices", "summary": "REST principles, versioning, error handling", "content": "... 5000 words ..."},
    "doc2": {"title": "Database Optimization Guide", "summary": "Indexing, query optimization, caching strategies", "content": "... 8000 words ..."},
    "doc3": {"title": "Kubernetes Deployment Patterns", "summary": "Rolling updates, blue-green, canary deployments", "content": "... 6000 words ..."}
}

# Tool 1: List available documents
def list_documents():
    return [{"id": doc_id, "title": doc["title"]} for doc_id, doc in documents.items()]

# Tool 2: Get summary
def get_document_summary(doc_id):
    if doc_id in documents:
        return {"id": doc_id, "title": documents[doc_id]["title"], "summary": documents[doc_id]["summary"]}
    return {"error": "Document not found"}

# Tool 3: Get full content (only when needed)
def get_document_content(doc_id):
    if doc_id in documents:
        return {"id": doc_id, "content": documents[doc_id]["content"]}
    return {"error": "Document not found"}

print("=" * 70)
print("Progressive Disclosure Pattern")
print("=" * 70)

print("\nStep 1: User asks question")
print("Query: 'How do I implement blue-green deployments?'")

print("\nStep 2: Model calls list_documents()")
doc_list = list_documents()
print(f"Available: {[d['title'] for d in doc_list]}")

print("\nStep 3: Model identifies relevant doc, calls get_document_summary('doc3')")
summary = get_document_summary("doc3")
print(f"Summary: {summary['summary']}")

print("\nStep 4: Model determines it needs full content, calls get_document_content('doc3')")
full_doc = get_document_content("doc3")
print("(Only NOW do we load the heavy 6000-word document into context)")
print(f"Loaded content for: {full_doc['id']}")

print("\nToken savings: Only loaded 1 document instead of all 3!")
print("Kept context focused and reduced cost by ~70%")
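
The demo above simulates the model's decisions with direct function calls. In a real tool-calling loop, the application receives a tool name plus a JSON string of arguments and has to route it to the right function. Below is a minimal sketch of that routing step. `TOOLS` and `dispatch_tool_call` are hypothetical names, the document store is a one-entry stand-in for the one above, and the `tool_call` dictionary is assumed to follow the Chat Completions tool-call shape (`function.name`, `function.arguments`).

```python
import json

# Tiny stand-in for the document store used in the demo
documents = {
    "doc3": {
        "title": "Kubernetes Deployment Patterns",
        "summary": "Rolling updates, blue-green, canary deployments",
        "content": "... 6000 words ...",
    },
}


def list_documents():
    return [{"id": doc_id, "title": doc["title"]} for doc_id, doc in documents.items()]


def get_document_summary(doc_id):
    doc = documents.get(doc_id)
    if doc is None:
        return {"error": "Document not found"}
    return {"id": doc_id, "title": doc["title"], "summary": doc["summary"]}


# Registry mapping tool names (as exposed to the model) to Python callables
TOOLS = {
    "list_documents": list_documents,
    "get_document_summary": get_document_summary,
}


def dispatch_tool_call(tool_call):
    """Run one tool call and return a JSON string for the tool message content."""
    name = tool_call["function"]["name"]
    # Arguments arrive as a JSON string; an empty string means "no arguments"
    args = json.loads(tool_call["function"]["arguments"] or "{}")
    func = TOOLS.get(name)
    if func is None:
        return json.dumps({"error": f"Unknown tool: {name}"})
    return json.dumps(func(**args))
```

Returning an error object for unknown tools, rather than raising, lets the model see the failure and pick a different tool on the next turn.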


---

## Part 7: Structured Outputs & Data Extraction

### Theory: JSON Mode vs Schemas

**JSON Mode:** `response_format: {"type": "json_object"}`
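- Guarantees valid JSON syntax, provided the output is not cut off by the token limit
- Field names/types can vary
- The prompt must mention "JSON", otherwise the request is rejected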

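**Strict Schemas:** `response_format: {"type": "json_schema", "json_schema": {"name": "...", "strict": true, "schema": {...}}}`
- Enforces exact fields and types
- Every field must be listed in `required`, and every object needs `additionalProperties: false` (use a `null` type union for optional values)
- The first request with a new schema is slow (~10 sec) while the schema is processed and cached; later requests are fast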
- Guarantees format, NOT accuracy
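
Because JSON mode only guarantees syntax, the application still has to check field names and types itself. A minimal sketch of such a check follows. `EXPECTED_FIELDS` and `validate_entity` are illustrative names, and the field list is an assumption chosen to match a simple entity-extraction prompt.

```python
import json

# Expected field names and Python types (illustrative)
EXPECTED_FIELDS = {"name": str, "age": int, "occupation": str, "city": str}


def validate_entity(raw_content, expected=EXPECTED_FIELDS):
    """Parse a JSON-mode response and return (data, problems)."""
    try:
        data = json.loads(raw_content)
    except json.JSONDecodeError as e:
        return None, [f"invalid JSON: {e}"]
    if not isinstance(data, dict):
        return None, ["top-level value is not an object"]

    problems = []
    for field, expected_type in expected.items():
        if field not in data:
            problems.append(f"missing field: {field}")
        # bool is a subclass of int in Python, so reject it explicitly
        elif isinstance(data[field], bool) or not isinstance(data[field], expected_type):
            problems.append(f"wrong type for {field}: {type(data[field]).__name__}")
    return data, problems
```

An empty `problems` list means the structure matches; it says nothing about whether the extracted values are correct.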

---

### Demo 7.1: JSON Extraction

In [None]:
json_payload = {
    "messages": [
        {
            "role": "developer",
            "content": "You extract entities and always respond with JSON containing name, age, occupation, city."
        },
        {
            "role": "user",
            "content": "Jisoo Park is a 28-year-old Chief Meme Engineer living in Seoul, South Korea."
        }
    ],
    "max_completion_tokens": 2000,
    "reasoning_effort": "low",
    "response_format": {"type": "json_object"}
}

response = requests.post(url, headers=headers, json=json_payload)
json_result = response.json()
raw_content = json_result['choices'][0]['message']['content']

print('Raw JSON response:\n' + raw_content)

try:
    parsed = json.loads(raw_content)
    print('\nParsed entity:')
    for key, val in parsed.items():
        print(f'  {key}: {val}')
except json.JSONDecodeError as e:
    print(f'\nFailed to parse JSON: {e}')

### Demo 7.2: Strict JSON Schema

Enforce exact structure for user profiles.

In [None]:
# Define a strict JSON schema for extracting user information
user_schema = {
    "type": "json_schema",
    "json_schema": {
        "name": "user_profile_extraction",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "personal_info": {
                    "type": "object",
                    "properties": {
                        "full_name": {
                            "type": "string",
                            "description": "The person's full name"
                        },
                        "age": {
                            "type": "integer",
                            "description": "The person's age in years"
                        },
                        "email": {
                            "type": ["string", "null"],
                            "description": "Email address if mentioned, otherwise null"
                        }
                    },
                    "required": ["full_name", "age", "email"],
                    "additionalProperties": False
                },
                "professional_info": {
                    "type": "object",
                    "properties": {
                        "occupation": {
                            "type": "string",
                            "description": "Current job title or occupation"
                        },
                        "company": {
                            "type": ["string", "null"],
                            "description": "Company name if mentioned"
                        },
                        "years_of_experience": {
                            "type": ["integer", "null"],
                            "description": "Years of professional experience if mentioned"
                        }
                    },
                    "required": ["occupation", "company", "years_of_experience"],
                    "additionalProperties": False
                },
                "location": {
                    "type": "object",
                    "properties": {
                        "city": {
                            "type": "string",
                            "description": "City name"
                        },
                        "country": {
                            "type": "string",
                            "description": "Country name"
                        }
                    },
                    "required": ["city", "country"],
                    "additionalProperties": False
                },
                "interests": {
                    "type": "array",
                    "items": {
                        "type": "string"
                    },
                    "description": "List of hobbies or interests mentioned"
                }
            },
            "required": ["personal_info", "professional_info", "location", "interests"],
            "additionalProperties": False
        }
    }
}

# Sample text to extract from
sample_text = """
Meet Dr. Maria Rodriguez, a 23-year-old data scientist at Joe AI Labs 
with 8 years of experience in machine learning. She lives in Barcelona, Spain, 
and enjoys hiking, photography, and reading science fiction novels. 
You can reach her at maria.r@joe.ai for collaboration opportunities.
"""

structured_payload = {
    "messages": [
        {
            "role": "developer",
            "content": "You extract structured information from text and return it in the specified JSON format."
        },
        {
            "role": "user",
            "content": f"Extract all relevant information from this text:\n\n{sample_text}"
        }
    ],
    "response_format": user_schema,
    "max_completion_tokens": 2000,
    "reasoning_effort": "low"
}

print("=" * 70)
print("Structured Output Demo with Strict JSON Schema")
print("=" * 70)
print(f"\nInput text:\n{sample_text}")
print(f"\n{'=' * 70}")
print("Schema enforces:")
print("  - Exact field names and types")
print("  - Required fields (no missing data)")
print("  - No additional properties")
print("  - Nested object structure")
print(f"{'=' * 70}\n")

response = requests.post(url, headers=headers, json=structured_payload)
result = response.json()
response_content = result['choices'][0]['message']['content']

print("Raw response:")
print(response_content)

# Parse and validate
try:
    parsed_data = json.loads(response_content)
    print(f"\n{'=' * 70}")
    print("Response successfully parsed as JSON")
    print(f"{'=' * 70}")
    print("\nFormatted extraction:")
    print(json.dumps(parsed_data, indent=2))
    
    # Demonstrate accessing nested fields
    print(f"\n{'=' * 70}")
    print("Accessing structured data:")
    print(f"{'=' * 70}")
    print(f"  Name: {parsed_data['personal_info']['full_name']}")
    print(f"  Age: {parsed_data['personal_info']['age']}")
    print(f"  Job: {parsed_data['professional_info']['occupation']}")
    print(f"  Company: {parsed_data['professional_info']['company']}")
    print(f"  Location: {parsed_data['location']['city']}, {parsed_data['location']['country']}")
    print(f"  Experience: {parsed_data['professional_info']['years_of_experience']} years")
    print(f"  Interests: {', '.join(parsed_data['interests'])}")
    
    print("\nAll required fields present and correctly typed!")
    
except json.JSONDecodeError as e:
    print(f"\nJSON parsing error: {e}")
except KeyError as e:
    print(f"\nMissing expected field: {e}")
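
Writing `required` and `additionalProperties` by hand for every nested object, as in the schema above, is easy to get wrong. One possible helper, sketched here under the assumption that the schema only nests through `properties` and `items`, applies both rules recursively. `make_strict` is a hypothetical name, not part of any library.

```python
import copy


def make_strict(schema):
    """Return a copy of a JSON Schema in which every object lists all of its
    properties in `required` and sets `additionalProperties` to False."""
    schema = copy.deepcopy(schema)  # never mutate the caller's schema

    def visit(node):
        if not isinstance(node, dict):
            return
        node_type = node.get("type")
        is_object = node_type == "object" or (
            isinstance(node_type, list) and "object" in node_type
        )
        if is_object and "properties" in node:
            node["required"] = list(node["properties"])
            node["additionalProperties"] = False
            for child in node["properties"].values():
                visit(child)
        if "items" in node:
            visit(node["items"])  # arrays of objects

    visit(schema)
    return schema
```

Since this marks every property as required, values that may be absent from the source text still need a `null` type union, as `email` and `company` have in the demo schema.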

---

## Part 8: Streaming Responses

### Theory: Server-Sent Events

**How It Works:**
1. Set `stream: true`
2. Receive chunks as `data: {...}\n\n`
3. Accumulate `delta.content`
4. Stop at `data: [DONE]`

**Benefits:** Lower perceived latency, progressive UI updates

**Trade-off:** More complex error handling, can't validate before displaying
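
The four steps above can be sketched as a small generator that works on already-decoded lines, which makes the parsing logic testable without a network connection. `iter_sse_content` is an illustrative name; the chunk shape (`choices[0].delta.content`) is the one described in the steps.

```python
import json


def iter_sse_content(lines):
    """Yield text fragments from an iterable of decoded SSE lines."""
    for line in lines:
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            return  # end-of-stream sentinel
        chunk = json.loads(data)
        choices = chunk.get("choices") or []
        if not choices:
            continue  # some chunks carry no choices (for example, usage-only chunks)
        content = choices[0].get("delta", {}).get("content")
        if content:
            yield content
```

With `requests`, the input could be produced by `response.iter_lines(decode_unicode=True)`, and the output consumed with `"".join(...)` or printed fragment by fragment.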

---

### Demo 8.1: Stream Tokens

In [None]:
import sys
import time
import requests
import json
import os

# Build streaming request URL and headers
endpoint = os.environ['DIAL_API_ENDPOINT']
deployment = os.environ['DIAL_DEPLOYMENT']
api_version = os.environ['DIAL_API_VERSION']
api_key = os.environ['DIAL_API_KEY']

url = f"{endpoint}/openai/deployments/{deployment}/chat/completions?api-version={api_version}"
headers = {
    "Content-Type": "application/json",
    "api-key": api_key
}

# Streaming request payload
streaming_payload = {
    "messages": [
        {
            "role": "developer",
            "content": "You are a technical writer who explains concepts clearly. Respond directly without extensive reasoning."
        },
        {
            "role": "user",
            "content": "Explain how HTTP requests work, step by step."
        }
    ],
    "max_completion_tokens": 3000,
    "reasoning_effort": "minimal",  # Add for GPT-5
    "stream": True  # Enable streaming
}

print("=" * 70)
print("Streaming Demo")
print("=" * 70)
print("\nStreaming response (tokens appear in real-time):\n")

start_time = time.time()
full_content = ""
chunk_count = 0

try:
    # Make streaming request
    response = requests.post(url, headers=headers, json=streaming_payload, stream=True, timeout=60)
    response.raise_for_status()
    
    # Process Server-Sent Events
    for line in response.iter_lines():
        if line:
            line_str = line.decode('utf-8')
            
            # SSE format: "data: {...}"
            if line_str.startswith('data: '):
                data_str = line_str[6:]  # Remove "data: " prefix
                
                # Check for stream end
                if data_str.strip() == '[DONE]':
                    break
                
                try:
                    # Parse JSON chunk
                    chunk = json.loads(data_str)
                    chunk_count += 1
                    
                    # Extract delta content
                    if 'choices' in chunk and len(chunk['choices']) > 0:
                        delta = chunk['choices'][0].get('delta', {})
                        content = delta.get('content', '')
                        
                        if content:
                            full_content += content
                            # Print token(s) as they arrive
                            print(content, end='', flush=True)
                        
                        # Check for finish reason
                        finish_reason = chunk['choices'][0].get('finish_reason')
                        if finish_reason:
                            print(f"\n\n[Stream ended: {finish_reason}]")
                
                except json.JSONDecodeError:
                    pass  # Skip malformed JSON
    
    elapsed = time.time() - start_time
    
    print(f"\n\n{'=' * 70}")
    print(f"Streaming Statistics:")
    print(f"  Total chunks received: {chunk_count}")
    print(f"  Total characters: {len(full_content)}")
    print(f"  Time elapsed: {elapsed:.2f}s")
    if chunk_count > 0:
        print(f"  Avg time per chunk: {(elapsed/chunk_count)*1000:.1f}ms")
    print(f"{'=' * 70}")

except requests.exceptions.RequestException as e:
    print(f"\n\nStreaming error: {e}")

---

## Part 9: Vision (Multimodal)

### Theory: Image Analysis

**Capabilities:**
- GPT-4o: text + images
- GPT-5: text only (for now)

**Detail Levels:**
- `low`: 85 tokens flat (classification)
- `high`: variable tokens, tiled processing (detailed analysis)

**Formats:** Base64 data or public URLs. Max 20MB/image, 50 images/request.
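
To make "variable tokens, tiled processing" concrete, here is a rough estimator. It is a sketch under stated assumptions: the constants (85 base tokens, 170 tokens per 512 px tile, scaling to fit 2048 px and then to a 768 px shortest side) are the figures published for GPT-4o and may differ for other models or deployments, and the function assumes images are only ever scaled down. `estimate_image_tokens` is a hypothetical helper; the `usage.prompt_tokens` value returned by the API remains the authoritative number.

```python
import math


def estimate_image_tokens(width, height, detail="high"):
    """Estimate input tokens for one image (GPT-4o-style accounting, assumed)."""
    if detail == "low":
        return 85  # flat cost, independent of image size

    # Step 1: scale down to fit inside a 2048 x 2048 square
    scale = min(1.0, 2048 / max(width, height))
    width, height = round(width * scale), round(height * scale)

    # Step 2: scale down so the shortest side is at most 768 px
    scale = min(1.0, 768 / min(width, height))
    width, height = round(width * scale), round(height * scale)

    # Step 3: count 512 px tiles and add the base cost
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 85 + 170 * tiles


print(estimate_image_tokens(1024, 1024))          # 765
print(estimate_image_tokens(2048, 4096))          # 1105
print(estimate_image_tokens(2048, 4096, "low"))   # 85
```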

---

### Demo 9.1: Analyze Vacation Photo

In [None]:
import base64
import os
import requests
from pathlib import Path

# Use GPT-4o for vision support
gpt4o_deployment = os.environ.get('DIAL_GPT4O_DEPLOYMENT', 'gpt-4o-mini-2024-07-18')

print("=" * 70)
print("Vision Demo: Analyzing Vacation Photos")
print("=" * 70)

# Use images from local images/ folder
IMAGE_1 = "images/background-vacation1.jpeg"
IMAGE_2 = "images/background-vacation2.jpeg"

def load_image_base64(image_path: str):
    """Load image and convert to base64."""
    path = Path(image_path)
    
    # If relative path, resolve from current directory
    if not path.is_absolute():
        path = Path.cwd() / path
    
    if not path.exists():
        raise FileNotFoundError(f"Image not found: {path}")
    
    with path.open("rb") as image_file:
        image_bytes = image_file.read()
    
    return base64.b64encode(image_bytes).decode("utf-8"), path

# Load first vacation image
image_data, image_file = load_image_base64(IMAGE_1)
print(f"\nAnalyzing: {image_file.name}")

# Build payload with base64-encoded image
payload = {
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Describe this vacation photo. What location might this be? What activities or experiences does it suggest?"
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{image_data}",
                        "detail": "high"  # Use high detail for better analysis
                    }
                }
            ]
        }
    ],
    "max_tokens": 400,
    "temperature": 0.7
}

# Build GPT-4o endpoint
endpoint = os.environ['DIAL_API_ENDPOINT']
api_version = os.environ['DIAL_API_VERSION']
gpt4o_url = f"{endpoint}/openai/deployments/{gpt4o_deployment}/chat/completions?api-version={api_version}"

headers = {
    "Content-Type": "application/json",
    "api-key": os.environ['DIAL_API_KEY']
}

try:
    response = requests.post(gpt4o_url, headers=headers, json=payload, timeout=60)
    response.raise_for_status()
    result = response.json()

    print("\nGPT-4o Analysis:")
    print("-" * 70)
    print(result['choices'][0]['message']['content'])

    usage = result.get('usage', {})
    print("\n" + "=" * 70)
    print("Token Usage:")
    print(f"  Input tokens: {usage.get('prompt_tokens', 0)}")
    print(f"  Output tokens: {usage.get('completion_tokens', 0)}")
    print(f"  Total: {usage.get('total_tokens', 0)}")
    print("\nHigh detail image analysis uses more tokens but provides richer descriptions")

    # Optional: Analyze second image
    print("\n" + "=" * 70)
    print("Analyzing second vacation photo...")
    print("=" * 70)

    image_data2, image_file_2 = load_image_base64(IMAGE_2)

    payload2 = {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this vacation scene. Compare the mood and setting to a beach destination."},
                    {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_data2}", "detail": "low"}}
                ]
            }
        ],
        "max_tokens": 300,
        "temperature": 0.7
    }

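    response2 = requests.post(gpt4o_url, headers=headers, json=payload2, timeout=60)
    response2.raise_for_status()  # surface HTTP errors instead of failing later on a missing 'choices' key
    result2 = response2.json()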

    print(f"\nAnalyzing: {image_file_2.name}")
    print("-" * 70)
    print(result2['choices'][0]['message']['content'])

    usage2 = result2.get('usage', {})
    print("\n" + "=" * 70)
    print("Token Comparison (detail='low' vs detail='high'):")
    print(f"  Image 1 (high detail): {usage.get('prompt_tokens', 0)} input tokens")
    print(f"  Image 2 (low detail):  {usage2.get('prompt_tokens', 0)} input tokens")
    print(f"  Difference: ~{usage.get('prompt_tokens', 0) - usage2.get('prompt_tokens', 0)} tokens (the two requests use different images and prompts, so treat this as indicative)")

except Exception as e:
    print(f"\nError: {e}")
    if hasattr(e, 'response') and e.response is not None:
        try:
            error_detail = e.response.json()
            print(f"Details: {error_detail}")
        except ValueError:  # error body was not valid JSON
            print(f"Response: {e.response.text[:500]}")

print("\n" + "=" * 70)
print("Key Takeaways:")
print("=" * 70)
print("- GPT-4o supports vision; GPT-5 is text-only")
print("- 'detail': 'high' -> better analysis, more tokens")
print("- 'detail': 'low' -> faster, cheaper, good for classification")
print("- Images can be base64-encoded or public URLs")
print("- Remove images from history after analysis to save tokens")


---

## Part 10: Production Readiness & Operational Guardrails

### Theory: Production Readiness

**Always Check `finish_reason`:**
- `stop`: Normal completion
- `length`: Truncated (increase max_completion_tokens)
- `content_filter`: Blocked by safety filters
- `tool_calls`: Model wants to call a function

**Rate Limits:**
- HTTP 429: Too many requests
- Headers: `x-ratelimit-remaining-requests`, `x-ratelimit-remaining-tokens`
- Implement exponential backoff: 1s, 2s, 4s, 8s...

**Usage Monitoring:**
- Track token usage per request
- Monitor cumulative costs
- Log response times and errors
- Use request IDs for debugging

**Model Selection Guide:**
- **GPT-5**: Complex reasoning, math, code analysis (slowest, highest cost)
- **GPT-5-mini**: Balanced reasoning tasks, good for most use cases (moderate speed & cost)
- **GPT-5-nano**: Lightweight reasoning for edge/mobile deployments (faster, lower cost)
- **GPT-4o**: General tasks, speed-critical apps (fast, cost-effective)
- **GPT-4o-mini**: Simple classification, high-volume tasks (fastest, cheapest)

---

### Demo 10.1: Detecting Truncation

In [None]:
import requests
import json
import os

# Build API request components
endpoint = os.environ['DIAL_API_ENDPOINT']
deployment = os.environ['DIAL_DEPLOYMENT']
api_version = os.environ['DIAL_API_VERSION']
api_key = os.environ['DIAL_API_KEY']

url = f"{endpoint}/openai/deployments/{deployment}/chat/completions?api-version={api_version}"
headers = {
    "Content-Type": "application/json",
    "api-key": api_key
}

truncated_payload = {
    "messages": [
        {"role": "developer", "content": "You are a technical writer."},
        {"role": "user", "content": "Explain microservices architecture in detail with examples."}
    ],
    "max_completion_tokens": 30,  # Intentionally too small!
    "reasoning_effort": "minimal"
}

response = requests.post(url, headers=headers, json=truncated_payload, timeout=30)
response.raise_for_status()  # fail fast on HTTP errors before reading the body
result = response.json()
choice = result['choices'][0]
content = choice['message'].get('content', '')

# Show the truncated response
if content:
    preview = content[:150]  # slicing is safe even when content is shorter than 150 chars
    print(f"Response (truncated):\n{preview}...")
else:
    # GPT-5 might use all tokens for reasoning
    reasoning_tokens = result['usage'].get('completion_tokens_details', {}).get('reasoning_tokens', 0)
    print(f"Response: (empty - {reasoning_tokens} reasoning tokens used, no visible output)")

print(f"\nFinish reason: {choice['finish_reason']}")

if choice['finish_reason'] == 'length':
    print("\nTRUNCATED! The response was cut off mid-sentence.")
    print(f"   Token usage: {result['usage']['prompt_tokens']} prompt + {result['usage']['completion_tokens']} completion = {result['usage']['total_tokens']} total")
    
    # Show reasoning token breakdown for GPT-5
    if 'completion_tokens_details' in result['usage']:
        reasoning = result['usage']['completion_tokens_details'].get('reasoning_tokens', 0)
        output = result['usage']['completion_tokens'] - reasoning
        if reasoning > 0:
            print(f"   Breakdown: {reasoning} reasoning tokens + {output} output tokens")
    
    print("\nFix: Increase max_completion_tokens well beyond 30 so the full response fits (other demos in this guide use 2000-3000)")
else:
    print(f"Response was not cut off by the token limit (finish_reason: {choice['finish_reason']})")
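
The demo handles only the `length` case. A fuller handler might map each `finish_reason` from the theory section to a next step, as in the sketch below. `interpret_finish_reason` and its status labels are illustrative, not part of the API.

```python
def interpret_finish_reason(choice):
    """Map a response choice to a (status, advice) pair."""
    reason = choice.get("finish_reason")
    content = (choice.get("message") or {}).get("content") or ""

    if reason == "stop":
        return "ok", "Response is complete."
    if reason == "length":
        if not content:
            # Reasoning models can spend the whole budget before emitting text
            return "retry", "No visible output: raise max_completion_tokens or lower reasoning_effort."
        return "retry", "Output was cut off: raise max_completion_tokens."
    if reason == "content_filter":
        return "blocked", "Blocked by safety filters: do not retry the same request unchanged."
    if reason == "tool_calls":
        return "tools", "Execute the requested tool calls and send the results back."
    return "unknown", f"Unexpected finish_reason: {reason!r}"
```

For example, `interpret_finish_reason(result['choices'][0])` could be called right after each request, with the status logged alongside token usage.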

---

### Demo 10.2: Exponential Backoff Pattern


In [None]:
import time
import requests
import json
import os

# Build API request components
endpoint = os.environ['DIAL_API_ENDPOINT']
deployment = os.environ['DIAL_DEPLOYMENT']
api_version = os.environ['DIAL_API_VERSION']
api_key = os.environ['DIAL_API_KEY']

url = f"{endpoint}/openai/deployments/{deployment}/chat/completions?api-version={api_version}"
headers = {
    "Content-Type": "application/json",
    "api-key": api_key
}

def call_with_retry(url, headers, payload, max_retries=3):
    '''Call API with exponential backoff on rate limits.'''
    for attempt in range(max_retries):
        try:
            response = requests.post(url, headers=headers, json=payload, timeout=30)
            response.raise_for_status()
            return response.json()
        except requests.exceptions.HTTPError as e:
            if e.response.status_code == 429 and attempt < max_retries - 1:
                wait_time = 2 ** attempt  # 1s, 2s, 4s, ... (doubles after each failed attempt)
                print(f"Rate limited! Waiting {wait_time}s before retry {attempt + 2}/{max_retries}...")
                time.sleep(wait_time)
            else:
                raise
        except Exception as e:
            print(f"Error: {e}")
            raise
    
print("Backoff schedule: 1s -> 2s -> 4s -> 8s... (doubles after each failed attempt)")
print("Always implement exponential backoff for production!")
print("\nExample: with max_retries=3 (three attempts in total), repeated 429 responses lead to:")
print("  Attempt 1 fails: wait 1s")
print("  Attempt 2 fails: wait 2s")
print("  Attempt 3 fails: the error is raised to the caller")
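
Two common refinements to plain exponential backoff are random jitter, so that many clients do not retry in lockstep, and honoring the standard HTTP `Retry-After` header when the server sends one. The sketch below shows a delay calculation with both. `backoff_delay` is a hypothetical helper, the "full jitter" strategy is a swapped-in variant rather than the fixed schedule used above, and whether a given endpoint actually returns `Retry-After` is an assumption to verify.

```python
import random


def backoff_delay(attempt, retry_after=None, base=1.0, cap=30.0, rng=random.random):
    """Seconds to wait after failed attempt number `attempt` (0-based)."""
    if retry_after is not None:
        try:
            # Retry-After given in seconds: trust the server, within the cap
            return min(cap, max(0.0, float(retry_after)))
        except ValueError:
            pass  # header may be an HTTP date; fall back to computed backoff

    ceiling = min(cap, base * (2 ** attempt))  # 1s, 2s, 4s, ... up to cap
    return ceiling * rng()  # full jitter: uniform between 0 and the ceiling
```

Inside a retry loop such as `call_with_retry`, the wait could then become `time.sleep(backoff_delay(attempt, e.response.headers.get("Retry-After")))`.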


---

### Demo 10.3: Track Usage and Costs


In [None]:
import requests
import json
import os

# Build API request components
endpoint = os.environ['DIAL_API_ENDPOINT']
deployment = os.environ['DIAL_DEPLOYMENT']
api_version = os.environ['DIAL_API_VERSION']
api_key = os.environ['DIAL_API_KEY']

url = f"{endpoint}/openai/deployments/{deployment}/chat/completions?api-version={api_version}"
headers = {
    "Content-Type": "application/json",
    "api-key": api_key
}

# Track usage across multiple requests
total_tokens = 0
total_cost = 0.0

for i in range(3):
    test_payload = {
        "messages": [
            {"role": "user", "content": f"Give me a random fact about space #{i+1}"}
        ],
        "max_completion_tokens": 100,
        "reasoning_effort": "minimal"
    }
    
    response = requests.post(url, headers=headers, json=test_payload, timeout=30)
    response.raise_for_status()  # an error response has no 'usage' field to read
    result = response.json()
    
    # Calculate usage
    prompt_tokens = result['usage']['prompt_tokens']
    completion_tokens = result['usage']['completion_tokens']
    request_tokens = result['usage']['total_tokens']
    
    # GPT-5-mini pricing: $0.25/1M input, $2.00/1M output
    request_cost = (prompt_tokens * 0.00000025) + (completion_tokens * 0.000002)
    
    total_tokens += request_tokens
    total_cost += request_cost
    
    print(f"Request {i+1}: {request_tokens} tokens, ${request_cost:.6f}")

print(f"\nTotal usage: {total_tokens} tokens")
print(f"Total cost: ${total_cost:.6f}")
print(f"Average per request: {total_tokens/3:.1f} tokens, ${total_cost/3:.6f}")
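
For anything longer than a notebook cell, the running totals are easier to manage in a small object. The following sketch is one way to do that. `UsageTracker` and `PRICE_PER_MILLION` are illustrative names, and the prices are simply the GPT-5-mini figures quoted in the demo above, which should be checked against current pricing.

```python
from dataclasses import dataclass

# USD per 1M tokens (values taken from the demo above; verify before relying on them)
PRICE_PER_MILLION = {"input": 0.25, "output": 2.00}


@dataclass
class UsageTracker:
    requests: int = 0
    prompt_tokens: int = 0
    completion_tokens: int = 0

    def record(self, usage):
        """Add one response's `usage` dictionary to the running totals."""
        self.requests += 1
        self.prompt_tokens += usage.get("prompt_tokens", 0)
        self.completion_tokens += usage.get("completion_tokens", 0)

    @property
    def cost(self):
        """Cumulative cost in USD."""
        return (
            self.prompt_tokens * PRICE_PER_MILLION["input"]
            + self.completion_tokens * PRICE_PER_MILLION["output"]
        ) / 1_000_000


tracker = UsageTracker()
tracker.record({"prompt_tokens": 1000, "completion_tokens": 500})
tracker.record({"prompt_tokens": 1000, "completion_tokens": 500})
print(f"{tracker.requests} requests, ${tracker.cost:.4f}")  # 2 requests, $0.0025
```

In the loop above, `tracker.record(result['usage'])` would replace the manual additions, and the averages would come from `tracker.requests` instead of a hardcoded 3.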
