# LLM API Foundations Workshop

This guide distills the core mechanics behind working with Large Language Model APIs across OpenAI and Azure OpenAI deployments. Use it as a reference for how to pass information into a model, what content can be supplied, and which controls shape behavior, cost, and reliability. Getting these fundamentals right routinely cuts token spend by 80%+, improves task accuracy by ~40%, and unlocks workflows that simple prompt-response bots cannot handle.

**Why This Matters:**

LLMs behave unlike deterministic APIs: each response is sampled token-by-token from probability distributions, the services are stateless between calls, and everything you send or receive is billed in tokens. As a result, poor prompt hygiene, unchecked context growth, or sloppy memory strategies surface as failures: context overflows, incoherent replies, and unexpected cost spikes. Treat tokens, prompts, and context as first-class resources from the outset; modern "context engineering" is really continuous memory management for probabilistic systems.

**Workshop Goals:**
- Master stateless API mechanics and token economics
- Understand GPT-5 reasoning vs GPT-4o standard model differences  
- Implement tool calling, streaming, and structured outputs
- Apply context management strategies for production workflows

**Prerequisites:** Python 3.8+, `requests` library, DIAL API key

---

## Part 1: Setup & First API Call

### Theory: What Are LLM APIs?

LLM APIs are stateless HTTP endpoints: each request is independent, and you must send the full conversation history every time. This workshop uses the Chat Completions pattern throughout:

**Chat Completions** (most common):
- Send `messages[]` array with roles: `system`, `user`, `assistant`, `tool`
- For GPT-5 reasoning models: use `max_completion_tokens` + `reasoning_effort`
- For GPT-4o standard: use `max_tokens` + `temperature`

**Key Point:** Every token you send or receive is billed. Poor prompt hygiene leads to cost spikes and failures.

---

**Default Models:**
- GPT-5: `gpt-5-mini-2025-08-07` (reasoning model)
- GPT-4o: `gpt-4o-mini-2024-07-18` (standard model)

---

### Demo 1.1: Configure API Credentials

In [37]:
import os
import getpass

# Collect and normalize configuration
if 'DIAL_API_KEY' not in os.environ or not os.environ['DIAL_API_KEY']:
    os.environ['DIAL_API_KEY'] = getpass.getpass('Enter DIAL_API_KEY: ')

# Default to GPT-5 mini (reasoning model) for main workshop
current_deployment = os.environ.get('DIAL_DEPLOYMENT', 'gpt-5-mini-2025-08-07')
chosen = input(f"Enter DIAL_DEPLOYMENT [{current_deployment}]: ").strip()
os.environ['DIAL_DEPLOYMENT'] = chosen or current_deployment

# Also store GPT-4o deployment for comparison demos
gpt4o_deployment = os.environ.get('GPT4O_DEPLOYMENT', 'gpt-4o-mini-2024-07-18')
gpt4o_chosen = input(f"Enter GPT4O_DEPLOYMENT for comparisons [{gpt4o_deployment}]: ").strip()
os.environ['GPT4O_DEPLOYMENT'] = gpt4o_chosen or gpt4o_deployment

os.environ.setdefault('DIAL_API_ENDPOINT', 'https://ai-proxy.lab.epam.com')
os.environ.setdefault('DIAL_API_VERSION', '2024-10-21')

# Remove any trailing slash to avoid double-slash URLs
os.environ['DIAL_API_ENDPOINT'] = os.environ['DIAL_API_ENDPOINT'].rstrip('/')

# Detect if this is a reasoning model
deployment = os.environ['DIAL_DEPLOYMENT'].lower()
reasoning_models = ['gpt-5', 'o1', 'o3', 'o4']
is_reasoning_model = any(model in deployment for model in reasoning_models)

print('✓ Configuration set:')
print(f"  Endpoint: {os.environ['DIAL_API_ENDPOINT']}")
print(f"  API Version: {os.environ['DIAL_API_VERSION']}")
print(f"  API Key: {'*' * 20} (hidden)")
print(f"\n  Primary (GPT-5): {os.environ['DIAL_DEPLOYMENT']}")
print(f"  Comparison (GPT-4o): {os.environ['GPT4O_DEPLOYMENT']}")

if is_reasoning_model:
    print(f"\n✓ GPT-5 Reasoning Model Configuration:")
    print("  ✓ Supported: max_completion_tokens, reasoning_effort, developer role")
    print("  ✗ NOT supported: temperature, top_p, max_tokens, frequency/presence_penalty")
    os.environ['MODEL_TYPE'] = 'reasoning'
else:
    print("\n⚠️  Warning: Primary deployment is not a reasoning model")
    print("  For this workshop, we recommend using gpt-5-mini or gpt-5")
    os.environ['MODEL_TYPE'] = 'standard'

print("\n📚 Workshop Structure:")
print("  • Most demos use GPT-5 (reasoning model)")
print("  • Demos 2-4 compare GPT-5 vs GPT-4o")

✓ Configuration set:
  Endpoint: https://ai-proxy.lab.epam.com
  API Version: 2024-10-21
  API Key: ******************** (hidden)

  Primary (GPT-5): gpt-5-mini-2025-08-07
  Comparison (GPT-4o): gpt-4o-mini-2024-07-18

✓ GPT-5 Reasoning Model Configuration:
  ✓ Supported: max_completion_tokens, reasoning_effort, developer role
  ✗ NOT supported: temperature, top_p, max_tokens, frequency/presence_penalty

📚 Workshop Structure:
  • Most demos use GPT-5 (reasoning model)
  • Demos 2-4 compare GPT-5 vs GPT-4o


### Demo 1.2: Validate Configuration

In [38]:
# Validate that all required environment variables are set
required_vars = ['DIAL_API_KEY', 'DIAL_DEPLOYMENT', 'DIAL_API_ENDPOINT', 'DIAL_API_VERSION']
# Treat unset and empty values alike
missing = [var for var in required_vars if not os.environ.get(var)]

if missing:
    print(f"✗ Missing environment variables: {', '.join(missing)}")
    print('   Run the configuration cell above first!')
else:
    print('✓ All environment variables are set')
    print('\nYour configuration:')
    print(f"  Endpoint: {os.environ['DIAL_API_ENDPOINT']}")
    print(f"  Deployment: {os.environ['DIAL_DEPLOYMENT']}")
    print(f"  API Version: {os.environ['DIAL_API_VERSION']}")
    print(f"  API Key: {'*' * 20} (hidden)")

    url = (
        f"{os.environ['DIAL_API_ENDPOINT']}/openai/deployments/"
        f"{os.environ['DIAL_DEPLOYMENT']}/chat/completions"
        f"?api-version={os.environ['DIAL_API_VERSION']}"
    )
    print('\nAPI endpoint:')
    print(f'  {url}')

    print('\n✓ Ready to make API calls!')


✓ All environment variables are set

Your configuration:
  Endpoint: https://ai-proxy.lab.epam.com
  Deployment: gpt-5-mini-2025-08-07
  API Version: 2024-10-21
  API Key: ******************** (hidden)

API endpoint:
  https://ai-proxy.lab.epam.com/openai/deployments/gpt-5-mini-2025-08-07/chat/completions?api-version=2024-10-21

✓ Ready to make API calls!


### Demo 1.3: Connection Test

**Critical:** This test must pass before proceeding!

In [39]:
import requests
import json

print("=" * 70)
print("CRITICAL: Testing API Connection")
print("=" * 70)

# Build test request
endpoint = os.environ['DIAL_API_ENDPOINT']
deployment = os.environ['DIAL_DEPLOYMENT']
api_version = os.environ['DIAL_API_VERSION']
model_type = os.environ.get('MODEL_TYPE', 'standard')
url = f"{endpoint}/openai/deployments/{deployment}/chat/completions?api-version={api_version}"

headers = {
    'Content-Type': 'application/json',
    'api-key': os.environ['DIAL_API_KEY'],
}

# Build test payload based on model type
if model_type == 'reasoning':
    # For reasoning models (GPT-5, o1, o3)
    test_payload = {
        'messages': [
            {'role': 'developer', 'content': 'You are a helpful assistant.'},
            {'role': 'user', 'content': 'Say "API TEST SUCCESSFUL" if you can read this.'}
        ],
        'max_completion_tokens': 50,
        'reasoning_effort': 'low'
    }
    print(f"\nTesting REASONING model: {deployment}")
    print("Parameters: max_completion_tokens, reasoning_effort")
else:
    # For standard models (GPT-4o, GPT-4o-mini)
    test_payload = {
        'messages': [
            {'role': 'system', 'content': 'You are a helpful assistant.'},
            {'role': 'user', 'content': 'Say "API TEST SUCCESSFUL" if you can read this.'}
        ],
        'max_tokens': 50,
        'temperature': 0.7
    }
    print(f"\nTesting STANDARD model: {deployment}")
    print("Parameters: max_tokens, temperature")

print(f"Endpoint: {url}\n")

try:
    response = requests.post(url, headers=headers, json=test_payload, timeout=30)
    
    # Show status code
    print(f"HTTP Status: {response.status_code}")
    
    if response.status_code == 200:
        result = response.json()
        content = result['choices'][0]['message']['content']
        
        print("\n" + "=" * 70)
        print(" SUCCESS! API is working correctly")
        print("=" * 70)
        print(f"\nResponse: {content}")
        print(f"\nToken usage:")
        print(f"  Prompt tokens: {result['usage']['prompt_tokens']}")
        print(f"  Completion tokens: {result['usage']['completion_tokens']}")
        print(f"  Total tokens: {result['usage']['total_tokens']}")
        
        if 'completion_tokens_details' in result['usage']:
            details = result['usage']['completion_tokens_details']
            if 'reasoning_tokens' in details:
                print(f"  Reasoning tokens: {details['reasoning_tokens']}")
        
        print("\n You can now run all the demos below!")
        print("=" * 70)
    else:
        # Show error details
        print("\n" + "=" * 70)
        print(" API ERROR - Demos will NOT work!")
        print("=" * 70)
        print(f"\nError response:")
        try:
            error_data = response.json()
            print(json.dumps(error_data, indent=2))
        except ValueError:  # body was not valid JSON
            print(response.text)
        
        print("\n" + "=" * 70)
        print("TROUBLESHOOTING:")
        print("=" * 70)
        
        if response.status_code == 400:
            print("\n 400 Bad Request - Most likely causes:")
            print("  1. Wrong deployment name")
            print("  2. Using wrong parameters for this model type")
            print("  3. Model doesn't exist in your DIAL instance")
            print("\nQuick fixes:")
            print("  • Try: gpt-4o (most common)")
            print("  • Try: gpt-4o-mini (cheaper, faster)")
            print("  • Try: gpt-35-turbo (older but widely available)")
            print("\nRe-run the config cell and enter a different deployment name")
        
        elif response.status_code == 401:
            print("\n 401 Unauthorized:")
            print("  Your API key is invalid or expired")
            print("\nFix: Get a new API key and re-run config cell")
        
        elif response.status_code == 403:
            print("\n 403 Forbidden:")
            print("  Your API key doesn't have access to this deployment")
            print("\nFix: Request access or try a different deployment")
        
        elif response.status_code == 404:
            print("\n 404 Not Found:")
            print(f"  Deployment '{deployment}' doesn't exist")
            print("\nFix: Check available deployments or use gpt-4o")
        
        elif response.status_code == 429:
            print("\n 429 Rate Limit:")
            print("  Too many requests - wait a moment and try again")
        
        else:
            print(f"\n Unexpected error: {response.status_code}")
        
        print("\n" + "=" * 70)
        print("⚠️  DO NOT RUN ANY DEMOS until this test passes!")
        print("=" * 70)

except requests.exceptions.Timeout:
    print("\n Connection timeout - check your network")
except requests.exceptions.ConnectionError:
    print("\n Cannot connect to API endpoint")
    print(f"   Check if {endpoint} is accessible")
except Exception as e:
    print(f"\n Unexpected error: {e}")
    print("   Check your configuration")

CRITICAL: Testing API Connection

Testing REASONING model: gpt-5-mini-2025-08-07
Parameters: max_completion_tokens, reasoning_effort
Endpoint: https://ai-proxy.lab.epam.com/openai/deployments/gpt-5-mini-2025-08-07/chat/completions?api-version=2024-10-21

HTTP Status: 200

 SUCCESS! API is working correctly

Response: API TEST SUCCESSFUL

Token usage:
  Prompt tokens: 29
  Completion tokens: 14
  Total tokens: 43
  Reasoning tokens: 0

 You can now run all the demos below!
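
The troubleshooting branch for 429 says to wait and try again. That advice can be automated with exponential backoff. The following is a minimal sketch, not part of the workshop client: `call_with_backoff`, its `send` callable, and the choice to retry only on 429 are all assumptions made for illustration.

```python
import random
import time
from typing import Callable, Dict, Tuple


def call_with_backoff(
    send: Callable[[], Tuple[int, Dict]],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, Dict]:
    """Call send() until it returns something other than 429 or attempts run out.

    send is a hypothetical zero-argument callable returning (status_code, body).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    status, body = send()
    for attempt in range(max_attempts - 1):
        if status != 429:
            break
        # Exponential backoff with a little jitter: about 1s, 2s, 4s, ...
        sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.25))
        status, body = send()
    return status, body
```

In a notebook, `send` could wrap the `requests.post(...)` call from the connection test and return `(response.status_code, response.json())`. Passing `sleep` as a parameter keeps the function easy to exercise without real delays.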


---

## Part 2: Your First Request

### Theory: GPT-5 vs GPT-4o

**Two Model Families:**

| Parameter | GPT-5 (Reasoning) | GPT-4o (Standard) |
|-----------|-------------------|-------------------|
| Output limit | `max_completion_tokens` | `max_tokens` |
| Control | `reasoning_effort` (low/medium/high) | `temperature` (0.0-2.0) |
| Role | `developer` or `system` | `system` only |
| Best for | Complex logic, debugging | Fast responses, vision |

**Prompt Structure:** System/developer role → user question → assistant reply → next user turn
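
The parameter differences in the table can be captured in one helper. This is a minimal sketch: `build_payload` and its argument names are hypothetical, and the default values simply mirror the ones used in the connection test.

```python
from typing import Any, Dict


def build_payload(model_type: str, instructions: str, question: str,
                  output_limit: int = 300) -> Dict[str, Any]:
    """Build a Chat Completions payload with the parameter names each family accepts."""
    if model_type == "reasoning":
        return {
            "messages": [
                {"role": "developer", "content": instructions},
                {"role": "user", "content": question},
            ],
            "max_completion_tokens": output_limit,
            "reasoning_effort": "low",
        }
    # Standard models take a system role, max_tokens, and temperature instead
    return {
        "messages": [
            {"role": "system", "content": instructions},
            {"role": "user", "content": question},
        ],
        "max_tokens": output_limit,
        "temperature": 0.7,
    }
```

Keeping this branch in one place avoids the most common 400 error in the workshop: sending `temperature` or `max_tokens` to a reasoning deployment.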

---

### Demo 2.1: Build Chat Client Helper

In [40]:
import os
import json
import time
from typing import Any, Dict

import requests


class DialChatClient:
    """Wrapper around the DIAL Azure OpenAI Chat Completions endpoint for GPT-5 demos."""

    def __init__(self) -> None:
        required = ["DIAL_API_ENDPOINT", "DIAL_DEPLOYMENT", "DIAL_API_VERSION", "DIAL_API_KEY"]
        missing = [var for var in required if not os.environ.get(var)]
        if missing:
            raise RuntimeError(
                f"Missing environment variables: {', '.join(missing)}. Run the configuration cell above first."
            )

        self.endpoint = os.environ['DIAL_API_ENDPOINT']
        self.deployment = os.environ['DIAL_DEPLOYMENT']
        self.api_version = os.environ['DIAL_API_VERSION']
        self.api_key = os.environ['DIAL_API_KEY']

        self.url = (
            f"{self.endpoint}/openai/deployments/{self.deployment}/chat/completions"
            f"?api-version={self.api_version}"
        )
        self.headers = {
            'Content-Type': 'application/json',
            'api-key': self.api_key,
        }
        self.total_tokens = 0
        self.total_requests = 0

    def call(self, payload: Dict[str, Any], *, show_request: bool = False, allow_error: bool = False) -> Dict[str, Any]:
        """Send a chat completion request and capture the response."""
        if show_request:
            print('Request payload:')
            print(json.dumps(payload, indent=2))

        start = time.perf_counter()  # monotonic clock, unaffected by system clock changes
        response = requests.post(self.url, headers=self.headers, json=payload, timeout=60)
        elapsed = time.perf_counter() - start

        try:
            response.raise_for_status()
        except requests.exceptions.HTTPError as exc:
            status = exc.response.status_code if exc.response is not None else None
            if status == 403:
                message = (
                    "403 Access denied from DIAL. Check that your API key has quota for "
                    f"deployment '{self.deployment}' or switch to gpt-5-mini-2025-08-07."
                )
                print(message)
            try:
                error_payload = exc.response.json()
            except Exception:
                error_payload = {'text': getattr(exc.response, 'text', str(exc))}

            if allow_error:
                return {
                    'error': error_payload,
                    'status_code': status,
                    'elapsed': elapsed,
                }
            raise

        data = response.json()
        usage = data.get('usage', {})
        self.total_requests += 1
        self.total_tokens += usage.get('total_tokens', 0)
        return {
            'data': data,
            'elapsed': elapsed,
        }

    @staticmethod
    def render_choice(result: Dict[str, Any]) -> None:
        """Pretty-print the assistant reply and finish reason."""
        choice = result['data']['choices'][0]
        # content can be None (tool calls) or empty (reasoning used the whole output budget)
        print((choice['message'].get('content') or '').strip())
        print('finish_reason:', choice['finish_reason'])
        usage = result['data'].get('usage', {})
        if usage:
            print('usage:', json.dumps(usage, indent=2))

    def usage_summary(self) -> Dict[str, Any]:
        """Return aggregate usage stats for all demo calls."""
        return {
            'endpoint': self.endpoint,
            'deployment': self.deployment,
            'total_requests': self.total_requests,
            'total_tokens': self.total_tokens,
            # Rough flat rate of $5 per million tokens; actual billing prices input and output separately
            'estimated_cost_usd': round(self.total_tokens * 0.000005, 6),
        }


def build_demo_client() -> DialChatClient:
    """Factory function so notebooks can rebuild the client after config changes."""
    return DialChatClient()


client = build_demo_client()
print('Ready to call:', client.url)


Ready to call: https://ai-proxy.lab.epam.com/openai/deployments/gpt-5-mini-2025-08-07/chat/completions?api-version=2024-10-21


### Demo 2.2: Basic GPT-5 Call

Minimal payload with token tracking.

In [41]:
basic_payload = {
    "messages": [
        {
            "role": "developer",
            "content": "You are a helpful assistant that explains technical concepts clearly."
        },
        {
            "role": "user",
            "content": "Summarize what an API does in one sentence."
        }
    ],
    "max_completion_tokens": 120,
    "reasoning_effort": "medium"
}

basic_result = client.call(basic_payload, show_request=True)
client.render_choice(basic_result)


Request payload:
{
  "messages": [
    {
      "role": "developer",
      "content": "You are a helpful assistant that explains technical concepts clearly."
    },
    {
      "role": "user",
      "content": "Summarize what an API does in one sentence."
    }
  ],
  "max_completion_tokens": 120,
  "reasoning_effort": "medium"
}

finish_reason: length
usage: {
  "completion_tokens": 120,
  "prompt_tokens": 32,
  "total_tokens": 152,
  "completion_tokens_details": {
    "accepted_prediction_tokens": 0,
    "audio_tokens": 0,
    "reasoning_tokens": 120,
    "rejected_prediction_tokens": 0
  },
  "prompt_tokens_details": {
    "audio_tokens": 0,
    "cached_tokens": 0
  }
}
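
In the output above no text was printed: `finish_reason` is `length` and `reasoning_tokens` equals `completion_tokens` (120 of 120), so the whole output budget went to hidden reasoning. A small check can flag this case. The sketch below assumes the response shape shown above; `reasoning_exhausted` is a hypothetical helper name.

```python
from typing import Any, Dict


def reasoning_exhausted(data: Dict[str, Any]) -> bool:
    """Return True when the output budget was spent on reasoning and no text came back."""
    choice = data["choices"][0]
    usage = data.get("usage", {})
    details = usage.get("completion_tokens_details", {})
    content = choice["message"].get("content") or ""
    return (
        choice.get("finish_reason") == "length"
        and not content.strip()
        and details.get("reasoning_tokens", 0) >= usage.get("completion_tokens", 0) > 0
    )


# Response shaped like the Demo 2.2 output above
sample = {
    "choices": [{"message": {"content": ""}, "finish_reason": "length"}],
    "usage": {
        "completion_tokens": 120,
        "completion_tokens_details": {"reasoning_tokens": 120},
    },
}
print(reasoning_exhausted(sample))  # True
```

When the check returns `True`, the usual remedies are a larger `max_completion_tokens` or a lower `reasoning_effort`.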


---

## Part 3: Reasoning & Tokens

### Theory: How GPT-5 "Thinks"

**Reasoning Effort Levels:**
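- `low` - Fast, cheap (often ~10-50 reasoning tokens on simple prompts)
- `medium` - Balanced (often ~50-200 tokens)
- `high` - Deep analysis (~200-1000+ tokens)

These ranges are rough guides. Actual usage depends on the task and can consume the entire `max_completion_tokens` budget.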

**Token Economics:** ~4 chars = 1 token. You pay for:
- Input tokens (what you send)
- Output tokens (4-8x the input price at the rates below)
- Reasoning tokens (GPT-5 only, billed as output)

**Pricing (2025):** GPT-5 input $1.25/M, output $10/M | GPT-4o input $2.50/M, output $10/M
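
The pricing line translates directly into a per-request estimate. This is a sketch under stated assumptions: the rates are copied from the line above and should be checked against current pricing, the mini deployments used in the demos are priced differently, and `estimate_cost_usd` is a hypothetical helper.

```python
# USD per million tokens, copied from the pricing line above
PRICES = {
    "gpt-5": {"input": 1.25, "output": 10.00},
    "gpt-4o": {"input": 2.50, "output": 10.00},
}


def estimate_cost_usd(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate request cost; completion_tokens already includes reasoning tokens."""
    rates = PRICES[model]
    return (prompt_tokens * rates["input"] + completion_tokens * rates["output"]) / 1_000_000


# A 32-token prompt with 120 completion tokens, as in Demo 2.2
print(f"{estimate_cost_usd('gpt-5', 32, 120):.6f}")  # 0.001240
```

Output dominates the bill here: the 120 completion tokens cost thirty times more than the 32 prompt tokens, even though none of them produced visible text.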

---

### Demo 3.1: Compare Reasoning Efforts

In [42]:
question = (
    "If a store sells 15 items per hour and is open 8 hours a day, but 20% of customers return their items the next day, "
    "how many net items are sold in a 7-day week?"
)

payload_low = {
    "messages": [
        {"role": "developer", "content": "You are a careful math tutor."},
        {"role": "user", "content": question}
    ],
    "max_completion_tokens": 300,
    "reasoning_effort": "low"
}

payload_high = {
    "messages": [
        {"role": "developer", "content": "You are a careful math tutor."},
        {"role": "user", "content": question}
    ],
    "max_completion_tokens": 300,
    "reasoning_effort": "high"
}

def summarize(result, label):
    choice = result['data']['choices'][0]
    usage = result['data'].get('usage', {})
    reasoning_tokens = usage.get('completion_tokens_details', {}).get('reasoning_tokens')
    print(f"--- {label} ---")
    print((choice['message'].get('content') or '').strip())
    print(
        f"finish_reason: {choice['finish_reason']} | total_tokens: {usage.get('total_tokens')} | "
        f"reasoning_tokens: {reasoning_tokens}"
    )
    print()

low_result = client.call(payload_low)
high_result = client.call(payload_high)

summarize(low_result, 'Low reasoning effort')
summarize(high_result, 'High reasoning effort')


--- Low reasoning effort ---

finish_reason: length | total_tokens: 361 | reasoning_tokens: 300

--- High reasoning effort ---

finish_reason: length | total_tokens: 361 | reasoning_tokens: 300



---

## Part 4: Multi-Turn Conversations

### Theory: APIs Have No Memory

**Core Principle:** The API doesn't remember previous calls. YOU resend history.

**Pattern:**
```json
{
  "messages": [
    {"role": "developer", "content": "Instructions"},
    {"role": "user", "content": "Question 1"},
    {"role": "assistant", "content": "Answer 1"},
    {"role": "user", "content": "Question 2"}
  ]
}
```

**Strategies:** Sliding window (keep last N), summarize old turns, prune aggressively.
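
The sliding-window strategy is the simplest of the three to implement. A minimal sketch follows; `sliding_window` is a hypothetical helper, and pinning the system/developer instructions while dropping orphaned replies at the window start are design assumptions rather than API requirements.

```python
from typing import Dict, List

Message = Dict[str, str]
PINNED_ROLES = ("system", "developer")


def sliding_window(messages: List[Message], keep_last: int) -> List[Message]:
    """Keep all system/developer instructions plus the last keep_last other messages."""
    pinned = [m for m in messages if m["role"] in PINNED_ROLES]
    rest = [m for m in messages if m["role"] not in PINNED_ROLES]
    trimmed = rest[-keep_last:] if keep_last > 0 else []
    # Avoid starting the window on a reply whose question was dropped
    while trimmed and trimmed[0]["role"] != "user":
        trimmed = trimmed[1:]
    return pinned + trimmed


history = [
    {"role": "developer", "content": "Instructions"},
    {"role": "user", "content": "Question 1"},
    {"role": "assistant", "content": "Answer 1"},
    {"role": "user", "content": "Question 2"},
    {"role": "assistant", "content": "Answer 2"},
    {"role": "user", "content": "Question 3"},
]
print(len(sliding_window(history, keep_last=3)))  # 4
```

The trade-off is that anything outside the window is forgotten entirely, which is why summarizing older turns is often layered on top.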

---

### Demo 4.1: Multi-Turn Conversation

In [43]:
messages = [
    {"role": "developer", "content": "You are a DevOps expert."},
    {"role": "user", "content": "What is Docker?"}
]

turn1 = client.call({"messages": messages, "max_completion_tokens": 180, "reasoning_effort": "low"})
reply1 = turn1['data']['choices'][0]['message']['content']
print('Turn 1 response:\n' + reply1.strip())

messages.append({"role": "assistant", "content": reply1})
messages.append({"role": "user", "content": "How does it differ from a virtual machine?"})

turn2 = client.call({"messages": messages, "max_completion_tokens": 220, "reasoning_effort": "low"})
reply2 = turn2['data']['choices'][0]['message']['content']
print('\nTurn 2 response:\n' + reply2.strip())

messages.append({"role": "assistant", "content": reply2})
messages.append({"role": "user", "content": "Provide a minimal docker-compose example."})

turn3 = client.call({"messages": messages, "max_completion_tokens": 260, "reasoning_effort": "low"})
reply3 = turn3['data']['choices'][0]['message']['content']
print('\nTurn 3 response:\n' + reply3.strip())

print(f'\nMessages sent on turn 3: {len(messages)}')


Turn 1 response:


Turn 2 response:


Turn 3 response:


Messages sent on turn 3: 6


---

## Part 5: Token Limits & Costs

### Theory: Context Windows

**Model Limits:**
- GPT-4o: 128K context, ~16K max output
- GPT-5: 400K context (up to 272K input), 128K max output per OpenAI's published limits; a given deployment may enforce lower values

**Critical Rule:** `prompt_tokens + max_completion_tokens < model_limit`

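**When `finish_reason == "length"`:** The response hit the output token limit and was truncated. Raise the limit, shrink the prompt, or lower `reasoning_effort`. On reasoning models the visible text can be empty if reasoning consumed the whole budget.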

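**Lost-in-the-middle:** Models recall content in the middle of a long prompt less reliably than content near the start or end, and quality tends to drop once the prompt fills more than roughly 50-55% of the context limit. Keep important content at the top and bottom.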
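
The critical rule can be checked before a request is sent. The sketch below is illustrative only: `fits_context` and `rough_token_count` are hypothetical helpers, and the character-based count relies on the ~4 characters per token heuristic from Part 3, which is only approximate for English text.

```python
def rough_token_count(text: str) -> int:
    """Approximate tokens with the ~4 characters per token heuristic (rounded up)."""
    return (len(text) + 3) // 4


def fits_context(prompt_tokens: int, max_completion_tokens: int, model_limit: int) -> bool:
    """Apply the rule: prompt_tokens + max_completion_tokens < model_limit."""
    return prompt_tokens + max_completion_tokens < model_limit


GPT4O_LIMIT = 128_000  # context size from the list above

print(fits_context(100_000, 16_000, GPT4O_LIMIT))  # True
print(fits_context(120_000, 16_000, GPT4O_LIMIT))  # False
```

For exact counts, use the `prompt_tokens` value reported in a previous response's `usage` block or a real tokenizer rather than the character heuristic.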

---

### Demo 5.1: Token Truncation

In [44]:
detailed_prompt = (
    "Explain REST APIs in depth, covering how requests flow, common verbs, typical response codes, "
    "best practices, and frequent pitfalls."
)

constrained_payload = {
    "messages": [
        {"role": "developer", "content": "You are a technical writer."},
        {"role": "user", "content": detailed_prompt}
    ],
    "max_completion_tokens": 60,
    "reasoning_effort": "low"
}

constrained = client.call(constrained_payload)
choice_short = constrained['data']['choices'][0]
content_short = choice_short['message']['content']

print('Response (60 tokens max):\n' + content_short.strip())
print('\nFinish reason: ' + choice_short['finish_reason'])

if choice_short['finish_reason'] == 'length':
    print('\n⚠️  Output token limit reached - response was truncated!')

Response (60 tokens max):


Finish reason: length

⚠️  Output token limit reached - response was truncated!


---

## Part 6: Structured Outputs

### Theory: JSON Mode vs Schemas

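**JSON Mode:** `response_format: {"type": "json_object"}`
- Guarantees valid JSON syntax, provided the reply is not truncated (`finish_reason == "length"`)
- Requires the word "JSON" to appear somewhere in the messages
- Field names/types can vary

**Strict Schemas:** `response_format: {"type": "json_schema", "json_schema": {"name": ..., "strict": true, "schema": {...}}}`
- Enforces exact fields and types
- All fields must be listed in `required`, with `additionalProperties: false` on every object; model optional values as nullable types
- First call with a new schema is slow (~10 sec while the schema is processed and cached), then fast

**Important:** Guarantees format, NOT accuracy!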
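
Because JSON mode leaves field names and types up to the model, the reply still needs a local check before use. A minimal sketch using only the standard library: `check_fields` and the `EXPECTED_FIELDS` mapping are hypothetical names chosen for illustration.

```python
import json
from typing import Dict, List

# Hypothetical field set for an entity-extraction prompt
EXPECTED_FIELDS: Dict[str, type] = {"name": str, "age": int, "occupation": str, "city": str}


def check_fields(raw: str, expected: Dict[str, type] = EXPECTED_FIELDS) -> List[str]:
    """Return a list of problems found in a JSON-mode reply; empty means it matched."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(parsed, dict):
        return ["top-level value is not an object"]
    problems = []
    for field, expected_type in expected.items():
        if field not in parsed:
            problems.append(f"missing field: {field}")
        # bool is a subclass of int in Python, so reject it explicitly
        elif not isinstance(parsed[field], expected_type) or isinstance(parsed[field], bool):
            problems.append(f"wrong type for {field}: {type(parsed[field]).__name__}")
    return problems


print(check_fields('{"name": "Jisoo Kim", "age": "28"}'))
# ['wrong type for age: str', 'missing field: occupation', 'missing field: city']
```

A check like this confirms structure only. Whether the extracted values are correct still has to be verified against the source text.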

---

### Demo 6.1: JSON Extraction

In [45]:
json_payload = {
    "messages": [
        {
            "role": "developer",
            "content": "You extract entities and always respond with JSON containing name, age, occupation, city."
        },
        {
            "role": "user",
            "content": "Jisoo Kim is a 28-year-old software engineer living in Singapore."
        }
    ],
    "max_completion_tokens": 200,
    "reasoning_effort": "low",
    "response_format": {"type": "json_object"}
}

json_result = client.call(json_payload)
raw_content = json_result['data']['choices'][0]['message']['content']

print('Raw JSON response:\n' + raw_content)

try:
    parsed = json.loads(raw_content)
    print('\nParsed entity:')
    for key, val in parsed.items():
        print(f'  {key}: {val}')
except json.JSONDecodeError as e:
    print(f'\n Failed to parse JSON: {e}')

Raw JSON response:
{"name":"Jisoo Kim","age":28,"occupation":"software engineer","city":"Singapore"}

Parsed entity:
  name: Jisoo Kim
  age: 28
  occupation: software engineer
  city: Singapore


### Demo 6.2: Strict JSON Schema

Enforce exact structure for user profiles.

In [47]:
# Define a strict JSON schema for extracting user information
user_schema = {
    "type": "json_schema",
    "json_schema": {
        "name": "user_profile_extraction",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "personal_info": {
                    "type": "object",
                    "properties": {
                        "full_name": {
                            "type": "string",
                            "description": "The person's full name"
                        },
                        "age": {
                            "type": "integer",
                            "description": "The person's age in years"
                        },
                        "email": {
                            "type": ["string", "null"],
                            "description": "Email address if mentioned, otherwise null"
                        }
                    },
                    "required": ["full_name", "age", "email"],
                    "additionalProperties": False
                },
                "professional_info": {
                    "type": "object",
                    "properties": {
                        "occupation": {
                            "type": "string",
                            "description": "Current job title or occupation"
                        },
                        "company": {
                            "type": ["string", "null"],
                            "description": "Company name if mentioned"
                        },
                        "years_of_experience": {
                            "type": ["integer", "null"],
                            "description": "Years of professional experience if mentioned"
                        }
                    },
                    "required": ["occupation", "company", "years_of_experience"],
                    "additionalProperties": False
                },
                "location": {
                    "type": "object",
                    "properties": {
                        "city": {
                            "type": "string",
                            "description": "City name"
                        },
                        "country": {
                            "type": "string",
                            "description": "Country name"
                        }
                    },
                    "required": ["city", "country"],
                    "additionalProperties": False
                },
                "interests": {
                    "type": "array",
                    "items": {
                        "type": "string"
                    },
                    "description": "List of hobbies or interests mentioned"
                }
            },
            "required": ["personal_info", "professional_info", "location", "interests"],
            "additionalProperties": False
        }
    }
}

# Sample text to extract from
sample_text = """
Meet Dr. Emily Rodriguez, a 34-year-old data scientist at TechCorp AI Labs 
with 8 years of experience in machine learning. She lives in Barcelona, Spain, 
and enjoys hiking, photography, and reading science fiction novels. 
You can reach her at emily.r@techcorp.ai for collaboration opportunities.
"""

structured_payload = {
    "messages": [
        {
            "role": "developer",
            "content": "You extract structured information from text and return it in the specified JSON format."
        },
        {
            "role": "user",
            "content": f"Extract all relevant information from this text:\n\n{sample_text}"
        }
    ],
    "response_format": user_schema,
    "max_completion_tokens": 500,
    "reasoning_effort": "low"
}

print("=" * 70)
print("Structured Output Demo with Strict JSON Schema")
print("=" * 70)
print(f"\nInput text:\n{sample_text}")
print(f"\n{'=' * 70}")
print("Schema enforces:")
print("  - Exact field names and types")
print("  - Required fields (no missing data)")
print("  - No additional properties")
print("  - Nested object structure")
print(f"{'=' * 70}\n")

result = client.call(structured_payload)
response_content = result['data']['choices'][0]['message']['content']

print("Raw response:")
print(response_content)

# Parse and validate
try:
    parsed_data = json.loads(response_content)
    print(f"\n{'=' * 70}")
    print(" Response successfully parsed as JSON")
    print(f"{'=' * 70}")
    print("\nFormatted extraction:")
    print(json.dumps(parsed_data, indent=2))
    
    # Demonstrate accessing nested fields
    print(f"\n{'=' * 70}")
    print("Accessing structured data:")
    print(f"{'=' * 70}")
    print(f"  Name: {parsed_data['personal_info']['full_name']}")
    print(f"  Age: {parsed_data['personal_info']['age']}")
    print(f"  Job: {parsed_data['professional_info']['occupation']}")
    print(f"  Company: {parsed_data['professional_info']['company']}")
    print(f"  Location: {parsed_data['location']['city']}, {parsed_data['location']['country']}")
    print(f"  Experience: {parsed_data['professional_info']['years_of_experience']} years")
    print(f"  Interests: {', '.join(parsed_data['interests'])}")
    
    print(f"\n All required fields present and correctly typed!")
    
except json.JSONDecodeError as e:
    print(f"\n JSON parsing error: {e}")
except KeyError as e:
    print(f"\n Missing expected field: {e}")

Structured Output Demo with Strict JSON Schema

Input text:

Meet Dr. Emily Rodriguez, a 34-year-old data scientist at TechCorp AI Labs 
with 8 years of experience in machine learning. She lives in Barcelona, Spain, 
and enjoys hiking, photography, and reading science fiction novels. 
You can reach her at emily.r@techcorp.ai for collaboration opportunities.


Schema enforces:
  - Exact field names and types
  - Required fields (no missing data)
  - No additional properties
  - Nested object structure

Raw response:
{
  "personal_info": {
    "full_name": "Dr. Emily Rodriguez",
    "age": 34,
    "email": "emily.r@techcorp.ai"
  },
  "professional_info": {
    "occupation": "Data Scientist",
    "company": "TechCorp AI Labs",
    "years_of_experience": 8
  },
  "location": {
    "city": "Barcelona",
    "country": "Spain"
  },
  "interests": [
    "Hiking",
    "Photography",
    "Reading science fiction novels"
  ]
}

 Response successfully parsed as JSON

Formatted extraction:
{
  "pe

---

## Part 7: Tool Calling

### Theory: Function Calling Workflow

**5-Step Process:**
1. Define tools with JSON Schema (costs input tokens)
2. Model decides to call (`finish_reason: "tool_calls"`)
3. Your code executes the function
4. Return results with `role: "tool"`
5. Model synthesizes final answer

**Control:** `tool_choice` = `"auto"`, `"none"`, `"required"`, or a specific function given as `{"type": "function", "function": {"name": "..."}}`

**Best Practice:** Design tools like clean APIs - single responsibility, clear names, constrained enums.
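
Steps 3 and 4 of the workflow can be sketched without a network call. The helper below is an assumption-laden illustration: `run_tool_calls` and the `registry` mapping are hypothetical names, while the message shapes (`tool_calls`, `tool_call_id`, `role: "tool"`) follow the Chat Completions format.

```python
import json
from typing import Any, Callable, Dict, List


def run_tool_calls(assistant_message: Dict[str, Any],
                   registry: Dict[str, Callable[..., Any]]) -> List[Dict[str, Any]]:
    """Execute each requested tool call and return the messages to append to history."""
    # The assistant turn that requested the tools must precede the tool results
    new_messages: List[Dict[str, Any]] = [assistant_message]
    for call in assistant_message.get("tool_calls") or []:
        name = call["function"]["name"]
        try:
            args = json.loads(call["function"]["arguments"])
            output = registry[name](**args)
        except Exception as exc:  # report failures to the model instead of crashing
            output = f"Error: {exc}"
        new_messages.append({
            "role": "tool",
            "tool_call_id": call["id"],  # must match the id the model issued
            "content": json.dumps(output),
        })
    return new_messages


# Simulated assistant turn requesting one tool call
assistant_turn = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "add", "arguments": '{"x": 2, "y": 3}'},
    }],
}
appended = run_tool_calls(assistant_turn, {"add": lambda x, y: x + y})
print(appended[1]["content"])  # 5
```

Appending the returned messages to the history and sending it again gives the model what it needs for step 5, the final synthesized answer.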

---

### Demo 7.1: Single Tool (Calculator)

In [48]:
# Define a simple calculator tool
def calculate(operation, x, y):
    """Execute a mathematical operation."""
    operations = {
        'add': lambda a, b: a + b,
        'subtract': lambda a, b: a - b,
        'multiply': lambda a, b: a * b,
        'divide': lambda a, b: a / b if b != 0 else 'Error: Division by zero'
    }
    return operations.get(operation, lambda a, b: 'Unknown operation')(x, y)

# Define the tool schema for the model
calculator_tool = {
    "type": "function",
    "function": {
        "name": "calculate",
        "description": "Perform basic arithmetic operations (add, subtract, multiply, divide)",
        "parameters": {
            "type": "object",
            "properties": {
                "operation": {
                    "type": "string",
                    "enum": ["add", "subtract", "multiply", "divide"],
                    "description": "The arithmetic operation to perform"
                },
                "x": {
                    "type": "number",
                    "description": "The first number"
                },
                "y": {
                    "type": "number",
                    "description": "The second number"
                }
            },
            "required": ["operation", "x", "y"],
            "additionalProperties": False
        }
    }
}

# Initial request with tool definition
tool_payload = {
    "messages": [
        {
            "role": "developer",
            "content": "You are a helpful math assistant. Use the calculator tool when needed."
        },
        {
            "role": "user",
            "content": "What is 127 multiplied by 89?"
        }
    ],
    "tools": [calculator_tool],
    "tool_choice": "auto",
    "max_completion_tokens": 300,
    "reasoning_effort": "low"
}

print("=" * 70)
print("STEP 1: Sending request with tool definition")
print("=" * 70)

# Send request
result1 = client.call(tool_payload, show_request=False)
choice1 = result1['data']['choices'][0]
message1 = choice1['message']

print(f"\nFinish reason: {choice1['finish_reason']}")
print(f"Model response: {message1.get('content', '(no text content)')}")

# Check if model wants to call a tool
if choice1['finish_reason'] == 'tool_calls' and message1.get('tool_calls'):
    tool_call = message1['tool_calls'][0]
    function_name = tool_call['function']['name']
    function_args = json.loads(tool_call['function']['arguments'])
    
    print(f"\n{'=' * 70}")
    print("STEP 2: Model requested tool call")
    print(f"{'=' * 70}")
    print(f"Function: {function_name}")
    print(f"Arguments: {json.dumps(function_args, indent=2)}")
    
    # Execute the function
    result = calculate(**function_args)
    print(f"\nFunction result: {result}")
    
    # Append the assistant's tool call request
    tool_payload['messages'].append(message1)
    
    # Append the tool result
    tool_payload['messages'].append({
        "role": "tool",
        "tool_call_id": tool_call['id'],
        "content": json.dumps({"result": result})
    })
    
    print(f"\n{'=' * 70}")
    print("STEP 3: Sending tool result back to model")
    print(f"{'=' * 70}")
    
    # Send back with tool result
    result2 = client.call(tool_payload)
    final_response = result2['data']['choices'][0]['message']['content']
    
    print(f"\nFinal response:")
    print(final_response)
    print(f"\nFinish reason: {result2['data']['choices'][0]['finish_reason']}")
else:
    print("\nModel responded without calling tool (unexpected for this example)")

print(f"\n{'=' * 70}")
print("Tool calling demo complete!")
print(f"{'=' * 70}")

STEP 1: Sending request with tool definition

Finish reason: tool_calls
Model response: None

STEP 2: Model requested tool call
Function: calculate
Arguments: {
  "operation": "multiply",
  "x": 127,
  "y": 89
}

Function result: 11303

STEP 3: Sending tool result back to model

Final response:
127 × 89 = 11,303

Finish reason: stop

Tool calling demo complete!
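
The demo above trusts the model's arguments completely. A more defensive version of step 3 might look like the sketch below, which reports failures back to the model as a tool result instead of raising. `execute_tool_call` and `registry` are hypothetical names introduced for illustration; the message shape mirrors the `role: "tool"` message used in the demo.

```python
import json

def execute_tool_call(tool_call, registry):
    """Run one tool call and return the `role: "tool"` message to append."""
    name = tool_call["function"]["name"]
    try:
        # The arguments arrive as a JSON string and may be malformed
        args = json.loads(tool_call["function"]["arguments"])
        if name not in registry:
            raise ValueError(f"unknown tool: {name}")
        payload = {"result": registry[name](**args)}
    except (ValueError, TypeError, ArithmeticError) as exc:
        # Report the failure to the model instead of crashing the loop
        payload = {"error": str(exc)}
    return {
        "role": "tool",
        "tool_call_id": tool_call["id"],
        "content": json.dumps(payload),
    }

# Hypothetical usage with a one-function registry
registry = {"add": lambda x, y: x + y}
call = {"id": "call_1", "function": {"name": "add", "arguments": '{"x": 2, "y": 3}'}}
print(execute_tool_call(call, registry))
```

Returning an `error` payload gives the model a chance to correct its arguments on the next turn, at the cost of one more round trip.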


### Demo 7.2: Multiple Tools

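The model chooses between the available tools and can request several calls in a single response (parallel tool calls). Each requested call needs its own `role: "tool"` result before the next request.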

In [49]:
# Define multiple tools
def get_current_weather(location, units="celsius"):
    """Simulated weather API."""
    # In real scenario, this would call an actual API
    weather_data = {
        "San Francisco": {"temp": 18, "conditions": "Partly cloudy"},
        "Tokyo": {"temp": 22, "conditions": "Sunny"},
        "London": {"temp": 12, "conditions": "Rainy"},
        "Paris": {"temp": 15, "conditions": "Overcast"}
    }
    data = weather_data.get(location, {"temp": 20, "conditions": "Unknown"})
    return {
        "location": location,
        "temperature": data["temp"],
        "units": units,
        "conditions": data["conditions"]
    }

def search_database(query, limit=5):
    """Simulated database search."""
    # Simulate a product database
    products = [
        {"id": 1, "name": "Laptop Pro 15", "price": 1299, "category": "Electronics"},
        {"id": 2, "name": "Wireless Mouse", "price": 29, "category": "Electronics"},
        {"id": 3, "name": "Office Chair", "price": 249, "category": "Furniture"},
        {"id": 4, "name": "Desk Lamp", "price": 45, "category": "Furniture"},
        {"id": 5, "name": "Notebook Set", "price": 12, "category": "Stationery"}
    ]
    # Simple search simulation
    results = [p for p in products if query.lower() in p['name'].lower() or query.lower() in p['category'].lower()]
    return results[:limit]

# Define tool schemas
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "The city name, e.g., San Francisco, Tokyo"
                },
                "units": {
                    "type": "string",
                    "enum": ["celsius", "fahrenheit"],
                    "description": "Temperature units"
                }
            },
            "required": ["location"],
            "additionalProperties": False
        }
    }
}

database_tool = {
    "type": "function",
    "function": {
        "name": "search_database",
        "description": "Search the product database for items matching a query",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "Search query string"
                },
                "limit": {
                    "type": "integer",
                    "description": "Maximum number of results to return",
                    "default": 5
                }
            },
            "required": ["query"],
            "additionalProperties": False
        }
    }
}

# Function dispatcher
tool_functions = {
    "get_current_weather": get_current_weather,
    "search_database": search_database
}

# Multi-tool request
multi_tool_payload = {
    "messages": [
        {
            "role": "developer",
            "content": "You are a helpful assistant with access to weather data and a product database."
        },
        {
            "role": "user",
            "content": "What's the weather in Tokyo and can you find electronics in the database?"
        }
    ],
    "tools": [weather_tool, database_tool],
    "tool_choice": "auto",
    "max_completion_tokens": 500,
    "reasoning_effort": "low"
}

print("=" * 70)
print("Multi-Tool Demo: Weather + Database")
print("=" * 70)

# First request
result1 = client.call(multi_tool_payload)
choice1 = result1['data']['choices'][0]
message1 = choice1['message']

print(f"\nModel's initial response:")
print(f"Finish reason: {choice1['finish_reason']}")

if message1.get('tool_calls'):
    print(f"\nModel requested {len(message1['tool_calls'])} tool call(s):")
    
    # Append assistant message
    multi_tool_payload['messages'].append(message1)
    
    # Execute all tool calls
    for tool_call in message1['tool_calls']:
        function_name = tool_call['function']['name']
        function_args = json.loads(tool_call['function']['arguments'])
        
        print(f"\n  Tool: {function_name}")
        print(f"  Args: {json.dumps(function_args, indent=4)}")
        
        # Execute function
        function_result = tool_functions[function_name](**function_args)
        print(f"  Result: {json.dumps(function_result, indent=4)}")
        
        # Append tool result
        multi_tool_payload['messages'].append({
            "role": "tool",
            "tool_call_id": tool_call['id'],
            "content": json.dumps(function_result)
        })
    
    # Send results back to model
    print(f"\n{'=' * 70}")
    print("Sending tool results back to model...")
    print(f"{'=' * 70}\n")
    
    result2 = client.call(multi_tool_payload)
    final_message = result2['data']['choices'][0]['message']['content']
    
    print("Final response:")
    print(final_message)
else:
    print("\nNo tool calls requested")
    print(message1.get('content', ''))

Multi-Tool Demo: Weather + Database

Model's initial response:
Finish reason: tool_calls

Model requested 2 tool call(s):

  Tool: get_current_weather
  Args: {
    "location": "Tokyo",
    "units": "celsius"
}
  Result: {
    "location": "Tokyo",
    "temperature": 22,
    "units": "celsius",
    "conditions": "Sunny"
}

  Tool: search_database
  Args: {
    "query": "electronics",
    "limit": 5
}
  Result: [
    {
        "id": 1,
        "name": "Laptop Pro 15",
        "price": 1299,
        "category": "Electronics"
    },
    {
        "id": 2,
        "name": "Wireless Mouse",
        "price": 29,
        "category": "Electronics"
    }
]

Sending tool results back to model...

Final response:
Tokyo: 22°C, sunny.

Electronics found (top results):
- Laptop Pro 15 — $1299 (category: Electronics)
- Wireless Mouse — $29 (category: Electronics)

Would you like more results, details on either item, or a different unit for the temperature?
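
The demo handles exactly one round of tool calls. When the model may chain several rounds, the same steps can be wrapped in a bounded loop. The sketch below assumes a hypothetical `call_model(messages)` callable that returns the assistant message dict (for example a thin wrapper around `client.call`); `run_tool_loop` and `max_rounds` are illustrative names, not part of the workshop client.

```python
import json

def run_tool_loop(call_model, messages, registry, max_rounds=5):
    """Call the model until it stops requesting tools or `max_rounds` is hit."""
    for _ in range(max_rounds):
        message = call_model(messages)
        messages.append(message)  # keep the assistant turn in history
        tool_calls = message.get("tool_calls") or []
        if not tool_calls:
            return message.get("content")
        # Answer every requested call, matching results by tool_call_id
        for tool_call in tool_calls:
            func = registry[tool_call["function"]["name"]]
            args = json.loads(tool_call["function"]["arguments"])
            messages.append({
                "role": "tool",
                "tool_call_id": tool_call["id"],
                "content": json.dumps(func(**args)),
            })
    raise RuntimeError(f"Model still requesting tools after {max_rounds} rounds")
```

The round limit is a design choice: it caps token spend if the model keeps calling tools without converging on an answer.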


---

## Part 8: Streaming Responses

### Theory: Server-Sent Events

**How It Works:**
1. Set `stream: true`
2. Receive chunks as `data: {...}\n\n`
3. Accumulate `delta.content`
4. Stop at `data: [DONE]`

**Benefits:** Lower perceived latency, progressive UI updates

**Trade-off:** More complex error handling, can't validate before displaying

---

### Demo 8.1: Stream Tokens

In [50]:
import sys
import time

# Streaming request
streaming_payload = {
    "messages": [
        {
            "role": "developer",
            "content": "You are a technical writer who explains concepts clearly."
        },
        {
            "role": "user",
            "content": "Explain how HTTP requests work, step by step."
        }
    ],
    "max_completion_tokens": 600,
    "stream": True  # Enable streaming
}

print("=" * 70)
print("Streaming Demo")
print("=" * 70)
print("\nStreaming response (tokens appear in real-time):\n")

# Prepare streaming request
url = client.url
headers = client.headers

start_time = time.time()
full_content = ""
chunk_count = 0

try:
    # Make streaming request
    response = requests.post(url, headers=headers, json=streaming_payload, stream=True, timeout=60)
    response.raise_for_status()
    
    # Process Server-Sent Events
    for line in response.iter_lines():
        if line:
            line_str = line.decode('utf-8')
            
            # SSE format: "data: {...}"
            if line_str.startswith('data: '):
                data_str = line_str[6:]  # Remove "data: " prefix
                
                # Check for stream end
                if data_str.strip() == '[DONE]':
                    break
                
                try:
                    # Parse JSON chunk
                    chunk = json.loads(data_str)
                    chunk_count += 1
                    
                    # Extract delta content
                    if 'choices' in chunk and len(chunk['choices']) > 0:
                        delta = chunk['choices'][0].get('delta', {})
                        content = delta.get('content', '')
                        
                        if content:
                            full_content += content
                            # Print token(s) as they arrive
                            print(content, end='', flush=True)
                        
                        # Check for finish reason
                        finish_reason = chunk['choices'][0].get('finish_reason')
                        if finish_reason:
                            print(f"\n\n[Stream ended: {finish_reason}]")
                
                except json.JSONDecodeError:
                    pass  # Skip malformed JSON
    
    elapsed = time.time() - start_time
    
    print(f"\n\n{'=' * 70}")
    print(f"Streaming Statistics:")
    print(f"  Total chunks received: {chunk_count}")
    print(f"  Total characters: {len(full_content)}")
    print(f"  Time elapsed: {elapsed:.2f}s")
    print(f"  Avg time per chunk: {(elapsed / max(chunk_count, 1)) * 1000:.1f}ms")  # guard against zero chunks
    print(f"{'=' * 70}")

except requests.exceptions.RequestException as e:
    print(f"\n\n Streaming error: {e}")

Streaming Demo

Streaming response (tokens appear in real-time):



[Stream ended: length]


Streaming Statistics:
  Total chunks received: 2
  Total characters: 0
  Time elapsed: 6.17s
  Avg time per chunk: 3082.5ms
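
The recorded run ended with `finish_reason: length` and zero visible characters. One plausible explanation, stated here as an assumption, is that the reasoning model used the whole `max_completion_tokens` budget on reasoning tokens before emitting text; setting `reasoning_effort` to `low` or raising the budget may help. The sketch below separates the SSE parsing from the network call so the empty-output case can be detected; `accumulate_sse` is a hypothetical helper, and the sample lines are made up to mirror the chunk shape used in the demo.

```python
import json

def accumulate_sse(lines):
    """Fold decoded SSE lines into (text, finish_reason, chunk_count)."""
    parts, finish_reason, chunks = [], None, 0
    for line in lines:
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and other fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        try:
            chunk = json.loads(data)
        except json.JSONDecodeError:
            continue  # skip malformed chunks, as the demo does
        chunks += 1
        for choice in chunk.get("choices", [])[:1]:
            content = (choice.get("delta") or {}).get("content")
            if content:
                parts.append(content)
            finish_reason = choice.get("finish_reason") or finish_reason
    return "".join(parts), finish_reason, chunks

# Made-up lines that reproduce the empty-output case from the run above
sample = [
    'data: {"choices": [{"delta": {"role": "assistant", "content": ""}}]}',
    'data: {"choices": [{"delta": {}, "finish_reason": "length"}]}',
    'data: [DONE]',
]
text, reason, count = accumulate_sse(sample)
if not text and reason == "length":
    print("No visible text: output budget exhausted before any content arrived")
```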


---

## Part 9: Vision (Multimodal)

### Theory: Image Analysis

**Capabilities:**
- GPT-4o: text + images
- GPT-5: text only (for now)

**Detail Levels:**
- `low`: 85 tokens flat (classification)
- `high`: variable tokens, tiled processing (detailed analysis)

**Formats:** Base64 data or public URLs. Max 20MB/image, 50 images/request.
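
The difference between the two detail levels can be estimated before sending a request. The sketch below uses the tiling rule published for GPT-4o as recalled here (fit within 2048x2048, scale the shortest side down to 768, then 170 tokens per 512-pixel tile plus 85 base tokens). Treat these constants as assumptions: they may differ for other models or deployments, and `estimate_image_tokens` is a hypothetical helper.

```python
import math

def estimate_image_tokens(width, height, detail="high"):
    """Rough input-token estimate for one image (assumed GPT-4o rules)."""
    if detail == "low":
        return 85  # flat cost regardless of image size
    # Step 1: fit the image inside a 2048 x 2048 square (downscale only)
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # Step 2: scale so the shortest side is at most 768 px
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    # Step 3: count 512 px tiles, 170 tokens each, plus 85 base tokens
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles

for size in [(1024, 1024), (2048, 4096)]:
    print(size, estimate_image_tokens(*size, detail="high"))
```

The authoritative number is the `prompt_tokens` value in the response's `usage` field; an estimate like this is only useful for budgeting before the call.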

---

### Demo 9.1: Analyze Vacation Photo

In [51]:
import base64
import os
from pathlib import Path

# Use GPT-4o for vision support
gpt4o_deployment = os.environ.get('GPT4O_DEPLOYMENT', 'gpt-4o')

print("=" * 70)
print("Vision Demo: Analyzing Local Vacation Photos")
print("=" * 70)

IMAGE_RELATIVE_PATH = "src/main/resources/images/background-vacation1.jpeg"
IMAGE_RELATIVE_PATH_2 = "src/main/resources/images/background-vacation2.jpeg"

def resolve_resource(relative_path: str) -> Path:
    path_candidate = Path(relative_path)
    if path_candidate.is_absolute():
        if path_candidate.exists():
            return path_candidate
        raise FileNotFoundError(f"File not found: {path_candidate}")
    cwd = Path.cwd().resolve()
    for base in [cwd] + list(cwd.parents):
        candidate = base / relative_path
        if candidate.exists():
            return candidate
    raise FileNotFoundError(
        f"Unable to locate '{relative_path}'. Start the notebook from the project root or ensure the file exists."
    )

def load_image_base64(relative_path: str):
    resolved_path = resolve_resource(relative_path)
    with resolved_path.open("rb") as image_file:
        image_bytes = image_file.read()
    return base64.b64encode(image_bytes).decode("utf-8"), resolved_path

# Load first vacation image
image_data, image_file = load_image_base64(IMAGE_RELATIVE_PATH)
print(f"\nAnalyzing: {image_file}")

# Build payload with base64-encoded image
payload = {
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Describe this vacation photo. What location might this be? What activities or experiences does it suggest?"
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{image_data}",
                        "detail": "high"  # Use high detail for better analysis
                    }
                }
            ]
        }
    ],
    "max_tokens": 400,
    "temperature": 0.7
}

# Build GPT-4o endpoint
endpoint = os.environ['DIAL_API_ENDPOINT']
api_version = os.environ['DIAL_API_VERSION']
gpt4o_url = f"{endpoint}/openai/deployments/{gpt4o_deployment}/chat/completions?api-version={api_version}"

headers = {
    "Content-Type": "application/json",
    "api-key": os.environ['DIAL_API_KEY']
}

try:
    response = requests.post(gpt4o_url, headers=headers, json=payload, timeout=60)
    response.raise_for_status()
    result = response.json()

    print("\nGPT-4o Analysis:")
    print("-" * 70)
    print(result['choices'][0]['message']['content'])

    usage = result.get('usage', {})
    print("\n" + "=" * 70)
    print("Token Usage:")
    print(f"  Input tokens: {usage.get('prompt_tokens', 0)}")
    print(f"  Output tokens: {usage.get('completion_tokens', 0)}")
    print(f"  Total: {usage.get('total_tokens', 0)}")
    print("\n High detail image analysis uses more tokens but provides richer descriptions")

    # Optional: Analyze second image
    print("\n" + "=" * 70)
    print("Analyzing second vacation photo...")
    print("=" * 70)

    image_data2, image_file_2 = load_image_base64(IMAGE_RELATIVE_PATH_2)

    payload2 = {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this vacation scene. Compare the mood and setting to a beach destination."},
                    {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_data2}", "detail": "low"}}
                ]
            }
        ],
        "max_tokens": 300,
        "temperature": 0.7
    }

    response2 = requests.post(gpt4o_url, headers=headers, json=payload2, timeout=60)
    response2.raise_for_status()  # surface HTTP errors before parsing the body
    result2 = response2.json()

    print(f"\nAnalyzing: {image_file_2}")
    print("-" * 70)
    print(result2['choices'][0]['message']['content'])

    usage2 = result2.get('usage', {})
    print("\n" + "=" * 70)
    print("Token Comparison (detail='low' vs detail='high'):")
    print(f"  Image 1 (high): {usage.get('prompt_tokens', 0)} input tokens")
    print(f"  Image 2 (low):  {usage2.get('prompt_tokens', 0)} input tokens")
    print(f"  Savings: ~{usage.get('prompt_tokens', 0) - usage2.get('prompt_tokens', 0)} tokens with 'low' detail")

except Exception as e:
    print(f"\n Error: {e}")
    if hasattr(e, 'response') and e.response is not None:
        try:
            error_detail = e.response.json()
            print(f"Details: {error_detail}")
        except ValueError:  # error body was not valid JSON
            print(f"Response: {e.response.text[:500]}")

print("\n" + "=" * 70)
print("Key Takeaways:")
print("=" * 70)
print(" GPT-4o supports vision; GPT-5 is text-only")
print(" 'detail': 'high' → better analysis, more tokens")
print(" 'detail': 'low' → faster, cheaper, good for classification")
print(" Images can be base64-encoded or public URLs")
print(" Remove images from history after analysis to save tokens")


Vision Demo: Analyzing Local Vacation Photos

Analyzing: /Users/Natig_Kurbanov/IdeaProjects/spring-ai-workshop/src/main/resources/images/background-vacation1.jpeg

GPT-4o Analysis:
----------------------------------------------------------------------
This vacation photo depicts a stunning tropical beach scene characterized by soft, sandy shores, crystal-clear turquoise waters, and lush green hills in the background. The shoreline is gently kissed by the waves, creating a serene atmosphere. Towering palm trees frame the scene, adding to the exotic feel, while traditional thatched-roof huts suggest a laid-back, island lifestyle.

The location could likely be a tropical paradise such as the Maldives, Fiji, or a Caribbean island, known for their picturesque beaches and lush landscapes. The warm, inviting climate and vibrant scenery make it an ideal spot for relaxation and leisure.

The activities and experiences suggested by this photo include sunbathing on the beach, swimming or snorkeli
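
The takeaway about removing images from history can be sketched as follows. Because the API is stateless, an image left in `messages` is sent and billed again on every later turn. `strip_images` and its placeholder text are hypothetical; the content-part shape (`type: "image_url"`) matches the payload used in this demo.

```python
def strip_images(messages, placeholder="[image removed after analysis]"):
    """Return a copy of `messages` with image parts replaced by a text note."""
    cleaned = []
    for msg in messages:
        content = msg.get("content")
        if isinstance(content, list):
            new_parts = []
            for part in content:
                if part.get("type") == "image_url":
                    # Keep a marker so the model knows an image was discussed
                    new_parts.append({"type": "text", "text": placeholder})
                else:
                    new_parts.append(part)
            msg = {**msg, "content": new_parts}
        cleaned.append(msg)
    return cleaned

# Hypothetical history with one image turn and one plain-text turn
history = [
    {"role": "user", "content": [
        {"type": "text", "text": "Describe this photo."},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,AAAA", "detail": "high"}},
    ]},
    {"role": "assistant", "content": "A tropical beach scene."},
]
print(strip_images(history))
```

The assistant's earlier description stays in history, so follow-up questions can usually still be answered from the text alone.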

---

## Part 10: Prompt Patterns

### Theory: Few-Shot Learning

**Pattern:** Show 3-5 examples before the actual query.

**Benefits:**
- Teaches output format without complex instructions
- Improves consistency
- Anchors domain-specific tone

**Prompt Design:** "Right altitude" system prompts - specific enough for reliable heuristics, flexible enough to generalize. Use `<instructions>`, `<examples>` sections.
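
As a sketch of the pattern, the helper below assembles a few-shot `messages` array from example pairs, wrapping the instruction in an `<instructions>` section. `build_few_shot_messages` is a hypothetical name, and passing the examples as alternating user and assistant turns is one common layout rather than the only option.

```python
def build_few_shot_messages(instructions, examples, query, system_role="developer"):
    """Build a messages list: instruction, example turns, then the real query."""
    messages = [{
        "role": system_role,
        "content": f"<instructions>\n{instructions}\n</instructions>",
    }]
    # Each example becomes a user turn followed by the ideal assistant reply
    for user_text, assistant_text in examples:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    messages.append({"role": "user", "content": query})
    return messages

# Hypothetical usage with two short examples
examples = [
    ("Review: Broke after one week.", "SENTIMENT: Negative"),
    ("Review: Best purchase this year.", "SENTIMENT: Positive"),
]
for message in build_few_shot_messages("Classify review sentiment.", examples, "Review: Works as expected."):
    print(message["role"], "|", message["content"])
```

For a standard model such as GPT-4o, pass `system_role="system"` so the instruction uses the role that model family expects.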

---

### Demo 10.1: Few-Shot Prompting (Sentiment, SQL, Formatting)

In [None]:
# Example 1: Sentiment analysis with specific format
print("=" * 70)
print("Few-Shot Prompting Demo - Sentiment Analysis")
print("=" * 70)

# Zero-shot (no examples)
zero_shot_payload = {
    "messages": [
        {
            "role": "developer",
            "content": "Analyze the sentiment of product reviews."
        },
        {
            "role": "user",
            "content": "Review: This laptop exceeded my expectations! The battery lasts all day and it's super fast."
        }
    ],
    "max_completion_tokens": 100,
    "reasoning_effort": "low"
}

print("\n1. ZERO-SHOT (no examples):")
result = client.call(zero_shot_payload)
print(f"Response: {result['data']['choices'][0]['message']['content']}")

# Few-shot (with examples)
few_shot_payload = {
    "messages": [
        {
            "role": "developer",
            "content": "Analyze product review sentiment and provide a structured response."
        },
        # Example 1
        {
            "role": "user",
            "content": "Review: The headphones broke after one week. Terrible quality."
        },
        {
            "role": "assistant",
            "content": "SENTIMENT: Negative\nSCORE: 1/5\nKEY_ISSUES: [durability, quality]\nRECOMMENDATION: Not recommended"
        },
        # Example 2
        {
            "role": "user",
            "content": "Review: Decent product for the price. Works as expected."
        },
        {
            "role": "assistant",
            "content": "SENTIMENT: Neutral\nSCORE: 3/5\nKEY_ISSUES: []\nRECOMMENDATION: Acceptable for budget-conscious buyers"
        },
        # Example 3
        {
            "role": "user",
            "content": "Review: Absolutely love it! Best purchase I've made this year."
        },
        {
            "role": "assistant",
            "content": "SENTIMENT: Positive\nSCORE: 5/5\nKEY_ISSUES: []\nRECOMMENDATION: Highly recommended"
        },
        # Actual query
        {
            "role": "user",
            "content": "Review: This laptop exceeded my expectations! The battery lasts all day and it's super fast."
        }
    ],
    "max_completion_tokens": 100,
    "reasoning_effort": "low"
}

print("\n2. FEW-SHOT (3 examples provided):")
result = client.call(few_shot_payload)
print(f"Response:\n{result['data']['choices'][0]['message']['content']}")

# Example 2: Data transformation
print(f"\n{'=' * 70}")
print("Few-Shot Prompting Demo - Data Transformation")
print(f"{'=' * 70}")

transform_payload = {
    "messages": [
        {
            "role": "developer",
            "content": "Transform natural language into SQL queries."
        },
        # Example 1
        {
            "role": "user",
            "content": "Show me all users who signed up last month"
        },
        {
            "role": "assistant",
            "content": "SELECT * FROM users WHERE signup_date >= DATE_SUB(CURRENT_DATE, INTERVAL 1 MONTH) AND signup_date < CURRENT_DATE;"
        },
        # Example 2
        {
            "role": "user",
            "content": "Find the top 5 products by revenue"
        },
        {
            "role": "assistant",
            "content": "SELECT product_id, product_name, SUM(price * quantity) as total_revenue FROM orders GROUP BY product_id, product_name ORDER BY total_revenue DESC LIMIT 5;"
        },
        # Example 3
        {
            "role": "user",
            "content": "Count active users from each country"
        },
        {
            "role": "assistant",
            "content": "SELECT country, COUNT(*) as active_users FROM users WHERE status = 'active' GROUP BY country ORDER BY active_users DESC;"
        },
        # Actual query
        {
            "role": "user",
            "content": "Show me orders over $1000 from this week with customer names"
        }
    ],
    "max_completion_tokens": 150,
    "reasoning_effort": "low"
}

print("\nQuery: 'Show me orders over $1000 from this week with customer names'")
result = client.call(transform_payload)
print(f"\nGenerated SQL:\n{result['data']['choices'][0]['message']['content']}")

# Example 3: Custom formatting
print(f"\n{'=' * 70}")
print("Few-Shot Prompting Demo - Custom Format")
print(f"{'=' * 70}")

format_payload = {
    "messages": [
        {
            "role": "developer",
            "content": "Convert meeting notes into structured action items."
        },
        # Example
        {
            "role": "user",
            "content": "Sarah mentioned we need to update the documentation and John will review the pull request by Friday."
        },
        {
            "role": "assistant",
            "content": """[ ] @Sarah - Update documentation
    Priority: Medium
    Due: TBD
    
[ ] @John - Review pull request
    Priority: High
    Due: Friday"""
        },
        # Actual query
        {
            "role": "user",
            "content": "Emily said she'll prepare the presentation for next Tuesday's client meeting. Mike needs to send the proposal by tomorrow and asked Lisa to help with the budget section."
        }
    ],
    "max_completion_tokens": 200,
    "reasoning_effort": "low"
}

print("\nInput: Meeting notes")
result = client.call(format_payload)
print(f"\nFormatted Action Items:\n{result['data']['choices'][0]['message']['content']}")

print(f"\n{'=' * 70}")
print("Key Benefits of Few-Shot Prompting:")
print("   More consistent output format")
print("   Better adherence to specific patterns")
print("   Reduced need for detailed instructions")
print("   Improved accuracy for domain-specific tasks")
print(f"{'=' * 70}")

### Demo 10.2: Context Management

Keep a long conversation inside a token budget by summarizing older turns once the estimated context size exceeds a limit.

In [None]:
# Simulate a long conversation with context management
def count_tokens_estimate(text):
    """Rough token estimation: ~4 characters per token."""
    return len(text) // 4

def summarize_conversation(messages):
    """Use the model to summarize old conversation turns."""
    summary_payload = {
        "messages": [
            {
                "role": "developer",
                "content": "Summarize the following conversation concisely, preserving key information."
            },
            {
                "role": "user",
                "content": "Conversation:\n" + "\n".join([
                    f"{msg['role']}: {msg['content']}" for msg in messages
                ])
            }
        ],
        "max_completion_tokens": 200,
        "reasoning_effort": "low"
    }
    result = client.call(summary_payload)
    return result['data']['choices'][0]['message']['content']

print("=" * 70)
print("Context Management Demo")
print("=" * 70)

# Start a conversation
conversation = [
    {"role": "developer", "content": "You are a helpful coding assistant."}
]

questions = [
    "What is Python?",
    "How do I create a list in Python?",
    "Can you explain list comprehensions?",
    "What's the difference between a list and a tuple?",
    "How do I sort a list?",
]

MAX_CONTEXT_TOKENS = 500  # Simulate a small context limit

print(f"\nContext limit: {MAX_CONTEXT_TOKENS} tokens")
print(f"{'=' * 70}\n")

for i, question in enumerate(questions, 1):
    # Add user question
    conversation.append({"role": "user", "content": question})
    
    # Estimate current context size
    context_text = json.dumps(conversation)
    estimated_tokens = count_tokens_estimate(context_text)
    
    print(f"Turn {i}: {question}")
    print(f"  Current context: ~{estimated_tokens} tokens")
    
    # Check if we need to compact context
    if estimated_tokens > MAX_CONTEXT_TOKENS:
        print("  Context exceeds limit! Applying compaction...")

        # Keep the system message and the current question; summarize everything in between
        system_msg = conversation[0]
        messages_to_summarize = conversation[1:-1]  # Skip system and current question
        
        if len(messages_to_summarize) > 0:
            print(f"  Summarizing {len(messages_to_summarize)} old messages...")
            summary = summarize_conversation(messages_to_summarize)
            
            # Rebuild conversation with summary
            conversation = [
                system_msg,
                {"role": "user", "content": f"[Previous conversation summary: {summary}]"},
                conversation[-1]  # Current question
            ]
            
            new_tokens = count_tokens_estimate(json.dumps(conversation))
            print(f"   Context reduced to ~{new_tokens} tokens")
            print(f"  Summary generated: {summary}")
    
    # Get response
    payload = {
        "messages": conversation,
        "max_completion_tokens": 150,
        "reasoning_effort": "low"
    }
    
    try:
        result = client.call(payload)
        response = result['data']['choices'][0]['message']['content']
        
        # Validate response is not empty
        if not response or response.strip() == '':
            print("\n  WARNING: Empty response received!")
            print(f"  Finish reason: {result['data']['choices'][0].get('finish_reason', 'unknown')}")
            response = "[Empty response from API]"
        
        # Add response to conversation
        conversation.append({"role": "assistant", "content": response})
        
        # Show FULL response
        print(f"\n  Response:")
        for line in response.split('\n'):
            print(f"    {line}")
        print(f"\n  Messages in conversation: {len(conversation)}")
        print(f"  Token usage: {result['data']['usage']['total_tokens']} tokens")
        
    except Exception as e:
        print(f"\n   ERROR: {e}")
        print(f"  Request failed - check your API configuration")
        break
    
    print()

print(f"{'=' * 70}")
print("Context Management Strategies Demonstrated:")
print("  1. Token estimation before API calls")
print("  2. Automatic summarization of old turns")
print("  3. Sliding window (keeping recent messages)")
print("  4. System message preservation")
print(f"{'=' * 70}")
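
The demo compacts by summarizing everything before the current question. A cheaper alternative that needs no extra model call is a plain sliding window, sketched below under the same ~4 characters per token heuristic. `estimate_tokens`, `trim_to_budget`, and `keep_recent` are hypothetical names for illustration.

```python
import json

def estimate_tokens(messages):
    """Same rough heuristic as the demo: ~4 characters per token."""
    return len(json.dumps(messages)) // 4

def trim_to_budget(messages, max_tokens, keep_recent=2):
    """Drop the oldest non-system messages until the estimate fits."""
    system, rest = messages[:1], list(messages[1:])
    # Always keep the system message and at least `keep_recent` latest messages
    while len(rest) > keep_recent and estimate_tokens(system + rest) > max_tokens:
        rest.pop(0)
    return system + rest

# Hypothetical conversation with six long turns
conversation = [{"role": "developer", "content": "You are a helpful coding assistant."}]
for i in range(6):
    role = "user" if i % 2 == 0 else "assistant"
    conversation.append({"role": role, "content": "x" * 200})
trimmed = trim_to_budget(conversation, max_tokens=150)
print(len(conversation), "->", len(trimmed), "messages")
```

Dropping messages one at a time can separate a question from its answer; trimming in user and assistant pairs, or combining the window with a summary, avoids that at the cost of slightly more code.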

---

## Part 11: Model Comparison

### Theory: Choosing the Right Model

**GPT-5 (Reasoning):**
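- Strong at complex logic, debugging, and math proofs
- Spends reasoning tokens before answering; their count is reported in `usage`, but the reasoning text itself is not returned
- Slower and more expensive; no vision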

**GPT-4o (Standard):**
- Fast, with vision support and `temperature` control
- Cheaper for high-volume tasks
- No reasoning phase; less reliable on complex logic

**When to Use:** GPT-5 for hard problems, GPT-4o for speed/vision/volume.
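
Because the two families accept different parameters, a small builder can keep call sites from mixing them up. The sketch below encodes the split described in this guide (`developer` role with `max_completion_tokens` and `reasoning_effort` for the reasoning model, `system` role with `max_tokens` and `temperature` for the standard one). `build_payload` and the `family` labels are hypothetical names, and the default values are arbitrary.

```python
def build_payload(family, instructions, user_text, max_output=200):
    """Build a chat payload with the parameters each model family accepts."""
    if family == "reasoning":  # e.g. a GPT-5 deployment
        return {
            "messages": [
                {"role": "developer", "content": instructions},
                {"role": "user", "content": user_text},
            ],
            "max_completion_tokens": max_output,
            "reasoning_effort": "low",
        }
    if family == "standard":  # e.g. a GPT-4o deployment
        return {
            "messages": [
                {"role": "system", "content": instructions},
                {"role": "user", "content": user_text},
            ],
            "max_tokens": max_output,
            "temperature": 0.7,
        }
    raise ValueError(f"Unknown model family: {family}")

print(build_payload("reasoning", "You are a technical writer.", "Explain REST."))
print(build_payload("standard", "You are a technical writer.", "Explain REST."))
```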

---

### Demo 11.1: Parameter Differences

In [None]:
import os
import requests
import json

endpoint = os.environ['DIAL_API_ENDPOINT']
api_version = os.environ['DIAL_API_VERSION']
api_key = os.environ['DIAL_API_KEY']

headers = {
    'Content-Type': 'application/json',
    'api-key': api_key,
}

prompt_text = "Explain what a REST API is in one paragraph."

print("=" * 70)
print("COMPARISON: GPT-5 vs GPT-4o Parameter Differences")
print("=" * 70)

# GPT-5 Request (Reasoning Model)
print("\n1️⃣  GPT-5 (Reasoning Model)")
print("-" * 70)

gpt5_deployment = os.environ['DIAL_DEPLOYMENT']
gpt5_url = f"{endpoint}/openai/deployments/{gpt5_deployment}/chat/completions?api-version={api_version}"

gpt5_payload = {
    'messages': [
        {'role': 'developer', 'content': 'You are a technical writer.'},
        {'role': 'user', 'content': prompt_text}
    ],
    'max_completion_tokens': 200,
    'reasoning_effort': 'medium'
}

print(f"Model: {gpt5_deployment}")
print("Parameters used:")
print(f"  • max_completion_tokens: 200")
print(f"  • reasoning_effort: medium")
print(f"  • role: developer (not system)")

try:
    response = requests.post(gpt5_url, headers=headers, json=gpt5_payload, timeout=30)
    response.raise_for_status()
    result = response.json()
    
    print(f"\n✓ Response:")
    print(result['choices'][0]['message']['content'])
    print(f"\n📊 Token usage: {result['usage']['total_tokens']} total")
    if 'completion_tokens_details' in result['usage']:
        reasoning = result['usage']['completion_tokens_details'].get('reasoning_tokens', 0)
        if reasoning:
            print(f"   Reasoning tokens: {reasoning}")
except Exception as e:
    print(f"\n Error: {e}")

# GPT-4o Request (Standard Model)
print("\n" + "=" * 70)
print("2️⃣  GPT-4o (Standard Model)")
print("-" * 70)

gpt4o_deployment = os.environ['GPT4O_DEPLOYMENT']
gpt4o_url = f"{endpoint}/openai/deployments/{gpt4o_deployment}/chat/completions?api-version={api_version}"

gpt4o_payload = {
    'messages': [
        {'role': 'system', 'content': 'You are a technical writer.'},
        {'role': 'user', 'content': prompt_text}
    ],
    'max_tokens': 200,
    'temperature': 0.7,
    'top_p': 0.9
}

print(f"Model: {gpt4o_deployment}")
print("Parameters used:")
print(f"  • max_tokens: 200 (not max_completion_tokens)")
print(f"  • temperature: 0.7")
print(f"  • top_p: 0.9")
print(f"  • role: system (not developer)")

try:
    response = requests.post(gpt4o_url, headers=headers, json=gpt4o_payload, timeout=30)
    response.raise_for_status()
    result = response.json()
    
    print(f"\n✓ Response:")
    print(result['choices'][0]['message']['content'])
    print(f"\n📊 Token usage: {result['usage']['total_tokens']} total")
except Exception as e:
    print(f"\n Error: {e}")

print("\n" + "=" * 70)
print("KEY DIFFERENCES:")
print("=" * 70)
print("\nGPT-5 (Reasoning):")
print("  ✓ max_completion_tokens, reasoning_effort")
print("  ✗ NO temperature, top_p, max_tokens")
print("\nGPT-4o (Standard):")
print("  ✓ max_tokens, temperature, top_p")
print("  ✗ NO reasoning_effort, max_completion_tokens")
print("=" * 70)

### Demo 11.2: Reasoning Quality

Compare GPT-5 at `reasoning_effort='high'` with GPT-4o on a multi-step logic puzzle.

In [None]:
import os
import requests
import json
import time

endpoint = os.environ['DIAL_API_ENDPOINT']
api_version = os.environ['DIAL_API_VERSION']
api_key = os.environ['DIAL_API_KEY']

headers = {
    'Content-Type': 'application/json',
    'api-key': api_key,
}

# Complex reasoning problem
problem = """A snail is at the bottom of a 20-foot well. Each day it climbs up 3 feet, 
but each night it slides down 2 feet. On which day will the snail reach the top of the well?"""

print("=" * 70)
print("COMPARISON: Reasoning Quality - GPT-5 vs GPT-4o")
print("=" * 70)
print(f"\nProblem: {problem}\n")
print("=" * 70)

# GPT-5 with HIGH reasoning effort
print("\n1️⃣  GPT-5 with reasoning_effort='high'")
print("-" * 70)

gpt5_url = f"{endpoint}/openai/deployments/{os.environ['DIAL_DEPLOYMENT']}/chat/completions?api-version={api_version}"

gpt5_payload = {
    'messages': [
        {'role': 'developer', 'content': 'You are a logical reasoning expert. Show your work step by step.'},
        {'role': 'user', 'content': problem}
    ],
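    # Reasoning tokens count against max_completion_tokens, so leave headroom
    # beyond the visible answer or the content may come back empty.
    'max_completion_tokens': 2000,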
    'reasoning_effort': 'high'
}

start = time.time()
try:
    response = requests.post(gpt5_url, headers=headers, json=gpt5_payload, timeout=60)
    response.raise_for_status()
    result = response.json()
    gpt5_time = time.time() - start
    
    gpt5_answer = result['choices'][0]['message']['content']
    gpt5_tokens = result['usage']['total_tokens']
    gpt5_reasoning_tokens = result['usage'].get('completion_tokens_details', {}).get('reasoning_tokens', 0)
    
    print(f"Response:\n{gpt5_answer}")
    print(f"\n📊 Stats:")
    print(f"   Total tokens: {gpt5_tokens}")
    print(f"   Reasoning tokens: {gpt5_reasoning_tokens}")
    print(f"   Time: {gpt5_time:.2f}s")
except Exception as e:
    print(f" Error: {e}")
    gpt5_answer = None

# GPT-4o with temperature for comparison
print("\n" + "=" * 70)
print("2️⃣  GPT-4o with temperature=0.7")
print("-" * 70)

gpt4o_url = f"{endpoint}/openai/deployments/{os.environ['GPT4O_DEPLOYMENT']}/chat/completions?api-version={api_version}"

gpt4o_payload = {
    'messages': [
        {'role': 'system', 'content': 'You are a logical reasoning expert. Show your work step by step.'},
        {'role': 'user', 'content': problem}
    ],
    'max_tokens': 500,
    'temperature': 0.7
}

start = time.time()
try:
    response = requests.post(gpt4o_url, headers=headers, json=gpt4o_payload, timeout=60)
    response.raise_for_status()
    result = response.json()
    gpt4o_time = time.time() - start
    
    gpt4o_answer = result['choices'][0]['message']['content']
    gpt4o_tokens = result['usage']['total_tokens']
    
    print(f"Response:\n{gpt4o_answer}")
    print(f"\n📊 Stats:")
    print(f"   Total tokens: {gpt4o_tokens}")
    print(f"   Time: {gpt4o_time:.2f}s")
except Exception as e:
    print(f" Error: {e}")
    gpt4o_answer = None

# Summary
print("\n" + "=" * 70)
print("ANALYSIS:")
print("=" * 70)
print("\nKey Observations:")
print("  • GPT-5 allocates reasoning tokens for deeper analysis")
print("  • GPT-5 may take longer but provides more thorough reasoning")
print("  • GPT-4o is faster but doesn't have a dedicated reasoning phase")
print("\nCorrect answer: Day 18")
print("   (After 17 full day/night cycles the snail is at 17 feet; on day 18 it climbs 3 feet and reaches 20)")
print("=" * 70)

### Demo 11.3: Cost vs Speed Trade-offs

Compare token usage, latency, and estimated cost on a simple task.

In [None]:
import os
import requests
import json
import time

endpoint = os.environ['DIAL_API_ENDPOINT']
api_version = os.environ['DIAL_API_VERSION']
api_key = os.environ['DIAL_API_KEY']

headers = {
    'Content-Type': 'application/json',
    'api-key': api_key,
}

# Simple task that doesn't need deep reasoning
simple_task = "Write a professional email subject line for a meeting reminder."

print("=" * 70)
print("COMPARISON: Cost & Speed - When to Use Which Model")
print("=" * 70)
print(f"\nTask: {simple_task}")
print("\nThis is a SIMPLE task. Let's compare cost and speed.")
print("=" * 70)

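# NOTE: the per-token prices used below (0.000005 and 0.000002) are
# illustrative placeholders; substitute your provider's current rates.
results = []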

# Test GPT-5 with LOW reasoning (cheapest)
print("\n1️⃣  GPT-5 with reasoning_effort='low'")
print("-" * 70)

gpt5_url = f"{endpoint}/openai/deployments/{os.environ['DIAL_DEPLOYMENT']}/chat/completions?api-version={api_version}"

gpt5_payload = {
    'messages': [
        {'role': 'developer', 'content': 'You write professional emails.'},
        {'role': 'user', 'content': simple_task}
    ],
    # Reasoning tokens share this budget, so a cap of 50 can leave no room for the answer
    'max_completion_tokens': 500,
    'reasoning_effort': 'low'
}

start = time.time()
try:
    response = requests.post(gpt5_url, headers=headers, json=gpt5_payload, timeout=30)
    response.raise_for_status()
    result = response.json()
    elapsed = time.time() - start
    
    print(f"Response: {result['choices'][0]['message']['content']}")
    print(f"\n📊 Stats:")
    print(f"   Tokens: {result['usage']['total_tokens']}")
    print(f"   Time: {elapsed:.2f}s")
    print(f"   Est. cost: ${result['usage']['total_tokens'] * 0.000005:.6f}")
    
    results.append({
        'model': 'GPT-5 (low)',
        'tokens': result['usage']['total_tokens'],
        'time': elapsed,
        'cost': result['usage']['total_tokens'] * 0.000005
    })
except Exception as e:
    print(f" Error: {e}")

# Test GPT-5 with HIGH reasoning (expensive, overkill for simple task)
print("\n" + "=" * 70)
print("2️⃣  GPT-5 with reasoning_effort='high' (OVERKILL)")
print("-" * 70)

gpt5_high_payload = {
    'messages': [
        {'role': 'developer', 'content': 'You write professional emails.'},
        {'role': 'user', 'content': simple_task}
    ],
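    # High effort spends many reasoning tokens; a cap of 50 would typically
    # truncate before any visible text is produced.
    'max_completion_tokens': 2000,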
    'reasoning_effort': 'high'
}

start = time.time()
try:
    response = requests.post(gpt5_url, headers=headers, json=gpt5_high_payload, timeout=30)
    response.raise_for_status()
    result = response.json()
    elapsed = time.time() - start
    
    print(f"Response: {result['choices'][0]['message']['content']}")
    print(f"\n📊 Stats:")
    print(f"   Tokens: {result['usage']['total_tokens']}")
    reasoning_tokens = result['usage'].get('completion_tokens_details', {}).get('reasoning_tokens', 0)
    print(f"   Reasoning tokens: {reasoning_tokens} (wasted for simple task)")
    print(f"   Time: {elapsed:.2f}s")
    print(f"   Est. cost: ${result['usage']['total_tokens'] * 0.000005:.6f}")
    
    results.append({
        'model': 'GPT-5 (high)',
        'tokens': result['usage']['total_tokens'],
        'time': elapsed,
        'cost': result['usage']['total_tokens'] * 0.000005
    })
except Exception as e:
    print(f" Error: {e}")

# Test GPT-4o (faster for simple tasks)
print("\n" + "=" * 70)
print("3️⃣  GPT-4o (Optimized for simple tasks)")
print("-" * 70)

gpt4o_url = f"{endpoint}/openai/deployments/{os.environ['GPT4O_DEPLOYMENT']}/chat/completions?api-version={api_version}"

gpt4o_payload = {
    'messages': [
        {'role': 'system', 'content': 'You write professional emails.'},
        {'role': 'user', 'content': simple_task}
    ],
    'max_tokens': 50,
    'temperature': 0.7
}

start = time.time()
try:
    response = requests.post(gpt4o_url, headers=headers, json=gpt4o_payload, timeout=30)
    response.raise_for_status()
    result = response.json()
    elapsed = time.time() - start
    
    print(f"Response: {result['choices'][0]['message']['content']}")
    print(f"\n📊 Stats:")
    print(f"   Tokens: {result['usage']['total_tokens']}")
    print(f"   Time: {elapsed:.2f}s")
    print(f"   Est. cost: ${result['usage']['total_tokens'] * 0.000002:.6f}")
    
    results.append({
        'model': 'GPT-4o',
        'tokens': result['usage']['total_tokens'],
        'time': elapsed,
        'cost': result['usage']['total_tokens'] * 0.000002
    })
except Exception as e:
    print(f" Error: {e}")

# Summary comparison
print("\n" + "=" * 70)
print("📊 COMPARISON TABLE:")
print("=" * 70)

if results:
    print(f"\n{'Model':<20} {'Tokens':<10} {'Time (s)':<12} {'Est. Cost':<12}")
    print("-" * 70)
    for r in results:
        print(f"{r['model']:<20} {r['tokens']:<10} {r['time']:<12.2f} ${r['cost']:<11.6f}")

print("\n" + "=" * 70)
print("RECOMMENDATIONS:")
print("=" * 70)
print("\nUse GPT-5 (high reasoning) for:")
print("   • Complex logic problems")
print("   • Code debugging and optimization")
print("   • Mathematical proofs")
print("   • Multi-step planning")
print("\nUse GPT-5 (low reasoning) for:")
print("   • Balanced quality and cost")
print("   • Moderate complexity tasks")
print("\nUse GPT-4o for:")
print("   • Simple content generation")
print("   • Quick responses")
print("   • High-volume, low-complexity tasks")
print("   • When speed matters more than deep reasoning")
print("=" * 70)

---

## Workshop Summary

### Key Takeaways

| Topic | Core Concepts |
|-------|---------------|
| **API Basics** | Stateless calls, token billing, finish_reason validation |
| **Models** | GPT-5 reasoning vs GPT-4o parameters |
| **Conversations** | Resend full history, sliding windows |
| **Tokens** | Prompt + completion < limit, cap completions |
| **Structured** | JSON mode for syntax, schemas for contracts |
| **Tools** | Define schemas, execute, return results |
| **Streaming** | Real-time SSE for better UX |
| **Vision** | GPT-4o images with detail control |
| **Prompting** | Few-shot teaches patterns efficiently |

### Production Checklist

**Before deployment:**
- Validate `finish_reason` before using responses
- Set `max_completion_tokens` to prevent cost overruns
- Log token usage per request
- Implement exponential backoff for rate limits (429 errors)
- Version prompts and schemas in source control
- Test edge cases: truncation, timeouts, refusals

**Cost optimization:**
- Prune conversation history aggressively
- Cache repeated system prompts
- Choose appropriate reasoning effort per request
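
Logging usage and choosing an effort level both depend on turning the `usage` block into a cost figure. The sketch below is one way to do that, assuming the response carries `prompt_tokens` and `completion_tokens` and that input and output tokens are priced separately. The function name `estimate_cost` and the prices passed in are hypothetical placeholders, not published rates.

```python
def estimate_cost(usage, input_price_per_1m, output_price_per_1m):
    """Estimate request cost in dollars from a Chat Completions `usage` block.

    Prices are per one million tokens and must be supplied by the caller.
    Reasoning tokens are already included in `completion_tokens`.
    """
    prompt = usage.get("prompt_tokens", 0)
    completion = usage.get("completion_tokens", 0)
    return (prompt * input_price_per_1m + completion * output_price_per_1m) / 1_000_000


# Example usage block; the prices are placeholders for illustration only.
usage = {"prompt_tokens": 1200, "completion_tokens": 300, "total_tokens": 1500}
cost = estimate_cost(usage, input_price_per_1m=1.0, output_price_per_1m=4.0)
print(f"Estimated cost: ${cost:.6f}")
```

Pricing input and output separately matters because output tokens usually cost more than input tokens, so a flat rate on `total_tokens` can misstate the bill.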

### Next Steps

1. Build domain-specific tools (DB queries, API calls)
2. Combine techniques: few-shot + tools + structured outputs
3. Implement context compaction with summarization
4. Test vision workflows with images
5. Measure token costs across reasoning efforts

**Spring AI integration:** Map these patterns to `ChatClient` APIs, implement tool calling with `@Tool` annotations, add Actuator metrics.

---

In [None]:
summary = client.usage_summary()
print(json.dumps(summary, indent=2))


---

## Part 12: Context Window Strategies

### Theory: Lost-in-the-Middle & Compaction

**Context Quality:**
- Quality often degrades well before the hard limit, roughly beyond 50-55% of the context window
- "Lost-in-the-middle" effect: model misses details buried in long contexts
- Keep high-signal content at top and bottom

**Compaction Strategies:**
- Summarization: compress old turns using the model itself
- Sliding window: keep last N turns verbatim
- External memory: store details outside, retrieve just-in-time via tools

**Impact:** Summarization alone lifts success by ~29%; with external memory: ~39% improvement, 84% token reduction.
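
The sliding-window strategy listed above can be sketched in a few lines. This is a minimal illustration, assuming messages are plain dicts with a `role` key. The helper name `sliding_window` and the `keep_last` parameter are hypothetical, not part of any API.

```python
def sliding_window(messages, keep_last=4):
    """Keep leading system/developer messages plus the last `keep_last` turns."""
    pinned = []
    rest = list(messages)
    # Pin instruction messages at the very start so they are never dropped.
    while rest and rest[0]["role"] in ("system", "developer"):
        pinned.append(rest.pop(0))
    if keep_last <= 0:
        return pinned
    return pinned + rest[-keep_last:]


history = [
    {"role": "developer", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "What is Python?"},
    {"role": "assistant", "content": "A high-level programming language."},
    {"role": "user", "content": "How do I create a list?"},
    {"role": "assistant", "content": "Use square brackets: [1, 2, 3]."},
    {"role": "user", "content": "What about dictionaries?"},
]

trimmed = sliding_window(history, keep_last=3)
print([m["role"] for m in trimmed])
```

If the history contains tool calls, trim on turn boundaries so that each `tool` message stays together with the assistant message that requested it.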

---

### Demo 12.1: Context Compaction with Summarization

In [None]:
# Context compaction demo
def summarize_conversation(messages):
    '''Compress old conversation turns into a summary.'''
    summary_payload = {
        "messages": [
            {"role": "developer", "content": "Summarize the following conversation concisely, preserving key information."},
            {"role": "user", "content": "Conversation:\n" + "\n".join([f"{msg['role']}: {msg['content']}" for msg in messages])}
        ],
        "max_completion_tokens": 200,
        "reasoning_effort": "low"
    }
    result = client.call(summary_payload)
    return result['data']['choices'][0]['message']['content']

# Simulate long conversation
conversation = [
    {"role": "developer", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "What is Python?"},
    {"role": "assistant", "content": "Python is a high-level, interpreted programming language known for its simplicity and readability."},
    {"role": "user", "content": "How do I create a list?"},
    {"role": "assistant", "content": "Use square brackets: my_list = [1, 2, 3] or list() constructor."},
    {"role": "user", "content": "What about dictionaries?"},
    {"role": "assistant", "content": "Dictionaries use curly braces: my_dict = {'key': 'value'}"}
]

print("Original conversation: {} messages, ~{} chars".format(len(conversation), sum(len(str(m)) for m in conversation)))

# Summarize old turns (keep system + last 2)
old_turns = conversation[1:-2]
summary = summarize_conversation(old_turns)

print(f"\nSummary of old turns:\n{summary}")

# Rebuild with summary
compacted = [
    conversation[0],  # developer (system-level) instructions
    {"role": "user", "content": f"[Previous: {summary}]"},
    conversation[-2],  # recent messages
    conversation[-1]
]

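# Character counts are only a rough proxy for tokens; on a toy conversation
# this short, the summary can even be longer than the turns it replaces.
original_chars = sum(len(str(m)) for m in conversation)
compacted_chars = sum(len(str(m)) for m in compacted)
print(f"\nCompacted conversation: {len(compacted)} messages, ~{compacted_chars} chars")
print(f"Approx. size reduction (by characters): {(1 - compacted_chars / original_chars) * 100:.0f}%")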

---

## Part 13: Parameter Tuning

### Theory: Temperature vs Reasoning Effort

**GPT-4o (Standard) Parameters:**
- `temperature` (0.0-2.0): Higher = more random/creative
- `top_p` (0.0-1.0): Nucleus sampling, restricts token pool
- Adjust one of `temperature` or `top_p` at a time; changing both makes the effect hard to predict

**GPT-5 (Reasoning) Parameters:**
- `reasoning_effort` (low/medium/high): Allocates thinking tokens
- NO temperature/top_p support
- Trade-off: accuracy vs cost/latency

**When to tune:** Deterministic, repeatable tasks → low temperature (GPT-4o). Simple tasks → low reasoning effort (GPT-5). Creative tasks → high temperature (GPT-4o only).
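
Because the two model families accept different parameters, it can help to build payloads in one place. The sketch below assumes the parameter split described above. The helper `build_payload` and its `model_family` labels (`"reasoning"`, `"standard"`) are hypothetical names used for illustration.

```python
def build_payload(model_family, instructions, user_text, max_output_tokens,
                  reasoning_effort="low", temperature=0.2):
    """Build a Chat Completions payload with the parameters each family accepts."""
    if model_family == "reasoning":
        # GPT-5 style: developer role, max_completion_tokens, reasoning_effort.
        return {
            "messages": [
                {"role": "developer", "content": instructions},
                {"role": "user", "content": user_text},
            ],
            "max_completion_tokens": max_output_tokens,
            "reasoning_effort": reasoning_effort,
        }
    if model_family == "standard":
        # GPT-4o style: system role, max_tokens, temperature.
        return {
            "messages": [
                {"role": "system", "content": instructions},
                {"role": "user", "content": user_text},
            ],
            "max_tokens": max_output_tokens,
            "temperature": temperature,
        }
    raise ValueError(f"unknown model family: {model_family!r}")


reasoning_payload = build_payload("reasoning", "Be concise.", "Explain REST.", 800)
standard_payload = build_payload("standard", "Be concise.", "Explain REST.", 200, temperature=0.0)
print(sorted(reasoning_payload))
print(sorted(standard_payload))
```

Centralizing this choice keeps unsupported parameters (for example `temperature` on a reasoning model) from leaking into requests.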

---

### Demo 13.1: Temperature Impact on Creativity

In [None]:
import os
import requests

# Compare temperature settings on GPT-4o
endpoint = os.environ['DIAL_API_ENDPOINT']
api_version = os.environ['DIAL_API_VERSION']
gpt4o_url = f"{endpoint}/openai/deployments/{os.environ['GPT4O_DEPLOYMENT']}/chat/completions?api-version={api_version}"
headers = {'Content-Type': 'application/json', 'api-key': os.environ['DIAL_API_KEY']}

prompt = "Write a creative company name for a coffee shop."

print("=" * 70)
print("Temperature Impact Demo (GPT-4o)")
print("=" * 70)

for temp in [0.0, 0.7, 1.5]:
    payload = {
        'messages': [
            {'role': 'system', 'content': 'You are a creative brand consultant.'},
            {'role': 'user', 'content': prompt}
        ],
        'max_tokens': 50,
        'temperature': temp
    }
    
    response = requests.post(gpt4o_url, headers=headers, json=payload, timeout=30)
    response.raise_for_status()  # surface HTTP errors instead of failing later with a KeyError
    result = response.json()
    
    print(f"\nTemperature {temp}:")
    print(f"  {result['choices'][0]['message']['content'].strip()}")

print("\n Notice: Higher temperature = more diverse/creative outputs")

---

## Part 14: Error Handling & Reliability

### Theory: Production Readiness

**Always Check `finish_reason`:**
- `stop`: Normal completion
- `length`: Truncated (raise `max_completion_tokens`, or `max_tokens` on GPT-4o, or shorten the prompt)
- `content_filter`: Blocked by safety filters
- `tool_calls`: Model wants to call a function

**Rate Limits:**
- HTTP 429: Too many requests
- Headers: `x-ratelimit-remaining-requests`, `x-ratelimit-remaining-tokens`
- Implement exponential backoff: 1s, 2s, 4s, 8s...

**Logging:** Track request IDs, token usage, latency, errors for debugging.
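
The `finish_reason` check can be wrapped in a small guard so that truncated or filtered output is never used as a final answer. This is a minimal sketch operating on an already-parsed response dict. `IncompleteResponse` and `extract_text` are hypothetical names, and raising on `tool_calls` is a simplification (a real loop would execute the tools instead).

```python
class IncompleteResponse(Exception):
    """Raised when a completion cannot be used as a final answer."""


def extract_text(response_json):
    """Return the assistant text only if the model finished normally."""
    choice = response_json["choices"][0]
    reason = choice.get("finish_reason")
    message = choice.get("message") or {}

    if reason == "stop":
        return message.get("content") or ""
    if reason == "length":
        raise IncompleteResponse("truncated: raise the token cap or shorten the prompt")
    if reason == "content_filter":
        raise IncompleteResponse("blocked by the content filter")
    if reason == "tool_calls":
        raise IncompleteResponse("model requested tool calls; run them and send the results back")
    raise IncompleteResponse(f"unexpected finish_reason: {reason!r}")


ok = {"choices": [{"finish_reason": "stop", "message": {"role": "assistant", "content": "Done."}}]}
cut = {"choices": [{"finish_reason": "length", "message": {"role": "assistant", "content": "Micro"}}]}

print(extract_text(ok))
try:
    extract_text(cut)
except IncompleteResponse as err:
    print(f"rejected: {err}")
```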

---

### Demo 14.1: Handle Truncation and Rate Limits

In [None]:
import time

# Demo 1: Detect truncation
print("=" * 70)
print("Demo 1: Detecting Truncation")
print("=" * 70)

truncated_payload = {
    "messages": [
        {"role": "developer", "content": "You are a technical writer."},
        {"role": "user", "content": "Explain microservices architecture in detail with examples."}
    ],
    "max_completion_tokens": 30,  # Too small!
    "reasoning_effort": "low"
}

result = client.call(truncated_payload)
choice = result['data']['choices'][0]

# Content may be empty if reasoning tokens used up the whole budget
content = choice['message'].get('content') or ''
print(f"Response: {content[:100]}...")
print(f"\nFinish reason: {choice['finish_reason']}")

if choice['finish_reason'] == 'length':
    print("TRUNCATED! Need to increase max_completion_tokens or reduce prompt.")

# Demo 2: Exponential backoff simulation
print("\n" + "=" * 70)
print("Demo 2: Exponential Backoff Pattern")
print("=" * 70)

def call_with_retry(payload, max_retries=3):
    '''Call API with exponential backoff on rate limits.'''
    for attempt in range(max_retries):
        try:
            result = client.call(payload)
            return result
        except Exception as e:
            if '429' in str(e) and attempt < max_retries - 1:
                wait_time = 2 ** attempt  # doubles each attempt: 1s, 2s, 4s, ...
                print(f"Rate limited! Waiting {wait_time}s before retry {attempt + 2}/{max_retries}...")
                time.sleep(wait_time)
            else:
                raise
    
print("Retry pattern: 1s → 2s → 4s → 8s...")
print("Always implement exponential backoff for production!")

---

## Part 15: Just-in-Time Retrieval

### Theory: Keep Prompts Lean

**Problem:** Stuffing entire documents into prompts wastes tokens and reduces quality.

**Solution:** Progressive disclosure
1. Store lightweight identifiers (file paths, IDs)
2. Surface summaries/metadata up front
3. Let model request details via tool calls
4. Load heavy data only when needed

**Pattern:** `getFileList()` → `getFileSummary(id)` → `getFileContents(id)` only when necessary.
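
To let the model drive this pattern, each step is exposed as a tool and the calls are routed to local handlers. The sketch below assumes the Chat Completions `tools` format and a tool call shaped like the ones the API returns. The `FILES` store, the `HANDLERS` table, and `run_tool_call` are hypothetical stand-ins for a real backend.

```python
import json

# Hypothetical in-memory store standing in for a real document backend.
FILES = {
    "doc3": {
        "summary": "Rolling updates, blue-green, canary deployments",
        "contents": "... full text ...",
    },
}


def _tool(name, description, properties=None, required=()):
    """Build one tool definition in the Chat Completions `tools` format."""
    parameters = {"type": "object", "properties": properties or {}, "required": list(required)}
    return {"type": "function",
            "function": {"name": name, "description": description, "parameters": parameters}}


ID_ARG = {"id": {"type": "string", "description": "File identifier"}}

TOOLS = [
    _tool("getFileList", "List available file IDs."),
    _tool("getFileSummary", "Return a short summary of one file.", ID_ARG, ["id"]),
    _tool("getFileContents", "Return the full contents of one file.", ID_ARG, ["id"]),
]

# Map tool names to local handlers; each handler receives the parsed arguments.
HANDLERS = {
    "getFileList": lambda args: sorted(FILES),
    "getFileSummary": lambda args: FILES[args["id"]]["summary"],
    "getFileContents": lambda args: FILES[args["id"]]["contents"],
}


def run_tool_call(tool_call):
    """Execute one tool call and wrap the result as a `tool` role message."""
    name = tool_call["function"]["name"]
    args = json.loads(tool_call["function"]["arguments"] or "{}")
    try:
        result = HANDLERS[name](args)
    except KeyError:
        # Covers both an unknown tool name and an unknown file id.
        result = {"error": f"unknown tool or id: {name} {args}"}
    return {"role": "tool", "tool_call_id": tool_call["id"], "content": json.dumps(result)}


fake_call = {"id": "call_1", "type": "function",
             "function": {"name": "getFileSummary", "arguments": '{"id": "doc3"}'}}
tool_message = run_tool_call(fake_call)
print(tool_message["content"])
```

`TOOLS` would be sent in the request's `tools` field, and each returned `tool` message appended to `messages` before the next call. Only `getFileContents` brings the heavy text into context.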

---

### Demo 15.1: Progressive Disclosure with Tools

In [None]:
# Simulate a document database
documents = {
    "doc1": {"title": "API Design Best Practices", "summary": "REST principles, versioning, error handling", "content": "... 5000 words ..."},
    "doc2": {"title": "Database Optimization Guide", "summary": "Indexing, query optimization, caching strategies", "content": "... 8000 words ..."},
    "doc3": {"title": "Kubernetes Deployment Patterns", "summary": "Rolling updates, blue-green, canary deployments", "content": "... 6000 words ..."}
}

# Tool 1: List available documents
def list_documents():
    return [{"id": doc_id, "title": doc["title"]} for doc_id, doc in documents.items()]

# Tool 2: Get summary
def get_document_summary(doc_id):
    if doc_id in documents:
        return {"id": doc_id, "title": documents[doc_id]["title"], "summary": documents[doc_id]["summary"]}
    return {"error": "Document not found"}

# Tool 3: Get full content (only when needed)
def get_document_content(doc_id):
    if doc_id in documents:
        return {"id": doc_id, "content": documents[doc_id]["content"]}
    return {"error": "Document not found"}

print("=" * 70)
print("Just-in-Time Retrieval Pattern")
print("=" * 70)

print("\nStep 1: User asks question")
print("Query: 'How do I implement blue-green deployments?'")

print("\nStep 2: Model calls list_documents()")
doc_list = list_documents()
print(f"Available: {[d['title'] for d in doc_list]}")

print("\nStep 3: Model identifies relevant doc, calls get_document_summary('doc3')")
summary = get_document_summary("doc3")
print(f"Summary: {summary['summary']}")

print("\nStep 4: Model determines it needs full content, calls get_document_content('doc3')")
print("(Only NOW do we load the heavy 6000-word document into context)")

print("\n Token savings: Only loaded 1 document instead of all 3!")
print("   Kept context focused and reduced cost by ~70%")