# Exploring the OpenAI Chat Completions API

## Overview

This notebook introduces the **OpenAI Chat Completions API**, the foundation for building LLM-powered applications. Understanding this API is critical for:

1. **System Design** - Architecture patterns for AI-powered services
2. **Cost Optimization** - Token usage directly impacts operational expenses
3. **Quality Control** - Temperature and parameter tuning for consistent outputs
4. **Production Readiness** - Error handling, rate limits, and monitoring

## What We'll Cover

1. **Basic Completions** - Single-turn requests and responses
2. **Message Roles** - System, user, and assistant message types
3. **Temperature Control** - Balancing creativity vs determinism
4. **Token Management** - Tracking usage for cost and context limits
5. **Helper Functions** - Reusable patterns for production systems

## Key Concepts

### Roles in Chat API
- **system**: Sets behavior, tone, and constraints (invisible to end user)
- **user**: Input from the human user
- **assistant**: LLM's response (can be used for few-shot examples)
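
The three roles can be combined into one `messages` list, with earlier `assistant` turns acting as few-shot examples. The sketch below is illustrative only: `build_messages` and the sample reviews are hypothetical names and data, not part of the OpenAI SDK or this notebook's helpers.

```python
def build_messages(system_prompt, few_shot_pairs, user_input):
    """Assemble a Chat Completions `messages` list.

    few_shot_pairs: iterable of (user_text, assistant_text) example turns.
    """
    # The system message comes first and sets overall behavior.
    messages = [{"role": "system", "content": system_prompt}]
    # Each example is a user turn followed by the assistant reply we want imitated.
    for example_user, example_assistant in few_shot_pairs:
        messages.append({"role": "user", "content": example_user})
        messages.append({"role": "assistant", "content": example_assistant})
    # The real question goes last.
    messages.append({"role": "user", "content": user_input})
    return messages


messages = build_messages(
    system_prompt="Classify the sentiment of each review as positive or negative.",
    few_shot_pairs=[
        ("The battery lasts all day.", "positive"),
        ("It broke after one week.", "negative"),
    ],
    user_input="Setup was quick and painless.",
)
for message in messages:
    print(f"{message['role']:>9}: {message['content']}")
```

The resulting list can be passed as the `messages` argument of a chat completion request.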

### Temperature (0-2)
- **0**: Near-deterministic, consistent (production default); small variations between calls are still possible
- **1**: Balanced creativity (general use)
- **2**: Maximum randomness (creative writing)
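
To build intuition for these settings, here is a conceptual sketch that assumes the common softmax-with-temperature formulation (scores divided by the temperature before normalizing). The API's internal sampler is not described in this notebook, and the `logits` values are made up for illustration.

```python
import math


def token_probabilities(logits, temperature):
    """Convert raw scores to probabilities, scaled by temperature."""
    if temperature == 0:
        # Limit case: all probability mass goes to the highest-scoring token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for t in (0, 0.5, 1, 2):
    probs = token_probabilities(logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature rises, the distribution flattens, so lower-scoring tokens are sampled more often.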

## Environment Setup

Initialize the OpenAI client with API credentials. The `tiktoken` library provides token counting for cost estimation.

In [1]:
import os
import openai
import tiktoken  # Token counting library from OpenAI
from dotenv import load_dotenv, find_dotenv

# Load environment variables from .env file (must contain OPENAI_API_KEY)
_ = load_dotenv(find_dotenv())

# Fail fast with a KeyError if the key is missing
# (openai.OpenAI() below also reads OPENAI_API_KEY from the environment)
openai.api_key = os.environ['OPENAI_API_KEY']

---

## Simple Completion Helper

A basic wrapper for single-turn interactions. This pattern is useful for:
- Quick prototyping
- Stateless applications
- Simple question-answering systems

**Default Parameters:**
- `temperature=0`: Deterministic output for consistency
- `model="gpt-3.5-turbo"`: Cost-effective model for most use cases

In [2]:
# Initialize OpenAI client (uses api_key from environment)
client = openai.OpenAI()

def get_completion(prompt, model="gpt-3.5-turbo"):
    """
    Simple wrapper for single-turn completions.
    
    Args:
        prompt: User's input text
        model: OpenAI model identifier (default: gpt-3.5-turbo)
    
    Returns:
        str: LLM's response content
    """
    messages = [{"role": "user", "content": prompt}]
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0  # Deterministic output for consistency
    )
    return response.choices[0].message.content


### Test: Basic Factual Query

Simple question-answering to verify API connectivity.

In [3]:
# Simple factual query - the answer should mention "Paris"
response = get_completion("What is the capital of France?")
print(response)

---

## Prompt Engineering: Character-Level Tasks

LLMs struggle with character-level operations (reversing strings, counting letters) due to tokenization. Let's demonstrate the problem and solution.

### Issue: String Reversal Without Delimiters

In [4]:
# Problem: LLM may struggle due to tokenization (sees word chunks, not characters)
response = get_completion("Take the letters in lollipop \
and reverse them")
print(response)

pilpolol


**Observation**: The LLM produces an incorrect reversal here ("pilpolol" instead of "popillol"). This happens because the tokenizer splits "lollipop" into a few multi-character tokens, so the model never sees the individual characters.

### Solution: Character Delimiting

Explicitly separate characters with hyphens so the LLM processes them individually.

In [5]:
# Solution: Hyphenate characters so LLM processes them individually
response = get_completion("""Take the letters in \
l-o-l-l-i-p-o-p and reverse them""")
print(response)  # Should correctly output: p-o-p-i-l-l-o-l

p-o-p-i-l-l-o-l


**Result**: Correct reversal (p-o-p-i-l-l-o-l).

**Key Lesson**: For character-level tasks, provide delimiters to work around tokenization limitations.
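
The delimited prompt can also be built programmatically, with the correct answer computed locally so the model's reply can be checked. This is a minimal sketch; `delimit_characters` and `reverse_delimited` are hypothetical helper names introduced here.

```python
def delimit_characters(text, delimiter="-"):
    """Insert a delimiter between every character: 'abc' -> 'a-b-c'."""
    return delimiter.join(text)


def reverse_delimited(text, delimiter="-"):
    """Reverse the delimited pieces: 'a-b-c' -> 'c-b-a'."""
    return delimiter.join(reversed(text.split(delimiter)))


word = "lollipop"
spaced = delimit_characters(word)
prompt = f"Take the letters in {spaced} and reverse them"
expected = reverse_delimited(spaced)  # ground truth to compare against the model's reply

print(prompt)
print(expected)
```

Comparing the model's reply with `expected` turns the manual check above into an automated one.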

---

## Advanced Completion Helper with Full Control

This function accepts a full message list and exposes the `model`, `temperature`, and `max_tokens` parameters:

In [6]:
def get_completion_from_messages(messages, 
                                 model="gpt-3.5-turbo", 
                                 temperature=0, 
                                 max_tokens=500):
    """
    Completion helper with full parameter control.
    
    Args:
        messages: List of message dicts with 'role' and 'content' keys
        model: OpenAI model identifier
        temperature: 0-2, controls randomness (0=deterministic, 2=creative)
        max_tokens: Maximum tokens in response (limits cost and response length)
    
    Returns:
        str: LLM's response content
    """
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=temperature,  # Degree of randomness (0=deterministic, 2=creative)
        max_tokens=max_tokens,    # Cap response length and cost
    )
    return response.choices[0].message.content

## Using System Messages to Constrain Output

System messages define behavior, format, and constraints. They're critical for:
1. **Output Format** - JSON, length limits, style requirements
2. **Tone Control** - Professional, casual, technical
3. **Safety Rails** - Content filtering, scope limitations

### Example: Length Constraint

Instruct the LLM to respond in exactly one sentence.

In [7]:
# System message: Enforce strict length constraint
messages = [  
    {'role': 'system',
     'content': 'All your responses must be one sentence long.'},  # Hard constraint
    {'role': 'user',
     'content': 'write me a story about a happy carrot'},  
] 

# temperature=1 allows creativity within the length constraint
response = get_completion_from_messages(messages, temperature=1)
print(response)

Once there was a cheerful carrot named Carl who loved dancing in the sunshine.


**Observation**: Even at a creative temperature, the model followed the system message and returned a single sentence.

**Production Use Cases**:
- Chat UI with character limits
- SMS/notification systems
- Summary generation for dashboards
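
A system message is an instruction rather than a guarantee, so systems with hard limits may want to verify the reply before using it. The check below is a rough heuristic sketch: `count_sentences` and `is_single_sentence` are hypothetical helpers, and the regex will miscount abbreviations such as "Dr." or sentences that end inside quotation marks.

```python
import re


def count_sentences(text):
    """Rough count: runs of '.', '!' or '?' followed by whitespace or end of text."""
    return len(re.findall(r"[.!?]+(?=\s|$)", text.strip()))


def is_single_sentence(text):
    return count_sentences(text) == 1


reply = "Once there was a cheerful carrot named Carl who loved dancing in the sunshine."
print(is_single_sentence(reply))
print(is_single_sentence("Carl was happy. He danced all day!"))
```

If the check fails, the caller could retry the request or truncate the reply, depending on the application.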

---

## Token Counting for Cost & Context Management

Production systems must track token usage for:
1. **Cost Control** - Billing is per-token (input + output)
2. **Context Limits** - Models have maximum context windows (e.g., 4K, 16K, 128K tokens)
3. **Performance Monitoring** - Track usage patterns and optimize prompts
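
For the context-limit point, one possible approach is to drop the oldest turns until the prompt fits a token budget. The sketch below makes several assumptions: `trim_to_budget` and `rough_token_count` are hypothetical helpers, the 4-characters-per-token estimate is only a rule of thumb for English text, and per-message overhead tokens are ignored.

```python
def rough_token_count(text):
    """Very rough estimate: about 4 characters per token (assumption)."""
    return max(1, len(text) // 4)


def trim_to_budget(messages, max_prompt_tokens, count_tokens=rough_token_count):
    """Drop the oldest non-system messages until the estimated total fits."""
    # System messages are always kept and placed first.
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    def total(msgs):
        return sum(count_tokens(m["content"]) for m in msgs)

    while rest and total(system + rest) > max_prompt_tokens:
        rest.pop(0)  # remove the oldest remaining turn
    return system + rest


history = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "x" * 400},
    {"role": "assistant", "content": "y" * 400},
    {"role": "user", "content": "What is the capital of France?"},
]
trimmed = trim_to_budget(history, max_prompt_tokens=50)
print([m["role"] for m in trimmed])
```

For exact counts, `count_tokens` could be replaced with a `tiktoken`-based counter, for example `lambda text: len(tiktoken.encoding_for_model("gpt-3.5-turbo").encode(text))`.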

In [8]:
def get_completion_and_token_count(messages, 
                                   model="gpt-3.5-turbo", 
                                   temperature=0, 
                                   max_tokens=500):
    """
    Completion helper that returns both content and token usage.
    Critical for production cost tracking and monitoring.
    
    Args:
        messages: List of message dicts
        model: OpenAI model identifier
        temperature: Randomness control (0-2)
        max_tokens: Response length cap
    
    Returns:
        tuple: (response_content, token_dict)
            token_dict contains: prompt_tokens, completion_tokens, total_tokens
    """
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=temperature, 
        max_tokens=max_tokens,
    )
    
    content = response.choices[0].message.content
    
    # Extract token usage from response metadata
    token_dict = {
        'prompt_tokens': response.usage.prompt_tokens,      # Input tokens (your cost)
        'completion_tokens': response.usage.completion_tokens,  # Output tokens (your cost)
        'total_tokens': response.usage.total_tokens,        # Sum of input + output
    }

    return content, token_dict

### Example: Style Control + Token Tracking

Using a system message to control output style (Dr. Seuss) while tracking token usage.

In [9]:
# System message: Enforce Dr. Seuss style (rhyming, whimsical)
messages = [
    {'role': 'system', 
     'content': """You are an assistant who responds in the style of Dr Seuss."""},
    {'role': 'user',
     'content': """write me a very short poem about a happy carrot"""},  
] 

# Get both response content and token usage metrics
response, token_dict = get_completion_and_token_count(messages)

### View Response

In [10]:
# Display the Dr. Seuss-style poem
print(response)

Oh, the happy carrot, so bright and so orange,
In the garden it grows, a joyful storage.
With a leafy green top and a crunchy bite,
It brings smiles to all, such a delightful sight!


**Observation**: The response is a rhyming poem in a Dr. Seuss-like style. Exact wording can vary across model versions.

---

### View Token Usage

In [11]:
# Display token breakdown for cost calculation
print(token_dict)

{'prompt_tokens': 35, 'completion_tokens': 44, 'total_tokens': 79}


**Annotated Output**:
```python
{
    'prompt_tokens': 35,       # System message + user message
    'completion_tokens': 44,   # LLM's response
    'total_tokens': 79         # Sum of input + output
}
```

### Cost Calculation Example (GPT-3.5-Turbo)

**Pricing** (as of 2024):
- Input: $0.50 / 1M tokens
- Output: $1.50 / 1M tokens

**This Request**:
- Input cost: 35 tokens × $0.50/1M = $0.0000175
- Output cost: 44 tokens × $1.50/1M = $0.0000660
- **Total: $0.0000835** (~$0.0001)

**Production Implications** (assuming every request is similar in size):
- 10K requests/day ≈ $0.84/day
- 1M requests/month ≈ $83.50/month
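
The same arithmetic can be wrapped in a small function and applied to any usage dictionary. This is a sketch: `estimate_cost_usd` is a hypothetical helper, the default prices are the 2024 figures listed above and will drift over time, and the `usage` values are the ones printed by the token-usage cell earlier.

```python
def estimate_cost_usd(prompt_tokens, completion_tokens,
                      input_price_per_million=0.50,
                      output_price_per_million=1.50):
    """Estimate the cost of one request in US dollars."""
    input_cost = prompt_tokens * input_price_per_million / 1_000_000
    output_cost = completion_tokens * output_price_per_million / 1_000_000
    return input_cost + output_cost


# Usage dict as printed by the token-usage cell
usage = {'prompt_tokens': 35, 'completion_tokens': 44, 'total_tokens': 79}
per_request = estimate_cost_usd(usage['prompt_tokens'], usage['completion_tokens'])

print(f"Per request:      ${per_request:.7f}")
print(f"10K requests/day: ${per_request * 10_000:.2f}")
print(f"1M requests:      ${per_request * 1_000_000:.2f}")
```

Keeping prices as parameters makes it easy to re-run the estimate for another model or an updated price list.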

---

## Summary: Production Patterns

### Helper Function Selection

| Use Case | Function | Key Feature |
|----------|----------|-------------|
| **Quick Prototyping** | `get_completion()` | Simple, single-turn |
| **Multi-turn Chat** | `get_completion_from_messages()` | Full message history support |
| **Cost Tracking** | `get_completion_and_token_count()` | Returns usage metrics |
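
For the multi-turn row, the caller is responsible for keeping the history and resending it on every request. The sketch below shows one way to do that. `Conversation` is a hypothetical class, and `fake_complete` is a stand-in so the example runs without network access; in the notebook, `get_completion_from_messages` could be passed in its place.

```python
class Conversation:
    """Keeps the running message history for a multi-turn chat."""

    def __init__(self, system_prompt, complete):
        self.complete = complete  # callable: list of messages -> reply text
        self.messages = [{"role": "system", "content": system_prompt}]

    def ask(self, user_input):
        self.messages.append({"role": "user", "content": user_input})
        reply = self.complete(self.messages)
        # Store the reply so the next turn sees the full history.
        self.messages.append({"role": "assistant", "content": reply})
        return reply


def fake_complete(messages):
    """Stand-in for a real API call; reports which user turn it is answering."""
    user_turns = sum(1 for m in messages if m["role"] == "user")
    return f"(reply to user turn {user_turns})"


chat = Conversation("You are a friendly assistant.", fake_complete)
print(chat.ask("Hi, my name is Isa."))
print(chat.ask("What is my name?"))
print([m["role"] for m in chat.messages])
```

Because the full history is resent each time, prompt tokens grow with every turn, which ties this pattern back to token tracking.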

### Best Practices

1. **System Messages**
   - Define behavior, format, tone
   - Use for output constraints (length, structure)
   - Invisible to end users

2. **Temperature Control**
   - 0: Near-deterministic (production default for consistency)
   - 0.5-1: Balanced (general use)
   - 1.5-2: Creative (writing, brainstorming)

3. **Token Management**
   - Always track usage in production
   - Set `max_tokens` to prevent runaway costs
   - Monitor prompt efficiency (shorter = cheaper)

4. **Error Handling** (not shown, but critical)
   - Wrap API calls in try/except
   - Handle rate limits (exponential backoff)
   - Log failures for debugging
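
The error-handling points above can be sketched as a generic retry wrapper with exponential backoff. Everything here is an assumption for illustration: `call_with_backoff` is a hypothetical helper, the delays and attempt count are arbitrary, and `flaky` simulates a failing call so the example runs offline. With the v1 OpenAI SDK, `retry_on` would likely include `openai.RateLimitError`.

```python
import logging
import random
import time

logger = logging.getLogger(__name__)


def call_with_backoff(fn, retry_on=(Exception,), max_attempts=5,
                      base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call fn(), retrying on the given exceptions with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on as error:
            if attempt == max_attempts:
                raise  # out of retries: surface the last error
            # Delay doubles each attempt (1s, 2s, 4s, ...), capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            delay += random.uniform(0, delay * 0.1)  # small jitter
            logger.warning("Attempt %d failed (%r); retrying in %.2fs",
                           attempt, error, delay)
            sleep(delay)


attempts = {"count": 0}


def flaky():
    """Simulated call that fails twice, then succeeds."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TimeoutError("simulated transient failure")
    return "ok"


# sleep is stubbed out so the demo does not actually wait
result = call_with_backoff(flaky, retry_on=(TimeoutError,), sleep=lambda seconds: None)
print(result, attempts["count"])
```

A real call would be wrapped as `call_with_backoff(lambda: get_completion_from_messages(messages), retry_on=(...))`, with the exception tuple chosen to match the SDK version in use.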

### Next Steps

- **B.EvaluateAndClassifyInputs** - Input validation and routing
- **C.InputModerationTechniques** - Content safety and filtering
- **D.ChainOfThoughtReasoning** - Complex reasoning patterns