# Ollama OpenAI Compatibility API

This notebook demonstrates Ollama's OpenAI-compatible API using the official `openai` Python library.

## Features Covered

- List models
- Generate response (completions)
- Chat completion
- Streaming responses
- Generate embeddings

## Limitations

The OpenAI compatibility layer does **not** expose these native Ollama operations, which remain available only through the native API:
- Show model details (`/api/show`)
- List running models (`/api/ps`)
- Copy model (`/api/copy`)
- Delete model (`/api/delete`)
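
As a rough illustration, requests for the native endpoints listed above could be built like this. Nothing is sent here. `native_request` is a hypothetical helper, and the `model` payload field is an assumption based on Ollama's native API documentation at the time of writing, so check it against the version you run.

```python
import json
import urllib.request


def native_request(host, endpoint, method="GET", payload=None):
    """Build (but do not send) a request to Ollama's native API.

    Hypothetical helper for illustration; not part of any library.
    """
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    headers = {"Content-Type": "application/json"} if data else {}
    return urllib.request.Request(
        f"{host.rstrip('/')}{endpoint}", data=data, headers=headers, method=method
    )


host = "http://ollama:11434"

# Assumed payload shape: a JSON body with a "model" field.
show_req = native_request(host, "/api/show", "POST", {"model": "llama3.2:latest"})
ps_req = native_request(host, "/api/ps")

print(show_req.get_method(), show_req.full_url)
print(ps_req.get_method(), ps_req.full_url)
```

To actually send one of these, pass it to `urllib.request.urlopen(req, timeout=5)` and decode the JSON body of the response.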

## Prerequisites

- Ollama pod running: `ujust ollama start`
- Model pulled: `ujust ollama pull llama3.2`

## 1. Setup & Configuration

In [8]:
import os
import time
import requests
from openai import OpenAI

# === Configuration ===
OLLAMA_HOST = os.getenv("OLLAMA_HOST", "http://ollama:11434")
DEFAULT_MODEL = "llama3.2:latest"

# Initialize OpenAI client pointing to Ollama
client = OpenAI(
    base_url=f"{OLLAMA_HOST}/v1",
    api_key="ollama"  # Required by library but ignored by Ollama
)

print(f"Ollama host: {OLLAMA_HOST}")
print(f"OpenAI base URL: {OLLAMA_HOST}/v1")
print(f"Default model: {DEFAULT_MODEL}")

Ollama host: http://ollama:11434
OpenAI base URL: http://ollama:11434/v1
Default model: llama3.2:latest
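
The base URL above is built by appending `/v1` to `OLLAMA_HOST`. If that variable might be set with a trailing slash or without a scheme (an assumption about your environment, not something this notebook requires), a small normalizer keeps the result well-formed. `openai_base_url` is a hypothetical helper name.

```python
def openai_base_url(host: str) -> str:
    """Turn a host string into an OpenAI-style base URL ending in /v1.

    Hypothetical helper: assumes plain HTTP when no scheme is given.
    """
    host = host.strip()
    if "://" not in host:
        host = f"http://{host}"
    return f"{host.rstrip('/')}/v1"


print(openai_base_url("http://ollama:11434"))
print(openai_base_url("ollama:11434/"))
```

Both calls produce `http://ollama:11434/v1`, which could then be passed as `base_url` to `OpenAI(...)`.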

## 2. Connection Health Check

In [9]:
def check_ollama_health() -> tuple[bool, bool]:
    """Check if Ollama server is running and model is available.
    
    Returns:
        tuple: (server_healthy, model_available)
    """
    try:
        response = requests.get(f"{OLLAMA_HOST}/api/tags", timeout=5)
        if response.status_code == 200:
            print("✓ Ollama server is running!")
            models = response.json()
            model_names = [m.get("name", "") for m in models.get("models", [])]
            
            if DEFAULT_MODEL in model_names:
                print(f"✓ Model '{DEFAULT_MODEL}' is available")
                return True, True
            else:
                print(f"✗ Model '{DEFAULT_MODEL}' not found!")
                print()
                if model_names:
                    print("Available models:")
                    for name in model_names:
                        print(f"  - {name}")
                else:
                    print("No models installed.")
                print()
                print("To fix this, run:")
                print(f"  ujust ollama pull {DEFAULT_MODEL.split(':')[0]}")
                return True, False
        else:
            print(f"Ollama returned unexpected status: {response.status_code}")
            return False, False
    except requests.exceptions.ConnectionError:
        print("✗ Cannot connect to Ollama server!")
        print("To fix this, run: ujust ollama start")
        return False, False
    except requests.exceptions.Timeout:
        print("✗ Connection to Ollama timed out!")
        return False, False

ollama_healthy, model_available = check_ollama_health()

✓ Ollama server is running!
✓ Model 'llama3.2:latest' is available
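
The availability check above is an exact string match, so a `DEFAULT_MODEL` of `llama3.2` would not match the installed `llama3.2:latest`, even though Ollama resolves an untagged name to the `latest` tag. One way to make the comparison tag-aware is sketched below; both function names are hypothetical.

```python
def normalize_model_name(name: str) -> str:
    """Append ':latest' when the model name carries no explicit tag."""
    # Only a colon after the last '/' counts as a tag separator, so a
    # registry port such as 'host:5000/model' is not mistaken for a tag.
    last_segment = name.rsplit("/", 1)[-1]
    return name if ":" in last_segment else f"{name}:latest"


def is_model_available(wanted: str, installed: list[str]) -> bool:
    """Compare model names after normalizing the tag on both sides."""
    target = normalize_model_name(wanted)
    return any(normalize_model_name(name) == target for name in installed)


installed = ["llama3.2:latest"]
print(is_model_available("llama3.2", installed))
print(is_model_available("llama3.2:1b", installed))
```

In the health check, `is_model_available(DEFAULT_MODEL, model_names)` could stand in for `DEFAULT_MODEL in model_names`.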

## 3. List Models

**Endpoint:** `GET /v1/models`

In [10]:
print("=== List Available Models ===")

models = client.models.list()
for model in models.data:
    print(f"  - {model.id}")

=== List Available Models ===
  - llama3.2:latest

## 4. Generate Response (Completions)

**Endpoint:** `POST /v1/completions`

In [11]:
print("=== Generate Response ===")

if not model_available:
    print()
    print("⚠ Skipping - model not available")
    print(f"  Run: ujust ollama pull {DEFAULT_MODEL.split(':')[0]}")
else:
    prompt = "Why is the sky blue? Answer in one sentence."
    print(f"Prompt: {prompt}")
    print()

    try:
        start_time = time.perf_counter()
        response = client.completions.create(
            model=DEFAULT_MODEL,
            prompt=prompt,
            max_tokens=100
        )
        end_time = time.perf_counter()

        print(f"Response: {response.choices[0].text}")
        print()
        print(f"Latency: {end_time - start_time:.2f}s")
        print(f"Completion tokens: {response.usage.completion_tokens}")
    except Exception as e:
        print(f"✗ Error: {e}")

=== Generate Response ===
Prompt: Why is the sky blue? Answer in one sentence.
Response: The sky appears blue because when sunlight enters Earth's atmosphere, it encounters tiny molecules of gases such as nitrogen and oxygen, which scatter the shorter (blue) wavelengths of light more than the longer (red) wavelengths, resulting in the blue color we see most often.

Latency: 0.32s
Completion tokens: 54
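
The latency and completion token count printed above can be combined into a rough throughput figure. This is only an approximation: the measured latency also includes prompt processing and HTTP overhead, so the real generation speed is somewhat higher. `tokens_per_second` is a hypothetical helper.

```python
def tokens_per_second(completion_tokens: int, latency_s: float) -> float:
    """Approximate end-to-end throughput from wall-clock latency."""
    if latency_s <= 0:
        raise ValueError("latency_s must be positive")
    return completion_tokens / latency_s


# Values taken from the run above: 54 tokens in 0.32 s.
rate = tokens_per_second(54, 0.32)
print(f"{rate:.2f} tokens/s")  # prints "168.75 tokens/s"
```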

## 5. Chat Completion

**Endpoint:** `POST /v1/chat/completions`

In [12]:
print("=== Chat Completion ===")

if not model_available:
    print()
    print("⚠ Skipping - model not available")
    print(f"  Run: ujust ollama pull {DEFAULT_MODEL.split(':')[0]}")
else:
    try:
        response = client.chat.completions.create(
            model=DEFAULT_MODEL,
            messages=[
                {"role": "system", "content": "You are a helpful assistant. Keep responses brief."},
                {"role": "user", "content": "Explain machine learning in one sentence."}
            ],
            temperature=0.7,
            max_tokens=100
        )

        print(f"Assistant: {response.choices[0].message.content}")
        print(f"\nTokens used: {response.usage.total_tokens}")
    except Exception as e:
        print(f"✗ Error: {e}")

=== Chat Completion ===
Assistant: Machine learning is a type of artificial intelligence that enables computers to automatically learn from data and improve their performance on specific tasks without being explicitly programmed.

Tokens used: 72

## 6. Multi-turn Conversation

In [13]:
print("=== Multi-turn Conversation ===")

if not model_available:
    print()
    print("⚠ Skipping - model not available")
    print(f"  Run: ujust ollama pull {DEFAULT_MODEL.split(':')[0]}")
else:
    try:
        messages = [
            {"role": "system", "content": "You are a helpful math tutor."}
        ]

        # Turn 1
        messages.append({"role": "user", "content": "What is 2 + 2?"})
        response = client.chat.completions.create(
            model=DEFAULT_MODEL,
            messages=messages,
            max_tokens=50
        )
        assistant_msg = response.choices[0].message.content
        messages.append({"role": "assistant", "content": assistant_msg})
        print("User: What is 2 + 2?")
        print(f"Assistant: {assistant_msg}")

        # Turn 2
        messages.append({"role": "user", "content": "And what is that multiplied by 3?"})
        response = client.chat.completions.create(
            model=DEFAULT_MODEL,
            messages=messages,
            max_tokens=50
        )
        print("User: And what is that multiplied by 3?")
        print(f"Assistant: {response.choices[0].message.content}")
    except Exception as e:
        print(f"✗ Error: {e}")

=== Multi-turn Conversation ===
User: What is 2 + 2?
Assistant: That's an easy one!

2 + 2 = 4
User: And what is that multiplied by 3?
Assistant: To find the product of 4 and 3, I'll multiply them together.

4 × 3 = 12

So, the answer is 12!
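
The pattern above (append the user message, send the full history, append the reply) can be wrapped in a small helper. The sketch below is one possible shape, not part of the `openai` library: `Conversation` and its `send` callback are hypothetical, and the demo uses a stub in place of a real model so it runs without a server.

```python
from typing import Callable

Message = dict[str, str]


class Conversation:
    """Keep chat history and resend it on every turn (hypothetical helper)."""

    def __init__(
        self,
        system_prompt: str,
        send: Callable[[list[Message]], str],
        max_turns: int = 10,
    ):
        self.system: Message = {"role": "system", "content": system_prompt}
        self.history: list[Message] = []
        self.send = send
        self.max_turns = max_turns

    def ask(self, user_text: str) -> str:
        self.history.append({"role": "user", "content": user_text})
        # Keep the newest turns only. An odd message count means the
        # trimmed history always starts with a user message.
        self.history = self.history[-(2 * self.max_turns - 1):]
        reply = self.send([self.system, *self.history])
        self.history.append({"role": "assistant", "content": reply})
        return reply


def echo_send(messages: list[Message]) -> str:
    """Stub standing in for a real model call."""
    return f"saw {len(messages)} messages"


chat = Conversation("You are a helpful math tutor.", echo_send, max_turns=2)
print(chat.ask("What is 2 + 2?"))
print(chat.ask("And what is that multiplied by 3?"))
```

With the real client, `send` would wrap `client.chat.completions.create(model=DEFAULT_MODEL, messages=messages)` and return `response.choices[0].message.content`. Capping the history with `max_turns` keeps long conversations from growing the prompt without bound.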

## 7. Streaming Response

**Endpoint:** `POST /v1/chat/completions` with `stream: true`

In [14]:
print("=== Streaming Response ===")

if not model_available:
    print()
    print("⚠ Skipping - model not available")
    print(f"  Run: ujust ollama pull {DEFAULT_MODEL.split(':')[0]}")
else:
    try:
        print()

        stream = client.chat.completions.create(
            model=DEFAULT_MODEL,
            messages=[{"role": "user", "content": "Count from 1 to 5."}],
            stream=True
        )

        collected = []
        for chunk in stream:
            # Skip chunks that carry no choices or no text (such as the final chunk)
            if chunk.choices and chunk.choices[0].delta.content:
                collected.append(chunk.choices[0].delta.content)

        print(f"Response: {''.join(collected)}")
    except Exception as e:
        print(f"✗ Error: {e}")

=== Streaming Response ===

Response: 1, 2, 3, 4, 5!
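
The cell above joins the chunks and prints them once the stream has finished. To display tokens as they arrive, each one can be handed to a callback instead. The sketch below assumes chunks shaped like the ones used above (`chunk.choices[0].delta.content`) and uses stand-in objects so it runs without a server; `consume_stream` and `fake_chunk` are hypothetical names.

```python
from types import SimpleNamespace


def print_token(token: str) -> None:
    """Default callback: print without a newline and flush immediately."""
    print(token, end="", flush=True)


def consume_stream(chunks, on_token=print_token) -> str:
    """Pass each text fragment to on_token and return the full text."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # some chunks may carry no choices
            continue
        token = chunk.choices[0].delta.content
        if token:  # skip None or empty fragments
            on_token(token)
            parts.append(token)
    return "".join(parts)


def fake_chunk(text):
    """Stand-in with the same attribute layout as a streamed chunk."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


fake_stream = [
    fake_chunk("1, "),
    fake_chunk("2, "),
    fake_chunk(None),
    SimpleNamespace(choices=[]),
    fake_chunk("3"),
]
text = consume_stream(fake_stream)
print()  # finish the line after streaming
```

With the real client, the `stream` object returned by `client.chat.completions.create(..., stream=True)` would be passed in place of `fake_stream`.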

## 8. Generate Embeddings

**Endpoint:** `POST /v1/embeddings`

In [15]:
print("=== Generate Embeddings ===")

if not model_available:
    print()
    print("⚠ Skipping - model not available")
    print(f"  Run: ujust ollama pull {DEFAULT_MODEL.split(':')[0]}")
else:
    try:
        test_text = "Ollama makes running LLMs locally easy and efficient."

        response = client.embeddings.create(
            model=DEFAULT_MODEL,
            input=test_text
        )

        embedding = response.data[0].embedding
        print(f"Input: '{test_text}'")
        print(f"Embedding dimensions: {len(embedding)}")
        print(f"First 5 values: {embedding[:5]}")
        print(f"Last 5 values: {embedding[-5:]}")
    except Exception as e:
        print(f"✗ Error: {e}")

=== Generate Embeddings ===
Input: 'Ollama makes running LLMs locally easy and efficient.'
Embedding dimensions: 3072
First 5 values: [-0.026683127507567406, -0.0028091324493288994, -0.02738499455153942, -0.009667067788541317, -0.017405545338988304]
Last 5 values: [-0.028065813705325127, 0.010568944737315178, -0.028453463688492775, 0.014874468557536602, -0.02971256710588932]
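
Embedding vectors like the one above are usually compared with cosine similarity. Here is a minimal pure-Python sketch. `cosine_similarity` is a hypothetical helper, and the short vectors are toy values rather than real model output.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for embeddings
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(f"{same_direction:.3f} {orthogonal:.3f}")
```

With real data, each argument would be a `response.data[i].embedding` list returned by `client.embeddings.create`.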

## 9. Error Handling

In [16]:
print("=== Error Handling ===")

# Test: Non-existent model
print("\n1. Testing non-existent model...")
try:
    response = client.chat.completions.create(
        model="invalid-model",
        messages=[{"role": "user", "content": "Hello"}]
    )
    print("   Unexpected success")
except Exception as e:
    print(f"   Expected error: {type(e).__name__}")

# Test: Empty messages
print("\n2. Testing empty messages...")
try:
    response = client.chat.completions.create(
        model=DEFAULT_MODEL,
        messages=[]
    )
    print("   Empty messages allowed")
except Exception as e:
    print(f"   Error: {type(e).__name__}")

print("\nError handling tests completed!")

=== Error Handling ===

1. Testing non-existent model...
   Expected error: NotFoundError

2. Testing empty messages...
   Error: BadRequestError

Error handling tests completed!
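
The tests above print only the exception class name. One possible extension maps those names to a remediation hint. `HINTS` and `describe_error` are hypothetical, `APIConnectionError` is included on the assumption that the `openai` library raises it when the server is unreachable, and a stand-in exception class lets the example run without a server.

```python
HINTS = {
    "NotFoundError": "Model not found. Pull it first with `ujust ollama pull <model>`.",
    "BadRequestError": "Request rejected. Check that `messages` is non-empty and well-formed.",
    "APIConnectionError": "Server unreachable. Start it with `ujust ollama start`.",
}


def describe_error(exc: Exception) -> str:
    """Return the exception class name plus a hint, if one is known."""
    name = type(exc).__name__
    hint = HINTS.get(name, "Unexpected error; inspect the exception message.")
    return f"{name}: {hint}"


# Stand-in so the example runs offline. With the real client, the
# exception caught from the `openai` call would be passed in instead.
class NotFoundError(Exception):
    pass


print(describe_error(NotFoundError("model 'invalid-model' not found")))
print(describe_error(ValueError("boom")))
```

Matching on class names keeps the sketch free of imports. In real code, catching the specific exception classes exported by `openai` in separate `except` clauses would be more robust.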

## Summary

This notebook demonstrated Ollama's OpenAI-compatible API.

### API Endpoints Used

| Endpoint | Method | Purpose |
|----------|--------|--------|
| `/v1/models` | GET | List models |
| `/v1/completions` | POST | Generate text |
| `/v1/chat/completions` | POST | Chat completion |
| `/v1/embeddings` | POST | Generate embeddings |

### Quick Reference

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://ollama:11434/v1",
    api_key="ollama"
)

# Chat
response = client.chat.completions.create(
    model="llama3.2:latest",
    messages=[{"role": "user", "content": "Hello!"}]
)
```

### Why Use OpenAI Compatibility?

- **Migration** - Existing OpenAI client code can target Ollama by changing only `base_url` and `api_key`, within the endpoints listed above
- **Tool ecosystem** - Works with LangChain, LlamaIndex, etc.
- **Familiar interface** - Standard OpenAI patterns