# Ollama Pod Testing Notebook

This notebook tests the local Ollama pod functionality using **three API approaches**:

1. **`requests`** - Raw HTTP calls to Ollama REST API
2. **`ollama`** - Official Ollama Python client
3. **`openai`** - OpenAI compatibility layer

Includes comprehensive tests and performance benchmarks.

## Prerequisites

- Ollama pod must be running: `ujust ollama start`
- At least one model must be pulled: `ujust ollama pull llama3.2`

## 1. Setup & Configuration

In [1]:
import os

# === Configuration (Papermill parameters) ===
OLLAMA_HOST = os.getenv("OLLAMA_HOST", "http://ollama:11434")
DEFAULT_MODEL = "llama3.2"
BENCHMARK_RUNS = 5
TEST_PROMPT = "Why is the sky blue? Answer in one sentence."

In [2]:
# === Imports ===
import requests
import json
import time
import statistics
from typing import Callable, Any

import ollama
from openai import OpenAI
import pandas as pd

print("Imports successful!")
print(f"Ollama host: {OLLAMA_HOST}")
print(f"  (source: {'OLLAMA_HOST env var' if 'OLLAMA_HOST' in os.environ else 'default fallback'})")
print(f"Default model: {DEFAULT_MODEL}")

Imports successful!
Ollama host: http://ollama:11434
  (source: OLLAMA_HOST env var)
Default model: llama3.2

## 2. Connection Health Check

In [3]:
def check_ollama_health() -> bool:
    """Check if Ollama server is running and accessible."""
    try:
        response = requests.get(f"{OLLAMA_HOST}/api/tags", timeout=5)
        if response.status_code == 200:
            print("Ollama server is running!")
            return True
        else:
            print(f"Ollama returned unexpected status: {response.status_code}")
            return False
    except requests.exceptions.ConnectionError:
        print("ERROR: Cannot connect to Ollama server!")
        print("")
        print("To fix this, run:")
        print("  ujust ollama start")
        print("")
        print("Then re-run this cell.")
        return False
    except requests.exceptions.Timeout:
        print("ERROR: Connection to Ollama timed out!")
        return False

ollama_healthy = check_ollama_health()

Ollama server is running!

## 3. List Available Models

In [4]:
# === Method 1: Using requests ===
print("=== Using requests library ===")
response = requests.get(f"{OLLAMA_HOST}/api/tags")
models_data = response.json()

if models_data.get("models"):
    for model in models_data["models"]:
        size_gb = model.get("size", 0) / (1024**3)
        print(f"  - {model['name']} ({size_gb:.2f} GB)")
else:
    print("  No models found. Run: ujust ollama pull llama3.2")

=== Using requests library ===
  - llama3.2:1b (1.23 GB)

In [5]:
# === Method 2: Using ollama library ===
print("=== Using ollama library ===")
models = ollama.list()

if models.get("models"):
    for model in models["models"]:
        size_gb = model.get("size", 0) / (1024**3)
        print(f"  - {model['name']} ({size_gb:.2f} GB)")
else:
    print("  No models found.")

=== Using ollama library ===

KeyError: 'name'

In [6]:
# === Method 3: Using OpenAI compatibility layer ===
print("=== Using OpenAI compatibility layer ===")

# Initialize OpenAI client pointing to Ollama
openai_client = OpenAI(
    base_url=f"{OLLAMA_HOST}/v1",
    api_key="ollama"  # Required by library but ignored by Ollama
)

models = openai_client.models.list()
for model in models.data:
    print(f"  - {model.id}")

=== Using OpenAI compatibility layer ===
  - llama3.2:1b

## 4. Basic Generation (requests)

In [7]:
print("=== Generation using requests ===")
print(f"Prompt: {TEST_PROMPT}")
print()

start_time = time.perf_counter()
response = requests.post(
    f"{OLLAMA_HOST}/api/generate",
    json={
        "model": DEFAULT_MODEL,
        "prompt": TEST_PROMPT,
        "stream": False
    }
)
end_time = time.perf_counter()

result = response.json()
print(f"Response: {result['response']}")
print()
print(f"Latency: {end_time - start_time:.2f}s")
print(f"Eval tokens: {result.get('eval_count', 'N/A')}")
print(f"Eval duration: {result.get('eval_duration', 0) / 1e9:.2f}s")

=== Generation using requests ===
Prompt: Why is the sky blue? Answer in one sentence.


KeyError: 'response'

## 5. Basic Generation (ollama library)

In [8]:
print("=== Generation using ollama library ===")
print(f"Prompt: {TEST_PROMPT}")
print()

start_time = time.perf_counter()
result = ollama.generate(
    model=DEFAULT_MODEL,
    prompt=TEST_PROMPT
)
end_time = time.perf_counter()

print(f"Response: {result['response']}")
print()
print(f"Latency: {end_time - start_time:.2f}s")
print(f"Eval tokens: {result.get('eval_count', 'N/A')}")

=== Generation using ollama library ===
Prompt: Why is the sky blue? Answer in one sentence.


ResponseError: model 'llama3.2' not found (status code: 404)

## 6. Basic Generation (OpenAI compatibility)

In [9]:
print("=== Generation using OpenAI compatibility layer ===")
print(f"Prompt: {TEST_PROMPT}")
print()

start_time = time.perf_counter()
response = openai_client.completions.create(
    model=DEFAULT_MODEL,
    prompt=TEST_PROMPT,
    max_tokens=100
)
end_time = time.perf_counter()

print(f"Response: {response.choices[0].text}")
print()
print(f"Latency: {end_time - start_time:.2f}s")
print(f"Completion tokens: {response.usage.completion_tokens}")

=== Generation using OpenAI compatibility layer ===
Prompt: Why is the sky blue? Answer in one sentence.


NotFoundError: Error code: 404 - {'error': {'message': "model 'llama3.2' not found", 'type': 'api_error', 'param': None, 'code': None}}

## 7. Chat Completion (requests)

In [10]:
print("=== Chat using requests ===")

messages = [
    {"role": "system", "content": "You are a helpful assistant. Keep responses brief."},
    {"role": "user", "content": "What is Python?"}
]

response = requests.post(
    f"{OLLAMA_HOST}/api/chat",
    json={
        "model": DEFAULT_MODEL,
        "messages": messages,
        "stream": False
    }
)

result = response.json()
print(f"Assistant: {result['message']['content']}")

=== Chat using requests ===

KeyError: 'message'

## 8. Chat Completion (ollama library)

In [11]:
print("=== Chat using ollama library ===")

# Multi-turn conversation
messages = [
    {"role": "user", "content": "What is 2 + 2?"},
]

response = ollama.chat(
    model=DEFAULT_MODEL,
    messages=messages
)
print(f"User: What is 2 + 2?")
print(f"Assistant: {response['message']['content']}")

# Continue conversation
messages.append(response["message"])
messages.append({"role": "user", "content": "And what is that multiplied by 3?"})

response = ollama.chat(
    model=DEFAULT_MODEL,
    messages=messages
)
print(f"User: And what is that multiplied by 3?")
print(f"Assistant: {response['message']['content']}")

=== Chat using ollama library ===

ResponseError: model 'llama3.2' not found (status code: 404)

## 9. Chat Completion (OpenAI compatibility)

In [12]:
print("=== Chat using OpenAI compatibility layer ===")

response = openai_client.chat.completions.create(
    model=DEFAULT_MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful assistant. Keep responses brief."},
        {"role": "user", "content": "Explain machine learning in one sentence."}
    ],
    temperature=0.7,
    max_tokens=100
)

print(f"Assistant: {response.choices[0].message.content}")
print(f"\nTokens used: {response.usage.total_tokens}")

=== Chat using OpenAI compatibility layer ===

NotFoundError: Error code: 404 - {'error': {'message': "model 'llama3.2' not found", 'type': 'api_error', 'param': None, 'code': None}}

## 10. Streaming Responses

In [13]:
print("=== Streaming with requests ===")
print("Response: ", end="", flush=True)

response = requests.post(
    f"{OLLAMA_HOST}/api/generate",
    json={
        "model": DEFAULT_MODEL,
        "prompt": "Count from 1 to 5.",
        "stream": True
    },
    stream=True
)

for line in response.iter_lines():
    if line:
        chunk = json.loads(line)
        print(chunk.get("response", ""), end="", flush=True)
        if chunk.get("done"):
            break
print()

=== Streaming with requests ===
Response: 

In [14]:
print("=== Streaming with ollama library ===")
print("Response: ", end="", flush=True)

stream = ollama.generate(
    model=DEFAULT_MODEL,
    prompt="Count from 1 to 5.",
    stream=True
)

for chunk in stream:
    print(chunk["response"], end="", flush=True)
print()

=== Streaming with ollama library ===
Response: 

ResponseError: model 'llama3.2' not found (status code: 404)

In [15]:
print("=== Streaming with OpenAI compatibility layer ===")
print("Response: ", end="", flush=True)

stream = openai_client.chat.completions.create(
    model=DEFAULT_MODEL,
    messages=[{"role": "user", "content": "Count from 1 to 5."}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()

=== Streaming with OpenAI compatibility layer ===
Response: 

NotFoundError: Error code: 404 - {'error': {'message': "model 'llama3.2' not found", 'type': 'api_error', 'param': None, 'code': None}}

## 11. Error Handling

In [16]:
print("=== Testing Error Handling ===")

# Test 1: Non-existent model
print("\n1. Testing non-existent model...")
try:
    result = ollama.generate(
        model="nonexistent-model-xyz",
        prompt="Hello"
    )
    print(f"   Unexpected success: {result}")
except Exception as e:
    print(f"   Expected error: {type(e).__name__}: {e}")

# Test 2: Empty prompt
print("\n2. Testing empty prompt...")
try:
    result = ollama.generate(
        model=DEFAULT_MODEL,
        prompt=""
    )
    print(f"   Response received (empty prompts allowed): {result['response'][:50]}...")
except Exception as e:
    print(f"   Error: {type(e).__name__}: {e}")

# Test 3: OpenAI client with invalid model
print("\n3. Testing OpenAI client with invalid model...")
try:
    response = openai_client.chat.completions.create(
        model="invalid-model",
        messages=[{"role": "user", "content": "Hello"}]
    )
    print(f"   Unexpected success")
except Exception as e:
    print(f"   Expected error: {type(e).__name__}")

print("\nError handling tests completed!")

=== Testing Error Handling ===

1. Testing non-existent model...   Expected error: ResponseError: model 'nonexistent-model-xyz' not found (status code: 404)

2. Testing empty prompt...
   Error: ResponseError: model 'llama3.2' not found (status code: 404)

3. Testing OpenAI client with invalid model...
   Expected error: NotFoundError

Error handling tests completed!

## 12. Performance Benchmarks

In [17]:
def benchmark_api(name: str, api_func: Callable, n_runs: int = BENCHMARK_RUNS) -> dict:
    """Benchmark an API function."""
    latencies = []
    tokens_per_second = []
    
    for i in range(n_runs):
        start = time.perf_counter()
        result = api_func()
        end = time.perf_counter()
        
        latency = end - start
        latencies.append(latency)
        
        # Extract token count based on result format
        if isinstance(result, dict):
            tokens = result.get("eval_count", 0)
        elif hasattr(result, "usage"):
            tokens = result.usage.completion_tokens if result.usage else 0
        else:
            tokens = 0
        
        if tokens > 0:
            tokens_per_second.append(tokens / latency)
    
    return {
        "API": name,
        "Mean Latency (s)": statistics.mean(latencies),
        "Std Dev (s)": statistics.stdev(latencies) if len(latencies) > 1 else 0,
        "Min (s)": min(latencies),
        "Max (s)": max(latencies),
        "Tokens/sec": statistics.mean(tokens_per_second) if tokens_per_second else "N/A"
    }

print(f"Benchmark functions defined. Will run {BENCHMARK_RUNS} iterations per API.")

Benchmark functions defined. Will run 5 iterations per API.

In [18]:
print(f"Running benchmarks with prompt: '{TEST_PROMPT}'")
print(f"Iterations per API: {BENCHMARK_RUNS}")
print()

# Define API functions
def requests_generate():
    response = requests.post(
        f"{OLLAMA_HOST}/api/generate",
        json={"model": DEFAULT_MODEL, "prompt": TEST_PROMPT, "stream": False}
    )
    return response.json()

def ollama_generate():
    return ollama.generate(model=DEFAULT_MODEL, prompt=TEST_PROMPT)

def openai_generate():
    return openai_client.completions.create(
        model=DEFAULT_MODEL,
        prompt=TEST_PROMPT,
        max_tokens=100
    )

# Run benchmarks
results = []

print("Benchmarking requests library...", flush=True)
results.append(benchmark_api("requests", requests_generate))

print("Benchmarking ollama library...", flush=True)
results.append(benchmark_api("ollama", ollama_generate))

print("Benchmarking OpenAI compatibility...", flush=True)
results.append(benchmark_api("openai", openai_generate))

# Display results
df = pd.DataFrame(results)
print("\n=== Benchmark Results ===")
print(df.to_string(index=False))

Running benchmarks with prompt: 'Why is the sky blue? Answer in one sentence.'
Iterations per API: 5

Benchmarking requests library...Benchmarking ollama library...

ResponseError: model 'llama3.2' not found (status code: 404)

In [19]:
print("=== Chat API Benchmarks ===")
print()

chat_messages = [{"role": "user", "content": TEST_PROMPT}]

def requests_chat():
    response = requests.post(
        f"{OLLAMA_HOST}/api/chat",
        json={"model": DEFAULT_MODEL, "messages": chat_messages, "stream": False}
    )
    return response.json()

def ollama_chat():
    return ollama.chat(model=DEFAULT_MODEL, messages=chat_messages)

def openai_chat():
    return openai_client.chat.completions.create(
        model=DEFAULT_MODEL,
        messages=chat_messages,
        max_tokens=100
    )

# Run chat benchmarks
chat_results = []

print("Benchmarking requests chat...", flush=True)
chat_results.append(benchmark_api("requests (chat)", requests_chat))

print("Benchmarking ollama chat...", flush=True)
chat_results.append(benchmark_api("ollama (chat)", ollama_chat))

print("Benchmarking OpenAI chat...", flush=True)
chat_results.append(benchmark_api("openai (chat)", openai_chat))

# Display results
df_chat = pd.DataFrame(chat_results)
print("\n=== Chat Benchmark Results ===")
print(df_chat.to_string(index=False))

=== Chat API Benchmarks ===

Benchmarking requests chat...Benchmarking ollama chat...

ResponseError: model 'llama3.2' not found (status code: 404)

## 13. GPU Verification

In [20]:
import subprocess

print("=== GPU Status ===")

# Check if nvidia-smi is available
try:
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.used,memory.total,utilization.gpu", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        timeout=5
    )
    if result.returncode == 0:
        lines = result.stdout.strip().split("\n")
        for i, line in enumerate(lines):
            parts = line.split(", ")
            if len(parts) >= 4:
                name, mem_used, mem_total, util = parts
                print(f"GPU {i}: {name}")
                print(f"  Memory: {mem_used} MB / {mem_total} MB")
                print(f"  Utilization: {util}%")
    else:
        print("nvidia-smi returned an error")
        print(result.stderr)
except FileNotFoundError:
    print("nvidia-smi not found - NVIDIA GPU may not be available")
except subprocess.TimeoutExpired:
    print("nvidia-smi timed out")
except Exception as e:
    print(f"Error checking GPU: {e}")

=== GPU Status ===GPU 0: NVIDIA GeForce RTX 4080 SUPER
  Memory: 3509 MB / 16376 MB
  Utilization: 45%

In [21]:
print("=== GPU Usage During Inference ===")
print("Running inference and checking GPU metrics...")
print()

# Run a generation to load the model
result = ollama.generate(
    model=DEFAULT_MODEL,
    prompt="Write a haiku about computers."
)

print(f"Response: {result['response']}")
print()

# Check Ollama's reported metrics
print("Ollama Inference Metrics:")
print(f"  Prompt eval count: {result.get('prompt_eval_count', 'N/A')}")
print(f"  Prompt eval duration: {result.get('prompt_eval_duration', 0) / 1e9:.3f}s")
print(f"  Eval count (tokens generated): {result.get('eval_count', 'N/A')}")
print(f"  Eval duration: {result.get('eval_duration', 0) / 1e9:.3f}s")
print(f"  Total duration: {result.get('total_duration', 0) / 1e9:.3f}s")

if result.get('eval_count') and result.get('eval_duration'):
    tokens_per_sec = result['eval_count'] / (result['eval_duration'] / 1e9)
    print(f"  Tokens/second: {tokens_per_sec:.1f}")

=== GPU Usage During Inference ===
Running inference and checking GPU metrics...


ResponseError: model 'llama3.2' not found (status code: 404)

## Summary

This notebook demonstrated three ways to interact with the Ollama pod:

| Method | Best For | Pros | Cons |
|--------|----------|------|------|
| **requests** | Maximum control | No dependencies, full API access | Verbose, manual JSON handling |
| **ollama** | Ollama-specific features | Clean API, streaming support | Ollama-only |
| **openai** | OpenAI migration | Standard API, tool ecosystem | Slight overhead |

### Quick Reference

```python
# requests
requests.post("http://localhost:11434/api/generate", json={...})

# ollama
ollama.generate(model="llama3.2", prompt="...")

# openai
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
client.chat.completions.create(model="llama3.2", messages=[...])
```