# vLLM Server Test

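This notebook tests a vLLM server through its OpenAI-compatible API. It was written for the Llama-3.1-8B-Instruct model; the outputs recorded below come from a server running openai/gpt-oss-20b.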

In [1]:
# /// script
# dependencies = [
#   "httpx",
#   "openai",
# ]
# ///

import httpx
import json
from typing import List, Dict, Any
from openai import OpenAI
import time

In [2]:
# Server configuration
BASE_URL = "http://localhost:8007"
client = OpenAI(base_url=f"{BASE_URL}/v1", api_key="local")

In [3]:
# Check server health
with httpx.Client() as http_client:
    health_response = http_client.get(f"{BASE_URL}/health", timeout=5)
    print(f"Health check status: {health_response.status_code}")
    
# Get available models using OpenAI client
models = client.models.list()
available_models = [model.id for model in models.data]
print("Available models:", available_models)

DEFAULT_MODEL = available_models[0] if available_models else "meta-llama/Llama-3.1-8B-Instruct"
print(f"Using model: {DEFAULT_MODEL}")


Health check status: 200
Available models: ['openai/gpt-oss-20b']
Using model: openai/gpt-oss-20b
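
A single health probe fails if the server is still loading the model. The sketch below polls until the probe reports HTTP 200 or the attempts run out. The helper name `wait_until_ready` and its parameters are illustrative and not part of vLLM or httpx; the probe is passed in as a callable, so the retry logic can be exercised without a live server.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], int],
    attempts: int = 10,
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True as soon as probe() returns 200, or False after all attempts."""
    for attempt in range(attempts):
        try:
            if probe() == 200:
                return True
        except Exception:
            # Connection errors and timeouts are expected while the server starts up.
            pass
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False


# Demo with a stand-in probe that fails twice before succeeding.
_status_codes = iter([503, 503, 200])
print(wait_until_ready(lambda: next(_status_codes), sleep=lambda _: None))
```

Against the server configured above, the probe could be `lambda: httpx.get(f"{BASE_URL}/health", timeout=5).status_code`.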


In [4]:
def send_chat_completion(
    messages: List[Dict[str, str]],
    temperature: float = 0.7,
    max_tokens: int | None = None,
    model: str = DEFAULT_MODEL,
    enable_thinking: bool = False,
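) -> Dict[str, Any] | None:
    """
    Send a chat completion request using OpenAI client.

    Args:
        messages: List of message dictionaries with 'role' and 'content'
        temperature: Sampling temperature (lower values give more deterministic output)
        max_tokens: Maximum number of tokens to generate (None leaves the limit to the server)
        model: Model to use for completion
        enable_thinking: Forwarded to the chat template as
            chat_template_kwargs["enable_thinking"]; templates that do not
            define this flag may ignore it

    Returns:
        Response dictionary from the server, or None if the request failed
    """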
    try:
        response = client.chat.completions.create(
            model=model,
            messages=messages,
            temperature=temperature,
            max_tokens=max_tokens,
            extra_body={
                "chat_template_kwargs": {"enable_thinking": enable_thinking},
            },
        )
        return response.model_dump()
    except Exception as e:
        print(f"Chat completion failed: {e}")
        return None
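
In the outputs recorded in this notebook, the assistant message carries the model's reasoning in two extra fields, `reasoning` and `reasoning_content`, alongside the usual `content`. These fields are not part of the standard chat completion schema, so the sketch below treats both as optional. The helper name `split_reply` and the shortened sample dictionary are illustrative.

```python
from typing import Any, Dict, Optional, Tuple


def split_reply(response: Dict[str, Any]) -> Tuple[Optional[str], Optional[str]]:
    """Return (answer, reasoning) from a chat completion dictionary."""
    message = response["choices"][0]["message"]
    # Either field may be missing or null depending on the server and model.
    reasoning = message.get("reasoning_content") or message.get("reasoning")
    return message.get("content"), reasoning


# Shortened sample in the shape of the responses recorded in this notebook.
sample = {
    "choices": [
        {
            "message": {
                "role": "assistant",
                "content": "Hello! How can I assist you today?",
                "reasoning": "We respond politely.",
                "reasoning_content": "We respond politely.",
            }
        }
    ]
}

answer, reasoning = split_reply(sample)
print("Answer:", answer)
print("Reasoning:", reasoning)
```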

## Test 1: Simple greeting

In [5]:
# Test 1: Simple greeting
messages = [
    {"role": "user", "content": "Hello! How are you today?"}
]

response = send_chat_completion(messages)
if response:
    print("Response:")
    print(json.dumps(response, indent=2))
    print("\nGenerated text:")
    print(response["choices"][0]["message"]["content"])
else:
    print("Failed to get response")

Response:
{
  "id": "chatcmpl-a135c6c16fa8155b",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "message": {
        "content": "Hello! I\u2019m just a bundle of code, so I don\u2019t have feelings in the way humans do, but I\u2019m here and ready to help. How can I assist you today?",
        "refusal": null,
        "role": "assistant",
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning": "User says \"Hello! How are you today?\" typical greeting. We respond politely.",
        "reasoning_content": "User says \"Hello! How are you today?\" typical greeting. We respond politely."
      },
      "stop_reason": null,
      "token_ids": null
    }
  ],
  "created": 1770188798,
  "model": "openai/gpt-oss-20b",
  "object": "chat.completion",
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "completion_tokens": 65,
    "prompt_tokens": 72,
   

## Test 2: Multi-turn conversation

In [6]:
# Test 2: Multi-turn conversation
messages = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
    {"role": "user", "content": "What's the population of that city?"}
]

response = send_chat_completion(messages)
if response:
    print("Multi-turn response:")
    print(response["choices"][0]["message"]["content"])
else:
    print("Failed to get response")

Multi-turn response:
As of the most recent estimates, the population of Paris (the city proper) is about **2.1 million people**. If you include the wider Île‑de‑France metropolitan area, the figure jumps to roughly **12 million**.
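
The chat completions endpoint keeps no state between requests, which is why the cell above resends the earlier turns. A minimal sketch of a helper that accumulates the history is shown below. The `Conversation` class is hypothetical, and the transport is injected as a `send` callable so the example runs without a server.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


class Conversation:
    """Keeps the running message list and resends it with every question."""

    def __init__(self, send: Callable[[List[Message]], str]) -> None:
        self._send = send
        self.messages: List[Message] = []

    def ask(self, text: str) -> str:
        pending = self.messages + [{"role": "user", "content": text}]
        reply = self._send(pending)
        # Commit both turns only after the request succeeded.
        self.messages = pending + [{"role": "assistant", "content": reply}]
        return reply


def fake_send(messages: List[Message]) -> str:
    """Stand-in for a real request; reports how much history it received."""
    return f"received {len(messages)} message(s)"


chat = Conversation(fake_send)
print(chat.ask("What is the capital of France?"))
print(chat.ask("What's the population of that city?"))
```

To use it with this notebook, `send` could wrap `send_chat_completion` and return `response["choices"][0]["message"]["content"]`, with a check for the `None` that `send_chat_completion` returns on failure.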


## Test 3: Reasoning task

In [7]:
# Test 3: Reasoning task
messages = [
    {"role": "user", "content": "If I have 3 apples and I give away 1 apple, then buy 2 more apples, how many apples do I have in total? Please explain your reasoning."}
]

response = send_chat_completion(messages)
if response:
    print("Reasoning response:")
    print(response["choices"][0]["message"]["content"])
else:
    print("Failed to get response")

Reasoning response:
You start with 3 apples.  
1. **Give away 1 apple** → 3 – 1 = 2 apples left.  
2. **Buy 2 more apples** → 2 + 2 = 4 apples in total.

So after giving one away and buying two more, you have **4 apples**.


## Test 4: Tool calling

In [8]:
def test_tool_calling():
    """Test tool calling capabilities if supported by the model."""
    tools = [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get the weather for a given location",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {
                            "type": "string",
                            "description": "The city and state, e.g. San Francisco, CA"
                        },
                        "unit": {
                            "type": "string",
                            "enum": ["celsius", "fahrenheit"],
                            "description": "The unit for temperature"
                        }
                    },
                    "required": ["location"]
                }
            }
        }
    ]
    
    messages = [
        {"role": "user", "content": "What's the weather like in Paris, France?"}
    ]
    
    try:
        response = client.chat.completions.create(
            model=DEFAULT_MODEL,
            messages=messages,
            tools=tools,
        )
        
        result = response.model_dump()
        print("Tool calling response:")
        print(json.dumps(result, indent=2))
        
        # Check if tool was called
        choice = result["choices"][0]
        if choice["message"].get("tool_calls"):
            print("\n✅ Tool calling is supported!")
            for tool_call in choice["message"]["tool_calls"]:
                print(f"Tool called: {tool_call['function']['name']}")
                print(f"Arguments: {tool_call['function']['arguments']}")
        else:
            print("\n❌ Tool calling not supported or model chose not to use tools")
            print("Response:", choice["message"]["content"])
            
    except Exception as e:
        print(f"❌ Tool calling test failed: {e}")

test_tool_calling()

Tool calling response:
{
  "id": "chatcmpl-86e963b75db90637",
  "choices": [
    {
      "finish_reason": "tool_calls",
      "index": 0,
      "logprobs": null,
      "message": {
        "content": null,
        "refusal": null,
        "role": "assistant",
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [
          {
            "id": "chatcmpl-tool-84afaaa3d19c6378",
            "function": {
              "arguments": "{\"location\": \"Paris, France\", \"unit\": \"celsius\"}",
              "name": "get_weather"
            },
            "type": "function"
          }
        ],
        "reasoning": "We need to get weather via function call get_weather. Use location \"Paris, France\". unit default maybe? Provide \"unit\" field. Let's use \"celsius\".",
        "reasoning_content": "We need to get weather via function call get_weather. Use location \"Paris, France\". unit default maybe? Provide \"unit\" field. Let's use \"ce

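The test above stops once the model has requested a tool call. A sketch of the next step follows: parse the JSON `arguments` string, run a local function, and build the `tool` role message that goes back to the model. The `get_weather` implementation and its return values are made-up placeholders, and `run_tool_calls` and `TOOL_REGISTRY` are illustrative names.

```python
import json
from typing import Any, Callable, Dict, List


def get_weather(location: str, unit: str = "celsius") -> Dict[str, Any]:
    """Hypothetical tool: returns canned placeholder data, not real weather."""
    return {"location": location, "unit": unit, "temperature": 18, "conditions": "cloudy"}


TOOL_REGISTRY: Dict[str, Callable[..., Any]] = {"get_weather": get_weather}


def run_tool_calls(
    assistant_message: Dict[str, Any],
    registry: Dict[str, Callable[..., Any]],
) -> List[Dict[str, str]]:
    """Execute each requested tool and return the matching 'tool' role messages."""
    results = []
    for call in assistant_message.get("tool_calls") or []:
        name = call["function"]["name"]
        handler = registry.get(name)
        if handler is None:
            payload: Any = {"error": f"unknown tool: {name}"}
        else:
            try:
                # The arguments arrive as a JSON string, not as a dict.
                arguments = json.loads(call["function"]["arguments"])
                payload = handler(**arguments)
            except (json.JSONDecodeError, TypeError) as exc:
                payload = {"error": str(exc)}
        results.append(
            {"role": "tool", "tool_call_id": call["id"], "content": json.dumps(payload)}
        )
    return results


# Assistant message in the shape recorded above.
assistant_message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [
        {
            "id": "chatcmpl-tool-84afaaa3d19c6378",
            "type": "function",
            "function": {
                "name": "get_weather",
                "arguments": "{\"location\": \"Paris, France\", \"unit\": \"celsius\"}",
            },
        }
    ],
}

results = run_tool_calls(assistant_message, TOOL_REGISTRY)
print(json.dumps(results, indent=2))
```

For a follow-up request, the assistant message and then the returned tool messages would be appended to `messages` before calling `client.chat.completions.create` again with the same `tools` list.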
## Test 5: Different temperature settings

In [9]:
prompt = "Write a creative short story about a robot learning to paint."
temperatures = [0.1, 0.7, 1.0]

for temp in temperatures:
    messages = [{"role": "user", "content": prompt}]
    response = send_chat_completion(messages, temperature=temp)
    
    if response:
        print(f"\n--- Temperature: {temp} ---")
        print(response["choices"][0]["message"]["content"])
    else:
        print(f"Failed to get response for temperature {temp}")


--- Temperature: 0.1 ---
**Astra’s First Brushstroke**

The workshop smelled of linseed oil and fresh paint, a scent that had been the backdrop of Mara’s life for thirty years. The walls were lined with canvases, each a different story, a different voice. In the corner, a small, humming machine sat on a workbench, its chrome arms resting on a stack of brushes. It was Astra, a service robot originally designed to spray paint car bodies in a factory, now repurposed for a far stranger task: learning to paint.

Astra’s eyes were a pair of infrared cameras, its ears a set of microphones that picked up the faintest brush squeak. Its brain was a lattice of neural networks, trained on millions of images, but none of them had ever been a painting. Its programming was simple: observe, analyze, replicate. But the world of color and texture was a maze of variables that no algorithm had ever mapped.

“Good morning, Astra,” Mara said, wiping her hands on a rag. She had a habit of talking to the mac
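
To show why lower temperatures give more predictable text, the sketch below applies the usual definition of temperature sampling: logits are divided by the temperature before the softmax. The logits are made-up numbers for three candidate tokens. This illustrates the general technique and makes no claim about vLLM's internal implementation.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after scaling them by 1/temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
for temp in (0.1, 0.7, 1.0):
    probs = softmax_with_temperature(logits, temp)
    print(f"temperature={temp}: {[round(p, 3) for p in probs]}")
```

At 0.1 nearly all probability mass sits on the top token, while at 1.0 the lower-ranked tokens keep a meaningful share, so sampled text varies more between runs.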

## Test 6: Performance timing

In [10]:
# Test 6: Performance timing
messages = [
    {"role": "user", "content": "Explain quantum computing in simple terms."}
]

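# perf_counter is a monotonic clock, which makes it better suited to measuring intervals
start_time = time.perf_counter()
response = send_chat_completion(messages)
end_time = time.perf_counter()

if response:
    duration = end_time - start_time
    # "usage" can be present but null, so fall back to an empty dict
    tokens_generated = (response.get("usage") or {}).get("completion_tokens", 0)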
    
    print(f"Response time: {duration:.2f} seconds")
    print(f"Tokens generated: {tokens_generated}")
    if tokens_generated > 0 and duration > 0:
        print(f"Tokens per second: {tokens_generated/duration:.2f}")
    print("\nResponse:")
    print(response["choices"][0]["message"]["content"])
else:
    print("Failed to get response for performance test")

Response time: 31.84 seconds
Tokens generated: 996
Tokens per second: 31.29

Response:
## Quantum Computing in Plain English

| **What it is** | **How it works** | **Why it matters** |
|----------------|------------------|--------------------|
| A new kind of computer that uses the rules of quantum physics instead of the rules of ordinary electronics | It manipulates *quantum bits* (qubits) that can be 0, 1, or “both at once.” | It can solve some problems *much faster* than today’s best computers. |

---

### 1. The Building Block – the Qubit

| **Bit** (classic) | **Qubit** (quantum) |
|-------------------|---------------------|
| One of two states: 0 or 1 | A tiny particle (electron, photon, etc.) that can be in a *superposition* of 0 and 1 at the same time. |
| Think of a light switch that’s either **off** or **on** | Think of a spinning coin that’s both **heads** and **tails** *until you look* |

Because a qubit can be 0 and 1 simultaneously, a single qubit carries *two bits of inf
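
A single timed request is a noisy measurement, and the wall-clock time includes prompt processing and network overhead as well as generation. The sketch below repeats the request and summarizes tokens per second. The function name `measure_throughput` is illustrative, and both the request and the clock are injected so the logic can be checked without a server.

```python
import statistics
import time
from typing import Any, Callable, Dict, List, Optional


def measure_throughput(
    send: Callable[[], Optional[Dict[str, Any]]],
    runs: int = 3,
    clock: Callable[[], float] = time.perf_counter,
) -> Dict[str, Any]:
    """Time several requests and summarize completion tokens per second."""
    rates: List[float] = []
    for _ in range(runs):
        start = clock()
        response = send()
        elapsed = clock() - start
        usage = (response or {}).get("usage") or {}
        tokens = usage.get("completion_tokens") or 0
        # Failed requests and empty completions are skipped rather than counted as zero.
        if tokens > 0 and elapsed > 0:
            rates.append(tokens / elapsed)
    if not rates:
        raise RuntimeError("no successful runs to measure")
    return {
        "runs": len(rates),
        "mean_tps": statistics.mean(rates),
        "median_tps": statistics.median(rates),
        "min_tps": min(rates),
        "max_tps": max(rates),
    }


# Demo with a stand-in request that takes about 10 ms and reports 100 tokens.
def fake_request() -> Dict[str, Any]:
    time.sleep(0.01)
    return {"usage": {"completion_tokens": 100}}


print(measure_throughput(fake_request, runs=2))
```

With this notebook's helper, the call could be `measure_throughput(lambda: send_chat_completion(messages))`.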

## Summary

Running all the cells above exercises the following aspects of your vLLM server:

1. **Server health check** - Verify that the server is running and list the available models
2. **Basic functionality** - Simple chat completion (Test 1)
3. **Multi-turn conversations** - Context awareness (Test 2)
4. **Reasoning capabilities** - Step-by-step arithmetic (Test 3)
5. **Tool calling** - Function calling capabilities, if supported (Test 4)
6. **Temperature effects** - Creativity control (Test 5)
7. **Performance** - Response time and tokens per second (Test 6)

The notebook uses:
- **httpx** for the health check request
- **OpenAI Python client** for listing models and for chat completions
- **Error handling** in `send_chat_completion` and `test_tool_calling`, which print the failure instead of raising
- **Tool calling tests** to check function calling support

The cells print their results and do not assert on them. If every cell runs without errors and the responses look reasonable, your vLLM server is working correctly.
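
The cells in this notebook print responses for a human to read. A minimal sketch of turning those checks into something a script can assert on is shown below. The function name `check_chat_response` and the specific checks are illustrative choices, not an official validation routine.

```python
from typing import Any, Dict, List, Optional


def check_chat_response(response: Optional[Dict[str, Any]]) -> List[str]:
    """Return a list of problems found in a chat completion dictionary."""
    if not response:
        return ["no response returned"]
    choices = response.get("choices") or []
    if not choices:
        return ["response has no choices"]

    problems: List[str] = []
    choice = choices[0]
    message = choice.get("message") or {}
    if message.get("role") != "assistant":
        problems.append("first choice is not an assistant message")
    # A tool-calling reply legitimately has null content, so accept either field.
    if not message.get("content") and not message.get("tool_calls"):
        problems.append("message has neither content nor tool_calls")
    if choice.get("finish_reason") == "length":
        problems.append("output was cut off by the token limit")
    return problems


good = {
    "choices": [
        {"finish_reason": "stop", "message": {"role": "assistant", "content": "Hi"}}
    ]
}
print(check_chat_response(good))
print(check_chat_response(None))
```

A cell could then end with `assert not check_chat_response(response)` so that a bad response stops the run.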