# vLLM Server Test

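This notebook tests a vLLM server through its OpenAI-compatible API. It uses the first model the server reports and falls back to `meta-llama/Llama-3.1-8B-Instruct` if none is listed; in the recorded run below, the server was serving `openai/gpt-oss-120b`.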

In [1]:
# /// script
# dependencies = [
#   "httpx",
#   "openai",
# ]
# ///

import httpx
import json
from typing import List, Dict, Any
from openai import OpenAI
import time

In [2]:
# Server configuration
BASE_URL = "http://localhost:8007"
client = OpenAI(base_url=f"{BASE_URL}/v1", api_key="local")

In [3]:
# Check server health
with httpx.Client() as http_client:
    health_response = http_client.get(f"{BASE_URL}/health", timeout=5)
    print(f"Health check status: {health_response.status_code}")
    
# Get available models using OpenAI client
models = client.models.list()
available_models = [model.id for model in models.data]
print("Available models:", available_models)

DEFAULT_MODEL = available_models[0] if available_models else "meta-llama/Llama-3.1-8B-Instruct"
print(f"Using model: {DEFAULT_MODEL}")


Health check status: 200
Available models: ['openai/gpt-oss-120b']
Using model: openai/gpt-oss-120b
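
The health check above is a single request, so it fails immediately if the server is still loading the model. A minimal sketch of a retry loop follows; `wait_until_ready`, its parameters, and the injected `probe` callable are hypothetical names introduced here for illustration and are not part of vLLM or the OpenAI client.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    attempts: int = 30,
    delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `probe` up to `attempts` times, pausing `delay` seconds between tries.

    Returns True as soon as a probe succeeds, or False if every attempt fails.
    """
    for attempt in range(attempts):
        if probe():
            return True
        # Do not sleep after the final failed attempt.
        if attempt < attempts - 1:
            sleep(delay)
    return False


# Usage against this notebook's server (assumes httpx and BASE_URL from the cells above):
#
#   def probe() -> bool:
#       try:
#           return httpx.get(f"{BASE_URL}/health", timeout=5).status_code == 200
#       except httpx.HTTPError:
#           return False
#
#   wait_until_ready(probe)
```

Passing the probe in as a callable keeps the retry logic independent of the HTTP library, which also makes it easy to exercise without a running server.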


In [4]:
def send_chat_completion(
    messages: List[Dict[str, str]],
    temperature: float = 0.7,
    max_tokens: int | None = None,
    model: str = DEFAULT_MODEL,
    enable_thinking: bool = False,
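) -> Dict[str, Any] | None:
    """
    Send a chat completion request using OpenAI client.

    Args:
        messages: List of message dictionaries with 'role' and 'content'
        temperature: Sampling temperature (0.0 is near-deterministic; higher values are more random)
        max_tokens: Maximum number of tokens to generate (None leaves the limit to the server)
        model: Model to use for completion
        enable_thinking: Forwarded to the chat template via chat_template_kwargs;
            it only has an effect if the model's chat template reads this flag

    Returns:
        Response dictionary from the server, or None if the request failed
    """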
    try:
        response = client.chat.completions.create(
            model=model,
            messages=messages,
            temperature=temperature,
            max_tokens=max_tokens,
            extra_body={
                "chat_template_kwargs": {"enable_thinking": enable_thinking},
            },
        )
        return response.model_dump()
    except Exception as e:
        print(f"Chat completion failed: {e}")
        return None
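
`send_chat_completion` returns the full response as a dictionary, or `None` on failure. In the outputs recorded in this notebook, the message carries the final answer under `content` and the model's reasoning under both `reasoning` and `reasoning_content`. A minimal sketch of a helper that separates the two is shown here; `split_reply` is a hypothetical name, and the field names are taken only from the outputs in this notebook, so other servers or versions may differ.

```python
from typing import Any, Dict, Optional, Tuple


def split_reply(
    response: Optional[Dict[str, Any]],
) -> Tuple[Optional[str], Optional[str]]:
    """Return (answer, reasoning) from a chat completion dictionary.

    Either element is None when absent, for example when the request
    failed or when `content` is null because the model issued a tool call.
    """
    if not response or not response.get("choices"):
        return None, None
    message = response["choices"][0].get("message") or {}
    # The recorded outputs carry the reasoning under both keys; accept either.
    reasoning = message.get("reasoning_content") or message.get("reasoning")
    return message.get("content"), reasoning
```

With this helper, a test cell can print `answer` and `reasoning` separately instead of dumping the whole JSON response.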

## Test 1: Simple greeting

In [5]:
# Test 1: Simple greeting
messages = [
    {"role": "user", "content": "Hello! How are you today?"}
]

response = send_chat_completion(messages)
if response:
    print("Response:")
    print(json.dumps(response, indent=2))
    print("\nGenerated text:")
    print(response["choices"][0]["message"]["content"])
else:
    print("Failed to get response")

Response:
{
  "id": "chatcmpl-bec957621cf71408",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "message": {
        "content": "Hello! I\u2019m doing great, thanks for asking. How can I help you today?",
        "refusal": null,
        "role": "assistant",
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning": "We need to respond as ChatGPT. The user says \"Hello! How are you today?\" Likely a casual greeting. Should respond politely, ask about user, maybe mention being ready to help. Also we should follow policy, no issues. Provide friendly tone.",
        "reasoning_content": "We need to respond as ChatGPT. The user says \"Hello! How are you today?\" Likely a casual greeting. Should respond politely, ask about user, maybe mention being ready to help. Also we should follow policy, no issues. Provide friendly tone."
      },
      "stop_reason": null,
      

## Test 2: Multi-turn conversation

In [6]:
# Test 2: Multi-turn conversation
messages = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
    {"role": "user", "content": "What's the population of that city?"}
]

response = send_chat_completion(messages)
if response:
    print("Multi-turn response:")
    print(json.dumps(response, indent=2))
else:
    print("Failed to get response")

Multi-turn response:
{
  "id": "chatcmpl-9de45684b3495748",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "message": {
        "content": "Paris is one of the world\u2019s most\u2011populated capital cities, but the exact number you hear depends on whether you\u2019re looking at the **city\u2011proper** (the administrative limits of the Commune of Paris) or the larger **metropolitan area**.\n\n| Area | Most recent official estimate (2025\u20112026) | How it\u2019s measured |\n|------|-------------------------------------------|-------------------|\n| **City\u2011proper (Commune de Paris)** | **\u2248\u202f2,180,000** residents | INSEE (French national statistics office) released the 2025 \u201cpopulation l\u00e9gale\u201d figures in January\u202f2026. |\n| **Greater Paris (M\u00e9tropole du Grand Paris)** | **\u2248\u202f7.2\u202fmillion** residents | Aggregates the 131 surrounding communes that make up the \u201cGrand Paris\u201d i

## Test 3: Reasoning task

In [7]:
# Test 3: Reasoning task
messages = [
    {"role": "user", "content": "If I have 3 apples and I give away 1 apple, then buy 2 more apples, how many apples do I have in total? Please explain your reasoning."}
]

response = send_chat_completion(messages, enable_thinking=True)
if response:
    print(json.dumps(response, indent=2))
else:
    print("Failed to get response")

{
  "id": "chatcmpl-a2d462b856139235",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "message": {
        "content": "You start with **3 apples**.\n\n1. **Give away 1 apple**  \n   \\(3 \\text{ apples} - 1 \\text{ apple} = 2 \\text{ apples}\\)\n\n2. **Buy 2 more apples**  \n   \\(2 \\text{ apples} + 2 \\text{ apples} = 4 \\text{ apples}\\)\n\nSo after giving one away and then buying two more, you have **4 apples** in total.",
        "refusal": null,
        "role": "assistant",
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning": "The user asks a simple arithmetic word problem: start with 3 apples, give away 1, then buy 2 more. So 3 - 1 = 2, then +2 = 4. So answer is 4 apples. Provide explanation.\n\nWe need to respond with reasoning.\n\nProbably straightforward.\n\nWe'll answer.",
        "reasoning_content": "The user asks a simple arithmetic word proble

## Test 4: Tool calling

In [8]:
def test_tool_calling():
    """Test tool calling capabilities if supported by the model."""
    tools = [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get the weather for a given location",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {
                            "type": "string",
                            "description": "The city and state, e.g. San Francisco, CA"
                        },
                        "unit": {
                            "type": "string",
                            "enum": ["celsius", "fahrenheit"],
                            "description": "The unit for temperature"
                        }
                    },
                    "required": ["location"]
                }
            }
        }
    ]
    
    messages = [
        {"role": "user", "content": "What's the weather like in Paris, France?"}
    ]
    
    try:
        response = client.chat.completions.create(
            model=DEFAULT_MODEL,
            messages=messages,
            tools=tools,
        )
        
        result = response.model_dump()
        print("Tool calling response:")
        print(json.dumps(result, indent=2))
        
        # Check if tool was called
        choice = result["choices"][0]
        if choice["message"].get("tool_calls"):
            print("\n✅ Tool calling is supported!")
            for tool_call in choice["message"]["tool_calls"]:
                print(f"Tool called: {tool_call['function']['name']}")
                print(f"Arguments: {tool_call['function']['arguments']}")
        else:
            print("\n❌ Tool calling not supported or model chose not to use tools")
            print("Response:", choice["message"]["content"])
            
    except Exception as e:
        print(f"❌ Tool calling test failed: {e}")

test_tool_calling()

Tool calling response:
{
  "id": "chatcmpl-accbb10841a3b85a",
  "choices": [
    {
      "finish_reason": "tool_calls",
      "index": 0,
      "logprobs": null,
      "message": {
        "content": null,
        "refusal": null,
        "role": "assistant",
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [
          {
            "id": "chatcmpl-tool-a883391edc9b4753",
            "function": {
              "arguments": "{\"location\": \"Paris, France\"}",
              "name": "get_weather"
            },
            "type": "function"
          }
        ],
        "reasoning": "User asks weather in Paris, France. Need to fetch using get_weather function. Provide location as \"Paris, France\". No unit specified, default. We'll call function.",
        "reasoning_content": "User asks weather in Paris, France. Need to fetch using get_weather function. Provide location as \"Paris, France\". No unit specified, default. We'll cal

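The test above stops once the model emits a tool call. To complete the round trip, the caller runs the requested function and sends the result back as a `tool` message. The following is a minimal sketch of that step: `get_weather` here is a stand-in that returns fixed data, and `TOOL_REGISTRY` and `run_tool_calls` are hypothetical names introduced for illustration.

```python
import json
from typing import Any, Callable, Dict, List


def get_weather(location: str, unit: str = "celsius") -> Dict[str, Any]:
    """Stand-in implementation: returns fixed data, not a real forecast."""
    return {"location": location, "temperature": 18, "unit": unit}


# Maps the tool names advertised to the model onto local Python callables.
TOOL_REGISTRY: Dict[str, Callable[..., Any]] = {"get_weather": get_weather}


def run_tool_calls(assistant_message: Dict[str, Any]) -> List[Dict[str, str]]:
    """Execute each requested tool and return one `tool` message per call."""
    tool_messages = []
    for tool_call in assistant_message.get("tool_calls") or []:
        name = tool_call["function"]["name"]
        # Arguments arrive as a JSON-encoded string, not as a dict.
        arguments = json.loads(tool_call["function"]["arguments"])
        result = TOOL_REGISTRY[name](**arguments)
        tool_messages.append(
            {
                "role": "tool",
                "tool_call_id": tool_call["id"],
                "content": json.dumps(result),
            }
        )
    return tool_messages
```

To get the final answer, append the assistant message (its `role`, `content`, and `tool_calls`) and then these tool messages to `messages`, and call `client.chat.completions.create` again with the same `tools` list.
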
## Test 5: Different temperature settings

In [9]:
prompt = "Write a creative short story about a robot learning to paint."
temperatures = [0.1, 0.7, 1.0]

for temp in temperatures:
    messages = [{"role": "user", "content": prompt}]
    response = send_chat_completion(messages, temperature=temp)
    
    if response:
        print(f"\n--- Temperature: {temp} ---")
        print(response["choices"][0]["message"]["content"])
    else:
        print(f"Failed to get response for temperature {temp}")


--- Temperature: 0.1 ---
**The Brush‑Circuit**

In the back corner of a cramped studio on the edge of the city, a thin line of dust curled around the base of a metal arm. The arm belonged to **C‑7**, a service robot whose original purpose had been to sort recyclables in the municipal plant. A mis‑delivered shipment of spare parts and a stray software update had given C‑7 a new set of instructions: *Explore creative expression*.

The studio belonged to Mara, a painter whose canvases were as wild as the thunderstorms that rolled over the river each night. She had found C‑7 abandoned in a junkyard, its chassis dented but its servos still humming. “You look like you could use a purpose,” she had said, and with a few bolts and a fresh coat of oil, she had given the robot a place among her brushes, tubes, and splattered palettes.

At first, C‑7’s idea of “painting” was literal. It scanned the room with its infrared eyes, mapped the geometry of the space, and projected a perfect replica of t

## Test 6: Performance

In [10]:
# Test 6: Performance timing
messages = [
    {"role": "user", "content": "Explain quantum computing in simple terms."}
]

# perf_counter() is a monotonic clock intended for measuring elapsed time
start_time = time.perf_counter()
response = send_chat_completion(messages)
end_time = time.perf_counter()

if response:
    duration = end_time - start_time
    tokens_generated = response.get("usage", {}).get("completion_tokens", 0)
    
    print(f"Response time: {duration:.2f} seconds")
    print(f"Tokens generated: {tokens_generated}")
    if tokens_generated > 0 and duration > 0:
        print(f"Tokens per second: {tokens_generated/duration:.2f}")
    print("\nResponse:")
    print(response["choices"][0]["message"]["content"])
else:
    print("Failed to get response for performance test")

Response time: 64.19 seconds
Tokens generated: 1320
Tokens per second: 20.57

Response:
**Quantum Computing in Plain English**

---

### 1. The Classic Computer vs. the Quantum Computer  

| Classic Computer | Quantum Computer |
|------------------|------------------|
| **Bits** are the smallest units of information. Each bit is either **0** *or* **1**. | **Qubits** are the quantum version of bits. A qubit can be **0**, **1**, **or both at the same time** (a property called **super‑position**). |
| Operations are like tiny, deterministic switches that flip bits or move them around. | Operations are performed with **quantum gates** that manipulate the probabilities of many possibilities at once. |
| To solve a problem you usually have to try many possibilities one after another. | A quantum computer can explore many possibilities **simultaneously**, giving it a shortcut for certain tasks. |

---

### 2. Two Key Quantum Tricks

1. **Super‑position – “being in many places at once”**  
   

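The duration measured above is end-to-end wall-clock time for a non-streaming request, so the tokens-per-second figure includes request overhead and prompt processing, not only decoding speed. A small sketch of a helper that collects these numbers in one place is given below; `summarize_timing` is a hypothetical name, and it also tolerates a response whose `usage` field is missing or null.

```python
from typing import Any, Dict, Optional


def summarize_timing(
    response: Dict[str, Any], duration_s: float
) -> Dict[str, Optional[float]]:
    """Combine the server-reported token usage with a measured duration."""
    # `usage` may be absent or None, so fall back to an empty dict.
    usage = response.get("usage") or {}
    completion_tokens = usage.get("completion_tokens") or 0
    rate = None
    if completion_tokens > 0 and duration_s > 0:
        rate = completion_tokens / duration_s
    return {
        "prompt_tokens": usage.get("prompt_tokens"),
        "completion_tokens": completion_tokens,
        "duration_s": duration_s,
        "tokens_per_second": rate,
    }
```

Running the same prompt several times and comparing these summaries gives a steadier picture than a single measurement.
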
## Summary

Run all the cells above to test the main aspects of your vLLM server:

1. **Server health check** - Verify the server is running and list the available models
2. **Basic functionality** - Simple chat completion
3. **Multi-turn conversations** - Context awareness
4. **Reasoning** - A short arithmetic word problem with `enable_thinking=True`
5. **Tool calling** - Function calling capabilities (if supported)
6. **Temperature effects** - Output variation across sampling temperatures
7. **Performance** - Response timing and token usage

The notebook uses:
- **httpx** for the raw health-check request
- **OpenAI Python client** for model listing and chat completions
- **Basic error handling** (failed requests are caught and reported) and response parsing
- **Tool calling tests** to check function calling support

If every cell completes without printing a failure message and the responses look sensible, your vLLM server is serving requests correctly.