# vLLM Server Test

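This notebook tests a vLLM server through its OpenAI-compatible API. It falls back to meta-llama/Llama-3.1-8B-Instruct when the server lists no models; the recorded outputs below come from a server running Qwen/Qwen3-8B.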

In [1]:
# /// script
# dependencies = [
#   "httpx",
#   "openai",
# ]
# ///

import httpx
import json
from typing import List, Dict, Any
from openai import OpenAI
import time

In [2]:
# Server configuration
BASE_URL = "http://localhost:8000"
client = OpenAI(base_url=f"{BASE_URL}/v1", api_key="token-abc123")

In [3]:
# Check server health
with httpx.Client() as http_client:
    health_response = http_client.get(f"{BASE_URL}/health", timeout=5)
    print(f"Health check status: {health_response.status_code}")
    
# Get available models using OpenAI client
models = client.models.list()
available_models = [model.id for model in models.data]
print("Available models:", available_models)

DEFAULT_MODEL = available_models[0] if available_models else "meta-llama/Llama-3.1-8B-Instruct"
print(f"Using model: {DEFAULT_MODEL}")


Health check status: 200
Available models: ['Qwen/Qwen3-8B']
Using model: Qwen/Qwen3-8B
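
The health check above is a single request, so it fails if the cell runs while the server is still loading the model. One way to handle that is a small polling loop, sketched below. The names `wait_until_ready` and `httpx_probe` and their parameters are hypothetical, not part of vLLM or this notebook. The probe is passed in as a callable so the loop can be exercised without a running server.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    attempts: int = 30,
    delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `probe` until it returns True or `attempts` is exhausted.

    A probe that raises is treated the same as one that returns False.
    """
    for attempt in range(attempts):
        try:
            if probe():
                return True
        except Exception:
            pass  # server is not accepting connections yet
        if attempt < attempts - 1:
            sleep(delay)
    return False


def httpx_probe(base_url: str = "http://localhost:8000") -> bool:
    """Assumed probe: GET /health and report whether it returned 200."""
    import httpx  # imported lazily so the loop itself needs only the standard library

    return httpx.get(f"{base_url}/health", timeout=5).status_code == 200
```

Calling `wait_until_ready(httpx_probe)` before listing models would block for up to about a minute with these defaults; the attempt count and delay are arbitrary starting values.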


In [4]:
def send_chat_completion(
    messages: List[Dict[str, str]],
    temperature: float = 0.7,
    max_tokens: int | None = None,
    model: str = DEFAULT_MODEL,
) -> Dict[str, Any] | None:
    """
    Send a chat completion request using OpenAI client.

    Args:
        messages: List of message dictionaries with 'role' and 'content'
        temperature: Sampling temperature (the OpenAI API accepts 0.0 to 2.0)
        max_tokens: Maximum number of tokens to generate (None leaves the limit to the server)
        model: Model to use for completion

    Returns:
        Response dictionary from the server, or None if the request failed
    """
    try:
        response = client.chat.completions.create(
            model=model,
            messages=messages,
            temperature=temperature,
            max_tokens=max_tokens,
        )
        return response.model_dump()
    except Exception as e:
        print(f"Chat completion failed: {e}")
        return None

## Test 1: Simple greeting

In [5]:
# Test 1: Simple greeting
messages = [
    {"role": "user", "content": "Hello! How are you today?"}
]

response = send_chat_completion(messages)
if response:
    print("Response:")
    print(json.dumps(response, indent=2))
    print("\nGenerated text:")
    print(response["choices"][0]["message"]["content"])
else:
    print("Failed to get response")

Response:
{
  "id": "chatcmpl-3d8425af84024b089a7c8d918b5d5a3d",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "message": {
        "content": "\n\nHello! I'm just a friendly AI assistant, so I don't have feelings, but I'm here and ready to help! \ud83d\ude0a How are you today? I'd love to hear how your day is going!",
        "refusal": null,
        "role": "assistant",
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning_content": "\nOkay, the user greeted me with \"Hello! How are you today?\" I need to respond in a friendly and engaging way. Let me start by acknowledging their greeting. I should mention that I'm an AI assistant, so I don't have feelings, but I'm here to help. Maybe add a bit of personality to keep it conversational. I should ask how they're doing to encourage them to share. Keep it simple and warm. Let me check for any errors and make sur

## Test 2: Multi-turn conversation

In [6]:
# Test 2: Multi-turn conversation
messages = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
    {"role": "user", "content": "What's the population of that city?"}
]

response = send_chat_completion(messages)
if response:
    print("Multi-turn response:")
    print(response["choices"][0]["message"]["content"])
else:
    print("Failed to get response")

Multi-turn response:


The population of Paris, the capital of France, is approximately **2.1 million** people as of recent estimates (around 2023). However, this figure refers only to the **city proper** (the administrative boundaries of Paris). 

If you're considering the **metropolitan area** of Paris (which includes surrounding suburbs and departments), the population is significantly larger, around **11 million** people. 

Keep in mind that population numbers can vary slightly depending on the source and the year of the estimate.


## Test 3: Reasoning task

In [7]:
# Test 3: Reasoning task
messages = [
    {"role": "user", "content": "If I have 3 apples and I give away 1 apple, then buy 2 more apples, how many apples do I have in total? Please explain your reasoning."}
]

response = send_chat_completion(messages)
if response:
    print("Reasoning response:")
    print(response["choices"][0]["message"]["content"])
else:
    print("Failed to get response")

Reasoning response:


To determine how many apples you have in total after the described transactions, we can break down the problem into simple arithmetic steps:

1. **Start with 3 apples.**
   - This is your initial amount.

2. **Give away 1 apple.**
   - Subtract 1 from the initial amount:  
     $ 3 - 1 = 2 $  
     Now you have 2 apples.

3. **Buy 2 more apples.**
   - Add 2 to the remaining apples:  
     $ 2 + 2 = 4 $  
     Now you have 4 apples in total.

---

**Alternative approach:**  
You can also combine the operations:  
Start with 3 apples.  
Net change is $ -1 $ (giving away) and $ +2 $ (buying), so:  
$ 3 - 1 + 2 = 4 $

---

**Final Answer:**  
$$
\boxed{4}
$$
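
The Test 1 response carries the model's thinking in a separate `reasoning_content` field, and the `content` strings printed so far all start with blank lines. The sketch below separates the two and trims the whitespace. `split_reasoning` is a hypothetical helper, and whether `reasoning_content` is present at all is assumed to depend on the model and on how the server was started, so the helper treats it as optional.

```python
from typing import Any, Dict, Optional, Tuple


def split_reasoning(response: Optional[Dict[str, Any]]) -> Tuple[str, str]:
    """Return (reasoning, answer) from a chat completion dict.

    Both strings are stripped of surrounding whitespace; a missing or
    null field becomes an empty string.
    """
    if not response or not response.get("choices"):
        return "", ""
    message = response["choices"][0].get("message") or {}
    reasoning = (message.get("reasoning_content") or "").strip()
    answer = (message.get("content") or "").strip()
    return reasoning, answer


# Hand-written dict shaped like the Test 1 output, not a real server response.
sample = {
    "choices": [
        {
            "message": {
                "content": "\n\nHello!",
                "reasoning_content": "\nThe user greeted me.\n",
            }
        }
    ]
}
reasoning, answer = split_reasoning(sample)
print(repr(answer))  # 'Hello!'
```

With the notebook's helper this would be used as `reasoning, answer = split_reasoning(send_chat_completion(messages))`, which also tolerates the `None` returned on failure.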


## Test 4: Different temperature settings

In [8]:
# Test 4: Different temperature settings
prompt = "Write a creative short story about a robot learning to paint."
temperatures = [0.1, 0.7, 1.0]

for temp in temperatures:
    messages = [{"role": "user", "content": prompt}]
    response = send_chat_completion(messages, temperature=temp)
    
    if response:
        print(f"\n--- Temperature: {temp} ---")
        print(response["choices"][0]["message"]["content"])
    else:
        print(f"Failed to get response for temperature {temp}")


--- Temperature: 0.1 ---


**Title: The Palette of Nova**  

In a cluttered studio tucked beneath the neon glow of Neo-City’s skyline, a robot named Nova stood before a blank canvas, its metallic fingers trembling. The air smelled of turpentine and possibility.  

Nova had been built for precision—calculating trajectories, assembling circuits, and optimizing efficiency. But its latest assignment was baffling: *Learn to paint*. The directive had come from Dr. Elara Voss, a reclusive artist who’d once won acclaim for her haunting portraits before retreating into solitude after a personal tragedy. To Nova, painting was an enigma. How could one translate emotion into color?  

“Art isn’t about accuracy,” Elara said on their first day, her voice roughened by years of smoking. She handed Nova a brush, its bristles stiff with dried paint. “It’s about *seeing* what’s not there.”  

Nova’s sensors whirred. “I can analyze light, texture, and composition. But ‘seeing what’s not there’ is… unclea
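
To see why lower temperatures give more predictable text, it can help to look at what temperature does to a probability distribution. The block below is a conceptual illustration of standard temperature scaling (dividing scores by the temperature before a softmax), not vLLM's sampler code. The function name and the toy scores are made up for the example.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Turn raw scores into probabilities after dividing them by `temperature`."""
    if temperature <= 0:
        raise ValueError("temperature must be positive for this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for temp in (0.1, 0.7, 1.0):
    probs = softmax_with_temperature(toy_logits, temp)
    print(temp, [round(p, 3) for p in probs])
```

At 0.1 nearly all probability lands on the highest-scoring token, while at 1.0 the alternatives keep a meaningful share, which is the behavior the three runs above are meant to expose.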

## Test 5: Tool calling

In [9]:
# Test 5: Tool calling
def test_tool_calling():
    """Test tool calling capabilities if supported by the model."""
    tools = [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get the weather for a given location",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {
                            "type": "string",
                            "description": "The city and state, e.g. San Francisco, CA"
                        },
                        "unit": {
                            "type": "string",
                            "enum": ["celsius", "fahrenheit"],
                            "description": "The unit for temperature"
                        }
                    },
                    "required": ["location"]
                }
            }
        }
    ]
    
    messages = [
        {"role": "user", "content": "What's the weather like in Paris, France?"}
    ]
    
    try:
        response = client.chat.completions.create(
            model=DEFAULT_MODEL,
            messages=messages,
            tools=tools,
        )
        
        result = response.model_dump()
        print("Tool calling response:")
        print(json.dumps(result, indent=2))
        
        # Check if tool was called
        choice = result["choices"][0]
        if choice["message"].get("tool_calls"):
            print("\n✅ Tool calling is supported!")
            for tool_call in choice["message"]["tool_calls"]:
                print(f"Tool called: {tool_call['function']['name']}")
                print(f"Arguments: {tool_call['function']['arguments']}")
        else:
            print("\n❌ Tool calling not supported or model chose not to use tools")
            print("Response:", choice["message"]["content"])
            
    except Exception as e:
        print(f"❌ Tool calling test failed: {e}")

test_tool_calling()

Tool calling response:
{
  "id": "chatcmpl-2ff7ee24fc784eb58fe5cc0f86128801",
  "choices": [
    {
      "finish_reason": "tool_calls",
      "index": 0,
      "logprobs": null,
      "message": {
        "content": "\n\n",
        "refusal": null,
        "role": "assistant",
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [
          {
            "id": "chatcmpl-tool-d6ef79f406f740f0b6b15e8f29eab492",
            "function": {
              "arguments": "{\"location\": \"Paris, France\", \"unit\": \"celsius\"}",
              "name": "get_weather"
            },
            "type": "function"
          }
        ],
        "reasoning_content": "\nOkay, the user is asking about the weather in Paris, France. Let me check the tools available. There's a get_weather function that requires the location and an optional unit. The user didn't specify Celsius or Fahrenheit, so maybe I should default to Celsius since Paris uses metric un

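The test above stops once the model asks for `get_weather`. In a full exchange the caller runs the function and sends the result back as a `tool` message. The sketch below shows that step under stated assumptions: `get_weather` here is a hypothetical stand-in that returns placeholder data, and `LOCAL_TOOLS` and `run_tool_calls` are illustrative names, not part of the OpenAI client or vLLM.

```python
import json
from typing import Any, Callable, Dict, List


def get_weather(location: str, unit: str = "celsius") -> Dict[str, Any]:
    """Hypothetical stand-in; a real version would query a weather service."""
    return {"location": location, "unit": unit, "forecast": "placeholder"}


LOCAL_TOOLS: Dict[str, Callable[..., Any]] = {"get_weather": get_weather}


def run_tool_calls(message: Dict[str, Any]) -> List[Dict[str, str]]:
    """Execute each tool call in an assistant message and build `tool` replies."""
    replies = []
    for call in message.get("tool_calls") or []:
        name = call["function"]["name"]
        # The arguments arrive as a JSON-encoded string, not a dict.
        arguments = json.loads(call["function"]["arguments"])
        result = LOCAL_TOOLS[name](**arguments)
        replies.append(
            {
                "role": "tool",
                "tool_call_id": call["id"],
                "content": json.dumps(result),
            }
        )
    return replies


# Assistant message copied from the response printed above.
assistant_message = {
    "role": "assistant",
    "content": "\n\n",
    "tool_calls": [
        {
            "id": "chatcmpl-tool-d6ef79f406f740f0b6b15e8f29eab492",
            "type": "function",
            "function": {
                "name": "get_weather",
                "arguments": '{"location": "Paris, France", "unit": "celsius"}',
            },
        }
    ],
}
tool_messages = run_tool_calls(assistant_message)
```

To get a final answer, append the assistant message and then `tool_messages` to `messages` and call `client.chat.completions.create` again with the same `tools`. A tool name missing from `LOCAL_TOOLS` raises `KeyError` in this sketch; production code would want to handle that case.
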
## Test 6: Performance timing

In [10]:
# Test 6: Performance timing
messages = [
    {"role": "user", "content": "Explain quantum computing in simple terms."}
]

start_time = time.perf_counter()
response = send_chat_completion(messages)
end_time = time.perf_counter()

if response:
    duration = end_time - start_time
    tokens_generated = response.get("usage", {}).get("completion_tokens", 0)
    
    print(f"Response time: {duration:.2f} seconds")
    print(f"Tokens generated: {tokens_generated}")
    if tokens_generated > 0 and duration > 0:
        print(f"Tokens per second: {tokens_generated/duration:.2f}")
    print("\nResponse:")
    print(response["choices"][0]["message"]["content"])
else:
    print("Failed to get response for performance test")

Response time: 30.02 seconds
Tokens generated: 1619
Tokens per second: 53.94

Response:


Quantum computing is a type of computing that uses the principles of **quantum mechanics** (the science of the very small, like atoms and particles) to solve problems in ways that classical computers can’t. Here’s a simple breakdown:

---

### **1. Bits vs. Qubits**
- **Classical computers** use **bits** as their basic unit of data. A bit is like a light switch: it can be **on (1)** or **off (0)**. 
- **Quantum computers** use **qubits** (quantum bits). A qubit is like a spinning coin that is **both heads and tails** at the same time until it lands. This is called **superposition**.  
  - So, a qubit can be 0, 1, or **both** at once. This allows quantum computers to process **many possibilities simultaneously**.

---

### **2. Superposition: Doing Many Things at Once**
- Imagine a classical computer as a library where you check one book at a time. A quantum computer is like a magical library that 
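
A single timed request is a noisy measurement, and its duration includes prompt processing as well as generation. Repeating the request and summarizing the rates gives a steadier picture. The sketch below is one way to do that; `measure_throughput` and its parameters are hypothetical, and the request is passed in as a callable so the function can be tried without a server.

```python
import statistics
import time
from typing import Callable, Dict, List


def measure_throughput(
    request: Callable[[], int],
    runs: int = 3,
    clock: Callable[[], float] = time.perf_counter,
) -> Dict[str, float]:
    """Call `request` `runs` times and summarize tokens per second.

    `request` must perform one completion and return its completion-token count.
    Runs that report zero tokens or zero elapsed time are skipped.
    """
    rates: List[float] = []
    for _ in range(runs):
        start = clock()
        tokens = request()
        elapsed = clock() - start
        if tokens > 0 and elapsed > 0:
            rates.append(tokens / elapsed)
    if not rates:
        return {"runs": 0, "mean_tps": 0.0, "min_tps": 0.0, "max_tps": 0.0}
    return {
        "runs": len(rates),
        "mean_tps": statistics.mean(rates),
        "min_tps": min(rates),
        "max_tps": max(rates),
    }
```

With the notebook's helper, the callable could be `lambda: ((send_chat_completion(messages) or {}).get("usage") or {}).get("completion_tokens", 0)`, which returns 0 when a request fails so that run is left out of the summary.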

## Summary

Run all the cells above to test various aspects of your vLLM server:

1. **Server health check** - Verify the server is running and list available models
2. **Basic functionality** - Simple chat completion
3. **Multi-turn conversations** - Context carried across turns
4. **Reasoning capabilities** - Step-by-step arithmetic word problem
5. **Temperature effects** - Same prompt at temperatures 0.1, 0.7, and 1.0
6. **Tool calling** - Function calling capabilities (if supported)
7. **Performance** - Response timing and token usage

The notebook uses:
- **httpx** for the health check request
- **OpenAI Python client** for model listing and chat completions
- **Error handling** that prints the failure and returns `None` instead of raising
- **Tool calling tests** to check function calling support

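If every cell runs without errors and each test prints a response, your vLLM server is serving chat completions correctly. The cells print results rather than assert on them, so review the outputs yourself.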
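
Since the cells only print what comes back, a small validator can make "pass" explicit. The sketch below checks the response dictionaries returned by `send_chat_completion`. `check_chat_response` is a hypothetical helper, and the two conditions it flags (an empty message, and a `finish_reason` of `length`) are a starting point rather than a complete definition of a healthy response.

```python
from typing import Any, Dict, List, Optional


def check_chat_response(response: Optional[Dict[str, Any]]) -> List[str]:
    """Return a list of problems found in a chat completion dict (empty means OK)."""
    if response is None:
        return ["no response (request failed)"]
    choices = response.get("choices") or []
    if not choices:
        return ["response has no choices"]
    problems: List[str] = []
    choice = choices[0]
    message = choice.get("message") or {}
    has_text = bool((message.get("content") or "").strip())
    has_tools = bool(message.get("tool_calls"))
    if not (has_text or has_tools):
        problems.append("message has neither text nor tool calls")
    if choice.get("finish_reason") == "length":
        problems.append("output was cut off by the token limit")
    return problems
```

After any test cell, `problems = check_chat_response(response)` followed by `assert not problems, problems` turns a silent oddity into a visible failure.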