# vLLM Server Test

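This notebook tests a vLLM server through its OpenAI-compatible API. The recorded outputs below come from a server running Qwen/Qwen2.5-7B-Instruct; `meta-llama/Llama-3.1-8B-Instruct` is only the fallback model name used when the server lists no models.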

In [28]:
# /// script
# dependencies = [
#   "httpx",
#   "openai",
# ]
# ///

import httpx
import json
from typing import List, Dict, Any
from openai import OpenAI
import time

In [29]:
# Server configuration
BASE_URL = "http://localhost:8000"
client = OpenAI(base_url=f"{BASE_URL}/v1", api_key="local")

In [30]:
# Check server health
with httpx.Client() as http_client:
    health_response = http_client.get(f"{BASE_URL}/health", timeout=5)
    print(f"Health check status: {health_response.status_code}")
    
# Get available models using OpenAI client
models = client.models.list()
available_models = [model.id for model in models.data]
print("Available models:", available_models)

DEFAULT_MODEL = available_models[0] if available_models else "meta-llama/Llama-3.1-8B-Instruct"
print(f"Using model: {DEFAULT_MODEL}")


Health check status: 200
Available models: ['Qwen/Qwen2.5-7B-Instruct']
Using model: Qwen/Qwen2.5-7B-Instruct
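
The health check above is a single request, so it fails if the server is still loading the model. A minimal sketch of a polling helper follows. The name `wait_until_ready` and the `probe` callable are hypothetical and not part of vLLM or httpx.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    timeout_s: float = 60.0,
    interval_s: float = 1.0,
) -> bool:
    """Poll `probe` until it returns True or `timeout_s` elapses.

    `probe` is a hypothetical zero-argument callable. Any exception it raises
    (for example a refused connection) is treated as "not ready yet".
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if probe():
                return True
        except Exception:
            pass  # server not reachable yet; retry after the interval
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

With the names from this notebook, one possible probe is `lambda: httpx.get(f"{BASE_URL}/health", timeout=5).status_code == 200`.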


In [31]:
def send_chat_completion(
    messages: List[Dict[str, str]],
    temperature: float = 0.7,
    max_tokens: int | None = None,
    model: str = DEFAULT_MODEL,
    enable_thinking: bool = False,
) -> Dict[str, Any] | None:
    """
    Send a chat completion request using the OpenAI client.

    Args:
        messages: List of message dictionaries with 'role' and 'content'
        temperature: Sampling temperature; lower values make output more
            deterministic, higher values more varied
        max_tokens: Maximum number of tokens to generate (None leaves the
            limit to the server)
        model: Model to use for completion
        enable_thinking: Forwarded to the chat template through
            chat_template_kwargs; only has an effect for models whose
            template reads this flag

    Returns:
        Response dictionary from the server, or None if the request failed
    """
    try:
        response = client.chat.completions.create(
            model=model,
            messages=messages,
            temperature=temperature,
            max_tokens=max_tokens,
            # extra_body carries fields outside the standard OpenAI schema
            extra_body={
                "chat_template_kwargs": {"enable_thinking": enable_thinking},
            },
        )
        return response.model_dump()
    except Exception as e:
        # Callers check for None instead of handling exceptions
        print(f"Chat completion failed: {e}")
        return None

## Test 1: Simple greeting

In [32]:
# Test 1: Simple greeting
messages = [
    {"role": "user", "content": "Hello! How are you today?"}
]

response = send_chat_completion(messages)
if response:
    print("Response:")
    print(json.dumps(response, indent=2))
    print("\nGenerated text:")
    print(response["choices"][0]["message"]["content"])
else:
    print("Failed to get response")

Response:
{
  "id": "chatcmpl-329c148e508f4b4aae45041af5bb5e5f",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "message": {
        "content": "Hello! I'm just a digital assistant, so I don't have feelings, but I'm here and ready to help you with any questions or information you need. How can I assist you today?",
        "refusal": null,
        "role": "assistant",
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning_content": null
      },
      "stop_reason": null,
      "token_ids": null
    }
  ],
  "created": 1759662488,
  "model": "Qwen/Qwen2.5-7B-Instruct",
  "object": "chat.completion",
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "completion_tokens": 41,
    "prompt_tokens": 36,
    "total_tokens": 77,
    "completion_tokens_details": null,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null,
  "prompt_token_i

## Test 2: Multi-turn conversation

In [33]:
# Test 2: Multi-turn conversation
messages = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
    {"role": "user", "content": "What's the population of that city?"}
]

response = send_chat_completion(messages)
if response:
    print("Multi-turn response:")
    print(response["choices"][0]["message"]["content"])
else:
    print("Failed to get response")

Multi-turn response:
As of 2023, the population of Paris is approximately 2.2 million people. This includes the inhabitants of the city proper (Paris市内). The greater metropolitan area, which encompasses a larger region including suburbs and surrounding cities, has a population of around 12.5 million people.
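
The chat endpoint is stateless, so context only survives because the full message list is resent on every turn, as in the cell above. A minimal sketch of a helper that keeps that history up to date is shown here. `ask` and its `complete` parameter are hypothetical names introduced for illustration.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def ask(
    history: List[Message],
    user_text: str,
    complete: Callable[[List[Message]], str],
) -> str:
    """Append a user turn, obtain a reply, and record the reply in `history`.

    `complete` is a hypothetical callable that takes the full message list
    and returns the assistant's text.
    """
    history.append({"role": "user", "content": user_text})
    reply = complete(history)
    # Store the reply so the next call sends it back as context
    history.append({"role": "assistant", "content": reply})
    return reply
```

In this notebook, `complete` could wrap `send_chat_completion` and extract `["choices"][0]["message"]["content"]`; that wrapper would also need to handle the `None` returned on failure.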


## Test 3: Reasoning task

In [34]:
# Test 3: Reasoning task
messages = [
    {"role": "user", "content": "If I have 3 apples and I give away 1 apple, then buy 2 more apples, how many apples do I have in total? Please explain your reasoning."}
]

response = send_chat_completion(messages)
if response:
    print("Reasoning response:")
    print(response["choices"][0]["message"]["content"])
else:
    print("Failed to get response")

Reasoning response:
Sure! Let's break down the problem step by step:

1. **Starting Point**: You have 3 apples.
2. **Giving Away an Apple**: When you give away 1 apple, you subtract 1 from your total. So now you have:
   \[
   3 - 1 = 2 \text{ apples}
   \]
3. **Buying More Apples**: Then, you buy 2 more apples. Adding these to your current total gives:
   \[
   2 + 2 = 4 \text{ apples}
   \]

So, after giving away 1 apple and then buying 2 more, you end up with a total of 4 apples.


## Test 4: Different temperature settings

In [35]:
# Test 4: Different temperature settings
prompt = "Write a creative short story about a robot learning to paint."
temperatures = [0.1, 0.7, 1.0]

for temp in temperatures:
    messages = [{"role": "user", "content": prompt}]
    response = send_chat_completion(messages, temperature=temp)
    
    if response:
        print(f"\n--- Temperature: {temp} ---")
        print(response["choices"][0]["message"]["content"])
    else:
        print(f"Failed to get response for temperature {temp}")


--- Temperature: 0.1 ---
In the heart of a bustling city, nestled among towering skyscrapers and vibrant street art, stood a small, unassuming workshop. This was the domain of Elara, a young artist with a passion for color and form. Her latest project was a series of large-scale murals that would transform the walls of an old factory into a vibrant canvas of life.

Elara had always been fascinated by the idea of collaboration between humans and machines. She believed that technology could enhance creativity rather than replace it. One day, she decided to bring her vision to life by creating a robot that could assist her in painting.

She named her creation "Pinta," a sleek, metallic figure with a long arm equipped with a brush. Pinta was not just any robot; it was designed to learn and adapt. Elara programmed it with basic painting techniques but left room for improvement through machine learning algorithms.

The first few days were challenging. Pinta struggled to mimic Elara's moveme
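
To see why lower temperatures give more predictable text, it helps to look at how temperature is usually described: the logits are divided by the temperature before the softmax. The sketch below illustrates that textbook formula with made-up logits. It is not vLLM's sampler code, and `softmax_with_temperature` is a hypothetical name.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after dividing by `temperature`."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for t in (0.1, 0.7, 1.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At 0.1 almost all probability mass sits on the highest-scoring token, while at 1.0 the alternatives keep a meaningful share, which is why repeated runs at higher temperatures diverge more.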

## Test 5: Tool calling

In [36]:
# Test 5: Tool calling
def test_tool_calling():
    """Test tool calling capabilities if supported by the model."""
    tools = [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get the weather for a given location",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {
                            "type": "string",
                            "description": "The city and state, e.g. San Francisco, CA"
                        },
                        "unit": {
                            "type": "string",
                            "enum": ["celsius", "fahrenheit"],
                            "description": "The unit for temperature"
                        }
                    },
                    "required": ["location"]
                }
            }
        }
    ]
    
    messages = [
        {"role": "user", "content": "What's the weather like in Paris, France?"}
    ]
    
    try:
        response = client.chat.completions.create(
            model=DEFAULT_MODEL,
            messages=messages,
            tools=tools,
        )
        
        result = response.model_dump()
        print("Tool calling response:")
        print(json.dumps(result, indent=2))
        
        # Check if tool was called
        choice = result["choices"][0]
        if choice["message"].get("tool_calls"):
            print("\n✅ Tool calling is supported!")
            for tool_call in choice["message"]["tool_calls"]:
                print(f"Tool called: {tool_call['function']['name']}")
                print(f"Arguments: {tool_call['function']['arguments']}")
        else:
            print("\n❌ Tool calling not supported or model chose not to use tools")
            print("Response:", choice["message"]["content"])
            
    except Exception as e:
        print(f"❌ Tool calling test failed: {e}")

test_tool_calling()

Tool calling response:
{
  "id": "chatcmpl-6e9e55836b8e437a83a60593be993fb6",
  "choices": [
    {
      "finish_reason": "tool_calls",
      "index": 0,
      "logprobs": null,
      "message": {
        "content": null,
        "refusal": null,
        "role": "assistant",
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [
          {
            "id": "chatcmpl-tool-c6fb67646f674bfa82a3061ecff56146",
            "function": {
              "arguments": "{\"location\": \"Paris, France\", \"unit\": \"celsius\"}",
              "name": "get_weather"
            },
            "type": "function"
          }
        ],
        "reasoning_content": null
      },
      "stop_reason": null,
      "token_ids": null
    }
  ],
  "created": 1759662516,
  "model": "Qwen/Qwen2.5-7B-Instruct",
  "object": "chat.completion",
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "completion_tokens": 29,
    "prompt_tokens": 22
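
The response above only requests a tool call; nothing has been executed yet. A minimal sketch of the next step follows, assuming the OpenAI-style convention of answering each call with a `tool` role message. `TOOL_REGISTRY` and `build_tool_messages` are hypothetical names, and this `get_weather` is a placeholder that returns no real weather data.

```python
import json
from typing import Any, Dict, List


def get_weather(location: str, unit: str = "celsius") -> Dict[str, Any]:
    """Hypothetical stand-in; a real tool would query a weather service."""
    return {"location": location, "unit": unit, "temperature": None}


# Maps the tool names advertised to the model onto local Python callables
TOOL_REGISTRY = {"get_weather": get_weather}


def build_tool_messages(assistant_message: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Run each requested tool and return one `tool` message per call."""
    results = []
    for call in assistant_message.get("tool_calls") or []:
        fn = TOOL_REGISTRY[call["function"]["name"]]
        # Arguments arrive as a JSON string, not as a dict
        args = json.loads(call["function"]["arguments"])
        results.append(
            {
                "role": "tool",
                "tool_call_id": call["id"],
                "content": json.dumps(fn(**args)),
            }
        )
    return results
```

To finish the round trip, append the assistant message (with its `tool_calls`) and these `tool` messages to `messages`, then call `client.chat.completions.create` again so the model can produce a final answer.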

## Test 6: Performance timing

In [37]:
# Test 6: Performance timing
messages = [
    {"role": "user", "content": "Explain quantum computing in simple terms."}
]

# perf_counter is a monotonic clock intended for measuring elapsed time
start_time = time.perf_counter()
response = send_chat_completion(messages)
end_time = time.perf_counter()

if response:
    duration = end_time - start_time
    tokens_generated = response.get("usage", {}).get("completion_tokens", 0)
    
    print(f"Response time: {duration:.2f} seconds")
    print(f"Tokens generated: {tokens_generated}")
    if tokens_generated > 0 and duration > 0:
        print(f"Tokens per second: {tokens_generated/duration:.2f}")
    print("\nResponse:")
    print(response["choices"][0]["message"]["content"])
else:
    print("Failed to get response for performance test")

Response time: 4.15 seconds
Tokens generated: 310
Tokens per second: 74.76

Response:
Quantum computing is a type of computing that uses the principles of quantum mechanics to process information. Here’s a simple way to understand it:

1. **Classical Bits vs. Quantum Bits (Qubits):**
   - In classical computers, data is processed using bits, which can be either 0 or 1.
   - Quantum computers use quantum bits, or qubits, which can exist as both 0 and 1 simultaneously thanks to a principle called superposition.

2. **Superposition:**
   - Imagine a coin spinning in the air; it's not heads or tails until it lands. Similarly, a qubit can be in multiple states at once until measured.

3. **Entanglement:**
   - This is when two qubits become linked in such a way that the state of one (whether it's 0 or 1) depends on the state of the other, no matter how far apart they are. Changing one instantly affects the other.

4. **Parallel Processing:**
   - Because qubits can be in multiple states at 
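
A single timed request is a noisy measurement, and the elapsed time includes prompt processing and client overhead as well as generation. A minimal sketch that repeats the request and summarizes the end-to-end rate is shown below. `measure_throughput` and its `send` parameter are hypothetical names; `send` is expected to return a response dictionary shaped like the ones above, or `None` on failure.

```python
import statistics
import time
from typing import Any, Callable, Dict, List, Optional


def measure_throughput(
    send: Callable[[], Optional[Dict[str, Any]]],
    runs: int = 5,
) -> Dict[str, float]:
    """Call `send` several times and summarize completion tokens per second."""
    rates: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        response = send()
        elapsed = time.perf_counter() - start
        if not response or elapsed <= 0:
            continue  # skip failed requests
        tokens = (response.get("usage") or {}).get("completion_tokens", 0)
        if tokens > 0:
            rates.append(tokens / elapsed)
    if not rates:
        raise RuntimeError("no successful runs to summarize")
    return {
        "runs": float(len(rates)),
        "mean_tokens_per_s": statistics.mean(rates),
        "min_tokens_per_s": min(rates),
        "max_tokens_per_s": max(rates),
    }
```

In this notebook, `send` could be `lambda: send_chat_completion(messages)`.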

## Summary

Run all the cells above to test various aspects of your vLLM server:

1. **Server health check** - Verify server is running and get available models
2. **Basic functionality** - Simple chat completion
3. **Multi-turn conversations** - Context awareness
4. **Reasoning capabilities** - Step-by-step arithmetic word problem
5. **Temperature effects** - Creativity control
6. **Tool calling** - Function calling capabilities (if supported)
7. **Performance** - Response timing and token usage

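The notebook uses:
- **httpx** for the health check request
- **OpenAI Python client** for model listing, chat completions, and tool calls
- **Basic error handling**: failed requests print the error and return `None`
- **Tool calling test** to check function calling support

If every cell prints a response rather than a failure message, your vLLM server is handling these request types. The cells do not assert on response content, so review the outputs yourself.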