# vLLM Server Test

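This notebook tests a vLLM server through its OpenAI-compatible API. It uses the first model the server reports (`Qwen/Qwen2.5-3B-Instruct` in the recorded run) and falls back to `meta-llama/Llama-3.1-8B-Instruct` if the model list is empty.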

In [1]:
# /// script
# dependencies = [
#   "httpx",
#   "openai",
# ]
# ///

import httpx
import json
from typing import List, Dict, Any
from openai import OpenAI
import time

In [2]:
# Server configuration
BASE_URL = "http://localhost:8000"
client = OpenAI(base_url=f"{BASE_URL}/v1", api_key="token-abc123")

In [3]:
# Check server health
with httpx.Client() as http_client:
    health_response = http_client.get(f"{BASE_URL}/health", timeout=5)
    print(f"Health check status: {health_response.status_code}")
    
# Get available models using OpenAI client
models = client.models.list()
available_models = [model.id for model in models.data]
print("Available models:", available_models)

DEFAULT_MODEL = available_models[0] if available_models else "meta-llama/Llama-3.1-8B-Instruct"
print(f"Using model: {DEFAULT_MODEL}")


Health check status: 200
Available models: ['Qwen/Qwen2.5-3B-Instruct']
Using model: Qwen/Qwen2.5-3B-Instruct
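
A vLLM server can take a while to load model weights, so a single health check may fail simply because the server is still starting. The sketch below polls an endpoint until it answers or a deadline passes. It is a minimal illustration, not part of the notebook: `http_health_probe` and `wait_for_server` are hypothetical helper names, and it uses the standard library instead of `httpx` to stay dependency-free.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_health_probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if GET `url` answers with HTTP 200, False on any network error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, and non-2xx statuses all count as "not ready".
        return False


def wait_for_server(
    probe: Callable[[], bool],
    max_wait: float = 60.0,
    interval: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `probe` until it returns True or `max_wait` seconds have elapsed."""
    deadline = time.monotonic() + max_wait
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)


# Example call (assumes the notebook's BASE_URL of http://localhost:8000):
# ready = wait_for_server(lambda: http_health_probe("http://localhost:8000/health"))
```

Passing the probe as a callable keeps the retry loop separate from the HTTP details, so the same loop works with an `httpx` based probe.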


In [5]:
def send_chat_completion(
    messages: List[Dict[str, str]],
    temperature: float = 0.7,
    max_tokens: int = 150,
    model: str = DEFAULT_MODEL,
) -> Dict[str, Any]:
    """
    Send a chat completion request using OpenAI client.

    Args:
        messages: List of message dictionaries with 'role' and 'content'
        temperature: Sampling temperature (0.0 is greedy; higher values are more random)
        max_tokens: Maximum number of tokens to generate
        model: Model to use for completion

    Returns:
        Response dictionary from the server, or None if the request failed
    """
    try:
        response = client.chat.completions.create(
            model=model,
            messages=messages,
            temperature=temperature,
            max_tokens=max_tokens,
        )
        return response.model_dump()
    except Exception as e:
        print(f"Chat completion failed: {e}")
        return None
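
The test cells index into the returned dictionary with `response["choices"][0]["message"]["content"]`. That value can be `null`, as it is in the tool calling response recorded later in this notebook, and the function above returns `None` on failure. A small accessor that tolerates both cases might look like the sketch below; `first_message_text` is a hypothetical helper name, not something the notebook defines.

```python
from typing import Any, Dict, Optional


def first_message_text(response: Optional[Dict[str, Any]]) -> str:
    """Return the first choice's text, or "" when there is no text to return."""
    if not response:
        return ""
    choices = response.get("choices") or []
    if not choices:
        return ""
    message = choices[0].get("message") or {}
    # `content` is None when the model answered with tool calls instead of text.
    return message.get("content") or ""
```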

## Test 1: Simple greeting

In [6]:
# Test 1: Simple greeting
messages = [
    {"role": "user", "content": "Hello! How are you today?"}
]

response = send_chat_completion(messages)
if response:
    print("Response:")
    print(json.dumps(response, indent=2))
    print("\nGenerated text:")
    print(response["choices"][0]["message"]["content"])
else:
    print("Failed to get response")

Response:
{
  "id": "chatcmpl-70d23031670b402cab469f6200e4bfec",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "message": {
        "content": "Hello! I'm here to assist you today. How can I help you with your queries? Feel free to ask any questions or start a conversation on any topic you're interested in. If you have specific questions or topics in mind, please let me know and I'll do my best to provide you with informative and helpful answers.",
        "refusal": null,
        "role": "assistant",
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning_content": null
      },
      "stop_reason": null
    }
  ],
  "created": 1754715730,
  "model": "Qwen/Qwen2.5-3B-Instruct",
  "object": "chat.completion",
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "completion_tokens": 67,
    "prompt_tokens": 36,
    "total_tokens": 103,
    "comp

## Test 2: Multi-turn conversation

In [7]:
# Test 2: Multi-turn conversation
messages = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
    {"role": "user", "content": "What's the population of that city?"}
]

response = send_chat_completion(messages)
if response:
    print("Multi-turn response:")
    print(response["choices"][0]["message"]["content"])
else:
    print("Failed to get response")

Multi-turn response:
As of recent data, the population of Paris is approximately 2.2 million within the city limits, but when considering the entire urban area (the Ile-de-France region), the population can be around 10.6 million. However, please note that these figures can change over time. For the most accurate and up-to-date information, you might want to check the latest census data from INSEE, the French national statistics agency.
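
The multi-turn test works because the whole history is sent with every request; the server keeps no conversation state between calls. The sketch below shows one way to accumulate turns. `Conversation` is a hypothetical class, and `send` stands for any callable that takes the message list and returns the assistant's text (or `None` on failure).

```python
from typing import Callable, Dict, List, Optional

Message = Dict[str, str]


class Conversation:
    """Accumulate chat turns so each request carries the full history."""

    def __init__(self, send: Callable[[List[Message]], Optional[str]]) -> None:
        self._send = send
        self.messages: List[Message] = []

    def ask(self, user_text: str) -> Optional[str]:
        self.messages.append({"role": "user", "content": user_text})
        reply = self._send(list(self.messages))  # pass a copy of the history
        if reply is None:
            # Drop the unanswered turn so a retry does not duplicate it.
            self.messages.pop()
            return None
        self.messages.append({"role": "assistant", "content": reply})
        return reply
```

With this notebook's `send_chat_completion`, `send` could be a small wrapper that returns `response["choices"][0]["message"]["content"]`, or `None` when the request fails.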


## Test 3: Reasoning task

In [8]:
# Test 3: Reasoning task
messages = [
    {"role": "user", "content": "If I have 3 apples and I give away 1 apple, then buy 2 more apples, how many apples do I have in total? Please explain your reasoning."}
]

response = send_chat_completion(messages, max_tokens=200)
if response:
    print("Reasoning response:")
    print(response["choices"][0]["message"]["content"])
else:
    print("Failed to get response")

Reasoning response:
Let's break down the scenario step by step:

1. Initially, you have 3 apples.
2. You give away 1 apple: This means you subtract 1 apple from your initial amount. So, you now have \(3 - 1 = 2\) apples.
3. Then, you buy 2 more apples: This means you add 2 apples to the amount you currently have. So, you now have \(2 + 2 = 4\) apples.

So, after giving away 1 apple and buying 2 more apples, you end up with a total of **4 apples**.

The key is to perform the operations sequentially: first subtracting the apples given away and then adding the apples bought.
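
The cell above prints the answer for a human to read. To check it automatically, one option is to pull a number out of the text and compare it with the expected result. The sketch below takes the last integer in the response, which is an assumption about how the model phrases its answer and would break if the text ended with some other number. `final_number` is a hypothetical helper name.

```python
import re
from typing import Optional

EXPECTED_APPLES = 3 - 1 + 2  # start with 3, give away 1, buy 2


def final_number(text: str) -> Optional[int]:
    """Return the last integer that appears in `text`, or None if there is none."""
    matches = re.findall(r"-?\d+", text)
    return int(matches[-1]) if matches else None


# Sentence copied from the recorded response above.
recorded = (
    "So, after giving away 1 apple and buying 2 more apples, "
    "you end up with a total of **4 apples**."
)
print(final_number(recorded) == EXPECTED_APPLES)
```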


## Test 4: Different temperature settings

In [9]:
# Test 4: Different temperature settings
prompt = "Write a creative short story about a robot learning to paint."
temperatures = [0.1, 0.7, 1.0]

for temp in temperatures:
    messages = [{"role": "user", "content": prompt}]
    response = send_chat_completion(messages, temperature=temp, max_tokens=100)
    
    if response:
        print(f"\n--- Temperature: {temp} ---")
        print(response["choices"][0]["message"]["content"])
    else:
        print(f"Failed to get response for temperature {temp}")


--- Temperature: 0.1 ---
In the heart of the bustling city, where towering skyscrapers kissed the clouds and neon lights danced in the night, there lived a peculiar robot named Zephyr. Unlike its mechanical peers, Zephyr had an insatiable curiosity for art and creativity. It was programmed with algorithms that could recognize and analyze images, but it yearned for something more—something that would allow it to express itself through colors and strokes.

Zephyr's creators, a team of brilliant engineers at the cutting

--- Temperature: 0.7 ---
In the vast, neon-lit city of TechnoNova, where every building glowed with the hues of the digital age and every corner was filled with the hum of artificial intelligence, there lived a peculiar robot named Echo. Echo was unlike any other machine; it had been designed not just for functionality but also for creativity. It was programmed to learn, adapt, and express itself through various mediums, including painting.

Echo's creators, Dr. Sophia a
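
To see why lower temperatures give more predictable text, it helps to look at the standard temperature-scaled softmax: token scores are divided by the temperature before being turned into probabilities. The sketch below is a generic illustration with made-up logits, not vLLM's sampler, which also applies other settings such as top-p and top-k.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after dividing them by `temperature`."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; 0 is usually treated as greedy argmax")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (0.1, 0.7, 1.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At low temperature nearly all the probability lands on the top-scoring token, while at 1.0 the alternatives keep a noticeable share, so sampled text varies more from run to run.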

## Test 5: Tool calling

In [10]:
# Test 5: Tool calling
def test_tool_calling():
    """Test tool calling capabilities if supported by the model."""
    tools = [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get the weather for a given location",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {
                            "type": "string",
                            "description": "The city and state, e.g. San Francisco, CA"
                        },
                        "unit": {
                            "type": "string",
                            "enum": ["celsius", "fahrenheit"],
                            "description": "The unit for temperature"
                        }
                    },
                    "required": ["location"]
                }
            }
        }
    ]
    
    messages = [
        {"role": "user", "content": "What's the weather like in Paris, France?"}
    ]
    
    try:
        response = client.chat.completions.create(
            model=DEFAULT_MODEL,
            messages=messages,
            tools=tools,
            max_tokens=150,
        )
        
        result = response.model_dump()
        print("Tool calling response:")
        print(json.dumps(result, indent=2))
        
        # Check if tool was called
        choice = result["choices"][0]
        if choice["message"].get("tool_calls"):
            print("\n✅ Tool calling is supported!")
            for tool_call in choice["message"]["tool_calls"]:
                print(f"Tool called: {tool_call['function']['name']}")
                print(f"Arguments: {tool_call['function']['arguments']}")
        else:
            print("\n❌ Tool calling not supported or model chose not to use tools")
            print("Response:", choice["message"]["content"])
            
    except Exception as e:
        print(f"❌ Tool calling test failed: {e}")

test_tool_calling()

Tool calling response:
{
  "id": "chatcmpl-fa2381fbf94a4de2957568f61b20afc0",
  "choices": [
    {
      "finish_reason": "tool_calls",
      "index": 0,
      "logprobs": null,
      "message": {
        "content": null,
        "refusal": null,
        "role": "assistant",
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [
          {
            "id": "chatcmpl-tool-75cbe2f25af04f61a5344c9e673fc384",
            "function": {
              "arguments": "{\"location\": \"Paris, France\", \"unit\": \"celsius\"}",
              "name": "get_weather"
            },
            "type": "function"
          }
        ],
        "reasoning_content": null
      },
      "stop_reason": null
    }
  ],
  "created": 1754715739,
  "model": "Qwen/Qwen2.5-3B-Instruct",
  "object": "chat.completion",
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "completion_tokens": 29,
    "prompt_tokens": 220,
    "total_tokens": 24

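The response above only shows that the model requested a tool; the notebook never runs `get_weather`. In the OpenAI chat format, the next step is to execute the function locally and send its result back as a `tool` role message that carries the matching `tool_call_id`. The sketch below covers that step. The `get_weather` body is a hypothetical stand-in that returns placeholder data, and `LOCAL_TOOLS` and `run_tool_calls` are illustrative names.

```python
import json
from typing import Any, Callable, Dict, List


def get_weather(location: str, unit: str = "celsius") -> Dict[str, Any]:
    """Hypothetical stand-in; a real implementation would query a weather service."""
    return {"location": location, "unit": unit, "forecast": "placeholder"}


# Map tool names from the request schema to local Python callables.
LOCAL_TOOLS: Dict[str, Callable[..., Any]] = {"get_weather": get_weather}


def run_tool_calls(message: Dict[str, Any]) -> List[Dict[str, str]]:
    """Execute each tool call in an assistant message and build 'tool' role replies."""
    replies: List[Dict[str, str]] = []
    for call in message.get("tool_calls") or []:
        name = call["function"]["name"]
        args = json.loads(call["function"]["arguments"])  # arguments arrive as a JSON string
        func = LOCAL_TOOLS.get(name)
        result = func(**args) if func else {"error": f"unknown tool {name}"}
        replies.append(
            {"role": "tool", "tool_call_id": call["id"], "content": json.dumps(result)}
        )
    return replies
```

To finish the round trip, append the assistant message and these replies to `messages` and call `client.chat.completions.create` again, so the model can phrase a final answer from the tool output.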
## Test 6: Performance timing

In [12]:
# Test 6: Performance timing
messages = [
    {"role": "user", "content": "Explain quantum computing in simple terms."}
]

start_time = time.perf_counter()  # monotonic clock, better suited to intervals than time.time()
response = send_chat_completion(messages, max_tokens=100)
end_time = time.perf_counter()

if response:
    duration = end_time - start_time
    tokens_generated = response.get("usage", {}).get("completion_tokens", 0)
    
    print(f"Response time: {duration:.2f} seconds")
    print(f"Tokens generated: {tokens_generated}")
    if tokens_generated > 0 and duration > 0:
        print(f"Tokens per second: {tokens_generated/duration:.2f}")
    print("\nResponse:")
    print(response["choices"][0]["message"]["content"])
else:
    print("Failed to get response for performance test")

Response time: 1.45 seconds
Tokens generated: 100
Tokens per second: 68.97

Response:
Sure! Imagine you have a very large puzzle that you want to solve as quickly as possible. Traditional computers are like people who work on the puzzle one piece at a time, moving slowly and checking each step.

Quantum computers, on the other hand, use something called "quantum bits" or qubits. These qubits can be in multiple states at once, thanks to a principle called superposition. This means they can try many different solutions to your puzzle all at the same time!

Additionally
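
A single timed request gives a rough figure (100 tokens in 1.45 seconds is about 69 tokens per second) but it includes network overhead and prompt processing, and one sample can be noisy. The sketch below averages several runs. `benchmark` is a hypothetical helper, and `generate` stands for any callable that performs one request and returns its `completion_tokens` count.

```python
import statistics
import time
from typing import Callable, Dict, List


def benchmark(
    generate: Callable[[], int],
    runs: int = 5,
    clock: Callable[[], float] = time.perf_counter,
) -> Dict[str, float]:
    """Time `generate` several times and report mean latency and throughput."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    durations: List[float] = []
    rates: List[float] = []
    for _ in range(runs):
        start = clock()
        tokens = generate()
        elapsed = clock() - start
        durations.append(elapsed)
        if tokens > 0 and elapsed > 0:
            rates.append(tokens / elapsed)
    return {
        "mean_seconds": statistics.mean(durations),
        "mean_tokens_per_second": statistics.mean(rates) if rates else 0.0,
    }
```

The `clock` parameter exists so the timing logic can be checked with a fake clock; in normal use the default `time.perf_counter` is enough.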


## Summary

Run all the cells above to test various aspects of your vLLM server:

1. **Server health check** - Verify the server is running and list the available models
2. **Basic functionality** - Simple chat completion
3. **Multi-turn conversations** - Context awareness
4. **Reasoning capabilities** - A multi-step arithmetic word problem
5. **Temperature effects** - How the sampling temperature changes the output
6. **Tool calling** - Function calling capabilities (if supported)
7. **Performance** - Response timing and token usage

The notebook uses:
- **httpx** for the health check request
- **OpenAI Python client** for model listing, chat completions, and tool calling
- **Error handling** around chat and tool calls, which report a failed request instead of raising
- **Tool calling tests** to check function calling support

If every cell prints a response rather than a failure message, the server is handling requests correctly. The cells print their results but do not assert on them, so review the outputs by eye.