# vLLM Server Test

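This notebook tests a local vLLM server through its OpenAI-compatible API. It uses the first model the server reports and falls back to `meta-llama/Llama-3.1-8B-Instruct` if none is listed; the recorded outputs below come from `nvidia/Nemotron-Cascade-8B`.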

In [1]:
# /// script
# dependencies = [
#   "httpx",
#   "openai",
# ]
# ///

import httpx
import json
from typing import List, Dict, Any
from openai import OpenAI
import time

In [2]:
# Server configuration
BASE_URL = "http://localhost:8007"
client = OpenAI(base_url=f"{BASE_URL}/v1", api_key="local")

In [3]:
# Check server health
with httpx.Client() as http_client:
    health_response = http_client.get(f"{BASE_URL}/health", timeout=5)
    print(f"Health check status: {health_response.status_code}")
    
# Get available models using OpenAI client
models = client.models.list()
available_models = [model.id for model in models.data]
print("Available models:", available_models)

DEFAULT_MODEL = available_models[0] if available_models else "meta-llama/Llama-3.1-8B-Instruct"
print(f"Using model: {DEFAULT_MODEL}")


Health check status: 200
Available models: ['nvidia/Nemotron-Cascade-8B']
Using model: nvidia/Nemotron-Cascade-8B
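
The health check above assumes the server is already up. When the notebook is started right after launching vLLM, the model may still be loading, so a small polling loop can help. The sketch below is one way to do that; `wait_until_ready` and its `probe` callable are hypothetical names introduced here, not part of vLLM or httpx.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    timeout: float = 60.0,
    interval: float = 1.0,
) -> bool:
    """Poll `probe` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            if probe():
                return True
        except Exception:
            # Connection errors are expected while the server is still loading.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Possible usage against this notebook's server (needs httpx and BASE_URL):
#   ready = wait_until_ready(
#       lambda: httpx.get(f"{BASE_URL}/health", timeout=5).status_code == 200
#   )
```

Passing the probe in as a callable keeps the retry logic independent of the HTTP client, so it can be exercised without a running server.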


In [4]:
def send_chat_completion(
    messages: List[Dict[str, str]],
    temperature: float = 0.7,
    max_tokens: int | None = None,
    model: str = DEFAULT_MODEL,
    enable_thinking: bool = False,
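) -> Dict[str, Any] | None:
    """
    Send a chat completion request using the OpenAI client.

    Args:
        messages: List of message dictionaries with 'role' and 'content'
        temperature: Sampling temperature (0.0 is greedy; higher values are more random)
        max_tokens: Maximum number of tokens to generate (None leaves the limit to the server)
        model: Model to use for completion
        enable_thinking: Forwarded to the chat template as `enable_thinking`;
            whether it has any effect depends on the model's template

    Returns:
        Response dictionary from the server, or None if the request failed
    """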
    try:
        response = client.chat.completions.create(
            model=model,
            messages=messages,
            temperature=temperature,
            max_tokens=max_tokens,
            extra_body={
                "chat_template_kwargs": {"enable_thinking": enable_thinking},
            },
        )
        return response.model_dump()
    except Exception as e:
        print(f"Chat completion failed: {e}")
        return None

## Test 1: Simple greeting

In [5]:
# Test 1: Simple greeting
messages = [
    {"role": "user", "content": "Hello! How are you today?"}
]

response = send_chat_completion(messages)
if response:
    print("Response:")
    print(json.dumps(response, indent=2))
    print("\nGenerated text:")
    print(response["choices"][0]["message"]["content"])
else:
    print("Failed to get response")

Response:
{
  "id": "chatcmpl-b1b0328c27e30eff",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "message": {
        "content": "Hello! I'm just a virtual assistant, so I don't have feelings, but I'm here and ready to help you with anything you need! How are *you* doing today? \ud83d\ude0a",
        "refusal": null,
        "role": "assistant",
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning": null,
        "reasoning_content": null
      },
      "stop_reason": null,
      "token_ids": null
    }
  ],
  "created": 1770982228,
  "model": "nvidia/Nemotron-Cascade-8B",
  "object": "chat.completion",
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "completion_tokens": 41,
    "prompt_tokens": 32,
    "total_tokens": 73,
    "completion_tokens_details": null,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null,
  "prompt_to

## Test 2: Multi-turn conversation

In [6]:
# Test 2: Multi-turn conversation
messages = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
    {"role": "user", "content": "What's the population of that city?"}
]

response = send_chat_completion(messages)
if response:
    print("Multi-turn response:")
    print(json.dumps(response, indent=2))
else:
    print("Failed to get response")

Multi-turn response:
{
  "id": "chatcmpl-a3cf95ff34f7b3fa",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "message": {
        "content": "The population of Paris, the capital of France, is approximately **2.1 million** people within the city limits (as of recent estimates). The metropolitan area (\u00cele-de-France region) is much larger, with around **11 million** inhabitants. \n\nWould you like more detailed demographic data?",
        "refusal": null,
        "role": "assistant",
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning": null,
        "reasoning_content": null
      },
      "stop_reason": null,
      "token_ids": null
    }
  ],
  "created": 1770982228,
  "model": "nvidia/Nemotron-Cascade-8B",
  "object": "chat.completion",
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "completion_tokens": 63,
    "prompt_tokens": 61

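The multi-turn test above builds the message history by hand. For longer exchanges, a small wrapper that appends each turn automatically may be more convenient. This is a minimal sketch: `Conversation` is a hypothetical helper, and it assumes the `send` callable behaves like `send_chat_completion` (returns a response dict, or `None` on failure).

```python
from typing import Callable, Dict, List, Optional

Message = Dict[str, str]


class Conversation:
    """Keeps the running message history for a multi-turn chat."""

    def __init__(self, send: Callable[[List[Message]], Optional[dict]]):
        self.send = send
        self.messages: List[Message] = []

    def ask(self, user_text: str) -> Optional[str]:
        """Send one user turn and record the assistant reply in the history."""
        self.messages.append({"role": "user", "content": user_text})
        response = self.send(self.messages)
        if not response:
            # Drop the unanswered turn so a retry does not duplicate it.
            self.messages.pop()
            return None
        reply = response["choices"][0]["message"]["content"]
        self.messages.append({"role": "assistant", "content": reply})
        return reply
```

In this notebook it could be used as `chat = Conversation(send_chat_completion)` followed by repeated `chat.ask(...)` calls.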
## Test 3: Reasoning task

In [7]:
# Test 3: Reasoning task
messages = [
    {"role": "user", "content": "If I have 3 apples and I give away 1 apple, then buy 2 more apples, how many apples do I have in total? Please explain your reasoning."}
]

response = send_chat_completion(messages, enable_thinking=True)
if response:
    print(json.dumps(response, indent=2))
else:
    print("Failed to get response")

{
  "id": "chatcmpl-970eda2bbe2f20b8",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "message": {
        "content": "\nTo solve the problem, let's break it down step by step with clear reasoning:\n\n1. **Initial Condition**: You start with 3 apples.\n2. **Giving Away an Apple**: You give away 1 apple. Giving away an apple means you lose it, so subtract 1 from the initial amount:  \n   \\(3 - 1 = 2\\) apples remaining.\n3. **Buying More Apples**: You then buy 2 more apples. Buying apples means you gain them, so add 2 to the current amount:  \n   \\(2 + 2 = 4\\) apples.\n\nAfter these actions, the total number of apples you have is **4**.\n\n### Summary of Reasoning:\n- The sequence of actions is crucial: first giving away, then buying.\n- Arithmetic operations reflect the changes: subtraction for giving away and addition for buying.\n- Final calculation: Start with 3, minus 1 gives 2, plus 2 gives 4.\n\nThus, you have **4 apples** i

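The responses above carry `reasoning` and `reasoning_content` fields next to `content`. Depending on how the server is configured, the model's thinking may arrive in those fields or stay inline in `content`. The helper below is a sketch that handles both cases; `split_reasoning` is a hypothetical name, and the `<think>...</think>` tag format is an assumption about the chat template rather than something the recorded output confirms.

```python
import re
from typing import Any, Dict, Optional, Tuple

# Assumed inline format; adjust if the model's template uses different tags.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(message: Dict[str, Any]) -> Tuple[Optional[str], str]:
    """Return (reasoning, answer) from a chat completion message dict."""
    content = message.get("content") or ""
    reasoning = message.get("reasoning_content") or message.get("reasoning")
    if reasoning:
        # The server already separated the reasoning from the answer.
        return reasoning.strip(), content.strip()
    match = _THINK_RE.search(content)
    if match:
        # Reasoning is inline: cut the tagged span out of the answer.
        answer = content[: match.start()] + content[match.end():]
        return match.group(1).strip(), answer.strip()
    return None, content.strip()
```

It would be called as `split_reasoning(response["choices"][0]["message"])`.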
## Test 4: Tool calling

In [8]:
def test_tool_calling():
    """Test tool calling capabilities if supported by the model."""
    tools = [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get the weather for a given location",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {
                            "type": "string",
                            "description": "The city and state, e.g. San Francisco, CA"
                        },
                        "unit": {
                            "type": "string",
                            "enum": ["celsius", "fahrenheit"],
                            "description": "The unit for temperature"
                        }
                    },
                    "required": ["location"]
                }
            }
        }
    ]
    
    messages = [
        {"role": "user", "content": "What's the weather like in Paris, France?"}
    ]
    
    try:
        response = client.chat.completions.create(
            model=DEFAULT_MODEL,
            messages=messages,
            tools=tools,
        )
        
        result = response.model_dump()
        print("Tool calling response:")
        print(json.dumps(result, indent=2))
        
        # Check if tool was called
        choice = result["choices"][0]
        if choice["message"].get("tool_calls"):
            print("\n✅ Tool calling is supported!")
            for tool_call in choice["message"]["tool_calls"]:
                print(f"Tool called: {tool_call['function']['name']}")
                print(f"Arguments: {tool_call['function']['arguments']}")
        else:
            print("\n❌ Tool calling not supported or model chose not to use tools")
            print("Response:", choice["message"]["content"])
            
    except Exception as e:
        print(f"❌ Tool calling test failed: {e}")

test_tool_calling()

Tool calling response:
{
  "id": "chatcmpl-b7c228db1e46d1bd",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "message": {
        "content": "<tool_call>\n{\"name\": \"get_weather\", \"location\": \"Paris, France\", \"unit\": \"celsius\"}\n</tool_call>",
        "refusal": null,
        "role": "assistant",
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning": null,
        "reasoning_content": null
      },
      "stop_reason": null,
      "token_ids": null
    }
  ],
  "created": 1770982247,
  "model": "nvidia/Nemotron-Cascade-8B",
  "object": "chat.completion",
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "completion_tokens": 26,
    "prompt_tokens": 216,
    "total_tokens": 242,
    "completion_tokens_details": null,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null,
  "prompt_token_ids": null,
  "kv_transfer_param

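In the recorded run above, `tool_calls` is empty and the call arrives as `<tool_call>` text inside `content`, so the test reports tool calling as unsupported even though the model tried to call `get_weather`. A client-side fallback could recover such calls. The sketch below assumes the model keeps emitting this exact tag format with a JSON object inside; `extract_inline_tool_calls` is a hypothetical helper, not part of vLLM or the OpenAI client.

```python
import json
import re
from typing import Any, Dict, List, Optional

_TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def extract_inline_tool_calls(content: Optional[str]) -> List[Dict[str, Any]]:
    """Parse <tool_call>{...}</tool_call> blocks left in the message content."""
    calls: List[Dict[str, Any]] = []
    for raw in _TOOL_CALL_RE.findall(content or ""):
        try:
            payload = json.loads(raw)
        except json.JSONDecodeError:
            continue  # Skip blocks that are not valid JSON.
        if not isinstance(payload, dict) or "name" not in payload:
            continue
        name = payload.pop("name")
        # Accept both {"name", "arguments": {...}} and the flat form seen above,
        # where the arguments sit next to "name".
        arguments = payload.pop("arguments", payload)
        calls.append({"name": name, "arguments": arguments})
    return calls
```

This only works around the symptom on the client; having the server parse tool calls itself would populate `tool_calls` directly.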
## Test 5: Different temperature settings

In [9]:
prompt = "Write a creative short story about a robot learning to paint."
temperatures = [0.1, 0.7, 1.0]

for temp in temperatures:
    messages = [{"role": "user", "content": prompt}]
    response = send_chat_completion(messages, temperature=temp)
    
    if response:
        print(f"\n--- Temperature: {temp} ---")
        print(response["choices"][0]["message"]["content"])
    else:
        print(f"Failed to get response for temperature {temp}")


--- Temperature: 0.1 ---
In the heart of the cybernetic metropolis, where steel towers pierced the clouds and circuits hummed with life, a robot named Zeta stood before a blank canvas. Its mechanical eyes, aglow with curiosity, scanned the vibrant colors spread out before it. Zeta was an experiment, a prototype designed to learn and adapt, but its creators had given it a peculiar directive: learn to paint.

At first, Zeta's mechanical arms moved with precision, yet clumsy strokes splattered the canvas with discordant hues. Paint dripped like malfunctioning code, creating a chaotic mess. The robot's processors whirred as it analyzed the failure. "Error: Aesthetic harmony not achieved," it beeped softly.

Undeterred, Zeta observed. It watched the human artists who traversed the gallery nearby—how their hands danced, how their minds translated emotions into swirling patterns. A human child pointed at a sunset painting, giggling as Zeta's sensors captured the warmth of amber and violet.
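
Reading three stories side by side makes it hard to judge how much temperature changed the output. A rough numeric comparison can help. The sketch below scores each pair of outputs with a word-level `difflib` ratio; `pairwise_similarity` is a hypothetical helper, and the ratio is only a crude proxy for diversity, especially with a single sample per temperature.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import Dict, Tuple


def pairwise_similarity(outputs: Dict[float, str]) -> Dict[Tuple[float, float], float]:
    """Return a 0..1 word-level similarity ratio for each pair of temperatures."""
    return {
        (a, b): SequenceMatcher(None, outputs[a].split(), outputs[b].split()).ratio()
        for a, b in combinations(sorted(outputs), 2)
    }
```

To use it, the loop above would collect each generated text into a dict keyed by temperature and pass that dict to `pairwise_similarity`.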

## Test 6: Performance

In [10]:
# Test 6: Performance timing
messages = [
    {"role": "user", "content": "Explain quantum computing in simple terms."}
]

# perf_counter is monotonic and better suited to measuring elapsed time
start_time = time.perf_counter()
response = send_chat_completion(messages)
end_time = time.perf_counter()

if response:
    duration = end_time - start_time
    tokens_generated = response.get("usage", {}).get("completion_tokens", 0)
    
    print(f"Response time: {duration:.2f} seconds")
    print(f"Tokens generated: {tokens_generated}")
    if tokens_generated > 0 and duration > 0:
        print(f"Tokens per second: {tokens_generated/duration:.2f}")
    print("\nResponse:")
    print(response["choices"][0]["message"]["content"])
else:
    print("Failed to get response for performance test")

Response time: 7.06 seconds
Tokens generated: 345
Tokens per second: 48.88

Response:
Sure! Here's a simple explanation of **quantum computing**:

### **Quantum Computing in Simple Terms**
Quantum computing is a new way of processing information using the principles of **quantum mechanics**—the science that describes how tiny particles (like atoms and electrons) behave.

#### **Key Differences from Regular Computers:**
1. ** classical bits vs. quantum bits (qubits)**:
   - Normal computers use **bits** that are either **0 or 1**.
   - Quantum computers use **qubits**, which can be **0, 1, or both at the same time** (this is called **superposition**). This lets them explore many possibilities simultaneously.

2. **Entanglement**:
   - Qubits can be **linked together** (entangled), meaning changing one instantly affects another, no matter how far apart they are. This helps in fast calculations.

3. **Quantum Interference**:
   - Quantum computers manipulate probabilities to **amplify c
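
A single timed request, as above, is sensitive to warm-up and to whatever else the server is doing. Repeating the request and reporting medians gives a steadier number. The sketch below assumes a zero-argument `send` callable that returns a response dict shaped like the ones above (or `None` on failure); `benchmark` is a hypothetical helper.

```python
import statistics
import time
from typing import Any, Callable, Dict, List, Optional


def benchmark(
    send: Callable[[], Optional[Dict[str, Any]]],
    runs: int = 5,
    warmup: int = 1,
) -> Dict[str, float]:
    """Time repeated requests and summarize latency and throughput."""
    for _ in range(warmup):
        send()  # Warm-up calls are not measured.
    latencies: List[float] = []
    rates: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        response = send()
        elapsed = time.perf_counter() - start
        if not response:
            continue  # Ignore failed requests.
        tokens = (response.get("usage") or {}).get("completion_tokens", 0)
        if tokens > 0 and elapsed > 0:
            latencies.append(elapsed)
            rates.append(tokens / elapsed)
    if not rates:
        raise RuntimeError("no successful runs to summarize")
    return {
        "runs": len(rates),
        "median_latency_s": statistics.median(latencies),
        "median_tokens_per_s": statistics.median(rates),
    }
```

In this notebook it could be called as `benchmark(lambda: send_chat_completion(messages))`. The throughput figure includes prompt processing and network time, so it understates pure decoding speed.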

## Summary

Run all the cells above to test various aspects of your vLLM server:

1. **Server health check** - Verify server is running and get available models
2. **Basic functionality** - Simple chat completion
3. **Multi-turn conversations** - Context awareness
4. **Reasoning capabilities** - Complex problem solving
5. **Tool calling** - Function calling capabilities (if supported)
6. **Temperature effects** - Creativity control
7. **Performance** - Response timing and token usage

The notebook uses:
- **httpx** for the raw `/health` request
- **OpenAI Python client** for listing models and for chat completions
- **Error handling** in `send_chat_completion`, which prints the error and returns `None` on failure
- **Tool calling tests** to check function calling support

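If every cell runs without errors and the outputs look sensible, the server is handling basic chat traffic correctly. Tool calling deserves a closer look: in the run recorded above, `tool_calls` came back empty and the call appeared as raw `<tool_call>` text in `content`, which suggests the server was started without a matching tool-call parser (see vLLM's `--enable-auto-tool-choice` and `--tool-call-parser` options).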