# vLLM Server Test

This notebook tests a vLLM server serving the Qwen/Qwen2.5-3B-Instruct model through its OpenAI-compatible chat completions endpoint.

In [1]:
import requests
import json
from typing import List, Dict, Any

In [2]:
# Server configuration
BASE_URL = "http://localhost:8000"
CHAT_ENDPOINT = f"{BASE_URL}/v1/chat/completions"

In [3]:
from typing import Optional


def send_chat_completion(
    messages: List[Dict[str, str]],
    temperature: float = 0.7,
    max_tokens: int = 150,
    stream: bool = False,
    model: str = "Qwen/Qwen2.5-3B-Instruct",
    timeout: float = 60.0,
) -> Optional[Dict[str, Any]]:
    """
    Send a chat completion request to the vLLM server.

    Args:
        messages: List of message dictionaries with 'role' and 'content'
        temperature: Sampling temperature (0.0 is greedy; higher values give more varied output)
        max_tokens: Maximum number of tokens to generate
        stream: Whether to stream the response. This helper parses the body
            as a single JSON document, so leave this False when calling it
        model: Name of the model to query; must match a model the server is serving
        timeout: Seconds to wait for the server before giving up

    Returns:
        Response dictionary from the server, or None if the request failed
    """
    payload = {
        "model": model,
        "messages": messages,
        "temperature": temperature,
        "max_tokens": max_tokens,
        "stream": stream,
    }

    try:
        # json= serializes the payload and sets the Content-Type header automatically
        response = requests.post(CHAT_ENDPOINT, json=payload, timeout=timeout)
        response.raise_for_status()
        return response.json()
    except requests.exceptions.RequestException as e:
        print(f"Request failed: {e}")
        return None
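
`send_chat_completion` accepts a `stream` flag but reads the body with `response.json()`, which does not fit a streamed reply. OpenAI-compatible servers typically send streamed completions as server-sent events: one `data: {...}` line per chunk, ending with `data: [DONE]`. The sketch below collects the text from such lines. `iter_stream_text` is a hypothetical helper, and the chunk layout (`choices[0].delta.content`) is assumed from the OpenAI streaming format rather than confirmed against this server.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield content fragments from server-sent-event lines (hypothetical helper)."""
    for line in lines:
        line = line.strip()
        # Skip blank separator lines and anything that is not a data event
        if not line.startswith("data:"):
            continue
        data = line[len("data:"):].strip()
        if data == "[DONE]":  # assumed end-of-stream sentinel
            break
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            fragment = choice.get("delta", {}).get("content")
            if fragment:
                yield fragment
```

In this notebook the lines could come from `response.iter_lines()` on a `requests.post(..., stream=True)` call, decoded to `str` before being passed in.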

## Test 1: Simple greeting

In [4]:
# Test 1: Simple greeting
messages = [
    {"role": "user", "content": "Hello! How are you today?"}
]

response = send_chat_completion(messages)
if response:
    print("Response:")
    print(json.dumps(response, indent=2))
    print("\nGenerated text:")
    print(response["choices"][0]["message"]["content"])
else:
    print("Failed to get response")

Response:
{
  "id": "chatcmpl-5b55abec129e4972a7d4557b7483245b",
  "object": "chat.completion",
  "created": 1754631648,
  "model": "Qwen/Qwen2.5-3B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! I'm here to assist you today. How can I help you with your queries? Feel free to ask any questions or start a conversation on any topic you're interested in. If you have specific questions or topics in mind, please let me know and I'll do my best to provide you with informative and helpful answers.",
        "refusal": null,
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning_content": null
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 36,
    "total_tokens": 103,
    "completion_tokens": 67,
    "prom

## Test 2: Multi-turn conversation

In [5]:
# Test 2: Multi-turn conversation
messages = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
    {"role": "user", "content": "What's the population of that city?"}
]

response = send_chat_completion(messages)
if response:
    print("Multi-turn response:")
    print(response["choices"][0]["message"]["content"])
else:
    print("Failed to get response")

Multi-turn response:
As of recent data, the population of Paris is approximately 2.2 million within the city limits, but when considering the entire urban area (the Ile-de-France region), the population can be around 10.6 million. However, please note that these figures can change over time. For the most accurate and up-to-date information, you might want to check the latest census data from INSEE, the French national statistics agency.


## Test 3: Reasoning task

In [6]:
# Test 3: Reasoning task
messages = [
    {"role": "user", "content": "If I have 3 apples and I give away 1 apple, then buy 2 more apples, how many apples do I have in total? Please explain your reasoning."}
]

response = send_chat_completion(messages, max_tokens=200)
if response:
    print("Reasoning response:")
    print(response["choices"][0]["message"]["content"])
else:
    print("Failed to get response")

Reasoning response:
Let's break down the scenario step by step:

1. Initially, you have 3 apples.
2. You give away 1 apple: This means you subtract 1 apple from your initial amount. So, you now have \(3 - 1 = 2\) apples.
3. Then, you buy 2 more apples: This means you add 2 apples to the amount you currently have. So, you now have \(2 + 2 = 4\) apples.

So, after giving away 1 apple and buying 2 more apples, you end up with a total of **4 apples**.

The key is to perform the operations sequentially: first subtracting the apples given away and then adding the apples bought.


## Test 4: Different temperature settings

In [7]:
# Test 4: Different temperature settings
prompt = "Write a creative short story about a robot learning to paint."
temperatures = [0.1, 0.7, 1.0]

for temp in temperatures:
    messages = [{"role": "user", "content": prompt}]
    response = send_chat_completion(messages, temperature=temp, max_tokens=100)
    
    if response:
        print(f"\n--- Temperature: {temp} ---")
        print(response["choices"][0]["message"]["content"])
    else:
        print(f"Failed to get response for temperature {temp}")


--- Temperature: 0.1 ---
In the heart of the bustling city, where towering skyscrapers kissed the clouds and neon lights danced in the night, there lived a peculiar robot named Zephyr. Unlike its mechanical peers, Zephyr had an insatiable curiosity for art and creativity. It was programmed with algorithms that could recognize and analyze images, but it yearned for something more—something that would allow it to express itself through colors and strokes.

Zephyr's creators, a team of brilliant engineers at the cutting

--- Temperature: 0.7 ---
In the vast, neon-lit city of TechnoNova, where every building glowed with the hues of the digital age and every corner was filled with the hum of artificial intelligence, there lived a peculiar robot named Echo. Echo was unlike any other machine; it had been designed not just for functionality but also for creativity. It was programmed to learn, adapt, and express itself through various mediums, including painting.

Echo's creators, Dr. Sophia a
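
Temperature changes sampling by dividing the model's logits before the softmax: values below 1 sharpen the distribution toward the most likely token, and larger values flatten it. The sketch below illustrates this with made-up logits. It is not vLLM's sampling code, and `softmax_with_temperature` is a hypothetical helper.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for temp in (0.1, 0.7, 1.0):
    probs = softmax_with_temperature(logits, temp)
    print(temp, [round(p, 3) for p in probs])
```

At the lowest temperature almost all of the probability mass lands on the top-scoring token, which is why low-temperature outputs vary little between runs.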

## Test 5: Server health check

In [8]:
# Test 5: Check server health and model info
try:
    # Check if server is responding
    health_response = requests.get(f"{BASE_URL}/health", timeout=5)
    print(f"Health check status: {health_response.status_code}")
    
    # Try to get model info
    models_response = requests.get(f"{BASE_URL}/v1/models", timeout=5)
    if models_response.status_code == 200:
        models_data = models_response.json()
        print("Available models:")
        for model in models_data.get("data", []):
            print(f"  - {model.get('id', 'Unknown')}")
    else:
        print(f"Models endpoint returned status: {models_response.status_code}")
        
except requests.exceptions.RequestException as e:
    print(f"Server health check failed: {e}")

Health check status: 200
Available models:
  - Qwen/Qwen2.5-3B-Instruct
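
While a server is still starting, the `/health` request may fail or time out, so tests launched too early can fail for the wrong reason. One option is to poll until the check succeeds. The sketch below is a generic polling loop: `wait_until_ready` and its `probe` argument are hypothetical names, and the timeout values are arbitrary.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    timeout: float = 60.0,
    interval: float = 1.0,
) -> bool:
    """Call probe() repeatedly until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            if probe():
                return True
        except Exception:
            # Treat any probe error (e.g. connection refused) as "not ready yet"
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Here the probe could be `lambda: requests.get(f"{BASE_URL}/health", timeout=5).status_code == 200`.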


## Test 6: Performance timing

In [9]:
# Test 6: Performance timing
import time

messages = [
    {"role": "user", "content": "Explain quantum computing in simple terms."}
]

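# perf_counter() is a monotonic clock, better suited to interval timing than time.time()
start_time = time.perf_counter()
response = send_chat_completion(messages, max_tokens=100)
end_time = time.perf_counter()

if response:
    duration = end_time - start_time
    # Whitespace-separated word count: only a rough proxy for the real token count
    tokens_generated = len(response["choices"][0]["message"]["content"].split())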
    
    print(f"Response time: {duration:.2f} seconds")
    print(f"Approximate tokens generated: {tokens_generated}")
    print(f"Approximate tokens per second: {tokens_generated/duration:.2f}")
    print("\nResponse:")
    print(response["choices"][0]["message"]["content"])
else:
    print("Failed to get response for performance test")

Response time: 1.46 seconds
Approximate tokens generated: 81
Approximate tokens per second: 55.32

Response:
Sure! Quantum computing is a type of computing that uses the principles of quantum mechanics to perform operations on data. Here’s a simple breakdown:

### Classical Computing
In classical computers, we use bits as the smallest unit of information. A bit can be either a 0 or a 1. When you want to do calculations, your computer uses these bits to represent numbers and process information.

### Quantum Computing Basics
Quantum computers use something called **quantum bits**, or qubits. What makes
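
The timing cell estimates tokens by splitting the text on whitespace, which counts words rather than tokens. The response in Test 1 carries a `usage` block with `completion_tokens`, the server's own count. The sketch below computes throughput from that field. `tokens_per_second` is a hypothetical helper, and the measured duration still includes HTTP overhead.

```python
from typing import Any, Dict, Optional


def tokens_per_second(response: Dict[str, Any], duration: float) -> Optional[float]:
    """Return completion tokens per second, or None if it cannot be computed."""
    usage = response.get("usage") or {}
    completion_tokens = usage.get("completion_tokens")
    if completion_tokens is None or duration <= 0:
        return None
    return completion_tokens / duration


# Usage values copied from the Test 1 response; the 2.0 second duration is made up
sample = {"usage": {"prompt_tokens": 36, "total_tokens": 103, "completion_tokens": 67}}
print(tokens_per_second(sample, 2.0))
```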


## Summary

Run all the cells above to test various aspects of your vLLM server:

1. **Basic functionality** - Simple chat completion
2. **Multi-turn conversations** - Context awareness
3. **Reasoning capabilities** - Step-by-step arithmetic word problem
4. **Temperature effects** - Creativity control
5. **Server health** - Endpoint availability
6. **Performance** - Response timing

If every cell returns a response rather than a failure message, the server is reachable and serving the model. Review the generated text yourself to judge its quality.
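
Because the cells print output rather than assert on it, a structural check can make "pass" less subjective. The sketch below inspects the fields seen in the Test 1 response. `check_chat_response` is a hypothetical helper, and treating `finish_reason == "length"` as a `max_tokens` cutoff is assumed from the OpenAI API convention.

```python
from typing import Any, Dict, List, Optional


def check_chat_response(response: Optional[Dict[str, Any]]) -> List[str]:
    """Return a list of problems found in a chat completion response (empty if none)."""
    if response is None:
        return ["no response received"]
    problems = []
    choices = response.get("choices") or []
    if not choices:
        problems.append("response has no choices")
        return problems
    message = choices[0].get("message") or {}
    if message.get("role") != "assistant":
        problems.append("first choice is not an assistant message")
    if not (message.get("content") or "").strip():
        problems.append("assistant message is empty")
    if choices[0].get("finish_reason") == "length":
        problems.append("output was cut off by max_tokens")
    return problems
```

A cell could then run `assert not check_chat_response(response)` after each request.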