# vLLM Server Test

This notebook tests a vLLM server running the Llama-3.1-8B-Instruct model.

In [1]:
import requests
import json
from typing import Any, Dict, List, Optional

In [2]:
# Server configuration
BASE_URL = "http://localhost:8000"
CHAT_ENDPOINT = f"{BASE_URL}/v1/chat/completions"

In [3]:
def send_chat_completion(
    messages: List[Dict[str, str]],
    temperature: float = 0.7,
    max_tokens: int = 150,
    stream: bool = False,
) -> Optional[Dict[str, Any]]:
    """
    Send a chat completion request to the vLLM server.

    Args:
        messages: List of message dictionaries with 'role' and 'content'
        temperature: Sampling temperature (0.0 is greedy; higher values are more random)
        max_tokens: Maximum number of tokens to generate
        stream: Whether to stream the response. This helper parses the body
            as a single JSON document, so leave this False here.

    Returns:
        Response dictionary from the server, or None if the request failed
    """
    payload = {
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": messages,
        "temperature": temperature,
        "max_tokens": max_tokens,
        "stream": stream,
    }

    try:
        # requests sets the JSON Content-Type header itself when json= is used;
        # the timeout keeps the cell from hanging if the server is unreachable.
        response = requests.post(CHAT_ENDPOINT, json=payload, timeout=60)
        response.raise_for_status()
        return response.json()
    except requests.exceptions.RequestException as e:
        print(f"Request failed: {e}")
        return None
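
The helper above accepts a `stream` flag but reads the body as one JSON document, so it cannot consume a streamed reply. The sketch below shows one way to pull text out of a stream, assuming the server emits OpenAI-style server-sent events (`data: {...}` lines ending with `data: [DONE]`). The function name `iter_stream_text` and the canned `sample_lines` are hypothetical and stand in for a live connection.

```python
import json
from typing import Iterable, Iterator, Union


def iter_stream_text(lines: Iterable[Union[str, bytes]]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style server-sent event lines."""
    for line in lines:
        if isinstance(line, bytes):
            line = line.decode("utf-8")
        line = line.strip()
        # Skip blank separator lines and anything that is not a data event.
        if not line.startswith("data:"):
            continue
        data = line[len("data:"):].strip()
        if data == "[DONE]":  # end-of-stream sentinel
            break
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Canned lines standing in for a live stream (illustrative, not recorded output).
sample_lines = [
    'data: {"choices": [{"index": 0, "delta": {"role": "assistant", "content": ""}}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {"content": "Hello"}}]}',
    'data: {"choices": [{"index": 0, "delta": {"content": " there"}}]}',
    "data: [DONE]",
]
streamed_text = "".join(iter_stream_text(sample_lines))
print(streamed_text)
```

Against a running server, the same function could be fed `response.iter_lines()` from a `requests.post(..., stream=True)` call whose payload sets `"stream": True`.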

## Test 1: Simple greeting

In [5]:
# Test 1: Simple greeting
messages = [
    {"role": "user", "content": "Hello! How are you today?"}
]

response = send_chat_completion(messages)
if response:
    print("Response:")
    print(json.dumps(response, indent=2))
    print("\nGenerated text:")
    print(response["choices"][0]["message"]["content"])
else:
    print("Failed to get response")

Response:
{
  "id": "chatcmpl-30b8d58b4cdc4f10856f414eaccb5947",
  "object": "chat.completion",
  "created": 1754628410,
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "I'm doing well, thank you for asking. I'm a large language model, so I don't have emotions like humans do, but I'm here and ready to help with any questions or tasks you may have. How can I assist you today?",
        "refusal": null,
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning_content": null
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 42,
    "total_tokens": 94,
    "completion_tokens": 52,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null,
  "kv_transfer_params": null
}

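Generated text:
I'm doing well, thank you for asking. I'm a large language model, so I don't have emotions like humans do, but I'm here and ready to help with any questions or tasks you may have. How can I assist you today?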
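
Most fields in the full response are null for a plain chat request. As a rough sketch, a small helper can reduce it to the parts that usually matter: the text, the finish reason, and the token counts. The name `summarize_response` is hypothetical, and `sample_response` is a trimmed stand-in for the response above with shortened content.

```python
from typing import Any, Dict


def summarize_response(response: Dict[str, Any]) -> Dict[str, Any]:
    """Pick out the generated text, finish reason, and token usage."""
    choice = response["choices"][0]
    usage = response.get("usage") or {}
    return {
        "text": choice["message"]["content"],
        "finish_reason": choice.get("finish_reason"),
        "prompt_tokens": usage.get("prompt_tokens"),
        "completion_tokens": usage.get("completion_tokens"),
        "total_tokens": usage.get("total_tokens"),
    }


# Trimmed stand-in for the response shown above (content shortened).
sample_response = {
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "I'm doing well, thank you for asking."},
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 42, "total_tokens": 94, "completion_tokens": 52},
}

summary = summarize_response(sample_response)
print(summary)
```

In the recorded response, `prompt_tokens` (42) plus `completion_tokens` (52) equals `total_tokens` (94), which is a quick consistency check on the usage block.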

## Test 2: Multi-turn conversation

In [6]:
# Test 2: Multi-turn conversation
messages = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
    {"role": "user", "content": "What's the population of that city?"}
]

response = send_chat_completion(messages)
if response:
    print("Multi-turn response:")
    print(response["choices"][0]["message"]["content"])
else:
    print("Failed to get response")

Multi-turn response:
The population of Paris, the capital city of France, is approximately 2.1 million people within the city limits. However, the larger metropolitan area of Paris, also known as the Île-de-France region, has a population of around 12.2 million people.
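
The chat completions endpoint is stateless: the model sees "that city" as Paris only because the earlier turns are sent again in `messages`. A minimal sketch of keeping that history up to date between calls follows. The helper name `extend_conversation` is hypothetical.

```python
from typing import Dict, List

Message = Dict[str, str]


def extend_conversation(
    history: List[Message],
    assistant_reply: str,
    next_user_message: str,
) -> List[Message]:
    """Return a new history with the assistant reply and the next user turn appended."""
    return history + [
        {"role": "assistant", "content": assistant_reply},
        {"role": "user", "content": next_user_message},
    ]


history = [{"role": "user", "content": "What is the capital of France?"}]
history = extend_conversation(
    history,
    "The capital of France is Paris.",
    "What's the population of that city?",
)
for message in history:
    print(f"{message['role']}: {message['content']}")
```

The function returns a new list rather than mutating its argument, so an earlier history can be reused to branch the conversation.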


## Test 3: Reasoning task

In [7]:
# Test 3: Reasoning task
messages = [
    {"role": "user", "content": "If I have 3 apples and I give away 1 apple, then buy 2 more apples, how many apples do I have in total? Please explain your reasoning."}
]

response = send_chat_completion(messages, max_tokens=200)
if response:
    print("Reasoning response:")
    print(response["choices"][0]["message"]["content"])
else:
    print("Failed to get response")

Reasoning response:
Let's break down the problem step by step:

1. You start with 3 apples.
2. You give away 1 apple. Now you have 3 - 1 = 2 apples.
3. You buy 2 more apples. Now you have 2 (remaining apples) + 2 (new apples) = 4 apples.

So, in total, you have 4 apples.


## Test 4: Different temperature settings

In [8]:
# Test 4: Different temperature settings
prompt = "Write a creative short story about a robot learning to paint."
temperatures = [0.1, 0.7, 1.0]

for temp in temperatures:
    messages = [{"role": "user", "content": prompt}]
    response = send_chat_completion(messages, temperature=temp, max_tokens=100)
    
    if response:
        print(f"\n--- Temperature: {temp} ---")
        print(response["choices"][0]["message"]["content"])
    else:
        print(f"Failed to get response for temperature {temp}")


--- Temperature: 0.1 ---
**The Brush of Creation**

In a world where technology and art coexisted in perfect harmony, a brilliant inventor, Dr. Rachel Kim, had a vision to create a robot that could bring joy and beauty to the world through the art of painting. She named her creation "Aurora," a sleek and agile robot with a mind capable of learning and adapting at an exponential rate.

Aurora's journey began in a state-of-the-art studio, surrounded by an array of paints, brushes

--- Temperature: 0.7 ---
**The Brushstrokes of a New Mind**

In a world where art and technology collided, a brilliant inventor, Dr. Rachel Kim, stood before her latest creation: a robot designed to learn the art of painting. The robot, named Nova, stood tall, its slender metal body adorned with a canvas easel and a palette of vibrant colors.

Nova's creator had imbued the robot with an advanced artificial intelligence, allowing it to process and analyze data at an incredible rate. The goal was for

--- Temper
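
The stories above stop mid-sentence, which is consistent with hitting the `max_tokens=100` limit. In OpenAI-compatible responses, a `finish_reason` of `"length"` indicates the token limit cut the generation short, while `"stop"` (as in the Test 1 response) indicates a natural end. The sketch below checks for that. The helper name `was_truncated` and both sample dictionaries are hypothetical.

```python
from typing import Any, Dict


def was_truncated(response: Dict[str, Any]) -> bool:
    """Return True if the first choice ended because it hit the token limit."""
    return response["choices"][0].get("finish_reason") == "length"


# Minimal illustrative responses, not recorded output.
cut_off = {"choices": [{"message": {"content": "Aurora's journey began"}, "finish_reason": "length"}]}
complete = {"choices": [{"message": {"content": "The end."}, "finish_reason": "stop"}]}

print(was_truncated(cut_off))
print(was_truncated(complete))
```

When this returns True, raising `max_tokens` or asking for a shorter story in the prompt are the usual remedies.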

## Test 5: Server health check

In [9]:
# Test 5: Check server health and model info
try:
    # Check if server is responding
    health_response = requests.get(f"{BASE_URL}/health", timeout=5)
    print(f"Health check status: {health_response.status_code}")
    
    # Try to get model info
    models_response = requests.get(f"{BASE_URL}/v1/models", timeout=5)
    if models_response.status_code == 200:
        models_data = models_response.json()
        print("Available models:")
        for model in models_data.get("data", []):
            print(f"  - {model.get('id', 'Unknown')}")
    else:
        print(f"Models endpoint returned status: {models_response.status_code}")
        
except requests.exceptions.RequestException as e:
    print(f"Server health check failed: {e}")

Health check status: 200
Available models:
  - meta-llama/Llama-3.1-8B-Instruct
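
Loading the model can take a while, so a single health check may fail right after the server starts. The sketch below polls until a check passes or a deadline expires. The name `wait_until_ready` and its defaults are assumptions, and the check is passed in as a callable so the example runs without a server.

```python
import time
from typing import Callable


def wait_until_ready(
    check: Callable[[], bool],
    timeout: float = 60.0,
    interval: float = 1.0,
) -> bool:
    """Call check() repeatedly until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            if check():
                return True
        except Exception:
            # Treat any error (for example, connection refused) as "not ready yet".
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Stand-in check that succeeds on its third call.
attempts = {"count": 0}


def fake_check() -> bool:
    attempts["count"] += 1
    return attempts["count"] >= 3


ready = wait_until_ready(fake_check, timeout=5.0, interval=0.01)
print(ready, attempts["count"])
```

With the notebook's setup, the check could be `lambda: requests.get(f"{BASE_URL}/health", timeout=5).status_code == 200`.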


## Test 6: Performance timing

In [10]:
# Test 6: Performance timing
import time

messages = [
    {"role": "user", "content": "Explain quantum computing in simple terms."}
]

# perf_counter is a monotonic clock intended for measuring elapsed time
start_time = time.perf_counter()
response = send_chat_completion(messages, max_tokens=100)
end_time = time.perf_counter()

if response:
    duration = end_time - start_time
    # Whitespace-separated words are only a rough proxy for tokens;
    # response["usage"]["completion_tokens"] holds the server's exact count.
    tokens_generated = len(response["choices"][0]["message"]["content"].split())
    
    print(f"Response time: {duration:.2f} seconds")
    print(f"Approximate tokens generated: {tokens_generated}")
    print(f"Approximate tokens per second: {tokens_generated/duration:.2f}")
    print("\nResponse:")
    print(response["choices"][0]["message"]["content"])
else:
    print("Failed to get response for performance test")

Response time: 1.20 seconds
Approximate tokens generated: 70
Approximate tokens per second: 58.19

Response:
**What is Quantum Computing?**

Quantum computing is a new way of processing information that's different from the classical computers we use today. While classical computers use "bits" to store and process information, quantum computers use "qubits" (quantum bits).

**Classical Computers vs. Quantum Computers**

Classical computers use "bits" that can only be in one of two states: 0 or 1. It's like a light switch that's either on (1) or off
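
The figure above counts whitespace-separated words, which undercounts tokens. The response's `usage.completion_tokens` field (visible in the Test 1 output) gives the server's own count. A minimal sketch of a throughput calculation based on it follows. The helper name `tokens_per_second` and the sample numbers are hypothetical, not measurements.

```python
from typing import Any, Dict


def tokens_per_second(response: Dict[str, Any], duration_seconds: float) -> float:
    """Compute generated tokens per second from the server-reported usage."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    return response["usage"]["completion_tokens"] / duration_seconds


# Illustrative values only: 100 completion tokens in 2.0 seconds.
sample_response = {"usage": {"prompt_tokens": 42, "completion_tokens": 100, "total_tokens": 142}}
rate = tokens_per_second(sample_response, 2.0)
print(f"{rate:.2f} tokens/s")
```

Because the duration is measured around the whole HTTP call, the result is end-to-end throughput that includes prompt processing and network time, not pure decode speed.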


## Summary

Run all the cells above to exercise different aspects of your vLLM server:

1. **Basic functionality** - Simple chat completion
2. **Multi-turn conversations** - Context awareness
3. **Reasoning capabilities** - Step-by-step arithmetic
4. **Temperature effects** - Creativity control
5. **Server health** - Endpoint availability
6. **Performance** - Response timing

If every cell returns a response without errors, the server is reachable and serving the model. The cells print their output rather than assert on it, so review the generated text yourself.
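
To turn the cells into checks with an explicit pass or fail result, each one can be wrapped in a function that returns a boolean. The sketch below is one possible runner. The name `run_checks` and the demo checks are hypothetical; real checks would call the server.

```python
from typing import Callable, Dict


def run_checks(checks: Dict[str, Callable[[], bool]]) -> Dict[str, bool]:
    """Run each named check and record whether it passed."""
    results: Dict[str, bool] = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception as exc:
            # A check that raises counts as a failure.
            print(f"{name}: error ({exc})")
            results[name] = False
    return results


# Demo checks standing in for real server calls.
demo_results = run_checks(
    {
        "always_passes": lambda: True,
        "always_fails": lambda: False,
        "raises": lambda: 1 / 0,
    }
)
all_passed = all(demo_results.values())
print(demo_results)
print("All checks passed" if all_passed else "Some checks failed")
```

A real check might be as simple as `lambda: send_chat_completion(messages) is not None`.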