# Client Notebook for Testing a vLLM Server

This notebook allows you to **send requests to a local or remote vLLM server** (compatible with the OpenAI API format) and test its performance and capabilities.

You will learn to:
- Send a chat completion as a raw HTTP request.
- Run blocking and streaming completions with the `openai` library.
- Benchmark the server's latency and throughput under concurrent load.

![server vLLM](images/schema_client_server.png)

## ⚙️ Initial Setup

Before we can send requests, we need to configure a few key parameters. This section sets up the server connection details and the prompt we'll use for testing.

### 1. Server Connection Details

The notebook reaches the vLLM server over the network. Set `PUBLIC_IP` to the public IP address of the EC2 instance running the server, or to `localhost` if the server runs on this machine. If the server was launched on a port other than the default, update `PORT` as well.

In [None]:
# The public IP address of the EC2 instance running the vLLM server.
# ➡️ ACTION: Replace the placeholder with your actual public IP.
PUBLIC_IP = "XYZ"  # TODO: Replace with your actual public IP

# The port the vLLM server is listening on. The default is 8000.
PORT = "8000"

# The model name to be used for the requests.
# This must match the name under which the vLLM server serves the model.
# By default that is the model passed at launch (e.g. 'mistralai/Mistral-7B-v0.2'),
# unless the server was started with --served-model-name.
MODEL_NAME = "mistral-7b"  # TODO: set the right model name

# Construct the base URL for the API endpoints.
URL = f"http://{PUBLIC_IP}:{PORT}"

print(f"Will connect to server at: {URL}")
print(f"Will use model: '{MODEL_NAME}'")
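If you are unsure which value `MODEL_NAME` should take, the OpenAI-compatible API exposes a `/v1/models` endpoint that lists the names the server answers to. The sketch below shows one way to read that list. The helper names `extract_model_ids` and `fetch_model_ids` are hypothetical, and the sample payload is illustrative, not real server output.

```python
import json
import urllib.request


def extract_model_ids(models_payload: dict) -> list:
    """Return the model names from a `/v1/models`-style response body."""
    return [entry["id"] for entry in models_payload.get("data", [])]


def fetch_model_ids(base_url: str, timeout: float = 10.0) -> list:
    """Query `<base_url>/v1/models` and return the served model names."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
        return extract_model_ids(json.load(resp))


# Illustrative payload in the OpenAI "list models" shape (assumed, not real output).
sample_payload = {
    "object": "list",
    "data": [{"id": "mistral-7b", "object": "model"}],
}
print(extract_model_ids(sample_payload))
```

Against a running server, `fetch_model_ids(URL)` should return the names you can copy into `MODEL_NAME`.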

### 2. Define Your Prompt

Now, let's define the prompt we will send to the model. You can change this to any question you like. Keep in mind the capabilities of the model you are using (e.g., a 7B model is powerful, but not at the level of GPT-4).

In [None]:
# You can modify the prompt to ask the model anything you want.
PROMPT = """
Explain the law of supply and demand in economics, with a simple example.
"""

## 1. Completion using a Raw HTTP Request

First, let's interact with the server at a low level using the `requests` library. This helps us understand the raw API structure, which is compliant with the OpenAI Chat Completions API format. We are sending a POST request to the `/v1/chat/completions` endpoint.

In [None]:
import requests
import json

# Define the headers for our HTTP request. We specify the content type is JSON.
headers = {
    "Content-Type": "application/json"
}

# Define the payload (body) of our request.
data = {
    "model": MODEL_NAME,
    "messages": [
        {"role": "user", "content": PROMPT}
    ],
    "max_tokens": 500,
}

# The full URL for the chat completions endpoint.
url_completions = URL + "/v1/chat/completions"

# Send the POST request. Network failures are reported separately from parsing errors.
try:
    response = requests.post(url_completions, headers=headers, json=data, timeout=60)
except requests.exceptions.RequestException as e:
    print("HTTP Request failed:", e)
else:
    print("Status code:", response.status_code)
    print("\n" + "-"*100 + "\n")
    try:
        resp_json = response.json()
        print("Full Response JSON:")
        print(json.dumps(resp_json, indent=2, ensure_ascii=False))
        print("\n" + "-"*100 + "\n")
        print("Generated Content:")
        print(resp_json["choices"][0]["message"]["content"])
    except (ValueError, KeyError, IndexError, TypeError) as e:
        # Raised when the body is not JSON, or is an error payload without "choices".
        print("Error reading the response:", e)
        print("Raw Response Text:", response.text)
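Beyond the generated text, a Chat Completions response body also carries a `finish_reason` and token counts in `usage`. The sketch below pulls those fields out of a response dictionary. `summarize_completion` is a hypothetical helper, and the sample response is hand-written for illustration, so the values are not real model output.

```python
def summarize_completion(resp_json: dict) -> dict:
    """Pull the commonly used fields out of a Chat Completions response body."""
    choice = resp_json["choices"][0]
    usage = resp_json.get("usage") or {}
    return {
        "content": choice["message"]["content"],
        "finish_reason": choice.get("finish_reason"),
        "prompt_tokens": usage.get("prompt_tokens"),
        "completion_tokens": usage.get("completion_tokens"),
    }


# Hand-written example in the Chat Completions response shape (illustrative only).
sample_response = {
    "object": "chat.completion",
    "model": "mistral-7b",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Prices rise when demand exceeds supply."},
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 24, "completion_tokens": 9, "total_tokens": 33},
}

summary = summarize_completion(sample_response)
print(summary["finish_reason"], summary["completion_tokens"])
```

A `finish_reason` of `"length"` indicates the answer was cut off by the `max_tokens` limit, which is worth checking when the generated text ends abruptly.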

## 2. Using the OpenAI Python Library

While using `requests` is good for understanding the underlying API, it's more convenient to use a dedicated library. The `openai` library can be configured to communicate with any OpenAI-compatible API, including our vLLM server.

In [None]:
from openai import OpenAI
import time

# Initialize the OpenAI client.
client = OpenAI(
    base_url=f"{URL}/v1",
    api_key="EMPTY"  # The client requires a value; vLLM only checks it if the server was started with --api-key
)

print("OpenAI client configured to connect to:", client.base_url)

### 🔁 Simple Completion Request

This is the most basic type of request. We send the prompt and wait for the full response to be generated before it's returned to us. This is also known as a blocking request.

In [None]:
print(f"Sending a simple completion request for model '{MODEL_NAME}'...\n")

try:
    response = client.chat.completions.create(
        model=MODEL_NAME,
        messages=[
            {"role": "user", "content": PROMPT}
        ]
    )
    print(response.choices[0].message.content)
except Exception as e:
    print(f"Error during OpenAI completion: {e}")

### 📡 Streaming Completion Request

For a better user experience, especially with long responses, we can stream the output. This means the server sends back the response token by token as it's being generated, rather than waiting for the entire sequence to be complete. This makes the application feel much more responsive.

In [None]:
print(f"Sending a streaming request for model '{MODEL_NAME}'...\n")

try:
    stream = client.chat.completions.create(
        model=MODEL_NAME,
        messages=[{"role": "user", "content": PROMPT}],
        stream=True
    )

    output = ""
    print("--- Streaming Response ---")
    for chunk in stream:
        # Some chunks carry no choices or an empty delta (e.g. the final chunk); skip them.
        if not chunk.choices:
            continue
        content = chunk.choices[0].delta.content
        if content:
            print(content, end="", flush=True)
            output += content
    print("\n--- End of Stream ---")
except Exception as e:
    print(f"Error during streaming completion: {e}")
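At the HTTP level, a streamed response in the OpenAI format arrives as server-sent events: each event is a `data: {...}` line holding a JSON chunk, and the stream ends with `data: [DONE]`. The sketch below parses such lines without the `openai` library. `iter_stream_content` is a hypothetical helper and the sample lines are hand-written, so treat this as an illustration of the wire format, not as the library's implementation.

```python
import json


def iter_stream_content(lines):
    """Yield text fragments from the lines of a streamed chat completion."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not a data event
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Hand-written events in the streaming chunk shape (illustrative only).
sample_lines = [
    'data: {"choices": [{"index": 0, "delta": {"role": "assistant", "content": ""}}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {"content": "Supply"}}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {"content": " and demand"}}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]}',
    "",
    "data: [DONE]",
]

streamed_text = "".join(iter_stream_content(sample_lines))
print(streamed_text)
```

With `requests`, the same lines could be obtained by adding `"stream": True` to the payload, passing `stream=True` to `requests.post`, and iterating over `response.iter_lines(decode_unicode=True)`.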

## 3. ⏱️ Concurrent Request Performance

Finally, let's measure the performance of our vLLM server. We will send multiple requests concurrently to simulate a real-world load and measure the latency of each request. This helps us understand the server's throughput and how well it can handle parallel requests.

We will measure:
- **Latency**: The time it takes to receive the complete response to a single request (in seconds).
- **Throughput**: The number of requests the server completes per second of wall-clock time.

In [None]:
import concurrent.futures
import time
import numpy as np

def single_request(prompt: str):
    """Sends a single request to the LLM and returns its latency in seconds."""
    # perf_counter is a monotonic clock, better suited to measuring intervals than time.time().
    start_time = time.perf_counter()
    try:
        client.chat.completions.create(
            model=MODEL_NAME,
            messages=[{"role": "user", "content": prompt}]
        )
        return time.perf_counter() - start_time
    except Exception as e:
        print(f"An error occurred in a request: {e}")
        return None

def run_benchmark(n_requests: int = 10, prompt: str = PROMPT):
    """Runs a benchmark by sending n_requests concurrently."""
    print(f"Starting benchmark with {n_requests} concurrent requests...\n")
    latencies = []

    # Wall-clock time for the whole batch, used to compute throughput.
    batch_start = time.perf_counter()
    with concurrent.futures.ThreadPoolExecutor(max_workers=n_requests) as executor:
        futures = [executor.submit(single_request, prompt) for _ in range(n_requests)]
        # as_completed yields futures in completion order, so the counter below
        # is the completion rank, not the submission index.
        for i, future in enumerate(concurrent.futures.as_completed(futures)):
            latency = future.result()
            if latency is not None:
                latencies.append(latency)
                print(f"Completion {i+1}/{n_requests}: {latency:.2f}s")
    total_time = time.perf_counter() - batch_start

    if not latencies:
        print("\nNo requests completed successfully.")
        return

    avg_latency = np.mean(latencies)
    min_latency = np.min(latencies)
    max_latency = np.max(latencies)
    throughput = len(latencies) / total_time

    print("\n=== Benchmark Summary ===")
    print(f"Total Successful Requests: {len(latencies)}")
    print(f"Average Latency: {avg_latency:.2f}s")
    print(f"Min Latency: {min_latency:.2f}s")
    print(f"Max Latency: {max_latency:.2f}s")
    print(f"Throughput: {throughput:.2f} requests/s")

# Run the benchmark with 10 requests.
run_benchmark(n_requests=10)
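Averages can hide slow outliers, so load tests often report percentiles as well. The sketch below computes nearest-rank percentiles from a list of latencies using only the standard library. `percentile` is a hypothetical helper, and the sample latencies are made-up values for illustration, not measurements from a real server.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of the data at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank in the sorted list
    return ordered[rank - 1]


# Made-up latencies in seconds (illustrative only).
sample_latencies = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]

p50 = percentile(sample_latencies, 50)
p95 = percentile(sample_latencies, 95)
print(f"p50: {p50:.2f}s, p95: {p95:.2f}s")
```

`np.percentile` interpolates linearly between neighbouring values by default, so it can return slightly different numbers than this nearest-rank version on small samples.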