# vLLM FastAPI API Calls Notebook

This notebook demonstrates how to interact with the FastAPI endpoints for managing vLLM containers. We will:

1. Start a vLLM container with specified parameters
2. Check the container status
3. Stop the container
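The three steps form a start/use/stop lifecycle. If you script the same calls outside the notebook, a context manager can make sure the stop request still goes out when something in between raises. Below is a minimal sketch. `vllm_container` is a hypothetical helper, and the `start`/`stop` callables stand in for the `/start-vllm` and `/stop-vllm` requests. Stand-in lambdas replace real HTTP calls here, so the sketch runs without the server.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def vllm_container(start: Callable[[], None], stop: Callable[[], None]) -> Iterator[None]:
    """Call `start`, yield to the caller, then always call `stop`.

    Hypothetical helper: `start` and `stop` are placeholders for the
    POST requests to /start-vllm and /stop-vllm. If `start` itself
    raises, `stop` is skipped because nothing was started.
    """
    start()
    try:
        yield
    finally:
        # Runs on normal exit and when the `with` body raises,
        # so a failed experiment does not leave the container running.
        stop()


# Usage with stand-in callables instead of real HTTP requests:
calls = []
with vllm_container(lambda: calls.append("start"), lambda: calls.append("stop")):
    calls.append("status")
print(calls)  # ['start', 'status', 'stop']
```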

In [None]:
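import os

import requests

# Base URL for the FastAPI server that manages the vLLM containers
BASE_URL = "http://localhost:7500"

# Model to serve and the Hugging Face token used to download it.
# An empty token is usually fine for public models like this one; gated models need a valid token.
# Reading the token from the environment keeps the secret out of the notebook.
MODEL_NAME = "HuggingFaceTB/SmolLM-135M"
HF_TOKEN = os.environ.get("HF_TOKEN", "")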

## Start vLLM Container

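This cell sends a POST request to the `/start-vllm` endpoint to start a vLLM container for `MODEL_NAME`. The JSON body carries the Hugging Face token, the model name, and a `command` string of extra vLLM engine arguments. Depending on the vLLM version in the container image, some of these flags (for example `--use-v2-block-manager`) may be deprecated.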
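Editing one long flag string makes it easy to drop or mistype an option. One way to toggle individual engine flags is to keep them in a dict and render the string from it. Below is a minimal sketch. `build_command` is a hypothetical helper, not part of the server's API. It assumes the server accepts the same space-separated `--flag value` form used in the payload in the cell below.

```python
def build_command(options: dict) -> str:
    """Render {"flag-name": value} as a space-separated vLLM flag string.

    Hypothetical helper: True emits a bare switch, False or None omits
    the flag, and any other value is rendered as "--flag value".
    """
    parts = []
    for name, value in options.items():
        if value is True:
            parts.append(f"--{name}")
        elif value is False or value is None:
            continue  # flag switched off: leave it out entirely
        else:
            parts.append(f"--{name} {value}")
    return " ".join(parts)


command = build_command({
    "max-num-seqs": 256,
    "max-num-batched-tokens": 16384,
    "gpu-memory-utilization": 0.1,
    "enforce-eager": True,
    "enable-prefix-caching": False,  # dropped from the output
})
print(command)
# --max-num-seqs 256 --max-num-batched-tokens 16384 --gpu-memory-utilization 0.1 --enforce-eager
```

The resulting string can be passed as the `command` value in the payload.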

In [None]:
# Extra vLLM engine arguments, sent to the server as one space-separated string
extra_args = [
    "--max-num-seqs 256",
    "--max-num-batched-tokens 16384",
    "--gpu-memory-utilization 0.1",  # fraction of GPU memory vLLM may use (10%)
    "--swap-space 8",
    "--disable-custom-all-reduce",
    "--use-v2-block-manager",
    "--enable-chunked-prefill",
    "--enable-prefix-caching",
    "--enforce-eager",
]

payload = {
    "hf_token": HF_TOKEN,
    "model_name": MODEL_NAME,
    "command": " ".join(extra_args),
}

start_url = f"{BASE_URL}/start-vllm"
response = requests.post(start_url, json=payload)
print("Status Code:", response.status_code)
print("Response:", response.json())

## Check vLLM Container Status

This cell sends a GET request to the `/status-vllm` endpoint, passing `model_name` as a query parameter, to check the status of the container serving that model.
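Pulling the image and loading the model can take a while. A status check made right after `/start-vllm` may therefore not yet report a ready container. The notebook does not document the status response schema, so readiness is left to a caller-supplied predicate. Below is a minimal polling sketch. `wait_until_ready`, `fetch_status`, and `is_ready` are hypothetical names. The `{"status": "running"}` shape in the usage is an assumed example, not the server's documented format.

```python
import time
from typing import Any, Callable


def wait_until_ready(
    fetch_status: Callable[[], Any],
    is_ready: Callable[[Any], bool],
    timeout_s: float = 300.0,
    interval_s: float = 5.0,
) -> Any:
    """Poll `fetch_status` until `is_ready(result)` is true or the timeout expires.

    Hypothetical helper: `fetch_status` would wrap the GET to /status-vllm,
    and `is_ready` must match whatever the server actually returns.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if is_ready(status):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"not ready after {timeout_s}s; last status: {status!r}")
        time.sleep(interval_s)


# Usage with a fake status source that reports ready on the third poll:
responses = iter([{"status": "created"}, {"status": "created"}, {"status": "running"}])
final = wait_until_ready(
    lambda: next(responses),
    lambda s: s.get("status") == "running",  # assumed field and value
    interval_s=0.01,
)
print(final)  # {'status': 'running'}
```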

In [None]:
status_url = f"{BASE_URL}/status-vllm"
params = {"model_name": MODEL_NAME}
response = requests.get(status_url, params=params)

print("Status Code:", response.status_code)
print("Response:", response.json())

## Stop vLLM Container

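This cell sends a POST request to the `/stop-vllm` endpoint to stop and remove the container. As with the status check, `model_name` travels as a query parameter rather than in a JSON body.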
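You can preview the URL this POST targets offline. The sketch below rebuilds it with the standard library and constructs, but does not send, an equivalent `urllib` request. `requests` should encode `params` the same way, so the `/` in the model name becomes `%2F`. The base URL and model name are copied from the setup cell.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://localhost:7500"
MODEL_NAME = "HuggingFaceTB/SmolLM-135M"

# model_name goes in the query string; urlencode percent-encodes the "/".
stop_url = f"{BASE_URL}/stop-vllm?{urlencode({'model_name': MODEL_NAME})}"
print(stop_url)
# http://localhost:7500/stop-vllm?model_name=HuggingFaceTB%2FSmolLM-135M

# Build (but do not send) a POST with an empty body to the same URL.
req = Request(stop_url, method="POST")
print(req.get_method(), req.full_url)
```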

In [None]:
stop_url = f"{BASE_URL}/stop-vllm"
params = {"model_name": MODEL_NAME}
response = requests.post(stop_url, params=params)

print("Status Code:", response.status_code)
print("Response:", response.json())