Lightweight web GUI for managing vLLM model serving with first-class support for speculative decoding (DFlash, EAGLE, Medusa).
- Model Browser — Search and download models from HuggingFace directly in the UI
- Speculative Config Builder — Visual form that generates valid `--speculative-config` JSON, with presets for DFlash, EAGLE, and Medusa (see the example after this list)
- Custom Flags Panel — Expose advanced vLLM flags (attention backend, GPU memory utilization, tensor parallelism, etc.)
- Chat UI — Stream responses from the running vLLM endpoint via SSE
- Live Logs — Watch vLLM startup logs in real-time
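As a sketch of the kind of flag the config builder emits (the draft model and token count below are placeholders, and the exact JSON fields vary across vLLM versions):

```bash
# Illustrative only: serve a target model with an EAGLE draft model.
# "your-org/your-eagle-draft" is a placeholder; check `vllm serve --help`
# for the exact fields your vLLM version accepts.
vllm serve Qwen/Qwen2.5-0.5B-Instruct \
  --port 8100 \
  --speculative-config '{"method": "eagle", "model": "your-org/your-eagle-draft", "num_speculative_tokens": 5}'
```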
To run SpecServe locally:

```bash
cd specserve
pip install -r requirements.txt
python server.py
```

Open http://localhost:9200 in your browser.
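To confirm the UI server is up before opening the browser, a quick check (this assumes the root route serves the single-page UI):

```bash
# Should print 200 once server.py is listening on port 9200
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:9200
```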
Then, in the UI:

- Search for a model in the left sidebar (e.g., `Qwen2.5-0.5B`)
- Configure speculative decoding — pick a preset or build a custom config
- Set custom flags — port, tensor parallelism, GPU memory utilization, etc.
- Enter the model name in the "Model to Serve" field
- Click Start — vLLM launches as a subprocess
- Chat — send messages and stream responses from the right panel
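The chat panel streams responses over SSE through the FastAPI proxy. You can also query the vLLM subprocess directly on its OpenAI-compatible endpoint (default port 8100); the model name below is illustrative:

```bash
curl http://localhost:8100/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen2.5-0.5B-Instruct",
        "messages": [{"role": "user", "content": "Hello!"}],
        "stream": true
      }'
```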
The available speculative decoding presets:

| Preset | Method | Description |
|---|---|---|
| None | — | Standard autoregressive decoding |
| DFlash | dflash | KV-cache matching with draft model (default: z-lab/Qwen3.6-35B-A3B-DFlash) |
| EAGLE | eagle | Tree-based draft model search |
| Medusa | medusa | Multiple decoding heads that draft several tokens per step |
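For instance, the DFlash preset plausibly maps to a config along these lines (a sketch based on the table above; the target model is a placeholder and field names may differ in your vLLM build):

```bash
# Hypothetical DFlash launch; the draft model is the table's default
vllm serve your-org/your-target-model \
  --speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.6-35B-A3B-DFlash"}'
```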
How the pieces fit together:

```
Browser (port 9200) → FastAPI server → vLLM subprocess (port 8100 default)
        ↕                                        ↕
  HuggingFace API                      /v1/chat/completions
  (huggingface_hub)                    (OpenAI-compatible)
```
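The model browser goes through the Hugging Face Hub. For reference, the public API that `huggingface_hub` wraps can also be queried directly (the search string is illustrative):

```bash
# Search the HF Hub for models matching a query
curl -s "https://huggingface.co/api/models?search=Qwen2.5-0.5B&limit=5"
```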
Repository layout:

- `server.py` — FastAPI backend: model browser, vLLM lifecycle, chat proxy
- `index.html` — Single-page frontend with Tailwind CSS
- `requirements.txt` — Python dependencies
- `logs/` — vLLM log files (created on first start)
Tips for DGX Spark deployments:

- Use `--tensor-parallel-size 1` for single-Spark deployments (see the sketch after this list)
- For 2-Spark clusters, use QSFP IPs and `--tensor-parallel-size 2`; add Ray/distributed flags manually until SpecServe grows first-class cluster launch support
- Per NVIDIA's DGX Spark playbooks, omit the `--gpu-memory-utilization` flag for tight 70B+ setups; the default is safest when unsure
- FP8 is the safest general-purpose format on GB10. NVIDIA-published NVFP4 models can work in NVIDIA's validated NGC vLLM containers; third-party FP4 needs testing first.
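Putting the single-Spark tips together, a minimal launch sketch (the model name is illustrative; note there is no explicit `--gpu-memory-utilization`):

```bash
# Single-Spark: TP=1, rely on vLLM's default GPU memory utilization
vllm serve Qwen/Qwen2.5-7B-Instruct \
  --tensor-parallel-size 1 \
  --port 8100
```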