SpecServe

Lightweight web GUI for managing vLLM model serving with first-class support for speculative decoding (DFlash, EAGLE, Medusa).

Features

  • Model Browser — Search and download models from HuggingFace directly in the UI
  • Speculative Config Builder — Visual form that generates valid --speculative-config JSON with presets for DFlash, EAGLE, and Medusa
  • Custom Flags Panel — Expose advanced vLLM flags (attention backend, GPU memory utilization, tensor parallelism, etc.); see the example command after this list
  • Chat UI — Stream responses from the running vLLM endpoint via SSE
  • Live Logs — Watch vLLM startup logs in real time
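
Under the hood, Start translates these panel settings into a vLLM launch. As a rough illustration only (the flag names and the VLLM_ATTENTION_BACKEND variable come from vLLM's standard CLI; the exact command SpecServe assembles may differ, and the model name is just an example):

# Hypothetical equivalent of a SpecServe-launched vLLM process
VLLM_ATTENTION_BACKEND=FLASHINFER \
vllm serve Qwen/Qwen2.5-0.5B-Instruct \
  --port 8100 \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.90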

Quick Start

cd specserve
pip install -r requirements.txt
python server.py

Open http://localhost:9200 in your browser.

Usage

  1. Search for a model in the left sidebar (e.g., Qwen2.5-0.5B)
  2. Configure speculative decoding — pick a preset or build custom config
  3. Set custom flags — port, tensor parallelism, GPU memory utilization, etc.
  4. Enter the model name in the "Model to Serve" field
  5. Click Start — vLLM launches as a subprocess
  6. Chat — send messages and stream responses from the right panel (see the example request below)
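
Step 6 goes through SpecServe's chat proxy, but the underlying vLLM endpoint is OpenAI-compatible, so you can also exercise it directly. A minimal streaming request (the model name is a placeholder for whatever you served; 8100 assumes the default port):

curl -N http://localhost:8100/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "<served-model>", "messages": [{"role": "user", "content": "Hello"}], "stream": true}'

With "stream": true the server replies with SSE data: chunks, which is what the chat panel consumes.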

Speculative Decoding Presets

Preset  Method  Description
None    (none)  Standard autoregressive decoding
DFlash  dflash  KV-cache matching with draft model (default: z-lab/Qwen3.6-35B-A3B-DFlash)
EAGLE   eagle   Tree-based draft model search
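
For orientation, the EAGLE preset maps onto a --speculative-config payload along these lines (key names follow recent vLLM releases; both model names are placeholders, and the exact JSON the builder emits may differ):

vllm serve <target-model> \
  --port 8100 \
  --speculative-config '{"method": "eagle", "model": "<eagle-draft-model>", "num_speculative_tokens": 5}'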

Architecture

Browser (port 9200) → FastAPI server → vLLM subprocess (port 8100 default)
                              ↕                    ↕
                     HuggingFace API          /v1/chat/completions
                     huggingface_hub         (OpenAI-compatible)

Files

  • server.py — FastAPI backend: model browser, vLLM lifecycle, chat proxy
  • index.html — Single-page frontend with Tailwind CSS
  • requirements.txt — Python dependencies
  • logs/ — vLLM log files (created on first start)

Notes for DGX Spark

  • Use --tensor-parallel-size 1 for single-Spark deployments
  • For 2-Spark clusters, use the QSFP interface IPs and --tensor-parallel-size 2; add Ray/distributed flags manually until SpecServe gains first-class cluster-launch support (see the sketch after this list)
  • Reference: NVIDIA's DGX Spark playbooks leave --gpu-memory-utilization unset even in their tight 70B+ examples; the default is safest when unsure
  • FP8 is the safest general-purpose format on GB10. NVIDIA-published NVFP4 models can work in NVIDIA's validated NGC vLLM containers; third-party FP4 needs testing first.
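
A rough sketch of the manual 2-Spark flow mentioned above, using standard Ray and vLLM commands (the IP and model are placeholders; exact steps vary by vLLM version and network setup):

# On the head Spark, bind Ray to the QSFP interface
ray start --head --node-ip-address=<head-qsfp-ip> --port=6379

# On the second Spark, join the cluster over QSFP
ray start --address=<head-qsfp-ip>:6379

# Back on the head: one GPU per Spark, tensor parallelism across both
vllm serve <model> --tensor-parallel-size 2 --distributed-executor-backend ray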
