SpecServe

Lightweight web GUI for managing vLLM model serving with first-class support for speculative decoding (DFlash, EAGLE, Medusa).

Features

  • Model Browser — Search and download models from HuggingFace directly in the UI
  • Speculative Config Builder — Visual form that generates valid --speculative-config JSON with presets for DFlash, EAGLE, and Medusa
  • Custom Flags Panel — Expose advanced vLLM flags (attention backend, GPU memory utilization, tensor parallelism, etc.); see the example command after this list
  • Chat UI — Stream responses from the running vLLM endpoint via SSE
  • Live Logs — Watch vLLM startup logs in real time
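
Under the hood, Start translates these panel settings into a vLLM launch. As a rough illustration only (the flag names and the VLLM_ATTENTION_BACKEND variable come from vLLM's standard CLI; the exact command SpecServe assembles may differ, and the model name is just an example):

# Hypothetical equivalent of a SpecServe-launched vLLM process
VLLM_ATTENTION_BACKEND=FLASHINFER \
vllm serve Qwen/Qwen2.5-0.5B-Instruct \
  --port 8100 \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.90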

Quick Start

cd specserve
pip install -r requirements.txt
python server.py

Open http://localhost:9200 in your browser.

Usage

  1. Search for a model in the left sidebar (e.g., Qwen2.5-0.5B)
  2. Configure speculative decoding — pick a preset or build custom config
  3. Set custom flags — port, tensor parallelism, GPU memory utilization, etc.
  4. Enter the model name in the "Model to Serve" field
  5. Click Start — vLLM launches as a subprocess
  6. Chat — send messages and stream responses from the right panel (see the example request below)
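
Step 6 goes through SpecServe's chat proxy, but the underlying vLLM endpoint is OpenAI-compatible, so you can also exercise it directly. A minimal streaming request (the model name is a placeholder for whatever you served; 8100 assumes the default port):

curl -N http://localhost:8100/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "<served-model>", "messages": [{"role": "user", "content": "Hello"}], "stream": true}'

With "stream": true the server replies with SSE data: chunks, which is what the chat panel consumes.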

Speculative Decoding Presets

Preset  Method  Description
None    (none)  Standard autoregressive decoding
DFlash  dflash  KV-cache matching with draft model (default: z-lab/Qwen3.6-35B-A3B-DFlash)
EAGLE   eagle   Tree-based draft model search
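
For orientation, the EAGLE preset maps onto a --speculative-config payload along these lines (key names follow recent vLLM releases; both model names are placeholders, and the exact JSON the builder emits may differ):

vllm serve <target-model> \
  --port 8100 \
  --speculative-config '{"method": "eagle", "model": "<eagle-draft-model>", "num_speculative_tokens": 5}'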

Architecture

Browser (port 9200) → FastAPI server → vLLM subprocess (port 8100 default)
                              ↕                    ↕
                     HuggingFace API          /v1/chat/completions
                     huggingface_hub         (OpenAI-compatible)

Files

  • server.py — FastAPI backend: model browser, vLLM lifecycle, chat proxy
  • index.html — Single-page frontend with Tailwind CSS
  • requirements.txt — Python dependencies
  • logs/ — vLLM log files (created on first start)

Notes for DGX Spark

  • Use --tensor-parallel-size 1 for single-Spark deployments
  • For 2-Spark clusters, use the QSFP interface IPs and --tensor-parallel-size 2; add Ray/distributed flags manually until SpecServe gains first-class cluster-launch support (see the sketch after this list)
  • Reference: NVIDIA's DGX Spark playbooks leave --gpu-memory-utilization unset even in their tight 70B+ examples; the default is safest when unsure
  • FP8 is the safest general-purpose format on GB10. NVIDIA-published NVFP4 models can work in NVIDIA's validated NGC vLLM containers; third-party FP4 needs testing first.
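
A rough sketch of the manual 2-Spark flow mentioned above, using standard Ray and vLLM commands (the IP and model are placeholders; exact steps vary by vLLM version and network setup):

# On the head Spark, bind Ray to the QSFP interface
ray start --head --node-ip-address=<head-qsfp-ip> --port=6379

# On the second Spark, join the cluster over QSFP
ray start --address=<head-qsfp-ip>:6379

# Back on the head: one GPU per Spark, tensor parallelism across both
vllm serve <model> --tensor-parallel-size 2 --distributed-executor-backend ray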
