vLLM v0.19.1 Windows Build — Multi-TurboQuant
Native Windows build of vLLM 0.19.1 — no WSL, no Docker, no Linux VM. Point release on top of v0.19.0-win.
What's new in v0.19.1-win
- vLLM v0.19.1 base — upstream point release (CI fixes, pinned
nixl-cu{12,13}, Jina ColBERT rotary recomputation for transformers v5). uvloopfallback baked into the wheel — upstream added an unconditionalimport uvloopinvllm/v1/utils.py; the patch now wraps it intry/except ImportError → asyncio, so user code no longer needs thesys.modules.setdefault("uvloop", ...)stub.- All 6 TQ methods re-verified on RTX 3090 with Qwen3-14B-AWQ-4bit (see below).
- New
tests/test_tq_diag.py— faulthandler-guarded diagnostic that distinguishes a real hang from a slow-but-terminating PyTorch-fallback decode (90s watchdog, per-method viaTQ_METHODenv var).
Verified (RTX 3090, Qwen3-14B-abliterated-AWQ-4bit)
Smoke test (FlashAttention 2, kv_cache_dtype=auto): 933 ms for 16 tokens, ~17 tok/s.
All six TurboQuant methods (Triton attention backend, PyTorch-fallback encode/decode). 5 tokens, max_model_len=512, gpu_memory_utilization=0.5:
| Method | Preset | Time (5 tok) | Output tok/s | Status |
|---|---|---|---|---|
isoquant3 |
no_calibration_symmetric | 41.5s | 0.12 | PASS |
isoquant4 |
no_calibration_quality | 53.0s | 0.09 | PASS |
planarquant3 |
k_only_planar | 40.5s | 0.12 | PASS |
planarquant4 |
k_only_planar | 53.0s | 0.09 | PASS |
turboquant25 |
max_compression | 6.7s | 0.74 | PASS |
turboquant35 |
speed | 5.4s | 0.92 | PASS |
turboquant25/35 are ~8× faster than the iso/planar family on the PyTorch-fallback path. All methods still pay the expected ~30-300× throughput cost vs FP16 until a fused Triton kernel lands.
Install
py -3.10 -m venv venv
venv\Scripts\activate
pip install torch==2.10.0 torchaudio==2.10.0 torchvision==0.25.0 --index-url https://download.pytorch.org/whl/cu126
pip install triton-windows==3.6.0.post26
pip install vllm-0.19.1+cu126-cp310-cp310-win_amd64.whl
pip install git+https://github.com/rookiemann/multi-turboquant.gitRequirements
- Windows 10 21H2+ / Windows 11
- NVIDIA GPU with SM 8.0+ (RTX 30/40/50, A100, H100)
- CUDA driver R545+
- Python 3.10.x
Known limitations
Unchanged from v0.19.0-win:
- TQ throughput penalty from PyTorch-fallback encode/decode (fused Triton kernel still pending).
- Single GPU only (NCCL unavailable on Windows).
- No FlashAttention 3, no FlashInfer.
See CHANGELOG.md for full history.