vLLM v0.19.1 Windows Build — Multi-TurboQuant

@aivrar aivrar released this 20 Apr 00:06
· 7 commits to master since this release

Native Windows build of vLLM 0.19.1 — no WSL, no Docker, no Linux VM. Point release on top of v0.19.0-win.

What's new in v0.19.1-win

  • vLLM v0.19.1 base — upstream point release (CI fixes, pinned nixl-cu{12,13}, Jina ColBERT rotary recomputation for transformers v5).
  • uvloop fallback baked into the wheel — upstream added an unconditional import uvloop in vllm/v1/utils.py; the patch now wraps it in try/except ImportError and falls back to asyncio, so user code no longer needs the sys.modules.setdefault("uvloop", ...) stub.
  • All six TurboQuant (TQ) methods re-verified on RTX 3090 with Qwen3-14B-abliterated-AWQ-4bit (see the Verified section below).
  • New tests/test_tq_diag.py — faulthandler-guarded diagnostic that distinguishes a real hang from a slow-but-terminating PyTorch-fallback decode (90s watchdog, per-method via TQ_METHOD env var).
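
The uvloop fallback mentioned in the list above follows the usual optional-import pattern. The following is a minimal sketch of that pattern, not the actual patch applied to vllm/v1/utils.py. The helper name new_event_loop is hypothetical and only illustrates falling back to asyncio when uvloop cannot be imported.

```python
import asyncio

try:
    # uvloop does not support Windows, so this import fails there.
    import uvloop
except ImportError:
    uvloop = None


def new_event_loop() -> asyncio.AbstractEventLoop:
    """Hypothetical helper: use uvloop when available, else plain asyncio."""
    if uvloop is not None:
        return uvloop.new_event_loop()
    return asyncio.new_event_loop()


async def _ping() -> str:
    # Trivial coroutine used only to show that the chosen loop works.
    await asyncio.sleep(0)
    return "ok"


loop = new_event_loop()
try:
    result = loop.run_until_complete(_ping())
finally:
    loop.close()
```

With the guard living inside the library, calling code needs no uvloop-specific workaround on Windows.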

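A watchdog of the kind tests/test_tq_diag.py is described as using can be built on the standard-library faulthandler module. This is a sketch under assumptions, not the contents of that test: run_with_watchdog and the stand-in workload are hypothetical, and only the 90-second limit and the TQ_METHOD variable name come from these notes.

```python
import faulthandler
import os
import time

WATCHDOG_SECONDS = 90  # watchdog limit quoted in the release notes


def run_with_watchdog(fn, timeout=WATCHDOG_SECONDS):
    """Hypothetical helper: run fn(); on a hang, dump all thread stacks and exit."""
    # If fn() has not returned after `timeout` seconds, faulthandler prints
    # every thread's traceback to stderr and terminates the process.
    faulthandler.dump_traceback_later(timeout, exit=True)
    start = time.perf_counter()
    try:
        result = fn()
    finally:
        # A slow-but-terminating run reaches this point and disarms the timer.
        faulthandler.cancel_dump_traceback_later()
    return result, time.perf_counter() - start


# Select the method under test from the environment; "turboquant35" is only
# an illustrative default.
method = os.environ.get("TQ_METHOD", "turboquant35")

# Stand-in workload; a real diagnostic would run the decode for `method` here.
result, elapsed = run_with_watchdog(lambda: f"decoded with {method}", timeout=5)
```

A real hang produces a traceback dump and a process exit, while a slow decode simply returns with a large elapsed time, which is the distinction the diagnostic is meant to draw.
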
Verified (RTX 3090, Qwen3-14B-abliterated-AWQ-4bit)

Smoke test (FlashAttention 2, kv_cache_dtype=auto): 933 ms for 16 tokens, ~17 tok/s.

All six TurboQuant methods were run with the Triton attention backend and PyTorch-fallback encode/decode, generating 5 tokens each with max_model_len=512 and gpu_memory_utilization=0.5:

Method        Preset                    Time (5 tok)  Output tok/s  Status
isoquant3     no_calibration_symmetric  41.5s         0.12          PASS
isoquant4     no_calibration_quality    53.0s         0.09          PASS
planarquant3  k_only_planar             40.5s         0.12          PASS
planarquant4  k_only_planar             53.0s         0.09          PASS
turboquant25  max_compression           6.7s          0.74          PASS
turboquant35  speed                     5.4s          0.92          PASS

turboquant25/35 are ~8× faster than the iso/planar family on the PyTorch-fallback path. All methods still pay the expected ~30-300× throughput cost vs FP16 until a fused Triton kernel lands.
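
The throughput column and the ~8× figure can be re-derived from the timings in the table. The sketch below does that arithmetic. Grouping the methods into two families and comparing mean wall-clock times is an assumption about how the ~8× figure was obtained, not something these notes state.

```python
TOKENS = 5  # tokens generated per run, as stated above

# (family, seconds for 5 tokens), copied from the table above.
TIMES = {
    "isoquant3": ("iso/planar", 41.5),
    "isoquant4": ("iso/planar", 53.0),
    "planarquant3": ("iso/planar", 40.5),
    "planarquant4": ("iso/planar", 53.0),
    "turboquant25": ("turboquant", 6.7),
    "turboquant35": ("turboquant", 5.4),
}

# Output throughput is tokens divided by wall-clock time.
throughput = {name: TOKENS / secs for name, (_, secs) in TIMES.items()}


def mean_time(family: str) -> float:
    """Mean wall-clock time over all methods in one family."""
    secs = [s for fam, s in TIMES.values() if fam == family]
    return sum(secs) / len(secs)


# Assumed definition of the speedup: ratio of the two family mean times.
speedup = mean_time("iso/planar") / mean_time("turboquant")

for name, tps in throughput.items():
    print(f"{name:<13} {tps:.3f} tok/s")
print(f"mean-time ratio: {speedup:.1f}x")
```

The per-method values agree with the table to within 0.01 tok/s, and the family ratio comes out a little under 8.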

Install

py -3.10 -m venv venv
venv\Scripts\activate
pip install torch==2.10.0 torchaudio==2.10.0 torchvision==0.25.0 --index-url https://download.pytorch.org/whl/cu126
pip install triton-windows==3.6.0.post26
pip install vllm-0.19.1+cu126-cp310-cp310-win_amd64.whl
pip install git+https://github.com/rookiemann/multi-turboquant.git
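
After the commands above finish, the pinned versions can be checked from inside the venv. This is an optional sketch using only the standard library. check_versions is a hypothetical helper, and the prefix match is an assumption meant to tolerate local version suffixes such as +cu126.

```python
from importlib import metadata

# Distribution names and versions taken from the pip commands above.
EXPECTED = {
    "torch": "2.10.0",
    "torchaudio": "2.10.0",
    "torchvision": "0.25.0",
    "triton-windows": "3.6.0.post26",
    "vllm": "0.19.1",
}


def check_versions(expected):
    """Return {name: (installed_version_or_None, matches_expected_prefix)}."""
    report = {}
    for name, want in expected.items():
        try:
            got = metadata.version(name)
        except metadata.PackageNotFoundError:
            got = None
        # Prefix match so that e.g. "2.10.0+cu126" satisfies "2.10.0".
        report[name] = (got, got is not None and got.startswith(want))
    return report


report = check_versions(EXPECTED)
for name, (got, ok) in report.items():
    status = "OK" if ok else "MISMATCH"
    print(f"{name:<15} {got or 'not installed':<20} {status}")
```

Any MISMATCH line points at a package to reinstall before trying to load a model.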

Requirements

  • Windows 10 21H2+ / Windows 11
  • NVIDIA GPU with SM 8.0+ (RTX 30/40/50, A100, H100)
  • CUDA driver R545+
  • Python 3.10.x

Known limitations

Unchanged from v0.19.0-win:

  • TQ throughput penalty from PyTorch-fallback encode/decode (fused Triton kernel still pending).
  • Single GPU only (NCCL unavailable on Windows).
  • No FlashAttention 3, no FlashInfer.

See CHANGELOG.md for full history.