Release vLLM v0.19.1 Windows Build — Multi-TurboQuant · aivrar/vllm-windows-build

Native Windows build of vLLM 0.19.1 — no WSL, no Docker, no Linux VM. Point release on top of v0.19.0-win.

What's new in v0.19.1-win

vLLM v0.19.1 base — upstream point release (CI fixes, pinned nixl-cu{12,13}, Jina ColBERT rotary recomputation for transformers v5).
uvloop fallback baked into the wheel — upstream added an unconditional import uvloop in vllm/v1/utils.py; the patch now wraps it in try/except ImportError → asyncio, so user code no longer needs the sys.modules.setdefault("uvloop", ...) stub.
All 6 TQ methods re-verified on RTX 3090 with Qwen3-14B-AWQ-4bit (see below).
New tests/test_tq_diag.py — faulthandler-guarded diagnostic that distinguishes a real hang from a slow-but-terminating PyTorch-fallback decode (90s watchdog, per-method via TQ_METHOD env var).

Verified (RTX 3090, Qwen3-14B-abliterated-AWQ-4bit)

Smoke test (FlashAttention 2, kv_cache_dtype=auto): 933 ms for 16 tokens, ~17 tok/s.

All six TurboQuant methods (Triton attention backend, PyTorch-fallback encode/decode). 5 tokens, max_model_len=512, gpu_memory_utilization=0.5:

Method	Preset	Time (5 tok)	Output tok/s	Status
`isoquant3`	no_calibration_symmetric	41.5s	0.12	PASS
`isoquant4`	no_calibration_quality	53.0s	0.09	PASS
`planarquant3`	k_only_planar	40.5s	0.12	PASS
`planarquant4`	k_only_planar	53.0s	0.09	PASS
`turboquant25`	max_compression	6.7s	0.74	PASS
`turboquant35`	speed	5.4s	0.92	PASS

turboquant25/35 are ~8× faster than the iso/planar family on the PyTorch-fallback path. All methods still pay the expected ~30-300× throughput cost vs FP16 until a fused Triton kernel lands.

Install

py -3.10 -m venv venv
venv\Scripts\activate
pip install torch==2.10.0 torchaudio==2.10.0 torchvision==0.25.0 --index-url https://download.pytorch.org/whl/cu126
pip install triton-windows==3.6.0.post26
pip install vllm-0.19.1+cu126-cp310-cp310-win_amd64.whl
pip install git+https://github.com/rookiemann/multi-turboquant.git

Requirements

Windows 10 21H2+ / Windows 11
NVIDIA GPU with SM 8.0+ (RTX 30/40/50, A100, H100)
CUDA driver R545+
Python 3.10.x

Known limitations

Unchanged from v0.19.0-win:

TQ throughput penalty from PyTorch-fallback encode/decode (fused Triton kernel still pending).
Single GPU only (NCCL unavailable on Windows).
No FlashAttention 3, no FlashInfer.

See CHANGELOG.md for full history.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

vLLM v0.19.1 Windows Build — Multi-TurboQuant

Choose a tag to compare

Sorry, something went wrong.