
vLLM v0.21.0 Windows Build — Multi-TurboQuant + native TurboQuant


@aivrar aivrar released this 19 May 20:49

Native Windows build of vLLM 0.21.0: no WSL, no Docker, no Linux VM. Major upstream bump on top of v0.19.1-win: 1,157 commits from v0.19.1 → v0.21.0, PyTorch 2.10.0 → 2.11.0, CUTLASS 4.2.1 → 4.4.2.

What's new in v0.21.0-win

  • vLLM v0.21.0 base — covers the v0.19.2 / v0.20.0 / v0.20.1 / v0.20.2 / v0.21.0 release train, including v1 engine maturity, zero-bubble DP scheduling, batched chat completions, the new DeepGEMM extension, async-scheduling hardening, and the new native TurboQuant attention backend (PR #38479).
  • 10 KV cache compression dtypes in one wheel. Our 6 Multi-TurboQuant methods and the 4 new upstream turboquant_* variants now coexist in CacheDType. The platform dispatcher in vllm/platforms/cuda.py routes the upstream names to TurboQuantBackend (fused Triton kernels, full speed) and ours stay on the patched TritonAttention backend.
  • PyTorch 2.11.0 + CUDA 12.6 wheels for cp310 win_amd64.
  • cutlass-windows.patch + vllm-flash-attn-cutlass-windows.patch ship inside the v5 source patch and get applied automatically by CMakeLists.txt / vllm_flash_attn.cmake after FetchContent — no manual .deps patching anymore.
  • Auto-default VLLM_USE_FLASHINFER_SAMPLER=False on Windows. Upstream's True default triggered ModuleNotFoundError: No module named 'flashinfer' at LLM() construction; the patch flips the default on sys.platform == "win32" so the Triton sampler is used silently.
  • The uvloop fallback is still included (carried over from v0.19.1-win).
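
The platform-dependent sampler default described above can be sketched as follows. This is a minimal illustration, not the code in the patch: the helper names `default_use_flashinfer_sampler` and `resolve_flag` are hypothetical, and the assumption that an explicitly set environment variable still overrides the default is ours.

```python
import sys


def default_use_flashinfer_sampler(platform: str = sys.platform) -> bool:
    """Hypothetical helper: default for VLLM_USE_FLASHINFER_SAMPLER.

    Returns False on Windows, where the flashinfer module cannot be
    imported, and True elsewhere (the upstream default per these notes).
    """
    return platform != "win32"


def resolve_flag(env: dict, platform: str = sys.platform) -> bool:
    """Assumption: an explicit setting wins over the platform default."""
    raw = env.get("VLLM_USE_FLASHINFER_SAMPLER")
    if raw is None:
        return default_use_flashinfer_sampler(platform)
    return raw.strip().lower() in ("1", "true")
```

Passing the platform as a parameter keeps the logic testable on any OS; in real use the defaults (`os.environ` and `sys.platform`) would be supplied by the caller.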

What changed inside the patch

vllm-windows-v5.patch replaces vllm-windows-v4.patch. 36 modified files + 3 new files (vllm/v1/attention/ops/multi_turboquant_kv.py, cutlass-windows.patch, vllm-flash-attn-cutlass-windows.patch), ~1918 lines.

New patches in v5:

  • /Usmall + WIN32_LEAN_AND_MEAN to defeat the Windows SDK rpcndr.h small macro that collides with PyTorch 2.11.0's bool small parameter name in c10::cuda::CUDACachingAllocator.
  • /Zc:__cplusplus so MSVC reports the real C++ version (defaults to 199711L); CUTLASS 4.4.2's platform.h gates is_unsigned_v etc. behind __cplusplus >= 201703L.
  • csrc/persistent_topk.cuh (new file in v0.21.0): __attribute__((always_inline)) guarded with _MSC_VER / __forceinline fallback.
  • csrc/quantization/fused_kernels/fused_silu_mul_block_quant.cu (new file): quant_type_max_v<scalar_out_t> → quant_type_max_v<scalar_out_t>() (function-template call syntax).
  • csrc/moe/topk_softplus_sqrt_kernels.cu: hoist #ifndef USE_ROCM out of the DISPATCH_HASH(...) macro argument (preprocessor-in-macro-arg is ill-formed even with /Zc:preprocessor).
  • requirements/cuda.txt: comment out fastsafetensors (Linux-only io_uring); we keep using our own numpy-mmap reader from v0.19.x.
  • cutlass-windows.patch: 5-file CUTLASS 4.4.2 patch (cuda_host_adapter.hpp memsetDevice host/device mismatch + 4 SM100/SM103 headers with static constexpr dim3 get_block_shape() violations).
  • vllm-flash-attn-cutlass-windows.patch: same constexpr-dim3 fix in the vendored CUTLASS submodule under vllm-flash-attn.

Three v4 hunks dropped as obsolete upstream: csrc/topk.cu designated-initializer fix (the affected function was rewritten upstream), routed_experts_capturer.py fcntl/msvcrt locking (upstream rewrote the file to not use file locks), and triton_reshape_and_cache_flash.py fp8 startswith assert (upstream now uses is_quantized_kv_cache).

Full per-file breakdown → PATCHES.md

Verified (RTX 3090, Qwen3-14B-abliterated-AWQ-4bit)

kv_cache_dtype                           | Output tok/s | Notes
auto (fp16, FA2)                         | 16.7         | 20 tokens in 1.20 s (wheel install, warm OS cache)
turboquant35 (Triton + PyTorch-fallback) | 0.93         | 20 tokens in 21.4 s; matches the v0.19.1 figure (0.92 tok/s)
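
As a quick check of the arithmetic, the throughput figures follow directly from the token counts and timings quoted in the notes. The short sketch below recomputes them and derives the relative slowdown; the slowdown ratio is our own derived number, not one stated in the release.

```python
def tokens_per_second(tokens: int, seconds: float) -> float:
    """Output throughput for a single generation run."""
    return tokens / seconds


# Timings taken from the Notes column of the table above.
baseline = tokens_per_second(20, 1.20)   # kv_cache_dtype=auto
tq35 = tokens_per_second(20, 21.4)       # kv_cache_dtype=turboquant35
slowdown = baseline / tq35               # derived, not from the release

print(f"auto:         {baseline:.1f} tok/s")   # 16.7
print(f"turboquant35: {tq35:.2f} tok/s")       # 0.93
print(f"slowdown:     {slowdown:.1f}x")        # 17.8
```

With only 20 tokens per run these are rough single-sample numbers, so small differences such as 0.92 vs 0.93 tok/s are within noise.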

Other Multi-TurboQuant methods (isoquant3/4, planarquant3/4, turboquant25) should behave the same as in v0.19.1-win; rerun tests/test_tq_real.py for a full sweep.
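
To make the split between the two backends concrete, here is a hedged sketch of the routing rule as the notes describe it: names with the upstream turboquant_ prefix go to TurboQuantBackend, and the six Multi-TurboQuant names stay on the patched TritonAttention backend. The function `select_backend` and the `"default"` return value are hypothetical; the real dispatch lives in vllm/platforms/cuda.py and may look different.

```python
# The six Multi-TurboQuant method names mentioned in these notes.
MULTI_TQ_METHODS = frozenset({
    "isoquant3", "isoquant4",
    "planarquant3", "planarquant4",
    "turboquant25", "turboquant35",
})

# Upstream variants are described only as turboquant_*; the exact
# suffixes are not listed here, so we match on the prefix.
UPSTREAM_PREFIX = "turboquant_"


def select_backend(kv_cache_dtype: str) -> str:
    """Hypothetical routing rule, returning a backend name as a string."""
    if kv_cache_dtype.startswith(UPSTREAM_PREFIX):
        return "TurboQuantBackend"   # fused Triton kernels
    if kv_cache_dtype in MULTI_TQ_METHODS:
        return "TritonAttention"     # patched backend, PyTorch-fallback codec
    return "default"                 # placeholder for all other dtypes
```

Note that turboquant25 and turboquant35 have no underscore, so the prefix test does not capture them and they fall through to the Multi-TurboQuant set.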

Build environment:

  • Windows 10 Pro 22H2
  • Visual Studio 2022 Community 17.13 (MSVC 19.43.34810)
  • CUDA Toolkit 12.6
  • Python 3.10.11
  • PyTorch 2.11.0+cu126
  • Triton-windows 3.6.0.post26

Install

Pre-built wheel — no compiler needed:

py -3.10 -m venv venv
venv\Scripts\activate
pip install torch==2.11.0 torchaudio==2.11.0 torchvision==0.26.0 ^
    --index-url https://download.pytorch.org/whl/cu126
pip install triton-windows==3.6.0.post26
pip install vllm-0.21.0+cu126-cp310-cp310-win_amd64.whl
pip install git+https://github.com/aivrar/multi-turboquant.git
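
Because the wheel is built only for cp310 on win_amd64, it can help to confirm the active interpreter matches before installing. The sketch below is an optional sanity check using only the standard library; `parse_wheel_tags` and `interpreter_matches` are hypothetical helper names, and the check does not inspect the CPU architecture, which pip verifies itself.

```python
import sys


def parse_wheel_tags(filename: str) -> tuple:
    """Return (python_tag, abi_tag, platform_tag) from a wheel filename."""
    stem = filename[: -len(".whl")]
    parts = stem.split("-")
    return parts[-3], parts[-2], parts[-1]


def interpreter_matches(filename: str,
                        version_info=sys.version_info,
                        platform: str = sys.platform) -> bool:
    """True if the CPython version and OS match the wheel's tags."""
    py_tag, _abi_tag, plat_tag = parse_wheel_tags(filename)
    expected_py = f"cp{version_info[0]}{version_info[1]}"
    on_windows = platform == "win32" and plat_tag.startswith("win")
    return py_tag == expected_py and on_windows
```

Calling `interpreter_matches("vllm-0.21.0+cu126-cp310-cp310-win_amd64.whl")` inside the activated venv should return True on Python 3.10 under Windows.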

Or from source (~60-90 min):

git clone https://github.com/vllm-project/vllm.git vllm-source
cd vllm-source && git checkout v0.21.0 && cd ..
git apply vllm-windows-v5.patch --directory vllm-source
build.bat

Full instructions: README.md

Known limitations

Unchanged from v0.19.x:

  • TQ throughput penalty on our 6 methods (PyTorch-fallback encode/decode). Memory savings real, throughput cost real. The 4 upstream turboquant_* variants don't pay this cost — they use fused Triton kernels.
  • Single GPU only (NCCL still unavailable on Windows; the patch wires up FakeProcessGroup for single-rank operation).
  • No FlashAttention 3 or 4 and no FlashInfer: neither ships Windows wheels.
  • No DeepGEMM, Quack, Tilelang, TokenSpeed-MLA, or NIXL: none of them ship Windows wheels. CMake skips DeepGEMM automatically when the target arch is below SM 9.0.

SHA256

b63902f427527ff8aa150744ff6d20c4a91d16c6c85125fbe727a9539a75cd21  vllm-0.21.0+cu126-cp310-cp310-win_amd64.whl
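
One way to check a downloaded wheel against the hash above is a few lines of standard-library Python. This is a minimal sketch; the function names `sha256_of` and `verify` are ours, and the file is read in chunks so the whole wheel never has to sit in memory.

```python
import hashlib

# Hash published in this release for the cp310 win_amd64 wheel.
EXPECTED_SHA256 = (
    "b63902f427527ff8aa150744ff6d20c4a91d16c6c85125fbe727a9539a75cd21"
)


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read 1 MiB at a time until read() returns an empty bytes object.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: str, expected: str = EXPECTED_SHA256) -> bool:
    """True if the file's digest equals the expected hex string."""
    return sha256_of(path) == expected.lower()
```

Usage: `verify("vllm-0.21.0+cu126-cp310-cp310-win_amd64.whl")` should return True for an intact download; if it returns False, download the wheel again before installing.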