Release vLLM v0.21.0 Windows Build — Multi-TurboQuant + native TurboQuant · aivrar/vllm-windows-build

Native Windows build of vLLM 0.21.0 — no WSL, no Docker, no Linux VM. Major upstream bump on top of v0.19.1-win: 1,157 commits from v0.19.1 → v0.21.0, PyTorch 2.10.0 → 2.11.0, CUTLASS 4.2.1 → 4.4.2.

What's new in v0.21.0-win

vLLM v0.21.0 base — covers the v0.19.2 / v0.20.0 / v0.20.1 / v0.20.2 / v0.21.0 release train, including v1 engine maturity, zero-bubble DP scheduling, batched chat completions, the new DeepGEMM extension, async-scheduling hardening, and the new native TurboQuant attention backend (PR #38479).
10 KV cache compression dtypes in one wheel. Our 6 Multi-TurboQuant methods and the 4 new upstream turboquant_* variants now coexist in CacheDType. The platform dispatcher in vllm/platforms/cuda.py routes the upstream names to TurboQuantBackend (fused Triton kernels, full speed), while ours stay on the patched TritonAttention backend.
PyTorch 2.11.0 + CUDA 12.6 wheels for cp310 win_amd64.
cutlass-windows.patch + vllm-flash-attn-cutlass-windows.patch ship inside the v5 source patch and get applied automatically by CMakeLists.txt / vllm_flash_attn.cmake after FetchContent — no manual .deps patching anymore.
Auto-default VLLM_USE_FLASHINFER_SAMPLER=False on Windows. Upstream's True default triggered ModuleNotFoundError: No module named 'flashinfer' at LLM() construction; the patch flips the default on sys.platform == "win32" so the Triton sampler is used silently.
The uvloop fallback is still baked in (carried over from v0.19.1-win).
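The platform-conditional sampler default described above can be pictured roughly as follows. This is a minimal sketch, not the code in the patch: both helper names are hypothetical, and the accepted truthy strings are an assumption. Only the environment variable name and the sys.platform == "win32" check come from the notes above.

```python
import os
import sys


def default_use_flashinfer_sampler(platform: str = sys.platform) -> bool:
    """Hypothetical helper: default for VLLM_USE_FLASHINFER_SAMPLER."""
    # Upstream defaults to True; on Windows the default flips to False
    # because the flashinfer module cannot be imported there.
    return platform != "win32"


def resolve_use_flashinfer_sampler(environ=os.environ,
                                   platform: str = sys.platform) -> bool:
    """Hypothetical helper: an explicit setting wins over the default."""
    raw = environ.get("VLLM_USE_FLASHINFER_SAMPLER")
    if raw is None:
        return default_use_flashinfer_sampler(platform)
    # Assumption: "1" and "true" (any case) count as enabled.
    return raw.strip().lower() in ("1", "true")


if __name__ == "__main__":
    print(resolve_use_flashinfer_sampler())
```

With this shape, a user who sets the variable explicitly still gets what they asked for; only the unset case changes on Windows.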

What changed inside the patch

vllm-windows-v5.patch replaces vllm-windows-v4.patch. 36 modified files + 3 new files (vllm/v1/attention/ops/multi_turboquant_kv.py, cutlass-windows.patch, vllm-flash-attn-cutlass-windows.patch), ~1918 lines.

New patches in v5:

/Usmall + WIN32_LEAN_AND_MEAN to defeat the Windows SDK rpcndr.h small macro that collides with PyTorch 2.11.0's bool small parameter name in c10::cuda::CUDACachingAllocator.
/Zc:__cplusplus so MSVC reports the real C++ version in __cplusplus (without the flag the macro stays at 199711L); CUTLASS 4.4.2's platform.h gates is_unsigned_v etc. behind __cplusplus >= 201703L.
csrc/persistent_topk.cuh (new file in v0.21.0): __attribute__((always_inline)) guarded with _MSC_VER / __forceinline fallback.
csrc/quantization/fused_kernels/fused_silu_mul_block_quant.cu (new file): quant_type_max_v<scalar_out_t> → quant_type_max_v<scalar_out_t>() (function-template call syntax).
csrc/moe/topk_softplus_sqrt_kernels.cu: hoist #ifndef USE_ROCM out of the DISPATCH_HASH(...) macro argument (preprocessor-in-macro-arg is ill-formed even with /Zc:preprocessor).
requirements/cuda.txt: comment out fastsafetensors (Linux-only io_uring); we keep using our own numpy-mmap reader from v0.19.x.
cutlass-windows.patch: 5-file CUTLASS 4.4.2 patch (cuda_host_adapter.hpp memsetDevice host/device mismatch + 4 SM100/SM103 headers with static constexpr dim3 get_block_shape() violations).
vllm-flash-attn-cutlass-windows.patch: same constexpr-dim3 fix in the vendored CUTLASS submodule under vllm-flash-attn.

Three v4 hunks dropped as obsolete upstream: csrc/topk.cu designated-initializer fix (the affected function was rewritten upstream), routed_experts_capturer.py fcntl/msvcrt locking (upstream rewrote the file to not use file locks), and triton_reshape_and_cache_flash.py fp8 startswith assert (upstream now uses is_quantized_kv_cache).

Full per-file breakdown → PATCHES.md

Verified (RTX 3090, Qwen3-14B-abliterated-AWQ-4bit)

`kv_cache_dtype`	Output tok/s	Notes
`auto` (fp16, FA2)	16.7	20 tokens in 1.20 s (wheel install, warm OS cache)
`turboquant35` (Triton + PyTorch-fallback)	0.93	20 tokens in 21.4 s — matches the v0.19.1 figure (0.92 tok/s)

Other Multi-TurboQuant methods (isoquant3/4, planarquant3/4, turboquant25) should behave the same as in v0.19.1-win; rerun tests/test_tq_real.py for a full sweep.
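For reference, the tok/s column is just tokens divided by wall-clock seconds. The short check below recomputes both rows from the timings in the Notes column; the function name is illustrative only.

```python
def tokens_per_second(tokens: int, seconds: float) -> float:
    """Output throughput: generated tokens per wall-clock second."""
    return tokens / seconds


# Timings taken from the Notes column of the table above.
baseline = tokens_per_second(20, 1.20)      # kv_cache_dtype="auto"
turboquant35 = tokens_per_second(20, 21.4)  # kv_cache_dtype="turboquant35"
slowdown = baseline / turboquant35

print(f"auto:         {baseline:.1f} tok/s")
print(f"turboquant35: {turboquant35:.2f} tok/s")
print(f"slowdown:     {slowdown:.1f}x")
```

Computed from the raw timings, turboquant35 is about 17.8x slower than the fp16 baseline on this 20-token run.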

Build environment:

Windows 10 Pro 22H2
Visual Studio 2022 Community 17.13 (MSVC 19.43.34810)
CUDA Toolkit 12.6
Python 3.10.11
PyTorch 2.11.0+cu126
Triton-windows 3.6.0.post26

Install

Pre-built wheel — no compiler needed:

py -3.10 -m venv venv
venv\Scripts\activate
pip install torch==2.11.0 torchaudio==2.11.0 torchvision==0.26.0 ^
    --index-url https://download.pytorch.org/whl/cu126
pip install triton-windows==3.6.0.post26
pip install vllm-0.21.0+cu126-cp310-cp310-win_amd64.whl
pip install git+https://github.com/aivrar/multi-turboquant.git
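The wheel is tagged cp310 / win_amd64, so pip will refuse it on any other interpreter or platform. Before running the commands above, a quick check like the one below can confirm the match. It is a simplified sketch that compares only the Python and platform tags; pip's real tag matching is more involved, and the helper names are made up for illustration.

```python
import sys
import sysconfig

WHEEL = "vllm-0.21.0+cu126-cp310-cp310-win_amd64.whl"


def wheel_tags(filename: str):
    """Split a wheel filename (no build tag) into python/abi/platform tags."""
    stem = filename[:-len(".whl")]
    _name, _version, python_tag, abi_tag, platform_tag = stem.split("-")
    return python_tag, abi_tag, platform_tag


def interpreter_tags():
    """Approximate tags for the running CPython interpreter."""
    python_tag = f"cp{sys.version_info.major}{sys.version_info.minor}"
    # sysconfig reports e.g. "win-amd64"; wheel tags use underscores.
    platform_tag = sysconfig.get_platform().replace("-", "_").replace(".", "_")
    return python_tag, platform_tag


def wheel_fits(filename: str, python_tag: str, platform_tag: str) -> bool:
    """True when the wheel's Python and platform tags match the given ones."""
    whl_python, _abi, whl_platform = wheel_tags(filename)
    return whl_python == python_tag and whl_platform == platform_tag


if __name__ == "__main__":
    py_tag, plat_tag = interpreter_tags()
    print(f"{py_tag}/{plat_tag} fits {WHEEL}: {wheel_fits(WHEEL, py_tag, plat_tag)}")
```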

Or from source (~60-90 min):

git clone https://github.com/vllm-project/vllm.git vllm-source
cd vllm-source && git checkout v0.21.0 && cd ..
git apply vllm-windows-v5.patch --directory vllm-source
build.bat

Full instructions: README.md

Known limitations

Unchanged from v0.19.x:

TQ throughput penalty on our 6 methods (PyTorch-fallback encode/decode). The memory savings are real, and so is the throughput cost. The 4 upstream turboquant_* variants don't pay this cost because they use fused Triton kernels.
Single GPU only (NCCL still unavailable on Windows; the patch wires up FakeProcessGroup for single-rank operation).
No FlashAttention 3 or 4 and no FlashInfer: no Windows wheels are available for them.
No DeepGEMM, Quack, Tilelang, TokenSpeed-MLA, or NIXL, for the same reason (no Windows wheels). CMake skips DeepGEMM automatically when target arch < SM 9.0.
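To make it easier to tell which kv_cache_dtype values carry the throughput penalty, here is a rough sketch of the routing rule as these notes describe it. It is not the dispatcher in vllm/platforms/cuda.py: the function name is hypothetical, the backend labels are plain strings, and the handling of every other dtype is an assumption. The six method names are the ones listed in these notes.

```python
# The six Multi-TurboQuant method names mentioned in these notes.
MULTI_TURBOQUANT_DTYPES = frozenset({
    "isoquant3", "isoquant4",
    "planarquant3", "planarquant4",
    "turboquant25", "turboquant35",
})


def select_backend(kv_cache_dtype: str) -> str:
    """Hypothetical routing rule keyed on the kv_cache_dtype name."""
    if kv_cache_dtype.startswith("turboquant_"):
        # Upstream variants: fused Triton kernels, no fallback penalty.
        return "TurboQuantBackend"
    if kv_cache_dtype in MULTI_TURBOQUANT_DTYPES:
        # Our methods: patched backend with PyTorch-fallback encode/decode.
        return "TritonAttention (patched)"
    # Assumption: anything else goes through the usual backend selection.
    return "default selection"


if __name__ == "__main__":
    for dtype in ("auto", "turboquant35", "turboquant_example"):
        print(dtype, "->", select_backend(dtype))
```

Note that "turboquant_example" is a placeholder, not one of the four real upstream names, which are not listed here.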

SHA256

b63902f427527ff8aa150744ff6d20c4a91d16c6c85125fbe727a9539a75cd21  vllm-0.21.0+cu126-cp310-cp310-win_amd64.whl
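To check a downloaded wheel against the digest above without extra tools, a small standard-library script along these lines should do. It is a sketch: the function names are illustrative, and the wheel is assumed to sit in the current directory.

```python
import hashlib

EXPECTED_SHA256 = "b63902f427527ff8aa150744ff6d20c4a91d16c6c85125fbe727a9539a75cd21"
WHEEL_PATH = "vllm-0.21.0+cu126-cp310-cp310-win_amd64.whl"


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash the file in 1 MiB chunks so a large wheel is never fully in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: str, expected: str = EXPECTED_SHA256) -> bool:
    """True when the file's SHA256 matches the expected hex digest."""
    return sha256_of(path) == expected.strip().lower()


if __name__ == "__main__":
    print("OK" if verify(WHEEL_PATH) else "MISMATCH: do not install this file")
```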
