vLLM v0.21.0 Windows Build — Multi-TurboQuant + native TurboQuant
Native Windows build of vLLM 0.21.0 — no WSL, no Docker, no Linux VM. Major upstream bump on top of v0.19.1-win: 1,157 commits from v0.19.1 → v0.21.0, PyTorch 2.10.0 → 2.11.0, CUTLASS 4.2.1 → 4.4.2.
What's new in v0.21.0-win
- vLLM v0.21.0 base: covers the `v0.19.2`/`v0.20.0`/`v0.20.1`/`v0.20.2`/`v0.21.0` release train, including v1 engine maturity, zero-bubble DP scheduling, batched chat completions, the new DeepGEMM extension, async-scheduling hardening, and the new native TurboQuant attention backend (PR #38479).
- 10 KV cache compression dtypes in one wheel. Our 6 Multi-TurboQuant methods and the 4 new upstream `turboquant_*` variants now coexist in `CacheDType`. The platform dispatcher in `vllm/platforms/cuda.py` routes the upstream names to `TurboQuantBackend` (fused Triton kernels, full speed), while ours stay on the patched `TritonAttention` backend.
- PyTorch 2.11.0 + CUDA 12.6 wheels for `cp310` `win_amd64`.
- `cutlass-windows.patch` + `vllm-flash-attn-cutlass-windows.patch` ship inside the v5 source patch and get applied automatically by `CMakeLists.txt` / `vllm_flash_attn.cmake` after FetchContent; no manual `.deps` patching anymore.
- Auto-default `VLLM_USE_FLASHINFER_SAMPLER=False` on Windows. Upstream's True default triggered `ModuleNotFoundError: No module named 'flashinfer'` at `LLM()` construction; the patch flips the default on `sys.platform == "win32"` so the Triton sampler is used silently.
- `uvloop` fallback also still baked in (carried over from v0.19.1-win).
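The platform-dependent sampler default can be pictured with a short sketch. Only the variable name and the `sys.platform == "win32"` check come from the notes above. The helper name `use_flashinfer_sampler` and the accepted truthy spellings are assumptions made for illustration, not the patch's actual code.

```python
import os
import sys
from typing import Mapping, Optional

ENV_NAME = "VLLM_USE_FLASHINFER_SAMPLER"


def use_flashinfer_sampler(
    env: Optional[Mapping[str, str]] = None,
    platform: str = sys.platform,
) -> bool:
    """Hypothetical helper: resolve the flag with a Windows-aware default."""
    env = os.environ if env is None else env
    raw = env.get(ENV_NAME)
    if raw is not None:
        # An explicit setting always wins over the platform default.
        # The accepted spellings are an assumption of this sketch.
        return raw.strip().lower() in {"1", "true", "yes"}
    # No explicit setting: off on Windows (no flashinfer wheel), on elsewhere.
    return platform != "win32"
```

With nothing set in the environment, this returns `False` for `"win32"` and `True` for other platforms, which mirrors the behaviour described in the list above.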
What changed inside the patch
vllm-windows-v5.patch replaces vllm-windows-v4.patch. 36 modified files + 3 new files (vllm/v1/attention/ops/multi_turboquant_kv.py, cutlass-windows.patch, vllm-flash-attn-cutlass-windows.patch), ~1918 lines.
New patches in v5:
- `/Usmall` + `WIN32_LEAN_AND_MEAN` to defeat the Windows SDK `rpcndr.h` `small` macro that collides with PyTorch 2.11.0's `bool small` parameter name in `c10::cuda::CUDACachingAllocator`.
- `/Zc:__cplusplus` so MSVC reports the real C++ version (defaults to `199711L`); CUTLASS 4.4.2's `platform.h` gates `is_unsigned_v` etc. behind `__cplusplus >= 201703L`.
- `csrc/persistent_topk.cuh` (new file in v0.21.0): `__attribute__((always_inline))` guarded with `_MSC_VER` / `__forceinline` fallback.
- `csrc/quantization/fused_kernels/fused_silu_mul_block_quant.cu` (new file): `quant_type_max_v<scalar_out_t>` → `quant_type_max_v<scalar_out_t>()` (function-template call syntax).
- `csrc/moe/topk_softplus_sqrt_kernels.cu`: hoist `#ifndef USE_ROCM` out of the `DISPATCH_HASH(...)` macro argument (preprocessor-in-macro-arg is ill-formed even with `/Zc:preprocessor`).
- `requirements/cuda.txt`: comment out `fastsafetensors` (Linux-only `io_uring`); we keep using our own numpy-mmap reader from v0.19.x.
- `cutlass-windows.patch`: 5-file CUTLASS 4.4.2 patch (`cuda_host_adapter.hpp` `memsetDevice` host/device mismatch + 4 SM100/SM103 headers with `static constexpr dim3 get_block_shape()` violations).
- `vllm-flash-attn-cutlass-windows.patch`: same constexpr-`dim3` fix in the vendored CUTLASS submodule under `vllm-flash-attn`.
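For readers unfamiliar with the mmap-based loading mentioned in the `requirements/cuda.txt` item, here is a minimal standard-library sketch of the idea. It is not the reader shipped in the patch, which uses numpy according to the notes. The function names `read_header` and `load_f32` are hypothetical, and only F32 tensors are decoded. It relies on the safetensors layout: an 8-byte little-endian header length, a JSON header, then the raw tensor bytes.

```python
import json
import mmap
import struct
from typing import Dict, List, Tuple


def read_header(path: str) -> Tuple[int, Dict[str, dict]]:
    """Return (offset of the data section, tensor table) for a safetensors file."""
    with open(path, "rb") as f:
        # First 8 bytes: little-endian u64 giving the JSON header length.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    header.pop("__metadata__", None)  # optional free-form metadata entry
    return 8 + header_len, header


def load_f32(path: str, name: str) -> List[float]:
    """Memory-map the file and decode one F32 tensor as a flat list."""
    data_start, header = read_header(path)
    info = header[name]
    if info["dtype"] != "F32":
        raise ValueError("this sketch only decodes F32 tensors")
    begin, end = info["data_offsets"]  # offsets relative to the data section
    count = (end - begin) // 4  # 4 bytes per float32
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return list(struct.unpack_from(f"<{count}f", mm, data_start + begin))
```

Mapping the file avoids reading it through a second buffer and needs no `io_uring`, so the same approach works on Windows.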
Three v4 hunks dropped as obsolete upstream: csrc/topk.cu designated-initializer fix (the affected function was rewritten upstream), routed_experts_capturer.py fcntl/msvcrt locking (upstream rewrote the file to not use file locks), and triton_reshape_and_cache_flash.py fp8 startswith assert (upstream now uses is_quantized_kv_cache).
Full per-file breakdown → PATCHES.md
Verified (RTX 3090, Qwen3-14B-abliterated-AWQ-4bit)
| `kv_cache_dtype` | Output tok/s | Notes |
|---|---|---|
| `auto` (fp16, FA2) | 16.7 | 20 tokens in 1.20 s (wheel install, warm OS cache) |
| `turboquant35` (Triton + PyTorch-fallback) | 0.93 | 20 tokens in 21.4 s; matches the v0.19.1 figure (0.92 tok/s) |
Other Multi-TurboQuant methods (isoquant3/4, planarquant3/4, turboquant25) should behave the same as in v0.19.1-win; rerun tests/test_tq_real.py for a full sweep.
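The throughput figures are simply tokens divided by wall-clock seconds. The sketch below reproduces the two measured rows and the resulting slowdown factor. The function name is illustrative, and the inputs are the token counts and timings from the table.

```python
def tokens_per_second(tokens: int, seconds: float) -> float:
    """Output throughput: generated tokens divided by elapsed seconds."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return tokens / seconds


# Measurements reported in the table above.
auto_tps = tokens_per_second(20, 1.20)   # kv_cache_dtype="auto"
tq35_tps = tokens_per_second(20, 21.4)   # kv_cache_dtype="turboquant35"

# How much slower the PyTorch-fallback path is in this single run.
slowdown = auto_tps / tq35_tps

print(f"auto:         {auto_tps:.1f} tok/s")
print(f"turboquant35: {tq35_tps:.2f} tok/s")
print(f"slowdown:     {slowdown:.1f}x")
```

A 20-token run is a very small sample, so the ratio is best read as an order-of-magnitude indication.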
Build environment:
- Windows 10 Pro 22H2
- Visual Studio 2022 Community 17.13 (MSVC 19.43.34810)
- CUDA Toolkit 12.6
- Python 3.10.11
- PyTorch 2.11.0+cu126
- Triton-windows 3.6.0.post26
Install
Pre-built wheel — no compiler needed:
py -3.10 -m venv venv
venv\Scripts\activate
pip install torch==2.11.0 torchaudio==2.11.0 torchvision==0.26.0 ^
--index-url https://download.pytorch.org/whl/cu126
pip install triton-windows==3.6.0.post26
pip install vllm-0.21.0+cu126-cp310-cp310-win_amd64.whl
pip install git+https://github.com/aivrar/multi-turboquant.git

Or from source (~60-90 min):
git clone https://github.com/vllm-project/vllm.git vllm-source
cd vllm-source && git checkout v0.21.0 && cd ..
git apply vllm-windows-v5.patch --directory vllm-source
build.bat

Full instructions: README.md
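Before installing, it can help to confirm that the interpreter matches the wheel's tags, since the wheel targets CPython 3.10 on 64-bit Windows. The sketch below splits a wheel filename into its tags and does a rough compatibility check. Both function names are hypothetical, and `pip` remains the authoritative judge of compatibility.

```python
import sys
from typing import NamedTuple


class WheelTags(NamedTuple):
    python: str
    abi: str
    platform: str


def parse_wheel_tags(filename: str) -> WheelTags:
    """Split a wheel filename into its python, ABI and platform tags."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel filename")
    parts = filename[: -len(".whl")].split("-")
    # Layout: name-version[-build]-python-abi-platform; tags are the last three.
    if len(parts) not in (5, 6):
        raise ValueError("unexpected wheel filename layout")
    return WheelTags(*parts[-3:])


def interpreter_matches(tags: WheelTags) -> bool:
    """Rough check for a CPython wheel built for win_amd64 (sketch only)."""
    py_tag = f"cp{sys.version_info.major}{sys.version_info.minor}"
    is_win64 = sys.platform == "win32" and sys.maxsize > 2**32
    return (
        sys.implementation.name == "cpython"
        and tags.python == py_tag
        and tags.platform == "win_amd64"
        and is_win64
    )


tags = parse_wheel_tags("vllm-0.21.0+cu126-cp310-cp310-win_amd64.whl")
print(tags, "compatible:", interpreter_matches(tags))
```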
Known limitations
Unchanged from v0.19.x:
- TQ throughput penalty on our 6 methods (PyTorch-fallback encode/decode). Memory savings real, throughput cost real. The 4 upstream `turboquant_*` variants don't pay this cost; they use fused Triton kernels.
- Single GPU only (NCCL still unavailable on Windows; the patch wires up `FakeProcessGroup` for single-rank operation).
- No FlashAttention 3 or 4, no FlashInfer. No Windows wheels.
- No DeepGEMM, no Quack, no Tilelang, no TokenSpeed-MLA, no NIXL. No Windows wheels. CMake skips DeepGEMM automatically when target arch < SM 9.0.
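To make clear which `kv_cache_dtype` values pay the fallback cost, the routing described in these notes can be sketched as a lookup. This is an illustration under stated assumptions, not the dispatcher code in `vllm/platforms/cuda.py`. The function `pick_backend` is hypothetical, the six names are expanded from the shorthand used in the verification notes, and the upstream variants are matched only by their `turboquant_` prefix because their full names are not listed here.

```python
# The six Multi-TurboQuant method names mentioned in these notes.
MULTI_TQ_DTYPES = frozenset(
    {
        "isoquant3",
        "isoquant4",
        "planarquant3",
        "planarquant4",
        "turboquant25",
        "turboquant35",
    }
)

# Upstream variants are only identified by this prefix in the notes.
UPSTREAM_PREFIX = "turboquant_"


def pick_backend(kv_cache_dtype: str) -> str:
    """Hypothetical routing: return the backend name for a KV cache dtype."""
    if kv_cache_dtype.startswith(UPSTREAM_PREFIX):
        # Upstream native path: fused Triton kernels, no throughput penalty.
        return "TurboQuantBackend"
    if kv_cache_dtype in MULTI_TQ_DTYPES:
        # Patched path: PyTorch-fallback encode/decode, slower.
        return "TritonAttention"
    # Anything else (for example "auto") is left to the regular selection.
    return "default"
```

Note that `turboquant35` has no underscore after `turboquant`, so it falls through the prefix check and lands on the patched path.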
SHA256
b63902f427527ff8aa150744ff6d20c4a91d16c6c85125fbe727a9539a75cd21 vllm-0.21.0+cu126-cp310-cp310-win_amd64.whl
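
One way to check a downloaded wheel against the published digest is a few lines of `hashlib`. The helper names below are illustrative. The expected value is the SHA256 listed above.

```python
import hashlib

EXPECTED_SHA256 = "b63902f427527ff8aa150744ff6d20c4a91d16c6c85125fbe727a9539a75cd21"


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so a large wheel is never fully loaded in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_wheel(path: str, expected: str = EXPECTED_SHA256) -> bool:
    """Return True when the file's digest equals the expected hex string."""
    return sha256_of(path) == expected.strip().lower()
```

Calling `verify_wheel("vllm-0.21.0+cu126-cp310-cp310-win_amd64.whl")` from the download directory should return `True` for an intact file.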