vLLM v0.17.1 — Windows + Triton + Qwen 3.5
vLLM v0.17.1 Native Windows Build
First native Windows build of vLLM with Triton kernel support, enabling Qwen 3.5 (Gated Delta Networks) and all models supported by vLLM 0.17.1.
What's included
- Pre-built wheel: `vllm-0.17.1+cu126-cp310-cp310-win_amd64.whl` (201 MB)
- Built for Python 3.10, CUDA 12.6, RTX 30xx (SM 8.6)
- Includes: `_C.pyd`, `_moe_C.pyd`, `_vllm_fa2_C.pyd`, `cumem_allocator.pyd`
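The wheel filename encodes which interpreter and platform it targets, which is why the install steps use Python 3.10. The sketch below splits the filename into its tag fields and compares them with the running interpreter. It is a simplified check, not pip's actual resolution logic; the helper names `parse_wheel_filename`, `interpreter_tags`, and `wheel_matches` are made up for illustration.

```python
import sys
import sysconfig


def parse_wheel_filename(filename):
    """Split a wheel filename into its fields (assumes no optional build tag)."""
    stem = filename[:-len(".whl")] if filename.endswith(".whl") else filename
    name, version, python_tag, abi_tag, platform_tag = stem.split("-")
    return {
        "name": name,
        "version": version,
        "python_tag": python_tag,
        "abi_tag": abi_tag,
        "platform_tag": platform_tag,
    }


def interpreter_tags():
    """Return (python_tag, platform_tag) for the running CPython interpreter."""
    python_tag = "cp{}{}".format(sys.version_info.major, sys.version_info.minor)
    platform_tag = sysconfig.get_platform().replace("-", "_").replace(".", "_")
    return python_tag, platform_tag


def wheel_matches(filename, python_tag, platform_tag):
    """True when the wheel's Python and platform tags equal the given ones."""
    tags = parse_wheel_filename(filename)
    return tags["python_tag"] == python_tag and tags["platform_tag"] == platform_tag


WHEEL = "vllm-0.17.1+cu126-cp310-cp310-win_amd64.whl"
print(parse_wheel_filename(WHEEL))
print("matches this interpreter:", wheel_matches(WHEEL, *interpreter_tags()))
```

If the last line prints `False`, pip is likely to reject the wheel as unsupported on that interpreter.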
Install
py -3.10 -m venv venv
venv\Scripts\activate
pip install torch==2.10.0 --index-url https://download.pytorch.org/whl/cu126
pip install triton-windows
pip install vllm-0.17.1+cu126-cp310-cp310-win_amd64.whl
pip install "llguidance>=1.3.0,<1.4.0" "xgrammar==0.1.29"
Run
import os
os.environ['VLLM_HOST_IP'] = '127.0.0.1'
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
from vllm import LLM, SamplingParams
llm = LLM(model='path/to/Qwen3.5-4B', max_model_len=4096,
          gpu_memory_utilization=0.90, enforce_eager=True, trust_remote_code=True)
output = llm.generate(['Hello!'], sampling_params=SamplingParams(max_tokens=100))
print(output[0].outputs[0].text)
RTX 3090 Performance (Qwen3.5-4B BF16)
| Metric | Value |
|---|---|
| Weights VRAM | 8.61 GiB |
| KV cache | 91,872 tokens (11.26 GiB) |
| Max concurrency @ 4096 ctx | ~63x |
| Max concurrency @ 2048 ctx | ~126x |
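As a rough sanity check on the figures above, the arithmetic below compares weights plus KV cache against the memory budget implied by `gpu_memory_utilization=0.90`, and derives the KV cache size per token. The 24 GiB total is the nominal RTX 3090 capacity and is an assumption here, not a measured value. The concurrency figures are taken as given; the sketch only shows that the 2048 figure follows from the 4096 one if concurrency scales inversely with context length.

```python
GIB = 1024 ** 3

# Figures copied from the table above.
weights_gib = 8.61
kv_cache_gib = 11.26
kv_cache_tokens = 91_872
gpu_memory_utilization = 0.90  # value passed to LLM(...) in the Run example

# Assumption: nominal RTX 3090 capacity, not a measured value.
total_vram_gib = 24.0

budget_gib = total_vram_gib * gpu_memory_utilization
accounted_gib = weights_gib + kv_cache_gib
remainder_gib = budget_gib - accounted_gib
kv_bytes_per_token = kv_cache_gib * GIB / kv_cache_tokens


def scale_concurrency(concurrency, from_ctx, to_ctx):
    """Rescale a concurrency figure, assuming it is inversely proportional to context length."""
    return concurrency * from_ctx / to_ctx


print(f"memory budget:        {budget_gib:.2f} GiB")
print(f"weights + KV cache:   {accounted_gib:.2f} GiB")
print(f"left for other uses:  {remainder_gib:.2f} GiB")
print(f"KV cache per token:   {kv_bytes_per_token / 1024:.1f} KiB")
print(f"concurrency @ 2048:   ~{scale_concurrency(63, 4096, 2048):.0f}x")
```

The remainder is presumably what is left for activations, workspace buffers, and other runtime overhead.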
Note: First inference is slow (1-2 min) while Triton JIT-compiles GDN kernels. Subsequent requests are fast.
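To see the warm-up cost separately from steady-state latency, it can help to time the first call on its own. The helper below is a minimal sketch; `time_calls` and `fake_generate` are hypothetical names, and the stand-in workload only imitates a one-time compilation delay so the snippet runs without a GPU.

```python
import statistics
import time


def time_calls(fn, repeats=5):
    """Call fn() `repeats` times; return (first_call_seconds, median_of_later_calls)."""
    if repeats < 2:
        raise ValueError("repeats must be at least 2")
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations[0], statistics.median(durations[1:])


# Stand-in workload: the first call sleeps longer, imitating a one-time JIT cost.
_state = {"warmed_up": False}


def fake_generate():
    time.sleep(0.01 if _state["warmed_up"] else 0.05)
    _state["warmed_up"] = True


first, steady = time_calls(fake_generate)
print(f"first call: {first:.3f} s, later calls (median): {steady:.3f} s")
```

With a real model, replace `fake_generate` with a function that calls `llm.generate(...)` as in the Run example.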
Stack
| Component | Version |
|---|---|
| vLLM | 0.17.1 |
| PyTorch | 2.10.0+cu126 |
| Triton | 3.6.0 (triton-windows) |
| CUDA | 12.6 |
| Python | 3.10 |
| Compiler | MSVC 2022 (19.43) |
| Target arch | SM 8.6 (RTX 30xx) |
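A quick way to confirm that an environment matches this table is to compare installed package versions against it. The sketch below uses only the standard library; the distribution names are taken from the pip commands in the Install section, the comparison is a simple prefix match (so local suffixes such as `+cu126` are accepted), and `check_stack` is a hypothetical helper.

```python
from importlib import metadata

# Distribution names come from the pip commands in the Install section;
# version prefixes come from the Stack table.
EXPECTED = {
    "vllm": "0.17.1",
    "torch": "2.10.0",
    "triton-windows": "3.6.0",
}


def installed_version(dist_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def check_stack(expected, lookup=installed_version):
    """Return {name: (wanted_prefix, found)} for every package that does not match."""
    problems = {}
    for name, wanted in expected.items():
        found = lookup(name)
        if found is None or not found.startswith(wanted):
            problems[name] = (wanted, found)
    return problems


for name, (wanted, found) in check_stack(EXPECTED).items():
    print(f"{name}: expected {wanted}*, found {found}")
```

The script prints nothing when all three packages match; each printed line names a missing or mismatched package.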
New patches (beyond v0.14.2 patchset)
- PyTorch `CUDACachingAllocator.h`: `#undef small` (Windows SDK macro conflict)
- `csrc/topk.cu`: C99 designated initializers → positional assignment
- `csrc/moe/grouped_topk_kernels.cu`: `__attribute((aligned))` → `__align__`
- CUTLASS SM100/103 headers: `constexpr dim3` → non-constexpr on MSVC
- Flash-attn CUTLASS `exmy_base.h`: `is_unsigned_v` guard + `typename` for dependent types
- FA3 (Hopper) disabled on MSVC: nested `#ifdef` in lambda macros incompatible
- `CUDA::cublas` linked to `_moe_C` target (new `router_gemm.cu`)
Building for other GPUs
The pre-built wheel targets SM 8.6 (RTX 30xx). To build for other GPUs, clone the repo, apply vllm-windows-v2.patch to the vLLM v0.17.1 source, and build with TORCH_CUDA_ARCH_LIST set to the value for your GPU:
| GPU | TORCH_CUDA_ARCH_LIST |
|---|---|
| RTX 20xx | 7.5 |
| RTX 30xx | 8.6 |
| RTX 40xx | 8.9 |
| RTX 50xx | 12.0 (requires CUDA 13.0+) |
See the Build from Source section in the README.
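If you are unsure which value applies, the compute capability reported by PyTorch maps directly onto the arch-list format. The sketch below assumes PyTorch with CUDA support is already installed and falls back gracefully when it is not; `arch_from_capability` and `detect_capability` are hypothetical helper names, and the lookup table simply mirrors the one above.

```python
# GPU series -> TORCH_CUDA_ARCH_LIST value, copied from the table above.
ARCH_BY_SERIES = {
    "RTX 20xx": "7.5",
    "RTX 30xx": "8.6",
    "RTX 40xx": "8.9",
    "RTX 50xx": "12.0",
}


def arch_from_capability(capability):
    """Format a (major, minor) compute capability as an arch-list entry, e.g. (8, 6) -> '8.6'."""
    major, minor = capability
    return f"{major}.{minor}"


def detect_capability():
    """Return the compute capability of GPU 0, or None when it cannot be queried."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


capability = detect_capability()
if capability is None:
    print("No CUDA device visible; pick a value from the table instead.")
else:
    arch = arch_from_capability(capability)
    known = arch in ARCH_BY_SERIES.values()
    print(f"TORCH_CUDA_ARCH_LIST={arch}" + ("" if known else " (not in the table above)"))
```

The printed value is only a starting point; the build steps themselves are described in the README.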
Previous release
For the v0.14.2 build with the one-click installer, see v0.14.2-win.