Release vLLM v0.17.1 — Windows + Triton + Qwen 3.5 · aivrar/vllm-windows-build

vLLM v0.17.1 Native Windows Build

First native Windows build of vLLM with Triton kernel support, enabling Qwen 3.5 (Gated Delta Networks) and all models supported by vLLM 0.17.1.

What's included

Pre-built wheel — vllm-0.17.1+cu126-cp310-cp310-win_amd64.whl (201 MB)
Built for Python 3.10, CUDA 12.6, RTX 30xx (SM 8.6)
Includes: _C.pyd, _moe_C.pyd, _vllm_fa2_C.pyd, cumem_allocator.pyd
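
The wheel's filename encodes what it was built against, so it can be checked against an interpreter before installing. The sketch below splits the name into its tags using only the standard library. `parse_wheel_tags` and `matches_interpreter` are hypothetical helper names, not part of pip or vLLM, and the check covers only the Python tag and the platform, not the GPU architecture.

```python
import sys

WHEEL = "vllm-0.17.1+cu126-cp310-cp310-win_amd64.whl"

def parse_wheel_tags(filename):
    """Split a wheel filename into name, version, and compatibility tags."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel: {filename}")
    parts = filename[:-len(".whl")].split("-")
    # Layout: name-version[-build]-python-abi-platform.
    # The last three fields are always the compatibility tags.
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected wheel name layout: {filename}")
    return {
        "name": parts[0],
        "version": parts[1],
        "python": parts[-3],
        "abi": parts[-2],
        "platform": parts[-1],
    }

def matches_interpreter(tags, version_info=sys.version_info, platform=sys.platform):
    """Rough check: CPython major.minor and a Windows platform only.

    sys.platform is "win32" on all Windows builds, so this does not
    distinguish 32-bit from 64-bit interpreters.
    """
    wanted = f"cp{version_info[0]}{version_info[1]}"
    on_windows = platform == "win32" and tags["platform"].startswith("win")
    return tags["python"] == wanted and on_windows

tags = parse_wheel_tags(WHEEL)
print(tags)
print("compatible with this interpreter:", matches_interpreter(tags))
```

Run under any other Python version or operating system, the last line prints `False`, which is the point at which pip would also refuse the wheel.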

Install

py -3.10 -m venv venv
venv\Scripts\activate
pip install torch==2.10.0 --index-url https://download.pytorch.org/whl/cu126
pip install triton-windows
pip install vllm-0.17.1+cu126-cp310-cp310-win_amd64.whl
pip install "llguidance>=1.3.0,<1.4.0" "xgrammar==0.1.29"

Run

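import os

# Set these before importing vllm so they are in place when the engine starts.
os.environ['VLLM_HOST_IP'] = '127.0.0.1'   # use the loopback address
os.environ['CUDA_VISIBLE_DEVICES'] = '0'   # expose only the first GPU

from vllm import LLM, SamplingParams

# enforce_eager=True skips CUDA graph capture; gpu_memory_utilization is the
# fraction of GPU memory vLLM is allowed to use.
llm = LLM(model='path/to/Qwen3.5-4B', max_model_len=4096,
          gpu_memory_utilization=0.90, enforce_eager=True, trust_remote_code=True)
output = llm.generate(['Hello!'], sampling_params=SamplingParams(max_tokens=100))
print(output[0].outputs[0].text)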

RTX 3090 Performance (Qwen3.5-4B BF16)

Metric	Value
Weights VRAM	8.61 GiB
KV cache	91,872 tokens (11.26 GiB)
Max concurrency @ 4096 ctx	~63x
Max concurrency @ 2048 ctx	~126x
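
The two concurrency rows differ by a factor of two, which is consistent with the figure scaling inversely with context length. As a rough sketch under that assumption, other context lengths can be estimated from the 4096-token row. The scaling rule and the helper name `estimate_concurrency` are illustrative, and anything other than the two rows in the table is an extrapolation, not a measurement.

```python
# Reference point taken from the table: ~63x at a 4096-token context.
REF_CONTEXT = 4096
REF_CONCURRENCY = 63

def estimate_concurrency(context_len):
    """Estimate max concurrency, assuming it scales as 1 / context length."""
    if context_len <= 0:
        raise ValueError("context_len must be positive")
    return REF_CONCURRENCY * REF_CONTEXT / context_len

for ctx in (1024, 2048, 4096):
    print(f"{ctx:>5} ctx: ~{estimate_concurrency(ctx):.0f}x")
```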

Note: First inference is slow (1-2 min) while Triton JIT-compiles GDN kernels. Subsequent requests are fast.
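
Because of this one-time compilation cost, it can help to send a throwaway warm-up request before measuring latency. Below is a minimal sketch of the pattern, using a stand-in function so that it runs without a GPU. `time_calls` and `fake_generate` are hypothetical names; in real use the callable would wrap `llm.generate` from the Run snippet.

```python
import time

def time_calls(fn, repeats=3):
    """Call fn repeatedly and return each call's wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations

# Stand-in for a model call: slow on first use, fast afterwards.
_state = {"warm": False}

def fake_generate():
    if not _state["warm"]:
        time.sleep(0.2)   # simulates one-time JIT compilation
        _state["warm"] = True
    time.sleep(0.01)      # simulates a normal request

first, *rest = time_calls(fake_generate)
print(f"first call: {first:.2f}s, later calls: {max(rest):.2f}s")

# With the real engine, assuming `llm` and `SamplingParams` from the Run snippet:
# time_calls(lambda: llm.generate(['Hello!'], SamplingParams(max_tokens=8)))
```

Discarding the first duration and reporting only the rest keeps the compile time out of throughput numbers.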

Stack

Component	Version
vLLM	0.17.1
PyTorch	2.10.0+cu126
Triton	3.6.0 (triton-windows)
CUDA	12.6
Python	3.10
Compiler	MSVC 2022 (19.43)
Target arch	SM 8.6 (RTX 30xx)
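
To confirm that an environment matches this stack, the installed package versions can be compared against the table. This is a sketch using the standard library's `importlib.metadata`. The distribution names are assumptions taken from the pip commands in the Install section, and `check_stack` is a hypothetical helper.

```python
from importlib.metadata import PackageNotFoundError, version

# Expected versions from the Stack table; distribution names are assumed
# to match the names used with pip in the Install section.
EXPECTED = {"vllm": "0.17.1", "torch": "2.10.0", "triton-windows": "3.6.0"}

def check_stack(expected, lookup=version):
    """Return {package: (expected, found, ok)}; found is None if not installed."""
    report = {}
    for name, want in expected.items():
        try:
            found = lookup(name)
        except PackageNotFoundError:
            found = None
        # Prefix match so suffixes such as "+cu126" still pass.
        ok = found is not None and found.startswith(want)
        report[name] = (want, found, ok)
    return report

for name, (want, found, ok) in check_stack(EXPECTED).items():
    status = "OK" if ok else "MISMATCH"
    print(f"{name:15} want {want:8} found {found!s:16} {status}")
```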

New patches (beyond v0.14.2 patchset)

PyTorch CUDACachingAllocator.h: #undef small (Windows SDK macro conflict)
csrc/topk.cu: C99 designated initializers → positional assignment
csrc/moe/grouped_topk_kernels.cu: __attribute((aligned)) → __align__
CUTLASS SM100/103 headers: constexpr dim3 → non-constexpr on MSVC
Flash-attn CUTLASS exmy_base.h: is_unsigned_v guard + typename for dependent types
FA3 (Hopper) disabled on MSVC: nested #ifdef in lambda macros incompatible
CUDA::cublas linked to _moe_C target (new router_gemm.cu)

Building for other GPUs

The pre-built wheel targets SM 8.6 (RTX 30xx). To build for other GPUs, clone the repo, apply vllm-windows-v2.patch to the vLLM v0.17.1 source, and build with TORCH_CUDA_ARCH_LIST set for your GPU:

GPU	`TORCH_CUDA_ARCH_LIST`
RTX 20xx	7.5
RTX 30xx	8.6
RTX 40xx	8.9
RTX 50xx	12.0 (requires CUDA 13.0+)

See the Build from Source section in the README.
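
If the GPU generation is not known offhand, the value can be derived from the device's compute capability. The sketch below hard-codes the table above and shows, in `detect_capability`, how PyTorch's `torch.cuda.get_device_capability` could supply the input. `arch_list_for` and `detect_capability` are hypothetical helper names, and only the four architectures listed in the table are covered.

```python
# Mapping copied from the table, keyed by (major, minor) compute capability.
ARCH_BY_CAPABILITY = {
    (7, 5): "7.5",    # RTX 20xx
    (8, 6): "8.6",    # RTX 30xx
    (8, 9): "8.9",    # RTX 40xx
    (12, 0): "12.0",  # RTX 50xx (requires CUDA 13.0+ per the table)
}

def arch_list_for(capability):
    """Return the TORCH_CUDA_ARCH_LIST value for a (major, minor) pair."""
    try:
        return ARCH_BY_CAPABILITY[tuple(capability)]
    except KeyError:
        raise ValueError(f"no entry for compute capability {capability}") from None

def detect_capability():
    """Query GPU 0 via PyTorch; needs a CUDA-enabled torch install."""
    import torch
    return torch.cuda.get_device_capability(0)

print(arch_list_for((8, 6)))

# On a machine with PyTorch and a GPU, print the line to run in cmd.exe:
# print(f"set TORCH_CUDA_ARCH_LIST={arch_list_for(detect_capability())}")
```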

Previous release

For the v0.14.2 build with the one-click installer, see v0.14.2-win.
