vLLM v0.17.1 — Windows + Triton + Qwen 3.5

@aivrar aivrar released this 21 Mar 06:58
· 13 commits to master since this release

vLLM v0.17.1 Native Windows Build

First native Windows build of vLLM with Triton kernel support, enabling Qwen 3.5 (Gated Delta Networks) and all models supported by vLLM 0.17.1.

What's included

  • Pre-built wheel: vllm-0.17.1+cu126-cp310-cp310-win_amd64.whl (201 MB)
  • Built for Python 3.10, CUDA 12.6, RTX 30xx (SM 8.6)
  • Includes: _C.pyd, _moe_C.pyd, _vllm_fa2_C.pyd, cumem_allocator.pyd
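
The wheel filename encodes the interpreter and platform it was built for. As a minimal sketch (the helper names parse_wheel_tags and matches_interpreter are hypothetical, and the platform check is deliberately simplified), the tags can be compared against the running interpreter before installing:

```python
import sys

def parse_wheel_tags(filename):
    """Split a wheel filename into name, version, and compatibility tags."""
    stem = filename[:-len(".whl")] if filename.endswith(".whl") else filename
    parts = stem.split("-")
    # Layout: name-version[-build]-python-abi-platform; the last three are tags.
    python_tag, abi_tag, platform_tag = parts[-3:]
    return {
        "name": parts[0],
        "version": parts[1],
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }

def matches_interpreter(tags, version_info=None, platform=None):
    """Rough check: same CPython minor version and a Windows host."""
    version_info = version_info or sys.version_info
    platform = platform or sys.platform
    python_ok = tags["python"] == f"cp{version_info[0]}{version_info[1]}"
    # sys.platform is "win32" on all Windows builds, including 64-bit ones.
    # Simplification: this does not verify that the interpreter is 64-bit.
    platform_ok = tags["platform"].startswith("win") and platform == "win32"
    return python_ok and platform_ok

tags = parse_wheel_tags("vllm-0.17.1+cu126-cp310-cp310-win_amd64.whl")
print(tags, matches_interpreter(tags))
```

On any interpreter other than CPython 3.10 on Windows, the check returns False, which is the case where pip would reject the wheel.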

Install

py -3.10 -m venv venv
venv\Scripts\activate
pip install torch==2.10.0 --index-url https://download.pytorch.org/whl/cu126
pip install triton-windows
pip install vllm-0.17.1+cu126-cp310-cp310-win_amd64.whl
pip install "llguidance>=1.3.0,<1.4.0" "xgrammar==0.1.29"
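
After the install steps above, it can help to confirm that the pinned versions actually landed in the venv. The following is a sketch only: check_pins and EXPECTED are hypothetical names, the llguidance range pin is left out for brevity, and local version suffixes such as +cu126 are ignored when comparing.

```python
from importlib.metadata import PackageNotFoundError, version

# Exact pins taken from the install commands above.
EXPECTED = {"torch": "2.10.0", "vllm": "0.17.1", "xgrammar": "0.1.29"}

def base_version(installed):
    """Drop a local version suffix, e.g. '2.10.0+cu126' -> '2.10.0'."""
    return installed.split("+", 1)[0]

def check_pins(expected, get_version=version):
    """Return a status string per package: 'ok', 'missing', or 'mismatch (...)'."""
    report = {}
    for package, wanted in expected.items():
        try:
            installed = get_version(package)
        except PackageNotFoundError:
            report[package] = "missing"
            continue
        if base_version(installed) == wanted:
            report[package] = "ok"
        else:
            report[package] = f"mismatch ({installed})"
    return report

if __name__ == "__main__":
    for package, status in check_pins(EXPECTED).items():
        print(f"{package}: {status}")
```

Run it inside the activated venv; any line that does not read "ok" points at the install step to repeat.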

Run

import os

# Set these before importing vllm.
os.environ['VLLM_HOST_IP'] = '127.0.0.1'   # loopback address for vLLM's internal communication
os.environ['CUDA_VISIBLE_DEVICES'] = '0'   # expose only the first GPU

from vllm import LLM, SamplingParams

# enforce_eager=True runs in eager mode and skips CUDA graph capture.
llm = LLM(model='path/to/Qwen3.5-4B', max_model_len=4096,
          gpu_memory_utilization=0.90, enforce_eager=True, trust_remote_code=True)
output = llm.generate(['Hello!'], sampling_params=SamplingParams(max_tokens=100))
print(output[0].outputs[0].text)

RTX 3090 Performance (Qwen3.5-4B BF16)

Metric                        Value
Weights VRAM                  8.61 GiB
KV cache                      91,872 tokens (11.26 GiB)
Max concurrency @ 4096 ctx    ~63x
Max concurrency @ 2048 ctx    ~126x
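
The memory figures in the table above can be related to the gpu_memory_utilization=0.90 setting with some simple arithmetic. This is a rough sketch, not vLLM's own accounting: the 24 GiB total is an assumed nominal capacity for the RTX 3090, memory_budget is a hypothetical helper, and the per-token figure is only an average over the reported cache size.

```python
GIB = 1024 ** 3

def memory_budget(total_vram_gib, utilization, weights_gib, kv_cache_gib, kv_cache_tokens):
    """Return (budget GiB, headroom GiB, average KV cache bytes per token)."""
    budget_gib = total_vram_gib * utilization
    # What is left for activations and other overhead after weights and KV cache.
    headroom_gib = budget_gib - weights_gib - kv_cache_gib
    bytes_per_token = kv_cache_gib * GIB / kv_cache_tokens
    return budget_gib, headroom_gib, bytes_per_token

budget, headroom, per_token = memory_budget(
    total_vram_gib=24.0,      # assumption: nominal RTX 3090 capacity
    utilization=0.90,         # gpu_memory_utilization from the Run example
    weights_gib=8.61,         # from the table
    kv_cache_gib=11.26,       # from the table
    kv_cache_tokens=91_872,   # from the table
)
print(f"budget {budget:.2f} GiB, headroom {headroom:.2f} GiB, "
      f"{per_token / 1024:.1f} KiB per cached token")
```

Under these assumptions the budget is 21.6 GiB, of which weights and KV cache take 19.87 GiB, leaving about 1.7 GiB for everything else.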

Note: The first inference is slow (1-2 min) while Triton JIT-compiles the GDN kernels. Subsequent requests reuse the compiled kernels and are fast.
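
One way to keep the compilation delay away from real traffic is to issue a throwaway request at startup and time it. The sketch below uses a stand-in class (FakeModel, hypothetical) with an artificial one-off delay so that it runs without a GPU; timed_calls is also a hypothetical helper.

```python
import time

def timed_calls(fn, repeats=3):
    """Call fn repeatedly and return each call's wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations

class FakeModel:
    """Stand-in for an engine whose first call pays a one-off compile cost."""

    def __init__(self):
        self.compiled = False

    def generate(self):
        if not self.compiled:
            time.sleep(0.05)  # simulated JIT compilation, not a real measurement
            self.compiled = True

model = FakeModel()
for index, seconds in enumerate(timed_calls(model.generate)):
    print(f"call {index}: {seconds:.3f}s")
```

With the real engine, the same helper could wrap a one-token request, for example timed_calls(lambda: llm.generate(['Hello!'], sampling_params=SamplingParams(max_tokens=1))), using the llm object from the Run section.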

Stack

Component      Version
vLLM           0.17.1
PyTorch        2.10.0+cu126
Triton         3.6.0 (triton-windows)
CUDA           12.6
Python         3.10
Compiler       MSVC 2022 (19.43)
Target arch    SM 8.6 (RTX 30xx)

New patches (beyond v0.14.2 patchset)

  • PyTorch CUDACachingAllocator.h: #undef small (Windows SDK macro conflict)
  • csrc/topk.cu: C99 designated initializers → positional assignment
  • csrc/moe/grouped_topk_kernels.cu: __attribute__((aligned)) → __align__
  • CUTLASS SM100/103 headers: constexpr dim3 → non-constexpr on MSVC
  • Flash-attn CUTLASS exmy_base.h: is_unsigned_v guard + typename for dependent types
  • FA3 (Hopper) disabled on MSVC: nested #ifdef in lambda macros incompatible
  • CUDA::cublas linked to _moe_C target (new router_gemm.cu)

Building for other GPUs

The pre-built wheel targets SM 8.6 (RTX 30xx). To build for other GPUs, clone the repo, apply vllm-windows-v2.patch to the vLLM v0.17.1 source, and build with TORCH_CUDA_ARCH_LIST set for your GPU:

GPU         TORCH_CUDA_ARCH_LIST
RTX 20xx    7.5
RTX 30xx    8.6
RTX 40xx    8.9
RTX 50xx    12.0 (requires CUDA 13.0+)
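
The table above can be turned into a small lookup for build scripts. This is a sketch under assumptions: ARCH_BY_FAMILY, arch_from_capability, and build_env are hypothetical names, and the actual build command is left to the README.

```python
import os

# Values copied from the table above.
ARCH_BY_FAMILY = {
    "RTX 20xx": "7.5",
    "RTX 30xx": "8.6",
    "RTX 40xx": "8.9",
    "RTX 50xx": "12.0",  # the table notes this requires CUDA 13.0+
}

def arch_from_capability(major, minor):
    """Format a CUDA compute capability pair as a TORCH_CUDA_ARCH_LIST entry."""
    return f"{major}.{minor}"

def build_env(arch_list, base_env=None):
    """Return a copy of the environment with TORCH_CUDA_ARCH_LIST set."""
    env = dict(os.environ if base_env is None else base_env)
    env["TORCH_CUDA_ARCH_LIST"] = arch_list
    return env

# With PyTorch installed, the pair can come from the local GPU:
#   import torch
#   major, minor = torch.cuda.get_device_capability(0)
env = build_env(arch_from_capability(8, 6))
print(env["TORCH_CUDA_ARCH_LIST"])
```

The returned dictionary can be passed as env= to subprocess.run when invoking the build, so the setting does not leak into the parent shell.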

See the Build from Source section in the README.

Previous release

For the v0.14.2 build with the one-click installer, see v0.14.2-win.