Release vLLM v0.19.0 — Windows + Multi-TurboQuant · aivrar/vllm-windows-build

vLLM 0.19.0 native Windows build with Multi-TurboQuant KV cache compression.

No WSL, no Docker, no Linux VM. Just download the wheel and pip install.

What's new

vLLM 0.19.0 base — Gemma 4 support, zero-bubble async scheduling, Model Runner V2, online MXFP8, batched chat completions endpoint, ViT full CUDA graphs.
Multi-TurboQuant KV cache compression — six methods integrated as native vLLM kv_cache_dtype options, with real packed uint8 storage:
- isoquant3 / isoquant4 — quaternion 4D rotation, no calibration needed
- planarquant3 / planarquant4 — Givens 2D rotation, no calibration needed
- turboquant25 / turboquant35 — WHT + MSE codebook + QJL residual
- Result: 2× more KV cache tokens at the same gpu_memory_utilization (measured: 16,336 → 32,672 tokens on Qwen3-14B AWQ-4bit, RTX 3090)
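The capacity figure above can be sanity-checked with a little arithmetic. The sketch below is illustrative only: the layer count, KV head count, and head dimension are assumed values for Qwen3-14B (they are not stated in these notes), and the 1-byte-per-element figure is chosen purely because it reproduces the reported 2× ratio. It says nothing about how the packed layout actually stores codes, scales, or residual bits.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_elem: float) -> float:
    """Bytes of KV cache one token occupies: K and V, for every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def tokens_that_fit(budget_bytes: float, per_token_bytes: float) -> int:
    """Whole tokens that fit in a fixed KV cache budget."""
    return int(budget_bytes // per_token_bytes)


# Assumed Qwen3-14B attention shape (hypothetical values, not from these notes).
LAYERS, KV_HEADS, HEAD_DIM = 40, 8, 128

fp16_per_token = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, 2.0)
# Size the budget so it holds exactly the reported FP16 token count.
budget = 16_336 * fp16_per_token

# Halving the effective bytes per element doubles the capacity.
half_per_token = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, 1.0)
fp16_tokens = tokens_that_fit(budget, fp16_per_token)
tq_tokens = tokens_that_fit(budget, half_per_token)
print(fp16_tokens, tq_tokens, tq_tokens / fp16_tokens)
```

With the same budget, the token count scales inversely with the bytes stored per cached element, which is why the comparison is only meaningful at a fixed gpu_memory_utilization.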
Custom Windows safetensors reader — numpy.memmap + chunked GPU streaming. Loads a 14B model in 6.5 seconds vs 189 seconds with the upstream mmap path. Works on Windows systems with the pagefile disabled.
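A minimal sketch of the memmap-plus-chunked-streaming idea described above, not the project's actual reader. The function names are hypothetical, the file is a raw array rather than a real safetensors file, and a NumPy buffer stands in for the preallocated GPU tensor so the example runs without a GPU.

```python
import os
import tempfile

import numpy as np


def stream_chunks(path, dtype, count, offset=0, chunk_elems=1 << 20):
    """Yield (start, chunk) pairs from a memory-mapped region of a file."""
    mm = np.memmap(path, dtype=dtype, mode="r", offset=offset, shape=(count,))
    try:
        for start in range(0, count, chunk_elems):
            # np.array(...) copies only this slice out of the mapping.
            yield start, np.array(mm[start:start + chunk_elems])
    finally:
        del mm  # drop the mapping so the file can be released


def load_tensor(path, dtype, count, offset=0, chunk_elems=1 << 20):
    """Assemble a tensor chunk by chunk into a preallocated destination."""
    dest = np.empty(count, dtype=dtype)  # stand-in for a GPU tensor
    for start, chunk in stream_chunks(path, dtype, count, offset, chunk_elems):
        dest[start:start + len(chunk)] = chunk  # a real loader would copy to the GPU here
    return dest


# Round-trip demo on a temporary file.
data = np.arange(10_000, dtype=np.float16)
with tempfile.NamedTemporaryFile(delete=False, suffix=".bin") as f:
    f.write(data.tobytes())
    tmp_path = f.name
loaded = load_tensor(tmp_path, np.float16, data.size, chunk_elems=4096)
os.remove(tmp_path)
print(np.array_equal(loaded, data))
```

Copying one bounded chunk at a time keeps the host-side working set small, which is plausibly why this style of loader tolerates a disabled pagefile; the chunk size here is an arbitrary choice.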
All 140 CUDA targets compile cleanly with MSVC 2022 + CUDA 12.6 + Ninja. 33 source files patched + 1 new file.
End-to-end test suite: verifies that each TQ method actually compresses the cache (it is not a placebo) and that each one produces output distinct from the FP16 baseline.

Install

Download vllm-0.19.0+cu126-cp310-cp310-win_amd64.whl below, then:

py -3.10 -m venv venv
venv\Scripts\activate
pip install torch==2.10.0 torchaudio==2.10.0 torchvision==0.25.0 --index-url https://download.pytorch.org/whl/cu126
pip install triton-windows==3.6.0.post26
pip install vllm-0.19.0+cu126-cp310-cp310-win_amd64.whl
pip install git+https://github.com/rookiemann/multi-turboquant.git

Or run the install.bat one-click installer.

Hello world

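import os
import sys
import types

# Set allocator options and DLL search paths before torch/vllm are imported.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
os.add_dll_directory(r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\bin")
os.add_dll_directory(r"C:\path\to\venv\Lib\site-packages\torch\lib")  # adjust to your venv

# uvloop does not support Windows; register an empty stub so "import uvloop" succeeds.
sys.modules.setdefault("uvloop", types.ModuleType("uvloop"))
from vllm import LLM, SamplingParams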

llm = LLM(
    model="path/to/Qwen3-14B-AWQ-4bit",
    dtype="float16",
    kv_cache_dtype="isoquant4",   # 2x more KV cache, near-FP16 quality
    max_model_len=2048,
    gpu_memory_utilization=0.85,
    enforce_eager=True,
)
print(llm.generate(["Explain CUDA streams in 3 sentences:"],
                   SamplingParams(temperature=0.7, max_tokens=200))[0].outputs[0].text)

Trade-offs

TQ throughput drops by roughly 30-300× because encode/decode runs in plain PyTorch (no fused Triton kernel yet). The memory savings are real; the throughput cost is the price. Best for offline, long-context, or batch workloads. Online serving should stay with kv_cache_dtype auto or fp8.
Single GPU only: NCCL is still unavailable on Windows, so the patch wires up FakeProcessGroup for single-rank operation.

Documentation

📖 Wiki — Install, Usage, Multi-TurboQuant deep dive, Benchmarks, Build from source, Architecture, Troubleshooting
📝 CHANGELOG — full release notes
🛠️ PATCHES.md — per-file breakdown of every change
🧪 tests/ — end-to-end test scripts

System requirements

Component	Minimum	Recommended
OS	Windows 10 21H2 (x64)	Windows 10 22H2 / Windows 11
GPU	NVIDIA SM 8.0+	RTX 3090 / 4090 / A6000
VRAM	12 GB	24 GB
RAM	16 GB	32+ GB
Python	3.10.x	3.10.11
CUDA driver	R545+	latest
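
The minimums in the table can be checked before installing. The following is a rough preflight sketch, not something shipped with this release: the function names are hypothetical, the GPU check only runs if torch is already installed, and driver version and system RAM are not checked.

```python
import platform
import sys


def check_python(version_info=sys.version_info) -> bool:
    """The wheel is tagged cp310, so only CPython 3.10 can install it."""
    return tuple(version_info[:2]) == (3, 10)


def check_os(system=None, machine=None) -> bool:
    """True on 64-bit x86 Windows (platform.machine() reports AMD64 there)."""
    system = system or platform.system()
    machine = machine or platform.machine()
    return system == "Windows" and machine.upper() in ("AMD64", "X86_64")


def check_gpu(min_cc=(8, 0), min_vram_gb=12):
    """Return True/False, or None when torch or a CUDA device is unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    cc = torch.cuda.get_device_capability(0)
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    # Round because cards report slightly less than their nominal size.
    return cc >= min_cc and round(vram_gb) >= min_vram_gb


if __name__ == "__main__":
    print("Python 3.10:", check_python())
    print("Windows x64:", check_os())
    print("GPU SM 8.0+ with 12 GB:", check_gpu())
```

A None result from check_gpu means the check could not run, for example before the torch install step, and should not be read as a failure.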
