vLLM v0.19.0 — Windows + Multi-TurboQuant
vLLM 0.19.0 native Windows build with Multi-TurboQuant KV cache compression.
No WSL, no Docker, no Linux VM. Just download the wheel and pip install.
What's new
- vLLM 0.19.0 base — Gemma 4 support, zero-bubble async scheduling, Model Runner V2, online MXFP8, batched chat completions endpoint, ViT full CUDA graphs.
- Multi-TurboQuant KV cache compression: six methods integrated as native vLLM `kv_cache_dtype` options, with real packed `uint8` storage:
  - `isoquant3` / `isoquant4`: quaternion 4D rotation, no calibration needed
  - `planarquant3` / `planarquant4`: Givens 2D rotation, no calibration needed
  - `turboquant25` / `turboquant35`: WHT + MSE codebook + QJL residual
- 2× the KV cache tokens at the same `gpu_memory_utilization` (verified: 16,336 → 32,672 on Qwen3-14B AWQ-4bit, RTX 3090)
- Custom Windows safetensors reader: `numpy.memmap` + chunked GPU streaming. Loads a 14B model in 6.5 seconds vs 189 seconds with the upstream mmap path. Works on Windows systems with the pagefile disabled.
- All 140 CUDA targets compile clean with MSVC 2022 + CUDA 12.6 + Ninja. 33 source files patched + 1 new file.
- End-to-end test suite: proves that each TQ method actually compresses (not a placebo) and that each one produces output that differs from FP16.
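
The `numpy.memmap` + chunked streaming idea from the list above can be sketched roughly as follows. This is a minimal illustration, not the reader shipped in this build: it relies only on the public safetensors file layout (an 8-byte little-endian header length, a JSON header, then raw tensor bytes), and the function names `read_header` and `iter_tensor_chunks` are hypothetical. The GPU copy is left out so the sketch runs with NumPy alone.

```python
import json
import struct

import numpy as np


def read_header(path):
    """Return (header dict, byte offset where tensor data begins)."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    return header, 8 + header_len


def iter_tensor_chunks(path, name, chunk_bytes=64 * 1024 * 1024):
    """Yield raw uint8 chunks of one tensor, read through numpy.memmap."""
    header, data_start = read_header(path)
    begin, end = header[name]["data_offsets"]  # relative to the data section
    mm = np.memmap(path, dtype=np.uint8, mode="r",
                   offset=data_start + begin, shape=(end - begin,))
    for start in range(0, end - begin, chunk_bytes):
        # np.array(...) copies the slice out of the mapping into ordinary
        # memory; a real loader would pass this buffer to the GPU copy.
        yield np.array(mm[start:start + chunk_bytes])
```

Reading in bounded chunks keeps only one chunk resident at a time, which is presumably what makes the approach workable when the pagefile is disabled; the chunk size here is an arbitrary assumption.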
Install
Download vllm-0.19.0+cu126-cp310-cp310-win_amd64.whl below, then:
py -3.10 -m venv venv
venv\Scripts\activate
pip install torch==2.10.0 torchaudio==2.10.0 torchvision==0.25.0 --index-url https://download.pytorch.org/whl/cu126
pip install triton-windows==3.6.0.post26
pip install vllm-0.19.0+cu126-cp310-cp310-win_amd64.whl
pip install git+https://github.com/rookiemann/multi-turboquant.git
Or run the install.bat one-click installer.
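
After installing, a quick sanity check can confirm that the packages are visible to the interpreter in the active venv. The sketch below is only a suggestion: `missing_packages` is a hypothetical helper, and the import names `torch`, `triton`, and `vllm` are assumed to match the installed distributions. The import name of the Multi-TurboQuant package is not stated here, so it is left out.

```python
import importlib.util


def missing_packages(names=("torch", "triton", "vllm")):
    """Return the top-level modules from `names` that cannot be found."""
    return [n for n in names if importlib.util.find_spec(n) is None]


if __name__ == "__main__":
    missing = missing_packages()
    print("all packages found" if not missing else f"missing: {missing}")
```

This only checks that the modules can be located; it does not load CUDA or verify that the GPU is usable.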
Hello world
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
os.add_dll_directory(r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\bin")  # CUDA runtime DLLs
os.add_dll_directory(r"C:\path\to\venv\Lib\site-packages\torch\lib")  # torch DLLs; adjust to your venv
import sys; sys.modules.setdefault("uvloop", type(sys)("uvloop"))  # stub module: uvloop does not support Windows
from vllm import LLM, SamplingParams
llm = LLM(
    model="path/to/Qwen3-14B-AWQ-4bit",
    dtype="float16",
    kv_cache_dtype="isoquant4",  # 2x more KV cache, near-FP16 quality
    max_model_len=2048,
    gpu_memory_utilization=0.85,
    enforce_eager=True,
)
print(llm.generate(["Explain CUDA streams in 3 sentences:"],
                   SamplingParams(temperature=0.7, max_tokens=200))[0].outputs[0].text)
Trade-offs
- TQ throughput drops ~30-300×: encode/decode runs in PyTorch (no fused Triton kernel yet). Memory savings are real, throughput cost is the price. Best for offline / long-context / batch workloads. Online serving should stay with `auto` or `fp8`.
- Single GPU only: NCCL still unavailable on Windows; the patch wires up `FakeProcessGroup` for single-rank operation.
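
To see where both the memory saving and the extra per-token work come from, here is a generic sketch of packing 4-bit codes into `uint8` storage, two codes per byte. It is not the layout Multi-TurboQuant uses; `pack_4bit` and `unpack_4bit` are hypothetical names, and the rotation, codebook, and any per-block metadata of the real methods are not modeled.

```python
import numpy as np


def pack_4bit(codes):
    """Pack an even-length array of 4-bit codes (0..15) into uint8, two per byte."""
    codes = np.asarray(codes, dtype=np.uint8)
    # Even-indexed codes go in the low nibble, odd-indexed in the high nibble.
    return (codes[0::2] | (codes[1::2] << 4)).astype(np.uint8)


def unpack_4bit(packed):
    """Inverse of pack_4bit: recover the original 4-bit codes."""
    packed = np.asarray(packed, dtype=np.uint8)
    out = np.empty(packed.size * 2, dtype=np.uint8)
    out[0::2] = packed & 0x0F
    out[1::2] = packed >> 4
    return out
```

Every cache write has to run the pack step and every read the unpack step, which is the kind of work that stays slow until it is fused into a single kernel.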
Documentation
- 📖 Wiki — Install, Usage, Multi-TurboQuant deep dive, Benchmarks, Build from source, Architecture, Troubleshooting
- 📝 CHANGELOG — full release notes
- 🛠️ PATCHES.md — per-file breakdown of every change
- 🧪 tests/ — end-to-end test scripts
System requirements
| Component | Minimum | Recommended |
|---|---|---|
| OS | Windows 10 21H2 (x64) | Windows 10 22H2 / Windows 11 |
| GPU | NVIDIA SM 8.0+ | RTX 3090 / 4090 / A6000 |
| VRAM | 12 GB | 24 GB |
| RAM | 16 GB | 32+ GB |
| Python | 3.10.x | 3.10.11 |
| CUDA driver | R545+ | latest |
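
The host-side rows of the table can be checked before installing with a few lines of standard-library Python. This is a rough sketch: `check_host` is a hypothetical helper, it covers only the Python version, OS, and CPU architecture implied by the `cp310` / `win_amd64` wheel tag, and it does not inspect the GPU, VRAM, or driver.

```python
import platform
import sys


def check_host(version=sys.version_info[:2], system=platform.system(),
               machine=platform.machine()):
    """Return a list of problems; an empty list means the host looks OK."""
    problems = []
    if tuple(version) != (3, 10):
        problems.append(f"Python 3.10.x required, found {version[0]}.{version[1]}")
    if system != "Windows":
        problems.append(f"Windows required, found {system}")
    if machine.upper() not in ("AMD64", "X86_64"):
        problems.append(f"x64 required, found {machine}")
    return problems


if __name__ == "__main__":
    for problem in check_host() or ["host looks OK"]:
        print(problem)
```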