vLLM v0.19.0 — Windows + Multi-TurboQuant

@aivrar aivrar released this 12 Apr 12:27

vLLM 0.19.0 native Windows build with Multi-TurboQuant KV cache compression.

No WSL, no Docker, no Linux VM. Just download the wheel and pip install.

What's new

  • vLLM 0.19.0 base — Gemma 4 support, zero-bubble async scheduling, Model Runner V2, online MXFP8, batched chat completions endpoint, ViT full CUDA graphs.
  • Multi-TurboQuant KV cache compression — six methods integrated as native vLLM kv_cache_dtype options, with real packed uint8 storage:
    • isoquant3 / isoquant4 — quaternion 4D rotation, no calibration needed
    • planarquant3 / planarquant4 — Givens 2D rotation, no calibration needed
    • turboquant25 / turboquant35 — WHT + MSE codebook + QJL residual
    • 2× more KV cache tokens at the same gpu_memory_utilization (verified: 16,336 → 32,672 on Qwen3-14B AWQ-4bit, RTX 3090)
  • Custom Windows safetensors reader: numpy.memmap + chunked GPU streaming. Loads a 14B model in 6.5 seconds vs 189 seconds with the upstream mmap path. Works on Windows systems with the pagefile disabled.
  • All 140 CUDA targets compile clean with MSVC 2022 + CUDA 12.6 + Ninja. 33 source files patched + 1 new file.
  • End-to-end test suite: verifies that each TQ method actually compresses the KV cache (not a placebo) and that each one produces output that differs from FP16.
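
The memmap-plus-chunking idea behind the custom reader can be illustrated with plain NumPy. This is only a sketch under stated assumptions, not the project's reader: it does not parse the safetensors header, the helper names `iter_chunks` and `load_in_chunks` and the chunk size are hypothetical, and the per-chunk copy into a preallocated host array stands in for the host-to-GPU transfer.

```python
import os
import tempfile

import numpy as np


def iter_chunks(path, dtype, count, chunk_elems):
    """Yield (start, chunk) pairs read lazily from a flat binary file."""
    # np.memmap maps the file without reading it all into RAM up front.
    mm = np.memmap(path, dtype=dtype, mode="r", shape=(count,))
    try:
        for start in range(0, count, chunk_elems):
            # np.array(...) copies only this slice out of the mapping.
            yield start, np.array(mm[start:start + chunk_elems])
    finally:
        del mm  # drop the mapping so the file can be removed on Windows


def load_in_chunks(path, dtype, count, chunk_elems=1 << 20):
    """Assemble the full array one chunk at a time."""
    # Stand-in for a preallocated GPU tensor (assumption): a real loader
    # would copy each chunk to the device here instead of into host memory.
    out = np.empty(count, dtype=dtype)
    for start, chunk in iter_chunks(path, dtype, count, chunk_elems):
        out[start:start + chunk.size] = chunk
    return out


# Round-trip demo on a small temporary file of fake "weights".
weights = np.random.default_rng(0).standard_normal(10_000).astype(np.float16)
fd, tmp_path = tempfile.mkstemp(suffix=".bin")
os.close(fd)
weights.tofile(tmp_path)
loaded = load_in_chunks(tmp_path, np.float16, weights.size, chunk_elems=4096)
os.remove(tmp_path)
```

Because only one chunk is materialized at a time, peak host memory in this sketch is bounded by the chunk size plus the destination buffer rather than by the file size.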

Install

Download vllm-0.19.0+cu126-cp310-cp310-win_amd64.whl below, then:

py -3.10 -m venv venv
venv\Scripts\activate
pip install torch==2.10.0 torchaudio==2.10.0 torchvision==0.25.0 --index-url https://download.pytorch.org/whl/cu126
pip install triton-windows==3.6.0.post26
pip install vllm-0.19.0+cu126-cp310-cp310-win_amd64.whl
pip install git+https://github.com/rookiemann/multi-turboquant.git

Or run the install.bat one-click installer.
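
The wheel filename encodes the interpreter and platform it targets (cp310, win_amd64), which is why the steps above create a Python 3.10 venv. As a rough pre-install check, the sketch below compares those tags with the running interpreter. `parse_wheel_tags` and `matches_interpreter` are hypothetical helper names, and the parsing assumes the standard wheel layout without the optional build tag; pip performs the authoritative check itself.

```python
import sys
import sysconfig

WHEEL = "vllm-0.19.0+cu126-cp310-cp310-win_amd64.whl"


def parse_wheel_tags(filename):
    """Split a wheel filename into its version and compatibility tags."""
    stem = filename[:-len(".whl")]
    # Assumes no optional build tag, so the name splits into exactly 5 parts.
    name, version, python_tag, abi_tag, platform_tag = stem.split("-")
    return {"name": name, "version": version, "python": python_tag,
            "abi": abi_tag, "platform": platform_tag}


def matches_interpreter(tags):
    """Rough check that the running CPython matches the wheel's tags."""
    current_python = f"cp{sys.version_info.major}{sys.version_info.minor}"
    # sysconfig.get_platform() returns e.g. "win-amd64"; wheel tags use "_".
    current_platform = sysconfig.get_platform().replace("-", "_").replace(".", "_")
    return tags["python"] == current_python and tags["platform"] == current_platform


tags = parse_wheel_tags(WHEEL)
if not matches_interpreter(tags):
    print(f"{WHEEL} targets {tags['python']}/{tags['platform']}; "
          "this interpreter does not match.")
```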

Hello world

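import os
import sys

# Allocator setting; must be in the environment before torch initializes CUDA.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
# Let Windows locate the CUDA and torch DLLs; adjust both paths to your setup.
os.add_dll_directory(r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\bin")
os.add_dll_directory(r"C:\path\to\venv\Lib\site-packages\torch\lib")

# uvloop is not available on Windows; register an empty stub module
# so that `import uvloop` does not fail.
sys.modules.setdefault("uvloop", type(sys)("uvloop"))
from vllm import LLM, SamplingParams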

llm = LLM(
    model="path/to/Qwen3-14B-AWQ-4bit",
    dtype="float16",
    kv_cache_dtype="isoquant4",   # 2x more KV cache, near-FP16 quality
    max_model_len=2048,
    gpu_memory_utilization=0.85,
    enforce_eager=True,
)
print(llm.generate(["Explain CUDA streams in 3 sentences:"],
                   SamplingParams(temperature=0.7, max_tokens=200))[0].outputs[0].text)
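
The two hard-coded DLL paths in the example can also be derived at runtime. The sketch below assumes the CUDA Toolkit installer has set the `CUDA_PATH` environment variable and that torch is installed in the active environment's site-packages. `dll_dirs` is a hypothetical helper, not part of vLLM or this build.

```python
import os
import sys
from pathlib import PureWindowsPath


def dll_dirs(env_prefix, cuda_path=None):
    """Return the DLL directories the example registers, as strings."""
    dirs = []
    if cuda_path:
        # e.g. <CUDA_PATH>\bin, where the CUDA runtime DLLs live
        dirs.append(str(PureWindowsPath(cuda_path) / "bin"))
    # torch ships its DLLs under Lib\site-packages\torch\lib in a venv
    dirs.append(str(PureWindowsPath(env_prefix) / "Lib" / "site-packages" / "torch" / "lib"))
    return dirs


# os.add_dll_directory exists only on Windows (Python 3.8+).
if hasattr(os, "add_dll_directory"):
    for d in dll_dirs(sys.prefix, os.environ.get("CUDA_PATH")):
        if os.path.isdir(d):  # skip directories that do not exist
            os.add_dll_directory(d)
```

Run this before `from vllm import LLM, SamplingParams`, in place of the two explicit `os.add_dll_directory` calls.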

Trade-offs

  • TQ throughput drops ~30-300× because encode/decode runs in PyTorch (no fused Triton kernel yet). The memory savings are real; the throughput cost is the price. Best for offline, long-context, and batch workloads. Online serving should stay with kv_cache_dtype auto or fp8.
  • Single GPU only — NCCL still unavailable on Windows; the patch wires up FakeProcessGroup for single-rank operation.
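
To make the encode/decode cost and the packed uint8 storage more concrete, here is a minimal NumPy sketch that packs two 4-bit codes into each byte and unpacks them again. It is not Multi-TurboQuant's storage layout or quantizer (there is no rotation, codebook, or residual step); `pack4` and `unpack4` are hypothetical names used only for illustration.

```python
import numpy as np


def pack4(codes):
    """Pack 4-bit codes (values 0..15, even count) two per uint8 byte."""
    codes = np.asarray(codes, dtype=np.uint8)
    if codes.size % 2 or codes.max(initial=0) > 15:
        raise ValueError("need an even number of codes in the range 0..15")
    # Low nibble holds the even-indexed code, high nibble the odd-indexed one.
    return codes[0::2] | (codes[1::2] << 4)


def unpack4(packed):
    """Inverse of pack4: recover the original sequence of 4-bit codes."""
    packed = np.asarray(packed, dtype=np.uint8)
    out = np.empty(packed.size * 2, dtype=np.uint8)
    out[0::2] = packed & 0x0F  # low nibbles
    out[1::2] = packed >> 4    # high nibbles
    return out


codes = np.random.default_rng(0).integers(0, 16, size=128, dtype=np.uint8)
packed = pack4(codes)
restored = unpack4(packed)
```

Here 128 codes occupy 64 bytes. Every cache write and read pays for a pack or unpack step like this on top of the quantization itself, which is the kind of work a fused kernel would absorb.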

Documentation

  • 📖 Wiki — Install, Usage, Multi-TurboQuant deep dive, Benchmarks, Build from source, Architecture, Troubleshooting
  • 📝 CHANGELOG — full release notes
  • 🛠️ PATCHES.md — per-file breakdown of every change
  • 🧪 tests/ — end-to-end test scripts

System requirements

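| Component   | Minimum               | Recommended                  |
|-------------|-----------------------|------------------------------|
| OS          | Windows 10 21H2 (x64) | Windows 10 22H2 / Windows 11 |
| GPU         | NVIDIA SM 8.0+        | RTX 3090 / 4090 / A6000      |
| VRAM        | 12 GB                 | 24 GB                        |
| RAM         | 16 GB                 | 32+ GB                       |
| Python      | 3.10.x                | 3.10.11                      |
| CUDA driver | R545+                 | latest                       |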