vLLM v0.14.2 — Pre-built Windows Binary
vLLM for Windows — One-Click Install
Pre-built vLLM v0.14.2 with all CUDA kernels compiled. No build tools needed.
Download
vllm-0.14.2-win.zip (371 MB)
Usage
- Extract the zip anywhere
- Double-click
launch.bat - On first run, it auto-installs Python 3.10, PyTorch 2.9.1+cu126, and vLLM (~2.5 GB download)
- Select a model from the interactive picker, or pass
--model path\to\model
launch.bat # interactive model selector
launch.bat --model E:\models\Qwen2.5-1.5B # direct launch
launch.bat --model E:\models\Phi-4 --port 8000 --gpu-memory-utilization 0.8What's included
| File | Description |
|---|---|
launch.bat |
One-click launcher (start here) |
install.bat |
Portable multi-stage installer (Python, PyTorch, vLLM) |
vllm_launcher.py |
OpenAI-compatible server with interactive model selector |
build_wheel.py |
Re-package script (advanced, for rebuilding the wheel) |
dist/vllm-*.whl |
Pre-built vLLM wheel (380 MB, all 5 compiled .pyd extensions) |
vllm-windows.patch |
Source patch for building from scratch |
Compiled extensions included
vllm/_C.pyd(142 MB) — core CUDA opsvllm/_moe_C.pyd(91 MB) — mixture of expertsvllm/cumem_allocator.pyd— CUDA memory allocatorvllm/vllm_flash_attn/_vllm_fa2_C.pyd(426 MB) — Flash Attention 2vllm/vllm_flash_attn/_vllm_fa3_C.pyd(626 MB) — Flash Attention 3
Requirements
- Windows 10/11 (64-bit)
- NVIDIA GPU with CUDA Compute Capability 7.0+ (RTX 20xx or newer)
- CUDA 12.6 runtime (driver 560+)
- ~5 GB disk space (after install)
- Internet connection (first run only)
API endpoints
GET /health → {"status": "ok"}
GET /v1/models → list loaded models
POST /v1/chat/completions → OpenAI-compatible chat (with tool calling)
POST /v1/embeddings → text embeddings (--task embed)
Works with any OpenAI-compatible client — just point base_url at http://127.0.0.1:8100/v1.