A concise, end‑to‑end reference for:
- Standing up a modern WSL2 Ubuntu 24.04 environment on Windows
- (Optional) Installing a full KDE Plasma desktop reachable via XRDP
- Enabling GPU acceleration (CUDA + PyTorch) inside WSL for local AI workloads
- Installing Docker Engine + NVIDIA Container Toolkit for GPU containers
- Running and validating high‑throughput GEMM benchmarks (`bench.py`, `bench_tests.py`)
This master README consolidates and cross‑links the two detailed guides and the Python benchmarking utilities contained in the repo.
| Path | Purpose |
|---|---|
| `wsl-kde-xrdp.md` | Step‑by‑step KDE Plasma + XRDP desktop enablement (optional GUI path) |
| `wsl2-gpu-ai-docker-setup.md` | Core WSL GPU + CUDA + Docker + PyTorch environment bootstrap, with benchmark usage notes |
| `python/bench.py` | Stand‑alone high‑throughput GEMM (matrix multiply) benchmark (TF32 / FP16 / BF16 where supported, optional CUDA Graphs) |
| `python/bench_tests.py` | Automated stress & validation matrix across sizes/modes/graphs; produces summaries & optional CSV |
| `README.md` | (This file) Unified overview and quick navigation |
- Install / update WSL2 (Admin PowerShell):
```powershell
wsl --install   # if first time
wsl --update
wsl --status
```
- Install Ubuntu 24.04 (if not already installed):
```powershell
wsl --install -d Ubuntu-24.04
```
- Enable systemd inside WSL (once, from the Ubuntu shell):
```bash
ps -p 1 -o comm=   # If this does not print 'systemd':
echo -e "[boot]\nsystemd=true" | sudo tee /etc/wsl.conf
# Then exit and run `wsl --shutdown` from the Windows side.
```
- Install the CUDA toolkit (the driver is already handled by the Windows NVIDIA driver) — follow the repo script in `wsl2-gpu-ai-docker-setup.md`, Section 3.
- Install Docker Engine + NVIDIA Container Toolkit — Sections 5 & 6 of the same guide.
- Create a Python venv + install PyTorch CUDA wheels — Section 7.
- Run a benchmark:
```bash
source ~/.venvs/ai/bin/activate
python python/bench.py --size 4096 --iters 30
```
- (Optional) Run the validation matrix:
```bash
python python/bench_tests.py
```
For richer explanations and rationale, read the detailed guide: `wsl2-gpu-ai-docker-setup.md`.
If you need a remoteable full Linux desktop (GUI IDEs, visualization tools) accessible via the Windows Remote Desktop client, use `wsl-kde-xrdp.md`. If you only need terminals and VS Code (WSLg already provides basic GUI support), you can skip it.
- Install / verify with `wsl --list --verbose` and `lsb_release -a`.
- Keep WSL updated (`wsl --update`).
- Enable systemd for smooth service management (Docker, etc.).
- The Windows NVIDIA driver is the single authoritative driver; do not install a Linux kernel driver inside WSL.
- Use NVIDIA's WSL CUDA repository to get user‑space CUDA toolkit binaries (e.g., `nvcc`).
- Avoid globally forcing `LD_LIBRARY_PATH` to CUDA; this preserves WSLg's D3D12 stack for GUI acceleration.
- The scripts detect available capabilities: TF32/BF16 modes are only attempted on Ampere (SM 8.0) or newer (see the sketch below).
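The check itself is simple; here is a minimal sketch of the kind of capability gate described above (the helper name `supported_modes` is illustrative, not the repo's actual function):

```python
import torch

def supported_modes():
    """Illustrative capability gate: FP16 everywhere, TF32/BF16 only on
    Ampere (SM 8.0) or newer. Not the repo's actual function."""
    modes = ["fp16"]
    if torch.cuda.is_available():
        major, _minor = torch.cuda.get_device_capability(0)
        if major >= 8:  # Ampere or newer
            modes += ["tf32", "bf16"]
    return modes
```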
- Install Docker CE packages; enable and start the service under systemd.
- Install `nvidia-container-toolkit` and run `sudo nvidia-ctk runtime configure --runtime=docker`.
- Validate with:
```bash
docker run --rm --gpus all nvidia/cuda:12.8.0-base-ubuntu22.04 nvidia-smi
```
- Create an isolated venv: `python3 -m venv ~/.venvs/ai`.
- Activate: `source ~/.venvs/ai/bin/activate`.
- Install CUDA‑enabled PyTorch wheels (example uses the CUDA 12.1 index):
```bash
pip install -U torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
```
- Sanity check inside Python:
```python
import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))
```
Use `nvidia-smi` (PowerShell) for live telemetry, e.g.:
```powershell
nvidia-smi --query-gpu=name,temperature.gpu,utilization.gpu,power.draw --format=csv -l 1
```
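If you prefer to sample the same fields programmatically (e.g., to log alongside benchmark runs), a small sketch is below; it uses only query flags `nvidia-smi` actually supports and works inside WSL as well:

```python
import csv
import io
import subprocess

def gpu_telemetry():
    """Grab one sample of the fields queried above via nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=name,temperature.gpu,utilization.gpu,power.draw",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    name, temp, util, power = next(csv.reader(io.StringIO(out)))
    return {"name": name.strip(), "temp_c": float(temp),
            "util_pct": float(util), "power_w": float(power)}
```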
- Install via `tasksel`, selecting KDE.
- Install `xrdp` and connect with the Windows `mstsc` client, choosing session type Xorg.
- Useful if you want a full Linux desktop vs. WSLg's per‑app windows.
High‑throughput GEMM benchmark focusing on TF32 / FP16 / BF16 performance (automatically skipping unavailable precisions) and optional CUDA Graphs. Key characteristics:
- Uses CUDA events for precise timing (see the sketch below).
- Auto warmup phase (customizable via `--warmup`).
- Static allocations to accommodate CUDA Graph capture.
- Reports average ms/iter + achieved TFLOP/s per mode & size.
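The timing follows the standard CUDA-events idiom; a self‑contained sketch of that pattern (assumes a CUDA device; not the repo's exact code):

```python
import torch

def time_gemm(n=4096, iters=30, warmup=5, dtype=torch.float16):
    """Warm up, then time n x n GEMMs with CUDA events.
    Returns (avg ms/iter, achieved TFLOP/s)."""
    a = torch.randn(n, n, device="cuda", dtype=dtype)    # static inputs
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    out = torch.empty(n, n, device="cuda", dtype=dtype)  # static output
    for _ in range(warmup):                  # populate autotuning caches
        torch.matmul(a, b, out=out)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        torch.matmul(a, b, out=out)
    end.record()
    torch.cuda.synchronize()
    avg_ms = start.elapsed_time(end) / iters
    tflops = (2 * n**3) / (avg_ms * 1e-3) / 1e12   # 2*N^3 FLOPs per GEMM
    return avg_ms, tflops
```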
| Flag | Meaning |
|---|---|
| `--size N` | Single cubic matrix (m = n = k = N) |
| `--sizes N1 N2 ...` | Multiple explicit sizes |
| `--sweep START STOP STEP` | Generate a size range |
| `--iters K` | Timed iterations (default 30) |
| `--warmup K` | Override warmup iteration count |
| `--modes tf32,fp16,bf16` | Comma‑delimited subset of modes |
| `--graphs` | Enable CUDA Graph capture/replay |
| `--csv file.csv` | Export results to CSV |
```bash
# Default (4096, all modes):
python bench.py

# Large size with CUDA Graphs:
python bench.py --size 8192 --iters 50 --graphs

# Multiple sizes + CSV:
python bench.py --sizes 2048 4096 6144 8192 --graphs --csv results.csv
```

Automated matrix for functional + performance regression‑style coverage.
- Iterates a progressive ladder of sizes (tiny → large) + modes + graphs (on/off).
- Dynamically adjusts iteration counts to balance timing stability vs. runtime.
- Computes an operational‑intensity heuristic & GFLOP/s per SM.
- Prints a per‑test table and summary statistics (median / mean / P90 / min / max / stdev) per (mode, graphs) combo (see the sketch below).
- Contains an embedded negative test (`test_invalid_mode`).
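For reference, those summary statistics can be computed with the standard library alone; a sketch (the real script's exact percentile method may differ):

```python
import statistics

def summarize(samples):
    """Median / mean / P90 / min / max / stdev for one (mode, graphs) combo."""
    s = sorted(samples)
    p90 = s[round(0.9 * (len(s) - 1))]   # nearest-rank style P90
    return {
        "median": statistics.median(s),
        "mean": statistics.fmean(s),
        "p90": p90,
        "min": s[0],
        "max": s[-1],
        "stdev": statistics.stdev(s) if len(s) > 1 else 0.0,
    }
```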
| Variable | Effect |
|---|---|
| `STRESS=1` | Adds very large sizes (6144, 8192) |
| `TEST_CSV=path.csv` | Writes raw per‑test rows to CSV |
| `VERBOSE=1` | Emits full tracebacks for failures |
```bash
# Standard run
python bench_tests.py

# Include stress sizes + export CSV
STRESS=1 TEST_CSV=matrix.csv python bench_tests.py

# Verbose errors if something fails
VERBOSE=1 python bench_tests.py
```

Interpreting the output:
- `TFLOP/s` gives aggregate throughput; compare across modes to understand precision tradeoffs (worked example below).
- `AVG_MS` is the latency per iteration for the given GEMM and mode.
- `GFLOP/S/SM` provides a rough per‑SM scaling sanity check (depends on accurate SM count inference).
- If CUDA Graphs provide a noticeable improvement, you will see a consistent TFLOP/s uplift and/or lower ms.
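As a worked example of the throughput arithmetic (the timing value here is illustrative, not measured):

```python
# A cubic GEMM does 2*N^3 FLOPs, so for N=4096 at AVG_MS = 5.0 ms:
n, avg_ms = 4096, 5.0
tflops = (2 * n**3) / (avg_ms * 1e-3) / 1e12
print(f"{tflops:.1f} TFLOP/s")  # -> 27.5
```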
- Stand up the baseline WSL + CUDA + PyTorch (no desktop). Validate `torch.cuda.is_available()`.
- Run `bench.py` at a modest size (4096) to establish baseline TF32/FP16/BF16 numbers.
- Enable `--graphs` and compare. Retain results (CSV) for future regressions (see the sketch below).
- Periodically run `bench_tests.py` (possibly with `STRESS=1`) after driver / PyTorch updates.
- (Optional) Add KDE + XRDP later if a full desktop is required.
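One way to use the retained CSVs for regression tracking, assuming columns named `size`, `mode`, and `tflops` (adjust to the actual CSV header):

```python
import csv

def flag_regressions(baseline_path, new_path, tol=0.05):
    """Print any (size, mode) whose TFLOP/s dropped more than tol vs. baseline."""
    def load(path):
        with open(path, newline="") as f:
            return {(r["size"], r["mode"]): float(r["tflops"])
                    for r in csv.DictReader(f)}
    base, new = load(baseline_path), load(new_path)
    for key, old in base.items():
        if key in new and new[key] < old * (1 - tol):
            print(f"REGRESSION {key}: {old:.1f} -> {new[key]:.1f} TFLOP/s")
```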
| Symptom | Likely Cause | Fix |
|---|---|---|
| CUDA not available in Python | venv created before installing driver / CUDA, or running in wrong environment | Activate correct venv; verify Windows NVIDIA driver; reinstall PyTorch with CUDA wheels |
| BF16/TF32 rows missing | GPU does not support those precisions (pre‑Ampere) | Expected; upgrade GPU if needed |
| `docker: Error response from daemon: could not select device driver` | NVIDIA Container Toolkit not configured | Re‑run `sudo nvidia-ctk runtime configure --runtime=docker`, then restart Docker |
| `nvidia-smi` works on Windows but not in container | Missing `--gpus all` flag | Add `--gpus all` to `docker run` |
| Bench graphs warn & disable | Capture unsafe due to allocations or older driver | Accept fallback; ensure static allocations are not modified |
| Unrealistic TFLOP/s for size=1 | Timing noise | Script caps tiny‑size outliers; ignore tiny‑size metrics |
- Add new dtypes (e.g., FP8) by extending mode handling in `bench.py` (see the sketch below).
- Integrate additional kernels (convolution, attention) following the same timing & graph pattern.
- Feed CSV outputs into a dashboard (Prometheus / Grafana, or lightweight HTML) for historical tracking.
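A mode‑to‑dtype table is one natural extension point; a hypothetical sketch (not the repo's actual structure):

```python
import torch

# Hypothetical mode table; adding a precision becomes a one-entry change.
MODES = {
    "fp16": torch.float16,
    "bf16": torch.bfloat16,
    "tf32": torch.float32,  # plus torch.backends.cuda.matmul.allow_tf32 = True
}
# e.g., on Hopper-class GPUs with an FP8-capable PyTorch build:
# MODES["fp8"] = torch.float8_e4m3fn   # FP8 GEMM goes through torch._scaled_mm
```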
- CUDA Graphs: captured only once per (size, mode) with static tensors to avoid illegal memory operations during replay (see the sketch below).
- Warmup strategy: a larger relative warmup for high iteration counts ensures kernel autotuning caches are populated.
- The memory‑intensity heuristic in the tests is intentionally approximate; refine with precise element sizes / read counts if needed.
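The capture‑once/replay approach follows PyTorch's documented CUDA Graphs idiom; a sketch under the same static‑tensor assumption (not the repo's exact code):

```python
import torch

def capture_gemm(n=4096, dtype=torch.float16):
    """Capture one GEMM into a CUDA graph; replaying re-launches it cheaply."""
    a = torch.randn(n, n, device="cuda", dtype=dtype)    # static inputs
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    out = torch.empty(n, n, device="cuda", dtype=dtype)  # static output
    s = torch.cuda.Stream()                 # warm up on a side stream
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        torch.matmul(a, b, out=out)
    torch.cuda.current_stream().wait_stream(s)
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):               # capture only, not executed here
        torch.matmul(a, b, out=out)
    return g, out

# g.replay() re-runs the captured kernels; never reallocate a, b, or out.
```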
- Do not install conflicting CUDA drivers inside WSL; rely on the Windows host driver.
- Avoid running untrusted containers with `--gpus all` unless you understand the security implications.
- Keep your Python environment isolated (venv) to prevent accidental system package pollution.
- Restrict benchmark modes to what the GPU supports (the script already performs capability checks).
If you copy `bench.py` / `bench_tests.py` to your home folder (as recommended in the setup guide) and later pull repo changes, just recopy them. They are self‑contained, with no relative imports beyond `bench`, which `bench_tests.py` imports.
- Add CI (GitHub Actions) to lint Python, and maybe run a reduced CPU‑only logic test when CUDA is absent.
- Provide a containerized benchmark image (`Dockerfile`) with pinned PyTorch + CUDA toolkit versions.
- Add a JSON output option for easier machine ingestion.
- Collect and visualize performance deltas across driver / PyTorch updates.
Released under the MIT License. See the LICENSE file for full text.
SPDX-License-Identifier: MIT
No secrets are stored here, and the scripts perform no external network actions—they are self‑contained. Run them locally in your own environment.
Happy benchmarking & productive hacking inside WSL! 🚀