GPU cluster health check. Runs in ~20 seconds. No install required.
```shell
curl -fsSL https://raw.githubusercontent.com/cv700/scanprobe/main/scanprobe.py | python3
```

Or install permanently:

```shell
pip install ashiba-scanprobe
scanprobe
```

```
ashiba scanprobe v0.1.0 ─ github.com/cv700/scanprobe

GPU 0  ✓ HEALTHY  H100 SXM5  72°C  no ECC errors
GPU 1  ✓ HEALTHY  H100 SXM5  74°C  no ECC errors
GPU 2  ! WATCH    H100 SXM5  91°C  HW thermal throttle: HwThermalSlowdown
GPU 3  ✓ HEALTHY  H100 SXM5  71°C  no ECC errors

Node: WATCH
  → GPU 2  HW thermal throttle: HwThermalSlowdown
  → GPU 2  Temperature critical: 91°C

Checked: nvidia-smi · ECC counters · Xid scan (18s)
Skipped: DCGM (not found), matmul/collective (--tier 2)

Tip: python3 scanprobe.py --tier 2 for DCGM + matmul checks (~3 min)
```
| Check | What it catches | Tier |
|---|---|---|
| nvidia-smi | ECC errors, temperature, clock throttle | 1 |
| Xid scan (dmesg) | Hardware faults written to kernel ring buffer | 1 |
| DCGM diagnostics | GPU memory bandwidth, compute validation | 1+ |
| Matmul correctness | Silent numerical errors (SDC) vs FP64 reference | 2 |
| Collective latency | Per-rank allreduce outliers — NVLink / fabric faults | 2+ |
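The Xid scan is essentially a pattern match over the kernel ring buffer. A minimal sketch of the idea in stdlib Python — the regex, the function name, and the watch/drain split are illustrative, not scanprobe's actual implementation (the drain-class Xid set is taken from the signal-weights table below):

```python
import re

# Drain-class Xids (see the signal-weights table); any other Xid is
# treated as watch-class in this sketch.
DRAIN_XIDS = {48, 63, 74, 79, 94, 95}

# NVIDIA driver log lines look roughly like:
#   NVRM: Xid (PCI:0000:1a:00): 79, pid=1234, GPU has fallen off the bus.
XID_RE = re.compile(r"NVRM: Xid \((PCI:[0-9a-f:.]+)\): (\d+)")

def scan_xids(dmesg_text):
    """Return (pci_bus, xid, severity) tuples found in dmesg output."""
    hits = []
    for m in XID_RE.finditer(dmesg_text):
        xid = int(m.group(2))
        severity = "drain" if xid in DRAIN_XIDS else "watch"
        hits.append((m.group(1), xid, severity))
    return hits

sample = "NVRM: Xid (PCI:0000:1a:00): 79, pid=1234, GPU has fallen off the bus."
print(scan_xids(sample))  # [('PCI:0000:1a:00', 79, 'drain')]
```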
```
--tier 1    ~20s    nvidia-smi + ECC + Xid (default)
--tier 2    ~3min   + DCGM diagnostics + matmul
--tier 3    ~10min  + collective latency sweep (all GPUs)
```
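The tier-2 matmul check compares a low-precision product against an FP64 reference; a large relative error on any shape flags possible silent data corruption. The real check presumably runs torch matmuls on each GPU — this stdlib-only sketch just emulates FP32 rounding on the CPU to show the comparison logic, and the function names and tolerance are illustrative:

```python
import random
import struct

def to_f32(x):
    """Round an FP64 value to FP32, emulating device precision."""
    return struct.unpack("f", struct.pack("f", x))[0]

def rel_error(n, seed=0):
    """Max relative error of an emulated-FP32 matmul vs the FP64 reference."""
    rng = random.Random(seed)
    a = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(n)]
    b = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(n)]
    worst = 0.0
    for i in range(n):
        for j in range(n):
            ref = acc32 = 0.0
            for k in range(n):
                ref += a[i][k] * b[k][j]                # FP64 reference path
                acc32 = to_f32(acc32 + to_f32(to_f32(a[i][k]) * to_f32(b[k][j])))
            worst = max(worst, abs(acc32 - ref) / max(abs(ref), 1.0))
    return worst

# Rounding error alone stays tiny; a hardware SDC would blow far past this.
print(rel_error(32) < 1e-3)  # True on healthy hardware
```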
| Status | Score | Meaning |
|---|---|---|
| HEALTHY | < 0.20 | No detected issues. Proceed. |
| WATCH | 0.20 – 0.49 | Anomaly present. Investigate before a long run. |
| DRAIN | ≥ 0.50 | Hardware fault confirmed. Do not use this GPU. |
Exit codes: 0=HEALTHY 1=WATCH 2=DRAIN 3=error
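The thresholds map directly onto the exit codes, which makes scanprobe easy to wire into pre-flight scripts. A sketch of the mapping (the function name is illustrative):

```python
def classify(score):
    """Map an aggregate anomaly score to (status, exit_code)."""
    if score >= 0.50:
        return "DRAIN", 2    # hardware fault confirmed
    if score >= 0.20:
        return "WATCH", 1    # anomaly present, investigate
    return "HEALTHY", 0      # no detected issues

print(classify(0.05))   # ('HEALTHY', 0)
print(classify(0.40))   # ('WATCH', 1)
print(classify(0.575))  # ('DRAIN', 2)
```

For example, a job prolog could run scanprobe and refuse to schedule onto the node whenever the exit code is 2.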
| Signal | Weight | Source |
|---|---|---|
| Xid drain-class (48, 63, 74, 79, 94, 95) | 0.85 | dmesg |
| DBE ECC volatile | 0.70+ | nvidia-smi |
| DCGM diagnostic failure | 0.55+ | dcgmi |
| nvidia-smi query failure | 0.55 | nvidia-smi |
| HW thermal throttle | 0.40 | nvidia-smi |
| Temperature > 88°C | 0.35 | nvidia-smi |
| Matmul >50% anomalous shapes | 0.35 | torch |
| DBE ECC aggregate (lifetime) | 0.30 | nvidia-smi |
| Collective latency outlier >3σ | 0.30 | torch.distributed |
| Xid watch-class | 0.25 | dmesg |
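The tier-3 collective check times a per-rank allreduce and flags ranks whose latency sits more than 3σ above the fleet mean — a common symptom of a degraded NVLink or fabric path. The statistics are simple enough to sketch without torch.distributed (the timings and function name here are made up for illustration):

```python
from statistics import mean, stdev

def latency_outliers(latencies_ms, z=3.0):
    """Return rank indices whose allreduce latency is > z sigma above the mean."""
    mu, sigma = mean(latencies_ms), stdev(latencies_ms)
    if sigma == 0:
        return []
    return [rank for rank, t in enumerate(latencies_ms)
            if (t - mu) / sigma > z]

# Fifteen healthy ranks around 1.1 ms; rank 15 on a flaky link stands out.
timings = [1.10, 1.12, 1.09, 1.11, 1.13, 1.10, 1.12, 1.08,
           1.11, 1.10, 1.12, 1.09, 1.13, 1.11, 1.10, 9.80]
print(latency_outliers(timings))  # [15]
```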
Scores aggregate with geometric decay: the strongest signal dominates, and each additional signal contributes half as much as the one before it. Two WATCH-level signals, e.g. HW thermal throttle (0.40) plus high temperature (0.35), combine to 0.40 + 0.35/2 = 0.575 and cross the DRAIN threshold.
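Concretely, the decay can be implemented by sorting signal weights in descending order and halving each successive contribution. A sketch of that aggregation (scanprobe's exact implementation may differ):

```python
def aggregate(weights):
    """Combine signal weights with geometric decay: strongest signal first,
    each further signal contributing half as much as the previous one."""
    total = 0.0
    for i, w in enumerate(sorted(weights, reverse=True)):
        total += w * 0.5 ** i
    return total

# HW thermal throttle (0.40) + temperature > 88°C (0.35):
score = aggregate([0.40, 0.35])
print(round(score, 3))  # 0.575 → at or above 0.50, so the node is marked DRAIN
```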
scanprobe covers pre-flight signals only. It will not detect dormant faults that manifest only after hundreds of training steps: some GPUs pass every pre-flight check and still fail at step 450+. NVIDIA's diagnostic tooling achieves only ~70% recall on silent data corruption (SDC) events.
For in-flight detection, see ByteRobust (§4–5) and XPUTimer.
```shell
# Zero dependencies (nvidia-smi + dmesg only)
pip install ashiba-scanprobe

# With pretty terminal output
pip install "ashiba-scanprobe[display]"

# With matmul + collective checks (requires PyTorch)
pip install "ashiba-scanprobe[full]"
```

Python 3.9+. The single-file scanprobe.py has no dependencies beyond the stdlib.
GPU hardware failures are common at scale. Meta's Llama 3 training on 16,384 H100s experienced one job failure every 3 hours. NVIDIA's EUD diagnostic has ~70% recall on SDC events (ByteRobust §4.3). Pre-flight checks catch the detectable subset before you commit 72 hours of compute.
Built by Ashiba · MIT license
Real-hardware validation is in progress. The checks were written from documentation
and papers. If you run this on real hardware, please open an issue or PR with your
`scanprobe --json` output — it is the most valuable contribution you can make.
| Hardware | Driver | Status |
|---|---|---|
| H100 SXM5 | — | in progress |
| A100 80GB | — | not yet |
| A10G | — | not yet |
See CONTRIBUTING.md. You don't need a GPU.