HwLoad is a lightweight, high-performance, and cross-platform hardware load generator. It is designed to stress-test specific hardware components (CPU, GPU, NPU) by generating precise synthetic loads or simulating realistic workloads.
It supports granular control over load intensity (from idle keep-alive to saturation) and allows mixed-device testing (e.g., stressing CPU and GPU simultaneously).
- Multi-Architecture: Native support for CPU (x86/ARM), NVIDIA GPU (CUDA), and Ascend NPU (MindSpore).
- Granular Control: 5 distinct pressure levels ranging from "Idle" to "Saturated".
- Targeted Profiles: Isolate specific bottlenecks (ALU, Memory Bandwidth, Latency, IO).
- Chaos Mode: "Random" profile to simulate fluctuating cluster environments.
- Real Simulation: "Real" profile to mimic actual model training/inference loops.
- Concurrency: Run multiple load generators in parallel via a single command.
| Component | Requirement | Note |
|---|---|---|
| Compiler | C++17 compliant | GCC, Clang, or MSVC |
| CMake | ≥ 3.10 | Build system |
| CUDA | Toolkit 11.0+ | Required for HWLOAD_USE_GPU |
| MindSpore | Python Environment | Required for HWLOAD_USE_NPU |
Configure the build to enable specific hardware backends:

    mkdir build && cd build
    # Enable/Disable modules as needed
    cmake .. \
      -DHWLOAD_USE_CPU=ON \
      -DHWLOAD_USE_GPU=ON \
      -DHWLOAD_USE_NPU=ON
    make -j

The binary accepts a list of tasks in the format `device:profile:level`. You can combine multiple tasks to stress different hardware simultaneously.
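
As a rough illustration of the `device:profile:level` format, the sketch below splits one task string into its three fields. It is not taken from HwLoad's sources: the `Task` struct and `parse_task` function are hypothetical names, and the example spelling `cpu:compute:high` is an assumption about how profiles and levels are written on the command line.

```cpp
#include <cassert>
#include <optional>
#include <string>

// One parsed task. The struct and its field names are illustrative only.
struct Task {
    std::string device;
    std::string profile;
    std::string level;
};

// Splits "device:profile:level" on ':' and returns std::nullopt unless the
// string has exactly two separators and three non-empty fields. The values
// are not checked against the names the real binary accepts.
std::optional<Task> parse_task(const std::string& spec) {
    const auto first = spec.find(':');
    if (first == std::string::npos) return std::nullopt;
    const auto second = spec.find(':', first + 1);
    if (second == std::string::npos) return std::nullopt;
    if (spec.find(':', second + 1) != std::string::npos) return std::nullopt;

    Task task{spec.substr(0, first),
              spec.substr(first + 1, second - first - 1),
              spec.substr(second + 1)};
    if (task.device.empty() || task.profile.empty() || task.level.empty()) {
        return std::nullopt;
    }
    return task;
}
```

Calling `parse_task` once per command-line argument and rejecting the run on the first `std::nullopt` would be one simple way to validate input before any load threads start.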

    ./loadgen <device:profile:level> [device:profile:level] ...

- Implementation: C++11 threads and OS primitives
- Compute: Complex floating-point math (`sin`, `sqrt`, arithmetic). Uses `std::atomic` and `volatile` to prevent compiler optimization
- Memory: Allocates a buffer larger than L3 cache (256 MB). Performs linear and strided writes to saturate DRAM bandwidth
- Data: Pointer chasing. Traverses a randomized linked list (64 MB) to maximize cache misses and stress memory latency
- IO (Linux only): Uses `mmap` to map a file, writes dirty pages, and forces disk synchronization via `msync`
- Random: Randomly selects a profile and level.
- Real
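
The pointer chasing behind the CPU Data profile can be sketched as follows. This is a minimal illustration under stated assumptions, not HwLoad's implementation: it uses an index table instead of linked-list nodes, builds the random order with Sattolo's algorithm so the chain forms a single cycle, and the names `build_chase_table` and `chase` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Builds a table where next[i] is the slot visited after slot i. Sattolo's
// algorithm produces a permutation with exactly one cycle, so a chase that
// starts anywhere touches every slot before it repeats.
std::vector<std::size_t> build_chase_table(std::size_t n, unsigned seed) {
    assert(n > 0);
    std::vector<std::size_t> next(n);
    std::iota(next.begin(), next.end(), std::size_t{0});
    std::mt19937_64 rng(seed);
    for (std::size_t i = n - 1; i > 0; --i) {
        // Pick j strictly below i; this is what distinguishes Sattolo's
        // algorithm from an ordinary Fisher-Yates shuffle.
        std::uniform_int_distribution<std::size_t> pick(0, i - 1);
        std::swap(next[i], next[pick(rng)]);
    }
    return next;
}

// Follows the chain for `steps` hops starting at slot 0. Each load depends
// on the result of the previous one, so cache misses cannot be overlapped.
// Returning the final position keeps the loop from being optimized away.
std::size_t chase(const std::vector<std::size_t>& next, std::size_t steps) {
    std::size_t pos = 0;
    for (std::size_t s = 0; s < steps; ++s) {
        pos = next[pos];
    }
    return pos;
}
```

With 8-byte indices, a table of `1 << 23` entries occupies 64 MB, which matches the buffer size quoted for the Data profile and is larger than typical L3 caches.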
- Implementation: CUDA Runtime API
- Compute: Fused Multiply-Add (FMA) and transcendental functions in tight loops to saturate SMs (Streaming Multiprocessors)
- Memory: Device-to-device (D2D) memory copies and read–modify–write operations to saturate HBM/GDDR bandwidth
- Data: Mixed kernel with compute-dependent memory access, simulating general-purpose CUDA workloads
- Random: Randomly selects a profile and level.
- Real
- Implementation: MindSpore (Python bridge)
- Compute: Large `MatMul` (FP16) operators to saturate Cube Units
- Memory: `Add`/`Assign` (FP32) on large tensors to stress Vector Units and HBM bandwidth
- Data: Mixed FP32↔FP16 casting with FP16 `MatMul` to stress data movement, Vector Units, and memory bandwidth.
- Random: Randomly selects a profile and level.
- Real: Simulates a CNN training step (`Conv2d + BN + ReLU`). Uses internal repeats to keep the NPU busy without CPU/Python overhead
Levels determine the duty cycle (run/sleep ratio), thread count, and data size.
| Level | Description | Behavior |
|---|---|---|
| Idle | Keep-Alive | Minimal pulse (e.g., 1 ms work / 500 ms sleep). Keeps the context active for monitoring |
| Low | Background | Short bursts with long sleep intervals (e.g., 20–30 ms sleep) |
| Medium | Business | Balanced work/sleep ratio. Simulates typical application usage |
| High | Heavy | Continuous operation with minimal sleep. Targets >90% utilization |
| Saturated | Stress / Max | 0 ms sleep + max concurrency. Spawns threads = logical cores (CPU) or fills command queues (GPU/NPU). Tests thermal limits and power throttling |
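
The run/sleep duty cycle behind these levels can be sketched as below. This is an assumption-laden illustration rather than HwLoad's code: `DutyCycle`, `duty_ratio`, and `run_duty_cycle` are hypothetical names, only the Idle figures (1 ms work, 500 ms sleep) come from the table, and the `sin`/`sqrt` loop merely stands in for a real profile kernel.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <thread>

// Work/sleep split for one pressure level. Only the Idle numbers in the
// table are documented; any other values are placeholders.
struct DutyCycle {
    std::chrono::milliseconds work;
    std::chrono::milliseconds sleep;
};

// Fraction of each period spent working, in [0, 1].
double duty_ratio(const DutyCycle& d) {
    const auto period = d.work + d.sleep;
    if (period.count() == 0) return 0.0;
    return static_cast<double>(d.work.count()) /
           static_cast<double>(period.count());
}

// Runs `periods` work/sleep cycles on the calling thread and returns how
// many iterations of the stand-in kernel were executed.
unsigned long long run_duty_cycle(const DutyCycle& d, int periods) {
    volatile double sink = 0.0;  // volatile store keeps the math from being elided
    unsigned long long iterations = 0;
    for (int p = 0; p < periods; ++p) {
        const auto deadline = std::chrono::steady_clock::now() + d.work;
        while (std::chrono::steady_clock::now() < deadline) {
            sink = std::sqrt(std::sin(static_cast<double>(iterations)) + 2.0);
            ++iterations;
        }
        if (d.sleep.count() > 0) {
            std::this_thread::sleep_for(d.sleep);
        }
    }
    return iterations;
}
```

Under this model the Idle example works out to a duty ratio of 1/501, roughly 0.2%, while a Saturated level with 0 ms sleep gives a ratio of 1.0. A saturating run would additionally start one such loop per logical core.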
- Privileges: `cpu:io` may require root privileges depending on the write location
- Heat generation: `Saturated` level can generate significant heat. Ensure adequate cooling before long runs