A GPU simulator written in Rust, modelling the architecture of modern NVIDIA GPUs (H100/Hopper) from the SM/warp/thread level up to multi-GPU clusters connected by NVLink and InfiniBand. The ultimate goal is to execute kernels and study scheduling, occupancy, memory hierarchy, and distributed communication behaviour.
Architecture based on: JAX Scaling Book — GPUs, NVIDIA architecture whitepapers, and GPGPU-Sim.
Modern GPU clusters span three levels of hierarchy — from individual compute units inside a GPU, up through multi-GPU nodes, to multi-node clusters. gpusim models all three:
```
Cluster
├── Node 0 ─────────────────────── NVLink 4.0 (900 GB/s, all-to-all)
│   ├── GPU 0
│   │   ├── SMs (×132)
│   │   │   ├── SMEM — fast on-chip shared memory (256KB L1/SMEM per SM, up to 228KB usable as SMEM)
│   │   │   ├── Warp Schedulers (×4 subpartitions)
│   │   │   └── Tensor Cores — matrix multiply-accumulate (MMA) units
│   │   ├── L2 Cache — ~50MB shared across all SMs
│   │   └── HBM — 80GB high-bandwidth memory (3.4 TB/s)
│   └── GPU 1 … GPU 7
├── Node 1 ─────────────────────── NVLink 4.0
│   └── GPU 0 … GPU 7
└── InfiniBand fabric (NDR 400Gb/s, fat-tree) connecting all nodes
```
Kernels are launched with a grid of thread blocks. Each block is assigned to an SM and its threads are divided into warps of 32. The warp scheduler determines execution order within the SM. Across GPUs, communication is modelled via NVLink (intra-node) and InfiniBand (inter-node) with realistic bandwidth and latency parameters.
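As a quick sanity check of that mapping, here is a standalone sketch of the grid-to-warp arithmetic (illustrative code, not gpusim's own API, which lives in kernel.rs and warp.rs):

```rust
// Standalone sketch of the grid → block → warp decomposition described above.
const WARP_SIZE: u32 = 32;

fn warps_per_block(threads_per_block: u32) -> u32 {
    // A partial warp still occupies a full warp slot.
    threads_per_block.div_ceil(WARP_SIZE)
}

fn main() {
    let (grid_blocks, block_threads) = (8u32, 128u32);
    let total_warps = grid_blocks * warps_per_block(block_threads);
    // 8 blocks × 128 threads → 32 warps, matching the demo output below.
    println!("{grid_blocks} blocks × {block_threads} threads = {total_warps} warps");
}
```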
- CUDA-style execution model: `grid_dim`, `block_dim`, `threadIdx`, `blockIdx`
- Per-thread `ThreadCtx` with access to SMEM and global HBM memory
- Kernels defined as Rust closures — no DSL or bytecode needed
- Blocks assigned to the SM with the most available resource headroom, matching empirically observed NVIDIA behaviour (Gilman et al., SIGMETRICS 2021)
- Resource-aware: tracks threads, warps, registers, and SMEM per SM
- Full 5-limiter occupancy model (mirrors GPGPU-Sim's `max_cta()` logic):
  - Thread slots
  - Warp slots
  - Register file (with allocation granularity)
  - Shared memory (with allocation granularity)
  - Hardware block cap
- Reports theoretical occupancy and identifies the bottleneck resource
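For intuition, here is a minimal sketch of that min-over-limiters calculation. The constants are H100-like (2048 threads, 64 warps, 64K registers, 228KB SMEM, 32 blocks per SM) and the granularities are illustrative; gpusim's real logic is `max_blocks_per_sm()` in occupancy.rs and may differ in detail:

```rust
// Illustrative 5-limiter occupancy calculation with H100-like constants.
// e.g. (128 threads, 32 regs, 0 SMEM) → 16 blocks/SM
// (thread slots, warp slots, and register file all tie at 16).
fn max_blocks_per_sm(threads: u32, regs_per_thread: u32, smem_per_block: u32) -> (u32, &'static str) {
    let warps = threads.div_ceil(32);
    // Round demand up to an assumed allocation granularity (256 regs, 1KB SMEM).
    let regs = (warps * 32 * regs_per_thread).max(1).next_multiple_of(256);
    let smem = smem_per_block.max(1).next_multiple_of(1024);
    [
        (2048 / threads, "thread slots"),
        (64 / warps, "warp slots"),
        (65_536 / regs, "register file"),
        (228 * 1024 / smem, "shared memory"),
        (32, "hardware block cap"),
    ]
    .into_iter()
    .min_by_key(|&(n, _)| n) // ties resolved arbitrarily in this sketch
    .unwrap()
}
```

With the quick-start launch below (128 threads/block, 32 regs/thread, no SMEM) this works out to 16 blocks/SM, the `max_blocks/SM=16` the demo reports.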
| Policy | Description | Best for |
|---|---|---|
| LRR | Loose Round-Robin — rotates through warps equally | Baseline comparison |
| GTO | Greedy-Then-Oldest — sticks with one warp until it stalls, then picks the oldest eligible | Cache locality (default in GPGPU-Sim) |
| TwoLevel | Active set + pending pool — LRR within active set, promotes from pending when stalled | Best overall (~19% speedup over LRR; Narasiman et al., MICRO 2011) |
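To give a flavour of the difference, here is an illustrative GTO pick function. It is a sketch, not the actual scheduler.rs code, and this `WarpSlot` is a simplified stand-in for gpusim's type:

```rust
// Sketch of a GTO pick: stay greedy on the current warp while it can issue;
// on a stall, fall back to the oldest eligible warp. `age` is taken to be
// the warp's launch timestamp, so a smaller age means an older warp.
struct WarpSlot {
    id: usize,
    age: u64,
    eligible: bool, // not stalled on a dependency this cycle
}

fn gto_pick(warps: &[WarpSlot], current: Option<usize>) -> Option<usize> {
    // Greedy: keep issuing from the current warp if it is still eligible.
    if let Some(id) = current {
        if warps.iter().any(|w| w.id == id && w.eligible) {
            return Some(id);
        }
    }
    // Then-oldest: pick the eligible warp with the earliest launch time.
    warps.iter().filter(|w| w.eligible).min_by_key(|w| w.age).map(|w| w.id)
}
```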
- SMEM — per-block on-chip scratch memory (up to 228KB per SM on H100)
- L2 Cache — shared across all SMs (~50MB), sparse-mapped
- HBM — 80GB main memory, sparse-mapped (no eager 80GB allocation on host)
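"Sparse-mapped" means pages are materialized only when first touched. A minimal sketch of the idea follows (illustrative; the real implementation is in memory.rs and may differ):

```rust
use std::collections::HashMap;

// Lazily allocated, page-granular backing store: an "80GB" HBM costs almost
// nothing on the host until a page is actually written.
const PAGE: usize = 4096;

#[derive(Default)]
struct SparseMem {
    pages: HashMap<usize, Box<[u8; PAGE]>>,
}

impl SparseMem {
    fn write(&mut self, addr: usize, data: &[u8]) {
        for (i, &b) in data.iter().enumerate() {
            let a = addr + i;
            let page = self
                .pages
                .entry(a / PAGE)
                .or_insert_with(|| Box::new([0u8; PAGE]));
            page[a % PAGE] = b;
        }
    }

    fn read(&self, addr: usize, len: usize) -> Vec<u8> {
        // Unwritten pages read back as zeroes.
        (addr..addr + len)
            .map(|a| self.pages.get(&(a / PAGE)).map_or(0, |p| p[a % PAGE]))
            .collect()
    }
}
```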
- `SmConfig::h100()` — Hopper (CC 9.0): 132 SMs, 64 warps/SM, 228KB SMEM/SM
- `SmConfig::a100()` — Ampere (CC 8.0): 164KB SMEM/SM
- `Cluster` — N nodes × M GPUs, NVLink intra-node + InfiniBand inter-node
- Point-to-point transfers routed automatically: NVLink if same node, InfiniBand if cross-node
- Collective operations: AllReduce (Ring / Tree / Direct), AllGather (Ring), Broadcast (Tree)
- Bandwidth and latency modelled realistically; reports time, effective bandwidth, and efficiency
- `launch_kernel_on(device, kernel, config, policy)` — run a kernel on any GPU in the cluster
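The numbers the collectives report follow the standard ring cost model: 2(N−1) steps, each moving size/N bytes per GPU, paced by the slowest link on the ring. A sketch (the function name and units here are mine, not interconnect.rs's API):

```rust
// Standard ring all-reduce cost model: 2(N-1) steps of size/N bytes each.
// Illustrative; see interconnect.rs for gpusim's actual math.
fn ring_all_reduce_us(n_gpus: f64, bytes: f64, bw_gb_s: f64, lat_us: f64) -> f64 {
    let steps = 2.0 * (n_gpus - 1.0);
    let bytes_per_step = bytes / n_gpus;
    let bw_bytes_per_us = bw_gb_s * 1e3; // 1 GB/s = 1e3 bytes/µs
    steps * (bytes_per_step / bw_bytes_per_us + lat_us)
}
```

For 16 GPUs with 1GB each, bottlenecked by 50 GB/s InfiniBand, this gives ≈40.3ms, the figure the cluster example below reports.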
| Interconnect | Preset | Bandwidth | Latency |
|---|---|---|---|
| NVLink 4.0 (H100) | `NVLinkConfig::h100()` | 900 GB/s | 1 µs |
| NVLink 3.0 (A100) | `NVLinkConfig::a100()` | 600 GB/s | 1 µs |
| NDR InfiniBand | `InfiniBandConfig::ndr()` | 50 GB/s | 2 µs |
| HDR InfiniBand | `InfiniBandConfig::hdr()` | 25 GB/s | 2 µs |
| XDR InfiniBand | `InfiniBandConfig::xdr()` | 100 GB/s | 1.5 µs |
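In the table's terms, a point-to-point transfer reduces to latency + size/bandwidth. A quick sketch that reproduces the NVLink figure from the usage example below (names are illustrative, not gpusim's API):

```rust
// Transfer model implied by the presets above: time = latency + bytes/bandwidth.
fn transfer_time_us(bytes: f64, bw_gb_s: f64, lat_us: f64) -> f64 {
    lat_us + bytes / (bw_gb_s * 1e3) // 1 GB/s = 1e3 bytes/µs
}

fn main() {
    let bytes = (1u64 << 30) as f64; // 1GB
    let t_us = transfer_time_us(bytes, 900.0, 1.0); // NVLink 4.0 preset
    let eff_gb_s = bytes / (t_us * 1e3);
    // → 1.19 ms at 899 GB/s effective, matching the cluster example below
    println!("{:.2} ms ({:.0} GB/s)", t_us / 1e3, eff_gb_s);
}
```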
- Attach to any running simulation from a separate terminal at any time — no need to restart the sim
- Polls `/tmp/gpusim_live.json` every 200ms (written atomically by the executor after each block)
- Shows SM utilization heatmap, occupancy gauge, block progress, and kernel stats
- Press `q` or `Esc` to quit; the simulation keeps running unaffected
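The "written atomically" part matters because the visualizer may poll mid-write. The usual trick, and a reasonable sketch of what metrics.rs does (details may differ), is write-to-temp-then-rename:

```rust
use std::fs;
use std::io::Write;

// Write the snapshot to a temp file, then rename it over the target.
// rename() is atomic on POSIX, so a polling reader never sees a torn file.
fn write_snapshot(json: &str) -> std::io::Result<()> {
    let tmp = "/tmp/gpusim_live.json.tmp";
    let mut f = fs::File::create(tmp)?;
    f.write_all(json.as_bytes())?;
    f.sync_all()?; // flush contents before the rename publishes them
    fs::rename(tmp, "/tmp/gpusim_live.json")
}
```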
```
src/
├── main.rs         — Entry point; single-GPU and multi-GPU demos
├── lib.rs          — Module declarations
├── gpu.rs          — Top-level GPU struct; launch_kernel()
├── sm.rs           — StreamingMultiprocessor; resource tracking
├── kernel.rs       — Dim3, LaunchConfig, ThreadCtx, Kernel
├── executor.rs     — KernelExecutor; block + warp scheduling loop; metrics snapshots
├── occupancy.rs    — SmConfig, KernelResources, max_blocks_per_sm()
├── scheduler.rs    — WarpState, WarpSlot, LRR/GTO/TwoLevel schedulers
├── metrics.rs      — LiveMetrics; atomic write/read to /tmp/gpusim_live.json
├── memory.rs       — L2Cache and HBM (sparse HashMap-backed)
├── warp.rs         — Warp struct (registers, PC, age)
├── tensor_core.rs  — TensorCore MMA unit
├── cluster.rs      — Cluster, Node, DeviceId; transfer(), all_reduce(), all_gather()
├── interconnect.rs — NVLinkConfig, InfiniBandConfig, transfer math, collective algorithms
└── bin/
    └── viz.rs      — Live TUI visualizer (ratatui)
```
```bash
cargo run
```

Output:

```
Initialized H100-like GPU with 132 SMs, 50MB L2 cache, 80GB HBM
[gpusim] Launching kernel 'vec_add' | grid=(8,1,1) block=(128,1,1) | policy=GTO | max_blocks/SM=16 | occupancy=100.0% (limited by register file)
[gpusim] Kernel 'vec_add' complete | 8 blocks | 32 warps | 1024 threads | occupancy=100.0%
Verification PASSED: all 1024 results correct (each = 1024)
Stats: 8 blocks | 32 warps | 1024 threads | occupancy=100.0% (limited by register file) | policy=GTO
```
```rust
use gpusim::gpu::GPU;
use gpusim::kernel::{Dim3, Kernel, LaunchConfig};
use gpusim::scheduler::SchedulingPolicy;

let mut gpu = GPU::h100();

// Define a kernel — doubles every element in-place
let kernel = Kernel::new("double", |ctx| {
    let addr = ctx.global_id() as usize * 4;
    let val = f32::from_le_bytes(ctx.gmem.read(addr, 4).try_into().unwrap());
    ctx.gmem.write(addr, &(val * 2.0).to_le_bytes());
});

// Launch: 128 threads/block, 32 regs/thread for occupancy calculation
let config = LaunchConfig::new(Dim3::x(8), Dim3::x(128))
    .with_resources(32, 0);
let stats = gpu.launch_kernel(&kernel, &config, SchedulingPolicy::Gto);
println!("Occupancy: {:.1}%", stats.theoretical_occupancy * 100.0);
println!("Bottleneck: {}", stats.occupancy_limiter);
```

```rust
// Loose Round-Robin
gpu.launch_kernel(&kernel, &config, SchedulingPolicy::Lrr);

// Greedy-Then-Oldest (default in GPGPU-Sim)
gpu.launch_kernel(&kernel, &config, SchedulingPolicy::Gto);

// Two-Level with an active set of 8 warps
gpu.launch_kernel(&kernel, &config, SchedulingPolicy::TwoLevel { active_set_size: 8 });
```

```rust
use gpusim::cluster::{Cluster, DeviceId};
use gpusim::interconnect::AllReduceAlgorithm;

// 2 nodes × 8 H100s = 16 GPUs, NVLink 4.0 + NDR InfiniBand
let mut cluster = Cluster::h100_dgx(2);

// Point-to-point transfer: 1GB intra-node (NVLink)
let t = cluster.transfer(DeviceId::new(0, 0), DeviceId::new(0, 1), 1 << 30);
println!("{:.2}ms ({:.0} GB/s)", t.time_us / 1000.0, t.effective_bandwidth_gb_s);
// → 1.19ms (899 GB/s)

// Point-to-point transfer: 1GB inter-node (InfiniBand)
let t = cluster.transfer(DeviceId::new(0, 0), DeviceId::new(1, 0), 1 << 30);
println!("{:.2}ms ({:.0} GB/s)", t.time_us / 1000.0, t.effective_bandwidth_gb_s);
// → 21.48ms (50 GB/s)

// AllReduce: 1GB per GPU across all 16 GPUs
let s = cluster.all_reduce(1 << 30, AllReduceAlgorithm::Ring);
println!("{:.2}ms efficiency={:.1}%", s.time_us / 1000.0, s.efficiency * 100.0);
// → 40.33ms efficiency=99.9%

// Launch a kernel on a specific GPU
cluster.launch_kernel_on(DeviceId::new(1, 3), &kernel, &config, SchedulingPolicy::Gto);
```

Start the simulation in one terminal, then attach the visualizer in another at any time:
```bash
# Terminal 1 — run simulation
cargo run

# Terminal 2 — attach visualizer (start before, during, or after)
cargo run --bin viz
```

The visualizer shows:
```
┌─ ⚡ gpusim live monitor ──────────────────────────────────────────────────────┐
│ kernel: vec_add    policy: GTO    status: RUNNING                            │
├──────────────────────────────────────┬───────────────────────────────────────┤
│ SM Utilization                       │ Stats                                 │
│ ██ active  ░░ idle                   │ Occupancy ████████████████ 100.0%     │
│                                      │ Blocks    ████░░░░░░░░░░░░ 3 / 8      │
│ ██ ░░ ░░ ░░ ░░ ░░ ░░ ░░ ░░ ░░ ░░    │                                       │
│ ░░ ░░ ░░ ░░ ░░ ░░ ░░ ░░ ░░ ░░ ░░    │ Warps: 12                             │
│ ░░ ░░ ░░ ░░ ░░ ░░ ░░ ░░ ░░ ░░ ░░    │ Threads: 384                          │
│ ...                                  │ Max blk/SM: 16                        │
│ 1/132 SMs active                     │ Limiter: register file                │
│                                      │ Grid: (8,1,1)                         │
│                                      │ Block: (128,1,1)                      │
├──────────────────────────────────────┴───────────────────────────────────────┤
│ q / esc: quit    auto-refreshes every 200ms    reads /tmp/gpusim_live.json   │
└──────────────────────────────────────────────────────────────────────────────┘
```
| Topic | Reference |
|---|---|
| GPU architecture | JAX Scaling Book — GPUs |
| Block scheduling | Gilman et al., SIGMETRICS 2021 — Demystifying NVIDIA's TBS |
| Fermi TBS details | Sreepathi Pai — How the Fermi TBS Works |
| GTO scheduling | Rogers et al., MICRO 2012 — Cache-Conscious Wavefront Scheduling |
| Two-level scheduling | Narasiman et al., MICRO 2011 — Two-Level Warp Scheduling |
| Latency hiding | Volkov, 2016 — Understanding Latency Hiding on GPUs |
| Occupancy formula | Lei Mao — CUDA Occupancy Calculation |
| Reference simulator | GPGPU-Sim |
- Warp stall simulation (LongScoreboard / ShortScoreboard state transitions)
- Cycle-accurate execution with simulated memory latency
- Python bindings for running experiments from Python notebooks
- Matrix multiplication kernel demo (using TensorCore MMA)
- Multi-kernel / concurrent kernel execution
- Performance metrics: IPC, memory bandwidth utilisation, stall breakdown
- Richer visualizer: warp state timeline, per-SM drill-down, stall breakdown chart
- Multi-GPU visualizer: per-node/per-GPU view, transfer activity, collective progress
MIT