From d00de38c13bdfd8b49442633d9321a96c220322f Mon Sep 17 00:00:00 2001
From: bcumming
Date: Thu, 22 May 2025 13:09:01 +0200
Subject: [PATCH 1/7] node-images

---
 docs/alps/hardware.md | 35 +-
 docs/images/alps/gh200-schematic.svg | 2676 ++++++++++++++++++++++++++
 2 files changed, 2702 insertions(+), 9 deletions(-)
 create mode 100644 docs/images/alps/gh200-schematic.svg

diff --git a/docs/alps/hardware.md b/docs/alps/hardware.md
index 58173255..ca24a5c8 100644
--- a/docs/alps/hardware.md
+++ b/docs/alps/hardware.md
@@ -40,20 +40,37 @@ Alps was installed in phases, starting with the installation of 1024 AMD Rome du
 There are currently four node types in Alps, with another becoming available in 2025:
-| type | blades | nodes | CPU sockets | GPU devices |
-| ---- | ------:| -----:| -----------:| -----------:|
-| NVIDIA GH200 | 1344 | 2688 | 10,752 | 10,752 |
-| AMD Rome | 256 | 1024 | 2,048 | -- |
-| NVIDIA A100 | 72 | 144 | 144 | 576 |
-| AMD MI250x | 12 | 24 | 24 | 96 |
-| AMD MI300A | 64 | 128 | 512 | 512 |
+| type | abbreviation | blades | nodes | CPU sockets | GPU devices |
+| ---- | ------- | ------:| -----:| -----------:| -----------:|
+| NVIDIA GH200 | gh200 | 1344 | 2688 | 10,752 | 10,752 |
+| AMD Rome | zen2 | 256 | 1024 | 2,048 | -- |
+| NVIDIA A100 | a100 | 72 | 144 | 144 | 576 |
+| AMD MI250x | mi200 | 12 | 24 | 24 | 96 |
+| AMD MI300A | mi300 | 64 | 128 | 512 | 512 |
 [](){#ref-alps-gh200-node}
 ### NVIDIA GH200 GPU Nodes
-!!! todo
+There are 24 cabinets, in 4 rows with 6 cabinets per row:
+
+* 8 chassis per cabinet
+* 7 blades per chassis
+    * a chassis can contain up to 8 blades, however Alps' gh200 chassis are underpopulated so that we can increase the amount of power delivered to each node.
+* 2 nodes per blade
+
+Each node contains four Grace-Hopper modules and four corresponding network interface cards (NICS) per blade, as illustrated below:
+
+![](../images/alps/gh200-schematic.svg)
+
+??? info "node xnames"
+    There are two boards per blade with one node per board.
+    This is different to the `zen2` CPU-only nodes (used for example in Eiger) that had two nodes per board for a total of four nodes per blade.
+    As such, there are no `n1` nodes in the xname list, e.g.:
+    ```
+    x1100c0s6b0n0
+    x1100c0s6b1n0
+    ```
-Blanca Peak
 [](){#ref-alps-zen2-node}
 ### AMD Rome CPU Nodes
diff --git a/docs/images/alps/gh200-schematic.svg b/docs/images/alps/gh200-schematic.svg
new file mode 100644
index 00000000..de981f66
--- /dev/null
+++ b/docs/images/alps/gh200-schematic.svg
@@ -0,0 +1,2676 @@
[SVG markup not recoverable; only the text labels survive. Diagram title: "CSCS ALPS HPE Cray EX GH200 Node", 4x NVIDIA GH200 Grace Hopper Superchips. Per superchip: Hopper GPU with ~94 GB HBM3 (≤4000GB/s Tot), Grace CPU with ~117 to ~120 GB LPDDR5X (≤500GB/s Tot), NVLink-C2C at 450GB/s per Dir, 900GB/s Tot. GPU All2all: 3x6 P2P NVLink, 150GB/s per direction. CPU All2all: Grace Interconnect, ~75GB/s per direction. NUMA nodes 0-3 (CPU) and 4(-11 MIG), 12(-19 MIG), 20(-27 MIG), 28(-35 MIG) (GPU). Slingshot hsn0-hsn3, 200Gb/s (25GB/s) each, 16x PCIe-5. ~378 GB (GPU) + ~477 GB (CPU) = ~855 GB Unified memory, Hardware Coherency. Slingshot Switches connecting to other ALPS nodes, internet, storage (SSD/HDD).]

From 1c1b72010d384c361cd16a757295dd1637b28cf6 Mon Sep 17 00:00:00 2001
From: bcumming
Date: Thu, 22 May 2025 13:11:46 +0200
Subject: [PATCH 2/7] add mi300 node diagram

---
 docs/alps/hardware.md | 3 +-
 docs/images/alps/mi300-schematic.svg | 4790 ++++++++++++++++++++++++++
 2 files changed, 4792 insertions(+), 1 deletion(-)
 create mode 100644 docs/images/alps/mi300-schematic.svg

diff --git a/docs/alps/hardware.md b/docs/alps/hardware.md
index ca24a5c8..a53d98a3 100644
--- a/docs/alps/hardware.md
+++ b/docs/alps/hardware.md
@@ -71,7 +71,6 @@ Each node contains four Grace-Hopper modules and four corresponding network inte
     x1100c0s6b1n0
     ```
-
 [](){#ref-alps-zen2-node}
 ### AMD Rome CPU Nodes
@@ -96,6 +95,8 @@ Bard Peak
 [](){#ref-alps-mi300-node}
 ### AMD MI300A GPU Nodes
+![](../images/alps/mi300-schematic.svg)
+
 !!! todo
     Parry Peak
diff --git a/docs/images/alps/mi300-schematic.svg b/docs/images/alps/mi300-schematic.svg
new file mode 100644
index 00000000..665e8016
--- /dev/null
+++ b/docs/images/alps/mi300-schematic.svg
@@ -0,0 +1,4790 @@
[SVG markup not recoverable; only the text labels survive. Diagram title: "CSCS ALPS HPE Cray EX MI300A Node", 4x AMD INSTINCT™ MI300A APU. Per APU: 6x38 = 228 GPU Compute Units, 3x8 = 24 AMD Zen4 x86 CPU cores, ~125 GiB Unified HBM3 Memory (≤5300GiB/s Tot), one NUMA node each (Numa0-Numa3). 6 Infinity Fabric™ Links providing a total of 256GB/s per direction between each APU. Slingshot hsn0-hsn3, 200Gb/s (25GiB/s) each, 16x PCIe-5. ~477 GiB Unified HBM3 Memory, Hardware Coherency. Slingshot Switches connecting to other ALPS nodes, internet, storage (SSD/HDD).]

From 7b2844cfa186517e920f7f1c56d3dd7d2da81aa9 Mon Sep 17 00:00:00 2001
From: bcumming
Date: Thu, 22 May 2025 15:22:13 +0200
Subject: [PATCH 3/7] add under-construction admonition; document custom admonitions in contributing

---
 docs/alps/hardware.md | 5 ++++
 docs/contributing/index.md | 60 ++++++++++++++++++++++++++++++++++++++
 docs/stylesheets/extra.css | 16 ++++++++++
 3 files changed, 81 insertions(+)

diff --git a/docs/alps/hardware.md b/docs/alps/hardware.md
index a53d98a3..1c323eec 100644
--- a/docs/alps/hardware.md
+++ b/docs/alps/hardware.md
@@ -51,6 +51,11 @@ There are currently four node types in Alps, with another becoming available in
 [](){#ref-alps-gh200-node}
 ### NVIDIA GH200 GPU Nodes
+!!! under-construction
+    The description of the GH200 nodes is incomplete.
+
+    Let us know if there is missing information.
+
 There are 24 cabinets, in 4 rows with 6 cabinets per row:
+
+* 8 chassis per cabinet
diff --git a/docs/contributing/index.md b/docs/contributing/index.md
index e9728925..0e7ea43e 100644
--- a/docs/contributing/index.md
+++ b/docs/contributing/index.md
@@ -229,6 +229,66 @@ They stand out better from the main text, and can be collapsed by default if nee
 If an admonition is collapsed by default, it should have a title.
+We provide some custom admonitions.
+
+#### Change
+
+For adding information about a change, originally designed for recording updates to clusters.
+
+=== "Rendered"
+    !!! change "2025-04-17"
+        * SLURM was upgraded to version 25.1.
+        * uenv was upgraded to v0.8
+
+    Old changes can be folded:
+
+    ??? change "2025-02-04"
+        * The new Scratch cleanup policy was implemented
+        * NVIDIA driver was updated
+
+=== "Markdown"
+    ```
+    !!! change "2025-04-17"
+        * SLURM was upgraded to version 25.1.
+        * uenv was upgraded to v0.8
+    ```
+
+    Old changes can be folded:
+
+    ```
+    ??? change "2025-02-04"
+        * The new Scratch cleanup policy was implemented
+        * NVIDIA driver was updated
+    ```
+
+#### Under construction
+
+For marking incomplete sections.
+
+=== "Rendered"
+    !!! under-construction
+        This is not finished yet!
+
+=== "Markdown"
+    ```
+    !!! under-construction
+        This is not finished yet!
+    ```
+
+#### Todo
+
+As a placeholder for documentation that needs to be written.
+
+=== "Rendered"
+    !!! todo
+        Add some common error messages and how to fix them.
+
+=== "Markdown"
+    ```
+    !!! todo
+        Add some common error messages and how to fix them.
+    ```
+
 ### Code blocks
 Use [code blocks](https://squidfunk.github.io/mkdocs-material/reference/code-blocks/) when you want to display monospace text in a programming language, terminal output, configuration files etc.
diff --git a/docs/stylesheets/extra.css b/docs/stylesheets/extra.css
index 4a9eec90..0e9ae3c6 100644
--- a/docs/stylesheets/extra.css
+++ b/docs/stylesheets/extra.css
@@ -53,6 +53,22 @@
     mask-image: var(--md-admonition-icon--alps);
 }
+/* under-construction admonition */
+.md-typeset .admonition.under-construction,
+.md-typeset details.under-construction {
+    border-color: rgb(0, 0, 0);
+}
+.md-typeset .under-construction > .admonition-title,
+.md-typeset .under-construction > summary {
+    background-color: rgba(255, 255, 0, 0.3);
+}
+.md-typeset .under-construction > .admonition-title::before,
+.md-typeset .under-construction > summary::before {
+    background-color: rgb(0, 0, 0);
+    -webkit-mask-image: var(--md-admonition-icon--digging);
+    mask-image: var(--md-admonition-icon--digging);
+}
+
 .md-nav__item .md-nav__link--active {
     font-weight: bold;
 }

From a150c2ecf51572ab991c4a452fa295b321c4f95a Mon Sep 17 00:00:00 2001
From: Ben Cumming
Date: Thu, 22 May 2025 15:50:24 +0200
Subject: [PATCH 4/7] Update docs/alps/hardware.md

Co-authored-by: Rocco Meli
---
 docs/alps/hardware.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/alps/hardware.md b/docs/alps/hardware.md
index 1c323eec..4ed1c9c0 100644
--- a/docs/alps/hardware.md
+++ b/docs/alps/hardware.md
@@ -67,9 +67,9 @@ Each node contains four Grace-Hopper modules and four corresponding network inte
 ![](../images/alps/gh200-schematic.svg)
-??? info "node xnames"
+??? info "Node xname"
     There are two boards per blade with one node per board.
-    This is different to the `zen2` CPU-only nodes (used for example in Eiger) that had two nodes per board for a total of four nodes per blade.
+    This is different to the `zen2` CPU-only nodes (used for example in Eiger) that have two nodes per board for a total of four nodes per blade.
     As such, there are no `n1` nodes in the xname list, e.g.:
     ```
     x1100c0s6b0n0

From 446e6cdb71ad257fac08cf6f9d2d38705c2bf59b Mon Sep 17 00:00:00 2001
From: Ben Cumming
Date: Thu, 22 May 2025 15:50:47 +0200
Subject: [PATCH 5/7] Update docs/alps/hardware.md

Co-authored-by: Rocco Meli
---
 docs/alps/hardware.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/alps/hardware.md b/docs/alps/hardware.md
index 4ed1c9c0..f13ee10d 100644
--- a/docs/alps/hardware.md
+++ b/docs/alps/hardware.md
@@ -63,7 +63,7 @@ There are 24 cabinets, in 4 rows with 6 cabinets per row:
     * a chassis can contain up to 8 blades, however Alps' gh200 chassis are underpopulated so that we can increase the amount of power delivered to each node.
 * 2 nodes per blade
-Each node contains four Grace-Hopper modules and four corresponding network interface cards (NICS) per blade, as illustrated below:
+Each node contains four Grace-Hopper modules and four corresponding network interface cards (NICs) per blade, as illustrated below:
 ![](../images/alps/gh200-schematic.svg)

From 386392f65c5c051f87bf01ff30ff2e9da40e217c Mon Sep 17 00:00:00 2001
From: Ben Cumming
Date: Thu, 22 May 2025 15:51:12 +0200
Subject: [PATCH 6/7] Update docs/alps/hardware.md

Co-authored-by: Rocco Meli
---
 docs/alps/hardware.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/alps/hardware.md b/docs/alps/hardware.md
index f13ee10d..0c9fc71f 100644
--- a/docs/alps/hardware.md
+++ b/docs/alps/hardware.md
@@ -38,7 +38,7 @@ This approach to cooling provides greater efficiency for the rack-level cooling,
 Alps was installed in phases, starting with the installation of 1024 AMD Rome dual socket CPU nodes in 2020, through to the main installation of 2,688 Grace-Hopper nodes in 2024.
-There are currently four node types in Alps, with another becoming available in 2025:
+There are currently five node types in Alps:
 | type | abbreviation | blades | nodes | CPU sockets | GPU devices |
 | ---- | ------- | ------:| -----:| -----------:| -----------:|

From d4a13e1be57bd0bea9bf329dc97ba5be8aef0a57 Mon Sep 17 00:00:00 2001
From: bcumming
Date: Thu, 22 May 2025 16:40:55 +0200
Subject: [PATCH 7/7] pr comments

---
 docs/alps/hardware.md | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/docs/alps/hardware.md b/docs/alps/hardware.md
index 1c323eec..4c3318a9 100644
--- a/docs/alps/hardware.md
+++ b/docs/alps/hardware.md
@@ -52,17 +52,18 @@ There are currently four node types in Alps, with another becoming available in
 ### NVIDIA GH200 GPU Nodes
 !!! under-construction
-    The description of the GH200 nodes is incomplete.
-
-    Let us know if there is missing information.
-
-There are 24 cabinets, in 4 rows with 6 cabinets per row:
+    The description of the GH200 nodes is a work in progress.
+    We will add more detailed information soon.
+    Please [get in touch](https://github.com/eth-cscs/cscs-docs/issues) if there is information that you want to see here.
+There are 24 cabinets, in 4 rows with 6 cabinets per row, and each cabinet contains 112 nodes (for a total of 448 GH200):
 * 8 chassis per cabinet
 * 7 blades per chassis
-    * a chassis can contain up to 8 blades, however Alps' gh200 chassis are underpopulated so that we can increase the amount of power delivered to each node.
 * 2 nodes per blade
+!!! info "Why 7 blades per chassis?"
+    A chassis can contain up to 8 blades, however Alps' gh200 chassis are underpopulated so that we can increase the amount of power delivered to each GPU.
+
 Each node contains four Grace-Hopper modules and four corresponding network interface cards (NICS) per blade, as illustrated below:
 ![](../images/alps/gh200-schematic.svg)
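
Reviewer note, not part of the patch series: the GH200 counts quoted in PATCH 7/7 (112 nodes and 448 GH200 per cabinet) can be cross-checked against the node table in PATCH 1/7 (1344 blades, 2688 nodes, 10,752 GPU devices). The sketch below is a minimal sanity check, assuming only the layout figures stated in the hunks above. The constant and function names are illustrative and do not come from the repository.

```python
# Layout figures quoted in the GH200 section of the patches above.
CABINETS = 24             # 4 rows x 6 cabinets per row
CHASSIS_PER_CABINET = 8
BLADES_PER_CHASSIS = 7    # a chassis has 8 slots, 7 are populated
NODES_PER_BLADE = 2
GH200_PER_NODE = 4        # four Grace-Hopper modules per node


def gh200_totals():
    """Derive per-cabinet and system-wide counts (hypothetical helper)."""
    nodes_per_cabinet = CHASSIS_PER_CABINET * BLADES_PER_CHASSIS * NODES_PER_BLADE
    blades = CABINETS * CHASSIS_PER_CABINET * BLADES_PER_CHASSIS
    nodes = CABINETS * nodes_per_cabinet
    return {
        "nodes_per_cabinet": nodes_per_cabinet,
        "gh200_per_cabinet": nodes_per_cabinet * GH200_PER_NODE,
        "blades": blades,
        "nodes": nodes,
        "gh200_modules": nodes * GH200_PER_NODE,
    }


if __name__ == "__main__":
    for name, value in gh200_totals().items():
        print(f"{name}: {value}")
```

Under these assumptions the derived values are 112 nodes and 448 GH200 per cabinet, and 1344 blades, 2688 nodes and 10752 GH200 modules in total, which agrees with both the prose and the table.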