add platform docs for CWp #61

[](){#ref-cluster-santis}
# Santis

Santis is an Alps cluster that provides GPU accelerators and file systems designed to meet the needs of climate and weather models for the [CWp][ref-platform-cwp].

## Cluster specification

### Compute nodes

Santis consists of around ??? [Grace-Hopper nodes][ref-alps-gh200-node].

> **Collaborator:** Not sure what the official number is supposed to be. The table below says 1,200, but I think the real number is lower than that.
>
> **Author:** I have added a temporary link to the Gordon Bell guide.

The number of nodes can change when nodes are moved to or from other clusters on Alps.

There are four login nodes, labelled `santis-ln00[1-4]`.
You will be assigned to one of the four login nodes when you SSH onto the system, from where you can edit files, compile applications and start simulation jobs.

| node type | number of nodes | total CPU sockets | total GPUs |
|-----------|-----------------|-------------------|------------|
| [gh200][ref-alps-gh200-node] | 1,200 | 4,800 | 4,800 |

> **Collaborator:** See comment above.

### Storage and file systems

Santis uses the [CWp filesystems and storage policies][ref-cwp-storage].

## Getting started

### Logging into Santis

To connect to Santis via SSH, first refer to the [ssh guide][ref-ssh].

!!! example "`~/.ssh/config`"
    Add the following to your [SSH configuration][ref-ssh-config] to enable you to connect directly to Santis using `ssh santis`.
    ```
    Host santis
        HostName santis.alps.cscs.ch
        ProxyJump ela
        User cscsusername
        IdentityFile ~/.ssh/cscs-key
        IdentitiesOnly yes
    ```
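
If you have not yet set up the configuration file, a one-off connection with an explicit jump through the frontend should work the same way (a sketch, assuming the standard `ela.cscs.ch` entry point and your CSCS key loaded in your SSH agent):

```bash
# jump via the Ela frontend, then on to the Santis login nodes
ssh -J cscsusername@ela.cscs.ch cscsusername@santis.alps.cscs.ch
```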

### Software

CSCS and the user community provide software environments on Santis using [uenv][ref-uenv].

Currently, the following uenv are provided for the climate and weather community:

* `icon/25.1`
* `climana/25.1`
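
As a sketch of a typical session with one of these uenv (using the standard uenv commands demonstrated later on this page; the exact image names and versions in the Santis registry may differ):

```bash
# list the icon images available on the system
uenv image find icon

# download the image
uenv image pull icon/25.1

# start a shell with the uenv mounted
uenv start icon/25.1
```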

In addition to the climate and weather uenv, uenv provided for other Alps clusters can also be used on Santis.

??? example "using uenv provided for other clusters"
    You can run uenv that were built for other Alps clusters using the `@` notation.
    For example, to use uenv images for [daint][ref-cluster-daint]:
    ```bash
    # list all images available for daint
    uenv image find @daint

    # download an image for daint
    uenv image pull namd/3.0:v3@daint

    # start the uenv
    uenv start namd/3.0:v3@daint
    ```

It is also possible to use HPC containers on Santis:

* Jobs using containers can be easily set up and submitted using the [container engine][ref-container-engine], as sketched below.
* To build images, see the [guide to building container images on Alps][ref-build-containers].
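
As a minimal sketch of the container engine workflow (the EDF file name, image and mount shown here are illustrative assumptions, not Santis defaults):

```bash
# contents of a hypothetical environment definition file (EDF)
# at ~/.edf/demo.toml:
#
#   image = "ubuntu:24.04"
#   mounts = ["/capstor/scratch/cscs"]

# run a command inside the container described by the EDF
srun --environment=demo cat /etc/os-release
```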

## Running jobs on Santis

### SLURM

Santis uses [SLURM][ref-slurm] as the workload manager, which is used to launch and monitor distributed workloads such as large simulation runs.

There are three SLURM partitions on the system:

* the `normal` partition is for all production workloads.
* the `debug` partition can be used to access a small allocation for up to 30 minutes for debugging and testing purposes.
* the `xfer` partition is for [internal data transfer][ref-data-xfer-internal] at CSCS.

| name | nodes | max nodes per job | time limit |
| -- | -- | -- | -- |
| `normal` | 1266 | - | 24 hours |
| `debug` | 32 | 2 | 30 minutes |
| `xfer` | 2 | 1 | 24 hours |

* nodes in the `normal` and `debug` partitions are not shared
* nodes in the `xfer` partition can be shared

See the SLURM documentation for instructions on how to run jobs on the [Grace-Hopper nodes][ref-slurm-gh200].
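
As a hedged sketch, a batch script for a two-node test on the `debug` partition might look like the following (the account `g123` is a placeholder, and the launch options are generic SLURM rather than Santis-specific tuning — see the [Grace-Hopper SLURM guide][ref-slurm-gh200] for GPU binding details):

```bash
#!/bin/bash
#SBATCH --job-name=test-run
#SBATCH --partition=debug      # small allocation for testing, 30 minute limit
#SBATCH --nodes=2              # the debug partition allows at most 2 nodes per job
#SBATCH --time=00:30:00
#SBATCH --account=g123         # placeholder: use your own project account

# one task per GPU: the gh200 nodes have 4 GPUs each (see the table above)
srun --ntasks-per-node=4 ./my_simulation
```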

??? example "how to check the number of nodes on the system"
    You can check the size of the system by running the following command in the terminal:
    ```terminal
    > sinfo --format "| %20R | %10D | %10s | %10l | %10A |"
    | PARTITION            | NODES      | JOB_SIZE   | TIMELIMIT  | NODES(A/I) |
    | debug                | 32         | 1-2        | 30:00      | 3/29       |
    | normal               | 1266       | 1-infinite | 1-00:00:00 | 812/371    |
    | xfer                 | 2          | 1          | 1-00:00:00 | 1/1        |
    ```
    The last column shows the number of nodes that are allocated to currently running jobs (`A`) and the number of nodes that are idle (`I`).

### FirecREST

Santis can also be accessed using [FirecREST][ref-firecrest] at the `https://api.cscs.ch/ml/firecrest/v1` API endpoint.
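
For example, once you have obtained an access token, querying the status of the systems exposed by the API might look like this (a sketch of a raw FirecREST v1 call; obtaining `$TOKEN` via your API client credentials is not shown):

```bash
# TOKEN must hold a valid access token for the CSCS API
curl -s -H "Authorization: Bearer $TOKEN" \
    "https://api.cscs.ch/ml/firecrest/v1/status/systems"
```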

## Maintenance and status

### Scheduled maintenance

Wednesday morning 8-12 CET is reserved for periodic updates, with services potentially unavailable during this time frame.
If the queues must be drained (redeployment of node images, rebooting of compute nodes, etc.), then a SLURM reservation will be in place that will prevent jobs from running into the maintenance window.

Exceptional and non-disruptive updates may happen outside this time frame and will be announced to the users mailing list, and on the [CSCS status page](https://status.cscs.ch).

### Change log

!!! change "2025-03-05 container engine updated"
    now supports better containers that go faster. Users do not need to change their workflow to take advantage of these updates.

??? change "2024-10-07 old event"
    this is an old update. Use `???` to automatically fold the update.

### Known issues

[](){#ref-platform-cwp}
# Climate and weather platform

The Climate and Weather Platform (CWp) provides compute, storage and support to the climate and weather modeling community in Switzerland.

## Getting started

### Getting access

Project administrators (PIs and deputy PIs) of projects on the CWp can invite users to join their project; users must join a project before they can use the project's resources on Alps.

!!! todo
    follow the template of the [MLp][ref-platform-mlp]
    This points to the Waldur solution - whether the [UMP][ref-account-ump] or [Waldur][ref-account-waldur] docs are linked depends on which is being used when these docs go live.

This is performed using the [project management tool][ref-account-waldur].

Once invited to a project, you will receive an email with the instructions you need to create an account and configure [multi-factor authentication][ref-mfa] (MFA).

## Systems

Santis is the system deployed on the Alps infrastructure for the climate and weather platform.
Its name derives from Säntis, the highest mountain in the Alpstein massif of north-eastern Switzerland.

<div class="grid cards" markdown>
- :fontawesome-solid-mountain: [__Santis__][ref-cluster-santis]

    Santis is a large [Grace-Hopper][ref-alps-gh200-node] cluster.
</div>

[](){#ref-cwp-storage}
## File systems and storage

There are three main file systems mounted on the CWp system Santis.

| type | mount | filesystem |
| -- | -- | -- |
| Home | `/users/$USER` | [VAST][ref-alps-vast] |
| Scratch | `/capstor/scratch/cscs/$USER` | [Capstor][ref-alps-capstor] |
| Project | `/capstor/store/cscs/userlab/<project>` | [Capstor][ref-alps-capstor] |

### Home

Every user has a home path (`$HOME`) mounted at `/users/$USER` on the [VAST][ref-alps-vast] filesystem.
The home directory has 50 GB of capacity, and is intended for configuration files, small software packages and scripts.

### Scratch

The Scratch filesystem provides temporary storage with high-performance I/O for executing jobs.
Use scratch to store datasets that will be accessed by jobs, and for job output.
Scratch is per user - each user gets a separate scratch path and quota.

!!! info
    A quota of 150 TB and 1 million inodes (files and folders) is applied to your scratch path.

    These are implemented as soft quotas: upon reaching either limit there is a grace period of 1 week before write access to `$SCRATCH` is blocked.

    You can check your quota at any time from Ela or one of the login nodes, using the [`quota` command][ref-storage-quota].

!!! info
    The environment variable `SCRATCH=/capstor/scratch/cscs/$USER` is set automatically when you log into the system, and can be used as a shortcut to access scratch.

!!! warning "scratch cleanup policy"
    Files that have not been accessed in 30 days are automatically deleted.

**Scratch is not intended for permanent storage**: transfer files back to the capstor project storage after job runs.
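
For example, copying job output from scratch back to project storage might look like this (a sketch; replace `<project>` with your project's directory name):

```bash
# copy results from scratch to the backed-up project storage
rsync -av "$SCRATCH/results/" "/capstor/store/cscs/userlab/<project>/results/"
```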

### Project

Project storage is backed up, with no cleaning policy: it provides intermediate storage space for datasets, shared code or configuration scripts that need to be accessed from different vClusters.
Project is per project - each project gets a project folder with a project-specific quota.

* hard limits on capacity and inodes prevent users from writing to project if the quota is reached - you can check quota and available space by running the [`quota`][ref-storage-quota] command on a login node or Ela.
* it is not recommended to write directly to the project path from jobs.

[](){#ref-platform-mlp}
# Machine learning platform

The Machine Learning Platform (MLp) provides compute, storage and expertise to the machine learning and AI community in Switzerland, with the main user being the [Swiss AI Initiative](https://www.swiss-ai.org/).

## Getting started

### Getting access

Project administrators (PIs and deputy PIs) of projects on the MLp can invite users to join their project; users must join a project before they can use the project's resources on Alps.

This is performed using the [project management tool][ref-account-waldur].

Once invited to a project, you will receive an email with the instructions you need to create an account and configure [multi-factor authentication][ref-mfa] (MFA).

## Systems

The main cluster provided by the MLp is Clariden, a large Grace-Hopper GPU system on Alps.

<div class="grid cards" markdown>
- :fontawesome-solid-mountain: [__Clariden__][ref-cluster-clariden]

    Clariden is the main [Grace-Hopper][ref-alps-gh200-node] cluster.
</div>

<div class="grid cards" markdown>
- :fontawesome-solid-mountain: [__Bristen__][ref-cluster-bristen]

    Bristen is a smaller system with [A100 GPU nodes][ref-alps-a100-node] for **todo**
</div>

[](){#ref-mlp-storage}
## File systems and storage

There are three main file systems mounted on the MLp clusters Clariden and Bristen.

| type | mount | filesystem |
| -- | -- | -- |
| Home | `/users/$USER` | [VAST][ref-alps-vast] |
| Scratch | `/iopstor/scratch/cscs/$USER` | [Iopstor][ref-alps-iopstor] |
| Project | `/capstor/store/cscs/swissai/<project>` | [Capstor][ref-alps-capstor] |

### Home

Every user has a home path (`$HOME`) mounted at `/users/$USER` on the [VAST][ref-alps-vast] filesystem.
The home directory has 50 GB of capacity, and is intended for configuration files, small software packages and scripts.

### Scratch

Scratch filesystems provide temporary storage with high-performance I/O for executing jobs.
Use scratch to store datasets that will be accessed by jobs, and for job output.
Scratch is per user - each user gets a separate scratch path and quota.

* The environment variable `SCRATCH=/iopstor/scratch/cscs/$USER` is set automatically when you log into the system, and can be used as a shortcut to access scratch.

!!! warning "scratch cleanup policy"
    Files that have not been accessed in 30 days are automatically deleted.

**Scratch is not intended for permanent storage**: transfer files back to the capstor project storage after job runs.

!!! note
    There is an additional scratch path mounted on [Capstor][ref-alps-capstor] at `/capstor/scratch/cscs/$USER`, however this is not recommended for ML workloads for performance reasons.

### Project

Project storage is backed up, with no cleaning policy: it provides intermediate storage space for datasets, shared code or configuration scripts that need to be accessed from different vClusters.
Project is per project - each project gets a project folder with a project-specific quota.

* if you need additional storage, ask your PI to contact the CSCS service managers Fawzi or Nicholas.
* hard limits on capacity and inodes prevent users from writing to project if the quota is reached - you can check quota and available space by running the [`quota`][ref-storage-quota] command on a login node or Ela.
* it is not recommended to write directly to the project path from jobs.

## Guides and tutorials

!!! todo
    links to tutorials and guides for ML workflows

> **Collaborator:** Mostly a question for myself: is CWp/MLp/etc. the way we capitalize this, not CWP/MLP/etc.?
>
> **Author:** I just asked, and it was XYp, but it was changed to XYP. I will make the change.