diff --git a/docs/clusters/clariden.md b/docs/clusters/clariden.md
index fffdcb82..dbbe636b 100644
--- a/docs/clusters/clariden.md
+++ b/docs/clusters/clariden.md
@@ -16,45 +16,9 @@ The number of nodes can change when nodes are added or removed from other cluste
 
 Most nodes are in the [`normal` slurm partition][ref-slurm-partition-normal], while a few nodes are in the [`debug` partition][ref-slurm-partition-debug].
 
-### File Systems and Storage
+### Storage and file systems
 
-There are three main file systems mounted on Clariden and Bristen.
-
-| type |mount | filesystem |
-| -- | -- | -- |
-| Home | /users/$USER | [VAST][ref-alps-vast] |
-| Scratch | `/iopstor/scratch/cscs/$USER` | [Iopstor][ref-alps-iopstor] |
-| Project | `/capstor/store/cscs/swissai/` | [Capstor][ref-alps-capstor] |
-
-#### Home
-
-Every user has a home path (`$HOME`) mounted at `/users/$USER` on the [VAST][ref-alps-vast] filesystem.
-The home directory has 50 GB of capacity, and is intended for configuration, small software packages and scripts.
-
-#### Scratch
-
-Scratch filesystems provide temporary storage for high-performance I/O for executing jobs.
-Use scratch to store datasets that will be accessed by jobs, and for job output.
-Scratch is per user - each user gets separate scratch path and quota.
-
-* The environment variable `SCRATCH=/iopstor/scratch/cscs/$USER` is set automatically when you log into the system, and can be used as a shortcut to access scratch.
-
-!!! warning "scratch cleanup policy"
-    Files that have not been accessed in 30 days are automatically deleted.
-
-    **Scratch is not intended for permanent storage**: transfer files back to the capstor project storage after job runs.
-
-!!! note
-    There is an additional scratch path mounted on [Capstor][ref-alps-capstor] at `/capstor/scratch/cscs/$USER`, however this is not recommended for ML workloads for performance reasons.
-
-### Project
-
-Project storage is backed up, with no cleaning policy: it provides intermediate storage space for datasets, shared code or configuration scripts that need to be accessed from different vClusters.
-Project is per project - each project gets a project folder with project-specific quota.
-
-* if you need additional storage, ask your PI to contact the CSCS service managers Fawzi or Nicholas.
-* hard limits on capacity and inodes prevent users from writing to project if the quota is reached - you can check quota and available space by running the [`quota`][ref-storage-quota] command on a login node or ela
-* it is not recommended to write directly to the project path from jobs.
+Clariden uses the [MLp filesystems and storage policies][ref-mlp-storage].
 
 ## Getting started
 
diff --git a/docs/clusters/santis.md b/docs/clusters/santis.md
index 507ca020..44c7a006 100644
--- a/docs/clusters/santis.md
+++ b/docs/clusters/santis.md
@@ -1,3 +1,127 @@
 [](){#ref-cluster-santis}
 # Santis
 
+Santis is an Alps cluster that provides GPU accelerators and file systems designed to meet the needs of climate and weather models for the [CWp][ref-platform-cwp].
+
+## Cluster specification
+
+### Compute nodes
+
+Santis consists of around ??? [Grace-Hopper nodes][ref-alps-gh200-node].
+The number of nodes can change when nodes are added or removed from other clusters on Alps.
+
+There are four login nodes, labelled `santis-ln00[1-4]`.
+You will be assigned to one of the four login nodes when you ssh onto the system, from where you can edit files, compile applications and start simulation jobs.
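+
+??? example "checking which login node you are on"
+    A minimal illustration, assuming you have already set up SSH access as described in the [ssh guide][ref-ssh]; the `hostname` command reports which of the four login nodes you were assigned (the output shown here is only an example):
+    ```terminal
+    > ssh santis
+    > hostname
+    santis-ln002
+    ```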
+
+| node type | number of nodes | total CPU sockets | total GPUs |
+|-----------|--------| ----------------- | ---------- |
+| [gh200][ref-alps-gh200-node] | 1,200 | 4,800 | 4,800 |
+
+### Storage and file systems
+
+Santis uses the [CWp filesystems and storage policies][ref-cwp-storage].
+
+## Getting started
+
+### Logging into Santis
+
+To connect to Santis via SSH, first refer to the [ssh guide][ref-ssh].
+
+!!! example "`~/.ssh/config`"
+    Add the following to your [SSH configuration][ref-ssh-config] to enable you to connect directly to Santis using `ssh santis`.
+    ```
+    Host santis
+        HostName santis.alps.cscs.ch
+        ProxyJump ela
+        User cscsusername
+        IdentityFile ~/.ssh/cscs-key
+        IdentitiesOnly yes
+    ```
+
+### Software
+
+CSCS and the user community provide software environments on Santis in the form of [uenv][ref-uenv].
+
+Currently, the following uenv are provided for the climate and weather community:
+
+* `icon/25.1`
+* `climana/25.1`
+
+In addition to the climate and weather uenv, all of the uenv provided for other Alps clusters can also be used on Santis, as shown in the example below.
+
+??? example "using uenv provided for other clusters"
+    You can run uenv that were built for other Alps clusters using the `@` notation.
+    For example, to use uenv images for [daint][ref-cluster-daint]:
+    ```bash
+    # list all images available for daint
+    uenv image find @daint
+
+    # download an image for daint
+    uenv image pull namd/3.0:v3@daint
+
+    # start the uenv
+    uenv start namd/3.0:v3@daint
+    ```
+
+It is also possible to use HPC containers on Santis:
+
+* Jobs using containers can be easily set up and submitted using the [container engine][ref-container-engine].
+* To build images, see the [guide to building container images on Alps][ref-build-containers].
+
+## Running jobs on Santis
+
+### SLURM
+
+Santis uses [SLURM][ref-slurm] as the workload manager, which is used to launch and monitor distributed workloads, such as training runs.
+
+There are three slurm partitions on the system:
+
+* the `normal` partition is for all production workloads.
+* the `debug` partition can be used to access a small allocation for up to 30 minutes for debugging and testing purposes.
+* the `xfer` partition is for [internal data transfer][ref-data-xfer-internal] at CSCS.
+
+| name | nodes | max nodes per job | time limit |
+| -- | -- | -- | -- |
+| `normal` | 1266 | - | 24 hours |
+| `debug` | 32 | 2 | 30 minutes |
+| `xfer` | 2 | 1 | 24 hours |
+
+* nodes in the `normal` and `debug` partitions are not shared.
+* nodes in the `xfer` partition can be shared.
+
+See the SLURM documentation for instructions on how to run jobs on the [Grace-Hopper nodes][ref-slurm-gh200].
+
+??? example "how to check the number of nodes on the system"
+    You can check the size of the system by running the following command in the terminal:
+    ```terminal
+    > sinfo --format "| %20R | %10D | %10s | %10l | %10A |"
+    | PARTITION | NODES | JOB_SIZE | TIMELIMIT | NODES(A/I) |
+    | debug | 32 | 1-2 | 30:00 | 3/29 |
+    | normal | 1266 | 1-infinite | 1-00:00:00 | 812/371 |
+    | xfer | 2 | 1 | 1-00:00:00 | 1/1 |
+    ```
+    The last column shows the number of nodes that have been allocated to currently running jobs (`A`) and the number of nodes that are idle (`I`).
+
+### FirecREST
+
+Santis can also be accessed using [FirecREST][ref-firecrest] at the `https://api.cscs.ch/ml/firecrest/v1` API endpoint.
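+
+??? example "querying the FirecREST API with curl"
+    A minimal sketch of calling the API directly with `curl`, assuming you have already registered an API client and exported a valid access token as `TOKEN` (see the [FirecREST][ref-firecrest] documentation for details); it lists the systems visible to your client via the standard FirecREST v1 `status/systems` endpoint:
+    ```bash
+    # list the systems accessible to this client
+    curl -s -H "Authorization: Bearer ${TOKEN}" \
+        "https://api.cscs.ch/ml/firecrest/v1/status/systems"
+    ```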
+
+## Maintenance and status
+
+### Scheduled maintenance
+
+Wednesday morning 8-12 CET is reserved for periodic updates, with services potentially unavailable during this timeframe.
+If the queues must be drained (redeployment of node images, rebooting of compute nodes, etc.) then a Slurm reservation will be in place that will prevent jobs from running into the maintenance window.
+
+Exceptional and non-disruptive updates may happen outside this time frame and will be announced to the users mailing list, and on the [CSCS status page](https://status.cscs.ch).
+
+### Change log
+
+!!! change "2025-03-05 container engine updated"
+    Now supports better containers that go faster. Users do not need to change their workflow to take advantage of these updates.
+
+??? change "2024-10-07 old event"
+    This is an old update. Use `???` to automatically fold the update.
+
+### Known issues
+
diff --git a/docs/platforms/cwp/index.md b/docs/platforms/cwp/index.md
index 38b7d143..cf40ff4e 100644
--- a/docs/platforms/cwp/index.md
+++ b/docs/platforms/cwp/index.md
@@ -1,5 +1,74 @@
 [](){#ref-platform-cwp}
-# Climate and Weather Platform
+# Climate and weather platform
+
+The Climate and Weather Platform (CWp) provides compute, storage and support to the climate and weather modeling community in Switzerland.
+
+## Getting started
+
+### Getting access
+
+Project administrators (PIs and deputy PIs) of projects on the CWp can invite users to join their project, which users must do before they can use the project's resources on Alps.
 
 !!! todo
-    follow the template of the [MLp][ref-platform-mlp]
+    This points to the Waldur solution - whether the [UMP][ref-account-ump] or [Waldur][ref-account-waldur] docs are linked depends on which is being used when these docs go live.
+
+This is performed using the [project management tool][ref-account-waldur].
+
+Once invited to a project, you will receive an email, which you can use to create an account and configure [multi-factor authentication][ref-mfa] (MFA).
+
+## Systems
+
+Santis is the system deployed on the Alps infrastructure for the Climate and Weather Platform.
+Its name derives from Säntis, the highest mountain in the Alpstein massif of north-eastern Switzerland.
+
+<div class="grid cards" markdown>
+- :fontawesome-solid-mountain: [__Santis__][ref-cluster-santis]
+
+    Santis is a large [Grace-Hopper][ref-alps-gh200-node] cluster.
+</div>
+
+[](){#ref-cwp-storage}
+## File systems and storage
+
+There are three main file systems mounted on the CWp system Santis.
+
+| type | mount | filesystem |
+| -- | -- | -- |
+| Home | `/users/$USER` | [VAST][ref-alps-vast] |
+| Scratch | `/capstor/scratch/cscs/$USER` | [Capstor][ref-alps-capstor] |
+| Project | `/capstor/store/cscs/userlab/` | [Capstor][ref-alps-capstor] |
+
+### Home
+
+Every user has a home path (`$HOME`) mounted at `/users/$USER` on the [VAST][ref-alps-vast] filesystem.
+The home directory has 50 GB of capacity, and is intended for configuration, small software packages and scripts.
+
+### Scratch
+
+The Scratch filesystem provides temporary storage for high-performance I/O for executing jobs.
+Use scratch to store datasets that will be accessed by jobs, and for job output.
+Scratch is per user - each user gets a separate scratch path and quota.
+
+!!! info
+    A quota of 150 TB and 1 million inodes (files and folders) is applied to your scratch path.
+
+    These are implemented as soft quotas: upon reaching either limit there is a grace period of 1 week before write access to `$SCRATCH` is blocked.
+
+    You can check your quota at any time from Ela or one of the login nodes, using the [`quota` command][ref-storage-quota].
+
+!!! info
+    The environment variable `SCRATCH=/capstor/scratch/cscs/$USER` is set automatically when you log into the system, and can be used as a shortcut to access scratch.
+
+!!! warning "scratch cleanup policy"
+    Files that have not been accessed in 30 days are automatically deleted.
+
+    **Scratch is not intended for permanent storage**: transfer files back to the capstor project storage after job runs.
+
+### Project
+
+Project storage is backed up, with no cleaning policy: it provides intermediate storage space for datasets, shared code or configuration scripts that need to be accessed from different vClusters.
+Project is per project - each project gets a project folder with a project-specific quota.
+
+* hard limits on capacity and inodes prevent users from writing to project if the quota is reached - you can check quota and available space by running the [`quota`][ref-storage-quota] command on a login node or Ela.
+* it is not recommended to write directly to the project path from jobs.
+
diff --git a/docs/platforms/mlp/index.md b/docs/platforms/mlp/index.md
index e1f7227b..ddd2addc 100644
--- a/docs/platforms/mlp/index.md
+++ b/docs/platforms/mlp/index.md
@@ -1,14 +1,9 @@
 [](){#ref-platform-mlp}
-# Machine Learning Platform
+# Machine learning platform
 
-!!! todo
-    A description of the MLP
-
-    * who are the users (help answer the question "is this the platform that I am on")
-    * who are the partners (SwissAI, etc)
-    * how to get apply to access MLp (if that is a thing)
+The Machine Learning Platform (MLp) provides compute, storage and expertise to the machine learning and AI community in Switzerland, with the main user being the [Swiss AI Initiative](https://www.swiss-ai.org/).
 
-## Getting Started
+## Getting started
 
 ### Getting access
 
@@ -17,14 +12,14 @@ This is performed using the [project management tool][ref-account-waldur]
 
 Once invited to a project, you will receive an email, which you can need to create an account and configure [multi-factor authentication][ref-mfa] (MFA).
 
-## vClusters
+## Systems
 
 The main cluster provided by the MLp is Clariden, a large Grace-Hopper GPU system on Alps.
 
 <div class="grid cards" markdown>
 - :fontawesome-solid-mountain: [__Clariden__][ref-cluster-clariden]
 
-    Clariden is the main [Grace-Hopper][ref-alps-gh200-node] cluster used for **todo**
+    Clariden is the main [Grace-Hopper][ref-alps-gh200-node] cluster.
 
 </div>
@@ -33,7 +28,48 @@ The main cluster provided by the MLp is Clariden, a large Grace-Hopper GPU syste
 
     Bristen is a smaller system with [A100 GPU nodes][ref-alps-a100-node] for **todo**
 
 </div>
-## Guides and Tutorials
+[](){#ref-mlp-storage}
+## File systems and storage
+
+There are three main file systems mounted on the MLp clusters Clariden and Bristen.
+
+| type | mount | filesystem |
+| -- | -- | -- |
+| Home | `/users/$USER` | [VAST][ref-alps-vast] |
+| Scratch | `/iopstor/scratch/cscs/$USER` | [Iopstor][ref-alps-iopstor] |
+| Project | `/capstor/store/cscs/swissai/` | [Capstor][ref-alps-capstor] |
+
+### Home
+
+Every user has a home path (`$HOME`) mounted at `/users/$USER` on the [VAST][ref-alps-vast] filesystem.
+The home directory has 50 GB of capacity, and is intended for configuration, small software packages and scripts.
+
+### Scratch
+
+Scratch filesystems provide temporary storage for high-performance I/O for executing jobs.
+Use scratch to store datasets that will be accessed by jobs, and for job output.
+Scratch is per user - each user gets a separate scratch path and quota.
+
+* The environment variable `SCRATCH=/iopstor/scratch/cscs/$USER` is set automatically when you log into the system, and can be used as a shortcut to access scratch.
+
+!!! warning "scratch cleanup policy"
+    Files that have not been accessed in 30 days are automatically deleted.
+
+    **Scratch is not intended for permanent storage**: transfer files back to the capstor project storage after job runs (see the example at the end of this page).
+
+!!! note
+    There is an additional scratch path mounted on [Capstor][ref-alps-capstor] at `/capstor/scratch/cscs/$USER`, however this is not recommended for ML workloads for performance reasons.
+
+### Project
+
+Project storage is backed up, with no cleaning policy: it provides intermediate storage space for datasets, shared code or configuration scripts that need to be accessed from different vClusters.
+Project is per project - each project gets a project folder with a project-specific quota.
+
+* if you need additional storage, ask your PI to contact the CSCS service managers Fawzi or Nicholas.
+* hard limits on capacity and inodes prevent users from writing to project if the quota is reached - you can check quota and available space by running the [`quota`][ref-storage-quota] command on a login node or Ela.
+* it is not recommended to write directly to the project path from jobs.
+
+## Guides and tutorials
 
 !!! todo
     links to tutorials and guides for ML workflows
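+
+??? example "staging job output from scratch to project storage"
+    A minimal sketch of copying results from `$SCRATCH` back to the backed-up project storage after a job completes, as recommended by the scratch cleanup policy above. The `<project>` folder and `results` directory are illustrative placeholders - replace them with your own paths:
+    ```bash
+    # copy results from iopstor scratch to the project folder on capstor
+    rsync -av "$SCRATCH/results/" "/capstor/store/cscs/swissai/<project>/results/"
+    ```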