add platform docs for CWp #61

[](){#ref-cluster-santis}
# Santis

Santis is an Alps cluster that provides GPU accelerators and file systems designed to meet the needs of climate and weather models for the [CWp][ref-platform-cwp].

## Cluster specification

### Compute nodes

Santis consists of around ??? [Grace-Hopper nodes][ref-alps-gh200-node].

> **Collaborator:** Not sure what the official number is supposed to be. The table below says 1,200, but I think the real number is lower than that.
>
> **Author:** I have added a temporary link to the Gordon Bell guide.

The number of nodes can change when nodes are moved to or from other clusters on Alps.

There are four login nodes, labelled `santis-ln00[1-4]`.
You will be assigned to one of the four login nodes when you SSH onto the system, from where you can edit files, compile applications and start simulation jobs.

| node type | number of nodes | total CPU sockets | total GPUs |
|-----------|-----------------|-------------------|------------|
| [gh200][ref-alps-gh200-node] | 1,200 | 4,800 | 4,800 |

> **Collaborator:** See comment above.

### Storage and file systems

Santis uses the [CWp filesystems and storage policies][ref-cwp-storage].

## Getting started

### Logging into Santis

To connect to Santis via SSH, first refer to the [ssh guide][ref-ssh].

!!! example "`~/.ssh/config`"
    Add the following to your [SSH configuration][ref-ssh-config] to enable you to connect directly to Santis using `ssh santis`.
    ```
    Host santis
        HostName santis.alps.cscs.ch
        ProxyJump ela
        User cscsusername
        IdentityFile ~/.ssh/cscs-key
        IdentitiesOnly yes
    ```
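
If you have not yet set up the configuration file, a one-off connection with an explicit jump through the frontend should work the same way (a sketch, assuming the standard `ela.cscs.ch` entry point and your CSCS key loaded in your SSH agent):

```bash
# jump via the Ela frontend, then on to the Santis login nodes
ssh -J cscsusername@ela.cscs.ch cscsusername@santis.alps.cscs.ch
```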

### Software

CSCS and the user community provide software environments on Santis using [uenv][ref-uenv].

Currently, the following uenv are provided for the climate and weather community:

* `icon/25.1`
* `climana/25.1`
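
As a sketch of a typical session with one of these uenv (using the standard uenv commands demonstrated later on this page; the exact image names and versions in the Santis registry may differ):

```bash
# list the icon images available on the system
uenv image find icon

# download the image
uenv image pull icon/25.1

# start a shell with the uenv mounted
uenv start icon/25.1
```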

In addition to the climate and weather uenv, uenv provided for other Alps clusters can also be used on Santis.

??? example "using uenv provided for other clusters"
    You can run uenv that were built for other Alps clusters using the `@` notation.
    For example, to use uenv images for [daint][ref-cluster-daint]:
    ```bash
    # list all images available for daint
    uenv image find @daint

    # download an image for daint
    uenv image pull namd/3.0:v3@daint

    # start the uenv
    uenv start namd/3.0:v3@daint
    ```

It is also possible to use HPC containers on Santis:

* Jobs using containers can be easily set up and submitted using the [container engine][ref-container-engine], as sketched below.
* To build images, see the [guide to building container images on Alps][ref-build-containers].
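
As a minimal sketch of the container engine workflow (the EDF file name, image and mount shown here are illustrative assumptions, not Santis defaults):

```bash
# contents of a hypothetical environment definition file (EDF)
# at ~/.edf/demo.toml:
#
#   image = "ubuntu:24.04"
#   mounts = ["/capstor/scratch/cscs"]

# run a command inside the container described by the EDF
srun --environment=demo cat /etc/os-release
```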

## Running jobs on Santis

### SLURM

Santis uses [SLURM][ref-slurm] as the workload manager, which is used to launch and monitor distributed workloads such as large simulation runs.

There are three SLURM partitions on the system:

* the `normal` partition is for all production workloads.
* the `debug` partition can be used to access a small allocation for up to 30 minutes for debugging and testing purposes.
* the `xfer` partition is for [internal data transfer][ref-data-xfer-internal] at CSCS.

| name | nodes | max nodes per job | time limit |
| -- | -- | -- | -- |
| `normal` | 1266 | - | 24 hours |
| `debug` | 32 | 2 | 30 minutes |
| `xfer` | 2 | 1 | 24 hours |

* nodes in the `normal` and `debug` partitions are not shared
* nodes in the `xfer` partition can be shared

See the SLURM documentation for instructions on how to run jobs on the [Grace-Hopper nodes][ref-slurm-gh200].
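
As a hedged sketch, a batch script for a two-node test on the `debug` partition might look like the following (the account `g123` is a placeholder, and the launch options are generic SLURM rather than Santis-specific tuning — see the [Grace-Hopper SLURM guide][ref-slurm-gh200] for GPU binding details):

```bash
#!/bin/bash
#SBATCH --job-name=test-run
#SBATCH --partition=debug      # small allocation for testing, 30 minute limit
#SBATCH --nodes=2              # the debug partition allows at most 2 nodes per job
#SBATCH --time=00:30:00
#SBATCH --account=g123         # placeholder: use your own project account

# one task per GPU: the gh200 nodes have 4 GPUs each (see the table above)
srun --ntasks-per-node=4 ./my_simulation
```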

??? example "how to check the number of nodes on the system"
    You can check the size of the system by running the following command in the terminal:
    ```terminal
    > sinfo --format "| %20R | %10D | %10s | %10l | %10A |"
    | PARTITION            | NODES      | JOB_SIZE   | TIMELIMIT  | NODES(A/I) |
    | debug                | 32         | 1-2        | 30:00      | 3/29       |
    | normal               | 1266       | 1-infinite | 1-00:00:00 | 812/371    |
    | xfer                 | 2          | 1          | 1-00:00:00 | 1/1        |
    ```
    The last column shows the number of nodes that are allocated to currently running jobs (`A`) and the number of nodes that are idle (`I`).

### FirecREST

Santis can also be accessed using [FirecREST][ref-firecrest] at the `https://api.cscs.ch/ml/firecrest/v1` API endpoint.
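
For example, once you have obtained an access token, querying the status of the systems exposed by the API might look like this (a sketch of a raw FirecREST v1 call; obtaining `$TOKEN` via your API client credentials is not shown):

```bash
# TOKEN must hold a valid access token for the CSCS API
curl -s -H "Authorization: Bearer $TOKEN" \
    "https://api.cscs.ch/ml/firecrest/v1/status/systems"
```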

## Maintenance and status

### Scheduled maintenance

Wednesday morning 8-12 CET is reserved for periodic updates, with services potentially unavailable during this time frame.
If the queues must be drained (redeployment of node images, rebooting of compute nodes, etc.), then a SLURM reservation will be in place that will prevent jobs from running into the maintenance window.

Exceptional and non-disruptive updates may happen outside this time frame and will be announced to the users mailing list, and on the [CSCS status page](https://status.cscs.ch).

### Change log

!!! change "2025-03-05 container engine updated"
    now supports better containers that go faster. Users do not need to change their workflow to take advantage of these updates.

??? change "2024-10-07 old event"
    this is an old update. Use `???` to automatically fold the update.

### Known issues

[](){#ref-platform-cwp}
# Climate and weather platform

The Climate and Weather Platform (CWp) provides compute, storage and support to the climate and weather modeling community in Switzerland.

## Getting started

### Getting access

Project administrators (PIs and deputy PIs) of projects on the CWp can invite users to join their project; users must join a project before they can use the project's resources on Alps.

!!! todo
    follow the template of the [MLp][ref-platform-mlp]
    This points to the Waldur solution - whether the [UMP][ref-account-ump] or [Waldur][ref-account-waldur] docs are linked depends on which is being used when these docs go live.

This is performed using the [project management tool][ref-account-waldur].

Once invited to a project, you will receive an email with the instructions you need to create an account and configure [multi-factor authentication][ref-mfa] (MFA).

## Systems

Santis is the system deployed on the Alps infrastructure for the climate and weather platform.
Its name derives from Säntis, the highest mountain in the Alpstein massif of north-eastern Switzerland.

<div class="grid cards" markdown>
- :fontawesome-solid-mountain: [__Santis__][ref-cluster-santis]

    Santis is a large [Grace-Hopper][ref-alps-gh200-node] cluster.
</div>

[](){#ref-cwp-storage}
## File systems and storage

There are three main file systems mounted on the CWp system Santis.

| type | mount | filesystem |
| -- | -- | -- |
| Home | `/users/$USER` | [VAST][ref-alps-vast] |
| Scratch | `/capstor/scratch/cscs/$USER` | [Capstor][ref-alps-capstor] |
| Project | `/capstor/store/cscs/userlab/<project>` | [Capstor][ref-alps-capstor] |

### Home

Every user has a home path (`$HOME`) mounted at `/users/$USER` on the [VAST][ref-alps-vast] filesystem.
The home directory has 50 GB of capacity, and is intended for configuration files, small software packages and scripts.

### Scratch

The Scratch filesystem provides temporary storage with high-performance I/O for executing jobs.
Use scratch to store datasets that will be accessed by jobs, and for job output.
Scratch is per user - each user gets a separate scratch path and quota.

!!! info
    A quota of 150 TB and 1 million inodes (files and folders) is applied to your scratch path.

    These are implemented as soft quotas: upon reaching either limit there is a grace period of 1 week before write access to `$SCRATCH` is blocked.

    You can check your quota at any time from Ela or one of the login nodes, using the [`quota` command][ref-storage-quota].

!!! info
    The environment variable `SCRATCH=/capstor/scratch/cscs/$USER` is set automatically when you log into the system, and can be used as a shortcut to access scratch.

!!! warning "scratch cleanup policy"
    Files that have not been accessed in 30 days are automatically deleted.

**Scratch is not intended for permanent storage**: transfer files back to the capstor project storage after job runs.
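
For example, copying job output from scratch back to project storage might look like this (a sketch; replace `<project>` with your project's directory name):

```bash
# copy results from scratch to the backed-up project storage
rsync -av "$SCRATCH/results/" "/capstor/store/cscs/userlab/<project>/results/"
```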

### Project

Project storage is backed up, with no cleaning policy: it provides intermediate storage space for datasets, shared code or configuration scripts that need to be accessed from different vClusters.
Project is per project - each project gets a project folder with a project-specific quota.

* hard limits on capacity and inodes prevent users from writing to project if the quota is reached - you can check quota and available space by running the [`quota`][ref-storage-quota] command on a login node or Ela.
* it is not recommended to write directly to the project path from jobs.

[](){#ref-platform-mlp}
# Machine learning platform

The Machine Learning Platform (MLp) provides compute, storage and expertise to the machine learning and AI community in Switzerland, with the main user being the [Swiss AI Initiative](https://www.swiss-ai.org/).

## Getting started

### Getting access

Project administrators (PIs and deputy PIs) of projects on the MLp can invite users to join their project; users must join a project before they can use the project's resources on Alps.

This is performed using the [project management tool][ref-account-waldur].

Once invited to a project, you will receive an email with the instructions you need to create an account and configure [multi-factor authentication][ref-mfa] (MFA).

## Systems

The main cluster provided by the MLp is Clariden, a large Grace-Hopper GPU system on Alps.

<div class="grid cards" markdown>
- :fontawesome-solid-mountain: [__Clariden__][ref-cluster-clariden]

    Clariden is the main [Grace-Hopper][ref-alps-gh200-node] cluster.
</div>

<div class="grid cards" markdown>
- :fontawesome-solid-mountain: [__Bristen__][ref-cluster-bristen]

    Bristen is a smaller system with [A100 GPU nodes][ref-alps-a100-node] for **todo**
</div>

[](){#ref-mlp-storage}
## File systems and storage

There are three main file systems mounted on the MLp clusters Clariden and Bristen.

| type | mount | filesystem |
| -- | -- | -- |
| Home | `/users/$USER` | [VAST][ref-alps-vast] |
| Scratch | `/iopstor/scratch/cscs/$USER` | [Iopstor][ref-alps-iopstor] |
| Project | `/capstor/store/cscs/swissai/<project>` | [Capstor][ref-alps-capstor] |

### Home

Every user has a home path (`$HOME`) mounted at `/users/$USER` on the [VAST][ref-alps-vast] filesystem.
The home directory has 50 GB of capacity, and is intended for configuration files, small software packages and scripts.

### Scratch

Scratch filesystems provide temporary storage with high-performance I/O for executing jobs.
Use scratch to store datasets that will be accessed by jobs, and for job output.
Scratch is per user - each user gets a separate scratch path and quota.

* The environment variable `SCRATCH=/iopstor/scratch/cscs/$USER` is set automatically when you log into the system, and can be used as a shortcut to access scratch.

!!! warning "scratch cleanup policy"
    Files that have not been accessed in 30 days are automatically deleted.

**Scratch is not intended for permanent storage**: transfer files back to the capstor project storage after job runs.

!!! note
    There is an additional scratch path mounted on [Capstor][ref-alps-capstor] at `/capstor/scratch/cscs/$USER`, however this is not recommended for ML workloads for performance reasons.

### Project

Project storage is backed up, with no cleaning policy: it provides intermediate storage space for datasets, shared code or configuration scripts that need to be accessed from different vClusters.
Project is per project - each project gets a project folder with a project-specific quota.

* if you need additional storage, ask your PI to contact the CSCS service managers Fawzi or Nicholas.
* hard limits on capacity and inodes prevent users from writing to project if the quota is reached - you can check quota and available space by running the [`quota`][ref-storage-quota] command on a login node or Ela.
* it is not recommended to write directly to the project path from jobs.

## Guides and tutorials

!!! todo
    links to tutorials and guides for ML workflows

> **Collaborator:** Mostly a question for myself: is CWp/MLp/etc. the way we capitalize this, not CWP/MLP/etc.?
>
> **Author:** I just asked, and it was XYp, but it was changed to XYP. I will make the change.