Commit

content updates
mjpritchard committed Feb 13, 2024
1 parent 7af7be1 commit 80e4666
Showing 5 changed files with 215 additions and 229 deletions.
57 changes: 29 additions & 28 deletions content/docs/batch-computing/lotus-cluster-specification.md
---
aliases: /article/4932-lotus-cluster-specification
date: 2023-03-13 13:50:34
description: LOTUS cluster specification
slug: lotus-cluster-specification
tags:
title: LOTUS cluster specification
---

## Current cluster specification

LOTUS is a cluster of over 300 nodes/hosts and 19,000 CPU cores. A node/host is
an individual computer in the cluster with more than one processor. Each
node/host belongs to a specific host group. The number of processors (CPUs or
cores) per host is listed in Table 1, together with the corresponding processor
model and the amount of physical memory (RAM) available per node/host.

**Table 1**. LOTUS cluster specification

**Current** host groups

Host group name | Number of nodes/hosts | Processor model | CPUs per host | RAM
---|---|---|---|---
broadwell256G | 37 | Intel Xeon E5-2640-v4 "Broadwell" | 20 | 256 GB
skylake348G | 151 | Intel Xeon Gold-5118 "Skylake" | 24 | 348 GB
epyctwo1024G | 200 | AMD | 48 | 1024 GB
{.table .table-striped}

## Selection of specific processor model

To select a node/host with a specific processor model and memory, add the
following Slurm directive to your job script:

```bash
#SBATCH --constraint="<host-group-name>"
```

For example

```bash
#SBATCH --constraint="skylake348G"
```

{{< alert type="info" >}}
Further notes

`intel` and `amd` node types are defined in the Slurm configuration as a feature:

- For any Intel node type use `#SBATCH --constraint="intel"`
- For a specific Intel CPU model use the host group name (see Table 1), e.g. `#SBATCH --constraint="skylake348G"`
- For AMD use `#SBATCH --constraint="amd"`
- There are 10 nodes of node type `skylake348G` with an SSD disk mounted on `/tmp`
- LOTUS nodes of node type `epyctwo1024G` are not available yet on the `par-multi` queue
{{< /alert >}}

{{< alert type="danger" >}}
If you choose to compile code for specific architectures, do not expect it to run elsewhere in the system.
{{< /alert >}}
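
The snippet below is a sketch only: the job name, queue, wall time and memory
values are illustrative placeholders, and the constraint is one of the host
group names from Table 1.

```bash
#!/bin/bash
#SBATCH --job-name=constraint-example   # illustrative job name
#SBATCH --partition=short-serial        # example queue; choose one suited to your workload
#SBATCH --time=01:00:00                 # illustrative wall-time request
#SBATCH --mem=8G                        # illustrative memory request
#SBATCH --constraint="broadwell256G"    # host group name taken from Table 1

# Print the allocated node name to confirm the constraint was honoured
hostname
```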

## Retired host groups no longer in use

(For reference only)

Host group name | Number of nodes/hosts | Processor model | CPUs per host | RAM
---|---|---|---|---
~~haswell256G~~ | ~~7~~ retired | ~~Intel Xeon E5-2650-v3 "Haswell"~~ | ~~20~~ | ~~256 GB~~
~~ivybridge2000G~~ | ~~3~~ retired | ~~Intel Xeon E7-4860-v2 "Ivy Bridge"~~ | ~~48~~ | ~~2048 GB~~
{.table .table-striped}


54 changes: 25 additions & 29 deletions content/docs/batch-computing/orchid-gpu-cluster.md
This article provides details on JASMIN's GPU
cluster, named **Orchid**.

## GPU cluster spec

The JASMIN GPU cluster is composed of 16 GPU nodes:

- 14 x standard GPU nodes with 4 NVIDIA A100 GPU cards each
- 2 x large GPU nodes with 8 NVIDIA A100 GPU cards each

{{< image src="img/docs/gpu-cluster-orchid/file-NZmhCFPJx9.png" caption="ORCHID GPU cluster" >}}

## Request access to Orchid

Access to the GPU cluster (and the GPU interactive node) is controlled by
membership of the Slurm account `orchid`. Please request access via the link
below, which will take you to the ORCHID service page on the JASMIN accounts
portal:

{{<button href="https://accounts.jasmin.ac.uk/services/additional_services/orchid/" >}}Apply here{{</button>}}

**Note:** In the supporting info on the request form, please provide details
of the software and the workflow that you will use/run on ORCHID.

## Test a GPU job

Testing a job on the JASMIN Orchid GPU cluster can be carried out
interactively by launching a pseudo-shell terminal Slurm job from a JASMIN
scientific server, e.g. `sci2`:

{{<command user="user" host="sci2">}}
srun --gres=gpu:1 --partition=orchid --account=orchid --pty /bin/bash
{{</command>}}
{{<command user="user" host="gpuhost16">}}
## you are now on gpuhost16
{{</command>}}


The GPU node `gpuhost016` is allocated for this interactive session on LOTUS.

Note that for batch mode, a GPU job is submitted using the Slurm command
`sbatch`:

{{<command user="user" host="sci2">}}
sbatch --partition=orchid --account=orchid --gres=gpu:1 myjobscript
{{</command>}}

or by adding the following preamble in the job script file

```bash
#SBATCH --partition=orchid
#SBATCH --account=orchid
#SBATCH --gres=gpu:1
```
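
For completeness, a minimal batch script is sketched below; the job name, wall
time, memory value and the use of `nvidia-smi` as a placeholder workload are
assumptions, while the partition, account and GPU request follow the
directives above.

```bash
#!/bin/bash
#SBATCH --job-name=gpu-test        # illustrative job name
#SBATCH --partition=orchid         # ORCHID partition
#SBATCH --account=orchid           # ORCHID Slurm account
#SBATCH --gres=gpu:1               # request one GPU
#SBATCH --time=01:00:00            # illustrative wall time, within the 24-hour limit
#SBATCH --mem=16G                  # illustrative memory request

# List the GPU(s) visible to this job as a simple check
nvidia-smi
```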

Note 1: `gpuhost015` and `gpuhost016` are the two largest nodes with 64 CPUs and
8 GPUs.

Note 2: **CUDA Version: 11.6**

Note 3: The Slurm batch partition/queue `orchid` has a maximum runtime of 24
hours and the default runtime is 1 hour. The maximum number of CPU cores per
user is limited to 8. If this limit is exceeded, the job is expected to remain
in a pending state with the reason {{<mark>}}QOSGrpCpuLimit{{</mark>}}.

## GPU interactive node

There is an interactive GPU node `gpuhost001.jc.rl.ac.uk`, with the same spec
as other Orchid nodes, which you can access via a JASMIN login server to
prototype and test your GPU code prior to running it as a batch job.


{{<command user="user" host="login1">}}
ssh -A gpuhost001.jc.rl.ac.uk
{{</command>}}
{{<command user="user" host="gpuhost001">}}
## you are now on gpuhost001
{{</command>}}

Available software includes:
- Singularity 3.7.0 - which supports NVIDIA/GPU containers
- SCL Python 3.6
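
As a sketch of how the container support might be used (the image name is a
placeholder, not a real file), Singularity's `--nv` flag exposes the host GPUs
inside a container:

{{<command user="user" host="gpuhost001">}}
singularity exec --nv my-gpu-image.sif nvidia-smi
{{</command>}}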


82 changes: 42 additions & 40 deletions content/docs/batch-computing/slurm-queues.md

The Slurm queues in the LOTUS cluster are:

- `test`
- `short-serial`
- `long-serial`
- `par-single`
- `par-multi`
- `high-mem`
- `short-serial-4hr`

Each queue has attributes of run-length limits (e.g. short, long) and
resources. A full breakdown of each queue and its associated resources is
shown below in Table 1.

## Queue details

Queues represent a set of pending jobs, lined up in a defined order, and
waiting for their opportunity to use resources. The queue is specified in the
job script file using a Slurm scheduler directive like this:

```bash
#SBATCH -p <queue_name>
```

where `<queue_name>` is the name of the queue/partition (Table 1, column 1).
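
For instance, picking one of the queues from Table 1 (the choice of `high-mem`
here is purely illustrative):

```bash
#SBATCH -p high-mem
```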

Table 1 summarises important specifications for each queue such as run time
limits and the number of CPU core limits. If the queue is not specified, Slurm
will schedule the job to the queue `short-serial` by default.

Table 1. LOTUS/Slurm queues and their specifications

Queue name | Max run time | Default run time | Max CPU cores per job | MaxCpuPerUserLimit | Priority
---|---|---|---|---|---
`test` | 4 hrs | 1hr | 8 | 8 | 30
`short-serial` | 24 hrs | 1hr | 1 | 2000 | 30
`par-single` | 48 hrs | 1hr | 16 | 300 | 25
`par-multi` | 48 hrs | 1hr | 256 | 300 | 20
`long-serial` | 168 hrs | 1hr | 1 | 300 | 10
`high-mem` | 48 hrs | 1hr | 1 | 75 | 30
`short-serial-4hr`<br>(**Note 3**) | 4 hrs | 1hr | 1 | 1000 | 30
{.table .table-striped}

**Note 1**: Resources requested by a job must be within the resource
allocation limits of the selected queue.

**Note 2:** The default value for `--time=[hh:mm:ss]` (predicted maximum wall
time) is 1 hour for all queues. If you do not specify this option
and/or your job exceeds the default maximum run time limit, then it will be
terminated by the Slurm scheduler.

**Note 3**: A user must specify the Slurm job account `--account=short4hr`
when submitting a batch job to the `short-serial-4hr` queue.
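
Combining Notes 2 and 3, a minimal sketch of such a submission might look like
the following; the wall time and the placeholder executable are illustrative
assumptions.

```bash
#!/bin/bash
#SBATCH --partition=short-serial-4hr   # queue from Table 1
#SBATCH --account=short4hr             # account required for this queue (Note 3)
#SBATCH --time=03:30:00                # explicit wall time within the 4-hour limit (Note 2)

# Placeholder workload
./my_analysis_task
```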

## State of queues

The Slurm command `sinfo` reports the state of queues and nodes
managed by Slurm. It has a wide variety of filtering, sorting, and formatting
options.

{{<command shell="bash">}}
sinfo
{{</command>}}

## `sinfo` output field description

By default, the Slurm command `sinfo` displays the following information:

- **PARTITION**: Partition name, followed by **\*** for the default queue/partition
- **AVAIL**: State/availability of a queue/partition: up or down.
- **TIMELIMIT**: The maximum run time limit per job in each queue/partition, shown as days-hours:minutes:seconds, e.g. 2-00:00:00 is a two-day maximum runtime limit
- **NODES**: Count of nodes with this particular configuration, e.g. 48 nodes
- **STATE**: State of the nodes. Possible states include: allocated, down, drained, and idle. For example, the state "idle" means that the node is not allocated to any jobs and is available for use.
- **NODELIST**: List of node names associated with this queue/partition

The `sinfo` example below reports more complete information about the
partition/queue `short-serial`:

{{<command>}}
sinfo --long --partition=short-serial
(out)short-serial* up 1-00:00:00 1-infinite no NO all 48 idle host[146-193]
{{</command>}}

## How to choose a Slurm queue/partition

### Test queue

The `test` queue can be used to test new workflows and also to help new
users to familiarise themselves with the Slurm batch system. Both serial and
parallel code can be tested on the `test` queue. The maximum runtime is 4 hrs
and the maximum number of jobs per user is 8 job slots. The maximum number of
cores for a parallel job (e.g. MPI, OpenMP, or multi-threaded) is limited to 8.
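
For example (the task count, wall time and script name are illustrative), a
quick trial submission to the `test` queue could be:

{{<command>}}
sbatch --partition=test --ntasks=4 --time=00:10:00 < myjobscript
{{</command>}}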
#### par-single

Multi-threaded, shared-memory parallel jobs (e.g. OpenMP) should be
submitted to the `par-single` queue. Each thread should be allocated one CPU
core. Oversubscribing the number of threads to the CPU cores will cause the
job to run very slow. The number of CPU cores should be specified via the
submission command line `sbatch -n <number of CPU cores>` or by adding the
Slurm directive `#SBATCH -n <number of CPU cores>` in the job script file. An
example is shown below:

{{<command>}}
sbatch --ntasks=4 --partition=par-single < myjobscript
{{</command>}}

Note: Jobs submitted with a number of CPU cores greater than 16 will be
terminated (killed) by the Slurm scheduler, with a corresponding statement in
the job output file.
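
As a sketch only (the thread count, wall time and program name are
illustrative assumptions), a `par-single` job script for a multi-threaded
program might look like:

```bash
#!/bin/bash
#SBATCH --partition=par-single
#SBATCH -n 8                        # one CPU core per thread, within the 16-core limit
#SBATCH --time=02:00:00             # illustrative wall time

# Match the thread count to the allocated cores to avoid oversubscription
export OMP_NUM_THREADS=$SLURM_NTASKS

./my_threaded_program               # placeholder executable
```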

#### par-multi

Distributed memory jobs with inter-node communication using the MPI library
should be submitted to the `par-multi` queue. A single MPI process (rank)
should be allocated a single CPU core. The number of CPU cores should be
specified via the Slurm submission command flag `sbatch -n <number of CPU
cores>` or by adding the Slurm directive `#SBATCH -n <number of CPU cores>`
to the job script file. An example is shown below:

{{<command>}}
sbatch --ntasks=4 --partition=par-multi < myjobscript
{{</command>}}

Note 1: The number of CPU cores gets passed from the Slurm submission flag
`-n`. Do not add the `-np` flag to the `mpirun` command.

Note 2: Slurm will reject a job that requires a number of CPU cores greater
than the limit of 256.
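
As a sketch only (the core count, wall time and program name are illustrative
assumptions), a `par-multi` job script consistent with the notes above might
look like:

```bash
#!/bin/bash
#SBATCH --partition=par-multi
#SBATCH -n 48                       # one core per MPI rank, within the 256-core limit
#SBATCH --time=06:00:00             # illustrative wall time

# mpirun takes the task count from Slurm, so no -np flag is given (Note 1)
mpirun ./my_mpi_program             # placeholder executable
```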
