# Intermediate Slurm on Milton
### Delivered by WEHI Research Computing Platform

<pre>Edward Yang    Michael Milton    Julie Iskander</pre><br>

<img src="static/1200px-Slurm_logo.svg.png" alt="Slurm" width="100"/>
<img src="static/milton.png" alt="Milton Mascot" width="100"/>
<img src="static/WEHI_RGB_logo.png" alt="WEHI logo" width="200"/>

## [OPTIONAL SLIDE] Self Introductions!

### Me
* Civil Engineer by training
* HPC by interest
* Mostly code in Fortran, occasionally Python
* I use HPC to simulate fluid and granular flows

## How about you?
* name
* How you use Milton's HPC
* Your coffee/tea order

## Background
* We already ran an "intro to Slurm" workshop (recording on RCP website)
* More "advanced" features of Slurm were highly requested
* Both ITS and researchers would benefit
    * ITS will have fewer issues to address
    * researchers can accelerate their research <- High Priority for everyone!

### Target Audience
* You've submitted quite a few jobs via `sbatch`
* You're familiar with resource requests, like:
    * using `--ntasks` and `--cpus-per-task`
    * using `--mem` and/or `--mem-per-cpu`
* You're wondering whether your jobs are utilizing resources efficiently
* You're wondering how to make your life easier when using `sbatch`

## Background cont.

My goal for the session is to teach you how to:
* get the status of the cluster
* get information about your jobs
* make use of `sbatch` scripting features
* run embarrassingly parallel tasks
* **bonus topic** submitting jobs from R or submitting `python` or `R` scripts without a wrapper script.

## Agenda for Today

12:00 - 12:30&emsp;<span style="color:blue">Introduction & Housekeeping</span>

12:30 - 13:00&emsp;Lunch

13:00 - 13:30&emsp;Laying the groundwork: nodes, tasks and other Slurm terminology

13:30 - 14:00&emsp;Understanding your jobs and the job queue

13:30 - 14:00&emsp;Basic profiling of your jobs

14:00 - 14:30&emsp;Slurm scripting features

14:30 - 15:00&emsp;Embarrassingly parallel examples

15:00 - 15:30&emsp;R batchtools

15:30 - 16:00&emsp;

## Introduction and Housekeeping

### Format
Slides + live coding

Live coding will be on Milton, so make sure you're connected to WEHI's VPN or staff network, or use RAP:<br />
rap.wehi.edu.au

Please follow along to reinforce learning!

Questions:
* Put your hand up whenever you have a question or have an issue running things
* Questions in the chat are welcome and will be addressed by helpers

Material is available here:
/link

Feel free to download the notebook and follow along.

## Introduction and Housekeeping Cont.
### Expected understanding of
* <span style="font-size:1.2em;">Concept of "resources"</span><br />
`CPUs`<br />
`RAM/memory`<br />
`Nodes`<br /><br />
* <span style="font-size:1.2em;">Job submission commands</span><br />
`srun    # executes a command/script/binary across tasks`<br />
`salloc  # allocates resources to be used (interactively and/or via srun)`<br />
`sbatch  # submits a script for later execution on requested resources`<br /><br />
* <span style="font-size:1.2em;">resource request options</span><br />
`--ntasks=             # "tasks" recognised by srun`<br />
`--nodes=              # no. of nodes`<br />
`--ntasks-per-node=    # tasks per node`<br />
`--cpus-per-task=      # cpus per task`<br />
`--mem=                # memory required for entire job`<br />
`--mem-per-cpu=        # memory required for each CPU`<br />
`--gres=               # "general resource" (i.e. GPUs)`<br />
`--time=               # requested wall time`<br />

# LUNCH

## Laying the Groundwork: Nodes, Tasks and Other Slurm Stuff

### What are Nodes?
Nodes are essentially standalone computers with their own CPU cores, RAM, local storage, and maybe GPUs. 
<br>
<img src="static/node-diagram.png" alt="Node diagram" width="400">

## Laying the Groundwork: Nodes, Tasks and Other Slurm Stuff cont.

HPC clusters (or just clusters) will consist of multiple nodes connected together through a (sometimes fast) network. <br>
<img src="static/cluster-diagram.png" alt="Cluster diagram" width="400">
<img src="static/shared-storage-diagram.png" alt="Shared storage diagram" width="400">

## Laying the Groundwork: Nodes, Tasks and Other Slurm Stuff cont.

Nodes cannot collaborate on problems unless they are running a program designed that way.

It's like clicking your mouse on your PC, and expecting the click to register on a colleague's PC. 

It's possible, but needs a special program/protocol to do so!

Most programs in bioinformatics, health sciences, and statistics (that I know of) aren't set up to have multiple nodes cooperate. Notable exceptions are TensorFlow, PyTorch, RELION, and GROMACS.

## Laying the Groundwork: Nodes, Tasks and Other Slurm Stuff cont.

### What are tasks?
Tasks are a collection of resources (CPU cores, GPUs) expected to perform the same "task", or used by a single program e.g., via threads, Python multiprocessing, or OpenMP.

<img src="static/tasks-diagram1.png" alt="Tasks diagram" width="400">

A task can only be given resources co-located on a node. Multiple tasks requested by `sbatch` or `salloc` can be spread across multiple nodes (unless `--nodes=` is specified).

`srun` will use the number of tasks requested by `salloc` or `sbatch` to run `ntasks` instances of the same command/script/program. e.g.,

`srun echo hello world` will run `echo` once for each task requested.

This was designed with "traditional" HPC in mind, where multiple instances of the same program would be created, and these instances could cooperate (thanks to MPI).

### why `ntasks` and  `cpus-per-task` then?
This is because some "traditional" HPC programs use multithreading (usually through OpenMP), where each program instance can utilize multiple cores.

Most data science, statistics, bioinformatics, and health-science work will use `--ntasks=1` together with `--cpus-per-task`. If you see/hear anything to do with "distributed" or MPI (e.g. distributed ML), you may want to change that.
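
As a rough sketch of the single-task, multi-core pattern described above, the script below requests one task with several cores and passes the core count on to a threaded program. The values `4` and `4G` are placeholders, not recommendations. `SLURM_CPUS_PER_TASK` and `SLURM_NTASKS` are variables Slurm sets inside a job, and the fallback to `1` is an assumption added so the script also runs outside a job.

```shell
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4   # placeholder value
#SBATCH --mem=4G            # placeholder value

# SLURM_CPUS_PER_TASK is only set inside a job that requested --cpus-per-task,
# so fall back to 1 to keep the script usable outside Slurm.
threads="${SLURM_CPUS_PER_TASK:-1}"

# OpenMP programs read this variable to decide how many threads to start.
export OMP_NUM_THREADS="$threads"

echo "tasks: ${SLURM_NTASKS:-1}, threads per task: ${threads}"
```

Tying the thread count to the request means the program cannot silently start more threads than the cores it was given.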

## Monitoring Your Jobs and the Job Queue

_knowledge is power_

### Building on the basics: `squeue`

`squeue` shows everyone's jobs in the queue; passing `-u <username>` shows only `<username>`'s jobs.

In [4]:
squeue | head -n 5

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           8516030      gpuq interact bollands  R    1:30:11      1 gpu-p100-n01
           8515707      gpuq cryospar cryospar  R    3:04:59      1 gpu-p100-n01
           8511988 interacti sys/dash    yan.a  R   20:15:53      1 sml-n03
           8516092 interacti     work jackson.  R    1:21:42      1 sml-n01


Getting a bit more: `squeue --long` makes things more legible and adds the `TIME_LIMIT` column.

In [5]:
squeue --long | head -n 5

Thu Oct 20 12:07:43 2022
             JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON)
           8516030      gpuq interact bollands  RUNNING    1:30:13   8:00:00      1 gpu-p100-n01
           8515707      gpuq cryospar cryospar  RUNNING    3:05:01 2-00:00:00      1 gpu-p100-n01
           8511988 interacti sys/dash    yan.a  RUNNING   20:15:55 1-00:00:00      1 sml-n03


## Monitoring Your Jobs and the Job Queue cont.

But what if we want _even more_ information?

We have to make use of the formatting options!

```
$ squeue --Format field1,field2,...
```

OR use the environment variable `SQUEUE_FORMAT2`. Useful fields:

| Resources related | Time related | Scheduling   |
| :---              | :---         | :---         |
| `NumCPUs`         | `starttime`  | `JobId`      |
| `NumNodes`        | `submittime` | `name`       |
| `minmemory`       | `pendingtime`| `partition`  |
| `tres-alloc`      | `timelimit`  | `priority`   |
|                   | `timeleft`   | `reasonlist` |
|                   | `timeused`   | `workdir`    |
|                   |              | `state`      |

You can always use `man squeue` to see the entire list of options.

So you don't have to type out the fields every time, I recommend aliasing the command with your fields of choice in `~/.bashrc`, e.g.

## Monitoring Your Jobs and the Job Queue cont.

In [26]:
alias sqv="squeue --Format=jobid:8,name:6' ',partition:10' ',statecompact:3,tres-alloc:60,timelimit:12,timeleft:12"
sqv | head -n 5

JOBID   NAME   PARTITION  ST TRES_ALLOC                                                  TIME_LIMIT  TIME_LEFT   
8517002 R      bigmem     R  cpu=22,mem=88G,node=1,billing=720984                        1-00:00:00  23:35:18    
8516030 intera gpuq       R  cpu=2,mem=20G,node=1,billing=44,gres/gpu=1,gres/gpu:p100=1  8:00:00     4:43:00     
8515707 cryosp gpuq       R  cpu=8,mem=17G,node=1,billing=44,gres/gpu=1,gres/gpu:p100=1  2-00:00:00  1-19:08:12  
8511988 sys/da interactiv R  cpu=8,mem=16G,node=1,billing=112                            1-00:00:00  1:57:18     


In [28]:
sqv -u bedo.j | head -n 5

JOBID   NAME   PARTITION  ST TRES_ALLOC                                                  TIME_LIMIT  TIME_LEFT   
8516851 bionix regular    PD cpu=24,mem=90G,node=1,billing=204                           2-00:00:00  2-00:00:00  
8516850 bionix regular    PD cpu=24,mem=90G,node=1,billing=204                           2-00:00:00  2-00:00:00  
8516849 bionix regular    PD cpu=24,mem=90G,node=1,billing=204                           2-00:00:00  2-00:00:00  
8516848 bionix regular    PD cpu=24,mem=90G,node=1,billing=204                           2-00:00:00  2-00:00:00  


## Monitoring Your Jobs and the Job Queue cont.

### Getting detailed information of your running/pending job

`scontrol show job <jobid>`

Useful if you care only about a specific job.

It's very useful when debugging jobs.

A lot of information without needing lots of input.

In [30]:
scontrol show job 8516360

JobId=8516360 JobName=Extr16S23S
   UserId=woodruff.c(2317) GroupId=allstaff(10908) MCS_label=N/A
   Priority=324 Nice=0 Account=wehi QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:21:53 TimeLimit=2-00:00:00 TimeMin=N/A
   SubmitTime=2022-10-20T11:37:49 EligibleTime=2022-10-20T11:37:49
   AccrueTime=2022-10-20T11:37:49
   StartTime=2022-10-20T14:28:03 EndTime=2022-10-22T14:28:03 Deadline=N/A
   PreemptEligibleTime=2022-10-20T14:28:03 PreemptTime=None
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-10-20T14:28:03 Scheduler=Main
   Partition=regular AllocNode:Sid=vc7-shared:12938
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=med-n24
   BatchHost=med-n24
   NumNodes=1 NumCPUs=32 NumTasks=32 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=32,mem=48G,node=1,billing=128
   Socks/Node=* NtasksPerN:B:S:C=32:0:*:* CoreSpec=*
   MinCPUsNode=32 MinMemoryNode=48G MinTmpDiskNode=0
   Features=(null) Dela
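
When only one or two of these fields matter, the `Key=Value` layout is easy to filter. The helper below is a small sketch: `job_field` is a hypothetical name, and the sample text is copied from the output above so the example runs without Slurm. It assumes values contain no spaces, which does not hold for every field (a job name with spaces, for instance).

```shell
# Print the value of one Key=Value field from `scontrol show job` output on stdin.
job_field() {
    # Put each Key=Value pair on its own line, then match on the key.
    tr -s ' ' '\n' | awk -F= -v key="$1" '$1 == key { sub(/^[^=]*=/, ""); print; exit }'
}

# Sample lines copied from the scontrol output above.
sample='   JobState=RUNNING Reason=None Dependency=(null)
   RunTime=00:21:53 TimeLimit=2-00:00:00 TimeMin=N/A
   TRES=cpu=32,mem=48G,node=1,billing=128'

printf '%s\n' "$sample" | job_field JobState
printf '%s\n' "$sample" | job_field TRES
```

On the cluster the same helper can be fed directly, e.g. `scontrol show job <jobid> | job_field TimeLimit`.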

## Monitoring Your Jobs and the Job Queue cont.
### Monitoring the cluster

Being able to understand the state of the cluster can help you understand why your job might be waiting.

Or, you can use the information to your advantage to reduce wait times.

To view the state of the cluster, we're going to use the `sinfo` command.

In [31]:
sinfo

PARTITION        AVAIL  TIMELIMIT  NODES  STATE NODELIST
interactive         up 1-00:00:00      4    mix med-n03,sml-n[01-03]
interactive         up 1-00:00:00      1  alloc med-n02
interactive         up 1-00:00:00      1   idle med-n01
regular*            up 2-00:00:00     42    mix lrg-n[02-03],med-n[03-05,07-09,12-13,18,20-23,25-27,29-30],sml-n[02-20,22-24]
regular*            up 2-00:00:00     13  alloc lrg-n04,med-n[02,06,10-11,14-17,19,24,28],sml-n21
long                up 14-00:00:0     40    mix med-n[03-05,07-09,12-13,18,20-23,25-27,29-30],sml-n[02-20,22-24]
long                up 14-00:00:0     12  alloc med-n[02,06,10-11,14-17,19,24,28],sml-n21
bigmem              up 2-00:00:00      3    mix lrg-n02,med-n[03-04]
bigmem              up 2-00:00:00      1  alloc med-n02
bigmem              up 2-00:00:00      1   idle lrg-n01
gpuq                up 2-00:00:00      1    mix gpu-p100-n01
gpuq                up 2-00:00:00     11   idle gpu-a30-n[01-07],gpu-p100-n[02-05]
gpuq_inter

## Monitoring Your Jobs and the Job Queue cont.

Like `squeue`, we can augment `sinfo`'s behaviour with options.

A very useful option is to use the `-N` (N for nodes) option.

In [33]:
sinfo -N | head -n 5

NODELIST      NODES        PARTITION STATE 
gpu-a10-n01       1 gpuq_interactive mix   
gpu-a30-n01       1             gpuq idle  
gpu-a30-n02       1             gpuq idle  
gpu-a30-n03       1             gpuq idle  


And now the data is node-oriented instead of partition-oriented!

## Monitoring Your Jobs and the Job Queue cont.
But just knowing whether nodes are "idle", "mixed", or "allocated" is not the _most useful_ information.

We can add detail with formatting options as well.

| CPU | memory | gres (GPU) | node state | time |
| :---| :--- | :--- | :--- | :--- |
| `CPUsState` | `FreeMem` | `GresUsed` | `StateCompact` | `Time` |
| | `AllocMem` | `Gres` | | |
| | `Memory` | | | |

* CPUs occupied/available/total: CPUsState
* memory occupied/available/total: FreeMem, AllocMem, Memory
* gres (GPU) occupied/available: GresUsed, Gres
* State of the node (e.g. whether the node is down): StateCompact
* Max time: Time

In [37]:
sinfo -NO nodelist:13,partition:13,cpusstate:13,freemem:8,memory:8,gresused,gres:11,statecompact:8,time | head -n 5

NODELIST     PARTITION    CPUS(A/I/O/T)FREE_MEMMEMORY  GRES_USED           GRES       STATE   TIMELIMIT           
gpu-a10-n01  gpuq_interact2/46/0/48    71517   257417  gpu:A10:1(IDX:0)    gpu:A10:4  mix     12:00:00            
gpu-a30-n01  gpuq         0/96/0/96    462381  511362  gpu:A30:0(IDX:N/A)  gpu:A30:4  idle    2-00:00:00          
gpu-a30-n02  gpuq         0/96/0/96    401605  511362  gpu:A30:0(IDX:N/A)  gpu:A30:4  idle    2-00:00:00          
gpu-a30-n03  gpuq         0/96/0/96    362552  511362  gpu:A30:0(IDX:N/A)  gpu:A30:4  idle    2-00:00:00          
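
The `CPUS(A/I/O/T)` column packs four counts into one field: allocated, idle, other, and total. The sketch below pulls out the idle count per node. `idle_cpus` is a hypothetical helper name, the sample rows are copied from the output above (first three columns only), and the wider column widths in the `sinfo` call are my own choice to stop fields running together.

```shell
# Read `sinfo -N -O nodelist,partition,cpusstate` output on stdin and print
# "node idle_cpus" for every row with at least one idle CPU.
idle_cpus() {
    awk 'NR > 1 { split($3, cpu, "/"); if (cpu[2] > 0) print $1, cpu[2] }'
}

# Sample rows copied from the sinfo output above.
sample='NODELIST     PARTITION    CPUS(A/I/O/T)
gpu-a30-n01  gpuq         0/96/0/96
gpu-a30-n02  gpuq         0/96/0/96'
printf '%s\n' "$sample" | idle_cpus

# On the cluster, feed it live data (skipped where sinfo is not installed).
if command -v sinfo >/dev/null 2>&1; then
    sinfo -N -O nodelist:15,partition:18,cpusstate:15 | idle_cpus
fi
```

A node that belongs to several partitions is listed once per partition by `sinfo -N`, so it can appear more than once in the result.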


## Monitoring Your Jobs and the Job Queue cont.

For newer Slurm versions, the GUI tool `sview` is an option

Its functionality is currently a little limited and not customizable.

In [36]:
sview





## Basic Profiling of Your Jobs



In [3]:
sacct -S <start time> -E <end time> -o jobid,jobname,alloctres,elapsed,end,start,exitcode,ncpus,nodelist,nnodes,submit


In [4]:
sinfo -NO nodehost,partition,statecompact,available,cpusstate,allocmem,memory,gres,gresused
scontrol show job <job id>


## Basic Job Profiling

ssh node

nvidia-smi

htop (and useful features)

seff

dcgmstats


## Sbatch Scripting Features

We're going to start with a simple R script submitted by a wrapper sbatch script:

```r
## matmul.rscript

print("starting the matmul R script!")
nrows = 1e3
print(paste0("elem: ", nrows, "*", nrows, " = ", nrows*nrows))

# generating matrices
M <- matrix(rnorm(nrows*nrows),nrow=nrows)
N <- matrix(rnorm(nrows*nrows),nrow=nrows)

# start matmul
start.time <- Sys.time()
invisible(M %*% N)
end.time <- Sys.time()

# Getting final time and writing to stdout
elapsed.time <- difftime(time1=end.time, time2=start.time, units="secs")
print(elapsed.time)
```

In [5]:
#!/bin/bash
# Example sbatch script running Rscript
# Does a matmul
# rev0
#SBATCH --mem=8G
#SBATCH --cpus-per-task=2
#SBATCH --time=1-
#SBATCH --nodes=1
#SBATCH --ntasks=1

# loading module for R
module load R/openBLAS/4.2.1

Rscript matmul.rscript


## Sbatch scripting features cont.

Sbatch options extend beyond just specifying resources. There are ways you can utilize `sbatch` features to make your life easier or aid in automation:
* redirecting where your output and error messages go
* changing directory
* notifying you of when jobs have started, ended
* Submitting Python and R scripts
* Making use of Slurm's environment variables

## Sbatch scripting features cont.

### a short aside on `stdout` and `stderr`
Linux has two main "channels" to send output messages to. One is "stdout" (standard out), and the other is "stderr" (standard error).

If you have ever used the `|` `>` or `>>` shell scripting features, then you've _redirected_ `stdout` somewhere else e.g., to another command, a file, or the void (`/dev/null`).

```bash
$ ls dir-that-doesnt-exist
ls: cannot access dir-that-doesnt-exist: No such file or directory # this is a stderr output
```

```bash
$ ls ~
bin cache Desktop Downloads ... # this is a stdout output!
```
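The two channels can be sent to different places. The sketch below uses a hypothetical `noisy` function as a stand-in for a real program and writes to a temporary directory so it does not touch your files.

```shell
# Hypothetical stand-in for a real program: one line to each channel.
noisy() {
    echo "result: 42"               # goes to stdout
    echo "warning: check me" >&2    # goes to stderr
}

workdir="$(mktemp -d)"

# >  redirects stdout, 2> redirects stderr
noisy > "$workdir/out.log" 2> "$workdir/err.log"

# 2>&1 sends stderr to wherever stdout currently points
noisy > "$workdir/both.log" 2>&1

echo "--- out.log ---"; cat "$workdir/out.log"
echo "--- err.log ---"; cat "$workdir/err.log"
```

This is the same separation that the `sbatch` options `--output` and `--error` apply to a whole job.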

### redirecting output and stderr
Sbatch will automatically redirect `stdout` and `stderr` to a single file called `slurm-<jobid>.out`. But this may not be useful. Maybe you want to separate any "output" from "errors". This can be done with `--output` and `--error`.

In [6]:
#!/bin/bash
# Example sbatch script running Rscript
# Does a matmul
# rev1-stderrstdout
#SBATCH --mem=8G
#SBATCH --cpus-per-task=2
#SBATCH --time=1-
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --output=results/Rmatmul-%j.out # the %j is interpreted as the job id!
#SBATCH --error=logs-debug/Rmatmul-%j.err

# loading module for R
module load R/openBLAS/4.2.1

Rscript matmul.rscript
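
One practical detail: Slurm does not create missing folders for `--output` and `--error`, so `results/` and `logs-debug/` need to exist before the job starts. A minimal sketch of the submission step follows. The script file name `Rmatmul.sbatch` is an assumption, and the `sbatch` call is guarded so the lines do nothing harmful off the cluster.

```shell
# Create the folders named in --output and --error before submitting.
mkdir -p results logs-debug

# Hypothetical script name; submit only where sbatch and the script exist.
if command -v sbatch >/dev/null 2>&1 && [ -f Rmatmul.sbatch ]; then
    sbatch Rmatmul.sbatch
fi
```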


### changing directory
The `sbatch` option `--chdir` can make life a bit easier for many reasons:
* your script may be "far" from your data, e.g. in a separate scripts folder
* you may be processing data in different places

A typical approach would be to modify both the `--output` and `--error` locations and use `cd` inside the script.

`--chdir` changes the working directory _which includes where stderr and stdout go_

This can be more concise and avoids having to change multiple paths.

In [7]:
#!/bin/bash
# Example sbatch script running Rscript
# Does a matmul
# rev2-chdir
#SBATCH --mem=8G
#SBATCH --cpus-per-task=2
#SBATCH --time=1-
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --output=Rmatmul-%j.out
#SBATCH --error=Rmatmul-%j.err
#SBATCH --chdir=/vast/scratch/yang.e/slurm-demo/test1

# loading module for R
module load R/openBLAS/4.2.1

Rscript matmul.rscript
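
Options given on the `sbatch` command line override the matching `#SBATCH` lines, so one script can be reused for data in several places by passing `--chdir` at submission time. The sketch below only prints the commands it would run. `chdir_cmds`, the directory names, and the script name `Rmatmul.sbatch` are all hypothetical.

```shell
# Print one sbatch command per run directory given as an argument.
chdir_cmds() {
    for dir in "$@"; do
        printf 'sbatch --chdir=%s Rmatmul.sbatch\n' "$dir"
    done
}

# Hypothetical run directories; replace with your own.
chdir_cmds "$HOME/slurm-demo/test1" "$HOME/slurm-demo/test2"
```

Once the printed commands look right, they can be run by piping the output to `bash`. The sketch assumes directory names without spaces.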
