# Intermediate Slurm on Milton
### Delivered by WEHI Research Computing Platform

<pre>Edward Yang    Michael Milton    Julie Iskander</pre><br>

<img src="static/1200px-Slurm_logo.svg.png" alt="Slurm" width="100"/>
<img src="static/milton.png" alt="Milton Mascot" width="100"/>
<img src="static/WEHI_RGB_logo.png" alt="WEHI logo" width="200"/>

## [OPTIONAL SLIDE] Self Introductions!

### Me
* Civil Engineer by training
* HPC by interest
* Mostly code in Fortran, occasionally Python
* I use HPC to simulate fluid and granular flows

## How about you?
* name
* How you use Milton's HPC
* Your coffee/tea order

## Background
* We already ran an "intro to Slurm" workshop (recording on RCP website)
* More "advanced" features of Slurm were highly requested
* Both ITS and researchers would benefit
    * ITS will have fewer issues to address
    * researchers can accelerate their research <- High Priority for everyone!

### Target Audience
* You've submitted quite a few jobs via `sbatch`
* You're familiar with resource requests. Like:
    * using `--ntasks` and `--cpus-per-task`
    * using `--mem` and/or `--mem-per-cpu`
* You're wondering whether your jobs are utilizing resources efficiently
* You're wondering how to make your life easier when using `sbatch`

## Background cont.

My goal for the session is to teach you how to:
* get the status of the cluster
* get information about your jobs
* make use of `sbatch` scripting features
* run embarrassingly parallel tasks
* **bonus topic** submitting jobs from R or submitting `python` or `R` scripts without a wrapper script.

## Agenda for Today

12:00 - 12:30&emsp;<span style="color:blue">Introduction & Housekeeping</span>

12:30 - 13:00&emsp;Lunch

13:00 - 13:30&emsp;Laying the groundwork: nodes, tasks and other Slurm terminology

13:30 - 14:00&emsp;Understanding your jobs and the job queue

13:30 - 14:00&emsp;Basic profiling of your jobs

14:00 - 14:30&emsp;Slurm scripting features

14:30 - 15:00&emsp;Embarrassingly parallel examples

15:00 - 15:30&emsp;R batchtools

15:30 - 16:00&emsp;

## Introduction and Housekeeping

### Format
Slides + live coding

Live coding will be on Milton, so make sure you're connected to WEHI's VPN or staff network, or use RAP:<br />
rap.wehi.edu.au

Please follow along to reinforce learning!

Questions:
* Put your hand up whenever you have a question or have an issue running things
* Questions in the chat are welcome and will be addressed by helpers

Material is available here:
/link

Feel free to download the notebook and follow along.

## Introduction and Housekeeping Cont.
### Expected understanding of
* <span style="font-size:1.2em;">Concept of "resources"</span><br />
`CPUs`<br />
`RAM/memory`<br />
`Nodes`<br /><br />
* <span style="font-size:1.2em;">Job submission commands</span><br />
`srun    # executes a command/script/binary across tasks`<br />
`salloc  # allocates resources to be used (interactively and/or via srun)`<br />
`sbatch  # submits a script for later execution on requested resources`<br /><br />
* <span style="font-size:1.2em;">resource request options</span><br />
`--ntasks=             # "tasks" recognised by srun`<br />
`--nodes=              # no. of nodes`<br />
`--ntasks-per-node=    # tasks per node`<br />
`--cpus-per-task=      # cpus per task`<br />
`--mem=                # memory required for entire job`<br />
`--mem-per-cpu=        # memory required for each CPU`<br />
`--gres=               # "general resource" (i.e. GPUs)`<br />
`--time=               # requested wall time`<br />

# LUNCH

## Laying the Groundwork: Nodes, Tasks and Other Slurm Stuff

### What are Nodes?
Nodes are essentially standalone computers with their own CPU cores, RAM, local storage, and maybe GPUs. 
<br>
<img src="static/node-diagram.png" alt="Node diagram" width="400">

## Laying the Groundwork: Nodes, Tasks and Other Slurm Stuff cont.

HPC clusters (or just clusters) will consist of multiple nodes connected together through a (sometimes fast) network. <br>
<img src="static/cluster-diagram.png" alt="Cluster diagram" width="400">
<img src="static/shared-storage-diagram.png" alt="Shared storage diagram" width="400">

## Laying the Groundwork: Nodes, Tasks and Other Slurm Stuff cont.

Nodes cannot collaborate on problems unless they are running a program designed that way.

It's like clicking your mouse on your PC, and expecting the click to register on a colleague's PC. 

It's possible, but needs a special program/protocol to do so!

Most programs in bioinformatics, health sciences, and statistics (that I know of) aren't set up to have multiple nodes cooperate. Notable exceptions are TensorFlow, PyTorch, RELION, and GROMACS.

## Laying the Groundwork: Nodes, Tasks and Other Slurm Stuff cont.

### So why are tasks useful/important?

The Slurm task model was created with "traditional HPC" in mind
* `srun` creates `ntasks` instances of a program which coordinate using MPI
* Some applications are designed to use multiple cores per task (hybrid MPI-OpenMP) for performance.

But this isn't a parallel programming workshop!

Tasks are not as relevant in bioinformatics, but Slurm nevertheless uses tasks for accounting/profiling purposes. 

Therefore, it's useful to have an understanding of tasks in order to interpret some of Slurm's job accounting/profiling outputs.

## Laying the Groundwork: Nodes, Tasks and Other Slurm Stuff cont.

### What are tasks?
Tasks are a collection of resources (CPU cores, GPUs) expected to perform the same "task", or used by a single program e.g., via threads, Python multiprocessing, or OpenMP.
<img src="static/tasks-diagram1.png" alt="Tasks diagram" width="400">

## Laying the Groundwork: Nodes, Tasks and Other Slurm Stuff cont.

A task can only be given resources co-located on a node. <br>
Multiple tasks requested by `sbatch` or `salloc` can be spread across multiple nodes (unless `--nodes=` is specified).

For example, if we have two nodes with 4 CPU cores each:

requesting 1 task and 8 cpus-per-task won't work.

But requesting 2 tasks and 4 cpus-per-task will!

Most data science, statistics, bioinformatics, and health-science work will use `--ntasks=1` and scale up with `--cpus-per-task`.

If you see/hear anything to do with "distributed" or MPI (e.g. distributed ML), you may want to change these options.
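
The two-node example above comes down to a little arithmetic. The sketch below uses a hypothetical helper, `request_fits` (not a Slurm command), and assumes identical, otherwise empty nodes: because a task cannot span nodes, each node can host only `cores_per_node / cpus_per_task` whole tasks.

```bash
#!/bin/bash
# Hypothetical helper, not part of Slurm: checks whether a request fits on an
# otherwise empty cluster of identical nodes.
# usage: request_fits NTASKS CPUS_PER_TASK NODES CORES_PER_NODE
request_fits() {
    local ntasks=$1 cpus_per_task=$2 nodes=$3 cores_per_node=$4
    # A task cannot span nodes, so count whole tasks per node (integer division).
    local tasks_per_node=$(( cores_per_node / cpus_per_task ))
    if (( tasks_per_node * nodes >= ntasks )); then
        echo "fits"
    else
        echo "does not fit"
    fi
}

# Two nodes with 4 CPU cores each, as in the example above.
request_fits 1 8 2 4   # 1 task x 8 cpus-per-task  -> does not fit
request_fits 2 4 2 4   # 2 tasks x 4 cpus-per-task -> fits
```

The real scheduler also accounts for memory, GPUs, and other jobs already on the nodes, so this only illustrates the "a task lives on one node" rule.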

## Laying the Groundwork: Nodes, Tasks and Other Slurm Stuff cont.

### A review of launching jobs (`srun` vs `sbatch` and `salloc`)

TL;DR:
* `sbatch` requests resources for use with a script
* `salloc` requests resources to be used interactively
* `srun` runs programs/scripts using resources requested by `sbatch` and `salloc`
    * When run "inside" a job, `srun` will use the resources requested in the parent request
    * When run "outside" a job, `srun` will need to be passed the resource request e.g. `--ntasks`

`srun` will execute `ntasks` instances of the same command/script/program. e.g.,

`srun echo hello world` will run `echo hello world` once for each task requested.

## Monitoring Your Jobs and the Job Queue

Slurm has lots of data on your jobs and the cluster!

Primary utilities discussed in this section:
* `squeue`    _Live_ Job queue data
* `scontrol`  _Live_ Singular job data
* `sinfo`     _Live_ Cluster data

This section shows you how to get more detailed information about:
* the queue,
* other jobs in the queue, and
* the state/busyness of the cluster.

## Monitoring Your Jobs and the Job Queue cont.
### Building on the basics: `squeue`

`squeue` shows everyone's jobs in the queue; passing `-u <username>` shows only `<username>`'s jobs.

In [4]:
squeue | head -n 5

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           8516030      gpuq interact bollands  R    1:30:11      1 gpu-p100-n01
           8515707      gpuq cryospar cryospar  R    3:04:59      1 gpu-p100-n01
           8511988 interacti sys/dash    yan.a  R   20:15:53      1 sml-n03
           8516092 interacti     work jackson.  R    1:21:42      1 sml-n01


## Monitoring Your Jobs and the Job Queue cont.
Getting a bit more: `squeue --long` makes the output more legible and adds the `TIME_LIMIT` column.

In [1]:
squeue --long | head -n 5

Thu Nov 03 12:57:40 2022
             JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON)
           8649808 gpuq_larg   AF2.2g iskander  RUNNING    4:34:07 5-00:00:00      1 gpu-a100-n02
           8649805 gpuq_larg   AF2.2g iskander  RUNNING 1-03:47:14 5-00:00:00      1 gpu-a100-n01
           8664606 interacti sys/dash     wu.y  RUNNING    2:12:59   8:00:00      1 med-n01


## Monitoring Your Jobs and the Job Queue cont.

But what if we want _even more_ information?

We have to make use of the formatting options!

```
$ squeue --Format field1,field2,...
```

OR use the environment variable `SQUEUE_FORMAT2`. Useful fields:

| Resources related | Time related | Scheduling   |
| :---              | :---         | :---         |
| `NumCPUs`         | `starttime`  | `JobId`      |
| `NumNodes`        | `submittime` | `name`       |
| `minmemory`       | `pendingtime`| `partition`  |
| `tres-alloc`      | `timelimit`  | `priority`   |
|                   | `timeleft`   | `reasonlist` |
|                   | `timeused`   | `workdir`    |
|                   |              | `state`      |

You can always use `man squeue` to see the entire list of options.

So you don't have to type out the fields each time, I recommend aliasing the command with your fields of choice in `~/.bashrc`.

## Monitoring Your Jobs and the Job Queue cont.

In [26]:
alias sqv="squeue --Format=jobid:8,name:6' ',partition:10' ',statecompact:3,tres-alloc:60,timelimit:12,timeleft:12"
sqv | head -n 5

JOBID   NAME   PARTITION  ST TRES_ALLOC                                                  TIME_LIMIT  TIME_LEFT   
8517002 R      bigmem     R  cpu=22,mem=88G,node=1,billing=720984                        1-00:00:00  23:35:18    
8516030 intera gpuq       R  cpu=2,mem=20G,node=1,billing=44,gres/gpu=1,gres/gpu:p100=1  8:00:00     4:43:00     
8515707 cryosp gpuq       R  cpu=8,mem=17G,node=1,billing=44,gres/gpu=1,gres/gpu:p100=1  2-00:00:00  1-19:08:12  
8511988 sys/da interactiv R  cpu=8,mem=16G,node=1,billing=112                            1-00:00:00  1:57:18     


In [28]:
sqv -u bedo.j | head -n 5

JOBID   NAME   PARTITION  ST TRES_ALLOC                                                  TIME_LIMIT  TIME_LEFT   
8516851 bionix regular    PD cpu=24,mem=90G,node=1,billing=204                           2-00:00:00  2-00:00:00  
8516850 bionix regular    PD cpu=24,mem=90G,node=1,billing=204                           2-00:00:00  2-00:00:00  
8516849 bionix regular    PD cpu=24,mem=90G,node=1,billing=204                           2-00:00:00  2-00:00:00  
8516848 bionix regular    PD cpu=24,mem=90G,node=1,billing=204                           2-00:00:00  2-00:00:00  
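
As an alternative to an alias, the `SQUEUE_FORMAT2` environment variable mentioned earlier can hold the field list, so that plain `squeue` picks it up. A minimal sketch; the fields and widths below are only an illustrative choice:

```bash
# Could be placed in ~/.bashrc. The field list is an example, not a recommendation.
export SQUEUE_FORMAT2="jobid:10,name:12,partition:12,statecompact:4,timelimit:12,timeleft:12"

# On a machine with Slurm installed, plain `squeue` now uses these fields:
#   squeue -u "$USER"
echo "squeue will use: ${SQUEUE_FORMAT2}"
```

An alias keeps several named views available side by side, whereas the environment variable changes what the bare `squeue` command prints.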


## Monitoring Your Jobs and the Job Queue cont.

### Getting detailed information of your running/pending job

`scontrol show job <jobid>`

Useful if you care only about a specific job.

It's very useful when debugging jobs.

A lot of information without needing lots of input.

In [30]:
scontrol show job 8516360

JobId=8516360 JobName=Extr16S23S
   UserId=woodruff.c(2317) GroupId=allstaff(10908) MCS_label=N/A
   Priority=324 Nice=0 Account=wehi QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:21:53 TimeLimit=2-00:00:00 TimeMin=N/A
   SubmitTime=2022-10-20T11:37:49 EligibleTime=2022-10-20T11:37:49
   AccrueTime=2022-10-20T11:37:49
   StartTime=2022-10-20T14:28:03 EndTime=2022-10-22T14:28:03 Deadline=N/A
   PreemptEligibleTime=2022-10-20T14:28:03 PreemptTime=None
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-10-20T14:28:03 Scheduler=Main
   Partition=regular AllocNode:Sid=vc7-shared:12938
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=med-n24
   BatchHost=med-n24
   NumNodes=1 NumCPUs=32 NumTasks=32 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=32,mem=48G,node=1,billing=128
   Socks/Node=* NtasksPerN:B:S:C=32:0:*:* CoreSpec=*
   MinCPUsNode=32 MinMemoryNode=48G MinTmpDiskNode=0
   Features=(null) Dela
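
When only one or two of these fields matter, the `Key=Value` layout is easy to filter. The sketch below assumes the layout shown above; `job_field` is a hypothetical helper name, and the sample text is copied from the output above so the snippet runs without Slurm.

```bash
#!/bin/bash
# Hypothetical helper: print the value of one Key=Value field read from stdin.
# usage: scontrol show job <jobid> | job_field TimeLimit
job_field() {
    local key=$1
    # Put each field on its own line, then keep the one whose key matches exactly.
    tr -s ' ' '\n' |
        awk -F= -v key="$key" '$1 == key { print substr($0, length(key) + 2); exit }'
}

sample='   JobState=RUNNING Reason=None Dependency=(null)
   RunTime=00:21:53 TimeLimit=2-00:00:00 TimeMin=N/A
   NodeList=med-n24'

echo "$sample" | job_field JobState    # RUNNING
echo "$sample" | job_field TimeLimit   # 2-00:00:00
echo "$sample" | job_field NodeList    # med-n24
```

Fields whose values contain spaces would be cut short by this approach, so treat it as a quick filter rather than a full parser.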

## Monitoring Your Jobs and the Job Queue cont.
### Monitoring the cluster

Being able to understand the state of the cluster can help you understand why your job might be waiting.

Or, you can use the information to your advantage to reduce wait times.

To view the state of the cluster, we're going to use the `sinfo` command.

In [31]:
sinfo

PARTITION        AVAIL  TIMELIMIT  NODES  STATE NODELIST
interactive         up 1-00:00:00      4    mix med-n03,sml-n[01-03]
interactive         up 1-00:00:00      1  alloc med-n02
interactive         up 1-00:00:00      1   idle med-n01
regular*            up 2-00:00:00     42    mix lrg-n[02-03],med-n[03-05,07-09,12-13,18,20-23,25-27,29-30],sml-n[02-20,22-24]
regular*            up 2-00:00:00     13  alloc lrg-n04,med-n[02,06,10-11,14-17,19,24,28],sml-n21
long                up 14-00:00:0     40    mix med-n[03-05,07-09,12-13,18,20-23,25-27,29-30],sml-n[02-20,22-24]
long                up 14-00:00:0     12  alloc med-n[02,06,10-11,14-17,19,24,28],sml-n21
bigmem              up 2-00:00:00      3    mix lrg-n02,med-n[03-04]
bigmem              up 2-00:00:00      1  alloc med-n02
bigmem              up 2-00:00:00      1   idle lrg-n01
gpuq                up 2-00:00:00      1    mix gpu-p100-n01
gpuq                up 2-00:00:00     11   idle gpu-a30-n[01-07],gpu-p100-n[02-05]
gpuq_inter

## Monitoring Your Jobs and the Job Queue cont.

Like `squeue`, we can augment `sinfo`'s behaviour with options.

A very useful option is to use the `-N` (N for nodes) option.

In [33]:
sinfo -N | head -n 5

NODELIST      NODES        PARTITION STATE 
gpu-a10-n01       1 gpuq_interactive mix   
gpu-a30-n01       1             gpuq idle  
gpu-a30-n02       1             gpuq idle  
gpu-a30-n03       1             gpuq idle  


Now the data is node-oriented instead of partition-oriented!

## Monitoring Your Jobs and the Job Queue cont.
But just knowing whether nodes are "idle", "mixed", or "allocated" is not the _most useful_ information.

We can add detail with formatting options as well.

| CPU | memory | gres (GPU) | node state | time |
| :---| :--- | :--- | :--- | :--- |
| `CPUsState` | `FreeMem` | `GresUsed` | `StateCompact` | `Time` |
| | `AllocMem` | `Gres` | | |
| | `Memory` | | | |

* CPUs occupied/available/total: CPUsState
* memory occupied/available/total: FreeMem, AllocMem, Memory
* gres (GPU) occupied/available: GresUsed, Gres
* State of the node (e.g. whether the node is down): StateCompact
* Max time: Time

In [13]:
sinfo -NO nodelist:11' ',partition:10' ',cpusstate:13' ',freemem:8' ',memory:8' ',gresused,gres:11,statecompact:8,time | head -n 5

NODELIST    PARTITION  CPUS(A/I/O/T) FREE_MEM MEMORY   GRES_USED           GRES       STATE   TIMELIMIT           
gpu-a10-n01 gpuq_inter 0/48/0/48     163914   257417   gpu:A10:0(IDX:N/A)  gpu:A10:4  idle    12:00:00            
gpu-a30-n01 gpuq       0/96/0/96     450325   511362   gpu:A30:0(IDX:N/A)  gpu:A30:4  idle    2-00:00:00          
gpu-a30-n02 gpuq       0/96/0/96     436435   511362   gpu:A30:0(IDX:N/A)  gpu:A30:4  idle    2-00:00:00          
gpu-a30-n03 gpuq       0/96/0/96     497816   511362   gpu:A30:0(IDX:N/A)  gpu:A30:4  idle    2-00:00:00          
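
The `CPUS(A/I/O/T)` column can be summed to see how many CPUs are idle in a partition. The sketch below assumes the column order of the command above; `idle_cpus` is a hypothetical helper name, and the sample rows are copied from the output above so that it runs without Slurm.

```bash
#!/bin/bash
# Hypothetical helper: sum the idle CPUs of one partition from node-oriented
# sinfo output read on stdin. Assumes the columns start with
# NODELIST PARTITION CPUS(A/I/O/T), as in the command above.
# usage: sinfo -NO nodelist:11' ',partition:10' ',cpusstate:13 | idle_cpus gpuq
idle_cpus() {
    awk -v part="$1" '
        NR > 1 && $2 == part {      # skip the header row
            split($3, cpus, "/")    # allocated/idle/other/total
            idle += cpus[2]
        }
        END { print idle + 0 }
    '
}

sample='NODELIST    PARTITION  CPUS(A/I/O/T) FREE_MEM MEMORY
gpu-a10-n01 gpuq_inter 0/48/0/48     163914   257417
gpu-a30-n01 gpuq       0/96/0/96     450325   511362
gpu-a30-n02 gpuq       0/96/0/96     436435   511362
gpu-a30-n03 gpuq       0/96/0/96     497816   511362'

echo "$sample" | idle_cpus gpuq         # 288
echo "$sample" | idle_cpus gpuq_inter   # 48
```

Note that the partition name has to match what is printed, so with a column width of 10 the `gpuq_interactive` partition appears as `gpuq_inter`.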


## Monitoring Your Jobs and the Job Queue cont.

For newer Slurm versions, the GUI tool `sview` is an option.

Its functionality is currently a little limited and not customizable.

Requires an X11 server running on your computer (Windows: MobaXterm, Mac: XQuartz).

In [36]:
sview





## Basic Job Monitoring and Profiling

This section will look at using command-line tools to obtain visibility into how your job is performing.

| type of data | Live | Historical |
| --- | --- | --- |
| <strong>good for</strong> | debugging | debugging |
| | evaluating utilization | profiling |
| <strong>drawbacks</strong> | uses system tools, so requires some system understanding | Only provides data when jobs are completed |

We will look at:
* `htop` for _Live_ Process activity on nodes
* `nvidia-smi` for _Live_ GPU activity on nodes
* `seff` for _Historical_ job CPU and memory usage data
* `dcgmstats` for _Historical_ job GPU usage data
* `sacct` for _Historical_ job data

## Basic Job Monitoring and Profiling cont.
### Live monitoring of jobs
A limitation of Slurm is that it doesn't offer tools that provide up-to-date data of a job as it is running. 

Instead, to obtain insight into current activity of a job, we can use general system tools.

The main obstacle with using these tools is that they do not use job IDs - you must match jobs with _processes_ on a node.

To do this, you use `squeue` to find which node(s) your job(s) are running on.<br>
Then, you will use the `ssh` command to move to that node and monitor the system's activity.

Another "drawback" is that some of the data may require a bit of operating system understanding to interpret.

But nevertheless, the tools are robust and useful in helping us understand how our jobs are performing.
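
Matching a job to its node can be scripted. The sketch below assumes the default `squeue` column layout shown earlier, where `NODELIST` is the eighth column; `job_nodes` is a hypothetical helper name, and the sample rows are copied from the earlier output so that it runs without Slurm.

```bash
#!/bin/bash
# Hypothetical helper: print the node list of one job from default squeue
# output read on stdin (NODELIST is the 8th column in that layout).
# usage: squeue -u "$USER" | job_nodes 8516030
job_nodes() {
    awk -v id="$1" '$1 == id { print $8 }'
}

sample='             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           8516030      gpuq interact bollands  R    1:30:11      1 gpu-p100-n01
           8511988 interacti sys/dash    yan.a  R   20:15:53      1 sml-n03'

echo "$sample" | job_nodes 8516030   # gpu-p100-n01
echo "$sample" | job_nodes 8511988   # sml-n03
```

On the cluster itself, `squeue -h -j <jobid> -o %N` prints the node list directly, after which you can `ssh` to that node. Job names containing spaces would shift the columns, which is one reason to prefer the `squeue` options when they are available.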

## Basic Job Monitoring and Profiling cont.
#### Live monitoring: CPU, memory, IO activity
`htop` is a utility often installed on HPC clusters for monitoring processes.

It can be used to look at the CPU, memory, and IO utilization of a running process. 

It's not a Slurm tool, but is nevertheless very useful in monitoring jobs' activity and diagnosing issues.

To show only your processes, execute `htop -u $USER`

## Basic Job Monitoring and Profiling cont.

`htop` shows the individual CPU core utilization on the top, followed by memory utilization and some misc. information.

The bottom panel shows the process information

Relevant headings (`htop` labels, with the `ps` equivalents in brackets):
* USER: User that owns the process
* PID: Process ID
* CPU% (`%CPU`): % of a single core that a process is using e.g. 400% means the process is using 4 cores
* MEM% (`%MEM`): % of the node's total RAM that the process is using
* VIRT (`VSZ`): "Virtual" memory - the memory a process "thinks" it's using
* RES (`RSS`): "Resident" memory - the actual physical memory a process is using

## Basic Job Monitoring and Profiling cont.
You can organize the process information into "trees" by pressing `F5`

You can add IO information by
1. Press `F2` (Setup)
2. Press down three times to move the cursor to "Columns" and press right twice
3. The cursor should now be in "Available Columns". Scroll down to `IO_READ_RATE` and press enter
4. Scroll down to `IO_WRITE_RATE` and press enter
5. Press `F10` to exit.

You should now be able to see read/write rates for processes that you have permissions for.

<strong>Tips</strong>: 
* `htop` configurations are saved in `~/.config/htop`. Delete this folder to reset your `htop` configuration.
* `ps` and `pidstat` are useful alternatives which can be incorporated into scripts.
* Some systems may not have `htop` installed, in which case `top` can be used instead

## Basic Job Monitoring and Profiling cont.

#### Live monitoring: GPU activity

To monitor activity of Milton's NVIDIA GPUs, we must rely on NVIDIA's `nvidia-smi` tool.

`nvidia-smi` shows information about the memory and compute utilization, process allocation and other details. 

Like `htop`, `nvidia-smi` reports on processes rather than Slurm jobs, in this case the processes running on each GPU. If your job is occupying an entire node and all its GPUs, it should be straightforward to determine which GPUs you've been allocated.

But if your job is sharing a node with other jobs, you might not know straight away which GPU your job has been allocated. You can determine this by
* inferring it from the command being run on a GPU, or
* using `squeue` with extra formatting options as discussed previously.

<strong>Note</strong>:
This tool is available only on GPU nodes where the CUDA drivers are installed, so you must `ssh` to a `gpu` node to try it.

<strong>Tip</strong>: Combine `nvidia-smi` with `watch` to automatically update the output.

## Basic Job Monitoring and Profiling cont.
### Historical monitoring of jobs
Slurm tools and plugins are generally easier to use because they provide information on a per-job basis, meaning there's no need to match processes with jobs like previously discussed.

<strong>Tip</strong>: _generally_, results are more reliable when commands are executed with `srun`. This applies to `seff` in particular.

## Basic Job Monitoring and Profiling cont.
#### Historical data: CPU and memory utilization
The `seff` command shows the memory and CPU utilization of a job that has <strong>ended</strong>.

In [35]:
seff 8665813

Job ID: 8665813
Cluster: milton
User/Group: yang.e/allstaff
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 4
CPU Utilized: 00:09:04
CPU Efficiency: 99.27% of 00:09:08 core-walltime
Job Wall-clock time: 00:02:17
Memory Utilized: 1.95 GB (estimated maximum)
Memory Efficiency: 48.83% of 4.00 GB (1.00 GB/core)
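
The "CPU Efficiency" line is the CPU time used divided by the core-walltime, i.e. the number of cores multiplied by the wall-clock time. The sketch below redoes that arithmetic with the numbers from the output above; `cpu_efficiency` is a hypothetical helper name, not part of `seff`.

```bash
#!/bin/bash
# Reproduces the "CPU Efficiency" figure from the seff output above.
# usage: cpu_efficiency CPU_SECONDS CORES WALL_SECONDS
cpu_efficiency() {
    awk -v used="$1" -v cores="$2" -v wall="$3" \
        'BEGIN { printf "%.2f%%\n", 100 * used / (cores * wall) }'
}

cpu_used=$(( 9 * 60 + 4 ))    # CPU Utilized: 00:09:04         -> 544 s
wall=$(( 2 * 60 + 17 ))       # Job Wall-clock time: 00:02:17  -> 137 s
cpu_efficiency "$cpu_used" 4 "$wall"   # 99.27%
```

Here 4 cores x 137 s gives 548 s of core-walltime (the `00:09:08` in the output), and 544 / 548 is 99.27%. A low value usually means the job requested more cores than the program could keep busy.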


## Basic Job Monitoring and Profiling cont.
`sacct` is a general job history querying command-line tool that can provide lots of information about your _past_ jobs.

<strong>Note</strong>: `sacct` data can take a few minutes to be updated, so it is most reliable for jobs that have already finished.

The following `sacct` command shows your job data for jobs since 1st Nov:
* Job steps' ID and name
* Requested resources
* Elapsed time
* The quantity of data written and read
* The quantity of virtual and resident memory used

Note that the IO and memory values shown will be for the highest use <strong>task</strong>.

## Basic Job Monitoring and Profiling cont.

In [21]:
sacct -S 2022-11-01 -o jobid%14' ',jobname,ncpus%5' ',nodelist,elapsed,state,maxdiskread,maxdiskwrite,maxvmsize,maxrss | head -n5

         JobID    JobName NCPUS        NodeList    Elapsed      State  MaxDiskRead MaxDiskWrite  MaxVMSize     MaxRSS 
-------------- ---------- ----- --------------- ---------- ---------- ------------ ------------ ---------- ---------- 
       8664599 sys/dashb+     2         sml-n01 1-00:00:22    TIMEOUT                                                 
 8664599.batch      batch     2         sml-n01 1-00:00:23  CANCELLED      102.64M       15.11M   1760920K     99812K 
8664599.extern     extern     2         sml-n01 1-00:00:22  COMPLETED        0.00M            0    146612K        68K 


## Basic Job Monitoring and Profiling cont.
#### Historical data: GPU activity
By default, Slurm doesn't have the ability to produce stats on GPU usage.

WEHI's ITS have implemented the `dcgmstats` NVIDIA Slurm plugin which can produce these summary stats.

To use this plugin, pass the `--comment=dcgmstats` option to `srun`, `salloc`, or `sbatch`.

If your job requested at least one GPU, an extra output file will be generated in the working directory called `dcgm-stats-<jobid>.out`. The output file will contain a table for each GPU requested by the job.

## Basic Job Monitoring and Profiling cont.
### Summary
* live monitoring:
    * `htop` for CPU, memory, and IO data (requires configuration)
    * `nvidia-smi` for GPU activity
    * both require matching jobs to hardware and running processes
* historical monitoring:
    * `seff` command for simple CPU and memory utilization data for one job
    * `sacct` command for memory and IO data for multiple past jobs
    * `dcgmstats` Slurm plugin for GPU stats for a single Slurm job

## Sbatch Scripting Features
Sbatch scripts have a lot of nice features that extend beyond requesting resources. This section will look at some of these useful features which you can use in your workflows.

This section will look at:
* getting email notifications
* changing `stdout` and `stderr` files
* controlling how jobs depend on each other
* making use of job environments and interpreters (e.g. python or R)
* submitting `python` or `R` scripts without a wrapper script

We're going to start with a simple R script submitted by a wrapper sbatch script:

```r
## matmul.rscript
# multiplies two matrices together and prints how long it takes.

print("starting the matmul R script!")
nrows = 1e3
print(paste0("elem: ", nrows, "*", nrows, " = ", nrows*nrows))

# generating matrices
M <- matrix(rnorm(nrows*nrows),nrow=nrows)
N <- matrix(rnorm(nrows*nrows),nrow=nrows)

# start matmul
start.time <- Sys.time()
invisible(M %*% N)
end.time <- Sys.time()

# Getting final time and writing to stdout
elapsed.time <- difftime(time1=end.time, time2=start.time, units="secs")
print(elapsed.time)
```

In [23]:
#!/bin/bash
# Example sbatch script running Rscript
# Does a matmul
# rev0

#SBATCH --mem=8G
#SBATCH --cpus-per-task=2
#SBATCH --time=1-
#SBATCH --nodes=1
#SBATCH --ntasks=1

# loading module for R
module load R/openBLAS/4.2.1

Rscript matmul.rscript

[1] "starting the matmul R script!"
[1] "elem: 1000*1000 = 1e+06"
Time difference of 0.06260109 secs


## Sbatch scripting features cont.
### Email notifications
Getting notifications about the status of your Slurm jobs removes the need to `ssh` onto Milton and run `squeue` to check on them.

Instead, Slurm emails you when your job's state changes e.g. when it has started or ended.

To enable this behaviour, add the following options to your job scripts:
```
--mail-user=me@gmail.com
--mail-type=ALL
```
This sends emails to `me@gmail.com` when the job state changes. 

If you only want to know when your job goes through certain states, e.g. if it fails but not when it starts or finishes, replace `ALL` with a comma-separated list of the values below (for example `--mail-type=FAIL,TIME_LIMIT_80`):
* BEGIN: job starts
* END: job finishes successfully
* FAIL: job fails
* TIME_LIMIT: job reaches time limit
* TIME_LIMIT_50/80/90: job reaches 50%/80%/90% of time limit

In [25]:
#!/bin/bash
# Example sbatch script running Rscript
# Does a matmul
# rev1 - email notifications

#SBATCH --mem=8G
#SBATCH --cpus-per-task=2
#SBATCH --time=1-
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mail-user=yang.e@wehi.edu.au
#SBATCH --mail-type=ALL

# loading module for R
module load R/openBLAS/4.2.1

Rscript matmul.rscript

[1] "starting the matmul R script!"
[1] "elem: 1000*1000 = 1e+06"
Time difference of 0.06343555 secs


## Sbatch scripting features cont.

### A short aside on `stdout` and `stderr`
Linux has two main "channels" to send output messages to. One is "stdout" (standard out), and the other is "stderr" (standard error).

If you have ever used the `|`, `>`, or `>>` shell scripting features, then you've _redirected_ `stdout` somewhere else e.g., to another command, a file, or the void (`/dev/null`).

```bash
$ ls dir-that-doesnt-exist
ls: cannot access dir-that-doesnt-exist: No such file or directory # this is a stderr output
```

```bash
$ ls ~
bin cache Desktop Downloads ... # this is a stdout output!
```

## Sbatch scripting features cont.
### Redirecting job's `stderr` and `stdout`
By default:
* job's working directory is the directory you submitted from
* `stdout` is directed to `slurm-<jobid>.out` in the job's working directory
* `stderr` is directed to wherever `stdout` is directed to

Redirect `stderr` and `stdout` with `--error` and `--output` options. They work with both relative and absolute paths, e.g.
```
--error=/dev/null
--output=path/to/output.out
```
where paths are resolved relative to the job's working directory.

Variables can be used, like:
* `%j`: job ID
* `%x`: job name
* `%u`: username
* `%t`: task ID i.e., separate file per task
* `%N`: node name i.e., separate file per node in the job

In [24]:
#!/bin/bash
# Example sbatch script running Rscript
# Does a matmul
# rev2 - added --output and --error options

#SBATCH --mem=8G
#SBATCH --cpus-per-task=2
#SBATCH --time=1-
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mail-user=yang.e@wehi.edu.au
#SBATCH --mail-type=ALL
#SBATCH --output=logs/matmul-%j.out
#SBATCH --error=logs-debug/matmul-%j.err

# loading module for R
module load R/openBLAS/4.2.1

Rscript matmul.rscript

[1] "starting the matmul R script!"
[1] "elem: 1000*1000 = 1e+06"
Time difference of 0.06532145 secs
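
One practical caveat, to the best of my knowledge: Slurm does not create missing directories for `--output` and `--error`, so `logs/` and `logs-debug/` need to exist before the job starts. The sketch below uses a hypothetical helper, `make_log_dirs`, that reads those two options from a job script and creates the parent directories. It assumes the `#SBATCH --output=...` spelling used above and does not expand patterns such as `%u` inside directory names.

```bash
#!/bin/bash
# Hypothetical helper: create the parent directories named in the
# "#SBATCH --output=" and "#SBATCH --error=" lines of a job script.
# Relative paths are created relative to the current directory, which is
# also the job's working directory when you submit from there.
# usage: make_log_dirs my-job.sbatch && sbatch my-job.sbatch
make_log_dirs() {
    local script=$1 path
    sed -nE 's/^#SBATCH[[:space:]]+--(output|error)=([^[:space:]]+).*/\2/p' "$script" |
        while IFS= read -r path; do
            mkdir -p "$(dirname "$path")"
        done
}

# Demonstration in a scratch directory, using the header lines from the script above.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
printf '%s\n' '#!/bin/bash' \
    '#SBATCH --output=logs/matmul-%j.out' \
    '#SBATCH --error=logs-debug/matmul-%j.err' > demo.sbatch
make_log_dirs demo.sbatch
ls -d logs logs-debug
```

A plain `mkdir -p logs logs-debug` before `sbatch` does the same job; the helper only saves retyping the paths when they change.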


## Sbatch scripting features cont.
### Using job dependencies
Slurm allows for submitted jobs to wait for another job to start or finish before beginning. While probably not as effective as workflow managers like Nextflow, Slurm's job dependencies can still be useful for simple workflows.

Make a job dependent on another by passing the `--dependency` option with one of the following values:
* `afterok:jobid1:jobid2...` waits for `jobid1`, `jobid2` ... to complete successfully
* `afterany:jobid1:jobid2...` waits for `jobid1`, `jobid2` ... to finish (fail, complete, or be cancelled)
* `after:jobid1:jobid2...` waits for `jobid1`, `jobid2` ... to start or be cancelled

e.g. `--dependency=afterok:12345678` will make the job wait for job `12345678` to complete successfully before starting.
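
Building the option value by hand gets fiddly once several job IDs are involved. The sketch below uses a hypothetical helper, `dependency_opt`, that joins a dependency type and any number of job IDs into the format described above. The commented lines at the end show how it might be combined with `sbatch --parsable`, which prints just the job ID (followed by a cluster name on multi-cluster setups); the script names there are placeholders.

```bash
#!/bin/bash
# Hypothetical helper: build a --dependency option from a type and job IDs.
# usage: dependency_opt afterok 111 222   ->   --dependency=afterok:111:222
dependency_opt() {
    local type=$1
    shift
    local IFS=:            # "$*" joins the remaining arguments with ':'
    printf '%s\n' "--dependency=${type}:$*"
}

dependency_opt afterok 12345678             # --dependency=afterok:12345678
dependency_opt afterany 12345678 12345679   # --dependency=afterany:12345678:12345679

# On the cluster, a two-step chain could look like this (not run here):
#   first=$(sbatch --parsable step1.sbatch)
#   sbatch "$(dependency_opt afterok "$first")" step2.sbatch
```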

## Sbatch scripting features cont.
### Making use of job environments and interpreters
By default, when you submit a Slurm job, Slurm copies all the environment variables in your environment and adds some extra ones for the job to use.

In [26]:
export VAR1="here is some text"

In [30]:
cat demo-scripts/env-vars1.sbatch

#!/bin/bash

echo $VAR1


In [37]:
sbatch demo-scripts/env-vars1.sbatch

Submitted batch job 8681656


In [38]:
cat slurm-8681656.out

here is some text


<strong>Note</strong>: For reproducibility reasons, a Slurm script that relies on environment variables can be submitted inside a wrapper script which first exports the relevant variable.

## Sbatch scripting features cont.
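Alternatively, you can use the `--export` option, which allows you to set specific values. Note that when `--export` lists specific variables without `ALL`, only those variables (plus Slurm's own `SLURM_*` variables) are passed to the job; use `--export=ALL,VAR1=...` to keep the rest of your environment.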

In [48]:
echo $VAR1

here is some text


In [49]:
sbatch --export=VAR1="this is some different text" demo-scripts/env-vars1.sbatch

Submitted batch job 8681761


In [50]:
cat slurm-8681761.out

this is some different text


This feature is especially useful when submitting jobs inside wrapper scripts.

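You can also use the `--export-file` option to point to a file of variable definitions that you wish the script to use. According to the `sbatch` man page, the file holds `NAME=value` definitions separated by null characters rather than shell `export` statements.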
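
As I read the `sbatch` man page, the file given to `--export-file` contains `NAME=value` definitions separated by null characters, which allows values with commas or other special characters. The sketch below writes such a file with `printf`; the file location and the variable values are placeholders, and the `sbatch` call is left as a comment because it needs a cluster.

```bash
#!/bin/bash
# Writes NAME=value definitions, each terminated by a null character.
# File name and values are placeholders for illustration.
env_file=$(mktemp)
printf '%s\0' \
    'VAR1=this is some different text' \
    'VAR2=a value with, commas' > "$env_file"

# Show the definitions one per line for inspection.
tr '\0' '\n' < "$env_file"

# On the cluster (not run here):
#   sbatch --export-file="$env_file" demo-scripts/env-vars1.sbatch
```

Check `man sbatch` on Milton before relying on this, as the accepted file format may differ between Slurm versions.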

## Sbatch scripting features cont.
Slurm also adds environment variables that expose the job's parameters to the script.

In [45]:
cat demo-scripts/env-vars2.sbatch

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2

echo I am running on ${SLURM_NODELIST}
echo with ${SLURM_NTASKS} tasks
echo and ${SLURM_CPUS_PER_TASK} CPUs per task


In [46]:
sbatch demo-scripts/env-vars2.sbatch

Submitted batch job 8681710


In [47]:
cat slurm-8681710.out

I am running on sml-n03
with 1 tasks
and 2 CPUs per task


These Slurm environment variables make it easy to supply parallelisation parameters to a program e.g. specifying number of threads. 
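
A common use is setting a thread count from `SLURM_CPUS_PER_TASK`. The sketch below is one way to do it, assuming an OpenMP-style program that reads `OMP_NUM_THREADS`; `threads_from_slurm` is a hypothetical helper name, and the fallback of 1 is my own choice so the script also runs outside a Slurm job.

```bash
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2

# Falls back to 1 thread when the variable is unset, e.g. when the script is
# run outside of a Slurm job for testing.
threads_from_slurm() {
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

# OMP_NUM_THREADS is read by OpenMP-based programs; many tools take a
# thread-count option on the command line instead (check the tool's documentation).
export OMP_NUM_THREADS=$(threads_from_slurm)
echo "using ${OMP_NUM_THREADS} thread(s)"
```

Deriving the thread count from the request keeps the two in step: changing `--cpus-per-task` is then the only edit needed.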

## Sbatch scripting features cont.
### Submitting scripts with different interpreters
Typically scripts submitted by `sbatch` use the `bash` or `sh` interpreter (e.g. `#!/bin/bash`), but it may be more convenient to use a different interpreter.

You can do this by changing the "hash bang" statement at the top of the script. To demonstrate this, we can take our original R matmul script, and add a "hash bang" statement to the top.

In [52]:
cat demo-scripts/matmul-interpreter1.rscript | head -n 5

#!/usr/bin/env Rscript
## matmul.rscript

print("starting the matmul R script!")
nrows = 1e3


The statement above looks for `Rscript` in your current environment.

Alternatively, you can specify the absolute path to the interpreter.

e.g. `#!/stornext/System/data/apps/R/openBLAS/R-4.2.1/lib64/R/bin/Rscript`

## Sbatch scripting features cont.
Changing the interpreter still allows you to access the extra Slurm environment variables, but in a way appropriate to the interpreter.

In [63]:
cat demo-scripts/matmul-interpreter2.rscript

#!/usr/bin/env Rscript
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --mem=4G

## matmul.rscript

print("starting the matmul R script!")
paste("using", Sys.getenv("SLURM_NTASKS"), "tasks")
paste("and", Sys.getenv("SLURM_CPUS_PER_TASK"), "CPUs per task")
nrows = 1e3
print(paste0("elem: ", nrows, "*", nrows, " = ", nrows*nrows))

# generating matrices
M <- matrix(rnorm(nrows*nrows),nrow=nrows)
N <- matrix(rnorm(nrows*nrows),nrow=nrows)

# start matmul
start.time <- Sys.time()
invisible(M %*% N)
end.time <- Sys.time()

# Getting final time and writing to stdout
elapsed.time <- difftime(time1=end.time, time2=start.time, units="secs")
print(elapsed.time)


In [59]:
sbatch demo-scripts/matmul-interpreter2.rscript

Submitted batch job 8682077


In [62]:
cat slurm-8682077.out

[1] "starting the matmul R script!"
[1] "using 1 tasks"
[1] "and 2 CPUs per task"
[1] "elem: 1000*1000 = 1e+06"
Time difference of 0.06340098 secs


`python` works similarly. Replace `Rscript` in the hash bang statement with `python`.