# SLURM resource limits

SLURM organizes work as a hierarchy: a job contains steps, and each step runs one or more tasks. At each level we can specify the resources to be used (nodes, CPUs, RAM).

- How are jobs, job steps, and step tasks scheduled across nodes?
- How does SLURM use [Control Groups](https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/resource_management_guide/ch01) to enforce resource limits?

## The task script
Every task in every step runs this script. It simply logs the SLURM resources and cgroup limits that the task sees.

In [1]:
%%writefile limits.py
#!/usr/bin/env python3
import os
import glob
import time

slurm = {
    k[len('SLURM_'):]: v 
    for k, v in os.environ.items() 
    if k.startswith('SLURM_')
}

# Job info (printed by task 0 in step 0)
if int(slurm['STEP_ID']) == 0 and int(slurm['PROCID']) == 0:
    print(f"Job {slurm['JOB_ID']}")
    print(
        f"  Submit host: {slurm['SUBMIT_HOST']}",
        f"  Nodes      : {slurm['JOB_NODELIST']}",
        f"  Num nodes  : {slurm['JOB_NUM_NODES']}",
        sep='\n',
        end='\n\n',
    )
    
# Step info (printed by task 0 in each step)
if int(slurm['PROCID']) == 0:
    print(f"Step {slurm['STEP_ID']}")
    print(
        f"  Nodes      : {slurm['STEP_NODELIST']}",
        f"  Num nodes  : {slurm['STEP_NUM_NODES']}",
        f"  Num tasks  : {slurm['STEP_NUM_TASKS']}",
        sep='\n',
        end='\n\n',
    )
    
# Sleep to avoid overlaps in the log file
time.sleep(int(slurm['PROCID']))

# Get cgroup of this process
# E.g. /slurm/uid_012345/job_001122/step_0/task_0
with open(f'/proc/{os.getpid()}/cgroup') as f:
#     cgroup = next(l.strip() for l in f if ':memory' in l).split(':')[2]
    for l in f:
        if ':memory' in l:
            cgroup = l.strip().split(':')[2]    
            *_, uid, job, step, task = cgroup.split('/')
            uid = int(uid[len('uid_'):])
            job = int(job[len('job_'):])
            step = int(step[len('step_'):])
            task = int(task[len('task_'):])
            break
    else:
        raise RuntimeError('Unable to parse cgroup info')

# Total CPU shares (relative scheduling weights) of all tasks in this step
# https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/resource_management_guide/sec-cpuacct
cpu_shares_tot = 0
for p in glob.glob(f'/sys/fs/cgroup/cpu,cpuacct/slurm/uid_{uid}/job_{job}/step_{step}/task_*/cpu.shares'):
    with open(p) as f:
        cpu_shares_tot += int(f.readline())
            
# CPU shares of this task (reported below as a fraction of the step total)
with open(f'/sys/fs/cgroup/cpu,cpuacct/slurm/uid_{uid}/job_{job}/step_{step}/task_{task}/cpu.shares') as f:
    cpu_shares_task = int(f.readline())
    
# CPU cores bound to this step (shared among tasks)
# https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/resource_management_guide/sec-cpuset
with open(f'/sys/fs/cgroup/cpuset/slurm/uid_{uid}/job_{job}/step_{step}/cpuset.cpus') as f:
    cpu_cores = f.readline().strip()
    
# RAM 
# Soft/hard limits in bytes for this step/task
# https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/resource_management_guide/sec-memory
with open(f'/sys/fs/cgroup/memory/slurm/uid_{uid}/job_{job}/step_{step}/memory.limit_in_bytes') as f:
    mem_step_hard = float(f.readline())
with open(f'/sys/fs/cgroup/memory/slurm/uid_{uid}/job_{job}/step_{step}/memory.soft_limit_in_bytes') as f:
    mem_step_soft = float(f.readline())
with open(f'/sys/fs/cgroup/memory/slurm/uid_{uid}/job_{job}/step_{step}/task_{task}/memory.limit_in_bytes') as f:
    mem_task_hard = float(f.readline())
with open(f'/sys/fs/cgroup/memory/slurm/uid_{uid}/job_{job}/step_{step}/task_{task}/memory.soft_limit_in_bytes') as f:
    mem_task_soft = float(f.readline())
    
print(
    f"Task {slurm['PROCID']}",
    f"  Hostname: {os.environ['SLURMD_NODENAME']}",

    # process ID of the task being started
    f"  Task PID: {slurm['TASK_PID']:<6}",
    
    # step-wise task ID, i.e. MPI rank
    f"  Task id : {slurm['PROCID']:<6}",
    
    # node local task ID for the process within this step
    f"  Local id: {slurm['LOCALID']:<6}",
    
    f"  Task control group : {cgroup}",
    
    # CPU cores shared among the tasks in this step on this machine
    f"  CPU cores  per step: {cpu_cores:<10}",
    
    # Relative CPU weight of this task among the tasks of this step
    # (cpu.shares is a weight applied under contention, not a hard cap)
    f"  CPU shares per task: {cpu_shares_task/cpu_shares_tot:<10.2%}",
    f"  RAM limits per step: {mem_step_soft / 2**30:.2f} GB (hard {mem_step_hard / 2**30:.2f} GB)",
    f"  RAM limits per task: {mem_task_soft / 2**30:.2f} GB (hard {mem_task_hard / 2**30:.2f} GB)",
    sep='\n',
    end='\n\n',
)

Overwriting limits.py
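
The cgroup-parsing step of the script can be tried in isolation. The following is a minimal sketch, assuming the cgroup v1 path layout shown in the script's comment (`/slurm/uid_<uid>/job_<job>/step_<step>/task_<task>`). The names `SlurmCgroup` and `parse_slurm_cgroup` and the hierarchy id `9` in the sample line are illustrative and not part of SLURM.

```python
from typing import NamedTuple


class SlurmCgroup(NamedTuple):
    uid: int
    job: int
    step: int
    task: int


def _strip_prefix(field: str, prefix: str) -> int:
    """Return the integer that follows `prefix` in `field`."""
    if not field.startswith(prefix):
        raise ValueError(f'Expected {prefix!r} in {field!r}')
    return int(field[len(prefix):])


def parse_slurm_cgroup(line: str) -> SlurmCgroup:
    """Parse one cgroup v1 line of /proc/<pid>/cgroup ('id:controllers:path')."""
    # Split at most twice so that the path is kept in one piece
    _, _, path = line.strip().split(':', 2)
    # The last four path components identify user, job, step and task
    *_, uid, job, step, task = path.split('/')
    return SlurmCgroup(
        uid=_strip_prefix(uid, 'uid_'),
        job=_strip_prefix(job, 'job_'),
        step=_strip_prefix(step, 'step_'),
        task=_strip_prefix(task, 'task_'),
    )


# Hypothetical sample line, built from the path printed in the examples
sample = '9:memory:/slurm/uid_1295800031/job_125493/step_0/task_0\n'
print(parse_slurm_cgroup(sample))
```

Like the script, this sketch expects numeric step names; a step whose cgroup has a non-numeric name would raise a `ValueError`.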


## Srun examples

Running `srun` from a normal shell creates a job with a single step.

A step can have multiple tasks:
- All tasks are launched with the same script and parameters
- Each task gets a unique task id (MPI rank)
- The tasks can run on a single node or span multiple nodes

### Basic single-task example

In [2]:
! srun --ntasks 1 python3 limits.py

Job 125493
  Submit host: moria
  Nodes      : smaug
  Num nodes  : 1

Step 0
  Nodes      : smaug
  Num nodes  : 1
  Num tasks  : 1

Task 0
  Hostname: smaug
  Task PID: 30193 
  Task id : 0     
  Local id: 0     
  Task control group : /slurm/uid_1295800031/job_125493/step_0/task_0
  CPU cores  per step: 38,78     
  CPU shares per task: 100.00%   
  RAM limits per step: 1.95 GB (hard 1.95 GB)
  RAM limits per task: 8589934592.00 GB (hard 8589934592.00 GB)
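
The task-level value of 8589934592.00 GB is not a real limit. Assuming cgroup v1 on a 64-bit kernel with 4 KiB pages, an unset `memory.limit_in_bytes` reads as 9223372036854771712 (the largest page-aligned signed 64-bit value), so in this run memory appears to be limited at the step level only. A small sketch that reports such values as unlimited; the helper `format_limit` is an illustrative name:

```python
# Value reported by cgroup v1 when no memory limit is set
# (assumption: 64-bit kernel with 4 KiB pages)
PAGE_SIZE = 4096
UNLIMITED = (2**63 - 1) // PAGE_SIZE * PAGE_SIZE


def format_limit(limit_bytes: int) -> str:
    """Format a cgroup memory limit, treating the 'no limit' value specially."""
    if limit_bytes >= UNLIMITED:
        return 'unlimited'
    return f'{limit_bytes / 2**30:.2f} GiB'


print(UNLIMITED)                   # 9223372036854771712
print(f'{UNLIMITED / 2**30:.2f}')  # 8589934592.00, the number in the task output
print(format_limit(UNLIMITED))     # unlimited
print(format_limit(2000 * 2**20))  # 1.95 GiB
```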



### Two tasks on the same node

Since both tasks execute on the same node, SLURM creates a single step cgroup there and assigns 4 cores to it (2 per task).

The step has two child tasks, each holding 50% of the step's CPU shares.
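
The 50% figure is a ratio of `cpu.shares` values. In cgroup v1 these are relative scheduling weights that matter when tasks compete for CPU time, not hard caps. A minimal sketch of the calculation the script performs, using hypothetical share values (the actual values SLURM writes are not shown in this notebook):

```python
def share_fractions(shares: dict) -> dict:
    """Map each task to its fraction of the step's total cpu.shares."""
    total = sum(shares.values())
    return {task: value / total for task, value in shares.items()}


# Hypothetical cpu.shares values for two sibling task cgroups
step_shares = {'task_0': 1024, 'task_1': 1024}

for task, fraction in share_fractions(step_shares).items():
    print(f'{task}: {fraction:.2%}')
```

With equal weights every task gets the same fraction, whatever the absolute values are.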

In [3]:
! srun --nodes 1 --ntasks 2 --cpus-per-task 2 python3 limits.py

Job 125494
  Submit host: moria
  Nodes      : smaug
  Num nodes  : 1

Step 0
  Nodes      : smaug
  Num nodes  : 1
  Num tasks  : 2

Task 0
  Hostname: smaug
  Task PID: 30228 
  Task id : 0     
  Local id: 0     
  Task control group : /slurm/uid_1295800031/job_125494/step_0/task_0
  CPU cores  per step: 38-39,78-79
  CPU shares per task: 50.00%    
  RAM limits per step: 3.91 GB (hard 3.91 GB)
  RAM limits per task: 8589934592.00 GB (hard 8589934592.00 GB)

Task 1
  Hostname: smaug
  Task PID: 30229 
  Task id : 1     
  Local id: 1     
  Task control group : /slurm/uid_1295800031/job_125494/step_0/task_1
  CPU cores  per step: 38-39,78-79
  CPU shares per task: 50.00%    
  RAM limits per step: 3.91 GB (hard 3.91 GB)
  RAM limits per task: 8589934592.00 GB (hard 8589934592.00 GB)
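
The step RAM limits in these outputs scale with the number of CPUs in the step cgroup: 1.95 GB for 2 CPUs and 3.91 GB for 4. This is consistent with 1000 MiB per allocated CPU, which is an inference from the numbers and not something the notebook states (the cluster configuration is not shown). A quick check of the arithmetic, with `step_mem_gib` as an illustrative helper:

```python
def step_mem_gib(n_cpus: int, mem_per_cpu_mib: int = 1000) -> float:
    """Expected step memory limit in GiB (the default per-CPU value is an assumption)."""
    return n_cpus * mem_per_cpu_mib / 1024


print(f'{step_mem_gib(2):.2f}')  # 1.95
print(f'{step_mem_gib(4):.2f}')  # 3.91
```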



### Two tasks, but on two different nodes

Since the tasks execute on two separate nodes, SLURM creates a step cgroup on each node and assigns 2 cores to each.

On each node, the step cgroup has a single child task that gets 100% of the step cores on that node.

In [4]:
! srun --nodes 2 --ntasks 2 --cpus-per-task 2 python3 limits.py

Job 125495
  Submit host: moria
  Nodes      : balrog,belegost
  Num nodes  : 2

Step 0
  Nodes      : balrog,belegost
  Num nodes  : 2
  Num tasks  : 2

Task 0
  Hostname: balrog
  Task PID: 27904 
  Task id : 0     
  Local id: 0     
  Task control group : /slurm/uid_1295800031/job_125495/step_0/task_0
  CPU cores  per step: 37,77     
  CPU shares per task: 100.00%   
  RAM limits per step: 1.95 GB (hard 1.95 GB)
  RAM limits per task: 8589934592.00 GB (hard 8589934592.00 GB)

Task 1
  Hostname: belegost
  Task PID: 13103 
  Task id : 1     
  Local id: 0     
  Task control group : /slurm/uid_1295800031/job_125495/step_0/task_1
  CPU cores  per step: 22,46     
  CPU shares per task: 100.00%   
  RAM limits per step: 1.95 GB (hard 1.95 GB)
  RAM limits per task: 8589934592.00 GB (hard 8589934592.00 GB)
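
Note that task 1 has global id 1 but local id 0, because it is the first task on its node. The sketch below models this with a simple block placement, where consecutive task ids fill one node before moving to the next. It is a simplified illustration that assumes an even split; SLURM's real placement depends on the available CPUs and on the `--distribution` setting. The function name `block_placement` is hypothetical.

```python
import math


def block_placement(ntasks: int, nodes: list) -> dict:
    """Return {task_id: (node, local_id)} for a simplified block distribution."""
    # Assumption: every node takes the same maximum number of tasks
    per_node = math.ceil(ntasks / len(nodes))
    placement = {}
    for task_id in range(ntasks):
        # Quotient selects the node, remainder is the node-local id
        node_index, local_id = divmod(task_id, per_node)
        placement[task_id] = (nodes[node_index], local_id)
    return placement


print(block_placement(2, ['balrog', 'belegost']))
# {0: ('balrog', 0), 1: ('belegost', 0)}
```

For three tasks on `['gondor', 'khazadum']` the same function places tasks 0 and 1 on gondor and task 2 on khazadum with local id 0, which matches the three-task run in this notebook.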



### Three tasks across two nodes

In [5]:
! srun --nodes 2 --ntasks 3 --cpus-per-task 2 python3 limits.py

Job 125496
  Submit host: moria
  Nodes      : gondor,khazadum
  Num nodes  : 2

Step 0
  Nodes      : gondor,khazadum
  Num nodes  : 2
  Num tasks  : 3

Task 0
  Hostname: gondor
  Task PID: 609   
  Task id : 0     
  Local id: 0     
  Task control group : /slurm/uid_1295800031/job_125496/step_0/task_0
  CPU cores  per step: 18-19,38-39
  CPU shares per task: 50.00%    
  RAM limits per step: 3.91 GB (hard 3.91 GB)
  RAM limits per task: 8589934592.00 GB (hard 8589934592.00 GB)

Task 1
  Hostname: gondor
  Task PID: 610   
  Task id : 1     
  Local id: 1     
  Task control group : /slurm/uid_1295800031/job_125496/step_0/task_1
  CPU cores  per step: 18-19,38-39
  CPU shares per task: 50.00%    
  RAM limits per step: 3.91 GB (hard 3.91 GB)
  RAM limits per task: 8589934592.00 GB (hard 8589934592.00 GB)

Task 2
  Hostname: khazadum
  Task PID: 4128  
  Task id : 2     
  Local id: 0     
  Task control group : /slurm/uid_1295800031/job_125496/step_0/task_2
  CPU cores  per step: 4,

## Sbatch example

- `sbatch` lets you create jobs with multiple steps.
- Steps are created using `srun` from within the `sbatch` script.
- A pool of nodes is allocated to the job, then the individual steps can use nodes from this pool.

Example:
- 2 nodes
- step 0 with 1 task
- step 1 with 2 tasks
- step 2 with 4 tasks
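
The plan above can be written down as data to check that every step fits inside the job's node pool. This is only an illustrative model: the `Step` class and the `check_plan` function are hypothetical and not part of any SLURM API, and the per-step node counts are the ones used in the `srun` lines of the batch script.

```python
from dataclasses import dataclass


@dataclass
class Step:
    nodes: int   # nodes requested by the step
    ntasks: int  # tasks launched by the step


def check_plan(job_nodes: int, steps: list) -> list:
    """Describe each step, failing if it asks for more nodes than the job has."""
    summary = []
    for step_id, step in enumerate(steps):
        if step.nodes > job_nodes:
            raise ValueError(
                f'step {step_id} needs {step.nodes} nodes, job has {job_nodes}'
            )
        summary.append(
            f'step {step_id}: {step.ntasks} task(s) on {step.nodes} of {job_nodes} node(s)'
        )
    return summary


plan = [Step(nodes=1, ntasks=1), Step(nodes=2, ntasks=2), Step(nodes=2, ntasks=4)]
print('\n'.join(check_plan(job_nodes=2, steps=plan)))
```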

In [6]:
%%writefile limits.sbatch
#!/bin/bash
#SBATCH --output limits.out
#SBATCH --nodes  2

srun --nodes 1 --ntasks 1 python3 limits.py
sleep 2
srun --nodes 2 --ntasks 2 python3 limits.py
sleep 2
srun --nodes 2 --ntasks 4 --ntasks-per-node 2 --cpus-per-task 1 --mem-per-cpu 256M python3 limits.py

Overwriting limits.sbatch


In [7]:
%%bash
rm -f limits.out

JOBID="$(sbatch --parsable limits.sbatch)"
# Wait until the job leaves the queue (-w avoids matching longer job ids)
while squeue | grep -qw "${JOBID}"; do sleep 1; done
sacct --job $JOBID --format JobID%-25,AllocCPUS,Elapsed,State,ExitCode,NodeList
echo ""

cat limits.out

                    JobID  AllocCPUS    Elapsed      State ExitCode        NodeList 
------------------------- ---------- ---------- ---------- -------- --------------- 
125497                             4   00:00:11  COMPLETED      0:0 balrog,belegost 
125497.batch                       2   00:00:11  COMPLETED      0:0          balrog 
125497.0                           1   00:00:00  COMPLETED      0:0          balrog 
125497.1                           2   00:00:02  COMPLETED      0:0 balrog,belegost 
125497.2                           4   00:00:04  COMPLETED      0:0 balrog,belegost 

Job 125497
  Submit host: balrog.csc.kth.se
  Nodes      : balrog,belegost
  Num nodes  : 2

Step 0
  Nodes      : balrog
  Num nodes  : 1
  Num tasks  : 1

Task 0
  Hostname: balrog
  Task PID: 27977 
  Task id : 0     
  Local id: 0     
  Task control group : /slurm/uid_1295800031/job_125497/step_0/task_0
  CPU cores  per step: 37,77     
  CPU shares per task: 100.00%   
  RAM limits per step: 1.9