# SLURM job arrays

How do we run a parameter sweep with SLURM job arrays? Two approaches are shown: passing command-line arguments and passing config files.

## Using command-line arguments

The arguments are stored in a text file, one set of arguments per line.

Each array task uses `SLURM_ARRAY_TASK_ID` to select its own line.

This is the Python script that we want to run with different arguments:

In [1]:
%%writefile sums.py
#!/usr/bin/env python3
import argparse

parser = argparse.ArgumentParser(
    description="Sum first N integers (incl.), "
                "optionally skipping multiples of k, "
                "or squaring the numbers before summing"
)
parser.add_argument('N', type=int)
parser.add_argument('-k', type=int, default=None)
parser.add_argument('-s', '--square', action='store_true')
args = parser.parse_args()

tot = 0
for i in range(1, args.N+1):
    if args.k is not None and i % args.k == 0:
        continue
    if args.square:
        i = i**2
    tot += i

print(tot)

Overwriting sums.py


This is the sbatch file that we will submit.
It doesn't contain the `--array` parameter because we'll specify that at submission time.

Variables used:
- `SLURM_ARRAY_TASK_ID`: provided by SLURM; it is different for every task in the array
- `ARGS_FILE`: must be passed to the script at submission time (via `--export`)

The Python script runs with the arguments found on line number `SLURM_ARRAY_TASK_ID` of `ARGS_FILE` (lines are counted from 1).
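The line selection described above can be sketched in Python. This is only an illustration of the idea, not part of the workflow: `args_for_task` is a hypothetical helper, and the list of lines mirrors the `sums.args` example used in this section.

```python
import shlex


def args_for_task(lines, task_id):
    """Return the argument list found on line `task_id` (counted from 1)."""
    if not 1 <= task_id <= len(lines):
        raise IndexError(f"task {task_id} has no line in the args file")
    # Split the line into separate arguments, as the shell does for $ARGS
    return shlex.split(lines[task_id - 1])


# Same content as the sums.args example in this section
lines = ["10", "20", "15 -k 2", "32 -k 3 --square"]
for task_id in range(1, len(lines) + 1):
    print(task_id, args_for_task(lines, task_id))
```

One difference worth knowing: `shlex.split` honours quotes inside a line, whereas an unquoted `$ARGS` in bash splits on whitespace only, so arguments containing spaces would need extra care in the sbatch script.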

In [2]:
%%writefile sums.sbatch
#!/bin/bash
#SBATCH --output %A_%a.out

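# SLURM_JOB_ID is unique to each array task; the ID shared by the whole
# array is SLURM_ARRAY_JOB_ID (the %A in --output above).
echo "This is job ${SLURM_JOB_ID}_${SLURM_ARRAY_TASK_ID} on ${SLURMD_NODENAME}"

# Pick the line of ARGS_FILE whose number matches this task's index
ARGS=$(sed -n "${SLURM_ARRAY_TASK_ID}p" "${ARGS_FILE}")
echo "$ARGS"
# $ARGS is deliberately unquoted so the shell splits it into separate arguments
python3 sums.py $ARGS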

Overwriting sums.sbatch


This is a plain text file in which each line holds one set of parameters:

In [3]:
%%writefile sums.args
10
20
15 -k 2
32 -k 3 --square

Overwriting sums.args
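For a larger sweep, the args file would probably be generated rather than typed by hand. A minimal sketch, assuming a full grid over `N`, `k` and `--square`; the helper `make_arg_lines`, the grid values and the file name `sweep.args` are all made up for illustration.

```python
import itertools


def make_arg_lines(n_values, k_values, square_values):
    """Build one command line per combination of N, k and --square."""
    lines = []
    for n, k, square in itertools.product(n_values, k_values, square_values):
        parts = [str(n)]
        if k is not None:
            parts += ["-k", str(k)]
        if square:
            parts.append("--square")
        lines.append(" ".join(parts))
    return lines


# Hypothetical grid, chosen only for illustration
lines = make_arg_lines([10, 20], [None, 3], [False, True])

# End every line with a newline so that `wc -l` counts all of them
with open("sweep.args", "w") as f:
    f.write("".join(line + "\n" for line in lines))
print(len(lines), "lines written")
```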


This submits the job array. There is some bookkeeping code, but the important bits are the `--array` and `--export` parameters of `sbatch`:

In [4]:
%%bash
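ARGS_FILE='sums.args'
# One array task per line of the args file
NUM_JOBS=$(wc -l < "$ARGS_FILE")
# At most this many tasks run at the same time (the %N suffix of --array)
MAX_PARALLEL_JOBS=2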

JOBID=$(
    sbatch \
    --parsable \
    --array 1-${NUM_JOBS}%${MAX_PARALLEL_JOBS} \
    --export ARGS_FILE=$ARGS_FILE \
    sums.sbatch
)

# Poll until no task of this array is left in the queue, then show the accounting summary
while [ -n "$(squeue | grep ${JOBID})" ]; do sleep 5; done
sacct --job $JOBID --format jobid%-25,State,ExitCode,NodeList
echo

for f in ${JOBID}_*.out; do
    echo $f
    cat $f
    echo
done

                    JobID      State ExitCode        NodeList 
------------------------- ---------- -------- --------------- 
125536_4                   COMPLETED      0:0           smaug 
125536_4.batch             COMPLETED      0:0           smaug 
125536_1                   COMPLETED      0:0           smaug 
125536_1.batch             COMPLETED      0:0           smaug 
125536_2                   COMPLETED      0:0           smaug 
125536_2.batch             COMPLETED      0:0           smaug 
125536_3                   COMPLETED      0:0           smaug 
125536_3.batch             COMPLETED      0:0           smaug 

125536_1.out
This is job 125537_1 on smaug
10
55

125536_2.out
This is job 125538_2 on smaug
20
210

125536_3.out
This is job 125539_3 on smaug
15 -k 2
64

125536_4.out
This is job 125536_4 on smaug
32 -k 3 --square
7975
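The `--array 1-${NUM_JOBS}%${MAX_PARALLEL_JOBS}` value in the submission cell is built from the line count of the args file. The sketch below reproduces that logic in Python to make one pitfall visible; `array_spec` and `count_tasks` are hypothetical helpers, not SLURM functions.

```python
def array_spec(num_tasks, max_parallel=None):
    """Build the value for sbatch --array: tasks 1..num_tasks, optionally throttled."""
    if num_tasks < 1:
        raise ValueError("need at least one task")
    spec = f"1-{num_tasks}"
    if max_parallel is not None:
        # The %N suffix caps how many tasks run simultaneously
        spec += f"%{max_parallel}"
    return spec


def count_tasks(args_text):
    """Count argument lines the way `wc -l` does: by newline characters."""
    return args_text.count("\n")


args_text = "10\n20\n15 -k 2\n32 -k 3 --square\n"
print(array_spec(count_tasks(args_text), max_parallel=2))  # 1-4%2
```

Because `wc -l` counts newline characters, an args file whose last line lacks a trailing newline would yield one task too few, and the last set of parameters would silently never run.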



## Using config files

Each run reads its parameters from a configuration file. We use `SLURM_ARRAY_TASK_ID` to get the right file.

This is the Python script that we want to run with different configs:

In [5]:
%%writefile prods.py
#!/usr/bin/env python3
"""
Multiplies the first N integers (incl.),
optionally skipping multiples of k,
or squaring the numbers before multiplying
"""
import sys
import json

with open(sys.argv[1]) as f:
    config = json.load(f)

tot = 1
for i in range(1, config['N']+1):
    if config.get('k') is not None and i % config['k'] == 0:
        continue
    if config.get('square', False):
        i = i**2
    tot *= i

print(tot)

Overwriting prods.py


This is the sbatch file that we will submit.
It doesn't contain the `--array` parameter because we'll specify that at submission time.

Variables used:
- `SLURM_ARRAY_TASK_ID`: provided by SLURM; it is different for every task in the array
- `ARGS_PREFIX`: must be passed to the script at submission time (via `--export`)

The Python script runs with the config file named `${ARGS_PREFIX}_${SLURM_ARRAY_TASK_ID}.json`.

In [6]:
%%writefile prods.sbatch
#!/bin/bash
#SBATCH --output %A_%a.out

echo "This is job ${SLURM_JOB_ID}_${SLURM_ARRAY_TASK_ID} on ${SLURMD_NODENAME}"

# Config file for this task, e.g. prods_1.json for task 1
ARGS="${ARGS_PREFIX}_${SLURM_ARRAY_TASK_ID}.json"
cat "$ARGS"
python3 prods.py "$ARGS"

Overwriting prods.sbatch


These are the configs that we want to run. In practice, there will be another script that creates them.

In [7]:
%%writefile prods_1.json
{
    "N": 10
}

Writing prods_1.json


In [8]:
%%writefile prods_2.json
{
    "N": 10,
    "k": 3
}

Writing prods_2.json


In [9]:
%%writefile prods_3.json
{
    "N": 15,
    "k": 7,
    "square": true
}

Writing prods_3.json
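A script that creates such config files could look roughly like the sketch below. The helper `write_configs`, the sweep values and the prefix `sweep` are assumptions made for illustration; the only convention taken from this section is the `<prefix>_<task id>.json` naming with task IDs starting at 1.

```python
import json
from pathlib import Path


def write_configs(configs, prefix, directory="."):
    """Write configs as <prefix>_1.json, <prefix>_2.json, ... and return the paths."""
    paths = []
    for task_id, config in enumerate(configs, start=1):
        path = Path(directory) / f"{prefix}_{task_id}.json"
        path.write_text(json.dumps(config, indent=4) + "\n")
        paths.append(path)
    return paths


# Hypothetical sweep; the prefix "sweep" avoids overwriting the prods_*.json files
configs = [{"N": n, "k": k} for n in (5, 10) for k in (2, 3)]
for path in write_configs(configs, prefix="sweep"):
    print(path)
```

Numbering the files consecutively from 1 matters: the submission cell derives the array range from the number of matching files, so a gap in the numbering would leave one task without a config.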


This submits the job array. There is some bookkeeping code, but the important bits are the `--array` and `--export` parameters of `sbatch`:

In [10]:
%%bash

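ARGS_PREFIX='prods'
# One array task per matching config file: print one dot per file, then count the dots
NUM_JOBS=$(find . -name "${ARGS_PREFIX}_*.json" -printf '.' | wc -c)
# At most this many tasks run at the same time (the %N suffix of --array)
MAX_PARALLEL_JOBS=2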

JOBID=$(
    sbatch \
    --parsable \
    --array 1-${NUM_JOBS}%${MAX_PARALLEL_JOBS} \
    --export ARGS_PREFIX=$ARGS_PREFIX \
    prods.sbatch
)

# Poll until no task of this array is left in the queue, then show the accounting summary
while [ -n "$(squeue | grep ${JOBID})" ]; do sleep 5; done
sacct --job $JOBID --format jobid%-25,State,ExitCode,NodeList
echo

for f in ${JOBID}_*.out; do
    echo $f
    cat $f
    echo
done

                    JobID      State ExitCode        NodeList 
------------------------- ---------- -------- --------------- 
125540_3                   COMPLETED      0:0        belegost 
125540_3.batch             COMPLETED      0:0        belegost 
125540_1                   COMPLETED      0:0           smaug 
125540_1.batch             COMPLETED      0:0           smaug 
125540_2                   COMPLETED      0:0           smaug 
125540_2.batch             COMPLETED      0:0           smaug 

125540_1.out
This is job 125541_1 on smaug
{
    "N": 10
}
3628800

125540_2.out
This is job 125542_2 on smaug
{
    "N": 10,
    "k": 3
}
22400

125540_3.out
This is job 125540_3 on belegost
{
    "N": 15,
    "k": 7,
    "square": true
}
178052087955456000000
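The three results above can be cross-checked without SLURM. The sketch below recomputes them with an independent helper (`expected_product`, a hypothetical name) and compares against factorial identities; it assumes Python 3.8 or later for `math.prod`.

```python
import math


def expected_product(config):
    """Multiply i (or i**2) for i in 1..N, skipping multiples of k if present."""
    k = config.get("k")
    square = config.get("square", False)
    return math.prod(
        i**2 if square else i
        for i in range(1, config["N"] + 1)
        if k is None or i % k != 0
    )


# Without k or square the result is simply N!
assert expected_product({"N": 10}) == math.factorial(10) == 3628800
assert expected_product({"N": 10, "k": 3}) == 22400
# With k = 7 and N = 15 only 7 and 14 are skipped, so the result is (15! / 98)**2
big = expected_product({"N": 15, "k": 7, "square": True})
assert big == (math.factorial(15) // 98) ** 2
print(big)
```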

