A modular, high-performance computing solution to run jobs using SLURM
Canine operates by running jobs on a SLURM cluster. It is designed to take a bash or WDL script and schedule jobs using data from a FireCloud workspace or manually provided inputs. API usage is documented at the bottom of this section.
Canine may be used in any of the following ways:
- Running a pipeline YAML file, e.g.:
  $ canine examples/example_pipeline.yaml
- Running a pipeline defined on the command line, e.g.:
  $ canine --backend type:TransientGCP --backend name:my-cluster (etc...)
- Building and running a pipeline in Python, e.g.:
  >>> canine.Orchestrator(pipeline_dict).run_pipeline()
- Using the Canine API to execute custom workflows in Slurm which could not be configured as a pipeline object
Canine can be natively configured to suit a vast range of setups. Canine is modularized into three main components which can be mixed and matched as needed: Adapters, Backends, and Localizers. A pipeline specifies which Adapter, Backend, and Localizer to use, along with any configuration options for each.
The pipeline adapter is responsible for converting the provided list of inputs into an input specification for each job.
This is a list of available adapters. For more details, see pipeline_options.md
- Manual: (Default) This is the primary input adapter responsible for determining the number of jobs and the inputs for each job, based on the raw inputs provided by the user.
  - Inputs which have a single constant value will have the same value for all jobs
  - Inputs which have a 1D list of values will have one of those values in each job. By default, all list inputs must have the same length, and there will be one job per element. The nth job will have the nth value of each input
  - There are extra configuration options which can change how inputs are combined or how lists are interpreted
- FireCloud/Terra: Choose this adapter if you are using data hosted in a FireCloud or Terra workspace. Your inputs will be interpreted as entity expressions, similar to how FireCloud and Terra workflows interpret inputs. This adapter can also be configured to post results back to your workspace, if you choose. Warning: Reading from Workspace buckets is convenient, but you may encounter issues if your Slurm cluster is not logged in using your credentials
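The default rules of the Manual adapter described above can be sketched in a few lines of Python. This is only an illustration of the documented behavior, not Canine's implementation: `expand_inputs` is a hypothetical helper, and the extra configuration options for combining inputs are not modeled.

```python
from typing import Dict, List, Union

RawInputs = Dict[str, Union[str, List[str]]]


def expand_inputs(raw: RawInputs) -> List[Dict[str, str]]:
    """Hypothetical helper: build one input dict per job from raw inputs."""
    # Collect the distinct lengths of every list-valued input.
    lengths = {len(value) for value in raw.values() if isinstance(value, list)}
    if len(lengths) > 1:
        raise ValueError("All list inputs must have the same length")
    # With no list inputs at all, there is a single job.
    n_jobs = lengths.pop() if lengths else 1
    return [
        {
            # The nth job takes the nth list element; constants are repeated.
            name: (value[i] if isinstance(value, list) else value)
            for name, value in raw.items()
        }
        for i in range(n_jobs)
    ]


jobs = expand_inputs({
    "sample": ["a.bam", "b.bam", "c.bam"],
    "reference": "hg38.fa",
})
print(len(jobs))  # 3
print(jobs[1])    # {'sample': 'b.bam', 'reference': 'hg38.fa'}
```

Three list elements produce three jobs, and the constant `reference` value is repeated in each of them.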
The pipeline backend is responsible for interfacing with the Slurm controller. There are many different backends available depending on where SLURM is running (or for creating a Slurm cluster for you).
This is a list of available backends. For more details, see pipeline_options.md
- Local: (Default) Choose this backend if you will be running Canine from the Slurm controller and your cluster is fully configured. This backend will run Slurm commands through the local shell
- Remote: Choose this backend if you have a fully configured Slurm cluster, but you will be running Canine elsewhere. This backend uses SSH and SFTP to interact with the Slurm controller
- GCPTransient: Choose this backend if you do not have a Slurm cluster. This backend will create a cluster to your specifications in Google Cloud and then use SSH and SFTP to interact with the controller. The cluster will be deleted after Canine has finished
- ImageTransient: Choose this backend if you do not have a Slurm cluster, but want more control over its startup than GCPTransient. This backend assumes that the current system has Slurm installed and has an NFS mount set up. It then creates worker nodes from a Google Compute Image that you have set up and configured.
- DockerTransient: Choose this backend if you want the same control as ImageTransient but do not want to set up a Google Compute Image. The Slurm daemons run inside docker containers on the worker nodes. The Slurm controller daemon runs inside a docker container on the local filesystem
- Dummy: Choose this backend for developing or testing pipelines. This backend simulates a Slurm cluster by running the controller and workers as docker containers on the local system. This backend does not provision any cloud resources. It runs entirely through the local docker daemon.
The pipeline localizer is responsible for staging the pipeline on the SLURM controller and for transferring inputs/outputs as needed. There are four different localizers to accommodate different needs.
This is a list of available localizers. For more details, see pipeline_options.md
- Batched: (Default) This localizer is suitable for most situations. It stages the canine pipeline workspace locally in a temporary directory, copying or symlinking local files into it before broadcasting the workspace directory structure over to the Slurm controller. Files stored in Google Cloud Storage are downloaded at the end, directly onto the Slurm controller (using credentials stored on the controller).
- Local: Choose this localizer if you have files in Google Cloud Storage which need to be localized but you are unable to save suitable credentials to the Slurm controller. This is very similar to the Batched localizer, except that Google Cloud Storage files are staged locally and broadcast to the Slurm controller along with the rest of the pipeline files
- Remote: Choose this localizer for small pipelines with few local files. This localizer stages the pipeline directory directly on the Slurm controller using SFTP. It is often less efficient than the bulk directory copy used by the Batched and Local localizers (especially if you provide a transfer_bucket to them) but can outperform other localizers for small pipelines which consist entirely of files from Google Cloud Storage.
- NFS: Choose this localizer if the current system has an active NFS mount to the Slurm controller. The canine pipeline will be staged locally, within the NFS mount point, allowing NFS to take care of transferring the pipeline directory to the controller.
There are a few examples in the examples/ directory which can be run out of the box.
To run one of these pipelines, follow any of the following instructions:
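From the command line:
$ canine examples/example_pipeline.yaml
From Python, passing the path to the pipeline file:
import canine
orchestrator = canine.Orchestrator('examples/example_pipeline.yaml')
results = orchestrator.run_pipeline()
From Python, passing an already-loaded configuration dictionary:
import canine
import yaml
with open('examples/example_pipeline.yaml') as r:
    # safe_load parses plain YAML; yaml.load without a Loader fails on recent PyYAML
    config = yaml.safe_load(r)
orchestrator = canine.Orchestrator(config)
results = orchestrator.run_pipeline()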
Hopefully you've run an example or two and have a better understanding of what a pipeline looks like. This section describes the parts of a pipeline configuration not already covered.
Inputs describe both the number of jobs and the inputs to each job.
The inputs section of the pipeline should be a dictionary.
Each key is a string naming an input, and each value is either a string or a list of strings.
As described above, the adapter is responsible for parsing the raw, user-provided inputs into the set of inputs for each job that will be run.
- Raw inputs which are lists of 2 or more dimensions are interpreted by the adapter as if the user wished to provide one of the nested lists to each job. The array is flattened to 2 dimensions and interpreted as if it were a regular list input (with one element passed to each job). The contents of these arrays are handled using the default localization rules
- Raw inputs which are lists of any dimensions, but marked as common in the overrides, are flattened to 1 dimension, and the whole list is provided as an input to each job. The contents of the array are handled as common files (see below)
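The two flattening behaviors above can be illustrated as follows. This is a sketch of one reading of the rules, not Canine's own code: `flatten` and `per_job_lists` are hypothetical names, and "flattened to 2 dimensions" is taken to mean that each top-level element becomes one flat list.

```python
from typing import Any, List


def flatten(value: Any) -> List[Any]:
    """Recursively flatten nested lists into a single 1D list."""
    if not isinstance(value, list):
        return [value]
    flat: List[Any] = []
    for item in value:
        flat.extend(flatten(item))
    return flat


def per_job_lists(raw: List[Any]) -> List[List[Any]]:
    """Flatten to 2 dimensions: one flat inner list per job."""
    return [flatten(item) for item in raw]


raw = [["a1", ["a2", "a3"]], ["b1", "b2"]]
print(per_job_lists(raw))  # [['a1', 'a2', 'a3'], ['b1', 'b2']]
print(flatten(raw))        # ['a1', 'a2', 'a3', 'b1', 'b2']
```

The first result corresponds to a regular nested input (two jobs, each receiving one list); the second corresponds to an input marked as common, where every job receives the whole flattened list.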
The pipeline script is the heart of the pipeline. This is the actual bash script which will be run. The script key can either be a filepath to a bash script to run, or a list of strings, each of which is a command to run.
Either way, the script gets executed by each job of the pipeline.
NOTE: During setup, every job will configure a $CANINE_DOCKER_ARGS environment variable. We recommend that you expand this variable inside the argument list to docker run commands to enable the container to properly interact with the canine environment.
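As an illustration of the recommendation above, the following sketch checks a script given as a list of commands and reports any docker run command that does not reference the variable. The helper `lines_missing_docker_args` is hypothetical and not part of Canine, and the image name and commands are placeholders.

```python
from typing import List


def lines_missing_docker_args(script: List[str]) -> List[str]:
    """Return `docker run` commands that never mention CANINE_DOCKER_ARGS."""
    return [
        line for line in script
        # Matching on the bare name covers both $VAR and ${VAR} spellings.
        if "docker run" in line and "CANINE_DOCKER_ARGS" not in line
    ]


script = [
    "docker run --rm $CANINE_DOCKER_ARGS ubuntu:22.04 echo ok",
    "docker run --rm ubuntu:22.04 echo forgot-the-args",
]
print(lines_missing_docker_args(script))
# ['docker run --rm ubuntu:22.04 echo forgot-the-args']
```

The variable is left unquoted in the first command so that the shell can split it into separate arguments when the job runs.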
Localization overrides, defined in localization.overrides, allow the user to change the localizer's default handling for a specific input.
The overrides section should be a dictionary mapping input names to a string describing the desired handling, as follows:
- Default rules (no override):
  - Strings which exist as a local filepath are treated as files and will be localized to the Slurm controller
  - Strings which start with gs:// are interpreted to be files/directories within Google Cloud Storage and will be localized to the Slurm controller
  - Any file or Google Storage object which appears as an input to multiple jobs is considered common and will be localized once to a common directory, visible to all jobs
  - If any input to any arbitrary job is a list, the contents of the list are interpreted using the same rules
- Stream: Inputs marked as Stream will be streamed into a FIFO pipe, and the path to the pipe will be exported to the job. The Stream override is ignored for inputs which are not Google Cloud Storage objects, causing those inputs to be localized under default rules. Jobs which are requeued due to node failure will always restart the stream. Streams are created in a temporary directory on the local disk of the compute node
- Delayed: Inputs marked as Delayed will be downloaded by the job once it starts, instead of upfront during localization. The Delayed override is ignored for inputs which are not Google Cloud Storage objects, causing those inputs to be localized under default rules. Jobs which are requeued due to node failures will only re-download delayed inputs if the job failed before the download completed
- Local: Similar to Delayed. Inputs marked as Local will be downloaded by the job once it starts. The difference between Delayed and Local is that for Local files, a new disk is provisioned and mounted to the worker node and Local downloads are saved there. The disk is automatically sized to fit all files marked as Local plus a small safety margin. Warning: Do not create or unzip files in the local download directory. The local download disks are sized automatically to fit the size of the downloaded files and will likely run out of space if additional files are created or unpacked
- Localize: Inputs marked as Localize will be treated as files and localized to job-specific input directories. This can be used to force files which would be handled as common to be localized for each job. The Localize override is ignored for inputs which are not valid filepaths or Google Cloud Storage objects, causing those inputs to be treated as strings
- Null or None: Inputs marked this way are treated as strings, and no localization will be applied.
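The default rules at the top of this list (no override) can be sketched as a small classifier. This is an illustration of the documented rules only: `classify` and `find_common` are hypothetical names, the bucket paths are placeholders, and none of the overrides are modeled.

```python
import os
from collections import Counter
from typing import Dict, List, Set


def classify(value: str) -> str:
    """Classify one input string under the default rules (no override)."""
    if os.path.exists(value):
        return "local"   # local path: localized to the Slurm controller
    if value.startswith("gs://"):
        return "gcs"     # Google Cloud Storage object or directory
    return "string"      # anything else is passed through unchanged


def find_common(jobs: List[Dict[str, str]]) -> Set[str]:
    """Return files or gs:// objects that are inputs to more than one job."""
    counts: Counter = Counter()
    for job in jobs:
        # Count each value once per job, so repeats within one job do not qualify.
        for value in set(job.values()):
            if classify(value) != "string":
                counts[value] += 1
    return {value for value, n in counts.items() if n > 1}


jobs = [
    {"bam": "gs://bucket/a.bam", "ref": "gs://bucket/ref.fa", "label": "first"},
    {"bam": "gs://bucket/b.bam", "ref": "gs://bucket/ref.fa", "label": "second"},
]
print(find_common(jobs))  # {'gs://bucket/ref.fa'}
```

Only the reference file is shared by both jobs, so it is the only input that would be localized once to the common directory.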
The outputs section defines a mapping of output names to file patterns which should be grabbed for output. File patterns may be raw filenames or globs, and may include shell variables (including job inputs).
These patterns are always relative to each job's initial cwd ($CANINE_JOB_ROOT). Patterns may match files above the workspace directory, but this is not recommended.
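To make the pattern rules concrete, here is a minimal sketch of how a pattern containing a shell variable and a glob could be resolved relative to a job's root directory. It is not how Canine performs delocalization: `match_outputs` is a hypothetical helper, and variable expansion is approximated with `string.Template` rather than a real shell.

```python
import glob
import os
import tempfile
from string import Template
from typing import Dict, List


def match_outputs(job_root: str, patterns: Dict[str, str],
                  env: Dict[str, str]) -> Dict[str, List[str]]:
    """Expand $VARIABLES in each pattern, then glob relative to job_root."""
    matched = {}
    for name, pattern in patterns.items():
        # safe_substitute leaves unknown variables untouched instead of raising.
        expanded = Template(pattern).safe_substitute(env)
        # Escape the root so only the pattern itself is treated as a glob.
        paths = glob.glob(os.path.join(glob.escape(job_root), expanded))
        matched[name] = sorted(os.path.relpath(p, job_root) for p in paths)
    return matched


with tempfile.TemporaryDirectory() as job_root:
    for filename in ("sampleA.vcf", "sampleA.log", "notes.txt"):
        open(os.path.join(job_root, filename), "w").close()
    result = match_outputs(
        job_root,
        {"calls": "$sample.vcf", "sample_files": "${sample}.*"},
        {"sample": "sampleA"},
    )
    print(result)
    # {'calls': ['sampleA.vcf'], 'sample_files': ['sampleA.log', 'sampleA.vcf']}
```

Here `sample` stands in for a job input exported as a shell variable; the first pattern names a single file and the second is a glob matching two files.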
By default, stdout and stderr are included in the outputs, which will grab the job's stdout/err streams.
You may override this behavior by providing your own pattern for stdout or stderr.
Warning: the outputs stdout and stderr have special handling, which expects their patterns to match exactly one file.
If you provide a custom pattern for stdout or stderr and it matches more than one file, the output dataframe will only show the first filename matched.
All files which match a provided output pattern will be delocalized from the Slurm controller back to the current system in the following directory structure:
output_dir/
    {job id}/
        stdout
        stderr
        {other output names}/
            {matched files/directories}
The resources section allows you to define additional arguments to sbatch to control the resource allocation or other scheduling parameters. The resources dictionary is converted to command-line arguments as follows:
- Single-letter keys are converted to short (-x) options.
- Multi-letter keys are converted to long (--xx) options.
- Keys with a value of True are converted to flags (no value)
- Keys with any other value are converted to parameters (--key=val)
- Underscores in keys are converted to hyphens (foo_bar becomes --foo-bar)
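The conversion rules above can be sketched as a short function. This is an illustrative reading of the rules rather than Canine's implementation: `sbatch_args` is a hypothetical name, and the way a short option carries its value (as a separate argument) is an assumption, since the rules only show the long form.

```python
from typing import Any, Dict, List


def sbatch_args(resources: Dict[str, Any]) -> List[str]:
    """Translate a resources dictionary into sbatch command-line arguments."""
    args: List[str] = []
    for key, value in resources.items():
        # Underscores become hyphens; one-letter keys are short options.
        name = key.replace("_", "-")
        option = f"-{name}" if len(name) == 1 else f"--{name}"
        if value is True:
            args.append(option)  # flag with no value
        elif len(name) == 1:
            # Assumption: short options take their value as a separate argument.
            args.extend([option, str(value)])
        else:
            args.append(f"{option}={value}")
    return args


print(sbatch_args({"cpus_per_task": 4, "mem": "8G", "exclusive": True, "p": "debug"}))
# ['--cpus-per-task=4', '--mem=8G', '--exclusive', '-p', 'debug']
```

The keys used in the example correspond to standard sbatch options; the partition name `debug` is a placeholder.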