
Writing your own workflows

dxenes1 edited this page Aug 15, 2019 · 2 revisions

Introduction

This guide serves as a tutorial to writing your own workflows. It is very similar to the CWL tutorials, but it covers the extra features of SABER and discusses the features of CWL that SABER does not yet support.

CWL, or Common Workflow Language, is, as the name suggests, a language for writing workflows. A CWL instance consists of two files -- the workflow file (usually ending in .cwl) and a job file (usually ending in .yml). Although the workflow file's extension is .cwl, its format is standard YAML. SABER also accepts a parameter sweep file, which also ends in .yml. Further information on this file can be found on the parameter sweeps page.
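For reference, a job file is plain YAML that assigns a value to each workflow input. A minimal sketch is below; the input names and values here are hypothetical, not taken from a real SABER workflow:

```yaml
# Hypothetical job file (job.yml) -- names and values are illustrative only
classifier:
    class: File
    path: ./classifier.ilp
membrane_classify_output_name: membrane_probability_map.npy
detect_threshold: 0.5
```

File-typed inputs are given as a mapping with `class: File` and a `path`; scalar inputs (strings, numbers) are given directly.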

Creating tools

The workflow file usually has multiple steps, each of which has a tool CWL file. It is usually easier to explain an example than an abstract definition, so below is the tool file for membrane classification in the Xbrain pipeline.

cwlVersion: v1.0
class: CommandLineTool
hints:
    DockerRequirement:
        dockerPull: aplbrain/xbrain:latest
    ResourceRequirement:
        ramMin: 4000
        coresMin: 2
baseCommand: process-xbrain.py
arguments: ["classify"]
inputs:
    input:
        type: File
        inputBinding:
            position: 1
            prefix: -i
    output_name:
        type: string
        inputBinding:
            position: 2
            prefix: -o
    classifier:
        type: File
        inputBinding:
            position: 3
            prefix: -c
    ram_amount:
        type: int?
        inputBinding:
            position: 4
            prefix: --ram
    num_threads:
        type: int?
        inputBinding:
            position: 5
            prefix: --threads
outputs:
    membrane_probability_map:
        type: File
        outputBinding:
            glob: $(inputs.output_name)

cwlVersion and class should always remain the same across your tools. CWL does support other classes of tools, but SABER does not.

hints describes various suggestions for how to run the tool. The most important one is

DockerRequirement: 
    dockerPull: imagename

which specifies which Docker image to use to run the tool. Note that this Docker image must either be pre-built locally or be pullable (i.e., you can specify images on Docker Hub). You should NOT specify the SABER-built images on AWS ECR, as updates to local images will not be uploaded.

The resource requirements are also important for keeping costs low and runtimes optimized. It is highly suggested you keep these as low as possible based on the job.

ResourceRequirement:
        ramMin: 4000
        coresMin: 2

ramMin and coresMin map to memory and vCPUs (respectively) directly in the AWS job definition.

baseCommand is the main executable of the program. If your tool is an executable script (i.e., it can be run with ./baseCommand), this will simply be the filename of the script. If it is a non-executable Python script, this might be python. A good way to test whether this will work is to run docker run DockerImage baseCommand. If this results in a "file not found" error or a permissions error, you will need to change the base command.

arguments are the arguments for the command that do not depend on the parameters of the job. For example, if your tool is a non-executable Python script, this will be the script name. Note that this is a list.
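For example, if the tool above were a non-executable Python script, the command portion of the tool file might look like this (a sketch, assuming the script lives on the container's PATH):

```yaml
# Hypothetical: run a non-executable Python script inside the container
baseCommand: python
arguments: ["process-xbrain.py", "classify"]
```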

inputs specify the inputs of the tool. These are also referred to as parameters of the tool. The format of an input is described below

input:
    type: File
    inputBinding:
        position: 1
        prefix: -i

input: the name of the input. This is not especially important, but needs to be unique across the tool definition. It should also be relatively descriptive.

type: This is the type of the input. SABER supports the following types: null, boolean, int, long, float, string, and File, which are all fairly self-explanatory. A trailing ? (e.g. int?) marks the input as optional. For other datatypes, specify the type as string and make sure your tool can handle it as a string.

inputBinding specifies how this input is presented to the tool. position indicates the position of this input relative to the other inputs. prefix is the prefix given to the input.
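Putting this together, for the tool file above the runner would assemble a command line along these lines (the filenames here are hypothetical; arguments without a position come before positioned inputs):

```
process-xbrain.py classify -i /path/to/input.npy -o membrane_map.npy -c /path/to/classifier.ilp --ram 4000 --threads 2
```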

outputs describes the outputs of the tool. Its structure is similar to inputs.

outputs:
    membrane_probability_map:
        type: File
        outputBinding:
            glob: $(inputs.output_name)

type should always be File. SABER does not support other output types.

outputBinding describes how the output is captured from the tool. glob here refers to filename globbing. The format specification essentially allows you to link output names with input parameters. In this example, $(inputs.output_name) finds the input parameter output_name and substitutes its value here. If you know the name of the output file, you can just include it as a string here.
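If the output filename is fixed rather than parameterized, the binding can simply use a literal pattern. A sketch (the output and file names are hypothetical):

```yaml
outputs:
    results:
        type: File
        outputBinding:
            glob: results.npy
```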

If your tool has no outputs, you can leave outputs as an empty list, i.e.,

outputs: []

Creating workflows

Workflows can be created by linking inputs and outputs of tools in a workflow CWL file. Again, it is probably easier to show by example. The following workflow is from the Xbrain pipeline, with a few parameters omitted for readability. The entire workflow file can be found in this repository under saber/xbrain/jobs/job1/xbrain.cwl

cwlVersion: v1.0
class: Workflow
inputs:
    _saber_bucket: string
    
    classifier: File
    membrane_classify_output_name: string
    cell_detect_output_name: string
    vessel_segment_output_name: string

    detect_threshold: float?
    stop: float?
    initial_template_size: int?
    detect_dilation: int?
    max_cells: int?
    segment_threshold: float?
    segment_dilation: int?
    minimum: int?

    map_output_name: string
    list_output_name: string
    centroid_volume_output_name: string
outputs:
    pull_output:
        type: File
        outputSource: boss_pull/pull_output
    membrane_classify_output:
        type: File
        outputSource: membrane_classify/membrane_probability_map
    cell_detect_output:
        type: File
        outputSource: cell_detect/cell_detect_results
    vessel_segment_output:
        type: File
        outputSource: vessel_segment/vessel_segment_results

steps:
    membrane_classify:
        run: ../../../../saber/xbrain/tools/membrane_classify_nos3.cwl
        in:
            input: volume.npy
            output_name: membrane_classify_output_name
            classifier: classifier
        out: [membrane_probability_map]
        hints:
            saber:
                score_format: '{} Average OOB: {score}'
    cell_detect:
        run: ../../../../saber/xbrain/tools/cell_detect_nos3.cwl
        in:
            input: membrane_classify/membrane_probability_map
            output_name: cell_detect_output_name
            classifier: classifier
            threshold: detect_threshold
            stop: stop
            initial_template_size: initial_template_size
            dilation: detect_dilation
            max_cells: max_cells
        out: [cell_detect_results]
        hints:
            saber:
                score_format: 'Iteration remaining = {} Correlation = [[{score}]]'
    vessel_segment:
        run: ../../../../saber/xbrain/tools/vessel_segment_nos3.cwl
        in:
            input: membrane_classify/membrane_probability_map
            output_name: vessel_segment_output_name
            classifier: classifier
            threshold: segment_threshold
            dilation: segment_dilation
            minimum: minimum
        out: [vessel_segment_results]
    cell_split:
        run: ../../../../saber/xbrain/tools/cell_split.cwl
        in:
            input: cell_detect/cell_detect_results
            map_output_name: map_output_name
            list_output_name: list_output_name
            centroid_volume_output_name: centroid_volume_output_name
        out:
            [cell_map, cell_list, centroid_volume]
        hints:
            saber:
                local: True

The workflow definition is fairly self-explanatory.

inputs describes the inputs to the workflow. Each description is of the form name: type, where type is one of the types described above.

Similarly, outputs describes the outputs of the workflow. Along with type, each output specifies the source of the output from the steps. Each outputSource is of the form step_name/output_name.

steps describes the steps of the workflow. The name of each step must be unique, but the tools do not have to be unique, i.e., you can use the same tool multiple times, as long as the steps are named differently in this file.

Each step must have the following properties: run, in and out. Optionally, it can have a hints property.

run gives the relative path to the tool CWL.

in specifies which workflow inputs map to which tool inputs with the form tool_input_name: workflow_input_name. You can also reference an output of another tool with the form tool_input_name: step_name/output_name. Doing so will set the execution order of the workflow so that steps will not run until all their inputs are available.

out specifies which outputs of each tool are used in the workflow. This is just a list of names, but they must match the output names in the tool CWL file.

hints specifies hints on how to run the tool. In the SABER case, it can specify the score format and whether or not to run the tool locally.

That's really all there is to it!

Other features of CWL are not supported, but may work.

Local Execution

Using Airflow, jobs can be executed in the cloud via AWS Batch, or using local resources. Local execution is useful for small jobs, and currently necessary for jobs requiring GPU support.

To execute locally, be sure to include the local hint in your workflow:

      hints:
            saber:
                local: True

Also be sure, after starting Airflow, to add a local execution pool. In Airflow, go to Admin->Pools and add a pool called 'Local'.

If you'd like to run a workflow without using S3 as an intermediate storage service, add the following line to the CWL workflow under the class definition:

doc: local

This version of local execution requires all tool images to be built before running.
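A sketch of where the line goes, following the workflow header shown earlier:

```yaml
cwlVersion: v1.0
class: Workflow
doc: local
# ... rest of the workflow (inputs, outputs, steps) unchanged
```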

Testing Your CWL workflows

When developing new CWL workflows, you can use the cwltool reference runner to test code before deploying with SABER. The CWL team maintains an open-source cwltool repository (https://github.com/common-workflow-language/cwltool) that allows you to test your Docker files and CWL files locally before scalable deployment.

This tool requires a local installation of Docker, Python, and pip. The repository's installation instructions contain an explanation of how to set up a virtual environment for this Python configuration to prevent conflicts with other Python libraries.

Once this is set up, the installation is simply pip install cwlref-runner. Then you can test your CWL tools, using your local hardware, from the command line. For example:

$ cwl-runner ../saber/xbrain/workflows/xbrain_unets_train.cwl ../saber/xbrain/jobs/unet_train_job/xbrain_unets_ex_job.yml

You will then see your workflow executed step by step, with the output stored in the present working directory.