Trigger the Google Genomics Pipeline API with CWL




Goal and Disclaimer

The goal of this project is to accept a CWL description of a workflow and use it to trigger jobs on the Google Genomics Pipeline API (GP). Beyond that, it tries to run jobs that do not depend on each other's output in parallel, to make the most of the available resources.
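To illustrate the idea of running non-dependent jobs in parallel, here is a minimal sketch in Python 3. It is not Funnel's actual scheduler: the function names `dependency_levels` and `run_in_levels`, and the step names in the usage note, are hypothetical. It groups steps into batches so that every step in a batch depends only on earlier batches, then runs each batch concurrently.

```python
from concurrent.futures import ThreadPoolExecutor


def dependency_levels(depends_on):
    """Group steps into batches; each step needs only steps from earlier batches.

    depends_on maps a step name to the names of the steps whose output it needs.
    """
    remaining = {step: set(deps) for step, deps in depends_on.items()}
    done = set()
    levels = []
    while remaining:
        # A step is ready once everything it depends on has finished.
        ready = sorted(step for step, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError("cyclic or unsatisfiable dependencies")
        levels.append(ready)
        done.update(ready)
        for step in ready:
            del remaining[step]
    return levels


def run_in_levels(depends_on, run_step, max_workers=4):
    """Run each batch concurrently, finishing it before starting the next."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for level in dependency_levels(depends_on):
            for step, result in zip(level, pool.map(run_step, level)):
                results[step] = result
    return results
```

For a workflow shaped like the hashsplitter test (three independent hashing steps feeding one merge step), `dependency_levels({"hash_a": [], "hash_b": [], "hash_c": [], "unify": ["hash_a", "hash_b", "hash_c"]})` puts the three hashing steps in the first batch and `unify` in the second.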

The system works for the simple test case included (test/hashsplitter-workflow.cwl), which reads an input file, hashes it three different ways, then merges those outputs into a single file. For now, inputs must be supplied from a Google bucket (GB) and output goes to a GB, so local inputs and outputs must be uploaded and downloaded manually.
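As a rough picture of what the hashsplitter test case computes, the following Python 3 sketch hashes the same bytes three ways concurrently and merges the results into one text blob. The choice of algorithms and the output layout are assumptions made for illustration; the actual workflow defines its own tools in the CWL file.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

# Assumed algorithms; the real workflow may use different ones.
ALGORITHMS = ("md5", "sha1", "sha256")


def hash_with(algorithm, data):
    """Return one 'algorithm  hexdigest' line for the given bytes."""
    return "{}  {}".format(algorithm, hashlib.new(algorithm, data).hexdigest())


def hashsplit(data):
    """Hash the same input three ways in parallel, then merge into one string."""
    with ThreadPoolExecutor(max_workers=len(ALGORITHMS)) as pool:
        lines = list(pool.map(lambda name: hash_with(name, data), ALGORITHMS))
    return "\n".join(lines) + "\n"
```

The three hashing calls do not depend on each other, so they can run side by side; only the final merge has to wait for all of them.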


A couple of things need to be installed for Funnel to run.

First, the latest version of cwl-runner:

sudo apt-get install greadlink
sudo pip install --upgrade cwl-runner

Then, the google api python client:

sudo pip install google-api-python-client

Then you need to set up gcloud:

export CLOUD_SDK_REPO="cloud-sdk-$(lsb_release -c -s)"
echo "deb http://packages.cloud.google.com/apt $CLOUD_SDK_REPO main" | sudo tee /etc/apt/sources.list.d/google-cloud-sdk.list
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
sudo apt-get update && sudo apt-get install google-cloud-sdk
sudo pip install -U Sphinx
sudo pip install gcloud

Add some auxiliary libraries:

sudo pip install httplib2shim

Now you should be ready to go!


If you just want to run some jobs locally and take advantage of the parallel execution of non-dependent tasks, it works much like the cwltool it is built on:

python -m funnel.main test/hashsplitter-workflow.cwl --input

If you want to run on GP, you must have a GP-enabled account and enough information to fill out gce_config.yaml:

project-id: machine-generated-837
bucket: your-bucket
output-file: path/to/where/you/want/google/pipeline/to/put/your/output
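Before launching a run, it can help to check that the config contains the three keys shown above. The sketch below is a hypothetical helper in Python 3, not part of Funnel: it reads only flat `key: value` lines, which is enough for this file's shape. For general YAML, a real parser such as PyYAML's `yaml.safe_load` would be the better choice.

```python
# Keys taken from the example config above.
REQUIRED_KEYS = ("project-id", "bucket", "output-file")


def parse_flat_config(text):
    """Parse flat 'key: value' lines; blank lines and # comments are skipped."""
    config = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError("expected 'key: value', got: " + raw)
        config[key.strip()] = value.strip()
    return config


def missing_keys(config):
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not config.get(key)]
```

Calling `missing_keys(parse_flat_config(open("gce_config.yaml").read()))` returns an empty list when all three keys are filled in.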

Once this is supplied, run the same command as before with the --gce argument pointing to the GP config, which switches execution to GP. In addition, every input must be given as a GB address:

python -m funnel.main --gce gce_config.yaml test/hashsplitter-workflow.cwl --input gs://hashsplitter/input/

If your input is actually in that bucket and your GP config is correct, this writes its output to the bucket location gs://hashsplitter/output/unify, along with any intermediate output produced by earlier steps.
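Since GP runs take their input as GB addresses, a small check that an argument really is a gs:// address can catch a local path passed by mistake. This is a sketch in Python 3 using only the standard library; `split_gs_address` is a hypothetical helper, not something Funnel is known to provide.

```python
from urllib.parse import urlparse


def split_gs_address(address):
    """Split 'gs://bucket/path' into (bucket, path); reject anything else."""
    parsed = urlparse(address)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError("not a gs:// address: " + address)
    # The bucket is the host part; the object path follows the first slash.
    return parsed.netloc, parsed.path.lstrip("/")
```

With the addresses used above, `split_gs_address("gs://hashsplitter/input/")` gives `("hashsplitter", "input/")`, while a local path such as `test/input.txt` raises `ValueError`.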