
Running example Snakemake pipeline with GCP life sciences pipeline

Wendy Wong edited this page Sep 2, 2020 · 6 revisions

There are several steps. You will need a service account with the right permissions, a compute node that runs the Snakemake workflow manager, and your input data in a GCP bucket.

It follows the steps described here: https://snakemake.readthedocs.io/en/stable/executor_tutorial/google_lifesciences.html

Service account

Create a service account with the following roles:

  • Compute Network Viewer
  • Firebase Develop Admin
  • Service Account User
  • Cloud Life Sciences Workflows Runner
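The service account can also be created from the command line. The following is a minimal sketch, not a tested recipe: the function name `create_snakemake_sa` and the display name are made up for illustration, and the role IDs are the ones I believe correspond to the four role titles above, so verify them in the IAM console before relying on them.

```shell
# Sketch: create a service account and grant it the four roles listed above.
# Assumption: these role IDs match the role titles in the list; verify them
# in the IAM console. The function name and display name are hypothetical.
create_snakemake_sa() {
    local project="$1" sa_name="$2"
    local sa_email="${sa_name}@${project}.iam.gserviceaccount.com"
    local role

    # Create the service account itself.
    gcloud iam service-accounts create "$sa_name" \
        --project="$project" \
        --display-name="Snakemake Life Sciences runner" || return 1

    # Bind each role to the new account at the project level.
    for role in \
        roles/compute.networkViewer \
        roles/firebase.developAdmin \
        roles/iam.serviceAccountUser \
        roles/lifesciences.workflowsRunner
    do
        gcloud projects add-iam-policy-binding "$project" \
            --member="serviceAccount:${sa_email}" \
            --role="$role" || return 1
    done
}
```

It would be called as `create_snakemake_sa <your-project-id> <service-account-name>`.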

Create a minimal VM instance, use Google Cloud Shell, or use your own computer

It might be easier to use a VM; we have already created one with the proper setup for testing. Google Cloud Shell is also convenient, but it only has 5GB of storage and times out automatically. If you are using your own computer, you will have to install the gcloud CLI. Whichever node you choose needs to stay on while the workflow runs, because it acts as the workflow manager. If you want to run both the Snakemake workflow manager and its jobs on GCP, you can do so via Kubernetes on GCP (without using the Google Life Sciences pipeline).

You can create a GCP VM in the console or with the gcloud CLI. This example creates a small 1-core VM with a 10GB (default) boot disk running Ubuntu 20.04 LTS:

gcloud beta compute --project=<your-project-id> instances create <name-of-the-instance> --zone=us-east4-c --machine-type=n1-standard-1  --image=ubuntu-2004-focal-v20200810 --image-project=ubuntu-os-cloud 

On the compute node of your choice, install Snakemake via conda:

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
. ~/.bashrc
conda install -c conda-forge mamba
mamba create -c conda-forge -c bioconda -n snakemake snakemake
conda activate snakemake

Set up the credentials

You might need to run gcloud init if you are using your own computer. Download the key of the service account you created earlier and store it in a safe location on your compute node. Then set the following environment variables:

export GOOGLE_APPLICATION_CREDENTIALS=<path-to-service-account-key>.json
export GOOGLE_CLOUD_PROJECT=<gcp-project-ID>
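Before launching the workflow, it can help to confirm that both variables are set and that the key file is readable. The following is a minimal sketch; the function name `check_gcp_env` is hypothetical, and it only checks the local environment without contacting GCP.

```shell
# Sketch: pre-flight check for the two variables exported above.
# The function name is hypothetical; nothing here talks to GCP.
check_gcp_env() {
    if [ -z "${GOOGLE_APPLICATION_CREDENTIALS:-}" ]; then
        echo "GOOGLE_APPLICATION_CREDENTIALS is not set" >&2
        return 1
    fi
    if [ ! -r "$GOOGLE_APPLICATION_CREDENTIALS" ]; then
        echo "Key file not readable: $GOOGLE_APPLICATION_CREDENTIALS" >&2
        return 1
    fi
    if [ -z "${GOOGLE_CLOUD_PROJECT:-}" ]; then
        echo "GOOGLE_CLOUD_PROJECT is not set" >&2
        return 1
    fi
    echo "Credentials and project look set for ${GOOGLE_CLOUD_PROJECT}"
}
```

Running `check_gcp_env && snakemake ...` would then stop early with a message instead of failing later inside the workflow.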

Download the test data and store it in a GCP bucket

git clone https://github.com/snakemake/snakemake-tutorial-data
cd snakemake-tutorial-data/
gsutil rsync -r data gs://<your-gcp-bucket>/snakemake-testing-data
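Note that gsutil addresses buckets by gs:// URL, while the snakemake commands on this page pass the bucket name to --default-remote-prefix without a scheme. The following sketch shows two small helpers for converting between the two forms; both function names are hypothetical.

```shell
# Sketch: convert between a bare bucket path and a gs:// URL.
# Both helper names are hypothetical, introduced only for illustration.

# Add the gs:// scheme unless it is already present (form used by gsutil).
to_gs_url() {
    case "$1" in
        gs://*) printf '%s\n' "$1" ;;
        *)      printf 'gs://%s\n' "$1" ;;
    esac
}

# Drop a leading gs:// if present (form used for --default-remote-prefix here).
strip_gs_scheme() {
    printf '%s\n' "${1#gs://}"
}
```

For example, `gsutil ls -r "$(to_gs_url <your-gcp-bucket>/snakemake-testing-data)"` should list the uploaded files if the sync succeeded.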

Run the pipeline

Run it with a conda environment:

snakemake --google-lifesciences --default-remote-prefix <your-gcp-bucket>/snakemake-testing-data --use-conda --google-lifesciences-region us-east1

It also works if you specify a container image for a rule. For example, change

rule samtools_index:
    input:
        "sorted_reads/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam.bai"
    conda:
        "environment.yaml"
    shell:
        "samtools index {input}"

to

rule samtools_index:
    input:
        "sorted_reads/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam.bai"
    container:
        "docker://biocontainers/samtools:v1.9-4-deb_cv1"
    shell:
        "samtools index {input}"

We can then run the pipeline with the --use-singularity option:

snakemake --google-lifesciences --default-remote-prefix <your-gcp-bucket>/snakemake-testing-data --use-conda --use-singularity --google-lifesciences-region us-east1

Watch the progress

You can see your jobs running on the GCP Life Sciences pipeline by listing its operations:

gcloud beta lifesciences operations list

You can watch the progress of a single job by describing its operation:

gcloud beta lifesciences operations describe projects/<your-project-id>/locations/us/operations/<pipeline-run-id>
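Instead of re-running the describe command by hand, you could poll it in a loop. The following is a minimal sketch under a few assumptions: the function name `wait_for_operation` is hypothetical, and it relies on the operation resource exposing a `done` field that gcloud prints as `True` once the operation has finished.

```shell
# Sketch: poll a Life Sciences operation until it reports done.
# Assumptions: the function name is hypothetical, and the operation's
# "done" field is printed as True (or true) when the run has finished.
wait_for_operation() {
    local op_name="$1" interval="${2:-30}" status
    while :; do
        # Ask only for the "done" field; it is empty while the job runs.
        status=$(gcloud beta lifesciences operations describe "$op_name" \
            --format="value(done)") || return 1
        case "$status" in
            True|true)
                echo "Operation finished: $op_name"
                return 0
                ;;
        esac
        echo "Still running: $op_name"
        sleep "$interval"
    done
}
```

It would be called as `wait_for_operation projects/<your-project-id>/locations/us/operations/<pipeline-run-id> 60`. A finished operation has not necessarily succeeded, so it is still worth running describe once more and checking whether the output contains an error.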

To be tested

  • Right now the final result, i.e. the input of rule all (plots/quals.svg in our case), is downloaded to the compute node where Snakemake runs. It is not yet clear whether it can stay in the Google bucket instead.