Running an example Snakemake pipeline with the GCP Life Sciences pipeline
There are several prerequisites: you will need a service account with the right permissions, a compute node that runs the Snakemake workflow manager, and your input data in a GCP bucket.
This guide follows the steps described here: https://snakemake.readthedocs.io/en/stable/executor_tutorial/google_lifesciences.html
Create a service account with the following roles:
- Compute Network Viewer
- Firebase Develop Admin
- Service Account User
- Cloud Life Sciences Workflows Runner
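The role assignment above could be scripted roughly as follows. This is only a sketch, not the procedure used for this guide: the account name `snakemake-runner` is hypothetical, and the role IDs are assumed equivalents of the console role names listed above, so check them with `gcloud iam roles list` before relying on them. The function only prints the commands so that they can be reviewed first.

```shell
# Print (do not run) the gcloud commands that create a service account and
# bind the four roles listed above.
# Assumptions: "snakemake-runner" is a hypothetical default account name;
# the role IDs are assumed mappings of the console role names.
print_sa_setup() {
    sa_project="$1"                       # GCP project ID
    sa_name="${2:-snakemake-runner}"      # service account name (hypothetical)
    sa_email="${sa_name}@${sa_project}.iam.gserviceaccount.com"

    echo "gcloud iam service-accounts create ${sa_name} --project=${sa_project}"
    for sa_role in \
        roles/compute.networkViewer \
        roles/firebase.developAdmin \
        roles/iam.serviceAccountUser \
        roles/lifesciences.workflowsRunner
    do
        echo "gcloud projects add-iam-policy-binding ${sa_project} --member=serviceAccount:${sa_email} --role=${sa_role}"
    done
}
```

Calling `print_sa_setup <your-project-id>` prints five commands, one to create the account and one per role; they can be pasted into a shell once they look right.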
It might be easier to use a VM; we have already created one with the proper setup for testing. Google Cloud Shell is also convenient, but it only has 5GB of storage and times out automatically. If you are using your own computer, you will have to install the gcloud CLI. This node needs to stay on because it acts as the workflow manager. If you want to run both the Snakemake workflow manager and its jobs on GCP, you can do so via Kubernetes on GCP (instead of using the Google Life Sciences pipeline).
You can create a GCP VM in the console or with the gcloud CLI. This example creates a small 1-core VM with a 10GB (default) boot disk running Ubuntu 20.04 LTS:
gcloud beta compute --project=<your-project-id> instances create <name-of-the-instance> --zone=us-east4-c --machine-type=n1-standard-1 --image=ubuntu-2004-focal-v20200810 --image-project=ubuntu-os-cloud
On the compute node of your choice, install Snakemake via conda:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
. ~/.bashrc
conda install -c conda-forge mamba
mamba create -c conda-forge -c bioconda -n snakemake snakemake
conda activate snakemake
You might need to run gcloud init if you are using your own computer.
Download the key of the service account you created earlier and store it in a safe location on your compute node.
Set the following environment variables:
export GOOGLE_APPLICATION_CREDENTIALS=<path-to-service-account-key>.json
export GOOGLE_CLOUD_PROJECT=<gcp-project-ID>
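Before launching the workflow it can help to confirm that both variables are set and that the key file is readable. The helper below is a small sketch of such a check; the function name `check_gcp_env` is made up for illustration and is not part of Snakemake or gcloud.

```shell
# Verify that the two variables exported above are set and that the key
# file they point to can be read. Returns non-zero on the first problem.
# check_gcp_env is a hypothetical helper name.
check_gcp_env() {
    if [ -z "${GOOGLE_CLOUD_PROJECT:-}" ]; then
        echo "GOOGLE_CLOUD_PROJECT is not set" >&2
        return 1
    fi
    if [ -z "${GOOGLE_APPLICATION_CREDENTIALS:-}" ]; then
        echo "GOOGLE_APPLICATION_CREDENTIALS is not set" >&2
        return 1
    fi
    if [ ! -r "${GOOGLE_APPLICATION_CREDENTIALS}" ]; then
        echo "key file not readable: ${GOOGLE_APPLICATION_CREDENTIALS}" >&2
        return 1
    fi
    echo "environment looks OK for project ${GOOGLE_CLOUD_PROJECT}"
}
```

Running `check_gcp_env && echo ready` in the shell where the exports were made gives a quick yes/no answer. It checks only that the file exists and is readable, not that the key is valid.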
Clone the tutorial data and copy it to your GCP bucket:
git clone https://github.com/snakemake/snakemake-tutorial-data
cd snakemake-tutorial-data/
gsutil rsync -r data gs://<your-gcp-bucket>/snakemake-testing-data
Run the workflow with conda environments:
snakemake --google-lifesciences --default-remote-prefix <your-gcp-bucket>/snakemake-testing-data --use-conda --google-lifesciences-region us-east1
It also works if you specify a container image for a rule. For example, change
rule samtools_index:
    input:
        "sorted_reads/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam.bai"
    conda:
        "environment.yaml"
    shell:
        "samtools index {input}"
to
rule samtools_index:
    input:
        "sorted_reads/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam.bai"
    container:
        "docker://biocontainers/samtools:v1.9-4-deb_cv1"
    shell:
        "samtools index {input}"
Then run the pipeline with the --use-singularity option added:
snakemake --google-lifesciences --default-remote-prefix <your-gcp-bucket>/snakemake-testing-data --use-conda --use-singularity --google-lifesciences-region us-east1
You can see your jobs running on the GCP Life Sciences pipeline by listing its operations through the API:
gcloud beta lifesciences operations list
and you can check the progress of a single run with:
gcloud beta lifesciences operations describe projects/<your-project-id>/locations/us/operations/<pipeline-run-id>
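To avoid re-running the describe command by hand, it could be wrapped in a polling loop such as the sketch below. It assumes that `--format='value(done)'` prints `True` once the operation has finished and nothing before that; this has not been verified here, so check it against your gcloud version. The function name and the 30-second default interval are arbitrary choices.

```shell
# Poll a Life Sciences operation until it reports done.
# Assumption: value(done) yields "True" for a finished operation and an
# empty string while it is still running.
wait_for_operation() {
    op_name="$1"            # projects/<id>/locations/us/operations/<run-id>
    op_interval="${2:-30}"  # seconds to wait between polls
    while :; do
        op_done="$(gcloud beta lifesciences operations describe "$op_name" --format='value(done)')" || return 1
        if [ "$op_done" = "True" ]; then
            echo "operation finished: $op_name"
            return 0
        fi
        sleep "$op_interval"
    done
}
```

For example, `wait_for_operation projects/<your-project-id>/locations/us/operations/<pipeline-run-id> 60` checks once a minute. The loop only reports that the operation ended, not whether it succeeded, so the full describe output is still worth reading afterwards.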
- Currently the final result, i.e. the input of rule all (plots/quals.svg in our case), is downloaded to the compute node where Snakemake runs. It is not clear whether it can stay in the Google bucket instead.