Skip to content
Permalink
Branch: master
Find file Copy path
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
122 lines (90 sloc) 6.15 KB

Why using snakemake

Snakemake is a python3 based pipeline building tool (a python variant of GNU make) specialized for bioinformatics. I put my notes managing different versions of python here. You can write any python codes inside the Snakefile. Using snakemake is to simplify the tedious pre-processing work for large genomic data sets and for the sake of reproducibility. There are many other tools you can find here for this purpose.

Key features of snakemake

  • Snakemake automatically creates missing directories.

  • wildcards and Input function

To access wildcards in a shell command: {wildcards.sample}

{wildcards} is greedy (.+): {sample}.fastq could be matching sampleA.fastq if there is no sub-folder anymore, but even whateverfolder/sampleA.fastq can be matched as well.

One needs to think snakemake in a bottom-up way: snakemake will first look for the output files, and substitue the file names to the {wildcards}, and look for which rule can be used to creat the output, and then look for input files that are defined by the {wildcards}.

The wildcards is pretty confusing to me. read the posts in the google group by searching wildcards

read threads below: https://groups.google.com/forum/#!searchin/snakemake/glob_wildcards/snakemake/YfHgx6P5se4/CRk-d151GBwJ
https://groups.google.com/forum/#!searchin/snakemake/glob_wildcards/snakemake/FsdT4ioRyNY/LCm6Xj8dIAAJ
https://groups.google.com/forum/#!searchin/snakemake/glob_wildcards/snakemake/JAcOdGgWR_g/1nT9nsNkCgAJ
https://groups.google.com/forum/#!searchin/snakemake/glob_wildcards/snakemake/KorE6c-OZg4/LVO_0jDBHlUJ

quote from the snakemake developer:

The generation of the final target is nearly always the complex part of the workflow. You have to determine, which files exist and somehow tell Snakemake which files you want it to generate. glob_wildcards and expand helps a lot - it was really painful in the versions before it was implemented, but often you have to write some custom code, if it gets complicated

Read the following

snakemake book
flexible bioinformatics pipelines with snakemake
Build bioinformatics pipelines with Snakemake
snakemake ChIP-seq pipeline example
submit all the jobs immediately
cluster job submission wrappers
snakemake-parallel-bwa
RNA-seq snakemake example
functions as inputs and derived parameters
snakemake FAQ
snakemake tutorial from the developer

examples

https://github.com/slowkow/snakefiles/blob/master/bsub.py
https://github.com/broadinstitute/viral-ngs/tree/master/pipes

jobscript

An example of the jobscript

#!/bin/bash
# properties = {properties}
. /etc/profile.d/modules.sh
module purge
module load snakemake/python3.2/2.5.2.2
{workflow.snakemakepath} --snakefile {workflow.snakefile} \
--force -j{cores} \
--directory {workdir} --nocolor --notemp --quiet --nolock {job.output} \
&& touch "{jobfinished}" || touch "{jobfailed}"
exit 0

A working snakemake pipeline for ChIP-seq

The folder structure is like this:

├── README.md
├── Snakemake
├── config.yaml
└── rawfastqs
    ├── sampleA
    │   ├── sampleA_L001.fastq.gz
    │   ├── sampleA_L002.fastq.gz
    │   └── sampleA_L003.fastq.gz
    ├── sampleB
    │   ├── sampleB_L001.fastq.gz
    │   ├── sampleB_L002.fastq.gz
    │   └── sampleB_L003.fastq.gz
    ├── sampleG1
    │   ├── sampleG1_L001.fastq.gz
    │   ├── sampleG1_L002.fastq.gz
    │   └── sampleG1_L003.fastq.gz
    └── sampleG2
        ├── sampleG2_L001.fastq.gz
        ├── sampleG2_L002.fastq.gz
        └── sampleG2_L003.fastq.gz

There is a folder named rawfastqs containing all the raw fastqs. each sample subfolder contains multiple fastq files from different lanes.

In this example, I have two control (Input) samples and two corresponding case(IP) samples.

CONTROLS = ["sampleG1","sampleG2"]
CASES = ["sampleA", "sampleB"]

putting them in a list inside the Snakefile. If there are many more samples, need to generate it with python programmatically.

## dry run
snakemake -np

## work flow diagram
snakemake --forceall --dag | dot -Tpng | display

###To Do:

  • Make the pipeline more flexiable. e.g. specify the folder name containing raw fastqs, now it is hard coded.
  • write a wrapper script for submitting jobs in moab. Figuring out dependencies and --immediate-submit

snakemake -k -j 1000 --forceall --cluster-config cluster.json --cluster "msub -V -N '{rule.name}_{wildcards.sample}' -l nodes={cluster.nodes}:ppn={cluster.cpu} -l mem={cluster.mem} -l walltime={cluster.time} -m {cluster.EmailNotice} -M {cluster.email}"

snakemake -k -j 1000 --forceall --cluster-config cluster.json --cluster "msub -V -l nodes={cluster.nodes}:ppn={cluster.cpu} -l mem={cluster.mem} -l walltime={cluster.time} -m {cluster.EmailNotice} -M {cluster.email}"

You can’t perform that action at this time.