# Snakemake workflows

## Why workflows?

- Workflows manage
  - Dependencies between tasks
  - Suspend/resume (reentrancy)
  - Logging
  - Parallelisation

## Classification of workflows

- [Leipzig 2016](http://bib.oxfordjournals.org/content/early/2016/03/23/bib.bbw020.long) classifies workflows into:
  - Implicit
  - Explicit

## Workflow platforms

![](workflows.png)

## Snakemake

- Text-based
- Can mix command line (shell), Python, and R scripts
- Matches file names against rule inputs and outputs to work out which rules transform one file into another

## Basic example

```python
rule sort:
    input:
        "path/to/dataset.txt"
    output:
        "dataset.sorted.txt"
    shell:
        "sort {input} > {output}"
```

## Top-level rule

```python
rule all:
    input:
        "dataset.sorted.txt"

rule sort:
    input:
        "dataset.txt"
    output:
        "dataset.sorted.txt"
    shell:
        "sort {input} > {output}"
```

## Handling dependencies

- Snakemake works out which rules to apply by *pattern matching*
- You specify the *name* of a file (or files) that you want, and Snakemake works out which rules are required
- Requires strict conventions for file naming

```python
rule all:
  input:
    "dataset.sorted.copied.txt"

rule sort:
    input:
        "dataset.txt"
    output:
        "dataset.sorted.txt"
    shell:
        "sort {input} > {output}"

rule copy:
    input:
        "dataset.sorted.txt"
    output:
        "dataset.sorted.copied.txt"
    shell:
        "cp {input} {output}"
```
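How a requested file leads back to the rules that build it can be sketched in plain Python. This is only an illustration of the idea, not Snakemake's implementation: `RULES` and `plan` are hypothetical names, the rules are reduced to fixed file names, and the sketch ignores timestamps, which Snakemake also uses when deciding what to rerun.

```python
# Hypothetical, simplified model of the rules above:
# rule name -> (list of input files, output file).
RULES = {
    "sort": (["dataset.txt"], "dataset.sorted.txt"),
    "copy": (["dataset.sorted.txt"], "dataset.sorted.copied.txt"),
}


def plan(target, existing, rules=RULES):
    """Return the rule names to run, in order, to produce `target`."""
    if target in existing:
        return []  # the file is already there, nothing to do
    for name, (inputs, output) in rules.items():
        if output == target:
            steps = []
            for dep in inputs:
                # Resolve each input first, without planning a rule twice
                for step in plan(dep, existing, rules):
                    if step not in steps:
                        steps.append(step)
            return steps + [name]
    raise FileNotFoundError(f"No rule produces {target!r}")


order = plan("dataset.sorted.copied.txt", existing={"dataset.txt"})
print(order)  # ['sort', 'copy']
```

If `dataset.sorted.txt` is already present, the same call returns only `['copy']`, which is the suspend/resume behaviour in miniature.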

## Wildcards

- Wildcards can be used to match filename patterns

```python
IDS, = glob_wildcards("{id}.fasta")

rule all:
  input:
    expand("{id}.fmt7",id=IDS)
```

- A *regular expression* after a comma inside the braces restricts what a wildcard may match, which handles more complicated patterns

```python
IDS, = glob_wildcards("{id,[A-Z]{3}[0-9]+}.fasta")
```
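What `glob_wildcards` and `expand` do in the examples above can be approximated in plain Python. This is a rough sketch rather than Snakemake's own code: `match_ids` and `expand_ids` are hypothetical helpers, the file names are made up, a list of names stands in for the directory listing, and an unconstrained wildcard is assumed to behave like the regular expression `.+`.

```python
import re


def match_ids(filenames, constraint=r".+"):
    """Collect the `id` part of every name of the form `{id}.fasta`."""
    pattern = re.compile(rf"(?P<id>{constraint})\.fasta")
    return [m.group("id") for name in filenames if (m := pattern.fullmatch(name))]


def expand_ids(template, ids):
    """Fill the `{id}` placeholder once per value."""
    return [template.format(id=i) for i in ids]


files = ["ABC123.fasta", "sample1.fasta", "notes.txt"]  # made-up names

ids = match_ids(files)
print(ids)                           # ['ABC123', 'sample1']
print(expand_ids("{id}.fmt7", ids))  # ['ABC123.fmt7', 'sample1.fmt7']

# With the regular-expression constraint, only ABC123 qualifies
constrained = match_ids(files, r"[A-Z]{3}[0-9]+")
print(constrained)                   # ['ABC123']
```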

## More wildcards

```python
rule sort:
    input:
        "{d1}.txt"
    output:
        "{d1}.sorted.txt"
    shell:
        "sort {input} > {output}"
```

```python
rule all:
  input:
    "dataset.sorted.copied.txt"

rule sort:
    input:
        "{d1}.txt"
    output:
        "{d1}.sorted.txt"
    shell:
        "sort {input} > {output}"

rule copy:
    input:
        "{d1}.txt"
    output:
        "{d1}.copied.txt"
    shell:
        "cp {input} {output}"
```
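With wildcards, the value of `{d1}` is inferred by matching the requested file against a rule's *output* pattern and is then substituted into the rule's input. A minimal sketch of that step in plain Python, where `infer_input` is a hypothetical helper and each wildcard is assumed to match `.+`:

```python
import re


def infer_input(output_pattern, input_pattern, target):
    """Match `target` against a rule's output pattern and fill in its input."""
    # Splitting on {name} gives alternating literal text and wildcard names
    parts = re.split(r"\{(\w+)\}", output_pattern)
    regex = "".join(
        re.escape(p) if i % 2 == 0 else f"(?P<{p}>.+)"
        for i, p in enumerate(parts)
    )
    m = re.fullmatch(regex, target)
    if m is None:
        return None  # this rule cannot produce the target
    return input_pattern.format(**m.groupdict())


# rule copy produces the final target, with d1 = "dataset.sorted"
print(infer_input("{d1}.copied.txt", "{d1}.txt", "dataset.sorted.copied.txt"))
# rule sort produces the intermediate file, with d1 = "dataset"
print(infer_input("{d1}.sorted.txt", "{d1}.txt", "dataset.sorted.txt"))
# rule sort does not match the final target
print(infer_input("{d1}.sorted.txt", "{d1}.txt", "dataset.sorted.copied.txt"))
```

The three calls print `dataset.sorted.txt`, `dataset.txt`, and `None`, which is why `copy` runs on the output of `sort` in the workflow above.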

```python
rule igblast:
  input:
    "{d1}.fasta"
  output:
    "{d1}.fmt7"
  message:
    "Running IgBLAST on {input}"
  threads:
    2
  shell:
    "igblastn \
    -germline_db_V database/human_trb_v \
    -germline_db_D database/human_trb_d \
    -germline_db_J database/human_trb_j \
    -auxiliary_data optional_file/human_gl.aux \
    -domain_system imgt -ig_seqtype TCR \
    -organism human -outfmt '7 std qseq sseq btop' \
    -query {input} -out {output} \
    -num_threads {threads}"
```

## Running Snakemake


```bash
snakemake \
  -j <number of cores> \
  -s <name of snakefile>
```

## Other features of Snakemake

- Run Python code or R code instead of shell commands
- Submit jobs to SLURM clusters
- Benchmarking