## Getting started with Snakemake workflows for automated bioinformatics analysis.

## Author: Ali Pirani

## Software we're going to use

We're going to be using the [conda](https://conda.io/en/latest/) package manager for installation and [snakemake](https://snakemake.readthedocs.io/en/stable/) to build the workflow.

If you want to run all of this on your own computer, you'll need to follow the install instructions below.

You can see the full set of software requirements in the conda `environment.yml` file located in the `binder` directory. (More on binder later.)

You could use this install file to run everything we're doing today on your laptop, with: 

```
conda env create --file binder/environment.yml -n smake
conda activate smake
pip install bash_kernel
python -m bash_kernel.install
```

We'll be implementing a short-read quality check and visualization pipeline, using [fastqc](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) and [multiqc](https://multiqc.info/), to demonstrate how to get started with a Snakemake workflow.

Why use a workflow management tool?

- Automate your pipeline
- Dependency management
- Reentrancy - start back up where you left off
- Reusable
- Documentation and logging
- Portable
- Reproducibility

There are many Snakemake workflows on GitHub that you can choose from for inspiration and directory organization: https://github.com/pditommaso/awesome-pipeline

## Agenda: Tasks we are going to do

- Get comfortable with Jupyter Notebook
- Install Bash Kernel for Jupyter Notebook.
- Download fastq data for QC analysis.
- Create a Snakefile to write our first Snakemake rule
- Explore advanced features of Snakemake

## Install Bash kernel for Jupyter notebooks

Run these commands in your terminal and then refresh the notebook tab. (Launch a terminal from the Launcher.)

```
pip install bash_kernel
python -m bash_kernel.install
```

## Download some data

Execute the cell below to fetch the example fastq data. The files will be downloaded into your `data` directory.

Note: To run bash commands, select bash kernel from the top right corner.

In [None]:
curl -L https://github.com/ctb/2019-snakemake-ucdavis/raw/9db09bc0b6a3469f8a0d4996d4b2995bf36e5d27/data/0Hour_001_1.fq.gz > data/0Hour_001_1.fq.gz
curl -L https://github.com/ctb/2019-snakemake-ucdavis/raw/9db09bc0b6a3469f8a0d4996d4b2995bf36e5d27/data/6Hour_001_1.fq.gz > data/6Hour_001_1.fq.gz
curl -L https://github.com/ctb/2019-snakemake-ucdavis/raw/9db09bc0b6a3469f8a0d4996d4b2995bf36e5d27/data/0Hour_001_2.fq.gz > data/0Hour_001_2.fq.gz
curl -L https://github.com/ctb/2019-snakemake-ucdavis/raw/9db09bc0b6a3469f8a0d4996d4b2995bf36e5d27/data/6Hour_001_2.fq.gz > data/6Hour_001_2.fq.gz
curl -L https://github.com/ctb/2019-snakemake-ucdavis/raw/9db09bc0b6a3469f8a0d4996d4b2995bf36e5d27/data/6Hour_002_1.fq.gz > data/6Hour_002_1.fq.gz
curl -L https://github.com/ctb/2019-snakemake-ucdavis/raw/9db09bc0b6a3469f8a0d4996d4b2995bf36e5d27/data/6Hour_002_2.fq.gz > data/6Hour_002_2.fq.gz
curl -L https://github.com/ctb/2019-snakemake-ucdavis/raw/9db09bc0b6a3469f8a0d4996d4b2995bf36e5d27/data/0Hour_002_1.fq.gz > data/0Hour_002_1.fq.gz
curl -L https://github.com/ctb/2019-snakemake-ucdavis/raw/9db09bc0b6a3469f8a0d4996d4b2995bf36e5d27/data/0Hour_002_2.fq.gz > data/0Hour_002_2.fq.gz

In [None]:
ls data/

Let's activate the snakemake environment using conda.

In [None]:
conda activate smake

## Running snakemake!

### Create your first Snakefile.

A Snakemake workflow is defined by specifying rules in a Snakefile. By default, snakemake looks for a file named `Snakefile` in your working directory.

Rules break down the workflow into small steps (for example, the application of a single tool) by specifying how to create sets of output files from sets of input files. Snakemake automatically determines the dependencies between the rules by matching file names.

We will define the rules of our workflow in a Snakefile.

Create a new text file by selecting (`File`, `New`, `Text file`) from the top left corner menu and copy/paste this first rule:

```
rule fastqc_a_file:
  shell:
    "fastqc data/0Hour_001_1.fq.gz"
```

Save the file and rename it to "Snakefile". (Right-click on the file in the file explorer in the left panel and select Rename.)

* the snakemake workflow file is by default called `Snakefile`

Now, run snakemake:


In [None]:
snakemake

What was the error?

Let's run it with 1 core.

In [None]:
snakemake --cores 1

You should see:

```

Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job              count    min threads    max threads
-------------  -------  -------------  -------------
fastqc_a_file        1              1              1
total                1              1              1

Select jobs to execute...

[Wed Nov  3 15:49:22 2021]
rule fastqc_a_file:
    jobid: 0
    resources: tmpdir=/tmp

Started analysis of 0Hour_001_1.fq.gz
Approx 5% complete for 0Hour_001_1.fq.gz
...
Analysis complete for 0Hour_001_1.fq.gz
[Wed Nov  3 15:49:27 2021]
Finished job 0.
1 of 1 steps (100%) done
Complete log: /home/apirani/Session8_snakemake/.snakemake/log/2021-11-03T154921.245311.snakemake.log

```

### Updating the Snakefile to track inputs and outputs 

At the moment this is basically just a shell script with extra syntax... then what's the point?

Well, shell scripts - and this snakefile, too - will rerun the command every time you run the file, even if there's no reason to do so because the file hasn't changed.

This is particularly important for large or long workflows, where you're dealing with 10s to 100s of files that may take hours to days to process! It can be hard to figure out which files to rerun, but snakemake can really help you do this!

It's hard to track this kind of thing in a shell script - I usually just comment out the lines I don't want to run, or break my commands up into multiple shell scripts so they don't take so long - but with snakemake, you can annotate the rule with input and output files! 

Input -> Process 1 -> Process 2 -> Process 3 -> Output

If you annotate the rule with input/output files, Snakemake will automatically check whether the output files exist and are up to date, and then decide whether to rerun everything, rerun only a part of the workflow, or do nothing.

We will see how it does that in the following section.

Change your snakefile to look like this:

```
rule fastqc_a_file:
  input:
    "data/0Hour_001_1.fq.gz"
  output:
    "data/0Hour_001_1_fastqc.html",
    "data/0Hour_001_1_fastqc.zip"
  shell:
    "fastqc data/0Hour_001_1.fq.gz"
```

Here, we've annotated the rule with the required **input** file, as well as the expected **output** files.

**Question: how do we know what the output files are?**

Now run:


In [None]:
snakemake --cores 1

You should see something like this:

```
Building DAG of jobs...
Nothing to be done.
Complete log: /home/apirani/Session8_snakemake/.snakemake/log/2021-11-03T155832.508508.snakemake.log
```

**Question: What happened?**

snakemake looked at the files listed under `output`, saw that they already existed, and figured out that it didn't need to do anything! So cool, right? To get the same behavior in a shell script, you would have to write the checking code yourself.

### Forcibly re-running things

You can tell snakemake to run the rule no matter what with `-f`:

In [None]:
snakemake -f --cores 1

You can also remove an output file and it will automatically re-run:

In [None]:
rm data/*.html
snakemake --cores 1

Note that you don't need to remove *all* the output files to rerun a command - just remove *one* of them.

You can *also* update the timestamp on an *input* file, and snakemake will figure out that the output file is older than the input file, and will rerun things.

In [None]:
touch data/*.fq.gz
snakemake --cores 1

This feature of Snakemake will become important later.
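
Here's a rough Python sketch of the check described above: a rule gets rerun when an output file is missing, or when an input file is newer than an output file. This is a simplification for illustration, not Snakemake's real code (the real tool tracks more than this), and `needs_rerun` is a made-up helper name. The demo uses empty placeholder files in a temporary directory.

```python
import os
import tempfile

def needs_rerun(inputs, outputs):
    """Return True if any output is missing or older than any input."""
    # A missing output always forces the rule to run.
    if any(not os.path.exists(path) for path in outputs):
        return True
    # Otherwise compare modification times: newest input vs. oldest output.
    newest_input = max(os.path.getmtime(path) for path in inputs)
    oldest_output = min(os.path.getmtime(path) for path in outputs)
    return newest_input > oldest_output

with tempfile.TemporaryDirectory() as workdir:
    reads = os.path.join(workdir, "sample.fq.gz")
    report = os.path.join(workdir, "sample_fastqc.html")
    open(reads, "w").close()
    missing = needs_rerun([reads], [report])     # output not created yet
    open(report, "w").close()
    os.utime(reads, (1000, 1000))                # input older than output
    os.utime(report, (2000, 2000))
    up_to_date = needs_rerun([reads], [report])
    os.utime(reads, (3000, 3000))                # like `touch` on the input
    touched = needs_rerun([reads], [report])

print(missing, up_to_date, touched)  # True False True
```

The three cases line up with what we just did by hand: removing an output, leaving everything alone, and running `touch` on an input.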

### Multiple rules

Let's add a rule to run fastqc on a second file:

```
rule fastqc_a_file:
  input:
    "data/0Hour_001_1.fq.gz"
  output:
    "data/0Hour_001_1_fastqc.html",
    "data/0Hour_001_1_fastqc.zip"
  shell:
    "fastqc data/0Hour_001_1.fq.gz"

rule fastqc_a_file2:
  input:
    "data/6Hour_001_1.fq.gz"
  output:
    "data/6Hour_001_1_fastqc.html",
    "data/6Hour_001_1_fastqc.zip"
  shell:
    "fastqc data/6Hour_001_1.fq.gz"
```

Now run snakemake again:

In [None]:
snakemake --cores 1

**Question: What did snakemake do?**

- It did nothing, because by default snakemake only runs the *first* rule in a Snakefile, and that rule's output files already exist.

**Question: How can you bypass this?**

- You can give a rule name on the command line, if you like, **or** you can tell snakemake what output file(s) you want.

Let's do the latter and run the following command:

In [None]:
snakemake --cores 1 data/0Hour_001_1_fastqc.html data/6Hour_001_1_fastqc.html

Now you should see the second fastqc command run, with the appropriate output files!

Note that snakemake only runs the second rule, because it looks at the output files and sees that the first file you wanted, `0Hour_001_1_fastqc.html` already exists!

It's a good feature when you are dealing with only a few rules or debugging your rules. The downsides here are:

* this is pretty long compared to the same shell script...
* specifying which file or rule you want is kind of annoying...

### A first refactoring: adding a better default rule

Let's start refactoring (cleaning up) this Snakefile.

First, let's add a rule at the top:

```
rule all:
  input:
    "data/0Hour_001_1_fastqc.html",
    "data/6Hour_001_1_fastqc.html"

rule fastqc_a_file:
  input:
    "data/0Hour_001_1.fq.gz"
  output:
    "data/0Hour_001_1_fastqc.html",
    "data/0Hour_001_1_fastqc.zip"
  shell:
    "fastqc data/0Hour_001_1.fq.gz"

rule fastqc_a_file2:
  input:
    "data/6Hour_001_1.fq.gz"
  output:
    "data/6Hour_001_1_fastqc.html",
    "data/6Hour_001_1_fastqc.zip"
  shell:
    "fastqc data/6Hour_001_1.fq.gz"
```

This rule, by convention called `all`, is a default rule that produces all the final output files. 

But it's a bit weird! It's all input, and no output!

It's a blank rule that gathers together all of the various output files you want the workflow to produce, and says "hey, snakemake, I depend on all of these files for my input - make them for me!" And then, once those files are all there, it ...does nothing.

Yep, this is perfectly legal in snakemake, and it's one way to make your life easier.


### A second refactoring: doing a bit of templating

There's a lot of repetition in each of these rules. Let's collapse it down a little bit by replacing the filename in the fastqc command with a magic variable, `{input}`, which stands for the files listed in the `input` section of the rule.

```
rule all:
  input:
    "data/0Hour_001_1_fastqc.html",
    "data/6Hour_001_1_fastqc.html"

rule fastqc_a_file:
  input:
    "data/0Hour_001_1.fq.gz"
  output:
    "data/0Hour_001_1_fastqc.html",
    "data/0Hour_001_1_fastqc.zip"
  shell:
    "fastqc {input}"

rule fastqc_a_file2:
  input:
    "data/6Hour_001_1.fq.gz"
  output:
    "data/6Hour_001_1_fastqc.html",
    "data/6Hour_001_1_fastqc.zip"
  shell:
    "fastqc {input}"
```

This all works as before, but now the rule is a bit more generic and will work with any input file. Sort of. Why?

The input files `data/0Hour_001_1.fq.gz` and `data/6Hour_001_1.fq.gz` are still hardcoded! We don't want to hardcode anything!

### Refactoring 3: templating output files, too

What do I mean, sort of?

Well, the output filenames ALSO depend on the input file names in some way - specifically, fastqc replaces the `.fq.gz` extension with `_fastqc.html` and `_fastqc.zip` to make its two output files.

Let's rewrite the rule using some snakemake pattern matching:

```
rule all:
  input:
    "data/0Hour_001_1_fastqc.html",
    "data/6Hour_001_1_fastqc.html"

rule fastqc_a_file:
  input:
    "{filename}.fq.gz"
  output:
    "{filename}_fastqc.html",
    "{filename}_fastqc.zip"
  shell:
    "fastqc {input}"

rule fastqc_a_file2:
  input:
    "{filename}.fq.gz"
  output:
    "{filename}_fastqc.html",
    "{filename}_fastqc.zip"
  shell:
    "fastqc {input}"
```

What we've done here is tell snakemake that anytime we say we *want* a file that ends with `_fastqc.html`, it should look for a file that ends in `.fq.gz` and then run `fastqc` on it.

Try running this:

In [None]:
snakemake --cores 1

We get an `AmbiguousRuleException`! What's going on?

Well, if you look at the rule above, we've given snakemake two different rules to produce the same file(s)! `fastqc_a_file` and `fastqc_a_file2` are now identical rules! snakemake doesn't like that.

Let's remove one, to get a trimmer, leaner, and above all *functional* snakefile:

```
rule all:
  input:
    "data/0Hour_001_1_fastqc.html",
    "data/6Hour_001_1_fastqc.html"

rule fastqc_a_file:
  input:
    "{filename}.fq.gz"
  output:
    "{filename}_fastqc.html",
    "{filename}_fastqc.zip"
  shell:
    "fastqc {input}"

```

and THAT should now work just fine!


In [None]:
snakemake --cores 1
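
If you're curious how the `{filename}` pattern gets filled in, here is a minimal Python sketch of the idea: match the requested file against the output pattern, pull out the wildcard value, and substitute it into the input pattern. This is an illustration under simplifying assumptions (each wildcard appears once in the pattern and matches any non-empty text), not Snakemake's actual implementation, and `match_wildcards` is a hypothetical helper name.

```python
import re

def match_wildcards(pattern, target):
    """Return a dict of wildcard values if `target` fits `pattern`, else None."""
    # re.split with a capturing group alternates literal text and wildcard names.
    pieces = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for index, piece in enumerate(pieces):
        if index % 2 == 0:
            regex += re.escape(piece)        # literal text, matched exactly
        else:
            regex += f"(?P<{piece}>.+)"      # a wildcard matches any non-empty text
    match = re.fullmatch(regex, target)
    return match.groupdict() if match else None

wildcards = match_wildcards("{filename}_fastqc.html", "data/0Hour_001_1_fastqc.html")
input_file = "{filename}.fq.gz".format(**wildcards)

print(wildcards)   # {'filename': 'data/0Hour_001_1'}
print(input_file)  # data/0Hour_001_1.fq.gz
```

Note that the wildcard swallowed the `data/` directory as well, which is why the rule no longer needs to mention it.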

### Adding some more files

Now here's the fun bit -- if you look in the data directory, you'll see that there are actually 8 files in there. Let's modify the snakefile to run fastqc on all of them!

**Question: How should we do that? (Give it a try!)**

Note that the variable name in input and output does not have to be "filename"; it can be anything as long as you're consistent.
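
One possible answer, sketched in plain Python (which a Snakefile can contain): derive the expected report names from whatever `.fq.gz` files are present instead of typing them all out. `expected_reports` is a hypothetical helper name, and the demo creates empty placeholder files in a temporary directory purely for illustration.

```python
import tempfile
from pathlib import Path

def expected_reports(data_dir):
    """List the fastqc HTML report expected for every .fq.gz file in data_dir."""
    suffix = ".fq.gz"
    reports = []
    for reads in sorted(Path(data_dir).glob("*" + suffix)):
        stem = reads.name[: -len(suffix)]    # e.g. "0Hour_001_1"
        reports.append((reads.parent / (stem + "_fastqc.html")).as_posix())
    return reports

with tempfile.TemporaryDirectory() as workdir:
    for name in ["0Hour_001_1.fq.gz", "6Hour_001_1.fq.gz", "notes.txt"]:
        (Path(workdir) / name).touch()
    names = [Path(report).name for report in expected_reports(workdir)]

print(names)  # ['0Hour_001_1_fastqc.html', '6Hour_001_1_fastqc.html']
```

In a Snakefile, something like `expected_reports("data")` could then serve as the input list of the `all` rule. Simply listing all eight files by hand works just as well.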

### Building out the workflow

So, we've gotten fastqc sorted out. What's next?

Let's add in a new rule - multiqc, to summarize our fastqc results.

multiqc takes a directory name under which there are one or more fastqc reports, and then summarizes them.

Run multiqc on the command line to determine what the output files look like:

In [None]:
multiqc data

You can see that it creates two outputs, *multiqc_report.html* and the directory *multiqc_data/* which contains a bunch of files. 

Let's create a snakemake rule for this; add:

```
rule all:
  input:
    "data/0Hour_001_1_fastqc.html",
    "data/6Hour_001_1_fastqc.html"

rule fastqc_a_file:
  input:
    "{filename}.fq.gz"
  output:
    "{filename}_fastqc.html",
    "{filename}_fastqc.zip"
  shell:
    "fastqc {input}"

rule run_multiqc:
  input:
    "data/0Hour_001_1_fastqc.html",
    "data/6Hour_001_1_fastqc.html",
  output:
    "multiqc_report.html",
    directory("multiqc_data")
  shell:
    "multiqc data/"
```

In [None]:
snakemake --cores 1

This ...doesn't really do what we want, for a couple of reasons.

First of all, the output of `run_multiqc` is not listed in the `all` rule, so snakemake doesn't look for a rule to create it.

Second of all, `multiqc_report.html` already exists, so snakemake won't run the rule.

Let's fix both things:

- add `multiqc_report.html` to the inputs for the first rule, `all`.
- then remove `multiqc_report.html` and re-run snakemake.

Your snakefile should look like:

```
rule all:
  input:
    "data/0Hour_001_1_fastqc.html",
    "data/6Hour_001_1_fastqc.html",
    "multiqc_report.html"

rule fastqc_a_file:
  input:
    "{filename}.fq.gz"
  output:
    "{filename}_fastqc.html",
    "{filename}_fastqc.zip"
  shell:
    "fastqc {input}"

rule run_multiqc:
  input:
    "data/0Hour_001_1_fastqc.html",
    "data/6Hour_001_1_fastqc.html",
  output:
    "multiqc_report.html",
    directory("multiqc_data")
  shell:
    "multiqc data/"
```

In [None]:
rm multiqc_report.html
snakemake --cores 1

There is still some redundancy in the code. Listing the input files explicitly in the multiqc rule is not great, because they are the same files we have in the `all` rule, so adding a new file means editing two places.

We can fix this by using a variable.

To do so, let's make a Python list at the very top, containing all of our expected output files from fastqc:

```
fastqc_output = ["data/0Hour_001_1_fastqc.html", "data/6Hour_001_1_fastqc.html", "data/0Hour_001_2_fastqc.html", "data/6Hour_001_2_fastqc.html"]
```

and modify the all and multiqc rules to contain this list. 

The final snakefile looks like this:

```
fastqc_output = ["data/0Hour_001_1_fastqc.html", "data/6Hour_001_1_fastqc.html",
  "data/0Hour_001_2_fastqc.html", "data/6Hour_001_2_fastqc.html"]

rule all:
  input:
    fastqc_output,
    "multiqc_report.html"

rule fastqc_a_file:
  input:
    "{filename}.fq.gz"
  output:
    "{filename}_fastqc.html",
    "{filename}_fastqc.zip"
  shell:
    "fastqc {input}"

rule run_multiqc:
  input:
    fastqc_output
  output:
    "multiqc_report.html",
    directory("multiqc_data")
  shell:
    "multiqc data/"
```

In [None]:
snakemake --cores 1
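
Because a Snakefile is Python, the `fastqc_output` list doesn't have to be typed by hand. Here is a minimal sketch that generates it, assuming the file names follow the `<timepoint>_<replicate>_<read>` pattern seen in the data directory. The variable names `timepoints`, `replicates` and `reads` are made up for this example.

```python
from itertools import product

timepoints = ["0Hour", "6Hour"]
replicates = ["001"]
reads = ["1", "2"]

# One report per combination of timepoint, replicate and read direction.
fastqc_output = [
    f"data/{timepoint}_{replicate}_{read}_fastqc.html"
    for timepoint, replicate, read in product(timepoints, replicates, reads)
]

print(fastqc_output)
```

This produces the same four files as the hand-written list, just in a different order. Snakemake also ships an `expand()` helper for this kind of combination.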

## Refactoring this to make it slightly more concise 

We've got one more redundancy in this file - the fastqc_output is listed in the all rule, but you don't need it there! Why?

Well, multiqc_report.html is already in the all rule, and the multiqc rule depends on fastqc_output, so fastqc_output already needs to be created to satisfy the all rule, so... specifying it in the all rule is redundant! And you can remove it!

(It's not a big deal and I usually leave it in. But I wanted to talk about dependencies!)
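
To make the dependency argument concrete, here is a toy Python model of it. The `built_from` dictionary and the `files_needed` helper are hypothetical and only for illustration. This is not how Snakemake stores rules, but the walk backwards from the requested file is the same idea.

```python
# Toy model of the workflow: each output file maps to the files it is built from.
built_from = {
    "multiqc_report.html": [
        "data/0Hour_001_1_fastqc.html",
        "data/6Hour_001_1_fastqc.html",
    ],
    "data/0Hour_001_1_fastqc.html": ["data/0Hour_001_1.fq.gz"],
    "data/6Hour_001_1_fastqc.html": ["data/6Hour_001_1.fq.gz"],
}

def files_needed(target, seen=None):
    """Collect every file that must exist before `target` can be built."""
    seen = set() if seen is None else seen
    for dependency in built_from.get(target, []):
        if dependency not in seen:
            seen.add(dependency)
            files_needed(dependency, seen)   # keep walking back to the raw data
    return seen

needed = files_needed("multiqc_report.html")
print("data/0Hour_001_1_fastqc.html" in needed)  # True
```

Asking for `multiqc_report.html` alone already pulls in the fastqc reports, which is why listing them in the `all` rule is optional.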

The Snakefile now looks like this:

```
fastqc_output = ["data/0Hour_001_1_fastqc.html", "data/6Hour_001_1_fastqc.html",
  "data/0Hour_001_2_fastqc.html", "data/6Hour_001_2_fastqc.html"]

rule all:
  input:
    "multiqc_report.html"

rule fastqc_a_file:
  input:
    "{filename}.fq.gz"
  output:
    "{filename}_fastqc.html",
    "{filename}_fastqc.zip"
  shell:
    "fastqc {input}"

rule run_multiqc:
  input:
    fastqc_output
  output:
    "multiqc_report.html",
    directory("multiqc_data")
  shell:
    "multiqc data/"
```

and we can rerun it from scratch by doing:

In [None]:
rm data/*.html multiqc_report.html
snakemake --cores 1

### Question: what files does snakemake check in order to decide about rerunning?

snakemake will compare only the dates on the very initial input files,
and the specific output file(s) you are requesting, to decide if it
needs to rerun the workflow.

In practical terms, this means that if you just delete the
`data/*.html` files above but leave `multiqc_report.html` around,
snakemake won't rerun anything. You have to delete both the
intermediaries _and_ the end output files (as we do in the previous
section), _or_ update the raw input files, in order to force
rerunning.

([This is a feature, not a bug](https://bitbucket.org/snakemake/snakemake/issues/885/snakemake-sometimes-doesnt-run-rule-when) - it helps deal with data-intensive pipelines where the intermediate files are really big.)

### Making a `clean` rule

It's kind of annoying to have to delete things explicitly. Snakemake should take care of that for us. 

Let's add a new rule, `clean`, below the `all` rule (so that `all` stays the default), that deletes the generated files and so forces rerunning of things --

``` 
fastqc_output = ["data/0Hour_001_1_fastqc.html", "data/6Hour_001_1_fastqc.html",
  "data/0Hour_001_2_fastqc.html", "data/6Hour_001_2_fastqc.html"]

rule all:
  input:
    fastqc_output,
    "multiqc_report.html"

rule clean:
  shell:
    "rm -f {fastqc_output} multiqc_report.html"
    
rule fastqc_a_file:
  input:
    "{filename}.fq.gz"
  output:
    "{filename}_fastqc.html",
    "{filename}_fastqc.zip"
  shell:
    "fastqc {input}"

rule run_multiqc:
  input:
    fastqc_output
  output:
    "multiqc_report.html",
    directory("multiqc_data")
  shell:
    "multiqc data/"
```

and now try rerunning things:


In [None]:
snakemake -p clean --cores 1

In [None]:
snakemake --cores 1

A few things to point out --

* Here we see the use of variables inside a shell command, again - `{fastqc_output}` means "replace the thing in the curly braces with the Python values in `fastqc_output`".
* We're using `snakemake -p` to get a printout of the commands that are run.

What's particularly nice about the `clean` rule (the name is a convention, not a requirement) is that you only need to keep track of the expected output files in one or two places - the all rule, and the clean rule.
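
As a rough picture of what happens to `{fastqc_output}` in the `clean` rule, here is the equivalent substitution written in plain Python. Snakemake does the joining for you. The explicit `" ".join(...)` and the `files` placeholder name are only there to make the step visible in this sketch.

```python
fastqc_output = ["data/0Hour_001_1_fastqc.html", "data/6Hour_001_1_fastqc.html"]

# Join the list with spaces, then substitute it into the command template.
command = "rm -f {files} multiqc_report.html".format(files=" ".join(fastqc_output))

print(command)
# rm -f data/0Hour_001_1_fastqc.html data/6Hour_001_1_fastqc.html multiqc_report.html
```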

## Recap

So, we've put all this work into making this snakefile with its input rules and its output rules... and there are a lot of advantages to our current approach already! 

Let's list a few of them --

- we've completely automated our analysis!
- we can easily add new data files into fastqc and multiqc!
- we can rerun things easily, and (even better) by default only things that need to be run will be run.
- the snakefile is actually pretty reusable - we could drop this into a new project, and, with little effort, run all of these things on new data!

# More advanced snakemake

## Running things in parallel

In [None]:
snakemake -p clean --cores 1

In [None]:
snakemake -j 4

This will run up to four things in parallel!

Points to remember:

* all the output is sort of smushed together... if a rule fails, it may be hard to figure out what happened. You can always just rerun snakemake.
* still need to be careful about how much memory and processor you're using!
* can be used on a cluster, to distribute jobs across multiple compute nodes. (This requires more work.)
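
As a loose analogy for what `-j 4` means, here is a small Python sketch that runs independent jobs through a pool of at most four workers. It is not how Snakemake schedules jobs, and `run_job` is a stand-in that does no real work.

```python
from concurrent.futures import ThreadPoolExecutor

def run_job(sample):
    """Stand-in for running fastqc on one sample (no real work is done)."""
    return f"{sample}_fastqc.html"

samples = ["0Hour_001_1", "0Hour_001_2", "6Hour_001_1", "6Hour_001_2"]

# max_workers plays the role of `-j 4`: at most four jobs run at once.
with ThreadPoolExecutor(max_workers=4) as pool:
    reports = list(pool.map(run_job, samples))

print(reports)
```

The fastqc jobs can run side by side because none of them depends on another. The multiqc step still has to wait until all of them are done.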

### Specifying software required for a rule

You can specify software on a per-rule basis! This is really helpful when you have incompatible software requirements for different rules, or want to run on a cluster, or just want to pin your snakemake workflow to a specific version.

For example, if you create a file `env_fastqc.yml` with the following content,

```
channels:
  - bioconda
  - defaults
  - conda-forge
dependencies:
  - fastqc==0.11.8
```

and then change the fastqc rule to look like this:

```
rule fastqc_a_file:
  input:
    "{filename}.fq.gz"
  output:
    "{filename}_fastqc.html",
    "{filename}_fastqc.zip"
  conda:
    "env_fastqc.yml"
  shell:
    "fastqc {input}"
```

you can now run snakemake like so,

```
snakemake --use-conda -j 4
```
and for that rule, snakemake will install just that software, with the specified version.

This aids in reproducibility, in addition to the practical advantages of isolating software installs from each other.

(You can also do this with docker and singularity containers!)

### Outputting the entire workflow diagram

You can visualize your workflow like so!


In [None]:
snakemake --dag -j 4  | dot -Tpng > dag.png
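
The `dot` program on the right-hand side of that pipe reads a plain-text graph description. Here is a minimal Python sketch of what such a description looks like for our three rules. The `edges` list and the `to_dot` helper are hypothetical, and the text Snakemake itself emits is more detailed than this.

```python
# Hypothetical rule-to-rule dependencies for the workflow built in this tutorial.
edges = [("fastqc_a_file", "run_multiqc"), ("run_multiqc", "all")]

def to_dot(edges):
    """Render (upstream, downstream) pairs as Graphviz dot text."""
    lines = ["digraph workflow {"]
    for upstream, downstream in edges:
        lines.append(f'    "{upstream}" -> "{downstream}";')
    lines.append("}")
    return "\n".join(lines)

dot_text = to_dot(edges)
print(dot_text)
```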

In [None]:
snakemake -p clean --cores 1

Running on a cluster:

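You can specify a cluster submit command (the `--cluster` option shown here applies to Snakemake versions before 8; later versions use executor plugins instead):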

In [None]:
snakemake --cluster sbatch -j 4

In [None]:
snakemake -p clean --cores 1

### Final thoughts - writing your own snakefile

Just like scripting, or writing an R script, writing a snakefile is a kind of programming. So you'll have to do a lot of debugging.

Workflows can get really complicated; for example [Samosa](https://github.com/alipirani88/samosa/blob/main/dag.svg)

If you want to see some good examples of how to build nice, clean, simple looking workflows, check out [this RNAseq example](https://github.com/snakemake-workflows/rna-seq-star-deseq2).

* start small, grow your snakefile!
* DO copy and paste from this and others you find online!
* it rarely hurts to just re-run snakemake!

## Other topics to explore for code club: wildcards, adding python code, cluster configuration, and many more.