<h1><center>Basic Bioinformatics Workflows on OSCAR: Snakemake and Nextflow</center></h1>
<p><center>Instructors: Ashok Ragavendran and Jordan Lawson</center>
 <center>Center for Computation and Visualization</center>
 <center>Center for Computational Biology of Human Disease - Computational Biology Core</center></p>

Resources for help @brown <br> 

COBRE CBHD Computational Biology Core
- Office hours
- https://cbc.brown.edu
- slack channel on ccv-share
- cbc-help@brown.edu <br>

Center for Computation and Visualization
- Office hours
- https://ccv.brown.edu
- ccv-share slack channel
- https://docs.ccv.brown.edu
- support@ccv.brown.edu

## What are Snakemake and Nextflow?

Snakemake and Nextflow are workflow management tools that allow users to easily write data-intensive computational **pipelines**. These pipelines, also called workflows, typically have the following key features:

- Files are processed sequentially
- More than one tool is usually required
- Multiple programming languages are often involved
- Each sample is usually processed individually
- Compute-resource intensive
  - Alignment could take 16 CPUs, 60 GB RAM, 4-24 hours, and 30 GB of disk space per sample
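
To get a feel for how quickly these per-sample figures add up, here is a small back-of-the-envelope sketch. It only multiplies the numbers quoted in the list above; the cohort size of 18 samples is a hypothetical value chosen for illustration.

```python
# Per-sample alignment figures quoted in the list above.
CPUS_PER_SAMPLE = 16
HOURS_RANGE = (4, 24)        # fastest and slowest wall-clock estimates
DISK_GB_PER_SAMPLE = 30


def project_totals(n_samples: int) -> dict:
    """Scale the per-sample figures to a whole project."""
    low_hours, high_hours = HOURS_RANGE
    return {
        "cpu_hours_min": n_samples * CPUS_PER_SAMPLE * low_hours,
        "cpu_hours_max": n_samples * CPUS_PER_SAMPLE * high_hours,
        "disk_gb": n_samples * DISK_GB_PER_SAMPLE,
    }


if __name__ == "__main__":
    # 18 samples is a hypothetical cohort size, used only for illustration.
    print(project_totals(18))
```

Even a modest project lands in the range of thousands of CPU-hours, which is why these pipelines are usually run on a cluster rather than a laptop.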

## Why do we care about these pipelines? 

### Reason 1: Reproducibility 

<img src="./img/reproduce.png" width="700"/>

The journal Nature published a survey that found that more than 70% of researchers have tried and failed to reproduce another scientist's experiments. This trend is hugely problematic because we then can't trust the findings from many studies enough to use them to make data-driven decisions. In short, we need tools and standards that help address the reproducibility crisis in science! 

Pipelines created with Snakemake and Nextflow incorporate version-control and state-of-the-art software tools, known as containers, to manage all software dependencies and create stable environments that can be trusted to reproduce analyses reliably and accurately. 

***Reproducibility is important for producing trustworthy research!***

### Reason 2: Portability

#### What if we need to perform analyses with more resources?

This type of scenario would require us to move our analyses to a different environment, for example, a High Performance Computing (HPC) cluster environment. 

An important feature of Snakemake and Nextflow workflow management tools is that they enable users to easily scale any pipeline written on a personal computer to then run on an HPC cluster such as OSCAR, the HPC cluster we use at Brown University. So now we can run our pipelines using high performance resources without having to change workflow definitions or hard-code a pipeline to a specific setup. As a result, **the code stays constant** across varying infrastructures, thereby allowing portability, easy collaboration, and avoiding lock-in. 

***In short, we can easily move our multi-step analyses (i.e., pipelines) to any place we need them!***

## So Let's See How All This Works! 

### Our Starting Point

Say we have samples from reduced representation bisulfite sequencing (RRBS data) that we need to process on OSCAR by performing the following set of actions: 

<img src="./img/workflow.png" width="700"/>

<h2><center>What do you do??</center></h2>

## A Naive Approach

One solution would be to write a bunch of shell scripts that use various software tools to process the data in the ways we need. 

For example, if we need to run FastQC, trim the reads, run FastQC again on the trimmed reads, and then perform an alignment, we could write a shell script for each step (a total of 4 shell scripts in this case), each with its own inputs and outputs. This would look something like the following:

**Script 1: Fastqc**

```
#!/bin/bash
#SBATCH -t 48:00:00
#SBATCH -n 32
#SBATCH -J rrbs_fastqc
#SBATCH --mem=198GB
#SBATCH --mail-type=ALL
#SBATCH --mail-user=jordan_lawson@brown.edu

source /gpfs/runtime/cbc_conda/bin/activate_cbc_conda
conda activate fedulov_rrbs
for sample in `ls /gpfs/data/cbc/fedulov_alexey/porcine_rrbs/Sequencing_Files/*fastq.gz`
do
align_dir="/gpfs/data/cbc/fedulov_alexey/porcine_rrbs" 
fastqc -o ${align_dir}/fastqc $sample
done
```

**Script 2: Trimming** 

```
#!/bin/bash
#SBATCH -t 48:00:00
#SBATCH -n 32
#SBATCH -J trimmomatic_update
#SBATCH --mem=198GB
#SBATCH --mail-type=ALL
#SBATCH --mail-user=jordan_lawson@brown.edu

source /gpfs/runtime/cbc_conda/bin/activate_cbc_conda

for sample in `ls /gpfs/data/cbc/fedulov_alexey/porcine_rrbs/trim_galore/*_trimmed.fq.gz`
do
    dir="/gpfs/data/cbc/fedulov_alexey/porcine_rrbs/trimmomatic"
    base=$(basename $sample "_trimmed.fq.gz")
    trimmomatic SE  -threads 8 -trimlog ${dir}/${base}_SE.log $sample ${dir}/${base}_tr.fq.gz ILLUMINACLIP:/gpfs/data/cbc/cbc_conda_v1/envs/cbc_conda/opt/trimmomatic-0.36/adapters/TruSeq3-SE.fa:2:30:5:6:true SLIDINGWINDOW:4:20 MINLEN:35
done 
```


**Script 3: Fastqc on trimmed reads**

```
#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH -n 8
#SBATCH -J retrim_fastqc_update
#SBATCH --mem=16GB
#SBATCH --mail-type=ALL
#SBATCH --mail-user=jordan_lawson@brown.edu

source /gpfs/runtime/cbc_conda/bin/activate_cbc_conda
conda activate fedulov_rrbs
for sample in `ls /gpfs/data/cbc/fedulov_alexey/porcine_rrbs/trimmomatic/*_tr.fq.gz`
do
trim_qc_dir="/gpfs/data/cbc/fedulov_alexey/porcine_rrbs"
fastqc -o ${trim_qc_dir}/trimmomatic_qc $sample
done
```

**Script 4: Alignment**

```
#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH -N 1
#SBATCH -n 16
#SBATCH -J bismark_align_update_redo
#SBATCH --mem=160GB
#SBATCH --mail-type=ALL
#SBATCH --mail-user=jordan_lawson@brown.edu
#SBATCH --array=1-18
#SBATCH -e /gpfs/data/cbc/fedulov_alexey/porcine_rrbs/logs/bismark_align_%a_%A_%j.err
#SBATCH -o /gpfs/data/cbc/fedulov_alexey/porcine_rrbs/logs/bismark_align_%a_%A_%j.out

source /gpfs/runtime/cbc_conda/bin/activate_cbc_conda
conda activate fedulov_rrbs
input=($(ls /gpfs/data/cbc/fedulov_alexey/porcine_rrbs/trimmomatic/*_tr.fq.gz)) # using the round brackets indicates that this is a bash array
bismark -o /gpfs/data/cbc/fedulov_alexey/porcine_rrbs/alignments --bowtie2 --genome /gpfs/data/shared/databases/refchef_refs/S_scrofa/primary/bismark_index --un --pbat ${input[$((SLURM_ARRAY_TASK_ID -1))]}
```

## Problems with the Naive Approach 

Using multiple shell scripts to create a makeshift pipeline will work, but it is **inefficient**, can **get complicated fast**, and there are a few **challenges you have to manage**, such as: 

* Making sure you have the appropriate software and all dependencies for each step in the analysis - this can be a lot to stay on top of if you have a pipeline with a lot of steps! (imagine a 10 step process)
* **Portability!** Building and running on different machines is much more work
* Specifying where your output will go 
* Calling in the appropriate input (which is often the output from a previous step) 
* Handling where log files go 
* More labor intensive - we have to stay on top of jobs, monitor when each step finishes, and then launch the next one
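
Much of this bookkeeping comes down to knowing which step produces the files another step needs. The sketch below shows, in plain Python, the kind of ordering a workflow manager works out for you from declared inputs and outputs. The step and file names are hypothetical stand-ins for the four scripts above, and this is an illustration of the idea rather than how Snakemake or Nextflow are implemented.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each step lists the files it reads and the files it writes.
# Step and file names are hypothetical stand-ins for the four scripts above.
STEPS = {
    "fastqc_raw":     {"inputs": ["raw.fastq.gz"],  "outputs": ["raw_fastqc.html"]},
    "trim":           {"inputs": ["raw.fastq.gz"],  "outputs": ["trimmed.fq.gz"]},
    "fastqc_trimmed": {"inputs": ["trimmed.fq.gz"], "outputs": ["trimmed_fastqc.html"]},
    "align":          {"inputs": ["trimmed.fq.gz"], "outputs": ["aligned.bam"]},
}


def run_order(steps: dict) -> list:
    """Return an order in which every step runs after the steps that produce its inputs."""
    # Map each output file to the step that creates it.
    producer = {out: name for name, spec in steps.items() for out in spec["outputs"]}
    # For each step, collect the steps it depends on (inputs produced by another step).
    graph = {
        name: {producer[f] for f in spec["inputs"] if f in producer}
        for name, spec in steps.items()
    }
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    print(run_order(STEPS))
```

With shell scripts, you carry this ordering in your head and submit each job by hand; a workflow manager derives it from the inputs and outputs you declare.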

## A Smarter Approach: Using Workflow Managers! 

Workflow management tools, such as Snakemake and Nextflow, let you process your data much more efficiently and handle the issues listed above for you. Let's now learn how to use Snakemake and Nextflow...

## Next up..

<img src="./img/nextflow.png" width="500"/>

### Step 1: The Setup

Let's first discuss setting up our environment on OSCAR so that we can get Nextflow up and running. 

**At this point, I am going to open my terminal on Open OnDemand (OOD) so that I can walk you through and show you how each of these steps and files below look. Feel free to open your terminal as well and follow along. To do so, you can go to OOD at https://ood.ccv.brown.edu and under the Clusters tab at the top select the >_OSCAR Shell Access option. All files used today can be found on GitHub in the folder at: https://github.com/compbiocore/workflows_on_OSCAR**

#### Step 1a: Set up Nextflow Configuration Script using `compbiocore/workflows_on_OSCAR`:

```bash
[pcao5@node1322 ~]$ cd ~/
[pcao5@node1322 ~]$ git clone https://github.com/compbiocore/workflows_on_OSCAR.git
[pcao5@node1322 ~]$ git clone https://github.com/compbiocore/workflows_workshop.git
```

#### Step 1b: Install `compbiocore/workflows_on_OSCAR` package:

```bash
bash ~/workflows_on_OSCAR/install_me/install.sh && source ~/.bashrc
```


For the 1st installation prompt, input `Nextflow`:

```bash
Welcome to a program designed to help you easily set up and run workflow management systems on OSCAR!

Please type which software you wish to use: Nextflow or Snakemake? Nextflow
```

For the 2nd installation prompt, input your GitHub username (e.g., `paulcao-brown`):

```bash
Nextflow software detected, initializing configuration...
What is your GitHub user name? paulcao-brown
What is your GitHub token (we will keep this secret) - [Hit Enter when Done]?
```

#### Step 1c: Create a new GitHub Token and enter it:
<img src="https://i.imgur.com/GBGDQhY.png"/>

#### Step 1d: Complete the Installation 


```bash
Currently the Nextflow default for HPC resources is: memory = 5.GB time = 2.h cpus = 2 
Do you want to change these default resources for your Nextflow pipeline [Yes or No]? No
Keeping defaults!

OUTPUT MESSAGE:

                ******************************************************************
                 NEXTFLOW is now set up and configured and ready to run on OSCAR!
                ******************************************************************
                

Your default resources for Nextflow are: memory = 5.GB time = 2.h cpus = 2 


                To further customize your pipeline for efficiency, you can enter the following 
                label '<LabelName>' options right within processes in your Nextflow .nf script:
                1. label 'OSCAR_small_job' (comes with memory = 4 GB, time = 1 hour, cpus = 1)
                2. label 'OSCAR_medium_job' (comes with memory = 8 GB, time = 16 hours, cpus = 2)
                3. label 'OSCAR_large_job' (comes with memory = 16 GB, time = 24 hours, cpus = 2)
                

README:

Please see https://github.com/compbiocore/workflows_on_OSCAR for further details on how to add the above label options to your workflow.

Note the setup is designed such that pipelines downloaded from nf-core with their own resource specs within the .nf script will override your defaults.

To run Nextflow commands, you must first type and run the nextflow_start command.

To further learn how to easily run your Nextflow pipelines on OSCAR, use the Nextflow template shell script located in your ~/nextflow_setup directory.
```

### Step 4: Selecting An nf-core Pipeline

nf-core has many analysis pipelines we can use, so we need to identify the specific pipeline that is appropriate for our needs. We are once again using the RRBS example that we started with, so we need a pipeline that is appropriate for processing RRBS data. Heading over to https://nf-co.re/pipelines we can see that the **methylseq** pipeline will work for our data. We can view the details of this pipeline, such as all of its arguments and the steps it performs, by clicking on its link or visiting https://nf-co.re/methylseq 

**Note:** Visiting the pipeline's page and reviewing the pipeline's documentation is important because we need to know what arguments to use to make it run correctly. Once we've reviewed this and have an idea of how we need to run things, we can move to the final step. 

#### Step 4a. Running methylseq

We want to take advantage of the fact that virtually all of the `nf-core` pipelines ship with a test profile: a small run with preset inputs and a preset samplesheet that we can explore and later reuse as templates for launching our real runs.  


```bash
nextflow run nf-core/methylseq -profile test,singularity --outdir methylseq_out
```

![](https://i.imgur.com/UGXkZAZ.png)

#### Step 4b. Inspecting the Exact Commands Run:

The parameters and inputs used by the test profile are listed in `conf/test.config` (https://github.com/nf-core/methylseq/blob/master/conf/test.config):
```
 // Input data
    input = "$test_data_base/samplesheet/samplesheet_test.csv"

    // Genome references
    fasta = "$test_data_base/reference/genome.fa"
    fasta_index = "$test_data_base/reference/genome.fa.fai"
```

https://github.com/nf-core/test-datasets/tree/methylseq/samplesheet/samplesheet_test.csv:
![](https://i.imgur.com/JHFvh3B.png)

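```bash
# Download the test inputs so we can reuse the command and the original inputs locally as templates for our own workflow.
# Use the raw.githubusercontent.com URLs: the github.com/.../tree/... pages return HTML, not the files themselves.
wget https://raw.githubusercontent.com/nf-core/test-datasets/methylseq/samplesheet/samplesheet_test.csv
wget https://raw.githubusercontent.com/nf-core/test-datasets/methylseq/reference/genome.fa
wget https://raw.githubusercontent.com/nf-core/test-datasets/methylseq/reference/genome.fa.fai

# Keep the singularity profile so the tools still run inside containers, as in the test run above.
nextflow run nf-core/methylseq -profile singularity --input samplesheet_test.csv --fasta genome.fa --fasta_index genome.fa.fai --outdir methylseq_out2
```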


We can also rely on nf-core's own documentation of what each parameter means; the parameters are all well documented.

Take a look here: https://nf-co.re/methylseq/2.3.0/parameters
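
When moving from the test profile to real data, the samplesheet is the main file you write yourself. Below is a minimal sketch of generating one, assuming single-end reads and a `sample,fastq_1,fastq_2` header. The column names, the helper name `build_samplesheet`, and the `/data/...` paths are assumptions for illustration, so check them against the test samplesheet and the parameter documentation before using this.

```python
import csv
import io
from pathlib import PurePosixPath


def build_samplesheet(fastq_paths: list) -> str:
    """Return CSV text with one row per single-end FASTQ file."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # Assumed header; confirm it against the pipeline's test samplesheet.
    writer.writerow(["sample", "fastq_1", "fastq_2"])
    for path in sorted(fastq_paths):
        filename = PurePosixPath(path).name
        sample = filename.split(".")[0]       # "sample_1.fq.gz" -> "sample_1"
        writer.writerow([sample, path, ""])   # fastq_2 stays empty for single-end reads
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical input paths, for illustration only.
    print(build_samplesheet(["/data/sample_2.fq.gz", "/data/sample_1.fq.gz"]))
```

The returned text can be written to a file and passed to the pipeline with `--input`.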

#### Step 4c. Once finished, inspect the test workflow results at `$OUT_DIR/pipeline_info`:

![](https://i.imgur.com/bG7Yh8e.png)


Inspect the full result here: https://nf-co.re/methylseq/results#methylseq/results-93bc5811603c287c766a0ff7e03b5b41f4483895/bismark/pipeline_info/pipeline_dag_2022-12-17_00-46-10.html

##### Audit Log of Every Step (with Command):
![](https://i.imgur.com/qdJw4qT.png)


##### Trace Log: 
![](https://i.imgur.com/YsjYMNe.png)
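
The trace file is plain tab-separated text, so it is easy to summarize with a short script. This is a minimal sketch, assuming the trace has a `status` column. The example rows are made up for illustration (a real trace has more columns), and `status_counts` is a hypothetical helper name.

```python
import csv
import io
from collections import Counter


def status_counts(trace_text: str) -> Counter:
    """Count tasks per value of the 'status' column in a tab-separated trace."""
    reader = csv.DictReader(io.StringIO(trace_text), delimiter="\t")
    return Counter(row["status"] for row in reader)


# Made-up rows for illustration only; a real trace file has more columns.
EXAMPLE_TRACE = (
    "task_id\tname\tstatus\texit\n"
    "1\tFASTQC (sample_1)\tCOMPLETED\t0\n"
    "2\tTRIMGALORE (sample_1)\tCOMPLETED\t0\n"
    "3\tBISMARK_ALIGN (sample_1)\tFAILED\t1\n"
)

if __name__ == "__main__":
    for status, count in status_counts(EXAMPLE_TRACE).items():
        print(status, count)
```

To use it on a real run, read the trace file from the `pipeline_info` folder into a string and pass it to `status_counts`.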

#### Step 4d. Adapt only a task from `nf-core/methylseq`

Suppose you like `methylseq` but would like to adapt, customize, or incorporate just one of its tasks in your own script while leveraging `nf-core`'s configuration.

We will try this with the `bismark` alignment process.

All of the Nextflow processes are defined under `modules/nf-core/$PROCESS` in the pipeline's repository:

##### https://github.com/nf-core/methylseq/blob/master/modules/nf-core/bismark/align/main.nf:

```groovy
process BISMARK_ALIGN {
    container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
        'https://depot.galaxyproject.org/singularity/bismark:0.24.0--hdfd78af_0' :
        'quay.io/biocontainers/bismark:0.24.0--hdfd78af_0' }"
    
    """
    bismark \\
        $fastq \\
        --genome $index \\
        --bam \\
        $args
    cat <<-END_VERSIONS > versions.yml
    "${task.process}":
        bismark: \$(echo \$(bismark -v 2>&1) | sed 's/^.*Bismark Version: v//; s/Copyright.*\$//')
    END_VERSIONS
    """
    ...
}
```

We can simply adapt this task into our toy workflow:

#### bismark_copycat.nf:

```groovy
process BISMARK_ALIGN {
    // Reuse the pre-built Bismark container that nf-core uses
    container 'https://depot.galaxyproject.org/singularity/bismark:0.24.0--hdfd78af_0'

    output:
    stdout

    script:
    """
    # Placeholder command that confirms Bismark runs inside the container.
    # To-do: wire in the FASTQ input and genome index, as in the nf-core module.
    bismark -v
    """
}

workflow {
    BISMARK_ALIGN() | view
}
```

#### run bismark_copycat.nf:

```bash
nextflow run bismark_copycat.nf
```

#### Step 5. Building Your Own Docker Container

You don't have to rely only on pre-built containers; you can also build your own, either from scratch or on top of existing containers.

##### Install `salmon` in a Debian-based container

###### salmon_Dockerfile:
```dockerfile
FROM debian:bullseye-slim

RUN apt-get update && apt-get install -y curl cowsay
RUN curl -sSL https://github.com/COMBINE-lab/salmon/releases/download/v1.5.2/salmon-1.5.2_linux_x86_64.tar.gz | tar xz \
&& mv /salmon-*/bin/* /usr/bin/ \
&& mv /salmon-*/lib/* /usr/lib/
```

###### Build the Docker image (targeting `linux/amd64` for OSCAR and tagged as `salmon:latest`):
```bash
docker build -f salmon_Dockerfile -t salmon:latest --platform linux/amd64 .
```

###### Upload to Your DockerHub:
```bash
docker tag salmon:latest $DOCKER_USER/salmon:latest
docker push $DOCKER_USER/salmon:latest
```

###### Example Output:
![](https://i.imgur.com/OYPbYSs.png)

###### salmon.nf:

```groovy
process salmon {
    container 'cowmoo/salmon:latest'

    output:
     stdout

    script:
    """
    salmon -h
    """
}

workflow {
  salmon() | view
}
```

###### run salmon.nf:

```bash
nextflow run salmon.nf
```

![](https://i.imgur.com/Zt5iMci.png)

## Additional Nextflow Patterns:

- Conditionals: https://github.com/stevekm/nextflow-demos/blob/master/conditional-execution/main.nf#L53 (demonstrates that you can put if/switch statements in your tasks)


- Making a Samplesheet: https://github.com/stevekm/nextflow-demos/blob/master/parse-samplesheet/main.nf
  - See the samplesheet: https://github.com/stevekm/nextflow-demos/blob/master/parse-samplesheet/samples.analysis.tsv
  - See how a samplesheet `Channel` is created: https://github.com/stevekm/nextflow-demos/blob/master/parse-samplesheet/main.nf#L3
  - See how a downstream task consumes each row (representing a sample) from the original samplesheet: https://github.com/stevekm/nextflow-demos/blob/master/parse-samplesheet/main.nf#L40


- More Patterns Here: https://github.com/nextflow-io/patterns

## Tutorial: Using workflow management tools on OSCAR 

Workflow management tools are software that allow you to write more efficient, portable computational pipelines. As a result, you are able to optimize your workflows while maintaining reproducibility and rigor. Note that there are many workflow management tools available to researchers, but the two tools we focus on today are **Snakemake** and **Nextflow**. Having covered Nextflow above, let's now turn to Snakemake...

<img src="./img/snakemake.png" width="500"/>

### Step 1: The Setup

Let's first discuss setting up our environment on OSCAR so that we can get Snakemake up and running. 

**At this point, I am going to open my terminal on Open OnDemand (OOD) so that I can walk you through and show you how each of these steps and files below look. Feel free to open your terminal as well and follow along. To do so, you can go to OOD at https://ood.ccv.brown.edu and under the Clusters tab at the top select the >_OSCAR Shell Access option. All files used today can be found on OSCAR in the folder at: /gpfs/data/shared/bootcamp_2022**

Once you are on OSCAR, there are a few ways to run Snakemake. For example, one could set up Snakemake through Conda environments or you could use singularity containers. The details of these tools (i.e., Conda environments and singularity containers) are beyond the scope of this workshop, but here are some helpful links to get you started learning more about these tools, should you be interested: 

<u>Conda and Conda environments:</u> 

https://docs.conda.io/projects/conda/en/latest/user-guide/concepts/environments.html <br>

https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html <br>


<u> Singularity containers: </u> 

https://sylabs.io/guides/3.5/user-guide/introduction.html <br>

https://blogs.iu.edu/ncgas/2021/04/29/a-quick-intro-to-singularity-containers/ <br>


For this workshop, the specific approach we are going to take is to set up a Python virtual environment (essentially just an isolated environment containing the few pieces of software we need to get up and running) and then, within this environment, tell Snakemake to download and run specific pre-built Singularity containers for each of the different data processing steps (known as Snakemake rules). Singularity containers are self-contained images that package all the software needed for a workflow step and thus enable reproducibility. They resemble virtual machines in that they isolate software, but they are much lighter because they share the host's operating system kernel. They are widely used on HPC systems because, unlike Docker, they can run without root privileges.

So now let's get started with our setup. First, ssh into OSCAR and, wherever you like (and have enough storage space), create a folder called **snakemake**. To do this, simply type ```mkdir snakemake```. 

Now enter this folder using ```cd /path/to/snakemake```. Inside it we will set up our virtual environment so that we can access and run Snakemake on OSCAR. To set up our virtual environment, we use the following script (env.sh): 

<br> 

```
module load python/3.9.0    # load version of Python needed 
virtualenv snakemake_env     # create environment 
source snakemake_env/bin/activate    # activate environment
pip3 install snakemake    # install snakemake using pip
deactivate    # deactivate and exit, we will use again later!
```
    
**Note:** Save this script as env.sh in your snakemake folder and run it using ```bash env.sh```. This will create a snakemake_env folder in that directory. 

After this code has been run, you can now at any point type ```source /path/to/environment/environment_folder_name/bin/activate``` and you will enter an environment that has snakemake software installed and ready to be called for use. Now let's move onto setting up the specific pieces of input that Snakemake needs to run on OSCAR, our HPC cluster. 
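
If you are ever unsure whether the environment is active, a quick check from Python can help. This is a minimal sketch using only the standard library; `in_virtual_environment` is a hypothetical helper name.

```python
import sys


def in_virtual_environment() -> bool:
    """True when the interpreter is running from a virtual environment."""
    # In an activated environment, sys.prefix points at the environment folder,
    # while sys.base_prefix still points at the Python it was created from.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


if __name__ == "__main__":
    print("virtual environment active:", in_virtual_environment())
    print("interpreter prefix:", sys.prefix)
```

After activating the environment, the reported prefix should point at your snakemake_env folder.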

### Step 2: Creating the Snakefile 

To start using Snakemake on any platform, the first thing one must do is create a **Snakefile**. A Snakefile is a file that defines a Snakemake workflow in terms of rules that are to be carried out in a specific order to complete a pipeline. This file determines the entire flow of your data analysis pipeline, specifying the rules to be carried out and their respective inputs and outputs. We can name this file whatever we like, as long as we tell our Snakemake program where to find it; however, by convention, we usually name this file Snakefile (with no extension!). 

I create a file called Snakefile and store this file in my newly created **snakemake** folder. Drawing on the same RRBS example we started with, the Snakefile would be: 

**Snakefile**

```python
## Snakefile for BootCamp ##

# Specify configuration file to use - optional
# configfile: "/path/to/file/config.yaml"

# Define sample to iterate across data 
sample=["sample_1", "sample_2", "sample_3", "sample_4"]

# Create rules
rule all:
    message: "All done....!"
    input:
        # FastQC outputs are requested here so that the fastqc rule is actually run
        expand("/gpfs/data/ccvstaff/jlawson9/bootcamp_2022/snakemake/fastqc/{sample}_fastqc.zip", sample=sample),
        expand("/gpfs/data/ccvstaff/jlawson9/bootcamp_2022/snakemake/fastqc/{sample}_fastqc.html", sample=sample),
        expand("/gpfs/data/ccvstaff/jlawson9/bootcamp_2022/snakemake/trim/{sample}_trimmed.fq.gz", sample=sample),
        expand("/gpfs/data/ccvstaff/jlawson9/bootcamp_2022/snakemake/trim/{sample}.fq.gz_trimming_report.txt", sample=sample),
        expand("/gpfs/data/ccvstaff/jlawson9/bootcamp_2022/snakemake/alignment/{sample}_bismark_bt2_pe.bam", sample=sample),
        expand("/gpfs/data/ccvstaff/jlawson9/bootcamp_2022/snakemake/alignment/{sample}_bismark_bt2_PE_report.txt", sample=sample),
        expand("/gpfs/data/ccvstaff/jlawson9/bootcamp_2022/snakemake/alignment/{sample}_bismark_bt2_pe.nucleotide_stats.txt", sample=sample)


rule fastqc:
    message: "Running FastQC..."
    input:
        rawread=expand("/gpfs/data/ccvstaff/jlawson9/bootcamp_2022/data_new/{sample}.fq.gz", sample=sample)
    output:
        # FastQC names its outputs <sample>_fastqc.zip and <sample>_fastqc.html
        zip=expand("/gpfs/data/ccvstaff/jlawson9/bootcamp_2022/snakemake/fastqc/{sample}_fastqc.zip", sample=sample),
        html=expand("/gpfs/data/ccvstaff/jlawson9/bootcamp_2022/snakemake/fastqc/{sample}_fastqc.html", sample=sample)
    singularity: "library://ftabaro/default/methylsnake"
    shell: "fastqc {input.rawread} -o /gpfs/data/ccvstaff/jlawson9/bootcamp_2022/snakemake/fastqc"

rule trim:
    message: "Performing reads trimming..."
    input:
        rawreads=expand("/gpfs/data/ccvstaff/jlawson9/bootcamp_2022/data_new/{sample}.fq.gz", sample=sample)
    output:
        trimmed=expand("/gpfs/data/ccvstaff/jlawson9/bootcamp_2022/snakemake/trim/{sample}_trimmed.fq.gz", sample=sample),
        # Trim Galore names the report after the full input file name (here <sample>.fq.gz)
        report=expand("/gpfs/data/ccvstaff/jlawson9/bootcamp_2022/snakemake/trim/{sample}.fq.gz_trimming_report.txt", sample=sample)
    params:
        quality_filter_value="22",
    threads: 4
    singularity: "library://ftabaro/default/methylsnake"
    shell: "trim_galore --quality {params.quality_filter_value} --phred33 --output_dir /gpfs/data/ccvstaff/jlawson9/bootcamp_2022/snakemake/trim --gzip --rrbs --fastqc --cores {threads} {input.rawreads}"

rule bismark_align:
    message: "Performing alignment..."
    input:
        trimreads=expand("/gpfs/data/ccvstaff/jlawson9/bootcamp_2022/snakemake/trim/{sample}_trimmed.fq.gz", sample=sample)
    output:
        # Outputs are declared in the alignment folder, matching the -o option below.
        # NOTE: Bismark names its outputs after the input file name and the run mode
        # (single- vs paired-end), so check these names against a real run and adjust them if they differ.
        bam=expand("/gpfs/data/ccvstaff/jlawson9/bootcamp_2022/snakemake/alignment/{sample}_bismark_bt2_pe.bam", sample=sample),
        bisreport=expand("/gpfs/data/ccvstaff/jlawson9/bootcamp_2022/snakemake/alignment/{sample}_bismark_bt2_PE_report.txt", sample=sample),
        stats=expand("/gpfs/data/ccvstaff/jlawson9/bootcamp_2022/snakemake/alignment/{sample}_bismark_bt2_pe.nucleotide_stats.txt", sample=sample)
    threads: 4
    singularity: "library://ftabaro/default/methylsnake"
    shell: "bismark -o /gpfs/data/ccvstaff/jlawson9/bootcamp_2022/snakemake/alignment --bowtie2 --genome /gpfs/data/ccvstaff/jlawson9/bootcamp_2022/index --un --pbat {input.trimreads}"
```
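
The `expand(...)` calls in the Snakefile simply turn one filename pattern into a list of concrete filenames, one per sample. The plain-Python sketch below mimics that behaviour to make it concrete. `expand_sketch` is a hypothetical helper written for illustration, not Snakemake's own implementation, and the shortened `trim/` path is only an example.

```python
from itertools import product


def expand_sketch(pattern: str, **wildcards) -> list:
    """Fill a pattern with every combination of the given wildcard values."""
    names = list(wildcards)
    combos = product(*(wildcards[name] for name in names))
    return [pattern.format(**dict(zip(names, combo))) for combo in combos]


if __name__ == "__main__":
    samples = ["sample_1", "sample_2"]
    # One output path per sample, just like the expand() calls in the Snakefile.
    for path in expand_sketch("trim/{sample}_trimmed.fq.gz", sample=samples):
        print(path)
```

Because every rule in the Snakefile uses `expand` for both inputs and outputs, each rule runs once and processes all samples together.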

### Step 3: Telling Snakemake how to run on the HPC Cluster  

After defining the rules and analysis steps for our workflow via the Snakefile, we now must set up the workflow to be compatible with our HPC cluster by telling it how to assign resources to each rule. This is done by creating a YAML file called **cluster.yaml** (really it can be called anything, as long as the extension is .yaml). This file can be stored anywhere, as long as you tell the Snakemake program where to find it when it's run (we will see this later). For simplicity, though, it's best to create and save it in the same place as your Snakefile. We specify the cluster.yaml file as follows: 

**cluster.yaml**

```yaml
__default__:
  partition: "batch"
  cpus: "1"
  time: 60
  mem: "4g"
  email: jordan_lawson@brown.edu  
  email_type: "ALL"
```

In the above, we create a default specification, where each rule, by default, will get this resource allocation when run. 
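
To see how these values end up in a job submission, the sketch below fills an `sbatch` command template from the defaults, much as Snakemake fills the `{cluster.time}`-style placeholders in the `--cluster` string used in Step 4. It is an illustration rather than Snakemake's actual code: the config is written as a Python dict to avoid needing a YAML parser, and the `bismark_align` entry is a hypothetical addition showing how a rule-specific section could take precedence over `__default__`.

```python
# Mirrors cluster.yaml above, written as a dict so the sketch needs no YAML parser.
CLUSTER_CONFIG = {
    "__default__": {"partition": "batch", "cpus": "1", "time": 60, "mem": "4g"},
    # Hypothetical per-rule override; not part of the workshop's cluster.yaml.
    "bismark_align": {"cpus": "4", "mem": "16g"},
}

SBATCH_TEMPLATE = "sbatch -t {time} --mem={mem} -c {cpus}"


def sbatch_command(rule: str, config: dict = CLUSTER_CONFIG) -> str:
    """Merge the defaults with any rule-specific settings and fill the template."""
    settings = {**config["__default__"], **config.get(rule, {})}
    return SBATCH_TEMPLATE.format(**settings)


if __name__ == "__main__":
    print(sbatch_command("trim"))            # falls back to the defaults
    print(sbatch_command("bismark_align"))   # uses the hypothetical override
```

Rules without their own entry inherit everything from `__default__`, which is why a single default block is enough to get started.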

### Step 4: Bringing it all together with Shell Scripting 

Now here is the step where we tie everything together, running our Snakefile and cluster specification all with one script, a shell script. To run everything on OSCAR, we can use the following script: 

```
#!/bin/bash

###############################
#                             #
#       Snakemake             #
#                             #
###############################

##### 1.) Job Submission Options ######

# Change these as needed 

#SBATCH -t 24:00:00
#SBATCH -n 2
#SBATCH -J snake_test
#SBATCH --mem=16GB
#SBATCH --mail-type=ALL
#SBATCH --mail-user=jordan_lawson@brown.edu

###### 2.) Run snakemake #####

# Activate virtual environment 
source snakemake_env/bin/activate

# Run snakemake - note you will need a cluster.yaml file as this command references one!
snakemake --use-singularity --singularity-args "-B /gpfs/data/ccvstaff/jlawson9/bootcamp_2022/:/gpfs/data/ccvstaff/jlawson9/bootcamp_2022/" -s /gpfs/data/ccvstaff/jlawson9/bootcamp_2022/snakemake/Snakefile --cluster-config /gpfs/data/ccvstaff/jlawson9/bootcamp_2022/snakemake/cluster.yaml --latency-wait 60 --cluster 'sbatch -t {cluster.time} --mem={cluster.mem} -c {cluster.cpus} --mail-type {cluster.email_type} --mail-user {cluster.email}' -j 10
```

I save this script as **snakemake.sh** and place it in the snakemake folder with the other files I recently created. Once we have this with everything else, we can run everything on the cluster by typing ```sbatch snakemake.sh```

Note in the above that we are running the snakemake program with the Snakefile and cluster.yaml files we specified. Some other important things: 

- Your Slurm log file will automatically be placed in the directory that you ran the snakemake.sh script from. However, you can also get more detailed log files for each Snakemake rule by adding the ```log:``` directive, followed by the path you want the log file to go to, to each rule in the Snakefile and redirecting that rule's command output to ```{log}```

- We are using the --use-singularity argument in the above shell script to allow Snakemake to use a Singularity container when running pipelines; this is highly recommended for reproducibility. 

- When using the singularity argument, **it is very, very important** that you include the ```--singularity-args "-B /path/on/host:/path/in/container"``` argument, or else your pipeline will fail to recognize any inputs and outputs you are specifying in the workflow 

## A Few Closing Thoughts.....

This workshop provided just an introductory overview of running workflow management tools on OSCAR. There is much more customization you can do and many more advanced things you can accomplish. Some of these are: 

* Resource allocation by rule (or analysis step) 
* Using a config.yaml file to handle your samples and how you iterate through files
* Running pipelines that skip steps and start at a specific step (for example, pipelines that fail at a certain step, you may not want to repeat everything but instead pick up where you left off)
* Handling log files within specific steps so that you get detailed output for each rule 
* And much more....
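
As a small step toward handling samples without hard-coding them, sample names can be derived from the files themselves. The sketch below is one possible approach, assuming input files end in `.fq.gz` as in the Snakefile above; `sample_names` and the `/data/...` paths are hypothetical and used only for illustration.

```python
from pathlib import PurePosixPath


def sample_names(paths: list, suffix: str = ".fq.gz") -> list:
    """Return sorted, de-duplicated sample names for files ending in `suffix`."""
    names = set()
    for path in paths:
        filename = PurePosixPath(path).name
        if filename.endswith(suffix):
            names.add(filename[: -len(suffix)])   # strip the suffix to get the sample name
    return sorted(names)


if __name__ == "__main__":
    # Hypothetical file listing, for illustration only.
    files = ["/data/sample_2.fq.gz", "/data/sample_1.fq.gz", "/data/readme.txt"]
    print(sample_names(files))
```

The resulting list could replace the hand-written `sample=[...]` line in the Snakefile, or be written into a config.yaml file.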