<h1><center>Basic Bioinformatics Workflows on OSCAR with Nextflow</center></h1>
 <center>Center for Computation and Visualization</center>
 <center>Center for Computational Biology of Human Disease - Computational Biology Core</center>

## What is Nextflow?  

Nextflow is a workflow management tool that allows users to easily write data-intensive computational **pipelines**. These pipelines, or workflows as they are also called, have the following key features:

- Sequential processing of files
- Usually requires more than one tool
- Multiple programming languages
- Most times each sample is processed individually
- Compute resource intensive
  - Alignment could take 16 CPUs, 60 GB of RAM, 4-24 hours, and 30 GB of disk space per sample

## Why do we care about these pipelines? 

### Reason 1: Reproducibility 

<img src="./img/reproduce.png" width="700"/>

The journal Nature published a survey that found that more than 70% of researchers have tried and failed to reproduce another scientist's experiments. This trend is hugely problematic because we then can't trust the findings from many studies enough to use them to make data-driven decisions. In short, we need tools and standards that help address the reproducibility crisis in science! 

Pipelines created with workflow managers such as Snakemake and Nextflow incorporate version control and software tools known as containers to manage all software dependencies and create stable environments that can be trusted to reproduce analyses reliably and accurately. 

***Reproducibility is important for producing trustworthy research***

### Reason 2: Portability

#### What if we need to perform analyses with more resources?

This type of scenario would require us to move our analyses to a different environment, for example, our High Performance Computing (HPC) cluster environment, OSCAR. 

An important feature of the Nextflow workflow management tool is that it enables users to take any pipeline written on a personal computer and run it on an HPC cluster such as OSCAR. So now we can run our pipelines using high performance resources without having to change workflow definitions or hard-code a pipeline to a specific setup. As a result, **the code stays constant** across varying infrastructures, which allows portability and easy collaboration and avoids lock-in. In short, we can easily move our multi-step analyses (i.e., pipelines) to any place we need them!

***However, the trick is that setting up Nextflow on OSCAR requires a bit of configuration, and many users at Brown aren't comfortable using the terminal, setting up software, and dealing with software dependencies. So we want not only to offer users this tool and its benefits but also to make its setup and use on OSCAR more user-friendly (and a lot better documented!).***

## First Let's See How All This Works

### Our Starting Point

Say that we have a basic RNA-Seq pipeline that we need to run on OSCAR with the following set of actions: 

<img src="./img/workflow2.png" width="700"/>

<h2><center>What do you do??</center></h2>

## A Naive Approach

One solution would be to write a bunch of shell scripts that use various software tools to process the data in the ways we need. 

For example, if we needed to index a genome, perform an alignment, and then do transcript assembly, we could just write a shell script for each step - so a total of 3 shell scripts in this case - where we have various inputs and outputs and call different tools for each step. In this case, at a minimum, the tools needed would be: "bowtie2", "TopHat2", and "cufflinks." Here is a rough example of what these scripts might look like. 

**Script 1: Indexing**

```
#!/bin/bash
#SBATCH -t 48:00:00
#SBATCH -n 32
#SBATCH -J bowtie_index
#SBATCH --mem=198GB
#SBATCH --mail-type=ALL
#SBATCH --mail-user=jordan_lawson@brown.edu

source /gpfs/runtime/cbc_conda/bin/activate_cbc_conda
conda activate some_environment_with_bowtie2 
genome_dir="/gpfs/data/cbc/project_folder" 
bowtie2-build "${genome_dir}/reference_sequence.fasta" index_name 
```

**Script 2: Alignment** 

```
#!/bin/bash
#SBATCH -t 48:00:00
#SBATCH -n 32
#SBATCH -J align_rna
#SBATCH --mem=198GB
#SBATCH --mail-type=ALL
#SBATCH --mail-user=jordan_lawson@brown.edu

source /gpfs/runtime/cbc_conda/bin/activate_cbc_conda
conda activate some_environment_with_Tophat
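for read1 in /gpfs/data/cbc/project_folder/*_1.fq
do
    # Pair each forward-read file with its reverse-read mate
    read2="${read1%_1.fq}_2.fq"
    sample="$(basename "${read1%_1.fq}")"
    # Use the index prefix built in Script 1 and write each sample to its own folder
    tophat2 --GTF /path/to/bed.gff_file -o "tophat_out/${sample}" index_name "$read1" "$read2"
done 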
```


**Script 3: Assembly**

```
#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH -n 8
#SBATCH -J assembly_rna
#SBATCH --mem=16GB
#SBATCH --mail-type=ALL
#SBATCH --mail-user=jordan_lawson@brown.edu

source /gpfs/runtime/cbc_conda/bin/activate_cbc_conda
conda activate some_environment_with_cufflinks
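annotation_gff_file="/path/to/bed.gff_file"
for bam in /gpfs/data/cbc/project_folder/bam_files/*.bam
do
    sample="$(basename "${bam%.bam}")"
    # Write each sample's assembly to its own folder so results are not overwritten
    cufflinks --no-update-check -q -G "$annotation_gff_file" -o "cufflinks_out/${sample}" "$bam"
done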
```

## Problems with the Naive Approach 

Using multiple shell scripts to create a makeshift pipeline will work, but it is **inefficient**, can **get complicated fast**, and there are a few **challenges you have to manage**, such as: 

* Making sure you have the appropriate software and all dependencies for each step in the analysis - this can be a lot to stay on top of if you have a pipeline with a lot of steps! (imagine a 10 step process)
* **Portability!** Building and running on different machines is much more work
* Specifying where your output will go 
* Calling in the appropriate input (which is often the output from a previous step) 
* Handling where log files go 
* More labor intensive - we have to stay on top of jobs, monitor when each step finishes, and then launch the next one
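
To make the last point concrete, here is a minimal sketch of the bookkeeping needed to chain the three scripts by hand using Slurm job dependencies. The script file names (`index.sh`, `align.sh`, `assembly.sh`) and the `submit` helper are hypothetical, and `DRY_RUN` defaults to printing the `sbatch` commands instead of submitting anything.

```shell
#!/bin/bash
# Sketch: chain three hypothetical job scripts so each starts only after
# the previous one finishes successfully.
# DRY_RUN=1 (the default here) prints the sbatch commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

submit() {
    # $1 = job script, $2 = optional job ID that must succeed first.
    # Prints the new job ID on stdout (a placeholder ID in dry-run mode).
    local script="$1"
    local dep="${2:-}"
    local args=(--parsable)
    if [ -n "$dep" ]; then
        args+=("--dependency=afterok:${dep}")
    fi
    if [ "$DRY_RUN" = "1" ]; then
        echo "sbatch ${args[*]} ${script}" >&2
        echo "dryrun_${script%.sh}"
    else
        sbatch "${args[@]}" "$script"
    fi
}

index_job="$(submit index.sh)"
align_job="$(submit align.sh "$index_job")"
assembly_job="$(submit assembly.sh "$align_job")"
echo "Last job in the chain: ${assembly_job}"
```

Even in this small case we have to track job IDs ourselves, and nothing here handles inputs, outputs, software environments, or restarting after a failure.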

## A Better Approach: Using Workflow Managers 

A workflow management tool such as Nextflow handles the aforementioned issues and processes your data in a much more efficient manner. Let's now learn how to use Nextflow with OSCAR...

<img src="./img/nextflow.png" width="500"/>

### The Setup

The somewhat tricky part is setting up and configuring Nextflow for use on OSCAR. There can be a few challenges, especially for community users who are less familiar with the terminal and with software installation and dependencies. These challenges are: 

1. When running Nextflow, we often do NOT want to re-invent the wheel and so instead we usually pull and execute existing pipelines from either GitHub or from nf-core (a community effort to collect a curated set of pipelines built using Nextflow), which interfaces with GitHub; as a result, the user needs to allow Nextflow access to GitHub. 

2. The user needs to specify the computational resources they wish to allocate to different pipelines or even specific tasks within pipelines and how this will happen (i.e., what workload manager will they be using?). Moreover, there are several sources of configuration specifications that can be used with Nextflow and they can conflict, so how does one resolve and manage this and easily set up the configuration for use with OSCAR? For example, resource specifications can be given to Nextflow from the following configuration sources: 

    a\. Parameters specified on the command line within the `nextflow run` command (note that `#SBATCH` directives in the sbatch script only set the resources for the job that runs the `nextflow` command itself)

    b\. Parameters provided using the `-params-file` option (takes a YAML or JSON file)

    c\. Config file specified using the `-c my_config` option

    d\. The config file named `nextflow.config` in the current directory

    e\. The config file named `nextflow.config` in the workflow project directory

    f\. The config file `$HOME/.nextflow/config`

    g\. Values defined within the pipeline script itself (e.g. `main.nf`)

    When setting up the configuration, Nextflow automatically looks in all of these places and merges the configuration information. The sources above are listed from highest to lowest priority: if two sources set the same option, the value from the higher-ranking source overrides the value from the lower-ranking one.

3. How Singularity containers and their mounting are handled needs to be addressed. Singularity caching is also important, as jobs need enough storage space not only to store containers but also to hold all the temporary files created when converting Docker images into Singularity containers. Any issue here can crash the pipeline and create problems. 

4. The user needs to set things up so that they can clearly manage and keep track of both Slurm job output and pipeline output. 
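
As a small illustration of the precedence described in challenge 2, the sketch below writes two configuration files that disagree about `process.memory`: a `nextflow.config` in the current directory (source d) and a file passed with `-c` (source c). Because a `-c` file ranks above the `nextflow.config` in the current directory, its value should be the one that appears in the merged configuration. The file name `override.config` and the resource values are made up for this example, and the `nextflow config` step only runs if Nextflow is installed.

```shell
#!/bin/bash
# Sketch: two config sources that disagree about one setting.
workdir="$(mktemp -d)"
cd "$workdir" || exit 1

# Lower priority: nextflow.config in the current directory (source d).
cat > nextflow.config <<'EOF'
process.cpus = 2
process.memory = '4 GB'
EOF

# Higher priority: a config file passed with -c (source c).
cat > override.config <<'EOF'
process.memory = '8 GB'
EOF

# Print the merged configuration if Nextflow is available on this machine.
if command -v nextflow >/dev/null 2>&1; then
    nextflow -c override.config config -flat
else
    echo "nextflow not found; config files were written to ${workdir}"
fi
```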

To address these challenges, we created a user-friendly tool (essentially a user-friendly CLI) to assist users in setting up and configuring Nextflow for OSCAR and that automatically handles many of these aforementioned elements. 
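
For orientation, here is a rough sketch of the kind of settings involved in challenges 2 and 3. It is not the file that the setup tool generates. Every value in it (the resource amounts, the queue size, the cache path, and the file name `oscar_sketch.config`) is an assumption chosen for illustration.

```shell
#!/bin/bash
# Sketch only: NOT the configuration produced by the setup tool.
# All paths and values below are assumptions for illustration.
config_file="${CONFIG_FILE:-$(mktemp -d)/oscar_sketch.config}"
cache_dir="${NXF_SINGULARITY_CACHEDIR:-$HOME/scratch/singularity_cache}"

cat > "$config_file" <<EOF
// Submit each pipeline task as its own Slurm job, with default resources
process {
    executor = 'slurm'
    cpus     = 2
    memory   = '8 GB'
    time     = '4h'
}

// Limit how many jobs Nextflow keeps in the Slurm queue at once
executor {
    queueSize = 20
}

// Run tools inside Singularity containers and reuse downloaded images
singularity {
    enabled    = true
    autoMounts = true
    cacheDir   = '${cache_dir}'
}
EOF

echo "Wrote ${config_file}"
```

A file like this would be passed to Nextflow with the `-c` option, so the pipeline code itself does not need to change.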

###  Demo and Walkthrough

For this portion, we will walk through the installation and setup guide to configure Nextflow and also illustrate how to use it on OSCAR. 

If you wish, you can follow along by going to this repo: https://github.com/compbiocore/Workflows_on_OSCAR

All you need to do is git clone the `nextflow_setup` folder into your `HOME` directory and then follow along. 

### Using Nextflow with an nf-core pipeline

nf-core has many analysis pipelines we can use, so we need to identify the specific pipeline that is appropriate for our needs. For example, if we were analyzing RRBS data, we would need a pipeline that is appropriate for processing RRBS data. Heading over to https://nf-co.re/pipelines we can see that the **methylseq** pipeline will work for these types of data. We can view the details of this pipeline, such as all of its arguments and the steps it performs, by clicking on its link or visiting here: https://nf-co.re/methylseq 

**Note:** Visiting the pipeline's page and reviewing the pipeline's documentation is important because we need to know what arguments to use to make it run correctly. Once we've reviewed this and have an idea of how we need to run things, we can use the pipeline. 

We can simply create a bash script called **nextflow.sh** that leverages our Nextflow installation using something similar to this: 

```
#!/bin/bash

###############################
#                             #
#   Nextflow with nf-core     #
#                             #
###############################

##### 1.) Job Submission Options ######

# Can change/add resources as needed 

# Logs
#SBATCH --mail-user=jordan_lawson@brown.edu  
#SBATCH --mail-type=ALL
#SBATCH --output=%x-%j.out

##### 2.) Run Nextflow #####

# Activate Nextflow setup 
source $HOME/nextflow_env_username/bin/activate

# Run nextflow 
nextflow run nf-core/methylseq \
    -profile singularity \
    --input "/gpfs/data/ccvstaff/jlawson9/data_folder/*.fq.gz" \
    --single_end \
    --genome Sscrofa10.2 \
    -c "$HOME/nextflow_setup/nextflow.config" \
    --outdir "$HOME/scratch"
```

Then we simply run everything on the cluster by typing ```sbatch nextflow.sh```

Note in the above that we are running Nextflow with the methylseq pipeline, as specified by the ```nf-core/methylseq``` argument that comes right after the ```nextflow run``` command. This tells Nextflow to fetch and download the methylseq pipeline from nf-core. Some more important notes: 

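- your Slurm log file will automatically be placed in the directory you submit the job from (your HOME directory if you submit from there), unless otherwise specified with the ```#SBATCH --output``` option. Nextflow's own log file (```.nextflow.log```) can be redirected with the ```-log``` option, placed before ```run```, followed by the path where you wish to store the log. 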

- the `-c` flag points Nextflow to the nextflow.config file that was automatically created for you 

- we are using the `-profile singularity` argument in the above shell script so that Nextflow uses Singularity containers when running the pipeline 

- the other arguments are bioinformatics specific and are there to make sure the pipeline runs according to our needs; they were found by visiting the pipeline's documentation page at: https://nf-co.re/methylseq (which you can get to from the nf-core homepage). 
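
One detail that is easy to miss in the command above: options that belong to Nextflow itself use a single dash (`-profile`, `-c`, `-log`), while parameters that belong to the pipeline use two dashes (`--input`, `--genome`, `--outdir`). The sketch below builds the same command from three groups to make this visible, and only prints it. The log path is a hypothetical example, and the array names are just labels chosen for this illustration.

```shell
#!/bin/bash
# Sketch: assemble the methylseq command from its three kinds of arguments.

# Options for the nextflow launcher itself; these go BEFORE `run`.
nextflow_opts=(-log "$HOME/nextflow_logs/methylseq.log")   # hypothetical log path

# Single-dash options for `nextflow run`.
run_opts=(-profile singularity -c "$HOME/nextflow_setup/nextflow.config")

# Double-dash parameters defined by the pipeline (see its documentation page).
pipeline_params=(
    --input "/gpfs/data/ccvstaff/jlawson9/data_folder/*.fq.gz"
    --single_end
    --genome Sscrofa10.2
    --outdir "$HOME/scratch"
)

cmd=(nextflow "${nextflow_opts[@]}" run nf-core/methylseq "${run_opts[@]}" "${pipeline_params[@]}")

# Print the command with shell quoting so it can be inspected first.
printf '%q ' "${cmd[@]}"
echo
# "${cmd[@]}"   # uncomment to actually launch the pipeline
```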

# Current and Next Steps

1.) Implement this same process for Snakemake (do we need/want to?) 

**Quick note about Nextflow vs Snakemake:** Nextflow is very much like Snakemake in that they both serve the same need, so which one you use is largely a matter of preference. However, one very notable difference is that Nextflow has an open-source, community-built repository of bioinformatics pipelines that you can easily download and use for your own data processing needs, so you don't have to write your own pipelines (but you can if you want to!). This resource is called **nf-core** and is a community effort to collect a curated set of pipelines built using Nextflow. This is nice because it already has many of the workflows and analyses that computational biologists use, so we don't have to waste time and effort re-inventing the wheel. Given these advantages and the Core's focus on bioinformatics, we prioritized Nextflow. But I am interested to hear thoughts about adding Snakemake and whether it seems necessary or desired. 

To learn more about nf-core, you can visit its homepage here: 

https://nf-co.re

2.) Creating a GUI for this (Nextflow Tower exists, incorporate this?) and launch with OOD

3.) Add GPU option to configuration so that tasks can be run on GPU node

4.) Should we limit the number of parallel SLURM jobs? Also, what should max resources be when Nextflow is interacting with OSCAR? 

5.) Need to do more to handle working and temporary directories for users so they don't have to do much here. 

6.) Extend the tool so that users can easily run their pipelines with just a few inputs once setup and HPC configuration are done. 

7.) Documentation, documentation, documentation....walk users through configurations and how to use label options to customize workflows even more, singularity containers for custom pipelines, etc. 
