# Nextflow Demo, Hsiao Lab (Sequence Analysis Workshop)
Author: Zohaib Anwar <br />
Date: April 29, 2021

## Setup

Setup instructions are available on the [Nextflow website](https://www.nextflow.io/index.html). <br />
There is only one prerequisite: <br />
* Java (version 8 or higher)


In [1]:
# Check Java version on your system
java -version

openjdk version "11.0.9.1" 2020-11-04 LTS
OpenJDK Runtime Environment Zulu11.43+55-CA (build 11.0.9.1+1-LTS)
OpenJDK 64-Bit Server VM Zulu11.43+55-CA (build 11.0.9.1+1-LTS, mixed mode)


**If the Java version on your system is lower than 8, please use this [link](https://java.com/en/download/help/download_options.html) to install a newer version.**
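
The version check can also be scripted. The sketch below is a minimal example, assuming the two common version string formats (the legacy `1.8.0_281` style and the newer `11.0.9.1` style); the helper name `java_major` is hypothetical and not part of Java or Nextflow.

```shell
# Hypothetical helper: print the major Java version for a version string.
# Assumes either the legacy "1.x.y" scheme (Java 8 and older) or the
# newer "x.y.z" scheme (Java 9 and later).
java_major() {
    local version="$1"
    case "$version" in
        1.*) version="${version#1.}" ;;   # "1.8.0_281" -> "8.0_281"
    esac
    echo "${version%%[!0-9]*}"            # keep only the leading digits
}

java_major "11.0.9.1"    # prints 11
java_major "1.8.0_281"   # prints 8
```

On a real system the version string could be taken from the first line of `java -version 2>&1` (the command writes to standard error), for example with `java -version 2>&1 | head -n 1 | cut -d'"' -f2`.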

In [2]:
# Installation
# curl -s https://get.nextflow.io | bash

In [3]:
# Let's check the Nextflow version
nextflow -v

nextflow version 20.10.0.5430


Let's start with the Nextflow Hello World pipeline.

In [4]:
nextflow run hello

N E X T F L O W  ~  version 20.10.0
Launching `nextflow-io/hello` [grave_shannon] - revision: e6d9427e5b [master]
executor >  local (4)
[a9/cf630d] process > sayHello (2) [100%] 4 of 4 ✔
Hello world!

Bonjour world!

Hola world!

Ciao world!

When no matching _nextflow_ file is available in the current directory, Nextflow looks in the [nextflow-io](https://github.com/nextflow-io/) organization on GitHub for a workflow with that name.

# Introduction

## Basic components

* **Processes**
* **Channels**

In practice a Nextflow pipeline script is made by joining together different processes. Each process can be written in any scripting language that can be executed by the Linux platform (Bash, Perl, Ruby, Python, etc.).

Processes are executed independently and are isolated from each other, i.e. they do not share a common (writable) state. The only way they can communicate is via asynchronous FIFO queues, called channels in Nextflow.

Any process can define one or more channels as input and output. The interaction between these processes, and ultimately the pipeline execution flow itself, is implicitly defined by these input and output declarations.

## Processes

A process may contain five definition blocks: directives, inputs, outputs, a when clause and, finally, the process script. The syntax is defined as follows:

```Nextflow

process < name > {

   [ directives ]

   input:
    < process inputs >

   output:
    < process outputs >

   when:
    < condition >

   [script|shell|exec]:
   < user script to be executed >

}

```

## Channels

Nextflow is based on the Dataflow programming model in which processes communicate through channels.

A channel has two major properties:

* Sending a message is an asynchronous operation which completes immediately, without having to wait for the receiving process.
* Receiving data is a blocking operation which stops the receiving process until the message has arrived.
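
As a loose analogy only (this is not how Nextflow implements channels), a named pipe in the shell shows the same first-in, first-out ordering and blocking receive. The sender is started in the background so that the main script does not wait for it; the temporary directory and the pipe name `channel` are made up for the illustration.

```shell
# Analogy only: a named pipe stands in for a channel.
tmpdir="$(mktemp -d)"
mkfifo "$tmpdir/channel"

# Sender: runs in the background, so the main script continues immediately.
( printf '%s\n' gut liver lung > "$tmpdir/channel" ) &

# Receiver: each `read` blocks until a message is available, and
# messages arrive in the order they were sent (first in, first out).
received=""
while read -r sample; do
    received="$received $sample"
done < "$tmpdir/channel"

wait                        # let the background sender finish
rm -r "$tmpdir"
echo "received:$received"   # prints "received: gut liver lung"
```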

### Channel factory

Channels may be created implicitly by the process output(s) declaration or explicitly using the following channel factory methods.

The available factory methods are:

* create
* empty
* from
* fromPath
* fromFilePairs
* fromSRA
* value
* watchPath


## Scripting language

Nextflow is designed to have a minimal learning curve, without having to pick up a new programming language. In most cases, users can utilise their current skills to develop Nextflow workflows. However, it also provides a powerful scripting DSL.

Nextflow scripting is an extension of the Groovy programming language, which in turn is a super-set of the Java programming language. Groovy can be considered as Python for Java in that it simplifies the writing of code and is more approachable.

## Working Demo

During this tutorial we will implement a proof of concept of an RNA-Seq pipeline which:

* Indexes a transcriptome file.
* Performs quality control.
* Performs quantification.
* Creates a MultiQC report.

Let's start by indexing the transcriptome file in ```$baseDir/data```.

In [5]:
cat 1.indexing.nf

/* 
 * pipeline input parameters 
 */
params.reads = "$baseDir/data/*_{1,2}.fq"
params.transcriptome = "$baseDir/data/transcriptome.fa"
params.multiqc = "$baseDir/multiqc"
params.outdir = "$baseDir/results"

println """\
         R N A S E Q - N F   P I P E L I N E    
         transcriptome: ${params.transcriptome}
         reads        : ${params.reads}
         outdir       : ${params.outdir}
         """
         .stripIndent()

/* 
 * create a transcriptome file object given the transcriptome string parameter
 */
transcriptome_file = file(params.transcriptome)
 
/* 
 * define the `index` process that creates a binary index
 * given the transcriptome file
 */
 
process index {

    tag {"indexing_${params.transcriptome}"}
    
    input:
    file transcriptome from transcriptome_file
     
    output:
    file 'index' into index_ch

    script:       
    """
    salmon index --threads $task.cpus -t $transcriptome -i index
    """
}
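
Nextflow substitutes `$task.cpus` and `$transcriptome` into the script string before the task runs. The shell sketch below reproduces only that substitution step with ordinary shell variables, so the resulting command can be inspected without Salmon installed. It assumes the default of one CPU per task (no `cpus` directive is set) and that the input file is staged under its original name; the variable names `task_cpus` and `index_cmd` are invented for the demo.

```shell
# Stand-ins for the values Nextflow would supply (assumptions for the demo).
task_cpus=1                        # default when no `cpus` directive is set
transcriptome="transcriptome.fa"   # name of the staged input file

# The same template as in the `index` process, with shell variables.
index_cmd="salmon index --threads $task_cpus -t $transcriptome -i index"
echo "$index_cmd"
# prints: salmon index --threads 1 -t transcriptome.fa -i index
```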

In [6]:
nextflow run 1.indexing.nf

N E X T F L O W  ~  version 20.10.0
Launching `1.indexing.nf` [compassionate_wing] - revision: 24486b812a
R N A S E Q - N F   P I P E L I N E    
transcriptome: /Users/au572806/GitHub/Nextflow_demo_HsiaoLab/data/transcriptome.fa
reads        : /Users/au572806/GitHub/Nextflow_demo_HsiaoLab/data/*_{1,2}.fq
outdir       : /Users/au572806/GitHub/Nextflow_demo_HsiaoLab/results

executor >  local (1)
[c5/ed1d79] process > index (indexing_/Users/au57... [100%] 1 of 1 ✔
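
The bracketed prefix in the progress line, such as `[c5/ed1d79]`, is the beginning of the task's directory name under `work/`, where Nextflow keeps the task script (`.command.sh`) and its logs. The sketch below shows one way to resolve such a prefix to the full directory. It runs against a mock directory tree: the hash suffix is made up, and the helper name `task_dir` is hypothetical.

```shell
# Build a mock work directory (the hash suffix below is made up for the demo).
demo_root="$(mktemp -d)"
mkdir -p "$demo_root/work/c5/ed1d79aaaaaaaaaaaaaaaaaaaaaaaaaa"

# Hypothetical helper: resolve a task prefix such as "c5/ed1d79"
# to the first matching directory below the given work root.
task_dir() {
    local work_root="$1" prefix="$2"
    ls -d "$work_root/$prefix"* 2>/dev/null | head -n 1
}

found="$(task_dir "$demo_root/work" "c5/ed1d79")"
echo "${found#"$demo_root"/}"   # path relative to the mock project root
rm -r "$demo_root"
```

In a real project the same lookup would be `ls -d work/c5/ed1d79*` from the directory where the pipeline was launched.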

The next step is to add quality visualization of the reads using ```fastqc``` on the three samples (gut, liver and lung), whose paired-end reads match ```$baseDir/data/*_{1,2}.fq```.
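
To make the pairing concrete, the sketch below mimics in plain shell how files matching `*_{1,2}.fq` are grouped by the part of the file name before `_1`/`_2`, which is the sample ID that `fromFilePairs` emits together with the two read files. It only illustrates the naming convention: the temporary directory and the empty files are created for the demo and are not the workshop data.

```shell
# Create empty stand-ins for the three paired-end samples.
demo_dir="$(mktemp -d)"
for sample in gut liver lung; do
    touch "$demo_dir/${sample}_1.fq" "$demo_dir/${sample}_2.fq"
done

# For every forward read, derive the sample ID and the matching reverse read.
pairs=""
for forward in "$demo_dir"/*_1.fq; do
    sample_id="$(basename "$forward" _1.fq)"
    reverse="$demo_dir/${sample_id}_2.fq"
    if [ -f "$reverse" ]; then
        pairs="$pairs$sample_id: ${sample_id}_1.fq ${sample_id}_2.fq
"
    fi
done

printf '%s' "$pairs"   # one line per sample: ID followed by its two files
rm -r "$demo_dir"
```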

In [7]:
cat 2.fastqc.nf

/* 
 * pipeline input parameters 
 */
params.reads = "$baseDir/data/*_{1,2}.fq"
params.transcriptome = "$baseDir/data/transcriptome.fa"
params.multiqc = "$baseDir/multiqc"
params.outdir = "results"

println """\
         R N A S E Q - N F   P I P E L I N E    
         transcriptome: ${params.transcriptome}
         reads        : ${params.reads}
         outdir       : ${params.outdir}
         """
         .stripIndent()


Channel 
    .fromFilePairs( params.reads )
    .ifEmpty { error "Cannot find any reads matching: ${params.reads}"  }
    .set { read_pairs_ch } 
    

process fastqc {
    tag "FASTQC on $sample_id"
    publishDir "${params.outdir}/${task.process}", pattern: "fastqc_${sample_id}_logs/*.{zip,html}", mode: 'copy'

    input:
    set sample_id, file(reads) from read_pairs_ch

    output:
    file("fastqc_${sample_id}_logs") into fastqc_ch
    path("fastqc_${sample_id}_logs/*.{zip,html}")

    script:
    """
    mkdir fastqc_${sample_id}_logs
    fast

In [8]:
nextflow run 2.fastqc.nf

N E X T F L O W  ~  version 20.10.0
Launching `2.fastqc.nf` [adoring_mcnulty] - revision: ed683cad0d
R N A S E Q - N F   P I P E L I N E    
transcriptome: /Users/au572806/GitHub/Nextflow_demo_HsiaoLab/data/transcriptome.fa
reads        : /Users/au572806/GitHub/Nextflow_demo_HsiaoLab/data/*_{1,2}.fq
outdir       : results

executor >  local (3)
[6e/4c81a0] process > fastqc (FASTQC on lung)  [100%] 3 of 3 ✔

Now we will add read quantification using ```salmon quant``` on the same samples (gut, liver and lung), whose paired-end reads match ```$baseDir/data/*_{1,2}.fq```.

In [9]:
cat 3.quantification.nf

/* 
 * pipeline input parameters 
 */
params.reads = "$baseDir/data/*_{1,2}.fq"
params.transcriptome = "$baseDir/data/transcriptome.fa"
params.multiqc = "$baseDir/multiqc"
params.outdir = "results"

println """\
         R N A S E Q - N F   P I P E L I N E    
         transcriptome: ${params.transcriptome}
         reads        : ${params.reads}
         outdir       : ${params.outdir}
         """
         .stripIndent()

/* 
 * create a transcriptome file object given the transcriptome string parameter
 */
transcriptome_file = file(params.transcriptome)
 
/* 
 * define the `index` process that creates a binary index
 * given the transcriptome file
 */
 
process index {
    
    input:
    file transcriptome from transcriptome_file
     
    output:
    file 'index' into index_ch

    script:       
    """
    salmon index --threads $task.cpus -t $transcriptome -i index
    """
}

Channel 
    .fromFilePairs( params.reads )
    .ifEmpty { error "Cannot find 

In [10]:
nextflow run 3.quantification.nf

N E X T F L O W  ~  version 20.10.0
Launching `3.quantification.nf` [zen_joliot] - revision: 8cc4f29644
R N A S E Q - N F   P I P E L I N E    
transcriptome: /Users/au572806/GitHub/Nextflow_demo_HsiaoLab/data/transcriptome.fa
reads        : /Users/au572806/GitHub/Nextflow_demo_HsiaoLab/data/*_{1,2}.fq
outdir       : results

executor >  local (4)
[13/92f07d] process > index                          [100%] 1 of 1 ✔
[85/a4a780] process > quantification (Quantificat... [100%] 3 of 3 ✔

Finally, we will add a process to visualize the quantification results using ```multiqc```.

In [11]:
cat 4.multiqc.nf

/* 
 * pipeline input parameters 
 */
params.reads = "$baseDir/data/gut_{1,2}.fq"
params.transcriptome = "$baseDir/data/transcriptome.fa"
params.multiqc = "$baseDir/multiqc"
params.outdir = "results"

println """\
         R N A S E Q - N F   P I P E L I N E    
         transcriptome: ${params.transcriptome}
         reads        : ${params.reads}
         outdir       : ${params.outdir}
         """
         .stripIndent()

/* 
 * create a transcriptome file object given the transcriptome string parameter
 */
transcriptome_file = file(params.transcriptome)
 
/* 
 * define the `index` process that creates a binary index
 * given the transcriptome file
 */
 
 
process index {
    
    input:
    file transcriptome from transcriptome_file
     
    output:
    file 'index' into index_ch

    script:       
    """
    salmon index --threads $task.cpus -t $transcriptome -i index
    """
}


Channel 
    .fromFilePairs( params.reads )
    .ifEmpty { error "Cannot find an

In [13]:
nextflow run 4.multiqc.nf

N E X T F L O W  ~  version 20.10.0
Launching `4.multiqc.nf` [admiring_banach] - revision: a3bcc4bd46
R N A S E Q - N F   P I P E L I N E    
transcriptome: /Users/au572806/GitHub/Nextflow_demo_HsiaoLab/data/transcriptome.fa
reads        : /Users/au572806/GitHub/Nextflow_demo_HsiaoLab/data/gut_{1,2}.fq
outdir       : results

executor >  local (3)
[72/c0df87] process > index                          [100%] 1 of 1 ✔
[80/5621ff] process > fastqc (FASTQC on gut)         [  0%] 0 of 1
[3d/4625ae] process > quantification (Quantificat... [  0%] 0 of 1
[-        ] process > mu

In [14]:
cat main.nf

/* 
 * pipeline input parameters 
 */
params.reads = "$baseDir/data/gut_{1,2}.fq"
params.transcriptome = "$baseDir/data/transcriptome.fa"
params.multiqc = "$baseDir/multiqc"
params.outdir = "results"

println """\
         R N A S E Q - N F   P I P E L I N E    
         transcriptome: ${params.transcriptome}
         reads        : ${params.reads}
         outdir       : ${params.outdir}
         """
         .stripIndent()

/* 
 * create a transcriptome file object given the transcriptome string parameter
 */
transcriptome_file = file(params.transcriptome)
 
/* 
 * define the `index` process that creates a binary index
 * given the transcriptome file
 */
 
 
process index {
    
    input:
    file transcriptome from transcriptome_file
     
    output:
    file 'index' into index_ch

    script:       
    """
    salmon index --threads $task.cpus -t $transcriptome -i index
    """
}


Channel 
    .fromFilePairs( params.reads )
    .ifEmpty { error "Cannot find any rea

In [15]:
nextflow run main.nf

N E X T F L O W  ~  version 20.10.0
Launching `main.nf` [admiring_noether] - revision: 11f5c2c5d2
R N A S E Q - N F   P I P E L I N E    
transcriptome: /Users/au572806/GitHub/Nextflow_demo_HsiaoLab/data/transcriptome.fa
reads        : /Users/au572806/GitHub/Nextflow_demo_HsiaoLab/data/gut_{1,2}.fq
outdir       : results

executor >  local (3)
[09/3eec23] process > index                          [100%] 1 of 1 ✔
[0e/728d7e] process > fastqc (FASTQC on gut)         [  0%] 0 of 1
[23/addcba] process > quantification (Quantificat... [  0%] 0