Nextflow DSL2 getting-started cheatsheet

First of all, use the DSL2 syntax. Recent Nextflow releases enable it by default; to enable it explicitly, add the following lines at the top of your script:

#!/usr/bin/env nextflow
nextflow.enable.dsl = 2

Key components of a pipeline

  • Inputs: in the context of Nextflow, they are stored in Channels.
  • Data processing steps: in the context of Nextflow DSL2, these are defined as Processes and organized into Workflows.
    • The output from one process is stored in a channel and can be piped into the next process.
    • If an input channel contains multiple elements, Nextflow will automatically run one task per element, in parallel where resources allow.
    • Each execution of a process runs in its own working directory, with inputs usually staged as symbolic links (symlinks) to the original files.
    • Output files are generated in the working directory by default and can be copied to a chosen directory with publishDir.
  • Computing environment and resources are set up to run the pipeline:
    • Will it run on a local machine, a high-performance computing environment (HPC) or the cloud?
    • How much CPU or memory will be used?
    • In the context of Nextflow, the execution platform is selected with an Executor, and these settings are often stored in separate configuration files.
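As a sketch of the last point, a nextflow.config file placed next to the pipeline script could look like the following. The resource values, the profile name cluster, and the queue name are assumptions chosen for illustration, not recommendations.

```groovy
// nextflow.config (illustrative values only)

// Defaults applied to every process
process {
    executor = 'local'
    cpus     = 2
    memory   = '4 GB'
}

// Optional profile, selected with: nextflow run main.nf -profile cluster
profiles {
    cluster {
        process.executor = 'slurm'
        process.queue    = 'short'   // hypothetical queue name
    }
}
```

Keeping these settings in a configuration file means the same pipeline script can be moved between a laptop and a cluster without editing the processes.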

A minimal example

To use the same process twice in the workflow definition, we create the following modules: ./modules/p1/main.nf and ./modules/p2/main.nf. The contents of ./modules/p1/main.nf are:

process p1 {

    // Run locally instead of HPC or Cloud
    executor 'local'  
  
    input:
        path(x) 

    output:
        path("head.txt"), emit: head
        /* p1.out will include all output files 
        from p1, whereas emit gives this 
        specific channel a name */
        path("tail.txt"), emit: tail 

    """
    head $x > head.txt
    tail $x > tail.txt
    """
}

and for .modules/p2/main.nf:

process p2 {

    executor 'local'
    // Copy files out of the working directory
    publishDir 'output_folder', mode: 'copy'  

    input:
        path(y)

    output: 
        path("*.gz")

    """
    gzip -f $y
    """
}

Then, we specify the workflow as shown below:

#!/usr/bin/env nextflow
nextflow.enable.dsl = 2

include { p1 } from './modules/p1'
include { p2 as p2a } from './modules/p2'
include { p2 as p2b } from './modules/p2'

workflow { 
    input_ch = Channel.fromPath( "*.txt" ) 
    p1(input_ch)
    p1.out.head | p2a
    p1.out.tail | p2b
}

Channels

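One way to create a queue channel from a set of values is the of factory method (the older Channel.from is deprecated, although many releases still accept it):

value_ch = Channel.of(1,2,3)

Another way to achieve the same result is to use the set operator:

Channel.of(1,2,3)
  .set{value_ch}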

To create a value channel, which holds a single value that can be read any number of times, we use the value factory method. For example:

example_1 = Channel.value()
example_2 = Channel.value('Hello there!')
example_3 = Channel.value([1,2,3,4,5])
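The practical difference between a value channel and a queue channel shows up when a process takes more than one input. The sketch below uses a hypothetical process named greet; the names and greeting text are made up for illustration. Because greeting_ch is a value channel, it is reused for every item of names_ch, so the process runs three times.

```groovy
// Hypothetical process used only to illustrate channel behaviour
process greet {
    input:
    val greeting
    val name

    output:
    stdout

    script:
    """
    echo "$greeting $name"
    """
}

workflow {
    greeting_ch = Channel.value('Hello')               // value channel: reusable
    names_ch    = Channel.of('Ada', 'Grace', 'Linus')  // queue channel: three items
    greet(greeting_ch, names_ch).view()
}
```

If greeting_ch were a queue channel holding a single item, the process would run only once, because the item would be consumed by the first task.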

To create a channel that emits each element of a list as a separate item, we can use the fromList method:

example_1 = Channel.fromList([1,2,3,4])

To create a channel from paths, we can use the fromPath method:

example_1 = Channel.fromPath('/path/file.txt')

To make Nextflow report an error when the file does not exist, we need to add checkIfExists: true as shown below:

example_1 = Channel.fromPath('/path/file.txt', checkIfExists: true)

To get the file pairs matching a glob pattern, we need to use the fromFilePairs method:

example_1 = Channel.fromFilePairs('/path/*_{1,2}.fastq')
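Each item emitted by fromFilePairs is a tuple made of the shared part of the file names followed by the list of matching files. A minimal sketch, reusing the placeholder glob from above (the variable names sample_id and files are arbitrary):

```groovy
// Each emitted item has the shape: [ shared_prefix, [ file_1, file_2 ] ]
Channel
  .fromFilePairs('/path/*_{1,2}.fastq')   // placeholder glob, adjust to real data
  .map { sample_id, files -> "sample: ${sample_id} -> ${files*.name}" }
  .view()
```

If no files match the pattern, nothing is printed.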

Finally, to retrieve records directly from SRA, we use the method fromSRA:

example_1 = Channel.fromSRA('SRP043510')

The channel contents can be combined or modified using operators.

Operators

Filtering operators

Filtering operators select the items emitted by a channel that satisfy a given condition.

Filter: The filter operator keeps only the items that match a regular expression, a type, or a boolean condition

// Keep the strings that start with 'a'
Channel
  .from('a', 'b', 'c', 'aa', 'ab')
  .filter( ~/^a.*/ )
  .view()

// Keep only the numeric items
Channel
  .from('a', 'b', 'c', '1', 1, 2, 2.35)
  .filter( Number )
  .view()

Unique: The unique operator removes duplicate items, so that each value is emitted only once

Channel
  .from(1, 2, 3, 1, 4, 'a', 'b', 'a')
  .unique()
  .view()

Distinct: The distinct operator removes consecutive duplicates; a value that reappears later is emitted again

Channel
  .from(1, 2, 3, 1, 1, 4, 'a', 'b', 'a')
  .distinct()
  .view()

Take: The take operator returns the first n items emitted by a channel

Channel
  .from(1..100)
  .take(10)
  .view()

First: The first operator returns the first item or, when a condition is given, the first item that satisfies it

Channel
  .from(1, 2, 5, 8)
  .first({it > 4})
  .view()

Last: The last operator returns the last item of a channel

Channel
  .from(1, 2, 5, 8)
  .last()
  .view()

Until: The until operator emits items until a certain condition is met (the item that satisfies the condition is NOT emitted)

Channel
  .from(1..100)
  .until({it == 49})
  .view()

Transforming operators

Transforming operators take the items emitted by a channel and transform them into new values

map: This operator applies a chosen function to every item of a channel

Channel
  .from(1, 2, 3, 4)
  .map({it * it})
  .subscribe onNext: {println it}, onComplete: {println "Done!"}

flatMap: This operator is like map, but when the function returns a collection, each element of that collection is emitted individually

Channel
  .from(1, 2, 3, 4)
  .flatMap({ [it, it * it] })
  .view()

groupTuple: This operator collects the tuples emitted by a channel and groups together the elements that share the same key (by default, the first element of each tuple)

Channel
  .from( [1, 'A'], [1, 'B'], [2, 'A'], [2, 'c'] )
  .groupTuple()
  .view()

collate: The collate operator groups the emitted items into tuples of n items, where n is specified by the user; by default, a final incomplete tuple is emitted as well

Channel
  .from(1..7)
  .collate(3)
  .view()

To drop the remaining items that do not fill a complete tuple, pass false as the second argument:

Channel
  .from(1..7)
  .collate(3, false)
  .view()

buffer: This operator gathers the emitted items into lists (subsets), delimited by a closing condition, by an opening and a closing condition, or by a fixed size

// Specify end condition
Channel
  .from(1..100)
  .buffer(5)
  .view()
// Specify start and end condition
Channel
  .from(1..100)
  .buffer(10, 20)
  .view()
// Specify size
Channel
  .from(1..100)
  .buffer(size: 3, remainder: false)
  .view()

Collect: This operator collects all the items emitted by a channel into a list and emits that list as a single item

Channel
  .from(1..10)
  .collect()
  .view()

toList: This operator also gathers all the items into a single list; unlike collect, it does not flatten items that are themselves lists

Channel
  .from(1..10)
  .toList()
  .view()

toSortedList: This operator returns the items in a sorted list

Channel
  .from(1, 2, 8, 5, 3, 4)
  .toSortedList()
  .view()

flatten: This operator transforms a channel so that each item is emitted separately even if it originally belongs to a collection or an array

Channel
  .from(1, [3, 4], 8, [34, 35, 36])
  .flatten()
  .view()
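Operators can be chained, so filtering and transforming steps are often combined in a single expression. A minimal sketch that uses only operators introduced above (the input values are arbitrary):

```groovy
Channel
  .of(1, 2, 3, 4, 5, 6)
  .filter { it % 2 == 0 }   // keep the even numbers: 2, 4, 6
  .map { it * it }          // square each of them: 4, 16, 36
  .collect()                // gather the results into a single list
  .view()                   // prints [4, 16, 36]
```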

Combining operators

The combining operators combine the emitted values from multiple channels

join: The join operator creates a channel that joins together the items emitted by two channels that share a matching key (by default, the first element of each item)

ch1 = Channel.from(['X', 1], ['Y', 2])
ch2 = Channel.from(['X', 6], ['Y', 3])
ch1.join(ch2).view()

mix: The mix operator combines the items of several channels into one, in no guaranteed order

c1 = Channel.from( 1,2,3 )
c2 = Channel.from( 'a','b','c' )
c3 = Channel.from( 'y','z' )
c1.mix(c2, c3).view()

collectFile: The collectFile operator collects the channel emissions and saves them into one or more files

Channel
    .from('alpha', 'beta', 'gamma')
    .collectFile(name: 'sample.txt', newLine: true)
    .subscribe {
        println "Entries are saved to file: $it"
        println "File content is: ${it.text}"
    }

combine: The combine operator returns the Cartesian product of items emitted by two channels

ch1 = Channel.from(1..5)
ch2 = Channel.from('A'..'C')
ch1.combine(ch2).view()

concat: The concat operator concatenates the items from two or more channels and, unlike mix, preserves the order: all items of one channel are emitted before those of the next

a = Channel.from('a','b','c')
b = Channel.from(1,2,3)
c = Channel.from('p','q')
c.concat( b, a ).view()

Processes

Here are the major components of a process:

process < name > {

  [ directives ]        

  input:                
  < process inputs >

  output:               
  < process outputs >

  when:                 
  < condition >

  [script|shell|exec]:  
  """
  < user script to be executed >
  """
}

A process declares its name, its inputs, and its outputs. A conditional (when) can also be specified, so that the process runs only when certain conditions are met. Note that in DSL2 the keywords from and into are no longer used to connect inputs and outputs to channels.

The script block defines the command to be executed. By default this block is interpreted as a Bash script, but another language can be used if the script starts with a shebang (#!) line:

process pyStuff {
  script:
  """
  #!/usr/bin/env python
  print("Hello world!")
  """
}

If the script is wrapped in ''' instead of """, Bash variables can be used directly without escaping $. In return, Nextflow variables are not interpolated inside such a block. For example:

process bar {
  script:
  '''
  echo $PATH | tr ':' '\n'
  '''
}
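For comparison, here is a sketch of the same command written with """, where Nextflow variables are interpolated and Bash variables therefore need an escaped \$. The process name listPath and the parameter params.label are hypothetical names introduced for this illustration.

```groovy
params.label = 'PATH entries'   // hypothetical parameter

process listPath {
    output:
    stdout

    script:
    """
    echo "${params.label}:"
    echo \$PATH | tr ':' '\\n'
    """
}

workflow {
    listPath().view()
}
```

Here ${params.label} is replaced by Nextflow before the script runs, while \$PATH is passed through to Bash as $PATH. The backslash in '\\n' is doubled for the same reason, so that Bash receives '\n'.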

Instead of script, shell can be used to mix Bash variables and Nextflow variables. In this case, the block is wrapped in ''' and Nextflow variables are written using the !{..} syntax:

params.data = 'le monde'

process baz {
  shell:
  '''
  X='Bonjour'
  echo $X !{params.data}
  '''
}

Here is a simple process which prints the corresponding input to the console:

process printWord{
  input:
  val x

  output:
  stdout

  script:
  """
  echo $x
  """
}

Here is another one which converts the input to uppercase:

process upper{
  input:
  val x

  output:
  stdout

  script:
  """
  echo "$x" | tr '[a-z]' '[A-Z]'
  """
}

When these two processes are chained in a workflow, the output of the first process, printWord, becomes the input of the second one, upper, converting lowercase hello to uppercase HELLO.

Files are often used as inputs and/or outputs in processes and thus knowing some file attributes can be extremely useful. Some commonly used file attributes are given in the table below:

Some useful file attributes

| Attribute | What it does |
| --- | --- |
| getName | gets the name of the file (without the directory path) |
| getBaseName | gets the file name without its last extension |
| getSimpleName | gets the file name with all extensions removed |
| getExtension | gets the extension of the file |
| exists | checks whether the file exists |
| isFile | returns true if it is a regular file |
| isDirectory | returns true if it is a directory |
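The attributes above can be tried on a path object created with the file function. The path below is a made-up example and does not need to exist for the name-based attributes to work.

```groovy
def f = file('/data/reads/sample_1.fastq.gz')   // hypothetical path

println f.getName()        // sample_1.fastq.gz
println f.getBaseName()    // sample_1.fastq
println f.getSimpleName()  // sample_1
println f.getExtension()   // gz
println f.exists()         // true only if the file is present on disk
```

The difference between getBaseName and getSimpleName only matters for files with more than one extension, such as compressed files.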

Workflows

Workflows connect processes so that inputs pass through a series of steps to produce a certain output. In a workflow, the output channel of one process can become the input of another process. Thus, multiple processes can be chained. As an example, in the following workflow the process printWord emits the word hello in a channel, and the content of this channel is passed to the process upper, which prints HELLO:

workflow {
  a = printWord("hello")
  upper(a).view()
}

Having defined inputs, channels and workflows, we can now write a complete Nextflow script and run it. Here is a very simple but complete Nextflow script which writes a greeting message to a file called hello.txt:

params.greetings="Hello world"
greeting = Channel.from(params.greetings)

// Write the processes
process writeText {
  input:
  val x

  output:
  path "hello.txt"

  script:
  """
  echo ${x} > hello.txt
  """
}

// Specify the workflow
workflow {
    writeText(greeting)
}

To include log information, we include the following:

log.info """
         """
         .stripIndent()

Between the triple quotes ("""), we list the parameters and their usage, as well as the outputs.
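As an illustration, a filled-in banner for the greeting script could look like the sketch below. The banner text and the parameter params.outdir are assumptions added for the example.

```groovy
params.greetings = 'Hello world'
params.outdir    = 'results'   // hypothetical parameter

log.info """
    G R E E T I N G   P I P E L I N E
    =================================
    greetings : ${params.greetings}
    outdir    : ${params.outdir}
    """
    .stripIndent()
```

The message is printed once when the run starts, which makes it easy to see which parameter values were actually used.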

To run the script, we first save the code snippet above to a file with the .nf extension (e.g., main.nf). With Nextflow installed and main.nf in the working directory, we run the following in the terminal: nextflow run main.nf.

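To publish the output text file to a directory, we add the publishDir directive to the process, as shown below. By default the published file is a symbolic link to the working directory; mode: 'copy' copies it instead:

publishDir "Hello", mode: 'copy'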
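To show where the directive sits, here is a sketch of the writeText process from above with publishDir added, using mode: 'copy' so that the published file is a real copy.

```groovy
process writeText {
    // Copy hello.txt into ./Hello once the task has finished
    publishDir "Hello", mode: 'copy'

    input:
    val x

    output:
    path "hello.txt"

    script:
    """
    echo ${x} > hello.txt
    """
}

workflow {
    writeText(Channel.of('Hello world'))
}
```

Only files declared in the output block are published, so hello.txt must be listed there.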

Modules in DSL2

A main advantage of the DSL2 syntax extension is the ability to write and use modules. Modules can be included and shared across workflows. Thus, code repetition can be avoided and Nextflow pipelines become more succinct. In addition, the nf-core community maintains high-quality modules for commonly used tools, which can be found here: Nextflow modules.

Modules may contain process, function, and workflow definitions. Components defined in a module can be imported into another Nextflow script using the keyword include, as shown in the example below:

include { foo } from './some/module' 

workflow {
    data = Channel.fromPath('/data/*.txt')
    foo(data)
}    

In this example, a process called foo which is present in the module ./some/module is invoked. The process foo takes data as input.

When multiple components need to be included from the same module, the components can be specified in the same inclusion. Their names need to be separated by ; as shown below:

include { foo; bar } from './some/module' 

workflow {
    data = Channel.fromPath('/data/*.txt')
    foo(data)
    bar(data)
}    
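The original does not show the module itself. As a sketch, a file named some/module.nf providing foo and bar could look like this; the bodies of both processes are invented for illustration.

```groovy
// some/module.nf (hypothetical contents)

// Count the lines of each input file
process foo {
    input:
    path x

    output:
    path "${x}.count"

    script:
    """
    wc -l < $x > ${x}.count
    """
}

// Write a compressed copy of each input file
process bar {
    input:
    path x

    output:
    path "${x}.gz"

    script:
    """
    gzip -c $x > ${x}.gz
    """
}
```

A module file contains only definitions, so nothing runs until a workflow in the including script calls foo or bar.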

Functions in DSL2

DSL2 allows us to write functions, such as the one shown below:

// Write a function
def print_on_console(x) {
  println x
}

print_on_console("Hello!")

The function returns the last evaluated expression, unless a return statement is provided explicitly, as in the example given below:

def fib (x) {
  if (x <= 1)
    return x
  else
    fib(x - 1) + fib(x - 2)
}

println fib(3)
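Functions can also be called from operator closures inside a workflow. A minimal sketch, with a hypothetical helper named shout:

```groovy
// Hypothetical helper: upper-case a string and add an exclamation mark
def shout(x) {
    x.toUpperCase() + '!'
}

workflow {
    Channel
      .of('hello', 'world')
      .map { shout(it) }
      .view()   // prints HELLO! then WORLD!
}
```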

Sources

Contributors