First, switch to the DSL2 syntax. To do so, add the following lines at the top of your script:
#!/usr/bin/env nextflow
nextflow.enable.dsl = 2
- Inputs: in Nextflow, these are stored in Channels.
- Data processing steps: in Nextflow DSL2, these are defined as Processes and organized into Workflows.
  - The output of one process is stored in a channel and can be piped into the next process.
  - If an input channel contains multiple elements, Nextflow automatically runs one task per element, in parallel where resources allow.
  - Each execution of a process runs in its own working directory, with inputs often created as symbolic links (symlinks) to the original files.
  - Output files are generated in the working directory by default and can be published (for example, copied) to a specified directory.
- Computing environment and resources that are set up to run the pipeline:
  - Will it run on a local machine, a high-performance computing (HPC) environment or the cloud?
  - How much CPU or memory will be used?
  - In Nextflow, these are specified by Executors and often stored in separate configuration files.
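As a rough illustration of the last point, a configuration file might look like the sketch below. nextflow.config is the file name Nextflow reads by default; the executor, CPU and memory values are placeholder assumptions, not recommendations.

```groovy
// nextflow.config (illustrative values only)
process {
    executor = 'local'   // e.g. 'slurm' on an HPC cluster
    cpus     = 2
    memory   = '4 GB'
}
```

Keeping these settings in a configuration file means the same pipeline script can be moved between a laptop, a cluster and the cloud without editing the processes.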
To use the same process twice in a workflow definition, we place each process in its own module, ./modules/p1/main.nf and ./modules/p2/main.nf, so that it can be included more than once under different aliases. The contents of ./modules/p1/main.nf are:
process p1 {
// Run locally instead of HPC or Cloud
executor 'local'
input:
path(x)
output:
path("head.txt"), emit: head
/* p1.out will include all output files
from p1, whereas emit gives this
specific channel a name */
path("tail.txt"), emit: tail
"""
head $x > head.txt
tail $x > tail.txt
"""
}
and for ./modules/p2/main.nf:
process p2 {
executor 'local'
// Copy files out of the working directory
publishDir 'output_folder', mode: 'copy'
input:
path(y)
output:
path("*.gz")
"""
gzip -f $y
"""
}
Then, we specify the workflow as shown below:
#!/usr/bin/env nextflow
nextflow.enable.dsl = 2
include { p1 as p1 } from './modules/p1'
include { p2 as p2a } from './modules/p2'
include { p2 as p2b } from './modules/p2'
workflow {
input_ch = Channel.fromPath( "*.txt" )
p1(input_ch)
p1.out.head | p2a
p1.out.tail | p2b
}
One way to create a channel from several values is the following:
value_ch = Channel.from(1,2,3)
Another way to achieve the same result is:
Channel.from(1,2,3)
.set{value_ch}
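Recent Nextflow releases mark Channel.from as deprecated and recommend Channel.of for the same purpose. A minimal equivalent, assuming a Nextflow version that provides Channel.of:

```groovy
// Same three values, created with the newer factory method
value_ch = Channel.of(1, 2, 3)
value_ch.view()
```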
To create a value channel, we use the value factory method. For example:
example_1 = Channel.value()
example_2 = Channel.value('Hello there!')
example_3 = Channel.value([1,2,3,4,5])
To create a channel that emits the elements of a list one by one, we can use the fromList method:
example_1 = Channel.fromList([1,2,3,4])
To create a channel from paths, we can use the fromPath method:
example_1 = Channel.fromPath('/path/file.txt')
To check whether the file exists, we add checkIfExists: true as shown below:
example_1 = Channel.fromPath('/path/file.txt', checkIfExists: true)
To get the file pairs matching a glob pattern, we use the fromFilePairs method:
example_1 = Channel.fromFilePairs('/path/*_{1,2}.fastq')
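Each item emitted by fromFilePairs is a tuple whose first element is the grouping key shared by the pair and whose second element is the list of matching files. The sketch below unpacks that tuple; the names sample_id and reads are chosen for illustration only.

```groovy
// Each emitted item has the form: [ key, [ file_1, file_2 ] ]
Channel
    .fromFilePairs('/path/*_{1,2}.fastq')
    .map { sample_id, reads -> "sample ${sample_id}: ${reads}" }
    .view()
```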
Finally, to retrieve records directly from SRA, we use the fromSRA method:
example_1 = Channel.fromSRA('SRP043510')
The channel contents can be combined or modified using operators.
Filtering operators select the items emitted by a channel that satisfy certain conditions.
Filter: The filter operator keeps only the items that match a given pattern, type or condition.
Channel
.from('a', 'b', 'c', 'aa', 'ab')
.filter( ~/^a.*/ )
.view()
Channel
.from('a', 'b', 'c', '1', 1, 2, 2.35)
.filter (Number)
.view()
Unique: The unique operator removes duplicates and returns the unique values from a channel.
Channel
.from(1, 2, 3, 1, 4, 'a', 'b', 'a')
.unique()
.view()
Distinct: The distinct operator removes consecutive duplicates, so an item is dropped only when it equals the item emitted immediately before it.
Channel
.from(1, 2, 3, 1, 1, 4, 'a', 'b', 'a')
.distinct()
.view()
Take: The take operator returns the first n items emitted by a channel.
Channel
.from(1..100)
.take(10)
.view()
First: The first operator returns either the first item of a channel or the first item that meets a certain condition.
Channel
.from(1, 2, 5, 8)
.first({it > 4})
.view()
Last: The last operator returns the last item of a channel.
Channel
.from(1, 2, 5, 8)
.last()
.view()
Until: The until operator emits items until a certain condition is met; the item that satisfies the condition is not emitted.
Channel
.from(1..100)
.until({it == 49})
.view()
Transforming operators take the items emitted by a channel and transform them into new values.
map: This operator applies a chosen function to every item of a channel.
Channel
.from(1, 2, 3, 4)
.map({it * it})
.subscribe onNext: {println it}, onComplete: {println "Done!"}
flatMap: This operator is like map, but when the function returns a list, each element of that list is emitted individually rather than as a single list item.
Channel
.from(1, 2, 3, 4)
.flatMap({it * it})
.view()
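The difference from map becomes visible when the function returns a list. The following sketch (using Channel.of, the newer equivalent of Channel.from) returns a two-element list for every input item:

```groovy
Channel
    .of(1, 2, 3)
    .flatMap { n -> [n, n * n] }   // returns a two-element list per input item
    .view()
// emits six separate items: 1, 1, 2, 4, 3, 9
```

With map in place of flatMap, the same closure would emit three list items instead: [1, 1], [2, 4] and [3, 9].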
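groupTuple: This operator collects the tuples emitted by a channel and groups together the elements that share the same key (by default, the first element of each tuple). In the example below, the result is [1, [A, B]] and [2, [A, c]].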
Channel
.from( [1, 'A'], [1, 'B'], [2, 'A'], [2, 'c'] )
.groupTuple()
.view()
collate: The collate operator transforms a channel so that the emitted items are grouped into tuples of n items, where n is specified by the user. In the example below, the last tuple contains only the one remaining item:
Channel
.from(1..7)
.collate(3)
.view()
To discard the incomplete final tuple, pass false as the second argument:
Channel
.from(1..7)
.collate(3, false)
.view()
buffer: This operator gathers the emitted items into subsets, each of which is emitted once a given condition is met.
// Specify end condition
Channel
.from(1..100)
.buffer(5)
.view()
// Specify start and end condition
Channel
.from(1..100)
.buffer(10, 20)
.view()
// Specify size
Channel
.from(1..100)
.buffer(size: 3, remainder: false)
.view()
Collect: This operator collects all the items emitted by a channel into a list and emits that list as a single item.
Channel
.from(1..10)
.collect()
.view()
toList: Like collect, this operator gathers all the items emitted by a channel into a single list.
Channel
.from(1..10)
.toList()
.view()
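The two operators are not fully interchangeable. By default, collect flattens nested lists by one level, while toList keeps each item as it was emitted. A short sketch of the difference:

```groovy
// collect flattens list items by one level (flat: true is the default)
Channel.of([1, 2], [3, 4]).collect().view()   // [1, 2, 3, 4]

// toList keeps each item exactly as it was emitted
Channel.of([1, 2], [3, 4]).toList().view()    // [[1, 2], [3, 4]]
```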
toSortedList: This operator returns the items in a sorted list
Channel
.from(1, 2, 8, 5, 3, 4)
.toSortedList()
.view()
flatten: This operator transforms a channel so that each item is emitted separately even if it originally belongs to a collection or an array
Channel
.from(1, [3, 4], 8, [34, 35, 36])
.flatten()
.view()
The combining operators combine the values emitted by multiple channels.
join: The join operator creates a channel that joins together the items emitted by two channels when a matching key exists.
ch1 = Channel.from(['X', 1], ['Y', 2])
ch2 = Channel.from(['X', 6], ['Y', 3])
ch1.join(ch2).view()
mix: The mix operator combines the items emitted by two or more channels into a single channel.
c1 = Channel.from( 1,2,3 )
c2 = Channel.from( 'a','b','c' )
c3 = Channel.from( 'y','z' )
c1.mix(c2, c3).view()
collectFile: The collectFile operator collects the items emitted by a channel and saves them into one or more files.
Channel
.from('alpha', 'beta', 'gamma')
.collectFile(name: 'sample.txt', newLine: true)
.subscribe {
println "Entries are saved to file: $it"
println "File content is: ${it.text}"
}
combine: The combine operator returns the Cartesian product of the items emitted by two channels.
ch1 = Channel.from(1..5)
ch2 = Channel.from('A'..'C')
ch1.combine(ch2).view()
concat: The concat operator concatenates the items from two or more channels and, unlike mix, retains the order: all items of the first channel are emitted before those of the next one.
a = Channel.from('a','b','c')
b = Channel.from(1,2,3)
c = Channel.from('p','q')
c.concat( b, a ).view()
Here are the major components of a process:
process < name > {
[ directives ]
input:
< process inputs >
output:
< process outputs >
when:
< condition >
[script|shell|exec]:
"""
< user script to be executed >
"""
}
The name, the input and the output of the process are specified. Conditionals (when) can also be specified, so that the process runs only when certain conditions are met. Note that when using DSL2, there is no need to include the words from and into when creating the input and output channels.
The script block defines the command to be executed. By default, this block is interpreted as a Bash script, but other languages can be used if a shebang (#!) declaration is present:
process pyStuff {
script:
"""
#!/usr/bin/env python
print("Hello world!")
"""
}
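The when condition mentioned above can be sketched as follows. The parameter params.shout and the process name upperIfAsked are hypothetical and serve only to illustrate where the when section goes.

```groovy
params.shout = true   // hypothetical parameter controlling the process

process upperIfAsked {
    input:
    val x

    output:
    stdout

    when:
    params.shout      // the task is skipped when this evaluates to false

    script:
    """
    echo "$x" | tr '[a-z]' '[A-Z]'
    """
}

workflow {
    upperIfAsked(Channel.of('hello')).view()
}
```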
If ''' is used instead of """, Bash variables can be referenced directly without escaping $ (Nextflow variables are then not interpolated). For example:
process bar {
script:
'''
echo $PATH | tr ':' '\n'
'''
}
Instead of script, shell can be used in order to mix Bash variables and Nextflow variables. In this case, Nextflow variables should be referenced using the !{..} syntax:
params.data = 'le monde'
process baz {
shell:
'''
X='Bonjour'
echo $X !{params.data}
'''
}
Here is a simple process which prints the corresponding input to the console:
process printWord{
input:
val x
output:
stdout
script:
"""
echo $x
"""
}
Here is another one which converts the input to uppercase:
process upper{
input:
val x
output:
stdout
script:
"""
echo "$x" | tr '[a-z]' '[A-Z]'
"""
}
When the two processes are chained, the output of the first process, hello, becomes the input of the second one, which converts the lowercase hello to the uppercase HELLO.
Files are often used as inputs and/or outputs in processes and thus knowing some file attributes can be extremely useful. Some commonly used file attributes are given in the table below:
Some useful file attributes
Attribute | What it does |
---|---|
getName | gets the name of the file without the directory path (e.g. sample.tar.gz) |
getBaseName | gets the file name without its last extension (e.g. sample.tar) |
getSimpleName | gets the file name without any extension (e.g. sample) |
getExtension | gets the extension of the file (e.g. gz) |
exists | returns true if the file exists |
isFile | returns true if it is a regular file |
isDirectory | returns true if it is a directory |
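The attributes in the table can be tried out on a file object created with file(). The path below is hypothetical; the name-based attributes work even if the file does not exist on disk.

```groovy
workflow {
    // Hypothetical path, used only to show how the name is split
    def f = file('/path/sample.tar.gz')

    println f.getName()        // sample.tar.gz
    println f.getBaseName()    // sample.tar
    println f.getSimpleName()  // sample
    println f.getExtension()   // gz
    println f.exists()         // true only if the file is actually present
}
```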
Workflows are sets of processes that take some inputs through a series of steps in order to produce a certain output. In a workflow, the contents of a channel can become the input to another process, so multiple processes can be chained. As an example, in the following workflow, a channel containing the word hello is created by the process printWord, and the content of this channel is passed to the process upper to print HELLO:
workflow {
a = printWord("hello")
upper(a).view()
}
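The same chain can also be written with the pipe operator. The sketch below repeats the two processes so that it can be saved and run as a single script; Channel.of is used to create the input channel explicitly.

```groovy
process printWord {
    input:
    val x

    output:
    stdout

    script:
    """
    echo $x
    """
}

process upper {
    input:
    val x

    output:
    stdout

    script:
    """
    echo "$x" | tr '[a-z]' '[A-Z]'
    """
}

workflow {
    // Each stage receives the output channel of the previous one
    Channel.of('hello') | printWord | upper | view
}
```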
Having defined inputs, channels and workflows, we can now write a complete Nextflow script and run it. Here is a very simple but complete Nextflow script which writes a greeting message to a file called hello.txt:
params.greetings="Hello world"
greeting = Channel.from(params.greetings)
// Write the processes
process writeText {
input:
val x
output:
path "hello.txt"
script:
"""
echo ${x} > hello.txt
"""
}
// Specify the workflow
workflow {
writeText(greeting)
}
To include log information, we include the following:
log.info """
"""
.stripIndent()
Between the triple quotes ("""), we include the parameters and their usage, as well as the outputs.
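As an illustration, a filled-in version might look like the sketch below. The banner text and the params.outdir parameter are made up for the example; params.greetings follows the script above.

```groovy
params.greetings = "Hello world"
params.outdir    = "results"   // hypothetical parameter, for illustration only

log.info """
    H E L L O   P I P E L I N E
    ===========================
    greetings : ${params.greetings}
    outdir    : ${params.outdir}
    """
    .stripIndent()
```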
To run the script, we first save the code snippet above to a file with the .nf extension (e.g., main.nf). With Nextflow installed and main.nf in our working directory, we run the following in the terminal: nextflow run main.nf.
To publish the output text file to a directory, we add the publishDir directive to the process, as shown below:
publishDir "Hello", mode: 'copy'
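For context, this is roughly where the directive sits inside the writeText process from the script above. The sketch uses the mode: 'copy' option so that hello.txt is copied out of the task's working directory rather than linked.

```groovy
params.greetings = "Hello world"

process writeText {
    // Copy hello.txt from the task's working directory into ./Hello
    publishDir "Hello", mode: 'copy'

    input:
    val x

    output:
    path "hello.txt"

    script:
    """
    echo ${x} > hello.txt
    """
}

workflow {
    writeText(Channel.of(params.greetings))
}
```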
A main advantage of the DSL2 syntax extension is the ability to write and use modules. Modules can be included and shared across workflows. Thus, code repetition can be avoided and Nextflow pipelines become more succinct. In addition, the nf-core community maintains high-quality modules for commonly used tools, which can be found here: Nextflow modules.
Modules may contain process, function, and workflow definitions. Components defined in a module can be imported into another Nextflow script using the keyword include, as shown in the example below:
include { foo } from './some/module'
workflow {
data = Channel.fromPath('/data/*.txt')
foo(data)
}
In this example, the process foo, which is defined in the module ./some/module, is invoked with the channel data as input.
When multiple components need to be included from the same module, the components can be specified in the same inclusion. Their names need to be separated by ; as shown below:
include { foo; bar } from './some/module'
workflow {
data = Channel.fromPath('/data/*.txt')
foo(data)
bar(data)
}
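The module file itself is not shown above. A hypothetical version, saved for instance as ./some/module.nf, could define the two processes as in the sketch below; the commands inside foo and bar are invented purely for illustration.

```groovy
// ./some/module.nf (hypothetical contents)

// Counts the lines of each input file
process foo {
    input:
    path x

    output:
    stdout

    script:
    """
    wc -l < $x
    """
}

// Prints the first line of each input file
process bar {
    input:
    path x

    output:
    stdout

    script:
    """
    head -n 1 $x
    """
}
```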
DSL2 allows us to write functions, such as the one shown below:
// Write a function
def print_on_console(x) {
println x
}
print_on_console("Hello!")
A function returns the value of the last evaluated expression unless a return statement is provided explicitly. The example below uses both forms:
def fib (x) {
if (x <= 1)
return x
else
fib(x - 1) + fib(x - 2)
}
println fib(3)
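Functions defined this way can also be called from within a workflow, for example inside an operator closure. A minimal sketch, with the function name square chosen for illustration:

```groovy
// Returns the square of x (implicit return of the last expression)
def square(x) {
    x * x
}

workflow {
    Channel
        .of(1, 2, 3)
        .map { square(it) }
        .view()   // 1, 4, 9
}
```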