Jabba: Hybrid Error Correction for Long Sequencing Reads
Jabba is a hybrid error correction tool that corrects third-generation (PacBio / ONT) sequencing data using second-generation (Illumina) data.
Jabba takes as input a concatenated de Bruijn graph and a set of sequences:
- the de Bruijn graph should be in fasta format with one entry per node; the header line of each entry should have the format:
>NODE <node number> <size of node> <number of in edges> <in edges represented by node number of origin, separated by tabs> <number of out edges> <out edges represented by node number of target, separated by tabs>
- the set of sequences should be in fasta or fastq format. These sequences will be corrected (e.g. PacBio reads). The corrections will be written to a file Jabba-<input filename>.fasta.
The output is a file in fasta format with corrections of the long reads, and additionally a file in the input format containing uncorrected reads.
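The node header format described above can be read with a few lines of code. The following is a minimal sketch, not part of Jabba: the names `NodeHeader` and `parse_node_header` are hypothetical, and it assumes that all header fields are separated by whitespace (spaces or tabs) and that node numbers are integers.

```python
from typing import List, NamedTuple


class NodeHeader(NamedTuple):
    node_id: int
    size: int
    in_edges: List[int]   # node numbers of origin
    out_edges: List[int]  # node numbers of target


def parse_node_header(line: str) -> NodeHeader:
    """Parse one '>NODE ...' header line of the graph file.

    Assumption: fields are separated by arbitrary whitespace.
    """
    fields = line.strip().split()
    if len(fields) < 5 or fields[0] != ">NODE":
        raise ValueError(f"not a node header: {line!r}")
    node_id, size, n_in = int(fields[1]), int(fields[2]), int(fields[3])
    pos = 4 + n_in  # index of the <number of out edges> field
    if pos >= len(fields):
        raise ValueError(f"truncated in-edge list: {line!r}")
    in_edges = [int(x) for x in fields[4:pos]]
    n_out = int(fields[pos])
    out_edges = [int(x) for x in fields[pos + 1:]]
    if len(out_edges) != n_out:
        raise ValueError(f"out-edge count mismatch: {line!r}")
    return NodeHeader(node_id, size, in_edges, out_edges)
```

For example, a header for node 3 of size 120 with two in edges (from nodes 1 and -2) and one out edge (to node 4) would parse into `NodeHeader(3, 120, [1, -2], [4])`.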
de Bruijn graph
To build a de Bruijn graph from sequencing reads, one can use brownie or any other suitable tool. Errors in the de Bruijn graph have to be corrected and linear paths concatenated. Correction can be achieved by building the graph from corrected second generation data, by correcting the graph directly, or preferably by a combination of the two. To correct the second generation reads, one can use brownie or Karect. This read correction should be performed with a small k-mer size, after which a larger k-mer size can be used to build the graph.
brownie will concatenate linear nodes and can output the graph in the desired graph format.
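To illustrate what "concatenating linear paths" means, here is a toy sketch in Python. It is not brownie's implementation: the function names are hypothetical, and the sketch assumes a node-centric graph in which every distinct k-mer is a node, ignores reverse complements, performs no error correction, and skips isolated cycles.

```python
from typing import Dict, List


def kmers(seq: str, k: int) -> List[str]:
    """All k-mers of seq, in order (empty if seq is shorter than k)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def compact_linear_paths(reads: List[str], k: int) -> List[str]:
    """Build a toy de Bruijn graph and merge non-branching paths."""
    nodes = {km for read in reads for km in kmers(read, k)}
    # Two k-mers are connected when they overlap in k-1 characters.
    succ: Dict[str, List[str]] = {
        n: [n[1:] + c for c in "ACGT" if n[1:] + c in nodes] for n in nodes
    }
    pred: Dict[str, List[str]] = {
        n: [c + n[:-1] for c in "ACGT" if c + n[:-1] in nodes] for n in nodes
    }

    def is_start(n: str) -> bool:
        # A path starts where the node cannot be merged into its predecessor.
        return len(pred[n]) != 1 or len(succ[pred[n][0]]) != 1

    paths = []
    visited = set()
    for n in sorted(nodes):
        if not is_start(n):
            continue
        path, cur = n, n
        visited.add(n)
        # Extend while the link to the next node is unambiguous both ways.
        while (len(succ[cur]) == 1
               and len(pred[succ[cur][0]]) == 1
               and succ[cur][0] not in visited):
            cur = succ[cur][0]
            visited.add(cur)
            path += cur[-1]
        paths.append(path)
    return sorted(paths)
```

With `k = 3`, the reads `ACGTT` and `ACGTA` share the path `ACG -> CGT` and then branch, so the sketch yields the three concatenated nodes `ACGT`, `GTA` and `GTT`.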
To build a graph with brownie from a fastq file short_reads.fastq containing short reads:
./brownie graphCorrection -p brownie_data -k 75 short_reads.fastq
To build a graph with brownie from a fasta file genome.fasta containing a reference genome (in this case the graph is not corrected):
./brownie graphConstruction -p brownie_data -k 75 genome.fasta
In both cases the graph file brownie_data/DBGraph.fasta will be created.
At the moment Jabba is available for Linux. It requires CMake 2.6 and GCC 4.7. Jabba can be compiled as follows:
mkdir -p build
cd build
cmake ../
make
cd ..
mkdir -p bin
cp -b ./build/src/Jabba ./bin/Jabba
This code is also available in the compile.sh script in the main Jabba directory.
jabba [options] [file_options] file1 [[file_options] file2]...
-h  --help        display help page
-i  --info        display information page
-l  --length      minimal seed size [default = 20]
-k  --dbgk        de Bruijn graph k-mer size
-e  --essak       sparseness factor of the essa [default = 1]
-t  --threads     number of threads [default = available cores]
-p  --passes      maximal number of passes per read [default = 2]
-m  --outputmode  short (do not extend the reads) or long (maximally extend reads) [default = short]
-o  --output      output directory [default = Jabba_output]
    -fastq        fastq input files
    -fasta        fasta input files
-g  --graph       graph input file [default = DBGraph.fasta]
./jabba --dbgk 31 --graph DBGraph.txt -fastq reads.fastq
./jabba -o Jabba -l 20 -k 31 -p 2 -e 1 -g DBGraph.fasta -fastq reads1.fastq reads2.fastq -fasta reads3.fasta
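When Jabba is called from a script, it can help to assemble the argument list programmatically. The sketch below uses only the options listed above. The function name `build_jabba_command`, its parameter names, and the default executable path `./jabba` are assumptions made for illustration and are not part of Jabba itself.

```python
from typing import List, Optional, Sequence


def build_jabba_command(
    graph: str,
    dbgk: int,
    fastq: Sequence[str] = (),
    fasta: Sequence[str] = (),
    output_dir: Optional[str] = None,
    min_seed: Optional[int] = None,
    passes: Optional[int] = None,
    essak: Optional[int] = None,
    threads: Optional[int] = None,
    output_mode: Optional[str] = None,
    executable: str = "./jabba",  # assumed location of the binary
) -> List[str]:
    """Return a Jabba command line as a list of arguments."""
    if not fastq and not fasta:
        raise ValueError("at least one fastq or fasta input file is required")
    if output_mode not in (None, "short", "long"):
        raise ValueError("output_mode must be 'short' or 'long'")
    cmd = [executable, "-k", str(dbgk), "-g", graph]
    optional = [
        ("-o", output_dir), ("-l", min_seed), ("-p", passes),
        ("-e", essak), ("-t", threads), ("-m", output_mode),
    ]
    # Options left at None are omitted so that Jabba uses its defaults.
    for flag, value in optional:
        if value is not None:
            cmd += [flag, str(value)]
    if fastq:
        cmd += ["-fastq", *fastq]
    if fasta:
        cmd += ["-fasta", *fasta]
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the Python standard library, which avoids shell quoting issues with file names.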
Given an Illumina dataset short_reads.fastq and a PacBio dataset long_reads.fastq, the following pipeline can be used:
First we download and compile the software:
git clone https://github.com/biointec/jabba.git
cd jabba
./compile.sh
cd ..
Now we are ready to run the tools:
mkdir karect_output
./jabba/bin/karect -correct -matchtype=hamming -celltype=haploid -inputfile=short_reads.fastq -resultdir=karect_output -tempdir=karect_output
mkdir brownie_output
./jabba/bin/brownie graphCorrection -p brownie_output -k 75 karect_output/karect_short_reads.fastq
./jabba/bin/jabba -o jabba_output -k 75 -g brownie_output/DBGraph.fasta -fastq long_reads.fastq
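The same pipeline can be written as a small Python helper that produces the commands for arbitrary input files. This is a sketch under stated assumptions: the function name `pipeline_commands` and the `tool_dir` parameter are hypothetical, the long reads are assumed to be in fastq format, and the corrected short reads are assumed to be named `karect_<input file name>` as in the example above.

```python
import os
from typing import List


def pipeline_commands(
    short_reads: str,
    long_reads: str,
    k: int = 75,
    tool_dir: str = "./jabba/bin",  # assumed location of the three binaries
) -> List[List[str]]:
    """Return the Karect, brownie and Jabba commands in execution order."""
    karect_out = "karect_output"
    brownie_out = "brownie_output"
    # Assumption: Karect prefixes the corrected file with 'karect_'.
    corrected = f"{karect_out}/karect_{os.path.basename(short_reads)}"
    return [
        ["mkdir", karect_out],
        [f"{tool_dir}/karect", "-correct", "-matchtype=hamming",
         "-celltype=haploid", f"-inputfile={short_reads}",
         f"-resultdir={karect_out}", f"-tempdir={karect_out}"],
        ["mkdir", brownie_out],
        [f"{tool_dir}/brownie", "graphCorrection", "-p", brownie_out,
         "-k", str(k), corrected],
        # The same k-mer size is used for the graph and for Jabba.
        [f"{tool_dir}/jabba", "-o", "jabba_output", "-k", str(k),
         "-g", f"{brownie_out}/DBGraph.fasta", "-fastq", long_reads],
    ]
```

Each returned command can be executed in order with `subprocess.run(cmd, check=True)`, so that the pipeline stops at the first step that fails.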