Home

gmiclotte edited this page Jun 4, 2018 · 11 revisions

Jabba: Hybrid Error Correction for Long Sequencing Reads

Jabba is a hybrid error correction tool that corrects third-generation (PacBio / ONT) sequencing reads using second-generation (Illumina) data.

Input

Jabba takes as input a concatenated de Bruijn graph and a set of sequences:

  • The de Bruijn graph should be in fasta format with one entry per node. The header line of each entry should hold the node's meta information in the format:
    >NODE <node number> <size of node> <number of in edges> <in edges represented by node number of origin, separated by tabs> <number of out edges> <out edges represented by node number of target, separated by tabs>
  • The set of sequences should be in fasta or fastq format. These are the sequences that will be corrected (e.g. PacBio reads). The corrections will be written to a file Jabba-<input filename>.fasta.

The output consists of a file in fasta format containing the corrected long reads and an additional file, in the input format, containing the uncorrected reads.
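To make the node header format described above concrete, here is a minimal Python sketch that parses one header line. It is not part of Jabba or brownie: the names NodeHeader and parse_node_header are hypothetical, the example line is made up, and the sketch assumes that all fields are separated by whitespace (tabs or spaces).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NodeHeader:
    node_id: int          # node number
    size: int             # size of the node as stated in the header
    in_edges: List[int]   # node numbers the incoming edges originate from
    out_edges: List[int]  # node numbers the outgoing edges point to


def parse_node_header(line: str) -> NodeHeader:
    """Parse one '>NODE ...' header line (hypothetical helper)."""
    fields = line.strip().split()  # assumption: whitespace-separated fields
    if not fields or fields[0] != ">NODE":
        raise ValueError(f"not a node header: {line!r}")
    numbers = [int(f) for f in fields[1:]]
    if len(numbers) < 4:
        raise ValueError(f"header too short: {line!r}")
    node_id, size, n_in = numbers[0], numbers[1], numbers[2]
    if len(numbers) < 4 + n_in:
        raise ValueError(f"in-edge list is truncated: {line!r}")
    in_edges = numbers[3:3 + n_in]
    n_out = numbers[3 + n_in]
    if len(numbers) != 4 + n_in + n_out:
        raise ValueError(f"edge counts do not match edge lists: {line!r}")
    out_edges = numbers[4 + n_in:]
    return NodeHeader(node_id, size, in_edges, out_edges)


# Made-up example: node 3 of size 120 with in edges from nodes 1 and 2
# and one out edge to node 5.
header = parse_node_header(">NODE\t3\t120\t2\t1\t2\t1\t5")
print(header)
```

Checking that the stated edge counts match the number of listed edges is a cheap way to detect a graph file that was truncated or written in a different format.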

de Bruijn graph

To build a de Bruijn graph from sequencing reads, one can use brownie or any other suitable tool. Errors in the de Bruijn graph have to be corrected and linear paths have to be concatenated. Correction can be achieved by building the graph from corrected second-generation data, by correcting the graph directly, or, preferably, by combining the two. For read correction of the second-generation data, one can use brownie or Karect. Read correction should be performed with a small k-mer size, after which a larger k-mer size can be used to build the graph.
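As an illustration of what concatenating linear paths means, the following Python sketch builds a toy de Bruijn graph from error-free reads and merges every chain of k-mers without branching into a single node sequence. This is a simplified sketch, not brownie's implementation: the function names are hypothetical, nodes are plain k-mers, and reverse complements and error correction are ignored.

```python
def build_kmer_graph(reads, k):
    """Return successor and predecessor lists for all k-mers in the reads."""
    kmers = {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}
    succ = {km: [km[1:] + b for b in "ACGT" if km[1:] + b in kmers] for km in kmers}
    pred = {km: [b + km[:-1] for b in "ACGT" if b + km[:-1] in kmers] for km in kmers}
    return succ, pred


def concatenate_linear_paths(succ, pred):
    """Merge chains of k-mers without branching into single node sequences."""

    def extends(u):
        # u -> v is mergeable if v is u's only successor and u is v's only predecessor
        return len(succ[u]) == 1 and len(pred[succ[u][0]]) == 1

    def continues_path(v):
        # v is not a chain start if its only incoming edge is mergeable
        return len(pred[v]) == 1 and extends(pred[v][0])

    visited, nodes = set(), []

    def walk(start):
        seq, cur = start, start
        visited.add(start)
        while extends(cur) and succ[cur][0] not in visited:
            cur = succ[cur][0]
            visited.add(cur)
            seq += cur[-1]  # each merged k-mer adds one character
        nodes.append(seq)

    for km in sorted(succ):      # chains that have a proper start
        if not continues_path(km):
            walk(km)
    for km in sorted(succ):      # remaining k-mers lie on isolated cycles
        if km not in visited:
            walk(km)
    return nodes


# Two toy reads that share the prefix ACGG and then diverge.
succ, pred = build_kmer_graph(["ACGGT", "ACGGA"], k=3)
print(concatenate_linear_paths(succ, pred))  # ['ACGG', 'GGA', 'GGT']
```

In the toy output, the shared prefix becomes one concatenated node and the two branches stay separate nodes; neighbouring nodes overlap by k - 1 characters.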

brownie will concatenate linear nodes and can output the graph in the desired graph format.
To build a graph with brownie from a fastq file short_reads.fastq containing short reads:

./brownie graphCorrection -p brownie_data -k 75 short_reads.fastq  

To build a graph with brownie from a fasta file genome.fasta containing a reference genome (in this case the graph is not corrected):

./brownie graphConstruction -p brownie_data -k 75 genome.fasta  

In both cases the graph file brownie_data/DBGraph.fasta will be created.

Installation

At the moment Jabba is available for Linux. It requires CMake 2.6 and GCC 4.7. Jabba can be compiled as follows:

mkdir -p build  
cd build  
cmake ../  
make
cd ..  
mkdir -p bin  
cp -b ./build/src/Jabba ./bin/Jabba  

This code is also available in the compile.sh script in the main Jabba directory.

Usage

jabba [options] [file_options] file1 [[file_options] file2]...
[options]
  -h  --help         display help page
  -i  --info         display information page
[options arg]
  -l  --length       minimal seed size [default = 20]
  -k  --dbgk         de Bruijn graph k-mer size
  -e  --essak        sparseness factor of the essa [default = 1]
  -t  --threads      number of threads [default = available cores]
  -p  --passes       maximal number of passes per read [default = 2]
  -m  --outputmode   short (do not extend the reads) or long (maximally extend reads) [default = short]
[file_options file_name]
  -o  --output       output directory [default = Jabba_output]
  -fastq             fastq input files
  -fasta             fasta input files
  -g  --graph        graph input file [default = DBGraph.fasta]

examples:

./jabba --dbgk 31 --graph DBGraph.txt -fastq reads.fastq  
./jabba -o Jabba -l 20 -k 31 -p 2 -e 1 -g DBGraph.fasta -fastq reads1.fastq reads2.fastq -fasta reads3.fasta  
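When Jabba is called from a script, the argument list can be assembled programmatically. The sketch below is a hypothetical Python helper (jabba_command is not part of Jabba) that only uses the short options listed above and groups the input files behind their -fastq and -fasta flags. It assumes the executable is reachable as ./jabba, as in the examples.

```python
def jabba_command(graph, dbgk, fastq=(), fasta=(), output=None, length=None,
                  passes=None, essak=None, threads=None, outputmode=None,
                  executable="./jabba"):
    """Build a Jabba argument list (hypothetical helper, does not run Jabba)."""
    if not fastq and not fasta:
        raise ValueError("at least one fastq or fasta input file is required")
    if outputmode not in (None, "short", "long"):
        raise ValueError("outputmode must be 'short' or 'long'")
    # Options left at None are omitted so that Jabba falls back to its defaults.
    options = [("-o", output), ("-l", length), ("-k", dbgk), ("-p", passes),
               ("-e", essak), ("-t", threads), ("-m", outputmode), ("-g", graph)]
    cmd = [executable]
    for flag, value in options:
        if value is not None:
            cmd += [flag, str(value)]
    if fastq:
        cmd += ["-fastq", *fastq]   # all fastq files follow one -fastq flag
    if fasta:
        cmd += ["-fasta", *fasta]   # all fasta files follow one -fasta flag
    return cmd


# Reproduces the second example above.
cmd = jabba_command("DBGraph.fasta", 31, output="Jabba", length=20, passes=2,
                    essak=1, fastq=["reads1.fastq", "reads2.fastq"],
                    fasta=["reads3.fasta"])
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) from the Python standard library to run the correction.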

Example

Given an Illumina dataset short_reads.fastq and a PacBio dataset long_reads.fastq, the following pipeline can be used. First, we download and compile the software:

git clone https://github.com/biointec/jabba.git  
cd jabba  
./compile.sh  
cd ..  

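Now we are ready to run the tools. Karect first corrects the short reads, brownie then builds a corrected de Bruijn graph from them, and finally Jabba corrects the long reads using that graph: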

mkdir karect_output  
./jabba/bin/karect -correct -matchtype=hamming -celltype=haploid -inputfile=short_reads.fastq -resultdir=karect_output -tempdir=karect_output  
mkdir brownie_output  
./jabba/bin/brownie graphCorrection -p brownie_output -k 75 karect_output/karect_short_reads.fastq  
./jabba/bin/jabba -o jabba_output -k 75 -g brownie_output/DBGraph.fasta -fastq long_reads.fastq
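The pipeline above can also be written down as data, which makes the hand-over of files between the steps explicit. The Python sketch below only plans the commands and does not execute anything. plan_pipeline is a hypothetical helper; the output name karect_<input file name> and the graph path brownie_output/DBGraph.fasta are taken from the commands above, and the same k-mer size is passed to brownie and Jabba, as in the example.

```python
from pathlib import Path


def plan_pipeline(short_reads, long_reads, k=75, bin_dir="./jabba/bin"):
    """Return the pipeline commands as argument lists (nothing is executed)."""
    karect_dir, brownie_dir, jabba_dir = "karect_output", "brownie_output", "jabba_output"
    # Naming of the corrected short reads as used in the example above.
    corrected = f"{karect_dir}/karect_{Path(short_reads).name}"
    graph = f"{brownie_dir}/DBGraph.fasta"
    return [
        ["mkdir", karect_dir],
        [f"{bin_dir}/karect", "-correct", "-matchtype=hamming", "-celltype=haploid",
         f"-inputfile={short_reads}", f"-resultdir={karect_dir}",
         f"-tempdir={karect_dir}"],
        ["mkdir", brownie_dir],
        # brownie builds and corrects the graph from the corrected short reads.
        [f"{bin_dir}/brownie", "graphCorrection", "-p", brownie_dir,
         "-k", str(k), corrected],
        # Jabba is given the same k-mer size that was used to build the graph.
        [f"{bin_dir}/jabba", "-o", jabba_dir, "-k", str(k), "-g", graph,
         "-fastq", str(long_reads)],
    ]


for step in plan_pipeline("short_reads.fastq", "long_reads.fastq"):
    print(" ".join(step))
```

Each planned step could be run with subprocess.run(step, check=True), so that the pipeline stops as soon as one tool fails.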