WTDBG

A fuzzy Bruijn graph (FBG) approach to long noisy reads assembly

Introduction

A major challenge in assembling long noisy reads from third-generation sequencing (TGS) is reducing the computing resources required, especially for large genomes. To address this issue, I developed a novel sequence alignment algorithm and a new assembly graph for efficiently assembling large genomes from TGS data.

Alignment
KBM: Kmer-BIN-Mapping.
KBM groups the k-mers from each non-overlapping 256 bp fragment of a long read into a bin. Bins in which most k-mers are high-frequency are filtered out as highly repetitive. KBM then searches for synteny among matched bin pairs between sequences using dynamic programming. A matched bin pair is defined as two bins of different origin that share a set of k-mers. KBM alignments have the same features as traditional sequence alignments, except that the unit of a KBM alignment is a 256 bp bin instead of a single base.
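As a rough illustration of the binning idea (not the actual KBM implementation), the Python sketch below splits a sequence into non-overlapping 256 bp bins, collects the k-mers of each bin, and lists bin pairs from two sequences that share k-mers. The function names, the k-mer size of 15, and the min_shared threshold are assumptions made for this example; the repetitive-bin filter and the dynamic-programming synteny search are not shown.

```python
BIN_SIZE = 256  # bin length in bp, as stated in the text


def bin_kmers(seq, k=15, bin_size=BIN_SIZE):
    """Return one set of k-mers per non-overlapping bin of `seq`.

    Trailing bases that do not fill a whole bin are ignored
    (an assumption of this sketch).
    """
    bins = []
    for start in range(0, len(seq) - bin_size + 1, bin_size):
        frag = seq[start:start + bin_size]
        bins.append({frag[i:i + k] for i in range(bin_size - k + 1)})
    return bins


def matched_bin_pairs(seq_a, seq_b, k=15, min_shared=1):
    """List (bin_index_a, bin_index_b, shared_kmers) for bins sharing k-mers.

    This is a brute-force all-vs-all comparison for clarity only;
    a real tool would use a k-mer index instead.
    """
    bins_a = bin_kmers(seq_a, k)
    bins_b = bin_kmers(seq_b, k)
    pairs = []
    for i, kmers_a in enumerate(bins_a):
        for j, kmers_b in enumerate(bins_b):
            shared = len(kmers_a & kmers_b)
            if shared >= min_shared:
                pairs.append((i, j, shared))
    return pairs
```

Matched pairs that line up on a diagonal (for example bins 0 and 1 of one read matching bins 1 and 2 of another) are the kind of collinear signal that a chaining step would join into one alignment.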
Assembly
FBG: Fuzzy Bruijn Graph.
FBG is composed of vertices, each 1024 bp long, taken from reads, and edges connecting vertices in the order they appear on read paths. Compared with a de Bruijn graph (DBG), the vertices in FBG are much larger and therefore are not sensitive to small repeats. To tolerate high sequencing error rates, FBG's vertices are found using gapped sequence alignments from KBM or other aligners, rather than by searching for identical k-mers as in DBG.
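The edge construction can be pictured with a small Python sketch. It assumes each read has already been reduced to an ordered list of vertex identifiers (its read path), counts how many reads support each pair of consecutive vertices, and keeps the edges with enough support. The function name, the input layout, and the edge_min parameter are illustrative assumptions, and strand information is ignored; this is not wtdbg's actual graph code.

```python
from collections import Counter


def build_edges(read_paths, edge_min):
    """Build supported edges from read paths.

    read_paths: dict mapping read name -> ordered list of vertex ids
                (hypothetical input layout for this sketch).
    edge_min:   minimum number of supporting reads for an edge to be kept.
    Returns a dict mapping (vertex_u, vertex_v) -> number of supporting reads.
    """
    support = Counter()
    for path in read_paths.values():
        # Consecutive vertices on a read path define one directed edge.
        for u, v in zip(path, path[1:]):
            support[(u, v)] += 1
    return {edge: n for edge, n in support.items() if n >= edge_min}


if __name__ == "__main__":
    paths = {
        "read1": ["A", "B", "C"],
        "read2": ["A", "B", "C"],
        "read3": ["A", "B", "D"],
    }
    print(build_edges(paths, edge_min=2))
```

With edge_min=2 the weakly supported edge B to D is dropped, which gives some intuition for why a higher support threshold suits high coverage and a lower one suits low coverage.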
Why choose wtdbg in genome assembly
There are many assemblers for long noisy reads, e.g. FALCON, CANU, miniasm, and SMARTdenovo (the progenitor of wtdbg). If your genome is 10 Gbp or larger, wtdbg is your first or even your only option. For small but complicated genomes (< 3 Gbp), my friends have often reported that wtdbg yields a better assembly. In addition, KBM is easy to use when you are setting up a web server for long-read mapping (see Example 2).
Limitation
Max read length is 0x0003FFFFU (256 Kb); longer reads will be split.
Max number of reads is 0x03FFFFFFU (64 M). If your data exceeds this, please filter out the relatively shorter reads.
In KBM, max read length is 0xFFFFFFFFU (4 Gb), max number of reads is 0x0FFFFFFFU (256 M).
Max number of threads is 4096.
Cannot run in parallel across multiple nodes, but you can implement this simply using kbm and wtdbg --load-alignments.
Developed and tested in Linux-GCC only.
Only accepts fasta/fastq format for input; '.gz' suffixed files will be piped through gzip -dc.
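For reference, the hexadecimal limits above translate to the decimal values in the Python sketch below. The split_read helper shows one simple way an over-long read could be cut into pieces no longer than the limit; how wtdbg itself splits reads is not described here, so the plain fixed-size chunking is an assumption.

```python
# Limits quoted in the list above, written as in the source.
WTDBG_MAX_READ_LEN = 0x0003FFFF   # 262,143 bp (about 256 Kb)
WTDBG_MAX_READS = 0x03FFFFFF      # 67,108,863 reads (about 64 M)
KBM_MAX_READ_LEN = 0xFFFFFFFF     # 4,294,967,295 bp (about 4 Gb)
KBM_MAX_READS = 0x0FFFFFFF        # 268,435,455 reads (about 256 M)


def split_read(seq, max_len=WTDBG_MAX_READ_LEN):
    """Cut `seq` into consecutive pieces of at most `max_len` bases.

    Hypothetical helper: plain chunking, not necessarily what wtdbg does.
    """
    return [seq[i:i + max_len] for i in range(0, len(seq), max_len)]


if __name__ == "__main__":
    pieces = split_read("A" * 600000)
    print([len(p) for p in pieces])
```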

Installation

git clone https://github.com/ruanjue/wtdbg-1.2.8.git
cd wtdbg-1.2.8
make

Long reads mapping

Suppose you have hg19.fa as the reference sequences and reads.fa.gz as the query sequences.

Example 1

kbm-1.2.8 -t 64 -d hg19.fa -i reads.fa.gz -o reads.kbmap

output format

COL1 qry_name
COL2 qry_strand
COL3 qry_length
COL4 qry_beg
COL5 qry_end
COL6 ref_name
COL7 ref_strand (always equals +)
COL8 ref_length
COL9 ref_beg
COL10 ref_end
COL11 match_len (length of matched k-mers)
COL12 align_len (length of the aligned region)
COL13 #kcnt (number of matched k-mers)
COL14 #gap (number of gapped BINs)
COL15 cigar (SAM-style cigar, with lengths in units of 256 bp bins)
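The column list above maps naturally onto a small record type. The Python sketch below parses one KBMAP line into named fields; it assumes the fields are whitespace-delimited (the delimiter is not stated above), and the class name, function name, and example values are invented for illustration.

```python
from typing import NamedTuple


class KbmHit(NamedTuple):
    qry_name: str
    qry_strand: str
    qry_length: int
    qry_beg: int
    qry_end: int
    ref_name: str
    ref_strand: str
    ref_length: int
    ref_beg: int
    ref_end: int
    match_len: int
    align_len: int
    kcnt: int
    gap: int
    cigar: str


# 0-based indexes of the numeric columns (COL3-5 and COL8-14).
INT_COLUMNS = {2, 3, 4, 7, 8, 9, 10, 11, 12, 13}


def parse_kbmap_line(line):
    """Parse one KBMAP line into a KbmHit (assumes 15 whitespace-delimited fields)."""
    cols = line.split()
    if len(cols) != 15:
        raise ValueError("expected 15 columns, got %d" % len(cols))
    values = [int(c) if i in INT_COLUMNS else c for i, c in enumerate(cols)]
    return KbmHit(*values)


if __name__ == "__main__":
    # Made-up line, only to show the field layout.
    example = "read1\t+\t10000\t256\t5120\tchr1\t+\t249250621\t1000192\t1005056\t2100\t4864\t300\t1\t19M"
    hit = parse_kbmap_line(example)
    print(hit.qry_name, hit.ref_name, hit.match_len / hit.align_len)
```

The ratio match_len / align_len gives a quick, approximate measure of how densely an alignment is covered by matched k-mers.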

Example 2

Suitable for online tools, or frequently used references

Build KBM-INDEX

kbm-1.2.8 -t 64 -d hg19.fa -i /dev/null -W hg19.kbmidx

Set up KBM server

kbm-1.2.8 -R hg19.kbmidx start

Now hg19.kbmidx is cached in memory for subsequent calls.

Mapping with KBM-INDEX

kbm-1.2.8 -R hg19.kbmidx -t 64 -i reads.fa -o reads.kbmap

Please note that hg19.kbmidx can be used concurrently by multiple processes on the same computer.

Shutdown KBM server

kbm-1.2.8 -R hg19.kbmidx stop

Long reads assembling

Quick Start

echo "manual"
run_wtdbg_assembly.sh -h
echo "generating shell script"
run_wtdbg_assembly.sh -t 0 -i reads.fa.gz -o dbg -T >run.sh

run.sh is now ready to be invoked.

Play with wtdbg

wtdbg-1.2.8 -h
wtdbg-1.2.8 --help

options

 -t <int>    Number of threads, 0: all cores, [0]
 -i <string> Long reads sequences file, + *
 -I <string> Error-free sequences file, +
 -o <string> Prefix of output files, *
 -f          Force overwrite
 -k <int>    Kmer fsize, 0 <= k <= 25, [0]
 -p <int>    Kmer psize, 0 <= p <= 25, [21]
             k + p <= 25, seed is <k-mer>+<p-homopolymer-compressed>
 -K <float>  Filter high frequency kmers, maybe repetitive, [1000]
             if K >= 1, take the integer value as cutoff
             else, mask the top fraction part high frequency kmers
 -E <int>    Min kmer frequency, [2]
 -F          Filter low frequency kmers by a 4G-bytes array (max_occ=3 2-bits). Here, -E must be greater than 1
 -S <int>    Subsampling kmers, 1/(<-S>) kmers are indexed, [4]
             -S is very useful in saving memory and speeding up
             please note that subsampling kmers will have less matched length
 -X <int>    Max number of bin(256bp) in one gap, [4]
 -Y <int>    Max number of bin(256bp) in one deviation, [4]
 -x <int>    penalty for BIN gap, [-7]
 -y <int>    penalty for BIN deviation, [-21]
 -l <float>  Min length of alignment, [2048]
 -m <float>  Min matched, [200]
 -s <float>  Max length variation of two aligned fragments, [0.2]
 -q          Quiet
 -v          Verbose, +
 --help      Show more options

For higher error rate long sequences

Decrease -p. Try -p 19 or -p 17
Decrease -S. Try -S 2 or -S 1
Both will increase computing time.

For very high coverage

Increase --edge-min. Try --edge-min 4, or higher.

For low coverage

Decrease --edge-min. Try --edge-min 2 --rescue-low-cov-edges.

Filter reads

--tidy-reads 5000 filters out sequences shorter than 5000 bp. If read names are in the format \/\d+_\d+$ (PacBio subread names), only the longest subread is selected.
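The effect of this option can be approximated by the following Python sketch. It drops reads shorter than the threshold and, for names ending in /<start>_<end>, keeps only the longest read among those sharing the same name prefix. The function name and the grouping by name prefix are assumptions about the intended behaviour, not a copy of wtdbg's code.

```python
import re

# Name suffix quoted in the text: "/<digits>_<digits>" at the end of the name.
SUBREAD_RE = re.compile(r"/\d+_\d+$")


def tidy_reads(reads, min_len=5000):
    """Filter (name, sequence) pairs in the spirit of --tidy-reads.

    Reads shorter than `min_len` are dropped. Reads whose names match
    SUBREAD_RE are grouped by the part of the name before the suffix,
    and only the longest read of each group is kept (an assumption).
    """
    best = {}  # group key -> (name, sequence); dicts keep insertion order
    for name, seq in reads:
        if len(seq) < min_len:
            continue
        match = SUBREAD_RE.search(name)
        key = name[:match.start()] if match else name
        if key not in best or len(seq) > len(best[key][1]):
            best[key] = (name, seq)
    return list(best.values())


if __name__ == "__main__":
    reads = [
        ("m1/7/0_6000", "A" * 6000),
        ("m1/7/6050_14000", "A" * 7950),
        ("m1/8/0_3000", "A" * 3000),
        ("other_read", "C" * 5000),
    ]
    print([name for name, _ in tidy_reads(reads)])
```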

output

Suppose the prefix is dbg

dbg.1.dot
DOT file for initialized graph
dbg.1.nodes
nodes and their positions in reads
dbg.1.reads
reads and their nodes
dbg.2.dot
DOT file after transitive reduction
dbg.3.dot
DOT file after merging bubbles and removing tips
dbg.alignments
KBMAP file, all vs all alignments
dbg.binkmer
Distribution of number of k-mers in a BIN
dbg.closed_bins
Filtered BINs
dbg.clps
Read clipping information
COL1 read_name
COL2 read_length
COL3 keep_offset
COL4 keep_length
dbg.ctg.dot
DOT file for contigs
dbg.ctg.lay
Contig layout file, which is read by wtdbg-cns. This file is the main result of wtdbg.
Format:

>ctg(\d+) nodes=(\d+) len=(\d+)
E <OFFSET> <NODE1> <STRAND1> <NODE2> <STRAND2>
S <READ_NAME> <STRAND> <REG_OFFSET> <REG_LENGTH> <REG_SEQ>
S <READ_NAME> <STRAND> <REG_OFFSET> <REG_LENGTH> <REG_SEQ>
S <READ_NAME> <STRAND> <REG_OFFSET> <REG_LENGTH> <REG_SEQ>
...
E ...
...

One contig contains many edges (lines starting with 'E'), and each edge contains many regions within reads (lines starting with 'S').
Please note that one read often contains many REGs.
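To make the nesting of contigs, edges, and regions concrete, here is a minimal Python reader for the layout format shown above. It assumes whitespace-delimited fields and keeps everything in plain dictionaries; the function name, the dictionary keys, and the sample text are invented for this sketch.

```python
import re

HEADER_RE = re.compile(r"^>(ctg\d+) nodes=(\d+) len=(\d+)")


def parse_ctg_lay(lines):
    """Parse layout lines into a list of contigs, each with edges and regions."""
    contigs = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            m = HEADER_RE.match(line)
            if m is None:
                raise ValueError("bad contig header: " + line)
            contigs.append({"name": m.group(1), "nodes": int(m.group(2)),
                            "len": int(m.group(3)), "edges": []})
            continue
        fields = line.split()
        if fields[0] == "E" and len(fields) == 6:
            # E <OFFSET> <NODE1> <STRAND1> <NODE2> <STRAND2>
            contigs[-1]["edges"].append({
                "offset": int(fields[1]),
                "node1": fields[2], "strand1": fields[3],
                "node2": fields[4], "strand2": fields[5],
                "regions": []})
        elif fields[0] == "S" and len(fields) == 6:
            # S <READ_NAME> <STRAND> <REG_OFFSET> <REG_LENGTH> <REG_SEQ>
            contigs[-1]["edges"][-1]["regions"].append({
                "read": fields[1], "strand": fields[2],
                "offset": int(fields[3]), "length": int(fields[4]),
                "seq": fields[5]})
        else:
            raise ValueError("unrecognized line: " + line)
    return contigs


if __name__ == "__main__":
    sample = """>ctg1 nodes=3 len=5120
E 0 N10 + N11 +
S read_a + 0 1024 ACGT
S read_b - 2048 1024 ACGA
E 1024 N11 + N12 -
S read_a + 1024 1024 TTTT
"""
    for ctg in parse_ctg_lay(sample.splitlines()):
        print(ctg["name"], len(ctg["edges"]), "edges")
```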

dbg.events
Log file of graph simplification
dbg.frg.dot
DOT file for unitigs
dbg.frg.nodes
unitigs and their nodes
dbg.kmerdep
Distribution of k-mer depth
STDERR stream
wtdbg prints runtime information to the program's STDERR stream. Use --quiet to disable it.

Consensus

wtdbg-cns -t 64 -i dbg.ctg.lay -o dbg.ctg.lay.fa

The output file dbg.ctg.lay.fa is ready for further polishing by PILON or QUIVER.

Performance

Human (3G) CHM1 PacBio P5C3 dataset, 65.5 core.hours

Data Source http://datasets.pacb.com/2014/Human54x/fasta.html
Command

wtdbg-1.2.8 -t 96 -i pb.fa -fo dbg --tidy-reads 5000 --edge-min 2 --rescue-low-cov-edges

Contigs TOT 2978536704, CNT 8752, AVG 340327, MAX 11662848, N50 1925120, L50 453, N90 400128, L90 1727, Min 5120
Runtime real 6131.803 sec, user 201836.200 sec, sys 33956.790 sec, maxrss 117281672.0 kB, maxvsize 202422172.0 kB
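The core.hours figure in the heading for this dataset appears consistent with the sum of user and sys CPU time divided by 3600. The short Python check below reproduces that arithmetic from the Runtime line above; the function name is made up for the example.

```python
def core_hours(user_sec, sys_sec):
    """CPU time in hours, taken here as (user + sys) seconds / 3600."""
    return (user_sec + sys_sec) / 3600.0


if __name__ == "__main__":
    # Values copied from the Runtime line of the P5C3 run.
    print(round(core_hours(201836.200, 33956.790), 1))  # 65.5
    # Wall-clock time of the same run, in hours.
    print(round(6131.803 / 3600.0, 2))  # 1.7
```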

Human (3G) CHM1 PacBio P6C4 dataset, 211.3 core.hours

Data Source http://www.ebi.ac.uk/ena/data/view/PRJNA246220
Command

wtdbg-1.2.8 -t 96 -i wt.fa -fo dbg --tidy-reads 5000 --edge-min 4 --rescue-low-cov-edges

Contigs TOT 2964872448, CNT 1909, AVG 1553103, MAX 105310208, N50 23586816, L50 34, N90 3326976, L90 158, Min 5120
Runtime real 16806.534 sec, user 681278.770 sec, sys 79371.630 sec, maxrss 264956752.0 kB, maxvsize 443356532.0 kB
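For readers unfamiliar with the N50/L50 and N90/L90 columns in the Contigs lines, the Python sketch below computes them from a list of contig lengths using the common definition: sort contigs from longest to shortest and stop once the running total reaches the chosen fraction of the total length. The function name is illustrative, and the scripts used to produce the numbers in this section may differ in detail.

```python
def nx_lx(lengths, fraction=0.5):
    """Return (Nx, Lx) for the given fraction (0.5 -> N50/L50, 0.9 -> N90/L90).

    Nx is the length of the contig at which the running total, taken over
    contigs sorted from longest to shortest, first reaches fraction * total.
    Lx is the number of contigs needed to get there.
    """
    if not lengths:
        raise ValueError("no contig lengths given")
    target = sum(lengths) * fraction
    running = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running >= target:
            return length, count
    # Only reachable when fraction is greater than 1.
    raise ValueError("fraction must not exceed 1")


if __name__ == "__main__":
    contig_lengths = [10, 8, 6, 4, 2]
    print(nx_lx(contig_lengths, 0.5))  # (8, 2)
    print(nx_lx(contig_lengths, 0.9))  # (4, 4)
```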

Axolotl (32G) PacBio dataset, 32 X, 3053 core.hours

Command

wtdbg-1.2.8 -t 96 -i ../rawdata/pacbio.fa.gz -p 21 -S 2 --aln-noskip --rescue-low-cov-edges --tidy-reads 5000 -fo axolotl

Contigs TOT 27375160576, CNT 115355, AVG 237313, MAX 7812608, N50 606976, L50 12527, N90 144896, L90 47295, Min 5120
Runtime real 190237.591 sec, user 10994200.800 sec, sys 488715.030 sec, maxrss 1671005352.0 kB, maxvsize 2365400208.0 kB

Human (3G) NA12878 ONT dataset, 197.5 core.hours

Data Source https://github.com/nanopore-wgs-consortium/NA12878
Command

wtdbg-1.2.8 -t 64 -i NA12878-ONT.fa.gz -fo dbg -S 2 --edge-min 2 --rescue-low-cov-edges

Contigs TOT 2827644928, CNT 19473, AVG 145209, MAX 31366400, N50 4540672, L50 162, N90 172800, L90 1111, Min 5120
Runtime real 14992.925 sec, user 649202.270 sec, sys 61638.300 sec, maxrss 256840096.0 kB, maxvsize 356668088.0 kB

Citation

To be published.
URL https://github.com/ruanjue/wtdbg-1.2.8/

Contact

Jue Ruan ruanjue@gmail.com
Jue Ruan ruanjue@caas.cn
