Fetching latest commit…
Cannot retrieve the latest commit at this time.

README.md

title author date header footer section
abyss-sealer
Daniel Paulino
2014-11-13
ABySS
ABySS
1

Name

abyss-sealer - Close gaps within scaffolds

Synopsis

abyss-sealer -b <Bloom filter size> -k <kmer size> -k <kmer size>... -o <output_prefix> -S <path to scaffold file> [options]... <reads1> [reads2]...

For example:

abyss-sealer -b20G -k64 -k80 -k96 -k112 -k128 -o test -S scaffold.fa read1.fa read2.fa

Description

Sealer is an application of Konnector that closes intra-scaffold gaps. It performs three sequential functions. First, regions with Ns are identified from an input scaffold. Flanking nucleotues (2 x 100bp) are extracted from those regions while respecting the strand (5' to 3') direction on the sequence immediately downstream of each gap. In the second step, flanking sequence pairs are used as input to Konnector along with a set of reads with a high level of coverage redundancy. Ideally, the reads should represent the original dataset from which the draft assembly is generated, or further whole genome shotgun (WGS) sequencing data generated from the same sample. Within Konnector, the input WGS reads are used to populate a Bloom filter, tiling the reads with a sliding window of length k, thus generating a probabilistic representation of all the k-mers in the reads. Konnector also uses crude error removal and correctional algorithms, eliminating singletons (k-mers that are observed only once) and fixing base mismatches in the flanking sequence pairs. Sealer launches Konnector processes using a user-input range of k-mer lengths. In the third and final operation, succesfully merged sequences are inserted into the gaps of the original scaffolds, and Sealer outputs a new gap-filled scaffold file.

Installation

See ABySS installation instructions.

How to run as stand-alone application

abyss-sealer [-b bloom filter size][-k values...] [-o outputprefix] [-S assembly file] [options...] [reads...]

Sealer requires the following information to run:

  • draft assembly
  • user-supplied k values (>0)
  • output prefix
  • WGS reads (for building Bloom Filters)

Sample commands

Without pre-built bloom filters:

abyss-sealer -b20G -k64 -k96 -o run1 -S test.fa read1.fq.gz read2.fq.gz

With pre-built bloom filters:

abyss-sealer -k64 -k96 -o run1 -S test.fa -i k64.bloom -i k96.bloom read1.fq.gz read2.fq.gz

Reusable Bloom filters can be pre-built with abyss-bloom build, e.g.:

abyss-bloom build -vv -k64 -j12 -b20G -l2 k64.bloom read1.fq.gz read2.fq.gz

Note: when using pre-built bloom filters generated by abyss-bloom build, Sealer must be compiled with the same maxk value that abyss-bloom was compiled with. For example, if a Bloom filter was built with a maxk of 64, Sealer must be compiled with a maxk of 64 as well. If different values are used between the pre-built bloom filter and Sealer, any sequences generated will be nonsensical and incorrect.

Output files

  • prefix_log.txt
  • prefix_scaffold.fa
  • prefix_merged.fa
  • prefix_flanks_1.fq -> if --print-flanks option used
  • prefix_flanks_2.fq -> if --print-flanks option used

The log file contains results of each Konnector run. The structure of one run is as follows:

  • ## unique gaps closed for k##
  • No start/goal kmer: ###
  • No path: ###
  • Unique path: ###
  • Multiple paths: ###
  • Too many paths: ###
  • Too many branches: ###
  • Too many path/path mismatches: ###
  • Too many path/read mismatches: ###
  • Contains cycle: ###
  • Exceeded mem limit: ###
  • Skipped: ###
  • ### flanks left
  • k## run complete
  • Total gaps closed so far = ###

The scaffold.fa file is a gap-filled version of the draft assembly inserted into Sealer. The merged.fa file contains every newly generated sequence that were inserted into gaps, including the flanking sequences. Negative sizes of new sequences indicate Konnector collapsed the pair of flanking sequences. For example:

>[scaffold ID]_[original start position of gap on scaffold]_[size of new sequence] ACGCGACGAGCAGCGAGCACGAGCAGCGACGAGCGACGACGAGCAGCGACGAGCG

If --print-flanks option is enabled, Sealer outputs the flanking sequences used to insert into Konnector. This may be useful should users which to double check if this tool is extracting the correct sequences surrounding gaps. The structure of these files are as follows:

>[scaffold ID]_[original start position of gap on scaffold]_[size of gap]/[1 or 2 indicating whether left or right flank] GCTAGCTAGCTAGCTGATCGATCGTAGCTAGCTAGCTGACTAGCTGATCAGTCGA

How to optimize for gap closure

To optimize Sealer, users can observe the log files generated after a run and adjust parameters accordingly. If k runs are showing gaps having too many paths or branches, consider increasing -P or -B parameters, respectively.

Also consider increasing the number of k values used. Generally, large k-mers are better able to address highly repetitive genomic regions, while smaller k-mers are better able to resolve areas of low coverage.

Runtime and memory usage

More k values mean more bloom filters will be required, which will increase runtime as it takes time to build/load each bloom filter at the beginning of each k run. Memory usage is not affected by using more bloom filters.

The larger value used for parameters such as -P, -B or -F will increase runtime.

Options

Parameters of abyss-sealer

  • --print-flanks: outputs flank files
  • -S,--input-scaffold=FILE: load scaffold from FILE
  • -L,--flank-length=N: length of flanks to be used as pseudoreads [100]
  • -j,--threads=N: use N parallel threads [1]
  • -k,--kmer=N: the size of a k-mer
  • -b,--bloom-size=N: size of bloom filter. Required when not using pre-built Bloom filter(s).
  • -B,--max-branches=N: max branches in de Bruijn graph traversal; use 'nolimit' for no limit [1000]
  • -d,--dot-file=FILE: write graph traversals to a DOT file
  • -e,--fix-errors: find and fix single-base errors when reads have no kmers in bloom filter [disabled]
  • -f,--min-frag=N: min fragment size in base pairs [0]
  • -F,--max-frag=N: max fragment size in base pairs [1000]
  • -i,--input-bloom=FILE: load bloom filter from FILE
  • --mask: mask new and changed bases as lower case
  • --no-mask: do not mask bases [default]
  • --chastity: discard unchaste reads [default]
  • --no-chastity: do not discard unchaste reads
  • --trim-masked: trim masked bases from the ends of reads
  • --no-trim-masked: do not trim masked bases from the ends of reads [default]
  • -l,--long-search: start path search as close as possible to the beginnings of reads. Takes more time but improves results when bloom filter false positive rate is high [disabled]
  • -m,--flank-mismatches=N`: max mismatches between paths and flanks; use 'nolimit' for no limit [nolimit]
  • -M,--max-mismatches=N`: max mismatches between all alternate paths; use 'nolimit' for no limit [nolimit]
  • -n --no-limits`: disable all limits; equivalent to '-B nolimit -m nolimit -M nolimit -P nolimit'
  • -o,--output-prefix=FILE`: prefix of output FASTA files [required]
  • -P,--max-paths=N`: merge at most N alternate paths; use 'nolimit' for no limit [2]
  • -q,--trim-quality=N`: trim bases from the ends of reads whose quality is less than the threshold
  • --standard-quality: zero quality is `!' (33) default for FASTQ and SAM files
  • --illumina-quality: zero quality is `@' (64) default for qseq and export files
  • -r,--read-name=STR`: only process reads with names that contain STR
  • -s,--search-mem=N`: mem limit for graph searches; multiply by the number of threads (-j) to get the total mem used for graph traversal [500M]
  • -t,--trace-file=FILE`: write graph search stats to FILE
  • -v,--verbose`: display verbose output
  • --help: display this help and exit
  • --version: output version information and exit

k is the size of k-mer for the de Bruijn graph. You may specify multiple values of k, which will increase the nubmer of gaps closed at the cost of increased run time. Multiple values of k ought to be specified in increasing order, as lower values of k have fewer coverage gaps and are less likely to misassemble.

P is the threshold for number of paths allowed to be traversed. When set to 10, Konnector will attempt to close gaps even when there are 10 different paths found. It would attempt to create a consensus sequence between these paths. The default setting is 2.