ABySS

ABySS is a de novo sequence assembler intended for short paired-end reads and large genomes.

Quick Start

Install ABySS on Debian or Ubuntu

Run the command

sudo apt-get install abyss

or download and install the Debian package.

Install ABySS on Mac OS X

Install Homebrew, and run the command

brew install homebrew/science/abyss

Assemble a small synthetic data set

wget http://www.bcgsc.ca/platform/bioinfo/software/abyss/releases/1.3.4/test-data.tar.gz
tar xzvf test-data.tar.gz
abyss-pe k=25 name=test \
    in='test-data/reads1.fastq test-data/reads2.fastq'

Calculate assembly contiguity statistics

abyss-fac test-unitigs.fa

Dependencies

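ABySS requires the Boost C++ libraries. MPI is needed to build the parallel assembler, and sparsehash is recommended to reduce memory usage; see "Compiling ABySS from source" for details.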

ABySS requires a C++ compiler that supports OpenMP such as GCC.

Compiling ABySS with Boost 1.51.0 or 1.52.0 fails because of a bug in those releases. Later versions of Boost compile without error.

Compiling ABySS from GitHub

When installing ABySS from GitHub source, Autoconf and Automake are required to generate the configure script and make files:

./autogen.sh

See "Compiling ABySS from source" for further steps.

Compiling ABySS from source

To compile and install ABySS in /usr/local:

./configure
make
sudo make install

To install ABySS in a specified directory:

./configure --prefix=/opt/abyss
make
sudo make install

ABySS uses OpenMP for parallelization, which requires a modern compiler such as GCC 4.2 or greater. If you have an older compiler, it is best to upgrade your compiler if possible. If you have multiple versions of GCC installed, you can specify a different compiler:

./configure CC=gcc-4.6 CXX=g++-4.6

ABySS requires the Boost C++ libraries. Many systems come with Boost installed. If yours does not, you can download Boost. It is not necessary to compile Boost before installing it. The Boost header file directory should be found at /usr/include/boost, in the ABySS source directory, or its location specified to configure:

./configure --with-boost=/usr/local/include

If you wish to build the parallel assembler with MPI support, MPI should be found in /usr/include and /usr/lib or its location specified to configure:

./configure --with-mpi=/usr/lib/openmpi

ABySS should be built using the sparsehash library to reduce memory usage, although it will build without. sparsehash should be found in /usr/include or its location specified to configure:

./configure CPPFLAGS=-I/usr/local/include

If SQLite is installed in non-default directories, its location can be specified to configure:

./configure --with-sqlite=/opt/sqlite3

The default maximum k-mer size is 64 and may be decreased to reduce memory usage or increased at compile time. This value must be a multiple of 32 (i.e. 32, 64, 96, 128, etc):

./configure --enable-maxk=96
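
Because the limit must be a multiple of 32, the smallest suitable value for a given k can be found by rounding up. The following is a minimal sketch; the helper name maxk_for is made up for illustration and is not part of ABySS.

```shell
# maxk_for K: print the smallest multiple of 32 that is >= K.
# Hypothetical helper, not an ABySS command.
maxk_for() {
    echo $(( ($1 + 31) / 32 * 32 ))
}

maxk_for 90    # prints 96
```

The result could then be passed to configure, for example as --enable-maxk=96 for assemblies with k up to 90.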

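ABySS is compiled with warnings treated as errors. If compiler warnings stop the build, you can override the flags so that warnings are still reported but no longer fail the build: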

make AM_CXXFLAGS=-Wall

To run ABySS, its executables should be found in your PATH. If you installed ABySS in /opt/abyss, add /opt/abyss/bin to your PATH:

PATH=/opt/abyss/bin:$PATH

Assembling a paired-end library

To assemble paired reads in two files named reads1.fa and reads2.fa into contigs in a file named ecoli-contigs.fa, run the command:

abyss-pe name=ecoli k=64 in='reads1.fa reads2.fa'

The parameter in specifies the input files to read. They may be in FASTA, FASTQ, qseq, export, SRA, SAM or BAM format, may be compressed with gz, bz2 or xz, and may be tarred. The assembled contigs will be stored in ${name}-contigs.fa.

A pair of reads must be named with the suffixes /1 and /2 to identify the first and second read, or the reads may be named identically. The paired reads may be in separate files or interleaved in a single file.

Reads without mates should be placed in a file specified by the parameter se (single-end). Reads without mates in the paired-end files will slow down the paired-end assembler considerably during the abyss-fixmate stage.
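
The naming rule for paired reads can be checked before starting an assembly. The following is a rough sketch for an interleaved FASTA file only; check_interleaved is a hypothetical helper, not an ABySS program. It verifies that consecutive reads have the same name once a trailing /1 or /2 is removed, and that the number of reads is even. It does not check which read of a pair carries which suffix.

```shell
# check_interleaved FILE: succeed if FILE (interleaved FASTA) holds
# consecutive read pairs whose names match once a trailing /1 or /2
# is removed. Hypothetical helper, not an ABySS command.
check_interleaved() {
    awk '
        /^>/ {
            n++
            name = substr($1, 2)        # drop the leading ">"
            sub(/\/[12]$/, "", name)    # drop a trailing /1 or /2
            if (n % 2 == 1) first = name
            else if (name != first) bad = 1
        }
        END { exit (bad || n % 2) ? 1 : 0 }
    ' "$1"
}

# Demonstration on a tiny made-up file.
tmp=$(mktemp)
printf '>r1/1\nACGT\n>r1/2\nTTGA\n' > "$tmp"
check_interleaved "$tmp" && echo "pairs look consistent"
rm -f "$tmp"
```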

Assembling multiple libraries

The distribution of fragment sizes of each library is calculated empirically by aligning paired reads to the contigs produced by the single-end assembler, and the distribution is stored in a file with the extension .hist, such as ecoli-3.hist. The N50 of the single-end assembly must be well over the fragment size to obtain an accurate empirical distribution.

Here's an example scenario of assembling a data set with two different fragment libraries and single-end reads. Note that the names of the libraries (pea and peb) are arbitrary.

  • Library pea has reads in two files, pea_1.fa and pea_2.fa.
  • Library peb has reads in two files, peb_1.fa and peb_2.fa.
  • Single-end reads are stored in two files, se1.fa and se2.fa.

The command line to assemble this example data set is:

abyss-pe k=64 name=ecoli lib='pea peb' \
    pea='pea_1.fa pea_2.fa' peb='peb_1.fa peb_2.fa' \
    se='se1.fa se2.fa'

The empirical distribution of fragment sizes will be stored in two files named pea-3.hist and peb-3.hist. These files may be plotted to check that the empirical distribution agrees with the expected distribution. The assembled contigs will be stored in ${name}-contigs.fa.
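
As a quick alternative to plotting, the mean of a histogram can be computed on the command line. This sketch assumes that a .hist file has two whitespace-separated columns, a fragment size followed by the number of pairs with that size; that layout is an assumption and should be checked against your own files. hist_mean is a hypothetical helper, not an ABySS program.

```shell
# hist_mean FILE: print the count-weighted mean of column 1.
# Assumes two whitespace-separated columns: fragment size, then the
# number of pairs with that size. Hypothetical helper.
hist_mean() {
    awk '{ sum += $1 * $2; n += $2 }
         END { if (n > 0) printf "%.1f\n", sum / n }' "$1"
}

# Demonstration on a tiny made-up histogram.
tmp=$(mktemp)
printf '180 1\n200 2\n220 1\n' > "$tmp"
hist_mean "$tmp"    # prints 200.0
rm -f "$tmp"
```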

Scaffolding

Long-distance mate-pair libraries may be used to scaffold an assembly. Specify the names of the mate-pair libraries using the parameter mp. The scaffolds will be stored in the file ${name}-scaffolds.fa. Here's an example of assembling a data set with two paired-end libraries and two mate-pair libraries. Note that the names of the libraries (pea, peb, mpc, mpd) are arbitrary.

abyss-pe k=64 name=ecoli lib='pea peb' mp='mpc mpd' \
    pea='pea_1.fa pea_2.fa' peb='peb_1.fa peb_2.fa' \
    mpc='mpc_1.fa mpc_2.fa' mpd='mpd_1.fa mpd_2.fa'

The mate-pair libraries are used only for scaffolding and do not contribute towards the consensus sequence.

Rescaffolding with long sequences

Long sequences such as RNA-Seq contigs can be used to rescaffold an assembly. Sequences are aligned using BWA-MEM to the assembled scaffolds. Additional scaffolds are then formed between scaffolds that can be linked unambiguously when considering all BWA-MEM alignments.

Similar to scaffolding, the names of the datasets can be specified with the long parameter. These scaffolds will be stored in the file ${name}-long-scaffs.fa. The following is an example of an assembly with paired-end (PET) libraries, mate-pair (MPET) libraries and an RNA-Seq assembly. Note that the names of the libraries are arbitrary.

abyss-pe k=64 name=ecoli lib='pe1 pe2' mp='mp1 mp2' long='longa' \
    pe1='pe1_1.fa pe1_2.fa' pe2='pe2_1.fa pe2_2.fa' \
    mp1='mp1_1.fa mp1_2.fa' mp2='mp2_1.fa mp2_2.fa' \
    longa='longa.fa'

Assembling using a Bloom filter de Bruijn graph

Assemblies may be performed using a Bloom filter de Bruijn graph, which typically reduces memory requirements by an order of magnitude. To assemble in Bloom filter mode, the user must specify 3 additional parameters: B (Bloom filter size in bytes), H (number of Bloom filter hash functions), and kc (minimum k-mer count threshold). B is the overall memory budget for the Bloom filter assembler, and may be specified with the unit suffixes 'k' (kilobytes), 'M' (megabytes) or 'G' (gigabytes). If no unit is specified, bytes are assumed. For example, the following will run an E. coli assembly with an overall memory budget of 100 megabytes, 3 hash functions and a minimum k-mer count threshold of 3, with verbose logging enabled:

abyss-pe name=ecoli k=64 in='reads1.fa reads2.fa' B=100M H=3 kc=3 v=-v

At present, the user must calculate suitable values for B and H, and finding the best value for kc may require experimentation (optimal values are typically in the range of 2-4). Internally, the Bloom filter assembler divides the memory budget (B) equally across (kc + 1) Bloom filters: kc Bloom filters make up the cascading Bloom filter, and one additional Bloom filter tracks k-mers that have already been included in contigs. We recommend targeting a Bloom filter false positive rate (FPR) of less than 5%, as reported in the assembly log when using the v=-v option (verbose level 1).
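
When choosing B and H, a rough estimate can be made before running an assembly. The sketch below splits the budget across kc + 1 filters as described above, then applies the textbook Bloom filter approximation (1 - e^(-H*N/m))^H for a single filter of m bits holding N distinct k-mers. Both helpers are hypothetical, the number of distinct k-mers is a guess that you must supply, and the FPR reported in the assembly log remains the authoritative value. For illustration, 100 megabytes is taken as 100,000,000 bytes.

```shell
# bloom_per_filter_bytes B KC: bytes available to each of the KC + 1
# Bloom filters when a budget of B bytes is split equally.
bloom_per_filter_bytes() {
    echo $(( $1 / ($2 + 1) ))
}

# bloom_fpr B KC H N: textbook false positive estimate
# (1 - e^(-H*N/m))^H for one filter of m bits holding N distinct
# k-mers. N is a guess supplied by the user, not an ABySS value.
bloom_fpr() {
    awk -v b="$1" -v kc="$2" -v h="$3" -v n="$4" 'BEGIN {
        m = 8 * b / (kc + 1)            # bits per filter
        printf "%.6f\n", (1 - exp(-h * n / m)) ^ h
    }'
}

bloom_per_filter_bytes 100000000 3    # prints 25000000
bloom_fpr 100000000 3 3 5000000       # B, kc, H, guessed distinct k-mers
```

If the estimate is close to or above 0.05, a larger B is probably needed for that data set.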

Assembling using a paired de Bruijn graph

Assemblies may be performed using a paired de Bruijn graph instead of a standard de Bruijn graph. In paired de Bruijn graph mode, ABySS uses k-mer pairs in place of k-mers, where each k-mer pair consists of two equal-size k-mers separated by a fixed distance. A k-mer pair is functionally similar to a large k-mer spanning the breadth of the k-mer pair, but uses less memory because the sequence in the gap is not stored. To assemble using paired de Bruijn graph mode, specify both the individual k-mer size (K) and the k-mer pair span (k). For example, to assemble E. coli with an individual k-mer size of 16 and a k-mer pair span of 64:

abyss-pe name=ecoli K=16 k=64 in='reads1.fa reads2.fa'

In this example, the size of the intervening gap between the two k-mers of a pair is 32 bp (64 - 2*16). Note that the k parameter takes on a new meaning in paired de Bruijn graph mode: k indicates the k-mer pair span when K is set, whereas k indicates the k-mer size in standard de Bruijn graph mode (when K is not set).
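
The gap arithmetic can be written as a small helper for checking a combination of k and K before starting a run. This is only a sketch; kmer_pair_gap is a hypothetical name, and the only combination it rejects is one where the span cannot hold two k-mers of the given size.

```shell
# kmer_pair_gap k K: print the gap (bp) between the two k-mers of a
# pair with span k and individual k-mer size K. Hypothetical helper.
kmer_pair_gap() {
    gap=$(( $1 - 2 * $2 ))
    if [ "$gap" -lt 0 ]; then
        echo "span k=$1 is too small for two k-mers of size K=$2" >&2
        return 1
    fi
    echo "$gap"
}

kmer_pair_gap 64 16    # prints 32
```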

Assembling a strand-specific RNA-Seq library

Strand-specific RNA-Seq libraries can be assembled such that the resulting unitigs, contigs and scaffolds are oriented correctly with respect to the original transcripts that were sequenced. In order to run ABySS in strand-specific mode, the SS parameter must be used as in the following example:

abyss-pe name=SS-RNA k=64 in='reads1.fa reads2.fa' SS=--SS

The expected orientation for the read sequences with respect to the original RNA is RF, i.e. the first read in a read pair is always in reverse orientation.

Optimizing the parameter k

To find the optimal value of k, run multiple assemblies and inspect the assembly contiguity statistics. The following shell snippet will assemble for every eighth value of k from 50 to 90.

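# Assemble once per value of k (50, 58, ..., 90), each in its own directory.
for k in $(seq 50 8 90); do
    mkdir -p k$k
    abyss-pe -C k$k name=ecoli k=$k in=../reads.fa
done
# Compare the contiguity statistics of all the assemblies.
abyss-fac k*/ecoli-contigs.fa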
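
When comparing assemblies, N50 is a commonly used contiguity statistic: the length of the contig at which the cumulative length, counted from the longest contig down, first reaches half of the total. The sketch below illustrates that definition on a plain list of lengths. It is not the abyss-fac implementation, which may apply additional rules such as a minimum contig length, and the name n50 is made up for illustration.

```shell
# n50: read one contig length per line on stdin and print the N50.
# Hypothetical helper illustrating the usual definition only.
n50() {
    sort -rn | awk '
        { len[NR] = $1; total += $1 }
        END {
            for (i = 1; i <= NR; i++) {
                running += len[i]
                if (2 * running >= total) { print len[i]; exit }
            }
        }'
}

printf '%s\n' 100 200 300 400 500 | n50    # prints 400
```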

The maximum value for k is fixed at compile time. This limit may be changed using the --enable-maxk option of configure (see "Compiling ABySS from source"). It may be decreased to 32 to reduce memory usage or increased to larger values.

Parallel processing

The np option of abyss-pe specifies the number of processes to use for the parallel MPI job. Without any MPI configuration, this will allow you to use multiple cores on a single machine. To use multiple machines for assembly, you must create a hostfile for mpirun, which is described in the mpirun man page.

Do not run mpirun -np 8 abyss-pe. To run ABySS with 8 processes, use abyss-pe np=8. The abyss-pe driver script will start the MPI processes itself, like so: mpirun -np 8 ABYSS-P.

The paired-end assembly stage is multithreaded, but must run on a single machine. The number of threads to use may be specified with the parameter j. The default value for j is the value of np.

Running ABySS on a cluster

ABySS integrates well with cluster job schedulers, such as:

  • SGE (Sun Grid Engine)
  • Portable Batch System (PBS)
  • Load Sharing Facility (LSF)
  • IBM LoadLeveler

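For example, to submit an array of jobs to assemble every eighth value of k between 50 and 90 using 64 processes for each job (the command omits name, k and np because, under SGE, abyss-pe takes their defaults from the environment variables JOB_NAME, SGE_TASK_ID and NSLOTS respectively):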

qsub -N ecoli -pe openmpi 64 -t 50-90:8 \
    <<<'mkdir k$SGE_TASK_ID && abyss-pe -C k$SGE_TASK_ID in=/data/reads.fa'

Using the DIDA alignment framework

ABySS supports the use of DIDA (Distributed Indexing Dispatched Alignment), an MPI-based framework for computing sequence alignments in parallel across multiple machines. The DIDA software must be separately downloaded and installed from http://www.bcgsc.ca/platform/bioinfo/software/dida. In comparison to the standard ABySS alignment stages which are constrained to a single machine, DIDA offers improved performance and the ability to scale to larger targets. Please see the DIDA section of the abyss-pe man page (in the doc subdirectory) for details on usage.

Assembly Parameters

Parameters of the driver script, abyss-pe

  • a: maximum number of branches of a bubble [2]
  • b: maximum length of a bubble (bp) [""]
  • B: Bloom filter size (e.g. "100M")
  • c: minimum mean k-mer coverage of a unitig [sqrt(median)]
  • d: allowable error of a distance estimate (bp) [6]
  • e: minimum erosion k-mer coverage [round(sqrt(median))]
  • E: minimum erosion k-mer coverage per strand [1 if sqrt(median) > 2 else 0]
  • G: genome size, used to calculate NG50 [disabled]
  • H: number of Bloom filter hash functions [1]
  • j: number of threads [2]
  • k: size of k-mer (when K is not set) or the span of a k-mer pair (when K is set)
  • kc: minimum k-mer count threshold for Bloom filter assembly [2]
  • K: the length of a single k-mer in a k-mer pair (bp)
  • l: minimum alignment length of a read (bp) [40]
  • m: minimum overlap of two unitigs (bp) [30]
  • n: minimum number of pairs required for building contigs [10]
  • N: minimum number of pairs required for building scaffolds [n]
  • np: number of MPI processes [1]
  • p: minimum sequence identity of a bubble [0.9]
  • q: minimum base quality [3]
  • s: minimum unitig size required for building contigs (bp) [1000]
  • S: minimum contig size required for building scaffolds (bp) [1000-10000]
  • t: maximum length of blunt contigs to trim [k]
  • v: use v=-v for verbose logging, v=-vv for extra verbose [disabled]
  • x: spaced seed (Bloom filter assembly only)

Please see the abyss-pe manual page for more information on assembly parameters.

Environment variables

abyss-pe configuration variables may be set on the command line or from the environment, for example with export k=20. abyss-pe may therefore pick up variables from your environment that you did not intend to set, which can cause unexpected results. To troubleshoot this, use the abyss-pe env command to print the values of all the abyss-pe configuration variables:

abyss-pe env [options]
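
Independently of abyss-pe, the shell environment can be inspected for exported variables whose names collide with abyss-pe parameters. The following is a sketch only: abyss_env_suspects is a hypothetical helper, and its list of names is copied from the parameters mentioned in this README, so it may be incomplete.

```shell
# abyss_env_suspects: list exported variables whose names match an
# abyss-pe parameter named in this README. Hypothetical helper; the
# name list is copied from this document and may be incomplete.
abyss_env_suspects() {
    env | grep -E '^(a|b|B|c|d|e|E|G|H|j|k|kc|K|l|m|n|N|np|p|q|s|S|t|v|x|name|in|lib|mp|se|long|db|species|strain|library)=' || true
}

# Demonstration: an exported k is reported.
( export k=20; abyss_env_suspects )    # output includes k=20
```

Any variable listed here that you did not intend for abyss-pe can be removed with unset before running the assembly.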

ABySS programs

abyss-pe is a driver script implemented as a Makefile. Any option of make may be used with abyss-pe. Particularly useful options are:

  • -C dir, --directory=dir Change to the directory dir and store the results there.
  • -n, --dry-run Print the commands that would be executed, but do not execute them.

abyss-pe uses the following programs, which must be found in your PATH:

  • ABYSS: de Bruijn graph assembler
  • ABYSS-P: parallel (MPI) de Bruijn graph assembler
  • AdjList: find overlapping sequences
  • DistanceEst: estimate the distance between sequences
  • MergeContigs: merge sequences
  • MergePaths: merge overlapping paths
  • Overlap: find overlapping sequences using paired-end reads
  • PathConsensus: find a consensus sequence of ambiguous paths
  • PathOverlap: find overlapping paths
  • PopBubbles: remove bubbles from the sequence overlap graph
  • SimpleGraph: find paths through the overlap graph
  • abyss-fac: calculate assembly contiguity statistics
  • abyss-filtergraph: remove shim contigs from the overlap graph
  • abyss-fixmate: fill the paired-end fields of SAM alignments
  • abyss-map: map reads to a reference sequence
  • abyss-scaffold: scaffold contigs using distance estimates
  • abyss-todot: convert graph formats and merge graphs
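
Whether these programs can be found in your PATH can be checked with the standard command -v shell builtin. In this sketch, missing_programs is a hypothetical helper; the program names passed to it are the ones listed above.

```shell
# missing_programs NAME...: print each NAME that is not found in PATH.
# Hypothetical helper built on the standard "command -v" builtin.
missing_programs() {
    for prog in "$@"; do
        command -v "$prog" >/dev/null 2>&1 || echo "$prog"
    done
}

missing_programs ABYSS ABYSS-P AdjList DistanceEst MergeContigs \
    MergePaths Overlap PathConsensus PathOverlap PopBubbles SimpleGraph \
    abyss-fac abyss-filtergraph abyss-fixmate abyss-map abyss-scaffold \
    abyss-todot
```

No output means that every listed program was found.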

This flowchart shows the ABySS assembly pipeline and its intermediate files.

Export to SQLite Database

ABySS has built-in support for SQLite and can export log values to an SQLite database file and/or .csv files at runtime.

Database parameters

Of abyss-pe:

  • db: path to SQLite repository file [$(name).sqlite]
  • species: name of species to archive [ ]
  • strain: name of strain to archive [ ]
  • library: name of library to archive [ ]

For example, to export data of species 'Ecoli', strain 'O121' and library 'pea' into your SQLite database repository named '/abyss/test.sqlite':

abyss-pe db=/abyss/test.sqlite species=Ecoli strain=O121 library=pea [other options]

Helper programs

Found in your path:

  • abyss-db-txt: create a flat file showing entire repository at a glance
  • abyss-db-csv: create .csv table(s) from the repository

Usage:

abyss-db-txt /your/repository
abyss-db-csv /your/repository program(s)

For example,

abyss-db-txt repo.sqlite

abyss-db-csv repo.sqlite DistanceEst
abyss-db-csv repo.sqlite DistanceEst abyss-scaffold
abyss-db-csv repo.sqlite --all

Publications

ABySS

Simpson, Jared T., Kim Wong, Shaun D. Jackman, Jacqueline E. Schein, Steven JM Jones, and Inanc Birol. ABySS: a parallel assembler for short read sequence data. Genome research 19, no. 6 (2009): 1117-1123. doi:10.1101/gr.089532.108

Trans-ABySS

Robertson, Gordon, Jacqueline Schein, Readman Chiu, Richard Corbett, Matthew Field, Shaun D. Jackman, Karen Mungall et al. De novo assembly and analysis of RNA-seq data. Nature methods 7, no. 11 (2010): 909-912. doi:10.1038/nmeth.1517

ABySS-Explorer

Nielsen, Cydney B., Shaun D. Jackman, Inanc Birol, and Steven JM Jones. ABySS-Explorer: visualizing genome sequence assemblies. IEEE Transactions on Visualization and Computer Graphics 15, no. 6 (2009): 881-888. doi:10.1109/TVCG.2009.116

Support

Ask a question on Biostars.

Create a new issue on GitHub.

Subscribe to the ABySS mailing list, abyss-users@googlegroups.com.

For questions related to transcriptome assembly, contact the Trans-ABySS mailing list, trans-abyss@googlegroups.com.

Authors

Supervised by Dr. Inanc Birol.

Copyright 2016 Canada's Michael Smith Genome Sciences Centre