cosmo/vari ver 0.5.1
Cosmo is a fast, low-memory DNA assembler that uses a succinct de Bruijn graph.
VARI, a succinct colored de Bruijn graph, can be found in the VARI branch.
After compiling, you can run Cosmo like so:
$ pack-edges <input_file> # this adds reverse complements and dummy edges, and packs them
$ cosmo-build <input_file>.packed # compresses and builds indices
$ cosmo-assemble <input_file>.packed.dbg # output: <input_file>.packed.dbg.fasta # NOT IMPLEMENTED YET
input_file is the binary output of a DSK run. Each program has a --help option that describes its usage in more detail.
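To make the "adds reverse complements" step concrete, here is a minimal Python sketch of what adding reverse complements to a k-mer set means. It is an illustration only, not the code in pack-edges: the function names are hypothetical, it works on plain strings rather than DSK's binary format, and it leaves out dummy edges and packing entirely.

```python
# Illustrative sketch; not Cosmo's pack-edges implementation.
# Function names are hypothetical and input is plain text, not DSK binary output.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer: str) -> str:
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return kmer.translate(COMPLEMENT)[::-1]


def with_reverse_complements(kmers):
    """Return the sorted, de-duplicated k-mers plus their reverse complements."""
    kmers = list(kmers)  # allow generators to be passed in
    result = set(kmers)
    result.update(reverse_complement(k) for k in kmers)
    return sorted(result)


if __name__ == "__main__":
    # "ACGT" is its own reverse complement, so it appears only once.
    print(with_reverse_complements(["ACGT", "AAAC"]))
```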
Here are some things that you don't want to let surprise you:
Definition of "k-mer"
Note that since our graph is edge-based, k defines the length of our edges, so our nodes are only k-1 symbols long.
If you want to construct a succinct de Bruijn graph whose nodes are k-mers, you will need to run DSK with k set to k+1. For example, using the output from
$ dsk <input_file> 27
will actually build a 26-dimensional de Bruijn graph.
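The off-by-one between the DSK k and the node length can be sketched in a few lines of Python. The helper names are hypothetical and exist only to illustrate the convention described above.

```python
# Illustrative sketch of the edge-based convention; helper names are hypothetical.

def edge_to_nodes(kmer: str):
    """Split a k-mer (an edge) into its source and target (k-1)-mer nodes."""
    return kmer[:-1], kmer[1:]


def node_length(dsk_k: int) -> int:
    """Node length of the graph built from k-mers counted with the given DSK k."""
    return dsk_k - 1


if __name__ == "__main__":
    print(edge_to_nodes("ACGT"))  # a 4-mer edge joins two 3-mer nodes
    print(node_length(27))        # DSK run with k = 27
```

So to get nodes of length k, the k-mers have to be counted with k+1.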
Note: Both even and odd k values should work with this assembler due to our loop-immune traversal.
Furthermore, most de Bruijn graph based assemblers add an edge between every pair of nodes that overlap. Instead, we take the k-mers as our edges (each joining two nodes of length k-1), so we only have edges that were directly represented in the read set. This makes more sense to us, as it reduces unnecessary branching. I may add support for the standard way in the future if anyone wants it (it would be similar to the dummy edge adding code).
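The difference between the two conventions can be seen on a tiny example. The following Python sketch uses hypothetical function names and plain string sets, not Cosmo's succinct representation. It compares the edges taken directly from the k-mers with the edges that would result from joining every pair of overlapping nodes.

```python
# Illustrative comparison of the two edge conventions; names are hypothetical.
from itertools import product


def nodes_of(kmers):
    """Collect the (k-1)-mer nodes at both ends of every k-mer."""
    nodes = set()
    for kmer in kmers:
        nodes.add(kmer[:-1])
        nodes.add(kmer[1:])
    return nodes


def observed_edges(kmers):
    """Edge-based convention: the edges are exactly the k-mers in the input."""
    return set(kmers)


def overlap_edges(kmers):
    """Node-based convention: join every pair of nodes whose suffix and prefix overlap."""
    nodes = nodes_of(kmers)
    return {u + v[-1] for u, v in product(nodes, repeat=2) if u[1:] == v[:-1]}


if __name__ == "__main__":
    kmers = ["ACG", "CGA"]
    # The node-based convention also adds GA -> AC, which was never observed.
    print(sorted(overlap_edges(kmers) - observed_edges(kmers)))
```

Here the extra edge is a 3-mer that appears in no input k-mer, which is the kind of additional branching the edge-based convention avoids.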
We currently only output the unitigs (paths between branching nodes).
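As a rough illustration of what "paths between branching nodes" means, here is a small Python sketch that extracts unitigs from a set of k-mer edges. It is an assumption-laden toy, not Cosmo's traversal: it uses dictionaries instead of the succinct graph, the function name is hypothetical, a node counts as branching when its in-degree or out-degree is not exactly 1, and isolated cycles with no branching node are skipped.

```python
# Toy unitig extraction; not Cosmo's traversal. The function name is hypothetical.
from collections import defaultdict


def unitigs(kmers):
    """Return the sorted sequences spelled by paths between branching nodes."""
    out_edges = defaultdict(list)
    in_degree = defaultdict(int)
    nodes = set()
    for kmer in sorted(set(kmers)):  # each distinct k-mer is one edge
        u, v = kmer[:-1], kmer[1:]
        out_edges[u].append(v)
        in_degree[v] += 1
        nodes.update((u, v))

    def is_branching(node):
        # Anything other than exactly one way in and one way out ends a unitig.
        return in_degree.get(node, 0) != 1 or len(out_edges.get(node, [])) != 1

    result = []
    for start in sorted(nodes):
        if not is_branching(start):
            continue
        for nxt in out_edges.get(start, []):
            seq = start + nxt[-1]
            # Follow the only outgoing edge until the next branching node.
            while not is_branching(nxt):
                nxt = out_edges[nxt][0]
                seq += nxt[-1]
            result.append(seq)
    return sorted(result)


if __name__ == "__main__":
    print(unitigs(["ACG", "CGT", "GTT"]))  # a single unbranched path
    print(unitigs(["ACG", "CGT", "CGA"]))  # node CG branches, so three short unitigs
```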
There is an included Makefile - just type make to build it (assuming you have the dependencies listed below).
To build with "Variable order mode", use the
- A compiler that supports C++11,
- Boost - ranges and range algorithms, zip iterator, tuple comparison, lots of good stuff,
- SDSL-lite - low level succinct data structures (for now you will have to use my branch if you want to use variable order graphs: clone this and check out the develop branch before compiling),
- TClap - command line parsing,
- DSK - k-mer counting (we need this for input),
- Optionally (for developers): Python and NumPy - rebuilding the lookup tables,
- STXXL - external merging (not actually required yet though)
Your help is more than welcome! Please fork and send a pull request, or contact me directly :)
Cosmos /ˈkɑz.moʊs/ (n) : "An ordered, harmonious whole.".
If that doesn't suit an assembly program then I don't know what does. The last s was dropped because it was nicer to say. Furthermore, it is a reference to the Seinfeld character Cosmo Kramer (whose last name I'm often reminded of while working on this stuff).
This software is copyright (c) Alex Bowe 2014 (bowe dot alexander at gmail dot com). It is released under the GNU General Public License (GPL) version 3.