Bonsai: Flexible Taxonomic Analysis and Extension

Bonsai contains varied utilities for taxonomic analysis and classification using exact subsequence matches. These include:

- A high-performance, generic taxonomic classifier
  - Efficient classification
    - 20x as fast, single-threaded, as Kraken in our benchmarks, while demonstrating significantly better thread scaling.
  - Arbitrary, user-defined spaced-seed encoding.
    - Reference compression by windowing/minimization schemes.
    - Generic minimization including by taxonomic depth, lexicographic value, subsequence specificity, or Shannon entropy.
  - Parallelized pairwise Jaccard Distance estimation using HyperLogLog sketches, which has recently migrated to dashing.
- An unsupervised method for taxonomic structure discovery and correction. (metatree)
- A threadsafe, SIMD-accelerated HyperLogLog implementation, which has migrated to hll.
- Scripts for downloading reference genomes from new (post-2014) and old RefSeq.
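
The windowing/minimization idea in the list above can be sketched in a few lines of Python. This is an illustration only, not Bonsai's implementation: the names `shannon_entropy` and `window_minimizers` are hypothetical, the sketch works on plain strings rather than encoded k-mers, and it assumes a window of `w` bases holds `w - k + 1` k-mers. Which direction Bonsai prefers for each score (for example, low or high entropy) is not stated here; the sketch simply keeps the k-mer with the smallest value of whatever key is supplied.

```python
import math
from collections import Counter


def shannon_entropy(kmer):
    """Shannon entropy (in bits) of the base composition of a k-mer."""
    counts = Counter(kmer)
    n = len(kmer)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def window_minimizers(seq, k, w, key=None):
    """Return the (position, k-mer) pairs selected as window minima.

    Each window spans w bases and so holds w - k + 1 k-mers. The k-mer
    with the smallest key wins; ties go to the leftmost candidate.
    Consecutive windows that pick the same k-mer contribute it once.
    """
    if not k <= w <= len(seq):
        raise ValueError("need k <= w <= len(seq)")
    if key is None:
        key = lambda kmer: kmer  # plain lexicographic order
    selected = []
    for start in range(len(seq) - w + 1):
        candidates = [(i, seq[i:i + k]) for i in range(start, start + w - k + 1)]
        best = min(candidates, key=lambda c: key(c[1]))
        if not selected or selected[-1] != best:
            selected.append(best)
    return selected
```

Passing `key=shannon_entropy` switches the same routine from lexicographic to entropy-based selection, which is the sense in which the minimization is "generic": only the scoring function changes.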

Tools have been compiled using both zlib and zstd, which means that they can transparently consume zlib-, zstd-, and uncompressed files.
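
As a rough illustration of what "transparently consume" can mean, the following Python sketch classifies an input by its leading magic bytes. It is not how Bonsai itself does this (the tools rely on the zlib and zstd libraries); the function name `sniff_compression` is hypothetical, and the sketch assumes that zlib-compressed inputs are gzip-format files such as the `.fna.gz` genomes used later in this document.

```python
GZIP_MAGIC = b"\x1f\x8b"          # gzip member header (RFC 1952)
ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"  # zstd frame magic 0xFD2FB528, little-endian


def sniff_compression(head):
    """Classify a file from its first bytes: 'gzip', 'zstd', or 'plain'."""
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    if head.startswith(ZSTD_MAGIC):
        return "zstd"
    return "plain"
```

A caller would read the first four bytes of a file in binary mode and pass them in, then choose a decompressor accordingly.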

All of these tools are experimental. Use at your own risk.

Build Instructions

cd bonsai && make bonsai

Unit Tests

We use the Catch testing framework. Build and run the tests with:

cd bonsai && make unit && ./unit

Dependencies

The primary dependency is sketch, stored in hll, which handles the sketching and bit-math requirements. In addition, we require zlib, ntHash, and zstd.

Usage

Encoding: Use Encoder from include/bonsai/encoder.h to encode k-mers directly, or RollingHasher to encode k-mers with a rolling hash, which allows k-mers of unbounded length. Both are driven through the for_each and for_each_hash functions.
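
To make the idea of direct k-mer encoding concrete, here is a minimal Python sketch of a generic 2-bit encoding with a constant-time update per base. It does not reproduce the Encoder or RollingHasher API: `encode_kmer` and `for_each_kmer` are hypothetical names, the base-to-code mapping is an assumption, and skipping windows that contain non-ACGT characters is a choice made for this example.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed mapping, for illustration


def encode_kmer(kmer):
    """Pack a k-mer into an integer, two bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | BASE_CODE[base]
    return value


def for_each_kmer(seq, k, func):
    """Call func(position, code) for every k-mer in seq.

    The code is updated in O(1) per base by shifting in the new base and
    masking off the oldest one. Any window containing a character outside
    ACGT (such as N) is skipped.
    """
    mask = (1 << (2 * k)) - 1
    value = 0
    valid = 0  # consecutive valid bases ending at the current position
    for i, base in enumerate(seq):
        code = BASE_CODE.get(base)
        if code is None:
            value, valid = 0, 0
            continue
        value = ((value << 2) | code) & mask
        valid += 1
        if valid >= k:
            func(i - k + 1, value)
```

With two bits per base, a 64-bit word holds at most 32 bases, which is the usual reason a direct encoding bounds k and a rolling hash does not.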

Executables:

Each executable prints its usage instructions when run with no options or with the -h flag.

For classification purposes, the commands involved are bonsai prebuild, bonsai build, and bonsai classify. prebuild is only required for taxonomic or feature minimization strategies, in which case database building requires twice as much memory. Unless you're very sure you know what you're doing, we recommend simply using bonsai build with either Entropy or Lexicographic minimization.

To build a database with k = 31 and window size = 50, minimized by entropy, from a taxonomy in ref/nodes.dmp and a nameidmap in ref/nameidmap.txt, and store it in bns.db:

bonsai build -e -w50 -k31 -p20 -T ref/nodes.dmp -M ref/nameidmap.txt bns.db `find ref/ -name '*.fna.gz'`
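
When the build is driven from a script, the same command line can be assembled in Python. This is a sketch under stated assumptions: `bonsai_build_argv` is a hypothetical helper, it uses only the flags shown in the command above, and `-e` and `-p20` are copied verbatim from that command rather than interpreted.

```python
from pathlib import Path


def bonsai_build_argv(ref_dir, db_path, k=31, window=50):
    """Assemble the argument list for the `bonsai build` example above."""
    ref = Path(ref_dir)
    # Equivalent of `find ref/ -name '*.fna.gz'`, sorted for reproducibility.
    genomes = sorted(str(p) for p in ref.rglob("*.fna.gz"))
    return [
        "bonsai", "build",
        "-e",             # copied verbatim from the example command
        f"-w{window}",    # window size
        f"-k{k}",         # k-mer length
        "-p20",           # copied verbatim from the example command
        "-T", str(ref / "nodes.dmp"),
        "-M", str(ref / "nameidmap.txt"),
        str(db_path),
        *genomes,
    ]

# Possible usage (requires a built bonsai executable on PATH):
#   import subprocess
#   subprocess.run(bonsai_build_argv("ref", "bns.db"), check=True)
```

Passing the genome paths as a list avoids shell quoting issues with unusual file names.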

To prepare the inputs above, use the script in python/download_genomes.py. The default, downloading all available genomes, can be run with python python/download_genomes.py --threads 20 all. By default, this places the downloaded genomes in the paths used in the bonsai build command above. These paths can be altered; see python/download_genomes.py -h/--help for details.
