Bayesian analysis of genomic sequence data under the multispecies coalescent model

bpp

Introduction

The aim of this project is to implement a versatile high-performance version of the BPP software. It should have the following properties:

  • open-source code with an appropriate open-source license.
  • 64-bit multi-threaded design that handles very large datasets.
  • easy to use and well-documented.
  • SIMD implementations of time-consuming parts.
  • Linux, Mac and Microsoft Windows compatibility.

BPP currently implements four methods:

  • Estimation of the parameters of species divergence times and population sizes under the multispecies coalescent (MSC) model when the species phylogeny is given (Rannala and Yang, 2003)

  • Inference of the species tree when the assignments of individuals to species are given by the user (Rannala and Yang, 2017)

  • Species delimitation using a user-specified guide tree (Yang and Rannala, 2010; Rannala and Yang, 2013)

  • Joint species delimitation and species tree estimation (Yang and Rannala, 2014)

BPP can also accommodate variable mutation rates among loci (Burgess and Yang, 2008) and heredity multipliers (Hey and Nielsen, 2004). Finally, BPP supports diploid data; phasing is done analytically as described by Gronau et al. (2011).

Compilation instructions

Currently, BPP requires GNU Bison and Flex to be installed on the target system. On a Debian-based Linux system, the two packages can be installed using the command:

apt-get install flex bison

BPP can then be compiled using the provided Makefile:

make

Compiling BPP requires GCC version 4.7 or newer, because the AVX and AVX-2 optimized functions are compiled even if your processor does not support them. This is fine: at run-time, BPP automatically selects the instruction set that your processor supports. You can therefore compile on one system and run BPP on any other compatible system.
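
To see which of these instruction sets a given Linux machine advertises, one can inspect /proc/cpuinfo. The sketch below is only an illustration: cpu_has_flag is a hypothetical helper that is not part of BPP, it works on Linux only, and it does not reproduce how BPP itself performs the detection.

```shell
# cpu_has_flag FLAG [CPUINFO_FILE]
# Hypothetical helper (not part of BPP): succeeds if FLAG appears as a whole
# word on the first "flags" line of CPUINFO_FILE (default: /proc/cpuinfo).
cpu_has_flag() {
    grep '^flags' "${2:-/proc/cpuinfo}" | head -n 1 | grep -qw "$1"
}
```

For example, `cpu_has_flag avx2 && echo "AVX-2 available"` prints a message only when the processor reports AVX-2. Note that Linux lists SSE3 under the flag name pni.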

If your compiler is older than GCC 4.7, you will get errors such as:

cc1: error: unrecognized command line option "-mavx2"

or

cc1: error: unrecognized command line option "-mavx"

If your compiler is GCC 4.6.x, you can compile BPP using:

make clean
make -e DISABLE_AVX2=1

If your compiler is older than GCC 4.6, compile using:

make -e DISABLE_AVX2=1 DISABLE_AVX=1

You can check your compiler version with:

gcc --version
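
The version rules above can be captured in a small shell function. This is a minimal sketch, not something shipped with BPP: bpp_make_flags is a hypothetical helper name, and it only encodes the three cases described in this section (4.7 or newer, 4.6.x, older than 4.6).

```shell
# bpp_make_flags GCC_VERSION
# Hypothetical helper (not part of BPP): prints the make variables that this
# README suggests for the given GCC version string, e.g. "4.6.3" or "9".
bpp_make_flags() {
    version=$1
    major=${version%%.*}
    rest=${version#*.}
    # A bare major version such as "9" has no minor component
    if [ "$rest" = "$version" ]; then
        minor=0
    else
        minor=${rest%%.*}
    fi
    if [ "$major" -gt 4 ] || { [ "$major" -eq 4 ] && [ "$minor" -ge 7 ]; }; then
        echo ""                               # GCC >= 4.7: default build
    elif [ "$major" -eq 4 ] && [ "$minor" -eq 6 ]; then
        echo "DISABLE_AVX2=1"                 # GCC 4.6.x: no AVX-2
    else
        echo "DISABLE_AVX2=1 DISABLE_AVX=1"   # older: no AVX-2, no AVX
    fi
}
```

A possible use is `make clean && make -e $(bpp_make_flags "$(gcc -dumpversion)")`, where `gcc -dumpversion` prints only the version number.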

Running BPP

After creating the control file, one can run BPP as follows:

bpp --cfile [CONTROL-FILE]

If you would like to resume from a checkpoint file, please run:

bpp --resume [CHECKPOINT-FILE]
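
For long runs it can be convenient to let a wrapper choose between the two invocations. The following is a sketch under stated assumptions: bpp_command is a hypothetical helper, and the control-file and checkpoint-file names are whatever you pass in, since this README does not specify how checkpoint files are named.

```shell
# bpp_command CONTROL_FILE [CHECKPOINT_FILE]
# Hypothetical helper (not part of BPP): prints the command to run, resuming
# from CHECKPOINT_FILE if it exists and starting from CONTROL_FILE otherwise.
bpp_command() {
    cfile=$1
    checkpoint=$2
    if [ -n "$checkpoint" ] && [ -f "$checkpoint" ]; then
        echo "bpp --resume $checkpoint"
    else
        echo "bpp --cfile $cfile"
    fi
}
```

The function only prints the command, so the output can be inspected before it is executed.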

More documentation regarding control files will be available soon on the wiki.

Citing BPP

Please cite the following publication if you use BPP:

Flouri T., Jiao X., Rannala B., Yang Z. (2018) Species Tree Inference with BPP using Genomic Sequences and the Multispecies Coalescent. Molecular Biology and Evolution (accepted manuscript). doi:10.1093/molbev/msy147

Please note that citing the publication corresponding to whichever of the four underlying methods you use may also be appropriate.

License and third party licenses

The code is currently licensed under the GNU Affero General Public License version 3.

Code

File                    Description
arch.c                  Architecture specific code (Linux/Mac/Windows)
allfixed.c              Summary statistics for method A00 (fixed species tree)
bpp.c                   Main file handling command-line parameters and executing selected methods
cfile.c                 Functions for parsing the control file
compress.c              Functions for compressing multiple sequence alignments into site patterns
core_likelihood.c       Core functions for evaluating the likelihood of a tree (non-vectorized)
core_likelihood_avx.c   Core functions for evaluating the likelihood of a tree (AVX version)
core_likelihood_avx2.c  Core functions for evaluating the likelihood of a tree (AVX-2 version)
core_likelihood_sse.c   Core functions for evaluating the likelihood of a tree (SSE-3 version)
core_partials.c         Core functions for computing partial likelihoods (non-vectorized)
core_partials_avx.c     Core functions for computing partial likelihoods (AVX version)
core_partials_avx2.c    Core functions for computing partial likelihoods (AVX-2 version)
core_partials_sse.c     Core functions for computing partial likelihoods (SSE-3 version)
core_pmatrix.c          Core functions for constructing the transition probability matrix
delimit.c               Species delimitation auxiliary functions and summary statistics
diploid.c               Functions for resolving/phasing diploid sequences
dlist.c                 Functions for handling doubly linked-lists
dump.c                  Functions for dumping the MCMC state into a checkpoint file
experimental.c          Experimental functions that are not yet production-ready
gtree.c                 Functions for setting and processing gene trees
hardware.c              Functions for hardware detection
hash.c                  Hash table implementation and related functions
lex_map.l               Lexical analyzer for parsing map files
lex_rtree.l             Lexical analyzer for parsing newick rooted trees
list.c                  Linked list implementation and related functions
load.c                  Functions for loading a checkpoint file
locus.c                 Locus specific functions
Makefile                Makefile
mapping.c               Functions for handling map files
maps.c                  Character mapping arrays for converting sequences to the internal representation
method.c                Function containing the MCMC loop and calls to proposals
msa.c                   Code for processing multiple sequence alignments
output.c                Auxiliary functions for printing pmatrices (to-be-renamed)
parse_map.y             Functions for parsing map files
parse_rtree.y           Functions for parsing rooted trees in newick format
phylip.c                Functions for parsing phylip files
random.c                Pseudo-random number generator functions
rtree.c                 Species tree export functions (to-be-renamed)
stree.c                 Functions for setting and processing the species tree
summary.c               Species tree inference summary related functions
util.c                  Various common utility functions

Acknowledgements

Special thanks to Yuttapong Thawornwattana and Mario dos Reis Barros for testing and bug reports.

References

  • Flouri T., Carrasco FI, Darriba D., Aberer AJ, Nguyen LT, Minh BQ, Haeseler A., Stamatakis A. (2015) The Phylogenetic Likelihood Library. Systematic Biology, 64(2):356-362. doi:10.1093/sysbio/syu084

  • Flouri T., Jiao X., Rannala B., Yang Z. (2018) Species Tree Inference with BPP using Genomic Sequences and the Multispecies Coalescent. Molecular Biology and Evolution, Accepted Manuscript. doi:10.1093/molbev/msy147

  • Rannala B., Yang Z. (2003) Bayes Estimation of Species Divergence Times and Ancestral Population Sizes using DNA Sequences From Multiple Loci. Genetics, 164:1645-1656. Available at: http://www.genetics.org/content/164/4/1645.long

  • Hey J., Nielsen R. (2004) Multilocus methods for estimating population sizes, migration rates and divergence time, with applications to the divergence of Drosophila pseudoobscura and D. persimilis. Genetics, 167(2):747-760. doi:10.1534/genetics.103.024182

  • Burgess R., Yang Z. (2008) Estimation of hominoid ancestral population sizes under Bayesian coalescent models incorporating mutation rate variation and sequencing errors. Molecular Biology and Evolution, 25(9):1979-1994. doi:10.1093/molbev/msn148

  • Yang Z., Rannala B. (2010) Bayesian species delimitation using multilocus sequence data. Proceedings of the National Academy of Sciences, 107(20):9264-9269. doi:10.1073/pnas.0913022107

  • Gronau I., Hubisz MJ, Gulko B., Danko CG, Siepel A. (2011) Bayesian inference of ancient human demography from individual genome sequences. Nature Genetics, 43(10):1031-1035. doi:10.1038/ng.937

  • Rannala B., Yang Z. (2013) Improved reversible jump algorithms for Bayesian species delimitation. Genetics, 194:245-253. doi:10.1534/genetics.112.149039

  • Yang Z., Rannala B. (2014) Unguided species delimitation using DNA sequence data from multiple loci. Molecular Biology and Evolution, 31(12):3125-3135. doi:10.1093/molbev/msu279

  • Rannala B., Yang Z. (2017) Efficient Bayesian Species Tree Inference under the Multispecies Coalescent. Systematic Biology, 66(5):823-842. doi:10.1093/sysbio/syw119