# Species-tree & species-delimitation using *bpp* (BP&P)
The program *bpp* by Rannala & Yang (2010; 2015) is a powerful tool for inferring species tree parameters and testing species delimitation hypotheses. It is *relatively* easy to use, and best of all, it's *quite fast*, although not highly parallelizable. This notebook describes a streamlined approach to set up input files for testing different hypotheses in *bpp*, and to do so in a clear programmatic way that makes it easy to perform many tests over many different parameter settings. We also show how to distribute many separate jobs to run in parallel on a cluster.

## Notebook setup
This is a Jupyter notebook, a reproducible and executable document. The code in this notebook is Python (2.7), and should be executed in a jupyter-notebook like this one. Execute each cell in order to reproduce our entire analysis. We make use of the [ipyparallel](http://ipyparallel.rtfd.io) Python library to distribute *bpp* jobs across processors in parallel. If that is confusing, see our [tutorial]() on using ipcluster with jupyter. The example data set used in this analysis is from the [empirical example ipyrad tutorial](http://ipyrad.readthedocs.io/pedicularis_.html).

#### Install required software
All software required for this notebook can be installed using conda. 

In [1]:
## conda install -c ipyrad ipyrad
## conda install -c ipyrad bpp
## conda install -c eaton-lab toytree

In [2]:
import ipyrad.analysis as ipa         ## ipyrad analysis tools
import ipyparallel as ipp             ## parallelization
import pandas as pd                   ## DataFrames
import numpy as np                    ## data generation
import toytree                        ## tree plotting

#### Connect to an ipyparallel cluster
We will use the `ipyparallel` library to submit jobs to run in parallel on a cluster. We have a separate tutorial with more background about using ipyparallel. You will need to have an `ipcluster` instance running in a separate terminal on your machine (or ideally, it is running on your HPC cluster). The code below simply connects to that cluster and prints how many CPUs are available for use. 

In [3]:
## Connect to a running ipcluster instance
ipyclient = ipp.Client()

## print information about our cluster
print "Connected to {} cores".format(len(ipyclient))

Connected to 4 cores


## Analysis setup

#### Enter paths and input files  (I/O) 
You must define a tree with the "species" names in your analysis. This will act either as a fixed tree or as a guide tree. You must also define an IMAP dictionary, which maps sample names to "species" names. You can also define an optional MINMAP dictionary, which is used to filter RAD loci so that only those in which every species has data for at least N samples are included.

In [4]:
## set the location of our input .loci file
locifile = "./analysis-ipyrad/pedic-full_outfiles/pedic-full.alleles.loci"

## set the output directory. It will be created if it doesn't exist.
workdir = "./analysis-bpp"

## a tree hypothesis (guidetree) (here based on tetrad results)
newick = "((((((rex, lip), rck), tha), cup), (cys, (cya, sup))), prz);"

## a dictionary mapping sample names to 'species' names
imap = {
    "prz": ["32082_przewalskii", "33588_przewalskii"],
    "cys": ["41478_cyathophylloides", "41954_cyathophylloides"],
    "cya": ["30686_cyathophylla"],
    "sup": ["29154_superba"],
    "cup": ["33413_thamno"],
    "tha": ["30556_thamno"],
    "rck": ["35236_rex"],
    "rex": ["35855_rex", "40578_rex"],
    "lip": ["39618_rex", "38362_rex"],  
    }

## optional: loci will be filtered if they do not have data for at
## least N samples in each species.
minmap = {
    "prz": 2,
    "cys": 2,
    "cya": 1,
    "sup": 1,
    "cup": 1,
    "tha": 1, 
    "rck": 1,
    "rex": 2,
    "lip": 2,
    }
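To make the role of `minmap` concrete, here is a minimal sketch of the kind of check it implies for a single locus. The function name `passes_minmap` and the toy dictionaries are hypothetical and only illustrate the idea; this is not how ipyrad implements the filter internally.

```python
# Hypothetical helper (not part of ipyrad): decide whether one locus passes
# a minmap-style filter, given the sample names that have data in it.
def passes_minmap(samples_in_locus, imap, minmap):
    present = set(samples_in_locus)
    for species, members in imap.items():
        # count how many samples of this species have data in the locus
        n_with_data = sum(1 for name in members if name in present)
        if n_with_data < minmap.get(species, 0):
            return False
    return True

# toy mapping using two of the 'species' defined above
toy_imap = {
    "prz": ["32082_przewalskii", "33588_przewalskii"],
    "cya": ["30686_cyathophylla"],
}
toy_minmap = {"prz": 2, "cya": 1}

locus_a = ["32082_przewalskii", "33588_przewalskii", "30686_cyathophylla"]
locus_b = ["32082_przewalskii", "30686_cyathophylla"]

print(passes_minmap(locus_a, toy_imap, toy_minmap))  # True
print(passes_minmap(locus_b, toy_imap, toy_minmap))  # False: only one prz sample
```

A species missing from `minmap` is treated here as requiring zero samples, which is an assumption of this sketch.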

In [5]:
## check your (starting) tree hypothesis
toytree.tree(newick).draw();
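Because the tip names in the tree must match the "species" names in the IMAP dictionary, a quick consistency check can catch typos before any files are written. The sketch below uses a hypothetical helper, `newick_tip_names`, built on a simple regular expression; it assumes plain alphanumeric tip names with no quoting, and is not a general newick parser.

```python
import re

def newick_tip_names(newick):
    # A tip name directly follows "(" or ","; internal node labels follow ")"
    # and branch lengths follow ":", so neither is matched by this pattern.
    return re.findall(r"[(,]\s*([A-Za-z0-9_]+)", newick)

tree = "((((((rex, lip), rck), tha), cup), (cys, (cya, sup))), prz);"
species = ["prz", "cys", "cya", "sup", "cup", "tha", "rck", "rex", "lip"]

tips = newick_tip_names(tree)
missing_from_imap = set(tips) - set(species)
missing_from_tree = set(species) - set(tips)

print(len(tips))                               # 9
print(missing_from_imap, missing_from_tree)    # both empty
```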

## The *bpp* Class object

To simplify the creation of input files for *bpp* analyses we've created a bpp job generator object that can be accessed from `ipa.bpp()`. Running *bpp* requires three input files (.ctl, .imap, and .seq) of which the .ctl file is the most important since it contains the parameters for a run and points to the location of the other two files. The `ipa.bpp()` object can be used to easily modify parameter settings for a run, to generate the input files, and if desired, to submit the bpp jobs to run on a cluster (your ipyclient cluster). 

In [6]:
## create a bpp object to run algorithm 00
test = ipa.bpp(
    locifile=locifile,
    guidetree=newick, 
    imap=imap, 
    workdir=workdir,
    minmap=minmap,   
    )

In [7]:
## set some optional params, leaving others at their defaults
test.params.burnin = 5000
test.params.nsample = 20000
test.params.sampfreq = 20

## print params
test.params

burnin          5000                
cleandata       0                   
copied          False               
delimit_alg     (0, 5)              
finetune        (0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01)
infer_delimit   0                   
infer_sptree    0                   
nsample         20000               
sampfreq        20                  
seed            12345               
tauprior        (2, 2000, 1)        
thetaprior      (2, 2000)           
usedata         1                   

In [8]:
## set some optional filters leaving others at their defaults
test.filters.maxloci = 500
test.filters.minsnps = 4

## print filters
test.filters

maxloci   500                 
minmap    {'cys': 4, 'rex': 4, 'cup': 2, 'rck': 2, 'cya': 2, 'lip': 4, 'sup': 2, 'tha': 2, 'prz': 4}
minsnps   4                   

### Generating files &/or submitting jobs
When you create a *bpp* object you assign it to a variable (in this example, `test`); this is simply the name of your bpp job generator. To write files for a specific run of *bpp* you must also provide a *job name prefix* to one of its two functions, **write_bpp_files()** or **submit_bpp_jobs()**. Both functions make it easy to sample different sets of loci to include in different replicate bpp analyses. Each rep will start from a different random seed after the initial `seed`. If you set the `maxloci` filter to limit the number of loci that will be used in the analysis, then you can also use the `randomize_order` argument to select a different random set of N loci in each rep.
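The idea of replicate seeds and randomized locus subsets can be sketched in plain Python. The function `replicate_plan` below is hypothetical and only illustrates the concept; the way ipyrad actually derives per-replicate seeds and samples loci may differ.

```python
import random

def replicate_plan(n_loci_total, maxloci, nreps, seed, randomize_order=True):
    """Return a list of (rep_seed, locus_indices) pairs, one per replicate."""
    rng = random.Random(seed)          # all draws follow from the initial seed
    plan = []
    for _ in range(nreps):
        rep_seed = rng.randint(1, 10**6)
        indices = list(range(n_loci_total))
        if randomize_order:
            rng.shuffle(indices)       # a different random subset per replicate
        plan.append((rep_seed, sorted(indices[:maxloci])))
    return plan

# 2000 is a made-up total number of loci, used only for illustration
plan = replicate_plan(n_loci_total=2000, maxloci=500, nreps=2, seed=12345)
for rep_seed, loci in plan:
    print(rep_seed, len(loci), loci[:5])
```

Without `randomize_order`, every replicate in this sketch would use the first `maxloci` loci and differ only in its seed.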

#### write_bpp_files()
This writes the .ctl, .seq, and .imap files for the specified run. 

In [9]:
## write files 
test.write_bpp_files(prefix="testrun")

input files created for job testrun (500 loci)


#### submit_bpp_jobs()
This writes a .ctl file for each job and submits the bpp jobs to run on the cluster designated by the *ipyclient* object. You can efficiently submit many replicate jobs in this way. 

In [10]:
## or, submit job to run by creating minimal needed files
test.submit_bpp_jobs(
    prefix="testrun", 
    nreps=2, 
    ipyclient=ipyclient, 
    seed=12345, 
    randomize_order=True,
    )

submitted 2 bpp jobs [testrun] (500 loci)


#### Accessing job results
When you submit jobs, the paths to the results files are stored in the bpp object's `.files` attribute. Similarly, the 'asynchronous result objects' for the submitted jobs, each of which represents a job running on the ipyclient cluster, are stored in its `.asyncs` attribute. You can inspect these objects to see whether a job has finished, or use them to trace the error if one arises.

In [11]:
## files associated with 'test'
test.files

locifile    ./analysis-ipyrad/pedic-full_outfiles/pedic-full.alleles.loci
mcmcfiles   ['~/Documents/ipyrad/tests/analysis-bpp/testrun.mcmc.txt', '~/Documents/ipyrad/tests/analysis-bpp/testrun-r0.mcmc.txt', '~/Documents/ipyrad/tests/analysis-bpp/testrun-r1.mcmc.txt']
outfiles    ['~/Documents/ipyrad/tests/analysis-bpp/testrun.out.txt', '~/Documents/ipyrad/tests/analysis-bpp/testrun-r0.out.txt', '~/Documents/ipyrad/tests/analysis-bpp/testrun-r1.out.txt']

In [12]:
## see async objects from a bpp object
test.asyncs

[<AsyncResult: _call_bpp>, <AsyncResult: _call_bpp>]

In [13]:
## check a result (or error) if the job is finished
if test.asyncs[0].ready():
    print test.asyncs[0].result()

In [None]:
## block until all jobs are ready
ipyclient.wait()
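If you would rather not block, you can tally job states instead. The sketch below relies only on the `ready()` and `successful()` methods of async result objects; the `FakeAsync` class is a stand-in so the example runs without a cluster, and `job_status` is a hypothetical helper name.

```python
class FakeAsync(object):
    """Stand-in for an async result; exists only to make this sketch runnable."""
    def __init__(self, ready, ok=True):
        self._ready, self._ok = ready, ok

    def ready(self):
        return self._ready

    def successful(self):
        return self._ok

def job_status(asyncs):
    counts = {"running": 0, "finished": 0, "failed": 0}
    for job in asyncs:
        if not job.ready():
            counts["running"] += 1
        elif job.successful():      # only queried once the job is ready
            counts["finished"] += 1
        else:
            counts["failed"] += 1
    return counts

jobs = [FakeAsync(True), FakeAsync(False), FakeAsync(True, ok=False)]
print(job_status(jobs))
```

With a real run, passing `test.asyncs` in place of `jobs` should work the same way, assuming those objects expose the same two methods.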

## Examples 

## Algorithm 00 - fixed tree parameter inference

The 00 algorithm means `'infer_sptree=0'` and `'infer_delimit=0'`, thus the tree that you enter will be treated as the fixed species tree and the analysis will infer parameters for the tree under the multispecies coalescent model. This will yield values of $\Theta$ for each branch of the tree, and divergence times ($\tau$) for each split in the tree. 
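Both parameters are scaled by the mutation rate: for a diploid population $\Theta = 4N_e\mu$, and $\tau$ is measured in expected substitutions per site. The sketch below shows how they could be converted to effective population size and years; the mutation rate and generation time used here are placeholder values chosen for illustration, not estimates for these taxa.

```python
def theta_to_ne(theta, mu):
    """Effective population size from theta = 4 * Ne * mu."""
    return theta / (4.0 * mu)

def tau_to_years(tau, mu, gentime):
    """Divergence time in years from tau = T * mu (T in generations)."""
    return tau / mu * gentime

MU = 1e-8        # assumed mutations per site per generation (placeholder)
GENTIME = 2.0    # assumed years per generation (placeholder)

print(theta_to_ne(0.004, MU))
print(tau_to_years(0.01, MU, GENTIME))
```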

In [16]:
## create a copy of the 'test' object above (does not copy asyncs)
A00 = test.copy()

In [15]:
## submit a job (increase nreps to run replicates from different random seeds)
A00.submit_bpp_jobs("A00", nreps=1, ipyclient=ipyclient)

submitted 1 bpp jobs [A00] (500 loci)


Also submit a job without data (using only the prior) by setting the `usedata` parameter to 0. It is good practice to also run a job without data to compare to your results. 

In [15]:
## change params to use no data
A00.params.usedata = 0

## submit a job with no data (prior only)
A00.submit_bpp_jobs("A00-nodata", nreps=1, ipyclient=ipyclient)

submitted 1 bpp jobs [A00-nodata] (500 loci)


#### Track progress

In [21]:
## wait for jobs to finish
ipyclient.wait()

3 jobs still running


#### Summarize results tables for algorithm 00
Different bpp algorithms produce different types of results files. For algorithm 00 the mcmc results file is a table of $\Theta$ and $\tau$ values, so we can parse it as a tab-delimited file to summarize the results. The same results are available in the .out.txt file, but I find that parsing the mcmc table directly is a bit easier and gives you a bit more control.

In [23]:
## parse the mcmc table
table = pd.read_csv(
    "analysis-bpp/A00-r0.mcmc.txt",
    sep="\t", 
    index_col=0)

## print pretty table summary (suppressing scientific notation)
pd.set_option('display.float_format', lambda x: '%.3f' % x)
table.describe().T

| parameter | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| theta_1cup | 2000.0 | 0.004 | 0.0 | 0.002 | 0.003 | 0.004 | 0.004 | 0.005 |
| theta_2cys | 2000.0 | 0.001 | 0.0 | 0.001 | 0.001 | 0.001 | 0.001 | 0.002 |
| theta_3lip | 2000.0 | 0.002 | 0.0 | 0.001 | 0.001 | 0.002 | 0.002 | 0.003 |
| theta_4prz | 2000.0 | 0.008 | 0.001 | 0.006 | 0.008 | 0.008 | 0.009 | 0.01 |
| theta_5rck | 2000.0 | 0.002 | 0.0 | 0.001 | 0.002 | 0.002 | 0.003 | 0.004 |
| theta_6rex | 2000.0 | 0.005 | 0.001 | 0.003 | 0.004 | 0.005 | 0.006 | 0.008 |
| theta_7tha | 2000.0 | 0.002 | 0.0 | 0.001 | 0.002 | 0.002 | 0.002 | 0.003 |
| theta_8rexliprckthacupcysprz | 2000.0 | 0.016 | 0.004 | 0.008 | 0.013 | 0.015 | 0.017 | 0.046 |
| theta_9rexliprckthacupcys | 2000.0 | 0.007 | 0.001 | 0.001 | 0.006 | 0.007 | 0.007 | 0.01 |
| theta_10rexliprckthacup | 2000.0 | 0.005 | 0.001 | 0.002 | 0.004 | 0.005 | 0.005 | 0.009 |
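Quartiles are a start, but posterior summaries are usually reported with a 95% interval. The following is a minimal sketch that computes a mean and a crude equal-tailed interval per column from tab-delimited text, using only the standard library. The toy table, the `Gen` index column name, and the helper `column_summaries` are assumptions made for illustration.

```python
def column_summaries(tsv_text, lower=0.025, upper=0.975):
    lines = tsv_text.strip().split("\n")
    header = lines[0].split("\t")[1:]           # skip the index column
    columns = dict((name, []) for name in header)
    for line in lines[1:]:
        fields = line.split("\t")[1:]
        for name, value in zip(header, fields):
            columns[name].append(float(value))
    out = {}
    for name, values in columns.items():
        ordered = sorted(values)
        n = len(ordered)
        out[name] = (
            sum(ordered) / n,                   # posterior mean
            ordered[int(lower * (n - 1))],      # lower bound (floored rank)
            ordered[int(upper * (n - 1))],      # upper bound (floored rank)
        )
    return out

# toy stand-in for an mcmc table: 41 evenly spaced samples of one parameter
rows = ["Gen\ttheta_1cup"]
rows += ["%d\t%.4f" % (i, 0.003 + 0.0001 * i) for i in range(41)]
summary = column_summaries("\n".join(rows))
print(summary["theta_1cup"])
```

If the table is already loaded with pandas, `table.quantile([0.025, 0.975])` gives interpolated versions of the same bounds.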


## Algorithm 10 - species tree inference

Algorithm 10 aims to infer the correct species tree from the data by implementing a tree search method; thus the input tree is treated only as a starting tree.

In [26]:
## create a new bpp object
A10 = A00.copy()

## set new params
A10.params.usedata = 1
A10.params.infer_sptree = 1
A10.params.infer_delimit = 0

In [27]:
## submit job reps to the cluster
A10.submit_bpp_jobs("A10", 
                    nreps=1, 
                    ipyclient=ipyclient, 
                    randomize_order=True)

submitted 1 bpp jobs [full] (500 loci)


Also submit a job without data (prior only).

In [16]:
## change params to not use data
A10.params.usedata = 0

## submit a job with no data (prior only)
A10.submit_bpp_jobs("A10-nodata", nreps=1, ipyclient=ipyclient)

submitted 1 bpp jobs [c10-nodata] (500 loci)


#### Plot the distribution of species trees from algorithm 10

In [41]:
## load trees, keeping every 8th tree from index 10 up to 5000: [10:5000:8]
trees = toytree.multitree(
    "./analysis-bpp/A10-r0.mcmc.txt",
    treeslice=(10, 5000, 8)
    )

len(trees) 

249
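The number of trees kept can be checked with ordinary Python slicing. This sketch assumes that `treeslice` follows Python's `(start, stop, step)` slice semantics and that the mcmc file holds 2000 sampled trees; both are assumptions, not documented behavior.

```python
n_trees_in_file = 2000             # assumption: number of sampled trees in the file
start, stop, step = 10, 5000, 8    # the treeslice tuple used above

# a stop beyond the end of the list is simply truncated by Python slicing
kept = list(range(n_trees_in_file))[start:stop:step]

print(len(kept))    # 249
print(kept[:3])     # [10, 18, 26]
```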

In [42]:
trees.draw_cloudtree(
    orient='right',
    edge_style={"opacity": 0.025},
    )

(<toyplot.canvas.Canvas at 0x7f1867529050>,
 <toyplot.coordinates.Cartesian at 0x7f18640dd610>)

In [73]:
tips = [
    "<em>P. przewalskii</em>",
    "<em>P. cyathophylloides</em>",
    "<em>P. cyathophylla</em>",
    "<em>P. superba</em>",
    "<em>P. thamnophila cup.</em>",
    "<em>P. thamnophila tham.</em>",
    "<em>P. rex rockii</em>",
    "<em>P. rex rex</em>",
    "<em>P. rex lipskyana</em>",
]

In [75]:
## plot a cloudtree onto a set of toyplot axes
import toyplot

## set up axes
canvas = toyplot.Canvas(width=450, height=400)
axes = canvas.cartesian()

## plot the tree
trees.draw_cloudtree(
    axes=axes,
    edge_style={"opacity": 0.05},
    use_edge_lengths=True,
    orient='right',
    tip_labels=tips[::-1],
    );

## style axes
axes.y.show = False
axes.x.show = True
axes.x.ticks.show = True
axes.x.ticks.locator = toyplot.locator.Explicit(
    locations=np.linspace(0, -15, 5) / 1000.,
    labels=np.linspace(0, 15, 5),
    )
axes.x.label.text = "Divergence time (substitutions/site x 10<sup>-3</sup>)"

In [70]:
toyplot.html.render(canvas, "cloud-test.html")

### Running other algorithms (species delimitation)
The results of the species delimitation algorithms (01 and 11) are a bit more difficult to summarize, so we do not yet have a recommended way to do so other than looking at the .out.txt file produced by the run. Have fun.

### Setting up tests of multiple prior settings
Rannala and Yang suggest that you try both species delimitation algorithms, and that you do so over a range of parameter values for each. They suggest that you run algorithm 0 with $\epsilon$=(2, 5, 10, 20), and algorithm 1 with $\alpha$=(1, 1.5, 2) and $m$=(1, 1.5, 2). They also suggest repeating this with different starting trees. Using our programmatic approach you can easily set up all of these tests and run them in parallel using a simple for-loop.
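The full grid of settings can be generated rather than typed by hand. The sketch below builds one tuple per test with `itertools.product`; the tuple layout `(algorithm, alpha, m)` for algorithm 1 is an assumption about argument order, and the prefix naming simply joins the tuple values.

```python
import itertools

epsilons = (2, 5, 10, 20)      # algorithm 0 settings
alphas = (1.0, 1.5, 2.0)       # algorithm 1 settings
ms = (1.0, 1.5, 2.0)

# one tuple per test: (0, epsilon) or (1, alpha, m)
delimit_tests = [(0, eps) for eps in epsilons]
delimit_tests += [(1, a, m) for a, m in itertools.product(alphas, ms)]

# a job name prefix for each test
prefixes = ["delim-" + "-".join(str(i) for i in t) for t in delimit_tests]

print(len(delimit_tests))           # 13
print(prefixes[0], prefixes[4])     # delim-0-2 delim-1-1.0-1.0
```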

In [48]:
delim = A00.copy()

delim.params.burnin = 100
delim.params.nsample = 100
delim.params.infer_delimit = 1
delim.params

burnin          100                 
cleandata       0                   
copied          False               
delimit_alg     (0, 5)              
finetune        (0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01)
infer_delimit   1                   
infer_sptree    0                   
nsample         100                 
sampfreq        20                  
seed            12345               
tauprior        (2, 2000, 1)        
thetaprior      (2, 2000)           
usedata         1                   

In [51]:
delim.submit_bpp_jobs("delim-test", nreps=2, ipyclient=ipyclient)

submitted 2 bpp jobs [delim-test] (500 loci)


In [53]:
ipyclient.wait_interactive()

   2/2 tasks finished after  455 s
done


In [25]:
## set up a couple tests to perform
## delimit arg is a tuple with (algorithm, param) or (alg, param, param)

## copy A00, then change params on the copy (not on A00 itself)
A01 = A00.copy()
A01.params.infer_sptree = 0
A01.params.infer_delimit = 1
A01.params.usedata = 1

In [26]:
DELIMIT_TESTS = [
    (0, 2),
    (0, 5),
    (0, 10),
    (1, 1.0, 1.0),
    (1, 1.0, 1.5),
    (1, 1.0, 2.0),
    (1, 1.5, 1.0), 
    (1, 1.5, 1.5), 
    (1, 1.5, 2.0),
    (1, 2.0, 1.0), 
    (1, 2.0, 1.5), 
    (1, 2.0, 2.0)
]

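## loop variable is named 'delimit_args' so it does not overwrite the 'test' bpp object
for delimit_args in DELIMIT_TESTS:
    ## set the delimit algorithm
    A01.params.delimit_alg = delimit_args
    
    ## create a name for this job
    prefix = "delim-" + "-".join([str(i) for i in delimit_args])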
    
    ## submit the job
    A01.submit_bpp_jobs(prefix=prefix, 
                        nreps=1, 
                        seed=123, 
                        ipyclient=ipyclient)

submitted 1 bpp jobs [delim-0-2] (500 loci)
submitted 1 bpp jobs [delim-0-5] (500 loci)
submitted 1 bpp jobs [delim-0-10] (500 loci)
submitted 1 bpp jobs [delim-1-1.0-1.0] (500 loci)
submitted 1 bpp jobs [delim-1-1.0-1.5] (500 loci)
submitted 1 bpp jobs [delim-1-1.0-2.0] (500 loci)
submitted 1 bpp jobs [delim-1-1.5-1.0] (500 loci)
submitted 1 bpp jobs [delim-1-1.5-1.5] (500 loci)
submitted 1 bpp jobs [delim-1-1.5-2.0] (500 loci)
submitted 1 bpp jobs [delim-1-2.0-1.0] (500 loci)
submitted 1 bpp jobs [delim-1-2.0-1.5] (500 loci)
submitted 1 bpp jobs [delim-1-2.0-2.0] (500 loci)
