# Species-tree & species-delimitation using *bpp* (BP&P)
The program *bpp* by Rannala & Yang (2010; 2015) is a powerful tool for inferring species tree parameters and testing species delimitation hypotheses. It is *relatively* easy to use and, best of all, *quite fast*, although not highly parallelizable. This notebook describes a streamlined approach we've developed to set up input files for testing different hypotheses in *bpp* in a clear, programmatic way that makes it easy to perform many tests over many different parameter settings. We also show how to submit many separate jobs to run in parallel on a cluster.

### Using Jupyter notebooks
If you have not used Jupyter notebooks before, please see our other documentation for an introduction. The purpose of these notebooks is to create a reproducible document that is easy to share and can serve as supplemental material, for example by uploading it to a site such as GitHub. You can execute the code (in this case written in Python) in the cells below to reproduce our results.

### Install required software
All software required for this notebook can be installed using conda, except for toyplot, which is installed here with pip.

In [1]:
## conda install -c ipyrad ipyrad
## conda install -c ipyrad bpp
## conda install -c etetoolkit ete3
## pip install toyplot

In [2]:
import ipyrad.analysis as ipa         ## ipyrad analysis tools
import ipyparallel as ipp             ## parallelization

### Connect to an ipyparallel cluster
We will use the `ipyparallel` library to submit jobs to run in parallel on a cluster. We have a separate tutorial with more background about using ipyparallel. You will need an `ipcluster` instance running in a separate terminal on your machine (or, ideally, on your HPC cluster). The code below simply connects to that cluster and prints how many CPUs are available for use. If this is confusing and you don't want to learn to use `ipyparallel`, no problem: you can continue without it and still run jobs in parallel, just not distributed across multiple nodes of an HPC cluster.

In [3]:
## connect to a running ipcluster instance
ipyclient = ipp.Client()

## print information about our cluster
print("Connected to {} cores".format(len(ipyclient)))

Connected to 4 cores


### Enter paths and input files (I/O)


In [4]:
## set the location of our input .loci file
LOCIFILE = "./branch-test/pedtest_outfiles/pedtest.loci"

## set the name of the output directory. It will be created if it doesn't exist.
WORKDIR = "./analysis-bpp"

## a tree hypothesis (guidetree) for our analyses (here based on tetrad results)
NEWICK = "((((sup, cya), cys), ((rex, rck), tha)), prz);"

## a dictionary mapping sample names to 'species' names
IMAP = {
    "prz": ["32082_przewalskii", "33588_przewalskii"], 
    "sup": ["29154_superba"],
    "cya": ["30686_cyathophylla"],
    "cys": ["41478_cyathophylloides", "41954_cyathophylloides"],
    "tha": ["33413_thamno"], 
    "rck": ["30556_thamno", "35236_rex"],
    "rex": ["35236_rex", "35236_rex", "39618_rex", "38362_rex"],  
}

## a dictionary of minimum sample coverage: a locus is only kept if it 
## has data for at least N samples in each species.
MINMAP = {
    "prz": 2, 
    "sup": 1,
    "cya": 1,
    "cys": 2,
    "tha": 1, 
    "rck": 2,
    "rex": 4,  
}
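Because `NEWICK`, `IMAP`, and `MINMAP` have to agree with each other, it can help to check them before building any input files. The following is a minimal sketch of such a check. The function `check_imap` is our own hypothetical helper (it is not part of ipyrad), and it assumes a plain newick string without branch lengths or node labels.

```python
import re
from collections import Counter

NEWICK = "((((sup, cya), cys), ((rex, rck), tha)), prz);"
IMAP = {
    "prz": ["32082_przewalskii", "33588_przewalskii"],
    "sup": ["29154_superba"],
    "cya": ["30686_cyathophylla"],
    "cys": ["41478_cyathophylloides", "41954_cyathophylloides"],
    "tha": ["33413_thamno"],
    "rck": ["30556_thamno", "35236_rex"],
    "rex": ["35236_rex", "35236_rex", "39618_rex", "38362_rex"],
}
MINMAP = {"prz": 2, "sup": 1, "cya": 1, "cys": 2, "tha": 1, "rck": 2, "rex": 4}


def check_imap(newick, imap, minmap):
    """Return a list of human-readable problems (empty if none are found)."""
    problems = []

    # tip labels: assumes no branch lengths or internal node labels
    tips = set(re.findall(r"[^(),;\s]+", newick))
    if tips != set(imap):
        odd = sorted(tips ^ set(imap))
        problems.append("tree tips and IMAP keys differ: {}".format(odd))

    # each sample should be listed once, under exactly one species
    owners = {}
    for species, samples in sorted(imap.items()):
        for sample, count in sorted(Counter(samples).items()):
            if count > 1:
                problems.append(
                    "{} is listed {} times in {}".format(sample, count, species))
            owners.setdefault(sample, []).append(species)
    for sample, species_list in sorted(owners.items()):
        if len(species_list) > 1:
            problems.append(
                "{} is assigned to several species: {}".format(sample, species_list))

    # minmap cannot exceed the number of distinct samples mapped to a species
    for species, nmin in sorted(minmap.items()):
        nsamples = len(set(imap.get(species, [])))
        if nmin > nsamples:
            problems.append(
                "minmap for {} is {} but only {} distinct samples are mapped".format(
                    species, nmin, nsamples))
    return problems


for problem in check_imap(NEWICK, IMAP, MINMAP):
    print(problem)
```

Run on the dictionaries above, this check flags `35236_rex`, which is listed twice under `rex` and also under `rck`, so `rex` has only three distinct samples while its `MINMAP` value is 4. It is worth confirming the sample names against your own data before running an analysis.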

In [5]:
## print tree 
tree = ipa.tree(NEWICK)
tree.draw(height=300);

### Create an `ipa.bpp()` object

Running *bpp* requires three input files (.ctl, .imap, and .seq), of which the .ctl file is the most important, since it contains the parameters for the run and points to the location of the other two files. Consequently, running analyses under a range of parameter settings with multiple replicates requires creating dozens of input files, which is a huge pain if done by hand. Thus, we have created a convenience tool that creates these input files and, if desired, submits them to run in parallel on a cluster.

In [6]:
## create a bpp analysis object with the required args passed to it.
b1 = ipa.bpp(locifile=LOCIFILE, 
             guidetree=NEWICK,
             imap=IMAP,
             workdir=WORKDIR)

In [7]:
## set filtering parameters
b1.filters.maxloci = 500
b1.filters.minsnps = 4
b1.filters.minmap = MINMAP
print(b1.filters)

maxloci   500                 
minmap    {'cys': 2, 'rex': 4, 'sup': 1, 'cya': 1, 'rck': 2, 'tha': 1, 'prz': 2}
minsnps   4                   
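
To make the meaning of `minmap` concrete, here is a small sketch of the filtering rule on toy data. It is our own illustration of the rule described above (keep a locus only if every species has at least N samples with data), not ipyrad's actual implementation. The names `passes_minmap`, `toy_imap`, `toy_minmap`, and `toy_loci` are hypothetical.

```python
def passes_minmap(locus_samples, imap, minmap):
    """True if every species has at least minmap[species] samples in the locus."""
    present = set(locus_samples)
    for species, nmin in minmap.items():
        nfound = sum(1 for sample in set(imap[species]) if sample in present)
        if nfound < nmin:
            return False
    return True


# toy data: two species, three loci described by the samples that have data
toy_imap = {"A": ["a1", "a2"], "B": ["b1"]}
toy_minmap = {"A": 2, "B": 1}
toy_loci = [
    ["a1", "a2", "b1"],   # kept: both A samples and the B sample are present
    ["a1", "b1"],         # dropped: only one of the two required A samples
    ["a1", "a2"],         # dropped: no B sample
]

kept = [locus for locus in toy_loci if passes_minmap(locus, toy_imap, toy_minmap)]
print("{} of {} loci pass the minmap filter".format(len(kept), len(toy_loci)))
```

Stricter `minmap` values give more complete loci but fewer of them, so it is worth checking how many loci remain (reported when the files are written) after changing these settings.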



In [8]:
## set bpp run parameters (~6 hours)
b1.params.burnin = 2500
b1.params.nsample = 25000
b1.params.sampfreq = 10
b1.params.tauprior = (2, 2000, 1)
b1.params.thetaprior = (2, 2000)
print(b1.params)

burnin          2500                
cleandata       0                   
delimit_alg     (0, 5)              
finetune        (0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01)
infer_delimit   0                   
infer_sptree    0                   
nsample         25000               
sampfreq        10                  
seed            12345               
tauprior        (2, 2000, 1)        
thetaprior      (2, 2000)           
usedata         1                   
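
It can be useful to translate these settings into more familiar quantities. The sketch below assumes that the first two values of `thetaprior` and `tauprior` are the shape and rate of a gamma distribution, as in bpp 3.x control files (other bpp releases may use a different prior family, so check the manual for the version you installed), and that the chain runs for `burnin + nsample * sampfreq` iterations. The helper `gamma_mean_sd` is our own.

```python
import math


def gamma_mean_sd(shape, rate):
    """Mean and standard deviation of a gamma(shape, rate) distribution."""
    mean = shape / float(rate)
    sd = math.sqrt(shape) / float(rate)
    return mean, sd


# values copied from the params printed above
BURNIN, NSAMPLE, SAMPFREQ = 2500, 25000, 10
THETAPRIOR = (2, 2000)
TAUPRIOR = (2, 2000, 1)   # third value is not a gamma parameter and is ignored here

total_iterations = BURNIN + NSAMPLE * SAMPFREQ
print("total MCMC iterations: {}".format(total_iterations))

for name, prior in [("thetaprior", THETAPRIOR), ("tauprior", TAUPRIOR)]:
    mean, sd = gamma_mean_sd(prior[0], prior[1])
    print("{}: mean={:.5f} sd={:.5f}".format(name, mean, sd))
```

Under these assumptions both priors have a mean of 0.001, so changing the second value by a factor of ten shifts the prior mean by a factor of ten in the opposite direction.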



### Option 1: simply write the bpp files
You can simply write the bpp files using the `write_bpp_files()` function and then execute them yourself by calling the `bpp` executable on the .ctl file. If you do it this way, it is up to you to parallelize the code yourself. Try running this first and take a look at the files that were produced in your working directory (`bpp-00.ctl.txt`, `tmp.seqfile.txt`, and `tmp.imapfile.txt`). You can change the settings above and look at their effect on these files.

In [9]:
## we'll name this test 'bpptest-b1'
b1.write_bpp_files("bpptest-b1")

input files created for job bpptest-b1 (500 loci)
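
If you go this route, one way to run several control files at once is to launch each `bpp` process from a small thread pool. This is only a sketch, written for Python 3, under stated assumptions: that the executable is called `bpp` and is on your PATH, that your bpp version accepts the control file as its only argument, and that the control files end in `.ctl.txt` as described above. The helpers `run_bpp` and `run_all` are our own and are not part of ipyrad.

```python
import glob
import os
import shutil
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_bpp(ctl_path, bpp_binary="bpp"):
    """Run bpp on one control file; return its exit code, or None if bpp is missing."""
    exe = shutil.which(bpp_binary)
    if exe is None:
        return None
    # assumption: the control file is passed as a single positional argument
    return subprocess.call([exe, ctl_path])


def run_all(workdir, max_workers=2, bpp_binary="bpp"):
    """Run every *.ctl.txt file in workdir, at most max_workers at a time."""
    ctl_files = sorted(glob.glob(os.path.join(workdir, "*.ctl.txt")))
    # threads are enough here: each one only waits for an external process
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        codes = list(pool.map(lambda path: run_bpp(path, bpp_binary), ctl_files))
    return dict(zip(ctl_files, codes))


results = run_all("./analysis-bpp", max_workers=2)
for path, code in results.items():
    print(path, code)
```

An exit code of `None` means the executable was not found, so nothing was run for that file.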


### Option 2: Submit many jobs to run in parallel
Alternatively, the `submit_bpp_jobs()` function submits a number of replicate jobs to run on a load-balanced job scheduler when you pass it an `ipyparallel` client object. If you submit multiple replicates, each will start from a different random seed, but you can still set the initial seed to make the runs reproducible.

In [None]:
## submit many reps of the same job
b1.submit_bpp_jobs(prefix="bpptest-b1",
                   nreps=3, 
                   seed=98765, 
                   ipyclient=ipyclient)
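
The idea of giving each replicate its own seed while keeping the whole set reproducible can be sketched as follows. This is one common way to do it and only an illustration. We do not claim that this is how `submit_bpp_jobs()` derives its seeds, and `replicate_seeds` is a hypothetical helper.

```python
import random


def replicate_seeds(seed, nreps):
    """Derive one seed per replicate from a single initial seed."""
    rng = random.Random(seed)   # private generator, global state is untouched
    return [rng.randint(1, 10**9) for _ in range(nreps)]


seeds_a = replicate_seeds(98765, 3)
seeds_b = replicate_seeds(98765, 3)
print(seeds_a)
print(seeds_a == seeds_b)   # the same initial seed gives the same replicate seeds
```

Because the replicate seeds depend only on the initial seed, rerunning the notebook with the same `seed` and `nreps` would reproduce the same set of runs.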

Similarly, you can submit many jobs with different params using a for-loop. If you do this, remember to assign a new name to each job so that it writes the output to differently named files.

In [11]:
## or, submit many jobs with different params using a for-loop
for tauprior in [2, 20, 200]:
    b1.params.tauprior = (2, tauprior, 1)
    b1.submit_bpp_jobs(prefix="bpptest-b1-tau-{}".format(tauprior),
                       nreps=1, 
                       seed=123, 
                       ipyclient=ipyclient,
                      )

submitted 1 bpp jobs [tbpp-00-tau-2] (500 loci)
submitted 1 bpp jobs [tbpp-00-tau-20] (500 loci)
submitted 1 bpp jobs [tbpp-00-tau-200] (500 loci)


You should also always run at least one job with `usedata=0`. With this setting *bpp* ignores the sequence data, so the results are dictated entirely by the prior settings. This gives you a baseline for judging how much the data inform the posterior.

In [12]:
## again, remember to set a different name for the job.
b1.params.usedata = 0
b1.submit_bpp_jobs(prefix="prior", 
                   nreps=1, 
                   seed=123,
                   ipyclient=ipyclient,
                  )

submitted 1 bpp jobs [prior] (500 loci)


### Wait for parallel jobs to finish

In [17]:
## block until all jobs are finished
ipyclient.wait_interactive()

   4/4 tasks finished after 22109 s
done


### Running other algorithms (species tree inference)

In [8]:
## create a separate new bpp object
bp01 = ipa.bpp(locifile=LOCIFILE, guidetree=NEWICK, imap=IMAP)

## set limits on data size
bp01.filters.minmap = MINMAP
bp01.filters.maxloci = 500
bp01.filters.minsnps = 2

## set bpp run params
bp01.params.usedata = 1
bp01.params.burnin = 10000
bp01.params.nsample = 100000
bp01.params.sampfreq = 10
bp01.params.infer_sptree = 1

## print it
bp01.params

burnin          10000               
cleandata       0                   
delimit_alg     (0, 5)              
finetune        (0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01)
infer_delimit   0                   
infer_sptree    1                   
nsample         100000              
sampfreq        10                  
seed            12345               
tauprior        (4, 2, 1)           
thetaprior      (5, 5)              
usedata         1                   

In [14]:
## submit jobs
bp01.submit_bpp_jobs("bp-01", 
                     nreps=1,
                     seed=123,
                     ipyclient=ipyclient)

## and submit a job w/o data.
bp01.params.usedata = 0
bp01.submit_bpp_jobs("bp-01-prior", 
                     nreps=1, 
                     seed=123, 
                     ipyclient=ipyclient)

submitted 1 bpp jobs [bp-01] (500 loci)
submitted 1 bpp jobs [bp-01-prior] (500 loci)


### So what's a smart test to perform?
Well, my interest in bpp was to perform species delimitation. Rannala and Yang suggest that you try both species delimitation algorithms, each over a range of parameter values: algorithm 0 with $\epsilon$=(2, 5, 10, 20), and algorithm 1 with $\alpha$=(1, 1.5, 2) and $m$=(1, 1.5, 2). They also suggest repeating this with different starting trees. See if you can set up an efficient for-loop to submit tests over a range of prior settings.

In [13]:
## set up a couple tests to perform
## delimit arg is a tuple with (algorithm, param) or (alg, param, param)
DELIMIT_TESTS = [
    (0, 2),
    (0, 5),
    (0, 10),
    (0, 20),
    (1, 1.0, 1.0),
    (1, 1.0, 1.5),
    (1, 1.0, 2.0),
    (1, 1.5, 1.0), 
    (1, 1.5, 1.5), 
    (1, 1.5, 2.0),
    (1, 2.0, 1.0), 
    (1, 2.0, 1.5), 
    (1, 2.0, 2.0)
]

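## create a new bpp object for the delimitation tests
bppo = ipa.bpp(locifile=LOCIFILE, guidetree=NEWICK, imap=IMAP, workdir=WORKDIR)
bppo.filters.minmap = MINMAP
bppo.filters.maxloci = 500

## turn on species delimitation and use the sequence data
bppo.params.infer_delimit = 1
bppo.params.usedata = 1

## submit one job per test, each with its own prefix
for test in DELIMIT_TESTS:
    bppo.params.delimit_alg = test
    prefix = "delim-" + "-".join([str(i) for i in test])
    bppo.submit_bpp_jobs(prefix=prefix, 
                         nreps=1, 
                         seed=123, 
                         ipyclient=ipyclient)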
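
Rather than typing out every combination, the list of tests can be generated from the parameter ranges quoted above. The sketch below only builds the tuples and the job prefixes, so it runs without ipyrad or a cluster. The names `EPSILONS`, `ALPHAS`, `MS`, and `prefix_for` are our own.

```python
from itertools import product

# parameter ranges quoted in the text above
EPSILONS = [2, 5, 10, 20]     # algorithm 0
ALPHAS = [1.0, 1.5, 2.0]      # algorithm 1
MS = [1.0, 1.5, 2.0]          # algorithm 1

# (algorithm, param) for algorithm 0 and (algorithm, param, param) for algorithm 1
tests = [(0, eps) for eps in EPSILONS]
tests += [(1, alpha, m) for alpha, m in product(ALPHAS, MS)]


def prefix_for(test):
    """Build a unique job name from a delimitation test tuple."""
    return "delim-" + "-".join(str(value) for value in test)


for test in tests:
    print(prefix_for(test), test)
```

Each tuple can then be assigned to `params.delimit_alg` and each prefix passed to `submit_bpp_jobs()`, exactly as in the loop above. To cover different starting trees as well, the same pattern extends to a `product` over a list of guide trees.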