# Species-tree & species-delimitation using *bpp* (BP&P) or *ibpp*
The program *bpp* by Rannala & Yang (2010; 2015) is a powerful tool for inferring species tree parameters and testing species delimitation hypotheses. It is *relatively* easy to use, and best of all, it's *quite fast*, although not easily parallelizable. This notebook describes a streamlined approach we've developed to easily set up input files for testing different hypotheses in *bpp*, and to do so in a clear programmatic way that makes it easy to perform many tests over many different parameter settings. We also show how to submit many separate jobs to run in parallel on a cluster. This approach also works with the program *ibpp*, which allows integration of traits with sequence data.

### Using Jupyter notebooks
If you have not used Jupyter notebooks before, please see the other documentation for an introduction. This is a Jupyter notebook which contains documented code, in this case all Python, that can be used to replicate an analysis. The purpose of these notebooks is to produce a document that is easy to share and reproduce, and that can be used as supplemental material, by simply uploading it to a site such as GitHub. You can execute the code in the cells to reproduce our results.

In [1]:
## Start by importing a few python modules
import ipyrad
import ipyparallel as ipp
import subprocess
import socket
import os
import sys

## print versions
print "ipyrad v.{}".format(ipyrad.__version__)

ipyrad v.0.5.10


### Let's make a new directory to store our tutorial files in

In [5]:
## make a new directory in our current directory called analysis_bpp/
WDIR = "./analysis_bpp"
if not os.path.exists(WDIR):
    os.mkdir(WDIR)

### Download and install *bpp* v.3.3 locally (only tested on Linux)
Copy and paste the code from the link below into a terminal (or a cell in this notebook along with a %%bash header) to install *bpp* locally. This will create a new directory if it does not already exist in `~/local/src/` and install *bpp* from source. This creates a binary file called **bpp**. Because we are installing it locally *you do not need administrator privileges to install it*. When finished it will print out the location where it is installed, which is `~/local/bin/bpp`. https://gist.github.com/dereneaton/73a377c643adaddc83635506a81180af


### Download and Install *ibpp* (v.2.1)
The *ibpp* installation follows a similar procedure and is installed in the same place. The source code in this case is downloaded (cloned) from GitHub, so you will need to have the software *git* installed/loaded. This is usually available by default on a Linux machine and/or HPC cluster. Execute the code here, which will print out the location where it is installed, `~/local/bin/ibpp`. https://gist.github.com/dereneaton/527b87488eede7b670222640fe26878d

### Create input files (.seq.txt, .imap.txt, .ctl.txt, and .traits.txt) 
Running *bpp/ibpp* requires at least three input files, of which the CTL file is the most important, as it points to the location of the other files. We can create these files fairly easily by parsing the sequence information from the `.loci` file produced by ipyrad, and by providing some additional information about which samples should be grouped together into the same "species" using Python dictionaries.

I show an example of this below, using a function from the ipyrad API (`loci2bpp`) that we've created for this purpose. This will create all of the dependency files for a bpp analysis. The first is the IMAP file (*.bpp.imap.txt*), which simply maps sample names to species groups. The second is the SEQ file (*.bpp.seq.txt*), which obviously contains the sequence data, properly formatted. And the third is the CTL file (*.bpp.ctl.txt*), which contains parameters for the bpp analysis. A final optional TRAITS file can also be produced for ibpp analyses. 

The `loci2bpp()` function contains many additional options for filtering loci or samples from the sequence data. For example, you can pass it arguments to keep only loci that have at least N samples in each species, or to keep only N total loci. It also removes any sample from the sequence data set that is not listed in your IMAP dictionary. You can set all of the CTL parameters using this function. We'll start by creating an IMAP dictionary that matches 'species' names to lists of sample names belonging to each species, a TREE stating our species tree hypothesis as a newick string, and one optional argument that is likely to be used very often with RAD-seq data, the MINMAP dictionary. Much further down in this notebook we also show how to incorporate traits into an ibpp analysis.

In [6]:
## Create a mapping dictionary
## The keys are 'species', i.e., clades/groups for your samples, 
## The values are lists of sample names that belong to each group

IMAP = {"A": ["1A_0", "1B_0", "1C_0", "1D_0"], 
        "B": ["2E_0", "2F_0", "2G_0", "2H_0"],
        "C": ["3I_0", "3J_0", "3K_0", "3L_0"]
       }

In [7]:
## Then you must write your tree hypothesis as a newick string.
## This must include all 'species' names in the imap dictionary

TREE = "((A,B),C);"
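
The requirement that the tree include every 'species' name in the IMAP dictionary can be checked before any files are written. The sketch below is not part of ipyrad: `newick_tips` is a hypothetical helper, and it assumes a simple newick string without branch lengths, support values, or quoted labels, like the one above.

```python
import re

def newick_tips(newick):
    """Return the set of tip labels in a simple newick string.
    Assumes no branch lengths, support values, or quoted labels."""
    return set(re.findall(r"[^(),;\s]+", newick))

## same example values as in the cells above
IMAP = {"A": ["1A_0", "1B_0", "1C_0", "1D_0"],
        "B": ["2E_0", "2F_0", "2G_0", "2H_0"],
        "C": ["3I_0", "3J_0", "3K_0", "3L_0"]}
TREE = "((A,B),C);"

## species that are in IMAP but absent from the tree, and vice versa
missing = set(IMAP) - newick_tips(TREE)
extra = newick_tips(TREE) - set(IMAP)
assert not missing and not extra, (missing, extra)
```

If the assertion fails, the two sets in the error message show which names are out of sync.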



(Optional): You can designate an additional dictionary that will be used to subsample loci for inclusion in the *bpp* analysis. Below I call this dictionary MINMAP. It is used to filter loci so that we only include those in which each 'species' group has at least N samples with sequence data.

In [8]:
## (Optional) Minimum sampling map
## The keys are 'species', i.e., clade/group names 
## The values are the number of samples in each 'species' that must have data
## for a given locus for it to be included in the data set. 

MINMAP = {"A": 4, 
          "B": 4, 
          "C": 4,
         }
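
To make the filtering rule concrete, here is a minimal sketch of the check that a MINMAP implies for a single locus. This is my own illustration of the rule described above, not the code that `loci2bpp()` actually runs; `passes_minmap` is a hypothetical function name.

```python
def passes_minmap(samples_in_locus, imap, minmap):
    """True if every species in minmap has at least the required
    number of its samples present in this locus."""
    present = set(samples_in_locus)
    for species, minimum in minmap.items():
        n_present = len(present.intersection(imap[species]))
        if n_present < minimum:
            return False
    return True

IMAP = {"A": ["1A_0", "1B_0", "1C_0", "1D_0"],
        "B": ["2E_0", "2F_0", "2G_0", "2H_0"],
        "C": ["3I_0", "3J_0", "3K_0", "3L_0"]}
MINMAP = {"A": 4, "B": 4, "C": 4}

## a locus with all 12 samples, and one that is missing sample 2G_0
full = [name for names in IMAP.values() for name in names]
partial = [name for name in full if name != "2G_0"]

print(passes_minmap(full, IMAP, MINMAP))     # True
print(passes_minmap(partial, IMAP, MINMAP))  # False
```

With a MINMAP of 4 for every group, a single missing sample is enough to drop a locus, so lower values are usually more practical for real RAD-seq data.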

### Run loci2bpp() to generate bpp input files
The `loci2bpp()` function has four required arguments: a name, a LOCI file, an IMAP dictionary, and a TREE hypothesis. For additional arguments see the documentation for the function by typing `?ipyrad.file_conversion.loci2bpp` into a cell. We also have more examples below. The function returns the CTL filename as a string, which you will see later can be quite useful.

In [10]:
## enter the path to your loci file
LOCI = "/home/deren/Documents/ipyrad/tests/cli/cli_outfiles/cli.loci"

## create bpp seq file with data for all samples in the loci file and IMAP dict.
## if you tell it verbose=True then it will also print the ctl file info to the screen
ipyrad.file_conversion.loci2bpp('test', LOCI, IMAP, TREE, 
                                wdir=WDIR, verbose=True)

ctl file
--------
seed = 12345
seqfile = /home/deren/Documents/ipyrad/tests/analysis_bpp/test.bpp.seq.txt
Imapfile = /home/deren/Documents/ipyrad/tests/analysis_bpp/test.bpp.imap.txt
mcmcfile = /home/deren/Documents/ipyrad/tests/analysis_bpp/test.bpp.mcmc.txt
outfile = /home/deren/Documents/ipyrad/tests/analysis_bpp/test.bpp.out.txt
nloci = 1000
usedata = 1
cleandata = 0
speciestree = 0
speciesdelimitation = 0 0 5
species&tree = 3 A C B
                 4 4 4
                 ((A,B),C);
thetaprior = 5 5
tauprior = 4 2 1
finetune = 1: 1 0.002 0.01 0.01 0.02 0.005 1.0
print = 1 0 0 0
burnin = 1000
sampfreq = 2
nsample = 10000
--------

new files created (1000 loci, 3 species, 12 samples)
  test.bpp.seq.txt
  test.bpp.imap.txt
  test.bpp.ctl.txt


'/home/deren/Documents/ipyrad/tests/analysis_bpp/test.bpp.ctl.txt'
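
Because the function returns the CTL path, it is easy to read the file back and double-check the settings before submitting a job. The sketch below is not part of ipyrad: `parse_ctl` is a hypothetical helper that assumes the simple `key = value` layout printed above, and treats lines without an `=` as continuations of the previous entry (as in `species&tree`).

```python
def parse_ctl(text):
    """Parse 'key = value' lines into a dict. Lines without '=' are
    appended to the value of the previous key."""
    params = {}
    key = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if "=" in line:
            key, value = [part.strip() for part in line.split("=", 1)]
            params[key] = value
        elif key is not None:
            params[key] += " " + line
    return params

## a few lines copied from the ctl output shown above
CTL_TEXT = """\
seed = 12345
nloci = 1000
species&tree = 3 A C B
                 4 4 4
                 ((A,B),C);
thetaprior = 5 5
tauprior = 4 2 1
"""

params = parse_ctl(CTL_TEXT)
print(params["thetaprior"])     # 5 5
print(params["species&tree"])   # 3 A C B 4 4 4 ((A,B),C);
```

For a real run you could pass it the contents of the returned file, e.g. `parse_ctl(open(ctlfile).read())`.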

#### LOTS of extra arguments are available in *loci2bpp()*
These can be used to filter the loci that will be included in the data set, as well as to modify the parameters that will be used in *bpp* and which are specified in the *.ctl* file. The *.ctl* file has a large range of options, so for some advanced usage you may still need to modify the file by hand, but our intention is to provide a fairly easy-to-use function that produces these files programmatically, instead of always having to write them by hand. In the final example in this notebook (the *ibpp* section at the end) we also provide trait data, in which case `loci2bpp()` creates an extra .traits.txt file and all of the files produced have ibpp in their names instead of bpp.


In [11]:
## Create bpp seq file with data for all samples in the loci file and IMAP dict
ipyrad.file_conversion.loci2bpp('test', LOCI, IMAP, TREE, wdir=WDIR)

## Create bpp file with only the first 100 loci
ipyrad.file_conversion.loci2bpp('test', LOCI, IMAP, TREE, wdir=WDIR, maxloci=100)

## Only keep loci that have at least MINMAP samples for each species
ipyrad.file_conversion.loci2bpp('test', LOCI, IMAP, TREE, wdir=WDIR, minmap=MINMAP)

## Only keep loci that have at least MINMAP samples for each species
## and write the ctl file so that we perform species delimitation
ipyrad.file_conversion.loci2bpp('test', LOCI, IMAP, TREE, minmap=MINMAP, wdir=WDIR,
                                infer_delimit=1, delimit_alg=(0, 5))

new files created (1000 loci, 3 species, 12 samples)
  test.bpp.seq.txt
  test.bpp.imap.txt
  test.bpp.ctl.txt
new files created (100 loci, 3 species, 12 samples)
  test.bpp.seq.txt
  test.bpp.imap.txt
  test.bpp.ctl.txt
new files created (1000 loci, 3 species, 12 samples)
  test.bpp.seq.txt
  test.bpp.imap.txt
  test.bpp.ctl.txt
new files created (1000 loci, 3 species, 12 samples)
  test.bpp.seq.txt
  test.bpp.imap.txt
  test.bpp.ctl.txt


'/home/deren/Documents/ipyrad/tests/analysis_bpp/test.bpp.ctl.txt'

### Why?
You could of course alternatively create all of the bpp input files by hand but trust me, it's a pain. Besides, by making it programmatic in this way you can easily create a variety of input files for different jobs with different parameter settings. Furthermore, it will be easy to share your code with others to show how you created a range of analyses. It's certainly much easier to share a bit of code than it is to share 20 different ctl files that you produced. Below we show an example where we create bpp input files for a range of parameter values and submit them to run in parallel on a cluster. 

### What if I don't want to run parallel jobs?
Simple. You can just call bpp or ibpp on a single *.ctl.txt* file at a time. I would recommend running parallel code, however, since each job takes pretty long to run, and each bpp job can only run on a single CPU at a time. Although we can't parallelize a single run of *bpp*, we can run many jobs simultaneously, allowing us to test a bunch of different priors, or delimitation methods. 

In [8]:
## %%bash

## I've commented the code out, but you could uncomment it to run a single job.
## The '> bpp-log.txt 2>&1' part saves all of the output (stdout and stderr)
## to a file instead of printing it to the screen
# bpp ./analysis_bpp/test.bpp.ctl.txt > bpp-log.txt 2>&1

### Set up a parallel client to submit parallel jobs through this notebook
We need to know a few tricks to submit parallel jobs from this Jupyter notebook. This is all handled by the ipyparallel library, which we loaded at the top of this notebook. We have a separate tutorial with more background about using ipyparallel. You will need to have an 'ipcluster' instance running in a separate terminal on your machine (or ideally, it is running on your HPC cluster). The code below simply connects to that cluster and prints how many CPUs are available for use.

In [12]:
## Connect to the running ipcluster instance
## (you need to start it in a separate terminal)
ipyclient = ipp.Client()
lbview = ipyclient.load_balanced_view()

## print some information about our cluster
res = ipyclient[:].apply(socket.gethostname)
for host in set(res.result_dict.values()):
    print "compute node: [{} cores] on {}"\
          .format(res.result_dict.values().count(host), host)

compute node: [4 cores] on oud


### A function to run bpp/ibpp
This function simply calls the bpp/ibpp binary. If you installed your binaries into a different location than the default in the install scripts at the beginning of this notebook then you will have to change the path to the binaries in this function.

In [13]:
def run_bpp(ctlfile):
    """ run bpp command line program """
    import subprocess, os
    
    ## binary paths
    bpp = os.path.expanduser("~/local/bin/bpp")
    ibpp = os.path.expanduser("~/local/bin/ibpp")
    
    ## which one to use
    if ".ibpp" in ctlfile:
        cmd = [ibpp, ctlfile]
    else:
        cmd = [bpp, ctlfile]
        
    ## call the command
    proc = subprocess.Popen(cmd, stderr=subprocess.STDOUT, stdout=subprocess.PIPE)
    proc.communicate()
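
Note that `run_bpp` discards the program's output and exit status, so a job counts as successful whenever the Python function returns, even if the binary itself exited with an error. A minimal sketch of a stricter variant is shown below. `run_checked` and the log file naming are my own hypothetical choices, not part of ipyrad or bpp.

```python
def run_checked(cmd, logfile):
    """Run cmd (a list of strings), write its combined stdout and
    stderr to logfile, and raise if the exit status is nonzero."""
    ## import inside the function so it also works on remote engines
    import subprocess

    with open(logfile, "wb") as log:
        proc = subprocess.Popen(cmd, stdout=log, stderr=subprocess.STDOUT)
        proc.wait()

    if proc.returncode != 0:
        raise RuntimeError("{} exited with status {}; see {}"
                           .format(cmd[0], proc.returncode, logfile))
    return logfile
```

Inside `run_bpp` you could then replace the last two lines with something like `run_checked(cmd, ctlfile + ".log")`, so that a crashed run shows up as a failed job and leaves a log behind.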

### Submit jobs to run in parallel

Now, we want each job that we submit to have a unique name. The code below creates new jobs over a range of theta and tau prior values: it creates a name (rname) that stores those values, passes these to the loci2bpp function to create new input files, and then submits those jobs to run on the cluster. You could edit this code to iterate over a different range of parameter settings.

In [14]:
## a range of theta params (alpha, beta) to test over
THETAS = [(5, 5), (5, 50), (5, 500)]

## a range of tau params (alpha, beta, dirich) to test over
TAUS = [(1, 1, 1), (1, 10, 1), (1, 100, 1)]
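
Before submitting anything it can help to see what these settings imply. The sketch below enumerates every theta/tau combination and prints the prior means. It assumes, as I understand the bpp v3 documentation, that both priors are gamma distributions with shape alpha and rate beta, so that the mean is alpha/beta; please check this against the manual for your version. `gamma_mean` is a hypothetical helper.

```python
import itertools

THETAS = [(5, 5), (5, 50), (5, 500)]
TAUS = [(1, 1, 1), (1, 10, 1), (1, 100, 1)]

def gamma_mean(alpha, beta):
    """Mean of a gamma distribution with shape alpha and rate beta
    (assumed parameterization, see the text above)."""
    return float(alpha) / beta

## every combination of a theta prior with a tau prior
settings = list(itertools.product(THETAS, TAUS))
print(len(settings))   # 9

for theta, tau in settings:
    ## only the first two tau values (alpha, beta) define the gamma prior
    print("theta prior mean {:g}, tau prior mean {:g}"
          .format(gamma_mean(*theta), gamma_mean(*tau[:2])))
```

Under that assumption each step in the lists moves the prior mean by an order of magnitude, which is a fairly coarse but useful grid for a sensitivity test.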

In [30]:
## a dictionary to store our results in
asyncs = {}

## send jobs to run 'asynchronously' using 'apply' over a range of values
for theta in THETAS:
    for tau in TAUS:
        
        ## name this run by its theta and tau params
        rname = 'TEST-{}-{}-{}-{}-{}'.format(*theta+tau)
    
        ## create input files for this run, the function returns the ctl
        ## file name as a string, which we will store and use below
        ctlfile = ipyrad.file_conversion.loci2bpp(rname, LOCI, IMAP, TREE, 
                                                  wdir=WDIR,
                                                  thetaprior=theta, 
                                                  tauprior=tau, 
                                                  nsample=100000, 
                                                  burnin=10000,
                                                  maxloci=100)

        ## submit job to the queue with ctlfile as the argument
        asyncs[ctlfile] = lbview.apply(run_bpp, ctlfile)
        
        ## print that the job was submitted
        sys.stderr.write('job submitted: bpp {}\n\n'.format(ctlfile))

new files created (100 loci, 3 species, 12 samples)
  TEST-5-5-1-1-1.bpp.seq.txt
  TEST-5-5-1-1-1.bpp.imap.txt
  TEST-5-5-1-1-1.bpp.ctl.txt
job submitted: bpp /home/deren/Documents/ipyrad/tests/TEST-5-5-1-1-1.bpp.ctl.txt

new files created (100 loci, 3 species, 12 samples)
  TEST-5-5-1-10-1.bpp.seq.txt
  TEST-5-5-1-10-1.bpp.imap.txt
  TEST-5-5-1-10-1.bpp.ctl.txt
job submitted: bpp /home/deren/Documents/ipyrad/tests/TEST-5-5-1-10-1.bpp.ctl.txt

new files created (100 loci, 3 species, 12 samples)
  TEST-5-5-1-100-1.bpp.seq.txt
  TEST-5-5-1-100-1.bpp.imap.txt
  TEST-5-5-1-100-1.bpp.ctl.txt
job submitted: bpp /home/deren/Documents/ipyrad/tests/TEST-5-5-1-100-1.bpp.ctl.txt

new files created (100 loci, 3 species, 12 samples)
  TEST-5-50-1-1-1.bpp.seq.txt
  TEST-5-50-1-1-1.bpp.imap.txt
  TEST-5-50-1-1-1.bpp.ctl.txt
job submitted: bpp /home/deren/Documents/ipyrad/tests/TEST-5-50-1-1-1.bpp.ctl.txt

new files created (100 loci, 3 species, 12 samples)
  TEST-5-50-1-10-1.bpp.seq.txt
  TEST-5-50-1

### Track progress
You could interrupt and/or restart this progress tracker without it interrupting the jobs that are running on the ipcluster engines. As you can see, we can still continue to work in this notebook while these jobs are running. We will have to wait for them to finish before we move on to analyzing the results, however. 

In [68]:
## check success/failure of jobs
for job in asyncs:
    ## get shorter name for job
    jobname = job.split("/")[-1]
    
    ## print done or not
    if asyncs[job].ready():
        if asyncs[job].successful():
            print "{:<30} -- finished".format(jobname)
        else:
            print "{:<30} -- failed: {}".format(jobname, asyncs[job].exception())
    else:
        print "{:<30} -- still running".format(jobname)

TEST-5-50-1-1-1.bpp.ctl.txt    -- finished
TEST-5-500-1-10-1.bpp.ctl.txt  -- still running
TEST-5-50-1-10-1.bpp.ctl.txt   -- still running
TEST-5-500-1-1-1.bpp.ctl.txt   -- still running
TEST-5-5-1-1-1.bpp.ctl.txt     -- finished
TEST-5-5-1-10-1.bpp.ctl.txt    -- finished
TEST-5-500-1-100-1.bpp.ctl.txt -- still running
TEST-5-50-1-100-1.bpp.ctl.txt  -- still running
TEST-5-5-1-100-1.bpp.ctl.txt   -- finished


### Interpreting/analyzing results
In this example we ran *bpp* under nine different prior settings (three $\theta$ priors by three $\tau$ priors). We can compare the results of these analyses to investigate the effect of the prior on the estimated posterior distributions of the parameter estimates from the multi-species coalescent ($\theta$ and $\tau$).

In [12]:
## I'll leave that to you.
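
As a starting point, here is a minimal sketch for summarizing one run. It assumes that the file named by `mcmcfile` in the CTL file is a tab-delimited table with a header row and one row per posterior sample. The column names and values in the example below are made up for illustration and are not real bpp output, and `posterior_means` is a hypothetical helper, not an ipyrad or bpp function.

```python
def posterior_means(lines, skip=("Gen",)):
    """Return {column: mean} for a tab-delimited table with a header
    row, ignoring any columns named in skip."""
    rows = [line.rstrip("\n").split("\t") for line in lines if line.strip()]
    header, body = rows[0], rows[1:]
    means = {}
    for idx, name in enumerate(header):
        if name in skip:
            continue
        values = [float(row[idx]) for row in body]
        means[name] = sum(values) / len(values)
    return means

## made-up numbers standing in for the contents of a real mcmc file
FAKE_MCMC = [
    "Gen\ttheta_A\ttau_ABC\n",
    "0\t0.010\t0.020\n",
    "2\t0.014\t0.022\n",
    "4\t0.012\t0.018\n",
]
means = posterior_means(FAKE_MCMC)
for name in sorted(means):
    print("{:<10} {:.4f}".format(name, means[name]))
```

For a real run you could pass an open file instead of the list, e.g. `posterior_means(open(mcmcfile))`, and then compare the means across the prior settings.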


### So what's a smart test to perform?
Well, my interest in bpp was to perform species delimitation. Rannala and Yang suggest that you try out both species delimitation algorithms, and that you do so over a range of parameters for the two algorithms. They suggest that you run algorithm 0 with $\epsilon$=(2, 5, 10, 20), and algorithm 1 with $\alpha$=(1, 1.5, 2) and $m$=(1, 1.5, 2), and that you also do this with different starting trees. So let's set up that test below for the example RAD data set from ipyrad.

In [39]:
## set up a couple tests to perform
DELIMIT_TESTS = [
    (0, 2),
    (0, 5),
    (0, 10),
    (1, 1.0, 1.0),
    (1, 1.0, 1.5),
    (1, 1.0, 2.0),
    (1, 1.5, 1.0), 
    (1, 1.5, 1.5), 
    (1, 1.5, 2.0),
    (1, 2.0, 1.0), 
    (1, 2.0, 1.5), 
    (1, 2.0, 2.0)
]

## Let's regroup the samples into more possible species
IMAP = {"A1": ["1A_0", "1B_0", "1C_0"],
        "A2": ["1D_0"], 
        "B1": ["2E_0", "2F_0"],
        "B2": ["2G_0", "2H_0"],
        "C1": ["3I_0", "3J_0", "3K_0", "3L_0"]
       }


## You can provide resolved and unresolved starting trees
TREE_TESTS = [
    "(((A1,A2),(B1,B2)),C1);",
    "((A1,A2),((B1,B2),C1));"
]

In [None]:
## a dictionary to store our results in
asyncs = {}

## send jobs to run 'asynchronously', here we number the tests using 'enumerate'
for tnum, tree in enumerate(TREE_TESTS):
    for anum, alg in enumerate(DELIMIT_TESTS):
        ## name this run by its tree and algorithm test numbers
        rname = 'TEST-{}-{}'.format(tnum, anum)

        ## create input files for this run, the function returns the ctl
        ## file name as a string, which we will store and use below
        ctlfile = ipyrad.file_conversion.loci2bpp(rname, LOCI, IMAP, 
                                                  tree, 
                                                  infer_delimit=1, 
                                                  delimit_alg=alg,
                                                  thetaprior=(2, 2000), 
                                                  tauprior=(2, 200, 1), 
                                                  nsample=10000, 
                                                  burnin=1000,
                                                  maxloci=100)

        ## submit job to the queue as args to run_bpp
        asyncs[ctlfile] = lbview.apply(run_bpp, ctlfile)

        ## print that the job was submitted
        sys.stderr.write('job submitted: bpp {}\n\n'.format(ctlfile))
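
Run names like `TEST-0-3` only record the position of each setting in the two lists, so it helps to keep a lookup table for later. A minimal sketch, using shortened copies of the lists above and a hypothetical `settings` dictionary:

```python
## shortened copies of the lists above, for illustration only
DELIMIT_TESTS = [(0, 2), (0, 5), (1, 1.0, 1.0)]
TREE_TESTS = [
    "(((A1,A2),(B1,B2)),C1);",
    "((A1,A2),((B1,B2),C1));",
]

## map each run name back to the tree and algorithm it was built from
settings = {}
for tnum, tree in enumerate(TREE_TESTS):
    for anum, alg in enumerate(DELIMIT_TESTS):
        rname = 'TEST-{}-{}'.format(tnum, anum)
        settings[rname] = {"tree": tree, "delimit_alg": alg}

print(len(settings))                         # 6
print(settings["TEST-1-2"]["delimit_alg"])   # (1, 1.0, 1.0)
```

The number of entries should equal the number of trees times the number of delimitation tests, which is also a quick way to confirm that every tree in the list was read as a separate string.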

# Integrating trait data with iBPP
Species delimitation can be further aided by trait information from the samples that are included in the study under a model of trait evolution as described by Solis-Lemus et al. (2015) in their software *ibpp*. This software was purposely made to be highly compatible with *bpp*, and so input files can be made using the same `loci2bpp` function. As usual, it is very important that these input files are created in exactly the correct way, and so we try to make that easy. 

To ensure that all of your input files are compatible, we remove individuals from the trait file that are not present in the IMAP dictionary. This allows you to easily remove taxa from an analysis to examine their influence. We will also mean-standardize the trait values and properly format missing data cells. Rather than use an input dictionary like we did above, here we will use a Pandas DataFrame, which is an easier way to work with data from a CSV file. You pass the DataFrame to `loci2bpp` and it will create a traits file for each named analysis. Although this creates some redundancy by making many files that may be the same, it is convenient for filtering taxa from one master trait list into a trait file that is correct for each given analysis. 

There are four requirements of the trait data when input to iBPP: (1) The first column should contain sample names and have "Indiv" as the header. (2) All trait values in the remaining columns should be quantitative. (3) Missing data should be listed as "NA" (we show below how to easily convert this from other values). (4) The data should be mean-standardized (we perform this for you in `loci2bpp`).
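
The first three requirements are easy to check on the final trait table before running *ibpp*. The sketch below is a hypothetical helper (`check_traits` is not an ipyrad function) that assumes a plain comma-separated table without quoted fields. It does not test requirement (4).

```python
def check_traits(lines):
    """Return a list of problems found in a comma-separated trait
    table; the list is empty if nothing looks wrong."""
    rows = [line.strip().split(",") for line in lines if line.strip()]
    header, body = rows[0], rows[1:]
    problems = []

    ## (1) the first column must be headed "Indiv"
    if header[0].strip() != "Indiv":
        problems.append("first header is {!r}, expected 'Indiv'"
                        .format(header[0]))

    ## (2, 3) every trait value must be a number or the string "NA"
    for row in body:
        for name, value in zip(header[1:], row[1:]):
            value = value.strip()
            if value == "NA":
                continue
            try:
                float(value)
            except ValueError:
                problems.append("{} {}: {!r} is not a number or NA"
                                .format(row[0], name.strip(), value))
    return problems

GOOD = ["Indiv,t1,t2", "1A_0,3,40.1", "2G_0,NA,NA"]
BAD = ["Indiv,t1,t2", "1A_0,3,40.1", "2G_0,,tall"]

print(check_traits(GOOD))       # []
print(len(check_traits(BAD)))   # 2
```

For a file on disk you could pass an open file handle, e.g. `check_traits(open("traits.csv"))`, where the file name is just a placeholder.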

### Read in a CSV format trait file
We use the `pandas.read_csv()` function.

In [17]:
## Here is example CSV data with missing data as "" or NA. 
## We're gonna make some small changes to it so it is compatible with ibpp
CSV_DATA = """\
Indiv,t1,t2,t3
1A_0,3,40.1,0.9
1B_0,3,38.8,1.0
1C_0,4,35.4,1.2
1D_0,4,37.0,1.0
2E_0,5,33.0,0.7
2F_0,5,32.4,0.7
2G_0,,NA,0.5
2H_0,,NA,0.5
3I_0,8,65.0,0.6
3J_0,8,67.4,0.4
3K_0,8,68.2,0.3
3L_0,9,59.9,0.3
"""

## For this example, I'll use the stringIO function to read the string data
## above to act like it is a file. I'm doing this only for this tutorial. 
## For your data you could simply read in a saved CSV file from disk.
import StringIO
csv_file = StringIO.StringIO(CSV_DATA)

## Load the csv_file using the pandas.read_csv() function, use the 'na_values=' 
## option to indicate missing data values that will be re-coded as NaN.
import pandas
traits = pandas.read_csv(csv_file, delimiter=",", na_values=["", "NA"], index_col=0)

## If your data are properly formatted they should look something like below.
print traits

        t1    t2   t3
Indiv                
1A_0   3.0  40.1  0.9
1B_0   3.0  38.8  1.0
1C_0   4.0  35.4  1.2
1D_0   4.0  37.0  1.0
2E_0   5.0  33.0  0.7
2F_0   5.0  32.4  0.7
2G_0   NaN   NaN  0.5
2H_0   NaN   NaN  0.5
3I_0   8.0  65.0  0.6
3J_0   8.0  67.4  0.4
3K_0   8.0  68.2  0.3
3L_0   9.0  59.9  0.3


### What is `loci2bpp` going to do with the trait dataframe?
`loci2bpp` will filter it to remove samples that are not in IMAP, mean-standardize the values in each column based on the samples that are present, and then save it to a file. You can see this below, where the values are now scaled around 0, and NaN is converted to "NA".
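
Written out, the standardization is the usual z-score, computed for each trait column $j$ over the $n_j$ samples that have data. This is a sketch of what the pandas expression in the next cell computes; note that pandas' `std()` uses the sample standard deviation ($n_j - 1$ in the denominator) by default.

$$
z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}, \qquad
\bar{x}_j = \frac{1}{n_j} \sum_{i} x_{ij}, \qquad
s_j = \sqrt{\frac{1}{n_j - 1} \sum_{i} \left(x_{ij} - \bar{x}_j\right)^2}
$$

For example, trait t1 has 10 non-missing values with $\bar{x} = 5.7$ and $s \approx 2.312$, so sample 1A_0 (with $x = 3$) gets $z \approx (3 - 5.7) / 2.312 \approx -1.168$, which matches the table printed below.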

In [18]:
## mean standardize data in each column
straits = traits.apply(lambda x: (x - x.mean()) / (x.std()))

## convert NaN (true missing) to NA strings, b/c that's what ibpp wants.
ftraits = straits.fillna("NA")

## Now save as a new filename (TRAITFILE)
ftraits.to_csv("./traits_standardized.csv")

## print mean standardized trait values for our records
print ftraits

             t1        t2        t3
Indiv                              
1A_0   -1.16792 -0.497734  0.760639
1B_0   -1.16792 -0.582649  1.098701
1C_0  -0.735356 -0.804735  1.774824
1D_0  -0.735356 -0.700224  1.098701
2E_0  -0.302794 -0.961502  0.084515
2F_0  -0.302794  -1.00069  0.084515
2G_0         NA        NA -0.591608
2H_0         NA        NA -0.591608
3I_0   0.994893   1.12872 -0.253546
3J_0   0.994893   1.28549 -0.929670
3K_0   0.994893   1.33774 -1.267731
3L_0    1.42746   0.79559 -1.267731


In [19]:
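### Let's run species delimitation with traits
### (the output below reports 3 species, i.e., these cells were run with the
### three-species IMAP and TREE defined near the top of this notebook)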
ctl1 = ipyrad.file_conversion.loci2bpp("delim_with_traits", LOCI, IMAP, TREE, 
                                       wdir=WDIR,
                                       infer_delimit=1, 
                                       traits_df=traits,
                                       useseqdata=1,
                                       usetraitdata=1)

### And compare it to when the traits are not used
ctl2 = ipyrad.file_conversion.loci2bpp("delim_no_traits", LOCI, IMAP, TREE, 
                                       wdir=WDIR,
                                       infer_delimit=1,  
                                       traits_df=traits,
                                       useseqdata=1,
                                       usetraitdata=0)

### And compare it to when only traits are used
ctl3 = ipyrad.file_conversion.loci2bpp("delim_only_traits", LOCI, IMAP, TREE, 
                                       wdir=WDIR,
                                       infer_delimit=1,  
                                       traits_df=traits,
                                       useseqdata=0,
                                       usetraitdata=1)

new files created (1000 loci, 3 species, 12 samples)
  delim_with_traits.ibpp.seq.txt
  delim_with_traits.ibpp.imap.txt
  delim_with_traits.ibpp.ctl.txt
  delim_with_traits.ibpp.traits.txt
new files created (1000 loci, 3 species, 12 samples)
  delim_no_traits.ibpp.seq.txt
  delim_no_traits.ibpp.imap.txt
  delim_no_traits.ibpp.ctl.txt
  delim_no_traits.ibpp.traits.txt
new files created (1000 loci, 3 species, 12 samples)
  delim_only_traits.ibpp.seq.txt
  delim_only_traits.ibpp.imap.txt
  delim_only_traits.ibpp.ctl.txt
  delim_only_traits.ibpp.traits.txt