# Species delimitation in Malagasy Canarium using iBPP

This notebook is an empirical application of ibpp for species delimitation using GBS data assembled in ipyrad. We use the ipyrad utility function `loci2bpp` to programmatically set up a range of tests and deploy them in parallel.

### Information about this notebook
This is a Jupyter notebook, and all of its code is Python. You should be able to download and execute it to reproduce all of our results. This notebook, along with other notebooks and data files, is hosted on GitHub: https://github.com/sarahfederman/Canarium-GBS/

### Install required software

In [None]:
## conda install bpp -c ipyrad
## conda install ete3 -c bioconda

### Import Python libraries

In [11]:
import ipyrad as ip
import pandas as pd
import ete3 as ete
import numpy as np
import random
import sys
import os

## print versions
print("ipyrad v.{}".format(ip.__version__))

ipyrad v.0.6.6


### Create a directory to store results files in

In [2]:
WDIR = "./analysis_bpp"
if not os.path.exists(WDIR):
    os.mkdir(WDIR)

### Setup an ipyparallel cluster connection

In [23]:
import ipyparallel as ipp
ipyclient = ipp.Client()
lbview = ipyclient.load_balanced_view()
print(ip.cluster_info())

  host compute node: [20 cores] on c20n05.farnam.hpc.yale.internal


### The input data

In [9]:
## download the .loci file (replace dropbox link with zenodo link) and save its path
#! curl -LkO https://dl.dropboxusercontent.com/u/2538935/CanEnd_min20.loci
LOCI = "./analysis-ipyrad/Canarium_min20_outfiles/Canarium_min20.loci"

In [12]:
## make a mapping dictionary grouping samples into 'species'
IMAP6 = {
    "A": ['SF172', 'SF175', 'SF328', 'SF200', 'SF209', 'D14528', 'SF276', 'SF286', 'D13052'],
    "B": ['D13101', 'D13103', 'D14482', 'D14483'],
    "C": ['D14504', 'D14505', 'D14506'],
    "D": ['D14477', 'D14478', 'D14480', 'D14485', 'D14501', 'D14513'], 
    "E": ['D13090', 'D12950'],
    "F": ['D13097', 'SF155', 'D13063', 'D12963', 'SF160', 'SF327',
          'SF224', 'SF228', '5573', 'SF153', 'SF164', 'D13075', 'SF197'], 
    }


## make a dictionary with min values to filter loci to those with N samples per species.
MINMAP6 = {
    "A": 8, 
    "B": 4, 
    "C": 3,
    "D": 5, 
    "E": 2, 
    "F": 8,
}


## Species tree hypothesis ('guide tree') based on raxml & bucky results
TREE6 = "((((D,C),B),(E,F)),A);"
print(ete.Tree(TREE6))


            /-D
         /-|
      /-|   \-C
     |  |
   /-|   \-B
  |  |
  |  |   /-E
--|   \-|
  |      \-F
  |
   \-A
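
Before building input files it can help to confirm that the mapping, the minimum-sample dictionary, and the guide tree agree with each other. The sketch below is not part of ipyrad: `check_design` and `passes_minmap` are hypothetical helper names, and the filter reflects our reading of the `minmap` comment above (a locus is kept only if every group has at least its minimum number of samples), not ipyrad's actual implementation.

```python
import re

IMAP6 = {
    "A": ['SF172', 'SF175', 'SF328', 'SF200', 'SF209', 'D14528', 'SF276', 'SF286', 'D13052'],
    "B": ['D13101', 'D13103', 'D14482', 'D14483'],
    "C": ['D14504', 'D14505', 'D14506'],
    "D": ['D14477', 'D14478', 'D14480', 'D14485', 'D14501', 'D14513'],
    "E": ['D13090', 'D12950'],
    "F": ['D13097', 'SF155', 'D13063', 'D12963', 'SF160', 'SF327',
          'SF224', 'SF228', '5573', 'SF153', 'SF164', 'D13075', 'SF197'],
}
MINMAP6 = {"A": 8, "B": 4, "C": 3, "D": 5, "E": 2, "F": 8}
TREE6 = "((((D,C),B),(E,F)),A);"


def check_design(imap, minmap, newick):
    """Return a list of problems found; an empty list means the inputs agree."""
    problems = []
    # tip labels are the names between newick punctuation characters
    # (assumes the tree has no branch lengths or support values)
    tips = set(re.findall(r"[^(),;:\s]+", newick))
    if tips != set(imap):
        problems.append("tree tips and imap groups differ")
    if set(minmap) != set(imap):
        problems.append("minmap and imap groups differ")
    # every sample should be assigned to exactly one group
    samples = [s for group in imap.values() for s in group]
    if len(samples) != len(set(samples)):
        problems.append("a sample is assigned to more than one group")
    # a minimum cannot exceed the number of samples available in the group
    for name, minimum in minmap.items():
        if minimum > len(imap.get(name, [])):
            problems.append("minmap for {} exceeds group size".format(name))
    return problems


def passes_minmap(locus_samples, imap, minmap):
    """True if the locus has at least minmap[group] samples from every group."""
    present = set(locus_samples)
    return all(
        len(present.intersection(imap[name])) >= minimum
        for name, minimum in minmap.items()
    )


print(check_design(IMAP6, MINMAP6, TREE6))
```

Note that groups B, C, and E have a minimum equal to their group size, so under this reading a locus missing any one of those samples would be dropped.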


### Make a function to call bpp/ibpp
We will submit a large number of jobs to our parallel cluster. First we will infer a species tree with bpp, and then we will add traits and test delimitation hypotheses with ibpp. To track the progress of the parallel processes, we store their async result objects in a dictionary (`tree_asyncs` below), keyed by control file.

In [13]:
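## a function to call i/bpp; subprocess is imported inside the function
## so that it is available on the remote engines that run it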
def bpp(ctlfile):
    import subprocess
    subprocess.check_output(["bpp", ctlfile])
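
`subprocess.check_output` raises `CalledProcessError` when the program exits with a non-zero status, but the program's own messages are easy to lose on a remote engine. A hedged variant is sketched below. `run_logged` is a hypothetical name, not part of ipyrad or bpp; it merges stderr into the captured output and attaches that text to the error it raises. It is demonstrated with the Python interpreter as a stand-in command, since we make no assumptions here about what bpp prints.

```python
import sys


def run_logged(cmd):
    """Run cmd (a list of strings) and return its combined stdout/stderr as text."""
    # imported inside the function, as in bpp() above, for use on remote engines
    import subprocess
    try:
        out = subprocess.check_output(cmd, stderr=subprocess.STDOUT)
    except subprocess.CalledProcessError as err:
        # include the captured output so a failed job explains itself
        msg = "{} exited with status {}:\n{}".format(
            cmd[0], err.returncode, err.output.decode(errors="replace"))
        raise RuntimeError(msg)
    return out.decode(errors="replace")


# stand-in for ["bpp", ctlfile]
print(run_logged([sys.executable, "-c", "print('ok')"]))
```

On a cluster this could be submitted in the same way as `bpp`, for example `lbview.apply(run_logged, ["bpp", job])`.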
    

### Species delimitation
We want to test different resolutions of our fixed species tree (TREE6) using the species delimitation test in bpp (infer_sptree=0; infer_delimit=1). To check that the MCMC is mixing adequately, we run each analysis from several random seeds and under different values of the theta and tau priors.

In [14]:
ctls = []
for theta in [200, 2000]:
    for tau in [1000, 2000]:
        for rep in range(10):
            ## build input files
            name = "delim-theta-{}-tau-{}-rep-{}".format(theta, tau, rep)
            ctl = ip.file_conversion.loci2bpp(name, 
                                              locifile=LOCI,
                                              imap=IMAP6,      
                                              minmap=MINMAP6,
                                              guidetree=TREE6,
                                              wdir=WDIR,
                                              infer_sptree=0,
                                              infer_delimit=1,
                                              maxloci=10000,
                                              nsample=100000,
                                              burnin=10000,
                                              sampfreq=2,
                                              thetaprior=(2, theta),
                                              tauprior=(2, tau, 1),
                                              seed=random.randint(1, int(1e9)),
                                              )
            ## store the ctl filename
            ctls.append(ctl)
        

new files created (1110 loci, 6 species, 37 samples)
  delim-theta-200-tau-1000-rep-0.bpp.seq.txt
  delim-theta-200-tau-1000-rep-0.bpp.imap.txt
  delim-theta-200-tau-1000-rep-0.bpp.ctl.txt
new files created (1110 loci, 6 species, 37 samples)
  delim-theta-200-tau-1000-rep-1.bpp.seq.txt
  delim-theta-200-tau-1000-rep-1.bpp.imap.txt
  delim-theta-200-tau-1000-rep-1.bpp.ctl.txt
new files created (1110 loci, 6 species, 37 samples)
  delim-theta-200-tau-1000-rep-2.bpp.seq.txt
  delim-theta-200-tau-1000-rep-2.bpp.imap.txt
  delim-theta-200-tau-1000-rep-2.bpp.ctl.txt
new files created (1110 loci, 6 species, 37 samples)
  delim-theta-200-tau-1000-rep-3.bpp.seq.txt
  delim-theta-200-tau-1000-rep-3.bpp.imap.txt
  delim-theta-200-tau-1000-rep-3.bpp.ctl.txt
new files created (1110 loci, 6 species, 37 samples)
  delim-theta-200-tau-1000-rep-4.bpp.seq.txt
  delim-theta-200-tau-1000-rep-4.bpp.imap.txt
  delim-theta-200-tau-1000-rep-4.bpp.ctl.txt
new files created (1110 loci, 6 species, 37 samples)
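
The nested loops above define a small grid of prior settings. As a rough guide to what each setting implies, the sketch below enumerates the grid and reports prior means. It assumes the gamma G(a, b) parameterization with mean a/b, as we understand the bpp 3.x documentation for `thetaprior` and `tauprior`; if your bpp version uses a different prior family, these means do not apply. The variable names are ours.

```python
from itertools import product

THETAS = [200, 2000]   # second parameter of thetaprior in the loop above
TAUS = [1000, 2000]    # second parameter of tauprior in the loop above
NREPS = 10
ALPHA = 2.0            # first parameter used for both priors above

# one job per (theta, tau, replicate) combination
grid = list(product(THETAS, TAUS, range(NREPS)))
print("{} jobs in total".format(len(grid)))

for theta, tau in product(THETAS, TAUS):
    # assumed: mean of a gamma(shape=a, rate=b) distribution is a / b
    print("theta b={:>4}: mean {:.4f} | tau b={:>4}: mean {:.4f}".format(
        theta, ALPHA / theta, tau, ALPHA / tau))
```

Under that assumption, a larger second parameter corresponds to a smaller prior mean, so the two theta settings differ tenfold and the two tau settings twofold.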
  

In [24]:
## a dictionary to store results
tree_asyncs = {}

## submit jobs to the cluster
for job in ctls:
    tree_asyncs[job] = lbview.apply(bpp, job)
    sys.stderr.write("job submitted [{}]\n".format(job))

job submitted [/ysm-gpfs/home/de243/Canarium-GBS/analysis_bpp/delim-theta-200-tau-1000-rep-0.bpp.ctl.txt]
job submitted [/ysm-gpfs/home/de243/Canarium-GBS/analysis_bpp/delim-theta-200-tau-1000-rep-1.bpp.ctl.txt]
job submitted [/ysm-gpfs/home/de243/Canarium-GBS/analysis_bpp/delim-theta-200-tau-1000-rep-2.bpp.ctl.txt]
job submitted [/ysm-gpfs/home/de243/Canarium-GBS/analysis_bpp/delim-theta-200-tau-1000-rep-3.bpp.ctl.txt]
job submitted [/ysm-gpfs/home/de243/Canarium-GBS/analysis_bpp/delim-theta-200-tau-1000-rep-4.bpp.ctl.txt]
job submitted [/ysm-gpfs/home/de243/Canarium-GBS/analysis_bpp/delim-theta-200-tau-1000-rep-5.bpp.ctl.txt]
job submitted [/ysm-gpfs/home/de243/Canarium-GBS/analysis_bpp/delim-theta-200-tau-1000-rep-6.bpp.ctl.txt]
job submitted [/ysm-gpfs/home/de243/Canarium-GBS/analysis_bpp/delim-theta-200-tau-1000-rep-7.bpp.ctl.txt]
job submitted [/ysm-gpfs/home/de243/Canarium-GBS/analysis_bpp/delim-theta-200-tau-1000-rep-8.bpp.ctl.txt]
job submitted [/ysm-gpfs/home/de243/Canarium-G

### Track progress of jobs

In [59]:
## check whether each has finished or failed
for jid, job in enumerate(tree_asyncs):
    ## get shorter name for job
    jobname = job.split("/")[-1]
    
    ## print done or not
    if tree_asyncs[job].ready():
        if tree_asyncs[job].successful():
            print("{:<3}{:<30} -- finished".format(jid, jobname))
        else:
            ## report the job name along with the exception that was raised
            print("{:<3}{:<30} -- failed: {}".format(jid, jobname, tree_asyncs[job].exception()))
    else:
        print("{:<3}{:<30} -- still running".format(jid, jobname))

0  delim-theta-200-tau-1000-rep-9.bpp.ctl.txt -- finished
1  delim-theta-2000-tau-2000-rep-5.bpp.ctl.txt -- finished
2  delim-theta-200-tau-2000-rep-8.bpp.ctl.txt -- finished
3  delim-theta-200-tau-2000-rep-2.bpp.ctl.txt -- finished
4  delim-theta-200-tau-2000-rep-3.bpp.ctl.txt -- finished
5  delim-theta-2000-tau-2000-rep-6.bpp.ctl.txt -- finished
6  delim-theta-2000-tau-2000-rep-7.bpp.ctl.txt -- finished
7  delim-theta-2000-tau-1000-rep-1.bpp.ctl.txt -- finished
8  delim-theta-2000-tau-1000-rep-8.bpp.ctl.txt -- finished
9  delim-theta-2000-tau-2000-rep-8.bpp.ctl.txt -- finished
10 delim-theta-200-tau-2000-rep-1.bpp.ctl.txt -- finished
11 delim-theta-2000-tau-1000-rep-5.bpp.ctl.txt -- finished
12 delim-theta-2000-tau-1000-rep-0.bpp.ctl.txt -- finished
13 delim-theta-200-tau-1000-rep-7.bpp.ctl.txt -- finished
14 delim-theta-200-tau-1000-rep-1.bpp.ctl.txt -- finished
15 delim-theta-200-tau-2000-rep-9.bpp.ctl.txt -- finished
16 delim-theta-200-tau-2000-rep-7.bpp.ctl.txt -- finished
17 del
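
With 40 jobs (2 theta values x 2 tau values x 10 replicates), a short tally can be easier to read than the full listing. The sketch below counts jobs by state using only the `ready()` and `successful()` methods called above. `tally_jobs` and the `FakeAsync` stand-in are hypothetical and exist only so the example runs without a cluster; with a live cluster you would pass `tree_asyncs` directly.

```python
from collections import Counter


def tally_jobs(asyncs):
    """Count jobs that are finished, failed, or still running."""
    counts = Counter()
    for result in asyncs.values():
        if not result.ready():
            counts["running"] += 1
        elif result.successful():
            counts["finished"] += 1
        else:
            counts["failed"] += 1
    return counts


class FakeAsync(object):
    """Minimal stand-in that mimics the two async-result methods used here."""

    def __init__(self, ready, ok=False):
        self._ready = ready
        self._ok = ok

    def ready(self):
        return self._ready

    def successful(self):
        return self._ok


# hypothetical job states, for illustration only
demo = {
    "job-0": FakeAsync(True, True),
    "job-1": FakeAsync(True, False),
    "job-2": FakeAsync(False),
}
print(dict(tally_jobs(demo)))
```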

### Summarize results

In [143]:
import glob

def parse_bpp(ofiles):
    ## collect the posterior column from each replicate's output file
    cols = []
    for ofile in ofiles:
        with open(ofile) as infile:
            dat = infile.read()
        ## keep the text after the first 'bpp.mcmc.txt' + blank line,
        ## up to the next blank line
        lastbits = dat.split("bpp.mcmc.txt\n\n")[1:]
        results = lastbits[0].split("\n\n")[0].split()
        ## skip three leading tokens; 8 delimitation models x 4 columns
        dat = np.array(results[3:]).reshape(8, 4)
        cols.append(dat[:, 3].astype(float))

    ## average over however many replicates were found, and store the
    ## means as floats so they are not truncated to the string width
    dd = pd.DataFrame(dat[:, 1:3], columns=["delim", "prior"])
    dd["posterior"] = np.array(cols).mean(axis=0)
    ## each '1' in the delim string adds one species to the collapsed model
    dd["nspecies"] = 1 + np.array([list(i) for i in dat[:, 1]], dtype=int).sum(axis=1)
    return dd
    
    
for theta in [200, 2000]:
    for tau in [1000, 2000]:
        ofile = "delim-theta-{}-tau-{}-*.out.txt".format(theta, tau)
        outfiles = glob.glob(os.path.join(WDIR, ofile))
        print(ofile)
        print(parse_bpp(outfiles))
        print("")


delim-theta-200-tau-1000-*.out.txt
   delim    prior posterior  nspecies
0  00000  0.12500       0.3         1
1  10000  0.12500   0.01053         2
2  11000  0.12500    0.0392         3
3  11001  0.12500   0.04750         4
4  11100  0.12500   0.29133         4
5  11101  0.12500   0.11126         5
6  11110  0.12500   6.6e-05         5
7  11111  0.12500   0.20008         6

delim-theta-200-tau-2000-*.out.txt
   delim    prior posterior  nspecies
0  00000  0.12500       0.0         1
1  10000  0.12500       0.0         2
2  11000  0.12500   0.02226         3
3  11001  0.12500   0.00979         4
4  11100  0.12500   0.20957         4
5  11101  0.12500   0.59025         5
6  11110  0.12500   0.06810         5
7  11111  0.12500   0.10000         6

delim-theta-2000-tau-1000-*.out.txt
   delim    prior posterior  nspecies
0  00000  0.12500       0.1         1
1  10000  0.12500       0.0         2
2  11000  0.12500   0.02475         3
3  11001  0.12500   0.01223         4
4  11100  0.12500 
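
The `nspecies` column follows directly from the `delim` string: `parse_bpp` sums the digits and adds one, so the fully collapsed model `00000` is one species and each `1` adds another. Reading each digit as a flag for whether one of the five internal nodes of the six-tip guide tree is split is our interpretation of the bpp output, not something the code itself checks. A minimal sketch, with `count_species` as a hypothetical helper name:

```python
def count_species(delim):
    """Number of species implied by a delimitation string such as '11100'."""
    return 1 + sum(int(flag) for flag in delim)


# the eight models listed in the tables above
MODELS = ["00000", "10000", "11000", "11001", "11100", "11101", "11110", "11111"]
for model in MODELS:
    print("{} -> {} species".format(model, count_species(model)))
```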

### Take home
In general, hypotheses of five or six species fared best, though in some cases a single species was also supported. Replicates started from different seeds often varied widely, with individual runs giving strong support to conflicting results. Further tuning of the mixing parameters might help with this. bpp does not output convergence statistics for the species delimitation method, but we concluded from runs of the 0,0 algorithm that this number of reps yielded ESS scores >200.
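
One way to read the tables above against this summary is to pool posterior support by the number of species. The sketch below does this for the two tables that are shown in full (theta 200 with tau 1000 and tau 2000), copying the averaged posteriors as printed. `pool_by_nspecies` is a hypothetical helper, and because the inputs are already averages over replicates, the pooled values say nothing about how much individual replicates disagreed.

```python
from collections import defaultdict

# (delim, posterior) pairs copied from the printed tables
TABLES = {
    "theta-200-tau-1000": [
        ("00000", 0.3), ("10000", 0.01053), ("11000", 0.0392), ("11001", 0.04750),
        ("11100", 0.29133), ("11101", 0.11126), ("11110", 6.6e-05), ("11111", 0.20008),
    ],
    "theta-200-tau-2000": [
        ("00000", 0.0), ("10000", 0.0), ("11000", 0.02226), ("11001", 0.00979),
        ("11100", 0.20957), ("11101", 0.59025), ("11110", 0.06810), ("11111", 0.10000),
    ],
}


def pool_by_nspecies(rows):
    """Sum posterior support over all models with the same number of species."""
    pooled = defaultdict(float)
    for delim, posterior in rows:
        nspecies = 1 + delim.count("1")
        pooled[nspecies] += posterior
    return dict(pooled)


for name in sorted(TABLES):
    pooled = pool_by_nspecies(TABLES[name])
    print(name)
    for n in sorted(pooled):
        print("  {} species: {:.5f}".format(n, pooled[n]))
```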