## Cookbook for running BUCKy in parallel in a Jupyter notebook

This notebook uses the *Pedicularis* example data set from the first empirical ipyrad tutorial. Here I show how to run BUCKy on a large set of loci parsed from the output file with the `.loci` ending. All code in this notebook is Python. You can simply follow along and execute this same code in a Jupyter notebook of your own. 

--------------------------------
# Modifications to this notebook are in progress...
---------------------------------

### Software requirements for this notebook
All required software can be installed through conda by running the commented out code below in a terminal. 

In [1]:
## conda install -c BioBuilds mrbayes
## conda install -c ipyrad ipyrad
## conda install -c ipyrad bucky
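
Before going further, it can help to confirm that the command-line programs called later in this notebook (`mb`, `mbsum` and `bucky`) are on your `PATH`. The following is a minimal sketch, assuming Python 3.3 or later for `shutil.which`; the helper name `find_missing_programs` is hypothetical and not part of ipyrad.

```python
import shutil

def find_missing_programs(programs):
    """Return the subset of program names that cannot be found on PATH."""
    return [prog for prog in programs if shutil.which(prog) is None]

## the three executables called later in this notebook
missing = find_missing_programs(["mb", "mbsum", "bucky"])
if missing:
    print("not found on PATH: {}".format(", ".join(missing)))
else:
    print("all programs found")
```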

In [1]:
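## import Python libraries
import ipyrad.analysis as ipa
import ipyparallel as ipp
import subprocess as sps   ## used below to call mb, mbsum and bucky
import ipyrad as ip
import glob                ## used below to collect nexus and tree files
import os                  ## used below to build file paths
#import ipyrad.file_conversion as ifc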

### Cluster setup
To execute code in parallel we will use the `ipyparallel` Python library. A quick guide to starting a parallel cluster locally can be found [here](link), and instructions for setting up a remote cluster on an HPC system are available [here](http://ipyrad.readthedocs.io/HPC_Tunnel.html). In either case, this notebook assumes you have started an `ipcluster` instance that it can find, which the cell below will test. 

In [3]:
## look for a running ipcluster instance, and create a load-balanced view
ipyclient = ipp.Client()
lbview = ipyclient.load_balanced_view()
print("{} engines found".format(len(ipyclient)))

4 engines found


### Create a bucky analysis object
The two required arguments are `name` and `data`. The `data` argument should be a `.loci` file or a `.alleles.loci` file. The `name` is used to name output files, which are written to `{workdir}/{name}/{number}.nex`. BUCKy doesn't deal well with missing data, so a locus is included only if it contains data for every sample in the analysis. It also doesn't work very well with more than about 10-12 samples. By default, all samples found in the loci file are used, unless you pass a list of names (the `samples` argument) to subsample taxa. It is best to select one individual per species or subspecies. You can set a number of additional parameters in the `.params` dictionary. 
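
The rule that a locus is kept only when every selected sample has data can be sketched in plain Python. This is an illustration, not the code ipyrad runs: the function name `loci_with_all_samples` is hypothetical, and it assumes a `.loci` layout in which each locus is a block of `name sequence` lines closed by a line starting with `//`.

```python
def loci_with_all_samples(loci_text, samples):
    """Return loci (dicts of name -> sequence) that contain every requested sample."""
    wanted = set(samples)
    kept = []
    current = {}
    for line in loci_text.splitlines():
        if line.startswith("//"):
            ## end of a locus block: keep it only if no sample is missing
            if wanted.issubset(current):
                kept.append({name: current[name] for name in samples})
            current = {}
        elif line.strip():
            name, seq = line.split()[:2]
            current[name] = seq
    return kept

## toy input (assumed layout): sample B is absent from the second locus
toy = """\
A    ACGT
B    ACGA
C    ACGT
//      *  |0|
A    TTGA
C    TTGC
//      *  |1|
"""
kept = loci_with_all_samples(toy, ["A", "B", "C"])
print("{} locus kept".format(len(kept)))
```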

In [6]:
## make a list of sample names you wish to include in your BUCKy analysis 
samples = [
    "29154_superba", 
    "30686_cyathophylla", 
    "41478_cyathophylloides", 
    "33413_thamno", 
    "30556_thamno",
    "35236_rex",
    "40578_rex", 
    "38362_rex", 
    "33588_przewalskii",
]

In [5]:
## initiate a bucky object
b = ipa.bucky(
    name="test",
    data="analysis-ipyrad/pedic_outfiles/pedic.alleles.loci",
    workdir="analysis-bucky",
    samples=samples,
    minSNPs=2
)

AttributeError: 'module' object has no attribute 'bucky'

In [None]:
## print the params dictionary
b.params

In [6]:
## This will write nexus files to {workdir}/{name}/[number].nex
b.write_multinexus_files()

infile is: /home/deren/Documents/ipyrad/tests/branch-test/base_outfiles/base.loci
outdir is: /home/deren/Documents/ipyrad/tests/analysis-bucky


### An example nexus file
Nexus files are written to a new directory named after the `name` argument given when the bucky object was created. If you also entered a `workdir` argument, this new directory is made as a subdirectory inside that working directory. Above we used `name="test"` and `workdir="analysis-bucky"`, so the files are in `analysis-bucky/test/`, which is the path used in the cell below. 

In [11]:
## print an example nexus file
! cat analysis-bucky/test/1.nex

#NEXUS
begin data;
dimensions ntax=9 nchar=66;
format datatype=dna interleave=yes gap=- missing=N;
matrix
30686_cyathophylla      CTTGGCAGGTGGCAGTTCGTTGCTGTTATATGCTGTAAGAAAAT-AAAAAAAAATCACCTGTTTAG
33413_thamno            CTTGGCAGGTGGCAGTTTGTTGCTGTTTTATGCTGTAAGAAAAT--AAAAAAAACCACCTGTTTAG
30556_thamno            CTTNGCAGGTGGCAGTTTGTTGCTGTTTTATGCTGTAAGAAAAT-NAAAAAAAATCACCTGTTTAG
33588_przewalskii       CTTGGCAGGTGGCAGTTCGTTGCTGAAATATGCTGTAAGAAAAT-AAAGAAAAATCATTT-TTTGG
29154_superba           CTTGGCAGTTGGCATTTCGTTGCTGTTATATGCTGTAAGAAAAT-AAAAAAAAATCACCTGTTTAA
40578_rex               CTTGGCAGGTGGCAGTTTGTTGCTGTTTTATGCTGTAAGAAAAT--AAAAAAAATCACCTGTTTAG
41478_cyathophylloides  CTTGGCAGGTGGCAGTTCGTTGCTGTTATATGCTGTAAGAAAATAAAAAAAAAATCACCTGTTTAG
38362_rex               CTTGGCAGGTGGCAGTTTGTTGCTGTTTTATGCTGTAAGAAAATAAAAAAAAAATCACCTGTTTAG
35236_rex               CTTGGCAGGTGGCAGTTTGTTGCTGTTTTATGCTGTAAGAAAAT--AAAAAAAATCACCTGTTTAG

    ;

begin mrbayes;
set autoclose=yes nowarn=yes;
lset nst=6 rates=gamma
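
The data block of such a file can be sketched as follows. The helper `nexus_data_block` is hypothetical and is not the writer ipyrad uses; it only mirrors the header fields visible in the example above, and it leaves out the `mrbayes` block because that block is not shown in full here.

```python
def nexus_data_block(seqs):
    """Format a dict of {sample name: aligned sequence} as a nexus data block."""
    lengths = set(len(seq) for seq in seqs.values())
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same length")
    nchar = lengths.pop()
    ## pad names so that the sequences line up in one column
    width = max(len(name) for name in seqs) + 2
    lines = [
        "#NEXUS",
        "begin data;",
        "dimensions ntax={} nchar={};".format(len(seqs), nchar),
        "format datatype=dna interleave=yes gap=- missing=N;",
        "matrix",
    ]
    for name in sorted(seqs):
        lines.append("{}{}".format(name.ljust(width), seqs[name]))
    lines.extend(["    ;", "end;"])
    return "\n".join(lines)

block = nexus_data_block({"35236_rex": "CTTGG", "33413_thamno": "CTTNG"})
print(block)
```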

In [12]:
## get all nexus files for this data set
nexfiles = glob.glob(os.path.join(RUNDIR, "*.nex"))

### Python functions to call `mrbayes`, `mbsum` and `bucky`


In [13]:
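def mrbayes(infile):
    """Run MrBayes on one nexus file; return its output only if the run failed."""
    ## import inside the function so the names exist on the remote engines
    import os
    import subprocess as sps

    ## double check file path
    infile = os.path.realpath(infile)
    if not os.path.exists(infile):
        raise Exception("infile not found; try using a fullpath")
        
    ## call mrbayes (stderr is merged into stdout)
    cmd = ['mb', infile]
    proc = sps.Popen(cmd, stderr=sps.STDOUT, stdout=sps.PIPE)
    stdout, _ = proc.communicate()
    
    ## check for errors: a non-zero exit code returns the captured output
    if proc.returncode:
        return stdout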

In [14]:
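def mbsum(dirs):
    """Run mbsum on each pair of MrBayes run1/run2 tree files in a directory."""
    ## sort both lists so the run1 and run2 files of a locus share an index
    trees1 = sorted(glob.glob(os.path.join(dirs, "*.run1.t")))
    trees2 = sorted(glob.glob(os.path.join(dirs, "*.run2.t")))
    for tidx in range(len(trees1)):
        cmd = ["mbsum", 
               "-n", "0", 
               "-o", os.path.join(dirs, str(tidx))+".in", 
               trees1[tidx], 
               trees2[tidx]]
        proc = sps.Popen(cmd, stderr=sps.STDOUT, stdout=sps.PIPE)
        proc.communicate()
    ## report the number of run1/run2 pairs that were summarized
    print("summed {} trees in: {}".format(len(trees1), dirs))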
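
`mbsum` needs the `run1` and `run2` tree files of the same locus to be passed together. One way to make that pairing explicit, rather than relying on list order, is to match files by their shared prefix. This is a sketch with a hypothetical helper name (`pair_tree_files`); it assumes only the `*.run1.t` / `*.run2.t` naming used in the function above.

```python
import glob
import os

def pair_tree_files(dirname):
    """Return sorted (run1, run2) path pairs that share the same file prefix."""
    pairs = []
    for run1 in sorted(glob.glob(os.path.join(dirname, "*.run1.t"))):
        ## swap the suffix to get the expected partner file name
        run2 = run1[:-len(".run1.t")] + ".run2.t"
        if os.path.exists(run2):
            pairs.append((run1, run2))
    return pairs
```

A locus whose second run is missing is skipped instead of being paired with the wrong file.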

In [15]:
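def bucky(outname, indir, alpha, nchains, nreps, niter):
    """Run BUCKy on all '.in' files in indir; return output only on failure."""
    ## import inside the function so the names exist on the remote engines
    import os
    import subprocess as sps

    ## check paths
    if not os.path.exists(indir):
        raise Exception("infiles not found; try using a fullpath")
    
    ## call bucky 
    infiles = os.path.join(indir, "*.in")
    cmd = ["bucky", 
           "-a", str(alpha),
           "-c", str(nchains),
           "-k", str(nreps),
           "-n", str(int(niter)), 
           "-o", outname, 
           infiles]
    
    ## join into one string; shell=True lets the shell expand the "*.in" wildcard
    cmd = " ".join(cmd)
    proc = sps.Popen(cmd, stderr=sps.STDOUT, stdout=sps.PIPE, shell=True)
    stdout, _ = proc.communicate()
    if proc.returncode:
        ## cmd is already a string here, so return it as-is
        return cmd, stdout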

### Run mrbayes on all nexus files in parallel
It is important that the lists contain the full paths to the files. 
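
Relative paths are resolved against the working directory of each engine, which may differ from the notebook's. A small sketch of converting a list of paths to absolute paths before submitting jobs follows; the helper `to_full_paths` and the list `example_files` are hypothetical stand-ins for your own `nexfiles`.

```python
import os

def to_full_paths(paths):
    """Return absolute versions of the given paths, resolved from the current directory."""
    return [os.path.abspath(path) for path in paths]

example_files = ["analysis-bucky/test/1.nex", "analysis-bucky/test/2.nex"]
for path in to_full_paths(example_files):
    print(path)
```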

In [15]:
## send jobs to the parallel engines
asyncs = []
for nexfile in nexfiles:
    ## `async` is a reserved word in recent Python versions, so use `job`
    job = lbview.apply(mrbayes, nexfile)
    asyncs.append(job)

### Track progress of the mrbayes runs
If you want to check the progress interactively then execute the cell below, which will tell you how many jobs have finished. The cell after that calls `wait()` to block until all of the mrbayes jobs are finished.

In [33]:
ready =  [i for i in asyncs if i.ready()]
failed = [i for i in ready if not i.successful()]

## print progress
print("mrbayes batch runs:")
print("{} jobs submitted".format(len(asyncs)))
print("{} jobs finished".format(len(ready)))

## print errors, if any.
if failed:
    print(failed[0].exception())
    print(failed[0].result())

mrbayes batch runs:
722 jobs submitted
35 jobs finished


In [49]:
## waits until all mrbayes runs are finished
ipyclient.wait()

True
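
The progress check above can also be wrapped in a small function so the same code can be reused for later batches. This is a sketch; `summarize_jobs` is a hypothetical name, and it relies only on the `ready()` and `successful()` methods already used above. The `FakeJob` class exists solely so the example runs without a cluster.

```python
def summarize_jobs(jobs):
    """Count submitted, finished and failed jobs from a list of async results."""
    ready = [job for job in jobs if job.ready()]
    failed = [job for job in ready if not job.successful()]
    return {"submitted": len(jobs), "finished": len(ready), "failed": len(failed)}

class FakeJob(object):
    """Stand-in for an ipyparallel async result, for demonstration only."""
    def __init__(self, done, ok):
        self._done, self._ok = done, ok
    def ready(self):
        return self._done
    def successful(self):
        return self._ok

jobs = [FakeJob(True, True), FakeJob(True, False), FakeJob(False, False)]
print(summarize_jobs(jobs))
```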

### Summarize the mrbayes posteriors

In [50]:
## run mbsum on each directory of tree files
mbsum(RUNDIR1)
mbsum(RUNDIR2)

summed 9 trees in: /home/deren/Documents/ipyrad/tests/analysis-bucky/bucky-samp13
summed 0 trees in: /home/deren/Documents/ipyrad/tests/analysis-bucky/bucky-samp9


### Run BUCKy to infer concordance factors

In [124]:
nchains = 4
nreps = 4
niter = 1e6
alphas = [0.1, 1, 10]

## submit jobs to run at several values of alpha
bsyncs = []
for alpha in alphas:
    outname = os.path.join(RUNDIR, "bucky-{}".format(alpha))
    args = (outname, RUNDIR, alpha, nchains, nreps, niter)
    ## `async` is a reserved word in recent Python versions, so use `job`
    job = lbview.apply(bucky, *args)
    bsyncs.append(job)
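
To see exactly what each job will execute, the command string can be assembled and printed before anything is submitted. The sketch below reuses the flags from the `bucky` function defined earlier (`-a`, `-c`, `-k`, `-n`, `-o`); the helper name `bucky_command` and the paths in the example are hypothetical.

```python
import os

def bucky_command(outname, indir, alpha, nchains, nreps, niter):
    """Build the bucky command line as a single string, without running it."""
    cmd = ["bucky",
           "-a", str(alpha),
           "-c", str(nchains),
           "-k", str(nreps),
           "-n", str(int(niter)),
           "-o", outname,
           os.path.join(indir, "*.in")]
    return " ".join(cmd)

## one command per alpha value, mirroring the loop above
for alpha in [0.1, 1, 10]:
    print(bucky_command("bucky-{}".format(alpha), "analysis-bucky/test", alpha, 4, 4, 1e6))
```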

### Track progress of BUCKy runs

In [107]:
ready =  [i for i in bsyncs if i.ready()]
failed = [i for i in ready if not i.successful()]
print("bucky batch runs:")
print("{} jobs submitted".format(len(bsyncs)))
print("{} jobs finished".format(len(ready)))
if len(ready) == len(bsyncs):
    ## print errors, if any.
    if failed:
        print(failed[0].exception())


bucky batch runs:
3 jobs submitted
0 jobs finished


In [108]:
ipyclient.wait()

True

### Results
Look at individual results files for final stats.

In [129]:
results = glob.glob(os.path.join(RUNDIR, "bucky-*.concordance"))


In [130]:
results

['/home/deren/Documents/ipyrad/tests/analysis-bucky/bucky-samp13/bucky-1.txt.concordance',
 '/home/deren/Documents/ipyrad/tests/analysis-bucky/bucky-samp13/bucky-0.1.concordance',
 '/home/deren/Documents/ipyrad/tests/analysis-bucky/bucky-samp13/bucky-0.1.txt.concordance',
 '/home/deren/Documents/ipyrad/tests/analysis-bucky/bucky-samp13/bucky-1.concordance',
 '/home/deren/Documents/ipyrad/tests/analysis-bucky/bucky-samp13/bucky-10.txt.concordance',
 '/home/deren/Documents/ipyrad/tests/analysis-bucky/bucky-samp13/bucky-10.concordance']
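
Since each run was written with its alpha value in the file name, the result files can be indexed by alpha. The following sketch assumes the `bucky-{alpha}` naming used above, with an optional `.txt` before `.concordance` as seen in the listing; `results_by_alpha` is a hypothetical helper, and it does not parse the contents of the files.

```python
import os
import re

def results_by_alpha(paths):
    """Map each alpha value (as a float) to the list of result files named after it."""
    pattern = re.compile(r"^bucky-(\d+(?:\.\d+)?)(?:\.txt)?\.concordance$")
    grouped = {}
    for path in paths:
        match = pattern.match(os.path.basename(path))
        if match:
            grouped.setdefault(float(match.group(1)), []).append(path)
    return grouped

## example file names following the pattern in the listing above
example = [
    "bucky-samp13/bucky-0.1.concordance",
    "bucky-samp13/bucky-10.txt.concordance",
    "bucky-samp13/bucky-10.concordance",
]
grouped = results_by_alpha(example)
for alpha in sorted(grouped):
    print("alpha={}: {} file(s)".format(alpha, len(grouped[alpha])))
```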