## Cookbook for running BUCKy in parallel in a Jupyter notebook

This notebook uses the *Pedicularis* example data set from the first empirical ipyrad tutorial. Here I show how to run BUCKy on a large set of loci parsed from the output file with the `.loci` ending. All code in this notebook is Python. You can simply follow along and execute this same code in a Jupyter notebook of your own. 

### Software requirements for this notebook

    + BUCKy
    + mbsum (distributed with BUCKy)
    + MrBayes
    + ipyrad


In [6]:
## import some Python libraries
import ipyparallel as ipp
import ipyrad as ip
import numpy as np
import subprocess
import glob
import os
from collections import Counter


### Cluster setup
To execute code in parallel we will use the `ipyparallel` Python library. A quick guide to starting a parallel cluster locally can be found [here](link), and instructions for setting up a remote cluster on an HPC system are available [here](http://ipyrad.readthedocs.io/HPC_Tunnel.html). In either case, this notebook assumes that an `ipcluster` instance is already running and that the notebook can connect to it. 

In [7]:
## look for running ipcluster instance
ipyclient = ipp.Client()
print("{} engines found".format(len(ipyclient)))

64 engines found


### Set up some tests
List the samples you wish to include in your analysis in a dictionary. You can map them to simpler names if you wish. BUCKy generally starts to perform less well when the number of tips is greater than about 10, so you might want to try analyses with different numbers of tips. In this case I make one fully sampled tree and one that has just one representative from each clade/species. 

In [12]:
## I load the ipyrad Assembly object here. This isn't required, but
## it makes the sample names easy to access. We then store the names
## in a dictionary as keys with matching values.
data = ip.load_json("/ysm-gpfs/home/de243/pedicularis-test/pedicularis/pedic.json")
fullsamples = {sample.name: sample.name for sample in data.samples.values()}

## alternatively, you can just enter all the names by hand, 
## here a subset of samples is mapped to simpler names in a dictionary
subsamples = {"superba": "29154_superba", 
              "cyathophylla": "30686_cyathophylla", 
              "cyathophylloides": "41478_cyathophylloides", 
              "thamno_cupul": "33413_thamno", 
              "thamno_thamno": "30556_thamno",
              "rex_rockii": "35236_rex",
              "rex_rex": "40578_rex", 
              "rex_lipskyana": "38362_rex", 
              "przewalskii": "33588_przewalskii"}
        
## print the fullsamples dict 
fullsamples

  loading Assembly: pedic
  from saved path: ~/pedicularis-test/pedicularis/pedic.json


{'29154_superba': '29154_superba',
 '30556_thamno': '30556_thamno',
 '30686_cyathophylla': '30686_cyathophylla',
 '32082_przewalskii': '32082_przewalskii',
 '33413_thamno': '33413_thamno',
 '33588_przewalskii': '33588_przewalskii',
 '35236_rex': '35236_rex',
 '35855_rex': '35855_rex',
 '38362_rex': '38362_rex',
 '39618_rex': '39618_rex',
 '40578_rex': '40578_rex',
 '41478_cyathophylloides': '41478_cyathophylloides',
 '41954_cyathophylloides': '41954_cyathophylloides'}

### Make an output directory for each test

In [14]:
## we group them into a dir called analysis_bucky
DIR1 = "analysis_bucky/test1"   
DIR2 = "analysis_bucky/test2"

## make the directories if they don't exist
for dirs in [DIR1, DIR2]:
    if not os.path.exists(dirs):
        os.makedirs(dirs)

### A function to write NEXUS blocks

In [15]:
NEXBLOCK = """\
#NEXUS
begin data;
dimensions ntax={} nchar={};
format datatype=dna interleave=yes gap=- missing=N;
matrix
{}
    ;
end;

begin mrbayes;
set autoclose=yes nowarn=yes;
lset nst=6 rates=gamma;
mcmc ngen=2000000 samplefreq=2000 printfreq=20000000;
sump burnin=1000000;
sumt burnin=1000000;
end;
"""

def nexmake(mdict, nlocus, dirs):
    """ 
    Takes a dictionary mapping names to sequences, a locus
    number, and an output directory, and writes the locus as
    a NEXUS file with a mrbayes analysis block.
    """
    ## create matrix as a string; names are truncated to 10 characters,
    ## so they must be unique within their first 10 characters
    matrix = ""
    for name, seq in mdict.items():
        matrix += "{:<10} {}\n".format(name[:10], seq)
    
    ## write nexus block; all sequences in a locus have the same length,
    ## so the length of any one of them gives nchar
    handle = os.path.join(dirs, "{}.nex".format(nlocus))
    with open(handle, 'w') as outnex:
        outnex.write(NEXBLOCK.format(len(mdict), 
                                     len(next(iter(mdict.values()))),
                                     matrix))
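
Because `nexmake` cuts every name to 10 characters, two labels that share their first 10 characters would be written to the NEXUS file as the same taxon. The following is a minimal sketch of a check for this; `find_truncation_clashes` is a hypothetical helper that is not part of the original notebook.

```python
from collections import defaultdict

def find_truncation_clashes(names, width=10):
    """Group names that become identical once cut to `width` characters."""
    groups = defaultdict(list)
    for name in names:
        groups[name[:width]].append(name)
    # keep only the truncated labels shared by more than one name
    return {short: sorted(full) for short, full in groups.items() if len(full) > 1}

labels = ["superba", "cyathophylla", "cyathophylloides", "rex_rex"]
print(find_truncation_clashes(labels))
```

Running a check like this on the keys of each sample dictionary before writing files shows whether any labels need to be shortened by hand.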

### A few simple functions

In [16]:
## a dictionary mapping ambiguous characters
AMBIGS = {"R": ("G", "A"),
          "K": ("G", "T"),
          "S": ("G", "C"),
          "Y": ("T", "C"),
          "W": ("T", "A"),
          "M": ("C", "A"), 
          "A": ("A", "A"), 
          "T": ("T", "T"), 
          "G": ("G", "G"), 
          "C": ("C", "C"), 
          "-": ("-", "-"), 
          "N": ("N", "N")}
            

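def resolveambig(subseq):
    """ Randomly resolves IUPAC heterozygous codes. This is a shortcut
    for now; we could instead use the phased alleles in RAD loci."""
    N = []
    ## each row of subseq is one sample; a single random draw per sample
    ## picks which of the two bases is used at all of its ambiguous sites
    for row in subseq:
        rand = np.random.binomial(1, 0.5)
        N.append([AMBIGS[base][rand] for base in row])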
    return np.array(N)
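
To make the per-sample draw concrete, here is a minimal standard-library sketch of the same idea applied to a plain string. `resolve_random` and `PAIRS` are hypothetical names used only for illustration, and a seeded `random.Random` stands in for the NumPy draw so that the result is repeatable.

```python
import random

# IUPAC two-base ambiguity codes and the bases they stand for
PAIRS = {"R": "GA", "K": "GT", "S": "GC", "Y": "TC", "W": "TA", "M": "CA"}

def resolve_random(seq, rng):
    """Resolve every ambiguity code in seq using one coin flip per sequence."""
    pick = rng.randint(0, 1)
    # unambiguous characters (A, C, G, T, -, N) pass through unchanged
    return "".join(PAIRS[base][pick] if base in PAIRS else base for base in seq)

rng = random.Random(42)
resolved = resolve_random("TCCYKCGN-A", rng)
print(resolved)
```

Because one draw is shared by the whole sequence, every ambiguous site in a sample resolves to the same position of its base pair.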

In [17]:
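def newPIS(seqsamp, N):
    """ returns the number of parsimony informative sites (PIS)
    if the locus has >= N of them, otherwise returns 0 """
    ## count bases in each column, skipping columns with missing data
    counts = [Counter(col) for col in seqsamp.T if not ("-" in col or "N" in col)]
    ## a site is informative if its second most common base occurs more than once
    pis = [i.most_common(2)[1][1] > 1 for i in counts if len(i.most_common(2))>1]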
    if sum(pis) >= N:
        return sum(pis)
    else:
        return 0      
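
As a worked example of the rule used in `newPIS`, the sketch below counts parsimony informative sites in a small made-up alignment using only the standard library. `count_pis` is a hypothetical name, and the sequences are invented for illustration.

```python
from collections import Counter

def count_pis(seqs):
    """Count parsimony informative columns in equal-length sequences."""
    npis = 0
    for col in zip(*seqs):
        # skip columns that contain a gap or missing data
        if "-" in col or "N" in col:
            continue
        top = Counter(col).most_common(2)
        # informative if the second most common base occurs more than once
        if len(top) > 1 and top[1][1] > 1:
            npis += 1
    return npis

alignment = ["AAGT", "AAGT", "ATCT", "ATCN", "CTCT"]
print(count_pis(alignment))
```

Here the first column has a single variant base (not informative), the second and third columns each have two bases that occur at least twice (informative), and the fourth column is skipped because it contains an `N`.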
    

In [18]:
def sample_loci_to_nexus(loci, hdict, dirs, minPIS=2):
    """ 
    This parses the .loci file format produced by ipyrad to 
    keep only loci that have data for all taxa listed in 
    the dictionary (hdict), and which have at least minPIS
    parsimony informative SNPs. 
    """
    ## keep track of how many loci pass
    nlocus = 0
    
    ## create subsampled data set
    for loc in loci:
        dat = loc.split("\n")[:-1]

        ## get names and seq from locus
        names = [i.split()[0] for i in dat]
        seqs = np.array([list(i.split()[1]) for i in dat])

        ## check that locus has required samples for each subtree
        if all([i in names for i in hdict.values()]):
            seqsamp = seqs[[names.index(tax) for tax in hdict.values()]]
            seqsamp = resolveambig(seqsamp)
            pis = newPIS(seqsamp, minPIS)
            if pis:
                nlocus += 1
                ## convert gaps to N, then remove columns that are missing (N) in all samples
                seqsamp[seqsamp == "-"] = "N"
                rmcol = np.all(seqsamp == "N", axis=0)
                seqsamp = seqsamp[:, ~rmcol]

                ## write to a nexus file
                mdict = dict(zip(hdict.keys(), ["".join(i) for i in seqsamp]))
                nexmake(mdict, nlocus, dirs)
    print("{} loci kept".format(nlocus))


### Parse the loci for each test
You can either find the `.loci` file path and enter it here, or load your Assembly object with ipyrad and access the loci file from the object attributes. 

In [20]:
## get loci file from its path or from ipyrad object
locifile = data.outfiles.loci
#locifile = "/home/deren/Documents/ipyrad/tests/pedicularis/pedic_outfiles/pedic.loci"

## parse the file into a list of individual loci
with open(locifile) as infile:
    loci = infile.read().strip().split("|\n")

## print the first and last locus
print("{}\n\n{}".format(loci[0], loci[-1]))

29154_superba              TCTGGTCCCGCGGGTGATCAAGGCCCCACCACCGCGTCTCACATTTTCGATCTCAGGCGGTCTTACTCA
30556_thamno               TCCGGTCCCGCGGGTGATCAAGGCCCCACCACCGCGTCTCACATTCTAGATCTCAGGCGGTCTTACTCA
30686_cyathophylla         TCCAGTCCCGCGGGTGATCAAGGCCCCACCACCGCATCTCACATTCTCGATCTCAGGCGGTCTTACTCA
33413_thamno               TCCGGTCCTTCGGGTGATCAAGGCCCCACCACCGCGTCTCACATTCTAGATCTCAGGCGGTCTTACTCA
35236_rex                  TCCGGTCCCGCGGGTGATCAAGGCCCCACCACCGCGTCTCACATTCTMGATCTCAGGCGGTCTTACTCA
35855_rex                  TCCGGTCCCGCGGGTGATCAAGGCCCCACCACCGCGTCTCACATTCTAGATCTCAGGCGGTCTTACTCA
38362_rex                  TCCGGTCCTTCGGGTGATCAAGGCCCCACCACCGCGTCTCACATTCTAGATCTCAGGCGGTCTTACTCA
40578_rex                  TCCGGTCCYKCGGGTGATCAAGGCCCCACCACCGCGTCTCACATTCTCGATCTCAGGCGGTCTTACTCA
41478_cyathophylloides     TCCGGTCCCGCGGGTGATCAAGGCCCCACCACCGCGTCTCACATTATCGATCTCAGGCGGTCTTACTCA
41954_cyathophylloides     TCCGGTCCCGCGGGTGATCAAGGCCCCACCACCGCGTCTCACATTATCGATCTCAGGCGGTCTTACTCA
//                           -
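
The parsing used in this notebook can be tried on a small string before pointing it at a real file. The sketch below is illustrative only: the toy text is made up and assumes, as the split on `"|\n"` implies, that each locus ends with a `//` line terminated by `|`. `parse_loci` is a hypothetical helper.

```python
# made-up two-locus example in the layout that the split assumes
toy = (
    "sampleA     ACGT\n"
    "sampleB     ACGA\n"
    "//             -|\n"
    "sampleA     TTGC\n"
    "sampleC     TTGC\n"
    "//              |\n"
)

def parse_loci(text):
    """Return a list of (names, seqs) tuples, one per locus."""
    parsed = []
    for loc in text.strip().split("|\n"):
        # the last line of each locus is the '//' line, so drop it
        rows = [line.split() for line in loc.split("\n")[:-1]]
        names = [row[0] for row in rows]
        seqs = [row[1] for row in rows]
        parsed.append((names, seqs))
    return parsed

for names, seqs in parse_loci(toy):
    print(names, seqs)
```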

### Sample loci and write NEXUS files

In [21]:
## write nexus file to the analysis directories
sample_loci_to_nexus(loci, subsamples, DIR1)
sample_loci_to_nexus(loci, fullsamples, DIR2)

847 loci kept
1833 loci kept


### Run mrbayes on loci in parallel
We now want to get a posterior distribution of gene trees from each RAD locus. 

In [22]:
@ipp.require(subprocess)
def mrbayes(infile):
    proc = subprocess.Popen(['mb', infile])
    proc.wait()

In [30]:
## create a load balanced view to distribute jobs
lbview = ipyclient.load_balanced_view()

## get all the nexus files 
nex1 = glob.glob(os.path.join(DIR1, "*.nex"))
nex2 = glob.glob(os.path.join(DIR2, "*.nex"))

## send jobs to the engines
for nexfile in nex1:  ## use nex1 + nex2 to also submit the test2 loci
    lbview.apply(mrbayes, nexfile)

### Track progress of the mrbayes runs
These can take quite a while when there are thousands of them. 

In [None]:
ipyclient.wait_interactive()

   0/847 tasks finished after  507 s

### Summarize the mrbayes posteriors

In [125]:
def mbsum(dirs):
    """ runs mbsum to summarize the two mrbayes runs of each locus """
    ## sort both lists so run1 and run2 files of the same locus pair up
    trees1 = sorted(glob.glob(os.path.join(dirs, "*.run1.t")))
    trees2 = sorted(glob.glob(os.path.join(dirs, "*.run2.t")))
    for tidx in range(len(trees1)):
        cmd = ["mbsum", "-n", "0", 
               "-o", os.path.join(dirs, str(tidx))+".in", 
               trees1[tidx], 
               trees2[tidx]]
        ## check_call waits for the command to finish and raises on failure
        subprocess.check_call(cmd)

In [127]:
## run mbsum on each directory of tree files
mbsum(DIR1)
mbsum(DIR2)
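
Pairing tree files by list position only works if both runs exist for every locus. A more defensive alternative, sketched below, derives each `run2` name from its `run1` name and skips loci that lack a second run. `pair_runs` is a hypothetical helper, and the file names are invented to match the `*.run1.t` and `*.run2.t` patterns used in `mbsum`.

```python
def pair_runs(paths):
    """Pair each *.run1.t file with the *.run2.t file of the same locus."""
    found = set(paths)
    pairs = []
    for run1 in sorted(p for p in paths if p.endswith(".run1.t")):
        # swap the suffix to get the name of the matching second run
        run2 = run1[:-len(".run1.t")] + ".run2.t"
        if run2 in found:
            pairs.append((run1, run2))
    return pairs

files = ["test1/2.nex.run2.t", "test1/1.nex.run1.t", "test1/2.nex.run1.t",
         "test1/1.nex.run2.t", "test1/3.nex.run1.t"]
for run1, run2 in pair_runs(files):
    print(run1, run2)
```

In this example the third locus has no `run2` file, so it is left out rather than being paired with a file from a different locus.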

### Run BUCKy to infer concordance factors

In [None]:
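@ipp.require(subprocess, glob, os)
def bucky(outname, indir, alpha, nchains, nreps, niter):
    ## no shell is involved, so expand the *.in wildcard here
    infiles = glob.glob(os.path.join(indir, "*.in"))
    ## all command line arguments must be strings
    cmd = ["bucky", "-a", str(alpha),
                    "-c", str(nchains),
                    "-k", str(nreps),
                    "-n", str(niter), 
                    "-o", outname] + infiles
    ## check_call waits for the command to finish and raises on failure
    subprocess.check_call(cmd)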

In [None]:
## submit jobs to run at several values of alpha
for indir in [DIR1, DIR2]:
    for alpha in [0.1, 1, 10]:
        outname = os.path.join(indir, "BUCKY_{}".format(alpha))
        lbview.apply(bucky, outname, indir, alpha, 4, 4, 4000000)

In [None]:
ipyclient.wait_interactive()
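
When a command is passed to `subprocess` as a list, every element must be a string and wildcards such as `*.in` are not expanded by a shell. The sketch below assembles a BUCKy command line without running it, so the arguments can be inspected first. `build_bucky_cmd` is a hypothetical helper, and the empty `.in` files are placeholders created only for the demonstration.

```python
import glob
import os
import tempfile

def build_bucky_cmd(outname, indir, alpha, nchains, nreps, niter):
    """Return the bucky command as a list of strings, without running it."""
    # expand the wildcard in Python, since no shell will do it for us
    infiles = sorted(glob.glob(os.path.join(indir, "*.in")))
    cmd = ["bucky", "-a", str(alpha), "-c", str(nchains),
           "-k", str(nreps), "-n", str(niter), "-o", outname]
    return cmd + infiles

# create a scratch directory holding two empty placeholder input files
tmpdir = tempfile.mkdtemp()
for name in ["0.in", "1.in"]:
    open(os.path.join(tmpdir, name), "w").close()

cmd = build_bucky_cmd(os.path.join(tmpdir, "BUCKY_1"), tmpdir, 1, 4, 4, 4000000)
print(" ".join(cmd))
```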

### Results

In [None]:
cat analysis_bucky/test1/BUCKY_1.concordance.tre