# Analysis of *Pedicularis* PE-GBS data set

Notebook 1 associated with this study shows how we assembled this data set using *ipyrad*. The data set is composed of PE-GBS data from 48 individuals of *Pedicularis* spanning two distinct clades. Here we apply several phylogenetic methods to each clade individually and to the two clades together.

### This notebook
This is a Jupyter notebook with IPython code to fully reproduce the phylogenetic analyses of the Yu-Eaton-Ree (2012) *Pedicularis* GBS data set. The notebook and its result files are stored in [this GitHub repository](https://github.com/dereneaton/pedicularis-WB-GBS).

In [1]:
## show the address of this git repo
! git config --get remote.origin.url

https://github.com/dereneaton/pedicularis-WB-GBS.git


### Import ipyrad and other common modules

In [4]:
## all necessary software is installed alongside ipyrad, 
## and can be installed by uncommenting the command below
# conda install -c ipyrad ipyrad -y

## import basic modules and ipyrad and print version
import os
import socket
import glob
import subprocess as sps
import numpy as np
import ipyparallel as ipp
import ipyrad as ip

print "ipyrad v.{}".format(ip.__version__)
print "ipyparallel v.{}".format(ipp.__version__)
print "numpy v.{}".format(np.__version__)

ipyrad v.0.4.9
ipyparallel v.5.2.0
numpy v.1.11.2


### The cluster
This notebook is connected to 80 cores on 5 nodes (16 cores each) of the Farnam HPC cluster at Yale. SSH tunneling was set up following [this tutorial](http://ipyrad.readthedocs.io/HPC_Tunnel.html). Below I use the ipyparallel Python module to show explicitly which host nodes we are connected to, and therefore which nodes *ipyrad* will use.

In [11]:
## open direct and load-balanced views to the client
ipyclient = ipp.Client()
lbview = ipyclient.load_balanced_view()
print "{} total cores".format(len(ipyclient.ids))

## confirm we are connected to 5 16-core nodes
hosts = ipyclient[:].apply_sync(socket.gethostname)

## get an engine id from each host to send threaded jobs to
threaded = {host:[] for host in set(hosts)}
for hid, host in enumerate(hosts):
    threaded[host].append(hid)
    
## print threaded setup, and save as threaded-views
tview = {}
idx = 0
for host, ids in threaded.items():
    print host, ids
    ## threaded-views
    tview[idx] = ipyclient.load_balanced_view(targets=ids)
    idx += 1

80 total cores
c13n02.farnam.hpc.yale.internal [63, 64, 65, 66, 67, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79]
c13n05.farnam.hpc.yale.internal [16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]
c13n04.farnam.hpc.yale.internal [40, 47, 48, 50, 51, 52, 53, 55, 56, 57, 58, 59, 60, 61, 62, 68]
c13n01.farnam.hpc.yale.internal [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
c13n03.farnam.hpc.yale.internal [32, 33, 34, 35, 36, 37, 38, 39, 41, 42, 43, 44, 45, 46, 49, 54]
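
The grouping step above pairs each hostname with its position in the list, which matches the engine id only when the ids run contiguously from 0. A minimal sketch of a variant that pairs hostnames with explicit engine ids is shown below. The function name `group_engines_by_host` and the hostnames and ids are hypothetical stand-ins; in the notebook the inputs would presumably be `ipyclient.ids` and the result of `ipyclient[:].apply_sync(socket.gethostname)`.

```python
from collections import defaultdict


def group_engines_by_host(engine_ids, hosts):
    """Map each hostname to the sorted list of engine ids running on it."""
    grouped = defaultdict(list)
    for engine_id, host in zip(engine_ids, hosts):
        grouped[host].append(engine_id)
    return {host: sorted(ids) for host, ids in grouped.items()}


# hypothetical engine ids (note the gaps) and hostnames, for illustration only
engine_ids = [0, 1, 2, 4, 7]
hosts = ["nodeA", "nodeA", "nodeB", "nodeA", "nodeB"]

threaded = group_engines_by_host(engine_ids, hosts)
for host in sorted(threaded):
    print("{} {}".format(host, threaded[host]))
```

Each list of ids could then be passed as `targets` to `ipyclient.load_balanced_view`, as in the cell above, so that a multi-threaded job stays on a single node.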


### Paths to working directories
This notebook is run from a local directory (NBDIR) on the HPC cluster, while the scratch directory (WORK) contains the large raw data and assembly files. In this notebook all analyses are run, and their results written, in the notebook directory (NBDIR).

In [12]:
## create a new directory in HPC scratch dir
WORK = "/ysm-gpfs/scratch60/de243/WB-PED"
if not os.path.exists(WORK):
    os.mkdir(WORK)

## the current dir (./) in which this notebook resides
NBDIR = os.path.realpath(os.curdir)

## print both
print "working directory (WORK) = {}".format(WORK)
print "current directory (NBDIR) = {}".format(NBDIR)

working directory (WORK) = /ysm-gpfs/scratch60/de243/WB-PED
current directory (NBDIR) = /ysm-gpfs/home/de243/pedicularis-WB-GBS
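
The check-then-create pattern above is repeated later for the RAxML directory, and `os.mkdir` fails if a parent directory is missing. A small sketch of a reusable helper follows; `ensure_dir` is a hypothetical name, and the demonstration uses a temporary directory rather than the real scratch path.

```python
import os
import tempfile


def ensure_dir(path):
    """Create path (and any missing parents) if needed; return its real path."""
    path = os.path.realpath(path)
    if not os.path.isdir(path):
        os.makedirs(path)
    return path


# hypothetical scratch location under a temporary directory
base = tempfile.mkdtemp()
work = ensure_dir(os.path.join(base, "scratch", "WB-PED"))
print("working directory (WORK) = {}".format(work))
```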


### A dictionary to map accession IDs to taxon names + IDs

In [13]:
NAMES = {"d33291": "P. oxycarpa 33291", 
         "d41389": "P. cranolopha 41389", 
         "d41237": "P. cranolopha 41237", 
         "d40328": "P. bidentata 40328",
         "d39531": "P. cranolopha 39531",
         "d31733": "P. latituba 31733",
         "d39187": "P. souliei 39187", 
         "d39103": "P. decorissima 39103", 
         "d39253": "P. decorissima 39253",
         "decor21": "P. decorissima XX-DE21", 
         "d34041": "P. decorissima 34041",
         "d39114": "P. armata var. trimaculata 39114", 
         "d39404": "P. armata var. trimaculata 39404", 
         "d39968": "P. davidii 39968", 
         "d35422": "P. longiflora 35422", 
         "d41058": "P. longiflora var. tubiformis 41058", 
         "d39104": "P. longiflora var. tubiformis 39104", 
         "d19long1": "P. longiflora XX-DE19", 
         "d30695": "P. siphonantha 30695", 
         "d41732": "P. siphonantha 41732", 
         "d35178": "P. siphonantha 35178", 
         "d35371": "P. siphonantha 35371", 
         "d35320": "P. cephalantha 35320", 
         "d30181": "P. fletcheri 30181"
        }
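
One likely use of such a dictionary is relabeling tree tips for display. The sketch below assumes the trees are plain Newick strings with unquoted tip labels. The function `relabel_newick`, the example tree, and the unmapped label `d99999` are all hypothetical, and only two entries of the dictionary are copied here.

```python
import re


def relabel_newick(newick, names):
    """Replace each tip label found in `names` with its taxon name.

    Spaces become underscores because unquoted Newick labels
    cannot contain whitespace. Labels missing from `names` are kept.
    """
    def swap(match):
        label = match.group(0)
        return names.get(label, label).replace(" ", "_")

    # a tip label is a run of letters, digits or underscores after "(" or ","
    return re.sub(r"(?<=[(,])[A-Za-z0-9_]+", swap, newick)


# two entries copied from the dictionary above
names = {"d33291": "P. oxycarpa 33291", "d41389": "P. cranolopha 41389"}

# hypothetical tree; d99999 is deliberately absent from the dictionary
tree = "((d33291:0.01,d41389:0.02):0.05,d99999:0.1);"
print(relabel_newick(tree, names))
```

Branch lengths and support values are left alone because they follow `:` or `)` rather than `(` or `,`.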

### Run raxml on the supermatrix alignments (.phy files)

In [14]:
## make raxml dir
RAXDIR = os.path.join(os.curdir, "analysis_raxml")
RAXDIR = os.path.realpath(RAXDIR)
if not os.path.exists(RAXDIR):
    os.mkdir(RAXDIR)
    
## build the outgroup string from sample names in the assembly object
min4 = ip.load_json(os.path.join(WORK, "c85d5f2h5/min4_c85d5f2h5.json"))
OUT = ",".join([i for i in min4.samples.keys() if i[0] == "d"])

## run raxml in the background
cmd4 = ["/home2/de243/miniconda2/bin/raxmlHPC-PTHREADS", 
        "-f", "a", 
        "-m", "GTRGAMMA", 
        "-N", "100", 
        "-T", "16", 
        "-x", "12345", 
        "-p", "54321",
        "-o", OUT, 
        "-w", RAXDIR, 
        "-n", "min4_tree",
        "-s", os.path.join(NBDIR, "min4_c85d5f2h5_outfiles/min4_c85d5f2h5.phy")]
        
cmd10 = ["/home2/de243/miniconda2/bin/raxmlHPC-PTHREADS", 
        "-f", "a", 
        "-m", "GTRGAMMA", 
        "-N", "100", 
        "-T", "16", 
        "-x", "12345", 
        "-p", "54321",
        "-o", OUT, 
        "-w", RAXDIR, 
        "-n", "min10_tree",
        "-s", os.path.join(NBDIR, "min10_c85d5f2h5_outfiles/min10_c85d5f2h5.phy")]
        
## Send jobs to different hosts
asyncs = {}
asyncs["min4"] = tview[0].apply(sps.check_output, cmd4)
asyncs["min10"] = tview[1].apply(sps.check_output, cmd10)

  loading Assembly: min4_c85d5f2h5
  from saved path: /ysm-gpfs/scratch60/de243/WB-PED/c85d5f2h5/min4_c85d5f2h5.json
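
The two command lists above differ only in the run name and the alignment file. As a sketch, they could be generated by one helper so the shared options are written once. The function `raxml_cmd` is a hypothetical name, the flags and paths are copied from the cell above, and `OUT` is a placeholder here because the real outgroup string comes from the assembly object.

```python
import os


def raxml_cmd(binary, phy, workdir, name, outgroup, threads=16, nboots=100):
    """Return the argument list for a rapid-bootstrap + ML search (-f a)."""
    return [binary,
            "-f", "a",
            "-m", "GTRGAMMA",
            "-N", str(nboots),
            "-T", str(threads),
            "-x", "12345",
            "-p", "54321",
            "-o", outgroup,
            "-w", workdir,
            "-n", name,
            "-s", phy]


# paths copied from this notebook; OUT is a hypothetical placeholder
RAXML = "/home2/de243/miniconda2/bin/raxmlHPC-PTHREADS"
NBDIR = "/ysm-gpfs/home/de243/pedicularis-WB-GBS"
RAXDIR = os.path.join(NBDIR, "analysis_raxml")
OUT = "sampleA,sampleB"

cmds = {}
for prefix in ["min4", "min10"]:
    assembly = "{}_c85d5f2h5".format(prefix)
    phy = os.path.join(NBDIR, assembly + "_outfiles", assembly + ".phy")
    cmds[prefix] = raxml_cmd(RAXML, phy, RAXDIR, prefix + "_tree", OUT)

print(" ".join(cmds["min4"]))
```

Each list in `cmds` can be submitted exactly as before, for example with `tview[0].apply(sps.check_output, cmds["min4"])`.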


In [28]:
## Check whether jobs have finished
## (`async` is a reserved word in Python 3.7+, so use another name)
for job, rasync in asyncs.items():
    if rasync.ready():
        if rasync.successful():
            print "job: [{}] finished.".format(job)
            print rasync.result()
        else:
            print rasync.exception()
    else:
        print "job: [{}]\t Elapsed: {:.0f}s".format(job, rasync.elapsed)

job: [min10]	 Elapsed: 5365s
job: [min4]	 Elapsed: 5365s
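
Besides polling the job objects, a finished run can be confirmed by checking for its result files. The sketch below lists the files that RAxML, to my knowledge, writes for a `-f a` run, each named `<prefix>.<run name>` inside the `-w` directory. The helper names are hypothetical, and the demonstration uses a temporary directory with empty stand-in files rather than real results.

```python
import os
import tempfile

# file prefixes expected from a "-f a" run; the run name follows the dot
RAXML_PREFIXES = [
    "RAxML_bestTree",
    "RAxML_bipartitions",
    "RAxML_bipartitionsBranchLabels",
    "RAxML_bootstrap",
    "RAxML_info",
]


def raxml_outputs(workdir, name):
    """Return {prefix: expected path} for a RAxML '-f a' run named `name`."""
    return {prefix: os.path.join(workdir, "{}.{}".format(prefix, name))
            for prefix in RAXML_PREFIXES}


def missing_outputs(workdir, name):
    """List the expected output files that do not exist yet."""
    paths = raxml_outputs(workdir, name).values()
    return sorted(path for path in paths if not os.path.exists(path))


# demonstrate with a temporary directory and two empty stand-in files
demo_dir = tempfile.mkdtemp()
for prefix in ["RAxML_info", "RAxML_bootstrap"]:
    open(os.path.join(demo_dir, prefix + ".min4_tree"), "w").close()

for path in missing_outputs(demo_dir, "min4_tree"):
    print("missing: {}".format(os.path.basename(path)))
```

In this notebook the calls would be `missing_outputs(RAXDIR, "min4_tree")` and `missing_outputs(RAXDIR, "min10_tree")`. An empty list suggests the run has written all of its expected files.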
