# Assembly and analysis of *Pedicularis* PE-GBS data set

A library for 48 samples was prepared following the protocol described in Escudero et al. (2013) with the PstI restriction enzyme, followed by PCR amplification of primer-ligated fragments. The library prep lacked a size-selection step, which we discuss in the methods below. The library was sequenced on one lane of an Illumina HiSeq 2000, yielding xxx reads.

### This notebook
This notebook provides a fully reproducible workflow to assemble and analyze the Yu-Eaton-Ree 2012 *Pedicularis* GBS data set, and to save the results to the GitHub repo that holds this notebook ([see git repo here](https://github.com/dereneaton/pedicularis-WB-GBS)). Starting from the raw data files, we assemble the data de novo in *ipyrad*: demultiplexing and filtering reads, then clustering within and between samples to identify homology, followed by final filtering and formatting to create output files. Analyses of the resulting files are shown in separate notebooks, also available in the [git repo](https://github.com/dereneaton/pedicularis-WB-GBS).

In [109]:
## show that this dir is a git repo (has .git file mapping to the address shown)
## this allows me to push updates to this notebook directly to github, 
## to easily share its contents with others.
! git config --get remote.origin.url

https://github.com/dereneaton/pedicularis-WB-GBS.git


### Import ipyrad and other common modules

In [110]:
## all necessary software can be installed by uncommenting the command below
# conda install -c ipyrad ipyrad -y

## import basic modules and ipyrad and print version
import os
import socket
import glob
import subprocess as sps
import numpy as np
import ipyparallel as ipp
import ipyrad as ip

print "ipyrad v.{}".format(ip.__version__)
print "ipyparallel v.{}".format(ipp.__version__)
print "numpy v.{}".format(np.__version__)

ipyrad v.0.4.4
ipyparallel v.5.0.1
numpy v.1.11.0


### The cluster
This notebook is connected to 32 cores on 4 nodes of the Louise HPC cluster at Yale. SSH Tunneling was set up following [this tutorial](http://ipyrad.readthedocs.io/HPC_Tunnel.html) to launch an *ipcluster* instance. Below I use the ipyparallel Python module to show that we are connected to all cores. 

In [113]:
## open a view to the client
ipyclient = ipp.Client()

## confirm we are connected to all 32 cores (count engines per host)
hosts = ipyclient[:].apply_sync(socket.gethostname)
for hostname in set(hosts):
    print("host compute node: [{} cores] on {}"\
          .format(hosts.count(hostname), hostname))

host compute node: [8 cores] on compute-24-14.local
host compute node: [16 cores] on compute-22-10.local
host compute node: [8 cores] on compute-20-15.local
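
The per-host tally above can also be written with `collections.Counter`. This is a minimal sketch that uses a hypothetical list of hostnames (`node-a`, `node-b`, `node-c`) in place of the live `ipyclient` call, so it runs without a cluster:

```python
from collections import Counter

# hypothetical stand-in for ipyclient[:].apply_sync(socket.gethostname):
# one hostname string per connected engine
hosts = ["node-a"] * 8 + ["node-b"] * 16 + ["node-c"] * 8

# count engines per host in a single pass
cores_per_host = Counter(hosts)
for hostname, ncores in sorted(cores_per_host.items()):
    print("host compute node: [{} cores] on {}".format(ncores, hostname))

# the total should equal the number of engines requested
total_cores = sum(cores_per_host.values())
print("total engines: {}".format(total_cores))
```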


### Set up a working directory

In [116]:
## create a new working directory in HPC scratch dir
WORK = "/fastscratch/de243/WB-PED"
if not os.path.exists(WORK):
    os.mkdir(WORK)

## the current dir (./) in which this notebook resides
NBDIR = os.path.realpath(os.curdir)

## print it
print "working directory (WORK) = {}".format(WORK)
print "current directory (NBDIR) = {}".format(NBDIR)

working directory (WORK) = /fastscratch/de243/WB-PED
current directory (NBDIR) = /home2/de243/pedicularis-WB-GBS


### The raw data
The raw R1 and R2 data are each split into 59 gzipped files approximately 300MB in size. The barcodes file maps sample names to barcodes that are contained inline in the R1 sequences, and are 4-8bp in length. The barcodes are printed a little further below. 

In [117]:
## Locations of the raw data stored temporarily on Yale's Louise HPC cluster
## Data are also stored more permanently on local computer tinus at Yale
RAWREADS = "/fastscratch/de243/TMP_RAWS/*.fastq.gz"
BARCODES = "/fastscratch/de243/TMP_RAWS/WB-PED_barcodes.txt"

### Fastqc quality check

I ran *FastQC* on the raw data files as a quality check; the results are available here: [fastqc_dir](https://github.com/dereneaton/pedicularis-WB-GBS/blob/master/fastqc). Overall, quality scores were neither terrible nor great. Our biggest problem was very high adapter contamination, which I will filter out using *cutadapt*, as implemented in step 2 of *ipyrad* and discussed further below.
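
Adapter contamination of this kind typically arises when a fragment is shorter than the read length, so the read runs past the insert and into the adapter; the missing size-selection step makes this likely here. The toy function below illustrates the idea of 3' adapter trimming with exact matching only. It is not cutadapt's algorithm (which also tolerates mismatches), and the adapter prefix used is an assumption, since the adapter sequence is not stated in this notebook:

```python
# assumed adapter prefix (common Illumina adapter start); not taken from this data set
ADAPTER = "AGATCGGAAGAGC"

def trim_adapter(read, adapter=ADAPTER, min_overlap=3):
    """Remove an exact adapter match, or a partial adapter at the 3' end."""
    # case 1: the full adapter occurs somewhere inside the read
    idx = read.find(adapter)
    if idx != -1:
        return read[:idx]
    # case 2: the read ends partway into the adapter
    for k in range(len(adapter) - 1, min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    # no adapter detected
    return read

print(trim_adapter("TGCAGACGT" + ADAPTER + "AAA"))
print(trim_adapter("TGCAGACGTAGATC"))
print(trim_adapter("TGCAGACGT"))
```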

In [104]:
## uncomment this to install fastqc with conda
#conda install -c bioconda fastqc -q 

## create a tmp directory for fastqc outfiles (./tmp_fastqc)
QUALDIR = os.path.join(NBDIR, "tmp_fastqc")
if not os.path.exists(QUALDIR):
    os.mkdir(QUALDIR)
    
## run fastqc on all raw data files and write outputs to fastqc tmpdir.
## This is parallelized by load-balancing with ipyclient
lbview = ipyclient.load_balanced_view()
for rawfile in glob.glob(RAWREADS):
    cmd = ['fastqc', rawfile, '--outdir', QUALDIR, '-t', '1', '-q']
    lbview.apply_async(sps.check_output, cmd)
    
## block until finished and print progress
ipyclient.wait_interactive()

## I then pushed the tmp_fastqc dir to the git repo
## and then deleted the tmpdir
! rm -r $QUALDIR

 118/118 tasks finished after  153 s
done


### Create demultiplexed files for each Sample in *ipyrad*
We set the locations of the raw data and the barcodes file for the demultiplexing object, and set the max barcode mismatch parameter to zero (strict), so that no mismatches are allowed.
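
To make the strict setting concrete, here is a toy sketch of zero-mismatch demultiplexing with inline barcodes. It is an illustration only, not ipyrad's implementation: the sample names and barcodes are hypothetical, and only the `TGCAG` overhang comes from this notebook. A read is assigned to a sample only if it begins with that sample's barcode followed directly by the cut-site overhang:

```python
CUT_SITE = "TGCAG"  # PstI overhang, as in the restriction_overhang parameter

# hypothetical sample names and barcodes (4-8 bp), for illustration only
TOY_BARCODES = {"sampleA": "ACGT", "sampleB": "TTGACA", "sampleC": "GGCATCAA"}

def assign_sample(read, barcodes=TOY_BARCODES, cut_site=CUT_SITE):
    """Return (sample, read without barcode) on an exact match, else (None, read)."""
    for name, barcode in barcodes.items():
        # exact match required: barcode immediately followed by the cut site
        if read.startswith(barcode + cut_site):
            return name, read[len(barcode):]
    return None, read

print(assign_sample("ACGTTGCAGAAACCC"))   # exact barcode match
print(assign_sample("ACGATGCAGAAACCC"))   # one barcode mismatch: unassigned
```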

In [118]:
## create an Assembly object to demultiplex the lane
demux = ip.Assembly("WB-PED_demux")

## set basic demultiplexing parameters for the object
demux.set_params("project_dir", os.path.join(WORK, "demux_reads"))
demux.set_params("raw_fastq_path", RAWREADS)
demux.set_params("barcodes_path", BARCODES)
demux.set_params("max_barcode_mismatch", 0)
demux.set_params("datatype", "pairgbs")
demux.set_params("restriction_overhang", ("TGCAG", "TGCAG"))

  New Assembly: WB-PED_demux


In [7]:
demux.run("1", force=True)


  Assembly: WB-PED_demux
  [####################] 100%  chunking large files  | 0:01:46 | s1 | 
  [####################] 100%  sorting reads         | 0:00:40 | s1 | 
  [####################] 100%  writing/compressing   | 0:02:00 | s1 | 


### Some quick summary stats, we'll look more in depth later

In [8]:
! cat $demux.stats_files.s1

raw_file                               total_reads    cut_found  bar_matched
lane2_NoIndex_L002_R1_001.fastq            4000000      3998858      3761574
lane2_NoIndex_L002_R2_001.fastq            4000000      3998858      3761574
lane2_NoIndex_L002_R1_002.fastq            4000000      3998854      3745260
lane2_NoIndex_L002_R2_002.fastq            4000000      3998854      3745260
lane2_NoIndex_L002_R1_003.fastq            4000000      3998872      3739970
lane2_NoIndex_L002_R2_003.fastq            4000000      3998872      3739970

sample_name                            total_reads
L20090356                                   210432
L2DZ0984                                    236276
L2DZ0988                                    251031
L2DZ0989                                    218117
L2DZ0990                                    229001
L2DZ1006                                    273899
L2DZ1007                                    605837
L2DZ1011                            
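
As a quick check on the demultiplexing output, the fraction of reads with a matched barcode can be computed from the raw-file rows printed above. This is a small sketch with the values copied by hand from the three R1 rows shown; the remaining rows of the stats file are not included:

```python
# (total_reads, bar_matched) copied from the R1 rows printed above
rows = {
    "lane2_NoIndex_L002_R1_001.fastq": (4000000, 3761574),
    "lane2_NoIndex_L002_R1_002.fastq": (4000000, 3745260),
    "lane2_NoIndex_L002_R1_003.fastq": (4000000, 3739970),
}

# fraction of reads in each file whose barcode matched a sample
match_rate = {
    name: matched / float(total)
    for name, (total, matched) in rows.items()
}

for name in sorted(match_rate):
    print("{}  {:.2%}".format(name, match_rate[name]))
```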

### Start an assembly of the data set
I usually prefer to start my analyses from data that are already demultiplexed, since the demultiplexed fastq files are typically what is available to others once a study is published and the data are archived online. So here we create a new Assembly object and read in the demultiplexed data. We then run step 2 to filter the data and take a close look at the results.

In [9]:
## this will be our assembly object for steps 1-6
data = ip.Assembly("c85d5f2h5")

## demux data location
DEMUX = os.path.join(demux.dirs.fastqs, "*gz")

## set parameters for this assembly and print them 
data.set_params("project_dir", os.path.join(WORK, data.name))
data.set_params("sorted_fastq_path", DEMUX)
data.set_params("barcodes_path", BARCODES)
data.set_params("filter_adapters", 2)
data.set_params("datatype", "pairgbs")
data.set_params("restriction_overhang", ("TGCAG", "TGCAG"))
data.set_params("max_Hs_consens", (5, 5))
data.set_params("trim_overhang", (0, 5, 5, 0))
data.get_params()

  New Assembly: c85d5f2h5
  0   assembly_name               c85d5f2h5                                    
  1   project_dir                 /fastscratch/de243/WB-PED/c85d5f2h5          
  2   raw_fastq_path                                                           
  3   barcodes_path               /fastscratch/de243/TMP_RAWS/WB-PED_barcodes.txt
  4   sorted_fastq_path           /fastscratch/de243/WB-PED/demux_reads/WB-PED_demux_fastqs/*gz
  5   assembly_method             denovo                                       
  6   reference_sequence                                                       
  7   datatype                    pairgbs                                      
  8   restriction_overhang        ('TGCAG', 'TGCAG')                           
  9   max_low_qual_bases          5                                            
  10  phred_Qscore_offset         33                                           
  11  mindepth_statistical        6                                         

In [10]:
data.run("12", force=True)


  Assembly: c85d5f2h5
  [####################] 100%  loading reads         | 0:00:01 | s1 | 
  [####################] 100%  processing reads      | 0:01:12 | s2 | 


In [11]:
print data.stats_dfs.s2

            reads_raw  trim_adapter_bp_read1  trim_adapter_bp_read2  \
L20090356      210432                  95926                 101524   
L2DZ0984       236276                  99895                 107534   
L2DZ0988       251031                 115927                 122157   
L2DZ0989       218117                  96324                 102621   
L2DZ0990       229001                 101439                 108863   
L2DZ1006       273899                 120394                 130348   
L2DZ1007       605837                 249392                 262101   
L2DZ1011       338422                 187917                 201909   
L2DZ1016       180016                  86163                  92733   
L2DZ1019       178359                  98059                 105969   
L2DZ1027       244863                 119740                 131395   
L2DZ1060       874698                 426618                 460594   
L2DZ1070       251629                 123686                 134965   
L2DZ11

In [12]:
data.run("3")


  Assembly: c85d5f2h5
  [####################] 100%  dereplicating         | 0:00:00 | s3 | 
  [####################] 100%  clustering            | 0:01:58 | s3 | 
  [####################] 100%  building clusters     | 0:00:14 | s3 | 
  [####################] 100%  chunking              | 0:00:01 | s3 | 
  [####################] 100%  aligning              | 0:06:22 | s3 | 
  [####################] 100%  concatenating         | 0:00:12 | s3 | 


In [6]:
data = ip.load_json("/fastscratch/de243/WB-PED/c85d5f2h5/c85d5f2h5.json")

  loading Assembly: c85d5f2h5
  from saved path: /fastscratch/de243/WB-PED/c85d5f2h5/c85d5f2h5.json


In [7]:
data.run("456", force=True)


  Assembly: c85d5f2h5
  [####################] 100%  inferring [H, E]      | 0:07:00 | s4 | 
  [####################] 100%  consensus calling     | 0:01:47 | s5 | 
  [####################] 100%  concat/shuffle input  | 0:00:02 | s6 | 
  [####################] 100%  clustering across     | 0:00:21 | s6 | 
  [####################] 100%  building clusters     | 0:00:02 | s6 | 
  [####################] 100%  aligning clusters     | 0:00:14 | s6 | 
  [####################] 100%  database indels       | 0:00:20 | s6 | 
  [####################] 100%  indexing clusters     | 0:00:18 | s6 | 
  [####################] 100%  building database     | 0:01:11 | s6 | 


In [15]:
## create named branches for final assemblies
min4 = data.branch("min4_c85d5f2h5")
min4.set_params("min_samples_locus", 4)

min10 = data.branch("min10_c85d5f2h5")
min10.set_params("min_samples_locus", 10)

## assemble outfiles
min4.run("7", force=True)
min10.run("7", force=True)


  Assembly: min4_c85d5f2h5
  [####################] 100%  filtering loci        | 0:00:04 | s7 | 
  [####################] 100%  building loci/stats   | 0:00:02 | s7 | 
  [####################] 100%  building vcf file     | 0:00:06 | s7 | 
  [####################] 100%  writing vcf file      | 0:00:00 | s7 | 
  [####################] 100%  writing outfiles      | 0:00:02 | s7 | 
  Outfiles written to: /fastscratch/de243/WB-PED/c85d5f2h5/min4_c85d5f2h5_outfiles

  Assembly: min10_c85d5f2h5
  [####################] 100%  filtering loci        | 0:00:04 | s7 | 
  [####################] 100%  building loci/stats   | 0:00:02 | s7 | 
  [####################] 100%  building vcf file     | 0:00:04 | s7 | 
  [####################] 100%  writing vcf file      | 0:00:00 | s7 | 
  [####################] 100%  writing outfiles      | 0:00:01 | s7 | 
  Outfiles written to: /fastscratch/de243/WB-PED/c85d5f2h5/min10_c85d5f2h5_outfiles
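
The two branches differ only in `min_samples_locus`, the minimum number of samples that must have data at a locus for it to be retained. A minimal sketch of that filter on hypothetical loci (the locus IDs and sample names are made up for illustration; this is not ipyrad's code):

```python
# hypothetical loci: locus id -> names of samples with data at that locus
toy_loci = {
    0: ["s1", "s2", "s3", "s4"],
    1: ["s1", "s2"],
    2: ["s{}".format(i) for i in range(1, 12)],  # 11 samples
}

def filter_min_samples(loci, min_samples):
    """Keep only loci with data for at least `min_samples` samples."""
    return {lid: names for lid, names in loci.items()
            if len(names) >= min_samples}

kept_min4 = filter_min_samples(toy_loci, 4)
kept_min10 = filter_min_samples(toy_loci, 10)
print("min4 keeps loci:  {}".format(sorted(kept_min4)))
print("min10 keeps loci: {}".format(sorted(kept_min10)))
```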


In [16]:
!cat $min4.stats_files.s7



## The number of loci caught by each filter.
## ipyrad API location: [assembly].statsfiles.s7_filters

                            total_filters  applied_order  retained_loci
total_prefiltered_loci              18895              0          18895
filtered_by_rm_duplicates             819            819          18076
filtered_by_max_indels                546            469          17607
filtered_by_max_snps                  957            649          16958
filtered_by_max_shared_het             12              4          16954
filtered_by_min_sample              11161          10822           6132
filtered_by_max_alleles              4737           2266           3866
total_filtered_loci                  3866              0           3866


## The number of loci recovered for each Sample.
## ipyrad API location: [assembly].stats_dfs.s7_samples

            sample_coverage
L20090356               845
L2DZ0984                871
L2DZ0988               1069
L2DZ
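
One way to read the filter table, inferred from the numbers themselves rather than from the ipyrad documentation: `total_filters` counts every locus a filter would catch on its own, `applied_order` counts the loci newly removed when the filters are applied in sequence, and `retained_loci` is the running total that remains. The sketch below checks that reading against the values printed above:

```python
# (filter name, applied_order) copied from the table above, in order
applied_order = [
    ("rm_duplicates", 819),
    ("max_indels", 469),
    ("max_snps", 649),
    ("max_shared_het", 4),
    ("min_sample", 10822),
    ("max_alleles", 2266),
]
# retained_loci column from the same table
reported = [18076, 17607, 16958, 16954, 6132, 3866]

retained = 18895  # total_prefiltered_loci
running = []
for name, removed in applied_order:
    # each filter removes only loci not already removed by an earlier one
    retained -= removed
    running.append(retained)
    print("{:<16} removed {:>6}  retained {:>6}".format(name, removed, retained))

print("matches reported column: {}".format(running == reported))
```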

In [17]:
%%bash
cut -c 1-80 /fastscratch/de243/WB-PED/c85d5f2h5/min4_c85d5f2h5_outfiles/min4_c85d5f2h5.loci | head -n 100

L20090356      TTGTTTATTATAACTATATACAGTGGATTGTCAAATAATCAAGTGAGCAATTGATGCAGTCT
L2DZ0984       TTGTTTATTATAACTATATACAGTGGATTGTCAAATAATCAAGTGAGCAATTGATGCAGTCA
L2DZ0989       TTGTTTATTATAACTATATACCGTGGATTGTCAAATAATCAAGTGAGCAATTGATGCAGTCA
LHW10071       TTGTTTATTATAACTATATACAGTGGATTGTCAAATAATCAAGTGAGCAATTGATGCAGTCA
//                                  -                                       -|0|
L20090356      TGGACGGCCTGGTCAAGGCCTACGGGGCCCGGCGCGT
L2DZ0984       TGGANGGNCTNGNNAAGGCCTACGGGGCNCNRCGCGT
L2DZ0990       TGGAGGGTCTCGCGAAGGCCTACGGGGCACGGCGCGT
L2DZ1016       TGGACGGCCTGGCCAAGGCCTACGGGGCCCGGCGCGT
L2DZ1243       TGGACGGCCTCGCGAAGGCCTACGGGGCACGGCGGGT
//                 -  -  * -*              *  -  -  |1|
L20090356      ATTTTGTTTCTCATAGTTAGTTGCGATTTATTTATTATAATTGTTTCGTCACATAAGTATTGAGT
L2DZ0988       ATTTTGTTTCTCATAGTTAGTTGCGAT-------TTATAATTGTTTCTTCACATAAGTATTGAGT
L2DZ0989       ATTTTGTTTCTCATAGTTAGTTGCGATTTATTTATTATAATTGTTTCGTCACATAAGTATTGAGT
L2DZ0990       ATTTTNNTTCTCATAGTTAGTTGNGAT-

cut: write error: Broken pipe
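
In the `.loci` output above, each locus is a block of aligned sample lines closed by a line beginning with `//`, which ends with the locus index between `|` characters. As I understand the format, `-` on that line marks a variable site and `*` a parsimony-informative site. Here is a small sketch of a parser that counts samples and marked sites per locus; the sample names and sequences are toy data, not taken from the real file:

```python
# toy data in the same layout as the .loci file (names and sequences made up)
TOY_LOCI = "\n".join([
    "sampleA      TTGTACAGT",
    "sampleB      TTGTACAGA",
    "sampleC      TTGTCCAGA",
    "sampleD      TTGTCCAGA",
    "//               *   -|0|",
    "sampleA      TGGACGG",
    "sampleC      TGGAGGG",
    "//               -  |1|",
])

def parse_loci(text):
    """Return one dict per locus: sample names and counts of marked sites."""
    loci, names = [], []
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("//"):
            # site markers sit between the leading '//' and the '|index|' suffix
            marks = line[2:].split("|")[0]
            loci.append({"samples": names,
                         "variable": marks.count("-"),
                         "informative": marks.count("*")})
            names = []
        else:
            names.append(line.split()[0])
    return loci

parsed = parse_loci(TOY_LOCI)
for i, locus in enumerate(parsed):
    print(i, len(locus["samples"]), locus["variable"], locus["informative"])
```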


## Analysis methods

In [None]:
%%bash 
#-s $min4.outfiles.

tetrad -s /fastscratch/de243/WB-PED/c85d5f2h5/min4_c85d5f2h5_outfiles/min4_c85d5f2h5.snps.phy \
       -l /fastscratch/de243/WB-PED/c85d5f2h5/min4_c85d5f2h5_outfiles/min4_c85d5f2h5.snps.map \
       -b 100 -n min4_c85d5f2h5 -c 32 --MPI -f


In [30]:
min4 = ip.load_json("/fastscratch/de243/WB-PED/c85d5f2h5/min4_c85d5f2h5.json")
min4.outfiles.loci[:-5]+'.phy'

  loading Assembly: min4_c85d5f2h5
  from saved path: /fastscratch/de243/WB-PED/c85d5f2h5/min4_c85d5f2h5.json


'/fastscratch/de243/WB-PED/c85d5f2h5/min4_c85d5f2h5_outfiles/min4_c85d5f2h5.phy'
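
Slicing off the last five characters works because the path ends in `.loci`. An alternative that does not depend on the extension length is `os.path.splitext`, sketched here on the `.loci` path used earlier in this notebook:

```python
import os

# path to the .loci outfile, as used in the bash cell above
loci_path = ("/fastscratch/de243/WB-PED/c85d5f2h5/"
             "min4_c85d5f2h5_outfiles/min4_c85d5f2h5.loci")

# split off the extension, whatever its length, and swap in '.phy'
phy_path = os.path.splitext(loci_path)[0] + ".phy"
print(phy_path)
```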

In [89]:
## make raxml dir
raxdir = os.path.join(os.curdir, "analysis_raxml")
raxdir = os.path.realpath(raxdir)
if not os.path.exists(raxdir):
    os.mkdir(raxdir)
    
## get outgroup string
OUT = ",".join([i for i in min4.samples.keys() if i[0] == "d"])

## run raxml in the background
cmd = ["/home2/de243/miniconda2/bin/raxmlHPC-PTHREADS", 
        "-f", "a", 
        "-m", "GTRGAMMA", 
        "-N", "100", 
        "-T", "8", 
        "-x", "12345", 
        "-p", "54321",
        "-o", OUT, 
        "-w", raxdir, 
        "-n", "min4_tree",
        "-s", min4.outfiles.loci[:-5]+'.phy']

## start process running in background
proc = sps.Popen(cmd, stderr=sps.PIPE, stdout=sps.PIPE)

In [107]:
## ask if it's still running in background
## (poll() returns None while running, and the exit code once finished)
if proc.poll() is not None:
    print proc.returncode
else:
    tail = ! tail $raxdir/*info*
    print "still running: \n", tail[-1]

still running: 
Bootstrap[2]: Time 38.238277 seconds, bootstrap likelihood -661693.484348, best rearrangement setting 12
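
When checking on a background process, note that `Popen.poll()` returns `None` while the child is still running and its exit code once it has finished. Because a successful exit code is `0`, comparing against `None` is what separates "finished" from "still running". A self-contained sketch, with a short-lived Python child process standing in for the long RAxML run:

```python
import subprocess
import sys

# a child process that sleeps briefly stands in for the long RAxML job
child = subprocess.Popen(
    [sys.executable, "-c", "import time; time.sleep(0.5)"]
)

# polled immediately: the child is still sleeping, so poll() gives None
status_while_running = child.poll()
print("while running: {}".format(status_while_running))

# block until the child exits, then poll again to get the exit code
child.wait()
status_when_done = child.poll()
print("when finished: {}".format(status_when_done))

# 0 is falsy, so test against None rather than relying on truthiness
finished = status_when_done is not None
print("finished: {}".format(finished))
```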
