# Assembly and analysis of *Pedicularis* PE-GBS data set

A library for 48 samples was prepared following the protocol described in Escudero et al. (2013) using the PstI restriction enzyme, followed by PCR amplification of primer-ligated fragments. The library prep lacked a size-selection step, which we discuss in the methods below. The library was sequenced on one lane of an Illumina HiSeq 2000, yielding xxx reads.

### This notebook
This notebook provides a fully reproducible workflow to assemble and analyze the Yu-Eaton-Ree (2012) *Pedicularis* GBS data set. The notebook and its results files are stored in [this GitHub repo](https://github.com/dereneaton/pedicularis-WB-GBS). Starting from the raw data files, we assemble the data de novo in *ipyrad*: we demultiplex, filter, and cluster reads within Samples, then cluster consensus reads between Samples to identify homology, and finally filter and format the results to create output files. Analyses of the resulting files are shown in separate notebooks, also available in the [git repo](https://github.com/dereneaton/pedicularis-WB-GBS).

In [1]:
## show that this dir is a git repo (has .git file mapping to the address shown)
## this allows me to push updates to this notebook directly to github, 
## to easily share its contents with others.
! git config --get remote.origin.url

https://github.com/dereneaton/pedicularis-WB-GBS.git


### Import ipyrad and other common modules

In [2]:
## all necessary software is installed alongside ipyrad, 
## and can be installed by uncommenting the command below
# conda install -c ipyrad ipyrad -y

## import basic modules and ipyrad and print version
import os
import socket
import glob
import subprocess as sps
import numpy as np
import ipyparallel as ipp
import ipyrad as ip

print "ipyrad v.{}".format(ip.__version__)
print "ipyparallel v.{}".format(ipp.__version__)
print "numpy v.{}".format(np.__version__)

ipyrad v.0.4.4
ipyparallel v.5.0.1
numpy v.1.11.0


### The cluster
This notebook is connected to 32 cores on 4 nodes of the Louise HPC cluster at Yale. SSH Tunneling was set up following [this tutorial](http://ipyrad.readthedocs.io/HPC_Tunnel.html) to launch an *ipcluster* instance. Below I use the ipyparallel Python module to show that we are connected to all cores. 

In [4]:
## open a view to the client
ipyclient = ipp.Client()

## confirm we are connected to 4 8-core nodes
hosts = ipyclient[:].apply_sync(socket.gethostname)
for hostname in set(hosts):
    print("host compute node: [{} cores] on {}"\
          .format(hosts.count(hostname), hostname))

host compute node: [8 cores] on compute-24-14.local
host compute node: [16 cores] on compute-22-10.local
host compute node: [8 cores] on compute-20-15.local


### Set up a working directory
This notebook is run from a local directory on the HPC cluster, while the scratch dir is used as a working directory in which all large sequence files are stored. At the end, we will transfer a few files to the local dir to save them for downstream analyses and to upload to GitHub. 

In [5]:
## create a new working directory in HPC scratch dir
WORK = "/fastscratch/de243/WB-PED"
if not os.path.exists(WORK):
    os.mkdir(WORK)

## the current dir (./) in which this notebook resides
NBDIR = os.path.realpath(os.curdir)

## print it
print "working directory (WORK) = {}".format(WORK)
print "current directory (NBDIR) = {}".format(NBDIR)

working directory (WORK) = /fastscratch/de243/WB-PED
current directory (NBDIR) = /home2/de243/pedicularis-WB-GBS


### The raw data
The raw R1 and R2 data are each split into 59 gzipped files approximately 300MB in size. The barcodes file maps sample names to barcodes that are contained inline in the R1 sequences, and are 4-8bp in length. The barcodes are printed a little further below. 

In [6]:
## Locations of the raw data stored temporarily on Yale's Louise HPC cluster
## Data are also stored more permanently on local computer tinus at Yale
RAWREADS = "/fastscratch/de243/TMP_RAWS/*.fastq.gz"
BARCODES = "/fastscratch/de243/TMP_RAWS/WB-PED_barcodes.txt"
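
Before demultiplexing, it can help to confirm that the barcodes file parses as expected and that barcode lengths fall in the 4-8bp range. The sketch below assumes a two-column, whitespace-delimited file (sample name, then barcode); the sample names and barcodes shown are hypothetical placeholders, not the real WB-PED barcodes.

```python
from collections import Counter


def read_barcodes(lines):
    """Parse 'sample<whitespace>barcode' lines into a {sample: barcode} dict."""
    barcodes = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        name, barcode = fields[0], fields[1]
        barcodes[name] = barcode.upper()
    return barcodes


# hypothetical example entries (assumption: not the real barcodes file)
example = ["sampleA\tCTCC", "sampleB\tTGCAAGGA", "", "sampleC\tACAAA"]
barcodes = read_barcodes(example)

# tally how many barcodes there are of each length
length_counts = Counter(len(b) for b in barcodes.values())
print(sorted(length_counts.items()))
```

On the real data, the same function could be called as `read_barcodes(open(BARCODES))`, provided the file follows the assumed layout.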

### Fastqc quality check

I ran the program *fastQC* on the raw data files as a quality check; the results are available here: [fastqc_dir](https://github.com/dereneaton/pedicularis-WB-GBS/tree/master/tmp_fastqc). Overall, quality scores were not terrible, but also not great. Our biggest problem was very high adapter contamination, which I will filter out using the program *cutadapt* as implemented in step 2 of *ipyrad*, discussed further below.

In [104]:
## uncomment this to install fastqc with conda
#conda install -c bioconda fastqc -q 

## create a tmp directory for fastqc outfiles (./tmp_fastqc)
QUALDIR = os.path.join(NBDIR, "tmp_fastqc")
if not os.path.exists(QUALDIR):
    os.mkdir(QUALDIR)
    
## run fastqc on all raw data files and write outputs to fastqc tmpdir.
## This is parallelized by load-balancing with ipyclient
lbview = ipyclient.load_balanced_view()
for rawfile in glob.glob(RAWREADS):
    cmd = ['fastqc', rawfile, '--outdir', QUALDIR, '-t', '1', '-q']
    lbview.apply_async(sps.check_output, cmd)
    
## block until all finished and print progress
ipyclient.wait_interactive()

 118/118 tasks finished after  153 s
done


### Create demultiplexed files for each Sample in *ipyrad*
We set the locations of the data and barcodes files for the demultiplexing object, and set the max barcode mismatch parameter to zero (strict), allowing no mismatches. 

In [7]:
## create an object to demultiplex the lane
demux = ip.Assembly("WB-PED_demux")

## set basic demultiplexing parameters for the object
demux.set_params("project_dir", os.path.join(WORK, "demux_reads"))
demux.set_params("raw_fastq_path", RAWREADS)
demux.set_params("barcodes_path", BARCODES)
demux.set_params("max_barcode_mismatch", 0)
demux.set_params("datatype", "pairgbs")
demux.set_params("restriction_overhang", ("TGCAG", "TGCAG"))

## print params
## demux.get_params()

  New Assembly: WB-PED_demux


In [7]:
demux.run("1")


  Assembly: WB-PED_demux
  [####################] 100%  chunking large files  | 0:00:00 | s1 | 
  [####################] 100%  sorting reads         | 0:18:40 | s1 | 
  [####################] 100%  writing/compressing   | 0:30:34 | s1 | 


### Look at demux results 
Showing just a few lines of the results below, you can see that a *ton* of reads did not match any barcode in the files numbered 041-050, because of Ns in the barcode region of the reads. Allowing one mismatch in the barcode sequence should recover many of these, so we reran step 1 allowing a one-bp difference. 

In [26]:
## print total
print "total reads recovered: {}\n".format(demux.stats.reads_raw.sum())

## print header, and then selected results across raw files
! head -n 1 $demux.stats_files.s1
! cat $demux.stats_files.s1 | grep 0[4-5][0-9].fastq

total reads recovered: 182369514

raw_file                               total_reads    cut_found  bar_matched
lane2_NoIndex_L002_R1_040.fastq            4000000      3998844      3288695
lane2_NoIndex_L002_R2_040.fastq            4000000      3998844      3288695
lane2_NoIndex_L002_R1_041.fastq            4000000      3999819       978178
lane2_NoIndex_L002_R2_041.fastq            4000000      3999819       978178
lane2_NoIndex_L002_R1_042.fastq            4000000      3999912       538103
lane2_NoIndex_L002_R2_042.fastq            4000000      3999912       538103
lane2_NoIndex_L002_R1_043.fastq            4000000      4000000            0
lane2_NoIndex_L002_R2_043.fastq            4000000      4000000            0
lane2_NoIndex_L002_R1_044.fastq            4000000      4000000            0
lane2_NoIndex_L002_R2_044.fastq            4000000      4000000            0
lane2_NoIndex_L002_R1_045.fastq            4000000      4000000            0
lane2_NoIndex_L002_R2_045.fastq            

In [None]:
## run with one bp mismatch
demux.set_params("max_barcode_mismatch", 1)
demux.run("1", force=True)


  Assembly: WB-PED_demux
  [####################] 100%  chunking large files  | 0:00:00 | s1 | 
  [                    ]   0%  sorting reads         | 0:00:50 | s1 | 
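
To illustrate why allowing one mismatch rescues reads with an N in the barcode region, here is a minimal sketch of mismatch-tolerant barcode matching. It is not ipyrad's implementation; the function names, barcodes, and read are hypothetical, and an N is simply counted as a mismatch.

```python
def hamming(a, b):
    """Number of differing positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def match_barcode(read, barcodes, max_mismatch=0):
    """Return the sample whose barcode best matches the start of the read.

    Returns None if no barcode is within max_mismatch differences, or if
    two barcodes tie for the best match (ambiguous assignment).
    """
    hits = []
    for name, barcode in barcodes.items():
        prefix = read[:len(barcode)]
        if len(prefix) < len(barcode):
            continue  # read is shorter than this barcode
        dist = hamming(prefix, barcode)
        if dist <= max_mismatch:
            hits.append((dist, name))
    if not hits:
        return None
    hits.sort()
    if len(hits) > 1 and hits[0][0] == hits[1][0]:
        return None
    return hits[0][1]


# hypothetical barcodes and a read with an N in the barcode region
toy_barcodes = {"sampleA": "CTCC", "sampleB": "GAGT"}
toy_read = "CNCCTGCAGAAAT"

print(match_barcode(toy_read, toy_barcodes, max_mismatch=0))  # None
print(match_barcode(toy_read, toy_barcodes, max_mismatch=1))  # sampleA
```

With strict matching the read is discarded, while one allowed mismatch assigns it to the only barcode within a single difference.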

In [10]:
## print total
print "total reads recovered: {}\n".format(demux.stats.reads_raw.sum())

## print header, and then selected results across raw files
! head -n 1 $demux.stats_files.s1
! cat $demux.stats_files.s1 | grep 0[4-5][0-9].fastq

total reads recovered: 221196301

raw_file                               total_reads    cut_found  bar_matched
lane2_NoIndex_L002_R1_040.fastq            4000000      3998844      3722871
lane2_NoIndex_L002_R2_040.fastq            4000000      3998844      3722871
lane2_NoIndex_L002_R1_041.fastq            4000000      3999819      3790942
lane2_NoIndex_L002_R2_041.fastq            4000000      3999819      3790942
lane2_NoIndex_L002_R1_042.fastq            4000000      3999912      3761225
lane2_NoIndex_L002_R2_042.fastq            4000000      3999912      3761225
lane2_NoIndex_L002_R1_043.fastq            4000000      4000000      3744178
lane2_NoIndex_L002_R2_043.fastq            4000000      4000000      3744178
lane2_NoIndex_L002_R1_044.fastq            4000000      4000000      3661280
lane2_NoIndex_L002_R2_044.fastq            4000000      4000000      3661280
lane2_NoIndex_L002_R1_045.fastq            4000000      4000000      3593439
lane2_NoIndex_L002_R2_045.fastq           
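
As a quick check of how much the relaxed setting helped, the sketch below computes the fraction of reads matched to a barcode under each setting. The counts are copied by hand from the R1 rows of the two tables in this notebook; the dictionary layout and variable names are my own.

```python
# file label: (total_reads, bar_matched with 0 mismatches, bar_matched with 1 mismatch)
rows = {
    "R1_040": (4000000, 3288695, 3722871),
    "R1_041": (4000000, 978178, 3790942),
    "R1_042": (4000000, 538103, 3761225),
    "R1_043": (4000000, 0, 3744178),
    "R1_044": (4000000, 0, 3661280),
    "R1_045": (4000000, 0, 3593439),
}

# fraction of reads assigned to a sample under each setting
fractions = {}
for label in sorted(rows):
    total, strict, relaxed = rows[label]
    fractions[label] = (strict / float(total), relaxed / float(total))
    print("{}  strict={:.1%}  one-mismatch={:.1%}".format(
        label, fractions[label][0], fractions[label][1]))
```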

### Copy the results txt file back to the local dir (git repo)
I then pushed it to the repo [(view the full results here)](https://github.com/dereneaton/pedicularis-WB-GBS/blob/master/s1_demultiplex_stats.txt)

In [11]:
## the results of this demux look better, so I copied the step 1
## stats file to the local dir and pushed it to the git repo.
! cp $demux.stats_files.s1 $NBDIR

### Start an assembly of the data set
I usually prefer to start my analyses from data that are already demultiplexed, since the demultiplexed fastq files are typically what is available to others after a study is published and the data are made available online. So here we create a new Assembly object and read in the demultiplexed data. Then we run step 2 to filter the data and take a close look at the results. 

In [12]:
## this will be our assembly object for steps 1-6
data = ip.Assembly("c85d5f2h5")

## (optional) set a more fine-tuned threading for our cluster
data._ipcluster["threads"] = 4

## demux data location
DEMUX = os.path.join(demux.dirs.fastqs, "*gz")

## set parameters for this assembly and print them 
data.set_params("project_dir", os.path.join(WORK, data.name))
data.set_params("sorted_fastq_path", DEMUX)
data.set_params("barcodes_path", BARCODES)
data.set_params("filter_adapters", 2)
data.set_params("datatype", "pairgbs")
data.set_params("restriction_overhang", ("TGCAG", "TGCAG"))
data.set_params("max_Hs_consens", (5, 5))
data.set_params("trim_overhang", (0, 5, 5, 0))
data.get_params()

  New Assembly: c85d5f2h5
  0   assembly_name               c85d5f2h5                                    
  1   project_dir                 /fastscratch/de243/WB-PED/c85d5f2h5          
  2   raw_fastq_path                                                           
  3   barcodes_path               /fastscratch/de243/TMP_RAWS/WB-PED_barcodes.txt
  4   sorted_fastq_path           /fastscratch/de243/WB-PED/demux_reads/WB-PED_demux_fastqs/*gz
  5   assembly_method             denovo                                       
  6   reference_sequence                                                       
  7   datatype                    pairgbs                                      
  8   restriction_overhang        ('TGCAG', 'TGCAG')                           
  9   max_low_qual_bases          5                                            
  10  phred_Qscore_offset         33                                           
  11  mindepth_statistical        6                                         

In [13]:
## run steps 1-2
data.run("12")


  Assembly: c85d5f2h5
  [####################] 100%  loading reads         | 0:00:56 | s1 | 
  [####################] 100%  processing reads      | 0:23:05 | s2 | 


### Filtering stats
I pushed the full step 2 stats file to github [(see it here)](https://github.com/dereneaton/pedicularis-WB-GBS/blob/master/s2_rawedit_stats.txt). As this short snippet shows, a large amount of adapter and low-quality sequence was trimmed from the reads of every sample.

In [17]:
## print just the first few samples
print data.stats_dfs.s2.head()

           reads_raw  trim_adapter_bp_read1  trim_adapter_bp_read2  \
L20090356    4143404                1836178                1983663   
L2DZ0984     4646226                1890089                2095516   
L2DZ0988     4948408                2195090                2371398   
L2DZ0989     4209024                1803353                1950326   
L2DZ0990     4621835                1962933                2150203   

           trim_quality_bp_read1  trim_quality_bp_read2  reads_filtered_by_Ns  \
L20090356               63255677               59275618                  1484   
L2DZ0984                74989294               69346878                  1654   
L2DZ0988                71173489               62383199                  1935   
L2DZ0989                62485246               60599462                  1577   
L2DZ0990                72114365               66128652                  1712   

           reads_filtered_by_minlen  reads_passed_filter  
L20090356                   10526

### Run remaining steps of Assembly

In [None]:
data.run("3456")


  Assembly: c85d5f2h5
  [                    ]   0%  dereplicating         | 0:00:19 | s3 | 

### Create final output files
We will do this for several values of `min_samples_locus`, and provide unique names for each resulting assembly. 

In [15]:
## create named branches for final assemblies
min4 = data.branch("min4_c85d5f2h5")
min4.set_params("min_samples_locus", 4)

min10 = data.branch("min10_c85d5f2h5")
min10.set_params("min_samples_locus", 10)

## assemble outfiles
min4.run("7", force=True)
min10.run("7", force=True)


  Assembly: min4_c85d5f2h5
  [####################] 100%  filtering loci        | 0:00:04 | s7 | 
  [####################] 100%  building loci/stats   | 0:00:02 | s7 | 
  [####################] 100%  building vcf file     | 0:00:06 | s7 | 
  [####################] 100%  writing vcf file      | 0:00:00 | s7 | 
  [####################] 100%  writing outfiles      | 0:00:02 | s7 | 
  Outfiles written to: /fastscratch/de243/WB-PED/c85d5f2h5/min4_c85d5f2h5_outfiles

  Assembly: min10_c85d5f2h5
  [####################] 100%  filtering loci        | 0:00:04 | s7 | 
  [####################] 100%  building loci/stats   | 0:00:02 | s7 | 
  [####################] 100%  building vcf file     | 0:00:04 | s7 | 
  [####################] 100%  writing vcf file      | 0:00:00 | s7 | 
  [####################] 100%  writing outfiles      | 0:00:01 | s7 | 
  Outfiles written to: /fastscratch/de243/WB-PED/c85d5f2h5/min10_c85d5f2h5_outfiles
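
If more values of `min_samples_locus` are wanted, the branching pattern above can be wrapped in a loop. This is a sketch only: `branch_by_min_samples` is a hypothetical helper, the value 20 is an arbitrary example, and `_StandInAssembly` is a stand-in class that mimics just the `name`, `branch` and `set_params` members used in this notebook so the example runs without ipyrad.

```python
def branch_by_min_samples(data, values):
    """Create one named branch per min_samples_locus value."""
    branches = {}
    for minsamp in values:
        name = "min{}_{}".format(minsamp, data.name)
        branch = data.branch(name)
        branch.set_params("min_samples_locus", minsamp)
        branches[name] = branch
    return branches


class _StandInAssembly(object):
    """Hypothetical stand-in for an ipyrad Assembly (illustration only)."""

    def __init__(self, name):
        self.name = name
        self.params = {}

    def branch(self, name):
        # a branch starts as a copy of the parent's parameters
        new = _StandInAssembly(name)
        new.params = dict(self.params)
        return new

    def set_params(self, key, value):
        self.params[key] = value


branches = branch_by_min_samples(_StandInAssembly("c85d5f2h5"), [4, 10, 20])
print(sorted(branches))
```

With the real Assembly object, the call would be `branch_by_min_samples(data, [4, 10, 20])`, followed by `branch.run("7", force=True)` for each returned branch.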


### Print final stats for the min4 assembly

In [16]:
!cat $min4.stats_files.s7



## The number of loci caught by each filter.
## ipyrad API location: [assembly].statsfiles.s7_filters

                            total_filters  applied_order  retained_loci
total_prefiltered_loci              18895              0          18895
filtered_by_rm_duplicates             819            819          18076
filtered_by_max_indels                546            469          17607
filtered_by_max_snps                  957            649          16958
filtered_by_max_shared_het             12              4          16954
filtered_by_min_sample              11161          10822           6132
filtered_by_max_alleles              4737           2266           3866
total_filtered_loci                  3866              0           3866


## The number of loci recovered for each Sample.
## ipyrad API location: [assembly].stats_dfs.s7_samples

            sample_coverage
L20090356               845
L2DZ0984                871
L2DZ0988               1069
L2DZ

### A quick view of the first few loci

In [17]:
%%bash
head -n 100 $min4.outfiles.loci | cut -c 1-80

L20090356      TTGTTTATTATAACTATATACAGTGGATTGTCAAATAATCAAGTGAGCAATTGATGCAGTCT
L2DZ0984       TTGTTTATTATAACTATATACAGTGGATTGTCAAATAATCAAGTGAGCAATTGATGCAGTCA
L2DZ0989       TTGTTTATTATAACTATATACCGTGGATTGTCAAATAATCAAGTGAGCAATTGATGCAGTCA
LHW10071       TTGTTTATTATAACTATATACAGTGGATTGTCAAATAATCAAGTGAGCAATTGATGCAGTCA
//                                  -                                       -|0|
L20090356      TGGACGGCCTGGTCAAGGCCTACGGGGCCCGGCGCGT
L2DZ0984       TGGANGGNCTNGNNAAGGCCTACGGGGCNCNRCGCGT
L2DZ0990       TGGAGGGTCTCGCGAAGGCCTACGGGGCACGGCGCGT
L2DZ1016       TGGACGGCCTGGCCAAGGCCTACGGGGCCCGGCGCGT
L2DZ1243       TGGACGGCCTCGCGAAGGCCTACGGGGCACGGCGGGT
//                 -  -  * -*              *  -  -  |1|
L20090356      ATTTTGTTTCTCATAGTTAGTTGCGATTTATTTATTATAATTGTTTCGTCACATAAGTATTGAGT
L2DZ0988       ATTTTGTTTCTCATAGTTAGTTGCGAT-------TTATAATTGTTTCTTCACATAAGTATTGAGT
L2DZ0989       ATTTTGTTTCTCATAGTTAGTTGCGATTTATTTATTATAATTGTTTCGTCACATAAGTATTGAGT
L2DZ0990       ATTTTNNTTCTCATAGTTAGTTGNGAT-

cut: write error: Broken pipe
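
The layout printed above (one line per sample, then a `//` line that marks variable sites and ends with the locus number between pipes) is easy to parse for quick summaries. The sketch below is based only on the lines shown in this notebook, not on a formal specification of the format, and the example sequences are hypothetical.

```python
def parse_loci(lines):
    """Yield (locus_id, {sample: sequence}) for each locus in .loci-style lines."""
    seqs = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        if line.startswith("//"):
            # the separator line closes a locus; its id sits between the last two pipes
            locus_id = int(line.rstrip().rstrip("|").rsplit("|", 1)[-1])
            yield locus_id, seqs
            seqs = {}
        else:
            name, seq = line.split()
            seqs[name] = seq


# hypothetical two-locus example in the same layout
example = [
    "sampleA      ACGTTGCA",
    "sampleB      ACGTTGCT",
    "//                  -|0|",
    "sampleA      GGGTACCA",
    "sampleC      GGGTACCA",
    "sampleD      GGCTACCA",
    "//             -     |1|",
]

loci = list(parse_loci(example))
for locus_id, seqs in loci:
    print("locus {}: {} samples".format(locus_id, len(seqs)))
```

On the real output, the generator could be fed an open file handle, for example `parse_loci(open(min4.outfiles.loci))`.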


## Analysis methods

In [30]:
## reload, b/c I disconnected and came back.
min4 = ip.load_json("/fastscratch/de243/WB-PED/c85d5f2h5/min4_c85d5f2h5.json")
min4.outfiles.loci[:-5]+'.phy'

  loading Assembly: min4_c85d5f2h5
  from saved path: /fastscratch/de243/WB-PED/c85d5f2h5/min4_c85d5f2h5.json


'/fastscratch/de243/WB-PED/c85d5f2h5/min4_c85d5f2h5_outfiles/min4_c85d5f2h5.phy'

In [89]:
## make raxml dir
raxdir = os.path.join(os.curdir, "analysis_raxml")
raxdir = os.path.realpath(raxdir)
if not os.path.exists(raxdir):
    os.mkdir(raxdir)
    
## get outgroup string
OUT = ",".join([i for i in min4.samples.keys() if i[0] == "d"])

## run raxml in the background
cmd = ["/home2/de243/miniconda2/bin/raxmlHPC-PTHREADS", 
        "-f", "a", 
        "-m", "GTRGAMMA", 
        "-N", "100", 
        "-T", "8", 
        "-x", "12345", 
        "-p", "54321",
        "-o", OUT, 
        "-w", raxdir, 
        "-n", "min4_tree",
        "-s", min4.outfiles.loci[:-5]+'.phy']

## start process running in background
proc = sps.Popen(cmd, stderr=sps.PIPE, stdout=sps.PIPE)

In [107]:
## ask if it's still running in background
## (poll() returns None while running, and the exit code once finished)
if proc.poll() is not None:
    print proc.returncode
else:
    tail = ! tail $raxdir/*info*
    print "still running: \n", tail[-1]

still running: 
Bootstrap[2]: Time 38.238277 seconds, bootstrap likelihood -661693.484348, best rearrangement setting 12
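
When polling a background job, note that `Popen.poll()` returns `None` while the process is alive and the exit code once it ends, so a successful exit (code 0) must be tested with `is None` rather than by truthiness. The sketch below demonstrates this with short-lived child processes standing in for the RAxML job; `process_status` is a hypothetical helper.

```python
import subprocess
import sys


def process_status(proc):
    """Return 'running', 'finished' or 'failed (code N)' for a Popen object."""
    code = proc.poll()  # None while the process is still alive
    if code is None:
        return "running"
    if code == 0:
        return "finished"
    return "failed (code {})".format(code)


# one child that exits cleanly and one that exits with an error code
ok = subprocess.Popen([sys.executable, "-c", "pass"])
ok.wait()
bad = subprocess.Popen([sys.executable, "-c", "import sys; sys.exit(3)"])
bad.wait()

print(process_status(ok))   # finished
print(process_status(bad))  # failed (code 3)
```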


### Plot the tree in R

In [15]:
%load_ext rpy2.ipython

ImportError: No module named rpy2.ipython