# Assembly and analysis of *Pedicularis* PE-GBS data set

A library for 48 samples was prepared following the protocol described in Escudero et al. 2013 with the PstI restriction enzyme, followed by PCR amplification of primer-ligated fragments. The library prep lacked a size selection step, which we discuss in the methods below. The library was sequenced on two lanes of an Illumina HiSeq 2000, yielding 378,809,976 reads in lane 1 and 375,813,513 reads in lane 2, for a total of ~755M reads.
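
As a quick arithmetic check on the total quoted above, the two per-lane read counts can be summed directly. The dictionary name `lane_reads` is only an illustrative label, not something defined elsewhere in this notebook.

```python
# Read counts per lane, copied from the text above
lane_reads = {"lane1": 378809976, "lane2": 375813513}

# Sum the lanes and report the total in millions
total_reads = sum(lane_reads.values())
print("total reads: {:,} (~{:.0f}M)".format(total_reads, total_reads / 1e6))
# prints: total reads: 754,623,489 (~755M)
```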

### This notebook
This notebook provides a fully reproducible workflow to assemble and analyze the Yu-Eaton-Ree 2012 *Pedicularis* GBS data set, and to save the results alongside this notebook in a [GitHub repo](https://github.com/dereneaton/pedicularis-WB-GBS). Starting from the raw data files, we assemble the data de novo in *ipyrad*. This involves demultiplexing and filtering reads, clustering within and between samples to identify homology, and then final filtering and formatting to create output files. Analysis of the resulting files is shown in separate notebooks, also available in the [git repo](https://github.com/dereneaton/pedicularis-WB-GBS).

In [5]:
## show my local dir (where this notebook is located)
! pwd

## show the scratch dir (where data will be written)
! echo /fastscratch/de243/

## show that this dir has a git repo (.git file mapping to the address shown)
## this allows me to push updates to this notebook directly to github, 
## and to easily share the notebook with collaborators and as a final document.
! git config --get remote.origin.url

/home2/de243/pedicularis-WB-GBS
/fastscratch/de243/
https://github.com/dereneaton/pedicularis-WB-GBS.git


### Import ipyrad and other common modules

In [54]:
## all necessary software can be installed by uncommenting the command below
# conda install -c ipyrad ipyrad -y

## import basic modules and ipyrad and print version
import os
import socket
import glob
import subprocess as sps
import numpy as np
import ipyparallel as ipp
import ipyrad as ip

print("ipyrad v.{}".format(ip.__version__))
print("ipyparallel v.{}".format(ipp.__version__))
print("numpy v.{}".format(np.__version__))

ipyrad v.0.4.3
ipyparallel v.5.0.1
numpy v.1.11.0


### The cluster
This notebook was run connected to 32 cores on 4 nodes of the Louise HPC cluster at Yale. SSH Tunneling was set up following this [tutorial](http://ipyrad.readthedocs.io/HPC_Tunnel.html) to launch an *ipcluster* instance, which we use below to connect ipyrad to the cluster. Here I will create a view to the connected engines using the ipyparallel module, and confirm we are connected to all cores. 

In [52]:
## connect a client to the running ipcluster instance
ipyclient = ipp.Client()

## count the engines (cores) connected on each host; 32 are expected in total
hosts = ipyclient[:].apply_sync(socket.gethostname)
for hostname in set(hosts):
    print("  host compute node: [{} cores] on {}"\
          .format(hosts.count(hostname), hostname))

  host compute node: [8 cores] on compute-24-14.local
  host compute node: [16 cores] on compute-22-10.local
  host compute node: [8 cores] on compute-20-15.local
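
The per-host lines above can be tallied to confirm the total number of connected cores. The following is a minimal sketch that works on a plain list of hostnames; the helper name `summarize_hosts` is hypothetical, and the list is rebuilt here from the printed output rather than fetched from the cluster.

```python
from collections import Counter

def summarize_hosts(hosts):
    """Return (number of hosts, number of engines, engines per host)
    from a list holding one hostname per connected engine."""
    counts = Counter(hosts)
    return len(counts), sum(counts.values()), dict(counts)

# Hostnames copied from the output above, one entry per engine
hosts = (["compute-24-14.local"] * 8
         + ["compute-22-10.local"] * 16
         + ["compute-20-15.local"] * 8)

n_hosts, n_cores, per_host = summarize_hosts(hosts)
print("{} cores on {} hosts".format(n_cores, n_hosts))
# prints: 32 cores on 3 hosts
```

In the live notebook the same list is what `ipyclient[:].apply_sync(socket.gethostname)` returns.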


### Set up a working directory

In [16]:
## create a new working directory in HPC scratch dir
WORK = "/fastscratch/de243/WB-PED"
if not os.path.exists(WORK):
    os.mkdir(WORK)

## print it
print("working directory = {}".format(WORK))

working directory = /fastscratch/de243/WB-PED
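
One caveat with `os.mkdir` is that it fails when a parent directory is missing. A slightly more defensive variant is sketched below, assuming only the standard library; `ensure_dir` is a hypothetical helper, and a temporary directory stands in for the scratch path so the example can run anywhere.

```python
import errno
import os
import tempfile

def ensure_dir(path):
    """Create `path` and any missing parents; do nothing if it already exists."""
    try:
        os.makedirs(path)
    except OSError as err:
        # re-raise unless the directory is already there
        if err.errno != errno.EEXIST or not os.path.isdir(path):
            raise
    return path

# a temporary directory stands in for the HPC scratch space
base = tempfile.mkdtemp()
work = ensure_dir(os.path.join(base, "scratch", "WB-PED"))
ensure_dir(work)  # calling it a second time is harmless
print(os.path.isdir(work))
```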


### The raw data
The raw R1 and R2 data are each split into 59 gzipped files approximately 300MB in size. The barcodes file maps sample names to barcodes that are contained inline in the R1 sequences, and are 4-8bp in length. The barcodes are printed a little further below. 
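
Before demultiplexing it can help to sanity-check the barcodes file. The sketch below assumes the usual two-column layout (sample name, then barcode, separated by whitespace) and checks the 4-8 bp length range mentioned above. The function `parse_barcodes` and the sample names and barcodes in the example are made up for illustration; they are not the contents of `WB-PED_barcodes.txt`.

```python
import re

def parse_barcodes(text, min_len=4, max_len=8):
    """Parse 'name<whitespace>barcode' lines into a dict, validating each line."""
    barcodes = {}
    for lineno, line in enumerate(text.splitlines(), 1):
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split()
        if len(fields) != 2:
            raise ValueError("line {}: expected 2 columns, found {}"
                             .format(lineno, len(fields)))
        name, barcode = fields[0], fields[1].upper()
        if not re.match("^[ACGT]+$", barcode):
            raise ValueError("line {}: non-ACGT barcode {!r}".format(lineno, barcode))
        if not min_len <= len(barcode) <= max_len:
            raise ValueError("line {}: barcode length {} outside {}-{}"
                             .format(lineno, len(barcode), min_len, max_len))
        if name in barcodes:
            raise ValueError("line {}: duplicate sample {!r}".format(lineno, name))
        barcodes[name] = barcode
    return barcodes

# hypothetical file contents, for illustration only
example = "sampleA\tCATG\nsampleB\tTTAGGCAA\n\nsampleC  ACGTAC\n"
parsed = parse_barcodes(example)
print(sorted(parsed.items()))
```

A malformed line (for example a name with no barcode) raises a `ValueError` that names the offending line number.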

In [53]:
## Locations of the raw data stored temporarily on Yale's Louise HPC cluster
## Data are also stored more permanently on local computer tinus at Yale
RAWREADS = "/fastscratch/de243/TMP_RAWS/*.fastq.gz"
BARCODES = "/fastscratch/de243/TMP_RAWS/WB-PED_barcodes.txt"

### Fastqc quality check

I ran the program *fastQC* on the raw data files as a quality check; the results will be posted in [fastqc_dir](https://github.com/dereneaton/pedicularis-WB-GBS/blob/master/fastqc). Overall, quality scores were neither terrible nor great. Our biggest problem, however, was very high adapter contamination. We will filter this out using the program *cutadapt*, as implemented in step 2 of *ipyrad* and discussed further below.
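
FastQC also writes a tab-separated `summary.txt` inside each `*_fastqc.zip`, with one PASS/WARN/FAIL line per module, which makes it easy to list problem modules such as adapter content without opening every HTML report. The sketch below parses that text; `flagged_modules` is a hypothetical helper and the example summary is invented, not taken from this data set.

```python
def flagged_modules(summary_text, levels=("WARN", "FAIL")):
    """Return (status, module) pairs for modules whose status is in `levels`."""
    flagged = []
    for line in summary_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # skip empty or malformed lines
        status, module = fields[0], fields[1]
        if status in levels:
            flagged.append((status, module))
    return flagged

# invented summary text for illustration; not results from these reads
example = (
    "PASS\tBasic Statistics\texample_R1.fastq.gz\n"
    "WARN\tPer base sequence quality\texample_R1.fastq.gz\n"
    "FAIL\tAdapter Content\texample_R1.fastq.gz\n"
)
for status, module in flagged_modules(example):
    print("{}: {}".format(status, module))
```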

In [104]:
## uncomment this to install fastqc with conda
#conda install -c bioconda fastqc -q 

## create a tmp directory for fastqc outfiles (./tmp_fastqc)
QUALDIR = os.path.join(os.path.realpath(os.curdir), "tmp_fastqc")
if not os.path.exists(QUALDIR):
    os.mkdir(QUALDIR)
    
## run fastqc on all raw data files and write outputs to fastqc tmpdir.
## This is parallelized by load-balancing with ipyclient
lbview = ipyclient.load_balanced_view()
for rawfile in glob.glob(RAWREADS):
    cmd = ['fastqc', rawfile, '--outdir', QUALDIR, '-t', '1', '-q']
    lbview.apply_async(sps.check_output, cmd)
    
## block until finished and print progress
ipyclient.wait_interactive()

 118/118 tasks finished after  153 s
done
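
The loop above discards the result of each `apply_async` call, so a failed fastqc job would go unnoticed. The general pattern of keeping a handle on every task and checking it afterwards is sketched below. Note the swap: this uses Python 3's standard `concurrent.futures` rather than ipyparallel so that it runs without a cluster, and the two commands are harmless stand-ins, not fastqc calls. `run_all` is a hypothetical helper.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

def run_all(commands, max_workers=4):
    """Run each command in a worker thread.
    Returns ({index: output}, {index: exception}) keyed by command position."""
    outputs, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            idx: pool.submit(subprocess.check_output, cmd, stderr=subprocess.STDOUT)
            for idx, cmd in enumerate(commands)
        }
        for idx, future in futures.items():
            try:
                outputs[idx] = future.result()
            except (subprocess.CalledProcessError, OSError) as err:
                errors[idx] = err  # keep the failure instead of losing it
    return outputs, errors

# stand-in commands: one that succeeds and one that exits with an error
commands = [
    [sys.executable, "-c", "print('ok')"],
    [sys.executable, "-c", "import sys; sys.exit(1)"],
]
outputs, errors = run_all(commands)
print("{} succeeded, {} failed".format(len(outputs), len(errors)))
```

With ipyparallel, the analogous approach would be to store each `AsyncResult` returned by `lbview.apply_async` in a dict and call `.get()` on it after `wait_interactive()`, since `.get()` re-raises remote errors.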


In [103]:
QUALDIR = os.path.join(os.path.realpath(os.curdir), "tmp_fastqc")
QUALDIR

'/home2/de243/pedicularis-WB-GBS/tmp_fastqc'

In [92]:
## update fastqc html results and this notebook
! git add $QUALDIR/*.html nb-WB-Pedicularis.ipynb
! git commit -m "fastq html results uploaded"
! git push -u origin master

fatal: '/fastscratch/de243/WB-PED/tmp_fastqc/lane2_NoIndex_L002_R1_001_fastqc.html' is outside repository
# On branch master
# Untracked files:
#   (use "git add <file>..." to include in what will be committed)
#
#	.ipynb_checkpoints/
#	ipyrad_log.txt
#	nb-WB-Pedicularis.ipynb
nothing added to commit but untracked files present (use "git add" to track)
error: The requested URL returned error: 403 Forbidden while accessing https://github.com/dereneaton/pedicularis-WB-GBS.git/info/refs

fatal: HTTP request failed
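
The first error above arises because git can only add files that live inside the repository's working tree, and the html files were under the scratch directory. A small check like the one sketched here could catch this before calling `git add`; `is_inside` is a hypothetical helper, and the paths are taken from the output shown in this notebook.

```python
import os

def is_inside(path, root):
    """Return True if `path` is `root` itself or lies somewhere beneath it."""
    root = os.path.realpath(root)
    path = os.path.realpath(path)
    return path == root or path.startswith(root + os.sep)

repo = "/home2/de243/pedicularis-WB-GBS"
html = "lane2_NoIndex_L002_R1_001_fastqc.html"

# a file in ./tmp_fastqc within the repo versus one on the scratch disk
in_repo = is_inside(os.path.join(repo, "tmp_fastqc", html), repo)
in_scratch = is_inside(os.path.join("/fastscratch/de243/WB-PED/tmp_fastqc", html), repo)
print(in_repo, in_scratch)
```

Files that fail the check would need to be copied into the repository directory (for example with `shutil.copy`) before they can be added.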


In [84]:
## cleanup tmpdir
! rm -r $QUALDIR



### Create demultiplexed files for each Sample in *ipyrad*
We set the locations of the raw data and the barcodes file on the Assembly object, and set the maximum barcode mismatch parameter to zero (strict), allowing no mismatches.
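
To illustrate what a barcode mismatch threshold means, here is a deliberately simplified sketch of assigning a read to a sample by comparing the start of the read with each inline barcode. This is not ipyrad's implementation: it ignores the restriction-site overhang and quality scores, and the function names, sample names, barcodes, and reads are all invented for the example.

```python
def hamming(a, b):
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def assign_sample(read, barcodes, max_mismatch=0):
    """Return the one sample whose barcode matches the start of `read`
    within `max_mismatch` differences, or None if none or several match."""
    hits = []
    for name, barcode in barcodes.items():
        prefix = read[:len(barcode)]
        if len(prefix) == len(barcode) and hamming(prefix, barcode) <= max_mismatch:
            hits.append(name)
    return hits[0] if len(hits) == 1 else None

# invented barcodes and reads, for illustration only
barcodes = {"sampleA": "CATCAT", "sampleB": "GGTTCA"}
exact_read = "CATCATTGCAGAAAA"   # starts with sampleA's barcode exactly
noisy_read = "CATCTTTGCAGAAAA"   # one difference from sampleA's barcode

print(assign_sample(exact_read, barcodes, max_mismatch=0))
print(assign_sample(noisy_read, barcodes, max_mismatch=0))
print(assign_sample(noisy_read, barcodes, max_mismatch=1))
```

With a threshold of zero the noisy read is left unassigned, which is the strict behavior chosen here; raising the threshold recovers more reads at the risk of misassignment when barcodes are similar.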

In [19]:
## create an Assembly object to demultiplex the raw reads
demux = ip.Assembly("WB-PED_demux")

## set basic demultiplexing parameters for the object
demux.set_params("project_dir", os.path.join(WORK, "demux_reads"))
demux.set_params("raw_fastq_path", RAWREADS)
demux.set_params("barcodes_path", BARCODES)
demux.set_params("max_barcode_mismatch", 0)

  New Assembly: WB-PED_demux


IPyradError:     Error setting parameter 'barcodes_path'
    list index out of range
    You entered: /home2/de243/RADSEQ_RAWS/WB-PED/WB-PED_barcodes.txt
    