# Assembly and analysis of *Pedicularis* PE-GBS data set

A library for 48 samples was prepared following the protocol described in Escudero et al. 2013 with the PstI restriction enzyme, followed by PCR amplification of primer-ligated fragments. The library prep lacked a size selection step, which we discuss in the methods below. The library was sequenced on one lane of an Illumina HiSeq 2000, yielding xxx reads.

### This notebook
This notebook provides a fully reproducible workflow to assemble and analyze the Yu-Eaton-Ree 2012 *Pedicularis* GBS data set, and to save the results alongside this notebook in a GitHub repo ([see git repo here](https://github.com/dereneaton/pedicularis-WB-GBS)). Starting from the raw data files, we assemble the data de novo in *ipyrad*, which involves demultiplexing and filtering reads, then clustering within and between samples to identify homology, followed by final filtering and formatting to create output files. Analysis of the resulting files is shown in separate notebooks, also available in the [git repo](https://github.com/dereneaton/pedicularis-WB-GBS).

In [5]:
## show my local dir (where this notebook is located)
! pwd

## show the scratch dir (where data will be written)
! echo /fastscratch/de243/

## show that this dir has a git repo (.git file mapping to the address shown)
## this allows me to push updates to this notebook directly to github, 
## and to easily share the notebook with collaborators and as a final document.
! git config --get remote.origin.url

/home2/de243/pedicularis-WB-GBS
/fastscratch/de243/
https://github.com/dereneaton/pedicularis-WB-GBS.git


### Import ipyrad and other common modules

In [54]:
## all necessary software can be installed by uncommenting the command below
# conda install -c ipyrad ipyrad -y

## import basic modules and ipyrad and print version
import os
import socket
import glob
import subprocess as sps
import numpy as np
import ipyparallel as ipp
import ipyrad as ip

print "ipyrad v.{}".format(ip.__version__)
print "ipyparallel v.{}".format(ipp.__version__)
print "numpy v.{}".format(np.__version__)

ipyrad v.0.4.3
ipyparallel v.5.0.1
numpy v.1.11.0


### The cluster
This notebook was run connected to 32 cores on 4 nodes of the Louise HPC cluster at Yale. SSH Tunneling was set up following this [tutorial](http://ipyrad.readthedocs.io/HPC_Tunnel.html) to launch an *ipcluster* instance, which we use below to connect ipyrad to the cluster. Here I will create a view to the connected engines using the ipyparallel module, and confirm we are connected to all cores. 

In [52]:
## open a view to the client
ipyclient = ipp.Client()

## confirm we are connected to 4 8-core nodes
hosts = ipyclient[:].apply_sync(socket.gethostname)
for hostname in set(hosts):
    print("host compute node: [{} cores] on {}"\
          .format(hosts.count(hostname), hostname))

  host compute node: [8 cores] on compute-24-14.local
  host compute node: [16 cores] on compute-22-10.local
  host compute node: [8 cores] on compute-20-15.local


### Set up a working directory

In [16]:
## create a new working directory in HPC scratch dir
WORK = "/fastscratch/de243/WB-PED"
if not os.path.exists(WORK):
    os.mkdir(WORK)

## print it
print "working directory = {}".format(WORK)

working directory = /fastscratch/de243/WB-PED


### The raw data
The raw R1 and R2 data are each split into 59 gzipped files approximately 300MB in size. The barcodes file maps sample names to barcodes that are contained inline in the R1 sequences, and are 4-8bp in length. The barcodes are printed a little further below. 
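
The `RAWREADS` glob pattern used in the next cell ends in `00[1-3]`, so it selects only a subset of the raw files. The sketch below shows which names such a pattern matches. The file names are hypothetical: the naming scheme is borrowed from the step 1 stats output further down, and the numbering from 001 to 059 is an assumption.

```python
import fnmatch

# Hypothetical file names (naming scheme taken from the step 1 stats
# output; numbering 001..059 for both R1 and R2 is an assumption).
names = [
    "lane2_NoIndex_L002_R{}_{:03d}.fastq.gz".format(read, idx)
    for idx in range(1, 60)
    for read in (1, 2)
]

# Same basename pattern as in RAWREADS
pattern = "*00[1-3].fastq.gz"
matched = fnmatch.filter(names, pattern)

print("{} of {} files match".format(len(matched), len(names)))
for name in matched:
    print(name)
```

Under these assumptions only the files numbered 001 to 003 match, which agrees with the six raw files listed in the step 1 stats.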

In [120]:
## Locations of the raw data stored temporarily on Yale's Louise HPC cluster
## Data are also stored more permanently on local computer tinus at Yale
RAWREADS = "/fastscratch/de243/TMP_RAWS/*00[1-3].fastq.gz"
BARCODES = "/fastscratch/de243/TMP_RAWS/WB-PED_barcodes.txt"

### Fastqc quality check

I ran the program *FastQC* on the raw data files as a quality check; the results are (or will be) available here: [fastqc_dir](https://github.com/dereneaton/pedicularis-WB-GBS/blob/master/fastqc). Overall, quality scores were neither terrible nor great, but our biggest problem was very high adapter contamination. We will filter this out using the program *cutadapt*, as implemented in step 2 of *ipyrad* and discussed further below.
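
Adapter contamination typically arises when a fragment is shorter than the read length, so the sequencer reads through the insert and into the adapter. This is a likely consequence of skipping size selection. The sketch below illustrates the idea of trimming with an exact string match on the 8 bp adapter prefix listed later in this notebook. It is only an illustration: *cutadapt* uses error-tolerant alignment rather than exact matching, and the function name and insert sequences here are made up.

```python
def trim_adapter(read, adapter="AGATCGGA"):
    """Cut the read at the first exact occurrence of the adapter, if any."""
    pos = read.find(adapter)
    return read if pos == -1 else read[:pos]


# Made-up reads: a short insert followed by the start of the adapter,
# and a longer insert with no adapter sequence in it.
short_fragment = "TGCAGGATTACA" + "AGATCGGAAGAGCGG"
long_fragment = "TGCAGGATTACACCGTTAGC"

print(trim_adapter(short_fragment))
print(trim_adapter(long_fragment))
```

The first read is cut back to its insert, while the second is returned unchanged.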

In [104]:
## uncomment this to install fastqc with conda
#conda install -c bioconda fastqc -q 

## create a tmp directory for fastqc outfiles (./tmp_fastqc)
QUALDIR = os.path.join(os.path.realpath(os.curdir), "tmp_fastqc")
if not os.path.exists(QUALDIR):
    os.mkdir(QUALDIR)
    
## run fastqc on all raw data files and write outputs to fastqc tmpdir.
## This is parallelized by load-balancing with ipyclient
lbview = ipyclient.load_balanced_view()
for rawfile in glob.glob(RAWREADS):
    cmd = ['fastqc', rawfile, '--outdir', QUALDIR, '-t', '1', '-q']
    lbview.apply_async(sps.check_output, cmd)
    
## block until finished and print progress
ipyclient.wait_interactive()

## I then pushed the tmp_fastqc dir to the git repo
## and then deleted the tmpdir
! rm -r $QUALDIR

 118/118 tasks finished after  153 s
done
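
The progress line counts finished tasks, which is not the same as successful ones. In the cell above the handles returned by `apply_async` are discarded, so a failed `fastqc` call would go unnoticed. The sketch below shows the general pattern of keeping one handle per task and checking it afterwards. It uses `concurrent.futures` from the Python 3 standard library as a stand-in for the ipyparallel view, and `fake_qc` and the file names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_qc(path):
    """Stand-in for a fastqc call; fails on a hypothetical bad file name."""
    if "bad" in path:
        raise ValueError("could not read {}".format(path))
    return path + ".html"


# Hypothetical input files
paths = ["reads_001.fastq.gz", "bad_002.fastq.gz", "reads_003.fastq.gz"]

with ThreadPoolExecutor(max_workers=2) as pool:
    # keep one handle per task, keyed by its input file
    futures = {path: pool.submit(fake_qc, path) for path in paths}

# all tasks are done once the pool has shut down; collect the failures
failed = sorted(p for p, fut in futures.items() if fut.exception() is not None)
print("failed: {}".format(failed))
```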


### Create demultiplexed files for each Sample in *ipyrad*
We set the locations of the raw data and the barcodes file on the Assembly object, and set the max barcode mismatch parameter to zero (strict), allowing no mismatches.
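
To make the strict setting concrete, here is a minimal sketch of inline-barcode matching with a mismatch threshold. It is not ipyrad's implementation: the helper names, the rule that the barcode must be followed directly by the `TGCAG` overhang, and the handling of ambiguous hits are all assumptions made for illustration. The barcodes are a few of those listed later in this notebook, and the insert sequence is made up.

```python
def hamming(a, b):
    """Count mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def assign_sample(read, barcodes, overhang="TGCAG", max_mismatch=0):
    """Return (sample, read without barcode), or None if unassigned."""
    hits = []
    for sample, barcode in barcodes.items():
        start = read[:len(barcode)]
        rest = read[len(barcode):]
        if len(start) < len(barcode):
            continue
        # the barcode must match and be followed by the cut-site overhang
        if hamming(start, barcode) <= max_mismatch and rest.startswith(overhang):
            hits.append((sample, rest))
    # missing or ambiguous matches are left unassigned
    return hits[0] if len(hits) == 1 else None


barcodes = {"L20090356": "ACAAA", "L2DZ0984": "TAGGAA", "d19long1": "TGCA"}

good = "TAGGAA" + "TGCAG" + "CTTAGGCATC"  # exact barcode
bad = "TAGCAA" + "TGCAG" + "CTTAGGCATC"   # one barcode mismatch

print(assign_sample(good, barcodes))
print(assign_sample(bad, barcodes))
print(assign_sample(bad, barcodes, max_mismatch=1))
```

With `max_mismatch=0` the read carrying a single barcode error is left unassigned; allowing one mismatch would recover it, at the risk of misassigning reads when barcodes are similar to one another.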

In [121]:
## create an Assembly object to demultiplex the lane
demux = ip.Assembly("WB-PED_demux")

## set basic demultiplexing parameters for the object
demux.set_params("project_dir", os.path.join(WORK, "demux_reads"))
demux.set_params("raw_fastq_path", RAWREADS)
demux.set_params("barcodes_path", BARCODES)
demux.set_params("max_barcode_mismatch", 0)
demux.set_params("datatype", "pairgbs")
demux.set_params("restriction_overhang", ("TGCAG", "TGCAG"))

  New Assembly: WB-PED_demux


In [122]:
demux.run("1", force=True)


  Assembly: WB-PED_demux
    [force] overwriting fastq files previously *created by ipyrad* in:
    /fastscratch/de243/WB-PED/demux_reads/WB-PED_demux_fastqs
    This does not affect your *original/raw data files*
  [####################] 100%  chunking large files  | 0:02:01 | s1 | 
  [####################] 100%  sorting reads         | 0:00:44 | s1 | 
  [####################] 100%  writing/compressing   | 0:02:01 | s1 | 


### Some quick summary stats (we'll look more in depth later)

In [129]:
! cat $demux.stats_files.s1

raw_file                               total_reads    cut_found  bar_matched
lane2_NoIndex_L002_R1_001.fastq            4000000      3998858      3761574
lane2_NoIndex_L002_R2_001.fastq            4000000      3998858      3761574
lane2_NoIndex_L002_R1_002.fastq            4000000      3998854      3745260
lane2_NoIndex_L002_R2_002.fastq            4000000      3998854      3745260
lane2_NoIndex_L002_R1_003.fastq            4000000      3998872      3739970
lane2_NoIndex_L002_R2_003.fastq            4000000      3998872      3739970

sample_name                            total_reads
L20090356                                   210432
L2DZ0984                                    236276
L2DZ0988                                    251031
L2DZ0989                                    218117
L2DZ0990                                    229001
L2DZ1006                                    273899
L2DZ1007                                    605837
L2DZ1011                            
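
The per-file counts in the first table can be turned into rates with a little arithmetic. The sketch below copies the R1 rows from the stats output and computes the fraction of reads with a cut site found and with a barcode matched; the variable names are my own.

```python
# (total_reads, cut_found, bar_matched) for the R1 files, copied from
# the step 1 stats output.
lane_stats = {
    "lane2_NoIndex_L002_R1_001.fastq": (4000000, 3998858, 3761574),
    "lane2_NoIndex_L002_R1_002.fastq": (4000000, 3998854, 3745260),
    "lane2_NoIndex_L002_R1_003.fastq": (4000000, 3998872, 3739970),
}

rates = {}
for name in sorted(lane_stats):
    total, cut_found, bar_matched = lane_stats[name]
    rates[name] = (cut_found / float(total), bar_matched / float(total))
    print("{}  cut_found={:.2%}  bar_matched={:.2%}".format(name, *rates[name]))
```

Nearly all reads contain the cut site, and roughly 93.5 to 94% match a barcode exactly.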

### Start an assembly of the data set
I usually prefer to start my analyses from data that are already demultiplexed, since the demultiplexed fastq files are typically what is available to others once a study is published and its data are made available online. So here we will create an Assembly object and read in the demultiplexed data. Then we will run step 2 to filter the data and take a close look at the results.

In [139]:
## this will be our assembly object for steps 1-6
data = ip.Assembly("c85d5f2h5")

## demux data location
DEMUX = os.path.join(demux.dirs.fastqs, "*gz")

## set parameters for this assembly and print them 
data.set_params("project_dir", os.path.join(WORK, data.name))
data.set_params("sorted_fastq_path", DEMUX)
data.set_params("barcodes_path", BARCODES)
data.set_params("filter_adapters", 2)
data.set_params("datatype", "pairgbs")
data.set_params("restriction_overhang", ("TGCAG", "TGCAG"))
data.set_params("max_Hs_consens", (5, 5))
data.get_params()

## the fastqc check found evidence of two adapters/primers
## R1s have Illumina Paired End PCR Primer 2
## R2s have Illumina Paired End PCR Primer 1
primer2 = "AGATCGGA" #AGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTAT"
primer1 = "AGATCGGA" #AGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGG"

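## By default ipyrad would search for the common prefix to the
## two primers, but b/c they're super common in this data set
## we set the adapter sequences explicitly. Note that as written
## only the shared 8 bp prefix is active; the remainder of each
## primer is commented out in the two assignments above.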
data._hackersonly["p5_adapter"] = primer2
data._hackersonly["p3_adapter"] = primer1

  New Assembly: c85d5f2h5
  0   assembly_name               c85d5f2h5                                    
  1   project_dir                 /fastscratch/de243/WB-PED/c85d5f2h5          
  2   raw_fastq_path                                                           
  3   barcodes_path               /fastscratch/de243/TMP_RAWS/WB-PED_barcodes.txt
  4   sorted_fastq_path           /fastscratch/de243/WB-PED/demux_reads/WB-PED_demux_fastqs/*gz
  5   assembly_method             denovo                                       
  6   reference_sequence                                                       
  7   datatype                    pairgbs                                      
  8   restriction_overhang        ('TGCAG', 'TGCAG')                           
  9   max_low_qual_bases          5                                            
  10  phred_Qscore_offset         33                                           
  11  mindepth_statistical        6                                         

In [140]:
data.run("12", force=True)


  Assembly: c85d5f2h5
  [####################] 100%  loading reads         | 0:00:01 | s1 | 
  [####################] 100%  processing reads      | 0:01:13 | s2 | 


In [141]:
print data.stats

            state  reads_raw  reads_passed_filter
L20090356       2     210432               161428
L2DZ0984        2     236276               181249
L2DZ0988        2     251031               195450
L2DZ0989        2     218117               170387
L2DZ0990        2     229001               180509
L2DZ1006        2     273899               213874
L2DZ1007        2     605837               484602
L2DZ1011        2     338422               238820
L2DZ1016        2     180016               135545
L2DZ1019        2     178359               130740
L2DZ1027        2     244863               189679
L2DZ1060        2     874698               635113
L2DZ1070        2     251629               193653
L2DZ1180        2     255598               193026
L2DZ1204        2     161805               126138
L2DZ1243        2     365060               275749
L2DZ1266        2     179139               139922
L2DZ1268        2     153988               114672
L2DZ1282_1      2     252918               193437


In [137]:
print data.stats

            state  reads_raw  reads_passed_filter
L20090356       2     210432               185032
L2DZ0984        2     236276               205672
L2DZ0988        2     251031               225226
L2DZ0989        2     218117               192268
L2DZ0990        2     229001               204410
L2DZ1006        2     273899               241373
L2DZ1007        2     605837               540138
L2DZ1011        2     338422               304366
L2DZ1016        2     180016               160610
L2DZ1019        2     178359               161434
L2DZ1027        2     244863               222857
L2DZ1060        2     874698               765719
L2DZ1070        2     251629               224869
L2DZ1180        2     255598               229999
L2DZ1204        2     161805               148565
L2DZ1243        2     365060               326697
L2DZ1266        2     179139               163800
L2DZ1268        2     153988               140695
L2DZ1282_1      2     252918               234617
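
The two `data.stats` printouts above (cells `In [141]` and `In [137]`) report different `reads_passed_filter` counts for the same `reads_raw` totals, presumably because they come from runs with different filter settings. A quick way to compare them is the fraction of reads retained per sample. The sketch below does this for three samples copied from the two tables; the variable names are my own.

```python
# sample: (reads_raw, passed in cell In [141], passed in cell In [137])
counts = {
    "L20090356": (210432, 161428, 185032),
    "L2DZ1007": (605837, 484602, 540138),
    "L2DZ1060": (874698, 635113, 765719),
}

retained = {}
for sample in sorted(counts):
    raw, run_141, run_137 = counts[sample]
    retained[sample] = (run_141 / float(raw), run_137 / float(raw))
    print("{:<12}{:>8.1%}{:>8.1%}".format(sample, *retained[sample]))
```

For these samples the `In [141]` run retains roughly 73 to 80% of reads, compared with roughly 88 to 89% in the `In [137]` run.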


In [142]:
data.barcodes

{'L20090356': 'ACAAA',
 'L2DZ0984': 'TAGGAA',
 'L2DZ0988': 'TTCAGA',
 'L2DZ0989': 'CCAGCT',
 'L2DZ0990': 'GGTGT',
 'L2DZ1006': 'GCTTA',
 'L2DZ1007': 'CGCTT',
 'L2DZ1011': 'GTCGATT',
 'L2DZ1016': 'GGACCTA',
 'L2DZ1019': 'GAACTTG',
 'L2DZ1027': 'GAATTCA',
 'L2DZ1060': 'CGCCTTAT',
 'L2DZ1070': 'CCGGATAT',
 'L2DZ1180': 'TCTCAGTG',
 'L2DZ1204': 'TGGTACGT',
 'L2DZ1243': 'TTCCTGGA',
 'L2DZ1266': 'TAGGCCAT',
 'L2DZ1268': 'GGTTGT',
 'L2DZ1282_1': 'TAGCATGG',
 'LHW10069': 'CTAGG',
 'LHW10071': 'TCACG',
 'LHW10074': 'ACCGT',
 'LHW10346': 'CCACAA',
 'LJ-118': 'GCTCTA',
 'd19long1': 'TGCA',
 'd30181': 'ATGCCT',
 'd30695': 'TGCGA',
 'd31733': 'AAAAGTT',
 'd33291': 'AACGCCT',
 'd34041': 'CGAT',
 'd35178': 'GTATT',
 'd35320': 'AATATGG',
 'd35371': 'AGCCG',
 'd35422': 'CAGA',
 'd39103': 'CTTCCA',
 'd39104': 'GTAA',
 'd39114': 'CTTGCTT',
 'd39187': 'TTCTG',
 'd39253': 'AGTGGA',
 'd39404': 'TATTTTT',
 'd39531': 'ATGAAAG',
 'd39968': 'GAGATA',
 'd40328': 'ACGACTAG',
 'd41058': 'ACTA',
 'd41237': 'AACT',
 