## Viburnum RAD-seq demultiplexing notebook

This notebook contains the *ipyrad* code used to demultiplex fastq data from multiple RAD-seq libraries sequenced across multiple lanes of Illumina HiSeq, and to visualize the distribution of reads among samples. Information about the libraries is recorded below. This notebook and its accompanying barcode files are archived online in a GitHub repository: [https://github.com/dereneaton/Viburnum-phylogeny](https://github.com/dereneaton/Viburnum-phylogeny). It may be updated as new data are obtained. 

### Raw fastq reads and index (barcode) files
The libraries were prepared with *inline* barcodes that are 10 bp in length. Barcode files, which match sample names to barcodes, are used to demultiplex the data. These files are available online and are used below. 
1. [Viburnum library-1 barcodes](https://github.com/dereneaton/Viburnum-phylogeny/blob/master/VIBURNUM_1_BARCODES.txt)
2. [Viburnum library-2 barcodes](https://github.com/dereneaton/Viburnum-phylogeny/blob/master/VIBURNUM_2_BARCODES.txt)  
3. [Viburnum library-3 barcodes](https://github.com/dereneaton/Viburnum-phylogeny/blob/master/VIBURNUM_3_BARCODES.txt)  
4. [Viburnum library-4 barcodes](https://github.com/dereneaton/Viburnum-phylogeny/blob/master/VIBURNUM_4_BARCODES.txt)  
5. [Viburnum library-5 barcodes](https://github.com/dereneaton/Viburnum-phylogeny/blob/master/VIBURNUM_5_BARCODES.txt)  
6. [Viburnum library-6 barcodes](https://github.com/dereneaton/Viburnum-phylogeny/blob/master/VIBURNUM_6_BARCODES.txt)  
7. [Viburnum library-7 barcodes](https://github.com/dereneaton/Viburnum-phylogeny/blob/master/VIBURNUM_7_BARCODES.txt)  
8. [Viburnum library-8 barcodes](https://github.com/dereneaton/Viburnum-phylogeny/blob/master/VIBURNUM_8_BARCODES.txt)  
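
Before demultiplexing it can be worth sanity-checking a barcode file. The sketch below assumes the two-column, whitespace-delimited layout (sample name, then barcode) and checks that every barcode is 10 bp long and unique. The function names and the example entries are made up for illustration; they are not taken from the files linked above.

```python
def read_barcodes(lines):
    """Parse 'sample<whitespace>barcode' lines into a {sample: barcode} dict."""
    barcodes = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        name, barcode = line.split()
        barcodes[name] = barcode.upper()
    return barcodes


def check_barcodes(barcodes, length=10):
    """Return a list of problems: wrong-length or duplicated barcodes."""
    problems = []
    seen = {}
    for name, barcode in sorted(barcodes.items()):
        if len(barcode) != length:
            problems.append(
                "{}: barcode is {} bp, expected {}".format(name, len(barcode), length)
            )
        if barcode in seen:
            problems.append(
                "{} and {} share barcode {}".format(seen[barcode], name, barcode)
            )
        seen[barcode] = name
    return problems


# Invented sample names and barcodes, for illustration only
example = ["sampleA  ACGTACGTAC", "sampleB\tTTGCATTGCA", "", "sampleC  ACGTAC"]
parsed = read_barcodes(example)
problems = check_barcodes(parsed)
print(problems)
```

For a real file, the same functions could be applied to `open(path)` in place of the `example` list.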


### The local paths to raw data files
Libraries 1 through 4 were each sequenced twice, on two separate lanes, to increase read depth; libraries 5 through 8 were each sequenced on a single lane. For the two-lane libraries we could simply read in the data from both lanes, along with the corresponding barcodes file, and demultiplex each library in one step. However, the more rigorous approach is to demultiplex each lane separately, so that any effect of the lane on our results could (at least in theory) be detected. This is also the recommended format for uploading data to the NCBI Sequence Read Archive, with each lane of data in its own files. It is not much extra trouble: when we later assemble the data, reads from the same sample that were demultiplexed into separate per-lane files can easily be merged using the `ipyrad.merge()` command.  
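
One simple check that per-lane demultiplexing makes possible is to compare how much each lane contributes to each sample. The sketch below is only an illustration of that idea with invented read counts; `lane_shares` is a hypothetical helper, not an *ipyrad* function.

```python
def lane_shares(lane1, lane2):
    """For each sample, return (total reads, fraction contributed by lane 1)."""
    shares = {}
    for sample in sorted(set(lane1) | set(lane2)):
        n1 = lane1.get(sample, 0)
        n2 = lane2.get(sample, 0)
        total = n1 + n2
        fraction = n1 / float(total) if total else float("nan")
        shares[sample] = (total, fraction)
    return shares


# Invented read counts, for illustration only
lane1 = {"sampleA": 1000, "sampleB": 300}
lane2 = {"sampleA": 1000, "sampleB": 900}

shares = lane_shares(lane1, lane2)
for sample, (total, fraction) in sorted(shares.items()):
    print("{}: {} reads, {:.2f} from lane 1".format(sample, total, fraction))
```

If both lanes behave alike, the lane-1 fraction should be similar across samples; a sample whose fraction departs strongly from the rest would be worth a closer look.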

In [1]:
## Name: Viburnum 1
## Description: mostly species-level sampling for phylogeny but also some
##              population-level sampling for Beth's thesis, including
##              nudum, dentatum, rufidulum, lentago, prunifolium.
## Sequencer: 2 lanes; 100bp; SE; Illumina Hi-Seq 2000 at Univ. Oregon
## Lib-prep: Floragenex, PstI enzyme, size-selection:?

## there are two compressed data files (2 lanes), each 16Gb in size.
lib1_1 = "/ysm-gpfs/project/de243/RADSEQ_RAWS/VIBURNUM_1/UO_C353_*.gz"
lib1_2 = "/ysm-gpfs/project/de243/RADSEQ_RAWS/VIBURNUM_1/UO_C354_*.gz"
bar1 = "/ysm-gpfs/project/de243/RADSEQ_RAWS/barcodes/VIBURNUM_1_BARCODES.txt"

In [2]:
## Name: Viburnum 2
## Description: mostly species-level sampling for phylogeny but also some
##              population-level sampling for Beth's thesis, including
##              lentago, prunifolium, rufidulum, obovatum.
## Sequencer: 2 lanes; 100bp; SE; Illumina Hi-Seq 2000 and 2500 at Univ. Oregon
## Lib-prep: Floragenex, PstI enzyme, size-selection: ?

## The first lane is HiSeq 2000, 9 files each ~1.4Gb in size.
## The second lane is HiSeq 2500, 7 files each ~1.4Gb in size.
lib2_1 = "/ysm-gpfs/project/de243/RADSEQ_RAWS/VIBURNUM_2/lane1_*.gz"
lib2_2 = "/ysm-gpfs/project/de243/RADSEQ_RAWS/VIBURNUM_2/lane8_*.gz"
bar2 = "/ysm-gpfs/project/de243/RADSEQ_RAWS/barcodes/VIBURNUM_2_BARCODES.txt"

In [3]:
## Name: Viburnum 3
## Description: dentatum & nudum sampling for Beth's thesis
## Sequencer: 2 lanes; 100bp; SE; Illumina Hi-Seq 2500 at Univ. Oregon
## Lib-prep: Floragenex, PstI enzyme, size-selection:?

## there are two compressed data files (2 lanes), each 9Gb in size.
lib3_1 = "/ysm-gpfs/project/de243/RADSEQ_RAWS/VIBURNUM_3/261_*.gz"
lib3_2 = "/ysm-gpfs/project/de243/RADSEQ_RAWS/VIBURNUM_3/262_*.gz"
bar3 = "/ysm-gpfs/project/de243/RADSEQ_RAWS/barcodes/VIBURNUM_3_BARCODES.txt"

In [4]:
## Name: Viburnum 4
## Description: mostly dentatum & rufidulum sampling for Beth's thesis
## Sequencer: 2 lanes; 100bp; SE; Illumina Hi-Seq 2500 at Univ. Oregon
## Lib-prep: Floragenex, PstI enzyme, size-selection:?

## there are two compressed data files (2 lanes), each 11Gb in size.
lib4_1 = "/ysm-gpfs/project/de243/RADSEQ_RAWS/VIBURNUM_4/263_*.gz"
lib4_2 = "/ysm-gpfs/project/de243/RADSEQ_RAWS/VIBURNUM_4/264_*.gz"
bar4 = "/ysm-gpfs/project/de243/RADSEQ_RAWS/barcodes/VIBURNUM_4_BARCODES.txt"

In [5]:
## Name: Viburnum 5 (C657)
## Description: ...
## Sequencer: 1 lane; 100bp; SE; Illumina Hi-Seq 4000 at Univ. Oregon
## Lib-prep: Floragenex, PstI enzyme, size-selection:?

## One file gzip compressed to 16GB
lib5_1 = "/ysm-gpfs/project/de243/RADSEQ_RAWS/VIBURNUM_5/932_*.gz"
bar5 = "/ysm-gpfs/project/de243/RADSEQ_RAWS/barcodes/VIBURNUM_5_BARCODES_deren.txt"

In [6]:
## Name: Viburnum 6 (C655)
## Description: ...
## Sequencer: 1 lane; 100bp; SE; Illumina Hi-Seq 4000 at Univ. Oregon
## Lib-prep: Floragenex, PstI enzyme, size-selection:?

## One file gzip compressed to 16GB
lib6_1 = "/ysm-gpfs/project/de243/RADSEQ_RAWS/VIBURNUM_6/930_*.gz"
bar6 = "/ysm-gpfs/project/de243/RADSEQ_RAWS/barcodes/VIBURNUM_6_BARCODES_deren.txt"

In [7]:
## Name: Viburnum 7 (C658)
## Description: ...
## Sequencer: 1 lane; 100bp; SE; Illumina Hi-Seq 4000 at Univ. Oregon
## Lib-prep: Floragenex, PstI enzyme, size-selection:?

## One file gzip compressed to 16GB
lib7_1 = "/ysm-gpfs/project/de243/RADSEQ_RAWS/VIBURNUM_7/933_*.gz"
bar7 = "/ysm-gpfs/project/de243/RADSEQ_RAWS/barcodes/VIBURNUM_7_BARCODES_deren.txt"

In [8]:
## Name: Viburnum 8 (C656)
## Description: ...
## Sequencer: 1 lane; 100bp; SE; Illumina Hi-Seq 4000 at Univ. Oregon
## Lib-prep: Floragenex, PstI enzyme, size-selection:?

## One file gzip compressed to 16GB
lib8_1 = "/ysm-gpfs/project/de243/RADSEQ_RAWS/VIBURNUM_8/931_*.gz"
bar8 = "/ysm-gpfs/project/de243/RADSEQ_RAWS/barcodes/VIBURNUM_8_BARCODES_deren.txt"

### Load *ipyrad* and print the current version

In [9]:
import ipyrad as ip
print("ipyrad v.{}".format(ip.__version__))

ipyrad v.0.5.15


In [10]:
print(ip.cluster_info())

  host compute node: [20 cores] on c15n01.farnam.hpc.yale.internal
  host compute node: [20 cores] on c14n12.farnam.hpc.yale.internal


### Initiate an Assembly class object for each lane of data


In [17]:
## create Assemblies for each library
data1_1 = ip.Assembly("lib1-lane1")
data1_2 = ip.Assembly("lib1-lane2")
data2_1 = ip.Assembly("lib2-lane1")
data2_2 = ip.Assembly("lib2-lane2")
data3_1 = ip.Assembly("lib3-lane1")
data3_2 = ip.Assembly("lib3-lane2")
data4_1 = ip.Assembly("lib4-lane1")
data4_2 = ip.Assembly("lib4-lane2")

  New Assembly: lib1-lane1
  New Assembly: lib1-lane2
  New Assembly: lib2-lane1
  New Assembly: lib2-lane2
  New Assembly: lib3-lane1
  New Assembly: lib3-lane2
  New Assembly: lib4-lane1
  New Assembly: lib4-lane2


In [11]:
data5_1 = ip.Assembly("lib5-lane1")
data6_1 = ip.Assembly("lib6-lane1")
data7_1 = ip.Assembly("lib7-lane1")
data8_1 = ip.Assembly("lib8-lane1")

  New Assembly: lib5-lane1
  New Assembly: lib6-lane1
  New Assembly: lib7-lane1
  New Assembly: lib8-lane1


### Set parameters for each Assembly

Here we set the paths to the raw data and barcode files for each Assembly, and set a common project directory so that all of the resulting files are grouped in one place. Because the barcodes are 10 bp long, we allow up to one mismatched base in the barcode during demultiplexing. 
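
To make the mismatch setting concrete, the sketch below assigns a read's first 10 bases to a sample when exactly one barcode lies within one mismatch of it. This is a generic Hamming-distance illustration, not a description of how *ipyrad* implements demultiplexing, and the barcodes are invented.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def assign(read_prefix, barcodes, max_mismatch=1):
    """Return the single sample whose barcode is within max_mismatch, else None."""
    hits = [
        name
        for name, barcode in barcodes.items()
        if hamming(read_prefix, barcode) <= max_mismatch
    ]
    return hits[0] if len(hits) == 1 else None


# Invented barcodes, for illustration only
barcodes = {"sampleA": "ACGTACGTAC", "sampleB": "TTGCATTGCA"}

exact = assign("ACGTACGTAC", barcodes)    # identical to sampleA's barcode
one_off = assign("ACGTACGTAA", barcodes)  # one mismatch from sampleA
too_far = assign("ACGTACGGAA", barcodes)  # two mismatches from sampleA
print(exact, one_off, too_far)
```

Allowing one mismatch is only unambiguous when every pair of barcodes differs at three or more positions; otherwise a single sequencing error could land a read equally close to two samples.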

In [12]:
## Path where we want to write all of the demux files 
demuxdir = "/ysm-gpfs/project/de243/Viburnum_demux"

In [None]:
## set data & barcodes paths for each library
data1_1.set_params("project_dir", demuxdir)
data1_1.set_params("raw_fastq_path", lib1_1)
data1_1.set_params("barcodes_path", bar1)
data1_1.set_params("max_barcode_mismatch", 1)

## set data & barcodes paths for each library
data1_2.set_params("project_dir", demuxdir)
data1_2.set_params("raw_fastq_path", lib1_2)
data1_2.set_params("barcodes_path", bar1)
data1_2.set_params("max_barcode_mismatch", 1)

In [None]:
## set data & barcodes paths for each library
data2_1.set_params("project_dir", demuxdir)
data2_1.set_params("raw_fastq_path", lib2_1)
data2_1.set_params("barcodes_path", bar2)
data2_1.set_params("max_barcode_mismatch", 1)

## set data & barcodes paths for each library
data2_2.set_params("project_dir", demuxdir)
data2_2.set_params("raw_fastq_path", lib2_2)
data2_2.set_params("barcodes_path", bar2)
data2_2.set_params("max_barcode_mismatch", 1)

In [None]:
## set data & barcodes paths for each library
data3_1.set_params("project_dir", demuxdir)
data3_1.set_params("raw_fastq_path", lib3_1)
data3_1.set_params("barcodes_path", bar3)
data3_1.set_params("max_barcode_mismatch", 1)

## set data & barcodes paths for each library
data3_2.set_params("project_dir", demuxdir)
data3_2.set_params("raw_fastq_path", lib3_2)
data3_2.set_params("barcodes_path", bar3)
data3_2.set_params("max_barcode_mismatch", 1)

In [None]:
## set data & barcodes paths for each library
data4_1.set_params("project_dir", demuxdir)
data4_1.set_params("raw_fastq_path", lib4_1)
data4_1.set_params("barcodes_path", bar4)
data4_1.set_params("max_barcode_mismatch", 1)

## set data & barcodes paths for each library
data4_2.set_params("project_dir", demuxdir)
data4_2.set_params("raw_fastq_path", lib4_2)
data4_2.set_params("barcodes_path", bar4)
data4_2.set_params("max_barcode_mismatch", 1)

In [13]:
## set data & barcodes paths for each library
data5_1.set_params("project_dir", demuxdir)
data5_1.set_params("raw_fastq_path", lib5_1)
data5_1.set_params("barcodes_path", bar5)
data5_1.set_params("max_barcode_mismatch", 1)

In [14]:
## set data & barcodes paths for each library
data6_1.set_params("project_dir", demuxdir)
data6_1.set_params("raw_fastq_path", lib6_1)
data6_1.set_params("barcodes_path", bar6)
data6_1.set_params("max_barcode_mismatch", 1)

In [15]:
## set data & barcodes paths for each library
data7_1.set_params("project_dir", demuxdir)
data7_1.set_params("raw_fastq_path", lib7_1)
data7_1.set_params("barcodes_path", bar7)
data7_1.set_params("max_barcode_mismatch", 1)

In [16]:
## set data & barcodes paths for each library
data8_1.set_params("project_dir", demuxdir)
data8_1.set_params("raw_fastq_path", lib8_1)
data8_1.set_params("barcodes_path", bar8)
data8_1.set_params("max_barcode_mismatch", 1)
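
The cells above repeat the same four `set_params` calls for every Assembly. A small helper could apply them in one place. The sketch below is one possible way to do that; `configure` and `StubAssembly` are hypothetical names, the stub exists only so the example runs without *ipyrad* installed, and the paths are placeholders.

```python
def configure(assembly, project_dir, fastq_path, barcodes_path, mismatch=1):
    """Apply the four demultiplexing parameters to one Assembly-like object."""
    assembly.set_params("project_dir", project_dir)
    assembly.set_params("raw_fastq_path", fastq_path)
    assembly.set_params("barcodes_path", barcodes_path)
    assembly.set_params("max_barcode_mismatch", mismatch)
    return assembly


class StubAssembly(object):
    """Hypothetical stand-in that only records parameters; not part of ipyrad."""

    def __init__(self, name):
        self.name = name
        self.params = {}

    def set_params(self, key, value):
        self.params[key] = value


# Placeholder paths, for illustration only
stub = configure(
    StubAssembly("lib5-lane1"),
    "/tmp/demux",
    "/tmp/raw/*.gz",
    "/tmp/barcodes.txt",
)
print(sorted(stub.params.items()))
```

In this notebook the helper would be called with the real objects, for example `configure(data5_1, demuxdir, lib5_1, bar5)`.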

### Demux the libraries

In [17]:
data5_1.run("1")


  Assembly: lib5-lane1
  [####################] 100%  chunking large files  | 0:09:36 | s1 | 
  [####################] 100%  sorting reads         | 0:03:06 | s1 | 
  [####################] 100%  writing/compressing   | 0:03:39 | s1 | 


In [18]:
data6_1.run("1")


  Assembly: lib6-lane1
  [####################] 100%  chunking large files  | 0:09:46 | s1 | 
  [####################] 100%  sorting reads         | 0:02:35 | s1 | 
  [####################] 100%  writing/compressing   | 0:03:31 | s1 | 


In [19]:
data7_1.run("1")


  Assembly: lib7-lane1
  [####################] 100%  chunking large files  | 0:09:27 | s1 | 
  [####################] 100%  sorting reads         | 0:02:52 | s1 | 
  [####################] 100%  writing/compressing   | 0:03:28 | s1 | 


In [20]:
data8_1.run("1")


  Assembly: lib8-lane1
  [####################] 100%  chunking large files  | 0:09:39 | s1 | 
  [####################] 100%  sorting reads         | 0:03:01 | s1 | 
  [####################] 100%  writing/compressing   | 0:03:06 | s1 | 


### Stats

In [24]:
## reload assemblies in the case that this notebook was restarted. 
import ipyrad as ip
import pandas as pd
import glob
import os

## the demuxdir
demuxdir = "/ysm-gpfs/home/de243/project/Viburnum_demux/"

## json files
jsons = sorted(glob.glob(os.path.join(demuxdir, "*.json")))

## datadict
data = [ip.load_json(i, quiet=True) for i in jsons]

In [25]:
## a quick summary of the raw_reads stats w/o Floragenex control sample
raws = pd.DataFrame([dat.stats.drop("FGXCONTROL").reads_raw.describe() for dat in data],
                     index=[dat.name for dat in data])

print(raws.astype(int))

            count     mean      std     min      25%      50%      75%  \
lib1-lane1     95  1624177   755857  443002  1130699  1450416  1924737   
lib1-lane2     95  1626160   739802  437908  1137801  1468499  1962176   
lib2-lane1     95  1201333   547659  318767   887036  1056552  1390343   
lib2-lane2     95   880605   402003  233855   649875   780487  1029653   
lib3-lane1     95  1040292   295246  399155   885037   995051  1164271   
lib3-lane2     95  1023212   282450  432182   840556  1017557  1133667   
lib4-lane1     95  1235275   813645  138413   687032  1127640  1564199   
lib4-lane2     95  1199391   782075  139396   662859  1074653  1527673   
lib5-lane1     95  2441584  3002894  109136  1208573  1541344  2746925   
lib6-lane1     95  2400024  1483075   18716  1727431  2364915  2866835   
lib7-lane1     95  2424420  2148516   25111  1158722  2068931  2875426   
lib8-lane1     95  2440229  1938794   53769  1385807  2181897  2866726   

                 max  
lib1-lane1   3

### File access permissions
Because I want all users in my group to be able to access the new demux directory that we created on the cluster, I use the Unix command `chmod` to change its permissions so that members of my group have full access. 

In [24]:
## the (!) means this is a bash command, 
## chmod changes the permissions, 
## the order is: me=7, group=7, others=4
## -R means apply to subfolders as well
! chmod -R 774 $demuxdir
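
For readers unfamiliar with octal permission modes, the sketch below decodes a mode such as `774` into read/write/execute flags for the user, the group, and others. `describe_mode` is a hypothetical helper written for this illustration.

```python
def describe_mode(mode):
    """Render an octal permission mode as user/group/others rwx flags."""
    parts = []
    for who, shift in (("user", 6), ("group", 3), ("others", 0)):
        bits = (mode >> shift) & 0o7  # three permission bits for this class
        flags = "{}{}{}".format(
            "r" if bits & 4 else "-",
            "w" if bits & 2 else "-",
            "x" if bits & 1 else "-",
        )
        parts.append("{}={}".format(who, flags))
    return " ".join(parts)


summary = describe_mode(0o774)
print(summary)
```

Note that with mode `774` the "others" class has read permission but not execute permission, which on a directory means they can list file names but cannot open the files inside it.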


## Analysis/Visualization of read distributions

In [26]:
## import some plotting libraries
import toyplot
import pandas as pd

### Summary of reads 

In [91]:
canvas = toyplot.Canvas(height=3600, width=600)

for idx in range(12):
    axes = canvas.cartesian(grid=(12, 1, idx), gutter=75)
    stat = data[idx].stats.drop("FGXCONTROL").reads_raw.sort_values()
    axes.bars(stat, title=data[idx].name)
    axes.x.label.text = data[idx].name
    axes.y.ticks.labels.angle = -90
    axes.y.ticks.labels.style = {"font-size": 12}