## MiSeq ezRAD data exploration

Download the compressed fastq files from Zac's GitHub repo.

In [1]:
%%bash

## download MiSeq data from Zac's github repo
curl -LkO https://github.com/zacforsman/example_ezRAD_data/raw/master/ipyrad_formatted_data.tar.gz

## de-compress it
tar -xzvf ipyrad_formatted_data.tar.gz

## download mt genome file
curl -LkO https://github.com/zacforsman/example_ezRAD_data/raw/master/Achatinella_sowerbyana.fasta

ipyrad_formatted_data/ASO1mtreads_R1_.fastq.gz
ipyrad_formatted_data/
ipyrad_formatted_data/ASO5mtreads_R1_.fastq.gz
ipyrad_formatted_data/ASO2mtreads_R1_.fastq.gz
ipyrad_formatted_data/ASO3mtreads_R1_.fastq.gz
ipyrad_formatted_data/ASO4mtreads_R1_.fastq.gz
ipyrad_formatted_data/ASO6mtreads_R1_.fastq.gz




### The data
The data appear to be paired-end MiSeq reads (2 × 300 bp). The paired reads are interleaved in a single file: each R1 record (four lines) is followed directly by its R2 mate, then the next R1 record, and so on, rather than R1 and R2 being stored in separate files, which is what I'm used to seeing. I found a short script on this page, [http://seqanswers.com/forums/showthread.php?t=38892](http://seqanswers.com/forums/showthread.php?t=38892), showing how to split this kind of data into two separate files. I wrote a Python function to do the same thing, below.

In [2]:
%%bash

gzip -d -c ipyrad_formatted_data/ASO1mtreads_R1_.fastq.gz | head -n 16 | cut -c 1-80

@M02308:132:000000000-ANV18:1:1101:6317:3772 1:N:0:5
GATCTCAATGTTGTTGTTATCTTATAACAGCTTAATAAACAACTTAATTTTCCATGATTAAGATTTACATAGAGAACTAT
+
CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
@M02308:132:000000000-ANV18:1:1101:6317:3772 2:N:0:5
GATCAAAGAGTCGAAGATTTAACATTAGAAAAGGATTATTATCAATTATTCCTAAAATAAGACTAATATGATTTATTTTA
+
CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGG
@M02308:132:000000000-ANV18:1:1101:6319:3792 1:N:0:5
GATCTCAATGTTGTTGTTATCTTATAACAGCTTAATAAACAACTTAATTTTCCATGATTAAGATTTACATAGAGAACTAT
+
@CCCCGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGFFGGGGGGGGGGGGGGGGFFGGGGGGGGGGCGGGGGC9F
@M02308:132:000000000-ANV18:1:1101:6319:3792 2:N:0:5
GATCAAAGAGTCGAAGATTTAACATTAGAAAAGGATTATTATCAATTATTCCTAAAATAAGACTAATATGATTTATTTTA
+
CCCCCGGGGGGGGGGGGGGGGFGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGDCFGGGFGGG



gzip: stdout: Broken pipe
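One way to confirm the interleaved layout is to walk through the records in pairs and compare their headers. The sketch below assumes Illumina-style headers like the ones printed above (a read name, a space, then a field starting with `1:` or `2:`). The helper names `read_records`, `is_mate_pair`, and `looks_interleaved` are made up for illustration, and the commented usage assumes Python 3.

```python
import itertools

def read_records(lines):
    """Yield 4-line fastq records as lists of stripped strings."""
    lines = iter(lines)
    while True:
        rec = [line.rstrip("\n") for line in itertools.islice(lines, 4)]
        if len(rec) < 4:
            return
        yield rec

def is_mate_pair(header1, header2):
    """True if two headers share a read name and are flagged read 1 and read 2."""
    name1, _, flag1 = header1.partition(" ")
    name2, _, flag2 = header2.partition(" ")
    return name1 == name2 and flag1.startswith("1:") and flag2.startswith("2:")

def looks_interleaved(lines, max_pairs=1000):
    """Check up to max_pairs consecutive record pairs for R1/R2 mates."""
    records = read_records(lines)
    checked = 0
    for rec1 in records:
        ## the record right after an R1 record should be its R2 mate
        rec2 = next(records, None)
        if rec2 is None or not is_mate_pair(rec1[0], rec2[0]):
            return False
        checked += 1
        if checked >= max_pairs:
            break
    return checked > 0

## usage on one of the downloaded files (Python 3 text mode):
## import gzip
## with gzip.open("ipyrad_formatted_data/ASO1mtreads_R1_.fastq.gz", "rt") as handle:
##     print(looks_interleaved(handle))
```

Only the headers are compared, so this says nothing about read quality; it just guards against feeding a non-interleaved file to the splitting step.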


### Splitting function 

In [3]:
import itertools
import gzip

def split_miseq(input_file, out_r1, out_r2):
    """Split an interleaved, gzipped fastq file into separate R1 and R2 files."""

    ## open the input and both outputs in binary mode (works in Python 2 and 3)
    with gzip.open(input_file, 'rb') as indat, \
         gzip.open(out_r1, 'wb') as write1, \
         gzip.open(out_r2, 'wb') as write2:

        ## alternate between writing to R1 and R2, starting with R1
        outs = itertools.cycle([write1, write2])

        ## read one 4-line fastq record at a time until the input is empty
        while True:
            quart = list(itertools.islice(indat, 4))
            if len(quart) < 4:
                break
            next(outs).write(b"".join(quart))

### Split all reads

In [4]:
import glob
import os

## make new directory for split files
newdir = "split-fastqs"
if not os.path.exists(newdir):
    os.makedirs(newdir)

## iterate over files
for miseqfile in glob.glob("ipyrad_formatted_data/*.gz"):
    ## get name from file
    name = os.path.basename(miseqfile).split("_", 1)[0]
    r1 = os.path.join(newdir, name + "_R1_.fastq.gz")
    r2 = os.path.join(newdir, name + "_R2_.fastq.gz")
    split_miseq(miseqfile, r1, r2)
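
The sample name used in the loop is everything before the first underscore in the file name. A small sketch of that naming rule, wrapped in a hypothetical helper `split_names`, shows the paths it produces for one of the downloaded files:

```python
import os

def split_names(miseqfile, newdir="split-fastqs"):
    """Return (sample name, R1 path, R2 path) for one interleaved file."""
    ## keep only the text before the first underscore of the file name
    name = os.path.basename(miseqfile).split("_", 1)[0]
    r1 = os.path.join(newdir, name + "_R1_.fastq.gz")
    r2 = os.path.join(newdir, name + "_R2_.fastq.gz")
    return name, r1, r2

name, r1, r2 = split_names("ipyrad_formatted_data/ASO1mtreads_R1_.fastq.gz")
print(name)
print(r1)
print(r2)
```

Note that this rule would merge samples whose names themselves contain an underscore, which is not a problem for the six files here.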

In [5]:
%%bash
ls -l split-fastqs

total 3652
-rw-rw-r-- 1 deren deren 282503 Aug 23 11:44 ASO1mtreads_R1_.fastq.gz
-rw-rw-r-- 1 deren deren 353785 Aug 23 11:44 ASO1mtreads_R2_.fastq.gz
-rw-rw-r-- 1 deren deren 256707 Aug 23 11:44 ASO2mtreads_R1_.fastq.gz
-rw-rw-r-- 1 deren deren 319318 Aug 23 11:44 ASO2mtreads_R2_.fastq.gz
-rw-rw-r-- 1 deren deren 230505 Aug 23 11:44 ASO3mtreads_R1_.fastq.gz
-rw-rw-r-- 1 deren deren 284714 Aug 23 11:44 ASO3mtreads_R2_.fastq.gz
-rw-rw-r-- 1 deren deren 147815 Aug 23 11:44 ASO4mtreads_R1_.fastq.gz
-rw-rw-r-- 1 deren deren 187029 Aug 23 11:44 ASO4mtreads_R2_.fastq.gz
-rw-rw-r-- 1 deren deren 334216 Aug 23 11:44 ASO5mtreads_R1_.fastq.gz
-rw-rw-r-- 1 deren deren 426128 Aug 23 11:44 ASO5mtreads_R2_.fastq.gz
-rw-rw-r-- 1 deren deren 395910 Aug 23 11:44 ASO6mtreads_R1_.fastq.gz
-rw-rw-r-- 1 deren deren 497179 Aug 23 11:44 ASO6mtreads_R2_.fastq.gz
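
Since every interleaved file should yield the same number of R1 and R2 reads, comparing record counts is a cheap sanity check on the split. This is a sketch and not part of the original workflow: `count_records` and `count_gz_records` are hypothetical helpers, and it assumes well-formed four-line fastq records.

```python
import glob
import gzip

def count_records(lines):
    """Count fastq records in an iterable of lines (4 lines per record)."""
    nlines = sum(1 for _ in lines)
    return nlines // 4

def count_gz_records(path):
    """Count fastq records in a gzipped fastq file."""
    with gzip.open(path, 'rb') as infile:
        return count_records(infile)

## compare each R1 file with its R2 partner (does nothing if no files match)
for r1 in sorted(glob.glob("split-fastqs/*_R1_.fastq.gz")):
    r2 = r1.replace("_R1_.fastq.gz", "_R2_.fastq.gz")
    n1, n2 = count_gz_records(r1), count_gz_records(r2)
    print("{} {} {} {}".format(r1, n1, n2, "ok" if n1 == n2 else "MISMATCH"))
```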


## Assemble the data set

In [6]:
import ipyrad as ip
print("ipyrad " + ip.__version__)


ipyrad 0.7.11


In [7]:
## create a denovo assembly object
denovo = ip.Assembly("denovo")
denovo.set_params("project_dir", "analysis-ipyrad")
denovo.set_params("sorted_fastq_path", "split-fastqs/*.gz")
denovo.set_params("mindepth_majrule", 2)
denovo.set_params("datatype", "pairgbs")
denovo.set_params("filter_adapters", 2)
denovo.set_params("restriction_overhang", ("GATC", "GATC"))

New Assembly: denovo
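
The `restriction_overhang` setting matches what the raw reads look like: in the records printed earlier, both R1 and R2 begin with `GATC`. A rough way to check how consistently that holds is to count the reads that start with the overhang. This is an illustrative sketch with a hypothetical helper `overhang_fraction`; it is not an ipyrad function.

```python
def overhang_fraction(seqs, overhang="GATC"):
    """Return the fraction of sequences that start with the overhang."""
    seqs = list(seqs)
    if not seqs:
        return 0.0
    hits = sum(1 for seq in seqs if seq.upper().startswith(overhang))
    return hits / float(len(seqs))

## the beginnings of the first R1 and R2 reads shown earlier
reads = [
    "GATCTCAATGTTGTTGTTATCTTATAACAGC",
    "GATCAAAGAGTCGAAGATTTAACATTAGAAA",
]
print(overhang_fraction(reads))
```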


In [8]:
## load the data, filter and cluster it.
denovo.run("123")

Assembly: denovo
[####################] 100%  loading reads         | 0:00:00 | s1 | 
[####################] 100%  processing reads      | 0:00:00 | s2 | 


In [9]:
## create a reference assembly object
ref = denovo.branch('reference')
ref.set_params("reference_sequence", "Achatinella_sowerbyana.fasta")
ref.set_params("assembly_method", "reference")

## map to reference
ref.run("3", force=True)

### Visual inspection of clusters looks pretty good.

In [48]:
%%bash

## first 16 lines trimmed at 80 characters per line
zcat analysis-ipyrad/denovo_clust_0.85/ASO1mtreads.clustS.gz | head -n 16 | cut -c 1-80

006c3b4bfd1d393a7bf041856df8344e;size=39;*
----------------------------------------------GATCAAGTAAAATCAAATTTTAAAAATAAAAAAG
ab4ec4807c2dca97c970118c1f234dd7;size=2;-
----------------------------------------------GATCAAGTAAAATCAAATTTTAAAAATAAAAAAG
549ea44a0ea3ab0b31deecbfeeb46513;size=1;-
ATTATTGCAGATAAGAAGAGGAAAAAGTATATTTGTAGTAATATTAGAACAAGTAAAATCAAATTTTAAAAATAAAAAAG
2d0de67dcdfe85f6ccb42e829eb8864e;size=1;-
-------------------------------------------------------AAATCAAATTTTAAAAATAAAAAAG
087c5fd2760f87460d551c3451088469;size=1;+
------------AAGAAGAGGAAAAAGTATATTTGTAGTAATATTAGAACAAGTAAAATCAAATTTTAAAAATAAAAAAG
7dbb429703111a49e245e199aa005e16;size=1;+
--------------GAAGAGGAAAAAGTATATTTGTAGTAATATTAGAACAAGTAAAATCAAATTTTAAAAATAAAAAAG
914d88a9d5ce8b53c690026d55190a56;size=1;-
------------------------------------AGTAATATTAGAACAAGTAAAATCAAATTTTAAAAATAAAAAAG
bfba95cfefb5d8a6ad90f3ca72417a89;size=1;-
----------------------------------------------GATCAAGTAAAATCAAATTTTTAAAATAAAAAAG



gzip: stdout: Broken pipe
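
The header lines in the cluster file pack three fields separated by semicolons: a sequence identifier, a `size=` count, and a trailing symbol (`*`, `+`, or `-`). The sketch below parses that layout as inferred from the output above. The helper `parse_cluster_header` is hypothetical, and reading `size` as the number of identical reads collapsed into one entry is an assumption.

```python
def parse_cluster_header(line):
    """Split 'name;size=N;flag' into (name, size, flag)."""
    name, size, flag = line.strip().split(";")
    return name, int(size.replace("size=", "")), flag

## the first three header lines from the cluster shown above
headers = [
    "006c3b4bfd1d393a7bf041856df8344e;size=39;*",
    "ab4ec4807c2dca97c970118c1f234dd7;size=2;-",
    "549ea44a0ea3ab0b31deecbfeeb46513;size=1;-",
]
parsed = [parse_cluster_header(h) for h in headers]

## total of the size fields across these entries
total = sum(size for _, size, _ in parsed)
print(total)
```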


### Finish assembly

In [49]:
denovo.run("4567")

Assembly: ezrad
[####################] 100%  inferring [H, E]      | 0:00:00 | s4 | 
[####################] 100%  calculating depths    | 0:00:00 | s5 | 
[####################] 100%  chunking clusters     | 0:00:00 | s5 | 
[####################] 100%  consens calling       | 0:00:02 | s5 | 
[####################] 100%  concat/shuffle input  | 0:00:00 | s6 | 
[####################] 100%  clustering across     | 0:00:02 | s6 | 
[####################] 100%  building clusters     | 0:00:00 | s6 | 
[####################] 100%  aligning clusters     | 0:00:00 | s6 | 
[####################] 100%  database indels       | 0:00:00 | s6 | 
[####################] 100%  indexing clusters     | 0:00:00 | s6 | 
[####################] 100%  building database     | 0:00:00 | s6 | 
[####################] 100%  filtering loci        | 0:00:00 | s7 | 
[####################] 100%  building loci/stats   | 0:00:00 | s7 | 
[####################] 100%  building vcf file     | 0:00:00 | s7 | 
[#################

In [50]:
denovo.stats

| | state | reads_raw | reads_passed_filter | clusters_total | clusters_hidepth | hetero_est | error_est | reads_consens |
|---|---|---|---|---|---|---|---|---|
| ASO1mtreads | 6 | 2723 | 2721 | 273 | 59 | 0.006884 | 0.001928 | 55 |
| ASO2mtreads | 6 | 2362 | 2360 | 254 | 54 | 0.010536 | 0.002622 | 48 |
| ASO3mtreads | 6 | 1991 | 1990 | 302 | 39 | 0.016154 | 0.002737 | 32 |
| ASO4mtreads | 6 | 1449 | 1447 | 194 | 27 | 0.009593 | 0.00164 | 20 |
| ASO5mtreads | 6 | 3220 | 3215 | 297 | 58 | 0.016241 | 0.002657 | 51 |
| ASO6mtreads | 6 | 3760 | 3755 | 271 | 63 | 0.016831 | 0.002391 | 53 |
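
The depth columns show that only a minority of clusters in each sample reach high depth. A quick calculation from the `clusters_total` and `clusters_hidepth` values in the table above makes this explicit. The numbers are copied by hand here, so treat this as a sketch rather than a query against the assembly object.

```python
## (clusters_total, clusters_hidepth) copied from the stats table above
stats = {
    "ASO1mtreads": (273, 59),
    "ASO2mtreads": (254, 54),
    "ASO3mtreads": (302, 39),
    "ASO4mtreads": (194, 27),
    "ASO5mtreads": (297, 58),
    "ASO6mtreads": (271, 63),
}

## fraction of clusters per sample that count as high depth
frac_hidepth = {
    name: hidepth / float(total) for name, (total, hidepth) in stats.items()
}

for name in sorted(frac_hidepth):
    print("{}  {:.3f}".format(name, frac_hidepth[name]))
```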


### Only one locus shared across 4 samples in this data set

In [53]:
denovo.stats_dfs.s7_loci

| | locus_coverage | sum_coverage |
|---|---|---|
| 1 | 0 | 0 |
| 2 | 0 | 0 |
| 3 | 0 | 0 |
| 4 | 1 | 1 |
| 5 | 0 | 1 |
| 6 | 0 | 1 |
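
In this table `locus_coverage` appears to count the loci shared by exactly that many samples, and `sum_coverage` is its running total. That reading is inferred from the numbers shown rather than taken from the ipyrad documentation, but it can be checked against the values above:

```python
## locus_coverage values for 1..6 samples, copied from the table above
locus_coverage = [0, 0, 0, 1, 0, 0]

## running total of loci as the number of samples increases
sum_coverage = []
total = 0
for nloci in locus_coverage:
    total += nloci
    sum_coverage.append(total)

print(sum_coverage)
```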


### Visual inspection of .loci file

In [13]:
cat analysis-ipyrad/denovo_outfiles/denovo.loci

ASO1mtreads     TACCTTGAGGGCAAATATCATATTGGGGAGCGACAGTAATTACTAATCTTGTTAGGGCAATTCCTTATTGAGGTCAAAACTTAGTTATTTGAATTTGAGGTGGATATTCTGTYGGTCCTGCAACTTTGGGCCGATTTTTTTCTTTACATTTTATTTTACCATTTTTAATTCTTGTATTAGTTTTAATACATTTAATTTTTTTACATTTAAAAGGATC
ASO2mtreads     TACCTTGAGGKCAAATATCATATTGGGGAGCAACAGTAATTACTAATCTTGTTAGGGCAATTCCTTATTGAGGTCAAAACTTAGTTATTTGAATTTGAGGTGGATATTCTGTTGGCCCTGCAACTTTGGGCCGATTTTTTTCTTTACATTTTATTTTACCATTTTTAATTCTTGTATTAGTTTTAATACATTTAATTTTTTTACATTTAAAAGGATC
ASO3mtreads     TACCTTGAGGACAAATATCATATTGGGGAGCAACAGTAATTACTAATCTTGTTAGGGCAATTCCTTATTGAGGTCAAAACTTAGTTATTTGAATTTGAGGTGGATATTCTGTAGGCCCTGCAACTTTGGGCCGATTTTTTTCTTTACATTTTATTTTACCATTTTTAATTCTTGTATTAGTTTTAATACATTTAATTTTTTTACATTTRAAAGGATC
ASO4mtreads     TACCTTGAGGGCAAATATCATATTGGGGAGCGACAGTAATTACTAATCTTGTTAGGGCAATTCCTTATTGAGGTCAAAACTTAGTTATTTGAATTTGAGGTGGATATTCTGTTGGTCCTGCAACTTTGGGCCGATTTTTTTCTTTACATTTTATTTTACCATTTTTAATTCTTGTATTAGTTTTAATACATTTAATTTTTTTACATTTAAAAGGATC
ASO5mtreads     -----------CAAATATCATATTGGGGAGCAACAGTAATTACT
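
The consensus sequences contain IUPAC ambiguity codes (for example `Y`, `K`, and `R` above), which mark sites where two bases are possible. A small lookup using the standard IUPAC two-base codes makes them easier to read. `ambiguous_sites` is a hypothetical helper written for this illustration.

```python
## standard IUPAC codes for two-base ambiguities
IUPAC = {
    "R": ("A", "G"),
    "Y": ("C", "T"),
    "K": ("G", "T"),
    "M": ("A", "C"),
    "S": ("C", "G"),
    "W": ("A", "T"),
}

def ambiguous_sites(seq):
    """Return (position, code, bases) for each two-base ambiguity in seq."""
    return [(i, base, IUPAC[base]) for i, base in enumerate(seq) if base in IUPAC]

## the first bases of the ASO2mtreads sequence shown above
print(ambiguous_sites("TACCTTGAGGKCAAATATC"))
```

Positions are zero-based offsets into the string that is passed in.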

### Analysis

In [None]:
## conda install toytree -c eaton-lab
## conda install raxml -c bioconda

In [14]:
### infer a quick tree
import ipyrad.analysis as ipa
import toytree 

rax = ipa.raxml(name=denovo.name, data=denovo.outfiles.phy)
rax.run()

job denovo finished successfully


In [20]:
tre = toytree.tree(rax.trees.bestTree)
tre.draw(width=300, use_edge_lengths=True);