# Notebook 2: Assemble RAD-seq data sets

See Notebook 1 for the code that downloads the fastq data files. In this notebook we merge technical replicates into single samples and then assemble RAD-seq data sets in *ipyrad* with both the denovo and the reference-based assembly methods.

In [1]:
# conda install ipyrad -c ipyrad

In [2]:
import os
import ipyrad as ip
import ipyparallel as ipp

In [3]:
ip.__version__

'0.8.0-dev'

### Connect to parallel client

In [19]:
ipyclient = ipp.Client()
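
`ipp.Client()` connects to an ipyparallel cluster that is already running (for example one started with `ipcluster start`), and engines can take a few seconds to register. The sketch below waits until enough engines are visible before any assembly step is run. The helper name `wait_for_engines` and its arguments are hypothetical; it relies only on the client's `ids` attribute, which lists the registered engines.

```python
import time


def wait_for_engines(client, minimum=1, timeout=30.0, poll=0.5):
    """Block until `client.ids` lists at least `minimum` engines.

    `client` is expected to behave like an ipyparallel Client.
    Hypothetical helper, not part of ipyrad or ipyparallel.
    """
    deadline = time.monotonic() + timeout
    while True:
        n_engines = len(client.ids)
        if n_engines >= minimum:
            return n_engines
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"only {n_engines} of {minimum} engines connected after {timeout}s"
            )
        time.sleep(poll)
```

In this notebook it could be called as `wait_for_engines(ipyclient, minimum=4)`, where 4 is an arbitrary example value.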

### Assemble Step 1: Load fastq data and merge technical replicate samples

In [5]:
WORKDIR = os.path.realpath("../analysis-ipyrad")
WORKDIR

'/home/deren/Documents/virentes-reference/analysis-ipyrad'

In [21]:
# load the first RAD-seq library
lib1 = ip.Assembly("lib1")
lib1.set_params("sorted_fastq_path", "../rawdata/radseq/*_v_*.fastq.gz")
lib1.set_params("project_dir", WORKDIR)
lib1.run("1")

New Assembly: lib1
Assembly: lib1
[####################] 100% 0:00:19 | loading reads        | s1 |


In [20]:
# load the 'replicates' library
lib2 = ip.Assembly("lib2")
lib2.set_params("sorted_fastq_path", "../rawdata/radseq/*_re_*.fastq.gz")
lib2.set_params("project_dir", WORKDIR)
lib2.run("1")

New Assembly: lib2
Assembly: lib2
[####################] 100% 0:00:24 | loading reads        | s1 |
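
The two libraries are separated only by the wildcard patterns `*_v_*.fastq.gz` and `*_re_*.fastq.gz`, so it is worth checking that every file matches exactly one of them. A minimal sketch, assuming shell-style matching as mimicked by `fnmatch`; the file names are hypothetical and stand in for the real contents of `../rawdata/radseq/`.

```python
from fnmatch import fnmatch

# Hypothetical file names, for illustration only.
filenames = [
    "BJSL25_v_R1_.fastq.gz",
    "BJSL25_re_R1_.fastq.gz",
    "AR_v_R1_.fastq.gz",
]

# same patterns as used for lib1 and lib2 above
lib1_files = [f for f in filenames if fnmatch(f, "*_v_*.fastq.gz")]
lib2_files = [f for f in filenames if fnmatch(f, "*_re_*.fastq.gz")]

# every file should land in exactly one library
unassigned = set(filenames) - set(lib1_files) - set(lib2_files)
overlap = set(lib1_files) & set(lib2_files)

print("lib1:", lib1_files)
print("lib2:", lib2_files)
print("unassigned:", sorted(unassigned), "overlap:", sorted(overlap))
```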


#### Merge technical replicates 

In [44]:
# rename sample dict keys in each library: keep only the text before the first
# underscore, which drops the _v or _re ending
lib1.samples = {name.split("_")[0]:sample for (name, sample) in lib1.samples.items()}
lib2.samples = {name.split("_")[0]:sample for (name, sample) in lib2.samples.items()}

# rename .name attribute of sample objects in the same way
for sample in lib1.samples.values():
    sample.name = sample.name.split("_")[0] 
for sample in lib2.samples.values():
    sample.name = sample.name.split("_")[0]     
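
Because `name.split("_")[0]` keeps only the text before the first underscore, two names that share that prefix would map to the same key, and the dict comprehension would silently keep just one of them. The sketch below checks for such collisions before renaming. The helper names and sample names are hypothetical.

```python
from collections import Counter


def base_name(name):
    """Return the text before the first underscore, as in the renaming cell."""
    return name.split("_")[0]


def find_collisions(names):
    """Return base names that more than one original name maps to."""
    counts = Counter(base_name(n) for n in names)
    return sorted(b for b, c in counts.items() if c > 1)


# Hypothetical sample names; the last two share the prefix "FL".
names = ["BJSL25_v", "AR_v", "FL_1_v", "FL_2_v"]
print(find_collisions(names))  # → ['FL']
```

In the notebook this could be run as `find_collisions(lib1.samples)` and `find_collisions(lib2.samples)` before the renaming cell; an empty list means the renaming loses no samples.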

In [45]:
# which sample names match between libraries (and will be merged)?
set(lib1.samples).intersection(set(lib2.samples))

{'BJSL25', 'BJVL19', 'CRL0001', 'CRL0030', 'FLAB109', 'FLBA140', 'FLSF54'}

In [46]:
# merge libraries (merges samples with the same sample name)
libmerge = ip.merge("libmerge", (lib1, lib2))
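
To illustrate what merging by name means for read counts, here is a toy sketch that assumes a merged sample simply pools the reads of its replicates. It is not how *ipyrad* implements `merge` internally, and the counts are hypothetical.

```python
from collections import Counter

# Hypothetical raw read counts per sample in each library.
lib1_reads = Counter({"BJSL25": 3_000_000, "AR": 4_000_000})
lib2_reads = Counter({"BJSL25": 3_300_000})

# Counter addition sums counts for names present in both libraries
# and carries over names present in only one.
merged_reads = lib1_reads + lib2_reads

for name, n_reads in sorted(merged_reads.items()):
    print(name, n_reads)
```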

### Assemble Step 2: Filter reads 

In [48]:
libmerge.set_params("filter_adapters", 2)
libmerge.run("2", ipyclient=ipyclient)

Assembly: libmerge
[####################] 100% 0:00:00 | concatenating inputs | s2 |
[####################] 100% 0:03:33 | processing reads     | s2 |


In [17]:
# show N raw reads per sample after merging technical replicates
libmerge.stats.head()

        state  reads_raw  reads_passed_filter
AR          2    4046890              4029027
BJSB3       2     931926               925265
BJSL25      2    6322202              6285975
BJVL19      2    5533067              5506835
BZBB1       2     849191               843394
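
A quick way to read this table is as the fraction of reads that passed the step 2 filters. The sketch below recomputes that fraction from the five rows shown above, using plain Python so that it runs without *ipyrad*.

```python
# (reads_raw, reads_passed_filter), copied from the table above
stats = {
    "AR": (4046890, 4029027),
    "BJSB3": (931926, 925265),
    "BJSL25": (6322202, 6285975),
    "BJVL19": (5533067, 5506835),
    "BZBB1": (849191, 843394),
}

passed_fraction = {name: passed / raw for name, (raw, passed) in stats.items()}

for name, frac in passed_fraction.items():
    print(f"{name}: {frac:.4f}")
```

Since `libmerge.stats` supports `.head()` and has these two columns, the same ratio can presumably be computed for all samples as `libmerge.stats["reads_passed_filter"] / libmerge.stats["reads_raw"]`.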


### Assemble Steps 3-7: Branch to assemble denovo data set


In [8]:
# create denovo branch and set params for assembly
denovo = libmerge.branch("denovo")
denovo.set_params("trim_loci", (0, 5, 0, 0))
denovo.set_params("output_formats", "*")
denovo.get_params()

0   assembly_name               denovo                                       
1   project_dir                 /home/deren/Documents/virentes-reference/analysis-ipyrad
2   raw_fastq_path                                                           
3   barcodes_path               Merged: lib1, lib2                           
4   sorted_fastq_path                                                        
5   assembly_method             denovo                                       
6   reference_sequence                                                       
7   datatype                    rad                                          
8   restriction_overhang        ('TGCAG', '')                                
9   max_low_qual_bases          5                                            
10  phred_Qscore_offset         33                                           
11  mindepth_statistical        6                                            
12  mindepth_majrule            6                    

In [10]:
denovo.run("34567", ipyclient=ipyclient)

Assembly: denovo
[####################] 100% 0:00:00 | concatenating        | s3 |
[####################] 100% 0:00:19 | dereplicating        | s3 |
[####################] 100% 0:04:29 | clustering/mapping   | s3 |
[####################] 100% 0:00:04 | building clusters    | s3 |
[####################] 100% 0:00:00 | chunking clusters    | s3 |
[####################] 100% 0:09:37 | aligning clusters    | s3 |
[####################] 100% 0:00:07 | concat clusters      | s3 |
[####################] 100% 0:00:03 | calc cluster stats   | s3 |
[####################] 100% 0:00:22 | inferring [H, E]     | s4 |
[####################] 100% 0:00:03 | calculating depths   | s5 |
[####################] 100% 0:00:05 | chunking clusters    | s5 |
[####################] 100% 0:07:13 | consens calling      | s5 |
[####################] 100% 0:00:15 | indexing alleles     | s5 |
[####################] 100% 0:00:15 | concatenating inputs | s6 |
[####################] 100% 0:08:03 | clustering tier 1    

### Assemble Steps 3-7: Branch to assemble reference data set


In [15]:
# create reference branch and set params for assembly
reference = denovo.branch("reference")
reference.set_params("assembly_method", "reference")
reference.set_params("reference_sequence", "../rawdata/Qrob_PM1N.fa")
reference.get_params()

0   assembly_name               reference                                    
1   project_dir                 /home/deren/Documents/virentes-reference/analysis-ipyrad
2   raw_fastq_path                                                           
3   barcodes_path               Merged: lib1, lib2                           
4   sorted_fastq_path                                                        
5   assembly_method             reference                                    
6   reference_sequence          /home/deren/Documents/virentes-reference/rawdata/Qrob_PM1N.fa
7   datatype                    rad                                          
8   restriction_overhang        ('TGCAG', '')                                
9   max_low_qual_bases          5                                            
10  phred_Qscore_offset         33                                           
11  mindepth_statistical        6                                            
12  mindepth_majrule            6    

In [16]:
# force=True reruns these steps even where results already exist
reference.run("34567", ipyclient=ipyclient, force=True)

Assembly: reference
[####################] 100% 0:13:43 | indexing reference   | s3 |
[####################] 100% 0:00:00 | concatenating        | s3 |
[####################] 100% 0:00:19 | dereplicating        | s3 |
[####################] 100% 0:02:04 | mapping reads        | s3 |
[####################] 100% 0:01:30 | building clusters    | s3 |
[####################] 100% 0:00:02 | calc cluster stats   | s3 |
[####################] 100% 0:00:37 | inferring [H, E]     | s4 |
[####################] 100% 0:00:02 | calculating depths   | s5 |
[####################] 100% 0:00:03 | chunking clusters    | s5 |
[####################] 100% 0:08:03 | consens calling      | s5 |
[####################] 100% 0:00:18 | indexing alleles     | s5 |
[####################] 100% 0:00:17 | concatenating bams   | s6 |
[####################] 100% 0:00:05 | fetching regions     | s6 |
[####################] 100% 0:00:05 | building loci        | s6 |
[####################] 100% 0:00:14 | applying filters  

#### Assemble min4, min10, and min20 data sets

In [None]:
# ...
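
The code for this cell is elided above. As a conceptual aid only, the toy sketch below assumes that "minN" refers to a minimum sample coverage per locus, so that a locus is kept only if at least N samples have data for it. It is not the author's code and makes no *ipyrad* calls; the function name and loci are hypothetical.

```python
def filter_loci(loci, min_samples):
    """Keep loci that have data for at least `min_samples` samples."""
    return {
        locus: samples
        for locus, samples in loci.items()
        if len(samples) >= min_samples
    }


# Hypothetical loci, each mapped to the set of samples with data at that locus.
loci = {
    "locus_0": {"AR", "BJSB3", "BJSL25", "BJVL19"},
    "locus_1": {"AR", "BJSB3"},
    "locus_2": {"AR", "BJSB3", "BJSL25", "BJVL19", "BZBB1"},
}

# Stricter thresholds keep fewer loci but leave less missing data.
kept = {n: sorted(filter_loci(loci, n)) for n in (2, 4, 5)}
for n, names in kept.items():
    print(f"min{n}: {names}")
```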