# Canarium GBS Assembly
### *Federman et al.*

This notebook provides all code necessary to reproduce the assembled GBS data sets used in Federman et al. (xxxx). Starting from demultiplexed fastq data files, we assemble the data into four complete data sets for use in downstream analyses. All code in this notebook is written in Python and uses the *ipyrad* package for assembly.

### Imports

In [18]:
import ipyrad as ip
print("ipyrad v.{}".format(ip.__version__))

ipyrad v.0.5.15


### Paths
Set the working directory for this analysis and the location of the fastq data files.

In [2]:
WORKDIR = "/home/deren/Documents/Canarium"
FASTQDIR = "/home/deren/Dropbox/Canarium_GBS/analyses/ipyrad/canarium_test/canarium_test_fastqs/*.gz"
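
Because `FASTQDIR` is a glob pattern rather than a directory, it can be worth confirming that it matches the expected files before starting a long assembly. The following is a minimal sketch using only the standard library; the helper name `find_fastqs` is hypothetical and is not part of *ipyrad*.

```python
import glob
import os


def find_fastqs(pattern):
    """Return the sorted list of file paths matching a glob pattern."""
    # expanduser lets patterns such as "~/data/*.gz" work as well
    return sorted(glob.glob(os.path.expanduser(pattern)))


# Hypothetical usage with the pattern defined above:
# files = find_fastqs(FASTQDIR)
# print("{} fastq files found".format(len(files)))
```

An empty list here would suggest a typo in the path or pattern.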

### Create an Assembly

Enter parameter values for the ipyrad assembly.

In [3]:
## create an Assembly
data = ip.Assembly("Canarium")

## set params
data.set_params("project_dir", "analysis-ipyrad")
data.set_params("sorted_fastq_path", FASTQDIR)
data.set_params("restriction_overhang", ("CWGC", "CWGC"))
data.set_params("datatype", "gbs")
data.set_params("clust_threshold", 0.90)
data.set_params("filter_adapters", 2)
data.set_params("max_SNPs_locus", (10, 10))
data.set_params("max_shared_Hs_locus", 4)
data.set_params("trim_reads", (0, 0))
data.set_params("trim_loci", (0, 5))
data.set_params("output_formats", list("lpsvk"))

## print params
data.get_params()

  New Assembly: Canarium
  0   assembly_name               Canarium                                     
  1   project_dir                 ./analysis-ipyrad                            
  2   raw_fastq_path                                                           
  3   barcodes_path                                                            
  4   sorted_fastq_path           /home/deren/Dropbox/Canarium_GBS/analyses/ipyrad/canarium_test/canarium_test_fastqs/*.gz
  5   assembly_method             denovo                                       
  6   reference_sequence                                                       
  7   datatype                    gbs                                          
  8   restriction_overhang        ('CWGC', 'CWGC')                             
  9   max_low_qual_bases          5                                            
  10  phred_Qscore_offset         33                                           
  11  mindepth_statistical        6                 

### Assemble reads within each Sample

In [4]:
## run steps 1-5
data.run("12345")


  Assembly: Canarium
  [####################] 100%  loading reads         | 0:02:41 | s1 | 
  [####################] 100%  processing reads      | 0:22:16 | s2 | 
  [####################] 100%  dereplicating         | 0:06:14 | s3 | 
  [####################] 100%  clustering            | 14:55:21 | s3 | 
  [####################] 100%  building clusters     | 0:04:03 | s3 | 
  [####################] 100%  chunking              | 0:00:50 | s3 | 
  [####################] 100%  aligning              | 1:15:07 | s3 | 
  [####################] 100%  concatenating         | 0:03:12 | s3 | 
  [####################] 100%  inferring [H, E]      | 0:13:51 | s4 | 
  [####################] 100%  calculating depths    | 0:01:11 | s5 | 
  [####################] 100%  chunking clusters     | 0:01:45 | s5 | 
  [####################] 100%  consens calling       | 0:33:17 | s5 | 


In [5]:
print(data.stats)

         state  reads_raw  reads_passed_filter  clusters_total  \
4304         5      47625                47517            4798   
5573         5    3382649              3366117          511399   
D12950       5    5675773              5657904          456910   
D12962       5     400266               399294          109210   
D12963       5    1033763              1030585          193959   
D13052       5   12539878             12515369          615213   
D13053       5    1694555              1688579          253455   
D13063       5    2192159              2186397          287284   
D13075       5     668679               663737          202341   
D13090       5    2499902              2494238          284933   
D13091       5     225668               222074          102911   
D13097       5   23355083             23267500         3575788   
D13101       5     402493               400708          120855   
D13103       5    2097537              2091313          261868   
D13374    

### Branch to remove super-low data samples

In [7]:
## samples with fewer than 10K consens reads
exclude = ["D12962", "D13091", "D14492", "4304", "SF301", "SF343"]

## keep = list of samples excluding low-data samples
keep = set(data.samples.keys()) - set(exclude)
keep = list(keep)

## new branch with only keep samples
subdata = data.branch("subdata", subsamples=keep)
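
The exclusion list above was entered by hand. As a sketch of how the same kind of list could be derived from a threshold, the helper below takes a mapping of sample names to consensus-read counts and returns the names falling below a cutoff. The function name `low_data_samples` and the example counts are hypothetical and are not values from this assembly.

```python
def low_data_samples(reads_consens, min_reads=10000):
    """Return sorted names of samples with fewer than min_reads consens reads."""
    return sorted(name for name, n in reads_consens.items() if n < min_reads)


# Hypothetical counts for illustration only (not from this assembly)
example_counts = {
    "sampleA": 134609,
    "sampleB": 2500,
    "sampleC": 9999,
    "sampleD": 10000,
}

exclude_example = low_data_samples(example_counts)
keep_example = sorted(set(example_counts) - set(exclude_example))
print(exclude_example)
print(keep_example)
```

With a real Assembly, the mapping could presumably be built from the `reads_consens` column of the stats table, for example `data.stats["reads_consens"].to_dict()`, assuming `data.stats` is a pandas DataFrame as its printed form suggests.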

### Cluster data across samples

In [8]:
subdata.run("6")


  Assembly: subdata
  [####################] 100%  concat/shuffle input  | 0:01:11 | s6 | 
  [####################] 100%  clustering across     | 6:09:08 | s6 | 
  [####################] 100%  building clusters     | 0:01:15 | s6 | 
  [####################] 100%  aligning clusters     | 0:11:11 | s6 | 
  [####################] 100%  database indels       | 0:04:16 | s6 | 
  [####################] 100%  indexing clusters     | 0:08:29 | s6 | 
  [####################] 100%  building database     | 0:28:18 | s6 | 


### Finish assemblies of subdata at minsamp = 4, 10, 20

In [12]:
## assemble data at 3 different minsamp values
min4 = subdata.branch("Canarium_min4")
min4.set_params("min_samples_locus", 4)
min4.run("7")

min10 = subdata.branch("Canarium_min10")
min10.set_params("min_samples_locus", 10)
min10.run("7")

min20 = subdata.branch("Canarium_min20")
min20.set_params("min_samples_locus", 20)
min20.run("7")


  Assembly: Canarium_min4
  [####################] 100%  filtering loci        | 0:00:09 | s7 | 
  [####################] 100%  building loci/stats   | 0:00:06 | s7 | 
  [####################] 100%  building vcf file     | 0:00:54 | s7 | 
  [####################] 100%  writing vcf file      | 0:00:00 | s7 | 
  [####################] 100%  building arrays       | 0:00:08 | s7 | 
  [####################] 100%  writing outfiles      | 0:05:49 | s7 | 
  Outfiles written to: ~/Documents/Canarium/analysis-ipyrad/Canarium_min4_outfiles

  Assembly: Canarium_min10
  [####################] 100%  filtering loci        | 0:00:09 | s7 | 
  [####################] 100%  building loci/stats   | 0:00:06 | s7 | 
  [####################] 100%  building vcf file     | 0:00:39 | s7 | 
  [####################] 100%  writing vcf file      | 0:00:00 | s7 | 
  [####################] 100%  building arrays       | 0:00:08 | s7 | 
  [####################] 100%  writing outfiles      | 0:03:34 | s7 | 
  Outfiles
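
The three branch/set/run blocks above follow the same pattern, so they could also be written as a small helper. This is only a sketch: the function name `finish_at_minsamp` is hypothetical, and it relies solely on the `branch`, `set_params`, and `run` calls already used in this notebook.

```python
def finish_at_minsamp(assembly, minsamp, prefix="Canarium_min"):
    """Branch an Assembly, set min_samples_locus, and run step 7."""
    branch = assembly.branch("{}{}".format(prefix, minsamp))
    branch.set_params("min_samples_locus", minsamp)
    branch.run("7")
    return branch


# Usage with the `subdata` Assembly from above (needs ipyrad and step-6 results):
# finished = {m: finish_at_minsamp(subdata, m) for m in (4, 10, 20)}
# min4, min10, min20 = finished[4], finished[10], finished[20]
```

Keeping the finished branches in a dictionary keyed by the minsamp value makes it easy to loop over them later when comparing stats.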

### Create assembly for min20 without outgroups
This is the data set that we will use in *structure* analyses. 

In [13]:
## exclude outgroup samples
outgroups = ["D13852", "D13374", "SFC1988", "D14269"]
keep = set(subdata.samples.keys()) - set(outgroups)
keep = list(keep)

## min20 without outgroups
min20no = min20.branch("Canarium_min20no", subsamples=keep)
min20no.run("7", force=True)


  Assembly: Canarium_min20no
  [####################] 100%  filtering loci        | 0:00:27 | s7 | 
  [####################] 100%  building loci/stats   | 0:00:05 | s7 | 
  [####################] 100%  building vcf file     | 0:00:23 | s7 | 
  [####################] 100%  writing vcf file      | 0:00:00 | s7 | 
  [####################] 100%  building arrays       | 0:00:19 | s7 | 
  [####################] 100%  writing outfiles      | 0:01:38 | s7 | 
  Outfiles written to: ~/Documents/Canarium/analysis-ipyrad/Canarium_min20no_outfiles


### Assembly stats

In [26]:
cat $min4.stats_files.s7



## The number of loci caught by each filter.
## ipyrad API location: [assembly].statsfiles.s7_filters

                            total_filters  applied_order  retained_loci
total_prefiltered_loci             438501              0         438501
filtered_by_rm_duplicates           44752          44752         393749
filtered_by_max_indels               1095           1095         392654
filtered_by_max_snps                48453          47952         344702
filtered_by_max_shared_het          64029          35269         309433
filtered_by_min_sample             130218         129761         179672
filtered_by_max_alleles            134200          30084         149588
total_filtered_loci                149588              0         149588


## The number of loci recovered for each Sample.
## ipyrad API location: [assembly].stats_dfs.s7_samples

         sample_coverage
5573               36385
D12950             43124
D12963              9677
D13052          

In [27]:
cat $min10.stats_files.s7



## The number of loci caught by each filter.
## ipyrad API location: [assembly].statsfiles.s7_filters

                            total_filters  applied_order  retained_loci
total_prefiltered_loci             438501              0         438501
filtered_by_rm_duplicates           44752          44752         393749
filtered_by_max_indels               1095           1095         392654
filtered_by_max_snps                48453          47952         344702
filtered_by_max_shared_het          64029          35269         309433
filtered_by_min_sample             216082         210755          98678
filtered_by_max_alleles            134200          20817          77861
total_filtered_loci                 77861              0          77861


## The number of loci recovered for each Sample.
## ipyrad API location: [assembly].stats_dfs.s7_samples

         sample_coverage
5573               32784
D12950             37214
D12963              8455
D13052          

In [28]:
cat $min20.stats_files.s7



## The number of loci caught by each filter.
## ipyrad API location: [assembly].statsfiles.s7_filters

                            total_filters  applied_order  retained_loci
total_prefiltered_loci             438501              0         438501
filtered_by_rm_duplicates           44752          44752         393749
filtered_by_max_indels               1095           1095         392654
filtered_by_max_snps                48453          47952         344702
filtered_by_max_shared_het          64029          35269         309433
filtered_by_min_sample             268400         251572          57861
filtered_by_max_alleles            134200          14461          43400
total_filtered_loci                 43400              0          43400


## The number of loci recovered for each Sample.
## ipyrad API location: [assembly].stats_dfs.s7_samples

         sample_coverage
5573               27298
D12950             30311
D12963              7090
D13052          
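
In the filter tables above, `retained_loci` appears to be a running total: each row equals the previous row minus that filter's `applied_order` count. The sketch below checks this reading against the min4 table; the helper name `retained_after_filters` is hypothetical, and the numbers are copied from the table.

```python
# (filter name, applied_order) pairs copied from the min4 table above
MIN4_FILTERS = [
    ("filtered_by_rm_duplicates", 44752),
    ("filtered_by_max_indels", 1095),
    ("filtered_by_max_snps", 47952),
    ("filtered_by_max_shared_het", 35269),
    ("filtered_by_min_sample", 129761),
    ("filtered_by_max_alleles", 30084),
]


def retained_after_filters(prefiltered, filters):
    """Subtract each applied_order count in turn and return the running totals."""
    retained = []
    remaining = prefiltered
    for _name, removed in filters:
        remaining -= removed
        retained.append(remaining)
    return retained


running = retained_after_filters(438501, MIN4_FILTERS)
for (name, _removed), n in zip(MIN4_FILTERS, running):
    print("{:<28}{:>8}".format(name, n))
```

The final running total is 149588, matching `total_filtered_loci` for min4. The first four filters remove the same loci in all three assemblies; only the `min_sample` and `max_alleles` rows differ, which is why the retained totals drop to 77861 and 43400 at minsamp 10 and 20.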

In [14]:
min4.stats

Unnamed: 0,state,reads_raw,reads_passed_filter,clusters_total,clusters_hidepth,hetero_est,error_est,reads_consens
5573,6,3382649,3366117,511399,137393,0.01972,0.002771,134609
D12950,6,5675773,5657904,456910,158327,0.018282,0.002557,155112
D12963,6,1033763,1030585,193959,49334,0.022711,0.00292,47204
D13052,6,12539878,12515369,615213,218911,0.017515,0.002578,214796
D13053,6,1694555,1688579,253455,91686,0.021762,0.003212,88939
D13063,6,2192159,2186397,287284,111030,0.019954,0.002964,108564
D13075,6,668679,663737,202341,23966,0.024076,0.003389,22740
D13090,6,2499902,2494238,284933,107659,0.019692,0.002954,105002
D13097,6,23355083,23267500,3575788,249250,0.026899,0.00255,237432
D13101,6,402493,400708,120855,15553,0.02212,0.003482,14816


In [15]:
min10.stats

Unnamed: 0,state,reads_raw,reads_passed_filter,clusters_total,clusters_hidepth,hetero_est,error_est,reads_consens
5573,6,3382649,3366117,511399,137393,0.01972,0.002771,134609
D12950,6,5675773,5657904,456910,158327,0.018282,0.002557,155112
D12963,6,1033763,1030585,193959,49334,0.022711,0.00292,47204
D13052,6,12539878,12515369,615213,218911,0.017515,0.002578,214796
D13053,6,1694555,1688579,253455,91686,0.021762,0.003212,88939
D13063,6,2192159,2186397,287284,111030,0.019954,0.002964,108564
D13075,6,668679,663737,202341,23966,0.024076,0.003389,22740
D13090,6,2499902,2494238,284933,107659,0.019692,0.002954,105002
D13097,6,23355083,23267500,3575788,249250,0.026899,0.00255,237432
D13101,6,402493,400708,120855,15553,0.02212,0.003482,14816


In [16]:
min20.stats

Unnamed: 0,state,reads_raw,reads_passed_filter,clusters_total,clusters_hidepth,hetero_est,error_est,reads_consens
5573,6,3382649,3366117,511399,137393,0.01972,0.002771,134609
D12950,6,5675773,5657904,456910,158327,0.018282,0.002557,155112
D12963,6,1033763,1030585,193959,49334,0.022711,0.00292,47204
D13052,6,12539878,12515369,615213,218911,0.017515,0.002578,214796
D13053,6,1694555,1688579,253455,91686,0.021762,0.003212,88939
D13063,6,2192159,2186397,287284,111030,0.019954,0.002964,108564
D13075,6,668679,663737,202341,23966,0.024076,0.003389,22740
D13090,6,2499902,2494238,284933,107659,0.019692,0.002954,105002
D13097,6,23355083,23267500,3575788,249250,0.026899,0.00255,237432
D13101,6,402493,400708,120855,15553,0.02212,0.003482,14816


In [17]:
min20no.stats

Unnamed: 0,state,reads_raw,reads_passed_filter,clusters_total,clusters_hidepth,hetero_est,error_est,reads_consens
5573,6,3382649,3366117,511399,137393,0.01972,0.002771,134609
D12950,6,5675773,5657904,456910,158327,0.018282,0.002557,155112
D12963,6,1033763,1030585,193959,49334,0.022711,0.00292,47204
D13052,6,12539878,12515369,615213,218911,0.017515,0.002578,214796
D13053,6,1694555,1688579,253455,91686,0.021762,0.003212,88939
D13063,6,2192159,2186397,287284,111030,0.019954,0.002964,108564
D13075,6,668679,663737,202341,23966,0.024076,0.003389,22740
D13090,6,2499902,2494238,284933,107659,0.019692,0.002954,105002
D13097,6,23355083,23267500,3575788,249250,0.026899,0.00255,237432
D13101,6,402493,400708,120855,15553,0.02212,0.003482,14816
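
The per-sample values shown are identical across the four branches, since they all come from the same step 1-5 results. As a sketch of how these tables can be summarized, the block below parses three rows copied from the output above and computes the fraction of clusters that reached high depth. The first column header is renamed from `Unnamed: 0` to `sample` for readability, and the helper name `hidepth_fraction` is hypothetical.

```python
import csv

# Three rows copied from the stats tables above; first header renamed to "sample"
STATS_TEXT = """sample,state,reads_raw,reads_passed_filter,clusters_total,clusters_hidepth,hetero_est,error_est,reads_consens
5573,6,3382649,3366117,511399,137393,0.01972,0.002771,134609
D13075,6,668679,663737,202341,23966,0.024076,0.003389,22740
D13097,6,23355083,23267500,3575788,249250,0.026899,0.00255,237432
"""


def hidepth_fraction(text):
    """Map each sample name to clusters_hidepth / clusters_total."""
    reader = csv.DictReader(text.splitlines())
    return {
        row["sample"]: int(row["clusters_hidepth"]) / float(row["clusters_total"])
        for row in reader
    }


fractions = hidepth_fraction(STATS_TEXT)
for sample in sorted(fractions):
    print("{:<8}{:.3f}".format(sample, fractions[sample]))
```

In these three rows the fraction ranges from roughly 7% to 27%, so raw read count alone does not indicate how many clusters are usable for consensus calling.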


### Visualize shared data

In [None]:
## ...