# _ipyrad_ testing tutorial

### Getting started
Import _ipyrad_ and remove any previous test files if they are already present.

In [1]:
## import modules
import ipyrad as ip                ## import ipyrad under the alias ip
print "version", ip.__version__    ## print version

DEBUG:ipyrad:H4CKERZ-mode: __loglevel__ = DEBUG


version 0.1.31


In [2]:
## clear data from test directory if it already exists
import shutil
import os
if os.path.exists("./test_rad/"):
    shutil.rmtree("./test_rad/")

### Assembly and Sample objects

Assembly and Sample objects are used by _ipyrad_ to access data stored on disk and to manipulate it. Each biological sample in a data set is represented in a Sample object, and a set of Samples is stored inside an Assembly object. The Assembly object has functions to assemble the data, and stores a log of all steps performed and the resulting statistics of those steps. Assembly objects can be copied or merged to allow branching events where different parameters can subsequently be applied to different Assemblies going forward. Examples of this are shown below.

We'll begin by creating a single Assembly object named "data1". It is created with a set of default assembly parameters and without any Samples linked to it. The name provided will be used in the output files that this Assembly creates.

In [3]:
## create an Assembly object named data1. 
data1 = ip.Assembly("data1")

  New Assembly: data1
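
To make the relationship between the two object types concrete, here is a minimal sketch in plain Python of an Assembly-like container that holds Sample-like records. The class names `ToySample` and `ToyAssembly`, and all of their attributes, are hypothetical stand-ins for illustration and are not the _ipyrad_ classes.

```python
class ToySample(object):
    """Hypothetical stand-in for a Sample: a name, a state, and file links."""
    def __init__(self, name):
        self.name = name
        self.state = 0                  # how far this sample has progressed
        self.files = {"fastqs": []}     # paths only, no sequence data


class ToyAssembly(object):
    """Hypothetical stand-in for an Assembly: parameters, samples, and a log."""
    def __init__(self, name):
        self.name = name
        self.params = {"working_directory": "."}
        self.samples = {}               # sample name -> ToySample
        self.log = []

    def add_sample(self, sample):
        self.samples[sample.name] = sample
        self.log.append("added sample {}".format(sample.name))


toy = ToyAssembly("data1")
for sample_name in ["1A_0", "1B_0"]:
    toy.add_sample(ToySample(sample_name))

print(sorted(toy.samples))
```

The point of the sketch is only that Samples live inside an Assembly and that the Assembly keeps the parameters and the log in one place.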


### Modifying assembly parameters
An Assembly object's parameter settings can be viewed using its `get_params()` function. To get more detailed information about all parameters, use the function `ip.get_params_info()`, or select a single parameter with `ip.get_params_info(N)`, where N is the number or string representation of a parameter. Assembly objects have a function `set_params()` that is used to modify parameters, as shown below.

In [4]:
## modify parameters for this Assembly object
data1.set_params('working_directory', "./test_rad")
data1.set_params('raw_fastq_path', "./data/sim_rad2*.fastq.gz")
data1.set_params('barcodes_path', "./data/sim_rad2*_barcodes.txt")
data1.set_params('filter_adapters', 0)
data1.set_params('datatype', 'rad')

## test on real data
#data1.set_params(2, "~/Dropbox/UO_C353_1.fastq.part-aa.gz")
#data1.set_params(3, "/home/deren/Dropbox/Viburnum_revised.barcodes")

## print the new parameters to screen
data1.get_params()

  1   working_directory           ./test_rad                                   
  2   raw_fastq_path              ./data/sim_rad2*.fastq.gz                    
  3   barcodes_path               ./data/sim_rad2*_barcodes.txt                
  4   sorted_fastq_path                                                        
  5   assembly_method             denovo                                       
  6   reference_sequence                                                       
  7   datatype                    rad                                          
  8   restriction_overhang        ('TGCAG', '')                                
  9   max_low_qual_bases          5                                            
  10  phred_Qscore_offset         33                                           
  11  mindepth_statistical        6                                            
  12  mindepth_majrule            6                                            
  13  maxdepth                    1000  
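
Parameters can be addressed either by number or by name. One way such a lookup could work is sketched below. The function `resolve_param` and the list `PARAM_NAMES` are hypothetical and only cover the first three rows of the table above; this is not how _ipyrad_ necessarily implements it.

```python
# First three parameter names from the table above, in table order.
PARAM_NAMES = ["working_directory", "raw_fastq_path", "barcodes_path"]


def resolve_param(key):
    """Return the parameter name for a 1-based number or a name string."""
    if isinstance(key, int):
        if not 1 <= key <= len(PARAM_NAMES):
            raise KeyError("unknown parameter number: {}".format(key))
        return PARAM_NAMES[key - 1]
    if key in PARAM_NAMES:
        return key
    raise KeyError("unknown parameter: {}".format(key))


params = {name: "" for name in PARAM_NAMES}
params[resolve_param(1)] = "./test_rad"
params[resolve_param("barcodes_path")] = "./data/sim_rad2*_barcodes.txt"

print(params["working_directory"])
```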

### Starting data
If the data are already demultiplexed, fastq files can be linked directly to the Assembly object, which will either create new Sample objects from them or link them to existing Sample objects based on the file names (for paired data, each Sample is linked to a pair of fastq files). The files may be gzip compressed. If the data are not demultiplexed, run the step1 function below to demultiplex the raw data.

In [5]:
## This would link fastq files from the 'sorted_fastq_path' if present.
## It is commented out here because it raises an error when there are no files in the sorted_fastq_path.

#data1.link_fastqs() #path="./test_rad/data1_fastqs/*")
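
As a rough illustration of linking files to Samples by file name, the sketch below groups fastq file names by sample. The naming pattern is an assumption based on the file names that appear in this notebook (for example `1C_0_R1_.fastq.gz`), the `_R2_` suffix for the second read of a pair is also an assumption, and `group_by_sample` is a hypothetical helper rather than part of _ipyrad_.

```python
import re
from collections import defaultdict

# Assumed naming convention: <sample>_R1_.fastq.gz, with a matching
# <sample>_R2_.fastq.gz for paired data.
FASTQ_NAME = re.compile(r"^(?P<sample>.+)_R(?P<read>[12])_\.fastq(\.gz)?$")


def group_by_sample(filenames):
    """Map each sample name to its sorted list of fastq file names."""
    groups = defaultdict(list)
    for filename in filenames:
        match = FASTQ_NAME.match(filename)
        if match:                       # skip files that do not fit the pattern
            groups[match.group("sample")].append(filename)
    return {name: sorted(files) for name, files in groups.items()}


example_files = ["1A_0_R1_.fastq.gz", "1B_0_R1_.fastq.gz", "notes.txt"]
linked = group_by_sample(example_files)

print(sorted(linked))
```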

### Step 1: Demultiplexing raw data files
Step 1 uses barcode information to demultiplex the data files found in param 2 (`raw_fastq_path`) and creates a Sample object for each barcoded sample. Below we use the `step1()` function to demultiplex. The `stats` attribute of an Assembly object is returned as a `pandas` data frame.

In [6]:
## run step 1 to demultiplex the data
data1.step1(force=True)

## print the results for each Sample in data1
print data1.stats

INFO:ipyrad.core.assembly:try 10: starting controller
DEBUG:ipyrad.core.assembly:OK! Connected to (4) engines
INFO:ipyrad.assemble.demultiplex:precheck optim=1636
DEBUG:ipyrad.assemble.util:zcat splittin' sim_rad2_R1_.fastq.gz
INFO:ipyrad.assemble.util:zcat is using optim = 1636
INFO:ipyrad.assemble.demultiplex:Executing 1 files, in 1 chunks, across 4 cpus
DEBUG:ipyrad.assemble.demultiplex:in parallel_sorter
DEBUG:ipyrad.assemble.demultiplex:gzipping 2G_0
DEBUG:ipyrad.assemble.demultiplex:gzipping 3K_0
DEBUG:ipyrad.assemble.demultiplex:gzipping 3J_0
DEBUG:ipyrad.assemble.demultiplex:gzipping 2E_0
DEBUG:ipyrad.assemble.demultiplex:gzipping 1A_0
DEBUG:ipyrad.assemble.demultiplex:gzipping 1B_0
DEBUG:ipyrad.assemble.demultiplex:gzipping 3I_0
DEBUG:ipyrad.assemble.demultiplex:gzipping 3L_0
DEBUG:ipyrad.assemble.demultiplex:gzipping 2F_0
DEBUG:ipyrad.assemble.demultiplex:gzipping 1C_0
DEBUG:ipyrad.assemble.demultiplex:gzipping 1D_0
DEBUG:ipyrad.assemble.demultiplex:gzipping 2H_0


      state  reads_raw
1A_0      1       1109
1B_0      1       1065
1C_0      1       1021
1D_0      1       1117
2E_0      1       1085
2F_0      1       1197
2G_0      1       1033
2H_0      1       1086
3I_0      1       1110
3J_0      1       1140
3K_0      1       1138
3L_0      1       1015
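
The idea behind demultiplexing can be sketched in a few lines: each read is assigned to the sample whose barcode it starts with, and the per-sample counts are what the `reads_raw` column reports. The barcodes and reads below are made up, and the sketch only does exact prefix matching; it is a simplification and not the _ipyrad_ implementation.

```python
# Hypothetical barcodes and reads; real barcodes come from the barcodes file.
barcodes = {"CATCAT": "1A_0", "GGTTAA": "1B_0"}
reads = ["CATCATTGCAGAAAC", "GGTTAATGCAGCCCT", "TTTTTTTGCAGGGGA"]


def demultiplex(reads, barcodes):
    """Assign each read to the sample whose barcode starts the read."""
    sorted_reads = {sample: [] for sample in barcodes.values()}
    unmatched = []
    for read in reads:
        for barcode, sample in barcodes.items():
            if read.startswith(barcode):
                # strip the barcode so only the insert sequence is kept
                sorted_reads[sample].append(read[len(barcode):])
                break
        else:
            # no barcode matched this read
            unmatched.append(read)
    return sorted_reads, unmatched


sorted_reads, unmatched = demultiplex(reads, barcodes)
counts = {sample: len(seqs) for sample, seqs in sorted_reads.items()}

print(counts)
```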


In [7]:
data1.statsfiles

{'s1':       reads_raw
 1A_0       1109
 1B_0       1065
 1C_0       1021
 1D_0       1117
 2E_0       1085
 2F_0       1197
 2G_0       1033
 2H_0       1086
 3I_0       1110
 3J_0       1140
 3K_0       1138
 3L_0       1015}

In [8]:
data1.samples

{'1A_0': <ipyrad.core.sample.Sample at 0x7fc9516a7710>,
 '1B_0': <ipyrad.core.sample.Sample at 0x7fc9516ba910>,
 '1C_0': <ipyrad.core.sample.Sample at 0x7fc951734e50>,
 '1D_0': <ipyrad.core.sample.Sample at 0x7fc951734190>,
 '2E_0': <ipyrad.core.sample.Sample at 0x7fc9516a7810>,
 '2F_0': <ipyrad.core.sample.Sample at 0x7fc951723c50>,
 '2G_0': <ipyrad.core.sample.Sample at 0x7fc9840b9bd0>,
 '2H_0': <ipyrad.core.sample.Sample at 0x7fc951734750>,
 '3I_0': <ipyrad.core.sample.Sample at 0x7fc9516ae250>,
 '3J_0': <ipyrad.core.sample.Sample at 0x7fc951647d10>,
 '3K_0': <ipyrad.core.sample.Sample at 0x7fc951647290>,
 '3L_0': <ipyrad.core.sample.Sample at 0x7fc9516aea90>}

In [9]:
#data1.save("data/saved_states/state1_rad.assembly")

In [10]:
## remove the lane control sequence
#data1.samples.pop("FGXCONTROL")
data1.dirs

{'fastqs': '/home/deren/Documents/ipyrad/tests/test_rad/data1_fastqs',
 'working': '/home/deren/Documents/ipyrad/tests/test_rad'}

### Step 2: Filter reads 
To run on only a subset of the data, pass the names of the selected samples to the `step2` function. Because `step2` is a method of `data1`, it always executes with the parameters that are linked to `data1`.

In [11]:
## example of ways to run step 2 to filter and trim reads

#data1.step2(force=True)   ## run on all samples
data1.step2(["1B_0", "1C_0"], force=True)      ## run on one or more samples

## print the results
print data1.statsfiles.s2

INFO:ipyrad.core.assembly:try 10: starting controller
DEBUG:ipyrad.core.assembly:OK! Connected to (4) engines
INFO:ipyrad.assemble.rawedit:optim=532
DEBUG:ipyrad.assemble.util:zcat splittin' 1B_0_R1_.fastq.gz
INFO:ipyrad.assemble.util:zcat is using optim = 532
INFO:ipyrad.assemble.rawedit:Executing 1 file, in 9 chunks, across 4 cpus
INFO:ipyrad.assemble.rawedit:optim=508
DEBUG:ipyrad.assemble.util:zcat splittin' 1C_0_R1_.fastq.gz
INFO:ipyrad.assemble.util:zcat is using optim = 508
INFO:ipyrad.assemble.rawedit:Executing 1 file, in 9 chunks, across 4 cpus


      reads_raw  filtered_by_qscore  filtered_by_adapter  reads_passed
1A_0        NaN                 NaN                  NaN           NaN
1B_0       1065                   0                    0          1065
1C_0       1021                   0                    0          1021
1D_0        NaN                 NaN                  NaN           NaN
2E_0        NaN                 NaN                  NaN           NaN
2F_0        NaN                 NaN                  NaN           NaN
2G_0        NaN                 NaN                  NaN           NaN
2H_0        NaN                 NaN                  NaN           NaN
3I_0        NaN                 NaN                  NaN           NaN
3J_0        NaN                 NaN                  NaN           NaN
3K_0        NaN                 NaN                  NaN           NaN
3L_0        NaN                 NaN                  NaN           NaN
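
The `filtered_by_qscore` column counts reads rejected for low base quality. A minimal sketch of such a filter is given below, using the `phred_Qscore_offset` (33) and `max_low_qual_bases` (5) values from the parameter table. The cutoff of Q20 for a "low quality" base and the function names are assumptions made for illustration.

```python
PHRED_OFFSET = 33        # param 10 in the table above
MAX_LOW_QUAL_BASES = 5   # param 9 in the table above
MIN_QUALITY = 20         # assumed cutoff for a "low quality" base


def count_low_quality(quality_string):
    """Count bases whose decoded Phred score falls below MIN_QUALITY."""
    scores = [ord(char) - PHRED_OFFSET for char in quality_string]
    return sum(1 for score in scores if score < MIN_QUALITY)


def passes_filter(quality_string):
    """Keep a read unless it has more than MAX_LOW_QUAL_BASES low bases."""
    return count_low_quality(quality_string) <= MAX_LOW_QUAL_BASES


good_read = "IIIIIIIIII"        # 'I' decodes to Q40 with offset 33
bad_read = "IIII######"         # '#' decodes to Q2 with offset 33

print(passes_filter(good_read))
print(passes_filter(bad_read))
```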


### Branching Assembly objects
Let's imagine at this point that we are interested in clustering our data at two different clustering thresholds, 0.90 and 0.85. First we need to make a copy (a branch) of the Assembly object. The copy inherits the locations of the data linked in the first object, but anything applied to it afterwards affects only the copy. Thus, the two Assembly objects can share the same working directory and inherit shared files, but new files will be linked to only one or the other. You can view the directories linked to an Assembly object with the `.dirs` attribute, shown below. The prefix_outname (param 14) of the new object is automatically set to the Assembly object name.


In [12]:
## create a copy of our Assembly object
data2 = data1.copy(newname="data2")

## set clustering threshold to 0.90
data2.set_params("clust_threshold", 0.90)

## look at inherited parameters
data2.get_params()

  1   working_directory           ./test_rad                                   
  2   raw_fastq_path              ./data/sim_rad2*.fastq.gz                    
  3   barcodes_path               ./data/sim_rad2*_barcodes.txt                
  4   sorted_fastq_path                                                        
  5   assembly_method             denovo                                       
  6   reference_sequence                                                       
  7   datatype                    rad                                          
  8   restriction_overhang        ('TGCAG', '')                                
  9   max_low_qual_bases          5                                            
  10  phred_Qscore_offset         33                                           
  11  mindepth_statistical        6                                            
  12  mindepth_majrule            6                                            
  13  maxdepth                    1000  
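
The branching behaviour can be illustrated with a deep copy of a plain dictionary: the branch starts with the same settings and paths as its parent, and later changes affect only the branch. The dictionary layout here is a hypothetical stand-in and says nothing about how `copy()` is implemented in _ipyrad_.

```python
import copy

# Hypothetical settings standing in for an Assembly's parameters and paths.
parent = {
    "name": "data1",
    "params": {"clust_threshold": 0.85, "datatype": "rad"},
    "dirs": {"working": "./test_rad"},
}

# A deep copy gives the branch its own nested dictionaries.
branch = copy.deepcopy(parent)
branch["name"] = "data2"
branch["params"]["clust_threshold"] = 0.90

print(parent["params"]["clust_threshold"])
print(branch["params"]["clust_threshold"])
```

A shallow copy would not be enough here, because the nested `params` dictionary would still be shared between parent and branch.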

### Step 3: clustering within-samples


In [1]:
import ipyrad as ip
data1 = ip.load.load_assembly("test_rad/data1.assembly")

DEBUG:ipyrad:H4CKERZ-mode: __loglevel__ = DEBUG


  loading Assembly: data1 [test_rad/data1.assembly]


In [20]:
import numpy as np

In [27]:
print data1.statsfiles.s2.to_string()#float_format='{}'.format)

      reads_raw  filtered_by_qscore  filtered_by_adapter  reads_passed
1A_0        NaN                 NaN                  NaN           NaN
1B_0       1065                   0                    0          1065
1C_0       1021                   0                    0          1021
1D_0        NaN                 NaN                  NaN           NaN
2E_0        NaN                 NaN                  NaN           NaN
2F_0        NaN                 NaN                  NaN           NaN
2G_0        NaN                 NaN                  NaN           NaN
2H_0        NaN                 NaN                  NaN           NaN
3I_0        NaN                 NaN                  NaN           NaN
3J_0        NaN                 NaN                  NaN           NaN
3K_0        NaN                 NaN                  NaN           NaN
3L_0        NaN                 NaN                  NaN           NaN


In [14]:
data1.samples["1C_0"].files

{'clusters': [],
 'consens': [],
 'database': [],
 'edits': [],
 'fastqs': [('/home/deren/Documents/ipyrad/tests/test_rad/data1_fastqs/1C_0_R1_.fastq.gz',)],
 'mapped_reads': [],
 'unmapped_reads': []}

In [12]:
## run step 3 to cluster reads within samples using vsearch
data1.step3(force=True)
#data1.step3()

## print the results
print data1.stats

INFO:ipyrad.core.assembly:try 10: starting controller
DEBUG:ipyrad.core.assembly:OK! Connected to (4) engines
DEBUG:ipyrad.core.assembly:Sample 2H_0 not in proper state.
DEBUG:ipyrad.core.assembly:Sample 3J_0 not in proper state.
DEBUG:ipyrad.core.assembly:Sample 2E_0 not in proper state.
DEBUG:ipyrad.core.assembly:Sample 1A_0 not in proper state.
DEBUG:ipyrad.core.assembly:Sample 2G_0 not in proper state.
DEBUG:ipyrad.core.assembly:Sample 3L_0 not in proper state.
DEBUG:ipyrad.core.assembly:Sample 2F_0 not in proper state.
DEBUG:ipyrad.core.assembly:Sample 3I_0 not in proper state.
DEBUG:ipyrad.core.assembly:Sample 1D_0 not in proper state.
DEBUG:ipyrad.core.assembly:Sample 3K_0 not in proper state.
[2:apply]: IndexError: list index out of range
[0:apply]: IndexError: list index out of range



    Sample not ready for clustering. First run step 2 on sample: 2H_0

    Sample not ready for clustering. First run step 2 on sample: 3J_0

    Sample not ready for clustering. First run step 2 on sample: 2E_0

    Sample not ready for clustering. First run step 2 on sample: 1A_0

    Sample not ready for clustering. First run step 2 on sample: 2G_0

    Sample not ready for clustering. First run step 2 on sample: 3L_0

    Sample not ready for clustering. First run step 2 on sample: 2F_0

    Sample not ready for clustering. First run step 2 on sample: 3I_0

    Sample not ready for clustering. First run step 2 on sample: 1D_0

    Sample not ready for clustering. First run step 2 on sample: 3K_0
Exception: one or more exceptions from call to method: clustall
[2:apply]: IndexError: list index out of range
[0:apply]: IndexError: list index out of range


CompositeError: one or more exceptions from call to method: clustall
[2:apply]: IndexError: list index out of range
[0:apply]: IndexError: list index out of range
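
The messages above suggest that step 3 expects step 2 to have been run on every sample first. A simple guard that selects only the samples that are ready might look like the sketch below. The state values and the threshold of 2 are assumptions inferred from the output in this notebook, where only `1B_0` and `1C_0` went through step 2.

```python
REQUIRED_STATE = 2   # assumption: a sample reaches state 2 after step 2

# Hypothetical states after running step 2 on two samples only.
states = {"1A_0": 1, "1B_0": 2, "1C_0": 2, "1D_0": 1}

ready = sorted(name for name, state in states.items()
               if state >= REQUIRED_STATE)
not_ready = sorted(name for name, state in states.items()
                   if state < REQUIRED_STATE)

print(ready)
print(not_ready)
```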

In [13]:
## run step 3 to cluster reads in data2 at 0.90 sequence similarity
data2.step3(force=True) 

## print the results
print data2.stats

INFO:ipyrad.core.assembly:try 10: starting controller
DEBUG:ipyrad.core.assembly:OK! Connected to (4) engines
INFO:ipyrad.assemble.cluster_within:muscle aligning
INFO:ipyrad.assemble.cluster_within:muscle aligning
INFO:ipyrad.assemble.cluster_within:muscle aligning
INFO:ipyrad.assemble.cluster_within:muscle aligning
INFO:ipyrad.assemble.cluster_within:muscle aligning
INFO:ipyrad.assemble.cluster_within:muscle aligning
INFO:ipyrad.assemble.cluster_within:muscle aligning
INFO:ipyrad.assemble.cluster_within:muscle aligning
INFO:ipyrad.assemble.cluster_within:muscle aligning
INFO:ipyrad.assemble.cluster_within:muscle aligning
INFO:ipyrad.assemble.cluster_within:muscle aligning
INFO:ipyrad.assemble.cluster_within:muscle aligning


      state  reads_raw  reads_filtered  clusters_total  clusters_hidepth
1A_0      3      20099           20099            1000              1000
1B_0      3      19977           19977            1000              1000
1C_0      3      20114           20114            1000              1000
1D_0      3      19895           19895            1000              1000
2E_0      3      19928           19928            1000              1000
2F_0      3      19934           19934            1000              1000
2G_0      3      20026           20026            1000              1000
2H_0      3      19936           19936            1000              1000
3I_0      3      20084           20084            1000              1000
3J_0      3      20011           20011            1000              1000
3K_0      3      20117           20117            1000              1000
3L_0      3      19901           19901            1000              1000
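
To give a feel for what a clustering threshold means, the sketch below computes the fraction of identical positions between two equal-length sequences and compares it with the two thresholds discussed in this tutorial. The sequences are made up, and the position-by-position comparison is a deliberate simplification: a real clustering tool works on alignments, so its similarity values can differ.

```python
def identity(seq_a, seq_b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return matches / float(len(seq_a))


seed = "TGCAGAAACCCGGGTT"      # 16 bases, hypothetical
read = "TGCAGAAACCCGGGAA"      # differs from the seed at the last 2 positions

similarity = identity(seed, read)
for threshold in (0.85, 0.90):
    # a read joins the seed's cluster only if it meets the threshold
    print("{} {}".format(threshold, similarity >= threshold))
```

With 14 of 16 positions matching, this read would join the seed's cluster at 0.85 but not at 0.90, which is why the two branches can end up with different clusters.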


### Branched Assembly objects
You can see below that the two Assembly objects are now working with several shared directories (working, fastq, edits) but with different clust directories (data1_clust_0.85 and data2_clust_0.9). 

In [12]:
print "data1 directories:"
for (i,j) in data1.dirs.items():
    print "{}\t{}".format(i, j)
    
print "\ndata2 directories:"
for (i,j) in data2.dirs.items():
    print "{}\t{}".format(i, j)

data1 directories:
fastqs	/home/deren/Documents/ipyrad/tests/test_rad/data1_fastqs
edits	/home/deren/Documents/ipyrad/tests/test_rad/data1_edits
clusts	/home/deren/Documents/ipyrad/tests/test_rad/data1_clust_0.85
working	/home/deren/Documents/ipyrad/tests/test_rad

data2 directories:
fastqs	/home/deren/Documents/ipyrad/tests/test_rad/data1_fastqs
edits	/home/deren/Documents/ipyrad/tests/test_rad/data1_edits
working	/home/deren/Documents/ipyrad/tests/test_rad


In [13]:
## TODO, just make a [name]_stats directory in [work] for each data obj
print data1.statsfiles.s1


      reads_raw
1A_0      20099
1B_0      19977
1C_0      20114
1D_0      19895
2E_0      19928
2F_0      19934
2G_0      20026
2H_0      19936
3I_0      20084
3J_0      20011
3K_0      20117
3L_0      19901


### Saving stats outputs
Example: two simple ways to save the stats data frame to a file.

In [None]:
data1.stats.to_csv("data1_results.csv", sep="\t")
data1.stats.to_latex("data1_results.tex")
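
For readers who want the same tab-separated layout without going through `pandas`, a sketch using only the standard library is shown below. The three read counts are copied from the step 1 table earlier in this notebook; the output file name and the temporary directory are arbitrary choices for the example.

```python
import csv
import os
import tempfile

# Read counts for three samples, copied from the step 1 table above.
reads_raw = [("1A_0", 1109), ("1B_0", 1065), ("1C_0", 1021)]

out_path = os.path.join(tempfile.mkdtemp(), "data1_results.tsv")

# write a header row followed by one row per sample
with open(out_path, "w") as handle:
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["sample", "reads_raw"])
    writer.writerows(reads_raw)

# read the file back to check its contents
with open(out_path) as handle:
    rows = list(csv.reader(handle, delimiter="\t"))

print(rows[0])
```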

### Example of plotting with _ipyrad_
There are a few simple plotting functions in _ipyrad_ useful for visualizing results. These are in the module `ipyrad.plotting`. Below is an interactive plot for visualizing the distributions of coverages across the 12 samples in the test data set.

In [2]:
import ipyrad.plotting as iplot

## plot for one or more selected samples
#iplot.depthplot(data1, ["1A_0", "1B_0"])

## plot for all samples in data1
iplot.depthplot(data1)

## save plot as pdf and html
#iplot.depthplot(data1, outprefix="testfig")

### Step 4: Joint estimation of heterozygosity and error rate


In [3]:
import ipyrad as ip
data1 = ip.load.load_assembly("test_rad/data1.assembly")

  loading Assembly: data1 [test_rad/data1.assembly]


In [3]:
## run step 4
data1.step4(force=True)

## print the results
print data1.stats

INFO:ipyrad.core.assembly:try 10: starting controller
DEBUG:ipyrad.core.assembly:OK! Connected to (4) engines


      state  reads_raw  reads_filtered  clusters_total  clusters_hidepth  \
1A_0      4       1109            1109              95                84   
1B_0      4       1065            1065              97                80   
1C_0      4       1021            1021              98                76   
1D_0      4       1117            1117              97                84   
2E_0      4       1085            1085              97                77   
2F_0      4       1197            1197              98                82   
2G_0      4       1033            1033              97                75   
2H_0      4       1086            1086              95                79   
3I_0      4       1110            1110              99                81   
3J_0      4       1140            1140              96                76   
3K_0      4       1138            1138              96                83   
3L_0      4       1015            1015              96                73   

      heter

### Step 5: Consensus base calls
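
As a rough picture of what a majority-rule base call involves, the sketch below calls the most common base in each column of a set of aligned reads, and returns only `N` when the cluster is shallower than `mindepth_majrule` (6 in the parameter table). This is a simplified illustration under those assumptions: it ignores gaps, heterozygous sites, and statistical base calling, and it is not the _ipyrad_ implementation.

```python
from collections import Counter

MINDEPTH_MAJRULE = 6   # param 12 in the table above


def majority_consensus(aligned_reads, mindepth=MINDEPTH_MAJRULE):
    """Call the most common base per column, or all N when depth is too low."""
    if not aligned_reads:
        return ""
    if len(aligned_reads) < mindepth:
        return "N" * len(aligned_reads[0])
    consensus = []
    for column in zip(*aligned_reads):
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)


# six hypothetical aligned reads; one carries a single mismatch
cluster = ["TGCAGA", "TGCAGA", "TGCAGA", "TGCAGA", "TGCATA", "TGCAGA"]

print(majority_consensus(cluster))
print(majority_consensus(cluster[:3]))
```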


In [1]:
import ipyrad as ip
data1 = ip.load.load_assembly("test_rad/data1")

DEBUG:ipyrad:H4CKERZ-mode: __loglevel__ = DEBUG


  loading Assembly: data1 [test_rad/data1.assembly]


In [16]:
#data1.set_params("max_Hs_consens", (1, 1))

In [5]:
## run step 5
data1.step5(force=True)#"1B_0")

## print the results
print data1.stats

INFO:ipyrad.core.assembly:try 10: starting controller
DEBUG:ipyrad.core.assembly:connected to 0 engines
INFO:ipyrad.core.assembly:try 9: starting controller
DEBUG:ipyrad.core.assembly:connected to 0 engines
INFO:ipyrad.core.assembly:try 8: starting controller
DEBUG:ipyrad.core.assembly:connected to 0 engines
INFO:ipyrad.core.assembly:try 7: starting controller


Exception: one or more exceptions from call to method: consensus
[3:apply]: KeyError: 'G'


CompositeError: one or more exceptions from call to method: consensus
[3:apply]: KeyError: 'G'

In [3]:
## run step 6
data1.step6(force=True)



In [17]:
import ipyrad as ip

## reload the autosaved data, in case you quit and came back
data1 = ip.load.load_assembly("test_rad/data1.assembly")

  loading Assembly: data1 [test_rad/data1.assembly]
  New Assembly: data1


In [4]:
data1.step7(force=True)

ERROR:ipyrad.file_conversion.loci2treemix:Treemix file conversion requires .unlinked_snps file, which does not exist.Make sure the param `output_formats` includes at least `usnps,treemix`and rerun step7()


Migrate format is still in development


## Assembly finished

In [5]:
ll test_rad/outfiles/

total 9068
-rw-rw-r-- 1 deren 2600999 Jan 23 02:03 data1.alleles
-rw-rw-r-- 1 deren 1252625 Jan 23 02:03 data1.full.vcf
-rw-rw-r-- 1 deren 1250895 Jan 23 02:03 data1.gphocs
-rw-rw-r-- 1 deren 1353000 Jan 23 02:03 data1.loci
-rw-rw-r-- 1 deren 1269535 Jan 23 02:03 data1.nex
-rw-rw-r-- 1 deren 1128105 Jan 23 02:03 data1.phy
-rw-rw-r-- 1 deren   21655 Jan 23 02:03 data1.phy.partitions
-rw-rw-r-- 1 deren      34 Jan 23 02:03 data1.treemix.gz
-rw-rw-r-- 1 deren  388482 Jan 23 02:03 data1.vcf


### Quick parameter explanations are always on-hand

In [18]:
ip.paramsinfo(10)


    (10) phred_Qscore_offset ---------------------------------------------
    Examples:
    ----------------------------------------------------------------------
    data.set_params(10) = 33
    data.set_params("phred_Qscore_offset") = 33
    ----------------------------------------------------------------------
    


### Log history 
A common problem at the end of an analysis, or while troubleshooting it, is that you find you've completely forgotten which parameters you used at what point, and when you changed them. Documenting or executing code inside Jupyter notebooks (like the one you're reading right now) is a great way to keep track of all of this. In addition, _ipyrad_ also stores a log history which time stamps all modifications to Assembly objects. 

In [None]:
for i in data1.log:
    print i
    
print "\ndata 2 log includes its pre-branching history with data1"
for i in data2.log:
    print i
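
The idea of a time-stamped log can be sketched with a small class that records every parameter change together with the time it was made. `LoggedParams` and its methods are hypothetical and are only meant to show the pattern, not the format of the _ipyrad_ log.

```python
import time


class LoggedParams(object):
    """Hypothetical parameter store that records every change with a time."""
    def __init__(self):
        self.values = {}
        self.log = []

    def set(self, name, value):
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        self.values[name] = value
        self.log.append((stamp, "set {} = {!r}".format(name, value)))


settings = LoggedParams()
settings.set("datatype", "rad")
settings.set("clust_threshold", 0.9)

for stamp, message in settings.log:
    print("{}  {}".format(stamp, message))
```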

### Saving Assembly objects
Assembly objects can be saved and loaded so that interactive analyses can be started, stopped, and returned to quite easily. The format of these saved files is a serialized 'dill' object used by Python. Individual Sample objects are saved within Assembly objects. These objects do not contain the actual sequence data, but only link to it, and so are not very large. The information contained includes parameters and the log of Assembly objects, and the statistics and state of Sample objects. Assembly objects are autosaved each time an assembly `step` function is called, but you can also create your own checkpoints with the `save` command.

In [None]:
## save assembly object
#ip.save_assembly("data1.p")

## load assembly object
#data = ip.load_assembly("data1.p")
#print data.name
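
The checkpoint idea can be tried out with the standard library `pickle` module, used here as a stand-in for `dill` (which extends `pickle`). The `state` dictionary is a hypothetical example of the kind of lightweight information described above: names, parameters, and file paths, but no sequence data.

```python
import os
import pickle
import tempfile

# Hypothetical lightweight state: parameters and file paths, no reads.
state = {
    "name": "data1",
    "params": {"datatype": "rad"},
    "samples": {"1A_0": {"state": 1, "fastqs": ["1A_0_R1_.fastq.gz"]}},
}

checkpoint = os.path.join(tempfile.mkdtemp(), "data1_checkpoint.pkl")

# save the state to disk
with open(checkpoint, "wb") as handle:
    pickle.dump(state, handle)

# load it back, as one would after restarting a session
with open(checkpoint, "rb") as handle:
    restored = pickle.load(handle)

print(restored["name"])
```

Because only paths are stored, the checkpoint stays small no matter how large the fastq files are, but it also stops being useful if those files are moved.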