# _ipyrad_ testing tutorial

### Getting started
Import _ipyrad_ and remove any test files left over from a previous run.

In [1]:
## import modules
import ipyrad as ip                ## import ipyrad under the alias ip
print "version", ip.__version__    ## print version


ipyparallel setup: Local connection to 4 engines.
version 0.0.66


In [2]:
## clear data from test directory if it already exists
import shutil
import os
if os.path.exists("./test_rad/"):
    shutil.rmtree("./test_rad/")

### Assembly and Sample objects

Assembly and Sample objects are used by _ipyrad_ to access data stored on disk and to manipulate it. Each biological sample in a data set is represented by a Sample object, and a set of Samples is stored inside an Assembly object. The Assembly object has functions to assemble the data, and it stores a log of all steps performed along with the resulting statistics of those steps. Assembly objects can be copied or merged to allow branching, so that different parameters can be applied to different Assemblies going forward. Examples of this are shown below.

Below is the command to create an Assembly object named "data1". It is created with default assembly parameters and without any Samples linked to it.

In [3]:
## create an Assembly object named data1. 
data1 = ip.Assembly("data1")

New Assembly object `data1` created
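
To make the relationship between the two object types concrete, here is a toy sketch. The classes `ToySample` and `ToyAssembly` and the method `link_sample` are hypothetical names invented for illustration; they are not _ipyrad_'s implementation, only a minimal model of "an Assembly holds parameters, a set of Samples, and a log".

```python
class ToySample(object):
    """Hypothetical stand-in for a Sample: a name plus per-sample stats."""
    def __init__(self, name):
        self.name = name
        self.stats = {"state": 0, "reads_raw": None}


class ToyAssembly(object):
    """Hypothetical stand-in for an Assembly: parameters, Samples, and a log."""
    def __init__(self, name):
        self.name = name
        self.params = {"clust_threshold": 0.85}   # default shown in this tutorial
        self.samples = {}                         # Sample name -> ToySample
        self.log = ["{} created".format(name)]

    def link_sample(self, sample):
        """Store a Sample under its name and record the event in the log."""
        self.samples[sample.name] = sample
        self.log.append("linked Sample {}".format(sample.name))


toy = ToyAssembly("data1")
for sample_name in ["1A_0", "1B_0"]:
    toy.link_sample(ToySample(sample_name))
```

After this runs, `toy.samples` holds two Samples keyed by name and `toy.log` has three entries, one for creation and one per linked Sample.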


### Modifying assembly parameters
An Assembly object's parameter settings can be viewed with the `get_params()` function. For more detailed information about all parameters use `ip.get_params_info()`, or for a single parameter use `ip.get_params_info(N)`, where N is the number of the parameter. Assembly objects also have a `set_params()` function for modifying parameters, as below.

In [4]:
## modify parameters for this Assembly object
data1.set_params(1, "./test_rad")
data1.set_params(2, "./data/sim_rad_test_R1_.fastq.gz")
data1.set_params(3, "./data/sim_rad_test_barcodes.txt")
#data1.set_params(2, "~/Dropbox/UO_C353_1.fastq.part-aa.gz")
#data1.set_params(3, "/home/deren/Dropbox/Viburnum_revised.barcodes")
data1.set_params(7, 3)
data1.set_params(10, 'rad')

## print the new parameters to screen
data1.get_params()

  1   working_directory             ./test_rad                                   
  2   raw_fastq_path                ./data/sim_rad_test_R1_.fastq.gz             
  3   barcodes_path                 ./data/sim_rad_test_barcodes.txt             
  4   sorted_fastq_path                                                          
  5   restriction_overhang          ('TGCAG', '')                                
  6   max_low_qual_bases            5                                            
  7   N_processors                  4                                            
  8   mindepth_statistical          6                                            
  9   mindepth_majrule              6                                            
  10  datatype                      rad                                          
  11  clust_threshold               0.85                                         
  12  minsamp                       4                                            
  13  max_shared

### Starting data
If the data are already demultiplexed, the fastq files can be linked directly to the Assembly object. Linking creates new Sample objects from the files, or attaches the files to existing Sample objects, based on the file names (paired data are matched as a pair of fastq files). The files may be gzip compressed. If the data are not demultiplexed, run the `step1` function below to demultiplex the raw data.

In [5]:
## This would link fastq files from the 'sorted_fastq_path' if present
## Here it does nothing b/c there are no files in the sorted_fastq_path
data1.link_fastqs()

0 new Samples created in data1.
0 fastq files linked to Samples.


### Step 1: Demultiplexing raw data files
Step 1 uses barcode information to demultiplex the data files found in param 2 (`raw_fastq_path`) and creates a Sample object for each barcoded sample. Below we call the `step1()` function to demultiplex. The `stats` attribute of an Assembly object is returned as a `pandas` data frame.

In [6]:
## run step 1 to demultiplex the data
data1.step1()

## print the results for each Sample in data1
print data1.stats

      state  reads_raw  reads_filtered  clusters_total  clusters_kept  \
1A_0      1      20099             NaN             NaN            NaN   
1B_0      1      19977             NaN             NaN            NaN   
1C_0      1      20114             NaN             NaN            NaN   
1D_0      1      19895             NaN             NaN            NaN   
2E_0      1      19928             NaN             NaN            NaN   
2F_0      1      19934             NaN             NaN            NaN   
2G_0      1      20026             NaN             NaN            NaN   
2H_0      1      19936             NaN             NaN            NaN   
3I_0      1      20084             NaN             NaN            NaN   
3J_0      1      20011             NaN             NaN            NaN   
3K_0      1      20117             NaN             NaN            NaN   
3L_0      1      19901             NaN             NaN            NaN   

      hetero_est  error_est  reads_consens  
1A_0 
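
The idea behind barcode demultiplexing can be sketched in a few lines. This is a simplified illustration, not what `step1()` actually does: it assumes exact barcode matches at the start of each read, and the function name `demultiplex`, the barcodes, and the reads are all invented for the example.

```python
from collections import defaultdict


def demultiplex(reads, barcodes):
    """Assign each read to the Sample whose barcode starts the read.

    reads: iterable of sequence strings.
    barcodes: dict mapping Sample name -> barcode string.
    Returns (dict of Sample name -> reads with the barcode trimmed off,
             list of reads that matched no barcode).
    """
    sorted_reads = defaultdict(list)
    unmatched = []
    for seq in reads:
        for sample, barcode in barcodes.items():
            if seq.startswith(barcode):
                sorted_reads[sample].append(seq[len(barcode):])
                break
        else:
            # no barcode matched this read
            unmatched.append(seq)
    return dict(sorted_reads), unmatched


# Hypothetical barcodes and reads, invented for illustration.
barcodes = {"1A_0": "CATCAT", "1B_0": "GGTTAA"}
reads = ["CATCATTGCAGAAAC", "GGTTAATGCAGCCCT", "TTTTTTTGCAGGGGA"]
sorted_reads, unmatched = demultiplex(reads, barcodes)
```

Each trimmed read here begins with `TGCAG`, the restriction overhang set in param 5. A real demultiplexer also has to handle sequencing errors in the barcode and barcodes of differing lengths, which this sketch ignores.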

### Step 2: Filter reads 
If for some reason we wanted to run on just a subset of our data, we could do so by calling the `step2` function on only certain samples. Because `step2` is a method of `data1`, it always executes with the parameters that are linked to `data1`.

In [7]:
## example of ways to run step 2 to filter and trim reads
data1.step2(["1A_0"])             ## run on a single sample
data1.step2(["1B_0", "1C_0"])     ## run on one or more samples
data1.step2(force=True)           ## run on all samples, rerunning finished ones

## print the results
print data1.stats

      state  reads_raw  reads_filtered  clusters_total  clusters_kept  \
1A_0      2      20099           20099             NaN            NaN   
1B_0      2      19977           19977             NaN            NaN   
1C_0      2      20114           20114             NaN            NaN   
1D_0      2      19895           19895             NaN            NaN   
2E_0      2      19928           19928             NaN            NaN   
2F_0      2      19934           19934             NaN            NaN   
2G_0      2      20026           20026             NaN            NaN   
2H_0      2      19936           19936             NaN            NaN   
3I_0      2      20084           20084             NaN            NaN   
3J_0      2      20011           20011             NaN            NaN   
3K_0      2      20117           20117             NaN            NaN   
3L_0      2      19901           19901             NaN            NaN   

      hetero_est  error_est  reads_consens  
1A_0 
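
The three calls above differ only in which Samples they touch. The toy model below shows that selection logic under two assumptions that the tutorial does not state outright: that a step skips Samples which have already reached its state, and that `force=True` reruns them. The second assumption is at least consistent with the log printed later in this tutorial, where 1A_0, 1B_0 and 1C_0 are each edited twice. The function `run_step` and the `states` dict are hypothetical.

```python
def run_step(states, target_state, samples=None, force=False):
    """Advance Samples to target_state and return the names that were run.

    states: dict mapping Sample name -> last completed step number.
    samples: optional list of names; None means every Sample.
    force: if True, rerun Samples that already reached target_state.
    """
    names = sorted(states) if samples is None else list(samples)
    ran = []
    for name in names:
        if states[name] >= target_state and not force:
            continue  # already finished, so skip unless forced
        states[name] = target_state
        ran.append(name)
    return ran


# Four Samples that have all finished step 1 (state 1).
states = {"1A_0": 1, "1B_0": 1, "1C_0": 1, "1D_0": 1}
first = run_step(states, 2, ["1A_0"])            # a single Sample
second = run_step(states, 2, ["1B_0", "1C_0"])   # two more Samples
remaining = run_step(states, 2)                  # only the unfinished Sample
everything = run_step(states, 2, force=True)     # all Samples again
```

In this model, `remaining` contains only `1D_0`, while `everything` contains all four names because `force=True` overrides the skip.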

### Branching Assembly objects
Let's imagine at this point that we are interested in clustering our data at two different clustering thresholds. We will try 0.90 and 0.85. First we need to make a copy (a branch) of the Assembly object. The copy inherits the locations of the data linked to the first object, but the two objects diverge in anything applied to them afterwards. Thus, the two Assembly objects can share the same working directory and inherit shared files, but each creates new files linked only to itself. You can view the directories linked to an Assembly object with the `.dirs` attribute, shown below. The prefix_outname (param 14) of the new object is automatically set to the Assembly object name.


In [9]:
## create a copy of our Assembly object
data2 = data1.copy(newname="data2")

## set clustering threshold to 0.90
data2.set_params(11, 0.90)

## look at inherited parameters
data2.get_params()

  1   working_directory             ./test_rad                                   
  2   raw_fastq_path                ./data/sim_rad_test_R1_.fastq.gz             
  3   barcodes_path                 ./data/sim_rad_test_barcodes.txt             
  4   sorted_fastq_path                                                          
  5   restriction_overhang          ('TGCAG', '')                                
  6   max_low_qual_bases            5                                            
  7   N_processors                  4                                            
  8   mindepth_statistical          6                                            
  9   mindepth_majrule              6                                            
  10  datatype                      rad                                          
  11  clust_threshold               0.9                                          
  12  minsamp                       4                                            
  13  max_shared
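
The key property of a branch is that changing a parameter on the copy leaves the original untouched. Here is a minimal sketch of that behaviour using plain dicts and `copy.deepcopy`; it is an illustration of the idea, not a description of how `Assembly.copy()` is implemented. The helper `clust_dir` is hypothetical and simply mirrors the directory names that appear in this tutorial's `.dirs` output.

```python
import copy

# Parameters of the original object, using values from this tutorial.
data1_params = {"working_directory": "./test_rad", "clust_threshold": 0.85}

# Branch: an independent copy, then change one parameter on the copy only.
data2_params = copy.deepcopy(data1_params)
data2_params["clust_threshold"] = 0.90


def clust_dir(name, params):
    """Hypothetical helper: build a per-branch cluster directory name."""
    return "{}_clust_{}".format(name, params["clust_threshold"])


dir1 = clust_dir("data1", data1_params)
dir2 = clust_dir("data2", data2_params)
```

Both branches keep the same `working_directory`, while `dir1` and `dir2` differ, so files written after the branch point do not collide.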

### Step 3: Clustering within samples


In [8]:
## run step 3 to cluster reads within samples using vsearch
data1.step3(force=True)

## print the results
print data1.stats

clustering 12 samples using 3 engines per job
      state  reads_raw  reads_filtered  clusters_total  clusters_kept  \
1A_0      3      20099           20099            1000           1000   
1B_0      3      19977           19977            1000           1000   
1C_0      3      20114           20114            1000           1000   
1D_0      3      19895           19895            1000           1000   
2E_0      3      19928           19928            1000           1000   
2F_0      3      19934           19934            1000           1000   
2G_0      3      20026           20026            1000           1000   
2H_0      3      19936           19936            1000           1000   
3I_0      3      20084           20084            1000           1000   
3J_0      3      20011           20011            1000           1000   
3K_0      3      20117           20117            1000           1000   
3L_0      3      19901           19901            1000           1000   

    

In [10]:
## run step 3 to cluster reads in data2 at 0.90 sequence similarity
data2.step3(force=True) 

## print the results
print data2.stats

clustering 12 samples using 3 engines per job
      state  reads_raw  reads_filtered  clusters_total  clusters_kept  \
1A_0      3      20099           20099            1000           1000   
1B_0      3      19977           19977            1000           1000   
1C_0      3      20114           20114            1000           1000   
1D_0      3      19895           19895            1000           1000   
2E_0      3      19928           19928            1000           1000   
2F_0      3      19934           19934            1000           1000   
2G_0      3      20026           20026            1000           1000   
2H_0      3      19936           19936            1000           1000   
3I_0      3      20084           20084            1000           1000   
3J_0      3      20011           20011            1000           1000   
3K_0      3      20117           20117            1000           1000   
3L_0      3      19901           19901            1000           1000   

    

### Branched Assembly objects
As the output below shows, the two Assembly objects now share several directories (working, fastqs, edits) but use different clust directories (clust_0.85 and clust_0.9).

In [11]:
print "data1 directories:"
for (i,j) in data1.dirs.items():
    print "{}\t{}".format(i, j)
    
print "\ndata2 directories:"
for (i,j) in data2.dirs.items():
    print "{}\t{}".format(i, j)

data1 directories:
fastqs	/home/deren/Dropbox/ipyrad/tests/test_rad/data1_fastqs
edits	/home/deren/Dropbox/ipyrad/tests/test_rad/data1_edits
clusts	/home/deren/Dropbox/ipyrad/tests/test_rad/data1_clust_0.85
working	/home/deren/Dropbox/ipyrad/tests/test_rad

data2 directories:
fastqs	/home/deren/Dropbox/ipyrad/tests/test_rad/data1_fastqs
edits	/home/deren/Dropbox/ipyrad/tests/test_rad/data1_edits
clusts	/home/deren/Dropbox/ipyrad/tests/test_rad/data2_clust_0.9
working	/home/deren/Dropbox/ipyrad/tests/test_rad


In [12]:
## TODO, just make a [name]_stats directory in [work] for each data obj
data1.statsfiles


{'s1': '/home/deren/Dropbox/ipyrad/tests/test_rad/data1_fastqs/s1_demultiplex_stats.txt',
 's2': '/home/deren/Dropbox/ipyrad/tests/test_rad/data1_edits/s2_rawedit_stats.txt',
 's3': '/home/deren/Dropbox/ipyrad/tests/test_rad/data1_clust_0.85/s3_cluster_stats.txt'}

### Saving stats outputs
Example: two simple ways to save the stats data frame to a file. Note that `sep="\t"` makes the first file tab-separated despite its `.csv` extension.

In [13]:
data1.stats.to_csv("data1_results.csv", sep="\t")
data1.stats.to_latex("data1_results.tex")
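
Because the first file is written with a tab separator, it has to be read back with a tab delimiter too. The sketch below shows this with the standard library `csv` module. It is written for Python 3 (unlike the Python 2 cells in this notebook), it writes to a temporary directory, and the three-column file layout is a simplified assumption; the values are copied from the stats table in this tutorial.

```python
import csv
import os
import tempfile

# A cut-down stats table: index column plus three stats columns.
rows = [
    ["", "state", "reads_raw", "reads_filtered"],
    ["1A_0", "3", "20099", "20099"],
    ["1B_0", "3", "19977", "19977"],
]

# Write a tab-separated file into a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "data1_results.csv")
with open(path, "w", newline="") as handle:
    csv.writer(handle, delimiter="\t").writerows(rows)

# Read it back, telling the reader that fields are separated by tabs.
with open(path, newline="") as handle:
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    reads_raw = {row[0]: int(row[2]) for row in reader}
```

Reading the same file with the default comma delimiter would return each line as a single field, which is the usual symptom of a mismatched separator.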

### Example of plotting with _ipyrad_
There are a few simple plotting functions in _ipyrad_ that are useful for visualizing results. These are in the module `ipyrad.plotting`. Below is an interactive plot for visualizing the distribution of coverage across the 12 samples in the test data set.

In [14]:
import ipyrad.plotting as iplot

## plot for one or more selected samples
#iplot.depthplot(data1, ["1A_0", "1B_0"])

## plot for all samples in data1
iplot.depthplot(data1)

## save plot as pdf and html
#iplot.depthplot(data1, outprefix="testfig")

### Step 4: Joint estimation of heterozygosity and error rate


In [15]:
## run step 4
data1.step4() 

## print the results
print data1.stats

      state  reads_raw  reads_filtered  clusters_total  clusters_kept  \
1A_0      4      20099           20099            1000           1000   
1B_0      4      19977           19977            1000           1000   
1C_0      4      20114           20114            1000           1000   
1D_0      4      19895           19895            1000           1000   
2E_0      4      19928           19928            1000           1000   
2F_0      4      19934           19934            1000           1000   
2G_0      4      20026           20026            1000           1000   
2H_0      4      19936           19936            1000           1000   
3I_0      4      20084           20084            1000           1000   
3J_0      4      20011           20011            1000           1000   
3K_0      4      20117           20117            1000           1000   
3L_0      4      19901           19901            1000           1000   

      hetero_est  error_est  reads_consens  
1A_0 

### Step 5: Consensus base calls


In [16]:
#import ipyrad as ip

## reload autosaved data. In case you quit and came back 
#data1 = ip.load_dataobj("test_rad/data1.assembly")

In [17]:
## run step 5
#data1.step5()

## print the results
#print data1.stats

In [18]:
data1.samples["1A_0"].stats

state                 4.000000
reads_raw         20099.000000
reads_filtered    20099.000000
clusters_total     1000.000000
clusters_kept      1000.000000
hetero_est            0.002223
error_est             0.000756
reads_consens              NaN
dtype: float64

### Quick parameter explanations are always on-hand

In [19]:
ip.get_params_info(10)


        (10) clust_threshold -------------------------------------------------
        Clustering threshold. 
        Examples:
        ----------------------------------------------------------------------
        data.setparams(10) = .85          ## clustering similarity threshold
        data.setparams(10) = .90          ## clustering similarity threshold
        data.setparams(10) = .95          ## very high values not recommended 
        data.setparams("clust_threshold") = .83  ## verbose
        ----------------------------------------------------------------------
        


### Log history 
A common problem at the end of an analysis, or while troubleshooting it, is that you find you've completely forgotten which parameters you used at what point, and when you changed them. Documenting or executing code inside Jupyter notebooks (like the one you're reading right now) is a great way to keep track of all of this. In addition, _ipyrad_ also stores a log history which time stamps all modifications to Assembly objects. 

In [24]:
for i in data1.log:
    print i
    
print "\ndata 2 log includes its pre-branching history with data1"
for i in data2.log:
    print i

('data1', '11/18/15 01:40:56', 'data1 created')
('data1', '11/18/15 01:40:57', '[1] set to ./test_rad')
('data1', '11/18/15 01:40:57', '[2] set to ./data/sim_rad_test_R1_.fastq.gz')
('data1', '11/18/15 01:40:57', '[3] set to ./data/sim_rad_test_barcodes.txt')
('data1', '11/18/15 01:40:57', '[7] set to 3')
('data1', '11/18/15 01:40:57', '[10] set to rad')
('data1', '11/18/15 01:41:18', 's1_demultiplexing:')
('data1', '11/18/15 01:41:18', 's2 rawediting on 1A_0')
('data1', '11/18/15 01:41:19', 's2 rawediting on 1B_0')
('data1', '11/18/15 01:41:20', 's2 rawediting on 1C_0')
('data1', '11/18/15 01:41:21', 's2 rawediting on 1B_0')
('data1', '11/18/15 01:41:21', 's2 rawediting on 2H_0')
('data1', '11/18/15 01:41:22', 's2 rawediting on 3J_0')
('data1', '11/18/15 01:41:23', 's2 rawediting on 2E_0')
('data1', '11/18/15 01:41:23', 's2 rawediting on 1C_0')
('data1', '11/18/15 01:41:24', 's2 rawediting on 1A_0')
('data1', '11/18/15 01:41:24', 's2 rawediting on 2G_0')
('data1', '11/18/15 01:41:25',
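
Since each log entry is a plain tuple of (name, time stamp, message), the log is easy to query. The sketch below works on a few entries copied from the output above rather than on a live Assembly object. It assumes the time stamps follow a month/day/year layout, which fits the values shown but is not documented here.

```python
from datetime import datetime

# Entries copied from the log output in this tutorial.
log = [
    ("data1", "11/18/15 01:40:56", "data1 created"),
    ("data1", "11/18/15 01:40:57", "[1] set to ./test_rad"),
    ("data1", "11/18/15 01:40:57", "[7] set to 3"),
    ("data1", "11/18/15 01:41:18", "s1_demultiplexing:"),
    ("data1", "11/18/15 01:41:18", "s2 rawediting on 1A_0"),
]

# Keep only the messages that record a parameter change.
param_changes = [msg for (_, _, msg) in log if " set to " in msg]

# Time elapsed between the first and last entries.
stamp_format = "%m/%d/%y %H:%M:%S"   # assumed layout of the time stamps
first = datetime.strptime(log[0][1], stamp_format)
last = datetime.strptime(log[-1][1], stamp_format)
elapsed_seconds = (last - first).total_seconds()
```

The same filter applied to `data2.log` would also return the changes made to `data1` before the branch, since the branch inherits that history.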

### Saving Assembly objects
Assembly objects can be saved and loaded so that interactive analyses can be started, stopped, and returned to quite easily. The saved files are Python objects serialized with 'dill'. Individual Sample objects are saved within Assembly objects. These objects do not contain the actual sequence data, but only link to it, and so are not very large. The information contained includes the parameters and log of Assembly objects, and the statistics and state of Sample objects. Assembly objects are autosaved each time an assembly `step` function is called, but you can also create your own checkpoints with the `save` command.

In [21]:
## save assembly object
#ip.save_assembly("data1.p")

## load assembly object
#data = ip.load_assembly("data1.p")
#print data.name
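
The save-and-reload cycle can be illustrated without _ipyrad_ at all. The sketch below uses the standard library `pickle` module as a stand-in for 'dill' and a plain dict as a stand-in for an Assembly object; both substitutions are assumptions made to keep the example self-contained, and the file is written to a temporary directory.

```python
import os
import pickle
import tempfile

# A stand-in for an Assembly: parameters and per-Sample stats, but no
# sequence data, which is why such checkpoints stay small.
checkpoint = {
    "name": "data1",
    "params": {"clust_threshold": 0.85},
    "samples": {"1A_0": {"state": 4, "reads_raw": 20099}},
}

# Serialize the object to disk.
path = os.path.join(tempfile.mkdtemp(), "data1.p")
with open(path, "wb") as handle:
    pickle.dump(checkpoint, handle)

# Later, possibly in a new session, load it back.
with open(path, "rb") as handle:
    restored = pickle.load(handle)
```

`restored` is a new object equal in content to `checkpoint`, so an analysis can resume from the saved state.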