# _ipyrad_ testing tutorial

### Getting started
Import _ipyrad_ and remove previous test files if they are already present.

In [1]:
## import modules
import ipyrad as ip      ## for RADseq assembly
print ip.__version__     ## print version

## clear data from test directory if it already exists
import shutil
import os
if os.path.exists("./test_rad/"):
    shutil.rmtree("./test_rad/")

0.0.61


### Assembly and Sample objects

Assembly and Sample objects are used by _ipyrad_ to access and manipulate data stored on disk. Each biological sample in a data set is represented by a Sample object, and these Samples are stored inside Assembly objects. The Assembly object contains functions to assemble the data, and stores a log of all steps performed along with the resulting statistics of those steps. Assembly objects can be copied or merged, which allows an analysis to branch so that different parameters are applied to different assemblies.

To create an Assembly object, call `ip.Assembly()` and pass it a name for the data set. Suppose we plan to assemble and later combine data from multiple sequencing runs, but each group of samples first has to be analyzed under a different set of parameters. As an example, we could name the two data sets "2014_data" and "2015_data". These initially do not contain any Samples. Sample objects are created either by linking fastq files to the Assembly object or by running step 1 to demultiplex raw data files.

In [2]:
## create two Assembly objects, data1 and data2.
## Each takes a name for the data set as its argument
data1 = ip.Assembly("2014_data")
data2 = ip.Assembly("2015_data")

print "Assembly object named", data1.name
print "Assembly object named", data2.name

Assembly object named 2014_data
Assembly object named 2015_data
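To make the relationship between the two object types more concrete, here is a minimal sketch using plain Python classes. `ToySample` and `ToyAssembly` are hypothetical stand-ins written for this illustration. They are not the _ipyrad_ classes, and they only mimic the ideas described above: Samples stored inside an Assembly, a log of steps, and copying to branch an analysis.

```python
import copy


class ToySample(object):
    """Hypothetical stand-in for a Sample: a name plus per-sample stats."""
    def __init__(self, name):
        self.name = name
        self.stats = {"state": 0}


class ToyAssembly(object):
    """Hypothetical stand-in for an Assembly: Samples, params, and a log."""
    def __init__(self, name):
        self.name = name
        self.samples = {}                       # sample name -> ToySample
        self.params = {"clust_threshold": 0.85}
        self.log = [name + " created"]

    def add_sample(self, sample_name):
        self.samples[sample_name] = ToySample(sample_name)
        self.log.append("added " + sample_name)

    def branch(self, new_name):
        """Return an independent copy so that parameters can diverge."""
        new = copy.deepcopy(self)
        new.name = new_name
        new.log.append("branched from " + self.name)
        return new


toy1 = ToyAssembly("2014_data")
toy1.add_sample("1A_0")

## branch, then change a parameter on the copy only
toy2 = toy1.branch("2014_data_c90")
toy2.params["clust_threshold"] = 0.90

print(toy1.params["clust_threshold"])
print(toy2.params["clust_threshold"])
```

The deep copy is the design choice that matters here: after branching, a change to one object's parameters or Samples leaves the other untouched.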


### Modifying assembly parameters
All parameter settings are linked to an Assembly object, which is created with a set of default parameters. These can be viewed using the Assembly's `get_params()` function. For more detailed information about all parameters use `ip.get_params_info()`, or pass a parameter number to describe a single parameter, as in `ip.get_params_info(3)`. Assembly objects also have a `set_params()` function that can be used to modify parameters.

In [3]:
## modify parameters for this Assembly object
data1.set_params(1, "./test_rad")
data1.set_params(2, "./data/sim_rad_test_R1_.fastq.gz")
data1.set_params(3, "./data/sim_rad_test_barcodes.txt")
data1.set_params(7, 3)
data1.set_params(10, 'rad')

## print the new parameters to screen
data1.get_params()

  1   working_directory             ./test_rad                                   
  2   raw_fastq_path                ./data/sim_rad_test_R1_.fastq.gz             
  3   barcodes_path                 ./data/sim_rad_test_barcodes.txt             
  4   sorted_fastq_path                                                          
  5   restriction_overhang          ('TGCAG', '')                                
  6   max_low_qual_bases            5                                            
  7   N_processors                  3                                            
  8   mindepth_statistical          6                                            
  9   mindepth_majrule              6                                            
  10  datatype                      rad                                          
  11  clust_threshold               0.85                                         
  12  minsamp                       4                                            
  13  max_shared
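The numbered table above suggests one way such a parameter store could work: an ordered mapping whose entries can be addressed by their 1-based position. The sketch below is a hypothetical illustration, not the _ipyrad_ implementation. The helper `toy_set_param`, the option to address a parameter by name, and the replacement values are all assumptions made for the example.

```python
from collections import OrderedDict

## the first three parameters from the table above, in the same order
toy_params = OrderedDict([
    ("working_directory", "./test_rad"),
    ("raw_fastq_path", "./data/sim_rad_test_R1_.fastq.gz"),
    ("barcodes_path", "./data/sim_rad_test_barcodes.txt"),
])


def toy_set_param(params, key, value):
    """Set a parameter by 1-based index or by name (hypothetical helper).

    Returns the name of the parameter that was changed.
    """
    if isinstance(key, int):
        names = list(params.keys())
        if not 1 <= key <= len(names):
            raise KeyError("no parameter numbered %d" % key)
        key = names[key - 1]
    elif key not in params:
        raise KeyError("no parameter named %r" % key)
    params[key] = value
    return key


## made-up replacement values, for illustration only
changed = toy_set_param(toy_params, 1, "./other_dir")
print(changed)
toy_set_param(toy_params, "barcodes_path", "./data/other_barcodes.txt")
print(toy_params["barcodes_path"])
```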

### Starting data assembly and Sample objects
If the data are already demultiplexed, fastq files can be linked directly to the Assembly object, which in turn creates a Sample object for each fastq file (or pair of fastq files for paired data). The files may be gzip compressed. If the data are not demultiplexed, you will have to run the `step1()` function below to demultiplex the raw data.

In [4]:
## This would link fastq files from the 'sorted_fastq_path' if present
## Here it does nothing b/c there are no files in the sorted_fastq_path
data1.link_fastqs()
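As a rough picture of what linking involves, the sketch below groups fastq files in a directory by sample name. It is a hypothetical helper, not `link_fastqs()` itself, and it assumes files are named `<sample>_R1_.fastq.gz` (plus `<sample>_R2_.fastq.gz` for paired data). That naming scheme is an assumption of this example.

```python
import glob
import os
import tempfile


def toy_link_fastqs(directory):
    """Map sample name -> sorted list of its fastq file names.

    Hypothetical helper. Assumes names like <sample>_R1_.fastq.gz and,
    for paired data, <sample>_R2_.fastq.gz.
    """
    linked = {}
    for path in sorted(glob.glob(os.path.join(directory, "*.fastq*"))):
        fname = os.path.basename(path)
        for tag in ("_R1_", "_R2_"):
            if tag in fname:
                sample = fname.split(tag)[0]
                linked.setdefault(sample, []).append(fname)
                break
    return linked


## build a throwaway directory with empty placeholder files
tmpdir = tempfile.mkdtemp()
for fname in ("1A_0_R1_.fastq.gz", "1A_0_R2_.fastq.gz", "1B_0_R1_.fastq.gz"):
    open(os.path.join(tmpdir, fname), "w").close()

linked = toy_link_fastqs(tmpdir)
print(sorted(linked))

## an empty directory links nothing
empty = toy_link_fastqs(tempfile.mkdtemp())
print(empty)
```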

### Step 1: Demultiplex the raw data files
This uses the barcode information to demultiplex reads in the data files found in the 'raw_fastq_path'. It creates a Sample object for each sample, which is stored in the Assembly object. The state of each sample is set to 1, meaning that the sample has completed step 1 of the _ipyrad_ assembly.

In [5]:
## run step 1 to demultiplex the data
data1.step1()

## print the results for each Sample in data1
print data1.stats

      state  reads_raw  reads_filtered  clusters_total  clusters_kept  \
1A_0      1      20099             NaN             NaN            NaN   
1B_0      1      19977             NaN             NaN            NaN   
1C_0      1      20114             NaN             NaN            NaN   
1D_0      1      19895             NaN             NaN            NaN   
2E_0      1      19928             NaN             NaN            NaN   
2F_0      1      19934             NaN             NaN            NaN   
2G_0      1      20026             NaN             NaN            NaN   
2H_0      1      19936             NaN             NaN            NaN   
3I_0      1      20084             NaN             NaN            NaN   
3J_0      1      20011             NaN             NaN            NaN   
3K_0      1      20117             NaN             NaN            NaN   
3L_0      1      19901             NaN             NaN            NaN   

      hetero_est  error_est  reads_consens  
1A_0 
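The idea behind demultiplexing can be sketched in a few lines: each read begins with a sample-specific barcode, so reads are sorted by matching that prefix. The example below is a simplified, hypothetical helper that uses exact matching only. The barcodes and reads are made up, and the real step handles details (such as file parsing) that are not shown here.

```python
def toy_demultiplex(reads, barcodes):
    """Assign each read to the sample whose barcode it starts with.

    Hypothetical helper using exact prefix matching only.
    Returns (reads per sample with the barcode stripped, unmatched reads).
    """
    sorted_reads = dict((name, []) for name in barcodes)
    unmatched = []
    for read in reads:
        for name, barcode in barcodes.items():
            if read.startswith(barcode):
                ## strip the barcode and keep the rest of the read
                sorted_reads[name].append(read[len(barcode):])
                break
        else:
            unmatched.append(read)
    return sorted_reads, unmatched


## made-up barcodes and reads, for illustration only
toy_barcodes = {"1A_0": "CATCAT", "1B_0": "GGTTAA"}
toy_reads = [
    "CATCATTGCAGAAACCC",
    "GGTTAATGCAGTTTGGG",
    "CATCATTGCAGACGTAC",
    "TTTTTTTGCAGACGTAC",
]

sorted_reads, unmatched = toy_demultiplex(toy_reads, toy_barcodes)
for name in sorted(sorted_reads):
    print("%s %d" % (name, len(sorted_reads[name])))
```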

### Step 2: Filter reads 
If for some reason we wanted to run a step on just a subset of our data, we could do so by passing only certain sample names to the `step2` function. Because `step2` is a function of `data1`, it always executes with the parameters that are linked to `data1`.

In [6]:
## example of ways to run step 2 to filter and trim reads
data1.step2("1A_0")            ## run on a single sample
data1.step2(["1B_0", "1C_0"])  ## run on one or more samples
data1.step2()                  ## run on all samples, skipping finished ones

## print the results
print data1.stats

skipping, 1B_0 already edited. Sample.stats['state'] == 2
skipping, 1C_0 already edited. Sample.stats['state'] == 2
skipping, 1A_0 already edited. Sample.stats['state'] == 2
      state  reads_raw  reads_filtered  clusters_total  clusters_kept  \
1A_0      2      20099           20099             NaN            NaN   
1B_0      2      19977           19977             NaN            NaN   
1C_0      2      20114           20114             NaN            NaN   
1D_0      2      19895           19895             NaN            NaN   
2E_0      2      19928           19928             NaN            NaN   
2F_0      2      19934           19934             NaN            NaN   
2G_0      2      20026           20026             NaN            NaN   
2H_0      2      19936           19936             NaN            NaN   
3I_0      2      20084           20084             NaN            NaN   
3J_0      2      20011           20011             NaN            NaN   
3K_0      2      20117 
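The "skipping" messages in the output above reflect a simple rule: a sample whose state already equals the step being run is left alone. A minimal sketch of that bookkeeping is shown below. `toy_run_step` is a hypothetical helper, not part of _ipyrad_, and it tracks only the state number rather than doing any filtering.

```python
def toy_run_step(states, step, samples=None):
    """Advance selected samples to `step`, skipping finished ones.

    `states` maps sample name -> last completed step. `samples` may be
    None (all samples), a single name, or a list of names.
    Hypothetical helper. Returns (names that ran, names skipped).
    """
    if samples is None:
        samples = sorted(states)
    elif isinstance(samples, str):
        samples = [samples]
    ran, skipped = [], []
    for name in samples:
        if states[name] >= step:
            skipped.append(name)
        else:
            states[name] = step
            ran.append(name)
    return ran, skipped


toy_states = {"1A_0": 1, "1B_0": 1, "1C_0": 1, "1D_0": 1}
toy_run_step(toy_states, 2, "1A_0")              ## a single sample
toy_run_step(toy_states, 2, ["1B_0", "1C_0"])    ## a list of samples
ran, skipped = toy_run_step(toy_states, 2)       ## all samples
print(ran)
print(skipped)
```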

### Step 3: Clustering within samples

In [7]:
## run step 3 to cluster reads within samples using vsearch
data1.step3()  ## e.g., data1.step3(["2H_0", "2G_0"]) to run on a subset

## print the results
print data1.stats

clustering 12 samples on 3 processors
      state  reads_raw  reads_filtered  clusters_total  clusters_kept  \
1A_0      3      20099           20099            1000           1000   
1B_0      3      19977           19977            1000           1000   
1C_0      3      20114           20114            1000           1000   
1D_0      3      19895           19895            1000           1000   
2E_0      3      19928           19928            1000           1000   
2F_0      3      19934           19934            1000           1000   
2G_0      3      20026           20026            1000           1000   
2H_0      3      19936           19936            1000           1000   
3I_0      3      20084           20084            1000           1000   
3J_0      3      20011           20011            1000           1000   
3K_0      3      20117           20117            1000           1000   
3L_0      3      19901           19901            1000           1000   

      hetero
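To give a feel for what clustering at a similarity threshold means, here is a toy greedy clustering of equal-length reads. It is a sketch under strong assumptions: reads are treated as already aligned, identity is the plain fraction of matching positions, and the 0.85 default mirrors the `clust_threshold` value in the parameter table. It is not how vsearch works internally, and `toy_identity` and `toy_cluster` are hypothetical names.

```python
def toy_identity(seq_a, seq_b):
    """Fraction of matching positions; assumes equal-length reads."""
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return matches / float(len(seq_a))


def toy_cluster(reads, threshold=0.85):
    """Greedy clustering: each read joins the first cluster whose seed
    it matches at or above `threshold`, else it seeds a new cluster."""
    clusters = []   ## each cluster is [seed, member, member, ...]
    for read in reads:
        for cluster in clusters:
            if toy_identity(read, cluster[0]) >= threshold:
                cluster.append(read)
                break
        else:
            clusters.append([read])
    return clusters


## made-up reads, for illustration only
toy_reads = [
    "AAAAAAAAAAAAAAAAAAAA",
    "AAAAAAAAAAAAAAAAAAAT",   ## 19/20 = 0.95 identity to the first read
    "CCCCCCCCCCCCCCCCCCCC",
    "AAAAAAAAAAAAAAAATTTT",   ## 16/20 = 0.80, below the threshold
]
toy_clusters = toy_cluster(toy_reads)
print([len(c) for c in toy_clusters])
```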

### Example of plotting with _ipyrad_
There are a few simple plotting functions in _ipyrad_ that are useful for visualizing results. These are in the module `ipyrad.plotting`. Below is an interactive plot for visualizing the distribution of coverage across the 12 samples in the test data set.

In [11]:
import ipyrad as ip
import ipyrad.plotting as iplot

## reload autosaved data, in case you quit and came back
#data1 = ip.load_dataobj("test_rad/2014_data.dataobj")

## plot for one or more selected samples
iplot.depthplot(data1, ["1A_0", "1B_0"])

## plot for all samples in data1
#iplot.depthplot(data1)

## save plot as pdf and html
iplot.depthplot(data1, outprefix="testfig")
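The quantity such a plot summarizes is a histogram of cluster depths, that is, how many clusters have each number of reads. The sketch below computes and prints one as text using only the standard library. The depth values are made up, and `toy_depth_histogram` is a hypothetical helper unrelated to `ipyrad.plotting`.

```python
from collections import Counter


def toy_depth_histogram(cluster_depths):
    """Return sorted (depth, number_of_clusters) pairs."""
    return sorted(Counter(cluster_depths).items())


## made-up depths (reads per cluster) for one sample
toy_depths = [18, 20, 20, 21, 19, 20, 22, 21, 20, 19]

for depth, count in toy_depth_histogram(toy_depths):
    print("%3d %s" % (depth, "#" * count))
```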

### Step 4: Joint estimation of heterozygosity and error rate


In [10]:
## run step 4
data1.step4()

## print the results
print data1.stats

      state  reads_raw  reads_filtered  clusters_total  clusters_kept  \
1A_0      4      20099           20099            1000           1000   
1B_0      4      19977           19977            1000           1000   
1C_0      4      20114           20114            1000           1000   
1D_0      4      19895           19895            1000           1000   
2E_0      4      19928           19928            1000           1000   
2F_0      4      19934           19934            1000           1000   
2G_0      4      20026           20026            1000           1000   
2H_0      4      19936           19936            1000           1000   
3I_0      4      20084           20084            1000           1000   
3J_0      4      20011           20011            1000           1000   
3K_0      4      20117           20117            1000           1000   
3L_0      4      19901           19901            1000           1000   

      hetero_est     error_est  reads_consens  
1A
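"Joint estimation" means both values are chosen together to best explain the observed base counts. The sketch below illustrates the idea with a deliberately simplified two-component mixture fitted by grid search. It is not necessarily the model _ipyrad_ uses. The model, the grid ranges, the helper names, and the data (with an unrealistically high fraction of heterozygous sites so the effect is visible) are all assumptions made for illustration.

```python
import math


def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials."""
    n_choose_k = math.factorial(n) // (math.factorial(k) * math.factorial(n - k))
    return n_choose_k * (p ** k) * ((1.0 - p) ** (n - k))


def toy_log_likelihood(sites, het, err):
    """Log-likelihood of (major_allele_count, depth) pairs under a
    mixture of homozygous sites (reads match unless there is an error)
    and heterozygous sites (reads split 50/50 on average)."""
    total = 0.0
    for k, n in sites:
        homo = binom_pmf(k, n, 1.0 - err)
        hetero = binom_pmf(k, n, 0.5)
        total += math.log((1.0 - het) * homo + het * hetero)
    return total


def toy_estimate(sites):
    """Grid search for the (het, err) pair with the highest likelihood."""
    best = None
    for h in range(1, 500, 5):        ## het from 0.001 to 0.496
        for e in range(1, 51):        ## err from 0.001 to 0.050
            het, err = h / 1000.0, e / 1000.0
            score = toy_log_likelihood(sites, het, err)
            if best is None or score > best[0]:
                best = (score, het, err)
    return best[1], best[2]


## made-up data: 16 homozygous-looking sites and 4 heterozygous-looking sites
toy_sites = [(20, 20)] * 8 + [(19, 20)] * 8 + [(10, 20)] * 4
het_est, err_est = toy_estimate(toy_sites)
print("het ~ %.3f, err ~ %.3f" % (het_est, err_est))
```

In this toy data 4 of 20 sites look heterozygous and 8 of the 320 reads at the remaining sites are mismatches, so the estimates should land near 0.2 and 0.025 respectively.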

### Step 5: Consensus base calls


In [None]:
## run step 5
data1.step5()

## print the results
print data1.stats
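As a rough illustration of a consensus call, the sketch below takes the most common base at each position of a cluster, and declines to call clusters that are too shallow. The default minimum depth of 6 mirrors the `mindepth_majrule` value in the parameter table. This is a simplified, hypothetical helper: it ignores heterozygous sites and the statistical base calling implied by `mindepth_statistical`, and the reads are made up.

```python
from collections import Counter


def toy_consensus(aligned_reads, mindepth=6):
    """Majority-rule consensus of equal-length reads from one cluster.

    Returns None when the cluster has fewer than `mindepth` reads.
    Ties are broken alphabetically so the result is deterministic.
    """
    if len(aligned_reads) < mindepth:
        return None
    consensus = []
    for column in zip(*aligned_reads):
        counts = Counter(column)
        ## most common base; sort by (-count, base) for a stable tie-break
        base = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[0][0]
        consensus.append(base)
    return "".join(consensus)


## made-up cluster of six reads, two of which carry a single mismatch
toy_reads_in_cluster = ["TGCAGA", "TGCAGA", "TGCAGA", "TGCAGT", "TGCAGA", "TGCTGA"]
print(toy_consensus(toy_reads_in_cluster))
print(toy_consensus(toy_reads_in_cluster[:3]))
```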

### Quick parameter explanations are always on-hand

In [12]:
ip.get_params_info(10)


        (10) clust_threshold -------------------------------------------------
        Clustering threshold. 
        Examples:
        ----------------------------------------------------------------------
        data.setparams(10) = .85          ## clustering similarity threshold
        data.setparams(10) = .90          ## clustering similarity threshold
        data.setparams(10) = .95          ## very high values not recommended 
        data.setparams("clust_threshold") = .83  ## verbose
        ----------------------------------------------------------------------
        


### Log history 
A common problem after struggling through an analysis is finding that you've completely forgotten which parameters you used at what point, and when you changed them. The log history timestamps all calls to `set_params()`, as well as calls to `step` methods. It also records copying/branching of data objects.

In [13]:
for i in data1.log:
    print i

('2014_data', '10/07/15 17:42:24', '2014_data created')
('2014_data', '10/07/15 17:42:25', '[1] set to ./test_rad')
('2014_data', '10/07/15 17:42:25', '[2] set to ./data/sim_rad_test_R1_.fastq.gz')
('2014_data', '10/07/15 17:42:25', '[3] set to ./data/sim_rad_test_barcodes.txt')
('2014_data', '10/07/15 17:42:25', '[7] set to 3')
('2014_data', '10/07/15 17:42:25', '[10] set to rad')
('2014_data', '10/07/15 17:42:35', 's1_demultiplexing:')
('2014_data', '10/07/15 17:43:01', 's2 rawediting on 1A_0')
('2014_data', '10/07/15 17:43:02', 's2 rawediting on 1B_0')
('2014_data', '10/07/15 17:43:03', 's2 rawediting on 1C_0')
('2014_data', '10/07/15 17:43:04', 's2 rawediting on 2H_0')
('2014_data', '10/07/15 17:43:04', 's2 rawediting on 3J_0')
('2014_data', '10/07/15 17:43:05', 's2 rawediting on 2E_0')
('2014_data', '10/07/15 17:43:06', 's2 rawediting on 2G_0')
('2014_data', '10/07/15 17:43:07', 's2 rawediting on 3L_0')
('2014_data', '10/07/15 17:43:07', 's2 rawediting on 2F_0')
('2014_data', '10/
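A log like the one above can be kept as a list of `(name, timestamp, message)` tuples that is appended to on every recorded action. The sketch below shows the pattern with a hypothetical `ToyLog` class. It is not the _ipyrad_ implementation, and the month/day/year ordering of the timestamp is an assumption about the format shown above.

```python
import time


class ToyLog(object):
    """Hypothetical timestamped log of (name, timestamp, message) tuples."""
    def __init__(self, name):
        self.name = name
        self.entries = []
        self.record("%s created" % name)

    def record(self, message):
        ## month/day/year order is an assumption about the format above
        stamp = time.strftime("%m/%d/%y %H:%M:%S")
        self.entries.append((self.name, stamp, message))


toy_log = ToyLog("2014_data")
toy_log.record("[7] set to 3")
toy_log.record("s1_demultiplexing:")

for entry in toy_log.entries:
    print(entry)
```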

### Saving Assembly objects
Assembly objects can be saved and loaded so that interactive analyses can be started, stopped, and returned to quite easily. The saved files are Python objects serialized with 'dill'. Individual Sample objects are saved within Assembly objects. These objects do not contain the actual sequence data, but only link to it, and so are not very large. The information they contain includes the parameters and log of Assembly objects, and the statistics and state of Sample objects. Assembly objects are autosaved each time an assembly `step` function is called, but you can also create your own checkpoints with the `save` command.

In [None]:
## save assembly object
#ip.save_assembly("data1.p")

## load assembly object
#data = ip.load_assembly("data1.p")
#print data.name
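The save-and-reload pattern can be tried out without any data by serializing a small stand-in object. The sketch below uses the standard library `pickle` module in place of 'dill' so that it runs without extra dependencies. The dictionary layout and the file name are assumptions made for illustration and do not reflect the actual saved format.

```python
import os
import pickle
import tempfile

## a small stand-in for the metadata an Assembly keeps (no sequence data)
toy_assembly = {
    "name": "2014_data",
    "params": {"datatype": "rad", "clust_threshold": 0.85},
    "sample_states": {"1A_0": 3, "1B_0": 3},
}

## save a checkpoint to a throwaway directory
checkpoint = os.path.join(tempfile.mkdtemp(), "toy_assembly.p")
with open(checkpoint, "wb") as handle:
    pickle.dump(toy_assembly, handle)

## load it back, as you would in a later session
with open(checkpoint, "rb") as handle:
    restored = pickle.load(handle)

print(restored["name"])
```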