### ipyrad testing

In [1]:
import ipyrad as ip      ## for RADseq assembly
print ip.__version__     ## print version

DEBUG:ipyrad:H4CKERZ-mode: __loglevel__ = DEBUG


0.0.66


In [2]:
## clear test directory if it already exists
import shutil
import os
if os.path.exists("./test_pairgbs"):
    shutil.rmtree("./test_pairgbs")

### Getting started -- Assembly objects
The first step is to create an Assembly object. It takes an optional argument that provides it with an internal name. Suppose we plan to assemble data from multiple sequencing runs and combine them later, but each group of samples must first be analyzed under a different set of parameters. We could give each group its own Assembly object, naming one "2014_data_set" and another "2015_data_set".

In [3]:
## create an Assembly object called data1.
## It takes an optional name, here 'test_pairgbs'
data1 = ip.Assembly('test_pairgbs')

INFO:ipyrad.core.assembly:New Assembly object `test_pairgbs` created
INFO:ipyrad.core.parallel:Local connection to 4 engines


New Assembly object `test_pairgbs` created
ipyparallel setup: Local connection to 4 engines.


In [4]:
data1.set_params(1, "./test_pairgbs")
data1.set_params(2, "./data/sim_pairgbs_*.gz")
data1.set_params(3, "./data/sim_pairgbs_barcodes.txt")
data1.set_params(10, "pairgbs")
data1.set_params(17, 0)
#data1.set_params(19, 1)

In [5]:
data1.get_params()

  1   working_directory           ./test_pairgbs                               
  2   raw_fastq_path              ./data/sim_pairgbs_*.gz                      
  3   barcodes_path               ./data/sim_pairgbs_barcodes.txt              
  4   sorted_fastq_path                                                        
  5   restriction_overhang        ('TGCAG', '')                                
  6   max_low_qual_bases          5                                            
  7   engines_per_job             4                                            
  8   mindepth_statistical        6                                            
  9   mindepth_majrule            6                                            
  10  datatype                    pairgbs                                      
  11  clust_threshold             0.85                                         
  12  minsamp                     4                                            
  13  max_shared_heterozygosity   0.25  

### Step 1: Demultiplex the raw data files
This step demultiplexes the raw reads and links the resulting fastq files to the Assembly object as Samples.

In [6]:
## demultiplex the raw_data files
## set step1 to only go if no samples are present...
data1.step1() #append=True)

print data1.stats

     state  reads_raw
1A0      1       4000
1B0      1       4000
1C0      1       4000
1D0      1       4000
2E0      1       4000
2F0      1       4000
2G0      1       4000
2H0      1       4000
3I0      1       4000
3J0      1       4000
3K0      1       4000
3L0      1       4000
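
To make the idea of demultiplexing concrete, here is a minimal sketch that sorts reads into samples by exact match of a leading barcode. It is an illustration only and not ipyrad's implementation: the function `demultiplex`, the barcode strings, and the reads are all hypothetical, and it handles single reads rather than the paired reads used in this notebook.

```python
from collections import defaultdict


def demultiplex(reads, barcodes):
    """Assign each read to a sample by exact match of its leading barcode.

    reads    : iterable of sequence strings
    barcodes : dict mapping sample name -> barcode (hypothetical values)
    Returns a dict of sample name -> list of reads with the barcode trimmed.
    """
    by_sample = defaultdict(list)
    for seq in reads:
        for name, barcode in barcodes.items():
            if seq.startswith(barcode):
                # keep the read, minus its barcode, under this sample
                by_sample[name].append(seq[len(barcode):])
                break
        else:
            # no barcode matched the start of this read
            by_sample["unmatched"].append(seq)
    return dict(by_sample)


# made-up barcodes and reads, for illustration only
barcodes = {"1A0": "CATCAT", "1B0": "GGTTGG"}
reads = [
    "CATCATTGCAGAAAC",
    "GGTTGGTGCAGCCCT",
    "CATCATTGCAGGGGT",
    "TTTTTTTGCAGACGT",
]

sorted_reads = demultiplex(reads, barcodes)
for name in sorted(sorted_reads):
    print("{} {}".format(name, len(sorted_reads[name])))
```

A real demultiplexer would typically also tolerate sequencing errors in the barcode; exact matching is used here only to keep the sketch short.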


In [7]:
#data1.set_params(4, "./test_pairgbs/fastq/*")
#data1.link_fastqs()


### Step 2: Filter reads 
If for some reason we wanted to run this step on just a subsample of our data, we could do so by passing only certain sample names to the `step2` function. Because `step2` is a method of `data1`, it always executes with the parameters that are linked to `data1`.

In [8]:
data1.step2() #["1A0","1B0"])#

print data1.stats

INFO:ipyrad.assemble.demultiplex:optim = 10000
INFO:ipyrad.assemble.demultiplex:optim = 10000
INFO:ipyrad.assemble.demultiplex:optim = 10000
INFO:ipyrad.assemble.demultiplex:optim = 10000
INFO:ipyrad.assemble.demultiplex:optim = 10000
INFO:ipyrad.assemble.demultiplex:optim = 10000
INFO:ipyrad.assemble.demultiplex:optim = 10000
INFO:ipyrad.assemble.demultiplex:optim = 10000
INFO:ipyrad.assemble.demultiplex:optim = 10000
INFO:ipyrad.assemble.demultiplex:optim = 10000
INFO:ipyrad.assemble.demultiplex:optim = 10000
INFO:ipyrad.assemble.demultiplex:optim = 10000


     state  reads_raw  reads_filtered
1A0      2       4000            4000
1B0      2       4000            4000
1C0      2       4000            4000
1D0      2       4000            4000
2E0      2       4000            4000
2F0      2       4000            4000
2G0      2       4000            4000
2H0      2       4000            4000
3I0      2       4000            4000
3J0      2       4000            4000
3K0      2       4000            4000
3L0      2       4000            4000
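
The parameter table above lists `max_low_qual_bases` with a value of 5. As a rough sketch of what a filter of that kind could look like, the code below counts low-quality bases in a FASTQ quality string and rejects reads that exceed the limit. The Phred cutoff of 20, the ASCII offset of 33, and the helper names are assumptions made for illustration; this is not taken from ipyrad's source.

```python
def count_low_quality(qual, min_phred=20, offset=33):
    """Count bases whose Phred score is below min_phred.

    qual is a FASTQ quality string; offset=33 assumes Sanger/Illumina 1.8+
    encoding, and min_phred=20 is an assumed cutoff.
    """
    return sum(1 for ch in qual if ord(ch) - offset < min_phred)


def passes_filter(qual, max_low_qual_bases=5, min_phred=20, offset=33):
    """Keep a read only if it has at most max_low_qual_bases low-quality bases."""
    return count_low_quality(qual, min_phred, offset) <= max_low_qual_bases


# 'I' encodes Phred 40 and '#' encodes Phred 2 at offset 33
good = "I" * 20
borderline = "I" * 15 + "#" * 5
bad = "I" * 14 + "#" * 6

for label, qual in [("good", good), ("borderline", borderline), ("bad", bad)]:
    print("{} {}".format(label, passes_filter(qual)))
```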


We can access the `name` and `fname` of the `Sample` objects and edit them as desired without affecting the original data files. The `name` identifier is equal to the filename (`fname`) by default, but it is the name used in the final output files, so it may be desirable to shorten it to something more readable.
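
As a small illustration of shortening names, the sketch below derives a readable label from a file path using plain string handling. The file names and the `_R1_`/`_R2_` tags are hypothetical, and the sketch deliberately does not touch `Sample` objects, since how they are reached from the Assembly is not shown here.

```python
import os


def readable_name(fname, drop=("_R1_", "_R2_")):
    """Derive a short label from a file path (hypothetical naming scheme)."""
    base = os.path.basename(fname)
    base = base.split(".")[0]      # drop extensions such as .fastq.gz
    for tag in drop:
        if base.endswith(tag):     # strip a trailing read-direction tag
            base = base[:-len(tag)]
    return base


# made-up file names, for illustration only
fnames = [
    "./test_pairgbs/fastq/1A0_R1_.fastq.gz",
    "./test_pairgbs/fastq/1B0_R1_.fastq.gz",
]
names = {f: readable_name(f) for f in fnames}
for f in fnames:
    print("{} -> {}".format(f, names[f]))
```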

In [None]:
import ipyrad as ip
data1 = ip.load_assembly("test_pairgbs/test_pairgbs.assembly")

In [9]:
data1.step3(["1A0"], force=True)

Clustering 1 samples using 4 engines per job.


In [12]:
print data1.stats

     state  reads_raw  reads_filtered  clusters_total  clusters_kept
1A0      3       4000            4000             100            100
1B0      3       4000            4000             100            100
1C0      3       4000            4000             100            100
1D0      3       4000            4000             100            100
2E0      3       4000            4000             100            100
2F0      3       4000            4000             100            100
2G0      3       4000            4000             100            100
2H0      3       4000            4000             100            100
3I0      3       4000            4000             100            100
3J0      3       4000            4000             100            100
3K0      3       4000            4000             100            100
3L0      3       4000            4000             100            100


In [14]:
data1.step4()
print data1.stats

     state  reads_raw  reads_filtered  clusters_total  clusters_kept  \
1A0      4       4000            4000             100            100   
1B0      4       4000            4000             100            100   
1C0      4       4000            4000             100            100   
1D0      4       4000            4000             100            100   
2E0      4       4000            4000             100            100   
2F0      4       4000            4000             100            100   
2G0      4       4000            4000             100            100   
2H0      4       4000            4000             100            100   
3I0      4       4000            4000             100            100   
3J0      4       4000            4000             100            100   
3K0      4       4000            4000             100            100   
3L0      4       4000            4000             100            100   

     hetero_est  error_est  
1A0    0.009533   0.000482  
1B0  

In [None]:
data1.step5(["1A0"])

In [None]:
#data1.step5(["1A0"])  ## better filters for -N-

### Quick parameter explanations are always on-hand

In [None]:
ip.get_params_info(10)

### Log history 
A common problem after struggling through an analysis is finding that you've completely forgotten which parameters you used at which point, and when you changed them. The log history timestamps all calls to `set_params()`, as well as calls to the `step` methods. It also records copies/branching of data objects.

In [None]:
## print each timestamped entry recorded for the data1 Assembly
for i in data1.log:
    print i
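
The idea behind such a log can be sketched with a small stand-in class that appends a timestamped entry each time a parameter is set. `LoggedParams` is hypothetical and exists only to show the pattern; the actual format of the entries in the Assembly log may differ.

```python
import time


class LoggedParams(object):
    """Hypothetical stand-in that timestamps every parameter change."""

    def __init__(self):
        self.params = {}
        self.log = []

    def _record(self, message):
        # store (timestamp, message) pairs in call order
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        self.log.append((stamp, message))

    def set_params(self, key, value):
        self.params[key] = value
        self._record("set_params({!r}, {!r})".format(key, value))


demo = LoggedParams()
demo.set_params(10, "pairgbs")
demo.set_params(17, 0)

for stamp, message in demo.log:
    print("{} {}".format(stamp, message))
```

Because entries are appended in call order, reading the log from top to bottom reconstructs the sequence of parameter changes.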

### Saving Assembly objects
Assembly objects can be saved and loaded so that interactive analyses can be started, stopped, and returned to quite easily. These files are stored in Python's serialized 'pickle' format. Individual Sample objects are saved within Assembly objects. While it is important to remember that some of the information in Assembly objects lies in their links to data files, most of the useful information that we would want to analyze post-assembly is stored in the object itself. Thus these objects will be useful for making plots and tables of assembly statistics later.

In [None]:
## save assembly object
#ip.save_assembly("data1.p")

## load assembly object
#data = ip.load_assembly("data1.p")
#print data.name
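
The pickle round trip itself can be sketched with the standard library alone. The dictionary below is a hypothetical stand-in for an Assembly object, and the file name `mini.p` is arbitrary. Note that only the path strings are serialized, not the data files they point to.

```python
import os
import pickle
import tempfile

# hypothetical stand-in for the state held by an Assembly object
assembly_state = {
    "name": "test_pairgbs",
    "params": {1: "./test_pairgbs", 10: "pairgbs"},
    "stats": {"1A0": {"state": 2, "reads_raw": 4000, "reads_filtered": 4000}},
}

# write the object to a pickle file in a temporary directory
path = os.path.join(tempfile.mkdtemp(), "mini.p")
with open(path, "wb") as out:
    pickle.dump(assembly_state, out, protocol=2)

# read it back as a new, equal object
with open(path, "rb") as infile:
    restored = pickle.load(infile)

print(restored["name"])
print(restored["stats"]["1A0"]["reads_raw"])
```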