### ipyrad testing

In [1]:
import ipyrad as ip      ## for RADseq assembly
print ip.__version__     ## print version

0.0.49


In [2]:
## clear test directory if it already exists
import shutil
import os
if os.path.exists("./test_pairgbs"):
    shutil.rmtree("./test_pairgbs")

### Getting started -- Assembly objects
The first step is to create an Assembly object. It takes an optional argument that gives it an internal name. Naming is useful when you plan to assemble and later combine data from multiple sequencing runs, where each group of samples must first be analyzed under its own set of parameters. For example, we could call one data set "2014_data_set" and another "2015_data_set".

Note: does `pairgbs` now expect reads that have already been through PEAR? No, because those reads would be 'merged'.

In [3]:
## create an Assembly object called data1.
## It takes an optional name, here 'test_pairgbs'
data1 = ip.Assembly('test_pairgbs')

In [4]:
data1.set_params(1, "./test_pairgbs")
data1.set_params(2, "./data/sim_pairgbs_*.gz")
data1.set_params(3, "./data/sim_pairgbs_barcodes.txt")
data1.set_params(12, "pairgbs")
#data1.set_params(19, 1)

In [5]:
data1.get_params()

  1   working_directory             ./test_pairgbs                               
  2   raw_fastq_path                ./data/sim_pairgbs_*.gz                      
  3   barcodes_path                 ./data/sim_pairgbs_barcodes.txt              
  4   sorted_fastq_path                                                          
  5   vsearch_path                  vsearch                                      
  6   muscle_path                   muscle                                       
  7   restriction_overhang          ('TGCAG', '')                                
  8   max_low_qual_bases            5                                            
  9   N_processors                  4                                            
  10  mindepth_statistical          6                                            
  11  mindepth_maj_rule             6                                            
  12  datatype                      pairgbs                                      
  13  clust_thre

### Step 1: Demultiplex the raw data files
This step demultiplexes the raw data and links the resulting fastq files to the Assembly object as Samples.

In [6]:
## demultiplex the raw_data files
## set step1 to only go if no samples are present...
data1.step1(preview=1) #append=True)

preview: 1B0 GATATA
preview: 2G0 ATAAAG
preview: 2F0 TGAAAG
preview: 1A0 CATCAT
preview: 2H0 AAGAAG
preview: 2E0 GATGGT
preview: 3I0 TTGAGG
preview: 1C0 GGGTGG
preview: 3L0 GTTGGG
preview: 1D0 TAGTAT
preview: 3J0 TATGGG
preview: 3K0 TGATGT
preview: ('/home/deren/Dropbox/ipyrad/tests/data/sim_pairgbs_R1_.fastq.gz', '/home/deren/Dropbox/ipyrad/tests/data/sim_pairgbs_R2_.fastq.gz')
raws [('/home/deren/Dropbox/ipyrad/tests/data/sim_pairgbs_R1_.fastq.gz', '/home/deren/Dropbox/ipyrad/tests/data/sim_pairgbs_R2_.fastq.gz')]
chunkslist [['/home/deren/Dropbox/ipyrad/tests/data/sim_pairgbs_R1_.fastq.gz', [('/tmp/tmpDz8WAO', '/tmp/tmpIl4BKx'), ('/tmp/tmpImj62V', '/tmp/tmpHKfH80'), ('/tmp/tmp7eQqeN', '/tmp/tmpiw4MJU'), ('/tmp/tmpZ2OAiQ', '/tmp/tmpCayifX'), ('/tmp/tmpqOy_EI', '/tmp/tmpFkksGd')]]]
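
The idea behind demultiplexing can be illustrated with a minimal, self-contained sketch. It is not ipyrad's implementation: the `demultiplex` function and the example reads are hypothetical, only exact barcode matches are accepted, and the two barcodes are copied from the preview output above.

```python
# Two barcodes copied from the preview output; the reads are made up.
BARCODES = {"CATCAT": "1A0", "GATATA": "1B0"}


def demultiplex(reads, barcodes):
    """Group reads by the sample whose barcode starts the read.

    Assumption: exact matches only, no mismatches tolerated.
    Returns (dict of sample name -> barcode-trimmed reads, unmatched count).
    """
    by_sample = {name: [] for name in barcodes.values()}
    unmatched = 0
    for read in reads:
        for barcode, name in barcodes.items():
            if read.startswith(barcode):
                # strip the barcode and keep the rest of the read
                by_sample[name].append(read[len(barcode):])
                break
        else:
            # no barcode matched the start of this read
            unmatched += 1
    return by_sample, unmatched


reads = [
    "CATCATTGCAGAAAC",
    "GATATATGCAGCCCT",
    "CATCATTGCAGGGTA",
    "TTTTTTTGCAGACGT",
]
by_sample, unmatched = demultiplex(reads, BARCODES)
for name in sorted(by_sample):
    print("%s %d reads" % (name, len(by_sample[name])))
print("unmatched %d" % unmatched)
```

A real demultiplexer would also have to handle paired files, barcode mismatches, and gzip-compressed input, all of which this sketch leaves out.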


In [7]:
#data1.set_params(4, "./test_pairgbs/fastq/*")
#data1.link_fastqs()
for key, sample in data1.samples.items():
    print key, sample.state, sample.stats.values()

1B0 1 [4000, [None, None], None, None, None]
2G0 1 [4000, [None, None], None, None, None]
2F0 1 [4000, [None, None], None, None, None]
1A0 1 [4000, [None, None], None, None, None]
1C0 1 [4000, [None, None], None, None, None]
2H0 1 [4000, [None, None], None, None, None]
2E0 1 [4000, [None, None], None, None, None]
3L0 1 [4000, [None, None], None, None, None]
3I0 1 [4000, [None, None], None, None, None]
1D0 1 [4000, [None, None], None, None, None]
3J0 1 [4000, [None, None], None, None, None]
3K0 1 [4000, [None, None], None, None, None]


### Step 2: Filter reads 
If we wanted to run this step on only a subset of our data, we could call the `step2` function on selected samples only. Because `step2` is a method of `data1`, it always executes with the parameters that are linked to `data1`.

In [8]:
## if edited files already exist, link them
data1.link_edits()

In [9]:
data1.step2() #["1A0","1B0"])#
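
As a rough illustration of what a quality filter such as `max_low_qual_bases` (parameter 8, set to 5 above) might do, here is a minimal sketch. It is an assumption-laden example, not ipyrad's code: the function names are hypothetical, and the Phred+33 encoding, the cutoff of 20, and the inclusive comparison are all assumptions.

```python
def count_low_quality(quality, min_phred=20, offset=33):
    """Count bases whose Phred score is below min_phred.

    Assumptions: Phred+33 encoding and a cutoff of 20.
    """
    return sum(1 for ch in quality if ord(ch) - offset < min_phred)


def passes_filter(quality, max_low_qual_bases=5, min_phred=20, offset=33):
    """Keep a read if it has at most max_low_qual_bases low-quality bases."""
    return count_low_quality(quality, min_phred, offset) <= max_low_qual_bases


good = "I" * 30                   # 'I' encodes Phred 40
bad = "I" * 20 + "#" * 10         # '#' encodes Phred 2
borderline = "I" * 25 + "#" * 5   # exactly five low-quality bases

for label, qual in [("good", good), ("bad", bad), ("borderline", borderline)]:
    print("%s %d %s" % (label, count_low_quality(qual), passes_filter(qual)))
```

Whether a read with exactly the maximum number of low-quality bases is kept is a design choice; this sketch keeps it.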

In [10]:
print data1.list_files()

test_pairgbs/
    fastq/
        3J0_R1_.gz
        3K0_R2_.gz
        3K0_R1_.gz
        2E0_R2_.gz
        1B0_R2_.gz
        3L0_R2_.gz
        2G0_R1_.gz
        2F0_R1_.gz
        1A0_R2_.gz
        2H0_R2_.gz
        3I0_R1_.gz
        2E0_R1_.gz
        2G0_R2_.gz
        3J0_R2_.gz
        2H0_R1_.gz
        1C0_R2_.gz
        2F0_R2_.gz
        1C0_R1_.gz
        1D0_R2_.gz
        3I0_R2_.gz
        1B0_R1_.gz
        3L0_R1_.gz
        1A0_R1_.gz
        1D0_R1_.gz
    edits/
        1D0.fasta
        3J0.fasta
        2F0.fasta
        2H0.fasta
        1B0.fasta
        2G0.fasta
        2E0.fasta
        1C0.fasta
        3I0.fasta
        1A0.fasta
        3L0.fasta
        3K0.fasta
    stats/
        s2_rawedit_stats.txt
        s1_demultiplex_stats.txt
None


We can access the `name` and `fname` attributes of the `Sample` objects and edit them as desired without affecting the original data files. By default `name` is equal to the file name (`fname`), but because `name` is what appears in the final output files, it may be worth shortening it to something more readable.

In [11]:
data1.step3(["1A0"], preview=1)

is list of 1
[('1A0', <ipyrad.core.sample.Sample object at 0x7f27bcff4990>)]
finished clustering
stats and cleanup ... 1
preview: in run_full
dereplicating...
vsearch -cluster_smallmem /home/deren/Dropbox/ipyrad/tests/test_pairgbs/edits/1A0.derep -strand both  -query_cov .60  -id 0.85 -userout /home/deren/Dropbox/ipyrad/tests/test_pairgbs/clust_0.85/1A0.utemp -userfields query+target+id+gaps+qstrand+qcov -maxaccepts 1 -maxrejects 0 -minsl 0.5 -fulldp -threads 4 -usersort  -notmatched /home/deren/Dropbox/ipyrad/tests/test_pairgbs/clust_0.85/1A0.htemp -fasta_width 600 -msaout /home/deren/Dropbox/ipyrad/tests/test_pairgbs/clust_0.85/1A0.msa


In [12]:
import numpy as np
b1 = np.array([list("aaabbbcccddd")])
b1 = np.append(b1,list("nnnn"))
b1

array(['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'd', 'd', 'd', 'n',
       'n', 'n', 'n'], 
      dtype='|S1')

In [13]:
data1.samples["1B0"].barcode

'GATATA'

In [14]:
a = "abcdefghif"
a = a[-5:]
a

'fghif'

In [15]:
import os
data1.samples["2G0"].files

{'clust': '',
 'consens': '',
 'edits': '/home/deren/Dropbox/ipyrad/tests/test_pairgbs/edits/2G0.fasta',
 'fastq': ('/home/deren/Dropbox/ipyrad/tests/test_pairgbs/fastq/2G0_R1_.gz',
  '/home/deren/Dropbox/ipyrad/tests/test_pairgbs/fastq/2G0_R2_.gz'),
 'pickle': None}

In [16]:
#data1.step4("1A0", sample_all=1)

In [17]:
#data1.step5("1A0")

In [18]:
#data1.step5("1A0")  ## better filters for -N-

### Quick parameter explanations are always on hand

In [19]:
ip.get_params_info(10)


        (10) clust_threshold -------------------------------------------------
        Clustering threshold. 
        Examples:
        ----------------------------------------------------------------------
        data.setparams(10) = .85          ## clustering similarlity threshold
        data.setparams(10) = .90          ## clustering similarlity threshold
        data.setparams(10) = .95          ## very high values not recommended 
        data.setparams("clust_threshold") = .83  ## verbose
        ----------------------------------------------------------------------
        


### Log history 
A common problem after a long analysis is forgetting which parameters you used at which point, and when you changed them. The log history timestamps all calls to `set_params()`, as well as calls to the `step` methods. It also records copies and branching of data objects.

In [20]:
for i in data1.log:
    print i
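
To make the idea of a timestamped parameter log concrete, here is a minimal sketch. The `ParamLog` class is a hypothetical stand-in, not the Assembly class, and the format of its log entries is an assumption; only the `set_params` and `log` names are borrowed from this notebook.

```python
import time


class ParamLog(object):
    """Hypothetical stand-in for an object that logs its parameter changes."""

    def __init__(self, name):
        self.name = name
        self.params = {}
        self.log = []
        self._record("created %s" % name)

    def _record(self, message):
        # store a (timestamp, message) pair for every recorded event
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        self.log.append((stamp, message))

    def set_params(self, key, value):
        self.params[key] = value
        self._record("set_params(%r, %r)" % (key, value))


demo = ParamLog("test_pairgbs")
demo.set_params(12, "pairgbs")
demo.set_params(9, 4)
for stamp, message in demo.log:
    print("%s  %s" % (stamp, message))
```

Keeping the log as an append-only list means the order of changes is preserved even when two calls share the same timestamp.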

### Saving Assembly objects
Assembly objects can be saved and loaded, so interactive analyses can be started, stopped, and resumed easily. Saved files use Python's serialized 'pickle' format. Individual Sample objects are saved within Assembly objects. Keep in mind that some of the information in an Assembly object consists of links to data files, which are not stored in the saved file. Most of the information we would want to analyze after assembly, however, is stored in the object itself, so these objects will be useful for making plots and tables of assembly statistics later.

In [None]:
## save assembly object
#ip.save_assembly("data1.p")

## load assembly object
#data = ip.load_assembly("data1.p")
#print data.name
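
The save and load round trip can be sketched with the standard `pickle` module alone. This is an illustration under stated assumptions: a plain dictionary stands in for the Assembly object, the file name `demo.p` is hypothetical, and the code does not use `ip.save_assembly` or `ip.load_assembly`. The barcode and state values are copied from the outputs above.

```python
import os
import pickle
import tempfile

# A plain dict stands in for an Assembly object (assumption for illustration).
demo = {
    "name": "test_pairgbs",
    "samples": {"1B0": {"barcode": "GATATA", "state": 1}},
}

# Write the object to a temporary file, then read it back.
path = os.path.join(tempfile.mkdtemp(), "demo.p")
with open(path, "wb") as handle:
    pickle.dump(demo, handle)

with open(path, "rb") as handle:
    restored = pickle.load(handle)

print(restored["name"])
print(restored["samples"]["1B0"]["barcode"])
```

Only the values held in the object are written to disk. Any file paths inside it remain plain strings, so the data files they point to still have to exist when the object is loaded again.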