# User guide to the *ipyrad* API
Welcome! This tutorial will introduce you to the basic and advanced features of working with the *ipyrad* API to assemble RADseq data in Python. The API offers many advantages over the command-line interface, but requires a little more work up front to learn the necessary tools for using it. This includes knowing some very rudimentary Python, and setting up a Jupyter notebook. 

### Getting started with Jupyter notebooks

This tutorial is an example of a [Jupyter Notebook](http://jupyter.org/Jupyter). If you've installed *ipyrad* then you already have jupyter installed as well, which you can start from the command-line (type `jupyter-notebook`) to launch an interactive notebook like this one. For some background on how jupyter notebooks work I would recommend searching on google, or watching this [YouTube video](https://www.youtube.com/watch?v=HW29067qVWk&t=47s). Once you have the hang of it, follow along with this code in your own notebook. 

### Connecting your notebook to a cluster
We have two previous tutorials about using Jupyter notebooks and connecting Jupyter notebooks to a computing cluster (see [here](http://ipyrad.readthedocs.io/analysis.html)). For this notebook I will assume that you are running this code in a Jupyter notebook, and that you have an *ipcluster* instance running either locally or remotely on a cluster. If an *ipcluster* instance is running then *ipyrad* will automatically use all available cores on that cluster instance.

### Import Python libraries
The only library we need to import is *ipyrad*. The *import* command is usually the first code called in a Python document to load any necessary packages. In the code below, we use a convenient trick in Python to tell it that we want to refer to *ipyrad* simply as *ip*. This saves us a little space since we might type the name many times. Below that, we use the print statement to print the version number of *ipyrad*. This is good practice to keep a record of which software version we are using. 

In [3]:
## this is a comment, it is not executed, but the code below it is.
import ipyrad as ip

## here we print the version
print ip.__version__

0.6.20


## The *ipyrad* API data structures
There are two main objects in *ipyrad*: Assembly class objects and Sample class objects. In fact, most users will only ever interact with Assembly objects, since Sample objects are stored inside Assembly objects, and Assembly objects have functions, such as `merge` and `branch`, that are designed for manipulating and exchanging Samples between different Assemblies. 

### Assembly Class objects
Assembly objects are a unique data structure that *ipyrad* uses to store and organize information about how to assemble RAD-seq data. They contain functions that can be applied to data, such as clustering and aligning sequences. They also store information about which settings (parameters) to use for those assembly functions, and which Samples those functions should be applied to. You can think of an Assembly mostly as a container that has a set of rules associated with it. 

To create a new Assembly object use the `ip.Assembly()` function and pass it the name of your new Assembly. Creating an object in this way has exactly the same effect as using the **-n {name}** argument in the *ipyrad* command line tool, except in the API instead of creating a params.txt file, we store the new Assembly information in a Python variable. This can be named anything you want. Below I name the variable *data1* so it is easy to remember that the Assembly name is also data1. 

In [4]:
## create an Assembly object named data1. 
data1 = ip.Assembly("data1")


  New Assembly: data1


### Setting parameters
You now have an Assembly object with a default set of parameters associated with it, analogous to the params file in the command line tool. You can view and modify these parameters using two functions of the Assembly object, `set_params()` and `get_params()`.  

In [6]:
## setting/modifying parameters for this Assembly object
data1.set_params('project_dir', "pedicularis")
data1.set_params('sorted_fastq_path', "./example_empirical_rad/*.gz")
data1.set_params('filter_adapters', 2)
data1.set_params('datatype', 'rad')

## prints the parameters to the screen
data1.get_params()

  0   assembly_name               data1                                        
  1   project_dir                 ./pedicularis                                
  2   raw_fastq_path                                                           
  3   barcodes_path                                                            
  4   sorted_fastq_path           ./example_empirical_rad/*.gz                 
  5   assembly_method             denovo                                       
  6   reference_sequence                                                       
  7   datatype                    rad                                          
  8   restriction_overhang        ('TGCAG', '')                                
  9   max_low_qual_bases          5                                            
  10  phred_Qscore_offset         33                                           
  11  mindepth_statistical        6                                            
  12  mindepth_majrule            6     

### Instantaneous parameter (and error) checking 
A nice feature of the `set_params()` function in the *ipyrad* API is that it checks your parameter settings at the time that you change them to make sure that they are compatible. By contrast, the *ipyrad* CLI does not check params until you try to run a step function. Below you can see that an error is raised when we try to set the `clust_threshold` parameter to 2.0, since it requires a decimal value between 0 and 1. It's hard to catch every possible error, but we've tried to catch many of the most common errors in parameter settings. 

In [8]:
## this is expected to raise an error, since the clust_threshold cannot be 2.0
data1.set_params("clust_threshold", 2.0)

IPyradError:     Error setting parameter 'clust_threshold'
    clust_threshold must be a decimal value between 0 and 1.
    You entered: 2.0
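
The idea of checking a value at the moment it is set can be sketched in plain Python. This is only an illustration and not how *ipyrad* implements it: `ParamError` and `check_clust_threshold` are hypothetical names, and treating the bounds 0 and 1 as exclusive is an assumption.

```python
class ParamError(Exception):
    """Hypothetical stand-in for the error type raised by ipyrad."""


def check_clust_threshold(value):
    """Return value as a float if it lies between 0 and 1 (exclusive, by assumption)."""
    value = float(value)
    if not 0.0 < value < 1.0:
        raise ParamError(
            "clust_threshold must be a decimal value between 0 and 1. "
            "You entered: {}".format(value))
    return value


## a valid setting passes straight through
threshold = check_clust_threshold("0.85")

## an invalid setting fails immediately, before any step is run
try:
    check_clust_threshold(2.0)
    message = None
except ParamError as err:
    message = str(err)
print(message)
```

Failing at assignment time means a typo in a parameter is caught in the cell where it was made, rather than minutes into a step that depends on it.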
    

### Attributes of Assembly objects
Assembly objects have many attributes which you can access to learn more about your Assembly. To see the full list of options you can type the name of your Assembly variable, followed by a '.', and then press <tab>. This will use tab-completion to list all of the available options. Below I print a few examples. 

In [9]:
print data1.name

data1


In [10]:
## another example attribute listing directories
## associated with this object. Most are empty b/c
## we haven't started creating files yet. But you 
## can see that it shows the fastq directory. 
print data1.dirs

fastqs : 
edits : 
clusts : 
consens : 
outfiles : 



### Sample Class objects
Sample Class objects correspond to individual samples in your study. They store the file paths pointing to the data that is saved on disk, and they store statistics about the results of each step of the Assembly. Sample class objects are stored inside Assembly class objects, and can be added, removed, or merged with other Sample class objects between different Assemblies. 

### Creating Samples
Samples are created during step 1 of the *ipyrad* Assembly. This involves either demultiplexing raw data files or loading data files that are already demultiplexed. For this example we are loading demultiplexed data files. Because we've already entered the path to our data files in `sorted_fastq_path` of our Assembly object, we can go ahead and run step 1 to create Sample objects that are linked to the data files.  

In [11]:
## run step 1 to create Sample objects
data1.run("1", force=True)



  Assembly: data1
  [####################] 100%  loading reads         | 0:00:11 | s1 | 

  Encountered an unexpected error (see ./ipyrad_log.txt)
  Error message is below -------------------------------
max() arg is an empty sequence


ValueError: max() arg is an empty sequence

### Samples stored in an Assembly
You can see below that Sample objects are stored in an Assembly under the attribute `samples`. They are stored as a dictionary in which the keys are Sample names and the values are the Sample objects. 

In [9]:
## Sample objects stored as a dictionary
data1.samples

{'29154_superba': <ipyrad.core.sample.Sample at 0x7f460b78db10>,
 '30556_thamno': <ipyrad.core.sample.Sample at 0x7f460b78d990>,
 '30686_cyathophylla': <ipyrad.core.sample.Sample at 0x7f460b78ddd0>,
 '32082_przewalskii': <ipyrad.core.sample.Sample at 0x7f460b7a5610>,
 '33413_thamno': <ipyrad.core.sample.Sample at 0x7f460b7b5ad0>,
 '33588_przewalskii': <ipyrad.core.sample.Sample at 0x7f460b7b5750>,
 '35236_rex': <ipyrad.core.sample.Sample at 0x7f460b748310>,
 '35855_rex': <ipyrad.core.sample.Sample at 0x7f460b73add0>,
 '38362_rex': <ipyrad.core.sample.Sample at 0x7f460b756990>,
 '39618_rex': <ipyrad.core.sample.Sample at 0x7f460b756610>,
 '40578_rex': <ipyrad.core.sample.Sample at 0x7f460b763190>,
 '41478_cyathophylloides': <ipyrad.core.sample.Sample at 0x7f460b763d10>,
 '41954_cyathophylloides': <ipyrad.core.sample.Sample at 0x7f460b6c6fd0>}

### The progress bar
As you can see, running a step of the analysis prints a progress bar similar to what you would see in the *ipyrad* command line tool. There are some differences, however. It shows "s1" on the far right to indicate that this was step 1 of the assembly, and it does not print information about our cluster setup (e.g., number of nodes and cores). This was a stylistic choice to provide a cleaner output for analyses inside Jupyter notebooks. You can view the cluster information when running the step functions by adding the argument `show_cluster=True`. Below, because we are re-running the same step that already finished for this Assembly, we need to use the `force=True` argument. 



In [10]:
## re-run step 1, this time printing the cluster information
data1.run("1", show_cluster=True, force=True)


  local compute node: [4 cores] on oud

  Assembly: data1
  [####################] 100%  loading reads         | 0:00:12 | s1 | 


### Viewing results of Assembly steps
Results for each step are stored in Sample class objects. However, Assembly class objects have attributes that summarize the stats of all the Sample class objects they contain, which provides a much easier way to view results. These are the `.stats` attribute, and the `.stats_dfs` attribute, which holds a separate table for each step. 

In [11]:
## print full stats summary
print data1.stats

                        state  reads_raw
29154_superba               1     696994
30556_thamno                1    1452316
30686_cyathophylla          1    1253109
32082_przewalskii           1     964244
33413_thamno                1     636625
33588_przewalskii           1    1002923
35236_rex                   1    1803858
35855_rex                   1    1409843
38362_rex                   1    1391175
39618_rex                   1     822263
40578_rex                   1    1707942
41478_cyathophylloides      1    2199740
41954_cyathophylloides      1    2199613


In [12]:
## print full stats for step 1 (in this case it's the same but for other
## steps the stats_dfs often contains more information.)
print data1.stats_dfs.s1

                        reads_raw
29154_superba              696994
30556_thamno              1452316
30686_cyathophylla        1253109
32082_przewalskii          964244
33413_thamno               636625
33588_przewalskii         1002923
35236_rex                 1803858
35855_rex                 1409843
38362_rex                 1391175
39618_rex                  822263
40578_rex                 1707942
41478_cyathophylloides    2199740
41954_cyathophylloides    2199613
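
A table like this is handy for spotting samples with unusually few reads before moving on. The sketch below uses a plain dictionary as a stand-in for the `reads_raw` column shown above (the numbers are copied from that output). The cutoff of half the mean read count is an arbitrary choice for illustration, not an *ipyrad* recommendation.

```python
## raw read counts copied from the stats table above
reads_raw = {
    "29154_superba": 696994,
    "30556_thamno": 1452316,
    "30686_cyathophylla": 1253109,
    "32082_przewalskii": 964244,
    "33413_thamno": 636625,
    "33588_przewalskii": 1002923,
    "35236_rex": 1803858,
    "35855_rex": 1409843,
    "38362_rex": 1391175,
    "39618_rex": 822263,
    "40578_rex": 1707942,
    "41478_cyathophylloides": 2199740,
    "41954_cyathophylloides": 2199613,
}

## samples with the fewest and the most raw reads
lowest = min(reads_raw, key=reads_raw.get)
highest = max(reads_raw, key=reads_raw.get)

## flag samples that fall below half of the mean read count
mean_reads = sum(reads_raw.values()) / float(len(reads_raw))
below_half_mean = sorted(
    name for name, count in reads_raw.items() if count < 0.5 * mean_reads
)

print(lowest)
print(highest)
print(below_half_mean)
```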


### Branching to subsample taxa
Branching in the *ipyrad* API works the same as in the CLI, but in many ways is easier to use because you can access attributes of the Assembly objects much more easily, such as when you want to provide a list of Sample names in order to subsample (exclude samples) during the branching process. Below is an example. 

In [13]:
## access all Sample names in data1
subsamples = list(data1.samples.keys())
print "Samples in data1:\n", "\n".join(subsamples)

Samples in data1:
30686_cyathophylla
33413_thamno
30556_thamno
32082_przewalskii
29154_superba
41478_cyathophylloides
40578_rex
35855_rex
33588_przewalskii
39618_rex
38362_rex
35236_rex
41954_cyathophylloides


In [14]:
## drop two samples from this list
subsamples.remove("33588_przewalskii")
subsamples.remove("32082_przewalskii")

## use branching to create new Assembly with only Samples whose
## name is in the subsamples list
data2 = data1.branch("data2", subsamples=subsamples)
print "Samples in data2:\n", "\n".join(data2.samples)

Samples in data2:
30686_cyathophylla
33413_thamno
41478_cyathophylloides
29154_superba
40578_rex
35855_rex
30556_thamno
39618_rex
38362_rex
35236_rex
41954_cyathophylloides
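
When many samples need to be dropped, it can be easier to build the list of names with a filter than to call `remove()` once per sample. The sketch below works on a plain list of the Sample names used in this tutorial; the variable `excluded_species` is a hypothetical name, and matching on the suffix after the underscore assumes names follow the `{id}_{species}` pattern seen here.

```python
## the 13 Sample names used in this tutorial
sample_names = [
    "29154_superba", "30556_thamno", "30686_cyathophylla",
    "32082_przewalskii", "33413_thamno", "33588_przewalskii",
    "35236_rex", "35855_rex", "38362_rex", "39618_rex", "40578_rex",
    "41478_cyathophylloides", "41954_cyathophylloides",
]

## keep every sample whose species suffix is not the excluded one
excluded_species = "przewalskii"
subsamples = [
    name for name in sample_names
    if not name.endswith("_" + excluded_species)
]

print(len(subsamples))
```

The resulting list could then be passed to `branch()` through the `subsamples` argument, as in the cell above.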


## Branching to iterate over parameter settings
This is the real bread and butter of the *ipyrad* API. 

You can write simple for-loops using Python code to apply a range of parameter settings to different branched assemblies. Furthermore, using branching this can be done in a way that greatly reduces the amount of computation needed to produce multiple data sets. Essentially, branching allows you to recycle intermediate states that are shared between branched Assemblies. This is particularly useful when assemblies differ by only one or a few parameters that are applied late in the assembly process. Setting up efficient branching code in this way requires some prior knowledge about when (at which step) each parameter is applied in *ipyrad*. That information is available in the documentation (http://ipyrad.readthedocs.io/parameters.html). 
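
To get a feel for the savings, the sketch below counts how many times a step would have to be run for the parameter grid used in the example later in this tutorial (2 filter settings, 3 clustering thresholds, 3 depth settings, 3 min-sample settings). It counts step runs only. Steps differ a lot in how long they take, so this is a rough illustration rather than a timing estimate.

```python
## number of values tried for each parameter in the example
n_filter, n_clust, n_depth, n_minsamp = 2, 3, 3, 3
n_final = n_filter * n_clust * n_depth * n_minsamp

## without branching: every final assembly runs all 7 steps from scratch
steps_without = n_final * 7

## with branching: a step is re-run only for the branches that differ by then
steps_with = (
    1                                     # step 1, run once on the base assembly
    + n_filter                            # step 2, once per filter setting
    + n_filter * n_clust                  # step 3, once per filter x clust
    + 3 * n_filter * n_clust * n_depth    # steps 4-6, once per filter x clust x depth
    + n_final                             # step 7, once per final assembly
)

print(steps_without)
print(steps_with)
```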

When setting up for-loop routines like the one below it may be helpful to break the script up among multiple cells of a Jupyter notebook so that you can easily restart from one step or another. It may also be useful to subsample your data set to a small number of samples to test the code first, and if all goes well, then proceed with your full data set.

### An example to create 54 assemblies
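In the example below we will create 54 complete Assemblies that differ in the settings of four parameters: the adapter filter (`filter_adapters`), the clustering threshold (`clust_threshold`), the minimum depth (`mindepth_statistical` and `mindepth_majrule`, set together), and the minimum number of samples per locus (`min_samples_locus`).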
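
Before running anything, it can help to list the names that the nested loops will produce. The sketch below uses `itertools.product` from the standard library to enumerate the same grid of values and the same naming scheme (`_f`, `_c`, `_d`, `_s` suffixes) that the cells in this section use. It only builds strings and does not touch *ipyrad*.

```python
import itertools

## the values iterated over in this section
filters = [1, 2]
clusts = [".86", ".90", ".94"]
depths = [5, 10, 15]
minsamps = [4, 8, 12]

## build one name per combination, dropping the leading "." of the threshold
names = [
    "base_f{}_c{}_d{}_s{}".format(filt, clust[1:], depth, minsamp)
    for filt, clust, depth, minsamp
    in itertools.product(filters, clusts, depths, minsamps)
]

print(len(names))
print(names[0])
```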

In [16]:
## Start by creating an assembly, setting the path to your data, 
## and running step1. I set a project-dir so that all of our 
## data sets will be grouped into a single directory.
base = ip.Assembly("base")
base.set_params("project_dir", "branch-test")
base.set_params("sorted_fastq_path", "~/Dropbox/Public/example_empirical_rad/*.gz")

## step 1: load in the data
base.run('1', show_cluster=True)

  New Assembly: base
  local compute node: [4 cores] on oud

  Assembly: base
  [####################] 100%  loading reads         | 0:00:12 | s1 | 


In [17]:
## testing
base.run("234567")


  Assembly: base
  [####################] 100%  processing reads      | 0:03:38 | s2 | 
  [####################] 100%  dereplicating         | 0:00:30 | s3 | 
  [####################] 100%  clustering            | 0:22:08 | s3 | 
  [####################] 100%  building clusters     | 0:00:28 | s3 | 
  [####################] 100%  chunking              | 0:00:03 | s3 | 
  [####################] 100%  aligning              | 1:31:32 | s3 | 
  [####################] 100%  concatenating         | 0:00:14 | s3 | 
  [####################] 100%  inferring [H, E]      | 1:00:22 | s4 | 
  [####################] 100%  calculating depths    | 0:00:18 | s5 | 
  [####################] 100%  chunking clusters     | 0:00:18 | s5 | 
  [####################] 100%  consens calling       | 0:16:22 | s5 | 
  [####################] 100%  concat/shuffle input  | 0:00:05 | s6 | 
  [####################] 100%  clustering across     | 0:03:19 | s6 | 
  [####################] 100%  building clusters     | 0:00

In [1]:
import ipyrad as ip
base = ip.load_json("/home/deren/Documents/ipyrad/tests/branch-test/base.json")
#base.get_params()
#base.write_params()
#base._link_populations()
base.set_params("output_formats", "*")
#base.run("7", force=True)



  loading Assembly: base
  from saved path: ~/Documents/ipyrad/tests/branch-test/base.json


In [1]:
from ipyrad.analysis.tetrad import Tetrad

In [None]:
tree = Tetrad(name="api",
              nboots=0,
              mapfile="./branch-test/base_outfiles/base.snps.map",
              seqfile="./branch-test/base_outfiles/base.snps.phy")

tree.run()

  loading seq array [13 taxa x 60940 bp]


In [7]:
import ipyrad.plotting as iplot
tree = """((((41478_cyathophylloides,41954_cyathophylloides),(29154_superba,30686_cyathophylla)),
          (32082_przewalskii,33588_przewalskii)),((((38362_rex,39618_rex),(35855_rex,40578_rex)),
          (30556_thamno,35236_rex)),33413_thamno));"""

x, y = iplot.shareplot(base.outfiles.loci, tree, width=900)
x, y

(<toyplot.canvas.Canvas at 0x7fe7825ff750>,
 <toyplot.coordinates.Cartesian at 0x7fe7825f9310>)

In [7]:
import h5py
from ipyrad.assemble.write_outfiles import *
from ipyrad.assemble.util import *

data = base
samples = data.samples.values()

## will iterate optim loci at a time
with h5py.File(data.clust_database, 'r') as io5:
    optim = io5["seqs"].attrs["chunksize"][0]
    nloci = io5["seqs"].shape[0]

    ## get name and snp padding
    anames = io5["seqs"].attrs["samples"]
    snames = [i.name for i in samples]
    ## get only snames in this data set sorted in the order they are in io5
    names = [i for i in anames if i in snames]
    pnames, _ = padnames(names)

In [22]:
import time
import ipyparallel as ipp

ipyclient = ipp.Client()

## get names boolean
sidx = np.array([i in snames for i in anames])
assert len(pnames) == sum(sidx)

## get names index in order of pnames
#sindx = [list(anames).index(i) for i in snames]

## send off outputs as parallel jobs
lbview = ipyclient.load_balanced_view()
start = time.time()
results = []

## build arrays and outputs from arrays.
## these arrays are keys in the tmp h5 array: seqarr, snparr, bisarr, maparr
boss_make_arrays(data, sidx, optim, nloci, ipyclient)

  [####################] 100%  building arrays       | 0:00:01 | s7 | 


In [12]:
data.outfiles.str = os.path.join(data.dirs.outfiles, data.name+".str")
data.outfiles.ustr = os.path.join(data.dirs.outfiles, data.name+".ustr")        
async = write_str(data, sidx, pnames)

KeyError: ''

In [30]:
with h5py.File(tmparrs, 'r') as io5:
    maparr = io5["maparr"]
    print maparr[:]
    end = np.where(np.all(maparr[:] == 0, axis=1))[0].min()
    print end, maparr.shape

[[1 0 0 1]
 [1 0 1 2]
 [1 0 2 3]
 ..., 
 [0 0 0 0]
 [0 0 0 0]
 [0 0 0 0]]
60940 (282021, 4)


In [31]:
tmparrs = os.path.join(data.dirs.outfiles, "tmp-{}.h5".format(data.name)) 
with h5py.File(tmparrs, 'r') as io5:
    snparr = io5["snparr"]
    bisarr = io5["bisarr"]

    ## trim to size b/c it was made longer than actual
    send = np.where(np.all(snparr[:] == "", axis=0))[0].min()
    bend = np.where(np.all(bisarr[:] == "", axis=0))[0].min()

    print(snparr.shape, bisarr.shape, send, bend)
    
    ## write to str and ustr
    out1 = open(data.outfiles.str, 'w')
    out2 = open(data.outfiles.ustr, 'w')
    numdict = {'A': '0', 'T': '1', 'G': '2', 'C': '3', 'N': '-9', '-': '-9'}
    if data.paramsdict["max_alleles_consens"] > 1:
        for idx, name in enumerate(pnames):
            print idx, name
            #print "{}\t\t\t\t\t{}\n"\
            #        .format(name,
            #        "\t".join([numdict[DUCT[i][0]] for i in snparr[idx]]))
            print snparr[idx]#, :send]

((13, 282021), (13, 12698), 60940, 12148)
0 29154_superba              
['T' 'G' 'C' ..., '' '' '']
1 30556_thamno               
['C' 'G' 'C' ..., '' '' '']
2 30686_cyathophylla         
['C' 'A' 'C' ..., '' '' '']
3 32082_przewalskii          
['N' 'N' 'N' ..., '' '' '']
4 33413_thamno               
['C' 'G' 'T' ..., '' '' '']
5 33588_przewalskii          
['N' 'N' 'N' ..., '' '' '']
6 35236_rex                  
['C' 'G' 'C' ..., '' '' '']
7 35855_rex                  
['C' 'G' 'C' ..., '' '' '']
8 38362_rex                  
['C' 'G' 'T' ..., '' '' '']
9 39618_rex                  
['N' 'N' 'N' ..., '' '' '']
10 40578_rex                  
['C' 'G' 'Y' ..., '' '' '']
11 41478_cyathophylloides     
['C' 'G' 'C' ..., '' '' '']
12 41954_cyathophylloides     
['C' 'G' 'C' ..., '' '' '']


In [6]:
base.samples

{'29154_superba': <ipyrad.core.sample.Sample at 0x7f8b5cc693d0>,
 '30556_thamno': <ipyrad.core.sample.Sample at 0x7f8b5cbaccd0>,
 '30686_cyathophylla': <ipyrad.core.sample.Sample at 0x7f8b5cc516d0>,
 '32082_przewalskii': <ipyrad.core.sample.Sample at 0x7f8b5cbb2550>,
 '33413_thamno': <ipyrad.core.sample.Sample at 0x7f8b6a29a510>,
 '33588_przewalskii': <ipyrad.core.sample.Sample at 0x7f8b5c85e8d0>,
 '35236_rex': <ipyrad.core.sample.Sample at 0x7f8b5cc69610>,
 '35855_rex': <ipyrad.core.sample.Sample at 0x7f8b5cc58710>,
 '38362_rex': <ipyrad.core.sample.Sample at 0x7f8b5cc04fd0>,
 '39618_rex': <ipyrad.core.sample.Sample at 0x7f8b5cbb2c10>,
 '40578_rex': <ipyrad.core.sample.Sample at 0x7f8b5c946950>,
 '41478_cyathophylloides': <ipyrad.core.sample.Sample at 0x7f8b5cc58750>,
 '41954_cyathophylloides': <ipyrad.core.sample.Sample at 0x7f8b5cbb2790>}

In [32]:
base.popdict = {"prz": ["32082_przewalskii", "33588_przewalskii"],
                "cya": ["29154_superba", "30686_cyathophylla", 
                        "41478_cyathophylloides", "41954_cyathophylloides"],
                "rex": ["30556_thamno", "33413_thamno", "35236_rex",
                        "35855_rex", "38362_rex", "39618_rex", "40578_rex"]}
base.popmins = {'cya':4, 'rex':4, 'prz':0}
base._link_populations(base.popdict, base.popmins)
base.populations

#base.run("7", force=True)

{'cya': (4,
  ['29154_superba',
   '30686_cyathophylla',
   '41478_cyathophylloides',
   '41954_cyathophylloides']),
 'prz': (0, ['32082_przewalskii', '33588_przewalskii']),
 'rex': (4,
  ['30556_thamno',
   '33413_thamno',
   '35236_rex',
   '35855_rex',
   '38362_rex',
   '39618_rex',
   '40578_rex'])}

### Iterating step 2 over different filter settings

In [15]:
## a dictionary for storing new branched Assemblies
s2dict = {}

## iterate over filtering params
for filt in [1, 2]:
    
    ## branch 'base', add _f{filter} to the name
    ## and set the filter param to a new value
    name = base.name + "_f{}".format(filt)
    assembly = base.branch(name)
    assembly.set_params("filter_adapters", filt)
    
    ## run step 2
    assembly.run("2")
    
    ## store assembly in dictionary
    s2dict[assembly.name] = assembly


  Assembly: base_f1
  [####################] 100%  processing reads      | 0:00:50 | s2 | 

  Assembly: base_f2
  [####################] 100%  processing reads      | 0:01:19 | s2 | 


### Iterating step 3 over different clust-thresholds

In [16]:
## A dictionary for storing new branched Assemblies
s3dict = {}

## iterate over assemblies
for name, assembly in s2dict.items():
    
    ## iterate over clust thresholds
    for clust in ['.86', '.90', '.94']:
        
        ## branch the assembly, setting a new name as name+_c{clust} 
        ## and set the new clust threshold
        new = assembly.branch(name+"_c{}".format(clust[1:]))
        new.set_params("clust_threshold", clust)
        
        ## run step 3 with new param
        new.run("3")
        
        ## store in a dictionary
        s3dict[new.name] = new


  Assembly: base_f2_c86
  [####################] 100%  dereplicating         | 0:00:06 | s3 | 
  [####################] 100%  clustering            | 0:05:01 | s3 | 
  [####################] 100%  building clusters     | 0:00:29 | s3 | 
  [####################] 100%  chunking              | 0:00:04 | s3 | 
  [####################] 100%  aligning              | 0:08:52 | s3 | 
  [####################] 100%  concatenating         | 0:00:25 | s3 | 

  Assembly: base_f2_c90
  [####################] 100%  dereplicating         | 0:00:05 | s3 | 
  [####################] 100%  clustering            | 0:05:20 | s3 | 
  [####################] 100%  building clusters     | 0:00:27 | s3 | 
  [####################] 100%  chunking              | 0:00:04 | s3 | 
  [####################] 100%  aligning              | 0:09:12 | s3 | 
  [####################] 100%  concatenating         | 0:00:25 | s3 | 

  Assembly: base_f2_c94
  [####################] 100%  dereplicating         | 0:00:05 | s3 | 
  

### Iterating steps 4-6 over different mindepth settings

In [17]:
## a dictionary for storing new branched Assemblies
s6dict = {}

## iterate over assemblies
for name, assembly in s3dict.items():
    
    ## iterate over mindepth values
    for mindepth in [5, 10, 15]:

        ## branch, assign new name (_d{depth}) and set new param
        new = assembly.branch(name+"_d{}".format(mindepth))
        new.set_params("mindepth_majrule", mindepth)
        new.set_params("mindepth_statistical", mindepth)

        ## run steps 4-6
        new.run("456")
        
        ## put into s6 dictionary
        s6dict[new.name] = new


  Assembly: base_f1_c86_d5
  [####################] 100%  inferring [H, E]      | 0:23:38 | s4 | 
  [####################] 100%  calculating depths    | 0:00:06 | s5 | 
  [####################] 100%  chunking clusters     | 0:00:06 | s5 | 
  [####################] 100%  consens calling       | 0:02:31 | s5 | 
  [####################] 100%  concat/shuffle input  | 0:00:06 | s6 | 
  [####################] 100%  clustering across     | 0:04:16 | s6 | 
  [####################] 100%  building clusters     | 0:00:07 | s6 | 
  [####################] 100%  aligning clusters     | 0:01:02 | s6 | 
  [####################] 100%  database indels       | 0:00:25 | s6 | 
  [####################] 100%  indexing clusters     | 0:00:10 | s6 | 
  [####################] 100%  building database     | 0:00:47 | s6 | 

  Assembly: base_f1_c86_d10
  [####################] 100%  inferring [H, E]      | 0:23:49 | s4 | 
  [####################] 100%  calculating depths    | 0:00:06 | s5 | 
  [#################

IPyradError: 
  Keyboard Interrupt by user. Cleaning up...

### Iterating step 7 over different minsample settings
The progress bars are starting to get really cumbersome now, as you can see. If you wanted you could pass the argument `quiet=True` to the `.run()` function and the progress bars will be suppressed, but error messages would still be printed if they occurred. 

In [2]:
## A dictionary for storing Assemblies
complete = {}

## iterate over parent assemblies
for name, assembly in s6dict.items():
    ## iterate over minsamp values
    for minsamp in [4, 8, 12]:
        
        ## branch assembly, assign new name and minsamp value
        new = assembly.branch(name+"_s{}".format(minsamp))
        new.set_params("min_samples_locus", minsamp)
        
        ## run the final step of assembly
        new.run("7", force=True)
        
        ## store Assembly in dictionary accessible by its name
        complete[new.name] = new


  Assembly: base_f2_c88_d5_s4
  [####################] 100%  filtering loci        | 0:00:07 | s7 | 
  [####################] 100%  building loci/stats   | 0:00:00 | s7 | 
  [####################] 100%  building vcf file     | 0:00:21 | s7 | 
  [####################] 100%  writing vcf file      | 0:00:00 | s7 | 
  [####################] 100%  building arrays       | 0:00:10 | s7 | 
  [####################] 100%  writing outfiles      | 0:00:03 | s7 | 
  Outfiles written to: ~/Documents/ipyrad/tests/branch-test/base_f2_c88_d5_s4_outfiles

  Assembly: base_f2_c88_d5_s8
  [####################] 100%  filtering loci        | 0:00:01 | s7 | 
  [####################] 100%  building loci/stats   | 0:00:00 | s7 | 
  [####################] 100%  building vcf file     | 0:00:06 | s7 | 
  [####################] 100%  writing vcf file      | 0:00:00 | s7 | 
  [####################] 100%  building arrays       | 0:00:07 | s7 | 
  [####################] 100%  writing outfiles      | 0:00:02 | s7 | 

### Comparing data sets
The really difficult thing now is that we just produced (2 x 3 x 3 x 3) = 54 data sets, and performing downstream analyses of all of these is going to take a lot of time. We've created a few tools for quickly comparing the stats of these assemblies. By accessing the Assembly objects themselves we have access to all of the information about the Assembly, including its name, parameter settings, and results/statistics. 
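
Because each Assembly name encodes the settings that produced it, the name alone can be turned back into a small record of parameters, which is useful for grouping or labelling results (for example, when all you have is a list of output directory names). The sketch below is one way to do that with a regular expression. `parse_name` and the dictionary keys are hypothetical, and it assumes the naming scheme used in the loops above.

```python
import re

## matches names such as "base_f1_c86_d10_s12"
NAME_PATTERN = re.compile(
    r"^(?P<base>.+)_f(?P<filt>\d+)_c(?P<clust>\d+)"
    r"_d(?P<depth>\d+)_s(?P<minsamp>\d+)$"
)


def parse_name(name):
    """Split an assembly name into the parameter values it encodes."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError("unrecognized assembly name: {}".format(name))
    fields = match.groupdict()
    return {
        "base": fields["base"],
        "filter_adapters": int(fields["filt"]),
        "clust_threshold": float("0." + fields["clust"]),
        "mindepth": int(fields["depth"]),
        "min_samples_locus": int(fields["minsamp"]),
    }


params = parse_name("base_f1_c86_d10_s12")
print(params["clust_threshold"])
```

When the Assembly objects themselves are still in memory, reading the settings from the objects is the more direct route; parsing names is a fallback.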

#### Compare stats

In [15]:
import toyplot

In [75]:
## make a dictionary with names: nloci
dat1 = {i.name: i.stats_dfs.s7_loci.locus_coverage[13] for i in complete.values()}
dat2 = {i.name: i.stats_dfs.s7_loci.sum_coverage[13] for i in complete.values()}

## plot nloci as bars
canvas = toyplot.Canvas(width=500, height=300)
axes1 = canvas.cartesian(yscale='log')
axes1.x.ticks.show = True

## set interactive names to pop-up when hovering
keys = sorted(dat2, key=lambda x: dat2[x])
vals = sorted(dat2.values())
floater = ["%s" % i for i in keys]
## plot bars with floating titles
bars2 = axes1.scatterplot(vals, title=floater)

## set interactive names to pop-up when hovering
keys = sorted(dat1, key=lambda x: dat1[x])
vals = sorted(dat1.values())
floater = ["%s" % i for i in keys]
## plot bars with floating titles
bars1 = axes1.scatterplot(vals, title=floater)

In [64]:
complete[i].stats_dfs.s5.heterozygosity["29154_superba"]
complete[i].stats_dfs.s7_loci



Unnamed: 0,locus_coverage,sum_coverage
1,0,0
2,0,0
3,0,0
4,0,0
5,0,0
6,0,0
7,0,0
8,0,0
9,0,0
10,0,0


In [51]:
## make a dictionary with names: heterozygosity of one sample
dat = {i.name: i.stats_dfs.s5.heterozygosity["29154_superba"] \
       for i in complete.values()}

## plot heterozygosity as points
canvas = toyplot.Canvas(width=600, height=300)
axes = canvas.axes()

## set interactive names to pop-up when hovering
keys = sorted(dat, key=lambda x: dat[x])
vals = sorted(dat.values())
floater = ["%s" % i for i in keys]

## plot points with floating titles
bars = axes.scatterplot(vals, title=floater, marker="o", size=10)

In [25]:
## how many loci are shared across all 13 taxa
#complete["base_f1_c86_d10_s12"].stats_dfs.s7_loci.sum_coverage[13]


In [29]:
## let's use the plotting library Toyplot
import toyplot


### Accessing the data
To compare multiple assemblies we can directly access their stats, which are stored as a Pandas DataFrame, a structure similar to the data frame in R. 


In [None]:
canvas = toyplot.Canvas()
axes = canvas.cartesian()
axes.bars([assembly.stats.reads_raw for assembly in complete.values()])

### Branching Assembly objects
Let's imagine at this point that we are interested in clustering our data at two different clustering thresholds. We will try 0.90 and 0.85. First we need to make a copy/branch of the Assembly object. The branch inherits the locations of the data linked in the first object, but diverges in any future steps applied to it. Thus, the two Assembly objects can share the same working directory, and inherit shared files, but will diverge in creating new files linked to only one or the other. You can view the directories linked to an Assembly object with the `.dirs` attribute, shown below. The prefix_outname (param 14) of the new object is automatically set to the Assembly object name. 


### Branched Assembly objects
And you can see below that the two Assembly objects are now working with several shared directories (working, fastq, edits) but with different clust directories (clust_0.85 and clust_0.9). 

In [None]:
print "data1 directories:"
for (i,j) in data1.dirs.items():
    print "{}\t{}".format(i, j)
    
print "\ndata2 directories:"
for (i,j) in data2.dirs.items():
    print "{}\t{}".format(i, j)

In [None]:
## TODO, just make a [name]_stats directory in [work] for each data obj
data1.statsfiles


### Saving stats outputs
Example: two simple ways to save the stats data frame to a file.

In [None]:
data1.stats.to_csv("data1_results.csv", sep="\t")
data1.stats.to_latex("data1_results.tex")

### Example of plotting with _ipyrad_
There are a few simple plotting functions in _ipyrad_ useful for visualizing results. These are in the module `ipyrad.plotting`. Below is an interactive plot for visualizing the distributions of coverages across the 12 samples in the test data set.  

In [11]:
import ipyrad.plotting as iplot

## plot for one or more selected samples
#iplot.depthplot(data1, ["1A_0", "1B_0"])

## plot for all samples in data1
iplot.depthplot(data1)

## save plot as pdf and html
#iplot.depthplot(data1, outprefix="testfig")

### Saving Assembly objects
Assembly objects (and the Sample objects they contain) are automatically saved each time that you use the `.run()` function. However, you can also save by calling the `.save()` function of an Assembly object. This updates the JSON file. Additionally, Assembly objects have a function called `.write_params()` which can be invoked to create a params file for use by the *ipyrad* command line tool. 

In [None]:
## save assembly object
data1.save()

## load assembly object
data1 = ip.load_json("pedicularis/data1.json")

## write params file for use by the CLI
data1.write_params()