# User guide to the *ipyrad* API
Welcome! This tutorial will introduce you to the basic and advanced features of working with the *ipyrad* API to assemble RADseq data in Python. The API offers many advantages over the command-line interface, but requires a little more work up front to learn the necessary tools for using it. This includes knowing some very rudimentary Python, and setting up a Jupyter notebook. 

In [1]:
import ipyrad as ip

### Getting started with Jupyter notebooks

This tutorial is an example of a [Jupyter Notebook](http://jupyter.org/Jupyter). If you've installed *ipyrad* then you already have Jupyter installed as well, and you can launch an interactive notebook like this one from the command line by typing `jupyter-notebook`. For some background on how Jupyter notebooks work, I recommend searching online or watching this [YouTube video](https://www.youtube.com/watch?v=HW29067qVWk&t=47s). Once you have the hang of it, follow along with this code in your own notebook.

### Connecting your notebook to a cluster
We have two previous tutorials about using Jupyter notebooks and connecting Jupyter notebooks to a computing cluster (see [here](http://ipyrad.readthedocs.io/analysis.html)). For this notebook I will assume that you are running this code in a Jupyter notebook, and that you have an *ipcluster* instance running either locally or remotely on a cluster. If an *ipcluster* instance is running then *ipyrad* will automatically use all available cores on that cluster instance.

### Import Python libraries
The only library we need to import is *ipyrad*. The *import* command is usually the first code called in a Python document to load any necessary packages. In the code below, we use a convenient trick in Python to tell it that we want to refer to *ipyrad* simply as *ip*. This saves us a little space since we might type the name many times. Below that, we use the print statement to print the version number of *ipyrad*. This is good practice to keep a record of which software version we are using. 

In [2]:
## this is a comment, it is not executed, but the code below it is.
import ipyrad as ip

## here we print the version
print ip.__version__

0.7.1


## The *ipyrad* API data structures
There are two main objects in *ipyrad*: Assembly class objects and Sample class objects. Most users will only ever interact with Assembly objects, since Sample objects are stored inside them, and Assembly objects have functions, such as `merge` and `branch`, that are designed for manipulating and exchanging Samples between different Assemblies.

### Assembly Class objects
Assembly objects are a unique data structure that *ipyrad* uses to store and organize information about how to assemble RAD-seq data. They contain functions that can be applied to data, such as clustering and aligning sequences. They also store information about which settings (parameters) to use for those assembly functions, and which Samples those functions should be applied to. You can think of an Assembly mostly as a container that has a set of rules associated with it.

To create a new Assembly object use the `ip.Assembly()` function and pass it the name of your new Assembly. Creating an object in this way has exactly the same effect as using the **-n {name}** argument in the *ipyrad* command line tool, except in the API instead of creating a params.txt file, we store the new Assembly information in a Python variable. This can be named anything you want. Below I name the variable *data1* so it is easy to remember that the Assembly name is also data1. 

In [3]:
## create an Assembly object named data1. 
data1 = ip.Assembly("data1")


New Assembly: data1


### Setting parameters
You now have an Assembly object with a default set of parameters associated with it, analogous to the params file in the command line tool. You can view and modify these parameters using two functions of the Assembly object, `set_params()` and `get_params()`.

In [4]:
## setting/modifying parameters for this Assembly object
data1.set_params('project_dir', "pedicularis")
data1.set_params('sorted_fastq_path', "./example_empirical_rad/*.gz")
data1.set_params('filter_adapters', 2)
data1.set_params('datatype', 'rad')

## prints the parameters to the screen
data1.get_params()

0   assembly_name               data1                                        
1   project_dir                 ./pedicularis                                
2   raw_fastq_path                                                           
3   barcodes_path                                                            
4   sorted_fastq_path           ./example_empirical_rad/*.gz                 
5   assembly_method             denovo                                       
6   reference_sequence                                                       
7   datatype                    rad                                          
8   restriction_overhang        ('TGCAG', '')                                
9   max_low_qual_bases          5                                            
10  phred_Qscore_offset         33                                           
11  mindepth_statistical        6                                            
12  mindepth_majrule            6                               

### Instantaneous parameter (and error) checking 
A nice feature of the `set_params()` function in the *ipyrad* API is that it checks your parameter settings at the time that you change them to make sure that they are compatible. By contrast, the *ipyrad* CLI does not check params until you try to run a step function. Below you can see that an error is raised when we try to set the `clust_threshold` parameter to 2.0, since it requires a decimal value between 0 and 1. It's hard to catch every possible error, but we've tried to catch many of the most common errors in parameter settings.

In [5]:
## this is expected to raise an error, since the clust_threshold cannot be 2.0
data1.set_params("clust_threshold", 2.0)

IPyradError:     Error setting parameter 'clust_threshold'
    clust_threshold must be a decimal value between 0 and 1.
    You entered: 2.0
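
To make the idea of checking a value at the moment it is set more concrete, here is a minimal sketch in plain Python. It does not use *ipyrad* and is not how `set_params()` is implemented. The function name `check_clust_threshold` is hypothetical, and treating the allowed range as strictly between 0 and 1 is an assumption based only on the wording of the error message above.

```python
def check_clust_threshold(value):
    """Hypothetical stand-in for the kind of check done when a parameter is set."""
    # reject anything that cannot be read as a decimal number
    try:
        value = float(value)
    except (TypeError, ValueError):
        raise ValueError(
            "clust_threshold must be a decimal value between 0 and 1.")
    # assumption: the endpoints 0 and 1 are themselves not allowed
    if not 0.0 < value < 1.0:
        raise ValueError(
            "clust_threshold must be a decimal value between 0 and 1. "
            "You entered: {}".format(value))
    return value


# a valid setting is returned unchanged
accepted = check_clust_threshold(0.85)

# an invalid setting fails immediately, not later when a step is run
try:
    check_clust_threshold(2.0)
    rejected = False
except ValueError as err:
    rejected = True
    message = str(err)

print(message)
```

The point of the design is that the mistake surfaces in the cell where it was made, rather than partway through a long-running assembly step.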
    

### Attributes of Assembly objects
Assembly objects have many attributes which you can access to learn more about your Assembly. To see the full list of options you can type the name of your Assembly variable, followed by a '.', and then press `<tab>`. This will use tab-completion to list all of the available options. Below I print a few examples.

In [6]:
print data1.name

data1


In [7]:
## another example attribute listing directories
## associated with this object. Most are empty b/c
## we haven't started creating files yet. But you 
## can see that it shows the fastq directory. 
print data1.dirs

fastqs : 
edits : 
clusts : 
consens : 
outfiles : 



### Sample Class objects
Sample Class objects correspond to individual samples in your study. They store the file paths pointing to the data that is saved on disk, and they store statistics about the results of each step of the Assembly. Sample class objects are stored inside Assembly class objects, and can be added, removed, or merged with other Sample class objects between different Assemblies.

### Creating Samples
Samples are created during step 1 of the *ipyrad* Assembly. This involves either demultiplexing raw data files or loading data files that are already demultiplexed. For this example we are loading demultiplexed data files. Because we've already entered the path to our data files in the `sorted_fastq_path` parameter of our Assembly object, we can go ahead and run step 1 to create Sample objects that are linked to the data files.

In [8]:
## run step 1 to create Sample objects
data1.run("1", force=True)


Assembly: data1
[####################] 100%  loading reads         | 0:00:11 | s1 | 


### Samples stored in an Assembly
You can see below that Sample objects are stored in an Assembly under the attribute `samples`. They are stored as a dictionary in which the keys are Sample names and the values are the Sample objects.

In [9]:
## Sample objects stored as a dictionary
data1.samples

{'29154_superba': <ipyrad.core.sample.Sample at 0x7f8270a0bd90>,
 '30556_thamno': <ipyrad.core.sample.Sample at 0x7f8270a0be50>,
 '30686_cyathophylla': <ipyrad.core.sample.Sample at 0x7f8270a0b790>,
 '32082_przewalskii': <ipyrad.core.sample.Sample at 0x7f8270a0b850>,
 '33413_thamno': <ipyrad.core.sample.Sample at 0x7f8270a0b950>,
 '33588_przewalskii': <ipyrad.core.sample.Sample at 0x7f8270a2b8d0>,
 '35236_rex': <ipyrad.core.sample.Sample at 0x7f8270a38790>,
 '35855_rex': <ipyrad.core.sample.Sample at 0x7f8270a38410>,
 '38362_rex': <ipyrad.core.sample.Sample at 0x7f8270a45350>,
 '39618_rex': <ipyrad.core.sample.Sample at 0x7f8270a59a50>,
 '40578_rex': <ipyrad.core.sample.Sample at 0x7f827082c610>,
 '41478_cyathophylloides': <ipyrad.core.sample.Sample at 0x7f827082ce50>,
 '41954_cyathophylloides': <ipyrad.core.sample.Sample at 0x7f8270835a10>}
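
Because this is an ordinary dictionary, the usual Python dictionary operations apply to it. The sketch below illustrates a few of them without importing *ipyrad*: `FakeSample` is a hypothetical stand-in for a real Sample object, and only three of the names from the listing above are used.

```python
class FakeSample(object):
    """Hypothetical stand-in for an ipyrad Sample object."""
    def __init__(self, name):
        self.name = name


# build a small dictionary shaped like the one shown above:
# keys are Sample names, values are the objects themselves
samples = {
    name: FakeSample(name)
    for name in ["29154_superba", "30556_thamno", "35236_rex"]
}

# list the names in sorted order
names = sorted(samples)

# look up a single object by its name
one = samples["35236_rex"]

# select a subset of the dictionary by testing each key
rex_only = {
    name: obj for name, obj in samples.items() if name.endswith("_rex")
}

print(names)
print(one.name)
```

With a real Assembly, the same patterns would be applied to `data1.samples` directly.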

### The progress bar
As you can see, running a step of the analysis prints a progress bar similar to what you would see in the *ipyrad* command line tool. There are some differences, however. It shows "s1" on the far right to indicate that this was step 1 of the assembly, and it does not print information about our cluster setup (e.g., number of nodes and cores). This was a stylistic choice to provide cleaner output for analyses inside Jupyter notebooks. You can view the cluster information when running the step functions by adding the argument `show_cluster=True`. Below, because we are re-running a step that already finished for this Assembly, we also need to use the `force=True` argument.



In [10]:
## re-run step 1, this time printing the cluster information
data1.run("1", show_cluster=True, force=True)


host compute node: [4 cores] on oud
Assembly: data1
[####################] 100%  loading reads         | 0:00:11 | s1 | 


### Viewing results of Assembly steps
Results for each step are stored in Sample class objects. However, Assembly class objects have attributes that summarize the stats of all the Sample class objects they contain, which provides a much easier way to view results. These are the `.stats` attribute and the `.stats_dfs` attribute, which holds a separate table for each step.

In [11]:
## print full stats summary
print data1.stats

                        state  reads_raw
29154_superba               1     696994
30556_thamno                1    1452316
30686_cyathophylla          1    1253109
32082_przewalskii           1     964244
33413_thamno                1     636625
33588_przewalskii           1    1002923
35236_rex                   1    1803858
35855_rex                   1    1409843
38362_rex                   1    1391175
39618_rex                   1     822263
40578_rex                   1    1707942
41478_cyathophylloides      1    2199740
41954_cyathophylloides      1    2199613


In [12]:
## print full stats for step 1 (in this case it's the same but for other
## steps the stats_dfs often contains more information.)
print data1.stats_dfs.s1

                        reads_raw
29154_superba              696994
30556_thamno              1452316
30686_cyathophylla        1253109
32082_przewalskii          964244
33413_thamno               636625
33588_przewalskii         1002923
35236_rex                 1803858
35855_rex                 1409843
38362_rex                 1391175
39618_rex                  822263
40578_rex                 1707942
41478_cyathophylloides    2199740
41954_cyathophylloides    2199613
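
Once the read counts are visible, a common next question is which samples received few reads. The sketch below answers that using the `reads_raw` values copied by hand from the table above into a plain dictionary, so it runs without *ipyrad*. The cutoff of one million reads is an arbitrary, hypothetical threshold chosen only for illustration.

```python
# reads_raw values copied from the step 1 table above
reads_raw = {
    "29154_superba": 696994,
    "30556_thamno": 1452316,
    "30686_cyathophylla": 1253109,
    "32082_przewalskii": 964244,
    "33413_thamno": 636625,
    "33588_przewalskii": 1002923,
    "35236_rex": 1803858,
    "35855_rex": 1409843,
    "38362_rex": 1391175,
    "39618_rex": 822263,
    "40578_rex": 1707942,
    "41478_cyathophylloides": 2199740,
    "41954_cyathophylloides": 2199613,
}

# the sample with the fewest raw reads
fewest = min(reads_raw, key=reads_raw.get)

# hypothetical cutoff: flag samples with fewer than one million reads
cutoff = 1000000
low = sorted(name for name, count in reads_raw.items() if count < cutoff)

print(fewest)
print(low)
```

A list like `low` is one way to decide which names to leave out when branching to a subsampled Assembly.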


### Branching to subsample taxa
Branching in the *ipyrad* API works the same as in the CLI, but it is often easier to use because the attributes of an Assembly object are directly accessible. For example, you can pull the list of Sample names from an Assembly, edit it, and pass it to the branching function to subsample (exclude samples from) the new branch. Below is an example.

In [13]:
## access all Sample names in data1 (list() makes an editable copy,
## so that .remove() can be used on it below)
subsamples = list(data1.samples.keys())
print "Samples in data1:\n", "\n".join(subsamples)

Samples in data1:
30686_cyathophylla
33413_thamno
30556_thamno
32082_przewalskii
29154_superba
41478_cyathophylloides
40578_rex
35855_rex
33588_przewalskii
39618_rex
38362_rex
35236_rex
41954_cyathophylloides


In [14]:
## drop two samples from this list
subsamples.remove("33588_przewalskii")
subsamples.remove("32082_przewalskii")

## use branching to create new Assembly with only Samples whose
## name is in the subsamples list
data2 = data1.branch("data2", subsamples=subsamples)
print "Samples in data2:\n", "\n".join(data2.samples)

Samples in data2:
30686_cyathophylla
33413_thamno
41478_cyathophylloides
29154_superba
40578_rex
35855_rex
30556_thamno
39618_rex
38362_rex
35236_rex
41954_cyathophylloides
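
Removing names one at a time works for a couple of samples, but a list comprehension scales better when you want to drop a whole group. The sketch below builds the same list of 11 names by excluding every sample whose taxon label is `przewalskii`. It uses only the names printed above and does not call *ipyrad*. The helper `taxon()` is hypothetical and assumes that the text after the first underscore in each name is the taxon label.

```python
# Sample names copied from the data1 listing above
all_names = [
    "29154_superba", "30556_thamno", "30686_cyathophylla",
    "32082_przewalskii", "33413_thamno", "33588_przewalskii",
    "35236_rex", "35855_rex", "38362_rex", "39618_rex", "40578_rex",
    "41478_cyathophylloides", "41954_cyathophylloides",
]


def taxon(name):
    """Return the part of a sample name after the first underscore."""
    return name.split("_", 1)[1]


# keep every sample except those from the taxon we want to exclude
exclude = "przewalskii"
subsamples = [name for name in all_names if taxon(name) != exclude]

print("{} of {} samples kept".format(len(subsamples), len(all_names)))
```

The resulting list could then be passed as the `subsamples` argument to `branch()`, as in the cell above.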


## Branching to iterate over parameter settings
This is the real bread and butter of the *ipyrad* API. 

You can write simple for-loops in Python to apply a range of parameter settings to different branched Assemblies. Because branching lets you recycle the intermediate states that are shared between branched Assemblies, this greatly reduces the amount of computation needed to produce multiple data sets. It is particularly useful when Assemblies differ by only one or a few parameters that are applied late in the assembly process. Setting up efficient branching code in this way requires some prior knowledge about when (in which step) each parameter is applied in *ipyrad*. That information is available in the [documentation](http://ipyrad.readthedocs.io/parameters.html).
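
To get a feel for how much work branching can save, the sketch below counts step runs for a made-up design. Everything in it is an assumption for illustration: the four parameter names are informal labels, the step at which each one is first applied is hypothetical (check the parameter documentation for the real values), and the number of values tried for each is arbitrary. The only thing taken from this tutorial is that an assembly has seven steps.

```python
# hypothetical: the first step (1-7) at which each parameter takes effect
first_step = {
    "filter_setting": 2,
    "clust_threshold": 3,
    "min_depth": 5,
    "min_sample": 7,
}

# hypothetical: how many different values we want to try for each parameter
n_values = {
    "filter_setting": 2,
    "clust_threshold": 3,
    "min_depth": 3,
    "min_sample": 3,
}

n_steps = 7


def distinct_runs(step):
    """Number of distinct results needed at a step when shared states are reused."""
    runs = 1
    for param, start in first_step.items():
        # a parameter only multiplies the work from its first step onward
        if start <= step:
            runs *= n_values[param]
    return runs


total_assemblies = distinct_runs(n_steps)

# running every assembly from scratch repeats all seven steps each time
without_branching = total_assemblies * n_steps

# with branching, each step is run once per distinct upstream setting
with_branching = sum(distinct_runs(step) for step in range(1, n_steps + 1))

print(total_assemblies)
print(without_branching)
print(with_branching)
```

Under these assumptions the branched design needs far fewer step runs than starting each assembly from the beginning, and the saving grows as more of the varied parameters are applied late.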

When setting up for-loop routines like the one below it may be helpful to break the script up among multiple cells of a Jupyter notebook so that you can easily restart from one step or another. It may also be useful to subsample your data set to a small number of samples to test the code first, and if all goes well, then proceed with your full data set.

### An example to create 54 assemblies
In the example below we will create 54 complete Assemblies that differ in their combination of settings for four parameters (filter_setting, clust_threshold, min_depth, and min_sample).
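
One way to plan such a set of Assemblies is to enumerate every combination of settings up front and give each one a name. The sketch below does this with `itertools.product` and does not call *ipyrad*. The specific values are hypothetical and were chosen only so that 2 x 3 x 3 x 3 gives 54 combinations. The four keys follow the informal names used in the text, which may not match the exact parameter names printed by `get_params()`, and the naming scheme is likewise an assumption.

```python
import itertools

# hypothetical values for the four parameters named in the text
grid = [
    ("filter_setting", [1, 2]),
    ("clust_threshold", [0.85, 0.90, 0.95]),
    ("min_depth", [6, 10, 20]),
    ("min_sample", [4, 8, 12]),
]
param_names = [name for name, _ in grid]
value_lists = [values for _, values in grid]

# map a descriptive assembly name to the settings it should receive
plans = {}
for combo in itertools.product(*value_lists):
    settings = dict(zip(param_names, combo))
    label = "base_f{}_c{}_d{}_m{}".format(
        settings["filter_setting"],
        int(round(settings["clust_threshold"] * 100)),
        settings["min_depth"],
        settings["min_sample"],
    )
    plans[label] = settings

print(len(plans))
```

Each label could serve as the name given to `branch()`, with the matching settings applied through `set_params()` before running the relevant steps. Nesting the loops so that early-step parameters are in the outer loops keeps shared intermediate results reusable.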

In [15]:
## Start by creating an assembly, setting the path to your data,
## and running step 1. I set a project_dir so that all of our
## data sets will be grouped into a single directory.
base = ip.Assembly("base")
base.set_params("project_dir", "branch-test")
base.set_params("sorted_fastq_path", "~/Dropbox/Public/example_empirical_rad/*.gz")

## step 1: load in the data
base.run('1', show_cluster=True)

New Assembly: base
host compute node: [4 cores] on oud
Assembly: base
[####################] 100%  loading reads         | 0:00:11 | s1 | 


In [16]:
## testing
base.run("234567")

Assembly: base
[####################] 100%  processing reads      | 0:03:55 | s2 | 
[####################] 100%  dereplicating         | 0:00:34 | s3 | 
[####################] 100%  clustering            | 0:21:30 | s3 | 
[####################] 100%  building clusters     | 0:00:26 | s3 | 
[####################] 100%  chunking              | 0:00:04 | s3 | 
[####################] 100%  aligning              | 0:19:23 | s3 | 
[####################] 100%  concatenating         | 0:00:08 | s3 | 
[####################] 100%  inferring [H, E]      | 0:03:39 | s4 | 
  [####################] 100%  calculating depths    | 0:00:18 | s5 | 
  [####################] 100%  chunking clusters     | 0:00:18 | s5 | 
  [####################] 100%  consens calling       | 0:11:38 | s5 | 
Continuing from checkpoint (use 'force' arg to restart instead)
[####################] 100%  concat/shuffle input  | 0:00:05 | s6 | 
[####################] 100%  clustering across     | 0:03:21 | s6 | 
[#################

### Saving Assembly objects
Assembly objects (and the Sample objects they contain) are automatically saved each time that you use the `.run()` function. However, you can also save by calling the `.save()` function of an Assembly object. This updates the JSON file. Additionally, Assembly objects have a function called `.write_params()` which can be invoked to create a params file for use by the *ipyrad* command line tool. 

In [None]:
## save assembly object
data1.save()

## load assembly object
data1 = ip.load_assembly("pedicularis/data1.json")

## write params file for use by the CLI
data1.write_params()