# Pipeline
This pipeline is modeled after:
1. [Carol Rowe's Allenrolfea analysis](https://digitalcommons.usu.edu/all_datasets/39/) <br>
2. [empirical ipyrad API pedicularis](https://nbviewer.jupyter.org/github/dereneaton/ipyrad/blob/master/tests/cookbook-empirical-API-1-pedicularis.ipynb)
<br>
*to be updated with further population analysis as well* <br>


**Check `ipcluster` instance with a profile**<br>
First, we need to check our parallelization setup.

In [6]:
import ipyparallel as ipp
print 'ipyparallel', ipp.__version__
mpi1 = ipp.Client(profile="MPI1")
print 'mpi1 has',len(mpi1), 'cores'

ipyparallel 6.0.2
mpi1 has 15 cores
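
If later steps assume a minimum number of engines, the check above can be turned into a guard that stops the notebook early. The sketch below is only an illustration, not part of ipyparallel: `require_engines` and `minimum` are hypothetical names, and the helper relies solely on `len(client)` reporting the number of connected engines, as `len(mpi1)` does above.

```python
def require_engines(client, minimum):
    """Return the engine count, or raise if fewer than `minimum` are connected.

    `client` is anything that supports len(), such as the ipp.Client above.
    """
    n_engines = len(client)
    if n_engines < minimum:
        raise RuntimeError(
            "expected at least %d engines, found %d" % (minimum, n_engines)
        )
    return n_engines
```

With the client from the cell above, this would be called as `require_engines(mpi1, 15)`.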


### ipyrad
The only library we need to import is *ipyrad*. The import command is usually the first code called in a Python document to load any necessary packages. In the code below, we use a convenient trick in Python to tell it that we want to refer to ipyrad simply as ip. This saves us a little space since we might type the name many times. Below that, we use the print statement to print the version number of *ipyrad*. This is good practice to keep a record of which software version we are using. <br>
<br>
This guide and its markdown text are taken directly from the [*ipyrad* API user guide](https://ipyrad.readthedocs.io/API_user-guide.html), but the examples use the 2019_thesis data.

In [7]:
#requires ipyrad
import ipyrad as ip
print ip.__version__

0.7.28
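
When more than one package matters for reproducibility, the same record-keeping idea can be wrapped in a small loop. This is a minimal sketch, assuming each package exposes a `__version__` attribute (as *ipyrad* and *ipyparallel* do above); `record_versions` is a hypothetical helper name, not an *ipyrad* function.

```python
import importlib


def record_versions(package_names):
    """Map each package name to its __version__ string, if one is available."""
    versions = {}
    for name in package_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            # Keep going so one missing package does not hide the others.
            versions[name] = "not installed"
            continue
        versions[name] = getattr(module, "__version__", "unknown")
    return versions


# Print one line per package so the versions end up in the notebook output.
for name, version in sorted(record_versions(["ipyrad", "ipyparallel"]).items()):
    print("%s %s" % (name, version))
```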


### Data structure
There are two main objects in *ipyrad*: Assembly class objects and Sample class objects. And in fact, most users will only ever interact with the Assembly class objects, since Sample objects are stored inside of the Assembly objects, and the Assembly objects have functions, such as merge, and branch, that are designed for manipulating and exchanging Samples between different Assemblies. <br>
### Assembly class objects
Assembly objects are a unique data structure that ipyrad uses to store and organize information about how to assemble RAD-seq data. An Assembly object contains functions that can be applied to data, such as clustering and aligning sequences. It also stores information about which settings (parameters) to use for assembly functions, and which Samples the functions should be applied to. You can think of it mostly as a container that has a set of rules associated with it. <br>
To create a new Assembly object, use the `ip.Assembly()` function and pass it the name of your new Assembly. Creating an object in this way has exactly the same effect as using the **-n {name}** argument in the *ipyrad* command line tool, except that in the API, instead of creating a params.txt file, we store the new Assembly information in a Python variable. The variable can be named anything you want. Below I name the variable *data1* so it is easy to remember that the Assembly name is also data1.

In [14]:
data1 = ip.Assembly("data1")

New Assembly: data1


### Setting parameters
You now have an Assembly object with a default set of parameters associated with it, analogous to the params file in the command line tool. You can view and modify these parameters using two methods of the Assembly object, `set_params()` and `get_params()`.

In [15]:
## set and modify params for this assembly object here
data1.set_params('project_dir', './') # this will need to be run for EACH assembly
data1.set_params('raw_fastq_path', './*.gz')
data1.set_params('barcodes_path', '') # gonna need to fix this, remember there was an issue with what Restriction codes gave us!
data1.set_params('assembly_method', 'denovo+reference')
data1.set_params('reference_sequence', 'Tzet_genomic.fna')
data1.set_params('datatype', 'ddrad')
data1.set_params('restriction_overhang', 'TGCAG, CGG') # remember there was an issue with what Restriction codes gave us!
data1.set_params('mindepth_statistical', '6')
data1.set_params('mindepth_majrule', '6')
data1.set_params('filter_adapters', '1') # do two runs: one with 1 and one with 2 (2 = stricter)
data1.set_params('max_SNPs_locus', '20, 30') # 20,20 is standard in ipyrad, we used 20,30 last time
# ...

#print param file
data1.get_params()

IPyradError:     Error setting parameter 'raw_fastq_path'
        The value entered for the path to the raw fastq file is unrecognized.
    Please be sure this path is correct. Double check the file name and
    the file extension. If it is a relative path be sure the path is
    correct with respect to the directory you're running ipyrad from.
    You entered: /home/cardenas.61/output/cluster_analysis/both_outfiles/*.gz

    You entered: ./*.gz
    

#### Instantaneous parameter (and error) checking
A nice feature of the `set_params()` function in the *ipyrad* API is that it checks your parameter settings at the time that you change them to make sure that they are compatible. By contrast, the *ipyrad* CLI does not check params until you try to run a step function. As you saw, we set the `raw_fastq_path` parameter to `./*.gz`, but no matching file exists in this directory, so it throws an error. <br>
Once everything is fixed, you can print your params and make sure everything looks right.
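
The kind of check that failed above can also be run by hand before calling `set_params()`. The sketch below is a stand-in written with the standard library only; it is not *ipyrad*'s own validation logic, and `check_fastq_glob` is a hypothetical helper name.

```python
import glob
import os


def check_fastq_glob(pattern):
    """Return the sorted files matching `pattern`, or raise if there are none."""
    matches = sorted(glob.glob(os.path.expanduser(pattern)))
    if not matches:
        # Report the working directory, since relative patterns depend on it.
        raise ValueError(
            "no files match %r (looked relative to %s)" % (pattern, os.getcwd())
        )
    return matches
```

Calling `check_fastq_glob('./*.gz')` from the notebook's working directory shows which files, if any, the pattern would pick up, which makes a wrong relative path easier to spot.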

In [17]:
data1.get_params()

0   assembly_name               data1                                        
1   project_dir                 /home/cardenas.61/output/cluster_analysis/both_outfiles
2   raw_fastq_path                                                           
3   barcodes_path                                                            
4   sorted_fastq_path                                                        
5   assembly_method             denovo                                       
6   reference_sequence                                                       
7   datatype                    rad                                          
8   restriction_overhang        ('TGCAG', '')                                
9   max_low_qual_bases          5                                            
10  phred_Qscore_offset         33                                           
11  mindepth_statistical        6                                            
12  mindepth_majrule            6                     
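
Since the settings have to be repeated for each assembly, one option is to keep the shared ones in a single list and apply them in a loop. This is a minimal sketch that assumes only the `set_params(name, value)` call used above; `apply_params` and `SHARED_PARAMS` are hypothetical names, and the values are copied from the cell above.

```python
def apply_params(assembly, params):
    """Call assembly.set_params(name, value) for each pair, in the given order."""
    for name, value in params:
        assembly.set_params(name, value)
    return assembly


# Shared settings; a list of pairs keeps the order fixed.
SHARED_PARAMS = [
    ("datatype", "ddrad"),
    ("restriction_overhang", "TGCAG, CGG"),
    ("mindepth_statistical", "6"),
    ("mindepth_majrule", "6"),
    ("filter_adapters", "1"),
]
```

For the Assembly created earlier, this would be called as `apply_params(data1, SHARED_PARAMS)`; paths that differ per assembly can still be set individually.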

# Multiple libraries and multiple lanes
[See the ipyrad documentation](https://ipyrad.readthedocs.io/tutorial-combining-data.html?highlight=-m%20both) for how to handle multiple libraries and lanes in this API, or jump to the API section below.
<br><br> Another way to handle it is the good old-fashioned way on the command line BEFORE running any Python code... BUT this will be **tedious** for 8 lanes from one library and 4 lanes from another! If you really hate yourself... make 12 param files for these 12 files and run step 1 on each.
Example:<br>
`ipyrad -p params-i06_1.txt -s 1 -f`<br>
`ipyrad -p params-i12_1.txt -s 1 -f`<br>
`ipyrad -p params-i03_1.txt -s 1 -f`<br>
`ipyrad -p params-i04_1.txt -s 1 -f`<br>
`ipyrad -p params-i05_1.txt -s 1 -f`<br>
`ipyrad -p params-i07_1.txt -s 1 -f`<br>
`ipyrad -p params-i01_1.txt -s 1 -f`<br>
`ipyrad -p params-i02_1.txt -s 1 -f`<br>
`ipyrad -p params-i06_2.txt -s 1 -f`<br>
`ipyrad -p params-i12_2.txt -s 1 -f`<br>
`ipyrad -p params-i03_2.txt -s 1 -f`<br>
`ipyrad -p params-i04_2.txt -s 1 -f`<br>
<br>
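
If the command-line route is taken anyway, the twelve step 1 commands do not have to be typed by hand. The sketch below only builds the command strings from the lane identifiers listed above and prints them; it does not run *ipyrad*, and `LANE_IDS` and `step1_commands` are hypothetical names.

```python
# Lane identifiers taken from the params file names listed above.
LANE_IDS = [
    "i06_1", "i12_1", "i03_1", "i04_1", "i05_1", "i07_1",
    "i01_1", "i02_1", "i06_2", "i12_2", "i03_2", "i04_2",
]


def step1_commands(lane_ids):
    """Build one 'ipyrad -p ... -s 1 -f' command string per lane."""
    return ["ipyrad -p params-%s.txt -s 1 -f" % lane for lane in lane_ids]


for command in step1_commands(LANE_IDS):
    print(command)
```

The printed lines can be pasted into a shell script, which keeps the list of lanes in one place if a lane is added or renamed.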

Merge these using: <br>
`ipyrad -m all [file_i06_1.txt] ... [file_i04_2.txt] -f`<br>
<br>

Then create a param file... you really shouldn't!

# Alternative: using *ipyrad* API
Using the ipyrad API is an alternative to using the command-line interface (CLI) above. As you can see below, writing code with the Python API can be much simpler and more elegant. We recommend using the API inside a Jupyter notebook. <br>
<br>
#### one lane one library<br>
`data1 = ip.Assembly("data1")`<br>
`data1.set_params("raw_fastq_path", "ipsimdata/rad_example_R1_.fastq.gz")`<br>
`data1.set_params("barcodes_path", "ipsimdata/rad_example_barcodes.txt")`<br>
`data1.run("1234567")`<br>
<br>
#### one library multiple lanes<br>
`lib1lane1 = ip.Assembly("lib1lane1")`<br>
`lib1lane1.set_params("raw_fastq_path", "ipsimdata/rad_example_R1_.fastq.gz")`<br>
`lib1lane1.set_params("barcodes_path", "ipsimdata/rad_example_barcodes.txt")`<br>
`lib1lane1.run("1")`<br>
<br>

`lib1lane2 = ip.Assembly("lib1lane2")`<br>
`lib1lane2.set_params("raw_fastq_path", "ipsimdata/rad_example_R1_.fastq.gz")`<br>
`lib1lane2.set_params("barcodes_path", "ipsimdata/rad_example_barcodes.txt")`<br>
`lib1lane2.run("1")`<br>
<br>
`merged = ip.merge("lib1-2lanes", [lib1lane1, lib1lane2])`<br>
`merged.run("234567")`<br>

<br>
#### multiple libraries multiple lanes<br>
`lib1lane1 = ip.Assembly("lib1lane1")`<br>
`lib1lane1.set_params("raw_fastq_path", "ipsimdata/lib1_lane1_R1_.fastq.gz")`<br>
`lib1lane1.set_params("barcodes_path", "ipsimdata/lib1_barcodes.txt")`<br>
`lib1lane1.run("1")`<br>

`lib1lane2 = ip.Assembly("lib1lane2")`<br>
`lib1lane2.set_params("raw_fastq_path", "ipsimdata/lib1_lane2.fastq.gz")`<br>
`lib1lane2.set_params("barcodes_path", "ipsimdata/lib1_barcodes.txt")`<br>
`lib1lane2.run("1")`<br>

`lib2lane1 = ip.Assembly("lib2lane1")`<br>
`lib2lane1.set_params("raw_fastq_path", "ipsimdata/lib2_lane1.fastq.gz")`<br>
`lib2lane1.set_params("barcodes_path", "ipsimdata/lib2_barcodes.txt")`<br>
`lib2lane1.run("1")`<br>

`lib2lane2 = ip.Assembly("lib2lane2")`<br>
`lib2lane2.set_params("raw_fastq_path", "ipsimdata/lib2_lane2_.fastq.gz")`<br>
`lib2lane2.set_params("barcodes_path", "ipsimdata/lib2_barcodes.txt")`<br>
`lib2lane2.run("1")`<br>

`fulldata = ip.merge("fulldata", [lib1lane1, lib1lane2, lib2lane1, lib2lane2])`<br>
`fulldata.run("234567")`<br>
<br>
#### splitting a library into different projects<br>
`project1 = ["sample1", "sample2", "sample3"]`<br>
`project2 = ["sample4", "sample5", "sample6"]`<br>

`proj1 = fulldata.branch("proj1", subsamples=project1)`<br>
`proj2 = fulldata.branch("proj2", subsamples=project2)`<br>

`proj1.run("234567", force=True)`<br>
`proj2.run("234567", force=True)`<br>
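
If there are more than a couple of projects, the branching can be driven from a dictionary. This sketch assumes only the `branch(name, subsamples=...)` call used above; `branch_projects` is a hypothetical helper name.

```python
def branch_projects(assembly, projects):
    """Create one branch per project; `projects` maps branch name -> sample names."""
    return {
        name: assembly.branch(name, subsamples=samples)
        for name, samples in sorted(projects.items())
    }
```

For the two projects above, `branch_projects(fulldata, {"proj1": project1, "proj2": project2})` would return a dictionary holding both branches, and each one can then be run as shown.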

<br>
#### print stats of project 1<br>
`print proj1.stats`<br>
