# Thesis Pipeline
This pipeline is modeled after:
1. [Carol Rowe's Allenrolfea analysis](https://digitalcommons.usu.edu/all_datasets/39/)
2. [empirical ipyrad API pedicularis](https://nbviewer.jupyter.org/github/dereneaton/ipyrad/blob/master/tests/cookbook-empirical-API-1-pedicularis.ipynb)
3. The Grünwald Lab [Poppr analysis](http://grunwaldlab.github.io/Population_Genetics_in_R/index.html) tutorial on GitHub
<br>
<br>
*to be updated with further population analysis as well* <br>


In [None]:
#required modules/programs/packages
#seaborn
#pandas
#numpy
#ipyparallel
#ipyrad

**Check `ipcluster` instance with a profile**<br>
First, we need to check our parallelization setup.

In [21]:
import ipyparallel as ipp
print('ipyparallel version is {}'.format(ipp.__version__))
mpi1 = ipp.Client(profile="MPI1")
print('mpi1 has {} cores'.format(len(mpi1)))
mpi1.ids  # lists the engine IDs; `len(mpi1)` gives their count
# note that Python counting starts at zero!

ipyparallel version is 6.0.2
mpi1 has 15 cores


[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]

### *ipyrad*
The only library we need to import is *ipyrad*. The import command is usually the first code called in a Python document to load any necessary packages. In the code below, we use a convenient trick in Python to tell it that we want to refer to ipyrad simply as ip. This saves us a little space since we might type the name many times. Below that, we use the print statement to print the version number of *ipyrad*. This is good practice to keep a record of which software version we are using. <br>
<br>
The text of this guide comes straight from the [*ipyrad* API user guide](https://ipyrad.readthedocs.io/API_user-guide.html), but the examples use the 2019_thesis data.

In [7]:
#requires ipyrad
import ipyrad as ip
print(ip.__version__)

0.7.28


### Data structure
There are two main objects in *ipyrad*: Assembly class objects and Sample class objects. And in fact, most users will only ever interact with the Assembly class objects, since Sample objects are stored inside of the Assembly objects, and the Assembly objects have functions, such as merge, and branch, that are designed for manipulating and exchanging Samples between different Assemblies. <br>
### Assembly class objects
Assembly objects are a unique data structure that ipyrad uses to store and organize information about how to assemble RAD-seq data. An Assembly contains functions that can be applied to data, such as clustering and aligning sequences. It also stores information about which settings (parameters) to use for assembly functions, and which Samples the functions should be applied to. You can think of it mostly as a container that has a set of rules associated with it. <br>
To create a new Assembly object, use the `ip.Assembly()` function and pass it the name of your new Assembly. Creating an object in this way has exactly the same effect as using the **-n {name}** argument in the *ipyrad* command line tool, except that in the API, instead of creating a params.txt file, we store the new Assembly information in a Python variable. This can be named anything you want. Below I name the variable *data1* so it is easy to remember that the Assembly name is also data1.

In [14]:
data1 = ip.Assembly("data1")

New Assembly: data1


### Setting parameters
You now have an Assembly object with a default set of parameters associated with it, analogous to the params file in the command line tool. You can view and modify these parameters using two methods of the Assembly object, `set_params()` and `get_params()`.

In [15]:
## set and modify params for this assembly object here
data1.set_params('project_dir', './') # this will need to be run for EACH assembly
data1.set_params('raw_fastq_path', './*.gz') # your file
data1.set_params('barcodes_path', '') # still needs fixing; remember there was an issue with what the restriction codes gave us!
data1.set_params('assembly_method', 'denovo+reference') # denovo+reference seemed to work well
data1.set_params('reference_sequence', 'Tzet_genomic.fna')
data1.set_params('datatype', 'ddrad')
data1.set_params('restriction_overhang', 'TGCAG, CGG') # remember there was an issue with what the restriction codes gave us!
data1.set_params('mindepth_statistical', '6')
data1.set_params('mindepth_majrule', '6')
data1.set_params('filter_adapters', '1') # do two runs: one with 1 and one with 2 (2 = stricter)
data1.set_params('max_SNPs_locus', '20, 30') # 20, 20 is standard in ipyrad; we used 20, 30 last time
# ...

#print param file
data1.get_params()

IPyradError:     Error setting parameter 'raw_fastq_path'
        The value entered for the path to the raw fastq file is unrecognized.
    Please be sure this path is correct. Double check the file name and
    the file extension. If it is a relative path be sure the path is
    correct with respect to the directory you're running ipyrad from.
    You entered: /home/cardenas.61/output/cluster_analysis/both_outfiles/*.gz

    You entered: ./*.gz
    

#### Instantaneous parameter (and error) checking
A nice feature of the `set_params()` function in the *ipyrad* API is that it checks your parameter settings at the time that you change them to make sure that they are compatible. By contrast, the *ipyrad* CLI does not check params until you try to run a step function. As you saw, we set the raw_fastq_path parameter to `./*.gz`, but no matching file exists in this directory, so it throws an error. <br>
Once you get it all fixed, you can print your param file and make sure everything looks right.
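Before setting `raw_fastq_path`, it can help to see what the pattern actually matches from the current working directory. The following is a minimal sketch using only the standard library; the helper names `find_fastq_files` and `describe_fastq_pattern` are hypothetical and are not part of *ipyrad*.

```python
import glob
import os


def find_fastq_files(pattern):
    """Return the sorted list of files matching a glob pattern such as './*.gz'."""
    return sorted(glob.glob(os.path.expanduser(pattern)))


def describe_fastq_pattern(pattern):
    """Summarize what a raw_fastq_path pattern would match, without touching ipyrad."""
    matches = find_fastq_files(pattern)
    if not matches:
        return "no files match {!r} (cwd: {})".format(pattern, os.getcwd())
    return "{} file(s) match {!r}".format(len(matches), pattern)


# Example: check the pattern before passing it to
# data1.set_params('raw_fastq_path', ...)
print(describe_fastq_pattern('./*.gz'))
```

If the summary reports no matches, the relative path is probably being resolved against a different directory than the one holding the reads.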

In [17]:
data1.get_params()

0   assembly_name               data1                                        
1   project_dir                 /home/cardenas.61/output/cluster_analysis/both_outfiles
2   raw_fastq_path                                                           
3   barcodes_path                                                            
4   sorted_fastq_path                                                        
5   assembly_method             denovo                                       
6   reference_sequence                                                       
7   datatype                    rad                                          
8   restriction_overhang        ('TGCAG', '')                                
9   max_low_qual_bases          5                                            
10  phred_Qscore_offset         33                                           
11  mindepth_statistical        6                                            
12  mindepth_majrule            6                     

# Multiple libraries and multiple lanes
[See the *ipyrad* documentation](https://ipyrad.readthedocs.io/tutorial-combining-data.html?highlight=-m%20both) for how to handle multiple libraries and lanes in this API.<br>
## barcodes_path
Your barcodes should be set up per index group and kept in the same directory.<br>
The index groups are weird, and follow this order: 06, 12, 03, 04, 05, 07, 01, 02. <br> But the order doesn't matter as much here, as long as your barcodes file is annotated properly.

## Index by lane
We have 1.5 libraries to run so it will look something like this: <br>
#### multiple libraries multiple lanes<br>
`lib1i06 = ip.Assembly("lib1i06")`<br>
`lib1i06.set_params('project_dir', './') `<br>
`lib1i06.set_params('raw_fastq_path', './*.gz') `<br>
`lib1i06.set_params('barcodes_path', '') `<br>
`lib1i06.set_params('assembly_method', 'denovo+reference')`<br>
`lib1i06.set_params('reference_sequence', 'Tzet_genomic.fna')`<br>
`lib1i06.set_params('datatype', 'ddrad')`<br>
`lib1i06.set_params('restriction_overhang', 'TGCAG, CGG') `<br>
`lib1i06.set_params('clust_threshold', '0.9')`<br>
`lib1i06.set_params('mindepth_statistical', '6')`<br>
`lib1i06.set_params('mindepth_majrule', '6')`<br>
`lib1i06.set_params('filter_adapters', '1')`<br>
`lib1i06.set_params('max_SNPs_locus', '20, 30')`<br>
`lib1i06.set_params('output_formats', '*')`<br>
`lib1i06.run("1")`<br>
...<br>
`lib1i01 = ip.Assembly("lib1i01")`<br>
`lib1i01.set_params('project_dir', './') `<br>
`lib1i01.set_params('raw_fastq_path', './*.gz') `<br>
`lib1i01.set_params('barcodes_path', '') `<br>
`lib1i01.set_params('assembly_method', 'denovo+reference')`<br>
`lib1i01.set_params('reference_sequence', 'Tzet_genomic.fna')`<br>
`lib1i01.set_params('datatype', 'ddrad')`<br>
`lib1i01.set_params('restriction_overhang', 'TGCAG, CGG') `<br>
`lib1i01.set_params('mindepth_statistical', '6')`<br>
`lib1i01.set_params('mindepth_majrule', '6')`<br>
`lib1i01.set_params('filter_adapters', '1')`<br>
`lib1i01.set_params('max_SNPs_locus', '20, 30')`<br>
`lib1i01.set_params('output_formats', '*')`<br>
`lib1i01.set_params('clust_threshold', '0.9')`<br>
`lib1i01.run("1")`<br>

`lib2i06 = ip.Assembly("lib2i06")`<br>
`lib2i06.set_params('project_dir', './') `<br>
`lib2i06.set_params('raw_fastq_path', './*.gz') `<br>
`lib2i06.set_params('barcodes_path', '') `<br>
`lib2i06.set_params('assembly_method', 'denovo+reference')`<br>
`lib2i06.set_params('reference_sequence', 'Tzet_genomic.fna')`<br>
`lib2i06.set_params('datatype', 'ddrad')`<br>
`lib2i06.set_params('restriction_overhang', 'TGCAG, CGG') `<br>
`lib2i06.set_params('mindepth_statistical', '6')`<br>
`lib2i06.set_params('mindepth_majrule', '6')`<br>
`lib2i06.set_params('filter_adapters', '1')`<br>
`lib2i06.set_params('max_SNPs_locus', '20, 30')`<br>
`lib2i06.set_params('output_formats', '*')`<br>
`lib2i06.set_params('clust_threshold', '0.9')`<br>
`lib2i06.run("1")`<br>
...<br>
`lib2i04 = ip.Assembly("lib2i04")`<br>
`lib2i04.set_params('project_dir', './') `<br>
`lib2i04.set_params('raw_fastq_path', './*.gz') `<br>
`lib2i04.set_params('barcodes_path', '') `<br>
`lib2i04.set_params('assembly_method', 'denovo+reference')`<br>
`lib2i04.set_params('reference_sequence', 'Tzet_genomic.fna')`<br>
`lib2i04.set_params('datatype', 'ddrad')`<br>
`lib2i04.set_params('restriction_overhang', 'TGCAG, CGG') `<br>
`lib2i04.set_params('mindepth_statistical', '6')`<br>
`lib2i04.set_params('mindepth_majrule', '6')`<br>
`lib2i04.set_params('filter_adapters', '1')`<br>
`lib2i04.set_params('max_SNPs_locus', '20, 30')`<br>
`lib2i04.set_params('output_formats', '*')`<br>
`lib2i04.set_params('clust_threshold', '0.9')`<br> 
`lib2i04.run("1")`<br>
#### merge the demultiplexed libraries!
`fulldata = ip.merge("fulldata", [lib1i06, ..., lib1i01, lib2i06, ..., lib2i04])`<br>
#### Run the dataset!
`.run()` takes the same step string as the CLI `-s` flag.<br>
`fulldata.run("234567")`<br>
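The per-index blocks above repeat the same settings many times, which makes copy-paste slips in the variable or Assembly name easy to miss. One possible way to avoid that is to build the Assemblies in a loop. This is only a sketch: `build_assemblies`, `SHARED_PARAMS`, and `NAMES` are hypothetical names, the Assembly constructor is passed in as an argument, and the list of names contains only the index groups written out above, not the elided ones. Paths that differ per index group (for example `raw_fastq_path` or `barcodes_path`) would need to be set separately.

```python
def build_assemblies(make_assembly, names, shared_params, first_step="1"):
    """Create one Assembly per name, apply the same parameters, and run a step.

    make_assembly is expected to be ip.Assembly; it is passed in so that the
    loop itself has no hard dependency on ipyrad.
    """
    assemblies = []
    for name in names:
        # the Assembly name always matches the name it is stored under
        assembly = make_assembly(name)
        for key, value in shared_params:
            assembly.set_params(key, value)
        assembly.run(first_step)
        assemblies.append(assembly)
    return assemblies


# Settings shared by every index group (values copied from the cells above).
SHARED_PARAMS = [
    ("project_dir", "./"),
    ("raw_fastq_path", "./*.gz"),
    ("barcodes_path", ""),
    ("assembly_method", "denovo+reference"),
    ("reference_sequence", "Tzet_genomic.fna"),
    ("datatype", "ddrad"),
    ("restriction_overhang", "TGCAG, CGG"),
    ("clust_threshold", "0.9"),
    ("mindepth_statistical", "6"),
    ("mindepth_majrule", "6"),
    ("filter_adapters", "1"),
    ("max_SNPs_locus", "20, 30"),
    ("output_formats", "*"),
]

# Only the index groups named explicitly above; add the elided ones as needed.
NAMES = ["lib1i06", "lib1i01", "lib2i06", "lib2i04"]

# With ipyrad available, usage would look like this:
# import ipyrad as ip
# assemblies = build_assemblies(ip.Assembly, NAMES, SHARED_PARAMS)
# fulldata = ip.merge("fulldata", assemblies)
```

Keeping the shared settings in one list means a change such as a different `clust_threshold` is made once rather than in every block.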

## Branch Assembly
At this point, you have the choice to branch the assembly, which is useful for trying different parameter values. For example, you may want to see which parameters provide the best coverage.<br>
    For example, you may want to run `fulldata.run("2")` first, <br>
    then check the stats on your data with `fulldata.stats`,<br>
    then run your dataset through the further steps: `fulldata.run("3456")`<br>
###### Here we want to make sure `min_samples_locus` scales with the dataset. <br>
If we only had 2 lanes, a fixed value might be reasonable, but here we would end up with a lot of missing data. So we want to set `min_samples_locus` in proportion to our dataset (i.e., I have 130 samples and may want to use 50 at the low end and 100 at the high end). That is what this step is for.<br>
<br>
For example, we could check the coverage by changing the `min_samples_locus` parameter in two separate runs:<br>
`## create a branch for outputs with min_samples = 4 (lots of missing data)`<br>
`min4 = fulldata.branch("min4")`<br>
`min4.set_params("min_samples_locus", 4)`<br>
`min4.run("7")`<br>

`## create a branch for outputs with min_samples = 13 (no missing data)`<br>
`min13 = fulldata.branch("min13")`<br>
`min13.set_params("min_samples_locus", 13)`<br>
`min13.run("7")`<br>
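To scale `min_samples_locus` with the dataset instead of hard-coding 4 and 13, the thresholds can be derived from the sample count. The sketch below assumes thresholds are expressed as whole percentages of the samples; the function name `scaled_min_samples` and the 40% and 75% cut-offs are illustrative assumptions, chosen to land near the 50 and 100 mentioned above for 130 samples.

```python
def scaled_min_samples(n_samples, percents):
    """Convert percentages of the sample count into min_samples_locus values.

    Integer arithmetic rounds up, so a 75% threshold on 130 samples gives 98.
    """
    return [(n_samples * pct + 99) // 100 for pct in percents]


# 130 samples as in the text; the 40% and 75% cut-offs are assumptions.
thresholds = scaled_min_samples(130, [40, 75])
print(thresholds)  # [52, 98]

# Each threshold would then get its own branch, for example:
# for k in thresholds:
#     branch = fulldata.branch("min{}".format(k))
#     branch.set_params("min_samples_locus", k)
#     branch.run("7")
```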


### Final Stats
We can view the final stats of each branch to see which one we would want to use.<br>
`min13.stats`<br><br>
Or we can call the stats of specific steps to see which had the most coverage.<br>
`min4.stats_dfs.s7_samples`<br>
`min13.stats_dfs.s7_samples`<br>

# Check data quality
From here, we can begin to quality-check our data. Because read lengths vary and SNPs are selected at random, the output may not maintain the minimum number of samples per locus (param #21, `min_samples_locus`, discussed previously). Even though we compared the coverage, we want to make sure the output contains the N loci reported in the *ipyrad* stats. This code comes from the Allenrolfea analysis pipeline linked at the top of this notebook.

In [23]:
import pandas as pd
import seaborn as sns
import numpy as np
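# The .ustr file has two lines per sample; check the expected number of rows and loci.
# ipyrad reported Nloci, and we have Nsamples.
# Assumes the file is whitespace-delimited with no header row.
my_ustr = pd.read_csv('min13.ustr', sep=r'\s+', header=None)
print(my_ustr.shape) # rows should equal Nsamples*2; columns should be about Nloci (plus a sample-name column, if present)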

ImportError: No module named seaborn
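The shape check itself does not need pandas or seaborn. As a fallback sketch, the standard library is enough, assuming the `.ustr` file is whitespace-delimited with the sample name as the first field on each line and one genotype column per locus after it. The function names `ustr_shape` and `shape_matches` are hypothetical.

```python
def ustr_shape(path):
    """Return (n_rows, n_genotype_columns) for a whitespace-delimited .ustr file.

    Assumes the first field on each line is the sample name and every
    remaining field is one genotype column.
    """
    n_rows = 0
    widths = set()
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            n_rows += 1
            widths.add(len(fields) - 1)
    if len(widths) > 1:
        raise ValueError(
            "rows have differing column counts: {}".format(sorted(widths))
        )
    n_cols = widths.pop() if widths else 0
    return n_rows, n_cols


def shape_matches(path, n_samples, n_loci):
    """True when the file has two rows per sample and one column per locus."""
    return ustr_shape(path) == (2 * n_samples, n_loci)


# Usage, with Nsamples and Nloci taken from the ipyrad stats:
# print(shape_matches('min13.ustr', n_samples=..., n_loci=...))
```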

# Hypothesis testing... 
Hypothesis_0: We expect populations in the tropics to be patchy (*source*)<br>
**null_0**: no population structure, there is admixture across all sites, one large population<br>
**alt_0.0**: Isolation by distance, one large pop with sub populations<br>
**alt_0.1**: Isolation by environment, there are multiple populations (creeks determine gene flow; individuals in a creek are more related than individuals between creeks)<br><br>

Hypothesis_1: We are uncertain about the taxonomy of the published T. zeteki genome <br>
**null_1.0**: The published T. zeteki genome is T. zeteki<br>
**alt_1.0**: The published T. zeteki genome is likely a different species<br><br>
    Test this hypothesis by building a phylogeny with the 6 random T.fov and 5 T.zet plus an in silico digested published T.zet genome<br>
        THIS COULD HAVE AN INTERESTING COALESCENT STORY HERE, WE KNOW THAT PANAMA FORMED ~15-8 MYA <br>
        WHAT CONSEQUENCES DOES THAT HAVE FOR THESE SIBLING SPECIES!?<br><br>
Hypothesis_2: Parasitoid wasps have some selective pressure on their hosts (*source*)<br> 
**null_2**: ... something about selective pressures of parasitoid wasps?
<br><br>
***there is another hypothesis to test, think about it some more***
<br><br>
## Find interesting loci, and do a literature search to see if we know the function of them yet!
We can use HWE... in poppr, we can extract a list of loci that we know are not in HWE and explore that dataset... somehow.
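To make the HWE idea concrete, here is a minimal sketch of a chi-square test for a single biallelic locus, written in plain Python. It does not use or reproduce the R packages named in this notebook; the function names and the fixed 5% critical value are assumptions for illustration, and no multiple-testing correction is applied.

```python
def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square statistic (1 degree of freedom) for a biallelic locus.

    Takes observed counts of the AA, AB, and BB genotypes and compares them
    with the Hardy-Weinberg expectations p^2, 2pq, and q^2.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes observed")
    p = (2.0 * n_aa + n_ab) / (2.0 * n)  # frequency of allele A
    q = 1.0 - p
    expected = [p * p * n, 2.0 * p * q * n, q * q * n]
    observed = [n_aa, n_ab, n_bb]
    # a monomorphic locus has zero expected counts; skip those terms
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


CRITICAL_VALUE = 3.841  # chi-square, 1 degree of freedom, alpha = 0.05


def departs_from_hwe(n_aa, n_ab, n_bb):
    """True when the locus departs from HWE at the 5% level."""
    return hwe_chi_square(n_aa, n_ab, n_bb) > CRITICAL_VALUE


# A locus with no heterozygotes at all is a strong departure:
print(departs_from_hwe(50, 0, 50))  # True
```

Loci flagged this way would be the candidates for the literature search described above.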

# Data Analysis

## RAxML tree

## tetrad-- quartet tree inference
much like SVDquartets

### *structure* analysis
[see ipyrad documentation](https://nbviewer.jupyter.org/github/dereneaton/ipyrad/blob/master/tests/cookbook-structure-pedicularis.ipynb)
#### input and output file locations
#### create *structure* class object
#### set parameters for *structure* object
#### submit job on the cluster
#### summarize replicates with CLUMPP
#### calculate the best K and test for convergence
#### create structure plot

## See { notebook name } for further R analysis!
1. test genetic distance by geographic distance
2. map samples (see.... ???)
3. check HWE (poppr & adegenet)

#### Map Samples
May use Python, but we have a pretty straightforward way of doing this in R right now. This will be based off clusters/k values, if any, and mapping those. 

## Explore admixture
using TREEMIX & ABBA-BABA admixture inference

## Coalescent analysis?
... biogeography here though...