# Thesis Pipeline
This pipeline is modeled after:
1. [Carol Rowe's Allenrolfea analysis](https://digitalcommons.usu.edu/all_datasets/39/)
2. [empirical ipyrad API Pedicularis cookbook](https://nbviewer.jupyter.org/github/dereneaton/ipyrad/blob/master/tests/cookbook-empirical-API-1-pedicularis.ipynb)
3. the Grünwald Lab [Poppr analysis](http://grunwaldlab.github.io/Population_Genetics_in_R/index.html) tutorial on GitHub
<br>
<br>
*to be updated with further population analysis as well* <br>


# ~!~ important ~!~ 
#### Reconnecting to a notebook after a job has ended
Use this when you reopen a notebook after a job has finished. Once a session ends, or when you copy old notes/data etc. into a new notebook, the notebook no longer knows the paths to the JSON files that ipyrad uses to build assemblies. Load them again with this command:<br>
`ipyrad.load.load_json("<your_assembly>.json")`
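
When several assemblies and branches share one project directory, it can help to list every saved `.json` file before reloading. This is a minimal sketch that assumes the `.json` files sit directly inside the project directory. `find_assembly_jsons` is a hypothetical helper, not part of ipyrad, and the load call in the comment is the one given above.

```python
import glob
import os


def find_assembly_jsons(project_dir):
    """Return sorted paths of the .json files directly inside project_dir."""
    pattern = os.path.join(project_dir, "*.json")
    return sorted(glob.glob(pattern))


# Usage in the notebook (needs ipyrad, so it is left as a comment here):
# for path in find_assembly_jsons("/fs/project/adams.1970/cardenas.61/2019_analysis"):
#     assembly = ipyrad.load.load_json(path)
```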

In [None]:
#required modules/programs/packages
#seaborn
#pandas
#numpy
#ipyparallel
#ipyrad

**Check `ipcluster` instance with a profile**<br>
First we need to check our parallelization.

In [None]:
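import ipyparallel as ipp
# format strings print the same text under Python 2 and Python 3
print("ipyparallel version is {}".format(ipp.__version__))
# connect to the running ipcluster instance started with this profile
mpi1 = ipp.Client(profile="MPI2019_06_21")
print("MPI2019_06_21 has {} cores".format(len(mpi1)))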

### *ipyrad*
The only library we need to import right now is *ipyrad*. Printing the version number of *ipyrad* is good practice to keep a record of which software version we are using. 
<br>
This markdown follows the [*ipyrad* API user guide](https://ipyrad.readthedocs.io/API_user-guide.html).<br>
See: *Eaton, D. A. R., & Overcast, I. (2019). ipyrad: interactive assembly and analysis of RADseq data sets. In prep.*

In [None]:
#requires ipyrad
import ipyrad as ip
print(ip.__version__)

## IPYRAD FIRST STEPS
We first used FastQC to examine our files and check our Illumina sequences. The sequences looked good overall and had the expected Phred scores, with no noticeable adapter issues. We then ran step 1, merged our assemblies, and removed samples with fewer than 0.05 million reads (see next cell). We have made a directory with all the demultiplexed reads after step 1. This is the path we point to in our parameters: <br>
`<assembly_name>.set_params('sorted_fastq_path', '/fs/project/adams.1970/cardenas.61/2019_analysis/reduced_samples/*.fastq.gz')`


## reduced data assembly
`fulldata.stats` covers the whole data set. A new data set containing all samples with > 0.05 million reads will go into a new directory labeled `reduced_samples`.<br>
**removed samples:**<br>
`Metro_Park_CC184	1	15963`<br>
`Metro_Park_CC185	1	19655`<br>
`Metro_Park_CC187	1	16969`<br>
`Metro_Park_CC252	1	42792`<br>
`Metro_Park_CC253	1	34129`<br>
*Other samples have < 1.0 million reads, but for now let's keep it simple. We do not want to over-filter our data.*
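
The read-count cut can be checked directly against the listing above. This is a minimal sketch that assumes the three tab-separated columns are sample name, a middle column (presumably the assembly state), and the raw read count. `low_read_samples` and `MIN_READS` are hypothetical names, and the 50,000 cutoff is the 0.05 million threshold stated above.

```python
# Removed-sample lines copied from the listing above (tab-separated).
STATS_LINES = """\
Metro_Park_CC184\t1\t15963
Metro_Park_CC185\t1\t19655
Metro_Park_CC187\t1\t16969
Metro_Park_CC252\t1\t42792
Metro_Park_CC253\t1\t34129
"""

MIN_READS = 50000  # 0.05 million reads


def low_read_samples(lines, min_reads=MIN_READS):
    """Return (name, reads) pairs for samples with at most min_reads reads."""
    flagged = []
    for line in lines.strip().splitlines():
        name, _middle, reads = line.split()
        if int(reads) <= min_reads:
            flagged.append((name, int(reads)))
    return flagged


for name, reads in low_read_samples(STATS_LINES):
    print("{}\t{}".format(name, reads))
```

Feeding the full stats table through the same function would list every sample that falls under the cutoff, which makes it easy to see how many more samples a stricter threshold would cost.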

In [None]:
reduced_data = ip.Assembly("reduced_data")
reduced_data.get_params() # need to change these parameters!

### Parameters 1
Here we set the generic parameters for our primary assembly. We run this assembly all the way through after creating a branch that we can use to make minor adjustments.<br>
For example, we branch after step 1 (branch1), and from that branch we create new assembly branches (each new branch, e.g. reduced2, copies branch1's .json file). <br><br>
* The only filter we keep consistent is a minimum read length of 50 bp after trimming: no small fragments.

In [None]:
reduced_data.set_params('project_dir', '/fs/project/adams.1970/cardenas.61/2019_analysis') 
reduced_data.set_params('sorted_fastq_path', '/fs/project/adams.1970/cardenas.61/2019_analysis/reduced_samples/*.fastq.gz')
reduced_data.set_params('assembly_method', 'denovo+reference')
reduced_data.set_params('datatype', 'pairddrad')
reduced_data.set_params('reference_sequence', './Tzet_genomic.fna')
reduced_data.set_params('restriction_overhang', 'TGCAG, CGG')
reduced_data.set_params('max_low_qual_bases', '5')
reduced_data.set_params('mindepth_statistical', '6')
reduced_data.set_params('mindepth_majrule', '6')
reduced_data.set_params('clust_threshold', '0.85')
reduced_data.set_params('filter_adapters', '1')
reduced_data.set_params('filter_min_trim_len', '50')
reduced_data.set_params('max_Hs_consens','8,8')
reduced_data.set_params('min_samples_locus', '4')
reduced_data.set_params('max_SNPs_locus', '20, 30')
reduced_data.set_params('trim_reads', '0, 0, 0, 0')
reduced_data.set_params('output_formats', '*')
reduced_data.get_params()

#.set_params('', '')


In [None]:
reduced_data.run("1",ipyclient=mpi1)
reduced_data.stats

In [None]:
branch1 = reduced_data.branch("branch1") 

Now we will just run through with the mostly default parameters.

In [None]:
reduced_data.run("234567",ipyclient=mpi1)
reduced_data.stats

In [None]:
reduced_data.stats

In [None]:
! cat /fs/project/adams.1970/cardenas.61/2019_analysis/reduced_data_outfiles/reduced_data_stats.txt

# Optimizing parameters
We are going to see which parameters give us the least data loss; we **do not want to over-filter our data**.

## Step two and three branches
First, see how `filter_adapters` = 2 compares to the primary assembly.<br>
Next, see how `trim_reads` = '0, 140, 0, 135' `(R1>,<R1,R2>,<R2)` compares to the primary assembly.<br>
Then see how `clust_threshold` = 0.80 and 0.90 compare to the 0.85 value in the primary assembly. <br>
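
The comparison plan above can also be kept in one table and applied in a loop, so each branch's override is recorded in a single place. This is a sketch only: it assumes nothing beyond `branch(name)` returning an object with `set_params(key, value)`, as used elsewhere in this notebook. `BRANCH_OVERRIDES` and `make_branches` are hypothetical names.

```python
# Parameter overrides per branch, taken from the comparison plan above.
BRANCH_OVERRIDES = {
    "filter2": {"filter_adapters": "2"},
    "trimread": {"trim_reads": "0, 140, 0, 135"},
    "clust80": {"clust_threshold": "0.80"},
    "clust90": {"clust_threshold": "0.90"},
}


def make_branches(parent, overrides):
    """Branch `parent` once per entry and apply that entry's parameters."""
    branches = {}
    for name in sorted(overrides):
        child = parent.branch(name)
        for key, value in sorted(overrides[name].items()):
            child.set_params(key, value)
        branches[name] = child
    return branches


# Usage in the notebook (needs the ipyrad assembly `branch1`):
# branches = make_branches(branch1, BRANCH_OVERRIDES)
# branches["filter2"].run("2", ipyclient=mpi1)
```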

In [None]:
filter2 = branch1.branch("filter2")
trimread = branch1.branch("trimread")
clust80 = branch1.branch("clust80")
clust90 = branch1.branch("clust90")

In [None]:
filter2.set_params('filter_adapters', '2')
trimread.set_params('trim_reads', '0, 140, 0, 135')
clust80.set_params('clust_threshold', '0.80')
clust90.set_params('clust_threshold', '0.90')

In [None]:
filter2.run("2",ipyclient=mpi1)
filter2.stats

### Which filter_adapters worked best?
The primary assembly or the filter2 assembly?<br>
Use this parameter in the trimmed-read assembly.

In [None]:
#.set_params('filter_adapters', '')

In [None]:
trimread.run("2",ipyclient=mpi1)

In [None]:
trimread.stats

### Which trim_reads setting worked best?
The primary assembly or the trimread assembly? <br>
Use this parameter in the step 3 comparison.

In [None]:
#.set_params('trim_reads', '')

In [None]:
# these branches were made after step 1, so step 2 has to run before step 3
clust80.run("23",ipyclient=mpi1)
clust90.run("23",ipyclient=mpi1)

In [None]:
clust80.stats

In [None]:
clust90.stats

## Step four through six
We shouldn't need to tweak the other parameters in steps 4-6. We want to stop after step 6 and branch there for step 7. <br> Run whichever assembly works best.

In [None]:
<assembly>.run("456",ipyclient=mpi1)

In [None]:
<assembly>.stats

### Step 7
Now see how `min_samples_locus` set to 2, 6 and 8 compares to the primary assembly (which uses 4).<br>
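
As a reminder of what the values 2, 6 and 8 control (the primary assembly sets `min_samples_locus` to 4): a locus is retained only when at least that many samples have data for it. The toy sketch below uses made-up locus and sample names and is not ipyrad's own implementation.

```python
def loci_passing(locus_samples, min_samples):
    """Return names of loci with data for at least min_samples samples."""
    return sorted(
        locus for locus, samples in locus_samples.items()
        if len(set(samples)) >= min_samples
    )


# Hypothetical coverage table: locus name -> samples with data at that locus.
toy = {
    "locus_A": ["s1", "s2"],
    "locus_B": ["s1", "s2", "s3", "s4", "s5", "s6"],
    "locus_C": ["s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8"],
}

for threshold in (2, 4, 6, 8):
    print("{} -> {}".format(threshold, loci_passing(toy, threshold)))
```

Raising the threshold can only shrink the retained set, so the lower values are the ones to watch when the goal is to avoid over-filtering.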

In [None]:
<name> = <assembly_at_step6>.branch("<name>") # from last assembly run 4-6

In [None]:
min_sample_locus_2 = <assembly>.branch("min_sample_locus_2")
min_sample_locus_6 = <assembly>.branch("min_sample_locus_6")
min_sample_locus_8 = <assembly>.branch("min_sample_locus_8")

In [None]:
min_sample_locus_2.set_params('min_samples_locus', '2')
min_sample_locus_6.set_params('min_samples_locus', '6')
min_sample_locus_8.set_params('min_samples_locus', '8')

In [None]:
# branches were made after step 6, so only step 7 is left to run
min_sample_locus_2.run("7",ipyclient=mpi1)
min_sample_locus_6.run("7",ipyclient=mpi1)
min_sample_locus_8.run("7",ipyclient=mpi1)

### Final Stats
At this point I prefer the following command. The CLI stats file gives a little more helpful detail, especially the first table, "The number of loci caught by each filter."<br>
`! cat /fs/project/adams.1970/cardenas.61/2019_analysis/<assembly name>_outfiles/<assembly name>_stats.txt`<br>
ex: `! cat /fs/project/adams.1970/cardenas.61/2019_analysis/reduced_R1_outfiles/reduced_R1_stats.txt`
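
The same path pattern can be built in Python, which avoids retyping it for every assembly. This is a minimal sketch: `stats_path` and `print_stats` are hypothetical helpers, and the project directory is the one used in the example above.

```python
import os

PROJECT_DIR = "/fs/project/adams.1970/cardenas.61/2019_analysis"


def stats_path(assembly_name, project_dir=PROJECT_DIR):
    """Build <project_dir>/<name>_outfiles/<name>_stats.txt for one assembly."""
    return os.path.join(
        project_dir,
        "{}_outfiles".format(assembly_name),
        "{}_stats.txt".format(assembly_name),
    )


def print_stats(assembly_name, project_dir=PROJECT_DIR):
    """Print the stats file if it exists; otherwise report the missing path."""
    path = stats_path(assembly_name, project_dir)
    if os.path.exists(path):
        with open(path) as handle:
            print(handle.read())
    else:
        print("no stats file yet: {}".format(path))


print(stats_path("reduced_R1"))
```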

In [None]:
#compare final stats of all three
! cat /fs/project/adams.1970/cardenas.61/2019_analysis_final/<assembly name>_outfiles/<assembly name>_stats.txt

In [None]:
#compare final stats of all three
! cat /fs/project/adams.1970/cardenas.61/2019_analysis_final/<assembly name>_outfiles/<assembly name>_stats.txt

In [None]:
#compare final stats of all three
! cat /fs/project/adams.1970/cardenas.61/2019_analysis_final/<assembly name>_outfiles/<assembly name>_stats.txt

# Data Analysis
These are (mostly) example ipyrad analysis programs. See the documentation outlining [the ipyrad analysis tools](https://ipyrad.readthedocs.io/analysis.html#ipyrad-api-analysis-tools) for cookbooks on how to run these analyses.

### PCA
provided in ipyrad analysis toolkit; https://radcamp.github.io/NYC2018/04_PCA_API.html

### RAxML tree
Need to test RAxML tree and see if we can find a way to partition our data... `MAGNET` shell scripts might solve this! If we can get it working...
<br>
<br>
see FASconCAT-G & gphocs2multiphylip.sh script

### tetrad-- quartet tree inference
much like SVDquartets

### Coalescent analysis?
... biogeography here though... BUCKy?

### *structure* analysis
[see ipyrad documentation](https://nbviewer.jupyter.org/github/dereneaton/ipyrad/blob/master/tests/cookbook-structure-pedicularis.ipynb)
#### input and output file locations
#### create *structure* class object
#### set parameters for *structure* object
#### submit job on the cluster
#### summarize replicates with CLUMPP
#### calculate the best K and test for convergence
#### create structure plot

#### Map Samples
May use Python, but we have a pretty straightforward way of doing this in R right now. This will be based off clusters/k values, if any, and mapping those. 

## See { notebook name } for further R analysis!
1. test genetic distance by geographic distance
2. map samples (see.... ???)
3. check HWE (poppr & adegenet)

## Explore admixture
using TREEMIX & ABBA-BABA admixture inference<br>
may not be relevant!!!