# Exon-capture to Phylogeny
###### Calder Atta - FISH 546 Project
---

## Workflow
1. Examine source data files
2. Check quality of reads (FastQC)
3. Merge files from multiple lanes
4. Remove adapter sequences and low quality score reads (TrimGalore)
5. Remove the duplicates from PCR, parse the reads to each locus (custom perl script: preads)
6. Assemble the filtered reads into contigs (Trinity)
7. Merge the loci containing more than one contig (Geneious)
8. Retrieve orthology by pairwise alignment to the corresponding bait sequences (custom perl script: Smith-Waterman algorithm)
9. Identify orthology by comparing the retrieved sequences to the genome of O. niloticus (bait source) (Blast)
10. Multiple sequence alignment (Clustal Omega v1.1.1)
11. Downstream analysis
---

#### Python Shortcuts

In [4]:
project = "/Users/calderatta/Desktop/FISH546_Bioinformatics/project/"

In [5]:
biotools = "/Applications/bio-tools/"

---
## 1. Examine source data files

#### Location, Files, and Naming
The `data/` directory contains reads for 7 species (ID 4 through 10). For each species there are 4 files, representing forward (R1) and reverse (R2) reads for two Illumina lanes (L006 and L008) = 28 files total. Within each forward/reverse pair of .fastq files, the order of sequences is consistent.
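
As a quick sanity check on the naming scheme described above, here is a rough sketch that groups file names by sample and reports any sample missing a lane/read combination. The regular expression is inferred from the file names in this notebook, and the function names are my own (hypothetical), so adjust as needed.

```python
import re
from collections import defaultdict

# Pattern inferred from names such as TORN_Pool_4_S4_L006_R1_001.fastq
NAME_RE = re.compile(r"^TORN_Pool_(\d+)_S\d+_(L\d{3})_(R[12])_001\.fastq$")

def group_by_sample(filenames):
    """Map sample ID -> set of (lane, read) tuples found for that sample."""
    groups = defaultdict(set)
    for name in filenames:
        match = NAME_RE.match(name)
        if match is None:
            continue  # skip anything that does not follow the convention
        sample, lane, read = match.groups()
        groups[int(sample)].add((lane, read))
    return dict(groups)

def incomplete_samples(groups, lanes=("L006", "L008"), reads=("R1", "R2")):
    """Return {sample: missing (lane, read) combinations} for incomplete samples."""
    expected = {(lane, read) for lane in lanes for read in reads}
    return {
        sample: sorted(expected - found)
        for sample, found in groups.items()
        if expected - found
    }
```

For example, `incomplete_samples(group_by_sample(os.listdir(project + "data")))` should return an empty dict if all 7 samples have their 4 files (this needs `import os`).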

In [9]:
ls {project}data

TORN_Pool_10_S10_L006_R1_001.fastq* TORN_Pool_6_S6_L008_R1_001.fastq*
TORN_Pool_10_S10_L006_R2_001.fastq* TORN_Pool_6_S6_L008_R2_001.fastq*
TORN_Pool_10_S10_L008_R1_001.fastq* TORN_Pool_7_S7_L006_R1_001.fastq*
TORN_Pool_10_S10_L008_R2_001.fastq* TORN_Pool_7_S7_L006_R2_001.fastq*
TORN_Pool_4_S4_L006_R1_001.fastq*   TORN_Pool_7_S7_L008_R1_001.fastq*
TORN_Pool_4_S4_L006_R2_001.fastq*   TORN_Pool_7_S7_L008_R2_001.fastq*
TORN_Pool_4_S4_L008_R1_001.fastq*   TORN_Pool_8_S8_L006_R1_001.fastq*
TORN_Pool_4_S4_L008_R2_001.fastq*   TORN_Pool_8_S8_L006_R2_001.fastq*
TORN_Pool_5_S5_L006_R1_001.fastq*   TORN_Pool_8_S8_L008_R1_001.fastq*
TORN_Pool_5_S5_L006_R2_001.fastq*   TORN_Pool_8_S8_L008_R2_001.fastq*
TORN_Pool_5_S5_L008_R1_001.fastq*   TORN_Pool_9_S9_L00

#### Contents

In [10]:
!head {project}data/TORN_Pool_10_S10_L006_R1_001.fastq

@K00179:70:HHV7JBBXX:6:1101:24454:1209 1:N:0:AACGAAGT
TNTCTCTCTCTCTTGCTCTCTCTCTCTCTGTTTGAGCTCTCTCTCCCTCTCTCTCTCTCTCTGTCTCTCTCTGTTTGAGCTAACTCTCTCTCTCTGTTAGAGCTCTCTCTCGNTCGCTCTCTCTCGCTNTCGCTGGCTCGCACGCTCTCT
+
A#AAAFFFJJJFJFAJFJ-FFFAJFJ-FJ-AA<--7<-7FA7F<A--77A-7A-<777A-<F--A-<FF7F<-----777777F777A-77AF<---7A-----)-7-7-7)#--7--<)F<--7)-)#7A-77)--)7)))7)7)))--
@K00179:70:HHV7JBBXX:6:1101:26829:1209 1:N:0:AACGAAGT
CNCTTTCCTTCAGGAGAGACTCTGTCAGGAGGTGCAGGAGGAACAAAAGGAGCAAGAGGAGGAGGATCTGAAGGAGGGATGAGGTGTTGCAGGACGATGAACAGGAGGGGGAGCATGAGGAGGAGCAGGAGTAGGTGGAGCATAAGGAGG
+
A#AAFFJFFJJJJJJAFJJJAJFJ<JAAJJJJAJJJ7FFF<AJJFAJFFJ<AFJJF<AJJJFFJA77F7A--F-AAFF-7FJJJAF7FJ<AAFFJJ77AF7<<A<AJF))-))-)-77A<-<7AJF)<7)7-7-7<F-))7)---7--7)
@K00179:70:HHV7JBBXX:6:1101:4472:1226 1:N:0:AACGAAGT
GNAACAACATGGAGGTCAGAGGAGGAACAACATGGAGGTCAGAGGAGGAACAACATGGAGGTCAGAGGAGGAACAACATGGAGGTCAGAGAAGGCGCATCACGTATCTCAGANGAAAAGAAAGGAGGTNTGCAAAGACGAACGAGGGGGC
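
Each record in the output above has four lines: a header starting with `@`, the sequence, a `+` separator, and a quality string with one character per base. Here is a small sketch of how the quality characters map to Phred scores, assuming the Sanger (Phred+33) encoding that TrimGalore also assumes later in this notebook. The function names are my own.

```python
def phred_scores(quality_line, offset=33):
    """Convert a FASTQ quality string to a list of integer Phred scores."""
    return [ord(ch) - offset for ch in quality_line]

def error_probability(q):
    """Probability that a base call with Phred score q is wrong: 10^(-q/10)."""
    return 10 ** (-q / 10)

# First ten quality characters of the first read shown above
scores = phred_scores("A#AAAFFFJJ")
```

Under this encoding, `#` is a score of 2 (the low-confidence `N` call at position 2 of each read) and `J` is a score of 41.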


#### Sequence Counts
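Now let's check that the number of sequences in each pair of R1 and R2 files matches. First test on one file, then apply to the rest. Note that `grep -c "@"` counts every line containing `@`, so it equals the number of reads only if no quality line contains an `@` character.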

In [21]:
!grep -c "@" {project}data/TORN_Pool_10_S10_L008_R2_001.fastq

2419885


In [47]:
!grep -c '@' {project}data/*.fastq

/Users/calderatta/Desktop/FISH 546 - Bioinformatics/project/raw/TORN_Pool_10_S10_L006_R1_001.fastq:2419885
/Users/calderatta/Desktop/FISH 546 - Bioinformatics/project/raw/TORN_Pool_10_S10_L006_R2_001.fastq:2419885
/Users/calderatta/Desktop/FISH 546 - Bioinformatics/project/raw/TORN_Pool_10_S10_L008_R1_001.fastq:1497568
/Users/calderatta/Desktop/FISH 546 - Bioinformatics/project/raw/TORN_Pool_10_S10_L008_R2_001.fastq:1497568
/Users/calderatta/Desktop/FISH 546 - Bioinformatics/project/raw/TORN_Pool_4_S4_L006_R1_001.fastq:3061218
/Users/calderatta/Desktop/FISH 546 - Bioinformatics/project/raw/TORN_Pool_4_S4_L006_R2_001.fastq:3061218
/Users/calderatta/Desktop/FISH 546 - Bioinformatics/project/raw/TORN_Pool_4_S4_L008_R1_001.fastq:2506641
/Users/calderatta/Desktop/FISH 546 - Bioinformatics/project/raw/TORN_Pool_4_S4_L008_R2_001.fastq:2506641
/Users/calderatta/Desktop/FISH 546 - Bioinformatics/project/raw/TORN_Pool_5_S5_L006_R1_001.fastq:1248025
/Users/calderatta/Desktop/FISH 546 - Bioinforma

They match!
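
Since `@` can also be a valid quality character, a stricter check is to count records by position (every 4th line is a header) and to compare read IDs between R1 and R2 at the same time. This is only a sketch: it assumes 4-line records with unwrapped sequences, and that the read ID is the header text before the first space, as in the headers shown above. The function names are my own.

```python
from itertools import zip_longest

def read_ids(path):
    """Yield the read ID (header text before the first space) of each record."""
    with open(path) as handle:
        for line_number, line in enumerate(handle):
            # Header lines are lines 0, 4, 8, ... of a 4-line-per-record file
            if line_number % 4 == 0 and line.strip():
                yield line[1:].split()[0]

def check_pairing(r1_path, r2_path):
    """Return (matching_record_count, index of first mismatch or None)."""
    count = 0
    pairs = zip_longest(read_ids(r1_path), read_ids(r2_path))
    for index, (id1, id2) in enumerate(pairs):
        if id1 != id2:  # also triggers when one file is shorter (None vs ID)
            return count, index
        count += 1
    return count, None
```

A result such as `(2419885, None)` would mean that both files hold the same number of records in the same order.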
## 2. Check quality of reads (FastQC)
Objective: Examine quality of reads for each sequence file. 

Input: source files (.fastq)

    {project}/data/*.fastq                      (eg. TORN_Pool_4_S4_L006_R1_001.fastq)
Output: sequence quality plots (.html/.zip)

    {project}/analysis/fastqc/*_fastqc.html     (eg. TORN_Pool_4_S4_L006_R1_001_fastqc.html)
    {project}/analysis/fastqc/*_fastqc.zip      (eg. TORN_Pool_4_S4_L006_R1_001_fastqc.zip)
Requires:
- fastqc (See `notebooks/installation.ipynb` for installation instructions.)

In [6]:
!{biotools}FastQC/fastqc \
{project}data/*.fastq \
-o {project}analysis/fastqc

Started analysis of TORN_Pool_10_S10_L006_R1_001.fastq
Approx 5% complete for TORN_Pool_10_S10_L006_R1_001.fastq
Approx 10% complete for TORN_Pool_10_S10_L006_R1_001.fastq
Approx 15% complete for TORN_Pool_10_S10_L006_R1_001.fastq
Approx 20% complete for TORN_Pool_10_S10_L006_R1_001.fastq
Approx 25% complete for TORN_Pool_10_S10_L006_R1_001.fastq
Approx 30% complete for TORN_Pool_10_S10_L006_R1_001.fastq
Approx 35% complete for TORN_Pool_10_S10_L006_R1_001.fastq
Approx 40% complete for TORN_Pool_10_S10_L006_R1_001.fastq
Approx 45% complete for TORN_Pool_10_S10_L006_R1_001.fastq
Approx 50% complete for TORN_Pool_10_S10_L006_R1_001.fastq
Approx 55% complete for TORN_Pool_10_S10_L006_R1_001.fastq
Approx 60% complete for TORN_Pool_10_S10_L006_R1_001.fastq
Approx 65% complete for TORN_Pool_10_S10_L006_R1_001.fastq
Approx 70% complete for TORN_Pool_10_S10_L006_R1_001.fastq
Approx 75% complete for TORN_Pool_10_S10_L006_R1_001.fastq
Approx 80% complete for TORN_Pool_10_S10_L006_R1_001.fastq
Ap

Approx 80% complete for TORN_Pool_4_S4_L008_R1_001.fastq
Approx 85% complete for TORN_Pool_4_S4_L008_R1_001.fastq
Approx 90% complete for TORN_Pool_4_S4_L008_R1_001.fastq
Approx 95% complete for TORN_Pool_4_S4_L008_R1_001.fastq
Analysis complete for TORN_Pool_4_S4_L008_R1_001.fastq
Started analysis of TORN_Pool_4_S4_L008_R2_001.fastq
Approx 5% complete for TORN_Pool_4_S4_L008_R2_001.fastq
Approx 10% complete for TORN_Pool_4_S4_L008_R2_001.fastq
Approx 15% complete for TORN_Pool_4_S4_L008_R2_001.fastq
Approx 20% complete for TORN_Pool_4_S4_L008_R2_001.fastq
Approx 25% complete for TORN_Pool_4_S4_L008_R2_001.fastq
Approx 30% complete for TORN_Pool_4_S4_L008_R2_001.fastq
Approx 35% complete for TORN_Pool_4_S4_L008_R2_001.fastq
Approx 40% complete for TORN_Pool_4_S4_L008_R2_001.fastq
Approx 45% complete for TORN_Pool_4_S4_L008_R2_001.fastq
Approx 50% complete for TORN_Pool_4_S4_L008_R2_001.fastq
Approx 55% complete for TORN_Pool_4_S4_L008_R2_001.fastq
Approx 60% complete for TORN_Pool_4_S4

Approx 70% complete for TORN_Pool_6_S6_L006_R2_001.fastq
Approx 75% complete for TORN_Pool_6_S6_L006_R2_001.fastq
Approx 80% complete for TORN_Pool_6_S6_L006_R2_001.fastq
Approx 85% complete for TORN_Pool_6_S6_L006_R2_001.fastq
Approx 90% complete for TORN_Pool_6_S6_L006_R2_001.fastq
Approx 95% complete for TORN_Pool_6_S6_L006_R2_001.fastq
Analysis complete for TORN_Pool_6_S6_L006_R2_001.fastq
Started analysis of TORN_Pool_6_S6_L008_R1_001.fastq
Approx 5% complete for TORN_Pool_6_S6_L008_R1_001.fastq
Approx 10% complete for TORN_Pool_6_S6_L008_R1_001.fastq
Approx 15% complete for TORN_Pool_6_S6_L008_R1_001.fastq
Approx 20% complete for TORN_Pool_6_S6_L008_R1_001.fastq
Approx 25% complete for TORN_Pool_6_S6_L008_R1_001.fastq
Approx 30% complete for TORN_Pool_6_S6_L008_R1_001.fastq
Approx 35% complete for TORN_Pool_6_S6_L008_R1_001.fastq
Approx 40% complete for TORN_Pool_6_S6_L008_R1_001.fastq
Approx 45% complete for TORN_Pool_6_S6_L008_R1_001.fastq
Approx 50% complete for TORN_Pool_6_S6

Approx 60% complete for TORN_Pool_8_S8_L006_R1_001.fastq
Approx 65% complete for TORN_Pool_8_S8_L006_R1_001.fastq
Approx 70% complete for TORN_Pool_8_S8_L006_R1_001.fastq
Approx 75% complete for TORN_Pool_8_S8_L006_R1_001.fastq
Approx 80% complete for TORN_Pool_8_S8_L006_R1_001.fastq
Approx 85% complete for TORN_Pool_8_S8_L006_R1_001.fastq
Approx 90% complete for TORN_Pool_8_S8_L006_R1_001.fastq
Approx 95% complete for TORN_Pool_8_S8_L006_R1_001.fastq
Analysis complete for TORN_Pool_8_S8_L006_R1_001.fastq
Started analysis of TORN_Pool_8_S8_L006_R2_001.fastq
Approx 5% complete for TORN_Pool_8_S8_L006_R2_001.fastq
Approx 10% complete for TORN_Pool_8_S8_L006_R2_001.fastq
Approx 15% complete for TORN_Pool_8_S8_L006_R2_001.fastq
Approx 20% complete for TORN_Pool_8_S8_L006_R2_001.fastq
Approx 25% complete for TORN_Pool_8_S8_L006_R2_001.fastq
Approx 30% complete for TORN_Pool_8_S8_L006_R2_001.fastq
Approx 35% complete for TORN_Pool_8_S8_L006_R2_001.fastq
Approx 40% complete for TORN_Pool_8_S8

Approx 50% complete for TORN_Pool_9_S9_L008_R2_001.fastq
Approx 55% complete for TORN_Pool_9_S9_L008_R2_001.fastq
Approx 60% complete for TORN_Pool_9_S9_L008_R2_001.fastq
Approx 65% complete for TORN_Pool_9_S9_L008_R2_001.fastq
Approx 70% complete for TORN_Pool_9_S9_L008_R2_001.fastq
Approx 75% complete for TORN_Pool_9_S9_L008_R2_001.fastq
Approx 80% complete for TORN_Pool_9_S9_L008_R2_001.fastq
Approx 85% complete for TORN_Pool_9_S9_L008_R2_001.fastq
Approx 90% complete for TORN_Pool_9_S9_L008_R2_001.fastq
Approx 95% complete for TORN_Pool_9_S9_L008_R2_001.fastq
Analysis complete for TORN_Pool_9_S9_L008_R2_001.fastq


Now we have .html files which can be opened in a browser to examine plots.
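
Opening 28 reports one by one is tedious, so here is a rough sketch for collecting the per-module PASS/WARN/FAIL flags instead. It relies on the tab-separated `summary.txt` that FastQC writes inside each `_fastqc.zip` (status, module, file name). The function names are my own.

```python
import zipfile
from pathlib import Path

def fastqc_summary(zip_path):
    """Return {module name: status} from the summary.txt inside a FastQC zip."""
    results = {}
    with zipfile.ZipFile(zip_path) as archive:
        summary_name = next(
            name for name in archive.namelist() if name.endswith("summary.txt")
        )
        for line in archive.read(summary_name).decode().splitlines():
            if not line.strip():
                continue
            status, module, _filename = line.split("\t", 2)
            results[module] = status
    return results

def flagged_modules(fastqc_dir):
    """Map each *_fastqc.zip in fastqc_dir to its modules that are not PASS."""
    flagged = {}
    for zip_path in sorted(Path(fastqc_dir).glob("*_fastqc.zip")):
        summary = fastqc_summary(zip_path)
        flagged[zip_path.name] = [m for m, s in summary.items() if s != "PASS"]
    return flagged
```

Calling `flagged_modules(project + "analysis/fastqc")` would then give one list of WARN/FAIL modules per input file.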

Alternatively...  
If on a Mac, we can download a DMG for a GUI application (see `installation.ipynb` for details). Once downloaded, open the application and select file(s) one by one. This method is slow, so the above method is preferred.

## 3. Merge files from multiple lanes
Objective: Merge the data from lane 6 (L006) and lane 8 (L008) for each sample.

Input: source files (.fastq)

    {project}/data/*_L006_*.fastq      (eg. TORN_Pool_4_S4_L006_R1_001.fastq)
    {project}/data/*_L008_*.fastq      (eg. TORN_Pool_4_S4_L008_R1_001.fastq)
Output: merged files (.fastq)

    {project}/analysis/merge/*.fastq   (eg. TORN_Pool_4_S4_R1.fastq)

Let's do a test run on one set of files.

In [39]:
!cat \
{project}data/TORN_Pool_4_S4_L006_R1_001.fastq \
{project}data/TORN_Pool_4_S4_L008_R1_001.fastq \
> {project}analysis/merge/TORN_Pool_4_S4_R1.fastq

Now let's do the rest using a loop (one for R1 and one for R2). For this I navigated to the `data/` folder in Terminal and used the following lines of code, because I couldn't figure out how to use absolute paths inside a for loop in Jupyter. The output files went into `data/`, but I moved them into `analysis/merge/`.

R1:

    (for i in *_L006_R1_001.fastq; do cat ${i%_L006_R1_001.fastq}_L006_R1_001.fastq ${i%_L006_R1_001.fastq}_L008_R1_001.fastq > ${i%_L006_R1_001.fastq}_R1.fastq; done)
R2:

    (for i in *_L006_R2_001.fastq; do cat ${i%_L006_R2_001.fastq}_L006_R2_001.fastq ${i%_L006_R2_001.fastq}_L008_R2_001.fastq > ${i%_L006_R2_001.fastq}_R2.fastq; done)
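
One way around the absolute-path problem is to do the concatenation in Python directly from the notebook. This is just a sketch of the same merge, not what I actually ran; `merge_lanes` is a hypothetical helper, and it assumes the `_L006_`/`_L008_` naming shown above.

```python
import shutil
from pathlib import Path

def merge_lanes(data_dir, out_dir, lanes=("L006", "L008"), reads=("R1", "R2")):
    """Concatenate per-lane FASTQ files into one file per sample and read."""
    data_dir, out_dir = Path(data_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    merged = []
    for read in reads:
        first_lane_suffix = f"_{lanes[0]}_{read}_001.fastq"
        for first in sorted(data_dir.glob(f"*{first_lane_suffix}")):
            sample = first.name[: -len(first_lane_suffix)]
            target = out_dir / f"{sample}_{read}.fastq"
            with open(target, "wb") as out:
                # Same lane order for R1 and R2 keeps the read pairs aligned
                for lane in lanes:
                    source = data_dir / f"{sample}_{lane}_{read}_001.fastq"
                    with open(source, "rb") as handle:
                        shutil.copyfileobj(handle, out)
            merged.append(target)
    return merged
```

For example, `merge_lanes(project + "data", project + "analysis/merge")` would write the merged files straight into `analysis/merge/`. A missing L008 file raises `FileNotFoundError` rather than silently producing a partial merge.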

## 4. Remove adapter sequences and low quality score reads (TrimGalore)
Objective: Trim adapter sequences and low-quality bases from the reads in the merged .fastq files.

Input: merged files (.fastq)

    {project}/analysis/merge/*.fastq             (eg. TORN_Pool_4_S4_R1.fastq)
Output: trimmed files (.fq)

    {project}/analysis/trimgalore/*_val_*.fq               (eg. TORN_Pool_4_S4_R1_val_1.fq)
    {project}/analysis/trimgalore/*_trimming_report.txt    (eg. TORN_Pool_4_S4_R1.fastq_trimming_report.txt)
    {project}/analysis/trimgalore/trim.log.txt
Requires:
- cutadapt (See `notebooks/installation.ipynb` for installation instructions.)
- trim_galore (See `notebooks/installation.ipynb` for installation instructions.)
- adapter sequences:
    - a  = AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
    - a2 = AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT

#### Verify cutadapt and trim_galore

In [4]:
!{biotools}cutadapt-master/cutadapt --version

1.18


In [5]:
!{biotools}TrimGalore-0.5.0/trim_galore --version


                        Quality-/Adapter-/RRBS-/Hard-Trimming
                                (powered by Cutadapt)
                                  version 0.5.0

                               Last update: 28 06 2018



Both are installed and up-to-date.
#### Run TrimGalore

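I tried to run TrimGalore on one sample, but couldn't get it to work (`cutadapt: error: Too many parameters.`). A possible cause is the spaces in the project path that appears in the log below (`FISH 546 - Bioinformatics`), which could be split into several arguments when the file name is passed to cutadapt.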

In [35]:
!{biotools}TrimGalore-0.5.0/trim_galore \
{project}data/TORN_Pool_10_S10_L008_R2_001.fastq \
--path_to_cutadapt {biotools}cutadapt-master/cutadapt \
-o {project}analysis/trimgalore/

No quality encoding type selected. Assuming that the data provided uses Sanger encoded Phred scores (default)

Path to Cutadapt set as: '/Applications/bio-tools/cutadapt-master/cutadapt' (user defined)
1.18
Cutadapt seems to be working fine (tested command '/Applications/bio-tools/cutadapt-master/cutadapt --version')


AUTO-DETECTING ADAPTER TYPE
Attempting to auto-detect adapter type from the first 1 million sequences of the first file (>> /Users/calderatta/Desktop/FISH 546 - Bioinformatics/project/data/TORN_Pool_10_S10_L008_R2_001.fastq <<)

Found perfect matches for the following adapter sequences:
Adapter type	Count	Sequence	Sequences analysed	Percentage
Illumina	62856	AGATCGGAAGAGC	1000000	6.29
Nextera	6	CTGTCTCTTATA	1000000	0.00
smallRNA	1	TGGAATTCTCGG	1000000	0.00
Using Illumina adapter for trimming (count: 62856). Second best hit was Nextera (count: 6)

Writing report to '/Users/calderatta/Desktop/FISH 546 - Bioinformatics/project/analysis/trimgalore/TORN_Pool_10_S10_L008_R2_00

As with merge, I also navigated to the `analysis/merge/` folder in Terminal and used the following line of code. Each file took several minutes to complete. The output files went into `analysis/merge/` but I moved them into `analysis/trimgalore/`.

    (for i in *_R1.fastq; do /Applications/bio-tools/TrimGalore-0.5.0/trim_galore --path_to_cutadapt /Applications/bio-tools/cutadapt-master/cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -a2 AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT --paired ${i%_R1.fastq}_R1.fastq ${i%_R1.fastq}_R2.fastq; done) >& trim.log.txt
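
The same loop can be driven from the notebook by building one argument list per R1/R2 pair. This is a sketch only: `trim_galore_commands` is a hypothetical helper, and the flags are the ones used in the commands above (`--path_to_cutadapt`, `-a`, `-a2`, `--paired`, `-o`).

```python
from pathlib import Path

ADAPTER_R1 = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"  # passed with -a
ADAPTER_R2 = "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"   # passed with -a2

def trim_galore_commands(merge_dir, out_dir, trim_galore, cutadapt):
    """Build one trim_galore argument list per R1/R2 pair found in merge_dir."""
    commands = []
    for r1 in sorted(Path(merge_dir).glob("*_R1.fastq")):
        r2 = r1.with_name(r1.name.replace("_R1.fastq", "_R2.fastq"))
        if not r2.exists():
            continue  # skip unpaired files rather than guess
        commands.append([
            str(trim_galore),
            "--path_to_cutadapt", str(cutadapt),
            "-a", ADAPTER_R1,
            "-a2", ADAPTER_R2,
            "--paired",
            "-o", str(out_dir),
            str(r1), str(r2),
        ])
    return commands
```

Each list can then be run with `subprocess.run(cmd, check=True)`. Passing a list means each path is handed to trim_galore as a single argument, although I have not tested whether trim_galore itself copes with spaces in paths.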
#### Rename
The output file names contain the suffix 'val_1' (eg. `TORN_Pool_4_S4_R1_val_1.fq`) or 'val_2', which needs to be removed.  
Note: There is also a script containing the following code: `{project}kuang-et-al-2018/scripts-old/preads_script/rename.sh`.

In [None]:
!for f in *_val_1.fq; do \
mv -- "$f" "${f%_R1_val_1.fq}_R1.fq"; \
done

In [None]:
!for f in *_val_2.fq; do \
mv -- "$f" "${f%_R2_val_2.fq}_R2.fq"; \
done
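
The renaming can also be done in Python, which avoids shell quoting inside the notebook. This is a sketch with a dry-run default so the planned renames can be inspected first; `strip_val_suffix` is a hypothetical helper name.

```python
import re
from pathlib import Path

# Matches the TrimGalore paired-end suffixes _R1_val_1.fq and _R2_val_2.fq
VAL_SUFFIX = re.compile(r"_(R[12])_val_[12]\.fq$")

def strip_val_suffix(directory, dry_run=True):
    """Rename *_R1_val_1.fq -> *_R1.fq and *_R2_val_2.fq -> *_R2.fq.

    Returns the list of (old name, new name) pairs; files are only renamed
    when dry_run is False.
    """
    planned = []
    for path in sorted(Path(directory).glob("*_val_[12].fq")):
        new_name = VAL_SUFFIX.sub(r"_\1.fq", path.name)
        if new_name == path.name:
            continue  # name did not follow the expected convention
        planned.append((path.name, new_name))
        if not dry_run:
            path.rename(path.with_name(new_name))
    return planned
```

Run it once with the default `dry_run=True` to check the plan, then again with `dry_run=False` to apply it.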


## 5. Remove the duplicates from PCR, parse the reads to each locus (preads)
Preads is a set of custom perl scripts (rmrep.pl and bandp.pl) for removing duplicate reads and parsing the remaining reads to each locus.

Objective: Remove replicates from the trimmed files, and parse the reads into their corresponding genes.

Input: all *.fq files, bait sequences, rmrep.pl, bandp.pl, data.pm

    {project}/analysis/trimgalore/*_R1.fq (eg. TORN_Pool_4_S4_R1.fq)
    {project}/analysis/trimgalore/*_R2.fq (eg. TORN_Pool_4_S4_R2.fq)
    {project}kuang-et-al-2018/scripts-old/preads_script/MudSkipper_Kit_Plus_Opsins.fas
    {project}kuang-et-al-2018/scripts-old/preads_script/bandp.pl
    {project}kuang-et-al-2018/scripts-old/preads_script/data.pm
    {project}kuang-et-al-2018/scripts-old/preads_script/rmrep.pl
Output (rmrep.pl): de-duplicated reads (.fq), index files (.index), and a BLAST database of the reads (.fas, .nhr, .nin, .nsq)

    {project}/analysis/preads/*_rmrep_R1.fq      (eg. TORN_Pool_4_S4_rmrep_R1.fq)
    {project}/analysis/preads/*_rmrep_R1.index   (eg. TORN_Pool_4_S4_rmrep_R1.index)
    {project}/analysis/preads/*_rmrep_R2.fq      (eg. TORN_Pool_4_S4_rmrep_R2.fq)
    {project}/analysis/preads/*_rmrep_R2.index   (eg. TORN_Pool_4_S4_rmrep_R2.index)
    {project}/analysis/preads/*.fas              (eg. TORN_Pool_4_S4.fas)
    {project}/analysis/preads/*.index            (eg. TORN_Pool_4_S4.index)
    {project}/analysis/preads/*.nhr              (eg. TORN_Pool_4_S4.nhr)
    {project}/analysis/preads/*.nin              (eg. TORN_Pool_4_S4.nin)
    {project}/analysis/preads/*.nsq              (eg. TORN_Pool_4_S4.nsq)
    {project}/analysis/preads/preads.1.log.txt
Output (bandp.pl): gene files containing corresponding reads (.fq)

    {project}/analysis/preads/*.*.blast.txt      (eg. MudSkipper_Kit_Plus_Opsins.TORN_Pool_4_S4.blast.txt)
    {project}/analysis/preads/*_results/         (eg. TORN_Pool_4_S4_results) <- contains many .fq files
    {project}/analysis/preads/preads.1.log.txt
        
Requires:
- makeblastdb (See `notebooks/installation.ipynb` for installation instructions. Make sure the script is added to a PATH directory. Eg. `/usr/local/bin/`)
- blastn (See `notebooks/installation.ipynb` for installation instructions. Make sure the script is added to a PATH directory. Eg. `/usr/local/bin/`)

Usage example:

    ./rmrep.pl -taxalist="558_1"
    ./bandp.pl -query="Python_molurus" -subject="558_1" > preads.1.log.txt
#### Setup
First we need to move all the trim_galore results (.fq) into the same directory as the other input files. Then, in Terminal, change to that directory:

    cd /Users/calderatta/Desktop/FISH\ 546\ -\ Bioinformatics/project/kuang-et-al-2018/scripts-old/preads_script/ 
#### Run rmrep.pl
This will be a long-running job (~15min per sample), so let's test it on one first. In the actual run, we want to substitute '558_1' in the example with a list of sample names (a string of names separated by spaces). For the test we will use 'TORN_Pool_4_S4'.

    ./rmrep.pl -taxalist="TORN_Pool_4_S4"
Ok. It works. Now we can run it in a for loop.

    (for i in TORN_Pool_4_S4 TORN_Pool_5_S5 TORN_Pool_6_S6 TORN_Pool_7_S7 TORN_Pool_8_S8 TORN_Pool_9_S9 TORN_Pool_10_S10; do ./rmrep.pl -taxalist=$i; done)
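
For intuition about what duplicate removal involves, here is a minimal sketch that keeps only the first read pair for each distinct combination of R1 and R2 sequences. This is an assumption for illustration: I have not checked which criteria rmrep.pl actually uses, and the function names are my own.

```python
def read_fastq(path):
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ file."""
    with open(path) as handle:
        while True:
            header = handle.readline().rstrip("\n")
            if not header:
                return  # end of file
            sequence = handle.readline().rstrip("\n")
            handle.readline()  # '+' separator line
            quality = handle.readline().rstrip("\n")
            yield header, sequence, quality

def unique_pairs(r1_path, r2_path):
    """Yield (r1_record, r2_record) for the first occurrence of each pair."""
    seen = set()
    for rec1, rec2 in zip(read_fastq(r1_path), read_fastq(r2_path)):
        key = (rec1[1], rec2[1])
        if key in seen:
            continue  # identical R1 and R2 sequences: treated as a duplicate
        seen.add(key)
        yield rec1, rec2
```

Keeping every sequence pair in a set uses a lot of memory for millions of reads; storing a hash of each pair instead would be one way to reduce that.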
#### Run bandp.pl

    ./bandp.pl -query="MudSkipper_Kit_Plus_Opsins" -subject=TORN_Pool_4_S4 > preads.1.log.txt
Passing the subject argument as a list of names separated by spaces didn't seem to work, so for this step I ran the samples individually. Each job took between 24 and 48 hours. Jobs were run on an office Windows desktop, except samples 9 and 10, which were run on MOX.

For files with the prefix 'CG_LWS_', the output file names contain a '|' on MOX and a character similar to '•' on Windows, which gets converted to a question-mark-box character when the files are moved to Mac. I substituted these characters with an underscore so that the files could be input into Trinity. The `|` has to be escaped in the pattern (unescaped, it is treated as regex alternation), and the `-n` flag only previews the renames, so drop it to apply them.

    rename -n 's/\|/_/' *.fq
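
Since the odd character differs between MOX, Windows, and Mac, another option is to replace anything outside a safe character set in one pass. This is a sketch under that assumption; `sanitize_names` is a hypothetical helper, and it defaults to a dry run.

```python
import re
from pathlib import Path

# Anything other than letters, digits, dot, underscore, and hyphen is replaced
UNSAFE = re.compile(r"[^A-Za-z0-9._-]")

def sanitize_names(directory, pattern="*.fq", dry_run=True):
    """Replace unsafe characters in matching file names with underscores.

    Returns the list of (old name, new name) pairs; files are only renamed
    when dry_run is False.
    """
    planned = []
    for path in sorted(Path(directory).glob(pattern)):
        new_name = UNSAFE.sub("_", path.name)
        if new_name == path.name:
            continue  # nothing to change
        target = path.with_name(new_name)
        if target.exists():
            raise FileExistsError(f"{target} already exists")
        planned.append((path.name, new_name))
        if not dry_run:
            path.rename(target)
    return planned
```

Unlike the `rename` one-liner, this replaces every unsafe character in a name, not just the first `|`.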
#### Move Files to Analysis Directory
Move output files to the new `/analysis/preads/`.

## 6. Assemble the filtered reads into contigs (Trinity)

Objective: Assemble the reads into short contigs.

Input: Samplename_results (preads result) folder containing gene files, runtrinity.pl, .sh file

    {project}/analysis/preads/*.*.blast.txt      (eg. MudSkipper_Kit_Plus_Opsins.TORN_Pool_4_S4.blast.txt)
    {project}/analysis/preads/*_results/         (eg. TORN_Pool_4_S4_results)
    {project}kuang-et-al-2018/scripts-old/runtrinity.pl
Output (runtrinity.pl): fasta files named in the format "gene name.Trinity.fasta".

    {project}/analysis/trinity/*.Trinity.fasta
        
Requires:
- Trinity (See `notebooks/installation.ipynb` for installation instructions. Make sure the script is added to a PATH directory. Eg. `/usr/local/bin/`)

Usage example:

    ./runtrinity.pl -species="samplename1 samplename2" > runtrinity15.log

#### Setup
First we need to move all the input files, preads results and 'runtrinity.pl' (copied), into a new directory:

    /Users/calderatta/Desktop/FISH\ 546\ -\ Bioinformatics/project/analysis/trinity
#### Run runtrinity.pl
In Progress...

Running Trinity requires ParaFly (in Trinity), seqtk-trinity (in Trinity), samtools, jellyfish (http://www.genome.umd.edu/jellyfish.html#Release), and salmon (https://combine-lab.github.io/salmon/).
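
Before starting a long Trinity run, it may help to confirm that each of these tools can be found on the PATH. A small sketch, assuming the executables are named as in the list above (`locate_tools` and `missing_tools` are my own helper names):

```python
import shutil

# Executable names taken from the requirement list above
REQUIRED_TOOLS = [
    "Trinity", "ParaFly", "seqtk-trinity", "samtools", "jellyfish", "salmon",
]

def locate_tools(tools=REQUIRED_TOOLS):
    """Return {tool: full path, or None if the tool is not on the PATH}."""
    return {tool: shutil.which(tool) for tool in tools}

def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that could not be found on the PATH."""
    return [tool for tool, path in locate_tools(tools).items() if path is None]
```

An empty list from `missing_tools()` means everything was found. Note that the notebook may see a different PATH than Terminal does.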