# Exon-capture to Phylogeny Project

Calder Atta - FISH 546 Project

## Workflow (from Kuang et al. 2018)

1. Raw reads from Illumina sequencer
2. BCL to fastq format demultiplex (Illumina bcl2fastq package)
3. Remove adapter sequences and low quality score reads (Cutadapt 1.1, Trim Galore v0.2.8)
4. Remove PCR duplicates and parse the reads to each locus (a custom Perl script: preads (supplementary))
5. Assemble the filtered reads into contigs (Trinity v20140717)
6. Merge the loci containing more than one contig (Geneious v7.1.5)
7. Retrieve orthology by pairwise alignment to corresponding baits sequence (A custom Perl script: Smith-Waterman algorithm)
8. Identify orthology by comparing the retrieved sequence to the genome of O. niloticus (Blast v2.2.27)
9. Multiple sequence alignment (Clustal Omega v1.1.1)
10. Downstream analysis

## Notes from Kuang et al. 2018

- Samples and Genes
	- Sampled 43 species
	- 1 mt marker (COI)
	- 17817 nu markers (120bp baits)
		- target region <120bp was padded with T to 120bp
- Filtering
	- only used sequences found in all species and with <5% missing data -> 570 markers
	- parameters for evaluating usefulness (calculated for all markers)
		1. Average pairwise difference (p-dist)
		2. Molecular clocklikeness (MCL)
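
To make the first parameter concrete, here is a minimal sketch of an average pairwise p-distance: the proportion of compared sites that differ between two aligned sequences, averaged over all pairs. The function names and the choice to skip sites with gaps or ambiguous bases are assumptions for illustration; Kuang et al. may have handled missing data differently.

```python
from itertools import combinations


def p_distance(seq_a, seq_b, missing="-N?"):
    """Proportion of differing sites between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        # Assumption: sites with a gap or ambiguous base in either
        # sequence are skipped rather than counted as differences.
        if a in missing or b in missing:
            continue
        compared += 1
        if a != b:
            differences += 1
    return differences / compared if compared else float("nan")


def average_p_distance(alignment):
    """Mean p-distance over all pairs of sequences in one marker alignment."""
    pairs = list(combinations(alignment, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(p_distance(a, b) for a, b in pairs) / len(pairs)
```

For one marker, `average_p_distance` would be called with the list of aligned sequences, one per species.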

## Getting started
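Store the project directory in a variable. The spaces are backslash-escaped so that the path can be interpolated into shell commands with `{project}`.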

In [8]:
project = "/Users/calderatta/Desktop/FISH\ 546\ -\ Bioinformatics/project/"

Examine raw data.

In [9]:
ls {project}raw

TORN_Pool_10_S10_L006_R1_001.fastq* TORN_Pool_6_S6_L008_R1_001.fastq*
TORN_Pool_10_S10_L006_R2_001.fastq* TORN_Pool_6_S6_L008_R2_001.fastq*
TORN_Pool_10_S10_L008_R1_001.fastq* TORN_Pool_7_S7_L006_R1_001.fastq*
TORN_Pool_10_S10_L008_R2_001.fastq* TORN_Pool_7_S7_L006_R2_001.fastq*
TORN_Pool_4_S4_L006_R1_001.fastq*   TORN_Pool_7_S7_L008_R1_001.fastq*
TORN_Pool_4_S4_L006_R2_001.fastq*   TORN_Pool_7_S7_L008_R2_001.fastq*
TORN_Pool_4_S4_L008_R1_001.fastq*   TORN_Pool_8_S8_L006_R1_001.fastq*
TORN_Pool_4_S4_L008_R2_001.fastq*   TORN_Pool_8_S8_L006_R2_001.fastq*
TORN_Pool_5_S5_L006_R1_001.fastq*   TORN_Pool_8_S8_L008_R1_001.fastq*
TORN_Pool_5_S5_L006_R2_001.fastq*   TORN_Pool_8_S8_L008_R2_001.fastq*
TORN_Pool_5_S5_L008_R1_001.fastq*   TORN_Pool_9_S9_L00

Species 4 through 10 should be listed. For each species there are forward (R1) and reverse (R2) reads for each Illumina lane (L00#). Within each forward/reverse file pair, the sequences appear in the same order.
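
The claim that each R1/R2 pair lists its reads in the same order can be checked directly. The sketch below compares read IDs record by record; it assumes uncompressed FASTQ with exactly four lines per record, and the function names are made up for this example.

```python
from itertools import islice, zip_longest


def read_ids(fastq_path):
    """Yield the read ID of each record (header text before the first space)."""
    with open(fastq_path) as handle:
        # Assumption: strict 4-line records, so every 4th line is a header.
        for header in islice(handle, 0, None, 4):
            if header.strip():
                yield header[1:].split()[0]


def pairs_in_sync(r1_path, r2_path):
    """True if both files hold the same read IDs in the same order."""
    ids_1 = read_ids(r1_path)
    ids_2 = read_ids(r2_path)
    # zip_longest pads the shorter file with None, so a length mismatch fails.
    return all(a == b for a, b in zip_longest(ids_1, ids_2))
```

In the notebook this would be called with the R1 and R2 paths of one pool and lane. Note that the Python paths should use plain spaces, not the backslash-escaped form stored in `project`.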

In [10]:
!head {project}raw/TORN_Pool_10_S10_L006_R1_001.fastq

@K00179:70:HHV7JBBXX:6:1101:24454:1209 1:N:0:AACGAAGT
TNTCTCTCTCTCTTGCTCTCTCTCTCTCTGTTTGAGCTCTCTCTCCCTCTCTCTCTCTCTCTGTCTCTCTCTGTTTGAGCTAACTCTCTCTCTCTGTTAGAGCTCTCTCTCGNTCGCTCTCTCTCGCTNTCGCTGGCTCGCACGCTCTCT
+
A#AAAFFFJJJFJFAJFJ-FFFAJFJ-FJ-AA<--7<-7FA7F<A--77A-7A-<777A-<F--A-<FF7F<-----777777F777A-77AF<---7A-----)-7-7-7)#--7--<)F<--7)-)#7A-77)--)7)))7)7)))--
@K00179:70:HHV7JBBXX:6:1101:26829:1209 1:N:0:AACGAAGT
CNCTTTCCTTCAGGAGAGACTCTGTCAGGAGGTGCAGGAGGAACAAAAGGAGCAAGAGGAGGAGGATCTGAAGGAGGGATGAGGTGTTGCAGGACGATGAACAGGAGGGGGAGCATGAGGAGGAGCAGGAGTAGGTGGAGCATAAGGAGG
+
A#AAFFJFFJJJJJJAFJJJAJFJ<JAAJJJJAJJJ7FFF<AJJFAJFFJ<AFJJF<AJJJFFJA77F7A--F-AAFF-7FJJJAF7FJ<AAFFJJ77AF7<<A<AJF))-))-)-77A<-<7AJF)<7)7-7-7<F-))7)---7--7)
@K00179:70:HHV7JBBXX:6:1101:4472:1226 1:N:0:AACGAAGT
GNAACAACATGGAGGTCAGAGGAGGAACAACATGGAGGTCAGAGGAGGAACAACATGGAGGTCAGAGGAGGAACAACATGGAGGTCAGAGAAGGCGCATCACGTATCTCAGANGAAAAGAAAGGAGGTNTGCAAAGACGAACGAGGGGGC


In [None]:
xargs

In [25]:
!ls {project}raw | grep -c "@"

0


In [21]:
!grep -c "@" {project}raw/TORN_Pool_10_S10_L008_R2_001.fastq

2419885
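
A caveat on the count above: `grep -c "@"` counts every line that contains `@`, and `@` is also a valid quality character in Phred+33 encoding, so in general this figure can be higher than the number of reads. A more robust count divides the number of lines by four. The sketch below assumes an uncompressed FASTQ file with strict four-line records; the function name is hypothetical.

```python
def count_fastq_records(fastq_path):
    """Count reads as (number of lines) / 4, assuming 4-line FASTQ records."""
    with open(fastq_path) as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{n_lines} lines is not a multiple of 4")
    return n_lines // 4
```

If this number matches the `grep` count for a file, no quality line in that file contains `@`.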


In [41]:
!find {project}raw | xargs /bin/echo

/Users/calderatta/Desktop/FISH 546 - Bioinformatics/project/raw /Users/calderatta/Desktop/FISH 546 - Bioinformatics/project/raw/TORN_Pool_10_S10_L006_R1_001.fastq /Users/calderatta/Desktop/FISH 546 - Bioinformatics/project/raw/TORN_Pool_7_S7_L006_R2_001.fastq /Users/calderatta/Desktop/FISH 546 - Bioinformatics/project/raw/TORN_Pool_6_S6_L008_R1_001.fastq /Users/calderatta/Desktop/FISH 546 - Bioinformatics/project/raw/TORN_Pool_5_S5_L006_R1_001.fastq /Users/calderatta/Desktop/FISH 546 - Bioinformatics/project/raw/TORN_Pool_9_S9_L006_R1_001.fastq /Users/calderatta/Desktop/FISH 546 - Bioinformatics/project/raw/TORN_Pool_8_S8_L008_R2_001.fastq /Users/calderatta/Desktop/FISH 546 - Bioinformatics/project/raw/TORN_Pool_4_S4_L008_R2_001.fastq /Users/calderatta/Desktop/FISH 546 - Bioinformatics/project/raw/TORN_Pool_8_S8_L006_R1_001.fastq /Users/calderatta/Desktop/FISH 546 - Bioinformatics/project/raw/TORN_Pool_4_S4_L006_R1_001.fastq /Users/calderatta/Desktop/FISH 546 - Bioinformatics/project/r

## 1. Raw Reads from Illumina Sequencer
This was already done.

## 2. BCL to fastq format demultiplex
This was already done.

## 3. Remove adapter sequences and low quality score reads

Download Trim-Galore v0.5.0

Link: https://github.com/FelixKrueger/TrimGalore/archive/0.5.0.zip

In [27]:
!curl -O https://github.com/FelixKrueger/TrimGalore/archive/0.5.0.zip > {project}cutadapt/TrimGalore-0.5.0.zip

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   127    0   127    0     0    693      0 --:--:-- --:--:-- --:--:--   693


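Note: This didn't work. The result was a 127-byte .zip file (it should be 25.9 MB) that unzipped into a .cpgz file, which in turn unzips back into a .zip (the zip/cpgz loop). The likely cause is that GitHub answers archive URLs with a redirect, which curl does not follow unless given `-L`, so the 127 bytes are probably the redirect response rather than the archive. In addition, `-O` saves to a file named after the URL in the current directory, so the `>` redirection receives nothing; `-o <path>` sets the output path instead. We can check the file using md5 or sha1, but I just tried downloading it using Safari and that seemed to work.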
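
The md5/sha1 check mentioned above can be sketched in Python, together with a quick test of whether the file is a zip archive at all. The function names are hypothetical. No published checksum for the archive is given here, so the digest is only useful for comparing two downloads against each other, for example the curl copy and the Safari copy.

```python
import hashlib
import zipfile


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the SHA-1 hex digest of a file, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def looks_like_zip(path):
    """True if the file has a valid zip end-of-central-directory record."""
    return zipfile.is_zipfile(path)
```

A file that fails `looks_like_zip` is expected to trigger the "End-of-central-directory signature not found" error when passed to `unzip`.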

In [19]:
!unzip {project}cutadapt/TrimGalore-0.5.0.zip -d {project}cutadapt/TrimGalore-0.5.0

Archive:  /Users/calderatta/Desktop/FISH 546 - Bioinformatics/project/cutadapt/TrimGalore-0.5.0.zip
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.
unzip:  cannot find zipfile directory in one of /Users/calderatta/Desktop/FISH 546 - Bioinformatics/project/cutadapt/TrimGalore-0.5.0.zip or
        /Users/calderatta/Desktop/FISH 546 - Bioinformatics/project/cutadapt/TrimGalore-0.5.0.zip.zip, and cannot find /Users/calderatta/Desktop/FISH 546 - Bioinformatics/project/cutadapt/TrimGalore-0.5.0.zip.ZIP, period.


In [None]:
sha1