# Exon-capture to Phylogeny Project

Calder Atta - FISH 546 Project

## Workflow (from Kuang et al. 2018)

1. Raw reads from Illumina sequencer
2. BCL to fastq format demultiplex (Illumina bcl2fastq package)
3. Remove adapter sequences and low quality score reads (Cutadapt v1.1, Trim Galore v0.2.8)
4. Remove PCR duplicates and parse the reads to each locus (a custom Perl script: preads (supplementary))
5. Assemble the filtered reads into contigs (Trinity v20140717)
6. Merge the loci containing more than one contig (Geneious v7.1.5)
7. Retrieve orthology by pairwise alignment to the corresponding bait sequences (a custom Perl script: Smith-Waterman algorithm)
8. Identify orthology by comparing the retrieved sequences to the genome of O. niloticus (Blast v2.2.27)
9. Multiple sequence alignment (Clustal Omega v1.1.1)
10. Downstream analysis

## Notes from Kuang et al. 2018

- Samples and Genes
	- Sampled 43 species
	- 1 mt marker (COI)
	- 17817 nu markers (120 bp baits)
		- target regions shorter than 120 bp were padded with T to 120 bp
- Filtering
	- only used sequences found in all species with <5% missing data -> 570 markers
	- parameters for evaluating usefulness (calculated for all markers)
		1. Average pairwise difference (p-dist)
		2. Molecular clocklikeness (MCL)

## Getting started
Store frequently used directories in variables as shortcuts.

In [5]:
project = "/Users/calderatta/Desktop/FISH\ 546\ -\ Bioinformatics/project/"

In [None]:
biotools = "/Applications/bio-tools/"
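
As an alternative to escaping spaces by hand, the same shortcuts could be kept as `pathlib.Path` objects and quoted only when they are handed to the shell. This is a minimal sketch, not what the notebook actually runs; `shell_path` is a hypothetical helper name, and the directories are the ones set above.

```python
import shlex
from pathlib import Path

# Same directories as above, written without manual backslash escapes.
project = Path("/Users/calderatta/Desktop/FISH 546 - Bioinformatics/project")
biotools = Path("/Applications/bio-tools")


def shell_path(path):
    """Return the path quoted so that spaces survive shell word splitting."""
    return shlex.quote(str(path))


# Build a sub-directory path first, then quote it once for the shell.
raw_dir = shell_path(project / "raw")
print(raw_dir)
```

Quoting at the last moment keeps the variables usable from plain Python (for example with `open` or `glob`) as well as in shell commands.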

Examine raw data.

In [9]:
ls {project}raw

TORN_Pool_10_S10_L006_R1_001.fastq* TORN_Pool_6_S6_L008_R1_001.fastq*
TORN_Pool_10_S10_L006_R2_001.fastq* TORN_Pool_6_S6_L008_R2_001.fastq*
TORN_Pool_10_S10_L008_R1_001.fastq* TORN_Pool_7_S7_L006_R1_001.fastq*
TORN_Pool_10_S10_L008_R2_001.fastq* TORN_Pool_7_S7_L006_R2_001.fastq*
TORN_Pool_4_S4_L006_R1_001.fastq*   TORN_Pool_7_S7_L008_R1_001.fastq*
TORN_Pool_4_S4_L006_R2_001.fastq*   TORN_Pool_7_S7_L008_R2_001.fastq*
TORN_Pool_4_S4_L008_R1_001.fastq*   TORN_Pool_8_S8_L006_R1_001.fastq*
TORN_Pool_4_S4_L008_R2_001.fastq*   TORN_Pool_8_S8_L006_R2_001.fastq*
TORN_Pool_5_S5_L006_R1_001.fastq*   TORN_Pool_8_S8_L008_R1_001.fastq*
TORN_Pool_5_S5_L006_R2_001.fastq*   TORN_Pool_8_S8_L008_R2_001.fastq*
TORN_Pool_5_S5_L008_R1_001.fastq*   TORN_Pool_9_S9_L00

Species 4 through 10 should be listed. For each species there are forward (R1) and reverse (R2) reads for each Illumina lane (L00#). Within each forward/reverse file pair, the order of sequences is consistent.
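
The naming scheme described above can be checked programmatically. The sketch below groups file names by sample and lane and reports any lane that is missing its R1 or R2 partner. The regular expression is an assumption inferred from the file names in the listing, and `group_read_pairs` and `unpaired` are hypothetical helper names.

```python
import re
from collections import defaultdict

# Assumed pattern for names such as TORN_Pool_10_S10_L006_R1_001.fastq
NAME_RE = re.compile(r"^(?P<sample>.+)_L(?P<lane>\d{3})_R(?P<read>[12])_001\.fastq$")


def group_read_pairs(filenames):
    """Map (sample, lane) to a dict holding the R1 and R2 file names."""
    pairs = defaultdict(dict)
    for name in filenames:
        match = NAME_RE.match(name)
        if match is None:
            continue  # skip files that do not follow the naming scheme
        key = (match["sample"], match["lane"])
        pairs[key]["R" + match["read"]] = name
    return dict(pairs)


def unpaired(pairs):
    """Return the (sample, lane) keys that lack either R1 or R2."""
    return sorted(key for key, reads in pairs.items() if set(reads) != {"R1", "R2"})


example = [
    "TORN_Pool_10_S10_L006_R1_001.fastq",
    "TORN_Pool_10_S10_L006_R2_001.fastq",
]
print(unpaired(group_read_pairs(example)))
```

In a notebook, the list of names could come from `os.listdir` on the raw directory instead of the hard-coded example.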

In [10]:
!head {project}raw/TORN_Pool_10_S10_L006_R1_001.fastq

@K00179:70:HHV7JBBXX:6:1101:24454:1209 1:N:0:AACGAAGT
TNTCTCTCTCTCTTGCTCTCTCTCTCTCTGTTTGAGCTCTCTCTCCCTCTCTCTCTCTCTCTGTCTCTCTCTGTTTGAGCTAACTCTCTCTCTCTGTTAGAGCTCTCTCTCGNTCGCTCTCTCTCGCTNTCGCTGGCTCGCACGCTCTCT
+
A#AAAFFFJJJFJFAJFJ-FFFAJFJ-FJ-AA<--7<-7FA7F<A--77A-7A-<777A-<F--A-<FF7F<-----777777F777A-77AF<---7A-----)-7-7-7)#--7--<)F<--7)-)#7A-77)--)7)))7)7)))--
@K00179:70:HHV7JBBXX:6:1101:26829:1209 1:N:0:AACGAAGT
CNCTTTCCTTCAGGAGAGACTCTGTCAGGAGGTGCAGGAGGAACAAAAGGAGCAAGAGGAGGAGGATCTGAAGGAGGGATGAGGTGTTGCAGGACGATGAACAGGAGGGGGAGCATGAGGAGGAGCAGGAGTAGGTGGAGCATAAGGAGG
+
A#AAFFJFFJJJJJJAFJJJAJFJ<JAAJJJJAJJJ7FFF<AJJFAJFFJ<AFJJF<AJJJFFJA77F7A--F-AAFF-7FJJJAF7FJ<AAFFJJ77AF7<<A<AJF))-))-)-77A<-<7AJF)<7)7-7-7<F-))7)---7--7)
@K00179:70:HHV7JBBXX:6:1101:4472:1226 1:N:0:AACGAAGT
GNAACAACATGGAGGTCAGAGGAGGAACAACATGGAGGTCAGAGGAGGAACAACATGGAGGTCAGAGGAGGAACAACATGGAGGTCAGAGAAGGCGCATCACGTATCTCAGANGAAAAGAAAGGAGGTNTGCAAAGACGAACGAGGGGGC
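
Each FASTQ record in the output above spans four lines: a header starting with `@`, the sequence, a `+` separator, and a quality string with one character per base. The sketch below reads records in groups of four lines and decodes quality characters into Phred scores. It assumes the Phred+33 encoding, which is not confirmed anywhere in this notebook; `read_fastq` and `phred_scores` are hypothetical helper names.

```python
def read_fastq(lines):
    """Yield (header, sequence, quality) tuples from an iterable of FASTQ lines.

    Lines are consumed four at a time; an incomplete trailing record is dropped.
    """
    iterator = iter(lines)
    for header, sequence, _separator, quality in zip(*[iterator] * 4):
        yield header.rstrip("\n"), sequence.rstrip("\n"), quality.rstrip("\n")


def phred_scores(quality, offset=33):
    """Convert a quality string to integer scores (assumes Phred+33 by default)."""
    return [ord(char) - offset for char in quality]


# First ten quality characters of the first record shown above.
print(phred_scores("A#AAAFFFJJ"))
```

Under the Phred+33 assumption, `#` decodes to a score of 2, which would fit the `N` base call at the second position of each read shown above.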


Now let's check that the number of sequences in each pair of R1 and R2 files matches.

In [21]:
!grep -c "@" {project}raw/TORN_Pool_10_S10_L008_R2_001.fastq

2419885


In [47]:
!grep -c '@' {project}raw/*.fastq

/Users/calderatta/Desktop/FISH 546 - Bioinformatics/project/raw/TORN_Pool_10_S10_L006_R1_001.fastq:2419885
/Users/calderatta/Desktop/FISH 546 - Bioinformatics/project/raw/TORN_Pool_10_S10_L006_R2_001.fastq:2419885
/Users/calderatta/Desktop/FISH 546 - Bioinformatics/project/raw/TORN_Pool_10_S10_L008_R1_001.fastq:1497568
/Users/calderatta/Desktop/FISH 546 - Bioinformatics/project/raw/TORN_Pool_10_S10_L008_R2_001.fastq:1497568
/Users/calderatta/Desktop/FISH 546 - Bioinformatics/project/raw/TORN_Pool_4_S4_L006_R1_001.fastq:3061218
/Users/calderatta/Desktop/FISH 546 - Bioinformatics/project/raw/TORN_Pool_4_S4_L006_R2_001.fastq:3061218
/Users/calderatta/Desktop/FISH 546 - Bioinformatics/project/raw/TORN_Pool_4_S4_L008_R1_001.fastq:2506641
/Users/calderatta/Desktop/FISH 546 - Bioinformatics/project/raw/TORN_Pool_4_S4_L008_R2_001.fastq:2506641
/Users/calderatta/Desktop/FISH 546 - Bioinformatics/project/raw/TORN_Pool_5_S5_L006_R1_001.fastq:1248025
/Users/calderatta/Desktop/FISH 546 - Bioinforma
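
One caveat about counting with `grep -c "@"`: it counts every line that contains `@`, and `@` is also a valid quality character in Phred+33 encoded files, so the count can exceed the number of records. Counting lines and dividing by four avoids this. The sketch below is one way to do that in Python, assuming each record spans exactly four lines; `count_fastq_records` and `counts_match` are hypothetical helper names.

```python
def count_fastq_records(path):
    """Count records as (number of lines) // 4 rather than by matching '@'."""
    with open(path, "rb") as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4


def counts_match(r1_path, r2_path):
    """Return True when the forward and reverse files hold the same number of records."""
    return count_fastq_records(r1_path) == count_fastq_records(r2_path)
```

A call such as `counts_match(r1, r2)` could then be made for each R1/R2 pair of one lane.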

## 1. Raw Reads from Illumina Sequencer
This was already done.

## 2. BCL to fastq format demultiplex
This was already done.

## 3. Remove adapter sequences and low quality score reads

### FastQC - checking quality across all sequences in each file
#### Installation for FastQC v0.11.8
First check that Java is installed and is at least version 1.8.

In [8]:
!java -version

java version "1.8.0_131"
Java(TM) SE Runtime Environment (build 1.8.0_131-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)


#### On Mac (DMG file for GUI application)
Link: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.8.dmg

Install from .dmg file. Open the application, and select file(s).

In [31]:
!curl https://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.8.dmg > {biotools}fastqc_v0.11.8.dmg

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 10.0M  100 10.0M    0     0  99908      0  0:01:45  0:01:45 --:--:--  115k


In [10]:
!open {biotools}FastQC.app

#### On Windows or Mac (Zip file to run in Bash; Use this version for creating a pipeline)
Link: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.8.zip

In [9]:
!curl https://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.8.zip > {biotools}fastqc_v0.11.8.zip

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  9.7M  100  9.7M    0     0   158k      0  0:01:03  0:01:03 --:--:--  114k


In [None]:
!unzip {biotools}fastqc_v0.11.8.zip -d {biotools}

Note: I already had a working version that I downloaded straight from Safari.

In [18]:
!{biotools}FastQC/fastqc -h


            FastQC - A high throughput sequence QC analysis tool

SYNOPSIS

	fastqc seqfile1 seqfile2 .. seqfileN

    fastqc [-o output dir] [--(no)extract] [-f fastq|bam|sam] 
           [-c contaminant file] seqfile1 .. seqfileN

DESCRIPTION

    FastQC reads a set of sequence files and produces from each one a quality
    control report consisting of a number of different modules, each one of 
    which will help to identify a different potential type of problem in your
    data.
    
    If no files to process are specified on the command line then the program
    will start as an interactive graphical application.  If files are provided
    on the command line then the program will run with no user interaction
    required.  In this mode it is suitable for inclusion into a standardised
    analysis pipeline.
    
    The options for the program as as follows:
    
    -h --help       Print this help file and exit
    
    -v --version    Print the vers

In [6]:
!{biotools}FastQC/fastqc \
{project}data/*.fastq \
-o {project}analysis/fastqc

Started analysis of TORN_Pool_10_S10_L006_R1_001.fastq
Approx 5% complete for TORN_Pool_10_S10_L006_R1_001.fastq
Approx 10% complete for TORN_Pool_10_S10_L006_R1_001.fastq
Approx 15% complete for TORN_Pool_10_S10_L006_R1_001.fastq
Approx 20% complete for TORN_Pool_10_S10_L006_R1_001.fastq
Approx 25% complete for TORN_Pool_10_S10_L006_R1_001.fastq
Approx 30% complete for TORN_Pool_10_S10_L006_R1_001.fastq
Approx 35% complete for TORN_Pool_10_S10_L006_R1_001.fastq
Approx 40% complete for TORN_Pool_10_S10_L006_R1_001.fastq
Approx 45% complete for TORN_Pool_10_S10_L006_R1_001.fastq
Approx 50% complete for TORN_Pool_10_S10_L006_R1_001.fastq
Approx 55% complete for TORN_Pool_10_S10_L006_R1_001.fastq
Approx 60% complete for TORN_Pool_10_S10_L006_R1_001.fastq
Approx 65% complete for TORN_Pool_10_S10_L006_R1_001.fastq
Approx 70% complete for TORN_Pool_10_S10_L006_R1_001.fastq
Approx 75% complete for TORN_Pool_10_S10_L006_R1_001.fastq
Approx 80% complete for TORN_Pool_10_S10_L006_R1_001.fastq
Ap

Approx 80% complete for TORN_Pool_4_S4_L008_R1_001.fastq
Approx 85% complete for TORN_Pool_4_S4_L008_R1_001.fastq
Approx 90% complete for TORN_Pool_4_S4_L008_R1_001.fastq
Approx 95% complete for TORN_Pool_4_S4_L008_R1_001.fastq
Analysis complete for TORN_Pool_4_S4_L008_R1_001.fastq
Started analysis of TORN_Pool_4_S4_L008_R2_001.fastq
Approx 5% complete for TORN_Pool_4_S4_L008_R2_001.fastq
Approx 10% complete for TORN_Pool_4_S4_L008_R2_001.fastq
Approx 15% complete for TORN_Pool_4_S4_L008_R2_001.fastq
Approx 20% complete for TORN_Pool_4_S4_L008_R2_001.fastq
Approx 25% complete for TORN_Pool_4_S4_L008_R2_001.fastq
Approx 30% complete for TORN_Pool_4_S4_L008_R2_001.fastq
Approx 35% complete for TORN_Pool_4_S4_L008_R2_001.fastq
Approx 40% complete for TORN_Pool_4_S4_L008_R2_001.fastq
Approx 45% complete for TORN_Pool_4_S4_L008_R2_001.fastq
Approx 50% complete for TORN_Pool_4_S4_L008_R2_001.fastq
Approx 55% complete for TORN_Pool_4_S4_L008_R2_001.fastq
Approx 60% complete for TORN_Pool_4_S4

Approx 70% complete for TORN_Pool_6_S6_L006_R2_001.fastq
Approx 75% complete for TORN_Pool_6_S6_L006_R2_001.fastq
Approx 80% complete for TORN_Pool_6_S6_L006_R2_001.fastq
Approx 85% complete for TORN_Pool_6_S6_L006_R2_001.fastq
Approx 90% complete for TORN_Pool_6_S6_L006_R2_001.fastq
Approx 95% complete for TORN_Pool_6_S6_L006_R2_001.fastq
Analysis complete for TORN_Pool_6_S6_L006_R2_001.fastq
Started analysis of TORN_Pool_6_S6_L008_R1_001.fastq
Approx 5% complete for TORN_Pool_6_S6_L008_R1_001.fastq
Approx 10% complete for TORN_Pool_6_S6_L008_R1_001.fastq
Approx 15% complete for TORN_Pool_6_S6_L008_R1_001.fastq
Approx 20% complete for TORN_Pool_6_S6_L008_R1_001.fastq
Approx 25% complete for TORN_Pool_6_S6_L008_R1_001.fastq
Approx 30% complete for TORN_Pool_6_S6_L008_R1_001.fastq
Approx 35% complete for TORN_Pool_6_S6_L008_R1_001.fastq
Approx 40% complete for TORN_Pool_6_S6_L008_R1_001.fastq
Approx 45% complete for TORN_Pool_6_S6_L008_R1_001.fastq
Approx 50% complete for TORN_Pool_6_S6

Approx 60% complete for TORN_Pool_8_S8_L006_R1_001.fastq
Approx 65% complete for TORN_Pool_8_S8_L006_R1_001.fastq
Approx 70% complete for TORN_Pool_8_S8_L006_R1_001.fastq
Approx 75% complete for TORN_Pool_8_S8_L006_R1_001.fastq
Approx 80% complete for TORN_Pool_8_S8_L006_R1_001.fastq
Approx 85% complete for TORN_Pool_8_S8_L006_R1_001.fastq
Approx 90% complete for TORN_Pool_8_S8_L006_R1_001.fastq
Approx 95% complete for TORN_Pool_8_S8_L006_R1_001.fastq
Analysis complete for TORN_Pool_8_S8_L006_R1_001.fastq
Started analysis of TORN_Pool_8_S8_L006_R2_001.fastq
Approx 5% complete for TORN_Pool_8_S8_L006_R2_001.fastq
Approx 10% complete for TORN_Pool_8_S8_L006_R2_001.fastq
Approx 15% complete for TORN_Pool_8_S8_L006_R2_001.fastq
Approx 20% complete for TORN_Pool_8_S8_L006_R2_001.fastq
Approx 25% complete for TORN_Pool_8_S8_L006_R2_001.fastq
Approx 30% complete for TORN_Pool_8_S8_L006_R2_001.fastq
Approx 35% complete for TORN_Pool_8_S8_L006_R2_001.fastq
Approx 40% complete for TORN_Pool_8_S8

Approx 50% complete for TORN_Pool_9_S9_L008_R2_001.fastq
Approx 55% complete for TORN_Pool_9_S9_L008_R2_001.fastq
Approx 60% complete for TORN_Pool_9_S9_L008_R2_001.fastq
Approx 65% complete for TORN_Pool_9_S9_L008_R2_001.fastq
Approx 70% complete for TORN_Pool_9_S9_L008_R2_001.fastq
Approx 75% complete for TORN_Pool_9_S9_L008_R2_001.fastq
Approx 80% complete for TORN_Pool_9_S9_L008_R2_001.fastq
Approx 85% complete for TORN_Pool_9_S9_L008_R2_001.fastq
Approx 90% complete for TORN_Pool_9_S9_L008_R2_001.fastq
Approx 95% complete for TORN_Pool_9_S9_L008_R2_001.fastq
Analysis complete for TORN_Pool_9_S9_L008_R2_001.fastq


FastQC produces an HTML report for each input file, which you can open in a browser.
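
With this many files, opening every report by hand is slow. The sketch below lists the modules that did not pass for each file. It assumes FastQC also wrote a `*_fastqc.zip` archive next to each HTML report and that each archive contains a tab-separated `summary.txt` (status, module, file name); `read_fastqc_summary` and `flagged_modules` are hypothetical helper names.

```python
import zipfile
from pathlib import Path


def read_fastqc_summary(zip_path):
    """Return {module name: status} from the summary.txt inside a FastQC zip."""
    results = {}
    with zipfile.ZipFile(zip_path) as archive:
        # Assumption: the archive holds one member whose name ends in summary.txt.
        member = next(n for n in archive.namelist() if n.endswith("summary.txt"))
        for line in archive.read(member).decode().splitlines():
            if not line.strip():
                continue  # ignore blank lines
            status, module, _filename = line.split("\t", 2)
            results[module] = status
    return results


def flagged_modules(report_dir):
    """Map each *_fastqc.zip in report_dir to its modules not marked PASS."""
    flagged = {}
    for zip_path in sorted(Path(report_dir).glob("*_fastqc.zip")):
        summary = read_fastqc_summary(zip_path)
        flagged[zip_path.name] = [m for m, s in summary.items() if s != "PASS"]
    return flagged
```

Here `report_dir` would be the directory passed to `-o` in the FastQC command above.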

### Download Trim-Galore v0.5.0

Link: https://github.com/FelixKrueger/TrimGalore/archive/0.5.0.zip

In [42]:
!curl -O https://github.com/FelixKrueger/TrimGalore/archive/0.5.0.zip > {biotools}TrimGalore-0.5.0.zip

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   127    0   127    0     0    671      0 --:--:-- --:--:-- --:--:--   671


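Note: This didn't work. The download was only 127 bytes (the archive should be 25.9 MB), and unzipping it produced a .cpgz file, which unzips back into a .zip (zip cpgz loop). The likely cause is the `curl` command itself. GitHub archive links redirect to another URL, and `curl` does not follow redirects unless `-L` is given, so the 127 bytes were probably the redirect response rather than the archive. In addition, `-O` saves the download under the remote file name (`0.5.0.zip`) in the working directory, so the `>` redirect receives nothing and leaves an empty file at the target path. A command of the form `curl -L <url> -o <file>` should avoid both problems. A checksum (md5 or sha1) could confirm a bad download, but I don't have a reference checksum for the original file. Downloading the archive with Safari seemed to work.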

In [19]:
!unzip {biotools}TrimGalore-0.5.0.zip -d {biotools}TrimGalore-0.5.0

Archive:  /Users/calderatta/Desktop/FISH 546 - Bioinformatics/project/cutadapt/TrimGalore-0.5.0.zip
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.
unzip:  cannot find zipfile directory in one of /Users/calderatta/Desktop/FISH 546 - Bioinformatics/project/cutadapt/TrimGalore-0.5.0.zip or
        /Users/calderatta/Desktop/FISH 546 - Bioinformatics/project/cutadapt/TrimGalore-0.5.0.zip.zip, and cannot find /Users/calderatta/Desktop/FISH 546 - Bioinformatics/project/cutadapt/TrimGalore-0.5.0.zip.ZIP, period.


In [44]:
!shasum {biotools}TrimGalore-0.5.0.zip

da39a3ee5e6b4b0d3255bfef95601890afd80709  /Users/calderatta/Desktop/FISH 546 - Bioinformatics/project/cutadapt/TrimGalore-0.5.0.zip
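
The digest printed above, `da39a3ee5e6b4b0d3255bfef95601890afd80709`, is the SHA-1 of zero bytes of input, which suggests that the file at that path is empty. A minimal Python sketch of the same check follows; `sha1_of_file` and `is_empty_download` are hypothetical helper names.

```python
import hashlib

# SHA-1 digest of zero bytes of input.
EMPTY_SHA1 = hashlib.sha1(b"").hexdigest()


def sha1_of_file(path, chunk_size=1 << 20):
    """Compute the SHA-1 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_empty_download(path):
    """Return True when the file hashes to the digest of empty input."""
    return sha1_of_file(path) == EMPTY_SHA1
```

Comparing `sha1_of_file` on a fresh download against a digest published by the source, where one is available, would be the stronger check.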


Now trying with sickle and seqtk.

In [15]:
!brew install seqtk

To reinstall 1.3, run `brew reinstall seqtk`


In [16]:
!brew install sickle

To reinstall 1.33, run `brew reinstall sickle`


In [11]:
!sickle se -f /Users/calderatta/Desktop/FISH\ 546\ -\ Bioinformatics/week-3/untreated1_chr4.fq.rtf -t sanger -o /Users/calderatta/Desktop/FISH\ 546\ -\ Bioinformatics/week-3/untreated1_chr4_sickle.fq


FastQ records kept: 203121
FastQ records discarded: 1234



In [10]:
!head -n 20 /Users/calderatta/Desktop/FISH\ 546\ -\ Bioinformatics/week-3/untreated1_chr4.fq.rtf

{\rtf1\ansi\ansicpg1252\cocoartf1561\cocoasubrtf600
{\fonttbl\f0\fmodern\fcharset0 Courier;}
{\colortbl;\red255\green255\blue255;\red0\green0\blue0;}
{\*\expandedcolortbl;;\cssrgb\c0\c0\c0;}
\margl1440\margr1440\vieww21720\viewh13500\viewkind0
\deftab720
\pard\pardeftab720\partightenfactor0

\f0\fs26 \cf0 \expnd0\expndtw0\kerning0
@SRR031729.3941844\
GGACAACCTAGCCAGGAAAGGGGCAGGGAACCCTCTAATTGGGCCCGAACCATTCTGTGGTGTTGGTCACCACAG\
+\
BC?BABABBA@BCBAC>A<4+?BA><B=@?AB@B@A>?BB=B.?7?>1;<??=@A8?8=B8B>?B@46==8863<\
@SRR031728.3674563\
CAACAACAGCCCAGGAAATGAGCTAGCGGACAACCTAGCCAGGAAAGGGGCAGGGAACCCTCTAATTGGGCCCGA\
+\
BBB>B@?B=A4?)ABBABA:?B??CBB@B@?BB;?9A>A4AA=??>?:A=<?7?=??1<67445<55?6<667??\
@SRR031729.8532600\
CCCAATTAGAGGATTCTCTGCCCCTTTCCTGGCTAGGTTGTCCGGTAGCTCATTTCCCGGGATGTTGTTGTGTCC\
+\


In [12]:
!head -n 20 /Users/calderatta/Desktop/FISH\ 546\ -\ Bioinformatics/week-3/untreated1_chr4_sickle.fq

@SRR031729.3941844\
GGACAACCTAGCCAGGAAAGGGGCAGGGAACCCTCTAATTGGGCCCGAACCATTCTGTGGTGTTGGTCACCACAG\
+
BC?BABABBA@BCBAC>A<4+?BA><B=@?AB@B@A>?BB=B.?7?>1;<??=@A8?8=B8B>?B@46==8863<\
@SRR031728.3674563\
CAACAACAGCCCAGGAAATGAGCTAGCGGACAACCTAGCCAGGAAAGGGGCAGGGAACCCTCTAATTGGGCCCGA\
+
BBB>B@?B=A4?)ABBABA:?B??CBB@B@?BB;?9A>A4AA=??>?:A=<?7?=??1<67445<55?6<667??\
@SRR031729.8532600\
CCCAATTAGAGGATTCTCTGCCCCTTTCCTGGCTAGGTTGTCCGGTAGCTCATTTCCCGGGATGTTGTTGTGTCC\
+
BBBBACCBCBCBCBBBABBBBBBBBBBBBA@BBA4(:3BB=CAA;8?=@??0?B=B@<+7@1<>2><3A38'78<\
@SRR031729.2779333\
GTTCTCTGCCCCTTTCCTGGCTAGGTTGTCCGCTAGCTCATTTCCCGAGATG
+
B??AAA@B@AA@@?@AA=A@?=?AB7;@8??:<66?7<5044:6?=?6@66;
@SRR031728.2826481\
TTCCTGGCTAGGTTGTCCGCTAGCTCATTTCCCGGGCTGTTGTTGTGTCCCGGGACACACCTTATTGTGAGTTTG\
+
B>CBBCCCB>CC@BBABBBB?;?BAB<ABABBCBCCCBB<@B?=8;9;BCBACB9>>@0AB@=5:??6B=B<==?\


In [32]:
%%bash
# Concatenate both lanes of each sample into one file per read direction.
# Lane names here are L001 and L002; the raw files listed earlier use L006 and L008.
for i in *_L001_R1_001.fastq; do
  cat "${i%_L001_R1_001.fastq}_L001_R1_001.fastq" \
      "${i%_L001_R1_001.fastq}_L002_R1_001.fastq" > "${i%_L001_R1_001.fastq}_R1.fastq"
done

for i in *_L001_R2_001.fastq; do
  cat "${i%_L001_R2_001.fastq}_L001_R2_001.fastq" \
      "${i%_L001_R2_001.fastq}_L002_R2_001.fastq" > "${i%_L001_R2_001.fastq}_R2.fastq"
done

In [None]:
%%bash
# Count the records in each FASTQ file (4 lines per record).
for f in /where/your/files/r/*fq
do
awk '{s++}END{print s/4}' "$f"
done