   # RNAseq Coho Differential Expression
   
    This notebook is the first in a series that outlines the steps taken to analyse an RNAseq data set, from raw reads through differential expression analysis and Gene Ontology enrichment analysis. It covers read trimming and quality control of the raw data set.
   
    The raw data files are too large to post on GitHub, but can be provided on request (aspanjer@uw.edu).
   
   Software used in this analysis:
   [FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
   [MultiQC](http://multiqc.info)
    [Trim Galore](http://www.bioinformatics.babraham.ac.uk/projects/trim_galore/)
   

In [None]:
A quick look at the raw data files that are used for the analysis:

In [2]:
cd /Users/andrewspanjer/RNAseq_Coho/

/Users/andrewspanjer/RNAseq_Coho


In [2]:
ls

A.1.fastq.gz  F.1.fastq.gz  K.1.fastq.gz  P.1.fastq.gz  T.2.fastq.gz
A.2.fastq.gz  F.2.fastq.gz  K.2.fastq.gz  P.2.fastq.gz  U.1.fastq.gz
B.1.fastq.gz  G.1.fastq.gz  L.1.fastq.gz  Q.1.fastq.gz  U.2.fastq.gz
B.2.fastq.gz  G.2.fastq.gz  L.2.fastq.gz  Q.2.fastq.gz  V.1.fastq.gz
C.1.fastq.gz  H.1.fastq.gz  M.1.fastq.gz  R.1.fastq.gz  V.2.fastq.gz
C.2.fastq.gz  H.2.fastq.gz  M.2.fastq.gz  R.2.fastq.gz  W.1.fastq.gz
D.1.fastq.gz  I.1.fastq.gz  N.1.fastq.gz  README.md     W.2.fastq.gz
D.2.fastq.gz  I.2.fastq.gz  N.2.fastq.gz  S.1.fastq.gz  X.1.fastq.gz
E.1.fastq.gz  J.1.fastq.gz  O.1.fastq.gz  S.2.fastq.gz  X.2.fastq.gz
E.2.fastq.gz  J.2.fastq.gz  O.2.fastq.gz  T.1.fastq.gz
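
Each sample letter should have both a `.1` (forward) and a `.2` (reverse) file. A minimal sketch of that check in Python, assuming the `<sample>.<read>.fastq.gz` naming shown in the listing above; the function name `find_unpaired` is hypothetical:

```python
import re
from collections import defaultdict

# Assumed naming scheme, taken from the listing above: <sample>.<read>.fastq.gz
FASTQ_NAME = re.compile(r"^(?P<sample>[^.]+)\.(?P<read>[12])\.fastq\.gz$")

def find_unpaired(filenames):
    """Return the sorted sample names that lack either the .1 or the .2 file."""
    reads_by_sample = defaultdict(set)
    for name in filenames:
        match = FASTQ_NAME.match(name)
        if match:  # skips non-FASTQ entries such as README.md
            reads_by_sample[match.group("sample")].add(match.group("read"))
    return sorted(s for s, reads in reads_by_sample.items() if reads != {"1", "2"})

# Small made-up listing: sample B is missing its .2 file
print(find_unpaired(["A.1.fastq.gz", "A.2.fastq.gz", "B.1.fastq.gz", "README.md"]))
```

Passing `os.listdir(".")` instead of the made-up list would run the same check on the real directory.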


Unzip the first fastq file and look at its first 10 lines to see what the raw sequencing data looks like. The file is re-zipped afterwards to save space...

In [15]:
!gunzip A.1.fastq.gz 

head: A.1.fastq: No such file or directory
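
The file can also be inspected without unzipping it on disk. A minimal sketch using Python's standard `gzip` module; the helper name `head_gz` is hypothetical, and the file name in the commented example is the one used above:

```python
import gzip
from itertools import islice

def head_gz(path, n=10):
    """Return the first n lines of a gzip-compressed text file."""
    with gzip.open(path, "rt") as handle:
        return [line.rstrip("\n") for line in islice(handle, n)]

# Example (assumes A.1.fastq.gz is in the working directory):
# for line in head_gz("A.1.fastq.gz"):
#     print(line)
```

From the shell, `gunzip -c A.1.fastq.gz | head` gives the same view and also leaves the compressed file untouched.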


For a complete description of the data files see the README.md in /data. This file is one of 48 sequencing files representing paired-end (PE) reads of RNA from juvenile coho salmon livers. Sequencing was performed by the UW Genomic Core facility.

In [17]:
!head A.1.fastq

@HWKWKBGXX:1:11101:10000:19471/1
TTTTTTTGTACAAAAGCTGAGTAAGTTGGCCTGCAAAGCCCTAATTAAATGCCCCCACAACCAATAGAGAAGCAACAATTACAATTGTCAACTCTTTTTGCAGGCAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGAATTCGTATA
+
AAAAAEEAEEEEAEEEEA<EEEEE6E//6E/AA<EEEEEE/EEE/EEAEEE<EA/EE/E666<EAAA6<EEEEEEEAEE<E/EEA/A/<EE/EE<EE/<//E//AE/<EAEEEEEEE/6/A/<E/E<E/AA</AAE<A//EEE<A/AE##
@HWKWKBGXX:1:11101:10002:10211/1
GAAGTGTTCTAGGGGTACCGCTTCAACAACGATGGGTATGAAGCTATGGTTAGCCCCGCAGATTTCAGAACATTGTCCGTAGAATACTCCAGGTCGAGAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGAATTCGTATCTCGTATG
+
6A/AA6/EEEEEAEEEEAAEEEEEEAE/EEAEEEEAEE//E//AEEEEA/EEE6/EEEEEEA//E//</<EEAE/AA6EEEA<AEEEE/6E<6AE<EEEE6/AAEAEAEEAEEE<A/EAA<E/EE/66EEA66AAE/</AE<<66<A###
@HWKWKBGXX:1:11101:10005:13424/1
CCACCGCCGGCTTGTACCTGTTGAGGTAGTCAGCCACGCCACATATCGTCGGGCAGTACTTGCCAAATCCATCCCTTGGTGTGCAGTCCTCAGAATAGTCTCCCCTCGTTTGTGCAGAGGAGAGGGAGAACAGCAAAAGAACGCCTGCGG


The data files are in FASTQ format, with four lines per read:

Line 1: the read identifier, starting with '@'. The full Illumina format is @ 'instrument':'run number':'flowcell ID':'lane':'tile':'x-pos':'y-pos' 'read':'is filtered':'control number':'index sequence'. The headers in these files appear to use a shorter form, 'flowcell ID':'lane':'tile':'x-pos':'y-pos'/'read'.

Line 2: the actual sequence.

Line 3: a separator line beginning with '+'.

Line 4: the quality codes, one for each base in the sequence. A guide for these codes can be found here: http://support.illumina.com/help/BaseSpace_OLH_009008/Content/Source/Informatics/BS/QualityScoreEncoding_swBS.htm
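
As an illustration of the four-line record and its quality line, the sketch below splits one record into its parts and converts the quality characters to Phred scores. It assumes the Sanger (Phred+33) encoding, in which the score is the character's ASCII code minus 33; the function names and the example record are made up for illustration.

```python
def parse_fastq_record(lines):
    """Split the four lines of one FASTQ record into (read_id, sequence, quality)."""
    header, sequence, separator, quality = (line.strip() for line in lines)
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    return header[1:], sequence, quality

def phred_scores(quality, offset=33):
    """Convert a quality string to integer Phred scores (Phred+33 assumed)."""
    return [ord(char) - offset for char in quality]

# A shortened, made-up record in the same layout as the reads above
record = ["@EXAMPLE:1:11101:10000:19471/1", "TTTTGA", "+", "AAE/6#"]
read_id, sequence, quality = parse_fastq_record(record)
print(read_id, phred_scores(quality))
```

Under this encoding 'E' is 36, 'A' is 32 and '#' is 2, so the '#' characters at the ends of the reads above would mark very low-confidence base calls.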

In [3]:
!gzip A.1.fastq

Next, I want to look at the quality of my raw data reads...

In [None]:
! scripts/FastQC/fastqc data/*.fastq.gz -o analyses/

Started analysis of A.1.fastq.gz
Approx 5% complete for A.1.fastq.gz
Approx 10% complete for A.1.fastq.gz
Approx 15% complete for A.1.fastq.gz
Approx 20% complete for A.1.fastq.gz
Approx 25% complete for A.1.fastq.gz
Approx 30% complete for A.1.fastq.gz
Approx 35% complete for A.1.fastq.gz
Approx 40% complete for A.1.fastq.gz
Approx 45% complete for A.1.fastq.gz
Approx 50% complete for A.1.fastq.gz
Approx 55% complete for A.1.fastq.gz
Approx 60% complete for A.1.fastq.gz
Approx 65% complete for A.1.fastq.gz
Approx 70% complete for A.1.fastq.gz
Approx 75% complete for A.1.fastq.gz
Approx 80% complete for A.1.fastq.gz
Approx 85% complete for A.1.fastq.gz
Approx 90% complete for A.1.fastq.gz
Approx 95% complete for A.1.fastq.gz
Analysis complete for A.1.fastq.gz
Started analysis of A.2.fastq.gz
Approx 5% complete for A.2.fastq.gz
Approx 10% complete for A.2.fastq.gz
Approx 15% complete for A.2.fastq.gz
Approx 20% complete for A.2.fastq.gz
Approx 25% complete for A.2.fastq.gz
Approx 30% co

A summary of these FastQC reports can be generated for all files using MultiQC...

In [6]:
!multiqc ./analyses/fastQC_before/ -o ./analyses/ -n multiqc_before

[INFO   ]         multiqc : This is MultiQC v0.8
[INFO   ]         multiqc : Template    : default
[INFO   ]         multiqc : Searching './analyses/fastQC_before/'
[INFO   ]          fastqc : Found 48 reports
[INFO   ]         multiqc : Report      : analyses/multiqc_before.html
[INFO   ]         multiqc : Data        : analyses/multiqc_before_data
[INFO   ]         multiqc : MultiQC complete


Next, based on the FastQC results, I want to run a trimming program to remove adapter sequences and low-quality bases. The output of this program is quite large, so I put the results on an external hard drive...


In [2]:
!./scripts/trim_galore_v0/trim_galore --paired ./data_raw/A.1.fastq.gz ./data_raw/A.2.fastq.gz --retain_unpaired --output_dir ./data_raw/ 

No quality encoding type selected. Assuming that the data provided uses Sanger encoded Phred scores (default)

Path to Cutadapt set as: 'cutadapt' (default)
1.11
Cutadapt seems to be working fine (tested command 'cutadapt --version')


AUTO-DETECTING ADAPTER TYPE
Attempting to auto-detect adapter type from the first 1 million sequences of the first file (>> ./data_raw/A.1.fastq.gz <<)

Found perfect matches for the following adapter sequences:
Adapter type	Count	Sequence	Sequences analysed	Percentage
Illumina	273916	AGATCGGAAGAGC	1000000	27.39
Nextera	2	CTGTCTCTTATA	1000000	0.00
smallRNA	1	TGGAATTCTCGG	1000000	0.00
Using Illumina adapter for trimming (count: 273916). Second best hit was Nextera (count: 2)

gunzip: error writing to output: Broken pipe
gunzip: ./data_raw/A.1.fastq.gz: uncompress failed
Writing report to './data_raw/A.1.fastq.gz_trimming_report.txt'

SUMMARISING RUN PARAMETERS
Input filename: ./data_raw/A.1.fastq.gz
Trimming mode: paired-end
Trim Galore version: 0.4.2
Cut
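
The auto-detection step in the log above reports how many of the first 1 million reads contain each adapter sequence. The idea can be sketched as a simple substring count; this is only an illustration of the counting, not Trim Galore's actual code. The adapter sequences are copied from the log, the reads are excerpts from the 3' ends of the three reads shown earlier, and `count_adapter_hits` is a hypothetical name.

```python
ADAPTERS = {
    "Illumina": "AGATCGGAAGAGC",
    "Nextera": "CTGTCTCTTATA",
    "smallRNA": "TGGAATTCTCGG",
}

def count_adapter_hits(reads, adapters=ADAPTERS):
    """Count the reads that contain a perfect match to each adapter sequence."""
    return {name: sum(seq in read for read in reads) for name, seq in adapters.items()}

# Excerpts from the ends of the three reads printed by `head` above
reads = [
    "TTTTTGCAGGCAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC",
    "CAGGTCGAGAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC",
    "GGAGAGGGAGAACAGCAAAAGAACGCCTGCGG",
]
hits = count_adapter_hits(reads)
best = max(hits, key=hits.get)
print(best, hits[best], f"{100 * hits[best] / len(reads):.2f}%")
```

The percentage column in the log is the same ratio: for sample A, 273916 / 1000000 = 27.39%.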

In [None]:
!./scripts/trim_galore_v0/trim_galore --paired ./data_raw/B.1.fastq.gz ./data_raw/B.2.fastq.gz --retain_unpaired --output_dir ./data_raw/ 

No quality encoding type selected. Assuming that the data provided uses Sanger encoded Phred scores (default)

Path to Cutadapt set as: 'cutadapt' (default)
1.11
Cutadapt seems to be working fine (tested command 'cutadapt --version')


AUTO-DETECTING ADAPTER TYPE
Attempting to auto-detect adapter type from the first 1 million sequences of the first file (>> ./data_raw/B.1.fastq.gz <<)

Found perfect matches for the following adapter sequences:
Adapter type	Count	Sequence	Sequences analysed	Percentage
Illumina	368340	AGATCGGAAGAGC	1000000	36.83
Nextera	0	CTGTCTCTTATA	1000000	0.00
smallRNA	0	TGGAATTCTCGG	1000000	0.00
Using Illumina adapter for trimming (count: 368340). Second best hit was Nextera (count: 0)

gunzip: error writing to output: Broken pipe
gunzip: ./data_raw/B.1.fastq.gz: uncompress failed
Writing report to './data_raw/B.1.fastq.gz_trimming_report.txt'

SUMMARISING RUN PARAMETERS
Input filename: ./data_raw/B.1.fastq.gz
Trimming mode: paired-end
Trim Galore version: 0.4.2
Cut

In [None]:
!./scripts/trim_galore_v0/trim_galore --paired ./data_raw/C.1.fastq.gz ./data_raw/C.2.fastq.gz --retain_unpaired --output_dir ./data_raw/ 

No quality encoding type selected. Assuming that the data provided uses Sanger encoded Phred scores (default)

Path to Cutadapt set as: 'cutadapt' (default)
1.11
Cutadapt seems to be working fine (tested command 'cutadapt --version')


AUTO-DETECTING ADAPTER TYPE
Attempting to auto-detect adapter type from the first 1 million sequences of the first file (>> ./data_raw/C.1.fastq.gz <<)

Found perfect matches for the following adapter sequences:
Adapter type	Count	Sequence	Sequences analysed	Percentage
Illumina	409340	AGATCGGAAGAGC	1000000	40.93
Nextera	1	CTGTCTCTTATA	1000000	0.00
smallRNA	0	TGGAATTCTCGG	1000000	0.00
Using Illumina adapter for trimming (count: 409340). Second best hit was Nextera (count: 1)

gunzip: error writing to output: Broken pipe
gunzip: ./data_raw/C.1.fastq.gz: uncompress failed
Writing report to './data_raw/C.1.fastq.gz_trimming_report.txt'

SUMMARISING RUN PARAMETERS
Input filename: ./data_raw/C.1.fastq.gz
Trimming mode: paired-end
Trim Galore version: 0.4.2
Cut

In [None]:
!./scripts/trim_galore_v0/trim_galore --paired ./data_raw/D.1.fastq.gz ./data_raw/D.2.fastq.gz --retain_unpaired --output_dir ./data_raw/ 

No quality encoding type selected. Assuming that the data provided uses Sanger encoded Phred scores (default)

Path to Cutadapt set as: 'cutadapt' (default)
1.11
Cutadapt seems to be working fine (tested command 'cutadapt --version')


AUTO-DETECTING ADAPTER TYPE
Attempting to auto-detect adapter type from the first 1 million sequences of the first file (>> ./data_raw/D.1.fastq.gz <<)

Found perfect matches for the following adapter sequences:
Adapter type	Count	Sequence	Sequences analysed	Percentage
Illumina	372267	AGATCGGAAGAGC	1000000	37.23
Nextera	1	CTGTCTCTTATA	1000000	0.00
smallRNA	0	TGGAATTCTCGG	1000000	0.00
Using Illumina adapter for trimming (count: 372267). Second best hit was Nextera (count: 1)

gunzip: error writing to output: Broken pipe
gunzip: ./data_raw/D.1.fastq.gz: uncompress failed
Writing report to './data_raw/D.1.fastq.gz_trimming_report.txt'

SUMMARISING RUN PARAMETERS
Input filename: ./data_raw/D.1.fastq.gz
Trimming mode: paired-end
Trim Galore version: 0.4.2
Cut

In [None]:
!./scripts/trim_galore_v0/trim_galore --paired ./data_raw/E.1.fastq.gz ./data_raw/E.2.fastq.gz --retain_unpaired --output_dir ./data_raw/ 

No quality encoding type selected. Assuming that the data provided uses Sanger encoded Phred scores (default)

Path to Cutadapt set as: 'cutadapt' (default)
1.11
Cutadapt seems to be working fine (tested command 'cutadapt --version')


AUTO-DETECTING ADAPTER TYPE
Attempting to auto-detect adapter type from the first 1 million sequences of the first file (>> ./data_raw/E.1.fastq.gz <<)

Found perfect matches for the following adapter sequences:
Adapter type	Count	Sequence	Sequences analysed	Percentage
Illumina	376351	AGATCGGAAGAGC	1000000	37.64
smallRNA	0	TGGAATTCTCGG	1000000	0.00
Nextera	0	CTGTCTCTTATA	1000000	0.00
Using Illumina adapter for trimming (count: 376351). Second best hit was smallRNA (count: 0)

gunzip: error writing to output: Broken pipe
gunzip: ./data_raw/E.1.fastq.gz: uncompress failed
Writing report to './data_raw/E.1.fastq.gz_trimming_report.txt'

SUMMARISING RUN PARAMETERS
Input filename: ./data_raw/E.1.fastq.gz
Trimming mode: paired-end
Trim Galore version: 0.4.2
Cu

In [None]:
!./scripts/trim_galore_v0/trim_galore --paired ./data_raw/F.1.fastq.gz ./data_raw/F.2.fastq.gz --retain_unpaired --output_dir ./data_raw/ 

No quality encoding type selected. Assuming that the data provided uses Sanger encoded Phred scores (default)

Path to Cutadapt set as: 'cutadapt' (default)
1.11
Cutadapt seems to be working fine (tested command 'cutadapt --version')


AUTO-DETECTING ADAPTER TYPE
Attempting to auto-detect adapter type from the first 1 million sequences of the first file (>> ./data_raw/F.1.fastq.gz <<)

Found perfect matches for the following adapter sequences:
Adapter type	Count	Sequence	Sequences analysed	Percentage
Illumina	376796	AGATCGGAAGAGC	1000000	37.68
Nextera	0	CTGTCTCTTATA	1000000	0.00
smallRNA	0	TGGAATTCTCGG	1000000	0.00
Using Illumina adapter for trimming (count: 376796). Second best hit was Nextera (count: 0)

gunzip: error writing to output: Broken pipe
gunzip: ./data_raw/F.1.fastq.gz: uncompress failed
Writing report to './data_raw/F.1.fastq.gz_trimming_report.txt'

SUMMARISING RUN PARAMETERS
Input filename: ./data_raw/F.1.fastq.gz
Trimming mode: paired-end
Trim Galore version: 0.4.2
Cut

In [None]:
!./scripts/trim_galore_v0/trim_galore --paired ./data_raw/G.1.fastq.gz ./data_raw/G.2.fastq.gz --retain_unpaired --output_dir ./data_raw/ 

No quality encoding type selected. Assuming that the data provided uses Sanger encoded Phred scores (default)

Path to Cutadapt set as: 'cutadapt' (default)
1.11
Cutadapt seems to be working fine (tested command 'cutadapt --version')


AUTO-DETECTING ADAPTER TYPE
Attempting to auto-detect adapter type from the first 1 million sequences of the first file (>> ./data_raw/G.1.fastq.gz <<)

Found perfect matches for the following adapter sequences:
Adapter type	Count	Sequence	Sequences analysed	Percentage
Illumina	374529	AGATCGGAAGAGC	1000000	37.45
Nextera	1	CTGTCTCTTATA	1000000	0.00
smallRNA	0	TGGAATTCTCGG	1000000	0.00
Using Illumina adapter for trimming (count: 374529). Second best hit was Nextera (count: 1)

gunzip: error writing to output: Broken pipe
gunzip: ./data_raw/G.1.fastq.gz: uncompress failed
Writing report to './data_raw/G.1.fastq.gz_trimming_report.txt'

SUMMARISING RUN PARAMETERS
Input filename: ./data_raw/G.1.fastq.gz
Trimming mode: paired-end
Trim Galore version: 0.4.2
Cut

In [None]:
!./scripts/trim_galore_v0/trim_galore --paired ./data_raw/H.1.fastq.gz ./data_raw/H.2.fastq.gz --retain_unpaired --output_dir ./data_raw/ 

No quality encoding type selected. Assuming that the data provided uses Sanger encoded Phred scores (default)

Path to Cutadapt set as: 'cutadapt' (default)
1.11
Cutadapt seems to be working fine (tested command 'cutadapt --version')


AUTO-DETECTING ADAPTER TYPE
Attempting to auto-detect adapter type from the first 1 million sequences of the first file (>> ./data_raw/H.1.fastq.gz <<)

Found perfect matches for the following adapter sequences:
Adapter type	Count	Sequence	Sequences analysed	Percentage
Illumina	371397	AGATCGGAAGAGC	1000000	37.14
Nextera	1	CTGTCTCTTATA	1000000	0.00
smallRNA	0	TGGAATTCTCGG	1000000	0.00
Using Illumina adapter for trimming (count: 371397). Second best hit was Nextera (count: 1)

gunzip: error writing to output: Broken pipe
gunzip: ./data_raw/H.1.fastq.gz: uncompress failed
Writing report to './data_raw/H.1.fastq.gz_trimming_report.txt'

SUMMARISING RUN PARAMETERS
Input filename: ./data_raw/H.1.fastq.gz
Trimming mode: paired-end
Trim Galore version: 0.4.2
Cut

In [None]:
!./scripts/trim_galore_v0/trim_galore --paired ./data_raw/I.1.fastq.gz ./data_raw/I.2.fastq.gz --retain_unpaired --output_dir ./data_raw/ 

No quality encoding type selected. Assuming that the data provided uses Sanger encoded Phred scores (default)

Path to Cutadapt set as: 'cutadapt' (default)
1.11
Cutadapt seems to be working fine (tested command 'cutadapt --version')


AUTO-DETECTING ADAPTER TYPE
Attempting to auto-detect adapter type from the first 1 million sequences of the first file (>> ./data_raw/I.1.fastq.gz <<)

Found perfect matches for the following adapter sequences:
Adapter type	Count	Sequence	Sequences analysed	Percentage
Illumina	403869	AGATCGGAAGAGC	1000000	40.39
smallRNA	0	TGGAATTCTCGG	1000000	0.00
Nextera	0	CTGTCTCTTATA	1000000	0.00
Using Illumina adapter for trimming (count: 403869). Second best hit was smallRNA (count: 0)

gunzip: error writing to output: Broken pipe
gunzip: ./data_raw/I.1.fastq.gz: uncompress failed
Writing report to './data_raw/I.1.fastq.gz_trimming_report.txt'

SUMMARISING RUN PARAMETERS
Input filename: ./data_raw/I.1.fastq.gz
Trimming mode: paired-end
Trim Galore version: 0.4.2
Cu

In [None]:
!./scripts/trim_galore_v0/trim_galore --paired ./data_raw/J.1.fastq.gz ./data_raw/J.2.fastq.gz --retain_unpaired --output_dir ./data_raw/ 

No quality encoding type selected. Assuming that the data provided uses Sanger encoded Phred scores (default)

Path to Cutadapt set as: 'cutadapt' (default)
1.11
Cutadapt seems to be working fine (tested command 'cutadapt --version')


AUTO-DETECTING ADAPTER TYPE
Attempting to auto-detect adapter type from the first 1 million sequences of the first file (>> ./data_raw/J.1.fastq.gz <<)

Found perfect matches for the following adapter sequences:
Adapter type	Count	Sequence	Sequences analysed	Percentage
Illumina	355259	AGATCGGAAGAGC	1000000	35.53
smallRNA	0	TGGAATTCTCGG	1000000	0.00
Nextera	0	CTGTCTCTTATA	1000000	0.00
Using Illumina adapter for trimming (count: 355259). Second best hit was smallRNA (count: 0)

gunzip: error writing to output: Broken pipe
gunzip: ./data_raw/J.1.fastq.gz: uncompress failed
Writing report to './data_raw/J.1.fastq.gz_trimming_report.txt'

SUMMARISING RUN PARAMETERS
Input filename: ./data_raw/J.1.fastq.gz
Trimming mode: paired-end
Trim Galore version: 0.4.2
Cu

In [None]:
!./scripts/trim_galore_v0/trim_galore --paired ./data_raw/K.1.fastq.gz ./data_raw/K.2.fastq.gz --retain_unpaired --output_dir ./data_raw/ 

No quality encoding type selected. Assuming that the data provided uses Sanger encoded Phred scores (default)

Path to Cutadapt set as: 'cutadapt' (default)
1.11
Cutadapt seems to be working fine (tested command 'cutadapt --version')


AUTO-DETECTING ADAPTER TYPE
Attempting to auto-detect adapter type from the first 1 million sequences of the first file (>> ./data_raw/K.1.fastq.gz <<)

Found perfect matches for the following adapter sequences:
Adapter type	Count	Sequence	Sequences analysed	Percentage
Illumina	409983	AGATCGGAAGAGC	1000000	41.00
Nextera	3	CTGTCTCTTATA	1000000	0.00
smallRNA	0	TGGAATTCTCGG	1000000	0.00
Using Illumina adapter for trimming (count: 409983). Second best hit was Nextera (count: 3)

gunzip: error writing to output: Broken pipe
gunzip: ./data_raw/K.1.fastq.gz: uncompress failed
Writing report to './data_raw/K.1.fastq.gz_trimming_report.txt'

SUMMARISING RUN PARAMETERS
Input filename: ./data_raw/K.1.fastq.gz
Trimming mode: paired-end
Trim Galore version: 0.4.2
Cut

In [None]:
!./scripts/trim_galore_v0/trim_galore --paired ./data_raw/L.1.fastq.gz ./data_raw/L.2.fastq.gz --retain_unpaired --output_dir ./data_raw/ 

No quality encoding type selected. Assuming that the data provided uses Sanger encoded Phred scores (default)

Path to Cutadapt set as: 'cutadapt' (default)
1.11
Cutadapt seems to be working fine (tested command 'cutadapt --version')


AUTO-DETECTING ADAPTER TYPE
Attempting to auto-detect adapter type from the first 1 million sequences of the first file (>> ./data_raw/L.1.fastq.gz <<)

Found perfect matches for the following adapter sequences:
Adapter type	Count	Sequence	Sequences analysed	Percentage
Illumina	381399	AGATCGGAAGAGC	1000000	38.14
smallRNA	0	TGGAATTCTCGG	1000000	0.00
Nextera	0	CTGTCTCTTATA	1000000	0.00
Using Illumina adapter for trimming (count: 381399). Second best hit was smallRNA (count: 0)

gunzip: error writing to output: Broken pipe
gunzip: ./data_raw/L.1.fastq.gz: uncompress failed
Writing report to './data_raw/L.1.fastq.gz_trimming_report.txt'

SUMMARISING RUN PARAMETERS
Input filename: ./data_raw/L.1.fastq.gz
Trimming mode: paired-end
Trim Galore version: 0.4.2
Cu

In [None]:
!./scripts/trim_galore_v0/trim_galore --paired ./data_raw/M.1.fastq.gz ./data_raw/M.2.fastq.gz --retain_unpaired --output_dir ./data_raw/ 

No quality encoding type selected. Assuming that the data provided uses Sanger encoded Phred scores (default)

Path to Cutadapt set as: 'cutadapt' (default)
1.11
Cutadapt seems to be working fine (tested command 'cutadapt --version')


AUTO-DETECTING ADAPTER TYPE
Attempting to auto-detect adapter type from the first 1 million sequences of the first file (>> ./data_raw/M.1.fastq.gz <<)

Found perfect matches for the following adapter sequences:
Adapter type	Count	Sequence	Sequences analysed	Percentage
Illumina	345311	AGATCGGAAGAGC	1000000	34.53
Nextera	0	CTGTCTCTTATA	1000000	0.00
smallRNA	0	TGGAATTCTCGG	1000000	0.00
Using Illumina adapter for trimming (count: 345311). Second best hit was Nextera (count: 0)

gunzip: error writing to output: Broken pipe
gunzip: ./data_raw/M.1.fastq.gz: uncompress failed
Writing report to './data_raw/M.1.fastq.gz_trimming_report.txt'

SUMMARISING RUN PARAMETERS
Input filename: ./data_raw/M.1.fastq.gz
Trimming mode: paired-end
Trim Galore version: 0.4.2
Cut

In [None]:
!./scripts/trim_galore_v0/trim_galore --paired ./data_raw/N.1.fastq.gz ./data_raw/N.2.fastq.gz --retain_unpaired --output_dir ./data_raw/ 

No quality encoding type selected. Assuming that the data provided uses Sanger encoded Phred scores (default)

Path to Cutadapt set as: 'cutadapt' (default)
1.11
Cutadapt seems to be working fine (tested command 'cutadapt --version')


AUTO-DETECTING ADAPTER TYPE
Attempting to auto-detect adapter type from the first 1 million sequences of the first file (>> ./data_raw/N.1.fastq.gz <<)

Found perfect matches for the following adapter sequences:
Adapter type	Count	Sequence	Sequences analysed	Percentage
Illumina	400810	AGATCGGAAGAGC	1000000	40.08
Nextera	1	CTGTCTCTTATA	1000000	0.00
smallRNA	0	TGGAATTCTCGG	1000000	0.00
Using Illumina adapter for trimming (count: 400810). Second best hit was Nextera (count: 1)

gunzip: error writing to output: Broken pipe
gunzip: ./data_raw/N.1.fastq.gz: uncompress failed
Writing report to './data_raw/N.1.fastq.gz_trimming_report.txt'

SUMMARISING RUN PARAMETERS
Input filename: ./data_raw/N.1.fastq.gz
Trimming mode: paired-end
Trim Galore version: 0.4.2
Cut

In [None]:
!./scripts/trim_galore_v0/trim_galore --paired ./data_raw/O.1.fastq.gz ./data_raw/O.2.fastq.gz --retain_unpaired --output_dir ./data_raw/ 

No quality encoding type selected. Assuming that the data provided uses Sanger encoded Phred scores (default)

Path to Cutadapt set as: 'cutadapt' (default)
1.11
Cutadapt seems to be working fine (tested command 'cutadapt --version')


AUTO-DETECTING ADAPTER TYPE
Attempting to auto-detect adapter type from the first 1 million sequences of the first file (>> ./data_raw/O.1.fastq.gz <<)

Found perfect matches for the following adapter sequences:
Adapter type	Count	Sequence	Sequences analysed	Percentage
Illumina	394544	AGATCGGAAGAGC	1000000	39.45
Nextera	0	CTGTCTCTTATA	1000000	0.00
smallRNA	0	TGGAATTCTCGG	1000000	0.00
Using Illumina adapter for trimming (count: 394544). Second best hit was Nextera (count: 0)

gunzip: error writing to output: Broken pipe
gunzip: ./data_raw/O.1.fastq.gz: uncompress failed
Writing report to './data_raw/O.1.fastq.gz_trimming_report.txt'

SUMMARISING RUN PARAMETERS
Input filename: ./data_raw/O.1.fastq.gz
Trimming mode: paired-end
Trim Galore version: 0.4.2
Cut

In [None]:
!./scripts/trim_galore_v0/trim_galore --paired ./data_raw/P.1.fastq.gz ./data_raw/P.2.fastq.gz --retain_unpaired --output_dir ./data_raw/ 

No quality encoding type selected. Assuming that the data provided uses Sanger encoded Phred scores (default)

Path to Cutadapt set as: 'cutadapt' (default)
1.11
Cutadapt seems to be working fine (tested command 'cutadapt --version')


AUTO-DETECTING ADAPTER TYPE
Attempting to auto-detect adapter type from the first 1 million sequences of the first file (>> ./data_raw/P.1.fastq.gz <<)

Found perfect matches for the following adapter sequences:
Adapter type	Count	Sequence	Sequences analysed	Percentage
Illumina	391437	AGATCGGAAGAGC	1000000	39.14
smallRNA	0	TGGAATTCTCGG	1000000	0.00
Nextera	0	CTGTCTCTTATA	1000000	0.00
Using Illumina adapter for trimming (count: 391437). Second best hit was smallRNA (count: 0)

gunzip: error writing to output: Broken pipe
gunzip: ./data_raw/P.1.fastq.gz: uncompress failed
Writing report to './data_raw/P.1.fastq.gz_trimming_report.txt'

SUMMARISING RUN PARAMETERS
Input filename: ./data_raw/P.1.fastq.gz
Trimming mode: paired-end
Trim Galore version: 0.4.2
Cu

In [None]:
!./scripts/trim_galore_v0/trim_galore --paired ./data_raw/Q.1.fastq.gz ./data_raw/Q.2.fastq.gz --retain_unpaired --output_dir ./data_raw/ 

No quality encoding type selected. Assuming that the data provided uses Sanger encoded Phred scores (default)

Path to Cutadapt set as: 'cutadapt' (default)
1.11
Cutadapt seems to be working fine (tested command 'cutadapt --version')


AUTO-DETECTING ADAPTER TYPE
Attempting to auto-detect adapter type from the first 1 million sequences of the first file (>> ./data_raw/Q.1.fastq.gz <<)

Found perfect matches for the following adapter sequences:
Adapter type	Count	Sequence	Sequences analysed	Percentage
Illumina	359241	AGATCGGAAGAGC	1000000	35.92
Nextera	0	CTGTCTCTTATA	1000000	0.00
smallRNA	0	TGGAATTCTCGG	1000000	0.00
Using Illumina adapter for trimming (count: 359241). Second best hit was Nextera (count: 0)

gunzip: error writing to output: Broken pipe
gunzip: ./data_raw/Q.1.fastq.gz: uncompress failed
Writing report to './data_raw/Q.1.fastq.gz_trimming_report.txt'

SUMMARISING RUN PARAMETERS
Input filename: ./data_raw/Q.1.fastq.gz
Trimming mode: paired-end
Trim Galore version: 0.4.2
Cut

In [None]:
!./scripts/trim_galore_v0/trim_galore --paired ./data_raw/R.1.fastq.gz ./data_raw/R.2.fastq.gz --retain_unpaired --output_dir ./data_raw/ 

No quality encoding type selected. Assuming that the data provided uses Sanger encoded Phred scores (default)

Path to Cutadapt set as: 'cutadapt' (default)
1.11
Cutadapt seems to be working fine (tested command 'cutadapt --version')


AUTO-DETECTING ADAPTER TYPE
Attempting to auto-detect adapter type from the first 1 million sequences of the first file (>> ./data_raw/R.1.fastq.gz <<)

Found perfect matches for the following adapter sequences:
Adapter type	Count	Sequence	Sequences analysed	Percentage
Illumina	365448	AGATCGGAAGAGC	1000000	36.54
Nextera	0	CTGTCTCTTATA	1000000	0.00
smallRNA	0	TGGAATTCTCGG	1000000	0.00
Using Illumina adapter for trimming (count: 365448). Second best hit was Nextera (count: 0)

gunzip: error writing to output: Broken pipe
gunzip: ./data_raw/R.1.fastq.gz: uncompress failed
Writing report to './data_raw/R.1.fastq.gz_trimming_report.txt'

SUMMARISING RUN PARAMETERS
Input filename: ./data_raw/R.1.fastq.gz
Trimming mode: paired-end
Trim Galore version: 0.4.2
Cut

In [None]:
!./scripts/trim_galore_v0/trim_galore --paired ./data_raw/S.1.fastq.gz ./data_raw/S.2.fastq.gz --retain_unpaired --output_dir ./data_raw/ 

No quality encoding type selected. Assuming that the data provided uses Sanger encoded Phred scores (default)

Path to Cutadapt set as: 'cutadapt' (default)
1.11
Cutadapt seems to be working fine (tested command 'cutadapt --version')


AUTO-DETECTING ADAPTER TYPE
Attempting to auto-detect adapter type from the first 1 million sequences of the first file (>> ./data_raw/S.1.fastq.gz <<)

Found perfect matches for the following adapter sequences:
Adapter type	Count	Sequence	Sequences analysed	Percentage
Illumina	403094	AGATCGGAAGAGC	1000000	40.31
smallRNA	0	TGGAATTCTCGG	1000000	0.00
Nextera	0	CTGTCTCTTATA	1000000	0.00
Using Illumina adapter for trimming (count: 403094). Second best hit was smallRNA (count: 0)

gunzip: error writing to output: Broken pipe
gunzip: ./data_raw/S.1.fastq.gz: uncompress failed
Writing report to './data_raw/S.1.fastq.gz_trimming_report.txt'

SUMMARISING RUN PARAMETERS
Input filename: ./data_raw/S.1.fastq.gz
Trimming mode: paired-end
Trim Galore version: 0.4.2
Cu

In [None]:
!./scripts/trim_galore_v0/trim_galore --paired ./data_raw/T.1.fastq.gz ./data_raw/T.2.fastq.gz --retain_unpaired --output_dir ./data_raw/ 

No quality encoding type selected. Assuming that the data provided uses Sanger encoded Phred scores (default)

Path to Cutadapt set as: 'cutadapt' (default)
1.11
Cutadapt seems to be working fine (tested command 'cutadapt --version')


AUTO-DETECTING ADAPTER TYPE
Attempting to auto-detect adapter type from the first 1 million sequences of the first file (>> ./data_raw/T.1.fastq.gz <<)

Found perfect matches for the following adapter sequences:
Adapter type	Count	Sequence	Sequences analysed	Percentage
Illumina	364140	AGATCGGAAGAGC	1000000	36.41
smallRNA	0	TGGAATTCTCGG	1000000	0.00
Nextera	0	CTGTCTCTTATA	1000000	0.00
Using Illumina adapter for trimming (count: 364140). Second best hit was smallRNA (count: 0)

gunzip: error writing to output: Broken pipe
gunzip: ./data_raw/T.1.fastq.gz: uncompress failed
Writing report to './data_raw/T.1.fastq.gz_trimming_report.txt'

SUMMARISING RUN PARAMETERS
Input filename: ./data_raw/T.1.fastq.gz
Trimming mode: paired-end
Trim Galore version: 0.4.2
Cu

In [None]:
!./scripts/trim_galore_v0/trim_galore --paired ./data_raw/U.1.fastq.gz ./data_raw/U.2.fastq.gz --retain_unpaired --output_dir ./data_raw/ 

No quality encoding type selected. Assuming that the data provided uses Sanger encoded Phred scores (default)

Path to Cutadapt set as: 'cutadapt' (default)
1.11
Cutadapt seems to be working fine (tested command 'cutadapt --version')


AUTO-DETECTING ADAPTER TYPE
Attempting to auto-detect adapter type from the first 1 million sequences of the first file (>> ./data_raw/U.1.fastq.gz <<)

Found perfect matches for the following adapter sequences:
Adapter type	Count	Sequence	Sequences analysed	Percentage
Illumina	358276	AGATCGGAAGAGC	1000000	35.83
smallRNA	0	TGGAATTCTCGG	1000000	0.00
Nextera	0	CTGTCTCTTATA	1000000	0.00
Using Illumina adapter for trimming (count: 358276). Second best hit was smallRNA (count: 0)

gunzip: error writing to output: Broken pipe
gunzip: ./data_raw/U.1.fastq.gz: uncompress failed
Writing report to './data_raw/U.1.fastq.gz_trimming_report.txt'

SUMMARISING RUN PARAMETERS
Input filename: ./data_raw/U.1.fastq.gz
Trimming mode: paired-end
Trim Galore version: 0.4.2
Cu

In [None]:
!./scripts/trim_galore_v0/trim_galore --paired ./data_raw/V.1.fastq.gz ./data_raw/V.2.fastq.gz --retain_unpaired --output_dir ./data_raw/ 

No quality encoding type selected. Assuming that the data provided uses Sanger encoded Phred scores (default)

Path to Cutadapt set as: 'cutadapt' (default)
1.11
Cutadapt seems to be working fine (tested command 'cutadapt --version')


AUTO-DETECTING ADAPTER TYPE
Attempting to auto-detect adapter type from the first 1 million sequences of the first file (>> ./data_raw/V.1.fastq.gz <<)

Found perfect matches for the following adapter sequences:
Adapter type	Count	Sequence	Sequences analysed	Percentage
Illumina	334664	AGATCGGAAGAGC	1000000	33.47
Nextera	0	CTGTCTCTTATA	1000000	0.00
smallRNA	0	TGGAATTCTCGG	1000000	0.00
Using Illumina adapter for trimming (count: 334664). Second best hit was Nextera (count: 0)

gunzip: error writing to output: Broken pipe
gunzip: ./data_raw/V.1.fastq.gz: uncompress failed
Writing report to './data_raw/V.1.fastq.gz_trimming_report.txt'

SUMMARISING RUN PARAMETERS
Input filename: ./data_raw/V.1.fastq.gz
Trimming mode: paired-end
Trim Galore version: 0.4.2
Cut

In [None]:
!./scripts/trim_galore_v0/trim_galore --paired ./data_raw/W.1.fastq.gz ./data_raw/W.2.fastq.gz --retain_unpaired --output_dir ./data_raw/ 

No quality encoding type selected. Assuming that the data provided uses Sanger encoded Phred scores (default)

Path to Cutadapt set as: 'cutadapt' (default)
1.11
Cutadapt seems to be working fine (tested command 'cutadapt --version')


AUTO-DETECTING ADAPTER TYPE
Attempting to auto-detect adapter type from the first 1 million sequences of the first file (>> ./data_raw/W.1.fastq.gz <<)

Found perfect matches for the following adapter sequences:
Adapter type	Count	Sequence	Sequences analysed	Percentage
Illumina	384971	AGATCGGAAGAGC	1000000	38.50
smallRNA	0	TGGAATTCTCGG	1000000	0.00
Nextera	0	CTGTCTCTTATA	1000000	0.00
Using Illumina adapter for trimming (count: 384971). Second best hit was smallRNA (count: 0)

gunzip: error writing to output: Broken pipe
gunzip: ./data_raw/W.1.fastq.gz: uncompress failed
Writing report to './data_raw/W.1.fastq.gz_trimming_report.txt'

SUMMARISING RUN PARAMETERS
Input filename: ./data_raw/W.1.fastq.gz
Trimming mode: paired-end
Trim Galore version: 0.4.2
Cu

In [None]:
!./scripts/trim_galore_v0/trim_galore --paired ./data_raw/X.1.fastq.gz ./data_raw/X.2.fastq.gz --retain_unpaired --output_dir ./data_raw/ 

No quality encoding type selected. Assuming that the data provided uses Sanger encoded Phred scores (default)

Path to Cutadapt set as: 'cutadapt' (default)
1.11
Cutadapt seems to be working fine (tested command 'cutadapt --version')


AUTO-DETECTING ADAPTER TYPE
Attempting to auto-detect adapter type from the first 1 million sequences of the first file (>> ./data_raw/X.1.fastq.gz <<)

Found perfect matches for the following adapter sequences:
Adapter type	Count	Sequence	Sequences analysed	Percentage
Illumina	385170	AGATCGGAAGAGC	1000000	38.52
Nextera	0	CTGTCTCTTATA	1000000	0.00
smallRNA	0	TGGAATTCTCGG	1000000	0.00
Using Illumina adapter for trimming (count: 385170). Second best hit was Nextera (count: 0)

gunzip: error writing to output: Broken pipe
gunzip: ./data_raw/X.1.fastq.gz: uncompress failed
Writing report to './data_raw/X.1.fastq.gz_trimming_report.txt'

SUMMARISING RUN PARAMETERS
Input filename: ./data_raw/X.1.fastq.gz
Trimming mode: paired-end
Trim Galore version: 0.4.2
Cut
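
The 24 cells above differ only in the sample letter, so the same commands could be generated in a loop. A sketch in Python, reusing the script path and flags from the cells above; the function names and the `dry_run` switch are hypothetical, and nothing is executed unless `dry_run=False` is passed.

```python
import string
import subprocess

TRIM_GALORE = "./scripts/trim_galore_v0/trim_galore"  # path used in the cells above

def trim_command(sample, data_dir="./data_raw/"):
    """Build the Trim Galore command for one paired-end sample."""
    return [
        TRIM_GALORE, "--paired",
        f"{data_dir}{sample}.1.fastq.gz",
        f"{data_dir}{sample}.2.fastq.gz",
        "--retain_unpaired",
        "--output_dir", data_dir,
    ]

def trim_all(samples, dry_run=True):
    """Print each command, and run it unless dry_run is True."""
    for sample in samples:
        command = trim_command(sample)
        print(" ".join(command))
        if not dry_run:
            subprocess.run(command, check=True)

# Samples A through X, as in the file listing; dry run only prints the commands
trim_all(string.ascii_uppercase[:24])
```

With `check=True`, a failing sample stops the loop instead of being silently skipped.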

In [1]:
cd /Users/andrewspanjer/RNAseq_Coho/

/Users/andrewspanjer/RNAseq_Coho


Then re-run FastQC to see how the reads have improved...

In [6]:
!scripts/FastQC/fastqc /Volumes/Untitled/trimed_fastq/*.fq.gz -o analyses/fastQC_after/

Started analysis of A.1_trimmed.fq.gz
Approx 5% complete for A.1_trimmed.fq.gz
Approx 10% complete for A.1_trimmed.fq.gz
Approx 15% complete for A.1_trimmed.fq.gz
Approx 20% complete for A.1_trimmed.fq.gz
Approx 25% complete for A.1_trimmed.fq.gz
Approx 30% complete for A.1_trimmed.fq.gz
Approx 35% complete for A.1_trimmed.fq.gz
Approx 40% complete for A.1_trimmed.fq.gz
Approx 45% complete for A.1_trimmed.fq.gz
Approx 50% complete for A.1_trimmed.fq.gz
Approx 55% complete for A.1_trimmed.fq.gz
Approx 60% complete for A.1_trimmed.fq.gz
Approx 65% complete for A.1_trimmed.fq.gz
Approx 70% complete for A.1_trimmed.fq.gz
Approx 75% complete for A.1_trimmed.fq.gz
Approx 80% complete for A.1_trimmed.fq.gz
Approx 85% complete for A.1_trimmed.fq.gz
Approx 90% complete for A.1_trimmed.fq.gz
Approx 95% complete for A.1_trimmed.fq.gz
Analysis complete for A.1_trimmed.fq.gz
Started analysis of A.2_trimmed.fq.gz
Approx 5% complete for A.2_trimmed.fq.gz
Approx 10% complete for A.2_trimmed.fq.gz
Appr

Re-run MultiQC...

In [7]:
!multiqc analyses/fastQC_after/ -o ./analyses/ -n multiqc_after

[INFO   ]         multiqc : This is MultiQC v0.8
[INFO   ]         multiqc : Template    : default
[INFO   ]         multiqc : Searching 'analyses/fastQC_after/'
[INFO   ]          fastqc : Found 48 reports
[INFO   ]         multiqc : Report      : analyses/multiqc_after.html
[INFO   ]         multiqc : Data        : analyses/multiqc_after_data
[INFO   ]         multiqc : MultiQC complete


I'm happy with the results from above. Next, I need to concatenate all of my forward reads and all of my reverse reads, in the same sample order, so they can be run in Trinity for assembly...

In [None]:
!cat /Volumes/Untitled/trimed_fastq/*.1.fq.gz > /Volumes/Untitled/all_1.fq.gz

In [None]:
!cat /Volumes/Untitled/trimed_fastq/*.2.fq.gz > /Volumes/Untitled/all_2.fq.gz
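
Gzip files can be joined end to end, because a gzip stream may contain several members, so `cat` on the compressed files works. What matters for a paired-end assembler is that both combined files list the samples in the same order. The sketch below does the same step in Python: it sorts the forward files and derives each mate's name from the forward name, so the two orders cannot drift apart. The suffixes and the function name are placeholders (assumptions) and should be changed to match the actual trimmed file names.

```python
import glob
import os
import shutil

def concatenate_pairs(in_dir, out_forward, out_reverse,
                      forward_suffix=".1.fq.gz", reverse_suffix=".2.fq.gz"):
    """Concatenate forward and reverse gzip files in the same sample order.

    The suffixes are assumptions; adjust them to the real trimmed file names.
    """
    forward_files = sorted(glob.glob(os.path.join(in_dir, "*" + forward_suffix)))
    with open(out_forward, "wb") as fwd_out, open(out_reverse, "wb") as rev_out:
        for forward in forward_files:
            # Derive the mate's path from the forward path
            reverse = forward[: -len(forward_suffix)] + reverse_suffix
            if not os.path.exists(reverse):
                raise FileNotFoundError(f"no mate file for {forward}")
            for source, target in ((forward, fwd_out), (reverse, rev_out)):
                with open(source, "rb") as handle:
                    shutil.copyfileobj(handle, target)
    return forward_files
```

The function returns the forward files in the order they were written, which can be kept as a record of the sample order.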

Upload to Galaxy Indiana for assembly 

![upload galaxy](https://github.com/aspanjer/RNAseq_Coho/blob/master/jupyter/upload.png?raw=true)


Start the assembly process...

![Galaxy assembly](https://github.com/aspanjer/RNAseq_Coho/blob/master/analyses/Trinity/Trinity%20Galaxy.png?raw=true)

In [None]:
Next step: [Differential Expression](https://github.com/aspanjer/RNAseq_Coho/blob/master/jupyter_notebooks/Differential%20Expression.ipynb) (new notebook)