### Receiving data, initial processing

Data was downloaded from Globus (globus.org).  It was sent in two zipped files, one from each lane, containing separate files for each sample (demultiplexed). 

I then checked that the md5 values matched by running `md5 [filename].tar.gz >> [filename].md5`, which appended the newly computed value to the .md5 file so that I could compare it with the value supplied. 
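The same comparison can be scripted rather than done by eye. This is a minimal Python sketch, assuming the expected checksum is available as a hex string. The function names `md5_of_file` and `checksum_matches` are made up for illustration, and the demo uses a small temporary file in place of a real archive.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path, expected_hex):
    """Compare a file's MD5 digest with an expected hex string (case-insensitive)."""
    return md5_of_file(path) == expected_hex.strip().lower()


# Demo on a small temporary file standing in for a downloaded archive.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.tar.gz")
    with open(demo_path, "wb") as handle:
        handle.write(b"not a real archive")
    expected = hashlib.md5(b"not a real archive").hexdigest()
    demo_ok = checksum_matches(demo_path, expected)
    demo_bad = checksum_matches(demo_path, "0" * 32)

print(demo_ok, demo_bad)
```

Reading in chunks keeps memory use flat, which matters for multi-gigabyte lane archives.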

### A note on working environment & data accessibility

I originally downloaded data to Ostrich, thinking that I could work on that computer remotely using Remote Desktop and Jupyter Notebook. However, my internet connection is too slow to work productively that way. I therefore also downloaded the data to my external hard drive, and worked locally. This also allows me to use all the packages that I had installed on my personal computer in 2018 (which is helpful). 

Canonical versions of the data are saved to Owl/Nightingales in the zipped format here: [nightingales/O_lurida/2020-04-21_QuantSeq-data/](http://owl.fish.washington.edu/nightingales/O_lurida/2020-04-21_QuantSeq-data/) and as individual fastq/sample here [nightingales/O_lurida/](http://owl.fish.washington.edu/nightingales/O_lurida/), and to Gannet as individual fastq/sample here [Atumefaciens/20200426_olur_fastqc_quantseq/](https://gannet.fish.washington.edu/Atumefaciens/20200426_olur_fastqc_quantseq/). 

Sam also ran MultiQC on my samples; check out his [notebook entry](https://robertslab.github.io/sams-notebook/2020/04/26/FastQC-MultiQC-Laura-Spencer's-QuantSeq-Data.html), and the [MultiQC report](https://gannet.fish.washington.edu/Atumefaciens/20200426_olur_fastqc_quantseq/multiqc_report.html).

### Install tag-seq scripts, create {tagseq} variable

I had cloned the tag-based_RNAseq.git repo back in 2018 using the following: 
`git clone https://github.com/z0on/tag-based_RNAseq.git /Users/lhs3/Documents/bioinf/tag-based_RNAseq/`. 

To update the package, I navigated to the directory locally, then did `git pull` and it updated the files. 

In [3]:
# check working directory 
pwd

'/Users/laura/Documents/roberts-lab/laura-quantseq/notebooks'

In [42]:
# create path variable to github repo 
repo = "/Users/laura/Documents/roberts-lab/laura-quantseq/"

In [4]:
# create path variable to raw data, saved on my external hard drive 
rawdata = "/Volumes/Peach\ Backup/QuantSeq-04-21-2020"

In [28]:
# create path variable to tagseq directory 
tagseq = "/Applications/bioinformatics/tag-based_RNAseq/"

In [2]:
# test fastq_quality_filter (see if it's correctly added to my PATH)
! fastq_quality_filter -h

usage: fastq_quality_filter [-h] [-v] [-q N] [-p N] [-z] [-i INFILE] [-o OUTFILE]
Part of FASTX Toolkit 0.0.14 by A. Gordon (assafgordon@gmail.com)

   [-h]         = This helpful help screen.
   [-q N]       = Minimum quality score to keep.
   [-p N]       = Minimum percent of bases that must have [-q] quality.
   [-z]         = Compress output with GZIP.
   [-i INFILE]  = FASTA/Q input file. default is STDIN.
   [-o OUTFILE] = FASTA/Q output file. default is STDOUT.
   [-v]         = Verbose - report number of sequences.
                  If [-o] is specified,  report will be printed to STDOUT.
                  If [-o] is not specified (and output goes to STDOUT),
                  report will be printed to STDERR.



In [9]:
cd {rawdata}

/Volumes/Peach Backup/QuantSeq-04-21-2020


In [10]:
pwd

'/Volumes/Peach Backup/QuantSeq-04-21-2020'

In [11]:
ls

Batch1_69plex_lane1.md5     Batch2_77plex_lane2_md5
Batch1_69plex_lane1.tar.gz  Batch2_77plex_lane2_tar.gz


The data are still compressed, which is how they arrived via Globus. I therefore need to decompress and untar each lane archive before I can gunzip the individual library files.  
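For reference, the decompress-then-untar step can also be done from Python's standard library. This is a rough sketch only: `extract_tar_gz` is a hypothetical helper name, and the demo builds a tiny throwaway archive instead of touching the real lane files.

```python
import os
import tarfile
import tempfile


def extract_tar_gz(archive_path, dest_dir="."):
    """Extract a gzip-compressed tar archive into dest_dir and return member names."""
    os.makedirs(dest_dir, exist_ok=True)
    with tarfile.open(archive_path, mode="r:gz") as archive:
        names = archive.getnames()
        archive.extractall(path=dest_dir)
    return names


# Demo: build a tiny archive that mimics a lane directory, then extract it.
with tempfile.TemporaryDirectory() as tmp_dir:
    lane_dir = os.path.join(tmp_dir, "lane_done")
    os.mkdir(lane_dir)
    with open(os.path.join(lane_dir, "sample.fastq"), "w") as handle:
        handle.write("@read1\nACGT\n+\nFFFF\n")
    archive_path = os.path.join(tmp_dir, "lane.tar.gz")
    with tarfile.open(archive_path, mode="w:gz") as archive:
        archive.add(lane_dir, arcname="lane_done")

    out_dir = os.path.join(tmp_dir, "extracted")
    extracted_names = extract_tar_gz(archive_path, out_dir)
    extracted_ok = os.path.exists(os.path.join(out_dir, "lane_done", "sample.fastq"))

print(extracted_names)
```

As with `tar` on the command line, this should only be run on archives from a trusted source.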

In [12]:
# extract batch/lane 1 data 
! gunzip -c Batch1_69plex_lane1.tar.gz | tar xopf -

In [13]:
# extract batch/lane 2 data 
! gunzip -c Batch2_77plex_lane2_tar.gz | tar xopf -

In [14]:
# check out resulting file structure 
! ls

Batch1_69plex_lane1.md5    Batch2_77plex_lane2_done
Batch1_69plex_lane1.tar.gz Batch2_77plex_lane2_md5
Batch1_69plex_lane1_done   Batch2_77plex_lane2_tar.gz


In [15]:
! ls Batch1_69plex_lane1_done/

137_S63_L001_R1_001.fastq.gz         314_S49_L001_R1_001.fastq.gz
139_S54_L001_R1_001.fastq.gz         315_S26_L001_R1_001.fastq.gz
140_S64_L001_R1_001.fastq.gz         316_S9_L001_R1_001.fastq.gz
141_S61_L001_R1_001.fastq.gz         317_S33_L001_R1_001.fastq.gz
156_S66_L001_R1_001.fastq.gz         318_S6_L001_R1_001.fastq.gz
159_S68_L001_R1_001.fastq.gz         319_S52_L001_R1_001.fastq.gz
161_S57_L001_R1_001.fastq.gz         321_S29_L001_R1_001.fastq.gz
162_S62_L001_R1_001.fastq.gz         322_S8_L001_R1_001.fastq.gz
168_S67_L001_R1_001.fastq.gz         323_S39_L001_R1_001.fastq.gz
169_S65_L001_R1_001.fastq.gz         324_S47_L001_R1_001.fastq.gz
171_S58_L001_R1_001.fastq.gz         325_S13_L001_R1_001.fastq.gz
172_S59_L001_R1_001.fastq.gz         326_S38_L001_R1_001.fastq.gz
181_S69_L001_R1_001.fastq.gz         327_S37_L001_R1_001.fastq.gz
183_S56_L001_R1_001.fastq.gz         328_S12_L001_R1_001.fastq.gz
184_S55_L001_R1_001.fastq.gz         329_S46_L001_R1_001.fastq.gz

### Move to each batch's directory containing demultiplexed library files, and gunzip all fastq files in that folder

In [18]:
cd Batch1_69plex_lane1_done/

/Volumes/Peach Backup/QuantSeq-04-21-2020/Batch1_69plex_lane1_done


In [19]:
! gunzip *.fastq.gz

In [21]:
cd ../Batch2_77plex_lane2_done/

/Volumes/Peach Backup/QuantSeq-04-21-2020/Batch2_77plex_lane2_done


In [22]:
! gunzip *.fastq.gz 

In [23]:
# Check out contents after gunzip 
! ls 

34_S68_L002_R1_001.fastq          482_S25_L002_R1_001.fastq
35_S72_L002_R1_001.fastq          483_S7_L002_R1_001.fastq
37_S70_L002_R1_001.fastq          484_S43_L002_R1_001.fastq
39_S52_L002_R1_001.fastq          485_S21_L002_R1_001.fastq
401_S10_L002_R1_001.fastq         487_S6_L002_R1_001.fastq
402_S5_L002_R1_001.fastq          488_S26_L002_R1_001.fastq
403_S30_L002_R1_001.fastq         489_S35_L002_R1_001.fastq
404_S42_L002_R1_001.fastq         490_S19_L002_R1_001.fastq
411_S9_L002_R1_001.fastq          491_S50_L002_R1_001.fastq
412_S74_L002_R1_001.fastq         492_S40_L002_R1_001.fastq
413_S38_L002_R1_001.fastq         506_S47_L002_R1_001.fastq
414_S49_L002_R1_001.fastq         513_S56_L002_R1_001.fastq
41_S62_L002_R1_001.fastq          521_S65_L002_R1_001.fastq
421_S22_L002_R1_001.fastq         522_S1_L002_R1_001.fastq
431b_S8_L002_R1_001.fastq         523_S4_L002_R1_001.fastq
432_S75_L002_R1_001.fastq         524_S11_L002_R1_001.fastq
434_S55_L002_R1_001.fastq   

## 1. Larval Data Processing 

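- Larval libraries were all run on the same lane, lane/batch2. There are 77 libraries. 
- Steps to process the data include:
  - quality check with FastQC
  - adapter trimming, deduplicating, and quality filtering
  - aligning reads to the O. lurida genome with Bowtie2
  - generating read counts per gene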

In [45]:
! mkdir ../fastqc/

In [46]:
! mkdir ../fastqc/untrimmed/

In [47]:
fastqc = "/Applications/bioinformatics/FastQC/"

In [48]:
! {fastqc}fastqc --help


            FastQC - A high throughput sequence QC analysis tool

SYNOPSIS

	fastqc seqfile1 seqfile2 .. seqfileN

    fastqc [-o output dir] [--(no)extract] [-f fastq|bam|sam] 
           [-c contaminant file] seqfile1 .. seqfileN

DESCRIPTION

    FastQC reads a set of sequence files and produces from each one a quality
    control report consisting of a number of different modules, each one of 
    which will help to identify a different potential type of problem in your
    data.
    
    If no files to process are specified on the command line then the program
    will start as an interactive graphical application.  If files are provided
    on the command line then the program will run with no user interaction
    required.  In this mode it is suitable for inclusion into a standardised
    analysis pipeline.
    
    The options for the program as as follows:
    
    -h --help       Print this help file and exit
    
    -v --version    Print the vers

In [49]:
# test fastqc on one sample file 
! {fastqc}fastqc \
506_S47_L002_R1_001.fastq \
--outdir ../fastqc/untrimmed/

Started analysis of 506_S47_L002_R1_001.fastq
Approx 5% complete for 506_S47_L002_R1_001.fastq
Approx 10% complete for 506_S47_L002_R1_001.fastq
Approx 15% complete for 506_S47_L002_R1_001.fastq
Approx 20% complete for 506_S47_L002_R1_001.fastq
Approx 25% complete for 506_S47_L002_R1_001.fastq
Approx 30% complete for 506_S47_L002_R1_001.fastq
Approx 35% complete for 506_S47_L002_R1_001.fastq
Approx 40% complete for 506_S47_L002_R1_001.fastq
Approx 45% complete for 506_S47_L002_R1_001.fastq
Approx 50% complete for 506_S47_L002_R1_001.fastq
Approx 55% complete for 506_S47_L002_R1_001.fastq
Approx 60% complete for 506_S47_L002_R1_001.fastq
Approx 65% complete for 506_S47_L002_R1_001.fastq
Approx 70% complete for 506_S47_L002_R1_001.fastq
Approx 75% complete for 506_S47_L002_R1_001.fastq
Approx 80% complete for 506_S47_L002_R1_001.fastq
Approx 85% complete for 506_S47_L002_R1_001.fastq
Approx 90% complete for 506_S47_L002_R1_001.fastq
Approx 95% complete for 506_S47_L002_R1_001.fastq
Analy

In [50]:
! {fastqc}fastqc \
*.fastq \
--outdir ../fastqc/untrimmed/ \
--quiet

In [52]:
# check out resulting fastqc files. 
! ls ../fastqc/untrimmed/

34_S68_L002_R1_001_fastqc.html          481_S57_L002_R1_001_fastqc.html
34_S68_L002_R1_001_fastqc.zip           481_S57_L002_R1_001_fastqc.zip
35_S72_L002_R1_001_fastqc.html          482_S25_L002_R1_001_fastqc.html
35_S72_L002_R1_001_fastqc.zip           482_S25_L002_R1_001_fastqc.zip
37_S70_L002_R1_001_fastqc.html          483_S7_L002_R1_001_fastqc.html
37_S70_L002_R1_001_fastqc.zip           483_S7_L002_R1_001_fastqc.zip
39_S52_L002_R1_001_fastqc.html          484_S43_L002_R1_001_fastqc.html
39_S52_L002_R1_001_fastqc.zip           484_S43_L002_R1_001_fastqc.zip
401_S10_L002_R1_001_fastqc.html         485_S21_L002_R1_001_fastqc.html
401_S10_L002_R1_001_fastqc.zip          485_S21_L002_R1_001_fastqc.zip
402_S5_L002_R1_001_fastqc.html          487_S6_L002_R1_001_fastqc.html
402_S5_L002_R1_001_fastqc.zip           487_S6_L002_R1_001_fastqc.zip
403_S30_L002_R1_001_fastqc.html         488_S26_L002_R1_001_fastqc.html
403_S30_L002_R1_001_fastqc.zip          488_S26_L002_R1_001_f

###  What adapters do I need to trim off? Do I need to remove indices? 

Each entry in a FASTQ file consists of four lines:
 1. Sequence identifier  
 2. Sequence  
 3. Quality score identifier line (consisting only of a +)  
 4. Quality score  
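To make the four-line layout concrete, here is a small Python sketch that reads records four lines at a time. `read_fastq` is a name chosen for illustration, and the example record is a shortened placeholder, not real data.

```python
import io
from itertools import islice


def read_fastq(handle):
    """Yield (identifier, sequence, quality) tuples from a FASTQ file handle."""
    while True:
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if not record:
            return  # end of file
        if len(record) < 4 or not record[0].startswith("@") or not record[2].startswith("+"):
            raise ValueError(f"malformed FASTQ record: {record}")
        identifier, sequence, _, quality = record
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {identifier}")
        yield identifier[1:], sequence, quality


# Shortened placeholder record in the same layout as the real files.
example = io.StringIO("@read1 1:N:0:CACTAA\nACGT\n+\nFFF,\n")
records = list(read_fastq(example))
print(records)
```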
 
First, I will check out a few fastq files - do reads all have the index number included in the sequence, or have those been omitted? 

In [53]:
# this sample (#506) index is CACTAA. 
! head -n 20 506_S47_L002_R1_001.fastq

@A01125:63:HLYCLDRXX:2:2101:5231:1031 1:N:0:CACTAA
CTTTAATACCCTACGTGATTTGAGTTCAGACCGGCGCAAGCCAGGTCGGTTTCTATCTTCTTTTATAATATTCTTTTGTCATGTACGAAAGGACCGTTAA
+
FFFFFFFFFFF,F,,:FFF,F:FFF:FF,FFF:FFFFFFF,F:FFFF:F:F:FFFF,F,FF,FFFFFFFFFFF,FFF:,FFFFFFFFFFFF:FF:FFFFF
@A01125:63:HLYCLDRXX:2:2101:11035:1031 1:N:0:CACTAA
GTGATTCTGGGTTTAACTATAGCAACATTTTTATTAATTTCAATACTGTGGGCATGGGGGGTATATAAAGACCCTTCTAACATTATGTCTGATATTGAAT
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFF:F:FFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFF:F
@A01125:63:HLYCLDRXX:2:2101:13548:1031 1:N:0:CACTAA
CCGATTCATGATGTATTTTTTTTATTAAATTAGGAAATCAAATAAATTTATTGCCAAAATCAAAAAAAAAAAAAAAAAAAAAGATCGGAAGAGCACACGT
+
FFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFF,FFFFFFFFFFFF,:FFFF:,F::,FF:,F,FF:F:FFF
@A01125:63:HLYCLDRXX:2:2101:14525:1031 1:N:0:CACTAA
CGGTTTCTATCTTCTTTTATAATATTCTTTTGGCATGTACGAAAGGACCGTTAAAAGAGGAAGTTTCCTTTTAAAAGAAATAGATTTAAATTGGAGTAAG
+
F:FFF:FFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFF

In [54]:
# this sample (#414) index is CGCCTG 
! head -n 20 414_S49_L002_R1_001.fastq 

@A01125:63:HLYCLDRXX:2:2101:1307:1031 1:N:0:CGCCTG
TGGGAGTGGATGTAATGGTACAATACTTGGCTTTTGAAAGTGCACTAAATGCATTGGGAATAGGTACATCAATAAAAATTCTGCACAGTCAAAAAAAAAA
+
:FFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFF
@A01125:63:HLYCLDRXX:2:2101:1380:1031 1:N:0:CGCCTG
CAAGCCGTCCAAAAAAAAAAAAAAAAAATATCGGAAGAGCACACGTCTGAACACCAGTCACCGCCGGATCGCGTTTGCCGTCTTCTGCTTGAAAAGGGGG
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFF,,::F:FF,F,,F,:::,,,:FF,,FF:,:,,FF,:F,F:,,,,F,,FF:,F,,,:,::,:F,,FF,,:,FF
@A01125:63:HLYCLDRXX:2:2101:4001:1031 1:N:0:CGCCTG
GTCAGGTCGGTTTCTATCTTCTTTTATAATATTCTTTTGGCATGTACGAAAGGACCGTTAAAAGAGGAAGTTTCCTTTTAAAAGAAATAGATTTAAATTG
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:F::FFFFF
@A01125:63:HLYCLDRXX:2:2101:9353:1031 1:N:0:CGCCTG
CCCAAAGCTGCTGACTATGAGATAAAACAAATTGAAAGAGCACATGAAACTAGAATCAAAAAAAAAAAAAAAAAAAGAAGGGGAGGACACACCGGGGGAA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFF

In [55]:
# this sample (#39) index is ATATCC
! head -n 20 39_S52_L002_R1_001.fastq

@A01125:63:HLYCLDRXX:2:2101:2266:1031 1:N:0:ATATCC
GTTGAAGGTGAACTGTAAGGTTTGCTGGAGGTATCAGAAGTGCGAATGCTGACATGAGTAACGATAATGGGGGTGAAAAACCCCCACGCCGGAAGAGCAC
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFF
@A01125:63:HLYCLDRXX:2:2101:2573:1031 1:N:0:ATATCC
CCACTGCACGCAGAAGAGGAGTTGCTGAACTGTCGTTGCGTCTTCGACTTGAAAACAGAAAAAAAAAAAAAAAAAAAGATCGGAAGAGCACACGGCTGAA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,,F:FFFFFF,FFFF:FF,F:FFF
@A01125:63:HLYCLDRXX:2:2101:3152:1031 1:N:0:ATATAC
AATAAAAAAAAAAAAAAAAAAAAAAACCCAAAAGCACACCTAAAAAAAACAAAAACAAGTACCACTGTTAAAAACTCTAGACCATGAGATACAGTGGAAT
+
,,F,FF:F,,FF,FFFFF:FFF,FFF,,,,FF,,,F,,F,,F:,F:,,,FFF:F,,,F,:,:,,,:,,:,F,FF,F,,F,,:F:FF,,,,:,,F,FF,,,
@A01125:63:HLYCLDRXX:2:2101:3314:1031 1:N:0:ATATCC
GAACAAAACAGCTATGTAAATATATACTAAAATAATAATCAAAATAAATATACTAAACGAAAATTAAAACATGAAGTTTCATAAAAAAATAAAGAAAATT
+
F,,F:,:,F:,FFFFF,,FFFF:FF,FFF,FF:,::F:,F:FFFFF,,:,F,FF,:,FFF,FFF,,

Looks like all index sequences are included in the Sequence Identifier line for each read (line #1), but not in the actual sequences. Good - now I need to figure out which Illumina adapter sequences I need to trim. 
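A quick way to check this across a whole file, rather than the first 20 lines, is to tally the index field of every identifier line. The sketch below assumes the index is the text after the last `:` in the identifier, as in the output above. The function names are hypothetical, and the sequence and quality lines in the demo are shortened placeholders.

```python
from collections import Counter


def index_from_header(header):
    """Return the text after the last ':' of an Illumina identifier line."""
    return header.strip().rsplit(":", 1)[-1]


def tally_indices(fastq_lines):
    """Count index sequences found on the identifier line of each 4-line record."""
    counts = Counter()
    for line_number, line in enumerate(fastq_lines):
        if line_number % 4 == 0:  # identifier lines are lines 0, 4, 8, ...
            counts[index_from_header(line)] += 1
    return counts


# Identifier lines copied from the sample #39 output above;
# the sequence and quality lines are shortened placeholders.
example_lines = [
    "@A01125:63:HLYCLDRXX:2:2101:2266:1031 1:N:0:ATATCC", "GTTG", "+", "FFFF",
    "@A01125:63:HLYCLDRXX:2:2101:2573:1031 1:N:0:ATATCC", "CCAC", "+", "FFFF",
    "@A01125:63:HLYCLDRXX:2:2101:3152:1031 1:N:0:ATATAC", "AATA", "+", ",,F,",
    "@A01125:63:HLYCLDRXX:2:2101:3314:1031 1:N:0:ATATCC", "GAAC", "+", "F,,F",
]
index_counts = tally_indices(example_lines)
print(index_counts)
```

In that excerpt one read carries ATATAC rather than ATATCC, which may mean that demultiplexing tolerated a one-base mismatch. I have not confirmed this.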

### Adaptor trimming, deduplicating, and quality filtering:
Creating and launching the cleaning process for all files at the same time:

In [29]:
! ls {tagseq}

2bRAD_bowtie2_launch.pl            picogreen.csv
README.md                          rnaseq_clipper0.pl
TagSeq_GSAF_Price.xlsx             rnaseq_clipper_fasta.pl
TagSeq_sample_prep_june2019.docx   rnaseq_clipper_old.pl
countreads.pl                      samcount.pl
dna.mixing.R                       samcount_launch.pl
dupCount.R                         samcount_launch_bt2.pl
expression_compiler.pl             samcount_v.0.1.pl
iRNAseq_shrimpmap_SAM.pl           selectFastaByHeader.pl
iRNAseq_trim_launch0.pl            splitFastaByHeader.pl
illumina_mix_data.csv              tag-seq_scripts_manual.pdf
isogroup_namer.pl                  tagSeq_processing_README.txt
launcher_creator.py                tagseq_bowtie2map.pl
ls5_launcher_creator.py            tagseq_clipper.pl

In [31]:
# run perl script that creates list of executables for each fastq file 
! {tagseq}tagseq_trim_launch.pl '\.fastq$' > clean

In [32]:
! head -2 clean

tagseq_clipper.pl 34_S68_L002_R1_001.fastq | fastx_clipper -a AAAAAAAA -l 20 -Q33 | fastx_clipper -a AGATCGGAAG -l 20 -Q33 | fastq_quality_filter -Q33 -q 20 -p 90 >34_S68_L002_R1_001.fastq.trim
tagseq_clipper.pl 35_S72_L002_R1_001.fastq | fastx_clipper -a AAAAAAAA -l 20 -Q33 | fastx_clipper -a AGATCGGAAG -l 20 -Q33 | fastq_quality_filter -Q33 -q 20 -p 90 >35_S72_L002_R1_001.fastq.trim


In [64]:
! cat {tagseq}tagseq_clipper.pl

#!/usr/bin/perl

$usage= "

tagseq_clipper.pl  : 

Clips 5'-leader off Illumina fastq reads in RNA-seq

Removes duplicated reads sharing the same degenerate header and 
the first 20 bases of the sequence (reads containing N bases in this
region are discarded, too)

prints to STDOUT

arguments:
1 : fastq file name
2 : string to define the leading sequence, default '[ATGC]?[ATGC][AC][AT]GGG+|[ATGC]?[ATGC]TGC[AC][AT]GGG+|[ATGC]?[ATGC]GC[AT]TC[ACT][AC][AT]GGG+'

Example:
tagseq_clipper.pl D6.fastq

					 
";

my $fq=shift or die $usage;
my $lead="";
if ($ARGV[0]) { $lead=$ARGV[0];}
else { $lead="[ATGC]?[ATGC][AC][AT][AT][AC][AT][ACT]GGG+|[ATGC]?[ATGC][AC][AT]GGG+|[ATGC]?[ATGC]TGC[AC][AT]GGG+|[ATGC]?[ATGC]GC[AT]TC[ACT][AC][AT]GGG+";}
my $trim=0;
my $name="";
my $name2="";
my $seq="";
my $qua="";
my %seen={};
open INP, $fq or die "cannot open file $fq\n";
my $ll=3;
my $nohead=0;
my $dups=0;
my $ntag=0;
my $tot=0;
my $goods=0;
while (<INP>) {
	if ($ll

In [57]:
# check that fastx_clipper (FASTX Toolkit) is on my PATH 
! fastx_clipper -h

usage: fastx_clipper [-h] [-a ADAPTER] [-D] [-l N] [-n] [-d N] [-c] [-C] [-o] [-v] [-z] [-i INFILE] [-o OUTFILE]
Part of FASTX Toolkit 0.0.14 by A. Gordon (assafgordon@gmail.com)

   [-h]         = This helpful help screen.
   [-a ADAPTER] = ADAPTER string. default is CCTTAAGG (dummy adapter).
   [-l N]       = discard sequences shorter than N nucleotides. default is 5.
   [-d N]       = Keep the adapter and N bases after it.
                  (using '-d 0' is the same as not using '-d' at all. which is the default).
   [-c]         = Discard non-clipped sequences (i.e. - keep only sequences which contained the adapter).
   [-C]         = Discard clipped sequences (i.e. - keep only sequences which did not contained the adapter).
   [-k]         = Report Adapter-Only sequences.
   [-n]         = keep sequences with unknown (N) nucleotides. default is to discard such sequences.
   [-v]         = Verbose - report number of sequences.
                  If [-o] is specified,  r

In [60]:
! fastq_quality_filter -h

usage: fastq_quality_filter [-h] [-v] [-q N] [-p N] [-z] [-i INFILE] [-o OUTFILE]
Part of FASTX Toolkit 0.0.14 by A. Gordon (assafgordon@gmail.com)

   [-h]         = This helpful help screen.
   [-q N]       = Minimum quality score to keep.
   [-p N]       = Minimum percent of bases that must have [-q] quality.
   [-z]         = Compress output with GZIP.
   [-i INFILE]  = FASTA/Q input file. default is STDIN.
   [-o OUTFILE] = FASTA/Q output file. default is STDOUT.
   [-v]         = Verbose - report number of sequences.
                  If [-o] is specified,  report will be printed to STDOUT.
                  If [-o] is not specified (and output goes to STDOUT),
                  report will be printed to STDERR.



In [59]:
! mkdir ../tagseq_trim/

### Based on repeated functions in the clean file, I wrote a script to loop through the files and execute all commands 

- `tagseq_clipper.pl` clips 5'-leader off Illumina fastq reads in RNA-seq  
- `fastx_clipper` is part of the [fastx_toolkit](http://hannonlab.cshl.edu/fastx_toolkit/commandline.html#fastx_clipper_usage), and removes sequencing adapters / linkers. Options include: 
  - `-a AAAAAAAA` _remove the adapter string AAAAAAAA_  
  - `-a AGATCGGAAG` _remove the adapter string AGATCGGAAG_  
  - `-l 20` _discard sequences shorter than 20bp_  
- `fastq_quality_filter` filters out reads based on their quality score. 
  - `-q 20` _minimum quality score to keep_  
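  - `-p 90` _minimum percentage of bases that must have the minimum quality_  
- `-Q33` _tells the FASTX tools that quality scores are Phred+33 encoded (Sanger / Illumina 1.8+); this option does not appear in the `-h` output above_ 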
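To show what `-Q33 -q 20 -p 90` means for a single read, here is a hedged Python sketch of the filtering rule. It is my own re-statement of the documented behaviour, not code from the FASTX Toolkit. In particular, treating exactly 90% as a pass is an assumption.

```python
def phred33_scores(quality_string):
    """Decode a Phred+33 quality string into integer scores."""
    return [ord(char) - 33 for char in quality_string]


def passes_quality_filter(quality_string, min_quality=20, min_percent=90):
    """Keep a read if at least min_percent of its bases score >= min_quality.

    Assumption: a read sitting exactly on the percentage threshold is kept.
    """
    scores = phred33_scores(quality_string)
    if not scores:
        return False
    good = sum(score >= min_quality for score in scores)
    return 100.0 * good / len(scores) >= min_percent


# 'F' decodes to 37, ':' to 25 and ',' to 11, so only ',' falls below 20.
print(phred33_scores("F:,"))
print(passes_quality_filter("FFFFFFFFF,"))   # 9 of 10 bases pass
print(passes_quality_filter("FFFFFFFF,,"))   # 8 of 10 bases pass
```

Reads with many `,` characters in the quality line, such as some of those shown earlier, are the ones this filter is expected to drop.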

### Run clean script on all .fastq files in the Batch2 folder 

The `clean` file repeats the same pipeline for every sample, so I wrote a loop that runs those commands on each file. I could not access my {tagseq} variable inside the `%%bash` cell (probably because the cell body is passed to bash as-is, without the variable expansion that `!` commands get), so I hard-coded the path to the tagseq script. 

In [None]:
%%bash 

for file in *.fastq
do
# strip the directory structure from each file name, then
# add the suffix .trim to create the output name for each file
results_file="$(basename "$file").trim"

# run tagseq scripts on each file
/Applications/bioinformatics/tag-based_RNAseq/tagseq_clipper.pl "$file" | \
fastx_clipper -a AAAAAAAA -l 20 -Q33 | \
fastx_clipper -a AGATCGGAAG -l 20 -Q33 | \
fastq_quality_filter -Q33 -q 20 -p 90 > \
"../tagseq_trim/$results_file"
done
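Before launching a long loop like this, it can help to print the commands first, much like the `clean` file does. The sketch below only builds the command strings and runs nothing. `build_clean_command` is a hypothetical helper, and the clipper path is the hard-coded one from the loop above.

```python
import shlex
from pathlib import Path

# Hard-coded path copied from the loop above.
TAGSEQ_CLIPPER = "/Applications/bioinformatics/tag-based_RNAseq/tagseq_clipper.pl"


def build_clean_command(fastq_path, out_dir="../tagseq_trim"):
    """Return the shell pipeline string for one FASTQ file (nothing is executed)."""
    fastq_path = Path(fastq_path)
    out_file = Path(out_dir) / (fastq_path.name + ".trim")
    return (
        f"{TAGSEQ_CLIPPER} {shlex.quote(str(fastq_path))} | "
        "fastx_clipper -a AAAAAAAA -l 20 -Q33 | "
        "fastx_clipper -a AGATCGGAAG -l 20 -Q33 | "
        f"fastq_quality_filter -Q33 -q 20 -p 90 > {shlex.quote(str(out_file))}"
    )


# Dry run: print the command for one file name from the listing above.
command = build_clean_command("34_S68_L002_R1_001.fastq")
print(command)
```

Quoting the file names with `shlex.quote` guards against spaces in paths, which matters here because the data sit on a volume named "Peach Backup".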

### Download O. lurida genome

In [88]:
! curl http://owl.fish.washington.edu/halfshell/genomic-databank/Olurida_v081.fa > ../references/Olurida_v081.fa

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1090M  100 1090M    0     0  1960k      0  0:09:29  0:09:29 --:--:-- 2198k


In [89]:
# MD5 should = 3ac56372bd62038f264d27eef0883bd3
! md5 ../references/Olurida_v081.fa

MD5 (../references/Olurida_v081.fa) = 3ac56372bd62038f264d27eef0883bd3


In [30]:
! mkdir bowtie/

In [34]:
cd bowtie/

/Users/laura/Documents/roberts-lab/laura-quantseq/results/bowtie


In [37]:
%%bash
### creating bowtie2 index for Oly genome v081:

bowtie2-build \
../../references/Olurida_v081.fa \
Olurida_v081.fa

Settings:
  Output files: "Olurida_v081.fa.*.bt2"
  Line rate: 6 (line is 64 bytes)
  Lines per side: 1 (side is 64 bytes)
  Offset rate: 4 (one in 16)
  FTable chars: 10
  Strings: unpacked
  Max bucket size: default
  Max bucket size, sqrt multiplier: default
  Max bucket size, len divisor: 4
  Difference-cover sample period: 1024
  Endianness: little
  Actual local endianness: little
  Sanity checking: disabled
  Assertions: disabled
  Random seed: 0
  Sizeofs: void*:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
  ../../references/Olurida_v081.fa
Reading reference sizes
  Time reading reference sizes: 00:00:12
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
  Time to join reference sequences: 00:00:07
bmax according to bmaxDivN setting: 269313774
Using parameters --bmax 201985331 --dcv 1024
  Doing ahead-of-time memory usage test
  Passed!  Constructing with these parameters: --bmax 201985331 --dcv 1024
Constructing suffix-

Building a SMALL index


In [39]:
! ls

Olurida_v081.fa.1.bt2     Olurida_v081.fa.3.bt2     Olurida_v081.fa.rev.1.bt2
Olurida_v081.fa.2.bt2     Olurida_v081.fa.4.bt2     Olurida_v081.fa.rev.2.bt2


In [41]:
cd ../../data/tagseq_trim/

/Users/laura/Documents/roberts-lab/laura-quantseq/data/tagseq_trim


In [104]:
! {tagseq}tagseq_bowtie2map.pl "trim$" ../../results/bowtie/Olurida_v081.fa > maps

In [105]:
! head -2 maps

bowtie2 --local -x ../../results/bowtie/Olurida_v081.fa -U CP-KS-LibL-L37-L_S32_L006_R1_001.fastq.trim -S CP-KS-LibL-L37-L_S32_L006_R1_001.fastq.trim.sam --no-hd --no-sq --no-unal -k 5
bowtie2 --local -x ../../results/bowtie/Olurida_v081.fa -U CP-KS-LibL-L3-G_S64_L006_R1_001.fastq.trim -S CP-KS-LibL-L3-G_S64_L006_R1_001.fastq.trim.sam --no-hd --no-sq --no-unal -k 5


###  Use a loop to execute the commands written in 'maps'; redirect screen output/error in bowtieout.txt

In [106]:
cd ../../data/tagseq_trim/

/Users/laura/Documents/roberts-lab/laura-quantseq/data/tagseq_trim


--no-unal  = suppress SAM records for unaligned reads  
--no-sq    = suppress @SQ header lines  
--no-hd    = suppress SAM header lines (starting with @)  
-k 5       = report up to 5 alignments per read  

In [110]:
%%bash 

for file in *.trim
do
# strip the directory structure from each file name, then
# add the suffix .sam to create the output name for each file
results_file="$(basename "$file").sam"

# align each trimmed file to the O. lurida bowtie2 index
bowtie2 --local -x \
../../results/bowtie/Olurida_v081.fa \
-U "$file" \
-S "$results_file" \
-k 5
done >> ../../results/bowtie/bowtieout.txt 2>&1

In [111]:
! cat ../../results/bowtie/bowtieout.txt

21874 reads; of these:
  21874 (100.00%) were unpaired; of these:
    6373 (29.14%) aligned 0 times
    11759 (53.76%) aligned exactly 1 time
    3742 (17.11%) aligned >1 times
70.86% overall alignment rate
21768 reads; of these:
  21768 (100.00%) were unpaired; of these:
    6229 (28.62%) aligned 0 times
    11759 (54.02%) aligned exactly 1 time
    3780 (17.36%) aligned >1 times
71.38% overall alignment rate
2649 reads; of these:
  2649 (100.00%) were unpaired; of these:
    2067 (78.03%) aligned 0 times
    352 (13.29%) aligned exactly 1 time
    230 (8.68%) aligned >1 times
21.97% overall alignment rate
21489 reads; of these:
  21489 (100.00%) were unpaired; of these:
    6142 (28.58%) aligned 0 times
    11731 (54.59%) aligned exactly 1 time
    3616 (16.83%) aligned >1 times
71.42% overall alignment rate
23903 reads; of these:
  23903 (100.00%) were unpaired; of these:
    6156 (25.75%) aligned 0 times
    13075 (54.70%) aligned exactly 1 time
    4672

In [112]:
# alignment rates:
! grep "overall alignment rate"  ../../results/bowtie/bowtieout.txt

70.86% overall alignment rate
71.38% overall alignment rate
21.97% overall alignment rate
71.42% overall alignment rate
74.25% overall alignment rate
74.05% overall alignment rate
69.84% overall alignment rate
71.05% overall alignment rate
69.79% overall alignment rate
65.44% overall alignment rate
73.34% overall alignment rate
41.75% overall alignment rate
73.35% overall alignment rate
73.44% overall alignment rate
63.80% overall alignment rate
73.59% overall alignment rate
57.34% overall alignment rate
71.77% overall alignment rate
74.38% overall alignment rate
74.85% overall alignment rate
62.80% overall alignment rate
68.49% overall alignment rate
73.56% overall alignment rate
71.53% overall alignment rate
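A few of these rates are much lower than the rest, so it may be worth flagging them automatically. This is a minimal Python sketch, assuming the log text looks like `bowtieout.txt` above. The 50% threshold and the function names are arbitrary choices for illustration.

```python
import re

RATE_PATTERN = re.compile(r"^(\d+(?:\.\d+)?)% overall alignment rate", re.MULTILINE)


def overall_alignment_rates(log_text):
    """Return overall alignment rates (percent, as floats) in order of appearance."""
    return [float(match) for match in RATE_PATTERN.findall(log_text)]


def low_alignment_runs(rates, threshold=50.0):
    """Return (position, rate) pairs for runs whose rate is below the threshold."""
    return [(position, rate) for position, rate in enumerate(rates) if rate < threshold]


# The first four rates from the grep output above.
example_log = """70.86% overall alignment rate
71.38% overall alignment rate
21.97% overall alignment rate
71.42% overall alignment rate
"""
rates = overall_alignment_rates(example_log)
flagged = low_alignment_runs(rates)
print(rates)
print(flagged)
```

The bowtie2 summary does not include file names, so a position only maps back to a sample if the files were processed in a known order, for example the sorted order of `*.trim`.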


## Generating read counts per gene 

In [113]:
ls

CP-KS-LibL-L10-G_S59_L006_R1_001.fastq.trim
CP-KS-LibL-L10-G_S59_L006_R1_001.fastq.trim.sam
CP-KS-LibL-L11-G_S65_L006_R1_001.fastq.trim
CP-KS-LibL-L11-G_S65_L006_R1_001.fastq.trim.sam
CP-KS-LibL-L13-G_S60_L006_R1_001.fastq.trim
CP-KS-LibL-L13-G_S60_L006_R1_001.fastq.trim.sam
CP-KS-LibL-L14-G_S63_L006_R1_001.fastq.trim
CP-KS-LibL-L14-G_S63_L006_R1_001.fastq.trim.sam
CP-KS-LibL-L15-G_S68_L006_R1_001.fastq.trim
CP-KS-LibL-L15-G_S68_L006_R1_001.fastq.trim.sam
CP-KS-LibL-L19-G_S67_L006_R1_001.fastq.trim
CP-KS-LibL-L19-G_S67_L006_R1_001.fastq.trim.sam
CP-KS-LibL-L2-G_S34_L006_R1_001.fastq.trim
CP-KS-LibL-L2-G_S34_L006_R1_001.fastq.trim.sam
CP-KS-LibL-L21-G_S58_L006_R1_001.fastq.trim
CP-KS-LibL-L21-G_S58_L006_R1_001.fastq.trim.sam
CP-KS-LibL-L23-G_S61_L006_R1_001.fastq.trim
CP-KS-LibL-L23-G_S61_L006_R1_001.fastq.trim.sam
CP-KS-LibL-L24-G_S62_L006_R1_001.fastq.trim
CP-KS-LibL-L24-G_S62_L006_R1_001.fastq.trim.sam
CP-KS-LibL-L29-G_S36_L006_R1_001.fastq.trim
CP-KS-LibL-L29-G_

In [82]:
! grep ">" ../../references/Olurida_v081.fa | head

>Contig0
>Contig1
>Contig2
>Contig3
>Contig4
>Contig5
>Contig6
>Contig7
>Contig8
>Contig9


In [63]:
! samtools view --help

samtools view: unrecognised option '--help'

Usage: samtools view [options] <in.bam>|<in.sam>|<in.cram> [region ...]

Options:
  -b       output BAM
  -C       output CRAM (requires -T)
  -1       use fast BAM compression (implies -b)
  -u       uncompressed BAM output (implies -b)
  -h       include header in SAM output
  -H       print SAM header only (no alignments)
  -c       print only the count of matching records
  -o FILE  output file name [stdout]
  -U FILE  output reads not selected by filters to FILE [null]
  -t FILE  FILE listing reference names and lengths (see long help) [null]
  -L FILE  only include reads overlapping this BED FILE [null]
  -r STR   only include reads in read group STR [null]
  -R FILE  only include reads with read group listed in FILE [null]
  -q INT   only include reads with mapping quality >= INT [0]
  -l STR   only include reads in library STR [null]
  -m INT   only include reads with number of CIGAR operations consuming
        

In [114]:
%%bash
# convert sam to bam, then sort

for file in *.sam
do
results_file="$(basename "$file")_sorted.bam"
samtools view -b "$file" | samtools sort -o "$results_file"
done

In [115]:
%%bash
# create .bam indexes
for file in *.bam
do
samtools index $file
done

In [118]:
ls *.bam

CP-KS-LibL-L10-G_S59_L006_R1_001.fastq.trim.sam_sorted.bam
CP-KS-LibL-L11-G_S65_L006_R1_001.fastq.trim.sam_sorted.bam
CP-KS-LibL-L13-G_S60_L006_R1_001.fastq.trim.sam_sorted.bam
CP-KS-LibL-L14-G_S63_L006_R1_001.fastq.trim.sam_sorted.bam
CP-KS-LibL-L15-G_S68_L006_R1_001.fastq.trim.sam_sorted.bam
CP-KS-LibL-L19-G_S67_L006_R1_001.fastq.trim.sam_sorted.bam
CP-KS-LibL-L2-G_S34_L006_R1_001.fastq.trim.sam_sorted.bam
CP-KS-LibL-L21-G_S58_L006_R1_001.fastq.trim.sam_sorted.bam
CP-KS-LibL-L23-G_S61_L006_R1_001.fastq.trim.sam_sorted.bam
CP-KS-LibL-L24-G_S62_L006_R1_001.fastq.trim.sam_sorted.bam
CP-KS-LibL-L29-G_S36_L006_R1_001.fastq.trim.sam_sorted.bam
CP-KS-LibL-L3-G_S64_L006_R1_001.fastq.trim.sam_sorted.bam
CP-KS-LibL-L32-G_S35_L006_R1_001.fastq.trim.sam_sorted.bam
CP-KS-LibL-L33-G_S66_L006_R1_001.fastq.trim.sam_sorted.bam
CP-KS-LibL-L34-L_S25_L006_R1_001.fastq.trim.sam_sorted.bam
CP-KS-LibL-L35-L_S31_L006_R1_001.fastq.trim.sam_sorted.bam
CP-KS-LibL-L37-L_S32_L006_R1_001.fastq.tri

In [121]:
# reads with neither the unmapped (0x4) nor the proper-pair (0x2) flag set 
! samtools view -F 6 \
CP-KS-LibL-L7-G_S57_L006_R1_001.fastq.trim.sam_sorted.bam | \
wc -l

   27324


In [122]:
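# -F 4 excludes unmapped reads, so this is the mapped read count (-f 4 would give the unmapped count)
# same as above because single-end reads never carry the proper-pair flag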
! samtools view -F 4 \
CP-KS-LibL-L7-G_S57_L006_R1_001.fastq.trim.sam_sorted.bam | \
wc -l

   27324
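The two counts agree because of how the flag filters work: `-F` drops a record if any of the listed bits is set, while `-f` keeps a record only if all of them are set. Here is a small Python sketch of that logic. The helper names are made up, and the example flags are typical single-end values, not values read from my BAM files.

```python
UNMAPPED = 0x4      # read is unmapped
PROPER_PAIR = 0x2   # read is mapped in a proper pair (paired-end data only)


def kept_by_big_f(flag, mask):
    """Mimic `samtools view -F mask`: keep the record if none of the mask bits are set."""
    return (flag & mask) == 0


def kept_by_small_f(flag, mask):
    """Mimic `samtools view -f mask`: keep the record if all of the mask bits are set."""
    return (flag & mask) == mask


# Typical single-end flags: forward, reverse strand, unmapped, secondary alignment.
example_flags = [0, 16, 4, 256]

mapped_f4 = [flag for flag in example_flags if kept_by_big_f(flag, UNMAPPED)]
mapped_f6 = [flag for flag in example_flags if kept_by_big_f(flag, UNMAPPED | PROPER_PAIR)]
unmapped = [flag for flag in example_flags if kept_by_small_f(flag, UNMAPPED)]
print(mapped_f4, mapped_f6, unmapped)
```

With single-end data the 0x2 bit is never set, so `-F 6` and `-F 4` select the same records. The counts also include every reported alignment, so with `-k 5` a read can be counted more than once.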
