### Notebook to download RNA-Seq data, unzip, and trim two different ways (`sickle`, `seqtk`)

The raw RNA-Seq data files are described here, with files stored in Owl (link towards bottom of page):
https://github.com/RobertsLab/project-olympia.oyster-genomic/wiki/RNA-Seq-February-2018

The pooled gonad samples from Fidalgo Bay and Oyster Bay adults from low/ambient pH are in the four files listed below: the 1st and 2nd reads (R1, R2) from 2 lanes of sequencing. The other files on Owl are from Katherine's CA, OR, and British Columbia populations.  
  - CP-4Spl_S11_L004_R1_0343.fastq.gz  
  - CP-4Spl_S11_L004_R2_0343.fastq.gz  
  - CP-4Spl_S11_L004_R1_0348.fastq.gz 
  - CP-4Spl_S11_L004_R2_0348.fastq.gz 

In [1]:
! pwd

/Users/studentuser/Desktop/Laura/laura-quantseq/notebooks


In [12]:
cd ../data

/Users/studentuser/Desktop/Laura/laura-quantseq/data


In [17]:
! pwd

/Users/studentuser/Desktop/Laura/laura-quantseq/data


In [18]:
! curl -O -O -O -O \
    http://owl.fish.washington.edu/nightingales/O_lurida/CP-4Spl_S11_L004_R1_0343.fastq.gz \
    http://owl.fish.washington.edu/nightingales/O_lurida/CP-4Spl_S11_L004_R1_0348.fastq.gz \
    http://owl.fish.washington.edu/nightingales/O_lurida/CP-4Spl_S11_L004_R2_0343.fastq.gz \
    http://owl.fish.washington.edu/nightingales/O_lurida/CP-4Spl_S11_L004_R2_0348.fastq.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1672M  100 1672M    0     0  11.0M      0  0:02:31  0:02:31 --:--:-- 16.4M
100 1859M  100 1859M    0     0  12.3M      0  0:02:30  0:02:30 --:--:-- 13.0M
100 1821M  100 1821M    0     0  12.4M      0  0:02:26  0:02:26 --:--:-- 15.1M
100 1970M  100 1970M    0     0  14.2M      0  0:02:18  0:02:18 --:--

In [19]:
ls

CP-4Spl_S11_L004_R1_0343.fastq.gz  CP-4Spl_S11_L004_R2_0348.fastq.gz
CP-4Spl_S11_L004_R1_0348.fastq.gz  README.md
CP-4Spl_S11_L004_R2_0343.fastq.gz  test-data/


Compare the md5 of each downloaded file to the md5 strings listed on Owl here: [checksums.md5](http://owl.fish.washington.edu/nightingales/O_lurida/checksums.md5)

a0cda314b2c11bcc0d99b5e8bf67e2f0  CP-4Spl_S11_L004_R1_0343.fastq.gz
307cdc0f096669e71a9032f9ca602b35  CP-4Spl_S11_L004_R1_0348.fastq.gz
3186f8691eef8f810cc6799fe25cf9e7  CP-4Spl_S11_L004_R2_0343.fastq.gz
b65e6146a66e03c47bb1fbc445034ed8  CP-4Spl_S11_L004_R2_0348.fastq.gz

If a file's checksum matches, `grep` prints the line `MD5 (file) = (md5 string provided)`; if it does not match, nothing is printed.

In [20]:
! md5 CP-4Spl_S11_L004_R1_0343.fastq.gz | grep -i a0cda314b2c11bcc0d99b5e8bf67e2f0

MD5 (CP-4Spl_S11_L004_R1_0343.fastq.gz) = a0cda314b2c11bcc0d99b5e8bf67e2f0


In [23]:
! md5 CP-4Spl_S11_L004_R1_0348.fastq.gz | grep -i 307cdc0f096669e71a9032f9ca602b35

MD5 (CP-4Spl_S11_L004_R1_0348.fastq.gz) = 307cdc0f096669e71a9032f9ca602b35


In [25]:
! md5 CP-4Spl_S11_L004_R2_0343.fastq.gz | grep -i 3186f8691eef8f810cc6799fe25cf9e7

MD5 (CP-4Spl_S11_L004_R2_0343.fastq.gz) = 3186f8691eef8f810cc6799fe25cf9e7


In [26]:
! md5 CP-4Spl_S11_L004_R2_0348.fastq.gz | grep -i b65e6146a66e03c47bb1fbc445034ed8

MD5 (CP-4Spl_S11_L004_R2_0348.fastq.gz) = b65e6146a66e03c47bb1fbc445034ed8
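
The four checks above could also be done in one pass from Python. This is only a sketch: the helper names `md5_of_file` and `verify_md5` are made up for illustration, and the expected digests are copied from the listing above.

```python
import hashlib
from pathlib import Path

# Expected digests, copied from the checksums.md5 listing above.
EXPECTED_MD5 = {
    "CP-4Spl_S11_L004_R1_0343.fastq.gz": "a0cda314b2c11bcc0d99b5e8bf67e2f0",
    "CP-4Spl_S11_L004_R1_0348.fastq.gz": "307cdc0f096669e71a9032f9ca602b35",
    "CP-4Spl_S11_L004_R2_0343.fastq.gz": "3186f8691eef8f810cc6799fe25cf9e7",
    "CP-4Spl_S11_L004_R2_0348.fastq.gz": "b65e6146a66e03c47bb1fbc445034ed8",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(expected, directory="."):
    """Map each file name to True (match), False (mismatch) or None (file missing)."""
    results = {}
    for name, want in expected.items():
        path = Path(directory) / name
        if path.is_file():
            results[name] = (md5_of_file(path) == want.lower())
        else:
            results[name] = None
    return results


if __name__ == "__main__":
    labels = {True: "OK", False: "MISMATCH", None: "missing"}
    for name, ok in verify_md5(EXPECTED_MD5).items():
        print(name, labels[ok])
```

Unlike the `grep` approach, a mismatch is reported explicitly instead of showing up as empty output.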


Unzip all the fastq files at once, using the wildcard `*` followed by the file extension:

In [28]:
! gunzip  *.fastq.gz

In [29]:
ls

CP-4Spl_S11_L004_R1_0343.fastq  CP-4Spl_S11_L004_R2_0348.fastq
CP-4Spl_S11_L004_R1_0348.fastq  README.md
CP-4Spl_S11_L004_R2_0343.fastq  test-data/


Check whether the two reads (R1 and R2) from the same lane represent paired-end data. If so, the two files should have the same number of lines:

In [30]:
! wc -l CP-4Spl_S11_L004_R1_0343.fastq

 118488660 CP-4Spl_S11_L004_R1_0343.fastq


In [31]:
! wc -l CP-4Spl_S11_L004_R2_0343.fastq

 118488660 CP-4Spl_S11_L004_R2_0343.fastq


In [32]:
! wc -l CP-4Spl_S11_L004_R1_0348.fastq

 131217744 CP-4Spl_S11_L004_R1_0348.fastq


In [33]:
! wc -l CP-4Spl_S11_L004_R2_0348.fastq

 131217744 CP-4Spl_S11_L004_R2_0348.fastq


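Yes, they do. Files ending with 0343 have 118488660 lines (29622165 reads, at 4 lines per fastq record), and files ending in 0348 have 131217744 lines (32804436 reads). Matching counts are consistent with paired-end data.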
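
Equal line counts are necessary for paired-end files but do not prove the reads are in the same order. A stricter check compares the read IDs record by record. The sketch below assumes the header layout seen in these files (ID, then a space, then `1:N:0:...` or `2:N:0:...`), so only the part before the first space is compared. The function names are hypothetical.

```python
from itertools import zip_longest

# Line counts reported by `wc -l` above; a fastq record is 4 lines.
LINE_COUNTS = {"0343": 118488660, "0348": 131217744}
READ_COUNTS = {run: lines // 4 for run, lines in LINE_COUNTS.items()}


def fastq_ids(path):
    """Yield the read ID (text before the first whitespace) of every record."""
    with open(path) as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 0:  # header lines are every 4th line
                yield (line.split() or [""])[0]


def first_mismatch(r1_path, r2_path):
    """Return the 0-based index of the first record whose IDs differ, or None."""
    pairs = zip_longest(fastq_ids(r1_path), fastq_ids(r2_path))
    for index, (id1, id2) in enumerate(pairs):
        if id1 != id2:
            return index
    return None
```

For example, `first_mismatch("CP-4Spl_S11_L004_R1_0343.fastq", "CP-4Spl_S11_L004_R2_0343.fastq")` returning `None` would mean every R1 record has its mate at the same position in R2.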

Use the `sickle` program in single-end mode (`se`) to quality-trim each fastq file, with the quality type set to `sanger`.

In [34]:
# Read 1 from lane 0343
! sickle se -f CP-4Spl_S11_L004_R1_0343.fastq -t sanger -o CP-4Spl_S11_L004_R1_0343_sickle.fastq


FastQ records kept: 29616361
FastQ records discarded: 5804



In [35]:
# Read 2 from lane 0343
! sickle se -f CP-4Spl_S11_L004_R2_0343.fastq -t sanger -o CP-4Spl_S11_L004_R2_0343_sickle.fastq


FastQ records kept: 29130489
FastQ records discarded: 491676



In [36]:
# Read 1 from lane 0348
! sickle se -f CP-4Spl_S11_L004_R1_0348.fastq -t sanger -o CP-4Spl_S11_L004_R1_0348_sickle.fastq


FastQ records kept: 32798145
FastQ records discarded: 6291



In [37]:
# Read 2 from lane 0348
! sickle se -f CP-4Spl_S11_L004_R2_0348.fastq -t sanger -o CP-4Spl_S11_L004_R2_0348_sickle.fastq


FastQ records kept: 32307093
FastQ records discarded: 497343



For both lanes, many more records were discarded from read 2: about 1.7% (0343) and 1.5% (0348) of read 2 records, versus about 0.02% of read 1 records.
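
The `sickle` counts can be cross-checked against the `wc -l` results: kept plus discarded records should equal the line count divided by 4. A small sketch using only the numbers reported above (the dictionary keys are just labels chosen here):

```python
# (kept, discarded) as printed by sickle for each file.
SICKLE_COUNTS = {
    "R1_0343": (29616361, 5804),
    "R2_0343": (29130489, 491676),
    "R1_0348": (32798145, 6291),
    "R2_0348": (32307093, 497343),
}
# Line counts from `wc -l` on the untrimmed files.
WC_LINES = {"0343": 118488660, "0348": 131217744}


def discard_percent(kept, discarded):
    """Percentage of input records that sickle discarded."""
    return 100.0 * discarded / (kept + discarded)


for name, (kept, discarded) in SICKLE_COUNTS.items():
    run = name.split("_")[1]
    total = kept + discarded
    # Every input record is either kept or discarded; 4 lines per record.
    assert total * 4 == WC_LINES[run], name
    print(f"{name}: {total} records, {discard_percent(kept, discarded):.2f}% discarded")
```

Because each file was trimmed separately in `se` mode, the kept counts differ between R1 and R2 (for 0343, 29616361 versus 29130489), so the trimmed files no longer line up record for record. `sickle` also has a paired-end mode (`sickle pe`) that may be worth trying if downstream tools need synchronized pairs.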

Use `seqtk trimfq` to trim the same four untrimmed files:

In [38]:
! seqtk trimfq CP-4Spl_S11_L004_R1_0343.fastq > CP-4Spl_S11_L004_R1_0343_trimfq.fastq

In [39]:
! seqtk trimfq CP-4Spl_S11_L004_R2_0343.fastq > CP-4Spl_S11_L004_R2_0343_trimfq.fastq

In [40]:
! seqtk trimfq CP-4Spl_S11_L004_R1_0348.fastq > CP-4Spl_S11_L004_R1_0348_trimfq.fastq

In [46]:
! seqtk trimfq CP-4Spl_S11_L004_R2_0348.fastq > CP-4Spl_S11_L004_R2_0348_trimfq.fastq

/bin/sh: /Users/studentuser/trimmed/CP-4Spl_S11_L004_R2_0348_trimfq.fastq: No such file or directory
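
A "No such file or directory" error on a shell redirect usually means the target directory does not exist yet. One way to avoid that, and to avoid repeating the command per file, is to drive `seqtk trimfq` from Python and create the output directory first. This is a sketch under assumptions: `trimmed_path` and `seqtk_trim` are hypothetical helpers, and `seqtk` must be on the `PATH`.

```python
import shutil
import subprocess
from pathlib import Path


def trimmed_path(fastq_path, out_dir=None, suffix="_trimfq"):
    """Build the output name, e.g. x.fastq -> x_trimfq.fastq (optionally in out_dir)."""
    path = Path(fastq_path)
    target_dir = Path(out_dir) if out_dir is not None else path.parent
    return target_dir / (path.stem + suffix + path.suffix)


def seqtk_trim(fastq_path, out_dir=None):
    """Run `seqtk trimfq` on one file, writing its stdout to the trimmed file."""
    out_path = trimmed_path(fastq_path, out_dir)
    # Create the output directory so the write cannot fail on a missing path.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with open(out_path, "w") as out:
        subprocess.run(["seqtk", "trimfq", str(fastq_path)], stdout=out, check=True)
    return out_path


if __name__ == "__main__" and shutil.which("seqtk"):
    # The pattern matches only the four untrimmed files, not *_sickle or *_trimfq.
    for fastq in sorted(Path(".").glob("CP-4Spl_S11_L004_R?_03??.fastq")):
        print(seqtk_trim(fastq))
```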


In [43]:
ls

CP-4Spl_S11_L004_R1_0343.fastq         CP-4Spl_S11_L004_R2_0343_sickle.fastq
CP-4Spl_S11_L004_R1_0343_sickle.fastq  CP-4Spl_S11_L004_R2_0343_trimfq.fastq
CP-4Spl_S11_L004_R1_0343_trimfq.fastq  CP-4Spl_S11_L004_R2_0348.fastq
CP-4Spl_S11_L004_R1_0348.fastq         CP-4Spl_S11_L004_R2_0348_sickle.fastq
CP-4Spl_S11_L004_R1_0348_sickle.fastq  CP-4Spl_S11_L004_R2_0348_trimfq.fastq
CP-4Spl_S11_L004_R1_0348_trimfq.fastq  README.md
CP-4Spl_S11_L004_R2_0343.fastq         test-data/


In [42]:
! head -20 CP-4Spl_S11_L004_R1_0343.fastq

@K00242:343:HNFNJBBXX:4:1101:1093:1138 1:N:0:NGGCTTCA
CTCCAAAACATTTCCACAAAAACCCCCAAAACATTCCTCATACTTGTTGTAGCTATTATTACAAACTAGAAAGTAAAACAGCAATGGGCTCATGGTAAAT
+
AAAFFJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJFFJJJJJFJJFJJJ<JFFJJFJJJJJJJJJJJJJJJJJJJJJJJFFJJJJ7F<FJJJJJJF
@K00242:343:HNFNJBBXX:4:1101:1154:1138 1:N:0:TGGCTTCA
GTGAGATTTTCCCAGCGCTCCGACCAGCCGATGGCAAGTGGTTTCTCACTTCCTGACCGTGCCAATTTAGAAGGACCACATGATCATGCTTGCCAGGTAT
+
AA<FAFJJJJJJFAJJFJAAJA<JJJ<FJFAJFJFJJJ-J-<FFJ<JFJJJJJJFJJJFJFFJJJ-FJJJJJJ7FJJJJJJF7FJ-A---7-77A<JFF-
@K00242:343:HNFNJBBXX:4:1101:1215:1138 1:N:0:TGGCTTCA
CCTCATAAAACCAGAACTTGAAGTTACCAAGCAGAAAAGAATGACGGTTATCCGAACAACGTTCATAAAGAATATCAACGACTGGTACTCTACGAGTTCC
+
AA-A<FA<<FAFA-<AFJJJ-7FJJJ7FJFJFF<<FF-FFJJFJFAJ-AFAA<FJAJF-<77<FAFJ<-FJJFJJJJAJFJJA<JJJ<<AFJ-FJ-AAAF
@K00242:343:HNFNJBBXX:4:1101:2128:1138 1:N:0:TGGCTTCA
CCGGGTTCTGTCGCTGACGTCATCGTACCCGTCGTCTGACGCGGCGTTAGCGATGTTAACGATTGTAGTGGTAGGGACATTCAGGTACATCCCGAGCATT
+
AAFFFJJJJFFJJJJJFJJFJJJJJJJJJJJJJJJJJJJJJJJJJJ<FJJJJJF
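
Each fastq record in the output above is 4 lines: header, sequence, `+`, and a quality string with one character per base. With the `sanger` quality type passed to `sickle`, each character encodes a Phred score as its ASCII code minus 33. The sketch below decodes a shortened copy of the first record (only the first 6 bases are kept, for illustration); `read_fastq` and `phred33_scores` are hypothetical names.

```python
def read_fastq(lines):
    """Yield (header, sequence, quality) from fastq lines, assuming 4-line records.

    A trailing incomplete record (as in truncated `head` output) is skipped.
    """
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            header, sequence, _plus, quality = record
            yield header, sequence, quality
            record = []


def phred33_scores(quality):
    """Decode a Sanger / Phred+33 quality string into integer scores."""
    return [ord(char) - 33 for char in quality]


# First record from the output above, shortened to 6 bases, plus an
# incomplete second record to show that it is ignored.
SAMPLE = """@K00242:343:HNFNJBBXX:4:1101:1093:1138 1:N:0:NGGCTTCA
CTCCAA
+
AAAFFJ
@K00242:343:HNFNJBBXX:4:1101:1154:1138 1:N:0:TGGCTTCA
GTGAGA
"""

for header, sequence, quality in read_fastq(SAMPLE.splitlines()):
    print(header.split()[0], sequence, phred33_scores(quality))
```

Under this encoding `J` is 41, `F` is 37, `A` is 32, and the `-` characters seen in some reads are 12, which is the kind of low-quality base a trimmer removes.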

The resulting trimmed files will be analyzed in R Markdown using the `qrqc` library ([Markdown notebook]()). I will also check the untrimmed files using FastQC, and possibly try another trimming program such as Trim Galore.