Aidan Coyle, afcoyle@uw.edu
FISH 546, Bioinformatics
2021-01-26

## Exploratory Analysis of Individual Hematodinium/C. bairdi Libraries

Because these libraries are large, I won't be examining all of them. Instead, I'll start with one or two, 
beginning with Library 72, the first numerically (note: numeric order does not match the order in the Nightingales spreadsheet because the library numbers are not zero-padded). 
Library 72 has two read files, Read 1 and Read 2. I'll start by looking at Read 1, then switch to Read 2. 
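
The ordering note above can be illustrated with a minimal sketch. It assumes the spreadsheet sorts library numbers as text; the IDs other than 72 are hypothetical and are not taken from the Nightingales spreadsheet.

```python
# Hypothetical library IDs; only "72" comes from this notebook.
library_ids = ["72", "113", "127", "294", "349"]

# Plain string sorting compares character by character,
# so "113" sorts before "72" because "1" < "7".
lexical_order = sorted(library_ids)

# Converting each ID to an integer first gives true numeric order.
numeric_order = sorted(library_ids, key=int)

print("text sort:   ", lexical_order)
print("numeric sort:", numeric_order)
```

Zero-padding the IDs (for example `072`) would make the two orders agree.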

Library 72 originates from a Hematodinium-infected crab in the elevated-temperature treatment group, sampled on Day 0 of the experiment.
The experiment was run by Grace Crandall on Chionoecetes bairdi. For more details, see my lab notebook.

In [4]:
pwd

'/mnt/c/Users/acoyl/Documents/GitHub/fish546/projects/2021-01-25_Data_Exploration/scripts'

In [3]:
# Get Read 1 for Library 72
!curl http://owl.fish.washington.edu/nightingales/C_bairdi/72_R1_001.fastq.gz > lib72_read1.fastq.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1719M  100 1719M    0     0  10.6M      0  0:02:42  0:02:42 --:--:-- 10.6M


In [5]:
!ls

[0m[01;32m01_indiv_lib_fastq_exploration.ipynb[0m*  [01;32mlib72_read1.fastq.gz[0m*


In [15]:
# Accidentally downloaded to scripts directory, let's fix
!mv lib72_read1.fastq.gz ../data/indiv_fastq/lib72_read1.fastq.gz

In [17]:
# Ensure that data is present
!ls ../data/indiv_fastq

lib72_read1.fastq.gz


In [24]:
# Download checksums file
!curl http://owl.fish.washington.edu/nightingales/C_bairdi/md5sum_list.txt > ../data/checksums/md5_indivlibs_checksums.txt

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  2684  100  2684    0     0  30157      0 --:--:-- --:--:-- --:--:-- 29822


In [30]:
# Get md5 checksum for existing file
!md5sum ../data/indiv_fastq/lib72_read1.fastq.gz | awk '{print $1}' > ../data/checksums/tmp_checksum

In [33]:
# See if our file's checksum matches 
!grep -f ../data/checksums/tmp_checksum ../data/checksums/md5_indivlibs_checksums.txt

88227c353e113898a976a71736d1d6ef  ./72_R1_001.fastq.gz
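
The md5sum, awk, and grep steps above can also be done in a single pass without a temporary file. The following is a rough sketch only, assuming the checksum list uses the standard `md5sum` layout of `<hash>  <filename>` per line, as the grep output suggests. The function names are hypothetical helpers, not part of any library.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_md5_list(path):
    """Parse md5sum-style lines into a {file basename: hash} dictionary."""
    expected = {}
    for line in Path(path).read_text().splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) == 2:
            # Drop any leading "./" or binary-mode "*" so lookups use the bare file name.
            name = Path(parts[1].strip().lstrip("*")).name
            expected[name] = parts[0]
    return expected


def checksum_matches(local_file, remote_name, md5_list_path):
    """True if the local file's MD5 equals the listed MD5 for remote_name."""
    expected = load_md5_list(md5_list_path)
    return expected.get(remote_name) == md5_of_file(local_file)
```

With the paths used in this notebook, the call would look like `checksum_matches("../data/indiv_fastq/lib72_read1.fastq.gz", "72_R1_001.fastq.gz", "../data/checksums/md5_indivlibs_checksums.txt")`. Unlike `grep -f`, this compares against the entry for one specific file name rather than any line containing the hash.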


In [46]:
# Remove the temporary checksum file now that the match is confirmed
rm ../data/checksums/tmp_checksum

In [39]:
# Unzip .fastq.gz file
!gunzip -c ../data/indiv_fastq/lib72_read1.fastq.gz > ../data/indiv_fastq/lib72_read1.fastq

In [44]:
!head -20 ../data/indiv_fastq/lib72_read1.fastq

@NGSNJ-086:277:GW200313501st:1:1101:1307:1000 1:N:0:ACTCGCTA+TCGACTAG
AATGTGTACAGTGTAGTGTGAACCACAGTGCTGTGCATGGAAGTGAGGAACCTGAATATCATAGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGTACTCTGCGTTGATACCTGTCTCTTATACAAAACTCCGAGCCCACGAGACCCTCGC
+
F,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFF:F:FFF:FFFFFFFFF,:F:,,:FFFFFF,,:F:,:,::FF:,FFF,,F,F,FF:F,FFF:F,F,,F,F,F:,
@NGSNJ-086:277:GW200313501st:1:1101:5719:1000 1:N:0:ACTCGCTA+TCGACTAG
ACCATGAGTTTGAGCCTTGGACCTGCGCACTCAGTCCTCTGGTGGCTTGCTCGTTTTATGCACGTCAGAAGTACTGGTTTTTTCTGCATTCTCCTTGGTGGACGAGTAGAGACCAGGCTTATTGTTGCAAATCCTCTCCGTGGACAGGTT
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFF:FF
@NGSNJ-086:277:GW200313501st:1:1101:9968:1000 1:N:0:ACTCGCTA+TCGACTAG
CCTTTGTTTTTATTGTATTCTCTTACTGTCTTCTTTATAGGGACTGTCTCTTATACACATCTCCGAGCCCACGAGACACTCGCTAATCTAGTATTCAGTCTTCTGCTTGAAAAATGTGGGGGTGGGGGGGGGGGGGGGGGGGGGGGGGGG
+
FFFFFFFFFF:FF:FF:F,FFFF,F:F,F
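
Each record in the output above spans four lines: a header starting with `@`, the sequence, a `+` separator, and a quality string with one character per base. As a minimal sketch, assuming the qualities use the Phred+33 encoding that is typical of recent Illumina output (not confirmed for these libraries), the records can be parsed and the quality characters decoded like this. Both function names are hypothetical.

```python
from itertools import islice


def phred_scores(quality_line, offset=33):
    """Convert a FASTQ quality string to a list of integer Phred scores."""
    return [ord(ch) - offset for ch in quality_line]


def read_fastq_records(lines):
    """Yield (header, sequence, quality) tuples from an iterable of FASTQ lines."""
    lines = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(lines, 4)]
        if len(record) < 4:
            return  # end of input, or a truncated final record
        header, sequence, _separator, quality = record
        yield header, sequence, quality


# The three quality characters that appear in the reads shown above.
print(phred_scores("F:,"))
```

Under the Phred+33 assumption, `F`, `:`, and `,` decode to 37, 25, and 11, which would explain why the runs of `,` cluster toward the low-quality tail of the first read.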

In [45]:
# count number of sequences
!bioawk -cfastx 'END{print NR}' ../data/indiv_fastq/lib72_read1.fastq

27249335
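
The same count can be obtained without bioawk and without keeping a decompressed copy on disk. This is a sketch only, assuming every record occupies exactly four lines, as in the `head` output earlier; `count_fastq_records` is a hypothetical helper.

```python
import gzip


def count_fastq_records(path):
    """Count records in a plain or gzipped FASTQ file, assuming 4 lines per record."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        line_count = sum(1 for _ in handle)
    if line_count % 4 != 0:
        raise ValueError(f"{path}: {line_count} lines is not a multiple of 4")
    return line_count // 4
```

Calling `count_fastq_records("../data/indiv_fastq/lib72_read1.fastq.gz")` should agree with the bioawk count if the four-line assumption holds.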


In [61]:
# see size of files
!ls -sh ../data/indiv_fastq/

total 12G
9.6G lib72_read1.fastq	1.7G lib72_read1.fastq.gz


In [63]:
# get format of .gz file
!file ../data/indiv_fastq/lib72_read1.fastq.gz

../data/indiv_fastq/lib72_read1.fastq.gz: gzip compressed data, original size modulo 2^32 63128
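
The "original size modulo 2^32" value reported above comes from the 4-byte ISIZE field at the end of a gzip file. Because the field is only 32 bits wide, it wraps around for data of 4 GiB or more, and for a file made of several concatenated gzip members it describes only the last member, so it should not be read as the true uncompressed size here. A minimal sketch of reading that field directly (`gzip_isize` is a hypothetical helper):

```python
import struct


def gzip_isize(path):
    """Return the ISIZE trailer field of a gzip file (uncompressed size modulo 2**32)."""
    with open(path, "rb") as handle:
        handle.seek(-4, 2)  # jump to the last 4 bytes of the file
        (isize,) = struct.unpack("<I", handle.read(4))  # little-endian unsigned 32-bit
    return isize
```

For a reliable size, `ls -sh` on the decompressed file is the safer check.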


In [62]:
# get format of .fastq file
!file ../data/indiv_fastq/lib72_read1.fastq

../data/indiv_fastq/lib72_read1.fastq: ASCII text


In [52]:
# download FastQC
!curl https://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.9.zip > ../programs/fastqc_v0.11.9.zip

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  9.7M  100  9.7M    0     0  2679k      0  0:00:03  0:00:03 --:--:-- 2678k


In [60]:
# unzip FastQC file
!unzip -q ../programs/fastqc_v0.11.9.zip -d ../programs/FastQC

In [69]:
# give the FastQC launcher script executable permissions
!chmod 755 ../programs/FastQC/fastqc

In [None]:
# add symbolic link. Since sudo requires an interactive prompt, this was run in a terminal and copied into Jupyter
!sudo ln -s ../programs/FastQC

In [79]:
# use the %cd magic so the directory change persists; `!cd` only affects a throwaway subshell
%cd ../programs/FastQC

/mnt/c/Users/acoyl/Documents/GitHub/fish546/projects/2021-01-25_Data_Exploration/programs/FastQC


In [91]:
# run FastQC on our file
!./fastqc ../../data/indiv_fastq/lib72_read1.fastq

Started analysis of lib72_read1.fastq
Approx 5% complete for lib72_read1.fastq
Approx 10% complete for lib72_read1.fastq
Approx 15% complete for lib72_read1.fastq
Approx 20% complete for lib72_read1.fastq
Approx 25% complete for lib72_read1.fastq
Approx 30% complete for lib72_read1.fastq
Approx 35% complete for lib72_read1.fastq
Approx 40% complete for lib72_read1.fastq
Approx 45% complete for lib72_read1.fastq
Approx 50% complete for lib72_read1.fastq
Approx 55% complete for lib72_read1.fastq
Approx 60% complete for lib72_read1.fastq
Approx 65% complete for lib72_read1.fastq
Approx 70% complete for lib72_read1.fastq
Approx 75% complete for lib72_read1.fastq
Approx 80% complete for lib72_read1.fastq
Approx 85% complete for lib72_read1.fastq
Approx 90% complete for lib72_read1.fastq
Approx 95% complete for lib72_read1.fastq
Analysis complete for lib72_read1.fastq


In [118]:
!ls ../../data/indiv_fastq

lib72_read1.fastq     lib72_read1_fastqc.html
lib72_read1.fastq.gz  lib72_read1_fastqc.zip


Opening lib72_read1_fastqc.html in a browser shows the full report, so it sure looks like FastQC ran correctly! Fantastic!
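
As a complement to viewing the HTML report, the pass/warn/fail status of each module can be read from the zip archive. This sketch assumes the archive contains a tab-separated `summary.txt` with the status in the first column and the module name in the second, which is the layout FastQC normally writes; `fastqc_summary` is a hypothetical helper.

```python
import zipfile


def fastqc_summary(zip_path):
    """Return {module name: status} from the summary.txt inside a FastQC zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        summary_name = next(
            (name for name in archive.namelist() if name.endswith("summary.txt")),
            None,
        )
        if summary_name is None:
            raise FileNotFoundError(f"no summary.txt found in {zip_path}")
        text = archive.read(summary_name).decode("utf-8")

    results = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) >= 2:
            status, module = fields[0], fields[1]
            results[module] = status
    return results
```

For this notebook the call would be `fastqc_summary("../../data/indiv_fastq/lib72_read1_fastqc.zip")`, run from the FastQC directory.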