# Introduction to FASTQ Files

<!--- 
** Add Intro Text***
-->

## Shell Variables
As before, we will use shell variables to make it easier to refer to the directories we are working with.  Shell variables do not carry over between notebooks: they are specific to a shell session, and each notebook runs its own shell session.

So the first thing we will do is assign the variables in this notebook.

In [1]:
# Hack to handle broken pipes - IGNORE.
cleanup () { 
    :
}
trap "cleanup" SIGPIPE

In [2]:
set -u
DATA_BASE="/data/hts_2019_data"
RAW_FASTQS="$DATA_BASE/hts2019_pilot_rawdata"
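
To make the point about shell sessions concrete, here is a minimal sketch. The names `DEMO_DIR` and `DEMO_UNDEFINED_VARIABLE` are hypothetical and used only for this illustration, and a child `bash` process stands in for "another notebook". It also shows what `set -u` is for: an unset variable becomes an error instead of silently expanding to an empty string.

```shell
# Hypothetical variable name, used only for this illustration.
DEMO_DIR="/tmp/demo_fastqs"

# The current shell session can see the variable.
parent_view="${DEMO_DIR:-unset}"
echo "this session: $parent_view"

# A separate bash process (standing in for a separate notebook) starts
# without it, because plain shell variables are not passed to other processes.
child_view="$(bash -c 'echo "${DEMO_DIR:-unset}"')"
echo "new session:  $child_view"

# With `set -u`, referring to an unset variable is an error instead of
# silently expanding to an empty string.
if bash -c 'set -u; echo "$DEMO_UNDEFINED_VARIABLE"' 2>/dev/null; then
    strict_result="no error"
else
    strict_result="error"
fi
echo "set -u on an unset variable: $strict_result"
```

This is why a mistyped variable name in a path fails loudly once `set -u` is active, rather than quietly pointing a command at the wrong directory.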

## Looking at a FASTQ
Let's take a quick look at our data.  For our first pass at analysis, we are just going to be working with the first read data (R1) from one sample.

In [3]:
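# -l: long listing, -S: sort by size, -r: reverse the order (smallest first), -h: human-readable sizes
ls -lSrh "$RAW_FASTQS"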

total 24G
-r--r--r-- 1 jovyan users 1.1K May 16 17:27 README.rtf
-r--r--r-- 1 jovyan users 3.0K Jun 12 13:06 2019_pilot_metadata.md
-r--r--r-- 1 jovyan users 3.7K May 16 17:27 2019_pilot_metadata.tsv
-r--r--r-- 1 jovyan users  11K May 16 17:27 md5_checksum.txt
-r--r--r-- 1 jovyan users  51K May 16 17:27 sequencing_run_report_5630_190509A5.html
-r--r--r-- 1 jovyan users 111M May 16 17:29 2_2018_P_H2_S28_L002_R1_001.fastq.gz
-r--r--r-- 1 jovyan users 113M May 16 17:30 2_2018_P_H2_S28_L004_R1_001.fastq.gz
-r--r--r-- 1 jovyan users 113M May 16 17:27 2_2018_P_H2_S28_L001_R1_001.fastq.gz
-r--r--r-- 1 jovyan users 115M May 16 17:30 2_2018_P_H2_S28_L003_R1_001.fastq.gz
-r--r--r-- 1 jovyan users 116M May 16 17:28 21_2019_P_M1_S21_L002_R1_001.fastq.gz
-r--r--r-- 1 jovyan users 118M May 16 17:27 21_2019_P_M1_S21_L001_R1_001.fastq.gz
-r--r--r-- 1 jovyan users 118M May 16 17:30 21_2019_P_M1_S21_L004_R1_001.fastq.gz

-r--r--r-- 1 jovyan users 172M May 16 17:27 4_2018_P_T1_S33_L004_R1_001.fastq.gz
-r--r--r-- 1 jovyan users 172M May 16 17:29 4_2018_P_T1_S33_L001_R1_001.fastq.gz
-r--r--r-- 1 jovyan users 172M May 16 17:28 9_2019_P_M1_S9_L002_R1_001.fastq.gz
-r--r--r-- 1 jovyan users 173M May 16 17:28 6_2019_P_M1_S6_L001_R1_001.fastq.gz
-r--r--r-- 1 jovyan users 173M May 16 17:28 18_2019_P_M1_S18_L001_R1_001.fastq.gz
-r--r--r-- 1 jovyan users 174M May 16 17:28 6_2019_P_M1_S6_L004_R1_001.fastq.gz
-r--r--r-- 1 jovyan users 174M May 16 17:28 18_2019_P_M1_S18_L004_R1_001.fastq.gz
-r--r--r-- 1 jovyan users 174M May 16 17:29 16_2019_P_M1_S16_L001_R1_001.fastq.gz
-r--r--r-- 1 jovyan users 175M May 16 17:31 16_2019_P_M1_S16_L004_R1_001.fastq.gz
-r--r--r-- 1 jovyan users 175M May 16 17:30 4_2018_P_T1_S33_L003_R1_001.fastq.gz
-r--r--r-- 1 jovyan users 176M May 16 17:29 9_2019_P_M1_S9_L0
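
The file names in the listing follow a regular pattern: a sample label, a lane (`L001` through `L004`), the read (`R1`), and `_001.fastq.gz`. A shell glob can pick out all lanes for one sample. The sketch below is only an illustration: it works in a throwaway directory created with `mktemp` and uses empty placeholder files whose names are modeled on the listing, so it does not touch the real data.

```shell
# Build a throwaway directory with empty placeholder files whose names
# are modeled on the listing above (they contain no sequence data).
demo_dir="$(mktemp -d)"
for lane in L001 L002 L003 L004; do
    touch "$demo_dir/21_2019_P_M1_S21_${lane}_R1_001.fastq.gz"
    touch "$demo_dir/2_2018_P_H2_S28_${lane}_R1_001.fastq.gz"
done

# The ? in the glob matches exactly one character, so this selects the
# R1 file for each lane of the names starting with 21_2019_P_M1_S21.
sample_files=( "$demo_dir"/21_2019_P_M1_S21_L00?_R1_001.fastq.gz )
echo "lanes found: ${#sample_files[@]}"
for f in "${sample_files[@]}"; do
    basename "$f"
done

# Clean up the throwaway directory.
rm -r "$demo_dir"
```

Against the real data, the same glob would be written with `"$RAW_FASTQS"` in place of the hypothetical `"$demo_dir"`.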

> ### Compression: gzip, zcat, etc.
> The ".gz" at the end of the FASTQ file name indicates that the file was compressed using a program named gzip.  This is common because FASTQ files can be huge.  `cat` is a program for printing text files; `zcat` is a variant that decompresses on the fly, so you can view a compressed text file without first decompressing it on disk.

Here we will use:
* zcat: to show the compressed FASTQ
* head: to grab only the first 10 lines, since the whole file has over 5 x 10^6 lines (printing all of them would almost certainly hang our web browser)
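
As a small, self-contained sketch of how these two tools fit together, the following builds a tiny gzipped FASTQ in a temporary directory and reads it back. The two reads and the file name `demo.fastq` are made up for illustration, and the example assumes the GNU versions of `gzip` and `zcat` found on typical Linux systems.

```shell
# Write a tiny two-read FASTQ (made-up reads) and compress it.
demo_dir="$(mktemp -d)"
printf '%s\n' \
    '@read1' 'ACGTN' '+' 'AAA#E' \
    '@read2' 'TTGCA' '+' 'EEEEE' > "$demo_dir/demo.fastq"
gzip "$demo_dir/demo.fastq"        # replaces demo.fastq with demo.fastq.gz

# Decompress to standard output and keep only the first 4 lines (one read).
first_read="$(zcat "$demo_dir/demo.fastq.gz" | head -n 4)"
echo "$first_read"

# zcat does not change the file on disk: only demo.fastq.gz is listed.
ls "$demo_dir"

# Clean up the throwaway directory.
rm -r "$demo_dir"
```

Because `head` stops reading after the requested number of lines, the pipeline finishes quickly even when the compressed file is very large.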

In [4]:
zcat "$RAW_FASTQS/21_2019_P_M1_S21_L002_R1_001.fastq.gz" | head

@NB501800:327:HF27FBGXB:2:11101:19634:1049 1:N:0:CGCTCATT+CTTCGCCT
GCGCTNCGAGGCGGCGGTGGCGCGGGCGGCGGAGACAGAGGCAGCGGCGGTGGCGAGGTTGGCCGAGTGCTGCTTG
+
AAAAA#EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
@NB501800:327:HF27FBGXB:2:11101:11230:1055 1:N:0:CGCTCATT+CTTCGCCT
AAAATATTGACTGGTCACCACTTGTCCAGAAATAATGTCAAAAATGGTAGTGCCGACACCGAGATATGAGTCGGTG
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEE
@NB501800:327:HF27FBGXB:2:11101:23417:1059 1:N:0:CGCTCATT+CTTCGCCT
AAAGATATCTGTAGGGTTGACCTCCTTGCCGGAGGGTTTCTTAAGAAGAACTCGAGTACCTTTTCTACCTAGCGCC
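
In the output above, each read takes up four lines: a header starting with `@`, the sequence, a `+` separator, and the quality string. Assuming every record in a file follows this four-line layout (no line-wrapped sequences), the number of reads is the line count divided by four. The sketch below demonstrates this on a made-up three-read file in a temporary directory rather than on the real data.

```shell
# Create a small gzipped FASTQ with three made-up reads.
demo_dir="$(mktemp -d)"
printf '%s\n' \
    '@read1' 'ACGTN' '+' 'AAA#E' \
    '@read2' 'TTGCA' '+' 'EEEEE' \
    '@read3' 'GGGCC' '+' 'EEAEE' | gzip > "$demo_dir/demo.fastq.gz"

# Count lines, then divide by 4 to get the number of reads.
line_count="$(zcat "$demo_dir/demo.fastq.gz" | wc -l)"
read_count=$(( line_count / 4 ))
echo "lines: $line_count, reads: $read_count"

# The sequence is the 2nd line of each 4-line record.
sequences="$(zcat "$demo_dir/demo.fastq.gz" | awk 'NR % 4 == 2')"
echo "$sequences"

# Clean up the throwaway directory.
rm -r "$demo_dir"
```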


> ## less
> `less` is a program for taking an interactive look at a text file, like a FASTQ: it lets you scroll, search, etc. `less` won't work in the bash notebook; if you want to try it out, you need to use a terminal.

> To switch to a terminal, click on the Jupyter "File" menu and select "Open".  A new browser window/tab should open with your Jupyter "home base". There, click on the "Files" tab if it is not already active, then click on "New" and select "Terminal", which should open a new live terminal.

> Since we want to look at a compressed (gzipped) FASTQ, we will use a version of `less` called `zless`, which decompresses on the fly.

> At the terminal's command prompt, type (or paste) `zless /data/hts_2019_data/hts2019_pilot_rawdata/21_2019_P_M1_S21_L002_R1_001.fastq.gz`.
> You should see the first few lines of the file. Notice that it looks like the examples we saw in lecture.

> `zless` (and its standard cousin `less`) can do a lot of things.  Here are a few important keystrokes:

> * q : quit
> * space : scroll down a page
> * up/down arrow : scroll up/down by a line 

## What do quality scores mean?
See the [Quality Scores](quality_scores.ipynb) notebook for a "translation" of quality scores.  The [Wikipedia article on FASTQs](https://en.wikipedia.org/wiki/FASTQ_format) is also a useful resource.
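
As a quick preview, here is a hedged sketch of the usual decoding. It assumes the Phred+33 encoding (quality score = ASCII code of the character minus 33), which is the common convention for recent Illumina data but is an assumption here, not something stated in this notebook. The helper name `phred_score` is made up for this example.

```shell
# Hypothetical helper: convert one quality character to a Phred score,
# assuming the Phred+33 encoding.
phred_score () {
    local ascii
    ascii=$(printf '%d' "'$1")     # numeric ASCII code of the character
    echo $(( ascii - 33 ))
}

# Quality characters that appear in the FASTQ output shown earlier.
for qchar in '#' 'A' 'E'; do
    echo "$qchar -> Q$(phred_score "$qchar")"
done
```

Under that assumption, `#` decodes to Q2, `A` to Q32, and `E` to Q36.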