# Fun Introductory Command Line Exercise: Next Generation Sequencing (NGS) Quality Analysis with Emoji 💻 📈 😻

**Developed by:** Ray Enke<sup>1</sup>, Rachael St. Jaques<sup>1</sup>, Max Maza<sup>1</sup>, Caylin Murray<sup>1</sup>, Sabrina Robertson<sup>2</sup>, Andrew Lonsdale<sup>3</sup>, & Jason Williams<sup>4</sup>

<sup>1</sup>Department of Biology, James Madison University <br /> <sup>2</sup>Department of Psychology & Neuroscience, University of North Carolina, Chapel Hill<br /> <sup>3</sup>School of Biosciences, University of Melbourne <br /> <sup>4</sup>DNA Learning Center, Cold Spring Harbor Laboratory

**Learning Goals**: Use basic command line coding to:
- Write basic command line scripts
- Analyze & assess the quality of FASTQ formatted NGS data
- Trim/filter low quality reads in FASTQ files 


The first step of any Next Generation Sequencing (NGS) analysis pipeline is checking the quality of the raw sequencing reads in each FASTQ formatted file. If the sequence quality is poor, then your resulting downstream analysis will be inaccurate and misleading. FastQC is a popular software tool used to provide an overview of basic quality metrics for NGS data. In this lesson, you will use an even more universal form of communication to analyze FASTQ files, THE EMOJI 😻😻😻.

**Technical requirements/limitations**: 

- FASTQE can run natively on a macOS or Linux computer with Anaconda installed (see below). Since Windows does not support the use of emoticons (😟😱😿), this implementation of the lesson can also be run in a [Jupyter notebook](https://jupyter.org/) on any machine!
- If using your own computer, you need to install Anaconda on your machine (see pre-class assignment https://bit.ly/2RxKApp; ~20 min to install). Anaconda is a Python-based data processing & scientific computing platform with built-in third-party libraries.

- Lastly, the FASTQE program is limited to short read NGS data of 500bp/read or less.


Like the popular FastQC software, FASTQE can be used to analyze FASTQ file quality whether it’s from a genome sequencing project, an RNA-seq project, a ChIP-seq project, etc. Here’s a brief background on the in-class metagenomics project that Dr. Enke’s Bio 481 Genomics class at James Madison University is collecting data for. Garter snakes excrete sexually dimorphic pheromones to attract a mate. The hypothesis of their experiment is that male and female garter snakes host unique microbial communities in their musk glands that contribute to sexually dimorphic bioengineering of these pheromone molecules. Figure 1 provides an overview of their 16S metagenomics analysis pipeline. For this lesson though, all you need are the FASTQ files. Feel free to substitute your own favorite FASTQ files for this activity if you like.

![figure1](./img/figure1.png)

**Figure 1. Overview of the in-class metagenomics project. Using a saline swabbing technique, microbial samples were collected from garter snake tissues in class (A). Swabs were placed in sterile tubes to release collected microbes & DNA was extracted for downstream analysis (B). Barcoded primers were used to PCR amplify the microbial 16S ribosomal DNA repeat genes for each sample followed by Illumina sequencing of PCR amplicons (C-D). The DNA Subway Purple Line web-based software can be used to analyze FASTQ data files generated from Illumina sequencing to reveal the microbial population of our swabs (E). Garter snakes were provided by Dr. Rocky Parker in the JMU Department of Biology (A; yellow shirt).**


As previously discussed, FASTQE is a program that analyzes FASTQ files & reads out an emoji output as an indicator of the sequence’s quality in the file. So, a high quality read may look like this 😃, while this symbol 💩 indicates... well you get the idea.


**Hands on assignment**: Working individually, in pairs, or in groups of 3... provide feedback wherever indicated. If you get stuck, ask for help! Turn in this document at the end of the activity for your graded assignment. If working in pairs or groups, make sure to rotate turns typing commands. Have fun! 😀


# Part 1: Download FASTQ files and run `fastqe`
Jupyter allows you to run commands by selecting a cell and then clicking the play button or pressing Ctrl+Enter. For example, running the next cell executes the `pwd` (print working directory) command, which will tell you what directory this notebook is located in.

In [None]:
pwd

### Try it: 
If you’ve printed a path that doesn’t make sense (i.e., you navigated to the wrong directory), how would you go back to the previous directory? (Hint: it includes the change directory command.)

- Hint: type your commands in the cell below to see how the `cd` (change directory) command works.
- Next, navigate to the data folder using the `cd /home/joyvan/data` command.
- Execute the `pwd` command again to confirm that you are in the `data` directory.
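
As a minimal sketch of how `cd` and `pwd` work together, the commands below navigate a throwaway directory tree instead of `/home/joyvan/data`. The folder names `project` and `data` and the variables `scratch` and `after_dash` are hypothetical and exist only for this illustration.

```shell
# Build a small, hypothetical directory tree in a temporary location
scratch=$(mktemp -d)
mkdir -p "$scratch/project/data"

cd "$scratch/project/data"   # move down into the nested directory
pwd                          # prints the full path, ending in /project/data

cd ..                        # ".." means the parent directory
pwd                          # the path now ends in /project

cd "$scratch"                # an absolute path works from anywhere
cd -                         # "-" returns to the directory you were in just before
after_dash=$(pwd)            # store the current location for inspection
echo "$after_dash"
```

Running `pwd` after every `cd` is a cheap way to confirm where you are before you run anything that reads or writes files.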


In [None]:
# Your code here


In [None]:
# Your code here


# IMPORTANT

Your output for the `pwd` command must be `/home/joyvan/data`; otherwise the rest of the lesson will not work because you will not be in the correct directory.

# Step 1

Using the `wget` command, download the compressed fastq file here: 

https://de.cyverse.org/dl/d/6476693F-1711-4AD5-AAEA-DDACBF8FB516/fastq.zip

(this is 1 file with the .zip extension that unzips into 3 .fastq files). 

The `wget` command we will use has three components:

**Usage**: `wget -O [filename] [URL]`

- `wget` is the name of the program.
- `-O` is an option we can pass to the `wget` program; it lets us choose the name our file will be saved as, in this case `fastq.zip`.
- `[URL]` is the URL you want to download the file from.

Type `wget`, then a space, then `-O fastq.zip`, another space, and then the URL you are downloading from.
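
The order of the three parts can be previewed with the sketch below. It uses a placeholder URL (`https://example.org/reads.zip`) and file name (`reads.zip`), both hypothetical, and it only prints the assembled command with `echo` rather than downloading anything.

```shell
# Hypothetical values for illustration only; this is not the lesson's download
url="https://example.org/reads.zip"
outfile="reads.zip"

# -O is a capital letter O (not a zero). The name to save as comes right
# after it, and the URL goes last.
cmd="wget -O $outfile $url"

# echo prints the command so you can check it before running it for real
echo "$cmd"
```

Printing a command first is a handy habit when a typo could overwrite a file or fetch the wrong data.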

In [None]:
# Your code here


# Step 2

**To see the downloaded zip file, you will need to navigate on the left-hand menu to the /home/joyvan/data directory. Click on the folder icon (highlighted) and navigate to the `data` directory.**

![](./img/data.png)


In the next cell, use the `unzip` command to unzip the downloaded `fastq.zip`

**Usage**: `unzip` [file to unzip]

In [None]:
# Your code here


### Question 2: 
What‚Äôs the purpose of using a zipped file?

#### Your answer to question 2: 

(Click the cell below, enter your answer, and then click the Play button or press Ctrl+Enter to render as markdown text)

YOUR ANSWER HERE

# Step 3

In the next cell, use the `ls` (list files) command to verify you have unzipped three files: `female2_oral1.fastq`, `Male5_oral1.fastq`, `Male5_oral2.fastq`

**Usage**: `ls` [directory] (list contents of a directory - if left blank, will display for the current directory, if a wildcard [e.g. \*.file-extension] is provided, will list all the files with the given file extension)

Use the `ls` command, passing `*.fastq` in place of the directory.
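
To see what the wildcard does, the sketch below lists a throwaway directory containing made-up, empty files. The names `sampleA.fastq`, `sampleB.fastq`, and `notes.txt` and the variables are hypothetical and are not the lesson's data.

```shell
# Work in a temporary directory with hypothetical, empty files
scratch=$(mktemp -d)
cd "$scratch"
touch sampleA.fastq sampleB.fastq notes.txt

ls            # no argument: lists everything in the current directory
ls *.fastq    # the shell expands *.fastq to the matching names before ls runs

# Store the wildcard listing so it can be inspected or counted
fastq_files=$(ls *.fastq)
n_fastq=$(ls *.fastq | wc -l | tr -d ' ')
echo "found $n_fastq fastq files"
```

Because the shell does the expansion, the same `*.fastq` pattern works with any program that accepts a list of file names.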

In [None]:
# Your code here


# Step 4

In the next cell, run the `fastqe` program to generate your emoji fastq report

**Usage**: `fastqe` [fastq-file] (runs the `fastqe` program; if a wildcard [e.g. \*.fastq] is provided, `fastqe` will run on all the fastq files in the current working directory)

**Note**: Remember that fastq files are very large, so this command will take ~30 seconds/file to complete.

In [None]:
# Your code here
# Hint: Run this ONLY ONCE. If this takes longer than 60 seconds - Go to the Kernel menu and choose "Restart Kernel"; you will have to also run the command 'cd /home/joyvan/data'


### Question 3: 
What are the advantages and disadvantages of using the command `fastqe *.fastq` rather than running `fastqe` on each of your files (e.g. `fastqe Female2-oral1.fastq` ... `fastqe Male5-oral1.fastq` ...)?

#### Your answer to question 3: 

(Click the cell below, enter your answer, and then click the Play button or press Ctrl+Enter to render as markdown text)

YOUR ANSWER HERE

# Student #2 `fastqe` help 

Notice that one of your files (`Male5-oral1.fastq`) seems to have lower quality than the others based on the Emoji readout. Let’s look more closely to see how bad the data is.

# Step 5

Open the FASTQE help page to view the “optional arguments”; these are all of the options and settings for the program.

To get the help info for `fastqe` (and many other command line programs) add the `--help` option to the `fastqe` program instead of a filename or wildcard (remember to leave a space between `fastqe` and `--help`).

In [None]:
# Your code here


### Question 4: 
Which optional argument will show the version # of FASTQE?

#### Your answer to question 4: 
(Enter in the cell below and click the Play button or press Ctrl+Enter to render)

YOUR ANSWER HERE

# Step 6

Add the `--scale` option to the `fastqe` command to view the Phred score associated with each emoji in your output. Try this just for the `Male5-oral1.fastq` file (remember to leave a space before you type `--scale`). This will take a few seconds to run. 

In [None]:
# Your code here
# Hint: You are checking the scale for a file you are inputting, therefore you need to have a fastq filename in this command


### Question 5: 
A Phred score of ≤20 is considered a poor quality base call. How many poor quality base calls are at the 3’ end of this read?
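
For context on the scale: a Phred score Q corresponds to an estimated error probability of 10^(-Q/10), so Q20 means roughly a 1 in 100 chance that the base call is wrong. The sketch below assumes the Phred+33 encoding (the quality character's ASCII code minus 33), which is the convention in current Illumina FASTQ files, and it uses a made-up quality string rather than a read from the lesson's data.

```shell
# Made-up quality string for one short read (not taken from the lesson's files)
qual='IIIIIFFF5555##'

low=0
i=1
while [ "$i" -le "${#qual}" ]; do
    ch=$(printf '%s' "$qual" | cut -c "$i")   # the i-th quality character
    ascii=$(printf '%d' "'$ch")               # its ASCII code
    q=$((ascii - 33))                         # Phred+33 decoding (assumed encoding)
    echo "position $i: '$ch' -> Q$q"
    if [ "$q" -le 20 ]; then
        low=$((low + 1))                      # count base calls at or below Q20
    fi
    i=$((i + 1))
done
echo "bases with Q <= 20: $low"
```

In this made-up string the low scores cluster at the right-hand (3’) end, which is the pattern the question asks you to look for in the real output.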

#### Answer to question 5: 

(Click the cell below, enter your answer, and then click the Play button or press Ctrl+Enter to render as markdown text)

YOUR ANSWER HERE

# Student #3 `fastp`  

Let’s use another program called `fastp` to get a more conventional readout of the .fastq file data. `fastp` is similar to the FastQC program we previously used; however, it also has a trimming tool to cut out or filter the low quality sequences in our file.

# Step 7

Run `fastp` on the lower quality `Male5-oral1.fastq` file



**Usage (Note: You will need to use all of these elements in your command)**: 

- `fastp` is the name of the software that will check the quality of the fastq file
- `-i [input.fastq]`: the `-i` option specifies the input file for `fastp`
- `-o [output.fastq]`: the `-o` option specifies the output file for `fastp`
- `--html [output.html]`: the `--html` option specifies the name of the HTML report for `fastp`
- `--json [output.json]`: the `--json` option specifies the name of the [JSON](https://en.wikipedia.org/wiki/JSON) report for `fastp`

Write a command using `Male5-oral1.fastq` as your input and `out.Male5-oral1.fastq` as your output. Name your `--html` report `Male5-oral1.html` and your `--json` report `Male5-oral1.json`. 
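
The sketch below shows one way the four options fit together, using hypothetical file names (`sample.fastq`, `out.sample.fastq`, `sample.html`, `sample.json`) that you would swap for the names given above. It prints the assembled command and only runs it if `fastp` is installed and the input file exists, so it is safe to try anywhere.

```shell
# Hypothetical file names; substitute the names given in the instructions
in_fq="sample.fastq"
out_fq="out.sample.fastq"

# Assemble the command from the four options described above
cmd="fastp -i $in_fq -o $out_fq --html sample.html --json sample.json"
echo "$cmd"

# Run it only if fastp is on the PATH and the input file actually exists
if command -v fastp >/dev/null 2>&1 && [ -f "$in_fq" ]; then
    $cmd
else
    echo "skipping run: fastp or $in_fq not found"
fi
```

The options can be given in any order, as long as each one is immediately followed by its own file name.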


In [None]:
# Your code here


# Step 8

You should now have 3 new files in your `data` folder:

1. .html file (this is your QC report)
2. .json file (ignore this for now)
3. trimmed fastq file (`out.Male5-oral1.fastq`)

Click on the `Male5-oral1.html` file in the Jupyter menu on the left to examine this report.

**Note**: Click on **Trust HTML** on the top of the HTML report tab to reveal graphs that may be hidden until you provide this authorization. 


### Question 6: 
From the “Summary” data in your HTML fastp report, how many reads are in this FASTQ file before and after filtering?
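
If you want to cross-check a read count from the command line, one rough approach is to count lines and divide by four. The sketch below assumes every read occupies exactly four lines (header, sequence, `+`, quality), which is the usual layout for Illumina FASTQ files, and it runs on a tiny made-up file (`tiny.fastq`, hypothetical) rather than the lesson's data.

```shell
# Write a tiny, made-up FASTQ file with two reads (four lines per read)
scratch=$(mktemp -d)
cat > "$scratch/tiny.fastq" <<'EOF'
@read1
ACGTACGT
+
IIIIFFFF
@read2
TTGCA
+
II55#
EOF

# Count the lines, then divide by 4 to get the number of reads
lines=$(wc -l < "$scratch/tiny.fastq" | tr -d ' ')
reads=$((lines / 4))
echo "$lines lines, $reads reads"
```

Running the same two lines on a file before and after filtering gives numbers you can compare against the report's summary.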

#### Answer to question 6: 


(Click the cell below, enter your answer, and then click the Play button or press Ctrl+Enter to render as markdown text)

YOUR ANSWER HERE

### Question 7: 
How do the before and after plots compare?

#### Answer to question 7: 

(Click the cell below, enter your answer, and then click the Play button or press Ctrl+Enter to render as markdown text)

YOUR ANSWER HERE

# Step 9
Use the `out.Male5-oral1.fastq` file to rerun `fastqe`. Remember this will take a few seconds to run. 

In [None]:
# Your code here


### Question 8: 
How does the emoji output for the trimmed file compare with the emoji output for the original file?

#### Answer to question 8: 

(Click the cell below, enter your answer, and then click the Play button or press Ctrl+Enter to render as markdown text)

YOUR ANSWER HERE

### Question 9: 
Which tool (fastqe or fastp) did you find easier to use?

#### Answer to question 9: 

(Click the cell below, enter your answer, and then click the Play button or press Ctrl+Enter to render as markdown text)

YOUR ANSWER HERE

### Question 10: 
Which tool (fastqe or fastp) do you think is a more reliable research grade tool?


#### Answer to question 10: 

(Click the cell below, enter your answer, and then click the Play button or press Ctrl+Enter to render as markdown text)

YOUR ANSWER HERE

To sum up, you just analyzed Illumina FASTQ data quality using Emoji output. You then filtered out low quality sequences & generated before & after QC plots. You did all of that on the command line. Congrats!


# Saving your work

Once you have completed your work, go to the file menu and select `Export Notebook As` and choose `Markdown`. Send this Markdown file to your instructor. Alternatively, use the left-hand file menu to navigate to the `notebooks` folder and download this notebook `fastqe-notebook.ipynb` by right-clicking on the file and choosing `Download`. 