# BNFO62: RNA-sequencing, Day <br>Part 1 - (RNA-seq Differential Expression Data Preparation)

**Authors:** Michelle Franc Ragsac (mragsac@eng.ucsd.edu) and Eric Kofman (ekofman@eng.ucsd.edu)

> *Based on the RNA-sequencing notebook from the 2020 BISB Bootcamp taught by Clarence Mah (ckmah@ucsd.edu)--one of the TAs for this class this year: https://github.com/mragsac/BISB-Bootcamp-2020/tree/master/day4/module5_rnaseq*

Within this notebook, we'll be going through the steps taken on the command line to process RNA-sequencing data from the FASTQ file format to a counts matrix containing the representation of each gene within a sample. By the end of this notebook, you will know how to process RNA-sequencing data with a basic pipeline, gain experience with commonly-used sequence alignment tools, and recognize common sequencing file types! 

<div class="alert alert-block alert-warning">
    <p>Please note that although this notebook uses a Python kernel, all of the cells use <code>%%bash</code> IPython magic commands! We created the notebook this way so that it would be compatible with the Jupyter Hub for this class.</p>
</div>

---

![Diagram of RNA-Sequencing Data Analysis Pipeline](img/day1_00_overview-figure.png)

<i><b>Diagram of RNA-Sequencing Data Analysis Pipeline</b></i>

---

There are quite a few tools that we'll be using in this pipeline, and you can find their tool manuals below: 

> #### Bioinformatics Tool Manuals in Order of Appearance
> 
> 1. **[FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)**: A quality control tool for high throughput sequence data 
> 2. **[Samtools](http://www.htslib.org/)**: A suite of programs for interacting with high-throughput sequencing data
> 3. **[`S`pliced `T`ranscripts `A`lignment to a `R`eference (STAR)](https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf)**: A *fast* tool for aligning reads to a reference genome  
> 4. **[featureCounts](http://bioinf.wehi.edu.au/featureCounts/)**: A read summarization program that counts mapped reads for genomic features such as genes, exons, promoter, gene bodies, genomic bins and chromosomal locations
> 5. **[MultiQC](https://multiqc.info/)**: A tool which is able to summarize the output from numerous bioinformatics tools.

---

### Table of Contents

1. Reviewing UNIX Command Line Commands for Navigating the Terminal
2. Checking the Quality of Raw Sequencing FASTQ Files
3. Aligning or Mapping RNA-sequencing FASTQ Reads to the Genome 
4. Sorting and Indexing Aligned Sequencing Reads with `samtools`
5. Generating a Gene Expression Counts Matrix using the `featureCounts` Program
6. Generate a Summary Quality Control Document with `MultiQC` 

---

## Creating a symlink to the data
Run the following command to create a symlink to the data needed for this notebook:

In [1]:
%%bash
ln -sfn ~/public/rnaseq/Day1_materials/data ~/module-3-rnaseq/Day1_materials/data
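
If you'd like to see what the `-s`, `-f`, and `-n` flags do before touching the course folders, here is a minimal sketch that builds a link inside a scratch directory. The `shared/data` and `workdir` folder names are made up for this illustration and are not part of the class materials.

```bash
# Work in a scratch directory so nothing in the course folders is touched
scratch=$(mktemp -d)
mkdir -p "$scratch/shared/data" "$scratch/workdir"

# -s: make a symbolic link (a pointer to the path) instead of a hard link
# -f: replace whatever already exists at the link location
# -n: if the link location is already a link to a directory, replace the
#     link itself instead of creating a new link inside that directory
ln -sfn "$scratch/shared/data" "$scratch/workdir/data"

# Running the same command a second time is safe: the link is replaced
ln -sfn "$scratch/shared/data" "$scratch/workdir/data"

# Check where the link points, and that no nested link was created
link_target=$(readlink "$scratch/workdir/data")
if [ -e "$scratch/shared/data/data" ]; then nested=yes; else nested=no; fi
echo "workdir/data -> $link_target (nested link created: $nested)"

# Clean up the scratch directory
rm -rf "$scratch"
```

Because of `-f` and `-n`, re-running the cell above should simply refresh the link rather than fail or nest a second link inside the first.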

## Reviewing UNIX Command Line Commands for Navigating the Terminal

Before we go into bioinformatics command line tools, let's start out by reviewing some UNIX commands, along with some basic rules to follow to help keep your computational analyses organized! 

Within this portion of the notebook, we'll review: 

* Using `pwd` to print the current "working directory", i.e., the folder in which we are currently located on the terminal
* Using `ls` to list the contents of the current working directory

### Previewing the Current Working Directory with the `pwd` Command

Whenever opening the terminal, it's helpful to first determine where you're located within your computer's filesystem before you run any other programs or commands. We can preview the current working directory with the `pwd` command. 

<div class="alert alert-block alert-info">
    <p>Because our Jupyter Notebook is located in the <code>module-3-rnaseq/Day1_materials</code> directory, the results of our <code>pwd</code> command should reflect the location of this directory.</p>
</div>

In [2]:
%%bash

# Use pwd to determine our current working directory
pwd

/home/dtv004/module-3-rnaseq/Day1_materials


### Listing the Contents of the Current Working Directory with the `ls` Command

Now that we have a handle on where we're located within our computer's filesystem, we can use the `ls` command to list out the contents of the folder we're located in. 

<div class="alert alert-block alert-info">
    <p>Because our Jupyter Notebook is located in the <code>module-3-rnaseq/Day1_materials</code> directory, the results of our <code>ls</code> command should reflect the contents of this directory.</p>
</div>

In [3]:
%%bash

# Use ls to list the contents of the current working directory
ls

data
Day1_RNAseq_Data_Prep.ipynb
img
results
SOLUTIONS_Day1_RNAseq_Data_Prep.ipynb


From this command, we can see that there are two Jupyter Notebooks present for this module, along with three folders: 

* The **`data/`** folder contains the data that we'll be using to learn about RNA-sequencing analysis for this module 
* The **`results/`** folder contains the outputs that you should expect from running our RNA-sequencing pipeline on the data files we've provided--just in case you want to run through the steps on your own and see if you did things correctly! 
* The **`img/`** folder contains some image files that are used in these notebooks

Let's view the contents within the `data/` folder, but also introduce one of the helpful flags that you can use with the `ls` command. `ls -l` allows you to view the contents of a directory in *long* format. This includes the permissions for the file (e.g., read, write, or execute permissions), the size of the file in bytes, along with the date the file was last modified. 

In [4]:
%%bash

# Use ls to show the contents of the data folder
ls -l data/yeast.*

-rw-rw---- 1 120197 root 12400379 Jan 19 20:28 data/yeast.fasta
-rw-rw---- 1 120197 root  3135233 Jan 19 20:28 data/yeast.gtf


<div class="alert alert-block alert-warning">
    <p>For this notebook, we'll be using the <code>yeast.*</code> files to demonstrate the different steps in an RNA-sequencing analysis pipeline in order to save time.</p>
    <p>These represent just a small subsampling of reads so that working with our files doesn't take too long. </p>
</div>

<div class="alert alert-block alert-info">
    <p>With the above block, we said <code>yeast.*</code>. The <code>*</code> is a common symbol that you will see within this notebook and in UNIX commands; it represents a <i>wildcard</i> that matches any sequence of characters, meaning that we would like to include every file whose name begins with <code>yeast.</code>, whatever its file extension! (e.g., <code>yeast.fasta</code>, <code>yeast.gtf</code>, etc.) Another example is <code>*.fasta</code>, meaning all files that end in <code>.fasta</code>.</p>
</div>
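
To make the wildcard behavior concrete, here is a small sketch that uses a scratch directory with a few empty, made-up files (none of these are the real course files). It shows that the shell expands each pattern into the matching file names before `ls` ever runs.

```bash
# Create a scratch directory with a few empty, made-up files
demo_dir=$(mktemp -d)
touch "$demo_dir/yeast.fasta" "$demo_dir/yeast.gtf" "$demo_dir/human.fasta"

# The shell replaces each pattern with the list of matching file names
yeast_files=$(cd "$demo_dir" && ls yeast.*)
fasta_files=$(cd "$demo_dir" && ls *.fasta)

echo "yeast.* matched:"
echo "$yeast_files"
echo "*.fasta matched:"
echo "$fasta_files"

# Clean up the scratch directory
rm -rf "$demo_dir"
```

Here `yeast.*` picks up both yeast files but not `human.fasta`, while `*.fasta` picks up both FASTA files but not `yeast.gtf`.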

<div class="alert alert-block alert-info">
    <p>If you're ever confused about basic UNIX commands and what their different flags mean, we highly recommend the ExplainShell resource! You can type in the command that you're struggling to understand, and it will explain each component for the command in a semi-interactive website. The website pulls from the <code>man</code> (manual) pages for the commands that you would like to search for.</p>
    <p><b>Link to ExplainShell Website</b>: <a href="https://explainshell.com/">https://explainshell.com/</a></p>
</div>

---

## Checking the Quality of Raw Sequencing FASTQ Files

Within this portion of the notebook, we'll be covering some of the basics behind the FASTQ file format (along with how it's organized) and a common method for assessing the quality of your sequencing data using the `FastQC` software.

### Background on the FASTQ File Format

FASTQ files are text-based files for storing a biological sequence along with its corresponding quality scores, and are the most common file format that a bioinformatician would receive from a sequencing run for further analysis. Each sequence letter and each quality score is encoded with a single ASCII character. These files usually contain four lines per sequence. 

---

![FASTQ File Format Description](img/day1_02_fastq-file-format.png)

[Hosseini, Morteza, Diogo Pratas, and Armando J. Pinho. "A survey on data compression methods for biological sequences." Information 7.4 (2016): 56.](https://www.mdpi.com/2078-2489/7/4/56/htm)

---

- **Line 1** begins with a `@` character and is followed by a sequence identifier and an *optional* description (like a `FASTA` title line)
- **Line 2** contains the raw sequence letters 
- **Line 3** begins with a `+` character and is *optionally* followed by the same sequence identifier (and any description) again
- **Line 4** encodes the quality values for the sequence in Line 2, and *must* contain the same number of symbols as letters in the sequence

It's important to become familiar with FASTQ files and how they're organized as they're one of the most common elements that you'll see across different bioinformatics pipelines that involve sequencing data (regardless of sequencing method)! 
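
The four rules above can be checked mechanically. The sketch below assumes the common layout where every record takes exactly four lines (sequences are not wrapped over several lines); the `check_fastq` helper and the two demo reads are made up for illustration.

```bash
# Build a tiny two-read FASTQ file for the demonstration (made-up reads)
demo_fastq=$(mktemp)
printf '%s\n' \
    '@read1 demo' 'ACGTACGT' '+' 'AAAAEEEE' \
    '@read2 demo' 'GGCCTTAA' '+read2 demo' '@EEE/EEA' > "$demo_fastq"

# Walk through the file four lines at a time and check each rule:
#   line 1 starts with "@", line 3 starts with "+",
#   and line 4 is exactly as long as line 2
check_fastq() {
    awk '
        NR % 4 == 1 { if (substr($0, 1, 1) != "@") bad++ }
        NR % 4 == 2 { seq_len = length($0) }
        NR % 4 == 3 { if (substr($0, 1, 1) != "+") bad++ }
        NR % 4 == 0 { if (length($0) != seq_len) bad++ }
        END {
            if (NR % 4 != 0) bad++          # incomplete final record
            printf "%d records, %d problems\n", int(NR / 4), bad
        }
    ' "$1"
}

check_fastq "$demo_fastq"
rm -f "$demo_fastq"
```

Notice that the quality line of the second demo read begins with `@`, which is a perfectly legal quality character. This is why position within the four-line record, rather than the first character alone, is what identifies a header line.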

<div class="alert alert-block alert-info">
    <p>If you have <b>paired-end</b> sequencing data, you will usually get <i>two</i> FASTQ files from the sequencing core, one labeled with an <code>R1</code> in its filename and one labeled with an <code>R2</code> in its filename!</p>
    <p>For <b>single-end</b> runs, you will typically get <i>one</i> FASTQ file to evaluate.</p>
</div>

### Previewing the FASTQ File Format with the `head`, `grep`, and `wc` UNIX Commands

Let's review some more UNIX commands as we learn about the `*.fastq` file format! 

* We can use the `head` command to preview the head, or beginning, of a file (the `-n` flag allows you to specify the number of lines that you would like to preview from the beginning of the file)
* `grep` is a powerful command-line utility for searching plain-text data files for specific patterns using regular expressions; you can find more information and some tutorials at the following link: https://www.geeksforgeeks.org/grep-command-in-unixlinux/

<div class="alert alert-block alert-info">
    <p><b>What are Regular Expressions?</b></p>
    <p>While we won't be covering them in this notebook, <b>regular expressions</b> are useful for extracting information from text, such as code, log files, spreadsheets, and more! With regular expressions, you write <i>expressions</i> to search for a specific pattern of characters. They're a bit cumbersome to learn at first, but if you're interested, here are a few resources that we've selected for you to learn more from:</p>
    <ul>
        <li><b>w3schools Python Regex Resource:</b> <a href="https://www.w3schools.com/python/python_regex.asp">https://www.w3schools.com/python/python_regex.asp</a></li>
        <li><b>RegexOne Website:</b> <a href="https://regexone.com/">https://regexone.com/</a></li>
        <li><b>RegExr Website:</b> <a href="https://regexr.com/">https://regexr.com/</a></li>
    </ul>
</div>

#### Using the `head` Command to Preview a FASTQ File

In [5]:
%%bash

# Preview the first 8 lines of the first FASTQ file in the read pair (R1)
head -n 8 data/yeast_R1.fastq

@SRR6924582.1 1 length=76
TCAGATTTAGTCCATAAGGCAAACTTGTTACCACCTTTTCTAATGCTTAAAACGACACCGTTAATTTGGGAGTCGT
+SRR6924582.1 1 length=76
AAAAAEAEEEEAEEEEAEEEAEEEEEEEAEEA/EEEEEEEEEAEEEE6EE//AE/EEEEEAEAEEEAEEEEE<EEA
@SRR6924582.2 2 length=75
GGGTACTTCAAGTACTTACCGGAGAACTTGGTGGTCGAGACAACGGTGACAACAGAGTCGTTCTGCTCGACAGTG
+SRR6924582.2 2 length=75
AAA/AEEEEEEEEEEEEAEE/E/EAEEEEAEEEEEEEEE/EEEEEEE/EEEEEEAEEEE///EEEA//A/EEEAE


From this command, we're previewing two separate reads that were produced in the yeast experiment. The name of each of our reads begins with the pattern `@SRR*`! We can also see the sequences that were read out by the sequencer for both of these reads, along with their associated quality scores at each base! 
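
If you're curious how those quality characters translate into numbers, here is a minimal sketch. It assumes the Phred+33 encoding (the ASCII code of each character minus 33), which is the usual convention for recent Illumina data but is not stated anywhere in the files themselves; the `phred_scores` helper is a made-up name.

```bash
# Convert a quality string into Phred scores, assuming the Phred+33 offset
phred_scores() {
    printf '%s\n' "$1" | awk '
        BEGIN {
            # Lookup table from each printable character to its ASCII code
            for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i
        }
        {
            out = ""
            for (j = 1; j <= length($0); j++) {
                q = ord[substr($0, j, 1)] - 33
                out = out (j > 1 ? " " : "") q
            }
            print out
        }'
}

# The five distinct characters that appear in the quality lines above
phred_scores 'AE/6<'
```

Under this assumption, a score Q corresponds to an estimated error probability of 10^(-Q/10), so the `E` characters (Q = 36) mark very confident base calls, while `/` (Q = 14) marks much less certain ones.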

<div class="alert alert-block alert-info">
    <p><code>tail</code> is the counterpart command for <code>head</code> as it allows you to view the <i>tail</i> or end of a file! It follows the same syntax as <code>head</code>.</p>
</div>

#### Using the `grep` and `wc` Commands to Determine the Number of Reads in a FASTQ File

Since we know that each of our read headers starts with the text `@SRR`, we can use that information to construct a `grep` command with a simple regular expression! `grep` commands usually follow the pattern below: 

```bash
grep [regular expression] filename
```

The command will then return all *lines* that match the regular expression you are searching for! 

If we were to run this command within the notebook to look for read headers in our file, then we would clog the notebook with many, many, *many* lines from our FASTQ file that match the pattern. Instead, we'll *pipe* the output (that is, pass it along as the input) into another command using the `|` operator.

In [6]:
%%bash

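# Extract all read headers from our FASTQ file, then preview the output using head
# (quoting the pattern stops the shell from treating * as a filename wildcard,
# and ^ anchors the match to the start of the line)
grep '^@SRR' data/yeast_R1.fastq | head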

@SRR6924582.1 1 length=76
@SRR6924582.2 2 length=75
@SRR6924582.3 3 length=75
@SRR6924582.4 4 length=76
@SRR6924582.5 5 length=76
@SRR6924582.6 6 length=74
@SRR6924582.7 7 length=75
@SRR6924582.8 8 length=74
@SRR6924582.9 9 length=75
@SRR6924582.10 10 length=76


Instead of just previewing what the output of the command looks like, we can use the `wc -l` command (`wc` with the `-l` flag) to determine the number of lines that are present in the piped result from our `grep` command!

In [7]:
%%bash

# Extract all read headers from our FASTQ file, then 
# determine the number of lines present in the output using the wc -l command
grep '^@SRR' data/yeast_R1.fastq | wc -l

23520


From this, we can see that there are 23,520 reads present in our `data/yeast_R1.fastq` file! 
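
Another way to arrive at the same number, without relying on what the read names look like, is to count lines and divide by four. This is only a sketch: it assumes every record takes exactly four lines, and the `count_reads` helper and the demo reads are made up for illustration.

```bash
# Count reads as (number of lines) / 4; assumes four-line FASTQ records
count_reads() {
    awk 'END { print int(NR / 4) }' "$1"
}

# Made-up two-read file in which one quality line happens to start with "@"
demo_fastq=$(mktemp)
printf '%s\n' \
    '@demo.1 1 length=8' 'ACGTACGT' '+' '@AAAEEEE' \
    '@demo.2 2 length=8' 'GGCCTTAA' '+' 'EEEEAAAA' > "$demo_fastq"

by_lines=$(count_reads "$demo_fastq")
by_at_sign=$(grep -c '^@' "$demo_fastq")

echo "reads by line count: $by_lines"
echo "lines starting with @: $by_at_sign"
rm -f "$demo_fastq"
```

Searching only for a leading `@` overcounts here because a quality line can also begin with that character. The longer `@SRR` prefix used above makes such false matches unlikely, and running `count_reads data/yeast_R1.fastq` would be one way to cross-check the `grep` result.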

<div class="alert alert-block alert-success">
    <p><b>Exercise:</b> How many reads are present in the <code>data/yeast_R2.fastq</code> file? Is it what you would expect to see? Why?</p>
</div>

In [8]:
%%bash

grep '^@SRR' data/yeast_R2.fastq | wc -l

23520
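
Matching read counts are necessary for paired-end data, but the mates also need to appear in the same order in both files. Here is a rough sketch of such a check. It assumes that both mates carry the same identifier as the first word of their header line (other datasets may use `/1` and `/2` suffixes instead), and `pairs_in_sync` is a made-up helper name.

```bash
# Print the number of header lines whose read ID differs between two FASTQ files
pairs_in_sync() {
    awk '
        NR == FNR { if (FNR % 4 == 1) id[FNR] = $1; next }   # first file: remember IDs
        FNR % 4 == 1 && id[FNR] != $1 { mismatches++ }        # second file: compare IDs
        END { print mismatches + 0 }
    ' "$1" "$2"
}

# Made-up mate files: the second record of the R2 file has the wrong ID
demo_r1=$(mktemp)
demo_r2=$(mktemp)
printf '%s\n' '@read1 1' 'ACGT' '+' 'EEEE' '@read2 2' 'GGCC' '+' 'EEEE' > "$demo_r1"
printf '%s\n' '@read1 1' 'TTAA' '+' 'EEEE' '@read9 2' 'CCGG' '+' 'EEEE' > "$demo_r2"

mismatched=$(pairs_in_sync "$demo_r1" "$demo_r2")
echo "mismatched read IDs: $mismatched"
rm -f "$demo_r1" "$demo_r2"
```

A result of `0` from `pairs_in_sync data/yeast_R1.fastq data/yeast_R2.fastq` would suggest the two files are in step. Note that this simple version does not notice extra reads at the end of the first file, so it complements the read counts rather than replacing them.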


### Generating a Quality Control Report with the `fastqc` Program

Now that we've gotten a handle on the FASTQ files that we're evaluating, let's try and assess the quality of the sequencing run that we're looking at! **FastQC** is a program that performs quality checks on raw sequencing data in the form of FASTQ, SAM, or BAM-formatted files. While FastQC has the capability to be used interactively with a graphical user interface (GUI), we'll be using it non-interactively on the command-line!

<div class="alert alert-block alert-info">
    <p><b>Analysis Modules included in the FastQC Software</b></p>
    <p>There are quite a few modules that you'll see in a <code>FastQC</code> report for FASTQ files, so we've summarized what each of them will show you below in the order they will appear in the report for easy reference (this information can also be found in the documentation): </p>
    <ol>
        <li><b>Basic Statistics</b>:
            <br>Simple composition statistics for the file analyzed</li>
        <li><b>Per Base Sequence Quality</b>:
            <br>Shows an overview of the range of quality values across all bases at each position in the file via a Box-Whisker plot</li>
        <li><b>Per Sequence Quality</b>:
            <br>Allows you to see if a subset of the sequences in the file have universally low quality values</li>
        <li><b>Per Base Sequence Content</b>:
            <br>Allows you to see the proportion of each base (e.g., <code>A</code>, <code>C</code>, <code>T</code>, or <code>G</code>) at each position across all sequences in the file</li>
        <li><b>Per Sequence GC Content</b>:
            <br>Measures the GC% content across the whole length of each sequence in the file and compares it against a modelled normal distribution of GC% content</li>
        <li><b>Per Base N Content</b>:
            <br>Plots the percentage of base calls at each position with a <code>N</code> (<code>N</code>'s are called when the sequencer does not have sufficient confidence to call a base as a specific nucleotide)</li>
        <li><b>Sequence Length Distribution</b>:
            <br>Generates a graph showing the distribution of fragment sizes in the file</li> 
        <li><b>Duplicate Sequences</b>:
            <br>Plots the relative number of sequences with different degrees of duplication in the file</li>
        <li><b>Overrepresented Sequences</b>:
            <br>Lists all the sequences that make up more than <code>0.1%</code> of the total number of sequences, along with their matches to common sequencing contaminants</li>
        <li><b>Adapter Content</b>:
            <br>Plots a cumulative percentage count of the proportion of the library that contains known adapter sequences</li>
        <li><b>Kmer Content</b>:
            <br>Plots the distribution of top 6 most biased kmers present within the file's sequences</li>
        <li><b>Per Tile Sequence Quality</b>: (Note: <i>Only appears in Illumina libraries</i>)
            <br>Shows the deviation from the average quality in each flowcell tile</li>
    </ol>
</div>

#### Running the `FastQC` Software on the Command Line

Luckily, running the command for `FastQC` is quite simple! Let's start out by making two folders, a `results/` folder to hold all of our results, and a `01_fastqc_output/` folder nested within the `results/` folder to hold our results for this section of the notebook! 

In [9]:
%%bash

# Create a folder for results and the fastqc output
mkdir -p results/01_fastqc_output/

# Run the command to generate fastqc reports for yeast_R1 and yeast_R2 fastq files
# and output to the folder we've created 
fastqc data/yeast_R1.fastq data/yeast_R2.fastq -o results/01_fastqc_output

Started analysis of yeast_R1.fastq
Approx 5% complete for yeast_R1.fastq
Approx 10% complete for yeast_R1.fastq
Approx 15% complete for yeast_R1.fastq
Approx 20% complete for yeast_R1.fastq
Approx 25% complete for yeast_R1.fastq
Approx 30% complete for yeast_R1.fastq
Approx 35% complete for yeast_R1.fastq
Approx 40% complete for yeast_R1.fastq
Approx 45% complete for yeast_R1.fastq
Approx 50% complete for yeast_R1.fastq
Approx 55% complete for yeast_R1.fastq
Approx 60% complete for yeast_R1.fastq
Approx 65% complete for yeast_R1.fastq
Approx 70% complete for yeast_R1.fastq
Approx 75% complete for yeast_R1.fastq
Approx 80% complete for yeast_R1.fastq
Approx 85% complete for yeast_R1.fastq
Approx 90% complete for yeast_R1.fastq
Approx 95% complete for yeast_R1.fastq


Analysis complete for yeast_R1.fastq


Started analysis of yeast_R2.fastq
Approx 5% complete for yeast_R2.fastq
Approx 10% complete for yeast_R2.fastq
Approx 15% complete for yeast_R2.fastq
Approx 20% complete for yeast_R2.fastq
Approx 25% complete for yeast_R2.fastq
Approx 30% complete for yeast_R2.fastq
Approx 35% complete for yeast_R2.fastq
Approx 40% complete for yeast_R2.fastq
Approx 45% complete for yeast_R2.fastq
Approx 50% complete for yeast_R2.fastq
Approx 55% complete for yeast_R2.fastq
Approx 60% complete for yeast_R2.fastq
Approx 65% complete for yeast_R2.fastq
Approx 70% complete for yeast_R2.fastq
Approx 75% complete for yeast_R2.fastq
Approx 80% complete for yeast_R2.fastq
Approx 85% complete for yeast_R2.fastq
Approx 90% complete for yeast_R2.fastq
Approx 95% complete for yeast_R2.fastq


Analysis complete for yeast_R2.fastq


With the `ls` command, we can see what files are present in the `results/01_fastqc_output/` folder! 

In [10]:
%%bash

# Show the output of the fastqc command with the ls command
ls -l results/01_fastqc_output/

total 1250
-rw-rw---- 1 dtv004 root 596741 Jan 19 23:06 yeast_R1_fastqc.html
-rw-rw---- 1 dtv004 root 329922 Jan 19 23:06 yeast_R1_fastqc.zip
-rw-rw---- 1 dtv004 root 624353 Jan 19 23:06 yeast_R2_fastqc.html
-rw-rw---- 1 dtv004 root 327641 Jan 19 23:06 yeast_R2_fastqc.zip


---

<div class="alert alert-block alert-danger">
    <b>Let's download the output files so we can open the <code>html</code> reports for our FASTQ files in our browser!</b> 
</div>
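
If you'd like a quick text summary before opening the reports, each FastQC `.zip` archive contains a `summary.txt` file with one tab-separated line per module: a `PASS`, `WARN`, or `FAIL` flag, the module name, and the input file name. The sketch below tallies those flags; `tally_fastqc` is a made-up helper, and the three input lines are invented purely to show the layout (they are not results from our yeast files).

```bash
# Tally the PASS / WARN / FAIL flags from a FastQC summary.txt read on standard input
tally_fastqc() {
    awk -F '\t' '
        { n[$1]++ }
        END { printf "PASS=%d WARN=%d FAIL=%d\n", n["PASS"], n["WARN"], n["FAIL"] }
    '
}

# Invented three-line summary, only to demonstrate the expected layout
printf 'PASS\tBasic Statistics\tdemo.fastq\nWARN\tPer base sequence content\tdemo.fastq\nPASS\tAdapter Content\tdemo.fastq\n' | tally_fastqc
```

Assuming the `unzip` program is available on your system, something like `unzip -p results/01_fastqc_output/yeast_R1_fastqc.zip yeast_R1_fastqc/summary.txt | tally_fastqc` should print the tally for the real report.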

---

## Aligning or Mapping RNA-sequencing FASTQ Reads to the Genome 

Now that we've assessed the quality control of our reads (and removed any reads we would like to exclude due to contamination if necessary, etc.), we can now try to **align** or **map** our reads to the genome! In general terms, the alignment or mapping step is the process of figuring out *where* in the genome our read sequences are from. 

<div class="alert alert-block alert-info">
    <p>The short read alignment problem is an interesting, but fairly difficult problem for two reasons: (i) the reference genome is large, and it is more difficult to search a large sequence than a small one, and (ii) we generally aren't looking for <i>exact</i> matches within the reference genome.</p>
</div>

There are two main options that you can follow depending on the availability of a genome sequence: 

> * When studying an organism with a reference genome, it is possible to infer which transcripts are expressed by mapping the reads to the reference genome (**genome mapping**) or transcriptome (**transcriptome mapping**). Mapping reads to the genome requires no knowledge of the set of transcribed regions or the way in which exons are spliced together. This approach allows the discovery of new, unannotated transcripts.
> * When working on an organism without a reference genome, reads need to be assembled first into longer contigs (**de novo assembly**). These contigs can then be considered as the expressed transcriptome to which reads are re-mapped for quantification.
>
> *Referenced from: https://www.ebi.ac.uk/training-beta/online/courses/functional-genomics-ii-common-technologies-and-data-analysis-methods/rna-sequencing/performing-a-rna-seq-experiment/data-analysis/read-mapping-or-alignment/*

While there are many bioinformatics tools available to perform the alignment of short reads, the most common read type produced by the Illumina sequencing platform, we'll be learning how to use the software known as `STAR`! 

<div class="alert alert-block alert-warning">
    <p>While <code>STAR</code> has <i>a lot</i> of functionality, we won't be going through each individual function within this notebook. We'll just be going through the flags that we need to get things running.</p>
    <p>We <i>highly encourage</i> you to review the documentation at the following link: <a href="https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf">https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf</a>. (Don't be too intimidated by how long the document is! Reading documentation is a common part of bioinformatics work, so knowing how to parse through all of it is a valuable skill!)</p>
</div>

### Generating a Genome Index with the `genomeGenerate` Run Mode for `STAR` 

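```
--runMode: genomeGenerate mode
--runThreadN: number of threads / cores
--genomeDir: /path/to/store/genome_indices
--genomeFastaFiles: /path/to/FASTA_file
--genomeSAindexNbases: length of the suffix array pre-indexing string (reduced from the default of 14 for small genomes)
--sjdbGTFfile: /path/to/GTF_file
```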

In [11]:
%%bash

STAR --runThreadN 1 \
    --runMode genomeGenerate \
    --genomeDir results/ref/ \
    --genomeFastaFiles data/yeast.fasta \
    --genomeSAindexNbases 10 \
    --sjdbGTFfile data/yeast.gtf

	/opt/conda/envs/rna-seq/bin/STAR-avx2 --runThreadN 1 --runMode genomeGenerate --genomeDir results/ref/ --genomeFastaFiles data/yeast.fasta --genomeSAindexNbases 10 --sjdbGTFfile data/yeast.gtf
	STAR version: 2.7.10b   compiled: 2023-05-25T06:56:23+0000 :/opt/conda/conda-bld/star_1684997536154/work/source
Jan 19 23:06:50 ..... started STAR run
Jan 19 23:06:50 ... starting to generate Genome files
Jan 19 23:06:50 ..... processing annotations GTF
Jan 19 23:06:50 ... starting to sort Suffix Array. This may take a long time...
Jan 19 23:06:50 ... sorting Suffix Array chunks and saving them to disk...
Jan 19 23:06:59 ... loading chunks from disk, packing SA...
Jan 19 23:07:00 ... finished generating suffix array
Jan 19 23:07:00 ... generating Suffix Array index
Jan 19 23:07:00 ... completed Suffix Array index
Jan 19 23:07:00 ..... inserting junctions into the genome indices
Jan 19 23:07:01 ... writing Genome to disk ...
Jan 19 23:07:01 ... writing Suffix Array to disk ...
Jan 19 23:07:01 ..
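
You may be wondering where the `--genomeSAindexNbases 10` value came from. The STAR manual recommends scaling this parameter down for small genomes to `min(14, log2(GenomeLength)/2 - 1)`. The sketch below works through that arithmetic; rounding down to a whole number is our own choice here, the helper names are made up, and the 12,000,000 figure is only a rough stand-in for the yeast genome size (consistent with the roughly 12.4 MB `yeast.fasta` file we listed earlier).

```bash
# Total number of bases in a FASTA file (every line that is not a ">" header line)
genome_length() {
    grep -v '^>' "$1" | tr -d '\n\r' | wc -c | tr -d ' '
}

# min(14, log2(GenomeLength)/2 - 1), rounded down to a whole number
sa_index_nbases() {
    awk -v n="$1" 'BEGIN {
        v = int(log(n) / log(2) / 2 - 1)
        if (v > 14) v = 14
        print v
    }'
}

# Rough yeast-sized genome (about 12 million bases)
sa_index_nbases 12000000

# Rough human-sized genome (about 3.1 billion bases), capped at the default of 14
sa_index_nbases 3100000000
```

To base the value on the actual file instead of a rough estimate, you could run `sa_index_nbases "$(genome_length data/yeast.fasta)"`.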

### Aligning Reads to the Reference Genome with `STAR`

```
--genomeDir: /path/to/genome_indices_directory
--runThreadN: number of threads / cores
--readFilesIn: /path/to/FASTQ_file
--outFileNamePrefix: prefix for all output files
--outSAMtype: output filetype (SAM default)
--outSAMunmapped: what to do with unmapped reads
```

In [12]:
%%bash

STAR --runThreadN 1 \
    --genomeDir results/ref/ \
    --readFilesIn data/yeast_R1.fastq data/yeast_R2.fastq \
    --outFileNamePrefix results/02_STAR/yeast \
    --outSAMunmapped Within

	/opt/conda/envs/rna-seq/bin/STAR-avx2 --runThreadN 1 --genomeDir results/ref/ --readFilesIn data/yeast_R1.fastq data/yeast_R2.fastq --outFileNamePrefix results/02_STAR/yeast --outSAMunmapped Within
	STAR version: 2.7.10b   compiled: 2023-05-25T06:56:23+0000 :/opt/conda/conda-bld/star_1684997536154/work/source
Jan 19 23:07:01 ..... started STAR run
Jan 19 23:07:01 ..... loading genome
Jan 19 23:07:01 ..... started mapping
Jan 19 23:07:07 ..... finished mapping
Jan 19 23:07:07 ..... finished successfully


<div class="alert alert-block alert-success">
    <p><b>Exercise:</b> What files were generated by <code>STAR</code> during the alignment step?</p>
</div>

In [13]:
%%bash

ls results/02_STAR/


yeastAligned.out.sam
yeastLog.final.out
yeastLog.out
yeastLog.progress.out
yeastSJ.out.tab


Now that we know what files were produced by our analysis, let's evaluate one of them! For `STAR` runs, you can find quick diagnostic information from the run under the `*Log.final.out` file that is generated in the output folder. 

In [14]:
%%bash

# Use the cat command to print out the contents of the yeastLog.final.out file 
cat results/02_STAR/yeastLog.final.out

                                 Started job on |	Jan 19 23:07:01
                             Started mapping on |	Jan 19 23:07:01
                                    Finished on |	Jan 19 23:07:07
       Mapping speed, Million of reads per hour |	14.11

                          Number of input reads |	23520
                      Average input read length |	150
                                    UNIQUE READS:
                   Uniquely mapped reads number |	20353
                        Uniquely mapped reads % |	86.53%
                          Average mapped length |	150.54
                       Number of splices: Total |	559
            Number of splices: Annotated (sjdb) |	513
                       Number of splices: GT/AG |	556
                       Number of splices: GC/AG |	0
                       Number of splices: AT/AC |	0
               Number of splices: Non-canonical |	3
                      Mismatch rate per base, % |	0.49%
                         Deletion rate pe

From this output, we can see that 86.53% of our input reads mapped uniquely, which is a decent percentage, and that the whole run finished within a few seconds because we're using a small test dataset to demonstrate the commands. Typically, an alignment run, especially one against a larger genome like *Homo sapiens* or *Mus musculus*, would take *much* longer! 
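
When you have many samples, reading each log by eye gets tedious, so here is a sketch of pulling individual values out of a `Log.final.out` file. The `star_log_value` helper is a made-up name, and the demo file contains just three lines copied from the output above.

```bash
# Print the value to the right of "|" for a given field name in a Log.final.out file
star_log_value() {
    awk -F '|' -v key="$2" '
        {
            name = $1; value = $2
            gsub(/^[ \t]+|[ \t]+$/, "", name)     # trim surrounding whitespace
            gsub(/^[ \t]+|[ \t]+$/, "", value)
            if (name == key) print value
        }
    ' "$1"
}

# Three lines copied from the log shown above
demo_log=$(mktemp)
printf '%s |\t%s\n' \
    '                          Number of input reads' '23520' \
    '                   Uniquely mapped reads number' '20353' \
    '                        Uniquely mapped reads %' '86.53%' > "$demo_log"

input_reads=$(star_log_value "$demo_log" 'Number of input reads')
unique_reads=$(star_log_value "$demo_log" 'Uniquely mapped reads number')
pct_reported=$(star_log_value "$demo_log" 'Uniquely mapped reads %')

# Recompute the percentage from the two counts as a sanity check
pct_computed=$(awk -v u="$unique_reads" -v t="$input_reads" 'BEGIN { printf "%.2f%%", 100 * u / t }')

echo "reported: $pct_reported, recomputed: $pct_computed"
rm -f "$demo_log"
```

The recomputed value agrees with the reported one, which confirms that the percentage is simply the uniquely mapped read count divided by the number of input reads. On the real file, `star_log_value results/02_STAR/yeastLog.final.out 'Uniquely mapped reads %'` should return the same figure.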

### Background on the SAM/BAM File Format

There are several files generated during alignment, the most important of which is the alignment file itself! Alignment files can come in either the `SAM` (Sequence Alignment/Map) format or its compressed binary counterpart, the `BAM` (Binary Alignment/Map) format. While originally developed as a text-based format for storing biological sequences aligned to a reference sequence, SAM has also been adapted to support unmapped sequences! 

Generally, the SAM format consists of a header section followed by an alignment section. Header lines begin with the `@` symbol in order to distinguish them from the alignment section. SAM files are typically analyzed and edited using the `SAMtools` software suite. 

In [15]:
%%bash

# Let's preview the sam file that we generated with the head command!
head results/02_STAR/yeastAligned.out.sam 

@HD	VN:1.4
@SQ	SN:chrI	LN:230218
@SQ	SN:chrII	LN:813184
@SQ	SN:chrIII	LN:316620
@SQ	SN:chrIV	LN:1531933
@SQ	SN:chrIX	LN:439888
@SQ	SN:chrM	LN:85779
@SQ	SN:chrV	LN:576874
@SQ	SN:chrVI	LN:270161
@SQ	SN:chrVII	LN:1090940


With this, we can see a preview of some of the header lines within the file. Let's try and view some of the alignment lines instead by deselecting these lines with `grep -v` and using head again! 

In [16]:
%%bash

# Deselect header lines from the sam file, and then pipe the output to head
# (the pattern is quoted so that the shell passes it to grep unchanged)
grep -v '^@' results/02_STAR/yeastAligned.out.sam | head

SRR6924582.1	99	chrXV	60516	255	76M	=	60639	198	TCAGATTTAGTCCATAAGGCAAACTTGTTACCACCTTTTCTAATGCTTAAAACGACACCGTTAATTTGGGAGTCGT	AAAAAEAEEEEAEEEEAEEEAEEEEEEEAEEA/EEEEEEEEEAEEEE6EE//AE/EEEEEAEAEEEAEEEEE<EEA	NH:i:1	HI:i:1	AS:i:149	nM:i:0
SRR6924582.1	147	chrXV	60639	255	75M	=	60516	-198	AATTCATCAATATCAGCACCTTTTCCTCTAAGTTGGAAAGACCATTTACCACCTTTAGCATTGGCTTCATCTTCC	E<EEE/EEE<<EE<EEEEEEEEEE/EEEE/EEEEEE/EEEEEEEEAEEEEAEAEEEEEEAEE/AEEEEEEAAAAA	NH:i:1	HI:i:1	AS:i:149	nM:i:0
SRR6924582.2	77	*	0	0	*	*	0	0	GGGTACTTCAAGTACTTACCGGAGAACTTGGTGGTCGAGACAACGGTGACAACAGAGTCGTTCTGCTCGACAGTG	AAA/AEEEEEEEEEEEEAEE/E/EAEEEEAEEEEEEEEE/EEEEEEE/EEEEEEAEEEE///EEEA//A/EEEAE	NH:i:0	HI:i:0	AS:i:45	nM:i:10	uT:A:1
SRR6924582.2	141	*	0	0	*	*	0	0	ACCACATCAAGGTCGACGGCCACTTGGGTAACTTGGGTAACGCCATCACTGTCGAGCAGAACGACTCTGTTGTCAC	AAA/AAEEEEA6EEE6EEEEEEEEEEEAEAEEE<AE/EEEE/EEEEE6EEEEEE/EEEEAAAEEEEE<AAA<EEEE	NH:i:0	HI:i:0	AS:i:45	nM:i:10	uT:A:1
SRR6924582.3	163	chrIV	818494	255	76M	=	818529	110	AGATAAATCACCACGAATCATTTGTTCGGTCATGTAACATGCAC

From our output of the alignment section of our file, we can see that there are quite a few columns present! Here's a handy table containing what each column is: 

---

![SAM Alignment Column Information](https://www.michaelchimenti.com/wp-content/uploads/2018/06/sequence_string_sam-768x381.png)

*Referenced from: https://www.michaelchimenti.com/2018/06/three-useful-references-decode-sam-files/*

---
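
The second column, the FLAG, is the least self-explanatory of the bunch: it packs several yes/no properties of a read into a single number by adding up powers of two. Here is a minimal sketch that unpacks it using the bit meanings from the SAM specification. The `decode_sam_flag` helper and the short property labels are our own naming.

```bash
# Translate a SAM FLAG value into the properties whose bits are set
decode_sam_flag() {
    flag=$1
    out=""
    for pair in 1:paired 2:proper_pair 4:unmapped 8:mate_unmapped \
                16:reverse 32:mate_reverse 64:read1 128:read2 \
                256:secondary 512:qc_fail 1024:duplicate 2048:supplementary; do
        bit=${pair%%:*}      # the number before the colon
        name=${pair#*:}      # the label after the colon
        if [ $(( flag & bit )) -ne 0 ]; then
            out="$out $name"
        fi
    done
    echo "${out# }"
}

# Decode the four flag values from the alignment lines previewed above
for f in 99 147 77 141; do
    printf '%s\t%s\n' "$f" "$(decode_sam_flag "$f")"
done
```

For example, 99 = 1 + 2 + 32 + 64, so the first read shown above is paired, properly paired, the first read of its pair, and has a mate that aligned to the reverse strand. Flags 77 and 141 both include the `unmapped` bit, which matches the `*` placeholders in those two lines.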

<div class="alert alert-block alert-info">
    <p>Again, we won't be going over <i>all</i> of the functionality of <code>samtools</code>, but you can find most of the information for how to use the package within its documentation at the following link: <a href="http://www.htslib.org/">http://www.htslib.org/</a></p>
    <p>Aaron Quinlan, Ph.D. at the University of Utah has also developed a wonderful tutorial website for using <code>samtools</code> to manipulate and view SAM- and BAM-formatted files: <a href="http://quinlanlab.org/tutorials/samtools/samtools.html">http://quinlanlab.org/tutorials/samtools/samtools.html</a>!</p>
</div>

---

## Sorting and Indexing Aligned Sequencing Reads with `samtools`

With our reads aligned and stored in the SAM file format, we can now convert our alignment file to the BAM file format (the binary alternative to the plain-text SAM file format), and then **sort** and **index** it. 

### Converting Between the SAM and BAM File Formats using the `samtools view` Command

Generally, to do anything useful with alignment data, we first need to convert our SAM files to the BAM file format, as the compressed binary format is smaller and much more efficient for computers to work with. However, this makes the files *unreadable* to us humans. We can easily convert between the two file formats using the `samtools view` command. 

#### Converting from SAM to BAM 

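To convert from SAM to BAM format, we pass the `-b` flag to request BAM output, since `samtools view` writes SAM by default. The `-S` flag marks the input as a SAM file; older versions of `samtools` required it, while recent versions detect the input format automatically and simply ignore the flag. We can then use the UNIX redirect operator, `>`, to redirect the output to a file instead of printing the results to the terminal (or in our case, the Jupyter notebook!).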

In [17]:
%%bash

# Create a folder to hold the outputs for this section of the notebook
mkdir -p results/03_samtools/

# Use the samtools view command to convert our SAM output to BAM instead
samtools view -S -b results/02_STAR/yeastAligned.out.sam > results/03_samtools/yeastAligned.out.bam

#### Converting from BAM to SAM

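If we wanted to move in the opposite direction and convert from BAM to SAM, we could use the `samtools view` command again, but this time it's a little simpler as we don't need to provide any flags! By default, `samtools view` prints the alignment records as SAM text and leaves out the header lines.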

In [18]:
%%bash

# Use the samtools view command to preview our BAM file as a SAM file 
samtools view results/03_samtools/yeastAligned.out.bam | head

SRR6924582.1	99	chrXV	60516	255	76M	=	60639	198	TCAGATTTAGTCCATAAGGCAAACTTGTTACCACCTTTTCTAATGCTTAAAACGACACCGTTAATTTGGGAGTCGT	AAAAAEAEEEEAEEEEAEEEAEEEEEEEAEEA/EEEEEEEEEAEEEE6EE//AE/EEEEEAEAEEEAEEEEE<EEA	NH:i:1	HI:i:1	AS:i:149	nM:i:0
SRR6924582.1	147	chrXV	60639	255	75M	=	60516	-198	AATTCATCAATATCAGCACCTTTTCCTCTAAGTTGGAAAGACCATTTACCACCTTTAGCATTGGCTTCATCTTCC	E<EEE/EEE<<EE<EEEEEEEEEE/EEEE/EEEEEE/EEEEEEEEAEEEEAEAEEEEEEAEE/AEEEEEEAAAAA	NH:i:1	HI:i:1	AS:i:149	nM:i:0
SRR6924582.2	77	*	0	0	*	*	0	0	GGGTACTTCAAGTACTTACCGGAGAACTTGGTGGTCGAGACAACGGTGACAACAGAGTCGTTCTGCTCGACAGTG	AAA/AEEEEEEEEEEEEAEE/E/EAEEEEAEEEEEEEEE/EEEEEEE/EEEEEEAEEEE///EEEA//A/EEEAE	NH:i:0	HI:i:0	AS:i:45	nM:i:10	uT:A:1
SRR6924582.2	141	*	0	0	*	*	0	0	ACCACATCAAGGTCGACGGCCACTTGGGTAACTTGGGTAACGCCATCACTGTCGAGCAGAACGACTCTGTTGTCAC	AAA/AAEEEEA6EEE6EEEEEEEEEEEAEAEEE<AE/EEEE/EEEEE6EEEEEE/EEEEAAAEEEEE<AAA<EEEE	NH:i:0	HI:i:0	AS:i:45	nM:i:10	uT:A:1
SRR6924582.3	163	chrIV	818494	255	76M	=	818529	110	AGATAAATCACCACGAATCATTTGTTCGGTCATGTAACATGCAC

<div class="alert alert-block alert-info">
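    <p>If you wanted to save the BAM file as a SAM file, you could redirect the output to a file of your choice with the <code>&gt;</code> operator, similar to how we did for the SAM to BAM file conversion! Adding the <code>-h</code> flag also keeps the header lines in the output.</p>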
</div>
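
The second column of each alignment in the preview above is the bitwise `FLAG`. As a rough illustration of how such a value breaks down, the sketch below tests each bit defined in the SAM specification using plain shell arithmetic. The `decode_flag` function and its lowercase labels are our own naming for this example and are not part of `samtools`.

```shell
# Print a comma-separated label for every SAM FLAG bit set in a decimal value
decode_flag() {
    flag=$1
    out=""
    # bit:label pairs; bit values follow the SAM format specification,
    # labels are informal names chosen for this example
    for pair in \
        1:paired 2:proper_pair 4:unmapped 8:mate_unmapped \
        16:reverse 32:mate_reverse 64:read1 128:read2 \
        256:secondary 512:qc_fail 1024:duplicate 2048:supplementary
    do
        bit=${pair%%:*}
        name=${pair#*:}
        if [ $(( flag & bit )) -ne 0 ]; then
            out="${out:+$out,}$name"
        fi
    done
    printf '%s\n' "$out"
}

# Decode the FLAG values that appear in the preview above
for f in 99 147 77 141; do
    printf '%s\t%s\n' "$f" "$(decode_flag "$f")"
done
```

For example, 99 decodes to `paired,proper_pair,mate_reverse,read1`, while 77 decodes to `paired,unmapped,mate_unmapped,read1`, which matches the `*` placeholders in that read's chromosome and CIGAR columns. The `samtools flags` subcommand offers a similar lookup on the command line.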

### Sorting BAM Files with the `samtools sort` Command

You might have noticed that when we previewed our alignments, they were *not* in order by their chromosomal position. When you align FASTQ files, the alignments are produced in an arbitrary order with respect to their position in the reference genome, because they follow the order in which the sequences occurred in the input FASTQ files! 

Many downstream steps, including the indexing step below, require the alignments to be sorted so that they are ordered positionally according to their coordinates on the genome, or "genome order". We can sort our files using the `samtools sort` command!

In [19]:
%%bash

# Sort our BAM file and save the output to a new file
samtools sort results/03_samtools/yeastAligned.out.bam > results/03_samtools/yeastAligned.out.sorted.bam
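
To get a feel for what "genome order" means, here is a minimal sketch that sorts a few made-up records by chromosome and then by position using the standard `sort` utility. The read names and the three-column layout are hypothetical simplifications of a SAM record. Note one assumption that differs from the real tool: `samtools sort` orders chromosomes by their order in the file header, whereas this toy version orders them alphabetically.

```shell
export LC_ALL=C
tab=$(printf '\t')

# Toy records in "FASTQ order": read name, chromosome, position (made-up reads)
toy=$(printf '%s\t%s\t%s\n' \
    read1 chrXV 60516 \
    read2 chrIV 818494 \
    read3 chrXV 60639 \
    read4 chrIV 1200)

# Sort by chromosome name (column 2), then numerically by position (column 3)
sorted=$(printf '%s\n' "$toy" | sort -t "$tab" -k2,2 -k3,3n)
printf '%s\n' "$sorted"
```

After sorting, the two `chrIV` records come first (position 1200, then 818494), followed by the two `chrXV` records in increasing position.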

### Indexing BAM Files with the `samtools index` Command

Now that our BAM file is sorted by chromosomal coordinates, we can **index** it! Indexing a sorted BAM file allows us to quickly extract alignments that overlap particular genomic coordinates. It's also an important step to do as it's often required by genome viewers, such as IGV (http://software.broadinstitute.org/software/igv/), so that these software tools can quickly display alignments in each genomic region that you navigate to using their graphic interface! 

Indexing a BAM file creates a new companion file in the `BAI` file format (Binary Alignment Index), which we can easily create using the `samtools index` command. 

In [20]:
%%bash

# Index our BAM file and save the output to our desired location
samtools index results/03_samtools/yeastAligned.out.sorted.bam results/03_samtools/yeastAligned.out.sorted.bai

<div class="alert alert-block alert-info">
    <p>Unlike the <code>samtools view</code> and <code>samtools sort</code> commands, the <code>samtools index</code> command takes the output location and filename as a second argument, so you do not need to use the redirect operator (<code>&gt;</code>). To learn more, you can also view the <code>man</code> pages for <code>samtools</code> on the terminal!</p>
</div>
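
As an example of what the index makes possible, the sketch below counts the alignments that overlap a small window of the genome by passing a region to `samtools view`. The region is a hypothetical window chosen around the first alignment we previewed earlier, and the snippet assumes that `samtools` can locate the index file created above; it simply skips the query if the tool or the BAM file is not available.

```shell
bam=results/03_samtools/yeastAligned.out.sorted.bam

# Hypothetical window around the first alignment previewed earlier (chrXV:60516)
region=chrXV:60000-61000

if command -v samtools >/dev/null 2>&1 && [ -f "$bam" ]; then
    # -c counts the matching alignments instead of printing them
    samtools view -c "$bam" "$region"
else
    echo "samtools or $bam not available; skipping region query" >&2
fi
```

Dropping the `-c` flag would print the overlapping alignments themselves rather than their count.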

### Viewing Practical Summary Information about Alignments with the `samtools flagstat` Command

Finally, sometimes you would like to learn some basic statistics about the alignments in your BAM file! An easy way to do that is using the `samtools flagstat` command. This command parses through the bitwise `FLAG` column of all of your alignments and determines information such as the number of total reads, duplicate reads, mapped reads, paired reads, etc.! 

In [21]:
%%bash

# Use the samtools flagstat command to view summary statistics of the BAM file
samtools flagstat results/03_samtools/yeastAligned.out.sorted.bam

49962 + 0 in total (QC-passed reads + QC-failed reads)
47040 + 0 primary
2922 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
45933 + 0 mapped (91.94% : N/A)
43011 + 0 primary mapped (91.43% : N/A)
47040 + 0 paired in sequencing
23520 + 0 read1
23520 + 0 read2
43010 + 0 properly paired (91.43% : N/A)
43010 + 0 with itself and mate mapped
1 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
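
The percentages in this report can be reproduced by hand, which helps clarify what each one is measured against. The sketch below copies the counts from the output above and recomputes the three percentages with `awk`. The `pct` helper is our own name for this example.

```shell
export LC_ALL=C

# Counts copied from the flagstat output above
total=49962
primary=47040
secondary=2922
mapped=45933
primary_mapped=43011
paired=47040
properly_paired=43010

# Print 100 * numerator / denominator with two decimal places
pct() {
    awk -v n="$1" -v d="$2" 'BEGIN { printf "%.2f", 100 * n / d }'
}

mapped_pct=$(pct "$mapped" "$total")                  # mapped / all records
primary_mapped_pct=$(pct "$primary_mapped" "$primary") # primary mapped / primary
proper_pct=$(pct "$properly_paired" "$paired")         # properly paired / paired in sequencing

echo "mapped:          $mapped_pct%"
echo "primary mapped:  $primary_mapped_pct%"
echo "properly paired: $proper_pct%"

# Sanity checks on how the rows relate to each other
echo "primary + secondary        = $(( primary + secondary ))"
echo "primary mapped + secondary = $(( primary_mapped + secondary ))"
```

The last two lines show that the "in total" and "mapped" rows both include the 2922 secondary alignments, which is why "mapped" (91.94%) is slightly higher than "primary mapped" (91.43%).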


---

## Generating a Quality Report for Aligned Files using `FastQC` 

Now that we've generated our alignments and sorted them, we can generate a quality report for them as well to view in the browser! Luckily, we already know how to do this--we can use the `FastQC` software once again!

In [22]:
%%bash

# Generate another folder for FastQC output 
mkdir -p results/04_fastqc_output_alignments/

# Use FastQC to calculate quality control statistics for our alignment file 
fastqc results/03_samtools/yeastAligned.out.sorted.bam -o results/04_fastqc_output_alignments/

Started analysis of yeastAligned.out.sorted.bam
Approx 5% complete for yeastAligned.out.sorted.bam
Approx 10% complete for yeastAligned.out.sorted.bam
Approx 15% complete for yeastAligned.out.sorted.bam
Approx 20% complete for yeastAligned.out.sorted.bam
Approx 25% complete for yeastAligned.out.sorted.bam
Approx 30% complete for yeastAligned.out.sorted.bam
Approx 35% complete for yeastAligned.out.sorted.bam
Approx 40% complete for yeastAligned.out.sorted.bam
Approx 50% complete for yeastAligned.out.sorted.bam
Approx 55% complete for yeastAligned.out.sorted.bam
Approx 60% complete for yeastAligned.out.sorted.bam
Approx 65% complete for yeastAligned.out.sorted.bam
Approx 70% complete for yeastAligned.out.sorted.bam
Approx 75% complete for yeastAligned.out.sorted.bam
Approx 80% complete for yeastAligned.out.sorted.bam
Approx 85% complete for yeastAligned.out.sorted.bam
Approx 95% complete for yeastAligned.out.sorted.bam
Approx 100% complete for yeastAligned.out.sorted.bam


Analysis complete for yeastAligned.out.sorted.bam


<div class="alert alert-block alert-warning">
    <p>We'll skip inspecting this output for now and will save it for the end!</p>
</div>

---

## Generating a Gene Expression Counts Matrix using the `featureCounts` Program

The final step of the command-line portion for RNA-sequencing data processing is determining the number of reads that have mapped to each gene. 

While there are many tools that can use BAM files as input and output the number of reads (or counts) associated with each feature of interest (e.g., genes, exons, transcripts, promoters, etc.), one commonly used tool is `featureCounts`. 

<div class="alert alert-block alert-info">
    <p>While <code>featureCounts</code> reports the "raw" counts of reads that map to a single location (unique mappings), it is best used for counting at the gene level, where it sums the number of reads associated with each of the exons belonging to a specific gene. There are some cases where you would like to use other tools instead, such as when accounting for multiple transcripts of a given gene, or when considering reads that can map to multiple regions of a genome.</p>
</div>

Generally, during this step, the genomic coordinates of where the read is mapped (contained within the BAM file) are cross-referenced with the genomic coordinates of features (contained within GTF files), such as exons, genes, or transcripts. In our case, we are considering **genes** as our features!
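
The cross-referencing idea can be sketched with a few made-up intervals: a read is assigned to a gene when the two lie on the same chromosome and their coordinates overlap. This is only a toy illustration of the concept under simplified assumptions (hypothetical genes and reads, no strand or exon handling), not the algorithm that `featureCounts` actually implements.

```shell
workdir=$(mktemp -d)

# Toy gene annotations: gene, chromosome, start, end (made-up values)
printf '%s\t%s\t%s\t%s\n' \
    geneA chrI 100 500 \
    geneB chrI 800 1200 \
    geneC chrII 50 300 > "$workdir/genes.tsv"

# Toy alignments: chromosome, start, end (made-up values)
printf '%s\t%s\t%s\n' \
    chrI 120 195 \
    chrI 450 525 \
    chrI 600 675 \
    chrI 1000 1075 \
    chrII 10 85 \
    chrII 400 475 > "$workdir/reads.tsv"

counts=$(awk -F'\t' '
    # First file: remember each gene interval and start its count at zero
    NR == FNR { gchr[$1] = $2; gstart[$1] = $3 + 0; gend[$1] = $4 + 0; n[$1] = 0; next }
    # Second file: a read counts toward a gene if the two intervals overlap
    {
        for (g in gchr)
            if (gchr[g] == $1 && $2 + 0 <= gend[g] && $3 + 0 >= gstart[g])
                n[g]++
    }
    END { for (g in n) print g "\t" n[g] }
' "$workdir/genes.tsv" "$workdir/reads.tsv" | LC_ALL=C sort)

printf '%s\n' "$counts"
rm -rf "$workdir"
```

Here two reads overlap `geneA`, one overlaps `geneB`, one overlaps `geneC`, and the remaining two reads fall between genes and are not counted.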

---

![Comparison Between GTF File and Read Alignment](img/day1_05_counts-matrix-gtf-diagram.png)

*Referenced from: https://hbctraining.github.io/Intro-to-rnaseq-hpc-O2/lessons/05_counting_reads.html*

---

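Generally, `featureCounts` takes three main inputs: 

- `-a` genome annotation file (`GTF`) where each entry is a feature
- `-o` output counts file
- the last parameter below is the input read file (`bam`)

Our command below also passes the `-p` flag, which tells `featureCounts` that the input contains paired-end reads. In the version used here (v2.0.3), `-p` on its own still counts individual reads; adding the `--countReadPairs` flag would count read pairs (fragments) instead.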

In [23]:
%%bash

# Create a directory to hold the output for this portion of the notebook
mkdir -p results/05_featureCounts/

# Run the featureCounts command to generate our counts matrix
featureCounts -a data/yeast.gtf -p \
    -o results/05_featureCounts/counts.txt \
    results/03_samtools/yeastAligned.out.sorted.bam


        =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
          =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
            ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
              ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
	  v2.0.3

||                                                                            ||
||             Input files : 1 BAM file                                       ||
||                                                                            ||
||                           yeastAligned.out.sorted.bam                      ||
||                                                                            ||
||             Output file : counts.txt                                       ||
||                 Summary : counts.txt.summary                               ||
||              Paired-end : yes                                              ||
||        Count read pairs : no                 
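
The counts file written by `featureCounts` is a tab-delimited table: a comment line starting with `#`, then a header row with the columns `Geneid`, `Chr`, `Start`, `End`, `Strand`, and `Length`, followed by one count column per input BAM file. As a minimal sketch of how to pull out just the gene IDs and counts, the example below builds a toy file with that layout and trims it with `grep` and `cut`. The gene names, coordinates, and counts are made up for illustration.

```shell
counts_file=$(mktemp)

# Toy file mimicking the featureCounts layout (all values are made up)
{
    printf '# Program:featureCounts (toy example)\n'
    printf 'Geneid\tChr\tStart\tEnd\tStrand\tLength\tsample.bam\n'
    printf 'geneA\tchrI\t100\t500\t+\t401\t12\n'
    printf 'geneB\tchrI\t800\t1200\t-\t401\t0\n'
    printf 'geneC\tchrII\t50\t300\t+\t251\t7\n'
} > "$counts_file"

# Drop the comment line, then keep the gene ID (column 1) and the count (column 7)
matrix=$(grep -v '^#' "$counts_file" | cut -f1,7)
printf '%s\n' "$matrix"

# Sum the counts across all genes, skipping the header row
total=$(printf '%s\n' "$matrix" | awk -F'\t' 'NR > 1 { s += $2 } END { print s }')
echo "Total assigned reads: $total"

rm -f "$counts_file"
```

The same `grep` and `cut` steps could be applied to `results/05_featureCounts/counts.txt` to get a compact two-column table for downstream analysis.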

<div class="alert alert-block alert-success">
    <p><b>Exercise:</b> What are the contents of the <code>featureCounts</code> summary file that was generated?</p>
</div>

In [24]:
%%bash

#TODO


---

## Generate a Summary Quality Control Document with `MultiQC` 

Now that we've finished running all of the command-line portions of the RNA-sequencing analysis pipeline, instead of viewing quality reports separately, we can consolidate everything into a single document using the `MultiQC` program! `MultiQC` is a tool that aggregates quality reports from commonly-used bioinformatics programs into a single HTML document that can be used as a quick reference for an experiment, shared with collaborators, or just used to make you look cool! It visualizes statistics from different steps in the pipeline, plots multiple samples together, and is also well-documented via their GitHub page! 

In [25]:
%%bash

# Use the multiqc command to generate summary reports for our data files 
multiqc results -o results/06_multiqc_output


  /// MultiQC 🔍 | v1.17

|           multiqc | Search path : /home/dtv004/module-3-rnaseq/Day1_materials/results
|         searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 32/32  results/ref/transcriptInfo.tab
|    feature_counts | Found 1 reports
|              star | Found 1 reports
|            fastqc | Found 3 reports
|           multiqc | Report      : results/06_multiqc_output/multiqc_report.html
|           multiqc | Data        : results/06_multiqc_output/multiqc_data
|           multiqc | MultiQC complete


---

<div class="alert alert-block alert-danger">
    <b>Let's download the output files so we can open the <code>html</code> reports for our entire RNA-sequencing analysis pipeline in our browser!</b> 
</div>