**RNASeq Basics**

RNASeq data processing can be broadly divided into two types: 1) reference-genome based (most commonly used) and 2) de novo, where no reference genome is available (uncommon, mostly used for novel species). For this course, we will focus solely on reference-based RNA-Seq data analysis. Reference genomes for many species are available through resources such as NCBI, Ensembl, etc. "Genome builds" are periodically released as more regions and genes are annotated within a genome (eg: hg19, hg38 (https://www.gencodegenes.org/human/)).

**RNASeq data processing**

After sequencing is completed, you have a collection of sequencing reads for each sample in the form of a *fastq file*. In a reference-based RNAseq experiment, each sequencing read from a fastq file is *mapped* against the reference genome. There are several programs/workflows available to map reads to a genome, such as the TopHat suite (https://ccb.jhu.edu/software/tophat/index.shtml), the STAR aligner (https://github.com/alexdobin/STAR) and the StringTie suite (https://ccb.jhu.edu/software/stringtie/). RNASeq read aligners are *splicing-aware*, meaning they can map a read whose sequence spans a splice junction, i.e. a read that skips introns (or whole exons) relative to the genome because of splicing events occurring in RNA transcripts.
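To make the fastq format concrete, here is a minimal Python sketch that reads four-line fastq records and decodes the quality string. It assumes uncompressed input with exactly four lines per record and Phred+33 quality encoding; the read names and sequences are made up for illustration.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) from a four-line-per-record fastq stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line is ignored
        quality = handle.readline().rstrip("\n")
        yield header[1:], sequence, quality  # drop the leading '@'


def phred_scores(quality, offset=33):
    """Convert a quality string to integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality]


# Two made-up reads, only for illustration.
example = io.StringIO("@read1\nACGT\n+\nII5!\n@read2\nGGCA\n+\nIIII\n")
records = list(read_fastq(example))
for name, sequence, quality in records:
    print(name, sequence, phred_scores(quality))
```

In practice fastq files are often gzip-compressed, in which case the file would be opened with `gzip.open(path, "rt")` instead of a plain `open`.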

**Common things to consider after read mapping is completed :** 

1) What fraction of total reads successfully aligned to the genome? A low fraction often indicates a problem such as contamination within your sample.
2) What fraction mapped uniquely to the genome? Since each read is derived from an RNA molecule, it should correspond to a single location within the genome. However, a significant fraction of reads can be expected NOT to map uniquely to the genome, due to the existence of repetitive regions (eg: rRNA). It is difficult to trace back the region from which a read was sequenced if the read derives from a repetitive region that is present at multiple locations within the genome.
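The two checks above reduce to simple ratios. The sketch below is only an illustration: the function name and the read counts are hypothetical, and in a real run the three numbers would come from the summary report of whichever aligner you used.

```python
def mapping_summary(total_reads, uniquely_mapped, multi_mapped):
    """Return the aligned, unique and multi-mapped fractions of a sample."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    aligned = uniquely_mapped + multi_mapped
    return {
        "aligned_fraction": aligned / total_reads,
        "unique_fraction": uniquely_mapped / total_reads,
        "multi_fraction": multi_mapped / total_reads,
    }


# Hypothetical counts for one sample.
summary = mapping_summary(total_reads=1_000_000,
                          uniquely_mapped=850_000,
                          multi_mapped=100_000)
for key, value in summary.items():
    print(f"{key}: {value:.1%}")
```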

**Gene Expression & Normalization**

Once the reads have been mapped to their corresponding locations across the genome, we can count the number of reads mapping to each gene/transcript/exon. Since the number of reads generated from a gene increases with both the length of the gene and the sequencing depth, the read counts for a sample must be normalized. There are several ways to perform read count normalization: 1) RPKM/FPKM (Reads/Fragments Per Kilobase of transcript per Million mapped reads) 2) TPM (Transcripts Per Million) 3) RPM (Reads Per Million mapped reads). RPM corrects for sequencing depth only, whereas RPKM/FPKM and TPM also correct for gene length.
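The three normalizations can be written out in a few lines of Python. This is a sketch under simplifying assumptions: the gene names, counts and lengths are made up, and the "total mapped reads" is approximated by the sum of the counts assigned to genes.

```python
def rpm(counts):
    """Reads Per Million: corrects for sequencing depth only."""
    total = sum(counts.values())
    return {gene: c / total * 1e6 for gene, c in counts.items()}


def rpkm(counts, lengths_bp):
    """Reads Per Kilobase per Million: corrects for depth and gene length."""
    total = sum(counts.values())
    return {gene: c / (lengths_bp[gene] / 1e3) / (total / 1e6)
            for gene, c in counts.items()}


def tpm(counts, lengths_bp):
    """Transcripts Per Million: length-normalize first, then rescale to 1e6."""
    rate = {gene: c / (lengths_bp[gene] / 1e3) for gene, c in counts.items()}
    scale = sum(rate.values())
    return {gene: r / scale * 1e6 for gene, r in rate.items()}


# Made-up example: geneB is 3x longer and collects 3x more reads.
counts = {"geneA": 100, "geneB": 300}
lengths_bp = {"geneA": 1000, "geneB": 3000}

print("RPM :", rpm(counts))
print("RPKM:", rpkm(counts, lengths_bp))
print("TPM :", tpm(counts, lengths_bp))
```

In this toy example RPM reports geneB as three times higher than geneA, while RPKM and TPM report the two genes as equal, because the extra reads on geneB are fully explained by its length. TPM values always sum to one million within a sample, which is not true of RPKM.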

## Some important RNASeq terminology :
We will be using these terms throughout the course.

1) Fastq: This is a standardized file format to store sequencing reads for each sample. 

2) Reference genome : A representative set of sequences, assembled through previous studies, that best represents the organization of genes within a genome.

3) Annotation : Each gene within a reference genome is assigned coordinates; eg: chr1:10,000-12,000. Each gene is further *annotated* by defining the location of UTRs, exons and introns within the gene. Commonly used annotations are provided by GENCODE (https://www.gencodegenes.org/human/) and UCSC (https://genome.ucsc.edu/cgi-bin/hgTables).

4) Alignment / Alignment Tool / Aligners : Alignment is the process of identifying the region within the genome from which a read was derived. This is done by matching the read sequence against the genome sequence to find the best match; a small number of mismatches is usually tolerated, since reads can contain sequencing errors or genuine variants.

5) SAM / BAM files : A SAM file is a standardized file format to store alignment records for each read (https://samtools.github.io/hts-specs/SAMv1.pdf). A BAM file is the compressed binary version of a SAM file. BAM files are not human readable, unlike SAM files.

6) Gene Expression : Broad term for the number of reads derived from a gene (could be a normalized count).

7) Differential Gene Expression (DGE) : A gene is differentially expressed when its expression is significantly higher/lower between two experimental conditions.
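To tie the alignment and SAM terms together, the following sketch parses a single SAM alignment line using the mandatory columns and flag bits defined in the SAM specification. The read itself is made up, and the function is an illustration rather than a replacement for a proper SAM/BAM library.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the genome


def parse_sam_line(line):
    """Extract position, strand and splicing information from one SAM record."""
    fields = line.rstrip("\n").split("\t")
    flag = int(fields[1])
    pos = int(fields[3])  # 1-based leftmost mapping position
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(fields[5])]
    ref_span = sum(n for n, op in ops if op in REF_CONSUMING)
    return {
        "read": fields[0],
        "chrom": fields[2],
        "start": pos,
        "end": pos + ref_span - 1,
        "mapped": not flag & 0x4,             # bit 0x4 marks an unmapped read
        "reverse_strand": bool(flag & 0x10),  # bit 0x10 marks the reverse strand
        "primary": not flag & 0x900,          # neither secondary nor supplementary
        "spliced": any(op == "N" for _, op in ops),  # N = skipped region (intron)
    }


# Made-up 50 bp read that spans a 500 bp intron.
line = "\t".join(["read1", "0", "chr1", "10001", "255", "30M500N20M",
                  "*", "0", "0", "A" * 50, "I" * 50])
record = parse_sam_line(line)
print(record)
```

The `N` operation in the CIGAR string is how a splicing-aware aligner records that part of the read maps on the far side of an intron.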

### The downloaded SRA files are next processed by following these steps:
1. `fastq-dump` from the SRA toolkit to generate .fastq files
2. `FastQC`<sup>[3](#ref3)</sup> to perform quality control and generate a QC report for the input RNA-seq data
3. `STAR`<sup>[3](#ref3)</sup> for the read alignment
4. `featureCounts`<sup>[4](#ref4)</sup> for assigning reads to genes
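As a rough sketch of how these four steps chain together, the Python helpers below only build the command lines as argument lists, using the same options as the commands in this section; nothing is executed. The accession, file names and paths are placeholders, and the helper names are hypothetical.

```python
import shlex


def fastq_dump_cmd(accession):
    # --split-spot is a long-form option, so it takes two dashes.
    return ["fastq-dump", "-I", "--split-spot", accession]


def fastqc_cmd(fastq, out_dir):
    return ["fastqc", "-o", out_dir, fastq]


def star_cmd(fastq, genome_dir, gtf, prefix, threads=4, gzipped=False):
    cmd = ["STAR",
           "--genomeDir", genome_dir,
           "--sjdbGTFfile", gtf,
           "--runThreadN", str(threads),
           "--outFileNamePrefix", prefix,
           "--readFilesIn", fastq,
           "--outSAMtype", "BAM", "Unsorted",
           "--outSAMmode", "Full"]
    if gzipped:
        # Only needed when the fastq file is gzip-compressed.
        cmd += ["--readFilesCommand", "zcat"]
    return cmd


def featurecounts_cmd(bams, gtf, out_file, threads=4,
                      feature="exon", attribute="gene_id"):
    return ["featureCounts", "-T", str(threads), "-t", feature,
            "-g", attribute, "-a", gtf, "-o", out_file, *bams]


# Placeholder names, for illustration only.
commands = [
    fastq_dump_cmd("SRR0000000"),
    fastqc_cmd("SRR0000000.fastq", "fastQC_output"),
    star_cmd("SRR0000000.fastq", "star_index", "genes.gtf", "aligned/SRR0000000_"),
    featurecounts_cmd(["sample1.bam", "sample2.bam"], "genes.gtf", "counts.txt"),
]
for cmd in commands:
    print(shlex.join(cmd))
```

Each list could be passed to `subprocess.run(cmd, check=True)` to actually run the step, provided the tools are installed and the placeholder paths are replaced with real ones.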

#### 1. Extract fastq reads from SRA files using the NCBI SRA toolkit (https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software)

1) Install the SRA toolkit as follows : 
    
    conda install -c bioconda sra-tools

2) Extract the reads for an accession as follows :

    fastq-dump -I --split-spot < SRA Accession Number >

#### 2. Perform quality control checks on fastq files using FastQC (https://www.bioinformatics.babraham.ac.uk/projects/download.html)

    fastqc -o ../fastQC_output < fastq file >

#### 3. Perform genome alignment for fastq reads

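    STAR \
        --genomeDir < path to STAR genome index > \
        --sjdbGTFfile < path to genome gtf file > \
        --runThreadN < # of processors to use > \
        --outFileNamePrefix < output filename prefix, including the path > \
        --readFilesIn < fastq filenames > \
        --readFilesCommand zcat \
        --outSAMtype BAM Unsorted \
        --outSAMmode Full

`--readFilesCommand zcat` is needed only when the fastq files are gzip-compressed (`.fastq.gz`); omit it for uncompressed fastq files.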

#### 4. Perform read counting on annotated genes using featureCounts (http://subread.sourceforge.net/)

    featureCounts \
        -T < # of threads > \
        -t < feature type to count reads on, eg: exon, transcript > \
        -g < GTF attribute used to group the counts, eg: gene_id, transcript_id > \
        -a < path to genome GTF file > \
        -o < output filename > \
        < list of bam files to perform read counting on >

featureCounts can be run on multiple bam files at the same time. The counts for each bam file are written to the same output file as separate tab-separated columns.
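The combined output table can be loaded with a few lines of Python. The sketch below assumes the usual featureCounts layout: a leading comment line starting with `#`, then a header row with `Geneid`, `Chr`, `Start`, `End`, `Strand` and `Length`, followed by one count column per bam file. The genes, coordinates and counts in the example are made up.

```python
import csv
import io


def read_featurecounts(handle):
    """Return (gene lengths, per-sample counts) from a featureCounts table."""
    rows = (line for line in handle if not line.startswith("#"))
    reader = csv.reader(rows, delimiter="\t")
    header = next(reader)
    samples = header[6:]  # the columns after 'Length' are the bam files
    lengths = {}
    counts = {sample: {} for sample in samples}
    for fields in reader:
        gene = fields[0]
        lengths[gene] = int(fields[5])
        for sample, value in zip(samples, fields[6:]):
            counts[sample][gene] = int(value)
    return lengths, counts


# Made-up table with two genes and two bam files.
table = io.StringIO(
    "# comment line written by featureCounts\n"
    "Geneid\tChr\tStart\tEnd\tStrand\tLength\tsampleA.bam\tsampleB.bam\n"
    "geneA\tchr1\t100\t1099\t+\t1000\t100\t80\n"
    "geneB\tchr1\t5000\t7999\t-\t3000\t300\t450\n"
)
lengths, counts = read_featurecounts(table)
print(lengths)
print(counts)
```

The `Length` column gives the gene lengths that length-aware normalizations such as RPKM or TPM require.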

## Example Dataset Features : 

https://github.com/MaayanLab/Zika-RNAseq-Pipeline

### Publication

Wang Z and Ma'ayan A. An open RNA-Seq data analysis pipeline tutorial with an example of reprocessing data from a recent Zika virus study.

The repository provides a detailed tutorial with instructions to process and analyze the data, along with detailed IPython notebooks. A Docker image is also available, although it is not possible to run Docker images on TSCC.