# Module 11: RNA Sequencing Analysis

## Overview of RNA Sequencing

RNA sequencing is similar to DNA sequencing. The product of RNA sequencing, the collection of RNA transcripts produced from the DNA, is called either the "transcriptome" (leading to the term transcriptomics) or "gene expression" data.

1. Samples of interest are selected
2. RNA is isolated from the samples of interest in library prep
3. cDNA is generated from the RNA
    - The generated cDNA is fragmented
    - Fragments are selected based on size
    - Adapters (linkers) are added for paired-end sequencing
4. Sequencing occurs in the same way as DNA sequencing
    - Generates 100s of millions of paired reads with 10s of billions of bases
    - FASTQ is the usual output format
5. Reads are mapped to:
    - Genome
    - Transcriptome
    - Predicted exon junctions
6. Downstream analysis can then occur

### Brief History of Transcriptomics

- 1995: P. Brown, et al.
    - Gene expression profiling using spotted cDNA microarray
    - Expression levels of known genes
- 2002: Affymetrix
    - Whole genome expression profiling using tiling array
    - Identification and profiling of novel genes and splicing variants
- 2008: Many groups, mRNA-seq:
    - Directed sequencing of mRNAs using NGS

### Features of RNA Sequencing Data

#### Replicates

We can often see replicates in RNA sequencing:

- Biological Replicate
    - Multiple isolations of cells showing the same phenotype, stage, or other experimental condition
    - Environmental factors, growth conditions, time, etc.
    - The correlation coefficient between biological replicates is usually 0.92-0.98
- Technical Replicate
    - Multiple instances of sequence generation
    - The same library is sequenced in different flow cells, lanes, indexes, etc.
    - Correlation coefficient should be even higher than biological replicates
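
The replicate correlations mentioned above can be checked directly from per-gene counts. The following is a minimal sketch, assuming Pearson correlation on log-transformed counts (a common choice, but not one the notes specify); the count values are made up for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def log_counts(counts):
    """log2(count + 1), so a few highly expressed genes do not dominate."""
    return [math.log2(c + 1) for c in counts]

# Hypothetical per-gene read counts for two replicates (made-up numbers).
replicate_a = [12, 250, 0, 1800, 43, 7, 960]
replicate_b = [15, 231, 1, 1725, 51, 4, 1010]

r = pearson(log_counts(replicate_a), log_counts(replicate_b))
print(f"Pearson r on log2(count + 1): {r:.3f}")
```

A replicate pair whose correlation falls well below the expected range is worth inspecting before any downstream analysis.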

#### Alternative Splicing

Alternative splicing occurs when the same gene can be transcribed into different isoforms of mRNA. One way this can occur is when a gene has multiple exons, which are spliced together in different combinations.

Alternative splicing is shown by 35-60% of human genes, with some genes having huge numbers of isoforms.
- *slo*: > 500 isoforms
- *neurexin*: > 1,000 isoforms
- *DSCAM*: > 38,000 isoforms
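
Isoform counts this large come from combinatorics: when a gene has several clusters of mutually exclusive exons, the number of possible isoforms is the product of the choices in each cluster. The sketch below works this out for *DSCAM*, assuming the commonly cited exon-variant counts for the *Drosophila* gene (these per-exon numbers are not given in the notes above).

```python
from math import prod

# Number of mutually exclusive alternatives in each of the four variable
# exon clusters of Drosophila Dscam (commonly cited values, assumed here).
dscam_exon_variants = {"exon 4": 12, "exon 6": 48, "exon 9": 33, "exon 17": 2}

# One alternative is chosen per cluster, so the choices multiply.
isoform_count = prod(dscam_exon_variants.values())
print(isoform_count)  # 38016
```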

### Challenges of RNA Sequencing

#### Mapping Reads to the Reference Genome

- Sequencing errors and polymorphisms
- It is more difficult to map reads that span splice junctions (for complex transcriptomes)
    - Having a list or other information about junction constitutions can be essential for effective RNA mapping, but this is only beneficial for known isoforms
    - A splice-aware mapper can help map junctions in novel isoforms
- Repetitive sequences
    - A significant portion of sequence reads match multiple locations in the genome
    - Obtaining longer sequence reads, or paired-end sequencing strategies, can help alleviate this multi-matching problem

#### Novel Splicing Variants and Quantification

- Discovery of novel splicing variants
    - Reconstruction of complete splice forms
    - Reliability (assignment of a p-value)
- Quantifying the expression levels of recently duplicated genes

### When to Use RNA Sequencing over DNA Sequencing

Applications for RNA sequencing include: differential expression, gene fusion, alternative splicing, novel transcribed regions, allele-specific expression, RNA editing, transcriptome for non-model organisms, small RNA (miRNA) sequencing.

*Functional studies* use RNA sequencing to determine whether an experimental condition has a pronounced effect on gene expression (since the genome is held constant). RNA is the functional output of the genome!
- Drug treated vs untreated cell lines, knockdown vs wildtype mice, etc.

*Some molecular features* can only be observed at the RNA level. These features also make it difficult to predict the transcript sequence from the genome itself.
- Alternative isoforms, fusion transcripts, RNA editing motifs, etc.

RNA sequencing is useful for *interpreting mutations* that don't have an obvious effect on the protein sequence. This can include *regulatory* mutations that affect which mRNA isoform is expressed (and how much).
- Splice sites, promoters, exonic/intronic splicing motifs, etc.

RNA sequencing assists in *prioritizing protein-coding* somatic mutations, which are often heterozygous.
- Unexpressed genes mean a mutation is "less interesting"
- Expression from only a wildtype allele may suggest a loss-of-function (haploinsufficiency)
- Expression of a mutant allele may suggest a candidate for drug targeting

### Technical and Methodological Considerations

#### RNA-seq vs Microarray

RNA-seq has a higher resolution, and can apply the same experimental protocol to various purposes. It can be used to characterize novel transcripts and splicing variants, as well as to profile the expression of known transcripts.

RNA Microarrays are specialized, and can only profile known transcripts.
- SNP Array: detects SNPs
- Junction Array: maps exon junctions
- Gene Fusion Array: detects gene fusions

## Process (Reads to Differential Expression)

1. Raw Sequence Data
    - FASTQ files get QC'd with FastQC/R
2. Reads Mapping
    - Takes Raw Sequence Data
    - SAM/BAM files get QC'd with RNA-SeQC
    - BWA, Bowtie
        - Generate Unspliced Mapping
    - TopHat, MapSplice
        - Generate Spliced Mapping
    - [STAR](https://github.com/alexdobin/STAR) Aligner ([docs](https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf)) is one of the most commonly used aligner programs today
3. Expression Quantification
    - Takes Mapped Reads
    - Cufflinks
        - Generates FPKM (Fragments Per Kilobase per Million reads mapped)
        - Generates RPKM (Reads Per Kilobase per Million reads mapped)
    - Summarized read counts and other metrics
4. Differential Expression Testing
    - Takes Expression Quantification Data
    - Cuffdiff
    - DESeq, edgeR, etc.
        - R packages for analyzing the expression data
    - Generates a list of differentially expressed genes, with statistical analyses that identify genes of interest
5. Functional Interpretation
    - Takes Differential Expression data
    - Function enrichment, inference of networks, integration with other data, etc.
6. Biological insights and hypotheses

### QC

#### Sequencing QC

Information to check:
- Basic information
    - Total reads, sequence length, etc.
- Per base sequence quality
- Overrepresented sequences
    - This is more important for RNA than DNA, since the adapter sequences or barcodes may not be properly trimmed
    - May see repeats at the beginning or end of the reads
    - May need additional trimming
    - Some software can trim off known adapter sequences
- GC content
    - Coverage difference is more important in RNA than DNA, since the coverage translates to expression levels
    - May need to adjust for changes
- Duplication level
    - Mark duplicates to correctly quantify gene expression levels
- and more

Frequently performed with FastQC/R
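
To make the "per base sequence quality" check concrete, here is a minimal sketch of what such a report computes. It assumes four-line FASTQ records and Phred+33 quality encoding; the function names and the two example reads are hypothetical, and this is not how FastQC itself is implemented.

```python
import io

def read_fastq(handle):
    """Yield (name, sequence, quality string) from a 4-line-per-record FASTQ."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            break
        sequence = handle.readline().rstrip()
        handle.readline()  # '+' separator line
        quality = handle.readline().rstrip()
        yield header[1:], sequence, quality

def mean_quality_per_position(records, offset=33):
    """Mean Phred score at each read position (Phred+33 encoding assumed)."""
    totals, counts = [], []
    for _, _, quality in records:
        for i, char in enumerate(quality):
            if i == len(totals):
                totals.append(0)
                counts.append(0)
            totals[i] += ord(char) - offset
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]

# Two made-up reads; the second one degrades toward its 3' end.
example = io.StringIO(
    "@read1\nACGT\n+\nIIII\n"
    "@read2\nACGA\n+\nII5#\n"
)
per_position = mean_quality_per_position(read_fastq(example))
print(per_position)  # [40.0, 40.0, 30.0, 21.0]
```

A drop in mean quality toward the end of the reads, as in this toy input, is the usual signal that trimming may be needed.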

#### Read Mapping QC

Need to pay attention to exon-exon junction reads, where one read maps across a junction!

Information to check:
- Percentage of reads properly mapped
- Percentage of reads uniquely mapped
    - Reads that are not uniquely mapped align to multiple sites and are generally considered improperly mapped
- Percentage of reads in intron, exon, and intergenic regions
    - Should not have many intergenic reads, as they are non-transcribed regions
- 5' or 3' bias
    - Excess reads from one direction
    - Degraded read quality from one direction
- Percentage of expressed genes

Frequently performed with RNA-SeQC
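
Reads that map across an exon-exon junction can be recognized in SAM/BAM output by an `N` (skipped region) operation in the CIGAR string. The sketch below is a minimal illustration, assuming alignments are already available as (name, POS, CIGAR) tuples; the function name and the three example alignments are hypothetical.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def splice_junctions(pos, cigar):
    """Return (intron_start, intron_end) pairs, 1-based inclusive, for each N operation.

    pos is the 1-based leftmost mapping position from the SAM POS column.
    """
    junctions = []
    ref_pos = pos
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":  # skipped reference region, i.e. an intron
            junctions.append((ref_pos, ref_pos + length - 1))
        if op in "MDN=X":  # operations that consume the reference
            ref_pos += length
    return junctions

# Hypothetical alignments: (read name, POS, CIGAR).
alignments = [
    ("read1", 100, "50M"),                  # unspliced
    ("read2", 180, "20M500N30M"),           # spans one junction
    ("read3", 150, "5S10M200N25M300N10M"),  # spans two junctions
]
spliced = [name for name, pos, cigar in alignments if splice_junctions(pos, cigar)]
print(f"{len(spliced) / len(alignments):.0%} of reads span a junction")
```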

### Expression Quantification

Expression quantification is the process of converting read counts into quantified expression levels.

Count data is the summary of the mapped reads to the coding sequence, gene, or exon level. 
- *CPM*: Counts per million
$$
CPM = \frac{\text{\# of fragments mapped to the gene}}{\text{\# of all mapped fragments (millions)}}
$$
- *RPKM*: Reads aligned Per Kilobase of exon per Million reads mapped
    - Not necessarily a consistent measure of expression abundance or relative molar concentration, as highly expressed genes may dominate and it does not normalize over the genome
- *FPKM*: Fragments Per Kilobase of exon per Million fragments mapped
    - Same idea as RPKM, but for paired-end sequencing
    - Not necessarily a consistent measure of expression abundance or relative molar concentration, as highly expressed genes may dominate and it does not normalize over the genome

$$
FPKM = \frac{\text{\# of fragments mapped to the gene}}{\text{\# of all mapped fragments (millions)} \cdot \text{Transcript length (KB)}}
$$
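
The CPM and FPKM formulas above translate directly into code. This is a minimal sketch with hypothetical gene names, counts, and transcript lengths chosen so the library totals exactly one million fragments.

```python
def cpm(counts):
    """Counts per million: fragments per gene / total mapped fragments (millions)."""
    total_millions = sum(counts.values()) / 1e6
    return {gene: n / total_millions for gene, n in counts.items()}

def fpkm(counts, lengths_bp):
    """FPKM: CPM divided by the transcript length in kilobases."""
    per_million = cpm(counts)
    return {gene: per_million[gene] / (lengths_bp[gene] / 1e3) for gene in counts}

# Hypothetical fragment counts and transcript lengths in base pairs.
counts = {"geneA": 500, "geneB": 1_000, "geneC": 998_500}
lengths_bp = {"geneA": 1_000, "geneB": 4_000, "geneC": 2_000}

cpm_values = cpm(counts)
fpkm_values = fpkm(counts, lengths_bp)
for gene in counts:
    print(gene, round(cpm_values[gene], 1), round(fpkm_values[gene], 1))
```

Note that geneB has twice the CPM of geneA but half the FPKM, because its transcript is four times longer.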

Raw counts of mapped reads are not directly comparable, because they depend on factors other than expression level, so they need further adjustment.

The number of reads is roughly proportional to:
- The length of the gene
- The total number of reads in the library

Scaling by total mapped reads (sequencing depth) can be substantially influenced by a small proportion of highly expressed genes, leading to **normalization issues**.
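
A small worked example shows how this happens. In the hypothetical data below, both libraries have the same depth, and geneA, geneB, and geneC keep the same proportions relative to each other. A single highly expressed gene (geneX) in the treated sample takes half the reads, so depth-scaled values for every other gene appear to halve.

```python
def cpm(counts):
    """Counts per million, scaling only by total mapped fragments."""
    total = sum(counts.values())
    return {gene: n * 1e6 / total for gene, n in counts.items()}

# Hypothetical libraries, each with 1,000,000 mapped fragments.
control = {"geneA": 300_000, "geneB": 300_000, "geneC": 400_000, "geneX": 0}
treated = {"geneA": 150_000, "geneB": 150_000, "geneC": 200_000, "geneX": 500_000}

control_cpm = cpm(control)
treated_cpm = cpm(treated)

# Apparent fold change of each gene that is present in both samples.
fold_changes = {
    gene: treated_cpm[gene] / control_cpm[gene]
    for gene in ("geneA", "geneB", "geneC")
}
print(fold_changes)
```

The apparent two-fold drop in geneA, geneB, and geneC is consistent with no change in their true abundance at all, which is the motivation for composition-aware normalization.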

#### Normalization Methods

- RPKM/FPKM:
    - Normalized by: differences in sequencing depth and transcript length
    - Goal: compare a gene across samples and different genes within a sample
- Trimmed Mean of M values (TMM):
    - Corrects for differences in RNA *composition* between samples
    - Normalized by: differences in transcript pool composition, extreme outliers
    - Goal: provide better across-sample comparability
    - More computationally complex
- Transcripts Per Million (TPM):
    - Normalized by: transcript length distribution in RNA pool
    - Goal: provide better across-sample comparability
    - *Best across-sample comparability*
    - Much more straightforward mathematical formula
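$$
TPM_g = \frac{\text{\# of reads in gene } g \cdot \text{read length} \cdot 10^{6}}{\text{transcript length of gene } g \cdot T}, \qquad T = \sum_{g}{\frac{\text{\# of reads in gene } g \cdot \text{read length}}{\text{transcript length of gene } g}}
$$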
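
TPM can be sketched in a few lines. This assumes the common definition: compute a per-gene rate (reads × read length / transcript length), divide by the sum of rates over all genes, and multiply by one million. The gene names, counts, and the default read length are hypothetical.

```python
def tpm(read_counts, transcript_lengths, read_length=100):
    """Transcripts per million for each gene in one sample."""
    # Length-normalized rate for each gene.
    rate = {
        gene: read_counts[gene] * read_length / transcript_lengths[gene]
        for gene in read_counts
    }
    # T is the sum of rates over all genes in the sample.
    T = sum(rate.values())
    return {gene: rate[gene] * 1e6 / T for gene in rate}

# Hypothetical read counts and transcript lengths in base pairs.
read_counts = {"geneA": 100, "geneB": 400, "geneC": 300}
transcript_lengths = {"geneA": 1_000, "geneB": 2_000, "geneC": 3_000}

tpm_values = tpm(read_counts, transcript_lengths)
print(tpm_values)
```

Because every sample's TPM values sum to one million, a given TPM value represents the same share of the transcript pool in any sample. The read length cancels out of the final ratio, so its exact value does not affect the result.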

#### Statistical Considerations

Two transcripts may have the same proportion of counts across two samples, resulting in identical RPKM values. However, these transcripts may have vastly different lengths, and so vastly different numbers of underlying reads (the *n* value of the analysis). Since the *n* value differs, it's logical that the p-value should differ as well.

Statistical programs like edgeR and DESeq2 try to use the count nature of the RNA-seq data, to increase statistical power and account for the differences in *n* values.

Discrete distributions, such as Poisson or negative binomial, are typically used rather than continuous distributions for modeling RNA-seq data. This makes sense because the data are discrete counts of reads sampled from the RNA pool, rather than continuous measurements of the whole pool.

As such, many Differential Expression tools demand tables of integer read counts as input, rather than RPKM/FPKM/TPM or other relative measures.
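
The effect of *n* on the p-value can be seen with a toy test. The sketch below is an illustration only, not the model used by edgeR or DESeq2. It assumes two samples with equal library sizes, so that under the null hypothesis a gene's reads split 50:50 between them, and it applies an exact binomial test. The two genes are hypothetical and have the same 1:2 ratio but different total counts.

```python
from math import comb

def equal_expression_p_value(count_1, count_2):
    """Two-sided exact test that a gene's reads split 50:50 between two samples.

    Conditional on n = count_1 + count_2, the null distribution of count_1 is
    Binomial(n, 0.5), which is symmetric, so the two tails are equal.
    """
    n = count_1 + count_2
    k = min(count_1, count_2)
    lower_tail = sum(comb(n, i) for i in range(k + 1))  # exact integer sum
    return min(1.0, 2 * lower_tail / 2**n)

# Same 1:2 ratio, but the longer transcript collects ten times more reads.
p_short_transcript = equal_expression_p_value(10, 20)
p_long_transcript = equal_expression_p_value(100, 200)

print(f"short transcript (n=30):  p = {p_short_transcript:.3g}")
print(f"long transcript  (n=300): p = {p_long_transcript:.3g}")
```

A length-normalized measure would report the same fold change for both genes, yet only the raw counts reveal how much evidence stands behind each one.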

#### Gene Clustering

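A hierarchical clustering can be created using simpler measures such as CPM or TPM, organizing genes by their expression patterns. Clustering by gene and clustering by sample answer different biological questions: the former groups genes with similar expression patterns, the latter groups samples with similar expression profiles.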

R's built-in `heatmap()` function can draw these clusterings.
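
To show what hierarchical clustering does underneath a heatmap, here is a minimal pure-Python sketch. It assumes average linkage and a correlation-based distance, which are common choices but not ones specified in these notes; the expression profiles are made up.

```python
import math
from itertools import combinations

def correlation_distance(x, y):
    """1 - Pearson r: profiles with the same shape are close, whatever their scale."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm = math.sqrt(
        sum((a - mean_x) ** 2 for a in x) * sum((b - mean_y) ** 2 for b in y)
    )
    return 1 - cov / norm

def average_linkage(profiles):
    """Agglomerative clustering; returns the merges in the order they happen."""
    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        def linkage(pair):
            a, b = pair
            dists = [
                correlation_distance(profiles[g], profiles[h]) for g in a for h in b
            ]
            return sum(dists) / len(dists)
        # Merge the two clusters with the smallest average pairwise distance.
        a, b = min(combinations(clusters, 2), key=linkage)
        merges.append((a, b))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Hypothetical log-scale expression of four genes across four samples
# (two control samples followed by two treated samples).
profiles = {
    "geneA": [1.0, 1.2, 5.0, 5.3],
    "geneB": [2.0, 2.1, 6.1, 6.0],
    "geneC": [7.0, 6.8, 2.0, 2.2],
    "geneD": [5.5, 5.9, 1.0, 1.1],
}
merges = average_linkage(profiles)
for left, right in merges:
    print(left, "+", right)
```

Here the genes induced by treatment (geneA, geneB) and the genes repressed by it (geneC, geneD) form two groups that are joined only at the final merge. Passing per-sample profiles instead of per-gene profiles would cluster the samples.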

## Observations from GWAS

Genome-wide association studies (GWAS) have identified hundreds of common DNA variants associated with multiple complex diseases and traits. More than 2 out of 3 of these SNPs lie in non-coding regions of DNA.

Many of these non-coding SNPs lie in regulatory regions. Variants associated with gene expression levels are called Expression Quantitative Trait Loci (eQTL). By examining how gene expression varies with genotype at these loci, we can identify the regulatory regions. Regression analysis between the genotypes and gene expression levels identifies them!

However, identifying eQTLs is difficult. Most human disease-relevant tissues or cell types are hard to obtain, especially since large sample sizes are required for statistical power.

The GTEx (Genotype-Tissue Expression) database is one of the largest sources of expression level data, collected from post-mortem samples.
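
The regression between genotype and expression mentioned above can be sketched for a single SNP-gene pair. This is a minimal illustration, assuming genotypes are coded as alternate-allele dosage (0, 1, or 2) and expression is on a log scale. The data are made up, and real eQTL analyses also adjust for covariates and for testing many SNP-gene pairs.

```python
def simple_linear_regression(x, y):
    """Ordinary least squares fit y = intercept + slope * x; also returns r^2."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    syy = sum((b - mean_y) ** 2 for b in y)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept, sxy**2 / (sxx * syy)

# Hypothetical data: allele dosage per individual and log2 expression of one gene.
genotypes = [0, 0, 0, 1, 1, 1, 2, 2, 2]
expression = [5.1, 4.9, 5.0, 6.0, 6.2, 5.8, 7.1, 6.9, 7.0]

slope, intercept, r_squared = simple_linear_regression(genotypes, expression)
print(f"effect per allele: {slope:.2f}, r^2: {r_squared:.2f}")
```

A slope clearly different from zero suggests that each additional copy of the allele shifts expression of the gene, which is the signature of an eQTL.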