# CMM262: RNA-sequencing, Day 2 (RNA-seq Differential Expression Analysis)

**Authors:** Michelle Franc Ragsac (mragsac@eng.ucsd.edu) 

*Based on the DESeq R Markdown notebook from CMM262 taught in Winter 2020* 

Now that we've worked through the first part of the RNA-sequencing analysis pipeline, transforming our FASTQ sequencing data into a counts matrix that records the number of reads for each gene in our experiment, we can use the `DESeq2` package in the R programming language to identify the *differentially expressed* genes in our experiment. 

### Table of Contents

1. Define Differential Expression Parameters for `DESeq2` 
2. Perform Quality Control with Built-In `DESeq2` Methods
3. Perform Differential Expression Analysis on our Dataset
4. Generating MA-Plots to Determine the Differences between Samples
5. Visualizing Significant Results with Volcano Plots
6. Using the `biomaRt` Library to Convert ENSEMBL Gene Identifiers to HGNC Gene Symbols 

---

## Performing Differential Expression Analyses with the `DESeq2` Library 

### Import the Packages We'll be Using in this Portion of the Notebook

In [None]:
# First, import/load in the DESeq2 library into our notebook
library(DESeq2)

# Next, import/load in the ggplot2 and RColorBrewer packages for result exploration
library(ggplot2)
library(RColorBrewer)

### Load in the RNA-Sequencing Dataset using the `read.csv()` Method on our Counts Matrix and Labels File

In [None]:
# Because DESeq2 works with raw count matrices and uses the row names as identifiers 
# for our genes, we'll import the data accordingly with the read.csv() method!
counts <- read.csv('data/asm_dex_counts.txt', 
                   sep = "\t",    # specify that our data is tab-delimited 
                   row.names = 1, # use the first column (gene identifiers) as the row names
                   header = TRUE) # state that we have a header already present

# After importing in the data, let's preview the contents with the head() method
head(counts)

In [None]:
# We also need our condition identifiers so DESeq2 knows what groups to compare against each other 
col.data <- read.csv('data/asm_dex_labels.txt', sep = '\t', header = TRUE, row.names = 1)
head(col.data)
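
Before handing both tables to `DESeq2`, it helps to confirm that the sample names line up: the package expects the columns of the counts matrix and the rows of the labels table to be in the same order. The sketch below shows one way to check and fix this on small stand-in tables; `toy.counts` and `toy.labels` are hypothetical and only mimic the shape of `counts` and `col.data`.

```r
# Toy stand-ins for the counts matrix and labels table (hypothetical values)
toy.counts <- data.frame(sampleB = c(10, 0, 5),
                         sampleA = c(12, 3, 7),
                         row.names = c("gene1", "gene2", "gene3"))
toy.labels <- data.frame(condition = c("control", "treated"),
                         row.names = c("sampleA", "sampleB"))

# Both tables describe the same samples, but in a different order
same.samples <- setequal(colnames(toy.counts), rownames(toy.labels))
same.order   <- identical(colnames(toy.counts), rownames(toy.labels))

# Reorder the count columns so they follow the rows of the labels table
toy.counts.aligned <- toy.counts[, rownames(toy.labels)]
aligned <- identical(colnames(toy.counts.aligned), rownames(toy.labels))
```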

---

## Defining Experimental Parameters for `DESeq2` with the `DESeqDataSetFromMatrix()` Method

In [None]:
# Generate the DESeqDataSet object using our counts matrix values and labels 
dds <- DESeqDataSetFromMatrix(countData = counts,   # specify the counts matrix to use 
                              colData = col.data,   # specify our sample groupings 
                              design = ~ condition) # state we would like to model the condition of our groupings

# Now that we've generated the DESeqDataSet object, let's preview the contents of the object!
dds 

---

## Performing Data Quality Control with Built-In `DESeq2` Methods

### Applying a Regularized Log-Transformation with the `rlog()` Method

In [None]:
rld <- rlog(dds)

### Visualizing Sample Separation with Principal Component Analysis (PCA) via the `plotPCA()` Method

In [None]:
data <- plotPCA(rld, intgroup = 'condition', returnData = TRUE)
percent.var <- round(100 * attr(data, 'percentVar'))
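
The `percentVar` attribute holds the fraction of the total variance captured by each principal component. By default, `plotPCA()` computes this on the 500 most variable genes (its `ntop` argument). As a rough illustration of where those fractions come from, here is a minimal base-R sketch on a random toy matrix; `toy.expr` is hypothetical and stands in for the transformed expression values.

```r
set.seed(1)

# Toy matrix of transformed expression values: 100 genes (rows) x 6 samples (columns)
toy.expr <- matrix(rnorm(600), nrow = 100,
                   dimnames = list(paste0("gene", 1:100), paste0("sample", 1:6)))

# prcomp() expects samples in rows, so transpose the matrix first
pca <- prcomp(t(toy.expr))

# Fraction of the total variance carried by each principal component
toy.percent.var <- pca$sdev^2 / sum(pca$sdev^2)

# Percentages for the first two components, rounded as in the cell above
round(100 * toy.percent.var[1:2])
```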

In [None]:
ggplot(data, aes(x = PC1, y = PC2, color = condition)) +
    geom_point(size = 5) + 
    xlab(paste0("PC1: ", percent.var[1], "% variance")) +
    ylab(paste0("PC2: ", percent.var[2], "% variance"))

### Visualizing Sample Similarities with a Heatmap using the `heatmap()` Method

In [None]:
sample.distances <- dist(t(assay(rld)))
sample.distances.matrix <- as.matrix(sample.distances)

rownames(sample.distances.matrix) <- paste(rld$condition)
colnames(sample.distances.matrix) <- paste(rld$condition)

head(sample.distances.matrix)
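
The transpose in `dist(t(...))` matters because `dist()` measures distances between the *rows* of its input, and our samples are stored as columns. A small sketch with a hypothetical 3-gene, 3-sample matrix (`toy.expr`) shows the resulting sample-by-sample Euclidean distance matrix:

```r
# Toy transformed values: 3 genes (rows) x 3 samples (columns); matrix() fills by column
toy.expr <- matrix(c(1, 2, 3,
                     1, 2, 4,
                     5, 5, 5),
                   nrow = 3,
                   dimnames = list(paste0("gene", 1:3), c("s1", "s2", "s3")))

# dist() compares rows, so transpose to compare samples instead of genes
toy.distances <- as.matrix(dist(t(toy.expr)))

# s1 and s2 differ in a single gene, so they sit much closer together than either does to s3
toy.distances
```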

In [None]:
colors <- colorRampPalette(rev(brewer.pal(9, 'Blues')))(255)

heatmap(sample.distances.matrix,
        col = colors)

---

## Perform Differential Expression Analysis on our Dataset

In [None]:
dds.result <- DESeq(dds)
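
The `DESeq()` call wraps three steps: estimating size factors, estimating dispersions, and fitting the negative binomial model with a Wald test. To give a feel for the first step, here is a simplified base-R sketch of the median-of-ratios idea behind the size factors. It is an illustration on a hypothetical toy matrix (`toy.counts`), not a replacement for the package's own implementation.

```r
# Toy raw counts: 4 genes x 2 samples, where sample2 was sequenced twice as deeply
toy.counts <- matrix(c(10, 20, 30, 40,
                       20, 40, 60, 80),
                     nrow = 4,
                     dimnames = list(paste0("gene", 1:4), c("sample1", "sample2")))

# Per-gene geometric mean across samples, computed on the log scale
log.geo.means <- rowMeans(log(toy.counts))

# Per-sample size factor: median ratio of each count to its gene's geometric mean
# (genes containing a zero count have an infinite log mean and are skipped)
size.factors <- apply(toy.counts, 2, function(sample.counts) {
    usable <- is.finite(log.geo.means) & sample.counts > 0
    exp(median(log(sample.counts[usable]) - log.geo.means[usable]))
})

# Dividing each column by its size factor puts the samples on a common scale
toy.normalized <- sweep(toy.counts, 2, size.factors, "/")
```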

In [None]:
result <- results(dds.result)
head(as.data.frame(result))
summary(result)
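
The `padj` column holds p-values adjusted for multiple testing with the Benjamini-Hochberg (BH) procedure, which is the default in `results()`. Note that `summary()` reports counts at `alpha = 0.1` unless told otherwise. The sketch below works through the BH adjustment by hand on five hypothetical p-values and compares it against base R's `p.adjust()`.

```r
# Hypothetical raw p-values for five genes
p.values <- c(0.001, 0.008, 0.039, 0.041, 0.60)

# Benjamini-Hochberg by hand: scale each sorted p-value by n / rank,
# then enforce monotonicity working down from the largest p-value
n <- length(p.values)
ranks <- order(p.values)
scaled <- p.values[ranks] * n / seq_len(n)
bh.sorted <- pmin(1, rev(cummin(rev(scaled))))

# Put the adjusted values back in the original gene order
bh.manual <- numeric(n)
bh.manual[ranks] <- bh.sorted

# The built-in p.adjust() should give the same answer
bh.builtin <- p.adjust(p.values, method = "BH")
all.equal(bh.manual, bh.builtin)
```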

In [None]:
result <- result[order(result$padj),]
head(as.data.frame(result))

In [None]:
par(mfrow=c(3,2))

plotCounts(dds, gene="ENSG00000152583")
plotCounts(dds, gene="ENSG00000179094")
plotCounts(dds, gene="ENSG00000116584")
plotCounts(dds, gene="ENSG00000189221")
plotCounts(dds, gene="ENSG00000120129")
plotCounts(dds, gene="ENSG00000148175")

---

## Generating MA-Plots to Determine the Differences between Samples

In [None]:
plotMA(result, main = "DESeq2 MA", ylim = c(-2,2))

In [None]:
result <- results(dds.result, alpha = 0.05)
result.dataframe <- as.data.frame(result)

plotMA(result, main = "DESeq2 MA, alpha=0.05", ylim = c(-2,2))

---

## Visualizing Significant Results with Volcano Plots

In [None]:
result.dataframe$neg.log10.padj <- -log10(result.dataframe$padj)
result.dataframe$is.sig <- result.dataframe$padj < 0.05
result.dataframe$is.sig.big.fc <- result.dataframe$is.sig & (result.dataframe$log2FoldChange > 2 | result.dataframe$log2FoldChange < -2)

In [None]:
ggplot(result.dataframe, aes(x = log2FoldChange, y = neg.log10.padj, color = is.sig.big.fc)) +
    geom_point(size = 1) +
    scale_color_manual(values = c("black", "red")) +
    xlab("Log2FC Normalized Counts") +
    ylab("-Log10 Adjusted p-Value") 

In [None]:
sig.results.dataframe <- result.dataframe[result.dataframe$is.sig.big.fc,]
dim(sig.results.dataframe)

In [None]:
sig.results.dataframe.filt <- sig.results.dataframe[!is.na(sig.results.dataframe$padj),]
dim(sig.results.dataframe.filt)
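
`DESeq2` sets `padj` to `NA` for genes it excludes from testing, for example genes removed by independent filtering for low mean counts or flagged as count outliers. When dropping those rows, a logical mask is safer than negative indexing with `which()`, because the latter silently returns zero rows when nothing is missing. A small sketch on a hypothetical table (`toy.results`) shows the difference:

```r
# Toy results table with no missing adjusted p-values (hypothetical values)
toy.results <- data.frame(padj = c(0.01, 0.20, 0.03),
                          row.names = c("gene1", "gene2", "gene3"))

# which() finds nothing here, and negating an empty index selects zero rows
dropped.everything <- toy.results[-which(is.na(toy.results$padj)), , drop = FALSE]

# A logical mask keeps every row whose padj is present
kept.everything <- toy.results[!is.na(toy.results$padj), , drop = FALSE]

nrow(dropped.everything)
nrow(kept.everything)
```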

---

## Using the `biomaRt` Library to Convert ENSEMBL Gene Identifiers to HGNC Gene Symbols 

### Import the Packages We'll be Using in this Portion of the Notebook

In [None]:
library("biomaRt")

### Create the `ENSEMBL` BioMart Object with the `useDataset()` and `useMart()` Methods

In [None]:
ensembl <- useDataset("hsapiens_gene_ensembl", useMart("ensembl", host="uswest.ensembl.org"))
ensembl

### Extract and Clean `ENSEMBL` Identifiers from our `DESeq2` Results with the `gsub()` Method

In [None]:
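# Our row names are already unversioned ENSEMBL IDs (e.g. ENSG00000152583), so no gsub() cleanup is needed here
ensembl.ids <- rownames(sig.results.dataframe)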
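
Some annotation files append a version suffix to each ENSEMBL identifier (such as `.12`), while the `ensembl_gene_id` filter in BioMart matches unversioned identifiers. If your own dataset carries these suffixes, a `gsub()` call along the following lines can strip them. The identifiers in `versioned.ids` are hypothetical examples, and the regular expression is one reasonable choice rather than the only one.

```r
# Hypothetical identifiers: two carry a version suffix, one does not
versioned.ids <- c("ENSG00000152583.12", "ENSG00000179094.15", "ENSG00000116584")

# Remove a trailing ".<digits>" version suffix; unversioned IDs pass through unchanged
clean.ids <- gsub("\\.[0-9]+$", "", versioned.ids)
clean.ids
```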

### Fetch Conversion Between `ENSEMBL` Identifiers and HGNC Symbols with the `getBM()` Method

In [None]:
# Before we fetch the gene list, let's enter these commands to 
# combat connection problems we might encounter when accessing biomart... 
# (From: https://www.bioconductor.org/packages/devel/bioc/vignettes/biomaRt/inst/doc/accessing_ensembl.html)
httr::set_config(httr::config(ssl_verifypeer = FALSE))
httr::set_config(httr::config(ssl_cipher_list = "DEFAULT@SECLEVEL=1"))

# Fetch the translation for our ENSEMBL IDs from BioMart to convert them to HGNC Symbols!
gene.list <- getBM(filters = 'ensembl_gene_id', 
                   attributes = c('ensembl_gene_id', 'hgnc_symbol'), # specifies columns we want from biomart
                   values = ensembl.ids, # provide the query for our search - which ENSEMBL IDs to look up
                   mart = ensembl)       # provide the biomart object we would like to use for our search

### Translate the `ENSEMBL` Identifiers in our `DESeq2` Results

In [None]:
rownames(sig.results.dataframe) <- ensembl.ids
filtered.sig.results.dataframe <- sig.results.dataframe[gene.list$ensembl_gene_id,]

rownames(filtered.sig.results.dataframe) <- make.names(gene.list$hgnc_symbol, unique = TRUE)
head(filtered.sig.results.dataframe)
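
One thing to keep in mind when reading the new row names: `make.names()` rewrites anything that is not a syntactically valid R name, so the labels may differ from the official HGNC symbols. The sketch below uses a hypothetical vector (`toy.symbols`) to show what happens to a missing symbol, a symbol containing a hyphen, and a duplicated symbol.

```r
# Hypothetical lookup result: one gene has no symbol, one symbol contains
# a hyphen, and one symbol appears twice
toy.symbols <- c("DUSP1", "", "HLA-DRA", "DUSP1")

# make.names() turns these into syntactically valid, unique R names:
# the empty string becomes "X", the hyphen becomes a dot, and the
# repeated symbol gains a numeric suffix
toy.row.names <- make.names(toy.symbols, unique = TRUE)
toy.row.names
```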

In [None]:
dir.create('results', showWarnings = FALSE) # make sure the output folder exists before writing
write.csv(filtered.sig.results.dataframe, file = 'results/asm_sig_genes_hgnc-symbol.csv')