# DESeq2: Basic Differential Expression (DE) analysis

## Objective: Carry out a basic set of DE analyses using DESeq2 and visualize the results

### Load packages

In [None]:
library(tidyverse)
library(DESeq2)
library(dendextend)
library(RColorBrewer)

### Load the 2019 pilot dds object from image file

In [None]:

curdir <- "/home/jovyan/work/scratch/analysis_output"
imgdir <- file.path(curdir, "img")

imgfile <- file.path(imgdir, "pilotdds2019.RData")

imgfile

attach(imgfile)

tools::md5sum(imgfile)

### List the objects that have been attached
ls(2)

dds2019 <- dds2019

detach(pos = 2)

### Check dimensions of the two objects

# Inspect object & Slots of an S4 class

Let's have a look at the object we have just loaded.

In [None]:
dds2019

Check the class of the dds object.

In [None]:
class(dds2019)

A `DESeqDataSet` is an S4 object. Recall that S4 objects were covered when Bioconductor was introduced. An S4 object lets users bundle multiple elements into a single variable; each element is called a slot.

In [None]:
slotNames(dds2019)
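To make the idea of slots concrete, here is a minimal sketch that defines a toy S4 class and inspects it the same way. The class name `ToyExperiment` and its two slots are hypothetical and only loosely mimic a `DESeqDataSet`; they are not part of DESeq2.

```r
library(methods)  # part of base R; loaded explicitly so the sketch also runs under Rscript

# Hypothetical toy class with two slots (illustration only, not a DESeq2 class)
setClass("ToyExperiment",
         representation(counts = "matrix", design = "formula"))

toy <- new("ToyExperiment",
           counts = matrix(1:6, nrow = 3,
                           dimnames = list(paste0("gene", 1:3), c("s1", "s2"))),
           design = ~ condition)

slotNames(toy)     # names of the slots
toy@design         # access a single slot with @
dim(toy@counts)    # slots hold ordinary R objects
```

Accessing slots with `@` is convenient for exploration. When a package provides accessor functions (for example `colData()` or `design()` in DESeq2), those are usually the safer choice.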

The sample metadata (column data) is stored in the slot `colData`.

In [None]:
dds2019@colData %>% as.data.frame %>% head(3)

The design formula is stored in the slot `design`. The design holds the R formula which expresses how the counts depend on the variables in colData.

In [None]:
dds2019@design

The first thing you may want to do is **have a look at the raw counts** you have imported. The `DESeq2::counts` function extracts a matrix of counts (with the genes along the rows and samples along the columns). Let us first verify the dimension of this matrix.

In [None]:
dim(counts(dds2019))

In [None]:
head(counts(dds2019),3)

The `dispersionFunction` slot holds the fitted dispersion trend function. It is empty for now and will be populated once the dispersions have been estimated.

In [None]:
dds2019@dispersionFunction

# Estimate Size Factors and Dispersion Parameters

Recall that DESeq2 requires estimates of sample-specific size factors and gene-specific dispersion parameters. More specifically, DESeq2 models the count $K_{ij}$ (gene $i$, sample $j$) as negative binomial with mean $\mu_{ij}$ and dispersion parameter $\alpha_i$. Here $\mu_{ij}=s_j q_{ij}$, where $s_j$ is the size factor specific to sample $j$, and $\log_2(q_{ij}) = \beta_{0i} + \beta_{1i} z_j$, where $z_j$ is the covariate value for sample $j$ (e.g. a 0/1 group indicator).

**Summary of notation**
- $K_{ij}$ denotes the observed **number of reads** mapped to gene $i$ for sample $j$
- $K_{ij}$ follows a **negative binomial distribution** with
    - **Mean** $\mu_{ij}$
    - **Dispersion parameter** $\alpha_i$
- Modelling
    - $K_{ij} \sim NB(\mu_{ij}, \alpha_i)$
    - $\mu_{ij} = s_{j}q_{ij}$
        - $s_j$ is sample $j$ specific normalization constant
    - $\log_2(q_{ij}) = \beta_{0i} + \beta_{1i} z_j$
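The notation above can be checked numerically. The sketch below plugs made-up parameter values into the model and simulates counts for one gene in one sample with base R's `rnbinom()`, whose `size` argument corresponds to $1/\alpha_i$ in this parameterization. All numeric values are hypothetical; the point is only that the negative binomial variance is $\mu_{ij} + \alpha_i \mu_{ij}^2$, so $\alpha_i$ controls how much the counts spread beyond Poisson noise.

```r
set.seed(1)

# Hypothetical parameter values, chosen for illustration
s_j   <- c(0.8, 1.2)   # size factors for two samples
beta0 <- 6             # log2 baseline expression for the gene
beta1 <- 1             # log2 fold change between the two groups
z     <- c(0, 1)       # sample 1 = reference group, sample 2 = other group
alpha <- 0.1           # dispersion for the gene

q  <- 2^(beta0 + beta1 * z)   # 64 and 128
mu <- s_j * q                 # 51.2 and 153.6

# Simulate many counts for sample 2; rnbinom's `size` is 1 / alpha
k <- rnbinom(1e5, mu = mu[2], size = 1 / alpha)

c(simulated_mean = mean(k), model_mean = mu[2])
c(simulated_var  = var(k),  model_var  = mu[2] + alpha * mu[2]^2)
```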

## 01 Size Factors
We begin by estimating the size factors $s_1,\ldots,s_n$:

In [None]:
dds2019 <- estimateSizeFactors(dds2019)
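By default, DESeq2 estimates size factors with the median-of-ratios method. The following base-R sketch reproduces that idea on a made-up count matrix (`toy_counts` and its numbers are hypothetical). It is meant to show what is being estimated, not to replace the `estimateSizeFactors()` call.

```r
# Toy count matrix: 4 genes x 3 samples (made-up numbers)
toy_counts <- matrix(c( 10,  20,  40,
                       100, 200, 400,
                        50, 100, 200,
                         5,  10,  20),
                     nrow = 4, byrow = TRUE,
                     dimnames = list(paste0("gene", 1:4), paste0("s", 1:3)))

# Per-gene geometric mean across samples, computed on the log scale
log_geo_means <- rowMeans(log(toy_counts))

# Genes with a zero count have a non-finite log geometric mean and are skipped
keep <- is.finite(log_geo_means)

# Size factor of a sample = median over genes of (count / geometric mean)
size_factors <- apply(toy_counts, 2, function(k) {
    exp(median(log(k[keep]) - log_geo_means[keep]))
})
size_factors
```

In this toy matrix every gene has counts in the ratio 1:2:4 across the three samples, so the size factors come out as 0.5, 1 and 2.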

Now compare the dds object with what it looked like before applying `estimateSizeFactors()`. What has changed? What remains unchanged?

In [None]:
dds2019

Note that a **sizeFactor** column has been added to **colData**. Compare the output before and after:

```
> dds # (before estimateSizeFactors)
class: DESeqDataSet 
dim: 8499 24 
metadata(1): version
assays(1): counts
rownames(8499): CNAG_00001 CNAG_00002 ... large_MTrRNA small_MTrRNA
rowData names(0):
colnames(24): 1_2019_P_M1 2_2019_P_M1 ... 23_2019_P_M1 24_2019_P_M1
colData names(22): Label sample_year ... RIN_normal_threshold
  RIN_lowered_threshold

> dds # (after estimateSizeFactors)
class: DESeqDataSet 
dim: 8499 24 
metadata(1): version
assays(1): counts
rownames(8499): CNAG_00001 CNAG_00002 ... large_MTrRNA small_MTrRNA
rowData names(0):
colnames(24): 1_2019_P_M1 2_2019_P_M1 ... 23_2019_P_M1 24_2019_P_M1
colData names(23): Label sample_year ... RIN_lowered_threshold
  sizeFactor
```

You can also get the size factors directly

In [None]:
sizeFactors(dds2019)

The size factors are easier to read with fewer decimal places. Next, show them rounded to 3 decimal places.

In [None]:
round(sizeFactors(dds2019),3)

Summarize size factors

In [None]:
summary(sizeFactors(dds2019))

Do you see a trend?

In [None]:
sizeFactors(dds2019) %>%
    as.data.frame %>%
        rownames_to_column %>%
            mutate(libnum = as.integer(str_remove(rowname, "_2019_P_M1"))) -> mydf

colnames(mydf)[2] <- "sizefac"

mydf

In [None]:
ggplot(mydf, aes(x = libnum, y = sizefac)) + geom_point()

Now that the size factors have been estimated, we can get "normalized" counts (DESeq2 divides each sample's counts by that sample's size factor).

In [None]:
# original counts for libraries 1 and 24
counts(dds2019)[1:5,c(1,24)]

# normalized count
counts(dds2019, normalized = TRUE)[1:5, c(1,24)]

# Size factor

sizeFactors(dds2019)[c(1,24)]

In [None]:
# normalized manually using size factors for library 1
counts(dds2019)[1:5, 1] / sizeFactors(dds2019)[1]

In [None]:
# normalized manually using size factors for library 24
counts(dds2019)[1:5, 24] / sizeFactors(dds2019)[24]
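The manual normalization above handles one library at a time. As a sketch of how the same division can be applied to every column at once, the example below uses a made-up matrix and made-up size factors (`toy_counts` and `toy_sf` are hypothetical names and values).

```r
toy_counts <- matrix(c( 10,  40,
                       100, 400,
                         7,  30),
                     nrow = 3, byrow = TRUE,
                     dimnames = list(paste0("gene", 1:3), c("s1", "s2")))
toy_sf <- c(s1 = 0.5, s2 = 2)   # hypothetical size factors

# Divide column j by size factor j
norm_sweep <- sweep(toy_counts, 2, toy_sf, "/")

# Equivalent formulation using transposes
norm_t <- t(t(toy_counts) / toy_sf)

norm_sweep
```

Note that `toy_counts / toy_sf` would not do the same thing: R recycles the vector down the rows, so the size factors would be matched to genes rather than samples.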

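How do you get the counts for a single gene, e.g. "CNAG_05845"? Index the count matrix by gene ID. The cell below returns the normalized counts; drop `normalized = TRUE` to get the raw counts.

In [None]:
counts(dds2019, normalized = TRUE)["CNAG_05845",]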

## 02 Dispersion Parameters
Next, we estimate the dispersion parameters $\alpha_1,\ldots,\alpha_{m}$, one per gene:

In [None]:
dds2019 <- estimateDispersions(dds2019)

Now inspect the dds object again and note that the row metadata (`rowData`) has been populated and that a second assay (`mu`) has been added:
- before: 
    - `rowData names(0):`
- after:  
    - `rowData names(10): baseMean baseVar ... dispOutlier dispMAP`

In [None]:
dds2019

Can you notice the difference?
```
> dds (before dispersion)
class: DESeqDataSet 
dim: 8499 24 
metadata(1): version
assays(1): counts
rownames(8499): CNAG_00001 CNAG_00002 ... large_MTrRNA small_MTrRNA
rowData names(0):
colnames(24): 1_2019_P_M1 2_2019_P_M1 ... 23_2019_P_M1 24_2019_P_M1
colData names(23): Label sample_year ... RIN_lowered_threshold
  sizeFactor
  
> dds (after dispersion)
class: DESeqDataSet 
dim: 8499 24 
metadata(1): version
assays(2): counts mu
rownames(8499): CNAG_00001 CNAG_00002 ... large_MTrRNA small_MTrRNA
rowData names(10): baseMean baseVar ... dispOutlier dispMAP
colnames(24): 1_2019_P_M1 2_2019_P_M1 ... 23_2019_P_M1 24_2019_P_M1
colData names(23): Label sample_year ... RIN_lowered_threshold
  sizeFactor
```

Note that the `dispersionFunction` slot is now populated.

In [None]:
dds2019@dispersionFunction

We can extract the gene-specific dispersion estimates using `dispersions()`. There is one number per gene. Below we store them, check how many there are, and look at the first four genes (rounded to 4 decimal places).

In [None]:
alphas <- dispersions(dds2019)

Verify that the number of dispersion factors equals the number of genes

In [None]:
# number of dispersion factors
length(alphas)

In [None]:
round(alphas[1:4], 4)
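To get a feel for what a dispersion value means, the sketch below simulates counts for a single gene with a known dispersion and recovers it with a simple method-of-moments calculation based on $\mathrm{Var}(K) = \mu + \alpha \mu^2$. This is not how `estimateDispersions()` works (DESeq2 fits gene-wise estimates, a trend, and shrunken final values); it assumes equal size factors and a single group, and all values are made up.

```r
set.seed(42)

true_alpha <- 0.2    # hypothetical dispersion
mu         <- 100    # hypothetical mean count

# Simulated counts for one gene across many replicate samples
k <- rnbinom(5000, mu = mu, size = 1 / true_alpha)

# Var(K) = mu + alpha * mu^2  =>  alpha = (Var(K) - mu) / mu^2
alpha_mom <- (var(k) - mean(k)) / mean(k)^2
round(alpha_mom, 3)
```

With only a handful of replicates per group, as in a typical RNA-seq experiment, such a per-gene estimate would be very noisy, which is the motivation for sharing information across genes.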

Extract the metadata using mcols() for the first four genes

| Term        | Description                                   |
|-------------|-----------------------------------------------|
| baseMean    | mean of normalized counts for all samples     |
| baseVar     | variance of normalized counts for all samples |
| allZero     | all counts for a gene are zero                |
| dispGeneEst | gene-wise estimates of dispersion             |
| dispFit     | fitted values of dispersion                   |
| dispersion  | final estimate of dispersion                  |
| dispIter    | number of iterations                          |
| dispOutlier | dispersion flagged as outlier                 |
| dispMAP     | maximum a posteriori estimate                 |


In [None]:
mcols(dds2019)[1:4,] %>% as.data.frame

Summarize the dispersion factors using a box plot (may want to log transform)

In [None]:
boxplot(log(dispersions(dds2019)))

# Differential Expression Analysis
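We can now conduct a differential expression analysis using the `DESeq()` function. Keep in mind that to get to this step, we first estimated the size factors and then the dispersion parameters. `DESeq()` wraps these same steps (size factors, then dispersions) and follows them with model fitting and a Wald test for each gene.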

In [None]:
### Carry out DE analysis
ddsDE <- DESeq(dds2019)

In [None]:
### Look at object
ddsDE

In [None]:
### Look at some of the results
results(ddsDE)

Note that the current model is additive: it includes the main effects of the two factors (`genotype` and `condition`) but not their interaction term.

### Look at some of the results (tidy version)


In [None]:
results(ddsDE, tidy = TRUE)

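We can extract the results of the differential expression analysis using `results()`. The `contrast` argument specifies which two groups of samples to compare, in the form `c(variable, numerator level, denominator level)`. If no contrast is given, `results()` uses the last variable in the design formula (`design(dds)`) and compares its last level against the reference level.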

In [None]:
# DE with respect to condition
myres_condition4v8 <- results(ddsDE, contrast = c("condition", "pH4", "pH8"))
myres_condition4v8

In [None]:
# DE with respect to condition (flip order)
myres_condition8v4 <- results(ddsDE, contrast = c("condition", "pH8", "pH4"))
myres_condition8v4
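Flipping the order of the two levels in a contrast flips the sign of the log2 fold change. The toy calculation below illustrates this with made-up group means for a single gene. DESeq2's estimates come from the fitted model rather than from a raw ratio of means, so this sketch only conveys the sign convention.

```r
# Hypothetical mean normalized counts for one gene in each condition
mean_pH4 <- 50
mean_pH8 <- 200

lfc_8v4 <- log2(mean_pH8 / mean_pH4)   # pH8 relative to pH4 (numerator first)
lfc_4v8 <- log2(mean_pH4 / mean_pH8)   # pH4 relative to pH8

c(pH8_vs_pH4 = lfc_8v4, pH4_vs_pH8 = lfc_4v8)
```

The two values are 2 and -2: the gene is four times higher at pH8, and which sign that receives depends only on which level is listed as the numerator.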

In [None]:
### DE with respect to genotype
myres_strainvWT <- results(ddsDE, contrast = c("genotype", "sre1d", "WT"))
myres_strainvWT

Let's look at the results for the first four genes

In [None]:
### Tidy the results
myres_condition8v4 <- results(ddsDE, contrast = c("condition", "pH8", "pH4"), tidy = TRUE)
head(myres_condition8v4, 4)

In [None]:
### Tidy the results for DE with respect to condition
### Results are sorted in ascending order by adjusted p-value
### Here pH4 is the reference level
### log2FC > 0 suggests that higher pH (pH8) is associated with increased expression
### log2FC < 0 suggests that higher pH (pH8) is associated with lower expression
myres_condition8v4 <- results(ddsDE, contrast = c("condition", "pH8", "pH4"), tidy = TRUE)

myres_condition8v4 %>% 
    arrange(padj) %>% 
        head(10)
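The `padj` column holds p-values adjusted for multiple testing; by default DESeq2 uses the Benjamini-Hochberg (BH) procedure. The sketch below applies BH to four made-up p-values with base R's `p.adjust()` and repeats the calculation by hand. The gene names and p-values are hypothetical, and DESeq2 additionally applies independent filtering, which can set some `padj` values to `NA`.

```r
# Made-up raw p-values for four genes
p_raw <- c(geneA = 0.01, geneB = 0.04, geneC = 0.03, geneD = 0.20)

# Benjamini-Hochberg adjustment with base R
p_bh <- p.adjust(p_raw, method = "BH")

# The same calculation by hand:
# sort, scale by n / rank, then enforce monotonicity from the largest p-value down
n   <- length(p_raw)
ord <- order(p_raw)
by_hand <- rev(cummin(rev(p_raw[ord] * n / seq_len(n))))
by_hand <- pmin(by_hand, 1)[order(ord)]   # cap at 1 and restore the original order

data.frame(p_raw, p_bh, by_hand)
```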

### Visualize DE effect

Looking at the results for these two genes: 

* The estimated log2FC for CNAG_00275 is negative. We will verify visually that pH8, compared to pH4, is associated with lower expression

* The estimated log2FC for CNAG_00531 is positive. We will verify visually that pH8, compared to pH4, is associated with higher expression


In [None]:
results(ddsDE, contrast = c("condition", "pH8", "pH4"), tidy = TRUE) %>%
    filter(row %in% c("CNAG_00275","CNAG_00531"))

In [None]:
### This dot plot lets us verify visually that exposure to pH8, compared to pH4, is associated with lower expression
plotCounts(dds2019, "CNAG_00275", intgroup = "condition")

In [None]:
### This dot plot lets us verify visually that exposure to pH8, compared to pH4, is associated with higher expression
plotCounts(dds2019, "CNAG_00531", intgroup = "condition")

Volcano plot

In [None]:
### Volcano plot for the condition effect (pH4 relative to pH8)
ggplot(results(ddsDE, contrast = c("condition", "pH4", "pH8"), tidy = TRUE), 
       aes(x = log2FoldChange, y = -log10(padj))) + geom_point()

In [None]:
### Genotype Effect
ggplot(results(ddsDE, contrast = c("genotype", "sre1d", "WT"), tidy = TRUE), 
       aes(x = log2FoldChange, y = -log10(padj))) + geom_point()
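A volcano plot is often easier to read when genes are flagged by significance and direction. The sketch below builds the two plotted quantities on a made-up results table and adds such a flag. The table `toy_res`, the thresholds `lfc_cut` and `padj_cut`, and the column `flag` are all hypothetical; the column names `row`, `log2FoldChange` and `padj` follow the tidy `results()` output used above.

```r
# Made-up results table with the same column names as the tidy results
toy_res <- data.frame(
    row            = paste0("gene", 1:4),
    log2FoldChange = c(2.5, -0.2, -3.0, 0.1),
    padj           = c(1e-6, 0.8, 0.001, NA)
)

lfc_cut  <- 1      # assumed effect-size threshold
padj_cut <- 0.05   # assumed significance threshold

# y-axis of the volcano plot
toy_res$neg_log10_padj <- -log10(toy_res$padj)

# Flag each gene as up, down, or not significant (NA padj counts as not significant)
toy_res$flag <- with(toy_res,
    ifelse(!is.na(padj) & padj < padj_cut & abs(log2FoldChange) > lfc_cut,
           ifelse(log2FoldChange > 0, "up", "down"),
           "ns"))
toy_res
```

A column like `flag` could then be mapped to color in the `ggplot()` call, for example with `aes(col = flag)`.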

## Clustering

### Regularized log transformation
The regularized log transform can be obtained using the [rlog() function](https://rdrr.io/bioc/DESeq2/man/rlog.html). An important argument is `blind` (`TRUE` by default), which "blinds" the transformation to the design, i.e. the sample groupings are ignored when the transformation is estimated. This matters for unsupervised analyses such as class discovery, where using the design could bias the result.

In [None]:
rld <- rlog(dds2019, blind = TRUE)

### Dendrogram of samples: showing strain & media of each sample

Hierarchical clustering using rlog transformation

In [None]:
options(repr.plot.width = 9, repr.plot.height = 5)
dists <- dist(t(assay(rld)))
plot(hclust(dists)) 
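The key detail in the cell above is the transpose: `dist()` computes distances between rows, so the genes-by-samples matrix has to be transposed to cluster samples. The sketch below shows the same steps on a small simulated matrix with two made-up groups of samples (`toy_expr` and the group shifts are hypothetical).

```r
set.seed(7)

# Toy "transformed expression" matrix: 50 genes x 4 samples (simulated)
grp_shift <- c(a1 = 0, a2 = 0, b1 = 3, b2 = 3)   # two made-up groups
toy_expr <- sapply(grp_shift, function(shift) rnorm(50, mean = 8 + shift, sd = 0.5))
rownames(toy_expr) <- paste0("gene", 1:50)

# dist() works on rows, so transpose to get sample-to-sample distances
toy_dists <- dist(t(toy_expr))
toy_hc    <- hclust(toy_dists, method = "complete")

# Cut the tree into two clusters and show the membership of each sample
cutree(toy_hc, k = 2)
```

Samples `a1` and `a2` end up in one cluster and `b1` and `b2` in the other, mirroring how the simulated groups were constructed.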

Store the dendrogram of samples using hierarchical clustering

In [None]:
assay(rld) %>%
    t() %>%
    dist %>%
    hclust(method = "complete") %>%
    as.dendrogram ->
    mydend

Dendrogram of samples: showing strain of each sample

In [None]:

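dendplot <- function(mydend, columndata, labvar, colvar, pchvar) {
    # Reorder the sample annotations to match the leaf order of the dendrogram
    leaf_order <- order.dendrogram(mydend)
    # Label colors: one Set1 color per level of colvar (brewer.pal needs n >= 3)
    cols <- factor(columndata[[colvar]][leaf_order])
    collab <- brewer.pal(max(3, nlevels(cols)), "Set1")[cols]
    # Leaf symbols: one plotting character per level of pchvar
    pchs <- factor(columndata[[pchvar]][leaf_order])
    pchlab <- seq_len(nlevels(pchs))[pchs]
    # Leaf labels taken from labvar
    lablab <- as.character(columndata[[labvar]][leaf_order])

    mydend %>% 
        set("labels_cex",1) %>% 
        set("labels_col",collab) %>%
        set("leaves_pch",pchlab) %>%
        set("labels", lablab)
}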



In [None]:
options(repr.plot.width = 9, repr.plot.height = 5)
dendplot(mydend, dds2019@colData, 
         "genotype",    # variable shown in the label
         "genotype",    # variable that defines the color
         "condition") %>% # variable that defines the shape of the points
    plot

Dendrogram of samples: showing media of each sample

### Customize presentation

In [None]:
### Merge gene expression with meta data
myDEplotData <- function(mydds, geneid, mergelab) {
    counts(mydds, normalized = TRUE) %>%
        as_tibble(rownames="gene") %>%
        filter(gene == geneid) %>%
        # long format: one row per sample, keyed by the merge column
        pivot_longer(-gene, names_to = mergelab, values_to = "geneexp") %>%
        select(-gene) -> genedat

    colData(mydds) %>%
        as.data.frame %>%
        as_tibble %>%
        full_join(genedat, by = mergelab) -> genedat
    
    return(genedat)
}

myDEplotData(dds2019, "CNAG_00003", "Label")[,c("Label", "genotype", "condition" , "geneexp")]


In [None]:
### Basic function

myDEplot <- function(mydds, geneid, mergelab) {
    mydat <- myDEplotData(mydds, geneid, mergelab)
    ggplot(mydat, aes(x = condition, y = geneexp))+ geom_point()
}



In [None]:
### Allow for grouping by any factor in dataframe

myDEplot <- function(mydds, geneid, grpvar, mergelab) {
    mydat <- myDEplotData(mydds, geneid, mergelab)
    ggplot(mydat, aes(x = .data[[grpvar]], y = geneexp)) + geom_point()
}

myDEplot(dds2019, "CNAG_00003", "genotype", "Label")
myDEplot(dds2019, "CNAG_00003", "condition", "Label")



In [None]:
### Add color

myDEplot <- function(mydds, geneid, grpvar, mergelab) {
    mydat <- myDEplotData(mydds, geneid, mergelab)
    ggplot(mydat, aes(x = .data[[grpvar]], y = geneexp, col = .data[[grpvar]])) + geom_point()
}
myDEplot(dds2019, "CNAG_00003", "condition", "Label")

In [None]:
### Allow for coloring with respect to another factor
myDEplot <- function(mydds, geneid, grpvar, colvar, mergelab) {
    mydat <- myDEplotData(mydds, geneid, mergelab)
    ggplot(mydat, aes(x = .data[[grpvar]], y = geneexp, col = .data[[colvar]])) + geom_point()
}
myDEplot(dds2019, "CNAG_00003", "condition", "genotype", "Label")

In [None]:
sessionInfo()