## edgeR TMM normalization with STAR gene counts table

https://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf

In [None]:
system("ln -sfn ~/public/rnaseq/Day2_materials/* ~/module-3-rnaseq/Day2_materials/")

In [None]:
options(stringsAsFactors = FALSE)

Load required libraries (install packages if required)

In [None]:
#Load packages
library(limma)
library(edgeR)
library(data.table)
library(RColorBrewer)
library(gplots)

### Creating DGE object for edgeR

Read in the counts file `data/CMM262_01232024_counts.csv` and view the head of the file.

In [None]:
#Read file (gene IDs in the first column become row names)
counts <- read.csv("./data/CMM262_01232024_counts.csv", stringsAsFactors=FALSE, row.names=1)
head(counts)

Define groups and design and create `dge` using `DGEList()`.

In [None]:
# 10 CTB_1st samples followed by 10 EVT_1st samples, in the column order of the counts table
group <- factor(rep(c("CTB_1st","EVT_1st"), each=10))
group

design <- model.matrix(~0+group)
design

dge<- DGEList(counts=counts,group=group)

Plot library sizes

In [None]:
par(mar=c(10,5,5,5))
options(scipen=5)
barplot(dge$samples$lib.size, horiz=FALSE, names.arg=colnames(dge$counts), las=2, cex.names=0.5, cex.axis=.5,
        main="Library Size")

In [None]:
#To check
class(dge)
dim(dge)
dge$samples

### Filtering based on cpm cutoff

Keep only genes with more than 10 counts per million (cpm) in at least 3 samples.

In [None]:
dim(dge)

In [None]:
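# how many genes have zero counts in all 20 samples?
table(rowSums(dge$counts==0)==20)
# keep genes with cpm > 10 in at least 3 samples, then recompute library sizes
keep <- rowSums(cpm(dge)>10) >= 3
dge.f <- dge[keep, , keep.lib.sizes=FALSE]
dim(dge.f)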
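To make the filtering rule concrete, here is a minimal base R sketch on a made-up count matrix (the gene names, sample names, and counts are invented for illustration). It computes cpm by hand as counts divided by library size times one million, which is what `cpm()` gives when all normalization factors are still 1.

```r
# Toy counts: 5 genes x 4 samples, chosen so every library sums to 1,000,000
toy <- matrix(c(     0,      0,      0,      0,
                    50,     60,     55,      0,
                     5,      4,      6,      5,
                    20,      3,      2,     15,
                999925, 999933, 999937, 999980),
              nrow = 5, byrow = TRUE,
              dimnames = list(c("gene1", "gene2", "gene3", "gene4", "rest"),
                              paste0("s", 1:4)))

# Counts per million: divide each column by its library size
toy.lib.size <- colSums(toy)
toy.cpm <- t(t(toy) / toy.lib.size) * 1e6

# Same rule as above: cpm > 10 in at least 3 samples
keep.toy <- rowSums(toy.cpm > 10) >= 3
keep.toy
```

Only `gene2` and `rest` pass: `gene3` never exceeds 10 cpm, and `gene4` exceeds it in just two samples.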

### Normalization using TMM, dispersion estimation using naive method, and DGE
“TMM (weighted trimmed mean of log expression) determines scaling factor calculated after double trimming values at the two extremes based on log-intensity ratios (M-values) and log-intensity averages (A-values)” (Dillies et al. Briefings in Bioinformatics, Vol. 14 (6): 671–683, 2013)

The `calcNormFactors()` function normalizes for RNA composition by finding a set of scaling factors for the library sizes that minimize the log-fold changes between the samples for most genes. The default method for computing these scale factors uses a trimmed mean of M-values (TMM) between each pair of samples. We call the product of the original library size and the scaling factor the effective library size. The effective library size replaces the original library size in all downstream analyses.
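As a small illustration of the effective library size, the sketch below runs `calcNormFactors()` on simulated counts (the matrix `sim` and its parameters are invented, not the course data). A handful of genes is inflated in one sample to mimic a composition difference.

```r
library(edgeR)

set.seed(1)  # reproducible simulated counts
sim <- matrix(rnbinom(1000 * 4, mu = 50, size = 10), ncol = 4,
              dimnames = list(paste0("gene", 1:1000), paste0("s", 1:4)))

# Inflate 20 genes in sample 4 so they take up a large share of its reads
sim[1:20, 4] <- sim[1:20, 4] * 50

y.sim <- DGEList(counts = sim)
y.sim <- calcNormFactors(y.sim, method = "TMM")

# Effective library size = original library size x TMM scaling factor
eff.lib.size <- y.sim$samples$lib.size * y.sim$samples$norm.factors
data.frame(lib.size     = y.sim$samples$lib.size,
           norm.factor  = y.sim$samples$norm.factors,
           eff.lib.size = eff.lib.size,
           row.names    = colnames(sim))
```

The factor for sample 4 should come out below 1, shrinking its effective library size so that the unchanged genes are not judged to be down-regulated. The factors are scaled so that their product is 1.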

To normalize using TMM and then test for differential expression:
- Calculate the normalization factors (`calcNormFactors()`) using `d`
- Maximize the negative binomial conditional common likelihood to estimate a common dispersion value across all genes (`estimateCommonDisp()`)
- Compute genewise exact tests for differences in the means between two groups of negative-binomially distributed counts (`exactTest()`)

### Estimate dispersion
The first major step in the analysis of DGE data using the negative binomial (NB) model is to estimate the dispersion parameter for each tag (gene), a measure of the degree of inter-library variation for that tag. Estimating the common dispersion gives an idea of the overall variability across the genome for this dataset. The square root of the common dispersion gives the biological coefficient of variation (BCV).

In this example, the result is stored in its own variable, `d`, because dispersion can be estimated in two ways: by assuming that every gene shares the same common dispersion, or by using a generalized linear model.

Estimate the dispersions with `estimateDisp()` using the filtered data (`dge.f`) and store the result in `d`.

In [None]:
#estimate dispersions
d <- estimateDisp(dge.f, design=design)

In [None]:
d$common.dispersion
sqrt(d$common.disp)

Here the common dispersion is found to be 0.06, so the biological coefficient of variation (BCV), its square root, is around 0.25.
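The arithmetic behind these two numbers can be checked in base R. The sketch below also shows what a dispersion of 0.06 implies under the negative binomial model, where the variance is `mu + dispersion * mu^2`. The mean values in `mu` are arbitrary examples.

```r
common.disp <- 0.06          # common dispersion reported above
bcv <- sqrt(common.disp)     # biological coefficient of variation
round(bcv, 3)

# NB variance for a few example mean counts (Poisson variance would equal mu)
mu <- c(10, 100, 1000)
nb.var <- mu + common.disp * mu^2
data.frame(mu = mu, poisson.var = mu, nb.var = nb.var, cv = sqrt(nb.var) / mu)
```

As the mean grows, the coefficient of variation approaches the BCV, which is why the BCV is read as the biological variability left after Poisson sampling noise becomes negligible.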

In [None]:
TMM <- calcNormFactors(d, method="TMM")
TMM <- estimateCommonDisp(TMM)
TMM <- exactTest(TMM)
dges <- table(p.adjust(TMM$table$PValue, method="BH")<0.05)
dges
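The `p.adjust(..., method="BH")` call above converts raw p-values into Benjamini-Hochberg adjusted values (FDR). A minimal base R sketch on five invented p-values shows how the adjustment works and how the `< 0.05` count is obtained:

```r
# Five made-up raw p-values, already sorted for readability
p.raw <- c(0.001, 0.008, 0.039, 0.041, 0.9)

# BH by hand: p * n / rank, then enforce monotonicity from the largest p down
n <- length(p.raw)
by.hand <- rev(cummin(rev(p.raw * n / seq_len(n))))

# Same result from the built-in function
p.bh <- p.adjust(p.raw, method = "BH")
rbind(raw = p.raw, by.hand = by.hand, p.adjust = p.bh)

# Number of "genes" called significant at FDR < 0.05
sum(p.bh < 0.05)
```

Note that the third and fourth p-values are both below 0.05 before adjustment but not after it.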

**This means that we have 7977 differentially expressed genes (adjusted p < 0.05) with edgeR**

If we want to take a look at the top 20:

In [None]:
TMM.table<-data.frame(topTags(TMM, n=20))
TMM.table

In [None]:
#write out the differentially expressed genes with FDR <= 0.05
DGEs_05 <- topTags(TMM, n=Inf, adjust.method="BH")
keep <- DGEs_05$table$FDR <= 0.05
write.table(DGEs_05$table[keep,],file="./output/edgeR_TMM_p0.05.txt",sep="\t")

## Plots

### Raw and unfiltered data

First calculate cpm and log cpm using the unfiltered data (`dge`)

In [None]:
# Raw data
cpm <- cpm(dge)
lcpm <- cpm(dge, log=TRUE)

Now calculate the log cpm for the filtered data (`dge.f`)

In [None]:
# Filtered data
lcpm.f <- cpm(dge.f, log=TRUE)

Lastly, calculate the TMM normalization factors using `dge.f` (stored as `dge.norm`) and get the log cpm of the normalized data (`lcpm.norm`).

In [None]:
# TMM normalized data
dge.norm <- calcNormFactors(dge.f,method="TMM") 
dge.norm$samples$norm.factors
lcpm.norm <- cpm(dge.norm, log=TRUE)

In [None]:
#set colours for graphs
nsamples <- ncol(dge.norm)
# "Paired" has at most 12 colours, so interpolate to get one colour per sample
col <- colorRampPalette(brewer.pal(12, "Paired"))(nsamples)

#Visualise filtered vs unfiltered data
par(mfrow=c(1,2))

#plot unfiltered data
samplenames<-c("CTB_1st","CTB_1st","CTB_1st","CTB_1st","CTB_1st","CTB_1st","CTB_1st","CTB_1st","CTB_1st","CTB_1st","EVT_1st","EVT_1st","EVT_1st","EVT_1st","EVT_1st",
                   "EVT_1st","EVT_1st","EVT_1st","EVT_1st","EVT_1st")

plot(density(lcpm[,1]),col=col[1],lwd=2,ylim=c(0,2.5),las=2,main="",xlab="")

title(main="A. Raw data", xlab="Log-cpm")
abline(v=0, lty=3)
for (i in 2:nsamples){
  den <- density(lcpm[,i])
  lines(den$x, den$y, col=col[i], lwd=2)
}
legend("topright", samplenames, text.col=col, bty="n",cex=0.8,pt.cex=0.8)


#plot filtered data
plot(density(lcpm.f[,1]), col=col[1], lwd=2, ylim=c(0,0.5), las=2,
     main="", xlab="")

title(main="B. Filtered data", xlab="Log-cpm")
abline(v=0, lty=3)
for (i in 2:nsamples){
  den <- density(lcpm.f[,i])
  lines(den$x, den$y, col=col[i], lwd=2)
}
legend("topright", samplenames, text.col=col, bty="n",cex=0.8,pt.cex=0.8)

### Boxplots of TMM-Normalized vs. unnormalized data

In [None]:
# Unnormalized data
lcpm <- cpm(dge, log=TRUE)

In [None]:
# TMM normalized data
dge.norm <- calcNormFactors(dge.f,method="TMM") 
dge.norm$samples$norm.factors
lcpm.norm <- cpm(dge.norm, log=TRUE)

In [None]:
par(mfrow=c(1,2))

# Unnormalised data
boxplot(lcpm, las=2, col=col, main="", ylim=c(2,20), names=as.character(group))
title(main="A. Unnormalized data",ylab="Log-cpm")

# TMM normalized data
boxplot(lcpm.norm, las=2, col=col, main="", ylim=c(2,20), names=as.character(group))
title(main="B. TMM Normalized data",ylab="Log-cpm")

### MDS plot (PCA-like)
Make a multidimensional scaling (MDS) plot, which gives a PCA-like view of how the samples cluster, using the log cpm normalized data (`lcpm.norm`).

In [None]:
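#MDS plot (PCA-like view of the samples)
par(mfrow=c(1,1))
col.group <- group
# brewer.pal() needs n >= 3, so request 3 colours and keep one per group
levels(col.group) <- brewer.pal(3, "Set1")[seq_len(nlevels(col.group))]
col.group <- as.character(col.group)
plotMDS(lcpm.norm, labels=group, col=col.group)
title(main="Samples")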
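For intuition about what an MDS plot shows, here is a rough base R sketch using classical MDS (`cmdscale()`) on Euclidean distances between simulated samples. This is an analogy only: `plotMDS()` uses its own distance measure based on leading log-fold changes, and the matrix `expr`, its sample names, and the simulated group effect are all invented.

```r
set.seed(42)  # reproducible toy data
n.genes <- 100

# Simulated log-expression values: 3 samples per group
expr <- matrix(rnorm(n.genes * 6, mean = 5, sd = 0.5), nrow = n.genes,
               dimnames = list(paste0("gene", 1:n.genes),
                               c(paste0("CTB_", 1:3), paste0("EVT_", 1:3))))

# Shift the first 50 genes upward in the EVT samples to mimic a group effect
expr[1:50, 4:6] <- expr[1:50, 4:6] + 3

# Classical MDS on the distances between samples (columns of expr)
sample.dist <- dist(t(expr))
mds <- cmdscale(sample.dist, k = 2)

# Plot sample labels at their MDS coordinates, coloured by group
plot(mds, type = "n", xlab = "Dimension 1", ylab = "Dimension 2")
text(mds, labels = rownames(mds), col = rep(c("red", "blue"), each = 3))
```

Samples that are close together in the plot have similar expression profiles, so the two simulated groups end up on opposite sides of dimension 1.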

### Heatmap of genes significantly different between groups (top 20)

<div class="alert alert-block alert-success">
    <p><b>Exercise:</b> Generate the heatmap for significantly different genes</p>
</div>

Hint: First subset the genes in the log cpm TMM-normalized data (`lcpm.norm`) using the genes in the `TMM.table`

In [None]:
# subset the top genes from the lcpm normalized data
genes <- row.names(TMM.table)
genes_lcpm.norm <- lcpm.norm[rownames(lcpm.norm) %in% genes, ]

In [None]:
## Get some nicer colours
mypalette <- brewer.pal(11,"RdYlBu")
morecols <- colorRampPalette(mypalette)
# Set up colour vector for celltype variable
col.cell <- c("purple","orange")[group]

heatmap.2(genes_lcpm.norm,col=rev(morecols(50)),trace="none", main="Top 20 genes, TMM normalized",
          ColSideColors=col.cell,scale="row",margins=c(9,9), cexCol=0.8)


The total number of DEGs seems low, so we can also try the GLM (quasi-likelihood) approach.

In [None]:
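y <- DGEList(counts=counts,group=group)
# keep genes with at least 10 reads in total and recompute library sizes
keep <- rowSums(y$counts) >= 10
y <- y[keep, , keep.lib.sizes=FALSE]
y <- calcNormFactors(y, method="TMM")
design <- model.matrix(~0+group)
y <- estimateDisp(y,design)
fit <- glmQLFit(y,design)
# with a no-intercept design, compare the groups with a contrast (EVT_1st - CTB_1st);
# coef=2 would instead test whether the EVT_1st mean is zero
qlf <- glmQLFTest(fit,contrast=c(-1,1))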

In [None]:
# count DEGs from the quasi-likelihood test (qlf), not the exact test (TMM)
dges_qlf <- table(p.adjust(qlf$table$PValue, method="BH")<0.05)
dges_qlf

## limma-voom

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4937821/

It has been shown that for RNA-seq count data, the variance is not independent of the mean – this is true of raw counts or when transformed to log-CPM values. Methods that model counts using a Negative Binomial distribution assume a quadratic mean-variance relationship. In limma, linear modelling is carried out on the log-CPM values which are assumed to be normally distributed and the mean-variance relationship is accommodated using precision weights calculated by the voom function.

When operating on a DGEList object, `voom()` converts raw counts to log-CPM values by automatically extracting library sizes and normalisation factors from the object itself.
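The log-CPM transformation that `voom()` applies can be sketched in base R as `log2((count + 0.5) / (effective library size + 1) * 1e6)`. The counts, library sizes, and normalization factors below are invented for illustration.

```r
# Toy counts: 3 genes x 2 samples
counts.toy <- matrix(c(  0,   5,
                        10,  20,
                       100,  75),
                     nrow = 3, byrow = TRUE,
                     dimnames = list(paste0("g", 1:3), c("s1", "s2")))

lib.size.toy     <- c(s1 = 1e6, s2 = 2e6)   # hypothetical library sizes
norm.factors.toy <- c(s1 = 1.0, s2 = 1.0)   # hypothetical TMM factors
eff.lib.toy      <- lib.size.toy * norm.factors.toy

# The 0.5 and 1 offsets keep the value finite when a count is zero
logcpm.toy <- t(log2(t(counts.toy + 0.5) / (eff.lib.toy + 1) * 1e6))
logcpm.toy
```

A zero count in a library of one million reads maps to roughly -1 on this scale rather than to minus infinity.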

Typically, the “voom-plot” shows a decreasing trend between the means and variances resulting from a combination of technical variation in the sequencing experiment and biological variation amongst the replicate samples from different cell populations. Experiments with high biological variation usually result in flatter trends, where variance values plateau at high expression values. Experiments with low biological variation tend to result in sharp decreasing trends.

First set up the design matrix

In [None]:
#limma-voom 
#Set up design
design <- model.matrix(~0 + group)
colnames(design) <- gsub("group","", colnames(design))
design

Then use [`makeContrasts()`](https://www.rdocumentation.org/packages/limma/versions/3.28.14/topics/makeContrasts) to "express contrasts between a set of parameters as a numeric matrix".

In [None]:
cm <- makeContrasts(CTBvsEVT=CTB_1st-EVT_1st,levels=design)
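To see what such a contrast does, here is a base R sketch that does not use limma. With a no-intercept design each coefficient is a group mean, and the contrast vector `c(1, -1)` turns the two means into their difference. The expression values in `y.toy` are invented.

```r
# Toy design: 3 samples per group, no intercept
grp <- factor(rep(c("CTB_1st", "EVT_1st"), each = 3))
des <- model.matrix(~0 + grp)
colnames(des) <- levels(grp)

# Made-up log-expression values for one gene
y.toy <- c(5.0, 5.2, 4.8, 7.1, 6.9, 7.0)

# Least-squares coefficients are the two group means
coefs <- lm.fit(des, y.toy)$coefficients
coefs

# Contrast CTB_1st - EVT_1st as a numeric vector
contrast.toy <- c(CTB_1st = 1, EVT_1st = -1)
est <- sum(contrast.toy * coefs)
est
```

A negative estimate means the gene is lower in CTB_1st than in EVT_1st, which is the direction of the `CTBvsEVT` contrast defined above.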

Apply `voom()` to remove heteroscedasticity from count data

In [None]:
v <- voom(dge.norm, design, plot=TRUE)
write.csv(v$E, "./output/TMM_and_Voom_normalized_counts.csv")

Fit the linear model

In [None]:
vfit <- lmFit(v,design)
vfit <- contrasts.fit(vfit, contrasts=cm)
efit <- eBayes(vfit)
plotSA(efit, main="Final model: Mean-variance trend")

Use `decideTests()` to determine which genes are up-regulated, down-regulated or not significantly different.

In [None]:
dt <- decideTests(efit)
summary(dt)

**This shows that with limma-voom we found 8296 DEGs**

### Genes with adjusted p<0.05

In [None]:
#write out genes with adjusted p<0.05
CTB_vs_EVT<-topTable(efit,coef=1,number=Inf)
head(CTB_vs_EVT)
ENSID<-row.names(CTB_vs_EVT)
norm<-data.frame(v$E)
merged<-merge(CTB_vs_EVT,norm,by=0,all=TRUE)
final<-subset(merged,merged$adj.P.Val<0.05)
write.table(final,file="./output/final_mRNAs_p0.05_limma.txt",sep="\t")

### Heatmap with genes with p<0.05

In [None]:
#heatmap
#subset the significant mRNAs from the voom-normalized data
mRNAs<-final$Row.names
lcpm.norm.heatmap<-as.matrix(subset(norm,rownames(norm) %in% mRNAs))

## Get some nicer colours
mypalette <- brewer.pal(11,"RdYlBu")
morecols <- colorRampPalette(mypalette)
# Set up colour vector for celltype variable
col.cell <- c("purple","orange")[group]
heatmap.2(lcpm.norm.heatmap,col=rev(morecols(50)),trace="none", main="p<0.05 limma-voom",
          ColSideColors=col.cell,scale="row",margins=c(9,9), cexCol=0.8)

In [None]:
star_salmon_degs_EdgeR_limma <- read.csv("./data/DEGs_salmon_star_edgeR_limma.csv", header=TRUE)
star_salmon_degs_EdgeR_limma <- data.frame(star_salmon_degs_EdgeR_limma)
star_salmon_degs_EdgeR_limma