# **Importing data & pre-processing**
Import data from GitHub & set row names (same as in part1):

In [1]:
# import file with NOT normalized expression data
dat.abundances <- read.table("https://raw.githubusercontent.com/ddz-icb/OmicsDataAnalysisCourse/main/data/dat.abundances.txt",
                            header=T,
                            sep="\t")
rownames(dat.abundances) <- dat.abundances[,1]        # set rownames to IDs from first column
dat.abundances <- data.matrix(dat.abundances[,-1])    # delete first column and change "data frame" to numeric "data matrix"



# import file with normalized data and extended information
dat.ext <- read.table("https://raw.githubusercontent.com/ddz-icb/OmicsDataAnalysisCourse/main/data/dat.ext.txt",
                            header=T,
                            sep="\t")

Keep only phosphorylated peptides (same as in part1):

In [2]:
# give row numbers with phosphopeptides
phospep.idx <- grep("Phospho", dat.ext$Modifications)   # grep() gives all row numbers containing the given pattern in column 'Modifications'

# keep only phosphopeptides
dat.abundances <- dat.abundances[phospep.idx,]
#dat.nonorm <- dat.nonorm[phospep.idx,]
dat.ext <- dat.ext[phospep.idx,]
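
The pattern-based row selection used above can be tried out on a small vector first. The following is a minimal sketch; the vector `mods` and its entries are invented for illustration and are not taken from `dat.ext`.

```r
# Toy stand-in for the 'Modifications' column (values are invented for illustration)
mods <- c("1xPhospho [S12]", "1xOxidation [M3]", "2xPhospho [S5; T9]", "")

# grep() returns the positions of all entries that contain the pattern
idx <- grep("Phospho", mods)
print(idx)

# grepl() returns a logical vector instead; it can be used for the same subsetting
keep <- grepl("Phospho", mods)
print(mods[keep])
```

Both variants select the same entries; `grep()` is convenient here because the resulting row numbers can be reused to subset several objects with identical row order.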

In part1, we determined that the normalization performed by the device software is acceptable, so we do not need to normalize the raw data ourselves. However, we still perform a group-specific imputation: if exactly one value is missing within a group, it is replaced by the mean of the remaining values of that group. This avoids excluding almost completely quantified phosphopeptides in some analysis steps. The same imputation as in part1 is performed.

In [3]:
# Give row vectors with group-specific column numbers
basal.idx <- grep("Basal", colnames(dat.abundances))
insulin.idx <- grep("Insulin", colnames(dat.abundances))



dat.abundances2 <- dat.abundances
for(i in 1:nrow(dat.abundances2)){
    if(sum(is.na(dat.abundances2[i,basal.idx])) == 1){
      na.idx <- which(is.na(dat.abundances2[i,basal.idx]))
      dat.abundances2[i, basal.idx[na.idx]] <- mean(dat.abundances2[i, basal.idx], na.rm=T)
    }

    if(sum(is.na(dat.abundances2[i,insulin.idx])) == 1){
      na.idx <- which(is.na(dat.abundances2[i,insulin.idx]))
      dat.abundances2[i, insulin.idx[na.idx]] <- mean(dat.abundances2[i, insulin.idx], na.rm=T)
    }
}
nrow(dat.abundances)
nrow(na.omit(dat.abundances))   # na.omit() removes rows with at least one 'NA'
nrow(na.omit(dat.abundances2))

dat.abundances2 <- na.omit(dat.abundances2)   # delete phosphopeptides that still have missing values despite imputation
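
To see what the group-specific imputation does, it may help to run the same logic on a tiny matrix. This is only a sketch: the matrix `toy`, the helper `impute.single.na()` and the sample names are hypothetical and chosen for illustration.

```r
# Toy matrix: 3 peptides x 6 samples (values invented for illustration)
toy <- matrix(c(10, 12, NA, 20, 22, 24,    # one NA in the basal group    -> imputed
                NA, NA,  5,  6,  7,  8,    # two NAs in the basal group   -> not imputed
                 1,  2,  3,  4,  5,  6),   # complete row                 -> unchanged
              nrow=3, byrow=TRUE,
              dimnames=list(c("pep1", "pep2", "pep3"),
                            c(paste0("Basal_", 1:3), paste0("Insulin_", 1:3))))
toy.basal.idx <- grep("Basal", colnames(toy))
toy.insulin.idx <- grep("Insulin", colnames(toy))

# Hypothetical helper: replace a value only if it is the single NA within the group
impute.single.na <- function(m, idx){
    for(i in seq_len(nrow(m))){
        na.pos <- which(is.na(m[i, idx]))
        if(length(na.pos) == 1){
            m[i, idx[na.pos]] <- mean(m[i, idx], na.rm=TRUE)
        }
    }
    m
}

toy2 <- impute.single.na(toy, toy.basal.idx)
toy2 <- impute.single.na(toy2, toy.insulin.idx)
print(toy2)

# Rows that still contain missing values are dropped by na.omit()
print(rownames(na.omit(toy2)))
```

In this toy case the missing basal value of `pep1` is filled with the mean of the two observed basal values, whereas `pep2` keeps its two missing values and is removed afterwards.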

# **Identification of differential candidates**
Calculate p-values and fold changes in order to identify differential candidates (same as in part2).

In [4]:
phospep.number <- nrow(dat.abundances2)

# define empty vectors for storing...
fc <- vector(length=phospep.number, mode="numeric")
p.val <- vector(length=phospep.number, mode="numeric")
p.val.adj <- vector(length=phospep.number, mode="numeric")

# adopt IDs of vector elements from row IDs of dat.abundances2
names(fc) <- rownames(dat.abundances2)
names(p.val) <- rownames(dat.abundances2)
names(p.val.adj) <- rownames(dat.abundances2)



# Calculate fold changes
for(i in 1:phospep.number){
    basal.mean <- mean(dat.abundances2[i, basal.idx])
    insulin.mean <- mean(dat.abundances2[i, insulin.idx])

    fc[i] <- insulin.mean / basal.mean
}



# Calculate p-values and adjusted p-values
for(i in 1:phospep.number){
    p.val[i] <- t.test(log2(dat.abundances2[i, basal.idx]), log2(dat.abundances2[i, insulin.idx]))$p.value
}
p.val.adj <- p.adjust(p.val, method="fdr")



# Give row numbers of candidates with both large log-fold change and low adj. p-value
diff.idx1 <- which(abs(log2(fc)) > 1)         # trick: via absolute value of log2(fc) we get phosphopeptides with fc > 2 as well as those with fc < 1/2
diff.idx2 <- which(p.val.adj < 0.05)          # gives row indices of phosphopeptides with adj. p-value < 0.05
diff.idx <- intersect(diff.idx1, diff.idx2)   # intersection gives row indices of differential candidates
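
The two selection criteria can be checked on a few hand-picked numbers. The following sketch uses invented fold changes and p-values (`toy.fc`, `toy.p`); only the thresholds are taken from the cell above.

```r
# Invented fold changes and raw p-values for four hypothetical phosphopeptides
toy.fc <- c(pepA=4, pepB=0.25, pepC=1.5, pepD=1.2)
toy.p  <- c(pepA=0.001, pepB=0.03, pepC=0.002, pepD=0.04)

# In p.adjust(), "fdr" is an alias for the Benjamini-Hochberg method ("BH")
toy.p.adj <- p.adjust(toy.p, method="fdr")
print(toy.p.adj)

# abs(log2(fc)) treats up- and down-regulation symmetrically:
# fc = 4 and fc = 0.25 both give an absolute log2 fold change of 2
print(abs(log2(toy.fc)))

# Same thresholds as above (both comparisons are strict)
toy.idx1 <- which(abs(log2(toy.fc)) > 1)
toy.idx2 <- which(toy.p.adj < 0.05)
toy.diff.idx <- intersect(toy.idx1, toy.idx2)
print(names(toy.fc)[toy.diff.idx])
```

Here `pepC` and `pepD` pass the p-value filter but not the fold-change filter, so only `pepA` and `pepB` remain as candidates.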

# **Hierarchical clustering & heat maps**
In order to perform hierarchical clustering...

In [None]:
# Transpose data matrix to have samples in the rows and phosphopeptides in the columns.
# This is needed, because the dist()-function computes distances between rows.
dat.abundances3 <- t(dat.abundances2)

# Scale the columns to make them more comparable, i.e. subtract from each value
# its column-specific mean and divide by its column-specific standard deviation.
dat.abundances3 <- scale(dat.abundances3)

# Compute distances and hierarchical clustering dendrogram
d1 <- dist(dat.abundances3, method="euclidean")
hc1 <- as.dendrogram(hclust(d1, method="complete"), hang=0.1)

# Visualize the dendrogram
options(repr.plot.width=10, repr.plot.height=10)
plot(hc1)
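
The effect of the scaling and distance steps can be verified on a small random matrix. This is a sketch under the assumption that samples are in the rows, as in `dat.abundances3`; the object names (`toy`, `toy.scaled`, `toy.d`, `toy.hc`) are hypothetical.

```r
set.seed(1)   # reproducible toy data
toy <- matrix(rnorm(4 * 5), nrow=4,
              dimnames=list(paste0("sample", 1:4), paste0("pep", 1:5)))

# scale() centers each column to mean 0 and scales it to standard deviation 1,
# so that no single phosphopeptide dominates the euclidean distance
toy.scaled <- scale(toy)
print(round(colMeans(toy.scaled), 10))
print(apply(toy.scaled, 2, sd))

# dist() computes distances between rows (here: samples)
toy.d <- dist(toy.scaled, method="euclidean")

# Manual check of the euclidean distance between the first two samples
manual <- sqrt(sum((toy.scaled[1, ] - toy.scaled[2, ])^2))
print(all.equal(manual, as.matrix(toy.d)[1, 2]))

# Complete linkage: n samples are joined in n-1 merge steps
toy.hc <- hclust(toy.d, method="complete")
print(toy.hc$merge)
```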

To improve the visualization of hierarchical clustering results, various R packages can be used. Here, the package "dendextend" is used because it provides a relatively simple syntax.

In [None]:
install.packages("dendextend")
library(dendextend)

Improved hierarchical clustering:

In [None]:
# Compute euclidean distances
d1 <- dist(dat.abundances3, method="euclidean")
d2 <- dist(dat.abundances3[,diff.idx], method="euclidean")

# Compute hierarchical clustering using complete linkage
hc1 <- as.dendrogram(hclust(d1, method="complete"), hang=0.1)
hc2 <- as.dendrogram(hclust(d2, method="complete"), hang=0.1)

# set label colors and label sizes
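# derive the colors from the leaf labels, because colors are assigned in dendrogram (leaf) order
labels_colors(hc1) <- ifelse(grepl("Basal", labels(hc1)), "blue", "red")
labels_colors(hc2) <- ifelse(grepl("Basal", labels(hc2)), "blue", "red")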
labels_cex(hc1) <- 2.5
labels_cex(hc2) <- 2.5

options(repr.plot.width=20, repr.plot.height=10)
par(mar=c(5, 4, 4, 2) + 1, mfrow=c(1,2))
plot(x=hc1,
     main="Cluster dendrogram",   # plot title
     ylab="Height",   # axis title
     cex.lab=2.0,     # size of axis title
     cex.axis=2.0,    # size of axis labels
     cex.main=3.0     # size of plot title
     )
plot(x=hc2,
     main="Cluster dendrogram (diff. candidates)",
     ylab="Height",
     cex.lab=2.0,
     cex.axis=2.0,
     cex.main=3.0
     )
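
With dendextend, label colors are assigned in the left-to-right leaf order of the dendrogram, which generally differs from the column order of the data. The following sketch uses only base R to inspect that order and to build a matching color vector; the toy data and the names `toy.dend`, `leaf.labels` and `leaf.cols` are assumptions made for illustration.

```r
set.seed(2)   # reproducible toy data
toy <- matrix(rnorm(6 * 10), nrow=6,
              dimnames=list(c(paste0("Basal_", 1:3), paste0("Insulin_", 1:3)), NULL))
toy.dend <- as.dendrogram(hclust(dist(toy), method="complete"))

# labels() returns the sample names in left-to-right leaf order
leaf.labels <- labels(toy.dend)
print(leaf.labels)

# order.dendrogram() gives the original row index of each leaf
print(order.dendrogram(toy.dend))

# A color vector derived from the leaf labels always matches the leaf order
leaf.cols <- ifelse(grepl("Basal", leaf.labels), "blue", "red")
print(setNames(leaf.cols, leaf.labels))
```

A vector built this way could then be passed to `labels_colors()`, so that each sample keeps its group color regardless of where it ends up in the tree.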

Simple heat map:

In [None]:
nrow(dat.abundances2)
heatmap(dat.abundances2, scale="row")

Improved heat map:

In [None]:
length(diff.idx)


d.fun1 <- function(c) dist(c, method="manhattan")
d.fun2 <- function(c) as.dist(1-cor(t(c)))

hc.fun1 <- function(d) hclust(d, method="complete")

options(repr.plot.width=10, repr.plot.height=10)
heatmap(dat.abundances2[diff.idx,], scale="row", distfun=d.fun1, hclustfun=hc.fun1, cexCol=2.5, margins=c(8,8))
heatmap(dat.abundances2[diff.idx,], scale="row", distfun=d.fun2, hclustfun=hc.fun1, cexCol=2.5, margins=c(8,8))
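
The two distance functions emphasize different aspects of the data. A minimal sketch with three invented abundance profiles (`pepA`, `pepB`, `pepC`) shows the difference:

```r
# Three invented abundance profiles across four samples
toy <- rbind(pepA=c(1, 2, 3, 4),
             pepB=c(10, 20, 30, 40),   # same profile shape as pepA, larger magnitude
             pepC=c(4, 3, 2, 1))       # opposite profile at the same magnitude as pepA

# Manhattan distance: sum of absolute differences, sensitive to magnitude
d.man <- dist(toy, method="manhattan")
print(d.man)

# Correlation-based distance: 0 for perfectly correlated rows,
# 2 for perfectly anti-correlated rows, independent of magnitude
d.cor <- as.dist(1 - cor(t(toy)))
print(round(as.matrix(d.cor), 3))
```

With the Manhattan distance, `pepA` is closer to `pepC` than to `pepB`, while the correlation-based distance groups `pepA` with `pepB` because their profiles have the same shape.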

# **PCA**
Basic PCA computation and results visualization:

In [None]:
# Perform PCA with the function prcomp(). Since it computes principal components
# for the columns of a data matrix we need to transpose the data first via t().
# Moreover, we scale the distributions of the phosphopeptides to make them more
# comparable and get better results.
pcdat <- prcomp(t(dat.abundances2), center=TRUE, scale=TRUE)

# The computed principal components are stored in the matrix 'x' of the
# results object. Because they are sorted by explained variance, the first
# two principal components differentiate the data points (here: samples) most
# clearly.
pc <- pcdat$x
print(pc)

# To visualize the results
par(mar=c(5, 4, 4, 2) + 1)
plot(pc[,1],
     pc[,2],
     xlab=colnames(pc)[1],
     ylab=colnames(pc)[2],
     main="PCA",
     cex=3,
     cex.axis=2,
     cex.lab=2,
     cex.main=3)

Improved PCA computation and results visualization:

In [None]:
pcdat <- prcomp(t(dat.abundances2), center=TRUE, scale=TRUE)
#pcdat <- prcomp(t(dat.abundances2[diff.idx,]), center=TRUE, scale=TRUE)
scores <- pcdat$x

print(scores)

Xlab <- paste0("PC1 (", round(summary(pcdat)$importance[2,1]*100, 2), "%)")
Ylab <- paste0("PC2 (", round(summary(pcdat)$importance[2,2]*100, 2), "%)")

par(mar=c(5, 4, 4, 2) + 1)
plot(scores[,1], scores[,2], xlab=Xlab, ylab=Ylab, cex=3, cex.axis=2, cex.lab=2)
points(scores[basal.idx,1], scores[basal.idx,2], cex=3, pch=19, col="blue")
points(scores[insulin.idx,1], scores[insulin.idx,2], cex=3, pch=19, col="red")
text(scores[basal.idx,1], scores[basal.idx,2], labels=rownames(scores)[basal.idx], pos=4, offset=1, cex=2, col="blue")
text(scores[insulin.idx,1], scores[insulin.idx,2], labels=rownames(scores)[insulin.idx], pos=2, offset=1, cex=2, col="red")
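
The percentages in the axis labels are the proportions of variance explained by each principal component. As a sketch on random toy data (the names `toy`, `toy.pca` and `var.explained` are hypothetical), they can be recomputed from the standard deviations stored in the `prcomp()` result:

```r
set.seed(3)   # reproducible toy data
toy <- matrix(rnorm(6 * 4), nrow=6,
              dimnames=list(paste0("sample", 1:6), paste0("pep", 1:4)))

toy.pca <- prcomp(toy, center=TRUE, scale.=TRUE)

# 'sdev' holds the standard deviation of each principal component;
# the squared values are the variances, which sum up to the total variance
var.explained <- toy.pca$sdev^2 / sum(toy.pca$sdev^2)
print(round(var.explained * 100, 2))

# The same proportions are reported (rounded) in row 2 of summary()$importance
print(summary(toy.pca)$importance[2, ])
```

Because the components are ordered by decreasing variance, the first entries of `var.explained` are the largest, which is why PC1 and PC2 are usually chosen for plotting.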