
# **Importing data & pre-processing**
Import data from GitHub & set row names (same as in part1):

In [2]:
# file with normalized expression data (abundances)
dat.abundances <- read.table("https://raw.githubusercontent.com/ddz-icb/OmicsDataAnalysisCourse/main/data/dat.abundances.txt",
                            header=T,
                            sep="\t")
rownames(dat.abundances) <- dat.abundances[,1]        # set rownames to IDs from first column
dat.abundances <- data.matrix(dat.abundances[,-1])    # delete first column and change "data frame" to numeric "data matrix"



## file with NOT normalized expression data
#dat.nonorm <- read.table("https://raw.githubusercontent.com/ddz-icb/OmicsDataAnalysisCourse/main/data/dat.nonorm.txt",
#                            header=T,
#                            sep="\t")
#rownames(dat.nonorm) <- dat.nonorm[,1]
#dat.nonorm <- data.matrix(dat.nonorm[,-1])



# file with normalized data and extended information
dat.ext <- read.table("https://raw.githubusercontent.com/ddz-icb/OmicsDataAnalysisCourse/main/data/dat.ext.txt",
                            header=T,
                            sep="\t")

Find the rows that contain phosphopeptides and keep only these phosphorylated peptides:

In [3]:
# get row numbers of phosphopeptides
phospep.idx <- grep("Phospho", dat.ext$Modifications)   # grep() returns the numbers of all rows whose entry in column 'Modifications' contains the pattern

# keep only phosphopeptides
dat.abundances <- dat.abundances[phospep.idx,]
#dat.nonorm <- dat.nonorm[phospep.idx,]
dat.ext <- dat.ext[phospep.idx,]
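
To illustrate how `grep()` selects rows, here is a minimal sketch on a small toy table. The table `toy.ext` and its entries are hypothetical and only mimic the 'Modifications' column; they are not taken from the course data.

```r
# Hypothetical toy annotation table (NOT the course data)
toy.ext <- data.frame(
  ID = c("pep1", "pep2", "pep3", "pep4"),
  Modifications = c("1xPhospho [S5]",
                    "1xOxidation [M3]",
                    "",
                    "1xOxidation [M2]; 2xPhospho [S4; T9]")
)

# row numbers whose 'Modifications' entry contains the pattern "Phospho"
toy.idx <- grep("Phospho", toy.ext$Modifications)
print(toy.idx)

# keep only the matching rows
toy.phos <- toy.ext[toy.idx, ]
print(toy.phos)
```

Note that applying `phospep.idx` (derived from `dat.ext`) to `dat.abundances` assumes that both tables list the peptides in the same row order.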

In part1, we determined that the normalization performed by the device software is acceptable, so we can use the normalized values and do not need to normalize the raw data ourselves. However, we still want to perform a group-specific imputation that replaces isolated missing values, so that phosphopeptides that are almost completely quantified are not excluded in later analysis steps. The imputation is the same as in part1: if exactly one value is missing within a group, it is replaced by the mean of the remaining values of that group.

In [None]:
basal.idx <- grep("Basal", colnames(dat.abundances))
insulin.idx <- grep("Insulin", colnames(dat.abundances))



dat.abundances2 <- dat.abundances
for(i in 1:nrow(dat.abundances2)){
    if(sum(is.na(dat.abundances2[i,basal.idx])) == 1){
      na.idx <- which(is.na(dat.abundances2[i,basal.idx]))
      dat.abundances2[i, basal.idx[na.idx]] <- mean(dat.abundances2[i, basal.idx], na.rm=T)
    }

    if(sum(is.na(dat.abundances2[i,insulin.idx])) == 1){
      na.idx <- which(is.na(dat.abundances2[i,insulin.idx]))
      dat.abundances2[i, insulin.idx[na.idx]] <- mean(dat.abundances2[i, insulin.idx], na.rm=T)
    }
}
cat("\n\n\n")
nrow(dat.abundances)
nrow(na.omit(dat.abundances))   # na.omit() removes rows with at least one 'NA'
nrow(na.omit(dat.abundances2))

dat.abundances2 <- na.omit(dat.abundances2)   # delete phosphopeptides that still have missing values despite imputation
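
The effect of this imputation rule can be checked on a small example. The following sketch uses a hypothetical toy matrix `toy` (three peptides, three replicates per group); the values and column names are made up for illustration. It applies the same rule as above, written with an inner loop over the two groups.

```r
# Hypothetical toy matrix (NOT the course data): 3 peptides, 3 replicates per group
toy <- matrix(c(10, 12, NA, 20, 22, 24,    # pepA: one NA in Basal   -> imputed
                 5, NA, NA,  8,  9, 10,    # pepB: two NAs in Basal  -> not imputed
                 7,  8,  9, NA, 14, 16),   # pepC: one NA in Insulin -> imputed
              nrow = 3, byrow = TRUE,
              dimnames = list(c("pepA", "pepB", "pepC"),
                              c("Basal_1", "Basal_2", "Basal_3",
                                "Insulin_1", "Insulin_2", "Insulin_3")))

toy.basal.idx <- grep("Basal", colnames(toy))
toy.insulin.idx <- grep("Insulin", colnames(toy))

toy2 <- toy
for (i in seq_len(nrow(toy2))) {
  for (grp.idx in list(toy.basal.idx, toy.insulin.idx)) {
    na.idx <- which(is.na(toy2[i, grp.idx]))
    if (length(na.idx) == 1) {   # impute only if exactly one value is missing in the group
      toy2[i, grp.idx[na.idx]] <- mean(toy2[i, grp.idx], na.rm = TRUE)
    }
  }
}
print(toy2)

# pepB still contains NAs and is dropped by na.omit()
toy2.complete <- na.omit(toy2)
print(rownames(toy2.complete))
```

In this toy case, the missing Basal value of pepA becomes 11 (mean of 10 and 12), the missing Insulin value of pepC becomes 15 (mean of 14 and 16), and pepB is removed because two of its three Basal values are missing.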

# **Preparatory definitions**
Empty vectors are now prepared for storing the fold changes, p-values and adjusted p-values that are still to be calculated. If the group-specific index vectors "basal.idx" and "insulin.idx" had not already been defined for the imputation, we would define them here.

In [None]:
# already defined for imputation
# basal.idx <- grep("Basal", colnames(dat.abundances))
# insulin.idx <- grep("Insulin", colnames(dat.abundances))

phospep.number <- nrow(dat.abundances2)

# define empty (zero-filled) numeric vectors for storing fold changes, p-values and adjusted p-values
fc <- vector(length=phospep.number, mode="numeric")
p.val <- vector(length=phospep.number, mode="numeric")
p.val.adj <- vector(length=phospep.number, mode="numeric")

# adopt IDs of vector elements from row IDs of dat.abundances2
names(fc) <- rownames(dat.abundances2)
names(p.val) <- rownames(dat.abundances2)
names(p.val.adj) <- rownames(dat.abundances2)

print(fc[1:10])

# **Computing means & fold change**
To calculate the group means and the fold change, we iterate a for loop over all phosphopeptides (i.e. over rows i = 1, 2, 3, ... of the data matrix "dat.abundances2"). For each phosphopeptide (= row "i"), the mean of the basal and of the insulin samples is calculated, and their ratio (insulin / basal) is saved at index "i" of the results vector "fc".

In [13]:
for(i in 1:phospep.number){
    basal.mean <- mean(dat.abundances2[i, basal.idx])
    insulin.mean <- mean(dat.abundances2[i, insulin.idx])

    fc[i] <- insulin.mean / basal.mean
}
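
The same calculation can also be written without an explicit loop. The sketch below uses `rowMeans()` on a hypothetical toy matrix (made-up values, not the course data) and is meant as an alternative formulation, not as a replacement for the loop above.

```r
# Hypothetical toy matrix (NOT the course data)
toy <- matrix(c(10, 10, 10, 20, 20, 20,   # pepA: insulin mean is twice the basal mean
                 8,  8,  8,  4,  4,  4),  # pepB: insulin mean is half the basal mean
              nrow = 2, byrow = TRUE,
              dimnames = list(c("pepA", "pepB"),
                              c("Basal_1", "Basal_2", "Basal_3",
                                "Insulin_1", "Insulin_2", "Insulin_3")))

toy.basal.idx <- grep("Basal", colnames(toy))
toy.insulin.idx <- grep("Insulin", colnames(toy))

# rowMeans() computes the mean of every row at once; names are taken from the row names
toy.fc <- rowMeans(toy[, toy.insulin.idx]) / rowMeans(toy[, toy.basal.idx])
print(toy.fc)
```

Here pepA gets a fold change of 2 and pepB a fold change of 0.5.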

The fold changes in "fc" can be inspected with histograms. On the original scale the distribution is asymmetric: down-regulation is compressed into the interval between 0 and 1, whereas up-regulation covers all values above 1. Up- and down-regulation can therefore only be compared visually after a log2 transformation. This transformation ensures that opposite quotients (e.g. 2 and 1/2) lie mirrored / symmetric around 0.

In [None]:
options(repr.plot.width=20, repr.plot.height=10)
par(mfrow=c(1,2))

hist(fc, breaks=50)         # breaks suggests the approximate number of bars
hist(log2(fc), breaks=50)
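
The symmetry after the log2 transformation can be verified with a few numbers. The values below are arbitrary examples, not results from the data set.

```r
# arbitrary example fold changes and their reciprocals
fc.up <- c(2, 4, 8)      # up-regulation
fc.down <- 1 / fc.up     # the same changes in the opposite direction: 0.5, 0.25, 0.125

log2(fc.up)     # 1 2 3
log2(fc.down)   # -1 -2 -3

# on the original scale the distances to "no change" (fc = 1) differ ...
abs(fc.up - 1)
abs(fc.down - 1)

# ... on the log2 scale they are identical
abs(log2(fc.up))
abs(log2(fc.down))
```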

# **Computing p-values**
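A for loop can also be used to calculate the p-values. For each phosphopeptide, a t-test (t.test() performs Welch's t-test by default) compares the log2-transformed abundances of the two groups. Afterwards, the p-values are adjusted for multiple testing with the FDR method (Benjamini-Hochberg):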

In [None]:
for(i in 1:phospep.number){
    p.val[i] <- t.test(log2(dat.abundances2[i, basal.idx]), log2(dat.abundances2[i, insulin.idx]))$p.value
}
p.val.adj <- p.adjust(p.val, method="fdr")

options(repr.plot.width=20, repr.plot.height=10)
par(mfrow=c(1,2))

hist(p.val)
hist(p.val.adj)
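
To make the adjustment step less of a black box, the following sketch recomputes the adjusted p-values by hand for a few arbitrary example p-values (not results from the data set). It relies on two facts about `p.adjust()`: `method="fdr"` is an alias for `method="BH"`, and the Benjamini-Hochberg adjustment multiplies each p-value by n / rank and then enforces monotonicity.

```r
# arbitrary example p-values, already sorted in ascending order
p <- c(0.001, 0.01, 0.03, 0.04, 0.5)
n <- length(p)

# built-in adjustment: "fdr" and "BH" give the same result
p.fdr <- p.adjust(p, method = "fdr")
p.bh <- p.adjust(p, method = "BH")

# manual Benjamini-Hochberg adjustment (valid here because p is sorted ascending)
p.raw <- p * n / seq_along(p)                  # p-value * n / rank
p.manual <- pmin(1, rev(cummin(rev(p.raw))))   # enforce monotonicity, cap at 1

print(p.fdr)
print(p.manual)
```

For these example values, both approaches give 0.005, 0.025, 0.05, 0.05 and 0.5.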

# **Volcano plot**
We have now calculated both decision criteria for the detection of differential candidates for all phosphopeptides, namely the p-values (representing statistical significance) and the fold changes (representing relevance). If both are plotted against each other on suitable scales (log2(fold change) on the x-axis vs. -log10(p-value) on the y-axis; here we use the adjusted p-values), a so-called volcano plot is obtained.

First, let's plot the **basic volcano plot**:

In [None]:
options(repr.plot.width=15, repr.plot.height=15)
plot(log2(fc), -log10(p.val.adj))

**More informative volcano plot**

We can improve the basic volcano plot as follows:

In [None]:
par(mar= c(5, 4, 4, 2) + 1)
plot(log2(fc), -log10(p.val.adj), cex=4, cex.axis=2, cex.lab=3)
abline(h=-log10(0.05), col="red", lwd=1, lty=1)
abline(v=log2(2), col="blue", lwd=3, lty=2)
abline(v=log2(1/2), col="green", lwd=6, lty=3)

Even more informative volcano plot:

In [None]:
diff.idx1 <- which(abs(log2(fc)) > 1)         # trick: via the absolute value of log2(fc) we get phosphopeptides with fc > 2 as well as those with fc < 1/2
diff.idx2 <- which(p.val.adj < 0.05)          # gives row indices of phosphopeptides with adj. p-value < 0.05
diff.idx <- intersect(diff.idx1, diff.idx2)   # intersection gives row indices of differential candidates

par(mar= c(5, 4, 4, 2) + 1)
plot(x=log2(fc),
     y=-log10(p.val.adj),
     cex=4,
     cex.axis=2,
     cex.lab=3,
     cex.main=3,
     main=paste0("Volcano plot with ", length(diff.idx), " differential candidates"))
abline(h=-log10(0.05), col="red", lwd=6, lty=3)
abline(v=log2(2), col="red", lwd=6, lty=3)
abline(v=log2(1/2), col="red", lwd=6, lty=3)
points(log2(fc[diff.idx]), -log10(p.val.adj[diff.idx]), cex=4, col="red")
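
The selection logic behind `diff.idx` can be traced on a small example. The fold changes and adjusted p-values below are hypothetical values chosen for illustration; the split into up- and down-regulated candidates is an optional extra step and not part of the analysis above.

```r
# Hypothetical toy results (NOT from the data set)
toy.fc <- c(pepA = 4, pepB = 0.2, pepC = 1.1, pepD = 3)
toy.p.adj <- c(pepA = 0.01, pepB = 0.03, pepC = 0.001, pepD = 0.2)

toy.idx1 <- which(abs(log2(toy.fc)) > 1)   # fc > 2 or fc < 1/2: pepA, pepB, pepD
toy.idx2 <- which(toy.p.adj < 0.05)        # adj. p-value < 0.05: pepA, pepB, pepC
toy.diff.idx <- intersect(toy.idx1, toy.idx2)

# names of the differential candidates (both criteria fulfilled)
toy.candidates <- names(toy.fc)[toy.diff.idx]
print(toy.candidates)

# optional: split candidates by direction of regulation
toy.up <- toy.candidates[log2(toy.fc[toy.candidates]) > 0]
toy.down <- toy.candidates[log2(toy.fc[toy.candidates]) < 0]
print(toy.up)
print(toy.down)
```

In this toy case, pepC is significant but has a small fold change, and pepD has a large fold change but is not significant, so only pepA (up) and pepB (down) remain as candidates.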