# **Importing data & pre-processing**
Import data from GitHub & set row names (same as in part1):

In [None]:
# import file with NOT normalized expression data
dat.abundances <- read.table("https://raw.githubusercontent.com/ddz-icb/OmicsDataAnalysisCourse/main/data/dat.abundances.txt",
                            header=T,
                            sep="\t")
rownames(dat.abundances) <- dat.abundances[,1]        # set rownames to IDs from first column
dat.abundances <- data.matrix(dat.abundances[,-1])    # delete first column and change "data frame" to numeric "data matrix"



# import file with normalized data and extended information
dat.ext <- read.table("https://raw.githubusercontent.com/ddz-icb/OmicsDataAnalysisCourse/main/data/dat.ext.txt",
                            header=T,
                            sep="\t")

Keep only phosphorylated peptides (same as in part1):

In [None]:
# give row numbers with phosphopeptides
phospep.idx <- grep("Phospho", dat.ext$Modifications)   # grep() gives all row numbers containing the given pattern in column 'Modifications'

# keep only phosphopeptides
dat.abundances <- dat.abundances[phospep.idx,]
#dat.nonorm <- dat.nonorm[phospep.idx,]
dat.ext <- dat.ext[phospep.idx,]

In part1, we determined that the normalization performed by the device software is adequate, so we do not need to normalize the raw data ourselves. However, we do want to perform a group-specific imputation that replaces isolated missing values, so that phosphopeptides that are almost completely quantified are not excluded in later analysis steps. We use the same imputation as in part1: if exactly one value is missing within a group (basal or insulin), it is replaced by the mean of the remaining values of that group.

In [None]:
# Give row vectors with group-specific column numbers
basal.idx <- grep("Basal", colnames(dat.abundances))
insulin.idx <- grep("Insulin", colnames(dat.abundances))



dat.abundances2 <- dat.abundances
for(i in 1:nrow(dat.abundances2)){
    if(sum(is.na(dat.abundances2[i,basal.idx])) == 1){
      na.idx <- which(is.na(dat.abundances2[i,basal.idx]))
      dat.abundances2[i, basal.idx[na.idx]] <- mean(dat.abundances2[i, basal.idx], na.rm=T)
    }

    if(sum(is.na(dat.abundances2[i,insulin.idx])) == 1){
      na.idx <- which(is.na(dat.abundances2[i,insulin.idx]))
      dat.abundances2[i, insulin.idx[na.idx]] <- mean(dat.abundances2[i, insulin.idx], na.rm=T)
    }
}
nrow(dat.abundances)
nrow(na.omit(dat.abundances))   # na.omit() removes rows with at least one 'NA'
nrow(na.omit(dat.abundances2))

dat.abundances2 <- na.omit(dat.abundances2)   # delete phosphopeptides that still have missing values despite imputation
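
To see what the imputation loop does, it can help to run the same rule on a tiny matrix. The following is only a sketch: the matrix `toy`, its values and the helper `impute.single.na()` are made up for illustration and are not part of the course data.

```r
# Toy matrix (hypothetical values): 3 peptides x 6 samples
toy <- matrix(c(10, 12, NA, 20, 22, 24,
                NA, NA,  5,  6,  7,  8,
                 1,  2,  3, NA,  5,  6),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("pepA", "pepB", "pepC"),
                              c("Basal_1", "Basal_2", "Basal_3",
                                "Insulin_1", "Insulin_2", "Insulin_3")))

# Hypothetical helper: if exactly one value of a group is missing,
# replace it by the mean of the remaining values of that group
impute.single.na <- function(mat, idx) {
  for (i in seq_len(nrow(mat))) {
    na.pos <- which(is.na(mat[i, idx]))
    if (length(na.pos) == 1) {
      mat[i, idx[na.pos]] <- mean(mat[i, idx], na.rm = TRUE)
    }
  }
  mat
}

toy.basal.idx   <- grep("Basal", colnames(toy))
toy.insulin.idx <- grep("Insulin", colnames(toy))

toy2 <- impute.single.na(toy, toy.basal.idx)
toy2 <- impute.single.na(toy2, toy.insulin.idx)
toy2                  # pepA and pepC are filled in, pepB keeps its two NAs
nrow(na.omit(toy2))   # pepB is removed because it still has missing values
```

pepB illustrates why the final na.omit() step is still needed: with two missing values in one group, the rule deliberately leaves the row untouched.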

# **Identification of differential candidates**
Calculate p-values and fold changes in order to identify differential candidates (same as in part2).

In [None]:
phospep.number <- nrow(dat.abundances2)

# define empty vectors for storing...
fc <- vector(length=phospep.number, mode="numeric")
p.val <- vector(length=phospep.number, mode="numeric")
p.val.adj <- vector(length=phospep.number, mode="numeric")

# adopt IDs of vector elements from row IDs of dat.abundances2
names(fc) <- rownames(dat.abundances2)
names(p.val) <- rownames(dat.abundances2)
names(p.val.adj) <- rownames(dat.abundances2)



# Calculate fold changes
for(i in 1:phospep.number){
    basal.mean <- mean(dat.abundances2[i, basal.idx])
    insulin.mean <- mean(dat.abundances2[i, insulin.idx])

    fc[i] <- insulin.mean / basal.mean
}



# Calculate p-values and adjusted p-values
for(i in 1:phospep.number){
    p.val[i] <- t.test(log2(dat.abundances2[i, basal.idx]), log2(dat.abundances2[i, insulin.idx]))$p.value
}
p.val.adj <- p.adjust(p.val, method="fdr")



# Give row numbers of candidates with both large log-fold change and low adj. p-value
diff.idx1 <- which(abs(log2(fc)) > 1)         # trick: via absolute value of log2(fc) we get phosphopeptides with fc > 2 as well as those with fc < 1/2
diff.idx2 <- which(p.val.adj < 0.05)          # gives row indices of phosphopeptides with adj. p-value < 0.05
diff.idx <- intersect(diff.idx1, diff.idx2)   # intersection gives row indices of differential candidates
print(length(diff.idx))

# Give row numbers of candidates with more stringent thresholds
diff.idx1 <- which(abs(log2(fc)) > 1.25)
diff.idx2 <- which(p.val.adj < 0.005)
diff.idx <- intersect(diff.idx1, diff.idx2)
print(length(diff.idx))
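
The interplay of the two filters can be checked on a few made-up numbers. This is a minimal sketch; the fold changes and p-values in `toy.fc` and `toy.p` are hypothetical and only serve to show how the thresholds act.

```r
# Hypothetical fold changes and raw p-values for four phosphopeptides
toy.fc <- c(pepA = 4, pepB = 0.25, pepC = 1.5, pepD = 3)
toy.p  <- c(pepA = 0.001, pepB = 0.01, pepC = 0.002, pepD = 0.6)

# FDR (Benjamini-Hochberg) adjustment, as in the cell above
toy.p.adj <- p.adjust(toy.p, method = "fdr")

# A candidate needs both |log2(fc)| > 1 and an adjusted p-value < 0.05
toy.candidates <- which(abs(log2(toy.fc)) > 1 & toy.p.adj < 0.05)
names(toy.candidates)
```

Here pepC fails the fold-change filter although its p-value is small, and pepD fails the p-value filter although its fold change is large. Combining both conditions with `&` selects the same rows as intersect() applied to the two index vectors.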

# **Prepare the list of interesting candidates**
For overrepresentation analysis a list of proteins or genes is needed. Thus, first we have to map the phosphopeptide IDs (row IDs) of our candidates to protein IDs (UniProt IDs). These are the phosphopeptide IDs of our candidates:

In [None]:
rownames(dat.abundances2)[diff.idx]

Since the UniProt IDs are already part of the phosphopeptide IDs, we only need to extract them. The R function gsub() does this: it substitutes or deletes the parts of a string that match a given regular expression.
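
As a small illustration of the individual steps, consider a single made-up ID. The string in `toy.id` is hypothetical; its format is only assumed from the regular expressions used in this notebook.

```r
# Hypothetical phosphopeptide ID (format assumed, not taken from the data)
toy.id <- "P12345-2_peptide7; Q9ABC1_peptide7"

toy.step1 <- gsub("_peptide\\d+", "", toy.id)   # drop the peptide suffix
toy.step2 <- gsub("-\\d+", "", toy.step1)       # drop the isoform suffix
toy.ids   <- strsplit(toy.step2, "; ")[[1]]     # split entries with several proteins

toy.step1
toy.step2
toy.ids
```

In the patterns, `\\d+` matches one or more digits, so "_peptide7" and "-2" are removed regardless of the actual number.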

In [None]:
uniprot.ids <- c()
for(i in seq_along(diff.idx)){   # seq_along() also behaves correctly if diff.idx is empty
  #print(paste0("Original: ", rownames(dat.abundances2)[diff.idx[i]]))
  tmp <- gsub("_peptide\\d+", "", rownames(dat.abundances2)[diff.idx[i]])
  #print(paste0("After removing '_peptide\\d+': ", tmp))

  tmp <- gsub("-\\d+", "", tmp)
  #print(paste0("After removing '-\\d+': ", tmp))

  #tmp <- print(strsplit(tmp, "; "))
  tmp <- strsplit(tmp, "; ")
  #print("After splitting:")
  #print(tmp[[1]])
  uniprot.ids <- c(uniprot.ids, tmp[[1]])

  #print("-----------------------")
}
uniprot.ids <- unique(uniprot.ids)
print(uniprot.ids)
print(length(uniprot.ids))

write.table(x=uniprot.ids, file="uniprot.ids.txt", quote=F, sep="\t", row.names=F, col.names=F)

In [None]:
id.map <- read.table("https://raw.githubusercontent.com/ddz-icb/OmicsDataAnalysisCourse/main/data/idmapping_2023_11_21_to_GeneID.tsv",
                            header=T,
                            sep="\t")
head(id.map)

entrez.ids <- id.map[id.map$From %in% uniprot.ids,"To"]
print(entrez.ids)
length(entrez.ids)
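
A mapping table does not have to be one-to-one: a UniProt ID may map to several Entrez IDs or to none. The following sketch shows how this could be checked. The table `toy.map` and the IDs in it are made up; only the column names "From" and "To" are taken from the mapping file above.

```r
# Hypothetical mapping table with the same column names as id.map
toy.map <- data.frame(From = c("P12345", "P12345", "Q9ABC1", "O00001"),
                      To   = c(101, 102, 203, 304),
                      stringsAsFactors = FALSE)
toy.uniprot <- c("P12345", "Q9ABC1", "A0AAA0")

# Mapped Entrez IDs, as character strings and without duplicates
toy.entrez <- unique(as.character(toy.map[toy.map$From %in% toy.uniprot, "To"]))

# UniProt IDs for which the table has no entry
toy.unmapped <- setdiff(toy.uniprot, toy.map$From)

toy.entrez
toy.unmapped
```

Comparing length(uniprot.ids) with length(entrez.ids) in the real data gives a first hint of whether such cases occur.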


# **GO-based overrepresentation analysis**
To perform Gene Ontology (GO)-based overrepresentation analysis ("GO-analysis" or "GO-ORA"), various R packages can be used; the package "clusterProfiler" is popular and widely used. To use clusterProfiler in Google Colab, we have to install it together with the package "org.Mm.eg.db", a gene annotation database for the species Mus musculus (corresponding databases are available for all main species). To visualize the GO results, we also install the packages "enrichplot" and "ggupset", which provide useful plots for ORA results.

In [None]:
# Warning: Installation of the packages takes ca. 20 minutes in Colab!
setRepositories(ind=1:5)
install.packages("clusterProfiler")
install.packages("org.Mm.eg.db")
install.packages("enrichplot")
install.packages("ggupset")
library(clusterProfiler)
library(org.Mm.eg.db)
library(enrichplot)
library(ggupset)

GO-ORA is performed via the enrichGO() function in clusterProfiler. We have to specify the GO domain, i.e. "biological process" (BP), "cellular component" (CC) or "molecular function" (MF), via the parameter "ont". Moreover, we have to specify the appropriate species-specific gene database (via "OrgDb"), the p-value cutoffs and a method for p-value adjustment. Via "as.data.frame()" the results can be inspected.

In [None]:
ego <- enrichGO(gene = entrez.ids,
                #universe = as.character(id.map$To),
                OrgDb = org.Mm.eg.db,
                ont = "BP",
                pAdjustMethod = "fdr",
                pvalueCutoff = 0.01,
                qvalueCutoff = 0.05,
                readable = TRUE)
as.data.frame(ego)[1:15,]
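
For each GO term, an ORA asks whether the candidate list contains more genes of that term than expected by chance; clusterProfiler evaluates this with a hypergeometric test. The following sketch reproduces the calculation for a single term in base R. All four numbers are hypothetical and not taken from our data.

```r
# Hypothetical numbers, for illustration only
toy.N <- 20000   # annotated genes in the background (universe)
toy.M <- 200     # genes annotated to one GO term
toy.n <- 100     # genes in the candidate list
toy.k <- 10      # candidate genes annotated to that GO term

# Expected overlap if the candidates were drawn at random
toy.expected <- toy.n * toy.M / toy.N

# P(overlap >= k) from the hypergeometric distribution
toy.p.ora <- phyper(toy.k - 1, toy.M, toy.N - toy.M, toy.n, lower.tail = FALSE)

toy.expected
toy.p.ora
```

The result depends on the choice of background, which is what the "universe" argument of enrichGO() controls.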

GO-ORA results visualization with bar plot...

In [None]:
options(repr.plot.width=8, repr.plot.height=6, repr.plot.res=150)
barplot(height=ego,
        color="p.adjust",
        showCategory=15,
        font.size=8,
        title="GO-based ORA results (BP)",
        label_format=50)

GO-ORA results visualization with dot plot...

In [None]:
options(repr.plot.width=8, repr.plot.height=6, repr.plot.res=150)
dotplot(object=ego,
        showCategory=15,
        font.size=8,
        title="GO-based ORA results (BP)",
        label_format=50)

GO-ORA results visualization with cnet plot...

In [None]:
egox <- setReadable(ego, 'org.Mm.eg.db', 'ENTREZID')
options(repr.plot.width=8, repr.plot.height=6, repr.plot.res=150)
cnetplot(x=egox,
         showCategory=c("response to insulin","TOR signaling"),
         #circular=T,
         color.params=list(edge = T),
         cex.params=list(category_label=0.8, gene_label=0.8))

GO-ORA results visualization with emap plot...

In [None]:
egox2 <- enrichplot::pairwise_termsim(egox)
options(repr.plot.width=8, repr.plot.height=6, repr.plot.res=150)
emapplot(x=egox2, showCategory=20)

GO-ORA results visualization with upset plot...

In [None]:
options(repr.plot.width=10, repr.plot.height=6, repr.plot.res=150)
enrichplot::upsetplot(ego)

# **Reactome-based overrepresentation**
To perform Reactome-based overrepresentation analysis ("Reactome-analysis" or "Reactome-ORA") in R, we install and load the package "ReactomePA".

In [None]:
# Warning: Installation of the package takes ca. 4 minutes in Colab!
setRepositories(ind=1:5)
install.packages("ReactomePA")
library("ReactomePA")

The enrichPathway() function takes our input set of candidate genes together with all settings (organism, cutoffs, and minimum and maximum gene set sizes).

In [None]:
reactome.res <- enrichPathway(entrez.ids,
                              organism="mouse",
                              pvalueCutoff=1.0,
                              pAdjustMethod = "BH",
                              qvalueCutoff = 0.01,
                              minGSSize = 10,
                              maxGSSize = 707)

The results are already available as a sorted table. We can directly index the results table for the top 10 pathways and the most informative columns.

In [None]:
reactome.res2 <- reactome.res[1:10,c("Description", "p.adjust")]
reactome.res2

Results visualization:

In [None]:
options(repr.plot.width=15, repr.plot.height=10)
par(mar=c(5, 33, 4, 0))
p <- barplot(-log10(reactome.res2[,"p.adjust"]), names.arg=reactome.res2[,"Description"], horiz=T, axes=F, las=1, cex.names=1.5, cex.main=3, font=2, main="top 10 Reactome-pathways")
text(x=-log10(reactome.res2[,"p.adjust"]) - 1, y=p, labels=round(reactome.res2[,"p.adjust"],10), font=2, cex=2)