# **Importing data & pre-processing**
Import data from GitHub & set row names (same as in part1):

In [6]:
# import file with NOT normalized expression data
dat.abundances <- read.table("https://raw.githubusercontent.com/ddz-icb/OmicsDataAnalysisCourse/main/data/dat.abundances.txt",
                            header=TRUE,
                            sep="\t")
rownames(dat.abundances) <- dat.abundances[,1]        # set rownames to IDs from first column
dat.abundances <- data.matrix(dat.abundances[,-1])    # delete first column and change "data frame" to numeric "data matrix"



# import file with normalized data and extended information
dat.ext <- read.table("https://raw.githubusercontent.com/ddz-icb/OmicsDataAnalysisCourse/main/data/dat.ext.txt",
                            header=TRUE,
                            sep="\t")

Keep only phosphorylated peptides (same as in part1):

In [7]:
# give row numbers with phosphopeptides
phospep.idx <- grep("Phospho", dat.ext$Modifications)   # grep() gives all row numbers containing the given pattern in column 'Modifications'

# keep only phosphopeptides
dat.abundances <- dat.abundances[phospep.idx,]
#dat.nonorm <- dat.nonorm[phospep.idx,]
dat.ext <- dat.ext[phospep.idx,]

In part1, we determined that the normalization performed by the device software is acceptable, so we do not need to normalize the raw data ourselves. However, we still want to perform a group-specific imputation: if a phosphopeptide has exactly one missing value within a group, that value is replaced by the mean of the remaining values of that group. This avoids excluding almost completely quantified phosphopeptides in some analysis steps. The same imputation as in part1 is used.

In [None]:
# Get the column indices of each group
basal.idx <- grep("Basal", colnames(dat.abundances))
insulin.idx <- grep("Insulin", colnames(dat.abundances))



dat.abundances2 <- dat.abundances
for(i in seq_len(nrow(dat.abundances2))){
    # if exactly one basal value is missing, replace it with the mean of the remaining basal values
    if(sum(is.na(dat.abundances2[i,basal.idx])) == 1){
      na.idx <- which(is.na(dat.abundances2[i,basal.idx]))
      dat.abundances2[i, basal.idx[na.idx]] <- mean(dat.abundances2[i, basal.idx], na.rm=TRUE)
    }

    # same procedure for the insulin group
    if(sum(is.na(dat.abundances2[i,insulin.idx])) == 1){
      na.idx <- which(is.na(dat.abundances2[i,insulin.idx]))
      dat.abundances2[i, insulin.idx[na.idx]] <- mean(dat.abundances2[i, insulin.idx], na.rm=TRUE)
    }
}
# compare the number of rows: all, complete before imputation, complete after imputation
nrow(dat.abundances)
nrow(na.omit(dat.abundances))   # na.omit() removes rows with at least one 'NA'
nrow(na.omit(dat.abundances2))

dat.abundances2 <- na.omit(dat.abundances2)   # delete phosphopeptides that still have missing values despite imputation
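
To make the effect of the imputation easier to follow, here is a minimal sketch on a small toy matrix. The matrix `toy`, its values, and the helper `impute_single_na()` are made up for illustration and are not part of the course data; the helper only wraps the same rule as the loop above (fill a value only if it is the single missing one in its group).

```r
# Toy matrix: 3 peptides x 6 samples (hypothetical names and values)
toy <- matrix(c(10, 12, NA, 20, 22, 24,
                NA, NA,  5,  8,  9, NA,
                 3,  4,  5,  6,  7,  8),
              nrow = 3, byrow = TRUE,
              dimnames = list(paste0("pep", 1:3),
                              c(paste0("Basal", 1:3), paste0("Insulin", 1:3))))

# Hypothetical helper: fill a single NA within a group with the row mean of that group
impute_single_na <- function(mat, idx) {
  for (i in seq_len(nrow(mat))) {
    na.pos <- which(is.na(mat[i, idx]))
    if (length(na.pos) == 1) {
      mat[i, idx[na.pos]] <- mean(mat[i, idx], na.rm = TRUE)
    }
  }
  mat
}

toy.basal.idx <- grep("Basal", colnames(toy))
toy.insulin.idx <- grep("Insulin", colnames(toy))

toy2 <- impute_single_na(toy, toy.basal.idx)
toy2 <- impute_single_na(toy2, toy.insulin.idx)

print(toy2)                 # pep2 keeps its two basal NAs, because only single NAs are filled
print(nrow(na.omit(toy2)))  # number of peptides that are complete after imputation
```

In this toy case, `pep1` is rescued by the imputation, while `pep2` is still incomplete and would be dropped by `na.omit()`.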

# **Identification of differential candidates**
Calculate p-values and fold changes in order to identify differential candidates (same as in part2).

In [None]:
phospep.number <- nrow(dat.abundances2)

# define empty vectors for storing...
fc <- vector(length=phospep.number, mode="numeric")
p.val <- vector(length=phospep.number, mode="numeric")
p.val.adj <- vector(length=phospep.number, mode="numeric")

# adopt IDs of vector elements from row IDs of dat.abundances2
names(fc) <- rownames(dat.abundances2)
names(p.val) <- rownames(dat.abundances2)
names(p.val.adj) <- rownames(dat.abundances2)



# Calculate fold changes
for(i in 1:phospep.number){
    basal.mean <- mean(dat.abundances2[i, basal.idx])
    insulin.mean <- mean(dat.abundances2[i, insulin.idx])

    fc[i] <- insulin.mean / basal.mean
}



# Calculate p-values and adjusted p-values
for(i in 1:phospep.number){
    p.val[i] <- t.test(log2(dat.abundances2[i, basal.idx]), log2(dat.abundances2[i, insulin.idx]))$p.value
}
p.val.adj <- p.adjust(p.val, method="fdr")



# Give row numbers of candidates with both large log-fold change and low adj. p-value
diff.idx1 <- which(abs(log2(fc)) > 1)         # trick: the absolute value of log2(fc) captures both fc > 2 and fc < 1/2
diff.idx2 <- which(p.val.adj < 0.05)          # gives row indices of phosphopeptides with adj. p-value < 0.05
diff.idx <- intersect(diff.idx1, diff.idx2)   # intersection gives row indices of differential candidates
print(length(diff.idx))

# Give row numbers of candidates with more stringent thresholds
diff.idx1 <- which(abs(log2(fc)) > 1.25)
diff.idx2 <- which(p.val.adj < 0.005)
diff.idx <- intersect(diff.idx1, diff.idx2)
print(length(diff.idx))
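
The loops above can also be written in a vectorized style. The following is a sketch on simulated toy data, not the course's own method: the matrix `toy.abund`, its dimensions, and the random values are assumptions made only so that the block runs on its own. It uses `rowMeans()` for the fold changes and `apply()` for the row-wise t-tests.

```r
set.seed(1)  # reproducible toy data

# Simulated abundances: 5 peptides x 6 samples (hypothetical)
toy.abund <- matrix(2^rnorm(5 * 6, mean = 10), nrow = 5,
                    dimnames = list(paste0("pep", 1:5),
                                    c(paste0("Basal", 1:3), paste0("Insulin", 1:3))))
toy.basal.idx <- grep("Basal", colnames(toy.abund))
toy.insulin.idx <- grep("Insulin", colnames(toy.abund))

# Fold change: ratio of the group means, computed for all rows at once
toy.fc <- rowMeans(toy.abund[, toy.insulin.idx]) / rowMeans(toy.abund[, toy.basal.idx])

# Row-wise t-test on log2 abundances, followed by FDR adjustment
toy.p <- apply(toy.abund, 1, function(x) {
  t.test(log2(x[toy.basal.idx]), log2(x[toy.insulin.idx]))$p.value
})
toy.p.adj <- p.adjust(toy.p, method = "fdr")

# Candidates need both a large absolute log2 fold change and a low adjusted p-value
toy.diff.idx <- which(abs(log2(toy.fc)) > 1 & toy.p.adj < 0.05)

print(data.frame(fc = toy.fc, p.val = toy.p, p.val.adj = toy.p.adj))
print(length(toy.diff.idx))
```

Combining both conditions with `&` inside a single `which()` gives the same indices as intersecting two separate index vectors.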

# **Decision trees in R**
There are various R packages for training decision trees and using them for classification. One of the most user-friendly is rpart, which also has the advantage of being pre-installed in Google Colab. We only need to install the rpart extension "rpart.plot".

In [None]:
setRepositories(ind=1:5)
install.packages("rpart.plot")

library(rpart)        # provides rpart(); loaded explicitly for clarity
library(rpart.plot)


Preparing a class vector:

In [None]:
colnames(dat.abundances2)
classes <- gsub("\\d+", "", colnames(dat.abundances2))
print(classes)
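
As a small illustration of what the `gsub()` call does, the sketch below strips the replicate numbers from a few made-up column names. The names in `toy.names` are assumptions; the real column names of `dat.abundances2` may look different.

```r
# Hypothetical sample names: group label followed by a replicate number
toy.names <- c("Basal1", "Basal2", "Insulin1", "Insulin10")

# "\\d+" matches one or more digits; replacing them with "" leaves only the group label
toy.classes <- gsub("\\d+", "", toy.names)

print(toy.classes)
print(table(toy.classes))   # number of samples per class
```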

Preparing the training data:

In [None]:
rpartdat <- data.frame(classes, t(dat.abundances2), stringsAsFactors=TRUE)
rpartdat[,1:10]

Fitting the tree model:

In [None]:
# minsplit: minimum number of samples a node must contain before a split is attempted
rpart.model <- rpart(classes~., data=rpartdat, method="class", minsplit=2)

printcp(rpart.model) # display the results
summary(rpart.model) # detailed summary of splits

Visualization of the tree model:

In [None]:
col.list <- list(
  adjustcolor("navy", alpha=0.3),
  adjustcolor("red", alpha=0.3),
  adjustcolor("darkorchid", alpha=0.3),
  adjustcolor("darkgreen", alpha=0.3)
)

options(repr.plot.width=10, repr.plot.height=10)
rpart.plot(rpart.model, roundint=F, type=2, extra=101, cex=1.5, box.palette=col.list)

To check whether the selected feature "Q5HZI1_peptide1" is differential and whether the threshold in the tree is valid, a scatter plot may be used:

In [None]:
options(repr.plot.width=10, repr.plot.height=10)
plot(dat.abundances2["Q5HZI1_peptide1", ], cex=3)
text(dat.abundances2["Q5HZI1_peptide1", ], labels=colnames(dat.abundances2), pos=3, offset=1, cex=1.25)

Predict the class probabilities of the samples (here, the same samples the model was trained on):

In [None]:
predict(object=rpart.model, newdata=rpartdat)
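
For a classification tree, `predict()` returns a matrix of class probabilities by default; with `type="class"` it returns the predicted class labels instead. The sketch below shows both variants on the built-in `iris` data set, which is used here only as a stand-in so that the block runs without the course data.

```r
library(rpart)

# Stand-in model on a built-in data set (not the phosphopeptide data)
toy.model <- rpart(Species ~ ., data = iris, method = "class")

# Default: one row per sample, one column per class, entries are class probabilities
toy.prob <- predict(toy.model, newdata = iris)
print(head(toy.prob))

# type = "class": a factor with the most probable class per sample
toy.class <- predict(toy.model, newdata = iris, type = "class")

# Cross-tabulate predicted against observed classes
print(table(predicted = toy.class, observed = iris$Species))
```

Because the model is evaluated on its own training data here, the table tends to look better than it would on unseen samples.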

Split the samples into a training set and a test set, fit the tree model on the training set, and then predict the test set samples:

In [None]:
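# training set: samples in columns 1, 2, 6, 7, restricted to the differential candidates
classes2 <- classes[c(1,2,6,7)]
rpartdat.train <- data.frame(classes2, t(dat.abundances2[diff.idx,c(1,2,6,7)]), stringsAsFactors=TRUE)
rpartdat.train[,1:10]

# test set: samples in columns 3, 4, 7, 8, restricted to the differential candidates
# note: column 7 is also part of the training set, so the two sets are not fully disjoint
classes2 <- classes[c(3,4,7,8)]
rpartdat.test <- data.frame(classes2, t(dat.abundances2[diff.idx,c(3,4,7,8)]), stringsAsFactors=TRUE)
rpartdat.test[,1:10]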

Fit & visualize the tree model:

In [None]:
rpart.model2 <- rpart(classes2~., data=rpartdat.train, method="class", minsplit=1)

options(repr.plot.width=10, repr.plot.height=10)
rpart.plot(rpart.model2, roundint=F, type=2, extra=101, cex=1.5, box.palette=col.list)

Classify the test set:

In [None]:
predict(object=rpart.model2, newdata=rpartdat.test)
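
A test set is only informative if it shares no samples with the training set. The following sketch, again on the built-in `iris` data as a stand-in, draws disjoint training and test indices, fits the tree on the training part, and reports the accuracy on the test part. The split sizes and the seed are arbitrary choices for illustration.

```r
library(rpart)

set.seed(42)  # reproducible split

# Disjoint index sets: 100 samples for training, the remaining 50 for testing
toy.train.idx <- sample(seq_len(nrow(iris)), size = 100)
toy.test.idx <- setdiff(seq_len(nrow(iris)), toy.train.idx)
stopifnot(length(intersect(toy.train.idx, toy.test.idx)) == 0)

# Fit on the training samples only
toy.model2 <- rpart(Species ~ ., data = iris[toy.train.idx, ], method = "class")

# Predict class labels for the held-out samples
toy.pred <- predict(toy.model2, newdata = iris[toy.test.idx, ], type = "class")

# Confusion table and fraction of correctly classified test samples
print(table(predicted = toy.pred, observed = iris$Species[toy.test.idx]))
toy.accuracy <- mean(toy.pred == iris$Species[toy.test.idx])
print(toy.accuracy)
```

With only a handful of samples per group, as in the phosphopeptide data, such an accuracy estimate is very coarse and should be read with caution.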