# **Importing data**

Import data from GitHub:

In [None]:
# this is a comment, which is ignored by R
dat.abundances <- read.table("https://raw.githubusercontent.com/ddz-icb/OmicsDataAnalysisCourse/main/data/dat.abundances.txt",    # URL of the file with normalized expression data in our GitHub repository
                            header=T,     # file has a header line
                            sep="\t")     # columns are separated by tabs (also common: ',' and ';')

Use the IDs in the first column as row names, then remove that column:

In [None]:
rownames(dat.abundances) <- dat.abundances[,1]        # set rownames to information from first column
dat.abundances <- data.matrix(dat.abundances[,-1])    # delete first column and change "data frame" to numeric "data matrix"
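
To see what these two lines do without downloading anything, the same steps can be tried on a tiny data frame. This is only a sketch: the IDs, sample names, and values in `toy` are made up and are not taken from the course data.

```r
# Toy data frame standing in for the imported table (all values are made up)
toy <- data.frame(ID       = c("pep1", "pep2", "pep3"),
                  Sample_A = c(10.5, NA, 7.2),
                  Sample_B = c(11.0, 9.8, 6.9))

rownames(toy) <- toy[, 1]            # use the ID column as row names
toy.mat <- data.matrix(toy[, -1])    # drop the ID column and convert to a numeric matrix

print(class(toy))                    # the original object is a data frame
print(is.matrix(toy.mat))            # the result is a matrix
print(toy.mat["pep2", "Sample_B"])   # rows can now be addressed by ID
```

Working with a numeric matrix means that functions such as `max()`, `median()`, or `log2()` can be applied to all values at once.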

Import two further files from GitHub:

In [None]:
# file with NOT normalized expression data
dat.nonorm <- read.table("https://raw.githubusercontent.com/ddz-icb/OmicsDataAnalysisCourse/main/data/dat.nonorm.txt",
                            header=T,
                            sep="\t")
rownames(dat.nonorm) <- dat.nonorm[,1]
dat.nonorm <- data.matrix(dat.nonorm[,-1])


# file with normalized data and extended information
dat.ext <- read.table("https://raw.githubusercontent.com/ddz-icb/OmicsDataAnalysisCourse/main/data/dat.ext.txt",
                            header=T,
                            sep="\t")

# **Data inspection**
Get dimensions of data matrix:

In [None]:
nrow(dat.abundances)      # gives number of rows (= phosphopeptides)
ncol(dat.abundances)      # gives number of columns (= samples)

Let's inspect data content:

In [None]:
head(dat.abundances)                # gives first 6 rows
tail(dat.abundances)                # gives last 6 rows
print(dat.abundances[1000:1005,5:8])   # prints rows 1000 to 1005 of columns 5 to 8

Further data inspection:

In [None]:
max(dat.abundances, na.rm=T)
min(dat.abundances, na.rm=T)
median(dat.abundances, na.rm=T)
mean(dat.abundances, na.rm=T)
table(is.na(dat.abundances))
summary(dat.abundances)
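
`table(is.na(dat.abundances))` counts missing values over the whole matrix. A per-sample or per-peptide breakdown can be obtained with `colSums()` and `rowSums()`. The sketch below uses a small made-up matrix `toy`, not the course data.

```r
# Toy matrix with some missing values (all values are made up)
toy <- matrix(c( 1, NA,  3,
                 4,  5, NA,
                NA,  8,  9,
                10, 11, 12),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("pep", 1:4), c("S1", "S2", "S3")))

na.per.sample <- colSums(is.na(toy))   # number of NAs in each column (sample)
na.per.row    <- rowSums(is.na(toy))   # number of NAs in each row (peptide)

print(na.per.sample)
print(na.per.row)
print(sum(is.na(toy)))                 # total number of NAs
```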

Inspecting additional information:

In [None]:
print(dat.ext[1:10,])

# **Ensuring data validity: removing meaningless data**
Despite enrichment, the data set still contains non-phosphorylated peptides. Find the rows that contain phosphopeptides and keep only those:

In [None]:
# get the row numbers of phosphopeptides
phospep.idx <- grep("Phospho", dat.ext$Modifications)   # grep() returns the indices of all entries in column 'Modifications' that match the given pattern
print(phospep.idx[1:10])    # print the first 10 row numbers of phosphopeptides

# keep only phosphopeptides
dat.abundances <- dat.abundances[phospep.idx,]
dat.nonorm <- dat.nonorm[phospep.idx,]
dat.ext <- dat.ext[phospep.idx,]
print(nrow(dat.abundances))
print(nrow(dat.nonorm))
print(nrow(dat.ext))

 [1]  1  2  3  4  5  6  7  8  9 10
[1] 15210
[1] 15210
[1] 15210
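
The filtering step relies on `grep()` returning positions rather than the matching text. A minimal sketch on a made-up character vector shows this, together with `grepl()`, which returns a logical vector instead. The modification strings in `mods` are invented for illustration and may not look like the entries in `dat.ext$Modifications`.

```r
# Made-up modification annotations (format is an assumption)
mods <- c("1xPhospho [S5]",
          "1xOxidation [M3]",
          "2xPhospho [S2; T7]",
          "",
          "1xCarbamidomethyl [C4]; 1xPhospho [Y9]")

idx  <- grep("Phospho", mods)    # positions of the entries that match the pattern
flag <- grepl("Phospho", mods)   # TRUE/FALSE for every entry

print(idx)
print(flag)
print(mods[idx])                 # subsetting by position keeps only the matches
```

Both results can be used for subsetting. The logical form is handy for counting how many entries would be dropped, for example with `sum(!flag)`.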


# **Normalization**
Inspect the non-normalized data:

In [None]:
#options(repr.plot.width=10, repr.plot.height=10, na.action="na.rm")
options(repr.plot.width=10, repr.plot.height=10)
boxplot(dat.nonorm)

The raw values are hard to compare on a linear scale. Apply a log2 transformation before plotting the boxplots:

In [None]:
boxplot(log2(dat.nonorm))     # log2() computes the respective log2-value for all values in the data matrix
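
A quick sketch of how `log2()` treats individual values may help when reading the plot. The vector `x` is made up. Note that a zero becomes `-Inf` and a missing value stays `NA`, so it can be worth checking for zeros before transforming.

```r
# Made-up values covering several orders of magnitude
x <- c(0, 1, 8, 1024, NA)

x.log <- log2(x)          # element-wise log2
print(x.log)

print(sum(x == 0, na.rm = TRUE))   # number of zeros that would turn into -Inf
```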

Inspect the effect of normalization:

In [None]:
boxplot(log2(dat.abundances))

Using a normalization method other than the one applied by Proteome Discoverer:

In [None]:
#install.packages("BiocManager")
#BiocManager::install("limma")
dat.norm <- limma::normalizeBetweenArrays(log2(dat.nonorm), method="quantile")
boxplot(dat.norm)
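
The idea behind quantile normalization can be sketched in a few lines of base R. This is an illustration of the principle on a made-up matrix without ties or missing values, not a reimplementation of the limma function, which also has to cope with missing values and ties.

```r
# Toy matrix: 4 peptides x 3 samples, no ties, no NAs (all values are made up)
toy <- matrix(c(5, 2, 3, 4,
                4, 1, 5, 2,
                3, 4, 6, 8),
              nrow = 4,
              dimnames = list(paste0("pep", 1:4), c("S1", "S2", "S3")))

sorted <- apply(toy, 2, sort)            # sort the values within each sample
target <- unname(rowMeans(sorted))       # mean of the k-th smallest values across samples

# replace every value by the target value that belongs to its rank within the sample
toy.qn <- apply(toy, 2, function(x) target[rank(x)])
rownames(toy.qn) <- rownames(toy)

print(toy.qn)
print(apply(toy.qn, 2, sort))            # after normalization all samples share the same values
```

After this step every sample has the same distribution of values, and only the order of the peptides within each sample differs.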

# **Dealing with missing values (imputation)**
Removing all rows with NAs:

In [None]:
nrow(na.omit(dat.abundances))   # na.omit() removes all rows with at least one NA

Performing group-specific imputation:

In [None]:
basal.idx <- grep("Basal", colnames(dat.abundances))
#print(basal.idx)
insulin.idx <- grep("Insulin", colnames(dat.abundances))
#print(insulin.idx)
dat.abundances2 <- dat.abundances
print(dat.abundances2[1:10,])
# loop over all rows (phosphopeptides); for a quick test use 'for(i in 1:10){' instead
for(i in seq_len(nrow(dat.abundances2))){
    # Basal group: if exactly one value is missing, replace it by the mean of the remaining Basal values
    if(sum(is.na(dat.abundances2[i,basal.idx])) == 1){
      na.idx <- which(is.na(dat.abundances2[i,basal.idx]))
      dat.abundances2[i, basal.idx[na.idx]] <- mean(dat.abundances2[i, basal.idx], na.rm=T)
    }

    # Insulin group: same rule, applied to the Insulin columns
    if(sum(is.na(dat.abundances2[i,insulin.idx])) == 1){
      na.idx <- which(is.na(dat.abundances2[i,insulin.idx]))
      dat.abundances2[i, insulin.idx[na.idx]] <- mean(dat.abundances2[i, insulin.idx], na.rm=T)
    }
}
cat("\n\n\n")
print(dat.abundances2[1:10,])
cat("\n\n\n")
nrow(na.omit(dat.abundances2))
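
The imputation rule used above (fill a value only when exactly one value per group is missing) can be checked on a small example. The sketch below wraps the rule in a helper function, `impute.single.na()`, which is a hypothetical name introduced here for illustration. The matrix `toy` and its column names are made up.

```r
# Toy matrix: 3 peptides, 3 Basal and 3 Insulin samples (all values are made up)
toy <- matrix(c(10, NA, 14,  20, 22, 24,
                NA, NA,  9,   5,  7, NA,
                 3,  4,  5,  NA, NA, NA),
              nrow = 3, byrow = TRUE,
              dimnames = list(paste0("pep", 1:3),
                              c("Basal_1", "Basal_2", "Basal_3",
                                "Insulin_1", "Insulin_2", "Insulin_3")))

# Hypothetical helper: within the columns 'idx', fill a row's missing value
# with the row's group mean, but only if exactly one value is missing
impute.single.na <- function(mat, idx) {
  for (i in seq_len(nrow(mat))) {
    na.idx <- which(is.na(mat[i, idx]))
    if (length(na.idx) == 1) {
      mat[i, idx[na.idx]] <- mean(mat[i, idx], na.rm = TRUE)
    }
  }
  mat
}

basal.idx   <- grep("Basal", colnames(toy))
insulin.idx <- grep("Insulin", colnames(toy))

toy.imp <- impute.single.na(toy, basal.idx)
toy.imp <- impute.single.na(toy.imp, insulin.idx)

print(toy.imp)
print(nrow(na.omit(toy.imp)))   # rows without any remaining NA
```

Groups with two or more missing values are left untouched, so those rows are still removed by `na.omit()`.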

Inspect the effect of imputation after removing the rows that still contain NAs:

In [None]:
hist(log2(na.omit(dat.abundances2)))   # with imputation
hist(log2(na.omit(dat.abundances)))    # without imputation