<font style="color:#68829E; font-family:calibri; font-size:36px"> <b> MS Analysis with R: Progenesis Data </b> </font>

<font style="color:black; font-family:calibri; font-size:15px">
Welcome to <i>R</i>, on HCC/Binder, through Jupyter. <br>

<p style="margin-left: 2em">
<b><i>"R is a free software environment for statistical computing and graphics" </i></b><br>
<a href="https://www.r-project.org/"> www.r-project.org </a> <br><br>
<b><i>"The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, <br>
equations, visualizations and narrative text" </i></b><br>
<a href="https://jupyter.org/"> www.jupyter.org </a> <br><br>
</p>

This tutorial, aimed at new learners, covers exploration of MS data processed with commercial software, prior to statistical modelling. Content includes: internal standards (IS), sample/feature visualizations, feature filtration, normalization, missing values, quality (QA/QC), correction and correlation. <br> 
The steps described may be executed sequentially or non-sequentially (dataset dependent). XXX <br><br>

<i> Example Data </i> <br>
Progenesis QI - user guide and tutorial data set @ <a href="http://www.nonlinear.com/progenesis/qi/v2.4/user-guide"> www.nonlinear.com/progenesis/qi/v2.4/user-guide</a> (Ion Mobility HDMSe LC-MS). <br>
( *simulated <i>Example Metadata</i> )
</font>

In [None]:
#lapply(.libPaths(), list.files) # print available/downloaded packages
suppressMessages(library(ggplot2))
suppressMessages(library(reshape2))
#suppressMessages(library(pcaMethods))

In [None]:
setwd("./") # directory (user input)
getwd()

In [None]:
list.files(path=".", pattern="\\.csv$") # list '.CSV' files in directory (pattern is a regular expression)

In [None]:
#define PROGENESIS & METADATA files (user input)
#iFileD <- "QC_measurements.csv"
iFileD <- "Progenesis_QI_Tutorial_HDMSe_Norm.csv"
#iFileD <- "Progenesis_QI_Tutorial_HDMSe_Raw.csv"

#iFileM <- "QC_metadata.csv"
iFileM <- "Example_Metadata.csv"

<font style="color:#68829E; font-family:calibri; font-size:29px">
<b> Processed Data </b>
</font>
<br>
<font style="color:black; font-family:calibri; font-size:15px">
Progenesis (Waters) outputs a processed peak list (following detection, alignment and deconvolution of isotopes/adducts) across samples, with either unnormalized (raw) or normalized (Mean Log Ratio, MLR) intensities. <br> 
Typically, each feature is described through a mass-to-charge (MZ in m/z) and retention time (RT in min) value. <br>
</font>
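
As a small illustration of the MZ/RT description above, the sketch below splits a feature ID of the form used later in this notebook ("7.48_322.0672m/z") into its RT and MZ parts. The helper name `parseFeatureID` is hypothetical, and the "RT_mass m/z" layout is an assumption based on that single example; IDs with another suffix would return `NA` for the mass.

```r
# Hypothetical helper: split a feature ID into RT (min) and m/z.
# Assumes IDs look like "<RT>_<mass>m/z", e.g. "7.48_322.0672m/z".
parseFeatureID <- function(id) {
    parts <- strsplit(id, "_", fixed=TRUE)
    rt <- as.numeric(vapply(parts, `[`, character(1), 1))                    # text before "_"
    mz <- as.numeric(sub("m/z$", "", vapply(parts, `[`, character(1), 2)))   # text after "_", suffix removed
    data.frame(id=id, rt=rt, mz=mz, stringsAsFactors=FALSE)
}

parseFeatureID(c("7.48_322.0672m/z", "2.89_121.0292m/z"))
```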

In [None]:
# ... open iFileD ...
# ... rows = peaks / cols = samples ...

In [None]:
labelRow <- 3; # label/header row (user input)

<div class="alert alert-block alert-info" style="font-style:italic; font-size:13px">
<b>#Tip 1.</b> Metrics, such as alignment scores, chromatographic peak width, normalization factors, are informative.
</div>

<font style="color:#68829E; font-family:calibri; font-size:29px">
<b> Data Importation </b>
</font>
<br>
<font style="color:black; font-family:calibri; font-size:15px">
For ease, a Progenesis '.CSV' file should be accompanied by a manually created Metadata '.CSV' file (with samples in the same order) that contains information pertinent to experimental design, sample collection, data acquisition and so on.
</font>
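
Because the two files are matched by position, it can be worth confirming the sample order before going further. The sketch below uses toy stand-ins; the `SampleID` column is a hypothetical name and may not exist in your Metadata file.

```r
# Toy stand-ins (assumptions): sample IDs from the peak list and a metadata
# table with a hypothetical "SampleID" column, deliberately shuffled.
msIDs <- c("QC_1", "A_LD_1", "A_LD_2", "QC_2")
meta <- data.frame(SampleID=c("A_LD_1", "QC_1", "QC_2", "A_LD_2"),
                   RunOrder=c(2, 1, 4, 3), stringsAsFactors=FALSE)

# Same samples, same order?
sameOrder <- identical(msIDs, meta$SampleID)
sameOrder

# If only the order differs, align metadata rows to the peak list
if (!sameOrder && setequal(msIDs, meta$SampleID)) {
    meta <- meta[match(msIDs, meta$SampleID), ]
}
meta
```

If the two sets of IDs differ (not just their order), the mismatch is better resolved by hand than by reordering.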

In [None]:
### PROGENESIS FILE (iFileD) ###

In [None]:
#?read.csv
rawMSdata <- read.csv(file=iFileD, header=F, sep=",", stringsAsFactors=F);
names(rawMSdata) = NULL; # remove names (aesthetics)

head(rawMSdata)

In [None]:
firstSample <- "C_Norm_1"; # user input
firstFeature <- "7.48_322.0672m/z"; # user input

startCol <- grep(firstSample, rawMSdata[labelRow, ]); # define column to index
startRow <- grep(firstFeature, rawMSdata[, 1], fixed=TRUE); # define row to index (literal match, '.' is not a wildcard)

In [None]:
MSsamID <- unlist(rawMSdata[labelRow, c(startCol:dim(rawMSdata)[2])]); # CHECK samples
head(MSsamID)

MSfeatID <- unlist(rawMSdata[c(startRow:dim(rawMSdata)[1]), grep("Compound", rawMSdata[labelRow, ])]); # CHECK features
head(MSfeatID)

In [None]:
#subset MS data - intensities only
MSdata <- rawMSdata[c(startRow:dim(rawMSdata)[1]), c(startCol:dim(rawMSdata)[2])];
MSdata <- t(MSdata);

dim(MSdata)

In [None]:
#convert MS data - numeric matrix
MSmatrix <- matrix(as.numeric(MSdata), nrow=dim(MSdata)[1], ncol=dim(MSdata)[2]);
colnames(MSmatrix) = t(MSfeatID);rownames(MSmatrix) = MSsamID;

MSmatrix[1:5, 1:5]

In [None]:
#extract descriptors (MZ, RT, PW - as.numeric)
mzVal <- as.numeric(unlist( rawMSdata[c(startRow:dim(rawMSdata)[1]), grep("m/z", rawMSdata[labelRow, ])] )); # MZ
mzVal2 <- round(mzVal,4); # 4 dp

rtVal <- as.numeric(unlist( rawMSdata[c(startRow:dim(rawMSdata)[1]), grep("Retention time", rawMSdata[labelRow, ])] )); # RT
rtVal2 <- rtVal*60; # seconds

pkWidth <- as.numeric(unlist( rawMSdata[c(startRow:dim(rawMSdata)[1]), grep("peak width", rawMSdata[labelRow, ])] )); # PW
pkWidth2 <- pkWidth*60; # seconds

In [None]:
### METADATA FILE (iFileM) ###

In [None]:
#?read.csv
rawMetadata <- read.csv(file=iFileM, header=T, sep=",", stringsAsFactors=F);

head(rawMetadata)

In [None]:
Metadata <- rawMetadata;
dim(Metadata)

In [None]:
#specify factors to plot/explore (global)
xAxis <- as.numeric(Metadata[, "RunOrder"]); # factor_#1
#xAxis <- as.factor(Metadata[, "Class"]);

cGroup <- as.factor(Metadata[, "Class"]); # factor_#2

<div class="alert alert-block alert-info" style="font-style:italic; font-size:13px">
<b>#Tip 2.</b> Fill Metadata with anything and everything.
</div>

<font style="color:#68829E; font-family:calibri; font-size:29px">
<b> Internal Standard </b>
</font>
<br>
<font style="color:black; font-family:calibri; font-size:15px">
For untargeted analysis, internal standards (IS) that span MZ and RT ranges should be incorporated into acquisition to assess chromatographic performance/alignment, sensitivity, mass stability & accuracy.
</font>

In [None]:
MSmatrixProc <- MSmatrix; # select MS data

In [None]:
MSfeatIDProc <- MSfeatID; # ID
mzValProc <- mzVal; # MZ
rtValProc <- rtVal; # RT
pkWidthProc <- pkWidth; # PW

In [None]:
NoiseVal <- 0; # 'noise' estimate

In [None]:
ISlsMZ <- c(121.0290, 121.0292, 139.0391, 121.0290, 121.0290, 121.0290, 121.0289) # MZ for IS
ISlsRT <- c(7.12, 2.89, 3.24, 3.10, 9.48, 8.73, 8.63) # RT for IS
IStolMZ <- 0.001; # MZ tolerance
IStolRT <- 0.05; # RT tolerance 

In [None]:
ISindex <- as.numeric(unlist(mapply(function(x,y){
    which( mzValProc <= (x+IStolMZ) & mzValProc >= (x-IStolMZ) & rtValProc <= (y+IStolRT) & rtValProc >= (y-IStolRT) )
}, x=ISlsMZ, y=ISlsRT))); # locate IS

MSfeatIDProc[ISindex]

In [None]:
#subset IS features
MSmatrixIS <- MSmatrixProc[,ISindex, drop=FALSE]; # drop=FALSE keeps a matrix if only one IS is found
dim(MSmatrixIS)

In [None]:
#subset IS descriptors
MSfeatIDIS <- MSfeatIDProc[ISindex]; # ID
mzValIS <- mzValProc[ISindex]; # MZ
rtValIS <- rtValProc[ISindex]; # RT
pkWidthIS <- pkWidthProc[ISindex]; # PW

In [None]:
#ISlsMZ - mzValIS # MZ diff
#ISlsRT - rtValIS # RT diff
pkWidthIS*60
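
Mass accuracy for the IS is often reported in parts per million (ppm) rather than as an absolute difference. A minimal sketch with toy values (assumptions, not taken from the tutorial data):

```r
# Toy values (assumptions)
mzTheo <- c(121.0290, 139.0391)   # expected m/z
mzObs  <- c(121.0292, 139.0388)   # matched (observed) m/z

# ppm error = (observed - expected) / expected * 1e6
ppmErr <- (mzObs - mzTheo) / mzTheo * 1e6
round(ppmErr, 2)

# An absolute tolerance expressed in ppm at a given mass
tolMZ <- 0.001
round(tolMZ / mzTheo * 1e6, 1)
```

An absolute tolerance of 0.001 corresponds to roughly 8 ppm at m/z 121, so a fixed window is relatively wider at low mass than at high mass.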

In [None]:
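#remove IS features (guard: negative indexing with an empty ISindex would drop every column)
keepIS <- setdiff(seq_len(dim(MSmatrixProc)[2]), ISindex); # columns to keep
MSmatrixProc <- MSmatrixProc[,keepIS];
dim(MSmatrixProc)

In [None]:
#remove IS descriptors
MSfeatIDProc <- MSfeatIDProc[keepIS]; # ID
mzValProc <- mzValProc[keepIS]; # MZ
rtValProc <- rtValProc[keepIS]; # RT
pkWidthProc <- pkWidthProc[keepIS]; # PW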

In [None]:
z1 <- melt( MSmatrixIS ); # transform & format
#z1 <- melt( scale(MSmatrixIS, center=T, scale=apply(MSmatrixIS, 2, sd)) ); # transform & format

#head(z1)

In [None]:
#structure data.frame for ggplot2
plot.m <- data.frame(
    x=rep(rownames(MSmatrixIS), times=length(ISindex)), 
    y=z1[,"value"],
    u=rep(colnames(MSmatrixIS), each=dim(MSmatrixIS)[1]),
    i=rep(cGroup, times=length(ISindex)),
    j=rep(xAxis, times=length(ISindex))
);

head(plot.m)

In [None]:
p1 <- ggplot(data=plot.m, aes(x=j, y=y)) + 
    geom_hline(yintercept=NoiseVal, color="red", size=0.5) +
    geom_point(data=plot.m, aes(x=j, y=y, color=i), size=2) + ggtitle("") +
    geom_line(data=plot.m, linetype=2, size=0.50) +
    facet_wrap(~u, scales="fixed", nrow=4) +
    labs(x="Spectra (RunOrder)", y="Intensity", color="Factor") +
    #guides(color = guide_legend(title.position="top", title.hjust = 0.50)) +
    theme(axis.title=element_text(face="bold", size=9), axis.text=element_text(size=8),
          strip.text=element_text(face="italic", size=8),
          legend.title=element_text(face="italic", size=9), legend.position="top", legend.text=element_text(size=8)
    )
p1

<div class="alert alert-danger" role="alert" style="font-style:italic; font-size:13px">
<b>#Note!</b> With commercial processing software, checks regarding IS for untargeted analysis are somewhat limited. 
</div>

<font style="color:#68829E; font-family:calibri; font-size:29px">
<b> Plots - Samples </b>
</font>
<br>
<font style="color:black; font-family:calibri; font-size:15px">
Alongside sample acquisition, reference data should also be acquired, for example: pooled QC to assess run stability (intra-study), serial dilutions to assess feature linearity, and a long-term reference to assess assay stability (intra-laboratory). <br>
Visualizations of sample-associated characteristics help optimize/prime MS data - for example, outlier exclusion, replicate correlation, QC, XXX
</font>

In [None]:
MSmat <- MSmatrix; # select MS data
#MSmat <- MSmatrixProc; # select MS data

In [None]:
#rowIndex <- grep("QC",MSsamID); # only QC
rowIndex <- c(1:dim(Metadata)[1]); # sample & QC

MSmat <- MSmat[rowIndex,];
xAxisT <- xAxis[rowIndex];
cGroupT <- cGroup[rowIndex];

In [None]:
#descriptive characteristics
sTIC <- apply(MSmat, 1, sum); # summed intensities 
#mINT <- apply(MSmat, 1, mean); # mean intensities
#sTIC <- apply(log2(MSmat+1), 1, sum); # log2 summed intensities 
mINT <- apply(log2(MSmat+1), 1, mean); # log2 mean intensities

In [None]:
#structure data.frame for ggplot2
plot.m <- data.frame(
    x=c(sTIC, mINT),
    y=rep(c("sTIC", "mINT"), each=dim(MSmat)[1]),
    i=rep(cGroupT, times=2), 
    j=rep(xAxisT, times=2)
);
head(plot.m)

In [None]:
p1 <- ggplot(data=plot.m, aes(x=j, y=x, group=j, fill=i)) + 
    geom_bar(stat="identity") + ggtitle("") +
    facet_wrap(~y, scales="free", nrow=2) +
    labs(x="Spectra (RunOrder)", y="Characteristics (Intensity)", fill="Factor") +
    theme(axis.title=element_text(face="bold", size=9), axis.text=element_text(size=8),
          strip.text=element_text(face="italic", size=9),
          legend.title=element_text(face="italic", size=9), legend.text=element_text(size=8)
    )
p1

In [None]:
z1 <- melt( scale(MSmat, center=T, scale=apply(MSmat, 2, sd)) ); # transform & format
#z1 <- melt( log2(MSmat+1) ); # transform & format

#head(z1)

In [None]:
#structure data.frame for ggplot2
plot.m <- data.frame(
    x=as.factor(z1[,"Var1"]), 
    y=z1[,"value"],
    i=rep(cGroupT, times=dim(MSmat)[2]),
    j=rep(xAxisT, times=dim(MSmat)[2])
);

head(plot.m)

In [None]:
p1 <- ggplot(data=plot.m, aes(x=j, y=y, group=j, fill=i)) + 
    geom_boxplot() + ggtitle("") +
    labs(x="Spectra (RunOrder)", y="Distribution (Intensity)", fill="Factor") +
    theme(axis.title=element_text(face="bold", size=9), axis.text=element_text(size=8),
          legend.title=element_text(face="italic", size=9),  legend.position="top", legend.text=element_text(size=8)
    )
p1

In [None]:
#z1 <- melt( scale(MSmat, center=T, scale=apply(MSmat, 2, sd)) ); # transform & format
z1 <- melt( log2(MSmat+1) ); # transform & format
z1 <- z1 [grep("QC",z1[,"Var1"]),]; # subset QC

#head(z1)

In [None]:
refQC <- "QC_1";

#structure data.frame for ggplot2
plot.m <- data.frame(
    x=rep(z1[ grep(refQC,z1[,"Var1"]),"value" ], each=length(unique(z1[,"Var1"]))),
    y=z1[,"value"], a=z1[,"Var1"], b=z1[,"Var2"]
);

head(plot.m)

In [None]:
p1 <- ggplot(data=plot.m, aes(x=x, y=y)) +
    geom_point() + ggtitle("") +
    facet_wrap(~a, scales="free", nrow=2) +
    stat_smooth(method='lm', formula=y~x, se=TRUE, fullrange=TRUE, color="red") +
    labs(x="Intensities (log2) - Reference", y="Intensities (log2) - Test") +
    theme(axis.title=element_text(face="bold", size=9),
          strip.text=element_text(face="italic", size=9),
          axis.text=element_text(size=8)
    )
p1
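
The scatter plots can be complemented by a numeric summary of replicate correlation. The sketch below simulates QC injections (toy data, not the tutorial set) and computes pairwise Pearson correlations on log2 intensities.

```r
set.seed(1)
# Toy log2 intensities (assumption): 4 QC injections (rows) x 50 features (columns)
base <- rnorm(50, mean=12, sd=2)                               # shared feature profile
qcLog <- t(sapply(1:4, function(i) base + rnorm(50, sd=0.2)))  # profile + injection noise
rownames(qcLog) <- paste0("QC_", 1:4)

# Pairwise Pearson correlation between injections (samples in columns for cor())
qcCor <- cor(t(qcLog))
round(qcCor, 3)
```

Low off-diagonal values for a particular injection would flag it for closer inspection.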

In [None]:
#TIC overlay ... ?

<div class="alert alert-block alert-info" style="font-style:italic; font-size:13px">
<b>#Tip 3.</b> Blanks may be included for analysis also - with special attention during pre-processing.
</div>

<font style="color:#68829E; font-family:calibri; font-size:29px">
<b> Plots - Features </b>
</font>
<br>
<font style="color:black; font-family:calibri; font-size:15px">
XXX. <br>
Visualizations of feature-associated characteristics help optimize/prime MS data - for example, detection, variation, XXX 
</font>

In [None]:
MSmat <- MSmatrix; # select MS data
#MSmat <- MSmatrixProc; # select MS data

MSfeatIDProc <- MSfeatID;mzValProc <- mzVal;rtValProc <- rtVal;pkWidthProc <- pkWidth;

In [None]:
#rowIndex <- grep("QC",MSsamID); # only QC
rowIndex <- c(1:dim(Metadata)[1]); # sample & QC

MSmat <- MSmat[rowIndex,];
xAxisT <- xAxis[rowIndex];
cGroupT <- as.factor(Metadata[rowIndex, "Batch"]); 

In [None]:
#descriptive statistics
Fsd <- apply(MSmat, 2, sd);c(min(Fsd), max(Fsd)) # standard deviation
FminC <- apply(MSmat, 2, function(x){sum(x==0)});c(min(FminC), max(FminC)) # zero count
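
Standard deviation scales with intensity, so variation is often also expressed relative to the mean (relative standard deviation, RSD, in percent), typically within QC injections. A minimal sketch on toy data; the 30% cut-off is a hypothetical value, not a recommendation from this tutorial.

```r
set.seed(2)
# Toy intensities (assumption): 5 QC injections (rows) x 6 features (columns)
qcInt <- matrix(rlnorm(30, meanlog=10, sdlog=0.1), nrow=5,
                dimnames=list(paste0("QC_", 1:5), paste0("F", 1:6)))

# Relative standard deviation (%) per feature
rsd <- apply(qcInt, 2, function(x) 100 * sd(x) / mean(x))
round(rsd, 1)

# Hypothetical cut-off; choose one suited to the assay
rsdMax <- 30
keep <- which(rsd <= rsdMax)
names(keep)
```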

In [None]:
#structure data.frame for ggplot2
plot.m <- data.frame(
    x=c(Fsd, FminC),
    y=rep(c("Fsd", "FminC"), each=dim(MSmat)[2]),
    z=rep(c(1:dim(MSmat)[2]), times=2),
    a=rep(mzValProc, times=2), 
    b=rep(rtValProc, times=2)
);
head(plot.m)

In [None]:
range01 <- function(x){(x-min(x))/(max(x)-min(x))}

plot.m$x[grep("Fsd",plot.m$y)] <- range01(plot.m$x[grep("Fsd",plot.m$y)]); # visualize between 0-1
plot.m$x[grep("FminC",plot.m$y)] <- range01(plot.m$x[grep("FminC",plot.m$y)]); # visualize between 0-1

In [None]:
xby <- 500;
xmid <- 0.25;

p1 <- ggplot(data=plot.m, aes(x=z, y=x, color=x)) + 
    geom_point() + ggtitle("") +
    facet_wrap(~y, scales="free", nrow=2) +
    labs(x="Features (Index)", y="Characteristics (Intensity)", color="") +
    scale_colour_gradient2(midpoint=xmid, low="blue", mid="white", high="red") +
    scale_x_continuous(limits=c(0,max(plot.m$z)), breaks=seq(from=0,to=max(plot.m$z), by=xby)) +
    theme(axis.title=element_text(face="bold", size=9), axis.text=element_text(size=8),
          strip.text=element_text(face="italic", size=9),
          legend.title=element_text(face="bold", size=9), legend.position="top", legend.text=element_text(size=6)
    )
p1

In [None]:
col_index <- seq(from=1, to=12, by=1); # select few features (first) 
#col_index <- sample( c(1:dim(MSmatrix)[2]), 12, replace=FALSE ); # select few features (random)

MSmat <- MSmat[,col_index];

In [None]:
z1 <- melt( MSmat ); # transform & format
#z1 <- melt( scale(MSmat, center=T, scale=apply(MSmat, 2, sd)) ); # transform & format

#head(z1)

In [None]:
#structure data.frame for ggplot2
plot.m <- data.frame(
    x=rep(rownames(MSmat), times=length(col_index)), 
    y=z1[,"value"],
    u=rep(colnames(MSmat), each=dim(MSmat)[1]),
    i=rep(cGroupT, times=length(col_index)),
    j=rep(xAxisT, times=length(col_index))
);

head(plot.m)

In [None]:
p1 <- ggplot(data=plot.m, aes(x=j, y=y, color=i)) +
    geom_hline(yintercept=0, color="black", linetype=1, size=0.50) +
    geom_point(size=2) + ggtitle("") +
    geom_line(data=plot.m,aes(linetype=i), size=0.50) +
    facet_wrap(~u, scales="free_y", nrow=3) +
    labs(x="Spectra (RunOrder)", y="Intensity", color="Factor", linetype="Factor") +
    theme(axis.title=element_text(face="bold", size=9), axis.text=element_text(size=8),
          strip.text=element_text(face="italic", size=8),
          legend.title=element_text(face="italic", size=9), legend.position="top", legend.text=element_text(size=8)
    )
p1

<font style="color:#68829E; font-family:calibri; font-size:29px">
<b> Feature Filtration </b>
</font>
<br>
<font style="color:black; font-family:calibri; font-size:15px">
Multiple methods exist to help reduce data and preserve reliable peaks only - <br>
<i>1.</i> Minimum Fraction - peak presence (detection rate) using class/group information. <br>
<i>2.</i> Retention Limit - increased variability for early/late-eluting features. <br>
<i>3.</i> Peak Width - sampling of chromatographic peak (low = inaccurate / high = coelution). <br>
Others include: response linearity, peak shape, blank contribution, standard deviation.
</font>
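
The minimum fraction idea can be sketched on a toy matrix as below. This is a vectorized illustration under stated assumptions (zeros mean "not detected"; toy classes A and B), and it compares detection rates directly rather than rounded expected counts, so edge cases may differ slightly from the loop used in this notebook.

```r
# Toy data (assumption): 6 samples (rows) x 4 features (columns), two classes
X <- matrix(c(5,0,0,0,
              6,0,3,0,
              7,2,0,0,
              0,0,4,0,
              0,0,5,1,
              0,1,6,0), nrow=6, byrow=TRUE,
            dimnames=list(NULL, paste0("F", 1:4)))
cls <- factor(c("A","A","A","B","B","B"))

minFrac <- 50  # percent of samples within a class
minG <- 1      # number of classes that must pass

# Detection rate (%) per class (rows) and feature (columns)
detRate <- 100 * rowsum((X > 0) * 1, group=cls) / as.vector(table(cls))
detRate

# Keep features that reach minFrac in at least minG classes
passed <- colSums(detRate >= minFrac) >= minG
names(which(passed))
```

Here F1 is detected in every class A sample and F3 in every class B sample, so both are kept; F2 and F4 never reach 50% in either class.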

In [None]:
MSmatrixProcI <- MSmatrix; # select MS data
#MSmatrixProcI <- MSmatrixProc; # select MS data

MSfeatIDProc <- MSfeatID;mzValProc <- mzVal;rtValProc <- rtVal;pkWidthProc <- pkWidth;

In [None]:
print( paste0("number of MS features before filtration = ", dim(MSmatrixProcI)[2]) )

In [None]:
### MINIMUM FRACTION (MF) FILTER ###

In [None]:
#define parameters
cGroupT <- as.factor(Metadata[, "Class"]); # class/group vector
minFrac <- 50; # percent present within group 
minG <- 1; # number of groups
thresD <- 0; # threshold re detection

In [None]:
#run filter
grpLv <- levels(cGroupT); # class/group levels (same order as table())
eTABLE <- round(table(cGroupT)*(minFrac/100),0); # expected numbers per class/group

countM <- matrix(NA, nrow=dim(MSmatrixProcI)[2], ncol=length(grpLv)); # count matrix
countI <- matrix(1, nrow=dim(MSmatrixProcI)[2], ncol=length(grpLv)); # binary matrix
for (i in 1:dim(MSmatrixProcI)[2]){
    for (j in 1:length(grpLv)){
        countM[i,j] = sum(MSmatrixProcI[which(cGroupT == grpLv[j]),i]>thresD)
        if(countM[i,j]<eTABLE[j]){countI[i,j]=0}
    }
}

MF_filter <- which( apply(countI,1,sum) >= minG )
#length(MF_filter)

In [None]:
#remove MF features
MSmatrixProcI <- MSmatrixProcI[,MF_filter];
print( paste0("number of MS features after filter #1 = ", dim(MSmatrixProcI)[2]) )

In [None]:
#remove MF descriptors
MSfeatIDProc <- MSfeatIDProc[MF_filter]; # ID
mzValProc <- mzValProc[MF_filter]; # MZ
rtValProc <- rtValProc[MF_filter]; # RT
pkWidthProc <- pkWidthProc[MF_filter]; # PW

In [None]:
### RETENTION LIMIT (RL) FILTER ###

In [None]:
c(min(rtValProc), max(rtValProc)) # visualize RT range 

In [None]:
#define parameters
rtLIM <- c(0.50, 9.50); # RT min/max (minutes)

In [None]:
#run filter
RT_filter <- which(rtValProc>=rtLIM[1] & rtValProc<=rtLIM[2])
#length(RT_filter)

In [None]:
#remove RL features
MSmatrixProcI <- MSmatrixProcI[,RT_filter];
print( paste0("number of MS features after filter #2 = ", dim(MSmatrixProcI)[2]) )

In [None]:
#remove RL descriptors
MSfeatIDProc <- MSfeatIDProc[RT_filter]; # ID
mzValProc <- mzValProc[RT_filter]; # MZ
rtValProc <- rtValProc[RT_filter]; # RT
pkWidthProc <- pkWidthProc[RT_filter]; # PW

In [None]:
### PEAK WIDTH (PW) FILTER ###

In [None]:
c(min(pkWidthProc), max(pkWidthProc)) # visualize PW range

In [None]:
#define parameters
pwLIM <- c(0.025, 0.500); # PW min/max (minutes)

In [None]:
#run filter
PW_filter <- which(pkWidthProc>=pwLIM[1] & pkWidthProc<=pwLIM[2])
#length(PW_filter)

In [None]:
#remove PW features
MSmatrixProcI <- MSmatrixProcI[,PW_filter];
print( paste0("number of MS features after filter #3 = ", dim(MSmatrixProcI)[2]) )

In [None]:
#remove PW descriptors
MSfeatIDProc <- MSfeatIDProc[PW_filter]; # ID
mzValProc <- mzValProc[PW_filter]; # MZ
rtValProc <- rtValProc[PW_filter]; # RT
pkWidthProc <- pkWidthProc[PW_filter]; # PW

<div class="alert alert-danger" role="alert" style="font-style:italic; font-size:13px">
<b>Note!</b> A combination of filters - all, one, none - can be employed before or after normalization.
</div>

<font style="color:#68829E; font-family:calibri; font-size:29px">
<b> *Normalization </b>
</font>
<br>
<font style="color:black; font-family:calibri; font-size:15px">
Normalization algorithms try to account for systematic variation in overall concentration between samples/rows (comparability).<br>
Examples include: CS = Constant Sum (Total Area), PQN = Probabilistic Quotient, MLR = Mean Log Ratio (Progenesis).
</font>
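
To make the idea concrete, the sketch below applies CS and PQN to a toy matrix in which one profile is "measured" at three dilutions. The data and sample names are assumptions for illustration; with real data the factors will not be this clean.

```r
set.seed(3)
# Toy data (assumption): one 'true' profile measured at three dilutions
profile <- rlnorm(20, meanlog=8, sdlog=1)
X <- rbind(S1=profile, S2=0.5*profile, S3=2*profile)

# Constant sum (CS): scale each row to the median total intensity
csFactor <- median(rowSums(X)) / rowSums(X)
Xcs <- X * csFactor          # a vector recycles down columns, i.e. one factor per row

# Probabilistic quotient (PQN): median ratio to a reference spectrum
ref <- X["S1", ]
pqFactor <- 1 / apply(sweep(X, 2, ref, "/"), 1, median)
Xpq <- X * pqFactor

rowSums(Xcs)   # identical totals after CS
pqFactor       # dilution recovered by PQN
```

CS assumes the total signal is comparable between samples, whereas PQN relies on the majority of features changing by the same dilution factor.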

In [None]:
### RAW DATA ONLY ###

pName <- MSsamID;pName # sample ID vector
RefSpec <- "A_LD_4" # select reference (representative spectra)

In [None]:
colnames(Metadata)
Nfactor <- as.numeric(Metadata[, "NormFactorMLR"]);Nfactor

#normalize to external vector
MSmat <- MSmatrix;
MSnorm <- MSmat*matrix(rep(Nfactor, times=dim(MSmat)[2]), nrow=dim(MSmat)[1], ncol=dim(MSmat)[2]);#MSnorm[1:5,1:5]

In [None]:
MSmat <- MSmatrix;
#MSmat <- MSmatrixProcI; # select MS data

#calculate CS normalization (with/without standardization - reference)
CSfactor <- median(apply(MSmat, 1, sum))/apply(MSmat, 1, sum);CSfactor 
CSfactor <- CSfactor/rep(CSfactor[grep(RefSpec, pName)], times=dim(MSmat)[1]);CSfactor

MSnormCS <- MSmat*matrix(rep(CSfactor, times=dim(MSmat)[2]), nrow=dim(MSmat)[1], ncol=dim(MSmat)[2]);#MSnormCS[1:5,1:5]

In [None]:
MSmat <- MSmatrix;
#MSmat <- MSnormCS; # select MS data

#calculate PQ normalization 
RefSam <- MSmat[grep(RefSpec, pName),]; # empirical reference
#RefSam <- apply(MSmat, 2, median); # theoretical reference
RefSam[RefSam == 0] <- 1e-04; # offset addition

PQmat <- MSmat/matrix(rep(RefSam, each=dim(MSmat)[1]), nrow=dim(MSmat)[1], ncol=dim(MSmat)[2]);
PQfactor <- 1/apply(PQmat, 1, median);PQfactor

MSnormPQ <- MSmat*matrix(rep(PQfactor, times=dim(MSmat)[2]), nrow=dim(MSmat)[1], ncol=dim(MSmat)[2]);#MSnormPQ[1:5,1:5]

In [None]:
MSmat <- MSmatrix;
#MSmat <- MSnormCS; # select MS data

#estimate MLR normalization
RefSam <- MSmat[grep(RefSpec, pName),]; # empirical reference
#RefSam <- apply(MSmat, 2, median); # theoretical reference
RefSam[RefSam == 0] <- 1e-04; # offset addition

MLRmat <- MSmat/matrix(rep(RefSam, each=dim(MSmat)[1]), nrow=dim(MSmat)[1], ncol=dim(MSmat)[2]);
MLRmat[MLRmat == 0] <- 1;
MLRmat <- log10(MLRmat);#MLRmat[1:5,1:5]

madLim <- 3 * apply(MLRmat, 1, mad); # mad() already applies the 1.4826 scaling by default
UppLim <- apply(MLRmat, 1, median) + madLim; # upper limit
LowLim <- apply(MLRmat, 1, median) - madLim; # lower limit
for (i in 1:dim(MSmat)[1]){
    #MLRmat[i, which(MLRmat[i,] > UppLim[i] | MLRmat[i,] < LowLim[i])] <- 0; # mask outliers
    MLRmat[i, which(MLRmat[i,] > UppLim[i] | MLRmat[i,] < LowLim[i])] <- NA; # mask outliers
}

#apply(MLRmat, 1, sd)
#MLRfactor <- 1/(10^apply(MLRmat, 1, mean));MLRfactor
MLRfactor <- 1/(10^apply(MLRmat, 1, mean, na.rm=TRUE));MLRfactor

MSnormMLR <- MSmat*matrix(rep(MLRfactor, times=dim(MSmat)[2]), nrow=dim(MSmat)[1], ncol=dim(MSmat)[2]);#MSnormMLR[1:5,1:5]
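
One detail worth checking when setting robust outlier limits: base R's `mad()` multiplies the raw median absolute deviation by 1.4826 by default (its `constant` argument), which makes it comparable to a standard deviation for normally distributed data. A short check on a toy vector:

```r
x <- c(1, 2, 3, 4, 100)                 # toy vector with one outlier
rawMAD <- median(abs(x - median(x)))    # unscaled median absolute deviation

c(raw=rawMAD, default=mad(x), unscaled=mad(x, constant=1))
```

Limits of the form median ± 3 × `mad(x)` therefore already include the scaling; applying the constant a second time widens them by a further factor of about 1.48.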

<font style="color:#68829E; font-family:calibri; font-size:29px">
<b> Missing Value Imputation </b>
</font>
<br>
<font style="color:black; font-family:calibri; font-size:15px">
MS data is prone to missing values - a topic that ranges in complexity. <br> 
For simplicity, a standard approach is to take a proxy for the limit of detection (half of the minimum) and treat every missing value the same. More advanced approaches may also be applicable, such as k-nearest neighbor (kNN), random forest (RF) or singular value decomposition (SVD), but care should be taken not to introduce bias.
</font>
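
The half-of-minimum idea can be applied globally (one value for the whole matrix) or per feature. The sketch below shows both on a toy matrix where zeros mark missing values; the per-feature variant is an illustrative alternative, not necessarily what is used in this notebook.

```r
# Toy data (assumption): zeros mark missing values
X <- matrix(c(100,  0, 40,
              120,  8,  0,
                0, 10, 44), nrow=3, byrow=TRUE,
            dimnames=list(paste0("S", 1:3), paste0("F", 1:3)))

# Global proxy: half of the smallest non-zero intensity in the whole matrix
lodGlobal <- min(X[X != 0]) / 2
Xglobal <- X
Xglobal[Xglobal == 0] <- lodGlobal

# Per-feature alternative: half of each feature's own non-zero minimum
Xfeature <- apply(X, 2, function(x) { x[x == 0] <- min(x[x != 0]) / 2; x })

Xglobal
Xfeature
```

A feature with no non-zero values would need handling (or removal) before the per-feature variant, since its minimum is undefined.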

In [None]:
#print message re missing values ('0')
paste0( sum(apply(MSmatrix, 2, function(x){sum(x==0)})), " / ", (dim(MSmatrix)[1]*dim(MSmatrix)[2]) )
#paste0( sum(apply(MSmatrixProcI, 2, function(x){sum(x==0)})), " / ", (dim(MSmatrixProcI)[1]*dim(MSmatrixProcI)[2]) )

In [None]:
#find minimum intensity (real/non-zero)
LOD <- min(MSmatrix[MSmatrix!=0]);
LOD

In [None]:
#find & replace 'missing' (half of minimum)
MSmatrixProcII <- MSmatrixProcI;
MSmatrixProcII[MSmatrixProcII==0] <- (LOD/2);

dim(MSmatrixProcII)

In [None]:
#save result as new '.CSV' (MVAPACK)
#write.table(rbind(t(Metadata),t(MSmatrixProcII)),"met_proc1.csv",sep=",",col.names=F,row.names=T);
write.table(rbind(t(Metadata),format(t(MSmatrixProcII),scientific=F)),"met_proc1.csv",sep=",",col.names=F,row.names=T);
#list.files(path=".", pattern=".csv")

<font style="color:#68829E; font-family:calibri; font-size:29px">
<b> Principal Component Analysis </b>
</font>
<br>
<font style="color:black; font-family:calibri; font-size:15px">
Unsupervised dimensionality reduction of MS data is valuable for exploration: outlier detection (acquisition, contamination, etc.), trends associated with Metadata (batch / run order), and quality assessment through tight clustering of reference data (QC or controls).
</font>
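
As a minimal illustration of how scores and percentage variance explained are obtained from `prcomp`, the sketch below uses simulated data (an assumption; two groups of samples separated on a subset of features).

```r
set.seed(4)
# Toy data (assumption): 10 samples x 30 features, second half of samples shifted
X <- matrix(rnorm(300), nrow=10)
X[6:10, 1:10] <- X[6:10, 1:10] + 3

pc <- prcomp(X, center=TRUE, scale.=TRUE)          # mean-centred, unit-variance scaled

# Percentage of variance explained by each component
varExp <- 100 * pc$sdev^2 / sum(pc$sdev^2)
round(varExp[1:2], 2)

head(pc$x[, 1:2])   # scores on PC1 and PC2
```

Unit-variance scaling requires every feature to vary across samples; constant columns need to be removed beforehand.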

In [None]:
MSmat <- MSmatrixProcII; # select MS data
pName <- MSsamID; # sample ID vector

dim(MSmat)
colnames(Metadata)

In [None]:
#cGroupT <- cGroup; 
cGroupT <- as.factor(Metadata[, "Batch"]);
rOrderT <- as.numeric(Metadata[, "RunOrder"]);

In [None]:
#MSmat <- MSmat[grep("QC",pName),];cGroupT <- cGroupT[grep("QC",pName)];rOrderT <- rOrderT[grep("QC",pName)]; # subset QC

#m2 <- pca(MSmat, method="ppca", nPcs=2, scale="uv", center=TRUE);m2
m1 <- prcomp(MSmat, rank=2, scale=TRUE, center=TRUE);summary(m1)

#VarExp <- round(m2@R2[1:2]*100, 2);VarExp
VarExp <- round((m1$sdev^2/sum(m1$sdev^2))[1:2]*100, 2);#VarExp

In [None]:
#ggdata <- data.frame(m2@scores[,1:2], Class=cGroupT);head(ggdata)
ggdata <- data.frame(m1$x[,1:2], Class=cGroupT);#head(ggdata)

#ggdata <- ggdata[grep("QC",pName),];head(ggdata) # subset QC

In [None]:
dis_palette <- c("#1f77b4","#ff7f0e","#2ca02c","#d62728","#9467bd","#8c564b","#e377c2") # discrete 

p1 <- ggplot(ggdata) + 
    geom_point(aes(x=PC1, y=PC2, color=factor(Class)), size=5, shape=20) +
    geom_text(aes(x=PC1, y=PC2, label=row.names(ggdata)), size=2.5, fontface="bold.italic") +
    geom_hline(yintercept=0, color="black", size=0.5) + geom_vline(xintercept=0, color="black", size=0.5) +
    labs(x = paste0("PC1: ",VarExp[1],"%"), y = paste0("PC2: ",VarExp[2], "%")) +
    stat_ellipse(aes(x=PC1,y=PC2,fill=factor(Class)), geom="polygon", level=0.95, alpha=0.2) +
    scale_color_manual(values = dis_palette) + scale_fill_manual(values = dis_palette) +
    guides(color=guide_legend(title="Factor"), fill=guide_legend(title="Factor")) +
    theme(axis.title=element_text(face="bold", size=9), axis.text=element_text(size=8),
         legend.title=element_text(face="italic", size=9), legend.position="top", legend.text=element_text(size=8)
    )
p1

In [None]:
#ggdata <- data.frame(m2@scores[,1:2], Class=rOrderT);head(ggdata)
ggdata <- data.frame(m1$x[,1:2], Class=rOrderT);#head(ggdata)

#ggdata <- ggdata[grep("QC",pName),];head(ggdata) # subset QC

In [None]:
con_palette <- cm.colors(length(unique(ggdata$Class))) # continuous

p2 <- ggplot(ggdata) + 
    geom_point(aes(x=PC1, y=PC2, color=factor(Class)), size=5, shape=20) +
    geom_text(aes(x=PC1, y=PC2, label=row.names(ggdata)), size=2.5, fontface="bold.italic") +
    geom_hline(yintercept=0, color="black", size=0.5) + geom_vline(xintercept=0, color="black", size=0.5) +
    xlim(layer_scales(p1)$x$range$range) + ylim(layer_scales(p1)$y$range$range) +
    labs(x = paste0("PC1: ",VarExp[1],"%"), y = paste0("PC2: ",VarExp[2], "%")) +
    #stat_ellipse(aes(x=PC1,y=PC2,fill=factor(Class)), geom="polygon", level=0.95, alpha=0.2) +
    scale_color_manual(values = con_palette) + scale_fill_manual(values = con_palette) +
    guides(color=guide_legend(title="Factor", nrow=1)) +
    theme(axis.title=element_text(face="bold", size=9), axis.text=element_text(size=8),
         legend.title=element_text(face="italic", size=9), legend.position="top", legend.text=element_text(size=5)
    )
p2

<font style="color:#68829E; font-family:calibri; font-size:29px">
<b> *Correction Methods </b>
</font>
<br>
<font style="color:black; font-family:calibri; font-size:15px">
MS data can be susceptible to structured variation that is orthogonal to the effect of interest, associated with sample preparation and data acquisition, which needs to be removed prior to statistical modelling (univariate or multivariate). <br>
A range of mathematical procedures exist (linear vs loess, mean vs median), using reference samples or all samples (QC vs background, respectively). The estimated correction factor can be a single scalar representative of all features or a combined vector with one factor per feature. 
</font>
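
A per-feature (vector) correction can be sketched with batch medians, as below. The data are simulated with an artificial 1.5-fold batch offset, and the median-ratio approach is one simple option among those listed above, shown here only to illustrate the mechanics.

```r
set.seed(5)
# Toy data (assumption): 8 samples x 5 features, batch 2 measured ~1.5x higher
batch <- factor(rep(c(1, 2), each=4))
X <- matrix(rlnorm(40, meanlog=8, sdlog=0.1), nrow=8)
X[batch == 2, ] <- X[batch == 2, ] * 1.5

Xcorr <- X
for (k in seq_len(ncol(X))) {
    batchMed <- tapply(X[, k], batch, median)                     # per-batch median
    ratio <- median(X[, k]) / as.numeric(batchMed[as.character(batch)])  # one factor per sample
    Xcorr[, k] <- X[, k] * ratio
}

# Batch medians per feature agree after correction
apply(Xcorr, 2, function(x) tapply(x, batch, median))
```

Estimating the factors from all samples, as done here, assumes the batches have comparable biological composition; otherwise reference samples (for example QC) are the safer basis.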

In [None]:
MSmat <- MSmatrixProcII; # select MS data
pName <- MSsamID; # sample ID vector

dim(MSmat)

In [None]:
#define MODEL (QC) set
#rowIndexIM <- grep("QC",pName); # only QC
rowIndexIM <- c(1:dim(MSmat)[1]); # sample & QC

GRP <- c("LD","HD");
rowIndexIM <- NULL;
for (i in 1:length(GRP)){rowIndexIM <- c(rowIndexIM,grep(GRP[i],pName))}

pName[rowIndexIM]

In [None]:
#structure MODEL (QC) data & classes & order
MSmatIM <- MSmat[rowIndexIM,];

cGroupIM <- Metadata[rowIndexIM, "Batch"];cGroupIM
cGroupIM[grep("HD",pName[rowIndexIM])] <- 2;cGroupIM <- as.factor(cGroupIM);cGroupIM
#cGroupIM <- as.factor(Metadata[rowIndexIM, "Batch"]);cGroupIM
rOrderIM <- as.numeric(Metadata[rowIndexIM, "RunOrder"]);rOrderIM 

#NewM <- model.matrix(~0+cGroupIM);NewM

In [None]:
#confirm MODEL (QC) data & classes & order
IMmat <- MSmatIM;
IMbat <- cGroupIM;
IMrun <- rOrderIM;
#clsDM <- NewM;

In [None]:
#define POPULATION (SAM) set
rowIndexIC <- c(1:dim(MSmat)[1]); # sample & QC

pName[rowIndexIC]

In [None]:
#structure POPULATION (SAM) data & classes & order
MSmatIC <- MSmat[rowIndexIC,];

cGroupIC <- Metadata[rowIndexIC, "Batch"];cGroupIC
cGroupIC[grep("HD",pName[rowIndexIC])] <- 2;cGroupIC <- as.factor(cGroupIC);cGroupIC
#cGroupIC <- as.factor(Metadata[rowIndexIC, "Batch"]);cGroupIC
rOrderIC <- as.numeric(Metadata[rowIndexIC, "RunOrder"]);rOrderIC 

#NewM <- model.matrix(~0+cGroupIC);NewM

In [None]:
#confirm POPULATION (SAM) data & classes & order
ICmat <- MSmatIC;
ICbat <- cGroupIC;
ICrun <- rOrderIC;
#clsDC <- NewM;

In [None]:
m1 <- prcomp(ICmat, rank=2, scale=TRUE, center=TRUE);summary(m1)
VarExp <- round((m1$sdev^2/sum(m1$sdev^2))[1:2]*100, 2);#VarExp
ggdata <- data.frame(m1$x[,1:2], Class=ICbat);#head(ggdata)

In [None]:
dis_palette <- c("#1f77b4","#ff7f0e","#2ca02c","#d62728","#9467bd","#8c564b","#e377c2") # discrete 

p1 <- ggplot(ggdata) + 
    geom_point(aes(x=PC1, y=PC2, color=factor(Class)), size=5, shape=20) +
    geom_text(aes(x=PC1, y=PC2, label=row.names(ggdata)), size=2.5, fontface="bold") +
    geom_hline(yintercept=0, color="black", size=0.5) + geom_vline(xintercept=0, color="black", size=0.5) +
    labs(x = paste0("PC1: ",VarExp[1],"%"), y = paste0("PC2: ",VarExp[2], "%")) +
    stat_ellipse(aes(x=PC1,y=PC2,fill=factor(Class)), geom="polygon", level=0.95, alpha=0.2) +
    scale_color_manual(values = dis_palette) + scale_fill_manual(values = dis_palette) +
    guides(color=guide_legend(title="Class"), fill=guide_legend(title="Class")) +
    theme(axis.title=element_text(face="bold", size=9), axis.text=element_text(size=8),
         legend.title=element_text(face="italic", size=9), legend.position="top", legend.text=element_text(size=8)
    )
p1

In [None]:
### EXAMPLE CORRECTION I - SINGLE SCALAR BATCH (RATIO) ###

In [None]:
#method = INAPPROPRIATE (check PCA -> no change in MODEL or POPULATION)

OMmat = IMmat; # corrected MODEL
OCmat = ICmat; # corrected POPULATION

#xFitM = apply(IMmat,1,sum);xFitC = apply(ICmat,1,sum); # summed int (MODEL & POPULATION)
xFitM = apply(IMmat,1,median);xFitC = apply(ICmat,1,median); # median int (MODEL & POPULATION)
#xFitM = apply(IMmat,1,mean);xFitC = apply(ICmat,1,mean); # mean int (MODEL & POPULATION)

dfM = data.frame(y=xFitM, r=IMrun, b=IMbat);#head(dfM)
dfC = data.frame(y=xFitC, r=ICrun, b=ICbat);#head(dfC)

m1M = lm(y~1+b, data=dfM);m1Ms = summary(m1M);#print(m1M) # build MODEL model
#m1C = lm(y~1+b, data=dfC);m1Cs = summary(m1C);#print(m1Cs) # build POPULATION model
 
#mFitM = m1M$fitted.values;mFitC = m1C$fitted.values; # 'extract' (MODEL & POPULATION)
mFitM = predict(m1M, dfM);mFitC = predict(m1M, dfC); # 'predict' (MODEL & POPULATION)

#mFitM = rep(0, times=dim(IMmat)[1]);mFitC = rep(0, times=dim(ICmat)[1]);
#for (i in 1:length(unique(dfM$b))){
#    #p = mean(dfM$y[which(dfM$b==unique(dfM$b)[i])]);#p
#    p = median(dfM$y[which(dfM$b==unique(dfM$b)[i])]);#p
#    mFitM[which(dfM$b==unique(dfM$b)[i])] = p;mFitC[which(dfC$b==unique(dfM$b)[i])] = p;
#}

pRefM = median(xFitM)/mFitM;pRefC = median(xFitM)/mFitC;

OMmat = OMmat*matrix(rep(pRefM, times=dim(OMmat)[2]), nrow=dim(OMmat)[1], ncol=dim(OMmat)[2]) 
OCmat = OCmat*matrix(rep(pRefC, times=dim(OCmat)[2]), nrow=dim(OCmat)[1], ncol=dim(OCmat)[2]) 

In [None]:
m1 <- prcomp(OCmat, rank=2, scale=TRUE, center=TRUE);summary(m1)
VarExp <- round((m1$sdev^2/sum(m1$sdev^2))[1:2]*100, 2);#VarExp
ggdata <- data.frame(m1$x[,1:2], Class=ICbat);#head(ggdata)

In [None]:
dis_palette <- c("#1f77b4","#ff7f0e","#2ca02c","#d62728","#9467bd","#8c564b","#e377c2") # discrete 

p2 <- ggplot(ggdata) + 
    geom_point(aes(x=PC1, y=PC2, color=factor(Class)), size=5, shape=20) +
    geom_text(aes(x=PC1, y=PC2, label=row.names(ggdata)), size=2.5, fontface="bold") +
    geom_hline(yintercept=0, color="black", size=0.5) + geom_vline(xintercept=0, color="black", size=0.5) +
    labs(x = paste0("PC1: ",VarExp[1],"%"), y = paste0("PC2: ",VarExp[2], "%")) +
    stat_ellipse(aes(x=PC1,y=PC2,fill=factor(Class)), geom="polygon", level=0.95, alpha=0.2) +
    scale_color_manual(values = dis_palette) + scale_fill_manual(values = dis_palette) +
    guides(color=guide_legend(title="Class"), fill=guide_legend(title="Class")) +
    theme(axis.title=element_text(face="bold", size=9), axis.text=element_text(size=8),
         legend.title=element_text(face="italic", size=9), legend.position="top", legend.text=element_text(size=8)
    )
p2

In [None]:
### EXAMPLE CORRECTION II - COMBINED VECTOR BATCH (RATIO) ###

In [None]:
#method = APPROPRIATE (check PCA -> change in MODEL and POPULATION)

OMmat = IMmat; # corrected MODEL
OCmat = ICmat; # corrected POPULATION

for (k in 1:dim(IMmat)[2]){
    
    xFitM = IMmat[,k]; # MODEL int
    xFitC = ICmat[,k]; # POPULATION int

    dfM = data.frame(y=xFitM, r=IMrun, b=IMbat);#head(dfM)
    dfC = data.frame(y=xFitC, r=ICrun, b=ICbat);#head(dfC)

    m1M = lm(y~1+b, data=dfM);m1Ms = summary(m1M);#print(m1M) # build MODEL model
    #m1C = lm(y~1+b, data=dfC);m1Cs = summary(m1C);#print(m1Cs) # build POPULATION model
    
        #m1M = lm(y~1+r*b, data=dfM);m1Ms = summary(m1M);#print(m1M)
        #m1C = lm(y~1+r*b, data=dfC);m1Cs = summary(m1C);#print(m1Cs)
    
    #mFitM = m1M$fitted.values;#mFitC = m1C$fitted.values; # 'extract' (MODEL & POPULATION)
    mFitM = predict(m1M, dfM);mFitC = predict(m1M, dfC); # 'predict' (MODEL & POPULATION)

    sFitM = (xFitM/mFitM)*median(xFitM);sFitC = (xFitC/mFitC)*median(xFitM); # 'correct' (MODEL & POPULATION)
    OMmat[,k] = sFitM;OCmat[,k] = sFitC; # 'replace' (MODEL & POPULATION)
    
}

In [None]:
m1 <- prcomp(OCmat, rank=2, scale=TRUE, center=TRUE);summary(m1)
VarExp <- round((m1$sdev^2/sum(m1$sdev^2))[1:2]*100, 2);#VarExp
ggdata <- data.frame(m1$x[,1:2], Class=ICbat);#head(ggdata)

In [None]:
dis_palette <- c("#1f77b4","#ff7f0e","#2ca02c","#d62728","#9467bd","#8c564b","#e377c2") # discrete 

p2 <- ggplot(ggdata) + 
    geom_point(aes(x=PC1, y=PC2, color=factor(Class)), size=5, shape=20) +
    geom_text(aes(x=PC1, y=PC2, label=row.names(ggdata)), size=2.5, fontface="bold") +
    geom_hline(yintercept=0, color="black", size=0.5) + geom_vline(xintercept=0, color="black", size=0.5) +
    labs(x = paste0("PC1: ",VarExp[1],"%"), y = paste0("PC2: ",VarExp[2], "%")) +
    stat_ellipse(aes(x=PC1,y=PC2,fill=factor(Class)), geom="polygon", level=0.95, alpha=0.2) +
    scale_color_manual(values = dis_palette) + scale_fill_manual(values = dis_palette) +
    guides(color=guide_legend(title="Class"), fill=guide_legend(title="Class")) +
    theme(axis.title=element_text(face="bold", size=9), axis.text=element_text(size=8),
         legend.title=element_text(face="italic", size=9), legend.position="top", legend.text=element_text(size=8))

p2

In [None]:
#visualize FITS ... ?

In [None]:
MSmatrixProcIII <- OCmat;
dim(MSmatrixProcIII)

In [None]:
#save result as new '.CSV' (CORRECTED)
#write.table(rbind(t(Metadata),t(MSmatrixProcIII)),"met_proc2.csv",sep=",",col.names=F,row.names=T);
write.table(rbind(t(Metadata),format(t(MSmatrixProcIII),scientific=F)),"met_proc2.csv",sep=",",col.names=F,row.names=T);
#list.files(path=".", pattern=".csv")

<div class="alert alert-danger" role="alert" style="font-style:italic; font-size:13px">
<b>Note!</b> Correcting against reference (QC) samples, rather than against the study samples themselves, is less susceptible to overfitting and more reproducible in future studies (*that is, if correction is required at all).
</div>
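
<font style="color:black; font-family:calibri; font-size:15px">
As a self-contained illustration of the per-feature batch ratio correction (Correction II), the sketch below applies the same <i>lm(y~1+b)</i> and ratio steps to simulated QC data. The batch gains, feature intensities, noise level and object names (<i>qc</i>, <i>corrected</i>, <i>cv</i>) are assumptions made for this example only; they are not part of the tutorial data set.
</font>

```r
# Simulated example: all names and values below are hypothetical
set.seed(1)
n_per_batch <- 6
batch <- factor(rep(c("B1", "B2", "B3"), each = n_per_batch))
batch_gain <- c(B1 = 1.0, B2 = 1.5, B3 = 0.7)   # multiplicative batch effect

# 18 QC injections (rows) x 4 features (columns): same material in every run,
# so only the batch gain and a small log-normal noise term vary
base <- c(1000, 5000, 200, 20000)
qc <- sapply(base, function(mu) {
    mu * batch_gain[as.character(batch)] * rlnorm(length(batch), sdlog = 0.05)
})

corrected <- qc
for (k in seq_len(ncol(qc))) {
    df  <- data.frame(y = qc[, k], b = batch)
    fit <- lm(y ~ 1 + b, data = df)              # fitted values = batch means
    # ratio to the batch fit, rescaled to the overall median of the feature
    corrected[, k] <- qc[, k] / predict(fit, df) * median(qc[, k])
}

# CV (%) per feature, before and after correction
cv <- function(m) apply(m, 2, sd) / apply(m, 2, mean) * 100
round(rbind(before = cv(qc), after = cv(corrected)), 1)
```

<font style="color:black; font-family:calibri; font-size:15px">
After correction the batch means of each feature coincide, so the remaining CV reflects within-batch noise only. In a real study the model would be fitted on QC samples and then applied to the study samples with <i>predict</i>, as in the cells above.
</font>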

<font style="color:#68829E; font-family:calibri; font-size:29px">
<b> Coefficient of Variation </b>
</font>
<br>
<font style="color:black; font-family:calibri; font-size:15px">
An important measure of data quality is precision, assessed from the variation in reference data (QC or controls) as the coefficient of variation (CV), also called the relative standard deviation (RSD): standard deviation/mean, compared against an acceptance criterion (15-30%). <br> Other metrics include the median absolute deviation (MAD), dispersion ratio (D-ratio) and intra-class correlation (ICC).
</font>
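
<font style="color:black; font-family:calibri; font-size:15px">
A minimal sketch of the CV calculation on a small, made-up QC matrix, alongside a MAD-based alternative that is less sensitive to a single outlying injection. The intensities and the names <i>qc</i>, <i>cv_pct</i> and <i>rcv_pct</i> are hypothetical; the MAD-based ratio is shown as one possible robust variant, not as the method used in this tutorial.
</font>

```r
# Hypothetical QC intensities: 5 injections (rows) x 3 features (columns)
qc <- cbind(f1 = c(100, 102,  98, 101,  99),   # tight
            f2 = c(100, 140,  60, 120,  80),   # widely scattered
            f3 = c(100, 101,  99, 100, 300))   # tight, but one outlying injection

# classical CV (%): standard deviation / mean
cv_pct  <- apply(qc, 2, sd)  / apply(qc, 2, mean)   * 100
# MAD-based CV (%): mad() already applies the 1.4826 scaling constant by default
rcv_pct <- apply(qc, 2, mad) / apply(qc, 2, median) * 100

round(rbind(CV = cv_pct, robustCV = rcv_pct), 1)

# features passing a 30% acceptance threshold on the classical CV
names(which(cv_pct <= 30))
```

<font style="color:black; font-family:calibri; font-size:15px">
Here only <i>f1</i> passes the classical CV threshold; <i>f3</i> fails because of a single outlier, while its MAD-based value stays low. Which metric is appropriate depends on whether such outliers should count against a feature.
</font>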

In [None]:
MSmat <- MSmatrixProcII; # select MS data
#MSmat <- MSmatrixProcIII; # select MS data
pName <- MSsamID; # sample ID vector

dim(MSmat)

In [None]:
#define parameters
CVthres <- 30; # 'acceptance' threshold (%)
CVlimit <- 100; # visualization limit (%)

In [None]:
rowIndex <- grep("QC",pName);rowIndex # only QC
#rowIndex <- c(1:dim(Metadata)[1]); # sample & QC

MSmat <- MSmat[rowIndex,];

In [None]:
#calculate CV (feature by feature)
CV <- ( apply(MSmat,2,sd) / apply(MSmat,2,mean) ) * 100;
range(CV, na.rm=TRUE) # min & max CV (%); features with zero mean give NaN

In [None]:
CV2 <- CV;
#CV2 <- abs(CV);
CV2[CV2>CVlimit] <- CVlimit; # mask outliers
#CV2[is.na(CV)==1] <- CVlimit; # mask NAs

CV_filter <- which(CV2<=CVthres);length(CV_filter)

In [None]:
#structure data.frame for ggplot2
plot.m <- data.frame(x=mzValProc, y=rtValProc, z=CV2);
head(plot.m)

In [None]:
#define the plotting function first (it must exist before it is called below)
HeatmapCV <- function(plot.m){
    
    matlab_palette <- c("#0000AA","#0040FF","#0080FF","#40BFFF","#80FFFF","#BFFFBF","#FFFF80","#FFBF40","#FF8000","#FF4000","#AA0000")
    mzby <- 100;rtby <- 1;
    #axis limits rounded UP to the next 10, so that no feature is clipped
    mzmax <- ceiling(max(plot.m$x, na.rm=TRUE)/10)*10;
    rtmax <- ceiling(max(plot.m$y, na.rm=TRUE)/10)*10;
    
    p1 <- ggplot(data=plot.m, aes(x=x, y=y, color=z)) + 
    geom_point(size=2) +
    geom_hline(yintercept=0, size=0.50, linetype=2) + geom_vline(xintercept=0, size=0.5, linetype=2)+
    scale_color_gradientn(colors=(matlab_palette), limits=c(0,max(plot.m$z, na.rm=TRUE)), name="CV", na.value="gray90")+ #"RdYlGn"
    scale_y_continuous(limits=c(0,rtmax), breaks=seq(from=0,to=rtmax, by=rtby)) +
    scale_x_continuous(limits=c(0,mzmax), breaks=seq(from=0,to=mzmax, by=mzby)) +
    labs(x="Mass (m/z)", y="RT (min)", color="CV") + coord_fixed(ratio = 75) +
    guides(color = guide_colorbar(title.position="top", title.hjust = 0.50)) +
    theme(axis.title=element_text(face="bold", size=9), axis.text=element_text(size=8),
          legend.title=element_text(face="italic", size=9), legend.text=element_text(size=7),
          legend.position="bottom",
          legend.key.width = unit(3, "cm"),
          axis.line = element_blank()
    )
    return(p1)
    
}

In [None]:
p1 <- HeatmapCV(plot.m);
p1

In [None]:
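#keep features with CV <= CVthres
#note: MSmat was subset to the QC rows above, so filter the full matrix to retain all samples
MSmatrixProcIV <- MSmatrixProcII[,CV_filter]; # use MSmatrixProcIII if that was selected above
dim(MSmatrixProcIV)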

In [None]:
#remove CV descriptors
MSfeatIDProcFin <- MSfeatIDProc[CV_filter]; # ID
mzValProcFin <- mzValProc[CV_filter]; # MZ
rtValProcFin <- rtValProc[CV_filter]; # RT
pkWidthProcFin <- pkWidthProc[CV_filter]; # PW


<font style="color:#68829E; font-family:calibri; font-size:29px">
<b> Correlation Coefficients </b>
</font>
<br>
<font style="color:black; font-family:calibri; font-size:15px">
Correlation across columns/features (STOCSY) may assist compound identification by revealing structural (within-compound) and biological (between-compound) collinearity, while correlation across rows/samples may reveal possible clusters. <br>
The reliability of a coefficient depends on several factors (linearity, outliers, groups), and ideally its confidence should be assessed through simulations/permutations.
</font>
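
<font style="color:black; font-family:calibri; font-size:15px">
One way to attach confidence to a single coefficient is a permutation test: shuffle one variable many times and compare the observed coefficient with the resulting null distribution. The sketch below uses simulated vectors; the helper <i>perm_cor_p</i>, its arguments and the simulated data are hypothetical and only illustrate the idea.
</font>

```r
# Simulated intensities for three features across 30 samples (hypothetical)
set.seed(42)
n <- 30
driver    <- rnorm(n)
linked    <- driver + rnorm(n, sd = 0.5)   # built to correlate with the driver
unrelated <- rnorm(n)                      # independent of the driver

# Permutation p-value for a correlation coefficient (two-sided)
perm_cor_p <- function(x, y, n_perm = 999, method = "pearson") {
    obs  <- cor(x, y, method = method)
    # null distribution: correlation after breaking the pairing of x and y
    null <- replicate(n_perm, cor(x, sample(y), method = method))
    # the observed value is counted as one of the permutations
    p <- (sum(abs(null) >= abs(obs)) + 1) / (n_perm + 1)
    c(r = obs, p = p)
}

perm_cor_p(driver, linked)
perm_cor_p(driver, unrelated)
```

<font style="color:black; font-family:calibri; font-size:15px">
The smallest attainable p-value is 1/(n_perm + 1), so the number of permutations limits the resolution. When many features are tested against one driver, the resulting p-values would also need a multiple-testing adjustment (for example with <i>p.adjust</i>).
</font>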

In [None]:
MSmat <- MSmatrixProcII; # select MS data
#MSmat <- MSmatrixProcIII; # select MS data
pName <- MSsamID; # sample ID vector

dim(MSmat)

In [None]:
#rowIndex <- grep("QC",MSsamID); # only QC
rowIndex <- c(1:dim(Metadata)[1]); # sample & QC

MSmat <- MSmat[rowIndex,];
pName <- pName[rowIndex];

In [None]:
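#feature of interest 
driver <- "9.48_121.0290";
driver_index <- grep(driver, MSfeatIDProc, fixed=TRUE); # fixed=TRUE: match '.' literally, not as a regex wildcard

MSfeatIDProc[driver_index] # check that exactly one feature is returned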

In [None]:
CS <- sapply(c(1:dim(MSmat)[2]), function(x){cor(MSmat[,driver_index], MSmat[,x], use="complete.obs", method="pearson")});
#CS <- sapply(c(1:dim(MSmat)[2]), function(x){cor(MSmat[,driver_index], MSmat[,x], use="complete.obs", method="spearman")});

In [None]:
#structure data.frame for ggplot2
plot.m <- data.frame(x=c(1:dim(MSmat)[2]), y=CS, u=mzValProc, v=rtValProc);
rownames(plot.m) <- MSfeatIDProc;
head(plot.m)

In [None]:
cThres <- 0.80; # 'acceptance' threshold (R)
plot.m$c <- rep("High", each=dim(MSmat)[2]);
plot.m$c[which(abs(plot.m$y) < cThres)] <- "Low";

In [None]:
p1 <- ggplot(data=plot.m, aes(x=x, y=y, color=c)) + 
    geom_point(size=1) + ggtitle("") +
    geom_hline(yintercept=(cThres*1), size=0.25, linetype=2) +
    geom_hline(yintercept=(cThres*-1), size=0.25, linetype=2) +
    labs(x="Features (Index)", y="Correlation Coefficient (R)", color="") +
    scale_color_manual(values=c("#FF0000","#CCCCCC")) +
    scale_y_continuous(limits=c(-1,1), breaks=seq(from=-1,to=1,by=0.5)) +
    theme(axis.title=element_text(face="bold", size=9),
          strip.text=element_text(face="italic", size=8),
          legend.position="top",
          axis.text=element_text(size=8))

p1

rownames(plot.m)[which(abs(plot.m$y) >= cThres)]

In [None]:
#run simulations ... ? 

In [None]:
#pairwise correlation-test p-values for all columns of x (x must have column names)
cor.test.p <- function(x){
  FUN <- function(x, y) cor.test(x, y)$p.value
  z <- outer(colnames(x), colnames(x), 
    Vectorize(function(i,j) FUN(x[,i], x[,j]))
  )
  dimnames(z) <- list(colnames(x), colnames(x))
  z
}

In [None]:
sessionInfo()

<font style="color:black; font-family:calibri; font-size:15px">
<b><i>References</i></b> <br>
</font>
<font style="color:black; font-family:calibri; font-size:15px">
Software Carpentry: Our Lessons (<a href="https://software-carpentry.org/lessons/">https://software-carpentry.org/lessons</a>) <br>
Progenesis QI: User Guide (<a href="http://www.nonlinear.com/progenesis/qi/v2.4/user-guide">www.nonlinear.com/progenesis/qi/v2.4/user-guide</a>) <br>
Galaxy Training Material: Metabolomics LC-MS Analysis (<a href="https://github.com/galaxyproject/training-material">https://github.com/galaxyproject/training-material</a>) <br>
National Phenome Centre: nPYc-toolbox-tutorials (<a href="https://github.com/phenomecentre/nPYc-toolbox-tutorials">https://github.com/phenomecentre/nPYc-toolbox-tutorials</a>) <br>
mQACC: Quality Control in Untargeted Metabolomics (<a href="https://epi.grants.cancer.gov/Consortia/mQACC">https://epi.grants.cancer.gov/Consortia/mQACC</a>)
</font>
