## ChemProp2
Authors: Abzer Kelminal (abzer.shah@uni-tuebingen.de) <br>
Edited by: Daniel Petras (daniel.petras@uni-tuebingen.de) <br>
Input file format: .txt files or .csv files <br>
Outputs: .csv files  <br>
Dependencies: library(ggplot2), library(dplyr)

### About Input files:

- **Feature_file** is the feature table obtained by running the Feature-Based Molecular Networking (FBMN) workflow on the data with MZmine.
- **Nw_edge file** is an output of GNPS. Its columns 'Feature_ID_1' and 'Feature_ID_2' hold the IDs of feature pairs that are similar (not identical).
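
To make the expected layout concrete, here is a minimal sketch of the three inputs as small data frames. All names prefixed with `toy_`, the sample names `S1` to `S3`, the `Time` attribute and every value are made up for illustration; real files will have more rows and columns.

```r
# Hypothetical toy inputs; IDs, sample names and values are invented for illustration.

# Feature table: row names are feature IDs, the first two columns are
# m/z and retention time, the remaining columns are samples.
toy_features <- data.frame(
  row.m.z            = c(301.14, 317.13, 285.15),
  row.retention.time = c(5.2, 4.8, 6.1),
  S1 = c(1000, 50, 300),
  S2 = c(600, 400, 310),
  S3 = c(200, 900, 290),
  row.names = c("1", "2", "3")
)

# Edge table: each row is one pair of similar features.
toy_edges <- data.frame(
  Feature_ID_1 = c(1, 1),
  Feature_ID_2 = c(2, 3)
)

# Row-wise metadata: one row per attribute, one column per sample.
toy_meta <- data.frame(S1 = 0, S2 = 1, S3 = 2, row.names = "Time")

# Sanity checks: every edge ID is a feature ID, and every metadata
# sample has a matching column in the feature table.
ids_ok     <- all(c(toy_edges$Feature_ID_1, toy_edges$Feature_ID_2) %in% rownames(toy_features))
samples_ok <- all(colnames(toy_meta) %in% colnames(toy_features))
```

Running the same two checks on the real `Feature_file`, `Nw_edge` and `Meta_Data` before scoring can catch mismatched IDs or sample names early.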

In [None]:
# setting the current directory as the working directory
setwd('Downloads/ChemProp2_Test') #Example

In [None]:
# install the packages only if they are not already present
for (pkg in c('ggplot2', 'dplyr')) {
  if (!requireNamespace(pkg, quietly = TRUE)) install.packages(pkg)
}

In [None]:
library(ggplot2)
library(dplyr)

Feature_file <- read.table("feature_table_ChemProp2.txt", sep="\t", header=TRUE, row.names = 1) # By applying 'row.names = 1', the 1st column 'ID' becomes the row names
Meta_File <-read.table("metadata_ChemProp2.txt", sep="\t",header=TRUE, row.names = 1)
Nw_edge <-read.table("Network_Edges_ChemProp2.txt", sep="\t", header = TRUE)

# In case of .csv files, use 'read.csv' instead of 'read.table'

In [None]:
# head() returns the first rows (6 by default) of each file. This gives an idea of the content of the files.
head(Feature_file)
head(Meta_File)
head(Nw_edge)

In [None]:
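# If the Meta_File information is given column-wise (samples as rows, metadata attributes as columns)
# The scoring loop matches colnames(Meta_Data) against the sample columns of Feature_file, so the selection is transposed to put the samples in columns
Meta_Data <- Meta_File %>% select(contains(readline('Enter the MetaData Name:')))
Meta_Data <- as.data.frame(t(Meta_Data))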

In [None]:
# If the Meta_File information is given row-wise (metadata attributes as rows, samples as columns)
Meta_Data <- Meta_File[(readline('Enter the MetaData Name:')),]

### Chemical Proportionality score:

- The code below adds a **Chemical Proportionality (ChemProp) score** column to the Nw_edge file, together with columns holding the absolute value and the sign of each ChemProp score.
- Besides the ChemProp score based on Pearson correlation (which is suited to linear relationships), the code also computes scores using Spearman correlation and Pearson correlation on natural-log-transformed and square-root-transformed data, to support non-linear data.
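
As a worked illustration of the four scores for a single feature pair, the sketch below applies the same formula to a small made-up table. The data frame `toy`, its values and the helper `chemprop()` are hypothetical and only mirror the arithmetic described above.

```r
# Hypothetical toy data: a metadata variable and one pair of connected features.
toy <- data.frame(
  Time      = c(0, 1, 2, 3, 4),
  Feature_1 = c(100, 80, 60, 40, 20),  # decreases linearly with Time
  Feature_2 = c(10, 30, 50, 70, 90)    # increases linearly with Time
)

# Hypothetical helper: (cor(metadata, feature 2) - cor(metadata, feature 1)) / 2.
# Column 1 must be the metadata, columns 2 and 3 the two features.
chemprop <- function(d, method = "pearson") {
  r <- cor(d, method = method)
  (r[1, 3] - r[1, 2]) / 2
}

score_pearson  <- chemprop(toy)
score_spearman <- chemprop(toy, method = "spearman")
score_log      <- chemprop(cbind(toy[1], log(toy[2:3] + 1)))  # natural log of (x + 1)
score_sqrt     <- chemprop(cbind(toy[1], sqrt(toy[2:3])))     # square root
```

Because each correlation lies between -1 and 1, every score also lies between -1 and 1. In this toy case Feature_1 falls and Feature_2 rises perfectly linearly, so the Pearson and Spearman scores both equal 1, while the log and square-root scores are positive but smaller than 1 because the transformations bend the linear trend.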

### <font color = red> To Note: </font>
- A feature table exported from MZmine 3 may contain several columns in addition to **Row m/z** and **Row Retention Time**. Adjust the line `x<-x[,c(-1:-2)]` in the code below so that all of these columns are excluded.
- Similarly, the first two columns of the Nw_edge file may be named differently, for example ID1, ID2 or Cluster_ID1, Cluster_ID2. In that case, change `Nw_edge$Feature_ID_1[i]` and `Nw_edge$Feature_ID_2[i]` in the code below to the actual column names, for example `x<- subset(Feature_file, rownames(Feature_file) == Nw_edge$ID1[i])`.
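
One possible alternative to editing these lines by hand is to rename the edge columns once and to select the sample columns by name instead of by position. This is only a sketch under assumptions: the `toy_` objects, the extra `charge` column and the sample names are invented, and it assumes the first two columns of the edge table are the two feature IDs.

```r
# Hypothetical toy inputs: a feature table with an extra non-sample column
# and an edge table whose ID columns have different names.
toy_features <- data.frame(
  row.m.z            = c(301.14, 317.13),
  row.retention.time = c(5.2, 4.8),
  charge             = c(1, 1),          # extra column, as MZmine 3 may export
  S1 = c(1000, 50),
  S2 = c(600, 400),
  S3 = c(200, 900),
  row.names = c("1", "2")
)
toy_meta  <- data.frame(S1 = 0, S2 = 1, S3 = 2, row.names = "Time")
toy_edges <- data.frame(ID1 = 1, ID2 = 2)

# Rename the first two edge columns to the names used in the scoring loop.
colnames(toy_edges)[1:2] <- c("Feature_ID_1", "Feature_ID_2")

# Keep only the columns that are samples listed in the metadata,
# regardless of how many extra columns precede them.
sample_cols <- intersect(colnames(toy_meta), colnames(toy_features))
x <- toy_features[, sample_cols, drop = FALSE]
```

Selecting by name has the side effect of ordering the columns as in the metadata, which is what the `match()` step in the loop otherwise does.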

In [None]:
ChemProp2 <- c()
ChemProp_spearman <-c()
ChemProp_log <- c()
ChemProp_sqrt <- c()

for (i in 1:NROW(Nw_edge)) {
  
  x<- subset(Feature_file, rownames(Feature_file) == Nw_edge$Feature_ID_1[i]) # rownames(Feature_file) holds the feature IDs. This line takes 'Feature ID 1' from the first column of Nw_edge, i.e., Feature_ID_1, then picks the corresponding row of Feature_file
  x<- rbind(x,subset(Feature_file, rownames(Feature_file) == Nw_edge$Feature_ID_2[i]))
  # x is the subset data which has the Feature ID 1 and 2 which are similar according to Nw_edge file.
  x<-x[,c(-1:-2)] # Removing the first two columns --> Row m/z and Row Retention Time information
  A<-colnames(x) 
  B<-colnames(Meta_Data)
  stopifnot(all(B %in% A)) # Checking that every column name of the meta data is present in the subset data x; stops with a clear error otherwise
  reorder_id<-match(B,A) #Match gives the position in which B (the column names of Meta data) is present in A (subset data) and store the position info in reorder_id 
  reordered_x <- x[reorder_id] #Rearranging x (subset data) with respect to the new positions
  reordered_x <- rbind(Meta_Data[1,],reordered_x) # With positions of both x and meta data being the same now, it can be combined
  
  reordered_x <-data.frame(t(reordered_x))  # Transposing the data, thus it contains 3 columns, 'Metadata info. For ex: Time', 'Feature ID 1', 'Feature ID 2'
  
  corr_result<-cor(reordered_x, method = "pearson") # Performing Pearson correlation
  ChemProp_score <- (corr_result[1,3] - corr_result[1,2]) / 2 # ChemProp2 score is obtained by: (Pearson(metadata, Feature ID 2) - Pearson(metadata, Feature ID 1)) / 2
  
  corr_2 <- cor(reordered_x, method = "spearman") # Performing Spearman correlation
  Score_spearman <- (corr_2[1,3] - corr_2[1,2]) / 2
  
  log_reorderedX <- cbind(reordered_x[,1],log(reordered_x[,2:3]+1)) # Performing natural log transformations on Feature IDs 1 and 2
  corr_3 <- cor(log_reorderedX) # performing (pearson) correlation on the log transformed data
  Score_log <-(corr_3[1,3] - corr_3[1,2]) / 2
  
  sqrt_reorderedX <- cbind(reordered_x[,1],sqrt(reordered_x[,2:3])) # Taking square roots of Feature IDs 1 and 2
  corr_4 <- cor(sqrt_reorderedX) # performing (pearson) correlation on the square roots
  Score_sqrt <- (corr_4[1,3] - corr_4[1,2])/2
  
  ChemProp2 <- rbind(ChemProp2, ChemProp_score, deparse.level = 0) # deparse.level = 0 constructs no labels; if not given, the resultant matrix has row names (for all rows) created from the input arguments such as 'ChemProp_score' here.
  ChemProp_spearman <- rbind(ChemProp_spearman,Score_spearman,  deparse.level = 0)
  ChemProp_log <- rbind(ChemProp_log,Score_log,  deparse.level = 0)
  ChemProp_sqrt <- rbind(ChemProp_sqrt, Score_sqrt, deparse.level = 0)
}

Nw_edge_new <- cbind (Nw_edge, ChemProp2,ChemProp_spearman,ChemProp_log,ChemProp_sqrt )
rownames(Nw_edge_new) <- NULL
Nw_edge_new <- Nw_edge_new[order(Nw_edge_new$ChemProp2, decreasing = TRUE), ] # Rearranging Nw_edge_new in the decreasing order of ChemProp2 score
Abs_ChemProp2 <- abs(Nw_edge_new$ChemProp2)
Sign_ChemProp2 <- sign(Nw_edge_new$ChemProp2)

ChemProp2_file <- cbind(Nw_edge_new,Abs_ChemProp2,Sign_ChemProp2)
write.csv(ChemProp2_file, 'With_ChemProp2_score.csv')

### Visualizing the distribution of ChemProp2 score of the sample data:

In [None]:
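# qplot() is deprecated since ggplot2 3.4.0, so ggplot() is used; the y-axis of a histogram shows counts, not densities
ggplot(ChemProp2_file, aes(x = ChemProp2)) +
  geom_histogram(binwidth = 0.05, fill = "#56B4E9") +
  labs(title = "Histogram for distribution of ChemProp2 score in sample data",
       x = 'ChemProp2 Score',
       y = 'Count')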