**Selecting columns from cluster summary files, analysing them, and storing the results as new files**<br>
Authors: Abzer Kelminal (abzer.shah@uni-tuebingen.de)<br>
Edited by:  <br>
Input file format: .clustersummary files from the Bucket table obtained via GNPS<br>
Outputs: Combined .csv files<br>
Dependencies: library(dplyr)

In [None]:
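getwd()   # Shows the current working directory
if (!requireNamespace("dplyr", quietly = TRUE)) install.packages("dplyr")   # Install dplyr only if it is missing
library(dplyr)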

**STEP 1 : Setting the working directory:**
 Copy the path of the folder that contains your files and paste it into setwd().<br> 
 For example: <font color = blue>(C:\Users\Nutzer\Desktop\Test_Data)</font>. Make sure to change every \ symbol to / when pasting the path into setwd().

In [None]:
setwd("C:/Users/Nutzer/Desktop/Test_Data")
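
The backslash replacement described in Step 1 can also be done in R itself. The following is a minimal sketch, assuming the example path from above; `win_path` and `unix_path` are names made up for this illustration.

```r
# Hypothetical Windows-style path; replace it with your own folder.
# Inside an R string, "\\" stands for one literal backslash.
win_path <- "C:\\Users\\Nutzer\\Desktop\\Test_Data"

# Replace every backslash with a forward slash.
# fixed = TRUE treats the pattern as plain text, not as a regular expression.
unix_path <- gsub("\\", "/", win_path, fixed = TRUE)
print(unix_path)

# setwd(unix_path)   # uncomment once the folder exists on your machine
```

The `setwd()` call is left commented out so that the sketch runs even if the folder does not exist.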

**STEP 2 :** First, we get the paths of our sample files and of the folder that contains them. 

In [None]:
pattern=".clustersummary"             # File type to look for. Note that Step 4 only reads .clustersummary files
dirs <- dir(path=getwd(), pattern=pattern, full.names=TRUE, recursive=TRUE)       # Gets the complete path of each matching file and stores it in 'dirs'
folders <- unique(dirname(dirs))      # Gets the path of the folder(s) holding the sample files and stores it in 'folders'
files <- list.files(folders, pattern=pattern, full.names=TRUE)  # Lists the files in 'folders' and stores them in 'files'
files_1 <- basename(files)  # File name of each file
files_2 <- dirname(files)   # Folder path of each file
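
To see what these calls return without touching real data, the lookup can be tried on a throwaway folder. This is only a sketch: the folder, the file names, and the `demo_` variables are hypothetical. It also uses a stricter pattern, because `pattern` is interpreted as a regular expression in which an unescaped dot matches any character. If you adopt the stricter pattern in the notebook itself, remember that Step 4 compares `pattern` against the literal text ".clustersummary".

```r
# Build a temporary folder with two hypothetical sample files and one unrelated file.
demo_dir <- file.path(tempdir(), "Test_Data_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("sampleA.clustersummary",
                                  "sampleB.clustersummary",
                                  "notes.txt")))

# Escape the dot and anchor the match at the end of the file name.
demo_pattern <- "\\.clustersummary$"
demo_files <- list.files(demo_dir, pattern = demo_pattern, full.names = TRUE)

print(basename(demo_files))          # file names only
print(unique(dirname(demo_files)))   # the folder that holds them
```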

**STEP 3 :** Creating a Result folder to store all the result files

In [None]:
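# Remove the ".clustersummary" extension from every file name at once
# (fixed = TRUE matches the text literally instead of as a regular expression)
files_1 <- sub(".clustersummary", "", files_1, fixed = TRUE)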

fName <- paste(files_2[[1]], "_Results", sep="")   # Results folder next to the input folder, e.g. "Test_Data_Results"
dir.create(path=fName, showWarnings = TRUE)
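
Because "_Results" is appended to the folder path, the result folder is created next to the input folder rather than inside it. A small sketch of this naming rule follows, assuming a hypothetical input folder under `tempdir()` so that nothing is written to your data directory; `input_dir` and `results_dir` are illustrative names.

```r
input_dir   <- file.path(tempdir(), "Test_Data")         # hypothetical input folder
results_dir <- paste(input_dir, "_Results", sep = "")    # same naming rule as fName

# Create the folder only if it does not exist yet, which avoids the warning on re-runs.
if (!dir.exists(results_dir)) dir.create(results_dir, recursive = TRUE)

print(basename(results_dir))                         # "Test_Data_Results"
print(dirname(results_dir) == dirname(input_dir))    # TRUE: a sibling of the input folder

# Path of one output file, built the same way as in Step 4.
print(paste(results_dir, "/NewFiles_", "sampleA", ".csv", sep = ""))
```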

**STEP 4 :** Reading the files and selecting the columns we need and storing them as new files in Result folder.
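
The selection logic can be tried on a small table before running it on real files. The sketch below is a base-R analogue of the dplyr calls used in this step, not the notebook's own code: the helper `pick()`, the table `toy`, and all values in it are hypothetical, and the column names only mimic what a cluster summary might contain after `read.csv` has replaced spaces with dots.

```r
# Toy table with made-up values.
toy <- data.frame(
  cluster.index           = c(1, 2, 3),
  precursor.mass          = c(301.14, 455.29, 609.27),
  componentindex          = c(1, -1, 1),
  LibraryID               = c("N/A", "Compound X", "N/A"),
  GNPSGROUP.G1            = c(2, 0, 5),
  GNPSGROUP.G2            = c(0, 4, 1),
  sum.precursor.intensity = c(1e5, 2e5, 3e5),
  stringsAsFactors = FALSE
)

# Keep the columns whose name starts with 'prefix' AND ends with 'suffix'.
# Names are lower-cased first, so the match ignores case.
pick <- function(df, prefix, suffix) {
  nm <- tolower(names(df))
  df[, startsWith(nm, tolower(prefix)) & endsWith(nm, tolower(suffix)), drop = FALSE]
}

LibraryID <- pick(toy, "Library", "ID")

# 0 = no library match ("N/A"), 1 = library match
BinID <- data.frame(Binary_Library_ID = ifelse(LibraryID[[1]] == "N/A", 0, 1))

spec <- cbind(pick(toy, "cluster", "index"),
              pick(toy, "precursor", "mass"),
              pick(toy, "component", "index"),
              LibraryID,
              BinID,
              toy[, grepl("GNPSGROUP", names(toy), fixed = TRUE), drop = FALSE])
print(spec)
```

The column `sum.precursor.intensity` is dropped because its name does not start with "precursor", which shows why both conditions are combined with `&`.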

In [None]:

temp <- list()  # Creating empty lists
final <- list()


for (j in seq_along(files))   # seq_along() skips the loop cleanly when no files were found
{
  
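  if(pattern == ".clustersummary") 
  { 
    temp[[j]] <- read.csv(file=files[j], header=TRUE, sep="\t") # Reading the tab-separated input file and storing it in the temp list
  } else {
    stop("Only .clustersummary files are read here; add a reader for other file types.")
  }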
  
  # Selecting the columns individually by the start and end of their names
  clusterID <- temp[[j]] %>% select(starts_with("cluster") & ends_with("index"))
  PrecursorMass <- temp[[j]] %>% select(starts_with("precursor") & ends_with("mass"))
  ComponentIndex <- temp[[j]] %>% select(starts_with("component") & ends_with("index"))
  LibraryID <- temp[[j]] %>% select(starts_with("Library") & ends_with("ID"))
  GNPS <- temp[[j]] %>% select(contains("GNPSGROUP"))   # All columns whose name contains GNPSGROUP (two or more, depending on the file)
  
  # Here, we perform the calculations and store the results in separate variables
  BinID <- ifelse(LibraryID == 'N/A',0,1)    # 0 = no library match ('N/A'), 1 = library match
  colnames(BinID) <- "Binary_Library_ID"
 
  spec<- cbind(clusterID,PrecursorMass,ComponentIndex,LibraryID,BinID,GNPS) #combining all the columns
  
  final[[j]] <- spec  
  write.csv(final[[j]], file=paste(fName, "/NewFiles_", files_1[[j]], ".csv", sep=""), row.names = FALSE) # write.csv always writes comma-separated files and ignores 'sep'. Results are stored in the Results folder with the prefix "NewFiles_"
  
}
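
After the loop has finished, it can be worth reading one result file back to confirm that it was written as expected. The following is a minimal round-trip sketch on a hypothetical table written to `tempdir()`; `demo`, `out_file`, and `check` are illustrative names, and the values are made up.

```r
# Hypothetical result table standing in for one element of 'final'.
demo <- data.frame(cluster.index     = c(1, 2),
                   LibraryID         = c("N/A", "Compound X"),
                   Binary_Library_ID = c(0, 1),
                   stringsAsFactors  = FALSE)

out_file <- file.path(tempdir(), "NewFiles_demo.csv")
write.csv(demo, file = out_file, row.names = FALSE)   # comma-separated output

# Read the file back. "N/A" stays a text value because read.csv only
# treats the string "NA" as missing by default.
check <- read.csv(out_file, stringsAsFactors = FALSE)

print(readLines(out_file, n = 1))      # header line of the written file
print(table(check$Binary_Library_ID))  # number of clusters without / with a library match
```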