# SeSiMe
### Sentence/Sequence Similarity Measure

## Goal: Create a graphml network file and a metadata csv file for Cytoscape
Import MS data and create documents.

### Here: GNPS dataset of 11134 spectra with SMILES

### Import data

In [1]:
# data locations
ROOT = "C:\\OneDrive - Netherlands eScience Center\\Project_Wageningen_iOMEGA"
PATH_MS_DATA = ROOT + "\\Data\\labeled_MS_data\\"
PATH_SAVE_DATA = ROOT + "\\SeSiMe\\data\\"
PATH_SAVE_MODEL = ROOT + "\\SeSiMe\\models_trained\\"
PATH_SESIME = ROOT + "\\SeSiMe\\"

PATH_NPLINKER = ROOT + "\\nplinker\\prototype\\"
#mgf_file = PATH_MS_DATA + "GNPSLibraries_allSMILES.mgf"
mgf_file = PATH_MS_DATA + "GNPSLibraries_uniqueSMILES_withFeatureIDs.mgf"

In [2]:
# import general packages
import sys
sys.path.insert(0, PATH_NPLINKER)
sys.path.insert(0, PATH_SESIME)

import helper_functions as functions
import MS_functions

import numpy as np
from metabolomics import load_spectra

In [3]:
# Import / Load data
results_file = "filtered_data_unique_smiles_minpeak10_loss500_2dec.json"

spectra, spectra_dict, MS_documents, MS_documents_intensity, sub_spectra_metadata = MS_functions.load_MGF_data(
    PATH_SAVE_DATA,
    mgf_file,
    results_file = results_file,
    num_decimals = 2,
    min_frag = 0.0, max_frag = 1000.0,
    min_loss = 5.0, max_loss = 500.0,
    min_intensity_perc = 0.01,
    exp_intensity_filter = 0.01,
    min_peaks = 10,
    peaks_per_mz = 15/200,
    peak_loss_words = ['peak_', 'loss_'],  # ['mz_', 'mz_'],
    sub_spectra = False)

Spectra json file found and loaded.


## Documents

+ Peaks were removed using an exponential fit to the peak intensity distribution. 
+ Words were created using 2 decimals.
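To make the notion of a "document" concrete, the following is a minimal sketch of how a spectrum could be turned into peak and loss words. It is not the code of `MS_functions.load_MGF_data`; the function name `spectrum_to_document` and the rule that a loss is the parent m/z minus the peak m/z are assumptions made for illustration. Only the word prefixes, the two-decimal rounding and the loss range come from the call above.

```python
def spectrum_to_document(peak_mzs, parent_mz, num_decimals=2,
                         min_loss=5.0, max_loss=500.0):
    """Hypothetical helper: turn peak m/z values into 'peak_'/'loss_' words."""
    # One word per peak, rounded to the requested number of decimals.
    peak_words = [f"peak_{mz:.{num_decimals}f}" for mz in sorted(peak_mzs)]

    # Assumption: a loss is the difference between parent m/z and peak m/z.
    losses = sorted(parent_mz - mz for mz in peak_mzs)
    loss_words = [f"loss_{loss:.{num_decimals}f}"
                  for loss in losses
                  if min_loss <= loss <= max_loss]

    return peak_words + loss_words


document = spectrum_to_document([74.731, 330.1], parent_mz=422.2)
print(document)
```

With two decimals, peaks that differ by less than about 0.01 m/z map to the same word, which is what allows words to be shared between spectra.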


In [4]:
# Have a look at what a document looks like:
print(MS_documents[0])

['peak_74.73', 'peak_79.02', 'peak_89.02', 'peak_89.04', 'peak_90.05', 'peak_95.05', 'peak_98.98', 'peak_105.04', 'peak_107.05', 'peak_117.03', 'peak_118.04', 'peak_134.67', 'peak_135.05', 'peak_135.28', 'peak_136.05', 'peak_137.00', 'peak_137.15', 'peak_145.03', 'peak_147.12', 'peak_160.09', 'peak_161.08', 'peak_162.59', 'peak_163.04', 'peak_163.08', 'peak_163.29', 'peak_164.04', 'peak_165.00', 'peak_165.40', 'peak_166.30', 'peak_167.15', 'peak_168.17', 'peak_172.58', 'peak_175.08', 'peak_181.06', 'peak_229.03', 'peak_237.01', 'peak_330.10', 'peak_330.14', 'loss_92.10', 'loss_100.08', 'loss_148.05', 'loss_154.04', 'loss_156.53', 'loss_160.95', 'loss_161.96', 'loss_162.81', 'loss_163.71', 'loss_164.11', 'loss_165.07', 'loss_165.82', 'loss_166.03', 'loss_166.07', 'loss_166.52', 'loss_168.04', 'loss_169.03', 'loss_182.00', 'loss_184.08', 'loss_191.96', 'loss_192.11', 'loss_193.07', 'loss_193.84', 'loss_194.07', 'loss_194.44', 'loss_211.07', 'loss_212.08', 'loss_222.06', 'loss_224.07', 'l

#### Switch to general SeSiMe functionality
Once we have a corpus (e.g. through the cells above), we can use SeSiMe to apply different similarity measuring methods.

In [4]:
from Similarities import SimilarityMeasures

MS_measure = SimilarityMeasures(MS_documents)

Using TensorFlow backend.


In [5]:
MS_measure.preprocess_documents(0.2, min_frequency = 2, create_stopwords = False)
print("Number of unique words: ", len(MS_measure.dictionary))

Preprocess documents...
Number of unique words:  67455


### Note
In total there would be about 100,000 words. Many occur only once in the entire corpus and are therefore filtered out (it makes no sense to place them somewhere in word-space; their position would be arbitrary).

A few are also filtered out because they occur too often (in more than 20% of the spectra). Those words have little discriminative power and are therefore ignored. It might still be worth keeping them in for comparison.
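The two filters described in this note can be sketched in plain Python. This is an illustration only, not the implementation behind `preprocess_documents`; the function name `filter_vocabulary` and the choice to count rare words over the whole corpus and frequent words per document are assumptions.

```python
from collections import Counter


def filter_vocabulary(documents, max_doc_fraction=0.2, min_frequency=2):
    """Hypothetical helper: drop words that are too rare or too common."""
    # Total number of occurrences of each word in the corpus.
    corpus_counts = Counter(word for doc in documents for word in doc)
    # Number of documents in which each word appears at least once.
    doc_counts = Counter(word for doc in documents for word in set(doc))

    max_docs = max_doc_fraction * len(documents)
    keep = {word for word in corpus_counts
            if corpus_counts[word] >= min_frequency
            and doc_counts[word] <= max_docs}

    filtered = [[word for word in doc if word in keep] for doc in documents]
    return filtered, keep


docs = [["a", "b"], ["a", "b"], ["a", "c"], ["d"], ["d"],
        ["e"], ["f"], ["g"], ["h"], ["i"]]
filtered_docs, vocabulary = filter_vocabulary(docs)
print(sorted(vocabulary))
```

In this toy corpus, `a` is dropped because it appears in 3 of 10 documents, and every word that occurs only once is dropped as well.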

## Word2Vec-based approach
### Compare different training parameters

+ Create Word2Vec-based document centroid vectors from models trained with different window sizes.

In [6]:
file_model_word2vec = PATH_SAVE_MODEL + 'model_w2v_MS_gnps_uniquesmiles_d300_w300_iter100_loss500_minpeak10_dec2.model'
MS_measure.build_model_word2vec(file_model_word2vec, size=300, window=300, 
                             min_count=1, workers=4, iter=100, 
                             use_stored_model=True)

Load stored word2vec model ...


In [7]:
# Use peak intensities as extra weights
MS_measure.get_vectors_centroid(extra_weights = MS_documents_intensity, tfidf_weighted=True)
MS_measure.get_centroid_similarity(num_hits=25, method='cosine')

All 'words' of the given documents were found in the trained word2vec model.
Calculated centroid vectors for  9550  of  9550  documents.
Calculated centroid vectors for  8610  of  9550  documents.
Calculated distances between  9550  documents.


This has calculated (cosine) distances between all spectra in an all-vs-all fashion.
The "num_hits" closest candidates for each spectrum are listed in two matrices.

One stores the distances, the other the respective IDs.
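As an illustration of what the two matrices contain, here is a small NumPy sketch of weighted centroids and an all-vs-all cosine distance search. It is not the SeSiMe implementation: the function names, the plain dictionary of word vectors and the weighting scheme are assumptions, and the TF-IDF weighting used above is left out.

```python
import numpy as np


def weighted_centroids(documents, weights, word_vectors):
    """Hypothetical helper: weighted mean of the word vectors of each document."""
    centroids = []
    for words, doc_weights in zip(documents, weights):
        vectors = np.array([word_vectors[word] for word in words], dtype=float)
        centroids.append(np.average(vectors, axis=0, weights=doc_weights))
    return np.array(centroids)


def closest_by_cosine(centroids, num_hits=25):
    """Return (ids, distances) of the num_hits closest documents per row."""
    # Normalise to unit length; assumes no centroid is the zero vector.
    unit = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    distances = 1.0 - unit @ unit.T  # cosine distance, all-vs-all
    num_hits = min(num_hits, len(centroids))
    ids = np.argsort(distances, axis=1)[:, :num_hits]
    return ids, np.take_along_axis(distances, ids, axis=1)


word_vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
documents = [["a"], ["a", "c"], ["b"]]
weights = [[1.0], [1.0, 1.0], [1.0]]

centroids = weighted_centroids(documents, weights, word_vectors)
ids, distances = closest_by_cosine(centroids, num_hits=2)
print(ids)
```

Each row of `ids` starts with the document itself, because its distance to itself is (numerically close to) zero. A full all-vs-all matrix grows quadratically with the number of spectra, which is one reason to keep only the `num_hits` best candidates.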

## Create graphml file and get metadata from ClassyFire csv file

In [8]:
MS_functions.MS_similarity_network(MS_measure, 
                          similarity_method="centroid", 
                          link_method = "single", 
                          filename="MS_word2vec_01.graphml", 
                          cutoff = 0.7,
                          max_links = 10)

Network stored as graphml file under:  MS_word2vec_01.graphml


In [18]:
# Compare to this one:
MS_functions.MS_similarity_network(MS_measure, 
                                   similarity_method="centroid", 
                                   link_method = "mutual", 
                                   filename="MS_word2vec_03.graphml", 
                                   cutoff=0.5,
                                   max_links = 20)

Network stored as graphml file under:  MS_word2vec_02.graphml
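The two calls differ in `link_method`, `cutoff` and `max_links`. The sketch below shows one plausible reading of these parameters: a link is kept if its similarity reaches the cutoff, each node proposes at most `max_links` links, and "mutual" additionally requires both nodes to propose each other. This interpretation and the function name `select_links` are assumptions, not taken from `MS_functions.MS_similarity_network`.

```python
def select_links(ids, similarities, cutoff=0.7, max_links=10,
                 link_method="single"):
    """Hypothetical helper: pick network edges from per-node candidate lists.

    ids[i] and similarities[i] list the candidates of node i,
    sorted by decreasing similarity.
    """
    # Candidates each node proposes: above the cutoff, at most max_links.
    proposed = {}
    for i, (row_ids, row_sims) in enumerate(zip(ids, similarities)):
        picked = [(j, sim) for j, sim in zip(row_ids, row_sims)
                  if j != i and sim >= cutoff]
        proposed[i] = dict(picked[:max_links])

    edges = {}
    for i, links in proposed.items():
        for j, sim in links.items():
            # "mutual": keep the edge only if j also proposes i.
            if link_method == "mutual" and i not in proposed.get(j, {}):
                continue
            edges[(min(i, j), max(i, j))] = sim
    return edges


ids = [[0, 1, 2], [1, 0, 2], [2, 1, 0]]
similarities = [[1.0, 0.9, 0.2], [1.0, 0.9, 0.8], [1.0, 0.8, 0.2]]

print(select_links(ids, similarities, max_links=1, link_method="single"))
print(select_links(ids, similarities, max_links=1, link_method="mutual"))
```

Under this reading, "mutual" yields a sparser network than "single" for the same cutoff, since one-sided proposals are discarded.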


In [9]:
csvfile = PATH_MS_DATA + "ClassyFire_InputforCytoscape_GNPSLibraries.csv"     

import pandas as pd
mol_classes = pd.read_csv(csvfile, delimiter='\t')  

In [16]:
list_mol_superclasses = []
list_mol_classes = []
list_mol_subclasses = []

for spectrum in spectra:
    # Plain substring match (no regex); rows without an inchikey never match
    subtable = mol_classes[mol_classes['inchikey'].str.contains(spectrum.metadata['inchikey'],
                                                                regex=False, na=False)]
    
    if subtable.shape[0] > 0:  # i.e. if match was found
        list_mol_superclasses.append(subtable['superclass'].values[0])
        list_mol_classes.append(subtable['class'].values[0])
        list_mol_subclasses.append(subtable['subclass'].values[0])
    else:
        list_mol_superclasses.append('None')
        list_mol_classes.append('None')
        list_mol_subclasses.append('None')
    
# Replace NaN entries with 'None' (pd.isna is more robust than an identity check against np.nan)
list_mol_superclasses = ['None' if pd.isna(x) else x for x in list_mol_superclasses]
list_mol_classes = ['None' if pd.isna(x) else x for x in list_mol_classes]
list_mol_subclasses = ['None' if pd.isna(x) else x for x in list_mol_subclasses]

In [17]:
# Export csv for cytoscape
import csv

filename = PATH_SESIME + "spectra_molclasses.csv"

# Pass every value as its own field and let csv.writer insert the ';' delimiter.
# Joining a row into one string makes the writer quote the whole line whenever
# it contains a comma or a quote character, so such lines start with ".
# newline='' avoids extra blank lines on Windows.
with open(filename, 'w', newline='') as csvFile:
    writer = csv.writer(csvFile, delimiter=';')
    writer.writerow(["node", "spectrum_ID", "mol_superclass", "mol_class",
                     "mol_subclass", "smiles", "parent_mz"])
    for i, mol_class in enumerate(list_mol_classes):
        writer.writerow([i,
                         int(sub_spectra_metadata.iloc[i, 1]),
                         list_mol_superclasses[i],
                         mol_class,
                         list_mol_subclasses[i],
                         spectra[i].smiles,
                         int(sub_spectra_metadata.iloc[i, 3])])
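The quoting behaviour of `csv.writer` can be reproduced in isolation with an in-memory buffer. The sketch below compares a row written as one pre-joined string with the same row written as separate fields; the helper `write_rows` and the example values are made up for illustration.

```python
import csv
import io


def write_rows(rows, **writer_kwargs):
    """Hypothetical helper: write rows with csv.writer and return the text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, **writer_kwargs)
    writer.writerows(rows)
    return buffer.getvalue()


# One pre-joined string per row: the default delimiter is ',', so a comma
# inside the string forces the writer to quote the entire line.
joined = write_rows([["0;123;Organic 1,3-dipolar compounds;"]])

# Separate fields with ';' as delimiter: no quoting is needed here.
fields = write_rows([[0, 123, "Organic 1,3-dipolar compounds"]], delimiter=";")

print(repr(joined))
print(repr(fields))
```

Only values that themselves contain the delimiter, a quote character or a line break are quoted when fields are passed separately, which is standard CSV behaviour that importers are expected to handle.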