## Part 1: Imports and first pre-processing
Set the paths, import the required packages, load the data, and run the first pre-processing steps.

In [1]:
ROOT = "C:\\OneDrive - Netherlands eScience Center\\Project_Wageningen_iOMEGA"
PATH_MS_DATA = ROOT + "\\SeSiMe\\data\\SPECTRA_grimur\\"
PATH_SAVE_MODEL = ROOT + "\\SeSiMe\\models_trained\\"
PATH_SAVE_DATA = ROOT + "\\SeSiMe\\data\\"
PATH_SESIME = ROOT + "\\SeSiMe\\"

PATH_NPLINKER = ROOT + "\\nplinker\\prototype\\"
mgf_file = PATH_MS_DATA + "GNPSLibraries_uniqueSMILES_withFeatureIDs.mgf"

In [2]:
# Import general packages
import sys
sys.path.insert(0, PATH_NPLINKER)
sys.path.insert(0, PATH_SESIME)

import helper_functions as functions
import MS_functions

import numpy as np
from metabolomics import load_spectra

In [3]:
# Import / Load data
results_file = "filtered_data_Grimur_minpeak10_loss500_2dec_10ppm.json"

spectra, spectra_dict, MS_documents, MS_documents_intensity, spectra_metadata = MS_functions.load_MS_data(PATH_MS_DATA, PATH_SAVE_DATA,
                 filefilter="*.*", 
                 results_file = results_file,
                 num_decimals = 2,
                 min_frag = 0.0, max_frag = 1000.0,
                 min_loss = 5.0, max_loss = 500.0,
                 min_intensity_perc = 0.0,
                 exp_intensity_filter = 0.01,
                 min_peaks = 10,
                 peaks_per_mz = 15/200,
                 merge_energies = True,
                 merge_ppm = 10,
                 replace = 'max',
                 peak_loss_words = ['peak_', 'loss_'])   

Spectra json file found and loaded.
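
To make the `peak_`/`loss_` vocabulary more concrete, here is a minimal sketch of how a single spectrum could be turned into a document of "words". It is not the `MS_functions.load_MS_data` implementation: the helper `spectrum_to_document` is hypothetical, and it assumes that words are built by rounding m/z values to `num_decimals` and that losses are taken relative to the precursor m/z and kept only between `min_loss` and `max_loss`.

```python
def spectrum_to_document(peaks, precursor_mz, num_decimals=2,
                         min_loss=5.0, max_loss=500.0):
    """Hypothetical helper: turn (m/z, intensity) peaks into 'words'.

    Assumption: each peak becomes 'peak_<mz>' and each neutral loss
    (precursor m/z minus peak m/z) becomes 'loss_<mz>' if it lies
    within [min_loss, max_loss].
    """
    words = [f"peak_{mz:.{num_decimals}f}" for mz, _intensity in peaks]
    for mz, _intensity in peaks:
        loss = precursor_mz - mz
        if min_loss <= loss <= max_loss:
            words.append(f"loss_{loss:.{num_decimals}f}")
    return words


# Toy spectrum: three peaks, precursor at m/z 300
peaks = [(100.051, 0.4), (250.102, 1.0), (298.0, 0.2)]
document = spectrum_to_document(peaks, precursor_mz=300.0)
print(document)
```

The peak at m/z 298.0 produces no loss word because its loss of 2.0 falls below `min_loss`.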


#### Switch to general SeSiMe functionality
Once we have a corpus (e.g. from the cells above), we can use SeSiMe to apply different similarity measures.

In [4]:
from Similarities import SimilarityMeasures
MS_measure = SimilarityMeasures(MS_documents)
MS_measure.preprocess_documents(0.2, min_frequency = 2, create_stopwords = False)
print("Number of unique words: ", len(MS_measure.dictionary))

Using TensorFlow backend.


Preprocess documents...
Number of unique words:  35775


## Part 2: Train spec2vec/word2vec model and spectrum vectors
### 1) Train on dataset itself

In [11]:
#file_model_word2vec = PATH_SAVE_MODEL + 'model_w2v_MS_gnps_uniquesmiles_d300_w300_iter100_loss500_minpeak10_dec2.model'
#file_model_word2vec = PATH_SAVE_MODEL + 'model_w2v_MS_allgnps_d300_w300_iter100_loss500_minpeak10_dec2.model'
file_model_word2vec = PATH_SAVE_MODEL + 'model_w2v_MS_Grimur_d300_w300_iter100_loss500_minpeak10_dec2.model'
MS_measure.build_model_word2vec(file_model_word2vec, size=300, window=300, 
                             min_count=1, workers=4, iter=100, 
                             use_stored_model=True)

Stored word2vec model not found!
Calculating new word2vec model...
 Epoch  100  of  100 .

In [12]:
# Use peak intensities as extra weights
MS_measure.get_vectors_centroid(method = 'update', #'ignore',
                             extra_weights = MS_documents_intensity, 
                             tfidf_weighted = True, 
                             weight_method = None, #'sqrt', 
                             tfidf_model = None,
                             extra_epochs = 5)

All 'words' of the given documents were found in the trained word2vec model.
Using present tfidf model.
  Calculated centroid vectors for  4138  of  4138  documents.
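
The idea behind a weighted centroid can be illustrated with a few lines of NumPy. This is only a sketch under assumptions, not the `get_vectors_centroid` code: the function `weighted_centroid` is hypothetical, and it assumes that the tf-idf weight and the peak intensity of each word are multiplied and that the weighted sum is normalized by the total weight.

```python
import numpy as np


def weighted_centroid(word_vectors, weights):
    """Hypothetical helper: weighted mean of the word vectors of one document."""
    word_vectors = np.asarray(word_vectors, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Scale each word vector by its weight, sum, and normalize by total weight
    return (weights[:, None] * word_vectors).sum(axis=0) / weights.sum()


# Toy document with three words embedded in two dimensions
word_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tfidf = np.array([0.5, 1.0, 2.0])        # assumed tf-idf weight per word
intensity = np.array([1.0, 0.5, 0.25])   # assumed relative peak intensity per word

centroid = weighted_centroid(word_vectors, tfidf * intensity)
print(centroid)
```

In this toy case all combined weights are equal, so the result is the plain mean of the three word vectors.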

## Part 3: Calculate/load the different score matrices
### all-vs-all matrix of spectrum-spectrum similarity scores
+ Word2Vec-centroid similarity scores
+ Cosine similarity scores
+ Modified cosine scores (MolNet)
+ Molecular similarity scores based on molecular fingerprints. Unless stated otherwise: Dice score based on Morgan-3 fingerprints.

### Calculate all-vs-all matrix for Word2Vec scores 

In [13]:
from scipy import spatial
M_sim_ctr = 1 - spatial.distance.cdist(MS_measure.vectors_centroid, MS_measure.vectors_centroid, 'cosine')
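
The expression `1 - cdist(..., 'cosine')` yields the cosine similarity between every pair of rows. As a minimal sketch of what that means, the same matrix can be written with plain NumPy by normalizing each vector to unit length and taking all dot products. The function name `cosine_similarity_matrix` is made up for this illustration.

```python
import numpy as np


def cosine_similarity_matrix(vectors):
    """All-vs-all cosine similarity between the rows of `vectors`."""
    vectors = np.asarray(vectors, dtype=float)
    # Normalize each row to unit length (rows of all zeros would give NaN)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / norms
    # Dot products of unit vectors are cosine similarities
    return unit @ unit.T


vectors = np.array([[1.0, 0.0],
                    [1.0, 1.0],
                    [0.0, 2.0]])
sim = cosine_similarity_matrix(vectors)
print(np.round(sim, 3))
```

The result is symmetric with ones on the diagonal, and orthogonal vectors (rows 0 and 2 here) score 0.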

### Calculate/load modified cosine score (here using "fast" way)
+ Be aware: calculating these scores is **very slow**!
+ The function below loads the given file and only calculates the scores from scratch if no such file is found.
+ Method choices are 'fast' or 'hungarian', the latter being more exact but even slower.

In [None]:
filename = PATH_SAVE_DATA + "MolNet_Grimur_tol02_minmatch6.npy"
molnet_sim = MS_functions.molnet_matrix(spectra, 
                                          tol = 0.2, 
                                          max_mz = 1000, 
                                          min_mz = 0, 
                                          min_match = 6, 
                                          min_intens = 0.01,
                                          filename = filename,
                                          method = 'fast', #'hungarian',
                                          num_workers = 8,
                                          safety_points = 50)
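
For readers unfamiliar with the modified cosine score, the following sketch shows the general idea for one pair of spectra. It is not the `MS_functions.molnet_matrix` implementation. It assumes that peaks may match either directly or after shifting by the precursor mass difference, and that the 'fast' method corresponds to a greedy assignment of peak pairs (the notebook does not confirm this). The function `greedy_cosine` and its arguments are hypothetical.

```python
from math import sqrt


def greedy_cosine(spec1, spec2, tol=0.2, mass_shift=0.0):
    """Hypothetical helper: greedy (modified) cosine score of two spectra.

    spec1, spec2: lists of (m/z, intensity) tuples.
    mass_shift: precursor m/z of spec1 minus precursor m/z of spec2;
                0.0 gives the plain cosine score.
    Returns (score, number_of_matched_peaks).
    """
    # Collect all peak pairs that agree within the tolerance
    candidates = []
    for i, (mz1, int1) in enumerate(spec1):
        for j, (mz2, int2) in enumerate(spec2):
            direct = abs(mz1 - mz2) <= tol
            shifted = mass_shift != 0.0 and abs(mz1 - mz2 - mass_shift) <= tol
            if direct or shifted:
                candidates.append((int1 * int2, i, j))

    # Greedy assignment: take the largest intensity products first,
    # using every peak at most once
    candidates.sort(reverse=True)
    used1, used2 = set(), set()
    total, n_matches = 0.0, 0
    for product, i, j in candidates:
        if i not in used1 and j not in used2:
            used1.add(i)
            used2.add(j)
            total += product
            n_matches += 1

    norm1 = sqrt(sum(intensity ** 2 for _mz, intensity in spec1))
    norm2 = sqrt(sum(intensity ** 2 for _mz, intensity in spec2))
    return total / (norm1 * norm2), n_matches


# Two toy spectra whose second peak differs by the precursor mass difference
spec_a = [(100.0, 1.0), (150.0, 0.5)]   # precursor m/z 200
spec_b = [(100.0, 1.0), (160.0, 0.5)]   # precursor m/z 210

plain_score, plain_matches = greedy_cosine(spec_a, spec_b)
modified_score, modified_matches = greedy_cosine(spec_a, spec_b, mass_shift=200.0 - 210.0)
print(plain_score, plain_matches)
print(modified_score, modified_matches)
```

A greedy assignment can miss the best overall pairing when several peaks compete for the same partner. An optimal assignment, which is presumably what the 'hungarian' option computes, avoids this at a higher computational cost.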

### Calculate/load cosine scores 
+ Calculating is much faster than the modified cosine score, but can still become **slow**, especially when using small tolerances and little filtering (resulting in many peaks...). 

In [17]:
filename = PATH_SAVE_DATA + "Cosine_Grimur_tol02_minmatch6.npy"
cosine_sim = MS_functions.cosine_matrix(spectra[:100], 
                                      tol = 0.2, 
                                      max_mz = 1000, 
                                      min_mz = 0, 
                                      min_match = 6, 
                                      min_intens = 0,
                                      filename = filename,
                                      num_workers = 4)


Could not find file  test0000
Cosine scores will be calculated from scratch.
Calculate pairwise cosine scores by  4 number of workers.
  Calculated cosine for pair  90 -- 94 . (  100.0  % done).
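
The "load if the file exists, otherwise calculate and save" behaviour described for these matrix functions is a common caching pattern. A minimal sketch with NumPy is shown below. The helper `load_or_compute` and the stand-in `expensive_scores` are hypothetical and only illustrate the pattern, not the internals of `MS_functions`.

```python
import os
import tempfile

import numpy as np


def load_or_compute(filename, compute):
    """Hypothetical helper: load a stored matrix or compute and store it."""
    if os.path.isfile(filename):
        return np.load(filename)
    matrix = compute()
    np.save(filename, matrix)
    return matrix


calls = []


def expensive_scores():
    """Stand-in for a slow all-vs-all score calculation."""
    calls.append(1)
    return np.eye(3)


with tempfile.TemporaryDirectory() as tmp:
    filename = os.path.join(tmp, "scores.npy")
    first = load_or_compute(filename, expensive_scores)   # computed and saved
    second = load_or_compute(filename, expensive_scores)  # loaded from file

print(len(calls))
```

One consequence of this pattern is that a stale file is silently reused when parameters such as the tolerance change, so it helps to encode the parameters in the file name, as the notebook does.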

### Calculate/load molecular similarity scores
+ First calculate the molecular fingerprints.
+ Then calculate (or load, if the file exists) the molecular similarity scores.

The fingerprint method can be "morgan1", "morgan2", "morgan3" or "daylight". For Morgan fingerprints, the similarity is a Dice score; for "daylight" fingerprints, it is a Tanimoto score.
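
The difference between the two scores is easiest to see on a toy example. The sketch below treats a fingerprint as the set of its "on" bits, which is a simplifying assumption (fingerprints may also be count-based), and the functions `tanimoto` and `dice` are written here for illustration only, not taken from `MS_functions`.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) score: shared bits divided by bits set in either."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def dice(bits_a, bits_b):
    """Dice score: twice the shared bits divided by the total number of set bits."""
    a, b = set(bits_a), set(bits_b)
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


# Toy fingerprints given as indices of set bits
fp1 = {1, 4, 7, 9}
fp2 = {1, 4, 8}

tanimoto_score = tanimoto(fp1, fp2)
dice_score = dice(fp1, fp2)
print(tanimoto_score, dice_score)
```

For the same pair of fingerprints the Dice score is never lower than the Tanimoto score, so thresholds chosen for one are not directly transferable to the other.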

In [None]:
molecules, fingerprints_m3, exclude_IDs = MS_functions.get_mol_fingerprints(spectra_dict, method = "morgan3")
# Map the excluded spectrum IDs to row indices, matching on column 1 of spectra_metadata
exclude = [np.where(np.array(spectra_metadata)[:,1] == x)[0][0] for x in exclude_IDs]

In [None]:
filename = PATH_SAVE_DATA + "tanimoto_Grimur_morgan3.npy"
molecular_similarities = MS_functions.tanimoto_matrix(spectra, 
                                                      fingerprints_m3,
                                                      filename = filename)