# Cosine Similarity Using Demuth powers

To weight the m/z values and intensities, the Demuth powers are used: **c = 0** for m/z and **d = 0.33** for intensity.
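As a rough illustration of what these powers do, the sketch below assumes each peak is weighted as m/z^c · intensity^d before the cosine is computed. The helper name `weight_peaks` and the toy peak values are hypothetical and are not part of the notebook's data.

```python
import numpy as np

def weight_peaks(mz, intensities, c=0.0, d=0.33):
    """Return peak weights mz**c * intensity**d (assumed weighting scheme)."""
    mz = np.asarray(mz, dtype=float)
    intensities = np.asarray(intensities, dtype=float)
    return (mz ** c) * (intensities ** d)

# Toy spectrum: m/z values and intensities normalized to a maximum of 1.
mz = [50.0, 100.0, 150.0]
intensities = [1.0, 0.5, 0.01]

weights = weight_peaks(mz, intensities)
```

With c = 0 the m/z values drop out of the weight, and d = 0.33 compresses the intensity range so that small peaks contribute relatively more to the score.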

# Obtaining the data from the MoNA file


## Getting the path to the MoNA file

The spectra in the MoNA file will be read with the `load_from_msp` function. First, the path to the file is built.

In [1]:
import os
import sys

ROOT = os.path.dirname(os.getcwd())
sys.path.insert(0, ROOT)

In [None]:
from custom_functions.spectra_functions import get_data_folder_path

# from_external=False to use the data folder within the project
path = get_data_folder_path(from_external=False)
spectrums_file = os.path.join(path, "MoNA-export-GC-MS.msp")

## Applying filters to the spectra

Applied filters are:
* normalize_intensities(s)
* reduce_to_number_of_peaks(s, **n_required=10**, **ratio_desired=0.5**)
* select_by_mz(s, **mz_from=0**, **mz_to=1000**)
* require_minimum_number_of_peaks(s, **n_required=10**)
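To make the effect of these filters more concrete, here is a simplified stand-in that works on plain NumPy arrays. It is not the matchms code: it only imitates intensity normalization, the m/z window and the minimum peak count, and it leaves out `reduce_to_number_of_peaks`. The function name `simple_filters` and the toy spectrum are hypothetical.

```python
import numpy as np

def simple_filters(mz, intensities, mz_from=0.0, mz_to=1000.0, n_required=10):
    """Simplified stand-in for three of the filters, applied to plain arrays."""
    mz = np.asarray(mz, dtype=float)
    intensities = np.asarray(intensities, dtype=float)
    if intensities.size == 0:
        return None
    # Scale intensities so the largest peak has intensity 1.
    intensities = intensities / intensities.max()
    # Keep only peaks whose m/z lies in the selected window.
    keep = (mz >= mz_from) & (mz <= mz_to)
    mz, intensities = mz[keep], intensities[keep]
    # Discard the spectrum when too few peaks remain.
    if mz.size < n_required:
        return None
    return mz, intensities

# Toy spectrum with 12 peaks, two of which lie above m/z 1000.
toy_mz = np.arange(100.0, 1300.0, 100.0)
toy_int = np.linspace(10.0, 120.0, 12)
result = simple_filters(toy_mz, toy_int)
```

Returning `None` for a rejected spectrum mirrors the way the notebook later drops entries that are `None` from the filtered list.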

In [2]:
from matchms.filtering import normalize_intensities
from matchms.filtering import reduce_to_number_of_peaks
from matchms.filtering import select_by_mz
from matchms.filtering import require_minimum_number_of_peaks
from matchms.importing import load_from_msp

def apply_my_filters(s):
    s = normalize_intensities(s)
    s = reduce_to_number_of_peaks(s, n_required=10, ratio_desired=0.5)
    s = select_by_mz(s, mz_from=0, mz_to=1000)
    s = require_minimum_number_of_peaks(s, n_required=10)
    return s

spectrums_filtered = [apply_my_filters(s) for s in load_from_msp(spectrums_file)]

spectrums_filtered = [s for s in spectrums_filtered if s is not None]

In [3]:
print("Remaining number of spectra:", len(spectrums_filtered))

Remaining number of spectra: 14359


# Computing the Cosine Similarity with Demuth powers

## Defining the method to compute the similarity scores

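This function uses `CosineGreedy` with **tolerance = 0.5, mz_power = 0, intensity_power = 0.33** (the tolerance is passed explicitly in the call; the function's default is 0.1). It compares every pair of spectra and returns two NumPy arrays: the matrix of similarity scores and the matrix of matched-peak counts.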
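For intuition about what a greedy cosine score does with a tolerance and the two powers, the following sketch may help. It is an assumption-based illustration on plain arrays, not the matchms implementation: peaks are paired when their m/z values differ by at most the tolerance, pairs are assigned greedily by the largest weighted product, and the sum is normalized over all peaks of both spectra. The name `greedy_cosine` and the toy peaks are hypothetical.

```python
import numpy as np

def greedy_cosine(mz1, int1, mz2, int2, tolerance=0.5,
                  mz_power=0.0, intensity_power=0.33):
    """Sketch of a greedy cosine score; returns (score, matched peak count)."""
    mz1, int1 = np.asarray(mz1, float), np.asarray(int1, float)
    mz2, int2 = np.asarray(mz2, float), np.asarray(int2, float)
    w1 = mz1 ** mz_power * int1 ** intensity_power
    w2 = mz2 ** mz_power * int2 ** intensity_power

    # Collect every candidate pair whose m/z difference is within tolerance.
    candidates = [
        (w1[i] * w2[j], i, j)
        for i in range(mz1.size)
        for j in range(mz2.size)
        if abs(mz1[i] - mz2[j]) <= tolerance
    ]
    # Greedy assignment: strongest products first, each peak used at most once.
    candidates.sort(reverse=True)
    used1, used2, total = set(), set(), 0.0
    for product, i, j in candidates:
        if i not in used1 and j not in used2:
            used1.add(i)
            used2.add(j)
            total += product
    norm = np.sqrt(np.sum(w1 ** 2)) * np.sqrt(np.sum(w2 ** 2))
    score = total / norm if norm > 0 else 0.0
    return score, len(used1)

# Two toy spectra whose peaks are shifted by 0.4, inside the 0.5 tolerance.
toy_score, toy_matches = greedy_cosine(
    [100, 200, 300], [1, 0.5, 0.2],
    [100.4, 200.4, 300.4], [1, 0.5, 0.2],
)
```

A wider tolerance lets more peak pairs count as matches, which is why the tolerance is worth stating alongside the powers whenever scores are compared.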

In [4]:
from datetime import datetime
from matchms import calculate_scores
from matchms.similarity import CosineGreedy
import numpy as np

def calculate_similarity_scores(spectrums, tolerance=0.1):
    ## Code inspired by Florian Huber's Jupyter notebook to compute similarity matrix
    ## https://github.com/iomega/spec2vec_gnps_data_analysis/blob/master/custom_functions/similarity_matrix.py
    def get_time():
        now = datetime.now()
        current_time = now.strftime("%H:%M:%S")
        current_time = "Time = " + current_time
        return current_time
    
    length_spec = len(spectrums)
    similarities = np.zeros((length_spec, length_spec))
    num_matches = similarities.copy()
    
    total_num_calculations = int((length_spec**2)/2 + 0.5 * length_spec)
    count = 0
    
    # Demuth powers: m/z power 0, intensity power 0.33
    similarity_measure = CosineGreedy(tolerance=tolerance, mz_power=0.0, intensity_power=0.33)
    
    print("Start", get_time())
    
    for i in range(length_spec):
        for j in range(i, length_spec):
            score, matches = similarity_measure(spectrums[i], spectrums[j])
            similarities[i, j] = score
            num_matches[i, j] = matches
            count += 1
            if (count+1) % 10000 == 0:
                print("\r", "About {:.3f}% completed".format(100 * count/total_num_calculations), get_time(), end="")
            
    # The score is symmetric, so mirror the upper triangle into the lower one
    for i in range(1, length_spec):
        for j in range(i):
            similarities[i, j] = similarities[j, i]
            num_matches[i, j] = num_matches[j, i]
            
    return similarities, num_matches
            
similarities, num_matches = calculate_similarity_scores(spectrums_filtered, 0.5)

Start Time = 15:56:55
 About 99.993% completed Time = 19:31:16
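The function above evaluates only the pairs with j ≥ i, that is n(n+1)/2 comparisons for n spectra, and then copies the results into the lower triangle. A minimal sketch of that mirroring step, written with NumPy instead of the explicit loops, could look as follows. The helper name `mirror_upper_triangle` and the toy matrix are hypothetical.

```python
import numpy as np

def mirror_upper_triangle(matrix):
    """Return a symmetric copy built from the upper triangle (diagonal included)."""
    upper = np.triu(matrix)
    # Add the strictly upper part, transposed, to fill the lower triangle.
    return upper + np.triu(matrix, k=1).T

n = 4
# Toy matrix with values only on and above the diagonal.
upper_only = np.triu(np.arange(1.0, n * n + 1).reshape(n, n))
symmetric = mirror_upper_triangle(upper_only)

# Number of pairwise comparisons needed for n spectra.
pair_count = n * (n + 1) // 2
```

For the 14359 spectra kept here, the same formula gives 103,097,620 comparisons, which is consistent with a run time of several hours.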

## Saving the matches and the similarities

In [5]:
filename = os.path.join(path, 'similarities_cosine_tol05_mzp0_intp033.npy')
np.save(filename, similarities)
# os.path.splitext only strips the extension, so dots elsewhere in the path are safe
np.save(os.path.splitext(filename)[0] + "_matches.npy", num_matches)
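To reuse the results later without recomputing them, both arrays can be read back with `np.load`. The sketch below shows a save and load round trip on small toy matrices in a temporary folder, assuming the matches file is named by inserting `_matches` before the extension. The helper `save_pair` and the file name `toy_similarities.npy` are hypothetical.

```python
import os
import tempfile
import numpy as np

def save_pair(base_path, similarities, num_matches):
    """Save both matrices; the matches file gets a '_matches' suffix."""
    root, ext = os.path.splitext(base_path)
    matches_path = root + "_matches" + ext
    np.save(base_path, similarities)
    np.save(matches_path, num_matches)
    return base_path, matches_path

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file name inside a temporary folder.
    base = os.path.join(tmp, "toy_similarities.npy")
    sim_path, match_path = save_pair(base, np.eye(3), np.full((3, 3), 2.0))
    loaded_sim = np.load(sim_path)
    loaded_matches = np.load(match_path)
```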