# Cosine Similarity Using MassBank powers

To weight the m/z values and intensities, the MassBank powers are used: **m/z power = 2** and **intensity power = 0.5**.
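
As a rough illustration of what these powers do, the sketch below weights each peak as m/z raised to the m/z power times intensity raised to the intensity power. The helper name `weighted_peaks` and the three-peak spectrum are made up for this example; this is not code from matchms.

```python
import numpy as np

def weighted_peaks(mz, intensities, mz_power=2.0, intensity_power=0.5):
    """Return one weight per peak: mz**mz_power * intensity**intensity_power."""
    mz = np.asarray(mz, dtype=float)
    intensities = np.asarray(intensities, dtype=float)
    return (mz ** mz_power) * (intensities ** intensity_power)

# Hypothetical spectrum with intensities already scaled to [0, 1].
mz = [50.0, 100.0, 200.0]
intensities = [1.0, 0.25, 0.04]

weights = weighted_peaks(mz, intensities)
print(weights)
```

With these powers the weak peak at m/z 200 ends up with the largest weight, which shows how strongly high-mass fragments are emphasized relative to the base peak.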

# Obtaining the data from the MoNA file


## Getting the path for the MoNA file

The cell below builds the path to the MoNA export file. The spectra inside it are read later with the method load_from_msp.

In [7]:
import os
import sys

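# The project root is the parent of the current working directory.
ROOT = os.path.dirname(os.getcwd())
path = os.path.join(ROOT, "data")
msp_file = os.path.join(path, "MoNA-export-GC-MS.msp")
# Make modules in the project root importable from this notebook.
sys.path.insert(0, ROOT)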

## Applying filters to the spectra

Applied filters are:
* normalize_intensities(s)
* select_by_mz(s, **mz_from=0**, **mz_to=1000**)
* select_by_relative_intensity(s, **intensity_from=0.05**, **intensity_to=1.0**)

Spectra that are left without any peaks after filtering are removed afterwards.

In [8]:
from matchms.importing import load_from_msp

spectrums = [s for s in load_from_msp(msp_file)]
print("Number of Spectra:", len(spectrums))

Number of Spectra: 14847


In [9]:
from matchms.filtering import normalize_intensities
from matchms.filtering import select_by_mz
from matchms.filtering import select_by_relative_intensity

def apply_my_filters(s):
    s = normalize_intensities(s)
    s = select_by_mz(s, mz_from=0, mz_to=1000)
    s = select_by_relative_intensity(s, intensity_from=0.05, intensity_to=1.0)
    return s

spectrums_filtered = [apply_my_filters(s) for s in spectrums]

spectrums_filtered = [s for s in spectrums_filtered if s is not None]
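
To make the effect of apply_my_filters concrete, here is a plain NumPy sketch of the same three steps on a single peak list. It is a simplified stand-in written for this note, not the matchms implementation; the function name `filter_peaks` and the example peaks are hypothetical.

```python
import numpy as np

def filter_peaks(mz, intensities, mz_from=0.0, mz_to=1000.0, intensity_from=0.05):
    """Normalize intensities, then keep peaks inside the m/z window
    whose normalized intensity is at least intensity_from."""
    mz = np.asarray(mz, dtype=float)
    intensities = np.asarray(intensities, dtype=float)

    # Step 1: scale so that the strongest peak has intensity 1.0.
    if intensities.size > 0 and intensities.max() > 0:
        intensities = intensities / intensities.max()

    # Steps 2 and 3: m/z window and relative-intensity threshold.
    keep = (mz >= mz_from) & (mz <= mz_to) & (intensities >= intensity_from)
    return mz[keep], intensities[keep]

# Hypothetical raw peak list.
raw_mz = [40.0, 75.0, 120.0, 1500.0]
raw_intensities = [200.0, 8.0, 1000.0, 600.0]

kept_mz, kept_intensities = filter_peaks(raw_mz, raw_intensities)
print(kept_mz, kept_intensities)
```

The peak at m/z 75 falls below 5% of the base peak and the peak at m/z 1500 lies outside the window, so only two peaks remain.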

In [10]:
print("Remaining number of spectra:", len(spectrums_filtered))

Remaining number of spectra: 14847


In [11]:
spectrums_filtered = [s for s in spectrums_filtered if len(s.peaks.intensities) > 0]
print("Remaining number of spectra:", len(spectrums_filtered))

Remaining number of spectra: 14844


# Computing the Cosine Similarity with MassBank Powers

## Defining the method to compute the similarity scores

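This method uses CosineGreedy with **mz_power=2** and **intensity_power=0.5** and is called with a **tolerance of 0.5** (the default in the signature is 0.1). It returns two square numpy arrays indexed by reference spectrum and query spectrum: one with the similarity scores and one with the number of matched peaks.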
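
The idea behind a greedy cosine score can be sketched in a few lines of NumPy. The function below is a simplified illustration written for this note, under the assumption that peaks are paired greedily by largest weighted product within the tolerance; it is not the CosineGreedy source, and the name `greedy_cosine` and the example spectra are hypothetical.

```python
import numpy as np

def greedy_cosine(mz1, int1, mz2, int2, tolerance=0.5,
                  mz_power=2.0, intensity_power=0.5):
    """Return (score, number of matched peaks) for two peak lists."""
    mz1, int1 = np.asarray(mz1, dtype=float), np.asarray(int1, dtype=float)
    mz2, int2 = np.asarray(mz2, dtype=float), np.asarray(int2, dtype=float)

    # Weight every peak with the chosen powers.
    w1 = mz1 ** mz_power * int1 ** intensity_power
    w2 = mz2 ** mz_power * int2 ** intensity_power

    # All peak pairs whose m/z values differ by at most the tolerance.
    candidates = [
        (float(w1[i] * w2[j]), i, j)
        for i in range(len(mz1))
        for j in range(len(mz2))
        if abs(mz1[i] - mz2[j]) <= tolerance
    ]
    # Greedy step: take the largest products first, using each peak once.
    candidates.sort(reverse=True)
    used1, used2 = set(), set()
    total = 0.0
    for product, i, j in candidates:
        if i not in used1 and j not in used2:
            used1.add(i)
            used2.add(j)
            total += product

    norm = np.sqrt(np.sum(w1 ** 2)) * np.sqrt(np.sum(w2 ** 2))
    score = total / norm if norm > 0 else 0.0
    return score, len(used1)

# Hypothetical spectra: the first two peaks match within 0.5, the third does not.
spec_a = ([50.0, 100.0, 200.0], [1.0, 0.25, 0.04])
spec_b = ([50.3, 100.0, 300.0], [1.0, 0.25, 0.04])

self_score, self_matches = greedy_cosine(*spec_a, *spec_a)
score, matches = greedy_cosine(*spec_a, *spec_b)
print(self_score, self_matches, score, matches)
```

A spectrum compared with itself gives a score of 1 with all peaks matched, which is why the diagonal of the similarity matrix carries no information.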

In [12]:
from datetime import datetime
from matchms.similarity import CosineGreedy
import numpy as np

def calculate_similarity_scores(spectrums, tolerance=0.1):
    
    def get_time():
        now = datetime.now()
        current_time = now.strftime("%H:%M:%S")
        current_time = "Time = " + current_time
        return current_time
    
    length_spec = len(spectrums)
    similarities = np.zeros((length_spec, length_spec))
    num_matches = similarities.copy()
    
    total_num_calculations = int((length_spec**2)/2 + 0.5 * length_spec)
    count = 0
    
    # MassBank powers: m/z power 2, intensity power 0.5.
    similarity_measure = CosineGreedy(tolerance=tolerance, mz_power=2.0, intensity_power=0.5)
    
    print("Start", get_time())
    
    for i in range(length_spec):
        for j in range(i, length_spec):
            score, matches = similarity_measure(spectrums[i], spectrums[j])
            similarities[i, j] = score
            num_matches[i, j] = matches
            count += 1
            if (count+1) % 10000 == 0:
                print("\r", "About {:.3f}% completed".format(100 * count/total_num_calculations), get_time(), end="")
            
    # Mirror the upper triangle into the lower triangle (the score is symmetric).
    for i in range(1, length_spec):
        for j in range(i):
            similarities[i, j] = similarities[j, i]
            num_matches[i, j] = num_matches[j, i]
            
    return similarities, num_matches
            
similarities, num_matches = calculate_similarity_scores(spectrums_filtered, 0.5)

Start Time = 14:06:45
 About 99.991% completed Time = 19:52:11
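
Because the score is symmetric, only the upper triangle (including the diagonal) has to be computed, which is n(n+1)/2 pairs. The mirroring step at the end of calculate_similarity_scores can also be written without Python loops. The sketch below shows one way to do that; the helper names `count_pairs` and `mirror_upper_triangle` are introduced here for illustration only.

```python
import numpy as np

def count_pairs(n):
    """Number of (i, j) pairs with i <= j among n spectra."""
    return n * (n + 1) // 2

def mirror_upper_triangle(upper):
    """Copy the upper triangle of a square matrix into its lower triangle."""
    upper = np.asarray(upper, dtype=float)
    # np.triu(upper) keeps the diagonal; the strict upper part is transposed
    # to fill the lower triangle.
    return np.triu(upper) + np.triu(upper, k=1).T

# Hypothetical 3x3 result in which only the upper triangle was filled.
upper_only = np.array([
    [1.0, 0.2, 0.3],
    [0.0, 1.0, 0.4],
    [0.0, 0.0, 1.0],
])
full = mirror_upper_triangle(upper_only)
print(full)
print(count_pairs(14844))
```

For the 14844 filtered spectra this comes to roughly 110 million pair comparisons, which is consistent with the run taking several hours.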

## Saving the matches and the similarities

In [13]:
filename = os.path.join(path,'similarities_filt05_cosine_tol05_mzp2_intp05.npy')
np.save(filename, similarities)
# splitext only strips the extension, so dots elsewhere in the path are safe.
np.save(os.path.splitext(filename)[0] + "_matches.npy", num_matches)
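
Since the computation is expensive, it is worth checking that the saved arrays can be read back. The sketch below does a save and load round trip on small made-up arrays in a temporary directory; the helper `matches_filename` and the demo file name are assumptions for this example and do not touch the real result files.

```python
import os
import tempfile
import numpy as np

def matches_filename(similarities_filename):
    """Derive the name of the matches file from the similarities file."""
    root, ext = os.path.splitext(similarities_filename)
    return root + "_matches" + ext

# Small hypothetical result matrices.
demo_similarities = np.array([[1.0, 0.3], [0.3, 1.0]])
demo_matches = np.array([[5.0, 2.0], [2.0, 5.0]])

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_file = os.path.join(tmp_dir, "similarities_demo.npy")
    np.save(demo_file, demo_similarities)
    np.save(matches_filename(demo_file), demo_matches)

    # Read both arrays back while the temporary directory still exists.
    loaded_similarities = np.load(demo_file)
    loaded_matches = np.load(matches_filename(demo_file))

print(np.array_equal(loaded_similarities, demo_similarities))
print(np.array_equal(loaded_matches, demo_matches))
```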