# Mass spectral similarity algorithms
The spectral similarity algorithms are collected from [Spectral Entropy](https://spectralentropy.readthedocs.io/en/master/#spectral_similarity.multiple_distance) and [Spectrum_utils](https://spectrum-utils.readthedocs.io/en/latest/quickstart.html).

Mass spectra are preprocessed by default using the following procedures:

1. Remove ions with *m/z* higher than the precursor mass.

2. Centroid peaks by merging peaks within a given mass tolerance.

3. Remove ions with intensity lower than a fixed fraction of the maximum intensity.
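
The three preprocessing steps above can be sketched as follows. This is a minimal illustration rather than the package's actual implementation: the function name `preprocess`, the default tolerance `tol=0.02`, the cutoff `min_rel_intensity=0.01`, and the intensity-weighted merging rule are all assumptions made for the example.

```python
import numpy as np

def preprocess(mz, intensity, precursor_mz, tol=0.02, min_rel_intensity=0.01):
    """Apply the three default cleaning steps to one spectrum (illustrative)."""
    mz = np.asarray(mz, dtype=float)
    intensity = np.asarray(intensity, dtype=float)

    # 1. Drop ions whose m/z exceeds the precursor m/z.
    keep = mz <= precursor_mz
    mz, intensity = mz[keep], intensity[keep]

    # 2. Centroid: walk the peaks in m/z order and merge a peak into the
    #    previous one when they are closer than `tol` (assumed merging rule).
    order = np.argsort(mz)
    merged_mz, merged_int = [], []
    for m, i in zip(mz[order], intensity[order]):
        if merged_mz and m - merged_mz[-1] <= tol and merged_int[-1] + i > 0:
            total = merged_int[-1] + i
            # The merged peak sits at the intensity-weighted mean m/z.
            merged_mz[-1] = (merged_mz[-1] * merged_int[-1] + m * i) / total
            merged_int[-1] = total
        else:
            merged_mz.append(m)
            merged_int.append(i)
    mz, intensity = np.array(merged_mz), np.array(merged_int)

    # 3. Drop ions below a fixed fraction of the most intense peak.
    if intensity.size:
        keep = intensity >= min_rel_intensity * intensity.max()
        mz, intensity = mz[keep], intensity[keep]
    return mz, intensity
```

For example, with a precursor at *m/z* 200, the peaks at 100.00 and 100.01 are merged, the peak at 300 is removed in step 1, and a peak at 0.5% relative intensity is removed in step 3.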

The distance between two MS$^2$ spectra is calculated as follows:

1. Find common peaks (*m/z*) within a given mass tolerance.

2. Define intensity from spec1 as P and intensity from spec2 as Q.

Similarity = 1 - Distance
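
The two steps above can be sketched as a peak alignment that produces the intensity vectors P and Q used by every formula on this page. The helper name `align_spectra`, the tolerance value, and the normalization of each vector to unit sum are assumptions for illustration; both peak lists are assumed to be sorted by *m/z* and non-empty.

```python
import numpy as np

def align_spectra(mz1, int1, mz2, int2, tol=0.02):
    """Return aligned, sum-normalized intensity vectors P (spec1) and Q (spec2)."""
    P, Q = [], []
    i = j = 0
    while i < len(mz1) or j < len(mz2):
        only1 = j == len(mz2)   # spec2 is exhausted
        only2 = i == len(mz1)   # spec1 is exhausted
        if not only1 and not only2 and abs(mz1[i] - mz2[j]) <= tol:
            P.append(int1[i])   # common peak: keep both intensities
            Q.append(int2[j])
            i += 1
            j += 1
        elif only1 or (not only2 and mz1[i] < mz2[j]):
            P.append(int1[i])   # peak present only in spec1
            Q.append(0.0)
            i += 1
        else:
            P.append(0.0)       # peak present only in spec2
            Q.append(int2[j])
            j += 1
    P, Q = np.array(P, dtype=float), np.array(Q, dtype=float)
    # Assumed normalization: each spectrum sums to 1.
    return P / P.sum(), Q / Q.sum()
```

A distance function then only needs P and Q, and the similarity is obtained as `1 - distance`.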

*The spectral similarity algorithms have been shown to vary in effectiveness compared with the classic cosine-based similarity. However, collision energies and mass spectrometers can vary widely in practical applications, especially in untargeted analyses where the sample composition is unknown. Therefore, we recommend using all algorithms for clustering to identify the algorithm combinations that best suit the specific dataset.*

GNPS supports a single cosine-based similarity for both library searching and feature organization. MSanalyst incorporates 45 additional algorithms, allowing users to interpret their metabolomic data from multiple perspectives. Some of them are derived from basic algorithms and can reduce noise or increase the influence of individual fragments on the final similarity score by adding weight factors, normalization, etc. Therefore, we recommend first using **modified_cosine, neutral_loss, entropy, symmetric_chi_squared, euclidean and ms_for_id to verify their effectiveness, and then using their modified versions.**

## dot_product
It is also known as cosine similarity and is defined as:

$Similarity = \frac{\sum Q_i P_i}{\sqrt{\sum Q_i^2} \sqrt{\sum P_i^2}}$
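
A minimal sketch of this formula, assuming P and Q are already aligned intensity vectors of equal length (the function name `cosine_similarity` is chosen for the example):

```python
import numpy as np

def cosine_similarity(P, Q):
    """Cosine of the angle between two aligned intensity vectors."""
    P, Q = np.asarray(P, dtype=float), np.asarray(Q, dtype=float)
    return float(np.sum(Q * P) / (np.sqrt(np.sum(Q**2)) * np.sqrt(np.sum(P**2))))
```

Because both vectors are divided by their lengths, rescaling either spectrum leaves the score unchanged.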

## dot_product_reverse_distance
Removes the influence of peaks that have zero intensity in the other spectrum.

$Distance = 1 - \sqrt{\frac{(\sum{Q_i^{'}P_i^{'}})^2}{\sum{(Q_i^{'})^2}\sum{(P_i^{'})^2}}}$

$P^{'}_{i}=\frac{P^{''}_{i}}{\sum_{i}{P^{''}_{i}}}$,
$P^{''}_{i}=\begin{cases}
        0 & \text{ if } Q_{i}=0 \\
        P_{i} & \text{ if } Q_{i}\neq0
        \end{cases}
$

## weighted_dot_product_distance
Modified by a ratio based on the relative intensities of adjacent fragment ions; it is especially useful for GC-EI-MS.

$Distance = 1 - \frac{(\sum{Q^{'}_{i} P^{'}_{i}})^2}{\sum{Q_{i}^{'2}}\sum{P_{i}^{'2}}}$

where $P^{'}_{i} = M_{p,i}^{3}I_{p,i}^{0.6}$ and $Q^{'}_{i} = M_{q,i}^{3}I_{q,i}^{0.6}$.

$P^{'}_{i}$ and $Q^{'}_{i}$ represent the transformed intensities of the ions, where the mass-to-charge ratios are raised to the power of 3 and the intensities are raised to the power of 0.6. This transformation helps to emphasize the contribution of ions with higher mass-to-charge ratios and moderate intensities.

The subsequent calculation is the same as for dot_product_distance.
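
The weighting can be sketched as below. As a simplifying assumption, each aligned slot is given one shared *m/z* value, so that $M_{p,i}$ and $M_{q,i}$ coincide. The function and parameter names are illustrative only.

```python
import numpy as np

def weighted_dot_product_distance(mz, P, Q, mz_power=3.0, int_power=0.6):
    """1 - squared cosine of the m/z- and intensity-weighted vectors."""
    mz, P, Q = (np.asarray(a, dtype=float) for a in (mz, P, Q))
    Pw = mz**mz_power * P**int_power   # P'_i = M^3 * I^0.6
    Qw = mz**mz_power * Q**int_power   # Q'_i = M^3 * I^0.6
    return float(1.0 - np.sum(Qw * Pw) ** 2 / (np.sum(Qw**2) * np.sum(Pw**2)))
```

With these exponents a fragment at *m/z* 200 carries eight times the weight of an equally intense fragment at *m/z* 100.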

## inner_product_distance
An unnormalized version of the dot_product similarity.

$Distance = 1-\sum{P_iQ_i}$

By summing the products of corresponding values, the formula emphasizes the importance of matching peaks with similar intensities.

## spectral_contrast_angle_distance
Similar to the cosine score, with a different normalizing factor.

$Distance = 1 - \frac{\sum{Q_iP_i}}{\sqrt{\sum Q_i^2\sum P_i^2}}$

${\sum{Q_iP_i}}$ represents the dot product of the intensity from two spectra and ${\sqrt{\sum Q_i^2\sum P_i^2}}$ normalizes the dot product, ensuring that the similarity score is comparable across different spectra.

### Application notes
#### Normalization:
The algorithm normalizes the dot product by the magnitudes of the spectra, making the similarity score comparable across different spectra.
#### Robustness:
The cosine similarity is robust to variations in the magnitudes of the spectra, focusing on the direction of the vectors rather than their magnitudes.

## jaccard_distance
Similar to the cosine score, but the denominator includes an additional term that subtracts the product of the corresponding intensities.

$Distance = \frac{\sum(P_i-Q_i)^2}{\sum P_i^2+\sum Q_i^2-\sum P_iQ_i}$

This formula sums the squared differences between corresponding intensity values from two spectra, then divides by the sum of the squared intensities minus the sum of the products of corresponding intensities. It can be interpreted as a measure that takes into account not only the differences in intensities but also the interaction between the intensities.

### Application notes
#### Interaction Effect:
By including the term $\sum{P_iQ_i}$, the formula accounts for the interaction between the spectra, which could provide a more nuanced measure of similarity.


## Neutral loss
Neutral loss is the *m/z* difference (*Δm/z*) between a precursor ion and each of its fragment ions in an MS/MS spectrum; it provides insight into the neutral molecules lost during ion fragmentation.
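
As a minimal sketch, a neutral loss spectrum can be derived by subtracting each fragment *m/z* from the precursor *m/z*. The function name is hypothetical and singly charged ions are assumed.

```python
import numpy as np

def neutral_losses(fragment_mz, precursor_mz):
    """Return the sorted neutral losses (precursor m/z minus fragment m/z)."""
    fragment_mz = np.asarray(fragment_mz, dtype=float)
    losses = precursor_mz - fragment_mz
    # Fragments above the precursor would give negative losses; drop them.
    return np.sort(losses[losses >= 0])
```

For a precursor at *m/z* 163.06 with a fragment at 145.05, the loss of 18.01 corresponds to water. The resulting loss values can then be compared between two spectra in the same way as fragment *m/z* values.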

### Application notes
#### Additional dimension:
Neutral loss adds a valuable dimension to MS<sup>2</sup> analysis, particularly for structurally related compounds sharing common building blocks.

## Modified cosine
Modified cosine similarity builds on traditional cosine similarity. It matches fragment ions directly and also accounts for peaks shifted by the precursor mass difference between spectra, capturing structural changes due to molecular modifications.
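
The idea can be sketched with a simple greedy matcher. This is an illustration under stated assumptions, not the implementation used by GNPS or MSanalyst: the function name, the tolerance, and the greedy pairing (highest intensity product first, each peak used at most once) are choices made for the example, and an optimal assignment could score slightly differently.

```python
import numpy as np

def modified_cosine(mz1, int1, pmz1, mz2, int2, pmz2, tol=0.02):
    """Cosine score that also pairs peaks shifted by the precursor mass difference."""
    mz1, int1, mz2, int2 = (np.asarray(a, dtype=float) for a in (mz1, int1, mz2, int2))
    shift = pmz1 - pmz2
    # Candidate pairs: direct matches, or matches after shifting spec2 by `shift`.
    candidates = []
    for i, m1 in enumerate(mz1):
        for j, m2 in enumerate(mz2):
            if abs(m1 - m2) <= tol or abs(m1 - (m2 + shift)) <= tol:
                candidates.append((float(int1[i] * int2[j]), i, j))
    # Greedy assignment: strongest pairs first, each peak used once.
    candidates.sort(reverse=True)
    used1, used2, score = set(), set(), 0.0
    for product, i, j in candidates:
        if i not in used1 and j not in used2:
            used1.add(i)
            used2.add(j)
            score += product
    return float(score / (np.linalg.norm(int1) * np.linalg.norm(int2)))
```

In the case of two spectra whose precursors differ by 14 Da, a fragment at *m/z* 150 in one spectrum can pair with a fragment at *m/z* 164 in the other, which a plain cosine score would treat as unmatched.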

### Application notes
It incorporates precursor mass differences, enabling it to detect fragment ions altered by structural modifications. It is most effective when the two spectra differ by a single modification.


## Reference
See [J. Am. Soc. Mass Spectrom. 2022, 33, 9, 1733–1744](https://pubs.acs.org/doi/10.1021/jasms.2c00153) and [J. Am. Soc. Mass Spectrom. 2022, 33, 3, 530–534](https://pubs.acs.org/doi/10.1021/jasms.1c00343) for details.


## peak_percentage
Calculated as the proportion of fragments shared by two spectra.

$Similarity = \frac{N_m}{N_p}$

$N_m$: number of matching fragments,

$N_p$: number of fragments in spectrum $p$ (the query spectrum).

The score is the ratio of the number of common fragments to the total number of fragments in the query spectrum.
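
A minimal sketch of this ratio is shown below. The function name and the use of a matching tolerance `tol` (as in the common-peak step of the distance calculation) are assumptions for illustration; each reference peak is allowed to match at most one query peak.

```python
def peak_percentage(query_mz, ref_mz, tol=0.02):
    """Fraction of query peaks that have a partner in the reference spectrum."""
    used = set()        # indices of reference peaks already matched
    n_match = 0
    for m in query_mz:
        for k, r in enumerate(ref_mz):
            if k not in used and abs(m - r) <= tol:
                used.add(k)
                n_match += 1
                break
    return n_match / len(query_mz)
```

Intensities never enter the calculation, so a weak fragment counts exactly as much as the base peak.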

### Application notes
#### Sensitivity to Fragmentation:
The algorithm is straightforward: it considers only the fragmentation pattern of the spectra and ignores the influence of intensities.

In tandem mass spectra preprocessing, signals ≤ 1% of the intensity of the highest peak are usually considered noise. A manual inspection of MNA’s experimental mass spectral library showed that, even after this noise-removal step, some essential fragment ions remain mixed in with many noise signals. This makes it difficult to define a general standard for noise removal. Another part of the MS$^2$ spectra in MNA’s library is predicted by CFM-ID. Although the CFM-ID model is well designed and applies state-of-the-art machine learning modules, in-silico fragments cannot be expected to follow real-world cleavage rules exactly. Unexpected simulated fragments will undoubtedly affect the comparison of MS$^2$ spectra.

For these reasons, we designed a new mass spectral similarity algorithm called peak percentage (Supplementary Fig. 3). It counts all matched peaks without a user-defined mass error, and the count is divided by the total number of peaks to get the final score. In practical use, we advise using this scoring method after the MS1 filtering procedure. The absolute number of matched peaks (matched peak count, MPC) is also used alongside it to exclude meaningless comparisons of oversimplified spectra.


## unweighted_entropy_distance
$Distance = -\frac{2\times S_{PQ}-S_P-S_Q} {\ln(4)}, S_I=\sum_{i} {I_i \ln(I_i)}$

$S_I=\sum_{i} {I_i \ln(I_i)}$ is the negative Shannon entropy of the spectrum $I$ (the leading minus sign in the distance restores the usual sign). The entropy measures the uncertainty or information content of the spectrum.

$S_{PQ}$ is the same quantity for the combined (merged) spectrum of $P$ and $Q$, which measures the combined uncertainty of the two spectra.
The formula calculates the distance between the two spectra from their combined and individual entropies. The normalization by $\ln(4)$ ensures that the score is comparable across different spectra.
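
The calculation can be sketched as follows. The sketch assumes that both spectra are normalized to unit sum and that the combined spectrum is the average of P and Q; the function names are chosen for the example.

```python
import numpy as np

def s_term(I):
    """S_I = sum_i I_i ln(I_i), with 0 * ln(0) treated as 0."""
    I = np.asarray(I, dtype=float)
    I = I[I > 0]
    return float(np.sum(I * np.log(I)))

def unweighted_entropy_distance(P, Q):
    P, Q = np.asarray(P, dtype=float), np.asarray(Q, dtype=float)
    P, Q = P / P.sum(), Q / Q.sum()          # assumed unit-sum normalization
    S_PQ = s_term((P + Q) / 2)               # assumed 1:1 merged spectrum
    return float(-(2 * S_PQ - s_term(P) - s_term(Q)) / np.log(4))
```

Under these assumptions identical spectra give a distance of 0 and spectra without any common peak give 1, which is what the division by $\ln(4)$ achieves.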

### Application notes
#### Information Theoretic Measure:
The algorithm uses Shannon entropy, which is an information theoretic measure that captures the information content of the spectra. This can provide a more meaningful measure of similarity compared to simple intensity differences.

#### Robustness:
The use of entropy makes the algorithm robust to variations in the magnitudes of the spectra, focusing on the distribution of intensities rather than their absolute values.


## entropy_distance

$Distance = -\frac{2\times S_{PQ}^{'}-S_P^{'}-S_Q^{'}} {\ln(4)}$

$S_I^{'}=\sum_{i} {I_i^{'} \ln(I_i^{'})}, I^{'}=I^{w}$

with $w=0.25+S\times 0.5$ for $S<1.5$
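
A sketch of the intensity weighting is given below. Several points are assumptions rather than statements from this page: $S$ is taken to be the (positive) Shannon entropy of the unit-sum spectrum, spectra with $S \geq 1.5$ are left unchanged, and the weighted intensities are renormalized to unit sum. The function and parameter names are hypothetical.

```python
import numpy as np

def entropy_weighted(I, cutoff=1.5, start=0.25, slope=0.5):
    """Raise intensities to the power w = start + slope * S when S < cutoff."""
    I = np.asarray(I, dtype=float)
    I = I / I.sum()
    nonzero = I[I > 0]
    S = float(-np.sum(nonzero * np.log(nonzero)))  # Shannon entropy (assumed meaning of S)
    if S < cutoff:
        w = start + slope * S
        I = I**w
        I = I / I.sum()                            # assumed renormalization
    return I
```

Because $w < 1$ for low-entropy spectra, weak peaks are boosted relative to the base peak before the entropy distance is computed.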


## euclidean_distance
The square root of the summed squared intensity differences over all *m/z*, reflecting the overall fragment distribution.

$Distance = (\sum|P_{i}-Q_{i}|^2)^{1/2}$

### Application notes
#### Sensitivity to Outliers:
The Euclidean distance is sensitive to outliers because squaring the differences amplifies the effect of large deviations. If outliers are a concern, consider methods to robustify the distance calculation against outliers.
#### Equal Weighting:
All dimensions (peaks in the spectra) are given equal weight in the calculation. In some cases, this might not be desirable if certain peaks are more informative than others.

## squared_euclidean_distance
The squared Euclidean distance, i.e. the Euclidean distance without the final square root.

$Distance = \sum(P_{i}-Q_{i})^2$

### Application notes
#### Robustness:
The use of squared differences makes the algorithm sensitive to larger disparities, which can be useful in identifying significant differences between spectra.

#### Sensitivity to Outliers:
The algorithm is sensitive to outliers because squaring differences amplifies the impact of large deviations.

## improved_similarity_distance
A normalized Euclidean score that reduces the impact of a single fragment.

$Distance = \sqrt{\frac{1}{N}\sum\{\frac{P_i-Q_i}{P_i+Q_i}\}^2}$

$N$ is the number of data points (peaks) being compared.
### Application notes
#### Emphasis on Proportional Differences:
The formula gives more weight to the relative differences between the peaks rather than their absolute differences, which can be useful in contexts where the scale of intensity values is less important than their relative changes.

#### Robustness:
This measure can be more robust to variations in the magnitude of spectral data, as it focuses on the ratio of differences to the sum of intensities.

## penrose_shape_distance
Similar to euclidean score, but focusing on the relative shapes of fragment ions rather than the intensity difference

$Distance = \sqrt{\sum((P_i-\bar{P})-(Q_i-\bar{Q}))^2}$

$(P_i-\bar{P})$ and $(Q_i-\bar{Q})$ represent the deviations of each element from its respective mean. Centering the data in this way helps to focus on the relative differences rather than the absolute values.

Squaring these differences emphasizes larger disparities and minimizes the effect of smaller ones. Summing these squared differences gives a total measure of dissimilarity.

### Application notes
#### Focus on Relative Differences:
By centering the data, the formula focuses on relative differences between the vectors, which can be more meaningful than absolute differences when the scales of the data vary.

## manhattan_distance
Similar to euclidean, only calculating the sum of the absolute intensity differences at each *m/z*

$Distance = \sum|P_{i}-Q_{i}|$

### Application notes
#### Robustness:
The use of absolute values makes this measure robust to outliers, as it does not amplify the effect of large deviations as squaring does.
#### Lack of Weighting:
Unlike some other metrics, this formula does not weight differences by their importance or by the scale of the data, treating all differences equally.

## penrose_size_distance
Manhattan score with a weighted factor $\sqrt N$

$Distance = \sqrt N\sum{|P_i-Q_i|}$

### Application notes
#### Centering:
The use of absolute differences helps to focus on the relative differences rather than the absolute values.
#### Robustness:
The use of absolute differences makes the algorithm robust to noise and variations in ion intensities. It does not assume a specific distribution of the data, which can be advantageous when dealing with real-world mass spectrometry data that may not follow a Gaussian distribution.

## mean_character_distance
Divides the manhattan score by the number of fragments $N$, providing an overview of central tendency.

$Distance = \frac{1}{N}\sum{|P_i-Q_i|}$

### Application notes
#### Robustness:
The mean absolute difference is less sensitive to outliers than measures that involve squaring differences, such as the Euclidean distance.
#### Sensitivity to Distribution:
The formula normalizes the sum of differences, allowing for comparison between spectra of different sizes or with different numbers of data points.
However, the mean difference does not take into account the distribution of the data, which could be important in certain applications.

## absolute_value_distance
A normalized form of manhattan, focusing more on ion distribution than on unique fragments.

$Distance = \frac{\sum|Q_i-P_i|}{\sum P_i}$

Each term $|Q_i-P_i|$ is the intensity difference between the two spectra at one data point. Summing all absolute differences gives the total absolute difference, which is then divided by the total intensity of $P$.
### Application notes
#### Sensitivity to Outliers:
This method is sensitive to outliers since a single large difference can significantly affect the total difference.

#### Lack of Correlation Consideration:
It does not account for correlations between fragment ions, focusing solely on intensity differences.

## matusita_distance
Calculates the Euclidean score between the square-root-transformed values of two spectra.

$Distance = \sqrt{\sum(\sqrt{P_{i}}-\sqrt{Q_{i}})^2}$

Each intensity value is transformed by taking its square root, and the standard Euclidean distance is then computed between the transformed vectors.
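
A minimal sketch of the transformation and distance, assuming aligned, non-negative intensity vectors (the function name mirrors the section title):

```python
import numpy as np

def matusita_distance(P, Q):
    """Euclidean distance between square-root-transformed intensities."""
    P, Q = np.asarray(P, dtype=float), np.asarray(Q, dtype=float)
    return float(np.sqrt(np.sum((np.sqrt(P) - np.sqrt(Q)) ** 2)))
```

For unit-sum spectra the value ranges from 0 (identical) to $\sqrt{2}$ (no common peak).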
### Application notes
#### Sensitivity to Proportional Differences:
By taking the square root of the intensities, the formula gives more weight to smaller differences and less to larger ones, which can be beneficial when the scale of intensity values is important.

## chebyshev_distance
Calculates the maximum absolute intensity difference, focusing on the single most divergent fragment ion.

$Distance = \underset{i}{\max}{(|P_{i}\ -\ Q_{i}|)}$

### Application notes
#### Sensitive to single peaks:
This metric emphasizes the largest difference between two spectra and is therefore sensitive to single peaks.
#### Focusing on specific features:
Unlike metrics that average differences across all points, this measure highlights the most extreme divergence, focusing on specific features of the spectra. It may overlook small, cumulative differences across the spectra that are significant for overall similarity metrics.

## avg_l_distance
The average of the Manhattan and Chebyshev scores, combining the advantages of the two.

$Distance = \frac{1}{2}(\sum|P_i-Q_i|+\underset{i}{\max}{|P_i-Q_i|})$

$(\sum|P_i-Q_i|)$ evaluates the cumulative intensity difference across all data points, providing a measure of overall similarity between the two spectra.

By including the largest single-point intensity difference, $\underset{i}{\max}{|P_i-Q_i|}$, this formula emphasizes significant outliers or peaks that might represent key differences between the spectra.

Combining these two metrics ensures the score reflects both overall similarity and the presence of critical intensity mismatches, which can be important for identifying unique spectral features.
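
The combination can be sketched directly from the two component distances. The function names follow the section titles on this page; aligned intensity vectors are assumed.

```python
import numpy as np

def manhattan_distance(P, Q):
    return float(np.sum(np.abs(np.asarray(P, dtype=float) - np.asarray(Q, dtype=float))))

def chebyshev_distance(P, Q):
    return float(np.max(np.abs(np.asarray(P, dtype=float) - np.asarray(Q, dtype=float))))

def avg_l_distance(P, Q):
    """Average of the summed and the maximum absolute intensity difference."""
    return 0.5 * (manhattan_distance(P, Q) + chebyshev_distance(P, Q))
```

For `P = [0.5, 0.3, 0.2]` and `Q = [0.2, 0.3, 0.5]`, the Manhattan part is 0.6 and the Chebyshev part is 0.3, so the combined score is 0.45.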

### Application notes
#### Sensitivity to Outliers:
Unlike simpler metrics such as the mean absolute difference, this formula explicitly accounts for the maximum deviation, making it more sensitive to outliers. This can be useful when critical spectral features are concentrated in a few peaks.

#### Highlighting Key Peaks:
By emphasizing the largest differences, the method is particularly effective when the analysis aims to identify unique or significant features in MS2 spectra, such as marker compounds or structural motifs.

#### Normalization:
As with other similarity measures, normalization or standardization may be required to prevent bias from differences in overall intensity scales between spectra.

#### Limitations:
The inclusion of the maximum difference may exaggerate the impact of single outlier peaks, potentially skewing similarity assessments if these peaks are not meaningful. Proper preprocessing, such as noise reduction and peak alignment, is essential.


## baroni_urbani_buser_distance

Calculates the minimum/maximum ratio of intensities at each *m/z*, reflecting the overall similarity.

$Distance = 1-\frac{\sum\min{(P_i,Q_i)}+\sqrt{\sum\min{(P_i,Q_i)}\sum(\max{(P)}-\max{(P_i,Q_i)})}}{\sum\max{(P_i,Q_i)}+\sqrt{\sum\min{(P_i,Q_i)}\sum(\max{(P)}-\max{(P_i,Q_i)})}}$

## fidelity_distance

Measured by calculating the sum of the square root products of intensities at each *m/z*

$Distance = 1-\sum\sqrt{P_{i}Q_{i}}$

This formula sums the square roots of the products of corresponding intensity values from two spectra and then subtracts this sum from 1 to obtain a distance. The idea might be to capture some form of interaction or correlation between the two sets of data.

### Application notes
#### Interaction Capture:
By multiplying and summing the square roots of the intensities, the formula might capture some form of interaction effect between the spectral data points.
#### Sensitivity to Outliers:
Similar to other measures involving square roots or multiplication, this formula might be sensitive to outliers.
#### Scale Dependence:
The formula depends on the scale of the data: $P_i$ and $Q_i$ need to be on the same scale.

## bhattacharya_1_distance
Measuring the similarity of two spectra at each *m/z* by 'arccos' calculation

$Distance = (\arccos{(\sum\sqrt{P_{i}Q_{i}})})^2$

The geometric mean ($\sqrt{P_{i}Q_{i}}$) focuses on the proportional similarity between two spectra at each data point. It is sensitive to the alignment of relative intensities, reducing the impact of absolute differences.

Summing the geometric means aggregates the pointwise similarities, while squaring amplifies the importance of higher alignment across multiple peaks.

The arccosine function provides a bounded angular representation of the similarity, where smaller values indicate greater similarity.
### Application notes
#### Proportional Similarity Emphasis:
This method emphasizes the relative relationship between intensities rather than absolute values. It is particularly useful when comparing spectra with different total intensities or baseline noise levels.
#### Sensitivity to Alignment:
By leveraging the geometric mean, the formula strongly rewards spectra with closely aligned peak intensities, making it effective for identifying structurally related compounds.
#### Normalization:
The method inherently reduces the need for additional normalization because it focuses on proportional intensities rather than absolute differences.
#### Limitations:
Highly sensitive to misaligned peaks, as geometric means drop to near-zero for unaligned intensities.

## bhattacharya_2_distance
Uses a ‘-ln’ calculation instead of 'arccos', making the score more tolerant to the maximum intensity.

$Distance = -\ln{(\sum\sqrt{P_{i}Q_{i}})}$

It uses the same geometric mean and summation as bhattacharya_1_distance, but applies the natural logarithm $\ln$ instead.
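
Both Bhattacharya variants share the sum $\sum\sqrt{P_iQ_i}$ and differ only in the final transform, which the following sketch makes explicit. It assumes unit-sum intensity vectors, so that the sum lies between 0 and 1; the helper name `bhattacharyya_coefficient` and the clipping step are additions for the example.

```python
import numpy as np

def bhattacharyya_coefficient(P, Q):
    P, Q = np.asarray(P, dtype=float), np.asarray(Q, dtype=float)
    # Clipping guards against rounding slightly above 1, which would make arccos fail.
    return float(np.clip(np.sum(np.sqrt(P * Q)), 0.0, 1.0))

def bhattacharya_1_distance(P, Q):
    return float(np.arccos(bhattacharyya_coefficient(P, Q)) ** 2)

def bhattacharya_2_distance(P, Q):
    bc = bhattacharyya_coefficient(P, Q)
    # With no overlapping intensity the logarithm diverges.
    return float("inf") if bc == 0 else float(-np.log(bc))
```

The first variant is bounded by $(\pi/2)^2$ for spectra without common peaks, whereas the second grows without bound, which is worth keeping in mind when converting distances to similarities.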
### Application notes
#### Proportional Alignment Sensitivity:
This metric highlights relative intensity alignment between spectra. It is well-suited for analyzing spectra with differing absolute intensities, such as those resulting from variable ionization efficiencies.
#### Scale Compression:
The logarithmic transformation helps to manage the influence of large sums, making the similarity score robust to outliers or high-intensity peaks.

## canberra_distance
Very sensitive to differences in low-intensity fragment ions

$Distance = \sum\frac{|P_{i}-Q_{i}|}{|P_{i}|+|Q_{i}|}$

This metric evaluates the pairwise proportional difference between corresponding intensity values of two spectra. By dividing the absolute difference $|P_i-Q_i|$ by the sum of absolute intensities $|P_i|+|Q_i|$, it normalizes the differences, ensuring the metric is scale-independent. The sum aggregates these normalized differences across all data points.
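
A short sketch illustrates why low-intensity ions weigh so heavily. Slots where both intensities are zero are skipped to avoid dividing by zero, which is an assumption of this example rather than something stated above.

```python
import numpy as np

def canberra_distance(P, Q):
    P, Q = np.asarray(P, dtype=float), np.asarray(Q, dtype=float)
    denom = np.abs(P) + np.abs(Q)
    mask = denom > 0   # slots where both intensities are zero contribute nothing
    return float(np.sum(np.abs(P - Q)[mask] / denom[mask]))
```

For `P = [0.99, 0.01]` and `Q = [1.0, 0.0]` the spectra differ by only 0.01 in each slot, yet the second slot alone contributes a full 1.0 because the small peak is entirely absent from `Q`.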

### Application notes
#### Proportional Difference Emphasis:
This metric is particularly effective when the goal is to compare the relative intensity distributions of two spectra, regardless of their absolute scale.
#### Scale Independence:
By normalizing differences against the combined intensity, the score is robust to variations in total intensity, making it suitable for comparing spectra obtained under varying experimental conditions.

## clark_distance
Similar to canberra score, amplifying the impact of low-intensity ions by squaring

$Distance = (\frac{1}{N}\sum(\frac{P_i-Q_i}{|P_i|+|Q_i|})^2)^\frac{1}{2}$

## dice_distance
Similar to clark without square root calculation, amplifying influence of intensity difference

$Distance = \frac{\sum(P_i-Q_i)^2}{\sum P_i^2+\sum Q_i^2}$

This metric computes the normalized sum of squared differences between corresponding intensities in two spectra. The denominator sums the squared intensities of both spectra, ensuring normalization.

### Application notes
#### Sensitivity to Differences:
The squared difference enhances sensitivity to significant discrepancies, making it suitable for detecting major deviations in spectral patterns.

#### Overweighting large deviations:
Squaring amplifies the influence of larger differences, which can overshadow minor but meaningful variations in spectra.

## divergence_distance
Treating fragment distribution as a probability distribution to calculate similarity

$Distance = 2\sum\frac{(P_i-Q_i)^2}{(P_i+Q_i)^2}$

### Application notes
#### Balanced effect:
Squaring the differences could highlight the discrepancies between the two spectra and the presence of ${(P_i+Q_i)^2}$ in the denominator might balance the influence of intensity differences.
#### Interpretability:
The formula does not directly account for the overall intensity or distribution of the spectra, which could affect the interpretation of the comparison results.

## harmonic_mean_distance
A variation of [Tanimoto coefficient](https://en.wikipedia.org/wiki/Jaccard_index), which measures similarity between two sets or vectors

$Distance = 1-2\sum(\frac{P_{i}Q_{i}}{P_{i}+Q_{i}})$

### Application notes
#### Balanced Measure:
The formula takes into account both the presence and absence of peaks by considering the product $P_iQ_i$ and the sum $P_i+Q_i$. A peak that is missing from one spectrum contributes zero to the sum, and the denominator $P_i+Q_i$ stays nonzero as long as at least one of the two spectra has a peak at that *m/z*.

## motyka_distance
Mainly focusing on the minimum intensity of shared fragments

$Distance = -\frac{\sum\min{(P_{i},Q_{i})}}{\sum(P_{i}+Q_{i})}$

$\sum\min{(P_{i},Q_{i})}$ calculates the sum of the minimum values between corresponding elements of the two spectra. This captures the overlap or commonality between the two sets of data points.

${\sum(P_{i}+Q_{i})}$ calculates the sum of all values in both spectra combined, representing the total presence or coverage of data points across both spectra.

The ratio represents the proportion of overlap between the two spectra.
### Application notes
#### Balanced Measure:
This formula balances the contribution of common and unique elements in both spectra, providing a comprehensive measure of similarity.
#### Normalization:
By considering the sum of all elements in both spectra, it implicitly normalizes the measure, making it less sensitive to differences in the scale of data.

## intersection_distance
Calculated as the sum of the minimum intensity at each *m/z*, followed by normalization.

$Distance = 1-\frac{\sum\min{(P_{i},Q_{i})}}{\min(\sum{P_{i}},\sum{Q_{i}})}$

This formula calculates a similarity measure by summing the minimum of corresponding intensity values from two spectra and then normalizing this sum by the minimum of the total sums of intensities from both spectra.
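
A minimal sketch of this calculation for aligned intensity vectors (the function name mirrors the section title):

```python
import numpy as np

def intersection_distance(P, Q):
    """1 - (summed per-peak minimum) / (smaller of the two total intensities)."""
    P, Q = np.asarray(P, dtype=float), np.asarray(Q, dtype=float)
    overlap = np.sum(np.minimum(P, Q))
    return float(1.0 - overlap / min(P.sum(), Q.sum()))
```

For `P = [0.5, 0.3, 0.2]` and `Q = [0.2, 0.3, 0.5]` the per-peak minima sum to 0.7, giving a distance of 0.3.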
### Application notes
#### Focus on Common Elements:
By using the minimum of corresponding values, the formula emphasizes the presence of common peaks with similar intensities in both spectra.

#### Robustness:
This measure can be more robust to differences in the scale of the spectra, as it focuses on the overlap of intensities rather than their absolute values.

## roberts_distance
Comprehensively considering the absolute and relative intensities

$Distance = 1-\sum\frac{(P_{i}+Q_{i})\frac{\min{(P_{i},Q_{i})}}{\max{(P_{i},Q_{i})}}}{\sum(P_{i}+Q_{i})}$

$\frac{\min{(P_{i},Q_{i})}}{\max{(P_{i},Q_{i})}}$ represents the ratio of the minimum intensity to the maximum intensity for each ion pair.

The weighted sum $(P_{i}+Q_{i})$ gives more weight to ions with higher intensities, making the similarity measure more sensitive to significant peaks.

The denominator $\sum(P_{i}+Q_{i})$ normalizes the weighted sum, ensuring that the similarity score is comparable across different spectra.
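
The three components described above can be sketched as follows. Slots where both intensities are zero are skipped, since the min/max ratio is undefined there; that handling is an assumption of the example.

```python
import numpy as np

def roberts_distance(P, Q):
    P, Q = np.asarray(P, dtype=float), np.asarray(Q, dtype=float)
    hi = np.maximum(P, Q)
    lo = np.minimum(P, Q)
    mask = hi > 0                      # skip slots where both intensities are zero
    # Each min/max ratio is weighted by the combined intensity of the pair.
    weighted = np.sum((P[mask] + Q[mask]) * lo[mask] / hi[mask])
    return float(1.0 - weighted / np.sum(P + Q))
```

For `P = [0.5, 0.3, 0.2]` and `Q = [0.2, 0.3, 0.5]` the weighted ratios are 0.28, 0.6 and 0.28, so the distance is `1 - 1.16 / 2 = 0.42`.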
### Application notes
#### Focus on Relative Intensities:
The algorithm focuses on the relative intensities of corresponding ions, which can be more meaningful than absolute differences when comparing mass spectra.

#### Weighted Contribution:
The weighted sum ensures that ions with higher intensities contribute more to the similarity score, reflecting their significance in the spectra.

#### Normalization:
The normalization by the sum of intensities makes the similarity score comparable across different spectra, regardless of their overall intensity levels.

## ruzicka_distance
Summing absolute intensity differences and normalizing by the maximum intensity for each pair

$Distance = \frac{\sum{|P_{i}-Q_{i}|}}{\sum{\max(P_{i},Q_{i})}}$

Summing these absolute differences ($|P_i-Q_i|$) gives a total measure of dissimilarity. The denominator ${\sum{\max(P_{i},Q_{i})}}$ normalizes the sum of absolute differences, ensuring that the similarity score is comparable across different spectra.
### Application notes
#### Focus on Relative Differences:
The algorithm focuses on the relative differences between the intensities of corresponding ions, which can be more meaningful than absolute differences when comparing mass spectra.
#### Normalization:
The normalization by the sum of maximum intensities makes the similarity score comparable across different spectra, regardless of their overall intensity levels.
#### Robustness:
It does not assume a specific distribution of the data, which can be advantageous when dealing with real-world mass spectrometry data that may not follow a Gaussian distribution.

## hellinger_distance
Uses the average ion intensity to normalize each intensity, making the score more sensitive to distribution shape.

$Distance = \sqrt{2\sum(\sqrt{\frac{P_i}{\bar{P}}}-\sqrt{\frac{Q_i}{\bar{Q}}})^2}$

The distance between two mass spectra is calculated from the normalized intensities of their peaks. ${\bar{P}}$ and ${\bar{Q}}$ represent the mean intensities.
### Application notes
#### Sensitivity to Relative Changes:
The square root transformation can emphasize smaller differences in intensity, which might be important for detecting subtle changes between spectra.

## whittaker_index_of_association_distance
A simpler version of hellinger, without the square and square root calculations.

$Distance = \frac{1}{2}\sum|\frac{P_i}{\bar{P}}-\frac{Q_i}{\bar{Q}}|$

$\frac{P_i}{\bar{P}}$ and $\frac{Q_i}{\bar{Q}}$ represent the normalized intensities of the ions, where each intensity is divided by the mean intensity of its respective spectrum.

This algorithm calculates the sum of the absolute differences between the normalized intensities of corresponding ions, scaled by a factor of $\frac{1}{2}$.

## lorentzian_distance
Calculated as the sum of the logarithmically transformed intensity differences at each *m/z*, reducing the impact of noise or outliers.

$Distance = \sum{\ln(1+|P_i-Q_i|)}$

The natural logarithm is used to reduce the impact of large differences and provide a more gradual change as differences increase.
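
The damping effect of the logarithm is easy to see in a small sketch (aligned intensity vectors assumed; the function name mirrors the section title):

```python
import numpy as np

def lorentzian_distance(P, Q):
    """Sum of ln(1 + |P_i - Q_i|) over all aligned peaks."""
    P, Q = np.asarray(P, dtype=float), np.asarray(Q, dtype=float)
    return float(np.sum(np.log1p(np.abs(P - Q))))
```

For unnormalized intensities such as `P = [100, 0]` and `Q = [0, 100]`, the plain sum of absolute differences is 200, while this score is only `2 * ln(101)`, roughly 9.2.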
### Application notes
#### Damping of Large Differences:
Because the logarithm grows slowly, large differences between corresponding spectral points contribute proportionally less than in a plain sum of absolute differences, so a few large discrepancies do not dominate the score.

## pearson_correlation_distance
Treating spectra as vectors and representing similarity by linear correlation

$Similarity = \frac{\sum[(Q_i-\bar{Q})(P_i-\bar{P})]}{\sqrt{\sum(Q_i-\bar{Q})^2\sum(P_i-\bar{P})^2}}$

${\sum[(Q_i-\bar{Q})(P_i-\bar{P})]}$ calculates the sum of the products of the deviations of each corresponding value from its mean. This captures the linear relationship between the two sets of data.

${\sqrt{\sum(Q_i-\bar{Q})^2\sum(P_i-\bar{P})^2}}$ normalizes the numerator by the product of the standard deviations of each set, ensuring that the result is independent of the scale of the data.
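
The two terms can be computed directly as written. The sketch below assumes aligned intensity vectors that are not constant (otherwise the denominator is zero); the function name is chosen for the example, and the result agrees with NumPy's `np.corrcoef`.

```python
import numpy as np

def pearson_correlation(P, Q):
    P, Q = np.asarray(P, dtype=float), np.asarray(Q, dtype=float)
    dP, dQ = P - P.mean(), Q - Q.mean()          # deviations from the mean
    return float(np.sum(dQ * dP) / np.sqrt(np.sum(dQ**2) * np.sum(dP**2)))
```

The value is 1 for any spectrum that is a positive linear rescaling of the other, and it becomes negative when intense peaks in one spectrum coincide with weak peaks in the other.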
### Application notes
#### Sensitivity to Linear Relationships:
This measure is sensitive to linear relationships between the two sets of data, making it useful for identifying correlated changes in spectral data.
#### Normalization:
By normalizing the covariance, the formula provides a measure that is not affected by the magnitude of the data, focusing instead on the pattern of relationship.


## symmetric_chi_squared_distance
Based on chi-square test, similarity is assessed by comparing the sum of squares of intensities at each *m/z*

$Distance =  \sqrt{\sum{\frac{\bar{P}+\bar{Q}}{N(\bar{P}+\bar{Q})^2}\frac{(P_i\bar{Q}-Q_i\bar{P})^2}{P_i+Q_i}\ }}$

$\frac{(P_i\bar{Q}-Q_i\bar{P})^2}{P_i+Q_i}$ represents the squared difference between the intensities of corresponding ions, weighted by the sum of the intensities of the ions.

${\sum{\frac{\bar{P}+\bar{Q}}{N(\bar{P}+\bar{Q})^2}}}$ normalizes the weighted squared differences by the mean intensities of the spectra and the total number of ions, ensuring that the similarity score is comparable across different spectra.

Taking the square root of the sum provides a distance measure, which is more intuitive and comparable across different data sets.
### Application notes
#### Normalization:
The algorithm normalizes the differences by the mean intensities and the total number of ions, making the similarity score comparable across different spectra.
#### Robustness:
The use of squared differences makes the algorithm sensitive to larger disparities, but the normalization helps to balance this sensitivity.
#### Sensitivity to Outliers:
The algorithm is sensitive to outliers because squaring differences amplifies the impact of large deviations.

## probabilistic_symmetric_chi_squared_distance
Similar to symmetric_chi_squared, but normalized by intensity sum for each pair

$Distance = \frac{1}{2} \times \sum\frac{(P_{i}-Q_{i}\ )^2}{P_{i}+Q_{i}\ }$

The squared differences $(P_{i}-Q_{i})^2$ are divided by $(P_i + Q_i)$, which normalizes the differences and reduces the impact of very intense peaks. Summing these normalized squared differences gives a total measure of dissimilarity.
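
A minimal sketch of the formula for aligned, non-negative intensity vectors. Slots where both intensities are zero are skipped to avoid dividing by zero, which is an assumption of this example.

```python
import numpy as np

def probabilistic_symmetric_chi_squared_distance(P, Q):
    P, Q = np.asarray(P, dtype=float), np.asarray(Q, dtype=float)
    total = P + Q
    mask = total > 0   # slots where both intensities are zero contribute nothing
    return float(0.5 * np.sum((P - Q)[mask] ** 2 / total[mask]))
```

For unit-sum spectra the result lies between 0 (identical) and 1 (no common peak).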
### Application notes
#### Robustness:
The use of squared differences makes the algorithm sensitive to larger disparities, but the normalization by the sum of intensities helps to balance this sensitivity.

## vicis_symmetric_chi_squared_3_distance
$Distance = \sum\frac{(P_i-Q_i)^2}{\max{(P_i,Q_i)}}$

This distance sums the squared differences between the intensities of corresponding ions, each normalized by the larger of the two intensities in the pair.
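
A minimal sketch of this calculation, assuming `p` and `q` are equal-length lists of intensities for peaks already aligned by *m/z*. The function name and the handling of pairs where both intensities are 0 are assumptions for illustration only.

```python
def vicis_symmetric_chi_squared_3_distance(p, q):
    """Sum of (P_i - Q_i)^2 / max(P_i, Q_i) over aligned peaks."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        largest = max(p_i, q_i)
        if largest > 0:  # assumption: skip pairs where both intensities are 0
            total += (p_i - q_i) ** 2 / largest
    return total


identical = vicis_symmetric_chi_squared_3_distance([0.5, 0.5], [0.5, 0.5])
disjoint = vicis_symmetric_chi_squared_3_distance([1.0, 0.0], [0.0, 1.0])
```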

### Application notes
#### Limited Statistical Significance:
The algorithm does not provide a statistical significance measure for the similarity score, which can make it difficult to determine whether the observed similarity is due to chance or a true match.
#### Normalization & Robustness

## wave_hedges_distance
Similar to vicis_symmetric_chi_squared_3, but without squaring the intensity difference.

$Distance = \sum\frac{|P_i-Q_i|}{\max{(P_i,Q_i)}}$

This distance sums the absolute differences between the intensities of corresponding ions, each normalized by the larger of the two intensities in the pair.
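
A minimal sketch of this calculation, assuming `p` and `q` are equal-length lists of intensities for peaks already aligned by *m/z*. The function name and the skipping of empty pairs are illustrative assumptions.

```python
def wave_hedges_distance(p, q):
    """Sum of |P_i - Q_i| / max(P_i, Q_i) over aligned peaks."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        largest = max(p_i, q_i)
        if largest > 0:  # assumption: skip pairs where both intensities are 0
            total += abs(p_i - q_i) / largest
    return total


identical = wave_hedges_distance([0.5, 0.5], [0.5, 0.5])
disjoint = wave_hedges_distance([1.0, 0.0], [0.0, 1.0])
```

Each peak pair contributes a value between 0 and 1, so a peak present in only one spectrum always adds exactly 1, regardless of its intensity.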
### Application notes
#### Normalization & Robustness
#### Limited Statistical Significance

## squared_chord_distance
This distance sums the squared differences between the square roots of the intensities at each *m/z*.

$Distance = \sum(\sqrt{P_{i}}-\sqrt{Q_{i}})^2$

$\sqrt{P_{i}}$ and $\sqrt{Q_{i}}$ are the square roots of the intensities of the ions. This transformation reduces the impact of very intense ions and makes the measure more comparable across different data sets.

Squaring the differences ($(\sqrt{P_{i}}-\sqrt{Q_{i}})^2$) emphasizes larger disparities and minimizes the effect of smaller ones. Summing these squared differences gives a total measure of dissimilarity.
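
The two steps above (square-root transform, then squared difference) can be sketched as follows, assuming `p` and `q` are equal-length lists of non-negative intensities for peaks already aligned by *m/z*. The function name is an illustrative assumption.

```python
import math


def squared_chord_distance(p, q):
    """Sum of (sqrt(P_i) - sqrt(Q_i))^2 over aligned peaks."""
    return sum((math.sqrt(p_i) - math.sqrt(q_i)) ** 2 for p_i, q_i in zip(p, q))


identical = squared_chord_distance([0.5, 0.5], [0.5, 0.5])
disjoint = squared_chord_distance([1.0, 0.0], [0.0, 1.0])
```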

### Application notes
#### Robustness:
The square root transformation reduces the impact of extreme intensity, making the similarity measure more comparable across different spectra. The use of squared differences makes the algorithm sensitive to larger disparities, but the square root transformation helps to balance this sensitivity.

#### Sensitivity to Outliers:
Like standard Euclidean distance, this measure can be sensitive to outliers because squaring differences amplifies the impact of large deviations.



## ms_for_id_distance
This distance considers the weighted sum of intensities, the number of matching fragment ions, and the differences in intensity and *m/z*.

$Distance = -\frac{N_m^b(\sum I_{q,i}+2\sum I_{r,i})^c}{(N_q+2N_r)^d+\sum|I_{q,i}-I_{r,i}|+\sum|M_{q,i}-M_{r,i}|},\ \ b=4,\ c=1.25,\ d=2$

Only peaks with intensity > 0.05 are kept before the calculation.

$N_m$: number of matching fragments

$N_q, N_r$: number of fragments in the query and reference spectrum

$M_q, M_r$: *m/z* of a peak in the query and reference spectrum

$I_q, I_r$: intensity of a peak in the query and reference spectrum

$N_m^b$ scales the score by the number of matching fragments; with $b=4$, additional matches increase the score strongly.

$\sum I_{q,i}+2\sum I_{r,i}$ represents the sum of the intensities of the ions in the two spectra, with the reference spectrum's intensities weighted by a factor of 2.

$\sum|M_{q,i}-M_{r,i}|$ represents the sum of the absolute differences in *m/z* of corresponding ions.

$\sum|I_{q,i}-I_{r,i}|$ represents the sum of the absolute differences in intensities of corresponding ions.

The denominator normalizes the similarity score by the total number of ions and the differences in intensities and mass-to-charge ratios. The exponent $d=2$ amplifies the effect of the total number of ions.
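
A minimal sketch of the formula, not the MSanalyst implementation. It assumes the peaks have already been filtered and matched, that `matched` is a list of `(mz_q, intensity_q, mz_r, intensity_r)` tuples, and that all sums run over the matched pairs only; the function name and this input layout are hypothetical.

```python
def ms_for_id_distance(matched, n_q, n_r, b=4, c=1.25, d=2):
    """ms_for_id distance from matched peak pairs and fragment counts.

    matched: list of (mz_q, int_q, mz_r, int_r) tuples (assumed layout).
    n_q, n_r: number of fragments in the query and reference spectrum.
    """
    n_m = len(matched)
    sum_i_q = sum(i_q for _, i_q, _, _ in matched)
    sum_i_r = sum(i_r for _, _, _, i_r in matched)
    diff_i = sum(abs(i_q - i_r) for _, i_q, _, i_r in matched)
    diff_m = sum(abs(m_q - m_r) for m_q, _, m_r, _ in matched)
    numerator = n_m ** b * (sum_i_q + 2 * sum_i_r) ** c
    denominator = (n_q + 2 * n_r) ** d + diff_i + diff_m
    return -numerator / denominator


two_matches = ms_for_id_distance(
    [(100.0, 0.5, 100.0, 0.5), (150.0, 0.5, 150.0, 0.5)], n_q=2, n_r=2
)
one_match = ms_for_id_distance([(100.0, 0.5, 100.0, 0.5)], n_q=2, n_r=2)
```

Because of the leading minus sign, a better match gives a more negative distance: here `two_matches` is smaller than `one_match`.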
### Application notes
#### Normalization:
The algorithm normalizes the similarity score by the total number of ions and the differences in intensities and mass-to-charge ratios, making the score comparable across different spectra.
#### Robustness:
The use of absolute differences and the normalization by the total number of ions make the algorithm robust to variations in the magnitudes of the spectra.


## ms_for_id_v1_distance
Similar to ms_for_id, but with different weight factors.

$Similarity = \frac{N_m^4}{N_qN_r(\sum|I_{q,i}-I_{r,i}|)^a},\ a=0.25$

$Distance = \frac{1}{Similarity}$

$N_m$: number of matching fragments

$N_q, N_r$: number of fragments in the query and reference spectrum

$\frac{N_m^4}{N_qN_r}$ normalizes the similarity score by the number of matching fragments and the total number of fragments in the two spectra. The exponent 4 on $N_m$ amplifies the effect of the number of matching fragments.

$\sum|I_{q,i}-I_{r,i}|$ represents the sum of the absolute differences in intensities of corresponding ions.

The exponent $a=0.25$ reduces the impact of the intensity differences, making the similarity score less sensitive to large differences in intensities.
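
A minimal sketch of the two formulas, not the MSanalyst implementation. It assumes `matched` is a list of `(intensity_q, intensity_r)` pairs for the matched fragments; the function name, the input layout, and the handling of the two edge cases (no matches, identical intensities) are assumptions for the example.

```python
import math


def ms_for_id_v1_distance(matched, n_q, n_r, a=0.25):
    """ms_for_id_v1 distance, computed directly as 1 / Similarity.

    matched: list of (int_q, int_r) pairs for matched fragments (assumed layout).
    n_q, n_r: number of fragments in the query and reference spectrum.
    """
    n_m = len(matched)
    if n_m == 0:
        return math.inf  # assumption: Similarity is 0 when nothing matches
    diff_i = sum(abs(i_q - i_r) for i_q, i_r in matched)
    # 1 / Similarity = N_q * N_r * (sum |I_q - I_r|)^a / N_m^4.
    # Written this way, identical intensities (diff_i == 0) give a distance
    # of 0 instead of a division by zero in Similarity.
    return n_q * n_r * diff_i ** a / n_m ** 4


close = ms_for_id_v1_distance([(0.5, 0.25), (0.5, 0.75)], n_q=2, n_r=2)
identical = ms_for_id_v1_distance([(0.5, 0.5), (0.5, 0.5)], n_q=2, n_r=2)
```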

### Application notes
#### Normalization:
The algorithm normalizes the similarity score by the number of matching fragments and the total number of fragments, making the score comparable across different spectra.

#### Robustness:
The use of absolute differences and the exponent $a$ makes the algorithm robust to variations in the magnitudes of the spectra.
