biorack/simile

SIMILE

SIMILE (Significant Interrelation of MS/MS Ions via Laplacian Embedding) is a Python library for interrelating fragmentation spectra with significance estimation. It is robust to multiple differences in chemical structure. See the Nature Communications manuscript for details.

New in V2:

  • Precursor-based neutral loss difference counts can be used in addition to the original m/z difference counts
  • Maximum weight matching replaces the original monotonic alignment method, with improved performance
  • Multiple matching is available in addition to the original pairwise matching, for fragment-centric analyses
  • Multiple comparison statistics
  • Much faster mass delta counting and significance testing
  • Matching ions report summarizing all scores and mass deltas with metadata

Figure: SIMILE Flow

Installation

Use the conda package manager to create an environment from environment-base.yml, which covers the minimum requirements. Alternatively, use environment.yml to run the example notebook.

conda env create -f environment-base.yml

Python dependencies

  • python3 (pinned to 3.7 currently due to non-SIMILE bugs)
  • numpy
  • scipy
  • pandas

Usage

import simile as sml

# Generate fragmentation similarity matrix
# (mzs, pmzs, and tolerance must be defined by the caller beforehand)
S, spec_ids = sml.similarity_matrix(mzs, pmzs=pmzs, tolerance=tolerance)

# Generate max weight matching for similarity matrix
M = sml.multiple_match(S, spec_ids)

# Generate pro/con comparison matrix such that 
# symmetric matches are 1 (pro) and
# asymmetric matches are -1 (con)
C = sml.sym_compare(M, spec_ids)

# Calculate significance of max weight matching between fragment ions
# for all combinations of spectra
spec_scores, pval, null_dist = sml.z_test(S, M, C, spec_ids, return_dist=True, log_size=5)

# Report back mass deltas and scores for simile comparison
df = sml.matching_ions_report(S, M, C, mzs, pmzs)
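
The idea of maximum weight matching can be illustrated independently of SIMILE. The sketch below assumes a simple bipartite setting (fragments of one spectrum against fragments of another) and uses SciPy's `linear_sum_assignment` with `maximize=True`. The score matrix `S_ab` is made up for illustration, and this is not necessarily how `sml.multiple_match` is implemented.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical similarity scores (illustrative values only):
# rows are fragments of spectrum A, columns are fragments of spectrum B.
S_ab = np.array([
    [0.9, 0.8, 0.0],
    [0.8, 0.1, 0.0],
    [0.0, 0.0, 0.5],
])

# Choose one column per row so that the summed similarity is maximal.
# A greedy pick of the 0.9 entry first would leave row 1 with at most 0.1,
# whereas the optimal matching pairs both 0.8 entries instead.
rows, cols = linear_sum_assignment(S_ab, maximize=True)
total = S_ab[rows, cols].sum()

for r, c in zip(rows, cols):
    print(f"A fragment {r} <-> B fragment {c} (score {S_ab[r, c]:.1f})")
print(f"total matched score: {total:.1f}")
```

The example shows why a globally optimal matching can differ from picking the best-scoring pair first.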

Theory

At its core, SIMILE contributes two concepts to the analysis of tandem mass spectrometry:


1. A similarity measure between fragment ions based on the fragmentation process. This similarity measure is defined to satisfy two expected properties of the fragmentation process:

  • (a) Fragment ions are similar if the difference in mass between them is common.
  • (b) Fragment ions are similar if their ancestor and descendant fragment ions are similar.

Property (a) is satisfied by constructing a transition matrix with row-normalized mass difference frequencies as transition probabilities. This corresponds to a "shortest path" distance between fragment ions.
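
As a rough illustration of property (a), the sketch below builds a row-normalized transition matrix from how often each binned mass difference occurs among a set of fragment ions. The function name `transition_matrix`, the binning scheme, and the example m/z values are assumptions made for this example; SIMILE's own counting (for instance across spectra, or with precursor-based neutral losses) may differ.

```python
import numpy as np

def transition_matrix(mzs, tolerance=0.01):
    """Row-normalized mass-difference frequencies (illustrative sketch).

    Assumes at least two fragment ions.
    """
    mzs = np.asarray(mzs, dtype=float)
    n = mzs.size
    # Pairwise absolute m/z differences between all fragment ions.
    diffs = np.abs(mzs[:, None] - mzs[None, :])
    # Bin differences so that deltas equal within the tolerance share a bin.
    bins = np.rint(diffs / tolerance).astype(np.int64)
    # Count how often each binned difference occurs among off-diagonal pairs.
    off_diag = ~np.eye(n, dtype=bool)
    _, inverse, counts = np.unique(
        bins[off_diag], return_inverse=True, return_counts=True
    )
    freq = np.zeros((n, n))
    freq[off_diag] = counts[np.ravel(inverse)]
    # Normalize each row so the frequencies become transition probabilities.
    return freq / freq.sum(axis=1, keepdims=True)

# Hypothetical fragment m/z values; the delta of 18 occurs twice.
P = transition_matrix([100.0, 118.0, 136.0, 150.0])
print(np.round(P, 2))
```

In this toy input the mass difference of 18 is the most common one, so transitions along that difference receive the highest probability in their rows.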

Property (b) is satisfied by converting the transition matrix of (a) into the pseudo-inverse of its (normalized) Laplacian. This corresponds to an "average commute time" distance between fragment ions, with the following intuition: if, instead of taking the shortest path between x and y, we meander about according to the transition matrix, how long will it take on average to wander from x to y and back to x?

This notion of "average commute time" distance captures property (b) because if x and y are similar, then their parents and children are similar; and if fragment ions are similar, then the transition probability between them is high. In other words, the paths walked when meandering between x and y are enriched with the ancestors and descendants of x and y. Therefore, if x and y share no (or few) ancestors or descendants, the time to meander between them is comparatively longer than if they do.
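
The link between the Laplacian pseudo-inverse and average commute time can be sketched with NumPy. The example assumes a symmetric, non-negative, connected weight matrix `W` (for instance, un-normalized mass-difference frequencies) and uses the standard random-walk result that commute times can be read off the pseudo-inverse of the symmetric normalized Laplacian. The function name `commute_times` is hypothetical, and this is not SIMILE's implementation.

```python
import numpy as np

def commute_times(W):
    """Average commute times for a random walk on a weighted graph (sketch)."""
    W = np.asarray(W, dtype=float)
    d = W.sum(axis=1)          # node degrees
    vol = d.sum()              # graph volume (sum of all degrees)
    inv_sqrt_d = 1.0 / np.sqrt(d)
    # Symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}
    L = np.eye(len(W)) - inv_sqrt_d[:, None] * W * inv_sqrt_d[None, :]
    G = np.linalg.pinv(L)      # Moore-Penrose pseudo-inverse
    # Rescale by degrees, then combine entries into pairwise commute times.
    K = inv_sqrt_d[:, None] * G * inv_sqrt_d[None, :]
    diag = np.diag(K)
    return vol * (diag[:, None] + diag[None, :] - 2.0 * K)

# Toy path graph 0 - 1 - 2 with unit weights.
W = np.array([
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
])
C = commute_times(W)
print(np.round(C, 3))
```

On this path graph the two end nodes are further apart in commute time than either is from the middle node, matching the intuition that nodes with fewer shared neighbours take longer to wander between.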

2. A null distribution for spectral similarity which leverages intraspectral comparisons to add confidence to interspectral comparisons.

Using an outdated analogy for the fragmentation process, fragment ions are generated from "parent" ions and generate "child" ions. We can extend this analogy to include "sibling" ions by noting that siblings are more similar to each other than to their parents or children.

By leveraging SIMILE's fragment ion similarity measure, which conforms to this analogy, we can ask how likely it is that the fragment ions matched up between fragmentation spectra by SIMILE are siblings. Taking this line of reasoning to its natural conclusion yields a null distribution, generated by permuting intra- and interspectral fragment similarity scores, from which p-values are obtained.
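
The general idea of a permutation-based null distribution can be sketched as follows. This is a generic one-sided permutation test on the difference in mean score, not SIMILE's actual statistic (the Usage section calls `sml.z_test`). The function name `permutation_pvalue`, the choice of test statistic, and the example scores are all assumptions made for illustration.

```python
import numpy as np

def permutation_pvalue(inter_scores, intra_scores, n_perm=2000, seed=0):
    """One-sided permutation test on mean(inter) - mean(intra) (sketch)."""
    rng = np.random.default_rng(seed)
    inter = np.asarray(inter_scores, dtype=float)
    intra = np.asarray(intra_scores, dtype=float)
    observed = inter.mean() - intra.mean()
    pooled = np.concatenate([inter, intra])
    null = np.empty(n_perm)
    for k in range(n_perm):
        # Shuffle the inter/intra labels by permuting the pooled scores.
        perm = rng.permutation(pooled)
        null[k] = perm[:inter.size].mean() - perm[inter.size:].mean()
    # Add-one correction keeps the p-value strictly positive.
    p = (1 + np.sum(null >= observed)) / (1 + n_perm)
    return p, null

# Hypothetical similarity scores (illustrative values only).
inter = [0.90, 0.80, 0.85, 0.95, 0.90]
intra = [0.10, 0.20, 0.15, 0.05, 0.10]
p, null = permutation_pvalue(inter, intra)
print(f"p-value: {p:.4f}")
```

Under the null hypothesis the inter/intra labels are exchangeable, so the shuffled statistics approximate the distribution the observed value is compared against.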

Ongoing research on multiple comparison explores using the asymmetry of the SIMILE max weight matching matrix as an alternative way to generate a null distribution.

Contributing

Pull requests are welcome.

For major changes, please open an issue first to discuss what you would like to change.

License

Modified BSD

Acknowledgements

The development of SIMILE was made possible by: