README | Stats: varying all parameters | Stats: comparing similarity algorithms
Write a similarity algorithm that finds the best matches among predicted mass spectra for an experimental mass spectrum.
Same as mgf-parser.
Definition: "Dereplication is a process used in recognising and eliminating the active substances that have already been studied in the early stage of the screening process." - https://www.thefreedictionary.com/Dereplication
Client: UNIGE
- Parse experimental and predicted mass spectra data that are in .mgf, .csv and .tsv formats. Then generate two JSON files with the data (`experiments.json` and `predictions.json`).
- Verify whether each experimental spectrum has a match in the predictions using the idCode.
- Generate a file `matchingExperiments.json` with only the spectra that have a corresponding prediction.
- Write a method to load the experimental and predicted data JSON files and apply a weighted merge to the X values of all the spectra.
- Write a method `similarity()` that returns the similarity between two spectra. The similarity function should be an option; the default is `cosine()`. Main functions used: `align()`, `norm()` and `cosine()`.
- Write a function `findBestMatches()` that runs `similarity()` for one experimental spectrum and an array of predicted spectra. It should return the best matches and meta-information.
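The last two tasks can be sketched in plain JavaScript as follows. This is only an illustration of the align / normalize / cosine pipeline, not the project's actual code: the `*Sketch` function names, the `{ idCode, x, y }` spectrum shape, the `delta` tolerance and the `numberBestMatches` option are all assumptions, and the real implementation relies on `ml-spectra-processing`, `ml-array-normed` and `ml-distance` instead of these hand-written helpers.

```javascript
// Hypothetical helper: pairs peaks of two spectra whose x values differ by at
// most `delta`. A peak without a partner gets intensity 0 on the other side.
// Both spectra are assumed to be { x: number[], y: number[] } sorted by x.
function alignSketch(a, b, delta) {
  const ya = [];
  const yb = [];
  let i = 0;
  let j = 0;
  while (i < a.x.length && j < b.x.length) {
    if (Math.abs(a.x[i] - b.x[j]) <= delta) {
      ya.push(a.y[i++]);
      yb.push(b.y[j++]);
    } else if (a.x[i] < b.x[j]) {
      ya.push(a.y[i++]);
      yb.push(0);
    } else {
      ya.push(0);
      yb.push(b.y[j++]);
    }
  }
  // Peaks left over in either spectrum have no partner.
  for (; i < a.x.length; i++) {
    ya.push(a.y[i]);
    yb.push(0);
  }
  for (; j < b.x.length; j++) {
    ya.push(0);
    yb.push(b.y[j]);
  }
  return { ya, yb };
}

// Scales a vector to unit Euclidean length (an all-zero vector is copied as-is).
function normSketch(v) {
  const length = Math.sqrt(v.reduce((sum, value) => sum + value * value, 0));
  return length === 0 ? v.slice() : v.map((value) => value / length);
}

// Cosine similarity of two equal-length vectors; 0 if either one is all zeros.
function cosineSketch(p, q) {
  let dot = 0;
  let pp = 0;
  let qq = 0;
  for (let k = 0; k < p.length; k++) {
    dot += p[k] * q[k];
    pp += p[k] * p[k];
    qq += q[k] * q[k];
  }
  return pp === 0 || qq === 0 ? 0 : dot / Math.sqrt(pp * qq);
}

// The similarity function is an option and defaults to the cosine.
function similaritySketch(experiment, prediction, options = {}) {
  const { delta = 0.01, algorithm = cosineSketch } = options;
  const { ya, yb } = alignSketch(experiment, prediction, delta);
  return algorithm(normSketch(ya), normSketch(yb));
}

// Scores every prediction against one experiment and keeps the top ones.
function findBestMatchesSketch(experiment, predictions, options = {}) {
  const { numberBestMatches = 3 } = options;
  return predictions
    .map((prediction) => ({
      idCode: prediction.idCode,
      similarity: similaritySketch(experiment, prediction, options),
    }))
    .sort((a, b) => b.similarity - a.similarity)
    .slice(0, numberBestMatches);
}

// Made-up spectra: the idCodes 'A' and 'B' are placeholders, not real ID codes.
const experiment = { idCode: 'A', x: [50, 100, 150], y: [10, 100, 30] };
const predictions = [
  { idCode: 'B', x: [60, 120], y: [100, 50] },
  { idCode: 'A', x: [50, 100, 150], y: [10, 100, 30] },
];
const matches = findBestMatchesSketch(experiment, predictions, { delta: 0.5 });
console.log(matches);
```

In this toy example the prediction with identical peaks ranks first and the one with no peak in common ranks last. Returning the rank of the correct prediction alongside the scores would be one way to provide the meta-information mentioned above.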
- `mgf-parser`: To parse the mgf files
- `papaparse`: To parse the csv files
- `openchemlib`: Obtain the id code from SMILES
- `ml-array-xy-weighted-merge`: Weighted merge of x values that are too close to each other in `loadData.js`
- `ml-spectra-processing`: Use the `align` method to align experimental and predicted spectrum for similarity
- `ml-distance`: Access to the similarity algorithms (e.g. cosine similarity)
- `ml-array-normed`: To normalize the aligned spectra
- `ml-array-min`, `ml-array-max`, `ml-array-mean`, `ml-array-median`: To generate stats on the matchIndex of many experiments
- `debug` (dev): to output things from `testSimilarity()` in the console
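To illustrate what a weighted merge of close x values does, here is a minimal stand-alone sketch. It is not the code of `ml-array-xy-weighted-merge`, whose exact grouping rule may differ; the function name `weightedMergeSketch`, the `groupWidth` parameter and the rule "start a new group when the gap to the previous point exceeds `groupWidth`" are assumptions made for this example.

```javascript
// Hypothetical sketch: merges consecutive points whose x values are closer
// than `groupWidth`. The merged x is the intensity-weighted mean of the group
// and the merged y is the sum of its intensities. Input x must be sorted.
function weightedMergeSketch({ x, y }, groupWidth) {
  const result = { x: [], y: [] };
  let weightedSum = 0; // sum of x * y in the current group
  let intensitySum = 0; // sum of y in the current group

  const flush = () => {
    // Groups with zero total intensity are dropped (no meaningful mean).
    if (intensitySum > 0) {
      result.x.push(weightedSum / intensitySum);
      result.y.push(intensitySum);
    }
    weightedSum = 0;
    intensitySum = 0;
  };

  for (let i = 0; i < x.length; i++) {
    // A gap larger than groupWidth closes the current group.
    if (i > 0 && x[i] - x[i - 1] > groupWidth) flush();
    weightedSum += x[i] * y[i];
    intensitySum += y[i];
  }
  flush();
  return result;
}

// Made-up data: the first two peaks are 0.002 apart and get merged.
const merged = weightedMergeSketch(
  { x: [100.0, 100.002, 150.0], y: [30, 10, 50] },
  0.01,
);
console.log(merged);
```

The merged peak sits closer to the more intense of the two original peaks, which is the point of weighting by intensity rather than taking a plain average.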
While parsing the database, Node.js threw the following error:
FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
The file was too big for Node.js to handle with its default heap size. To work around this, we can raise the limit with the `--max-old-space-size` option (the value is in megabytes), like so:
node --max-old-space-size=8192 -r esm combineMgfCsv.js
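To check that the option was actually picked up, one possibility is to print the heap limit from inside the script. This is a small sketch using Node's built-in `v8` module; the variable name `heapLimitMb` is ours, and the snippet uses `require`, which would become an `import` in a file loaded through `esm`.

```javascript
const v8 = require('v8');

// heap_size_limit is reported in bytes; convert it to megabytes.
const heapLimitMb = v8.getHeapStatistics().heap_size_limit / (1024 * 1024);
console.log(`V8 heap limit: ${Math.round(heapLimitMb)} MB`);
```

The reported value is usually somewhat higher than the number passed on the command line, because the limit covers more than the old space alone.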
We first tried to use the InChI to see whether each experiment has a prediction. However, this did not work, because many experiments did not have one. Therefore, we had to find another unique identifier to match predictions and experiments.
We then decided to use `openchemlib` to generate the ID code of each molecule from its SMILES. Yet again, around half of the experiments did not have a valid SMILES. It appears that each molecule has either a SMILES or an InChI.
We will therefore have to find a way to convert InChI into SMILES, so that we can work with the other half of the experimental data.
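Once every entry carries an ID code, checking which experiments have a prediction comes down to a set lookup. The sketch below assumes each entry is an object with an `idCode` property, presumably computed beforehand with `openchemlib` (for instance via `Molecule.fromSmiles(smiles).getIDCode()`). The function name `splitByPrediction`, the `name` field and the sample values are hypothetical.

```javascript
// Hypothetical sketch: separates experiments that have a prediction with the
// same idCode from those that do not (including entries with no idCode at all,
// e.g. because no valid SMILES was available).
function splitByPrediction(experiments, predictions) {
  const predictedIdCodes = new Set(predictions.map((p) => p.idCode));
  const matching = [];
  const unmatched = [];
  for (const experiment of experiments) {
    if (experiment.idCode && predictedIdCodes.has(experiment.idCode)) {
      matching.push(experiment);
    } else {
      unmatched.push(experiment);
    }
  }
  return { matching, unmatched };
}

// Placeholder idCodes, not real openchemlib ID codes.
const sampleExperiments = [
  { name: 'exp1', idCode: 'code-1' },
  { name: 'exp2', idCode: 'code-9' },
  { name: 'exp3', idCode: undefined }, // no valid SMILES, so no idCode
];
const samplePredictions = [{ idCode: 'code-1' }, { idCode: 'code-2' }];

const { matching, unmatched } = splitByPrediction(
  sampleExperiments,
  samplePredictions,
);
console.log(matching.length, unmatched.length);
```

Keeping the unmatched entries in a separate array makes it easy to count how many experiments are lost for lack of an identifier.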
We have documented the results of the tests we ran, varying a handful of parameters, in dereplicationData.md and dereplicationStats.md.