Building the MassBank data

Here we describe the scripts used to generate the following data and load it into our SQLite database (DB):

  • Candidate sets
  • MS² matching scores (CFM-ID, MetFrag and SIRIUS)
  • Molecular fingerprints and descriptors
  • ...

We start with the initial SQLite DB: massbank__2020.11__v0.6.1.sqlite. This DB was generated as described in the Methods section "Pre-processing pipeline for raw MassBank records" of the manuscript, based on the MassBank release 2020.11, using the "massbank2db" package (version 0.6.1). Please note that the latest package version is 0.9.0, but the code for parsing and grouping the MassBank records has remained unchanged.
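If you want to take a quick look at the initial DB before running any of the scripts below, the standard Python sqlite3 module is sufficient. This is a minimal sketch; the table layout is whatever the "massbank2db" package created and is not documented here:

```python
import sqlite3

# open the initial DB and list its tables
conn = sqlite3.connect("massbank__2020.11__v0.6.1.sqlite")
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)
conn.close()
```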

Preparation

Python

To re-generate the database, the following Python packages need to be installed (preferably into a conda environment):

  • "massbank2db": Contains the routines to convert the MassBank spectra to the input format of the insilico tools.
  • "rosvm": Provides functionality to compute the molecule fingerprint feature representations.
  • "ssvm": Provides functionality to convert counting fingerprints into binary representations for the efficient computation of MinMax kernels on integer vectors.
  • "matchms": Provides routines to compute the similarity between MS² spectra needed for the CFM-ID score computation.
  • "rdkit": Provides routines to compute the molecular descriptor features.

R

An R installation is required to compute the ClassyFire molecule classes. Furthermore, the following packages need to be installed:

Some general remarks

  • The scripts that modify the DB always create a copy of the DB and add the new information (e.g. scores or features) to the copy, preserving the original DB. This is purely a precautionary measure; you can modify this behavior if desired.
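All import scripts described below follow this copy-then-modify pattern. As a rough sketch (with a hypothetical output file name and table; the actual scripts define their own names and schemas), the pattern looks like this:

```python
import shutil
import sqlite3

# (1) work on a copy so that the input DB is never modified
shutil.copy("massbank__input.sqlite", "massbank__with_new_info.sqlite")

# (2) add the new information (e.g. scores or features) to the copy only
conn = sqlite3.connect("massbank__with_new_info.sqlite")
with conn:  # commits on success, rolls back on error
    conn.execute("CREATE TABLE IF NOT EXISTS new_info (spectrum TEXT, value REAL)")
    conn.execute("INSERT INTO new_info VALUES (?, ?)", ("EXAMPLE_SPECTRUM", 0.42))
conn.close()
```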

Generating the molecular candidate sets

The candidate sets were generated using the SIRIUS software by Dr. Kai Dührkop (developer of SIRIUS). SIRIUS uses PubChem as its molecular structure DB and returns candidate sets limited to molecules with the ground-truth molecular formula of the particular MassBank spectrum. It is important to note that neither the GUI nor the CLI tool of SIRIUS was used for the candidate set and MS² score generation. Instead, the non-public internal SIRIUS library was used, which allows the score prediction in a structure-disjoint fashion. That means, for each MassBank spectrum a CSI:FingerID (the prediction backend of SIRIUS) model was used that was not trained on its ground-truth structure. This setting was chosen to prevent overfitting.
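Conceptually, the structure-disjoint prediction can be pictured as follows. This is purely an illustration of the idea; the internal SIRIUS library is not public and its actual bookkeeping differs, and the fold mapping below is hypothetical:

```python
# hypothetical mapping: first InChIKey block of a structure -> CV fold in which
# this structure was left out during CSI:FingerID training
left_out_fold = {"FMGSKLZLMKYGDP": 3, "JYGXADMDTFJGBT": 7}

def model_for_spectrum(ground_truth_inchikey1, models_by_fold):
    # pick the model that has never seen the spectrum's ground-truth structure,
    # i.e. the model of the fold in which that structure was left out
    return models_by_fold[left_out_fold[ground_truth_inchikey1]]
```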

Generating the "SIRIUS ready" ms-files

The following script call was used to generate the "SIRIUS ready" ms-files:

python create_insilico_tool_outputs_for_database.py massbank__2020.11__v0.6.1.sqlite sirius

A directory tools/sirius will be created with a sub-directory for each MassBank group (see Methods "Pre-processing pipeline for raw MassBank records"), containing one ms-file (*.ms) per group of original MassBank accessions (see Methods "Pre-computing the MS² matching scores"). For example, "AU22543794" in "AU_001" relates to the original MassBank accessions "AU300907", "AU300908", "AU300909", "AU300910" and "AU300911". The file tools/sirius/AU_001/AU22543794.ms can be directly loaded into the SIRIUS software.
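For orientation, a SIRIUS .ms file roughly follows the layout sketched below. The compound name, formula and peaks shown here are made up for illustration, and the exact set of metadata fields written by the massbank2db package may differ:

```
>compound EXAMPLE_COMPOUND
>formula C9H11NO2
>parentmass 166.0863
>ionization [M+H]+
>collision 10
120.0808 100.0
166.0863 35.2
>collision 20
103.0542 48.1
120.0808 100.0
```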

Importing the SIRIUS candidates and MS² scores

By calling:

python import_sirius_scores.py db/massbank__2020.11__v0.6.1.sqlite tool_output/sirius sirius_scores.tar.gz \
  --pubchem_db_fn=/PATH/TO/LOCAL_PUBCHEM.SQLITE \
  --acc_to_be_removed_fn=grouped_accessions_to_be_removed.txt

a copy of our initial SQLite DB is generated (massbank__with_sirius.sqlite) and the following information is added to the database:

  • all (spectrum, candidate)-pairs generated by SIRIUS
  • the SIRIUS MS² scores for all (spectrum, candidate)-pairs
  • enriched candidate sets (see Methods "Generating the molecular candidate sets")
  • (optional, --include_sirius_fps) the binary fingerprints for each candidate as used by SIRIUS

Note: This step requires a local PubChem SQLite DB.

Candidate set enrichment

As SIRIUS does not return scores for stereoisomers, we need to add them manually to the candidate sets. For that, we perform an inner merge on the first InChIKey block (e.g. FMGSKLZLMKYGDP of the full InChIKey FMGSKLZLMKYGDP-UHFFFAOYSA-N) between the candidates provided by SIRIUS and a local copy of PubChem:

| ROW | InChIKey | SIRIUS MS² score |
| --- | -------- | ---------------- |
| 1 | JYGXADMDTFJGBT | 0.3 |
| 2 | OIGNJSKKLXVSLS | 0.1 |
| 3 | OMFXVFTZEKFJBZ | 0.5 |

becomes

| IDX | InChIKey | SIRIUS MS² score |
| --- | -------- | ---------------- |
| 1.A | JYGXADMDTFJGBT-UHFFFAOYSA | 0.3 |
| 1.B | JYGXADMDTFJGBT-KZUKIWJVSA | 0.3 |
| 1.C | JYGXADMDTFJGBT-LHXCVJCSSA | 0.3 |
| 2.A | OIGNJSKKLXVSLS-UHFFFAOYSA | 0.1 |
| 2.B | OIGNJSKKLXVSLS-RDEQXLMJSA | 0.1 |
| 3.A | OMFXVFTZEKFJBZ-UHFFFAOYSA | 0.5 |
| 3.B | OMFXVFTZEKFJBZ-NCQLXYLKSA | 0.5 |

after the merge.
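The same enrichment can be sketched with pandas as follows. The data frames use the (made-up) score values from the tables above and only a subset of the stereoisomers; the actual merge against the local PubChem copy is implemented in import_sirius_scores.py:

```python
import pandas as pd

# SIRIUS candidates, keyed by the first InChIKey block (2D structure only)
sirius = pd.DataFrame({
    "inchikey1": ["JYGXADMDTFJGBT", "OIGNJSKKLXVSLS", "OMFXVFTZEKFJBZ"],
    "sirius_ms2_score": [0.3, 0.1, 0.5],
})

# local PubChem copy: all stereoisomers sharing the same first InChIKey block
pubchem = pd.DataFrame({
    "inchikey1": ["JYGXADMDTFJGBT", "JYGXADMDTFJGBT", "OIGNJSKKLXVSLS"],
    "inchikey":  ["JYGXADMDTFJGBT-UHFFFAOYSA", "JYGXADMDTFJGBT-KZUKIWJVSA",
                  "OIGNJSKKLXVSLS-RDEQXLMJSA"],
})

# inner merge on the first InChIKey block: each stereoisomer inherits the
# MS2 score of its 2D structure
enriched = pd.merge(pubchem, sirius, on="inchikey1", how="inner")
print(enriched)
```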

Removal of records associated with the #152 pull-request in MassBank

As described in the Methods "Pre-processing pipeline for raw MassBank records", we remove a couple of MassBank records related to the "LU" datasets which were reported to have issues. For that, we compare which original "LU*" accessions were removed from MassBank between release 2020.11 (our baseline) and release 2021.3. We list our internal accession IDs in the file grouped_accessions_to_be_removed.txt. Entries in this list will not be imported into the massbank__with_sirius.sqlite database and hence are not part of our experiments.
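As a minimal sketch of how such an exclusion list can be applied (the actual filtering happens inside import_sirius_scores.py via the --acc_to_be_removed_fn option; the accession list below is hypothetical):

```python
# read the grouped accession IDs that must not be imported
with open("grouped_accessions_to_be_removed.txt") as f:
    to_be_removed = {line.strip() for line in f if line.strip()}

# hypothetical grouped accessions found in the SIRIUS tool output
all_accessions = ["AU22543794", "LU00112233"]
accessions_to_import = [acc for acc in all_accessions if acc not in to_be_removed]
```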

Applying the other MS² scoring methods to create the candidate scores

Besides SIRIUS, we also used MetFrag and CFM-ID for the candidate ranking using the MS² information.

MetFrag

  1. We create the required MetFrag input files, which will be written to tools/metfrag:

python create_insilico_tool_outputs_for_database.py massbank__with_sirius.sqlite metfrag

  2. We generate the candidate set files as required by the MetFrag software:

python get_metfrag_candidates.py massbank__with_sirius.sqlite tools/metfrag --gzip

  3. We use the MetFrag software (v 2.4.5) for the in silico MS² scoring of the molecular candidates. A makefile (tools/metfrag/Makefile) can be used to run MetFrag in parallel on multiple cores.

  4. We import the FragmenterScore of MetFrag as candidate scores into the database (a sketch of inspecting a MetFrag result file follows this list):

python import_metfrag_scores.py massbank__with_sirius.sqlite tools/metfrag

  5. A new database file has been created: massbank__with_metfrag.sqlite, which is a copy of massbank__with_sirius.sqlite plus the MetFrag scores.
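For orientation, a single MetFrag result file could be inspected as sketched below, assuming CSV output was configured in the parameter files generated in step 1; the exact file names and column set depend on that configuration:

```python
import pandas as pd

# hypothetical result file written by the MetFrag CLI for one spectrum
results = pd.read_csv("tools/metfrag/EXAMPLE_SPECTRUM.csv")

# the FragmenterScore column holds the score that is imported into the DB
print(results[["Identifier", "FragmenterScore"]].head())
```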

CFM-ID

  1. We create the required CFM-ID input files, which will be written to tools/cfmid4:

python create_insilico_tool_outputs_for_database.py massbank__with_metfrag.sqlite cfmid4

  2. We generate the candidate set files as required by the CFM-ID software:

python get_cfmid_candidates.py massbank__with_metfrag.sqlite tools/cfmid4 --gzip --store_candidates_separately

  3. We use CFM-ID (v 4.0.7) for the in silico MS² spectra prediction. The CFM-ID developers provide pre-trained cross-validation (CV) models. That means we can predict the in silico MS² spectra for the candidate sets in a structure-disjoint fashion. We use the "Metlin 2019 MSML" models. The tools/cfmid4/{pos,neg} directories contain a list of the CFM-ID training molecules and their respective left-out CV id, which we use for the structure-disjoint prediction. The spectra simulation is a computationally very (!) heavy process and performing it on a cluster is highly recommended. We provide a script (tools/cfmid4/predict_candidate_spectra.sh) illustrating how this can be done on a cluster using the SLURM workload manager.

  4. We load the predicted spectra and compute the similarity score with the corresponding measured spectrum. This similarity is used as the CFM-ID MS² candidate score (see the matchms sketch after this list):

python import_cfmid_scores.py massbank__with_metfrag.sqlite tools/cfmid4

  5. A new database file has been created: massbank__with_cfmid.sqlite, which is a copy of massbank__with_metfrag.sqlite plus the CFM-ID scores.
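The spectra similarity is computed with the matchms package (see the Python requirements above). A minimal sketch of scoring one predicted candidate spectrum against the measured spectrum is shown below; the peak lists are made up, and the actual similarity measure and its parameters are defined in import_cfmid_scores.py (CosineGreedy is used here only for illustration):

```python
import numpy as np
from matchms import Spectrum
from matchms.similarity import CosineGreedy

# measured MassBank spectrum (peaks sorted by m/z)
measured = Spectrum(mz=np.array([103.05, 120.08, 166.09]),
                    intensities=np.array([0.48, 1.00, 0.35]))

# CFM-ID predicted spectrum of one molecular candidate
predicted = Spectrum(mz=np.array([103.06, 120.08, 149.06]),
                     intensities=np.array([0.40, 0.90, 0.20]))

# the result contains the cosine score and the number of matched peaks
result = CosineGreedy(tolerance=0.1).pair(measured, predicted)
print(result)
```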

Computing the molecule features

For our experiments we use three (3) different molecular feature representations. In the following we describe how those can be computed and added to the database.

Circular fingerprints (FCFP)

Our Structured Support Vector Machine (SSVM) model uses FCFP fingerprints, computed from isomeric SMILES, to represent the molecular candidates.

  1. Compute the counting FCFP fingerprints (with and without chirality encoding, "2D" and "3D"):

python compute_circular_fingerprints.py massbank__with_cfmid.sqlite

  2. The fingerprints are inserted into a copy of the DB: massbank__with_fcfp.sqlite

  3. Convert the counting fingerprints into binarized counting vectors. The main purpose of this step is to speed up the kernel computation required for the SSVM. The binary representation still encodes the counts, but allows using the Tanimoto kernel instead of the MinMax kernel while yielding the same similarity values (see the sketch after this list). For details on the implementation, the reader is referred to the "ssvm" library and the publication by Ralaivola et al. (2005):

python convert_to_binary_fingerprints.py massbank__with_fcfp.sqlite FCFP

  4. A new database file has been created: massbank__with_binary_fcfp.sqlite, which is a copy of massbank__with_fcfp.sqlite plus the binarized fingerprints.
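The two steps above can be illustrated as follows. This is a minimal sketch, not the implementation in compute_circular_fingerprints.py / convert_to_binary_fingerprints.py: counting FCFP-like fingerprints are obtained from RDKit's feature-based Morgan fingerprints, and the counts are then unary ("thermometer") encoded so that the Tanimoto kernel on the binary vectors equals the MinMax kernel on the counts:

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

def counting_fcfp(smiles, radius=2, n_bits=2048, chirality=True):
    # feature-based (FCFP-like) hashed Morgan count fingerprint
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetHashedMorganFingerprint(
        mol, radius, nBits=n_bits, useFeatures=True, useChirality=chirality)
    counts = np.zeros(n_bits, dtype=int)
    for bit, count in fp.GetNonzeroElements().items():
        counts[bit] = count
    return counts

def binarize(counts, max_counts):
    # unary encoding: for dimension i emit max_counts[i] bits, the first
    # counts[i] of which are set to one
    return np.concatenate([(np.arange(m) < c).astype(int)
                           for c, m in zip(counts, max_counts)])

x = counting_fcfp("CC(=O)OC1=CC=CC=C1C(=O)O")   # aspirin, as an example
y = counting_fcfp("OC(=O)C1=CC=CC=C1O")         # salicylic acid

minmax = np.minimum(x, y).sum() / np.maximum(x, y).sum()

max_counts = np.maximum(x, y)   # in practice: maximum over all candidates
bx, by = binarize(x, max_counts), binarize(y, max_counts)
tanimoto = np.logical_and(bx, by).sum() / np.logical_or(bx, by).sum()

print(minmax, tanimoto)   # both values are identical
```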

Molecular descriptors

Bouwmeester et al. (2019) defined a set of molecular descriptors, computed using RDKit, which they found useful for retention time (RT) prediction. We use those features for one of our comparison methods and add them to our DB.

  1. Compute the descriptors and add them to the DB (see the RDKit sketch after this list):

python compute_bouwmeester_features.py massbank__with_binary_fcfp.sqlite

  2. A new database file has been created: massbank__with_descriptors.sqlite, which is a copy of massbank__with_binary_fcfp.sqlite plus the molecular descriptors.
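As a minimal RDKit sketch of how such descriptors are obtained (the specific descriptor subset of Bouwmeester et al. is selected inside compute_bouwmeester_features.py and not reproduced here):

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

mol = Chem.MolFromSmiles("CC(=O)OC1=CC=CC=C1C(=O)O")  # aspirin, as an example

# a few of the descriptors available in RDKit
print(Descriptors.MolWt(mol), Descriptors.MolLogP(mol), Descriptors.TPSA(mol))

# all available (name, function) pairs, from which a subset can be selected
print(len(Descriptors.descList))
```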

Counting substructure fingerprints

Bach et al. (2020) used substructure counting fingerprints to represent molecules in their RankSVM model. As we compare with their approach for MS² and RT score integration, we add their features to our DB. We pre-computed the fingerprints using the CDK package; the fingerprint vectors are stored in substructure_fingerprints/candidates___SMILES_ISO.tsv.gz. Please note that the pre-computed fingerprints are limited to the molecules in our current candidate sets.

  1. Import the pre-computed substructure fingerprints (see the sketch after this list for a quick way to inspect the file):

python import_substructure_count_fingerprints.py massbank__with_descriptors.sqlite substructure_fingerprints/candidates___SMILES_ISO.tsv.gz

  2. A new database file has been created: massbank__with_substructure_fps.sqlite, which is a copy of massbank__with_descriptors.sqlite plus the substructure fingerprints.
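To take a quick look at the pre-computed fingerprint file (a sketch; no assumptions are made about the column layout, which is defined by the file itself):

```python
import pandas as pd

# gzip-compressed TSV file with the pre-computed CDK substructure counts
fps = pd.read_csv("substructure_fingerprints/candidates___SMILES_ISO.tsv.gz", sep="\t")
print(fps.shape)
print(fps.columns.tolist())
```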

Generate evaluation LC-MS² experiments

We evaluate all methods, including our SSVM model, using multiple sets (or sequences) of (MS², RT)-tuples sampled for each dataset / MassBank group in our data. We refer the reader to the Methods section of our paper, which explains the sampling procedure in detail. Each tuple set contains about 50 MS features (i.e. (MS², RT)-tuples). Depending on the amount of available data in each dataset, the number of sampled sets differs. We distinguish two sampling scenarios: "default" (FULLDATA in the paper) and "with_stereo" (ONLYSTEREO in the paper). Again, the interested reader is encouraged to read the corresponding method descriptions in our manuscript. A simplified sketch of the sampling idea follows the steps below.

  1. Generate the evaluation LC-MS² experiments for both scenarios:

python generate_lcmsms_experiments.py massbank__with_substructure_fps.sqlite

  2. A new database file has been created: massbank__with_test_splits.sqlite, which is a copy of massbank__with_substructure_fps.sqlite plus the LC-MS² experiments (indices of evaluation records).
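A strongly simplified sketch of the idea behind the sampling (the actual procedure, including the handling of the "default" and "with_stereo" scenarios, is implemented in generate_lcmsms_experiments.py and described in the manuscript):

```python
import random

def sample_tuple_sets(ms_features, set_size=50, n_sets=10, seed=2020):
    # ms_features: list of (MS2 spectrum id, RT) tuples available for one
    # dataset / MassBank group; each evaluation set is a random subset
    rng = random.Random(seed)
    return [rng.sample(ms_features, min(set_size, len(ms_features)))
            for _ in range(n_sets)]
```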

Molecule classifications

In our experiments we specifically analyze the performance of our SSVM model on specific molecular classes, based on two (2) different classification systems. The first one is ClassyFire, which assigns classes to molecules based on their structure. The second classification system is taken from PubChemLite, which is based on literature information about the usage of certain molecules in certain contexts.

ClassyFire

  1. Import the ClassyFire classes into the DB (note: no copy of the DB is created in this step):
Rscript insert_classyfire_classes.R massbank__with_substructure_fps.sqlite

PubChemLite

  1. Download the PubChemLite DB (v0.3.0)

  2. Insert the PubChemLite classifications into the DB:

python insert_pubchemlite_annotations.py massbank__with_substructure_fps.sqlite /path/to/pubchemlite.csv

  3. A new database file has been created: massbank__with_pubchemlite.sqlite, which is a copy of massbank__with_substructure_fps.sqlite plus the PubChemLite classifications.