
ipaPy2

Python implementation of the Integrated Probabilistic Annotation (IPA) - A Bayesian annotation method for LC/MS data integrating biochemical relations, isotope patterns and adduct formation.

Installation

ipaPy2 requires Python 3.9 or higher

Install via pip (recommended)

pip install ipaPy2

Install via bioconda

conda install -c bioconda ipapy2

Compiling from source (macOS)

  1. Create a folder in which you want to put the library
mkdir IPA
cd IPA
  2. Download the library. If Homebrew is not installed on your machine, you can install it from https://brew.sh
brew install git
git clone https://github.com/francescodc87/ipaPy2
cd ipaPy2
  3. Create and activate a virtual environment and install the necessary libraries
python3 -m venv ipaPy2
source ipaPy2/bin/activate
pip install wheel
pip install setuptools
pip install twine
pip install pytest==4.4.1
pip install pytest-runner==4.4
  4. Run tests (optional)
python setup.py pytest
  5. Build the library
python setup.py bdist_wheel
  6. The wheel file will be stored in the dist folder. You can install the library in a new terminal as follows
pip install /path/to/wheelfile.whl

Compiling from source (Linux)

  1. Create a folder in which you want to put the library
mkdir IPA
cd IPA
  2. Download the library
sudo apt-get install git
git clone https://github.com/francescodc87/ipaPy2
cd ipaPy2
  3. Create and activate a virtual environment and install the necessary libraries
python3 -m venv ipaPy2
source ipaPy2/bin/activate
pip install wheel
pip install setuptools
pip install twine
pip install pytest==4.4.1
pip install pytest-runner==4.4
  4. Run tests (optional)
python setup.py pytest
  5. Build the library
python setup.py bdist_wheel
  6. The wheel file will be stored in the dist folder. You can install the library in a new terminal as follows
pip install /path/to/wheelfile.whl

Compiling from source (Windows)

  1. Create a folder in which you want to put the library
mkdir IPA
cd IPA
  2. Install git (https://github.com/git-guides/install-git)
  3. Download the library
git clone https://github.com/francescodc87/ipaPy2
cd ipaPy2
  4. Create and activate a virtual environment and install the necessary libraries
python -m venv ipaPy2
ipaPy2\Scripts\activate
pip install wheel
pip install setuptools
pip install twine
pip install pytest==4.4.1
pip install pytest-runner==4.4
  5. Run tests (optional)
python setup.py pytest
  6. Build the library
python setup.py bdist_wheel
  7. The wheel file will be stored in the dist folder. You can install the library in a new terminal as follows
pip install /path/to/wheelfile.whl

Databases

One of the most powerful features of the IPA method is that it is able to integrate the knowledge gained from previous experiments in the annotation process. There are three files that are used as the IPA database:

1. Adducts file (required)
The ipaPy2 library requires a file containing all the information needed for the computation of the adducts. An adducts.csv file is provided with the package here. The file contains the most common adducts. If any exotic adduct (or in-source fragment) needs to be considered, the user must modify the file accordingly. The format required for the adducts file is shown below.

import pandas as pd
import numpy as np
adducts = pd.read_csv('DB/adducts.csv')
adducts.head()
name calc Charge Mult Mass Ion_mode Formula_add Formula_ded Multi
0 M+H M+1.007276 1 1 1.007276 positive H1 FALSE 1
1 M+NH4 M+18.033823 1 1 18.033823 positive N1H4 FALSE 1
2 M+Na M+22.989218 1 1 22.989218 positive Na1 FALSE 1
3 M+K M+38.963158 1 1 38.963158 positive K1 FALSE 1
4 M+ M-0.00054858 1 1 -0.000549 positive FALSE FALSE 1
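Appending a row for an adduct not already covered can be done directly in pandas. A minimal sketch, where the single M+H row stands in for the full adducts.csv and the methanol-adduct values (name, mass shift, formula) are illustrative only:

```python
import pandas as pd

# Columns as in DB/adducts.csv; in practice the full file would be loaded
# with pd.read_csv('DB/adducts.csv') instead of this one-row stand-in.
cols = ["name", "calc", "Charge", "Mult", "Mass",
        "Ion_mode", "Formula_add", "Formula_ded", "Multi"]
adducts = pd.DataFrame(
    [["M+H", "M+1.007276", 1, 1, 1.007276, "positive", "H1", "FALSE", 1]],
    columns=cols)

# Hypothetical methanol-cluster adduct; the values shown are illustrative
# and must be computed for the adduct actually observed.
new_row = pd.DataFrame(
    [["M+CH3OH+H", "M+33.033491", 1, 1, 33.033491,
      "positive", "C1H5O1", "FALSE", 1]], columns=cols)
adducts = pd.concat([adducts, new_row], ignore_index=True)
```

The extended table can then be written back to a .csv file and passed to the library in place of the default adducts file.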

2. MS1 database file (required)
The IPA method requires a pandas dataframe containing the database against which the annotation is performed. This dataframe must contain the following columns in this exact order (optional columns can have empty fields):

  • id: unique id of the database entry (e.g., 'C00031') - necessary
  • name: compound name (e.g., 'D-Glucose') - necessary
  • formula: chemical formula (e.g., 'C6H12O6') - necessary
  • inchi: inchi string - optional
  • smiles: smiles string - optional
  • RT: if known, retention time range (in seconds) where this compound is expected to elute (e.g., '30;60') - optional
  • adductsPos: list of adducts that should be considered in Positive mode for this entry (e.g.,'M+Na;M+H;M+') - necessary
  • adductsNeg: list of adducts that should be considered in Negative mode for this entry (e.g.,'M-H;M-2H') - necessary
  • description: comments on the entry - optional
  • pk: previous knowledge on the likelihood of this compound being present in the sample analysed. The value has to be between 1 (compound highly likely to be present in the sample) and 0 (compound cannot be present in the sample) - necessary
  • MS2: id for the MS2 database entries related to this compound - optional
  • reactions: list of reaction ids involving this compound (e.g., 'R00010 R00015 R00028'). If required, these can be used to find possible biochemical connections - optional

The column names must be the ones reported here. While users are strongly advised to build their own ad-hoc database to match their specific instrument setup and sample types, here you can find a relatively big example database.
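As a starting point for an ad-hoc database, a minimal one-entry dataframe respecting the required column order can be built directly in pandas (the entry below is illustrative):

```python
import pandas as pd

# The 12 columns, in the required order; optional fields may be left as None.
cols = ["id", "name", "formula", "inchi", "smiles", "RT", "adductsPos",
        "adductsNeg", "description", "pk", "MS2", "reactions"]
entry = ["C00031", "D-Glucose", "C6H12O6", None, None, "30;60",
         "M+H;M+Na;M+", "M-H;M-2H", None, 1, None, None]
DB = pd.DataFrame([entry], columns=cols)
```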

DB = pd.read_csv('DB/IPA_MS1.csv')
DB.head()
id name formula inchi smiles RT adductsPos adductsNeg description pk MS2 reactions
0 C00002 ATP C10H16N5O13P3 InChI=1S/C10H16N5O13P3/c11-8-5-9(13-2-12-8)15(... NaN NaN M+H;M+Na;M+2H;2M+H M-H;2M-H;M-2H;3M-H NaN 1 EMBL-MCF_spec365637_1 R00002 R00076 R00085 R00086 R00087 R00088 R000...
1 C00003 NAD+ C21H28N7O14P2 InChI=1S/C21H27N7O14P2/c22-17-12-19(25-7-24-17... NaN NaN M+H;M+Na;M+2H;2M+H M-H;2M-H;M-2H;3M-H NaN 1 EMBL-MCF_specxxxxx_10 R00023 R00090 R00091 R00092 R00093 R00094 R000...
2 C00004 NADH C21H29N7O14P2 InChI=1S/C21H29N7O14P2/c22-17-12-19(25-7-24-17... NaN NaN M+H;M+Na;M+2H;2M+H M-H;2M-H;M-2H;3M-H NaN 1 NaN R00023 R00090 R00091 R00092 R00093 R00094 R000...
3 C00005 NADPH C21H30N7O17P3 InChI=1S/C21H30N7O17P3/c22-17-12-19(25-7-24-17... NaN NaN M+H;M+Na;M+2H;2M+H M-H;2M-H;M-2H;3M-H NaN 1 NaN R00105 R00106 R00107 R00108 R00109 R00111 R001...
4 C00006 NADP+ C21H29N7O17P3 InChI=1S/C21H28N7O17P3/c22-17-12-19(25-7-24-17... NaN NaN M+H;M+Na;M+2H;2M+H M-H;2M-H;M-2H;3M-H NaN 1 EMBL-MCF_specxxxxxx_45 R00104 R00106 R00107 R00108 R00109 R00111 R001...

This example database was obtained considering the KEGG database, the Natural Products Atlas database and the MoNa database (only compounds having at least one fragmentation spectrum obtained with a QExactive). For each entry, only a handful of the most common adducts are considered. To fully exploit the IPA method, it is strongly recommended to constantly update the database when new knowledge is gained from previous experiments. Providing a retention time window for compounds previously detected with the analytical system at hand is particularly useful. For the sake of the example in this tutorial, a reduced example database is also provided.

DB = pd.read_csv('DB/DB_test_pos.csv')
DB.head()
id name formula inchi smiles RT adductsPos adductsNeg description pk MS2 reactions
0 C00079 L-Phenylalanine C9H11NO2 InChI=1S/C9H11NO2/c10-8(9(11)12)6-7-4-2-1-3-5-... NaN 120;160 M+H;M+Na;M+2H;2M+H M-H;2M-H;M-2H;3M-H NaN 1 UA005501_1 R00686 R00688 R00689 R00690 R00691 R00692 R006...
1 C00082 L-Tyrosine C9H11NO3 InChI=1S/C9H11NO3/c10-8(9(12)13)5-6-1-3-7(11)4... NaN 50;90 M+H;M+Na;M+2H;2M+H M-H;2M-H;M-2H;3M-H NaN 1 UA005601_1 R00031 R00728 R00729 R00730 R00731 R00732 R007...
2 C00114 Choline C5H14NO InChI=1S/C5H14NO/c1-6(2,3)4-5-7/h7H,4-5H2,1-3H... NaN NaN M+H;M+Na;M+2H;2M+H M-H;2M-H;M-2H;3M-H NaN 1 NaN R01021 R01022 R01023 R01025 R01026 R01027 R010...
3 C00123 L-Leucine C6H13NO2 InChI=1S/C6H13NO2/c1-4(2)3-5(7)6(8)9/h4-5H,3,7... NaN 70;110 M+H;M+Na;M+2H;2M+H M-H;2M-H;M-2H;3M-H NaN 1 NaN R01088 R01089 R01090 R01091 R02552 R03657 R084...
4 C00148 L-Proline C5H9NO2 InChI=1S/C5H9NO2/c7-5(8)4-2-1-3-6-4/h4,6H,1-3H... NaN 35;55 M+H;M+Na;M+2H;2M+H M-H;2M-H;M-2H;3M-H NaN 1 EMBL-MCF_specxxxxx_7 R00135 R00671 R01246 R01248 R01249 R01251 R012...

3. MS2 database file (only required if MS2 data is available)
This new implementation of the IPA method also allows the user to include MS2 data in the annotation pipeline. In order to exploit this functionality an MS2 spectra database must be provided. The MS2 database must be provided as a pandas dataframe including the following columns in this exact order:

  • compound_id: unique id for each compound; it must match the ids used in the MS1 database - necessary
  • id: unique id for the single entry (i.e., spectra) of the database - necessary
  • name: compound name (e.g., 'D-Glucose') - necessary
  • formula: chemical formula (e.g., 'C6H12O6') - necessary
  • inchi: inchi string - optional
  • precursorType: the adduct form of the precursor ion (e.g., 'M+H') - necessary
  • instrument: the type of instrument the spectrum was acquired with - optional
  • collision.energy: the collision energy level used to acquire the spectrum (e.g., '15') - necessary
  • spectrum: the actual spectrum in the form of a string in the following format 'mz1:Int1 mz2:Int2 mz3:Int3 ...' - necessary

It is necessary that the user uses an MS2 database specific to the instrument used to acquire the data. The MS2 database found here contains all the MS2 spectra found in the MoNa database acquired with a QExactive. This is a relatively big file, and for the sake of this tutorial a drastically reduced version of it has been included within this repository, and can be found here.

DBMS2 = pd.read_csv('DB/DBMS2_test_pos.csv')
DBMS2.head()
compound_id id name formula inchi precursorType instrument collision.energy spectrum
0 EMBL-MCF_specxxxxxx_11 EMBL-MCF_spec103039 L-valine C5H11NO2 InChI=1S/C5H11NO2/c1-3(2)4(6)5(7)8/h3-4H,6H2,1... M+H Thermo Q-Exactive Plus 35 55.0550575256:5.821211 57.0581207275:0.385600 ...
1 EMBL-MCF_specxxxxxx_11 EMBL-MCF_spec353465 L-valine C5H11NO2 InChI=1S/C5H11NO2/c1-3(2)4(6)5(7)8/h3-4H,6H2,1... M+H Thermo Q-Exactive Plus 30 49.5028053042:0.000000 49.5031356971:0.000000 ...
2 EMBL-MCF_specxxxxxx_11 EMBL-MCF_spec27828 L-valine C5H11NO2 InChI=1S/C5H11NO2/c1-3(2)4(6)5(7)8/h3-4H,6H2,1... M-H Thermo Q-Exactive Plus 35 55.0173454285:0.819357 58.0282363892:0.155430 ...
3 EMBL-MCF_specxxxxx_7 EMBL-MCF_spec96902 L-proline C5H9NO2 InChI=1S/C5H9NO2/c7-5(8)4-2-1-3-6-4/h4,6H,1-3H... M+H Thermo Q-Exactive Plus 35 50.5765228271:0.040013 51.3066940308:0.039949 ...
4 EMBL-MCF_specxxxxx_7 EMBL-MCF_spec353568 L-proline C5H9NO2 InChI=1S/C5H9NO2/c7-5(8)4-2-1-3-6-4/h4,6H,1-3H... M+H Thermo Q-Exactive Plus 30 49.5028215674:0.000000 49.5031519602:0.000000 ...
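Since spectra are stored as 'mz1:Int1 mz2:Int2 ...' strings, they can be parsed back into numeric arrays when needed. A small helper (not part of the ipaPy2 API, shown only for illustration):

```python
import numpy as np

def parse_spectrum(spec: str) -> np.ndarray:
    """Parse an 'mz1:Int1 mz2:Int2 ...' string into an (N, 2) float array."""
    return np.array([pair.split(":") for pair in spec.split()], dtype=float)

spec = "55.0550575256:5.821211 57.0581207275:0.385600 72.0808:100.0"
peaks = parse_spectrum(spec)  # peaks[:, 0] = m/z values, peaks[:, 1] = intensities
```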

Data preparation

Before using the ipaPy2 package, the processed data coming from an untargeted metabolomics experiment must be properly prepared.

1. MS1 data

The data must be organized in a pandas dataframe containing the following columns:

  • ids: a unique numeric id for each mass spectrometry feature
  • rel.ids: relation ids. Features must be clustered based on correlation/peak shape/retention time. Features in the same cluster are likely to come from the same metabolite.
  • mzs: mass-to-charge ratios, usually the average across different samples.
  • RTs: retention times in seconds, usually the average across different samples.
  • Int: representative (e.g., maximum or average) intensity detected for each feature across samples (either peak area or peak intensity)

An example is reported below:

df1=pd.read_csv('ExampleDatasets/README/df_test_pos.csv')
df1.head()
ids rel.ids mzs RTs Int
0 1 0 116.070544 45.770423 2.170017e+09
1 88 0 117.073678 45.787586 1.256520e+08
2 501 0 231.133673 46.183948 2.519223e+07
3 4429 0 232.136923 46.176715 2.635594e+06
4 2 1 104.106830 40.843309 1.889172e+09

Clustering the features is a necessary step and must be performed before running the IPA method. For this step, the use of widely used data processing software such as mzMatch and CAMERA is recommended. Nevertheless, the ipaPy2 library provides a function (clusterFeatures()) able to perform this step, starting from a dataframe containing the measured intensities across several samples (at least 3 samples; the more samples the better). This dataframe should be organized as follows:

df2=pd.read_csv('ExampleDatasets/README/df_test_pos_not_clustered.csv')
df2.head()
ids mzs RTs sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10
0 1 116.070544 45.770423 1.003660e+09 1.299828e+09 1.878029e+09 1.778238e+09 1.715394e+09 4.340034e+08 1.586635e+09 2.170017e+09 1.312151e+09 2.051875e+09
1 2 104.106830 40.843309 3.778343e+08 8.721901e+08 8.353805e+08 1.889172e+09 1.114844e+09 1.296362e+09 7.361379e+08 7.386887e+08 9.546864e+08 6.969054e+08
2 3 118.085998 43.584638 5.984715e+08 1.399106e+09 2.831220e+08 1.415610e+09 7.557607e+08 7.800359e+08 8.949854e+08 5.074069e+08 6.854525e+08 1.000501e+09
3 4 166.086047 143.321396 1.390905e+09 1.047887e+09 1.053413e+09 2.781809e+08 1.037486e+09 1.117700e+09 6.153332e+08 1.215932e+09 1.264092e+09 1.370995e+09
4 5 132.101745 89.387202 6.071912e+08 1.014152e+09 1.270735e+09 1.069765e+09 4.925938e+08 4.087633e+08 3.777945e+08 2.541470e+08 8.025257e+08 3.544281e+08
from ipaPy2 import ipa
df=ipa.clusterFeatures(df2)
Clustering features ....
0.0 seconds elapsed

All information about the function can be found in its help:

help(ipa.clusterFeatures)
Help on function clusterFeatures in module ipaPy2.ipa:

clusterFeatures(df, Cthr=0.8, RTwin=1, Intmode='max')
    Clustering MS1 features based on correlation across samples.
    
    Parameters
    ----------
    df: pandas dataframe with the following columns:
        -ids: a unique id for each feature
        -mzs: mass-to-charge ratios, usually the average across different
              samples.
        -RTs: retention times in seconds, usually the average across different
              samples.
        -Intensities: for each sample, a column reporting the detected
                      intensities in each sample. 
    Cthr: Default value 0.8. Minimum correlation allowed in each cluster
    RTwin: Default value 1. Maximum difference in RT time between features in
           the same cluster
    Intmode: Defines how the representative intensity of each feature is
             computed. If 'max' (default) the maximum across samples is used.
             If 'ave' the average across samples is computed
    Returns
    -------
    df: pandas dataframe in correct format to be used as an input of the
    map_isotope_patterns() function

Once run, this function returns a pandas dataframe in the correct format for the ipaPy2 package:

df.head()
ids rel.ids mzs RTs Int
0 1 0 116.070544 45.770423 2.170017e+09
1 88 0 117.073678 45.787586 1.256520e+08
2 501 0 231.133673 46.183948 2.519223e+07
3 4429 0 232.136923 46.176715 2.635594e+06
4 2 1 104.106830 40.843309 1.889172e+09

2. MS2 data

If fragmentation data was acquired during the experiment, it can be included in the IPA annotation process. To do so, the data must be organized in a pandas dataframe containing the following columns, in this exact order:

  • id: a unique id for each feature for which the MS2 spectrum was acquired (same as in MS1)
  • spectrum: string containing the spectrum information in the following format 'mz1:Int1 mz2:Int2 mz3:Int3 ...'
  • ev: collision energy used to acquire the fragmentation spectrum

An example is reported below:

dfMS2=pd.read_csv('ExampleDatasets/README/MS2data_example.csv')
dfMS2.head()
id spectrum ev
0 1 51.3066132836457:0.884272376680125 59.96532241... 35
1 1 51.3066132836457:0.884272376680125 59.96532241... 15
2 90 62.4153253406374:0.743812036877455 63.93291389... 35
3 992 50.983321052233:0.973529955385613 53.039006800... 35
4 3 55.0551847656264:5.67780579195993 57.058126021... 35
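The fragmentation pattern scores used later in the pipeline are cosine similarities between a measured spectrum and a database spectrum, with peaks matched within an m/z (or ppm) tolerance. A simplified illustration of how such a score can be computed on the spectrum strings above; this is a sketch, not the ipaPy2 implementation:

```python
import numpy as np

def cosine_score(spec_a: str, spec_b: str, mzd: float = 0.01) -> float:
    """Greedy cosine similarity between two 'mz:Int ...' spectrum strings.

    Peaks are matched when their m/z values differ by less than mzd
    (the library also supports a ppm-based tolerance instead).
    """
    a = np.array([p.split(":") for p in spec_a.split()], dtype=float)
    b = np.array([p.split(":") for p in spec_b.split()], dtype=float)
    dot, used = 0.0, set()
    for mz, inten in a:
        d = np.abs(b[:, 0] - mz)           # distance to every peak in b
        j = int(np.argmin(d))
        if d[j] < mzd and j not in used:   # match each b-peak at most once
            dot += inten * b[j, 1]
            used.add(j)
    norm = np.linalg.norm(a[:, 1]) * np.linalg.norm(b[:, 1])
    return dot / norm if norm > 0 else 0.0

s1 = "55.055:5.82 57.058:0.39 72.081:100.0"
s2 = "55.056:6.10 72.081:95.0"
```

Identical spectra score 1, while spectra with no matching peaks score 0.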

Usage

The Integrated Probabilistic Annotation (IPA) method can be applied in different situations, and the ipaPy2 package allows the users to tailor the IPA pipeline around their specific needs.

This brief tutorial describes the most common scenarios the IPA method can be applied to.

1. Mapping isotope patterns

The first step of the IPA pipeline consists in the mapping of the isotope patterns within the dataset considered. This is achieved through the map_isotope_patterns() function. Its help provides a detailed description:

help(ipa.map_isotope_patterns)
Help on function map_isotope_patterns in module ipaPy2.ipa:

map_isotope_patterns(df, isoDiff=1, ppm=100, ionisation=1, MinIsoRatio=0.5)
    mapping isotope patterns in MS1 data.
    
    Parameters
    ----------
    df : pandas dataframe (necessary)
         A dataframe containing the MS1 data including the following columns:
            -ids: an unique id for each feature
            -rel.ids:   relation ids. In a previous step of the data processing
                        pipeline, features are clustered based on peak shape
                        similarity/retention time. Features in the same
                        cluster are likely to come from the same metabolite.
                        All isotope patterns must be in the same rel.id
                        cluster.
            -mzs: mass-to-charge ratios, usually the average across
                  different samples.
            -RTs: retention times in seconds, usually the average across
                  different samples.
            -Ints: representative (e.g., maximum or average) intensity detected
                   for each feature across samples (either peak area or peak
                   intensity)
    isoDiff : Default value 1. Difference between isotopes of charge 1, does
              not need to be exact
    ppm:   Default value 100. Maximum ppm value allowed between 2 isotopes.
            It is very high on purpose
    ionisation: Default value 1. positive = 1, negative = -1
    MinIsoRatio: minimum intensity ratio, expressed as a percentage (Default
                value 0.5%). Only isotopes with intensity higher than
                MinIsoRatio% of the main isotope are considered.
    
    Returns
    -------
    df: the main input is modified by adding and populating the following
        columns
        - relationship: the possible values are:
                        * bp: basepeak, most intense peak within each rel id
                        * bp|isotope: isotope of the basepeak
                        * potential bp: most intense peak within each isotope
                                        pattern (excluding the basepeak)
                        * potential bp|isotope: isotope of one potential bp
        - isotope pattern: feature used to cluster the different isotope
                            patterns within the same relation id
        - charge: predicted charge based on the isotope pattern (1,2,3,4,5 or
                  -1,-2,-3,-4,-5 are the only values allowed)

For the sake of this tutorial, the small dataset example introduced above is considered.

ipa.map_isotope_patterns(df,ionisation=1)
mapping isotope patterns ....
0.1 seconds elapsed

Once finished, this function modifies the pandas dataframe provided as input, annotating all isotope patterns.

df.head()
ids rel.ids mzs RTs Int relationship isotope pattern charge
0 1 0 116.070544 45.770423 2.170017e+09 bp 0 1
1 88 0 117.073678 45.787586 1.256520e+08 bp|isotope 0 1
2 501 0 231.133673 46.183948 2.519223e+07 potential bp 1 1
3 4429 0 232.136923 46.176715 2.635594e+06 potential bp|isotope 1 1
4 2 1 104.106830 40.843309 1.889172e+09 bp 0 1
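The added columns can then be queried with ordinary pandas operations, for example to separate monoisotopic peaks from their isotopes. A sketch on a mock dataframe mirroring the output above:

```python
import pandas as pd

# Mock of the columns added by ipa.map_isotope_patterns(); the real df is
# modified in place by the function and contains all original columns too.
df = pd.DataFrame({
    "ids": [1, 88, 501, 4429],
    "relationship": ["bp", "bp|isotope", "potential bp", "potential bp|isotope"],
    "charge": [1, 1, 1, 1],
})
counts = df["relationship"].value_counts()                 # features per category
isotopes = df[df["relationship"].str.endswith("isotope")]  # isotope peaks
monoisotopic = df[~df["relationship"].str.endswith("isotope")]
```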

Some data processing pipelines already include an isotope mapping function, and the user can rely on it as long as the data is organised in the correct format.

2. Compute all adducts

The second step of the pipeline consists in the calculation of all possible adducts that could be formed by the compounds included in the database. This is done by the function compute_all_adducts(). This function comes with a very detailed help.

help(ipa.compute_all_adducts)
Help on function compute_all_adducts in module ipaPy2.ipa:

compute_all_adducts(adductsAll, DB, ionisation=1, ncores=1)
    compute all adducts table based on the information present in the database
    
    Parameters
    ----------
    adductsAll : pandas dataframe (necessary)
                 Dataframe containing information on all possible
                 adducts. The file must be in the same format as the example
                 provided in the DB/adducts.csv
    DB : pandas dataframe (necessary)
         Dataframe containing the database against which the annotation is
         performed. The DB must contain the following columns in this exact
         order (optional fields can contain None):
             - id: unique id of the database entry (e.g., 'C00031') - necessary
             - name: compound name (e.g., 'D-Glucose') - necessary
             - formula: chemical formula (e.g., 'C6H12O6') - necessary
             - inchi: inchi string - optional
             - smiles: smiles string - optional
             - RT: if known, retention time range (in seconds) where this
                   compound is expected to elute (e.g., '30;60') - optional
             - adductsPos: list of adducts that should be considered in
                           positive mode for this entry (e.g.,'M+Na;M+H;M+')
             - adductsNeg: list of adducts that should be considered in
                           negative mode for this entry (e.g.,'M-H;M-2H')
             - description: comments on the entry - optional
             - pk: previous knowledge on the likelihood of this compound to be
                   present in the sample analysed. The value has to be between
                   1 (compound likely to be present in the sample) and 0
                   (compound cannot be present in the sample).
             - MS2: id for the MS2 database entries related to this compound
                    (optional)
             - reactions: list of reactions ids involving this compound
                          (e.g., 'R00010 R00015 R00028')-optional 
    ionisation : Default value 1. positive = 1, negative = -1
    ncores : default value 1. Number of cores used
    
    Returns
    -------
    allAdds: pandas dataframe containing the information on all the possible
    adducts given the database.

Depending on the size of the database used (i.e., the number of compounds included), this step can become rather time-consuming, and the use of multiple cores should be considered. In the context of this tutorial, the heavily reduced example database introduced before is considered.

allAddsPos = ipa.compute_all_adducts(adducts, DB, ionisation=1, ncores=1)
computing all adducts ....
0.1 seconds elapsed
allAddsPos.head()
id name adduct formula charge m/z RT pk MS2
0 C00079 L-Phenylalanine M+H C9H12NO2 1 166.086255 120;160 1 UA005501_1
1 C00079 L-Phenylalanine M+Na C9H11NNaO2 1 188.068197 120;160 1 UA005501_1
2 C00079 L-Phenylalanine M+2H C9H13NO2 2 83.546765 120;160 1 UA005501_1
3 C00079 L-Phenylalanine 2M+H C18H23N2O4 1 331.165233 120;160 1 UA005501_1
4 C00082 L-Tyrosine M+H C9H12NO3 1 182.081169 50;90 1 UA005601_1

If the same database is used for subsequent experiments without introducing new information, it is recommended to save the results of this function into a .csv file. In this case, the user would need to repeat this step in the future only if the DB changed.
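Caching the table is a one-liner with pandas. The one-row dataframe below is a stand-in for the allAddsPos table computed above:

```python
import os
import tempfile
import pandas as pd

# One-row stand-in for the output of ipa.compute_all_adducts(); in practice,
# save the allAddsPos dataframe itself.
allAddsPos = pd.DataFrame(
    [["C00079", "L-Phenylalanine", "M+H", "C9H12NO2", 1, 166.086255]],
    columns=["id", "name", "adduct", "formula", "charge", "m/z"])

path = os.path.join(tempfile.gettempdir(), "allAddsPos.csv")
allAddsPos.to_csv(path, index=False)   # save once, after computing the table
cached = pd.read_csv(path)             # reload in subsequent analyses
```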

3. Annotation based on MS1 information

At this point, the actual annotation process can start. If no fragmentation data is available, the MS1annotation() function should be used. This function annotates the dataset using the MS1 data and the information stored in the database. A detailed description of the function can be accessed through the help:

help(ipa.MS1annotation)
Help on function MS1annotation in module ipaPy2.ipa:

MS1annotation(df, allAdds, ppm, me=0.000548579909065, ratiosd=0.9, ppmunk=None, ratiounk=None, ppmthr=None, pRTNone=None, pRTout=None, ncores=1)
    Annotation of the dataset based on the MS1 information. Prior probabilities
    are based on mass only, while post probabilities are based on mass, RT,
    previous knowledge and isotope patterns.
    
    Parameters
    ----------
    df: pandas dataframe containing the MS1 data. It should be the output of the
        function ipa.map_isotope_patterns()
    allAdds: pandas dataframe containing the information on all the possible
            adducts given the database. It should be the output of either
            ipa.compute_all_adducts() or ipa.compute_all_adducts_Parallel()
    ppm: accuracy of the MS instrument used
    me: accurate mass of the electron. Default 5.48579909065e-04
    ratiosd: default 0.9. It represents the acceptable ratio between predicted
             intensity and observed intensity of isotopes. It is used to compute
             the shape parameters of the lognormal distribution used to
             calculate the isotope pattern scores as sqrt(1/ratiosd)
    ppmunk: ppm associated to the 'unknown' annotation. If not provided equal
            to ppm.
    ratiounk: isotope ratio associated to the 'unknown' annotation. If not
              provided equal to 0.5
    ppmthr: Maximum ppm possible for the annotations. If not provided equal to
            2*ppm
    pRTNone: Multiplicative factor for the RT if no RTrange present in the
             database. If not provided equal to 0.8
    pRTout: Multiplicative factor for the RT if measured RT is outside the
            RTrange present in the database. If not provided equal to 0.4
    ncores: default value 1. Number of cores used
    
    Returns
    -------
    annotations: a dictionary containing all the possible annotations for the
                measured features. The keys of the dictionary are the unique
                ids for the features present in df. For each feature, the
                annotations are summarized in a pandas dataframe.
annotations=ipa.MS1annotation(df,allAddsPos,ppm=3,ncores=1)
annotating based on MS1 information....
0.4 seconds elapsed

This function returns all the possible annotations for all the mass spectrometry features (excluding the ones previously identified as isotopes). The annotations are provided in the form of a dictionary. The keys of the dictionary are the unique ids for the features present in df. For each feature, all possible annotations are summarised in a dataframe including the following information:

  • id: Unique id associated with the compound as reported in the database
  • name: Name of the compound
  • formula: Chemical formula of the putative annotation
  • adduct: Adduct type
  • mz: Theoretical m/z associated with the specific ion
  • charge: Theoretical charge of the ion
  • RT range: Retention time range reported in the database for the specific compound
  • ppm: mass accuracy
  • isotope pattern score: Score representing how similar the measured and theoretical isotope patterns are
  • fragmentation pattern score: Cosine similarity. Empty in this case as no MS2 data was provided
  • prior: Probabilities associated with each possible annotation computed by only considering the mz values (i.e., only considering ppm)
  • post: Probabilities associated with each possible annotation computed by integrating all the additional information available: retention time range, ppm, isotope pattern score and prior knowledge.

As an example, the possible annotations for the feature associated with id=1 (m/z=116.0705438, RT=45.77) are shown below:

annotations[1]
id name formula adduct m/z charge RT range ppm isotope pattern score fragmentation pattern score prior post
0 C00148 L-Proline C5H10NO2 M+H 116.070605 1 35;55 -0.523247 0.331946 None 0.318084 0.454248
1 C00763 D-Proline C5H10NO2 M+H 116.070605 1 None -0.523247 0.331946 None 0.318084 0.363398
2 C18170 3-Acetamidopropanal C5H10NO2 M+H 116.070605 1 500;560 -0.523247 0.331946 None 0.318084 0.181699
3 Unknown Unknown None None None None None 3.000000 0.004161 None 0.045748 0.000655

It should be noted that in this example, the prior probabilities associated with L-Proline M+H, D-Proline M+H and 3-Acetamidopropanal are exactly the same. This is because all three ions have exactly the same theoretical mass. However, the post probabilities are different. This is because the retention time associated with this feature is within the retention time range reported in the database for L-Proline and outside the one reported for 3-Acetamidopropanal.

An expert in LC/MS-based mass spectrometry would argue that with most chromatographic columns stereoisomers such as L- and D-Proline would share the same RT range. While this is likely to be correct, it must be noted that the IPA method can only use the information present in the database. When populating it, we opted for a more agnostic approach and only included RT ranges for compounds that were actually detected as standards with our experimental setting. If the user wants to include the notion that 'stereoisomers share the same RT ranges', they should simply add this information in the database.
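Since each entry of the annotations dictionary is a dataframe, a summary of the most likely annotation per feature can be extracted with ordinary pandas operations. A sketch on a mock dictionary with the same shape as the output of MS1annotation():

```python
import pandas as pd

# Mock of the structure returned by ipa.MS1annotation(): one dataframe per
# feature id, each with a 'post' column of posterior probabilities.
annotations = {
    1: pd.DataFrame({
        "name": ["L-Proline", "D-Proline", "3-Acetamidopropanal", "Unknown"],
        "post": [0.454248, 0.363398, 0.181699, 0.000655]}),
}
# Highest-posterior annotation for every feature
best = {fid: tab.sort_values("post", ascending=False).iloc[0]["name"]
        for fid, tab in annotations.items()}
```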

Here is another example:

annotations[999]
id name formula adduct m/z charge RT range ppm isotope pattern score fragmentation pattern score prior post
0 C00079 L-Phenylalanine C18H23N2O4 2M+H 331.165233 1 120;160 -0.941814 0.472049 None 0.240106 0.550778
1 C02265 D-Phenylalanine C18H23N2O4 2M+H 331.165233 1 None -0.941814 0.472049 None 0.240106 0.440622
2 Unknown Unknown None None None None None 3.000000 0.055901 None 0.039575 0.0086
3 C03263 Coproporphyrinogen III C36H46N4O8 M+2H 331.165233 2 None -0.941814 0.0 None 0.240106 0.0
4 C05768 Coproporphyrinogen I C36H46N4O8 M+2H 331.165233 2 None -0.941814 0.0 None 0.240106 0.0

Also in this case, all the prior probabilities associated with the four ions are exactly the same since all the ions have the same theoretical mass-to-charge ratio. However, the posterior probabilities are significantly different. Two of these ions (Coproporphyrinogen III M+2H and Coproporphyrinogen I M+2H) have charge +2, while the other two possible annotations have charge +1. The observed isotope pattern is consistent with an ion with charge +1 (i.e., difference between isotopes = 1), and this is reflected in the isotope pattern score and consequently in the posterior probabilities. Moreover, the retention time associated with this feature is within the range reported for L-Phenylalanine in the database. Therefore, the posterior probability associated with L-Phenylalanine 2M+H is the highest.

4. Annotation based on MS1 and MS2 information

As already mentioned above, fragmentation data can be included in the annotation process by using the MSMSannotation() function. A detailed description of the function can be accessed through the help:

help(ipa.MSMSannotation)
Help on function MSMSannotation in module ipaPy2.ipa:

MSMSannotation(df, dfMS2, allAdds, DBMS2, ppm, me=0.000548579909065, ratiosd=0.9, ppmunk=None, ratiounk=None, ppmthr=None, pRTNone=None, pRTout=None, mzdCS=0, ppmCS=10, CSunk=0.7, evfilt=False, ncores=1)
    Annotation of the dataset based on the MS1 and MS2 information. Prior
    probabilities are based on mass only, while post probabilities are based
    on mass, RT, previous knowledge and isotope patterns.
    
    Parameters
    ----------
    df: pandas dataframe containing the MS1 data. It should be the output of the
        function ipa.map_isotope_patterns()
    dfMS2: pandas dataframe containing the MS2 data. It must contain 3 columns
        -id: an unique id for each feature for which the MS2 spectrum was
             acquired (same as in df)
        -spectrum: string containing the spectrum information in the following
                   format 'mz1:Int1 mz2:Int2 mz3:Int3 ...'
        -ev: collision energy used to acquire the fragmentation spectrum
    allAdds: pandas dataframe containing the information on all the possible
            adducts given the database. It should be the output of either
            ipa.compute_all_adducts() or ipa.compute_all_adducts_Parallel()
    DBMS2: pandas dataframe containing the database storing the MS2
           information
    ppm: accuracy of the MS instrument used
    me: accurate mass of the electron. Default 5.48579909065e-04
    ratiosd: default 0.9. It represents the acceptable ratio between predicted
            and observed intensity of isotopes. It is used to compute the
            shape parameter of the lognormal distribution used to calculate
            the isotope pattern scores, as sqrt(1/ratiosd)
    ppmunk: ppm associated to the 'unknown' annotation. If not provided equal
            to ppm.
    ratiounk: isotope ratio associated to the 'unknown' annotation. If not
              provided equal to 0.5
    ppmthr: Maximum ppm possible for the annotations. If not provided equal to
            2*ppm
    pRTNone: Multiplicative factor for the RT if no RTrange present in the
            database. If not provided equal to 0.8
    pRTout: Multiplicative factor for the RT if measured RT is outside the
            RTrange present in the database. If not provided equal to 0.4
    mzdCS: maximum mz difference allowed when computing cosine similarity
           scores. If one wants to use this parameter instead of ppmCS, this
           must be set to 0. Default 0.
    ppmCS: maximum ppm allowed when computing cosine similarity scores.
           If one wants to use this parameter instead of mzdCS, this must be
           set to 0. Default 10.
    CSunk: cosine similarity score associated with the 'unknown' annotation.
            Default 0.7
    evfilt: Default value False. If true, only spectra acquired with the same
            collision energy are considered.
    ncores: default value 1. Number of cores used
    
    Returns
    -------
    annotations: a dictionary containing all the possible annotations for the
                 measured features. The keys of the dictionary are the unique
                 ids for the features present in df. For each feature, the
                 annotations are summarized in a pandas dataframe.

The line below integrates the fragmentation data and the fragmentation database introduced above into the annotation process. The role of the CSunk (“cosine unknown”) parameter deserves a brief discussion. In most cases, the fragmentation database contains spectra only for a subset of the compounds in the database. Therefore, when considering a feature for which a fragmentation spectrum was acquired, the cosine similarity can often only be computed for a subset of the possible annotations. The CSunk value is assigned to the remaining annotations so that all annotations can be compared.

annotations=ipa.MSMSannotation(df,dfMS2,allAddsPos,DBMS2,CSunk=0.7,ppm=3,ncores=1)
annotating based on MS1 and MS2 information....
0.7 seconds elapsed

The output of this function has the same structure as that of the MS1annotation() function, but it also includes the fragmentation pattern scores when fragmentation data is available. As an example, the possible annotations for the feature associated with id=1 are shown below:

annotations[1]
id name formula adduct m/z charge RT range ppm isotope pattern score fragmentation pattern score prior post
0 C00148 L-Proline C5H10NO2 M+H 116.070605 1 35;55 -0.523247 0.331946 0.999759 0.318084 0.543121
1 C00763 D-Proline C5H10NO2 M+H 116.070605 1 None -0.523247 0.331946 0.7 0.318084 0.304221
2 C18170 3-Acetamidopropanal C5H10NO2 M+H 116.070605 1 500;560 -0.523247 0.331946 0.7 0.318084 0.15211
3 Unknown Unknown None None None None None 3.000000 0.004161 0.7 0.045748 0.000548

In this case, the cosine similarity score for the annotation L-Proline M+H is very high, therefore the posterior probability associated with it is higher than the one obtained without considering the MS2 data.

Here is another example for a feature having a very similar mass-to-charge ratio (id=90, m/z=117.0705223, RT=63.45).

annotations[90]
id name formula adduct m/z charge RT range ppm isotope pattern score fragmentation pattern score prior post
0 C00763 D-Proline C5H10NO2 M+H 116.070605 1 None -0.708479 None 0.7 0.317329 0.480821
1 C18170 3-Acetamidopropanal C5H10NO2 M+H 116.070605 1 500;560 -0.708479 None 0.7 0.317329 0.24041
2 C00148 L-Proline C5H10NO2 M+H 116.070605 1 35;55 -0.708479 None 0.59986 0.317329 0.206018
3 Unknown Unknown None None None None None 3.000000 None 0.7 0.048013 0.072751

In this case, the cosine similarity score for the annotation L-Proline M+H is not very high. Moreover, the retention time assigned to this feature (63.45 s) is outside both retention time ranges reported in the database, for L-Proline and for 3-Acetamidopropanal. Therefore, the most likely annotation for this feature is D-Proline M+H, the only annotation not directly contradicted by the available evidence. It should be noted that the fragmentation pattern score has a rather weak effect on the posterior probability associated with L-Proline, since it is close to the score assigned to annotations without MS2 information in the database (CSunk=0.7).
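To make the spectrum format and the role of the cosine similarity concrete, the sketch below parses two spectra written in the 'mz1:Int1 mz2:Int2 ...' format used by the dfMS2 'spectrum' column and computes a naive cosine similarity. This is an illustration only, not the library's implementation; the function names and the mzd tolerance are hypothetical stand-ins (the tolerance plays a role analogous to the mzdCS parameter).

```python
import math

def parse_spectrum(s):
    # Parse the 'mz1:Int1 mz2:Int2 ...' format used in dfMS2's 'spectrum' column
    return [(float(mz), float(i)) for mz, i in
            (pair.split(":") for pair in s.split())]

def cosine_similarity(spec1, spec2, mzd=0.01):
    # Naive greedy peak matching: two peaks match if their m/z values
    # differ by less than mzd (analogous to the mzdCS parameter)
    matched, used = [], set()
    for mz1, i1 in spec1:
        for j, (mz2, i2) in enumerate(spec2):
            if j not in used and abs(mz1 - mz2) < mzd:
                matched.append((i1, i2))
                used.add(j)
                break
    num = sum(a * b for a, b in matched)
    den = (math.sqrt(sum(i * i for _, i in spec1))
           * math.sqrt(sum(i * i for _, i in spec2)))
    return num / den if den else 0.0

s1 = parse_spectrum("70.065:100 116.071:45")
s2 = parse_spectrum("70.066:90 116.070:50")
print(round(cosine_similarity(s1, s2), 3))  # → 0.996
```

A score close to 1 indicates near-identical spectra; annotations without an entry in the fragmentation database would instead receive the fixed CSunk value.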

5. Computing posterior probabilities integrating adducts connections

Until this point, the putative annotations and the associated probabilities computed for each feature are independent of each other. However, the IPA method can be used to update the probabilities by considering the possible relationships between annotations. For example, the Gibbs_sampler_add() function uses a Gibbs sampler to estimate the posterior probabilities obtained by considering all possible adduct connections.

The help() provides a detailed description of this function:

help(ipa.Gibbs_sampler_add)
Help on function Gibbs_sampler_add in module ipaPy2.ipa:

Gibbs_sampler_add(df, annotations, noits=100, burn=None, delta_add=1, all_out=False, zs=None)
    Gibbs sampler considering only adduct connections. The function computes
    the posterior probabilities of the annotations considering the adducts
    connections.
    
    Parameters
    ----------
    df: pandas dataframe containing the MS1 data. It should be the output of the
        function ipa.map_isotope_patterns()
    annotations: a dictionary containing all the possible annotations for the
                measured features. The keys of the dictionary are the unique
                ids for the features present in df. For each feature, the
                annotations are summarized in a pandas dataframe. Output of
                functions MS1annotation(), MS1annotation_Parallel(),
                MSMSannotation() or MSMSannotation_Parallel()
    noits: number of iterations of the Gibbs sampler to be run
    burn: number of iterations to be ignored when computing posterior
          probabilities. If None, it is set to 10% of total iterations
    delta_add: parameter used when computing the conditional priors. The
               parameter must be positive. The smaller the parameter the more
               weight the adducts connections have on the posterior
               probabilities. Default 1.
    all_out: logical value. If true the list of assignments found in each
             iteration is returned by the function. Default False.
    zs: list of assignments computed in a previous run of the Gibbs sampler. 
        Optional, default None.
    
    Returns
    -------
    annotations: the function modifies the annotations dictionary by adding 2
                 columns to each entry. One named 'post Gibbs' contains the
                 posterior probabilities computed. The other is called
                 'chi-square pval' containing the p-value from a chi-squared
                 test comparing the 'post' with the 'post Gibbs' probabilities.
    zs: optional, if all_out==True, the function returns the full list of
        assignments computed. This allows restarting the sampler from where
        a previous run stopped.

zs = ipa.Gibbs_sampler_add(df,annotations,noits=1000,delta_add=0.1, all_out=True)
computing posterior probabilities including adducts connections
initialising sampler ...


Gibbs Sampler Progress Bar: 100%|██████████| 1000/1000 [00:03<00:00, 258.48it/s]

parsing results ...
Done -  3.9 seconds elapsed

The function modifies the annotations dictionary by adding two additional columns to each dataframe:

  • post Gibbs: posterior probabilities obtained from the Gibbs sampler.
  • chi-square pval: p-value of a chi-square test assessing whether the 'post Gibbs' probabilities are statistically different from the 'post' probabilities. The obtained p-value is reported in this column.

If all_out=True, the function also returns the full list of assignments computed. If provided as an input to a subsequent call of the Gibbs sampler, this list allows restarting the sampler from where the previous run finished.

ipa.Gibbs_sampler_add(df,annotations, noits=4000,delta_add=0.1,zs=zs)
computing posterior probabilities including adducts connections
initialising sampler ...


Gibbs Sampler Progress Bar: 100%|██████████| 4000/4000 [00:15<00:00, 261.29it/s]

parsing results ...
Done -  15.4 seconds elapsed

As an example, the possible annotations for the feature associated with the id 501 are shown below.

annotations[501]
id name formula adduct m/z charge RT range ppm isotope pattern score fragmentation pattern score prior post post Gibbs chi-square pval
0 C00148 L-Proline C10H19N2O4 2M+H 231.133933 1 35;55 -1.124747 0.328045 None 0.314538 0.453117 0.662075 4.087367e-186
1 C00763 D-Proline C10H19N2O4 2M+H 231.133933 1 None -1.124747 0.328045 None 0.314538 0.362494 0.266163 4.087367e-186
2 C18170 3-Acetamidopropanal C10H19N2O4 2M+H 231.133933 1 500;560 -1.124747 0.328045 None 0.314538 0.181247 0.071540 4.087367e-186
3 Unknown Unknown None None None None None 3.000000 0.015864 None 0.056386 0.003143 0.000222 4.087367e-186

This feature is clustered with feature id=1, whose most likely annotation is L-Proline M+H. As expected, when the adduct connections are considered, the 'post Gibbs' probability associated with L-Proline 2M+H is significantly higher than those of the alternatives.
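The 'chi-square pval' column can be used to quickly list the features whose probabilities were substantially updated by the sampler. The snippet below is illustrative: it builds a toy stand-in for the annotations dictionary, and the 0.05 threshold is an arbitrary choice.

```python
import pandas as pd

# Toy stand-in for the annotations dictionary produced by the Gibbs sampler:
# every row of a feature's dataframe carries the same 'chi-square pval'.
annotations = {
    501: pd.DataFrame({"name": ["L-Proline", "D-Proline"],
                       "post": [0.45, 0.36],
                       "post Gibbs": [0.66, 0.27],
                       "chi-square pval": [4.1e-186, 4.1e-186]}),
    502: pd.DataFrame({"name": ["Betaine"],
                       "post": [0.90],
                       "post Gibbs": [0.91],
                       "chi-square pval": [0.42]}),
}

# Features whose posteriors changed significantly (p < 0.05)
updated = [fid for fid, ann in annotations.items()
           if ann["chi-square pval"].iloc[0] < 0.05]
print(updated)  # → [501]
```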

6. Computing posterior probabilities integrating biochemical connections

The IPA method can also update the probabilities associated with each possible annotation by considering all possible biochemical connections.

Before doing so, it is necessary to provide a pandas dataframe reporting which compounds can be considered biochemically related. The function Compute_Bio() can be used to compute such a dataframe. The help() provides a detailed description of this function:

help(ipa.Compute_Bio)
Help on function Compute_Bio in module ipaPy2.ipa:

Compute_Bio(DB, annotations=None, mode='reactions', connections=['C3H5NO', 'C6H12N4O', 'C4H6N2O2', 'C4H5NO3', 'C3H5NOS', 'C6H10N2O3S2', 'C5H7NO3', 'C5H8N2O2', 'C2H3NO', 'C6H7N3O', 'C6H11NO', 'C6H11NO', 'C6H12N2O', 'C5H9NOS', 'C9H9NO', 'C5H7NO', 'C3H5NO2', 'C4H7NO2', 'C11H10N2O', 'C9H9NO2', 'C5H9NO', 'C4H4O2', 'C3H5O', 'C10H12N5O6P', 'C10H15N2O3S', 'C10H14N2O2S', 'CH2ON', 'C21H34N7O16P3S', 'C21H33N7O15P3S', 'C10H15N3O5S', 'C5H7', 'C3H2O3', 'C16H30O', 'C8H8NO5P', 'CH3N2O', 'C5H4N5', 'C10H11N5O3', 'C10H13N5O9P2', 'C10H12N5O6P', 'C9H13N3O10P2', 'C9H12N3O7P', 'C4H4N3O', 'C10H13N5O10P2', 'C10H12N5O7P', 'C5H4N5O', 'C10H11N5O4', 'C10H14N2O10P2', 'C10H12N2O4', 'C5H5N2O2', 'C10H13N2O7P', 'C9H12N2O11P2', 'C9H11N2O8P', 'C4H3N2O2', 'C9H10N2O5', 'C2H3O2', 'C2H2O', 'C2H2', 'CO2', 'CHO2', 'H2O', 'H3O6P2', 'C2H4', 'CO', 'C2O2', 'H2', 'O', 'P', 'C2H2O', 'CH2', 'HPO3', 'NH2', 'PP', 'NH', 'SO3', 'N', 'C6H10O5', 'C6H10O6', 'C5H8O4', 'C12H20O11', 'C6H11O8P', 'C6H8O6', 'C6H10O5', 'C18H30O15'], ncores=1)
    Compute matrix of biochemical connections. Either based on a list of
    possible connections in the form of a list of formulas or based on the
    reactions present in the database.
    
    Parameters
    ----------
    DB: pandas dataframe containing the database against which the annotation
        is performed. The DB must contain the following columns in this exact
        order (optional fields can contain None):
        - id: unique id of the database entry (e.g., 'C00031') - necessary
        - name: compound name (e.g., 'D-Glucose') - necessary
        - formula: chemical formula (e.g., 'C6H12O6') - necessary
        - inchi: inchi string - optional
        - smiles: smiles string - optional
        - RT: if known, retention time range (in seconds) where this compound
                is expected to elute (e.g., '30;60') - optional
        - adductsPos: list of adducts that should be considered in positive mode
                      for this entry (e.g.,'M+Na;M+H;M+') - necessary
        - adductsNeg: list of adducts that should be considered in negative
                      mode for this entry (e.g.,'M-H;M-2H') - necessary
        - description: comments on the entry - optional
        - pk: previous knowledge on the likelihood of this compound being
             present in the analysed sample. The value has to be between 1
             (compound likely to be present in the sample) and 0 (compound
             cannot be present in the sample).
        - MS2: id for the MS2 database entries related to this compound
               (optional)
        - reactions: list of reactions ids involving this compound
                    (e.g., 'R00010 R00015 R00028')-optional, but necessary if 
                    mode='reactions'.
    annotations: If equal to None (default) all entries in the DB are considered 
                (used to pre-compute the Bio matrix), alternatively it should be
                a dictionary containing all the possible annotations for the
                measured features. The keys of the dictionary are the unique ids
                for the features present in df. For each feature, the
                annotations are summarized in a pandas dataframe. Output of
                functions MS1annotation(), MS1annotation_Parallel(),
                MSMSannotation() or MSMSannotation_Parallel(). In this case
                only the entries currently considered as possible annotations
                are used.
    mode: either 'reactions' (connections are computed based on the reactions
          present in the database) or 'connections' (connections are computed
          based on the list of connections provided). Default 'reactions'.
    connections: list of possible connections between compounds defined as
                formulas. Only necessary if mode='connections'. A list of
                common biotransformations is provided as default.
    ncores: default value 1. Number of cores used
    
    Returns
    -------
        Bio: dataframe containing all the possible connections computed.

According to the value assigned to the 'mode' parameter, the function can compute all possible biochemical connections in two ways. If mode='reactions', the function connects the compounds that share the same reaction id(s) according to what is reported in the database.

Bio = ipa.Compute_Bio(DB,annotations,mode='reactions')
Bio
computing all possible biochemical connections
considering the reactions stored in the database ...
0.0 seconds elapsed
0 1
0 C00082 C00079
1 C00082 C04368
2 C21092 C00407
3 C02265 C00079
4 C00123 C02486
5 C00763 C00431
6 C00079 C20807
7 C00407 C00183
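The matching logic used when mode='reactions' can be sketched as follows: two compounds are connected if their 'reactions' fields share at least one reaction id. The database fragment below is a toy example, not taken from the actual DB.

```python
from itertools import combinations

import pandas as pd

# Toy database fragment: 'reactions' holds space-separated reaction ids
DB = pd.DataFrame({
    "id": ["C00148", "C00763", "C00183"],
    "reactions": ["R01248 R02293", "R02293", "R01214"],
})

pairs = [(i1, i2)
         for (i1, r1), (i2, r2) in combinations(zip(DB["id"], DB["reactions"]), 2)
         if set(r1.split()) & set(r2.split())]  # share at least one reaction id

Bio = pd.DataFrame(pairs)
print(pairs)  # → [('C00148', 'C00763')]
```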

If mode='connections', the function computes the 'chemical formula difference' for each pair of compounds considered. If the difference is included in the list of connections, the two compounds are considered connected. A list of connections is provided as default, but it can be modified.

Bio = ipa.Compute_Bio(DB,annotations,mode='connections')
Bio
computing all possible biochemical connections
considering the provided connections ...
3.1 seconds elapsed
0 1
0 C04282 C05131
1 C04282 C22140
2 C04282 C16744
3 C01879 C05131
4 C01879 C22140
5 C01879 C16744
6 C01877 C05131
7 C01877 C22140
8 C01877 C16744
9 C05131 C02237
10 C05131 C04281
11 C05131 C22141
12 C02237 C22140
13 C02237 C16744
14 C22140 C04281
15 C22140 C22141
16 C16744 C04281
17 C16744 C22141
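The idea behind mode='connections' can be sketched as follows: compute the atom-count difference between two chemical formulas and check whether it appears in the connections list. This is a simplified illustration (no brackets, isotopes or charges), not the library's implementation.

```python
import re
from collections import Counter

def parse_formula(formula):
    # Count atoms in a simple formula such as 'C6H12O6' (no brackets/charges)
    counts = Counter()
    for element, n in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(n) if n else 1
    return counts

def formula_diff(f1, f2):
    # Difference between the larger and the smaller formula, as a formula
    # string; None if the smaller one is not fully contained in the larger
    c1, c2 = parse_formula(f1), parse_formula(f2)
    big, small = (c1, c2) if sum(c1.values()) >= sum(c2.values()) else (c2, c1)
    if small - big:          # atoms left over: not a clean difference
        return None
    return "".join(el + (str(n) if n > 1 else "")
                   for el, n in sorted((big - small).items()))

# D-Glucose and sucrose differ by C6H10O5, one of the default connections
diff = formula_diff("C6H12O6", "C12H22O11")
print(diff)  # → C6H10O5
```

Since 'C6H10O5' is in the default connections list, these two compounds would be reported as biochemically connected.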

Depending on the size of the database and the dataset, computing all possible biochemical connections can be extremely computationally demanding and can drastically increase the computation time needed for the annotation. For this reason, a precomputed list of biochemical connections based on the provided database (computed with either the 'reactions' or the 'connections' mode) is included in the library and can be used directly, without needing to compute the biochemical connections.

Bio = pd.read_csv('DB/allBIO_reactions.csv')

The list of connections computed with mode='connections' needs to be unzipped first.

import zipfile
with zipfile.ZipFile("DB/allBio_connections.csv.zip","r") as zip_ref:
    zip_ref.extractall("DB/")

Bio=pd.read_csv('DB/allBio_connections.csv')

Alternatively, the user can define their own biochemical connections, referring to compounds by their database ids, for example: L-Proline (C00148), L-Valine (C00183), L-Phenylalanine (C00079), L-Leucine (C00123), 5-Oxoproline (C01879), Betaine (C00719), Hordatine A (C08307), L-Tyrosine (C00082), D-Proline (C00763) and D-Phenylalanine (C02265).

Bio=pd.DataFrame([['C00148','C00763'],
                  ['C00079','C02265'],
                  ['C08307','C00082'],
                  ['C08307','C00079']])
ipa.Gibbs_sampler_bio(df,annotations,Bio,noits=5000,delta_bio=0.1)
computing posterior probabilities including biochemical connections
initialising sampler ...


Gibbs Sampler Progress Bar: 100%|██████████| 5000/5000 [00:23<00:00, 217.12it/s]


parsing results ...
Done -  23.1 seconds elapsed

As an example, the possible annotations for the feature associated with the id 992 are shown below.

annotations[992]
id name formula adduct m/z charge RT range ppm isotope pattern score fragmentation pattern score prior post post Gibbs chi-square pval
0 C00763 D-Proline C5H10NO2 M+H 116.070605 1 None -0.65851 None 0.7 0.317559 0.531683 0.770667 0.0
2 C00148 L-Proline C5H10NO2 M+H 116.070605 1 35;55 -0.65851 None 0.324512 0.317559 0.123241 0.180889 0.0
1 C18170 3-Acetamidopropanal C5H10NO2 M+H 116.070605 1 500;560 -0.65851 None 0.7 0.317559 0.265841 0.035556 0.0
3 Unknown Unknown None None None None None 3.00000 None 0.7 0.047324 0.079234 0.012889 0.0

The probability associated with D-Proline M+H is significantly higher after considering the biochemical connections. This is because D-Proline is biochemically connected to L-Proline (via proline racemase), and the most likely annotation for feature id=1 is L-Proline M+H (>50%).

7. Computing posterior probabilities integrating both adducts and biochemical connections

It is also possible to run the Gibbs sampler considering biochemical and adduct connections at the same time. To do so, one can use the function Gibbs_sampler_bio_add(). The help() provides a detailed explanation of the function.

help(ipa.Gibbs_sampler_bio_add)
Help on function Gibbs_sampler_bio_add in module ipaPy2.ipa:

Gibbs_sampler_bio_add(df, annotations, Bio, noits=100, burn=None, delta_bio=1, delta_add=1, all_out=False, zs=None)
    Gibbs sampler considering both biochemical and adducts connections. The
    function computes the posterior probabilities of the annotations
    considering the possible biochemical connections reported in Bio and the
    possible adducts connection.
    
    Parameters
    ----------
    df: pandas dataframe containing the MS1 data. It should be the output of the
        function ipa.map_isotope_patterns()
    annotations: a dictionary containing all the possible annotations for the
                 measured features. The keys of the dictionary are the unique
                 ids for the features present in df. For each feature, the
                 annotations are summarized in a pandas dataframe. Output of
                 functions MS1annotation(), MS1annotation_Parallel(),
                 MSMSannotation() or MSMSannotation_Parallel()
    Bio: dataframe (2 columns), reporting all the possible connections between
         compounds. It uses the unique ids from the database. It could be the
         output of Compute_Bio() or Compute_Bio_Parallel().
    noits: number of iterations of the Gibbs sampler to be run
    burn: number of iterations to be ignored when computing posterior
          probabilities. If None, it is set to 10% of total iterations
    delta_bio: parameter used when computing the conditional priors.
               The parameter must be positive. The smaller the parameter the
               more weight the biochemical connections have on the posterior
               probabilities. Default 1.
    delta_add: parameter used when computing the conditional priors. The
               parameter must be positive. The smaller the parameter the more
               weight the adducts connections have on the posterior
               probabilities. Default 1.
    all_out: logical value. If true the list of assignments found in each
            iteration is returned by the function. Default False.
    zs: list of assignments computed in a previous run of the Gibbs sampler.
        Optional, default None.
    
    Returns
    -------
    annotations: the function modifies the annotations dictionary by adding 2
                columns to each entry. One named 'post Gibbs' contains the
                posterior probabilities computed. The other is called
                'chi-square pval' containing the p-value from a chi-squared
                test comparing the 'post' with the 'post Gibbs' probabilities.
    zs: optional, if all_out==True, the function returns the full list of
        assignments computed. This allows restarting the sampler from where
        a previous run stopped.

ipa.Gibbs_sampler_bio_add(df,annotations,Bio,noits=5000,delta_bio=0.1,delta_add=0.1)
computing posterior probabilities including biochemical and adducts connections
initialising sampler ...


Gibbs Sampler Progress Bar: 100%|██████████| 5000/5000 [00:24<00:00, 204.74it/s]


parsing results ...
Done -  24.5 seconds elapsed
annotations[1]
id name formula adduct m/z charge RT range ppm isotope pattern score fragmentation pattern score prior post post Gibbs chi-square pval
0 C00148 L-Proline C5H10NO2 M+H 116.070605 1 35;55 -0.523247 0.331946 0.999759 0.318084 0.543121 0.783111 1.468730e-271
1 C00763 D-Proline C5H10NO2 M+H 116.070605 1 None -0.523247 0.331946 0.7 0.318084 0.304221 0.212444 1.468730e-271
2 C18170 3-Acetamidopropanal C5H10NO2 M+H 116.070605 1 500;560 -0.523247 0.331946 0.7 0.318084 0.15211 0.004444 1.468730e-271
3 Unknown Unknown None None None None None 3.000000 0.004161 0.7 0.045748 0.000548 0.000000 1.468730e-271

8. Running the whole pipeline with a single function

Finally, the ipaPy2 library also includes a wrapper function that allows running the whole IPA pipeline in one step. A detailed description of the function can be accessed through the help().

help(ipa.simpleIPA)
Help on function simpleIPA in module ipaPy2.ipa:

simpleIPA(df, ionisation, DB, adductsAll, ppm, dfMS2=None, DBMS2=None, noits=100, burn=None, delta_add=None, delta_bio=None, Bio=None, mode='reactions', CSunk=0.5, isodiff=1, ppmiso=100, ncores=1, me=0.000548579909065, ratiosd=0.9, ppmunk=None, ratiounk=None, ppmthr=None, pRTNone=None, pRTout=None, mzdCS=0, ppmCS=10, evfilt=False, connections=['C3H5NO', 'C6H12N4O', 'C4H6N2O2', 'C4H5NO3', 'C3H5NOS', 'C6H10N2O3S2', 'C5H7NO3', 'C5H8N2O2', 'C2H3NO', 'C6H7N3O', 'C6H11NO', 'C6H11NO', 'C6H12N2O', 'C5H9NOS', 'C9H9NO', 'C5H7NO', 'C3H5NO2', 'C4H7NO2', 'C11H10N2O', 'C9H9NO2', 'C5H9NO', 'C4H4O2', 'C3H5O', 'C10H12N5O6P', 'C10H15N2O3S', 'C10H14N2O2S', 'CH2ON', 'C21H34N7O16P3S', 'C21H33N7O15P3S', 'C10H15N3O5S', 'C5H7', 'C3H2O3', 'C16H30O', 'C8H8NO5P', 'CH3N2O', 'C5H4N5', 'C10H11N5O3', 'C10H13N5O9P2', 'C10H12N5O6P', 'C9H13N3O10P2', 'C9H12N3O7P', 'C4H4N3O', 'C10H13N5O10P2', 'C10H12N5O7P', 'C5H4N5O', 'C10H11N5O4', 'C10H14N2O10P2', 'C10H12N2O4', 'C5H5N2O2', 'C10H13N2O7P', 'C9H12N2O11P2', 'C9H11N2O8P', 'C4H3N2O2', 'C9H10N2O5', 'C2H3O2', 'C2H2O', 'C2H2', 'CO2', 'CHO2', 'H2O', 'H3O6P2', 'C2H4', 'CO', 'C2O2', 'H2', 'O', 'P', 'C2H2O', 'CH2', 'HPO3', 'NH2', 'PP', 'NH', 'SO3', 'N', 'C6H10O5', 'C6H10O6', 'C5H8O4', 'C12H20O11', 'C6H11O8P', 'C6H8O6', 'C6H10O5', 'C18H30O15'])
     Wrapper function performing the whole IPA pipeline.
     
     Parameters
     ----------
     df: pandas dataframe containing the MS1 data. It should be the output of the
         function ipa.map_isotope_patterns()
         
     DB: pandas dataframe containing the database against which the annotation
         is performed. The DB must contain the following columns in this exact
         order (optional fields can contain None):
         - id: unique id of the database entry (e.g., 'C00031') - necessary
         - name: compound name (e.g., 'D-Glucose') - necessary
         - formula: chemical formula (e.g., 'C6H12O6') - necessary
         - inchi: inchi string - optional
         - smiles: smiles string - optional
         - RT: if known, retention time range (in seconds) where this compound
                 is expected to elute (e.g., '30;60') - optional
         - adductsPos: list of adducts that should be considered in positive mode
                       for this entry (e.g.,'M+Na;M+H;M+') - necessary
         - adductsNeg: list of adducts that should be considered in negative
                       mode for this entry (e.g.,'M-H;M-2H') - necessary
         - description: comments on the entry - optional
         - pk: previous knowledge on the likelihood of this compound being
              present in the analysed sample. The value has to be between 1
              (compound likely to be present in the sample) and 0 (compound
              cannot be present in the sample).
         - MS2: id for the MS2 database entries related to this compound
                (optional)
         - reactions: list of reactions ids involving this compound
                     (e.g., 'R00010 R00015 R00028')-optional, but necessary if 
                     mode='reactions'.
     adductsAll: a dataframe containing information on all possible adducts.
     ppm: accuracy of the MS instrument used
     dfMS2: pandas dataframe containing the MS2 data (optional). It must contain
            3 columns:
                    -id: a unique id for each feature for which the MS2 spectrum
                        was acquired (same as in df)
                    -spectrum: string containing the spectrum information in the
                               following format 'mz1:Int1 mz2:Int2 mz3:Int3 ...'
                    -ev: collision energy used to acquire the fragmentation
                        spectrum
     DBMS2: pandas dataframe containing the database storing the MS2
            information (optional)
     evfilt: Default value False. If true, only spectra acquired with the same
             collision energy are considered.
     noits: number of iterations of the Gibbs sampler to be run
     burn: number of iterations to be ignored when computing posterior
           probabilities. If None, it is set to 10% of total iterations
     delta_bio: parameter used when computing the conditional priors.
                The parameter must be positive. The smaller the parameter the
                more weight the biochemical connections have on the posterior
                probabilities. Default 1.
     delta_add: parameter used when computing the conditional priors. The
                parameter must be positive. The smaller the parameter the more
                weight the adducts connections have on the posterior
                probabilities. Default 1.
     Bio: dataframe (2 columns), reporting all the possible connections between
          compounds. It uses the unique ids from the database. It could be the
          output of Compute_Bio() or Compute_Bio_Parallel().
     mode: either 'reactions' (connections are computed based on the reactions
           present in the database) or 'connections' (connections are computed
           based on the list of connections provided). Default 'reactions'.
     CSunk: cosine similarity score associated with the 'unknown' annotation.
             Default 0.5
     isodiff: Default value 1. Difference between isotopes of charge 1, does not
              need to be exact
     ppmiso: Default value 100. Maximum ppm value allowed between 2 isotopes.
             It is very high on purpose
     ncores: default value 1. Number of cores used
     me: accurate mass of the electron. Default 5.48579909065e-04
     ratiosd: default 0.9. It represents the acceptable ratio between predicted
             and observed intensity of isotopes. It is used to compute the
             shape parameter of the lognormal distribution used to calculate
             the isotope pattern scores, as sqrt(1/ratiosd)
     ppmunk: ppm associated to the 'unknown' annotation. If not provided equal
             to ppm.
     ratiounk: isotope ratio associated to the 'unknown' annotation. If not
               provided equal to 0.5
     ppmthr: Maximum ppm possible for the annotations. If not provided equal to
             2*ppm
     pRTNone: Multiplicative factor for the RT if no RTrange present in the
             database. If not provided equal to 0.8
     pRTout: Multiplicative factor for the RT if measured RT is outside the
             RTrange present in the database. If not provided equal to 0.4
     mzdCS: maximum mz difference allowed when computing cosine similarity
            scores. If one wants to use this parameter instead of ppmCS, this
            must be set to 0. Default 0.
     ppmCS: maximum ppm allowed when computing cosine similarity scores.
            If one wants to use this parameter instead of mzdCS, this must be
            set to 0. Default 10.
     connections: list of possible connections between compounds defined as
                 formulas. Only necessary if mode='connections'. A list of
                 common biotransformations is provided as default.
     Output:
         annotations: a dictionary containing all the possible annotations for the measured features. The keys of the dictionary are the
                      unique ids for the features present in df. For each feature, the annotations are summarized in a pandas dataframe.

Depending on the parameters passed to the function, different steps of the pipeline are executed and the end result will differ accordingly.

For example, if one wants to use both the MS1 and MS2 data and not use the Gibbs sampler, the following should be used:

annotations= ipa.simpleIPA(df,ionisation=1, DB=DB,adductsAll=adducts,ppm=3,dfMS2=dfMS2,DBMS2=DBMS2)
isotopes already mapped
computing all adducts ....
0.1 seconds elapsed
annotating based on MS1 and MS2 information....
0.8 seconds elapsed

If instead one wants to use only the MS1 data and only consider the adducts connections in the Gibbs sampler, one should use the following:

annotations= ipa.simpleIPA(df,ionisation=1, DB=DB,adductsAll=adducts,ppm=3,noits=5000,delta_add=0.1)
isotopes already mapped
computing all adducts ....
0.1 seconds elapsed
annotating based on MS1 information....
0.4 seconds elapsed
computing posterior probabilities including adducts connections
initialising sampler ...


Gibbs Sampler Progress Bar: 100%|██████████| 5000/5000 [00:19<00:00, 255.13it/s]

parsing results ...
Done -  19.7 seconds elapsed

Or, if one wants to use both the MS1 and MS2 data and consider both adducts and biochemical connections in the Gibbs sampler, the following should be used.

annotations= ipa.simpleIPA(df,ionisation=1, DB=DB,adductsAll=adducts,ppm=3,dfMS2=dfMS2,DBMS2=DBMS2,noits=5000,
                             Bio=Bio,
                             delta_add=0.1, 
                             delta_bio=0.4)
isotopes already mapped
computing all adducts ....
0.1 seconds elapsed
annotating based on MS1 and MS2 information....
0.7 seconds elapsed
computing posterior probabilities including biochemical and adducts connections
initialising sampler ...


Gibbs Sampler Progress Bar: 100%|██████████| 5000/5000 [00:24<00:00, 207.29it/s]


parsing results ...
Done -  24.2 seconds elapsed
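Once the pipeline has run, the annotations dictionary can be condensed into a single top hit per feature. The sketch below uses a toy stand-in for the dictionary; the 'post Gibbs' column name matches the output shown above, but the values are made up for illustration.

```python
import pandas as pd

# Toy stand-in shaped like the annotations dictionary returned by the pipeline
annotations = {
    1: pd.DataFrame({"id": ["C00148", "C00763"],
                     "name": ["L-Proline", "D-Proline"],
                     "post Gibbs": [0.78, 0.21]}),
    90: pd.DataFrame({"id": ["C00763", "C00148"],
                      "name": ["D-Proline", "L-Proline"],
                      "post Gibbs": [0.48, 0.21]}),
}

# Most likely annotation for each feature
best = {fid: ann.sort_values("post Gibbs", ascending=False).iloc[0]["name"]
        for fid, ann in annotations.items()}
print(best)  # → {1: 'L-Proline', 90: 'D-Proline'}
```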
