
# ipaPy2
Python implementation of the Integrated Probabilistic Annotation (IPA) - A Bayesian annotation method for LC/MS data integrating biochemical relations,
isotope patterns and adduct formation.
![alt text](figure_paper.png)

## Installation
ipaPy2 requires Python 3.9 or higher

### Install via pip (recommended)

```
pip install ipaPy2
```

### Compiling from source (macOS)
1. create a folder in which you want to put the library
```
mkdir IPA
cd IPA
```
2. download the library. If Homebrew is not installed on your machine, you can install it from https://brew.sh
```
brew install git
```
```
git clone https://github.com/francescodc87/ipaPy2
cd ipaPy2
```
3. create and activate a virtual environment for your folder and install the necessary libraries
```
python3 -m venv ipaPy2
source ipaPy2/bin/activate
pip install wheel
pip install setuptools
pip install twine
pip install pytest==4.4.1
pip install pytest-runner==4.4
```
4. run tests (optional)
```
python setup.py pytest
```
5. build your library
```
python setup.py bdist_wheel
```
6. The wheel file will be stored in the `dist` folder. You can install the library in a new terminal as follows:
```
pip install /path/to/wheelfile.whl
```

### Compiling from source (Linux)
1. create a folder in which you want to put the library
```
mkdir IPA
cd IPA
```
2. download the library
```
sudo apt-get install git
git clone https://github.com/francescodc87/ipaPy2
cd ipaPy2
```
3. create and activate a virtual environment for your folder and install the necessary libraries
```
python3 -m venv ipaPy2
source ipaPy2/bin/activate
pip install wheel
pip install setuptools
pip install twine
pip install pytest==4.4.1
pip install pytest-runner==4.4
```
4. run tests (optional)
```
python setup.py pytest
```
5. build your library
```
python setup.py bdist_wheel
```
6. The wheel file will be stored in the `dist` folder. You can install the library in a new terminal as follows:
```
pip install /path/to/wheelfile.whl
```

### Compiling from source (Windows)
1. create a folder in which you want to put the library
```
mkdir IPA
cd IPA
```
2. install git (https://github.com/git-guides/install-git) and download the library
```
git clone https://github.com/francescodc87/ipaPy2
cd ipaPy2
```
3. create and activate a virtual environment for your folder and install the necessary libraries
```
python -m venv ipaPy2
ipaPy2\Scripts\activate
pip install wheel
pip install setuptools
pip install twine
pip install pytest==4.4.1
pip install pytest-runner==4.4
```
4. run tests (optional)
```
python setup.py pytest
```
5. build your library
```
python setup.py bdist_wheel
```
6. The wheel file will be stored in the `dist` folder. You can install the library in a new terminal as follows:
```
pip install /path/to/wheelfile.whl
```

## Databases
One of the most powerful features of the IPA method is that it is able to integrate the knowledge gained from previous experiments in the annotation process. There are three files that are used as database:

**1. adducts file (required)**
<br />
The ipaPy2 library requires a file containing all the information needed to compute the adducts. An adducts.csv file is provided with the package [here](DB/adducts.csv). The file contains the most common adducts. If any exotic adduct (or in-source fragment) needs to be considered, the user must modify the file accordingly. The format required for the adducts file is shown below.


In [1]:
import pandas as pd
import numpy as np
adducts = pd.read_csv('DB/adducts.csv')
adducts.head()

|  | name | calc | Charge | Mult | Mass | Ion_mode | Formula_add | Formula_ded | Multi |
|---|---|---|---|---|---|---|---|---|---|
| 0 | M+H | M+1.007276 | 1 | 1 | 1.007276 | positive | H1 | False | 1 |
| 1 | M+NH4 | M+18.033823 | 1 | 1 | 18.033823 | positive | N1H4 | False | 1 |
| 2 | M+Na | M+22.989218 | 1 | 1 | 22.989218 | positive | Na1 | False | 1 |
| 3 | M+K | M+38.963158 | 1 | 1 | 38.963158 | positive | K1 | False | 1 |
| 4 | M+ | M-0.00054858 | 1 | 1 | -0.000549 | positive | FALSE | False | 1 |
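
Before a modified adducts file is passed to the library, it can help to confirm that it still has the layout shown above. The following is a minimal sketch, not part of ipaPy2: `check_adducts` is a hypothetical helper, the column list is copied from the table above, and the uniqueness check on adduct names is an assumption about what a sensible file looks like.

```python
import pandas as pd

# Column names copied from the adducts table shown above.
ADDUCT_COLUMNS = ["name", "calc", "Charge", "Mult", "Mass", "Ion_mode",
                  "Formula_add", "Formula_ded", "Multi"]


def check_adducts(adducts: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical sanity check: raise if a column is missing or a name repeats."""
    missing = [c for c in ADDUCT_COLUMNS if c not in adducts.columns]
    if missing:
        raise ValueError(f"adducts table is missing columns: {missing}")
    duplicated = adducts.loc[adducts["name"].duplicated(), "name"].tolist()
    if duplicated:
        raise ValueError(f"duplicated adduct names: {duplicated}")
    return adducts


# Two rows copied from the table above, used here instead of DB/adducts.csv.
example = pd.DataFrame(
    [["M+H", "M+1.007276", 1, 1, 1.007276, "positive", "H1", "False", 1],
     ["M+Na", "M+22.989218", 1, 1, 22.989218, "positive", "Na1", "False", 1]],
    columns=ADDUCT_COLUMNS,
)
checked = check_adducts(example)

# Select the adducts defined for positive ionisation mode.
positive = checked[checked["Ion_mode"] == "positive"]
print(positive["name"].tolist())
```

In practice the same check could be applied to the dataframe returned by `pd.read_csv('DB/adducts.csv')` after editing the file.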


**2. MS1 database file (required)**
<br />
The IPA method requires a pandas dataframe containing the database against which the annotation is performed.
This dataframe must contain the following columns in this exact order (optional columns can have empty fields):
- **id**: unique id of the database entry (e.g., 'C00031') - *necessary*
- **name**: compound name (e.g., 'D-Glucose') - *necessary*
- **formula**: chemical formula (e.g., 'C6H12O6') - *necessary*
- **inchi**: inchi string - *optional*
- **smiles**: smiles string - *optional*
- **RT**: if known, retention time range (in seconds) where this compound is expected to elute (e.g., '30;60') - *optional*
- **adductsPos**: list of adducts that should be considered in positive mode for this entry (e.g.,'M+Na;M+H;M+') - *necessary*
- **adductsNeg**: list of adducts that should be considered in negative mode for this entry (e.g.,'M-H;M-2H') - *necessary*
- **description**: comments on the entry - *optional*
- **pk**: previous knowledge on the likelihood of this compound being present in the analysed sample. The value must be between 1 (compound likely to be present in the sample) and 0 (compound cannot be present in the sample).
- **MS2**: id for the MS2 database entries related to this compound - *optional*
- **reactions**: list of reactions ids involving this compound (e.g., 'R00010 R00015 R00028'). If required, these can be used to find possible biochemical connections - *optional* 

The column names must be the ones reported here.
While the users are strongly advised to build their own *ad-hoc* database, [here](DB/IPA_MS1.csv) you can find a relatively big example database.
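
As a starting point for an ad-hoc database, the sketch below builds a one-entry dataframe with the columns in the order listed above. It is only an illustration: `make_db` is a hypothetical helper, the example values are the ones quoted in the column descriptions, and leaving optional fields as NaN is an assumption that mirrors what `pd.read_csv` produces for empty cells.

```python
import pandas as pd

# Column names and order as listed above.
DB_COLUMNS = ["id", "name", "formula", "inchi", "smiles", "RT", "adductsPos",
              "adductsNeg", "description", "pk", "MS2", "reactions"]


def make_db(entries):
    """Hypothetical helper: build a dataframe with the columns in the required order.

    Fields that are absent from an entry are left empty (NaN).
    """
    return pd.DataFrame(entries).reindex(columns=DB_COLUMNS)


db = make_db([
    {"id": "C00031", "name": "D-Glucose", "formula": "C6H12O6",
     "adductsPos": "M+Na;M+H;M+", "adductsNeg": "M-H;M-2H", "pk": 1},
])
print(db.columns.tolist())
```

Entries for which a retention time window is known can add an `RT` field such as `'30;60'` in the same way.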

In [2]:
DB = pd.read_csv('DB/IPA_MS1.csv')
DB.head()

|  | id | name | formula | inchi | smiles | RT | adductsPos | adductsNeg | description | pk | MS2 | reactions |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | C00002 | ATP | C10H16N5O13P3 | InChI=1S/C10H16N5O13P3/c11-8-5-9(13-2-12-8)15(... |  |  | M+H;M+Na;M+2H;2M+H | M-H;2M-H;M-2H;3M-H |  | 1 | EMBL-MCF_spec365637_1 | R00002 R00076 R00085 R00086 R00087 R00088 R000... |
| 1 | C00003 | NAD+ | C21H28N7O14P2 | InChI=1S/C21H27N7O14P2/c22-17-12-19(25-7-24-17... |  |  | M+H;M+Na;M+2H;2M+H | M-H;2M-H;M-2H;3M-H |  | 1 | EMBL-MCF_specxxxxx_10 | R00023 R00090 R00091 R00092 R00093 R00094 R000... |
| 2 | C00004 | NADH | C21H29N7O14P2 | InChI=1S/C21H29N7O14P2/c22-17-12-19(25-7-24-17... |  |  | M+H;M+Na;M+2H;2M+H | M-H;2M-H;M-2H;3M-H |  | 1 |  | R00023 R00090 R00091 R00092 R00093 R00094 R000... |
| 3 | C00005 | NADPH | C21H30N7O17P3 | InChI=1S/C21H30N7O17P3/c22-17-12-19(25-7-24-17... |  |  | M+H;M+Na;M+2H;2M+H | M-H;2M-H;M-2H;3M-H |  | 1 |  | R00105 R00106 R00107 R00108 R00109 R00111 R001... |
| 4 | C00006 | NADP+ | C21H29N7O17P3 | InChI=1S/C21H28N7O17P3/c22-17-12-19(25-7-24-17... |  |  | M+H;M+Na;M+2H;2M+H | M-H;2M-H;M-2H;3M-H |  | 1 | EMBL-MCF_specxxxxxx_45 | R00104 R00106 R00107 R00108 R00109 R00111 R001... |


This example database was obtained from the [KEGG database](https://www.genome.jp/kegg/compound/), the [Natural Products Atlas database](https://www.npatlas.org) and the [MoNa database](https://mona.fiehnlab.ucdavis.edu) (only compounds having at least one fragmentation spectrum obtained with a QExactive).
For each entry, only a handful of the most common adducts are considered.
To fully exploit the IPA method, it is strongly recommended to update the database whenever new knowledge is gained from previous experiments. Providing a retention time window for compounds previously detected with the analytical system at hand is particularly useful.
For the sake of the example in this tutorial, a reduced example database is also provided.

In [3]:
DB = pd.read_csv('DB/DB_test_pos.csv')
DB.head()

|  | id | name | formula | inchi | smiles | RT | adductsPos | adductsNeg | description | pk | MS2 | reactions |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | C00079 | L-Phenylalanine | C9H11NO2 | InChI=1S/C9H11NO2/c10-8(9(11)12)6-7-4-2-1-3-5-... |  | 120;160 | M+H;M+Na;M+2H;2M+H | M-H;2M-H;M-2H;3M-H |  | 1 | UA005501_1 | R00686 R00688 R00689 R00690 R00691 R00692 R006... |
| 1 | C00082 | L-Tyrosine | C9H11NO3 | InChI=1S/C9H11NO3/c10-8(9(12)13)5-6-1-3-7(11)4... |  | 50;90 | M+H;M+Na;M+2H;2M+H | M-H;2M-H;M-2H;3M-H |  | 1 | UA005601_1 | R00031 R00728 R00729 R00730 R00731 R00732 R007... |
| 2 | C00114 | Choline | C5H14NO | InChI=1S/C5H14NO/c1-6(2,3)4-5-7/h7H,4-5H2,1-3H... |  |  | M+H;M+Na;M+2H;2M+H | M-H;2M-H;M-2H;3M-H |  | 1 |  | R01021 R01022 R01023 R01025 R01026 R01027 R010... |
| 3 | C00123 | L-Leucine | C6H13NO2 | InChI=1S/C6H13NO2/c1-4(2)3-5(7)6(8)9/h4-5H,3,7... |  | 70;110 | M+H;M+Na;M+2H;2M+H | M-H;2M-H;M-2H;3M-H |  | 1 |  | R01088 R01089 R01090 R01091 R02552 R03657 R084... |
| 4 | C00148 | L-Proline | C5H9NO2 | InChI=1S/C5H9NO2/c7-5(8)4-2-1-3-6-4/h4,6H,1-3H... |  | 35;55 | M+H;M+Na;M+2H;2M+H | M-H;2M-H;M-2H;3M-H |  | 1 | EMBL-MCF_specxxxxx_7 | R00135 R00671 R01246 R01248 R01249 R01251 R012... |


**3. MS2 database file (only required if MS2 data is available)**
<br />
This new implementation of the IPA method also allows the user to include MS2 data in the annotation pipeline.
In order to exploit this functionality, an MS2 spectra database must be provided.
The MS2 database must be provided as a pandas dataframe including the following columns in this exact order:
- **compound_id**: unique id for each compound; it must match the ids used in the MS1 database - *necessary*
- **id**: unique id for the single entry (i.e., spectrum) of the database - *necessary*
- **name**: compound name (e.g., 'D-Glucose') - *necessary*
- **formula**: chemical formula (e.g., 'C6H12O6') - *necessary*
- **inchi**: inchi string - *optional*
- **precursorType**: the adduct form of the precursor ion (e.g., 'M+H') - *necessary*
- **instrument**: the type of instrument the spectrum was acquired with - *optional*
- **collision.energy**: the collision energy level used to acquire the spectrum (e.g., '15') - *necessary*
- **spectrum**: The actual spectrum in the form of a string in the following format 'mz1:Int1 mz2:Int2 mz3:Int3 ...'

The user must use an MS2 database specific to the instrument used to acquire the data.
The MS2 database found [here](https://drive.google.com/file/d/15qduvtE8aSAAUCf1FE4ojcVLaTw-B2W6/view?usp=sharing), contains all the MS2 spectra found in the [MoNa](https://mona.fiehnlab.ucdavis.edu) database acquired with a QExactive. This is a relatively big file, and for the sake of this tutorial a drastically reduced version of it has been included within this repository, and can be found [here](DB/DBMS2_test_pos.csv).
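
The spectrum strings use the 'mz1:Int1 mz2:Int2 mz3:Int3 ...' format described above. When inspecting or building such a database, it can be handy to convert between this string and a numeric array. The helpers below are a small sketch written for this tutorial (`parse_spectrum` and `format_spectrum` are hypothetical names, not ipaPy2 functions).

```python
import numpy as np


def parse_spectrum(spectrum: str) -> np.ndarray:
    """Turn 'mz1:Int1 mz2:Int2 ...' into an (n_peaks, 2) array of floats."""
    pairs = [peak.split(":") for peak in spectrum.split()]
    values = [[float(mz), float(intensity)] for mz, intensity in pairs]
    return np.array(values, dtype=float).reshape(-1, 2)


def format_spectrum(peaks: np.ndarray) -> str:
    """Inverse of parse_spectrum: write peaks back as 'mz:Int' pairs."""
    return " ".join(f"{float(mz)}:{float(intensity)}" for mz, intensity in peaks)


peaks = parse_spectrum("55.0550575256:5.821211 57.0581207275:0.3856")
print(peaks.shape)             # number of peaks and the two columns (m/z, intensity)
print(format_spectrum(peaks))  # string form, suitable for the 'spectrum' column
```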


In [4]:
DBMS2 = pd.read_csv('DB/DBMS2_test_pos.csv')
DBMS2.head()

|  | compound_id | id | name | formula | inchi | precursorType | instrument | collision.energy | spectrum |
|---|---|---|---|---|---|---|---|---|---|
| 0 | EMBL-MCF_specxxxxxx_11 | EMBL-MCF_spec103039 | L-valine | C5H11NO2 | InChI=1S/C5H11NO2/c1-3(2)4(6)5(7)8/h3-4H,6H2,1... | M+H | Thermo Q-Exactive Plus | 35 | 55.0550575256:5.821211 57.0581207275:0.385600 ... |
| 1 | EMBL-MCF_specxxxxxx_11 | EMBL-MCF_spec353465 | L-valine | C5H11NO2 | InChI=1S/C5H11NO2/c1-3(2)4(6)5(7)8/h3-4H,6H2,1... | M+H | Thermo Q-Exactive Plus | 30 | 49.5028053042:0.000000 49.5031356971:0.000000 ... |
| 2 | EMBL-MCF_specxxxxxx_11 | EMBL-MCF_spec27828 | L-valine | C5H11NO2 | InChI=1S/C5H11NO2/c1-3(2)4(6)5(7)8/h3-4H,6H2,1... | M-H | Thermo Q-Exactive Plus | 35 | 55.0173454285:0.819357 58.0282363892:0.155430 ... |
| 3 | EMBL-MCF_specxxxxx_7 | EMBL-MCF_spec96902 | L-proline | C5H9NO2 | InChI=1S/C5H9NO2/c7-5(8)4-2-1-3-6-4/h4,6H,1-3H... | M+H | Thermo Q-Exactive Plus | 35 | 50.5765228271:0.040013 51.3066940308:0.039949 ... |
| 4 | EMBL-MCF_specxxxxx_7 | EMBL-MCF_spec353568 | L-proline | C5H9NO2 | InChI=1S/C5H9NO2/c7-5(8)4-2-1-3-6-4/h4,6H,1-3H... | M+H | Thermo Q-Exactive Plus | 30 | 49.5028215674:0.000000 49.5031519602:0.000000 ... |


## Data preparation
Before using the ipaPy2 package, the processed data coming from an untargeted metabolomics experiment must be properly prepared.

**1. MS1 data**

The data must be organized in a pandas dataframe containing the following columns:
- **ids**: a unique numeric id for each mass spectrometry feature
- **rel.ids**: relation ids. Features must be clustered based on correlation/peak shape/retention time. Features in the same cluster are likely to come from the same metabolite.
- **mzs**: mass-to-charge ratios, usually the average across different samples.
- **RTs**: retention times in seconds, usually the average across different samples.
- **Int**: representative (e.g., maximum or average) intensity detected for each feature across samples (either peak area or peak intensity)


An example is shown below:

In [5]:
df1=pd.read_csv('ExampleDatasets/README/df_test_pos.csv')
df1.head()

|  | ids | rel.ids | mzs | RTs | Int |
|---|---|---|---|---|---|
| 0 | 1 | 0 | 116.070544 | 45.770423 | 2170017000.0 |
| 1 | 88 | 0 | 117.073678 | 45.787586 | 125652000.0 |
| 2 | 501 | 0 | 231.133673 | 46.183948 | 25192230.0 |
| 3 | 4429 | 0 | 232.136923 | 46.176715 | 2635594.0 |
| 4 | 2 | 1 | 104.10683 | 40.843309 | 1889172000.0 |


The clustering of the features is a necessary step and must be performed before running the IPA method. For this step, the use of widely used data processing software such as [mzMatch](https://github.com/UoMMIB/mzmatch.R) and [CAMERA](https://bioconductor.org/packages/release/bioc/html/CAMERA.html) is recommended.
Nevertheless, the ipaPy2 library provides a function (clusterFeatures()) that performs this step starting from a dataframe containing the measured intensities across several samples (at least 3 samples, the more samples the better).
This dataframe should be organized as follows:

In [6]:
df2=pd.read_csv('ExampleDatasets/README/df_test_pos_not_clustered.csv')
df2.head()

|  | ids | mzs | RTs | sample1 | sample2 | sample3 | sample4 | sample5 | sample6 | sample7 | sample8 | sample9 | sample10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 116.070544 | 45.770423 | 1003660000.0 | 1299828000.0 | 1878029000.0 | 1778238000.0 | 1715394000.0 | 434003400.0 | 1586635000.0 | 2170017000.0 | 1312151000.0 | 2051875000.0 |
| 1 | 2 | 104.10683 | 40.843309 | 377834300.0 | 872190100.0 | 835380500.0 | 1889172000.0 | 1114844000.0 | 1296362000.0 | 736137900.0 | 738688700.0 | 954686400.0 | 696905400.0 |
| 2 | 3 | 118.085998 | 43.584638 | 598471500.0 | 1399106000.0 | 283122000.0 | 1415610000.0 | 755760700.0 | 780035900.0 | 894985400.0 | 507406900.0 | 685452500.0 | 1000501000.0 |
| 3 | 4 | 166.086047 | 143.321396 | 1390905000.0 | 1047887000.0 | 1053413000.0 | 278180900.0 | 1037486000.0 | 1117700000.0 | 615333200.0 | 1215932000.0 | 1264092000.0 | 1370995000.0 |
| 4 | 5 | 132.101745 | 89.387202 | 607191200.0 | 1014152000.0 | 1270735000.0 | 1069765000.0 | 492593800.0 | 408763300.0 | 377794500.0 | 254147000.0 | 802525700.0 | 354428100.0 |


In [7]:
from ipaPy2 import ipa
df=ipa.clusterFeatures(df2)

Clustering features ....
0.0 seconds elapsed


All information about the function can be found in its help:

In [8]:
help(ipa.clusterFeatures)

Help on function clusterFeatures in module ipaPy2.ipa:

clusterFeatures(df, Cthr=0.8, RTwin=1, Intmode='max')
    Clustering MS1 features based on correlation across samples.
    
    Parameters
    ----------
    df: pandas dataframe with the following columns:
        -ids: a unique id for each feature
        -mzs: mass-to-charge ratios, usually the average across different
              samples.
        -RTs: retention times in seconds, usually the average across different
              samples.
        -Intensities: for each sample, a column reporting the detected
                      intensities in each sample. 
    Cthr: Default value 0.8. Minimum correlation allowed in each cluster
    RTwin: Default value 1. Maximum difference in RT time between features in
           the same cluster
    Intmode: Defines how the representative intensity of each feature is
             computed. If 'max' (default) the maximum across samples is used.
             If 'ave' the average across sa

After running, this function returns a pandas dataframe in the correct format for the ipaPy2 package:

In [9]:
df.head()

|  | ids | rel.ids | mzs | RTs | Int |
|---|---|---|---|---|---|
| 0 | 1 | 0 | 116.070544 | 45.770423 | 2170017000.0 |
| 1 | 88 | 0 | 117.073678 | 45.787586 | 125652000.0 |
| 2 | 501 | 0 | 231.133673 | 46.183948 | 25192230.0 |
| 3 | 4429 | 0 | 232.136923 | 46.176715 | 2635594.0 |
| 4 | 2 | 1 | 104.10683 | 40.843309 | 1889172000.0 |
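
To give an intuition of what clustering by correlation and retention time means, the toy function below groups features greedily: starting from the most intense unassigned feature, it collects every unassigned feature whose retention time is within `RTwin` and whose intensity profile across samples has a Pearson correlation of at least `Cthr`. This is an illustrative sketch only and is not the ipaPy2 implementation; `greedy_cluster` is a hypothetical name, the greedy strategy is an assumption, and only the parameter names are borrowed from the help of clusterFeatures().

```python
import numpy as np
import pandas as pd


def greedy_cluster(df, Cthr=0.8, RTwin=1.0):
    """Toy greedy clustering on RT proximity and across-sample correlation."""
    # Every column that is not ids/mzs/RTs is treated as a sample intensity column.
    sample_cols = [c for c in df.columns if c not in ("ids", "mzs", "RTs")]
    X = df[sample_cols].to_numpy(dtype=float)
    corr = np.corrcoef(X)                      # feature-by-feature Pearson correlation
    rts = df["RTs"].to_numpy(dtype=float)
    max_int = X.max(axis=1)                    # representative intensity ('max' mode)
    labels = np.full(len(df), -1)
    cluster = 0
    for seed in np.argsort(-max_int):          # most intense feature first
        if labels[seed] != -1:
            continue
        members = ((labels == -1)
                   & (np.abs(rts - rts[seed]) <= RTwin)
                   & (corr[seed] >= Cthr))
        members[seed] = True
        labels[members] = cluster
        cluster += 1
    out = df[["ids", "mzs", "RTs"]].copy()
    out.insert(1, "rel.ids", labels)
    out["Int"] = max_int
    return out


# Made-up intensities: features 1 and 2 co-elute and co-vary, feature 3 does not.
toy = pd.DataFrame({
    "ids": [1, 2, 3],
    "mzs": [116.07, 117.07, 104.11],
    "RTs": [45.77, 45.79, 40.84],
    "sample1": [10.0, 1.0, 5.0],
    "sample2": [20.0, 2.0, 1.0],
    "sample3": [30.0, 3.0, 9.0],
})
clustered = greedy_cluster(toy)
print(clustered)
```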


**2. MS2 data**

If fragmentation data was acquired during the experiment, it can be included in the IPA annotation process.
To do so, the data must be organized in a pandas dataframe containing the following columns:
- **id**: a unique id for each feature for which the MS2 spectrum was acquired (same as in MS1)
- **spectrum**: string containing the spectrum information in the following format 'mz1:Int1 mz2:Int2 mz3:Int3 ...'
- **ev**: collision energy used to acquire the fragmentation spectrum

An example is shown below:

In [10]:
dfMS2=pd.read_csv('ExampleDatasets/README/MS2data_example.csv')
dfMS2.head()

|  | id | spectrum | ev |
|---|---|---|---|
| 0 | 1 | 51.3066132836457:0.884272376680125 59.96532241... | 35 |
| 1 | 1 | 51.3066132836457:0.884272376680125 59.96532241... | 15 |
| 2 | 90 | 62.4153253406374:0.743812036877455 63.93291389... | 35 |
| 3 | 992 | 50.983321052233:0.973529955385613 53.039006800... | 35 |
| 4 | 3 | 55.0551847656264:5.67780579195993 57.058126021... | 35 |


## Usage
The Integrated Probabilistic Annotation (IPA) method can be applied in different situations, and the ipaPy2 package allows users to tailor the IPA pipeline to their specific needs.

This brief tutorial describes the most common scenarios the IPA method can be applied to.

**1. Mapping isotope patterns**

The first step of the IPA pipeline consists of mapping the isotope patterns within the dataset considered. This is achieved through the map_isotope_patterns() function. The help of this function provides a detailed description of it.

In [11]:
help(ipa.map_isotope_patterns)

Help on function map_isotope_patterns in module ipaPy2.ipa:

map_isotope_patterns(df, isoDiff=1, ppm=100, ionisation=1)
    mapping isotope patterns in MS1 data.
    
    Parameters
    ----------
    df : pandas dataframe (necessary)
         A dataframe containing the MS1 data including the following columns:
            -ids: an unique id for each feature
            -rel.ids:   relation ids. In a previous step of the data processing
                        pipeline, features are clustered based on peak shape
                        similarity/retention time. Features in the same
                        cluster are likely to come from the same metabolite.
                        All isotope patterns must be in the same rel.id
                        cluster.
            -mzs: mass-to-charge ratios, usually the average across
                  different samples.
            -RTs: retention times in seconds, usually the average across
                  different samples.
            -

For the sake of this tutorial, the small dataset example introduced above is considered.

In [12]:
ipa.map_isotope_patterns(df,ionisation=1)

mapping isotope patterns ....
0.1 seconds elapsed


Once finished, this function modifies the pandas dataframe provided as input annotating all isotope patterns.

In [13]:
df.head()

|  | ids | rel.ids | mzs | RTs | Int | relationship | isotope pattern | charge |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 116.070544 | 45.770423 | 2170017000.0 | bp | 0 | 1 |
| 1 | 88 | 0 | 117.073678 | 45.787586 | 125652000.0 | bp\|isotope | 0 | 1 |
| 2 | 501 | 0 | 231.133673 | 46.183948 | 25192230.0 | potential bp | 1 | 1 |
| 3 | 4429 | 0 | 232.136923 | 46.176715 | 2635594.0 | potential bp\|isotope | 1 | 1 |
| 4 | 2 | 1 | 104.10683 | 40.843309 | 1889172000.0 | bp | 0 | 1 |


Some data processing pipelines already include an isotope mapping function, and the user can use it as long as the data is organised in the correct format.
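
The link between isotope spacing and charge can be illustrated with a few lines of code. For an ion of charge z, consecutive isotope peaks are expected to be roughly `isoDiff / z` apart on the m/z axis. The sketch below checks this for the first two features of the example table (ids 1 and 88). It is not the logic used by map_isotope_patterns(): `isotope_spacing_matches` is a hypothetical helper, and interpreting `ppm` as a tolerance relative to the heavier m/z is an assumption made here for illustration.

```python
def isotope_spacing_matches(mz_light, mz_heavy, charge, isoDiff=1.0, ppm=100.0):
    """Return True if the m/z gap is compatible with an isotope step at this charge.

    Assumption: the tolerance is ppm parts-per-million of the heavier m/z value.
    """
    expected_gap = isoDiff / charge
    tolerance = mz_heavy * ppm * 1e-6
    return abs((mz_heavy - mz_light) - expected_gap) <= tolerance


# m/z values of features 1 and 88 from the example dataset above.
mz_bp, mz_iso = 116.070544, 117.073678
for z in (1, 2):
    print(z, isotope_spacing_matches(mz_bp, mz_iso, charge=z))
```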

**2. Compute all adducts**

The second step of the pipeline consists in the calculation of all possible adducts that could be generated by the compounds included in the database.
This is done by the function compute_all_adducts(). This function comes with a very detailed help.

In [14]:
help(ipa.compute_all_adducts)

Help on function compute_all_adducts in module ipaPy2.ipa:

compute_all_adducts(adductsAll, DB, ionisation=1, ncores=1)
    compute all adducts table based on the information present in the database
    
    Parameters
    ----------
    adductsAll : pandas dataframe (necessary)
                 Dataframe containing information on all possible
                 adducts. The file must be in the same format as the example
                 provided in the DB/adducts.csv
    DB : pandas dataframe (necessary)
         Dataframe containing the database against which the annotation is
         performed. The DB must contain the following columns in this exact
         order (optional fields can contain None):
             - id: unique id of the database entry (e.g., 'C00031') - necessary
             - name: compound name (e.g., 'D-Glucose') - necessary
             - formula: chemical formula (e.g., 'C6H12O6') - necessary
             - inchi: inchi string - optional
             - smiles: sm

Depending on the size of the database used (i.e., number of compounds included), this step can become rather time-consuming, and the use of multiple cores should be considered.
In the context of this tutorial, the heavily reduced example database introduced before is considered.

In [15]:
allAddsPos = ipa.compute_all_adducts(adducts, DB, ionisation=1, ncores=1)

computing all adducts ....
0.1 seconds elapsed


In [16]:
allAddsPos.head()

|  | id | name | adduct | formula | charge | m/z | RT | pk | MS2 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | C00079 | L-Phenylalanine | M+H | C9H12NO2 | 1 | 166.086255 | 120;160 | 1 | UA005501_1 |
| 1 | C00079 | L-Phenylalanine | M+Na | C9H11NNaO2 | 1 | 188.068197 | 120;160 | 1 | UA005501_1 |
| 2 | C00079 | L-Phenylalanine | M+2H | C9H13NO2 | 2 | 83.546765 | 120;160 | 1 | UA005501_1 |
| 3 | C00079 | L-Phenylalanine | 2M+H | C18H23N2O4 | 1 | 331.165233 | 120;160 | 1 | UA005501_1 |
| 4 | C00082 | L-Tyrosine | M+H | C9H12NO3 | 1 | 182.081169 | 50;90 | 1 | UA005601_1 |
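
The m/z values in this table can be checked by hand. The sketch below assumes the usual relation m/z = (Mult × M + Mass) / Charge, where M is the monoisotopic mass of the neutral compound and Mult, Mass and Charge are read as the corresponding columns of the adducts file. This reading of the columns is an assumption, `adduct_mz` is a hypothetical helper rather than a library function, and the mass shift used for M+2H (twice the M+H shift) is also assumed since that row is not shown in the adducts table above.

```python
def adduct_mz(neutral_mass, mult, mass_shift, charge):
    """m/z of an adduct ion: (mult * M + mass_shift) / |charge|."""
    return (mult * neutral_mass + mass_shift) / abs(charge)


# Monoisotopic mass of L-Phenylalanine (C9H11NO2).
phe = 165.078979

print(round(adduct_mz(phe, mult=1, mass_shift=1.007276, charge=1), 6))      # M+H
print(round(adduct_mz(phe, mult=2, mass_shift=1.007276, charge=1), 6))      # 2M+H
print(round(adduct_mz(phe, mult=1, mass_shift=2 * 1.007276, charge=2), 6))  # M+2H (assumed shift)
```

The printed values can be compared with the `m/z` column of the L-Phenylalanine rows above; small differences in the last decimal come from rounding of the input masses.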


If the same database is used for subsequent experiments without introducing new information, it is recommended to save the results of this function into a .csv file. Therefore, the user would need to repeat this step in the future only if the DB changed.
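
One simple way to follow this advice is to wrap the computation in a small caching helper. The sketch below is an assumption about how one might organise this, not part of ipaPy2: `load_or_compute` is a hypothetical function, and a stand-in `fake_compute` replaces the real call so that the example runs on its own.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_or_compute(csv_path, compute):
    """Read the table from csv_path if it exists, otherwise compute and save it."""
    csv_path = Path(csv_path)
    if csv_path.exists():
        return pd.read_csv(csv_path)
    table = compute()
    table.to_csv(csv_path, index=False)
    return table


calls = []


def fake_compute():
    # Stand-in for the expensive adducts computation; records each call.
    calls.append(1)
    return pd.DataFrame({"id": ["C00079"], "adduct": ["M+H"], "m/z": [166.086255]})


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "allAddsPos.csv"
    first = load_or_compute(cache, fake_compute)    # computed and written to disk
    second = load_or_compute(cache, fake_compute)   # read back, no recomputation

print(len(calls))
```

With ipaPy2 installed, `compute` could be `lambda: ipa.compute_all_adducts(adducts, DB, ionisation=1, ncores=1)`. A CSV round trip may change column types, so it is worth comparing the reloaded table once against a freshly computed one, and the cached file should be deleted whenever the database changes.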

**3. Annotation based on MS1 information**

At this point, the actual annotation process can start. If no fragmentation data is available, the MS1annotation() function should be used. This function annotates the dataset using the MS1 data and the information stored in the database. A detailed description of the function can be accessed through the help:

In [17]:
help(ipa.MS1annotation)

Help on function MS1annotation in module ipaPy2.ipa:

MS1annotation(df, allAdds, ppm, me=0.000548579909065, ratiosd=0.9, ppmunk=None, ratiounk=None, ppmthr=None, pRTNone=None, pRTout=None, ncores=1)
    Annotation of the dataset base on the MS1 information. Prior probabilities
    are based on mass only, while post probabilities are based on mass, RT,
    previous knowledge and isotope patterns.
    
    Parameters
    ----------
    df: pandas dataframe containing the MS1 data. It should be the output of the
        function ipa.map_isotope_patterns()
    allAdds: pandas dataframe containing the information on all the possible
            adducts given the database. It should be the output of either
            ipa.compute_all_adducts() or ipa.compute_all_adducts_Parallel()
    ppm: accuracy of the MS instrument used
    me: accurate mass of the electron. Default 5.48579909065e-04
    ratiosd: default 0.9. It represents the acceptable ratio between predicted
             intensity and

In [18]:
annotations=ipa.MS1annotation(df,allAddsPos,ppm=3,ncores=1)

annotating based on MS1 information....
0.9 seconds elapsed


This function returns all the possible annotations for all the mass spectrometry features (excluding the ones previously identified as isotopes). The annotations are provided in the form of a dictionary. The keys of the dictionary are the unique ids for the features present in df.
For each feature, all possible annotations are summarised in a dataframe including the following information:

- **id:** Unique id associated with the compound as reported in the database
- **name:** Name of the compound
- **formula:** Chemical formula of the putative annotation
- **adduct:** Adduct type
- **mz:** Theoretical m/z associated with the specific ion
- **charge:** Theoretical charge of the ion
- **RT range:** Retention time range reported in the database for the specific compound
- **ppm:** mass accuracy
- **isotope pattern score:** Score representing how similar the measured and theoretical isotope patterns are
- **fragmentation pattern score:** Cosine similarity. Empty in this case as no MS2 data was provided
- **prior:** Probabilities associated with each possible annotation computed by only considering the mz values (i.e., only considering ppm)
- **post:** Probabilities associated with each possible annotation computed by integrating all the additional information available: retention time range, ppm, isotope pattern score and prior knowledge.

As an example, the possible annotations for the feature associated with id=1 are shown below:


In [19]:
annotations[1]

|  | id | name | formula | adduct | m/z | charge | RT range | ppm | isotope pattern score | fragmentation pattern score | prior | post |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | C00148 | L-Proline | C5H10NO2 | M+H | 116.070605 | 1.0 | 35;55 | -0.523247 | 0.331946 |  | 0.318084 | 0.454248 |
| 1 | C00763 | D-Proline | C5H10NO2 | M+H | 116.070605 | 1.0 |  | -0.523247 | 0.331946 |  | 0.318084 | 0.363398 |
| 2 | C18170 | 3-Acetamidopropanal | C5H10NO2 | M+H | 116.070605 | 1.0 | 500;560 | -0.523247 | 0.331946 |  | 0.318084 | 0.181699 |
| 3 | Unknown | Unknown |  |  |  |  |  | 3.0 | 0.004161 |  | 0.045748 | 0.000655 |


It should be noted that in this example, the prior probabilities associated with L-Proline M+H, D-Proline M+H and 3-Acetamidopropanal M+H are exactly the same. This is because all three ions have exactly the same theoretical mass.
However, the post probabilities are different. This is because the retention time associated with this feature is within the retention time range reported in the database for L-Proline and outside the one reported for 3-Acetamidopropanal.
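
The reason identical theoretical masses lead to identical priors can be made concrete with a small calculation. The sketch below computes the ppm error of a measured m/z and then turns a list of ppm errors into normalised mass-only probabilities. It rests on assumptions that are not stated in this tutorial: a Gaussian likelihood with standard deviation `ppm / 2`, and an 'Unknown' candidate placed at an error of `ppmunk`, taken equal to `ppm` when not given. Both function names are hypothetical. Under these assumptions the numbers appear to agree with the `prior` column shown above, but this should be read as a plausible reconstruction rather than a description of the library's code.

```python
import numpy as np


def ppm_error(measured_mz, theoretical_mz):
    """Mass error in parts per million."""
    return (measured_mz - theoretical_mz) / theoretical_mz * 1e6


def mass_only_priors(ppm_errors, ppm, ppmunk=None):
    """Normalised Gaussian likelihoods of the candidates plus one 'Unknown'.

    Assumptions: sigma = ppm / 2, and the unknown sits at ppmunk (default: ppm).
    The last element of the returned array refers to 'Unknown'.
    """
    sigma = ppm / 2.0
    ppmunk = ppm if ppmunk is None else ppmunk
    errors = np.append(np.asarray(ppm_errors, dtype=float), ppmunk)
    likelihood = np.exp(-0.5 * (errors / sigma) ** 2)
    return likelihood / likelihood.sum()


# Feature 1: measured m/z from the dataset, theoretical m/z of the M+H ion.
print(ppm_error(116.070544, 116.070605))

# Three candidates sharing the same theoretical mass, hence the same ppm error.
priors = mass_only_priors([-0.523247] * 3, ppm=3)
print(np.round(priors, 6))
```

Since the three candidates enter the calculation with the same ppm error, they necessarily receive the same prior. Only additional evidence, such as retention time or isotope patterns, can separate them.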

Here is another example:

In [20]:
annotations[999]

|  | id | name | formula | adduct | m/z | charge | RT range | ppm | isotope pattern score | fragmentation pattern score | prior | post |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | C00079 | L-Phenylalanine | C18H23N2O4 | 2M+H | 331.165233 | 1.0 | 120;160 | -0.941814 | 0.472049 |  | 0.240106 | 0.550778 |
| 1 | C02265 | D-Phenylalanine | C18H23N2O4 | 2M+H | 331.165233 | 1.0 |  | -0.941814 | 0.472049 |  | 0.240106 | 0.440622 |
| 2 | Unknown | Unknown |  |  |  |  |  | 3.0 | 0.055901 |  | 0.039575 | 0.0086 |
| 3 | C03263 | Coproporphyrinogen III | C36H46N4O8 | M+2H | 331.165233 | 2.0 |  | -0.941814 | 0.0 |  | 0.240106 | 0.0 |
| 4 | C05768 | Coproporphyrinogen I | C36H46N4O8 | M+2H | 331.165233 | 2.0 |  | -0.941814 | 0.0 |  | 0.240106 | 0.0 |


Also in this case, all the prior probabilities associated with the four ions are exactly the same since all the ions have the same theoretical mass-to-charge ratio. However, the post probabilities are significantly different.
Two of these ions (Coproporphyrinogen III M+2H and Coproporphyrinogen I M+2H) have charge +2, while the other two possible annotations have charge +1. The observed isotope pattern is consistent with an ion with charge +1 (i.e., difference between isotopes = 1), and this is reflected in the isotope pattern score and consequently in the post probabilities. Moreover, the retention time associated with this feature is within the range reported for L-Phenylalanine in the database. Therefore, L-Phenylalanine 2M+H is the most likely annotation.

**4. Annotation based on MS1 and MS2 information**

As already mentioned above, fragmentation data can be included in the annotation process by using the MSMSannotation() function. A detailed description of the function can be accessed through the help:

In [21]:
help(ipa.MSMSannotation)

Help on function MSMSannotation in module ipaPy2.ipa:

MSMSannotation(df, dfMS2, allAdds, DBMS2, ppm, me=0.000548579909065, ratiosd=0.9, ppmunk=None, ratiounk=None, ppmthr=None, pRTNone=None, pRTout=None, mzdCS=0, ppmCS=10, CSunk=0.7, evfilt=False, ncores=1)
    Annotation of the dataset base on the MS1 and MS2 information. Prior
    probabilities are based on mass only, while post probabilities are based
    on mass, RT, previous knowledge and isotope patterns.
    
    Parameters
    ----------
    df: pandas dataframe containing the MS1 data. It should be the output of the
        function ipa.map_isotope_patterns()
    dfMS2: pandas dataframe containing the MS2 data. It must contain 3 columns
        -id: an unique id for each feature for which the MS2 spectrum was
             acquired (same as in df)
        -spectrum: string containing the spectrum information in the following
                   format 'mz1:Int1 mz2:Int2 mz3:Int3 ...'
        -ev: collision energy used to acquir

The line below integrates the fragmentation data and the fragmentation database introduced above into the annotation process. The role of the CSunk parameter deserves a brief discussion. In most cases, the fragmentation database contains fragmentation spectra only for a subset of the compounds in the database. Therefore, when considering a feature for which a fragmentation spectrum was acquired, the cosine similarity can often be computed only for a subset of the possible annotations. The CSunk value is then assigned to the other possible annotations for comparison.

In [22]:
annotations=ipa.MSMSannotation(df,dfMS2,allAddsPos,DBMS2,CSunk=0.7,ppm=3,ncores=1)

annotating based on MS1 and MS2 information....
1.3 seconds elapsed


The output of this function has the same structure as the one from the MS1annotation() function, but it also includes the fragmentation pattern scores when fragmentation data is available.
As an example, the possible annotations for the feature associated with id=1 are shown below:

In [23]:
annotations[1]

|  | id | name | formula | adduct | m/z | charge | RT range | ppm | isotope pattern score | fragmentation pattern score | prior | post |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | C00148 | L-Proline | C5H10NO2 | M+H | 116.070605 | 1.0 | 35;55 | -0.523247 | 0.331946 | 0.999759 | 0.318084 | 0.543121 |
| 1 | C00763 | D-Proline | C5H10NO2 | M+H | 116.070605 | 1.0 |  | -0.523247 | 0.331946 | 0.7 | 0.318084 | 0.304221 |
| 2 | C18170 | 3-Acetamidopropanal | C5H10NO2 | M+H | 116.070605 | 1.0 | 500;560 | -0.523247 | 0.331946 | 0.7 | 0.318084 | 0.15211 |
| 3 | Unknown | Unknown |  |  |  |  |  | 3.0 | 0.004161 | 0.7 | 0.045748 | 0.000548 |


In this case, the cosine similarity score for the annotation L-Proline M+H is very high; therefore, the post probability associated with it is higher than the one obtained without considering the MS2 data.
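
For readers unfamiliar with the cosine similarity between two fragmentation spectra, the following sketch shows one common way of computing it: peaks of the two spectra are paired when their m/z values differ by less than a tolerance, and the cosine is taken over the paired intensities. This is a generic illustration, not the scoring code of ipaPy2, whose peak matching (controlled by parameters such as mzdCS and ppmCS) may differ. The names `parse`, `cosine_similarity` and `mz_tol` are hypothetical, and the spectra are made up.

```python
import numpy as np


def parse(spectrum):
    """'mz1:Int1 mz2:Int2 ...' -> (n_peaks, 2) float array."""
    values = [[float(v) for v in peak.split(":")] for peak in spectrum.split()]
    return np.array(values, dtype=float).reshape(-1, 2)


def cosine_similarity(spectrum_a, spectrum_b, mz_tol=0.01):
    """Cosine similarity after greedy one-to-one peak matching within mz_tol (Da)."""
    a, b = parse(spectrum_a), parse(spectrum_b)
    if len(a) == 0 or len(b) == 0:
        return 0.0
    used = set()
    dot = 0.0
    for mz, intensity in a:
        diffs = np.abs(b[:, 0] - mz)
        for j in np.argsort(diffs):            # closest peak of b first
            j = int(j)
            if diffs[j] > mz_tol:
                break                          # no peak of b is close enough
            if j not in used:
                used.add(j)
                dot += intensity * b[j, 1]
                break
    norm = np.linalg.norm(a[:, 1]) * np.linalg.norm(b[:, 1])
    return float(dot / norm) if norm > 0 else 0.0


measured = "55.055:5.8 57.058:0.4 70.065:100.0"
library = "55.056:6.0 70.066:95.0"
print(cosine_similarity(measured, library))
print(cosine_similarity(measured, "200.0:1.0"))   # no shared peaks
```

A score close to 1 indicates that the two spectra share their main fragments with similar relative intensities, while spectra without common peaks score 0.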

Here is another example for a feature having a very similar mass-to-charge ratio.

In [24]:
annotations[90]

|  | id | name | formula | adduct | m/z | charge | RT range | ppm | isotope pattern score | fragmentation pattern score | prior | post |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | C00763 | D-Proline | C5H10NO2 | M+H | 116.070605 | 1.0 |  | -0.708479 |  | 0.7 | 0.317329 | 0.480821 |
| 1 | C18170 | 3-Acetamidopropanal | C5H10NO2 | M+H | 116.070605 | 1.0 | 500;560 | -0.708479 |  | 0.7 | 0.317329 | 0.24041 |
| 2 | C00148 | L-Proline | C5H10NO2 | M+H | 116.070605 | 1.0 | 35;55 | -0.708479 |  | 0.59986 | 0.317329 | 0.206018 |
| 3 | Unknown | Unknown |  |  |  |  |  | 3.0 |  | 0.7 | 0.048013 | 0.072751 |


In this case, the cosine similarity score for the annotation L-Proline M+H is not very high. Moreover, the retention time assigned to this feature is outside both retention time ranges reported in the database for L-Proline and 3-Acetamidopropanal. Therefore, the most likely annotation for this feature is D-Proline M+H.

**5. Computing posterior probabilities integrating adducts connections**

Until this point, the putative annotations and the associated probabilities computed for each feature are independent of each other. However, the IPA method can be used to update the probabilities by considering the possible relationships between annotations.
For example, the Gibbs_sampler_add() function uses a Gibbs sampler to estimate the posterior probabilities obtained by considering all possible adducts connections.

The help() provides a detailed description of this function:

In [25]:
help(ipa.Gibbs_sampler_add)

Help on function Gibbs_sampler_add in module ipaPy2.ipa:

Gibbs_sampler_add(df, annotations, noits=100, burn=None, delta_add=1, all_out=False, zs=None)
    Gibbs sampler considering only adduct connections. The function computes
    the posterior probabilities of the annotations considering the adducts
    connections.
    
    Parameters
    ----------
    df: pandas dataframe containing the MS1 data. It should be the output of the
        function ipa.map_isotope_patterns()
    annotations: a dictionary containing all the possible annotations for the
                measured features. The keys of the dictionary are the unique
                ids for the features present in df. For each feature, the
                annotations are summarized in a pandas dataframe. Output of
                functions MS1annotation(), MS1annotation_Parallel(),
                MSMSannotation() or MSMSannotation_Parallel
    noits: number of iterations if the Gibbs sampler to be run
    burn: number of it

In [26]:
zs = ipa.Gibbs_sampler_add(df,annotations,noits=1000,delta_add=0.1, all_out=True)

computing posterior probabilities including adducts connections
initialising sampler ...


Gibbs Sampler Progress Bar: 100%|██████████| 1000/1000 [00:05<00:00, 191.42it/s]

parsing results ...
Done -  5.3 seconds elapsed





The function modifies the annotations dictionary by adding two columns to each dataframe:
- **post Gibbs:** posterior probabilities obtained from the Gibbs sampler.
- **chi-square pval:** a chi-square test is used to check whether the posterior probabilities obtained are statistically different from the priors. The resulting p-value is reported in this column.
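
To make the idea behind the 'chi-square pval' column concrete, here is a generic goodness-of-fit sketch using only the standard library. It is not ipaPy2's code: the function names, the sample counts and the reference probabilities are all made up, and the exact quantities ipaPy2 feeds into its test are an assumption.

```python
import math


def chi_square_statistic(observed_counts, expected_probs):
    """Pearson statistic: sum over categories of (O - E)^2 / E."""
    total = sum(observed_counts)
    return sum(
        (obs - total * p) ** 2 / (total * p)
        for obs, p in zip(observed_counts, expected_probs)
    )


def chi_square_pvalue(stat, dof):
    """Upper-tail probability of a chi-square distribution with integer dof."""
    half = stat / 2.0
    if dof % 2 == 0:
        # Even dof: closed-form finite series.
        terms = (half ** i / math.factorial(i) for i in range(dof // 2))
        return math.exp(-half) * sum(terms)
    # Odd dof: start from the 1-dof tail and add the remaining series terms.
    tail = math.erfc(math.sqrt(half))
    series = sum(
        half ** (i - 0.5) / math.gamma(i + 0.5)
        for i in range(1, (dof - 1) // 2 + 1)
    )
    return tail + math.exp(-half) * series


# Made-up example: 1000 sampled assignments over four candidate annotations.
expected_probs = [0.45, 0.36, 0.18, 0.01]  # reference probabilities (assumed)
observed_counts = [657, 281, 61, 1]        # how often each candidate was sampled

stat = chi_square_statistic(observed_counts, expected_probs)
pval = chi_square_pvalue(stat, dof=len(observed_counts) - 1)
print(stat, pval)
```

A very small p-value indicates that the sampled frequencies differ from the reference probabilities by more than chance would explain.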

If all_out=True, the function also returns the full list of assignments computed. If this list is passed back to the Gibbs sampler through the zs argument, the sampler restarts from where it finished.

In [27]:
ipa.Gibbs_sampler_add(df,annotations, noits=4000,delta_add=0.1,zs=zs)

computing posterior probabilities including adducts connections
initialising sampler ...


Gibbs Sampler Progress Bar: 100%|██████████| 4000/4000 [00:20<00:00, 190.94it/s]


parsing results ...
Done -  21.0 seconds elapsed


As an example, the possible annotations for the feature associated with the id 501 are shown below.

In [28]:
annotations[501]

Unnamed: 0,id,name,formula,adduct,m/z,charge,RT range,ppm,isotope pattern score,fragmentation pattern score,prior,post,post Gibbs,chi-square pval
0,C00148,L-Proline,C10H19N2O4,2M+H,231.133933,1.0,35;55,-1.124747,0.328045,,0.314538,0.453117,0.657187,2.844927e-187
1,C00763,D-Proline,C10H19N2O4,2M+H,231.133933,1.0,,-1.124747,0.328045,,0.314538,0.362494,0.281493,2.844927e-187
2,C18170,3-Acetamidopropanal,C10H19N2O4,2M+H,231.133933,1.0,500;560,-1.124747,0.328045,,0.314538,0.181247,0.060875,2.844927e-187
3,Unknown,Unknown,,,,,,3.0,0.015864,,0.056386,0.003143,0.000444,2.844927e-187


This feature is clustered with feature id=1, whose most likely annotation is L-Proline M+H. As expected, once the adduct connections are considered, the 'post Gibbs' probability associated with L-Proline 2M+H is significantly higher.
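
The effect described above can be reproduced with a toy Gibbs sampler. This is a simplified sketch under stated assumptions, not the ipaPy2 implementation: the function `toy_gibbs`, the feature labels, the prior values and the weighting rule (prior multiplied by `delta` plus the number of connected features currently assigned to the same compound) are all illustrative choices.

```python
import random
from collections import Counter


def toy_gibbs(priors, delta=0.1, noits=2000, burn=200, seed=0):
    """Toy sampler: features assigned to the same compound reinforce each other."""
    rng = random.Random(seed)
    features = list(priors)
    # Start every feature from its most probable candidate.
    state = {f: max(priors[f], key=priors[f].get) for f in features}
    counts = {f: Counter() for f in features}
    for it in range(noits):
        for f in features:
            others = [state[g] for g in features if g != f]
            names = list(priors[f])
            # Weight = prior * (delta + number of connected features agreeing).
            weights = [
                priors[f][c] * (delta + sum(o == c and c != "Unknown" for o in others))
                for c in names
            ]
            state[f] = rng.choices(names, weights=weights, k=1)[0]
            if it >= burn:  # discard the first `burn` iterations
                counts[f][state[f]] += 1
    kept = noits - burn
    return {f: {c: counts[f][c] / kept for c in priors[f]} for f in features}


# Made-up probabilities for two features connected through their adducts.
priors = {
    "M+H feature": {"L-Proline": 0.54, "D-Proline": 0.30,
                    "3-Acetamidopropanal": 0.15, "Unknown": 0.01},
    "2M+H feature": {"L-Proline": 0.45, "D-Proline": 0.36,
                     "3-Acetamidopropanal": 0.18, "Unknown": 0.01},
}
posterior = toy_gibbs(priors)
print(posterior["2M+H feature"])
```

In this toy model, a smaller `delta` makes the connection count dominate the weights, so agreement between connected features matters more; a larger `delta` keeps the result closer to the input probabilities.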

**6. Computing posterior probabilities integrating biochemical connections**

The IPA method can also update the probabilities associated with each possible annotation by considering all possible biochemical connections.

Before doing so, it is necessary to provide a pandas dataframe reporting which compounds can be considered biochemically related.
The function Compute_Bio() can be used to compute such a dataframe.
The help() function provides a detailed description of this function:

In [29]:
help(ipa.Compute_Bio)

Help on function Compute_Bio in module ipaPy2.ipa:

Compute_Bio(DB, annotations=None, mode='reactions', connections=['C3H5NO', 'C6H12N4O', 'C4H6N2O2', 'C4H5NO3', 'C3H5NOS', 'C6H10N2O3S2', 'C5H7NO3', 'C5H8N2O2', 'C2H3NO', 'C6H7N3O', 'C6H11NO', 'C6H11NO', 'C6H12N2O', 'C5H9NOS', 'C9H9NO', 'C5H7NO', 'C3H5NO2', 'C4H7NO2', 'C11H10N2O', 'C9H9NO2', 'C5H9NO', 'C4H4O2', 'C3H5O', 'C10H12N5O6P', 'C10H15N2O3S', 'C10H14N2O2S', 'CH2ON', 'C21H34N7O16P3S', 'C21H33N7O15P3S', 'C10H15N3O5S', 'C5H7', 'C3H2O3', 'C16H30O', 'C8H8NO5P', 'CH3N2O', 'C5H4N5', 'C10H11N5O3', 'C10H13N5O9P2', 'C10H12N5O6P', 'C9H13N3O10P2', 'C9H12N3O7P', 'C4H4N3O', 'C10H13N5O10P2', 'C10H12N5O7P', 'C5H4N5O', 'C10H11N5O4', 'C10H14N2O10P2', 'C10H12N2O4', 'C5H5N2O2', 'C10H13N2O7P', 'C9H12N2O11P2', 'C9H11N2O8P', 'C4H3N2O2', 'C9H10N2O5', 'C2H3O2', 'C2H2O', 'C2H2', 'CO2', 'CHO2', 'H2O', 'H3O6P2', 'C2H4', 'CO', 'C2O2', 'H2', 'O', 'P', 'C2H2O', 'CH2', 'HPO3', 'NH2', 'PP', 'NH', 'SO3', 'N', 'C6H10O5', 'C6H10O6', 'C5H8O4', 'C12H20O11', 'C6H11O8P

Depending on the value assigned to the 'mode' parameter, the function can compute all possible biochemical connections in two ways.
If mode='reactions', the function connects the compounds that share the same reaction id(s) according to what is reported in the database.

In [30]:
Bio = ipa.Compute_Bio(DB,annotations,mode='reactions')
Bio

computing all possible biochemical connections
considering the reactions stored in the database ...
0.0 seconds elapsed


Unnamed: 0,0,1
0,C00763,C00431
1,C02265,C00079
2,C00183,C00407
3,C00079,C00082
4,C00079,C20807
5,C02486,C00123
6,C04368,C00082
7,C00407,C21092
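
The logic behind mode='reactions' can be sketched in a few lines. This is an illustration only, not the Compute_Bio() code: the function `connect_by_reactions` is hypothetical, the reaction ids are made up, and the way the real database stores reaction ids is not assumed here.

```python
from itertools import combinations

# Hypothetical mapping compound id -> set of reaction ids (reaction ids are made up).
reactions = {
    "C00148": {"R1", "R2"},
    "C00763": {"R1"},
    "C00079": {"R3"},
    "C00082": {"R3", "R4"},
    "C00183": {"R5"},
}


def connect_by_reactions(reactions):
    """Return every pair of compounds sharing at least one reaction id."""
    pairs = []
    for a, b in combinations(sorted(reactions), 2):
        if reactions[a] & reactions[b]:  # non-empty intersection
            pairs.append((a, b))
    return pairs


print(connect_by_reactions(reactions))
```

Each returned pair corresponds to one row of a two-column table like the one shown above.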


If mode='connections', the function computes the 'chemical formula difference' for each pair of compounds considered. If the difference is included in the list of connections, the two compounds are considered connected.
A list of connections is provided by default, but it can be modified.

In [31]:
Bio = ipa.Compute_Bio(DB,annotations,mode='connections')
Bio

computing all possible biochemical connections
considering the provided connections ...
6.1 seconds elapsed


Unnamed: 0,0,1
0,C02237,C16744
1,C02237,C22140
2,C02237,C05131
3,C16744,C04281
4,C16744,C22141
5,C16744,C01879
6,C16744,C04282
7,C16744,C01877
8,C22140,C04281
9,C22140,C22141
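
The 'chemical formula difference' used by mode='connections' can be illustrated as follows. This is a sketch under assumptions, not the library's implementation: the helper names are hypothetical, only a short subset of the default connections is used, and a difference is accepted only when one formula is fully contained in the other, which may differ from how ipaPy2 treats mixed differences.

```python
import re


def parse_formula(formula):
    """Turn a formula such as 'C9H11NO2' into {'C': 9, 'H': 11, 'N': 1, 'O': 2}."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts


def formula_difference(f1, f2):
    """Element-wise difference, or None if it is empty or has mixed signs."""
    a, b = parse_formula(f1), parse_formula(f2)
    diff = {el: a.get(el, 0) - b.get(el, 0) for el in set(a) | set(b)}
    diff = {el: v for el, v in diff.items() if v != 0}
    if not diff or len({v > 0 for v in diff.values()}) != 1:
        return None
    return {el: abs(v) for el, v in diff.items()}


def is_connected(f1, f2, connections):
    """True if the formula difference matches one of the allowed connections."""
    diff = formula_difference(f1, f2)
    return diff is not None and diff in [parse_formula(c) for c in connections]


connections = ["O", "CH2", "H2O"]  # small subset of the default list
print(is_connected("C9H11NO2", "C9H11NO3", connections))  # L-Phenylalanine vs L-Tyrosine
print(is_connected("C5H11NO2", "C6H13NO2", connections))  # L-Valine vs L-Leucine
print(is_connected("C5H9NO2", "C9H11NO2", connections))   # L-Proline vs L-Phenylalanine
```

Because every pair of compounds has to be compared, the cost grows quadratically with the number of compounds.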


Depending on the size of the database and the dataset, computing all possible biochemical connections can be extremely computationally demanding and can drastically increase the time needed for the annotation. For this reason, precomputed lists of biochemical connections for the provided database (one computed with mode='reactions' and one with mode='connections') are included in the library and can be used directly, without recomputing the connections.

In [32]:
Bio = pd.read_csv('DB/allBIO_reactions.csv')

The list of connections computed with mode='connections' needs to be unzipped first.

In [33]:
import zipfile
with zipfile.ZipFile("DB/allBio_connections.csv.zip","r") as zip_ref:
    zip_ref.extractall("DB/")

Bio=pd.read_csv('DB/allBio_connections.csv')


Alternatively, users can define their own biochemical connections.
For example, using the database ids of the following compounds:
- L-Proline: C00148
- L-Valine: C00183
- L-Phenylalanine: C00079
- L-Leucine: C00123
- 5-Oxoproline: C01879
- Betaine: C00719
- Hordatine A: C08307
- L-Tyrosine: C00082
- D-Proline: C00763
- D-Phenylalanine: C02265




In [34]:
Bio=pd.DataFrame([['C00148','C00763'],
                  ['C00079','C02265'],
                  ['C08307','C00082'],
                  ['C08307','C00079']])
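
When connections are easier to write down by compound name, the id pairs can be derived from a name-to-id mapping. The helper `to_id_pairs` below is a hypothetical convenience function, not part of ipaPy2; the names and ids are the ones listed above.

```python
# Names and database ids taken from the list above.
name_to_id = {
    "L-Proline": "C00148",
    "D-Proline": "C00763",
    "L-Phenylalanine": "C00079",
    "D-Phenylalanine": "C02265",
    "Hordatine A": "C08307",
    "L-Tyrosine": "C00082",
}

named_pairs = [
    ("L-Proline", "D-Proline"),
    ("L-Phenylalanine", "D-Phenylalanine"),
    ("Hordatine A", "L-Tyrosine"),
    ("Hordatine A", "L-Phenylalanine"),
    ("D-Proline", "L-Proline"),  # same connection in reverse order, dropped below
]


def to_id_pairs(named_pairs, name_to_id):
    """Translate names to ids, skipping self-pairs and repeated connections."""
    seen = set()
    pairs = []
    for a, b in named_pairs:
        id_a, id_b = name_to_id[a], name_to_id[b]
        key = frozenset((id_a, id_b))  # order-independent key
        if id_a == id_b or key in seen:
            continue
        seen.add(key)
        pairs.append([id_a, id_b])
    return pairs


pairs = to_id_pairs(named_pairs, name_to_id)
print(pairs)
```

Passing `pairs` to `pd.DataFrame` gives a two-column dataframe in the same layout as the `Bio` dataframe defined in the cell above.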

In [35]:
ipa.Gibbs_sampler_bio(df,annotations,Bio,noits=5000,delta_bio=0.1)

computing posterior probabilities including biochemical connections
initialising sampler ...


Gibbs Sampler Progress Bar: 100%|██████████| 5000/5000 [00:30<00:00, 162.45it/s]


parsing results ...
Done -  30.8 seconds elapsed


As an example, the possible annotations for the feature associated with the id 992 are shown below.

In [36]:
annotations[992]

Unnamed: 0,id,name,formula,adduct,m/z,charge,RT range,ppm,isotope pattern score,fragmentation pattern score,prior,post,post Gibbs,chi-square pval
0,C00763,D-Proline,C5H10NO2,M+H,116.070605,1.0,,-0.65851,,0.7,0.317559,0.531683,0.774,0.0
2,C00148,L-Proline,C5H10NO2,M+H,116.070605,1.0,35;55,-0.65851,,0.324512,0.317559,0.123241,0.178667,0.0
1,C18170,3-Acetamidopropanal,C5H10NO2,M+H,116.070605,1.0,500;560,-0.65851,,0.7,0.317559,0.265841,0.036222,0.0
3,Unknown,Unknown,,,,,,3.0,,0.7,0.047324,0.079234,0.011111,0.0


The probability associated with D-Proline M+H is significantly higher after considering the biochemical connections. This is because D-Proline is biochemically connected to L-Proline (proline racemase), and the most likely annotation for feature id=1 is L-Proline M+H (>50%).

**7. Computing posterior probabilities integrating both adducts and biochemical connections**

It is also possible to run the Gibbs sampler considering biochemical and adduct connections at the same time.
To do so, one can use the function Gibbs_sampler_bio_add().
The help() function provides a detailed explanation of this function.

In [37]:
help(ipa.Gibbs_sampler_bio_add)

Help on function Gibbs_sampler_bio_add in module ipaPy2.ipa:

Gibbs_sampler_bio_add(df, annotations, Bio, noits=100, burn=None, delta_bio=1, delta_add=1, all_out=False, zs=None)
    Gibbs sampler considering both biochemical and adducts connections. The
    function computes the posterior probabilities of the annotations
    considering the possible biochemical connections reported in Bio and the
    possible adducts connection.
    
    Parameters
    ----------
    df: pandas dataframe containing the MS1 data. It should be the output of the
        function ipa.map_isotope_patterns()
    annotations: a dictionary containing all the possible annotations for the
                 measured features. The keys of the dictionary are the unique
                 ids for the features present in df. For each feature, the
                 annotations are summarized in a pandas dataframe. Output of
                 functions MS1annotation(), MS1annotation_Parallel(),
                 MSMSannotati

In [38]:
ipa.Gibbs_sampler_bio_add(df,annotations,Bio,noits=5000,delta_bio=0.1,delta_add=0.1)

computing posterior probabilities including biochemical and adducts connections
initialising sampler ...


Gibbs Sampler Progress Bar: 100%|██████████| 5000/5000 [00:28<00:00, 175.56it/s]


parsing results ...
Done -  28.6 seconds elapsed


In [39]:
annotations[1]

Unnamed: 0,id,name,formula,adduct,m/z,charge,RT range,ppm,isotope pattern score,fragmentation pattern score,prior,post,post Gibbs,chi-square pval
0,C00148,L-Proline,C5H10NO2,M+H,116.070605,1.0,35;55,-0.523247,0.331946,0.999759,0.318084,0.543121,0.618,7.003899e-93
1,C00763,D-Proline,C5H10NO2,M+H,116.070605,1.0,,-0.523247,0.331946,0.7,0.318084,0.304221,0.34,7.003899e-93
2,C18170,3-Acetamidopropanal,C5H10NO2,M+H,116.070605,1.0,500;560,-0.523247,0.331946,0.7,0.318084,0.15211,0.042,7.003899e-93
3,Unknown,Unknown,,,,,,3.0,0.004161,0.7,0.045748,0.000548,0.0,7.003899e-93


**8. Running the whole pipeline with a single function**

Finally, the ipaPy2 library also includes a wrapper function that runs the whole IPA pipeline in one step.
A detailed description of the function can be accessed with help().

In [40]:
help(ipa.simpleIPA)

Help on function simpleIPA in module ipaPy2.ipa:

simpleIPA(df, ionisation, DB, adductsAll, ppm, dfMS2=None, DBMS2=None, noits=100, burn=None, delta_add=None, delta_bio=None, Bio=None, mode='reactions', CSunk=0.5, isodiff=1, ppmiso=100, ncores=1, me=0.000548579909065, ratiosd=0.9, ppmunk=None, ratiounk=None, ppmthr=None, pRTNone=None, pRTout=None, mzdCS=0, ppmCS=10, evfilt=False, connections=['C3H5NO', 'C6H12N4O', 'C4H6N2O2', 'C4H5NO3', 'C3H5NOS', 'C6H10N2O3S2', 'C5H7NO3', 'C5H8N2O2', 'C2H3NO', 'C6H7N3O', 'C6H11NO', 'C6H11NO', 'C6H12N2O', 'C5H9NOS', 'C9H9NO', 'C5H7NO', 'C3H5NO2', 'C4H7NO2', 'C11H10N2O', 'C9H9NO2', 'C5H9NO', 'C4H4O2', 'C3H5O', 'C10H12N5O6P', 'C10H15N2O3S', 'C10H14N2O2S', 'CH2ON', 'C21H34N7O16P3S', 'C21H33N7O15P3S', 'C10H15N3O5S', 'C5H7', 'C3H2O3', 'C16H30O', 'C8H8NO5P', 'CH3N2O', 'C5H4N5', 'C10H11N5O3', 'C10H13N5O9P2', 'C10H12N5O6P', 'C9H13N3O10P2', 'C9H12N3O7P', 'C4H4N3O', 'C10H13N5O10P2', 'C10H12N5O7P', 'C5H4N5O', 'C10H11N5O4', 'C10H14N2O10P2', 'C10H12N2O4', 'C5H5N2O2

The result of this function depends on the parameters passed to it.

For example, to use both the MS1 and MS2 data and consider only the adduct connections in the Gibbs sampler, the following should be used:

In [41]:
annotations= ipa.simpleIPA(df,ionisation=1, DB=DB,adductsAll=adducts,ppm=3,dfMS2=dfMS2,DBMS2=DBMS2,
                           noits=5000,delta_add=0.1)

isotopes already mapped
computing all adducts ....
0.1 seconds elapsed
annotating based on MS1 and MS2 information....
1.2 seconds elapsed
computing posterior probabilities including adducts connections
initialising sampler ...


Gibbs Sampler Progress Bar: 100%|██████████| 5000/5000 [00:25<00:00, 193.44it/s]


parsing results ...
Done -  25.9 seconds elapsed


If instead one wants to use only the MS1 data and consider only the adduct connections in the Gibbs sampler, the following should be used:

In [42]:
annotations= ipa.simpleIPA(df,ionisation=1, DB=DB,adductsAll=adducts,ppm=3,noits=5000,delta_add=0.1)

isotopes already mapped
computing all adducts ....
0.1 seconds elapsed
annotating based on MS1 information....
0.8 seconds elapsed
computing posterior probabilities including adducts connections
initialising sampler ...


Gibbs Sampler Progress Bar: 100%|██████████| 5000/5000 [00:25<00:00, 192.32it/s]

parsing results ...
Done -  26.1 seconds elapsed





Finally, to use both the MS1 and MS2 data and consider both adduct and biochemical connections in the Gibbs sampler, the following should be used:

In [43]:
annotations= ipa.simpleIPA(df,ionisation=1, DB=DB,adductsAll=adducts,ppm=3,dfMS2=dfMS2,DBMS2=DBMS2,noits=5000,
                             Bio=Bio,
                             delta_add=0.1, 
                             delta_bio=0.4)

isotopes already mapped
computing all adducts ....
0.1 seconds elapsed
annotating based on MS1 and MS2 information....
1.3 seconds elapsed
computing posterior probabilities including biochemical and adducts connections
initialising sampler ...


Gibbs Sampler Progress Bar: 100%|██████████| 5000/5000 [00:28<00:00, 177.00it/s]


parsing results ...
Done -  28.3 seconds elapsed
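
The three calls above suggest that the choice of sampler depends on which of `delta_add` and `delta_bio` are provided. The sketch below summarises that reading. It is an assumption drawn from the printed outputs, not from the simpleIPA() source: `sampler_for` is a hypothetical helper, and the cases with only `delta_bio` or with neither parameter are not demonstrated in this tutorial.

```python
def sampler_for(delta_add=None, delta_bio=None):
    """Hypothetical summary of which Gibbs sampler the examples above end up running."""
    if delta_add is not None and delta_bio is not None:
        return "Gibbs_sampler_bio_add"  # adduct and biochemical connections
    if delta_add is not None:
        return "Gibbs_sampler_add"      # adduct connections only
    if delta_bio is not None:
        return "Gibbs_sampler_bio"      # assumed: biochemical connections only
    return None                         # assumed: no Gibbs sampler is run


print(sampler_for(delta_add=0.1))                 # as in cells In [41] and In [42]
print(sampler_for(delta_add=0.1, delta_bio=0.4))  # as in cell In [43]
```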
