# Synthetic example

In this notebook, we show the usage of the new implementation of the IPA method with the simulated dataset introduced with the previous version of the method ([Del Carratore et al., 2019](https://pubs.acs.org/doi/full/10.1021/acs.analchem.9b02354)).
The synthetic experiment was built by considering 15 compounds involved in the mevalonate pathway and limonene synthesis (compounds highlighted in blue in the figure below).
![Mevalonate](ExampleDatasets/Synthetic/mevalonate_pathway.jpeg)
Several adducts and isotopes were simulated for each of the considered metabolites, in both negative and positive mode, consistent with the relative concentrations shown in the table below. A detailed description of how the simulated datasets were built can be found in the original paper ([Del Carratore et al., 2019](https://pubs.acs.org/doi/full/10.1021/acs.analchem.9b02354)).
 


In [1]:
import pandas as pd

compounds = pd.read_csv('ExampleDatasets/Synthetic/synthetic_compounds_info.csv')
compounds

Unnamed: 0,id,name,Formula,RT,Relative Concentration
0,C00024,Acetyl-CoA,C23H38N7O17P3S1,60,19.1
1,C00332,Acetoacetyl-CoA,C25H40N7O18P3S,75,95.7
2,C00010,Coenzyme A,C21H36N7O16P3S,300,24.9
3,C00356,3-Hydroxy-3-methylglutaryl-CoA,C27H44N7O20P3S,120,36.8
4,C00005,NADPH,C21H30N7O17P3,200,76.1
5,C00006,NADP,C21H29N7O17P3,150,77.8
6,C00418,Mevalonic acid,C6H12O4,250,34.9
7,C00002,ATP,C10H16N5O13P3,210,6.7
8,C00008,ADP,C10H15N5O10P2,211,9.8
9,C01107,Mevalonic acid-5P,C6H13O7P,310,35.8


## Positive mode
The dataset simulated in positive mode contains 83 mass spectrometry features and can be found in the .csv file shown below.

In [2]:
df_all = pd.read_csv('ExampleDatasets/Synthetic/positive_synth_dataset.csv')
df_all.head()

Unnamed: 0,ids,rel.ids,mzs,RTs,Int,Compound ID,Formula,Adduct,isotope,charge
0,1,0,810.13374,60.534708,19.1,C00024,C23H39N7O17P3S1,M+H,mono,1
1,2,0,832.11583,61.629705,9.000503,C00024,C23H38N7O17P3S1Na1,M+Na,mono,1
2,3,0,405.571155,60.727202,16.601507,C00024,C23H40N7O17P3S1,M+2H,mono,2
3,4,0,1619.258146,60.988866,9.287749,C00024,C46H77N14O34P6S2,2M+H,mono,1
4,5,1,852.144271,75.607828,95.7,C00332,C25H41N7O18P3S1,M+H,mono,1


The .csv file also contains the correct annotation for each feature. To use the IPA method, a dataframe containing only the necessary information is required.

In [3]:
df = df_all.copy()
df=df.drop(['Compound ID', 'Formula','Adduct','isotope','charge'], axis=1)
df.head()

Unnamed: 0,ids,rel.ids,mzs,RTs,Int
0,1,0,810.13374,60.534708,19.1
1,2,0,832.11583,61.629705,9.000503
2,3,0,405.571155,60.727202,16.601507
3,4,0,1619.258146,60.988866,9.287749
4,5,1,852.144271,75.607828,95.7
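
Before running the pipeline, it can be useful to confirm that the input table has the layout shown above. The following is a minimal sketch: the column list is taken from the preview, while the helper name `check_ms1_columns` is hypothetical and not part of ipaPy2.

```python
import pandas as pd

# Columns seen in the preview above (the unnamed index column is ignored).
EXPECTED_COLUMNS = ['ids', 'rel.ids', 'mzs', 'RTs', 'Int']

def check_ms1_columns(table, expected=EXPECTED_COLUMNS):
    """Return the expected columns missing from `table` (hypothetical helper)."""
    return [col for col in expected if col not in table.columns]

# Toy table copied from the first two rows shown above.
toy = pd.DataFrame({'ids': [1, 2], 'rel.ids': [0, 0],
                    'mzs': [810.13374, 832.11583],
                    'RTs': [60.534708, 61.629705],
                    'Int': [19.1, 9.000503]})
missing = check_ms1_columns(toy)
print(missing)  # an empty list means every expected column is present
```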


We also need to import the ipaPy2 module and the necessary database.

In [4]:
from ipaPy2 import ipa
DB=pd.read_csv('ExampleDatasets/Synthetic/DBMS1_reduced.csv')
adducts = pd.read_csv('DB/adducts.csv')

The biochemical connections used for the computation of the posterior probability are directly taken from the mevalonate pathway reported above.

In [5]:
Bio=pd.DataFrame([['C00024','C00332'], # Acetyl-CoA, Acetoacetyl-CoA
                  ['C00332','C00356'], # Acetoacetyl-CoA, HMG-CoA
                  ['C00356','C00010'], # HMG-CoA, CoA
                  ['C00356','C00005'], # HMG-CoA, NADPH
                  ['C00010','C00005'], # CoA, NADPH
                  ['C00010','C00006'], # CoA, NADP+
                  ['C00006','C00005'], # NADP+, NADPH
                  ['C00356','C00006'], # HMG-CoA, NADP+
                  ['C00356','C00418'], # HMG-CoA, Mevalonate
                  ['C00010','C00418'], # CoA, Mevalonate
                  ['C00005','C00418'], # NADPH, Mevalonate
                  ['C00006','C00418'], # NADP+, Mevalonate
                  ['C00002','C00008'], # ATP, ADP
                  ['C00418','C00008'], # Mevalonate, ADP
                  ['C00418','C00002'], # Mevalonate, ATP
                  ['C00418','C01107'], # Mevalonate, Mevalonate-5-phosphate
                  ['C01107','C00002'], # Mevalonate-5-phosphate, ATP
                  ['C01107','C00008'], # Mevalonate-5-phosphate, ADP
                  ['C01107','C01143'], # Mevalonate-5-phosphate, Mevalonate-5-diphosphate
                  ['C00002','C01143'], # ATP, Mevalonate-5-diphosphate
                  ['C00008','C01143'], # ADP, Mevalonate-5-diphosphate
                  ['C00002','C00129'], # ATP, isopentenyl pyrophosphate
                  ['C00008','C00129'], # ADP, isopentenyl pyrophosphate
                  ['C01143','C00129'], # Mevalonate-5-diphosphate, isopentenyl pyrophosphate
                  ['C00235','C00129'], # Dimethylallyl diphosphate, isopentenyl pyrophosphate
                  ['C00235','HMDB0032291'], # Dimethylallyl diphosphate, Geranyl pyrophosphate
                  ['C00129','HMDB0032291'], # isopentenyl pyrophosphate, Geranyl pyrophosphate
                  ['HMDB0032291','C00521']]) # Geranyl pyrophosphate, Limonene
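
To inspect which compounds a given metabolite is linked to, the pair list can be turned into a neighbour map. This is only an illustrative sketch on a few of the pairs above: the helper `neighbours` is hypothetical, and treating the connections as undirected is an assumption made here, not a statement about how ipaPy2 handles them internally.

```python
import pandas as pd
from collections import defaultdict

# A few of the pairs listed above, used here as a toy example.
bio_toy = pd.DataFrame([['C00024', 'C00332'],
                        ['C00332', 'C00356'],
                        ['C00356', 'C00010']])

def neighbours(bio):
    """Map each compound id to the set of ids it is paired with (hypothetical helper)."""
    graph = defaultdict(set)
    for a, b in bio.values.tolist():
        graph[a].add(b)
        graph[b].add(a)  # assumption: connections are treated as undirected
    return dict(graph)

graph = neighbours(bio_toy)
print(sorted(graph['C00332']))
```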

At this point, the whole IPA pipeline can be run with the simpleIPA() function.
Note that this notebook uses 5 cores for the analysis. The ncores parameter should be chosen carefully, according to the hardware actually available.

In [6]:
annotations_MS1 = ipa.simpleIPA(df=df,ionisation=1,DB=DB,adductsAll=adducts,ppm=5,ppmthr=30,Bio=Bio,
                            delta_add=.1,delta_bio=1,burn=1000,noits=5000,ncores=5)

mapping isotope patterns ....
0.1 seconds elapsed
computing all adducts - Parallelized ....
6.6 seconds elapsed
annotating based on MS1 information - Parallelized ...
1.4 seconds elapsed
computing posterior probabilities including biochemical and adducts connections
initialising sampler ...


Gibbs Sampler Progress Bar: 100%|███████████| 5000/5000 [01:13<00:00, 67.61it/s]


parsing results ...
Done -  74.1 seconds elapsed
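
The ncores value used above may not suit every machine. One possible way to pick it, sketched below, is to cap the requested value at the number of cores reported by the operating system. The helper `pick_ncores` and its `reserve` argument are hypothetical and not part of ipaPy2.

```python
import os

def pick_ncores(requested, reserve=1):
    """Cap the requested number of cores at what the machine offers (hypothetical helper)."""
    available = os.cpu_count() or 1       # os.cpu_count() may return None
    usable = max(1, available - reserve)  # leave `reserve` cores free for other work
    return max(1, min(requested, usable))

ncores = pick_ncores(5)
print(ncores)
```

The resulting value could then be passed as `ncores=ncores` in the call shown above.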


To highlight how the integration of MS2 data improves the annotation accuracy in the new implementation of the IPA method, fragmentation spectra were simulated for three of the MS1 features.

In [7]:
dfMS2 = pd.read_csv('ExampleDatasets/Synthetic/dfMS2_pos.csv')
dfMS2

Unnamed: 0,id,spectrum,ev
0,43,53.319:1.97448429548671 54.537:2.2957893853734...,35
1,51,68.0501937866:0.982856686586494 70.0658111572:...,35
2,76,40:52.348731209745 41:74.3811909502889 42:66.3...,35
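
Judging from the truncated preview above, each entry of the spectrum column appears to be a space-separated list of `m/z:intensity` pairs. Under that assumption, a spectrum string could be converted into a numeric array as sketched below. The helper `parse_spectrum` is hypothetical, and the toy values are illustrative only.

```python
import numpy as np

def parse_spectrum(text):
    """Turn 'mz:intensity mz:intensity ...' into an (n, 2) float array (hypothetical helper)."""
    pairs = [peak.split(':') for peak in text.split()]
    return np.array(pairs, dtype=float)

# Toy string in the same style as the preview; the values are illustrative only.
peaks = parse_spectrum('53.319:1.97 54.537:2.30 68.050:0.98')
print(peaks.shape)  # one row per peak, columns are m/z and intensity
```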


We can now re-run the whole annotation pipeline including the MS2 data using the simpleIPA() function as shown below:

In [8]:
DBMS2 = pd.read_csv('ExampleDatasets/Synthetic/DBMS2_reduced.csv')
annotations_MS2 = ipa.simpleIPA(df=df,dfMS2=dfMS2,DBMS2=DBMS2,ionisation=1,DB=DB,adductsAll=adducts,ppm=5,
                                ppmthr=30,Bio=Bio,delta_add=.1,delta_bio=1,burn=1000,noits=5000,CSunk=0.1,ncores=5)

mapping isotope patterns ....
0.1 seconds elapsed
computing all adducts - Parallelized ....
6.5 seconds elapsed
annotating based on MS1 and MS2 information - Parallelized...
1.4 seconds elapsed
computing posterior probabilities including biochemical and adducts connections
initialising sampler ...


Gibbs Sampler Progress Bar: 100%|███████████| 5000/5000 [01:14<00:00, 67.03it/s]


parsing results ...
Done -  74.7 seconds elapsed


The synthetic dataset described here was also used to compare the new and old implementations of the IPA method. To achieve this, the same dataset was annotated with the original R package ([https://github.com/francescodc87/IPA](https://github.com/francescodc87/IPA)) using the exact same database. The R script used to annotate the example dataset can be found [here](https://github.com/francescodc87/ipaPy2/blob/main/ExampleDatasets/Synthetic/running_old_IPA.R). The posterior probabilities calculated by the original implementation for the correct annotations are summarized in the res_IPAV1_POS.csv file.

In [9]:
result_summary = pd.read_csv('ExampleDatasets/Synthetic/res_IPAV1_POS.csv')

With the few lines below, the posterior probabilities calculated by the new implementation of the IPA method are added to the same table.

In [10]:
postIPA2_MS1 = []
postIPA2_MS2 = []
for idx in result_summary['ids']:
    correct=list(result_summary['Compound.ID'][result_summary['ids']==idx])[0]
    tmp = annotations_MS1[idx]
    postIPA2_MS1.append(list(tmp['post Gibbs'][tmp['id']==correct])[0])
    tmp = annotations_MS2[idx]
    postIPA2_MS2.append(list(tmp['post Gibbs'][tmp['id']==correct])[0])

result_summary['post IPA2 - MS1'] = postIPA2_MS1
result_summary['post IPA2 - MS2'] = postIPA2_MS2
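
The loop above indexes `[0]` directly, which raises an IndexError if the correct compound is not among the candidates for a feature. A more defensive lookup could look like the sketch below. It assumes, as in the loop above, that the annotations object maps each feature id to a table with `id` and `post Gibbs` columns. The helper `posterior_of` and the toy data are hypothetical.

```python
import pandas as pd

def posterior_of(annotations, feature_id, compound_id, column='post Gibbs'):
    """Posterior of `compound_id` for one feature, or 0.0 if it is not a candidate."""
    table = annotations[feature_id]
    hits = table.loc[table['id'] == compound_id, column]
    return float(hits.iloc[0]) if len(hits) else 0.0

# Toy stand-in for the pipeline output: one table of candidates per feature id.
toy_annotations = {
    1: pd.DataFrame({'id': ['C00024', 'C00332'], 'post Gibbs': [0.997, 0.003]}),
}
found = posterior_of(toy_annotations, 1, 'C00024')
absent = posterior_of(toy_annotations, 1, 'C00010')
print(found, absent)
```

Returning 0.0 for a missing candidate is a design choice of this sketch: the logarithm of such a value is minus infinity, so it would need separate handling in any log-based score.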


The table below shows the probabilities assigned to the correct annotation by both the old (post IPA1) and new (post IPA2) implementations of the IPA method. "post IPA2 - MS1" refers to the posterior probabilities obtained when considering only the MS1 data, while "post IPA2 - MS2" refers to the posterior probabilities obtained when the fragmentation spectra are also included in the annotation.

In [11]:
result_summary 

Unnamed: 0,ids,rel.ids,mzs,RTs,Int,Compound.ID,Formula,Adduct,post IPA1,post IPA2 - MS1,post IPA2 - MS2
0,1,0,810.13374,60.534708,19.1,C00024,C23H39N7O17P3S1,M+H,1.0,0.997,0.99725
1,2,0,832.11583,61.629705,9.000503,C00024,C23H38N7O17P3S1Na1,M+Na,0.996,0.997,0.99875
2,3,0,405.571155,60.727202,16.601507,C00024,C23H40N7O17P3S1,M+2H,0.99375,0.99675,0.99675
3,4,0,1619.258146,60.988866,9.287749,C00024,C46H77N14O34P6S2,2M+H,0.99425,0.99825,0.99925
4,5,1,852.144271,75.607828,95.7,C00332,C25H41N7O18P3S1,M+H,1.0,0.99975,0.99925
5,7,1,874.125193,73.9838,72.348321,C00332,C25H40N7O18P3S1Na1,M+Na,0.99575,0.9995,0.99975
6,9,1,426.576318,74.66154,66.800632,C00332,C25H42N7O18P3S1,M+2H,0.9945,0.9995,0.99975
7,11,1,1703.28066,74.653105,36.161426,C00332,C50H81N14O36P6S2,2M+H,0.99475,0.99975,1.0
8,14,2,768.123027,299.078198,24.9,C00010,C21H37N7O16P3S1,M+H,0.99975,1.0,0.99975
9,16,2,790.104911,302.337016,20.391088,C00010,C21H36N7O16P3S1Na1,M+Na,0.996,0.999,0.999


The annotation accuracy for the three features with an associated MS2 spectrum (id=43, id=51 and id=76) increases when the MS2 data are considered in the annotation process (i.e., higher posterior probabilities in the table above). More interestingly, the increase in accuracy is not limited to the features for which a fragmentation spectrum was simulated, but 'spreads' through adduct and biochemical connections. For example, the posterior probabilities associated with all the simulated adducts of Limonene (C00521) increase when the fragmentation spectrum for the M+H adduct is considered. Additionally, the posterior probabilities associated with all the adducts of Geranyl pyrophosphate (HMDB0032291) also increase, since the IPA method considers this compound biochemically connected to Limonene.
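
One simple way to see which features gained from the MS2 data is to subtract the two posterior columns. The sketch below does this on three rows copied from the table above. The `gain` column name is introduced here for illustration only. Very small differences may partly reflect sampling variability in the Gibbs sampler rather than a real change.

```python
import pandas as pd

# Three rows copied from the table above.
toy = pd.DataFrame({
    'ids': [2, 4, 14],
    'post IPA2 - MS1': [0.997, 0.99825, 1.0],
    'post IPA2 - MS2': [0.99875, 0.99925, 0.99975],
})
# Positive gain: the correct annotation is more probable once MS2 data are included.
toy['gain'] = toy['post IPA2 - MS2'] - toy['post IPA2 - MS1']
improved = toy.loc[toy['gain'] > 0, 'ids'].tolist()
print(improved)
```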

Similar to what is shown in the 2019 IPA paper, having the correct annotations for each mass spectrometry feature, we can use the Logarithmic Predictive Score (LPS) to evaluate the overall performance of the method.
The LPS score is computed as follows:

$LPS = \sum \limits _{i=1} ^{M} \log(p_{i}) $

where $p_i$ is the probability assigned to the correct annotation for the $i^{th}$ feature and $M$ is the number of features.
In the best case scenario the LPS score is equal to zero. In all other cases LPS is a negative value.
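
As a small worked illustration of the formula, the sketch below wraps the computation in a helper and applies it to a perfect case and to three values taken from the "post IPA1" column above. The function name `lps` is introduced here for illustration only.

```python
import numpy as np

def lps(probabilities):
    """Logarithmic Predictive Score: sum of the log-probabilities of the correct annotations."""
    p = np.asarray(probabilities, dtype=float)
    return float(np.sum(np.log(p)))

perfect = lps([1.0, 1.0, 1.0])           # every correct annotation has probability 1
imperfect = lps([1.0, 0.996, 0.99375])   # any probability below 1 lowers the score
print(perfect, imperfect)
```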


The LPS score computed for the old method is:

In [12]:
import numpy as np
np.sum(np.log(result_summary['post IPA1']))

-27.110869367994507

The LPS score computed for the new method considering only MS1 data is:

In [13]:
np.sum(np.log(result_summary ['post IPA2 - MS1']))

-23.96410310141909

And the LPS score computed for the new method also including MS2 data is:

In [14]:
np.sum(np.log(result_summary ['post IPA2 - MS2']))

-14.473628546862983

While the results obtained with the original IPA implementation are reported here for comparison, these scores are not directly comparable with those of the new implementation, because the two versions handle isotope patterns in completely different ways.

## Negative mode
The dataset simulated in negative mode contains 95 mass spectrometry features and can be found in the .csv file shown below.

In [15]:
df_all = pd.read_csv('ExampleDatasets/Synthetic/negative_synth_dataset.csv')
df_all.head()

Unnamed: 0,ids,rel.ids,mzs,RTs,Int,Compound ID,Formula,Adduct,isotope,charge
0,1,0,808.117931,60.168864,15.454966,C00024,C23H37N7O17P3S1,M-H,mono,-1
1,2,0,1617.244072,59.832483,16.612324,C00024,C46H75N14O34P6S2,2M-H,mono,-1
2,3,0,1618.246019,61.852075,8.265022,C00024,C45[13]C1H75N14O34P6S2,2M-H,iso,-1
3,4,0,403.554946,60.069536,5.311077,C00024,C23H36N7O17P3S1,M-2H,mono,-2
4,5,0,2426.368968,58.812628,13.08505,C00024,C69H113N21O51P9S3,3M-H,mono,-1


Repeating the same steps done for the positive dataset, we can run the whole IPA pipeline with the simpleIPA() function considering only the MS1 data.

In [16]:
df = df_all.copy()
df=df.drop(['Compound ID', 'Formula','Adduct','isotope','charge'], axis=1)
annotations_MS1 = ipa.simpleIPA(df=df,ionisation=-1,DB=DB,adductsAll=adducts,ppm=5,ppmthr=30,Bio=Bio,
                            delta_add=.1,delta_bio=1,burn=1000,noits=5000,ncores=5)

mapping isotope patterns ....
0.1 seconds elapsed
computing all adducts - Parallelized ....
25.5 seconds elapsed
annotating based on MS1 information - Parallelized ...
1.7 seconds elapsed
computing posterior probabilities including biochemical and adducts connections
initialising sampler ...


Gibbs Sampler Progress Bar: 100%|███████████| 5000/5000 [01:20<00:00, 62.00it/s]


parsing results ...
Done -  80.8 seconds elapsed


To highlight how the integration of MS2 data improves the annotation accuracy in the new implementation of the IPA method, fragmentation spectra were simulated for four of the MS1 features.

In [17]:
dfMS2 = pd.read_csv('ExampleDatasets/Synthetic/dfMS2_neg.csv')
dfMS2

Unnamed: 0,id,spectrum,ev
0,53,15.0234751:1.47715889194318 17.00273965:5.4950...,35
1,57,49.5038454774:1.26593544822571 66.9429024669:4...,35
2,61,49.5039295661:1.62058301590392 78.957781377:82...,35
3,89,41.03912516:25.3037735965324 43.05477522:3.467...,35


We can now re-run the whole annotation pipeline including the MS2 data using the simpleIPA() function as shown below:

In [None]:
annotations_MS2 = ipa.simpleIPA(df=df,dfMS2=dfMS2,DBMS2=DBMS2,ionisation=-1,DB=DB,adductsAll=adducts,ppm=5,
                                ppmthr=30,Bio=Bio,delta_add=.1,delta_bio=1,burn=1000,noits=5000,CSunk=0.1,ncores=5)

mapping isotope patterns ....
0.1 seconds elapsed
computing all adducts - Parallelized ....
25.2 seconds elapsed
annotating based on MS1 and MS2 information - Parallelized...
1.8 seconds elapsed
computing posterior probabilities including biochemical and adducts connections
initialising sampler ...


Gibbs Sampler Progress Bar:  52%|█████▋     | 2578/5000 [00:41<00:39, 62.07it/s]

As was done for the positive dataset, the negative dataset was also annotated with the original R package ([https://github.com/francescodc87/IPA](https://github.com/francescodc87/IPA)) using the exact same database. The R script used to annotate the example dataset can be found [here](https://github.com/francescodc87/ipaPy2/blob/main/ExampleDatasets/Synthetic/running_old_IPA.R). The posterior probabilities calculated by the original implementation for the correct annotations are summarized in the res_IPAV1_NEG.csv file.
With the few lines below, the posterior probabilities calculated by the new implementation of the IPA method are added to the same table.

In [None]:
result_summary = pd.read_csv('ExampleDatasets/Synthetic/res_IPAV1_NEG.csv')

postIPA2_MS1 = []
postIPA2_MS2 = []
for idx in result_summary['ids']:
    correct=list(result_summary['Compound.ID'][result_summary['ids']==idx])[0]
    tmp = annotations_MS1[idx]
    postIPA2_MS1.append(list(tmp['post Gibbs'][tmp['id']==correct])[0])
    tmp = annotations_MS2[idx]
    postIPA2_MS2.append(list(tmp['post Gibbs'][tmp['id']==correct])[0])
    
result_summary ['post IPA2 - MS1'] = postIPA2_MS1
result_summary ['post IPA2 - MS2'] = postIPA2_MS2

The table below shows the probabilities assigned to the correct annotation by both the old (post IPA1) and new (post IPA2) implementations of the IPA method. "post IPA2 - MS1" refers to the posterior probabilities obtained when considering only the MS1 data, while "post IPA2 - MS2" refers to the posterior probabilities obtained when the fragmentation spectra are also included in the annotation.

In [None]:
result_summary

The annotation accuracy for the four features with an associated MS2 spectrum (id=53, id=57, id=61 and id=89) increases when the MS2 data are considered in the annotation process (i.e., higher posterior probabilities in the table above). In this case too, the increase in accuracy is not limited to the features for which a fragmentation spectrum was simulated, but 'spreads' through adduct and biochemical connections.

To evaluate the overall performance of the method, the LPS scores were also computed for the negative dataset, as shown below.

In [None]:
np.sum(np.log(result_summary['post IPA1']))

In [None]:
np.sum(np.log(result_summary ['post IPA2 - MS1']))

In [None]:
np.sum(np.log(result_summary ['post IPA2 - MS2']))