# *E. coli* example


In the paper that introduced the first version of the IPA method ([Del Carratore et al., 2019](https://pubs.acs.org/doi/full/10.1021/acs.analchem.9b02354)), an LC/MS-based untargeted metabolomics experiment on an *Escherichia coli* extract was used to test the method under real-life conditions.
The new version of the IPA method was applied to the datasets (positive and negative) obtained from this experiment.

## Positive dataset
The positive dataset can be found within this library:

In [1]:
import pandas as pd
dfpos = pd.read_csv('ExampleDatasets/Ecoli/Ecoli_pos.csv')
dfpos[dfpos['rel.ids']==1]

Unnamed: 0,ids,rel.ids,mzs,RTs,Int
7,2,1,116.07058,42.66312,4527300000.0
8,54,1,117.073749,42.766938,274986000.0
9,73,1,231.133964,42.565073,235710800.0
10,221,1,70.065249,42.831656,59570450.0
11,484,1,118.074696,42.758915,22317830.0
12,2842,1,71.068563,42.741755,2621352.0


In order to run the method on this dataset, it is necessary to load the ipaPy2 package, the MS$^1$ database and the adducts information.

In [2]:
from ipaPy2 import ipa
DB=pd.read_csv('DB/IPA_MS1.csv')
adducts = pd.read_csv('DB/adducts.csv')

As described in the original paper, a set of standard mixes was analysed with the same analytical settings.
Everything learned from these samples was recorded in the .csv file shown below.

In [3]:
updates = pd.read_csv('ExampleDatasets/Beer/update_based_on_standards.csv')
updates.head()

Unnamed: 0,KEGG.id,Name,Names,Formula,monoisotopic.mass,previous.knowledge,Ref,RT,POS.adducts,main.POS.adducts,NEG.adducts,main.NEG.adducts
0,C00025,L-Glutamate,L-Glutamate;L-Glutamic acid;L-Glutaminic acid;...,C5H9N1O4,147.05316,1,Information taken by the single standard injec...,30;60,M+H;M-H2O+H;2M+H;M-NH3+H;M+Na;M+2H,M+H,M-H;M-H2O-H;M-;2M-H;M+K-2H;M-2H;3M-H,M-H
1,C00031,D-Glucose,D-Glucose;Grape sugar;Dextrose;Glucose;D-Gluco...,C6H12O6,180.06339,1,This compound has been analyses in two standar...,30;60,M+Na;M+H+Na;M+;2M+H;M+2H;M+H,M+Na,M+CH2O2-H;M-H;M+Cl;2M-H;3M-H;M-2H,M+CH2O2-H
2,C00041,L-Alanine,L-Alanine;L-2-Aminopropionic acid;L-alpha-Alanine,C3H7N1O2,89.04768,1,The standard mix containing has been analyzed ...,20;60,M+H;2M+H;M+Na;M+2Na-H;M+2H,M+H,M-H;M-;2M-H;M-2H;3M-H,M-H
3,C00042,Succinate,Succinate;Succinic acid;Butanedionic acid;Ethy...,C4H6O4,118.02661,1,Information taken by a standard mix containing...,60;100,M-H2O+H;M+H;M+Na;M-NH3+H;M+CH2+H;2M+Na;M+2H;2M+H,M-H2O+H,M-H;M-;M-H2O-H;2M-H;M-2H;3M-H,M-H
4,C00062,L-Arginine,L-Arginine;(S)-2-Amino-5-guanidinovaleric acid...,C6H14N4O2,174.11168,1,The standard mix containing has been analyzed ...,20;60,M+H;2M+H;M+Na;M+2H;2M+Na;M+H+K;M+2Na-H,M+H,M-H;2M-H;M-;M-2H;3M-H,M-H


This information can be used in the annotation process by updating the database.
With the simple for loop shown below, we copy the information about retention time and adducts (in positive and negative mode) into the database, so that the ipaPy2 library can use it.

In [4]:
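for k in range(len(updates.index)):
    # Rows of DB whose id matches the KEGG.id of the k-th standard (as a boolean array for .iloc)
    match = (DB['id'] == updates.iloc[k, 0]).to_numpy()
    DB.iloc[match, 5] = updates.iloc[k, 7]   # RT range from the standards
    DB.iloc[match, 6] = updates.iloc[k, 8]   # POS.adducts
    DB.iloc[match, 7] = updates.iloc[k, 10]  # NEG.adducts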
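
To make the effect of this loop easier to see, here is a minimal sketch on toy data. The column names of `toy_DB` other than `id` (`RT`, `adductsPos`, `adductsNeg`) are assumptions made for illustration and may not match the real database. The single row of `toy_updates` reuses the L-Glutamate values shown above. The sketch uses label-based `.loc` indexing, so it does not depend on column positions.

```python
import pandas as pd

# Toy stand-in for the MS1 database (column names other than 'id' are hypothetical).
toy_DB = pd.DataFrame({
    'id': ['C00025', 'C00031', 'C00041'],
    'RT': [None, None, None],
    'adductsPos': ['M+H', 'M+H', 'M+H'],
    'adductsNeg': ['M-H', 'M-H', 'M-H'],
})

# Toy stand-in for the standards table, using the L-Glutamate row shown above.
toy_updates = pd.DataFrame({
    'KEGG.id': ['C00025'],
    'RT': ['30;60'],
    'POS.adducts': ['M+H;M-H2O+H;2M+H;M-NH3+H;M+Na;M+2H'],
    'NEG.adducts': ['M-H;M-H2O-H;M-;2M-H;M+K-2H;M-2H;3M-H'],
})

# Copy RT range and adducts for every compound listed in the standards table.
for _, row in toy_updates.iterrows():
    match = toy_DB['id'] == row['KEGG.id']
    toy_DB.loc[match, 'RT'] = row['RT']
    toy_DB.loc[match, 'adductsPos'] = row['POS.adducts']
    toy_DB.loc[match, 'adductsNeg'] = row['NEG.adducts']

print(toy_DB)
```

Only the row for C00025 is modified; compounds that are absent from the standards table keep their original entries.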

A dataframe containing all possible biochemical connections among the metabolites present in the IPA_MS1.csv database has been pre-computed and is available in the library. Loading it instead of computing the connections from scratch considerably speeds up the pipeline.

In [5]:
Bio = pd.read_csv('DB/allBIO_reactions.csv')

Finally, we can run the whole pipeline with the simpleIPA() function.

WARNING! Running the whole pipeline, including the Gibbs sampler, on such a big dataset/database can take several hours, depending on the number of cores available (the run below took about 84 minutes with `ncores=20`).

In [6]:
annotationsPos = ipa.simpleIPA(df=dfpos,ionisation=1,DB=DB,adductsAll=adducts,ppm=5, Bio=Bio,
                            delta_add=0.1,delta_bio=1,noits=3000,ncores=20)

mapping isotope patterns ....
1.9 seconds elapsed
computing all adducts - Parallelized ....
17.7 seconds elapsed
annotating based on MS1 information - Parallelized ...
8.2 seconds elapsed
computing posterior probabilities including biochemical and adducts connections
initialising sampler ...


Gibbs Sampler Progress Bar: 100%|█████████| 3000/3000 [1:24:26<00:00,  1.69s/it]


parsing results ...
Done -  5068.8 seconds elapsed


In the original paper ([Del Carratore et al., 2019](https://pubs.acs.org/doi/full/10.1021/acs.analchem.9b02354)), the annotation for the feature associated with id=2 (m/z=116.070580, RT=42.66 s) is shown in [Figure 4](https://pubs.acs.org/cms/10.1021/acs.analchem.9b02354/asset/images/large/ac9b02354_0004.jpeg). For the sake of comparison, the annotation table obtained for the same feature with the new IPA implementation is shown below.


In [7]:
annotationsPos[2]

Unnamed: 0,id,name,formula,adduct,m/z,charge,RT range,ppm,isotope pattern score,fragmentation pattern score,prior,post,post Gibbs,chi-square pval
0,C00148,L-Proline,C5H10NO2,M+H,116.070605,1.0,20;70,-0.214542,0.199451,,0.19471,0.23801,0.938519,0.0
1,C00763,D-Proline,C5H10NO2,M+H,116.070605,1.0,,-0.214542,0.199451,,0.19471,0.190408,0.035926,0.0
3,C18170,3-Acetamidopropanal,C5H10NO2,M+H,116.070605,1.0,,-0.214542,0.199451,,0.19471,0.190408,0.012593,0.0
2,C16435,Proline,C5H10NO2,M+H,116.070605,1.0,,-0.214542,0.199451,,0.19471,0.190408,0.007407,0.0
4,NPA018555,Pleurocybellaziridin,C5H10NO2,M+H,116.070605,1.0,,-0.214542,0.199451,,0.19471,0.190408,0.005556,0.0
5,Unknown,Unknown,,,,,,5.0,0.002747,,0.026448,0.000356,0.0,0.0
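
One way to read such a table is to rank the candidates by the `post Gibbs` column. The sketch below rebuilds a few columns of the table above by hand (values copied from the output) and extracts the candidate with the highest posterior probability. On real output, the same pandas calls could presumably be applied to `annotationsPos[2]` directly.

```python
import pandas as pd

# A few columns of the annotation table above, typed in by hand for illustration.
table = pd.DataFrame({
    'id': ['C00148', 'C00763', 'C18170', 'C16435', 'NPA018555', 'Unknown'],
    'name': ['L-Proline', 'D-Proline', '3-Acetamidopropanal', 'Proline',
             'Pleurocybellaziridin', 'Unknown'],
    'post': [0.23801, 0.190408, 0.190408, 0.190408, 0.190408, 0.000356],
    'post Gibbs': [0.938519, 0.035926, 0.012593, 0.007407, 0.005556, 0.0],
})

# Candidate with the highest posterior probability after Gibbs sampling.
best = table.loc[table['post Gibbs'].idxmax()]
print(best['id'], best['name'], best['post Gibbs'])

# All candidates, ranked from most to least probable.
ranked = table.sort_values('post Gibbs', ascending=False)
print(ranked[['id', 'name', 'post Gibbs']])
```

In this table the isobaric candidates have almost identical `post` values, and it is the `post Gibbs` column that separates L-Proline from the others.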


The whole annotation dictionary for this dataset can be saved as a .pickle file.

In [8]:
import pickle
# The with-statement closes the file even if pickling fails
with open("ExampleDatasets/Ecoli/annotationsPos.pickle", "wb") as file:
    pickle.dump(annotationsPos, file)

## Negative dataset
The negative dataset can also be found within this library:

In [9]:
dfneg = pd.read_csv('ExampleDatasets/Ecoli/Ecoli_neg.csv')
dfneg.head()

Unnamed: 0,ids,rel.ids,mzs,RTs,Int
0,1,0,191.018617,56.672309,1790060000.0
1,14,0,192.021784,56.628663,124421400.0
2,30,0,111.007146,56.776775,71385700.0
3,75,0,193.02243,56.636817,25689800.0
4,177,0,405.027847,56.930517,10976920.0


This dataset can be annotated with the IPA method in the same way as the positive dataset.

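WARNING! Running the whole pipeline, including the Gibbs sampler, on such a big dataset/database can take several hours, depending on the number of cores available (the run below took about 50 minutes with `ncores=20`).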

In [10]:
annotationsNeg = ipa.simpleIPA(df=dfneg,ionisation=-1,DB=DB,adductsAll=adducts,ppm=10,Bio=Bio,
                            delta_add=0.1,delta_bio=1,noits=3000,ncores=20)

mapping isotope patterns ....
1.6 seconds elapsed
computing all adducts - Parallelized ....
63.4 seconds elapsed
annotating based on MS1 information - Parallelized ...
8.8 seconds elapsed
computing posterior probabilities including biochemical and adducts connections
initialising sampler ...


Gibbs Sampler Progress Bar: 100%|███████████| 3000/3000 [49:29<00:00,  1.01it/s]


parsing results ...
Done -  2971.4 seconds elapsed


The whole annotation dictionary for this dataset can be saved as a .pickle file.

In [11]:
# The with-statement closes the file even if pickling fails
with open("ExampleDatasets/Ecoli/annotationsNeg.pickle", "wb") as file:
    pickle.dump(annotationsNeg, file)
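
A saved annotation dictionary can later be reloaded with `pickle.load`, which avoids rerunning the Gibbs sampler. The following is a minimal round-trip sketch: the toy dictionary and the temporary path are stand-ins chosen for illustration, and in practice the path would be one of the .pickle files written above.

```python
import os
import pickle
import tempfile

import pandas as pd

# Toy stand-in for an annotation dictionary: feature id -> annotation table.
toy_annotations = {2: pd.DataFrame({'id': ['C00148'], 'post Gibbs': [0.938519]})}

# Hypothetical path in a temporary directory; replace with the real .pickle path.
path = os.path.join(tempfile.mkdtemp(), 'annotations.pickle')

# Save the dictionary.
with open(path, 'wb') as f:
    pickle.dump(toy_annotations, f)

# Reload it in a later session.
with open(path, 'rb') as f:
    reloaded = pickle.load(f)

print(reloaded[2])
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.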