# Beer dataset

The paper that first presented the IPA method ([Del Carratore et al., 2019](https://pubs.acs.org/doi/full/10.1021/acs.analchem.9b02354)) introduced an LC/MS-based untargeted metabolomics experiment on 21 different beers (7 India pale ales, 7 lagers, and 7 porters).
The new version of the IPA method was applied to the datasets (positive and negative) obtained from this experiment.

## Positive dataset
The positive dataset can be found within this library:

In [1]:
import pandas as pd
dfpos = pd.read_csv('ExampleDatasets/Beer/Beer_pos.csv')
dfpos[dfpos['rel.ids']==9]

Unnamed: 0,ids,rel.ids,mzs,RTs,Int
56,10,9,182.080964,67.367644,752484700.0
57,174,9,183.084172,67.386181,76511290.0
58,248,9,165.054445,67.381211,56638180.0
59,303,9,136.075532,67.299333,46015640.0
60,385,9,311.123492,66.831756,33858540.0
61,959,9,123.043945,67.4053,14597330.0
62,1859,9,119.049026,67.433861,6920786.0
63,1996,9,147.043864,67.365218,6399960.0
64,2204,9,473.176365,66.906002,5715022.0
65,2225,9,166.057784,67.444651,5636496.0


To run the method on this dataset, it is necessary to load the package, the MS$^1$ database, and the adduct information.

In [2]:
from ipaPy2 import ipa
DB=pd.read_csv('DB/IPA_MS1.csv')
adducts = pd.read_csv('DB/adducts.csv')

As described in the original paper, a set of standard mixes was analysed with the same analytical settings. Everything learned from these samples was recorded in the .csv file shown below.

In [3]:
updates = pd.read_csv('ExampleDatasets/Beer/update_based_on_standards.csv')
updates.head()

Unnamed: 0,KEGG.id,Name,Names,Formula,monoisotopic.mass,previous.knowledge,Ref,RT,POS.adducts,main.POS.adducts,NEG.adducts,main.NEG.adducts
0,C00025,L-Glutamate,L-Glutamate;L-Glutamic acid;L-Glutaminic acid;...,C5H9N1O4,147.05316,1,Information taken by the single standard injec...,30;60,M+H;M-H2O+H;2M+H;M-NH3+H;M+Na;M+2H,M+H,M-H;M-H2O-H;M-;2M-H;M+K-2H;M-2H;3M-H,M-H
1,C00031,D-Glucose,D-Glucose;Grape sugar;Dextrose;Glucose;D-Gluco...,C6H12O6,180.06339,1,This compound has been analyses in two standar...,30;60,M+Na;M+H+Na;M+;2M+H;M+2H;M+H,M+Na,M+CH2O2-H;M-H;M+Cl;2M-H;3M-H;M-2H,M+CH2O2-H
2,C00041,L-Alanine,L-Alanine;L-2-Aminopropionic acid;L-alpha-Alanine,C3H7N1O2,89.04768,1,The standard mix containing has been analyzed ...,20;60,M+H;2M+H;M+Na;M+2Na-H;M+2H,M+H,M-H;M-;2M-H;M-2H;3M-H,M-H
3,C00042,Succinate,Succinate;Succinic acid;Butanedionic acid;Ethy...,C4H6O4,118.02661,1,Information taken by a standard mix containing...,60;100,M-H2O+H;M+H;M+Na;M-NH3+H;M+CH2+H;2M+Na;M+2H;2M+H,M-H2O+H,M-H;M-;M-H2O-H;2M-H;M-2H;3M-H,M-H
4,C00062,L-Arginine,L-Arginine;(S)-2-Amino-5-guanidinovaleric acid...,C6H14N4O2,174.11168,1,The standard mix containing has been analyzed ...,20;60,M+H;2M+H;M+Na;M+2H;2M+Na;M+H+K;M+2Na-H,M+H,M-H;2M-H;M-;M-2H;3M-H,M-H
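
The RT and adduct columns in the table above pack several values into one semicolon-separated string. The sketch below shows one way to unpack them for inspection. It assumes that the RT field is a `min;max` range in seconds, and the helper names `parse_rt_range` and `parse_adducts` are hypothetical (they are not part of ipaPy2).

```python
def parse_rt_range(rt):
    """Split a 'min;max' string into a (min, max) tuple of floats."""
    low, high = (float(x) for x in rt.split(";"))
    return low, high


def parse_adducts(adducts):
    """Split a semicolon-separated adduct string into a list of adduct names."""
    return [a for a in adducts.split(";") if a]


# Values copied from the L-Glutamate row shown above.
rt_low, rt_high = parse_rt_range("30;60")
pos_adducts = parse_adducts("M+H;M-H2O+H;2M+H;M-NH3+H;M+Na;M+2H")
print(rt_low, rt_high, pos_adducts)
```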


This information can be used in the annotation process by updating the database. The simple for loop shown below copies the information about retention time and adducts (in positive and negative mode) into the database, so that the ipaPy2 library can use it during the annotation process.

In [4]:
for k in range(0,len(updates.index)):
    DB.iloc[DB['id']==updates.iloc[k,0],5] = updates.iloc[k,7] # copying information about RT to the DB
    DB.iloc[DB['id']==updates.iloc[k,0],6] = updates.iloc[k,8] # copying information about positive adducts to the DB
    DB.iloc[DB['id']==updates.iloc[k,0],7] = updates.iloc[k,10] # copying information about negative adducts to the DB
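
The loop above addresses columns by position, which is easy to get wrong if the column order changes. As a rough alternative, the sketch below performs the same kind of update by column name on toy data. The database column names `RT`, `adductsPos`, and `adductsNeg` are assumptions made for illustration and should be checked against the actual header of `DB/IPA_MS1.csv`.

```python
import pandas as pd

# Toy stand-ins for the database and the standards table.
# The DB column names used here are hypothetical.
DB_toy = pd.DataFrame({
    "id": ["C00025", "C00031", "C00099"],
    "RT": [None, None, None],
    "adductsPos": ["M+H", "M+H", "M+H"],
    "adductsNeg": ["M-H", "M-H", "M-H"],
})
updates_toy = pd.DataFrame({
    "KEGG.id": ["C00025", "C00031"],
    "RT": ["30;60", "30;60"],
    "POS.adducts": ["M+H;M+Na", "M+Na;M+H"],
    "NEG.adducts": ["M-H;2M-H", "M-H;M+Cl"],
})

# Source column in the standards table -> target column in the database.
column_map = {"RT": "RT", "POS.adducts": "adductsPos", "NEG.adducts": "adductsNeg"}

indexed = updates_toy.set_index("KEGG.id")
mask = DB_toy["id"].isin(indexed.index)  # rows with a matching standard
for src, dst in column_map.items():
    # Look up the new value for each matching database id.
    DB_toy.loc[mask, dst] = DB_toy.loc[mask, "id"].map(indexed[src])

print(DB_toy)
```

Rows without a matching standard (here `C00099`) keep their original values.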

A dataframe containing all possible biochemical connections among the metabolites present in the IPA_MS1.csv database has been pre-computed and is available in the library. Loading it instead of computing the connections considerably speeds up the pipeline.

In [5]:
Bio = pd.read_csv('DB/allBIO_reactions.csv')

Finally, we can run the whole pipeline with the simpleIPA() function.

WARNING! Running the whole pipeline, including the Gibbs sampler, on such a large dataset and database takes several hours (about 8.5 hours for the run shown below, with ncores=20).

In [6]:
annotationsPos = ipa.simpleIPA(df=dfpos,ionisation=1,DB=DB,adductsAll=adducts,ppm=5, Bio=Bio,
                            delta_add=0.1,delta_bio=1,noits=3000,ncores=20)

mapping isotope patterns ....
2.7 seconds elapsed
computing all adducts - Parallelized ....
19.0 seconds elapsed
annotating based on MS1 information - Parallelized ...
12.3 seconds elapsed
computing posterior probabilities including biochemical and adducts connections
initialising sampler ...


Gibbs Sampler Progress Bar: 100%|█████████| 3000/3000 [8:34:17<00:00, 10.29s/it]


parsing results ...
Done -  30862.7 seconds elapsed


The annotation of the feature used as an example in the original paper (id=10, m/z=182.080964, RT=67.37 s), shown there in [Figure 5](https://pubs.acs.org/cms/10.1021/acs.analchem.9b02354/asset/images/large/ac9b02354_0005.jpeg), is reported below.

In [7]:
annotationsPos[10]

Unnamed: 0,id,name,formula,adduct,m/z,charge,RT range,ppm,isotope pattern score,fragmentation pattern score,prior,post,post Gibbs,chi-square pval
1,C00082,L-Tyrosine,C9H12NO3,M+H,182.081169,1.0,,-1.126904,0.09999922,,0.097486,0.09756086,0.753704,0.0
7,C20807,3-Hydroxy-L-phenylalanine,C9H12NO3,M+H,182.081169,1.0,,-1.126904,0.09999922,,0.097486,0.09756086,0.045556,0.0
8,C21308,(S)-beta-Tyrosine,C9H12NO3,M+H,182.081169,1.0,,-1.126904,0.09999922,,0.097486,0.09756086,0.041481,0.0
3,C03290,L-threo-3-Phenylserine,C9H12NO3,M+H,182.081169,1.0,,-1.126904,0.09999922,,0.097486,0.09756086,0.04,0.0
4,C04368,3-Amino-3-(4-hydroxyphenyl)propanoate,C9H12NO3,M+H,182.081169,1.0,,-1.126904,0.09999922,,0.097486,0.09756086,0.039259,0.0
0,C06420,D-Tyrosine,C9H12NO3,M+H,182.081169,1.0,50;90,-1.126904,0.09999922,,0.097486,0.1219511,0.023704,0.0
5,C19579,gamma-Hydroxy-3-pyridinebutanoate,C9H12NO3,M+H,182.081169,1.0,,-1.126904,0.09999922,,0.097486,0.09756086,0.016296,0.0
6,C19712,N-Hydroxy-L-phenylalanine,C9H12NO3,M+H,182.081169,1.0,,-1.126904,0.09999922,,0.097486,0.09756086,0.016296,0.0
9,NPA027085,2-((2-hydroxyethyl)amino)benzoic acid,C9H12NO3,M+H,182.081169,1.0,,-1.126904,0.09999922,,0.097486,0.09756086,0.014815,0.0
2,C01536,Tyrosine,C9H12NO3,M+H,182.081169,1.0,,-1.126904,0.09999922,,0.097486,0.09756086,0.008889,0.0
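
To make the ppm column in the table above more concrete, the sketch below applies the usual definition of mass error in parts per million to the measured m/z of feature 10 and the theoretical m/z listed for the M+H adduct. It is assumed, not confirmed by the source, that ipaPy2 uses this same definition; the helper name `ppm_error` is hypothetical. The result is close to the tabulated value, and the small remaining difference is plausibly due to rounding of the displayed m/z values.

```python
def ppm_error(observed_mz, theoretical_mz):
    """Mass error in parts per million, relative to the theoretical m/z."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


observed = 182.080964      # measured m/z of feature id=10
theoretical = 182.081169   # m/z column of the annotation table

error = ppm_error(observed, theoretical)
within_tolerance = abs(error) <= 5  # ppm=5 was passed to simpleIPA for this dataset

print(round(error, 2), within_tolerance)
```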


The whole annotation dictionary for this dataset can be saved as a .pickle file.

In [8]:
import pickle
with open("ExampleDatasets/Beer/annotationsPos.pickle", "wb") as file:
    pickle.dump(annotationsPos, file)
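
Since the Gibbs sampler run is expensive, it is usually more convenient to reload the saved dictionary than to recompute it. The sketch below shows a pickle round trip on a small toy dictionary written to a temporary directory; the toy content and the file name `annotations_toy.pickle` are placeholders, not the real annotations.

```python
import os
import pickle
import tempfile

# Toy stand-in for the annotation dictionary (feature id -> annotation data).
toy_annotations = {10: {"name": "L-Tyrosine", "post Gibbs": 0.753704}}

path = os.path.join(tempfile.mkdtemp(), "annotations_toy.pickle")

# Save: the context manager closes the file even if dumping fails.
with open(path, "wb") as handle:
    pickle.dump(toy_annotations, handle)

# Reload in a later session instead of rerunning the pipeline.
with open(path, "rb") as handle:
    restored = pickle.load(handle)

print(restored == toy_annotations)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.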

## Negative dataset
The negative dataset can also be found within this library:

In [9]:
Bio = pd.read_csv('DB/allBIO_reactions.csv')
dfneg = pd.read_csv('ExampleDatasets/Beer/Beer_neg.csv')
dfneg.head()

Unnamed: 0,ids,rel.ids,mzs,RTs,Int
0,1,0,96.958464,48.370133,1449591000.0
1,35,0,98.954139,48.395777,69215420.0
2,215,0,705.192297,50.280204,15274240.0
3,250,0,925.235146,49.086028,13603920.0
4,301,0,786.218066,50.939859,11542660.0


This can be annotated with the IPA method in the same way as the positive data.

WARNING! Running the whole pipeline, including the Gibbs sampler, on such a large dataset and database takes several hours (about 2 hours for the run shown below, with ncores=20).

In [10]:
annotationsNeg = ipa.simpleIPA(df=dfneg,ionisation=-1,DB=DB,adductsAll=adducts,ppm=10,Bio=Bio,
                            delta_add=0.1,delta_bio=1,noits=3000,ncores=20)

mapping isotope patterns ....
2.5 seconds elapsed
computing all adducts - Parallelized ....
54.5 seconds elapsed
annotating based on MS1 information - Parallelized ...
11.4 seconds elapsed
computing posterior probabilities including biochemical and adducts connections
initialising sampler ...


Gibbs Sampler Progress Bar: 100%|█████████| 3000/3000 [2:06:49<00:00,  2.54s/it]


parsing results ...
Done -  7612.9 seconds elapsed


The whole annotation dictionary for this dataset can be saved as a .pickle file.

In [11]:
with open("ExampleDatasets/Beer/annotationsNeg.pickle", "wb") as file:
    pickle.dump(annotationsNeg, file)
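
Once the annotation dictionaries are available, a compact overview can help when browsing the results. The sketch below collects the highest-ranked candidate for each feature into one table. It assumes that the dictionary maps feature ids to dataframes with the `id`, `name`, and `post Gibbs` columns seen in the output of `annotationsPos[10]`; the toy data, including feature 11 and its candidates, is hypothetical.

```python
import pandas as pd

# Toy stand-in for an annotation dictionary (feature id -> candidate table).
toy_annotations = {
    10: pd.DataFrame({
        "id": ["C00082", "C20807", "C06420"],
        "name": ["L-Tyrosine", "3-Hydroxy-L-phenylalanine", "D-Tyrosine"],
        "post Gibbs": [0.753704, 0.045556, 0.023704],
    }),
    11: pd.DataFrame({  # hypothetical feature with made-up candidates
        "id": ["X1", "X2"],
        "name": ["candidate A", "candidate B"],
        "post Gibbs": [0.4, 0.6],
    }),
}

rows = []
for feature_id, table in toy_annotations.items():
    # Pick the candidate with the highest posterior probability.
    best = table.loc[table["post Gibbs"].idxmax()]
    rows.append({
        "feature": feature_id,
        "id": best["id"],
        "name": best["name"],
        "post Gibbs": best["post Gibbs"],
    })

top_hits = pd.DataFrame(rows).set_index("feature")
print(top_hits)
```

A high `post Gibbs` value for the top candidate does not by itself confirm an identification, so the full candidate table is still worth inspecting for features of interest.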