# Q-TOF example

To test the new implementation of the IPA method on a real-life dataset that includes fragmentation data, we used the dataset introduced by [Ten-Doménech et al. (2020)](https://www.mdpi.com/2218-1989/10/4/126/htm). That study compared different strategies for automated MS2 data-dependent acquisition (DDA). Here we consider only the untargeted DDA strategy, as it is the most commonly used in untargeted metabolomics experiments. The experiment consisted of the analysis of human milk samples with an Agilent 6550 Spectrometer iFunnel quadrupole time-of-flight (QTOF) MS.

Peak detection, integration, deconvolution, alignment and pseudospectra identification were performed with [XCMS](http://www.bioconductor.org/packages/release/bioc/html/xcms.html) and [CAMERA](https://bioconductor.org/packages/release/bioc/html/CAMERA.html) in R 3.6.1.
The resulting peak tables and MS2 data were made available through the [Mendeley Data repository](https://data.mendeley.com/) under [DOI:10.17632/fnzbxmkv83.1](https://doi.org/10.17632/fnzbxmkv83.1).

The data were reorganised into .csv files to be compatible with the ipaPy2 package, and they have been made available within this library.
The MS1 data can be found here:

In [1]:
import pandas as pd
import numpy as np
dfMS1 = pd.read_csv('ExampleDatasets/QTOF_DDA/dfMS1.csv')
dfMS1=dfMS1.replace('None',None)
dfMS1.head()

Unnamed: 0,ids,rel.ids,mzs,RTs,Ints,relationship,isotope.pattern,charge
0,15646,763,164.920465,4.248273,19065.755811,bp,,
1,15179,4435,164.920489,12.941176,20773.512435,bp,,
2,10366,421,609.339623,3.241514,48622.08767,potential bp,0.0,1.0
3,15580,421,610.343058,3.250793,19287.414987,potential bp|isotope,0.0,1.0
4,15489,285,1140.835809,15.545977,19641.19867,,,


And the MS2 data can be found here:

In [2]:
dfMS2 = pd.read_csv('ExampleDatasets/QTOF_DDA/dfMS2.csv')
dfMS2

Unnamed: 0,id,spectrum,ev
0,2586,71.085:683.6281 73.0278:271.797 75.0992:492.37...,20
1,2586,71.0851:1917.348 73.0277:137.5172 73.0645:579....,20
2,4385,71.0863:157.7903 80.9472:268.5981 82.9449:282....,20
3,4380,71.0497:155.6042 71.0861:2613.574 72.091:152.0...,20
4,4385,71.081:206.8802 73.0653:142.1885 80.9478:978.4...,20
...,...,...,...
1184,8491,71.0851:374.3737 80.9475:902.768 82.9442:163.1...,20
1185,3515,70.0641:130.5069 71.0134:331.6173 71.0837:202....,20
1186,3515,70.0289:140.5467 71.0131:142.3553 71.0607:251....,20
1187,8491,73.0528:151.0102 73.0636:156.0888 80.9484:387....,20
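Each entry in the `spectrum` column appears to be a space-separated list of `m/z:intensity` pairs. The sketch below shows one way such a string could be unpacked for inspection. The helper name `parse_spectrum` is hypothetical and is not part of ipaPy2, and the example string uses only the first two peaks of the first row, since the remainder is truncated in the display above.

```python
import numpy as np

def parse_spectrum(spectrum):
    """Turn 'mz:intensity mz:intensity ...' into an (n, 2) float array.

    Hypothetical helper for illustration only; not an ipaPy2 function.
    """
    pairs = [peak.split(':') for peak in spectrum.split()]
    return np.array([(float(mz), float(intensity)) for mz, intensity in pairs])

# First two peaks of the first spectrum shown above
example = "71.085:683.6281 73.0278:271.797"
peaks = parse_spectrum(example)

# m/z of the most intense fragment in the parsed spectrum
base_peak_mz = peaks[np.argmax(peaks[:, 1]), 0]
print(peaks.shape, base_peak_mz)
```

Applied to the real data, the same function could be mapped over `dfMS2['spectrum']` to check, for instance, how many fragments each spectrum contains.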


To run the method on this dataset, it is necessary to load the package, the MS1 database, the MS2 database and the adducts information.
Due to its large size, the MS2 database is not distributed with the library and must be downloaded beforehand from [here](https://drive.google.com/file/d/15qduvtE8aSAAUCf1FE4ojcVLaTw-B2W6/view?usp=sharing).
This MS2 database is based on data acquired on a Q Exactive instrument, so a database acquired on an instrument closer to the one used here would be preferable for this dataset.

In [3]:
from ipaPy2 import ipa
DBMS1=pd.read_csv('DB/IPA_MS1.csv')
DBMS2=pd.read_csv('DB/IPA_MS2_Qex.csv')  # make sure that this database was downloaded and placed in the correct folder
adducts = pd.read_csv('DB/adducts.csv')
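Because the MS2 database has to be downloaded separately, it can be convenient to confirm that all input files are in place before loading them. The following is a minimal sketch using only the standard library; the helper name `missing_files` is hypothetical, and the paths are the ones used in the cell above.

```python
from pathlib import Path

def missing_files(paths):
    """Return the subset of paths that do not point to an existing file."""
    return [p for p in paths if not Path(p).is_file()]

# Paths taken from the loading cell above
required = ['DB/IPA_MS1.csv', 'DB/IPA_MS2_Qex.csv', 'DB/adducts.csv']
missing = missing_files(required)

if missing:
    print('Missing input files:', ', '.join(missing))
else:
    print('All input files found')
```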

As mentioned above, this study is based on the analysis of human milk samples. It is therefore sensible to increase the 'prior knowledge' (pk) score associated with metabolites previously detected in human milk.
For example, we know that human milk is rich in sugars, with lactose being the major component ([Newburg, 2013](https://link.springer.com/article/10.1134/S0006297913070092)).

In [4]:
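# Assign a default prior-knowledge score to every database entry
DBMS1['pk'] = 0.9
# Raise the score for lactose (id C00243), the major sugar in human milk
DBMS1.loc[DBMS1['id']=='C00243','pk'] = 1
DBMS1[DBMS1['id']=='C00243']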

Unnamed: 0,id,name,formula,inchi,smiles,RT,adductsPos,adductsNeg,description,pk,MS2,reactions
171,C00243,Lactose,C12H22O11,InChI=1S/C12H22O11/c13-1-3-5(15)6(16)9(19)12(2...,,,M+H;M+Na;M+2H;2M+H,M-H;2M-H;M-2H;3M-H,,1.0,,R00503 R01100 R01678 R01680 R04393 R05166
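When several metabolites are known to occur in the sample matrix, the same adjustment can be applied to a list of database ids rather than to one row at a time. The sketch below illustrates this on a toy table. The function `set_prior_knowledge` is hypothetical (it is not part of ipaPy2), and the toy table only mimics the `id`, `name` and `pk` columns of the MS1 database.

```python
import pandas as pd

def set_prior_knowledge(db, ids, score, default=None):
    """Return a copy of db with 'pk' set to score for the given ids.

    If default is given, every other entry receives that value first.
    Hypothetical helper for illustration only.
    """
    out = db.copy()
    if default is not None:
        out['pk'] = default
    out.loc[out['id'].isin(ids), 'pk'] = score
    return out

# Toy stand-in for the MS1 database (ids and names as shown in this notebook)
toy_db = pd.DataFrame({
    'id': ['C00243', 'C00208', 'C03690'],
    'name': ['Lactose', 'Maltose', 'Bis(2-ethylhexyl)phthalate'],
})
toy_db = set_prior_knowledge(toy_db, ['C00243'], score=1.0, default=0.9)
print(toy_db)
```

Selecting rows by `id` rather than by position keeps the adjustment valid if the database is reordered or extended.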


A dataframe containing all possible biochemical connections among the metabolites present in the IPA_MS1.csv database has been pre-computed and is available in the library. Using it instead of computing the connections from scratch considerably speeds up the pipeline.

In [5]:
Bio = pd.read_csv('DB/allBIO_reactions.csv')

Finally, we can run the whole pipeline with the simpleIPA() function.

In [6]:
annotations = ipa.simpleIPA(df=dfMS1,dfMS2=dfMS2,ionisation=1,DB=DBMS1,DBMS2=DBMS2,adductsAll=adducts,ppm=5,
                            delta_add=0.1,delta_bio=0.5,Bio=Bio,burn=1000,noits=5000,
                            CSunk=0.7,ncores=70)

isotopes already mapped
computing all adducts - Parallelized ....
35.0 seconds elapsed
annotating based on MS1 and MS2 information - Parallelized...
184.2 seconds elapsed
computing posterior probabilities including biochemical and adducts connections
initialising sampler ...


Gibbs Sampler Progress Bar: 100%|██████████| 5000/5000 [20:19:34<00:00, 14.63s/it]   


parsing results ...
Done -  73220.9 seconds elapsed


As an example, the annotations of a few peaks are shown below.


As mentioned before, human milk is rich in different sugars, and the mass feature shown below (id=98) is very likely to be a sugar. 

In [7]:
annotations[98]

Unnamed: 0,id,name,formula,adduct,m/z,charge,RT range,ppm,isotope pattern score,fragmentation pattern score,prior,post,post Gibbs,chi-square pval
1,C05402,Melibiose,C12H22NaO11,M+Na,365.10543,1.0,,0.6713,0.028537,0.994367,0.028457,0.039,0.0515,8.709218e-09
2,C01742,Palatinose,C12H22NaO11,M+Na,365.10543,1.0,,0.6713,0.028537,0.99417,0.028457,0.038993,0.045,8.709218e-09
0,C08240,Gentiobiose,C12H22NaO11,M+Na,365.10543,1.0,,0.6713,0.028537,0.994705,0.028457,0.039014,0.0385,8.709218e-09
3,C00243,Lactose,C12H22NaO11,M+Na,365.10543,1.0,,0.6713,0.028537,0.7,0.028457,0.030505,0.035,8.709218e-09
27,C01970,beta-Lactose,C12H22NaO11,M+Na,365.10543,1.0,,0.6713,0.028537,0.7,0.028457,0.027455,0.032,8.709218e-09
4,C00208,Maltose,C12H22NaO11,M+Na,365.10543,1.0,,0.6713,0.028537,0.729812,0.028457,0.028624,0.032,8.709218e-09
16,C19773,3-O-alpha-D-Mannopyranosyl-alpha-D-mannopyranose,C12H22NaO11,M+Na,365.10543,1.0,,0.6713,0.028537,0.7,0.028457,0.027455,0.031,8.709218e-09
5,C01083,"alpha,alpha-Trehalose",C12H22NaO11,M+Na,365.10543,1.0,,0.6713,0.028537,0.700893,0.028457,0.02749,0.03025,8.709218e-09
34,C00252,Isomaltose,C12H22NaO11,M+Na,365.10543,1.0,,0.6713,0.028537,0.7,0.028457,0.027455,0.03025,8.709218e-09
9,C08250,Sophorose,C12H22NaO11,M+Na,365.10543,1.0,,0.6713,0.028537,0.7,0.028457,0.027455,0.0295,8.709218e-09


Human milk has been reported to be a source of sphingolipids such as sphingomyelin ([De Cas et al., 2020](https://translational-medicine.biomedcentral.com/articles/10.1186/s12967-020-02641-0)).
The two mass features shown below are very likely to be sphingolipids:

In [8]:
annotations[837]

Unnamed: 0,id,name,formula,adduct,m/z,charge,RT range,ppm,isotope pattern score,fragmentation pattern score,prior,post,post Gibbs,chi-square pval
0,EMBL-MCF_spec32425x_1,SPHINGOMYELIN,C41H84N2O6P,M+H,731.606151,1.0,,0.663757,,0.860111,0.877047,0.887492,0.815,1.058444e-47
1,Unknown,Unknown,,,,,,5.0,,0.7,0.122953,0.112508,0.185,1.058444e-47


In [9]:
annotations[1983]

Unnamed: 0,id,name,formula,adduct,m/z,charge,RT range,ppm,isotope pattern score,fragmentation pattern score,prior,post,post Gibbs,chi-square pval
0,C12144,Phytosphingosine,C18H40NO3,M+H,318.30027,1.0,,-0.831674,,0.842381,0.874864,0.883341,0.98075,4.3714790000000004e-82
1,Unknown,Unknown,,,,,,5.0,,0.7,0.125136,0.116659,0.01925,4.3714790000000004e-82


Alarmingly, the presence of phthalates has also been reported in human breast milk ([Fromme et al., 2011](https://pubmed.ncbi.nlm.nih.gov/21406311/)).
The most likely annotation for the mass feature shown below (id=255) is Bis(2-ethylhexyl)phthalate (DEHP).

In [10]:
annotations[255]

Unnamed: 0,id,name,formula,adduct,m/z,charge,RT range,ppm,isotope pattern score,fragmentation pattern score,prior,post,post Gibbs,chi-square pval
0,C03690,Bis(2-ethylhexyl)phthalate,C24H39O4,M+H,391.284286,1.0,,0.837601,0.090775,0.99118,0.089423,0.1240212,0.14825,0.0
4,C15375,Apocholic acid,C24H39O4,M+H,391.284286,1.0,,0.837601,0.090775,0.7,0.089423,0.08758737,0.09675,0.0
9,NPA027102,2-hydroxy-6-(12-oxoheptadecyl)benzoic acid,C24H39O4,M+H,391.284286,1.0,,0.837601,0.090775,0.7,0.089423,0.08758737,0.095,0.0
8,NPA018676,Antroquinonol,C24H39O4,M+H,391.284286,1.0,,0.837601,0.090775,0.7,0.089423,0.08758737,0.08875,0.0
10,NPA032143,Phomopene B,C24H39O4,M+H,391.284286,1.0,,0.837601,0.090775,0.7,0.089423,0.08758737,0.0885,0.0
6,NPA011157,Delta-8'-Merulinic acid A,C24H39O4,M+H,391.284286,1.0,,0.837601,0.090775,0.7,0.089423,0.08758737,0.0845,0.0
7,NPA016979,12α-Hydroxy-3-ketocholanic acid,C24H39O4,M+H,391.284286,1.0,,0.837601,0.090775,0.7,0.089423,0.08758737,0.0845,0.0
2,C14227,Di-n-octyl phthalate,C24H39O4,M+H,391.284286,1.0,,0.837601,0.090775,0.7,0.089423,0.08758737,0.0805,0.0
1,C11637,"3alpha,12alpha-Dihydroxy-5beta-chol-6-enoate",C24H39O4,M+H,391.284286,1.0,,0.837601,0.090775,0.7,0.089423,0.08758737,0.07925,0.0
3,C14577,Diisooctyl phthalate,C24H39O4,M+H,391.284286,1.0,,0.837601,0.090775,0.7,0.089423,0.08758737,0.0785,0.0
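Inspecting features one by one becomes tedious for a full dataset, so it may help to collect the highest-ranked candidate of each feature into a single table. The sketch below assumes that `annotations` behaves like a dictionary mapping feature ids to dataframes with `id`, `name` and `post Gibbs` columns, as the outputs above suggest. The function `top_annotations` is hypothetical, and the toy input reuses the values shown above for features 837 and 1983.

```python
import pandas as pd

def top_annotations(annotations, score_col='post Gibbs'):
    """Collect the highest-scoring candidate of each feature in one table.

    Hypothetical helper; assumes a dict of dataframes keyed by feature id.
    """
    rows = []
    for feature_id, table in annotations.items():
        best = table.sort_values(score_col, ascending=False).iloc[0]
        rows.append({'feature': feature_id,
                     'id': best['id'],
                     'name': best['name'],
                     score_col: best[score_col]})
    return pd.DataFrame(rows)

# Toy input built from the two sphingolipid examples shown above
toy = {
    837: pd.DataFrame({'id': ['EMBL-MCF_spec32425x_1', 'Unknown'],
                       'name': ['SPHINGOMYELIN', 'Unknown'],
                       'post Gibbs': [0.815, 0.185]}),
    1983: pd.DataFrame({'id': ['C12144', 'Unknown'],
                        'name': ['Phytosphingosine', 'Unknown'],
                        'post Gibbs': [0.98075, 0.01925]}),
}
summary = top_annotations(toy)
print(summary)
```

A summary of this kind can then be filtered, for example to keep only features whose best candidate exceeds a chosen probability threshold.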


The whole annotation file can be saved.

In [11]:
import pickle
# The context manager closes the file even if pickling fails
with open("ExampleDatasets/QTOF_DDA/annotations.pickle", "wb") as file:
    pickle.dump(annotations, file)
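Given that the run above took more than 20 hours, reloading the saved results is far cheaper than recomputing them. The sketch below shows a save-and-reload round trip with the standard `pickle` module. It uses a small toy dictionary and a temporary directory so that it does not depend on the real annotation file; the variable names are illustrative.

```python
import os
import pickle
import tempfile

import pandas as pd

# Toy stand-in for the annotations object (values from feature 1983 above)
toy_annotations = {
    1983: pd.DataFrame({'name': ['Phytosphingosine', 'Unknown'],
                        'post Gibbs': [0.98075, 0.01925]}),
}

# Write to a temporary location rather than the real results file
path = os.path.join(tempfile.mkdtemp(), 'annotations.pickle')
with open(path, 'wb') as handle:
    pickle.dump(toy_annotations, handle)

# Read the object back in a later session
with open(path, 'rb') as handle:
    restored = pickle.load(handle)

print(restored[1983])
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.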