In [10]:
import pandas as pd
import numpy as np

# On the road to being complete

Most of this is adapted directly from FTMSVisualization (https://github.com/wkew/FTMSVisualization) especially 0-FormulaGenerator.py and 1-FormulaAssignment.py files that are the ones relevant to what we want to achieve - make a "database" of possible formulas with their respective exact theoric mass and then assign to a set of m/z peaks a specific formula from said database.

Apart from this, the 2007 paper "Seven Golden Rules for heuristic filtering of molecular formulas obtained by accurate mass spectrometry" of Tobias Kind and Oliver Fiehn was used as a base for the rules used to make the "formula database".

### Warning!

I advise not to run this notebook since it takes so much time (like a day) and it also asks a lot for the pc, specifically, RAM memory. For both of these reasons, this shouldn't be ran lightly.

Update: Speed of the functions was considerably improved (still takes a good amount of time). RAM and space issues stil exist.

### Some of the issues encountered so far with this process - Not updated

Biggest problem -  Criteria for formula attribution

#### Commentary not COMPLETE - situation is evolving



## Part nº 1 - Building the Formula "Database"

### Setting up functions to build the formula database

First a dictionary with all the masses and abundances of different elements (taken from FTMSVisualization).

In [1]:
#atomic masses taken from Pure Appl. Chem. 2016; 88(3): 265–291, Atomic Weights of the Elements 2013, doi: 10.1515/pac-2015-0305
#Atomic Masses taken from AME2012 - Chinese Physics C 36 (2012)  1603-2014, Wang, Audi, Wapstra, Kondex, MacCormic, Xu, and Pfeiffer. doi: 10.1088/1674-1137/36/12/003
#isotoptic abundances from Pure Appl. Chem. 2016; 88(3): 293–306, Isotopic compositions of the elements 2013 (IUPAC Technical Report), doi: 10.1515/pac-2015-0503
#electron mass from NIST http://physics.nist.gov/cgi-bin/cuu/Value?meu|search_for=electron+mass
chemdict = {'H':(1.007825, 0.99984),
            'C':(12.000000, 0.98892),
            'N':(14.003074, 0.99634),
            'O':(15.994915, 0.99762),
            'Na':(22.989769, 1.0),
            'P':(30.973763,1.0),
            'S':(31.972071, 0.95041),
            'Cl':(34.968853, 0.75765),
            'F':(18.9984032, 1.0)} 

Function to calculate formulas exact masses (Update: Fixed an error - mass of P was counted twice)

In [2]:
def getmass(c,h,o,n,s,p,cl,f):
    "Get the exact mass for any formula."
    massC = chemdict['C'][0] * c
    massH = chemdict['H'][0] * h
    massO = chemdict['O'][0] * o
    massN = chemdict['N'][0] * n
    massS = chemdict['S'][0] * s
    massP = chemdict['P'][0] * p
    massCl = chemdict['Cl'][0] * cl
    massF = chemdict['F'][0] * f

    massTotal = massC + massH + massO + massN + massS + massP + massCl + massF

    return massTotal

Function to calculate the natural abundances of the formulas made of all the most common isotopes for each element.

In [3]:
def getabun(c,h,o,n,s,cl):
    "The natural abundance of the formulas made of all the most common isotopes for each element."
    abunC, abunH, abunO, abunN, abunS, abunCl = 1,1,1,1,1,1
    
    if c > 0:
        abunC = chemdict['C'][1] ** c

    if h > 0:
        abunH = chemdict['H'][1] ** h

    if o > 0:
        abunO = chemdict['O'][1] ** o

    if n > 0:
        abunN = chemdict['N'][1] ** n

    if s > 0:
        abunS = chemdict['S'][1] ** s
        
    if cl > 0:
        abunCl = chemdict['Cl'][1] ** cl

    abunTotal = abunC * abunH * abunO * abunN * abunS * abunCl
    return abunTotal

### Maximum number of elements and ranges for element proportions according to the 7 golden rules paper

In [4]:
#if m < 500:
#    maxC,maxH,maxO,maxN,maxS,maxP = 39,72,20,20,10,9
#elif m < 1000: #36 7 golden rules?
#    maxC,maxH,maxO,maxN,maxS,maxP = 78,126,27,25,14,9
#elif m < 2000:
#    maxC,maxH,maxO,maxN,maxS,maxP = 156,180,63,32,14,9

#Common range
com_range = {'H/C':(0.2,3.1),'N/C':(0,1.3),'O/C':(0,1.2),'P/C':(0,0.3),'S/C':(0,0.8)} #99.7% of all existing formulas
#Extended range
ext_range = {'H/C':(0.1,6),'N/C':(0,4),'O/C':(0,3),'P/C':(0,2),'S/C':(0,3)} #99.99% of all existing formulas

### Maximum number of elements and ranges for element proportions as per the chemical constraints put upon on the MetaboScape

Issues: Adding Cl and F element leads to exponentially greater amount of formulas and time of analysis.

#### These limits were the ones imposed in the following functions for rule nº1 and rule nº5 of the 7 golden rules.

In [5]:
#Rules applied to the file:
maxC,maxH,maxO,maxN,maxS,maxP = 39,72,20,20,5,9
maxF, maxCl = 8,5
ms_range = {'H/C':(0.2,3.1),'N/C':(0,1.3),'O/C':(0,1.2),'P/C':(0,0.3),'S/C':(0,0.8),'F/C':(0,1.5), 'Cl/C':(0,0.8)}

### Function to give every formula possible and their exact mass between a certain mass interval and according to other limitations such as the 7 golden rules paper or specifically imposed conditions (adapted from FTMSVisualization) - form_calc

Rules nº 1, 2 (partially), 4, 5 and 6 currently followed.

Rule nº 7 - TMS check not applicable.

Rule nº 3- Isotope pattern checking, I think it is not applicable.

#### form_calc function applies all these rules, rules nº2 and 6 however have specific functions.

2 different functions with the same name:

The first one is the closest to the one in FTMSVisualization (with some changes to be closer to the 7 golden rules). I should probably add the actual function in FTMS Visualization for changes to be better seen but more significant changes are with using NOPS function instead of 1.3 * number of carbons check (explained later) and using Lewis and Senior rules more in-depth instead of a simple N and H odd/even number check (explained later).

The second one is a bit different made so the speed of the function could be better - around 20% improvement in speed (significative with the time it takes, should save several minutes).

Number of formulas in both are equal so using the 2nd should be better in terms of efficiency.

In [59]:
def form_calc(low, high, elem_range = com_range):
    """Calculates all formulas possible (according to some stipulations) between a certain mass interval.
    
       low: scalar; lower limit of the molecular mass of the formula.
       high: scalar, upper limit of the molecular mass of the formula.
       elem_range: dictionary; dictionary where keys are string of the ratios of certain elements (examples: 'H/C', 'O/C') and
    their values are a tuple with the minimum and maximum ratios they can have in a certain formula (rule nº5).
    
       return: dictionary where keys are exact masses and values are tuples with the overall abundance of the monoisotopic 'mass'
    (maybe superfluous) and the number of atoms of each elements (in a specific order)."""
    
    
    #RULE Nº 1 - Following the 7 golden rule limits
    #if high <= 500:
        #maxC,maxH,maxO,maxN,maxS,maxP = 39,72,20,20,10,9
    #else: #ignoring the other limit since this already takes a lot of time 
    #elif high <= 1000: #36 7 golden rules?
        #maxC,maxH,maxO,maxN,maxS,maxP = 78,126,27,25,14,9
    #else: #elif m < 2000:
        #maxC,maxH,maxO,maxN,maxS,maxP = 156,180,63,32,14,9

    """Following the maximum elements from the limits imposed to MetaboScape."""
    maxC,maxH,maxO,maxN,maxS,maxP = 39,72,20,20,5,9
    maxF, maxCl = 8,5
    
    maxC = min((int(high) / 12), maxC + 1) #max carbon nº is the smaller of the total mass/12 or predefined maxC
    maxH = min((maxC * 4), maxH + 1) #max hydrogen nº is the smaller of 4 times the nº of carbons or the predefined max hydrogen nº
    maxO = min((int(high) / 16), maxO + 1)#max oxygen nº has to be the smaller of the total mass/16 or predefined maxO
    #Those 3 above should have a +1 next to their maxC, maxH and maxO but since I forgot on the first 
    maxN = maxN + 1
    maxS = maxS + 1
    maxP = maxP + 1
    maxF = maxF + 1
    maxCl = maxCl + 1

    allposs = {} #pd.DataFrame(columns = ['abundance', 'c','h','o','n','s','p'])
    
    #For's for each element and ifs that see if the each element count follow the imposed ratios in elem_range
    for c in range(int(maxC))[1:]: #molecules contain at least 1 C and 1 H
        for h in range(int(maxH))[1:]:
            hcrat = float(h)/float(c)
            if elem_range['H/C'][0] < hcrat < elem_range['H/C'][1]: #RULE Nº 4
                for p in range(maxP):
                    pcrat = float(p)/float(c)
                    if pcrat < elem_range['P/C'][1]: #RULE Nº 5
                        for o in range(int(maxO)):
                            ocrat = float(o)/float(c)
                            if ocrat < elem_range['O/C'][1]: #RULE Nº 5
                                for n in range(maxN):
                                    ncrat = float(n)/float(c)
                                    if ncrat < elem_range['N/C'][1]: #RULE Nº 5
                                        #if neg_nhchecker(h,n):
                                        for s in range(maxS):
                                            scrat = float(s)/float(c)
                                            if scrat < elem_range['S/C'][1]: #RULE Nº 5
                                                NOPS_ratio = NOPS(n,o,p,s)
                                                if NOPS_ratio:#RULE Nº 6 - element probability check - see function below
                                                    for f in range(maxF):
                                                        fcrat = float(f)/float(c)
                                                        if fcrat < elem_range['F/C'][1]: #RULE Nº 5
                                                            for cl in range(maxCl):
                                                                clcrat = float(cl)/float(c)
                                                                if clcrat < elem_range['Cl/C'][1]: #RULE Nº 5
                                                                    mass = getmass(c,h,o,n,s,p,cl,f) #getting the exact mass of the formula
                                                                    if low < mass < high: #If the mass is in the given range
                                                                        #formula = "C%iH%iO%iN%iS%iP%i" % (c,h,o,n,s,p)
                                                                        Valency = Lewis_Senior_rules(c,h,o,n,s,p,cl,f)
                                                                        if Valency:
                                                                            abundance = getabun(c,h,o,n,s,cl)
                                                                            allposs[mass] = (abundance, c,h,o,n,s,p, cl, f)
        #print(c)
    return allposs

In [65]:
def form_calc(low, high, elem_range = com_range):
    """Calculates all formulas possible (according to some stipulations) between a certain mass interval.
    
       low: scalar; lower limit of the molecular mass of the formula.
       high: scalar, upper limit of the molecular mass of the formula.
       elem_range: dictionary; dictionary where keys are string of the ratios of certain elements (examples: 'H/C', 'O/C') and
    their values are a tuple with the minimum and maximum ratios they can have in a certain formula (rule nº5).
    
       return: dictionary where keys are exact masses and values are tuples with the overall abundance of the monoisotopic 'mass'
    (maybe superfluous) and the number of atoms of each elements (in a specific order)."""
    
    #RULE Nº 1 - Following the 7 golden rule limits
    #if high <= 500:
        #maxC,maxH,maxO,maxN,maxS,maxP = 39,72,20,20,10,9
    #else: #ignoring the other limit since this already takes a lot of time 
    #elif high <= 1000: #36 7 golden rules?
        #maxC,maxH,maxO,maxN,maxS,maxP = 78,126,27,25,14,9
    #else: #elif m < 2000:
        #maxC,maxH,maxO,maxN,maxS,maxP = 156,180,63,32,14,9

    """Following the maximum elements from the limits imposed to MetaboScape."""
    maxC,maxH,maxO,maxN,maxS,maxP = 39,72,20,20,5,9
    maxF, maxCl = 8,5
    
    maxC = min((int(high) / 12), maxC + 1) #max carbon nº is the smaller of the total mass/12 or predefined maxC
    maxH = min((maxC * 4), maxH + 1) #max hydrogen nº is the smaller of 4 times the nº of carbons or the predefined max hydrogen nº
    maxO2 = min((int(high) / 16), maxO + 1)#max oxygen nº has to be the smaller of the total mass/16 or predefined maxO
    #Those 3 above should have a +1 next to their maxC, maxH and maxO but since I forgot on the first 
    maxN2 = maxN + 1
    maxS2 = maxS + 1
    maxP2 = maxP + 1
    maxF2 = maxF + 1
    maxCl2 = maxCl + 1

    allposs = {} #pd.DataFrame(columns = ['abundance', 'c','h','o','n','s','p'])
    
    for c in range(int(maxC))[1:]: #molecules contain at least 1 C and 1 H
        for h in range(int(maxH))[1:]:
            hcrat = float(h)/float(c)
            if elem_range['H/C'][0] < hcrat < elem_range['H/C'][1]: #RULE Nº 4
                maxP = min(maxP2, int(c * elem_range['P/C'][1]+0.99)) #RULE Nº 5
                #print('maxP', maxP)
                for p in range(maxP):
                    maxO = min(maxO2, int(c * elem_range['O/C'][1]+0.99)) #RULE Nº 5
                    for o in range(int(maxO)):
                        maxN = min(maxN2, int(c * elem_range['N/C'][1]+0.99)) #RULE Nº 5
                        for n in range(maxN):
                            maxS = min(maxS2, int(c * elem_range['S/C'][1]+0.99)) #RULE Nº 5
                            for s in range(maxS):
                                NOPS_ratio = NOPS(n,o,p,s)
                                if NOPS_ratio:#RULE Nº 6 - element probability check - see function below
                                    maxF = min(maxF2, int(c * elem_range['F/C'][1]+0.99)) #RULE Nº 5
                                    for f in range(maxF):
                                        maxCl = min(maxCl2, int(c * elem_range['Cl/C'][1]+0.99)) #RULE Nº 5
                                        for cl in range(maxCl):
                                            mass = getmass(c,h,o,n,s,p,cl,f) #getting the exact mass of the formula
                                            if low < mass < high: #If the mass is in the given range
                                                #formula = "C%iH%iO%iN%iS%iP%i" % (c,h,o,n,s,p)
                                                Valency = Lewis_Senior_rules(c,h,o,n,s,p,cl,f)
                                                if Valency:
                                                    abundance = getabun(c,h,o,n,s,cl) #If isotope pattern checking becomes a thing
                                                    allposs[mass] = (abundance, c,h,o,n,s,p, cl, f)
        #print(c)
    return allposs

### Rule nº 6 - HNOPS heuristic probability check

FTMSVisualization didn't impose this rule. Instead, they did something else where they said that the total number of heteroatoms had to be smaller than 1.3 times the number of carbons. I altered to this to follow the 7 golden rules paper more closely. The following function sees if there are at least 3 types of heteroatoms in the formula and then observes if each of their counts is below a threshold shown in the 7 golden rules paper. If it isn't, NOPS_ratio is False and the formula is ignored.

#### Possible problem:

No extra heteroatom checks for F and Cl elements. This means that they are basically treated as non-heteroatoms so far. There should probably be soem kind of check that takes into account their presence when other heteroatoms are also present but I haven't found something like that in the literature yet.

In [7]:
#Rule nº 6 - HNOPS heuristic probability check
def NOPS (n,o,p,s):
    """Checks if the element counts follow the HNOPS heuristic probablility checks as delineated by the 7 golden rules paper.
    
       n,o,p,s - integers; number of N, O, P and S atoms respectively in the considered formula.
       
       returns: bool; True if it fulfills the conditions, False if it doesn't."""
    NOPS_ratio = True
    if (n > 1) and (o > 1) and (p > 1) and (s > 1): #NOPS
        if (n < 10) and (o < 20) and (p < 4) and (s < 3):
            NOPS_ratio = True
        else:
            NOPS_ratio = False
    elif (n > 3) and (o > 3) and (p > 3): #NOP
        if (n < 11) and (o < 22) and (p < 6):
            NOPS_ratio = True
        else:
            NOPS_ratio = False
    elif (o > 1) and (p > 1) and (s > 1): #OPS
        if (o < 14) and (p < 3) and (s < 3):
            NOPS_ratio = True
        else:
            NOPS_ratio = False
    elif (n > 1) and (p > 1) and (s > 1): #PSN
        if (n < 10) and (p < 4) and (s < 3):
            NOPS_ratio = True
        else:
            NOPS_ratio = False
    elif (n > 6) and (o > 6) and (s > 6): #PSN
        if (n < 19) and (o < 14) and (s < 8):
            NOPS_ratio = True
        else:
            NOPS_ratio = False

    return NOPS_ratio

### Rule nº 2 - Lewis and Senior (chemical rules) check

FTMSVisualization didn't impose this rule strictly . Instead, they checked the sum of the H and N (Na and K too, ionic compositions) elements to see if the formula followed a strict N rule, see if the sum was odd (not allowed) or even (allowed).

I altered this to follow the 7 golden rules paper more closely which says the following:

i) The sum of valences or the total number of atoms having odd valences is even;  - (close to the N rule followed)

ii) The sum of valences is greater than or equal to twice the maximum valence;

iii) The sum of valences is greater than or equal to twice the number of atoms minus 1 - (see if we can actually make a structure with those molecules - exaggerated since we are considering all molecules can always have their maximum valence state.


While "allowing maximum valence states for each element" i.e. allowing for elements to be in higher valences that they can assume in some molecules and not only their ground states (example: N usually has a valence of 3 establishing 3 bonds, but can have a valence of 4 (molecules with NO2 groups) or 5). Maximum valences for each element is an absolute value.

Rule ii) confuses me a bit with what they mean as the "maximum valence". As I've seen in a paper ('ANALOGOUS ODD-EVEN PARITIES
IN MATHEMATICS AND CHEMISTRY', 2003, Morikawa T and Newbold BT), the objective of this rule is to indicate "the non-existence of small molecules such as CH2" (although they don't expalin what maximum valence is). Since we are dealing with molecules with, at least, 100 Da, I ignored this rule so far.

Note: Cl with the normal ground state valence of +1 instead of this maximum +7. My knowledge (with no base in literature) tells me this should be rare in biologic molecules so I put it as +1. It is worth mentioning that the % of formulas that are filtered due to this change is (in context of the massive amount of allowed formulas) small.

In [8]:
def Lewis_Senior_rules(c,h,o=0,n=0,s=0,p=0,cl=0,f=0):
    """See if the formula follows Lewis' and Senior's rules (considering all max possible valency states for each element
    except Cl).
    
       c,h,o,n,s,p,cl,f - integers; number of C, H, O, N, S, P, Cl and F atoms respectively in the considered formula.
       
       returns: bool; True if it fulfills the conditions, False if it doesn't."""
    
    #Max_Valencies
    valC = 4
    valH = 1 #ODD
    valO = 2 #Positive instead of negative for it to work?
    valN = 5 #Normally 3, ODD
    valS = 6 #Normally -2 
    valP = 5 #Normally -3, Odd
    valCl = 7 #Normally -1, Odd ????
    valF = 1 #Normally -1, Odd
    
    Valency = False
    #1st rule - The sum of valences or the total number of atoms having odd valences is even.
    if (h + n + p + cl + f) % 2 == 0:
        total_v = (valC * c) + (valH * h) + (valO * o) + (valN * n) + (valS * s) + (valP * p) + (valCl * cl) + (valF * f)
            #2nd rule - The sum of valences is greater than or equal to twice the maximum valence.
            #This one confuses me a little - maximum valences? Is it 8? Is it the element with the most valencies? But like Cl doesn't
            #normally have 7
            #if total_v > 2*8: #?
            #Ok, not applying this one since it normally only eliminates small molecules either way and we are searching for molecules
            #with more than 100 Da.

            #3rd rule - The sum of valences is greater than or equal to twice the number of atoms minus 1.
        natoms = c + h + o + n + s + p + cl + f
        if total_v >= (2*(natoms-1)):
            Valency = True

    return Valency

Making a dictionary of dataframes with each dataframe having all possible formulas (according to the previously mentioned rules and functions) of a 100 mass interval with the name of each dataframe being dict + lower limit of said mass interval.

#### Warning: Don't do range (1,12) here

In [66]:
formulas = {}
#Change range here - don't do range(1,12)!!
#Example range(1,4), gives in the formula dictionary 3 DataFrames with formulas with m/z between 100-200, 200-300 and 300-400
for i in range(1,12):
    a = form_calc(i*100, (i+1)*100, elem_range = ms_range)
    formulas['dict' + str(i*100)] = pd.DataFrame.from_dict(a, orient = 'index', columns = ['Abundance', 'C','H','O','N','S','P','F','Cl'])
    formulas['dict' + str(i*100)] = formulas['dict' + str(i*100)].sort_index()
    #Writing a .csv file of the data
    #formulas['dict' + str(i*100)].to_csv('dict' + str(i*100) + '.csv')
#a = form_calc(200, 300, elem_range = com_range) #using the common 99.7 common range described in the 7 golden rules paper.

Writing individual files if one wants to. 

Be careful: they can be very big.

In [67]:
#formulas['dict600'].to_csv('dict600-LS.csv')

In [67]:
formulas

{'dict100':             Abundance  C   H  O  N  S  P  F  Cl
 100.007291   0.957622  3   1  1  2  0  0  0   1
 100.007803   0.953372  4   5  1  0  0  1  0   0
 100.007978   0.716023  5   5  0  0  0  0  1   0
 100.009121   0.738481  2   6  1  0  0  0  1   1
 100.009519   0.911867  3   4  0  2  1  0  0   0
 ...               ... ..  .. .. .. .. .. ..  ..
 199.998653   0.675327  5   7  0  2  1  0  1   2
 199.998696   0.918957  6   5  4  2  0  1  0   0
 199.998871   0.690176  7   5  3  2  0  0  1   0
 199.999165   0.672329  6  11  0  0  1  1  1   1
 199.999943   0.806557  5  12  2  0  3  0  0   0
 
 [19070 rows x 9 columns],
 'dict200':             Abundance   C   H  O  N  S  P  F  Cl
 200.000014   0.711823   4   6  4  2  0  0  1   1
 200.000273   0.904008   9   4  0  0  0  1  0   3
 200.000412   0.878951   5   4  3  4  1  0  0   0
 200.000483   0.520786   4  12  0  0  1  0  2   2
 200.000526   0.708664   5  10  4  0  0  1  1   0
 ...               ...  ..  .. .. .. .. .. ..  ..
 299.999925

## Part nº2 - Assigning Formulas to m/z peaks / bucket labels

Reading the file from MetaboScape. Due to some complications with reading the .csv files. I had to transform it into an Excel where each row was separated into columns by the "," (text to columns function of excel) adn then was read with the read_excel function from Pandas.

In [13]:
formula_file = pd.read_excel('tabela5yeasts11-3-2020.xlsx')

#Making the bucket label column into floats (and taking out the " Da" part)
for i in range(len(formula_file)):
    formula_file.iloc[i,0] = np.float(formula_file.iloc[i,0][:-3])
formula_file

Unnamed: 0,label,m/z,Name,Formula,KEGG,BY0_000001,BY0_000002,BY0_000003,GRE3_000001,GRE3_000002,GRE3_000003,ENO1_000001,ENO1_000002,ENO1_000003,dGLO1_000001,dGLO1_000002,dGLO1_000003,GLO2_000001,GLO2_000002,GLO2_000003
0,307.084,308.091097,Glutathione,C10H17N3O6S,C02471,1.482951e+09,1.520914e+09,1.515563e+09,1.231184e+09,1.205245e+09,1.227117e+09,7.151679e+08,7.148845e+08,7.031374e+08,5.903519e+08,5.864704e+08,5.959080e+08,287722656.0,2.914869e+08,2.920921e+08
1,555.269,556.276570,Enkephalin L,C28H37N5O7,,4.219694e+08,4.219694e+08,4.219694e+08,4.219694e+08,4.219694e+08,4.219694e+08,4.219694e+08,4.219694e+08,4.219694e+08,4.219694e+08,4.219694e+08,4.219694e+08,421969440.0,4.219694e+08,4.219694e+08
2,624.087,625.094568,,C13H31F4N2O15P3,,4.933358e+08,4.858229e+08,4.977819e+08,5.939113e+08,5.818669e+08,5.844022e+08,3.827252e+08,3.834892e+08,3.807282e+08,2.709398e+08,2.645076e+08,2.677631e+08,131498984.0,1.330865e+08,1.320918e+08
3,493.317,494.324096,LysoPC_16:1_9Z_0:0_,C24H48NO7P,C04230,4.026932e+08,4.174214e+08,4.232130e+08,4.742500e+08,4.770647e+08,4.894069e+08,8.762351e+07,8.903744e+07,8.863885e+07,8.300221e+07,8.356744e+07,8.238210e+07,56428384.0,5.809368e+07,5.933968e+07
4,257.103,258.110160,Glycerophosphocholine,C8H20NO6P,C00670,2.056732e+08,2.052868e+08,2.117945e+08,2.113430e+08,1.993518e+08,2.069864e+08,2.030151e+08,1.999564e+08,1.930296e+08,1.212009e+08,1.165595e+08,1.169341e+08,59549848.0,6.113070e+07,6.227151e+07
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21169,2029.12,2030.126139,,,,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,1.487363e+06,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.0,0.000000e+00,0.000000e+00
21170,532.812,533.819665,,,,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,1.487326e+06,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.0,0.000000e+00,0.000000e+00
21171,1070,1071.004232,,,,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,1.487324e+06,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.0,0.000000e+00,0.000000e+00
21172,1343.9,1344.907878,,,,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,1.487058e+06,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.0,0.000000e+00,0.000000e+00


### Function to assign a formula and after function to "write" the formula in a correct way - form_checker

### Major issue: Formula attribution criteria. Right now, the formula with the closest theoric mass to the m/z peak is chosen 

2 different with the same name:

The first one is the closest to the one in FTMSVisualization (with some changes to vastly improve the speed of the process, multiple times faster by doing the section with df2 before the for, instead of what the original function did).

The second one is a bit different made so the speed of the function could be even higher - very very small improvement though, not even sure it actually helps - will be developed further. Both are here since I didn't check yet if the results with the two are identical.

#### The two functions have a different parameter. The rest of notebook is setup to work with the first form_checker. Application for the second one is always there as a commentary right after the application if one wishes to switch.

In [70]:
def form_checker(mass, threshold, threshppm, df):
    """Assigning formulas to an m/z peak based on the distance of the m/z to the formulas present in a given database.
    
       mass: scalar; m/z of the peak.
       threshold: scalar; error threshold for formulae in Da - i.e. absolute error threshold.
       threshppm: scalar; error threshold for formulae in ppm - i.e. relative error threshold.
       df: Pandas Dataframe; dataframe with the formulas that are possible to assign to the m/z peak.
       
       returns: tuple with the mass given and the formula assigned (np.nan if no formula could be assigned within the 2 given
    thresholds)."""
    
    #allposs = []
    #i=0
    possible_ma = []
    #Select formulas that have masses within the absolute error threshold given.
    df2 = df.copy()
    df2 = df2[df2.index<= (mass+threshold)]
    df2 = df2[df2.index>= (mass-threshold)]
    
    for x in df2.itertuples():
        #Calculate the error (in ppm) of the mass of the filtered formulas
        error = ((mass - x[0])/x[0])*1000000
        if abs(error) <= threshppm:
            #Select formulas whose masses are within the relative error threshold given
            possible_ma.append(x)
                
                #THINGS FROM THE ORIGINAL FUNCTION THAT I FOUND WERE NOT NEEDED
                
                #allposs.append(list(x[1:9]))
                #allposs[i].append(error)
                #allposs[i].append(intensity)
                #allposs[i].append(mass)
                #formulatemp = FTPM.formulator(int(x[3]),int(x[4]),int(x[6]),int(x[5]),int(x[7]),int(x[8]),int(x[9]),int(x[10]),ionisationmode)
                #dbe = FTPM.DBEcalc(int(x[3]),int(x[4]),int(x[6]),ionisationmode)
                #allposs[i].append(dbe)
                #allposs[i].append(formulatemp)
                #i = i +1
    
    #If more than one formula is within the mass interval
    if len(possible_ma) > 1:
        #Calculate and store the error (in absolute values) of the mass of the filtered formulas
        min_dif = []
        for i in range(len(possible_ma)):
            min_dif.append(abs(possible_ma[i][0] - mass))
        mini = np.argmin(min_dif)
        #Choose the formula with the lowest error (closest to original mass).
        return(mass, formulator(possible_ma[mini][2],possible_ma[mini][3],possible_ma[mini][4],possible_ma[mini][5],possible_ma[mini][6],possible_ma[mini][7],possible_ma[mini][8],possible_ma[mini][9]))
    
    #Only one formula is within the mass interval
    elif len(possible_ma) == 1:
        return(mass, formulator(possible_ma[0][2],possible_ma[0][3],possible_ma[0][4],possible_ma[0][5],possible_ma[0][6],possible_ma[0][7],possible_ma[0][8],possible_ma[0][9]))
    
    #No formula is within the mass interval
    else:
        return(mass, np.nan)

In [35]:
def form_checker(mass, threshppm, df):
    """Assigning formulas to an m/z peak based on the distance of the m/z to the formulas present in a given database.
    
       mass: scalar; m/z of the peak.
       threshppm: scalar; error threshold for formulae in ppm - i.e. relative error threshold.
       df: Pandas Dataframe; dataframe with the formulas that are possible to assign to the m/z peak.
       
       returns: tuple with the mass given and the formula assigned (np.nan if no formula could be assigned within the 2 given
    thresholds)."""
    
    #Calculate mass difference allowed based on the mass and ppm thresholds given
    mass_dif = mass - (mass*1000000/(1000000 + threshppm))
    #Select the formulas from the database dataframe that are within said mass difference to the mass given
    df2 = df.copy()
    df2 = df2[df2.index<= (mass+mass_dif)]
    df2 = df2[df2.index>= (mass-mass_dif)]
    
    #If more than one formula is within the mass interval
    if len(df2) > 1:
        #Calculate and store the error (in ppm) of the mass of the filtered formulas
        df3 = pd.DataFrame(abs(((mass - df2.index)/df2.index)*1000000), index = df2.index, columns = ['error'])
        mini = df3.idxmin(axis = 'index')
        #Choose the formula with the lowest error (closest to original mass).
        return(mass, formulator(df2.loc[mini, 'C'].values,df2.loc[mini,'H'].values,df2.loc[mini,'O'].values,
                                df2.loc[mini,'N'].values,df2.loc[mini,'S'].values,df2.loc[mini,'P'].values,
                                df2.loc[mini,'F'].values, df2.loc[mini,'Cl'].values))
    
    #Only one formula is within the mass interval
    elif len(df2) == 1:
        return(mass, formulator(df2['C'].values,df2['H'].values,df2['O'].values,df2['N'].values,df2['S'].values,df2['P'].values,
                                df2['F'].values,df2['Cl'].values))
    
    #No formula is within the mass interval
    else: 
        return(mass, np.nan)

#### Formulator function

Function to write formulas based on the element counts in the same order that MetaboScape writes its formulas.

In [16]:
#Adapted from FTMSVisualization
def formulator(c,h,o,n,s,p,f,cl):
    """Transforms element counts to a readable formula in string format. Element order: C, H, N, O, S, P, F and Cl."""
    formula = "C"+str(c)+"H"+str(h)    
    if cl > 0:
        if cl > 1:
            formula = formula + "Cl" + str(cl)
        else:
            formula = formula + "Cl"
    if f > 0:
        if f > 1:
            formula = formula + "F" + str(f)
        else:
            formula = formula + "F"
    if n > 0:
        if n > 1:
            formula = formula + "N" + str(n)
        else:
            formula = formula + "N"
    if o > 0:
        if o > 1:
            formula = formula + "O" + str(o)
        else:
            formula = formula + "O"
    if p > 0:
        if p > 1:
            formula = formula + "P" + str(p)
        else:
            formula = formula + "P"
    if s > 0:
        if s > 1:
            formula = formula + "S" + str(s)
        else:
            formula = formula + "S"
    return formula

### Parameters for formula attribution (FTMSVisualization)

In [17]:
#Define some parameters for our assignment thresholds.
#Relevant parameters
threshppm = 1.0 # Error threshold for formulae in ppm - i.e. relative error threshold.
#Not needed for new form_checker
threshold = 0.005 # Error threshold for formulae in Da - i.e. absolute error threshold.

## Extras

#Parameters concerning Isotope finding, checking and the like which isn't needed for this kind of Formula assignment.
#isothresh = 0.0001 # threshold for isotope peak checker #Function not currently active
precisionfactor = 1000#0#0000 # This is a work in progress. THis is used for isotope hunting. A value of 1000 = 1.003355*1000,
#is equal to a 1 mDa error threshold, at 500 m/z this is 0.2 ppm.


#Parameters concerning Kendrick Mass Defect series that they used in a myriad of functions. I don't understand its relevancy.
maxgap = 0.0003 #gap between KMDs for separating series. Units are Da, hence the small numbers. This value will be a compromise, and depend on complexity of your dataset.
# Looking at Z* and KMD method compared to other methods it seems an appropriate range could be 0.0002 or even 0.00002. Look at closer in a future version
minKMDseries = 1 #minimum number of peaks in a single homologous series to assign by zstar approach # 3 seems a good number.

### Test with a very small part of the dataset

In [78]:
teste = formula_file[formula_file['label']< 200]
teste

Unnamed: 0,label,m/z,Name,Formula,KEGG,BY0_000001,BY0_000002,BY0_000003,GRE3_000001,GRE3_000002,GRE3_000003,ENO1_000001,ENO1_000002,ENO1_000003,dGLO1_000001,dGLO1_000002,dGLO1_000003,GLO2_000001,GLO2_000002,GLO2_000003
74,194.115,233.078461,Tetraethylene glycol,C8H18O5,,0.0,0.0,0.0,8602024.0,9371144.0,8738521.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
101,188.105,227.068084,Azelaic acid,C9H16O4,C08261,0.0,0.0,0.0,8437384.0,9330059.0,9128233.0,0.0,0.0,0.0,0.0,0.0,1194726.0,0.0,0.0,0.0
137,187.121,226.084055,N-Heptanoylglycine,C9H17NO3,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1749687.0,0.0,2265409.0,2242338.0,2368078.0,0.0,0.0,0.0
147,186.089,225.052442,5-_2-Methylpropyl_tetrahydro-2-oxo-3-furancarb...,C9H14O4,,0.0,0.0,0.0,4850851.0,4553961.0,4521906.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
172,190.121,229.083634,,C9H18O4,,0.0,0.0,0.0,0.0,0.0,0.0,1678466.0,2134808.0,1921604.0,0.0,0.0,1197077.0,0.0,0.0,0.0


In [79]:
teste_form = []
for i in range(len(teste)):
    mass = teste.iloc[i,0]
    print(form_checker(mass, threshold, threshppm, df = formulas['dict100'])) #1st Form Checker function
    #print(form_checker(mass, threshppm, df = formulas['dict100'])) #2nd Form Checker function
    teste_form.append(form_checker(mass, threshold, threshppm, df = formulas['dict100']))

(194.1153114956, 'C8H18O5')
(188.1047773826, 'C9H16O4')
(187.1208269152, 'C9H17NO3')
(186.0892154645, 'C9H14O4')
(190.1205077248, 'C9H18O4')


In [80]:
intersection = 0
cls = 0
for i in range(len(teste)):
    if teste.iloc[i,3] == teste_form[i][1]:
        intersection = intersection + 1
intersection

5

### Assigning formula to the full dataset

Cut the intensity and samples out of the dataset

In [68]:
formula_file2 = formula_file[['label','m/z','Name','Formula','KEGG']]

In [69]:
#Make a column to store the formulas with each of the labels.
forma = pd.DataFrame(np.zeros((len(formula_file2),1)), index = formula_file2['label'], columns = ['Form_give'])

### Try to assign a formula to each peak on the dataset

#### Warning: Don't do range (1,12) here, do the same range as done in form_calc

In [71]:
#Change range here - don't do range(1,12)!!
#Same range as in the form_calc before
for i in range (1,12):
    teste = formula_file2[formula_file2['label'] < (i+1)*100]
    teste = teste[teste['label'] > i*100]
    for j in range(len(teste)):
        mass = teste.iloc[j,0]
        tup = form_checker(mass, threshold, threshppm, df = formulas['dict' + str(i*100)]) #1st Form Checker function
        #tup = form_checker(mass, threshppm, df = formulas['dict' + str(i*100)]) #2nd Form Checker function
        forma.loc[tup[0]] = tup[1]
        #print(j)
    print(i*100, 'complete')


100 complete
200 complete
300 complete
400 complete
500 complete


### Just until m/z 600 for this example

### Make a dataframe with Formulas assigned by MetaboScape ('Formula') and by form_checker ('Form_give')

In [72]:
forma.loc[forma.index<600]

Unnamed: 0_level_0,Form_give
label,Unnamed: 1_level_1
307.083818,C10H17N3O6S
555.269298,C19H47Cl2N7OS4
493.316816,C19H41ClFN11O
257.102875,C8H20NO6P
347.063081,C10H14N5O7P
...,...
536.699349,C7H2Cl2F5NO4S5
538.942676,C14H5Cl8N5O3S3
550.505569,C34H69Cl2F
535.492464,C25H61Cl4N7


In [73]:
formula_file2 = formula_file2.set_index('label')
ending = pd.concat([formula_file2,forma], axis = 1)

In [74]:
ending = ending.loc[ending.index<600]
colus = ending[['Formula','Form_give']]

colus2 = colus.notnull()
print('-----------Number of formulas given by the MetaboScape (Formula) and this notebook (Form_give)---------------')
print(colus2.sum())
print('Total number of peaks:', len(ending))

-----------Number of formulas given by the MetaboScape (Formula) and this notebook (Form_give)---------------
Formula      1788
Form_give    4765
dtype: int64
Total number of peaks: 5266


In [75]:
colus

Unnamed: 0_level_0,Formula,Form_give
label,Unnamed: 1_level_1,Unnamed: 2_level_1
307.083818,C10H17N3O6S,C10H17N3O6S
555.269298,C28H37N5O7,C19H47Cl2N7OS4
493.316816,C24H48NO7P,C19H41ClFN11O
257.102875,C8H20NO6P,C8H20NO6P
347.063081,C10H14N5O7P,C10H14N5O7P
...,...,...
536.699349,,C7H2Cl2F5NO4S5
538.942676,,C14H5Cl8N5O3S3
550.505569,,C34H69Cl2F
535.492464,,C25H61Cl4N7


Almost all peaks had a formula associated not the best

### Number of equal formula assignments between MetaboScape and form_checker

In [83]:
intersection = 0
cls = 0
for i in range(len(colus)):
    if colus.iloc[i,0] == colus.iloc[i,1]:
        intersection = intersection + 1
    #if type(colus.iloc[i,0]) == type('str'):
        #if 'Cl' in colus.iloc[i,0]:
            #print(colus.iloc[i,0])
            #cls = cls + 1
        #elif 'F' in colus.iloc[i,0]:
            #cls = cls + 1
    #print(colus.iloc[i,0], colus.iloc[i,1])
print(intersection, 'em', colus2.sum().iloc[0])
print(intersection/colus2.sum().iloc[0]*100, '% of same attributions.', )

320 em 1788
17.89709172259508 % of same attributions.


### Miscellaneous non-relevant tests

### Really not relevant

In [26]:
#abc = formula_file2.loc[formula_file2['Formula'].notnull()]
#ret = []
#for i in range(len(abc)):
    #ret.append(Lewis_Senior_rules())

Unnamed: 0,label,m/z,Name,Formula,KEGG
0,307.084,308.091097,Glutathione,C10H17N3O6S,C02471
1,555.269,556.276570,Enkephalin L,C28H37N5O7,
2,624.087,625.094568,,C13H31F4N2O15P3,
3,493.317,494.324096,LysoPC_16:1_9Z_0:0_,C24H48NO7P,C04230
4,257.103,258.110160,Glycerophosphocholine,C8H20NO6P,C00670
...,...,...,...,...,...
20951,830.605,831.612543,PG_14:1_11Z_26:1_11Z_,C46H87O10P,
20960,544.272,545.279705,"Lyso-PS_20:4_5Z,8Z,11Z,14Z_0:0_",C26H43NO9P,
21037,256.239,257.246442,2-hydroxyhexadecanal,C16H32O2,
21064,535.099,536.106465,lipoyl-AMP,C18H26N5O8PS2,


In [25]:
#pd.read_table('D/dict100')

In [179]:
teste = formula_file[formula_file['label']< 300]
teste = teste[teste['label']> 200]
teste

Unnamed: 0,label,m/z,Name,Formula,KEGG
4,257.103,258.110160,Glycerophosphocholine,C8H20NO6P,C00670
7,254.225,255.231904,Hypogeic acid,C16H30O2,
13,256.24,257.247559,Palmitic acid,C16H32O2,C00249
20,243.183,244.190777,N-Undecanoylglycine,C13H25NO3,
22,278.152,317.114972,alpha-CEHC,C16H22O4,
...,...,...,...,...,...
20108,262.23,263.236853,"6,10,14-Trimethyl-5,9,13-pentadecatrien-2-one",C18H30O,
20521,226.121,227.127818,Allixin,C12H18O4,
20770,269.272,270.279028,Capsiamide,C17H35NO,C17515
20991,266.95,267.957701,,,


In [182]:
forma = pd.DataFrame(np.zeros((len(teste),1)), index = teste['label'], columns = ['Form_give'])
#print(forma)
for i in range(len(teste)):
    mass = teste.iloc[i,0]
    tup = form_checker(mass, threshold, threshppm, dicta = formulas['dict200'])
    forma.loc[tup[0]] = tup[1]
print(forma)


             Form_give
label                 
257.102875         NaN
254.224610    C16H30O2
256.240207    C16H32O2
243.183406   C13H25NO3
278.151833    C16H22O4
...                ...
262.229577     C18H30O
226.120541    C12H18O4
269.271752    C17H35NO
266.950424  C4H5N4O4SP
256.239165         NaN

[420 rows x 1 columns]


In [92]:
def form_checker(mass, threshold, threshppm, df):
    #allposs = []
    #i=0
    possible_ma = []
    df2 = df.copy()
    df2 = df2[df2.index<= (mass+threshold)]
    df2 = df2[df2.index>= (mass-threshold)]
    df3 = pd.DataFrame(abs(((mass - df2.index)/df2.index)*1000000), index = df2.index, columns = ['error'])
    df2 = [df3['error'] <= threshppm]
    if len(df3) > 1:
        mini = df3.idxmin(axis = 'index')
        #print(df2.loc[mini,'N'].values)
        return(mass, formulator(df2.loc[mini, 'C'].values,df2.loc[mini,'H'].values,df2.loc[mini,'O'].values,
                                df2.loc[mini,'N'].values,df2.loc[mini,'S'].values,df2.loc[mini,'P'].values,
                                df2.loc[mini,'F'].values, df2.loc[mini,'Cl'].values))
    elif len(df3) == 1:
        return(mass, formulator(df2['C'].values,df2['H'].values,df2['O'].values,
                                df2['N'].values,df2['S'].values,df2['P'].values,
                                df2['F'].values,df2['Cl'].values))
    else: 
        return(mass, np.nan)

In [12]:
def form_checker(mass, threshold, threshppm, df):
    #allposs = []
    #i=0
    possible_ma = []
    df2 = df.copy()
    #df2 = df2[df2['label']<= (mass+threshold)]
    #df2 = df2[df2['label']>= (mass-threshold)]
    for x in df.itertuples():
        if (mass-threshold) <= x[0] <= (mass+threshold):
            error = ((mass - x[0])/x[0])*1000000
            if abs(error) <= threshppm:
                #ideally enter Senior and Lewis check - next thing to be developed
                possible_ma.append(x)
                #print(x[0], formulator(x[2],x[3],x[4],x[5],x[6],x[7]))
                
                #THINGS FROM THE ORIGINAL FUNCTION THAT I FOUND WERE NOT NEEDED
                
                #allposs.append(list(x[1:9]))
                #allposs[i].append(error)
                #allposs[i].append(intensity)
                #allposs[i].append(mass)
                #formulatemp = FTPM.formulator(int(x[3]),int(x[4]),int(x[6]),int(x[5]),int(x[7]),int(x[8]),int(x[9]),int(x[10]),ionisationmode)
                #dbe = FTPM.DBEcalc(int(x[3]),int(x[4]),int(x[6]),ionisationmode)
                #allposs[i].append(dbe)
                #allposs[i].append(formulatemp)
                #i = i +1
                
    #sENIOR AND LEWIS CHECK SHOULD BE HERE
    if len(possible_ma) > 1:
        min_dif = []
        for i in range(len(possible_ma)):
            min_dif.append(abs(possible_ma[i][0] - mass))
        mini = np.argmin(min_dif)
        return(mass, formulator(possible_ma[mini][2],possible_ma[mini][3],possible_ma[mini][4],possible_ma[mini][5],possible_ma[mini][6],possible_ma[mini][7]))
    elif len(possible_ma) == 1:
        return(mass, formulator(possible_ma[0][2],possible_ma[0][3],possible_ma[0][4],possible_ma[0][5],possible_ma[0][6],possible_ma[0][7]))
    else:
        return(mass, np.nan)

In [169]:
#pd.concat([teste.set_index('label'),forma])
teste = teste.set_index('label')
pd.concat([teste,forma], axis = 1)
#forma

Unnamed: 0_level_0,m/z,Name,Formula,KEGG,BY0_000001,BY0_000002,BY0_000003,GRE3_000001,GRE3_000002,GRE3_000003,ENO1_000001,ENO1_000002,ENO1_000003,dGLO1_000001,dGLO1_000002,dGLO1_000003,GLO2_000001,GLO2_000002,GLO2_000003,Form_give
label,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1
257.102875,258.110160,Glycerophosphocholine,C8H20NO6P,C00670,2.056732e+08,2.052868e+08,2.117945e+08,2.113430e+08,1.993518e+08,2.069864e+08,2.030151e+08,1.999564e+08,1.930296e+08,1.212009e+08,1.165595e+08,1.169341e+08,59549848.0,6.113070e+07,6.227151e+07,
254.224610,255.231904,Hypogeic acid,C16H30O2,,3.768004e+07,6.755935e+07,7.409344e+07,2.596554e+07,4.509098e+07,7.337231e+07,3.014258e+07,7.357830e+07,8.325845e+07,5.520110e+07,5.732034e+07,5.935389e+07,9995730.0,1.379522e+07,1.950888e+07,C16H30O2
256.240207,257.247559,Palmitic acid,C16H32O2,C00249,4.938765e+07,5.654261e+07,5.656344e+07,5.135225e+07,4.646735e+07,4.946956e+07,3.967435e+07,5.034116e+07,5.555791e+07,3.700789e+07,3.896814e+07,4.062987e+07,13851696.0,1.279689e+07,1.322522e+07,C16H32O2
243.183406,244.190777,N-Undecanoylglycine,C13H25NO3,,1.923931e+07,1.813270e+07,1.958480e+07,2.611610e+07,2.529480e+07,2.436860e+07,1.036602e+07,1.101832e+07,1.081115e+07,1.092722e+07,1.230570e+07,1.234905e+07,5295285.5,5.204488e+06,4.870206e+06,C13H25NO3
278.151833,317.114972,alpha-CEHC,C16H22O4,,1.389726e+07,1.454467e+07,1.619012e+07,1.631480e+07,1.342425e+07,1.304407e+07,1.132682e+07,1.062885e+07,1.030479e+07,6.430070e+06,6.741803e+06,5.763232e+06,5298218.5,4.351453e+06,4.133799e+06,C16H22O4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
262.229577,263.236853,"6,10,14-Trimethyl-5,9,13-pentadecatrien-2-one",C18H30O,,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,1.111047e+06,0.0,0.000000e+00,0.000000e+00,C18H30O
226.120541,227.127818,Allixin,C12H18O4,,0.000000e+00,0.000000e+00,0.000000e+00,2.487062e+06,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.0,0.000000e+00,0.000000e+00,C12H18O4
269.271752,270.279028,Capsiamide,C17H35NO,C17515,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,1.096925e+06,0.0,0.000000e+00,0.000000e+00,C17H35NO
266.950424,267.957701,,,,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,1.512282e+06,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.0,0.000000e+00,0.000000e+00,C4H5N4O4SP


In [132]:
dict23.loc[dict23['C'] == 16].loc[dict23['H'] == 32]
teste.iloc[-1,0]

256.2391650817

In [133]:
256.240230
((256.2391650817 - 256.240230)/256.240230)*1000000

-4.155937184480636

In [110]:
dict23 = pd.DataFrame.from_dict(a, orient = 'index', columns = ['Abundance', 'C','H','O','N','S','P'])
#dict12.to_csv('Formulas100-200.csv')
dict23 = dict23.sort_index()
print(dict23)

            Abundance   C   H   O  N  S  P
200.000412   0.878951   5   4   3  4  1  0
200.001218   0.852422   5  14   0  0  2  1
200.001687   0.928933   5   6   1  4  0  1
200.001755   0.866743   7   6   4  1  1  0
200.003030   0.916031   7   8   2  1  0  1
...               ...  ..  ..  .. .. .. ..
299.999174   0.869468  10   6  10  1  0  0
299.999349   0.776458  17   4   2  2  1  0
299.999504   0.772499   9  18   1  0  3  1
299.999912   0.903729   8  16   4  0  0  2
299.999973   0.841835   9  10   2  4  1  1

[59074 rows x 7 columns]


In [56]:
dict12 = pd.DataFrame.from_dict(a, orient = 'index', columns = ['Abundance', 'C','H','O','N','S','P'])
#dict12.to_csv('Formulas100-200.csv')
dict12 = dict12.sort_index()
print(dict12)

            Abundance  C   H  O  N  S  P
100.003469   0.956417  3   2  3  1  0  0
100.009519   0.911867  3   4  0  2  1  0
100.016045   0.948991  4   4  3  0  0  0
100.018724   0.921291  7   2  0  1  0  0
100.022095   0.904787  4   6  0  1  1  0
...               ... ..  .. .. .. .. ..
199.995229   0.832837  6   6  1  3  2  0
199.995705   0.909089  7   4  7  0  0  0
199.996572   0.821269  8   8  2  0  2  0
199.997847   0.867971  8  10  0  0  1  1
199.999943   0.806557  5  12  2  0  3  0

[9103 rows x 7 columns]


In [19]:
a = 0

In [50]:
com_range['H/C'][0]

0.2

In [44]:
#if not ((n < 11) and (o < 22) and (p < 6)):
#    print('HELLO')

HELLO


In [13]:
#hi = '34s'
#hi = hi + 'sd'
#hi

'34ssd'

In [16]:
#n, o, p, s = 12,20,5,20
#import numpy as np

In [32]:
NOPS_ratio = True
if (n > 1) and (o > 1) and (p > 1) and (s > 1): #NOPS
    if (n < 10) and (o < 20) and (p < 4) and (s < 3):
        NOPS_ratio = True
    else:
        NOPS_ratio = False
elif (n > 3) and (o > 3) and (p > 3): #NOP
    if (n < 11) and (o < 22) and (p < 6):
        NOPS_ratio = True
    else:
        NOPS_ratio = False
elif (o > 1) and (p > 1) and (s > 1): #OPS
    if (o < 14) and (p < 3) and (s < 3):
        NOPS_ratio = True
    else:
        NOPS_ratio = False
elif (n > 1) and (p > 1) and (s > 1): #PSN
    if (n < 10) and (p < 4) and (s < 3):
        NOPS_ratio = True
    else:
        NOPS_ratio = False
elif (n > 6) and (o > 6) and (s > 6): #PSN
    if (n < 19) and (o < 14) and (s < 8):
        NOPS_ratio = True
    else:
        NOPS_ratio = False
NOPS_ratio

False

In [57]:
ending = ending.loc[ending.index<500]
colus = ending[['Formula','Form_give']]

colus2 = colus.notnull()
print('-----------Number of formulas given by the MetaboScape (Formula) and this notebook (Form_give)---------------')
print(colus2.sum())
print('Total number of peaks:', len(ending))

-----------Number of formulas given by the MetaboScape (Formula) and this notebook (Form_give)---------------
Formula       3872
Form_give    17089
dtype: int64
Total number of peaks: 17334


Almost all peaks had a formula associated not the best

In [65]:
intersection = 0
cls = 0
for i in range(len(colus)):
    if colus.iloc[i,0] == colus.iloc[i,1]:
        intersection = intersection + 1
    if type(colus.iloc[i,0]) == type('str'):
        if 'Cl' in colus.iloc[i,0]:
            cls = cls + 1
        elif 'F' in colus.iloc[i,0]:
            cls = cls + 1
    #print(colus.iloc[i,0], colus.iloc[i,1])
print(intersection, cls, 3872 - cls, intersection/(3872 - cls))

326 2903 969 0.3364293085655315
