# Formula Generation Notebook

This notebook generates the Formula Database used by the Formula Assignment algorithm. The database contains every candidate formula to be considered for annotation.

The database is built from a series of rules and is largely adapted from FTMSVisualization (https://github.com/wkew/FTMSVisualization), in particular `0-FormulaGenerator.py`.

Beyond what was done in FTMSVisualization, the paper "Kind T, Fiehn O. Seven Golden Rules for heuristic filtering of molecular formulas obtained by accurate mass spectrometry. BMC Bioinformatics. 2007;8:1-20. doi:10.1186/1471-2105-8-105" was used as the basis for extra rules applied when building the Formula Database. These were complemented with some criteria used by "Kujawinski EB, Behn MD. Automated Analysis of Electrospray Ionization Fourier Transform Ion Cyclotron Resonance Mass Spectra of Natural Organic Matter. Anal Chem. 2006;78(13):4363-4373. doi:10.1021/ac0600306" in their CIA software for formula assignment. The rules for assigning formulas were also extended with basic C(13) isotope searching and other priority rules. Some of these rules were expanded or slightly altered to better fit the characteristics of biologically relevant molecules.

#### Building the Formulas Database

This notebook presents the functions used to generate the database that can be customized if other parameters are desired. Functions that should not be changed are stored in `form_assign_func.py`.

Generating the databases can require a large amount of RAM.



### Needed imports

In [None]:
import pandas as pd
import numpy as np
from tqdm import tqdm  # progress bar used inside form_calc
import form_assign_func as form_afunc
from form_assign_func import chemdict

# Part 1 - Function to generate the Formulas Database

Other related functions are in `form_assign_func`.

In [None]:
def form_calc(low, high, elem_range, elem_max):
    """Calculates all possible formulas (subject to the rules below) within a given mass interval.
    
       low: scalar; lower limit of the molecular mass of the formula.
       high: scalar; upper limit of the molecular mass of the formula.
       elem_range: dictionary; elemental ratio ranges (min, max) for each mass bracket, plus the 'ext_range' entry.
       elem_max: dictionary; maximum number of atoms of each element for each mass bracket.
    
       return: dictionary where keys are exact (monoisotopic) masses and values are tuples with, in order: the overall
    abundance of the monoisotopic mass; the number of atoms of each element (C, H, O, N, S, P, Cl, F); a bool indicating
    whether the formula follows the valency rules when its elements are in their most common valency; the abundance and
    the mass of the isotope with 1 C(13) atom; and a False default flag that marks whether the C(13) isotope mass or the
    monoisotopic mass is being used.
    """
    
    # RULE Nº 1
    # Maximum number of atoms per element, following the recommendations of the 7 Golden Rules paper.
    # Pick the first mass bracket whose upper limit covers `high`; fall back to 'Higher' above 1250 Da
    for limit in (250, 500, 750, 1000, 1250):
        if high <= limit:
            maxima = elem_max[limit]
            break
    else:
        maxima = elem_max['Higher']
    maxC, maxH, maxO, maxN = maxima['C'], maxima['H'], maxima['O'], maxima['N']
    maxS, maxP, maxCl, maxF = maxima['S'], maxima['P'], maxima['Cl'], maxima['F']

    # The maximum possible range to use as a start
    ext_range = elem_range['ext_range']
    
    # + 1 is done since we use range
    maxC = min((int(high) / 12), maxC + 1) # max carbon nº is the smaller of the total mass/12 or predefined maxC
    maxH2 = min((maxC * 4), maxH + 1) # max H nº is the smaller of 4 times the nº of carbons or the predefined max hydrogen nº
    maxO2 = min((int(high) / 16), maxO + 1) # max oxygen nº has to be the smaller of the total mass/16 or predefined maxO
    # Note: the mass-based limits above (int(high) / 12 and int(high) / 16) may also need a + 1, since range excludes
    # its upper bound.
    maxN2 = maxN + 1
    maxS2 = maxS + 1
    maxP2 = maxP + 1
    maxF2 = maxF + 1
    maxCl2 = maxCl + 1

    allposs = {} # pd.DataFrame(columns = ['abundance', 'c','h','o','n','s','p'])
    
    # Construction of all possible formulas
    for c in tqdm(range(int(maxC))[1:]): # metabolites contain at least 1 C and 1 H
        
        # This process is done for every element, starting from elements with higher atomic masses to lower masses (faster)
        # The maximum nº of atoms of each element is the smaller of the established maximum value and the maximum number
        # allowed by the element's ratio to carbon
        maxCl = min(maxCl2, int(c * ext_range['Cl/C'][1]+0.99)) # RULE Nº 5
        for cl in range(maxCl):
            massCl = chemdict['C'][0] * c + chemdict['Cl'][0] * cl # Calculate the mass of the formulas thus far
            if massCl < high: # If it's below the high threshold, continue
                # Repeat for other elements
                maxS = min(maxS2, int(c * ext_range['S/C'][1]+0.99)) # RULE Nº 5
                for s in range(maxS):
                    massS = massCl + chemdict['S'][0] * s
                    if massS < high:
                        maxP = min(maxP2, int(c * ext_range['P/C'][1]+0.99)) # RULE Nº 5
                        for p in range(maxP):
                            massP = massS + chemdict['P'][0] * p
                            if massP < high:
                                maxO = min(maxO2, int(c * ext_range['O/C'][1]+0.99)) # RULE Nº 5
                                minO = 0 # If P is present, require at least 3 O atoms per P atom (capped at maxO)
                                if p > 0:
                                    minO = 3*p
                                    minO = min(minO, int(maxO))
                                for o in range(minO, int(maxO)):
                                    massO = massP + chemdict['O'][0] * o
                                    if massO < high:
                                        maxN = min(maxN2, int(c * ext_range['N/C'][1]+0.99)) # RULE Nº 5
                                        for n in range(maxN):
                                            massN = massO + chemdict['N'][0] * n
                                            if massN < high:
                                                #print(n)
                                                if (n + o) <= 2*c: # Adapted Kujawinski criteria nº1
                                                    NOPS_ratio = form_afunc.NOPS(n,o,p,s)
                                                    if NOPS_ratio: # RULE Nº 6 - element probability check - see function below
                                                        maxF = min(maxF2, int(c * ext_range['F/C'][1]+0.99)) # RULE Nº 5
                                                        for f in range(maxF):
                                                            massF = massN + chemdict['F'][0] * f 
                                                            if massF < high:
                                                                maxH = min(ext_range['H/C'][1] * c + 0.99, maxH2)
                                                                for h in range(int(maxH))[1:]:
                                                                    hcrat = float(h)/float(c)
                                                                    if ext_range['H/C'][0] < hcrat: # RULE Nº 4
                                                                        mass = massF + chemdict['H'][0] * h
                                                                        if h <= (2*c + n + p + 2): # Adapted Kujawinski criteria nº2
                                                                            if low < mass < high:
                                                                                # Rule nº 2 (partially) - Valency check
                                                                                Valency, Valency_normal = form_afunc.Lewis_Senior_rules(c,h,o,n,s,p,cl,f)
                                                                                if Valency:
                                                                                    # If the formula passed all the checks
                                                                                    elem_range_check = False
                                                                                    for m, e_range in elem_range.items():
                                                                                        if mass < m or mass > 1250:
                                                                                            HCrange = h/c
                                                                                            OCrange = o/c
                                                                                            NCrange = n/c
                                                                                            PCrange = p/c
                                                                                            SCrange = s/c
                                                                                            if mass > 1250:
                                                                                                elem_range_check=form_afunc.elem_check(HCrange, OCrange, NCrange,
                                                                                                                       PCrange, SCrange, elem_range['Higher'])
                                                                                            else:
                                                                                                elem_range_check=form_afunc.elem_check(HCrange, OCrange, NCrange,
                                                                                                                       PCrange, SCrange, e_range)
                                                                                            break

                                                                                    if elem_range_check:
                                                                                        # Get the data for the formula and store
                                                                                        abuns = form_afunc.getabun(c,h,o,n,s,cl)
                                                                                        abundance = abuns['Monoisotopic']
                                                                                        abundanceC13 = abuns['C13']
                                                                                        massC13 = mass + chemdict['C13'][0] - chemdict['C'][0]
                                                                                        allposs[mass] = (abundance,c,h,o,n,s,p,cl,f,
                                                                                                         Valency_normal,abundanceC13,massC13,False)
                                                                                        #allposs[mass] = formulator(c,h,o,n,s,p,f,cl)

    return allposs
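
The two adapted Kujawinski criteria and the H/C lower bound applied inside `form_calc` can be tried in isolation. The following is a minimal sketch, not part of `form_assign_func`: the helper name `passes_basic_filters` is hypothetical, and the default `min_hc` simply mirrors the lower H/C limit of `ext_range` used in this notebook.

```python
def passes_basic_filters(c, h, o, n, p, min_hc=0.1):
    """Hypothetical helper mirroring three of the checks made in form_calc."""
    if c < 1 or h < 1:
        return False  # formulas must contain at least 1 C and 1 H
    if (n + o) > 2 * c:
        return False  # adapted Kujawinski criterion no. 1
    if h > 2 * c + n + p + 2:
        return False  # adapted Kujawinski criterion no. 2
    return h / c > min_hc  # lower H/C bound (Rule no. 4)


print(passes_basic_filters(6, 12, 6, 0, 0))  # glucose, C6H12O6
print(passes_basic_filters(1, 5, 0, 0, 0))   # CH5 has too many hydrogens
```

Glucose passes all three checks, while CH5 fails the second criterion because 5 > 2*1 + 0 + 0 + 2.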

# Part 2 - Choose Parameters and Generate the Formulas Database

In [None]:
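# Ranges of elemental ratios to use (min. ratio, max. ratio)
# The numeric keys are the upper mass limit (in Da) up to which those ratios should be used
# The key ext_range should always be named ext_range
# form_calc looks up a 'Higher' entry for masses above 1250 Da, so add one if intervals above 1250 Da are generated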
elem_range = {
    250: {'H/C':(0.1, 6),'N/C':(0, 3),'O/C':(0,6),'P/C':(0,1),'S/C':(0,1),'F/C':(0,1.5), 'Cl/C':(0,0.8)},
    500: {'H/C':(0.1, 2.5),'N/C':(0, 1.3),'O/C':(0,3.5),'P/C':(0,0.75),'S/C':(0,0.5),'F/C':(0,1.5), 'Cl/C':(0,0.8)},
    750: {'H/C':(0.3, 2.3),'N/C':(0, 0.6),'O/C':(0,3.5),'P/C':(0,0.75),'S/C':(0,0.3),'F/C':(0,1.5), 'Cl/C':(0,0.8)},
    1000: {'H/C':(0.5, 2.3),'N/C':(0, 0.6),'O/C':(0,1.2),'P/C':(0,0.3),'S/C':(0,0.3),'F/C':(0,1.5), 'Cl/C':(0,0.8)},
    1250: {'H/C':(0.5, 2.3),'N/C':(0, 0.6),'O/C':(0,1.2),'P/C':(0,0.3),'S/C':(0,0.3),'F/C':(0,1.5), 'Cl/C':(0,0.8)},
    # Maximum range to consider for any formula before checking its mass; this should be at least as wide as the
    # ranges of every mass bracket
    'ext_range': {'H/C':(0.1, 6),'N/C':(0, 3),'O/C':(0,6),'P/C':(0,1),'S/C':(0,1),'F/C':(0,1.5), 'Cl/C':(0,0.8)}
}
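
To make the bracket keys concrete, here is a minimal sketch of how a formula's mass can be mapped to the bracket whose ratio ranges apply to it. The helper `pick_ratio_bracket` is hypothetical and only approximates the lookup done inside `form_calc`: it skips non-numeric keys such as `'ext_range'` and falls back to `'Higher'`, the key `form_calc` uses for masses above 1250 Da.

```python
def pick_ratio_bracket(mass, ranges):
    """Return the key of the first mass bracket whose upper limit exceeds `mass`.

    Hypothetical helper: non-numeric keys (e.g. 'ext_range') are ignored and
    masses above every numeric limit map to 'Higher'.
    """
    limits = sorted(k for k in ranges if isinstance(k, (int, float)))
    for limit in limits:
        if mass < limit:
            return limit
    return 'Higher'


# Only the keys matter for this illustration, so the values are left empty
example_ranges = {250: {}, 500: {}, 750: {}, 1000: {}, 1250: {}, 'ext_range': {}}
print(pick_ratio_bracket(180.0634, example_ranges))  # 250
print(pick_ratio_bracket(612.15, example_ranges))    # 750
```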

In [None]:
# Maximum number of atoms of each element a formula may contain
# The keys are the upper mass limit (in Da) up to which those maximums should be used; form_calc expects exactly the
# keys used in the example below

elem_max = {
    250: {'C':37,'H':72,'O':18,'N':12,'S':6, 'P':4, 'F':0, 'Cl':0},
    500: {'C':37,'H':72,'O':18,'N':12,'S':6, 'P':4, 'F':0, 'Cl':0},
    750: {'C':54,'H':92,'O':27,'N':12,'S':6, 'P':6, 'F':0, 'Cl':0},
    1000: {'C':78,'H':130,'O':31,'N':18,'S':6, 'P':6, 'F':0, 'Cl':0},
    1250: {'C':78,'H':130,'O':31,'N':18,'S':6, 'P':6, 'F':0, 'Cl':0},
    # > 1250 Da
    'Higher': {'C':78,'H':130,'O':31,'N':18,'S':6, 'P':6, 'F':0, 'Cl':0},
}
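
These maximums are not used directly as loop limits: `form_calc` combines them with mass-based limits. The sketch below reproduces that arithmetic for C, H and O in a hypothetical standalone helper (`loop_bounds` is not part of the notebook's modules) so the effect of a given `elem_max` entry can be inspected before running the full generation.

```python
def loop_bounds(high, max_c, max_h, max_o):
    """Exclusive upper bounds of the C, H and O loops, computed as in form_calc."""
    bound_c = min(int(high) / 12, max_c + 1)  # mass limit vs. predefined maximum
    bound_h = min(bound_c * 4, max_h + 1)     # at most 4 H per C
    bound_o = min(int(high) / 16, max_o + 1)  # mass limit vs. predefined maximum
    return int(bound_c), int(bound_h), int(bound_o)


print(loop_bounds(250, 37, 72, 18))    # (20, 73, 15)
print(loop_bounds(1250, 78, 130, 31))  # (79, 131, 32)
```

Because `range` excludes its upper bound, an interval ending at 250 Da iterates carbon counts from 1 to 19, whereas an interval ending at 1250 Da is limited by `elem_max` rather than by mass.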

In [None]:
from tqdm import tqdm

# Store the Formula Database
formulas = {}

# Creating from 0 to 1250 Da in 250 Da intervals - 5 total files
# You can change this interval; the upper limit of each interval determines which parameters from the above cells are used.
for i in range(0,1250,250):
    # An earlier variant (commented out below) built 100 m/z intervals and widened each one by `dif + 0.001` on both
    # sides. The idea was to account for possible C(13) isotopes that surpass an interval barrier while the monoisotopic
    # formula doesn't, which would lead to an error in the formula assignment procedure if the monoisotopic formula
    # isn't present in the database. That widening means a slight overlap between the different files.

    # Building the Database for each 250 Da interval
    #a = form_calc(i*100 - dif - 0.001, (i+1)*100 + dif + 0.001, elem_range = ms_range)
    a = form_calc(i, i+250, elem_range=elem_range, elem_max=elem_max)
    formulas['dict' + str(i)] = pd.DataFrame.from_dict(a, orient = 'index', 
                                                           columns = ['Abundance', 'C', 'H', 'O', 'N', 'S', 'P',
                                                                      'Cl', 'F', 'Valency', 'C13 Abun', 'C13 mass',
                                                                      'C13 check'])
    #abundance,c,h,o,n,s,p,cl,f,Valency_normal,abundanceC13,massC13,False
    #formulas['dict' + str(i*100)] = formulas['dict' + str(i*100)].sort_index()

    print(f'{i}-{i+250} complete.')

    # Writing a .csv file of the data
    #formulas['dict' + str(i*100)].to_csv('dict' + str(i*100) + '.csv')
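
The `C13 mass` column stored above is the monoisotopic mass shifted by the mass difference between one C(13) and one C(12) atom. The following minimal sketch shows that calculation for a CHO formula; the `MASS` dictionary is a hypothetical stand-in for the masses held in `chemdict`, and `mono_and_c13_mass` is not a function from `form_assign_func`.

```python
# Monoisotopic masses in Da (assumed values standing in for chemdict)
MASS = {'C': 12.0, 'H': 1.00782503207, 'O': 15.99491461956, 'C13': 13.0033548378}


def mono_and_c13_mass(c, h, o):
    """Return the monoisotopic mass and the mass with one C(12) replaced by C(13)."""
    mono = MASS['C'] * c + MASS['H'] * h + MASS['O'] * o
    c13 = mono + MASS['C13'] - MASS['C']
    return mono, c13


mono, c13 = mono_and_c13_mass(6, 12, 6)  # glucose, C6H12O6
print(round(mono, 4), round(c13, 4))     # 180.0634 181.0667
```

A shift of about 1.00335 Da can carry a C(13) peak across an interval boundary that its monoisotopic formula does not cross, which is the situation the interval comments in the cell above describe.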

In [None]:
# Number of formulas in each 250 Da interval
a = 0
for i in formulas:
    a += len(formulas[i])
    print(f'{i[4:]}-{int(i[4:])+250} has {len(formulas[i])} formulas.')
print(f'Formula Database has a total of {a} formulas.')

In [None]:
for i in formulas:
    formulas[i].to_csv(f'formulas_{i}.csv')