# CompoundCalculator 20201111

Given a list of base compound names and molecular weights, this notebook calculates possible derivatives (e.g. metabolites) in two phases, then adds adducts and losses, resulting in a list of possible ion masses with labels.
The intention is that this be used with the PCVG compound-finding code: once we have a suggested identification, e.g. ibuprofen, we can calculate other potential masses and look for matches with the ions we've found. This eliminates a lot of manual calculation and allows interactive investigation, e.g. if we find an adduct that corresponds to addition of C2H3O2Na we can quickly incorporate it into our lists and see if it shows up elsewhere. It also allows us to look for forms that we might not otherwise consider. The calculated mass lists can also be used interactively with complex spectra, for example by having the calculator in one window and the matching notebook in another.

Changes
-------

20201007

All user-defined parameters are now accessible in "Setup", so the code can be executed with "Run selected cell and all below".

20201005

Modifications are separated from the base compound with underscores so that only losses are joined with a minus sign, making the output easier to read.

The base compound information is now a list of (name, mass) tuples, which provides a way to incorporate specific known modifications. For example, vinpocetin shows a loss of C2H4, possibly following oxidations, which then undergoes further phase 1 and 2 modifications, so the base_compounds list contains both Vinpo and Apo-vinpo.

The output can now include heterodimers, which are calculated by generating new compounds from all pairs of modified compounds, i.e. after the phase 1 and phase 2 modifications have been generated but prior to generating the final list with adducts. NOTE: this can generate a lot of extra entries!

20201111

A number of changes have been made to allow better operation with large numbers of adducts; in particular, a new recursive routine efficiently calculates the combinations.

An extra field ('Root') has been added to Composition; this tracks the initial compound in each derived composition so they can be easily sorted and filtered later. This is useful when the base compound list contains several compounds as in interactive spectrum analysis where unknowns, fragments, etc. can be added to the list.

In some cases ions corresponding to adduct clusters are observed in the spectrum, for example JChromA 2019(1600)-174; this can now be accommodated by setting *include_adducts_as_compounds* = True. These are most likely not simply due to the small cations (Na, K) but require additional moieties, such as formate or acetate, which can be specified via *adduct_as_compound_must_have*; the mass range can also be restricted.

There is a cell to summarize the date and conditions used and this information is also written as the first line, prefixed with #, in output files.

In [2]:
from collections import defaultdict
from itertools import groupby
import datetime
import os

Class and function definitions
------------------------------
The basic entity is a 'Composition'...NB. this is not an elemental composition but simply a text label, a count, a root name and a mass. When compositions are combined, the labels are concatenated (using a specified separator character) and the masses are added.

In [3]:
from dataclasses import dataclass

@dataclass
class Composition:
    Name: str = ""
    Count: int = 1
    Mass: float=-1
    Root:  str = ""       # the composition this one is based on
    
    Mods = {}             # class-level table of mass shifts, filled in after the class definition
    
    def __init__(self, name, count, mass=None, root=None):

        self.Name = f'{name}' if count == 1 else f'({name}){count}'

        self.Count = 1    # there's only one of these even if the 'count' (really a multiplier) is greater
        self.Mass = mass if mass is not None else self.Mods[name]*count   # look up the shift when no mass is given
        
        if root:
            self.Root = root
        else:
            self.Root = name
    
    # Make the Composition from a (Name, Count) tuple
    @classmethod
    def from_tuple(cls, t):
        return Composition(t[0],t[1])

    # make a composition from a list of (Name,Count)tuples
    @classmethod
    def from_tuple_list(cls, t_list):
        comp = None
        
        for t in t_list:           
            if not comp:               
                comp = Composition.from_tuple(t)    #create a comp from the first in the list so we can append others to it
            else:
                comp2 = Composition.from_tuple(t)
                comp = comp.add_comp(comp2, sep='.')
                
        return comp
    
    @classmethod
    def proton(cls):
        return Composition('H+', 1, 1.00727)
    
    def protonate(self):
        return self.add_comp(Composition('H+', 1, 1.00727), sep='.')
    
    def deprotonate(self):
        return self.add_comp(Composition('[-H+]-', 1, -1.00727), sep='.')

    def make_copy(self, mult=1):
        return Composition(self.Name, self.Count*mult, self.Mass*mult, self.Root)
    
    def label(self):
        return self.Name
    
    # Merge two compositions to generate a new one with a mass
    def add_comp(self, comp1, sep='_'):
        new_name = self.label() + sep + comp1.label()
        new_mass = self.Mass + comp1.Mass
        return Composition( new_name, 1, root=self.Root, mass=new_mass)

#provide the set of allowed changes - this can be extended as needed
Composition.Mods = {'OH':15.99492,
        'COOH':29.97418,     # CH3 -> COOH, i.e. +O2, -H2
        'gluc':176.032088,
        'sulphate':79.956815,
        'NH3':17.026549,     #adducts from here
        'Na-H':21.981944,
        'K-H':37.955881,
        'K*H': 39.9540,     #41K - H
        'Ca-2H': 37.946941,
        'H2O':-18.010565,
        'NaAc': 82.003074,
        'NaFo': 67.987424,     # sodium formate
        'KFo': 83.961361,
        'C2H4O2':60.021129,
        'CH2O2':46.005479,
        'CHO2': 44.997654,
        'CO2':-43.989829,
        'Am':-17.026549,
        'Rib':-132.0425,
        'RibNH2':-149.0691,
        'HCOOH':-46.005479,
                }
 
print('Proton: ', Composition.proton())

a = Composition('Na-H',2)
print('Normal init:', a)

b = Composition.from_tuple(('K-H',2))
print('From tuple:', b)

ab = a.add_comp(b, sep='.')
print('From merge:', ab)

t_list = [('Na-H',2),('K-H',2), ('NH3', 1)]
abc = Composition.from_tuple_list(t_list)
print('From tuple list:', abc)

Proton:  Composition(Name='H+', Count=1, Mass=1.00727, Root='H+')
Normal init: Composition(Name='(Na-H)2', Count=1, Mass=43.963888, Root='Na-H')
From tuple: Composition(Name='(K-H)2', Count=1, Mass=75.911762, Root='K-H')
From merge: Composition(Name='(Na-H)2.(K-H)2', Count=1, Mass=119.87565, Root='Na-H')
From tuple list: Composition(Name='(Na-H)2.(K-H)2.NH3', Count=1, Mass=136.902199, Root='Na-H')
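
As a rough cross-check of the `Mods` table, several of the mass shifts can be recomputed from monoisotopic element masses. This is only a sketch: the element masses are standard values rounded to six decimals and are not taken from the notebook, and `ELEMENT`, `shift` and `checks` are hypothetical names introduced for illustration.

```python
# Monoisotopic element masses (standard values, rounded; an assumption, not from the notebook)
ELEMENT = {'H': 1.007825, 'C': 12.000000, 'N': 14.003074,
           'O': 15.994915, 'Na': 22.989770, 'K': 38.963707}

def shift(gained, lost=None):
    """Mass change for gaining and losing the given {element: count} dicts."""
    lost = lost or {}
    plus = sum(ELEMENT[e] * n for e, n in gained.items())
    minus = sum(ELEMENT[e] * n for e, n in lost.items())
    return plus - minus

checks = {
    'OH':   shift({'O': 1}),                    # hydroxylation is a net +O
    'COOH': shift({'O': 2}, {'H': 2}),          # CH3 -> COOH is +O2, -H2
    'gluc': shift({'C': 6, 'H': 8, 'O': 6}),    # glucuronidation, +C6H8O6
    'NH3':  shift({'N': 1, 'H': 3}),
    'Na-H': shift({'Na': 1}, {'H': 1}),         # Na replaces a labile H
    'K-H':  shift({'K': 1}, {'H': 1}),
    'H2O':  -shift({'H': 2, 'O': 1}),           # losses are stored as negative shifts
    'CO2':  -shift({'C': 1, 'O': 2}),
}

for name, value in checks.items():
    print(f'{name:5s} {value:12.6f}')
```

The recomputed values should agree with the table entries to within about 1e-5; a larger difference would suggest a typing error in a newly added modification.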


In [4]:
# recursive routine to find combinations
def get_combs(maxima, item_count, pos, seed, take, res):
    """
    The idea is that the number of each composition to evaluate can be written as a list of integers, e.g. [1,0,0], [0,1,0].
    We process each entry successively, setting the value to the number we need to take or the maximum allowed for that entry;
    the number to take for the subsequent entry is based on the number remaining from the first. E.g. if we are to take 5 and 3
    are used for the first entry, we pass 2 to the next. We stop when take gets to 0 or when we run out of entries.
    """
    if take == 0:
        return
    elif take > maxima[pos]:
        this_take = maxima[pos]
    else:
        this_take = take
    
    while this_take >= 0:
        
        # clear the rest of the seed and set this position's value
        for i in range(pos, len(seed)): seed[i] = 0
        seed[pos] = this_take

        # set up for next level
        next_take = take - this_take
        next_pos = pos + 1
        
        if not next_take: # or next_pos == item_count:        # nothing more to add, so save a copy of the current seed
            res.append(list(seed))   # copy the seed
        elif next_pos == item_count:
            break
        else:    
            get_combs(maxima, item_count, next_pos, seed, next_take, res)
        
        this_take -=1


def make_combinations(limit_list, max_combinations):
    """
    Sets up for the recursive routine by getting and cleaning the list of limits, generating a list of integers 
    corresponding to the maximum number of each composition, and calling get_combs with take counts of 1, 2, 3...max
    """
    
    # first we make sure the compositions are unique and limits are non-zero
    # this is needed because the user may specify the same composition more than once, which would
    # cause it to be treated as a separate limit
    
    cleaned = defaultdict(int)
    
    # create a dictionary of {comp:limit}; if the comp is already present the limit will be added
    for (c,l) in limit_list:
        if l > 0:
            cleaned[c] += l
    
    # convert the cleaned dict to a list and then into lists of comps and maxima   
    clean_list = [(c, cleaned[c]) for c in cleaned]    
    comps, limits = zip(*clean_list)
    
    # get a list of combinations; each combination is a list of the counts for the composition at that index
    item_count = len(limits)  # number of entries in the limit list
    seed = [0]*item_count
    
    res = []   # this will hold the lists of integers representing the count of each Composition
    
    # take 1, 2, 3...max_combinations items and append to res[]
    for take in range(1, max_combinations+1):
        get_combs(limits, item_count, 0, seed, take, res)
    
    # finally generate a list of the actual compositions, i.e. [('x',2), ('y',3)] etc.
    # by combining the compositions and each list of counts
    combs=[]
    
    for r in res:
        c = [(comps[i], r[i]) for i in range(item_count) if r[i] > 0]
        combs.append(c)
        
    return combs,','.join(comps)

# Note: x is present twice
combs, comps_as_str = make_combinations([('x', 2), ('y', 2), ('z', 2), ('x',1)], 3)

print(len(combs), 'should be 17')
print(comps_as_str)
# for c in combs:
#     print(c)

17 should be 17
x,y,z
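
The count of 17 can be confirmed independently by brute force. The sketch below is not the notebook's recursive routine: it enumerates every count vector with `itertools.product` and keeps those whose total lies between 1 and the maximum. `brute_force_combinations` is a hypothetical helper, intended only as a check for small inputs.

```python
from itertools import product

def brute_force_combinations(limits, max_total):
    """Enumerate count vectors by brute force (for checking only)."""
    # merge duplicate names by summing their limits, as make_combinations does
    merged = {}
    for name, limit in limits:
        if limit > 0:
            merged[name] = merged.get(name, 0) + limit
    names = list(merged)
    ranges = [range(merged[n] + 1) for n in names]
    # keep every vector whose total is between 1 and max_total
    return [dict(zip(names, counts)) for counts in product(*ranges)
            if 1 <= sum(counts) <= max_total]

found = brute_force_combinations([('x', 2), ('y', 2), ('z', 2), ('x', 1)], 3)
print(len(found))   # 17
```

Brute force visits the product of all (limit + 1) values, so it grows quickly with the number of adducts; the recursive routine avoids generating the vectors that exceed the total.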


In [21]:
def add_mods(compounds, limits, sep='_', update_root=False):
    """
    Adds modifications to each compound in the list returning the new compound list.
    The modifications are provided as a list of (mod, max count) tuples.
    By default the root is not updated, so it stays the same as the original compound, but if update_root is True it
    is changed to the new compound. This allows the root to reflect the compounds at a different level, e.g. after phase 1
    """
    mods = []

    # Make the compounds by copying the base and adding the possible mods
    for c in compounds:
        for l in limits:
            for i in range(l[1]):
                new_comp = c.make_copy().add_comp(Composition(l[0], i+1), sep=sep)
                
                if update_root:
                    new_comp.Root = new_comp.Name
                mods.append(new_comp)
                #print(new_comp)

    compounds += mods
    
    return compounds


In [22]:
# convert compositions to a printable string
def limits_as_string(limits):
    """
    Converts the composition limits for a particular type (adducts, losses, phase 1...) to a string.
    Compositions can be switched off by setting the limit to zero, so we skip those
    """
    non_zero_limits = [l for l in limits if l[1] > 0]  # a list of compositions with limit > 0
    
    if len(non_zero_limits) == 0:
        return ""
    else:
        desc = ",".join([f'{l}' for l in non_zero_limits])
        return desc

In [23]:
# generates a unique file name given the parameters and a string representing the date
def get_ouput_file_name(comp_names, ionization, time_str, xic_width):
    """
    Generates a file name based on the compounds used (as a string Comp1_comp2.. etc.) and the polarity
    with additions indicating the file is intended to extract XICs in PeakView and the date/time if
    required; the format used by the main code is YYMMDD_HHMMSS
    """
    polarity = 'neg' if ionization == "negative" else 'pos'

    wants_xic = xic_width > 0

    base_name = f'{comp_names} ions {polarity}'

    if wants_xic:
        base_name += ' xic'
        
    if include_date_in_file_name:
        base_name += ' ' + time_str
    
    return wants_xic, base_name + '.txt'

Setup
-----

Provide the  base compound information and other parameters.
The base compounds are supplied as a list of (name, mass) tuples.
As shown below, an entry need not be a known compound; it can be an observed but unexplained peak, so that its potential derivatives are generated.

All user-defined parameters are set here so, once they are set, the code can be executed with "Run selected cell and all below".
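
Because the ionization step adds or removes a proton, an unexplained peak is entered as a neutral mass rather than as the observed m/z. A minimal sketch of that conversion, assuming a singly charged ion; `neutral_mass` is a hypothetical helper and `PROTON` repeats the value used by `Composition.proton()`.

```python
PROTON = 1.00727   # same value as used by Composition.proton()

def neutral_mass(observed_mz, ionization='positive'):
    """Convert a singly charged [M+H]+ or [M-H]- m/z to a neutral mass."""
    if ionization == 'negative':
        return observed_mz + PROTON
    return observed_mz - PROTON

# the unknown observed at 679.5115 in positive mode
print(round(neutral_mass(679.5115), 4))   # 678.5042
```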

In [24]:
# define the path for data files
# This allows the Calculator and Match notebooks to easily share data and works on Colab
try:
    from google.colab import drive
    
    print('Using Colab')
    
    drive.mount('/content/drive')

    data_path = os.sep + os.path.join('content', 'drive', 'MyDrive', 'SharedData')
except ImportError:     # google.colab is only importable when running on Colab
    print('Not using Colab')
    data_path = os.sep + os.path.join('Users','ronbonner','Data', 'SharedData')

file_path = os.path.join(data_path, '200325 SJ C18 Pos A, C001-U.txt')

print(file_path)

Not using Colab
/Users/ronbonner/Data/SharedData/200325 SJ C18 Pos A, C001-U.txt


In [54]:

# Define the compound(s) we want to work with
#
# base_compounds = [('Guan', 283.091669),  # Guanosine
#                 ('F', 151.0489),       # fragment-H+
#                 ('x678', 678.5042),    # unknown at 679.5115 - H+
#                 ('x230', 229.24033),   # formic acid
#                 ]     

base_compounds = [('Ibu', 206.1307), ('x544', 544.2148)]   #Identifier + MW
#base_compounds = [('Vinpo', 350.1994),('Apo_vinpo', 322.1681)]  #Apo_vinpo is vinpo-C2H4
#base_compounds = [('x544', 544.2148+1.00727)] 

multimer_limit = 2              # maximum multimer count
max_adduct_count = 6             # total number of adducts allowed
ionization = 'positive'          # only 'negative' changes the settings...anything else is 'positive'
include_hetero_dimers = False     # if True, calculate dimers of *different* compounds

include_adducts_as_compounds = False  # add the ionized adducts to the ion list subject to the following conditions
max_adduct_as_compound_mass = 800
adduct_as_compound_must_have = 'CH2O2'

# Output parameters

output_mass_limit = 1000    # masses greater than this are not written to the file
xic_width = 0.0            # if 0 the normal output form is used...alternative, e.g. 0.01, to generate the PeakView compatible form

save_ion_list = False
write_locally = False       # write to the same location as the notebook (useful for Colab); if 'False' a file path is generated
include_date_in_file_name = False

# Define the limits for metabolites and adducts...
# Defining this way is not required but allows metabolite and adduct sets to be easily changed depending on polarity.
# Unwanted compositions can be removed or the limit can be set to zero
phase1_limits = [('OH', 2), ('COOH', 1)]  # metabolite modifications - phase 1

if ionization == 'negative':
    phase2_limits = [('gluc', 1), ('sulphate', 0)]
    adduct_limits = [('Na-H', 2), ('K-H', 1), ('C2H4O2',2), ('CH2O2', 2)]  
    loss_limits = [('H2O',0), ('CO2',0)]
else:
    phase2_limits = [('gluc', 1)]
    adduct_limits = [('Na-H', 3), ('K-H', 1), ('K*H',0), ('NH3',1), ('C2H4O2',2) ]    #, ('Na-H', 8), ('K-H', 8), ('K*H', 8)]
    loss_limits = [('H2O',2), ('HCOOH', 1), ('Am', 0), ('Rib', 0), ('RibNH2',0)]

Step 1 - Adduct generation
---------------------------

Generate a list of possible adduct forms by generating all combinations of adducts (up to the specified limit) and selecting the unique forms (i.e. as far as we are concerned, a+b+a is the same as a+a+b). These will be added to each compound. Note: this approach would also work if we wanted to allow combinations of the metabolites.

In [55]:
adduct_combs, comps_as_str = make_combinations(adduct_limits, max_adduct_count)
    
adduct_comps = [Composition.from_tuple_list(c) for c in adduct_combs]
adduct_comps = sorted(adduct_comps, key=lambda x: x.Mass)

print(len(adduct_comps),'adduct forms')
print(comps_as_str)

# for a in adduct_comps[:75]:
#     print(a)

46 adduct forms
Na-H,K-H,NH3,C2H4O2


Step 2 - Compound generation
-----------------------------

We convert the base compound list to a list of compositions and then successively apply the various modifications, generating extended compound lists, as follows:

- phase 1
- phase 2

Then calculate the dimers and heterodimers (if desired)

In [56]:
# Make the compounds by copying the base and adding the possible mods

compounds = [Composition(name, 1, mass) for name, mass in base_compounds]
        
compounds = add_mods(compounds, phase1_limits, update_root=True)
print(len(compounds), 'compounds after phase 1')

compounds = add_mods(compounds, phase2_limits, update_root=True)
print(len(compounds), 'after phase 2')

multimers = []

for c in compounds:
    for m in range(2, multimer_limit+1):
        new_comp = c.make_copy(m)
        multimers.append(new_comp)

if include_hetero_dimers:
    for i, c in enumerate(compounds):
        for j in range(i+1, len(compounds)):
            new_comp = c.make_copy()
            new_comp_2 = compounds[j].make_copy()
            new_comp = new_comp.add_comp(new_comp_2, sep='+')
            multimers.append(new_comp)
    
compounds += multimers

print(len(compounds), 'with multimers')

compounds = add_mods(compounds, loss_limits, sep='-')
print (len(compounds), 'after losses')

# for c in compounds:
#     print(c)

8 compounds after phase 1
16 after phase 2
32 with multimers
128 after losses


Step 3 - Generate ion forms
----------------------------

We now add all the adduct forms to each of the compounds. The approach relies on adducts being formed by replacing labile protons, so the adduct forms are independent of the polarity; the final form is determined by providing a charge agent, i.e. adding or subtracting a proton.
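
The arithmetic behind an ion mass can be sketched as follows: the neutral mass, plus each modification shift times its count, plus or minus one proton. This is an illustration rather than the notebook's code; `ion_mass` and `MODS` are hypothetical names, and the shifts are copied from `Composition.Mods`.

```python
PROTON = 1.00727
# mass shifts copied from Composition.Mods
MODS = {'Na-H': 21.981944, 'NH3': 17.026549, 'HCOOH': -46.005479}

def ion_mass(base_mass, mods, ionization='positive'):
    """Sum a neutral mass, (name, count) modifications and the charge agent."""
    mass = base_mass + sum(MODS[name] * count for name, count in mods)
    return mass - PROTON if ionization == 'negative' else mass + PROTON

ibu = 206.1307
print(f'{ion_mass(ibu, []):.4f}')                            # Ibu.H+
print(f'{ion_mass(ibu, [("Na-H", 1)]):.4f}')                 # Ibu.Na-H.H+, i.e. [M+Na]+
print(f'{ion_mass(ibu, [("HCOOH", 1), ("Na-H", 2)]):.4f}')   # Ibu-HCOOH.(Na-H)2.H+
print(f'{ion_mass(ibu, [], "negative"):.4f}')                # Ibu.[-H+]-
```

Because `Na-H` already subtracts a hydrogen, adding the proton afterwards gives the familiar sodiated ion in positive mode, and the same adduct list can be reused unchanged in negative mode.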


In [57]:
ion_forms = []  

# If desired we add the adducts that meet the mass and label requirements.
# we do that here, rather than add them to the compound list, so the adducts aren't added to themselves
if include_adducts_as_compounds:
    
    adduct_comps = sorted(adduct_comps, key = lambda x: x.Mass)

    adducts_to_add = [a for a in adduct_comps if a.Mass <= max_adduct_as_compound_mass \
                              and adduct_as_compound_must_have in a.Name]
    
    print(len(adducts_to_add),'ionized adduct forms')
    
    for a in adducts_to_add:
        new_comp = a.make_copy()
        if ionization == 'negative':
            new_comp = new_comp.deprotonate()
        else:
            new_comp = new_comp.protonate()
        ion_forms.append(new_comp)

# now we add each compound on its own and then with the adducts
for c in compounds:
    
    # add the base compound, with a proton added or subtracted depending on the ionization mode
    new_comp = c.make_copy()
    if ionization == 'negative':
        new_comp = new_comp.deprotonate()
    else:
        new_comp = new_comp.protonate() 
        
    ion_forms.append(new_comp)   
    
    # then add the adduct forms
    for a in adduct_comps:
        new_comp = c.make_copy().add_comp(a, sep='.')
        if ionization == 'negative':
            new_comp = new_comp.deprotonate()
        else:
            new_comp = new_comp.protonate()
        ion_forms.append(new_comp)       
        
print(len(ion_forms), 'ion forms')

6016 ion forms
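
The counts printed in Steps 2 and 3 can be reproduced by simple arithmetic, which is a useful sanity check before enabling options that enlarge the list. The sketch assumes the settings used in this run (heterodimers off, 46 adduct forms from Step 1); `count_after_mods` is a hypothetical helper.

```python
def count_after_mods(n_compounds, limits):
    """Each compound is kept and gains one variant per mod for each count 1..limit."""
    return n_compounds * (1 + sum(limit for _, limit in limits))

multimer_limit = 2
n_adduct_forms = 46                                     # printed in Step 1

n = 2                                                   # Ibu and x544
n = count_after_mods(n, [('OH', 2), ('COOH', 1)])       # phase 1 -> 8
n = count_after_mods(n, [('gluc', 1)])                  # phase 2 -> 16
n += n * (multimer_limit - 1)                           # dimers of each compound -> 32
n = count_after_mods(n, [('H2O', 2), ('HCOOH', 1)])     # losses -> 128
ions = n * (1 + n_adduct_forms)                         # bare ion plus every adduct form

print(n, ions)   # 128 6016
```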


Step 4 - Summarize results and conditions
-----------------------------------------

In [58]:
# summarize calculations

current_time = datetime.datetime.now()

time_str = current_time.strftime('%y%m%d_%H%M%S')

comp_names = '_'.join([f'{c}' for (c,m) in base_compounds])

if include_adducts_as_compounds:  # add the ionized adducts to the ion list subject to the following conditions
    comp_names += f'_{adduct_as_compound_must_have}'

print (time_str)
cond_str = f'Time:{time_str}'

print('Compounds:', comp_names)
cond_str += f';Compounds:{comp_names}'

if multimer_limit > 1:
    print(f'Up to {multimer_limit} multimers')
    cond_str += f';Multimer_limit:{multimer_limit}'

if include_hetero_dimers:
    print(f'Include heterodimers')
    cond_str += f';Heterodimers:True'

print(f'{ionization} mode')
cond_str += f';Polarity:{ionization}'

desc = limits_as_string(phase1_limits)
if desc:
    print(f'Phase 1: {desc}')
    cond_str += f';Phase_1:{desc}'


desc = limits_as_string(phase2_limits)
if desc:
    print(f'Phase 2: {desc}')
    cond_str += f';Phase_2:{desc}'

desc = limits_as_string(adduct_limits)
if desc:
    print(f'Adducts: {desc}, max count = {max_adduct_count}')
    cond_str += f';Adducts:{desc}'

desc = limits_as_string(loss_limits)
if desc:
    print(f'Losses: {desc}')  
    cond_str += f';Losses:{desc}'

if include_adducts_as_compounds:
    desc = f'{len(adducts_to_add)} adducts with'
    desc += f' mass < {max_adduct_as_compound_mass}'
    desc += f' and {adduct_as_compound_must_have} in name included'
    print(desc)
    cond_str += f';Adducts_as_comps:{len(adducts_to_add)}'
    cond_str += f';Add_comps_mass_limit:{max_adduct_as_compound_mass}'
    cond_str += f';Add_comps_must_have:{adduct_as_compound_must_have}'
          
print(len(ion_forms), 'ion forms')


print(cond_str)

201122_082656
Compounds: Ibu_x544
Up to 2 multimers
positive mode
Phase 1: ('OH', 2),('COOH', 1)
Phase 2: ('gluc', 1)
Adducts: ('Na-H', 3),('K-H', 1),('NH3', 1),('C2H4O2', 2), max count = 6
Losses: ('H2O', 2),('HCOOH', 1)
6016 ion forms
Time:201122_082656;Compounds:Ibu_x544;Multimer_limit:2;Polarity:positive;Phase_1:('OH', 2),('COOH', 1);Phase_2:('gluc', 1);Adducts:('Na-H', 3),('K-H', 1),('NH3', 1),('C2H4O2', 2);Losses:('H2O', 2),('HCOOH', 1)
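
Since the condition string is a `;`-separated list of `key:value` fields, it can be turned back into a dictionary when an output file is read later. A minimal sketch, assuming no compound name contains `;`; `parse_conditions` is a hypothetical helper and the example line is a shortened copy of the output above.

```python
import ast

def parse_conditions(line):
    """Split a 'key:value;key:value' condition line into a dict."""
    text = line.lstrip('#').strip()
    fields = {}
    for item in text.split(';'):
        key, _, value = item.partition(':')   # only the first ':' separates key and value
        fields[key] = value
    return fields

example = "# Time:201122_082656;Compounds:Ibu_x544;Multimer_limit:2;Polarity:positive;Phase_1:('OH', 2),('COOH', 1)"
conditions = parse_conditions(example)
print(conditions['Polarity'])   # positive

# the limit fields are Python tuple literals, so they can be recovered as a list
phase1 = ast.literal_eval('[' + conditions['Phase_1'] + ']')
print(phase1)                   # [('OH', 2), ('COOH', 1)]
```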


Step 5 - Save the mass/name list
--------------------------------

Optionally save the ion forms as a simple tab delimited text file.
- the main format is: mass, root, label
- an additional format: mass, xic width, name is intended to be used with PeakView Extract XIC (by importing it)

The list can also be truncated to an upper mass limit.

To be sure the file exists, we re-open it and count the number of lines.
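
For reference, a file in the main format (mass, root, label) could be read back and grouped by root along the following lines. This is a sketch under stated assumptions: `read_ion_list` is a hypothetical helper, the sample text is an abbreviated stand-in for a real output file, and the PeakView form (mass, xic width, name) would need the middle column handled differently.

```python
import io
from collections import defaultdict

def read_ion_list(handle):
    """Read the main output format: a '#' condition line, then mass, root, label."""
    conditions = None
    by_root = defaultdict(list)
    for line in handle:
        line = line.rstrip('\n')
        if not line:
            continue
        if line.startswith('#'):
            conditions = line[1:].strip()
            continue
        mass, root, label = line.split('\t')
        by_root[root].append((float(mass), label))   # float() ignores the padding spaces
    return conditions, dict(by_root)

# abbreviated sample in the same layout as the saved file
sample = io.StringIO(
    "# Time:201122_082656;Compounds:Ibu_x544\n"
    "  161.1325\tIbu\tIbu-HCOOH.H+\n"
    "  177.1274\tIbu_OH\tIbu_OH-HCOOH.H+\n"
    "  189.1274\tIbu\tIbu-H2O.H+\n"
)
conditions, by_root = read_ion_list(sample)
print(conditions)
print(by_root['Ibu'])
```

Grouping on the root column is what makes the `Root` field useful when several base compounds, unknowns or fragments share one file.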

In [65]:
# Set up file names and paths...

if save_ion_list:
    
    wants_xic, out_name = get_ouput_file_name(comp_names, ionization, time_str, xic_width)

    line_count = 1      # first line is conditions

    ion_forms = sorted(ion_forms, key=lambda x: x.Mass)

    output_path = out_name if write_locally else os.path.join(data_path, out_name)

    print (output_path)

    with open(output_path, 'w') as f:

        print('#',cond_str, file=f)

        for ion in ion_forms:

            if ion.Mass > output_mass_limit: 
                break       

            if wants_xic:
                print(f'{ion.Mass:10.4f}\t{xic_width}\t{ion.Name}', file=f)
            else:
                print(f'{ion.Mass:10.4f}\t{ion.Root}\t{ion.Name}', file=f)

            line_count += 1

    print(time_str) 
    print(line_count, 'lines written to', output_path)

    with open(output_path, 'r') as f:   # the with block closes the file on exit
        lines_read = f.readlines()

    print(len(lines_read), 'read')

    if lines_read[0][0] == '#':
        print("Conditions:")
        print(lines_read[0][1:])
else:
    for ion in sorted(ion_forms, key=lambda x:x.Mass):        #sort list by mass
        print(f'{ion.Mass:12.4f}     {ion.Root:14} {ion.Name}')

    161.1325     Ibu            Ibu-HCOOH.H+
    171.1168     Ibu            Ibu-(H2O)2.H+
    177.1274     Ibu_OH         Ibu_OH-HCOOH.H+
    178.1590     Ibu            Ibu-HCOOH.NH3.H+
    183.1144     Ibu            Ibu-HCOOH.Na-H.H+
    187.1118     Ibu_OH         Ibu_OH-(H2O)2.H+
    188.1434     Ibu            Ibu-(H2O)2.NH3.H+
    189.1274     Ibu            Ibu-H2O.H+
    191.1067     Ibu_COOH       Ibu_COOH-HCOOH.H+
    193.0988     Ibu            Ibu-(H2O)2.Na-H.H+
    193.1223     Ibu_(OH)2      Ibu_(OH)2-HCOOH.H+
    194.1540     Ibu_OH         Ibu_OH-HCOOH.NH3.H+
    199.0884     Ibu            Ibu-HCOOH.K-H.H+
    199.1094     Ibu_OH         Ibu_OH-HCOOH.Na-H.H+
    200.1410     Ibu            Ibu-HCOOH.Na-H.NH3.H+
    201.0910     Ibu_COOH       Ibu_COOH-(H2O)2.H+
    203.1067     Ibu_(OH)2      Ibu_(OH)2-(H2O)2.H+
    204.1383     Ibu_OH         Ibu_OH-(H2O)2.NH3.H+
    205.0964     Ibu            Ibu-HCOOH.(Na-H)2.H+
    205.1223     Ibu_OH         Ibu_OH-H2O.H+
    2