In [1]:
!jt -t grade3 -tfs 12

CompoundCalculator 20200816
===========================
Given a base compound name and molecular weight, this calculates possible derivatives (e.g. metabolites) in 2 phases, and adds adducts and losses resulting in a list of possible ion masses with labels.
The intention is that this is used with the PCVG compound finding code, and the idea is that once we have a suggested identification, e.g. ibuprofen, we can calculate other potential masses and look for matches with the ions we've found. This eliminates a lot of manual calculations and allows interactive investigation, e.g. if we find an adduct that corresponds to addition of C2H3O2Na we can quickly incorporate it into our lists and see if it shows up elsewhere. It also allows us to look for forms that we might not otherwise consider.
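
As a rough illustration of the matching step, the sketch below compares calculated ion masses with observed m/z values using a ppm tolerance. The function name `match_ions`, the 10 ppm default and the observed values are assumptions made for this example; they are not taken from the PCVG code.

```python
def match_ions(calculated, observed, tol_ppm=10.0):
    """Return (observed_mz, label, error_ppm) for each calculated ion within tol_ppm of an observed m/z."""
    matches = []
    for mz in observed:
        for mass, label in calculated:
            error_ppm = (mz - mass) / mass * 1e6
            if abs(error_ppm) <= tol_ppm:
                matches.append((mz, label, round(error_ppm, 1)))
    return matches

# Two calculated ions copied from the output listing at the end of this notebook
calculated = [(323.1754, 'Vinpo-C2H4.H+'), (333.1961, 'Vinpo-H2O.H+')]
observed = [323.1760, 400.0000]   # hypothetical observed m/z values

matches = match_ions(calculated, observed)
print(matches)
```

Only the first observed value falls within the tolerance of a calculated ion; the second has no match and is simply not reported.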

To do
----
- Add UI - a form?
- Clean up and standardize code
- Refactor the Composition class to use Count and include adduct, etc.


Changes
-------

A new modification level - base_mods - has been added to allow for specific changes to the base compound. For example, vinpocetine shows a loss of C2H4, possibly following oxidations, which then undergoes further phase 1 and phase 2 modifications. Base_mods are used to create additional compounds before any other modifications are considered.
The output can now include heterodimers, which are calculated by generating new compounds from all pairs of modified compounds, i.e. after the phase 1 and phase 2 modifications have been generated but prior to generating the final list with adducts. NOTE: this can generate a lot of extra entries!
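
To show why the pair generation grows so quickly, here is a minimal sketch that builds heterodimers from a short list of (label, mass) pairs. The compound 'A' and its masses are placeholders for illustration; the '+' separator follows the one used for heterodimers later in this notebook.

```python
from itertools import combinations

# Hypothetical (label, mass) pairs standing in for the modified compounds
modified = [('A', 100.0), ('A-OH', 115.99492), ('A-gluc', 276.032088)]

# Every unordered pair of distinct compounds gives one heterodimer
heterodimers = [(f'{n1}+{n2}', m1 + m2)
                for (n1, m1), (n2, m2) in combinations(modified, 2)]

for name, mass in heterodimers:
    print(f'{mass:10.4f}\t{name}')

# n compounds give n*(n-1)/2 pairs, so the list grows quadratically
print(len(heterodimers), 'heterodimers from', len(modified), 'compounds')
```

With 18 modified compounds the same rule gives 18*17/2 = 153 heterodimers.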

In [2]:
from itertools import combinations
from itertools import groupby

Class and function definitions
------------------------------
The basic entity is a 'Composition'. NB: this is not an elemental composition but simply a text label, a count and a mass. When compositions are combined, the labels are concatenated (using a specified separator character) and the masses are added.

In [3]:
from dataclasses import dataclass

@dataclass
class Composition:
    Name: str = ""
    Count: int = 1
    Mass: float=-1
    
    Mods = {}    # modification name -> mass change; populated after the class definition
    
    def __init__(self, name, count, mass=None):
        self.Name = f'{name}' if count == 1 else f'{count}({name})'
        self.Count = 1    # there's only one of these even if the 'count' (really a multiplier) is greater
        self.Mass = mass if mass is not None else self.Mods[name]*count    # look up the mass only when none is given
    
    # Make the Composition from a (Name, Count) tuple
    @classmethod
    def from_tuple(cls, t):
        return Composition(t[0],t[1])

    # Make a composition from a list of (Name, Count) tuples
    @classmethod
    def from_tuple_list(cls, t_list):
        comp = None
        
        for t in t_list:           
            if not comp:               
                comp = Composition.from_tuple(t)    # create a comp from the first in the list so we can append others to it
            else:
                comp2 = Composition.from_tuple(t)
                comp = comp.add_comp(comp2, sep='.')
                
        return comp
    
    @classmethod
    def proton(cls):
        return Composition('H+', 1, 1.00727)
    
    def protonate(self):
        return self.add_comp(Composition('H+', 1, 1.00727), sep='.')
    
    def deprotonate(self):
        return self.add_comp(Composition('[-H+]-', 1, -1.00727), sep='.')

    def make_copy(self, mult=1):
        return Composition(self.Name, self.Count*mult, self.Mass*mult)
    
    def label(self):
        return self.Name
    
    # Merge two compositions to generate a new one with a mass
    def add_comp(self, comp1, sep='-'):
        new_name = self.label() + sep + comp1.label()
        new_mass = self.Mass + comp1.Mass
        return Composition( new_name, 1, mass=new_mass)
    
    # expands a composition by generating a list of Name repeated Count times
    def expand(self):
        result = [self.Name for n in range(self.Count)]
        return result

Composition.Mods = {'OH':15.99492,
        'COOH':29.97418,     # COOH is CH3 -> COOH, i.e. +O2, -H2
        'gluc':176.032088,
        'sulphate':79.956815,
        'NH3':17.026549,     #adducts from here
        'Na-H':21.981944,
        'K-H':37.955881,
        'Ca-2H': 37.946941,
        'H2O':-18.010565,
        'NaAc': 82.003074,
        'NaFo': 67.987424,     # sodium formate
        'C2H4O2':60.021129,
        'CH2O2':46.004931,
        'CO2':-43.989829,
        'C2H4': -28.0313}
 
print('Proton: ', Composition.proton())

a = Composition('Na-H',2)
print('Normal init:', a)

b = Composition.from_tuple(('K-H',2))
print('From tuple:', b)

ab = a.add_comp(b, sep='.')
print('From merge:', ab)

print('Expanded:', a.expand())

t_list = [('Na-H',2),('K-H',2), ('NH3', 1)]
abc = Composition.from_tuple_list(t_list)
print('From tuple list:', abc)

Proton:  Composition(Name='H+', Count=1, Mass=1.00727)
Normal init: Composition(Name='2(Na-H)', Count=1, Mass=43.963888)
From tuple: Composition(Name='2(K-H)', Count=1, Mass=75.911762)
From merge: Composition(Name='2(Na-H).2(K-H)', Count=1, Mass=119.87565)
Expanded: ['2(Na-H)']
From tuple list: Composition(Name='2(Na-H).2(K-H).NH3', Count=1, Mass=136.902199)


In [4]:
def make_combinations(limit_list, max_combinations):
    """Given a list of limits as tuples (comp, upper_limit), return all combinations to a given maximum value
    """

    entities = []

    # Use the limit_list to generate an expanded list of individual entities, i.e. [('X', 2), ('Y', 2)] -> X, X, Y, Y
    for l in limit_list:
        for n in range(l[1]):
            entities.append(l[0])

    entity_combinations = []

    # Now find all combinations of 1 entity, 2 entities...to the max number required
    # This will include duplicates, e.g. (x, y) appears once for every pairing of an x with a y
    for i in range(1, max_combinations + 1): 
        entity_combinations += list(combinations(entities, i))

    # making this into a set will find the unique combinations.
    # initially each combination tuple was sorted to make sure they were canonicalized, but this isn't needed:
    # combinations() preserves the input order and identical entities are adjacent in the expanded list
    # i.e. combs = [tuple(sorted(c)) for c in entity_combinations]; csl = set(combs)
    
    csl=set(entity_combinations)   

    csa = []

    # we convert these back into the form [('x', 2), ('y', 1)] by grouping the elements of each combination
    # and recording the element and its count...Note each group has to be converted to a list for this to work
    for c in csl:
        csa.append( [(key, len(list(group))) for key, group in groupby(c)])
    
    return csa
        
combs = make_combinations([('y',3), ('x',2), ('z',2)], 3)
print(len(combs),'should be 17')   #...should be 17

17 should be 17
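
The expected value of 17 can be cross-checked with an independent approach. The sketch below is not the method used in this notebook: it enumerates the per-entity counts directly with `itertools.product` and keeps those whose total lies between 1 and the maximum. The function name `count_vectors` is made up for this example.

```python
from itertools import product

def count_vectors(limit_list, max_total):
    """Enumerate per-entity counts directly, keeping totals from 1 to max_total."""
    names = [name for name, _ in limit_list]
    ranges = [range(upper + 1) for _, upper in limit_list]
    results = []
    for counts in product(*ranges):
        if 1 <= sum(counts) <= max_total:
            # keep only the entities that are actually present
            results.append([(n, c) for n, c in zip(names, counts) if c > 0])
    return results

check = count_vectors([('y', 3), ('x', 2), ('z', 2)], 3)
print(len(check))   # -> 17
```

This avoids generating and then discarding duplicates, at the cost of visiting every count vector, including those over the total limit.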


In [5]:
def add_mods(compounds, limits):
    """
    Adds modifications to each compound in the list returning the new compound list.
    The modifications are provided as a list of (mods, max count) tuples
    """
    mods = []

    # Make the compounds by copying the base and adding the possible mods
    for c in compounds:
        for l in limits:
            for i in range(l[1]):
                new_comp = c.make_copy().add_comp(Composition(l[0], i+1))
                mods.append(new_comp)
                #print(new_comp)

    compounds += mods
    
    return compounds


Setup
-----

Provide the initial parameters - base compound information and other global values.
The base compound is formed from a name and a mass.
As shown below, the mass need not be that of a known compound; it can be an observed, unexplained peak, so that its potential derivatives are generated.

In [6]:
#base_name, base_mass = 'Ibu', 206.1307   #Identifier + MW
base_name, base_mass = 'Vinpo', 350.1994   #Identifier + MW

# base_name = 'x543'
# base_mass = 543.2068+1.00727

multimer_limit = 3
max_adduct_count = 4 # total number of adducts allowed
ionization = 'positive'
include_hetero_dimers = True
base_mods = ['C2H4']     #[] - empty list if none

In [7]:
# Define the limits for metabolites and adducts...
# Defining this way is not required but allows the sets to be easily changed depending on polarity.

phase1_limits = [('OH', 1), ('COOH', 1)]  # metabolite modifications - phase 1

if ionization == 'negative':
    phase2_limits = [('gluc', 1)] #, ('sulphate', 1)]
    adduct_limits = [('Na-H', 3), ('K-H', 2), ('NaAc',2), ('NaFo', 1)]  #, ('NH3', 1), ('NaAc',2)
    loss_limits = [('H2O',2), ('CO2',1)]
else:
    phase2_limits = [('gluc', 2)]
    adduct_limits = [('Na-H', 3), ('K-H', 3), ('NH3', 1), ('NaAc',2), ('NaFo', 1)]
    loss_limits = [('H2O',2)]

Step 1 - Adduct generation
---------------------------

Generate a list of possible adduct forms by generating all combinations of adducts (up to the specified limit) and selecting the unique forms (i.e. as far as we are concerned, a+b+a is the same as a+a+b). These will be added to each compound. Note: this approach would also work if we wanted to allow combinations of the metabolites.

In [8]:
adduct_combs = make_combinations(adduct_limits, max_adduct_count)   # use the limit from Setup rather than a hard-coded 4
    
adduct_comps = [Composition.from_tuple_list(c) for c in adduct_combs]
adduct_comps = sorted(adduct_comps, key=lambda x: x.Mass)

print(len(adduct_comps),'adduct forms')

# for ac in adduct_comps:
#     print(ac)

76 adduct forms


Step 2 - Compound generation
-----------------------------

We successively apply the various modifications, generating extended compound lists, as follows:
- base
- phase 1
- phase 2

Then calculate the multimers (up to multimer_limit), the heterodimers (if desired), and finally apply the losses.

In [9]:
base_compound = Composition(base_name, 1, base_mass)

compounds = [base_compound]

if base_mods:
    for c in base_mods:
        new_comp = base_compound.make_copy().add_comp(Composition(c, 1))   #limited to 1
        compounds.append(new_comp)
        
# Make the compounds by copying the base and adding the possible mods

compounds = add_mods(compounds, phase1_limits)
print(len(compounds), 'after phase 1')

compounds = add_mods(compounds, phase2_limits)
print(len(compounds), 'after phase 2')

multimers = []

for c in compounds:
    for m in range(2, multimer_limit+1):
        new_comp = c.make_copy(m)
        multimers.append(new_comp)

if include_hetero_dimers:
    for i, c in enumerate(compounds):
        for j in range(i+1, len(compounds)):
            new_comp = c.make_copy()
            new_comp_2 = compounds[j].make_copy()
            new_comp = new_comp.add_comp(new_comp_2, sep='+')
            multimers.append(new_comp)
    
compounds += multimers

print(len(compounds), 'with multimers')

compounds = add_mods(compounds, loss_limits)
print (len(compounds), 'after losses')

# for c in compounds:
#     print(c)

6 after phase 1
18 after phase 2
207 with multimers
621 after losses
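
The counts above can be reproduced by simple arithmetic. The sketch below assumes the settings of this run (one base_mod, the positive-mode limits, multimer_limit = 3, heterodimers enabled); the variable names are illustrative only.

```python
from math import comb

n_base = 1 + 1                      # base compound plus one base_mod (C2H4)
n_phase1 = n_base * (1 + 1 + 1)     # unmodified, +OH, +COOH
n_phase2 = n_phase1 * (1 + 2)       # unmodified, +gluc, +2(gluc)
n_multimers = n_phase2 * (3 - 1)    # dimers and trimers (multimer_limit = 3)
n_hetero = comb(n_phase2, 2)        # all unordered pairs of distinct compounds
n_with_multimers = n_phase2 + n_multimers + n_hetero
n_after_losses = n_with_multimers * (1 + 2)   # unmodified, -H2O, -2(H2O)

# In Step 3 each compound gets the bare charge agent plus each of the 76 adduct forms
n_ions = n_after_losses * (1 + 76)

print(n_phase1, n_phase2, n_with_multimers, n_after_losses, n_ions)
# -> 6 18 207 621 47817
```

The heterodimer term (153 of the 207 compounds) dominates, which is why including heterodimers inflates the final list so much.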


Step 3 - Generate ion forms
----------------------------

We now add all the adduct forms to each of the compounds. The approach relies on adducts being formed by replacing labile protons, so they are independent of the polarity; the final form is determined by providing a charge agent, i.e. adding or subtracting protons.
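
As a worked example of this convention, the sketch below checks that 'Na-H' plus the 'H+' charge agent gives the same m/z as the conventional [M+Na]+ description. The sodium and electron masses are standard constants added here for the comparison; they do not appear elsewhere in the notebook.

```python
# Monoisotopic masses in u; NA and ELECTRON are standard values added for this check
NA = 22.989770
ELECTRON = 0.000549

base_mass = 350.1994     # Vinpo, from the Setup cell
NA_MINUS_H = 21.981944   # 'Na-H' entry in Composition.Mods
PROTON = 1.00727         # value used by protonate()

# Notebook approach: replace a labile H by Na, then add the charge agent
mz_notebook = base_mass + NA_MINUS_H + PROTON

# Conventional description: [M+Na]+ = M + Na - electron
mz_conventional = base_mass + NA - ELECTRON

print(f'{mz_notebook:.4f}  {mz_conventional:.4f}')
```

The two values differ by less than 0.0001, the small residue coming from rounding of the constants.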


In [10]:
ion_forms = []  

for c in compounds:
    
    # add the base compound, with a proton added or subtracted depending on the ionization mode
    new_comp = c.make_copy()
    if ionization == 'negative':
        new_comp = new_comp.deprotonate()
    else:
        new_comp = new_comp.protonate() 
        
    ion_forms.append(new_comp)   
    
    # then add the adduct forms
    for a in adduct_comps:
        new_comp = c.make_copy().add_comp(a, sep='.')
        if ionization == 'negative':
            new_comp = new_comp.deprotonate()
        else:
            new_comp = new_comp.protonate()            
        ion_forms.append(new_comp)       
        
print(len(ion_forms))
# for ion in ion_forms:
#     print(ion)

47817


Step 4 - Save the mass/name list
--------------------------------

The list is saved as a simple tab delimited text file.
- the main format is: mass, label
- an additional format: mass, xic width, name is intended to be used with PeakView Extract XIC (by importing it)

The list can also be truncated to an upper mass limit

In [11]:
import os

mass_limit = 500
xic_width = 0.0   # if 0 the normal form is used...alternative could be 0.01 to generate the PeakView compatible form
count = 0

ion_forms = sorted(ion_forms, key=lambda x: x.Mass)

# Set up file names and paths...this is a convenient platform independent way to provide file paths
f_dir = os.sep + os.path.join('Users','ronbonner','Data','PCA')

if ionization == "negative":
    f_name = f'{base_name} ions neg.txt'
else:
    f_name = f'{base_name} ions pos.txt'

data_path = os.path.join(f_dir, f_name)

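print (data_path)

# NOTE: the file is opened by name only, so it is written to the current working directory,
# not to data_path as printed above
with open(f_name, 'w') as f: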
    
    for ion in ion_forms:

        if ion.Mass > mass_limit: 
            break       
      
        if xic_width:
            f.write(f'{ion.Mass:10.4f}\t{xic_width}\t{ion.Name}\n')
        else:
            f.write(f'{ion.Mass:10.4f}\t{ion.Name}\n')
            
        count += 1

print(count, 'lines written')

/Users/ronbonner/Data/PCA/Vinpo ions pos.txt
933 lines written


Step 5 - Verification
---------------------

Just to be sure the file exists, we re-open it, read the lines and report how many there are. The file can also be viewed with the shell commands (preceded by a !) shown below.

In [12]:
with open(f_name, 'r') as f:
    lines = f.readlines()

print(len(lines), 'read')

933 read
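
Beyond counting lines, the file can be parsed back into (mass, label) pairs. The sketch below handles only the main two-column format and reads from an in-memory sample so that it runs on its own; the function name `read_ion_list` is an assumption for this example.

```python
import io

def read_ion_list(handle):
    """Parse lines of the main format 'mass<TAB>label' into (float, str) tuples."""
    ions = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        mass_text, label = line.split('\t', 1)
        ions.append((float(mass_text), label))
    return ions

# Two lines in the main format, copied from the file listing at the end of this notebook
sample = io.StringIO('  287.1542\tVinpo-C2H4-2(H2O).H+\n  305.1648\tVinpo-C2H4-H2O.H+\n')
ions = read_ion_list(sample)
print(ions)
```

Passing an open file handle instead of the `StringIO` object should work the same way, provided the file was written with xic_width set to 0.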


In [13]:
!ls -l

total 264
-rw-r--r--  1 ronbonner  staff  41169 Oct  3 06:11 CompoundCalculator.ipynb
-rw-r--r--  1 ronbonner  staff  39975 Oct  2 09:23 Ibu ions neg.txt
-rw-r--r--  1 ronbonner  staff   1063 Sep 11 06:33 LICENSE
-rw-r--r--  1 ronbonner  staff    116 Sep 11 06:33 README.md
-rw-r--r--  1 ronbonner  staff  37804 Oct  3 06:11 Vinpo ions pos.txt


In [14]:
!cat 'Vinpo ions pos.txt'

  287.1542	Vinpo-C2H4-2(H2O).H+
  303.1492	Vinpo-C2H4-OH-2(H2O).H+
  304.1808	Vinpo-C2H4-2(H2O).NH3.H+
  305.1648	Vinpo-C2H4-H2O.H+
  309.1362	Vinpo-C2H4-2(H2O).Na-H.H+
  315.1855	Vinpo-2(H2O).H+
  317.1284	Vinpo-C2H4-COOH-2(H2O).H+
  320.1757	Vinpo-C2H4-OH-2(H2O).NH3.H+
  321.1597	Vinpo-C2H4-OH-H2O.H+
  322.1914	Vinpo-C2H4-H2O.NH3.H+
  323.1754	Vinpo-C2H4.H+
  325.1101	Vinpo-C2H4-2(H2O).K-H.H+
  325.1311	Vinpo-C2H4-OH-2(H2O).Na-H.H+
  326.1627	Vinpo-C2H4-2(H2O).Na-H.NH3.H+
  327.1467	Vinpo-C2H4-H2O.Na-H.H+
  331.1181	Vinpo-C2H4-2(H2O).2(Na-H).H+
  331.1805	Vinpo-OH-2(H2O).H+
  332.2121	Vinpo-2(H2O).NH3.H+
  333.1961	Vinpo-H2O.H+
  334.1550	Vinpo-C2H4-COOH-2(H2O).NH3.H+
  335.1390	Vinpo-C2H4-COOH-H2O.H+
  337.1675	Vinpo-2(H2O).Na-H.H+
  338.1863	Vinpo-C2H4-OH-H2O.NH3.H+
  339.1104	Vinpo-C2H4-COOH-2(H2O).Na-H.H+
  339.1703	Vinpo-C2H4-OH.H+
  340.2019	Vinpo-C2H4.NH3.H+
  341.1050	Vinpo-C2H4-OH-2(H2O).K-H.H+
  342.1367	Vinpo-C2H4-2(H2O).K-H.NH3.H+
  342.1577	Vinpo-C2H4-OH-2(H2O).Na-H.NH3.