Welcome to the M+1 tutorial for alanine! In this document, we are going to go through the steps to simulate an alanine measurement.

We begin by importing the Python files and libraries we need. We start with "sys.path.insert" so that we can import from the parent directory; this lets us keep all the code in a parent directory and the notebooks and other data in a separate folder.

In [1]:
import sys; sys.path.insert(0, '..')

from datetime import date

today = date.today()

import copy
import json

import numpy as np
import pandas as pd
from tqdm import tqdm

import basicDeltaOperations as op
import calcIsotopologues as ci
import fragmentAndSimulate as fas

Our first step is to initialize some basic information about the molecule. This is done via four separate lists of the same length giving information about the various sites. 

IDList gives each site a name (arbitrary; by convention the first letter(s) give the element)

elIDs gives the identity of each element ("C", "O", "N", "H", and "S" are supported) 

numberAtSite gives the number of atoms at that particular site (as defined in the theory paper, sites contain any number of atoms of a single element)

deltas gives the site specific deltas for the M+1 substitution at each site (13C, 17O, 15N, 33S, D). Other deltas (18O, 34S, 36S) are set assuming a mass scaling relationship with the M+1 substitution. We do not include any information about clumps at present. The reference frames are VPDB, VSMOW, AIR, and CDT; see basicDeltaOperations file for the standard constants.

In [2]:
IDList = ['Calphabeta','Ccarboxyl','Ocarboxyl','Namine','Hretained','Hlost']
elIDs = ['C','C','O','N','H','H']
numberAtSite = [2,1,2,1,6,2]
deltas = [-40,-20,0,-10,25,40]

l = [elIDs, numberAtSite, deltas]
cols = ['IDS','Number','deltas']

Next, we define the fragmentation dictionary for this molecule. This is done via a two step process; first we define a dictionary "allFragments" giving information about every possible fragment of this molecule; we then define a "fragSubset" list giving the names of the fragments we are using for this experiment, and construct the actual dictionary by pulling out relevant fragments. This makes it easy to add or remove fragments for given experiments. 

Each fragment has a name ("full" or "44" here) keyed to subgeometries (keys "01", "02", etc.). The subgeometries allow us to define multiple fragmentation pathways leading to the same observed beam. Each subgeometry gets a list specifying whether each site is retained (1) or lost ('x') in that fragment (fractional fragments are not permitted; 1 and 'x' are the only values). Each also gets a relative contribution ('relCont') defining how much that particular subgeometry contributes to the observed fragment. The relative contributions for a given fragment should sum to 1.

At present, there are no automatic checks to make sure an input fragmentation pattern is valid--they must be checked manually. Part of the value of simulating and solving a synthetic dataset is confirming everything is defined properly. 

In [3]:
allFragments = {'full':{'01':{'subgeometry':[1,1,1,1,1,1],'relCont':1}},
                  '44':{'01':{'subgeometry':[1,'x','x',1,1,'x'],'relCont':1}}}

fragSubset = ['full','44']

fragmentationDictionary = {key: value for key, value in allFragments.items() if key in fragSubset}
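
Because no automatic checks exist, a small manual check can catch the most common mistakes. The sketch below is a hypothetical helper (check_fragmentation is not part of fragmentAndSimulate); it assumes only the dictionary layout described above and tests the three stated rules: one entry per site, entries limited to 1 and 'x', and relCont summing to 1 per fragment.

```python
def check_fragmentation(fragments, n_sites):
    """Hypothetical helper: return a list of problems found in a fragmentation dictionary."""
    problems = []
    for frag_name, subgeometries in fragments.items():
        total = 0.0
        for sub_key, info in subgeometries.items():
            label = frag_name + '_' + sub_key
            geometry = info['subgeometry']
            # Rule 1: one entry per site
            if len(geometry) != n_sites:
                problems.append('%s: expected %d entries, found %d'
                                % (label, n_sites, len(geometry)))
            # Rule 2: only 1 (retained) or 'x' (lost) are allowed
            if any(entry not in (1, 'x') for entry in geometry):
                problems.append("%s: entries must be 1 or 'x'" % label)
            total += info['relCont']
        # Rule 3: relative contributions sum to 1 for each fragment
        if abs(total - 1) > 1e-9:
            problems.append('%s: relCont values sum to %g, not 1' % (frag_name, total))
    return problems


example = {'full': {'01': {'subgeometry': [1, 1, 1, 1, 1, 1], 'relCont': 1}},
           '44': {'01': {'subgeometry': [1, 'x', 'x', 1, 1, 'x'], 'relCont': 1}}}
problems = check_fragmentation(example, n_sites=6)
print(problems)
```

An empty list means the three rules hold; it does not confirm that the fragments are chemically sensible.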

Next, we process this fragmentation information to create two lists: condensedFrags and fragKeys (named siteFrags and fragSubgeometryKeys in the code below).

fragKeys contains a string for each subgeometry, e.g. ['full_01','44_01']. This variable tracks all subfragments, not just the observed fragments.

condensedFrags is a list of lists; each inner list gives the subgeometry for the corresponding entry of fragKeys. For example, in our case, condensedFrags = [[1, 1, 1, 1, 1, 1], [1, 'x', 'x', 1, 1, 'x']], for full_01 and 44_01, respectively.

In [4]:
siteFrags = []
fragSubgeometryKeys = []

for fragKey, subFragDict in fragmentationDictionary.items():
    for subFragNum, subFragInfo in subFragDict.items():
        l.append(subFragInfo['subgeometry'])
        cols.append(fragKey + '_' + subFragNum)
        siteFrags.append(subFragInfo['subgeometry'])
        fragSubgeometryKeys.append(fragKey + '_' + subFragNum)

We then define an additional way to track fragmentation, the "expandedFrags" variable (named atomFrags in the code below). This is similar to condensedFrags, but expands entries for multiatomic sites to give one entry per atom, as opposed to one entry per site.

We call the condensedFrags depiction the "SITE" depiction of our molecule.

We call the expandedFrags depiction the "ATOM" depiction of our molecule. Note that we may also expand elIDs and delta values from the "SITE" depiction to the "ATOM" depiction. We will make use of both depictions. 

In [5]:
atomFrags = [fas.expandFrag(x, numberAtSite) for x in siteFrags]

In [6]:
atomFrags

[[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 [1, 1, 'x', 'x', 'x', 1, 1, 1, 1, 1, 1, 1, 'x', 'x']]
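
To make the SITE to ATOM expansion concrete, here is a minimal sketch of what such an expansion does. The function expand_frag_sketch is an illustrative stand-in written for this tutorial, not the actual body of fas.expandFrag; it simply repeats each site entry once per atom at that site.

```python
def expand_frag_sketch(site_frag, number_at_site):
    """Illustrative stand-in: repeat each SITE entry once per atom at that site."""
    atom_frag = []
    for entry, n_atoms in zip(site_frag, number_at_site):
        atom_frag.extend([entry] * n_atoms)
    return atom_frag


numberAtSite = [2, 1, 2, 1, 6, 2]
expanded = expand_frag_sketch([1, 'x', 'x', 1, 1, 'x'], numberAtSite)
print(expanded)
# [1, 1, 'x', 'x', 'x', 1, 1, 1, 1, 1, 1, 1, 'x', 'x']
```

The result has 14 entries, one per atom, and agrees with the second row of atomFrags printed above.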

Finally, we put all this information into a dataFrame to track basic information about this molecule. 

In [7]:
df = pd.DataFrame(l, columns = IDList)
df = df.transpose()
df.columns = cols

df

            IDS  Number  deltas  full_01  44_01
Calphabeta  C    2       -40     1        1
Ccarboxyl   C    1       -20     1        x
Ocarboxyl   O    2       0       1        x
Namine      N    1       -10     1        1
Hretained   H    6       25      1        1
Hlost       H    2       40      1        x


Often, we will want to initialize the same molecule many times, with different delta values and/or fragments. We therefore wrap all of the steps above in a function and initialize the molecule just by calling it. You may use the alanineTest.py file as an example. (We also added an option, printHeavy, for it to report delta 18O assuming a mass scaling relationship with 17O.)

In [8]:
import alanineTest

deltas = [-40,-20,0,-10,25,40]
fragSubset = ['full','44']
df, expandedFrags, fragSubgeometryKeys, fragmentationDictionary = alanineTest.initializeAlanine(
    deltas, fragSubset, printHeavy = True)

Delta 18O
0.0


With this basic information, we can begin simulating a measurement. We start by calculating the concentration of all isotopologues of a molecule; this is viable computationally for molecules with millions of distinct isotopologues. For more complex molecules, we may wish to only calculate concentrations for the M+1, M+2, etc. populations of interest; we have implemented this for M+1. 

In [9]:
# disable = True turns off the progress bar
byAtom = ci.inputToAtomDict(df, disable = False, M1Only = False)

Calculating Isotopologue Concentrations


100%|███████████████████████████████████████████████████████████████████████████| 1512/1512 [00:00<00:00, 65959.97it/s]


Compiling Isotopologue Dictionary


100%|███████████████████████████████████████████████████████████████████████████| 1512/1512 [00:00<00:00, 56088.26it/s]


Our output is a dictionary. The keys are strings with length equal to the number of atoms (NOT # of sites) in the molecule; i.e., they correspond to the ATOM depiction. We can track the element IDs at each position of the string with the following function.

In [10]:
ci.strSiteElements(df)

'CCCOONHHHHHHHH'
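
The same string can be rebuilt by hand from the SITE-level lists, which is a useful way to see how string positions map onto sites. This is a sketch of the idea only, not the body of ci.strSiteElements, and it relies on every supported element symbol being a single character.

```python
elIDs = ['C', 'C', 'O', 'N', 'H', 'H']
numberAtSite = [2, 1, 2, 1, 6, 2]

# Repeat each element symbol once per atom at its site, then join.
atom_elements = ''.join(el * n for el, n in zip(elIDs, numberAtSite))
print(atom_elements)
# CCCOONHHHHHHHH
```

Positions 0-1 are therefore Calphabeta, position 2 is Ccarboxyl, positions 3-4 are Ocarboxyl, position 5 is Namine, positions 6-11 are Hretained, and positions 12-13 are Hlost.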

Our dictionary contains the following information:

Number: The number of indistinguishable isotopologues with this geometry. 

Full: An expanded version of the ATOM depiction. Each multiatomic site, with n atoms, contains n numbers in parentheses corresponding to the individual atoms. These are indistinguishable; hence, rather than including (1,0) and (0,1) simultaneously, we only include (0,1), always using the version with leading zeroes. Review the theory paper for more details about this depiction. 

Conc: The sum of the concentration of all indistinguishable isotopologues with this geometry. Summing all concentrations in byAtom should yield 1. 

Mass: The cardinal mass difference between this isotopologue and the unsubstituted isotopologue. This is obtained by summing across the byAtom string. 

Subs: The identity of any substitutions in this isotopologue. 

Note that we calculate the ATOM depiction from the full depiction, which for the above example would include (0,1) but not (1,0). This means 01 would be a key in byAtom but 10 would not be. It is rarely recommended that one index into byAtom manually. 

In [11]:
byAtom

{'00000000000000': {'Number': 1,
  'Full': '(0, 0)0(0, 0)0(0, 0, 0, 0, 0, 0)(0, 0)',
  'Conc': 0.9587824350083225,
  'Mass': 0,
  'Subs': ''},
 '00000000000001': {'Number': 2,
  'Full': '(0, 0)0(0, 0)0(0, 0, 0, 0, 0, 0)(0, 1)',
  'Conc': 0.0003106271003199443,
  'Mass': 1,
  'Subs': 'D'},
 '00000000000011': {'Number': 1,
  'Full': '(0, 0)0(0, 0)0(0, 0, 0, 0, 0, 0)(1, 1)',
  'Conc': 2.5159304115833954e-08,
  'Mass': 2,
  'Subs': 'DD'},
 '00000000000100': {'Number': 6,
  'Full': '(0, 0)0(0, 0)0(0, 0, 0, 0, 0, 1)(0, 0)',
  'Conc': 0.0009184407052729123,
  'Mass': 1,
  'Subs': 'D'},
 '00000000000101': {'Number': 12,
  'Full': '(0, 0)0(0, 0)0(0, 0, 0, 0, 0, 1)(0, 1)',
  'Conc': 2.975571544468823e-07,
  'Mass': 2,
  'Subs': 'DD'},
 '00000000000111': {'Number': 6,
  'Full': '(0, 0)0(0, 0)0(0, 0, 0, 0, 0, 1)(1, 1)',
  'Conc': 2.410070123585612e-11,
  'Mass': 3,
  'Subs': 'DDD'},
 '00000000001100': {'Number': 15,
  'Full': '(0, 0)0(0, 0)0(0, 0, 0, 0, 1, 1)(0, 0)',
  'Conc': 3.665818308991039e-0
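
The 'Number' entries in the output above can be reproduced by counting arrangements. The sketch below is a hand check, not library code; it assumes each atom is either unsubstituted (0) or singly substituted (1), which holds for the hydrogen sites shown here. Sites with more than one heavy isotope (e.g. 17O and 18O) would need multinomial rather than binomial coefficients.

```python
from math import comb  # Python 3.8+


def count_equivalent(site_tuples):
    """Count indistinguishable arrangements for sites whose atoms are flagged 0 or 1."""
    total = 1
    for site in site_tuples:
        # Choose which of the len(site) atoms carry the sum(site) substitutions
        total *= comb(len(site), sum(site))
    return total


# Hretained and Hlost tuples taken from the 'Full' strings printed above
one_d_each = count_equivalent([(0, 0, 0, 0, 0, 1), (0, 1)])
two_d_retained = count_equivalent([(0, 0, 0, 0, 1, 1), (0, 0)])
print(one_d_each, two_d_retained)
# 12 15
```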

At this point, one may add any clumps of interest. This is an advanced feature, so we do not discuss it here.

After adding clumps, we calculate a "bySub" dictionary. This contains similar information, but here the keys are substitutions ("D", "13C", etc.) and the information is given for all isotopologues with those substitutions. This makes it easier to calculate molecular average information and spectra. 

In [12]:
bySub = ci.calcSubDictionary(byAtom, df, atomInput = True)

In [13]:
bySub

{'': {'Number': 1,
  'Full': ['(0, 0)0(0, 0)0(0, 0, 0, 0, 0, 0)(0, 0)'],
  'Conc': 0.9587824350083225,
  'Mass': [0],
  'ATOM': ['00000000000000']},
 'D': {'Number': 8,
  'Full': ['(0, 0)0(0, 0)0(0, 0, 0, 0, 0, 0)(0, 1)',
   '(0, 0)0(0, 0)0(0, 0, 0, 0, 0, 1)(0, 0)'],
  'Conc': 0.0012290678055928567,
  'Mass': [1, 1],
  'ATOM': ['00000000000001', '00000000000100']},
 'DD': {'Number': 28,
  'Full': ['(0, 0)0(0, 0)0(0, 0, 0, 0, 0, 0)(1, 1)',
   '(0, 0)0(0, 0)0(0, 0, 0, 0, 0, 1)(0, 1)',
   '(0, 0)0(0, 0)0(0, 0, 0, 0, 1, 1)(0, 0)'],
  'Conc': 6.892982894618201e-07,
  'Mass': [2, 2, 2],
  'ATOM': ['00000000000011', '00000000000101', '00000000001100']},
 'DDD': {'Number': 56,
  'Full': ['(0, 0)0(0, 0)0(0, 0, 0, 0, 0, 1)(1, 1)',
   '(0, 0)0(0, 0)0(0, 0, 0, 0, 1, 1)(0, 1)',
   '(0, 0)0(0, 0)0(0, 0, 0, 1, 1, 1)(0, 0)'],
  'Conc': 2.2090118358316652e-10,
  'Mass': [3, 3, 3],
  'ATOM': ['00000000000111', '00000000001101', '00000000011100']},
 'DDDD': {'Number': 70,
  'Full': ['(0, 0)0(0, 0)0(0, 0,

We then simulate the molecular average ("U value") measurement from the bySub dictionary. We can determine which measurements to simulate in two ways: first, we can specify a massThreshold; all U values for substitutions below this cardinal mass difference will be calculated. Alternatively, we can enter a subList (including e.g. '13C', '15N'); in this case, U values will be calculated only for the indicated substitutions.

Multiple substitutions are given without spaces (e.g. "18O18O"). Currently, we do not check for symmetry (e.g. if given "18O13C" it does not look for both "13C18O" and "18O13C"). The order must match the order given in the bySub dictionary (try bySub.keys()).

In [14]:
UValueList = ['13C','15N','18O','D','DD']
allMeasurementInfo = {}
allMeasurementInfo = fas.UValueMeasurement(bySub, allMeasurementInfo, massThreshold = 3,
                                          subList = UValueList)
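
Since the order of a multiple substitution must match the bySub key, a lookup that ignores order can save some trial and error. The helper below is hypothetical (it is not part of fragmentAndSimulate) and assumes substitutions are written either as a mass number followed by one capital letter ('13C', '18O') or as 'D'.

```python
import re


def split_subs(sub_string):
    """Split e.g. '18O13C' into ['18O', '13C'] and 'DD' into ['D', 'D']."""
    return re.findall(r'\d+[A-Z]|D', sub_string)


def find_matching_key(sub_string, keys):
    """Return the key holding the same substitutions in any order, or None."""
    wanted = sorted(split_subs(sub_string))
    for key in keys:
        if sorted(split_subs(key)) == wanted:
            return key
    return None


toy_keys = ['', 'D', 'DD', '13C18O']  # stand-in for bySub.keys()
match = find_matching_key('18O13C', toy_keys)
print(match)
# 13C18O
```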

In [15]:
allMeasurementInfo

{'Full Molecule': {'D': 0.0012819048,
  'DD': 7.189308692913599e-07,
  '15N': 0.00363924,
  '18O': 0.004010400000000001,
  '13C': 0.03258788}}
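
The '13C' value above can be checked by hand from the site-specific deltas. The sketch assumes a VPDB 13C/12C ratio of 0.0112372; this constant is my assumption, and the values actually used live in basicDeltaOperations. Each delta is converted to a ratio and the ratios are summed over all carbon atoms.

```python
R_VPDB_13C = 0.0112372  # assumed standard ratio; see basicDeltaOperations for the real constants


def delta_to_ratio(delta_permil, r_standard):
    """Convert a delta value in per mil to an isotope ratio."""
    return (delta_permil / 1000 + 1) * r_standard


# (delta, number of atoms) for Calphabeta and Ccarboxyl
carbon_sites = [(-40, 2), (-20, 1)]
u_13c = sum(n * delta_to_ratio(d, R_VPDB_13C) for d, n in carbon_sites)
print(u_13c)  # approximately 0.03258788
```

Under that assumption the hand calculation reproduces the '13C' entry in allMeasurementInfo.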

We next prepare to simulate M+N data. To do so, we first select only those isotopologues with cardinal mass differences of interest, explicitly enumerating the M+1, M+2, etc. populations. (Note that a faster version of our algorithm would calculate these populations directly, rather than calculating all isotopologues and then selecting this subset.)

In [16]:
MN = ci.massSelections(byAtom, massThreshold = 2)

In [17]:
MN

{'M0': {'00000000000000': {'Number': 1,
   'Full': '(0, 0)0(0, 0)0(0, 0, 0, 0, 0, 0)(0, 0)',
   'Conc': 0.9587824350083225,
   'Mass': 0,
   'Subs': ''}},
 'M1': {'00000000000001': {'Number': 2,
   'Full': '(0, 0)0(0, 0)0(0, 0, 0, 0, 0, 0)(0, 1)',
   'Conc': 0.0003106271003199443,
   'Mass': 1,
   'Subs': 'D'},
  '00000000000100': {'Number': 6,
   'Full': '(0, 0)0(0, 0)0(0, 0, 0, 0, 0, 1)(0, 0)',
   'Conc': 0.0009184407052729123,
   'Mass': 1,
   'Subs': 'D'},
  '00000100000000': {'Number': 1,
   'Full': '(0, 0)0(0, 0)1(0, 0, 0, 0, 0, 0)(0, 0)',
   'Conc': 0.0034892393887796876,
   'Mass': 1,
   'Subs': '15N'},
  '00001000000000': {'Number': 2,
   'Full': '(0, 0)0(0, 1)0(0, 0, 0, 0, 0, 0)(0, 0)',
   'Conc': 0.0007284828941193235,
   'Mass': 1,
   'Subs': '17O'},
  '00100000000000': {'Number': 1,
   'Full': '(0, 0)1(0, 0)0(0, 0, 0, 0, 0, 0)(0, 0)',
   'Conc': 0.010558549379102012,
   'Mass': 1,
   'Subs': '13C'},
  '01000000000000': {'Number': 2,
   'Full': '(0, 1)0(0, 0)0(0, 0, 0, 0, 0
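
The grouping itself is simple to picture. The following sketch is an illustrative stand-in, not the body of ci.massSelections; it assumes the threshold is inclusive and uses a few trimmed entries in the shape of byAtom.

```python
def mass_selections_sketch(by_atom, mass_threshold):
    """Illustrative stand-in: group isotopologues by cardinal mass up to the threshold."""
    selected = {'M' + str(m): {} for m in range(mass_threshold + 1)}
    for key, info in by_atom.items():
        if info['Mass'] <= mass_threshold:  # assumed inclusive
            selected['M' + str(info['Mass'])][key] = info
    return selected


# Trimmed entries shaped like byAtom (only the fields needed here)
toy_by_atom = {'00000000000000': {'Mass': 0, 'Subs': ''},
               '00000000000001': {'Mass': 1, 'Subs': 'D'},
               '00000000000011': {'Mass': 2, 'Subs': 'DD'},
               '00000000000111': {'Mass': 3, 'Subs': 'DDD'}}
toy_MN = mass_selections_sketch(toy_by_atom, mass_threshold=1)
print(sorted(toy_MN.keys()))
# ['M0', 'M1']
```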

We update this dictionary to include information about fragmentation. We do not use unresolvedDict here, which is an advanced feature allowing us to specify how unresolved ion beams combine.

The output dictionary adds the stochastic U value for each isotopologue, the ATOM depiction of each isotopologue upon fragmentation, and the substitutions of each isotopologue upon fragmentation. This makes it easy to predict spectra upon fragmentation for various M+N experiments.

In [18]:
MN = fas.trackMNFragments(MN, atomFrags, fragSubgeometryKeys, df, unresolvedDict = {})

In [19]:
MN

{'M0': {'00000000000000': {'Number': 1,
   'Full': '(0, 0)0(0, 0)0(0, 0, 0, 0, 0, 0)(0, 0)',
   'Conc': 0.9587824350083225,
   'Mass': 0,
   'Subs': '',
   'Stochastic U': 1.0,
   'full_01 Identity': '00000000000000',
   'full_01 Subs': 'Unsub',
   '44_01 Identity': '00xxx0000000xx',
   '44_01 Subs': 'Unsub'}},
 'M1': {'00000000000001': {'Number': 2,
   'Full': '(0, 0)0(0, 0)0(0, 0, 0, 0, 0, 0)(0, 1)',
   'Conc': 0.0003106271003199443,
   'Mass': 1,
   'Subs': 'D',
   'Stochastic U': 0.00032398079999999997,
   'full_01 Identity': '00000000000001',
   'full_01 Subs': 'D',
   '44_01 Identity': '00xxx0000000xx',
   '44_01 Subs': 'Unsub'},
  '00000000000100': {'Number': 6,
   'Full': '(0, 0)0(0, 0)0(0, 0, 0, 0, 0, 1)(0, 0)',
   'Conc': 0.0009184407052729123,
   'Mass': 1,
   'Subs': 'D',
   'Stochastic U': 0.000957924,
   'full_01 Identity': '00000000000100',
   'full_01 Subs': 'D',
   '44_01 Identity': '00xxx0000001xx',
   '44_01 Subs': 'D'},
  '00000100000000': {'Number': 1,
   'Full': '
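
The 'Identity' strings above follow directly from the ATOM depiction of the fragment: retained positions keep their digit and lost positions become 'x'. The function below is a sketch of that masking step written for illustration, not the code inside fas.trackMNFragments.

```python
def fragment_identity_sketch(atom_string, atom_frag):
    """Keep the digit where the atom is retained (1); write 'x' where it is lost."""
    return ''.join(ch if keep == 1 else 'x'
                   for ch, keep in zip(atom_string, atom_frag))


atom_frag_44 = [1, 1, 'x', 'x', 'x', 1, 1, 1, 1, 1, 1, 1, 'x', 'x']

# A deuterium on a retained hydrogen survives fragmentation...
retained = fragment_identity_sketch('00000000000100', atom_frag_44)
# ...while a deuterium on a lost hydrogen does not.
lost = fragment_identity_sketch('00000000000001', atom_frag_44)
print(retained, lost)
# 00xxx0000001xx 00xxx0000000xx
```

Both results agree with the '44_01 Identity' entries shown for these two isotopologues.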

We then use this MN dictionary to simulate the M+N experiment of interest. The function here has many options:

abundanceThreshold: gives a relative abundance threshold (e.g. 0.01) below which peaks will not be observed. If a simulated ion beam has relative abundance below this threshold, it is culled from the predicted measurement. 

fractionationFactors: Advanced feature, not used here. Allows the user to specify experimental fractionation factors for each ion beam. 

omitMeasurements: Allows a user to manually specify ion beams to not measure. For example, omitMeasurements = {'M1':{'61':'D'}} would mean I do not observe the D ion beam of the 61 fragment of the M+1 experiment, regardless of its abundance. 

unresolvedDict: Discussed above. 

outputFull: For debugging; see following note. 

The FF output of the tuple contains fractionation factors used for this simulation. 

In [20]:
predictedMeasurement, FF = fas.predictMNFragmentExpt(allMeasurementInfo, MN, expandedFrags,
                                                     fragSubgeometryKeys, df,
                                                     fragmentationDictionary,
                                                     abundanceThreshold = 0,
                                                     fractionationFactors = {},
                                                     omitMeasurements = {},
                                                     unresolvedDict = {},
                                                     outputFull = False)

predictedMeasurement contains information about each observed ion beam, specifying its abundance four different ways (calculated stepwise):

Abs. Abundance gives the actual abundance in concentration space. 

Rel. Abundance gives the relative abundance compared to all ion beams (including unobserved ones) for this fragment. This is the quantity we hope to actually recover. 

Combined Rel. Abundance gives the relative abundance after combining any unresolved beams; for example, if the 17O beam adds to the 13C beam, the 13C beam will increase and 17O will equal 0. 

Adj. Rel. Abundance gives the relative abundance after culling those beams below the abundance threshold or manually omitted. This is the quantity we actually observe. 

See the theory paper for more details about each step. 

Note that beams which are not observed are culled from the dictionary, so information for these is not provided. Setting outputFull = True allows the user to see this information; however, note the dataset cannot be used in this case (as it includes data for unobserved beams). 

In [21]:
predictedMeasurement

{'Full Molecule': {'D': 0.0012819048,
  'DD': 7.189308692913599e-07,
  '15N': 0.00363924,
  '18O': 0.004010400000000001,
  '13C': 0.03258788},
 'M0': {'full': {'Unsub': {'Abs. Abundance': 0.9587824350083225,
    'Rel. Abundance': 1.0,
    'Combined Rel. Abundance': 1.0,
    'Adj. Rel. Abundance': 1.0}},
  '44': {'Unsub': {'Abs. Abundance': 0.9587824350083225,
    'Rel. Abundance': 1.0,
    'Combined Rel. Abundance': 1.0,
    'Adj. Rel. Abundance': 1.0}}},
 'M1': {'full': {'D': {'Abs. Abundance': 0.0012290678055928567,
    'Rel. Abundance': 0.03349736519737601,
    'Combined Rel. Abundance': 0.03349736519737601,
    'Adj. Rel. Abundance': 0.03349736519737601},
   '15N': {'Abs. Abundance': 0.0034892393887796876,
    'Rel. Abundance': 0.09509672740198702,
    'Combined Rel. Abundance': 0.09509672740198702,
    'Adj. Rel. Abundance': 0.09509672740198702},
   '17O': {'Abs. Abundance': 0.0007284828941193235,
    'Rel. Abundance': 0.019854280970760307,
    'Combined Rel. Abundance': 0.0198542
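
The stepwise abundances can be illustrated with invented numbers. This sketch skips the unresolved-beam combination step and assumes that, after culling, the surviving beams are renormalized to sum to 1; that renormalization is my reading of "the quantity we actually observe" and should be checked against the theory paper. The function name and the toy values are hypothetical.

```python
def relative_abundances(abs_abundance, abundance_threshold=0.0):
    """Return (relative, adjusted) abundances for one fragment's ion beams."""
    total = sum(abs_abundance.values())
    rel = {beam: value / total for beam, value in abs_abundance.items()}
    # Cull beams whose relative abundance falls below the threshold
    kept = {beam: value for beam, value in rel.items()
            if value >= abundance_threshold}
    # Assumed: observed beams are renormalized among themselves
    kept_total = sum(kept.values())
    adjusted = {beam: value / kept_total for beam, value in kept.items()}
    return rel, adjusted


toy_beams = {'D': 1.0, '15N': 3.0, '17O': 0.5, '13C': 5.5}  # invented numbers
rel, adj = relative_abundances(toy_beams, abundance_threshold=0.08)
print(sorted(adj.keys()))
# ['13C', '15N', 'D']
```

With abundanceThreshold = 0, as in the cell above, nothing is culled and the adjusted values equal the relative ones.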

We pass data between steps as .json files; the following cell exports our simulated data in that format.

In [22]:
outputPath = str(today) + " Tutorial Alanine"
output = json.dumps(predictedMeasurement)

# The with block closes the file even if the write fails
with open(outputPath + ".json", "w") as f:
    f.write(output)
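
Reading the file back is the mirror image of the export. The sketch below uses a small invented dictionary and a temporary directory so that it runs on its own; in practice the path would be the one written above.

```python
import json
import os
import tempfile

# Invented stand-in with the same nesting as predictedMeasurement
measurement = {'M1': {'full': {'D': {'Rel. Abundance': 0.25}}}}

path = os.path.join(tempfile.mkdtemp(), 'example.json')
with open(path, 'w') as f:
    json.dump(measurement, f)

with open(path) as f:
    reloaded = json.load(f)

print(reloaded == measurement)
# True
```

Note that JSON keys are always strings, so a dictionary with non-string keys would not round-trip unchanged.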

Typically, we will simulate the same molecule many times; again, we define a function, simulateMeasurement, in the alanineTest.py file, allowing easy access to the options described above.

In [23]:
deltas = [-40,-20,0,-10,25,40]
fragSubset = ['full','44']
df, expandedFrags, fragSubgeometryKeys, fragmentationDictionary = alanineTest.initializeAlanine(deltas, fragSubset)

unresolvedDict = {}
calcFF = False
forbiddenPeaks = {}
UValueList = ['13C','15N']
abundanceThreshold = 0
massThreshold = 1

predictedMeasurement, MNDict, fractionationFactorsSmp = alanineTest.simulateMeasurement(
    df, fragmentationDictionary, expandedFrags, fragSubgeometryKeys,
    abundanceThreshold = abundanceThreshold,
    outputPath = str(today) + " TUTORIAL Sample Stochastic",
    calcFF = calcFF,
    ffstd = 0.05,
    unresolvedDict = unresolvedDict,
    outputFull = False,
    omitMeasurements = forbiddenPeaks,
    UValueList = UValueList,
    massThreshold = massThreshold)

Delta 18O
0.0
Calculating Isotopologue Concentrations


100%|███████████████████████████████████████████████████████████████████████████| 1512/1512 [00:00<00:00, 65953.80it/s]


Compiling Isotopologue Dictionary


100%|███████████████████████████████████████████████████████████████████████████| 1512/1512 [00:00<00:00, 44597.04it/s]

Simulating Measurement





In [24]:
deltas = [-30,-30,0,0,0,0]
df, expandedFrags, fragSubgeometryKeys, fragmentationDictionary = alanineTest.initializeAlanine(deltas, fragSubset)

predictedMeasurement, MNDict, fractionationFactorsStd = alanineTest.simulateMeasurement(
    df, fragmentationDictionary, expandedFrags, fragSubgeometryKeys,
    abundanceThreshold = abundanceThreshold,
    outputPath = str(today) + " TUTORIAL Standard Stochastic",
    calcFF = calcFF,
    ffstd = 0.05,
    unresolvedDict = unresolvedDict,
    fractionationFactors = fractionationFactorsSmp,
    outputFull = False,
    omitMeasurements = forbiddenPeaks,
    UValueList = UValueList,
    massThreshold = massThreshold)

Delta 18O
0.0
Calculating Isotopologue Concentrations


100%|███████████████████████████████████████████████████████████████████████████| 1512/1512 [00:00<00:00, 43312.30it/s]


Compiling Isotopologue Dictionary


100%|███████████████████████████████████████████████████████████████████████████| 1512/1512 [00:00<00:00, 63094.83it/s]

Simulating Measurement



