# Automatic atom mapping - example

This notebook demonstrates how the `atom_mapping` module works in practice. Its purpose is to reduce the workload of preparing input data for metabolic flux analysis (MFA) in INCA.

The only input required is a COBRA model that contains all reaction data and, most importantly, references to metabolite structures in the KEGG Compound, HMDB, or ChEBI databases, or an InChI key.
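
Before starting, it can help to check that each metabolite annotation actually carries a structure reference. The sketch below is illustrative only: the helper `has_structure_reference` is hypothetical, and the annotation keys are assumed to follow the usual COBRA naming, so they may differ from the keys `atom_mapping` really reads.

```python
# Assumed annotation keys; adjust them to whatever your model actually uses.
STRUCTURE_KEYS = ("kegg.compound", "hmdb", "chebi", "inchi_key")


def has_structure_reference(annotation: dict) -> bool:
    """Return True if the annotation has a non-empty value for any assumed key."""
    return any(bool(annotation.get(key)) for key in STRUCTURE_KEYS)


annotated = {"kegg.compound": "C00082"}
unannotated = {"bigg.metabolite": "13dpg"}

print(has_structure_reference(annotated))    # True
print(has_structure_reference(unannotated))  # False
```

Metabolites that fail such a check are the ones for which no structure can be looked up.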

This module uses RDKit to generate the input for RDT (Reaction Decoder Tool), a tool to estimate the atom mapping of a reaction. The output of RDT is then parsed and used to generate a dictionary of atom mappings for each reaction in the model.

#### First, import required modules:

In [1]:
import pandas as pd
from incawrapper.atommapping import atom_mapping
from cobra.io import load_model
import re

## Overview
#### 1. Create a dataframe with the metabolites
- A dataframe that has `met_id` and `annotations` as columns

In [2]:
model = load_model("textbook")
 
met_df = pd.DataFrame(
    {
        "met_id": [met.id for met in model.metabolites],
        "annotations": [met.annotation for met in model.metabolites],
    }
)
print(met_df.head())

    met_id                                        annotations
0  13dpg_c  {'bigg.metabolite': '13dpg', 'biocyc': 'DPG', ...
1    2pg_c  {'bigg.metabolite': '2pg', 'biocyc': '2-PG', '...
2    3pg_c  {'bigg.metabolite': '3pg', 'biocyc': 'G3P', 'c...
3   6pgc_c  {'bigg.metabolite': '6pgc', 'biocyc': 'CPD-296...
4   6pgl_c  {'bigg.metabolite': '6pgl', 'biocyc': 'D-6-P-G...


#### 2. Create a dataframe with the reactions
- A dataframe that has `rxn_id` and the reaction details as columns
- The information is taken from the model

In [3]:
# Collect the equation, stoichiometry and metabolite IDs of every reaction in the model
reaction_data_temp = {}
for cnt, r in enumerate(model.reactions):
    reaction_data_dict = {
        "rxn_id": r.id,
        "equation": r.build_reaction_string(),
        "reactants_stoichiometry": [
            r.get_coefficient(react.id) for react in r.reactants
        ],
        "reactants_ids": [react.id for react in r.reactants],
        "products_stoichiometry": [
            r.get_coefficient(prod.id) for prod in r.products
        ],
        "products_ids": [prod.id for prod in r.products],
    }
    reaction_data_temp[cnt] = reaction_data_dict
 
rxn_data = pd.DataFrame.from_dict(reaction_data_temp, orient='index')
rxn_data.head()

Unnamed: 0,rxn_id,equation,reactants_stoichiometry,reactants_ids,products_stoichiometry,products_ids
0,ACALD,acald_c + coa_c + nad_c <=> accoa_c + h_c + na...,"[-1.0, -1.0, -1.0]","[acald_c, coa_c, nad_c]","[1.0, 1.0, 1.0]","[accoa_c, h_c, nadh_c]"
1,ACALDt,acald_e <=> acald_c,[-1.0],[acald_e],[1.0],[acald_c]
2,ACKr,ac_c + atp_c <=> actp_c + adp_c,"[-1.0, -1.0]","[ac_c, atp_c]","[1.0, 1.0]","[actp_c, adp_c]"
3,ACONTa,cit_c <=> acon_C_c + h2o_c,[-1.0],[cit_c],"[1.0, 1.0]","[acon_C_c, h2o_c]"
4,ACONTb,acon_C_c + h2o_c <=> icit_c,"[-1.0, -1.0]","[acon_C_c, h2o_c]",[1.0],[icit_c]

The reactions can also come from a plain-text source instead of a COBRA model. The following cells read a CSV file in which each row holds one reaction written with KEGG compound IDs, with the atom labels in parentheses, and parse it into the same `rxn_data` layout.


In [4]:
file = "./Literature data/wasylenko/wasylenko_model_KEGG.csv"

wasylenko_kegg = pd.read_csv(file, index_col=0)
wasylenko_kegg

Unnamed: 0,0
0,C00031.ext (abcdef) -> C00092 (abcdef)
1,C01510.ext (abcde) -> C00231 (abcde)
2,C00469 (ab) -> C00469.ext (ab)
3,C00033 (ab) -> C00033.ext (ab)
4,C00116 (abc) -> C00116.ext (abc)
...,...
90,C00092 (abcdef) -> C00182 (abcdef)
91,C00092 (abcdef) -> C00965 (abcdef)
92,C05345 (abcdef) -> C00464 (abcdef)
93,C00011.ext (a) -> C00011 (a)


In [5]:
rxn = wasylenko_kegg.iloc[5, 0]
rxn

'0.459 C00041 + 0.161 C00062 + 0.102 C00152 + 0.297 C00049 + 0.007 C00097 + 0.105 C00064 + 0.302 C00025 + 0.29 C00037 + 0.066 C00135 + 0.193 C00407 + 0.296 C00123 + 0.286 C00047 + 0.051 C00073 + 0.134 C00079 + 0.165 C00148 + 0.185 C00065 + 0.191 C00188 + 0.028 C00078 + 0.102 C00082 + 0.265 C00183 + 0.519 C00182 + 0.023 Trehalose + 0.808 C00464 + 1.135 C00965 + 0.051 C00020 + 0.05 C00055 + 0.051 C00144 + 0.067 C00105 + 0.0036 C00360 + 0.0024 C00239 + 0.0024 C00362 + 0.0036 C00364 + 0.0066 C00422 + 0.0007 C01694 + 0.0015 C05437 + 0.0006 PA + 0.0062 PC + 0.0046 PE + 0.0055 PI + 0.0017 PS + 0.4623 C00024.c -> Biomass'

Remove atom mapping

In [6]:
def remove_atommapping(rxn):
    rxn_wo_am = re.sub(r'\([^)]*\)', '', rxn)
    return rxn_wo_am

rxn_wo_am = remove_atommapping(rxn)
rxn_wo_am

'0.459 C00041 + 0.161 C00062 + 0.102 C00152 + 0.297 C00049 + 0.007 C00097 + 0.105 C00064 + 0.302 C00025 + 0.29 C00037 + 0.066 C00135 + 0.193 C00407 + 0.296 C00123 + 0.286 C00047 + 0.051 C00073 + 0.134 C00079 + 0.165 C00148 + 0.185 C00065 + 0.191 C00188 + 0.028 C00078 + 0.102 C00082 + 0.265 C00183 + 0.519 C00182 + 0.023 Trehalose + 0.808 C00464 + 1.135 C00965 + 0.051 C00020 + 0.05 C00055 + 0.051 C00144 + 0.067 C00105 + 0.0036 C00360 + 0.0024 C00239 + 0.0024 C00362 + 0.0036 C00364 + 0.0066 C00422 + 0.0007 C01694 + 0.0015 C05437 + 0.0006 PA + 0.0062 PC + 0.0046 PE + 0.0055 PI + 0.0017 PS + 0.4623 C00024.c -> Biomass'
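
The biomass reaction above carries no atom labels, so the regular expression leaves it unchanged. A minimal sketch on a reaction that does have labels, taken from the first row of the CSV, shows the effect more clearly:

```python
import re

# A reaction in the same "compound (atoms)" notation as the CSV above.
mapped = "C00031.ext (abcdef) -> C00092 (abcdef)"

# Drop every parenthesised group, i.e. the atom labels.
unmapped = re.sub(r"\([^)]*\)", "", mapped)

print(repr(unmapped))  # 'C00031.ext  -> C00092 '
```

The substitution leaves a double space and a trailing space behind, which is why the compound names still need whitespace cleanup afterwards.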

Split substrate and product side

In [7]:
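def split_rxn(rxn_wo_am):
    # Reversible reactions use " <-> ", irreversible ones " -> "
    if " <-> " in rxn_wo_am:
        rxn_sub, rxn_prod = rxn_wo_am.split(" <-> ")
    elif " -> " in rxn_wo_am:
        rxn_sub, rxn_prod = rxn_wo_am.split(" -> ")
    else:
        raise ValueError(f"No reaction arrow found in: {rxn_wo_am!r}")
    return rxn_sub, rxn_prod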

rxn_sub, rxn_prod = split_rxn(rxn_wo_am)

print(rxn_sub, rxn_prod)

0.459 C00041 + 0.161 C00062 + 0.102 C00152 + 0.297 C00049 + 0.007 C00097 + 0.105 C00064 + 0.302 C00025 + 0.29 C00037 + 0.066 C00135 + 0.193 C00407 + 0.296 C00123 + 0.286 C00047 + 0.051 C00073 + 0.134 C00079 + 0.165 C00148 + 0.185 C00065 + 0.191 C00188 + 0.028 C00078 + 0.102 C00082 + 0.265 C00183 + 0.519 C00182 + 0.023 Trehalose + 0.808 C00464 + 1.135 C00965 + 0.051 C00020 + 0.05 C00055 + 0.051 C00144 + 0.067 C00105 + 0.0036 C00360 + 0.0024 C00239 + 0.0024 C00362 + 0.0036 C00364 + 0.0066 C00422 + 0.0007 C01694 + 0.0015 C05437 + 0.0006 PA + 0.0062 PC + 0.0046 PE + 0.0055 PI + 0.0017 PS + 0.4623 C00024.c Biomass
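
The split discards the arrow, and with it the information about whether the reaction is reversible. If that is needed later, a variant could return it alongside the two sides. This is a sketch only; `split_rxn_with_direction` is a hypothetical helper and not part of `atom_mapping`.

```python
def split_rxn_with_direction(rxn: str):
    """Split a reaction string into (substrates, products, reversible)."""
    for arrow, reversible in ((" <-> ", True), (" -> ", False)):
        if arrow in rxn:
            substrates, products = rxn.split(arrow, 1)
            return substrates, products, reversible
    # Neither arrow was found, so the string is not a valid reaction.
    raise ValueError(f"No reaction arrow found in: {rxn!r}")


print(split_rxn_with_direction("C00092 <-> C05345"))
# ('C00092', 'C05345', True)
print(split_rxn_with_direction("C00022.c -> C00022.mnt"))
# ('C00022.c', 'C00022.mnt', False)
```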


Split reactants

In [8]:
def split_reactants_products(rxn_side):
    if " + " in rxn_side:
        rxn_side_split = rxn_side.split(" + ")
    else:
        rxn_side_split = [rxn_side]
    return rxn_side_split

rxn_sub_split = split_reactants_products(rxn_sub)
rxn_prod_split = split_reactants_products(rxn_prod)

print(rxn_sub_split, rxn_prod_split)

['0.459 C00041', '0.161 C00062', '0.102 C00152', '0.297 C00049', '0.007 C00097', '0.105 C00064', '0.302 C00025', '0.29 C00037', '0.066 C00135', '0.193 C00407', '0.296 C00123', '0.286 C00047', '0.051 C00073', '0.134 C00079', '0.165 C00148', '0.185 C00065', '0.191 C00188', '0.028 C00078', '0.102 C00082', '0.265 C00183', '0.519 C00182', '0.023 Trehalose', '0.808 C00464', '1.135 C00965', '0.051 C00020', '0.05 C00055', '0.051 C00144', '0.067 C00105', '0.0036 C00360', '0.0024 C00239', '0.0024 C00362', '0.0036 C00364', '0.0066 C00422', '0.0007 C01694', '0.0015 C05437', '0.0006 PA', '0.0062 PC', '0.0046 PE', '0.0055 PI', '0.0017 PS', '0.4623 C00024.c'] ['Biomass']


In [9]:
def clear_white_spaces(compound_list: list):
    new_list = []
    for compound in compound_list:
        new_compound = re.sub(r'[ ]+', ' ', compound)
        new_compound = new_compound.strip()
        new_list.append(new_compound)
    return new_list

rxn_sub_split = clear_white_spaces(rxn_sub_split)
rxn_prod_split = clear_white_spaces(rxn_prod_split)

print(rxn_sub_split, rxn_prod_split)

['0.459 C00041', '0.161 C00062', '0.102 C00152', '0.297 C00049', '0.007 C00097', '0.105 C00064', '0.302 C00025', '0.29 C00037', '0.066 C00135', '0.193 C00407', '0.296 C00123', '0.286 C00047', '0.051 C00073', '0.134 C00079', '0.165 C00148', '0.185 C00065', '0.191 C00188', '0.028 C00078', '0.102 C00082', '0.265 C00183', '0.519 C00182', '0.023 Trehalose', '0.808 C00464', '1.135 C00965', '0.051 C00020', '0.05 C00055', '0.051 C00144', '0.067 C00105', '0.0036 C00360', '0.0024 C00239', '0.0024 C00362', '0.0036 C00364', '0.0066 C00422', '0.0007 C01694', '0.0015 C05437', '0.0006 PA', '0.0062 PC', '0.0046 PE', '0.0055 PI', '0.0017 PS', '0.4623 C00024.c'] ['Biomass']


Extract stoichiometry

In [10]:
def extract_stoichiometry_and_compound(rxn_side_split):
    rxn_side_stoich = []
    rxn_side_compound = []
    for reactant in rxn_side_split:
        if " " in reactant:
            # "<coefficient> <compound>": split into its two parts
            stoich, compound = reactant.split(" ")
            rxn_side_stoich.append(stoich)
            # The compartment suffix (e.g. ".ext") stays part of the compound ID
            rxn_side_compound.append(compound)
        else:
            # No explicit coefficient means a stoichiometry of 1
            rxn_side_stoich.append("1")
            rxn_side_compound.append(reactant)
    return rxn_side_stoich, rxn_side_compound

rxn_sub_stoich, rxn_sub_compound = extract_stoichiometry_and_compound(rxn_sub_split)
rxn_prod_stoich, rxn_prod_compound = extract_stoichiometry_and_compound(rxn_prod_split)

print(rxn_sub_stoich, rxn_sub_compound)
print(rxn_prod_stoich, rxn_prod_compound)

['0.459', '0.161', '0.102', '0.297', '0.007', '0.105', '0.302', '0.29', '0.066', '0.193', '0.296', '0.286', '0.051', '0.134', '0.165', '0.185', '0.191', '0.028', '0.102', '0.265', '0.519', '0.023', '0.808', '1.135', '0.051', '0.05', '0.051', '0.067', '0.0036', '0.0024', '0.0024', '0.0036', '0.0066', '0.0007', '0.0015', '0.0006', '0.0062', '0.0046', '0.0055', '0.0017', '0.4623'] ['C00041', 'C00062', 'C00152', 'C00049', 'C00097', 'C00064', 'C00025', 'C00037', 'C00135', 'C00407', 'C00123', 'C00047', 'C00073', 'C00079', 'C00148', 'C00065', 'C00188', 'C00078', 'C00082', 'C00183', 'C00182', 'Trehalose', 'C00464', 'C00965', 'C00020', 'C00055', 'C00144', 'C00105', 'C00360', 'C00239', 'C00362', 'C00364', 'C00422', 'C01694', 'C05437', 'PA', 'PC', 'PE', 'PI', 'PS', 'C00024.c']
['1'] ['Biomass']
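
The coefficients above are still strings. As an alternative sketch, a single regular expression can parse one term into a numeric coefficient and a compound ID in one step. The helper `parse_term` and the pattern are illustrative assumptions, not what `atom_mapping` uses internally.

```python
import re

# Optional numeric coefficient, then a compound identifier without spaces.
TERM = re.compile(r"^\s*(?:(\d+(?:\.\d+)?)\s+)?(\S+)\s*$")


def parse_term(term: str):
    """Return (coefficient, compound); the coefficient defaults to 1.0."""
    match = TERM.match(term)
    if match is None:
        raise ValueError(f"Cannot parse reaction term: {term!r}")
    coefficient, compound = match.groups()
    return (float(coefficient) if coefficient else 1.0), compound


print(parse_term("0.459 C00041"))  # (0.459, 'C00041')
print(parse_term("Biomass"))       # (1.0, 'Biomass')
```

Terms with more than two whitespace-separated parts raise a `ValueError` instead of being parsed silently.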


Run on all reactions

In [11]:
model_dict = {}

for i, row in wasylenko_kegg.iterrows():
    rxn = row.iloc[0]  # positional access; the only column is labelled "0" (a string)
    rxn_wo_am = remove_atommapping(rxn)
    rxn_sub, rxn_prod = split_rxn(rxn_wo_am)
    rxn_sub_split = split_reactants_products(rxn_sub)
    rxn_prod_split = split_reactants_products(rxn_prod)
    rxn_sub_split_cleared = clear_white_spaces(rxn_sub_split)
    rxn_prod_split_cleared = clear_white_spaces(rxn_prod_split)
    rxn_sub_stoich, rxn_sub_compound = extract_stoichiometry_and_compound(rxn_sub_split_cleared)
    rxn_prod_stoich, rxn_prod_compound = extract_stoichiometry_and_compound(rxn_prod_split_cleared)
    model_dict[i] = {
        "rxn_id": i,
        "equation": rxn_wo_am,
        "reactants_stoichiometry": [float(stoich) * -1 for stoich in rxn_sub_stoich],
        "reactants_ids": rxn_sub_compound,
        "products_stoichiometry": [float(stoich) for stoich in rxn_prod_stoich],
        "products_ids": rxn_prod_compound,
    }

rxn_data = pd.DataFrame.from_dict(model_dict, orient='index')
rxn_data.head(10)

Unnamed: 0,rxn_id,equation,reactants_stoichiometry,reactants_ids,products_stoichiometry,products_ids
0,0,C00031.ext -> C00092,[-1.0],[C00031.ext],[1.0],[C00092]
1,1,C01510.ext -> C00231,[-1.0],[C01510.ext],[1.0],[C00231]
2,2,C00469 -> C00469.ext,[-1.0],[C00469],[1.0],[C00469.ext]
3,3,C00033 -> C00033.ext,[-1.0],[C00033],[1.0],[C00033.ext]
4,4,C00116 -> C00116.ext,[-1.0],[C00116],[1.0],[C00116.ext]
5,5,0.459 C00041 + 0.161 C00062 + 0.102 C00152 + 0...,"[-0.459, -0.161, -0.102, -0.297, -0.007, -0.10...","[C00041, C00062, C00152, C00049, C00097, C0006...",[1.0],[Biomass]
6,6,C00022.c -> C00022.mnt,[-1.0],[C00022.c],[1.0],[C00022.mnt]
7,7,C00022.m -> C00022.mnt,[-1.0],[C00022.m],[1.0],[C00022.mnt]
8,8,C00022.mnt -> C00022.fix,[-1.0],[C00022.mnt],[1.0],[C00022.fix]
9,9,C00092 <-> C05345,[-1.0],[C00092],[1.0],[C05345]
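
Once the records are assembled, a quick consistency check can catch parsing mistakes early. The sketch below assumes the record layout used above (negative stoichiometry for reactants, positive for products); `check_reaction_record` is a hypothetical helper.

```python
def check_reaction_record(record: dict) -> list:
    """Return a list of problems found in one parsed reaction record."""
    problems = []
    if len(record["reactants_ids"]) != len(record["reactants_stoichiometry"]):
        problems.append("reactant IDs and stoichiometry differ in length")
    if len(record["products_ids"]) != len(record["products_stoichiometry"]):
        problems.append("product IDs and stoichiometry differ in length")
    if any(value >= 0 for value in record["reactants_stoichiometry"]):
        problems.append("reactant stoichiometry must be negative")
    if any(value <= 0 for value in record["products_stoichiometry"]):
        problems.append("product stoichiometry must be positive")
    return problems


record = {
    "rxn_id": 0,
    "equation": "C00031.ext -> C00092",
    "reactants_stoichiometry": [-1.0],
    "reactants_ids": ["C00031.ext"],
    "products_stoichiometry": [1.0],
    "products_ids": ["C00092"],
}
print(check_reaction_record(record))  # []
```

An empty list means the record passed all checks. With the dictionary built above, the same function could be applied to each entry of `model_dict.values()`.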


In [12]:
# Collect every metabolite ID that appears in any reaction
metabolites = []
for i, row in rxn_data.iterrows():
    metabolites.extend(row["reactants_ids"])
    metabolites.extend(row["products_ids"])
metabolites = list(set(metabolites))

# Keep only KEGG-style IDs (starting with "C") and strip the compartment suffix.
# Rows are keyed by KEGG ID, so compartment variants of one compound share a row.
met_dict = {}
for met in metabolites:
    if met.startswith("C"):
        temp_met = met.split(".")[0]
        met_dict[temp_met] = {
            "met_id": met,
            "annotations": {"kegg.compound": temp_met},
        }

met_df = pd.DataFrame.from_dict(met_dict, orient='index')
met_df

Unnamed: 0,met_id,annotations
C00082,C00082,{'kegg.compound': 'C00082'}
C00036,C00036.m,{'kegg.compound': 'C00036'}
C00011,C00011.out,{'kegg.compound': 'C00011'}
C00149,C00149,{'kegg.compound': 'C00149'}
C00062,C00062,{'kegg.compound': 'C00062'}
...,...,...
C05382,C05382,{'kegg.compound': 'C05382'}
C00152,C00152,{'kegg.compound': 'C00152'}
C00073,C00073,{'kegg.compound': 'C00073'}
C01694,C01694,{'kegg.compound': 'C01694'}


In [13]:
met_df.iloc[0].annotations

{'kegg.compound': 'C00082'}
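
Because the table is keyed by KEGG ID, compartment variants such as `C00022.c` and `C00022.m` end up in a single row, and only one of them survives as `met_id`. To see which IDs share a compound, they can be grouped explicitly. This is a sketch under the same assumptions as the cell above (IDs starting with "C" are KEGG IDs, and the suffix after the first dot is a compartment tag); `group_by_kegg_id` is a hypothetical helper.

```python
from collections import defaultdict


def group_by_kegg_id(metabolite_ids):
    """Map each KEGG compound ID to the metabolite IDs that refer to it."""
    groups = defaultdict(list)
    for met_id in metabolite_ids:
        if not met_id.startswith("C"):
            continue  # same filter as above: skip non-KEGG names such as "Biomass"
        groups[met_id.split(".")[0]].append(met_id)
    return dict(groups)


ids = ["C00022.c", "C00022.m", "C00022.mnt", "C00092", "Biomass"]
print(group_by_kegg_id(ids))
# {'C00022': ['C00022.c', 'C00022.m', 'C00022.mnt'], 'C00092': ['C00092']}
```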

#### 3. Create a `MolfileDownloader` object
- Calling the `generate_molfile_database` method triggers the download of the molfiles for the metabolites in the dataframe

In [17]:
base_path = "./"
downloader = atom_mapping.MolfileDownloader(met_df.iloc[0:2], base_path=base_path)
downloader.generate_molfile_database()

Fetching metabolite structures...
Successfully fetched 0/2 metabolites


#### 4. Write reactions in the correct format for RDT (`write_rxn_files`)

In [None]:
atom_mapping.write_rxn_files(rxn_data, base_path=base_path)

#### 5. Query RDT for the atom mapping (obtain_atom_mappings)
- This might take a while
- Can be skipped if you already have the atom mappings

In [None]:
atom_mapping.obtain_atom_mappings(max_time=20, base_path=base_path)

#### 6. Write reactions file with atom mapping

In [None]:
base_path + '/mappedRxns/rxnFiles'

In [None]:
atom_mapping.parse_reaction_mappings(mapped_rxn_path=base_path + '/mappedRxns/rxnFiles', rxn_data=rxn_data)

#### 7. When you're done, you can delete the downloaded files and directories to free up space

In [None]:
atom_mapping.clean_output(base_path=base_path)