# Create metabolite mapping table

**General steps to build the metabolite mapping table:**

> I / Install and load the modules.

> II / Create the metabolite table for the AraCore model.

> III / BiGG metabolites.

> IV / ModelSEED metabolites table.

> V / Mapping AraCore.

> VI / Aggregation of compounds identifiers.

> VII / Mapping by chemical formulas.

# I / Install missing modules and load modules 

In [45]:
import pandas as pd
import numpy as np
import cobra
import requests # requests module allows us to send HTTP requests

## 1) Load AraCore Model

### 1) a. Get file from GitHub

In [46]:
fileName = 'https://raw.githubusercontent.com/ma-blaetke/CBM_C3_C4_Metabolism/master/data/2018-23-05-mb-genC3.sbml'
r = requests.get(fileName) # download the SBML file; its content is available as r.text
r.raise_for_status() # stop here with an HTTPError if the download failed

### 1) b. Create model 

In [47]:
model = cobra.io.read_sbml_model(r.text)

In [48]:
model

0,1
Name,c3_model
Memory address,0x07f8dc65fab50
Number of metabolites,413
Number of reactions,572
Number of groups,0
Objective expression,1.0*Ex_Suc - 1.0*Ex_Suc_reverse_fb96e
Compartments,"Chloroplast, Lumen, Cytosol, Mitochondrion, IntermembraneSpace, Peroxisome"


## 2) Correct Compartment Naming in AraCore Model according to BiGG naming conventions

In [49]:
model.compartments

{'h': 'Chloroplast',
 'l': 'Lumen',
 'c': 'Cytosol',
 'm': 'Mitochondrion',
 'i': 'IntermembraneSpace',
 'p': 'Peroxisome'}

In [50]:
bigg_compartments = {'c':'cytosol',
'e':'extracellular space',
'p':'periplasm',
'm':'mitochondria',
'x':'peroxisome/glyoxysome',
'r':'endoplasmic reticulum',
'v':'vacuole',
'n':'nucleus',
'g':'golgi apparatus',
'u':'thylakoid',
'l':'lysosome',
'h':'chloroplast',
'f':'flagellum',
's':'eyespot',
'im':'intermembrane space of mitochondria',
'cx':'carboxyzome',
'um':'thylakoid membrane',
'cm':'cytosolic membrane',
'i':'inner mitochondrial compartment',
'mm':'mitochondrial intermembrane',
'w':'wildtype staph aureus',
'y':'cytochrome complex'}

In [51]:
bigg_compartments

{'c': 'cytosol',
 'e': 'extracellular space',
 'p': 'periplasm',
 'm': 'mitochondria',
 'x': 'peroxisome/glyoxysome',
 'r': 'endoplasmic reticulum',
 'v': 'vacuole',
 'n': 'nucleus',
 'g': 'golgi apparatus',
 'u': 'thylakoid',
 'l': 'lysosome',
 'h': 'chloroplast',
 'f': 'flagellum',
 's': 'eyespot',
 'im': 'intermembrane space of mitochondria',
 'cx': 'carboxyzome',
 'um': 'thylakoid membrane',
 'cm': 'cytosolic membrane',
 'i': 'inner mitochondrial compartment',
 'mm': 'mitochondrial intermembrane',
 'w': 'wildtype staph aureus',
 'y': 'cytochrome complex'}

The compartment symbols that can appear in a BiGG model are listed on the following page: http://bigg.ucsd.edu/compartments

**Summary of the notations used:**

- c: cytosol
- h: chloroplast
- m: mitochondria
- x: peroxisome/glyoxysome
- im: intermembrane space of mitochondria
- ul: thylakoid lumen <<< NEW


In [52]:
df_compartment_mapping = pd.Series(
  { 'c': 'c', #cytosol
  'h': 'h', #chloroplast
  'm': 'm', #mitochondria
  'p': 'x', #peroxisome/glyoxysome
  'i': 'im', #intermembrane space of mitochondria
  'l': 'ul', #thylakoid lumen <<< NEW
  }
 )
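
As a quick sanity check on the mapping above, the following sketch verifies that every compartment symbol of the model has a target symbol, and lists the target symbols that are absent from the BiGG compartment list. The dictionaries are copied from the cells above; the names `missing` and `new_symbols` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Compartments reported by model.compartments (copied from the output above)
model_compartments = {'h': 'Chloroplast', 'l': 'Lumen', 'c': 'Cytosol',
                      'm': 'Mitochondrion', 'i': 'IntermembraneSpace', 'p': 'Peroxisome'}

# Symbols of the BiGG compartment list (keys of bigg_compartments above)
bigg_symbols = {'c', 'e', 'p', 'm', 'x', 'r', 'v', 'n', 'g', 'u', 'l', 'h',
                'f', 's', 'im', 'cx', 'um', 'cm', 'i', 'mm', 'w', 'y'}

# AraCore symbol (index) -> BiGG-style symbol (value)
compartment_mapping = pd.Series({'c': 'c', 'h': 'h', 'm': 'm', 'p': 'x', 'i': 'im', 'l': 'ul'})

# Model compartments without a mapping entry (should be empty)
missing = sorted(set(model_compartments) - set(compartment_mapping.index))

# Target symbols that BiGG does not define (the newly introduced ones)
new_symbols = sorted(set(compartment_mapping.values) - bigg_symbols)

print(missing, new_symbols)
```

An empty `missing` list means the id update in the next section will not fail on an unknown compartment symbol.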

# II / Create Metabolite Table for AraCore Model

## 1) Create mapping table 

### 1) a. Create the dataframe 

In [53]:
#Create mapping table
df_metabolites_aracore = pd.DataFrame(
    {
        "aracore_ids" : [met_obj.id for met_obj in model.metabolites],
        "aracore_name" : [met_obj.name for met_obj in model.metabolites],
        "aracore_formula": [met_obj.formula for met_obj in model.metabolites],
        "aracore_annotations" : [met_obj.annotation for met_obj in model.metabolites]
    })

df_metabolites_aracore.head(25) 

Unnamed: 0,aracore_ids,aracore_name,aracore_formula,aracore_annotations
0,hnu_h,Photon,X,{}
1,PQ_h,Oxidized plastoquinone,C13H16O2,{}
2,H2O_h,"H2O, water",H2O,{}
3,H_h,"H+, proton",H,{}
4,PQH2_h,Reduced plastoquinone,C13H18O2,{}
5,O2_h,"O2, oxygen",O2,{}
6,H_l,"H+, proton",H,{}
7,PCox_h,Oxidized plastocyanin,X,{}
8,PCrd_h,Reduced plastocyanin,X,{}
9,Fdox_h,Oxidized ferredoxin,S8FeX,{}


### 1) b. Changes to the dataframe 

Following the BiGG conventions described on the website mentioned above, a few changes have to be made:
> - The compartment symbols need to be updated in the metabolite identifiers:
>>  - we split each metabolite identifier at its last "_" in order to isolate the compartment symbol.
>>  - we replace the compartment symbol with its BiGG equivalent (see the "bigg_compartments" dictionary and the "df_compartment_mapping" series above). 

> - Universal identifiers (identifiers without compartment) need to be created so that they can be matched against the ones in the BiGG metabolite table (mapping):
>>  - we split at the last "_" and keep the first part of each resulting list
>>  - the universal metabolite identifiers are converted to lower case so that the comparison does not depend on letter case
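
The steps listed above can also be written with vectorized pandas string methods. This is only a sketch on three ids taken from the table above, not the notebook's own implementation; the names `parts`, `updated`, and `universal_lower` are illustrative.

```python
import pandas as pd

# AraCore symbol (index) -> BiGG-style symbol (value), as defined earlier
compartment_mapping = pd.Series({'c': 'c', 'h': 'h', 'm': 'm', 'p': 'x', 'i': 'im', 'l': 'ul'})

# Three ids taken from the metabolite table above
ids = pd.Series(['hnu_h', 'H_l', 'PQH2_h'])

# Split once from the right: column 0 = universal part, column 1 = compartment
parts = ids.str.rsplit('_', n=1, expand=True)

universal = parts[0]
# Translate the compartment symbol and glue the id back together
updated = universal + '_' + parts[1].map(compartment_mapping)
# Lower-case universal ids for case-insensitive matching
universal_lower = universal.str.lower()

print(updated.tolist())
print(universal_lower.tolist())
```

Splitting from the right with `n=1` matters because some universal ids may themselves contain an underscore; only the trailing compartment symbol is separated.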

In [54]:
#Update compartment symbols in metabolite ids and make metabolite ids lower case
def apply_split_aracore_ids(aracore_id):
    # Split once from the right (maxsplit=1) so that only the trailing compartment symbol is separated
    name_molecule, compartment = aracore_id.rsplit('_', 1)
    # The pandas Series behaves like a dictionary: indexing it with the AraCore
    # compartment symbol (key) returns the BiGG compartment symbol (value)
    bigg_compartment = df_compartment_mapping[compartment]
    return f"{name_molecule}_{bigg_compartment}"

# apply() calls the function once for every id of the column
df_metabolites_aracore['aracore_updated_ids'] = df_metabolites_aracore['aracore_ids'].apply(apply_split_aracore_ids)


#Create universal metabolite ids by removing compartment symbols
def apply_split_updated_universal_ids(aracore_universal_id):
    # Keep everything before the last "_", i.e. drop the compartment symbol
    return aracore_universal_id.rsplit('_', 1)[0]

df_metabolites_aracore['aracore_updated_universal_ids'] = df_metabolites_aracore['aracore_updated_ids'].apply(apply_split_updated_universal_ids)

df_metabolites_aracore.head(25)

Unnamed: 0,aracore_ids,aracore_name,aracore_formula,aracore_annotations,aracore_updated_ids,aracore_updated_universal_ids
0,hnu_h,Photon,X,{},hnu_h,hnu
1,PQ_h,Oxidized plastoquinone,C13H16O2,{},PQ_h,PQ
2,H2O_h,"H2O, water",H2O,{},H2O_h,H2O
3,H_h,"H+, proton",H,{},H_h,H
4,PQH2_h,Reduced plastoquinone,C13H18O2,{},PQH2_h,PQH2
5,O2_h,"O2, oxygen",O2,{},O2_h,O2
6,H_l,"H+, proton",H,{},H_ul,H
7,PCox_h,Oxidized plastocyanin,X,{},PCox_h,PCox
8,PCrd_h,Reduced plastocyanin,X,{},PCrd_h,PCrd
9,Fdox_h,Oxidized ferredoxin,S8FeX,{},Fdox_h,Fdox


In [55]:
df_metabolites_aracore['aracore_updated_universal_ids_lower'] = df_metabolites_aracore['aracore_updated_universal_ids'].str.lower()
df_metabolites_aracore.head(25)

Unnamed: 0,aracore_ids,aracore_name,aracore_formula,aracore_annotations,aracore_updated_ids,aracore_updated_universal_ids,aracore_updated_universal_ids_lower
0,hnu_h,Photon,X,{},hnu_h,hnu,hnu
1,PQ_h,Oxidized plastoquinone,C13H16O2,{},PQ_h,PQ,pq
2,H2O_h,"H2O, water",H2O,{},H2O_h,H2O,h2o
3,H_h,"H+, proton",H,{},H_h,H,h
4,PQH2_h,Reduced plastoquinone,C13H18O2,{},PQH2_h,PQH2,pqh2
5,O2_h,"O2, oxygen",O2,{},O2_h,O2,o2
6,H_l,"H+, proton",H,{},H_ul,H,h
7,PCox_h,Oxidized plastocyanin,X,{},PCox_h,PCox,pcox
8,PCrd_h,Reduced plastocyanin,X,{},PCrd_h,PCrd,pcrd
9,Fdox_h,Oxidized ferredoxin,S8FeX,{},Fdox_h,Fdox,fdox


# III / BiGG Metabolites

## 1) Load BiGG metabolites table 

In [56]:
# Load BIGG Metabolites Table
bigg_metabolites_url = 'http://bigg.ucsd.edu/static/namespace/bigg_models_metabolites.txt'
df_metabolites_bigg = pd.read_csv(bigg_metabolites_url, sep='\t') # load the file and parse it according to the separator

df_metabolites_bigg.head(25)

Unnamed: 0,bigg_id,universal_bigg_id,name,model_list,database_links,old_bigg_ids
0,12dgr120_c,12dgr120,"1,2-Diacyl-sn-glycerol (didodecanoyl, n-C12:0)",iEC1364_W; iEC1349_Crooks; iEC1356_Bl21DE3; iM...,MetaNetX (MNX) Chemical: http://identifiers.or...,12dgr120; 12dgr120[c]; 12dgr120_c; _12dgr120_c
1,12dgr140_c,12dgr140,"1,2-Diacyl-sn-glycerol (ditetradecanoyl, n-C14:0)",iECNA114_1301; iECSE_1348; iECO111_1330; iECOK...,MetaNetX (MNX) Chemical: http://identifiers.or...,12dgr140; 12dgr140[c]; 12dgr140_c; _12dgr140_c
2,12dgr180_c,12dgr180,"1,2-Diacyl-sn-glycerol (dioctadecanoyl, n-C18:0)",iECB_1328; iECDH10B_1368; iEcE24377_1341; iECD...,MetaNetX (MNX) Chemical: http://identifiers.or...,12dgr180; 12dgr180[c]; 12dgr180_c; _12dgr180_c
3,14glucan_c,14glucan,"1,4-alpha-D-glucan",iSFxv_1172; iUTI89_1310; iSSON_1240; iSbBS512_...,BioCyc: http://identifiers.org/biocyc/META:1-4...,14glucan; 14glucan_c
4,15dap_c,15dap,"1,5-Diaminopentane",iECUMN_1333; iLF82_1304; iETEC_1333; iECSF_132...,KEGG Compound: http://identifiers.org/kegg.com...,15dap; 15dap[c]; 15dap_c
5,23ddhb_c,23ddhb,"2,3-Dihydro-2,3-dihydroxybenzoate",iEC1372_W3110; iEC1368_DH5a; iCN900; iEC1364_W...,KEGG Compound: http://identifiers.org/kegg.com...,23ddhb; 23ddhb_c
6,23dhba_c,23dhba,"(2,3-Dihydroxybenzoyl)adenylate",iECs_1301; iECO111_1330; iECP_1309; iECIAI1_13...,KEGG Compound: http://identifiers.org/kegg.com...,23dhba; 23dhba_c
7,23dhbzs_c,23dhbzs,"2,3-dihydroxybenzoylserine",STM_v1_0; iY75_1357; iAF1260b; iML1515; iEC134...,KEGG Compound: http://identifiers.org/kegg.com...,23dhbzs; 23dhbzs_c
8,26dap_LL_c,26dap_LL,"LL-2,6-Diaminoheptanedioate",iLJ478; iAF1260b; STM_v1_0; iJN678; iY75_1357;...,KEGG Compound: http://identifiers.org/kegg.com...,26dap-LL[c]; 26dap_DASH_LL_c; 26dap_LL; 26dap_...
9,2agpe141_c,2agpe141,2-Acyl-sn-glycero-3-phosphoethanolamine (n-C14:1),iEC1344_C; iYS1720; iEC1368_DH5a; iEC1372_W311...,MetaNetX (MNX) Chemical: http://identifiers.or...,2agpe141; 2agpe141_c; _2agpe141_c


## 2) Changes to the universal BiGG ids 

In order to make the correspondence between the universal BiGG identifiers and the aracore and ModelSEED ones, a few changes in the universal BiGG ids need to be performed:

> **For the BiGG database, we use the database links as "references", because they contain information about ModelSEED ids**

- convert the universal BiGG ids to lower case
- convert the string of database links into dictionaries:
    - key: database identifier/symbol
    - value: database-specific metabolite/annotation identifier
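
As a worked illustration of the conversion described above, the sketch below parses a single `database_links` string. The example string combines two entries seen in different rows of the table and is purely illustrative. Unlike a plain dictionary assignment, this hypothetical variant (`parse_database_links` is not part of the notebook) collects all identifiers of a database in a list, so that a second link to the same database does not overwrite the first one.

```python
from collections import defaultdict

def parse_database_links(links):
    """Hypothetical variant that keeps every identifier per database."""
    if not isinstance(links, str):  # missing entries are read as NaN (a float)
        return {}
    parsed = defaultdict(list)
    for entry in links.split(';'):
        # entry looks like 'KEGG Compound: http://identifiers.org/kegg.compound/C01672'
        url = entry.split(':', 1)[-1].strip()
        # The last two path segments are the database key and the identifier
        _, database, identifier = url.rsplit('/', 2)
        parsed[database].append(identifier)
    return dict(parsed)

example = ("KEGG Compound: http://identifiers.org/kegg.compound/C01672; "
           "MetaNetX (MNX) Chemical: http://identifiers.org/metanetx.chemical/MNXM4939")
links = parse_database_links(example)
print(links)
```

Whether to keep one identifier or a list per database is a design choice; a list avoids silently dropping alternatives but requires the later lookup steps to handle lists.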

In [57]:
#Convert universal bigg id for metabolites to lower case
df_metabolites_bigg['universal_bigg_id_lower'] = df_metabolites_bigg['universal_bigg_id'].str.lower()

In [58]:
#Convert string of database links into dictionaries of database identifier/symbol (key) and database-specific metabolite/annotation id (value)
#Parse the string to get the seed_id
def apply_parsing_databaseLinks(databaseLink):
    if isinstance(databaseLink, str):
        #isinstance(object, classinfo): returns True if the object argument is an instance of the classinfo argument.
        #If object is not an object of the given type, the function always returns False
        dictionary = {}
        databaseLink_splited = databaseLink.split(';')
        for db_link in databaseLink_splited:
            db_link_splited = db_link.split(':',1)
            intermediate_variable = db_link_splited[-1]
            intermediate_variable_splited = intermediate_variable.split('/')
            key = intermediate_variable_splited[-2]
            value = intermediate_variable_splited[-1]
            dictionary[key] = value
        #print(dictionary)
    else:
        dictionary = {}
    return dictionary

df_metabolites_bigg['database_links'] = df_metabolites_bigg['database_links'].apply(apply_parsing_databaseLinks) 
df_metabolites_bigg.head(25)

Unnamed: 0,bigg_id,universal_bigg_id,name,model_list,database_links,old_bigg_ids,universal_bigg_id_lower
0,12dgr120_c,12dgr120,"1,2-Diacyl-sn-glycerol (didodecanoyl, n-C12:0)",iEC1364_W; iEC1349_Crooks; iEC1356_Bl21DE3; iM...,{'metanetx.chemical': 'MNXM4939'},12dgr120; 12dgr120[c]; 12dgr120_c; _12dgr120_c,12dgr120
1,12dgr140_c,12dgr140,"1,2-Diacyl-sn-glycerol (ditetradecanoyl, n-C14:0)",iECNA114_1301; iECSE_1348; iECO111_1330; iECOK...,{'metanetx.chemical': 'MNXM146479'},12dgr140; 12dgr140[c]; 12dgr140_c; _12dgr140_c,12dgr140
2,12dgr180_c,12dgr180,"1,2-Diacyl-sn-glycerol (dioctadecanoyl, n-C18:0)",iECB_1328; iECDH10B_1368; iEcE24377_1341; iECD...,{'metanetx.chemical': 'MNXM4217'},12dgr180; 12dgr180[c]; 12dgr180_c; _12dgr180_c,12dgr180
3,14glucan_c,14glucan,"1,4-alpha-D-glucan",iSFxv_1172; iUTI89_1310; iSSON_1240; iSbBS512_...,"{'biocyc': 'META:1-4-alpha-D-Glucan', 'metanet...",14glucan; 14glucan_c,14glucan
4,15dap_c,15dap,"1,5-Diaminopentane",iECUMN_1333; iLF82_1304; iETEC_1333; iECSF_132...,"{'kegg.compound': 'C01672', 'chebi': 'CHEBI:58...",15dap; 15dap[c]; 15dap_c,15dap
5,23ddhb_c,23ddhb,"2,3-Dihydro-2,3-dihydroxybenzoate",iEC1372_W3110; iEC1368_DH5a; iCN900; iEC1364_W...,"{'kegg.compound': 'C04171', 'chebi': 'CHEBI:87...",23ddhb; 23ddhb_c,23ddhb
6,23dhba_c,23dhba,"(2,3-Dihydroxybenzoyl)adenylate",iECs_1301; iECO111_1330; iECP_1309; iECIAI1_13...,"{'kegg.compound': 'C04030', 'chebi': 'CHEBI:57...",23dhba; 23dhba_c,23dhba
7,23dhbzs_c,23dhbzs,"2,3-dihydroxybenzoylserine",STM_v1_0; iY75_1357; iAF1260b; iML1515; iEC134...,"{'kegg.compound': 'C04204', 'chebi': 'CHEBI:70...",23dhbzs; 23dhbzs_c,23dhbzs
8,26dap_LL_c,26dap_LL,"LL-2,6-Diaminoheptanedioate",iLJ478; iAF1260b; STM_v1_0; iJN678; iY75_1357;...,"{'kegg.compound': 'C00666', 'chebi': 'CHEBI:63...",26dap-LL[c]; 26dap_DASH_LL_c; 26dap_LL; 26dap_...,26dap_ll
9,2agpe141_c,2agpe141,2-Acyl-sn-glycero-3-phosphoethanolamine (n-C14:1),iEC1344_C; iYS1720; iEC1368_DH5a; iEC1372_W311...,{'metanetx.chemical': 'MNXM3447'},2agpe141; 2agpe141_c; _2agpe141_c,2agpe141


In [59]:
#All database keys in database_links
np.unique(df_metabolites_bigg['database_links'].apply(lambda x: list(x.keys())).sum())
#np.unique() function returns the sorted unique elements of an array

array(['biocyc', 'chebi', 'hmdb', 'inchikey', 'kegg.compound',
       'kegg.drug', 'kegg.glycan', 'lipidmaps', 'metanetx.chemical',
       'reactome', 'seed.compound'], dtype='<U17')
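
Summing lists with `.sum()` concatenates them one by one, which can become slow on a large table. A possible alternative, sketched here on three small dictionaries modelled on the rows above, is a set union over the dictionaries; the names `links_per_metabolite` and `database_keys` are illustrative.

```python
# Toy stand-in for the parsed 'database_links' column
links_per_metabolite = [
    {'metanetx.chemical': 'MNXM4939'},
    {'kegg.compound': 'C01672', 'seed.compound': 'cpd01155'},
    {},  # metabolite without any database link
]

# Iterating over a dict yields its keys, so union() collects all keys at once
database_keys = sorted(set().union(*links_per_metabolite))
print(database_keys)
```

With the notebook's data the same idea would read `sorted(set().union(*df_metabolites_bigg['database_links']))`, assuming the column has already been converted to dictionaries.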

## 3) Get ModelSEED compound ids (metabolite ids) into the BiGG model file  

In [60]:
#Get ModelSEED compound ids (metabolite ids) into the BiGG model file
def apply_modelSEED_id(dict_db_link):
    if "seed.compound" in dict_db_link.keys():
        seed_id = dict_db_link["seed.compound"]
    else:
        seed_id = None
    #print(seed_id)
    return seed_id

df_metabolites_bigg['seed.compound'] = df_metabolites_bigg['database_links'].apply(apply_modelSEED_id)

df_metabolites_bigg.head(25)

Unnamed: 0,bigg_id,universal_bigg_id,name,model_list,database_links,old_bigg_ids,universal_bigg_id_lower,seed.compound
0,12dgr120_c,12dgr120,"1,2-Diacyl-sn-glycerol (didodecanoyl, n-C12:0)",iEC1364_W; iEC1349_Crooks; iEC1356_Bl21DE3; iM...,{'metanetx.chemical': 'MNXM4939'},12dgr120; 12dgr120[c]; 12dgr120_c; _12dgr120_c,12dgr120,
1,12dgr140_c,12dgr140,"1,2-Diacyl-sn-glycerol (ditetradecanoyl, n-C14:0)",iECNA114_1301; iECSE_1348; iECO111_1330; iECOK...,{'metanetx.chemical': 'MNXM146479'},12dgr140; 12dgr140[c]; 12dgr140_c; _12dgr140_c,12dgr140,
2,12dgr180_c,12dgr180,"1,2-Diacyl-sn-glycerol (dioctadecanoyl, n-C18:0)",iECB_1328; iECDH10B_1368; iEcE24377_1341; iECD...,{'metanetx.chemical': 'MNXM4217'},12dgr180; 12dgr180[c]; 12dgr180_c; _12dgr180_c,12dgr180,
3,14glucan_c,14glucan,"1,4-alpha-D-glucan",iSFxv_1172; iUTI89_1310; iSSON_1240; iSbBS512_...,"{'biocyc': 'META:1-4-alpha-D-Glucan', 'metanet...",14glucan; 14glucan_c,14glucan,cpd21754
4,15dap_c,15dap,"1,5-Diaminopentane",iECUMN_1333; iLF82_1304; iETEC_1333; iECSF_132...,"{'kegg.compound': 'C01672', 'chebi': 'CHEBI:58...",15dap; 15dap[c]; 15dap_c,15dap,cpd01155
5,23ddhb_c,23ddhb,"2,3-Dihydro-2,3-dihydroxybenzoate",iEC1372_W3110; iEC1368_DH5a; iCN900; iEC1364_W...,"{'kegg.compound': 'C04171', 'chebi': 'CHEBI:87...",23ddhb; 23ddhb_c,23ddhb,cpd29666
6,23dhba_c,23dhba,"(2,3-Dihydroxybenzoyl)adenylate",iECs_1301; iECO111_1330; iECP_1309; iECIAI1_13...,"{'kegg.compound': 'C04030', 'chebi': 'CHEBI:57...",23dhba; 23dhba_c,23dhba,cpd02494
7,23dhbzs_c,23dhbzs,"2,3-dihydroxybenzoylserine",STM_v1_0; iY75_1357; iAF1260b; iML1515; iEC134...,"{'kegg.compound': 'C04204', 'chebi': 'CHEBI:70...",23dhbzs; 23dhbzs_c,23dhbzs,cpd15332
8,26dap_LL_c,26dap_LL,"LL-2,6-Diaminoheptanedioate",iLJ478; iAF1260b; STM_v1_0; iJN678; iY75_1357;...,"{'kegg.compound': 'C00666', 'chebi': 'CHEBI:63...",26dap-LL[c]; 26dap_DASH_LL_c; 26dap_LL; 26dap_...,26dap_ll,cpd00504
9,2agpe141_c,2agpe141,2-Acyl-sn-glycero-3-phosphoethanolamine (n-C14:1),iEC1344_C; iYS1720; iEC1368_DH5a; iEC1372_W311...,{'metanetx.chemical': 'MNXM3447'},2agpe141; 2agpe141_c; _2agpe141_c,2agpe141,


# IV / ModelSEED Metabolites table

## 1) Load ModelSEED metabolites table 

In [61]:
# Load ModelSeed Metabolites Table

seed_metabolites_url = 'https://raw.githubusercontent.com/ModelSEED/ModelSEEDDatabase/master/Biochemistry/compounds.tsv'
df_metabolites_seed = pd.read_csv(seed_metabolites_url, sep='\t')

df_metabolites_seed.head(25)


Unnamed: 0,id,abbreviation,name,formula,mass,source,inchikey,charge,is_core,is_obsolete,...,is_cofactor,deltag,deltagerr,pka,pkb,abstract_compound,comprised_of,aliases,smiles,notes
0,cpd00001,h2o,H2O,H2O,18.0,Primary Database,XLYOFNOQVPJJNP-UHFFFAOYSA-N,0,1,0,...,0,-37.54,0.18,1:1:15.70,1:1:-1.80,,,Name: H20; H2O; H3O+; HO-; Hydroxide ion; OH; ...,O,GC|EQ|EQU
1,cpd00002,atp,ATP,C10H13N5O13P3,504.0,Primary Database,ZKHQWZAMYRWXGA-KQYNXXCUSA-K,-3,1,0,...,0,-548.85,0.36,1:14:12.60;1:22:3.29;1:26:0.90;1:29:7.42;1:30:...,1:6:-7.46;1:9:-1.06;1:14:-3.85;1:15:4.93,,,Name: ATP; Adenosine 5'-triphosphate; adenosin...,Nc1ncnc2c1ncn2[C@@H]1O[C@H](COP(=O)([O-])OP(=O...,GC|EQ|EQU
2,cpd00003,nad,NAD,C21H26N7O14P2,662.0,Primary Database,BAWFJGJZGIEFAR-NNYOXOHSSA-M,-1,1,0,...,0,-286.41,1.59,1:6:11.94;1:17:1.85;1:18:2.28;1:25:11.38;1:35:...,1:6:-4.22;1:35:-3.85;1:37:-1.05;1:41:4.93;1:43...,,,Name: DPN; DPN+; DPN-ox; Diphosphopyridine nuc...,NC(=O)c1ccc[n+]([C@@H]2O[C@H](COP(=O)([O-])OP(...,GC|EQ|EQU
3,cpd00004,nadh,NADH,C21H27N7O14P2,663.0,Primary Database,BOPGDPNILDQYTO-NNYOXOHSSA-L,-2,1,0,...,0,-271.15,1.59,1:14:12.28;1:18:14.00;1:22:-7.46;1:26:-1.05;1:...,1:6:2.28;1:9:1.85;1:14:-3.85;1:15:4.93;1:18:-3...,,,Name: DPNH; NAD-reduced; NADH; NADH+H+; NADH2;...,NC(=O)C1=CN([C@@H]2O[C@H](COP(=O)([O-])OP(=O)(...,GC|EQ|EQU
4,cpd00005,nadph,NADPH,C21H26N7O17P3,742.0,Primary Database,ACFIXJIJDZMPPO-NNYOXOHSSA-J,-4,1,0,...,0,-483.1,1.62,1:18:0.90;1:19:5.78;1:26:0.66;1:30:3.26;1:40:1...,1:11:-7.46;1:12:-1.06;1:22:4.87;1:40:-3.78,,,Name: NADP(H); NADP-red; NADP-reduced; NADPH; ...,NC(=O)C1=CN([C@@H]2O[C@H](COP(=O)([O-])OP(=O)(...,GC|EQ|EQU
5,cpd00006,nadp,NADP,C21H25N7O17P3,741.0,Primary Database,XJLXINKUBYWONI-NNYOXOHSSA-K,-3,1,0,...,0,-498.36,1.63,1:18:0.90;1:19:5.78;1:26:3.26;1:30:0.66;1:47:1...,1:11:-7.46;1:12:-1.06;1:22:4.87,,,Name: NADP; NADP(+); NADP+; NADP-ox; NADP-oxid...,NC(=O)c1ccc[n+]([C@@H]2O[C@H](COP(=O)([O-])OP(...,GC|EQ|EQU
6,cpd00007,o2,O2,O2,32.0,Primary Database,MYMOFIZGZYHOMD-UHFFFAOYSA-N,0,1,0,...,0,3.92,0.71,,,,,Name: O2; Oxygen; dioxygen; oxygen; oxygen mol...,O=O,GC|EQ|EQU
7,cpd00008,adp,ADP,C10H13N5O10P2,425.0,Primary Database,XTWYTFMLZFPYCI-KQYNXXCUSA-L,-2,1,0,...,0,-340.04,0.3,1:14:12.46;1:18:13.98;1:22:2.22;1:25:7.42;1:26...,1:6:-7.46;1:9:-1.05;1:14:-3.85;1:15:4.93;1:18:...,,,Name: ADP; Adenosine 5'-diphosphate; Adenosine...,Nc1ncnc2c1ncn2[C@@H]1O[C@H](COP(=O)([O-])OP(=O...,GC|EQ|EQU
8,cpd00009,pi,Phosphate,HO4P,96.0,Primary Database,NBIIXXVUZAFLBC-UHFFFAOYSA-L,-2,1,0,...,0,-252.51,0.18,1:2:12.90;1:3:1.80;1:4:6.95,,,,Name: H2PO4-; HPO4-2; HPO42-; Orthophosphate; ...,O=P([O-])([O-])O,GC|EQ|EQU
9,cpd00010,coa,CoA,C21H32N7O16P3S,764.0,Primary Database,RGJOEKWQDUBAIZ-IBOSZNHHSA-J,-4,1,0,...,0,-429.53,1.87,1:22:0.92;1:23:5.94;1:26:0.83;1:30:3.27;1:48:1...,1:8:-7.46;1:9:-1.06;1:17:4.89,,,Name: CoA; CoA-SH; Coenzyme A; CoenzymeA; Coen...,CC(C)(COP(=O)([O-])OP(=O)([O-])OC[C@H]1O[C@@H]...,GC|EQ|EQU


## 2) Changes to the abbreviation 

As previously described for the AraCore and BiGG ids, a few changes are needed in order to match the databases against the AraCore model. In this case, we focus on the "abbreviation" column, because its entries resemble the BiGG ids and the AraCore ids. 

> **For the ModelSEED database, we use aliases as "references" because they contain information about BiGG ids**

Thus, we need to perform the following changes:
- convert the abbreviation to lower case
- convert the string of "aliases" pairs into dictionaries of keys and values:
    1. split by "|" to separate the different key-value pairs
    2. split by ":" to separate keys and values
    3. some values are themselves strings holding several items and need to be split again:
        - split those strings by ";" and remove leading and trailing white space
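
The three steps above can be illustrated on a shortened alias string. This is a compact sketch, not the notebook's implementation: `parse_aliases` is a hypothetical helper, the example string is abridged from the first row of the table, and stripping white space from the keys is an added assumption.

```python
def parse_aliases(aliases):
    """Hypothetical one-pass version of the alias parsing steps."""
    if not isinstance(aliases, str):  # missing aliases are read as NaN (a float)
        return {}
    parsed = {}
    for pair in aliases.split('|'):             # 1) separate key-value pairs
        key, _, values = pair.partition(':')    # 2) split at the first ':'
        # 3) split multi-item values and strip surrounding white space
        parsed[key.strip()] = [item.strip() for item in values.split(';')]
    return parsed

example = "Name: H20; H2O|BiGG: h2o; oh1"
aliases = parse_aliases(example)
print(aliases)
```

Splitting only at the first ":" keeps values that contain a colon themselves intact.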

In [62]:
df_metabolites_seed['abbreviation_lower'] = df_metabolites_seed['abbreviation'].str.lower()

In [63]:
#Convert string of alias pairs into dictionaries of keys and values
# 1) split by "|" to separate the different key - value pairs
# 2) split by ":" to separate keys and values
def apply_alias_into_dict(aliases_str):
    if isinstance(aliases_str, str):
        dictionary = {}        
        for alias in aliases_str.split("|"): 
            intermediate_alias = alias.split(":", 1)
            key = intermediate_alias[0]
            value = intermediate_alias[-1]
            dictionary[key] = value
    else:
        dictionary = {}
    #print(dictionary)
    return dictionary
    
df_metabolites_seed['aliases_dict'] = df_metabolites_seed['aliases'].apply(apply_alias_into_dict)


#Some of the keys have values that are again strings with multiple items that need to be split
# 3) split those strings by ';' and also remove leading and trailing white spaces 
def apply_items_splited(alias_dict):
    dict_alias = {}
    for alias_key, alias_values in alias_dict.items():
        # build a fresh list for each key: split the value string and strip every item
        dict_alias[alias_key] = [alias_value.strip() for alias_value in alias_values.split(';')]
    return dict_alias

df_metabolites_seed['aliases_dict_items_splitted'] = df_metabolites_seed['aliases_dict'].apply(apply_items_splited)
        
df_metabolites_seed.head(25)

Unnamed: 0,id,abbreviation,name,formula,mass,source,inchikey,charge,is_core,is_obsolete,...,pka,pkb,abstract_compound,comprised_of,aliases,smiles,notes,abbreviation_lower,aliases_dict,aliases_dict_items_splitted
0,cpd00001,h2o,H2O,H2O,18.0,Primary Database,XLYOFNOQVPJJNP-UHFFFAOYSA-N,0,1,0,...,1:1:15.70,1:1:-1.80,,,Name: H20; H2O; H3O+; HO-; Hydroxide ion; OH; ...,O,GC|EQ|EQU,h2o,{'Name': ' H20; H2O; H3O+; HO-; Hydroxide ion;...,"{'Name': ['H20', 'H2O', 'H3O+', 'HO-', 'Hydrox..."
1,cpd00002,atp,ATP,C10H13N5O13P3,504.0,Primary Database,ZKHQWZAMYRWXGA-KQYNXXCUSA-K,-3,1,0,...,1:14:12.60;1:22:3.29;1:26:0.90;1:29:7.42;1:30:...,1:6:-7.46;1:9:-1.06;1:14:-3.85;1:15:4.93,,,Name: ATP; Adenosine 5'-triphosphate; adenosin...,Nc1ncnc2c1ncn2[C@@H]1O[C@H](COP(=O)([O-])OP(=O...,GC|EQ|EQU,atp,{'Name': ' ATP; Adenosine 5'-triphosphate; ade...,"{'Name': ['ATP', 'Adenosine 5'-triphosphate', ..."
2,cpd00003,nad,NAD,C21H26N7O14P2,662.0,Primary Database,BAWFJGJZGIEFAR-NNYOXOHSSA-M,-1,1,0,...,1:6:11.94;1:17:1.85;1:18:2.28;1:25:11.38;1:35:...,1:6:-4.22;1:35:-3.85;1:37:-1.05;1:41:4.93;1:43...,,,Name: DPN; DPN+; DPN-ox; Diphosphopyridine nuc...,NC(=O)c1ccc[n+]([C@@H]2O[C@H](COP(=O)([O-])OP(...,GC|EQ|EQU,nad,{'Name': ' DPN; DPN+; DPN-ox; Diphosphopyridin...,"{'Name': ['DPN', 'DPN+', 'DPN-ox', 'Diphosphop..."
3,cpd00004,nadh,NADH,C21H27N7O14P2,663.0,Primary Database,BOPGDPNILDQYTO-NNYOXOHSSA-L,-2,1,0,...,1:14:12.28;1:18:14.00;1:22:-7.46;1:26:-1.05;1:...,1:6:2.28;1:9:1.85;1:14:-3.85;1:15:4.93;1:18:-3...,,,Name: DPNH; NAD-reduced; NADH; NADH+H+; NADH2;...,NC(=O)C1=CN([C@@H]2O[C@H](COP(=O)([O-])OP(=O)(...,GC|EQ|EQU,nadh,{'Name': ' DPNH; NAD-reduced; NADH; NADH+H+; N...,"{'Name': ['DPNH', 'NAD-reduced', 'NADH', 'NADH..."
4,cpd00005,nadph,NADPH,C21H26N7O17P3,742.0,Primary Database,ACFIXJIJDZMPPO-NNYOXOHSSA-J,-4,1,0,...,1:18:0.90;1:19:5.78;1:26:0.66;1:30:3.26;1:40:1...,1:11:-7.46;1:12:-1.06;1:22:4.87;1:40:-3.78,,,Name: NADP(H); NADP-red; NADP-reduced; NADPH; ...,NC(=O)C1=CN([C@@H]2O[C@H](COP(=O)([O-])OP(=O)(...,GC|EQ|EQU,nadph,{'Name': ' NADP(H); NADP-red; NADP-reduced; NA...,"{'Name': ['NADP(H)', 'NADP-red', 'NADP-reduced..."
5,cpd00006,nadp,NADP,C21H25N7O17P3,741.0,Primary Database,XJLXINKUBYWONI-NNYOXOHSSA-K,-3,1,0,...,1:18:0.90;1:19:5.78;1:26:3.26;1:30:0.66;1:47:1...,1:11:-7.46;1:12:-1.06;1:22:4.87,,,Name: NADP; NADP(+); NADP+; NADP-ox; NADP-oxid...,NC(=O)c1ccc[n+]([C@@H]2O[C@H](COP(=O)([O-])OP(...,GC|EQ|EQU,nadp,{'Name': ' NADP; NADP(+); NADP+; NADP-ox; NADP...,"{'Name': ['NADP', 'NADP(+)', 'NADP+', 'NADP-ox..."
6,cpd00007,o2,O2,O2,32.0,Primary Database,MYMOFIZGZYHOMD-UHFFFAOYSA-N,0,1,0,...,,,,,Name: O2; Oxygen; dioxygen; oxygen; oxygen mol...,O=O,GC|EQ|EQU,o2,{'Name': ' O2; Oxygen; dioxygen; oxygen; oxyge...,"{'Name': ['O2', 'Oxygen', 'dioxygen', 'oxygen'..."
7,cpd00008,adp,ADP,C10H13N5O10P2,425.0,Primary Database,XTWYTFMLZFPYCI-KQYNXXCUSA-L,-2,1,0,...,1:14:12.46;1:18:13.98;1:22:2.22;1:25:7.42;1:26...,1:6:-7.46;1:9:-1.05;1:14:-3.85;1:15:4.93;1:18:...,,,Name: ADP; Adenosine 5'-diphosphate; Adenosine...,Nc1ncnc2c1ncn2[C@@H]1O[C@H](COP(=O)([O-])OP(=O...,GC|EQ|EQU,adp,{'Name': ' ADP; Adenosine 5'-diphosphate; Aden...,"{'Name': ['ADP', 'Adenosine 5'-diphosphate', '..."
8,cpd00009,pi,Phosphate,HO4P,96.0,Primary Database,NBIIXXVUZAFLBC-UHFFFAOYSA-L,-2,1,0,...,1:2:12.90;1:3:1.80;1:4:6.95,,,,Name: H2PO4-; HPO4-2; HPO42-; Orthophosphate; ...,O=P([O-])([O-])O,GC|EQ|EQU,pi,{'Name': ' H2PO4-; HPO4-2; HPO42-; Orthophosph...,"{'Name': ['H2PO4-', 'HPO4-2', 'HPO42-', 'Ortho..."
9,cpd00010,coa,CoA,C21H32N7O16P3S,764.0,Primary Database,RGJOEKWQDUBAIZ-IBOSZNHHSA-J,-4,1,0,...,1:22:0.92;1:23:5.94;1:26:0.83;1:30:3.27;1:48:1...,1:8:-7.46;1:9:-1.06;1:17:4.89,,,Name: CoA; CoA-SH; Coenzyme A; CoenzymeA; Coen...,CC(C)(COP(=O)([O-])OP(=O)([O-])OC[C@H]1O[C@@H]...,GC|EQ|EQU,coa,{'Name': ' CoA; CoA-SH; Coenzyme A; CoenzymeA;...,"{'Name': ['CoA', 'CoA-SH', 'Coenzyme A', 'Coen..."


In [64]:
#All database keys in aliases
np.unique(df_metabolites_seed['aliases_dict_items_splitted'].apply(lambda x: list(x.keys())).sum())

array(['AlgaGEM', 'AraCyc', 'AraGEM', 'BiGG', 'BrachyCyc', 'ChlamyCyc',
       'CornCyc', 'DF_Athaliana', 'EcoCyc', 'JM_Creinhardtii',
       'JP_Creinhardtii_MSB', 'JP_Creinhardtii_NMeth', 'KEGG', 'MaizeCyc',
       'Maize_C4GEM', 'MetaCyc', 'Name', 'PlantCyc', 'PoplarCyc',
       'RiceCyc', 'SorghumCyc', 'SoyCyc', 'TS_Athaliana', 'iAF1260',
       'iAF692', 'iAG612', 'iAO358', 'iAbaylyiv4', 'iGT196', 'iIN800',
       'iIT341', 'iJN746', 'iJR904', 'iMA945', 'iMEO21', 'iMM904',
       'iMO1053-PAO1', 'iMO1056', 'iND750', 'iNJ661', 'iPS189', 'iRR1083',
       'iRS1563', 'iRS1597', 'iSB619', 'iSO783', 'iYO844'], dtype='<U21')

## 3) Get BiGG metabolite ids 

In [65]:
#Get BiGG metabolite ids
def apply_bigg_metabolites_id(dict_aliases):
    # dict.get returns the list of BiGG ids, or None if the compound has no "BiGG" alias
    return dict_aliases.get("BiGG")
        
df_metabolites_seed['BiGG'] = df_metabolites_seed['aliases_dict_items_splitted'].apply(apply_bigg_metabolites_id)
        
df_metabolites_seed.head(25)

Unnamed: 0,id,abbreviation,name,formula,mass,source,inchikey,charge,is_core,is_obsolete,...,pkb,abstract_compound,comprised_of,aliases,smiles,notes,abbreviation_lower,aliases_dict,aliases_dict_items_splitted,BiGG
0,cpd00001,h2o,H2O,H2O,18.0,Primary Database,XLYOFNOQVPJJNP-UHFFFAOYSA-N,0,1,0,...,1:1:-1.80,,,Name: H20; H2O; H3O+; HO-; Hydroxide ion; OH; ...,O,GC|EQ|EQU,h2o,{'Name': ' H20; H2O; H3O+; HO-; Hydroxide ion;...,"{'Name': ['H20', 'H2O', 'H3O+', 'HO-', 'Hydrox...","[h2o, oh1]"
1,cpd00002,atp,ATP,C10H13N5O13P3,504.0,Primary Database,ZKHQWZAMYRWXGA-KQYNXXCUSA-K,-3,1,0,...,1:6:-7.46;1:9:-1.06;1:14:-3.85;1:15:4.93,,,Name: ATP; Adenosine 5'-triphosphate; adenosin...,Nc1ncnc2c1ncn2[C@@H]1O[C@H](COP(=O)([O-])OP(=O...,GC|EQ|EQU,atp,{'Name': ' ATP; Adenosine 5'-triphosphate; ade...,"{'Name': ['ATP', 'Adenosine 5'-triphosphate', ...",[atp]
2,cpd00003,nad,NAD,C21H26N7O14P2,662.0,Primary Database,BAWFJGJZGIEFAR-NNYOXOHSSA-M,-1,1,0,...,1:6:-4.22;1:35:-3.85;1:37:-1.05;1:41:4.93;1:43...,,,Name: DPN; DPN+; DPN-ox; Diphosphopyridine nuc...,NC(=O)c1ccc[n+]([C@@H]2O[C@H](COP(=O)([O-])OP(...,GC|EQ|EQU,nad,{'Name': ' DPN; DPN+; DPN-ox; Diphosphopyridin...,"{'Name': ['DPN', 'DPN+', 'DPN-ox', 'Diphosphop...",[nad]
3,cpd00004,nadh,NADH,C21H27N7O14P2,663.0,Primary Database,BOPGDPNILDQYTO-NNYOXOHSSA-L,-2,1,0,...,1:6:2.28;1:9:1.85;1:14:-3.85;1:15:4.93;1:18:-3...,,,Name: DPNH; NAD-reduced; NADH; NADH+H+; NADH2;...,NC(=O)C1=CN([C@@H]2O[C@H](COP(=O)([O-])OP(=O)(...,GC|EQ|EQU,nadh,{'Name': ' DPNH; NAD-reduced; NADH; NADH+H+; N...,"{'Name': ['DPNH', 'NAD-reduced', 'NADH', 'NADH...",[nadh]
4,cpd00005,nadph,NADPH,C21H26N7O17P3,742.0,Primary Database,ACFIXJIJDZMPPO-NNYOXOHSSA-J,-4,1,0,...,1:11:-7.46;1:12:-1.06;1:22:4.87;1:40:-3.78,,,Name: NADP(H); NADP-red; NADP-reduced; NADPH; ...,NC(=O)C1=CN([C@@H]2O[C@H](COP(=O)([O-])OP(=O)(...,GC|EQ|EQU,nadph,{'Name': ' NADP(H); NADP-red; NADP-reduced; NA...,"{'Name': ['NADP(H)', 'NADP-red', 'NADP-reduced...",[nadph]
5,cpd00006,nadp,NADP,C21H25N7O17P3,741.0,Primary Database,XJLXINKUBYWONI-NNYOXOHSSA-K,-3,1,0,...,1:11:-7.46;1:12:-1.06;1:22:4.87,,,Name: NADP; NADP(+); NADP+; NADP-ox; NADP-oxid...,NC(=O)c1ccc[n+]([C@@H]2O[C@H](COP(=O)([O-])OP(...,GC|EQ|EQU,nadp,{'Name': ' NADP; NADP(+); NADP+; NADP-ox; NADP...,"{'Name': ['NADP', 'NADP(+)', 'NADP+', 'NADP-ox...",[nadp]
6,cpd00007,o2,O2,O2,32.0,Primary Database,MYMOFIZGZYHOMD-UHFFFAOYSA-N,0,1,0,...,,,,Name: O2; Oxygen; dioxygen; oxygen; oxygen mol...,O=O,GC|EQ|EQU,o2,{'Name': ' O2; Oxygen; dioxygen; oxygen; oxyge...,"{'Name': ['O2', 'Oxygen', 'dioxygen', 'oxygen'...",[o2]
7,cpd00008,adp,ADP,C10H13N5O10P2,425.0,Primary Database,XTWYTFMLZFPYCI-KQYNXXCUSA-L,-2,1,0,...,1:6:-7.46;1:9:-1.05;1:14:-3.85;1:15:4.93;1:18:...,,,Name: ADP; Adenosine 5'-diphosphate; Adenosine...,Nc1ncnc2c1ncn2[C@@H]1O[C@H](COP(=O)([O-])OP(=O...,GC|EQ|EQU,adp,{'Name': ' ADP; Adenosine 5'-diphosphate; Aden...,"{'Name': ['ADP', 'Adenosine 5'-diphosphate', '...",[adp]
8,cpd00009,pi,Phosphate,HO4P,96.0,Primary Database,NBIIXXVUZAFLBC-UHFFFAOYSA-L,-2,1,0,...,,,,Name: H2PO4-; HPO4-2; HPO42-; Orthophosphate; ...,O=P([O-])([O-])O,GC|EQ|EQU,pi,{'Name': ' H2PO4-; HPO4-2; HPO42-; Orthophosph...,"{'Name': ['H2PO4-', 'HPO4-2', 'HPO42-', 'Ortho...",[pi]
9,cpd00010,coa,CoA,C21H32N7O16P3S,764.0,Primary Database,RGJOEKWQDUBAIZ-IBOSZNHHSA-J,-4,1,0,...,1:8:-7.46;1:9:-1.06;1:17:4.89,,,Name: CoA; CoA-SH; Coenzyme A; CoenzymeA; Coen...,CC(C)(COP(=O)([O-])OP(=O)([O-])OC[C@H]1O[C@@H]...,GC|EQ|EQU,coa,{'Name': ' CoA; CoA-SH; Coenzyme A; CoenzymeA;...,"{'Name': ['CoA', 'CoA-SH', 'Coenzyme A', 'Coen...",[coa]


In [66]:
df_metabolites_seed['BiGG'].notna().value_counts() # => only 2729 ModelSEED compounds are mapped to BiGG ids

False    31263
True      2729
Name: BiGG, dtype: int64

# V / Mapping AraCore 

## 1) Mapping AraCore - BiGG 

### Based on BiGG and AraCore Metabolite Ids 

In [67]:
df_metabolites_aracore.head(25)

Unnamed: 0,aracore_ids,aracore_name,aracore_formula,aracore_annotations,aracore_updated_ids,aracore_updated_universal_ids,aracore_updated_universal_ids_lower
0,hnu_h,Photon,X,{},hnu_h,hnu,hnu
1,PQ_h,Oxidized plastoquinone,C13H16O2,{},PQ_h,PQ,pq
2,H2O_h,"H2O, water",H2O,{},H2O_h,H2O,h2o
3,H_h,"H+, proton",H,{},H_h,H,h
4,PQH2_h,Reduced plastoquinone,C13H18O2,{},PQH2_h,PQH2,pqh2
5,O2_h,"O2, oxygen",O2,{},O2_h,O2,o2
6,H_l,"H+, proton",H,{},H_ul,H,h
7,PCox_h,Oxidized plastocyanin,X,{},PCox_h,PCox,pcox
8,PCrd_h,Reduced plastocyanin,X,{},PCrd_h,PCrd,pcrd
9,Fdox_h,Oxidized ferredoxin,S8FeX,{},Fdox_h,Fdox,fdox


In [68]:
#Check whether each universal AraCore metabolite id has an identically named (lowercase) universal metabolite id in BiGG
df_metabolites_aracore['is_bigg_id'] = df_metabolites_aracore['aracore_updated_universal_ids_lower'].apply(lambda met_id: ((df_metabolites_bigg['universal_bigg_id_lower'] == met_id).sum() > 0))

#Fill column for universal bigg ids for which above is the case
df_metabolites_aracore['universal_bigg_id'] = df_metabolites_aracore[['aracore_updated_universal_ids_lower','is_bigg_id']].apply(lambda x: x[0] if x[1] else None,axis=1)

#Add ModelSEED compound ids from BiGG metabolite table => df_metabolites_bigg['seed.compound']
df_metabolites_aracore['bigg_seed_id'] = df_metabolites_aracore[['aracore_updated_universal_ids_lower','is_bigg_id']].apply(lambda x: df_metabolites_bigg[df_metabolites_bigg['universal_bigg_id_lower'] == x[0]]['seed.compound'].unique()[0] if x[1] else None,axis=1)

df_metabolites_aracore.head(25)

Unnamed: 0,aracore_ids,aracore_name,aracore_formula,aracore_annotations,aracore_updated_ids,aracore_updated_universal_ids,aracore_updated_universal_ids_lower,is_bigg_id,universal_bigg_id,bigg_seed_id
0,hnu_h,Photon,X,{},hnu_h,hnu,hnu,False,,
1,PQ_h,Oxidized plastoquinone,C13H16O2,{},PQ_h,PQ,pq,True,pq,
2,H2O_h,"H2O, water",H2O,{},H2O_h,H2O,h2o,True,h2o,cpd27222
3,H_h,"H+, proton",H,{},H_h,H,h,True,h,cpd00067
4,PQH2_h,Reduced plastoquinone,C13H18O2,{},PQH2_h,PQH2,pqh2,True,pqh2,
5,O2_h,"O2, oxygen",O2,{},O2_h,O2,o2,True,o2,cpd00007
6,H_l,"H+, proton",H,{},H_ul,H,h,True,h,cpd00067
7,PCox_h,Oxidized plastocyanin,X,{},PCox_h,PCox,pcox,True,pcox,cpd30035
8,PCrd_h,Reduced plastocyanin,X,{},PCrd_h,PCrd,pcrd,True,pcrd,cpd30034
9,Fdox_h,Oxidized ferredoxin,S8FeX,{},Fdox_h,Fdox,fdox,True,fdox,cpd15876


In [69]:
#Count the mappings between aracore_updated_universal_ids_lower and universal_bigg_id_lower
df_metabolites_aracore['is_bigg_id'].value_counts() # => 165 mappings between aracore_updated_universal_ids_lower and universal_bigg_id_lower
#value_counts() returns a Series with the number of occurrences of each distinct value.

False    248
True     165
Name: is_bigg_id, dtype: int64

In [70]:
#Count the ModelSEED compound ids retrieved through the mapped BiGG ids (the entries that are not NaN)
df_metabolites_aracore['bigg_seed_id'].isna().value_counts() # => 156 ModelSEED compound ids found via the mapped BiGG ids
#isna() flags the entries that are NaN/None, so the False count is the number of ids found

True     257
False    156
Name: bigg_seed_id, dtype: int64
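
The row-wise lookups in In [68] scan the whole BiGG table once per AraCore metabolite. As a minimal sketch on toy data (the two small frames below are hypothetical stand-ins for `df_metabolites_aracore` and `df_metabolites_bigg`, not the real tables), the same three columns could be built with `isin` and `map`. Like the original, it keeps only the first ModelSEED id per BiGG id.

```python
import pandas as pd

# Toy stand-ins for df_metabolites_aracore and df_metabolites_bigg (hypothetical data)
aracore = pd.DataFrame({"aracore_updated_universal_ids_lower": ["hnu", "h2o", "h", "pq"]})
bigg = pd.DataFrame({
    "universal_bigg_id_lower": ["h2o", "h", "pq", "atp"],
    "seed.compound": ["cpd27222", "cpd00067", None, "cpd00002"],
})

key = aracore["aracore_updated_universal_ids_lower"]

# True where the lowercase AraCore id also exists as a BiGG universal id
aracore["is_bigg_id"] = key.isin(bigg["universal_bigg_id_lower"])

# Keep the id where it matched, leave the cell missing otherwise
aracore["universal_bigg_id"] = key.where(aracore["is_bigg_id"])

# One ModelSEED id per BiGG id (first occurrence), then look it up by key
lookup = (
    bigg.drop_duplicates("universal_bigg_id_lower")
        .set_index("universal_bigg_id_lower")["seed.compound"]
)
aracore["bigg_seed_id"] = key.map(lookup)

print(aracore)
```

Whether this is worth adopting depends on table size; for 413 metabolites the original `apply` calls remain perfectly usable.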

## 2) Mapping AraCore - ModelSEED 

In [71]:
#Check whether each universal AraCore metabolite id matches a ModelSEED abbreviation (lowercase comparison)
df_metabolites_aracore['is_seed_id'] = df_metabolites_aracore['aracore_updated_universal_ids_lower'].apply(lambda met_id: ((df_metabolites_seed['abbreviation_lower'] == met_id).sum() > 0))

#Fill column with the ModelSEED compound id where this is the case (first match only)
df_metabolites_aracore['seed_id'] = df_metabolites_aracore[['aracore_updated_universal_ids_lower','is_seed_id']].apply(lambda x: df_metabolites_seed[df_metabolites_seed['abbreviation_lower'] == x[0]]['id'].tolist()[0] if x[1] else None,axis=1)

#Add BiGG ids from the ModelSEED table => df_metabolites_seed['BiGG']
df_metabolites_aracore['seed_BiGG_id'] = df_metabolites_aracore[['seed_id','is_seed_id']].apply(lambda x: df_metabolites_seed[df_metabolites_seed['id'] == x[0]]['BiGG']  if x[1] else None,axis=1)
df_metabolites_aracore['seed_BiGG_id'] = df_metabolites_aracore['seed_BiGG_id'].apply(lambda x: x.values.tolist()[0] if isinstance(x, pd.core.series.Series) else None)

df_metabolites_aracore.head(25)

Unnamed: 0,aracore_ids,aracore_name,aracore_formula,aracore_annotations,aracore_updated_ids,aracore_updated_universal_ids,aracore_updated_universal_ids_lower,is_bigg_id,universal_bigg_id,bigg_seed_id,is_seed_id,seed_id,seed_BiGG_id
0,hnu_h,Photon,X,{},hnu_h,hnu,hnu,False,,,False,,
1,PQ_h,Oxidized plastoquinone,C13H16O2,{},PQ_h,PQ,pq,True,pq,,False,,
2,H2O_h,"H2O, water",H2O,{},H2O_h,H2O,h2o,True,h2o,cpd27222,True,cpd00001,"[h2o, oh1]"
3,H_h,"H+, proton",H,{},H_h,H,h,True,h,cpd00067,True,cpd00067,[h]
4,PQH2_h,Reduced plastoquinone,C13H18O2,{},PQH2_h,PQH2,pqh2,True,pqh2,,False,,
5,O2_h,"O2, oxygen",O2,{},O2_h,O2,o2,True,o2,cpd00007,True,cpd00007,[o2]
6,H_l,"H+, proton",H,{},H_ul,H,h,True,h,cpd00067,True,cpd00067,[h]
7,PCox_h,Oxidized plastocyanin,X,{},PCox_h,PCox,pcox,True,pcox,cpd30035,False,,
8,PCrd_h,Reduced plastocyanin,X,{},PCrd_h,PCrd,pcrd,True,pcrd,cpd30034,False,,
9,Fdox_h,Oxidized ferredoxin,S8FeX,{},Fdox_h,Fdox,fdox,True,fdox,cpd15876,True,cpd15876,[fdox]


In [72]:
#Count the matches between aracore_updated_universal_ids_lower and the ModelSEED abbreviations
df_metabolites_aracore['is_seed_id'].value_counts() # => 162 mappings between aracore_updated_universal_ids_lower and ModelSEED abbreviation

False    251
True     162
Name: is_seed_id, dtype: int64

In [73]:
#Count the number of additional BiGG ids there are after mapping ModelSEED compound ids
df_metabolites_aracore['seed_BiGG_id'].notna().value_counts() # => 157 additional BiGG Ids found by mapped ModelSeed compound id
#notna() function returns a boolean same-sized object indicating if the values are not NA

False    256
True     157
Name: seed_BiGG_id, dtype: int64
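
The lookup in In [71] takes `.tolist()[0]`, that is, the first ModelSEED compound whose abbreviation matches. A quick check for abbreviations shared by several compounds can show where that choice matters. The sketch below uses a hypothetical toy table (`cpd99991` and `cpd99992` are made-up ids) and only assumes the `id` and `abbreviation_lower` columns used above.

```python
import pandas as pd

# Toy stand-in for df_metabolites_seed (hypothetical rows)
seed = pd.DataFrame({
    "id": ["cpd00001", "cpd00067", "cpd99991", "cpd99992"],
    "abbreviation_lower": ["h2o", "h", "xyz", "xyz"],
})

# Rows whose abbreviation occurs more than once
dupes = seed[seed["abbreviation_lower"].duplicated(keep=False)]

# abbreviation -> all compound ids sharing it
ambiguous = dupes.groupby("abbreviation_lower")["id"].apply(list).to_dict()
print(ambiguous)
```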

# VI / Aggregation of compound ids 

## 1) Aggregate ModelSEED compound ids in column 'bigg_seed_id' and 'seed_id' 

In [74]:
#Aggregate model seed compound ids from columns 'bigg_seed_id' and 'seed_id' into a list and extract unique values of this list
df_metabolites_aracore['seed_id_aggr'] = df_metabolites_aracore[['bigg_seed_id', 'seed_id']].apply(lambda x: list(np.unique(list(filter(None,[x[0],x[1]])))) , axis=1)
df_metabolites_aracore.head(25)

Unnamed: 0,aracore_ids,aracore_name,aracore_formula,aracore_annotations,aracore_updated_ids,aracore_updated_universal_ids,aracore_updated_universal_ids_lower,is_bigg_id,universal_bigg_id,bigg_seed_id,is_seed_id,seed_id,seed_BiGG_id,seed_id_aggr
0,hnu_h,Photon,X,{},hnu_h,hnu,hnu,False,,,False,,,[]
1,PQ_h,Oxidized plastoquinone,C13H16O2,{},PQ_h,PQ,pq,True,pq,,False,,,[]
2,H2O_h,"H2O, water",H2O,{},H2O_h,H2O,h2o,True,h2o,cpd27222,True,cpd00001,"[h2o, oh1]","[cpd00001, cpd27222]"
3,H_h,"H+, proton",H,{},H_h,H,h,True,h,cpd00067,True,cpd00067,[h],[cpd00067]
4,PQH2_h,Reduced plastoquinone,C13H18O2,{},PQH2_h,PQH2,pqh2,True,pqh2,,False,,,[]
5,O2_h,"O2, oxygen",O2,{},O2_h,O2,o2,True,o2,cpd00007,True,cpd00007,[o2],[cpd00007]
6,H_l,"H+, proton",H,{},H_ul,H,h,True,h,cpd00067,True,cpd00067,[h],[cpd00067]
7,PCox_h,Oxidized plastocyanin,X,{},PCox_h,PCox,pcox,True,pcox,cpd30035,False,,,[cpd30035]
8,PCrd_h,Reduced plastocyanin,X,{},PCrd_h,PCrd,pcrd,True,pcrd,cpd30034,False,,,[cpd30034]
9,Fdox_h,Oxidized ferredoxin,S8FeX,{},Fdox_h,Fdox,fdox,True,fdox,cpd15876,True,cpd15876,[fdox],[cpd15876]


In [75]:
df_metabolites_aracore['seed_id_aggr'].apply(len).value_counts() # => 127 + 37 = 164 metabolites have ModelSEED ids; 37 of them have 2 ModelSEED ids -> potential conflicts

0    249
1    127
2     37
Name: seed_id_aggr, dtype: int64
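
The aggregation in In [74] relies on `filter(None, ...)`, which drops `None` but would keep a `NaN` value, since `NaN` is truthy. As a sketch under that observation, a small helper (the name `unique_ids` is hypothetical) can collect only string identifiers, whether they arrive as scalars or as lists:

```python
import pandas as pd

def unique_ids(*values):
    """Collect non-empty string ids from scalars or lists into a sorted, de-duplicated list."""
    collected = set()
    for value in values:
        if isinstance(value, (list, tuple)):
            collected.update(v for v in value if isinstance(v, str) and v)
        elif isinstance(value, str) and value:
            collected.add(value)
    return sorted(collected)

# Toy rows mirroring the columns 'bigg_seed_id' and 'seed_id'
df = pd.DataFrame({
    "bigg_seed_id": ["cpd27222", None, "cpd00067"],
    "seed_id": ["cpd00001", None, "cpd00067"],
})

df["seed_id_aggr"] = pd.Series(
    [unique_ids(a, b) for a, b in zip(df["bigg_seed_id"], df["seed_id"])],
    index=df.index,
)
print(df["seed_id_aggr"].tolist())
```

The same helper would also work for the BiGG aggregation, where one of the two inputs is already a list.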

In [76]:
#Extract those metabolites that have more than one ModelSEED compound id mapped
df_metabolites_aracore_seed_conflicted = df_metabolites_aracore[df_metabolites_aracore['seed_id_aggr'].apply(len) > 1].copy()
df_metabolites_aracore_seed_conflicted

Unnamed: 0,aracore_ids,aracore_name,aracore_formula,aracore_annotations,aracore_updated_ids,aracore_updated_universal_ids,aracore_updated_universal_ids_lower,is_bigg_id,universal_bigg_id,bigg_seed_id,is_seed_id,seed_id,seed_BiGG_id,seed_id_aggr
2,H2O_h,"H2O, water",H2O,{},H2O_h,H2O,h2o,True,h2o,cpd27222,True,cpd00001,"[h2o, oh1]","[cpd00001, cpd27222]"
14,Pi_h,Orthophosphate,HO4P,{},Pi_h,Pi,pi,True,pi,cpd27787,True,cpd00009,[pi],"[cpd00009, cpd27787]"
23,F6P_h,Fructose 6-phosphate,C6H11O9P,{},F6P_h,F6P,f6p,True,f6p,cpd19035,True,cpd00072,"[f6p, f6p_B]","[cpd00072, cpd19035]"
28,R5P_h,Ribose 5-phosphate,C5H9O8P,{},R5P_h,R5P,r5p,True,r5p,cpd19028,True,cpd00101,[r5p],"[cpd00101, cpd19028]"
30,G6P_h,Glucose 6-phosphate,C6H11O9P,{},G6P_h,G6P,g6p,True,g6p,cpd26836,True,cpd00079,[g6p],"[cpd00079, cpd26836]"
31,G1P_h,Glucose 1-phosphate,C6H11O9P,{},G1P_h,G1P,g1p,True,g1p,cpd28817,True,cpd00089,[g1p],"[cpd00089, cpd28817]"
33,PPi_h,"Diphosphate, Pyrophosphate",O7P2,{},PPi_h,PPi,ppi,True,ppi,cpd27828,True,cpd00012,[ppi],"[cpd00012, cpd27828]"
45,G6P_c,Glucose 6-phosphate,C6H11O9P,{},G6P_c,G6P,g6p,True,g6p,cpd26836,True,cpd00079,[g6p],"[cpd00079, cpd26836]"
50,Pi_c,Orthophosphate,HO4P,{},Pi_c,Pi,pi,True,pi,cpd27787,True,cpd00009,[pi],"[cpd00009, cpd27787]"
51,G1P_c,Glucose 1-phosphate,C6H11O9P,{},G1P_c,G1P,g1p,True,g1p,cpd28817,True,cpd00089,[g1p],"[cpd00089, cpd28817]"


In [77]:
#Add abbreviations and formulas of the df_metabolites_seed to compare by eye and make notes to solve potential conflicts for later
df_metabolites_aracore_seed_conflicted['seed_id_aggr_abbr'] = df_metabolites_aracore_seed_conflicted['seed_id_aggr'].apply(lambda x: [df_metabolites_seed[df_metabolites_seed['id'] == compound_id]['abbreviation'].tolist()[0] for compound_id in x])
df_metabolites_aracore_seed_conflicted['seed_id_aggr_formula'] = df_metabolites_aracore_seed_conflicted['seed_id_aggr'].apply(lambda x: [df_metabolites_seed[df_metabolites_seed['id'] == compound_id]['formula'].tolist()[0] for compound_id in x])
df_metabolites_aracore_seed_conflicted[['aracore_ids','aracore_name','aracore_formula','seed_id_aggr','seed_id_aggr_abbr','seed_id_aggr_formula']]

Unnamed: 0,aracore_ids,aracore_name,aracore_formula,seed_id_aggr,seed_id_aggr_abbr,seed_id_aggr_formula
2,H2O_h,"H2O, water",H2O,"[cpd00001, cpd27222]","[h2o, hydroxyl-group]","[H2O, HO]"
14,Pi_h,Orthophosphate,HO4P,"[cpd00009, cpd27787]","[pi, phosphate-group]","[HO4P, HO4P]"
23,F6P_h,Fructose 6-phosphate,C6H11O9P,"[cpd00072, cpd19035]","[f6p, beta-D-Fructose 6-phosphate]","[C6H11O9P, C6H11O9P]"
28,R5P_h,Ribose 5-phosphate,C5H9O8P,"[cpd00101, cpd19028]","[r5p, alpha-D-Ribose 5-phosphate]","[C5H9O8P, C5H9O8P]"
30,G6P_h,Glucose 6-phosphate,C6H11O9P,"[cpd00079, cpd26836]","[g6p, D-glucose-6-phosphate]","[C6H11O9P, C6H11O9P]"
31,G1P_h,Glucose 1-phosphate,C6H11O9P,"[cpd00089, cpd28817]","[g1p, glucose-1-phosphate]","[C6H11O9P, C6H11O9P]"
33,PPi_h,"Diphosphate, Pyrophosphate",O7P2,"[cpd00012, cpd27828]","[ppi, pyrophosphate-group]","[HO7P2, HO7P2]"
45,G6P_c,Glucose 6-phosphate,C6H11O9P,"[cpd00079, cpd26836]","[g6p, D-glucose-6-phosphate]","[C6H11O9P, C6H11O9P]"
50,Pi_c,Orthophosphate,HO4P,"[cpd00009, cpd27787]","[pi, phosphate-group]","[HO4P, HO4P]"
51,G1P_c,Glucose 1-phosphate,C6H11O9P,"[cpd00089, cpd28817]","[g1p, glucose-1-phosphate]","[C6H11O9P, C6H11O9P]"


In [78]:
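#Resolve conflicts in original table df_metabolites_aracore
#NOTE: the lines below use '==' (comparison), so each one only evaluates to a boolean and leaves the table unchanged;
#the assignments that actually resolve these conflicts are made in In [85].

df_metabolites_aracore.set_index("aracore_ids",inplace=True) #Set index to aracore_ids to make editing "easier"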

#H2O_c, H2O_h, H2O_p, H2O_m == cpd00001
df_metabolites_aracore.loc['H2O_c','seed_id_aggr'] == ['cpd00001']
df_metabolites_aracore.loc['H2O_h','seed_id_aggr'] == ['cpd00001']
df_metabolites_aracore.loc['H2O_p','seed_id_aggr'] == ['cpd00001']
df_metabolites_aracore.loc['H2O_m','seed_id_aggr'] == ['cpd00001']

#THF_m, THF_c, THF_h	5,6,7,8-Tetrahydrofolate == cpd00087 -> 186 has one more H in sum formula
df_metabolites_aracore.loc['THF_m','seed_id_aggr'] == ['cpd00087']
df_metabolites_aracore.loc['THF_c','seed_id_aggr'] == ['cpd00087']
df_metabolites_aracore.loc['THF_h','seed_id_aggr'] == ['cpd00087']

#SO4_h, SO4_m, SO4_c Sulfate O4S == cpd00048
df_metabolites_aracore.loc['SO4_h','seed_id_aggr'] == ['cpd00048']
df_metabolites_aracore.loc['SO4_m','seed_id_aggr'] == ['cpd00048']
df_metabolites_aracore.loc['SO4_c','seed_id_aggr'] == ['cpd00048']

#ppi has different sum formulas in the Aracore model?
#Coenzyme A has different sum formulas in the Aracore model?
#5,6,7,8-Tetrahydrofolate has different sum formulas in the Aracore model?

df_metabolites_aracore.reset_index(inplace=True)
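
The `==` operator used in the cell above compares a cell with a list and returns a boolean; nothing is written back. A minimal sketch on toy data of the difference, using `DataFrame.at` to store a flat list in a single cell (this is one possible way to write the assignment, not the one used later in the notebook):

```python
import pandas as pd

# Toy table indexed by aracore_ids, with list-valued cells
df = pd.DataFrame({
    "aracore_ids": ["H2O_c", "H_c"],
    "seed_id_aggr": [["cpd00001", "cpd27222"], ["cpd00067"]],
}).set_index("aracore_ids")

# Comparison: evaluates to False here and does not touch the table
_ = df.loc["H2O_c", "seed_id_aggr"] == ["cpd00001"]
unchanged = df.at["H2O_c", "seed_id_aggr"]

# Assignment: .at addresses exactly one cell, so the list is stored as-is
df.at["H2O_c", "seed_id_aggr"] = ["cpd00001"]
print(df)
```

Storing the list through `.at` keeps the cell a flat list such as `[cpd00001]`, which makes later `len()` checks on the column straightforward.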

## 2) Aggregate BiGG ids in column 'universal_bigg_id' and 'seed_BiGG_id'

In [79]:
#Aggregate BiGG ids from columns 'universal_bigg_id' and 'seed_BiGG_id' into a list and extract the unique values of this list

df_metabolites_aracore['bigg_id_aggr'] = df_metabolites_aracore[['universal_bigg_id', 'seed_BiGG_id']].apply(lambda x: list(filter(None,[x[0]]+x[1])) if x[1] else  list(filter(None,[x[0]]+[x[1]])), axis=1) #.apply(len).value_counts()
df_metabolites_aracore['bigg_id_aggr'] = df_metabolites_aracore['bigg_id_aggr'].apply(lambda x: list(np.unique(x)))

df_metabolites_aracore.head(25)

Unnamed: 0,aracore_ids,aracore_name,aracore_formula,aracore_annotations,aracore_updated_ids,aracore_updated_universal_ids,aracore_updated_universal_ids_lower,is_bigg_id,universal_bigg_id,bigg_seed_id,is_seed_id,seed_id,seed_BiGG_id,seed_id_aggr,bigg_id_aggr
0,hnu_h,Photon,X,{},hnu_h,hnu,hnu,False,,,False,,,[],[]
1,PQ_h,Oxidized plastoquinone,C13H16O2,{},PQ_h,PQ,pq,True,pq,,False,,,[],[pq]
2,H2O_h,"H2O, water",H2O,{},H2O_h,H2O,h2o,True,h2o,cpd27222,True,cpd00001,"[h2o, oh1]","[cpd00001, cpd27222]","[h2o, oh1]"
3,H_h,"H+, proton",H,{},H_h,H,h,True,h,cpd00067,True,cpd00067,[h],[cpd00067],[h]
4,PQH2_h,Reduced plastoquinone,C13H18O2,{},PQH2_h,PQH2,pqh2,True,pqh2,,False,,,[],[pqh2]
5,O2_h,"O2, oxygen",O2,{},O2_h,O2,o2,True,o2,cpd00007,True,cpd00007,[o2],[cpd00007],[o2]
6,H_l,"H+, proton",H,{},H_ul,H,h,True,h,cpd00067,True,cpd00067,[h],[cpd00067],[h]
7,PCox_h,Oxidized plastocyanin,X,{},PCox_h,PCox,pcox,True,pcox,cpd30035,False,,,[cpd30035],[pcox]
8,PCrd_h,Reduced plastocyanin,X,{},PCrd_h,PCrd,pcrd,True,pcrd,cpd30034,False,,,[cpd30034],[pcrd]
9,Fdox_h,Oxidized ferredoxin,S8FeX,{},Fdox_h,Fdox,fdox,True,fdox,cpd15876,True,cpd15876,[fdox],[cpd15876],[fdox]


In [80]:
#Extract those metabolites that have more than one BiGG id mapped
df_metabolites_aracore_bigg_conflicted = df_metabolites_aracore[df_metabolites_aracore['bigg_id_aggr'].apply(len) > 1].copy()
df_metabolites_aracore_bigg_conflicted

Unnamed: 0,aracore_ids,aracore_name,aracore_formula,aracore_annotations,aracore_updated_ids,aracore_updated_universal_ids,aracore_updated_universal_ids_lower,is_bigg_id,universal_bigg_id,bigg_seed_id,is_seed_id,seed_id,seed_BiGG_id,seed_id_aggr,bigg_id_aggr
2,H2O_h,"H2O, water",H2O,{},H2O_h,H2O,h2o,True,h2o,cpd27222,True,cpd00001,"[h2o, oh1]","[cpd00001, cpd27222]","[h2o, oh1]"
23,F6P_h,Fructose 6-phosphate,C6H11O9P,{},F6P_h,F6P,f6p,True,f6p,cpd19035,True,cpd00072,"[f6p, f6p_B]","[cpd00072, cpd19035]","[f6p, f6p_B]"
55,H2O_c,"H2O, water",H2O,{},H2O_c,H2O,h2o,True,h2o,cpd27222,True,cpd00001,"[h2o, oh1]","[cpd00001, cpd27222]","[h2o, oh1]"
56,F6P_c,Fructose 6-phosphate,C6H11O9P,{},F6P_c,F6P,f6p,True,f6p,cpd19035,True,cpd00072,"[f6p, f6p_B]","[cpd00072, cpd19035]","[f6p, f6p_B]"
76,HCO3_c,Bicarbonate,CHO3,{},HCO3_c,HCO3,hco3,True,hco3,cpd00242,True,cpd00242,"[h2co3, hco3]",[cpd00242],"[h2co3, hco3]"
101,H2O_m,"H2O, water",H2O,{},H2O_m,H2O,h2o,True,h2o,cpd27222,True,cpd00001,"[h2o, oh1]","[cpd00001, cpd27222]","[h2o, oh1]"
126,H2O_p,"H2O, water",H2O,{},H2O_x,H2O,h2o,True,h2o,cpd27222,True,cpd00001,"[h2o, oh1]","[cpd00001, cpd27222]","[h2o, oh1]"
136,NH4_m,Ammonia,H4N,{},NH4_m,NH4,nh4,True,nh4,cpd19013,True,cpd00013,"[nh3, nh4]","[cpd00013, cpd19013]","[nh3, nh4]"
178,HCO3_h,Bicarbonate,CHO3,{},HCO3_h,HCO3,hco3,True,hco3,cpd00242,True,cpd00242,"[h2co3, hco3]",[cpd00242],"[h2co3, hco3]"
180,ACP_h,Acyl-carrier protein,HSR,{},ACP_h,ACP,acp,True,acp,cpd11493,True,cpd11493,[ACP],[cpd11493],"[ACP, acp]"


**Inspect the table above by eye and note how each potential conflict is resolved:**
- H2O_h, H2O_m, H2O_p, H2O_c  => h2o
- F6P_h, F6P_c => f6p
- HCO3_c, HCO3_h => hco3
- NH4_m, NH4_h, NH4_c => nh4
- ACP_h => acp
- Orn_h, Orn_m => orn

In [81]:
#Resolve conflicts in original table df_metabolites_aracore

df_metabolites_aracore.set_index("aracore_ids",inplace=True) #Set index to aracore_ids to make editing easier

# H2O_h, H2O_m, H2O_p, H2O_c  => h2o
df_metabolites_aracore.loc['H2O_c','bigg_id_aggr'] = [['h2o']]
df_metabolites_aracore.loc['H2O_h','bigg_id_aggr'] = [['h2o']]
df_metabolites_aracore.loc['H2O_p','bigg_id_aggr'] = [['h2o']]
df_metabolites_aracore.loc['H2O_m','bigg_id_aggr'] = [['h2o']]

# F6P_h, F6P_c => f6p
df_metabolites_aracore.loc['F6P_h','bigg_id_aggr'] = [['f6p']]
df_metabolites_aracore.loc['F6P_c','bigg_id_aggr'] = [['f6p']]

# HCO3_c, HCO3_h => hco3
df_metabolites_aracore.loc['HCO3_c','bigg_id_aggr'] = [['hco3']]
df_metabolites_aracore.loc['HCO3_h','bigg_id_aggr'] = [['hco3']]

# NH4_m, NH4_h, NH4_c => nh4
df_metabolites_aracore.loc['NH4_m','bigg_id_aggr'] = [['nh4']]
df_metabolites_aracore.loc['NH4_h','bigg_id_aggr'] = [['nh4']]
df_metabolites_aracore.loc['NH4_c','bigg_id_aggr'] = [['nh4']]

# ACP_h => acp
df_metabolites_aracore.loc['ACP_h','bigg_id_aggr'] = [['acp']]

# Orn_h, Orn_m => orn
df_metabolites_aracore.loc['Orn_h','bigg_id_aggr'] = [['orn']]
df_metabolites_aracore.loc['Orn_m','bigg_id_aggr'] = [['orn']]


#ppi has different sum formulas in the Aracore model?
#Coenzyme A has different sum formulas in the Aracore model?
#5,6,7,8-Tetrahydrofolate has different sum formulas in the Aracore model?

df_metabolites_aracore.reset_index(inplace=True)
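
Since each decision above applies to the same metabolite in every compartment, the manual choices could also be kept in one dictionary keyed by the universal id. The following is a sketch on toy data; the dictionary `resolved` and the helper `resolve` are hypothetical names, and the rows are a small excerpt shaped like the conflict table.

```python
import pandas as pd

# Toy excerpt with conflicting BiGG candidates
df = pd.DataFrame({
    "aracore_ids": ["H2O_c", "H2O_h", "NH4_m", "H_c"],
    "aracore_updated_universal_ids_lower": ["h2o", "h2o", "nh4", "h"],
    "bigg_id_aggr": [["h2o", "oh1"], ["h2o", "oh1"], ["nh3", "nh4"], ["h"]],
})

# Manual decisions, one per universal metabolite id
resolved = {"h2o": "h2o", "nh4": "nh4"}

def resolve(row):
    """Return the chosen id as a one-element list, or the existing list if no decision exists."""
    choice = resolved.get(row["aracore_updated_universal_ids_lower"])
    return [choice] if choice is not None else row["bigg_id_aggr"]

df["bigg_id_aggr"] = pd.Series([resolve(r) for _, r in df.iterrows()], index=df.index)

# No row should keep more than one candidate
print((df["bigg_id_aggr"].apply(len) > 1).sum())
```

This keeps the list of decisions in one place, so a compartment that was overlooked cannot be missed by accident.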

In [82]:
#Check if all bigg conflicts resolved 
(df_metabolites_aracore['bigg_id_aggr'].apply(len) > 1).value_counts() #all conflicts resolved

False    413
Name: bigg_id_aggr, dtype: int64

In [83]:
#Convert list of len 1 into string of BiGG id
df_metabolites_aracore['bigg_id_aggr'] = df_metabolites_aracore['bigg_id_aggr'].apply(lambda x: x[0] if x else None)

#Clean dataframe and drop cols
df_metabolites_aracore.drop(['is_seed_id','is_bigg_id','seed_BiGG_id','seed_id','bigg_seed_id','universal_bigg_id'], axis=1, inplace=True)

df_metabolites_aracore.head(25)

Unnamed: 0,aracore_ids,aracore_name,aracore_formula,aracore_annotations,aracore_updated_ids,aracore_updated_universal_ids,aracore_updated_universal_ids_lower,seed_id_aggr,bigg_id_aggr
0,hnu_h,Photon,X,{},hnu_h,hnu,hnu,[],
1,PQ_h,Oxidized plastoquinone,C13H16O2,{},PQ_h,PQ,pq,[],pq
2,H2O_h,"H2O, water",H2O,{},H2O_h,H2O,h2o,"[cpd00001, cpd27222]",h2o
3,H_h,"H+, proton",H,{},H_h,H,h,[cpd00067],h
4,PQH2_h,Reduced plastoquinone,C13H18O2,{},PQH2_h,PQH2,pqh2,[],pqh2
5,O2_h,"O2, oxygen",O2,{},O2_h,O2,o2,[cpd00007],o2
6,H_l,"H+, proton",H,{},H_ul,H,h,[cpd00067],h
7,PCox_h,Oxidized plastocyanin,X,{},PCox_h,PCox,pcox,[cpd30035],pcox
8,PCrd_h,Reduced plastocyanin,X,{},PCrd_h,PCrd,pcrd,[cpd30034],pcrd
9,Fdox_h,Oxidized ferredoxin,S8FeX,{},Fdox_h,Fdox,fdox,[cpd15876],fdox


In [84]:
df_metabolites_aracore['bigg_id_aggr'].notna().value_counts() #165 BiGG Ids mapped to aracore id
# => 248 - 165 = 83 conflicts to solve by hand

False    248
True     165
Name: bigg_id_aggr, dtype: int64

In [85]:
#Resolve conflicts in original table df_metabolites_aracore
df_metabolites_aracore.set_index("aracore_ids",inplace=True) #Set index to aracore_ids to make editing easier

#H2O_c, H2O_h, H2O_p, H2O_m == cpd00001
df_metabolites_aracore.loc['H2O_c','seed_id_aggr'] = [['cpd00001']]
df_metabolites_aracore.loc['H2O_h','seed_id_aggr'] = [['cpd00001']]
df_metabolites_aracore.loc['H2O_p','seed_id_aggr'] = [['cpd00001']]
df_metabolites_aracore.loc['H2O_m','seed_id_aggr'] = [['cpd00001']]

#Pi_h,_c,_m => cpd00009
df_metabolites_aracore.loc['Pi_c','seed_id_aggr'] = [['cpd00009']]
df_metabolites_aracore.loc['Pi_h','seed_id_aggr'] = [['cpd00009']]
df_metabolites_aracore.loc['Pi_m','seed_id_aggr'] = [['cpd00009']]

#F6P_c,h => cpd00072
df_metabolites_aracore.loc['F6P_c','seed_id_aggr'] = [['cpd00072']]
df_metabolites_aracore.loc['F6P_h','seed_id_aggr'] = [['cpd00072']]

#R5P_h => cpd00101
df_metabolites_aracore.loc['R5P_h','seed_id_aggr'] = [['cpd00101']]

#G6P_h,c => cpd00079
df_metabolites_aracore.loc['G6P_c','seed_id_aggr'] = [['cpd00079']]
df_metabolites_aracore.loc['G6P_h','seed_id_aggr'] = [['cpd00079']]

#G1P_h,c => cpd00089
df_metabolites_aracore.loc['G1P_c','seed_id_aggr'] = [['cpd00089']]
df_metabolites_aracore.loc['G1P_h','seed_id_aggr'] = [['cpd00089']]


#UDPG_c => cpd00026
df_metabolites_aracore.loc['UDPG_c','seed_id_aggr'] = [['cpd00026']]

#PPi_h,_c => cpd00012
df_metabolites_aracore.loc['PPi_c','seed_id_aggr'] = [['cpd00012']]
df_metabolites_aracore.loc['PPi_h','seed_id_aggr'] = [['cpd00012']]


#CoA_m,_c,_h => cpd00010
df_metabolites_aracore.loc['CoA_c','seed_id_aggr'] = [['cpd00010']]
df_metabolites_aracore.loc['CoA_h','seed_id_aggr'] = [['cpd00010']]
df_metabolites_aracore.loc['CoA_m','seed_id_aggr'] = [['cpd00010']]

#NH4_m,_h,_c => cpd00013
df_metabolites_aracore.loc['NH4_c','seed_id_aggr'] = [['cpd00013']]
df_metabolites_aracore.loc['NH4_h','seed_id_aggr'] = [['cpd00013']]
df_metabolites_aracore.loc['NH4_m','seed_id_aggr'] = [['cpd00013']]

#AMP_h,_c => cpd00018
df_metabolites_aracore.loc['AMP_c','seed_id_aggr'] = [['cpd00018']]
df_metabolites_aracore.loc['AMP_h','seed_id_aggr'] = [['cpd00018']]

#H2S_h,_c,_m => cpd00239
df_metabolites_aracore.loc['H2S_c','seed_id_aggr'] = [['cpd00239']]
df_metabolites_aracore.loc['H2S_h','seed_id_aggr'] = [['cpd00239']]
df_metabolites_aracore.loc['H2S_m','seed_id_aggr'] = [['cpd00239']]

#Orn_h,_m => cpd00064
df_metabolites_aracore.loc['Orn_h','seed_id_aggr'] = [['cpd00064']]
df_metabolites_aracore.loc['Orn_m','seed_id_aggr'] = [['cpd00064']]

#For_h => cpd00047
df_metabolites_aracore.loc['For_h','seed_id_aggr'] = [['cpd00047']]

#THF_m, THF_c, THF_h	5,6,7,8-Tetrahydrofolate == cpd00087 -> 186 has one more H in sum formula
df_metabolites_aracore.loc['THF_m','seed_id_aggr'] = [['cpd00087']]
df_metabolites_aracore.loc['THF_c','seed_id_aggr'] = [['cpd00087']]
df_metabolites_aracore.loc['THF_h','seed_id_aggr'] = [['cpd00087']]

#SO4_h, SO4_m, SO4_c Sulfate O4S == cpd00048
df_metabolites_aracore.loc['SO4_h','seed_id_aggr'] = [['cpd00048']]
df_metabolites_aracore.loc['SO4_m','seed_id_aggr'] = [['cpd00048']]
df_metabolites_aracore.loc['SO4_c','seed_id_aggr'] = [['cpd00048']]

#ppi has different sum formulas in the Aracore model?!?
#Coenzyme A has different sum formulas in the Aracore model?!?
#5,6,7,8-Tetrahydrofolate has different sum formulas in the Aracore model?!?

df_metabolites_aracore.reset_index(inplace=True)

In [86]:
#Export final mapping table for manual mapping
df_metabolites_aracore.to_csv('../data/processed/2021-06-22-ca-metabolite-mapping-table')

**After mapping the identifiers from the AraCore model and the two databases BiGG and ModelSEED, we will try to perform the mapping according to the chemical formulas.**

# VII / Mapping by chemical formulas

In [40]:
df_metabolites_seed["formula"].value_counts()

C15H24        150
C6H12O6       100
C12H22O11      66
C6H11O9P       66
C15H26O        54
             ... 
C33H39N4O4      1
C6H3Cl4N        1
C24H29O4        1
C8H11NO5P       1
C38H46N2O8      1
Name: formula, Length: 16762, dtype: int64

**From df_metabolites_seed, we create a new dataframe in which one row corresponds to one chemical formula.**

**Chemical formula = index of the dataframe.**


1. We call the groupby() function with a column name as parameter to group all the rows that share the same value in the column of interest.

2. Each group then contains the ModelSEED ids of all compounds in that group, so we need an aggregation, i.e. a transformation of the values of a group into a single value (common examples: mean, sum, max, min).

**In this case, we pass the column named "formula" to the groupby() function.
All rows that have the same chemical formula are grouped and aggregated.
As we want to keep all the ModelSEED ids, we collect them in a list.**

3. The drop() function deletes the "X" entry (no value).

4. The reset_index() function turns the index (the formula) back into a regular column, which gives a dataframe.


First, we tried the following commands:

df_metabolites_groupby_formula = df_metabolites_seed.groupby("formula")["id"].apply(list).drop(labels = 'X', axis=0).reset_index()
df_metabolites_groupby_formula 

However, this did not seem very efficient for handling both the groupby() step and the conflicts on model_seed_id. Thus, we performed an aggregation after the groupby() function to list the ModelSEED ids.

> After a groupby(), we need to perform an aggregation with the agg() function.

> agg(dictionary): this function defines the aggregation we want for each column.

> The dictionary passed as parameter: the keys are the names of the columns to aggregate, and the values are the functions used to aggregate the values of each column.
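
The groupby/agg/drop/reset_index chain described above can be sketched on a toy table. The rows are hypothetical stand-ins for `df_metabolites_seed` (`cpd99999` is a made-up id carrying the placeholder formula "X"); only the column names `id`, `abbreviation` and `formula` are taken from the notebook.

```python
import pandas as pd

# Toy stand-in for df_metabolites_seed
toy = pd.DataFrame({
    "id": ["cpd00001", "cpd15275", "cpd00067", "cpd99999"],
    "abbreviation": ["h2o", "oh1", "h", "unknown"],
    "formula": ["H2O", "H2O", "H", "X"],
})

grouped = (
    toy.groupby("formula")                      # one group per chemical formula
       .agg({"id": list, "abbreviation": list}) # collect ids and abbreviations as lists
       .drop(labels="X", axis=0)                # remove the placeholder formula
       .reset_index()                           # formula becomes a regular column again
)
print(grouped)
```

Each resulting row holds one formula together with the list of every compound that shares it, which is the shape needed for the merge on `aracore_formula`.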

Finally, we used the following commands:

df_metabolites_groupby_formula = df_metabolites_seed.groupby("formula").agg({"id":list, "abbreviation":list}).drop(labels = 'X', axis = 0).reset_index()

df_metabolites_groupby_formula 

In [41]:
df_metabolites_groupby_formula = df_metabolites_seed.groupby("formula").agg({"id":list, "abbreviation":list}).drop(labels = 'X', axis = 0).reset_index()
df_metabolites_groupby_formula 

Unnamed: 0,formula,id,abbreviation
0,Ag,"[cpd04100, cpd24345, cpd37274]","[ag, Ag++, Silver]"
1,Ag2O4S,[cpd22247],[Ag2SO4(aq)]
2,AgNO3,[cpd28685],[AgNO3]
3,Al,[cpd24344],[Al+++]
4,AlF3,[cpd26028],[aluminium fluoride]
...,...,...,...
16756,V,[cpd12858],[V]
16757,W,"[cpd00560, cpd28334]","[Tungsten, W+6]"
16758,Xe,[cpd09418],[Xenon]
16759,Z,[cpd31000],[photon]


In [44]:
df_metabolites_groupby_formula = df_metabolites_groupby_formula.rename(columns = {"formula":"model_seed_formula", "id":"model_seed_id", "abbreviation":"model_seed_abbreviation"})
df_metabolites_groupby_formula

Unnamed: 0,model_seed_formula,model_seed_id,model_seed_abbreviation
0,Ag,"[cpd04100, cpd24345, cpd37274]","[ag, Ag++, Silver]"
1,Ag2O4S,[cpd22247],[Ag2SO4(aq)]
2,AgNO3,[cpd28685],[AgNO3]
3,Al,[cpd24344],[Al+++]
4,AlF3,[cpd26028],[aluminium fluoride]
...,...,...,...
16756,V,[cpd12858],[V]
16757,W,"[cpd00560, cpd28334]","[Tungsten, W+6]"
16758,Xe,[cpd09418],[Xenon]
16759,Z,[cpd31000],[photon]


In [45]:
# Now, we merge with the dataframe according to the formula
df_metabolites_aracore_final = df_metabolites_aracore.merge(df_metabolites_groupby_formula, how = 'left', left_on = 'aracore_formula', right_on = 'model_seed_formula')
df_metabolites_aracore_final

Unnamed: 0,aracore_ids,aracore_name,aracore_formula,aracore_annotations,aracore_updated_ids,aracore_updated_universal_ids,seed_id_aggr,bigg_id_aggr,model_seed_formula,model_seed_id,model_seed_abbreviation
0,hnu_h,Photon,X,{},hnu_h,hnu,[],,,,
1,PQ_h,Oxidized plastoquinone,C13H16O2,{},pq_h,pq,[],pq,C13H16O2,"[cpd07588, cpd12011, cpd16487]","[4'-Hydroxy-3'-prenylacetophenone, Plastoquino..."
2,H2O_h,"H2O, water",H2O,{},h2o_h,h2o,[[cpd00001]],h2o,H2O,"[cpd00001, cpd15275]","[h2o, oh1]"
3,H_h,"H+, proton",H,{},h_h,h,[cpd00067],h,H,[cpd00067],[h]
4,PQH2_h,Reduced plastoquinone,C13H18O2,{},pqh2_h,pqh2,[],pqh2,C13H18O2,"[cpd01475, cpd16486]","[Plastoquinol-1, Plastoquinol]"
...,...,...,...,...,...,...,...,...,...,...,...
408,PQstar_h,Plastoquinone radical,C13H16O2,{},pqstar_h,pqstar,[],,C13H16O2,"[cpd07588, cpd12011, cpd16487]","[4'-Hydroxy-3'-prenylacetophenone, Plastoquino..."
409,PGR5_PGRL1ox_h,oxidised proton gradient regulation 5 (PGR5)/P...,X,{},pgr5_pgrl1ox_h,pgr5_pgrl1ox,[],,,,
410,PGR5_PGRL1rd_h,reduced proton gradient regulation 5 (PGR5)/PG...,X,{},pgr5_pgrl1rd_h,pgr5_pgrl1rd,[],,,,
411,NDHox_h,oxidised NADH dehydrogenase-like (NDH) complex,X,{},ndhox_h,ndhox,[],,,,


In [46]:
df_metabolites_aracore_final.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 413 entries, 0 to 412
Data columns (total 11 columns):
 #   Column                         Non-Null Count  Dtype 
---  ------                         --------------  ----- 
 0   aracore_ids                    413 non-null    object
 1   aracore_name                   413 non-null    object
 2   aracore_formula                413 non-null    object
 3   aracore_annotations            413 non-null    object
 4   aracore_updated_ids            413 non-null    object
 5   aracore_updated_universal_ids  413 non-null    object
 6   seed_id_aggr                   413 non-null    object
 7   bigg_id_aggr                   165 non-null    object
 8   model_seed_formula             360 non-null    object
 9   model_seed_id                  360 non-null    object
 10  model_seed_abbreviation        360 non-null    object
dtypes: object(11)
memory usage: 38.7+ KB
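
The `info()` output shows 360 of 413 rows with a formula match. To list which AraCore metabolites found no ModelSEED formula, the left merge can be run with pandas' `indicator=True` option. A minimal sketch on toy data (the two frames are hypothetical excerpts, not the real tables):

```python
import pandas as pd

# Toy excerpts of the AraCore table and the formula-grouped ModelSEED table
left = pd.DataFrame({
    "aracore_ids": ["hnu_h", "H2O_h", "H_h"],
    "aracore_formula": ["X", "H2O", "H"],
})
right = pd.DataFrame({
    "model_seed_formula": ["H2O", "H"],
    "model_seed_id": [["cpd00001", "cpd15275"], ["cpd00067"]],
})

merged = left.merge(
    right,
    how="left",
    left_on="aracore_formula",
    right_on="model_seed_formula",
    indicator=True,  # adds a '_merge' column: 'both' or 'left_only'
)

# AraCore metabolites without any formula match
unmatched = merged.loc[merged["_merge"] == "left_only", "aracore_ids"].tolist()
print(unmatched)
```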


In [47]:
#Export final mapping table for manual mapping
df_metabolites_aracore_final.to_csv('../data/processed/2021-06-22-metabolite-mapping-table2.csv')

# Conflicts solving

Now, we have a dataframe with a column containing the ModelSEED ids retrieved from the ModelSEED database by formula. However, a row is a potential conflict if it holds more than one id. Thus, we apply the len() function to each row of that column and count the number of rows for each length value with the value_counts() function.
=> value_counts() returns a Series containing the number of occurrences of each distinct value.

In [48]:
# Replace the NaN values by empty lists
def apply_replace_Nan_by_list(line):
    if not isinstance(line, list):
        value = []  # NaN (no formula match) becomes an empty list
    else:
        value = line  # already a list of ModelSEED ids: keep it unchanged
    return value
df_metabolites_aracore_final["model_seed_id"] = df_metabolites_aracore_final["model_seed_id"].apply(apply_replace_Nan_by_list)
df_metabolites_aracore_final

Unnamed: 0,aracore_ids,aracore_name,aracore_formula,aracore_annotations,aracore_updated_ids,aracore_updated_universal_ids,seed_id_aggr,bigg_id_aggr,model_seed_formula,model_seed_id,model_seed_abbreviation
0,hnu_h,Photon,X,{},hnu_h,hnu,[],,,[],
1,PQ_h,Oxidized plastoquinone,C13H16O2,{},pq_h,pq,[],pq,C13H16O2,"[cpd07588, cpd12011, cpd16487]","[4'-Hydroxy-3'-prenylacetophenone, Plastoquino..."
2,H2O_h,"H2O, water",H2O,{},h2o_h,h2o,[[cpd00001]],h2o,H2O,"[cpd00001, cpd15275]","[h2o, oh1]"
3,H_h,"H+, proton",H,{},h_h,h,[cpd00067],h,H,[cpd00067],[h]
4,PQH2_h,Reduced plastoquinone,C13H18O2,{},pqh2_h,pqh2,[],pqh2,C13H18O2,"[cpd01475, cpd16486]","[Plastoquinol-1, Plastoquinol]"
...,...,...,...,...,...,...,...,...,...,...,...
408,PQstar_h,Plastoquinone radical,C13H16O2,{},pqstar_h,pqstar,[],,C13H16O2,"[cpd07588, cpd12011, cpd16487]","[4'-Hydroxy-3'-prenylacetophenone, Plastoquino..."
409,PGR5_PGRL1ox_h,oxidised proton gradient regulation 5 (PGR5)/P...,X,{},pgr5_pgrl1ox_h,pgr5_pgrl1ox,[],,,[],
410,PGR5_PGRL1rd_h,reduced proton gradient regulation 5 (PGR5)/PG...,X,{},pgr5_pgrl1rd_h,pgr5_pgrl1rd,[],,,[],
411,NDHox_h,oxidised NADH dehydrogenase-like (NDH) complex,X,{},ndhox_h,ndhox,[],,,[],


In [49]:
df_metabolites_aracore_final["model_seed_id"].apply(len).value_counts() # left: number of ModelSEED ids in the list;
# right: number of rows with that many ids
# => 53 rows have 0 ids; 77 rows have 1 id
# => 63 rows have 2 ids; the remaining rows have more than 2 ids => potential conflicts?

1      77
2      63
0      53
3      51
4      27
5      22
7      20
66     13
11     12
9      11
16     11
6       8
10      7
12      6
100     6
14      5
19      5
13      4
8       3
15      3
20      2
39      2
52      2
Name: model_seed_id, dtype: int64
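
The counts above are ordered by frequency, which makes the distribution harder to scan. The following is a minimal sketch on a toy Series (the values are illustrative and the name `candidate_ids` is hypothetical, not part of the notebook) showing how the same counts could be ordered by list length and summarised into three buckets.

```python
import pandas as pd

# Toy stand-in for df_metabolites_aracore_final["model_seed_id"] (illustrative values only)
candidate_ids = pd.Series([
    [],
    ["cpd00067"],
    ["cpd00001", "cpd15275"],
    ["cpd00007", "cpd00532"],
    [],
])

# Number of candidate ModelSEED IDs per row
lengths = candidate_ids.apply(len)

# Same counts as value_counts(), but ordered by list length instead of frequency
length_counts = lengths.value_counts().sort_index()

# Three buckets: no candidate, exactly one candidate, several candidates (conflict)
summary = {
    "no_candidate": int((lengths == 0).sum()),
    "unique": int((lengths == 1).sum()),
    "ambiguous": int((lengths > 1).sum()),
}

print(length_counts)
print(summary)
```

Applied to the real column, the `ambiguous` bucket would correspond to the rows selected in the next cell.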

In [50]:
# Extract the metabolites that have more than one ModelSEED compound ID mapped
df_metabolites_aracore_seed_conflicted = df_metabolites_aracore_final[df_metabolites_aracore_final['model_seed_id'].apply(len) > 1].copy()
df_metabolites_aracore_seed_conflicted

Unnamed: 0,aracore_ids,aracore_name,aracore_formula,aracore_annotations,aracore_updated_ids,aracore_updated_universal_ids,seed_id_aggr,bigg_id_aggr,model_seed_formula,model_seed_id,model_seed_abbreviation
1,PQ_h,Oxidized plastoquinone,C13H16O2,{},pq_h,pq,[],pq,C13H16O2,"[cpd07588, cpd12011, cpd16487]","[4'-Hydroxy-3'-prenylacetophenone, Plastoquino..."
2,H2O_h,"H2O, water",H2O,{},h2o_h,h2o,[[cpd00001]],h2o,H2O,"[cpd00001, cpd15275]","[h2o, oh1]"
4,PQH2_h,Reduced plastoquinone,C13H18O2,{},pqh2_h,pqh2,[],pqh2,C13H18O2,"[cpd01475, cpd16486]","[Plastoquinol-1, Plastoquinol]"
5,O2_h,"O2, oxygen",O2,{},o2_h,o2,[cpd00007],o2,O2,"[cpd00007, cpd00532]","[o2, o2s]"
11,NADP_h,Nicotinamide adenine dinucleotide phosphate,C21H25N7O17P3,{},nadp_h,nadp,[cpd00006],nadp,C21H25N7O17P3,"[cpd00006, cpd33786]","[nadp, alpha-NADP+]"
...,...,...,...,...,...,...,...,...,...,...,...
404,Thr_m,Threonine,C4H9NO3,{},thr_m,thr,[],,C4H9NO3,"[cpd00161, cpd00227, cpd00611, cpd01432, cpd02...","[thr-L, hom-L, D-Threonine, 2-Methylserine, GA..."
405,Trp_m,Tryptophan,C11H12N2O2,{},trp_m,trp,[],,C11H12N2O2,"[cpd00065, cpd00411, cpd04921, cpd07629, cpd10...","[trp-L, D-Tryptophan, Ethotoin, Vasicinol, Nir..."
406,Tyr_m,Tyrosine,C9H11NO3,{},tyr_m,tyr,[],,C9H11NO3,"[cpd00069, cpd02099, cpd02674, cpd03843, cpd23...","[tyr-L, L-threo-3-Phenylserine, beta-Tyrosine,..."
407,Val_m,Valine,C5H11NO2,{},val_m,val,[],,C5H11NO2,"[cpd00156, cpd00339, cpd00540, cpd01241, cpd01...","[val-L, 5aptn, glyb, D-Norvaline, L-Norvaline,..."


In [51]:
df_metabolites_aracore_seed_conflicted.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 283 entries, 1 to 408
Data columns (total 11 columns):
 #   Column                         Non-Null Count  Dtype 
---  ------                         --------------  ----- 
 0   aracore_ids                    283 non-null    object
 1   aracore_name                   283 non-null    object
 2   aracore_formula                283 non-null    object
 3   aracore_annotations            283 non-null    object
 4   aracore_updated_ids            283 non-null    object
 5   aracore_updated_universal_ids  283 non-null    object
 6   seed_id_aggr                   283 non-null    object
 7   bigg_id_aggr                   95 non-null     object
 8   model_seed_formula             283 non-null    object
 9   model_seed_id                  283 non-null    object
 10  model_seed_abbreviation        283 non-null    object
dtypes: object(11)
memory usage: 26.5+ KB


> According to the results above, 283 conflicts would have to be resolved by hand.

> We now look at the conflicts that could not be resolved earlier, i.e. the rows whose seed_id_aggr is empty.

In [52]:
# Keep only the conflicts that still need to be resolved: rows with an empty seed_id_aggr
df_metabolites_aracore_seed_no_match = df_metabolites_aracore_seed_conflicted[df_metabolites_aracore_seed_conflicted['seed_id_aggr'].apply(len) == 0].copy()
df_metabolites_aracore_seed_no_match

Unnamed: 0,aracore_ids,aracore_name,aracore_formula,aracore_annotations,aracore_updated_ids,aracore_updated_universal_ids,seed_id_aggr,bigg_id_aggr,model_seed_formula,model_seed_id,model_seed_abbreviation
1,PQ_h,Oxidized plastoquinone,C13H16O2,{},pq_h,pq,[],pq,C13H16O2,"[cpd07588, cpd12011, cpd16487]","[4'-Hydroxy-3'-prenylacetophenone, Plastoquino..."
4,PQH2_h,Reduced plastoquinone,C13H18O2,{},pqh2_h,pqh2,[],pqh2,C13H18O2,"[cpd01475, cpd16486]","[Plastoquinol-1, Plastoquinol]"
16,RuBP_h,"Ribulose 1,5-bisphosphate",C5H8O11P2,{},rubp_h,rubp,[],,C5H8O11P2,"[cpd00847, cpd00871, cpd34070]","[r15bp, rb15bp, D-ribofuranose 2,5-bisphosphate]"
20,GAP_h,Glyceraldehyde 3-phosphate,C3H5O6P,{},gap_h,gap,[],,C3H5O6P,"[cpd00095, cpd00102, cpd19005, cpd27362]","[dhap, g3p, DL-Glyceraldehyde 3-phosphate, L-GAP]"
22,FBP_h,"Fructose 1,6-bisphosphate",C6H10O12P2,{},fbp_h,fbp,[],,C6H10O12P2,"[cpd00290, cpd00499, cpd00503, cpd00898, cpd02...","[fdp, g16bp, f26bp, Inositol 1,4-bisphosphate,..."
...,...,...,...,...,...,...,...,...,...,...,...
404,Thr_m,Threonine,C4H9NO3,{},thr_m,thr,[],,C4H9NO3,"[cpd00161, cpd00227, cpd00611, cpd01432, cpd02...","[thr-L, hom-L, D-Threonine, 2-Methylserine, GA..."
405,Trp_m,Tryptophan,C11H12N2O2,{},trp_m,trp,[],,C11H12N2O2,"[cpd00065, cpd00411, cpd04921, cpd07629, cpd10...","[trp-L, D-Tryptophan, Ethotoin, Vasicinol, Nir..."
406,Tyr_m,Tyrosine,C9H11NO3,{},tyr_m,tyr,[],,C9H11NO3,"[cpd00069, cpd02099, cpd02674, cpd03843, cpd23...","[tyr-L, L-threo-3-Phenylserine, beta-Tyrosine,..."
407,Val_m,Valine,C5H11NO2,{},val_m,val,[],,C5H11NO2,"[cpd00156, cpd00339, cpd00540, cpd01241, cpd01...","[val-L, 5aptn, glyb, D-Norvaline, L-Norvaline,..."


In [48]:
df_metabolites_aracore_seed_no_match.to_csv('../data/processed/2021-06-01-metabolite-aracore-seed-no-match.csv')

> **After this filtering, 189 conflicts remain to be resolved by hand.**
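
The drop from 283 to 189 comes from setting aside rows whose seed_id_aggr is not empty. As a sketch of how such rows could be narrowed automatically, the toy example below keeps only the formula-based candidates that are confirmed by seed_id_aggr. The helper names `flatten` and `narrow_candidates` and the column `model_seed_id_narrowed` are hypothetical; the three rows reuse values displayed in the tables above. This is one possible approach under those assumptions, not the procedure used in this notebook.

```python
import pandas as pd

def flatten(ids):
    """Flatten one level of nesting, e.g. [['cpd00001']] -> ['cpd00001']."""
    flat = []
    for item in ids:
        if isinstance(item, list):
            flat.extend(item)
        else:
            flat.append(item)
    return flat

def narrow_candidates(aggregated, candidates):
    """Keep the candidates confirmed by the aggregated IDs; otherwise keep all of them."""
    confirmed = set(flatten(aggregated))
    kept = [cpd for cpd in candidates if cpd in confirmed]
    return kept if kept else list(candidates)

# Toy subset of the conflicted table (values copied from the rows shown above)
toy = pd.DataFrame({
    "aracore_ids": ["H2O_h", "O2_h", "PQH2_h"],
    "seed_id_aggr": [[["cpd00001"]], ["cpd00007"], []],
    "model_seed_id": [
        ["cpd00001", "cpd15275"],
        ["cpd00007", "cpd00532"],
        ["cpd01475", "cpd16486"],
    ],
})

# Build the narrowed lists row by row and store them as an object column
toy["model_seed_id_narrowed"] = pd.Series(
    [narrow_candidates(a, c) for a, c in zip(toy["seed_id_aggr"], toy["model_seed_id"])],
    index=toy.index,
)

print(toy[["aracore_ids", "model_seed_id_narrowed"]])
```

Rows such as PQH2_h, which have no aggregated ID, keep all their candidates and still require manual curation.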

In [49]:
# Resolve conflicts in the original table df_metabolites_aracore
df_metabolites_aracore.set_index("aracore_ids", inplace=True)  # Set the index to aracore_ids to make editing easier

> **Because of the large number of conflicts to resolve by hand and the limited time available, we decided to skip this step and keep the mapping table that has fewer conflicts to resolve by hand.**
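
For reference, a manual resolution on a table indexed by aracore_ids could look like the sketch below. It is only an illustration on a toy table: the dictionary `manual_choices` and the column `model_seed_id_resolved` are hypothetical, and the two choices shown simply follow the seed_id_aggr values displayed earlier for these metabolites.

```python
import pandas as pd

# Toy stand-in for the conflicted table, indexed by AraCore ID
toy = pd.DataFrame({
    "aracore_ids": ["H2O_h", "O2_h"],
    "model_seed_id": [["cpd00001", "cpd15275"], ["cpd00007", "cpd00532"]],
}).set_index("aracore_ids")

# Hypothetical curation record: AraCore ID -> ModelSEED ID chosen by the curator
manual_choices = {"H2O_h": "cpd00001", "O2_h": "cpd00007"}

# New column holding the curated ID; None means "not resolved yet"
toy["model_seed_id_resolved"] = None

for aracore_id, chosen in manual_choices.items():
    candidates = toy.at[aracore_id, "model_seed_id"]
    # Guard against typos: the chosen ID must be one of the listed candidates
    if chosen not in candidates:
        raise ValueError(f"{chosen} is not a candidate for {aracore_id}")
    toy.at[aracore_id, "model_seed_id_resolved"] = chosen

print(toy)
```

Keeping the choices in a dictionary separates the curation decisions from the code that applies them, so they can be reviewed or extended without touching the table logic.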