
### **Media Prediction**

**Media/Culture Information**

1.1 MediaDive | Provides media IDs, components/annotations, associated taxa, and other relevant information

1.2 BacDive | Provides culture/isolation info, metabolic annotations (including ec), and more (1.1)

**Taxa to EC**

2.1 UniProtKB (species) | Queries UniProtKB for ec numbers from an input of species names (1.1)

2.2 UniProtKB (taxon_id) | Queries UniProtKB for ec numbers from an input of NCBI taxon_id's (1.2)

2.3 Final taxa2ec dataframe | Formatting identifiers and associated ec numbers (1.2, 2.1, 2.2)

**Media to EC**

3.1 KEGG (CPD) | Queries KEGG for ec numbers from an input of KEGG compounds (1.1)



*Search terms: "final dataframe", "#SAVE", "To-do", ...*

In [39]:
import pandas as pd
import numpy as np
from tqdm import tqdm
from io import StringIO

# OS for reading/saving
import os
DATA_DIR = os.path.expanduser('~/Desktop/code/data/')  # expand '~' so the path also works outside pandas

# Requests for querying databases
import requests
import re
from requests.adapters import HTTPAdapter, Retry
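
The `HTTPAdapter` and `Retry` imports above are typically combined into a session that retries transient server errors. The sketch below shows one way to do that; the helper name `make_session`, the retry count, the backoff factor, and the status codes are illustrative assumptions, not settings taken from the `modules` package used in this notebook.

```python
import requests
from requests.adapters import HTTPAdapter, Retry


def make_session(total_retries=5, backoff_factor=0.25):
    """Build a requests.Session that retries transient HTTP failures.

    Hypothetical helper: the parameter values are assumptions for illustration.
    """
    retry = Retry(
        total=total_retries,
        backoff_factor=backoff_factor,
        status_forcelist=[500, 502, 503, 504],  # retry only on server-side errors
    )
    session = requests.Session()
    session.mount("https://", HTTPAdapter(max_retries=retry))
    return session


session = make_session()
```

A session built this way can be reused for every query in a loop, so connection setup and retry handling are shared across requests.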

### [1.1] MediaDive

Returns all MediaDive information (media IDs, components, component IDs, characteristics, associated taxa, etc.) used in subsequent analyses.

In [2]:
import modules.mediadive as md

Retrieve all MediaDive info

In [None]:
# Retrieve all available media from MediaDive
md_media_df = md.get_media()
#md_media_df.to_csv(os.path.join(DATA_DIR,"mediadive", "mediadive-media.csv"), index=False) #SAVE

In [3]:
# Create media_id_list
md_media_df = md_media_df.rename(columns={"id": "media_id"})
media_id_list = md_media_df["media_id"].astype(str).unique()

# Use media_id_list to retrieve media composition information
md_comp_df = md.get_composition(media_id_list)

# Use media_id_list to retrieve media-associated strain information
md_strains_df = md.get_strains(media_id_list)

#md_comp_df.to_csv(os.path.join(DATA_DIR, "mediadive","mediadive-media-comp.csv"), sep=";", index=False) #SAVE
#md_strains_df.to_csv(os.path.join(DATA_DIR, "mediadive", "mediadive-media-strains.csv"), sep=";", index=False) #SAVE

100%|██████████| 3315/3315 [09:59<00:00,  5.53it/s]
100%|██████████| 3315/3315 [09:17<00:00,  5.95it/s]


Merge MediaDive outputs

In [4]:
# Merge media composition and strains info
md_df = pd.merge(left=md_comp_df, right=md_strains_df, on="media_id", how="outer", indicator=True)

# Merge media information with original dataframe
md_df = pd.merge(left=md_media_df, right=md_df, on="media_id", how="left", indicator=False)

# Add extra column indicating the source of the data
# (md_comp_df was the left table, so 'left_only' means composition only)
md_df = md_df.rename(columns={"_merge": "merge_source"})
md_df["merge_source"] = md_df["merge_source"].cat.rename_categories({"left_only": "composition_only", "right_only": "strains_only"})

#md_df.to_csv(os.path.join(DATA_DIR,"mediadive","mediadive-all.csv"), sep=";", index=False)
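
To make the meaning of the merge indicator concrete, here is a minimal sketch on toy data (the rows are invented for illustration). With `indicator=True`, pandas labels each row `left_only`, `right_only`, or `both` relative to the `left` and `right` arguments, so a composition table passed as `left` yields `left_only` for media that have composition data but no strains.

```python
import pandas as pd

# Toy stand-ins for the composition and strain tables (invented rows)
comp = pd.DataFrame({"media_id": ["1a", "2"], "components": ["Peptone", "Agar"]})
strains = pd.DataFrame({"media_id": ["1a", "3"], "species": ["Comamonas testosteroni", "Thermus thermophilus"]})

merged = pd.merge(left=comp, right=strains, on="media_id", how="outer", indicator=True)

# Relabel the indicator relative to which table was on the left/right
merged["_merge"] = merged["_merge"].cat.rename_categories(
    {"left_only": "composition_only", "right_only": "strains_only"}
)

source_by_media = dict(zip(merged["media_id"], merged["_merge"].astype(str)))
```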

**MediaDive final dataframe**

In [13]:
md_df = pd.read_csv(os.path.join(DATA_DIR,"mediadive","mediadive-all.csv"), sep = ';')
md_df.head()

Unnamed: 0,media_id,name,complex_medium,source,link,min_pH,max_pH,reference,description,components,component_ids,strain_id,species,ccno,bacdive_id,merge_source
0,1,NUTRIENT AGAR,1,DSMZ,https://www.dsmz.de/microorganisms/medium/pdf/...,7.0,7.0,,,,,,,,,
1,1a,REACTIVATION WITH LIQUID MEDIUM 1,1,DSMZ,https://www.dsmz.de/microorganisms/medium/pdf/...,7.0,7.0,,,"['Peptone', 'Meat extract', 'Agar', 'Distilled...","[1, 2, 3, 4]",29.0,Comamonas testosteroni,DSM 38,2912.0,both
2,1a,REACTIVATION WITH LIQUID MEDIUM 1,1,DSMZ,https://www.dsmz.de/microorganisms/medium/pdf/...,7.0,7.0,,,"['Peptone', 'Meat extract', 'Agar', 'Distilled...","[1, 2, 3, 4]",30.0,Delftia acidovorans,DSM 39,2941.0,both
3,1a,REACTIVATION WITH LIQUID MEDIUM 1,1,DSMZ,https://www.dsmz.de/microorganisms/medium/pdf/...,7.0,7.0,,,"['Peptone', 'Meat extract', 'Agar', 'Distilled...","[1, 2, 3, 4]",39.0,Acidovorax delafieldii,DSM 64,2885.0,both
4,1a,REACTIVATION WITH LIQUID MEDIUM 1,1,DSMZ,https://www.dsmz.de/microorganisms/medium/pdf/...,7.0,7.0,,,"['Peptone', 'Meat extract', 'Agar', 'Distilled...","[1, 2, 3, 4]",52.0,Pseudomonas putida,DSM 84,12895.0,both


### [1.2] BacDive

Returns EC numbers, environmental info, metabolite data, taxonomic information, culturing info, and more from an input of bacdive_id's.

In [7]:
import bacdive
import modules.bacdive as bd

Initialize BacDive, prepare IDs

In [8]:
# Retrieve bacdive_id's from the 'md_df' MediaDive table
md_df = pd.read_csv(os.path.join(DATA_DIR, "mediadive", "mediadive-all.csv"), sep=";")
bd_id_list = md_df["bacdive_id"].dropna().astype(int).astype(str).unique()
bd_id_list

array(['2912', '2941', '2885', ..., '159029', '158241', '131423'],
      dtype=object)

Retrieval of BacDive info

In [10]:
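# Initialize client (credentials are read from environment variables instead of being hardcoded;
# the variable names BACDIVE_EMAIL and BACDIVE_PASSWORD are placeholders)
client = bacdive.BacdiveClient(os.environ["BACDIVE_EMAIL"], os.environ["BACDIVE_PASSWORD"])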

step = 100
bacdive_df = []

for idx_start in tqdm(range(0, len(bd_id_list), step)): #status bar
    id_list = ";".join(bd_id_list[idx_start:idx_start+step]) 
    bacdive_df.append(bd.taxon2ec(id_list=id_list, client=client))

bacdive_df = pd.concat(bacdive_df, axis=0, ignore_index=True)
bacdive_df = bacdive_df.drop("reference", axis=1)

#bacdive_df.to_csv(os.path.join(DATA_DIR, "bacdive", "bacdive-all.csv"), index=False) #SAVE

-- Authentication successful --


100%|██████████| 138/138 [04:03<00:00,  1.76s/it]


Unnamed: 0,general_@ref,bacdive_id,dsmz_id,general_keywords,general_description,taxon_id,ncbi_tax_id_matching_level,strain_history_@ref,strain_history_history,general_doi,...,api_id32sta_beta_gur,physiology_and_metabolism_murein,physiology_and_metabolism_api_list,isolation_enrichment_culture,isolation_enrichment_culture_temperature,multicellular_morphology_complex_color,api_list_beta_hem,metabolite_tests_citrate_test,compound_production_excreted,isolation_enrichment_culture_duration
0,21113,24370,11532.0,"[Bacteria, human pathogen]",Pseudomonas fluorescens PWD34 is a human patho...,294,species,21113.0,"<- W. Duetz, RIVM, Bilthoven; PWD34",10.13145/bacdive24370.20240510.9,...,,,,,,,,,,
1,21111,24368,304.0,Bacteria,Pseudomonas sp. DSM 304 is a bacterium of the ...,306,species,21111.0,"<- IMG, 1591 (<i>P. fluorescens</i>) <- H. Stolp",10.13145/bacdive24368.20240510.9,...,,,,,,,,,,
2,20542,23995,30059.0,Bacteria,Lelliottia amnigena 21824 is a bacterium that ...,61646,species,20542.0,<- Bakteriologisches Institut der Sueddeutsche...,10.13145/bacdive23995.20240510.9,...,,,,,,,,,,
3,1479,17621,3849.0,"[16S sequence, Bacteria, plant pathogen]",Xanthomonas citri subsp. malvacearum XM13 is a...,346,species,,,10.13145/bacdive17621.20240510.9,...,,,,,,,,,,
4,1480,17596,3850.0,"[genome sequence, Bacteria, obligate aerobe, G...",Xanthomonas campestris DSM 3850 is an obligate...,339,species,,,10.13145/bacdive17596.20240510.9,...,,,,,,,,,,
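
The retrieval cell above sends IDs in semicolon-joined batches of 100. The same batching can be isolated in a small generator, sketched here; the name `batched_ids` is hypothetical and not part of `modules.bacdive`.

```python
def batched_ids(ids, size):
    """Yield semicolon-joined batches of at most `size` IDs (hypothetical helper)."""
    for start in range(0, len(ids), size):
        yield ";".join(ids[start:start + size])


batches = list(batched_ids(["2912", "2941", "2885"], size=2))
```

Keeping the batching separate from the request makes it easy to change the batch size or to test the slicing without contacting the API.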


**BacDive final dataframe**

In [17]:
bacdive_df = pd.read_csv(os.path.join(DATA_DIR, 'bacdive', 'bacdive-all.csv'), low_memory=False)
bacdive_df.head()

Unnamed: 0,general_@ref,bacdive_id,dsmz_id,general_keywords,general_description,taxon_id,ncbi_tax_id_matching_level,strain_history_@ref,strain_history_history,general_doi,...,api_id32sta_beta_gur,physiology_and_metabolism_murein,physiology_and_metabolism_api_list,isolation_enrichment_culture,isolation_enrichment_culture_temperature,multicellular_morphology_complex_color,api_list_beta_hem,metabolite_tests_citrate_test,compound_production_excreted,isolation_enrichment_culture_duration
0,21113,24370,11532.0,"['Bacteria', 'human pathogen']",Pseudomonas fluorescens PWD34 is a human patho...,294,species,21113.0,"<- W. Duetz, RIVM, Bilthoven; PWD34",10.13145/bacdive24370.20240510.9,...,,,,,,,,,,
1,21111,24368,304.0,Bacteria,Pseudomonas sp. DSM 304 is a bacterium of the ...,306,species,21111.0,"<- IMG, 1591 (<i>P. fluorescens</i>) <- H. Stolp",10.13145/bacdive24368.20240510.9,...,,,,,,,,,,
2,20542,23995,30059.0,Bacteria,Lelliottia amnigena 21824 is a bacterium that ...,61646,species,20542.0,<- Bakteriologisches Institut der Sueddeutsche...,10.13145/bacdive23995.20240510.9,...,,,,,,,,,,
3,1479,17621,3849.0,"['16S sequence', 'Bacteria', 'plant pathogen']",Xanthomonas citri subsp. malvacearum XM13 is a...,346,species,,,10.13145/bacdive17621.20240510.9,...,,,,,,,,,,
4,1480,17596,3850.0,"['genome sequence', 'Bacteria', 'obligate aero...",Xanthomonas campestris DSM 3850 is an obligate...,339,species,,,10.13145/bacdive17596.20240510.9,...,,,,,,,,,,


### [2.1] UniProtKB taxa2ec (species name)

Returns ec numbers and identifier information from UniProtKB for an input of species (MediaDive)

In [12]:
import modules.uniprotkb as uni

Format species name from the 'md_df' MediaDive dataframe for querying UniProtKB

In [13]:
species_df = md_df.copy()
species_df['species'] = species_df['species'].replace(' ','+', regex=True)
species_list = set(species_df['species'].to_list())
len(species_list)

12391

Retrieval of UniProtKB info

In [14]:
# Import and run function to retrieve ec's associated with each species
# Note: the script currently checks only reviewed entries (Swiss-Prot) and ignores thousands of TrEMBL entries
# HTTP errors mark species that are absent entirely; species with only unreviewed entries are counted under 'x species with no data'

uniprot_df = uni.species2ec(species_list)
#uniprot_df.to_csv(os.path.join(DATA_DIR, "uniprot", "uniprot-all.csv"), index=False) #SAVE

Processing species:  21%|██▏       | 2635/12391 [05:53<13:21, 12.18it/s]  

HTTP error occurred for Sphingobacterium+composti+[homonym]: 400 Client Error: Bad Request for url: https://rest.uniprot.org/uniprotkb/search?fields=accession%2Cec%2Corganism_name%2Corganism_id%2Ccc_cofactor%2Cid&format=tsv&size=500&query=%28organism_name%3ASphingobacterium+composti+%5Bhomonym%5D%29+AND+%28ec%3A*%29+AND+%28reviewed%3Atrue%29


Processing species:  84%|████████▎ | 10364/12391 [22:29<03:27,  9.78it/s] 

HTTP error occurred for Protocrea+illino&euml;nsis: 400 Client Error: Bad Request for url: https://rest.uniprot.org/uniprotkb/search?fields=accession%2Cec%2Corganism_name%2Corganism_id%2Ccc_cofactor%2Cid&format=tsv&size=500&query=%28organism_name%3AProtocrea+illino&euml;nsis%29+AND+%28ec%3A*%29+AND+%28reviewed%3Atrue%29


Processing species: 100%|██████████| 12391/12391 [26:43<00:00,  7.73it/s]


10660 species with no data


Unnamed: 0,species,ec_uniprot
0,Ruminiclostridium+josui,1.2.1.70; 1.3.1.76; 4.99.1.4
1,Ruminiclostridium+josui,2.5.1.61
2,Ruminiclostridium+josui,2.1.1.107; 4.2.1.75
3,Ruminiclostridium+josui,4.2.1.24
4,Ruminiclostridium+josui,3.2.1.4
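
Both HTTP 400 errors in the log above involve names with characters that do not survive a plain space-to-`+` substitution: a bracketed qualifier (`[homonym]`) and an HTML entity (`&euml;`). A possible pre-processing step is sketched below using only the standard library. The helper names are hypothetical, and whether UniProtKB returns matches for the cleaned names is an assumption that would need checking.

```python
import html
import re
from urllib.parse import quote_plus


def clean_species_name(name):
    """Normalize a species label before placing it in a query string."""
    name = html.unescape(name)                 # "&euml;" becomes "ë"
    name = re.sub(r"\s*\[[^\]]*\]", "", name)  # drop qualifiers such as "[homonym]"
    return " ".join(name.split())              # collapse repeated whitespace


def encode_for_query(name):
    """Percent-encode the cleaned name; spaces become '+'."""
    return quote_plus(clean_species_name(name))
```

For example, `clean_species_name("Sphingobacterium composti [homonym]")` gives `"Sphingobacterium composti"`, which `encode_for_query` turns into `"Sphingobacterium+composti"`.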


**UniProtKB (MediaDive species name) final dataframe**

In [8]:
uniprot_df = pd.read_csv(os.path.join(DATA_DIR, 'uniprot', 'uniprot-all.csv'))
uniprot_df.head()

Unnamed: 0,species,ec_uniprot
0,Ruminiclostridium+josui,1.2.1.70; 1.3.1.76; 4.99.1.4
1,Ruminiclostridium+josui,2.5.1.61
2,Ruminiclostridium+josui,2.1.1.107; 4.2.1.75
3,Ruminiclostridium+josui,4.2.1.24
4,Ruminiclostridium+josui,3.2.1.4


### [2.2] UniProtKB taxa2ec (taxon_id)

Combined approach: uses NCBI taxon_id's (species level) from BacDive to query UniProtKB

In [None]:
import modules.uniprotkb as uni

Format taxon_id's to query UniProtKB

In [32]:
import ast 

bacdive_df = pd.read_csv(os.path.join(DATA_DIR, 'bacdive', 'bacdive-all.csv'), low_memory=False)
taxon_id = bacdive_df['taxon_id'].to_list()

# Some taxon_id cells hold a stringified list of dictionaries (one per matching level).
# This loop keeps only the species-level ID from those cells; cells holding a plain ID are skipped.
data = taxon_id
tax_ids = []

# Loop to extract NCBI tax ids
for item in data:
    if isinstance(item, str) and item.startswith('[') and item.endswith(']'):
        # Parse the string representation of the list of dictionaries
        try:
            dicts = ast.literal_eval(item)
            for d in dicts:
                if d['Matching level'] == 'species':
                    tax_ids.append(d['NCBI tax id'])
        except (ValueError, SyntaxError):
            # Handle cases where the string is not a valid list of dictionaries
            continue

len(tax_ids)

[83615, 231455, 33050, 40683, 169176, 291995, 203192, 1211807, 316, 203192, 1211807, 316, 316, 303, 57480, 82996, 61651, 55208, 82983, 61648, 204039, 556, 82979, 80866, 221992, 415849, 56457, 378548, 256466, 86184, 47883, 29442, 303, 301, 43263, 553088, 396808, 363850, 1293823, 41200, 649746, 1120986, 97477, 86959, 349095, 164452, 164451, 172827, 81028, 115778, 89059, 109790, 2275, 2275, 154046, 160404, 169679, 84698, 1501, 33954, 1537, 1492, 152260, 120962, 1263547, 81462, 351091, 146817, 447593, 1120994, 286803, 501571, 288966, 390806, 349933, 349931, 396504, 319475, 208479, 94869, 1519, 1535, 1488, 2049024, 2981785, 626940, 508460, 180311, 174709, 264463, 261299, 259063, 179628, 89152, 36842, 29361, 341220, 33025, 2981724, 2028282, 55205, 301953, 453230, 131540, 2221, 29291, 118126, 2298, 33059, 35837, 42471, 264148, 89053, 463192, 458711, 58180, 82171, 126333, 91360, 231447, 160661, 190898, 57666, 2779352, 1980281, 94009, 214851, 365349, 55504, 112901, 112900, 2374, 202604, 28212, 
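
The `taxon_id` column mixes two shapes: plain IDs (e.g. `294`) and stringified lists of dictionaries. A parser that handles both is sketched below. It assumes that a plain ID is already at the species level, which matches the `ncbi_tax_id_matching_level` values in the sample rows shown earlier but is not verified for the whole table; the function name `species_tax_ids` is hypothetical.

```python
import ast


def species_tax_ids(value):
    """Return species-level NCBI tax IDs from one BacDive 'taxon_id' cell."""
    if isinstance(value, (int, float)):
        # Plain numeric cell; NaN is the only float that differs from itself
        return [int(value)] if value == value else []
    text = str(value).strip()
    if text.isdigit():
        # Plain ID stored as text (assumed to be species level)
        return [int(text)]
    if text.startswith("[") and text.endswith("]"):
        try:
            entries = ast.literal_eval(text)
        except (ValueError, SyntaxError):
            return []
        return [
            entry["NCBI tax id"]
            for entry in entries
            if isinstance(entry, dict)
            and entry.get("Matching level") == "species"
            and "NCBI tax id" in entry
        ]
    return []


cells = ["294", "[{'NCBI tax id': 1218107, 'Matching level': 'strain'}, {'NCBI tax id': 80866, 'Matching level': 'species'}]", "294"]
# dict.fromkeys removes duplicates while keeping first-seen order
unique_ids = list(dict.fromkeys(i for cell in cells for i in species_tax_ids(cell)))
```

Deduplicating before querying avoids sending the same taxon ID to UniProtKB more than once.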

Retrieval of UniProtKB info

In [35]:
ncbi_df = uni.taxon2ec(tax_ids)
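# Split the '; '-separated EC strings, then explode the whole dataframe so each row keeps its own taxon ID
ncbi_df['ec_uniprot'] = ncbi_df['ec_uniprot'].str.split('; ')
ncbi_df = ncbi_df.explode('ec_uniprot', ignore_index=True)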

#ncbi_df.to_csv(os.path.join(DATA_DIR, 'ncbi-ec-2.csv'), index=False) #SAVE

Processing species: 100%|██████████| 2053/2053 [33:17<00:00,  1.03it/s] 


503 species with no data
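
When turning one row per protein into one row per EC number, the explode has to be applied to the dataframe rather than to a single column that is then assigned back, because an exploded Series is longer than the frame it came from. A minimal sketch on toy rows (values borrowed from the output below, used purely for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "species": [231455, 33050],
    "ec_uniprot": ["5.6.2.2; 2.7.7.6", "2.6.1.-"],
})

# Split into lists, then explode the dataframe so 'species' is repeated per EC number
long_df = (
    toy.assign(ec_uniprot=toy["ec_uniprot"].str.split("; "))
       .explode("ec_uniprot", ignore_index=True)
)
```

`DataFrame.explode` repeats the other columns for each list element, so every EC number stays paired with the taxon ID it was retrieved for.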


**UniProtKB (NCBI taxon IDs) final dataframe**

In [36]:
ncbi_df = pd.read_csv(os.path.join(DATA_DIR, 'ncbi-ec.csv'))
ncbi_df.head()

Unnamed: 0,species,ec_uniprot
0,231455,5.6.2.2
1,231455,2.7.7.6
2,33050,2.6.1.-
3,33050,3.1.1.87
4,33050,2.7.1.25


### [2.3] Formatting final taxa2ec table

Reformat outputs from [1.2-2.2]

In [16]:
bacdive_df = pd.read_csv(os.path.join(DATA_DIR, 'bacdive', 'bacdive-all.csv'), low_memory=False)
uniprot_df = pd.read_csv(os.path.join(DATA_DIR, 'uniprot', 'uniprot-all.csv'))
ncbi_df = pd.read_csv(os.path.join(DATA_DIR, 'ncbi-ec.csv'))

# BacDive taxa2ec (grouped by bacdive_id)
bacdive_ec = bacdive_df[['bacdive_id','taxon_id','type_strain','ec']].copy()
bacdive_ec['ec'] = bacdive_ec['ec'].str.replace("'", "")
bacdive_ec = bacdive_ec.rename(columns={'ec': 'ec_bacdive'})

# UniProtKB taxa2ec (grouped by species name)
uniprot_ec = uniprot_df.copy()
uniprot_ec['species'] = uniprot_ec['species'].replace(r'\+',' ', regex=True)
uniprot_ec['ec_uniprot'] = uniprot_ec['ec_uniprot'].str.replace(";", ",")
uniprot_ec = uniprot_ec.groupby("species", as_index=False)["ec_uniprot"].apply(lambda x: "[%s]" % ', '.join(x))

# NCBI taxa2ec (grouped by taxon_id)
ncbi_ec = ncbi_df.astype(str).copy()
ncbi_ec = ncbi_ec.rename(columns={'species': 'taxon_id', 'ec_uniprot': 'ec_ncbi'})
ncbi_ec = ncbi_ec.groupby("taxon_id", as_index=False)["ec_ncbi"].apply(lambda x: "[%s]" % ', '.join(x))

Merge 'md_df' MediaDive dataframe with formatted outputs

In [17]:
media_df = md_df.copy()

# Completing merge in multiple steps since we're merging on different columns
merged1 = pd.merge(left = media_df, right = uniprot_ec, on = 'species', how = 'left')
merged2 = pd.merge(left = merged1, right = bacdive_ec, on = 'bacdive_id', how = 'left')
merged3 = pd.merge(left = merged2, right = ncbi_ec, on = 'taxon_id', how = 'left')

merged3.head()

Unnamed: 0,media_id,name,complex_medium,source,link,min_pH,max_pH,reference,description,components,...,strain_id,species,ccno,bacdive_id,merge_source,ec_uniprot,taxon_id,type_strain,ec_bacdive,ec_ncbi
0,1,NUTRIENT AGAR,1,DSMZ,https://www.dsmz.de/microorganisms/medium/pdf/...,7.0,7.0,,,,...,,,,,,,,,,
1,1a,REACTIVATION WITH LIQUID MEDIUM 1,1,DSMZ,https://www.dsmz.de/microorganisms/medium/pdf/...,7.0,7.0,,,"['Peptone', 'Meat extract', 'Agar', 'Distilled...",...,29.0,Comamonas testosteroni,DSM 38,2912.0,both,"[2.6.1.1, 4.1.1.12, 1.13.11.74, 1.13.11.76, 1....",285,no,"[1.9.3.1, 3.2.1.21, 3.5.1.5, 3.5.3.6]","[4.2.1.3, 4.2.1.99, 6.3.2.10, 2.3.1.241, 5.6.2..."
2,1a,REACTIVATION WITH LIQUID MEDIUM 1,1,DSMZ,https://www.dsmz.de/microorganisms/medium/pdf/...,7.0,7.0,,,"['Peptone', 'Meat extract', 'Agar', 'Distilled...",...,30.0,Delftia acidovorans,DSM 39,2941.0,both,"[1.14.11.43, 1.14.11.44, 3.1.1.75, 2.4.2.1, 2....","[{'NCBI tax id': 1218107, 'Matching level': 's...",yes,"[1.9.3.1, 3.2.1.21, 3.5.1.5, 3.5.3.6]",
3,1a,REACTIVATION WITH LIQUID MEDIUM 1,1,DSMZ,https://www.dsmz.de/microorganisms/medium/pdf/...,7.0,7.0,,,"['Peptone', 'Meat extract', 'Agar', 'Distilled...",...,39.0,Acidovorax delafieldii,DSM 64,2885.0,both,[1.1.1.37],47920,yes,"[3.2.1.51, 3.2.1.24, 3.2.1.52, 3.2.1.21, 3.2.1...",
4,1a,REACTIVATION WITH LIQUID MEDIUM 1,1,DSMZ,https://www.dsmz.de/microorganisms/medium/pdf/...,7.0,7.0,,,"['Peptone', 'Meat extract', 'Agar', 'Distilled...",...,52.0,Pseudomonas putida,DSM 84,12895.0,both,"[5.1.1.10, 5.3.3.1, 1.18.1.3, 1.2.98.1, 1.18.1...",303,no,"[3.2.1.21, 3.5.1.5, 3.5.3.6]","[2.7.7.23, 2.7.1.25, 2.7.7.4, 3.1.11.5, 5.6.2...."


**taxa2ec final dataframe**

In [18]:
final_df = merged3[["media_id", "species", "taxon_id", "ec_uniprot", "ec_bacdive","ec_ncbi"]].copy()

# Melt ec columns and attribute ec source
final_df = final_df.melt(
    id_vars=["media_id", "species", "taxon_id"],
    value_vars=["ec_uniprot", "ec_bacdive", "ec_ncbi"],
    value_name="ec",
    var_name="source"
)

# Format source and ec columns
final_df["source"] = final_df["source"].str.replace("ec_", "", regex=False)
final_df["ec"] = final_df["ec"].astype(str)
final_df["ec"] = final_df["ec"].str.replace("[", "", regex=False).str.replace("]", "", regex=False)
final_df["ec"] = final_df["ec"].str.split(", ")
final_df = final_df.explode("ec")

# Remove rows with nan 'ec' values (missing values became the string 'nan' after astype(str))
is_nan = final_df["ec"].str.contains("nan", regex=False)
final_df = final_df[~is_nan]

#final_df.to_csv(os.path.join(DATA_DIR, "taxa2ec-final.csv"), index=False) #SAVE
final_df

Unnamed: 0,media_id,species,taxon_id,source,ec
1,1a,Comamonas testosteroni,285,uniprot,2.6.1.1
1,1a,Comamonas testosteroni,285,uniprot,4.1.1.12
1,1a,Comamonas testosteroni,285,uniprot,1.13.11.74
1,1a,Comamonas testosteroni,285,uniprot,1.13.11.76
1,1a,Comamonas testosteroni,285,uniprot,1.14.13.23
...,...,...,...,...,...
92484,J1236,Thermus thermophilus,274,ncbi,6.1.1.16
92484,J1236,Thermus thermophilus,274,ncbi,2.5.1.145
92484,J1236,Thermus thermophilus,274,ncbi,1.1.1.37
92484,J1236,Thermus thermophilus,274,ncbi,5.2.1.8


### [3.1] KEGG media2ec (CPD)

Returns ec numbers from KEGG for an input of CPDs (MediaDive)

In [2]:
from Bio.KEGG import REST
import modules.mediadive as md
import modules.kegg as kegg

Retrieve KEGG and ChEBI compound IDs from an input of MediaDive 'component_ids'

In [3]:
# Retrieve 'component_ids' from the 'md_df' MediaDive dataframe
md_df = pd.read_csv(os.path.join(DATA_DIR,"mediadive","mediadive-all.csv"), sep = ';')

import ast

comps_df = md_df[['media_id','components','component_ids']].copy()
comps_df = comps_df.drop_duplicates(subset='media_id')  # keep one row per medium
comps_df = comps_df.dropna()

# Extract the component_ids into a list
comps_df['component_ids'] = comps_df['component_ids'].astype(str)

def extract_ids(comps_df):
    id_set = set()  # Use a set to avoid duplicate IDs
    for ids in comps_df['component_ids']:
        id_list = ast.literal_eval(ids)  # Safely parse the string representation of the list
        id_set.update(id_list)
    return list(id_set)

# Extract IDs
id_list = extract_ids(comps_df)

Retrieve associated ECs for each compound

In [4]:
# Retrieve compound IDs from component IDs (MediaDive)
compound_df = md.get_compounds(id_list)

# Making 'cpd_list' using the KEGG compound IDs
cpd_df = compound_df['KEGG cpd'].dropna().copy()
cpd_list = cpd_df.to_list()

# Retrieve ECs from CPDs
cpd2ec_df = kegg.compound2ec(cpd_list) # HTTP errors = no ECs associated with this compound on KEGG

  0%|          | 0/763 [00:00<?, ?it/s]

100%|██████████| 763/763 [01:39<00:00,  7.68it/s]


HTTP error occurred for C00293: 404 Client Error: Not Found for url: https://rest.kegg.jp/get/compound:C00293
HTTP error occurred for C00382: 404 Client Error: Not Found for url: https://rest.kegg.jp/get/compound:C00382


Format final table for media2ec

In [5]:
# Merge dataframes with component_id, compound, and ec information
cpd_merge = pd.merge(left=compound_df, right=cpd2ec_df, on="KEGG cpd", how="outer")
cpd_merge = cpd_merge.drop_duplicates()

#cpd_merge.to_csv(os.path.join(DATA_DIR,"kegg","kegg-compound-ec.csv"), index=False) #SAVE
cpd_merge

Unnamed: 0,component_id,ChEBI,KEGG cpd,Enzyme
0,4,15377.0,C00001,1.1.1.1 1.1.1.22 1.1.1.23 ...
3,56,15377.0,C00001,1.1.1.1 1.1.1.22 1.1.1.23 ...
6,754,15377.0,C00001,1.1.1.1 1.1.1.22 1.1.1.23 ...
9,1798,15846.0,C00003,1.1.1.1 1.1.1.3 1.1.1.4 ...
10,783,15346.0,C00010,1.1.1.34 1.1.1.88 1.1.1.- ...
...,...,...,...,...
882,2042,,,
883,2043,132764.0,,
884,2045,,,
885,2046,2509.0,,


**media2ec final dataframe**

In [34]:
# Explode md_df on component_id
component_df = md_df.copy()
component_df['component_ids'] = component_df['component_ids'].str.strip('[]').str.split(', ')
component_df = component_df.explode('component_ids', ignore_index=True)  # one row per component
component_df = component_df.rename(columns={'component_ids': 'component_id'})

# Merge md_df with media2ec (KEGG), remove NaN 'KEGG cpd' values
cpd_merge['component_id'] = cpd_merge['component_id'].astype(str)
comp_comp = pd.merge(left=component_df, right=cpd_merge, on='component_id', how='outer')
kegg2ec = comp_comp.loc[comp_comp['KEGG cpd'].notnull()].copy()
kegg2ec = kegg2ec.loc[kegg2ec['Enzyme'].notnull()]

# NOTE: not every component ID has an associated KEGG compound, and not every KEGG compound has an associated enzyme, so the final table loses part of the original input
    # Splitting complex media components into their simpler forms may recover some of this loss

# Load and format taxon_id's from bacdive dataframe
taxon_df = pd.read_csv(os.path.join(DATA_DIR,"bacdive","bacdive-all.csv"), low_memory=False)
taxon_df = taxon_df[['bacdive_id','taxon_id']]

# Merge taxon_id's to kegg2ec table
media_final = pd.merge(left=kegg2ec, right=taxon_df, on='bacdive_id', how='outer')

media_final = media_final[['media_id','taxon_id','component_id','KEGG cpd','Enzyme']].copy()
media_final = media_final.dropna(subset='Enzyme')

#media_final.to_csv(os.path.join(DATA_DIR, "media2ec-final.csv"), index=False) #SAVE
media_final.head()
    #Two different model inputs: one with component id's, one with enzymes

Unnamed: 0,media_id,taxon_id,component_id,KEGG cpd,Enzyme
5,J597,"[{'NCBI tax id': 1226664, 'Matching level': 's...",4,C00001,1.1.1.1 1.1.1.22 1.1.1.23 ...
15,J97,"[{'NCBI tax id': 104101, 'Matching level': 'sp...",37,C12486,1.14.-.-
16,J709,104098,199,C00369,2.4.1.1 2.4.1.18 2.4.1.19 ...
25,J709,"[{'NCBI tax id': 610245, 'Matching level': 'sp...",199,C00369,2.4.1.1 2.4.1.18 2.4.1.19 ...
36,J97,"[{'NCBI tax id': 1123227, 'Matching level': 's...",4,C00001,1.1.1.1 1.1.1.22 1.1.1.23 ...


**media2ec final dataframe (exploded)**

In [49]:
media_final = pd.read_csv(os.path.join(DATA_DIR, "media2ec-final.csv"))

# Split the 'Enzyme' string into individual EC numbers.
# Entries are separated by a variable number of spaces (6 to 9 in this dataset),
# so split on any run of whitespace instead of a fixed range of widths.
def split_and_clean(value):
    return value.split()

# Split and explode
df_split = media_final.copy()
df_split['split_column'] = df_split['Enzyme'].apply(split_and_clean)
media_split = df_split.explode('split_column').reset_index(drop=True)
media_split = media_split.drop(columns=['Enzyme'])
media_split = media_split.rename(columns={'split_column': 'ec_KEGG'})

media_split = media_split.dropna(subset=['ec_KEGG'])
#media_split.to_csv(os.path.join(DATA_DIR, "media2ec-explode.csv"), index=False) #SAVE
media_split.head()

Unnamed: 0,media_id,taxon_id,component_id,KEGG cpd,ec_KEGG
0,J597,"[{'NCBI tax id': 1226664, 'Matching level': 's...",4,C00001,1.1.1.1
1,J597,"[{'NCBI tax id': 1226664, 'Matching level': 's...",4,C00001,1.1.1.22
2,J597,"[{'NCBI tax id': 1226664, 'Matching level': 's...",4,C00001,1.1.1.23
3,J597,"[{'NCBI tax id': 1226664, 'Matching level': 's...",4,C00001,1.1.1.115
4,J97,"[{'NCBI tax id': 104101, 'Matching level': 'sp...",37,C12486,1.14.-.-


**To-do**

- Implement ChEBI media2ec (massive dataset, figure out how to reduce/manage)

- Normalize ECs for comparison (proportion of dataset rather than raw counts)
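
One possible reading of the normalization item is to express each EC number as a share of all EC annotations from its source, so that sources of different sizes can be compared. The sketch below uses invented toy rows; the column names follow the taxa2ec table, and the choice to group by `source` is an assumption.

```python
import pandas as pd

# Toy rows in the shape of the taxa2ec table (invented values)
toy = pd.DataFrame({
    "source": ["uniprot", "uniprot", "uniprot", "bacdive", "bacdive"],
    "ec": ["1.1.1.37", "1.1.1.37", "2.6.1.1", "1.1.1.37", "3.5.1.5"],
})

# Proportion of each EC number within its source, instead of raw counts
ec_share = (
    toy.groupby("source")["ec"]
       .value_counts(normalize=True)
       .rename("proportion")
       .reset_index()
)
```

The proportions within each source sum to 1, which removes the effect of one database simply returning more annotations than another.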