### **Media Prediction**
[MediaDive] to [KEGG, UniProt OR BacDive] to [Downstream applications]

In [4]:
import pandas as pd
from io import StringIO
from Bio.KEGG import REST

import requests
import re
from requests.adapters import HTTPAdapter, Retry

### [MediaDive] 
Using an input of DSMZ media IDs, retrieve components, component IDs, and strain taxonomy information.

Import MediaDive functions, provide input = **media_id_list**

In [7]:
import modules.mediadive as md
media_id_list = ['1','1001']

In [8]:
#media_id_list = ['65','514','830','92','1','535','220','11','693','104','53','645','110','585','58','429','372','545','98','9','63','553','78','84','104b','215','381','425','119','535b','554','736','67','804','27','83','641.1','354','123','621','760.1','339.1','339','595','31','144','330','214','593','371']

In [9]:
# Returns dataframe of the media_id and its components
md_comp_df = md.get_composition(media_id_list)

# Returns dataframe of the media_id and associated strains' info
md_strains_df = md.get_strains(media_id_list)

# Merging component and strain info
md_df = pd.merge(left=md_comp_df, right=md_strains_df, on="media_id",how="outer")
md_df

Unnamed: 0,media_id,components,component_ids,strain_id,species,ccno,bacdive_id
0,1,"[Peptone, Meat extract, Agar, Distilled water]","[1, 2, 3, 4]",1,Heyndrickxia coagulans,DSM 1,654.0
1,1,"[Peptone, Meat extract, Agar, Distilled water]","[1, 2, 3, 4]",2,Paenibacillus macquariensis subsp. macquariensis,DSM 2,11477.0
2,1,"[Peptone, Meat extract, Agar, Distilled water]","[1, 2, 3, 4]",3,Sporosarcina psychrophila,DSM 3,11984.0
3,1,"[Peptone, Meat extract, Agar, Distilled water]","[1, 2, 3, 4]",6,Peribacillus psychrosaccharolyticus,DSM 6,748.0
4,1,"[Peptone, Meat extract, Agar, Distilled water]","[1, 2, 3, 4]",7,Bacillus amyloliquefaciens,DSM 7,598.0
...,...,...,...,...,...,...,...
2291,1,"[Peptone, Meat extract, Agar, Distilled water]","[1, 2, 3, 4]",49035,Acinetobacter pittii,DSM 101622,
2292,1,"[Peptone, Meat extract, Agar, Distilled water]","[1, 2, 3, 4]",49036,Acinetobacter pittii,DSM 101623,
2293,1,"[Peptone, Meat extract, Agar, Distilled water]","[1, 2, 3, 4]",1431,Pseudomonas sp.,DSM 1991,12943.0
2294,1,"[Peptone, Meat extract, Agar, Distilled water]","[1, 2, 3, 4]",180,Pseudomonas putida,DSM 291,12871.0
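Because the merge above uses `how="outer"`, a medium that appears in only one of the two frames is kept, with NaN in the columns it is missing. The following is a minimal sketch on made-up toy frames (the values are illustrative, not real MediaDive output) of how pandas' `indicator=True` option could be used to spot such rows:

```python
import pandas as pd

# Toy stand-ins for md_comp_df and md_strains_df (illustrative values only)
comp = pd.DataFrame({
    "media_id": ["1", "1001"],
    "components": [["Peptone", "Agar"], ["Yeast extract"]],
})
strains = pd.DataFrame({
    "media_id": ["1", "1", "2"],
    "species": [
        "Heyndrickxia coagulans",
        "Bacillus amyloliquefaciens",
        "Pseudomonas putida",
    ],
})

# indicator=True adds a '_merge' column naming the source of each row
merged = pd.merge(comp, strains, on="media_id", how="outer", indicator=True)

# Rows found in only one input are flagged 'left_only' or 'right_only'
unmatched = merged[merged["_merge"] != "both"]
print(unmatched[["media_id", "_merge"]])
```

In this toy case, medium `1001` has components but no strains, and medium `2` has a strain but no components, so both are flagged.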


Retrieve ChEBI and KEGG compound IDs associated with the media components, output = **compounds_df**

In [11]:
import ast

# Extract the unique component IDs into a list
md_df['component_ids'] = md_df['component_ids'].astype(str)

def extract_ids(df, column):
    id_set = set()  # Use a set to avoid duplicate IDs
    for ids in df[column]:
        # Safely parse the string representation of the list (avoids eval)
        id_set.update(ast.literal_eval(ids))
    return list(id_set)

component_id_list = extract_ids(md_df, 'component_ids')

# Returns dataframe of the component_id and associated ChEBI and KEGG compound IDs
compounds_df = md.get_compounds(component_id_list)
compounds_df

Unnamed: 0,component_id,ChEBI,KEGG cpd
0,1,,
1,2,,
2,3,2509.0,C08815
3,4,15377.0,C00001
4,519,30961.0,
5,136,17015.0,C00255
6,135,7916.0,C00864
7,138,27470.0,C00504
8,139,15940.0,C00253
9,907,39054.0,
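Not every component maps to a ChEBI or KEGG identifier, and only annotated components can be carried into the next steps. As a rough sketch, the coverage could be tallied as below. The frame is rebuilt by hand from the ten rows shown in the output above so that the snippet runs on its own; the variable names are illustrative.

```python
import pandas as pd

# Values copied from the compounds_df output shown above
compounds = pd.DataFrame({
    "component_id": [1, 2, 3, 4, 519, 136, 135, 138, 139, 907],
    "ChEBI": [None, None, 2509, 15377, 30961, 17015, 7916, 27470, 15940, 39054],
    "KEGG cpd": [None, None, "C08815", "C00001", None,
                 "C00255", "C00864", "C00504", "C00253", None],
})

total = len(compounds)
with_chebi = int(compounds["ChEBI"].notna().sum())
with_kegg = int(compounds["KEGG cpd"].notna().sum())

print(f"{with_chebi}/{total} components have a ChEBI ID")
print(f"{with_kegg}/{total} components have a KEGG compound ID")

# Components that cannot be queried in KEGG
missing_kegg = compounds.loc[compounds["KEGG cpd"].isna(), "component_id"].tolist()
print("No KEGG compound ID:", missing_kegg)
```

For these rows, components 1 and 2 (Peptone and Meat extract in the first table) have neither identifier, which is consistent with their being complex mixtures rather than single compounds.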


### [KEGG]
Query KEGG to retrieve EC numbers associated with KEGG compound IDs, input = **compounds_df**

In [13]:
import modules.compound2ec as cpds

# Making 'cpd_list' using the KEGG compound IDs
cpd_list = compounds_df['KEGG cpd'].dropna().to_list()

# Look up the enzymes associated with each compound
cpd2ec_df = cpds.compound2ec(cpd_list)

def process(df):
    # Drop compounds with no enzyme hit (stored as the string 'None')
    df_filtered = df[df['Enzyme'] != 'None']
    # Split the whitespace-separated EC numbers so each one gets its own row
    df_expanded = df_filtered.assign(Enzyme=df_filtered['Enzyme'].str.split()).explode('Enzyme')
    return df_expanded

filtered_df = process(cpd2ec_df)
kegg_ec_list = filtered_df['Enzyme'].dropna().to_list()
kegg_ec_list

['1.1.1.1',
 '1.1.1.22',
 '1.1.1.23',
 '1.1.1.115',
 '1.5.1.30',
 '1.5.1.36',
 '1.5.1.41',
 '2.5.1.9',
 '2.7.1.33',
 '3.5.1.22',
 '3.5.1.92',
 '3.5.1.-',
 '1.5.1.3',
 '1.17.1.5',
 '1.17.2.1',
 '2.1.1.7',
 '2.4.1.196',
 '1.16.1.6',
 '6.3.1.20',
 '1.14.-.-',
 '1.-.-.-',
 '2.8.1.6',
 '3.5.1.12',
 '6.2.1.11',
 '1.7.1.-',
 '1.14.12.-',
 '1.14.13.27',
 '1.14.99.68',
 '1.14.14.10']
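The list above mixes fully specified EC numbers with partial classes such as `3.5.1.-` and `1.-.-.-`. Because each compound is expanded separately, the same EC number could in principle also appear more than once. A minimal sketch of one way to de-duplicate the list and separate the two kinds follows. It assumes a complete EC number has four numeric fields, and the repeated entry in the sample is added here only to illustrate de-duplication.

```python
import re

# A few values taken from the kegg_ec_list output above, plus one repeat
# added purely to illustrate de-duplication
ec_list = ["1.1.1.1", "3.5.1.-", "1.14.-.-", "1.-.-.-", "2.8.1.6", "1.1.1.1"]

# Assumption: a fully specified EC number has four numeric fields
COMPLETE_EC = re.compile(r"^\d+\.\d+\.\d+\.\d+$")

# dict.fromkeys removes duplicates while keeping first-seen order
unique_ecs = list(dict.fromkeys(ec_list))

complete = [ec for ec in unique_ecs if COMPLETE_EC.match(ec)]
partial = [ec for ec in unique_ecs if not COMPLETE_EC.match(ec)]

print("complete:", complete)
print("partial: ", partial)
```

Keeping the partial classes in a separate list, rather than discarding them, leaves the option of matching them at the class level later.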

### [UniProt]
Query UniProt to retrieve EC numbers associated with ChEBI IDs, input = **compounds_df**

### [BacDive]
Query BacDive to retrieve EC numbers from taxa (bacdive_id), input = **bacdive_ids**

In [16]:
# Each media ID can have thousands of associated strains, so pulling all EC numbers for every strain can take hours.
# TODO: find a way to reduce the number of strains queried.
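One possible way to shrink the BacDive workload, offered only as a sketch and not necessarily the approach this notebook will take, is to drop strains that have no `bacdive_id` and keep a single representative strain per species for each medium. The rows below are copied from the md_df output above; `reduced` and `bacdive_ids` are illustrative names.

```python
import pandas as pd

# A handful of rows copied from the md_df output shown earlier
strains = pd.DataFrame({
    "media_id": ["1"] * 5,
    "species": [
        "Heyndrickxia coagulans",
        "Acinetobacter pittii",
        "Acinetobacter pittii",
        "Pseudomonas sp.",
        "Pseudomonas putida",
    ],
    "ccno": ["DSM 1", "DSM 101622", "DSM 101623", "DSM 1991", "DSM 291"],
    "bacdive_id": [654.0, None, None, 12943.0, 12871.0],
})

reduced = (
    strains
    .dropna(subset=["bacdive_id"])                    # cannot query BacDive without an ID
    .drop_duplicates(subset=["media_id", "species"])  # one strain per species per medium
)

# BacDive IDs were read as floats because of the missing values
bacdive_ids = reduced["bacdive_id"].astype(int).tolist()
print(bacdive_ids)
```

In this small sample only the missing-ID filter removes rows. On the full table, species with several strains would also collapse to one row. Strains of the same species can differ in enzyme content, so this trades completeness for speed.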

### Notes:
Some "Error" codes occur because the corresponding media components are complex mixtures with no single ingredient associated with them. Proceeding as-is keeps broths, vitamin mixes, trace element mixes, etc. grouped together, which could be useful for certain analyses. To get the most basic components, however, the individual ingredients of each solution would have to be separated out and added to the later lists.

Cross-reference and compare these different outputs. A lot of information appears to be lost because annotations are missing at each mapping step, from component_id to compound to EC number.
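The cross-referencing could be sketched with plain set operations. The EC numbers below are taken from the KEGG list earlier in this notebook, but their assignment to the three sources is invented purely for illustration. In practice each set would be filled from the corresponding step.

```python
# Hypothetical EC sets per source (the assignments are made up for illustration)
ec_by_source = {
    "KEGG": {"1.1.1.1", "2.8.1.6", "3.5.1.12"},
    "UniProt": {"1.1.1.1", "3.5.1.12", "6.2.1.11"},
    "BacDive": {"1.1.1.1", "2.7.1.33"},
}

# EC numbers reported by every source
shared = set.intersection(*ec_by_source.values())

# EC numbers reported by at least one source
all_ecs = set.union(*ec_by_source.values())

# EC numbers reported by exactly one source
unique_to = {
    name: ecs - set.union(
        *(other for other_name, other in ec_by_source.items() if other_name != name)
    )
    for name, ecs in ec_by_source.items()
}

print("shared by all:", sorted(shared))
print("total distinct:", len(all_ecs))
for name, ecs in unique_to.items():
    print(f"only in {name}:", sorted(ecs))
```

A large `unique_to` set for one source alongside a small `shared` set would be one sign of the annotation gaps described above.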