In [1]:
import pandas as pd

First, import the ChEBI hierarchy and the TCDB data as DataFrames.

In [9]:
# Load the ChEBI hierarchy and keep only the numeric part of each CHEBI_<number> identifier
df_hierarchy = pd.read_csv("chebiHierarchy.tsv", sep="\t")
df_hierarchy['child'] = df_hierarchy['child'].str.extract(r'CHEBI_(\d+)', expand=False).astype(int)
df_hierarchy['parent'] = df_hierarchy['parent'].str.extract(r'CHEBI_(\d+)', expand=False).astype(int)
all_parents = set(df_hierarchy["parent"])

df_tcdb = pd.read_csv("../main_retry/tcdb_data_combined.csv")
all_chebi_tcdb = set(df_tcdb["CHEBI IDs"].unique())

Map secondary ChEBI IDs to their primary IDs, so that every ID is expressed as a primary ID.

In [10]:
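import ast

prim_sec_chebi = pd.read_csv("primary_secondary_chebi_ids.tsv", sep="\t")
secondary_to_primary = {}

for _, row in prim_sec_chebi.iterrows():
    primary_id = int(row["Primary_CHEBI_ID"])
    # The cell is expected to hold a list literal as text; literal_eval parses it without executing code
    secondary_ids = ast.literal_eval(row["Secondary_CHEBI_IDs"])

    for s_id in secondary_ids:
        # Store keys as strings so they match the str() lookup in get_primary_id
        secondary_to_primary[str(s_id)] = primary_id

def get_primary_id(chebi_id):
    """Return the primary ID for a secondary ChEBI ID, or the ID itself if it is not a known secondary ID."""
    chebi_id = int(chebi_id)
    return secondary_to_primary.get(str(chebi_id), chebi_id)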
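
The mapping step can be illustrated on a toy table. This is only a sketch: the rows are invented, the `toy_` names are hypothetical, and the assumption that `Secondary_CHEBI_IDs` holds a text-encoded list of ID strings is inferred from the lookup code above, not confirmed from the file.

```python
import ast

# Invented rows that mimic the assumed layout of primary_secondary_chebi_ids.tsv
toy_rows = [
    {"Primary_CHEBI_ID": 100, "Secondary_CHEBI_IDs": "['101', '102']"},
    {"Primary_CHEBI_ID": 200, "Secondary_CHEBI_IDs": "[]"},
]

# Build the secondary -> primary lookup with string keys
toy_secondary_to_primary = {}
for row in toy_rows:
    for s_id in ast.literal_eval(row["Secondary_CHEBI_IDs"]):
        toy_secondary_to_primary[str(s_id)] = int(row["Primary_CHEBI_ID"])

def toy_get_primary_id(chebi_id):
    """Return the primary ID for a secondary ID, or the ID itself otherwise."""
    chebi_id = int(chebi_id)
    return toy_secondary_to_primary.get(str(chebi_id), chebi_id)

print(toy_get_primary_id(101))  # secondary ID, resolves to 100
print(toy_get_primary_id(200))  # already primary, returned unchanged
```

IDs that are unknown to the table pass through unchanged, which is why IDs missing from the mapping file do not raise an error.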

In [11]:
all_primary_chebi_tcdb = {get_primary_id(el) for el in all_chebi_tcdb}

df_hierarchy_prim = df_hierarchy.copy()
df_hierarchy_prim["child"] = df_hierarchy["child"].apply(get_primary_id)
df_hierarchy_prim["parent"] = df_hierarchy["parent"].apply(get_primary_id)

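# TCDB IDs that never appear as a child in the hierarchy (set lookup instead of list scan)
all_children = set(df_hierarchy_prim["child"].tolist())
not_in_hierarchy = [chebi_id for chebi_id in all_primary_chebi_tcdb if chebi_id not in all_children]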

*Method 1*

Keep only the ChEBI IDs that are leaf nodes, i.e. IDs that never appear as a parent.

In [None]:
# Note: all_parents was built from the hierarchy before the primary-ID conversion
df_filtered = df_hierarchy_prim[~df_hierarchy_prim["child"].isin(all_parents)]
leaf_nodes = set(df_filtered["child"].tolist())

# Just out of curiosity: which TCDB IDs are leaf nodes?
common_children = all_primary_chebi_tcdb & leaf_nodes

n_tcdb = len(all_primary_chebi_tcdb)
n_non_leaf = n_tcdb - len(common_children)       # parents or not in the hierarchy
n_parents = n_non_leaf - len(not_in_hierarchy)   # parents only

print(f"There are {len(leaf_nodes)} leaf nodes.\nThe original set of ChEBIs from TCDB contains {n_tcdb} elements.")
print(f"The leaf nodes and the IDs from TCDB have {len(common_children)} IDs in common.")
print(f"This means that there are {n_non_leaf} IDs that are parents or not listed in the hierarchy.")
print(f"There are {len(not_in_hierarchy)} IDs not in hierarchy and {n_parents} IDs that are parents.")
print(f"As seen below, some of the {n_non_leaf} ChEBIs that are either parents or not in the hierarchy are kept by Method 2.")

There are 187792 leaf nodes.
The original set of ChEBIs from TCDB contains 1524 elements.
The leaf nodes and the IDs from TCDB have 941 IDs in common.
This means that there are 583 IDs that are parents or not listed in the hierarchy.
There are 86 IDs not in hierarchy and 497 IDs that are parents.
As seen below, some of the 583 ChEBIs that are either parents or not in the hierarchy are kept by Method 2.
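
The leaf-node filter of Method 1 can be checked on a small invented hierarchy. This is a minimal sketch: the edges, the TCDB IDs and all `toy_` names are made up for illustration and do not come from the real data.

```python
import pandas as pd

# Invented child -> parent edges: 1 is the root, 2 and 3 are intermediate, 4 to 6 are leaves
toy_hierarchy = pd.DataFrame(
    {"child": [2, 3, 4, 5, 6], "parent": [1, 1, 2, 2, 3]}
)
toy_tcdb_ids = {2, 4, 6, 99}  # 99 does not occur in the hierarchy

# A leaf node is a child that never appears in the parent column
toy_parents = set(toy_hierarchy["parent"].tolist())
is_leaf = ~toy_hierarchy["child"].isin(toy_parents)
toy_leaves = set(toy_hierarchy.loc[is_leaf, "child"].tolist())

# Method 1 keeps only the TCDB IDs that are leaf nodes
kept_method_1 = toy_tcdb_ids & toy_leaves

print(sorted(toy_leaves))     # [4, 5, 6]
print(sorted(kept_method_1))  # [4, 6]
```

In this toy case Method 1 drops both the intermediate node 2 and the ID 99 that is missing from the hierarchy.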


*Method 2*

Take the ChEBI IDs from TCDB and remove every ID that has a child which is also in TCDB. IDs that are not listed in the hierarchy are kept.

Only this method needs the primary-ID conversion, because it combines the TCDB and ChEBI data.

In [None]:
# Find all parents that have at least one child among the TCDB IDs
parents_w_children_in_tcdb = set(df_hierarchy_prim[df_hierarchy_prim["child"].isin(all_primary_chebi_tcdb)]["parent"])

chebi_ids_to_remove = parents_w_children_in_tcdb.intersection(all_primary_chebi_tcdb)

filtered_chebis = all_primary_chebi_tcdb - chebi_ids_to_remove

n_removed = len(all_primary_chebi_tcdb) - len(filtered_chebis)

print(f"Now, there are {len(filtered_chebis)} IDs in the filter.")
print(f"This method only removes {n_removed} IDs from the original set of {len(all_primary_chebi_tcdb)} IDs.\nAll {n_removed} removed IDs are parents with at least one child in TCDB.")
print("IDs that are too broad cannot be linked to information about their substrate properties\nand will subsequently be removed.")

Now, there are 1326 IDs in the filter.
This method only removes 198 IDs from the original set of 1524 IDs.
All 198 removed IDs are parents with at least one child in TCDB.
IDs that are too broad cannot be linked to information about their substrate properties
and will subsequently be removed.
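
Method 2 can be traced on a small invented hierarchy to see which IDs survive. This is a sketch under stated assumptions: the edges, the TCDB IDs and the `toy_` names are hypothetical, and the table is assumed to list direct child/parent edges only.

```python
import pandas as pd

# Invented child -> parent edges: 1 is the root, 2 and 3 are intermediate, 4 to 6 are leaves
toy_hierarchy = pd.DataFrame(
    {"child": [2, 3, 4, 5, 6], "parent": [1, 1, 2, 2, 3]}
)
toy_tcdb_ids = {2, 3, 4, 99}  # 99 does not occur in the hierarchy

# Parents of every hierarchy child that is itself a TCDB ID
has_tcdb_child = toy_hierarchy["child"].isin(toy_tcdb_ids)
parents_with_tcdb_child = set(toy_hierarchy.loc[has_tcdb_child, "parent"].tolist())

# Remove only those parents that are TCDB IDs themselves
to_remove = parents_with_tcdb_child & toy_tcdb_ids
kept_method_2 = toy_tcdb_ids - to_remove

print(sorted(to_remove))      # [2]
print(sorted(kept_method_2))  # [3, 4, 99]
```

Node 2 is removed because its child 4 is also in the toy TCDB set. Node 3 is a parent but survives, since its only child 6 is not in TCDB, and 99 survives because it never enters the hierarchy lookup. If the real table lists only direct edges, an ancestor whose nearest TCDB descendant is a grandchild would be kept in the same way.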


Method 2 is preferable, as it keeps the leaf nodes as well as the parents that have no children in TCDB. This is not a weakness: where a ChEBI ID is too broad to have specific properties, it will simply be removed later. Method 2 also keeps all the IDs that are not listed in the hierarchy, of which there are some, as shown in Method 1 (86, to be exact).