# HackMed 21: UniProt Bacterial Protein Explorer
Start: 24.04.2021 | author Camillo Moschner (cm967)

Source Data: https://www.uniprot.org/uniprot/?query=taxonomy:%22Bacteria%20[2]%22%20AND%20reviewed:yes%20ec:6.1.1.9

## Import statements

In [1]:
import numpy as np
import pandas as pd
from pprint import pprint
import pickle
import re
from itertools import combinations

In [2]:
from tqdm.notebook import tqdm

## Function Definitions

In [3]:
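def getNumbers(text):
    # Return the first run of digits in the string as an integer
    array = re.findall(r'[0-9]+', text)
    return int(array[0])

def getChEBIs(text):
    # Return all runs of four or more digits (the ChEBI IDs) as strings
    array = re.findall(r'[0-9]\d\d\d+', text)
    return array

def getIntChEBIs(cell):
    # Same as getChEBIs, but with the IDs converted to integers
    extracted_ChEBI_list = getChEBIs(cell)
    integer_map = map(int, extracted_ChEBI_list)
    integer_list = list(integer_map)
    return integer_list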
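
As a quick check of these helpers, the sketch below applies the same regular expressions to the `ChEBI IDs` string of the first UniProt entry (printed further down in this notebook). The functions are re-declared under hypothetical snake_case names so the block runs on its own. The `CHEBI:(\d+)` pattern and the short ID `776` are assumptions added for illustration only; they show that the four-digit minimum would skip any shorter ID.

```python
import re

def get_numbers(text):
    # First run of digits in the string, as an int (mirrors getNumbers)
    return int(re.findall(r'[0-9]+', text)[0])

def get_int_chebis(cell):
    # All runs of four or more digits, as ints (mirrors getIntChEBIs)
    return [int(m) for m in re.findall(r'[0-9]\d\d\d+', cell)]

sample = 'CHEBI:456215; CHEBI:57762; CHEBI:30616; CHEBI:78537; CHEBI:78442; CHEBI:33019'

first_id = get_numbers(sample)
all_ids = get_int_chebis(sample)

# Assumed alternative: anchor on the "CHEBI:" prefix so IDs of any length match.
# The three-digit ID below is made up for the comparison.
short_sample = 'CHEBI:456215; CHEBI:776'
prefixed_ids = [int(m) for m in re.findall(r'CHEBI:(\d+)', short_sample)]
four_digit_ids = get_int_chebis(short_sample)

print(first_id, all_ids, prefixed_ids, four_digit_ids)
```

If the dataset contains ChEBI IDs with fewer than four digits, the prefix-based pattern may be the safer choice; whether it does is not checked here.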

## UniProt Explorer & Data Prep

Load dataset from UniProt:

In [4]:
tab_file = '/Users/camillomoschner/Desktop/21_HackMed/uniprot-taxonomy__Bacteria+[2]_-filtered-reviewed_yes.tab'

In [5]:
df_bac_prot = pd.read_csv(tab_file, sep='\t')

Filter the UniProt dataset to keep only entries that have ChEBI IDs:

In [6]:
pd.set_option('display.max_columns', None)
proc_df_bac_prot = df_bac_prot.loc[df_bac_prot['ChEBI IDs'].notnull()]
proc_df_bac_prot

Unnamed: 0,Entry,Entry name,Status,Protein names,Gene names,Organism,Length,Catalytic activity,Cofactor,EC number,Function [CC],Kinetics,Binding site,Activity regulation,Active site,Calcium binding,Pathway,Sequence,Gene encoded by,Organism ID,pH dependence,Metal binding,Temperature dependence,Protein existence,Interacts with,Gene ontology (biological process),Gene ontology (molecular function),Gene ontology (GO),ChEBI,ChEBI (Catalytic activity),ChEBI (Cofactor),ChEBI IDs,Post-translational modification,Signal peptide,Subcellular location [CC],Biotechnological use,Pharmaceutical use,3D,PubMed ID,Date of last modification,Protein families,Taxonomic lineage IDs
0,Q8K9I1,SYV_BUCAP,reviewed,Valine--tRNA ligase (EC 6.1.1.9) (Valyl-tRNA s...,valS BUsg_354,Buchnera aphidicola subsp. Schizaphis graminum...,960,CATALYTIC ACTIVITY: Reaction=ATP + L-valine + ...,,6.1.1.9,FUNCTION: Catalyzes the attachment of valine t...,,"BINDING 556; /note=""ATP""; /evidence=""ECO:000...",,,,,MKKNYNPKDIEEHLYNFWEKNGFFKPNNNLNKPAFCIMMPPPNITG...,,198804,,,,Inferred from homology,,valyl-tRNA aminoacylation [GO:0006438],aminoacyl-tRNA editing activity [GO:0002161]; ...,cytoplasm [GO:0005737]; aminoacyl-tRNA editing...,AMP [CHEBI:456215]; L-valine [CHEBI:57762]; AT...,AMP [CHEBI:456215]; L-valine [CHEBI:57762]; AT...,,CHEBI:456215; CHEBI:57762; CHEBI:30616; CHEBI:...,,,SUBCELLULAR LOCATION: Cytoplasm {ECO:0000255|H...,,,,12089438,2020-12-02,"Class-I aminoacyl-tRNA synthetase family, ValS...",198804
1,Q664P8,TAUB_YERPS,reviewed,Taurine import ATP-binding protein TauB (EC 7....,tauB YPTB3721,Yersinia pseudotuberculosis serotype I (strain...,255,CATALYTIC ACTIVITY: Reaction=ATP + H2O + tauri...,,7.6.2.7,FUNCTION: Part of the ABC transporter complex ...,,,,,,,MLNVSGLWAEYQGKPALQDVSLQIASGQLVVVLGPSGCGKTTLLNL...,,273123,,,,Inferred from homology,,,ATPase activity [GO:0016887]; ATP binding [GO:...,plasma membrane [GO:0005886]; ATP binding [GO:...,H2O [CHEBI:15377]; H(+) [CHEBI:15378]; phospha...,H2O [CHEBI:15377]; H(+) [CHEBI:15378]; phospha...,,CHEBI:15377; CHEBI:15378; CHEBI:43474; CHEBI:5...,,,SUBCELLULAR LOCATION: Cell inner membrane {ECO...,,,,15358858,2021-04-07,"ABC transporter superfamily, Taurine importer ...",273123
2,Q8E4B4,TARI_STRA3,reviewed,Ribitol-5-phosphate cytidylyltransferase (EC 2...,tarI gbs1487,Streptococcus agalactiae serotype III (strain ...,239,CATALYTIC ACTIVITY: Reaction=CTP + D-ribitol 5...,,2.7.7.40,FUNCTION: Catalyzes the transfer of the cytidy...,,,,,,PATHWAY: Cell wall biogenesis; poly(ribitol ph...,MNIGVIFAGGVGRRMNTKGKPKQFLEVHGKPIIVHTIDIFQNTEAI...,,211110,,,,Inferred from homology,,cell wall organization [GO:0071555]; isoprenoi...,D-ribitol-5-phosphate cytidylyltransferase act...,D-ribitol-5-phosphate cytidylyltransferase act...,H(+) [CHEBI:15378]; CDP-L-ribitol [CHEBI:57608...,H(+) [CHEBI:15378]; CDP-L-ribitol [CHEBI:57608...,,CHEBI:15378; CHEBI:57608; CHEBI:37563; CHEBI:3...,,,,,,,12354221,2020-12-02,"IspD/TarI cytidylyltransferase family, TarI su...",211110
3,B3CQ06,SYS_WOLPP,reviewed,Serine--tRNA ligase (EC 6.1.1.11) (Seryl-tRNA ...,serS WP0551,Wolbachia pipientis subsp. Culex pipiens (stra...,426,CATALYTIC ACTIVITY: Reaction=ATP + L-serine + ...,,6.1.1.11,FUNCTION: Catalyzes the attachment of serine t...,,"BINDING 284; /note=""Serine""; /evidence=""ECO:...",,,,PATHWAY: Aminoacyl-tRNA biosynthesis; selenocy...,MHDIEHIRKNPKGFEKAIKSRGVKEFTAKEILEIDHKKRSLTTKLQ...,,570417,,,,Inferred from homology,,selenocysteine biosynthetic process [GO:001626...,ATP binding [GO:0005524]; serine-tRNA ligase a...,cytoplasm [GO:0005737]; ATP binding [GO:000552...,AMP [CHEBI:456215]; H(+) [CHEBI:15378]; 3'-(L-...,AMP [CHEBI:456215]; H(+) [CHEBI:15378]; 3'-(L-...,,CHEBI:456215; CHEBI:15378; CHEBI:78533; CHEBI:...,,,SUBCELLULAR LOCATION: Cytoplasm {ECO:0000255|H...,,,,18550617,2020-12-02,"Class-II aminoacyl-tRNA synthetase family, Typ...",570417
4,Q83JA5,SYW_SHIFL,reviewed,Tryptophan--tRNA ligase (EC 6.1.1.2) (Tryptoph...,trpS SF3402 S4360,Shigella flexneri,334,CATALYTIC ACTIVITY: Reaction=ATP + L-tryptopha...,,6.1.1.2,FUNCTION: Catalyzes the attachment of tryptoph...,,"BINDING 135; /note=""L-tryptophan""; /evidence...",,,,,MTKPIVFSGAQPSGELTIGNYMGALRQWVNMQDDYHCIYCIVDQHA...,,623,,,,Inferred from homology,,tryptophanyl-tRNA aminoacylation [GO:0006436],ATP binding [GO:0005524]; tryptophan-tRNA liga...,cytoplasm [GO:0005737]; ATP binding [GO:000552...,AMP [CHEBI:456215]; H(+) [CHEBI:15378]; 3'-(L-...,AMP [CHEBI:456215]; H(+) [CHEBI:15378]; 3'-(L-...,,CHEBI:456215; CHEBI:15378; CHEBI:78535; CHEBI:...,,,SUBCELLULAR LOCATION: Cytoplasm {ECO:0000255|H...,,,,12384590; 12704152,2021-02-10,Class-I aminoacyl-tRNA synthetase family,623
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
334961,Q13UP6,PDXH_PARXL,reviewed,Pyridoxine/pyridoxamine 5'-phosphate oxidase (...,pdxH Bxeno_A3655 Bxe_A0741,Paraburkholderia xenovorans (strain LB400),213,CATALYTIC ACTIVITY: Reaction=H2O + O2 + pyrido...,COFACTOR: Name=FMN; Xref=ChEBI:CHEBI:58210; Ev...,1.4.3.5,FUNCTION: Catalyzes the oxidation of either py...,,"BINDING 66; /note=""Substrate""; /evidence=""EC...",,,,PATHWAY: Cofactor metabolism; pyridoxal 5'-pho...,MTSLAELRKNYSLGSLDVGDVDRNPFRQFDTWFKQAVDAQLPEPNT...,,266265,,,,Inferred from homology,,pyridoxine biosynthetic process [GO:0008615],FMN binding [GO:0010181]; pyridoxamine-phospha...,FMN binding [GO:0010181]; pyridoxamine-phospha...,H2O2 [CHEBI:16240]; H2O [CHEBI:15377]; FMN [CH...,H2O2 [CHEBI:16240]; H2O [CHEBI:15377]; O2 [CHE...,FMN [CHEBI:58210],CHEBI:16240; CHEBI:15377; CHEBI:58210; CHEBI:1...,,,,,,,17030797,2021-04-07,Pyridoxamine 5'-phosphate oxidase family,266265
334962,B0SBH6,PDXJ_LEPBA,reviewed,Pyridoxine 5'-phosphate synthase (PNP synthase...,pdxJ LBF_0458,Leptospira biflexa serovar Patoc (strain Patoc...,260,CATALYTIC ACTIVITY: Reaction=1-deoxy-D-xylulos...,,2.6.99.2,FUNCTION: Catalyzes the complicated ring closu...,,"BINDING 7; /note=""3-amino-2-oxopropyl phospha...",,"ACT_SITE 43; /note=""Proton acceptor""; /evide...",,PATHWAY: Cofactor biosynthesis; pyridoxine 5'-...,MTQLSVNVNKIATLRNSRGGSLPSVLKLSELILDSGAHGITVHPRS...,,355278,,,,Inferred from homology,,pyridoxine biosynthetic process [GO:0008615],pyridoxine 5'-phosphate synthase activity [GO:...,cytoplasm [GO:0005737]; pyridoxine 5'-phosphat...,1-deoxy-D-xylulose 5-phosphate [CHEBI:57792]; ...,1-deoxy-D-xylulose 5-phosphate [CHEBI:57792]; ...,,CHEBI:57792; CHEBI:15377; CHEBI:15378; CHEBI:4...,,,SUBCELLULAR LOCATION: Cytoplasm {ECO:0000255|H...,,,,18270594,2021-04-07,PNP synthase family,355278
334964,C3L855,PEPT_BACAC,reviewed,Peptidase T (EC 3.4.11.4) (Aminotripeptidase) ...,pepT BAMEG_0759,Bacillus anthracis (strain CDC 684 / NRRL 3495),410,CATALYTIC ACTIVITY: Reaction=Release of the N-...,COFACTOR: Name=Zn(2+); Xref=ChEBI:CHEBI:29105;...,3.4.11.4,FUNCTION: Cleaves the N-terminal amino acid of...,,,,"ACT_SITE 81; /evidence=""ECO:0000255|HAMAP-Rul...",,,MKEELIERFTRYVKIDTQSNEDSHTVPTTPGQIEFGKLLVEELKEV...,,568206,,"METAL 79; /note=""Zinc 1""; /evidence=""ECO:000...",,Inferred from homology,,peptide catabolic process [GO:0043171],metallopeptidase activity [GO:0008237]; tripep...,cytoplasm [GO:0005737]; metallopeptidase activ...,Zn(2+) [CHEBI:29105],,Zn(2+) [CHEBI:29105],CHEBI:29105,,,SUBCELLULAR LOCATION: Cytoplasm {ECO:0000255|H...,,,,,2021-04-07,Peptidase M20B family,568206
334965,B7HED5,PGK_BACC4,reviewed,Phosphoglycerate kinase (EC 2.7.2.3),pgk BCB4264_A5252,Bacillus cereus (strain B4264),394,CATALYTIC ACTIVITY: Reaction=(2R)-3-phosphogly...,,2.7.2.3,,,"BINDING 36; /note=""Substrate""; /evidence=""EC...",,,,PATHWAY: Carbohydrate degradation; glycolysis;...,MNKKSIRDVDLKGKRVFCRVDFNVPMKEGKITDETRIRAALPTIQY...,,405532,,,,Inferred from homology,,glycolytic process [GO:0006096],ATP binding [GO:0005524]; phosphoglycerate kin...,cytoplasm [GO:0005737]; ATP binding [GO:000552...,(2R)-3-phosphoglycerate [CHEBI:58272]; (2R)-3-...,(2R)-3-phosphoglycerate [CHEBI:58272]; (2R)-3-...,,CHEBI:58272; CHEBI:57604; CHEBI:30616; CHEBI:4...,,,SUBCELLULAR LOCATION: Cytoplasm {ECO:0000255|H...,,,,,2021-02-10,Phosphoglycerate kinase family,405532


In [7]:
df_bac_prot['ChEBI IDs'][0]

'CHEBI:456215; CHEBI:57762; CHEBI:30616; CHEBI:78537; CHEBI:78442; CHEBI:33019'

## FDA Drug Shortage Data Prep
Load FDA drug shortages:

In [8]:
#pd.set_option('display.max_rows', None)
with open("/Users/camillomoschner/Desktop/21_HackMed/drug_shortages.p", "rb") as f:
    drug_shortage_df = pickle.load(f)
#drug_shortage_df

Clean the scraped dataset by dropping entries without a CAS number and converting the ChEBI IDs to integers:

In [9]:
# .copy() gives an independent DataFrame, which avoids pandas' SettingWithCopyWarning on the assignment below
proc_drug_shortage_df = drug_shortage_df.loc[drug_shortage_df['CAS_number'] != 'N/A'].copy()
proc_drug_shortage_df['ChEBI'] = [ getNumbers(i) for i in proc_drug_shortage_df['ChEBI'] ]
proc_drug_shortage_df.reset_index(inplace=True, drop=True)
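
Assigning a column on a filtered slice is what can trigger pandas' `SettingWithCopyWarning`. The following is a minimal sketch of the same clean-up step with an explicit copy, run on a toy table; the drug names, CAS numbers and ChEBI strings are made up for illustration and are not taken from the scraped data.

```python
import re
import pandas as pd

# Toy stand-in for the scraped shortage table (all values are hypothetical)
toy_shortages = pd.DataFrame({
    "drug": ["drug_a", "drug_b", "drug_c"],
    "CAS_number": ["111-11-1", "N/A", "222-22-2"],
    "ChEBI": ["CHEBI:17823", "CHEBI:12345", "CHEBI:28262"],
})

# .copy() makes the filtered frame independent of toy_shortages,
# so the column assignment below is unambiguous
filtered = toy_shortages.loc[toy_shortages["CAS_number"] != "N/A"].copy()
filtered["ChEBI"] = [int(re.findall(r"[0-9]+", s)[0]) for s in filtered["ChEBI"]]
filtered = filtered.reset_index(drop=True)

print(filtered)
```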


## Test Execution
Test whether FDA Drug Shortage ChEBIs take part in the bacterial reactome. As a first step, build every pairwise combination of the ChEBIs associated with each enzyme:

In [10]:
ChEBI_graph = []
for x in range(len(proc_df_bac_prot)):
    chebi_ids = [int(re.findall(r'[0-9]+', s)[0]) for s in proc_df_bac_prot["ChEBI IDs"].iloc[x].split(";")]
    ChEBI_graph.append(chebi_ids)
adjacency = []
enzyme_name = []
for x in range(len(ChEBI_graph)):
    # Since we know which ChEBIs are involved in each enzymatic reaction but not their order, we have to ensure
    #  that we consider all combinations of molecules acting together, using itertools' combinations:
    mini_list = list(combinations(ChEBI_graph[x],r = 2))
    adjacency.extend(mini_list)
    for _ in range(len(mini_list)):
        enzyme_name.append(proc_df_bac_prot["Entry"].iloc[x])

In [11]:
adjacency_list = pd.DataFrame()
adjacency_list["reac_1"] = np.array(adjacency)[:,0]
adjacency_list["reac_2"] = np.array(adjacency)[:,1]
adjacency_list["enzyme"] = enzyme_name

Visually inspect the reactants and the enzyme they are associated with (NB. we still don't know what their association is, i.e. which ones are reactants and which ones are products).

In [12]:
adjacency_list

Unnamed: 0,reac_1,reac_2,enzyme
0,456215,57762,Q8K9I1
1,456215,30616,Q8K9I1
2,456215,78537,Q8K9I1
3,456215,78442,Q8K9I1
4,456215,33019,Q8K9I1
...,...,...,...
2649529,58272,30616,B7HED5
2649530,58272,456216,B7HED5
2649531,57604,30616,B7HED5
2649532,57604,456216,B7HED5
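
To make the pairing step concrete, here is a small sketch on toy data. The enzyme names and ChEBI numbers are hypothetical; only `itertools.combinations` and the column names `reac_1`, `reac_2` and `enzyme` come from the notebook.

```python
from itertools import combinations
import pandas as pd

# Toy input (hypothetical): enzyme accession -> ChEBI IDs it is annotated with
toy_enzymes = {
    "ENZ_A": [1, 2, 3],
    "ENZ_B": [2, 4],
}

rows = []
for enzyme, chebi_ids in toy_enzymes.items():
    # Every unordered pair of compounds annotated to the same enzyme
    for reac_1, reac_2 in combinations(chebi_ids, r=2):
        rows.append((reac_1, reac_2, enzyme))

toy_adjacency = pd.DataFrame(rows, columns=["reac_1", "reac_2", "enzyme"])
print(toy_adjacency)
```

An enzyme with n annotated ChEBIs contributes n(n-1)/2 rows, so the first entry above (Q8K9I1, six ChEBI IDs) accounts for 15 pairs. This is why the adjacency list is so much longer than the protein table.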


## Hit-ChEBI Reassociation
We have now discovered chemical structures that are in both the FDA Drug Shortage list and the list of all known bacterial reactions.

However, these include *all* ChEBIs that the enzyme in question is associated with. To truly identify our hits we need to reassociate our "drug_ChEBI" with the enzyme we discovered is involved with it.

In [13]:
hit_frames = []
drug_ChEBI = []
for ChEBI_id in proc_drug_shortage_df['ChEBI']:
    # Check both columns, since the drug may sit on either side of a pair
    for column in ("reac_1", "reac_2"):
        hit_df = adjacency_list[adjacency_list[column] == ChEBI_id]
        if len(hit_df) > 0:
            hit_frames.append(hit_df)
            drug_ChEBI.extend([ChEBI_id] * len(hit_df))

# DataFrame.append was removed in pandas 2.0, so the collected frames are concatenated instead
hits = pd.concat(hit_frames) if hit_frames else pd.DataFrame(columns=adjacency_list.columns)
hits["drug_ChEBI"] = drug_ChEBI

Visualise all your potential chemical interactions including the identified hits/drug_ChEBIs:

In [14]:
hits

Unnamed: 0,reac_1,reac_2,enzyme,drug_ChEBI
2192373,15377,17823,P18326,17823
2192380,15378,17823,P18326,17823
2192386,15379,17823,P18326,17823
2192391,33737,17823,P18326,17823
2192395,33738,17823,P18326,17823
...,...,...,...,...
1673545,15378,28262,Q57366,28262
1673553,58389,28262,Q57366,28262
1673560,16374,28262,Q57366,28262
1970480,15377,28262,Q8GPG4,28262
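
The same reassociation can also be sketched without looping over every drug, using `Series.isin` once per column. This is an alternative formulation, not the notebook's method. The toy table reuses three pairs from the outputs above plus one made-up row, and it assumes the drug list contains no duplicate ChEBI IDs.

```python
import pandas as pd

# Toy stand-in for adjacency_list: three pairs from the outputs above, last row made up
toy_adjacency = pd.DataFrame({
    "reac_1": [15377, 15378, 58272, 17823],
    "reac_2": [17823, 28262, 30616, 15379],
    "enzyme": ["P18326", "Q57366", "B7HED5", "P18326"],
})
drug_chebis = [17823, 28262]  # assumed to contain no duplicates

parts = []
for column in ("reac_1", "reac_2"):
    # Keep the pairs whose compound in this column is a shortage drug
    part = toy_adjacency[toy_adjacency[column].isin(drug_chebis)].copy()
    # Record which member of the pair was the drug
    part["drug_ChEBI"] = part[column]
    parts.append(part)

toy_hits = pd.concat(parts)
print(toy_hits)
```

The rows come out grouped by column rather than by drug, so the order differs from the loop-based version even though the same pairs are selected.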


This now allows us to identify how many true hits we found using:

In [15]:
len(hits['drug_ChEBI'].unique())

5

(at least as far as we know so far)
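
Beyond the overall count, it may help to see how many distinct enzymes are linked to each drug. The sketch below does this with `groupby` on a toy copy of three rows from the `hits` output above; the variable names are hypothetical.

```python
import pandas as pd

# Three rows copied from the hits table shown above
toy_hits = pd.DataFrame({
    "reac_1": [15377, 15378, 15377],
    "reac_2": [17823, 17823, 28262],
    "enzyme": ["P18326", "P18326", "Q8GPG4"],
    "drug_ChEBI": [17823, 17823, 28262],
})

# Number of distinct drug ChEBIs (same idea as len(...unique()))
n_drugs = toy_hits["drug_ChEBI"].nunique()

# Number of distinct enzymes associated with each drug ChEBI
enzymes_per_drug = toy_hits.groupby("drug_ChEBI")["enzyme"].nunique()

print(n_drugs)
print(enzymes_per_drug)
```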

## Data Backup

In [16]:
with open("/Users/camillomoschner/Documents/GitHub/react2drug/drug_hit_ChEBIs.p", "wb") as f:
    pickle.dump(hits, f)