# Common Enzymes Among Two Diseases

The main goal of this project is to find the enzymes that are common to the bibliographic data of any two query terms. The terms can be two distinct or similar oncogenic diseases, viral or bacterial pathogens, or any other biomedical terms. As long as scientific literature exists for both query terms, this code can fetch their bibliographic and citation data.

Below, I use two diseases with different origins, **Cancer** and **SARS CoV 2**, as example query terms. We first fetch the bibliographic data from the Entrez database of NCBI. We then process the records to extract only the enzymes. Using these enzymes as nodes, we construct a network graph. The layout of the graph is shown below.

**FAQs**
1. Why bibliographic data?
- Frankly, I don't know of another method or database that provides this information. If I find a better way in the future, I'll update the code.

2. Why bother to find common enzymes among two diseases?
- I am curious about what two distinct things or topics have in common. I wanted a custom script that returns the list of enzymes shared between two scientific terms or phenomena.
  
3. Does every article's citation and bibliographic data include a list of enzymes?
- Not necessarily. Enzymes are listed only if they are mentioned in the article and their identifiers are included in its MeSH record.

In [1]:
! pip install -q biopython

! pip install -q pyvis


In [2]:
# Fetching PubMed article metadata
from Bio import Entrez, Medline

# Graph creation and visualisation
from pyvis.network import Network

import time
import os
from functools import reduce
import numpy as np
import pandas as pd

In [None]:
# Record the start time to measure the total runtime
start_time = time.time()

In [18]:
def esearch(query_term):
  """Returns the PMID(s) matching the given query term (at most 10,000)"""

  # Contact email sent along with every Entrez request
  Entrez.email = 'akishirsath@gmail.com'

  handle = Entrez.esearch(db="pubmed", term=query_term, retmax="10000")

  records = Entrez.read(handle)
  handle.close()

  return records["IdList"]

In [19]:
def efetch(pmids):
    """Returns the MEDLINE/PubMed records for a comma-separated string of PMID(s)"""
    
    Entrez.email = 'akishirsath@gmail.com'

    handle = Entrez.efetch(db="pubmed", 
                           id=pmids, 
                           rettype="medline", 
                           retmode="text")

    # Medline.parse is lazy, so read every record before closing the handle
    records = list(Medline.parse(handle))
    handle.close()
    
    return records

In [20]:
first_query_term = 'Cancer'
second_query_term = 'Covid'

first_pmids = esearch(first_query_term)

second_pmids = esearch(second_query_term)

In [21]:
len(first_pmids), len(second_pmids)

(10000, 10000)

In [22]:
first_topic_records = efetch(",".join(first_pmids))

time.sleep(10)

second_topic_records = efetch(",".join(second_pmids))
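
The cell above sends all 10,000 PMIDs in a single request per topic. A possible alternative, sketched below, is to fetch them in smaller batches with a short pause in between. The helper names `chunked` and `fetch_in_batches`, the batch size of 500, and the `pause` parameter are assumptions made for illustration and are not part of the original notebook. The `fetch` argument is any callable that takes a comma-separated ID string and returns a list of records, such as the `efetch` function defined earlier.

```python
import time
from typing import Callable, Iterator, List, Sequence


def chunked(ids: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `ids`, each holding at most `size` items."""
    if size < 1:
        raise ValueError("size must be a positive integer")
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


def fetch_in_batches(pmids: Sequence[str],
                     fetch: Callable[[str], list],
                     batch_size: int = 500,
                     pause: float = 0.0) -> list:
    """Call `fetch` once per batch of IDs and concatenate the results.

    `batch_size` and `pause` are illustrative defaults (assumptions),
    not values recommended by the original notebook.
    """
    records = []
    for batch in chunked(pmids, batch_size):
        records.extend(fetch(",".join(batch)))
        if pause > 0:
            time.sleep(pause)  # wait between consecutive requests
    return records
```

With the functions from this notebook, a call could look like `first_topic_records = fetch_in_batches(first_pmids, efetch, batch_size=500, pause=1)`.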

## Network-Graph Method

In [None]:
colors = {
    'backgrd' : '#f1f2f6',    # Background color
    'font' : '#2f3542',       # Text font color
    'first_prim' : '#6F1E51', # Article nodes color (first)
    'second_prim' : '#1B1464',# Article nodes color (second)
    'first_sec' : '#ED4C67',  # Enzyme nodes color (first)
    'second_sec' : '#0652DD'  # Enzyme nodes color (second)
}

In [None]:
N = Network(height='750px', 
            width='100%', 
            bgcolor=colors['backgrd'], 
            font_color=colors['font'], 
            notebook=True)

In [None]:
N.set_options("""
var options = {
  "edges": {
    "arrows": {
      "to": {
        "enabled": true,
        "scaleFactor": 0.5
      }
    },
    "color": {
      "inherit": true
    },
    "smooth": {
      "forceDirection": "none"
    }
  },
  "physics": {
    "barnesHut": {
      "gravitationalConstant": -17350,
      "springLength": 210,
      "springConstant": 0.055,
      "avoidOverlap": 0.53
    },
    "minVelocity": 0.75
  }
}
""")

In [None]:
cancer_enzymes = dict()

for record in first_topic_records:
  substances = record.get('RN')
  pmid = record.get('PMID')

  if isinstance(substances, list):
    for molecule in substances:
      # Enzyme entries start with 'EC ' followed by the EC number; the
      # trailing space avoids matching other registry codes that begin with 'EC'
      if molecule.startswith('EC '):

        # Primary PMID node
        N.add_node(pmid.strip(), size=30, color=colors['first_prim'])

        # Secondary Enzyme node
        N.add_node(molecule, size=15, color=colors['first_sec'])
        N.add_edge(pmid.strip(), molecule)


In [None]:
for record in second_topic_records:
  substances = record.get('RN')
  pmid = record.get('PMID')
  if isinstance(substances, list):
    for molecule in substances:
      # Enzyme entries start with 'EC ' followed by the EC number; the
      # trailing space avoids matching other registry codes that begin with 'EC'
      if molecule.startswith('EC '):

        # Primary PMID node
        N.add_node(pmid.strip(), size=30, color=colors['second_prim'])

        # Secondary Enzyme node
        N.add_node(molecule, size=15, color=colors['second_sec'])
        N.add_edge(pmid.strip(), molecule)
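
Both loops use the enzyme string as the node ID, so an enzyme found under both topics is a single node and cannot carry both enzyme colors. The sketch below shows one way to decide a color per enzyme up front, so that shared enzymes stand out. The function names `enzyme_names` and `enzyme_colors` and the third color `#009432` are hypothetical and not part of the original notebook. Records are assumed to be dict-like with an optional `RN` list, as returned by `Medline.parse`.

```python
def enzyme_names(records):
    """Collect the distinct 'EC ...' entries from the RN field of each record."""
    names = set()
    for record in records:
        substances = record.get('RN')
        if isinstance(substances, list):
            for molecule in substances:
                if molecule.startswith('EC '):
                    names.add(molecule)
    return names


def enzyme_colors(first_records, second_records,
                  first_color, second_color, shared_color):
    """Map every enzyme name to a color based on which topics mention it."""
    first = enzyme_names(first_records)
    second = enzyme_names(second_records)
    palette = {}
    for name in first | second:
        if name in first and name in second:
            palette[name] = shared_color   # mentioned under both topics
        elif name in first:
            palette[name] = first_color    # first topic only
        else:
            palette[name] = second_color   # second topic only
    return palette
```

For example, `palette = enzyme_colors(first_topic_records, second_topic_records, colors['first_sec'], colors['second_sec'], '#009432')` builds the mapping, and each enzyme node could then be added with `color=palette[molecule]`.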

In [None]:
N.show('common_enzymes_net_graph_viz.html')

In [None]:
end_time = time.time()

In [None]:
print("--- %s seconds ---" % (end_time - start_time))

--- 179.98954319953918 seconds ---


## Tabular Method

In [23]:
def records_to_enzymes(records, name):
  '''Collect (PMID, enzyme) pairs from the records into a
  pandas DataFrame labelled with the given disease name'''
  
  enzymes = list()
  for record in records:
    substances = record.get('RN')
    pmid = record.get('PMID')
    if isinstance(substances, list):
      for molecule in substances:
        # Keep only enzyme entries, which start with 'EC ' and the EC number
        if molecule.startswith('EC '):
          enzymes.append((pmid, molecule))

  enzymes_df = pd.DataFrame(enzymes, columns=['PMID', 'Enzyme'])

  # Same label on every row
  enzymes_df['Disease'] = name

  return enzymes_df
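
Each entry in the `Enzyme` column combines an EC number with a name, for example `EC 2.7.11.1 (TOR Serine-Threonine Kinases)`. Should it be useful to compare enzymes by EC number rather than by the full string, the entry can be split as sketched below. The pattern `EC_PATTERN` and the helper `split_enzyme` are assumptions for illustration. They are based only on the entry shapes visible in this notebook's output, not on a formal specification of the field.

```python
import re

# 'EC', whitespace, one to four dot-separated levels (digits or '-'),
# then an optional name in parentheses. Derived from the sample output only.
EC_PATTERN = re.compile(r'^EC\s+(\d+(?:\.(?:\d+|-)){0,3})\s*(?:\((.*)\))?\s*$')


def split_enzyme(entry):
    """Return (ec_number, name) for an 'EC ...' entry, or None if it does not match."""
    match = EC_PATTERN.match(entry.strip())
    if match is None:
        return None
    number, name = match.groups()
    return number, (name or '')
```

Applied to the table, `enzymes_df['Enzyme'].map(split_enzyme)` would yield one tuple per row (or `None` for entries the pattern does not cover).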

In [24]:
cancer_enzymes_df = records_to_enzymes(first_topic_records, 'Cancer')

In [25]:
cancer_enzymes_df

| | PMID | Enzyme | Disease |
|---|---|---|---|
| 0 | 35416174 | EC 1.1.1.30 (BDH2 protein, human) | Cancer |
| 1 | 35416174 | EC 1.1.1.30 (Hydroxybutyrate Dehydrogenase) | Cancer |
| 2 | 35416174 | EC 2.7.1.1 (MTOR protein, human) | Cancer |
| 3 | 35416174 | EC 2.7.11.1 (Proto-Oncogene Proteins c-akt) | Cancer |
| 4 | 35416174 | EC 2.7.11.1 (TOR Serine-Threonine Kinases) | Cancer |
| ... | ... | ... | ... |
| 811 | 35347031 | EC 2.7.7.49 (Telomerase) | Cancer |
| 812 | 35347013 | EC 2.7.11.1 (Protein Serine-Threonine Kinases) | Cancer |
| 813 | 35347013 | EC 2.7.11.1 (SRPK2 protein, human) | Cancer |
| 814 | 35347013 | EC 3.1.3.48 (PTPRZ1 protein, human) | Cancer |


In [26]:
covid_enzymes_df = records_to_enzymes(second_topic_records, 'SARS CoV2')

In [27]:
covid_enzymes_df

| | PMID | Enzyme | Disease |
|---|---|---|---|
| 0 | 35414771 | EC 3.4.17.23 (Angiotensin-Converting Enzyme 2) | SARS CoV2 |
| 1 | 35414771 | EC 3.4.21.75 (Furin) | SARS CoV2 |
| 2 | 35414393 | EC 3.4.17.23 (Angiotensin-Converting Enzyme 2) | SARS CoV2 |
| 3 | 35412852 | EC 2.7.7.- (Nucleotidyltransferases) | SARS CoV2 |
| 4 | 35409421 | EC 3.4.17.23 (Angiotensin-Converting Enzyme 2) | SARS CoV2 |
| ... | ... | ... | ... |
| 152 | 35289719 | EC 3.4.17.23 (Angiotensin-Converting Enzyme 2) | SARS CoV2 |
| 153 | 35283406 | EC 3.4.15.1 (Peptidyl-Dipeptidase A) | SARS CoV2 |
| 154 | 35283406 | EC 3.4.17.23 (Angiotensin-Converting Enzyme 2) | SARS CoV2 |
| 155 | 35283406 | EC 3.4.21.- (Serine Endopeptidases) | SARS CoV2 |


In [28]:
# https://stackoverflow.com/questions/46556169/finding-common-elements-between-multiple-dataframe-columns
# Unique enzyme strings that occur in both tables
common_enzymes = reduce(np.intersect1d, [cancer_enzymes_df.Enzyme, covid_enzymes_df.Enzyme])
len(common_enzymes)

31

In [29]:
combine_df = pd.concat([cancer_enzymes_df, covid_enzymes_df])

In [30]:
# Keep every (PMID, enzyme) row whose enzyme is shared by both topics
common_df = combine_df[combine_df['Enzyme'].isin(common_enzymes)]
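
The filtered frame keeps one row per (PMID, enzyme) pair, so listing its `Enzyme` column repeats each shared enzyme once for every article that mentions it. A compact alternative is sketched below: it returns each shared enzyme once, together with the number of distinct articles per topic. The helper `common_enzyme_counts` is hypothetical and uses only the standard library. It expects plain `(pmid, enzyme)` tuples.

```python
from collections import defaultdict


def common_enzyme_counts(first_pairs, second_pairs):
    """Return sorted (enzyme, n_first, n_second) tuples for enzymes found in both inputs.

    Each input is an iterable of (pmid, enzyme) tuples; the counts are the
    numbers of distinct PMIDs that mention the enzyme under each topic.
    """
    def index(pairs):
        by_enzyme = defaultdict(set)
        for pmid, enzyme in pairs:
            by_enzyme[enzyme].add(pmid)
        return by_enzyme

    first, second = index(first_pairs), index(second_pairs)
    shared = sorted(first.keys() & second.keys())
    return [(enzyme, len(first[enzyme]), len(second[enzyme])) for enzyme in shared]
```

The tuples can be taken from the tables above, for example with `cancer_enzymes_df[['PMID', 'Enzyme']].itertuples(index=False, name=None)` and the same call on `covid_enzymes_df`.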

In [32]:
for item in common_df.Enzyme:
  print(item)

EC 2.7.1.1 (MTOR protein, human)
EC 2.7.11.1 (TOR Serine-Threonine Kinases)
EC 2.7.7.- (Nucleotidyltransferases)
EC 2.7.11.31 (AMP-Activated Protein Kinases)
EC 2.3.2.27 (Ubiquitin-Protein Ligases)
EC 3.4.17.23 (Angiotensin-Converting Enzyme 2)
EC 3.4.21.75 (Furin)
EC 2.7.1.1 (MTOR protein, human)
EC 2.7.11.1 (TOR Serine-Threonine Kinases)
EC 2.7.1.1 (MTOR protein, human)
EC 2.7.11.1 (TOR Serine-Threonine Kinases)
EC 2.7.11.1 (TOR Serine-Threonine Kinases)
EC 2.3.2.27 (Ubiquitin-Protein Ligases)
EC 3.4.25.1 (Proteasome Endopeptidase Complex)
EC 2.3.2.27 (Ubiquitin-Protein Ligases)
EC 3.4.21.- (Serine Endopeptidases)
EC 2.3.2.27 (Ubiquitin-Protein Ligases)
EC 3.4.25.1 (Proteasome Endopeptidase Complex)
EC 2.3.2.27 (Ubiquitin-Protein Ligases)
EC 3.4.14.5 (Dipeptidyl Peptidase 4)
EC 2.7.7.- (Nucleotidyltransferases)
EC 3.4.25.1 (Proteasome Endopeptidase Complex)
EC 2.3.2.27 (Ubiquitin-Protein Ligases)
EC 3.4.22.- (Cysteine Endopeptidases)
EC 3.4.21.75 (Furin)
EC 2.7.1.1 (MTOR protein, hum