# Common Enzymes Among Two Diseases

The main goal of this project is to find enzymes common to the bibliographic data of any two query terms. The search terms can be two distinct or similar oncogenic diseases, viral or bacterial pathogens, or any other biomedical terms. As long as scientific literature exists for both query terms, this code can fetch their bibliographic and citation data.

Below, I use two diseases with different origins, **Cancer** and **SARS CoV 2**, as example query terms. We first fetch the bibliographic data from NCBI's Entrez database (PubMed). We then process the data to extract only the enzymes. Using these enzymes as nodes, we construct a network graph. The layout of the graph is shown below.

**FAQs**
1. Why bibliographic data?
- Frankly, I don't know of another method or database for fetching this information. If I find a better way in the future, I'll update the code.

2. Why bother to find common enzymes between two diseases?
- I'm curious about what two distinct things or topics have in common. I wanted to create a custom script that returns a list of enzymes common to two scientific terms or phenomena.

3. Does every article's citation and bibliographic data include a list of enzymes?
- Not necessarily. Enzymes are listed only if they are mentioned in the article and their identifiers are included in the MeSH record.
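
To make the last answer concrete, here is a minimal sketch of how enzyme entries can be picked out of a single record. It assumes the record is a dict shaped like those yielded by `Bio.Medline.parse`, with substances under the `RN` key. The record below is illustrative rather than fetched from PubMed, and `enzyme_entries` is a hypothetical helper name.

```python
# Illustrative record; the PMID and substance list are made up for this sketch.
record = {
    "PMID": "00000000",
    "RN": [
        "0 (Antineoplastic Agents)",
        "EC 2.7.7.49 (Telomerase)",
        "EC 2.7.- (Protein Kinases)",
    ],
}


def enzyme_entries(record):
    """Return the RN entries that start with an EC (Enzyme Commission) prefix."""
    # Records without any listed substances have no 'RN' key at all.
    substances = record.get("RN") or []
    # The trailing space is a cautious choice: it skips entries that merely
    # begin with the letters 'EC' without being an EC number.
    return [entry for entry in substances if entry.startswith("EC ")]


print(enzyme_entries(record))
print(enzyme_entries({"PMID": "00000001"}))
```

A record with no `RN` field simply yields an empty list, which is why not every article contributes enzymes.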

In [1]:
! pip install -q biopython

! pip install -q pyvis


In [40]:
# Fetching PubMed article metadata
from Bio import Entrez, Medline

# Graph creation and visualisation
from pyvis.network import Network

import time
import os
from functools import reduce
import numpy as np
import pandas as pd

In [3]:
# Mapping the time
start_time = time.time()

In [4]:
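def process_pmid_txt(text_file_path):
  """Read one PMID per line from a text file, skipping blank lines"""

  pmids = list()

  with open(text_file_path, "r") as f:
    for line in f:
      pmid = line.strip()
      if pmid:  # a trailing newline would otherwise add an empty PMID
        pmids.append(pmid)

  return pmids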

In [5]:
def efetch(pmids):
    """Returns MEDLINE/pubmed record associated with the PMID(s)"""
    
    Entrez.email = 'akishirsath@gmail.com'

    handle = Entrez.efetch(db="pubmed", 
                           id=pmids, 
                           rettype="medline", 
                           retmode="text")

    # Medline.parse is lazy, so read all records before closing the handle
    records = list(Medline.parse(handle))
    handle.close()

    return records

In [6]:
first_file = "/content/drive/MyDrive/05-Data/PubMed-Common-Enzymes/pmid-Cancer-set.txt"

second_file = "/content/drive/MyDrive/05-Data/PubMed-Common-Enzymes/pmid-sarscov-2-set.txt"

first_pmids = process_pmid_txt(first_file)

second_pmids = process_pmid_txt(second_file)

first_topic_records = efetch(",".join(first_pmids))

time.sleep(10)

second_topic_records = efetch(",".join(second_pmids))
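
The cell above sends each PMID list in a single request. A minimal sketch of a batched alternative follows, assuming one would rather fetch long lists in smaller pieces with a pause in between. `chunked`, `fetch_in_batches`, the batch size of 200 and the `pause` value are all hypothetical choices, not part of the notebook. The `fetch` argument stands in for a function such as `efetch`, and a stub is used here so the sketch runs without network access.

```python
import time


def chunked(items, size):
    """Yield consecutive slices of `items` holding at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_in_batches(pmids, fetch, batch_size=200, pause=1.0):
    """Call `fetch` on comma-joined batches of PMIDs and merge the results."""
    records = []
    batches = list(chunked(pmids, batch_size))
    for index, batch in enumerate(batches):
        records.extend(fetch(",".join(batch)))
        # Pause between requests, but not after the final one.
        if index < len(batches) - 1:
            time.sleep(pause)
    return records


# Stub that mimics the shape of the real result without contacting NCBI.
def fake_fetch(id_string):
    return [{"PMID": pmid} for pmid in id_string.split(",")]


demo = fetch_in_batches(["1", "2", "3", "4", "5"], fake_fetch, batch_size=2, pause=0)
print(len(demo))
```

In the notebook, `fetch_in_batches(first_pmids, efetch)` would be the equivalent call, under the same assumptions.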

## Network-Graph Method

In [7]:
colors = {
    'backgrd' : '#f1f2f6',    # Background color
    'font' : '#2f3542',       # Text font color
    'first_prim' : '#6F1E51', # Article nodes color (first)
    'second_prim' : '#1B1464',# Article nodes color (second)
    'first_sec' : '#ED4C67',  # Enzyme nodes color (first)
    'second_sec' : '#0652DD'  # Enzyme nodes color (second)
}

In [8]:
N = Network(height='750px', 
            width='100%', 
            bgcolor=colors['backgrd'], 
            font_color=colors['font'], 
            notebook=True)

In [9]:
N.set_options("""
var options = {
  "edges": {
    "arrows": {
      "to": {
        "enabled": true,
        "scaleFactor": 0.5
      }
    },
    "color": {
      "inherit": true
    },
    "smooth": {
      "forceDirection": "none"
    }
  },
  "physics": {
    "barnesHut": {
      "gravitationalConstant": -17350,
      "springLength": 210,
      "springConstant": 0.055,
      "avoidOverlap": 0.53
    },
    "minVelocity": 0.75
  }
}
""")

In [10]:
for record in first_topic_records:
  substances = record.get('RN')
  pmid = record.get('PMID')

  if isinstance(substances, list):
    for molecule in substances:
      if molecule.startswith('EC'):

        # Primary PMID node
        N.add_node(pmid.strip(), size=30, color=colors['first_prim'])

        # Secondary Enzyme node
        N.add_node(molecule, size=15, color=colors['first_sec'])
        N.add_edge(pmid.strip(), molecule)


In [12]:
for record in second_topic_records:
  substances = record.get('RN')
  pmid = record.get('PMID')
  if isinstance(substances, list):
    for molecule in substances:
      if molecule.startswith('EC'):

        # Primary PMID node
        N.add_node(pmid.strip(), size=30, color=colors['second_prim'])

        # Secondary Enzyme node
        N.add_node(molecule, size=15, color=colors['second_sec'])
        N.add_edge(pmid.strip(), molecule)
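
Both loops use the enzyme string itself as the node id, so an enzyme that occurs in both record sets shares one id across the two passes. A minimal sketch, assuming one wants to know in advance which enzymes are shared (for instance to give them a third colour), is to classify each enzyme by origin before building the graph. `enzymes_by_origin` and the labels `first`, `second` and `both` are hypothetical names, and the records below are illustrative dicts shaped like `Medline.parse` output.

```python
def enzymes_by_origin(first_records, second_records):
    """Map each enzyme entry to 'first', 'second' or 'both'."""

    def enzymes(records):
        found = set()
        for record in records:
            # Records without substances have no 'RN' key.
            for substance in record.get("RN") or []:
                if substance.startswith("EC "):
                    found.add(substance)
        return found

    first = enzymes(first_records)
    second = enzymes(second_records)

    origin = {}
    for enzyme in first | second:
        if enzyme in first and enzyme in second:
            origin[enzyme] = "both"
        elif enzyme in first:
            origin[enzyme] = "first"
        else:
            origin[enzyme] = "second"
    return origin


# Illustrative records, not fetched from PubMed.
first_demo = [{"PMID": "1", "RN": ["EC 3.4.15.1 (Peptidyl-Dipeptidase A)",
                                   "EC 2.7.7.49 (Telomerase)"]}]
second_demo = [{"PMID": "2", "RN": ["EC 3.4.15.1 (Peptidyl-Dipeptidase A)",
                                    "EC 3.1.26.5 (Ribonuclease P)"]},
               {"PMID": "3"}]

for enzyme, label in sorted(enzymes_by_origin(first_demo, second_demo).items()):
    print(label, enzyme)
```

The resulting label could then select the node colour, for example through an extra entry in the `colors` dict, which the notebook does not currently define.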

In [13]:
N.show('common_enzymes_net_graph_viz.html')

In [14]:
end_time = time.time()

In [15]:
print("--- %s seconds ---" % (end_time - start_time))

--- 179.98954319953918 seconds ---


## Tabular Method

In [32]:
def records_to_enzymes(records, name):
  '''Extract enzymes along with their respective PMIDs
  into a pandas DataFrame labelled with the disease name'''

  enzymes = list()
  for record in records:
    substances = record.get('RN')
    pmid = record.get('PMID')
    if isinstance(substances, list):
      for molecule in substances:
        if molecule.startswith('EC'):
          enzymes.append((pmid, molecule))

  enzymes_df = pd.DataFrame(enzymes, columns=['PMID', 'Enzyme'])

  # Assigning a scalar broadcasts the label to every row
  enzymes_df['Disease'] = name

  return enzymes_df

In [33]:
cancer_enzymes_df = records_to_enzymes(first_topic_records, 'Cancer')

In [34]:
cancer_enzymes_df

| | PMID | Enzyme | Disease |
|---|---|---|---|
| 0 | 19081671 | EC 2.7.- (Protein Kinases) | Cancer |
| 1 | 31081789 | EC 4.2.1.2 (Fumarate Hydratase) | Cancer |
| 2 | 27582428 | EC 2.7.1.- (Phosphatidylinositol 3-Kinases) | Cancer |
| 3 | 11280022 | EC 2.7.7.49 (Telomerase) | Cancer |
| 4 | 23142414 | EC 2.3.2.27 (BRAP protein, human) | Cancer |
| ... | ... | ... | ... |
| 1252 | 23222297 | EC 2.7.11.1 (Proto-Oncogene Proteins B-raf) | Cancer |
| 1253 | 21825998 | EC 2.7.7.6 (RNA Polymerase III) | Cancer |
| 1254 | 11201683 | EC 3.4.24.- (Pregnancy-Associated Plasma Prote... | Cancer |
| 1255 | 20976540 | EC 2.7.10.1 (Receptor, IGF Type 1) | Cancer |


In [35]:
covid_enzymes_df = records_to_enzymes(second_topic_records, 'SARS CoV2')

In [36]:
covid_enzymes_df

| | PMID | Enzyme | Disease |
|---|---|---|---|
| 0 | 33293238 | EC 3.4.17.23 (ACE2 protein, human) | SARS CoV2 |
| 1 | 33293238 | EC 3.4.17.23 (Angiotensin-Converting Enzyme 2) | SARS CoV2 |
| 2 | 34611326 | EC 3.4.15.1 (Peptidyl-Dipeptidase A) | SARS CoV2 |
| 3 | 33103998 | EC 3.4.17.23 (ACE2 protein, human) | SARS CoV2 |
| 4 | 33103998 | EC 3.4.17.23 (Angiotensin-Converting Enzyme 2) | SARS CoV2 |
| ... | ... | ... | ... |
| 2198 | 34092976 | EC 3.4.15.1 (Peptidyl-Dipeptidase A) | SARS CoV2 |
| 2199 | 35181721 | EC 3.1.26.5 (Ribonuclease P) | SARS CoV2 |
| 2200 | 32522617 | EC 3.4.15.1 (Peptidyl-Dipeptidase A) | SARS CoV2 |
| 2201 | 32522617 | EC 3.4.17.23 (Angiotensin-Converting Enzyme 2) | SARS CoV2 |


In [42]:
# https://stackoverflow.com/questions/46556169/finding-common-elements-between-multiple-dataframe-columns
common_enzymes = reduce(np.intersect1d, [cancer_enzymes_df.Enzyme, covid_enzymes_df.Enzyme])
len(common_enzymes)

69

In [44]:
combine_df = pd.concat([cancer_enzymes_df, covid_enzymes_df])

In [46]:
common_df = combine_df[combine_df['Enzyme'].isin(common_enzymes)]

In [47]:
common_df

| | PMID | Enzyme | Disease |
|---|---|---|---|
| 0 | 19081671 | EC 2.7.- (Protein Kinases) | Cancer |
| 5 | 23142414 | EC 2.3.2.27 (Ubiquitin-Protein Ligases) | Cancer |
| 6 | 17882664 | EC 2.3.2.27 (Ubiquitin-Protein Ligases) | Cancer |
| 7 | 19860736 | EC 3.6.1.- (Adenosine Triphosphatases) | Cancer |
| 19 | 6258623 | EC 3.4.15.1 (Peptidyl-Dipeptidase A) | Cancer |
| ... | ... | ... | ... |
| 2186 | 34201422 | EC 3.4.21.- (Kallikreins) | SARS CoV2 |
| 2189 | 34202565 | EC 3.1.- (Exoribonucleases) | SARS CoV2 |
| 2193 | 34405154 | EC 1.13.12.- (Luciferases) | SARS CoV2 |
| 2198 | 34092976 | EC 3.4.15.1 (Peptidyl-Dipeptidase A) | SARS CoV2 |
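
The intersection above compares whole strings, so two entries with the same EC number but different names, such as `EC 3.4.17.23 (ACE2 protein, human)` and `EC 3.4.17.23 (Angiotensin-Converting Enzyme 2)`, count as different enzymes. A minimal sketch of an alternative, assuming one would prefer to match on the EC number alone, is shown below. `ec_number` and `common_ec_numbers` are hypothetical helpers, and the regular expression only covers the numeric and `-` forms seen in the tables above.

```python
import re

# 'EC', a space, then one to four dot-separated fields that are digits or '-'.
EC_NUMBER = re.compile(r"^EC (\d+(?:\.(?:\d+|-)){0,3})")


def ec_number(entry):
    """Return the EC number at the start of an RN entry, or None if absent."""
    match = EC_NUMBER.match(entry)
    return match.group(1) if match else None


def common_ec_numbers(first_entries, second_entries):
    """Return the sorted EC numbers that occur in both collections."""
    first = {ec_number(entry) for entry in first_entries}
    second = {ec_number(entry) for entry in second_entries}
    # Drop None, which stands for entries without a recognisable EC number.
    return sorted((first & second) - {None})


# Entry strings taken from the tables above, grouped here only for illustration.
first_demo = ["EC 3.4.15.1 (Peptidyl-Dipeptidase A)", "EC 2.7.- (Protein Kinases)"]
second_demo = ["EC 3.4.17.23 (ACE2 protein, human)",
               "EC 3.4.17.23 (Angiotensin-Converting Enzyme 2)",
               "EC 3.4.15.1 (Peptidyl-Dipeptidase A)"]

print(common_ec_numbers(first_demo, second_demo))
```

Partial numbers such as `2.7.-` denote a whole enzyme class, so matching on them is coarser than matching on a full four-field number. In the notebook, the helper could be applied to the `Enzyme` column of each DataFrame before intersecting.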
