# bioNX: Automated Knowledge Graph Construction for PPI Networks

Given 

This notebook contains a workflow that integrates data from several sources:
* [BioGRID](https://thebiogrid.org/) - Primary data source for PPIs
* [HGNC](https://www.genenames.org/) - Gene nomenclature reference
* [PubMed](https://pubmed.ncbi.nlm.nih.gov/) - Literature
* [UniProt](https://www.uniprot.org/) - Protein properties
* [Entrez](https://www.ncbi.nlm.nih.gov/Web/Search/entrezfs.html) - Gene properties
* [GO](http://geneontology.org/) - Gene properties

*Please note this project is under development.*

# Setup

In [57]:
import os
import sys
sys.path[0] = '../'
from dotenv import load_dotenv
import numpy as np
import pandas as pd
import re
import requests
from xml.etree.ElementTree import fromstring, ElementTree
import time

In [58]:
# Display options
pd.options.display.max_rows = None  # display all rows
pd.options.display.max_columns = None  # display all columns

In [59]:
# Set access key for BioGRID REST API
load_dotenv()
BIOGRID_ACCESS_KEY = os.getenv('BIOGRID_ACCESS_KEY')
NEO4J_HOME = os.getenv('NEO4J_HOME')
importDir = '/Users/gregory/Library/Application Support/Neo4j Desktop/Application/neo4jDatabases/database-d0b780b3-ce77-46cc-b7ed-bd0f78a46581/installation-3.5.14/import/'

# Building the Base Graph

Interaction data is fetched from the BioGRID REST API for a given gene (or list of genes) and its protein-protein interactions (PPIs). This PPI dataset is the foundation of the knowledge graph, which is then augmented with data fetched from the additional sources.

In [60]:
# Loads data and formats columns
def load_biogrid_data(gene_specifier, limit=10000):

    # Multiple genes are passed to the API as a pipe-separated list
    if isinstance(gene_specifier, list):
        gene_specifier = '|'.join(gene_specifier)
    
    url = f"https://webservice.thebiogrid.org/interactions/?searchNames=true&geneList={gene_specifier}" \
    "&taxId=9606&includeInteractors=true&includeInteractorInteractions=true&includeHeader=true" \
    f"&accesskey={BIOGRID_ACCESS_KEY}&max={limit}"

    # Load data
    data = pd.read_csv(url, sep='\t', header=0)

    # Remove leading hash character
    data.rename(columns={"#BioGRID Interaction ID":"BioGRID Interaction ID"}, inplace=True)

    # Replace pipe separators with commas (raw string avoids an invalid-escape warning)
    data = data.replace(r'\|', ',', regex=True)

    # Select columns that use '-' as a missing-value placeholder and replace it with np.nan
    cols = ['Systematic Name Interactor A', 
          'Systematic Name Interactor B', 
          'Score', 
          'Modification', 
          'Phenotypes',
          'Qualifications',
          'Tags']

    # Exact-match replacement also works for columns that pandas parsed as numeric
    data[cols] = data[cols].replace('-', np.nan)
       
    return data
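
The request URL in `load_biogrid_data` is assembled by string concatenation. As a minimal sketch, the same query string could be built with the standard library's `urllib.parse.urlencode`, which percent-encodes the pipe separator and any unusual characters in gene names. The helper name `build_biogrid_url` is hypothetical; the parameter names and values are the ones already used in `load_biogrid_data`.

```python
from urllib.parse import urlencode

BIOGRID_URL = "https://webservice.thebiogrid.org/interactions/"

def build_biogrid_url(gene_specifier, access_key, limit=10000):
    """Hypothetical helper: build the BioGRID query URL for one gene or a list of genes."""
    # Multiple genes are joined with '|', as in load_biogrid_data
    if isinstance(gene_specifier, list):
        gene_specifier = "|".join(gene_specifier)
    params = {
        "searchNames": "true",
        "geneList": gene_specifier,
        "taxId": 9606,
        "includeInteractors": "true",
        "includeInteractorInteractions": "true",
        "includeHeader": "true",
        "accesskey": access_key,
        "max": limit,
    }
    # urlencode percent-encodes reserved characters, e.g. '|' becomes '%7C'
    return f"{BIOGRID_URL}?{urlencode(params)}"

print(build_biogrid_url(["BRCA1", "TP53"], access_key="YOUR_KEY", limit=10))
```

The resulting string can be passed to `pd.read_csv` in the same way as the hand-built URL.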

In [61]:
# Selects and transforms columns
def preprocess_biogrid_data(data):
    
    # Convert Score column to float
    data['Score'] = data['Score'].astype('float64')
    
    # Select columns of interest for graph; .copy() avoids SettingWithCopyWarning
    # when new columns are assigned below
    data = data[['BioGRID Interaction ID', 'Official Symbol Interactor A', 'Entrez Gene Interactor A', 
                       'Synonyms Interactor A', 'Organism Interactor A', 'Official Symbol Interactor B', 
                       'Entrez Gene Interactor B', 'Synonyms Interactor B', 'Organism Interactor B', 
                       'Author', 'Pubmed ID', 'Experimental System', 'Experimental System Type', 'Throughput']].copy()

    # Create Year column from the last token of Author, stripping the parentheses
    data['Publication Year'] = data['Author'].str.split(' ').str[-1].str.strip('()')    
    
    # Remove Year from Author column, keeping the first two tokens joined by ', '
    data['Author'] = data['Author'].str.split(' ').str[:2].str.join(', ')
    
    return data
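
To make the Author handling concrete, here is a small pure-Python sketch of the same split applied to a single value. It assumes the raw BioGRID field looks like `Braganca J (2003)` (surname, initials, year in parentheses), which is inferred from the cleaned output in this notebook rather than from BioGRID documentation. The name `split_author_year` is hypothetical.

```python
def split_author_year(author_field):
    """Split 'Surname Initials (YYYY)' into ('Surname, Initials', 'YYYY')."""
    tokens = author_field.split(' ')
    year = tokens[-1].strip('()')     # last token holds the year in parentheses
    author = ', '.join(tokens[:2])    # first two tokens: surname and initials
    return author, year

print(split_author_year("Braganca J (2003)"))
```

Because only the first two tokens are kept, a multi-word surname would be truncated by this approach.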

In [62]:
# Select gene or genes; TODO: loop over multiple genes and append the dataframes
gene = 'BRCA1'  # alternatives: 'TP53', 'CYP1A2'
genes = ['BRCA1', 'TP53']  # alternative: ['FECH', 'CYP1A2']

# Fetch and clean data
ppi_data = load_biogrid_data(gene, limit=10)
ppi_data = preprocess_biogrid_data(ppi_data)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


In [63]:
ppi_data.shape

(10, 15)

In [64]:
ppi_data

Unnamed: 0,BioGRID Interaction ID,Official Symbol Interactor A,Entrez Gene Interactor A,Synonyms Interactor A,Organism Interactor A,Official Symbol Interactor B,Entrez Gene Interactor B,Synonyms Interactor B,Organism Interactor B,Author,Pubmed ID,Experimental System,Experimental System Type,Throughput,Publication Year
0,1161,EP300,2033,"KAT3B,RSTS2,p300",9606,TFAP2A,7020,"AP-2,AP-2alpha,AP2TF,BOFS,TFAP2",9606,"Braganca, J",12586840,Two-hybrid,physical,Low Throughput,2003
1,2368,BRCA1,672,"BRCAI,BRCC1,BROVCA1,FANCS,IRIS,PNCA4,PPP1R53,P...",9606,ATF1,466,"EWS-ATF1,FUS/ATF-1,TREB36",9606,"Houvras, Y",10945975,Two-hybrid,physical,Low Throughput,2000
2,2398,BRCA1,672,"BRCAI,BRCC1,BROVCA1,FANCS,IRIS,PNCA4,PPP1R53,P...",9606,MSH2,4436,"COCA1,FCC1,HNPCC,HNPCC1,LCFS2",9606,"Wang, Q",11498787,Two-hybrid,physical,Low Throughput,2001
3,2411,BRCA1,672,"BRCAI,BRCC1,BROVCA1,FANCS,IRIS,PNCA4,PPP1R53,P...",9606,BARD1,580,-,9606,"Wu, LC",8944023,Two-hybrid,physical,Low Throughput,1996
4,2424,BRCA1,672,"BRCAI,BRCC1,BROVCA1,FANCS,IRIS,PNCA4,PPP1R53,P...",9606,MSH6,2956,"GTBP,GTMBP,HNPCC5,HSAP,p160",9606,"Wang, Q",11498787,Two-hybrid,physical,Low Throughput,2001
5,3315,MCM5,4174,"CDC46,P1-CDC46",9606,MCM2,4171,"BM28,CCNL1,CDCL1,D3S3194,MITOTIN,cdc19",9606,"Kneissl, M",12614612,Two-hybrid,physical,High Throughput,2003
6,3324,ORC2,4999,ORC2L,9606,MCM2,4171,"BM28,CCNL1,CDCL1,D3S3194,MITOTIN,cdc19",9606,"Kneissl, M",12614612,Two-hybrid,physical,High Throughput,2003
7,3325,RPA2,6118,"REPA2,RP-A p32,RP-A p34,RPA32",9606,MCM2,4171,"BM28,CCNL1,CDCL1,D3S3194,MITOTIN,cdc19",9606,"Kneissl, M",12614612,Two-hybrid,physical,High Throughput,2003
8,3326,DBF4,10926,"ASK,CHIF,DBF4A,ZDBF1",9606,MCM2,4171,"BM28,CCNL1,CDCL1,D3S3194,MITOTIN,cdc19",9606,"Kneissl, M",12614612,Two-hybrid,physical,High Throughput,2003
9,3329,AKAP8,10270,"AKAP 95,AKAP-8,AKAP-95,AKAP95",9606,MCM2,4171,"BM28,CCNL1,CDCL1,D3S3194,MITOTIN,cdc19",9606,"Eide, T",12740381,Two-hybrid,physical,Low Throughput,2003


In [65]:
# Save in project folder
ppi_data.to_csv('../data/clean/biogrid_ppi_data.csv', index=False)

# Save to Neo4j imports folder for LOAD CSV command
ppi_data.to_csv(importDir + 'biogrid_ppi_data.csv', index=False)

In [66]:
ppi_data.groupby('Author')['Pubmed ID'].value_counts()[:5]

Author       Pubmed ID
Braganca, J  12586840     1
Eide, T      12740381     1
Houvras, Y   10945975     1
Kneissl, M   12614612     4
Wang, Q      11498787     2
Name: Pubmed ID, dtype: int64

# Integrating Data from External APIs

Additional data is fetched from the following APIs to build a table of gene properties:

* HGNC
* PubMed
* Uniprot
* Entrez

In [67]:
# TODO: create new columns for gene: Gene Description, NCBI url, Wikipedia url, full sequence url, chromosome, #bp
# TODO: create new columns for article: Article Title, Publication, Pubmed url
# TODO: create new columns: subcellular location, condition, PTM/Processing, chromosome

## HGNC 

The response from the HGNC REST service is parsed as XML. Documentation at:<br>
https://www.genenames.org/help/rest/

In [68]:
def fetch_hgnc_data(ppi_data):
    
    def build_gene_dict(gene):

        # Query HGNC for a single gene symbol
        url = f"http://rest.genenames.org/fetch/symbol/{gene}"
        r = requests.get(url)

        root = fromstring(r.text)

        # Single-valued <str> fields
        str_attribs = ['symbol', 'name', 'entrez_id', 'locus_type', 'location', 'ensembl_gene_id', 
                    'locus_group']

        # Multi-valued <arr> fields; only the first value of each is kept
        arr_attribs = ['pubmed_id', 'gene_group', 'uniprot_ids', 'omim_id']

        gene_dict = dict()

        # Retrieve <str> attributes; store 'NULL' when the field is absent
        # TODO: keep the symbol if no record is available (some genes are new research); use synonyms
        for name in str_attribs:
            elements = root.findall(f".//str[@name='{name}']")
            gene_dict[name] = [elements[0].text] if elements else ['NULL']

        # Retrieve <arr> attributes; store 'NULL' when the field is absent
        for name in arr_attribs:
            elements = root.findall(f".//arr[@name='{name}']/*")
            gene_dict[name] = [elements[0].text] if elements else ['NULL']

        return gene_dict
    
    def build_gene_dataframe(ppi_data):

        genes = list(ppi_data['Official Symbol Interactor A'].unique())

        gene_dict_list = []

        for i, gene in enumerate(genes):
            gene_dict = build_gene_dict(gene)
            gene_dict_list.append(gene_dict)
            
            # Throttle to less than 10 requests per second
            if i != 0 and i % 10 == 0:
                time.sleep(.5)

        merge_gene_dict = {}

        for key in gene_dict_list[0].keys():
            merge_gene_dict[key] = [gene_dict_list[i][key][0] for i in range(len(gene_dict_list))]

        return pd.DataFrame(merge_gene_dict)
    
    return build_gene_dataframe(ppi_data)
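
The loop in `build_gene_dataframe` pauses for half a second after every tenth request, so it relies on request latency to stay under 10 requests per second. A minimal sketch of a stricter alternative is to enforce a minimum interval between consecutive calls. `RateLimiter` is a hypothetical helper, not part of this notebook or of `requests`; the clock and sleep functions are injectable only so the behaviour can be checked without waiting.

```python
import time

class RateLimiter:
    """Hypothetical helper: allow at most `max_per_second` calls per second."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self.clock = clock
        self.sleep = sleep
        self.last_call = None

    def wait(self):
        """Block until at least `min_interval` seconds have passed since the last call."""
        now = self.clock()
        if self.last_call is not None:
            remaining = self.min_interval - (now - self.last_call)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_call = now

limiter = RateLimiter(max_per_second=10)
for gene in ["BRCA1", "TP53", "ATR"]:
    limiter.wait()  # call this immediately before each request
    print(f"would request {gene}")
```

In the notebook, `limiter.wait()` would take the place of the `time.sleep(.5)` check inside the loop over genes.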


## Pubmed

Example query: <br>
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&id=9806834

The `id` parameter takes a value from the `pubmed_id` column. It also accepts a comma-delimited list of IDs, so all records can be retrieved in a single request.
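
As a sketch of the batched variant, the helpers below build one ESummary URL for several IDs and parse a multi-record response. The names `build_esummary_url` and `parse_esummary` are hypothetical, and `SAMPLE_XML` is a hand-written, abbreviated stand-in for a real response (one `<DocSum>` per article with `<Item Name=...>` children, matching the XPath used in this notebook), so the real payload may differ in detail.

```python
from xml.etree.ElementTree import fromstring

ESUMMARY_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"

def build_esummary_url(pubmed_ids):
    """Build a single ESummary request URL for a list of PubMed IDs."""
    id_list = ",".join(str(pubmed_id) for pubmed_id in pubmed_ids)
    return f"{ESUMMARY_URL}?db=pubmed&id={id_list}"

def parse_esummary(xml_text, item_names=("Title", "FullJournalName", "DOI")):
    """Return one dict per <DocSum>, with None for items that are absent."""
    records = []
    for docsum in fromstring(xml_text).findall(".//DocSum"):
        record = {"pubmed_id": docsum.findtext("Id")}
        for name in item_names:
            record[name.lower()] = docsum.findtext(f".//Item[@Name='{name}']")
        records.append(record)
    return records

# Hand-written sample for illustration only; not a real API response.
SAMPLE_XML = """
<eSummaryResult>
  <DocSum>
    <Id>9705155</Id>
    <Item Name="Title" Type="String">Cloning and tissue expression of two putative steroid membrane receptors.</Item>
    <Item Name="FullJournalName" Type="String">Biological chemistry</Item>
    <Item Name="DOI" Type="String">10.1515/bchm.1998.379.7.907</Item>
  </DocSum>
  <DocSum>
    <Id>8978690</Id>
    <Item Name="Title" Type="String">The Schizosaccharomyces pombe rad3 checkpoint gene.</Item>
    <Item Name="FullJournalName" Type="String">The EMBO journal</Item>
  </DocSum>
</eSummaryResult>
""".strip()

print(build_esummary_url([9705155, 8978690]))
for record in parse_esummary(SAMPLE_XML):
    print(record)
```

Fetching all IDs in one request would also remove the need to throttle per-article calls, at the cost of handling larger responses.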

In [94]:
def fetch_pubmed_data(data, column):
    
    def build_article_dict(pubmed_id, column):

        url = f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&id={pubmed_id}"
        r = requests.get(url)

        root = fromstring(r.text)

        item_attribs = ['Title', 'FullJournalName', 'DOI']

        article_dict = {column: [pubmed_id]}

        # Retrieve <Item> attributes; store 'NULL' when an item is absent (e.g. no DOI)
        for name in item_attribs:
            elements = root.findall(f".//Item[@Name='{name}']")
            try:
                article_dict[name.lower()] = [elements[0].text]
            except IndexError as e:
                print(f"Error: {e}") # Log errors 
                article_dict[name.lower()] = ['NULL']

        # Throttle to 10 requests per second
        time.sleep(0.1)

        article_dict['full_journal_name'] = article_dict.pop('fulljournalname')
        
        return article_dict
    
    def build_article_dataframe(data, column):

        pubmed_id_list = list(data[column].unique())

        article_dict_list = [build_article_dict(article, column) for article in pubmed_id_list]
        
        # Each dict holds single-element lists; unwrap them into one row per article
        rows = [{key: value[0] for key, value in item.items()} for item in article_dict_list]
        
        return pd.DataFrame(rows)
    
    article_df = build_article_dataframe(data, column)
    
    # Join the article fields back onto the input rows by PubMed ID
    return data.merge(article_df, on=column)
    

In [95]:
fetch_pubmed_data(gene_data, 'pubmed_id')

Error: list index out of range
Error: list index out of range


Unnamed: 0,symbol,name,entrez_id,locus_type,location,ensembl_gene_id,locus_group,pubmed_id,gene_group,uniprot_ids,omim_id,title,doi,full_journal_name
0,MAP2K2,mitogen-activated protein kinase kinase 2,5605,gene with protein product,19p13.3,ENSG00000126934,protein-coding gene,8388392,Mitogen-activated protein kinase kinases,P36507,601263,Cloning and characterization of two distinct h...,,The Journal of biological chemistry
1,PGRMC1,progesterone receptor membrane component 1,10857,gene with protein product,Xq24,ENSG00000101856,protein-coding gene,9705155,Membrane associated progesterone receptor family,O00264,300435,Cloning and tissue expression of two putative ...,10.1515/bchm.1998.379.7.907,Biological chemistry
2,PNKP,polynucleotide kinase 3'-phosphatase,11284,gene with protein product,19q13.33,ENSG00000039650,protein-coding gene,10446192,HAD Asp-based non-protein phosphatases,Q96T60,605610,"Molecular cloning of the human gene, PNKP, enc...",10.1074/jbc.274.34.24176,The Journal of biological chemistry
3,CIAO1,cytosolic iron-sulfur assembly component 1,9391,gene with protein product,2q11.2,ENSG00000144021,protein-coding gene,9556563,WD repeat domain containing,O76071,604333,Ciao 1 is a novel WD40 protein that interacts ...,10.1074/jbc.273.18.10880,The Journal of biological chemistry
4,ATR,ATR serine/threonine kinase,545,gene with protein product,3q23,ENSG00000175054,protein-coding gene,8978690,Armadillo like helical domain containing,Q13535,601215,The Schizosaccharomyces pombe rad3 checkpoint ...,,The EMBO journal


In [80]:
article_dict_list = [{'pubmed_id': ['8388392'], 'title': ['Cloning and characterization of two distinct human extracellular signal-regulated kinase activator kinases, MEK1 and MEK2.'], 'DOI': ['NULL'], 'full_journal_name': ['The Journal of biological chemistry']}, {'pubmed_id': ['9705155'], 'title': ['Cloning and tissue expression of two putative steroid membrane receptors.'], 'doi': ['10.1515/bchm.1998.379.7.907'], 'full_journal_name': ['Biological chemistry']}, {'pubmed_id': ['10446192'], 'title': ["Molecular cloning of the human gene, PNKP, encoding a polynucleotide kinase 3'-phosphatase and evidence for its role in repair of DNA strand breaks caused by oxidative damage."], 'doi': ['10.1074/jbc.274.34.24176'], 'full_journal_name': ['The Journal of biological chemistry']}, {'pubmed_id': ['9556563'], 'title': ['Ciao 1 is a novel WD40 protein that interacts with the tumor suppressor protein WT1.'], 'doi': ['10.1074/jbc.273.18.10880'], 'full_journal_name': ['The Journal of biological chemistry']}, {'pubmed_id': ['8978690'], 'title': ['The Schizosaccharomyces pombe rad3 checkpoint gene.'], 'DOI': ['NULL'], 'full_journal_name': ['The EMBO journal']}]

In [81]:
article_dict_list

[{'pubmed_id': ['8388392'],
  'title': ['Cloning and characterization of two distinct human extracellular signal-regulated kinase activator kinases, MEK1 and MEK2.'],
  'DOI': ['NULL'],
  'full_journal_name': ['The Journal of biological chemistry']},
 {'pubmed_id': ['9705155'],
  'title': ['Cloning and tissue expression of two putative steroid membrane receptors.'],
  'doi': ['10.1515/bchm.1998.379.7.907'],
  'full_journal_name': ['Biological chemistry']},
 {'pubmed_id': ['10446192'],
  'title': ["Molecular cloning of the human gene, PNKP, encoding a polynucleotide kinase 3'-phosphatase and evidence for its role in repair of DNA strand breaks caused by oxidative damage."],
  'doi': ['10.1074/jbc.274.34.24176'],
  'full_journal_name': ['The Journal of biological chemistry']},
 {'pubmed_id': ['9556563'],
  'title': ['Ciao 1 is a novel WD40 protein that interacts with the tumor suppressor protein WT1.'],
  'doi': ['10.1074/jbc.273.18.10880'],
  'full_journal_name': ['The Journal of biolog

In [None]:
# Normalize keys to lowercase; build new dicts instead of mutating them
# (and the list) while iterating
article_dict_list = [{key.lower(): value for key, value in item.items()}
                     for item in article_dict_list]

In [None]:
article_dict_list

## Clean Final Data

In [24]:
def final_process_data(data):
    
    def rename_columns(data):
        # Lowercase column names and replace spaces with underscores
        data.columns = data.columns.str.lower().str.replace(' ', '_')
        
        return data

    # Data cleaning pipeline
    data = rename_columns(data)
    
    # TODO: Extract chromosome
    
    return data
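
The "Extract chromosome" step is not implemented yet. One possible sketch, assuming `location` holds cytogenetic band strings such as `19p13.3` or `Xq24` as in the gene table in this notebook, is to take the leading chromosome label with a regular expression. `extract_chromosome` is a hypothetical name, and values that do not start with a chromosome label (for example the `'NULL'` placeholder) return `None`.

```python
import re

# Leading chromosome label: one or two digits, X or Y,
# followed by an arm letter (p/q) or the end of the string
CHROMOSOME_PATTERN = re.compile(r"^(\d{1,2}|[XY])(?=[pq]|$)")

def extract_chromosome(location):
    """Return the chromosome label from a cytogenetic location, or None."""
    match = CHROMOSOME_PATTERN.match(location)
    return match.group(1) if match else None

for location in ["19p13.3", "Xq24", "2q11.2", "NULL"]:
    print(location, "->", extract_chromosome(location))
```

Applied to a dataframe, this could become `data['chromosome'] = data['location'].apply(extract_chromosome)`, assuming the `location` column contains only strings.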

# Pipeline

In [100]:
# Select gene; TODO: loop over multiple genes and append the dataframes
gene = 'CYP1A2'  # alternative: 'TP53'
#genes = ['FECH', 'CYP1A2']

# Fetch and clean interaction data from BioGRID
ppi_data = load_biogrid_data(gene, limit=20)
ppi_data = preprocess_biogrid_data(ppi_data)

# Fetch HGNC data for information on genes; requests are rate-limited
gene_data = fetch_hgnc_data(ppi_data)

# Fetch PubMed data
ppi_data = fetch_pubmed_data(ppi_data, 'Pubmed ID')
gene_data = fetch_pubmed_data(gene_data, 'pubmed_id')

# Final cleaning step
ppi_data = final_process_data(ppi_data)
gene_data = final_process_data(gene_data)

# Save in project folder
ppi_data.to_csv('../data/clean/biogrid_ppi_data.csv', index=False)
gene_data.to_csv('../data/clean/gene_data.csv', index=False)

# Save to Neo4j imports folder for LOAD CSV command
ppi_data.to_csv(importDir + 'biogrid_ppi_data.csv', index=False)
gene_data.to_csv(importDir + 'gene_data.csv', index=False)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


## Tests

In [None]:
# Test for all genes present in genes data
# (after final_process_data, column names are lowercase with underscores)
interaction_set = set(ppi_data['official_symbol_interactor_a'].unique())
gene_set = set(gene_data['symbol'].unique())

assert interaction_set == gene_set, \
    f"There are genes missing in the data:\nmissing from gene data: {interaction_set - gene_set}\nmissing from interaction data: {gene_set - interaction_set}"


In [30]:
gene_data

Unnamed: 0,symbol,name,entrez_id,locus_type,location,ensembl_gene_id,locus_group,pubmed_id,gene_group,uniprot_ids,omim_id,title,doi,full_journal_name
0,FECH,ferrochelatase,2235,gene with protein product,18q21.31,ENSG00000066926,protein-coding gene,1838349,,P22830,612386,Assignment of the gene for cyclic AMP-response...,10.1016/0888-7543(91)90047-i,Genomics
1,POR,cytochrome p450 oxidoreductase,5447,gene with protein product,7q11.23,ENSG00000127948,protein-coding gene,2516426,MicroRNA protein coding host genes,P16435,124015,Isolation of a human cytochrome P-450 reductas...,10.1111/j.1469-1809.1989.tb01798.x,Annals of human genetics
2,PGRMC1,progesterone receptor membrane component 1,10857,gene with protein product,Xq24,ENSG00000101856,protein-coding gene,9705155,Membrane associated progesterone receptor family,O00264,300435,Cloning and tissue expression of two putative ...,10.1515/bchm.1998.379.7.907,Biological chemistry
3,CYP1A2,cytochrome P450 family 1 subfamily A member 2,1544,gene with protein product,15q24.1,ENSG00000140505,protein-coding gene,15128046,Cytochrome P450 family 1,P05177,124060,Comparison of cytochrome P450 (CYP) genes from...,10.1097/00008571-200401000-00001,Pharmacogenetics
4,CYB5A,cytochrome b5 type A,1528,gene with protein product,18q22.3,ENSG00000166347,protein-coding gene,1840560,,P00167,613218,Chromosomal localization of a cytochrome b5 ge...,10.1016/0888-7543(91)90136-3,Genomics
