# bioNX: Automated Knowledge Graph Construction for PPI Networks

Given 

This notebook contains a workflow that integrates data from several sources:
* [bioGRID](https://thebiogrid.org/) - primary data source for PPIs
* [HGNC](https://www.genenames.org/) - Gene nomenclature reference
* [PubMed](https://pubmed.ncbi.nlm.nih.gov/) - Literature
* [Uniprot](https://www.uniprot.org/) - Protein properties (*pending implementation*)
* [Entrez](https://www.ncbi.nlm.nih.gov/Web/Search/entrezfs.html) - Gene properties (*pending implementation*)
* [GO](http://geneontology.org/) - Gene properties (*pending implementation*)

*Please note this project is under development.*

# Setup

In [1]:
import os
import sys
sys.path[0] = '../'
from dotenv import load_dotenv
import numpy as np
import pandas as pd
import re
import requests
from xml.etree.ElementTree import fromstring, ElementTree
import time

In [2]:
# Display options
pd.options.display.max_rows = None  # display all rows
pd.options.display.max_columns = None  # display all columns

In [25]:
# Set access key for BioGRID REST API
load_dotenv()
BIOGRID_ACCESS_KEY = os.getenv('BIOGRID_ACCESS_KEY')
NEO4J_HOME = os.getenv('NEO4J_HOME')
importDir = NEO4J_HOME.replace('\\', '') + 'import/'

# Building the Base Graph

Interaction data is fetched from the BioGRID API for a single gene and its protein-protein interactions (PPIs). The PPI dataset forms the core foundation upon which the knowledge graph is built. It is augmented using data fetched from additional sources.

In [27]:
# Loads data and formats columns
def load_biogrid_data(gene_specifier, limit=10000):

    if not isinstance(gene_specifier, list): 
        gene_specifier = gene_specifier
    else:
        gene_specifier = '|'.join(gene_specifier)
    
    url = f"https://webservice.thebiogrid.org/interactions/?searchNames=true&geneList={gene_specifier}" \
    "&taxId=9606&includeInteractors=true&includeInteractorInteractions=true&includeHeader=true" \
    f"&accesskey={BIOGRID_ACCESS_KEY}&max={limit}"

    # Load data
    data = pd.read_csv(url, sep='\t', header=0)

    # Remove leading hash character
    data.rename(columns={"#BioGRID Interaction ID":"BioGRID Interaction ID"}, inplace=True)

    # Replace pipe separators with commas
    data = data.replace('\|', ',', regex=True)

    # Select str columns and replace '-' with np.nan
    cols = ['Systematic Name Interactor A', 
          'Systematic Name Interactor B', 
          'Score', 
          'Modification', 
          'Phenotypes',
          'Qualifications',
          'Tags']

    data[cols] = data[cols].applymap(lambda col: re.sub(r'^-$', str(np.NaN), col))
       
    return data

In [28]:
# Selects and transforms columns
def preprocess_biogrid_data(data):
    
     # Convert Score column to float
    data['Score'] = data['Score'].astype('float64')
    
    # Select columns of interest for graph
    data = data[['BioGRID Interaction ID', 'Official Symbol Interactor A', 'Entrez Gene Interactor A', 
                       'Synonyms Interactor A', 'Organism Interactor A', 'Official Symbol Interactor B', 
                       'Entrez Gene Interactor B', 'Synonyms Interactor B', 'Organism Interactor B', 
                       'Author', 'Pubmed ID', 'Experimental System', 'Experimental System Type', 'Throughput']]

    # Create Year column
    data['Publication Year'] = data['Author'].str.split(' ').str[-1].str.strip('()')    
    
    # Remove Year from Author column
    data['Author'] = data['Author'].str.split(' ').str[:2]
    data['Author'] = data['Author'].apply(lambda x: ', '.join(x))
    
    return data

# Integrating Data from External APIs

Fetching data from various APIs to build a table for genes:

* HGNC
* PubMed
* Uniprot
* Entrez

In [29]:
### Create new columns for gene: Gene Description, NCBI url, Wikipedia url, full sequence url, chromosome, base pairs
### Create new columns for article: Article Title, Publication, Pubmed url
### Create new column: subcellular location, condition, PTM/Processing, chromosome

## HGNC 

Parses XML from request. Documentation at:<br>
https://www.genenames.org/help/rest/

In [30]:
def fetch_hgnc_data(ppi_data):
    
    def build_gene_dict(gene):

        url = f"http://rest.genenames.org/fetch/symbol/{gene}";
        r = requests.get(url)

        tree = ElementTree(fromstring(r.text))
        root = tree.getroot()

        str_attribs = ['symbol', 'name', 'entrez_id', 'locus_type', 'location', 'ensembl_gene_id', 
                    'locus_group']

        arr_attribs = ['pubmed_id', 'gene_group', 'uniprot_ids', 'omim_id']

        gene_dict = dict()

        # retrieve <str> attributes
        for index, name in enumerate(str_attribs):
            elements = root.findall(f".//str[@name='{name}']") # add .upper() method to gene symbol
            try:
                gene_dict[name] = [elements[0].text]
            except Exception as e:
                #print(f"Error: {e}") # Log errors 
                gene_dict[name] = ['NULL'] ### Keep symbol if no record is available; some genes are new research; use synonyms
                continue

        # retrieve <arr> attributes
        for index, name in enumerate(arr_attribs):
            elements = root.findall(f".//arr[@name='{name}']/*")
            try:
                gene_dict[name] = [elements[0].text]
            except Exception as e:
                #print(f"Error: {e}") # Log errors 
                gene_dict[name] = ['NULL']
                continue

        return gene_dict
    
    def build_gene_dataframe(ppi_data):

        genes = list(ppi_data['Official Symbol Interactor A'].unique())

        gene_dict_list = []

        for i, gene in enumerate(genes):
            gene_dict = build_gene_dict(gene)
            gene_dict_list.append(gene_dict)
            
            # Throttle to less than 10 requests per second
            if i != 0 and i % 10 == 0:
                time.sleep(.5)

        merge_gene_dict = {}

        for key in gene_dict_list[0].keys():
            merge_gene_dict[key] = [gene_dict_list[i][key][0] for i in range(len(gene_dict_list))]

        return pd.DataFrame(merge_gene_dict)
    
    return build_gene_dataframe(ppi_data)


## Pubmed

Example query: <br>
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&id=9806834 - 'pubmed_id'

--> can fetch comma delimited list for id to retrieve all in one request

In [31]:
def fetch_pubmed_data(data, column):
    
    def build_article_dict(pubmed_id, column):

        url = f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&id={pubmed_id}";
        r = requests.get(url)

        tree = ElementTree(fromstring(r.text))
        root = tree.getroot()

        item_attribs = ['Title', 'FullJournalName', 'DOI']

        article_dict = {column:[pubmed_id]}

        # retrieve <Item> attributes
        for index, Name in enumerate(item_attribs):
            elements = root.findall(f".//Item[@Name='{Name}']") # add .upper() method to gene symbol
            try:
                article_dict[Name.lower()] = [elements[0].text]
            except Exception as e:
                print(f"Error: {e}") # Log errors 
                article_dict[Name.lower()] = ['NULL'] ### Keep symbol if no record is available; some genes are new research; use synonyms
                continue

        # Throttle to 10 requests per second
        time.sleep(0.1)

        article_dict['full_journal_name'] = article_dict.pop('fulljournalname')
        
        return article_dict
    
    def build_article_dataframe(data, column):

        pubmed_id_list = list(data[column].unique())

        article_dict_list = [build_article_dict(article, column) for article in pubmed_id_list]
        #print(article_dict_list)
        
        merge_article_dict = {}

        # Possible to use csv.writerows() for csv output
        for item in article_dict_list:
            #print(item)
            for key in item.keys():
                #print(key)
                merge_article_dict[key] = [article_dict_list[i][key][0] for i in range(len(article_dict_list))]
        
        return pd.DataFrame(merge_article_dict)
    
    article_df = build_article_dataframe(data, column)
    
    return data.merge(article_df)
    

# Clean Final Data

In [32]:
def final_process_data(data):
    
    def rename_columns(data):
        data.columns = data.columns.str.lower().str.replace(' ', '_')
        
        return data

    # Data cleaning pipeline
    data = rename_columns(data)
    
    # Extract chromosome
    
    
    return data

# Generate the Knowledge Graph

## Select Gene

In [None]:
gene = 'MTHFR'
limit = 100

## Run the Pipeline

In [33]:
# Fetch and clean interaction data from BioGRID
ppi_data = load_biogrid_data(gene, limit=limit)
ppi_data = preprocess_biogrid_data(ppi_data)

# Fetch HGNC data for information on genes; time limit on requests s 
gene_data = fetch_hgnc_data(ppi_data)

# Fetch PubMed data
ppi_data = fetch_pubmed_data(ppi_data, 'Pubmed ID')
gene_data = fetch_pubmed_data(gene_data, 'pubmed_id')

# Final cleaning step
ppi_data = final_process_data(ppi_data)
gene_data = final_process_data(gene_data)

# Save in project folder
ppi_data.to_csv('../data/clean/biogrid_ppi_data.csv', index=False)
gene_data.to_csv('../data/clean/gene_data.csv', index=False)

# Save to Neo4j imports folder for LOAD CSV command
ppi_data.to_csv(importDir + 'biogrid_ppi_data.csv', index=False)
gene_data.to_csv(importDir + 'gene_data.csv', index=False)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


Error: list index out of range
Error: list index out of range
Error: list index out of range
Error: list index out of range
Error: list index out of range
Error: list index out of range
Error: list index out of range
Error: list index out of range
Error: list index out of range
Error: list index out of range
Error: list index out of range
Error: list index out of range
Error: list index out of range
Error: list index out of range
