# <p style="text-align: center;">RNA Knowledge Graph Analysis and Enhancement</p>
    
***
***

**Authors:** [ECavalleri](https://mail.google.com/mail/u/0/?view=cm&fs=1&tf=1&to=emanuele.cavalleri@unimi.it), [ACabri](https://mail.google.com/mail/u/0/?view=cm&fs=1&tf=1&to=alberto.cabri@unimi.it), [MSGomez](https://mail.google.com/mail/u/0/?view=cm&fs=1&tf=1&to=mauricio.soto@unimi.it), [MMesiti](https://mail.google.com/mail/u/0/?view=cm&fs=1&tf=1&to=marco.mesiti@unimi.it)

**GitHub Repositories:** [testRNA-KG](https://github.com/emanuelecavalleri/testRNA-KG), [PheKnowLator](https://github.com/callahantiff/PheKnowLator/)
  
<br>  
  
**Purpose:** We analyze and visualize the topology of our KG and import it into Neo4j. The Neo4j instance stores properties on individual edges, which enriches the biomedical KG in several ways:
- **Provenance:** recording the source of each relation lets us assess reliability. For example, we can keep only the edges coming from the more reliable source(s), or give more weight to edges that appear in more than one database.
- **Context:** a relation may hold only under specific conditions (e.g., a "miRNA-gene" interaction that is not general but occurs in a specific cell line) or may come from a specific study design, such as case-control or cohort studies.
- **Time:** relationships not documented before a specific date (e.g., 2017) can be removed to evaluate whether machine learning algorithms can predict them.
- **Scores:** values associated with a relation (e.g., False Discovery Rate, FDR) act as indicators that either strengthen or weaken its validity.
<br>

**Assumptions:**   
- Knowledge graphs ➞ `./resources/knowledge_graphs`
<br>

**Dependencies:**   
- **Scripts**: This notebook makes use of the [`GRAPE`](https://github.com/AnacletoLAB/grape/) tool.  
- **Data**: All downloaded and generated data sources are provided through [this](https://drive.google.com/drive/folders/1sev5zczMviX7UVqMhTpkFXG43K3nQa9f) dedicated Google Drive repository. 
_____
***

## Table of Contents
***

### [Preprocessing](#pre-processing)


### [Neo4j](#neo4j)   


### [GRAPE](#grape)  

____
***

## Set-Up Environment
***

In [None]:
# Run this to install/update grape
#!pip install --upgrade grape ensmallen embiggen graphviz

In [None]:
# import needed libraries
import pandas as pd
import numpy as np
from typing import Union
import re
import requests
from tqdm import tqdm

from grape import Graph, GraphVisualizer
from grape.embedders import Node2VecCBOWEnsmallen

tqdm.pandas()

In [None]:
# directory to write edges data to
edge_data_location = '../resources/edge_data/'

# directory to use for processing data
processed_data_location = '../resources/processed_data/'

# directory to use for metadata
metadata_location = '../resources/property_data/'

***
# Preprocessing  <a class="anchor" id="pre-processing"></a>

This section defines the file path used to access the graph and builds the lists of node and edge types.
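
As a minimal sketch of what the next cells rely on: each line of an N-Triples file has the form `<subject> <predicate> <object> .`, so splitting on single spaces yields four columns, the last of which holds only the trailing dot. The two triples below are toy examples with hypothetical `example.org` URIs (they are not taken from RNA-KG), and the approach assumes that no object is a literal containing spaces.

```python
import io
import pandas as pd

# Two toy N-Triples lines (hypothetical URIs, for illustration only)
toy_nt = (
    "<http://example.org/geneA> <http://example.org/interactsWith> <http://example.org/mirnaB> .\n"
    "<http://example.org/mirnaB> <http://example.org/causes> <http://example.org/diseaseC> .\n"
)

colnames = ["subject", "predicate", "object", "unused"]
toy = pd.read_csv(io.StringIO(toy_nt), sep=" ", header=None, names=colnames)
toy = toy.drop(columns=["unused"])  # the fourth column only holds the trailing dot
print(toy.shape)
```

If the graph contained literals with embedded spaces, a dedicated RDF parser would be a safer choice than a space-separated read.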

In [None]:
fpath = "../resources/knowledge_graphs/"
graphname = "PheKnowLator_v3.1.1_full_instance_inverseRelations_OWLNETS"
graphext = ".nt"

graph_fname = fpath+graphname+graphext

colnames = ["subject","predicate","object","unused"]

In [None]:
fulldata = pd.read_csv(graph_fname,sep=' ',header=None, names=colnames)
fulldata.drop([colnames[3]],axis=1,inplace=True) # remove the last column containing the dot symbol
fulldata.head()

### Build the nodes dataframe
The nodes are the union of all subjects and objects in the graph file; each node is then annotated with its object type in a new column named "type".
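
The idea can be sketched on a toy graph: collect every URI that appears as a subject or an object, then assign a type by looking for a known substring in the URI. The predicate URI and the three-entry `prefix_to_type` table below are simplifying assumptions for illustration; the real mapping used in this notebook is far larger.

```python
import pandas as pd

# Toy triples (URIs follow patterns used in this notebook; the predicate is hypothetical)
triples = pd.DataFrame({
    "subject": ["<http://www.ncbi.nlm.nih.gov/gene/581>",
                "<https://www.mirbase.org/hairpin/MI0000063>"],
    "predicate": ["<http://example.org/rel>", "<http://example.org/rel>"],
    "object": ["<https://www.mirbase.org/hairpin/MI0000063>",
               "<http://purl.obolibrary.org/obo/MONDO_0011226>"],
})

# Node list = union of all subjects and objects
names = sorted(set(triples["subject"]) | set(triples["object"]))
toy_nodes = pd.DataFrame(names, columns=["name"])

# Simplified substring-to-type table (an assumption, not the full mapping)
prefix_to_type = {
    "https://www.mirbase.org/": "miRNA",
    "purl.obolibrary.org/obo/MONDO": "Disease",
    "www.ncbi.nlm.nih.gov/gene": "Gene",
}

def toy_type(uri):
    """Return the type of the first matching substring, or None if nothing matches."""
    for prefix, ntype in prefix_to_type.items():
        if prefix in uri:
            return ntype
    return None

toy_nodes["type"] = toy_nodes["name"].map(toy_type)
print(toy_nodes)
```

Because the first matching substring wins, the order of the entries matters whenever one pattern is contained in another.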

In [None]:
nodes_df = pd.DataFrame(set(fulldata[colnames[0]])|set(fulldata[colnames[2]]),columns=["name"])
nodes_df.dropna(inplace=True)
nodes_df.head()

In [None]:
print('Number of nodes in ' + graph_fname + ': ' + str(len(nodes_df)))

In [None]:
# Full mapping for all node types in RNA-KG
RNAonly = False # when false all nodes are considered otherwise only RNA nodes are selected

def uri2ntype(uri: str)->Union[str,None]:
    
    retval = None
    
    # match regular expression for all RNA genes in ncbi format
    retlist=re.split(r"gene/[\w\-]+[?]", uri)
    if len(retlist) == 2:  # pattern matched therefore list item 1 contains the RNA type
        value = retlist[1][:-1]
        if value == "others":
            retval = "otherRNA"
        elif value == "pseudo":
            retval = "Pseudogene"
        elif value == "unknown":
            retval = "unknown RNA"
        else:
            retval = value
    # regular expressions didn't match -> continue with direct string matching
    elif ("https://www.mirbase.org/" in uri): 
        retval = "miRNA"
    elif ("https://www.addgene.org/" in uri):
        retval = "gRNA"
    elif ("https://www.ncbi.nlm.nih.gov/nuccore/" in uri):
        retval = "Viral RNA"
    elif ("http://web.mit.edu/sirna/" in uri):
        retval = "s(i/h)RNA"
    elif ("https://hanlab.uth.edu/HeRA" in uri): 
        retval = "eRNA"
    elif ("http://bigdata.ibp.ac.cn/piRBase" in uri): 
        retval = "piRNA"
    elif ("http://scottgroup.med.usherbrooke.ca/snoDB" in uri): 
        retval = "snoRNA"  # snoDB is a database of snoRNAs
    elif ("tRNA" in uri) or ("trna" in uri) or ("TRNA" in uri):
        retval = "tRNA"
    elif ("tRF" in uri) or ("trf" in uri):
        retval = "tRF"
    elif ("tsRNA" in uri): 
        retval = "tsRNA"
    elif ("https://go.drugbank.com/drugs/" in uri): 
        retval = "RNA drug"
    elif ("https://eskip-finder.org" in uri): 
        retval = "ASO"
    elif ("https://www.aptagen.com/aptamer-details" in uri): 
        retval = "Aptamer"
    elif ("retained_intron" in uri): 
        retval = "Retained intron"
    elif ("tbdb.io/tboxes/" in uri) or ("penchovsky" in uri):
        retval = "Riboswitch"
    elif ("http://rfamlive.xfam.org/" in uri):
        retval = "Ribozyme"
    elif not RNAonly:    
        if ("http://purl.obolibrary.org/obo/MONDO" in uri) or ("purl.obolibrary.org/obo/DOID" in uri) or ("ghr.nlm.nih.gov/condition" in uri) or ("rarediseases.info.nih.gov/diseases" in uri):
            retval = "Disease"
        elif ("purl.obolibrary.org/obo/IDO" in uri):
            retval = "Infectious disease"
        elif ("purl.obolibrary.org/obo/MFOMD" in uri):
            retval = "Mental disease"
        elif ("http://purl.obolibrary.org/obo/GO" in uri):
            retval = "GO"
        elif ("http://purl.obolibrary.org/obo/CHR" in uri):
            retval = "Chromosome"
        elif ("http://purl.obolibrary.org/obo/SO" in uri):
            retval = "Sequence"
        elif ("http://purl.obolibrary.org/obo/VO" in uri):
            retval = "Vaccine"
        elif ("http://purl.obolibrary.org/obo/CHEBI" in uri): 
            retval = "Chemical"
        elif ("http://purl.obolibrary.org/obo/PR" in uri) or ("http://purl.obolibrary.org/obo/vo/ontorat/PR" in uri): 
            retval = "Protein"
        elif ("http://purl.obolibrary.org/obo/PW" in uri) or ("https://reactome.org/content/detail/" in uri): 
            retval = "Pathway"
        elif ("http://purl.obolibrary.org/obo/FOODON" in uri): 
            retval = "Food"
        elif ("http://purl.obolibrary.org/obo/MF" in uri): 
            retval = "Mental functioning"
        elif ("http://purl.obolibrary.org/obo/OGMS" in uri): 
            retval = "General medical science"
        elif ("http://purl.obolibrary.org/obo/MAXO" in uri): 
            retval = "Medical action"
        elif ("https://www.ncbi.nlm.nih.gov/snp/" in uri):
            retval = "Variant (SNP)"
        elif ("http://purl.obolibrary.org/obo/NBO" in uri):
            retval = "Neuro behaviour"
        elif ("https://www.genome.gov/genetics-glossary/" in uri):
            retval = "Biological role"
        elif  ("http://purl.obolibrary.org/obo/CARO" in uri) or ("http://purl.obolibrary.org/obo/UBERON" in uri) or ("http://sig.uw.edu/fma" in uri) or ("http://purl.obolibrary.org/obo/FMA" in uri): 
            retval = "Anatomy"  
        elif  ("http://purl.obolibrary.org/obo/NCIT" in uri): 
            retval = "NCI thesaurus" 
        elif ("http://purl.obolibrary.org/obo/FBbt" in uri):
            retval = "Drosophila anatomy" 
        elif ("http://purl.obolibrary.org/obo/CL_" in uri): 
            retval = "Cell"
        elif ("http://purl.obolibrary.org/obo/CLO" in uri) or ("http://www.ebi.ac.uk/cellline" in uri): 
            retval = "Cell line"
        elif ("http://purl.obolibrary.org/obo/HP" in uri) or ("http://purl.obolibrary.org/obo/PATO" in uri) or ("http://purl.obolibrary.org/obo/UPHENO" in uri): 
            retval = "Phenotype"
        elif ("http://purl.obolibrary.org/obo/GNO" in uri): 
            retval = "Glycan"
        elif ("http://purl.obolibrary.org/obo/BFO" in uri): 
            retval = "Basic formal"
        elif ("http://purl.obolibrary.org/obo/ENVO" in uri): 
            retval = "Environment"
        elif ("http://purl.obolibrary.org/obo/ECTO" in uri): 
            retval = "Environmental exposure"
        elif ("www.ncbi.nlm.nih.gov/gene" in uri) or ("http://purl.obolibrary.org/obo/OGG" in uri):
            retval = "Gene"
        # Note: OGG URIs are already typed as "Gene" by the branch above,
        # so a separate "Genome" branch for OGG would never be reached.
        elif ("www.ncbi.nlm.nih.gov/Taxonomy/Browser" in uri) or ("purl.obolibrary.org/obo/NCBITaxon" in uri): 
            retval = "Species"
        elif ("https://www.encodeproject.org/targets" in uri): 
            retval = "Epigenetic modification"
        elif ("crdd.osdd.net/raghava/dbem?" in uri): 
            retval = "Histone modification"
        elif ("bigdata.ibp.ac.cn/SmProt/SmProt.php?ID" in uri): 
            retval = "Small protein"
        elif ("snomedct" in uri) or ("SNOMEDCT" in uri): 
            retval = "snomedct"
        elif ("http://www.ebi.ac.uk/efo/EFO" in uri): 
            retval = "Experimental factor"
        elif ("http://purl.obolibrary.org/obo/HsapDv" in uri): 
            retval = "Human developmental stage"
        elif ("http://www.w3.org/2002/07/owl#Nothing" in uri): 
            retval = "owlNothing"
        elif ("http://purl.obolibrary.org/obo/" in uri):    # all unmapped obo types are dealt with here
            retlist = re.split(r"obo/",uri)
            retval = retlist[1].split('_', 1)[0]
            
    else:
        retval = None

    return retval

In [None]:
%%time
ntypes_list = []
for u in tqdm(nodes_df["name"].values):
    nty = uri2ntype(u)
    ntypes_list.append(nty)

nodes_df.loc[:,"type"] = ntypes_list
nodes_df.tail()

In [None]:
nodes_df['type'].unique()

In [None]:
print("Unassigned node types:", nodes_df.type.isna().sum())
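# assign the result back: an in-place fillna on a column accessor may not modify the dataframe
nodes_df['type'] = nodes_df['type'].fillna('undefined')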

nodes_df = nodes_df.dropna()
nodes_df.tail()

### Build the edges dataframe

In [None]:
# collect the distinct predicate URIs of the graph: each one identifies an edge type
ety_df = pd.DataFrame(set(fulldata[colnames[1]]),columns=["name"])
ety_df.head()

In [None]:
# split camel case strings, e.g., "overexpressedIn" --> "overexpressed in"
def camel_case_split(identifier):
    matches = re.finditer('.+?(?:(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])|$)', identifier)
    return [m.group(0) for m in matches]

# automatically match all OBO items for the specified ontologies
hdr = {'Accept': 'application/json'}
ontos = ["ro","bspo","vo","clo","mondo","ogg","cl","mf"]
tomatch = "http://purl.obolibrary.org/obo/"

def uri2etype(uri: str)->Union[str,None]:
    label = None
    for oy in ontos:
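        baseuri = f"https://www.ebi.ac.uk/ols4/api/ontologies/{oy}/properties?iri={uri[1:-1]}"
        try:
            # headers must be passed by keyword: the second positional argument of requests.get is `params`
            res = requests.get(baseuri, headers=hdr, timeout=30).json()
            label=res['_embedded']['properties'][0]['label']
            label=label.lower().capitalize()
        except (requests.RequestException, ValueError, KeyError, IndexError):
            # the request failed or this ontology has no property with this IRI: try the next one
            pass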

    if ("http://www.w3.org/1999/02/22-rdf-syntax-ns#type" in uri):   
        label = "Type"  
    elif ("http://www.w3.org/2000/01/rdf-schema#subClassOf" in uri):   
        label = "SubClass of" 
    # new manually set edges by splitting over the # symbol of the uri
    elif ("http://semanticscience.org/resource/SIO_000420" in uri):   
        label = "Has expression" 
    elif ("http://purl.obolibrary.org/obo/CLO_0054408" in uri):   
        label = "Overexpresses gene"
    elif ("http://purl.obolibrary.org/obo/CLO_0054409" in uri):   
        label = "Adenoma formation induced by cell lineage cells in mice"
    elif ("http://purl.obolibrary.org/obo/uberon/core" in uri) or \
         ("http://purl.obolibrary.org/obo/mondo" in uri) or \
         ("http://purl.obolibrary.org/obo/so" in uri) or \
         ("http://purl.obolibrary.org/obo/envo" in uri) or \
         ("http://purl.obolibrary.org/obo/pr" in uri) or \
         ("http://purl.obolibrary.org/obo/pato" in uri) or \
         ("http://purl.obolibrary.org/obo/pw" in uri) or \
         ("http://purl.obolibrary.org/obo/exo.obo" in uri) or \
         ("http://purl.obolibrary.org/obo/cl" in uri) or \
         ("http://purl.obolibrary.org/obo/nbo" in uri) or \
         ("http://purl.obolibrary.org/obo/MF" in uri) or \
         ("http://www.obofoundry.org/ro" in uri) or \
         ("http://purl.obolibrary.org/obo/chebi" in uri):
            label = uri[1:-1].split("#")[1]
            label = '_'.join(camel_case_split(label))
            label = label.replace('_',' ').lower().capitalize()
    else:
        pass
    
    return label

In [None]:
%%time
etypes_list = []
for u in tqdm(ety_df["name"].values):
    ety = uri2etype(u)
    etypes_list.append(ety)

ety_df.loc[:,"type"] = etypes_list
ety_df.tail()

In [None]:
print("Unassigned edge types:",ety_df.type.isna().sum())
#ety_df.type.fillna('undefined',inplace=True)
#ety_df = ety_df.dropna()

In [None]:
%%time
# add the type column to the original graph structure
edges_df = fulldata.copy()

# map each predicate URI to its label through a dictionary (much faster than a row-wise search)
etype_map = dict(zip(ety_df['name'], ety_df['type']))
edges_df["type"] = edges_df[colnames[1]].map(etype_map)
edges_df.tail()

***
# Neo4j  <a class="anchor" id="neo4j"></a>
We start from properties (metadata) stored by PheKnowLator when generating the OWLNETS version of our KG.

In [None]:
properties = pd.read_csv('../resources/knowledge_graphs/PheKnowLator_v3.1.1_full_instance_inverseRelations_OWLNETS_NodeLabels.txt',
            sep='\t')
node_properties = properties[properties['entity_type'] == 'NODES']
node_properties

In [None]:
neo4jnodes_df = nodes_df.copy()
neo4jnodes_df = pd.merge(neo4jnodes_df, node_properties, left_on='name', right_on='entity_uri', how='outer')
neo4jnodes_df = neo4jnodes_df[['name', 'type', 'label', 'description/definition', 'synonym']]
neo4jnodes_df

In [None]:
neo4jedges_df = ety_df.copy()
edge_properties = properties[properties['entity_type'] == 'RELATIONS']
neo4jedges_df = pd.merge(neo4jedges_df, edge_properties, left_on='name', right_on='entity_uri', how='outer')
neo4jedges_df = neo4jedges_df[['name', 'type', 'description/definition', 'synonym']]
neo4jedges_df = neo4jedges_df.rename(columns={'type':'label'})
neo4jedges_df

In [None]:
neo4jedges_df = pd.merge(neo4jedges_df, edges_df, left_on='name', right_on='predicate', how='outer')
neo4jedges_df = neo4jedges_df[['subject','predicate','object','label','description/definition','synonym']]
neo4jedges_df

### gene-disease from [Human Disease Molecular Mechanisms](https://github.com/callahantiff/PheKnowLator/wiki/Building-a-KG-of-Human-Disease-Molecular-Mechanisms) (PKT-built)
With Neo4j we can also store metadata on individual edges. Let's start with the `gene-disease` edges.

In [None]:
# Utility function: collapse all rows describing the same relation (same subject and object, which are
# the pivot columns for join operations) into one row, joining the distinct values of every other
# column with '|'. See the next cell for an example of how it works
def merge_rows(df, column1, column2):
    df = df.drop_duplicates()
    # sorted() makes the order of the joined values deterministic
    df_merged = df.groupby([column1, column2]).agg(lambda x: '|'.join(sorted(set(str(i) for i in x if pd.notnull(i))))).reset_index()
    return df_merged.drop_duplicates()

In [None]:
data = {'a': [1, 1, 1],
    'b': [2, 2, 3],
    'c': ['£', '%', '$']}

df = pd.DataFrame(data)
print(df)
print('merge_rows:')
print(merge_rows(df, 'a', 'b'))

We need to process the data manually (i.e., reproduce what PKT does for us when it processes the `resource_info` input) in order to attach properties to individual edges.
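
In outline, once a source has been reduced to one row per (subject, object) pair, its properties can be attached to the KG edges with a left join. The sketch below uses toy identifiers and hypothetical source names (`SRC_A`, `SRC_B`); only the join pattern mirrors what the following cells do.

```python
import pandas as pd

# Toy KG edges (hypothetical identifiers)
toy_edges = pd.DataFrame({
    "subject": ["g1", "g2"],
    "predicate": ["causes", "causes"],
    "object": ["d1", "d2"],
})

# Toy per-edge properties, already collapsed to one row per (subject, object)
toy_props = pd.DataFrame({
    "subject": ["g1"],
    "object": ["d1"],
    "score": [0.7],
    "Source(s)": ["SRC_A|SRC_B"],
})

# A left join keeps every KG edge; edges without metadata get NaN properties
enriched = pd.merge(toy_edges, toy_props, on=["subject", "object"], how="left")
print(enriched)
```

Note that joining on subject and object alone attaches the properties to every predicate that links the same pair of nodes, so the predicate may need to be added to the join keys when a pair is connected by more than one relation.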

In [None]:
list(pd.read_csv('../resources/resource_info.txt', sep='|', header=None).loc[0])

In [None]:
gene_disease = pd.read_csv(edge_data_location + 'gene-disease_curated_gene_disease_associations.tsv', sep='\t', names=[
    'geneId','geneSymbol','DSI','DPI','diseaseId','diseaseName','diseaseType','diseaseClass','diseaseSemanticType','score','EI',
    'YearInitial','YearFinal','NofPmids','NofSnps','Source(s)']).drop(columns=['geneSymbol','diseaseName','diseaseClass',
                                                                               'diseaseSemanticType','diseaseType'])
gene_disease['geneId'] = '<http://www.ncbi.nlm.nih.gov/gene/' + gene_disease['geneId'].astype(str) + '>'
gene_disease = gene_disease[gene_disease['EI'] >= 1.0]
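# strip only a trailing '.0': an unanchored '.0' treated as a regex would match any character followed by 0
gene_disease['YearInitial'] = gene_disease['YearInitial'].astype('str').str.replace(r'\.0$', '', regex=True)
gene_disease['YearFinal'] = gene_disease['YearFinal'].astype('str').str.replace(r'\.0$', '', regex=True)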
gene_disease['Source(s)'] = gene_disease['Source(s)'].str.replace(';', '|')

gene_disease = pd.merge(gene_disease, pd.read_csv(processed_data_location + 'DISEASE_MONDO_MAP.txt', sep='\t', header=None),
                        left_on='diseaseId', right_on=0).drop(columns=[0,'diseaseId'])
gene_disease[1] = '<http://purl.obolibrary.org/obo/' + gene_disease[1] + '>'
gene_disease.rename(columns={'geneId':'subject', 1:'object'}, inplace=True)
gene_disease = gene_disease[['subject', 'object', 'DSI', 'DPI', 'YearInitial', 'YearFinal', 'NofPmids', 'NofSnps', 'score', 'EI', 'Source(s)']]
gene_disease = merge_rows(gene_disease, 'subject', 'object')
gene_disease

In [None]:
test = pd.merge(neo4jedges_df, gene_disease, on=['subject','object'], how='left')
test[test['subject'] == '<http://www.ncbi.nlm.nih.gov/gene/100126334>']

***
### gene-miRNA from [TarBase](https://dianalab.e-ce.uth.gr/html/diana/web/index.php?r=tarbasev8/index)

In [None]:
list(pd.read_csv('../resources/resource_info.txt', sep='|', header=None).loc[1])

In [None]:
gene_miRNA = pd.read_csv(edge_data_location + 'gene-miRNA_TarBase_v8_download.txt',
                         names=['geneId','geneName','mirna','species','cell_line','tissue','category','method','positive_negative',
                                'direct_indirect',
                                'up_down','condition','transcript(3p/5p)'], sep='\t').drop(columns=['geneName','species'])

gene_miRNA = pd.merge(gene_miRNA, pd.read_csv(processed_data_location + 'ENSEMBL_GENE_ENTREZ_GENE_MAP.txt', sep='\t', header=None),
                      left_on='geneId', right_on=0).drop(columns=['geneId',0,2,3,4,5])
gene_miRNA[1] = '<http://www.ncbi.nlm.nih.gov/gene/' + gene_miRNA[1].astype(str) + '>'
gene_miRNA.rename(columns={1:'subject'}, inplace=True)
gene_miRNA = pd.merge(gene_miRNA, pd.read_csv(processed_data_location + 'MIRBASE_ID_ACCESSION_MAP.txt', sep='\t', header=None),
                      left_on='mirna', right_on=0).drop(columns=['mirna',0])
gene_miRNA[1] = '<https://www.mirbase.org/hairpin/' + gene_miRNA[1] + '>'
gene_miRNA.rename(columns={1:'object'}, inplace=True)
gene_miRNA = gene_miRNA[['subject', 'object'] + [col for col in gene_miRNA.columns if col not in ['subject', 'object']]]
gene_miRNA['Source(s)'] = 'TarBase'

gene_miRNA = merge_rows(gene_miRNA, 'subject', 'object')
gene_miRNA

We need to ground properties as much as possible. For instance, strings such as `HITS-CLIP` should be mapped to an appropriate ontology of biological experiments.
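
Grounding a multi-valued cell can be sketched as a dictionary look-up applied to each '|'-separated label, leaving unknown labels untouched. The helper name `ground` and the identifiers `EX_0000001`/`EX_0000002` are placeholders invented for this illustration; they are not real ontology terms.

```python
import numpy as np
import pandas as pd

# Hypothetical look-up table: free-text label -> placeholder ontology identifier
toy_map = {"hits-clip": "EX_0000001", "microarray": "EX_0000002"}

def ground(cell, lookup, base="http://purl.obolibrary.org/obo/"):
    """Replace each '|'-separated label with '<base><id> (<label>)' when it is known."""
    if pd.isna(cell):
        return np.nan
    parts = []
    for label in cell.split("|"):
        key = label.strip().lower()
        parts.append(f"{base}{lookup[key]} ({key})" if key in lookup else label)
    return "|".join(parts)

print(ground("HITS-CLIP|Unknown assay", toy_map))
```

A single parameterized helper of this kind could stand in for the per-ontology functions defined below, since they differ only in the dictionary they consult.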

In [None]:
gene_miRNA.condition = gene_miRNA.condition.replace('', np.nan)
gene_miRNA.up_down = gene_miRNA.up_down.replace('', np.nan)

In [None]:
gene_miRNA['cell_line'].unique()

In [None]:
gene_miRNA.cell_line = gene_miRNA.cell_line.replace('', np.nan).str.lower()
gene_miRNA.cell_line = gene_miRNA.cell_line.str.replace('cells', 'cell')
# I prepared this look-up table in advance; the method to generate it is the same one we saw in the first notebook
# (I consider labels and synonyms in CLO, which is a community-based ontology of cell lines)
desc_clo_map = pd.read_csv(processed_data_location + 'DESC_CLO_MAP.txt', header=None, sep='\t')
desc_clo_map2 = desc_clo_map.copy()
# strip the longer suffix first: removing ' cell' before ' cells' would leave a stray 's'
desc_clo_map2[0] = desc_clo_map2[0].str.replace(' cells', '')
desc_clo_map2[0] = desc_clo_map2[0].str.replace(' cell', '')
desc_clo_map = pd.concat([desc_clo_map, desc_clo_map2])
clo_dict = dict(zip(desc_clo_map[0], 'http://purl.obolibrary.org/obo/' + desc_clo_map[1] +
                      ' (' + desc_clo_map[0] + ')'))

def replace_with_clo(substring):
    if pd.isna(substring):
        return np.nan
    else:
        return '|'.join([clo_dict.get(part, part) for part in substring.split('|')])

gene_miRNA.cell_line = [replace_with_clo(item) for item in gene_miRNA.cell_line]

gene_miRNA.cell_line.unique()[:5]

In [None]:
gene_miRNA.tissue.unique()[:5]

In [None]:
gene_miRNA.tissue = gene_miRNA.tissue.replace('', np.nan).str.lower()
gene_miRNA.tissue = gene_miRNA.tissue.str.replace("/", '|')
gene_miRNA.tissue = gene_miRNA.tissue.str.replace("larva, whole", 'larva')
gene_miRNA.tissue = gene_miRNA.tissue.str.replace("peripheral blood", 'http://purl.obolibrary.org/obo/BTO_0000553 (peripheral blood)')
gene_miRNA.tissue = gene_miRNA.tissue.str.replace("nervous", 'http://purl.obolibrary.org/obo/UBERON_0003714 (neural tissue)')
gene_miRNA.tissue = gene_miRNA.tissue.str.replace("gastric", 'http://purl.obolibrary.org/obo/UBERON_0000945 (stomach)')
gene_miRNA.tissue = gene_miRNA.tissue.str.replace("synovial", 'http://purl.obolibrary.org/obo/UBERON_0002217 (synovial joint)')
gene_miRNA.tissue = gene_miRNA.tissue.str.replace("ewing sarcoma", 'http://purl.obolibrary.org/obo/MONDO_0012817 (ewing sarcoma)')

# I already prepared this look-up table and I won't bore you on that, btw the method to generate it is the same we saw in the first notebook
# (I consider labels and synonyms in the Uberon ontology which is an integrated cross-species anatomy ontology)
desc_uberon_map = pd.read_csv(processed_data_location + 'DESC_EXT_MAP.txt', header=None, sep='\t')
uberon_dict = dict(zip(desc_uberon_map[0], 'http://purl.obolibrary.org/obo/' + desc_uberon_map[1] +
                      ' (' + desc_uberon_map[0] + ')'))

def replace_with_uberon(substring):
    if pd.isna(substring):
        return np.nan
    else:
        return '|'.join([uberon_dict.get(part, part) for part in substring.split('|')])

gene_miRNA.tissue = [replace_with_uberon(item) for item in gene_miRNA.tissue]

gene_miRNA.tissue.unique()[:5]

In [None]:
gene_miRNA['method'].unique()[:5]

In [None]:
gene_miRNA.method = gene_miRNA.method.replace('', np.nan).str.lower()
gene_miRNA.method = gene_miRNA.method.str.replace("microarrays", 'microarray')
# I already prepared this look-up table and I won't bore you on that, btw the method to generate it is the same we saw in the first notebook
# (I consider labels and synonyms in the NCIT ontology which is a reference terminology in biology and medicine)
desc_ncit_map = pd.read_csv(processed_data_location + 'DESC_NCIT_MAP.txt', header=None, sep='\t')
ncit_dict = dict(zip(desc_ncit_map[0], 'http://purl.obolibrary.org/obo/' + desc_ncit_map[1] +
                      ' (' + desc_ncit_map[0] + ')'))

def replace_with_ncit(substring):
    if pd.isna(substring):
        return np.nan
    else:
        return '|'.join([ncit_dict.get(part, part) for part in substring.split('|')])

gene_miRNA.method = [replace_with_ncit(item) for item in gene_miRNA.method]

gene_miRNA.method.unique()[:5]

In [None]:
gene_miRNA

In [None]:
gene_miRNA.category.unique()

In [None]:
gene_miRNA.category = gene_miRNA.category.str.replace(',','|')
gene_miRNA.category = gene_miRNA.category.str.replace('Embryonic/Fetal',
                                                      'http://purl.obolibrary.org/obo/NCIT_C34100 (embryonic)')
gene_miRNA.category = gene_miRNA.category.str.replace('Normal/Primary',
                                                      'http://purl.obolibrary.org/obo/EFO_0009654 (reference sample)')
gene_miRNA.category = gene_miRNA.category.str.replace('Cancer/Malignant', 'http://purl.obolibrary.org/obo/MONDO_0004992 (cancer)')
gene_miRNA.category = gene_miRNA.category.str.replace('Stem/Progenitor',
                                                      'http://purl.obolibrary.org/obo/CLO_0037224 (stem cell line cell)')
gene_miRNA.category = gene_miRNA.category.str.replace('Pulmonary Artery',
                                                      'http://purl.obolibrary.org/obo/NCIT_C12774 (pulmonary artery)')

In [None]:
gene_miRNA

***
### miRNA-disease from [miR2Disease](http://watson.compbio.iupui.edu:8080/miR2Disease/)

In [None]:
list(pd.read_csv('../resources/resource_info.txt', sep='|', header=None).loc[2])

In [None]:
miRNA_disease = pd.read_csv(edge_data_location + 'miRNA-disease_miR2Disease.txt',
                            names=['DO','miRNA','Disease','Condition','Experiment','Year','Description'], sep='\t').drop(columns='Disease')
miRNA_disease = pd.merge(miRNA_disease, pd.read_csv(processed_data_location + 'DISEASE_DOID_MONDO_Map.txt', sep='\t', header=None),
                         left_on='DO', right_on=0).drop(columns=[0,'DO'])
miRNA_disease = pd.merge(miRNA_disease, pd.read_csv(processed_data_location + 'MIRBASE_ID_ACCESSION_MAP.txt', sep='\t', header=None),
                         left_on='miRNA', right_on=0).drop(columns=[0,'miRNA'])

miRNA_disease['1_x'] = '<http://purl.obolibrary.org/obo/' + miRNA_disease['1_x'].astype(str) + '>'
miRNA_disease['1_y'] = '<https://www.mirbase.org/hairpin/' + miRNA_disease['1_y'].astype(str) + '>'
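# strip only a trailing '.0': an unanchored '.0' treated as a regex would match any character followed by 0
miRNA_disease['Year'] = miRNA_disease['Year'].astype('str').str.replace(r'\.0$', '', regex=True)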

miRNA_disease = miRNA_disease[['1_y','1_x','Condition','Experiment','Year','Description']]
miRNA_disease.rename(columns={'1_x':'object','1_y':'subject'}, inplace=True)
miRNA_disease['Source(s)'] = 'miR2Disease'
miRNA_disease.head()

In [None]:
miRNA_disease.Condition.unique()

In [None]:
miRNA_disease[miRNA_disease['Condition'] == 'hepatocellular carcinoma (HCC)']
# hepatocellular carcinoma (HCC) is an error due to the fact that this DB is manually compiled
miRNA_disease.Condition = miRNA_disease.Condition.replace('hepatocellular carcinoma (HCC)', np.nan)

In [None]:
miRNA_disease.Condition = miRNA_disease.Condition.str.replace("down-regulated",
                                                              'http://purl.obolibrary.org/obo/OMIT_0016265 (down-regulation)')
miRNA_disease.Condition = miRNA_disease.Condition.str.replace("up-regulated",
                                                              "http://purl.obolibrary.org/obo/OMIT_0016489 (up-regulation)")
miRNA_disease.Condition = miRNA_disease.Condition.str.replace("normal",
                                                              "http://purl.obolibrary.org/obo/OBI_0002584 (differential expression analysis data)")

In [None]:
miRNA_disease.Experiment.unique()

In [None]:
miRNA_disease.Experiment = miRNA_disease.Experiment.str.replace('microarray', 'http://purl.obolibrary.org/obo/NCIT_C44282 (microarray)')
miRNA_disease.Experiment = miRNA_disease.Experiment.str.replace('Northern blot, qRT-PCR etc', 'northern blot, qRT-PCR etc')
miRNA_disease.Experiment = miRNA_disease.Experiment.str.replace('northern blot', 'http://purl.obolibrary.org/obo/NCIT_C16355 (northern blotting)')
# keep the '|' separator between the two grounded methods, as in the other multi-valued columns
miRNA_disease.Experiment = miRNA_disease.Experiment.str.replace(', qRT-PCR etc', '|http://purl.obolibrary.org/obo/NCIT_C28408 (qrt-pcr)')

In [None]:
miRNA_disease

In [None]:
neo4jnodes_df.to_csv(metadata_location + 'nodes.csv', index=False)
neo4jedges_df.to_csv(metadata_location + 'relationships.csv', index=False)
gene_disease.to_csv(metadata_location + 'gene_disease.csv', index=False)
gene_miRNA.to_csv(metadata_location + 'gene_miRNA.csv', index=False)
miRNA_disease.to_csv(metadata_location + 'miRNA_disease.csv', index=False)

In the subsequent chunks, we write Cypher scripts to import our KG into Neo4j. Since they are time-consuming, you can skip the following steps: our KG is available at http://fievel.anacleto.di.unimi.it:7474/browser/. The data are stored in the Neo4j import folder (`/var/lib/neo4j/import`).
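
For orientation, an import statement for the node file could be assembled as in the sketch below. This is only an illustration under stated assumptions, not the script actually used for RNA-KG: the helper `node_import_query` and the node label `Node` are hypothetical, and the statement is built as a string without being sent to any server. With Neo4j's default configuration, a `file:///` URL in `LOAD CSV` is resolved relative to the import folder.

```python
def node_import_query(csv_name, key, properties, label="Node"):
    """Build a Cypher LOAD CSV statement as a string (it is not executed here)."""
    # Backticks allow column names that are not plain identifiers
    set_clause = ", ".join(f"n.`{p}` = row.`{p}`" for p in properties)
    return (
        f"LOAD CSV WITH HEADERS FROM 'file:///{csv_name}' AS row\n"
        f"MERGE (n:{label} {{`{key}`: row.`{key}`}})\n"
        f"SET {set_clause}"
    )

query = node_import_query("nodes.csv", key="name", properties=["type", "label", "synonym"])
print(query)
```

Using `MERGE` on the node name rather than `CREATE` keeps the import idempotent, at the cost of a look-up per row; an index or uniqueness constraint on the key property usually makes that look-up affordable.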

If something goes wrong...

![Neo4j meta-graph](https://github.com/emanuelecavalleri/testRNA-KG/assets/33032169/8aa80e9c-4371-4705-bea2-3546f3234fab "Neo4j meta-graph")

We can retrieve "autosomal dominant nonsyndromic hearing loss 15" (MONDO:0011226) disease or miRNA attributes with ease.



We can retrieve useful statistics by sampling some nodes and reporting on property and relationship counts per node.

We can show a meta-graph containing gene-miRNA interactions that occur only in HeLa cell lines (http://purl.obolibrary.org/obo/CLO_0003684). We can notice a small central well-connected core, typical of (scale-free) biological networks. We can also analyse one of the central nodes, which is https://www.mirbase.org/hairpin/MI0000063 (stem-loop hsa-let-7b). For instance, it interacts with BAX (a human apoptosis regulator gene, Gene ID: 581). We can explore this relationship more in depth.
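
A comparable selection can be approximated in pandas before the import, as a rough sketch: keep the edges whose `cell_line` property mentions the HeLa identifier, then count how many of them touch each miRNA. The rows below are toy data with hypothetical node names; they assume the `URI (label)` values joined by `|` that the grounding cells above produce.

```python
import pandas as pd

HELA = "http://purl.obolibrary.org/obo/CLO_0003684"

# Toy gene-miRNA edges (hypothetical node names)
toy = pd.DataFrame({
    "subject": ["g1", "g2", "g3"],
    "object": ["m1", "m1", "m2"],
    "cell_line": [f"{HELA} (hela cell)", f"other cell line|{HELA} (hela cell)", None],
})

# Literal (non-regex) substring match; rows without a cell line are excluded
hela_edges = toy[toy["cell_line"].str.contains(HELA, regex=False, na=False)]

# Number of HeLa-specific edges per miRNA, a rough proxy for centrality
degree = hela_edges["object"].value_counts()
print(hela_edges)
print(degree)
```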

![hela subnetwork](https://github.com/emanuelecavalleri/testRNA-KG/assets/33032169/7be56b86-abe0-40be-a1ec-5e79e0788246 "hela subnetwork")

![hsa-let-7b subsubnetwork](https://github.com/emanuelecavalleri/testRNA-KG/assets/33032169/9f802d8d-7567-4edb-857d-1c84e28898b4 "hsa-let-7b subsubnetwork")

***
# GRAPE  <a class="anchor" id="grape"></a>

In [None]:
# set the names for persistence files
pik_nodes = "../../nodes.pkl"
pik_edges = "../../edges.pkl"

In [None]:
# save the nodes and edges dataframe to pickle files for persistence
nodes_df.to_pickle(pik_nodes)
edges_df.to_pickle(pik_edges)

In [None]:
# load the nodes and edges dataframe from pickle files
nodes_df = pd.read_pickle(pik_nodes)
edges_df = pd.read_pickle(pik_edges)

In [None]:
%%time
# load it into a graph
graph = Graph.from_pd(
    edges_df=edges_df,
    nodes_df=nodes_df,
    node_name_column="name",
    node_type_column="type",
    edge_src_column="subject",
    edge_dst_column="object",
    #edge_weight_column="weight",
    edge_type_column="type",
    node_types_separator="|",
    directed=True,
    name="graph",
)

graph

In [None]:
htmlrep = fpath+"RNAgraphReport.html"

# the context manager closes the file even if the write fails
with open(htmlrep, "w") as ff:
    ff.write(str(graph))

In [None]:
graph.get_diameter()

In [None]:
graph = Graph.from_pd(
    edges_df=edges_df,
    nodes_df=nodes_df,
    node_name_column="name",
    node_type_column="type",
    edge_src_column="subject",
    edge_dst_column="object",
    #edge_weight_column="weight",
    edge_type_column="type",
    node_types_separator="|",
    directed=False,
    name="graph",
)

engine = Node2VecCBOWEnsmallen(walk_length=5)
embedding = engine.fit_transform(graph)
vis = GraphVisualizer(graph)

vis.fit_edges(embedding)
vis.plot_edge_types(k=9)

In [None]:
vis.fit_nodes(embedding)
vis.plot_node_types(k=9)

### Predictions using Node2Vec

In [None]:
train, test = graph.connected_holdout(train_size=0.7)
train.enable()

vis = GraphVisualizer(
    graph=test,
    support=graph
)

vis.fit_negative_and_positive_edges(embedding)
vis.plot_positive_and_negative_edges()

***
# ASSIGNMENTS

- Let us go back to the `gene_miRNA` dataframe. It has 'positive_negative', 'direct_indirect', and 'up_down' columns, i.e., metadata on individual edges. One way to refine this KG is to promote these three properties to specific edge types. Check [RO](https://www.ebi.ac.uk/ols4/ontologies/ro) to retrieve relations more accurate than the generic "interacts with", split the dataframe according to these three new relations, and generate the refined KG using PheKnowLator. This refinement increases the semantic (and biological) information stored within your KG.

- Moreover, the `gene_miRNA` dataframe has a "transcript(3p/5p)" column. We can improve the KG by distinguishing stem-loop (hairpin) miRNAs from mature sequences. The look-up table for mapping mature miRNA sequences to miRBase identifiers is already provided. Remember to store the -5p/-3p information about mature miRNA molecules as a node property. What should you do if you find a "5p|3p" cell?
    - hsa-mir-15a --> https://www.mirbase.org/hairpin/MI0000069 (we already know it...)
    - hsa-miR-15a-5p --> https://www.mirbase.org/mature/MIMAT0000068
    - hsa-miR-15a-3p --> https://www.mirbase.org/mature/MIMAT0004488

- The previous improvement distinguishes mature and stem-loop sequences, but mature sequences are derived from stem-loops! We can store this information within the KG as a new bidirectional edge (look for "develops"-like relationships stored in RO on [OLS4](https://www.ebi.ac.uk/ols4/)). This is a good example of how we can infer biological information just by studying the semantics of the KG domain, without considering any new dataset.  
    - hsa-mir-15a --> develops --> hsa-miR-15a-5p (hint: employ an RO property with an appropriate inverse)
    - hsa-mir-15a --> develops --> hsa-miR-15a-3p

- To improve the look-up tables, you can move from exact 1-to-1 matching to threshold-based matching, e.g., map both "Parkinson disease" and "Parkinsons disease" to MONDO:0005180 (a suitable threshold could be 95%). Thresholds have to be validated empirically for each ontology; for example, a CLO mapping requires a lower threshold than a Mondo one. These assignments show that KG construction and maintenance form a circular process that continually improves the final KG by identifying edge and node subtypes and by enhancing grounding/mapping.

- Analyze your newly generated KG(s) with GRAPE.
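
As a starting point for the threshold-based mapping, one possible sketch scores a label against every key of a look-up table and accepts the best candidate only when its similarity reaches the threshold. The helper name `best_match` and the one-entry table are assumptions made for illustration, and `difflib` is just one of several similarity measures that could be used.

```python
from difflib import SequenceMatcher

def best_match(label, lookup, threshold=0.95):
    """Return the identifier of the most similar key, or None if it scores below the threshold."""
    best_key, best_score = None, 0.0
    for key in lookup:
        score = SequenceMatcher(None, label.lower(), key.lower()).ratio()
        if score > best_score:
            best_key, best_score = key, score
    return lookup[best_key] if best_score >= threshold else None

# One-entry toy table (the identifier is the one quoted in the assignment)
toy_mondo = {"Parkinson disease": "MONDO:0005180"}

print(best_match("Parkinsons disease", toy_mondo))  # similarity 34/35, above the threshold
print(best_match("Alzheimer disease", toy_mondo))   # far below the threshold
```

Keeping the threshold as a parameter makes it straightforward to tune it separately for each ontology.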

In [None]:
gene_miRNA[['subject','object','positive_negative','direct_indirect','up_down','transcript(3p/5p)']].head()

In [None]:
from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

print(similar("Parkinson disease", "Parkinson disease"))
print(similar("Parkinson disease", "Parkinsons disease"))
print(similar("Parkinson disease", "Parkinsonss disease"))