# <p style="text-align: center;">Construction of a simplified RNA-based Knowledge Graph</p>

**Objective:** [SDM-RDFizer](https://github.com/SDM-TIB/SDM-RDFizer) is an interpreter of mapping rules that transforms (un)structured data into RDF Knowledge Graphs. In this notebook, we show how to use RDFizer to generate an RNA-based Knowledge Graph that involves **gene-disease**, **gene-miRNA**, and **miRNA-disease** relationships. These relationships are available from the following public repositories:
- [gene-disease](https://storage.googleapis.com/pheknowlator/current_build/data/original_data/curated_gene_disease_associations.tsv), provided by PheKnowLator
- [gene-miRNA](https://dianalab.e-ce.uth.gr/downloads/tarbase_v8_data.tar.gz), provided by TarBase (we selected 50k random rows to reduce input size)
- [miRNA-disease](http://watson.compbio.iupui.edu:8080/miR2Disease/download/AllEntries.txt), provided by miR2Disease

The construction of the Knowledge Graph should be compliant with the following Ontologies:
- Mondo Merged Disease Ontology ([**MONDO**](https://obofoundry.org/ontology/mondo.html)), a semi-automatically constructed Ontology that merges in multiple disease resources
- Non-Coding RNA Ontology ([**NCRO**](https://obofoundry.org/ontology/ncro.html)), an ontology for non-coding RNA, both of biological origin, and engineered
- Relations Ontology ([**RO**](https://obofoundry.org/ontology/ro.html)), a collection of OWL relations (ObjectProperties) intended for use across a wide variety of biological ontologies

To achieve this, we need to map the objects of the RDF triples to ontology identifiers by means of RDFizer facilities. We will use the following mapping files:
- [disease-mondo-map](https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/DISEASE_MONDO_MAP.txt), provided by PheKnowLator
- [mirbase-ncro-map](#genemirna), obtained by means of PheKnowLator ecosystem's functions
- [doid-mondo-map](#mirnadis), obtained by means of PheKnowLator ecosystem's functions

All the data required for the execution of this notebook can be downloaded from this [testRNA-KG/RDFizer/files](https://drive.google.com/drive/u/3/folders/1e54vON8b82FpqcNGesiqiVORD3vV_z2R) Google Drive repository.

**Authors:** [E. Cavalleri](https://mail.google.com/mail/u/0/?view=cm&fs=1&tf=1&to=emanuele.cavalleri@studenti.unimi.it), [T.J. Callahan](https://mail.google.com/mail/u/0/?view=cm&fs=1&tf=1&to=callahantiff@gmail.com), [M. Mesiti](https://mail.google.com/mail/u/0/?view=cm&fs=1&tf=1&to=marco.mesiti@unimi.it)

**GitHub Repositories:** [testRNA-KG/RDFizer](https://github.com/emanuelecavalleri/testRNA-KG/tree/main/RDFizer_comparison), [PheKnowLator](https://github.com/callahantiff/PheKnowLator/)  
<!--- **Release:** **[v2.0.0](https://github.com/callahantiff/PheKnowLator/wiki/v2.0.0)** --->
  
<br>  

<a target="_blank" href="https://user-images.githubusercontent.com/33032169/225636670-056a7774-f3d6-4aee-84b1-4f462c3cf33a.png"> <img src="https://user-images.githubusercontent.com/33032169/225636670-056a7774-f3d6-4aee-84b1-4f462c3cf33a.png"></a> 

(*Click Figure to Enlarge Image in Current Browser Tab*)

<br>

***
***

## Download, load, and map data
This section downloads and loads the different types of data needed to construct the Knowledge Graph. Identifiers must be mapped so that they are compliant with the MONDO and NCRO bio-ontologies.

***
### gene-disease from [PheKnowLator](https://github.com/callahantiff/PheKnowLator)
[PheKnowLator](https://github.com/callahantiff/PheKnowLator) is a tool developed to download, process, map, and clean data in order to build edges for a Knowledge Graph. It provides a Human Disease Molecular Mechanisms Knowledge Graph that includes gene-disease associations.

In [1]:
import pandas as pd

gene_disease = pd.read_csv("https://storage.googleapis.com/pheknowlator/current_build/"+
                            "data/original_data/curated_gene_disease_associations.tsv", sep='\t')
# The original data is filtered such that only records meeting the following criteria were included:
# EI >= 1.0 (90th percentile)
gene_disease = gene_disease[gene_disease['EI']>=1.0]
gene_disease

Unnamed: 0,geneId,geneSymbol,DSI,DPI,diseaseId,diseaseName,diseaseType,diseaseClass,diseaseSemanticType,score,EI,YearInitial,YearFinal,NofPmids,NofSnps,source
0,1,A1BG,0.700,0.538,C0019209,Hepatomegaly,phenotype,C23;C06,Finding,0.30,1.0,2017.0,2017.0,1,0,CTD_human
1,1,A1BG,0.700,0.538,C0036341,Schizophrenia,disease,F03,Mental or Behavioral Dysfunction,0.30,1.0,2015.0,2015.0,1,0,CTD_human
3,2,A2M,0.529,0.769,C0007102,Malignant tumor of colon,disease,C06;C04,Neoplastic Process,0.31,1.0,2004.0,2019.0,1,0,CTD_human
4,2,A2M,0.529,0.769,C0009375,Colonic Neoplasms,group,C06;C04,Neoplastic Process,0.30,1.0,2004.0,2004.0,1,0,CTD_human
5,2,A2M,0.529,0.769,C0011265,Presenile dementia,disease,C10;F03,Mental or Behavioral Dysfunction,0.30,1.0,1998.0,2004.0,3,0,CTD_human
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
84027,106480868,RN7SKP114,0.931,0.077,C4722327,"PROSTATE CANCER, HEREDITARY, 1",disease,C04;C12,Neoplastic Process,0.30,1.0,2018.0,2018.0,1,0,CTD_human
84028,106481323,RNU6-456P,0.931,0.077,C2931456,"Prostate cancer, familial",disease,C04;C12,Neoplastic Process,0.30,1.0,2018.0,2018.0,1,0,CTD_human
84029,106481323,RNU6-456P,0.931,0.077,C4722327,"PROSTATE CANCER, HEREDITARY, 1",disease,C04;C12,Neoplastic Process,0.30,1.0,2018.0,2018.0,1,0,CTD_human
84031,107075310,MTCO2P12,0.368,0.962,C0268237,Cytochrome-c Oxidase Deficiency,disease,C16;C18,Disease or Syndrome; Congenital Abnormality,0.33,1.0,1999.0,2011.0,0,0,GENOMICS_ENGLAND


In [2]:
gene_disease[['geneId','diseaseId']].to_csv(
    'files/gene-disease.csv', index=None, header=['gene', 'disease'])

gene_disease_map = pd.read_csv("https://storage.googleapis.com/pheknowlator/current_build/"+
                                "data/processed_data/DISEASE_MONDO_MAP.txt", sep='\t', header=None)
gene_disease_map.to_csv(
    'files/DISEASE_MONDO_MAP.csv', index=None, header=['umls', 'disease'])

***
### gene-miRNA from [TarBase](https://dianalab.e-ce.uth.gr/html/diana/web/index.php?r=tarbasev8/index)<a class="anchor" id="genemirna"></a>
DIANA-TarBase v8 is a reference database devoted to the indexing of experimentally supported microRNA (miRNA) targets.

In [3]:
gene_mirna = pd.read_csv('https://dianalab.e-ce.uth.gr/downloads/tarbase_v8_data.tar.gz',
                         compression='gzip', sep="\t", dtype={"cell_line": "string"})
gene_mirna

Unnamed: 0,TarBase_v8_download.txt,geneName,mirna,species,cell_line,tissue,category,method,positive_negative,direct_indirect,up_down,condition
0,0910001A06Rik(mmu),0910001A06Rik(mmu),mmu-miR-124-3p,Mus musculus,,,Cancer/Malignant,Microarrays,POSITIVE,INDIRECT,UP,
1,1200004M23Rik(mmu),1200004M23Rik(mmu),mmu-miR-124-3p,Mus musculus,,,Cancer/Malignant,Microarrays,POSITIVE,INDIRECT,DOWN,
2,1700027J05Rik(mmu),1700027J05Rik(mmu),mmu-miR-124-3p,Mus musculus,,,Cancer/Malignant,Microarrays,POSITIVE,INDIRECT,UP,
3,1810015A11Rik(mmu),1810015A11Rik(mmu),mmu-miR-124-3p,Mus musculus,,,Cancer/Malignant,Microarrays,POSITIVE,INDIRECT,DOWN,
4,2310047A01Rik(mmu),2310047A01Rik(mmu),mmu-miR-124-3p,Mus musculus,,,Cancer/Malignant,Microarrays,POSITIVE,INDIRECT,DOWN,
...,...,...,...,...,...,...,...,...,...,...,...,...
927115,vimentin(hsa),vimentin(hsa),hsa-miR-9-5p,Homo sapiens,,,Cancer/Malignant,Western Blot,NEGATIVE,INDIRECT,,
927116,Â Â Â Â (PTPRG)(hsa),Â Â Â Â (PTPRG)(hsa),hsa-miR-146a-5p,Homo sapiens,,,Cancer/Malignant,Microarrays,POSITIVE,INDIRECT,UP,
927117,Î±-actin(mmu),Î±-actin(mmu),mmu-miR-24-3p,Mus musculus,,,Cancer/Malignant,qPCR,POSITIVE,INDIRECT,UP,
927118,Î±-actin(mmu),Î±-actin(mmu),mmu-miR-24-3p,Mus musculus,,,Cancer/Malignant,Western Blot,POSITIVE,INDIRECT,UP,


In [4]:
from typing import Dict, Tuple
from rdflib import Graph, URIRef

# Index ontology classes by their (lower-cased) label and exactMatch values
def gets_ontology_class_label(graph: Graph) -> Tuple:
    # Triples whose predicate mentions 'label' and whose subject is a URI
    dbx_uris: Dict = dict()
    dbx = [x for x in graph if 'label' in str(x[1]).lower() if isinstance(x[0], URIRef)]
    for x in dbx:
        if str(x[2]).lower() in dbx_uris.keys(): dbx_uris[str(x[2]).lower()].append(str(x[0]))
        else: dbx_uris[str(x[2]).lower()] = [str(x[0])]
    dbx_type = {str(x[2]).lower(): 'DbXref' for x in dbx}

    # Triples whose predicate mentions 'exactmatch' and whose subject is a URI
    ex_uris: Dict = dict()
    ex = [x for x in graph if 'exactmatch' in str(x[1]).lower() if isinstance(x[0], URIRef)]
    for x in ex:
        if str(x[2]).lower() in ex_uris.keys(): ex_uris[str(x[2]).lower()].append(str(x[0]))
        else: ex_uris[str(x[2]).lower()] = [str(x[0])]
    ex_type = {str(x[2]).lower(): 'ExactMatch' for x in ex}

    return {**dbx_uris, **ex_uris}, {**dbx_type, **ex_type}
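
To make the shape of the returned dictionary concrete, here is a minimal sketch of the same grouping idea applied to plain tuples instead of an rdflib graph. The function name `group_subjects_by_label`, the `example.org` URIs, and the labels are all hypothetical and chosen only for illustration. They are not taken from NCRO.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object) as plain strings

def group_subjects_by_label(triples: Iterable[Triple]) -> Dict[str, List[str]]:
    """Map each lower-cased label to the list of subjects that carry it."""
    index: Dict[str, List[str]] = defaultdict(list)
    for subject, predicate, obj in triples:
        # Same test as in the notebook: keep predicates that mention 'label'
        if 'label' in predicate.lower():
            index[obj.lower()].append(subject)
    return dict(index)

RDFS_LABEL = 'http://www.w3.org/2000/01/rdf-schema#label'

# Hypothetical toy triples (not real ontology content)
toy_triples = [
    ('http://example.org/obo/EX_0000001', RDFS_LABEL, 'Example miRNA A'),
    ('http://example.org/obo/EX_0000002', RDFS_LABEL, 'example mirna a'),
    ('http://example.org/obo/EX_0000003', RDFS_LABEL, 'Example miRNA B'),
    ('http://example.org/obo/EX_0000003', 'http://example.org/comment', 'ignored'),
]

label_index = group_subjects_by_label(toy_triples)
print(label_index)
```

Because labels are lower-cased before being used as keys, two classes whose labels differ only in case end up in the same list, which is why the values are lists rather than single URIs.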

In [5]:
# For the time being, we keep only Homo sapiens rows
gene_mirna = gene_mirna[gene_mirna['species'].str.contains("Homo sapiens", na=False)]

# Moreover, we keep only 50k (random) rows to reduce input size
gene_mirna = gene_mirna.sample(n=50000)

# NCRO does not distinguish between the "3p" and "5p" arms of a miRNA, so we drop the suffix.
# regex=True is passed explicitly: pandas >= 2.0 treats the pattern as a literal string by default
gene_mirna["mirna"] = gene_mirna["mirna"].str.replace(r'-3p$', '', regex=True)
gene_mirna["mirna"] = gene_mirna["mirna"].str.replace(r'-5p$', '', regex=True)

# If you want to learn more about why we have 5p or 3p human miRNAs:
# https://www.mirbase.org/blog/category/nomenclature/

# Again, for NCRO IDs compatibility
gene_mirna['mirna'] = gene_mirna['mirna'].str.lower()

ncro_graph = Graph().parse("http://purl.obolibrary.org/obo/ncro.owl", format='application/rdf+xml')
ncro_label = gets_ontology_class_label(ncro_graph)[0]
# Fix string patterns
mirbase_ids = {str(k): {str(i).split('/')[-1].replace(':','_') for i in v} for k, v in ncro_label.items() if 'NCRO' in str(v) and 'mir-' in str(k) and 'hsa' in str(k)}
ncro_ids = {'hsa-'+str(k): {str(i).split('/')[-1].replace(':','_') for i in v} for k, v in ncro_label.items() if 'NCRO' in str(v) and 'mir-' in str(k) and 'hsa' not in str(k)}

with open('files/MIRBASE_ID_NCRO_MAP.csv', 'w') as outfile:
    outfile.write("mirbase,ncro\n")
    for k, v in {**mirbase_ids, **ncro_ids}.items():
        outfile.write(str(k) + ',' + str(v).replace('{','').replace('\'','').replace('}','') + '\n')

gene_entrez_map = pd.read_csv('https://storage.googleapis.com/pheknowlator/current_build/'+
                               'data/processed_data/ENSEMBL_GENE_ENTREZ_GENE_MAP.txt', sep='\t', header=None)

gene_mirna.merge(gene_entrez_map, left_on=['TarBase_v8_download.txt'], right_on=[0])[[1,'mirna']].to_csv('files/gene-miRNA.csv', index=None, header=['gene', 'mirna'])
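
The miRNA name normalisation used in the cell above (drop a trailing arm suffix, then lower-case) can be sketched in isolation with the standard library. The helper name `normalise_mirna` is hypothetical, and the single pattern `-[35]p$` is a compact stand-in for the two separate replacements in the notebook. The sample identifiers are taken from the tables shown in this notebook.

```python
import re

def normalise_mirna(name: str) -> str:
    """Drop a trailing -3p/-5p arm suffix and lower-case the identifier."""
    # The `$` anchor ensures only a suffix at the very end of the name is removed
    return re.sub(r'-[35]p$', '', name).lower()

examples = ['hsa-miR-9-5p', 'hsa-miR-146a-5p', 'hsa-mir-542-3p', 'hsa-let-7e']
normalised = [normalise_mirna(n) for n in examples]
print(normalised)
# ['hsa-mir-9', 'hsa-mir-146a', 'hsa-mir-542', 'hsa-let-7e']
```

Names without an arm suffix, such as `hsa-let-7e`, pass through unchanged apart from the case folding.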

***
### miRNA-disease from [miR2Disease](http://watson.compbio.iupui.edu:8080/miR2Disease/)<a class="anchor" id="mirnadis"></a>
miR2Disease, a manually curated database, aims at providing a comprehensive resource of miRNA deregulation in various human diseases.

In [6]:
mirna_disease = pd.read_csv('http://watson.compbio.iupui.edu:8080/miR2Disease/download/AllEntries.txt', sep="\t", header=None)
mirna_disease[0] = mirna_disease[0].str.lower()

descDOmap = pd.read_csv('http://watson.compbio.iupui.edu:8080/miR2Disease/download/diseaseList.txt', sep="\t")
descDOmap.columns = ['desc', 'doid']
descDOmap['doid'] = descDOmap['doid'].str.upper()
descDOmap['doid'] = descDOmap['doid'].str.replace(':', '_')

mirna_disease.columns = ['mirna', 'desc', 2,3,4,5]
mirna_disease = pd.merge(descDOmap, mirna_disease, on=['desc'])
mirna_disease

Unnamed: 0,desc,doid,mirna,2,3,4,5
0,acute lymphoblastic leukemia (ALL),DOID_9952,hsa-let-7e,down-regulated,microarray,2007.0,MicroRNA expression signatures accurately disc...
1,acute lymphoblastic leukemia (ALL),DOID_9952,hsa-mir-125a,down-regulated,microarray,2007.0,MicroRNA expression signatures accurately disc...
2,acute lymphoblastic leukemia (ALL),DOID_9952,hsa-mir-151*,up-regulated,microarray,2007.0,MicroRNA expression signatures accurately disc...
3,acute lymphoblastic leukemia (ALL),DOID_9952,hsa-mir-210,up-regulated,microarray,2007.0,MicroRNA expression signatures accurately disc...
4,acute lymphoblastic leukemia (ALL),DOID_9952,hsa-mir-22,down-regulated,"northern blot, qRT-PCR etc",2009.0,Gene silencing of MIR22 in acute lymphoblastic...
...,...,...,...,...,...,...,...
2619,Waldenstrom Macroglobulinemia (WM),DOID_9080,hsa-mir-542-3p,up-regulated,"Northern blot, qRT-PCR etc",2009.0,"microRNA expression in the biology, prognosis ..."
2620,Waldenstrom Macroglobulinemia (WM),DOID_9080,hsa-mir-9*,down-regulated,"Northern blot, qRT-PCR etc",2009.0,"microRNA expression in the biology, prognosis ..."
2621,Waldenstrom Macroglobulinemia (WM),DOID_9080,hsa-mir-155,up-regulated,"Northern blot, qRT-PCR etc",2009.0,"microRNA expression in the biology, prognosis ..."
2622,Waldenstrom Macroglobulinemia (WM),DOID_9080,hsa-mir-184,up-regulated,"Northern blot, qRT-PCR etc",2009.0,"microRNA expression in the biology, prognosis ..."


In [7]:
from typing import Dict, Tuple

# Index ontology classes by their (lower-cased) database cross-references and exactMatch values
def gets_ontology_class_dbxrefs(graph: Graph) -> Tuple:
    # Triples whose predicate mentions 'hasdbxref' and whose subject is a URI
    dbx_uris: Dict = dict()
    dbx = [x for x in graph if 'hasdbxref' in str(x[1]).lower() if isinstance(x[0], URIRef)]
    for x in dbx:
        if str(x[2]).lower() in dbx_uris.keys(): dbx_uris[str(x[2]).lower()].append(str(x[0]))
        else: dbx_uris[str(x[2]).lower()] = [str(x[0])]
    dbx_type = {str(x[2]).lower(): 'DbXref' for x in dbx}

    # Triples whose predicate mentions 'exactmatch' and whose subject is a URI
    ex_uris: Dict = dict()
    ex = [x for x in graph if 'exactmatch' in str(x[1]).lower() if isinstance(x[0], URIRef)]
    for x in ex:
        if str(x[2]).lower() in ex_uris.keys(): ex_uris[str(x[2]).lower()].append(str(x[0]))
        else: ex_uris[str(x[2]).lower()] = [str(x[0])]
    ex_type = {str(x[2]).lower(): 'ExactMatch' for x in ex}

    return {**dbx_uris, **ex_uris}, {**dbx_type, **ex_type}

In [8]:
mondo_graph = Graph().parse("http://purl.obolibrary.org/obo/mondo.owl", format='application/rdf+xml')
dbxref_res = gets_ontology_class_dbxrefs(mondo_graph)[0]

# Fix DOIDs (substitute : with _) and upper case them
mondo_dict = {str(k).replace(':','_').upper(): {str(i).split('/')[-1].replace(':','_') for i in v}
              for k, v in dbxref_res.items() if 'doid' in str(k)}

with open('files/DOID_MONDO_MAP.csv', 'w') as outfile:
    outfile.write("doid,disease\n")
    for k, v in mondo_dict.items():
        outfile.write(str(k) + ',' + str(v).replace('{','').replace(', ','').replace('\'','').replace('}','') + '\n')

mirbase_mirna_map = pd.read_csv("files/MIRBASE_ID_NCRO_MAP.csv")
mirna_disease.merge(mirbase_mirna_map, left_on=['mirna'], right_on=['mirbase'])[['ncro', 'doid']].to_csv('files/mirna-disease.csv', index=None, header=['ncro', 'doid'])
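
In the cell above, the set of MONDO IDs for each DOID is turned into text with chained `replace` calls. If a DOID has more than one MONDO match, removing `', '` joins the IDs into a single string. The sketch below shows one possible alternative that emits one row per (DOID, target) pair. The function name `doid_mondo_rows` and the `example.org` URIs are hypothetical, and whether one row per pair suits the downstream mapping rules is an assumption to check against `mapping.ttl`.

```python
from typing import Dict, Iterable, List, Set, Tuple

def doid_mondo_rows(dbxrefs: Dict[str, Iterable[str]]) -> List[Tuple[str, str]]:
    """Return sorted (DOID, target ID) pairs, one pair per cross-reference."""
    rows: Set[Tuple[str, str]] = set()
    for xref, uris in dbxrefs.items():
        # Keep only Disease Ontology cross-references, as in the notebook
        if 'doid' not in xref.lower():
            continue
        doid = xref.replace(':', '_').upper()
        for uri in uris:
            # Keep the last path segment of the URI as the identifier
            rows.add((doid, uri.split('/')[-1].replace(':', '_')))
    return sorted(rows)

# Hypothetical toy input shaped like the dictionary returned by gets_ontology_class_dbxrefs
toy_dbxrefs = {
    'doid:0000001': ['http://example.org/obo/EX_0000001',
                     'http://example.org/obo/EX_0000002'],
    'umls:c0000000': ['http://example.org/obo/EX_0000003'],
}

rows = doid_mondo_rows(toy_dbxrefs)
print(rows)
# [('DOID_0000001', 'EX_0000001'), ('DOID_0000001', 'EX_0000002')]
```

Writing these pairs with `csv.writer` keeps the `doid,disease` header meaningful even when a DOID maps to several targets.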

***
We can paste the content of `mapping.ttl` into [**rdf-grapher**](https://www.ldf.fi/service/rdf-grapher) to visualize the mapping rules graphically.

<a target="_blank" href=https://user-images.githubusercontent.com/33032169/229295644-f50cdc40-f13b-4d45-8f81-9f7fde625ffb.png> <img src=https://user-images.githubusercontent.com/33032169/229295644-f50cdc40-f13b-4d45-8f81-9f7fde625ffb.png></a> 

(*Click Figure to Enlarge Image in Current Browser Tab*)

Another visualization tool is [**rdf-visualizer**](https://issemantic.net/rdf-visualizer), which allows us to dynamically inspect the graph (in the figure below you can find a screenshot of the output).

<a target="_blank" href=https://user-images.githubusercontent.com/33032169/232470062-4dd148ef-b389-4b84-b183-b4d38e5ac788.png> <img src=https://user-images.githubusercontent.com/33032169/232470062-4dd148ef-b389-4b84-b183-b4d38e5ac788.png></a> 

(*Click Figure to Enlarge Image in Current Browser Tab*)

In [10]:
!python3 -m rdfizer -c config.ini

Semantifying RNA-KG...
TM: file:///Users/emanuelecavalleri/Desktop/Finaldissertation/SDM-RDFizer/rdfizer/gene-miRNA
TM: file:///Users/emanuelecavalleri/Desktop/Finaldissertation/SDM-RDFizer/rdfizer/gene-disease
TM: file:///Users/emanuelecavalleri/Desktop/Finaldissertation/SDM-RDFizer/rdfizer/miRNA-disease
TM: file:///Users/emanuelecavalleri/Desktop/Finaldissertation/SDM-RDFizer/rdfizer/gene-gene-map
TM: file:///Users/emanuelecavalleri/Desktop/Finaldissertation/SDM-RDFizer/rdfizer/disease-mondo-map
TM: file:///Users/emanuelecavalleri/Desktop/Finaldissertation/SDM-RDFizer/rdfizer/mirbase-ncro-map
TM: file:///Users/emanuelecavalleri/Desktop/Finaldissertation/SDM-RDFizer/rdfizer/doid-mondo-map
Successfully semantified RNA-KG.


Successfully semantified all datasets in 15.967 seconds.


Let us consider, for instance, the *MEGF8* gene (Entrez ID [1954](http://www.ncbi.nlm.nih.gov/gene/1954)). Biologically speaking, the protein encoded by this gene is a membrane protein that contains several [EGF-like](https://www.uniprot.org/keywords/KW-0245) and [PSI](https://www.ebi.ac.uk/interpro/entry/InterPro/IPR016201/) domains. Defects in this gene are a cause of [Carpenter syndrome 2](https://www.ncbi.nlm.nih.gov/medgen/905199) ([MONDO_0013998](http://purl.obolibrary.org/obo/MONDO_0013998)).

In [11]:
# Read in the RNA-KG generated using RDFizer
g = Graph()
g.parse("output/RNA-KG.ttl", format='ttl')

<Graph identifier=Na19ed8c90b3b4d5c86f94c0ec0466529 (<class 'rdflib.graph.Graph'>)>

In [12]:
# Utility function to add edges to a graph

from typing import List, Set, Union

def adds_edges_to_graph(graph: Graph, edge_list: Union[List, Set], progress_bar: bool = True) -> Graph:
    # progress_bar is kept in the signature but has no effect in this notebook
    # De-duplicate the edges before adding them to the graph
    for edge in set(edge_list):
        graph.add(edge)

    return graph

In [13]:
# Obtain a Knowledge subGraph having only MEGF8 as subject

from rdflib import Graph, URIRef

megf8 = URIRef('http://www.ncbi.nlm.nih.gov/gene/1954')

# Compare subjects for equality: a prefix test would also match IDs such as .../gene/19540
filtered_triples = [(s, p, o) for s, p, o in g if s == megf8]

print('There are {} edges'.format(len(filtered_triples)))
sub_graph = adds_edges_to_graph(Graph(), filtered_triples)

sub_graph.serialize(destination='output/subRNA-KG.nt', format='nt', encoding='utf-8')

There are 12 edges


<Graph identifier=N0f86f364e70b471981678d8707577088 (<class 'rdflib.graph.Graph'>)>

<a target="_blank" href=https://user-images.githubusercontent.com/33032169/229296211-10a43191-d03e-4900-ba30-32ca67e67dfb.png> <img src=https://user-images.githubusercontent.com/33032169/229296211-10a43191-d03e-4900-ba30-32ca67e67dfb.png></a> 

MEGF8 (Entrez ID [1954](http://www.ncbi.nlm.nih.gov/gene/1954)):
- *is_a* (http://www.w3.org/1999/02/22-rdf-syntax-ns#type) gene ([SO_0000704](http://purl.obolibrary.org/obo/SO_0000704))
- which *causes or contributes to condition* ([RO_0003302](http://purl.obolibrary.org/obo/RO_0003302)) Carpenter syndrome 2 ([MONDO_0013998](http://purl.obolibrary.org/obo/MONDO_0013998)),
- and *interacts with* ([RO_0002434](http://purl.obolibrary.org/obo/RO_0002434)) [NCRO_0002044](http://purl.obolibrary.org/obo/NCRO_0002044).

NCRO_0002044 is a microRNA (miRNA), also known as *[hsa-mir-374a](https://www.mirbase.org/textsearch.shtml?q=hsa-mir-374a)*. This miRNA is known to [target](https://en.wikipedia.org/wiki/MicroRNA#Targets) MEGF8 transcripts (if you are interested in this topic, please watch the seminar [*An introduction to the RNA world*](https://youtu.be/QjNeOWaKKEE) for further biological information on how this targeting occurs).