# <p style="text-align: center;">RNA Knowledge Graph Build Data Preparation</p>
    
***
***

**Authors:** [ECavalleri](https://mail.google.com/mail/u/0/?view=cm&fs=1&tf=1&to=emanuele.cavalleri@unimi.it), [TJCallahan](https://mail.google.com/mail/u/0/?view=cm&fs=1&tf=1&to=callahantiff@gmail.com), [MMesiti](https://mail.google.com/mail/u/0/?view=cm&fs=1&tf=1&to=marco.mesiti@unimi.it)

**GitHub Repositories:** [testRNA-KG](https://github.com/emanuelecavalleri/testRNA-KG), [PheKnowLator](https://github.com/callahantiff/PheKnowLator/)  
<!--- **Release:** **[v2.0.0](https://github.com/callahantiff/PheKnowLator/wiki/v2.0.0)** --->
  
<br>  
  
**Purpose:** This notebook serves as a script to download, process, map, and clean data in order to build edges for the simplified RNA-centered Knowledge Graph.

<br>

**Assumptions:**   
- Edge data downloads ➞ `./resources/edge_data`  
- Ontologies ➞ `./resources/ontologies`    
- Processed data write location ➞ `./resources/processed_data`  

<br>

**Dependencies:**   
- **Scripts**: This notebook utilizes several helper functions, which are stored in the [`data_utils.py`](https://github.com/callahantiff/PheKnowLator/blob/master/pkt_kg/utils/data_utils.py) and [`kg_utils.py`](https://github.com/callahantiff/PheKnowLator/blob/master/pkt_kg/utils/kg_utils.py) scripts.  
- **Data**: All downloaded and generated data sources are provided through [this](https://drive.google.com/drive/folders/1sev5zczMviX7UVqMhTpkFXG43K3nQa9f) dedicated Google Drive repository. <u>This notebook will download everything that is needed for you</u>.  
_____
***

## Table of Contents
***

### [Download Ontologies](#create-ontologies)


### [Create Identifier Maps](#create-identifier-maps)


### [Download and Process Edge Datasets](#create-edges)

____
***

## Set-Up Environment
***

In [2]:
import sys
!{sys.executable} -m pip install -r requirements.txt
sys.path.append('../')

Defaulting to user installation because normal site-packages is not writeable






In [36]:
# import needed libraries
import datetime
import glob
import itertools
import networkx
import numpy
import os
import openpyxl
import pandas
import pandas as pd  # both the full module name and the pd alias are used below
import pickle
import re
import requests
import sys
import tarfile

from collections import Counter
from functools import reduce
from rdflib import Graph, Namespace, URIRef, BNode, Literal
from rdflib.namespace import OWL, RDF, RDFS
from reactome2py import content
from tqdm import tqdm
from typing import Dict, Tuple

from pkt_kg.utils import *

#### Define Global Variables

In [4]:
# directory to store resources
resource_data_location = '../resources/'

# directory to use for processing data
processed_data_location = '../resources/processed_data/'

# directory to write ontology data to
ontology_data_location = '../resources/ontologies/'

# directory to write edges data to
edge_data_location = '../resources/edge_data/'

# owltools location
owltools_location = '../pkt_kg/libs/owltools'

In [10]:
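# Download helper for data already processed by the PKL ecosystem
def download(name, path):
    """Download the file `name` from the PKL processed-data bucket into `path`."""
    url = 'https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/' + name
    # Note: the existence check below is disabled, so the file is downloaded on every call
    #if not os.path.exists(path + name):
    data_downloader(url, path)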
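
The commented-out existence check in `download` hints that files already on disk need not be fetched again. A minimal sketch of that idea follows. The names `download_if_missing`, `fetch`, and `fake_fetch` are hypothetical and introduced only for illustration; in this notebook `fetch` would wrap `data_downloader`, and the sketch assumes the downloaded file keeps the name that was requested.

```python
import os
import tempfile

BASE_URL = 'https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/'

def download_if_missing(name, path, fetch):
    """Call fetch(url, path) only when path/name is absent; return True if fetched."""
    target = os.path.join(path, name)
    if os.path.exists(target):
        return False
    os.makedirs(path, exist_ok=True)
    fetch(BASE_URL + name, path)
    return True

# Stand-in fetcher that writes an empty file, so the demo needs no network access.
def fake_fetch(url, path):
    with open(os.path.join(path, url.rsplit('/', 1)[-1]), 'w') as handle:
        handle.write('')

demo_dir = tempfile.mkdtemp()
first_call = download_if_missing('RELATIONS_LABELS.txt', demo_dir, fake_fetch)
second_call = download_if_missing('RELATIONS_LABELS.txt', demo_dir, fake_fetch)
print(first_call, second_call)
```

Passing the fetcher as an argument keeps the caching logic testable without touching the remote bucket.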

***
***
### DOWNLOAD ONTOLOGIES  <a class="anchor" id="create-ontologies"></a>
We need a single identifier standard for every entity in our simplified RNA-centered KG: relations, diseases, miRNAs, and genes. Well-established bio-ontologies provide terms for relations, miRNAs, and diseases, but genes have no direct ontology counterpart.
***
***

### Relation Ontology ([RO](https://www.ebi.ac.uk/ols/ontologies/ro))
The OBO Relations Ontology (RO) is a collection of OWL relations (ObjectProperties) intended for use across a wide variety of biological ontologies.

In [11]:
if not os.path.exists(ontology_data_location + 'ro_with_imports.owl'):
    command = '{} {} --merge-import-closure -o {}'
    os.system(command.format(owltools_location, 'http://purl.obolibrary.org/obo/ro.owl',
                             ontology_data_location + 'ro_with_imports.owl'))
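
The `os.system` call above passes a formatted shell string and does not report a failed owltools run. As a hedged alternative, the same command can be expressed as an argument list for `subprocess.run`, which raises an error on a non-zero exit status when `check=True` is set. The function names `build_owltools_command` and `run_owltools` are hypothetical; the flags are the ones already used in this notebook. The demo only builds and prints the command, it does not execute owltools.

```python
import subprocess

def build_owltools_command(owltools, ontology_url, output_path):
    """Return the argument list equivalent to the shell string used above."""
    return [owltools, ontology_url, '--merge-import-closure', '-o', output_path]

def run_owltools(owltools, ontology_url, output_path):
    """Run owltools; raises CalledProcessError if it exits with a non-zero status."""
    subprocess.run(build_owltools_command(owltools, ontology_url, output_path), check=True)

command = build_owltools_command('../pkt_kg/libs/owltools',
                                 'http://purl.obolibrary.org/obo/ro.owl',
                                 '../resources/ontologies/ro_with_imports.owl')
print(' '.join(command))
```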



In [12]:
# Relations labels are already provided by PKL ecosystem
download('RELATIONS_LABELS.txt', '../resources/relations_data/')
download('INVERSE_RELATIONS.txt', '../resources/relations_data/')

# Load data, print row count, and preview it
ro_data_label = pandas.read_csv('../resources/relations_data/'+'RELATIONS_LABELS.txt', header=0, delimiter='\t')

print('There are {edge_count} RO Relations and Labels'.format(edge_count=len(ro_data_label)))
ro_data_label.head(n=5)

Downloading Data from https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/RELATIONS_LABELS.txt
Downloading Data from https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/INVERSE_RELATIONS.txt
There are 667 RO Relations and Labels


Unnamed: 0,Label,Relation
0,helper property (not for use in curation),http://purl.obolibrary.org/obo/RO_0002464
1,developmentally replaces,http://purl.obolibrary.org/obo/RO_0002285
2,is approximately equivalent to,http://purl.obolibrary.org/obo/RO_0002603
3,has intracellular endoparasite,http://purl.obolibrary.org/obo/RO_0002641
4,supplies,http://purl.obolibrary.org/obo/RO_0002178


***
### Mondo Disease Ontology ([Mondo](https://www.ebi.ac.uk/ols/ontologies/mondo))
A semi-automatically constructed ontology that merges in multiple disease resources to yield a coherent merged ontology.

In [13]:
if not os.path.exists(ontology_data_location + 'mondo_with_imports.owl'):
    command = '{} {} --merge-import-closure -o {}'
    os.system(command.format(owltools_location, 'http://purl.obolibrary.org/obo/mondo.owl',
                             ontology_data_location + 'mondo_with_imports.owl'))



***
### Non-Coding RNA Ontology ([NCRO](https://www.ebi.ac.uk/ols4/ontologies/ncro))
The NCRO is a reference ontology in the non-coding RNA (ncRNA) field, aiming to provide a common set of terms and relations that will facilitate the curation, analysis, exchange, sharing, and management of ncRNA structural, functional, and sequence data.

In [14]:
if not os.path.exists(ontology_data_location + 'ncro_with_imports.owl'):
    command = '{} {} --merge-import-closure -o {}'
    os.system(command.format(owltools_location, 'http://purl.obolibrary.org/obo/ncro.owl',
                             ontology_data_location + 'ncro_with_imports.owl'))



At this point, please run the <tt>Ontology_Cleaning.ipynb</tt> notebook provided by PKT.

***
***
### DOWNLOAD EDGES  <a class="anchor" id="create-edges"></a>
***
***

### gene-disease from [Human Disease Molecular Mechanisms](https://github.com/callahantiff/PheKnowLator/wiki/Building-a-KG-of-Human-Disease-Molecular-Mechanisms) (PKT-built)

In [15]:
data_downloader('https://storage.googleapis.com/pheknowlator/current_build/data/original_data/curated_gene_disease_associations.tsv',
                edge_data_location)

# Rename file adding relationship's identifier
os.rename(edge_data_location+'curated_gene_disease_associations.tsv',
          edge_data_location+'gene-disease_curated_gene_disease_associations.tsv')

# Load the renamed file and preview it
data = pd.read_csv(edge_data_location + 'gene-disease_curated_gene_disease_associations.tsv', sep="\t")
data

Downloading Data from https://storage.googleapis.com/pheknowlator/current_build/data/original_data/curated_gene_disease_associations.tsv


Unnamed: 0,geneId,geneSymbol,DSI,DPI,diseaseId,diseaseName,diseaseType,diseaseClass,diseaseSemanticType,score,EI,YearInitial,YearFinal,NofPmids,NofSnps,source
0,1,A1BG,0.700,0.538,C0019209,Hepatomegaly,phenotype,C23;C06,Finding,0.30,1.000,2017.0,2017.0,1,0,CTD_human
1,1,A1BG,0.700,0.538,C0036341,Schizophrenia,disease,F03,Mental or Behavioral Dysfunction,0.30,1.000,2015.0,2015.0,1,0,CTD_human
2,2,A2M,0.529,0.769,C0002395,Alzheimer's Disease,disease,C10;F03,Disease or Syndrome,0.50,0.769,1998.0,2018.0,3,0,CTD_human
3,2,A2M,0.529,0.769,C0007102,Malignant tumor of colon,disease,C06;C04,Neoplastic Process,0.31,1.000,2004.0,2019.0,1,0,CTD_human
4,2,A2M,0.529,0.769,C0009375,Colonic Neoplasms,group,C06;C04,Neoplastic Process,0.30,1.000,2004.0,2004.0,1,0,CTD_human
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
84033,109580095,HBB-LCR,0.743,0.115,C0002875,Cooley's anemia,disease,C16;C15,Disease or Syndrome,0.30,,,,0,0,CTD_human
84034,109580095,HBB-LCR,0.743,0.115,C0005283,beta Thalassemia,disease,C16;C15,Disease or Syndrome,0.30,,,,0,0,CTD_human
84035,109580095,HBB-LCR,0.743,0.115,C0019025,Hemoglobin F Disease,disease,C16;C15,Disease or Syndrome,0.30,,,,0,0,CTD_human
84036,109580095,HBB-LCR,0.743,0.115,C0085578,Thalassemia Minor,disease,C16;C15,Disease or Syndrome,0.30,,,,0,0,CTD_human


To represent genes, we can use NCBI Entrez Gene identifiers (<tt>geneId</tt> column); gene symbols (<tt>geneSymbol</tt> column) would have been a viable choice as well. For diseases (<tt>diseaseId</tt> column), the original TSV adopts DisGeNET identifiers, so we'll need a mapping that links these identifiers to the Mondo ontology.

***
### gene-miRNA from [TarBase](https://dianalab.e-ce.uth.gr/html/diana/web/index.php?r=tarbasev8/index)
DIANA-TarBase v8 is a reference database devoted to the indexing of experimentally supported microRNA (miRNA) targets.

In [51]:
data_downloader('https://dianalab.e-ce.uth.gr/downloads/tarbase_v8_data.tar.gz', edge_data_location)

with tarfile.TarFile(edge_data_location+'tarbase_v8_data.tar', 'r') as tar_ref:
    tar_ref.extractall(edge_data_location)
    
# Remove tar file
os.remove(edge_data_location+'tarbase_v8_data.tar')
    
# Rename file adding relationship's identifier
os.rename(edge_data_location+'TarBase_v8_download.txt',
          edge_data_location+'gene-miRNA_TarBase_v8_download.txt')   

# Load the renamed file (cell_line is forced to string dtype) and preview it
data = pd.read_csv(edge_data_location + 'gene-miRNA_TarBase_v8_download.txt', sep="\t", dtype={"cell_line": "string"})
data

Downloading Gzipped Data from https://dianalab.e-ce.uth.gr/downloads/tarbase_v8_data.tar.gz


Unnamed: 0,geneId,geneName,mirna,species,cell_line,tissue,category,method,positive_negative,direct_indirect,up_down,condition
0,0910001A06Rik(mmu),0910001A06Rik(mmu),mmu-miR-124-3p,Mus musculus,,,Cancer/Malignant,Microarrays,POSITIVE,INDIRECT,UP,
1,1200004M23Rik(mmu),1200004M23Rik(mmu),mmu-miR-124-3p,Mus musculus,,,Cancer/Malignant,Microarrays,POSITIVE,INDIRECT,DOWN,
2,1700027J05Rik(mmu),1700027J05Rik(mmu),mmu-miR-124-3p,Mus musculus,,,Cancer/Malignant,Microarrays,POSITIVE,INDIRECT,UP,
3,1810015A11Rik(mmu),1810015A11Rik(mmu),mmu-miR-124-3p,Mus musculus,,,Cancer/Malignant,Microarrays,POSITIVE,INDIRECT,DOWN,
4,2310047A01Rik(mmu),2310047A01Rik(mmu),mmu-miR-124-3p,Mus musculus,,,Cancer/Malignant,Microarrays,POSITIVE,INDIRECT,DOWN,
...,...,...,...,...,...,...,...,...,...,...,...,...
927114,uPA(hsa),uPA(hsa),hsa-miR-23b-3p,Homo sapiens,,,Cancer/Malignant,Western Blot,POSITIVE,INDIRECT,DOWN,
927115,vimentin(hsa),vimentin(hsa),hsa-miR-9-5p,Homo sapiens,,,Cancer/Malignant,Western Blot,NEGATIVE,INDIRECT,,
927116,Â Â Â Â (PTPRG)(hsa),Â Â Â Â (PTPRG)(hsa),hsa-miR-146a-5p,Homo sapiens,,,Cancer/Malignant,Microarrays,POSITIVE,INDIRECT,UP,
927117,Î±-actin(mmu),Î±-actin(mmu),mmu-miR-24-3p,Mus musculus,,,Cancer/Malignant,qPCR,POSITIVE,INDIRECT,UP,
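
The extract-then-delete step used for the TarBase archive can be wrapped in a small helper. The sketch below is an illustration under stated assumptions, not the notebook's own code: `extract_and_remove` is a hypothetical name, and it uses `tarfile.open` with mode `'r:*'`, which lets the standard library detect whether the archive is plain or compressed. The demo builds a throwaway archive locally instead of downloading anything.

```python
import os
import tarfile
import tempfile

def extract_and_remove(archive_path, destination):
    """Extract a tar archive (compressed or not), delete it, and return member names."""
    # Mode 'r:*' lets tarfile detect gzip/bz2/xz compression or a plain tar.
    with tarfile.open(archive_path, 'r:*') as archive:
        names = archive.getnames()
        archive.extractall(destination)
    os.remove(archive_path)
    return names

# Build a small gzip-compressed archive to demonstrate the helper.
work_dir = tempfile.mkdtemp()
payload = os.path.join(work_dir, 'example.txt')
with open(payload, 'w') as handle:
    handle.write('geneId\tmirna\n')
archive_path = os.path.join(work_dir, 'example.tar.gz')
with tarfile.open(archive_path, 'w:gz') as archive:
    archive.add(payload, arcname='example.txt')
os.remove(payload)

extracted = extract_and_remove(archive_path, work_dir)
print(extracted, os.path.exists(payload), os.path.exists(archive_path))
```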


In [52]:
# For the time being, we keep only Homo sapiens rows
data = data[data['species'].str.contains("Homo sapiens")]

# Moreover, we keep only 50k (random) rows to reduce input size
# (no random_state is set, so the sample differs between runs)
data = data.sample(n=50000)

# This simplified KG ignores whether a transcript is "3p" or "5p", so we store the arm as an additional column
data['p'] = data['mirna'].str.extract(r'-([35]p)$', expand=False)
data['mirna'] = data['mirna'].str.replace(r'-[35]p$', '', regex=True)
data['mirna'] = data['mirna'].str.lower()

# If you're interested in understanding why Homo sapiens has -3p and -5p miRNAs:
# https://pubmed.ncbi.nlm.nih.gov/12592000/
# Putting aside the -(3/5)p information, we are essentially dealing with non-mature (aka hairpin) miRNA:
# https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3616697/

data

Unnamed: 0,geneId,geneName,mirna,species,cell_line,tissue,category,method,positive_negative,direct_indirect,up_down,condition,p
206695,ENSG00000120694,HSPH1,hsa-mir-320c,Homo sapiens,BCBL1,Bone Marrow,Cancer/Malignant,HITS-CLIP,POSITIVE,DIRECT,DOWN,,
588042,ENSG00000233927,RPS28,hsa-mir-3613,Homo sapiens,BCBL1,Bone Marrow,Cancer/Malignant,HITS-CLIP,POSITIVE,DIRECT,DOWN,,3p
517597,ENSG00000182568,SATB1,hsa-mir-19b,Homo sapiens,HEK293,Kidney,Embryonic/Fetal,PAR-CLIP,POSITIVE,DIRECT,DOWN,mild MNase digestion,3p
222544,ENSG00000124688,MAD2L1BP,hsa-mir-361,Homo sapiens,293S,Kidney,Embryonic/Fetal,HITS-CLIP,POSITIVE,DIRECT,DOWN,no treatment (control),3p
261765,ENSG00000133794,ARNTL,hsa-mir-124,Homo sapiens,HEPG2,Liver,Cancer/Malignant,Microarrays,NEGATIVE,INDIRECT,,"72hrs post-transfection, Overexpression",3p
...,...,...,...,...,...,...,...,...,...,...,...,...,...
19659,ENSG00000033170,FUT8,hsa-mir-671,Homo sapiens,MDAMB231,Mammary Gland,Cancer/Malignant,HITS-CLIP,POSITIVE,DIRECT,DOWN,,5p
433708,ENSG00000166266,CUL5,hsa-mir-362,Homo sapiens,293S,Kidney,Embryonic/Fetal,HITS-CLIP,POSITIVE,DIRECT,DOWN,no treatment (control),5p
353787,ENSG00000150722,PPP1R1C,hsa-mir-146a,Homo sapiens,,,Cancer/Malignant,Microarrays,POSITIVE,INDIRECT,DOWN,,5p
213214,ENSG00000122566,HNRNPA2B1,hsa-mir-7-1,Homo sapiens,293S,Kidney,Embryonic/Fetal,HITS-CLIP,POSITIVE,DIRECT,DOWN,no treatment (control),3p
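
The arm handling in the cell above (strip a trailing `-3p` or `-5p`, keep it separately, lower-case the rest) can be checked on individual names with the standard `re` module. This is a small sketch; `split_arm` is a hypothetical helper, and the input names are illustrative strings in the TarBase naming style.

```python
import re

# Matches a trailing '-3p' or '-5p' and captures the arm.
ARM_SUFFIX = re.compile(r'-([35]p)$')

def split_arm(mirna):
    """Return (lower-cased name without the arm suffix, arm or None)."""
    match = ARM_SUFFIX.search(mirna)
    arm = match.group(1) if match else None
    return ARM_SUFFIX.sub('', mirna).lower(), arm

for name in ['hsa-miR-19b-3p', 'hsa-miR-671-5p', 'hsa-miR-320c']:
    print(name, '->', split_arm(name))
```

Anchoring the pattern with `$` ensures that only a genuine suffix is removed, so names without an arm pass through unchanged apart from lower-casing.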


In this case, genes are identified only by Ensembl gene (ENSG) identifiers and symbols (<tt>geneId</tt> and <tt>geneName</tt> columns), so we'll need an additional mapping to link ENSG identifiers or symbols to NCBI Entrez Gene identifiers.

In [53]:
data.to_csv(edge_data_location + 'gene-miRNA_TarBase_v8_download.txt', header=None, sep='\t', index=None)

***
### miRNA-disease from [miR2Disease](http://watson.compbio.iupui.edu:8080/miR2Disease/)
miR2Disease, a manually curated database, aims at providing a comprehensive resource of miRNA deregulation in various human diseases.

In [55]:
data_downloader('http://watson.compbio.iupui.edu:8080/miR2Disease/download/AllEntries.txt', edge_data_location)

data = pd.read_csv(edge_data_location + 'AllEntries.txt', sep="\t", header=None)  
os.remove(edge_data_location + 'AllEntries.txt')
data

Downloading Data from http://watson.compbio.iupui.edu:8080/miR2Disease/download/AllEntries.txt


Unnamed: 0,0,1,2,3,4,5
0,hsa-let-7f-2,kidney cancer,up-regulated,microarray,2007.0,Micro-RNA profiling in kidney and bladder canc...
1,hsa-let-7g,hepatocellular carcinoma (HCC),down-regulated,"Northern blot, qRT-PCR etc",2008.0,Identification of metastasis-related microRNAs...
2,hsa-let-7g,lung cancer,down-regulated,"Northern blot, qRT-PCR etc",2007.0,The let-7 microRNA represses cell proliferatio...
3,hsa-let-7g,non-small cell lung cancer (NSCLC),down-regulated,"Northern blot, qRT-PCR etc",2008.0,Suppression of non-small cell lung tumor devel...
4,hsa-let-7g,ovarian cancer (OC),down-regulated,"Northern blot, qRT-PCR etc",2007.0,Let-7 expression defines two differentiation s...
...,...,...,...,...,...,...
2897,hsa-miR-21,glioblastoma multiforme (GBM),up-regulated,"Northern blot, qRT-PCR etc",2008.0,miR-124 and miR-137 inhibit proliferation of g...
2898,hsa-miR-21,glioma,up-regulated,"Northern blot, qRT-PCR etc",2008.0,MicroRNA 21 promotes glioma invasion by target...
2899,hsa-miR-21,hepatocellular carcinoma (HCC),up-regulated,"Northern blot, qRT-PCR etc",2006.0,Downregulation of miR-122 in the rodent and hu...
2900,hsa-miR-21,Inclusion body myositis (IBM),up-regulated,microarray,2007.0,Distinctive patterns of microRNA expression in...


In [56]:
# miR2Disease provides a look-up table for mapping disease names to DO terms
data_downloader('http://watson.compbio.iupui.edu:8080/miR2Disease/download/diseaseList.txt', processed_data_location)

descDOmap = pd.read_csv(processed_data_location + 'diseaseList.txt', sep="\t")  
os.remove(processed_data_location + 'diseaseList.txt')
descDOmap

Downloading Data from http://watson.compbio.iupui.edu:8080/miR2Disease/download/diseaseList.txt


Unnamed: 0,disease name in original paper,disease ontology ID
0,Abdominal Aortic Aneurysm,DOID:7693
1,acute lymphoblastic leukemia (ALL),DOID:9952
2,acute myeloid leukemia (AML),DOID:9119
3,acute myocardial infarction,DOID:9408
4,acute promyelocytic leukemia (APL),DOID:9119
...,...,...
169,uterine leiomyoma (ULM),DOID:13223
170,uveal melanoma,DOID:1909
171,vascular disease,DOID:178
172,vesicular stomatitis,DOID:10881


In [57]:
# Term URIs in OWL files use '_' rather than ':' (e.g., DOID_9952), so convert the IDs accordingly
descDOmap['disease ontology ID'] = descDOmap['disease ontology ID'].astype(str).str.replace(':', '_')

disease2mirna = pd.merge(descDOmap, data, left_on=['disease name in original paper'], right_on=[1]).drop(
    columns=['disease name in original paper'])
disease2mirna[0] = disease2mirna[0].str.lower()
disease2mirna

Unnamed: 0,disease ontology ID,0,1,2,3,4,5
0,DOID_9952,hsa-let-7e,acute lymphoblastic leukemia (ALL),down-regulated,microarray,2007.0,MicroRNA expression signatures accurately disc...
1,DOID_9952,hsa-mir-125a,acute lymphoblastic leukemia (ALL),down-regulated,microarray,2007.0,MicroRNA expression signatures accurately disc...
2,DOID_9952,hsa-mir-151*,acute lymphoblastic leukemia (ALL),up-regulated,microarray,2007.0,MicroRNA expression signatures accurately disc...
3,DOID_9952,hsa-mir-210,acute lymphoblastic leukemia (ALL),up-regulated,microarray,2007.0,MicroRNA expression signatures accurately disc...
4,DOID_9952,hsa-mir-22,acute lymphoblastic leukemia (ALL),down-regulated,"northern blot, qRT-PCR etc",2009.0,Gene silencing of MIR22 in acute lymphoblastic...
...,...,...,...,...,...,...,...
2619,DOID_9080,hsa-mir-542-3p,Waldenstrom Macroglobulinemia (WM),up-regulated,"Northern blot, qRT-PCR etc",2009.0,"microRNA expression in the biology, prognosis ..."
2620,DOID_9080,hsa-mir-9*,Waldenstrom Macroglobulinemia (WM),down-regulated,"Northern blot, qRT-PCR etc",2009.0,"microRNA expression in the biology, prognosis ..."
2621,DOID_9080,hsa-mir-155,Waldenstrom Macroglobulinemia (WM),up-regulated,"Northern blot, qRT-PCR etc",2009.0,"microRNA expression in the biology, prognosis ..."
2622,DOID_9080,hsa-mir-184,Waldenstrom Macroglobulinemia (WM),up-regulated,"Northern blot, qRT-PCR etc",2009.0,"microRNA expression in the biology, prognosis ..."


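The merge is an inner join, so entries whose disease name is missing from the look-up table are dropped: the table shrinks from 2,901 to 2,623 rows, a loss of 278 rows (about 9.6%).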
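
To see which disease names fail to match during such a merge, one option is a left join with pandas' `indicator=True`, which adds a `_merge` column marking unmatched rows. The sketch below uses toy data: the column names, the "made-up disease name" entry, and the `DOID_PLACEHOLDER_*` values are invented for illustration and are not taken from miR2Disease.

```python
import pandas as pd

# Toy stand-ins for the miR2Disease entries and the disease look-up table.
entries = pd.DataFrame({'mirna': ['hsa-let-7g', 'hsa-miR-21', 'hsa-miR-21'],
                        'disease': ['lung cancer', 'glioma', 'made-up disease name']})
lookup = pd.DataFrame({'disease': ['lung cancer', 'glioma'],
                       'doid': ['DOID_PLACEHOLDER_1', 'DOID_PLACEHOLDER_2']})

# how='left' keeps unmatched entries; indicator=True adds the '_merge' column.
merged = entries.merge(lookup, on='disease', how='left', indicator=True)
dropped = merged.loc[merged['_merge'] == 'left_only', 'disease']
print(dropped.value_counts())
```

Counting the unmatched names this way shows whether the loss comes from a few frequent spellings that could be fixed by hand.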

In [58]:
disease2mirna.to_csv(edge_data_location + 'miRNA-disease_miR2Disease.txt', header=None, sep='\t', index=None)

***
***
### CREATE MAPPING DATASETS  <a class="anchor" id="create-identifier-maps"></a>
***
***

### Ensembl Gene-Entrez Gene <a class="anchor" id="ensemblgene-entrezgene"></a>


**Purpose:** To map Ensembl gene identifiers to Entrez gene identifiers

**Output:** `ENSEMBL_GENE_ENTREZ_GENE_MAP.txt`

Already provided by PKL ecosystem.

In [23]:
download('ENSEMBL_GENE_ENTREZ_GENE_MAP.txt', processed_data_location)

ensEntrez = pd.read_csv(processed_data_location + 'ENSEMBL_GENE_ENTREZ_GENE_MAP.txt', sep="\t", header=None)
ensEntrez

Downloading Data from https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/ENSEMBL_GENE_ENTREZ_GENE_MAP.txt


Unnamed: 0,0,1,2,3,4,5
0,ENSG00000171241,79801,protein-coding,protein-coding,protein-coding,protein-coding
1,ENSG00000131149,23199,protein-coding,protein-coding,protein-coding,protein-coding
2,ENSG00000096092,28978,protein-coding,protein-coding,protein-coding,protein-coding
3,ENSG00000222691,106479891,snRNA,pseudogene,not protein-coding,not protein-coding
4,ENSG00000230052,100873180,unprocessed_pseudogene,pseudogene,not protein-coding,not protein-coding
...,...,...,...,...,...,...
42283,ENSG00000175699,256369,protein-coding,protein-coding,protein-coding,protein-coding
42284,ENSG00000251308,359776,processed_pseudogene,pseudogene,not protein-coding,not protein-coding
42285,ENSG00000108479,2584,protein-coding,protein-coding,protein-coding,protein-coding
42286,ENSG00000167371,112476,protein-coding,protein-coding,protein-coding,protein-coding


***
### DisGeNET-Mondo <a class="anchor" id="DisGeNET-Mondo"></a>


**Purpose:** To map DisGeNET identifiers to Mondo identifiers

**Output:** `DISEASE_MONDO_MAP.txt`

Already provided by PKL ecosystem.

In [50]:
download('DISEASE_MONDO_MAP.txt', processed_data_location)

disMondo = pd.read_csv(processed_data_location + 'DISEASE_MONDO_MAP.txt', sep="\t", header=None)
disMondo

Downloading Data from https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/DISEASE_MONDO_MAP.txt


Unnamed: 0,0,1
0,0001816,MONDO_0016982
1,0002116,MONDO_0005085
2,0014667,MONDO_0005066
3,0040084,MONDO_0005972
4,0040085,MONDO_0005229
...,...,...
182942,D017098,MONDO_0001341
182943,C131452,MONDO_0013874
182944,19995004,MONDO_0022013
182945,268173,MONDO_0017053


***
### Disease Ontology (DO) - Mondo mapping <a class="anchor" id="ensemblgene-entrezgene"></a>


**Purpose:** To map DO identifiers to Mondo identifiers

**Output:** `DISEASE_DOID_MONDO_Map.txt`

In [24]:
mondo_graph = Graph().parse(ontology_data_location + 'mondo_with_imports.owl')

dbxref_res = gets_ontology_class_dbxrefs(mondo_graph)[0]

# Fix DOIDs (substitute : with _) and upper case them
mondo_dict = {str(k).replace(':','_').upper(): {str(i).split('/')[-1].replace(':','_') for i in v}
              for k, v in dbxref_res.items() if 'doid' in str(k)}

with open(processed_data_location + 'DISEASE_DOID_MONDO_Map.txt', 'w') as outfile:
    for k, v in mondo_dict.items():
        # Join the set members with commas (no spaces) so they split cleanly below
        outfile.write(str(k) + '\t' + ','.join(sorted(v)) + '\n')

# Reload the file and expand it to one DOID-Mondo pair per row
doidMondo = pd.read_csv(processed_data_location + 'DISEASE_DOID_MONDO_Map.txt', sep="\t", header=None)
doidMondo[1] = doidMondo[1].str.split(',')
doidMondo = doidMondo.explode(1)
doidMondo.to_csv(processed_data_location + 'DISEASE_DOID_MONDO_Map.txt', header=None, sep='\t', index=None)
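
The cell above turns a dictionary of sets into a two-column file with one pair per row. The same long-form layout can be written directly, without the intermediate write, split, and explode round trip. This is a sketch with the standard `csv` module; `write_long_form` is a hypothetical helper and the identifiers in `toy_map` are placeholders, not real cross-references.

```python
import csv
import io

# Toy mapping; the identifiers are placeholders, not real cross-references.
toy_map = {'DOID_A': {'MONDO_2', 'MONDO_1'}, 'DOID_B': {'MONDO_3'}}

def write_long_form(mapping, handle):
    """Write one tab-separated (key, value) row per set member, in sorted order."""
    writer = csv.writer(handle, delimiter='\t', lineterminator='\n')
    for key in sorted(mapping):
        for value in sorted(mapping[key]):
            writer.writerow([key, value])

buffer = io.StringIO()
write_long_form(toy_map, buffer)
print(buffer.getvalue())
```

Sorting keys and values makes the output file deterministic, which helps when comparing builds.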

In [32]:
doidMondo = pd.read_csv(processed_data_location + 'DISEASE_DOID_MONDO_Map.txt', sep="\t", header=None)
doidMondo

Unnamed: 0,0,1
0,DOID_14555,MONDO_0001998
1,DOID_2349,MONDO_0002277
2,DOID_0060465,MONDO_0016068
3,DOID_7160,MONDO_0004125
4,DOID_0111292,MONDO_0013103
...,...,...
11146,DOID_2649,MONDO_0004997
11147,DOID_0070353,MONDO_0012786
11148,DOID_14796,MONDO_0009124
11149,DOID_0060859,MONDO_0000827
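
With a DO-to-Mondo map in hand, edges keyed by DOID can be re-keyed to Mondo. The sketch below shows one possible approach, assuming a DOID may map to several Mondo terms. The two mapping rows are copied from the preview above; the miRNA names, the second DOID, and the helper `to_mondo_edges` are invented for illustration and do not represent real associations.

```python
from collections import defaultdict

# Two mapping rows copied from the preview above.
doid_mondo_rows = [('DOID_14555', 'MONDO_0001998'), ('DOID_2349', 'MONDO_0002277')]

# Invented edges: the miRNA names and 'DOID_0000000' are placeholders.
toy_edges = [('mirna-example-1', 'DOID_14555'), ('mirna-example-2', 'DOID_0000000')]

doid_to_mondo = defaultdict(list)
for doid, mondo in doid_mondo_rows:
    doid_to_mondo[doid].append(mondo)

def to_mondo_edges(edges, mapping):
    """Return (edges re-keyed to Mondo, edges whose DOID has no Mondo term)."""
    mapped, unmapped = [], []
    for mirna, doid in edges:
        targets = mapping.get(doid, [])
        if targets:
            mapped.extend((mirna, target) for target in targets)
        else:
            unmapped.append((mirna, doid))
    return mapped, unmapped

mapped_edges, unmapped_edges = to_mondo_edges(toy_edges, doid_to_mondo)
print(mapped_edges)
print(unmapped_edges)
```

Returning the unmapped edges separately makes it easy to report how much data a mapping step discards.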


***
### miRBase ID - Non-Coding RNA Ontology (NCRO) mapping


**Purpose:** To map miRNA identifiers from miRBase to NCRO identifiers

**Output:** `MIRBASE_ID_NCRO_MAP.txt`

In [37]:
# Helper function to map every ontology class label (and exactMatch value) to its class URIs
def gets_ontology_class_label(graph: Graph) -> Tuple:
    """Return (value -> list of class URIs, value -> source type), keyed by lower-cased value."""
    # Labels are tagged 'DbXref' here; only the first returned dictionary is used below
    dbx_uris: Dict = dict()
    dbx = [x for x in graph if 'label' in str(x[1]).lower() and isinstance(x[0], URIRef)]
    for x in dbx:
        dbx_uris.setdefault(str(x[2]).lower(), []).append(str(x[0]))
    dbx_type = {str(x[2]).lower(): 'DbXref' for x in dbx}

    ex_uris: Dict = dict()
    # Fixed: the subject x[0] (not the literal list [0]) must be tested against URIRef
    ex = [x for x in graph if 'exactmatch' in str(x[1]).lower() and isinstance(x[0], URIRef)]
    for x in ex:
        ex_uris.setdefault(str(x[2]).lower(), []).append(str(x[0]))
    ex_type = {str(x[2]).lower(): 'ExactMatch' for x in ex}

    return {**dbx_uris, **ex_uris}, {**dbx_type, **ex_type}
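
The core of the helper above is an inverted index from lower-cased label to the subjects that carry it. The simplified sketch below shows that idea on plain tuples instead of an RDFLib graph, so it runs without any ontology file. It matches the `rdfs:label` predicate exactly, whereas the helper matches any predicate containing "label"; `index_labels` and the `example.org` URIs are hypothetical.

```python
from collections import defaultdict

RDFS_LABEL = 'http://www.w3.org/2000/01/rdf-schema#label'

def index_labels(triples):
    """Map each lower-cased label to the sorted list of subjects carrying it."""
    index = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == RDFS_LABEL:
            index[obj.lower()].add(subject)
    return {label: sorted(subjects) for label, subjects in index.items()}

# Toy triples; the subject URIs are placeholders, not real NCRO terms.
toy_triples = [
    ('http://example.org/NCRO_X1', RDFS_LABEL, 'hsa-miR-21'),
    ('http://example.org/NCRO_X2', RDFS_LABEL, 'HSA-MIR-21'),
    ('http://example.org/NCRO_X1', 'http://example.org/other', 'ignored'),
]
label_index = index_labels(toy_triples)
print(label_index)
```

Lower-casing the labels is what allows names such as `hsa-miR-21` and `hsa-mir-21` to resolve to the same key.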

In [38]:
# Read data into RDFLib graph object
ncro_graph = Graph().parse(ontology_data_location + 'ncro_with_imports.owl')

dbxref_res = gets_ontology_class_label(ncro_graph)[0]

# Fix string patterns
ncro_dict = {str(k): {str(i).split('/')[-1].replace(':','_') for i in v}
             for k, v in dbxref_res.items() if 'NCRO' in str(v) and 'mir-' in str(k) and 'hsa' in str(k)}
ncro_dict2 = {'hsa-'+str(k): {str(i).split('/')[-1].replace(':','_') for i in v} for k, v in dbxref_res.items()
              if 'NCRO' in str(v) and 'mir-' in str(k) and 'hsa' not in str(k)}

In [39]:
with open(processed_data_location + 'MIRBASE_ID_NCRO_MAP.txt', 'w') as outfile:
    for k, v in {**ncro_dict, **ncro_dict2}.items():
        # Join the set members with commas (no spaces) so they split cleanly below
        outfile.write(str(k) + '\t' + ','.join(sorted(v)) + '\n')

# Reload the file and expand it to one miRNA-NCRO pair per row
hsaNcro = pd.read_csv(processed_data_location + 'MIRBASE_ID_NCRO_MAP.txt', sep="\t", header=None)
hsaNcro[1] = hsaNcro[1].str.split(',')
hsaNcro = hsaNcro.explode(1)
hsaNcro.to_csv(processed_data_location + 'MIRBASE_ID_NCRO_MAP.txt', header=None, sep='\t', index=None)

In [40]:
pd.read_csv(processed_data_location + 'MIRBASE_ID_NCRO_MAP.txt', sep="\t", header=None)

Unnamed: 0,0,1
0,hsa-mir-4782,NCRO_0003219
1,hsa-mir-106b,NCRO_0001673
2,hsa-mir-4652,NCRO_0001159
3,hsa-mir-651,NCRO_0002901
4,hsa-mir-6085,NCRO_0001382
...,...,...
2014,hsa-mir-1254,NCRO_0003004
2015,hsa-mir-4435,NCRO_0003133
2016,hsa-mir-322,NCRO_0002800
2017,hsa-mir-133,NCRO_0002711


***
To represent genes, PKT designates them as subclasses of relevant Sequence Ontology ([SO](https://www.ebi.ac.uk/ols4/ontologies/so)) terms. If you wish to include additional bio-entities to enhance this KG, you will need to expand the construction dictionary loaded in the next cell. If you are interested, you can explore the entire [RNA-KG](https://github.com/AnacletoLAB/RNA-KG/). Feel free to contact me at [emanuele dot cavalleri at unimi dot it](https://mail.google.com/mail/u/0/?view=cm&fs=1&tf=1&to=emanuele.cavalleri@unimi.it).

In [41]:
# KG construction approach dictionary (for non ontological data), provided by PKL ecosystem
download('subclass_construction_map.pkl', '../resources/construction_approach/')

# Load data and preview it
nonO_data = pd.read_pickle(r'../resources/construction_approach/'+'subclass_construction_map.pkl')

# For instance, NCBI gene IDs are mapped to the appropriate Sequence Ontology (SO) terms
list(nonO_data.items())[:5]

Downloading Data from https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/subclass_construction_map.pkl


[('84103', ['SO_0001217']),
 ('84690', ['SO_0001217']),
 ('3579', ['SO_0001217']),
 ('54514', ['SO_0001217']),
 ('7159', ['SO_0001217'])]
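
To make the subclass-based approach concrete, the sketch below expands dictionary entries like the ones previewed above into `rdfs:subClassOf` statements. It illustrates the idea only and is not PKT's implementation: `subclass_triples` is a hypothetical helper, the gene URI prefix is assumed from the node metadata keys shown in this notebook, and the OBO prefix is assumed from the ontology download URLs.

```python
# Entries copied from the preview above.
subclass_map = {'84103': ['SO_0001217'], '3579': ['SO_0001217']}

GENE_PREFIX = 'http://www.ncbi.nlm.nih.gov/gene/'   # assumed from the node metadata keys
OBO_PREFIX = 'http://purl.obolibrary.org/obo/'      # assumed from the ontology download URLs
RDFS_SUBCLASS_OF = 'http://www.w3.org/2000/01/rdf-schema#subClassOf'

def subclass_triples(mapping):
    """Yield (gene URI, rdfs:subClassOf, SO class URI) for every mapped gene."""
    for gene_id, so_terms in mapping.items():
        for term in so_terms:
            yield (GENE_PREFIX + gene_id, RDFS_SUBCLASS_OF, OBO_PREFIX + term)

triples = list(subclass_triples(subclass_map))
for triple in triples:
    print(triple)
```

Adding a new bio-entity type would then amount to adding its identifiers, with suitable ontology terms, to the dictionary.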

PKT also provides node and relation metadata (a.k.a. node properties and relation attributes).

In [42]:
# KG metadata, provided by PKL ecosystem
download('node_metadata_dict.pkl', '../resources/node_data/')

# Load data and preview it
metadata = pd.read_pickle(r'../resources/node_data/'+'node_metadata_dict.pkl')

# 0 for nodes (e.g., genes); 1 for (RO) relations
list(metadata.items())[0]

Downloading Data from https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/node_metadata_dict.pkl


('nodes',
 {'http://www.ncbi.nlm.nih.gov/gene/1': {'Label': 'A1BG',
   'Description': "A1BG has locus group 'protein-coding' and is located on chromosome 19 (19q13.43).",
   'Synonym': 'HEL-S-163pA|ABG|A1B|epididymis secretory sperm binding protein Li 163pA|GAB|HYST2477alpha-1B-glycoprotein'},
  'http://www.ncbi.nlm.nih.gov/gene/2': {'Label': 'A2M',
   'Description': "A2M has locus group 'protein-coding' and is located on chromosome 12 (12p13.31).",
   'Synonym': 'CPAMD5|S863-7alpha-2-macroglobulin|FWP007|C3 and PZP-like alpha-2-macroglobulin domain-containing protein 5|A2MD|alpha-2-M'},
  'http://www.ncbi.nlm.nih.gov/gene/3': {'Label': 'A2MP1',
   'Description': "A2MP1 has locus group 'pseudo' and is located on chromosome 12 (12p13.31).",
   'Synonym': 'A2MPpregnancy-zone protein pseudogene'},
  'http://www.ncbi.nlm.nih.gov/gene/9': {'Label': 'NAT1',
   'Description': "NAT1 has locus group 'protein-coding' and is located on chromosome 8 (8p22).",
   'Synonym': 'NAT-1|N-acetyltransfera