# <p style="text-align: center;">RNA Knowledge Graph Build Data Preparation</p>
    
***
***

**Authors:** [ECavalleri](https://mail.google.com/mail/u/0/?view=cm&fs=1&tf=1&to=emanuele.cavalleri@unimi.it), [TJCallahan](https://mail.google.com/mail/u/0/?view=cm&fs=1&tf=1&to=callahantiff@gmail.com), [MMesiti](https://mail.google.com/mail/u/0/?view=cm&fs=1&tf=1&to=marco.mesiti@unimi.it)

**GitHub Repositories:** [testRNA-KG](https://github.com/emanuelecavalleri/testRNA-KG), [PheKnowLator](https://github.com/callahantiff/PheKnowLator/) (PKT)
<!--- **Release:** **[v2.0.0](https://github.com/callahantiff/PheKnowLator/wiki/v2.0.0)** --->
  
<br>  
  
**Purpose:** This notebook serves as a script to download, process, map, and clean data in order to build edges for the simplified RNA-centered Knowledge Graph.

<br>

**Assumptions:**   
- Edge data downloads ➞ `./resources/edge_data`  
- Ontologies ➞ `./resources/ontologies`    
- Processed data write location ➞ `./resources/processed_data`  

<br>

**Dependencies:**   
- **Scripts**: This notebook utilizes several helper functions, which are stored in the [`data_utils.py`](https://github.com/callahantiff/PheKnowLator/blob/master/pkt_kg/utils/data_utils.py) and [`kg_utils.py`](https://github.com/callahantiff/PheKnowLator/blob/master/pkt_kg/utils/kg_utils.py) scripts.  
- **Data**: All downloaded and generated data sources are provided through [this](https://drive.google.com/drive/folders/1sev5zczMviX7UVqMhTpkFXG43K3nQa9f) dedicated Google Drive repository. <u>This notebook will download everything that is needed for you</u>.  
_____
***

## Table of Contents
***

### [Download Ontologies](#create-ontologies)


### [Download and Process Edge Datasets](#create-edges)  


### [Create Identifier Maps](#create-identifier-maps)   

____
***

## Set-Up Environment
***

In [1]:
%%capture
import sys
!{sys.executable} -m pip install -r requirements.txt
sys.path.append('../')

In [2]:
# import needed libraries
import os
import pickle
import tarfile
import pandas as pd
import gffpandas.gffpandas as gffpd
from rdflib import Graph

from pkt_kg.utils import *

#### Define Global Variables

In [3]:
# directory to store resources
resource_data_location = '../resources/'

# directory to use for processing data
processed_data_location = '../resources/processed_data/'

# directory to write ontology data to
ontology_data_location = '../resources/ontologies/'

# directory to write edges data to
edge_data_location = '../resources/edge_data/'

# owltools location
owltools_location = '../pkt_kg/libs/owltools'

In [4]:
# Download helper for data already processed by the PKT ecosystem
def download(name, path):
    """Download `name` from the PKT processed-data bucket into `path`."""
    url = 'https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/' + name
    data_downloader(url, path)
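
If repeated runs should not fetch the same file again, the helper can check whether the target already exists. The following is a minimal sketch, not PKT's implementation: `download_if_missing` and `fake_fetch` are hypothetical names, and `fake_fetch` is a stub that stands in for a real downloader so the example runs offline.

```python
import os
import tempfile

def download_if_missing(name, path, fetch):
    """Call fetch(name, path) only when path/name is not already on disk.

    `fetch` is a hypothetical stand-in for a real downloader.
    Returns True if a download was triggered, False if the file was reused.
    """
    target = os.path.join(path, name)
    if os.path.exists(target):
        return False
    os.makedirs(path, exist_ok=True)
    fetch(name, path)
    return True

calls = []

def fake_fetch(name, path):
    # Stub downloader: records the call and writes a placeholder file.
    calls.append(name)
    with open(os.path.join(path, name), 'w') as handle:
        handle.write('placeholder\n')

workdir = tempfile.mkdtemp()
first = download_if_missing('RELATIONS_LABELS.txt', workdir, fake_fetch)
second = download_if_missing('RELATIONS_LABELS.txt', workdir, fake_fetch)
print(first, second, calls)
```

The second call finds the file on disk and skips the stub, so only one fetch is recorded.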

***
***
### DOWNLOAD ONTOLOGIES  <a class="anchor" id="create-ontologies"></a>
We must establish a unified standard for identifying entities within our simplified RNA-centered KG. The entities are relations, diseases, miRNAs, and genes. Well-established bio-ontologies provide terms for relations and diseases, whereas miRNAs and genes have no direct ontology counterparts and are identified through database identifiers instead.
***
***

### Relation Ontology ([RO](https://www.ebi.ac.uk/ols/ontologies/ro))
The OBO Relations Ontology (RO) is a collection of OWL relations (ObjectProperties) intended for use across a wide variety of biological ontologies.

In [5]:
command = '{} {} --merge-import-closure -o {}'
os.system(command.format(owltools_location, 'http://purl.obolibrary.org/obo/ro.owl',
                         ontology_data_location + 'ro_with_imports.owl'))



0
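
The `0` printed above is the exit status returned by `os.system`; a failed owltools run would only show up as a different number. As a hedged alternative, the same call can be expressed as an argument list and run with `subprocess.run(..., check=True)`, which raises on a non-zero exit code. The helper names below are hypothetical, and the actual run is left commented out because it needs owltools installed.

```python
import subprocess

def build_merge_command(owltools, ontology_url, output_file):
    """Return the owltools call as an argument list (no shell string formatting)."""
    return [owltools, ontology_url, '--merge-import-closure', '-o', output_file]

def run_checked(args):
    """Run a command and raise CalledProcessError on a non-zero exit code."""
    return subprocess.run(args, check=True)

cmd = build_merge_command('../pkt_kg/libs/owltools',
                          'http://purl.obolibrary.org/obo/ro.owl',
                          '../resources/ontologies/ro_with_imports.owl')
print(cmd)
# run_checked(cmd)  # uncomment where owltools is available
```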

In [6]:
# Relation labels are already provided by the PKT ecosystem
download('RELATIONS_LABELS.txt', '../resources/relations_data/')
download('INVERSE_RELATIONS.txt', '../resources/relations_data/')

# Load data, print row count, and preview it
ro_data_label = pd.read_csv('../resources/relations_data/'+'RELATIONS_LABELS.txt', header=0, delimiter='\t')

# We also specify symmetric relations (e.g., RO_0002434 inverseOf RO_0002434)
symmetric_relation = pd.DataFrame({'Relation': 'RO_0002434', 'Inverse_Relation': 'RO_0002434'}, index=[0])
ro_data_inv = pd.read_csv('../resources/relations_data/'+'INVERSE_RELATIONS.txt', header=0, delimiter='\t')
ro_data_inv = pd.concat([ro_data_inv, symmetric_relation], ignore_index=True)

print('There are {edge_count} RO Relations and Labels'.format(edge_count=len(ro_data_label)))
ro_data_label.head(n=5)

Downloading Data from https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/RELATIONS_LABELS.txt
Downloading Data from https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/INVERSE_RELATIONS.txt
There are 667 RO Relations and Labels


Unnamed: 0,Label,Relation
0,helper property (not for use in curation),http://purl.obolibrary.org/obo/RO_0002464
1,developmentally replaces,http://purl.obolibrary.org/obo/RO_0002285
2,is approximately equivalent to,http://purl.obolibrary.org/obo/RO_0002603
3,has intracellular endoparasite,http://purl.obolibrary.org/obo/RO_0002641
4,supplies,http://purl.obolibrary.org/obo/RO_0002178
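
Inverse and symmetric relations are easiest to use as a look-up table in which a symmetric relation is its own inverse. This is a small sketch under stated assumptions: `build_inverse_lookup` is a hypothetical helper, and `RO_X`/`RO_Y` are placeholders rather than real RO identifiers; only `RO_0002434` comes from the cell above.

```python
def build_inverse_lookup(inverse_pairs, symmetric=()):
    """Map each relation to its inverse; symmetric relations map to themselves."""
    lookup = {}
    for relation, inverse in inverse_pairs:
        # Register both directions of every inverse pair.
        lookup[relation] = inverse
        lookup[inverse] = relation
    for relation in symmetric:
        lookup[relation] = relation
    return lookup

# 'RO_X' and 'RO_Y' are placeholders, not real RO identifiers
lookup = build_inverse_lookup([('RO_X', 'RO_Y')], symmetric=['RO_0002434'])
print(lookup['RO_X'], lookup['RO_Y'], lookup['RO_0002434'])
```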


***
### Mondo Disease Ontology ([Mondo](https://www.ebi.ac.uk/ols/ontologies/mondo))
A semi-automatically constructed ontology that merges multiple disease resources into a single coherent ontology.

In [7]:
command = '{} {} --merge-import-closure -o {}'
os.system(command.format(owltools_location, 'http://purl.obolibrary.org/obo/mondo.owl',
                         ontology_data_location + 'mondo_with_imports.owl'))



0

At this point, please run the [<tt>Ontology_Cleaning.ipynb</tt>](https://github.com/callahantiff/PheKnowLator/blob/master/notebooks/Ontology_Cleaning.ipynb) notebook provided by PKT.

***
***
### DOWNLOAD EDGES  <a class="anchor" id="create-edges"></a>
***
***

### gene-disease from [Human Disease Molecular Mechanisms](https://github.com/callahantiff/PheKnowLator/wiki/Building-a-KG-of-Human-Disease-Molecular-Mechanisms) (PKT-built)

In [8]:
data_downloader('https://storage.googleapis.com/pheknowlator/current_build/data/original_data/curated_gene_disease_associations.tsv',
                edge_data_location)

# Rename file adding relationship's identifier
os.rename(edge_data_location+'curated_gene_disease_associations.tsv',
          edge_data_location+'gene-disease_curated_gene_disease_associations.tsv')

data = pd.read_csv(edge_data_location + 'gene-disease_curated_gene_disease_associations.tsv', sep="\t")  
data

Downloading Data from https://storage.googleapis.com/pheknowlator/current_build/data/original_data/curated_gene_disease_associations.tsv


Unnamed: 0,geneId,geneSymbol,DSI,DPI,diseaseId,diseaseName,diseaseType,diseaseClass,diseaseSemanticType,score,EI,YearInitial,YearFinal,NofPmids,NofSnps,source
0,1,A1BG,0.700,0.538,C0019209,Hepatomegaly,phenotype,C23;C06,Finding,0.30,1.000,2017.0,2017.0,1,0,CTD_human
1,1,A1BG,0.700,0.538,C0036341,Schizophrenia,disease,F03,Mental or Behavioral Dysfunction,0.30,1.000,2015.0,2015.0,1,0,CTD_human
2,2,A2M,0.529,0.769,C0002395,Alzheimer's Disease,disease,C10;F03,Disease or Syndrome,0.50,0.769,1998.0,2018.0,3,0,CTD_human
3,2,A2M,0.529,0.769,C0007102,Malignant tumor of colon,disease,C06;C04,Neoplastic Process,0.31,1.000,2004.0,2019.0,1,0,CTD_human
4,2,A2M,0.529,0.769,C0009375,Colonic Neoplasms,group,C06;C04,Neoplastic Process,0.30,1.000,2004.0,2004.0,1,0,CTD_human
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
84033,109580095,HBB-LCR,0.743,0.115,C0002875,Cooley's anemia,disease,C16;C15,Disease or Syndrome,0.30,,,,0,0,CTD_human
84034,109580095,HBB-LCR,0.743,0.115,C0005283,beta Thalassemia,disease,C16;C15,Disease or Syndrome,0.30,,,,0,0,CTD_human
84035,109580095,HBB-LCR,0.743,0.115,C0019025,Hemoglobin F Disease,disease,C16;C15,Disease or Syndrome,0.30,,,,0,0,CTD_human
84036,109580095,HBB-LCR,0.743,0.115,C0085578,Thalassemia Minor,disease,C16;C15,Disease or Syndrome,0.30,,,,0,0,CTD_human


In [9]:
# We keep only 50k (random) rows to reduce input size
data = data.sample(n=50000)
data.to_csv(edge_data_location + 'gene-disease_curated_gene_disease_associations.tsv', header=None, sep='\t', index=None)
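
Random sampling without a seed yields a different subset on every run. If a reproducible subset is wanted, `DataFrame.sample` accepts a `random_state`. The sketch below uses a toy frame; `sample_rows` and the seed value are illustrative choices, not part of the original pipeline.

```python
import pandas as pd

def sample_rows(frame, n, seed=42):
    """Return at most n rows, drawn reproducibly with a fixed seed."""
    return frame.sample(n=min(n, len(frame)), random_state=seed)

# Toy stand-in for the gene-disease table
toy = pd.DataFrame({'geneId': range(10),
                    'diseaseId': [f'C{i:07d}' for i in range(10)]})
first = sample_rows(toy, 5)
second = sample_rows(toy, 5)
print(first.equals(second))
```

Capping `n` at `len(frame)` also avoids the `ValueError` that `sample` raises when more rows are requested than exist.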

To represent genes, we can use NCBI Entrez Gene identifiers (<tt>geneId</tt> column); gene symbols (<tt>geneSymbol</tt> column) would have been a viable choice as well. To denote diseases (<tt>diseaseId</tt> column), the original TSV adopts DisGeNET identifiers, which are UMLS concept identifiers such as <tt>C0019209</tt>.

***
### gene-miRNA from [TarBase](https://dianalab.e-ce.uth.gr/html/diana/web/index.php?r=tarbasev8/index)
DIANA-TarBase v8 is a reference database devoted to the indexing of experimentally supported microRNA (miRNA) targets.

In [10]:
data_downloader('https://dianalab.e-ce.uth.gr/downloads/tarbase_v8_data.tar.gz', edge_data_location)

with tarfile.TarFile(edge_data_location+'tarbase_v8_data.tar', 'r') as tar_ref:
    tar_ref.extractall(edge_data_location)
    
# Remove tar file
os.remove(edge_data_location+'tarbase_v8_data.tar')
    
# Rename file adding relationship's identifier
os.rename(edge_data_location+'TarBase_v8_download.txt',
          edge_data_location+'gene-miRNA_TarBase_v8_download.txt')   

data = pd.read_csv(edge_data_location + 'gene-miRNA_TarBase_v8_download.txt', sep="\t", dtype={"cell_line": "string"})  
data

Downloading Gzipped Data from https://dianalab.e-ce.uth.gr/downloads/tarbase_v8_data.tar.gz


Unnamed: 0,geneId,geneName,mirna,species,cell_line,tissue,category,method,positive_negative,direct_indirect,up_down,condition
0,0910001A06Rik(mmu),0910001A06Rik(mmu),mmu-miR-124-3p,Mus musculus,,,Cancer/Malignant,Microarrays,POSITIVE,INDIRECT,UP,
1,1200004M23Rik(mmu),1200004M23Rik(mmu),mmu-miR-124-3p,Mus musculus,,,Cancer/Malignant,Microarrays,POSITIVE,INDIRECT,DOWN,
2,1700027J05Rik(mmu),1700027J05Rik(mmu),mmu-miR-124-3p,Mus musculus,,,Cancer/Malignant,Microarrays,POSITIVE,INDIRECT,UP,
3,1810015A11Rik(mmu),1810015A11Rik(mmu),mmu-miR-124-3p,Mus musculus,,,Cancer/Malignant,Microarrays,POSITIVE,INDIRECT,DOWN,
4,2310047A01Rik(mmu),2310047A01Rik(mmu),mmu-miR-124-3p,Mus musculus,,,Cancer/Malignant,Microarrays,POSITIVE,INDIRECT,DOWN,
...,...,...,...,...,...,...,...,...,...,...,...,...
927114,uPA(hsa),uPA(hsa),hsa-miR-23b-3p,Homo sapiens,,,Cancer/Malignant,Western Blot,POSITIVE,INDIRECT,DOWN,
927115,vimentin(hsa),vimentin(hsa),hsa-miR-9-5p,Homo sapiens,,,Cancer/Malignant,Western Blot,NEGATIVE,INDIRECT,,
927116,Â Â Â Â (PTPRG)(hsa),Â Â Â Â (PTPRG)(hsa),hsa-miR-146a-5p,Homo sapiens,,,Cancer/Malignant,Microarrays,POSITIVE,INDIRECT,UP,
927117,Î±-actin(mmu),Î±-actin(mmu),mmu-miR-24-3p,Mus musculus,,,Cancer/Malignant,qPCR,POSITIVE,INDIRECT,UP,


In [11]:
# For the time being, we keep only Homo sapiens rows
data = data[data['species'].str.contains("Homo sapiens")]

# Moreover, we keep only 50k (random) rows to reduce input size
data = data.sample(n=50000)

# This simplified KG ignores whether a mature miRNA comes from the 3p or the 5p arm,
# so we store the arm in an additional column and strip the suffix from the name
data['p'] = data['mirna'].str.extract(r'-([35]p)$', expand=False)
data['mirna'] = data['mirna'].str.replace(r'-[35]p$', '', regex=True)
data['mirna'] = data['mirna'].str.lower()

# If you're interested in understanding why Homo sapiens has -3p and -5p miRNAs:
# https://pubmed.ncbi.nlm.nih.gov/12592000/
# Putting aside the -(3/5)p information, we are essentially dealing with non-mature (aka hairpin) miRNA:
# https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3616697/

data

Unnamed: 0,geneId,geneName,mirna,species,cell_line,tissue,category,method,positive_negative,direct_indirect,up_down,condition,p
23496,ENSG00000044446,PHKA2,hsa-mir-766,Homo sapiens,293S,Kidney,Embryonic/Fetal,HITS-CLIP,POSITIVE,DIRECT,DOWN,no treatment (control),5p
439409,ENSG00000166946,CCNDBP1,hsa-let-7b,Homo sapiens,HELA,Cervix,Cancer/Malignant,Microarrays,POSITIVE,INDIRECT,DOWN,"8hrs post-transfection, Overexpression",5p
202972,ENSG00000119866,BCL11A,hsa-mir-30a,Homo sapiens,TZMBL,Cervix,Cancer/Malignant,PAR-CLIP,POSITIVE,DIRECT,DOWN,shRNAs against HIV-1,5p
112380,ENSG00000103495,MAZ,hsa-mir-29b,Homo sapiens,293S,Kidney,Embryonic/Fetal,HITS-CLIP,POSITIVE,DIRECT,DOWN,treatment: arsenite,3p
322689,ENSG00000143499,SMYD2,hsa-mir-301b,Homo sapiens,HEK293,Kidney,Embryonic/Fetal,PAR-CLIP,POSITIVE,DIRECT,DOWN,mild MNase digestion,3p
...,...,...,...,...,...,...,...,...,...,...,...,...,...
198909,ENSG00000119048,UBE2B,hsa-mir-20a,Homo sapiens,HEK293,Kidney,Embryonic/Fetal,PAR-CLIP,POSITIVE,DIRECT,DOWN,mild MNase digestion,5p
27773,ENSG00000052802,MSMO1,hsa-mir-23a,Homo sapiens,293S,Kidney,Embryonic/Fetal,HITS-CLIP,POSITIVE,DIRECT,DOWN,treatment: emetine,3p
346983,ENSG00000148730,EIF4EBP2,hsa-mir-22,Homo sapiens,HUVEC,Umbilical Vein,Normal/Primary,HITS-CLIP,POSITIVE,DIRECT,DOWN,,3p
297095,ENSG00000138757,G3BP2,hsa-mir-27a,Homo sapiens,BETA Cells,Pancreas,Normal/Primary,HITS-CLIP,POSITIVE,DIRECT,DOWN,,3p
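
The arm handling can be checked on a few names in plain Python. This is an illustrative sketch; `split_arm` is a hypothetical helper that separates the `-3p`/`-5p` suffix from the lower-cased base name and returns `None` for names without an arm suffix.

```python
import re

ARM_SUFFIX = re.compile(r'-([35]p)$')

def split_arm(mirna):
    """Return (lower-cased name without arm suffix, arm or None)."""
    match = ARM_SUFFIX.search(mirna)
    arm = match.group(1) if match else None
    base = ARM_SUFFIX.sub('', mirna).lower()
    return base, arm

for name in ['hsa-miR-766-5p', 'hsa-let-7b-5p', 'hsa-miR-21']:
    print(name, '->', split_arm(name))
```

Anchoring the pattern at the end of the string keeps names without a suffix from being assigned an arm by accident.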


In this case, genes are identified only by symbols and ENSG identifiers (<tt>geneName</tt> and <tt>geneId</tt> columns), so we need an additional mapping to link these symbols or ENSG identifiers to NCBI Entrez Gene identifiers.
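
Such a mapping amounts to a join on the Ensembl identifier. The sketch below is illustrative only: the gene-miRNA pairings are toy rows (not real interactions), the third ENSG identifier is made up to show an unmapped case, and the column names `ensembl` and `entrez` are assumptions. The two Ensembl-Entrez pairs are taken from the PKT map previewed later in this notebook.

```python
import pandas as pd

# Toy edge table; pairings are illustrative, the last ENSG id is made up
edges = pd.DataFrame({
    'geneId': ['ENSG00000171241', 'ENSG00000131149', 'ENSG99999999999'],
    'mirna': ['hsa-mir-766', 'hsa-let-7b', 'hsa-mir-21'],
})
ens_to_entrez = pd.DataFrame({
    'ensembl': ['ENSG00000171241', 'ENSG00000131149'],
    'entrez': ['79801', '23199'],
})

# A left join keeps every edge, so unmapped genes stay visible as NaN
mapped = edges.merge(ens_to_entrez, how='left', left_on='geneId', right_on='ensembl')
unmapped = mapped[mapped['entrez'].isna()]
print(mapped[['entrez', 'mirna']])
print(len(unmapped), 'edge(s) without an Entrez identifier')
```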

In [12]:
data.to_csv(edge_data_location + 'gene-miRNA_TarBase_v8_download.txt', header=None, sep='\t', index=None)

***
### miRNA-disease from [miR2Disease](http://watson.compbio.iupui.edu:8080/miR2Disease/)
miR2Disease, a manually curated database, aims at providing a comprehensive resource of miRNA deregulation in various human diseases.

In [13]:
data_downloader('http://watson.compbio.iupui.edu:8080/miR2Disease/download/AllEntries.txt', edge_data_location)

data = pd.read_csv(edge_data_location + 'AllEntries.txt', sep="\t", header=None)  
os.remove(edge_data_location + 'AllEntries.txt')
data

Downloading Data from http://watson.compbio.iupui.edu:8080/miR2Disease/download/AllEntries.txt


Unnamed: 0,0,1,2,3,4,5
0,hsa-let-7f-2,kidney cancer,up-regulated,microarray,2007.0,Micro-RNA profiling in kidney and bladder canc...
1,hsa-let-7g,hepatocellular carcinoma (HCC),down-regulated,"Northern blot, qRT-PCR etc",2008.0,Identification of metastasis-related microRNAs...
2,hsa-let-7g,lung cancer,down-regulated,"Northern blot, qRT-PCR etc",2007.0,The let-7 microRNA represses cell proliferatio...
3,hsa-let-7g,non-small cell lung cancer (NSCLC),down-regulated,"Northern blot, qRT-PCR etc",2008.0,Suppression of non-small cell lung tumor devel...
4,hsa-let-7g,ovarian cancer (OC),down-regulated,"Northern blot, qRT-PCR etc",2007.0,Let-7 expression defines two differentiation s...
...,...,...,...,...,...,...
2897,hsa-miR-21,glioblastoma multiforme (GBM),up-regulated,"Northern blot, qRT-PCR etc",2008.0,miR-124 and miR-137 inhibit proliferation of g...
2898,hsa-miR-21,glioma,up-regulated,"Northern blot, qRT-PCR etc",2008.0,MicroRNA 21 promotes glioma invasion by target...
2899,hsa-miR-21,hepatocellular carcinoma (HCC),up-regulated,"Northern blot, qRT-PCR etc",2006.0,Downregulation of miR-122 in the rodent and hu...
2900,hsa-miR-21,Inclusion body myositis (IBM),up-regulated,microarray,2007.0,Distinctive patterns of microRNA expression in...


In [14]:
# miR2Disease provides a look-up table for mapping disease names to DO terms
data_downloader('http://watson.compbio.iupui.edu:8080/miR2Disease/download/diseaseList.txt', processed_data_location)

descDOmap = pd.read_csv(processed_data_location + 'diseaseList.txt', sep="\t")  
os.remove(processed_data_location + 'diseaseList.txt')
descDOmap

Downloading Data from http://watson.compbio.iupui.edu:8080/miR2Disease/download/diseaseList.txt


Unnamed: 0,disease name in original paper,disease ontology ID
0,Abdominal Aortic Aneurysm,DOID:7693
1,acute lymphoblastic leukemia (ALL),DOID:9952
2,acute myeloid leukemia (AML),DOID:9119
3,acute myocardial infarction,DOID:9408
4,acute promyelocytic leukemia (APL),DOID:9119
...,...,...
169,uterine leiomyoma (ULM),DOID:13223
170,uveal melanoma,DOID:1909
171,vascular disease,DOID:178
172,vesicular stomatitis,DOID:10881


In [15]:
# Ontologies are represented in OWL files that make use of _ for URIs
descDOmap['disease ontology ID'] = descDOmap['disease ontology ID'].astype(str).str.replace(':', '_')

disease2mirna = pd.merge(descDOmap, data, left_on=['disease name in original paper'], right_on=[1]).drop(
    columns=['disease name in original paper'])
disease2mirna[0] = disease2mirna[0].str.lower()
disease2mirna

Unnamed: 0,disease ontology ID,0,1,2,3,4,5
0,DOID_9952,hsa-let-7e,acute lymphoblastic leukemia (ALL),down-regulated,microarray,2007.0,MicroRNA expression signatures accurately disc...
1,DOID_9952,hsa-mir-125a,acute lymphoblastic leukemia (ALL),down-regulated,microarray,2007.0,MicroRNA expression signatures accurately disc...
2,DOID_9952,hsa-mir-151*,acute lymphoblastic leukemia (ALL),up-regulated,microarray,2007.0,MicroRNA expression signatures accurately disc...
3,DOID_9952,hsa-mir-210,acute lymphoblastic leukemia (ALL),up-regulated,microarray,2007.0,MicroRNA expression signatures accurately disc...
4,DOID_9952,hsa-mir-22,acute lymphoblastic leukemia (ALL),down-regulated,"northern blot, qRT-PCR etc",2009.0,Gene silencing of MIR22 in acute lymphoblastic...
...,...,...,...,...,...,...,...
2619,DOID_9080,hsa-mir-542-3p,Waldenstrom Macroglobulinemia (WM),up-regulated,"Northern blot, qRT-PCR etc",2009.0,"microRNA expression in the biology, prognosis ..."
2620,DOID_9080,hsa-mir-9*,Waldenstrom Macroglobulinemia (WM),down-regulated,"Northern blot, qRT-PCR etc",2009.0,"microRNA expression in the biology, prognosis ..."
2621,DOID_9080,hsa-mir-155,Waldenstrom Macroglobulinemia (WM),up-regulated,"Northern blot, qRT-PCR etc",2009.0,"microRNA expression in the biology, prognosis ..."
2622,DOID_9080,hsa-mir-184,Waldenstrom Macroglobulinemia (WM),up-regulated,"Northern blot, qRT-PCR etc",2009.0,"microRNA expression in the biology, prognosis ..."


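We lost 2,902 − 2,624 = 278 rows during mapping (278/2,902 < 10%), because the inner merge keeps only the entries whose disease name matches the look-up table exactly.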
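
To see which entries are dropped, a left join with `indicator=True` labels each row by where it was found. The sketch uses toy data; the DOID values are placeholders and the column names are assumptions, and the trailing space and capital letter in `'Glioma '` illustrate how a small spelling difference prevents an exact match.

```python
import pandas as pd

entries = pd.DataFrame({
    'mirna': ['hsa-let-7g', 'hsa-miR-21', 'hsa-miR-21'],
    'disease': ['lung cancer', 'glioma', 'Glioma '],
})
disease_map = pd.DataFrame({
    'disease': ['lung cancer', 'glioma'],
    'doid': ['DOID_0000001', 'DOID_0000002'],  # placeholder identifiers
})

# 'left_only' marks entries with no exact match in the look-up table
audit = entries.merge(disease_map, on='disease', how='left', indicator=True)
lost = audit[audit['_merge'] == 'left_only']
print(len(entries) - len(lost), 'kept,', len(lost), 'lost')
print(lost['disease'].tolist())
```

Listing the lost names this way makes it possible to judge whether normalizing case or whitespace before the merge would recover some of them.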

In [16]:
disease2mirna.to_csv(edge_data_location + 'miRNA-disease_miR2Disease.txt', header=None, sep='\t', index=None)

***
***
### CREATE MAPPING DATASETS  <a class="anchor" id="create-identifier-maps"></a>
***
***

### Ensembl Gene-Entrez Gene <a class="anchor" id="ensemblgene-entrezgene"></a>


**Purpose:** To map Ensembl gene identifiers to Entrez gene identifiers

**Output:** `ENSEMBL_GENE_ENTREZ_GENE_MAP.txt`

Already provided by PKT ecosystem.

In [17]:
download('ENSEMBL_GENE_ENTREZ_GENE_MAP.txt', processed_data_location)

ensEntrez = pd.read_csv(processed_data_location + 'ENSEMBL_GENE_ENTREZ_GENE_MAP.txt', sep="\t", header=None)
ensEntrez

Downloading Data from https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/ENSEMBL_GENE_ENTREZ_GENE_MAP.txt


Unnamed: 0,0,1,2,3,4,5
0,ENSG00000171241,79801,protein-coding,protein-coding,protein-coding,protein-coding
1,ENSG00000131149,23199,protein-coding,protein-coding,protein-coding,protein-coding
2,ENSG00000096092,28978,protein-coding,protein-coding,protein-coding,protein-coding
3,ENSG00000222691,106479891,snRNA,pseudogene,not protein-coding,not protein-coding
4,ENSG00000230052,100873180,unprocessed_pseudogene,pseudogene,not protein-coding,not protein-coding
...,...,...,...,...,...,...
42283,ENSG00000175699,256369,protein-coding,protein-coding,protein-coding,protein-coding
42284,ENSG00000251308,359776,processed_pseudogene,pseudogene,not protein-coding,not protein-coding
42285,ENSG00000108479,2584,protein-coding,protein-coding,protein-coding,protein-coding
42286,ENSG00000167371,112476,protein-coding,protein-coding,protein-coding,protein-coding


***
### DisGeNET-Mondo <a class="anchor" id="DisGeNET-Mondo"></a>


**Purpose:** To map DisGeNET identifiers to Mondo identifiers

**Output:** `DISEASE_MONDO_MAP.txt`

Already provided by PKT ecosystem.

In [18]:
download('DISEASE_MONDO_MAP.txt', processed_data_location)

disMondo = pd.read_csv(processed_data_location + 'DISEASE_MONDO_MAP.txt', sep="\t", header=None)
disMondo

Downloading Data from https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/DISEASE_MONDO_MAP.txt


Unnamed: 0,0,1
0,0001816,MONDO_0016982
1,0002116,MONDO_0005085
2,0014667,MONDO_0005066
3,0040084,MONDO_0005972
4,0040085,MONDO_0005229
...,...,...
182942,D017098,MONDO_0001341
182943,C131452,MONDO_0013874
182944,19995004,MONDO_0022013
182945,268173,MONDO_0017053


***
### Disease Ontology (DO) - Mondo mapping <a class="anchor" id="doid-mondo"></a>


**Purpose:** To map DO identifiers to Mondo identifiers

**Output:** `DISEASE_DOID_MONDO_Map.txt`

In [19]:
mondo_graph = Graph().parse(ontology_data_location + 'mondo_with_imports.owl')

dbxref_res = gets_ontology_class_dbxrefs(mondo_graph)[0]

# Fix DOIDs (substitute : with _) and upper case them
mondo_dict = {str(k).replace(':','_').upper(): {str(i).split('/')[-1].replace(':','_') for i in v}
              for k, v in dbxref_res.items() if 'doid' in str(k)}

with open(processed_data_location + 'DISEASE_DOID_MONDO_Map.txt', 'w') as outfile:
    for k, v in mondo_dict.items():
        # Join without spaces so the later split(',') yields clean Mondo identifiers
        outfile.write(k + '\t' + ','.join(sorted(v)) + '\n')
 
doidMondo = pd.read_csv(processed_data_location + 'DISEASE_DOID_MONDO_Map.txt', sep="\t", header=None)
doidMondo[1] = doidMondo[1].str.split(',')
doidMondo = doidMondo.explode(1)
doidMondo.to_csv(processed_data_location + 'DISEASE_DOID_MONDO_Map.txt', header=None, sep='\t', index=None)

In [20]:
doidMondo = pd.read_csv(processed_data_location + 'DISEASE_DOID_MONDO_Map.txt', sep="\t", header=None)
doidMondo

Unnamed: 0,0,1
0,DOID_0060231,MONDO_0017195
1,DOID_0081065,MONDO_0850469
2,DOID_3255,MONDO_0002578
3,DOID_6428,MONDO_0006132
4,DOID_0080563,MONDO_0012052
...,...,...
11281,DOID_0080627,MONDO_0008756
11282,DOID_6559,MONDO_0003923
11283,DOID_13810,MONDO_0005439
11284,DOID_999,MONDO_0015691


***
### miRBase ID - miRBase accession


**Purpose:** To map miRBase miRNA identifiers (names such as `hsa-mir-6859-1`) to miRBase accessions, which guarantee a standard identification of miRNA molecules via URIs

**Output:** `MIRBASE_ID_ACCESSION_MAP.txt`

In [21]:
data_downloader('https://www.mirbase.org/download/hsa.gff3', processed_data_location)

miRBaseMap = gffpd.read_gff3(processed_data_location + 'hsa.gff3')  
os.remove(processed_data_location + 'hsa.gff3')
print(miRBaseMap.header)
print(miRBaseMap.df)

Downloading Data from https://www.mirbase.org/download/hsa.gff3
##gff-version 3
##date 2018-3-5
#
# Chromosomal coordinates of Homo sapiens microRNAs
# microRNAs:               miRBase v22
# genome-build-id:         GRCh38
# genome-build-accession:  NCBI_Assembly:GCA_000001405.15
#
# Hairpin precursor sequences have type "miRNA_primary_transcript". 
# Note, these sequences do not represent the full primary transcript, 
# rather a predicted stem-loop portion that includes the precursor 
# miRNA. Mature sequences have type "miRNA".
#

     seq_id source                      type     start       end score strand   
0      chr1      .  miRNA_primary_transcript     17369     17436     .      -  \
1      chr1      .                     miRNA     17409     17431     .      -   
2      chr1      .                     miRNA     17369     17391     .      -   
3      chr1      .  miRNA_primary_transcript     30366     30503     .      +   
4      chr1      .                     miRNA     30438  

In [22]:
miRBaseMap = miRBaseMap.attributes_to_columns()
miRBaseMap = miRBaseMap[['attributes']]
miRBaseMap

Unnamed: 0,attributes
0,ID=MI0022705;Alias=MI0022705;Name=hsa-mir-6859-1
1,ID=MIMAT0027618;Alias=MIMAT0027618;Name=hsa-mi...
2,ID=MIMAT0027619;Alias=MIMAT0027619;Name=hsa-mi...
3,ID=MI0006363;Alias=MI0006363;Name=hsa-mir-1302-2
4,ID=MIMAT0005890;Alias=MIMAT0005890;Name=hsa-mi...
...,...
4796,ID=MIMAT0023714_1;Alias=MIMAT0023714;Name=hsa-...
4797,ID=MI0032313;Alias=MI0032313;Name=hsa-mir-9985
4798,ID=MIMAT0039763;Alias=MIMAT0039763;Name=hsa-mi...
4799,ID=MI0039722;Alias=MI0039722;Name=hsa-mir-12120


In [23]:
miRBaseMap = miRBaseMap.attributes.str.split(';',expand=True)
# Keep only the "Name" and "ID" columns (copy to avoid chained-assignment warnings)
miRBaseMap = miRBaseMap[[2,0]].copy()
# Remove substring "ID="
miRBaseMap[0] = miRBaseMap[0].str[3:]
# Remove substring "Name="
miRBaseMap[2] = miRBaseMap[2].str[5:]
# Keep only hairpin/stem-loop miRNAs: their accessions start with MI,
# whereas MIMAT accessions are reserved for mature sequences
miRBaseMap = miRBaseMap[~miRBaseMap[0].str.startswith('MIMAT')]
miRBaseMap
miRBaseMap

Unnamed: 0,2,0
0,hsa-mir-6859-1,MI0022705
3,hsa-mir-1302-2,MI0006363
5,hsa-mir-6859-2,MI0026420
8,hsa-mir-12136,MI0039740
10,hsa-mir-200b,MI0000342
...,...,...
4791,hsa-mir-1184-3,MI0015972
4793,hsa-mir-3690-2,MI0023561
4795,hsa-mir-6089-2,MI0023563
4797,hsa-mir-9985,MI0032313


In [24]:
miRBaseMap.to_csv(processed_data_location+'MIRBASE_ID_ACCESSION_MAP.txt', header=None,sep='\t', index=None)

***
To represent genes, PKT designates them as subclasses of relevant Sequence Ontology ([SO](https://www.ebi.ac.uk/ols4/ontologies/so)) terms. We add miRNAs as subclasses of [SO_0000647](http://purl.obolibrary.org/obo/SO_0000647) (*miRNA_primary_transcript*).

In [25]:
# KG construction approach dictionary (for non ontological data), provided by PKT ecosystem
download('subclass_construction_map.pkl', '../resources/construction_approach/')

# Load data and preview it
nonO_data = pd.read_pickle(r'../resources/construction_approach/subclass_construction_map.pkl')

# For instance, NCBI gene IDs are mapped to appropriate Sequence Ontology (SO) terms
list(nonO_data.items())[:5]

Downloading Data from https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/subclass_construction_map.pkl


[('84103', ['SO_0001217']),
 ('84690', ['SO_0001217']),
 ('3579', ['SO_0001217']),
 ('54514', ['SO_0001217']),
 ('7159', ['SO_0001217'])]

In [26]:
list(nonO_data.items())[-5:]

[('R-HSA-977442', {'PW_0000848'}),
 ('R-HSA-977441', {'PW_0000848'}),
 ('R-HSA-6788656', {'PW_0000051', 'PW_0000052', 'PW_0000053', 'PW_0000075'}),
 ('R-HSA-168254', {'PW_0001054'}),
 ('R-HSA-71182', {'PW_0000052', 'PW_0000075'})]

In [27]:
# Map every hairpin accession to SO_0000647 (each key gets its own list)
mirna_nonO = {accession: ['SO_0000647'] for accession in miRBaseMap[0]}
nonO_data = {**nonO_data, **mirna_nonO}

list(nonO_data.items())[-5:]

[('MI0015972', ['SO_0000647']),
 ('MI0023561', ['SO_0000647']),
 ('MI0023563', ['SO_0000647']),
 ('MI0032313', ['SO_0000647']),
 ('MI0039722', ['SO_0000647'])]
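
When every accession maps to the same class, it matters whether the entries share one list object or each hold their own: a list built with `[[x]] * n` repeats a single inner list, so a later in-place change to one entry shows up in all of them. A short illustration, where `SO_EXTRA` is a hypothetical second class used only to make the effect visible:

```python
accessions = ['MI0015972', 'MI0023561']

# One inner list repeated: both keys point at the same object
shared = dict(zip(accessions, [['SO_0000647']] * len(accessions)))
# A comprehension creates a fresh list per key
independent = {accession: ['SO_0000647'] for accession in accessions}

shared['MI0015972'].append('SO_EXTRA')       # hypothetical extra class
independent['MI0015972'].append('SO_EXTRA')

print(shared['MI0023561'])       # also changed
print(independent['MI0023561'])  # unchanged
```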

In [28]:
with open('../resources/construction_approach/subclass_construction_map.pkl', 'wb') as handle:
    pickle.dump(nonO_data, handle, protocol=pickle.HIGHEST_PROTOCOL)

***
PKT also provides node and relation metadata (a.k.a. node properties and relation type attributes).

In [29]:
# KG metadata, provided by PKT ecosystem
download('node_metadata_dict.pkl', '../resources/node_data/')

# Load data and preview its top-level keys
metadata = pd.read_pickle(r'../resources/node_data/node_metadata_dict.pkl')

metadata.keys()

Downloading Data from https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/node_metadata_dict.pkl


dict_keys(['nodes', 'relations'])

In [30]:
{k: metadata['nodes'][k] for k in list(metadata['nodes'])[:5]}

{'http://www.ncbi.nlm.nih.gov/gene/1': {'Label': 'A1BG',
  'Description': "A1BG has locus group 'protein-coding' and is located on chromosome 19 (19q13.43).",
  'Synonym': 'HEL-S-163pA|ABG|A1B|epididymis secretory sperm binding protein Li 163pA|GAB|HYST2477alpha-1B-glycoprotein'},
 'http://www.ncbi.nlm.nih.gov/gene/2': {'Label': 'A2M',
  'Description': "A2M has locus group 'protein-coding' and is located on chromosome 12 (12p13.31).",
  'Synonym': 'CPAMD5|S863-7alpha-2-macroglobulin|FWP007|C3 and PZP-like alpha-2-macroglobulin domain-containing protein 5|A2MD|alpha-2-M'},
 'http://www.ncbi.nlm.nih.gov/gene/3': {'Label': 'A2MP1',
  'Description': "A2MP1 has locus group 'pseudo' and is located on chromosome 12 (12p13.31).",
  'Synonym': 'A2MPpregnancy-zone protein pseudogene'},
 'http://www.ncbi.nlm.nih.gov/gene/9': {'Label': 'NAT1',
  'Description': "NAT1 has locus group 'protein-coding' and is located on chromosome 8 (8p22).",
  'Synonym': 'NAT-1|N-acetyltransferase type 1|NATIarylamin

In [31]:
{k: metadata['relations'][k] for k in list(metadata['relations'])[:5]}

{'http://purl.obolibrary.org/obo/RO_0002533': {'Label': 'sequence atomic unit',
  'Description': 'Any individual unit of a collection of like units arranged in a linear order',
  'Synonym': 'None'},
 'http://purl.obolibrary.org/obo/RO_0002577': {'Label': 'system',
  'Description': 'A material entity consisting of multiple components that are causally integrated.',
  'Synonym': 'None'},
 'http://purl.obolibrary.org/obo/RO_0002534': {'Label': 'sequence bearer',
  'Description': 'Any entity that can be divided into parts such that each part is an atomical unit of a sequence',
  'Synonym': 'None'},
 'http://purl.obolibrary.org/obo/RO_0002532': {'Label': 'sequentially ordered entity',
  'Description': 'Any entity that is ordered in discrete units along a linear axis.',
  'Synonym': 'None'},
 'http://purl.obolibrary.org/obo/RO_0002310': {'Label': 'exposure event or process',
  'Description': 'A process occurring within or in the vicinity of an organism that exerts some causal influence on th

We can add miRNA properties (*Label*, *Description*, and *Synonym*) from miRBase.
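
Each node entry in the metadata dictionary is keyed by a URI and carries the three fields shown above. The sketch below builds one such entry for a miRNA. It rests on assumptions: `MIRBASE_URI` is a hypothetical URI prefix, `make_node_entry` is an illustrative helper, and the description is a placeholder; the accession, label, and synonym are those of `MI0000060` in miRBase.

```python
def make_node_entry(label, description, synonym):
    """Build a metadata entry with the same keys as the PKT node metadata."""
    return {'Label': label, 'Description': description, 'Synonym': synonym}

MIRBASE_URI = 'https://www.mirbase.org/hairpin/'  # assumption: illustrative prefix

nodes = {}
nodes[MIRBASE_URI + 'MI0000060'] = make_node_entry(
    label='hsa-let-7a-1',
    description='placeholder description',
    synonym='Homo sapiens let-7a-1 stem-loop',
)
print(nodes)
```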

In [34]:
from Bio import SeqIO

data_downloader('https://www.mirbase.org/download/miRNA.dat', processed_data_location)

# Open the EMBL file
embl_file = processed_data_location + 'miRNA.dat'

# Create empty lists to store the data
data = {
    "ID": [],
    "Description": [],
    "Sequence": [],
    "Comments": [],
    "References": []
}

# Iterate through the records in the EMBL file
for record in SeqIO.parse(embl_file, "embl"):
    data["ID"].append(record.id)
    data["Description"].append(record.description)
    data["Sequence"].append(str(record.seq))
    data["Comments"].append(str(record.annotations.get('comment', '')))
    references = []
    # Number the references from 1, keeping the "([i], 'pubmed_id')" format
    for i, ref in enumerate(record.annotations.get('references', []), start=1):
        references.append(f"{[i], ref.pubmed_id}")
    data["References"].append(", ".join(references))

df = pd.DataFrame(data)
df

Downloading Data from https://www.mirbase.org/download/miRNA.dat


Unnamed: 0,ID,Description,Sequence,Comments,References
0,MI0000001,Caenorhabditis elegans let-7 stem-loop,UACACUGUGGAUCCGGUGAGGUAGUAGGUUGUAUAGUUUGGAAUAU...,let-7 is found on chromosome X in Caenorhabdit...,"([1], '11679671'), ([2], '12672692'), ([3], '1..."
1,MI0000002,Caenorhabditis elegans lin-4 stem-loop,AUGCUUCCGGCCUGUUCCCUGAGACCUCAAGUGUGAGUGUACUAUU...,lin-4 is found on chromosome II in Caenorhabdi...,"([1], '11679671'), ([2], '10642801'), ([3], '1..."
2,MI0000003,Caenorhabditis elegans miR-1 stem-loop,AAAGUGACCGUACCGAGCUGCAUACUUCCUUACAUGCCCAUACUAU...,miR-1 was independently identified in C. elega...,"([1], '11679671'), ([2], '11679672'), ([3], '1..."
3,MI0000004,Caenorhabditis elegans miR-2 stem-loop,UAAACAGUAUACAGAAAGCCAUCAAAGCGGUGGUUGAUGUGUUGCA...,,"([1], '11679671'), ([2], '11679672'), ([3], '1..."
4,MI0000005,Caenorhabditis elegans miR-34 stem-loop,CGGACAAUGCUCGAGAGGCAGUGUGGUUAGCUGGUUGCAUAUUUCC...,,"([1], '11679671'), ([2], '12672692'), ([3], '1..."
...,...,...,...,...,...
38584,MI0041070,Symbiodinium microadriaticum miR-12461 stem-loop,GAGGAUGCUGAUCAUUCACUGGCCCCCUGUGGACACGUGUGUUGCA...,,"([1], '24119094')"
38585,MI0041071,Homo sapiens miR-9902-1 stem-loop,GCAGGGAAAGGGAACCCAGAAAUCUGGUAUGCCAGCAAAGAGAGUA...,,"([1], '25049417')"
38586,MI0041072,Gallus gallus miR-1784b stem-loop,UUCUGCUCCUAUUUAAGUCAAUGGCAGAACUCUCACUGAUUUCAAU...,,"([1], '29079676')"
38587,MI0041073,Monodelphis domestica miR-7385g-1 stem-loop,UAGUCUGAUAUUCCAUGUUUCUAUGUCAUGAAACUUGGAGCAUAGA...,,"([1], '29079676')"


In [35]:
# Keep Homo sapiens records and reduce to "ID", "Description", and "Synonym" columns
df = df[df['Description'].astype(str).str.contains('Homo sapiens')].copy()
df['Synonym'] = df['Description']
df['Description'] = df['Comments'].astype(str) + df['References'].astype(str) + '. Sequence: ' + df['Sequence'].astype(str)
df = df[['ID', 'Description', 'Synonym']]
df

Unnamed: 0,ID,Description,Synonym
57,MI0000060,let-7a-3p cloned in [6] has a 1 nt 3' extensio...,Homo sapiens let-7a-1 stem-loop
58,MI0000061,"([1], '11679670'), ([2], '15183728'), ([3], '1...",Homo sapiens let-7a-2 stem-loop
59,MI0000062,let-7a-3p cloned in [6] has a 1 nt 3' extensio...,Homo sapiens let-7a-3 stem-loop
60,MI0000063,"([1], '11679670'), ([2], '14573789'), ([3], '1...",Homo sapiens let-7b stem-loop
61,MI0000064,"([1], '11679670'), ([2], '14573789'), ([3], '1...",Homo sapiens let-7c stem-loop
...,...,...,...
37307,MI0039734,"([1], '28471449'). Sequence: UUAACAUCUUUUCCAUC...",Homo sapiens miR-12132 stem-loop
37308,MI0039735,"([1], '28471449'). Sequence: GAAGUGUACUUUUUAAU...",Homo sapiens miR-12133 stem-loop
37312,MI0039739,"([1], '28640911'). Sequence: UGUGGAUAUUCUUUUUU...",Homo sapiens miR-12135 stem-loop
37313,MI0039740,"([1], '28640911'). Sequence: GAAAAAGUCAUGGAGGC...",Homo sapiens miR-12136 stem-loop
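
The cell above concatenates comments and references with no separator, so a comment that ends in a period runs straight into the first reference. As a minimal sketch of one alternative, the helper below (`compose_description` is a hypothetical name, not part of the notebook or of PheKnowLator) joins only the non-empty parts and collapses the line breaks embedded in miRBase comments. The sequences are shortened toy strings.

```python
def compose_description(comments: str, references: str, sequence: str) -> str:
    """Join the non-empty parts with single spaces (hypothetical helper)."""
    parts = [
        " ".join(comments.split()),  # collapse newlines and repeated spaces
        f"References: {references}." if references else "",
        f"Sequence: {sequence}" if sequence else "",
    ]
    return " ".join(part for part in parts if part)


# Entry without comments (toy sequence)
short = compose_description("", "([1], '28471449')", "UUAACAUC")

# Entry whose comment contains an embedded line break (toy sequence)
full = compose_description(
    "let-7a-3p cloned in [6] has a 1 nt 3' extension (U), which is incompatible\n"
    "with the genome sequence.",
    "([1], '11679670')",
    "UGGG",
)

print(short)
print(full)
```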


Finally, we add a *Label* column by merging the table with the miRBase identifier map (`miRBaseMap`) built earlier.

In [36]:
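# Inner join on the miRBase accession (column 0 of miRBaseMap); unmatched hairpins are dropped
df = pd.merge(df, miRBaseMap, left_on=['ID'], right_on=[0])
# Column 2 of miRBaseMap holds the miRNA name (e.g., hsa-let-7a-1)
df['Label'] = df[2]
df = df[['ID', 'Label', 'Description', 'Synonym']]
df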

Unnamed: 0,ID,Label,Description,Synonym
0,MI0000060,hsa-let-7a-1,let-7a-3p cloned in [6] has a 1 nt 3' extensio...,Homo sapiens let-7a-1 stem-loop
1,MI0000061,hsa-let-7a-2,"([1], '11679670'), ([2], '15183728'), ([3], '1...",Homo sapiens let-7a-2 stem-loop
2,MI0000062,hsa-let-7a-3,let-7a-3p cloned in [6] has a 1 nt 3' extensio...,Homo sapiens let-7a-3 stem-loop
3,MI0000063,hsa-let-7b,"([1], '11679670'), ([2], '14573789'), ([3], '1...",Homo sapiens let-7b stem-loop
4,MI0000064,hsa-let-7c,"([1], '11679670'), ([2], '14573789'), ([3], '1...",Homo sapiens let-7c stem-loop
...,...,...,...,...
1908,MI0039734,hsa-mir-12132,"([1], '28471449'). Sequence: UUAACAUCUUUUCCAUC...",Homo sapiens miR-12132 stem-loop
1909,MI0039735,hsa-mir-12133,"([1], '28471449'). Sequence: GAAGUGUACUUUUUAAU...",Homo sapiens miR-12133 stem-loop
1910,MI0039739,hsa-mir-12135,"([1], '28640911'). Sequence: UGUGGAUAUUCUUUUUU...",Homo sapiens miR-12135 stem-loop
1911,MI0039740,hsa-mir-12136,"([1], '28640911'). Sequence: GAAAAAGUCAUGGAGGC...",Homo sapiens miR-12136 stem-loop
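
Because `pd.merge` defaults to an inner join, any hairpin whose accession is missing from the map is dropped without notice. The sketch below, on toy frames, shows one way to list such rows first using `how="left"` with `indicator=True`. The accession `MI9999999` is made up, and the layout of `id_map` (accession in column 0, name in column 2, a placeholder column 1) is an assumption that only mirrors how `miRBaseMap` is used above.

```python
import pandas as pd

# Toy stand-ins: MI9999999 is a made-up accession that is absent from the map.
hairpins = pd.DataFrame({"ID": ["MI0000060", "MI0000061", "MI9999999"]})
id_map = pd.DataFrame({
    0: ["MI0000060", "MI0000061"],
    1: ["placeholder", "placeholder"],  # contents of column 1 are not assumed
    2: ["hsa-let-7a-1", "hsa-let-7a-2"],
})

# how="left" keeps every hairpin; indicator=True adds a "_merge" column
# that marks rows without a match as "left_only".
checked = hairpins.merge(id_map, left_on="ID", right_on=0, how="left", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "ID"].tolist()
print(unmatched)

# The default inner join, as in the notebook, silently drops the unmatched row.
labelled = hairpins.merge(id_map, left_on="ID", right_on=0)
labelled = labelled.rename(columns={2: "Label"})[["ID", "Label"]]
print(labelled)
```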


In [37]:
# Convert the DataFrame to a dictionary keyed by node URI
miRNA_dict = {}
for _, row in df.iterrows():
    # Map the miRBase accession to its URI, which serves as the node identifier in the KG
    mirna_uri = f'https://www.mirbase.org/hairpin/{row["ID"]}'
    miRNA_dict[mirna_uri] = {
        'Label': row['Label'],
        'Description': row['Description'],
        'Synonym': row['Synonym']
    }

# Preview the first five entries
{k: miRNA_dict[k] for k in list(miRNA_dict)[:5]}

{'https://www.mirbase.org/hairpin/MI0000060': {'Label': 'hsa-let-7a-1',
  'Description': "let-7a-3p cloned in [6] has a 1 nt 3' extension (U), which is incompatible\nwith the genome sequence.([1], '11679670'), ([2], '15183728'), ([3], '12554860'), ([4], '14573789'), ([5], '15325244'), ([6], '17604727'), ([7], '17616659'), ([8], '17989717'), ([9], '20158877'). Sequence: UGGGAUGAGGUAGUAGGUUGUAUAGUUUUAGGGUCACACCCACCACUGGGAGAUAACUAUACAAUCUACUGUCUUUCCUA",
  'Synonym': 'Homo sapiens let-7a-1 stem-loop'},
 'https://www.mirbase.org/hairpin/MI0000061': {'Label': 'hsa-let-7a-2',
  'Description': "([1], '11679670'), ([2], '15183728'), ([3], '12554860'), ([4], '14573789'), ([5], '15325244'), ([6], '17604727'), ([7], '17616659'), ([8], '17989717'), ([9], '19015728'), ([10], '20158877'). Sequence: AGGUUGAGGUAGUAGGUUGUAUAGUUUAGAAUUACAUCAAGGGAGAUAACUGUACAGCCUCCUAGCUUUCCU",
  'Synonym': 'Homo sapiens let-7a-2 stem-loop'},
 'https://www.mirbase.org/hairpin/MI0000062': {'Label': 'hsa-let-7a-3',
  'Descri

In [38]:
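# Add the miRNA entries to the existing node metadata; on a key collision the miRNA entry wins
nodes_dict = {**metadata.get('nodes'), **miRNA_dict}
nodes_final_dict = {'nodes': nodes_dict}

# Relation metadata is carried over unchanged
rel_dict = {'relations': {**metadata.get('relations')}}
metadata = {**nodes_final_dict, **rel_dict}

# Persist the updated metadata dictionary
with open('../resources/node_data/node_metadata_dict.pkl', 'wb') as handle:
    pickle.dump(metadata, handle, protocol=pickle.HIGHEST_PROTOCOL)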
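
Two properties of the cell above are worth checking: with `{**a, **b}` the right-hand dictionary wins on duplicate keys, and the pickled file should load back unchanged. The following is a minimal sketch on toy data. The `example.org` URIs and labels are made up, and the file goes to a temporary directory rather than `../resources/node_data`.

```python
import os
import pickle
import tempfile

# Toy stand-ins for the notebook's objects (URIs and labels are made up).
metadata = {
    "nodes": {"https://example.org/node/A": {"Label": "old A"}},
    "relations": {"https://example.org/rel/R": {"Label": "R"}},
}
miRNA_dict = {
    "https://example.org/node/A": {"Label": "new A"},
    "https://example.org/node/B": {"Label": "B"},
}

# On a key collision the right-hand dictionary (miRNA_dict) wins.
nodes_dict = {**metadata["nodes"], **miRNA_dict}
merged = {"nodes": nodes_dict, "relations": dict(metadata["relations"])}

# Round trip through a temporary file to confirm the pickle loads back intact.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "node_metadata_dict.pkl")
    with open(path, "wb") as handle:
        pickle.dump(merged, handle, protocol=pickle.HIGHEST_PROTOCOL)
    with open(path, "rb") as handle:
        restored = pickle.load(handle)

print(restored == merged)
print(nodes_dict["https://example.org/node/A"]["Label"])
```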

***
If you are interested in enhancing this KG, you can explore the entire [RNA-KG](https://github.com/AnacletoLAB/RNA-KG/). Feel free to contact me at [emanuele dot cavalleri at unimi dot it](https://mail.google.com/mail/u/0/?view=cm&fs=1&tf=1&to=emanuele.cavalleri@unimi.it).