# a re-run will be needed

***
***

<img width='700' src="https://user-images.githubusercontent.com/8030363/108961534-b9a66980-7634-11eb-96e2-cc46589dcb8c.png" style="vertical-align:middle">

## Pre-Knowledge Graph Build Data Preparation
***

**Authors:** [ECavalleri](https://mail.google.com/mail/u/0/?view=cm&fs=1&tf=1&to=emanuele.cavalleri@studenti.unimi.it), [TJCallahan](https://mail.google.com/mail/u/0/?view=cm&fs=1&tf=1&to=callahantiff@gmail.com), [MMesiti](https://mail.google.com/mail/u/0/?view=cm&fs=1&tf=1&to=xxx@gmail.com)

**GitHub Repositories:** [RNA-KG](https://github.com/emanuelecavalleri/xxx), [PheKnowLator](https://github.com/callahantiff/PheKnowLator/wiki)  
<!--- **Release:** **[v2.0.0](https://github.com/callahantiff/PheKnowLator/wiki/v2.0.0)** --->
  
<br>  
  
**Purpose:** This notebook downloads and processes the data needed to generate the mapping and filtering files used to build edges for the RNA knowledge graph. For more information on the data sources used in this notebook, please see the [Data Sources](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources) Wiki page.

<br>

**Assumptions:**   
- Raw data downloads ➞ `./resources/processed_data/unprocessed_data`    
- Processed data write location ➞ `./resources/processed_data`  

<br>

**Dependencies:**   
- **Scripts**: This notebook utilizes several helper functions, which are stored in the [`data_utils.py`](https://github.com/callahantiff/PheKnowLator/blob/master/pkt_kg/utils/data_utils.py) and [`kg_utils.py`](https://github.com/callahantiff/PheKnowLator/blob/master/pkt_kg/utils/kg_utils.py) scripts.  
- **Data**: Hyperlinks to all downloaded and generated data sources are provided through [this](https://console.cloud.google.com/storage/browser/pheknowlator/release_v2.0.0?project=pheknowlator) dedicated Google Cloud Storage Bucket. <u>This notebook will download everything that is needed for you</u>.  
_____
***

## Table of Contents
***

### [Download Ontologies](#create-ontologies)


### [Create Identifier Maps](#create-identifier-maps)   


### [Download Edge Datasets](#create-edges)  

____

## Set-Up Environment
_____

In [1]:
import sys
!{sys.executable} -m pip install -r requirements.txt
sys.path.append('../')




[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip available: [0m[31;49m22.3.1[0m[39;49m -> [0m[32;49m23.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49m/opt/anaconda3/bin/python -m pip install --upgrade pip[0m


In [2]:
# import needed libraries
import datetime
import glob
import itertools
import networkx
import numpy
import os
import openpyxl
import pandas
import pickle
import re
import requests
import tarfile
import pandas as pd

from collections import Counter
from functools import reduce
from rdflib import Graph, Namespace, URIRef, BNode, Literal
from rdflib.namespace import OWL, RDF, RDFS
from reactome2py import content
from tqdm import tqdm
from typing import Dict, Tuple

from pkt_kg.utils import *
from builds.ontology_cleaning import *

#### Define Global Variables

In [3]:
# directory to store resources
resource_data_location = '../resources/'

# directory to use for processing data
unprocessed_data_location = '../resources/processed_data/unprocessed_data/'
processed_data_location = '../resources/processed_data/'

# directory to write relations data to
relations_data_location = '../resources/relations_data/'

# directory to write node metadata to
node_data_location = '../resources/node_data/'

# directory to write kg construction approach dictionary to
construction_approach_location = '../resources/construction_approach/'

# directory to write ontology data to
ontology_data_location = '../resources/ontologies/'

# directory to write edges data to
edge_data_location = '../resources/edge_data/'

# owltools location
owltools_location = '../pkt_kg/libs/owltools'

# obo namespace
obo = Namespace('http://purl.obolibrary.org/obo/')


# set up environment variables
write_location = '../resources/ontologies'
knowledge_graphs_location = '../resources/knowledge_graphs'

# set global namespaces
schema = Namespace('http://www.w3.org/2001/XMLSchema#')
oboinowl = Namespace('http://www.geneontology.org/formats/oboInOwl#')
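
The cells that follow write into the directories defined above, so those directories need to exist before the first download. A minimal sketch, assuming nothing else creates them; `ensure_directories` is a hypothetical helper and not part of `pkt_kg`:

```python
import os
import tempfile

def ensure_directories(paths):
    """Create each directory (and missing parents); return the paths that were newly created."""
    created = []
    for path in paths:
        if not os.path.isdir(path):
            os.makedirs(path, exist_ok=True)
            created.append(path)
    return created

# Demonstration in a throwaway directory. In the notebook one would pass the
# *_location variables defined above instead of these demo paths.
with tempfile.TemporaryDirectory() as root:
    demo_paths = [os.path.join(root, 'resources', name)
                  for name in ('edge_data', 'ontologies', 'relations_data')]
    first_run = ensure_directories(demo_paths)   # all three are created
    second_run = ensure_directories(demo_paths)  # nothing left to create
```

Calling the helper twice is harmless: the second call finds every directory in place and returns an empty list.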

In [5]:
# download data function for already processed data
def download(name, path):
    url = 'https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/'+name
    if not os.path.exists(path + name):
        data_downloader(url, path)
        
# subclass metadata
download('subclass_construction_map.pkl', construction_approach_location)

***
***
### DOWNLOAD ONTOLOGIES  <a class="anchor" id="create-ontologies"></a>
***
***

#### Relation Ontology (RO)

In [86]:
if not os.path.exists(ontology_data_location + 'ro_with_imports.owl'):
    command = '{} {} --merge-import-closure -o {}'
    os.system(command.format(owltools_location, 'http://purl.obolibrary.org/obo/ro.owl',
                             ontology_data_location + 'ro_with_imports.owl'))

#### Mondo Disease Ontology (MONDO)

In [87]:
if not os.path.exists(ontology_data_location + 'mondo_with_imports.owl'):
    command = '{} {} --merge-import-closure -o {}'
    os.system(command.format(owltools_location, 'http://purl.obolibrary.org/obo/mondo.owl',
                             ontology_data_location + 'mondo_with_imports.owl'))

#### Non-Coding RNA Ontology (NCRO)

In [88]:
if not os.path.exists(ontology_data_location + 'ncro_with_imports.owl'):
    command = '{} {} --merge-import-closure -o {}'
    os.system(command.format(owltools_location, 'http://purl.obolibrary.org/obo/ncro.owl',
                             ontology_data_location + 'ncro_with_imports.owl'))
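
The three ontology cells above repeat the same OWLTools call, and `os.system` only reports failure through its return value, which those cells ignore. A minimal sketch of one shared helper that raises on failure, assuming the same OWLTools arguments as above; `build_owltools_command` and `merge_imports` are hypothetical names:

```python
import subprocess

def build_owltools_command(owltools, ontology_url, output_path):
    """Return the argument list equivalent to the format string used in the cells above."""
    return [owltools, ontology_url, '--merge-import-closure', '-o', output_path]

def merge_imports(owltools, ontology_url, output_path):
    """Run OWLTools; raises subprocess.CalledProcessError on a non-zero exit status."""
    subprocess.run(build_owltools_command(owltools, ontology_url, output_path), check=True)

# Only the command is built here; calling merge_imports would actually run OWLTools.
command = build_owltools_command('../pkt_kg/libs/owltools',
                                 'http://purl.obolibrary.org/obo/ro.owl',
                                 '../resources/ontologies/ro_with_imports.owl')
```

Passing the arguments as a list also avoids shell quoting problems if a path ever contains spaces.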

***
***
### CREATE MAPPING DATASETS  <a class="anchor" id="create-identifier-maps"></a>
***
***

***
### Ensembl Gene-Entrez Gene <a class="anchor" id="ensemblgene-entrezgene"></a>


**Purpose:** To map Ensembl gene identifiers to Entrez gene identifiers when creating `gene`-`gene` edges

**Output:** `ENSEMBL_GENE_ENTREZ_GENE_MAP.txt`

This file is already provided by the PheKnowLator (PKL) ecosystem.

In [7]:
download('ENSEMBL_GENE_ENTREZ_GENE_MAP.txt', processed_data_location)

ensEntrez = pd.read_csv(processed_data_location + 'ENSEMBL_GENE_ENTREZ_GENE_MAP.txt', sep="\t", header=None)
ensEntrez

Unnamed: 0,ENSG00000171241,79801,protein-coding,protein-coding.1,protein-coding.2,protein-coding.3
0,ENSG00000131149,23199,protein-coding,protein-coding,protein-coding,protein-coding
1,ENSG00000096092,28978,protein-coding,protein-coding,protein-coding,protein-coding
2,ENSG00000222691,106479891,snRNA,pseudogene,not protein-coding,not protein-coding
3,ENSG00000230052,100873180,unprocessed_pseudogene,pseudogene,not protein-coding,not protein-coding
4,ENSG00000158050,1844,protein-coding,protein-coding,protein-coding,protein-coding
...,...,...,...,...,...,...
42282,ENSG00000175699,256369,protein-coding,protein-coding,protein-coding,protein-coding
42283,ENSG00000251308,359776,processed_pseudogene,pseudogene,not protein-coding,not protein-coding
42284,ENSG00000108479,2584,protein-coding,protein-coding,protein-coding,protein-coding
42285,ENSG00000167371,112476,protein-coding,protein-coding,protein-coding,protein-coding


***
### Disease Ontology (DO) - MONDO mapping <a class="anchor" id="doid-mondo"></a>


**Purpose:** To map DO identifiers to MONDO identifiers

**Output:** `DISEASE_DOID_MONDO_Map.txt`

In [8]:
mondo_graph = Graph().parse(ontology_data_location + 'mondo_with_imports.owl')

dbxref_res = gets_ontology_class_dbxrefs(mondo_graph)[0]

# Fix DOIDs (substitute : with _)
mondo_dict = {str(k).replace(':','_').upper(): {str(i).split('/')[-1].replace(':','_') for i in v} for k, v in dbxref_res.items() if 'doid' in str(k)}

with open(processed_data_location + 'DISEASE_DOID_MONDO_Map.txt', 'w') as outfile:
    for k, v in mondo_dict.items():
        outfile.write(str(k) + '\t' + str(v).replace('{','').replace('\'','').replace('}','') + '\n')

In [10]:
doidMondo = pd.read_csv(processed_data_location + 'DISEASE_DOID_MONDO_Map.txt', sep="\t", header=None)
doidMondo

Unnamed: 0,0,1
0,DOID_0060503,MONDO_0000778
1,DOID_0111559,MONDO_0032728
2,DOID_3635,MONDO_0018940
3,DOID_0110321,MONDO_0013200
4,DOID_10444,MONDO_0001037
...,...,...
9786,DOID_0111153,MONDO_0016558
9787,DOID_13404,MONDO_0007011
9788,DOID_0111341,MONDO_0007434
9789,DOID_14271,MONDO_0001930
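
The map file above is written by stripping braces and quotes from the string form of each set. A hedged alternative is to join the values explicitly, which also makes the order reproducible. This is a sketch on toy data; `write_mapping` is a hypothetical helper and the multi-value row is invented for illustration:

```python
import io

def write_mapping(mapping, handle):
    """Write one 'key<TAB>value1, value2' line per key, with keys and values sorted."""
    for key in sorted(mapping):
        handle.write('{}\t{}\n'.format(key, ', '.join(sorted(mapping[key]))))

toy_map = {
    'DOID_3635': {'MONDO_0018940'},           # row taken from the preview above
    'DOID_10444': {'MONDO_0001037'},          # row taken from the preview above
    'DOID_EXAMPLE': {'MONDO_B', 'MONDO_A'},   # invented row with two values
}
buffer = io.StringIO()  # stands in for the open output file
write_mapping(toy_map, buffer)
lines = buffer.getvalue().splitlines()
```

The line format matches what the original cell produces for single-value entries; only the ordering of keys and of multiple values is made deterministic.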


***
### Disease description from DO - DO mapping <a class="anchor" id="disease-description-doid"></a>


**Purpose:** To map Disease descriptions from DO to DO identifiers

**Output:** `descDOmap` (an in-memory DataFrame; the downloaded `diseaseList.txt` is deleted after loading)

The source list is provided by miR2Disease.

In [11]:
data_downloader('http://watson.compbio.iupui.edu:8080/miR2Disease/download/diseaseList.txt', processed_data_location)
 
descDOmap = pd.read_csv(processed_data_location + 'diseaseList.txt', sep="\t")
descDOmap

Downloading Data from http://watson.compbio.iupui.edu:8080/miR2Disease/download/diseaseList.txt


Unnamed: 0,disease name in original paper,disease ontology ID
0,Abdominal Aortic Aneurysm,DOID:7693
1,acute lymphoblastic leukemia (ALL),DOID:9952
2,acute myeloid leukemia (AML),DOID:9119
3,acute myocardial infarction,DOID:9408
4,acute promyelocytic leukemia (APL),DOID:9119
...,...,...
169,uterine leiomyoma (ULM),DOID:13223
170,uveal melanoma,DOID:1909
171,vascular disease,DOID:178
172,vesicular stomatitis,DOID:10881


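Next, normalize the table: rename the columns, lowercase the descriptions, and rewrite each identifier in the `DOID_<number>` form used by the DO-MONDO map.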

In [12]:
descDOmap.columns = ['desc', 'doid']
descDOmap['desc'] = descDOmap['desc'].str.lower()
descDOmap['doid'] = descDOmap['doid'].str.upper()
descDOmap['doid'] = descDOmap['doid'].str.replace(':', '_')
descDOmap

Unnamed: 0,desc,doid
0,abdominal aortic aneurysm,DOID_7693
1,acute lymphoblastic leukemia (all),DOID_9952
2,acute myeloid leukemia (aml),DOID_9119
3,acute myocardial infarction,DOID_9408
4,acute promyelocytic leukemia (apl),DOID_9119
...,...,...
169,uterine leiomyoma (ulm),DOID_13223
170,uveal melanoma,DOID_1909
171,vascular disease,DOID_178
172,vesicular stomatitis,DOID_10881


In [13]:
# Remove original file
os.remove(processed_data_location + 'diseaseList.txt')

***
### miRBase ID - Non-Coding RNA Ontology (NCRO) mapping


**Purpose:** To map miRNA identifiers from miRBase to NCRO identifiers

**Output:** `hsa_NCRO_Map.txt`

The mapping is derived from the class labels in NCRO.

In [6]:
# Helper function to map each lowercased class label (and exactMatch value) to the URIs of its classes
def gets_ontology_class_label(graph: Graph) -> Tuple[Dict, Dict]:
    # triples whose predicate mentions 'label' and whose subject is a URI (not a blank node)
    dbx_uris: Dict = dict()
    dbx = [x for x in graph if 'label' in str(x[1]).lower() if isinstance(x[0], URIRef)]
    for x in dbx:
        if str(x[2]).lower() in dbx_uris.keys(): dbx_uris[str(x[2]).lower()].append(str(x[0]))
        else: dbx_uris[str(x[2]).lower()] = [str(x[0])]
    dbx_type = {str(x[2]).lower(): 'DbXref' for x in dbx}

    # same grouping for exactMatch triples; the subject check must test x[0], not the literal list [0]
    ex_uris: Dict = dict()
    ex = [x for x in graph if 'exactmatch' in str(x[1]).lower() if isinstance(x[0], URIRef)]
    for x in ex:
        if str(x[2]).lower() in ex_uris.keys(): ex_uris[str(x[2]).lower()].append(str(x[0]))
        else: ex_uris[str(x[2]).lower()] = [str(x[0])]
    ex_type = {str(x[2]).lower(): 'ExactMatch' for x in ex}

    return {**dbx_uris, **ex_uris}, {**dbx_type, **ex_type}

In [7]:
# read data into RDFLib graph object
ncro_graph = Graph().parse(ontology_data_location + 'ncro_with_imports.owl')

dbxref_res = gets_ontology_class_label(ncro_graph)[0]

# Fix string patterns
ncro_dict = {str(k): {str(i).split('/')[-1].replace(':','_') for i in v} for k, v in dbxref_res.items() if 'NCRO' in str(v) and 'mir-' in str(k) and 'hsa' in str(k)}
ncro_dict2 = {'hsa-'+str(k): {str(i).split('/')[-1].replace(':','_') for i in v} for k, v in dbxref_res.items() if 'NCRO' in str(v) and 'mir-' in str(k) and 'hsa' not in str(k)}

list({**ncro_dict, **ncro_dict2}.items())[:5]

KeyboardInterrupt: 

In [9]:
with open(processed_data_location + 'hsa_NCRO_Map.txt', 'w') as outfile:
    for k, v in {**ncro_dict, **ncro_dict2}.items():
        outfile.write(str(k) + '\t' + str(v).replace('{','').replace('\'','').replace('}','') + '\n')
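
The two dictionary comprehensions above keep labels that contain `mir-` and point to an NCRO class, and add the `hsa-` prefix where it is missing. A minimal sketch of that selection as one function on toy data; `labels_to_mirna_map` is a hypothetical name and the identifiers are invented. One deliberate difference: the sketch keeps only the NCRO identifiers for a label, whereas the cell above keeps every URI attached to a label that has at least one NCRO URI.

```python
def labels_to_mirna_map(label_to_uris):
    """Keep miRNA-like labels that have an NCRO class and key them by 'hsa-' names."""
    result = {}
    for label, uris in label_to_uris.items():
        ncro_ids = {uri.split('/')[-1] for uri in uris if 'NCRO' in uri}
        if 'mir-' not in label or not ncro_ids:
            continue
        key = label if 'hsa' in label else 'hsa-' + label
        result.setdefault(key, set()).update(ncro_ids)
    return result

toy_labels = {  # invented labels and identifiers, for illustration only
    'mir-21': ['http://purl.obolibrary.org/obo/NCRO_EXAMPLE_1'],
    'hsa-mir-34a': ['http://purl.obolibrary.org/obo/NCRO_EXAMPLE_2'],
    'gene': ['http://purl.obolibrary.org/obo/SO_EXAMPLE'],
}
mirna_map = labels_to_mirna_map(toy_labels)
```

Merging into one dictionary with `setdefault` also avoids silently dropping values when a prefixed and an unprefixed label collapse to the same key.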

***
***
### DOWNLOAD EDGES  <a class="anchor" id="create-edges"></a>
***
***

**Get Relations Labels**  
Identify all relations and their labels for use when building the knowledge graph.

In [21]:
# Already provided by PKL ecosystem
download('RELATIONS_LABELS.txt', relations_data_location)

# load data, print row count, and preview it
ro_data_label = pandas.read_csv(relations_data_location + 'RELATIONS_LABELS.txt', header=0, delimiter='\t')

print('There are {edge_count} RO Relations and Labels'.format(edge_count=len(ro_data_label)))
ro_data_label.head(n=5)

There are 667 RO Relations and Labels


Unnamed: 0,Label,Relation
0,helper property (not for use in curation),http://purl.obolibrary.org/obo/RO_0002464
1,developmentally replaces,http://purl.obolibrary.org/obo/RO_0002285
2,is approximately equivalent to,http://purl.obolibrary.org/obo/RO_0002603
3,has intracellular endoparasite,http://purl.obolibrary.org/obo/RO_0002641
4,supplies,http://purl.obolibrary.org/obo/RO_0002178


#### Add non-ontology data to pkl subclass dictionary

In [69]:
obj = pd.read_pickle(construction_approach_location + r'subclass_construction_map.pkl')
obj

{'84103': ['SO_0001217'],
 '84690': ['SO_0001217'],
 '3579': ['SO_0001217'],
 '54514': ['SO_0001217'],
 '7159': ['SO_0001217'],
 '9070': ['SO_0001217'],
 '642641': ['SO_0000336'],
 '105374698': ['SO_0002127'],
 '317781': ['SO_0001217'],
 '5016': ['SO_0001217'],
 '100133310': ['SO_0000336'],
 '51166': ['SO_0001217'],
 '646851': ['SO_0001217'],
 '9819': ['SO_0001217'],
 '100873575': ['SO_0000336'],
 '109910379': ['SO_0001637'],
 '106481303': ['SO_0000336'],
 '51200': ['SO_0001217'],
 '7529': ['SO_0001217'],
 '100874251': ['SO_0002127'],
 '23054': ['SO_0001217'],
 '100130338': ['SO_0000336'],
 '4163': ['SO_0001217'],
 '100271626': ['SO_0000336'],
 '677811': ['SO_0001267'],
 '9367': ['SO_0001217'],
 '106631777': ['SO_0001637'],
 '55763': ['SO_0001217'],
 '100500825': ['SO_0000276'],
 '158434': ['SO_0002127'],
 '83729': ['SO_0001217'],
 '150197': ['SO_0002127'],
 '100873271': ['SO_0000336'],
 '145235': ['SO_0000336'],
 '84964': ['SO_0001217'],
 '646174': ['SO_0001217'],
 '100421201': ['SO_0

In [70]:
ncro_dict = pd.read_csv(processed_data_location + 'hsa_NCRO_Map.txt', sep='\t', header=None)
ncro_dict['SO'] = [['SO_0001265']] * len(ncro_dict)
ncro_0 = ncro_dict.drop(1, axis=1).set_index(0).to_dict()
ncro_0['SO']

{'hsa-mir-1302-8': ['SO_0001265'],
 'hsa-mir-3617': ['SO_0001265'],
 'hsa-mir-4539': ['SO_0001265'],
 'hsa-mir-210': ['SO_0001265'],
 'hsa-mir-6895': ['SO_0001265'],
 'hsa-mir-3143': ['SO_0001265'],
 'hsa-mir-548aq': ['SO_0001265'],
 'hsa-mir-4298': ['SO_0001265'],
 'hsa-mir-6863': ['SO_0001265'],
 'hsa-mir-5092': ['SO_0001265'],
 'hsa-mir-4701': ['SO_0001265'],
 'hsa-mir-3671': ['SO_0001265'],
 'hsa-mir-134': ['SO_0001265'],
 'hsa-mir-3158-1': ['SO_0001265'],
 'hsa-mir-1243': ['SO_0001265'],
 'hsa-mir-5093': ['SO_0001265'],
 'hsa-mir-6085': ['SO_0001265'],
 'hsa-mir-4715': ['SO_0001265'],
 'hsa-mir-3189': ['SO_0001265'],
 'hsa-mir-515-1': ['SO_0001265'],
 'hsa-mir-873': ['SO_0001265'],
 'hsa-mir-590': ['SO_0001265'],
 'hsa-mir-4518': ['SO_0001265'],
 'hsa-mir-1305': ['SO_0001265'],
 'hsa-mir-4327': ['SO_0001265'],
 'hsa-mir-1306': ['SO_0001265'],
 'hsa-mir-3162': ['SO_0001265'],
 'hsa-mir-4643': ['SO_0001265'],
 'hsa-mir-3687-1': ['SO_0001265'],
 'hsa-mir-302d': ['SO_0001265'],
 'hsa-

In [71]:
obj.update(ncro_0['SO'])

In [72]:
ncro_1 = ncro_dict.drop(0, axis=1).set_index(1).to_dict()
ncro_1['SO']

{'NCRO_0002185': ['SO_0001265'],
 'NCRO_0003197': ['SO_0001265'],
 'NCRO_0001132': ['SO_0001265'],
 'NCRO_0002758': ['SO_0001265'],
 'NCRO_0001579': ['SO_0001265'],
 'NCRO_0000846': ['SO_0001265'],
 'NCRO_0002113': ['SO_0001265'],
 'NCRO_0000888': ['SO_0001265'],
 'NCRO_0001547': ['SO_0001265'],
 'NCRO_0001305': ['SO_0001265'],
 'NCRO_0001198': ['SO_0001265'],
 'NCRO_0000986': ['SO_0001265'],
 'NCRO_0002775': ['SO_0001265'],
 'NCRO_0002457': ['SO_0001265'],
 'NCRO_0003265': ['SO_0001265'],
 'NCRO_0001306': ['SO_0001265'],
 'NCRO_0001382': ['SO_0001265'],
 'NCRO_0001210': ['SO_0001265'],
 'NCRO_0000874': ['SO_0001265'],
 'NCRO_0001744': ['SO_0001265'],
 'NCRO_0002862': ['SO_0001265'],
 'NCRO_0002872': ['SO_0001265'],
 'NCRO_0001114': ['SO_0001265'],
 'NCRO_0003095': ['SO_0001265'],
 'NCRO_0000924': ['SO_0001265'],
 'NCRO_0002942': ['SO_0001265'],
 'NCRO_0000855': ['SO_0001265'],
 'NCRO_0001151': ['SO_0001265'],
 'NCRO_0002689': ['SO_0001265'],
 'NCRO_0001884': ['SO_0001265'],
 'NCRO_000

In [73]:
obj.update(ncro_1['SO'])

In [76]:
# Store data (serialize)
with open(construction_approach_location + 'subclass_construction_map.pkl', 'wb') as handle:
    pickle.dump(obj, handle, protocol=pickle.HIGHEST_PROTOCOL)
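
`dict.update` overwrites the value of any key that is already present, so it can be worth checking for overlapping keys before extending the subclass dictionary and rewriting the pickle. A small sketch on toy entries taken from the previews above, written to a temporary directory rather than to the real file:

```python
import os
import pickle
import tempfile

existing = {'84103': ['SO_0001217']}          # entry from the subclass map preview
additions = {'hsa-mir-210': ['SO_0001265']}   # entry from the miRNA preview

overlap = set(existing) & set(additions)  # keys whose values update() would overwrite
merged = dict(existing)                   # copy, so the original stays untouched
merged.update(additions)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'subclass_construction_map.pkl')
    with open(path, 'wb') as handle:
        pickle.dump(merged, handle, protocol=pickle.HIGHEST_PROTOCOL)
    with open(path, 'rb') as handle:
        restored = pickle.load(handle)    # round trip to confirm the file is readable
```

An empty `overlap` set means the update only adds entries and replaces none.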

<br>

***
***
### Linked Data <a class="anchor" id="linked-data"></a>
***
***

#### gene-disease from PKL itself

In [25]:
data_downloader('https://storage.googleapis.com/pheknowlator/current_build/data/original_data/curated_gene_disease_associations.tsv',
                edge_data_location)

# Rename file adding relationship's identifier
os.rename(edge_data_location+'curated_gene_disease_associations.tsv',
          edge_data_location+'gene-disease_curated_gene_disease_associations.tsv')

Downloading Data from https://storage.googleapis.com/pheknowlator/current_build/data/original_data/curated_gene_disease_associations.tsv


In [27]:
data = pd.read_csv(edge_data_location + 'gene-disease_curated_gene_disease_associations.tsv', sep="\t")
data

Unnamed: 0,geneId,geneSymbol,DSI,DPI,diseaseId,diseaseName,diseaseType,diseaseClass,diseaseSemanticType,score,EI,YearInitial,YearFinal,NofPmids,NofSnps,source
0,1,A1BG,0.700,0.538,C0019209,Hepatomegaly,phenotype,C23;C06,Finding,0.30,1.000,2017.0,2017.0,1,0,CTD_human
1,1,A1BG,0.700,0.538,C0036341,Schizophrenia,disease,F03,Mental or Behavioral Dysfunction,0.30,1.000,2015.0,2015.0,1,0,CTD_human
2,2,A2M,0.529,0.769,C0002395,Alzheimer's Disease,disease,C10;F03,Disease or Syndrome,0.50,0.769,1998.0,2018.0,3,0,CTD_human
3,2,A2M,0.529,0.769,C0007102,Malignant tumor of colon,disease,C06;C04,Neoplastic Process,0.31,1.000,2004.0,2019.0,1,0,CTD_human
4,2,A2M,0.529,0.769,C0009375,Colonic Neoplasms,group,C06;C04,Neoplastic Process,0.30,1.000,2004.0,2004.0,1,0,CTD_human
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
84033,109580095,HBB-LCR,0.743,0.115,C0002875,Cooley's anemia,disease,C16;C15,Disease or Syndrome,0.30,,,,0,0,CTD_human
84034,109580095,HBB-LCR,0.743,0.115,C0005283,beta Thalassemia,disease,C16;C15,Disease or Syndrome,0.30,,,,0,0,CTD_human
84035,109580095,HBB-LCR,0.743,0.115,C0019025,Hemoglobin F Disease,disease,C16;C15,Disease or Syndrome,0.30,,,,0,0,CTD_human
84036,109580095,HBB-LCR,0.743,0.115,C0085578,Thalassemia Minor,disease,C16;C15,Disease or Syndrome,0.30,,,,0,0,CTD_human


#### gene-miRNA from TarBase

In [28]:
data_downloader('https://dianalab.e-ce.uth.gr/downloads/tarbase_v8_data.tar.gz', edge_data_location)

with tarfile.TarFile(edge_data_location+'tarbase_v8_data.tar', 'r') as tar_ref:
    tar_ref.extractall(edge_data_location)
    
# Remove tar file
os.remove(edge_data_location+'tarbase_v8_data.tar')
    
# Rename file adding relationship's identifier
os.rename(edge_data_location+'TarBase_v8_download.txt',
          edge_data_location+'gene-miRNA_TarBase_v8_download.txt')    

Downloading Gzipped Data from https://dianalab.e-ce.uth.gr/downloads/tarbase_v8_data.tar.gz
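
The cell above opens `tarbase_v8_data.tar`, which suggests that `data_downloader` has already gunzipped the download. If only a `.tar.gz` file were on disk, the standard library can decompress and extract in one step. A minimal sketch under that assumption, run against a tiny archive built on the fly; `extract_tar_gz` is a hypothetical helper:

```python
import os
import tarfile
import tempfile

def extract_tar_gz(archive_path, destination):
    """Extract a gzip-compressed tar archive and return the names of its members."""
    with tarfile.open(archive_path, 'r:gz') as archive:
        names = archive.getnames()
        archive.extractall(destination)  # only do this with archives from a trusted source
    return names

# Offline demonstration with a one-file archive.
with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, 'example.txt')
    with open(source, 'w') as handle:
        handle.write('geneId\tmirna\n')
    archive_path = os.path.join(tmp, 'example.tar.gz')
    with tarfile.open(archive_path, 'w:gz') as archive:
        archive.add(source, arcname='example.txt')
    out_dir = os.path.join(tmp, 'out')
    os.makedirs(out_dir)
    members = extract_tar_gz(archive_path, out_dir)
    extracted_exists = os.path.exists(os.path.join(out_dir, 'example.txt'))
```

Listing the member names first makes it easy to confirm the expected file is present before renaming it.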


In [73]:
data = pd.read_csv(edge_data_location + 'gene-miRNA_TarBase_v8_download.txt', sep="\t", dtype={"cell_line": "string"})
data

Unnamed: 0,geneId,geneName,mirna,species,cell_line,tissue,category,method,positive_negative,direct_indirect,up_down,condition
0,0910001A06Rik(mmu),0910001A06Rik(mmu),mmu-miR-124-3p,Mus musculus,,,Cancer/Malignant,Microarrays,POSITIVE,INDIRECT,UP,
1,1200004M23Rik(mmu),1200004M23Rik(mmu),mmu-miR-124-3p,Mus musculus,,,Cancer/Malignant,Microarrays,POSITIVE,INDIRECT,DOWN,
2,1700027J05Rik(mmu),1700027J05Rik(mmu),mmu-miR-124-3p,Mus musculus,,,Cancer/Malignant,Microarrays,POSITIVE,INDIRECT,UP,
3,1810015A11Rik(mmu),1810015A11Rik(mmu),mmu-miR-124-3p,Mus musculus,,,Cancer/Malignant,Microarrays,POSITIVE,INDIRECT,DOWN,
4,2310047A01Rik(mmu),2310047A01Rik(mmu),mmu-miR-124-3p,Mus musculus,,,Cancer/Malignant,Microarrays,POSITIVE,INDIRECT,DOWN,
...,...,...,...,...,...,...,...,...,...,...,...,...
927114,uPA(hsa),uPA(hsa),hsa-miR-23b-3p,Homo sapiens,,,Cancer/Malignant,Western Blot,POSITIVE,INDIRECT,DOWN,
927115,vimentin(hsa),vimentin(hsa),hsa-miR-9-5p,Homo sapiens,,,Cancer/Malignant,Western Blot,NEGATIVE,INDIRECT,,
927116,Â Â Â Â (PTPRG)(hsa),Â Â Â Â (PTPRG)(hsa),hsa-miR-146a-5p,Homo sapiens,,,Cancer/Malignant,Microarrays,POSITIVE,INDIRECT,UP,
927117,Î±-actin(mmu),Î±-actin(mmu),mmu-miR-24-3p,Mus musculus,,,Cancer/Malignant,qPCR,POSITIVE,INDIRECT,UP,


In [74]:
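# For the time being, we keep only hsa rows (copy to avoid modifying a view of the full table)
data = data[data['species'].str.contains("Homo sapiens")].copy()
# Record the arm ("3p" or "5p") in its own column, then drop it from the miRNA name
data['p'] = data[data['mirna'].str.contains("p")]['mirna']
data["p"] = data["p"].str[-2:]
# regex=True makes the pattern handling explicit, so '$' is treated as an end-of-string anchor
data["mirna"] = data["mirna"].str.replace(r'-3p$', '', regex=True)
data["mirna"] = data["mirna"].str.replace(r'-5p$', '', regex=True)

# For NCRO ids compatibility
data['mirna'] = data['mirna'].str.lower()
data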

In [75]:
data

Unnamed: 0,geneId,geneName,mirna,species,cell_line,tissue,category,method,positive_negative,direct_indirect,up_down,condition,p
7,39701(hsa),39701(hsa),hsa-mir-21,Homo sapiens,,,Cancer/Malignant,Microarrays,POSITIVE,INDIRECT,DOWN,,5p
13,A2D(hsa),A2D(hsa),hsa-mir-34a,Homo sapiens,,,Cancer/Malignant,pSILAC,POSITIVE,INDIRECT,UP,,5p
14,A2LG(hsa),A2LG(hsa),hsa-mir-34a,Homo sapiens,,,Cancer/Malignant,pSILAC,POSITIVE,INDIRECT,UP,,5p
15,A2LP(hsa),A2LP(hsa),hsa-mir-34a,Homo sapiens,,,Cancer/Malignant,pSILAC,POSITIVE,INDIRECT,UP,,5p
16,A2RP(hsa),A2RP(hsa),hsa-mir-34a,Homo sapiens,,,Cancer/Malignant,pSILAC,POSITIVE,INDIRECT,UP,,5p
...,...,...,...,...,...,...,...,...,...,...,...,...,...
927112,tcag7.648(hsa),tcag7.648(hsa),hsa-mir-34a,Homo sapiens,,,Cancer/Malignant,pSILAC,POSITIVE,INDIRECT,DOWN,,5p
927113,uPA(hsa),uPA(hsa),hsa-mir-23b,Homo sapiens,,,Cancer/Malignant,qPCR,NEGATIVE,INDIRECT,DOWN,,3p
927114,uPA(hsa),uPA(hsa),hsa-mir-23b,Homo sapiens,,,Cancer/Malignant,Western Blot,POSITIVE,INDIRECT,DOWN,,3p
927115,vimentin(hsa),vimentin(hsa),hsa-mir-9,Homo sapiens,,,Cancer/Malignant,Western Blot,NEGATIVE,INDIRECT,,,5p


In [76]:
data.to_csv(edge_data_location + 'gene-miRNA_TarBase_v8_download.txt', header=None, sep='\t', index=None)
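
The name normalization applied to the TarBase table (split off the `-3p`/`-5p` arm, then lowercase) can be checked on single strings. A dependency-free sketch using `re`; `split_arm` is a hypothetical helper and not part of the notebook:

```python
import re

ARM_SUFFIX = re.compile(r'-([35]p)$')  # matches a trailing '-3p' or '-5p'

def split_arm(mirna):
    """Return (lowercased name without the arm suffix, arm or None)."""
    match = ARM_SUFFIX.search(mirna)
    arm = match.group(1) if match else None
    return ARM_SUFFIX.sub('', mirna).lower(), arm

examples = [split_arm(name) for name in ('hsa-miR-21-5p', 'hsa-miR-23b-3p', 'hsa-let-7g')]
```

Anchoring the pattern at the end of the string leaves names without an arm suffix, such as `hsa-let-7g`, unchanged apart from lowercasing.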

#### miRNA-disease from miR2Disease

In [83]:
data_downloader('http://watson.compbio.iupui.edu:8080/miR2Disease/download/AllEntries.txt', edge_data_location)
    
# Rename file adding relationship's identifier
os.rename(edge_data_location+'AllEntries.txt',
          edge_data_location+'miRNA-disease_miR2Disease.txt') 

Downloading Data from http://watson.compbio.iupui.edu:8080/miR2Disease/download/AllEntries.txt


In [84]:
data = pd.read_csv(edge_data_location + 'miRNA-disease_miR2Disease.txt', sep="\t", header=None)  
# For NCRO ids compatibility
data[0] = data[0].str.lower()
data

Unnamed: 0,0,1,2,3,4,5
0,hsa-let-7f-2,kidney cancer,up-regulated,microarray,2007.0,Micro-RNA profiling in kidney and bladder canc...
1,hsa-let-7g,hepatocellular carcinoma (HCC),down-regulated,"Northern blot, qRT-PCR etc",2008.0,Identification of metastasis-related microRNAs...
2,hsa-let-7g,lung cancer,down-regulated,"Northern blot, qRT-PCR etc",2007.0,The let-7 microRNA represses cell proliferatio...
3,hsa-let-7g,non-small cell lung cancer (NSCLC),down-regulated,"Northern blot, qRT-PCR etc",2008.0,Suppression of non-small cell lung tumor devel...
4,hsa-let-7g,ovarian cancer (OC),down-regulated,"Northern blot, qRT-PCR etc",2007.0,Let-7 expression defines two differentiation s...
...,...,...,...,...,...,...
2897,hsa-mir-21,glioblastoma multiforme (GBM),up-regulated,"Northern blot, qRT-PCR etc",2008.0,miR-124 and miR-137 inhibit proliferation of g...
2898,hsa-mir-21,glioma,up-regulated,"Northern blot, qRT-PCR etc",2008.0,MicroRNA 21 promotes glioma invasion by target...
2899,hsa-mir-21,hepatocellular carcinoma (HCC),up-regulated,"Northern blot, qRT-PCR etc",2006.0,Downregulation of miR-122 in the rodent and hu...
2900,hsa-mir-21,Inclusion body myositis (IBM),up-regulated,microarray,2007.0,Distinctive patterns of microRNA expression in...


In [85]:
# From description to DO ids
data.columns = ['mirna', 'desc', 2,3,4,5]
disease2mirna = pd.merge(descDOmap, data, on=['desc'])
disease2mirna

Unnamed: 0,desc,doid,mirna,2,3,4,5
0,adenoma,DOID_657,hsa-let-7a,normal,"Northern blot, qRT-PCR etc",2007.0,Disrupting the pairing between let-7 and Hmga2...
1,adrenocortical carcinoma,DOID_3948,hsa-mir-184,up-regulated,microarray,2009.0,Integrative molecular-bioinformatics study of ...
2,adrenocortical carcinoma,DOID_3948,hsa-mir-503,up-regulated,microarray,2009.0,Integrative molecular-bioinformatics study of ...
3,adrenocortical carcinoma,DOID_3948,hsa-mir-511,down-regulated,microarray,2009.0,Integrative molecular-bioinformatics study of ...
4,adrenocortical carcinoma,DOID_3948,hsa-mir-214,down-regulated,microarray,2009.0,Integrative molecular-bioinformatics study of ...
...,...,...,...,...,...,...,...
1290,vascular disease,DOID_178,hsa-mir-21,up-regulated,"Northern blot, qRT-PCR etc",2007.0,MicroRNA expression signature and antisense-me...
1291,vascular disease,DOID_178,hsa-mir-352,up-regulated,"Northern blot, qRT-PCR etc",2007.0,MicroRNA expression signature and antisense-me...
1292,vascular disease,DOID_178,hsa-mir-365,down-regulated,"Northern blot, qRT-PCR etc",2007.0,MicroRNA expression signature and antisense-me...
1293,vesicular stomatitis,DOID_10881,hsa-mir-93,down-regulated,"Northern blot, qRT-PCR etc",2007.0,Hypersusceptibility to vesicular stomatitis vi...


In [93]:
disease2mirna.to_csv(edge_data_location + 'miRNA-disease_miR2Disease.txt', header=None, sep='\t', index=None)
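
The merge above matches `desc` values exactly. `descDOmap['desc']` was lowercased earlier, while the disease column of the miR2Disease table keeps its original case (for example `hepatocellular carcinoma (HCC)` in the preview), so such rows may not find a partner. A minimal sketch, in plain Python to stay dependency-free, of normalizing both sides before joining; `normalize` and `join_on_description` are hypothetical helpers and one input row is invented:

```python
def normalize(text):
    """Lowercase and collapse whitespace so that equivalent descriptions compare equal."""
    return ' '.join(text.lower().split())

def join_on_description(desc_to_doid, rows):
    """Inner-join (mirna, description) rows with a description -> DOID mapping."""
    lookup = {normalize(desc): doid for desc, doid in desc_to_doid.items()}
    return [(mirna, normalize(desc), lookup[normalize(desc)])
            for mirna, desc in rows if normalize(desc) in lookup]

desc_to_doid = {'acute myeloid leukemia (aml)': 'DOID_9119',
                'vascular disease': 'DOID_178'}
rows = [('hsa-mir-21', 'vascular disease'),
        ('hsa-mir-EXAMPLE', 'Acute Myeloid Leukemia (AML)'),  # invented row with mixed case
        ('hsa-let-7g', 'lung cancer')]                         # no DOID in the toy mapping
joined = join_on_description(desc_to_doid, rows)
```

With pandas, the equivalent step would be lowercasing the `desc` column of both tables before calling `pd.merge`.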


<br>

***
***

```
@misc{callahan_tj_2019_3401437,
  author       = {Callahan, TJ},
  title        = {PheKnowLator},
  month        = mar,
  year         = 2019,
  doi          = {10.5281/zenodo.3401437},
  url          = {https://doi.org/10.5281/zenodo.3401437}
}
```