***
***

# PheKnowLator - Data Preparation

***

**Author:** [TJCallahan](https://mail.google.com/mail/u/0/?view=cm&fs=1&tf=1&to=callahantiff@gmail.com)  
**GitHub Repository:** [PheKnowLator](https://github.com/callahantiff/PheKnowLator/wiki)  
**Release:** **[v2.0.0](https://github.com/callahantiff/PheKnowLator/wiki/v2.0.0)**

**Purpose:** This notebook serves as a script to download and process the data needed to generate the mapping and filtering files used to build edges for the PheKnowLator knowledge graph. For more information on the data sources used within this script, please see the [Data Sources](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources) Wiki page.

**Assumptions:**   
- Raw data downloads ➞ `./resources/processed_data/unprocessed_data`    
- Processed data write location ➞ `./resources/processed_data`   
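
The two locations above have to exist before any cell writes to them. A minimal sketch of creating them is shown below; `ensure_data_dirs` is a hypothetical helper (it is not part of `DataPreparationHelperFunctions.py`), and the demo runs in a temporary directory so that nothing is written to the working tree.

```python
import os
import tempfile


def ensure_data_dirs(root='.'):
    """Create the processed and unprocessed data directories under `root` if missing."""
    processed = os.path.join(root, 'resources', 'processed_data')
    unprocessed = os.path.join(processed, 'unprocessed_data')
    os.makedirs(unprocessed, exist_ok=True)  # also creates `processed`
    return processed, unprocessed


# demo in a throwaway directory (hypothetical root, for illustration only)
with tempfile.TemporaryDirectory() as tmp:
    processed_dir, unprocessed_dir = ensure_data_dirs(tmp)
    dirs_created = os.path.isdir(processed_dir) and os.path.isdir(unprocessed_dir)
```

Calling `ensure_data_dirs()` with the default root would create the paths relative to the directory the notebook is run from.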

**Dependencies:** This notebook utilizes several helper functions, which are stored in the [`DataPreparationHelperFunctions.py`](https://github.com/callahantiff/PheKnowLator/blob/master/scripts/python/DataPreparationHelperFunctions.py) script. Hyperlinks to all downloaded and generated data sources are provided on the [Data Sources](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources) Wiki page as well as within each source subsection of this notebook. All generated data is freely available for download from Dropbox. 

<br>

## Table of Contents
***
### [Create Identifier Maps ](#create-identifier-maps)  
- [HUMAN TRANSCRIPT, GENE, AND PROTEIN IDENTIFIER MAPPING](#human-transcript,-gene,-and-protein-identifier-mapping)
  - [Ensembl Gene-Ensembl Transcript](#ensemblgene-ensembltranscript)  
  - [Entrez Gene-Ensembl Transcript](#entrezgene-ensembltranscript)
  - [Ensembl Gene-Entrez Gene](#ensemblgene-entrezgene)
  - [Ensembl Gene-Protein Ontology](#ensemblgene-proteinontology)
  - [Uniprot Accession-Protein Ontology](#uniprotaccession-proteinontology)
  - [STRING-Protein Ontology](#string-proteinontology)  
<br>
- [OTHER IDENTIFIER MAPPING](#other-identifier-mapping) 
  - [ChEBI Identifiers](#mesh-chebi)  
  - [Human Protein Atlas Tissue and Cell Types](#hpa-uberon)  
  - [Human Disease and Phenotype Identifiers](#disease-identifiers)


### [Create Edge Datasets](#create-edge-datasets)
- [ONTOLOGIES](#ontologies)  
  - [Protein Ontology](#protein-ontology)  
  - [Relation Ontology](#relation-ontology)  
<br>
- [LINKED DATA](#linked-data)  
  - [Reactome Protein-Complex Data](#reactome-protein-complex)
  - [Reactome Complex-Complex Data](#reactome-complex-complex)
  - [Reactome Chemical-Complex Data](#reactome-chemical-complex)  
  - [Uniprot Protein-Cofactor and Protein-Catalyst](#uniprot-protein-cofactorcatalyst)  
  - [NCBI Gene Protein-Coding Genes and Proteins](#ncbi-protein-coding-genes)


### [Gather Instance Data Labels](#create-label-data)   



***

### Set-Up Environment

***


In [1]:
# import needed libraries
import glob
import networkx
import numpy
import pandas

from owlready2 import subprocess
from rdflib import Graph, Namespace, URIRef, BNode, extras, Literal
from rdflib.extras.external_graph_libs import *
from tqdm import tqdm

# import script containing helper functions
from scripts.python.DataPreparationHelperFunctions import *

**Define Global Variables**

In [2]:
# directory to read unprocessed data files from
unprocessed_data_location = 'resources/processed_data/unprocessed_data/'

# directory to write processed data files to
processed_data_location = 'resources/processed_data/'

***
***
### CREATE MAPPING DATASETS  <a class="anchor" id="create-identifier-maps"></a>
***
***

### Human Transcript, Gene, and Protein Identifier Mapping  <a class="anchor" id="human-transcript,-gene,-and-protein-identifier-mapping"></a>

**Data Source Wiki Pages:** 
- [Uniprot Knowledgebase](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#uniprot-knowledgebase)   
- [Protein Ontology](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#protein-ontology)
- [NCBI Gene](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#ncbi-gene) 

**Purpose:** To create protein-coding gene-protein relations and mappings between the identifiers listed below. The edge types produced from each of these mappings are described further within each identifier mapping section: 
- [Ensembl Gene-Ensembl Transcript](#ensemblgene-ensembltranscript)  
- [Entrez Gene-Ensembl Transcript](#entrezgene-ensembltranscript)  
- [Ensembl Gene-Entrez Gene](#ensemblgene-entrezgene)
- [Ensembl Gene-Protein Ontology](#ensemblgene-proteinontology) 
- [Uniprot Accession-Protein Ontology](#uniprotaccession-proteinontology)
- [STRING-Protein Ontology](#string-proteinontology)

**Output:** This script downloads and saves the following data:  
- Human Ensembl Identifiers: [HUMAN_9606_idmapping_selected.tab.gz](ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/by_organism/HUMAN_9606_idmapping_selected.tab.gz) ➞ [`HUMAN_9606_idmapping_selected.tab`](https://www.dropbox.com/s/afk12rtr1aya0za/HUMAN_9606_idmapping_selected.tab?dl=0) 
- Human Protein Identifiers: [promapping.txt](https://proconsortium.org/download/current/promapping.txt) ➞ [`promapping.txt`](https://www.dropbox.com/s/x7wdimv6ph6bl8k/promapping.txt?dl=0) 
- Human Gene Identifiers: [Homo_sapiens.gene_info.gz](ftp://ftp.ncbi.nih.gov/gene/DATA/GENE_INFO/Mammalia/Homo_sapiens.gene_info.gz) ➞ [`Homo_sapiens.gene_info`](https://www.dropbox.com/s/vazlmzxydgv6xzz/Homo_sapiens.gene_info?dl=0)  

***

**Process Data:** `HUMAN_9606_idmapping_selected.tab`

In [44]:
url = 'ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/by_organism/HUMAN_9606_idmapping_selected.tab.gz'
data_downloader(url, unprocessed_data_location)

Downloading gzipped data from ftp server
Decompressing and writing gzipped data
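
The log above reports a download step followed by a decompression step. The sketch below illustrates only the decompression half using the standard library; it is not the implementation of `data_downloader`, and `gunzip_file` and the toy file name are hypothetical.

```python
import gzip
import os
import shutil
import tempfile


def gunzip_file(gz_path, out_dir):
    """Decompress `gz_path` into `out_dir`, dropping the .gz suffix; return the output path."""
    out_path = os.path.join(out_dir, os.path.basename(gz_path)[:-len('.gz')])
    with gzip.open(gz_path, 'rb') as src, open(out_path, 'wb') as dst:
        shutil.copyfileobj(src, dst)  # stream in chunks, so large files are not held in memory
    return out_path


# round-trip a tiny gzipped file in a temporary directory (toy data)
with tempfile.TemporaryDirectory() as tmp:
    gz_path = os.path.join(tmp, 'toy_idmapping.tab.gz')
    with gzip.open(gz_path, 'wb') as handle:
        handle.write(b'P31946\t1433B_HUMAN\n')
    out_path = gunzip_file(gz_path, tmp)
    with open(out_path, 'rb') as handle:
        round_trip = handle.read()
```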


In [76]:
# read in data and provide column labels (needed to unnest data)
uniprot_ensembl = pandas.read_csv(unprocessed_data_location + 'HUMAN_9606_idmapping_selected.tab',
                                  header = None,
                                  delimiter = '\t',
                                  names = ['Entry', 'Entry name', 'GeneID (EntrezGene)', 'RefSeq', 'GI', 'PDB',
                                           'GO', 'UniRef100', 'UniRef90', 'UniRef50', 'UniParc', 'PIR',
                                           'NCBI-taxon', 'MIM', 'UniGene', 'PubMed', 'EMBL', 'EMBL-CDS',
                                           'Ensembl', 'Ensembl_TRS', 'Ensembl_PRO', 'Additional PubMed'],
                                  low_memory=False)

# replace NaN with 'None'
uniprot_ensembl.fillna('None', inplace=True)

In [77]:
# explode nested data
nested_cols = [list(uniprot_ensembl)[2]] + list(uniprot_ensembl)[18:21]
explode_df_ensembl = explode(uniprot_ensembl.copy(), nested_cols, ';')

# preview data
explode_df_ensembl.head(n=3)

Unnamed: 0,Entry,Entry name,GeneID (EntrezGene),RefSeq,GI,PDB,GO,UniRef100,UniRef90,UniRef50,...,NCBI-taxon,MIM,UniGene,PubMed,EMBL,EMBL-CDS,Ensembl,Ensembl_TRS,Ensembl_PRO,Additional PubMed
0,P31946,1433B_HUMAN,7529,NP_003395.1; NP_647539.1; XP_016883528.1,4507949; 377656702; 67464628; 1345590; 1034625...,2BQ0:A; 2BQ0:B; 2C23:A; 4DNK:A; 4DNK:B; 5N10:A...,GO:0005737; GO:0005829; GO:0070062; GO:0005925...,UniRef100_P31946,UniRef90_P31946,UniRef50_P31946,...,9606,601289,,8515476; 14702039; 11780052; 15489334; 2357255...,X57346; AK292717; AL008725; CH471077; CH471077...,CAA40621.1; BAF85406.1; -; EAW75893.1; EAW7589...,ENSG00000166913,ENST00000353703,ENSP00000300161,11996670; 12364343; 12437930; 12468542; 124825...
1,P31946,1433B_HUMAN,7529,NP_003395.1; NP_647539.1; XP_016883528.1,4507949; 377656702; 67464628; 1345590; 1034625...,2BQ0:A; 2BQ0:B; 2C23:A; 4DNK:A; 4DNK:B; 5N10:A...,GO:0005737; GO:0005829; GO:0070062; GO:0005925...,UniRef100_P31946,UniRef90_P31946,UniRef50_P31946,...,9606,601289,,8515476; 14702039; 11780052; 15489334; 2357255...,X57346; AK292717; AL008725; CH471077; CH471077...,CAA40621.1; BAF85406.1; -; EAW75893.1; EAW7589...,ENSG00000166913,ENST00000372839,ENSP00000300161,11996670; 12364343; 12437930; 12468542; 124825...
2,P31946,1433B_HUMAN,7529,NP_003395.1; NP_647539.1; XP_016883528.1,4507949; 377656702; 67464628; 1345590; 1034625...,2BQ0:A; 2BQ0:B; 2C23:A; 4DNK:A; 4DNK:B; 5N10:A...,GO:0005737; GO:0005829; GO:0070062; GO:0005925...,UniRef100_P31946,UniRef90_P31946,UniRef50_P31946,...,9606,601289,,8515476; 14702039; 11780052; 15489334; 2357255...,X57346; AK292717; AL008725; CH471077; CH471077...,CAA40621.1; BAF85406.1; -; EAW75893.1; EAW7589...,ENSG00000166913,ENST00000353703,ENSP00000361930,11996670; 12364343; 12437930; 12468542; 124825...
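
The preview suggests that unnesting produces one row per combination of the `;`-delimited values in the nested columns. The following is a minimal standard-library sketch of that idea, not the `explode` helper itself; `explode_row` is a hypothetical name and the toy row reuses identifiers from the preview. Row order may differ from the helper's output.

```python
from itertools import product


def explode_row(row, nested_cols, sep=';'):
    """Yield one copy of `row` per combination of the values nested in `nested_cols`."""
    split_values = [[v.strip() for v in str(row[col]).split(sep)] for col in nested_cols]
    for combo in product(*split_values):
        new_row = dict(row)
        new_row.update(zip(nested_cols, combo))
        yield new_row


# toy row with two nested columns (values taken from the preview above)
toy_row = {'Entry': 'P31946',
           'Ensembl_TRS': 'ENST00000353703; ENST00000372839',
           'Ensembl_PRO': 'ENSP00000300161; ENSP00000361930'}

exploded = list(explode_row(toy_row, ['Ensembl_TRS', 'Ensembl_PRO']))
```

Two values in each of two columns give four rows, and each value is stripped of the whitespace that follows the delimiter.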


***

**Process Data:** `promapping.txt`

In [276]:
url = 'https://proconsortium.org/download/current/promapping.txt'
data_downloader(url, unprocessed_data_location)

Downloading data file


In [294]:
pro_mapping = pandas.read_csv(unprocessed_data_location + 'promapping.txt',
                              header = None,
                              names = ['pro_id', 'Entry', 'pro_mapping'],
                              delimiter = '\t')

# remove rows without 'UniProtKB'
pro_mapping = pro_mapping.loc[pro_mapping['Entry'].apply(lambda x: x.startswith('UniProtKB:'))] 


# remove identifier type, which appears before ':'
pro_mapping['Entry'] = pro_mapping['Entry'].replace(r'^\w*:', '', regex=True)

# preview data
pro_mapping.head(n=3)

Unnamed: 0,pro_id,Entry,pro_mapping
6,PR:000000005,P37173,is_a
7,PR:000000005,P38438,is_a
8,PR:000000005,Q62312,is_a
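
The two filtering steps above (keep `UniProtKB:` rows, then drop the prefix) can be illustrated on a few toy rows. This is a sketch only: `strip_prefix` is a hypothetical helper and the second toy row, including its `HGNC:` prefix, is invented for illustration.

```python
import re

# matches an identifier-type prefix such as 'UniProtKB:' at the start of a string
PREFIX = re.compile(r'^\w*:')


def strip_prefix(identifier):
    """Remove the identifier type that appears before the first ':'."""
    return PREFIX.sub('', identifier)


# toy rows in (pro_id, Entry, pro_mapping) order
rows = [('PR:000000005', 'UniProtKB:P37173', 'is_a'),
        ('PR:000000001', 'HGNC:1234', 'exact')]  # hypothetical non-UniProtKB row

kept = [(pro_id, strip_prefix(entry), rel)
        for pro_id, entry, rel in rows
        if entry.startswith('UniProtKB:')]
```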


***

**Process Data:** `Homo_sapiens.gene_info`

In [None]:
url = 'ftp://ftp.ncbi.nih.gov/gene/DATA/GENE_INFO/Mammalia/Homo_sapiens.gene_info.gz'
data_downloader(url, unprocessed_data_location)

In [None]:
ncbi_gene = pandas.read_csv(unprocessed_data_location + 'Homo_sapiens.gene_info', header = 0, delimiter = '\t')

# replace "-" with "None"
ncbi_gene.replace('-','None', inplace=True)

***

**Merge Processed Data:** `HUMAN_9606_idmapping_selected.tab` + `promapping.txt`  

In [297]:
# merge uniprot and protein ontology mapping data
uniprot_merged_data = pandas.merge(explode_df_ensembl,
                           pro_mapping,
                           left_on='Entry',
                           right_on='Entry',
                           how='outer')

# replace NaN with 'None'
uniprot_merged_data.fillna('None', inplace=True)

# preview data
uniprot_merged_data.head(n=3)

Unnamed: 0,Entry,Entry name,GeneID (EntrezGene),RefSeq,GI,PDB,GO,UniRef100,UniRef90,UniRef50,...,UniGene,PubMed,EMBL,EMBL-CDS,Ensembl,Ensembl_TRS,Ensembl_PRO,Additional PubMed,pro_id,pro_mapping
0,P31946,1433B_HUMAN,7529,NP_003395.1; NP_647539.1; XP_016883528.1,4507949; 377656702; 67464628; 1345590; 1034625...,2BQ0:A; 2BQ0:B; 2C23:A; 4DNK:A; 4DNK:B; 5N10:A...,GO:0005737; GO:0005829; GO:0070062; GO:0005925...,UniRef100_P31946,UniRef90_P31946,UniRef50_P31946,...,,8515476; 14702039; 11780052; 15489334; 2357255...,X57346; AK292717; AL008725; CH471077; CH471077...,CAA40621.1; BAF85406.1; -; EAW75893.1; EAW7589...,ENSG00000166913,ENST00000353703,ENSP00000300161,11996670; 12364343; 12437930; 12468542; 124825...,PR:000002175,is_a
1,P31946,1433B_HUMAN,7529,NP_003395.1; NP_647539.1; XP_016883528.1,4507949; 377656702; 67464628; 1345590; 1034625...,2BQ0:A; 2BQ0:B; 2C23:A; 4DNK:A; 4DNK:B; 5N10:A...,GO:0005737; GO:0005829; GO:0070062; GO:0005925...,UniRef100_P31946,UniRef90_P31946,UniRef50_P31946,...,,8515476; 14702039; 11780052; 15489334; 2357255...,X57346; AK292717; AL008725; CH471077; CH471077...,CAA40621.1; BAF85406.1; -; EAW75893.1; EAW7589...,ENSG00000166913,ENST00000353703,ENSP00000300161,11996670; 12364343; 12437930; 12468542; 124825...,PR:P31946,exact
2,P31946,1433B_HUMAN,7529,NP_003395.1; NP_647539.1; XP_016883528.1,4507949; 377656702; 67464628; 1345590; 1034625...,2BQ0:A; 2BQ0:B; 2C23:A; 4DNK:A; 4DNK:B; 5N10:A...,GO:0005737; GO:0005829; GO:0070062; GO:0005925...,UniRef100_P31946,UniRef90_P31946,UniRef50_P31946,...,,8515476; 14702039; 11780052; 15489334; 2357255...,X57346; AK292717; AL008725; CH471077; CH471077...,CAA40621.1; BAF85406.1; -; EAW75893.1; EAW7589...,ENSG00000166913,ENST00000372839,ENSP00000300161,11996670; 12364343; 12437930; 12468542; 124825...,PR:000002175,is_a
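
Because the merge is an outer join, rows present in only one of the two inputs are kept and later filled with `'None'`. One way to see how many rows matched is pandas' `indicator` option, sketched here on toy frames; the identifiers are taken from the previews above, but their pairing into these two small frames is illustrative.

```python
import pandas

left = pandas.DataFrame({'Entry': ['P31946', 'P62258'],
                         'Ensembl': ['ENSG00000166913', 'ENSG00000108953']})
right = pandas.DataFrame({'Entry': ['P31946', 'Q62312'],
                          'pro_id': ['PR:000002175', 'PR:000000005']})

# indicator=True adds a '_merge' column: 'both', 'left_only', or 'right_only'
merged = pandas.merge(left, right, on='Entry', how='outer', indicator=True)
counts = merged['_merge'].value_counts().to_dict()
```

Here one accession matches in both frames, one appears only on the left, and one only on the right.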


***

**Merge Processed Data:** `uniprot_merged_data` + `Homo_sapiens.gene_info`

In [298]:
# make sure merge columns are the same type
ncbi_gene['GeneID'] = ncbi_gene['GeneID'].astype(str)

# merge uniprot and ncbi data
merged_data = pandas.merge(uniprot_merged_data,
                           ncbi_gene,
                           left_on='GeneID (EntrezGene)',
                           right_on='GeneID',
                           how='outer')

# replace NaN with 'None'
merged_data.fillna('None', inplace=True)

# preview data
merged_data.head(n=3)

Unnamed: 0,Entry,Entry name,GeneID (EntrezGene),RefSeq,GI,PDB,GO,UniRef100,UniRef90,UniRef50,...,chromosome,map_location,description,type_of_gene,Symbol_from_nomenclature_authority,Full_name_from_nomenclature_authority,Nomenclature_status,Other_designations,Modification_date,Feature_type
0,P31946,1433B_HUMAN,7529,NP_003395.1; NP_647539.1; XP_016883528.1,4507949; 377656702; 67464628; 1345590; 1034625...,2BQ0:A; 2BQ0:B; 2C23:A; 4DNK:A; 4DNK:B; 5N10:A...,GO:0005737; GO:0005829; GO:0070062; GO:0005925...,UniRef100_P31946,UniRef90_P31946,UniRef50_P31946,...,20,20q13.12,tyrosine 3-monooxygenase/tryptophan 5-monooxyg...,protein-coding,YWHAB,tyrosine 3-monooxygenase/tryptophan 5-monooxyg...,O,14-3-3 protein beta/alpha|14-3-3 alpha|brain p...,20191200.0,
1,P31946,1433B_HUMAN,7529,NP_003395.1; NP_647539.1; XP_016883528.1,4507949; 377656702; 67464628; 1345590; 1034625...,2BQ0:A; 2BQ0:B; 2C23:A; 4DNK:A; 4DNK:B; 5N10:A...,GO:0005737; GO:0005829; GO:0070062; GO:0005925...,UniRef100_P31946,UniRef90_P31946,UniRef50_P31946,...,20,20q13.12,tyrosine 3-monooxygenase/tryptophan 5-monooxyg...,protein-coding,YWHAB,tyrosine 3-monooxygenase/tryptophan 5-monooxyg...,O,14-3-3 protein beta/alpha|14-3-3 alpha|brain p...,20191200.0,
2,P31946,1433B_HUMAN,7529,NP_003395.1; NP_647539.1; XP_016883528.1,4507949; 377656702; 67464628; 1345590; 1034625...,2BQ0:A; 2BQ0:B; 2C23:A; 4DNK:A; 4DNK:B; 5N10:A...,GO:0005737; GO:0005829; GO:0070062; GO:0005925...,UniRef100_P31946,UniRef90_P31946,UniRef50_P31946,...,20,20q13.12,tyrosine 3-monooxygenase/tryptophan 5-monooxyg...,protein-coding,YWHAB,tyrosine 3-monooxygenase/tryptophan 5-monooxyg...,O,14-3-3 protein beta/alpha|14-3-3 alpha|brain p...,20191200.0,


***

### Ensembl Gene-Ensembl Transcript <a class="anchor" id="ensemblgene-ensembltranscript"></a>

**Purpose:** To map Ensembl gene identifiers to Ensembl transcript identifiers when creating the following edges: 
- RNA-Cell   
- RNA-Tissue Types  
- RNA-Protein  

**Output:** [`ENSEMBL_GENE_ENSEMBL_TRANSCRIPT_MAP.txt`](https://www.dropbox.com/s/8n1isqytlz2z1g6/ENSEMBL_GENE_ENSEMBL_TRANSCRIPT_MAP.txt?dl=0)

In [None]:
# de-dup data
df_ens = merged_data.drop_duplicates(subset=['Ensembl', 'Ensembl_TRS', 'Symbol_from_nomenclature_authority', 'Synonyms', 'Full_name_from_nomenclature_authority'], keep='first', inplace=False) 

# reformat data and write it out
with open(processed_data_location + 'ENSEMBL_GENE_ENSEMBL_TRANSCRIPT_MAP.txt', 'w') as outfile:
    for idx, row in tqdm(df_ens.iterrows(), total=df_ens.shape[0]):
        if row['Ensembl'] != 'None' and row['Ensembl_TRS'] != 'None': 
            outfile.write(row['Ensembl'] + '\t' + row['Ensembl_TRS'] + '\t' + row['Symbol_from_nomenclature_authority'] + '\t' + row['Synonyms'] + '\t' + row['Full_name_from_nomenclature_authority'] +  '\n')


**Preview Processed Data**

In [300]:
eget_data = pandas.read_csv(processed_data_location + 'ENSEMBL_GENE_ENSEMBL_TRANSCRIPT_MAP.txt',
                            header = None,
                            names=['Ensembl_Gene_IDs', 'Ensembl_Transcript_IDs', 'Ensembl_Labels', 'Ensembl_Synonyms', 'Ensembl_Description'],
                            delimiter = '\t')

print('There are {edge_count} ensembl gene-ensembl transcript edges'.format(edge_count=len(eget_data)))

There are 147648 ensembl gene-ensembl transcript edges


In [301]:
eget_data.head(n=5)

Unnamed: 0,Ensembl_Gene_IDs,Ensembl_Transcript_IDs,Ensembl_Labels,Ensembl_Synonyms,Ensembl_Description
0,ENSG00000166913,ENST00000353703,YWHAB,GW128|HEL-S-1|HS1|KCIP-1|YWHAA,tyrosine 3-monooxygenase/tryptophan 5-monooxyg...
1,ENSG00000166913,ENST00000372839,YWHAB,GW128|HEL-S-1|HS1|KCIP-1|YWHAA,tyrosine 3-monooxygenase/tryptophan 5-monooxyg...
2,ENSG00000108953,ENST00000264335,YWHAE,14-3-3E|HEL2|KCIP-1|MDCR|MDS,tyrosine 3-monooxygenase/tryptophan 5-monooxyg...
3,ENSG00000274474,ENST00000264335,YWHAE,14-3-3E|HEL2|KCIP-1|MDCR|MDS,tyrosine 3-monooxygenase/tryptophan 5-monooxyg...
4,ENSG00000108953,ENST00000571732,YWHAE,14-3-3E|HEL2|KCIP-1|MDCR|MDS,tyrosine 3-monooxygenase/tryptophan 5-monooxyg...
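
The write step used for this map and the maps that follow has the same shape each time: drop duplicates, skip rows whose identifiers are `'None'`, and write tab-delimited lines. A minimal sketch of that pattern with the standard `csv` module is shown below; `write_map` and the shortened `Symbol` column name are hypothetical, and this is an alternative illustration rather than the notebook's own loop.

```python
import csv
import io


def write_map(rows, id_cols, label_cols, handle):
    """Write unique tab-delimited rows, skipping any row with a 'None' identifier."""
    writer = csv.writer(handle, delimiter='\t', lineterminator='\n')
    seen = set()
    for row in rows:
        record = tuple(row[c] for c in id_cols + label_cols)
        if any(row[c] == 'None' for c in id_cols) or record in seen:
            continue
        seen.add(record)
        writer.writerow(record)
    return len(seen)


# toy rows: one valid row, an exact duplicate, and a row with a missing gene identifier
rows = [
    {'Ensembl': 'ENSG00000166913', 'Ensembl_TRS': 'ENST00000353703', 'Symbol': 'YWHAB'},
    {'Ensembl': 'ENSG00000166913', 'Ensembl_TRS': 'ENST00000353703', 'Symbol': 'YWHAB'},
    {'Ensembl': 'None', 'Ensembl_TRS': 'ENST00000372839', 'Symbol': 'YWHAB'},
]

buffer = io.StringIO()
n_written = write_map(rows, ['Ensembl', 'Ensembl_TRS'], ['Symbol'], buffer)
written_text = buffer.getvalue()
```

Passing an open file handle instead of the `io.StringIO` buffer would write the same lines to disk.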


  
***

### Entrez Gene-Ensembl Transcript <a class="anchor" id="entrezgene-ensembltranscript"></a>

**Purpose:** To map Entrez gene identifiers to Ensembl transcript identifiers when creating the following edges: 
- gene-RNA 

**Output:** [`ENTREZ_GENE_ENSEMBL_TRANSCRIPT_MAP.txt`](https://www.dropbox.com/s/yqnofd8h90luygu/ENTREZ_GENE_ENSEMBL_TRANSCRIPT_MAP.txt?dl=0)

In [None]:
# de-dup data
df_ens = merged_data.drop_duplicates(subset=['GeneID', 'Ensembl_TRS', 'Symbol_from_nomenclature_authority', 'Synonyms', 'Full_name_from_nomenclature_authority'], keep='first', inplace=False) 

# reformat data and write it out
with open(processed_data_location + 'ENTREZ_GENE_ENSEMBL_TRANSCRIPT_MAP.txt', 'w') as outfile:
    for idx, row in tqdm(df_ens.iterrows(), total=df_ens.shape[0]):
        if row['GeneID'] != 'None' and row['Ensembl_TRS'] != 'None': 
            outfile.write(row['GeneID'] + '\t' + row['Ensembl_TRS'] + '\t' + row['Symbol_from_nomenclature_authority'] + '\t' + row['Synonyms'] + '\t' + row['Full_name_from_nomenclature_authority'] +  '\n')


**Preview Processed Data**

In [324]:
eet_data = pandas.read_csv(processed_data_location + 'ENTREZ_GENE_ENSEMBL_TRANSCRIPT_MAP.txt',
                            header = None,
                            names=['Entrez_Gene_IDs', 'Ensembl_Transcript_IDs', 'Ensembl_Symbols', 'Ensembl_Synonyms', 'Ensembl_Description'],
                            delimiter = '\t')

print('There are {edge_count} entrez gene-ensembl transcript edges'.format(edge_count=len(eet_data)))

There are 51382 entrez gene-ensembl transcript edges


In [325]:
eet_data.head(n=5)

Unnamed: 0,Entrez_Gene_IDs,Ensembl_Transcript_IDs,Ensembl_Symbols,Ensembl_Synonyms,Ensembl_Description
0,7529,ENST00000353703,YWHAB,GW128|HEL-S-1|HS1|KCIP-1|YWHAA,tyrosine 3-monooxygenase/tryptophan 5-monooxyg...
1,7529,ENST00000372839,YWHAB,GW128|HEL-S-1|HS1|KCIP-1|YWHAA,tyrosine 3-monooxygenase/tryptophan 5-monooxyg...
2,7531,ENST00000264335,YWHAE,14-3-3E|HEL2|KCIP-1|MDCR|MDS,tyrosine 3-monooxygenase/tryptophan 5-monooxyg...
3,7531,ENST00000571732,YWHAE,14-3-3E|HEL2|KCIP-1|MDCR|MDS,tyrosine 3-monooxygenase/tryptophan 5-monooxyg...
4,7531,ENST00000616643,YWHAE,14-3-3E|HEL2|KCIP-1|MDCR|MDS,tyrosine 3-monooxygenase/tryptophan 5-monooxyg...



***

### Ensembl Gene-Entrez Gene <a class="anchor" id="ensemblgene-entrezgene"></a>


**Purpose:** To map Ensembl gene identifiers to Entrez gene identifiers when creating the following edges:   
- gene-gene

**Output:** [`ENSEMBL_GENE_ENTREZ_GENE_MAP.txt`](https://www.dropbox.com/s/crghjh2we5v7pws/ENSEMBL_GENE_ENTREZ_GENE_MAP.txt?dl=0)

In [None]:
# de-dup data
df_ens = merged_data.drop_duplicates(subset=['Ensembl', 'GeneID', 'Symbol_from_nomenclature_authority', 'Synonyms', 'Full_name_from_nomenclature_authority'], keep='first', inplace=False) 

# reformat data and write it out
with open(processed_data_location + 'ENSEMBL_GENE_ENTREZ_GENE_MAP.txt', 'w') as outfile:
    for idx, row in tqdm(df_ens.iterrows(), total=df_ens.shape[0]):
        if row['Ensembl'] != 'None' and row['GeneID'] != 'None': 
            outfile.write(row['Ensembl'] + '\t' + row['GeneID'] + '\t' + row['Symbol_from_nomenclature_authority'] + '\t' + row['Synonyms'] + '\t' + row['Full_name_from_nomenclature_authority'] +  '\n')


**Preview Processed Data**

In [326]:
egeg_data = pandas.read_csv(processed_data_location + 'ENSEMBL_GENE_ENTREZ_GENE_MAP.txt',
                            header = None,
                            names=['Ensembl_Gene_IDs', 'Entrez_Gene_IDs', 'Ensembl_Symbols', 'Ensembl_Synonyms', 'Ensembl_Description'],
                            delimiter = '\t')

print('There are {edge_count} ensembl gene-entrez gene edges'.format(edge_count=len(egeg_data)))

There are 20972 ensembl gene-entrez gene edges


In [327]:
egeg_data.head(n=5)

Unnamed: 0,Ensembl_Gene_IDs,Entrez_Gene_IDs,Ensembl_Symbols,Ensembl_Synonyms,Ensembl_Description
0,ENSG00000166913,7529,YWHAB,GW128|HEL-S-1|HS1|KCIP-1|YWHAA,tyrosine 3-monooxygenase/tryptophan 5-monooxyg...
1,ENSG00000108953,7531,YWHAE,14-3-3E|HEL2|KCIP-1|MDCR|MDS,tyrosine 3-monooxygenase/tryptophan 5-monooxyg...
2,ENSG00000274474,7531,YWHAE,14-3-3E|HEL2|KCIP-1|MDCR|MDS,tyrosine 3-monooxygenase/tryptophan 5-monooxyg...
3,ENSG00000128245,7533,YWHAH,YWHA1,tyrosine 3-monooxygenase/tryptophan 5-monooxyg...
4,ENSG00000170027,7532,YWHAG,14-3-3GAMMA|EIEE56|PPP1R170,tyrosine 3-monooxygenase/tryptophan 5-monooxyg...



***

### Ensembl Gene - Protein Ontology <a class="anchor" id="ensemblgene-proteinontology"></a>

**Purpose:** To map Ensembl gene identifiers to Protein Ontology identifiers when creating the following edges:   
- RNA-protein
- gene-protein

**Output:** [`ENSEMBL_GENE_PRO_ONTOLOGY_MAP.txt`](https://www.dropbox.com/s/3rt17t81yqjuc62/ENSEMBL_PROTEIN_PRO_MAP.txt?dl=0)

In [None]:
# de-dup data
df_ens = merged_data.drop_duplicates(subset=['Ensembl', 'pro_id'], keep='first', inplace=False) 

# reformat data and write it out
with open(processed_data_location + 'ENSEMBL_GENE_PRO_ONTOLOGY_MAP.txt', 'w') as outfile:
    for idx, row in tqdm(df_ens.iterrows(), total=df_ens.shape[0]):
        if row['Ensembl'] != 'None' and row['pro_id'] != 'None': 
            outfile.write(row['Ensembl'] + '\t' + row['pro_id'].replace(':', '_') + '\t' + row['Symbol_from_nomenclature_authority'] + '\t' + row['Synonyms'] + '\t' + row['Full_name_from_nomenclature_authority'] +  '\n')


**Preview Processed Data**

In [330]:
egpr_data = pandas.read_csv(processed_data_location + 'ENSEMBL_GENE_PRO_ONTOLOGY_MAP.txt',
                            header = None,
                            names=['Ensembl_Gene_IDs', 'Protein_Ontology_IDs', 'Ensembl_Symbols', 'Ensembl_Synonyms', 'Ensembl_Description'],
                            delimiter = '\t')

print('There are {edge_count} ensembl gene-protein ontology edges'.format(edge_count=len(egpr_data)))

There are 41208 ensembl gene-protein ontology edges


In [331]:
egpr_data.head(n=5)

Unnamed: 0,Ensembl_Gene_IDs,Protein_Ontology_IDs,Ensembl_Symbols,Ensembl_Synonyms,Ensembl_Description
0,ENSG00000166913,PR_000002175,YWHAB,GW128|HEL-S-1|HS1|KCIP-1|YWHAA,tyrosine 3-monooxygenase/tryptophan 5-monooxyg...
1,ENSG00000166913,PR_P31946,YWHAB,GW128|HEL-S-1|HS1|KCIP-1|YWHAA,tyrosine 3-monooxygenase/tryptophan 5-monooxyg...
2,ENSG00000108953,PR_000003104,YWHAE,14-3-3E|HEL2|KCIP-1|MDCR|MDS,tyrosine 3-monooxygenase/tryptophan 5-monooxyg...
3,ENSG00000108953,PR_P62258,YWHAE,14-3-3E|HEL2|KCIP-1|MDCR|MDS,tyrosine 3-monooxygenase/tryptophan 5-monooxyg...
4,ENSG00000274474,PR_000003104,YWHAE,14-3-3E|HEL2|KCIP-1|MDCR|MDS,tyrosine 3-monooxygenase/tryptophan 5-monooxyg...



***

### Uniprot Accession-Protein Ontology <a class="anchor" id="uniprotaccession-proteinontology"></a>

**Purpose:** To map Uniprot accession identifiers to Protein Ontology identifiers when creating the following edges:  
- protein-gobp  
- protein-gomf  
- protein-gocc  
- protein-complex  
- protein-cofactor  
- protein-catalyst 
- protein-reaction  
- protein-pathway

**Output:** [`UNIPROT_ACCESSION_PRO_ONTOLOGY_MAP.txt`](https://www.dropbox.com/s/txp8tqdipzwus9p/UNIPROT_ACCESSION_PRO_ONTOLOGY_MAP.txt?dl=0)

In [None]:
# de-dup data
df_ens = merged_data.drop_duplicates(subset=['Entry', 'pro_id'], keep='first', inplace=False) 

# reformat data and write it out
with open(processed_data_location + 'UNIPROT_ACCESSION_PRO_ONTOLOGY_MAP.txt', 'w') as outfile:
    for idx, row in tqdm(df_ens.iterrows(), total=df_ens.shape[0]):
        if row['Entry'] != 'None' and row['pro_id'] != 'None': 
            outfile.write(row['Entry'] + '\t' + row['pro_id'].replace(':', '_') +  '\n')


**Preview Processed Data**

In [319]:
uapr_data = pandas.read_csv(processed_data_location + 'UNIPROT_ACCESSION_PRO_ONTOLOGY_MAP.txt',
                            header = None,
                            names=['Uniprot_Accession_IDs', 'Protein_Ontology_IDs'],
                            delimiter = '\t')

print('There are {edge_count} uniprot accession-protein ontology edges'.format(edge_count=len(uapr_data)))

There are 313776 uniprot accession-protein ontology edges


In [320]:
uapr_data.head(n=5)

Unnamed: 0,Uniprot_Accession_IDs,Protein_Ontology_IDs
0,P31946,PR_000002175
1,P31946,PR_P31946
2,P62258,PR_000003104
3,P62258,PR_P62258
4,Q04917,PR_000017551



***

### STRING-Protein Ontology <a class="anchor" id="string-proteinontology"></a>

**Purpose:** To map STRING identifiers to Protein Ontology identifiers when creating the following edges:   
- protein-protein  

**Output:** [`STRING_PRO_ONTOLOGY_MAP.txt`](https://www.dropbox.com/s/mekh5lr3bxp7gvu/STRING_PRO_ONTOLOGY_MAP.txt?dl=0)

In [None]:
# de-dup data
df_ens = merged_data.drop_duplicates(subset=['Ensembl_PRO', 'pro_id'], keep='first', inplace=False) 

# reformat data and write it out
with open(processed_data_location + 'STRING_PRO_ONTOLOGY_MAP.txt', 'w') as outfile:
    for idx, row in tqdm(df_ens.iterrows(), total=df_ens.shape[0]):
        if row['Ensembl_PRO'] != 'None' and row['pro_id'] != 'None': 
            outfile.write('9606.' + row['Ensembl_PRO'] + '\t' + row['pro_id'].replace(':', '_') +  '\n')


**Preview Processed Data**

In [322]:
stpr_data = pandas.read_csv(processed_data_location + 'STRING_PRO_ONTOLOGY_MAP.txt',
                            header = None,
                            names=['STRING_IDs', 'Protein_Ontology_IDs'],
                            delimiter = '\t')

print('There are {edge_count} string-protein ontology edges'.format(edge_count=len(stpr_data)))

There are 92755 string-protein ontology edges


In [323]:
stpr_data.head(n=5)

Unnamed: 0,STRING_IDs,Protein_Ontology_IDs
0,9606.ENSP00000300161,PR_000002175
1,9606.ENSP00000300161,PR_P31946
2,9606. ENSP00000361930,PR_000002175
3,9606. ENSP00000361930,PR_P31946
4,9606.ENSP00000264335,PR_000003104
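
Some identifiers in the preview, such as `9606. ENSP00000361930`, contain a space after the taxon prefix, which suggests that whitespace following the `;` delimiter survived the unnesting step. A minimal sketch of building the STRING identifier with the whitespace removed is given below; `string_id` is a hypothetical helper, and whether downstream code expects the stripped form is an assumption.

```python
def string_id(ensembl_protein, taxon='9606'):
    """Prefix an Ensembl protein identifier with the NCBI taxon, removing stray whitespace."""
    return taxon + '.' + ensembl_protein.strip()


cleaned = [string_id(x) for x in ['ENSP00000300161', ' ENSP00000361930']]
```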


***
### Other Identifier Mapping <a class="anchor" id="other-identifier-mapping"></a>
* [ChEBI Identifiers](#mesh-chebi)  
* [Human Protein Atlas Tissue and Cell Types](#hpa-uberon) 
* [Human Disease and Phenotype Identifiers](#disease-identifiers) 

***

### MeSH - ChEBI <a class="anchor" id="mesh-chebi"></a>

**Data Source Wiki Page:** [mapping-mesh-to-chebi](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#mapping-mesh-identifiers-to-chebi-identifiers)  

**Purpose:** Map MeSH identifiers to ChEBI identifiers when creating the following edges:  
- Chemical-Gene  
- Chemical-Disease

**Dependencies:** This section assumes that the `NCBO_rest_api.py` script has been run and that the data it generated was written to `./resources/processed_data/temp`. 

**Output:** [`MESH_CHEBI_MAP.txt`](https://www.dropbox.com/s/5nr87v5h6x8oc1b/MESH_CHEBI_MAP.txt?dl=0)


In [None]:
with open(processed_data_location + 'MESH_CHEBI_MAP.txt', 'w') as out:
    for filename in tqdm(glob.glob(processed_data_location + 'temp/*.txt')):
        for row in list(filter(None, open(filename, 'r').read().split('\n'))):
            mesh = '_'.join(row.split('\t')[0].split('/')[-2:])
            chebi = row.split('\t')[1].split('/')[-1]
            out.write(mesh + '\t' + chebi + '\n')


**Preview Processed Data**

In [147]:
mc_data = pandas.read_csv(processed_data_location + 'MESH_CHEBI_MAP.txt',
                          delimiter = '\t',
                          header=None,
                          names=['MeSH_IDs', 'ChEBI_IDs'])

print('There are {edge_count} MeSH-ChEBI edges'.format(edge_count=len(mc_data)))

There are 11434 MeSH-ChEBI edges


In [148]:
mc_data.head(n=5)

Unnamed: 0,MeSH_IDs,ChEBI_IDs
0,MESH_C535085,CHEBI_133814
1,MESH_C008574,CHEBI_17221
2,MESH_C492482,CHEBI_34581
3,MESH_C007556,CHEBI_135978
4,MESH_C500395,CHEBI_29138
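
The parsing in the loop above takes the last two path segments of the first column and the last path segment of the second. The sketch below applies the same string operations to a single toy row; the URI forms are assumptions about what `NCBO_rest_api.py` writes, chosen so that they reproduce the first row of the preview.

```python
def parse_mapping_row(row):
    """Convert a tab-delimited pair of URIs into (MeSH, ChEBI) identifiers."""
    mesh_uri, chebi_uri = row.split('\t')[:2]
    mesh = '_'.join(mesh_uri.split('/')[-2:])  # e.g. '.../MESH/C535085' -> 'MESH_C535085'
    chebi = chebi_uri.split('/')[-1]           # e.g. '.../obo/CHEBI_133814' -> 'CHEBI_133814'
    return mesh, chebi


# toy input row (assumed URI format)
toy_row = ('http://purl.bioontology.org/ontology/MESH/C535085'
           '\thttp://purl.obolibrary.org/obo/CHEBI_133814')
mesh_id, chebi_id = parse_mapping_row(toy_row)
```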



***

### Human Protein Atlas Tissue/Cells - UBERON + Cell Ontology <a class="anchor" id="hpa-uberon"></a>

**Data Source Wiki Page:** [human-protein-atlas](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#human-protein-atlas)  

**Purpose:** Downloads the [`normal_tissue.tsv.zip`](https://www.proteinatlas.org/download/normal_tissue.tsv.zip) and [`rna_tissue_consensus.tsv.zip`](https://www.proteinatlas.org/download/rna_tissue_consensus.tsv.zip) files in order to map HPA cell and tissue type strings to Uber-Anatomy (UBERON) and Cell Ontology concepts (see [human-protein-atlas](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#human-protein-atlas) for details on the mapping process). These mappings are then used to create the following edge types:  
- RNA-cell type  
- RNA-tissue type  
- RNA-protein  

**Output:**  
- All HPA tissue and cell type strings ➞ [`HPA_tissues.txt`](https://www.dropbox.com/s/m0spn8h1l8kxb61/HPA_tissues.txt?dl=0)  
- Mapping HPA strings to ontology concepts (documentation) ➞ [`zooma_tissue_cell_mapping_04DEC2019.xlsx`](https://www.dropbox.com/s/9xjhc7aygecmtnm/zooma_tissue_cell_mapping_04DEC2019.xlsx?dl=0)  
- Final HPA-ontology mappings ➞ [`HPA_TISSUE_CELL_MAP.txt`](https://www.dropbox.com/s/dsh1x88u6251w76/HPA_TISSUE_CELL_MAP.txt?dl=0)

In [3]:
url_normal = 'https://www.proteinatlas.org/download/normal_tissue.tsv.zip'
data_downloader(url_normal, unprocessed_data_location)

url_abnormal = 'https://www.proteinatlas.org/download/rna_tissue_consensus.tsv.zip'
data_downloader(url_abnormal, unprocessed_data_location)

Downloading zipped data file
Downloading zipped data file


_Read in Data Files_

In [4]:
abnormal_tissue = []

for line in tqdm(open(unprocessed_data_location + 'rna_tissue_consensus.tsv').readlines()):
    abnormal_tissue.append(line.split('\t')[2].strip())

100%|██████████| 1193056/1193056 [00:01<00:00, 915078.26it/s]


In [5]:
normal_tissue = []

for line in tqdm(open(unprocessed_data_location + 'normal_tissue.tsv').readlines()):
    normal_tissue.append(line.split('\t')[2].strip() + ' - ' + line.split('\t')[3].strip())

100%|██████████| 1056062/1056062 [00:02<00:00, 517483.38it/s]


In [6]:
# combine normal and abnormal tissue and cell types into single list
combo = set(abnormal_tissue + normal_tissue)

# write results
with open(unprocessed_data_location + 'HPA_tissues.txt', 'w') as outfile:
    for x in tqdm(combo):
        outfile.write(x.strip() + '\n')


100%|██████████| 208/208 [00:00<00:00, 276605.97it/s]
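
The cells above collect the unique tissue and cell type strings by column position. A minimal sketch of the same idea with the `csv` module follows; `tissue_cell_labels` is a hypothetical helper, the header and data row are toy values, and skipping a header row is an assumption (the notebook's loops read every line, which would include a header if the file has one).

```python
import csv
import io


def tissue_cell_labels(handle, tissue_col=2, cell_col=None):
    """Collect unique 'tissue' or 'tissue - cell type' strings from a tab-delimited file."""
    reader = csv.reader(handle, delimiter='\t')
    next(reader)  # skip the header row (assumed to be present)
    labels = set()
    for fields in reader:
        label = fields[tissue_col].strip()
        if cell_col is not None:
            label += ' - ' + fields[cell_col].strip()
        labels.add(label)
    return labels


# toy stand-in for normal_tissue.tsv (column layout is an assumption)
toy_tsv = io.StringIO(
    'Gene\tGene name\tTissue\tCell type\tLevel\tReliability\n'
    'ENSG00000000003\tTSPAN6\tadipose tissue\tadipocytes\tNot detected\tApproved\n'
    'ENSG00000000003\tTSPAN6\tadipose tissue\tadipocytes\tNot detected\tApproved\n'
)
labels = tissue_cell_labels(toy_tsv, tissue_col=2, cell_col=3)
```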


In [132]:
# read back in mapped tissue/cell data
hpa_mapping_data = pandas.read_excel(open(unprocessed_data_location + 'zooma_tissue_cell_mapping_04DEC2019.xlsx', 'rb'),
                                     sheet_name='zooma_tissue_cell_mapping_04DEC',
                                     header=0)

hpa_mapping_data.fillna('None', inplace=True)

# preview data
hpa_mapping_data.head(n=3)

Unnamed: 0,TISSUE,CELL TYPE,ONTOLOGY,ONTOLOGY ID,ONTOLOGY LABEL,MAPPING
0,adipose tissue,,UBERON,http://purl.obolibrary.org/obo/UBERON_0001013,adipose tissue,ZOOMA
1,adipose tissue,adipocytes,UBERON,http://purl.obolibrary.org/obo/UBERON_0001013,adipose tissue,ZOOMA
2,adipose tissue,adipocytes,CL,http://purl.obolibrary.org/obo/CL_0001070,fat cell,Manual


In [10]:
# reformat the mapped tissue/cell data (read in above) and write it out
with open(processed_data_location + 'HPA_TISSUE_CELL_MAP.txt', 'w') as outfile:
    for idx, row in tqdm(hpa_mapping_data.iterrows(), total=hpa_mapping_data.shape[0]):
        if row['TISSUE'] != 'None':
            outfile.write(str(row['TISSUE']).strip() + '\t' + str(row['ONTOLOGY ID']).strip() + '\n')

        if row['CELL TYPE'] != 'None':
            outfile.write(str(row['CELL TYPE']).strip() + '\t' + str(row['ONTOLOGY ID']).strip() + '\n')

100%|██████████| 340/340 [00:00<00:00, 4113.13it/s]
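The reformatting step above emits one line per non-empty tissue label and one per non-empty cell-type label, so a spreadsheet row that carries both produces two lines with the same ontology identifier. The following is a minimal, dependency-free sketch of that behaviour. The function name `rows_to_map_lines` and the example rows are hypothetical and only mirror the shape of the spreadsheet preview.

```python
def rows_to_map_lines(rows):
    """Yield one 'label<TAB>ontology_id' line per non-empty tissue or cell-type label."""
    for row in rows:
        for column in ('TISSUE', 'CELL TYPE'):
            label = str(row[column]).strip()
            # 'None' is the placeholder written by fillna('None') in the notebook
            if label != 'None':
                yield label + '\t' + str(row['ONTOLOGY ID']).strip()


# hypothetical rows shaped like the spreadsheet preview
example_rows = [
    {'TISSUE': 'adipose tissue', 'CELL TYPE': 'None',
     'ONTOLOGY ID': 'http://purl.obolibrary.org/obo/UBERON_0001013'},
    {'TISSUE': 'adipose tissue', 'CELL TYPE': 'adipocytes',
     'ONTOLOGY ID': 'http://purl.obolibrary.org/obo/CL_0001070'},
]

map_lines = list(rows_to_map_lines(example_rows))
for line in map_lines:
    print(line)
```

This explains why the processed file can contain more lines than the spreadsheet has rows.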


**Preview Processed Data**

In [25]:
hpa_data = pandas.read_csv(processed_data_location + 'HPA_TISSUE_CELL_MAP.txt',
                           header = None,
                           names=['Tissue/Cell', 'UBERON_CL_IDs'],
                           delimiter = '\t')

print('There are {edge_count} edges'.format(edge_count=len(hpa_data)))

There are 622 edges


In [13]:
hpa_data.head(n=5)

Unnamed: 0,Tissue/Cell,UBERON_CL_IDs
0,adipose tissue,http://purl.obolibrary.org/obo/UBERON_0001013
1,adipose tissue,http://purl.obolibrary.org/obo/UBERON_0001013
2,adipocytes,http://purl.obolibrary.org/obo/UBERON_0001013
3,adipose tissue,http://purl.obolibrary.org/obo/CL_0001070
4,adipocytes,http://purl.obolibrary.org/obo/CL_0001070



***

### Disease and Phenotype Identifiers <a class="anchor" id="disease-identifiers"></a>

**Data Source Wiki Page:** [disgenet](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#disgenet)  

**Purpose:** This script downloads the [disease_mappings.tsv](https://www.disgenet.org/static/disgenet_ap1/files/downloads/disease_mappings.tsv.gz) to map UMLS identifiers to Human Disease and Human Phenotype identifiers when creating the following edges:  
- chemical-disease  
- disease-phenotype

**Output:**   
- Human Disease Ontology Mappings ➞ [`DISEASE_DOID_MAP.txt`](https://www.dropbox.com/s/q30ferujl7k574j/DISEASE_DOID_MAP.txt?dl=0)  
- Human Phenotype Ontology Mappings ➞ [`PHENOTYPE_HPO_MAP.txt`](https://www.dropbox.com/s/5ayl0c5qm7r4tdm/PHENOTYPE_HPO_MAP.txt?dl=0)
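
As a rough illustration of the mapping described above, the sketch below groups Disease Ontology and Human Phenotype Ontology codes by UMLS identifier. It is a simplified stand-in, not the notebook's implementation: the function name `group_mappings` and the example records are hypothetical, and records from other vocabularies are simply skipped here, whereas the notebook passes MeSH, OMIM, and Orphanet codes to the `mesh_finder` helper.

```python
from collections import defaultdict


def group_mappings(records):
    """Group DO and HPO codes by UMLS identifier.

    records is an iterable of (disease_id, vocabulary, code) tuples.
    """
    grouped = defaultdict(list)
    for disease_id, vocabulary, code in records:
        if vocabulary == 'DO':
            grouped[disease_id].append('DOID_' + str(code))
        elif vocabulary == 'HPO':
            grouped[disease_id].append(str(code).replace('HP:', 'HP_'))
        # any other vocabulary is ignored in this sketch
    return dict(grouped)


# hypothetical records in the (diseaseId, vocabulary, code) layout
example_records = [
    ('C0018923', 'DO', '1816'),
    ('C0018923', 'HPO', 'HP:0200058'),
    ('C0033999', 'DO', '2116'),
    ('C0033999', 'OTHER', 'X0000'),
]

disease_map = group_mappings(example_records)
print(disease_map)
```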


In [14]:
url = 'https://www.disgenet.org/static/disgenet_ap1/files/downloads/disease_mappings.tsv.gz'
data_downloader(url, unprocessed_data_location)

Downloading gzipped data file


In [161]:
data = pandas.read_csv(unprocessed_data_location + 'disease_mappings.tsv',
                       header = 0,
                       delimiter = '|')

In [225]:
data.head(n=3)

Unnamed: 0,diseaseId,name,vocabulary,code,vocabularyName
0,C0018923,Hemangiosarcoma,DO,1816,angiosarcoma
1,C0854893,Angiosarcoma non-metastatic,DO,1816,angiosarcoma
2,C0033999,Pterygium,DO,2116,pterygium


In [None]:
# convert to dictionary
disease_dict = {}

for idx, row in tqdm(data.iterrows(), total=data.shape[0]):
    
    if row['vocabulary'] == 'MSH':
        mesh_finder(data, row['code'], 'MESH:', disease_dict)
        
    elif row['vocabulary'] == 'OMIM':
        mesh_finder(data, row['code'], 'OMIM:', disease_dict)
        
    elif row['vocabulary'] == 'ORDO':
        mesh_finder(data, row['code'], 'ORPHA:', disease_dict)
    
    elif row['diseaseId'] in disease_dict.keys():
        if row['vocabulary'] == 'DO':
            disease_dict[row['diseaseId']].append('DOID_' + row['code']) 
        
        if row['vocabulary'] == 'HPO':
            disease_dict[row['diseaseId']].append(row['code'].replace('HP:', 'HP_'))
    
    else:
        if row['vocabulary'] == 'DO':
            disease_dict[row['diseaseId']] = ['DOID_' + row['code']] 
        
        if row['vocabulary'] == 'HPO':
            disease_dict[row['diseaseId']] = [row['code'].replace('HP:', 'HP_')] 
    

In [None]:
# reformat data and write it out
with open(processed_data_location + 'DISEASE_DOID_MAP.txt', 'w') as outfile1,open(processed_data_location + 'PHENOTYPE_HPO_MAP.txt', 'w') as outfile2:
    for key, value in tqdm(disease_dict.items()):
        for i in value:
            # get diseases
            if i.startswith('DOID_'): 
                outfile1.write(key + '\t' + i + '\n')

            # get phenotypes
            if i.startswith('HP_'): 
                outfile2.write(key + '\t' + i + '\n')


**Preview Processed Data**

_Preview Disease (DOID) Mappings_

In [262]:
dis_data = pandas.read_csv(processed_data_location + 'DISEASE_DOID_MAP.txt',
                           header = None,
                           names=['Disease_IDs', 'DOID_IDs'],
                           delimiter = '\t')

print('There are {} disease-DOID edges'.format(len(dis_data)))

There are 46720 disease-DOID edges


In [263]:
dis_data.head(n=5)

Unnamed: 0,Disease_IDs,DOID_IDs
0,C0018923,DOID_0001816
1,C0854893,DOID_0001816
2,C0033999,DOID_0002116
3,C4520843,DOID_0002116
4,C0024814,DOID_0014667


_Preview Phenotype (HP) Mappings_

In [264]:
hp_data = pandas.read_csv(processed_data_location + 'PHENOTYPE_HPO_MAP.txt',
                          header = None,
                          names=['Disease_IDs', 'HP_IDs'],
                          delimiter = '\t')

print('There are {} phenotype-HPO edges'.format(len(hp_data)))

There are 21676 phenotype-HPO edges


In [265]:
hp_data.head(n=5)

Unnamed: 0,Disease_IDs,HP_IDs
0,C0018923,HP_0200058
1,C0033999,HP_0001059
2,C4520843,HP_0001059
3,C0037199,HP_0000246
4,C0008780,HP_0012265


***
***
### CREATE EDGE DATASETS  <a class="anchor" id="create-edge-datasets"></a>
***  
***
***

### Ontologies  <a class="anchor" id="ontologies"></a>
- [Protein Ontology](#protein-ontology)  
- [Relation Ontology](#relation-ontology)  

***

### Protein Ontology <a class="anchor" id="protein-ontology"></a>

**Data Source Wiki Page:** [protein-ontology](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#human-phenotype-ontology)  

**Purpose:** This script downloads the [pr.owl](http://purl.obolibrary.org/obo/pr.owl) file from [ProConsortium.org](https://proconsortium.org/) in order to create a version of the ontology that contains only human proteins. This is achieved by performing forward and reverse breadth-first searches over all proteins that are `owl:subClassOf` [Homo sapiens protein](https://proconsortium.org/app/entry/PR%3A000029067/).

<br>
**Output:**  
- Human Protein Ontology ➞ [`human_pro.owl`](https://www.dropbox.com/s/jw8jksgnqbcz9sm/human_pro.owl?dl=0)
- Classified Human Protein Ontology (Hermit) ➞ [`human_pro_closed.owl`](https://www.dropbox.com/s/6ux85agl95ja3wx/human_pro_closed.owl?dl=0)
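
To make the forward and reverse breadth-first search idea concrete, here is a small dependency-free sketch over a toy list of triples. The function `toy_edge_bfs` and the toy class names are hypothetical; the function is a simplified stand-in for a library edge traversal and is not the code used in this notebook.

```python
from collections import deque


def toy_edge_bfs(edges, source, reverse=False):
    """Return the edges reached by breadth-first search from source.

    edges is a list of (subject, predicate, object) triples. With
    reverse=True, edges are followed from object to subject.
    """
    adjacency = {}
    for s, p, o in edges:
        start = o if reverse else s
        adjacency.setdefault(start, []).append((s, p, o))

    seen_nodes, seen_edges, found = {source}, set(), []
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for edge in adjacency.get(node, []):
            if edge in seen_edges:
                continue
            seen_edges.add(edge)
            found.append(edge)
            next_node = edge[0] if reverse else edge[2]
            if next_node not in seen_nodes:
                seen_nodes.add(next_node)
                queue.append(next_node)
    return found


# toy triples with made-up class names
toy_edges = [
    ('human_protein_A', 'subClassOf', 'protein_A'),
    ('protein_A', 'subClassOf', 'protein'),
    ('human_isoform_A1', 'subClassOf', 'human_protein_A'),
    ('mouse_protein_A', 'subClassOf', 'protein_A'),
]

forward = toy_edge_bfs(toy_edges, 'human_protein_A')
backward = toy_edge_bfs(toy_edges, 'human_protein_A', reverse=True)
human_subgraph = set(forward + backward)
print(len(human_subgraph))
```

In this toy case the forward search collects the ancestors of the human class, the reverse search collects its descendants, and the mouse-specific edge is never reached.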


In [120]:
url = 'http://purl.obolibrary.org/obo/pr.owl'
data_downloader(url, unprocessed_data_location)

Downloading data file


In [12]:
# read in ontology as graph (the ontology is large so this takes ~60 minutes) - 11,757,623 edges on 12/18/2019
graph = Graph()
graph.parse(unprocessed_data_location + 'pr.owl')

print('There are {} edges in the ontology'.format(len(graph)))

There are 11757623 edges in the ontology


**Convert Ontology to Directed MultiGraph:**  
In order to create a version of the ontology which includes all relevant human edges, we need to first convert the KG to a [directed multigraph](https://networkx.github.io/documentation/stable/reference/classes/multidigraph.html).

In [122]:
# convert RDF graph to multidigraph (the ontology is large so this takes ~45 minutes)
networkx_mdg = rdflib_to_networkx_multidigraph(graph)

**Identify Human Proteins:**   
A list of human proteins is obtained by querying the ontology to return all ontology classes `only_in_taxon some Homo sapiens`. To expedite the query time, the following SPARQL query is run from the [ProConsortium](https://proconsortium.org/pro_sparql.shtml) SPARQL endpoint: 

```SPARQL
PREFIX obo: <http://purl.obolibrary.org/obo/>

SELECT ?PRO_term
FROM <http://purl.obolibrary.org/obo/pr>
WHERE {
       ?PRO_term rdf:type owl:Class .
       ?PRO_term rdfs:subClassOf ?restriction .
       ?restriction owl:onProperty obo:RO_0002160 .
       ?restriction owl:someValuesFrom obo:NCBITaxon_9606 .

       # use this to filter-out things like hgnc ids
       FILTER (regex(?PRO_term,"http://purl.obolibrary.org/obo/*")) .
}

```
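
The request URL for a query like the one above can be assembled with the standard library instead of being hand-encoded. This is a sketch only: the endpoint address and the `query` and `format` parameter names are taken from the request URL used in this notebook, the helper name `build_query_url` is hypothetical, and no network request is made here.

```python
from urllib.parse import urlencode, urlparse, parse_qs

ENDPOINT = 'http://sparql.proconsortium.org/virtuoso/sparql'

QUERY = """PREFIX obo: <http://purl.obolibrary.org/obo/>
SELECT ?PRO_term
FROM <http://purl.obolibrary.org/obo/pr>
WHERE {
    ?PRO_term rdf:type owl:Class .
    ?PRO_term rdfs:subClassOf ?restriction .
    ?restriction owl:onProperty obo:RO_0002160 .
    ?restriction owl:someValuesFrom obo:NCBITaxon_9606 .
    FILTER (regex(?PRO_term,"http://purl.obolibrary.org/obo/*")) .
}"""


def build_query_url(endpoint, query, result_format='text/html'):
    """Percent-encode the query and append it to the endpoint address."""
    return endpoint + '?' + urlencode({'query': query, 'format': result_format})


query_url = build_query_url(ENDPOINT, QUERY)

# decoding the URL again should give back the original query text
decoded = parse_qs(urlparse(query_url).query)
print(decoded['format'][0])
```

Keeping the query as readable text and encoding it at request time makes later edits to the query less error-prone.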


In [123]:
# download data - pro classes only_in_taxon some Homo sapiens (61,064 classes on 12/18/2019)
url = 'http://sparql.proconsortium.org/virtuoso/sparql?query=PREFIX+obo%3A+%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2F%3E%0D%0ASELECT+%3FPRO_term%0D%0AFROM+%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2Fpr%3E%0D%0AWHERE%0D%0A%7B%0D%0A+++%3FPRO_term+rdf%3Atype+owl%3AClass+.%0D%0A+++%3FPRO_term+rdfs%3AsubClassOf+%3Frestriction+.%0D%0A+++%3Frestriction+owl%3AonProperty+obo%3ARO_0002160+.%0D%0A+++%3Frestriction+owl%3AsomeValuesFrom+obo%3ANCBITaxon_9606+.%0D%0A%0D%0A+++FILTER+%28regex%28%3FPRO_term%2C%22http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2F*%22%29%29+.%0D%0A%0D%0A%7D%0D%0A&format=text%2Fhtml&debug='
html = requests.get(url, allow_redirects=True).content

# extract data from html table
df_list = pandas.read_html(html)
human_pro_classes = list(df_list[-1]['PRO_term'])

print('There are {protein_count} human classes in the PRO ontology'.format(protein_count=len(human_pro_classes)))

There are 61064 human classes in the PRO ontology


**Construct Human PRO:**   
Now that we have the list of human protein classes, we can collect every edge reachable from each class by forward and reverse breadth-first search and use those edges to construct a human-only version of the Protein Ontology.

In [124]:
# create a new graph using bfs paths
human_pro_graph = Graph()
human_networkx_mdg = networkx.MultiDiGraph()

for node in tqdm(human_pro_classes):
    forward = list(networkx.edge_bfs(networkx_mdg, URIRef(node), orientation='original'))
    reverse = list(networkx.edge_bfs(networkx_mdg, URIRef(node), orientation='reverse'))
    
    # add edges from forward and reverse bfs paths
    for path in forward + reverse:
        human_pro_graph.add((path[0], path[2], path[1]))
        human_networkx_mdg.add_edge(path[0], path[1], path[2])

100%|██████████| 61064/61064 [36:49<00:00, 27.63it/s]    


In [125]:
# verify that the constructed ontology only has 1 component
networkx.number_connected_components(human_networkx_mdg.to_undirected())

1

In [25]:
# save filtered ontology
human_pro_graph.serialize(destination=unprocessed_data_location + 'human_pro.owl', format='xml')

**Classify Ontology:**  
To ensure that we have correctly built the new ontology, we run the HermiT reasoner over it to verify that there are no incomplete triples or inconsistent classes. In order to do this, we will call the reasoner using [OWLTools](https://github.com/owlcollab/owltools), which this script assumes has already been downloaded to the `../resources/lib` directory. The reasoner is then run from the command line with the following arguments:  

```bash
./resources/lib/owltools ./resources/unprocessed_data/human_pro.owl --reasoner hermit --run-reasoner --assert-implied -o ./resources/processed_data/human_pro_closed.owl
```

_**Note.** This step takes around 30-45 minutes to run. When run from the command line, the reasoner determined that the ontology was consistent and 174 new axioms were inferred (12/18/2019)._
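
If the reasoner call is ever scripted, each flag and each flag value needs to be its own list element when the command is passed as an argument list. The sketch below only builds and prints such a list and does not run OWLTools. The helper name `build_reasoner_command` and the placeholder file names `input.owl` and `output.owl` are hypothetical, while the flags are the ones shown in the command above.

```python
import shlex


def build_reasoner_command(owltools_path, input_owl, output_owl):
    """Return the OWLTools reasoner call as a list of separate arguments."""
    return [
        owltools_path,
        input_owl,
        '--reasoner', 'hermit',
        '--run-reasoner',
        '--assert-implied',
        '-o', output_owl,
    ]


command = build_reasoner_command('./resources/lib/owltools', 'input.owl', 'output.owl')

# print the equivalent shell command for inspection
print(shlex.join(command))
```

A list in this form could then be handed to `subprocess.run(command, check=True)`, assuming OWLTools is installed at the given path.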

In [None]:
# run reasoner -- RUN FROM COMMAND LINE NOT HERE
# note: each flag and its value must be a separate list item when passed to subprocess.run
# subprocess.run(['../../resources/lib/owltools',
#                 '../../resources/unprocessed_data/human_pro.owl',
#                 '--reasoner', 'hermit',
#                 '--run-reasoner',
#                 '--assert-implied',
#                 '--list-unsatisfiable',
#                 '-o', './resources/processed_data/human_pro_closed.owl'])

**Examine Cleaned Human PRO:**  
Once we have cleaned the ontology, we can load the classified graph from the `../../resources/processed_data` directory and get counts of its nodes and edges.

In [None]:
# load the classified ontology
pro_human_graph = Graph()
pro_human_graph.parse(processed_data_location + 'human_pro_closed.owl')

# get node and edge count of the classified graph
edge_count = len(pro_human_graph)
node_count = len(set([str(node) for edge in list(pro_human_graph) for node in edge[0::2]]))

print('\n The classified, filtered Human version of PRO contains {node} nodes and {edge} edges\n'.format(node=node_count, edge=edge_count))


***

### Relation Ontology <a class="anchor" id="relation-ontology"></a>

**Data Source Wiki Page:** [RO](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#relation-ontology)  

**Purpose:** This script downloads the [ro.owl](http://purl.obolibrary.org/obo/ro.owl) file from [obofoundry.org](http://www.obofoundry.org/) in order to obtain all `ObjectProperties` and their inverse relations.  

**Output:** [`RO_INVERSE_RELATIONS.txt`](https://www.dropbox.com/s/osqk3c5cazncmni/RO_INVERSE_RELATIONS.txt?dl=0)

In [129]:
url = 'http://purl.obolibrary.org/obo/ro.owl'
data_downloader(url, unprocessed_data_location)

Downloading data file


In [130]:
ro_graph = Graph()
ro_graph.parse(unprocessed_data_location + 'ro.owl')

print('There are {} edges in the ontology'.format(len(ro_graph))) #5,669 edges on 12/15/2019

There are 5669 edges in the ontology


**Identify Relations and Inverse Relations:**  
Identify all relations and their inverse relations using the `owl:inverseOf` property. To make it easier to look up the inverse relations, each pair is listed twice, for example:  
- [location of](http://www.ontobee.org/ontology/RO?iri=http://purl.obolibrary.org/obo/RO_0001015) `owl:inverseOf` [located in](http://www.ontobee.org/ontology/RO?iri=http://purl.obolibrary.org/obo/RO_0001025)  
- [located in](http://www.ontobee.org/ontology/RO?iri=http://purl.obolibrary.org/obo/RO_0001025) `owl:inverseOf` [location of](http://www.ontobee.org/ontology/RO?iri=http://purl.obolibrary.org/obo/RO_0001015)
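
The effect of listing each pair twice can be sketched without any RDF library. In the example below, the helper name `both_directions` is hypothetical and the single pair is the `location of` / `located in` example given above.

```python
OBO = 'http://purl.obolibrary.org/obo/'
INVERSE_OF = 'http://www.w3.org/2002/07/owl#inverseOf'


def both_directions(inverse_pairs, predicate=INVERSE_OF):
    """Return each (relation, inverse) pair as two rows, one per direction."""
    rows = []
    for subject, obj in inverse_pairs:
        rows.append((subject, predicate, obj))
        rows.append((obj, predicate, subject))
    return rows


# location of (RO_0001015) and located in (RO_0001025)
pairs = [(OBO + 'RO_0001015', OBO + 'RO_0001025')]
rows = both_directions(pairs)

# with both directions present, a plain dictionary lookup works from either side
lookup = {subject: obj for subject, _, obj in rows}
print(lookup[OBO + 'RO_0001025'])
```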

In [131]:
with open(processed_data_location + 'RO_INVERSE_RELATIONS.txt', 'w') as outfile:

    # find inverse relations
    for s, p, o in tqdm(ro_graph):
        if 'owl#inverseOf' in str(p):
            if 'RO' in str(s) and 'RO' in str(o):
                outfile.write(str(s) + '\t' + str(p) + '\t' + str(o) + '\n')
                outfile.write(str(o) + '\t' + str(p) + '\t' + str(s) + '\n')


**Preview Processed Data**

In [33]:
ro_data = pandas.read_csv(processed_data_location + 'RO_INVERSE_RELATIONS.txt',
                          header = None,
                          names=['RO_Edge', 'RO_ObjectProperty', 'RO_Inverse_Edge'],
                          delimiter = '\t')

print('There are {edge_count} RO Relations and Inverse Relations'.format(edge_count=len(ro_data)))

There are 172 RO Relations and Inverse Relations


In [34]:
ro_data.head(n=5)

Unnamed: 0,RO_Edge,RO_ObjectProperty,RO_Inverse_Edge
0,http://purl.obolibrary.org/obo/RO_0003000,http://www.w3.org/2002/07/owl#inverseOf,http://purl.obolibrary.org/obo/RO_0003001
1,http://purl.obolibrary.org/obo/RO_0003001,http://www.w3.org/2002/07/owl#inverseOf,http://purl.obolibrary.org/obo/RO_0003000
2,http://purl.obolibrary.org/obo/RO_0002233,http://www.w3.org/2002/07/owl#inverseOf,http://purl.obolibrary.org/obo/RO_0002352
3,http://purl.obolibrary.org/obo/RO_0002352,http://www.w3.org/2002/07/owl#inverseOf,http://purl.obolibrary.org/obo/RO_0002233
4,http://purl.obolibrary.org/obo/RO_0002234,http://www.w3.org/2002/07/owl#inverseOf,http://purl.obolibrary.org/obo/RO_0002353


***
### Linked Data <a class="anchor" id="linked-data"></a>
* [Reactome Protein-Complex Data](#reactome-protein-complex)  
* [Reactome Complex-Complex Data](#reactome-complex-complex)  
* [Reactome Chemical-Complex Data](#reactome-chemical-complex)  
* [Uniprot Protein-Cofactor and Protein-Catalyst](#uniprot-protein-cofactorcatalyst)  
* [NCBI Gene Protein-Coding Genes and Proteins](#ncbi-protein-coding-genes)    


***

### Reactome Protein-Complex Data <a class="anchor" id="reactome-protein-complex"></a>

**Data Source Wiki Page:** [Reactome](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#reactome-pathway-database)  

**Purpose:** This script downloads the [ComplexParticipantsPubMedIdentifiers_human.txt](https://reactome.org/download/current/ComplexParticipantsPubMedIdentifiers_human.txt) file from [Reactome](https://reactome.org) in order to create the following edges:  
- protein-complex

**Output:** [`REACTOME_PROTEIN_COMPLEX.txt`](https://www.dropbox.com/s/7meu0cdz1mrnsz7/REACTOME_PROTEIN_COMPLEX.txt?dl=0)
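
The parsing used for the Reactome complex file can be summarised in a short sketch. It assumes, based on the processing code in this notebook, that the first three tab-separated columns hold the complex identifier, the complex label, and a `|`-separated participant list with `uniprot:` and `chebi:` prefixes. The helper name `split_participants` is hypothetical, and the example line is constructed for illustration (its trailing columns are placeholders).

```python
def split_participants(line):
    """Split one line into complex id, label, UniProt accessions, and ChEBI ids."""
    fields = line.rstrip('\n').split('\t')
    complex_id, label, participants = fields[0], fields[1], fields[2]

    proteins, chemicals = [], []
    for item in participants.split('|'):
        item = item.strip()
        if item.startswith('uniprot:'):
            proteins.append(item.split(':')[-1])
        elif item.startswith('chebi:'):
            chemicals.append(item.replace('chebi:', 'CHEBI_'))
    return complex_id, label, proteins, chemicals


# constructed example line; the last two columns are placeholders
example_line = ('R-HSA-1006173\tCFH:Host cell surface [plasma membrane]\t'
                'chebi:24505|chebi:28879|uniprot:P08603\t-\t-\n')

complex_id, label, proteins, chemicals = split_participants(example_line)
print(complex_id, proteins, chemicals)
```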


In [35]:
url = 'https://reactome.org/download/current/ComplexParticipantsPubMedIdentifiers_human.txt'
data_downloader(url, unprocessed_data_location)

Downloading data file


In [None]:
# process data
data = open(unprocessed_data_location + 'ComplexParticipantsPubMedIdentifiers_human.txt').readlines()

# reformat data and write it out
with open(processed_data_location + 'REACTOME_PROTEIN_COMPLEX.txt', 'w') as outfile:
    for line in tqdm(data):
        row = line.split('\t')
        
        # find all proteins in a complex
        for x in row[2].split('|'):
            if x.startswith('uniprot:'):            
                outfile.write(x.split(':')[-1].strip() + '\t' + row[0].strip() + '\t' + row[1].strip() + '\n')


**Preview Processed Data**

In [142]:
pc_data = pandas.read_csv(processed_data_location + 'REACTOME_PROTEIN_COMPLEX.txt',
                       header = None,
                       names=['Uniprot_Protein', 'Reactome_Complex', 'Reactome_Label'],
                       delimiter = '\t')

print('There are {edge_count} protein-complex edges'.format(edge_count=len(pc_data)))

There are 91985 protein-complex edges


In [143]:
pc_data.head(n=5)

Unnamed: 0,Uniprot_Protein,Reactome_Complex,Reactome_Label
0,P08603,R-HSA-1006173,CFH:Host cell surface [plasma membrane]
1,Q16621,R-HSA-1008206,NF-E2:Promoter region of beta-globin [nucleopl...
2,Q9ULX9,R-HSA-1008206,NF-E2:Promoter region of beta-globin [nucleopl...
3,O15525,R-HSA-1008206,NF-E2:Promoter region of beta-globin [nucleopl...
4,O60675,R-HSA-1008206,NF-E2:Promoter region of beta-globin [nucleopl...



***

### Reactome Complex-Complex Data <a class="anchor" id="reactome-complex-complex"></a>

**Data Source Wiki Page:** [Reactome](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#reactome-pathway-database)  

**Purpose:** This script downloads the [ComplexParticipantsPubMedIdentifiers_human.txt](https://reactome.org/download/current/ComplexParticipantsPubMedIdentifiers_human.txt) file from [Reactome](https://reactome.org) in order to create the following edges:  
- complex-complex  

**Output:** [`REACTOME_COMPLEX_COMPLEX.txt`](https://www.dropbox.com/s/sojaq8u3hwfw4jz/REACTOME_COMPLEX_COMPLEX.txt?dl=0)


In [333]:
# create label dictionary
labels = pandas.read_csv(unprocessed_data_location + 'ComplexParticipantsPubMedIdentifiers_human.txt',
                         header = 0,
                         delimiter = '\t')

# convert to dictionary
label_dict = {row[0]:row[1] for idx, row in labels.iterrows()}

In [None]:
# process data
data = open(unprocessed_data_location + 'ComplexParticipantsPubMedIdentifiers_human.txt').readlines()

# reformat data and write it out
with open(processed_data_location + 'REACTOME_COMPLEX_COMPLEX.txt', 'w') as outfile:
    for line in tqdm(data[1:]):
        row = line.split('\t')
        
        # find all complexes
        for x in row[3].split('|'):
            if x.startswith('R-HSA-') and x.strip() in label_dict.keys():            
                outfile.write(row[0].strip() + '\t' + x.strip() + '\t' + row[1].strip() + '\t' + label_dict[x.strip()] + '\n')


**Preview Processed Data**

In [335]:
cc_data = pandas.read_csv(processed_data_location + 'REACTOME_COMPLEX_COMPLEX.txt',
                          header = None,
                          names=['Reactome_Complex_u', 'Reactome_Complex_v', 'Reactome Label_u', 'Reactome Label_v'],
                          delimiter = '\t')

print('There are {edge_count} complex-complex edges'.format(edge_count=len(cc_data)))

There are 13722 complex-complex edges


In [336]:
cc_data.head(n=5)

Unnamed: 0,Reactome_Complex_u,Reactome_Complex_v,Reactome Label_u,Reactome Label_v
0,R-HSA-1008206,R-HSA-1008229,NF-E2:Promoter region of beta-globin [nucleopl...,NF-E2 [nucleoplasm]
1,R-HSA-1013011,R-HSA-1013017,GABA B receptor G-protein beta-gamma and Kir3 ...,GABA B receptor G-protein beta-gamma complex [...
2,R-HSA-1013011,R-HSA-1013019,GABA B receptor G-protein beta-gamma and Kir3 ...,G-protein beta-gamma subunits [plasma membrane]
3,R-HSA-1013011,R-HSA-420698,GABA B receptor G-protein beta-gamma and Kir3 ...,GABAB receptor:GABA [plasma membrane]
4,R-HSA-1013011,R-HSA-420748,GABA B receptor G-protein beta-gamma and Kir3 ...,GABAB receptor [plasma membrane]



***

### Reactome Chemical-Complex Data <a class="anchor" id="reactome-chemical-complex"></a>

**Data Source Wiki Page:** [Reactome](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#reactome-pathway-database)  

**Purpose:** This script downloads the [ComplexParticipantsPubMedIdentifiers_human.txt](https://reactome.org/download/current/ComplexParticipantsPubMedIdentifiers_human.txt) file from [Reactome](https://reactome.org) in order to create the following edges:  
- chemical-complex  

**Output:** [`REACTOME_CHEMICAL_COMPLEX.txt`](https://www.dropbox.com/s/qoetjt0vfy6qb3y/REACTOME_CHEMICAL_COMPLEX.txt?dl=0)


In [None]:
# process data
data = open(unprocessed_data_location + 'ComplexParticipantsPubMedIdentifiers_human.txt').readlines()

# reformat data and write it out
with open(processed_data_location + 'REACTOME_CHEMICAL_COMPLEX.txt', 'w') as outfile:
    for line in tqdm(data[1:]):
        row = line.split('\t')
        
        # find all chemicals in a complex
        for x in row[2].split('|'):
            if x.startswith('chebi:'):            
                outfile.write(x.replace('chebi:', 'CHEBI_') + '\t' + row[0].strip() + '\t' + row[1].strip() + '\n')


In [159]:
cc1_data = pandas.read_csv(processed_data_location + 'REACTOME_CHEMICAL_COMPLEX.txt',
                           header = None,
                           names=['CHEBI_IDs', 'Reactome_IDs', 'Reactome_Label'],
                           delimiter = '\t')

print('There are {edge_count} chemical-complex edges'.format(edge_count=len(cc1_data)))

There are 5608 chemical-complex edges


In [160]:
cc1_data.head(n=5)

Unnamed: 0,CHEBI_IDs,Reactome_IDs,Reactome_Label
0,CHEBI_24505,R-HSA-1006173,CFH:Host cell surface [plasma membrane]
1,CHEBI_28879,R-HSA-1006173,CFH:Host cell surface [plasma membrane]
2,CHEBI_59888,R-HSA-1013011,GABA B receptor G-protein beta-gamma and Kir3 ...
3,CHEBI_59888,R-HSA-1013017,GABA B receptor G-protein beta-gamma complex [...
4,CHEBI_29105,R-HSA-109266,NT5E:Zn2+ dimer [plasma membrane]


***
***

### Uniprot Protein-Cofactor and Protein-Catalyst <a class="anchor" id="uniprot-protein-cofactorcatalyst"></a>

**Data Source Wiki Page:** [Uniprot](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#uniprot-knowledgebase)  

**Purpose:** This script downloads the [uniprot-cofactor-catalyst.tab](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#uniprot-knowledgebase) file from the [Uniprot Knowledge Base](https://www.uniprot.org) in order to create the following edges:  
- protein-cofactor  
- protein-catalyst  

**Output:**  
- protein-cofactor ➞ [`UNIPROT_PROTEIN_COFACTOR.txt`](https://www.dropbox.com/s/ij9t89botd8nmmj/UNIPROT_PROTEIN_COFACTOR.txt?dl=0)
- protein-catalyst ➞ [`UNIPROT_PROTEIN_CATALYST.txt`](https://www.dropbox.com/s/pvopvs0iq8x3oq2/UNIPROT_PROTEIN_CATALYST.txt?dl=0)
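
One way to pull ChEBI identifiers out of a cofactor or catalytic-activity field is a regular expression. The sketch below assumes that identifiers appear in square brackets as `[CHEBI:<digits>]`, which is inferred from the string splitting used in this notebook rather than from the UniProt documentation. The helper name `extract_chebi_ids` and the example field are hypothetical.

```python
import re

# matches identifiers written as [CHEBI:12345]
CHEBI_PATTERN = re.compile(r'\[CHEBI:(\d+)\]')


def extract_chebi_ids(field):
    """Return all ChEBI identifiers in a field, formatted as CHEBI_<digits>."""
    return ['CHEBI_' + number for number in CHEBI_PATTERN.findall(field)]


# made-up field that follows the assumed 'name [CHEBI:id]; name [CHEBI:id]' layout
example_field = 'cofactor one [CHEBI:18420]; cofactor two [CHEBI:29035]'

chebi_ids = extract_chebi_ids(example_field)
print(chebi_ids)
```

Compared with chained `split` and `replace` calls, a pattern of this kind returns an empty list for fields without identifiers instead of a stray fragment.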


In [48]:
url = 'https://www.uniprot.org/uniprot/?query=&fil=organism%3A%22Homo%20sapiens%20(Human)%20%5B9606%5D%22&columns=id%2Centry%20name%2Creviewed%2Cdatabase(PRO)%2Cchebi(Cofactor)%2Cchebi(Catalytic%20activity)&format=tab'
data_downloader(url, unprocessed_data_location, 'uniprot-cofactor-catalyst.tab')

Downloading data file


In [49]:
data = open(unprocessed_data_location + 'uniprot-cofactor-catalyst.tab').readlines()

# reformat data and write it out
with open(processed_data_location + 'UNIPROT_PROTEIN_COFACTOR.txt', 'w') as outfile1, open(processed_data_location + 'UNIPROT_PROTEIN_CATALYST.txt', 'w') as outfile2:
    for line in tqdm(data):

        # get cofactors
        if 'CHEBI' in line.split('\t')[4]: 
            for i in line.split('\t')[4].split(';'):
                chebi = i.split('[')[-1].replace(']', '').replace(':', '_')
                outfile1.write('PR_' + line.split('\t')[3].strip(';') + '\t' + chebi + '\n')
        
        # get catalysts
        if 'CHEBI' in line.split('\t')[5]:       
            for i in line.split('\t')[5].split(';'):
                chebi = i.split('[')[-1].replace(']', '').replace(':', '_')
                outfile2.write('PR_' + line.split('\t')[3].strip(';') + '\t' + chebi + '\n')


100%|██████████| 188350/188350 [00:00<00:00, 405727.43it/s]


**Preview Processed Data**

_Preview Cofactor Data_

In [50]:
pcp1_data = pandas.read_csv(processed_data_location + 'UNIPROT_PROTEIN_COFACTOR.txt',
                            header = None,
                            names=['Protein_Ontology_IDs', 'CHEBI_IDs'],
                            delimiter = '\t')

print('There are {edge_count} protein-cofactor edges'.format(edge_count=len(pcp1_data)))

There are 5577 protein-cofactor edges


In [51]:
pcp1_data.head(n=5)

Unnamed: 0,Protein_Ontology_IDs,CHEBI_IDs
0,PR_Q9BRS2,CHEBI_18420
1,PR_Q05823,CHEBI_18420
2,PR_Q05823,CHEBI_29035
3,PR_Q13472,CHEBI_18420
4,PR_Q9BXA7,CHEBI_18420


_Preview Catalyst Data_

In [52]:
pcp2_data = pandas.read_csv(processed_data_location + 'UNIPROT_PROTEIN_CATALYST.txt',
                            header = None,
                            names=['Protein_Ontology_IDs', 'CHEBI_IDs'],
                            delimiter = '\t')

print('There are {edge_count} protein-catalyst edges'.format(edge_count=len(pcp2_data)))

There are 59863 protein-catalyst edges


In [53]:
pcp2_data.head(n=5)

Unnamed: 0,Protein_Ontology_IDs,CHEBI_IDs
0,PR_Q9NP80,CHEBI_15377
1,PR_Q9NP80,CHEBI_15378
2,PR_Q9NP80,CHEBI_28868
3,PR_Q9NP80,CHEBI_16870
4,PR_Q9NP80,CHEBI_58168



***
***

### NCBI Gene Protein-Coding Gene-Protein <a class="anchor" id="ncbi-protein-coding-genes"></a>

**Data Source Wiki Page:** [Uniprot](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#uniprot-knowledgebase) 

**Purpose:** This script utilizes the merged data created in the [Human-Transcript, Gene, and Protein Identifier Mapping](#Human-Transcript,-Gene,-and-Protein-Identifier-Mapping) subsection in order to create the following edges:  
- gene-protein

**Output:** [`PROTEIN_CODING_GENES_PROTEINS.txt`](https://www.dropbox.com/s/79ce6oe68jt72ph/PROTEIN_CODING_GENES_PROTEINS.txt?dl=0)  


In [None]:
# de-dup data
df_ens = merged_data.drop_duplicates(subset=['GeneID', 'type_of_gene', 'Cross-reference (PRO)', 'Symbol_from_nomenclature_authority', 'Full_name_from_nomenclature_authority', 'Synonyms'], keep='first', inplace=False) 

# reformat data and write it out
with open(processed_data_location + 'PROTEIN_CODING_GENES_PROTEINS.txt', 'w') as outfile:
    for idx, row in tqdm(df_ens.iterrows(), total=df_ens.shape[0]):
        if row['type_of_gene'] == 'protein-coding' and row['Cross-reference (PRO)'] != 'None': 
            outfile.write(row['GeneID'] + '\t' + 'PR_' + row['Cross-reference (PRO)'] + '\t' + row['Symbol_from_nomenclature_authority'] + '\t' + row['Synonyms'] + '\t' + row['Full_name_from_nomenclature_authority'] + '\n')


**Preview Processed Data**

In [270]:
hpe_data = pandas.read_csv(processed_data_location + 'PROTEIN_CODING_GENES_PROTEINS.txt',
                           header = None,
                           names=['Entrez_Gene_IDs', 'Protein_Ontology_IDs', 'Gene_Name', 'Gene_Synonyms', 'Gene_Description'],
                           delimiter = '\t')

print('There are {edge_count} protein-coding gene edges'.format(edge_count=len(hpe_data)))

There are 18744 protein-coding gene edges


In [271]:
hpe_data.head(n=5)

Unnamed: 0,Entrez_Gene_IDs,Protein_Ontology_IDs,Gene_Name,Gene_Synonyms,Gene_Description
0,7529,PR_P31946,YWHAB,GW128|HEL-S-1|HS1|KCIP-1|YWHAA,tyrosine 3-monooxygenase/tryptophan 5-monooxyg...
1,7531,PR_P62258,YWHAE,14-3-3E|HEL2|KCIP-1|MDCR|MDS,tyrosine 3-monooxygenase/tryptophan 5-monooxyg...
2,7533,PR_Q04917,YWHAH,YWHA1,tyrosine 3-monooxygenase/tryptophan 5-monooxyg...
3,7532,PR_P61981,YWHAG,14-3-3GAMMA|EIEE56|PPP1R170,tyrosine 3-monooxygenase/tryptophan 5-monooxyg...
4,2810,PR_P31947,SFN,YWHAS,stratifin



***
***
### GATHER INSTANCE DATA LABELS <a class="anchor" id="create-label-data"></a>
***
***


**Data Source Wiki Pages:**  
- [ClinVar](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#clinvar): Variant
- [Uniprot](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#uniprot-knowledgebase): RNA
- [NCBI Gene](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#ncbi-gene): Gene
- [Reactome](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#reactome-pathway-database): Complex, Pathway, Reaction  


<br>
**Purpose:** Ontologies include identifier attributes like labels and synonyms. This section creates a file that contains labels for all data that is not from an ontology, which includes:
- Variant (dbSNP Identifiers) ➞ [`tab_delimited/variant_summary.txt.gz`](ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz)  
- RNA (Ensembl Transcripts) ➞ [`HUMAN_9606_idmapping_selected.tab`](https://www.dropbox.com/s/afk12rtr1aya0za/HUMAN_9606_idmapping_selected.tab?dl=0) + [`Homo_sapiens.gene_info`](https://www.dropbox.com/s/vazlmzxydgv6xzz/Homo_sapiens.gene_info?dl=0)
- Gene (Entrez Gene Identifiers) ➞ [`Homo_sapiens.gene_info`](https://www.dropbox.com/s/vazlmzxydgv6xzz/Homo_sapiens.gene_info?dl=0)
- Complex (Reactome Complex Identifiers) ➞ [`ComplexParticipantsPubMedIdentifiers_human.txt`](https://reactome.org/download/current/ComplexParticipantsPubMedIdentifiers_human.txt) 
- Pathway (Reactome Pathway Identifiers) ➞ [`ReactomePathways.txt`](https://reactome.org/download/current/ReactomePathways.txt)
- Reaction (Reactome Reaction Identifiers) ➞ [`UniProt2ReactomeReactions.txt`](https://reactome.org/download/current/UniProt2ReactomeReactions.txt) + [`ChEBI2ReactomeReactions.txt`](https://reactome.org/download/current/ChEBI2ReactomeReactions.txt)
  

<br>
**Output:** File contains the following columns: identifier, label, synonyms, descriptions  
- [`instance_data_labels`]()   

***
**Variant:** `tab_delimited/variant_summary.txt.gz`

In [338]:
url = 'ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz'
data_downloader(url, unprocessed_data_location)

Downloading gzipped data from ftp server
Decompressing and writing gzipped data


In [413]:
var_data = pandas.read_csv(unprocessed_data_location + 'variant_summary.txt',
                           header = 0,
                           delimiter = '\t',
                           low_memory=False)

In [414]:
# filter data to only include variants whose clinical significance starts with 'Pathogenic'
var_data = var_data.loc[var_data['ClinicalSignificance'].apply(lambda x: x.startswith('Pathogenic'))] 
# drop variants without a dbSNP identifier (coded as -1)
var_data = var_data.loc[var_data['RS# (dbSNP)'].apply(lambda x: x != -1)] 

# reorder data
var_data = var_data[['RS# (dbSNP)', 'Name']].drop_duplicates(subset=None, keep='first', inplace=False)

# preview data
var_data.head(n=5)

Unnamed: 0,RS# (dbSNP),Name
3345,200300612,NM_147127.5(EVC2):c.3659+2T>C
6421,757415879,NM_000094.3(COL7A1):c.5797C>T (p.Arg1933Ter)
6586,886058642,NM_000094.3(COL7A1):c.1637-1G>A
6633,757552268,NM_000387.6(SLC25A20):c.326+1del
6705,775696136,NM_001369.2(DNAH5):c.13458dup (p.Asn4487Ter)
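
The variant filtering logic can also be expressed without pandas, which may help when checking it on a handful of rows. This is a sketch under the assumption that rows are dictionaries keyed by the ClinVar column names used above. The helper name `pathogenic_labels` and the example rows are hypothetical.

```python
def pathogenic_labels(rows):
    """Return unique (rsid, name) pairs for pathogenic variants that have an rsID."""
    seen, kept = set(), []
    for row in rows:
        if not row['ClinicalSignificance'].startswith('Pathogenic'):
            continue
        if row['RS# (dbSNP)'] == -1:
            continue
        pair = (row['RS# (dbSNP)'], row['Name'])
        if pair not in seen:
            seen.add(pair)
            kept.append(pair)
    return kept


# made-up rows using the same column names as the ClinVar summary file
example_variants = [
    {'RS# (dbSNP)': 200300612, 'Name': 'variant A', 'ClinicalSignificance': 'Pathogenic'},
    {'RS# (dbSNP)': 200300612, 'Name': 'variant A', 'ClinicalSignificance': 'Pathogenic/Likely pathogenic'},
    {'RS# (dbSNP)': -1, 'Name': 'variant B', 'ClinicalSignificance': 'Pathogenic'},
    {'RS# (dbSNP)': 757415879, 'Name': 'variant C', 'ClinicalSignificance': 'Benign'},
]

variant_labels = pathogenic_labels(example_variants)
print(variant_labels)
```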


***
**RNA + Gene:** `HUMAN_9606_idmapping_selected.tab` + `Homo_sapiens.gene_info`

This section takes advantage of the `merged_data` dataset that was created in the [Human Transcript, Gene, and Protein Identifier Mapping](#human-transcript,-gene,-and-protein-identifier-mapping) section by merging the `Homo_sapiens.gene_info` and exploded `HUMAN_9606_idmapping_selected.tab` files.

In [373]:
# make sure merge columns are the same type
ncbi_gene['GeneID'] = ncbi_gene['GeneID'].astype(str)

# merge uniprot and ncbi data
merged_label_data = pandas.merge(explode_df_ensembl[['GeneID (EntrezGene)', 'Ensembl_TRS']],
                                 ncbi_gene[['GeneID', 'Symbol_from_nomenclature_authority', 'Full_name_from_nomenclature_authority', 'Synonyms']],
                                 left_on='GeneID (EntrezGene)',
                                 right_on='GeneID',
                                 how='outer')

# replace NaN with 'None'
merged_label_data.fillna('None', inplace=True)

# preview data
merged_label_data.head(n=5)

Unnamed: 0,GeneID (EntrezGene),Ensembl_TRS,GeneID,Symbol_from_nomenclature_authority,Full_name_from_nomenclature_authority,Synonyms
0,7529,ENST00000353703,7529,YWHAB,tyrosine 3-monooxygenase/tryptophan 5-monooxyg...,GW128|HEL-S-1|HS1|KCIP-1|YWHAA
1,7529,ENST00000372839,7529,YWHAB,tyrosine 3-monooxygenase/tryptophan 5-monooxyg...,GW128|HEL-S-1|HS1|KCIP-1|YWHAA
2,7529,ENST00000353703,7529,YWHAB,tyrosine 3-monooxygenase/tryptophan 5-monooxyg...,GW128|HEL-S-1|HS1|KCIP-1|YWHAA
3,7529,ENST00000372839,7529,YWHAB,tyrosine 3-monooxygenase/tryptophan 5-monooxyg...,GW128|HEL-S-1|HS1|KCIP-1|YWHAA
4,7529,,7529,YWHAB,tyrosine 3-monooxygenase/tryptophan 5-monooxyg...,GW128|HEL-S-1|HS1|KCIP-1|YWHAA
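
The `astype(str)` conversion before the merge matters because the join keys must have the same type; depending on the pandas version, mismatched key types either raise an error or fail to match. The sketch below reproduces the pattern on two invented miniature tables (the `Symbol` column name and all values are placeholders) and shows how an outer merge followed by `fillna('None')` leaves a `'None'` marker wherever one side had no partner.

```python
import pandas as pd

# toy transcript table: gene identifiers are already strings
toy_transcripts = pd.DataFrame({
    'GeneID (EntrezGene)': ['7529', '7529', '9999'],
    'Ensembl_TRS': ['ENST_A', 'ENST_B', 'ENST_C'],
})

# toy gene table: identifiers read as integers
toy_genes = pd.DataFrame({
    'GeneID': [7529, 7531],
    'Symbol': ['YWHAB', 'YWHAE'],
})

# align the key types before merging
toy_genes['GeneID'] = toy_genes['GeneID'].astype(str)

# the outer merge keeps unmatched rows from both sides
toy_merged = pd.merge(toy_transcripts, toy_genes,
                      left_on='GeneID (EntrezGene)', right_on='GeneID',
                      how='outer')

# unmatched cells become the string 'None'
toy_merged = toy_merged.fillna('None')
```

Gene `9999` has no symbol and gene `7531` has no transcript, so each contributes one row containing `'None'`.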


**RNA Data**

In [439]:
# keep the transcript identifier and gene symbol columns and drop duplicate rows
rna_data = merged_label_data[['Ensembl_TRS', 'Symbol_from_nomenclature_authority']].drop_duplicates(subset=None, keep='first', inplace=False)

# delete all rows containing "None"
indexNames = rna_data[(rna_data['Ensembl_TRS'] == 'None') | (rna_data['Symbol_from_nomenclature_authority'] == 'None')].index
rna_data.drop(indexNames, inplace=True)

# preview data
rna_data.head(n=5)

Unnamed: 0,Ensembl_TRS,Symbol_from_nomenclature_authority
0,ENST00000353703,YWHAB
1,ENST00000372839,YWHAB
5,ENST00000264335,YWHAE
7,ENST00000571732,YWHAE
9,ENST00000616643,YWHAE


**Gene Data**

In [440]:
# keep the gene identifier and gene symbol columns and drop duplicate rows
gene_data = merged_label_data[['GeneID', 'Symbol_from_nomenclature_authority']].drop_duplicates(subset=None, keep='first', inplace=False)

# delete all rows containing "None"
indexNames = gene_data[(gene_data['GeneID'] == 'None') | (gene_data['Symbol_from_nomenclature_authority'] == 'None')].index
gene_data.drop(indexNames, inplace=True)

# preview data
gene_data.head(n=5)

Unnamed: 0,GeneID,Symbol_from_nomenclature_authority
0,7529,YWHAB
5,7531,YWHAE
38,7533,YWHAH
41,7532,YWHAG
42,2810,SFN
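
The RNA and gene cells follow the same three steps: select an identifier column and a label column, discard rows where either holds the `'None'` placeholder, and drop duplicates. A minimal sketch of that pattern as a reusable function is shown below; `label_pairs` is a hypothetical name and the toy frame is invented.

```python
import pandas as pd


def label_pairs(df, id_col, label_col, missing='None'):
    """Return unique (identifier, label) rows without placeholder values.

    Hypothetical helper for illustration; not part of the notebook.
    """
    pairs = df[[id_col, label_col]]
    keep = (pairs[id_col] != missing) & (pairs[label_col] != missing)
    return pairs.loc[keep].drop_duplicates()


# toy merged table (values invented)
toy_merged_labels = pd.DataFrame({
    'Ensembl_TRS': ['ENST_A', 'ENST_B', 'None', 'ENST_A'],
    'GeneID': ['1', '1', '2', '1'],
    'Symbol': ['G1', 'G1', 'G2', 'G1'],
})

toy_rna = label_pairs(toy_merged_labels, 'Ensembl_TRS', 'Symbol')
toy_gene = label_pairs(toy_merged_labels, 'GeneID', 'Symbol')
```

Filtering with a boolean mask returns a new frame, which avoids the in-place `drop` on a slice used in the cells above.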


***
**Complex:** `ComplexParticipantsPubMedIdentifiers_human.txt`  
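This step assumes that the `ComplexParticipantsPubMedIdentifiers_human.txt` file was already downloaded in the [Reactome Protein Complex](#reactome-protein-complex) section. The download cell below only needs to be run if the file is not yet present in `unprocessed_data_location`.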

In [None]:
url = 'https://reactome.org/download/current/ComplexParticipantsPubMedIdentifiers_human.txt'
data_downloader(url, unprocessed_data_location)

In [420]:
complex_data = pandas.read_csv(unprocessed_data_location + 'ComplexParticipantsPubMedIdentifiers_human.txt',
                               header = 0,
                               delimiter = '\t',
                               low_memory=False)

In [421]:
# keep the complex identifier and name columns and drop duplicate rows
complex_data = complex_data[['identifier', 'name']].drop_duplicates(subset=None, keep='first', inplace=False)

# preview data
complex_data.head(n=5)

Unnamed: 0,identifier,name
0,R-HSA-1006173,CFH:Host cell surface [plasma membrane]
1,R-HSA-1008206,NF-E2:Promoter region of beta-globin [nucleopl...
2,R-HSA-1008229,NF-E2 [nucleoplasm]
3,R-HSA-1008252,IRF1:Promoter region of IFN beta [nucleoplasm]
4,R-HSA-1011577,C-terminal EH domain containing proteins:Raben...


***
**Pathway:** `ReactomePathways.txt`

In [367]:
url = 'https://reactome.org/download/current/ReactomePathways.txt'
data_downloader(url, unprocessed_data_location)

Downloading data file


In [422]:
pathway_data = pandas.read_csv(unprocessed_data_location + 'ReactomePathways.txt',
                               header = None,
                               delimiter = '\t',
                               low_memory=False)

In [423]:
# filter data to only include human pathways (column 2 holds the species name)
pathway_data = pathway_data.loc[pathway_data[2].apply(lambda x: x.startswith('Homo sapiens'))]

# keep the pathway identifier and name columns and drop duplicate rows
pathway_data = pathway_data[[0, 1]].drop_duplicates(subset=None, keep='first', inplace=False)

# preview data
pathway_data.head(n=5)

Unnamed: 0,0,1
10025,R-HSA-164843,2-LTR circle formation
10026,R-HSA-73843,5-Phosphoribose 1-diphosphate biosynthesis
10027,R-HSA-1971475,A tetrasaccharide linker sequence is required ...
10028,R-HSA-5619084,ABC transporter disorders
10029,R-HSA-1369062,ABC transporters in lipid homeostasis
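
Because `ReactomePathways.txt` has no header row, the cells above address columns by position (0, 1, 2). One way to make such code easier to read is to assign column names while reading. The sketch below does this on an invented three-line sample held in memory; the names `identifier`, `label`, and `species` are assumptions chosen for illustration, not names defined by Reactome.

```python
import io

import pandas as pd

# invented sample in the same three-column, tab-delimited layout
sample = (
    'R-HSA-0000001\tExample pathway A\tHomo sapiens\n'
    'R-MMU-0000002\tExample pathway B\tMus musculus\n'
    'R-HSA-0000003\tExample pathway C\tHomo sapiens\n'
)

# header=None tells pandas the first line is data; names labels the columns
toy_pathways = pd.read_csv(io.StringIO(sample), sep='\t', header=None,
                           names=['identifier', 'label', 'species'])

# keep human rows and the two label columns
toy_human = toy_pathways.loc[toy_pathways['species'] == 'Homo sapiens',
                             ['identifier', 'label']]
```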


***
**Reaction:** `UniProt2ReactomeReactions.txt` + `ChEBI2ReactomeReactions.txt`

In [377]:
url = 'https://reactome.org/download/current/UniProt2ReactomeReactions.txt'
data_downloader(url, unprocessed_data_location)

url = 'https://reactome.org/download/current/ChEBI2ReactomeReactions.txt'
data_downloader(url, unprocessed_data_location)

Downloading data file
Downloading data file


In [424]:
react_data1 = pandas.read_csv(unprocessed_data_location + 'UniProt2ReactomeReactions.txt',
                              header = None,
                              delimiter = '\t',
                              low_memory=False)

react_data2 = pandas.read_csv(unprocessed_data_location + 'ChEBI2ReactomeReactions.txt',
                              header = None,
                              delimiter = '\t',
                              low_memory=False)

In [425]:
# merge datasets
merged_react_data = pandas.merge(react_data1[[1, 3, 5]],
                                 react_data2[[1, 3, 5]],
                                 left_on=[1, 3, 5],
                                 right_on=[1, 3, 5],
                                 how='outer')

# filter data to only include human reactions (column 5 holds the species name)
merged_react_data = merged_react_data.loc[merged_react_data[5].apply(lambda x: x.startswith('Homo sapiens'))]

# keep the reaction identifier and name columns and drop duplicate rows
merged_react_data = merged_react_data[[1, 3]].drop_duplicates(subset=None, keep='first', inplace=False)

# preview data
merged_react_data.head(n=5)

Unnamed: 0,1,3
16004,R-HSA-1112666,BLNK (SLP-65) Signalosome hydrolyzes phosphati...
16384,R-HSA-166753,Conversion of C4 into C4a and C4b
17320,R-HSA-166792,Conversion of C2 into C2a and C2b
18247,R-HSA-173626,Activation of C1r
18532,R-HSA-173631,Activation of C1s
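
The outer merge above joins the two reaction files on every column that is kept, so after the species filter and `drop_duplicates` it yields the same identifier and label pairs as stacking the two tables. The sketch below checks this on invented toy tables using the same positional column labels (1, 3, 5); it illustrates the equivalence and is not a replacement for the notebook's cell.

```python
import pandas as pd

# toy versions of the two reaction tables (values invented)
toy_react1 = pd.DataFrame({
    1: ['R-1', 'R-1', 'R-2'],
    3: ['rxn one', 'rxn one', 'rxn two'],
    5: ['Homo sapiens', 'Homo sapiens', 'Homo sapiens'],
})
toy_react2 = pd.DataFrame({
    1: ['R-1', 'R-3'],
    3: ['rxn one', 'rxn three'],
    5: ['Homo sapiens', 'Mus musculus'],
})


def human_labels(df):
    """Keep human rows, then the identifier and label columns, without duplicates."""
    return df.loc[df[5] == 'Homo sapiens', [1, 3]].drop_duplicates()


# approach used above: outer merge on all shared columns
via_merge = human_labels(pd.merge(toy_react1, toy_react2, on=[1, 3, 5], how='outer'))

# alternative: stack the two tables
via_concat = human_labels(pd.concat([toy_react1, toy_react2], ignore_index=True))

merge_pairs = set(via_merge.itertuples(index=False, name=None))
concat_pairs = set(via_concat.itertuples(index=False, name=None))
```

Both routes give the pairs for `R-1` and `R-2`; the mouse reaction `R-3` is removed by the species filter.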



***
**Write Label Data to File**

In [None]:
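# create a list of all labeled datasets
dfs = [var_data, rna_data, gene_data, complex_data, pathway_data, merged_react_data]

# write each (identifier, label) pair as one tab-delimited line
with open(processed_data_location + 'Instance_Data_Labels.txt', 'w') as outfile:

    for df in tqdm(dfs):
        # only data frames with exactly two columns (identifier, label) are written
        if df.shape[1] != 2:
            continue

        for identifier, label in df.itertuples(index=False, name=None):
            outfile.write(str(identifier) + '\t' + str(label) + '\n')

# the with statement closes the file, so no explicit close() call is needed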
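
As an alternative to writing rows one at a time, each two-column data frame can be appended to the open file with `DataFrame.to_csv`. The sketch below writes invented toy frames to a temporary file; the file name and frame contents are placeholders. One difference from a plain `write` call is that `to_csv` quotes fields containing the delimiter or quote characters by default, so the output may differ for labels with such characters.

```python
import os
import tempfile

import pandas as pd

# toy label frames (values invented); the last one has the wrong shape
toy_frames = [
    pd.DataFrame({'id': [1, 2], 'label': ['a', 'b']}),
    pd.DataFrame({'identifier': ['X-1'], 'name': ['c']}),
    pd.DataFrame({'only_one_column': ['skipped']}),
]

out_path = os.path.join(tempfile.mkdtemp(), 'toy_labels.txt')

with open(out_path, 'w', newline='') as outfile:
    for frame in toy_frames:
        # skip anything that is not an (identifier, label) table
        if frame.shape[1] != 2:
            continue
        # no header and no index, so each line is identifier<TAB>label
        frame.to_csv(outfile, sep='\t', header=False, index=False)

with open(out_path) as infile:
    written = infile.read().splitlines()
```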

**Preview Label Data**

In [445]:
label_data = pandas.read_csv(processed_data_location + 'Instance_Data_Labels.txt',
                             header = None,
                             delimiter = '\t')

print('There are {label_count} instance data labels'.format(label_count=len(label_data)))

There are 185584 instance data labels


In [446]:
label_data.head(n=5)

Unnamed: 0,0,1
0,200300612,NM_147127.5(EVC2):c.3659+2T>C
1,757415879,NM_000094.3(COL7A1):c.5797C>T (p.Arg1933Ter)
2,886058642,NM_000094.3(COL7A1):c.1637-1G>A
3,757552268,NM_000387.6(SLC25A20):c.326+1del
4,775696136,NM_001369.2(DNAH5):c.13458dup (p.Asn4487Ter)
