***
# PheKnowLator - Data Preparation
***
***

**Author:** [TJCallahan](https://mail.google.com/mail/u/0/?view=cm&fs=1&tf=1&to=callahantiff@gmail.com)  
**GitHub Repository:** [PheKnowLator](https://github.com/callahantiff/PheKnowLator/wiki)  
**Release:** **[v2.0.0](https://github.com/callahantiff/PheKnowLator/wiki/v2.0.0)**
  
<br>  
  
**Purpose:** This notebook serves as a script to download and process the data needed to generate the mapping and filtering datasets used to build edges for the PheKnowLator knowledge graph. For more information on the data sources used within this script, please see the [Data Sources](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources) Wiki page.

<br>

**Assumptions:**   
- Raw data downloads ➞ `./resources/processed_data/unprocessed_data`    
- Processed data write location ➞ `./resources/processed_data`  

<br>

**Dependencies:** This notebook uses several helper functions, which are stored in the [`data_preparation_helper_functions.py`](https://github.com/callahantiff/PheKnowLator/blob/master/scripts/python/data_preparation_helper_functions.py) script. Hyperlinks to all downloaded and generated data sources are provided on the [Data Sources](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources) Wiki page as well as within each source subsection of this notebook. All generated data is freely available for download from Dropbox. 

_____
***

## Table of Contents
***

### [Create Identifier Maps](#create-identifier-maps)  
- [HUMAN TRANSCRIPT, GENE, AND PROTEIN IDENTIFIER MAPPING](#human-transcript,-gene,-and-protein-identifier-mapping)
  - [Entrez Gene-Ensembl Transcript](#entrezgene-ensembltranscript)  
  - [Entrez Gene-Protein Ontology](#entrezgene-proteinontology)  
  - [Ensembl Gene-Entrez Gene](#ensemblgene-entrezgene)
  - [Gene Symbol-Ensembl Transcript](#genesymbol-ensembltranscript)  
  - [STRING-Protein Ontology](#string-proteinontology)  
  - [Uniprot Accession-Protein Ontology](#uniprotaccession-proteinontology)
  

- [OTHER IDENTIFIER MAPPING](#other-identifier-mapping) 
  - [ChEBI Identifiers](#mesh-chebi) 
  - [Human Disease and Phenotype Identifiers](#disease-identifiers)
  - [Human Protein Atlas Tissue and Cell Types](#hpa-uberon)  
  - [Reactome Pathways - Pathway Ontology](#reactome-pw)  
  - [Genomic Identifiers - Sequence Ontology](#genomic-soo)  

<br>

### [Create Edge Datasets](#create-edge-datasets)
- [ONTOLOGIES](#ontologies)  
  - [Protein Ontology](#protein-ontology)  
  - [Relations Ontology](#relations-ontology)  


- [LINKED DATA](#linked-data)  
  - [Clinvar Variant-Diseases and Phenotypes](#clinvar-variant)
  - [Uniprot Protein-Cofactor and Protein-Catalyst](#uniprot-protein-cofactorcatalyst)  

<br>

### [Create Instance Data and/or Subclass Metadata](#create-instance-metadata)  
- [Genes/RNA](#gene-and-rna-metadata)
- [Pathways](#pathway-metadata)
- [Variants](#variant-metadata) 

____

<br>

### Set-Up Environment
_____

In [1]:
# import needed libraries
import datetime
import glob
import ijson
import itertools
import networkx
import numpy
import pandas
import pickle
import requests

from collections import Counter
from functools import reduce
import subprocess
from rdflib import Graph, Namespace, URIRef, BNode, Literal
from reactome2py import content
from tqdm import tqdm

# import script containing helper functions
from pkt_kg.utils import *

**Define Global Variables**

In [2]:
# directory to read unprocessed data files from
unprocessed_data_location = 'resources/processed_data/unprocessed_data/'

# directory to write processed data files to
processed_data_location = 'resources/processed_data/'

# directory to write relations data to
relations_data_location = 'resources/relations_data/'

# directory to write node metadata to
node_data_location = 'resources/node_data/'
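
The cells that follow read from and write to the directories defined above, so those directories need to exist before anything is downloaded. A minimal sketch of one way to guarantee that, assuming the paths are relative to the current working directory (`ensure_directories` is a hypothetical helper, not part of `pkt_kg`):

```python
import os


def ensure_directories(paths):
    """Create each directory in `paths` if it is missing (hypothetical helper).

    Returns a list of booleans confirming that each path is now a directory.
    """
    for path in paths:
        # exist_ok=True makes the call safe to repeat on every notebook run
        os.makedirs(path, exist_ok=True)
    return [os.path.isdir(path) for path in paths]
```

It could be called once after the global variables are defined, for example `ensure_directories([unprocessed_data_location, processed_data_location, relations_data_location, node_data_location])`.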

<br>

***
***
### CREATE MAPPING DATASETS  <a class="anchor" id="create-identifier-maps"></a>
***
***

### Human Transcript, Gene, and Protein Identifier Mapping  <a class="anchor" id="human-transcript,-gene,-and-protein-identifier-mapping"></a>
***

**Data Source Wiki Pages:**   
- [Ensembl](https://uswest.ensembl.org/)  
- [Uniprot Knowledgebase](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#uniprot-knowledgebase)  
- [HGNC](ftp://ftp.ebi.ac.uk/pub/databases/genenames/new/tsv/hgnc_complete_set.txt) 
- [NCBI Gene](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#ncbi-gene) 
- [Protein Ontology](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#protein-ontology)

<br>

**Purpose:** To create protein-coding gene-protein relations and mappings between the identifiers listed below. The edge types produced from each of these mappings are described further within each identifier mapping section:  
- [Entrez Gene-Ensembl Transcript](#entrezgene-ensembltranscript)  
- [Entrez Gene-Protein Ontology](#entrezgene-proteinontology)  
- [Ensembl Gene-Entrez Gene](#ensemblgene-entrezgene)
- [Gene Symbol-Ensembl Transcript](#genesymbol-ensembltranscript)  
- [STRING-Protein Ontology](#string-proteinontology)  
- [Uniprot Accession-Protein Ontology](#uniprotaccession-proteinontology)

<br>

**Gene and Transcript Types:** The transcript and gene/locus types were reviewed by a PhD-level molecular biologist to confirm whether each should be treated as `protein-coding`, which is useful for creating `genomic-rna`, `genomic-protein`, and `rna-protein` edges in the knowledge graph. For more information on this classification, please see the table below. Definitions of concepts in the table have been taken from [HGNC](https://www.genenames.org/help/symbol-report/), [Ensembl](https://uswest.ensembl.org/info/genome/genebuild/biotypes.html), [NCBI](https://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/lxr/source/src/objects/entrezgene/entrezgene.asn), and Wikipedia.

<table>
<tr>
<th align="center">Gene and Transcript Type</th>  
<th align="center">Definition</th>
<th align="center">Type</th>
<th align="center">Genomic material <i>transcribed_to</i> RNA</th>
<th align="center">RNA <i>translated_to</i> Protein</th>
<th align="center">Genomic material <i>has_gene_product</i> Protein</th>
</tr>
<tr>
  <td rowspan="2">biological-region</td> 
  <td rowspan="2">Biological_region (SO:0001411; Special note: This is a parental feature spanning all other feature annotation on each RefSeq Functional Element record. It is a 'misc_feature' in GenBank flat files but a 'Region' feature in ASN.1 and GFF3 formats)</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
  <td>transcript</td> 
  <td> --- </td> 
  <td> --- </td> 
  <td> --- </td> 
</tr>
<tr>
  <td rowspan="2">IG_C_gene</td> 
  <td rowspan="2">Constant chain immunoglobulin gene that undergoes somatic recombination before transcription</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
  <td>transcript</td> 
  <td>yes</td> 
  <td>yes</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">IG_C_pseudogene</td> 
  <td rowspan="2">Inactivated immunoglobulin gene. Immunoglobulin gene segments that are inactivated due to frameshift mutations and/or stop codons in the open reading frame</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>yes</td> 
  <td>no</td> 
</tr>	 	 
<tr>
  <td rowspan="2">IG_D_gene</td> 
  <td rowspan="2">Diversity chain immunoglobulin gene that undergoes somatic recombination before transcription</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>yes</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">IG_J_gene</td> 
  <td rowspan="2">IG J gene: Joining chain immunoglobulin gene that undergoes somatic recombination before transcription</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>yes</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">IG_J_pseudogene</td> 
  <td rowspan="2">Inactivated immunoglobulin gene. Immunoglobulin gene segments that are inactivated due to frameshift mutations and/or stop codons in the open reading frame</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>yes</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">IG_pseudogene</td> 
  <td rowspan="2">Inactivated immunoglobulin gene. Immunoglobulin gene segments that are inactivated due to frameshift mutations and/or stop codons in the open reading frame</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>yes</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">IG_V_gene</td> 
  <td rowspan="2">Variable chain immunoglobulin gene that undergoes somatic recombination before transcription</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>yes</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">IG_V_pseudogene</td> 
  <td rowspan="2">Inactivated immunoglobulin gene. Immunoglobulin gene segments that are inactivated due to frameshift mutations and/or stop codons in the open reading frame</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>yes</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">lncRNA</td> 
  <td rowspan="2">RNA, long non-coding - non-protein coding genes that encode long non-coding RNAs (lncRNAs) (SO:0001877); these are at least 200 nt in length. Subtypes include intergenic (SO:0001463), intronic (SO:0001903) and antisense (SO:0001904)</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">miRNA</td> 
  <td rowspan="2">RNA, micro - non-protein coding genes that encode microRNAs (miRNAs) (SO:0001265)</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">misc_RNA</td> 
  <td rowspan="2">Non-protein coding genes that encode miscellaneous types of small ncRNAs, such as vault (SO:0000404) and Y (SO:0000405) RNA genes</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">Mt_rRNA</td> 
  <td rowspan="2">Mitochondrial rRNA</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">Mt_tRNA</td> 
  <td rowspan="2">Mitochondrial tRNA</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">ncRNA</td> 
  <td rowspan="2">Noncoding RNA</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
<td>transcript</td> 
  <td> --- </td> 
  <td> --- </td> 
  <td> --- </td> 
</tr>
<tr>
  <td rowspan="2">non_stop_decay</td> 
  <td rowspan="2">Transcripts that have polyA features (including signal) without a prior stop codon in the CDS, i.e. a non-genomic polyA tail attached directly to the CDS without 3' UTR. These transcripts are subject to degradation</td>
  <td>gene</td> 
  <td> --- </td> 
  <td> --- </td> 
  <td> --- </td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">nonsense_mediated_decay</td> 
  <td rowspan="2">If the coding sequence (following the appropriate reference) of a transcript finishes >50bp from a downstream splice site then it is tagged as NMD. If the variant does not cover the full reference coding sequence then it is annotated as NMD if NMD is unavoidable i.e. no matter what the exon structure of the missing portion is the transcript will be subject to NMD</td>
  <td>gene</td> 
  <td> --- </td> 
  <td> --- </td> 
  <td> --- </td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>yes</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">other</td> 
  <td rowspan="2">other</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">phenotype</td> 
  <td rowspan="2"> Mapped phenotypes where the causative gene has not been identified (SO:0001500) </td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
<td>transcript</td> 
  <td> --- </td> 
  <td> --- </td> 
  <td> --- </td> 
</tr>
<tr>
  <td rowspan="2">polymorphic_pseudogene</td> 
  <td rowspan="2">Pseudogene owing to a SNP/DIP but in other individuals/haplotypes/strains the gene is translated</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>yes</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">processed_pseudogene</td> 
  <td rowspan="2">Pseudogene that lack introns and is thought to arise from reverse transcription of mRNA followed by reinsertion of DNA into the genome</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>yes</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">processed_transcript</td> 
  <td rowspan="2">Gene/transcript that doesn't contain an open reading frame</td>
  <td>gene</td> 
  <td> --- </td> 
  <td> --- </td> 
  <td> --- </td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>yes</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">protein_coding</td> 
  <td rowspan="2">Contains an open reading frame (ORF)</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>yes</td> 
  <td>yes</td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>yes</td> 
</tr>
<tr>
  <td rowspan="2">pseudogene</td> 
  <td rowspan="2">Have homology to proteins but generally suffer from a disrupted coding sequence and an active homologous gene can be found at another locus. Sometimes these entries have an intact coding sequence or an open but truncated ORF, in which case there is other evidence used (for example genomic polyA stretches at the 3' end) to classify them as a pseudogene</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>yes</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">retained_intron</td> 
  <td rowspan="2">Has an alternatively spliced transcript believed to contain intronic sequence relative to other, coding, variants</td>
  <td>gene</td> 
  <td> --- </td> 
  <td> --- </td> 
  <td> --- </td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>yes</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">ribozyme</td> 
  <td rowspan="2">Ribozymes are RNA molecules that have the ability to catalyze specific biochemical reactions, including RNA splicing in gene expression, similar to the action of protein enzymes</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>yes</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">rRNA</td> 
  <td rowspan="2">RNA, ribosomal - non-protein coding genes that encode ribosomal RNAs (rRNAs) (SO:0001637)</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">rRNA_pseudogene</td> 
  <td rowspan="2">A gene that has homology to known protein-coding genes but contain a frameshift and/or stop codon(s) which disrupts the ORF. Thought to have arisen through duplication followed by loss of function</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>yes</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">scaRNA</td> 
  <td rowspan="2">Small Cajal body-specific RNAs are a class of small nucleolar RNAs that specifically localise to the Cajal body, a nuclear organelle involved in the biogenesis of small nuclear ribonucleoproteins</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">scRNA</td> 
  <td rowspan="2">RNA, small cytoplasmic - non-protein coding genes that encode small cytoplasmic RNAs (scRNAs) (SO:0001266)</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">snoRNA</td> 
  <td rowspan="2">RNA, small nucleolar - non-protein coding genes that encode small nucleolar RNAs (snoRNAs) containing C/D or H/ACA box domains (SO:0001267)</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
  <td>transcript</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">snRNA</td> 
  <td rowspan="2">RNA, small nuclear - non-protein coding genes that encode small nuclear RNAs (snRNAs) (SO:0001268)</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">sRNA</td> 
  <td rowspan="2">Bacterial small RNAs (sRNA) are small RNAs produced by bacteria; they are 50- to 500-nucleotide non-coding RNA molecules, highly structured and containing several stem-loops</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">TEC</td> 
  <td rowspan="2">TEC (To be Experimentally Confirmed). This is used for non-spliced EST clusters that have polyA features. This category has been specifically created for the ENCODE project to highlight regions that could indicate the presence of protein coding genes that require experimental validation, either by 5' RACE or RT-PCR to extend the transcripts, or by confirming expression of the putatively-encoded peptide with specific antibodies</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>yes</td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>yes</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">TR_C_gene</td> 
  <td rowspan="2">Constant chain T cell receptor gene that undergoes somatic recombination before transcription</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>yes</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">TR_D_gene</td> 
  <td rowspan="2">Diversity chain T cell receptor gene that undergoes somatic recombination before transcription</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
  <td>transcript</td> 
  <td>yes</td> 
  <td>yes</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">TR_J_gene</td> 
  <td rowspan="2">Joining chain T cell receptor gene that undergoes somatic recombination before transcription</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>yes</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">TR_J_pseudogene</td> 
  <td rowspan="2">T cell receptor pseudogene - T cell receptor gene segments that are inactivated due to frameshift mutations and/or stop codons in the open reading frame</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>yes</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">TR_V_gene</td> 
  <td rowspan="2">Variable chain T cell receptor gene that undergoes somatic recombination before transcription</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>yes</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">TR_V_pseudogene</td> 
  <td rowspan="2">T cell receptor pseudogene - T cell receptor gene segments that are inactivated due to frameshift mutations and/or stop codons in the open reading frame</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>yes</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">transcribed_processed_pseudogene</td> 
  <td rowspan="2">Pseudogene where protein homology or genomic structure indicates a pseudogene, but the presence of locus-specific transcripts indicates expression. These can be classified into 'Processed', 'Unprocessed' and 'Unitary'</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">transcribed_unitary_pseudogene</td> 
  <td rowspan="2">Pseudogene where protein homology or genomic structure indicates a pseudogene, but the presence of locus-specific transcripts indicates expression. These can be classified into 'Processed', 'Unprocessed' and 'Unitary'</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">transcribed_unprocessed_pseudogene</td> 
  <td rowspan="2">Pseudogene where protein homology or genomic structure indicates a pseudogene, but the presence of locus-specific transcripts indicates expression. These can be classified into 'Processed', 'Unprocessed' and 'Unitary'</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">translated_processed_pseudogene</td> 
  <td rowspan="2">Pseudogenes that have mass spec data suggesting that they are also translated. These can be classified into 'Processed', 'Unprocessed'</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
  <td>transcript</td> 
  <td>yes</td> 
  <td>yes</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">translated_unprocessed_pseudogene</td> 
  <td rowspan="2">Pseudogenes that have mass spec data suggesting that they are also translated. These can be classified into 'Processed', 'Unprocessed'</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
  <td>transcript</td> 
  <td>yes</td> 
  <td>yes</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">tRNA</td> 
  <td rowspan="2">Transfer RNA</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
  <td>transcript</td> 
  <td> --- </td> 
  <td> --- </td> 
  <td> --- </td> 
</tr>
<tr>
  <td rowspan="2">unitary_pseudogene</td> 
  <td rowspan="2">A species specific unprocessed pseudogene without a parent gene, as it has an active orthologue in another species</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
  <td>transcript</td> 
  <td>yes</td> 
  <td>yes</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">unknown</td> 
  <td rowspan="2">Entries where the locus type is currently unknown</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
  <td>transcript</td> 
  <td> --- </td> 
  <td> --- </td> 
  <td> --- </td> 
</tr>
<tr>
  <td rowspan="2">unprocessed_pseudogene</td> 
  <td rowspan="2">Pseudogene that can contain introns since produced by gene duplication</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
  <td>transcript</td> 
  <td>yes</td> 
  <td>yes</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2" align="center">vaultRNA</td> 
  <td rowspan="2" align="center">Short non coding RNA genes that form part of the vault ribonucleoprotein complex</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
  <td>transcript</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td> 
</tr>
</table> 
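
The table above is effectively a lookup from a biotype and an entity type (gene or transcript) to the relations that may be created. A minimal sketch of how such a lookup could be encoded, assuming only a handful of rows copied from the table (`EDGE_RULES` and `allowed_edges` are hypothetical names and do not come from `pkt_kg`):

```python
# A few rows copied from the table above; None marks cells shown as '---'.
# Tuple order: (transcribed_to RNA, translated_to Protein, has_gene_product Protein)
EDGE_RULES = {
    ('protein_coding', 'gene'): (True, True, True),
    ('protein_coding', 'transcript'): (True, False, True),
    ('lncRNA', 'gene'): (True, False, False),
    ('lncRNA', 'transcript'): (True, False, False),
    ('IG_C_gene', 'transcript'): (True, True, False),
    ('ncRNA', 'transcript'): (None, None, None),
}

EDGE_NAMES = ('transcribed_to', 'translated_to', 'has_gene_product')


def allowed_edges(biotype, entity_type):
    """Return the relation names marked 'yes' for a biotype/entity pair.

    Unknown pairs and '---' cells yield an empty list (hypothetical helper).
    """
    flags = EDGE_RULES.get((biotype, entity_type))
    if flags is None:
        return []
    return [name for name, flag in zip(EDGE_NAMES, flags) if flag]
```

Keeping the rules in a single dictionary means the classification can be reviewed in one place and extended row by row as needed.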


<br>

**Output:** This script downloads and saves the following data:  
- Human Ensembl Gene Set ➞ [`Homo_sapiens.GRCh38.99.gtf`](ftp://ftp.ensembl.org/pub/release-99/gtf/homo_sapiens/Homo_sapiens.GRCh38.99.gtf.gz)
- Human Ensembl-UniProt Identifiers ➞ [`Homo_sapiens.GRCh38.98.uniprot.tsv`](ftp://ftp.ensembl.org/pub/release-99/tsv/homo_sapiens/Homo_sapiens.GRCh38.99.entrez.tsv.gzv) 
- Human Ensembl-Entrez Identifiers ➞ [`Homo_sapiens.GRCh38.98.entrez.tsv`](ftp://ftp.ensembl.org/pub/release-98/tsv/homo_sapiens/Homo_sapiens.GRCh38.98.entrez.tsv.gz) 
- Human Gene Identifiers ➞ [`Homo_sapiens.gene_info`](ftp://ftp.ncbi.nih.gov/gene/DATA/GENE_INFO/Mammalia/Homo_sapiens.gene_info.gz), [`hgnc_complete_set.txt`](ftp://ftp.ebi.ac.uk/pub/databases/genenames/new/tsv/hgnc_complete_set.txt)  
- Human Protein Identifiers ➞ [`promapping.txt`](https://proconsortium.org/download/current/promapping.txt)  
- UniProt Identifiers ➞ [`uniprot_identifier_mapping.tab`](https://www.uniprot.org/uniprot/?query=&fil=organism%3A%22Homo%20sapiens%20(Human)%20%5B9606%5D%22&columns=id%2Cdatabase(GeneID)%2Cdatabase(Ensembl)%2Cdatabase(HGNC)%2Cgenes(PREFERRED)%2Cgenes(ALTERNATIVE))

_All Merged Data Sets:_  
- [`Merged_Human_Ensembl_Entrez_HGNC_Uniprot_Identifiers.txt`](https://www.dropbox.com/s/qnxqnm88rqcerh9/Merged_Human_Ensembl_Entrez_HGNC_Uniprot_Identifiers.txt?dl=1)  
- [`Merged_gene_rna_protein_identifiers.pkl`](https://www.dropbox.com/s/6idnt7b3i322hlh/Merged_gene_rna_protein_identifiers.pkl?dl=1)  

***

<br>

***
**HGNC Data** 

_Human Gene Set Data_ - `hgnc_complete_set.txt`

In [3]:
url = 'ftp://ftp.ebi.ac.uk/pub/databases/genenames/new/tsv/hgnc_complete_set.txt'
data_downloader(url, unprocessed_data_location)

Downloading Data from FTP Server: ftp://ftp.ebi.ac.uk/pub/databases/genenames/new/tsv/hgnc_complete_set.txt
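
`data_downloader` is a helper imported from `pkt_kg.utils`, and its implementation is not shown in this notebook. As a rough illustration of what a download step like this involves, the sketch below uses only the standard library; `download_file` is a hypothetical stand-in, and the assumption that `.gz` files are decompressed on the fly is based only on the log messages printed by the download cells.

```python
import gzip
import os
import shutil
import urllib.request


def download_file(url, write_location):
    """Download `url` into `write_location`, decompressing '.gz' files.

    Hypothetical stand-in for illustration; not the pkt_kg implementation.
    """
    filename = url.split('/')[-1]
    is_gzipped = filename.endswith('.gz')
    # drop the '.gz' suffix when the content is decompressed while writing
    target = os.path.join(write_location, filename[:-3] if is_gzipped else filename)

    with urllib.request.urlopen(url) as response, open(target, 'wb') as out_file:
        # wrap the response in a gzip reader when the remote file is compressed
        source = gzip.GzipFile(fileobj=response) if is_gzipped else response
        shutil.copyfileobj(source, out_file)

    return target
```

Streaming with `shutil.copyfileobj` avoids loading large files such as the Ensembl GTF into memory in one piece.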


In [4]:
hgnc = pandas.read_csv(unprocessed_data_location + 'hgnc_complete_set.txt',
                       header=0,
                       delimiter='\t',
                       low_memory=False)

_Preprocess Data_  
The data file needs light cleaning before it can be merged with other data: renaming columns, replacing `NaN` with `None`, updating data types (i.e. making all columns type `str`), and unnesting `|`-delimited data. The final step regroups the gene type values into broader categories such as `protein-coding`, `other`, and `ncRNA`.
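
The unnesting step is handled in this notebook by the `explodes_data` helper, whose implementation is not shown here. A minimal sketch of the same idea using plain pandas, assuming a single-character delimiter and that each listed column should be expanded one after another (`explode_columns` is a hypothetical name):

```python
import pandas


def explode_columns(df, columns, delimiter):
    """Split delimited strings in `columns` so each value gets its own row.

    Hypothetical helper; assumes `delimiter` is a single character, which
    pandas treats as a literal string rather than a regular expression.
    """
    for column in columns:
        split_values = df[column].str.split(delimiter)
        df = df.assign(**{column: split_values}).explode(column)
    return df.reset_index(drop=True)


# toy data for illustration only
example = pandas.DataFrame({'symbol': ['A1BG', 'XYZ'],
                            'uniprot_id': ['P04217', 'Q1|Q2']})
exploded = explode_columns(example, ['uniprot_id'], '|')
```

Exploding several columns in sequence produces every combination of their values, so the result can grow quickly when more than one column holds nested data.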

In [5]:
# remove all rows that don't have 'Approved' status
hgnc = hgnc.loc[hgnc['status'] == 'Approved']

# drop unneeded columns
hgnc = hgnc[['hgnc_id', 'entrez_id', 'ensembl_gene_id', 'uniprot_ids', 'symbol', 'locus_type',
             'alias_symbol', 'name', 'location', 'alias_name']]

# rename columns
hgnc.rename(columns={'uniprot_ids': 'uniprot_id', 'location': 'map_location', 'locus_type': 'hgnc_gene_type'}, inplace=True)

# strip the 'HGNC:' prefix off of the identifiers
hgnc['hgnc_id'] = hgnc['hgnc_id'].replace(r'.*:', '', regex=True)

# combine certain columns into single column
hgnc['name'] = hgnc['name'] + '|' + hgnc['alias_name']
hgnc['synonyms'] = hgnc['alias_symbol'] + '|' + hgnc['alias_name']

# replace NaN with 'None'
hgnc.fillna('None', inplace=True)

# make data columns of type string
hgnc['entrez_id'] = hgnc['entrez_id'].apply(lambda x: str(int(x)) if x != 'None' else 'None')

# explode nested data
explode_df_hgnc = explodes_data(hgnc.copy(), ['ensembl_gene_id', 'uniprot_id', 'symbol', 'name'], '|')

# reformat hgnc gene type
explode_df_hgnc['hgnc_gene_type'].replace('region', 'biological region', inplace=True, regex=False)
explode_df_hgnc['hgnc_gene_type'].replace('complex locus constituent', 'complex locus constituent', inplace=True, regex=False)
explode_df_hgnc['hgnc_gene_type'].replace('endogenous retrovirus', 'endogenous retrovirus', inplace=True, regex=False)
explode_df_hgnc['hgnc_gene_type'].replace('fragile site', 'fragile site', inplace=True, regex=False)
explode_df_hgnc['hgnc_gene_type'].replace('immunoglobulin gene', 'IG_gene', inplace=True, regex=False)
explode_df_hgnc['hgnc_gene_type'].replace('immunoglobulin pseudogene', 'IG_pseudogene', inplace=True, regex=False)
explode_df_hgnc['hgnc_gene_type'].replace('RNA, long non-coding', 'lncRNA', inplace=True, regex=False)
explode_df_hgnc['hgnc_gene_type'].replace('RNA, micro', 'miRNA', inplace=True, regex=False)
explode_df_hgnc['hgnc_gene_type'].replace('RNA, misc', 'miscRNA', inplace=True, regex=False)
explode_df_hgnc['hgnc_gene_type'].replace('phenotype only', 'phenotype only', inplace=True, regex=False)
explode_df_hgnc['hgnc_gene_type'].replace('gene with protein product', 'protein-coding', inplace=True, regex=False)
explode_df_hgnc['hgnc_gene_type'].replace('protocadherin', 'protocadherin', inplace=True, regex=False)
explode_df_hgnc['hgnc_gene_type'].replace('pseudogene', 'pseudogene', inplace=True, regex=False)
explode_df_hgnc['hgnc_gene_type'].replace('readthrough', 'readthrough', inplace=True, regex=False)
explode_df_hgnc['hgnc_gene_type'].replace('RNA, ribosomal', 'rRNA', inplace=True, regex=False)
explode_df_hgnc['hgnc_gene_type'].replace('RNA, small nucleolar', 'snoRNA', inplace=True, regex=False)
explode_df_hgnc['hgnc_gene_type'].replace('RNA, small nuclear', 'snRNA', inplace=True, regex=False)
explode_df_hgnc['hgnc_gene_type'].replace('RNA, cluster', 'sRNA', inplace=True, regex=False)
explode_df_hgnc['hgnc_gene_type'].replace('T cell receptor gene', 'TR_gene', inplace=True, regex=False)
explode_df_hgnc['hgnc_gene_type'].replace('T cell receptor pseudogene', 'TR_pseudogene', inplace=True, regex=False)
explode_df_hgnc['hgnc_gene_type'].replace('transposable element', 'transposable element', inplace=True, regex=False)
explode_df_hgnc['hgnc_gene_type'].replace('RNA, transfer', 'tRNA', inplace=True, regex=False)
explode_df_hgnc['hgnc_gene_type'].replace('unknown', 'unknown', inplace=True, regex=False)
explode_df_hgnc['hgnc_gene_type'].replace('RNA, vault', 'vaultRNA', inplace=True, regex=False)
explode_df_hgnc['hgnc_gene_type'].replace('virus integration site', 'virus integration site', inplace=True, regex=False)
explode_df_hgnc['hgnc_gene_type'].replace('RNA, Y', 'Y RNA', inplace=True, regex=False)

# master gene type
explode_df_hgnc['master_gene_type'] = explode_df_hgnc['hgnc_gene_type']
explode_df_hgnc['master_gene_type'].replace('biological region', 'other', inplace=True, regex=False)
explode_df_hgnc['master_gene_type'].replace('complex locus constituent', 'other', inplace=True, regex=False)
explode_df_hgnc['master_gene_type'].replace('endogenous retrovirus', 'other', inplace=True, regex=False)
explode_df_hgnc['master_gene_type'].replace('fragile site', 'other', inplace=True, regex=False)
explode_df_hgnc['master_gene_type'].replace('IG_gene', 'other', inplace=True, regex=False)
explode_df_hgnc['master_gene_type'].replace('IG_pseudogene', 'pseudogene', inplace=True, regex=False)
explode_df_hgnc['master_gene_type'].replace('lncRNA', 'ncRNA', inplace=True, regex=False)
explode_df_hgnc['master_gene_type'].replace('miRNA', 'ncRNA', inplace=True, regex=False)
explode_df_hgnc['master_gene_type'].replace('miscRNA', 'ncRNA', inplace=True, regex=False)
explode_df_hgnc['master_gene_type'].replace('phenotype only', 'other', inplace=True, regex=False)
explode_df_hgnc['master_gene_type'].replace('protocadherin', 'other', inplace=True, regex=False)
explode_df_hgnc['master_gene_type'].replace('readthrough', 'other', inplace=True, regex=False)
explode_df_hgnc['master_gene_type'].replace('snoRNA', 'ncRNA', inplace=True, regex=False)
explode_df_hgnc['master_gene_type'].replace('snRNA', 'ncRNA', inplace=True, regex=False)
explode_df_hgnc['master_gene_type'].replace('sRNA', 'ncRNA', inplace=True, regex=False)
explode_df_hgnc['master_gene_type'].replace('TR_gene', 'other', inplace=True, regex=False)
explode_df_hgnc['master_gene_type'].replace('TR_pseudogene', 'pseudogene', inplace=True, regex=False)
explode_df_hgnc['master_gene_type'].replace('transposable element', 'other', inplace=True, regex=False)
explode_df_hgnc['master_gene_type'].replace('tRNA', 'tRNA', inplace=True, regex=False)
explode_df_hgnc['master_gene_type'].replace('vaultRNA', 'ncRNA', inplace=True, regex=False)
explode_df_hgnc['master_gene_type'].replace('virus integration site', 'other', inplace=True, regex=False)
explode_df_hgnc['master_gene_type'].replace('Y RNA', 'ncRNA', inplace=True, regex=False)

# remove alias columns, which are no longer needed
explode_df_hgnc.drop(['alias_symbol', 'alias_name'], axis=1, inplace=True)

# remove duplicates
explode_df_hgnc.drop_duplicates(subset=None, keep='first', inplace=True)

# preview data
explode_df_hgnc.head(n=3)

Unnamed: 0,hgnc_id,entrez_id,ensembl_gene_id,uniprot_id,symbol,hgnc_gene_type,name,map_location,synonyms,master_gene_type
0,5,1,ENSG00000121410,P04217,A1BG,protein-coding,,19q13.43,,protein-coding
1,37133,503538,ENSG00000268895,,A1BG-AS1,lncRNA,,19q13.43,,ncRNA
2,24086,29974,ENSG00000148584,Q9NQ94,A1CF,protein-coding,,10q11.23,,protein-coding


<br>
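
The run of `replace` calls above applies one fixed lookup from source-specific gene types to a small set of master categories. As a minimal sketch, the same idea can be written as a single dictionary. The function name `regroup_gene_type` and the mapping excerpt are illustrative only; they are not part of the notebook or its helper script. Unmapped values are assumed to pass through unchanged, as they do with `replace`.

```python
# Illustrative excerpt of the HGNC -> master gene type lookup used above.
# A full mapping would contain one entry per `replace` call.
HGNC_TO_MASTER = {
    'lncRNA': 'ncRNA',
    'miRNA': 'ncRNA',
    'snoRNA': 'ncRNA',
    'IG_pseudogene': 'pseudogene',
    'TR_pseudogene': 'pseudogene',
    'fragile site': 'other',
    'readthrough': 'other',
}


def regroup_gene_type(gene_type, mapping=HGNC_TO_MASTER):
    """Return the master category for gene_type; unmapped values are kept as-is."""
    return mapping.get(gene_type, gene_type)


hgnc_types = ['protein-coding', 'lncRNA', 'IG_pseudogene', 'fragile site']
master_types = [regroup_gene_type(t) for t in hgnc_types]
print(master_types)
```

With a DataFrame, the same dictionary can be passed to `Series.replace`, assigning the result back to the column rather than relying on `inplace=True`.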

***
***
**Ensembl Data**

_Human Gene Set Data_ - `Homo_sapiens.GRCh38.99.gtf.gz`

In [6]:
url = 'ftp://ftp.ensembl.org/pub/release-99/gtf/homo_sapiens/Homo_sapiens.GRCh38.99.gtf.gz'
data_downloader(url, unprocessed_data_location)

Downloading Gzipped data from FTP Server: ftp://ftp.ensembl.org/pub/release-99/gtf/homo_sapiens/Homo_sapiens.GRCh38.99.gtf.gz
Decompressing and Writing Gzipped Data to File


In [7]:
ensembl_geneset = pandas.read_csv(unprocessed_data_location + 'Homo_sapiens.GRCh38.99.gtf',
                                  header = None,
                                  delimiter='\t',
                                  skiprows=5,
                                  low_memory=False)

_Preprocess Data_  
The data file must be reformatted before it can be merged with the other gene, RNA, and protein identifier data. To do this, we iterate over each row and extract the fields listed in `data_cols` from the attribute column (column `8`), storing each extracted field in its own column. The final step derives `master_gene_type` and `master_transcript_type`: gene types are re-grouped into broader categories such as `protein-coding`, `ncRNA`, `pseudogene`, and `other`, and transcript types are collapsed to `protein-coding` or `not protein-coding`.

In [8]:
# set list of data items to extract from column string
data_cols = ['gene_id', 'transcript_id', 'gene_name', 'gene_biotype', 'transcript_name', 'transcript_biotype']

# loop over items contained in the string of column 8
cleaned_column = []

for data_list in tqdm(list(ensembl_geneset[8])):
    data_list = data_list if not data_list.endswith(';') else data_list[:-1]
    temp_data = [data_list.split('; ')[[x.split(' ')[0] for x in data_list.split('; ')].index(col)] if col in data_list else col + ' None' for col in data_cols]
    cleaned_column.append(temp_data) 

# update columns
ensembl_geneset['ensembl_gene_id'] = [x[0].split(' ')[-1].replace('"', '') for x in cleaned_column]
ensembl_geneset['transcript_stable_id'] = [x[1].split(' ')[-1].replace('"', '') for x in cleaned_column]
ensembl_geneset['symbol'] = [x[2].split(' ')[-1].replace('"', '') for x in cleaned_column]
ensembl_geneset['ensembl_gene_type'] = [x[3].split(' ')[-1].replace('"', '') for x in cleaned_column]
ensembl_geneset['transcript_name'] = [x[4].split(' ')[-1].replace('"', '') for x in cleaned_column]
ensembl_geneset['ensembl_transcript_type'] = [x[5].split(' ')[-1].replace('"', '') for x in cleaned_column]

# replace NaN with 'None'
ensembl_geneset.fillna('None', inplace=True)

# reformat ensembl gene type
ensembl_geneset['ensembl_gene_type'].replace('misc_RNA', 'miscRNA', inplace=True, regex=False)
ensembl_geneset['ensembl_gene_type'].replace('TEC', 'protein-coding', inplace=True, regex=False)
ensembl_geneset['ensembl_gene_type'].replace('protein_coding', 'protein-coding', inplace=True, regex=False)

# reformat master gene type
ensembl_geneset['master_gene_type'] = ensembl_geneset['ensembl_gene_type']
ensembl_geneset['master_gene_type'].replace('IG_C_gene', 'other', inplace=True, regex=False)
ensembl_geneset['master_gene_type'].replace('IG_C_pseudogene', 'pseudogene', inplace=True, regex=False)
ensembl_geneset['master_gene_type'].replace('IG_D_gene', 'other', inplace=True, regex=False)
ensembl_geneset['master_gene_type'].replace('IG_J_gene', 'other', inplace=True, regex=False)
ensembl_geneset['master_gene_type'].replace('IG_J_pseudogene', 'pseudogene', inplace=True, regex=False)
ensembl_geneset['master_gene_type'].replace('IG_pseudogene', 'pseudogene', inplace=True, regex=False)
ensembl_geneset['master_gene_type'].replace('IG_V_gene', 'other', inplace=True, regex=False)
ensembl_geneset['master_gene_type'].replace('IG_V_pseudogene', 'pseudogene', inplace=True, regex=False)
ensembl_geneset['master_gene_type'].replace('lncRNA', 'ncRNA', inplace=True, regex=False)
ensembl_geneset['master_gene_type'].replace('miRNA', 'ncRNA', inplace=True, regex=False)
ensembl_geneset['master_gene_type'].replace('miscRNA', 'ncRNA', inplace=True, regex=False)
ensembl_geneset['master_gene_type'].replace('Mt_rRNA', 'rRNA', inplace=True, regex=False)
ensembl_geneset['master_gene_type'].replace('Mt_tRNA', 'tRNA', inplace=True, regex=False)
ensembl_geneset['master_gene_type'].replace('polymorphic_pseudogene', 'pseudogene', inplace=True, regex=False)
ensembl_geneset['master_gene_type'].replace('processed_pseudogene', 'pseudogene', inplace=True, regex=False)
ensembl_geneset['master_gene_type'].replace('ribozyme', 'other', inplace=True, regex=False)
ensembl_geneset['master_gene_type'].replace('rRNA_pseudogene', 'pseudogene', inplace=True, regex=False)
ensembl_geneset['master_gene_type'].replace('scaRNA', 'ncRNA', inplace=True, regex=False)
ensembl_geneset['master_gene_type'].replace('scRNA', 'ncRNA', inplace=True, regex=False)
ensembl_geneset['master_gene_type'].replace('snoRNA', 'ncRNA', inplace=True, regex=False)
ensembl_geneset['master_gene_type'].replace('snRNA', 'ncRNA', inplace=True, regex=False)
ensembl_geneset['master_gene_type'].replace('sRNA', 'ncRNA', inplace=True, regex=False)
ensembl_geneset['master_gene_type'].replace('TR_C_gene', 'other', inplace=True, regex=False)
ensembl_geneset['master_gene_type'].replace('TR_D_gene', 'other', inplace=True, regex=False)
ensembl_geneset['master_gene_type'].replace('TR_J_gene', 'other', inplace=True, regex=False)
ensembl_geneset['master_gene_type'].replace('TR_J_pseudogene', 'pseudogene', inplace=True, regex=False)
ensembl_geneset['master_gene_type'].replace('TR_V_gene', 'other', inplace=True, regex=False)
ensembl_geneset['master_gene_type'].replace('TR_V_pseudogene', 'pseudogene', inplace=True, regex=False)
ensembl_geneset['master_gene_type'].replace('transcribed_processed_pseudogene', 'pseudogene', inplace=True, regex=False)
ensembl_geneset['master_gene_type'].replace('transcribed_unitary_pseudogene', 'pseudogene', inplace=True, regex=False)
ensembl_geneset['master_gene_type'].replace('transcribed_unprocessed_pseudogene', 'pseudogene', inplace=True, regex=False)
ensembl_geneset['master_gene_type'].replace('translated_processed_pseudogene', 'pseudogene', inplace=True, regex=False)
ensembl_geneset['master_gene_type'].replace('translated_unprocessed_pseudogene', 'pseudogene', inplace=True, regex=False)
ensembl_geneset['master_gene_type'].replace('unitary_pseudogene', 'pseudogene', inplace=True, regex=False)
ensembl_geneset['master_gene_type'].replace('unprocessed_pseudogene', 'pseudogene', inplace=True, regex=False)
ensembl_geneset['master_gene_type'].replace('vaultRNA', 'ncRNA', inplace=True, regex=False)

# reformat master transcript type
ensembl_geneset['master_transcript_type'] = ensembl_geneset['ensembl_transcript_type']
ensembl_geneset['master_transcript_type'].replace('IG_C_gene', 'protein-coding', inplace=True, regex=False)
ensembl_geneset['master_transcript_type'].replace('IG_C_pseudogene', 'protein-coding', inplace=True, regex=False)
ensembl_geneset['master_transcript_type'].replace('IG_D_gene', 'protein-coding', inplace=True, regex=False)
ensembl_geneset['master_transcript_type'].replace('IG_J_gene', 'protein-coding', inplace=True, regex=False)
ensembl_geneset['master_transcript_type'].replace('IG_J_pseudogene', 'protein-coding', inplace=True, regex=False)
ensembl_geneset['master_transcript_type'].replace('IG_pseudogene', 'protein-coding', inplace=True, regex=False)
ensembl_geneset['master_transcript_type'].replace('IG_V_gene', 'protein-coding', inplace=True, regex=False)
ensembl_geneset['master_transcript_type'].replace('IG_V_pseudogene', 'protein-coding', inplace=True, regex=False)
ensembl_geneset['master_transcript_type'].replace('lncRNA', 'not protein-coding', inplace=True, regex=False)
ensembl_geneset['master_transcript_type'].replace('miRNA', 'not protein-coding', inplace=True, regex=False)
ensembl_geneset['master_transcript_type'].replace('misc_RNA', 'not protein-coding', inplace=True, regex=False)
ensembl_geneset['master_transcript_type'].replace('Mt_rRNA', 'not protein-coding', inplace=True, regex=False)
ensembl_geneset['master_transcript_type'].replace('Mt_tRNA', 'not protein-coding', inplace=True, regex=False)
ensembl_geneset['master_transcript_type'].replace('non_stop_decay', 'not protein-coding', inplace=True, regex=False)
ensembl_geneset['master_transcript_type'].replace('None', 'not protein-coding', inplace=True, regex=False)
ensembl_geneset['master_transcript_type'].replace('nonsense_mediated_decay', 'protein-coding', inplace=True, regex=False)
ensembl_geneset['master_transcript_type'].replace('polymorphic_pseudogene', 'protein-coding', inplace=True, regex=False)
ensembl_geneset['master_transcript_type'].replace('processed_pseudogene', 'protein-coding', inplace=True, regex=False)
ensembl_geneset['master_transcript_type'].replace('processed_transcript', 'protein-coding', inplace=True, regex=False)
ensembl_geneset['master_transcript_type'].replace('protein_coding', 'protein-coding', inplace=True, regex=False)
ensembl_geneset['master_transcript_type'].replace('pseudogene', 'protein-coding', inplace=True, regex=False)
ensembl_geneset['master_transcript_type'].replace('retained_intron', 'protein-coding', inplace=True, regex=False)
ensembl_geneset['master_transcript_type'].replace('ribozyme', 'protein-coding', inplace=True, regex=False)
ensembl_geneset['master_transcript_type'].replace('rRNA', 'not protein-coding', inplace=True, regex=False)
ensembl_geneset['master_transcript_type'].replace('rRNA_pseudogene', 'protein-coding', inplace=True, regex=False)
ensembl_geneset['master_transcript_type'].replace('scaRNA', 'not protein-coding', inplace=True, regex=False)
ensembl_geneset['master_transcript_type'].replace('scRNA', 'not protein-coding', inplace=True, regex=False)
ensembl_geneset['master_transcript_type'].replace('snoRNA', 'not protein-coding', inplace=True, regex=False)
ensembl_geneset['master_transcript_type'].replace('snRNA', 'not protein-coding', inplace=True, regex=False)
ensembl_geneset['master_transcript_type'].replace('sRNA', 'not protein-coding', inplace=True, regex=False)
ensembl_geneset['master_transcript_type'].replace('TEC', 'protein-coding', inplace=True, regex=False)
ensembl_geneset['master_transcript_type'].replace('TR_C_gene', 'protein-coding', inplace=True, regex=False)
ensembl_geneset['master_transcript_type'].replace('TR_D_gene', 'protein-coding', inplace=True, regex=False)
ensembl_geneset['master_transcript_type'].replace('TR_J_gene', 'protein-coding', inplace=True, regex=False)
ensembl_geneset['master_transcript_type'].replace('TR_J_pseudogene', 'protein-coding', inplace=True, regex=False)
ensembl_geneset['master_transcript_type'].replace('TR_V_gene', 'protein-coding', inplace=True, regex=False)
ensembl_geneset['master_transcript_type'].replace('TR_V_pseudogene', 'protein-coding', inplace=True, regex=False)
ensembl_geneset['master_transcript_type'].replace('transcribed_processed_pseudogene', 'not protein-coding', inplace=True, regex=False)
ensembl_geneset['master_transcript_type'].replace('transcribed_unitary_pseudogene', 'not protein-coding', inplace=True, regex=False)
ensembl_geneset['master_transcript_type'].replace('transcribed_unprocessed_pseudogene', 'not protein-coding', inplace=True, regex=False)
ensembl_geneset['master_transcript_type'].replace('translated_processed_pseudogene', 'protein-coding', inplace=True, regex=False)
ensembl_geneset['master_transcript_type'].replace('translated_unprocessed_pseudogene', 'protein-coding', inplace=True, regex=False)
ensembl_geneset['master_transcript_type'].replace('unitary_pseudogene', 'protein-coding', inplace=True, regex=False)
ensembl_geneset['master_transcript_type'].replace('unprocessed_pseudogene', 'protein-coding', inplace=True, regex=False)
ensembl_geneset['master_transcript_type'].replace('vaultRNA', 'not protein-coding', inplace=True, regex=False)
ensembl_geneset['master_transcript_type'].replace('unknown', 'not protein-coding', inplace=True, regex=False)

# remove unneeded columns
ensembl_geneset.drop(list(range(9)), axis=1, inplace=True)

# remove duplicates
ensembl_geneset.drop_duplicates(subset=None, keep='first', inplace=True)

# preview data
ensembl_geneset.head(n=3)

100%|██████████| 2905054/2905054 [03:02<00:00, 15887.25it/s]


Unnamed: 0,ensembl_gene_id,transcript_stable_id,symbol,ensembl_gene_type,transcript_name,ensembl_transcript_type,master_gene_type,master_transcript_type
0,ENSG00000223972,,DDX11L1,transcribed_unprocessed_pseudogene,,,pseudogene,not protein-coding
1,ENSG00000223972,ENST00000456328,DDX11L1,transcribed_unprocessed_pseudogene,DDX11L1-202,processed_transcript,pseudogene,protein-coding
5,ENSG00000223972,ENST00000450305,DDX11L1,transcribed_unprocessed_pseudogene,DDX11L1-201,transcribed_unprocessed_pseudogene,pseudogene,not protein-coding


<br>
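
To make the attribute extraction above easier to follow, here is a minimal standalone sketch of the same idea applied to a single attribute string. The function name `parse_gtf_attributes` is hypothetical and this is not the notebook's implementation; it assumes attributes are formatted as `key "value";` pairs separated by `; `, and it looks keys up by exact name rather than by substring.

```python
def parse_gtf_attributes(attribute_string, wanted_keys):
    """Extract selected keys from a GTF attribute string such as
    'gene_id "ENSG..."; gene_name "DDX11L1";'. Missing keys map to 'None'."""
    parsed = {}
    for field in attribute_string.strip().rstrip(';').split('; '):
        # each field looks like: key "value"
        key, _, value = field.partition(' ')
        # keep the first occurrence of a repeated key (e.g. 'tag')
        parsed.setdefault(key, value.strip('"'))
    return [parsed.get(key, 'None') for key in wanted_keys]


data_cols = ['gene_id', 'transcript_id', 'gene_name', 'gene_biotype',
             'transcript_name', 'transcript_biotype']

# attribute string in the style of a gene-level row, which has no transcript fields
row = ('gene_id "ENSG00000223972"; gene_version "5"; gene_name "DDX11L1"; '
       'gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene";')

parsed_row = parse_gtf_attributes(row, data_cols)
print(parsed_row)
```

Gene-level rows carry no transcript attributes, which is why the transcript columns of such rows end up as `None` in the preview above.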

***
**Ensembl Annotation Data**

_Ensembl-UniProt_ - `Homo_sapiens.GRCh38.99.uniprot.tsv`  
With the main Ensembl gene set loaded, the next step is to read in the `ensembl-uniprot` mapping file, which is needed to merge the Ensembl identifiers with the UniProt data set.

In [9]:
url_uniprot = 'ftp://ftp.ensembl.org/pub/release-99/tsv/homo_sapiens/Homo_sapiens.GRCh38.99.uniprot.tsv.gz'
data_downloader(url_uniprot, unprocessed_data_location)

Downloading Gzipped data from FTP Server: ftp://ftp.ensembl.org/pub/release-99/tsv/homo_sapiens/Homo_sapiens.GRCh38.99.uniprot.tsv.gz
Decompressing and Writing Gzipped Data to File


In [10]:
ensembl_uniprot = pandas.read_csv(unprocessed_data_location + 'Homo_sapiens.GRCh38.99.uniprot.tsv',
                                  header=0,
                                  delimiter='\t',
                                  low_memory=False)

# rename columns
ensembl_uniprot.rename(columns={'xref': 'uniprot_id', 'gene_stable_id': 'ensembl_gene_id'}, inplace=True)

# replace "-"
ensembl_uniprot.replace('-', 'None', inplace=True)

# replace NaN with 'None'
ensembl_uniprot.fillna('None', inplace=True)

# remove unneeded columns
ensembl_uniprot.drop(['db_name', 'info_type', 'source_identity', 'xref_identity', 'linkage_type'], axis=1, inplace=True)

# remove duplicates
ensembl_uniprot.drop_duplicates(subset=None, keep='first', inplace=True)

<br>

_Ensembl-Entrez_ - `Homo_sapiens.GRCh38.99.entrez.tsv`  
With the `ensembl-uniprot` mapping processed, the next step is to read in the `ensembl-entrez` mapping file, which is needed to merge the Ensembl identifiers with the Entrez data set.

In [11]:
url_entrez = 'ftp://ftp.ensembl.org/pub/release-99/tsv/homo_sapiens/Homo_sapiens.GRCh38.99.entrez.tsv.gz'
data_downloader(url_entrez, unprocessed_data_location)

Downloading Gzipped data from FTP Server: ftp://ftp.ensembl.org/pub/release-99/tsv/homo_sapiens/Homo_sapiens.GRCh38.99.entrez.tsv.gz
Decompressing and Writing Gzipped Data to File


In [12]:
# read in ensembl-entrez data
ensembl_entrez = pandas.read_csv(unprocessed_data_location + 'Homo_sapiens.GRCh38.99.entrez.tsv',
                                 header=0,
                                 delimiter='\t',
                                 low_memory=False)

# rename columns
ensembl_entrez.rename(columns={'xref': 'entrez_id', 'gene_stable_id': 'ensembl_gene_id'}, inplace=True)

# keep only rows whose db_name is "EntrezGene" (copy to avoid modifying a view of the original frame)
ensembl_entrez = ensembl_entrez.loc[ensembl_entrez['db_name'] == 'EntrezGene'].copy()

# replace "-"
ensembl_entrez.replace('-', 'None', inplace=True)

# replace NaN with 'None'
ensembl_entrez.fillna('None', inplace=True)

# remove unneeded columns
ensembl_entrez.drop(['db_name', 'info_type', 'source_identity', 'xref_identity', 'linkage_type'], axis=1, inplace=True)

# remove duplicates
ensembl_entrez.drop_duplicates(subset=None, keep='first', inplace=True)

<br>

_Merge Annotation Data_ - `ensembl_uniprot` + `ensembl_entrez`

In [13]:
ensembl_annot = pandas.merge(ensembl_uniprot,
                             ensembl_entrez,
                             left_on=['ensembl_gene_id', 'transcript_stable_id', 'protein_stable_id'],
                             right_on=['ensembl_gene_id', 'transcript_stable_id', 'protein_stable_id'],
                             how='outer')

# replace NaN with 'None'
ensembl_annot.fillna('None', inplace=True)

# preview data
ensembl_annot.head(n=3)

Unnamed: 0,ensembl_gene_id,transcript_stable_id,protein_stable_id,uniprot_id,entrez_id
0,ENSG00000186092,ENST00000641515,ENSP00000493376,A0A2U3U0J3,79501
1,ENSG00000186092,ENST00000335137,ENSP00000334393,Q8NH21,79501
2,ENSG00000284733,ENST00000426406,ENSP00000409316,Q6IEY1,729759


<br>
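
The outer merge above keeps every row from both inputs and leaves gaps where one side has no match. As a small sketch on toy data (all identifiers below are made up, and pandas is assumed to be installed as in the rest of the notebook), `on=` can be used as shorthand when the left and right key lists are identical, and `indicator=True` records where each row came from.

```python
import pandas

# toy frames with made-up identifiers, standing in for ensembl_uniprot / ensembl_entrez
left = pandas.DataFrame({'ensembl_gene_id': ['G1', 'G2'],
                         'transcript_stable_id': ['T1', 'T2'],
                         'protein_stable_id': ['P1', 'P2'],
                         'uniprot_id': ['U1', 'U2']})
right = pandas.DataFrame({'ensembl_gene_id': ['G1', 'G3'],
                          'transcript_stable_id': ['T1', 'T3'],
                          'protein_stable_id': ['P1', 'P3'],
                          'entrez_id': ['E1', 'E3']})

keys = ['ensembl_gene_id', 'transcript_stable_id', 'protein_stable_id']

# `on=keys` is shorthand for identical left_on/right_on lists;
# `indicator=True` adds a `_merge` column recording each row's origin
merged = pandas.merge(left, right, on=keys, how='outer', indicator=True)
merged['_merge'] = merged['_merge'].astype(str)

# unmatched cells are NaN after an outer merge; recode them as in the notebook
merged = merged.fillna('None')

print(merged[['ensembl_gene_id', 'uniprot_id', 'entrez_id', '_merge']])
```

Counting the values in `_merge` is a quick way to see how many rows matched on both sides versus only one.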

_Merge Ensembl Annotation and Gene Set Data_ - `ensembl_geneset` + `ensembl_annot`

In [14]:
ensembl = pandas.merge(ensembl_geneset,
                       ensembl_annot,
                       left_on = ['ensembl_gene_id', 'transcript_stable_id'],
                       right_on = ['ensembl_gene_id', 'transcript_stable_id'],
                       how='outer')

# replace NaN with 'None'
ensembl.fillna('None', inplace=True)
ensembl.replace('NA','None', inplace=True, regex=False)

# preview data
ensembl.head(n=3)

Unnamed: 0,ensembl_gene_id,transcript_stable_id,symbol,ensembl_gene_type,transcript_name,ensembl_transcript_type,master_gene_type,master_transcript_type,protein_stable_id,uniprot_id,entrez_id
0,ENSG00000223972,,DDX11L1,transcribed_unprocessed_pseudogene,,,pseudogene,not protein-coding,,,
1,ENSG00000223972,ENST00000456328,DDX11L1,transcribed_unprocessed_pseudogene,DDX11L1-202,processed_transcript,pseudogene,protein-coding,,,
2,ENSG00000223972,ENST00000450305,DDX11L1,transcribed_unprocessed_pseudogene,DDX11L1-201,transcribed_unprocessed_pseudogene,pseudogene,not protein-coding,,,


_Save Cleaned Ensembl Data_  
Save the cleaned Ensembl data so that it can be used when generating node metadata for transcript identifiers.

In [15]:
ensembl.to_csv(processed_data_location + 'ensembl_identifier_data_cleaned.txt',
               header=True,
               sep='\t',
               index=False)

<br>

***
**UniProt Data**   
_Human Gene Set Data_ - `uniprot_identifier_mapping.tab`

In [16]:
url = 'https://www.uniprot.org/uniprot/?query=&fil=organism%3A%22Homo%20sapiens%20(Human)%20%5B9606%5D%22&columns=id%2Cdatabase(GeneID)%2Cdatabase(Ensembl)%2Cdatabase(HGNC)%2Cgenes(PREFERRED)%2Cgenes(ALTERNATIVE)&format=tab'
data_downloader(url, unprocessed_data_location, 'uniprot_identifier_mapping.tab')

Downloading Data from https://www.uniprot.org/uniprot/?query=&fil=organism%3A%22Homo%20sapiens%20(Human)%20%5B9606%5D%22&columns=id%2Cdatabase(GeneID)%2Cdatabase(Ensembl)%2Cdatabase(HGNC)%2Cgenes(PREFERRED)%2Cgenes(ALTERNATIVE)&format=tab


In [17]:
uniprot = pandas.read_csv(unprocessed_data_location + 'uniprot_identifier_mapping.tab',
                          header=0,
                          delimiter='\t')

_Preprocess Data_  
The data file needs light cleaning before it can be merged with the other data: renaming columns, replacing `NaN` with `None`, and unnesting `;`- and space-delimited data.

In [18]:
# replace NaN with 'None'
uniprot.fillna('None', inplace=True)

# rename columns
uniprot.rename(columns={'Entry': 'uniprot_id',
                        'Cross-reference (GeneID)': 'entrez_id',
                        'Ensembl transcript': 'transcript_stable_id',
                        'Cross-reference (HGNC)': 'hgnc_id',
                        'Gene names  (synonym )': 'synonyms',
                        'Gene names  (primary )' :'symbol'}, inplace=True)

# update space-delimited synonyms to a '|'
uniprot['synonyms'] = uniprot['synonyms'].apply(lambda x: '|'.join(x.split()) if x.isupper() else x)

# explode nested data
explode_df_uniprot = explodes_data(uniprot.copy(), ['transcript_stable_id', 'entrez_id', 'hgnc_id'], ';')
explode_df_uniprot = explodes_data(explode_df_uniprot.copy(), ['symbol'], '|')


# strip any text that follows the transcript identifier (everything from the first whitespace on)
explode_df_uniprot['transcript_stable_id'].replace(r'\s.*', '', inplace=True, regex=True)

# remove duplicates
explode_df_uniprot.drop_duplicates(subset=None, keep='first', inplace=True)

# preview data
explode_df_uniprot.head(n=3)

Unnamed: 0,uniprot_id,entrez_id,transcript_stable_id,hgnc_id,symbol,synonyms
0,Q8NF67,,,43603,ANKRD20A12P,
1,Q9NPB9,51554.0,ENST00000249887,1611,ACKR4,CCBP2|CCR11|CCRL1|VSHK1
2,P31937,11112.0,ENST00000265395,4907,HIBADH,


<br>
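
The "unnesting" step above turns one row with delimited values into several rows with one value each. The sketch below illustrates that idea in plain Python. It is not the implementation of the `explodes_data` helper, whose internals are not shown here; the function name `explode_rows` is hypothetical, and it assumes that exploding several columns yields every combination of their values.

```python
from itertools import product


def explode_rows(rows, columns, delimiter):
    """Return one row per combination of the delimited values in `columns`.

    `rows` is a list of dicts; all other columns are copied unchanged.
    """
    exploded = []
    for row in rows:
        # split each target column, dropping empty pieces left by trailing delimiters
        pieces = [[v.strip() for v in row[col].split(delimiter) if v.strip()] or ['None']
                  for col in columns]
        for combo in product(*pieces):
            new_row = dict(row)
            new_row.update(zip(columns, combo))
            exploded.append(new_row)
    return exploded


# toy record with made-up identifiers, in the style of the UniProt download
records = [{'uniprot_id': 'U1', 'entrez_id': '10;20;', 'hgnc_id': '7'}]

exploded_records = explode_rows(records, ['entrez_id', 'hgnc_id'], ';')
for record in exploded_records:
    print(record)
```

Because each exploded row repeats the values of the untouched columns, dropping duplicates afterwards (as the notebook does) keeps the result compact.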

***
**NCBI Data**   
_Human Gene Set Data_ - `Homo_sapiens.gene_info`

In [19]:
url = 'ftp://ftp.ncbi.nih.gov/gene/DATA/GENE_INFO/Mammalia/Homo_sapiens.gene_info.gz'
data_downloader(url, unprocessed_data_location)

Downloading Gzipped data from FTP Server: ftp://ftp.ncbi.nih.gov/gene/DATA/GENE_INFO/Mammalia/Homo_sapiens.gene_info.gz
Decompressing and Writing Gzipped Data to File


In [20]:
ncbi_gene = pandas.read_csv(unprocessed_data_location + 'Homo_sapiens.gene_info',
                            header=0,
                            delimiter='\t')



_Preprocess Data_  
The data file needs light cleaning before it can be merged with the other data: renaming columns, replacing `-` with `None`, converting the merge column `entrez_id` to type `str`, and unnesting `|` delimited data. Then, the gene type variable is cleaned so that its values are re-grouped into broader master categories such as `protein-coding`, `ncRNA`, `pseudogene`, and `other`.

In [21]:
# remove all rows that are not human (copy to avoid modifying a view of the original frame)
ncbi_gene = ncbi_gene.loc[ncbi_gene['#tax_id'] == 9606].copy()

# replace "-" with "None"
ncbi_gene.replace('-','None', inplace=True)

# rename columns before merging
ncbi_gene.rename(columns={'GeneID': 'entrez_id', 'Symbol': 'symbol', 'Synonyms': 'synonyms'}, inplace=True)

# combine symbol columns into single column
ncbi_gene['symbol'] = ncbi_gene['Symbol_from_nomenclature_authority'] + '|' + ncbi_gene['symbol']
ncbi_gene['name'] = ncbi_gene['Full_name_from_nomenclature_authority'] + '|' + ncbi_gene['description']

# explode nested data
explode_df_ncbi_gene = explodes_data(ncbi_gene.copy(), ['symbol', 'name'], '|')

# make sure that merge columns are of same type
explode_df_ncbi_gene['entrez_id'] = explode_df_ncbi_gene['entrez_id'].astype(str)

# reformat entrez gene type
explode_df_ncbi_gene['entrez_gene_type'] = explode_df_ncbi_gene['type_of_gene']
explode_df_ncbi_gene['entrez_gene_type'].replace('None', 'unknown', inplace=True, regex=False)
explode_df_ncbi_gene['entrez_gene_type'].replace('pseudo', 'pseudogene', inplace=True, regex=False)
explode_df_ncbi_gene['entrez_gene_type'].replace('biological-region', 'biological region', inplace=True, regex=False)

# reformat master gene type
explode_df_ncbi_gene['master_gene_type'] = explode_df_ncbi_gene['entrez_gene_type']
explode_df_ncbi_gene['master_gene_type'].replace('biological region', 'other', inplace=True, regex=False)
explode_df_ncbi_gene['master_gene_type'].replace('scRNA', 'ncRNA', inplace=True, regex=False)
explode_df_ncbi_gene['master_gene_type'].replace('snoRNA', 'ncRNA', inplace=True, regex=False)
explode_df_ncbi_gene['master_gene_type'].replace('snRNA', 'ncRNA', inplace=True, regex=False)

# remove unneeded columns
explode_df_ncbi_gene.drop(['type_of_gene', 'dbXrefs', 'description', 'Nomenclature_status', 'Modification_date',
                           'LocusTag', '#tax_id', 'Full_name_from_nomenclature_authority', 'Feature_type',
                           'Symbol_from_nomenclature_authority'], axis=1, inplace=True)

# remove duplicates
explode_df_ncbi_gene.drop_duplicates(subset=None, keep='first', inplace=True)

# preview data
explode_df_ncbi_gene.head(n=3)

Unnamed: 0,entrez_id,symbol,synonyms,chromosome,map_location,Other_designations,name,entrez_gene_type,master_gene_type
0,1,A1BG,A1B|ABG|GAB|HYST2477,19,19q13.43,alpha-1B-glycoprotein|HEL-S-163pA|epididymis s...,alpha-1-B glycoprotein,protein-coding,protein-coding
4,2,A2M,A2MD|CPAMD5|FWP007|S863-7,12,12p13.31,alpha-2-macroglobulin|C3 and PZP-like alpha-2-...,alpha-2-macroglobulin,protein-coding,protein-coding
8,3,A2MP1,A2MP,12,12p13.31,pregnancy-zone protein pseudogene,alpha-2-macroglobulin pseudogene 1,pseudogene,pseudogene


<br>

***
**Protein Ontology Identifier Mapping Data**   
_Protein Ontology Identifier Data_ - `promapping.txt`

In [22]:
url = 'https://proconsortium.org/download/current/promapping.txt'
data_downloader(url, unprocessed_data_location)

Downloading Data from https://proconsortium.org/download/current/promapping.txt


In [23]:
pro_mapping = pandas.read_csv(unprocessed_data_location + 'promapping.txt',
                              header=None,
                              names=['pro_id', 'Entry', 'pro_mapping'],
                              delimiter='\t')

_Preprocess Data_  
Basic filtering is performed to keep only `Protein Ontology` mappings to `UniProt` identifiers, and the accession values are reformatted to remove the `UniProtKB:` prefix.

In [24]:
# remove rows without 'UniProtKB'
pro_mapping = pro_mapping.loc[pro_mapping['Entry'].apply(lambda x: x.startswith('UniProtKB:'))] 

# replace PR: with PR_
pro_mapping['pro_id'].replace('PR:','PR_', inplace=True, regex=True)

# remove identifier type, which appears before ':'
pro_mapping['Entry'].replace(r'(^\w*:)', '', inplace=True, regex=True)

# rename columns before merging
pro_mapping.rename(columns={'Entry': 'uniprot_id'}, inplace=True)

# remove unneeded columns
pro_mapping.drop(['pro_mapping'], axis=1, inplace=True)

# remove duplicates
pro_mapping.drop_duplicates(subset=None, keep='first', inplace=True)

# preview data
pro_mapping.head(n=3)

Unnamed: 0,pro_id,uniprot_id
6,PR_000000005,P37173
7,PR_000000005,P38438
8,PR_000000005,Q62312


<br>
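
The filtering and reformatting applied to `promapping.txt` can be illustrated on individual rows. This is a minimal sketch rather than the notebook's code: the function name `clean_pro_row` is hypothetical, the first input row is reconstructed from the preview above, and the second is a made-up non-UniProt row.

```python
import re


def clean_pro_row(pro_id, entry):
    """Return (pro_id, uniprot_id) for UniProtKB rows, or None for other sources."""
    if not entry.startswith('UniProtKB:'):
        return None
    # 'PR:000000005' -> 'PR_000000005'; 'UniProtKB:P37173' -> 'P37173'
    return pro_id.replace('PR:', 'PR_'), re.sub(r'^\w*:', '', entry)


# first row follows the preview above; the second is a made-up non-UniProt row
rows = [('PR:000000005', 'UniProtKB:P37173'),
        ('PR:000000005', 'EXAMPLE_DB:12345')]

cleaned = [c for c in (clean_pro_row(*r) for r in rows) if c is not None]
print(cleaned)
```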

***

**Merge Processed Data:** `hgnc` + `ensembl`

In [25]:
# merge ensembl and hgnc data
ensembl_hgnc_merged_data = pandas.merge(ensembl, explode_df_hgnc,
                                        left_on=['ensembl_gene_id', 'entrez_id', 'uniprot_id',
                                                 'master_gene_type', 'symbol'],
                                        right_on=['ensembl_gene_id', 'entrez_id', 'uniprot_id',
                                                  'master_gene_type', 'symbol'],
                                        how='outer')

# replace NaN with 'None'
ensembl_hgnc_merged_data.fillna('None', inplace=True)

# remove duplicates
ensembl_hgnc_merged_data.drop_duplicates(subset=None, keep='first', inplace=True)

# preview data
ensembl_hgnc_merged_data.head(n=3)

Unnamed: 0,ensembl_gene_id,transcript_stable_id,symbol,ensembl_gene_type,transcript_name,ensembl_transcript_type,master_gene_type,master_transcript_type,protein_stable_id,uniprot_id,entrez_id,hgnc_id,hgnc_gene_type,name,map_location,synonyms
0,ENSG00000223972,,DDX11L1,transcribed_unprocessed_pseudogene,,,pseudogene,not protein-coding,,,,,,,,
1,ENSG00000223972,ENST00000456328,DDX11L1,transcribed_unprocessed_pseudogene,DDX11L1-202,processed_transcript,pseudogene,protein-coding,,,,,,,,
2,ENSG00000223972,ENST00000450305,DDX11L1,transcribed_unprocessed_pseudogene,DDX11L1-201,transcribed_unprocessed_pseudogene,pseudogene,not protein-coding,,,,,,,,


<br>

***
**Merge Processed Data:** `ensembl_hgnc_merged_data` + `explode_df_uniprot`

In [26]:
# merge ensembl-hgnc and uniprot data
ensembl_hgnc_uniprot_merged_data = pandas.merge(ensembl_hgnc_merged_data, explode_df_uniprot,
                                                left_on=['entrez_id', 'hgnc_id', 'uniprot_id', 'transcript_stable_id', 'symbol','synonyms'],
                                                right_on=['entrez_id', 'hgnc_id', 'uniprot_id', 'transcript_stable_id', 'symbol','synonyms'],
                                                how='outer')

# replace NaN with 'None'
ensembl_hgnc_uniprot_merged_data.fillna('None', inplace=True)

# remove duplicates
ensembl_hgnc_uniprot_merged_data.drop_duplicates(subset=None, keep='first', inplace=True)

# preview data
ensembl_hgnc_uniprot_merged_data.head(n=3)

Unnamed: 0,ensembl_gene_id,transcript_stable_id,symbol,ensembl_gene_type,transcript_name,ensembl_transcript_type,master_gene_type,master_transcript_type,protein_stable_id,uniprot_id,entrez_id,hgnc_id,hgnc_gene_type,name,map_location,synonyms
0,ENSG00000223972,,DDX11L1,transcribed_unprocessed_pseudogene,,,pseudogene,not protein-coding,,,,,,,,
1,ENSG00000223972,ENST00000456328,DDX11L1,transcribed_unprocessed_pseudogene,DDX11L1-202,processed_transcript,pseudogene,protein-coding,,,,,,,,
2,ENSG00000223972,ENST00000450305,DDX11L1,transcribed_unprocessed_pseudogene,DDX11L1-201,transcribed_unprocessed_pseudogene,pseudogene,not protein-coding,,,,,,,,


<br>

***
**Merge Processed Data:** `ensembl_hgnc_uniprot_merged_data` + `Homo_sapiens.gene_info`

In [27]:
ensembl_hgnc_uniprot_ncbi_merged_data = pandas.merge(ensembl_hgnc_uniprot_merged_data, explode_df_ncbi_gene,
                                                     left_on=['entrez_id', 'master_gene_type', 'symbol', 'synonyms', 'name', 'map_location'],
                                                     right_on=['entrez_id', 'master_gene_type', 'symbol', 'synonyms', 'name', 'map_location'],
                                                     how='outer')

# replace NaN with 'None'
ensembl_hgnc_uniprot_ncbi_merged_data.fillna('None', inplace=True)

# remove duplicates
ensembl_hgnc_uniprot_ncbi_merged_data.drop_duplicates(subset=None, keep='first', inplace=True)

# preview data
ensembl_hgnc_uniprot_ncbi_merged_data.head(n=3)

Unnamed: 0,ensembl_gene_id,transcript_stable_id,symbol,ensembl_gene_type,transcript_name,ensembl_transcript_type,master_gene_type,master_transcript_type,protein_stable_id,uniprot_id,entrez_id,hgnc_id,hgnc_gene_type,name,map_location,synonyms,chromosome,Other_designations,entrez_gene_type
0,ENSG00000223972,,DDX11L1,transcribed_unprocessed_pseudogene,,,pseudogene,not protein-coding,,,,,,,,,,,
1,ENSG00000223972,ENST00000456328,DDX11L1,transcribed_unprocessed_pseudogene,DDX11L1-202,processed_transcript,pseudogene,protein-coding,,,,,,,,,,,
2,ENSG00000223972,ENST00000450305,DDX11L1,transcribed_unprocessed_pseudogene,DDX11L1-201,transcribed_unprocessed_pseudogene,pseudogene,not protein-coding,,,,,,,,,,,


<br>

***
**Merge Processed Data:** `ensembl_hgnc_uniprot_ncbi_merged_data` + `promapping.txt`  

In [28]:
merged_data = pandas.merge(ensembl_hgnc_uniprot_ncbi_merged_data,
                           pro_mapping,
                           left_on='uniprot_id',
                           right_on='uniprot_id',
                           how='outer')

# replace NaN with 'None'
merged_data.fillna('None', inplace=True)

# remove duplicates
merged_data.drop_duplicates(subset=None, keep='first', inplace=True)

# preview data
merged_data.head(n=3)

Unnamed: 0,ensembl_gene_id,transcript_stable_id,symbol,ensembl_gene_type,transcript_name,ensembl_transcript_type,master_gene_type,master_transcript_type,protein_stable_id,uniprot_id,entrez_id,hgnc_id,hgnc_gene_type,name,map_location,synonyms,chromosome,Other_designations,entrez_gene_type,pro_id
0,ENSG00000223972,,DDX11L1,transcribed_unprocessed_pseudogene,,,pseudogene,not protein-coding,,,,,,,,,,,,
1,ENSG00000223972,ENST00000456328,DDX11L1,transcribed_unprocessed_pseudogene,DDX11L1-202,processed_transcript,pseudogene,protein-coding,,,,,,,,,,,,
2,ENSG00000223972,ENST00000450305,DDX11L1,transcribed_unprocessed_pseudogene,DDX11L1-201,transcribed_unprocessed_pseudogene,pseudogene,not protein-coding,,,,,,,,,,,,


<br>

_Fix Symbol Formatting_  
Some gene symbols resemble dates (e.g. `DEC1`) and can be erroneously re-formatted as date values on input (e.g. `1-DEC`). For the data to merge successfully with other data sources, all date-formatted symbols must be converted back to their symbol form.
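
To make the rule concrete, here is a small standalone sketch, separate from the notebook's own cell. It is a stricter variant of the length-based check, not the notebook's exact logic: it assumes that only values of the form `<day>-<month abbreviation>` should be rewritten, and the function name `fix_date_symbol` and the `MONTHS` set are introduced here purely for illustration.

```python
MONTHS = {'JAN', 'FEB', 'MAR', 'APR', 'MAY', 'JUN',
          'JUL', 'AUG', 'SEP', 'OCT', 'NOV', 'DEC'}


def fix_date_symbol(symbol):
    """Turn a date-like value such as '1-DEC' back into 'DEC1'; leave others alone."""
    day, sep, month = symbol.partition('-')
    if sep and day.isdigit() and len(day) < 3 and month.upper() in MONTHS:
        return month.upper() + day
    return symbol


fixed_symbols = [fix_date_symbol(s) for s in ['1-DEC', 'A1BG-AS1', 'DEC1']]
print(fixed_symbols)
```

Requiring a numeric day and a known month abbreviation is one way to avoid rewriting ordinary hyphenated symbols whose second part happens to be three characters long.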

In [29]:
clean_dates = []

for x in tqdm(list(merged_data['symbol'])):
    if '-' in x and len(x.split('-')[0]) < 3 and len(x.split('-')[1]) == 3:
        clean_dates.append(x.split('-')[1].upper() + x.split('-')[0])
    else:
        clean_dates.append(x)

# add cleaned date var back to data set
merged_data['symbol'] = clean_dates

# remove duplicates
merged_data.drop_duplicates(subset=None, keep='first', inplace=True)

# make sure that all gene and transcript type columns have 'None' recoded to 'unknown' or 'not protein-coding'
merged_data['hgnc_gene_type'].replace('None', 'unknown', inplace=True, regex=False)
merged_data['ensembl_gene_type'].replace('None', 'unknown', inplace=True, regex=False)
merged_data['entrez_gene_type'].replace('None', 'unknown', inplace=True, regex=False)
merged_data['master_gene_type'].replace('None', 'unknown', inplace=True, regex=False)
merged_data['master_transcript_type'].replace('None', 'not protein-coding', inplace=True, regex=False)
merged_data['ensembl_transcript_type'].replace('None', 'unknown', inplace=True, regex=False)

# remove duplicates
merged_data_clean = merged_data.drop_duplicates(subset=None, keep='first')

# write data
merged_data_clean.to_csv(processed_data_location + 'Merged_Human_Ensembl_Entrez_HGNC_Uniprot_Identifiers.txt',
                         header=True,
                         sep='\t',
                         index=False)
    
# preview data
merged_data_clean.head(n=3)

100%|██████████| 1083723/1083723 [00:00<00:00, 1619827.27it/s]


Unnamed: 0,ensembl_gene_id,transcript_stable_id,symbol,ensembl_gene_type,transcript_name,ensembl_transcript_type,master_gene_type,master_transcript_type,protein_stable_id,uniprot_id,entrez_id,hgnc_id,hgnc_gene_type,name,map_location,synonyms,chromosome,Other_designations,entrez_gene_type,pro_id
0,ENSG00000223972,,DDX11L1,transcribed_unprocessed_pseudogene,,unknown,pseudogene,not protein-coding,,,,,unknown,,,,,,unknown,
1,ENSG00000223972,ENST00000456328,DDX11L1,transcribed_unprocessed_pseudogene,DDX11L1-202,processed_transcript,pseudogene,protein-coding,,,,,unknown,,,,,,unknown,
2,ENSG00000223972,ENST00000450305,DDX11L1,transcribed_unprocessed_pseudogene,DDX11L1-201,transcribed_unprocessed_pseudogene,pseudogene,not protein-coding,,,,,unknown,,,,,,unknown,
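
The date-repair rule applied in the cell above can be isolated into a small function, which makes it easier to test on individual symbols. This is a minimal sketch; the function name `repair_date_symbol` is hypothetical and not part of the helper script.

```python
def repair_date_symbol(symbol):
    """Revert a date-like string such as '1-DEC' to the gene symbol 'DEC1'."""
    parts = symbol.split('-')
    # same rule as the loop above: a prefix shorter than three characters
    # followed by a three-character suffix is treated as a mangled date
    if len(parts) > 1 and len(parts[0]) < 3 and len(parts[1]) == 3:
        return parts[1].upper() + parts[0]
    return symbol


for example in ['1-DEC', '1-Sep', 'DDX11L1', 'HLA-DRB1']:
    print(example, '->', repair_date_symbol(example))
```

The rule is purely length-based, so any hyphenated symbol with a one- or two-character prefix and a three-character second part would also be rewritten.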


***
**Create a Master Mapping Dictionary**  
Although the above steps result in a `pandas.DataFrame` of the merged identifiers, more work is needed to obtain a complete mapping between the identifiers. For example, if you were to search for Entrez identifier `entrez_259234` you would find the following mappings: `entrez_259234-ENSG00000233316`, `entrez_259234-DSCR10`. If you only had `ENSG00000233316`, you would be unable to obtain the gene symbol from the current data without first mapping to the Entrez gene identifier. 

To solve this problem, we build a master dictionary where the keys are `ensembl_gene_id`, `transcript_stable_id`, `symbol`, `protein_stable_id`, `uniprot_id`, `entrez_id`, `hgnc_id`, and `pro_id` identifiers and the values are the lists of identifiers that map to each key. It's important to note that there are several labeling identifiers (i.e. `name`, `chromosome`, `map_location`, `Other_designations`, `synonyms`, `transcript_name`, `*_gene_types`, and `transcript_type_update`), which will only be mapped when clustered against one of the primary identifier types (i.e. the keys described above).
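
To make the structure of this dictionary concrete, here is a minimal sketch on a two-row toy frame. It follows the same idea (group by a key column, collect prefixed values from every other column, never use metadata columns as keys), but the frame, the `name` value, and the variable names `toy`, `metadata`, and `toy_dict` are illustrative assumptions rather than the notebook's data.

```python
import itertools
import pandas

# toy frame built around the identifiers used in the example above
toy = pandas.DataFrame({
    'entrez_id': ['259234', '259234'],
    'ensembl_gene_id': ['ENSG00000233316', 'ENSG00000233316'],
    'symbol': ['DSCR10', 'DSCR10'],
    'name': ['None', 'example name'],  # placeholder label (metadata column)
})

columns = ['entrez_id', 'ensembl_gene_id', 'symbol', 'name']
metadata = {'name'}  # metadata columns are mapped to, but never used as keys

toy_dict = {}
for key_col, value_col in itertools.permutations(columns, 2):
    if key_col in metadata:
        continue
    for key, group in toy.groupby(key_col):
        # prefix each value with its column name and skip 'None' placeholders
        values = [value_col + '_' + v for v in sorted(set(group[value_col])) if v != 'None']
        toy_dict.setdefault(key_col + '_' + key, []).extend(values)

print(toy_dict['ensembl_gene_id_ENSG00000233316'])
```

With this structure, the Ensembl gene identifier resolves directly to the gene symbol without going through the Entrez identifier first.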

_Note_. The next chunk takes approximately 45-90 minutes to run.

In [35]:
# get all permutations of identifiers (e.g. ['entrez_id', 'ensembl_gene_id'])
identifiers = ['ensembl_gene_id', 'transcript_stable_id', 'protein_stable_id', 'uniprot_id', 'entrez_id',
               'hgnc_id', 'pro_id', 'symbol', 'synonyms', 'ensembl_gene_type', 'transcript_name',
               'ensembl_transcript_type', 'master_gene_type', 'master_transcript_type', 'hgnc_gene_type',
               'name', 'map_location', 'chromosome', 'Other_designations', 'entrez_gene_type']

stop_point = 8

# get list of data types that ignores subjects of a permutation pair that are metadata
identifier_list = [x for x in list(itertools.permutations(identifiers, 2)) if x[0] not in identifiers[stop_point:]]

# create master dictionary of all identifiers
master_dict = {}

for ids in tqdm(identifier_list):
    maps = {k: [ids[1] + '_' + x for x in set(g[ids[1]].tolist()) if x != 'None'] for k,g in merged_data.groupby(ids[0])}

    for key in tqdm(maps.keys()):
        if ids[0] + '_' + key in master_dict.keys():
            master_dict[ids[0] + '_' + key] += maps[key]
        else:
            master_dict[ids[0] + '_' + key] = maps[key]


  0%|          | 0/152 [00:00<?, ?it/s]
  1%|          | 1/152 [00:15<38:05, 15.14s/it]
 22%|██▏       | 33/152 [17:50<1:51:09, 56.05s/it]
 41%|████      | 62/152 [38:27<1:55:33, 77.04s/it]
 55%|█████▌    | 84/152 [1:00:01<21:56, 19.36s/it]
 82%|████████▏ | 125/152 [1:16:04<22:44, 50.52s/it]
[nested per-key progress bars omitted; remaining output truncated]

_Condensing Master Mapping Dictionary_  
Once we have pairwise mapping data between all of the identifiers, the next step is to condense the information for each identifier into a format that can easily be used downstream.
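
One part of this condensing step is collapsing several mapped gene or transcript types into a single label, where `protein-coding` takes precedence over everything else. The sketch below isolates that rule; the function name `resolve_type` and the toy set `linked` are hypothetical and only mirror the prefix convention used in this notebook.

```python
def resolve_type(types):
    """Return 'protein-coding' if any mapped type is protein-coding, else 'not protein-coding'."""
    return 'protein-coding' if 'protein-coding' in set(types) else 'not protein-coding'


# toy set of prefixed values linked to one identifier (illustrative only)
linked = {'master_gene_type_pseudogene', 'master_gene_type_protein-coding', 'symbol_GENE_A'}

# strip the column prefix, keeping only the type label after the last underscore
gene_types = [x.split('_')[-1] for x in linked if x.startswith('master_gene_type')]

print(resolve_type(gene_types))
print(resolve_type([]))
```

An identifier with no mapped types falls through to `not protein-coding`, which matches the default branch of the loop in the cell that follows.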

In [36]:
reformatted_mapped_identifiers = {}

# set globals
gene_type_var = 'master_gene_type'
transcript_type_var = 'master_transcript_type'

for ident in tqdm(master_dict.keys()):
    identifier_info = set()
    
    for x_id in master_dict[ident]:
        # get all identifying information for all linked identifiers
        if not any(x for x in identifiers[stop_point:] if x_id.startswith(x)) and x_id in master_dict.keys():
            identifier_info |= set(master_dict[x_id])
        else:
            continue

    # clean gene and transcript types mapped to multiple types
    gene_types = [x.split('_')[-1] for x in identifier_info if x.startswith(gene_type_var)]
    trans_types = [x.split('_')[-1] for x in identifier_info if x.startswith(transcript_type_var)]
    type_updates = []

    for types in [gene_types if len(gene_types) > 0 else ['None'], trans_types if len(trans_types) > 0 else ['None']]:
        if 'protein-coding' in set(types):
            type_updates.append('protein-coding')
        else:
            type_updates.append('not protein-coding')

    # update identifier set information
    identifier_info = [x for x in identifier_info if not x.startswith(gene_type_var) and not x.startswith(transcript_type_var)]
    identifier_info += ['gene_type_update_' + type_updates[0], 'transcript_type_update_' + type_updates[1]]
    reformatted_mapped_identifiers[ident] = identifier_info

# save a copy of the dictionary
pickle.dump(reformatted_mapped_identifiers,
            open(processed_data_location + 'Merged_gene_rna_protein_identifiers.pkl', 'wb'), protocol=4)


100%|██████████| 1160128/1160128 [08:38<00:00, 2237.50it/s] 


In [None]:
# reformatted_mapped_identifiers = pickle.load(open(processed_data_location + 'Merged_gene_rna_protein_identifiers.pkl', 'rb'),
#                                              encoding='bytes')


<br>

***

### Ensembl Gene-Entrez Gene <a class="anchor" id="ensemblgene-entrezgene"></a>


**Purpose:** To map Ensembl gene identifiers to Entrez gene identifiers when creating the following edges:   
- gene-gene

**Output:** [`ENSEMBL_GENE_ENTREZ_GENE_MAP.txt`](https://www.dropbox.com/s/ggnue4s5psvywn9/ENSEMBL_GENE_ENTREZ_GENE_MAP.txt?dl=1)

In [41]:
genomic_id_mapper(reformatted_mapped_identifiers,
                  processed_data_location + 'ENSEMBL_GENE_ENTREZ_GENE_MAP.txt',
                  'ensembl_gene_id', 'entrez_id',
                  'gene_type_update', False, False)

100%|██████████| 1160128/1160128 [01:44<00:00, 11098.09it/s] 


_Preview Processed Data_

In [42]:
egeg_data = pandas.read_csv(processed_data_location + 'ENSEMBL_GENE_ENTREZ_GENE_MAP.txt',
                            header=None,
                            names=['Ensembl_Gene_IDs', 'Entrez_Gene_IDs', 'Gene_Type'],
                            delimiter='\t')

print('There are {edge_count} ensembl gene-entrez gene edges'.format(edge_count=len(egeg_data.drop_duplicates())))

There are 43803 ensembl gene-entrez gene edges


In [43]:
egeg_data.head(n=5)

Unnamed: 0,Ensembl_Gene_IDs,Entrez_Gene_IDs,Gene_Type
0,ENSG00000000003,7105,protein-coding
1,ENSG00000000005,64102,protein-coding
2,ENSG00000000419,8813,protein-coding
3,ENSG00000000457,57147,protein-coding
4,ENSG00000000460,55732,protein-coding
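
For downstream lookups, a map file like this one can be turned into a dictionary keyed by the first column. The following is a minimal sketch that reads three rows copied from the preview above out of an in-memory buffer instead of the real file; the variable names `sample` and `lookup` are illustrative.

```python
import io
import pandas

# three rows from the preview above, tab-delimited with no header row
sample = ('ENSG00000000003\t7105\tprotein-coding\n'
          'ENSG00000000005\t64102\tprotein-coding\n'
          'ENSG00000000419\t8813\tprotein-coding\n')

egeg = pandas.read_csv(io.StringIO(sample),
                       header=None,
                       names=['Ensembl_Gene_IDs', 'Entrez_Gene_IDs', 'Gene_Type'],
                       delimiter='\t',
                       dtype=str)  # keep identifiers as strings

# one Ensembl gene can map to several Entrez genes, so collect values in lists
lookup = egeg.groupby('Ensembl_Gene_IDs')['Entrez_Gene_IDs'].apply(list).to_dict()
print(lookup['ENSG00000000003'])
```

Reading with `dtype=str` avoids Entrez identifiers being parsed as integers, which matters when they are later concatenated into edge labels.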


<br>

***

### Ensembl Transcript-Protein Ontology <a class="anchor" id="ensembltranscript-proteinontology"></a>

**Purpose:** To map Ensembl transcript identifiers to Protein Ontology identifiers when creating the following edges: 
- rna-protein  

**Output:** [`ENSEMBL_TRANSCRIPT_PROTEIN_ONTOLOGY_MAP.txt`](https://www.dropbox.com/s/9upjijohu0xzf13/ENSEMBL_TRANSCRIPT_PROTEIN_ONTOLOGY_MAP.txt?dl=1)


In [44]:
genomic_id_mapper(reformatted_mapped_identifiers,
                  processed_data_location + 'ENSEMBL_TRANSCRIPT_PROTEIN_ONTOLOGY_MAP.txt',
                  'transcript_stable_id', 'pro_id',
                  'transcript_type_update', False, True)

100%|██████████| 1160128/1160128 [01:40<00:00, 11577.20it/s] 


_Preview Processed Data_

In [45]:
etpr_data = pandas.read_csv(processed_data_location + 'ENSEMBL_TRANSCRIPT_PROTEIN_ONTOLOGY_MAP.txt',
                            header=None,
                            names=['Ensembl_Transcript_IDs', 'Protein_Ontology_IDs', 'Transcript_Type'],
                            delimiter='\t',
                            low_memory=False)

print('There are {edge_count} ensembl transcript-protein ontology edges'.format(edge_count=len(etpr_data.drop_duplicates())))

There are 323860 ensembl transcript-protein ontology edges


In [46]:
etpr_data.head(n=5)

Unnamed: 0,Ensembl_Transcript_IDs,Protein_Ontology_IDs,Transcript_Type
0,ENST00000000233,PR_000004204,protein-coding
1,ENST00000000233,PR_P84085,protein-coding
2,ENST00000000412,PR_000010032,protein-coding
3,ENST00000000412,PR_P20645,protein-coding
4,ENST00000000442,PR_000007208,protein-coding


<br>

***
***

### Entrez Gene-Ensembl Transcript <a class="anchor" id="entrezgene-ensembltranscript"></a>

**Purpose:** To map Entrez gene identifiers to Ensembl transcript identifiers when creating the following edges: 
- gene-rna 

**Output:** [`ENTREZ_GENE_ENSEMBL_TRANSCRIPT_MAP.txt`](https://www.dropbox.com/s/jxm9v7qfwm2b6ot/ENTREZ_GENE_ENSEMBL_TRANSCRIPT_MAP.txt?dl=1)

In [47]:
genomic_id_mapper(reformatted_mapped_identifiers,
                  processed_data_location + 'ENTREZ_GENE_ENSEMBL_TRANSCRIPT_MAP.txt',
                  'entrez_id', 'transcript_stable_id',
                  'transcript_type_update', False, False)

100%|██████████| 1160128/1160128 [01:41<00:00, 11421.90it/s] 


_Preview Processed Data_

In [48]:
eet_data = pandas.read_csv(processed_data_location + 'ENTREZ_GENE_ENSEMBL_TRANSCRIPT_MAP.txt',
                           header=None,
                           names=['Entrez_Gene_IDs', 'Ensembl_Transcript_IDs', 'Gene_Type', 'Transcript_Type'],
                           delimiter='\t',
                           low_memory=False)

print('There are {edge_count} entrez gene identifiers-ensembl transcript edges'.format(edge_count=len(eet_data.drop_duplicates())))

There are 216982 entrez gene identifiers-ensembl transcript edges


In [49]:
eet_data.head(n=5)

Unnamed: 0,Entrez_Gene_IDs,Ensembl_Transcript_IDs,Gene_Type,Transcript_Type
0,381,ENST00000000233,protein-coding,
1,4074,ENST00000000412,protein-coding,
2,2101,ENST00000000442,protein-coding,
3,2288,ENST00000001008,protein-coding,
4,56603,ENST00000001146,protein-coding,


<br>

***

### Entrez Gene-Protein Ontology <a class="anchor" id="entrezgene-proteinontology"></a>

**Purpose:** To map Entrez gene identifiers to Protein Ontology identifiers when creating the following edges:   
- chemical-protein  
- gene-protein

**Output:** [`ENTREZ_GENE_PRO_ONTOLOGY_MAP.txt`](https://www.dropbox.com/s/e5x7rq4kc2kfq49/ENTREZ_GENE_PRO_ONTOLOGY_MAP.txt?dl=1)

In [50]:
genomic_id_mapper(reformatted_mapped_identifiers,
                  processed_data_location + 'ENTREZ_GENE_PRO_ONTOLOGY_MAP.txt',
                  'entrez_id', 'pro_id',
                  'gene_type_update', False, True)

100%|██████████| 1160128/1160128 [01:37<00:00, 11940.32it/s] 


_Preview Processed Data_

In [51]:
egpr_data = pandas.read_csv(processed_data_location + 'ENTREZ_GENE_PRO_ONTOLOGY_MAP.txt',
                            header=None,
                            names=['Gene_IDs', 'Protein_Ontology_IDs', 'Gene_Type'],
                            delimiter='\t')

print('There are {edge_count} entrez gene-protein ontology edges'.format(edge_count=len(egpr_data.drop_duplicates())))

There are 37471 entrez gene-protein ontology edges


In [52]:
egpr_data.head(n=5)

Unnamed: 0,Gene_IDs,Protein_Ontology_IDs,Gene_Type
0,1,PR_P04217,protein-coding
1,1,PR_000003511,protein-coding
2,10,PR_P11245,protein-coding
3,10,PR_000011001,protein-coding
4,100,PR_000003707,protein-coding


<br>

***
***

### Gene Symbol-Ensembl Transcript <a class="anchor" id="genesymbol-ensembltranscript"></a>

**Purpose:** To map gene symbols to Ensembl transcript identifiers when creating the following edges: 
- chemical-rna  
- rna-anatomy  
- rna-cell  

**Output:** [`GENE_SYMBOL_ENSEMBL_TRANSCRIPT_MAP.txt`](https://www.dropbox.com/s/7wdrbpc79kj3sr7/GENE_SYMBOL_ENSEMBL_TRANSCRIPT_MAP.txt?dl=1)

In [53]:
genomic_id_mapper(reformatted_mapped_identifiers,
                  processed_data_location + 'GENE_SYMBOL_ENSEMBL_TRANSCRIPT_MAP.txt',
                  'symbol', 'transcript_stable_id',
                  'transcript_type_update', False, False)

100%|██████████| 1160128/1160128 [01:47<00:00, 10817.72it/s] 


_Preview Processed Data_

In [54]:
set_data = pandas.read_csv(processed_data_location + 'GENE_SYMBOL_ENSEMBL_TRANSCRIPT_MAP.txt',
                            header=None,
                            names=['Gene_Symbols', 'Ensembl_Transcript_IDs', 'Transcript_Type'],
                            delimiter='\t')

print('There are {edge_count} gene symbol-ensembl transcript edges'.format(edge_count=len(set_data.drop_duplicates())))

There are 269322 gene symbol-ensembl transcript edges


In [55]:
set_data.head(n=5)

Unnamed: 0,Gene_Symbols,Ensembl_Transcript_IDs,Transcript_Type
0,ARF5,ENST00000000233,protein-coding
1,M6PR,ENST00000000412,protein-coding
2,ESRRA,ENST00000000442,protein-coding
3,FKBP4,ENST00000001008,protein-coding
4,CYP26B1,ENST00000001146,protein-coding


<br>

***

### STRING-Protein Ontology <a class="anchor" id="string-proteinontology"></a>

**Purpose:** To map STRING identifiers to Protein Ontology identifiers when creating the following edges:   
- protein-protein  

**Output:** [`STRING_PRO_ONTOLOGY_MAP.txt`](https://www.dropbox.com/s/mpv6rzum0c1lgxe/STRING_PRO_ONTOLOGY_MAP.txt?dl=1)

In [56]:
genomic_id_mapper(reformatted_mapped_identifiers,
                  processed_data_location + 'STRING_PRO_ONTOLOGY_MAP.txt',
                  'protein_stable_id', 'pro_id',
                  None, False, True)

100%|██████████| 1160128/1160128 [00:46<00:00, 25051.90it/s] 


_Preview Processed Data_

In [57]:
stpr_data = pandas.read_csv(processed_data_location + 'STRING_PRO_ONTOLOGY_MAP.txt',
                            header=None,
                            names=['STRING_IDs', 'Protein_Ontology_IDs', 'Genomic_Type'],
                            delimiter='\t')

print('There are {edge_count} string-protein ontology edges'.format(edge_count=len(stpr_data.drop_duplicates())))

There are 208038 string-protein ontology edges


In [58]:
stpr_data.head(n=5)

Unnamed: 0,STRING_IDs,Protein_Ontology_IDs,Genomic_Type
0,ENSP00000000233,PR_000004204,
1,ENSP00000000233,PR_P84085,
2,ENSP00000000412,PR_000010032,
3,ENSP00000000412,PR_P20645,
4,ENSP00000000442,PR_000007208,


<br>

***

### Uniprot Accession-Protein Ontology <a class="anchor" id="uniprotaccession-proteinontology"></a>

**Purpose:** To map Uniprot accession identifiers to Protein Ontology identifiers when creating the following edges:  
- protein-gobp  
- protein-gomf  
- protein-gocc  
- protein-cofactor  
- protein-catalyst 
- protein-pathway

**Output:** [`UNIPROT_ACCESSION_PRO_ONTOLOGY_MAP.txt`](https://www.dropbox.com/s/wvk1yv28xb06mfr/UNIPROT_ACCESSION_PRO_ONTOLOGY_MAP.txt?dl=1)

In [59]:
genomic_id_mapper(reformatted_mapped_identifiers,
                  processed_data_location + 'UNIPROT_ACCESSION_PRO_ONTOLOGY_MAP.txt',
                  'uniprot_id', 'pro_id',
                  None, False, True)

100%|██████████| 1160128/1160128 [01:29<00:00, 12981.75it/s] 


_Preview Processed Data_

In [60]:
uapr_data = pandas.read_csv(processed_data_location + 'UNIPROT_ACCESSION_PRO_ONTOLOGY_MAP.txt',
                            header=None,
                            names=['Uniprot_Accession_IDs', 'Protein_Ontology_IDs', 'Genomic_Type'],
                            delimiter='\t')

print('There are {edge_count} uniprot accession-protein ontology edges'.format(edge_count=len(uapr_data.drop_duplicates())))

There are 299074 uniprot accession-protein ontology edges


In [61]:
uapr_data.head(n=5)

Unnamed: 0,Uniprot_Accession_IDs,Protein_Ontology_IDs,Genomic_Type
0,A0A023HJ61,PR_P20338,
1,A0A023HJ61,PR_000013631,
2,A0A023IP86,PR_P13612,
3,A0A023IP86,PR_000009129,
4,A0A023IP88,PR_P13612,


<br><br>

***
***
### Other Identifier Mapping <a class="anchor" id="other-identifier-mapping"></a>
***
* [ChEBI Identifiers](#mesh-chebi)  
* [Human Protein Atlas Tissue and Cell Types](#hpa-uberon) 
* [Human Disease and Phenotype Identifiers](#disease-identifiers) 
* [Reactome Pathways and the Pathway Ontology](#reactome-pw)  
* [Genomic Identifiers and the Sequence Ontology](#genomic-so)  

***
***

***
### ChEBI-MeSH Identifiers <a class="anchor" id="mesh-chebi"></a>

**Data Source Wiki Page:** [mapping-mesh-to-chebi](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#mapping-mesh-identifiers-to-chebi-identifiers)  

**Purpose:** Map MeSH identifiers to ChEBI identifiers when creating the following edges:  
- chemical-gene  
- chemical-disease

**Dependencies:** This script assumes that the [`ncbo_rest_api.py`](https://gist.github.com/callahantiff/a28fb3160782f42f104e9ec41553af0d) script was run and the data generated from this file was written to `./resources/processed_data/temp`. 

**Output:** [`MESH_CHEBI_MAP.txt`](https://www.dropbox.com/s/1mbd6a12dmjslae/MESH_CHEBI_MAP.txt?dl=1)


In [None]:
with open(processed_data_location + 'MESH_CHEBI_MAP.txt', 'w') as out:
    for filename in tqdm(glob.glob(processed_data_location + 'temp/*.txt')):
        # open each input file in a context manager so its handle is closed promptly
        with open(filename, 'r') as mapping_file:
            # skip empty rows
            for row in filter(None, mapping_file.read().split('\n')):
                fields = row.split('\t')
                mesh = '_'.join(fields[0].split('/')[-2:])
                chebi = fields[1].split('/')[-1]
                out.write(mesh + '\t' + chebi + '\n')
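
The string handling in the cell above is easier to follow on a single row. The sketch below applies the same splits to one made-up row; the `example.org` URLs and both identifiers are placeholders, and the real rows written by `ncbo_rest_api.py` are only assumed to be two tab-separated URLs of this general shape.

```python
def parse_mapping_row(row):
    """Split one tab-delimited row of two URLs into (MeSH id, ChEBI id)."""
    fields = row.split('\t')
    # join the last two path segments of the first URL, e.g. 'MESH' and the descriptor id
    mesh = '_'.join(fields[0].split('/')[-2:])
    # keep only the last path segment of the second URL
    chebi = fields[1].split('/')[-1]
    return mesh, chebi


# hypothetical row; URLs and identifiers are placeholders
example_row = 'http://example.org/ontology/MESH/D000000\thttp://example.org/obo/CHEBI_00000'
print(parse_mapping_row(example_row))
```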

_Preview Processed Data_

In [None]:
mc_data = pandas.read_csv(processed_data_location + 'MESH_CHEBI_MAP.txt',
                          delimiter='\t',
                          header=None,
                          names=['MeSH_IDs', 'ChEBI_IDs'])

print('There are {edge_count} MeSH-ChEBI edges'.format(edge_count=len(mc_data)))

In [None]:
mc_data.head(n=5)

<br>

***
***

### Disease and Phenotype Identifiers <a class="anchor" id="disease-identifiers"></a>

**Data Source Wiki Page:** [DisGeNET](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#disgenet)  

**Purpose:** This script downloads [disease_mappings.tsv](https://www.disgenet.org/static/disgenet_ap1/files/downloads/disease_mappings.tsv.gz) from DisGeNET to map UMLS identifiers to Human Disease Ontology and Human Phenotype Ontology identifiers when creating the following edges:  
- chemical-disease  
- disease-phenotype

**Output:**   
- Human Disease Ontology Mappings ➞ [`DISEASE_DOID_MAP.txt`](https://www.dropbox.com/s/ziv0glx4ph9jidc/DISEASE_DOID_MAP.txt?dl=1)  
- Human Phenotype Ontology Mappings ➞ [`PHENOTYPE_HPO_MAP.txt`](https://www.dropbox.com/s/71ts7kw44vm70tg/PHENOTYPE_HPO_MAP.txt?dl=1)

In [None]:
url = 'https://www.disgenet.org/static/disgenet_ap1/files/downloads/disease_mappings.tsv.gz'
data_downloader(url, unprocessed_data_location)

In [None]:
disease_data = pandas.read_csv(unprocessed_data_location + 'disease_mappings.tsv',
                               header=0,
                               delimiter='\t')  # DisGeNET changed the delimiter from "|" to "\t" in May 2020

disease_data.head(n=3)

_Build Disease Identifier Dictionary_  
In order to improve efficiency when mapping different disease terminology identifiers to the [Human Disease Ontology](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#human-disease-ontology) and [Human Phenotype Ontology](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#human-phenotype-ontology), we create a dictionary of disease identifiers.

In [None]:
# convert to dictionary
disease_dict = {}

for idx, row in tqdm(disease_data.iterrows(), total=disease_data.shape[0]):
    if row['vocabulary'] == 'MSH':
        mesh_finder(disease_data, row['code'], 'MESH:', disease_dict)
        print(row['code'])
    elif row['vocabulary'] == 'OMIM':
        mesh_finder(disease_data, row['code'], 'OMIM:', disease_dict)
        print(row['code'])
    elif row['vocabulary'] == 'ORDO':
        mesh_finder(disease_data, row['code'], 'ORPHA:', disease_dict)
        print(row['code'])
    elif row['diseaseId'] in disease_dict.keys():
        if row['vocabulary'] == 'DO':
            disease_dict[row['diseaseId']].append('DOID_' + row['code']) 
        if row['vocabulary'] == 'HPO':
            disease_dict[row['diseaseId']].append(row['code'].replace('HP:', 'HP_'))
    else:
        if row['vocabulary'] == 'DO':
            disease_dict[row['diseaseId']] = ['DOID_' + row['code']] 
        if row['vocabulary'] == 'HPO':
            disease_dict[row['diseaseId']] = [row['code'].replace('HP:', 'HP_')] 
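
The `DO` and `HPO` branches of the loop above can be sketched on a toy frame as follows. This version uses a `defaultdict` in place of the explicit key check and leaves out the `MSH`, `OMIM`, and `ORDO` branches, which rely on the `mesh_finder` helper. The rows and identifiers are made up; only the column names `diseaseId`, `vocabulary`, and `code` come from the loop.

```python
from collections import defaultdict

import pandas

# toy rows shaped like the columns used above (identifiers are placeholders)
toy_rows = pandas.DataFrame({
    'diseaseId': ['C0000001', 'C0000001', 'C0000002'],
    'vocabulary': ['DO', 'HPO', 'HPO'],
    'code': ['0000001', 'HP:0000001', 'HP:0000002'],
})

toy_disease_dict = defaultdict(list)
for row in toy_rows.itertuples(index=False):
    if row.vocabulary == 'DO':
        # Disease Ontology codes get a 'DOID_' prefix
        toy_disease_dict[row.diseaseId].append('DOID_' + row.code)
    elif row.vocabulary == 'HPO':
        # Human Phenotype Ontology codes swap ':' for '_'
        toy_disease_dict[row.diseaseId].append(row.code.replace('HP:', 'HP_'))

print(dict(toy_disease_dict))
```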

_Write Mapping Data_

In [None]:
with open(processed_data_location + 'DISEASE_DOID_MAP.txt', 'w') as outfile1, \
        open(processed_data_location + 'PHENOTYPE_HPO_MAP.txt', 'w') as outfile2:
    for key, value in tqdm(disease_dict.items()):
        for i in value:
            # get diseases
            if i.startswith('DOID_'):
                outfile1.write(key.split(':')[-1] + '\t' + i + '\n')

            # get phenotypes
            if i.startswith('HP_'):
                outfile2.write(key.split(':')[-1] + '\t' + i + '\n')
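
The routing of dictionary values into the two output files can be checked without touching disk. The sketch below writes to in-memory buffers instead of `DISEASE_DOID_MAP.txt` and `PHENOTYPE_HPO_MAP.txt`; the toy dictionary and its identifiers are placeholders.

```python
import io

# toy dictionary in the same shape as disease_dict (identifiers are placeholders)
toy_disease_dict = {'C0000001': ['DOID_0000001', 'HP_0000001'],
                    'C0000002': ['HP_0000002']}

doid_out, hp_out = io.StringIO(), io.StringIO()

for key, values in toy_disease_dict.items():
    for value in values:
        line = key.split(':')[-1] + '\t' + value + '\n'
        if value.startswith('DOID_'):
            doid_out.write(line)  # disease mappings
        elif value.startswith('HP_'):
            hp_out.write(line)    # phenotype mappings

print(doid_out.getvalue())
print(hp_out.getvalue())
```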

<br>

_Preview Processed Human Disease Ontology Mappings_

In [None]:
dis_data = pandas.read_csv(processed_data_location + 'DISEASE_DOID_MAP.txt',
                           header=None,
                           names=['Disease_IDs', 'DOID_IDs'],
                           delimiter='\t')

print('There are {} disease-DOID edges'.format(len(dis_data)))

In [None]:
dis_data.head(n=5)

<br>

_Preview Processed Human Phenotype Mappings_

In [None]:
hp_data = pandas.read_csv(processed_data_location + 'PHENOTYPE_HPO_MAP.txt',
                          header=None,
                          names=['Disease_IDs', 'HP_IDs'],
                          delimiter='\t')

print('There are {} phenotype-HPO edges'.format(len(hp_data)))

In [None]:
hp_data.head(n=5)

<br>

***

### Human Protein Atlas/GTEx Tissue/Cells - UBERON + Cell Ontology + Cell Line Ontology <a class="anchor" id="hpa-uberon"></a>

**Data Source Wiki Page:**  
- [human-protein-atlas](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#human-protein-atlas) 
- [genotype-tissue-expression-project](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#the-genotype-tissue-expression-gtex-project)  

<br>

**Purpose:** Downloads a query for cell, tissue, and blood types with overexpressed protein-coding genes in the human proteome ([`proteinatlas_search.tsv`](https://www.proteinatlas.org/api/search_download.php?search=&columns=g,eg,up,pe,rnatsm,rnaclsm,rnacasm,rnabrsm,rnabcsm,rnablsm,scl,t_RNA_adipose_tissue,t_RNA_adrenal_gland,t_RNA_amygdala,t_RNA_appendix,t_RNA_basal_ganglia,t_RNA_bone_marrow,t_RNA_breast,t_RNA_cerebellum,t_RNA_cerebral_cortex,t_RNA_cervix,_uterine,t_RNA_colon,t_RNA_corpus_callosum,t_RNA_ductus_deferens,t_RNA_duodenum,t_RNA_endometrium_1,t_RNA_epididymis,t_RNA_esophagus,t_RNA_fallopian_tube,t_RNA_gallbladder,t_RNA_heart_muscle,t_RNA_hippocampal_formation,t_RNA_hypothalamus,t_RNA_kidney,t_RNA_liver,t_RNA_lung,t_RNA_lymph_node,t_RNA_midbrain,t_RNA_olfactory_region,t_RNA_ovary,t_RNA_pancreas,t_RNA_parathyroid_gland,t_RNA_pituitary_gland,t_RNA_placenta,t_RNA_pons_and_medulla,t_RNA_prostate,t_RNA_rectum,t_RNA_retina,t_RNA_salivary_gland,t_RNA_seminal_vesicle,t_RNA_skeletal_muscle,t_RNA_skin_1,t_RNA_small_intestine,t_RNA_smooth_muscle,t_RNA_spinal_cord,t_RNA_spleen,t_RNA_stomach_1,t_RNA_testis,t_RNA_thalamus,t_RNA_thymus,t_RNA_thyroid_gland,t_RNA_tongue,t_RNA_tonsil,t_RNA_urinary_bladder,t_RNA_vagina,t_RNA_B-cells,t_RNA_dendritic_cells,t_RNA_granulocytes,t_RNA_monocytes,t_RNA_NK-cells,t_RNA_T-cells,t_RNA_total_PBMC,cell_RNA_A-431,cell_RNA_A549,cell_RNA_AF22,cell_RNA_AN3-CA,cell_RNA_ASC_diff,cell_RNA_ASC_TERT1,cell_RNA_BEWO,cell_RNA_BJ,cell_RNA_BJ_hTERT+,cell_RNA_BJ_hTERT+_SV40_Large_T+,cell_RNA_BJ_hTERT+_SV40_Large_T+_RasG12V,cell_RNA_CACO-2,cell_RNA_CAPAN-2,cell_RNA_Daudi,cell_RNA_EFO-21,cell_RNA_fHDF/TERT166,cell_RNA_HaCaT,cell_RNA_HAP1,cell_RNA_HBEC3-KT,cell_RNA_HBF_TERT88,cell_RNA_HDLM-2,cell_RNA_HEK_293,cell_RNA_HEL,cell_RNA_HeLa,cell_RNA_Hep_G2,cell_RNA_HHSteC,cell_RNA_HL-60,cell_RNA_HMC-1,cell_RNA_HSkMC,cell_RNA_hTCEpi,cell_RNA_hTEC/SVTERT24-B,cell_RNA_hTERT-HME1,cell_RNA_HUVEC_TERT2,cell_RNA_K-562,cell_RNA_Karpas-707,cell_RNA_LHCN-M2,cell_RNA_MCF7,cell_RNA_MOLT-4,cell_RNA_NB-4,cell_RNA_NTERA-2,cell_RNA_PC-3,cell_RNA_REH,cell_RNA_RH-30,cell_RNA_RPMI-8226,cell_RNA_RPTEC_TERT1,cell_RNA_RT4,cell_RNA_SCLC-21H,cell_RNA_SH-SY5Y,cell_RNA_SiHa,cell_RNA_SK-BR-3,cell_RNA_SK-MEL-30,cell_RNA_T-47d,cell_RNA_THP-1,cell_RNA_TIME,cell_RNA_U-138_MG,cell_RNA_U-2_OS,cell_RNA_U-2197,cell_RNA_U-251_MG,cell_RNA_U-266/70,cell_RNA_U-266/84,cell_RNA_U-698,cell_RNA_U-87_MG,cell_RNA_U-937,cell_RNA_WM-115,blood_RNA_basophil,blood_RNA_classical_monocyte,blood_RNA_eosinophil,blood_RNA_gdT-cell,blood_RNA_intermediate_monocyte,blood_RNA_MAIT_T-cell,blood_RNA_memory_B-cell,blood_RNA_memory_CD4_T-cell,blood_RNA_memory_CD8_T-cell,blood_RNA_myeloid_DC,blood_RNA_naive_B-cell,blood_RNA_naive_CD4_T-cell,blood_RNA_naive_CD8_T-cell,blood_RNA_neutrophil,blood_RNA_NK-cell,blood_RNA_non-classical_monocyte,blood_RNA_plasmacytoid_DC,blood_RNA_T-reg,blood_RNA_total_PBMC,brain_RNA_amygdala,brain_RNA_basal_ganglia,brain_RNA_cerebellum,brain_RNA_cerebral_cortex,brain_RNA_hippocampal_formation,brain_RNA_hypothalamus,brain_RNA_midbrain,brain_RNA_olfactory_region,brain_RNA_pons_and_medulla,brain_RNA_thalamus&format=tsv)) and median gene-level TPM by tissue for all genes that are not protein-coding ([`GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_median_tpm.gct`](https://storage.googleapis.com/gtex_analysis_v8/rna_seq_data/GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_median_tpm.gct.gz)) in order to create mappings between cell and tissue type strings to the Uber-Anatomy, Cell Ontology, and Cell Line Ontology concepts (see [human-protein-atlas](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#human-protein-atlas) for details on the mapping process). The mappings are then used to create the following edge types:  
- rna-cell line  
- rna-tissue type   
- protein-cell line  
- protein-tissue type  

<br>

**Output:**  
- All HPA tissue and cell type strings ➞ [`HPA_tissues.txt`](https://www.dropbox.com/s/0w6lbrit4mag92h/HPA_tissues.txt?dl=1)  
- Mapping HPA strings to ontology concepts (documentation) ➞ [`zooma_tissue_cell_mapping_04JAN2020.xlsx`](https://www.dropbox.com/s/0dh8tg8g4yvrl06/zooma_tissue_cell_mapping_04JAN2020.xlsx?dl=1)  
- Final HPA-ontology mappings ➞ [`HPA_GTEx_TISSUE_CELL_MAP.txt`](https://www.dropbox.com/s/at00xui224syzh8/HPA_GTEx_TISSUE_CELL_MAP.txt?dl=1)
- HPA Edges ➞ [`HPA_GTEX_RNA_GENE_PROTEIN_EDGES.txt`](https://www.dropbox.com/s/us3u516e4vhkuco/HPA_GTEX_RNA_GENE_PROTEIN_EDGES.txt?dl=1)

**Human Protein Atlas**

_Download Data_

In [None]:
url = 'https://www.proteinatlas.org/api/search_download.php?search=&columns=g,eg,up,pe,rnatsm,rnaclsm,rnacasm,rnabrsm,rnabcsm,rnablsm,scl,t_RNA_adipose_tissue,t_RNA_adrenal_gland,t_RNA_amygdala,t_RNA_appendix,t_RNA_basal_ganglia,t_RNA_bone_marrow,t_RNA_breast,t_RNA_cerebellum,t_RNA_cerebral_cortex,t_RNA_cervix,_uterine,t_RNA_colon,t_RNA_corpus_callosum,t_RNA_ductus_deferens,t_RNA_duodenum,t_RNA_endometrium_1,t_RNA_epididymis,t_RNA_esophagus,t_RNA_fallopian_tube,t_RNA_gallbladder,t_RNA_heart_muscle,t_RNA_hippocampal_formation,t_RNA_hypothalamus,t_RNA_kidney,t_RNA_liver,t_RNA_lung,t_RNA_lymph_node,t_RNA_midbrain,t_RNA_olfactory_region,t_RNA_ovary,t_RNA_pancreas,t_RNA_parathyroid_gland,t_RNA_pituitary_gland,t_RNA_placenta,t_RNA_pons_and_medulla,t_RNA_prostate,t_RNA_rectum,t_RNA_retina,t_RNA_salivary_gland,t_RNA_seminal_vesicle,t_RNA_skeletal_muscle,t_RNA_skin_1,t_RNA_small_intestine,t_RNA_smooth_muscle,t_RNA_spinal_cord,t_RNA_spleen,t_RNA_stomach_1,t_RNA_testis,t_RNA_thalamus,t_RNA_thymus,t_RNA_thyroid_gland,t_RNA_tongue,t_RNA_tonsil,t_RNA_urinary_bladder,t_RNA_vagina,t_RNA_B-cells,t_RNA_dendritic_cells,t_RNA_granulocytes,t_RNA_monocytes,t_RNA_NK-cells,t_RNA_T-cells,t_RNA_total_PBMC,cell_RNA_A-431,cell_RNA_A549,cell_RNA_AF22,cell_RNA_AN3-CA,cell_RNA_ASC_diff,cell_RNA_ASC_TERT1,cell_RNA_BEWO,cell_RNA_BJ,cell_RNA_BJ_hTERT+,cell_RNA_BJ_hTERT+_SV40_Large_T+,cell_RNA_BJ_hTERT+_SV40_Large_T+_RasG12V,cell_RNA_CACO-2,cell_RNA_CAPAN-2,cell_RNA_Daudi,cell_RNA_EFO-21,cell_RNA_fHDF/TERT166,cell_RNA_HaCaT,cell_RNA_HAP1,cell_RNA_HBEC3-KT,cell_RNA_HBF_TERT88,cell_RNA_HDLM-2,cell_RNA_HEK_293,cell_RNA_HEL,cell_RNA_HeLa,cell_RNA_Hep_G2,cell_RNA_HHSteC,cell_RNA_HL-60,cell_RNA_HMC-1,cell_RNA_HSkMC,cell_RNA_hTCEpi,cell_RNA_hTEC/SVTERT24-B,cell_RNA_hTERT-HME1,cell_RNA_HUVEC_TERT2,cell_RNA_K-562,cell_RNA_Karpas-707,cell_RNA_LHCN-M2,cell_RNA_MCF7,cell_RNA_MOLT-4,cell_RNA_NB-4,cell_RNA_NTERA-2,cell_RNA_PC-3,cell_RNA_REH,cell_RNA_RH-30,cell_RNA_RPMI-8226,cell_RNA_RPTEC_TERT1,cell_RNA_RT4,cell_RNA_SCLC-21H,cell_RNA_SH-SY5Y,cell_RNA_SiHa,cell_RNA_SK-BR-3,cell_RNA_SK-MEL-30,cell_RNA_T-47d,cell_RNA_THP-1,cell_RNA_TIME,cell_RNA_U-138_MG,cell_RNA_U-2_OS,cell_RNA_U-2197,cell_RNA_U-251_MG,cell_RNA_U-266/70,cell_RNA_U-266/84,cell_RNA_U-698,cell_RNA_U-87_MG,cell_RNA_U-937,cell_RNA_WM-115,blood_RNA_basophil,blood_RNA_classical_monocyte,blood_RNA_eosinophil,blood_RNA_gdT-cell,blood_RNA_intermediate_monocyte,blood_RNA_MAIT_T-cell,blood_RNA_memory_B-cell,blood_RNA_memory_CD4_T-cell,blood_RNA_memory_CD8_T-cell,blood_RNA_myeloid_DC,blood_RNA_naive_B-cell,blood_RNA_naive_CD4_T-cell,blood_RNA_naive_CD8_T-cell,blood_RNA_neutrophil,blood_RNA_NK-cell,blood_RNA_non-classical_monocyte,blood_RNA_plasmacytoid_DC,blood_RNA_T-reg,blood_RNA_total_PBMC,brain_RNA_amygdala,brain_RNA_basal_ganglia,brain_RNA_cerebellum,brain_RNA_cerebral_cortex,brain_RNA_hippocampal_formation,brain_RNA_hypothalamus,brain_RNA_midbrain,brain_RNA_olfactory_region,brain_RNA_pons_and_medulla,brain_RNA_thalamus&format=tsv'
data_downloader(url, unprocessed_data_location, 'proteinatlas_search.tsv.gz')

_Load Data_

In [None]:
hpa = pandas.read_csv(unprocessed_data_location + 'proteinatlas_search.tsv',
                      header=0,
                      delimiter='\t')

# replace NaN with 'None'
hpa.fillna('None', inplace=True)

_Identify HPA Terms Needing Mapping_  
To expedite the mapping process, all HPA tissues, cells, cell lines, and fluid types are extracted from the HPA data columns.

In [None]:
# retrieve terms to map
terms_to_map = list(hpa.columns)

# write results
with open(unprocessed_data_location + 'HPA_tissues.txt', 'w') as outfile:
    for x in tqdm(terms_to_map):
        if x.endswith('[NX]'):
            term = x.split('RNA - ')[-1].split(' [NX]')[:-1][0]
            outfile.write(term + '\n')
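
As a minimal sketch of the term extraction above, the following uses a regular expression instead of chained `split` calls. The column names are made-up examples that follow the `<prefix> RNA - <term> [NX]` pattern the loop relies on, and `extract_nx_terms` is a hypothetical helper, not part of the PheKnowLator code.

```python
import re

# hypothetical column names following the '<prefix> RNA - <term> [NX]' pattern
example_columns = ['Gene', 'Ensembl', 'Tissue RNA - adipose tissue [NX]',
                   'Cell RNA - A-431 [NX]', 'Blood RNA - basophil [NX]']

NX_PATTERN = re.compile(r'RNA - (.+) \[NX\]$')


def extract_nx_terms(columns):
    """Return the tissue/cell term embedded in each '[NX]' column name."""
    terms = []
    for col in columns:
        match = NX_PATTERN.search(col)
        if match:
            terms.append(match.group(1))
    return terms


hpa_terms = extract_nx_terms(example_columns)
print(hpa_terms)
```

Columns without the `[NX]` suffix (for example `Gene`) are skipped, which mirrors the `endswith('[NX]')` check in the cell above.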

<br>

**Genotype-Tissue Expression Project**

_Download Data_

In [None]:
url='https://storage.googleapis.com/gtex_analysis_v8/rna_seq_data/GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_median_tpm.gct.gz'
data_downloader(url, unprocessed_data_location)

_Load Data_

In [None]:
gtex = pandas.read_csv(unprocessed_data_location + 'GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_median_tpm.gct',
                       header=0,
                       skiprows=2,
                       delimiter='\t')

# replace NaN with 'None'
gtex.fillna('None', inplace=True)

# remove the Ensembl version suffix, which appears after '.'
gtex['Name'] = gtex['Name'].replace(r'\..*', '', regex=True)
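
The suffix removal matters because the `GTEx` gene identifiers are later compared against the `HPA` `Ensembl` column. A small sketch of the same substitution on a plain list, using the standard library only (the identifiers are examples chosen for illustration):

```python
import re

# example GTEx-style gene identifiers (Ensembl ID plus an optional version suffix)
versioned_ids = ['ENSG00000223972.5', 'ENSG00000227232.5', 'ENSG00000243485']

# drop everything from the first '.' onward; IDs without a suffix are unchanged
stable_ids = [re.sub(r'\..*$', '', gene_id) for gene_id in versioned_ids]
print(stable_ids)
```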

<br>

**Get Mapping Data**   
Import the tissues, cells, cell lines, and fluids that we externally mapped from HPA and GTEx data to [UBERON](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#uber-anatomy-ontology), the [Cell Ontology](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#cell-ontology), and the [Cell Line Ontology](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#cell-line-ontology).

In [None]:
mapping_data = pandas.read_excel(open(unprocessed_data_location + 'zooma_tissue_cell_mapping_04JAN2020.xlsx', 'rb'),
                                 sheet_name='Concept_Mapping - 04JAN2020',
                                 header=0)

# convert NaN to None
mapping_data.fillna('None', inplace=True)

# preview data
mapping_data.head(n=3)

<br>

_Write HPA and GTEx Mapping Data_  
The HPA and GTEx mapping data is written locally so that it can be used by the `PheKnowLator` algorithm when creating the knowledge graph edge lists. 

In [None]:
with open(processed_data_location + 'HPA_GTEx_TISSUE_CELL_MAP.txt', 'w') as outfile:
    for idx, row in tqdm(mapping_data.iterrows(), total=mapping_data.shape[0]):
        if row['UBERON ID'] != 'None':
            outfile.write(str(row['ORIGINAL TERM']).strip() + '\t' + str(row['UBERON ID']).strip() + '\n')
        if row['CL ID'] != 'None':
            outfile.write(str(row['ORIGINAL TERM']).strip() + '\t' + str(row['CL ID']).strip() + '\n')
        if row['CLO ID'] != 'None':
            outfile.write(str(row['ORIGINAL TERM']).strip() + '\t' + str(row['CLO ID']).strip() + '\n')
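
The cell above turns one wide row per term into one line per (term, ontology identifier) pair. A minimal sketch of that reshaping on plain dictionaries follows; the rows and identifiers are illustrative placeholders, not verified entries from the mapping spreadsheet, and `to_long_format` is a hypothetical helper.

```python
# hypothetical rows shaped like the mapping spreadsheet (same column names as above)
example_rows = [
    {'ORIGINAL TERM': 'adipose tissue ', 'UBERON ID': 'UBERON_EXAMPLE_1', 'CL ID': 'None', 'CLO ID': 'None'},
    {'ORIGINAL TERM': 'T-cells', 'UBERON ID': 'UBERON_EXAMPLE_2', 'CL ID': 'CL_EXAMPLE_1', 'CLO ID': 'None'},
    {'ORIGINAL TERM': 'HeLa', 'UBERON ID': 'None', 'CL ID': 'None', 'CLO ID': 'CLO_EXAMPLE_1'},
]

ID_COLUMNS = ['UBERON ID', 'CL ID', 'CLO ID']


def to_long_format(rows):
    """Emit one (term, ontology ID) pair for every non-empty ID column."""
    pairs = []
    for row in rows:
        term = str(row['ORIGINAL TERM']).strip()
        for col in ID_COLUMNS:
            if row[col] != 'None':
                pairs.append((term, str(row[col]).strip()))
    return pairs


mapping_pairs = to_long_format(example_rows)
for term, ontology_id in mapping_pairs:
    print(term + '\t' + ontology_id)
```

A term mapped to more than one ontology (the second row) produces more than one output line, which is why the preview cell below reads the file back as a two-column table.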

_Preview Processed Data_

In [None]:
mapping_data = pandas.read_csv(processed_data_location + 'HPA_GTEx_TISSUE_CELL_MAP.txt',
                               header=None,
                               names=['TISSUE_CELL_TERM', 'ONTOLOGY_IDs'],
                               delimiter='\t')

In [None]:
mapping_data.head(n=3)

<br>

**Create Edge Data Set**

_Human Protein Atlas_  
The `HPA` data is looped over and reformatted such that all tissue, cell, cell line, and fluid types are stored as a nested list. As shown in the code chunk, the anatomy type (`anatomy` or `cell line`) is added as an item in each list. This is done in order to make mapping more efficient while building the knowledge graph edge list.

In [None]:
hpa_results = []

for idx, row in tqdm(hpa.iterrows(), total=hpa.shape[0]):
    if row['RNA tissue specific NX'] != 'None':
        for x in row['RNA tissue specific NX'].split(';'):
            hpa_results += [[row['Ensembl'], row['Gene'], row['Uniprot'], row['Evidence'], 'anatomy', x.split(':')[0]]]

    if row['RNA cell line specific NX'] != 'None':
        for x in row['RNA cell line specific NX'].split(';'):
            hpa_results += [[row['Ensembl'], row['Gene'], row['Uniprot'], row['Evidence'], 'cell line', x.split(':')[0]]]

    if row['RNA brain regional specific NX'] != 'None':
        for x in row['RNA brain regional specific NX'].split(';'):
            hpa_results += [[row['Ensembl'], row['Gene'], row['Uniprot'], row['Evidence'], 'anatomy', x.split(':')[0]]]

    if row['RNA blood cell specific NX'] != 'None':
        for x in row['RNA blood cell specific NX'].split(';'):
            hpa_results += [[row['Ensembl'], row['Gene'], row['Uniprot'], row['Evidence'], 'anatomy', x.split(':')[0]]]

    if row['RNA blood lineage specific NX'] != 'None':
        for x in row['RNA blood lineage specific NX'].split(';'):
            hpa_results += [[row['Ensembl'], row['Gene'], row['Uniprot'], row['Evidence'], 'anatomy', x.split(':')[0]]]
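
Each of the five branches above parses the same cell format: entries separated by `;`, with the term before the first `:`. A minimal sketch of that parsing step is shown below. The example value is hypothetical, `parse_specific_terms` is not part of the notebook, and the extra `strip()` is an addition for illustration.

```python
def parse_specific_terms(cell_value):
    """Split a 'term: value;term: value' string into a list of terms."""
    if cell_value == 'None':
        return []
    return [entry.split(':')[0].strip() for entry in cell_value.split(';')]


example_value = 'brain: 12.3;liver: 45.6'  # hypothetical cell contents
parsed_terms = parse_specific_terms(example_value)
print(parsed_terms)
```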

<br>

_Genotype-Tissue Expression Project_  
The `GTEx` edge data is created by first filtering out all _protein-coding_ genes that appear in the `HPA` cell transcriptome data set. Once the data is filtered so that only non-coding genes remain, we perform an additional filtering step to only add genes and their corresponding tissue, cell, or fluid if the median expression is `>= 1.0`. The `GTEx` data is formatted such that all tissue, cell, and fluid types occur as their own column and all unique genes occur as a row; thus, the expression filtering step is performed while also reformatting the file. The genes and tissues/cells/fluids that meet these criteria are stored as a nested list.

In [None]:
# get a list of hpa protein-coding genes
hpa_genes = list(hpa['Ensembl'].drop_duplicates(keep='first', inplace=False))

# remove rows that contain protein coding genes
gtex = gtex.loc[~gtex['Name'].isin(hpa_genes)]

# loop over data and re-organize - only keep results with tpm >= 1 and if gene symbol is not a protein-coding gene
gtex_results = []

for idx, row in tqdm(gtex.iterrows(), total=gtex.shape[0]):    
    for col in list(gtex.columns)[2:]:
        if row[col] >= 1.0:           
            gtex_results += [[row['Name'], row['Description'], 'None', 'Evidence at transcript level', 'cell line' if 'Cells' in col else 'anatomy', col]]
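
The two filtering steps described above (drop protein-coding genes, then keep gene-tissue pairs with a median TPM of at least `1.0`) can be sketched on plain dictionaries as follows. The gene identifiers and values are made up, and `gtex_edges` is a hypothetical helper that returns a shortened version of the lists built in the cell above.

```python
TPM_THRESHOLD = 1.0

# hypothetical GTEx-style rows: gene ID, symbol, then one median-TPM column per tissue
example_gtex = [
    {'Name': 'ENSG_A', 'Description': 'GENE_A', 'Liver': 3.2, 'Cells - Cultured fibroblasts': 0.4},
    {'Name': 'ENSG_B', 'Description': 'GENE_B', 'Liver': 0.0, 'Cells - Cultured fibroblasts': 1.0},
    {'Name': 'ENSG_C', 'Description': 'GENE_C', 'Liver': 9.9, 'Cells - Cultured fibroblasts': 9.9},
]
protein_coding_ids = {'ENSG_C'}  # stands in for the HPA gene list


def gtex_edges(rows, coding_ids, threshold=TPM_THRESHOLD):
    """Reshape wide rows into [gene, symbol, anatomy type, tissue] lists."""
    edges = []
    for row in rows:
        if row['Name'] in coding_ids:
            continue  # protein-coding genes are covered by the HPA data
        for col, value in row.items():
            if col in ('Name', 'Description'):
                continue
            if value >= threshold:
                anatomy_type = 'cell line' if 'Cells' in col else 'anatomy'
                edges.append([row['Name'], row['Description'], anatomy_type, col])
    return edges


example_edges = gtex_edges(example_gtex, protein_coding_ids)
print(example_edges)
```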

_Write Results_

In [None]:
with open(processed_data_location + 'HPA_GTEX_RNA_GENE_PROTEIN_EDGES.txt', 'w') as outfile:
    for res in tqdm(hpa_results + gtex_results):
        outfile.write('\t'.join(str(x) for x in res) + '\n')

_Preview Processed Data_

In [None]:
hpa_edges = pandas.read_csv(processed_data_location + 'HPA_GTEX_RNA_GENE_PROTEIN_EDGES.txt',
                           header=None,
                           names=['Ensembl_IDs', 'Gene_Symbols', 'Uniprot_IDs', 'Evidence', 'Anatomy_Type', 'Anatomy'],
                           low_memory=False,
                           sep='\t')

print('There are {edge_count} edges'.format(edge_count=len(hpa_edges)))

In [None]:
hpa_edges.head(n=5)

<br>

***

### Mapping Reactome Pathways to the Pathway Ontology <a class="anchor" id="reactome-pw"></a>

**Data Source Wiki Page:** [Pathway Ontology](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#pathway-ontology)  

**Purpose:** This script downloads the [canonical pathways](http://compath.scai.fraunhofer.de/export_mappings) and [kegg-reactome pathway mappings](https://github.com/ComPath/resources/blob/master/mappings/kegg_reactome.csv) files from the [ComPath Ecosystem](https://github.com/ComPath) in order to create the following identifier mappings:  
- `Reactome Pathway Identifiers`  ➞ `KEGG Pathway Identifiers` ➞ `Pathway Ontology Identifiers` 

**Output:**  
- [`REACTOME_PW_GO_MAPPINGS.txt`](https://www.dropbox.com/s/6752jvjt5p3qf2e/REACTOME_PW_GO_MAPPINGS.txt?dl=1)


**Download the Pathway Ontology**   
Use [OWL Tools](https://github.com/owlcollab/owltools/wiki) to download the [Pathway Ontology](http://www.obofoundry.org/ontology/pw.html). Once downloaded, we read the ontology in as an `RDFLib` graph object so that we can query it to obtain all `DbXRefs`.

In [None]:
# download ontology using subprocess and OWLTOOLS in order to get the ontology and its imported ontologies
subprocess.check_call(['./pkt_kg/libs/owltools',
                       'http://purl.obolibrary.org/obo/pw.owl',
                       '--merge-import-closure',
                       '-o',
                       unprocessed_data_location + 'pw_with_imports.owl'])

In [None]:
pw_graph = Graph()
pw_graph.parse(unprocessed_data_location + 'pw_with_imports.owl')

print('There are {} axioms in the ontology (date: {})'.format(len(pw_graph), datetime.datetime.now().strftime('%m/%d/%Y')))

In [None]:
results = pw_graph.query(
    """SELECT DISTINCT ?c ?xref ?rel_syns ?exc_syns
           WHERE {
              ?c rdf:type owl:Class .
              ?c rdfs:label ?c_label .
              ?c_annot owl:annotatedSource ?c .
              ?c_annot oboInOwl:hasDbXref ?xref .
              ?c oboInOwl:hasRelatedSynonym ?rel_syns . 
              ?c oboInOwl:hasExactSynonym ?exc_syns .}
           """, initNs={"rdf": 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
                        "rdfs": 'http://www.w3.org/2000/01/rdf-schema#',
                        "owl": 'http://www.w3.org/2002/07/owl#',
                        "oboInOwl": 'http://www.geneontology.org/formats/oboInOwl#'}) 

_Reformat Mapping Results_  
Create a dictionary of mapping results where pathway ontology identifiers are values and the keys are `DbXRef` identifiers.


In [None]:
id_mappings = {}

for res in tqdm(results):
    for x in res:
        if 'http' not in x and 'PMID' not in x:
            if str(x) in id_mappings.keys():
                id_mappings[str(x)] |= set([str(res[0])])
            else:
                id_mappings[str(x)] = set([str(res[0])])

print('There are {} results (date: {})'.format(len(id_mappings), datetime.datetime.now().strftime('%m/%d/%Y')))
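
The loop above builds a reverse index from annotation values (cross-references and synonyms) to the classes that carry them. A minimal sketch of the same idea with `collections.defaultdict` is shown here; the pairs are hypothetical stand-ins for the SPARQL results and `build_reverse_index` is not part of the notebook.

```python
from collections import defaultdict

# hypothetical (class IRI, annotation value) pairs standing in for the query results
example_results = [
    ('http://purl.obolibrary.org/obo/PW_EXAMPLE_1', 'KEGG:00010'),
    ('http://purl.obolibrary.org/obo/PW_EXAMPLE_1', 'example synonym'),
    ('http://purl.obolibrary.org/obo/PW_EXAMPLE_2', 'KEGG:00010'),
    ('http://purl.obolibrary.org/obo/PW_EXAMPLE_2', 'PMID:0000000'),
]


def build_reverse_index(pairs):
    """Map each annotation value to the set of classes annotated with it."""
    index = defaultdict(set)
    for class_iri, value in pairs:
        # same filter as above: skip IRIs and PubMed references
        if 'http' not in value and 'PMID' not in value:
            index[value].add(class_iri)
    return dict(index)


example_index = build_reverse_index(example_results)
print(sorted(example_index))
```

Using sets as values means that a cross-reference shared by several classes keeps all of them, which the later KEGG lookups depend on.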

**Download Human Reactome Pathways**  
Download a file of all [Reactome Pathways](https://reactome.org/download/current/ReactomePathways.txt), [Reactome's GO Annotations](https://reactome.org/download/current/gene_association.reactome.gz), and [Reactome's mappings to ChEBI](https://reactome.org/download/current/ChEBI2Reactome_All_Levels.txt). Each file is filtered to only include human pathways.

_Reactome Pathway Stable Identifiers_

In [None]:
url = 'https://reactome.org/download/current/ReactomePathways.txt'
data_downloader(url, unprocessed_data_location)

In [None]:
reactome_pathways = pandas.read_csv(unprocessed_data_location + 'ReactomePathways.txt',
                                    header=None,
                                    delimiter='\t',
                                    low_memory=False)

# remove all non-human pathways
reactome_pathways = reactome_pathways.loc[reactome_pathways[2].apply(lambda x: x == 'Homo sapiens')] 

# save as dictionary, mapping each pathway identifier to the default PW_0000001
mapped_reactome_identifiers = {x:set(['PW_0000001']) for x in set(list(reactome_pathways[0]))}     
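
A small sketch of the species filter and default assignment, using only the standard library. The three-column layout (identifier, name, species) follows the cell above; the rows themselves are made up for illustration.

```python
import csv
import io

# hypothetical three-column extract: stable ID, pathway name, species
example_file = io.StringIO(
    'R-HSA-0000001\tExample pathway\tHomo sapiens\n'
    'R-MMU-0000001\tExample pathway\tMus musculus\n'
    'R-HSA-0000002\tAnother pathway\tHomo sapiens\n'
)

DEFAULT_PW = 'PW_0000001'  # default mapping assigned in the cell above

# keep only human rows, then give every identifier the default mapping
human_ids = {row[0] for row in csv.reader(example_file, delimiter='\t')
             if row[2] == 'Homo sapiens'}
example_identifiers = {pathway_id: {DEFAULT_PW} for pathway_id in human_ids}
print(sorted(example_identifiers))
```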


_Reactome's Mappings to GO Annotations_

In [None]:
url = 'https://reactome.org/download/current/gene_association.reactome.gz'
data_downloader(url, unprocessed_data_location)

In [None]:
reactome_pathways2 = pandas.read_csv(unprocessed_data_location + 'gene_association.reactome',
                                     header=None,
                                     delimiter='\t',
                                     skiprows=1,
                                     low_memory=False)

# remove all non-human pathways
reactome_pathways2 = reactome_pathways2.loc[reactome_pathways2[12].apply(lambda x: x == 'taxon:9606')] 

# add to dictionary, mapping each pathway identifier to the default PW_0000001
mapped_reactome_identifiers.update({x.split(':')[-1]:set(['PW_0000001']) for x in set(list(reactome_pathways2[5]))})     


_Reactome's Mappings to ChEBI_

In [None]:
url = 'https://reactome.org/download/current/ChEBI2Reactome_All_Levels.txt'
data_downloader(url, unprocessed_data_location)

In [None]:
reactome_pathways3 = pandas.read_csv(unprocessed_data_location + 'ChEBI2Reactome_All_Levels.txt',
                                     header=None,
                                     delimiter='\t',
                                     low_memory=False)

# remove all non-human pathways
reactome_pathways3 = reactome_pathways3.loc[reactome_pathways3[5].apply(lambda x: x == 'Homo sapiens')] 

# add to dictionary, mapping each pathway identifier to the default PW_0000001
mapped_reactome_identifiers.update({x:set(['PW_0000001']) for x in set(list(reactome_pathways3[1]))})     


**ComPath Reactome Pathway Mappings**  
Use [ComPath Mappings](https://github.com/ComPath/resources/tree/master/mappings) to obtain the following mappings:  
- `Reactome Pathway Identifiers`  ➞ `KEGG Pathway Identifiers` ➞ `Pathway Ontology Identifiers` 

_Canonical Pathways_

In [None]:
url1 = 'http://compath.scai.fraunhofer.de/export_mappings'
data_downloader(url1, unprocessed_data_location, 'compath_canonical_pathway_mappings.txt')

In [None]:
compath_cannonical = pandas.read_csv(unprocessed_data_location + 'compath_canonical_pathway_mappings.txt',
                               header=None,
                               delimiter='\t',
                               low_memory=False)

# replace NaN with 'None'
compath_cannonical.fillna('None', inplace=True)

In [None]:
for idx, row in tqdm(compath_cannonical.iterrows(), total=compath_cannonical.shape[0]):
    if row[6] == 'kegg' and 'KEGG:' + row[5].strip('path:hsa') in id_mappings.keys() and row[2] == 'reactome':
        for x in id_mappings['KEGG:' + row[5].strip('path:hsa')]:
            if row[1] in mapped_reactome_identifiers.keys():
                mapped_reactome_identifiers[row[1]] |= set([x.split('/')[-1]])
            else:
                mapped_reactome_identifiers[row[1]] = set([x.split('/')[-1]])
    
    if (row[2] == 'kegg' and 'KEGG:' + row[1].strip('path:hsa') in id_mappings.keys()) and row[6] == 'reactome':
        for x in id_mappings['KEGG:' + row[1].strip('path:hsa')]:
            if row[5] in mapped_reactome_identifiers.keys():
                mapped_reactome_identifiers[row[5]] |= set([x.split('/')[-1]])
            else:
                mapped_reactome_identifiers[row[5]] = set([x.split('/')[-1]])         
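
A note on `strip('path:hsa')` as used above: `str.strip` treats its argument as a set of characters to trim from both ends, not as a prefix. It gives the expected result for identifiers such as `path:hsa04110` because digits are not in that set. The sketch below shows an exact prefix removal instead; `kegg_xref_key` is a hypothetical helper and the second identifier is invented to show the difference.

```python
def kegg_xref_key(kegg_id, prefix='path:hsa'):
    """Turn 'path:hsa04110' into 'KEGG:04110' by removing an exact prefix."""
    if kegg_id.startswith(prefix):
        kegg_id = kegg_id[len(prefix):]
    return 'KEGG:' + kegg_id


standard = kegg_xref_key('path:hsa04110')

# an invented identifier that ends in one of the characters in 'path:hsa'
over_trimmed = 'path:hsa04110a'.strip('path:hsa')  # trailing 'a' is also removed
exact = kegg_xref_key('path:hsa04110a')            # only the prefix is removed

print(standard, over_trimmed, exact)
```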

_KEGG - Reactome Mappings_

In [None]:
url2 = 'https://raw.githubusercontent.com/ComPath/resources/master/mappings/kegg_reactome.csv'
data_downloader(url2, unprocessed_data_location, 'kegg_reactome.csv')

In [None]:
kegg_reactome_map = pandas.read_csv(unprocessed_data_location + 'kegg_reactome.csv',
                                    header=0,
                                    delimiter=',',
                                    low_memory=False)

In [None]:
for idx, row in tqdm(kegg_reactome_map.iterrows(), total=kegg_reactome_map.shape[0]):
    if row['Source Resource'] == 'reactome' and 'KEGG:' + row['Target ID'].strip('path:hsa') in id_mappings.keys():
        for x in id_mappings['KEGG:' + row['Target ID'].strip('path:hsa')]:
            if row['Source ID'] in mapped_reactome_identifiers.keys():
                mapped_reactome_identifiers[row['Source ID']] |= set([x.split('/')[-1]])
            else:
                mapped_reactome_identifiers[row['Source ID']] = set([x.split('/')[-1]])
    
    if row['Target Resource'] == 'reactome' and 'KEGG:' + row['Source ID'].strip('path:hsa') in id_mappings.keys():
        for x in id_mappings['KEGG:' + row['Source ID'].strip('path:hsa')]:
            if row['Target ID'] in mapped_reactome_identifiers.keys():
                mapped_reactome_identifiers[row['Target ID']] |= set([x.split('/')[-1]])
            else:
                mapped_reactome_identifiers[row['Target ID']] = set([x.split('/')[-1]])
    

**Reactome Pathway GO Annotation Mappings**  
Use Reactome's [API](https://reactome.org/dev/content-service) to obtain the following mappings:  
- `Reactome Pathway Identifiers`  ➞ `Gene Ontology Identifiers`


In [None]:
for request_ids in tqdm(list(chunks(list(mapped_reactome_identifiers.keys()), 20))):
    for res in content.query_ids(ids=','.join(request_ids)):
        for key in res.keys(): 
            if key == 'goBiologicalProcess':
                for x in res[key]:
                    if res['stId'] in mapped_reactome_identifiers.keys():
                        mapped_reactome_identifiers[res['stId']] |= set(['GO_' + res[key]['accession']])
                    else:
                        mapped_reactome_identifiers[res['stId']] = set(['GO_' + res[key]['accession']])
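
The `chunks` function used above is not defined in this section and presumably comes from the helper functions script. Under that assumption, a minimal stand-in that splits a list into batches of at most 20 identifiers could look like this (the identifiers are placeholders):

```python
def chunks(items, size):
    """Yield successive lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


example_ids = ['R-HSA-{}'.format(i) for i in range(45)]  # placeholder identifiers
batches = list(chunks(example_ids, 20))
batch_sizes = [len(batch) for batch in batches]

# each batch is joined into one comma-separated string per request
query_string = ','.join(batches[0][:3])
print(batch_sizes, query_string)
```

Batching keeps each request small while avoiding one call per pathway identifier.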

**Write Mappings**

In [None]:
# reformat data and write it out
with open(processed_data_location + 'REACTOME_PW_GO_MAPPINGS.txt', 'w') as outfile:
    for key in tqdm(mapped_reactome_identifiers.keys()):
        for mapping in mapped_reactome_identifiers[key]:
            outfile.write(key + '\t' + mapping + '\n')

In [None]:
pw_data = pandas.read_csv(processed_data_location + 'REACTOME_PW_GO_MAPPINGS.txt',
                           header=None,
                           names=['Pathway_IDs', 'Mapping_IDs'],
                           delimiter='\t')

print('There are {edge_count} pathway ontology mappings'.format(edge_count=len(pw_data)))

In [None]:
pw_data.head(n=5)

<br>

***

### Mapping Genomic Identifiers to the Sequence Ontology <a class="anchor" id="genomic-soo"></a>

**Data Source Wiki Page:** [Sequence Ontology](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#sequence-ontology)  

**Purpose:** This script downloads the [genomic identifier mapping](https://www.dropbox.com/s/g0blo27qc8ogvk2/genomic_sequence_ontology_mappings.xlsx?dl=1) file in order to create the following identifier mappings:  
- `Gene BioTypes`  ➞ `Sequence Ontology Identifiers`  
- `RNA BioTypes`  ➞ `Sequence Ontology Identifiers`  
- `Variant Types`  ➞ `Sequence Ontology Identifiers`

**Output:**  
- [`SO_GENE_TRANSCRIPT_VARIANT_TYPE_MAPPING.txt`](https://www.dropbox.com/s/6dl73a3470u1hcr/SO_GENE_TRANSCRIPT_VARIANT_TYPE_MAPPING.txt?dl=1)


In [None]:
mapping_data = pandas.read_excel(open(unprocessed_data_location + 'genomic_sequence_ontology_mappings.xlsx', 'rb'),
                                 sheet_name='GenomicType_SO_Map_09Mar2020',
                                 header=0)

# convert data to dictionary
genomic_type_so_map = {}

for idx, row in tqdm(mapping_data.iterrows(), total=mapping_data.shape[0]):
    genomic_type_so_map[row['source_*_type'] + '_' + row['Genomic']] = row['SO ID']


**Genes**

In [None]:
# read in genomic mapping data
genomic_mapped_ids = pickle.load(open(processed_data_location + 'Merged_gene_rna_protein_identifiers.pkl', 'rb'),
                                 encoding='bytes')


In [None]:
sequence_map = {}

for identifier in tqdm(genomic_mapped_ids.keys()):    
    if identifier.startswith('entrez_id_') and identifier.replace('entrez_id_', '') != 'None':
        id_clean = identifier.replace('entrez_id_', '')
        
        # get identifier types
        ensembl = [x.replace('ensembl_gene_type_', '') for x in genomic_mapped_ids[identifier] if x.startswith('ensembl_gene_type') and x != 'ensembl_gene_type_unknown']
        hgnc = [x.replace('hgnc_gene_type_', '')  for x in genomic_mapped_ids[identifier] if x.startswith('hgnc_gene_type') and x != 'hgnc_gene_type_unknown']
        entrez = [x.replace('entrez_gene_type_', '')  for x in genomic_mapped_ids[identifier] if x.startswith('entrez_gene_type') and x != 'entrez_gene_type_unknown']
        
        # determine gene type
        if len(ensembl) > 0:
            gene_type = genomic_type_so_map[ensembl[0].replace('ensembl_gene_type_', '') + '_Gene']
        elif len(hgnc) > 0:
            gene_type = genomic_type_so_map[hgnc[0].replace('hgnc_gene_type_', '') + '_Gene']
        elif len(entrez) > 0:
            gene_type = genomic_type_so_map[entrez[0].replace('entrez_gene_type_', '') + '_Gene']
        else:
            gene_type = 'SO_0000704'
        
        # update sequence map
        if id_clean in sequence_map.keys():
            sequence_map[id_clean] += [gene_type]
        else:
            sequence_map[id_clean] = [gene_type]
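
The cell above resolves a gene's type by preferring Ensembl, then HGNC, then Entrez annotations, and falls back to `SO_0000704` when none is known. A compact sketch of that priority order follows. The mapping values are placeholders rather than real Sequence Ontology identifiers, and `resolve_gene_type` is a hypothetical helper.

```python
DEFAULT_GENE_SO = 'SO_0000704'  # fallback used in the cell above

# placeholder mapping; the real values come from the mapping spreadsheet
example_so_map = {'protein-coding_Gene': 'SO_EXAMPLE_1', 'ncRNA_Gene': 'SO_EXAMPLE_2'}


def resolve_gene_type(attributes, so_map):
    """Return the SO identifier from the first source with a known gene type."""
    for source in ('ensembl', 'hgnc', 'entrez'):
        prefix = source + '_gene_type_'
        types = [a[len(prefix):] for a in attributes
                 if a.startswith(prefix) and a != prefix + 'unknown']
        if types:
            return so_map[types[0] + '_Gene']
    return DEFAULT_GENE_SO


example_attributes = ['ensembl_gene_type_unknown',
                      'hgnc_gene_type_ncRNA',
                      'entrez_gene_type_protein-coding']
resolved = resolve_gene_type(example_attributes, example_so_map)
print(resolved)
```

Here the Ensembl type is `unknown`, so the HGNC annotation wins over the Entrez one.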
    

**Transcripts**

In [None]:
# read in processed Ensembl Transcript data 
transcript_data = pandas.read_csv(processed_data_location + 'ensembl_identifier_data_cleaned.txt',
                                  header=0,
                                  delimiter='\t',
                                  low_memory=False)

# convert to dictionary
transcripts = {}

for idx, row in tqdm(transcript_data.iterrows(), total=transcript_data.shape[0]):
    if row['transcript_stable_id'] != 'None':
        if row['transcript_stable_id'].replace('transcript_stable_id_', '') in transcripts.keys():
            transcripts[row['transcript_stable_id'].replace('transcript_stable_id_', '')] += [row['ensembl_transcript_type']]
        else:
            transcripts[row['transcript_stable_id'].replace('transcript_stable_id_', '')] = [row['ensembl_transcript_type']]


In [None]:
# update SO map dictionary
for identifier in tqdm(transcripts.keys()):
    
    # map transcript type
    if transcripts[identifier][0] == 'protein_coding':
        trans_type = genomic_type_so_map['protein-coding_Transcript']
    elif transcripts[identifier][0] == 'misc_RNA':
        trans_type = genomic_type_so_map['miscRNA_Transcript']
    else:
        trans_type = genomic_type_so_map[list(set(transcripts[identifier]))[0] + '_Transcript']

    sequence_map[identifier] = [trans_type, 'SO_0000673']
        

**Variants**

In [None]:
# read in variant summary data 
# (assumes that it was already downloaded from 'ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz')
variant_data = pandas.read_csv(unprocessed_data_location + 'variant_summary.txt',
                               header=0,
                               delimiter='\t',
                               low_memory=False)

# convert to dictionary
variants = {}

for idx, row in tqdm(variant_data.iterrows(), total=variant_data.shape[0]):
    if row['Assembly'] == 'GRCh38' and row['RS# (dbSNP)'] != -1:
        if 'rs' + str(row['RS# (dbSNP)']) in variants.keys():
            variants['rs' + str(row['RS# (dbSNP)'])] |= set([row['Type']])
        else:
            variants['rs' + str(row['RS# (dbSNP)'])] = set([row['Type']])


In [None]:
# update SO map dictionary
for identifier in tqdm(variants.keys()):
    for typ in variants[identifier]:
    
        # map variant type
        var_type = genomic_type_so_map[typ + '_Variant']

        if identifier in sequence_map.keys():
            sequence_map[identifier] += [var_type]
        else:
            sequence_map[identifier] = [var_type]


**Write Data**

In [None]:
# reformat data and write it out
with open(processed_data_location + 'SO_GENE_TRANSCRIPT_VARIANT_TYPE_MAPPING.txt', 'w') as outfile:
    for key in tqdm(sequence_map.keys()):
        for map_type in sequence_map[key]:
            outfile.write(key + '\t' + map_type + '\n')

In [None]:
so_data = pandas.read_csv(processed_data_location + 'SO_GENE_TRANSCRIPT_VARIANT_TYPE_MAPPING.txt',
                           header=None,
                           names=['Identifier', 'Sequence_Ontology_ID'],
                           delimiter='\t')

print('There are {edge_count} sequence ontology mappings'.format(edge_count=len(so_data)))

In [None]:
so_data.head(n=5)

<br>

**Combine Pathway and Sequence Ontology Mapping Data in Dictionary**  
Combine the pathway and sequence mapping data into a dictionary and output it.

In [None]:
subclass_mapping = {}  

# combine genomic and pathway maps
sequence_map.update(mapped_reactome_identifiers)

# iterate over pathway lists and combine them
for key in tqdm(sequence_map.keys()):
    subclass_mapping[key] = sequence_map[key]


In [None]:
# save a copy of the dictionary
pickle.dump(subclass_mapping, open(processed_data_location + 'subclass_construction_map.pkl', 'wb'), protocol=4)


<br>

***
***
### CREATE EDGE DATASETS  <a class="anchor" id="create-edge-datasets"></a>
***
***

### Ontologies  <a class="anchor" id="ontologies"></a>
***
- [Protein Ontology](#protein-ontology)  
- [Relations Ontology](#relations-ontology)  

***
***

***
### Protein Ontology <a class="anchor" id="protein-ontology"></a>

**Data Source Wiki Page:** [protein-ontology](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#protein-ontology)  

**Purpose:** This script uses OWLTools to download the [pr.owl](http://purl.obolibrary.org/obo/pr.owl) (with imports) file from [ProConsortium.org](https://proconsortium.org/) in order to create a version of the ontology that contains only human proteins. This is achieved by performing forward and reverse breadth-first search over all proteins which are `rdfs:subClassOf` [Homo sapiens protein](https://proconsortium.org/app/entry/PR%3A000029067/).

<br>

**Output:**  
- Human Protein Ontology ➞ [`human_pro.owl`](https://www.dropbox.com/s/t7sq1dshu2usmkj/human_pro.owl?dl=1)
- Classified Human Protein Ontology (Hermit) ➞ [`human_pro_closed.owl`](https://www.dropbox.com/s/khtyja0zw14rp32/human_pro_closed.owl?dl=1)


In [None]:
# download ontology using subprocess and OWLTOOLS in order to get the ontology and its imported ontologies
subprocess.check_call(['./pkt_kg/libs/owltools',
                       'http://purl.obolibrary.org/obo/pr.owl',
                       '--merge-import-closure',
                       '-o',
                       unprocessed_data_location + 'pr_with_imports.owl'])

In [None]:
# read in ontology as graph (the ontology is large so this takes ~60 minutes) - 11,799,102 edges on 06/08/2020
graph = Graph()
graph.parse(unprocessed_data_location + 'pr_with_imports.owl')

print('There are {} axioms in the ontology (date: {})'.format(len(graph), datetime.datetime.now().strftime('%m/%d/%Y')))

_Convert Ontology to Directed MultiGraph_  
In order to create a version of the ontology which includes all relevant human edges, we first need to convert the ontology to a [directed multigraph](https://networkx.github.io/documentation/stable/reference/classes/multidigraph.html).

In [None]:
# convert RDF graph to multidigraph
networkx_mdg: networkx.MultiDiGraph = networkx.MultiDiGraph()
    
for s, p, o in tqdm(graph):
    networkx_mdg.add_edge(s, o, **{'key': p})

_Identify Human Proteins_   
A list of human proteins is obtained by querying the ontology to return all ontology classes `only_in_taxon some Homo sapiens`. To expedite the query time, the following SPARQL query is run from the [ProConsortium](https://proconsortium.org/pro_sparql.shtml) SPARQL endpoint: 

```SPARQL
PREFIX obo: <http://purl.obolibrary.org/obo/>

SELECT ?PRO_term
FROM <http://purl.obolibrary.org/obo/pr>
WHERE {
       ?PRO_term rdf:type owl:Class .
       ?PRO_term rdfs:subClassOf ?restriction .
       ?restriction owl:onProperty obo:RO_0002160 .
       ?restriction owl:someValuesFrom obo:NCBITaxon_9606 .

       # use this to filter-out things like hgnc ids
       FILTER (regex(?PRO_term,"http://purl.obolibrary.org/obo/*")) .
}

```


In [None]:
# download data - pro classes only_in_taxon some Homo sapiens (61,025 classes on 06/08/2020)
url = 'https://sparql.proconsortium.org/virtuoso/sparql?default-graph-uri=&query=PREFIX+obo%3A+%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2F%3E%0D%0A%0D%0ASELECT+%3FPRO_term%0D%0AFROM+%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2Fpr%3E%0D%0AWHERE+%7B%0D%0A+++++++%3FPRO_term+rdf%3Atype+owl%3AClass+.%0D%0A+++++++%3FPRO_term+rdfs%3AsubClassOf+%3Frestriction+.%0D%0A+++++++%3Frestriction+owl%3AonProperty+obo%3ARO_0002160+.%0D%0A+++++++%3Frestriction+owl%3AsomeValuesFrom+obo%3ANCBITaxon_9606+.%0D%0A%0D%0A+++++++%23+use+this+to+filter-out+things+like+hgnc+ids%0D%0A+++++++FILTER+%28regex%28%3FPRO_term%2C%22http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2F*%22%29%29+.%0D%0A%7D&format=text%2Fhtml&timeout=5000&debug=on'
html = requests.get(url, allow_redirects=True).content

# extract data from html table
df_list = pandas.read_html(html)
human_pro_classes = list(df_list[-1]['PRO_term'])

print('There are {} human PRO classes (date:{})'.format(len(human_pro_classes), datetime.datetime.now().strftime('%m/%d/%Y')))


<br>

_Construct Human PRO_   
Now that we have all of the paths from the original graph that are relevant to humans, we can construct a human-only version of the PRotein ontology.

In [None]:
# create a new graph using bfs paths
human_pro_graph = Graph()
human_networkx_mdg = networkx.MultiDiGraph()

for node in tqdm(human_pro_classes):
    forward = list(networkx.edge_bfs(networkx_mdg, URIRef(node), orientation='original'))
    reverse = list(networkx.edge_bfs(networkx_mdg, URIRef(node), orientation='reverse'))
    
    # add edges from forward and reverse bfs paths
    for path in forward + reverse:
        human_pro_graph.add((path[0], path[2], path[1]))
        human_networkx_mdg.add_edge(path[0], path[1], path[2])
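
The traversal above keeps every edge that can be reached from a human class in either direction. The same idea is sketched below on toy triples using only the standard library. This is an illustrative stand-in for `networkx.edge_bfs`, not its implementation, and the node names are made up.

```python
from collections import defaultdict, deque


def reachable_edges(triples, start, reverse=False):
    """Breadth-first collection of (s, p, o) triples reachable from start.

    With reverse=True, edges are followed from object to subject, but the
    triples are still returned in their original orientation.
    """
    adjacency = defaultdict(list)
    for s, p, o in triples:
        adjacency[o if reverse else s].append((s, p, o))

    seen_nodes, seen_edges, found = {start}, set(), []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for s, p, o in adjacency[node]:
            if (s, p, o) not in seen_edges:
                seen_edges.add((s, p, o))
                found.append((s, p, o))
            nxt = s if reverse else o
            if nxt not in seen_nodes:
                seen_nodes.add(nxt)
                queue.append(nxt)
    return found


# toy graph: a human protein, its parent class, a child form, and an unrelated mouse protein
toy_triples = [('human_protein', 'subClassOf', 'protein'),
               ('human_isoform', 'subClassOf', 'human_protein'),
               ('mouse_protein', 'subClassOf', 'protein'),
               ('protein', 'subClassOf', 'entity')]

forward = reachable_edges(toy_triples, 'human_protein')
backward = reachable_edges(toy_triples, 'human_protein', reverse=True)
human_only = set(forward + backward)
```

In this toy graph the mouse edge is never collected: the forward pass only climbs from the human class to its ancestors, and the reverse pass only descends from the human class to its own children.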

In [None]:
# verify that the constructed ontology only has 1 component
networkx.number_connected_components(human_networkx_mdg.to_undirected())

In [None]:
# save filtered ontology
human_pro_graph.serialize(destination=unprocessed_data_location + 'human_pro.owl', format='xml')

<br>

_Classify Ontology_  
To ensure that we have correctly built the new ontology, we run the HermiT reasoner over it to check that there are no incomplete triples or inconsistent classes. To do this, we call the reasoner using [OWLTools](https://github.com/owlcollab/owltools), which this script assumes has already been downloaded to the `./resources/lib` directory. The following command runs the reasoner (from the command line):  

```bash
./resources/lib/owltools ./resources/processed_data/unprocessed_data/human_pro.owl --reasoner hermit --run-reasoner --assert-implied -o ./resources/processed_data/human_pro_closed.owl
```

_**Note.** This step takes around 30 minutes to run. When run from the command line the reasoner determined that the ontology was consistent and 12 new axioms were inferred (06/08/2020)._
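
When this command is launched from Python instead, each flag and each value has to be its own list element, because `subprocess.run` does not split the items of an argument list on spaces. The sketch below uses `shlex.split` to derive such a list from the command shown above. Nothing is executed, and the commented-out call assumes OWLTools is installed at the stated path.

```python
import shlex

command = ('./resources/lib/owltools ./resources/processed_data/unprocessed_data/human_pro.owl '
           '--reasoner hermit --run-reasoner --assert-implied '
           '-o ./resources/processed_data/human_pro_closed.owl')

# shlex.split tokenizes the string the way a POSIX shell would
owltools_args = shlex.split(command)

# '--reasoner' and 'hermit' end up as two separate arguments, as do '-o' and the output path
print(owltools_args)

# to run it (assumes OWLTools is installed at the path above):
# import subprocess
# subprocess.run(owltools_args, check=True)
```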

In [None]:
# run reasoner -- RUN FROM COMMAND LINE NOT HERE
# each flag and its value are passed as separate list items so that OWLTools can parse them
subprocess.run(['./resources/lib/owltools',
                unprocessed_data_location + 'human_pro.owl',
                '--reasoner', 'hermit',
                '--run-reasoner',
                '--assert-implied',
                '--list-unsatisfiable',
                '-o', processed_data_location + 'human_pro_closed.owl'])

_Examine Cleaned Human PRO_  
Once we have cleaned the ontology, we can get counts of components, nodes, and edges, and then write the cleaned graph to the `./resources/processed_data` directory.

In [None]:
gets_ontology_statistics(processed_data_location + 'human_pro_closed.owl')

***


### Relations Ontology <a class="anchor" id="relations-ontology"></a>

**Data Source Wiki Page:** [RO](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#relation-ontology)  

**Purpose:** This script downloads the [ro.owl](http://purl.obolibrary.org/obo/ro.owl) file from [obofoundry.org](http://www.obofoundry.org/) in order to obtain all `ObjectProperties` and their inverse relations.  

**Output:** 
- Relations and Inverse Relations ➞ [`INVERSE_RELATIONS.txt`](https://www.dropbox.com/s/4gq1iebdxta7qr8/INVERSE_RELATIONS.txt?dl=1)
- Relations and Labels ➞ [`RELATIONS_LABELS.txt`](https://www.dropbox.com/s/vr0tj22am192ubw/RELATIONS_LABELS.txt?dl=1)

_Download Ontology_

In [None]:
# download ontology using subprocess and OWLTOOLS in order to get the ontology and its imported ontologies
subprocess.run(['./resources/lib/owltools',
                'http://purl.obolibrary.org/obo/ro.owl',
                '--merge-import-closure',
                '-o',
                unprocessed_data_location + 'ro_with_imports.owl'])

_Load Ontology to RDFLib Graph_

In [None]:
ro_graph = Graph()
ro_graph.parse(unprocessed_data_location + 'ro_with_imports.owl')

print('There are {} edges in the ontology (date:{})'.format(len(ro_graph), datetime.datetime.now().strftime('%m/%d/%Y')))

**Identify Relations and Inverse Relations**  
Identify all relations and their inverse relations using the `owl:inverseOf` property. To make it easier to look up the inverse relations, each pair is listed twice, for example:  
- [location of](http://www.ontobee.org/ontology/RO?iri=http://purl.obolibrary.org/obo/RO_0001015) `owl:inverseOf` [located in](http://www.ontobee.org/ontology/RO?iri=http://purl.obolibrary.org/obo/RO_0001025)  
- [located in](http://www.ontobee.org/ontology/RO?iri=http://purl.obolibrary.org/obo/RO_0001025) `owl:inverseOf` [location of](http://www.ontobee.org/ontology/RO?iri=http://purl.obolibrary.org/obo/RO_0001015)

In [None]:
with open(relations_data_location + 'INVERSE_RELATIONS.txt', 'w') as outfile:
    
    # write column names
    outfile.write('Relation' + '\t' + 'Inverse_Relation' + '\n')

    # find inverse relations and write each pair in both directions
    for s, p, o in tqdm(ro_graph):
        if 'owl#inverseOf' in str(p):
            if 'RO' in str(s) and 'RO' in str(o):
                outfile.write(str(s).split('/')[-1] + '\t' + str(o).split('/')[-1] + '\n')
                outfile.write(str(o).split('/')[-1] + '\t' + str(s).split('/')[-1] + '\n')

_Preview Processed Data_

In [None]:
ro_data = pandas.read_csv(relations_data_location + 'INVERSE_RELATIONS.txt',
                          header=0,
                          delimiter='\t')

print('There are {edge_count} RO Relations and Inverse Relations'.format(edge_count=len(ro_data)))

In [None]:
ro_data.head(n=5)
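
Because each pair is written in both directions, the file can be read into a plain dictionary and queried from either side. The sketch below uses an in-memory stand-in for `INVERSE_RELATIONS.txt` that contains only the `location of` / `located in` pair from the example above.

```python
import csv
import io

# in-memory stand-in for INVERSE_RELATIONS.txt (same two-column, tab-separated layout)
inverse_file = io.StringIO('Relation\tInverse_Relation\n'
                           'RO_0001015\tRO_0001025\n'
                           'RO_0001025\tRO_0001015\n')

reader = csv.DictReader(inverse_file, delimiter='\t')
inverse_lookup = {row['Relation']: row['Inverse_Relation'] for row in reader}

# either member of the pair can be used as the key
print(inverse_lookup['RO_0001015'])
print(inverse_lookup['RO_0001025'])
```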

<br>

**Get Relations Labels**  
Identify all relations and their labels for use when building the knowledge graph.

In [None]:
results = ro_graph.query(
    """SELECT DISTINCT ?p ?p_label
           WHERE {
              ?p rdf:type owl:ObjectProperty .
              ?p rdfs:label ?p_label . }
           """, initNs={"rdf": 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
                        "rdfs": 'http://www.w3.org/2000/01/rdf-schema#',
                        "owl": 'http://www.w3.org/2002/07/owl#'})    

In [None]:
# write data to file
with open(relations_data_location + 'RELATIONS_LABELS.txt', 'w') as outfile:
    
    # write column names
    outfile.write('Relation' + '\t' + 'Label' + '\n')

    for p, p_label in list(results):
        outfile.write(str(p).split('/')[-1] + '\t' + str(p_label) + '\n')

_Preview Processed Data_

In [None]:
ro_data_label = pandas.read_csv(relations_data_location + 'RELATIONS_LABELS.txt',
                                header=0,
                                delimiter='\t')

print('There are {edge_count} RO Relations and Labels'.format(edge_count=len(ro_data_label)))

In [None]:
ro_data_label.head(n=5)

<br><br>

***
***
### Linked Data <a class="anchor" id="linked-data"></a>
***
* [Clinvar Variant-Diseases and Phenotypes](#clinvar-variant) 
* [Uniprot Protein-Cofactor and Protein-Catalyst](#uniprot-protein-cofactorcatalyst)  

***

***
***
### Clinvar Variant-Diseases and Phenotypes <a class="anchor" id="clinvar-variant"></a>

**Data Source Wiki Page:** [Clinvar](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#clinvar)  

**Purpose:** This script downloads the [variant_summary.txt](ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz) file from [ClinVar](https://www.ncbi.nlm.nih.gov/clinvar/) in order to create the following edges:  
- gene-variant  
- variant-disease  
- variant-phenotype  

**Output:** [`CLINVAR_VARIANT_GENE_DISEASE_PHENOTYPE_EDGES.txt`](https://www.dropbox.com/s/px3dlywz0q6gb6d/CLINVAR_VARIANT_GENE_DISEASE_PHENOTYPE_EDGES.txt?dl=1)


In [None]:
url = 'ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz'
data_downloader(url, unprocessed_data_location)

In [None]:
clinvar_data = pandas.read_csv(unprocessed_data_location + 'variant_summary.txt',
                               header=0,
                               delimiter='\t',
                               low_memory=False)

_Preprocess Data_

In [None]:
# replace NaN with 'None'
clinvar_data.fillna('None', inplace=True)

# explode nested data
explode_df_clinvar = explodes_data(clinvar_data.copy(), ['PhenotypeIDS'], ';')
explode_df_clinvar = explodes_data(explode_df_clinvar.copy(), ['PhenotypeIDS'], ',')

# edit column formatting
explode_df_clinvar['PhenotypeIDS'] = explode_df_clinvar['PhenotypeIDS'].replace('Orphanet:ORPHA', 'ORPHA:', regex=True)
explode_df_clinvar['PhenotypeIDS'] = explode_df_clinvar['PhenotypeIDS'].replace('Human Phenotype Ontology:HP:', 'HP_', regex=True)

# write data
explode_df_clinvar.to_csv(processed_data_location + 'CLINVAR_VARIANT_GENE_DISEASE_PHENOTYPE_EDGES.txt',
                          header=True,
                          sep='\t',
                          encoding='utf-8',
                          index=False)
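
To make the effect of the two explode passes and the prefix clean-up concrete, here is a small standard-library sketch. `explode_field` is an illustrative stand-in for the `explodes_data` helper (whose actual implementation lives in the helper script), and the record is made up.

```python
def explode_field(rows, column, delimiter):
    """Illustrative stand-in for the explodes_data helper: one output row per delimited value."""
    exploded = []
    for row in rows:
        for value in row[column].split(delimiter):
            new_row = dict(row)
            new_row[column] = value
            exploded.append(new_row)
    return exploded


# made-up record with the nested identifier layout handled above
rows = [{'GeneSymbol': 'GENE1',
         'PhenotypeIDS': 'MedGen:C0000001,Orphanet:ORPHA123;Human Phenotype Ontology:HP:0000001'}]

# split on ';' first, then on ','
for delimiter in (';', ','):
    rows = explode_field(rows, 'PhenotypeIDS', delimiter)

# apply the same prefix clean-up as the notebook
for row in rows:
    row['PhenotypeIDS'] = (row['PhenotypeIDS']
                           .replace('Orphanet:ORPHA', 'ORPHA:')
                           .replace('Human Phenotype Ontology:HP:', 'HP_'))

phenotype_ids = [row['PhenotypeIDS'] for row in rows]
```

One input record becomes three rows, each carrying a single identifier and a copy of the remaining columns.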

_Preview Processed Data_

In [None]:
print('There are {edge_count} variant edges'.format(edge_count=len(explode_df_clinvar)))

In [None]:
# preview data
explode_df_clinvar.head(n=5)

<br>


***

### Uniprot Protein-Cofactor and Protein-Catalyst <a class="anchor" id="uniprot-protein-cofactorcatalyst"></a>

**Data Source Wiki Page:** [Uniprot](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#uniprot-knowledgebase)  

**Purpose:** This script downloads the [uniprot-cofactor-catalyst.tab](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#uniprot-knowledgebase) file from the [Uniprot Knowledge Base](https://www.uniprot.org) in order to create the following edges:  
- protein-cofactor  
- protein-catalyst  

**Output:**  
- protein-cofactor ➞ [`UNIPROT_PROTEIN_COFACTOR.txt`](https://www.dropbox.com/s/xjtljf21eqign73/UNIPROT_PROTEIN_COFACTOR.txt?dl=1)
- protein-catalyst ➞ [`UNIPROT_PROTEIN_CATALYST.txt`](https://www.dropbox.com/s/w4lh6k9wbo5qkw0/UNIPROT_PROTEIN_CATALYST.txt?dl=1)


In [None]:
url = 'https://www.uniprot.org/uniprot/?query=&fil=organism%3A%22Homo%20sapiens%20(Human)%20%5B9606%5D%22&columns=id%2Centry%20name%2Creviewed%2Cdatabase(PRO)%2Cchebi(Cofactor)%2Cchebi(Catalytic%20activity)&format=tab'
data_downloader(url, unprocessed_data_location, 'uniprot-cofactor-catalyst.tab')

In [None]:
# reformat data and write it out
with open(unprocessed_data_location + 'uniprot-cofactor-catalyst.tab') as infile, \
        open(processed_data_location + 'UNIPROT_PROTEIN_COFACTOR.txt', 'w') as outfile1, \
        open(processed_data_location + 'UNIPROT_PROTEIN_CATALYST.txt', 'w') as outfile2:
    for line in tqdm(infile.readlines()):
        # split each row once; strip the newline so it does not leak into the last column
        row = line.rstrip('\n').split('\t')
        pro_id = 'PR_' + row[3].strip(';')

        # get cofactors
        if 'CHEBI' in row[4]:
            for i in row[4].split(';'):
                chebi = i.split('[')[-1].replace(']', '').replace(':', '_')
                outfile1.write(pro_id + '\t' + chebi + '\n')

        # get catalysts
        if 'CHEBI' in row[5]:
            for i in row[5].split(';'):
                chebi = i.split('[')[-1].replace(']', '').replace(':', '_')
                outfile2.write(pro_id + '\t' + chebi + '\n')

<br>
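
The identifier extraction above can be followed on a single row. The sketch below assumes the six-column layout requested in the download URL (accession, entry name, review status, PRO identifier, cofactors, catalytic activity). The row itself is made up for illustration and `chebi_ids` is a hypothetical helper.

```python
def chebi_ids(cell):
    """Pulls 'CHEBI_<number>' identifiers out of a 'name [CHEBI:<number>]; ...' cell."""
    ids = []
    for entry in cell.strip().split(';'):
        if 'CHEBI' in entry:
            ids.append(entry.split('[')[-1].replace(']', '').replace(':', '_').strip())
    return ids


# illustrative row: accession, entry name, status, PRO id, cofactors, catalytic activity
line = 'P00000\tEXAMPLE_HUMAN\treviewed\t000000001;\tZn(2+) [CHEBI:29105]; Mg(2+) [CHEBI:18420]\t\n'
fields = line.rstrip('\n').split('\t')

protein = 'PR_' + fields[3].strip(';')
cofactor_edges = [(protein, chebi) for chebi in chebi_ids(fields[4])]
catalyst_edges = [(protein, chebi) for chebi in chebi_ids(fields[5])]
```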

_Preview Cofactor Data_

In [None]:
pcp1_data = pandas.read_csv(processed_data_location + 'UNIPROT_PROTEIN_COFACTOR.txt',
                            header=None,
                            names=['Protein_Ontology_IDs', 'CHEBI_IDs'],
                            delimiter='\t')

print('There are {edge_count} protein-cofactor edges'.format(edge_count=len(pcp1_data)))

In [None]:
pcp1_data.head(n=5)

<br>

_Preview Catalyst Data_

In [None]:
pcp2_data = pandas.read_csv(processed_data_location + 'UNIPROT_PROTEIN_CATALYST.txt',
                            header=None,
                            names=['Protein_Ontology_IDs', 'CHEBI_IDs'],
                            delimiter='\t')

print('There are {edge_count} protein-catalyst edges'.format(edge_count=len(pcp2_data)))

In [None]:
pcp2_data.head(n=5)

<br>

***
***
### INSTANCE AND/OR SUBCLASS METADATA <a class="anchor" id="create-instance-metadata"></a>
***

**Data Source Wiki Page:** [Dependencies](https://github.com/callahantiff/PheKnowLator/wiki/Dependencies/#node-metadata) 

<br>

**Purpose:** The goal of this section is to obtain metadata for each instance and/or subclass data source used in the knowledge graph. To determine which of the edges contains instance and/or subclass data, the [`Master_Edge_List_Dict.json`](https://www.dropbox.com/s/w4l9yffnn4tyk2e/Master_Edge_List_Dict.json?dl=1) file is parsed and saved to a nested dictionary (see example below). 

```python
{
  'complex': {
              'chemical-complex': [[node_1, node_2]...[node_n, node_m]],
              'complex-complex':  [[node_1, node_2]...[node_n, node_m]],
              'complex-pathway':  [[node_1, node_2]...[node_n, node_m]],
              },
     'gene': {
                'chemical-gene':  [[node_1, node_2]...[node_n, node_m]],
                 'gene-disease':  [[node_1, node_2]...[node_n, node_m]],
              }
}
```

<br>

Once this dictionary is created, each major data type (examples shown in the list below) will be processed. For **[`Release V2.0.0`](https://github.com/callahantiff/PheKnowLator/wiki/v2.0.0)**, the following are instance and/or subclass data and require the compiling of metadata:
- [Genes](#gene-metadata)
- [RNA](#rna-metadata)
- [Pathways](#pathway-metadata)
- [Variants](#variant-metadata)


<br>

____

**Metadata:** The <u>metadata</u> we will gather includes:  

| **Metadata Type** | **Definition** | **Example Node**  | **Example Node Metadata** | 
| :---: | :---: | :---: | :---: | 
| Label | The primary label or name for the node | `R-HSA-1006173` | "CFH:Host cell surface" |       
| Description | A definition or other useful details about the node | `rs794727058` | This `germline` `single nucleotide variant (allele alteration: C➞T)` located on chromosome `5 (GRCh38: NC_000005.10, start/stop positions (126555930/126555930))` with `pathogenic` clinical significance and a last review date of `2/23/2015` (review status: `criteria provided, single submitter`). |        
| Synonym | Alternative terms used for a node | `81399` | "OR1-1, OR7-21" |           

<br>

The metadata information will be used to create the following edges in the knowledge graph:  
- **Label** ➞ node `rdfs:label`  
- **Description** ➞ node `obo:IAO_0000115` description 
- **Synonyms** ➞ node `oboInOwl:hasExactSynonym` synonym 

<br>

*<b>NOTE.</b> All node metadata datasets are written to the `node_data` directory. The algorithm will look for data in this directory and if it is not there, then no node metadata will be created.*

_____

<br>

### Prepare Metadata Dictionaries
***

**Purpose:** To create the resources needed in order to create metadata dictionaries, which are in turn used to obtain metadata for instance and/or subclass data nodes. This process has the following steps:

**1. [Identify Instance and/or Subclass Data Nodes](#identify-instance-or-subclass-data-nodes):** In order to automatically obtain the list of edges that include an instance data source and their corresponding edge lists, the `Master_Edge_List_Dict.json` is read in and processed.  
  - <u>Input Data</u>: [`Master_Edge_List_Dict.json`](https://www.dropbox.com/s/w4l9yffnn4tyk2e/Master_Edge_List_Dict.json?dl=1)  

<br>

**2. [Generate Metadata Dictionaries](#generate-metadata-dictionaries):** In order to efficiently obtain metadata for the instance and/or subclass data nodes identified in _Step 1_, we first read in the data for each node type (i.e. genes, rna, pathways, and variants) and convert it into a dictionary. Then, each metadata dictionary is saved to a `master_metadata_dictionary`, keyed by node type.
  - <u>Input Datasets</u>:  
    - Genes ➞ [`Homo_sapiens.gene_info`](https://www.dropbox.com/s/95jmr5bqkjcft8k/Homo_sapiens.gene_info?dl=1)    
    - RNA ➞ [`ensembl_identifier_data_cleaned.txt`](https://www.dropbox.com/s/gcq8157cqcdz2d0/ensembl_identifier_data_cleaned.txt?dl=1) 
    - Pathways ➞ [`reactome2py API`](https://github.com/reactome/reactome2py)   
    - Variants ➞ [`variant_summary.txt.gz`](https://www.dropbox.com/s/nqsgf92jhu7690e/variant_summary.txt?dl=1)  

<br>

**3. [Write Metadata Files](#write-metadata-files):** The instance and/or subclass data node dictionary from _Step 1_ and the metadata dictionaries from _Step 2_ are used to write `.txt` files for all `edge-type` data included in the instance and/or subclass node dictionary.

<br>

***

### Identify Instance and/or Subclass Data Nodes  <a class="anchor" id="identify-instance-or-subclass-data-nodes"></a>

In [None]:
# read in data files for each edge type
with open('./resources/Master_Edge_List_Dict.json', 'r') as edge_file:
    edge_data = json.load(edge_file)

edge_dict = {key: [edge_data[key]['data_type'], edge_data[key]['edge_list']] for key in edge_data.keys()}


**Sort Data**  
For all edges in `edge_dict` that include instance and/or subclass data, we create a new dictionary in which the edge types are grouped by the node that references the instance data (e.g. in the `chemical-gene` edge type, the `gene` node references instance data).

In [None]:
# sort data files
metadata_file_info = {}

for edge in tqdm(edge_dict.keys()):
    if 'entity' in edge_dict[edge][0]:

        # get instance type
        inst_type = edge.split('-')[edge_dict[edge][0].split('-').index('entity')]

        # store the edge list and data type under the instance type, creating the entry if needed
        metadata_file_info.setdefault(inst_type, {})[edge] = {'data': edge_dict[edge][1],
                                                              'inst_sbc_idx': edge_dict[edge][0]}


<br>
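
The way the instance type is derived from an edge name and its data type can be seen on a toy input. In the sketch below, the two entries stand in for `Master_Edge_List_Dict.json`. The node identifiers are made up, and `class` as the label for the non-instance side of an edge is an assumption made for this example.

```python
# toy stand-in for two entries of Master_Edge_List_Dict.json; 'class' as the label for
# the non-instance side of an edge is an assumption made for this example
edge_data = {'chemical-gene': {'data_type': 'class-entity', 'edge_list': [['CHEBI_1', '1234']]},
             'gene-rna': {'data_type': 'entity-entity', 'edge_list': [['1234', 'ENST0001']]}}

instance_types = {}
for edge, info in edge_data.items():
    if 'entity' in info['data_type']:
        # position of the first 'entity' in the data type selects the node in the edge name
        position = info['data_type'].split('-').index('entity')
        instance_types[edge] = edge.split('-')[position]

print(instance_types)
```

Because `index` returns the first match, an edge whose nodes are both instances (here `gene-rna`) is filed under its first node type.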

***

### Generate Metadata Dictionaries  <a class="anchor" id="generate-metadata-dictionaries"></a>
In this step, the goal is to create a metadata dictionary for each node type that does not rely on API data. In this case, only the **Gene**, **RNA**, and **Variant** nodes require data that is not from an API.


<br>

**Genes Metadata Dictionary**

In [None]:
# entrez gene data
entrez_gene_data = pandas.read_csv(unprocessed_data_location + 'Homo_sapiens.gene_info',
                                   header=0,
                                   delimiter='\t',
                                   low_memory=False)

# remove all rows that are not human
entrez_gene_data = entrez_gene_data.loc[entrez_gene_data['#tax_id'].apply(lambda x: x == 9606)]

# replace NaN and '-' with 'None'
entrez_gene_data.fillna('None', inplace=True)
entrez_gene_data.replace('-','None', inplace=True, regex=False)


_Create Gene Metadata Dictionary_  
The nested dictionary of gene metadata is created by looping over the preprocessed Entrez gene data from the prior cell. The `keys` of the dictionary are `Entrez gene identifiers` and the `values` are dictionaries for each metadata type: `Label`, `Description`, and `Synonym`.

In [None]:
# create metadata
genes, label, description, synonym = [], [], [], []

for idx, row in tqdm(entrez_gene_data.iterrows(), total=entrez_gene_data.shape[0]):
    # node -- skip rows without an identifier so that all four lists stay aligned
    if row['GeneID'] == 'None':
        continue
    genes.append(row['GeneID'])

    # label -- fall back to the identifier when there is no symbol
    if row['Symbol'] != 'None' and row['Symbol'] != '':
        label.append(row['Symbol'])
    else:
        label.append('Entrez_ID:' + str(row['GeneID']))

    # description
    if row['Full_name_from_nomenclature_authority'] != 'None' and row['type_of_gene'] != 'None' and row['chromosome'] != 'None' and row['map_location'] != 'None':
        description.append("{desc} has locus group '{gene}' and is located on chromosome {chrom} (map_location: {map_loc}).".format(
            desc=row['Symbol'], gene=row['type_of_gene'], chrom=row['chromosome'], map_loc=row['map_location']))
    else:
        description.append("{desc} has locus group '{gene}'.".format(desc=row['Symbol'], gene=row['type_of_gene']))

    # synonym -- join both fields with '|' before splitting so that adjacent values are not fused
    syn_fields = [row[col] for col in ['Synonyms', 'Other_designations'] if row[col] != 'None']
    syns = set(x for x in '|'.join(syn_fields).split('|') if x != 'None' and x != '')
    synonym.append('|'.join(syns) if syns else 'None')
            
    
# combine into new data frame        
gene_metadata_final = pandas.DataFrame(list(zip(genes, label, description, synonym)), columns =['ID', 'Label', 'Description', 'Synonym'])

# make all variables string
gene_metadata_final = gene_metadata_final.astype(str)

# dedup
gene_metadata_final.drop_duplicates(subset='ID', keep='first', inplace=True)

# convert df to dictionary
gene_metadata_final.set_index('ID', inplace=True)
gene_metadata_dict = gene_metadata_final.to_dict('index')
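
The synonym handling is the part of this cell that is easiest to get subtly wrong, so it is sketched in isolation below. `merge_synonyms` is a hypothetical helper. The `OR1-1` and `OR7-21` values come from the metadata table earlier in this notebook, and the designation string is made up.

```python
def merge_synonyms(synonyms, other_designations, missing='None'):
    """Combines two '|'-delimited fields into one de-duplicated '|'-delimited string."""
    fields = [field for field in (synonyms, other_designations) if field != missing]
    # join with '|' first so the last synonym and the first designation stay separate
    values = {value for value in '|'.join(fields).split('|') if value not in (missing, '')}
    return '|'.join(sorted(values)) if values else missing


merged = merge_synonyms('OR1-1|OR7-21', 'olfactory receptor 1-1|OR1-1')
only_one = merge_synonyms('None', 'olfactory receptor 1-1')
neither = merge_synonyms('None', 'None')
```

Sorting the values is a choice made here so that repeated runs produce the same string; a plain `set` gives no ordering guarantee.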


<br>

**RNA Metadata Dictionary**

In [None]:
rna_gene_data = pandas.read_csv(processed_data_location + 'ensembl_identifier_data_cleaned.txt',
                                header=0,
                                delimiter='\t',
                                low_memory=False)


_Preprocess Data_  
Standard preprocessing and filtering steps are performed to prepare the data for the next step, which converts it to a metadata dictionary.

In [None]:
# replace NaN with 'None' so that missing identifiers can be filtered below
rna_gene_data = rna_gene_data.fillna('None')

# remove rows without identifiers
rna_gene_data = rna_gene_data.loc[rna_gene_data['transcript_stable_id'] != 'None']

# remove unneeded columns
rna_gene_data = rna_gene_data.drop(['ensembl_gene_id', 'symbol', 'protein_stable_id', 'uniprot_id', 'master_transcript_type',
                                    'entrez_id', 'ensembl_gene_type', 'master_gene_type'], axis=1)

# remove duplicates
rna_gene_data = rna_gene_data.drop_duplicates(subset=['transcript_stable_id', 'transcript_name', 'ensembl_transcript_type'], keep='first')


_Create RNA Metadata Dictionary_  
The nested dictionary of RNA metadata is created by looping over the cleaned human [Ensembl](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#ensembl) gene, RNA, and protein identifier data set ([`ensembl_identifier_data_cleaned.txt`](https://www.dropbox.com/s/gcq8157cqcdz2d0/ensembl_identifier_data_cleaned.txt?dl=1)). The `keys` of the dictionary are `Ensembl transcript identifiers` and the `values` are dictionaries for each metadata type: `Label`, `Description`, and `Synonym`.

In [None]:
# create metadata
rna, label, description, synonym = [], [], [], []

for idx, row in tqdm(rna_gene_data.iterrows(), total=rna_gene_data.shape[0]):
    # node
    rna.append(row['transcript_stable_id'])
    
    # label info -- fall back to the identifier when there is no transcript name
    if row['transcript_name'] != 'None':
        label.append(row['transcript_name'])
    else:
        label.append('Ensembl_Transcript_ID:' + row['transcript_stable_id'])

    # rna type info
    rna_type = row['ensembl_transcript_type']

    if rna_type != 'None':
        # description
        description.append("Transcript {desc} is classified as type '{typ}'.".format(desc=row['transcript_name'], typ=rna_type))
    else:
        # description
        description.append('None')

    # synonym
    synonym.append('None')
    
# combine into new data frame
rna_metadata_final = pandas.DataFrame(list(zip(rna, label, description, synonym)),
                                      columns =['ID', 'Label', 'Description', 'Synonym'])

# convert df to dictionary
rna_metadata_final.set_index('ID', inplace=True)
rna_metadata_dict = rna_metadata_final.to_dict('index')


<br>

**Variant Metadata Dictionary**

_Download Data_  
Only run this code block if the `variant_summary.txt` has not already been downloaded.

In [None]:
url = 'ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz'
data_downloader(url, unprocessed_data_location)


In [None]:
var_data = pandas.read_csv(unprocessed_data_location + 'variant_summary.txt',
                           header=0,
                           delimiter='\t',
                           low_memory=False)


_Preprocess Data_  
Standard preprocessing and filtering steps are performed to prepare the data for the next step, which converts it to a metadata dictionary.

In [None]:
# remove rows without identifiers
var_data = var_data.loc[var_data['Assembly'].apply(lambda x: x == 'GRCh38')]
var_data = var_data.loc[var_data['RS# (dbSNP)'].apply(lambda x: x != -1)]

# keep only the columns needed for the metadata (copy to avoid modifying a view of var_data)
var_metadata = var_data[['#AlleleID', 'Type', 'Name', 'ClinicalSignificance', 'RS# (dbSNP)', 'Origin',
                         'ChromosomeAccession', 'Chromosome', 'Start', 'Stop', 'ReferenceAllele',
                         'Assembly', 'AlternateAllele','Cytogenetic', 'ReviewStatus', 'LastEvaluated']].copy()

# replace NaN with 'None'
var_metadata = var_metadata.fillna('None')

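# remove duplicate dbSNP ids by choosing the most recently reviewed variant; dates are parsed so
# that the sort is chronological and entries without a parseable date are placed last
# (the key argument requires pandas >= 1.1)
var_metadata = var_metadata.sort_values('LastEvaluated', ascending=False, na_position='last',
                                        key=lambda col: pandas.to_datetime(col, errors='coerce'))
var_metadata = var_metadata.drop_duplicates(subset='RS# (dbSNP)', keep='first')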
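
Selecting the most recently reviewed record per identifier depends on comparing dates rather than date strings. The standard-library sketch below shows the difference on made-up records. `most_recent_per_id` is a hypothetical helper, and the `'%m/%d/%Y'` layout is assumed for this example only, so the format should be checked against the downloaded file.

```python
from datetime import datetime


def most_recent_per_id(records, date_format):
    """Keeps, for each 'rs_id', the record with the latest parseable 'last_evaluated' date."""
    best = {}
    for record in records:
        try:
            evaluated = datetime.strptime(record['last_evaluated'], date_format)
        except ValueError:
            # unparseable or missing dates sort before every real date
            evaluated = datetime.min
        if record['rs_id'] not in best or evaluated > best[record['rs_id']][0]:
            best[record['rs_id']] = (evaluated, record)
    return [record for _, record in best.values()]


# made-up records; the '%m/%d/%Y' layout is assumed for this example only
records = [{'rs_id': 1, 'last_evaluated': '9/1/2014', 'significance': 'uncertain'},
           {'rs_id': 1, 'last_evaluated': '2/23/2015', 'significance': 'pathogenic'},
           {'rs_id': 2, 'last_evaluated': 'None', 'significance': 'benign'}]

latest = most_recent_per_id(records, '%m/%d/%Y')
```

Compared as plain strings, `'9/1/2014'` sorts after `'2/23/2015'`, so a descending string sort would keep the older review for `rs_id` 1.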


_Create Variant Metadata Dictionary_  
The nested dictionary of variant metadata is created by looping over the human [ClinVar Variant](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#clinvar) identifier data set ([`variant_summary.txt`](https://www.dropbox.com/s/nqsgf92jhu7690e/variant_summary.txt?dl=1)). The `keys` of the dictionary are `dbSNP identifiers` and the `values` are dictionaries for each metadata type: `Label` and `Description`.

In [None]:
# create metadata
variant, label, description = [], [], []

for idx, row in tqdm(var_metadata.iterrows(), total=var_metadata.shape[0]):
    # node -- skip rows without an identifier so that all three lists stay aligned
    if row['RS# (dbSNP)'] == 'None':
        continue
    variant.append('rs' + str(row['RS# (dbSNP)']))
    
    # label -- fall back to the identifier when there is no name
    if row['Name'] != 'None':
        label.append(row['Name'])
    else:
        label.append('dbSNP_ID:rs' + str(row['RS# (dbSNP)']))
    
    # description
    sent = "This variant is a {Origin} {Type} that results when a {ReferenceAllele} allele is changed to {AlternateAllele} on chromosome {Chromosome} ({ChromosomeAccession}, start:{Start}/stop:{Stop} positions, cytogenetic location:{Cytogenetic}) and has clinical significance '{ClinicalSignificance}'. This entry is for the {Assembly} and was last reviewed on {LastEvaluated} with review status '{ReviewStatus}'."
    description.append(sent.format(Origin=row['Origin'], Type=row['Type'], ReferenceAllele=row['ReferenceAllele'],
                                   AlternateAllele=row['AlternateAllele'], Chromosome=row['Chromosome'],
                                   ChromosomeAccession=row['ChromosomeAccession'], Start=row['Start'],
                                   Stop=row['Stop'], Cytogenetic=row['Cytogenetic'], ClinicalSignificance=row['ClinicalSignificance'],
                                   Assembly=row['Assembly'], LastEvaluated=row['LastEvaluated'], ReviewStatus=row['ReviewStatus']))

# combine into new data frame
var_metadata_final = pandas.DataFrame(list(zip(variant, label, description)), columns =['ID', 'Label', 'Description'])

# drop duplicates
var_metadata_final.drop_duplicates(subset=None, keep='first', inplace=True)

# make all variables string
var_metadata_final = var_metadata_final.astype(str)

# convert df to dictionary
var_metadata_final.set_index('ID', inplace=True)
var_metadata_dict = var_metadata_final.to_dict('index') 


<br>

**Create Master Metadata Dictionary**  
To make it easier to navigate the mapping of each instance node in an edge, a master dictionary is created and keyed by node type. This is most useful when both nodes in an edge are instances, but of different data types (e.g. `gene-rna`).


In [None]:
master_metadata_dictionary = {'gene': gene_metadata_dict,
                              'rna': rna_metadata_dict,
                              'variant': var_metadata_dict}
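
Keying the master dictionary by node type means that the edge name itself says which dictionary to consult for each node. The sketch below uses toy dictionaries shaped like the ones built above. `lookup_edge_metadata` is a hypothetical helper written for illustration, not the `metadata_dictionary_mapper` function from the helper script, and all values are made up.

```python
# toy metadata dictionaries shaped like the ones built above (values are made up)
master_metadata = {'gene': {'1234': {'Label': 'GENE1', 'Description': 'None', 'Synonym': 'None'}},
                   'rna': {'ENST0001': {'Label': 'GENE1-201', 'Description': 'None', 'Synonym': 'None'}}}


def lookup_edge_metadata(edge_type, edge, metadata):
    """Hypothetical helper: returns the metadata found for each node of one edge."""
    found = {}
    for node_type, node in zip(edge_type.split('-'), edge):
        # node types without a dictionary (e.g. those served by an API) are skipped
        if node_type in metadata and node in metadata[node_type]:
            found[node] = metadata[node_type][node]
    return found


gene_rna = lookup_edge_metadata('gene-rna', ['1234', 'ENST0001'], master_metadata)
chemical_gene = lookup_edge_metadata('chemical-gene', ['CHEBI_1', '1234'], master_metadata)
```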


<br>

***

### Write Metadata Files  <a class="anchor" id="write-metadata-files"></a>   
Using the `Master Metadata Dictionary` created in the prior step, all of the `edge-type` data is processed and the resulting data is written out as `.txt` files to the `./resources/node_data` directory.

- [Genes](#gene-metadata)
- [RNA](#rna-metadata)
- [Pathways](#pathway-metadata)
- [Variants](#variant-metadata)

***

### Genes <a class="anchor" id="gene-metadata"></a>

**Data Source Wiki Pages:** [NCBI Gene](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#ncbi-gene) 

**Output:**  
- chemical-gene ➞ [`chemical-gene_GENE_METADATA.txt`](https://www.dropbox.com/s/jc332b9vo8qc1rm/chemical-gene_GENE_METADATA.txt?dl=1) 
- gene-disease ➞ [`gene-disease_GENE_METADATA.txt`](https://www.dropbox.com/s/1gk34aoj28ze6r8/gene-disease_GENE_METADATA.txt?dl=1) 
- gene-gene ➞ [`gene-gene_GENE_METADATA.txt`](https://www.dropbox.com/s/vxlgc7iblh02frp/gene-gene_GENE_METADATA.txt?dl=1) 
- gene-pathway ➞ [`gene-pathway_GENE_METADATA.txt`](https://www.dropbox.com/s/vh5l5kwwxww1twc/gene-pathway_GENE_METADATA.txt?dl=1) 
- gene-phenotype ➞ [`gene-phenotype_GENE_METADATA.txt`](https://www.dropbox.com/s/t1e7l6xqu4c1q8s/gene-phenotype_GENE_METADATA.txt?dl=1) 
- gene-protein ➞ [`gene-protein_GENE_METADATA.txt`](https://www.dropbox.com/s/4t9inhmxw0mluh2/gene-protein_GENE_METADATA.txt?dl=1) 
- gene-rna ➞ [`gene-rna_GENE_METADATA.txt`](https://www.dropbox.com/s/eoii2ee0j6j64id/gene-rna_GENE_METADATA.txt?dl=1) 

In [None]:
node_type = 'gene'

for edge_type in tqdm(metadata_file_info[node_type]):
    print('\nPROCESSING EDGE TYPE: {}'.format(edge_type))

    # gather vars for processing data
    data = metadata_file_info[node_type][edge_type]['data']
    edge_data_type = metadata_file_info[node_type][edge_type]['inst_sbc_idx']
    if 'entity' in edge_data_type: inst_idx = edge_data_type.split('-').index('entity')

    # get list of nodes to map and the dictionary to use
    # when nodes are of the same type (i.e. gene-gene)
    if (edge_type.split('-')[0] == edge_type.split('-')[1]):
        nodes = set([x for y in data for x in y])
        
        if edge_type.split('-')[0] in master_metadata_dictionary.keys():
            metadata_dictionaries = master_metadata_dictionary[edge_type.split('-')[0]]
            results = metadata_dictionary_mapper(nodes, metadata_dictionaries)
        else:
            results = metadata_api_mapper(list(nodes))
            
    # when nodes are both instances, but different types (i.e. gene-rna)
    elif edge_data_type.split('-')[0] == edge_data_type.split('-')[1]:
        data_res = []

        for node in edge_type.split('-'):
            nodes = set([x[int(edge_type.split('-').index(node))] for x in data])
            
            if node in master_metadata_dictionary:
                metadata_dictionaries = master_metadata_dictionary[node]
                data_res.append(metadata_dictionary_mapper(nodes, metadata_dictionaries))
            else:
                data_res.append(metadata_api_mapper(list(nodes)))
    
        # combine data into single df
        results = pandas.concat(data_res, ignore_index=True)
                
    # when only one node is an instance
    else:
        nodes = set([x[int(inst_idx)] for x in data])
        
        if edge_type.split('-')[int(inst_idx)] in master_metadata_dictionary.keys():
            metadata_dictionaries = master_metadata_dictionary[edge_type.split('-')[int(inst_idx)]]
            results = metadata_dictionary_mapper(nodes, metadata_dictionaries)
        else:
            results = metadata_api_mapper(list(nodes))

    # write data
    results.to_csv(node_data_location + edge_type + '_' + node_type.upper() + '_METADATA.txt',
                   header=True,
                   sep='\t',
                   index=False)
    

<br>
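
The three branches in the cell above decide which columns of each edge list are sent to the metadata lookup. The sketch below restates that decision as a standalone function so it can be checked in isolation. It is only an illustration: `positions_to_map` is a hypothetical name, and the `inst_sbc_idx` values are assumed to be two hyphen-separated labels in which `entity` marks an instance node (the other label is written `class` here purely for illustration).

```python
def positions_to_map(edge_type, inst_sbc_idx):
    """Return (node_label, column_indices) pairs to look up for one edge type.

    Hypothetical helper; it mirrors the if/elif/else logic of the cell above.
    """
    labels = edge_type.split('-')    # e.g. 'gene-rna' -> ['gene', 'rna']
    kinds = inst_sbc_idx.split('-')  # assumed form, e.g. 'entity-class'

    # same node type on both sides (e.g. gene-gene): pool both columns
    if labels[0] == labels[1]:
        return [(labels[0], [0, 1])]

    # both sides share the same kind but differ in type (e.g. gene-rna):
    # each column is looked up separately and the results are concatenated
    if kinds[0] == kinds[1]:
        return [(labels[0], [0]), (labels[1], [1])]

    # only one side is an instance: look up just that column
    idx = kinds.index('entity')
    return [(labels[idx], [idx])]


print(positions_to_map('gene-gene', 'entity-entity'))
print(positions_to_map('gene-rna', 'entity-entity'))
print(positions_to_map('chemical-gene', 'class-entity'))
```

Computing the instance index inside the final branch, rather than once before the branches, avoids carrying a value over from a previous loop iteration when the current edge type has no single instance column.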

***

### RNA<a class="anchor" id="rna-metadata"></a>

**Data Source Wiki Pages:**  
- [Ensembl](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#clinvar)  
- [NCBI Gene](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#ncbi-gene) 

**Output:**  
- chemical-rna ➞ [`chemical-rna_RNA_METADATA.txt`](https://www.dropbox.com/s/eu2ihol7xyxuhrg/chemical-rna_RNA_METADATA.txt?dl=1) 
- rna-anatomy ➞ [`rna-anatomy_RNA_METADATA.txt`](https://www.dropbox.com/s/dsktusrpeit9cpn/rna-anatomy_RNA_METADATA.txt?dl=1) 
- rna-cell ➞ [`rna-cell_RNA_METADATA.txt`](https://www.dropbox.com/s/nnqb7911hdhbw7q/rna-cell_RNA_METADATA.txt?dl=1) 
- rna-protein ➞ [`rna-protein_RNA_METADATA.txt`](https://www.dropbox.com/s/gjq1hydldyds4rz/rna-protein_RNA_METADATA.txt?dl=1) 

In [None]:
node_type = 'rna'

for edge_type in tqdm(metadata_file_info[node_type]):
    print('\nPROCESSING EDGE TYPE: {}'.format(edge_type))

    # gather vars for processing data
    data = metadata_file_info[node_type][edge_type]['data']
    edge_data_type = metadata_file_info[node_type][edge_type]['inst_sbc_idx']
    if 'entity' in edge_data_type: inst_idx = edge_data_type.split('-').index('entity')

    # get list of nodes to map and the dictionary to use
    # when nodes are of the same type (e.g. gene-gene)
    if (edge_type.split('-')[0] == edge_type.split('-')[1]):
        nodes = set([x for y in data for x in y])
        
        if edge_type.split('-')[0] in master_metadata_dictionary.keys():
            metadata_dictionaries = master_metadata_dictionary[edge_type.split('-')[0]]
            results = metadata_dictionary_mapper(nodes, metadata_dictionaries)
        else:
            results = metadata_api_mapper(list(nodes))
            
    # when nodes are both instances, but different types (e.g. gene-rna)
    elif edge_data_type.split('-')[0] == edge_data_type.split('-')[1]:
        data_res = []

        for node in edge_type.split('-'):
            nodes = set([x[int(edge_type.split('-').index(node))] for x in data])
            
            if node in master_metadata_dictionary:
                metadata_dictionaries = master_metadata_dictionary[node]
                data_res.append(metadata_dictionary_mapper(nodes, metadata_dictionaries))
            else:
                data_res.append(metadata_api_mapper(list(nodes)))
    
        # combine data into single df
        results = pandas.concat(data_res, ignore_index=True)
                
    # when only one node is an instance
    else:
        nodes = set([x[int(inst_idx)] for x in data])
        
        if edge_type.split('-')[int(inst_idx)] in master_metadata_dictionary.keys():
            metadata_dictionaries = master_metadata_dictionary[edge_type.split('-')[int(inst_idx)]]
            results = metadata_dictionary_mapper(nodes, metadata_dictionaries)
        else:
            results = metadata_api_mapper(list(nodes))

    # write data
    results.to_csv(node_data_location + edge_type + '_' + node_type.upper() + '_METADATA.txt',
                   header=True,
                   sep='\t',
                   index=False)
    

<br>
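
The write step in the cell above builds the output file name by string concatenation, which only yields a valid path when `node_data_location` ends with a path separator. A minimal sketch of a more forgiving alternative follows. The function name `metadata_file_path` and the example directory are hypothetical, and the sketch is not part of the notebook's helper functions.

```python
import os


def metadata_file_path(node_data_location, edge_type, node_type):
    """Build '<dir>/<edge_type>_<NODE_TYPE>_METADATA.txt' (hypothetical helper)."""
    file_name = '{}_{}_METADATA.txt'.format(edge_type, node_type.upper())
    # os.path.join inserts a separator only when the directory lacks one
    return os.path.join(node_data_location, file_name)


# 'some/output/dir' is a placeholder, not the notebook's actual location
print(metadata_file_path('some/output/dir/', 'rna-protein', 'rna'))
```

The returned string could be passed as the first argument of `results.to_csv(...)` in place of the concatenated expression.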

***

### Pathways<a class="anchor" id="pathway-metadata"></a>

**Data Source Wiki Page:** [Reactome](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#reactome-pathway-database)  

**Output:**    
- chemical-pathway ➞ [`chemical-pathway_PATHWAY_METADATA.txt`](https://www.dropbox.com/s/drs7av9ie5a6omr/chemical-pathway_PATHWAY_METADATA.txt?dl=1)
- gobp-pathway ➞ [`gobp-pathway_PATHWAY_METADATA.txt`](https://www.dropbox.com/s/wkfd7qhlkndzjay/gobp-pathway_PATHWAY_METADATA.txt?dl=1)
- pathway-gocc ➞ [`pathway-gocc_PATHWAY_METADATA.txt`](https://www.dropbox.com/s/tknn5uyobls9f4n/pathway-gocc_PATHWAY_METADATA.txt?dl=1)
- pathway-gomf ➞ [`pathway-gomf_PATHWAY_METADATA.txt`](https://www.dropbox.com/s/zr6arq66qmz08ws/pathway-gomf_PATHWAY_METADATA.txt?dl=1)
- protein-pathway ➞ [`protein-pathway_PATHWAY_METADATA.txt`](https://www.dropbox.com/s/cuvgwh2r8qtf08j/protein-pathway_PATHWAY_METADATA.txt?dl=1)

In [None]:
node_type = 'pathway'

for edge_type in tqdm(metadata_file_info[node_type]):
    print('\nPROCESSING EDGE TYPE: {}'.format(edge_type))

    # gather vars for processing data
    data = metadata_file_info[node_type][edge_type]['data']
    edge_data_type = metadata_file_info[node_type][edge_type]['inst_sbc_idx']
    if 'entity' in edge_data_type: inst_idx = edge_data_type.split('-').index('entity')

    # get list of nodes to map and the dictionary to use
    # when nodes are of the same type (e.g. gene-gene)
    if (edge_type.split('-')[0] == edge_type.split('-')[1]):
        nodes = set([x for y in data for x in y])
        
        if edge_type.split('-')[0] in master_metadata_dictionary.keys():
            metadata_dictionaries = master_metadata_dictionary[edge_type.split('-')[0]]
            results = metadata_dictionary_mapper(nodes, metadata_dictionaries)
        else:
            results = metadata_api_mapper(list(nodes))
            
    # when nodes are both instances, but different types (e.g. gene-rna)
    elif edge_data_type.split('-')[0] == edge_data_type.split('-')[1]:
        data_res = []

        for node in edge_type.split('-'):
            nodes = set([x[int(edge_type.split('-').index(node))] for x in data])
            
            if node in master_metadata_dictionary:
                metadata_dictionaries = master_metadata_dictionary[node]
                data_res.append(metadata_dictionary_mapper(nodes, metadata_dictionaries))
            else:
                data_res.append(metadata_api_mapper(list(nodes)))
    
        # combine data into single df
        results = pandas.concat(data_res, ignore_index=True)
                
    # when only one node is an instance
    else:
        nodes = set([x[int(inst_idx)] for x in data])
        
        if edge_type.split('-')[int(inst_idx)] in master_metadata_dictionary.keys():
            metadata_dictionaries = master_metadata_dictionary[edge_type.split('-')[int(inst_idx)]]
            results = metadata_dictionary_mapper(nodes, metadata_dictionaries)
        else:
            results = metadata_api_mapper(list(nodes))

    # write data
    results.to_csv(node_data_location + edge_type + '_' + node_type.upper() + '_METADATA.txt',
                   header=True,
                   sep='\t',
                   index=False)
    

<br>

***

### Variants<a class="anchor" id="variant-metadata"></a>

**Data Source Wiki Page:** [ClinVar](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#clinvar)  

**Output:**  
- variant-disease ➞ [`variant-disease_VARIANT_METADATA.txt`](https://www.dropbox.com/s/eiyh1bc7lttdmut/variant-disease_VARIANT_METADATA.txt?dl=1)  
- variant-gene ➞ [`variant-gene_VARIANT_METADATA.txt`](https://www.dropbox.com/s/luafq6fa07eljn9/variant-gene_VARIANT_METADATA.txt?dl=1)  
- variant-phenotype ➞ [`variant-phenotype_VARIANT_METADATA.txt`](https://www.dropbox.com/s/fhphtuy6ecq1bkx/variant-phenotype_VARIANT_METADATA.txt?dl=1)  

In [None]:
node_type = 'variant'

for edge_type in tqdm(metadata_file_info[node_type]):
    print('\nPROCESSING EDGE TYPE: {}'.format(edge_type))

    # gather vars for processing data
    data = metadata_file_info[node_type][edge_type]['data']
    edge_data_type = metadata_file_info[node_type][edge_type]['inst_sbc_idx']
    if 'entity' in edge_data_type: inst_idx = edge_data_type.split('-').index('entity')

    # get list of nodes to map and the dictionary to use
    # when nodes are of the same type (e.g. gene-gene)
    if (edge_type.split('-')[0] == edge_type.split('-')[1]):
        nodes = set([x for y in data for x in y])
        
        if edge_type.split('-')[0] in master_metadata_dictionary.keys():
            metadata_dictionaries = master_metadata_dictionary[edge_type.split('-')[0]]
            results = metadata_dictionary_mapper(nodes, metadata_dictionaries)
        else:
            results = metadata_api_mapper(list(nodes))
            
    # when nodes are both instances, but different types (e.g. gene-rna)
    elif edge_data_type.split('-')[0] == edge_data_type.split('-')[1]:
        data_res = []

        for node in edge_type.split('-'):
            nodes = set([x[int(edge_type.split('-').index(node))] for x in data])
            
            if node in master_metadata_dictionary:
                metadata_dictionaries = master_metadata_dictionary[node]
                data_res.append(metadata_dictionary_mapper(nodes, metadata_dictionaries))
            else:
                data_res.append(metadata_api_mapper(list(nodes)))
    
        # combine data into single df
        results = pandas.concat(data_res, ignore_index=True)
                
    # when only one node is an instance
    else:
        nodes = set([x[int(inst_idx)] for x in data])
        
        if edge_type.split('-')[int(inst_idx)] in master_metadata_dictionary.keys():
            metadata_dictionaries = master_metadata_dictionary[edge_type.split('-')[int(inst_idx)]]
            results = metadata_dictionary_mapper(nodes, metadata_dictionaries)
        else:
            results = metadata_api_mapper(list(nodes))

    # write data
    results.to_csv(node_data_location + edge_type + '_' + node_type.upper() + '_METADATA.txt',
                   header=True,
                   sep='\t',
                   index=False)
    


<br>
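
Each metadata file above is written as tab-separated text with a header row and no index column. A quick way to sanity-check such a file after writing is to read the header and count the data rows. The sketch below uses only the standard library. `summarize_metadata` is a hypothetical helper, and the column names in the sample are invented for illustration, since the real columns depend on what the mapper functions return.

```python
import csv
import io


def summarize_metadata(handle):
    """Return (header, number_of_data_rows) for a tab-separated file handle."""
    reader = csv.reader(handle, delimiter='\t')
    header = next(reader, [])        # empty list if the file has no rows
    n_rows = sum(1 for _ in reader)  # remaining lines are data rows
    return header, n_rows


# in-memory stand-in for a *_METADATA.txt file; column names are made up
sample = io.StringIO('ID\tLabel\nA1\tfirst node\nB2\tsecond node\n')
print(summarize_metadata(sample))
```

For a file on disk, the same function can be called inside `with open(path, newline='') as handle:`.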

***
***

```
@misc{callahan_tj_2019_3401437,
  author       = {Callahan, TJ},
  title        = {PheKnowLator},
  month        = mar,
  year         = 2019,
  doi          = {10.5281/zenodo.3401437},
  url          = {https://doi.org/10.5281/zenodo.3401437}
}
```