***
***

<img width='700' src="https://user-images.githubusercontent.com/8030363/108961534-b9a66980-7634-11eb-96e2-cc46589dcb8c.png" style="vertical-align:middle">

## Pre-Knowledge Graph Build Data Preparation
***

**Author:** [TJCallahan](https://mail.google.com/mail/u/0/?view=cm&fs=1&tf=1&to=callahantiff@gmail.com)  
**GitHub Repository:** [PheKnowLator](https://github.com/callahantiff/PheKnowLator/wiki)  
**Release:** **[`v4.0.0`](https://github.com/callahantiff/PheKnowLator/wiki/v4.0.0)**
  
<br>  
  
**Purpose:** This notebook serves as a script to download and process data in order to generate mapping and filtering data needed to build edges for the PheKnowLator knowledge graph. For more information on the data sources utilized within this script, please see the [Data Sources](https://github.com/callahantiff/PheKnowLator/wiki/v4-Data-Sources) Wiki page.

<br>

**Assumptions:**   
- Raw data downloads ➞ `./resources/processed_data/unprocessed_data`    
- Processed data write location ➞ `./resources/processed_data`  

<br>

**Dependencies:**   
- **Scripts**: This notebook utilizes several helper functions, which are stored in the [`data_utils.py`](https://github.com/callahantiff/PheKnowLator/blob/master/pkt_kg/utils/data_utils.py) and [`kg_utils.py`](https://github.com/callahantiff/PheKnowLator/blob/master/pkt_kg/utils/kg_utils.py) scripts.  
- **Data**: Hyperlinks to all downloaded and generated data sources are provided through [this](https://console.cloud.google.com/storage/browser/pheknowlator/release_v4.0.0?project=pheknowlator) dedicated Google Cloud Storage Bucket. <u>This notebook will download everything that is needed for you</u>.  
_____
***

## Table of Contents
***

### [Identifier Maps](#create-identifier-maps)  
- [HUMAN TRANSCRIPT, GENE, AND PROTEIN IDENTIFIER MAPPING](#human-transcript,-gene,-and-protein-identifier-mapping)
  - [Entrez Gene-Ensembl Transcript](#entrezgene-ensembltranscript)  
  - [Entrez Gene-Protein Ontology](#entrezgene-proteinontology)  
  - [Ensembl Gene-Entrez Gene](#ensemblgene-entrezgene)
  - [Gene Symbol-Ensembl Transcript](#genesymbol-ensembltranscript)  
  - [STRING-Protein Ontology](#string-proteinontology)  
  - [Uniprot Accession-Protein Ontology](#uniprotaccession-proteinontology)  
  - [Uniprot Accession-Entrez Gene](#uniprotaccession-entrezgene)
  

- [OTHER IDENTIFIER MAPPING](#other-identifier-mapping) 
  - [ChEBI Identifiers](#mesh-chebi) 
  - [Human Disease and Phenotype Identifiers](#disease-identifiers)
  - [Human Protein Atlas Tissue and Cell Types](#hpa-uberon)  
  - [Reactome Pathways - Pathway Ontology](#reactome-pw)  
  - [Genomic Identifiers - Sequence Ontology](#genomic-soo)  


### [Edge Datasets](#create-edge-datasets)
- [ONTOLOGIES](#ontologies)  
  - [Protein Ontology](#protein-ontology)  
  - [Relations Ontology](#relations-ontology)  


- [LINKED DATA](#linked-data)  
  - [Clinvar Variant-Diseases and Phenotypes](#clinvar-variant)
  - [Uniprot Protein-Cofactor and Protein-Catalyst](#uniprot-protein-cofactorcatalyst)  


### [Node and Relation Metadata](#node-relation-metadata)  
- [CTD_chem_gene_ixns.tsv](#chemical-gene)  
- [CTD_chem_go_enriched.tsv](#chemical-go)    
- [CTD_chemicals_diseases.tsv](#chemical-disease)  
- [CTD_genes_pathways.tsv](#gene-pathway)     
- [goa_human.gaf](#goa)     
- [COMBINED.DEFAULT_NETWORKS.BP_COMBINING.txt](#gene-gene)    
- [phenotype.hpoa](#phenotype-disease) 
- [ChEBI2Reactome_All_Levels.txt](#chemical-pathway)   
- [gene_association.reactome](#reactome-goa)    
- [UniProt2Reactome_All_Levels.txt](#uniprot-react)     
- [CLINVAR_VARIANT_DISEASE_PHENOTYPE_EDGES.txt](#variant-disease)    
- [CLINVAR_VARIANT_GENE_EDGES.txt](#variant-gene)   
- [HPA_GTEX_RNA_GENE_PROTEIN_EDGES.txt](#hpa) 
- [UNIPROT_PROTEIN_CATALYST.txt](#uniprot-catalyst)  
- [UNIPROT_PROTEIN_COFACTOR.txt](#uniprot-cofactor)  
- [9606.protein.links.v11.0.txt.gz](#protein-protein)  
- [curated_gene_disease_associations.tsv](#gene-phen)  
____

<br>

## Set-Up Environment
_____

In [None]:
# # uncomment and run to install any required modules from notebooks/requirements.txt
# import sys
# !{sys.executable} -m pip install -r requirements.txt

In [None]:
# # if running a local version of pkt_kg, uncomment the code below
# import sys
# sys.path.append('../')

In [None]:
# import needed libraries
import datetime
import glob
import itertools
import json
import networkx
import numpy
import os
import openpyxl
import pandas
import pickle
import re
import requests
import shutil
import sys

from collections import Counter
from functools import reduce
from rdflib import Graph, Namespace, URIRef, BNode, Literal
from rdflib.namespace import OWL, RDF, RDFS
from reactome2py import content
from tqdm.notebook import tqdm
from typing import Dict

from pkt_kg.utils import *  # import pkt_kg utility script containing helper functions

#### Define Global Variables

In [None]:
# directory to use for processing data
unprocessed_data_location = '../resources/processed_data/unprocessed_data/'
processed_data_location = '../resources/processed_data/'

# directory to write relations data to
relations_data_location = '../resources/relations_data/'

# directory to write node metadata to
metadata_location = '../resources/metadata/'

# directory to write kg construction approach dictionary to
construction_approach_location = '../resources/construction_approach/'

# directory to write ontology data to
ontology_data_location = '../resources/ontologies/'

# owltools location
owltools_location = '../pkt_kg/libs/owltools'

# obo namespace
obo = Namespace('http://purl.obolibrary.org/obo/')

<br>

***
***
### CREATE MAPPING DATASETS  <a class="anchor" id="create-identifier-maps"></a>
***
***

### Human Transcript, Gene, and Protein Identifier Mapping  <a class="anchor" id="human-transcript,-gene,-and-protein-identifier-mapping"></a>
***

**Data Source Wiki Pages:**   
- [Ensembl](https://uswest.ensembl.org/)  
- [Uniprot Knowledgebase](https://github.com/callahantiff/PheKnowLator/wiki/v4-Data-Sources#universal-protein-resource-knowledgebase)  
- [HGNC](ftp://ftp.ebi.ac.uk/pub/databases/genenames/new/tsv/hgnc_complete_set.txt) 
- [NCBI Gene](https://github.com/callahantiff/PheKnowLator/wiki/v4-Data-Sources#national-center-for-biotechnology-information-gene) 
- [Protein Ontology](https://github.com/callahantiff/PheKnowLator/wiki/v4-Data-Sources/#protein-ontology)

<br>

**Purpose:** To create `protein-coding gene`-`protein` edges and mappings between the identifier types listed below. The edge types produced from each of these mappings will be further described within each of the subsequent identifier mapping sections:  
- [Entrez Gene-Ensembl Transcript](#entrezgene-ensembltranscript)  
- [Entrez Gene-Protein Ontology](#entrezgene-proteinontology)  
- [Ensembl Gene-Entrez Gene](#ensemblgene-entrezgene)
- [Gene Symbol-Ensembl Transcript](#genesymbol-ensembltranscript)  
- [STRING-Protein Ontology](#string-proteinontology)  
- [Uniprot Accession-Protein Ontology](#uniprotaccession-proteinontology)
- [Uniprot Accession-Entrez Gene](#uniprotaccession-entrezgene)

<br>

**Gene and Transcript Types:** The transcript and gene/locus types were reviewed by a PhD molecular biologist to confirm whether each should be classified as `protein-coding`, which is useful for creating `genomic`-`rna`, `genomic`-`protein`, and `rna`-`protein` edges in the knowledge graph. For more information on this classification, please see the table below. Definitions of concepts in the table have been taken from [HGNC](https://www.genenames.org/help/symbol-report/), [Ensembl](https://uswest.ensembl.org/info/genome/genebuild/biotypes.html), [NCBI](https://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/lxr/source/src/objects/entrezgene/entrezgene.asn), and Wikipedia.

<table>
<tr>
<th align="center">Gene and Transcript Type</th>  
<th align="center">Definition</th>
<th align="center">Type</th>
<th align="center">Genomic material <i>transcribed_to</i> RNA</th>
<th align="center">RNA <i>translated_to</i> Protein</th>
<th align="center">Genomic material <i>has_gene_product</i> Protein</th>
</tr>
<tr>
  <td rowspan="2">biological-region</td> 
  <td rowspan="2">Biological_region (SO:0001411); Special note: This is a parental feature spanning all other feature annotation on each RefSeq Functional Element record. It is a 'misc_feature' in GenBank flat files but a 'Region' feature in ASN.1 and GFF3 formats</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
  <td>transcript</td> 
  <td> --- </td> 
  <td> --- </td> 
  <td> --- </td> 
</tr>
<tr>
  <td rowspan="2">IG_C_gene</td> 
  <td rowspan="2">Constant chain immunoglobulin gene that undergoes somatic recombination before transcription</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
  <td>transcript</td> 
  <td>yes</td> 
  <td>yes</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">IG_C_pseudogene</td> 
  <td rowspan="2">Inactivated immunoglobulin gene. Immunoglobulin gene segments that are inactivated due to frameshift mutations and/or stop codons in the open reading frame</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>yes</td> 
  <td>no</td> 
</tr>	 	 
<tr>
  <td rowspan="2">IG_D_gene</td> 
  <td rowspan="2">Diversity chain immunoglobulin gene that undergoes somatic recombination before transcription</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>yes</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">IG_J_gene</td> 
  <td rowspan="2">IG J gene: Joining chain immunoglobulin gene that undergoes somatic recombination before transcription</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>yes</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">IG_J_pseudogene</td> 
  <td rowspan="2">Inactivated immunoglobulin gene. Immunoglobulin gene segments that are inactivated due to frameshift mutations and/or stop codons in the open reading frame</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>yes</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">IG_pseudogene</td> 
  <td rowspan="2">Inactivated immunoglobulin gene. Immunoglobulin gene segments that are inactivated due to frameshift mutations and/or stop codons in the open reading frame</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>yes</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">IG_V_gene</td> 
  <td rowspan="2">Variable chain immunoglobulin gene that undergoes somatic recombination before transcription</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>yes</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">IG_V_pseudogene</td> 
  <td rowspan="2">Inactivated immunoglobulin gene. Immunoglobulin gene segments that are inactivated due to frameshift mutations and/or stop codons in the open reading frame</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>yes</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">lncRNA</td> 
  <td rowspan="2">RNA, long non-coding - non-protein coding genes that encode long non-coding RNAs (lncRNAs) (SO:0001877); these are at least 200 nt in length. Subtypes include intergenic (SO:0001463), intronic (SO:0001903) and antisense (SO:0001904)</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">miRNA</td> 
  <td rowspan="2">RNA, micro - non-protein coding genes that encode microRNAs (miRNAs) (SO:0001265)</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">misc_RNA</td> 
  <td rowspan="2">Non-protein coding genes that encode miscellaneous types of small ncRNAs, such as vault (SO:0000404) and Y (SO:0000405) RNA genes</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">Mt_rRNA</td> 
  <td rowspan="2">Mitochondrial rRNA</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">Mt_tRNA</td> 
  <td rowspan="2">Mitochondrial tRNA</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">ncRNA</td> 
  <td rowspan="2">Noncoding RNA</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
<td>transcript</td> 
  <td> --- </td> 
  <td> --- </td> 
  <td> --- </td> 
</tr>
<tr>
  <td rowspan="2">non_stop_decay</td> 
  <td rowspan="2">Transcripts that have polyA features (including signal) without a prior stop codon in the CDS, i.e. a non-genomic polyA tail attached directly to the CDS without 3' UTR. These transcripts are subject to degradation</td>
  <td>gene</td> 
  <td> --- </td> 
  <td> --- </td> 
  <td> --- </td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">nonsense_mediated_decay</td> 
  <td rowspan="2">If the coding sequence (following the appropriate reference) of a transcript finishes >50bp from a downstream splice site then it is tagged as NMD. If the variant does not cover the full reference coding sequence then it is annotated as NMD if NMD is unavoidable i.e. no matter what the exon structure of the missing portion is the transcript will be subject to NMD</td>
  <td>gene</td> 
  <td> --- </td> 
  <td> --- </td> 
  <td> --- </td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>yes</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">other</td> 
  <td rowspan="2">other</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">phenotype</td> 
  <td rowspan="2"> Mapped phenotypes where the causative gene has not been identified (SO:0001500) </td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
<td>transcript</td> 
  <td> --- </td> 
  <td> --- </td> 
  <td> --- </td> 
</tr>
<tr>
  <td rowspan="2">polymorphic_pseudogene</td> 
  <td rowspan="2">Pseudogene owing to a SNP/DIP but in other individuals/haplotypes/strains the gene is translated</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>yes</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">processed_pseudogene</td> 
  <td rowspan="2">Pseudogene that lack introns and is thought to arise from reverse transcription of mRNA followed by reinsertion of DNA into the genome</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>yes</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">processed_transcript</td> 
  <td rowspan="2">Gene/transcript that doesn't contain an open reading frame</td>
  <td>gene</td> 
  <td> --- </td> 
  <td> --- </td> 
  <td> --- </td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>yes</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">protein_coding</td> 
  <td rowspan="2">Contains an open reading frame (ORF)</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>yes</td> 
  <td>yes</td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>yes</td> 
</tr>
<tr>
  <td rowspan="2">pseudogene</td> 
  <td rowspan="2">Have homology to proteins but generally suffer from a disrupted coding sequence and an active homologous gene can be found at another locus. Sometimes these entries have an intact coding sequence or an open but truncated open reading frame, in which case there is other evidence used (for example genomic polyA stretches at the 3' end) to classify them as a pseudogene</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>yes</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">retained_intron</td> 
  <td rowspan="2">Has an alternatively spliced transcript believed to contain intronic sequence relative to other, coding, variants</td>
  <td>gene</td> 
  <td> --- </td> 
  <td> --- </td> 
  <td> --- </td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>yes</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">ribozyme</td> 
  <td rowspan="2">Ribozymes are RNA molecules that have the ability to catalyze specific biochemical reactions, including RNA splicing in gene expression, similar to the action of protein enzymes</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>yes</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">rRNA</td> 
  <td rowspan="2">RNA, ribosomal - non-protein coding genes that encode ribosomal RNAs (rRNAs) (SO:0001637)</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">rRNA_pseudogene</td> 
  <td rowspan="2">A gene that has homology to known protein-coding genes but contain a frameshift and/or stop codon(s) which disrupts the open reading frame. Thought to have arisen through duplication followed by loss of function</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>yes</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">scaRNA</td> 
  <td rowspan="2">Small Cajal body-specific RNAs are a class of small nucleolar RNAs that specifically localize to the Cajal body, a nuclear organelle involved in the biogenesis of small nuclear ribonucleoproteins</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">scRNA</td> 
  <td rowspan="2">RNA, small cytoplasmic - non-protein coding genes that encode small cytoplasmic RNAs (scRNAs) (SO:0001266)</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">snoRNA</td> 
  <td rowspan="2">RNA, small nucleolar - non-protein coding genes that encode small nucleolar RNAs (snoRNAs) containing C/D or H/ACA box domains (SO:0001267)</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
  <td>transcript</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">snRNA</td> 
  <td rowspan="2">RNA, small nuclear - non-protein coding genes that encode small nuclear RNAs (snRNAs) (SO:0001268)</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">sRNA</td> 
  <td rowspan="2">Bacterial small RNAs (sRNA) are small RNAs produced by bacteria; they are 50- to 500-nucleotide non-coding RNA molecules, highly structured and containing several stem-loops</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">TEC</td> 
  <td rowspan="2">TEC (To be Experimentally Confirmed). This is used for non-spliced EST clusters that have polyA features. This category has been specifically created for the ENCODE project to highlight regions that could indicate the presence of protein coding genes that require experimental validation, either by 5' RACE or RT-PCR to extend the transcripts, or by confirming expression of the putatively-encoded peptide with specific antibodies</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>yes</td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>yes</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">TR_C_gene</td> 
  <td rowspan="2">Constant chain T cell receptor gene that undergoes somatic recombination before transcription</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>yes</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">TR_D_gene</td> 
  <td rowspan="2">Diversity chain T cell receptor gene that undergoes somatic recombination before transcription</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
  <td>transcript</td> 
  <td>yes</td> 
  <td>yes</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">TR_J_gene</td> 
  <td rowspan="2">Joining chain T cell receptor gene that undergoes somatic recombination before transcription</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>yes</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">TR_J_pseudogene</td> 
  <td rowspan="2">T cell receptor pseudogene - T cell receptor gene segments that are inactivated due to frameshift mutations and/or stop codons in the open reading frame</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>yes</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">TR_V_gene</td> 
  <td rowspan="2">Variable chain T cell receptor gene that undergoes somatic recombination before transcription</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>yes</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">TR_V_pseudogene</td> 
  <td rowspan="2">T cell receptor pseudogene - T cell receptor gene segments that are inactivated due to frameshift mutations and/or stop codons in the open reading frame</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>yes</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">transcribed_processed_pseudogene</td> 
  <td rowspan="2">Pseudogene where protein homology or genomic structure indicates a pseudogene, but the presence of locus-specific transcripts indicates expression. These can be classified into 'Processed', 'Unprocessed' and 'Unitary'</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">transcribed_unitary_pseudogene</td> 
  <td rowspan="2">Pseudogene where protein homology or genomic structure indicates a pseudogene, but the presence of locus-specific transcripts indicates expression. These can be classified into 'Processed', 'Unprocessed' and 'Unitary'</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">transcribed_unprocessed_pseudogene</td> 
  <td rowspan="2">Pseudogene where protein homology or genomic structure indicates a pseudogene, but the presence of locus-specific transcripts indicates expression. These can be classified into 'Processed', 'Unprocessed' and 'Unitary'</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
<td>transcript</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">translated_processed_pseudogene</td> 
  <td rowspan="2">Pseudogenes that have mass spec data suggesting that they are also translated. These can be classified into 'Processed', 'Unprocessed'</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
  <td>transcript</td> 
  <td>yes</td> 
  <td>yes</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">translated_unprocessed_pseudogene</td> 
  <td rowspan="2">Pseudogenes that have mass spec data suggesting that they are also translated. These can be classified into 'Processed', 'Unprocessed'</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
  <td>transcript</td> 
  <td>yes</td> 
  <td>yes</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">tRNA</td> 
  <td rowspan="2">Transfer RNA</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
  <td>transcript</td> 
  <td> --- </td> 
  <td> --- </td> 
  <td> --- </td> 
</tr>
<tr>
  <td rowspan="2">unitary_pseudogene</td> 
  <td rowspan="2">A species specific unprocessed pseudogene without a parent gene, as it has an active orthologue in another species</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
  <td>transcript</td> 
  <td>yes</td> 
  <td>yes</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2">unknown</td> 
  <td rowspan="2">Entries where the locus type is currently unknown</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
  <td>transcript</td> 
  <td> --- </td> 
  <td> --- </td> 
  <td> --- </td> 
</tr>
<tr>
  <td rowspan="2">unprocessed_pseudogene</td> 
  <td rowspan="2">Pseudogene that can contain introns since produced by gene duplication</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
  <td>transcript</td> 
  <td>yes</td> 
  <td>yes</td> 
  <td>no</td> 
</tr>
<tr>
  <td rowspan="2" align="center">vaultRNA</td> 
  <td rowspan="2" align="center">Short non coding RNA genes that form part of the vault ribonucleoprotein complex</td>
  <td>gene</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td>
</tr>
<tr>
  <td>transcript</td> 
  <td>yes</td> 
  <td>no</td> 
  <td>no</td> 
</tr>
</table> 
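
One way to make the table above machine-usable is a small lookup keyed by type and level. The sketch below is illustrative only: `EDGE_RULES`, `EDGE_NAMES`, and `allowed_edges` are hypothetical names that are not part of `pkt_kg`, and only three rows are copied from the table.

```python
# Hypothetical lookup built from three rows of the table above; the names
# EDGE_RULES, EDGE_NAMES, and allowed_edges are illustrative, not part of pkt_kg.
from typing import Dict, List, Tuple

# (type, level) -> (transcribed_to RNA, translated_to Protein, has_gene_product Protein)
EDGE_RULES: Dict[Tuple[str, str], Tuple[bool, bool, bool]] = {
    ('protein_coding', 'gene'): (True, True, True),
    ('lncRNA', 'gene'): (True, False, False),
    ('lncRNA', 'transcript'): (True, False, False),
}

EDGE_NAMES = ('transcribed_to', 'translated_to', 'has_gene_product')


def allowed_edges(biotype: str, level: str) -> List[str]:
    """Returns the edge types marked 'yes' for a type/level pair."""
    flags = EDGE_RULES.get((biotype, level))
    if flags is None:  # row not copied into this sketch, or marked '---' in the table
        return []
    return [name for name, flag in zip(EDGE_NAMES, flags) if flag]


print(allowed_edges('protein_coding', 'gene'))
print(allowed_edges('lncRNA', 'transcript'))
```

Returning an empty list for missing pairs treats the `---` cells in the table the same as rows that have no entry, which is a simplifying assumption of this sketch.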


<br>

**Output:** This script downloads and saves the following data:  
- Human Ensembl Gene Set ➞ `Homo_sapiens.GRCh38.<<release>>.gtf`
- Human Ensembl-UniProt Identifiers ➞ `Homo_sapiens.GRCh38.<<release>>.uniprot.tsv` 
- Human Ensembl-Entrez Identifiers ➞ `Homo_sapiens.GRCh38.<<release>>.entrez.tsv` 
- Human Gene Identifiers ➞ [`Homo_sapiens.gene_info`](ftp://ftp.ncbi.nih.gov/gene/DATA/GENE_INFO/Mammalia/Homo_sapiens.gene_info.gz), [`hgnc_complete_set.txt`](ftp://ftp.ebi.ac.uk/pub/databases/genenames/new/tsv/hgnc_complete_set.txt)  
- Human Protein Identifiers ➞ [`promapping.txt`](https://proconsortium.org/download/current/promapping.txt)  
- UniProt Identifiers ➞ [`uniprot_identifier_mapping.tab`](https://www.uniprot.org/uniprot/?query=&fil=organism%3A%22Homo%20sapiens%20(Human)%20%5B9606%5D%22&columns=id%2Cdatabase(GeneID)%2Cdatabase(Ensembl)%2Cdatabase(HGNC)%2Cgenes(PREFERRED)%2Cgenes(ALTERNATIVE))

*All Merged Data Sets:*  
- `Merged_Human_Ensembl_Entrez_HGNC_Uniprot_Identifiers.txt` 
- `Merged_gene_rna_protein_identifiers.pkl`  

***

**Genomic Typing Dictionary**  
Read in the `genomic_typing_dict.pkl` dictionary, which is needed to preprocess the genomic identifier datasets.
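
The later cells index this dictionary as `genomic_type_mapper[<column>][<original value>]`, so it appears to hold one replacement table per column. A minimal sketch of that layout and of the regrouping loop follows. The entries in `toy_mapper` are invented for illustration and are not the contents of `genomic_typing_dict.pkl`, and `regroup` is a hypothetical helper.

```python
# Toy stand-in for genomic_typing_dict.pkl; these entries are invented for
# illustration and are NOT the real contents of the pickle file.
from typing import Dict

toy_mapper: Dict[str, Dict[str, str]] = {
    'hgnc_gene_type': {
        'gene with protein product': 'protein-coding',
        'RNA, micro': 'ncRNA',
    },
}


def regroup(value: str, table: Dict[str, str]) -> str:
    """Applies each replacement in turn, mirroring the str.replace loops used later."""
    for old, new in table.items():
        value = value.replace(old, new)
    return value


for raw in ['gene with protein product', 'RNA, micro', 'unknown']:
    print(raw, '->', regroup(raw, toy_mapper['hgnc_gene_type']))
```

Because the replacement is substring-based, values with no matching key pass through unchanged.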

In [None]:
# download data
url = 'https://storage.googleapis.com/pheknowlator/curated_data/genomic_typing_dict.pkl'
if not os.path.exists(unprocessed_data_location + 'genomic_typing_dict.pkl'):
    data_downloader(url, unprocessed_data_location)

# load data
with open(unprocessed_data_location + 'genomic_typing_dict.pkl', 'rb') as in_file:
    genomic_type_mapper = pickle.load(in_file)  # context manager closes the file handle

***
**HGNC Data** 

_Human Gene Set Data_ - `hgnc_complete_set.txt`

In [None]:
# download data
url = 'http://ftp.ebi.ac.uk/pub/databases/genenames/new/tsv/hgnc_complete_set.txt'
if not os.path.exists(unprocessed_data_location + 'hgnc_complete_set.txt'):
    data_downloader(url, unprocessed_data_location)

# load data
hgnc = pandas.read_csv(unprocessed_data_location + 'hgnc_complete_set.txt', header=0, delimiter='\t', low_memory=False)

_Preprocess Data_  
The data file needs light cleaning before it can be merged with other data: renaming columns, replacing `NaN` with `None`, updating data types (i.e. making all columns type `str`), and unnesting `|`-delimited data. The final step regroups the values of the gene type variable into `protein-coding`, `other`, or `ncRNA`.
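
The unnesting step can be pictured with plain pandas. This is a sketch only: the notebook itself calls the `explodes_data` helper from `pkt_kg`, whose implementation is not reproduced here, and the identifiers in `toy` are made up.

```python
# Illustrative stand-in for the unnesting step; the values below are made up
# and this is not the pkt_kg explodes_data implementation.
import pandas

toy = pandas.DataFrame({
    'hgnc_id': ['ID1', 'ID2'],
    'uniprot_id': ['P1', 'P2|P3'],  # second record holds two '|'-delimited values
})

# split each '|'-delimited string into a list, then emit one row per list element
toy['uniprot_id'] = toy['uniprot_id'].str.split('|')
unnested = toy.explode('uniprot_id').reset_index(drop=True)

print(unnested)
```

Each nested value ends up on its own row while the other columns are repeated, which is the shape needed for a later merge on a single identifier.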

In [None]:
hgnc = hgnc.loc[hgnc['status'] == 'Approved']  # keep approved records only
hgnc = hgnc[['hgnc_id', 'entrez_id', 'ensembl_gene_id', 'uniprot_ids', 'symbol', 'locus_type', 'alias_symbol', 'name', 'location', 'alias_name']].copy()  # copy so the in-place edits below act on an independent frame
hgnc.rename(columns={'uniprot_ids': 'uniprot_id', 'location': 'map_location', 'locus_type': 'hgnc_gene_type'}, inplace=True)
hgnc['hgnc_id'] = hgnc['hgnc_id'].str.replace(r'.*:', '', regex=True)  # strip the 'HGNC:' prefix off of the identifiers
hgnc.fillna('None', inplace=True)  # replace NaN with 'None'
hgnc['entrez_id'] = hgnc['entrez_id'].apply(lambda x: str(int(x)) if x != 'None' else 'None')  # make col str

# combine certain columns into single column
hgnc['name'] = hgnc['name'] + '|' + hgnc['alias_name']
hgnc['synonyms'] = hgnc['alias_symbol'] + '|' + hgnc['alias_name'] + '|' + hgnc['name']
hgnc['symbol'] = hgnc['symbol'] + '|' + hgnc['alias_symbol']

# explode nested data and reformat values in preparation for combining it with other gene identifiers
explode_df_hgnc = explodes_data(hgnc.copy(), ['ensembl_gene_id', 'uniprot_id', 'symbol', 'name', 'synonyms'], '|')

# reformat hgnc gene type
for val in genomic_type_mapper['hgnc_gene_type'].keys():
    explode_df_hgnc['hgnc_gene_type'] = explode_df_hgnc['hgnc_gene_type'].str.replace(val, genomic_type_mapper['hgnc_gene_type'][val])

# reformat master hgnc gene type
explode_df_hgnc['master_gene_type'] = explode_df_hgnc['hgnc_gene_type']
master_dict = genomic_type_mapper['hgnc_master_gene_type']
for val in master_dict.keys():
    explode_df_hgnc['master_gene_type'] = explode_df_hgnc['master_gene_type'].str.replace(val, master_dict[val])

# post-process reformatted data
explode_df_hgnc.drop(['alias_symbol', 'alias_name'], axis=1, inplace=True)  # remove alias columns already combined above
explode_df_hgnc.drop_duplicates(subset=None, keep='first', inplace=True)

# preview data
explode_df_hgnc.head(n=3)

***
**Ensembl Data**

_Human Gene Set Data_ - `Homo_sapiens.GRCh38.102.gtf.gz`

In [None]:
# download data
url = 'ftp://ftp.ensembl.org/pub/release-102/gtf/homo_sapiens/Homo_sapiens.GRCh38.102.gtf.gz'
if not os.path.exists(unprocessed_data_location + 'Homo_sapiens.GRCh38.102.gtf'):
    data_downloader(url, unprocessed_data_location)

# load data
ensembl_geneset = pandas.read_csv(unprocessed_data_location + 'Homo_sapiens.GRCh38.102.gtf',
                                  header=None, delimiter='\t', skiprows=5, usecols=[8], low_memory=False)

_Preprocess Data_  
The data file must be reformatted before it can be merged with the other gene, RNA, and protein identifier data. To do this, we iterate over each row of the data and extract the fields listed in the `columns` argument below, making each extracted field its own column. The final step regroups the values of the gene type variable into `protein-coding`, `other`, or `ncRNA`.
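
The row-level extraction amounts to turning one GTF attribute string into a dictionary. A minimal sketch, assuming the usual `key "value";` layout of the attribute column; `parse_gtf_attributes` is a hypothetical helper name and the sample string is abbreviated for illustration.

```python
# Minimal sketch of parsing one GTF attribute string (the ninth column);
# parse_gtf_attributes is a hypothetical helper and the sample is abbreviated.
from typing import Dict


def parse_gtf_attributes(attributes: str) -> Dict[str, str]:
    """Splits 'key "value"; key "value";' pairs into a dictionary."""
    parsed = {}
    for field in attributes.strip().split(';'):
        field = field.strip()
        if not field:  # skip the empty piece left after the trailing ';'
            continue
        key, _, value = field.partition(' ')
        parsed[key] = value.strip('"')  # a repeated key keeps only its last value
    return parsed


sample = ('gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; '
          'gene_name "DDX11L1"; gene_biotype "transcribed_unprocessed_pseudogene";')
row = parse_gtf_attributes(sample)
print(row['gene_id'], row['transcript_id'], row['gene_biotype'])
```

Rows without a `transcript_id` (for example gene-level records) would simply lack that key, which is why the notebook checks for both identifiers before extracting fields.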

In [None]:
ensembl_data = list(ensembl_geneset[8])
ensembl_df_data = []
for row in tqdm(ensembl_data):
    # only rows carrying both a gene and a transcript identifier are kept
    if 'gene_id' in row and 'transcript_id' in row:
        # split each 'key "value"' attribute pair into a dictionary entry
        row_dict = {x.split(' "')[0].lstrip(): x.split(' "')[1].strip('"') for x in row.split(';')[0:-1]}
        ensembl_df_data += [(row_dict['gene_id'], row_dict['transcript_id'], row_dict['gene_name'],
                           row_dict['gene_biotype'], row_dict['transcript_name'], row_dict['transcript_biotype'])]
# convert to data frame
ensembl_geneset = pandas.DataFrame(ensembl_df_data,
                                   columns=['ensembl_gene_id', 'transcript_stable_id', 'symbol',
                                            'ensembl_gene_type', 'transcript_name', 'ensembl_transcript_type'])

# reformat ensembl gene type
gene_dict = genomic_type_mapper['ensembl_gene_type']
for val in gene_dict.keys():
    ensembl_geneset['ensembl_gene_type'] = ensembl_geneset['ensembl_gene_type'].str.replace(val, gene_dict[val])
# reformat master gene type
ensembl_geneset['master_gene_type'] = ensembl_geneset['ensembl_gene_type']
gene_dict = genomic_type_mapper['ensembl_master_gene_type']
for val in gene_dict.keys():
    ensembl_geneset['master_gene_type'] = ensembl_geneset['master_gene_type'].str.replace(val, gene_dict[val])
# reformat master transcript type
ensembl_geneset['ensembl_transcript_type'] = ensembl_geneset['ensembl_transcript_type'].str.replace('vault_RNA', 'vaultRNA', regex=False)
ensembl_geneset['master_transcript_type'] = ensembl_geneset['ensembl_transcript_type']
trans_dict = genomic_type_mapper['ensembl_master_transcript_type']
for val in trans_dict.keys():
    ensembl_geneset['master_transcript_type'] = ensembl_geneset['master_transcript_type'].str.replace(val, trans_dict[val])

# post-process reformatted data
ensembl_geneset.drop_duplicates(subset=None, keep='first', inplace=True)

# preview data
ensembl_geneset.head(n=3)
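
To make the attribute parsing above concrete, the following minimal sketch applies the same idea to a single GTF attribute string. The helper name `parse_gtf_attributes` and the example string are illustrative assumptions and are not part of the notebook or of `data_utils.py`.

```python
def parse_gtf_attributes(attribute_field):
    """Parse a GTF attribute string ('key "value"; key "value";') into a dict."""
    parsed = {}
    for item in attribute_field.strip().split(';'):
        item = item.strip()
        if not item:  # skip the empty piece after the trailing ';'
            continue
        key, _, value = item.partition(' ')
        parsed[key] = value.strip('"')
    return parsed


# illustrative attribute string in the style of column 9 of an Ensembl GTF file
example = ('gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; '
           'gene_name "DDX11L1"; gene_biotype "transcribed_unprocessed_pseudogene";')
attrs = parse_gtf_attributes(example)
print(attrs['gene_id'], attrs['transcript_id'], attrs['gene_name'])
```

Rows that lack a `transcript_id` (e.g. gene-level records) would simply not contain that key, which is why the notebook checks for both `gene_id` and `transcript_id` before extracting fields.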

***
**Ensembl Annotation Data**

_Ensembl-UniProt_ - `Homo_sapiens.GRCh38.102.uniprot.tsv`  
Once the main Ensembl gene set has been read in, the next step is to read in the `ensembl-uniprot` mapping file. This file is vital for successfully merging the Ensembl identifiers with the UniProt data set.

In [None]:
# download data
url_uniprot = 'ftp://ftp.ensembl.org/pub/release-102/tsv/homo_sapiens/Homo_sapiens.GRCh38.102.uniprot.tsv.gz'
if not os.path.exists(unprocessed_data_location + 'Homo_sapiens.GRCh38.102.uniprot.tsv'):
    data_downloader(url_uniprot, unprocessed_data_location)

# load data
ensembl_uniprot = pandas.read_csv(unprocessed_data_location + 'Homo_sapiens.GRCh38.102.uniprot.tsv', header=0, delimiter='\t', low_memory=False)

# preprocess data
ensembl_uniprot.rename(columns={'xref': 'uniprot_id', 'gene_stable_id': 'ensembl_gene_id'}, inplace=True)
ensembl_uniprot.replace('-', 'None', inplace=True)
ensembl_uniprot.fillna('None', inplace=True)
ensembl_uniprot = ensembl_uniprot.loc[ensembl_uniprot['xref_identity'].apply(lambda x: x != 'None')]
ensembl_uniprot = ensembl_uniprot.loc[ensembl_uniprot['uniprot_id'].apply(lambda x: '-' not in x)]  # remove isoforms
ensembl_uniprot = ensembl_uniprot.loc[ensembl_uniprot['info_type'].apply(lambda x: x == 'DIRECT')]
# ensembl_uniprot['master_gene_type'] = ['protein-coding'] * len(ensembl_uniprot)
# ensembl_uniprot['master_transcript_type'] = ['protein-coding'] * len(ensembl_uniprot)
ensembl_uniprot.drop(['db_name', 'info_type', 'source_identity', 'xref_identity', 'linkage_type'], axis=1, inplace=True)
ensembl_uniprot.drop_duplicates(subset=None, keep='first', inplace=True)
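
As a hedged illustration of the filtering performed above (dropping isoform accessions and keeping only `DIRECT` cross-references), the sketch below uses boolean masks on a toy data frame. The rows are made-up values chosen for illustration and do not come from the Ensembl file.

```python
import pandas

# toy cross-reference rows (illustrative values only)
xrefs = pandas.DataFrame({
    'uniprot_id': ['P04217', 'P04217-2', 'Q9Y6K9', 'P04217'],
    'info_type': ['DIRECT', 'DIRECT', 'SEQUENCE_MATCH', 'DIRECT'],
})

# isoform accessions contain a '-', so keep only accessions without one
not_isoform = ~xrefs['uniprot_id'].str.contains('-', regex=False)
is_direct = xrefs['info_type'] == 'DIRECT'

filtered = xrefs.loc[not_isoform & is_direct].drop_duplicates().reset_index(drop=True)
print(filtered)
```

Vectorized masks like these are equivalent in effect to the `apply(lambda ...)` filters in the cell above and tend to be faster on large files.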

_Ensembl-Entrez_ - `Homo_sapiens.GRCh38.102.entrez.tsv`  
Once the main Ensembl gene set has been read in, the next step is to read in the `ensembl-entrez` mapping file. This file is vital for successfully merging the Ensembl identifiers with the Entrez data set.

In [None]:
# download data
url_entrez = 'ftp://ftp.ensembl.org/pub/release-102/tsv/homo_sapiens/Homo_sapiens.GRCh38.102.entrez.tsv.gz'
if not os.path.exists(unprocessed_data_location + 'Homo_sapiens.GRCh38.102.entrez.tsv'):
    data_downloader(url_entrez, unprocessed_data_location)

# load data
ensembl_entrez = pandas.read_csv(unprocessed_data_location + 'Homo_sapiens.GRCh38.102.entrez.tsv', header=0, delimiter='\t', low_memory=False)

# preprocess data
ensembl_entrez.rename(columns={'xref': 'entrez_id', 'gene_stable_id': 'ensembl_gene_id'}, inplace=True)
ensembl_entrez = ensembl_entrez.loc[ensembl_entrez['db_name'].apply(lambda x: x == 'EntrezGene')]
ensembl_entrez = ensembl_entrez.loc[ensembl_entrez['info_type'].apply(lambda x: x == 'DEPENDENT')]
ensembl_entrez.replace('-', 'None', inplace=True)
ensembl_entrez.fillna('None', inplace=True)
ensembl_entrez.drop(['db_name', 'info_type', 'source_identity', 'xref_identity', 'linkage_type'], axis=1, inplace=True)
ensembl_entrez.drop_duplicates(subset=None, keep='first', inplace=True)

_Merge Annotation Data_ - `ensembl_uniprot` + `ensembl_entrez`

In [None]:
merge_cols = list(set(ensembl_entrez).intersection(set(ensembl_uniprot)))
ensembl_annot = pandas.merge(ensembl_uniprot, ensembl_entrez, on=merge_cols, how='outer')
ensembl_annot.fillna('None', inplace=True)

# preview data
ensembl_annot.head(n=3)
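
The merge pattern used here, and repeated in the merges that follow, joins two data frames on whatever columns they share and keeps unmatched rows from both sides. A minimal sketch on toy data, with made-up identifiers, shows the behavior:

```python
import pandas

# toy stand-ins for two annotation tables (identifiers are made up)
left = pandas.DataFrame({'ensembl_gene_id': ['G1', 'G2'],
                         'transcript_stable_id': ['T1', 'T2'],
                         'uniprot_id': ['P1', 'P2']})
right = pandas.DataFrame({'ensembl_gene_id': ['G2', 'G3'],
                          'transcript_stable_id': ['T2', 'T3'],
                          'entrez_id': ['11', '12']})

# join on every column the two tables have in common
merge_cols = sorted(set(left.columns) & set(right.columns))
merged = pandas.merge(left, right, on=merge_cols, how='outer').fillna('None')
print(merged)
```

With `how='outer'`, `G1` and `G3` are retained even though each appears in only one table; the missing values are then recoded to the string `'None'`, as in the notebook.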

_Merge Ensembl Annotation and Gene Set Data_ - `ensembl_geneset` + `ensembl_annot`

In [None]:
merge_cols = list(set(ensembl_annot).intersection(set(ensembl_geneset)))
ensembl = pandas.merge(ensembl_geneset, ensembl_annot, on=merge_cols, how='outer')
ensembl.fillna('None', inplace=True)
ensembl.replace('NA','None', inplace=True, regex=False)

# preview data
ensembl.head(n=3)

_Save Cleaned Ensembl Data_  
Save the cleaned Ensembl data so that it can be used when generating node metadata for transcript identifiers.

In [None]:
ensembl.to_csv(processed_data_location + 'ensembl_identifier_data_cleaned.txt', header=True, sep='\t', index=False)

***
**UniProt Data**   
_Human Gene Set Data_ - `uniprot_identifier_mapping.tab`

This data was obtained by querying the [UniProt Knowledgebase](https://www.uniprot.org/uniprot/) using the *organism:"Homo sapiens (Human) [9606]"* keyword and including the following columns:
- Entry (Standard)    
- GeneID (*Genome Annotation*)  
- Ensembl (*Genome Annotation*)  
- HGNC (*Organism-specific*)  
- Gene names (primary) (*Names & Taxonomy*)    
- Gene synonym (primary) (*Names & Taxonomy*)    

The URL to access the results of this query is obtained by clicking on the share symbol and copying the free-text from the box. To obtain the data in a tab-delimited format the following string is appended to the end of the URL: "&format=tab".

**NOTE.** Be sure to obtain a new URL from the [UniProt Knowledgebase](https://www.uniprot.org/uniprot/) when rebuilding to ensure you are getting the most up-to-date data. This query was last generated on `01/30/2022`.

In [None]:
# download data
url = 'https://www.uniprot.org/uniprot/?query=&fil=organism%3A%22Homo%20sapiens%20(Human)%20%5B9606%5D%22&columns=id%2Creviewed%2Cdatabase(GeneID)%2Cdatabase(Ensembl)%2Cdatabase(HGNC)%2Cgenes(ALTERNATIVE)%2Cgenes(PREFERRED)&format=tab'
if not os.path.exists(unprocessed_data_location + 'uniprot_identifier_mapping.tab'):
    data_downloader(url, unprocessed_data_location, 'uniprot_identifier_mapping.tab')

# load data
uniprot = pandas.read_csv(unprocessed_data_location + 'uniprot_identifier_mapping.tab', header=0, delimiter='\t')

_Preprocess Data_  
The data file needs to be lightly cleaned before it can be merged with other data. This light cleaning includes renaming columns, replacing `NaN` with `None`, keeping only reviewed entries, and unnesting `;`- and `|`-delimited data.

In [None]:
uniprot.fillna('None', inplace=True)  # replace NaN with 'None'
uniprot.rename(columns={'Entry': 'uniprot_id',
                        'Cross-reference (GeneID)': 'entrez_id',
                        'Ensembl transcript': 'transcript_stable_id',
                        'Cross-reference (HGNC)': 'hgnc_id',
                        'Gene names  (synonym )': 'synonyms',
                        'Gene names  (primary )' :'symbol'}, inplace=True)

# update space-delimited synonyms to a pipe (i.e. '|')
uniprot['synonyms'] = uniprot['synonyms'].apply(lambda x: '|'.join(x.split()) if x.isupper() else x)

# only keep reviewed entries
uniprot = uniprot.loc[uniprot['Status'].apply(lambda x: x != 'unreviewed')]

# explode nested data
explode_df_uniprot = explodes_data(uniprot.copy(), ['transcript_stable_id', 'entrez_id', 'hgnc_id'], ';')
explode_df_uniprot = explodes_data(explode_df_uniprot.copy(), ['symbol', 'synonyms'], '|')

# strip out uniprot names
explode_df_uniprot['transcript_stable_id'] = explode_df_uniprot['transcript_stable_id'].str.replace(r'\s.*', '', regex=True)

# remove duplicates
explode_df_uniprot.drop(['Status'], axis=1, inplace=True)
explode_df_uniprot.drop_duplicates(subset=None, keep='first', inplace=True)

# preview data
explode_df_uniprot.head(n=3)
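
The unnesting step relies on the `explodes_data` helper from `data_utils.py`, whose implementation is not shown in this notebook. As a rough sketch of the general idea only, and not of that helper itself, a delimited column can be unnested with plain pandas. The function name `explode_column` and the toy values are assumptions.

```python
import pandas


def explode_column(df, column, delimiter):
    """Return a copy of df with one row per delimited value in `column`."""
    out = df.copy()
    out[column] = out[column].str.split(delimiter)
    out = out.explode(column).reset_index(drop=True)
    out[column] = out[column].str.strip()
    # drop the empty strings produced by trailing delimiters (e.g. '10;20;')
    return out.loc[out[column] != ''].reset_index(drop=True)


toy = pandas.DataFrame({'uniprot_id': ['P1', 'P2'], 'entrez_id': ['10;20;', '30']})
exploded = explode_column(toy, 'entrez_id', ';')
print(exploded)
```

Each original row is repeated once per value, so `P1` ends up on two rows while `P2` stays on one.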

***
**NCBI Data**   
_Human Gene Set Data_ - `Homo_sapiens.gene_info`

In [None]:
# download data
url = 'ftp://ftp.ncbi.nih.gov/gene/DATA/GENE_INFO/Mammalia/Homo_sapiens.gene_info.gz'
if not os.path.exists(unprocessed_data_location + 'Homo_sapiens.gene_info'):
    data_downloader(url, unprocessed_data_location)

# load data
ncbi_gene = pandas.read_csv(unprocessed_data_location + 'Homo_sapiens.gene_info', header=0, delimiter='\t', low_memory=False)

_Preprocess Data_  
The data file needs to be lightly cleaned before it can be merged with other data. This light cleaning includes renaming columns, replacing `NaN` with `None`, updating data types (i.e. making all columns type `str`), and unnesting `|`-delimited data. Then, the `gene_type` variable is cleaned such that each of its values is re-grouped as `protein-coding`, `other`, or `ncRNA`.

In [None]:
# preprocess data
ncbi_gene = ncbi_gene.loc[ncbi_gene['#tax_id'].apply(lambda x: x == 9606)]  # remove non-human rows
ncbi_gene.replace('-', 'None', inplace=True)
ncbi_gene.rename(columns={'GeneID': 'entrez_id', 'Symbol': 'symbol', 'Synonyms': 'synonyms'}, inplace=True)
ncbi_gene['synonyms'] = ncbi_gene['synonyms'] + '|' + ncbi_gene['description'] + '|' + ncbi_gene['Full_name_from_nomenclature_authority'] + '|' + ncbi_gene['Other_designations']
ncbi_gene['symbol'] = ncbi_gene['Symbol_from_nomenclature_authority'] + '|' + ncbi_gene['symbol']
ncbi_gene['name'] = ncbi_gene['Full_name_from_nomenclature_authority'] + '|' + ncbi_gene['description']

# explode nested data
explode_df_ncbi_gene = explodes_data(ncbi_gene.copy(), ['symbol', 'synonyms', 'name', 'dbXrefs'], '|')

# clean up results
explode_df_ncbi_gene['entrez_id'] = explode_df_ncbi_gene['entrez_id'].astype(str)
explode_df_ncbi_gene = explode_df_ncbi_gene.loc[explode_df_ncbi_gene['dbXrefs'].apply(lambda x: x.split(':')[0] in ['Ensembl', 'HGNC', 'IMGT/GENE-DB'])]
explode_df_ncbi_gene['hgnc_id'] = explode_df_ncbi_gene['dbXrefs'].loc[explode_df_ncbi_gene['dbXrefs'].apply(lambda x: x.startswith('HGNC'))]
explode_df_ncbi_gene['ensembl_gene_id'] = explode_df_ncbi_gene['dbXrefs'].loc[explode_df_ncbi_gene['dbXrefs'].apply(lambda x: x.startswith('Ensembl'))]
explode_df_ncbi_gene.fillna('None', inplace=True)

# reformat entrez gene type
explode_df_ncbi_gene['entrez_gene_type'] = explode_df_ncbi_gene['type_of_gene']
gene_dict = genomic_type_mapper['entrez_gene_type']
for val in gene_dict.keys():
    explode_df_ncbi_gene['entrez_gene_type'] = explode_df_ncbi_gene['entrez_gene_type'].str.replace(val, gene_dict[val])
# reformat master gene type
explode_df_ncbi_gene['master_gene_type'] = explode_df_ncbi_gene['entrez_gene_type']
gene_dict = genomic_type_mapper['master_gene_type']
for val in gene_dict.keys():
    explode_df_ncbi_gene['master_gene_type'] = explode_df_ncbi_gene['master_gene_type'].str.replace(val, gene_dict[val])

# post-process reformatted data
explode_df_ncbi_gene.drop(['type_of_gene', 'dbXrefs', 'description', 'Nomenclature_status', 'Modification_date',
                           'LocusTag', '#tax_id', 'Full_name_from_nomenclature_authority', 'Feature_type',
                           'Symbol_from_nomenclature_authority'], axis=1, inplace=True)
explode_df_ncbi_gene['hgnc_id'] = explode_df_ncbi_gene['hgnc_id'].str.replace('HGNC:', '', regex=True)
explode_df_ncbi_gene['ensembl_gene_id'] = explode_df_ncbi_gene['ensembl_gene_id'].str.replace('Ensembl:', '', regex=True)
explode_df_ncbi_gene.drop_duplicates(subset=None, keep='first', inplace=True)

# preview data
explode_df_ncbi_gene.head(n=3)
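
The `dbXrefs` handling above can be illustrated on a single value. The sketch below splits one pipe-delimited cross-reference string and pulls out the HGNC and Ensembl accessions; the helper name `parse_dbxrefs` is hypothetical and the example string only follows the general `source:accession` layout of the `gene_info` file.

```python
def parse_dbxrefs(dbxrefs):
    """Extract HGNC and Ensembl accessions from a '|'-delimited dbXrefs string."""
    result = {'hgnc_id': None, 'ensembl_gene_id': None}
    for xref in dbxrefs.split('|'):
        source, _, accession = xref.partition(':')
        if source == 'HGNC':
            # HGNC entries repeat the prefix (e.g. 'HGNC:HGNC:5'), so strip it again
            result['hgnc_id'] = accession.replace('HGNC:', '')
        elif source == 'Ensembl':
            result['ensembl_gene_id'] = accession
    return result


parsed = parse_dbxrefs('MIM:138670|HGNC:HGNC:5|Ensembl:ENSG00000121410')
print(parsed)
```

A value of `-` (no cross-references) yields `None` for both fields, which corresponds to the `'None'` placeholder used throughout the notebook.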

***
**Protein Ontology Identifier Mapping Data**   
_Protein Ontology Identifier Data_ - `promapping.txt`

In [None]:
# download data
url = 'https://proconsortium.org/download/current/promapping.txt'
if not os.path.exists(unprocessed_data_location + 'promapping.txt'):
    data_downloader(url, unprocessed_data_location)

# load data
pro_map = pandas.read_csv(unprocessed_data_location + 'promapping.txt', header=None, names=['pro_id', 'entry', 'pro_mapping'], delimiter='\t')

_Preprocess Data_  
Basic filtering to include only `Protein Ontology` mappings to `UniProt` identifiers, and cleaning to update the formatting of accession values (i.e. removing `UniProtKB:`).

In [None]:
pro_map = pro_map.loc[pro_map['entry'].apply(lambda x: x.startswith('Uni') and '_VAR' not in x and ', ' not in x)]  # keep 'UniProtKB' rows
pro_map = pro_map.loc[pro_map['pro_mapping'].apply(lambda x: x.startswith('exact'))] # keep exact mappings
pro_map['pro_id'] = pro_map['pro_id'].str.replace('PR:','PR_', regex=True)  # replace PR: with PR_
pro_map['entry'] = pro_map['entry'].str.replace(r'(^\w*\:)', '', regex=True)  # remove id prefixes
pro_map = pro_map.loc[pro_map['pro_id'].apply(lambda x: '-' not in x)] # remove isoforms
pro_map.rename(columns={'entry': 'uniprot_id'}, inplace=True)  # rename columns before merging
pro_map.drop(['pro_mapping'], axis=1, inplace=True)  # remove unneeded columns
pro_map.drop_duplicates(subset=None, keep='first', inplace=True)

# preview data
pro_map.head(n=3)
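
To show what these filters and substitutions do to individual rows, here is a small standard-library sketch. The function names and the example rows are hypothetical; only the rules mirror the cell above.

```python
import re


def keep_pro_row(pro_id, entry, mapping):
    """Mirror the row filters: UniProtKB entries, exact mappings, no isoforms."""
    is_uniprot = entry.startswith('Uni') and '_VAR' not in entry and ', ' not in entry
    return is_uniprot and mapping.startswith('exact') and '-' not in pro_id


def clean_pro_row(pro_id, entry):
    """Reformat the PRO identifier and strip the database prefix from the entry."""
    return pro_id.replace('PR:', 'PR_'), re.sub(r'^\w*:', '', entry)


# made-up rows: (pro_id, entry, pro_mapping)
rows = [('PR:P12345', 'UniProtKB:P12345', 'exact'),
        ('PR:P12345-2', 'UniProtKB:P12345-2', 'exact'),
        ('PR:000000001', 'HGNC:1234', 'exact'),
        ('PR:Q99999', 'UniProtKB:Q99999', 'is_a')]

cleaned = [clean_pro_row(p, e) for p, e, m in rows if keep_pro_row(p, e, m)]
print(cleaned)
```

Only the first row survives: the second is an isoform, the third is not a UniProtKB entry, and the fourth is not an exact mapping.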

***

#### Merging Processed Genomic Identifier Data Sources  
Merging all of the genomic identifier data sources is needed in order to create a map that can be used to integrate the different genomic data sources.

***
*Data Sources:* `hgnc` + `ensembl`

In [None]:
merge_cols = list(set(explode_df_hgnc.columns).intersection(set(ensembl.columns)))
ensembl_hgnc_merged_data = pandas.merge(ensembl, explode_df_hgnc, on=merge_cols, how='outer')

# clean up merged data
ensembl_hgnc_merged_data.fillna('None', inplace=True)
ensembl_hgnc_merged_data.drop_duplicates(subset=None, keep='first', inplace=True)

# preview data
ensembl_hgnc_merged_data.head(n=3)

***
*Data Sources:* `ensembl_hgnc_merged_data` + `explode_df_uniprot`

In [None]:
merge_cols = list(set(ensembl_hgnc_merged_data.columns).intersection(set(explode_df_uniprot.columns)))
ensembl_hgnc_uniprot_merged_data = pandas.merge(ensembl_hgnc_merged_data, explode_df_uniprot, on=merge_cols, how='outer')

# clean up merged data
ensembl_hgnc_uniprot_merged_data.fillna('None', inplace=True)
ensembl_hgnc_uniprot_merged_data.drop_duplicates(subset=None, keep='first', inplace=True)

# preview data
ensembl_hgnc_uniprot_merged_data.head(n=3)

***
*Data Sources:* `ensembl_hgnc_uniprot_merged_data` + `Homo_sapiens.gene_info`

In [None]:
merge_cols = list(set(ensembl_hgnc_uniprot_merged_data.columns).intersection(set(explode_df_ncbi_gene.columns)))
ensembl_hgnc_uniprot_ncbi_merged_data = pandas.merge(ensembl_hgnc_uniprot_merged_data, explode_df_ncbi_gene, on=merge_cols, how='outer')

# clean up merged data
ensembl_hgnc_uniprot_ncbi_merged_data.fillna('None', inplace=True)
ensembl_hgnc_uniprot_ncbi_merged_data.drop_duplicates(subset=None, keep='first', inplace=True)

# preview data
ensembl_hgnc_uniprot_ncbi_merged_data.head(n=3)

***
*Data Sources:* `ensembl_hgnc_uniprot_ncbi_merged_data` + `promapping.txt`  

In [None]:
merged_data = pandas.merge(ensembl_hgnc_uniprot_ncbi_merged_data, pro_map, on='uniprot_id', how='outer')

# clean up merged data
merged_data.fillna('None', inplace=True)
merged_data.drop_duplicates(subset=None, keep='first', inplace=True)

# preview data
merged_data.head(n=3)

_Fix Symbol Formatting_  
Some gene symbols resemble dates (e.g. `DEC1`) and can be erroneously reformatted as date values on input (i.e. `1-DEC`). In order for the data to merge successfully with other data sources, all date-formatted gene symbols need to be resolved.

In [None]:
clean_dates = []
for x in tqdm(list(merged_data['symbol'])):
    if '-' in x and len(x.split('-')[0]) < 3 and len(x.split('-')[1]) == 3:
        clean_dates.append(x.split('-')[1].upper() + x.split('-')[0])
    else: clean_dates.append(x)

# add cleaned date var back to data set
merged_data['symbol'] = clean_dates
merged_data.fillna('None', inplace=True)

# make sure that all gene and transcript type columns have 'None' recoded to 'unknown' or 'not protein-coding'
merged_data['hgnc_gene_type'] = merged_data['hgnc_gene_type'].str.replace('None', 'unknown', regex=False)
merged_data['ensembl_gene_type'] = merged_data['ensembl_gene_type'].str.replace('None', 'unknown', regex=False)
merged_data['entrez_gene_type'] = merged_data['entrez_gene_type'].str.replace('None', 'unknown', regex=False)
merged_data['master_gene_type'] = merged_data['master_gene_type'].str.replace('None', 'unknown', regex=False)
merged_data['master_transcript_type'] = merged_data['master_transcript_type'].str.replace('None', 'not protein-coding', regex=False)
merged_data['ensembl_transcript_type'] = merged_data['ensembl_transcript_type'].str.replace('None', 'unknown', regex=False)

# remove duplicates
merged_data_clean = merged_data.drop_duplicates(subset=None, keep='first')

# write data
merged_data_clean.to_csv(processed_data_location + 'Merged_Human_Ensembl_Entrez_HGNC_Uniprot_Identifiers.txt', header=True, sep='\t', index=False)
    
# preview data
merged_data_clean.head(n=3)
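
The date-like symbol repair performed at the start of this step can be written as a small function. The sketch below follows the same rule as the loop above, with one added assumption that is not in the notebook: the part before the hyphen must be numeric, which avoids rewriting symbols that merely happen to have a short first segment.

```python
def fix_date_symbol(symbol):
    """Turn a date-like value such as '1-DEC' back into a gene symbol ('DEC1')."""
    parts = symbol.split('-')
    looks_like_date = (len(parts) == 2
                       and len(parts[0]) < 3
                       and parts[0].isdigit()  # assumption: stricter than the notebook
                       and len(parts[1]) == 3)
    if looks_like_date:
        return parts[1].upper() + parts[0]
    return symbol


for value in ['1-DEC', '2-Sep', 'HLA-DRA', 'T-SP1', 'None']:
    print(value, '->', fix_date_symbol(value))
```

Without the `isdigit` check, a hypothetical value such as `T-SP1` would be rewritten to `SP1T`, so the stricter test may be worth considering when rebuilding.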

***
**Create a Master Mapping Dictionary**  
Although the above steps result in a `pandas.DataFrame` of the merged identifiers, more work is needed to obtain a complete mapping between the identifiers. For example, if you were to search for Entrez gene identifier `entrez_259234` you would find the following mappings: `entrez_259234-ENSG00000233316`, `entrez_259234-DSCR10`. If you only had `ENSG00000233316`, with the current data you would be unable to obtain the gene symbol without first mapping to the Entrez gene identifier. 

To solve this problem, we build a master dictionary where the keys are `ensembl_gene_id`, `transcript_stable_id`, `protein_stable_id`, `uniprot_id`, `entrez_id`, `hgnc_id`, `pro_id`, and `symbol` identifiers and the values are the lists of genomic identifiers that match each key. It's important to note that there are several labeling identifiers (i.e. `name`, `chromosome`, `map_location`, `Other_designations`, `synonyms`, `transcript_name`, `*_gene_types`, and `transcript_type_update`), which will only be mapped when clustered against one of the primary identifier types (i.e. the keys described above).

_Note_. The next chunk does a lot of heavy lifting and takes approximately 40 minutes to run.

In [None]:
# reformat data to convert all nones, empty values, and unknowns to NaN
for col in merged_data_clean.columns:
    merged_data_clean[col] = merged_data_clean[col].apply(lambda x: '|'.join([i for i in x.split('|') if i != 'None']))
merged_data_clean.replace(to_replace=['None', '', 'unknown'], value=numpy.nan, inplace=True)
identifiers = [x for x in merged_data_clean.columns if x.endswith('_id')] + ['symbol']

In [None]:
# convert data to dictionary
master_dict = {}
for idx in tqdm(identifiers):
    grouped_data = merged_data_clean.groupby(idx)
    grp_ids = set([x for x in list(grouped_data.groups.keys()) if pandas.notna(x)])  # NaN never equals itself, so test with notna
    for grp in grp_ids:
        df = grouped_data.get_group(grp).dropna(axis=1, how='all')
        df_cols, key = df.columns, idx + '_' + grp
        val_df = [[col + '_' + x for x in set(df[col]) if isinstance(x, str)] for col in df_cols if col != idx]
        if len(val_df) > 0:
            if key in master_dict.keys(): master_dict[key] += [i for j in val_df for i in j if len(i) > 0]
            else: master_dict[key] = [i for j in val_df for i in j if len(i) > 0]  
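
The structure of the resulting dictionary is easier to see on a toy example. The sketch below uses plain Python rather than `groupby`, and the two rows reuse the `entrez_259234` example from the text; it is a simplified illustration, not a replacement for the cell above.

```python
from collections import defaultdict

# two toy rows; None marks a missing value
rows = [
    {'entrez_id': '259234', 'ensembl_gene_id': 'ENSG00000233316', 'symbol': None},
    {'entrez_id': '259234', 'ensembl_gene_id': None, 'symbol': 'DSCR10'},
]
identifiers = ['entrez_id', 'ensembl_gene_id', 'symbol']

toy_master = defaultdict(set)
for row in rows:
    for id_col in identifiers:
        if row[id_col] is None:
            continue
        key = id_col + '_' + row[id_col]
        # collect every other non-missing value found on the same row
        toy_master[key] |= {col + '_' + val for col, val in row.items()
                            if col != id_col and val is not None}

for key in sorted(toy_master):
    print(key, sorted(toy_master[key]))
```

Every identifier becomes a prefixed key (e.g. `entrez_id_259234`), and its values are the prefixed identifiers that co-occur with it across all rows sharing that key.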

_Finalizing Master Mapping Dictionary_  
Next, we need to identify a master gene and transcript type for each entity, because the previous code chunk can yield genes and transcripts with several differing types (i.e. `protein-coding` or `not protein-coding`). The next step collects all type information for each gene and transcript and performs a voting procedure to select a single primary gene and transcript type.

In [None]:
reformatted_mapped_identifiers = dict()
for key, values in tqdm(master_dict.items()):
    identifier_info = set(values); gene_prefix = 'master_gene_type_'; trans_prefix = 'master_transcript_type_'
    if key.split('_')[0] in ['protein', 'uniprot', 'pro']: pass
    elif 'transcript' in key:
        trans_match = [x.replace(trans_prefix, '') for x in values if trans_prefix in x]
        if len(trans_match) > 0:
            t_type_list = ['protein-coding' if ('protein-coding' in trans_match or 'protein_coding' in trans_match) else 'not protein-coding']
            identifier_info |= {'transcript_type_update_' + max(set(t_type_list), key=t_type_list.count)}
    else:
        gene_match = [x.replace(gene_prefix, '') for x in values if x.startswith(gene_prefix) and 'type' in x]
        if len(gene_match) > 0:
            g_type_list = ['protein-coding' if ('protein-coding' in gene_match or 'protein_coding' in gene_match) else 'not protein-coding']
            identifier_info |= {'gene_type_update_' + max(set(g_type_list), key=g_type_list.count)}
    reformatted_mapped_identifiers[key] = identifier_info
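
As written in the cell above, the selection rule reduces to the following: if any collected master type is protein-coding, the entity is labeled `protein-coding`; otherwise it is labeled `not protein-coding`. A minimal sketch of that rule, using a hypothetical helper name and made-up values:

```python
def resolve_master_type(values, prefix):
    """Return a single type label from prefixed values, or None if no type is present."""
    matches = [v[len(prefix):] for v in values if v.startswith(prefix)]
    if not matches:
        return None
    if 'protein-coding' in matches or 'protein_coding' in matches:
        return 'protein-coding'
    return 'not protein-coding'


gene_values = ['master_gene_type_ncRNA', 'master_gene_type_protein-coding', 'symbol_DSCR10']
print(resolve_master_type(gene_values, 'master_gene_type_'))
print(resolve_master_type(['master_gene_type_ncRNA'], 'master_gene_type_'))
print(resolve_master_type(['symbol_DSCR10'], 'master_gene_type_'))
```

Entities with no type information are left without an update, matching the `len(...) > 0` guards in the notebook.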

In [None]:
# save a copy of the dictionary
# output > 4GB requires special approach: https://stackoverflow.com/questions/42653386/does-pickle-randomly-fail-with-oserror-on-large-files
filepath = processed_data_location + 'Merged_gene_rna_protein_identifiers.pkl'

# defensive way to write pickle.write, allowing for very large files on all platforms
max_bytes, bytes_out = 2**31 - 1, pickle.dumps(reformatted_mapped_identifiers)
n_bytes = len(bytes_out)  # number of serialized bytes to write

with open(filepath, 'wb') as f_out:
    for idx in range(0, n_bytes, max_bytes):
        f_out.write(bytes_out[idx:idx+max_bytes])

In [None]:
# # load data
# filepath = processed_data_location + 'Merged_gene_rna_protein_identifiers.pkl'

# # defensive way to write pickle.load, allowing for very large files on all platforms
# max_bytes = 2**31 - 1
# input_size = os.path.getsize(filepath)
# bytes_in = bytearray(0)

# with open(filepath, 'rb') as f_in:
#     for _ in range(0, input_size, max_bytes):
#         bytes_in += f_in.read(max_bytes)

# # load pickled data
# reformatted_mapped_identifiers = pickle.loads(bytes_in)
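
The chunked write and read pattern can be checked end to end on a small object. The sketch below wraps both directions in functions and uses a deliberately tiny chunk size so that several chunks are exercised; the function names and the temporary file are assumptions made for illustration.

```python
import os
import pickle
import tempfile


def dump_in_chunks(obj, path, max_bytes=2**31 - 1):
    """Pickle obj and write it to path in pieces of at most max_bytes."""
    data = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    with open(path, 'wb') as f_out:
        for start in range(0, len(data), max_bytes):
            f_out.write(data[start:start + max_bytes])


def load_in_chunks(path, max_bytes=2**31 - 1):
    """Read path in pieces of at most max_bytes and unpickle the result."""
    buffer = bytearray()
    with open(path, 'rb') as f_in:
        for _ in range(0, os.path.getsize(path), max_bytes):
            buffer += f_in.read(max_bytes)
    return pickle.loads(bytes(buffer))


toy_dict = {'entrez_id_259234': {'symbol_DSCR10', 'ensembl_gene_id_ENSG00000233316'}}
with tempfile.TemporaryDirectory() as tmp_dir:
    tmp_path = os.path.join(tmp_dir, 'toy_identifiers.pkl')
    dump_in_chunks(toy_dict, tmp_path, max_bytes=16)  # tiny chunks to force several writes
    restored = load_in_chunks(tmp_path, max_bytes=16)

print(restored == toy_dict)
```

In real use the default chunk size of `2**31 - 1` bytes would be kept; the small value here only serves to test the loop.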

***
### Ensembl Gene-Entrez Gene <a class="anchor" id="ensemblgene-entrezgene"></a>


**Purpose:** To map Ensembl gene identifiers to Entrez gene identifiers when creating `gene`-`gene` edges

**Output:** `ENSEMBL_GENE_ENTREZ_GENE_MAP.txt`

In [None]:
genomic_id_mapper(reformatted_mapped_identifiers,
                  processed_data_location + 'ENSEMBL_GENE_ENTREZ_GENE_MAP.txt',
                  'ensembl_gene_id', 'entrez_id', 'ensembl_gene_type', 'entrez_gene_type',
                  'gene_type_update', 'gene_type_update')

In [None]:
# load data, print the number of rows, and preview it
egeg_data = pandas.read_csv(processed_data_location + 'ENSEMBL_GENE_ENTREZ_GENE_MAP.txt',
                            header=None, delimiter='\t', low_memory=False,
                            names=['Ensembl_Gene_IDs', 'Entrez_Gene_IDs',
                                   'Ensembl_Gene_Type', 'Entrez_Gene_Type',
                                   'Master_Gene_Type1', 'Master_Gene_Type2'])

# add prefix to output edge
egeg_data['Entrez_Gene_IDs'] = 'NCBIGene_' + egeg_data['Entrez_Gene_IDs'].astype(str)

# write data back to file
egeg_data.to_csv(processed_data_location + 'ENSEMBL_GENE_ENTREZ_GENE_MAP.txt', header=None, sep='\t', index=False)

print('There are {edge_count} ensembl gene-entrez gene edges'.format(edge_count=len(egeg_data)))
egeg_data.head(n=5)

***
### Ensembl Transcript-Protein Ontology <a class="anchor" id="ensembltranscript-proteinontology"></a>

**Purpose:** To map Ensembl transcript identifiers to Protein Ontology identifiers when creating `rna`-`protein` edges

**Output:** `ENSEMBL_TRANSCRIPT_PROTEIN_ONTOLOGY_MAP.txt`


In [None]:
genomic_id_mapper(reformatted_mapped_identifiers,
                  processed_data_location + 'ENSEMBL_TRANSCRIPT_PROTEIN_ONTOLOGY_MAP.txt',
                  'transcript_stable_id', 'pro_id', 'ensembl_transcript_type', None,
                  'transcript_type_update', None)

In [None]:
# load data, print the number of rows, and preview it
etpr_data = pandas.read_csv(processed_data_location + 'ENSEMBL_TRANSCRIPT_PROTEIN_ONTOLOGY_MAP.txt',
                            header=None, delimiter='\t', low_memory=False, usecols=[0, 1, 2, 4],
                            names=['Ensembl_Transcript_IDs', 'Protein_Ontology_IDs',
                                   'Ensembl_Transcript_Type', 'Master_Transcript_Type'])

# add prefix to output edge
etpr_data['Ensembl_Transcript_ID_Edge'] = 'ensembl_' + etpr_data['Ensembl_Transcript_IDs'].astype(str)

# write data back to file
etpr_data.to_csv(processed_data_location + 'ENSEMBL_TRANSCRIPT_PROTEIN_ONTOLOGY_MAP.txt', header=None, sep='\t', index=False)

print('There are {edge_count} ensembl transcript-protein ontology edges'.format(edge_count=len(etpr_data)))
etpr_data.head(n=5)

***
### Entrez Gene-Ensembl Transcript <a class="anchor" id="entrezgene-ensembltranscript"></a>

**Purpose:** To map Entrez gene identifiers to Ensembl transcript identifiers when creating `gene`-`rna` edges

**Output:** `ENTREZ_GENE_ENSEMBL_TRANSCRIPT_MAP.txt`

In [None]:
genomic_id_mapper(reformatted_mapped_identifiers,
                  processed_data_location + 'ENTREZ_GENE_ENSEMBL_TRANSCRIPT_MAP.txt',
                  'entrez_id', 'transcript_stable_id', 'entrez_gene_type', 'ensembl_transcript_type',
                  'gene_type_update', 'transcript_type_update')

In [None]:
# load data, print the number of rows, and preview it
eet_data = pandas.read_csv(processed_data_location + 'ENTREZ_GENE_ENSEMBL_TRANSCRIPT_MAP.txt',
                           header=None, delimiter='\t', low_memory=False,
                           names=['Entrez_Gene_IDs', 'Ensembl_Transcript_IDs',
                                  'Entrez_Gene_Type', 'Ensembl_Transcript_Type',
                                  'Master_Gene_Type', 'Master_Transcript_Type'])

# add prefix to output edge
eet_data['Ensembl_Transcript_IDs'] = 'ensembl_' + eet_data['Ensembl_Transcript_IDs'].astype(str)
eet_data['Entrez_Gene_Edge'] = 'NCBIGene_' + eet_data['Entrez_Gene_IDs'].astype(str)

# write data back to file
eet_data.to_csv(processed_data_location + 'ENTREZ_GENE_ENSEMBL_TRANSCRIPT_MAP.txt', header=None, sep='\t', index=False)

print('There are {edge_count} entrez gene identifiers-ensembl transcript edges'.format(edge_count=len(eet_data)))
eet_data.head(n=5)

***
### Entrez Gene-Protein Ontology <a class="anchor" id="entrezgene-proteinontology"></a>

**Purpose:** To map Entrez gene identifiers to Protein Ontology identifiers when creating the following edges:   
- chemical-protein  
- gene-protein

**Output:** `ENTREZ_GENE_PRO_ONTOLOGY_MAP.txt`

In [None]:
genomic_id_mapper(reformatted_mapped_identifiers,
                  processed_data_location + 'ENTREZ_GENE_PRO_ONTOLOGY_MAP.txt',
                  'entrez_id', 'pro_id', 'entrez_gene_type', None,
                  'gene_type_update', None)

In [None]:
# load data, print the number of rows, and preview it
egpr_data = pandas.read_csv(processed_data_location + 'ENTREZ_GENE_PRO_ONTOLOGY_MAP.txt',
                            header=None, delimiter='\t', low_memory=False, usecols = [0, 1, 2, 4],
                            names=['Gene_IDs', 'Protein_Ontology_IDs',
                                   'Entrez_Gene_Type', 'Master_Gene_Type'])

# add prefix to output edge
egpr_data['Entrez_Gene_Edge'] = 'NCBIGene_' + egpr_data['Gene_IDs'].astype(str)

# write data back to file
egpr_data.to_csv(processed_data_location + 'ENTREZ_GENE_PRO_ONTOLOGY_MAP.txt', header=None, sep='\t', index=False)

print('There are {edge_count} entrez gene-protein ontology edges'.format(edge_count=len(egpr_data)))
egpr_data.head(n=5)

***
### Gene Symbol-Ensembl Transcript <a class="anchor" id="genesymbol-ensembltranscript"></a>

**Purpose:** To map gene symbols to Ensembl transcript identifiers when creating the following edges: 
- chemical-rna  
- rna-anatomy  
- rna-cell  

**Output:** `GENE_SYMBOL_ENSEMBL_TRANSCRIPT_MAP.txt`

In [None]:
genomic_id_mapper(reformatted_mapped_identifiers,
                  processed_data_location + 'GENE_SYMBOL_ENSEMBL_TRANSCRIPT_MAP.txt',
                  'symbol', 'transcript_stable_id', 'master_gene_type', 'ensembl_transcript_type',
                  'gene_type_update', 'transcript_type_update')

In [None]:
# load data, print the number of rows, and preview it
set_data = pandas.read_csv(processed_data_location + 'GENE_SYMBOL_ENSEMBL_TRANSCRIPT_MAP.txt',
                           header=None, delimiter='\t', low_memory=False,
                           names=['Gene_Symbols', 'Ensembl_Transcript_IDs',
                                  'Gene_Type', 'Ensembl_Transcript_Type',
                                  'Master_Gene_Type', 'Master_Transcript_Type'])

# add prefix to output edge
set_data['Ensembl_Transcript_IDs'] = 'ensembl_' + set_data['Ensembl_Transcript_IDs'].astype(str)

# write data back to file
set_data.to_csv(processed_data_location + 'GENE_SYMBOL_ENSEMBL_TRANSCRIPT_MAP.txt', header=None, sep='\t', index=False)

print('There are {edge_count} gene symbol-ensembl transcript edges'.format(edge_count=len(set_data.drop_duplicates())))
set_data.head(n=5)

***

### STRING-Protein Ontology <a class="anchor" id="string-proteinontology"></a>

**Purpose:** To map STRING identifiers to Protein Ontology identifiers when creating `protein`-`protein` edges 

**Output:** `STRING_PRO_ONTOLOGY_MAP.txt`

In [None]:
genomic_id_mapper(reformatted_mapped_identifiers,
                  processed_data_location + 'STRING_PRO_ONTOLOGY_MAP.txt',
                  'protein_stable_id', 'pro_id', None, None, None, None)

In [None]:
# load data, print the number of rows, and preview it
stpr_data = pandas.read_csv(processed_data_location + 'STRING_PRO_ONTOLOGY_MAP.txt',
                            header=None, delimiter='\t', low_memory=False, usecols=[0, 1],
                            names=['STRING_IDs', 'Protein_Ontology_IDs'])

# add prefix to output edge
stpr_data['STRING_IDs'] = '9606.' + stpr_data['STRING_IDs'].astype(str)

# write data back to file
stpr_data.to_csv(processed_data_location + 'STRING_PRO_ONTOLOGY_MAP.txt', header=None, sep='\t', index=False)

print('There are {edge_count} string-protein ontology edges'.format(edge_count=len(stpr_data.drop_duplicates())))
stpr_data.head(n=5)

***

### Uniprot Accession-Protein Ontology <a class="anchor" id="uniprotaccession-proteinontology"></a>

**Purpose:** To map Uniprot accession identifiers to Protein Ontology identifiers when creating the following edges:  
- protein-gobp  
- protein-gomf  
- protein-gocc  
- protein-cofactor  
- protein-catalyst 
- protein-pathway

**Output:** `UNIPROT_ACCESSION_PRO_ONTOLOGY_MAP.txt`

In [None]:
genomic_id_mapper(reformatted_mapped_identifiers,
                  processed_data_location + 'UNIPROT_ACCESSION_PRO_ONTOLOGY_MAP.txt',
                  'uniprot_id', 'pro_id', None, None, None, None)

In [None]:
# load data, print the number of rows, and preview it
uapr_data = pandas.read_csv(processed_data_location + 'UNIPROT_ACCESSION_PRO_ONTOLOGY_MAP.txt',
                            header=None, delimiter='\t', low_memory=False, usecols=[0, 1],
                            names=['Uniprot_Accession_IDs', 'Protein_Ontology_IDs'])

print('There are {edge_count} uniprot accession-protein ontology edges'.format(edge_count=len(uapr_data.drop_duplicates())))
uapr_data.head(n=5)

***

### Uniprot Accession-Entrez Gene <a class="anchor" id="uniprotaccession-entrezgene"></a>

**Purpose:** To map Uniprot accession identifiers to Entrez Gene identifiers when creating the following edges:  
- gene-gene  

**Output:** `UNIPROT_ACCESSION_ENTREZ_GENE_MAP.txt`

In [None]:
genomic_id_mapper(reformatted_mapped_identifiers,
                  processed_data_location + 'UNIPROT_ACCESSION_ENTREZ_GENE_MAP.txt',
                  'uniprot_id', 'entrez_id', None, 'master_gene_type', None, 'gene_type_update')

In [None]:
# load data, print the number of rows, and preview it
uaeg_data = pandas.read_csv(processed_data_location + 'UNIPROT_ACCESSION_ENTREZ_GENE_MAP.txt',
                            header=None, delimiter='\t', low_memory=False, usecols=[0, 1, 2, 3],
                            names=['Uniprot_Accession_IDs', 'Entrez_Gene_IDs',
                                  'master_gene_type', 'gene_type_update'])

# add prefix to output edge
uaeg_data['Entrez_Gene_IDs'] = 'NCBIGene_' + uaeg_data['Entrez_Gene_IDs'].astype(str)

# write data back to file
uaeg_data.to_csv(processed_data_location + 'UNIPROT_ACCESSION_ENTREZ_GENE_MAP.txt', header=None, sep='\t', index=False)

print('There are {edge_count} uniprot accession-entrez gene edges'.format(edge_count=len(uaeg_data.drop_duplicates())))
uaeg_data.head(n=5)

<br>

***
***
### Other Identifier Mapping <a class="anchor" id="other-identifier-mapping"></a>
***
* [ChEBI Identifiers](#mesh-chebi)  
* [Human Protein Atlas Tissue and Cell Types](#hpa-uberon) 
* [Human Disease and Phenotype Identifiers](#disease-identifiers) 
* [Reactome Pathways and the Pathway Ontology](#reactome-pw)  
* [Genomic Identifiers and the Sequence Ontology](#genomic-so)  

***
***

### ChEBI-MeSH Identifiers <a class="anchor" id="mesh-chebi"></a>

**Data Source Wiki Page:** [mapping-mesh-to-chebi](https://github.com/callahantiff/PheKnowLator/wiki/v4-Data-Sources#mapping-mesh-identifiers-to-chebi-identifiers)  

**Purpose:** Map MeSH identifiers to ChEBI identifiers when creating the following edges:  
- chemical-gene  
- chemical-disease

**Dependencies:** Recapitulates the [`LOOM`](https://www.bioontology.org/wiki/BioPortal_Mappings) algorithm implemented by BioPortal when creating mappings between resources. The procedure is relatively straightforward and consists of the following:
- For all MeSH `SCR Chemicals`, obtain the following information:  
  - <u>Identifiers</u>: MeSH identifiers     
  - <u>Labels</u>: string labels using the `rdfs:label` property  
  - <u>Synonyms</u>: track down all synonyms using the `vocab:concept` and `vocab:preferredConcept` object properties   
- For all ChEBI classes, obtain the following information:  
  - <u>Labels</u>: string labels using the `rdfs:label` property  
  - <u>Synonyms</u>: track down all synonyms using all `synonym` properties 
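
The matching idea behind the procedure above can be sketched in a few lines. This is a minimal illustration of lexical matching on normalized strings, not the BioPortal implementation; the function names (`normalize`, `lexical_matches`) and the toy identifiers are assumptions made for the example.

```python
import re

def normalize(label: str) -> str:
    """Lowercase a label and drop every non-word character (spaces, punctuation)."""
    return re.sub(r'[^\w]', '', label.lower())

# hypothetical toy vocabularies: identifier -> labels and synonyms
mesh_terms = {'MESH_TOY1': ['Acetaminophen', 'Paracetamol']}
chebi_terms = {'CHEBI_TOY1': ['paracetamol', '4-acetamidophenol']}

def lexical_matches(left: dict, right: dict) -> set:
    """Return (left_id, right_id) pairs that share at least one normalized string."""
    index = {}
    for right_id, strings in right.items():
        for s in strings:
            index.setdefault(normalize(s), set()).add(right_id)
    pairs = set()
    for left_id, strings in left.items():
        for s in strings:
            for right_id in index.get(normalize(s), set()):
                pairs.add((left_id, right_id))
    return pairs

matches = lexical_matches(mesh_terms, chebi_terms)
```

Because labels and synonyms are compared only after normalization, `Paracetamol` and `paracetamol` are treated as the same string, which is what links the two toy entries.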
  
*Alternatively:* You can use the [`ncbo_rest_api.py`](https://gist.github.com/callahantiff/a28fb3160782f42f104e9ec41553af0d) script to pull mappings from the BioPortal API, but note that it takes >2 days for it to finish.

**Output:** `MESH_CHEBI_MAP.txt`


***  
**MeSH**  
Downloads the `nt`-formatted version of the 2021 MeSH vocabulary. The data is then preprocessed and converted into a Pandas DataFrame so that it can be merged with `ChEBI` to identify overlapping concepts.
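
To make the preprocessing easier to follow, here is a small sketch of how a single N-Triples line is split into a subject, predicate, and object using the same string operations as the cell below. The sample line is illustrative (the exact URI layout in `mesh2021.nt` is an assumption) and `split_triple` is a hypothetical helper name.

```python
# illustrative N-Triples line; the exact URI layout in mesh2021.nt is an assumption
line = ('<http://id.nlm.nih.gov/mesh/2021/D000001> '
        '<http://www.w3.org/2000/01/rdf-schema#label> "Calcimycin"@en .\n')

def split_triple(nt_line: str) -> tuple:
    """Split one N-Triples line into short subject, predicate, and raw object strings."""
    parts = nt_line.split('> ')
    subject = parts[0].split('/')[-1]    # last path segment of the subject URI
    predicate = parts[1].split('#')[-1]  # fragment after '#', when there is one
    obj = parts[2]                       # left raw: may be a URI or a quoted literal
    return subject, predicate, obj

s, p, o = split_triple(line)

# labels are quoted literals, so the text sits between the first pair of double quotes
label = o.split('"')[1] if 'label' in p.lower() else None
```

Splitting on `'> '` is fast but fragile: a literal that itself contains `> ` would be split incorrectly, which a full N-Triples parser would avoid.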

In [None]:
# download data
url = 'ftp://nlmpubs.nlm.nih.gov/online/mesh/rdf/2021/mesh2021.nt'
if not os.path.exists(unprocessed_data_location + 'mesh2021.nt'):
    data_downloader(url, unprocessed_data_location)
    
# load data (the context manager closes the file once it has been read)
with open(unprocessed_data_location + 'mesh2021.nt') as mesh_file:
    mesh = [x.split('> ') for x in tqdm(mesh_file.readlines())]

In [None]:
# preprocess data
mesh_dict, results = {}, []
for row in tqdm(mesh):
    dbx, lab, msh_type = None, None, None
    s, p, o = row[0].split('/')[-1], row[1].split('#')[-1], row[2]  
    if s[0] in ['C', 'D'] and ('.' not in s and 'Q' not in s) and len(s) >= 5:
        s = 'MESH_' + s
        if p == 'preferredConcept' or p == 'concept': dbx = 'MESH_' + o.split('/')[-1]
        if 'label' in p.lower(): lab = o.split('"')[1]
        if 'type' in p.lower(): msh_type = o.split('#')[1]
        if s in mesh_dict.keys():
            if dbx is not None: mesh_dict[s]['dbxref'].add(dbx)
            if lab is not None: mesh_dict[s]['label'].add(lab)
            if msh_type is not None: mesh_dict[s]['type'].add(msh_type)
        else:
            mesh_dict[s] = {'dbxref': set() if dbx is None else {dbx},
                            'label': set() if lab is None else {lab},
                            'type': set() if msh_type is None else {msh_type},
                            'synonym': set()}

# fine tune dictionary - obtain labels for each entry's synonym identifiers
for key in tqdm(mesh_dict.keys()):
    for i in mesh_dict[key]['dbxref']:
        if len(mesh_dict[key]['dbxref']) > 0 and i in mesh_dict.keys():
            mesh_dict[key]['synonym'] |= mesh_dict[i]['label']

# expand data and convert to pandas DataFrame
for key, value in tqdm(mesh_dict.items()):
    results += [[key, list(value['label'])[0], 'NAME']]
    if len(value['synonym']) > 0:
        for i in value['synonym']:
            results += [[key, i, 'SYNONYM']]
mesh_filtered = pandas.DataFrame({'CODE': [x[0] for x in results],
                                  'TYPE': [x[2] for x in results],
                                  'STRING': [x[1] for x in results]})

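# lowercase all strings and remove white space and punctuation
# (regex=True is required for the pattern to be treated as a regular expression in recent pandas)
mesh_filtered['STRING'] = mesh_filtered['STRING'].str.lower()
mesh_filtered['STRING'] = mesh_filtered['STRING'].str.replace(r'[^\w]', '', regex=True)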

# preview data
mesh_filtered.head()

***  
**ChEBI**  
Downloads the flat file containing labels and synonyms for all classes in the `ChEBI` ontology. The data is then preprocessed and loaded into a Pandas DataFrame so that it can be merged with `MeSH` to identify overlapping concepts.

In [None]:
# download data
url = 'ftp://ftp.ebi.ac.uk/pub/databases/chebi/Flat_file_tab_delimited/names.tsv.gz'
if not os.path.exists(unprocessed_data_location + 'names.tsv'):
    data_downloader(url, unprocessed_data_location)
    
# load data
chebi = pandas.read_csv(unprocessed_data_location + 'names.tsv', header=0, delimiter='\t')

# preprocess data
# take an explicit copy so the column assignments below do not act on a view of `chebi`
chebi_filtered = chebi[['COMPOUND_ID', 'TYPE', 'NAME']].drop_duplicates(keep='first').copy()
chebi_filtered.columns = ['CODE', 'TYPE', 'STRING']

# append CHEBI to the number in each code
chebi_filtered['CODE'] = chebi_filtered['CODE'].apply(lambda x: "{}{}".format('CHEBI_', x))

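# lowercase all strings and remove white space and punctuation
# (regex=True is required for the pattern to be treated as a regular expression in recent pandas)
chebi_filtered['STRING'] = chebi_filtered['STRING'].str.lower()
chebi_filtered['STRING'] = chebi_filtered['STRING'].str.replace(r'[^\w]', '', regex=True)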

# preview data
chebi_filtered.head()

***  
**Merge Identifier Data**  
Performs an inner merge of the `MeSH` and `ChEBI` Pandas DataFrames in order to find concepts that exist in both DataFrames. Results are then written out to a text file.
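
As a quick illustration of what the inner merge produces, the sketch below merges two toy DataFrames that reuse the notebook's column names. The rows and identifiers are made up. Because both frames have a `CODE` column, pandas adds the default `_x` (left) and `_y` (right) suffixes.

```python
import pandas

# toy frames using the same column names as the notebook; the rows are made up
chebi_toy = pandas.DataFrame({'STRING': ['paracetamol', 'water'],
                              'CODE': ['CHEBI_TOY1', 'CHEBI_TOY2']})
mesh_toy = pandas.DataFrame({'STRING': ['paracetamol', 'aspirin'],
                             'CODE': ['MESH_TOY1', 'MESH_TOY3']})

# only strings present in both frames survive an inner merge
merged = pandas.merge(chebi_toy, mesh_toy, on='STRING', how='inner')

# CODE_x comes from the left (ChEBI) frame and CODE_y from the right (MeSH) frame
pairs = set(zip(merged['CODE_y'], merged['CODE_x']))
```

This is why the filtering loop in the next cell reads the MeSH identifier from `CODE_y` and the ChEBI identifier from `CODE_x`.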

In [None]:
# merge data
chem_merge = pandas.merge(chebi_filtered[['STRING', 'CODE']], mesh_filtered[['STRING', 'CODE']], on='STRING', how='inner')

# filter results
mesh_edges = set()
for idx, row in chem_merge.drop_duplicates().iterrows():
    mesh, chebi = row['CODE_y'], row['CODE_x']
    syns = [x for x in mesh_dict[mesh]['dbxref'] if 'C' in x or 'D' in x]
    mesh_edges.add(tuple([mesh, chebi]))
    if len(syns) > 0:
        for x in syns:
            mesh_edges.add(tuple([x, chebi]))

# write resulting mappings
with open(processed_data_location + 'MESH_CHEBI_MAP.txt', 'w') as out:
    for pair in mesh_edges:
        out.write(pair[0].replace('_', ':') + '\t' + pair[1] + '\n')

In [None]:
# load data
data = pandas.read_csv(processed_data_location + 'MESH_CHEBI_MAP.txt', header=None, names=['MESH_ID', 'CHEBI_ID'], delimiter='\t')

# preview mapping results
print('There are {} MeSH-ChEBI Edges'.format(len(data)))
data.head(n=5)

***

### Disease and Phenotype Identifiers <a class="anchor" id="disease-identifiers"></a>

**Data Source Wiki Page:**  
- [DisGeNET](https://github.com/callahantiff/PheKnowLator/wiki/v4-Data-Sources#disgenet)  
- [MedGen](https://github.com/callahantiff/PheKnowLator/wiki/v4-Data-Sources#national-center-for-biotechnology-information-medgen) 

**Purpose:** This script downloads the Human Phenotype Ontology (HPO), the MonDO Disease Ontology (MONDO), [disease_mappings.tsv](https://www.disgenet.org/static/disgenet_ap1/files/downloads/disease_mappings.tsv.gz), and [MGCONSO.RRF](https://ftp.ncbi.nlm.nih.gov/pub/medgen/MGCONSO.RRF.gz) in order to map UMLS identifiers to HPO and MONDO identifiers when creating the following edges:  
- chemical-disease  
- disease-phenotype  
- chemical-phenotype  
- gene-phenotype  
- variant-phenotype  

**Output:**   
- Human Disease Ontology Mappings ➞ `DISEASE_MONDO_MAP.txt`
- Human Phenotype Ontology Mappings ➞ `PHENOTYPE_HPO_MAP.txt`

***
**MONDO Identifiers**  
`MONDO` contains DbXRef mappings to other disease terminology identifiers. To make this useful, we will store the DbXRefs as a dictionary with `MONDO` identifiers as the values.
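
The reshaping step can be pictured with a toy example. The sketch assumes the helper returns a dictionary keyed by DbXRef with sets of ontology class IRIs as values, which is inferred from how the result is used below; `reformat_dbxrefs` and the identifier pairings are hypothetical. Unlike the notebook cell, the sketch filters each class individually rather than testing the whole value set.

```python
# toy stand-in for the helper's output: dbxref string -> set of class IRIs (assumed shape)
dbxref_toy = {
    'UMLS:C0000001': {'http://purl.obolibrary.org/obo/MONDO_0000001'},
    'MESH:D000001': {'http://purl.obolibrary.org/obo/MONDO_0000001',
                     'http://purl.obolibrary.org/obo/MONDO_0000002'},
}

def reformat_dbxrefs(dbxrefs: dict, prefix: str) -> dict:
    """Lowercase each dbxref key and shorten class IRIs to CURIEs such as 'MONDO:0000001'."""
    out = {}
    for xref, classes in dbxrefs.items():
        curies = {str(c).split('/')[-1].replace('_', ':') for c in classes if prefix in str(c)}
        if curies:
            out[str(xref).lower().split('/')[-1]] = curies
    return out

toy_map = reformat_dbxrefs(dbxref_toy, 'MONDO')
```

Keying the dictionary by the external identifier means a lookup such as `toy_map['mesh:d000001']` returns every ontology class that cites it.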

In [None]:
# download ontology
if not os.path.exists(unprocessed_data_location + 'mondo_with_imports.owl'):
    command = '{} {} --merge-import-closure -o {}'
    os.system(command.format(owltools_location, 'http://purl.obolibrary.org/obo/mondo.owl',
                             unprocessed_data_location + 'mondo_with_imports.owl'))
    
# read data into RDFLib graph object
mondo_graph = Graph().parse(unprocessed_data_location + 'mondo_with_imports.owl')
print('There are {} axioms in the ontology (date: {})'.format(len(mondo_graph), datetime.datetime.now().strftime('%m/%d/%Y')))

# get dbxrefs for all MONDO classes
dbxref_res = gets_ontology_class_dbxrefs(mondo_graph)[0]
mondo_dict = {str(k).lower().split('/')[-1]: {str(i).split('/')[-1].replace('_', ':') for i in v} for k, v in dbxref_res.items() if 'MONDO' in str(v)}

# pickle dictionary (the context manager closes the file after writing)
with open(processed_data_location + 'Mondo_Identifier_Map.pkl', 'wb') as out:
    pickle.dump(mondo_dict, out, protocol=4)

# convert to pandas DataFrame
temp_list = []
for k, v in mondo_dict.items():
    if k.startswith('umls:'): new_k = k.split(':')[-1].upper()
    elif k.startswith('hp:'): new_k = k.upper()
    elif k.startswith('mesh:'): new_k = 'MESH:' + k.split(':')[-1].upper()
    elif k.startswith('orphanet:'): new_k = 'ORPHA:' + k.split(':')[-1].upper()
    elif k.startswith('omimps:'): new_k = 'OMIM:' + k.split(':')[-1].upper()
    else: new_k = k
    for i in v:
        temp_list += [[new_k, i.replace(':', '_')]]
        temp_list += [[i, i.replace(':', '_')]]

# build the DataFrame from the reformatted identifier pairs
mondo_df = pandas.DataFrame({'other_id': [x[0] for x in temp_list],
                             'ontology_id': [x[1] for x in temp_list]})

***
**HPO Identifiers**  
`HPO` contains DbXRef mappings to other disease terminology identifiers. To make this useful, we will store the DbXRefs as a dictionary with `HPO` identifiers as the values.

In [None]:
# download ontology
if not os.path.exists(unprocessed_data_location + 'hp_with_imports.owl'):
    command = '{} {} --merge-import-closure -o {}'
    os.system(command.format(owltools_location, 'http://purl.obolibrary.org/obo/hp.owl',
                             unprocessed_data_location + 'hp_with_imports.owl'))

# read data into RDFLib graph object
hp_graph = Graph().parse(unprocessed_data_location + 'hp_with_imports.owl')
print('There are {} axioms in the ontology (date: {})'.format(len(hp_graph), datetime.datetime.now().strftime('%m/%d/%Y')))

# get dbxrefs for all HPO classes
dbxref_res = gets_ontology_class_dbxrefs(hp_graph)[0]
hp_dict = {str(k).lower().split('/')[-1]: {str(i).split('/')[-1].replace('_', ':') for i in v} for k, v in dbxref_res.items() if 'HP' in str(v)}

# pickle dictionary (the context manager closes the file after writing)
with open(processed_data_location + 'HPO_Identifier_Map.pkl', 'wb') as out:
    pickle.dump(hp_dict, out, protocol=4)

# convert to pandas DataFrame
temp_list = []
for k, v in hp_dict.items():
    if k.startswith('umls:'): new_k = k.split(':')[-1].upper()
    elif k.startswith('mondo:'): new_k = k.upper()
    elif k.startswith('msh:'): new_k = 'MESH:' + k.split(':')[-1].upper()
    elif k.startswith('orpha:'): new_k = 'ORPHA:' + k.split(':')[-1].upper()
    else: new_k = k
    for i in v:
        temp_list += [[new_k, i.replace(':', '_')]]
        temp_list += [[i, i.replace(':', '_')]]

# build the DataFrame from the reformatted identifier pairs
hp_df = pandas.DataFrame({'other_id': [x[0] for x in temp_list],
                          'ontology_id': [x[1] for x in temp_list]})

*Combine MONDO and HP Disease Mapping DataFrames into a Single DataFrame*

In [None]:
# combine data frames
disease_map_df = pandas.concat([mondo_df, hp_df])

# preview data
disease_map_df.head(n=5)

***
**DisGeNET Disease Mappings**

In [None]:
# download data
url = 'https://www.disgenet.org/static/disgenet_ap1/files/downloads/disease_mappings.tsv.gz'
if not os.path.exists(unprocessed_data_location + 'disease_mappings.tsv'):
    data_downloader(url, unprocessed_data_location)
    
# load data
disease_data = pandas.read_csv(unprocessed_data_location + 'disease_mappings.tsv', header=0, delimiter='\t')

# reformat data
disease_data['vocabulary'] = disease_data['vocabulary'].str.lower()
disease_data['diseaseId'] = disease_data['diseaseId'].str.lower()
disease_data['vocabulary'] = disease_data['vocabulary'].str.replace('hpo', 'HP')
disease_data['vocabulary'] = disease_data['vocabulary'].str.replace('mondo', 'MONDO')
disease_data['vocabulary'] = disease_data['vocabulary'].str.replace('msh', 'MESH')
disease_data['vocabulary'] = disease_data['vocabulary'].str.replace('omim', 'OMIM')
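# 'do' is replaced as a substring, so 'ordo' first becomes 'ordoid' and then 'ORPHAid';
# the final replacement restores the intended 'ORPHA' prefix
disease_data['vocabulary'] = disease_data['vocabulary'].str.replace('do', 'doid')
disease_data['vocabulary'] = disease_data['vocabulary'].str.replace('ordo', 'ORPHA')
disease_data['vocabulary'] = disease_data['vocabulary'].str.replace('ORPHAid', 'ORPHA')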

# capitalize UMLS id
disease_data['diseaseId'] = disease_data['diseaseId'].str.upper()

# create a disease code column
disease_data['code'] = disease_data['vocabulary'] + ':' + disease_data['code']
disease_data['code'] = disease_data['code'].str.replace('HP:HP:', 'HP:')

# rename columns
disease_data.rename(columns={'diseaseId': 'cui', 'vocabularyName': 'code_name'}, inplace=True)

# remove unneeded columns
disease_data = disease_data[['cui', 'code', 'code_name', 'vocabulary']].drop_duplicates()

# preview data
disease_data.head(n=3)

***
**MedGen Disease Mappings**

In [None]:
# download data
url = 'https://ftp.ncbi.nlm.nih.gov/pub/medgen/MGCONSO.RRF.gz'
if not os.path.exists(unprocessed_data_location + 'MGCONSO.RRF'):
    data_downloader(url, unprocessed_data_location)
    
# load data and clean data
medgen_data = pandas.read_csv(unprocessed_data_location + 'MGCONSO.RRF', header=0, delimiter='|')
medgen_data = medgen_data[medgen_data['SUPPRESS'] == 'N'].drop_duplicates()
medgen_data = medgen_data[medgen_data['SAB'].isin(['HPO', 'MONDO', 'MSH', 'ORDO', 'OMIM'])].drop_duplicates()

# reformat codes
medgen_data['temp_code'] = medgen_data.apply(lambda x: 'MESH:' + x['CODE'] if x['SAB'] == 'MSH'
                                             else 'OMIM:' + x['CODE'] if x['SAB'] == 'OMIM'
                                             else 'ORPHA:' + x['SDUI'].split('_')[-1] if x['SAB'] == 'ORDO'
                                             else x['SDUI'] if x['SAB'] == 'HPO'
                                             else x['SDUI'] if x['SAB'] == 'MONDO'
                                             else 'None', axis=1)

# add rows for MedGen identifiers
temp = medgen_data[['#CUI']].copy()  # copy so the new column is not assigned on a view
temp['temp_code'] = 'MedGen:' + temp['#CUI']
medgen_data = pandas.concat([medgen_data, temp])

# remove unneeded columns
medgen_data = medgen_data[['#CUI', 'temp_code', 'STR', 'SAB']].drop_duplicates()

# rename columns
medgen_data.rename(columns={'#CUI': 'cui',
                            'STR': 'code_name',
                            'temp_code': 'code',
                            'SAB': 'vocabulary'}, inplace=True)

# reformat vocabulary ids
medgen_data['vocabulary'] = medgen_data['vocabulary'].str.replace('HPO', 'HP')
medgen_data['vocabulary'] = medgen_data['vocabulary'].str.replace('MSH', 'MESH')

# preview data
medgen_data.head(n=3)

*Combine DisGeNET and MedGen Mappings*

In [None]:
# combine data
disease_mapping_data = pandas.concat([disease_data, medgen_data]).drop_duplicates()

# preview data
disease_mapping_data.head(n=3)

_Build Disease Identifier Dictionary_  
In order to improve efficiency when mapping different disease terminology identifiers to the [MonDO Disease Ontology](https://github.com/callahantiff/PheKnowLator/wiki/v4-Data-Sources#mondo-disease-ontology) and [Human Phenotype Ontology](https://github.com/callahantiff/PheKnowLator/wiki/v4-Data-Sources#human-phenotype-ontology), we create a dictionary of disease identifiers.
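
The core idea is to use the UMLS CUI as a bridge: any code that shares a CUI with an `HP` or `MONDO` code can be mapped to that ontology code. The sketch below shows this on made-up rows. The column names mirror the notebook, but the values, the `anchors`/`bridged`/`lookup` names, and the removal of self-mappings are choices made only for this example.

```python
import pandas

# toy rows: each CUI is linked to one or more vocabulary codes (all values are made up)
toy = pandas.DataFrame({
    'cui': ['C1', 'C1', 'C2', 'C3'],
    'code': ['MONDO:0000001', 'MESH:D000001', 'OMIM:100000', 'HP:0000001'],
    'vocabulary': ['MONDO', 'MESH', 'OMIM', 'HP'],
})

# CUIs that have at least one HP or MONDO code
anchors = toy[toy['vocabulary'].isin(['HP', 'MONDO'])][['cui', 'code']]

# every code sharing a CUI with an anchor inherits that anchor's ontology code
bridged = toy.merge(anchors, on='cui', suffixes=('_other', '_ontology'))
bridged = bridged[bridged['code_other'] != bridged['code_ontology']]
lookup = dict(zip(bridged['code_other'], bridged['code_ontology']))
```

In the toy data, `OMIM:100000` is dropped because its CUI (`C2`) has no `HP` or `MONDO` code to bridge to.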

In [None]:
# find cuis that map to HP or MONDO
disease_data_keep = disease_mapping_data.copy()
disease_data_keep = disease_data_keep.query('vocabulary == "HP" | vocabulary == "MONDO"')
disease_data_keep = disease_data_keep[['cui', 'code']]
cui_list = set(disease_data_keep['cui'])

# obtain a list of other ids that map to the cuis
temp_df = disease_mapping_data[disease_mapping_data['cui'].isin(cui_list)]

# merge back with original data
merged_temp = temp_df.merge(disease_data_keep, on='cui')
merged_temp = merged_temp[['code_x', 'code_y', 'code_name', 'vocabulary']].drop_duplicates()

# rename the columns
merged_temp.rename(columns={'code_x': 'cui', 'code_y': 'code'}, inplace=True)

# combine the columns back to main data
disease_mapping_data = pandas.concat([disease_mapping_data, merged_temp]).drop_duplicates()
disease_mapping_data = disease_mapping_data[['cui', 'code']].drop_duplicates()

# merge ontology and other mappings together
cleaned_disease_map = disease_mapping_data.merge(disease_map_df, left_on='cui', right_on='other_id')

# clean up file (rename returns a new DataFrame, so later assignments do not act on a view)
cleaned_disease_map = cleaned_disease_map[['cui', 'ontology_id']].rename(columns={'cui': 'disease_id'})

# format ontology identifiers
cleaned_disease_map['ontology_id'] = cleaned_disease_map['ontology_id'].str.replace(':', '_')
cleaned_disease_map['vocabulary'] = cleaned_disease_map['ontology_id'].str.replace(r'_.*', '', regex=True)
cleaned_disease_map = cleaned_disease_map.drop_duplicates()

# preview data
cleaned_disease_map.head(n=3)

_Write Mapping Data_

In [None]:
# split data by ontology and write to file
mondo_map = cleaned_disease_map[cleaned_disease_map['vocabulary'] == 'MONDO'].drop_duplicates()
hp_map = cleaned_disease_map[cleaned_disease_map['vocabulary'] == 'HP'].drop_duplicates()
mondo_map = mondo_map[['disease_id', 'ontology_id']]
hp_map = hp_map[['disease_id', 'ontology_id']]


# write data
mondo_map.to_csv(processed_data_location + 'DISEASE_MONDO_MAP.txt', header=None, index=False, sep='\t')
hp_map.to_csv(processed_data_location + 'PHENOTYPE_HPO_MAP.txt', header=None, index=False, sep='\t')

_Preview Processed MONDO Disease Ontology Mappings_

In [None]:
# load data, print row count, and preview it
dis_data = pandas.read_csv(processed_data_location + 'DISEASE_MONDO_MAP.txt', header=None, names=['Disease_IDs', 'MONDO_IDs'], delimiter='\t')

print('There are {} disease-MONDO edges'.format(len(dis_data)))
dis_data.head(n=5)

_Preview Processed Human Phenotype Mappings_

In [None]:
# load data, print row count, and preview it
hp_data = pandas.read_csv(processed_data_location + 'PHENOTYPE_HPO_MAP.txt', header=None, names=['Disease_IDs', 'HP_IDs'], delimiter='\t')

print('There are {} phenotype-HPO edges'.format(len(hp_data)))
hp_data.head(n=5)

***

### Human Protein Atlas/GTEx Tissue/Cells - UBERON + Cell Ontology + Cell Line Ontology <a class="anchor" id="hpa-uberon"></a>

**Data Source Wiki Page:**  
- [human-protein-atlas](https://github.com/callahantiff/PheKnowLator/wiki/v4-Data-Sources#human-protein-atlas) 
- [genotype-tissue-expression-project](https://github.com/callahantiff/PheKnowLator/wiki/v4-Data-Sources#genotype-tissue-expression-project)  

<br>

**Purpose:** Downloads a query for cell, tissue, and blood types with overexpressed protein-coding genes in the human proteome ([`proteinatlas_search.tsv`](https://www.proteinatlas.org/api/search_download.php?search=&columns=g,eg,up,pe,rnatsm,rnaclsm,rnacasm,rnabrsm,rnabcsm,rnablsm,scl,t_RNA_adipose_tissue,t_RNA_adrenal_gland,t_RNA_amygdala,t_RNA_appendix,t_RNA_basal_ganglia,t_RNA_bone_marrow,t_RNA_breast,t_RNA_cerebellum,t_RNA_cerebral_cortex,t_RNA_cervix,_uterine,t_RNA_colon,t_RNA_corpus_callosum,t_RNA_ductus_deferens,t_RNA_duodenum,t_RNA_endometrium_1,t_RNA_epididymis,t_RNA_esophagus,t_RNA_fallopian_tube,t_RNA_gallbladder,t_RNA_heart_muscle,t_RNA_hippocampal_formation,t_RNA_hypothalamus,t_RNA_kidney,t_RNA_liver,t_RNA_lung,t_RNA_lymph_node,t_RNA_midbrain,t_RNA_olfactory_region,t_RNA_ovary,t_RNA_pancreas,t_RNA_parathyroid_gland,t_RNA_pituitary_gland,t_RNA_placenta,t_RNA_pons_and_medulla,t_RNA_prostate,t_RNA_rectum,t_RNA_retina,t_RNA_salivary_gland,t_RNA_seminal_vesicle,t_RNA_skeletal_muscle,t_RNA_skin_1,t_RNA_small_intestine,t_RNA_smooth_muscle,t_RNA_spinal_cord,t_RNA_spleen,t_RNA_stomach_1,t_RNA_testis,t_RNA_thalamus,t_RNA_thymus,t_RNA_thyroid_gland,t_RNA_tongue,t_RNA_tonsil,t_RNA_urinary_bladder,t_RNA_vagina,t_RNA_B-cells,t_RNA_dendritic_cells,t_RNA_granulocytes,t_RNA_monocytes,t_RNA_NK-cells,t_RNA_T-cells,t_RNA_total_PBMC,cell_RNA_A-431,cell_RNA_A549,cell_RNA_AF22,cell_RNA_AN3-CA,cell_RNA_ASC_diff,cell_RNA_ASC_TERT1,cell_RNA_BEWO,cell_RNA_BJ,cell_RNA_BJ_hTERT+,cell_RNA_BJ_hTERT+_SV40_Large_T+,cell_RNA_BJ_hTERT+_SV40_Large_T+_RasG12V,cell_RNA_CACO-2,cell_RNA_CAPAN-2,cell_RNA_Daudi,cell_RNA_EFO-21,cell_RNA_fHDF/TERT166,cell_RNA_HaCaT,cell_RNA_HAP1,cell_RNA_HBEC3-KT,cell_RNA_HBF_TERT88,cell_RNA_HDLM-2,cell_RNA_HEK_293,cell_RNA_HEL,cell_RNA_HeLa,cell_RNA_Hep_G2,cell_RNA_HHSteC,cell_RNA_HL-60,cell_RNA_HMC-1,cell_RNA_HSkMC,cell_RNA_hTCEpi,cell_RNA_hTEC/SVTERT24-B,cell_RNA_hTERT-HME1,cell_RNA_HUVEC_TERT2,cell_RNA_K-562,cell_RNA_Karpas-707,cell_RNA_LHCN-M2,cell_RNA_MCF7,cell_RNA_MOLT-4,cell_RNA_NB-4,cell_RNA_NTERA-2,cell_RNA_PC-3,cell_RNA_REH,cell_RNA_RH-30,cell_RNA_RPMI-8226,cell_RNA_RPTEC_TERT1,cell_RNA_RT4,cell_RNA_SCLC-21H,cell_RNA_SH-SY5Y,cell_RNA_SiHa,cell_RNA_SK-BR-3,cell_RNA_SK-MEL-30,cell_RNA_T-47d,cell_RNA_THP-1,cell_RNA_TIME,cell_RNA_U-138_MG,cell_RNA_U-2_OS,cell_RNA_U-2197,cell_RNA_U-251_MG,cell_RNA_U-266/70,cell_RNA_U-266/84,cell_RNA_U-698,cell_RNA_U-87_MG,cell_RNA_U-937,cell_RNA_WM-115,blood_RNA_basophil,blood_RNA_classical_monocyte,blood_RNA_eosinophil,blood_RNA_gdT-cell,blood_RNA_intermediate_monocyte,blood_RNA_MAIT_T-cell,blood_RNA_memory_B-cell,blood_RNA_memory_CD4_T-cell,blood_RNA_memory_CD8_T-cell,blood_RNA_myeloid_DC,blood_RNA_naive_B-cell,blood_RNA_naive_CD4_T-cell,blood_RNA_naive_CD8_T-cell,blood_RNA_neutrophil,blood_RNA_NK-cell,blood_RNA_non-classical_monocyte,blood_RNA_plasmacytoid_DC,blood_RNA_T-reg,blood_RNA_total_PBMC,brain_RNA_amygdala,brain_RNA_basal_ganglia,brain_RNA_cerebellum,brain_RNA_cerebral_cortex,brain_RNA_hippocampal_formation,brain_RNA_hypothalamus,brain_RNA_midbrain,brain_RNA_olfactory_region,brain_RNA_pons_and_medulla,brain_RNA_thalamus&format=tsv)) via [API](https://www.proteinatlas.org/about/help/dataaccess) and median gene-level TPM by tissue for all genes that are not protein-coding ([`GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_median_tpm.gct`](https://storage.googleapis.com/gtex_analysis_v8/rna_seq_data/GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_median_tpm.gct.gz)) in order to create mappings from cell and tissue type strings to Uber-Anatomy Ontology, Cell Ontology, and Cell Line Ontology concepts (see [human-protein-atlas](https://github.com/callahantiff/PheKnowLator/wiki/v4-Data-Sources#human-protein-atlas) for details on the mapping process). The mappings are then used to create the following edge types:  
- rna-cell line  
- rna-tissue type   
- protein-cell line  
- protein-tissue type  


**Output:**  
- All HPA tissue and cell type strings ➞ `HPA_tissues.txt`  
- Mapping HPA strings to ontology concepts (documentation) ➞ `zooma_tissue_cell_mapping_04JAN2020.xlsx` 
- Final HPA-ontology mappings ➞ `HPA_GTEx_TISSUE_CELL_MAP.txt`
- HPA Edges ➞ `HPA_GTEX_RNA_GENE_PROTEIN_EDGES.txt`

***
**Human Protein Atlas**  
To expedite the mapping process, all HPA tissues, cells, cell lines, and fluid types are extracted from the HPA data columns.
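
The extraction relies on the expression columns ending in `[nTPM]` and carrying the tissue or cell name after `RNA - `. The sketch below applies that rule to a few hypothetical headers; the real HPA export may label its columns differently, and `term_from_header` is a helper name invented for the example.

```python
# hypothetical column headers; the real HPA export may label them differently
columns = ['Gene', 'Ensembl', 'Tissue RNA - adipose tissue [nTPM]', 'Cell RNA - HeLa [nTPM]']

def term_from_header(header: str):
    """Return the tissue or cell string from an expression column header, else None."""
    if not header.endswith('[nTPM]'):
        return None
    # keep what follows 'RNA - ' and precedes the ' [nTPM]' unit suffix
    return header.split('RNA - ')[-1].split(' [nTPM]')[0]

terms = [t for t in map(term_from_header, columns) if t is not None]
```

Columns such as `Gene` and `Ensembl` are skipped because they lack the unit suffix, so only anatomical terms reach the output file.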

In [None]:
# download data
url = 'https://www.proteinatlas.org/api/search_download.php?search=&columns=g,eg,up,pe,rnatsm,rnaclsm,rnacasm,rnabrsm,rnabcsm,rnablsm,scl,t_RNA_adipose_tissue,t_RNA_adrenal_gland,t_RNA_amygdala,t_RNA_appendix,t_RNA_basal_ganglia,t_RNA_bone_marrow,t_RNA_breast,t_RNA_cerebellum,t_RNA_cerebral_cortex,t_RNA_cervix,_uterine,t_RNA_colon,t_RNA_corpus_callosum,t_RNA_ductus_deferens,t_RNA_duodenum,t_RNA_endometrium_1,t_RNA_epididymis,t_RNA_esophagus,t_RNA_fallopian_tube,t_RNA_gallbladder,t_RNA_heart_muscle,t_RNA_hippocampal_formation,t_RNA_hypothalamus,t_RNA_kidney,t_RNA_liver,t_RNA_lung,t_RNA_lymph_node,t_RNA_midbrain,t_RNA_olfactory_region,t_RNA_ovary,t_RNA_pancreas,t_RNA_parathyroid_gland,t_RNA_pituitary_gland,t_RNA_placenta,t_RNA_pons_and_medulla,t_RNA_prostate,t_RNA_rectum,t_RNA_retina,t_RNA_salivary_gland,t_RNA_seminal_vesicle,t_RNA_skeletal_muscle,t_RNA_skin_1,t_RNA_small_intestine,t_RNA_smooth_muscle,t_RNA_spinal_cord,t_RNA_spleen,t_RNA_stomach_1,t_RNA_testis,t_RNA_thalamus,t_RNA_thymus,t_RNA_thyroid_gland,t_RNA_tongue,t_RNA_tonsil,t_RNA_urinary_bladder,t_RNA_vagina,t_RNA_B-cells,t_RNA_dendritic_cells,t_RNA_granulocytes,t_RNA_monocytes,t_RNA_NK-cells,t_RNA_T-cells,t_RNA_total_PBMC,cell_RNA_A-431,cell_RNA_A549,cell_RNA_AF22,cell_RNA_AN3-CA,cell_RNA_ASC_diff,cell_RNA_ASC_TERT1,cell_RNA_BEWO,cell_RNA_BJ,cell_RNA_BJ_hTERT+,cell_RNA_BJ_hTERT+_SV40_Large_T+,cell_RNA_BJ_hTERT+_SV40_Large_T+_RasG12V,cell_RNA_CACO-2,cell_RNA_CAPAN-2,cell_RNA_Daudi,cell_RNA_EFO-21,cell_RNA_fHDF/TERT166,cell_RNA_HaCaT,cell_RNA_HAP1,cell_RNA_HBEC3-KT,cell_RNA_HBF_TERT88,cell_RNA_HDLM-2,cell_RNA_HEK_293,cell_RNA_HEL,cell_RNA_HeLa,cell_RNA_Hep_G2,cell_RNA_HHSteC,cell_RNA_HL-60,cell_RNA_HMC-1,cell_RNA_HSkMC,cell_RNA_hTCEpi,cell_RNA_hTEC/SVTERT24-B,cell_RNA_hTERT-HME1,cell_RNA_HUVEC_TERT2,cell_RNA_K-562,cell_RNA_Karpas-707,cell_RNA_LHCN-M2,cell_RNA_MCF7,cell_RNA_MOLT-4,cell_RNA_NB-4,cell_RNA_NTERA-2,cell_RNA_PC-3,cell_RNA_REH,cell_RNA_RH-30,cell_RNA_RPMI-8226,cell_RNA_RPTEC_TERT1,cell_RNA_RT4,cell_RNA_SCLC-21H,cell_RNA_SH-SY5Y,cell_RNA_SiHa,cell_RNA_SK-BR-3,cell_RNA_SK-MEL-30,cell_RNA_T-47d,cell_RNA_THP-1,cell_RNA_TIME,cell_RNA_U-138_MG,cell_RNA_U-2_OS,cell_RNA_U-2197,cell_RNA_U-251_MG,cell_RNA_U-266/70,cell_RNA_U-266/84,cell_RNA_U-698,cell_RNA_U-87_MG,cell_RNA_U-937,cell_RNA_WM-115,blood_RNA_basophil,blood_RNA_classical_monocyte,blood_RNA_eosinophil,blood_RNA_gdT-cell,blood_RNA_intermediate_monocyte,blood_RNA_MAIT_T-cell,blood_RNA_memory_B-cell,blood_RNA_memory_CD4_T-cell,blood_RNA_memory_CD8_T-cell,blood_RNA_myeloid_DC,blood_RNA_naive_B-cell,blood_RNA_naive_CD4_T-cell,blood_RNA_naive_CD8_T-cell,blood_RNA_neutrophil,blood_RNA_NK-cell,blood_RNA_non-classical_monocyte,blood_RNA_plasmacytoid_DC,blood_RNA_T-reg,blood_RNA_total_PBMC,brain_RNA_amygdala,brain_RNA_basal_ganglia,brain_RNA_cerebellum,brain_RNA_cerebral_cortex,brain_RNA_hippocampal_formation,brain_RNA_hypothalamus,brain_RNA_midbrain,brain_RNA_olfactory_region,brain_RNA_pons_and_medulla,brain_RNA_thalamus&format=tsv'
if not os.path.exists(unprocessed_data_location + 'proteinatlas_search.tsv'):
    data_downloader(url, unprocessed_data_location, 'proteinatlas_search.tsv.gz')

# load data
hpa = pandas.read_csv(unprocessed_data_location + 'proteinatlas_search.tsv', header=0, delimiter='\t')
hpa.fillna('None', inplace=True)

In [None]:
# retrieve terms to map and write results
with open(unprocessed_data_location + 'HPA_tissues.txt', 'w') as outfile:
    for x in tqdm(list(hpa.columns)):
        if x.endswith('[nTPM]'):
            outfile.write(x.split('RNA - ')[-1].split(' [nTPM]')[:-1][0] + '\n')

***
**Genotype-Tissue Expression Project**  
Import the tissues, cells, cell lines, and fluids that we externally mapped from HPA and GTEx data to [UBERON](https://github.com/callahantiff/PheKnowLator/wiki/v4-Data-Sources#uber-anatomy-ontology), the [Cell Ontology](https://github.com/callahantiff/PheKnowLator/wiki/v4-Data-Sources#cell-ontology), and the [Cell Line Ontology](https://github.com/callahantiff/PheKnowLator/wiki/v4-Data-Sources#cell-line-ontology).

In [None]:
# download data
url='https://storage.googleapis.com/gtex_analysis_v8/rna_seq_data/GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_median_tpm.gct.gz'
if not os.path.exists(unprocessed_data_location + 'GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_median_tpm.gct'):
    data_downloader(url, unprocessed_data_location)

# load data
gtex = pandas.read_csv(unprocessed_data_location + 'GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_median_tpm.gct', header=0, skiprows=2, delimiter='\t')
gtex.fillna('None', inplace=True)  # replace NaN with 'None'
gtex['Name'] = gtex['Name'].str.replace(r'(\..*)', '', regex=True)  # remove the suffix that follows '.' (e.g. the Ensembl version)


In [None]:
# download data
url='https://storage.googleapis.com/pheknowlator/curated_data/zooma_tissue_cell_mapping_04JAN2020.xlsx'
if not os.path.exists(unprocessed_data_location + 'zooma_tissue_cell_mapping_04JAN2020.xlsx'):
    data_downloader(url, unprocessed_data_location)
    
# load ontology mapping data
mapping_data = pandas.read_excel(open(unprocessed_data_location + 'zooma_tissue_cell_mapping_04JAN2020.xlsx', 'rb'),
                                 sheet_name='Concept_Mapping - 04JAN2020', header=0, engine='openpyxl')
mapping_data.fillna('None', inplace=True)  # convert NaN to None

# preview data
mapping_data.head(n=3)

_Write HPA and GTEx Mapping Data_  
The HPA and GTEx mapping data is written locally so that it can be used by the `PheKnowLator` algorithm when creating the knowledge graph edge lists. 

In [None]:
with open(processed_data_location + 'HPA_GTEx_TISSUE_CELL_MAP.txt', 'w') as out:
    for idx, row in tqdm(mapping_data.iterrows(), total=mapping_data.shape[0]):
        if row['UBERON'] != 'None': out.write(str(row['TERM']).strip() + '\t' + str(row['UBERON']).strip() + '\n')
        if row['CL'] != 'None': out.write(str(row['TERM']).strip() + '\t' + str(row['CL']).strip() + '\n')
        if row['CLO'] != 'None': out.write(str(row['TERM']).strip() + '\t' + str(row['CLO']).strip() + '\n')

In [None]:
# load mapping data
mapping_data = pandas.read_csv(processed_data_location + 'HPA_GTEx_TISSUE_CELL_MAP.txt', header=None, names=['TISSUE_CELL_TERM', 'ONTOLOGY_IDs'], delimiter='\t')

# preview data
mapping_data.head(n=3)

***

**Create Edge Data Set**

_Human Protein Atlas_  
The `HPA` data is looped over and reformatted such that all tissue, cell, cell line, and fluid types are stored as a nested list. The anatomy type is specified as an item in each list according to its type in order to make mapping more efficient while building the knowledge graph edge list.
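
Each of the specificity columns processed below holds either `'None'` or a string of `name: value` pairs separated by `;`. A minimal sketch of that parsing step follows; `parse_ntpm` is a hypothetical helper and the sample strings are made up.

```python
def parse_ntpm(cell: str) -> list:
    """Split 'name: value;name: value' into (name, value) pairs; 'None' yields no pairs."""
    if cell == 'None':
        return []
    pairs = []
    for item in cell.split(';'):
        # split on the last ': ' so the numeric value is always the final piece
        name, value = item.rsplit(': ', 1)
        pairs.append((name, float(value)))
    return pairs

example = parse_ntpm('liver: 12.5;kidney: 3.0')
```

Because `str.split(';')` returns a one-element list when no `;` is present, a helper like this handles single-entry and multi-entry cells with the same loop.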

In [None]:
hpa_results = []
for idx, row in tqdm(hpa.iterrows(), total=hpa.shape[0]):
    ens = str(row['Ensembl']); gene = str(row['Gene']); uni = str(row['Uniprot'])
    evid = str(row['Evidence']); sub = str(row['Subcellular location']); source = 'The Human Protein Atlas'
    if row['RNA tissue specific nTPM'] != 'None':
        row_val = row['RNA tissue specific nTPM']
        if ';' in row_val:
            for x in row_val.split(';'):
                x1 = str(x.split(':')[0]); x2 = float(x.split(': ')[1])
                hpa_results += [ [ens, gene, uni, evid, 'anatomy', 'None', x1, x2, source]]
        else:
            x1 = str(row_val.split(':')[0]); x2 = float(row_val.split(': ')[1])
            hpa_results += [[ens, gene, uni, evid, 'anatomy', 'None', x1, x2, source]]
    if row['RNA cell line specific nTPM'] != 'None':
        row_val = row['RNA cell line specific nTPM']
        if ';' in row_val:
            for x in row_val.split(';'):
                x1 = str(x.split(':')[0]); x2 = float(x.split(': ')[1])
                hpa_results += [[ens, gene, uni, evid, 'cell line', sub, x1, x2, source]]
        else:
            x1 = str(row_val.split(':')[0]); x2 = float(row_val.split(': ')[1])
            hpa_results += [[ens, gene, uni, evid, 'cell line', sub, x1, x2, source]]
    if row['RNA brain regional specific nTPM'] != 'None':
        row_val = row['RNA brain regional specific nTPM']
        if ';' in row_val:
            for x in row_val.split(';'):
                x1 = str(x.split(':')[0]); x2 = float(x.split(': ')[1])
                hpa_results += [[ens, gene, uni, evid, 'anatomy', 'None', x1, x2, source]]
        else:
            x1 = str(row_val.split(':')[0]); x2 = float(row_val.split(': ')[1])
            hpa_results += [[ens, gene, uni, evid, 'anatomy', 'None', x1, x2, source]]
    if row['RNA blood cell specific nTPM'] != 'None':
        row_val = row['RNA blood cell specific nTPM']
        if ';' in row_val:
            for x in row_val.split(';'):
                x1 = str(x.split(':')[0]); x2 = float(x.split(': ')[1])
                hpa_results += [[ens, gene, uni, evid, 'cell line', sub, x1, x2, source]]
        else:
            x1 = str(row_val.split(':')[0]); x2 = float(row_val.split(': ')[1])
            hpa_results += [[ens, gene, uni, evid, 'cell line', sub, x1, x2, source]]
    if row['RNA blood lineage specific nTPM'] != 'None':
        row_val = row['RNA blood lineage specific nTPM']
        if ';' in row_val:
            for x in row_val.split(';'):
                x1 = str(x.split(':')[0]); x2 = float(x.split(': ')[1])
                hpa_results += [[ens, gene, uni, evid, 'cell line', sub, x1, x2, source]]
        else:
            x1 = str(row_val.split(':')[0]); x2 = float(row_val.split(': ')[1])
            hpa_results += [[ens, gene, uni, evid, 'cell line', sub, x1, x2, source]]
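
The five branches above differ only in the column they read, the anatomy type, and whether the subcellular location is kept. As a minimal sketch, the shared parsing step could be factored out as shown below. The helper name `parse_ntpm` is hypothetical and is not part of `data_utils.py`, and the sketch assumes every entry has the form `name: value` with entries separated by `;`. Unlike the loop above, it also trims whitespace around the name.

```python
def parse_ntpm(field):
    """Split an HPA nTPM string such as 'liver: 4.5;kidney: 2.0' into (name, value) pairs."""
    pairs = []
    for item in field.split(';'):
        # split on the first ':' only; the text before it is the name, the rest is the value
        name, value = item.split(':', 1)
        pairs.append((name.strip(), float(value)))
    return pairs


# a multi-entry field and a single-entry field go through the same code path
print(parse_ntpm('liver: 4.5;kidney: 2.0'))
print(parse_ntpm('brain: 1.0'))
```

Because `str.split(';')` returns a one-element list when no `;` is present, a helper like this would not need a separate branch for single-entry fields.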

_Genotype-Tissue Expression Project_  
The `GTEx` edge data is created by first removing all _protein-coding_ genes that already appear in the `HPA` cell transcriptome data set. Once only noncoding genes remain, we apply a second filter and keep a gene and its corresponding tissue, cell, or fluid only if the median expression is `>= 1.0`. The `GTEx` file is formatted such that each anatomical entity is a column and each unique gene is a row, so the expression filter is applied while the file is being reformatted. The gene and tissue/cell/fluid pairs that meet the criteria are stored as a nested list.
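
The reshaping and filtering described above can be sketched on a toy table in plain Python. The gene names, column names, and values below are invented for illustration; only the `>= 1.0` threshold and the `'Cells'` check come from this notebook.

```python
# toy wide-format rows: one row per gene, one column per tissue (values are invented)
toy_rows = [
    {'Name': 'GENE_A', 'Description': 'geneA', 'Liver': 2.5, 'Cells - toy line': 0.2},
    {'Name': 'GENE_B', 'Description': 'geneB', 'Liver': 0.0, 'Cells - toy line': 7.0},
]

long_rows = []
for row in toy_rows:
    for col, value in row.items():
        if col in ('Name', 'Description'):
            continue  # identifier columns, not expression values
        if float(value) >= 1.0:  # keep only gene-tissue pairs at or above the threshold
            typ = 'cell line' if 'Cells' in col else 'anatomy'
            long_rows.append([row['Name'], col, typ, float(value)])

print(long_rows)
```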

In [None]:
# remove rows that contain protein coding genes already in the hpa data
hpa_genes = list(hpa['Ensembl'].drop_duplicates(keep='first', inplace=False))
gtex = gtex.loc[gtex['Name'].apply(lambda x: x not in hpa_genes)]

In [None]:
# loop over data and re-organize - only keep results with tpm >= 1 (protein-coding genes were removed above)
gtex_results = []
source = 'Genotype-Tissue Expression (GTEx) Project'
evidence = 'Evidence at transcript level'
for idx, row in tqdm(gtex.iterrows(), total=gtex.shape[0]):
    for col in list(gtex.columns)[2:]:
        # skip gene-tissue pairs whose median expression is below the threshold
        if float(row[col]) >= 1.0:
            typ = 'cell line' if 'Cells' in col else 'anatomy'
            gtex_results += [[str(row['Name']), str(row['Description']), 'None', evidence, typ, 'None', col, float(row[col]), source]]

*Write Edge Data*  

In [None]:
with open(processed_data_location + 'HPA_GTEX_RNA_GENE_PROTEIN_EDGES.txt', 'w') as out:
    for x in tqdm(hpa_results + gtex_results):
        out.write('\t'.join(str(v) for v in x) + '\n')

In [None]:
# load data, return edge count, and preview it
hpa_edges = pandas.read_csv(processed_data_location + 'HPA_GTEX_RNA_GENE_PROTEIN_EDGES.txt',
                           header=None, low_memory=False, sep='\t',
                           names=['Ensembl_IDs', 'Gene_Symbols', 'Uniprot_IDs', 'Evidence',
                                  'Anatomy_Type', 'Subcellular_Location', 'Anatomy', 'Expression_Value',
                                 'Source'])

print('There are {edge_count} edges'.format(edge_count=len(hpa_edges)))
hpa_edges.head(n=5)

<br>

***

### Mapping Reactome Pathways to the Pathway Ontology <a class="anchor" id="reactome-pw"></a>

**Data Source Wiki Page:** [Pathway Ontology](https://github.com/callahantiff/PheKnowLator/wiki/v4-Data-Sources#pathway-ontology)  

**Purpose:** This script downloads the [canonical pathways](http://compath.scai.fraunhofer.de/export_mappings) and [kegg-reactome pathway mappings](https://github.com/ComPath/resources/blob/master/mappings/kegg_reactome.csv) files from the [ComPath Ecosystem](https://github.com/ComPath) in order to create the following identifier mappings:  
- `Reactome Pathway Identifiers`  ➞ `KEGG Pathway Identifiers` ➞ `Pathway Ontology Identifiers` 

**Output:**  
- `REACTOME_PW_GO_MAPPINGS.txt`


***

**Pathway Ontology**   
Use [OWLTools](https://github.com/owlcollab/owltools/wiki) to download the [Pathway Ontology](http://www.obofoundry.org/ontology/pw.html). Once downloaded, we read the ontology in as an `RDFLib` graph object so that we can query it to obtain all `DbXRefs`.

In [None]:
# download ontology
if not os.path.exists(unprocessed_data_location + 'pw_with_imports.owl'):
    command = '{} {} --merge-import-closure -o {}'
    os.system(command.format(owltools_location, 'http://purl.obolibrary.org/obo/pw.owl',
                             unprocessed_data_location + 'pw_with_imports.owl'))

# load the ontology as a graph
pw_graph = Graph().parse(unprocessed_data_location + 'pw_with_imports.owl')
print('There are {} axioms in the ontology (date: {})'.format(len(pw_graph), datetime.datetime.now().strftime('%m/%d/%Y')))

_Reformat Mapping Results_  
Create a dictionary of mapping results where pathway ontology identifiers are values and the keys are `DbXRef` identifiers.


In [None]:
# get dbxref results
dbxref_res = gets_ontology_class_dbxrefs(pw_graph)[0]
dbxref_dict = {str(k).lower().split('/')[-1]: {str(i).split('/')[-1].replace('_', ':') for i in v} for k, v in dbxref_res.items() if 'PW_' in str(v)}

# get synonym results
syn_res = gets_ontology_class_synonyms(pw_graph)[0]
synonym_dict = {str(k).lower().split('/')[-1]: {str(i).split('/')[-1].replace('_', ':') for i in v} for k, v in syn_res.items() if 'PW_' in str(v)}

# combine results into single dictionary
id_mappings = {**dbxref_dict, **synonym_dict}

print('There are {} results (date: {})'.format(len(id_mappings), datetime.datetime.now().strftime('%m/%d/%Y')))
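
To make the reformatting step concrete, here is a small sketch on hand-made input. It assumes that the helper functions return a dictionary whose keys are cross-reference strings and whose values are sets of class IRIs. That shape is inferred from the comprehensions above rather than from the helpers' documentation, and the `kegg:04110` entries are invented. The sketch also shows one thing to watch for: merging with `{**a, **b}` keeps only the second dictionary's value when a key occurs in both.

```python
# invented inputs that mimic the assumed shape: {cross-reference: {class IRIs}}
toy_dbxrefs = {'KEGG:04110': {'http://purl.obolibrary.org/obo/PW_0000001'}}
toy_synonyms = {'kegg:04110': {'http://purl.obolibrary.org/obo/PW_0000002'}}


def normalize(mapping):
    """Lower-case the keys and reduce each class IRI to a 'PW:0000001'-style identifier."""
    return {str(k).lower().split('/')[-1]: {str(i).split('/')[-1].replace('_', ':') for i in v}
            for k, v in mapping.items() if 'PW_' in str(v)}


a, b = normalize(toy_dbxrefs), normalize(toy_synonyms)

# dictionary unpacking: the second dictionary wins on shared keys
overwritten = {**a, **b}

# alternative: union the value sets so that no mapping is dropped
unioned = {k: a.get(k, set()) | b.get(k, set()) for k in a.keys() | b.keys()}

print(overwritten)
print(unioned)
```

Whether overwriting or taking the union is preferable depends on how often cross-references and synonyms share a key in the real ontology, which this sketch does not measure.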

***

**Reactome Pathways**  
Download a file of all [Reactome Pathways](https://reactome.org/download/current/ReactomePathways.txt), [Reactome's GO Annotations](https://reactome.org/download/current/gene_association.reactome.gz), and [Reactome's mappings to ChEBI](https://reactome.org/download/current/ChEBI2Reactome_All_Levels.txt). Each file is filtered to only include human pathways.

_Reactome Pathway Stable Identifiers_

In [None]:
# download data
url = 'https://reactome.org/download/current/ReactomePathways.txt'
if not os.path.exists(unprocessed_data_location + 'ReactomePathways.txt'):
    data_downloader(url, unprocessed_data_location)

# load data
reactome_pathways = pandas.read_csv(unprocessed_data_location + 'ReactomePathways.txt', header=None, delimiter='\t', low_memory=False)

In [None]:
# remove all non-human pathways and map each remaining identifier to the default class PW_0000001
reactome_pathways = reactome_pathways.loc[reactome_pathways[2].apply(lambda x: x == 'Homo sapiens')] 
reactome_map = {x:set(['PW_0000001']) for x in set(list(reactome_pathways[0]))}     

_Reactome's Mappings to GO Annotations_

In [None]:
# download data
url = 'https://reactome.org/download/current/gene_association.reactome.gz'
if not os.path.exists(unprocessed_data_location + 'gene_association.reactome'):
    data_downloader(url, unprocessed_data_location)

# load data
reactome_pathways2 = pandas.read_csv(unprocessed_data_location + 'gene_association.reactome', header=None, delimiter='\t', skiprows=4, low_memory=False)

In [None]:
# remove all non-human annotations and map each remaining identifier to the default class PW_0000001
reactome_pathways2 = reactome_pathways2.loc[reactome_pathways2[12].apply(lambda x: x == 'taxon:9606')] 
reactome_map.update({x.split(':')[-1]:set(['PW_0000001']) for x in set(list(reactome_pathways2[5]))})     

_Reactome's Mappings to ChEBI_

In [None]:
# download data
url = 'https://reactome.org/download/current/ChEBI2Reactome_All_Levels.txt'
if not os.path.exists(unprocessed_data_location + 'ChEBI2Reactome_All_Levels.txt'):
    data_downloader(url, unprocessed_data_location)

# load data
reactome_pathways3 = pandas.read_csv(unprocessed_data_location + 'ChEBI2Reactome_All_Levels.txt', header=None, delimiter='\t', low_memory=False)

In [None]:
# remove all non-human pathways and map each remaining identifier to the default class PW_0000001
reactome_pathways3 = reactome_pathways3.loc[reactome_pathways3[5].apply(lambda x: x == 'Homo sapiens')] 
reactome_map.update({x:set(['PW_0000001']) for x in set(list(reactome_pathways3[1]))})     

***

**ComPath Reactome Pathway Mappings**  
Use [ComPath Mappings](https://github.com/ComPath/resources/tree/master/mappings) to obtain the following mappings:  `Reactome Pathways`  ➞ `KEGG Pathways` ➞ `Pathway Ontology` 

_Canonical Pathways_

In [None]:
# download data
url1 = 'http://compath.scai.fraunhofer.de/export_mappings'
if not os.path.exists(unprocessed_data_location + 'compath_canonical_pathway_mappings.txt'):
    data_downloader(url1, unprocessed_data_location, 'compath_canonical_pathway_mappings.txt')

# load data
compath_cannonical = pandas.read_csv(unprocessed_data_location + 'compath_canonical_pathway_mappings.txt', header=None, delimiter='\t', low_memory=False)
compath_cannonical.fillna('None', inplace=True)

In [None]:
for idx, row in tqdm(compath_cannonical.iterrows(), total=compath_cannonical.shape[0]):
    if row[6] == 'kegg' and 'kegg:' + row[5].strip('path:hsa') in id_mappings.keys() and row[2] == 'reactome':
        for x in id_mappings['kegg:' + row[5].strip('path:hsa')]:
            if row[1] in reactome_map.keys(): reactome_map[row[1]] |= set([x.split('/')[-1]])
            else: reactome_map[row[1]] = set([x.split('/')[-1]])
    if (row[2] == 'kegg' and 'kegg:' + row[1].strip('path:hsa') in id_mappings.keys()) and row[6] == 'reactome':
        for x in id_mappings['kegg:' + row[1].strip('path:hsa')]:
            if row[5] in reactome_map.keys(): reactome_map[row[5]] |= set([x.split('/')[-1]])
            else: reactome_map[row[5]] = set([x.split('/')[-1]])         
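
One detail worth knowing about the loop above: `str.strip('path:hsa')` removes any of the characters `p`, `a`, `t`, `h`, `:`, and `s` from both ends of the string, rather than removing the literal prefix. For KEGG identifiers that end in a digit the result is the same. The sketch below contrasts the two approaches on a typical-looking identifier and on a deliberately contrived one. `remove_kegg_prefix` is a hypothetical helper, not something this notebook uses.

```python
def remove_kegg_prefix(identifier, prefix='path:hsa'):
    """Drop a literal leading prefix and leave the rest of the identifier untouched."""
    return identifier[len(prefix):] if identifier.startswith(prefix) else identifier


well_formed = 'path:hsa04110'
contrived = 'path:hsa0001a'  # invented: ends in a character that strip() also removes

# both approaches agree when the identifier ends in a digit
print(well_formed.strip('path:hsa'), remove_kegg_prefix(well_formed))

# they disagree when the last character is in the strip() character set
print(contrived.strip('path:hsa'), remove_kegg_prefix(contrived))
```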

_KEGG - Reactome Mappings_

In [None]:
# download data
url2 = 'https://raw.githubusercontent.com/ComPath/resources/master/mappings/kegg_reactome.csv'
if not os.path.exists(unprocessed_data_location + 'kegg_reactome.csv'):
    data_downloader(url2, unprocessed_data_location, 'kegg_reactome.csv')

# load data
kegg_reactome_map = pandas.read_csv(unprocessed_data_location + 'kegg_reactome.csv', header=0, delimiter=',', low_memory=False)

In [None]:
for idx, row in tqdm(kegg_reactome_map.iterrows(), total=kegg_reactome_map.shape[0]):
    if row['Source Resource'] == 'reactome' and 'kegg:' + row['Target ID'].strip('path:hsa') in id_mappings.keys():
        for x in id_mappings['kegg:' + row['Target ID'].strip('path:hsa')]:
            if row['Source ID'] in reactome_map.keys(): reactome_map[row['Source ID']] |= set([x.split('/')[-1]])
            else: reactome_map[row['Source ID']] = set([x.split('/')[-1]])
    if row['Target Resource'] == 'reactome' and 'kegg:' + row['Source ID'].strip('path:hsa') in id_mappings.keys():
        for x in id_mappings['kegg:' + row['Source ID'].strip('path:hsa')]:
            if row['Target ID'] in reactome_map.keys(): reactome_map[row['Target ID']] |= set([x.split('/')[-1]])
            else: reactome_map[row['Target ID']] = set([x.split('/')[-1]])

***

**Reactome Pathway GO Annotation Mappings**  
Use Reactome's [API](https://reactome.org/dev/content-service) to obtain the following mappings: `Reactome Pathway Identifiers`  ➞ `Gene Ontology Identifiers`.

In [None]:
for request_ids in tqdm(list(chunks(list(reactome_map.keys()), 20))):
    result, key = content.query_ids(ids=','.join(request_ids)), 'goBiologicalProcess'
    if result is not None and (isinstance(result, List) or result['code'] != 404):
        for res in result:
            if key in res.keys():
                if res['stId'] in reactome_map.keys(): reactome_map[res['stId']] |= {'GO_' + res[key]['accession']}
                else: reactome_map[res['stId']] = {'GO_' + res[key]['accession']}
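
The loop above sends identifiers to the Reactome content service in batches of 20 rather than one request per pathway. The `chunks` helper it calls is imported elsewhere in this notebook. As a rough stand-in (an assumption about its behavior, not its actual source), batching can be written as follows; `chunk_list` and the identifiers are invented for illustration.

```python
def chunk_list(items, size):
    """Yield consecutive slices of `items`, each holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


toy_ids = ['R-HSA-{}'.format(i) for i in range(1, 6)]  # invented identifiers

# each batch becomes one comma-separated query string
batches = [','.join(batch) for batch in chunk_list(toy_ids, 2)]
print(batches)
```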

*Write Data*

In [None]:
# reformat identifiers -- replace ':' with '_' in ontology concept identifiers
temp_dict = dict()
for key, value in tqdm(reactome_map.items()):
    temp_dict[key] = set(x.replace(':', '_') for x in value)

# overwrite original reactome dict with cleaned mappings
reactome_map = temp_dict

# output data
with open(processed_data_location + 'REACTOME_PW_GO_MAPPINGS.txt', 'w') as out:
    for key in tqdm(reactome_map.keys()):
        for x in reactome_map[key]:
            if x.startswith('PW') or x.startswith('GO'): out.write(key + '\t' + x + '\n')

In [None]:
# load data, print row count, and preview it
pw_data = pandas.read_csv(processed_data_location + 'REACTOME_PW_GO_MAPPINGS.txt', header=None, names=['Pathway_IDs', 'Mapping_IDs'], delimiter='\t')

print('There are {edge_count} pathway ontology mappings'.format(edge_count=len(pw_data)))
pw_data.head(n=5)

<br>

***

### Mapping Genomic Identifiers to the Sequence Ontology <a class="anchor" id="genomic-soo"></a>

**Data Source Wiki Page:** [Sequence Ontology](https://github.com/callahantiff/PheKnowLator/wiki/v4-Data-Sources#sequence-ontology)  

**Purpose:** This script downloads the `genomic_sequence_ontology_mappings.xlsx` file in order to create the following identifier mappings:  
- `Gene BioTypes`  ➞ `Sequence Ontology Identifiers`  
- `RNA BioTypes`  ➞ `Sequence Ontology Identifiers`  
- `Variant Types`  ➞ `Sequence Ontology Identifiers`

**Output:**  
- `SO_GENE_TRANSCRIPT_VARIANT_TYPE_MAPPING.txt`


In [None]:
# download data
url='https://storage.googleapis.com/pheknowlator/curated_data/genomic_sequence_ontology_mappings.xlsx'
if not os.path.exists(unprocessed_data_location + 'genomic_sequence_ontology_mappings.xlsx'):
    data_downloader(url, unprocessed_data_location)

# load data
mapping_data = pandas.read_excel(open(unprocessed_data_location + 'genomic_sequence_ontology_mappings.xlsx', 'rb'),
                                 sheet_name='GenomicType_SO_Map_09Mar2020', header=0, engine='openpyxl')

# convert data to dictionary
genomic_type_so_map = {}
for idx, row in tqdm(mapping_data.iterrows(), total=mapping_data.shape[0]):
    genomic_type_so_map[row['source_*_type'] + '_' + row['Genomic']] = row['SO ID']

***

**Genes**

In [None]:
# read in genomic mapping data
genomic_mapped_ids = pickle.load(open(processed_data_location + 'Merged_gene_rna_protein_identifiers.pkl', 'rb'))

sequence_map = {}
for identifier in tqdm(genomic_mapped_ids.keys()):    
    if identifier.startswith('entrez_id_') and identifier.replace('entrez_id_', '') != 'None':
        id_clean = identifier.replace('entrez_id_', '')
        
        # get identifier types
        ensembl = [x.replace('ensembl_gene_type_', '') for x in genomic_mapped_ids[identifier] if x.startswith('ensembl_gene_type') and x != 'ensembl_gene_type_unknown']
        hgnc = [x.replace('hgnc_gene_type_', '')  for x in genomic_mapped_ids[identifier] if x.startswith('hgnc_gene_type') and x != 'hgnc_gene_type_unknown']
        entrez = [x.replace('entrez_gene_type_', '')  for x in genomic_mapped_ids[identifier] if x.startswith('entrez_gene_type') and x != 'entrez_gene_type_unknown']
        
        # determine gene type
        if len(ensembl) > 0: gene_type = genomic_type_so_map[ensembl[0].replace('ensembl_gene_type_', '') + '_Gene']
        elif len(hgnc) > 0: gene_type = genomic_type_so_map[hgnc[0].replace('hgnc_gene_type_', '') + '_Gene']
        elif len(entrez) > 0: gene_type = genomic_type_so_map[entrez[0].replace('entrez_gene_type_', '') + '_Gene']
        else: gene_type = 'SO_0000704'  
        
        # update sequence map
        if id_clean in sequence_map.keys(): sequence_map[id_clean] += [gene_type]
        else: sequence_map[id_clean] = [gene_type]
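
The gene type is chosen by priority: the Ensembl type is used if present, then HGNC, then Entrez, and `SO_0000704` is the fallback when no source reports a type. A minimal sketch of that selection logic follows. The function name `pick_gene_type`, the dictionary keys, and the `SO_PLACEHOLDER_*` values are all invented; the real keys and identifiers come from the curated spreadsheet loaded above.

```python
def pick_gene_type(ensembl, hgnc, entrez, type_map, default='SO_0000704'):
    """Return the mapped identifier for the first source that reports a gene type."""
    for source_types in (ensembl, hgnc, entrez):  # same priority order as the cell above
        if source_types:
            return type_map[source_types[0] + '_Gene']
    return default  # fallback identifier used in the cell above


# placeholder mapping; keys and values are invented for illustration
toy_map = {'protein_coding_Gene': 'SO_PLACEHOLDER_A', 'lncRNA_Gene': 'SO_PLACEHOLDER_B'}

print(pick_gene_type([], ['lncRNA'], ['protein_coding'], toy_map))
print(pick_gene_type([], [], [], toy_map))
```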

***

**Transcripts**

In [None]:
# read in processed Ensembl Transcript data 
transcript_data = pandas.read_csv(processed_data_location + 'ensembl_identifier_data_cleaned.txt', header=0, delimiter='\t', low_memory=False)

# convert to dictionary
transcripts = {}
for idx, row in tqdm(transcript_data.iterrows(), total=transcript_data.shape[0]):
    if row['transcript_stable_id'] != 'None':
        if row['transcript_stable_id'].replace('transcript_stable_id_', '') in transcripts.keys():
            transcripts[row['transcript_stable_id'].replace('transcript_stable_id_', '')] += [row['ensembl_transcript_type']]
        else: transcripts[row['transcript_stable_id'].replace('transcript_stable_id_', '')] = [row['ensembl_transcript_type']]
            
# update so map dictionary
for identifier in tqdm(transcripts.keys()):
    if transcripts[identifier][0] == 'protein_coding': trans_type = genomic_type_so_map['protein-coding_Transcript']
    elif transcripts[identifier][0] == 'misc_RNA': trans_type = genomic_type_so_map['miscRNA_Transcript']
    else: trans_type = genomic_type_so_map[list(set(transcripts[identifier]))[0] + '_Transcript']
    sequence_map[identifier] = [trans_type, 'SO_0000673']

***

**Variants**

In [None]:
# read in variant summary data 
url = 'ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz'
if not os.path.exists(unprocessed_data_location + 'variant_summary.txt'):
    data_downloader(url, unprocessed_data_location)
    
# load data    
variant_data = pandas.read_csv(unprocessed_data_location + 'variant_summary.txt', header=0, delimiter='\t', low_memory=False)

# convert to dictionary
variants = {}
for idx, row in tqdm(variant_data.iterrows(), total=variant_data.shape[0]):
    if row['Assembly'] == 'GRCh38' and row['RS# (dbSNP)'] != -1:
        if 'rs' + str(row['RS# (dbSNP)']) in variants.keys(): variants['rs' + str(row['RS# (dbSNP)'])] |= set([row['Type']])
        else: variants['rs' + str(row['RS# (dbSNP)'])] = set([row['Type']])

# update so map dictionary
for identifier in tqdm(variants.keys()):
    for typ in variants[identifier]:
        var_type = genomic_type_so_map[typ.lower() + '_Variant']
        if identifier in sequence_map.keys(): sequence_map[identifier] += [var_type]
        else: sequence_map[identifier] = [var_type]
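
The `if key in dict ... else` pattern used above to collect variant types per identifier can also be written with `collections.defaultdict`. A minimal sketch with invented records:

```python
from collections import defaultdict

# (rs identifier, variant type) pairs; values are invented for illustration
toy_records = [('rs1', 'Deletion'), ('rs1', 'Indel'), ('rs2', 'Deletion'), ('rs1', 'Deletion')]

variants_by_id = defaultdict(set)
for rs_id, variant_type in toy_records:
    variants_by_id[rs_id].add(variant_type)  # a missing key starts as an empty set

print(dict(variants_by_id))
```

Using a set for the values means that repeated rows for the same identifier and type, such as the separate assembly lines in the ClinVar file, collapse to a single entry.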

*** 
**Write Data**

In [None]:
# reformat data and write it out
with open(processed_data_location + 'SO_GENE_TRANSCRIPT_VARIANT_TYPE_MAPPING.txt', 'w') as outfile:
    for key in tqdm(sequence_map.keys()):
        for map_type in sequence_map[key]:
            outfile.write(key + '\t' + map_type + '\n')

# load data, print row count, and preview it
so_data = pandas.read_csv(processed_data_location + 'SO_GENE_TRANSCRIPT_VARIANT_TYPE_MAPPING.txt', header=None, delimiter='\t', names=['Identifier', 'Sequence_Ontology_ID'])

print('There are {edge_count} sequence ontology mappings'.format(edge_count=len(so_data)))
so_data.head(n=5)

***

**Combine Pathway and Sequence Ontology Mapping Data in Dictionary**  
Combine the pathway and sequence mapping data into a dictionary and output it.

In [None]:
# combine genomic and pathway maps
subclass_mapping = {}  
sequence_map.update(reactome_map)

# iterate over pathway lists and combine them
for key in tqdm(sequence_map.keys()):
    subclass_mapping[key] = sequence_map[key]

# save a copy of the dictionary
pickle.dump(subclass_mapping, open(construction_approach_location + 'subclass_construction_map.pkl', 'wb'), protocol=4)

<br>

***
***
### CREATE EDGE DATASETS  <a class="anchor" id="create-edge-datasets"></a>
***
***

### Ontologies  <a class="anchor" id="ontologies"></a>
***
- [Protein Ontology](#protein-ontology)  
- [Relations Ontology](#relations-ontology)  

***
***

***
### Protein Ontology <a class="anchor" id="protein-ontology"></a>

**Data Source Wiki Page:** [Protein Ontology](https://github.com/callahantiff/PheKnowLator/wiki/v4-Data-Sources#protein-ontology)  

**Purpose:** This script uses [OWLTools](https://github.com/owlcollab/owltools) to download the [pr.owl](http://purl.obolibrary.org/obo/pr.owl) (with imports) file from [ProConsortium.org](https://proconsortium.org/) in order to create a version of the ontology that contains only human proteins. This is achieved by performing forward and reverse breadth-first searches starting from every protein that is an `owl:subClassOf` [Homo sapiens protein](https://proconsortium.org/app/entry/PR%3A000029067/).

<br>

**Output:**  
- Human Protein Ontology ➞ `human_pro.owl`
- Classified Human Protein Ontology (ELK) ➞ `human_pro_closed.owl`


In [None]:
# download ontology
if not os.path.exists(unprocessed_data_location + 'pr_with_imports.owl'):
    command = '{} {} --merge-import-closure -o {}'
    os.system(command.format(owltools_location, 'http://purl.obolibrary.org/obo/pr.owl',
                             unprocessed_data_location + 'pr_with_imports.owl'))
    
# read in ontology as graph (the ontology is large so this takes ~60 minutes)
print('Loading Protein Ontology')
pr_graph = Graph().parse(unprocessed_data_location + 'pr_with_imports.owl')
print('There are {} axioms in the ontology (date: {})'.format(len(pr_graph), datetime.datetime.now().strftime('%m/%d/%Y')))

_Convert Ontology to Directed MultiGraph_  
In order to create a version of the ontology that includes all relevant human edges, we first need to convert the ontology graph to a [directed multigraph](https://networkx.github.io/documentation/stable/reference/classes/multidigraph.html).

In [None]:
networkx_mdg: networkx.MultiDiGraph = networkx.MultiDiGraph()
    
for s, p, o in tqdm(pr_graph):
    networkx_mdg.add_edge(s, o, **{'key': p})

_Identify Human Proteins_   
A list of human proteins is obtained by querying the ontology to return all ontology classes `only_in_taxon some Homo sapiens`.

*Approach 1 - Query Loaded Graph to Obtain Human Protein Classes*  
Does not require using external resources or SPARQL Endpoints. This is the preferred approach.

In [None]:
human_classes_restriction = list(pr_graph.triples((None, OWL.someValuesFrom, obo.NCBITaxon_9606)))
human_classes = [list(pr_graph.subjects(RDFS.subClassOf, x[0])) for x in human_classes_restriction]
human_pro_classes = list(str(i) for j in human_classes for i in j if 'PR_' in str(i))

print('There are {} human PRO classes (date: {})'.format(len(human_pro_classes), datetime.datetime.now().strftime('%m/%d/%Y')))

*Approach 2 - Query PRO Consortium SPARQL Endpoint to Obtain Human Protein Classes*  
This approach should only be used when the PRO endpoint is not limiting the number of results that are returned. As of `October 2021`, the endpoint was truncating results, so please use *Approach 1*, which does not depend on the endpoint.

In [None]:
# # download data
# url = 'https://sparql.proconsortium.org/virtuoso/sparql?query=PREFIX+obo%3A+%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2F%3E%0D%0A%0D%0ASELECT+%3FPRO_term%0D%0AFROM+%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2Fpr%3E%0D%0AWHERE+%7B%0D%0A+++++++%3FPRO_term+rdf%3Atype+owl%3AClass+.%0D%0A+++++++%3FPRO_term+rdfs%3AsubClassOf+%3Frestriction+.%0D%0A+++++++%3Frestriction+owl%3AonProperty+obo%3ARO_0002160+.%0D%0A+++++++%3Frestriction+owl%3AsomeValuesFrom+obo%3ANCBITaxon_9606+.%0D%0A%0D%0A+++++++%23+use+this+to+filter-out+things+like+hgnc+ids%0D%0A+++++++FILTER+%28regex%28%3FPRO_term%2C%22http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2F*%22%29%29+.%0D%0A%7D&format=text%2Fhtml&debug='
# if not os.path.exists(unprocessed_data_location + 'human_pro_classes.html'):
#     data_downloader(url, unprocessed_data_location, 'human_pro_classes.html')

# # load data
# df_list = pandas.read_html(unprocessed_data_location + 'human_pro_classes.html')

# # extract data from html table - pro classes only_in_taxon some Homo sapiens
# human_pro_classes = list(df_list[-1]['PRO_term'])
# print('There are {} edges in the ontology (date:{})'.format(len(human_pro_classes), datetime.datetime.now().strftime('%m/%d/%Y')))

_Construct Human PRO_   
Now that we have all of the paths from the original graph that are relevant to humans, we can construct a human-only version of the PRotein Ontology. After building the human subset, we verify the number of connected components and get 1. However, after reformatting the graph using [OWLTools](https://github.com/owlcollab/owltools) you will see that there are 3 connected components: component 1 (n=`1051673`); component 2 (n=`12`); and component 3 (n=`2`). The contents of components 2 and 3 are shown below:

```python
[{'http://purl.obolibrary.org/obo/IAO_0000115',
  'http://www.geneontology.org/formats/oboInOwl#hasAlternativeId',
  'http://www.geneontology.org/formats/oboInOwl#hasBroadSynonym',
  'http://www.geneontology.org/formats/oboInOwl#hasDbXref',
  'http://www.geneontology.org/formats/oboInOwl#hasExactSynonym',
  'http://www.geneontology.org/formats/oboInOwl#hasNarrowSynonym',
  'http://www.geneontology.org/formats/oboInOwl#hasOBONamespace',
  'http://www.geneontology.org/formats/oboInOwl#hasRelatedSynonym',
  'http://www.geneontology.org/formats/oboInOwl#id',
  'http://www.geneontology.org/formats/oboInOwl#is_transitive',
  'http://www.geneontology.org/formats/oboInOwl#shorthand',
  'http://www.w3.org/2002/07/owl#AnnotationProperty'},
 
 {'N41f0be4cf00c48929605b1e69a09f326',
  'http://www.w3.org/2002/07/owl#Ontology'}]
```

In [None]:
# create a new graph using bfs paths
human_pro_graph = Graph()
human_networkx_mdg = networkx.MultiDiGraph()

for node in tqdm(human_pro_classes):
    forward = list(networkx.edge_bfs(networkx_mdg, URIRef(node), orientation='original'))
    reverse = list(networkx.edge_bfs(networkx_mdg, URIRef(node), orientation='reverse'))
    
    # add edges from forward and reverse bfs paths
    for path in set(forward + reverse):
        human_pro_graph.add((path[0], path[2], path[1]))
        human_networkx_mdg.add_edge(path[0], path[1], **{'key': path[2]})
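
The forward and reverse searches above collect, for each human protein class, every edge reachable by following edges in their original direction plus every edge reachable by following them backwards. The plain-Python sketch below illustrates that idea on three invented triples. It is a simplified stand-in for `networkx.edge_bfs`, not a reimplementation of it, and the `PR_toy_*` names are made up.

```python
from collections import deque


def reachable_edges(edges, start, reverse=False):
    """Return the set of (s, p, o) triples reached by a breadth-first search from `start`."""
    # index triples by the node we step away from (subject if forward, object if reverse)
    index = {}
    for s, p, o in edges:
        index.setdefault(o if reverse else s, []).append((s, p, o))

    seen_nodes, found, queue = {start}, set(), deque([start])
    while queue:
        node = queue.popleft()
        for s, p, o in index.get(node, []):
            found.add((s, p, o))
            next_node = s if reverse else o
            if next_node not in seen_nodes:
                seen_nodes.add(next_node)
                queue.append(next_node)
    return found


toy_edges = [
    ('PR_toy_child', 'subClassOf', 'PR_toy_human'),
    ('PR_toy_human', 'subClassOf', 'PR_toy_protein'),
    ('PR_toy_other', 'subClassOf', 'PR_toy_protein'),
]

forward = reachable_edges(toy_edges, 'PR_toy_human')
backward = reachable_edges(toy_edges, 'PR_toy_human', reverse=True)
human_subset = forward | backward
print(sorted(human_subset))
```

In this toy graph the edge from `PR_toy_other` is left out, because it is neither an ancestor path nor a descendant path of the starting class.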

In [None]:
# get connected component information
print('Finding Connected Components')
components = list(networkx.connected_components(human_networkx_mdg.to_undirected()))
component_dict = sorted(components, key=len, reverse=True)

# if more than 1 connected component, only keep the biggest
if len(component_dict) > 1:
    print('Cleaning Graph: Removing Small Disconnected Components')
    for node in tqdm([x for y in component_dict[1:] for x in list(y)]):
        human_pro_graph.remove((node, None, None))

# save data
print('Saving Human Subset of the Protein Ontology')
human_pro_graph.serialize(destination=unprocessed_data_location + 'human_pro.owl', format='xml')
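
As an illustration of the component clean-up step, the sketch below finds connected components in a tiny undirected edge list and separates the largest component from the rest. It is written in plain Python on invented nodes and only approximates what `networkx.connected_components` provides.

```python
def connected_components(edges):
    """Group nodes into components, treating every edge as undirected."""
    neighbors = {}
    for s, o in edges:
        neighbors.setdefault(s, set()).add(o)
        neighbors.setdefault(o, set()).add(s)

    seen, components = set(), []
    for node in neighbors:
        if node in seen:
            continue
        stack, component = [node], set()
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(neighbors[current] - component)
        seen |= component
        components.append(component)
    # largest component first, mirroring the sort in the cell above
    return sorted(components, key=len, reverse=True)


toy_edges = [('a', 'b'), ('b', 'c'), ('x', 'y')]
components = connected_components(toy_edges)
largest = components[0]
small_nodes = set().union(*components[1:])  # nodes that would be removed
print(largest, small_nodes)
```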

_Classify Ontology_  
To check that the new ontology was built correctly, we run the ELK reasoner over it to confirm that there are no incomplete triples or inconsistent classes. To do this, we call the reasoner using [OWLTools](https://github.com/owlcollab/owltools), which this script assumes has already been downloaded to the `./resources/lib` directory. The reasoner is run with the following command (from the command line):  

___

```bash
../pkt_kg/libs/owltools ./resources/processed_data/unprocessed_data/human_pro.owl --reasoner elk --run-reasoner --assert-implied -o ./resources/processed_data/human_pro_closed.owl
```
___


_**Note.** This step takes around 5 minutes to run. When run from the command line the reasoner determined that the ontology was consistent and 200 new axioms were inferred (12/01/2020)._

In [None]:
# run reasoner
command = '{} {} --reasoner {} --run-reasoner --assert-implied -o {}'
os.system(command.format(owltools_location, unprocessed_data_location + 'human_pro.owl', 'elk',
                         ontology_data_location + 'pr_with_imports.owl'))

_Examine Cleaned Human PRO_  
Once we have cleaned the ontology we can get counts of components, nodes, and edges.

In [None]:
gets_ontology_statistics(ontology_data_location + 'pr_with_imports.owl', '../pkt_kg/libs/owltools')

<br>

***

### Relations Ontology <a class="anchor" id="relations-ontology"></a>

**Data Source Wiki Page:** [Relations Ontology](https://github.com/callahantiff/PheKnowLator/wiki/v4-Data-Sources#relations-ontology)  

**Purpose:** This script downloads the [ro.owl](http://purl.obolibrary.org/obo/ro.owl) file from [obofoundry.org](http://www.obofoundry.org/) in order to obtain all `ObjectProperties` and their inverse relations.  

**Output:** 
- Relations and Inverse Relations ➞ `INVERSE_RELATIONS.txt`
- Relations and Labels ➞ `RELATIONS_LABELS.txt`

In [None]:
# download ontology
if not os.path.exists(unprocessed_data_location + 'ro_with_imports.owl'):
    command = '{} {} --merge-import-closure -o {}'
    os.system(command.format(owltools_location, 'http://purl.obolibrary.org/obo/ro.owl',
                             unprocessed_data_location + 'ro_with_imports.owl'))
# load graph
ro_graph = Graph().parse(unprocessed_data_location + 'ro_with_imports.owl')
print('There are {} edges in the ontology (date:{})'.format(len(ro_graph), datetime.datetime.now().strftime('%m/%d/%Y')))

***

**Identify Relations and Inverse Relations**  
Identify all relations and their inverse relations using the `owl:inverseOf` property. To make it easier to look up the inverse relations, each pair is listed twice, for example:  
- [location of](http://www.ontobee.org/ontology/RO?iri=http://purl.obolibrary.org/obo/RO_0001015) `owl:inverseOf` [located in](http://www.ontobee.org/ontology/RO?iri=http://purl.obolibrary.org/obo/RO_0001025)  
- [located in](http://www.ontobee.org/ontology/RO?iri=http://purl.obolibrary.org/obo/RO_0001025) `owl:inverseOf` [location of](http://www.ontobee.org/ontology/RO?iri=http://purl.obolibrary.org/obo/RO_0001015)

In [None]:
with open(relations_data_location + 'INVERSE_RELATIONS.txt', 'w') as outfile:
    outfile.write('Relation' + '\t' + 'Inverse_Relation' + '\n')
    for s, p, o in tqdm(ro_graph):
        if 'owl#inverseOf' in str(p):
            if 'RO' in str(s) and 'RO' in str(o):
                outfile.write(str(s.split('/')[-1]) + '\t' + str(o.split('/')[-1]) + '\n')
                outfile.write(str(o.split('/')[-1]) + '\t' + str(s.split('/')[-1]) + '\n')
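
Writing each pair in both directions means that a single dictionary lookup is enough to find the inverse of any relation, whichever side of `owl:inverseOf` it appeared on. A minimal sketch using the `location of` / `located in` pair from the example above:

```python
# one owl:inverseOf pair, taken from the example above
pairs = [('RO_0001015', 'RO_0001025')]

inverse_of = {}
for relation, inverse in pairs:
    inverse_of[relation] = inverse
    inverse_of[inverse] = relation  # store the pair in both directions

print(inverse_of['RO_0001015'], inverse_of['RO_0001025'])
```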

_Preview Processed Data_

In [None]:
# load data, print row count, and preview it
ro_data = pandas.read_csv(relations_data_location + 'INVERSE_RELATIONS.txt', header=0, delimiter='\t')

print('There are {edge_count} RO Relations and Inverse Relations'.format(edge_count=len(ro_data)))
ro_data.head(n=5)

***

**Get Relations Labels**  
Identify all relations and their labels for use when building the knowledge graph.

In [None]:
results = {str(x[2]).lower(): str(x[0]) for x in ro_graph if '/RO_' in str(x[0]) and 'label' in str(x[1]).lower()}

# write data to file
with open(relations_data_location + 'RELATIONS_LABELS.txt', 'w') as outfile:
    outfile.write('Relation' + '\t' + 'Label' + '\n')
    for k, v in results.items():
        outfile.write(str(v).split('/')[-1] + '\t' + str(k) + '\n')

_Preview Processed Data_

In [None]:
# load data, print row count, and preview it
ro_data_label = pandas.read_csv(relations_data_location + 'RELATIONS_LABELS.txt', header=0, delimiter='\t')

print('There are {edge_count} RO Relations and Labels'.format(edge_count=len(ro_data_label)))
ro_data_label.head(n=5)

<br>

***
***
### Linked Data <a class="anchor" id="linked-data"></a>
***
* [Clinvar Variant-Diseases and Phenotypes](#clinvar-variant) 
* [Uniprot Protein-Cofactor and Protein-Catalyst](#uniprot-protein-cofactorcatalyst)  

***

***
***
### Clinvar Variant-Diseases and Phenotypes <a class="anchor" id="clinvar-variant"></a>

**Data Source Wiki Page:** [Clinvar](https://github.com/callahantiff/PheKnowLator/wiki/v4-Data-Sources#clinvar)  

**Purpose:** This script downloads the data files listed below in order to create the following edges:  
- gene-variant  
- variant-disease  
- variant-phenotype  

**Data Files:**  
Details on each file have been taken from this [README](https://ftp.ncbi.nlm.nih.gov/pub/clinvar/README.txt) and are provided in relevant code chunks below.  
##### *Core Data Files* <a class="anchor" id="core-data-files"></a>  
- [`variant_summary.txt.gz`](https://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz)    

##### *Metadata Files*<a class="anchor" id="metadata-files"></a>    
- [`var_citations.txt`](https://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/var_citations.txt)  
- [`allele_gene.txt.gz`](https://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/allele_gene.txt.gz)  

**Output:**  
- `CLINVAR_VARIANT_GENE_EDGES.txt`  
- `CLINVAR_VARIANT_DISEASE_PHENOTYPE_EDGES.txt`


<br>

#### Download and Process Core Data Files <a class="anchor" id="core-data-files"></a>
***

*Data Files:*  
- [`variant_summary.txt.gz`](https://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz)  

*Processing Details*  
The first step is to download the `variant_summary.txt.gz` file. After downloading, the file is cleaned to handle missing data, unneeded variables are removed, identifiers and date fields are cleaned and reformatted, and rows without valid disease/phenotype identifiers are removed.  
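
The missing-data handling described above can be sketched on a toy table. This is a minimal illustration, not the notebook's exact code: the helper name `clean_missing` and the toy values are assumptions, and only the missing-value coding and integer casting are shown.

```python
import numpy
import pandas

# Toy rows mimicking how ClinVar codes missing values ('-', 'na', and -1)
toy_summary = pandas.DataFrame({
    'GeneID': [2052, -1, 672],
    'HGNC_ID': ['HGNC:3401', '-', 'na'],
})


def clean_missing(df, id_cols):
    """Replace string placeholders with NaN and recode -1 identifiers as missing."""
    out = df.replace(['na', '-'], numpy.nan)
    for col in id_cols:
        # the nullable 'Int64' dtype keeps identifiers as integers while allowing missing values
        out[col] = out[col].replace(-1, numpy.nan).astype('Int64')
    return out


cleaned_summary = clean_missing(toy_summary, ['GeneID'])
```

Using the nullable `Int64` type avoids identifiers being displayed as floats (e.g., `2052.0`) once missing values are introduced.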

<br>

[**`variant_summary.txt.gz`**](https://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz)

> A tab-delimited report based on each variant at a location on the genome for which data have been submitted to ClinVar.  
The data for the variant are reported for each assembly, so most variants have a line for GRCh37 (hg19) and another line for GRCh38 (hg38).
>
>  - <u>AlleleID</u>: integer value as stored in the AlleleID field in ClinVar  
>  - <u>Type</u>: character, the type of variant represented by the AlleleID  
>  - <u>Name</u>: character, ClinVar's preferred name for the record with this AlleleID  
>  - <u>GeneID</u>: integer, GeneID in NCBI's Gene database, reported if there is a single gene, otherwise reported as -1. 
>  - <u>GeneSymbol</u>: character, comma-separated list of GeneIDs overlapping the variant  
>  - <u>HGNC_ID</u>: string, of format HGNC:integer, reported if there is a single GeneID.  
>  - <u>ClinicalSignificance</u>: character, comma-separated list of aggregate values of clinical significance calculated for this variant. 
>  - <u>ClinSigSimple</u>: integer,  
>    - 0 = no current value of Likely pathogenic or Pathogenic  
>    - 1 = at least one current record submitted with an interpretation of Likely pathogenic or Pathogenic (independent of whether that record includes assertion criteria and evidence)  
>    - -1 = no values for clinical significance at all for this variant or set of variants; used for the "included" variants that are only in ClinVar because they are included in a haplotype or genotype with an interpretation  
>  - <u>LastEvaluated</u>: date, the latest date any submitter reported clinical significance  
>  - <u>RS# (dbSNP)</u>: integer, rs# in dbSNP, reported as -1 if missing  
>  - <u>nsv/esv (dbVar)</u>: character, the NSV identifier for the region in dbVar  
>  - <u>RCVaccession</u>: character, list of RCV accessions that report this variant  
>  - <u>PhenotypeIDs</u>: character, list of identifiers for phenotype(s) interpreted for this variant. If more than 5 conditions are reported, the number of conditions is reported instead.  
>  - <u>PhenotypeList</u>: character, list of names corresponding to PhenotypeIDs. If more than 5 conditions are reported, the number of conditions is reported instead.  
>  - <u>Origin</u>: character, list of all allelic origins for this variant  
>  - <u>OriginSimple</u>: character, processed from Origin to make it easier to distinguish between germline and somatic  
>  - <u>Assembly</u>: character, name of the assembly on which locations are based   
>  - <u>ChromosomeAccession</u>: Accession and version of the RefSeq sequence defining the position reported in the start and stop columns.  
>  - <u>Chromosome</u>: character, chromosomal location  
>  - <u>Start</u>: integer, starting location, right-shifted, in pter->qter orientation  
>  - <u>Stop</u>: integer, end location, right-shifted, in pter->qter orientation  
>  - <u>ReferenceAllele</u>: The reference allele using the right-shifted location in Start and Stop.  
>  - <u>AlternateAllele</u>: The alternate allele using the right-shifted location in Start and Stop.  
>  - <u>Cytogenetic</u>: character, ISCN band
>  - <u>ReviewStatus</u>: character, highest review status for reporting this measure.  
>  - <u>NumberSubmitters</u>: integer, number of submitters describing this variant  
>  - <u>Guidelines</u>: character, ACMG only right now  
>  - <u>TestedInGTR</u>: character, Y/N for Yes/No if there is a test registered as specific to this variant in the NIH Genetic Testing Registry (GTR)  
>  - <u>OtherIDs</u>: character, list of other identifiers or sources of information about this variant  
>  - <u>SubmitterCategories</u>: coded value to indicate whether data were submitted by another resource (1), any other type of source (2), both (3), or none (4)  
>  - <u>VariationID</u>: The identifier ClinVar uses specific to the AlleleID.  Not all VariationIDS that may be related to the AlleleID are reported in this file.  
>  - <u>PositionVCF</u>: integer, starting location, left-shifted, in pter->qter orientation  
>  - <u>ReferenceAlleleVCF</u>: The reference allele using the left-shifted location in vcf_pos.  
>  - <u>AlternateAlleleVCF</u>: The alternate allele using the left-shifted location in vcf_pos.  

In [None]:
# download data
url = 'https://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz'
if not os.path.exists(unprocessed_data_location + 'variant_summary.txt'):
    data_downloader(url, unprocessed_data_location)

# load data
var_summary = pandas.read_csv(unprocessed_data_location + 'variant_summary.txt',
                              header=0, delimiter='\t', low_memory=False)

In [None]:
# replace "na" and "-" with NaN
var_summary = var_summary.replace('na', numpy.nan)
var_summary = var_summary.replace('-', numpy.nan)

# handle ids that are coded as missing (i.e., -1)
var_summary['GeneID'] = var_summary['GeneID'].replace(-1, numpy.nan)
var_summary['RS# (dbSNP)'] = var_summary['RS# (dbSNP)'].replace(-1, numpy.nan)

# convert date format
var_summary['LastEvaluated'] = var_summary['LastEvaluated'].str.replace('None', '')
var_summary['LastEvaluated'] = pandas.to_datetime(var_summary['LastEvaluated'])
var_summary['LastEvaluated'] = var_summary['LastEvaluated'].dt.strftime('%B %d, %Y')
var_summary['LastEvaluated'] = var_summary['LastEvaluated'].replace('', numpy.nan)

# rename variables
var_summary.rename(columns={'#AlleleID': 'AlleleID',
                           'nsv/esv (dbVar)': 'nsv',
                           'Name': 'VariantName'}, inplace=True)

# update variable types
var_summary['GeneID'] = var_summary['GeneID'].astype('Int64')
var_summary['RS# (dbSNP)'] = var_summary['RS# (dbSNP)'].astype('Int64')

# print row count and preview data
print('There are {edge_count} variant edges'.format(edge_count=len(var_summary)))
var_summary.head(n=5)

*Address Duplicate Rows for GRCh37 and GRCh38 Assemblies*
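
Because most variants have one row per assembly, the cell below packs the assembly-specific columns into a single string per `VariationID`. A minimal sketch of that idea on toy data follows; the helper name `collapse_assemblies` and the toy values are assumptions, and the packing is done with a plain loop rather than `groupby().apply()`.

```python
import pandas

# Toy data: variant 10 is reported on both assemblies, variant 11 on one
toy_variants = pandas.DataFrame({
    'VariationID': [10, 10, 11],
    'Type': ['single nucleotide variant'] * 3,
    'Assembly': ['GRCh37', 'GRCh38', 'GRCh38'],
    'Start': [100, 105, 200],
})


def collapse_assemblies(df, key, assembly_cols):
    """Pack assembly-specific columns into one string per key and drop the duplicated rows."""
    records = {}
    for row in df[[key] + assembly_cols].drop_duplicates().to_dict('records'):
        records.setdefault(row.pop(key), []).append(row)
    packed = pandas.DataFrame({key: list(records),
                               'Assembly': [str(v) for v in records.values()]})
    # remove the per-assembly columns, de-duplicate, then re-attach the packed string
    rest = df.drop(columns=assembly_cols).drop_duplicates()
    return rest.merge(packed, on=key, how='left')


collapsed = collapse_assemblies(toy_variants, 'VariationID', ['Assembly', 'Start'])
```

After collapsing, each variant occupies a single row and its `Assembly` value holds the stringified list of per-assembly records.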

In [None]:
# subset df
var_summary_update_assemb = var_summary.copy()
var_summary_update_assemb = var_summary_update_assemb[['VariationID', 'Assembly', 'ChromosomeAccession',
                                                       'Chromosome', 'Start', 'Stop', 'ReferenceAllele',
                                                       'AlternateAllele', 'Cytogenetic', 'PositionVCF']].drop_duplicates()

# identify columns to process
assemb_cols = ['ChromosomeAccession', 'Chromosome', 'Start', 'Stop', 'ReferenceAllele',
               'AlternateAllele','Cytogenetic', 'PositionVCF', 'ReferenceAlleleVCF', 'AlternateAlleleVCF']

# group data by variant
df = var_summary_update_assemb.fillna('None')
df = df.groupby('VariationID').apply(lambda g: str(g.drop(['VariationID'], axis=1).to_dict('records'))).to_dict()

# convert to Pandas DataFrame
df_items = df.items()
temp_df = pandas.DataFrame({'VariationID': [x[0] for x in df_items], 'Assembly': [x[1] for x in df_items]})

# join temp df with original data
var_summary_assemb = var_summary.copy().drop(assemb_cols + ['Assembly'], axis = 1)
var_summary_update = var_summary_assemb.merge(temp_df, on='VariationID', how='left')

# drop duplicates
var_summary_update.drop_duplicates(inplace=True)

# print row count and preview data
print('There are {edge_count} edges'.format(edge_count=len(var_summary_update)))
var_summary_update.head(n=5)

*Process `PhenotypeIDS` and `PhenotypeList` Columns*
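
The identifier clean-up performed in the next cell is written as a nested lambda, which can be hard to follow. The sketch below expresses the same intent as a small function. It is an approximation under stated assumptions: the name `normalize_phenotype_ids` is hypothetical, the input strings are toy values, and, unlike the cell below, it strips whitespace and returns a sorted list instead of a `;`-joined string.

```python
def normalize_phenotype_ids(raw):
    """Return a sorted, de-duplicated list of cleaned phenotype identifiers."""
    if raw is None:
        return []
    # source prefix in the raw data -> prefix used in the cleaned identifier
    prefixes = {'MONDO': 'MONDO', 'Human Phenotype': 'HP', 'Orphanet': 'ORPHA'}
    cleaned = set()
    for item in raw.replace('|', ';').replace(',', ';').split(';'):
        item = item.strip()
        # skip empty entries and placeholders such as '12 conditions'
        if not item or item.endswith(' conditions'):
            continue
        for source, curie in prefixes.items():
            if item.startswith(source):
                item = curie + ':' + item.split(':')[-1]
                break
        cleaned.add(item)
    return sorted(cleaned)


toy_ids = normalize_phenotype_ids(
    'MONDO:MONDO:0007254,MedGen:C0346153|Human Phenotype Ontology:HP:0003002')
```

Identifiers from sources that are not remapped (e.g., `MedGen`) are passed through unchanged, which mirrors the final `else i` branch of the lambda.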

In [None]:
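# clean-up identifiers (regex=False treats '|' and ',' as literal characters rather than regex patterns)
var_summary_update['Phenotype'] = (var_summary_update['PhenotypeIDS']
                                   .str.replace('|', ';', regex=False).str.replace(',', ';', regex=False))
var_summary_update['OtherIDs'] = (var_summary_update['OtherIDs']
                                  .str.replace(';', '|', regex=False).str.replace(',', '|', regex=False))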

# remove unneeded variables
drop_list = ['PhenotypeList', 'PhenotypeIDS']
var_summary_update = var_summary_update.drop(drop_list, axis = 1).drop_duplicates()

# replace NaN with 'None'
var_summary_update['Phenotype'] = var_summary_update['Phenotype'].fillna('None')

# normalize MONDO/HPO/Orphanet identifier prefixes, drop 'N conditions' placeholders, and de-duplicate identifiers
var_summary_update['Phenotype'] = var_summary_update['Phenotype'].apply(
    lambda x: ';'.join(set(x for x in ['MONDO:' + i.split(':')[-1] if i.startswith('MONDO')
                        else 'HP:' + i.split(':')[-1] if i.startswith('Human Phenotype')
                        else 'ORPHA:' + i.split(':')[-1] if i.startswith('Orphanet')
                        else 'None' if i.endswith(' conditions')
                        else i for i in x.split(';')] if x != 'None')))

# drop duplicates
var_summary_update.drop_duplicates(inplace=True)

# print row count and preview data
print('There are {edge_count} edges'.format(edge_count=len(var_summary_update)))
var_summary_update.head(n=5)

<br>

#### Metadata Files <a class="anchor" id="metadata-files"></a>
***

*Data Files:*  
- [`var_citations.txt`](https://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/var_citations.txt)  
- [`allele_gene.txt.gz`](https://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/allele_gene.txt.gz)  
- [`gene_specific_summary.txt`](https://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/gene_specific_summary.txt)  

*Processing Details*  
<u>Step 1</u>: The first step is to download the files. After downloading, the files are cleaned to handle missing data, unneeded variables are removed, and identifiers and date fields are cleaned and reformatted.  

<u>Step 2</u>: Merge each cleaned file with the processed variant summary data from the prior steps.
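
The two steps above can be sketched with toy tables. The example below assumes made-up citation identifiers and variant names; it shows citations being collapsed to one `|`-delimited string per `VariationID` and then left-merged so that variants without citations are retained.

```python
import pandas

# Toy variant summary and citation tables (all values are illustrative)
toy_variants = pandas.DataFrame({'VariationID': [10, 11],
                                 'VariantName': ['variant A', 'variant B']})
toy_citations = pandas.DataFrame({
    'VariationID': [10, 10],
    'citation_source': ['PubMed', 'PubMed'],
    'citation_id': ['1001', '1002'],
})

# Step 1: build 'source:id' strings and collapse them per variant
toy_citations['Citation'] = toy_citations['citation_source'] + ':' + toy_citations['citation_id']
grouped = toy_citations.groupby('VariationID')['Citation'].agg('|'.join).reset_index()

# Step 2: a left merge keeps every variant, even those without citations
toy_merged = toy_variants.merge(grouped, on='VariationID', how='left')
```

A left merge is used here so that the row count of the variant table is preserved; variants with no citations simply receive a missing `Citation` value.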

<br>

[**`var_citations.txt`**](https://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/var_citations.txt)

> A tab-delimited report of citations associated with data in ClinVar, connected to the AlleleID, the VariationID, and either rs# from dbSNP or nsv in dbVar.
>
> - <u>AlleleID</u>: integer value as stored in the AlleleID field in ClinVar  
> - <u>VariationID</u>: The identifier ClinVar uses to anchor its default display  
> - <u>rs</u>: rs identifier from dbSNP, null if missing  
> - <u>nsv</u>: nsv identifier from dbVar, null if missing  
> - <u>citation_source</u>: The source of the citation, either PubMed, PubMedCentral, or the NCBI Bookshelf  
> - <u>citation_id</u>: The identifier used by that source  

In [None]:
# download data
url = 'https://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/var_citations.txt'
if not os.path.exists(unprocessed_data_location + 'var_citations.txt'):
    data_downloader(url, unprocessed_data_location)

# load data
var_citations = pandas.read_csv(unprocessed_data_location + 'var_citations.txt',
                                header=0, delimiter='\t', low_memory=False)

In [None]:
# replace "na" and "-" with NaN
var_citations = var_citations.replace('na', numpy.nan)
var_citations = var_citations.replace('-', numpy.nan)

# combine citation information
var_citations['Citation'] = var_citations['citation_source'] + ':' + var_citations['citation_id']

# remove unneeded variables
drop_list = ['citation_source', 'citation_id']
var_citations = var_citations.drop(drop_list, axis = 1).drop_duplicates()

# group citations by variant
var_citations = var_citations.groupby('VariationID').Citation.agg([('Citation', '|'.join)]).reset_index()
var_citations = var_citations.drop_duplicates().sort_values(by=['VariationID'])

# print row count and preview data
print('There are {edge_count} edges'.format(edge_count=len(var_citations)))
var_citations.head(n=5)

<br>

[**`allele_gene.txt.gz`**](https://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/allele_gene.txt.gz)

> Reports per ClinVar's AlleleID, the genes that are related to that gene and how they are related.
>
> - <u>AlleleID</u>: the integer identifier assigned by ClinVar to each simple allele
> - <u>GeneID</u>: integer, GeneID in NCBI's Gene database  
> - <u>Symbol</u>: character, Symbol preferred in NCBI's Gene database. Is the symbol from HGNC when available  
> - <u>Name</u>: character, full name of the gene  
> - <u>GenesPerAlleleID</u>: integer, number of genes related to the allele  
> - <u>Category</u>: character, type of allele-gene relationship. The values for category are:
>   - <u>asserted, but not computed</u>: Submitted as related to a gene, but not within the location of that gene on the genome  
>   - <u>genes overlapped by variant</u>: The gene and variant overlap  
>   - <u>near gene, downstream</u>: Outside the location of the gene on the genome, within 5 kb  
>   - <u>near gene, upstream</u>: Outside the location of the gene on the genome, within 5 kb  
>   - <u>within multiple genes by overlap</u>: The variant is within genes that overlap on the genome. Includes introns  
>   - <u>within single gene</u>: The variant is in only one gene. Includes introns    
> - <u>Source</u>: character, was the relationship submitted or calculated? 

In [None]:
# download data
url = 'https://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/allele_gene.txt.gz'
if not os.path.exists(unprocessed_data_location + 'allele_gene.txt'):
    data_downloader(url, unprocessed_data_location)

# load data
allele_gene = pandas.read_csv(unprocessed_data_location + 'allele_gene.txt',
                              header=0, delimiter='\t', low_memory=False)

In [None]:
# replace "na" and "-" with NaN
allele_gene = allele_gene.replace('na', numpy.nan)
allele_gene = allele_gene.replace('-', numpy.nan)

# handle gene ids that may be coded as -1
allele_gene['GeneID'] = allele_gene['GeneID'].replace(-1, numpy.nan)

# rename variables
allele_gene.rename(columns={'#AlleleID': 'AlleleID',
                           'Symbol': 'GeneSymbol',
                           'Name': 'GeneName'}, inplace=True)

# update variable types
allele_gene['GeneID'] = allele_gene['GeneID'].astype('Int64')

# print row count and preview data
print('There are {edge_count} edges'.format(edge_count=len(allele_gene)))
allele_gene.head(n=5)

_Merge and Process Data Sources_

Merge `var_summary_update` with `var_citations` data

In [None]:
# merge data
merge_cols = list(set(var_summary_update.columns).intersection(set(var_citations.columns)))
var_summary_merged = var_summary_update.merge(var_citations, on=merge_cols, how='left')

# print row count and preview data
print('There are {edge_count} edges'.format(edge_count=len(var_summary_merged)))
var_summary_merged.head(n=5)

Merge merged `var_summary_update` with `allele_gene` data

In [None]:
# merge data
merge_cols = list(set(var_summary_merged.columns).intersection(set(allele_gene.columns)))
var_summary_merged = var_summary_merged.merge(allele_gene, on=merge_cols, how='left')

# update variable types
var_summary_merged['GenesPerAlleleID'] = var_summary_merged['GenesPerAlleleID'].astype('Int64')

# print row count and preview data
print('There are {edge_count} edges'.format(edge_count=len(var_summary_merged)))
var_summary_merged.head(n=5)

**Write Edge Lists**

*`variant`-`gene` Edges*

In [None]:
# reduce data set
var_summary_merged_gene = var_summary_merged.copy()
var_summary_merged_gene = var_summary_merged_gene[[
    'VariationID', 'AlleleID', 'RS# (dbSNP)', 'Type', 'VariantName',
    'OtherIDs', 'GeneID', 'GeneSymbol', 'GeneName', 'GenesPerAlleleID',
    'Assembly', 'Category', 'Guidelines', 'TestedInGTR', 'RCVaccession', 'LastEvaluated',
    'ReviewStatus', 'ClinicalSignificance', 'ClinSigSimple', 'Origin', 'OriginSimple', 'Source',
    'SubmitterCategories', 'NumberSubmitters', 'Citation']]
var_summary_merged_gene.drop_duplicates(inplace=True)

# remove any rows missing a gene id
var_summary_merged_gene = var_summary_merged_gene.dropna(subset=['GeneID'])

# add identifier prefixes to output
var_summary_merged_gene['GeneID'] = 'NCBIGene_' + var_summary_merged_gene['GeneID'].astype(str)
var_summary_merged_gene['VariationID'] = 'clinvar_' + var_summary_merged_gene['VariationID'].astype(str)

# print row count and preview data
print('There are {edge_count} edges'.format(edge_count=len(var_summary_merged_gene)))
var_summary_merged_gene.head(n=5)

In [None]:
# write out data (passing the path lets pandas open and close the file)
var_summary_merged_gene.to_csv(processed_data_location + 'CLINVAR_VARIANT_GENE_EDGES.txt',
                               index=False, header=True, sep='\t')

*`variant`-`disease` / `variant`-`phenotype` Edges*
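
Each variant can be linked to several disease or phenotype identifiers, so the `;`-delimited `Phenotype` column has to be expanded to one identifier per row before writing the edge list. A minimal sketch on toy data, assuming illustrative identifiers, is shown here.

```python
import pandas

# Toy edge table: one variant with two identifiers, one with none, one with a single identifier
toy_edges = pandas.DataFrame({
    'VariationID': [10, 11, 12],
    'Phenotype': ['MONDO:0007254;HP:0003002', 'None', 'MONDO:0007254'],
})

# split the delimited string into a list, then create one row per list element
long_edges = toy_edges.assign(Phenotype=toy_edges['Phenotype'].str.split(';')).explode('Phenotype')

# drop rows without a usable identifier and add the variant prefix
long_edges = long_edges[~long_edges['Phenotype'].isin(['None', ''])].drop_duplicates()
long_edges = long_edges.reset_index(drop=True)
long_edges['VariationID'] = 'clinvar_' + long_edges['VariationID'].astype(str)
```

The variant without an identifier disappears from the result, while the variant with two identifiers contributes two edges.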

In [None]:
# reduce data set
var_summary_merged_disease = var_summary_merged.copy()
var_summary_merged_disease = var_summary_merged_disease[[
    'VariationID', 'AlleleID', 'RS# (dbSNP)', 'Type', 'VariantName', 'RCVaccession',
    'LastEvaluated', 'ReviewStatus', 'ClinicalSignificance', 'ClinSigSimple', 'GeneID',
    'NumberSubmitters', 'SubmitterCategories', 'Guidelines', 'TestedInGTR',
    'Origin', 'OriginSimple', 'Assembly', 'Phenotype', 'Citation', 'OtherIDs']]
var_summary_merged_disease.drop_duplicates(inplace=True)

# expand results by disease identifier
cols = ['Phenotype']
for col in tqdm(cols):
    var_summary_merged_disease = var_summary_merged_disease.assign(
        **{col: var_summary_merged_disease[col].str.split(';')}).explode(col)

# remove phenotype rows that are 'None' or empty (i.e., no valid identifier) and drop duplicates
var_summary_merged_disease = var_summary_merged_disease[
    ~var_summary_merged_disease['Phenotype'].isin(['None', ''])]
var_summary_merged_disease.drop_duplicates(inplace=True)

# add identifier prefix to output
var_summary_merged_disease['VariationID'] = 'clinvar_' + var_summary_merged_disease['VariationID'].astype(str)

# print row count and preview data
print('There are {edge_count} edges'.format(edge_count=len(var_summary_merged_disease)))
var_summary_merged_disease.head(n=5)

In [None]:
# write data to file (passing the path lets pandas open and close the file)
var_summary_merged_disease.to_csv(processed_data_location + 'CLINVAR_VARIANT_DISEASE_PHENOTYPE_EDGES.txt',
                                  index=False, header=True, sep='\t')


<br>

***

### UniProt Protein-Cofactor and Protein-Catalyst <a class="anchor" id="uniprot-protein-cofactorcatalyst"></a>

**Data Source Wiki Page:** [UniProt](https://github.com/callahantiff/PheKnowLator/wiki/v4-Data-Sources#universal-protein-resource-knowledgebase)  

**Purpose:** This script downloads the [uniprot-cofactor-catalyst.tab](https://github.com/callahantiff/PheKnowLator/wiki/v4-Data-Sources#universal-protein-resource-knowledgebase) file from the [UniProt Knowledgebase](https://www.uniprot.org) in order to create the following edges:  
- protein-cofactor  
- protein-catalyst  

**Data:** These data were obtained by querying the [UniProt Knowledgebase](https://www.uniprot.org/uniprot/) using the *reviewed:yes AND organism:"Homo sapiens (Human) [9606]"* query and including the following columns:
- Entry (Standard) 
- Status (Standard) 
- PRO (*Miscellaneous*)  
- ChEBI (Cofactor) (*Chemical entities*)   
- ChEBI (Catalytic activity) (*Chemical entities*)  

The URL to access the results of this query is obtained by clicking on the share symbol and copying the free-text from the box. To obtain the data in a tab-delimited format the following string is appended to the end of the URL: "&format=tab".

**NOTE.** Be sure to obtain a new URL from the [UniProt Knowledgebase](https://www.uniprot.org/uniprot/) when rebuilding to ensure you are getting the most up-to-date data. This query was last generated on `12/02/2020`.

<br>

**Output:**  
- protein-cofactor ➞ `UNIPROT_PROTEIN_COFACTOR.txt`
- protein-catalyst ➞ `UNIPROT_PROTEIN_CATALYST.txt`
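
To make the reformatting step concrete, the sketch below extracts ChEBI identifiers from a single cofactor or catalytic activity cell. It is an illustration under stated assumptions: the cell is assumed to look like `name [CHEBI:id]; name [CHEBI:id]`, the protein identifier `PR_P12345` and the cell contents are toy values, the helper `protein_chebi_edges` is hypothetical, and a regular expression is used in place of the string splitting in the cells below.

```python
import re

# matches the numeric part of identifiers written as '[CHEBI:12345]'
CHEBI_PATTERN = re.compile(r'\[CHEBI:(\d+)\]')


def protein_chebi_edges(pr_id, chebi_cell):
    """Return (protein, chemical) pairs for every ChEBI identifier found in a cell."""
    return [(pr_id, 'CHEBI_' + chebi_id) for chebi_id in CHEBI_PATTERN.findall(chebi_cell)]


toy_uniprot_edges = protein_chebi_edges('PR_P12345', 'heme [CHEBI:30413]; Zn(2+) [CHEBI:29105]')
no_uniprot_edges = protein_chebi_edges('PR_P12345', '')
```

Cells without any ChEBI annotation yield an empty list, so no edge is written for that protein.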


In [None]:
# download data
url = 'https://www.uniprot.org/uniprot/?query=&fil=organism%3A%22Homo%20sapiens%20(Human)%20%5B9606%5D%22&columns=id%2Creviewed%2Centry%20name%2Cdatabase(PRO)%2Cchebi(Cofactor)%2Cchebi(Catalytic%20activity)&format=tab'
if not os.path.exists(unprocessed_data_location + 'uniprot-cofactor-catalyst.tab'):
    data_downloader(url, unprocessed_data_location, 'uniprot-cofactor-catalyst.tab')

# load data
with open(unprocessed_data_location + 'uniprot-cofactor-catalyst.tab') as infile:
    data = infile.readlines()

In [None]:
# reformat data and write it out
with open(processed_data_location + 'UNIPROT_PROTEIN_COFACTOR.txt', 'w') as outfile1, \
        open(processed_data_location + 'UNIPROT_PROTEIN_CATALYST.txt', 'w') as outfile2:
    for line in tqdm(data):
        # split each row once; columns follow the order requested in the download URL
        row = line.strip('\n').split('\t')
        upt_id, status, upt_entry = row[0], row[1], row[2]
        pr_id = 'PR_' + row[3].strip(';')
        # get cofactors
        if 'CHEBI' in row[4]:
            for i in row[4].split(';'):
                chebi = i.split('[')[-1].replace(']', '').replace(':', '_')
                outfile1.write(pr_id + '\t' + chebi + '\t' + status + '\t' + upt_id + '\t' + upt_entry + '\n')
        # get catalysts
        if 'CHEBI' in row[5]:
            for i in row[5].split(';'):
                chebi = i.split('[')[-1].replace(']', '').replace(':', '_')
                outfile2.write(pr_id + '\t' + chebi + '\t' + status + '\t' + upt_id + '\t' + upt_entry + '\n')

***

**Cofactor Data**  

In [None]:
# load data, print row count, and preview it
pcp1_data = pandas.read_csv(processed_data_location + 'UNIPROT_PROTEIN_COFACTOR.txt', header=None,
                            names=['Protein_Ontology_IDs', 'CHEBI_IDs', 'Status', 'Uniprot_ID', 'Uniprot_Entry_name'],
                            delimiter='\t')

print('There are {edge_count} protein-cofactor edges'.format(edge_count=len(pcp1_data)))
pcp1_data.head(n=5)

***


**Catalyst Data**  

In [None]:
# load data, print row count, and preview it
pcp2_data = pandas.read_csv(processed_data_location + 'UNIPROT_PROTEIN_CATALYST.txt', header=None,
                            names=['Protein_Ontology_IDs', 'CHEBI_IDs', 'Status', 'Uniprot_ID', 'Uniprot_Entry_name'],
                            delimiter='\t')

print('There are {edge_count} protein-catalyst edges'.format(edge_count=len(pcp2_data)))
pcp2_data.head(n=5)

<br>

***
***
### NODE AND RELATION METADATA<a class="anchor" id="node-relation-metadata"></a>
***

**Data Source Wiki Page:** [Dependencies](https://github.com/callahantiff/PheKnowLator/wiki/Dependencies/#metadata) 

**Purpose:** The goal of this section is to obtain metadata for each entity that is not from an ontology and all relations used in the knowledge graph. 

<br>

**Metadata:**  
A variety of <u>metadata</u> are pulled from the data sources that are used to support external edges added to enhance the core set of ontologies. For the monthly PheKnowLator builds, please see the [`pheknowlator_source_metadata.xlsx`](https://github.com/callahantiff/PheKnowLator/blob/master/resources/pheknowlator_source_metadata.xlsx) spreadsheet. This spreadsheet has two tabs, one for nodes and one for edges. For each entity (i.e., node or relation), there are several columns, including descriptions of the metadata, the variable type, and examples of values for each type of metadata. 

*Example Metadata Dictionary Output*. The code snippet below is meant to provide a snapshot of how data are organized in the metadata dictionary. As demonstrated by this example, there are three high-level keys:  
  - `nodes`: Nodes are keyed by CURIE. Every node has a `Label`, `Description`, `Synonym`, and `Dbxref` (whenever possible). Metadata that are obtained from specific sources that are not ontologies are added as a nested dictionary keyed by the filename.   
  - `edges`: Edges are keyed by a label which represents the edge type (the same label that is used in `resource_info.txt` and `edge_source_list.txt` files). Metadata that are obtained from specific sources that are not ontologies are added as a nested dictionary keyed by the filename.    
  - `relations`: Relations or `owl:ObjectProperty` objects are keyed by CURIE. Similar to nodes, every relation has a `Label`, `Description`, and `Synonym` (whenever possible). Metadata that are obtained from specific sources that are not ontologies are added as a nested dictionary keyed by the filename.     

```python
{
    'nodes': {
        'NCBIGene_2052': {
            'Label': 'EPHX1',
            'Description': "EPHX1 has locus group 'protein-coding' and is located on chromosome 1 (1q42.12).",
            'Synonym': 'epoxide hydrolase 1, microsomal (xenobiotic)|epoxide hydratase|EPHX|HYL1|MEHepoxide hydrolase 1|epoxide hydrolase 1 microsomal|EPOX',
            'Dbxref': 'MIM:132810|HGNC:HGNC:3401|Ensembl:ENSG00000143819', ... },
        'CHEBI_4592': {
            'Label': 'Dihydroxycarbazepine',
            'Description': "None",
            'Synonym': '10,11-Dihydro-10,11-dihydroxy-5H-dibenzazepine-5-carboxamide|10,11-Dihydroxycarbamazepine',
            'Dbxref': 'CAS:35079-97-1|KEGG:C07495',
            'CTD_chem_gene_ixns.tsv.gz': {  
                'CTD_ChemicalID': {'MESH:C004822'},
                'CTD_CasRN': {'35079-97-1'},
                'CTD_ChemicalName': {'10,11-dihydro-10,11-dihydroxy-5H-dibenzazepine-5-carboxamide'}}, ... }, ... },
    'edges': {
        'chemical-gene': {
            'CHEBI_4592-NCBIGene_2052': {
                'CTD_chem_gene_ixns.tsv': {
                    'CTD_Evidence': [{'CTD_Interaction': '[EPHX1 gene SNP affects the metabolism of carbamazepine epoxide] which affects the chemical synthesis of 10,11-dihydro-10,11-dihydroxy-5H-dibenzazepine-5-carboxamide',
                     'CTD_InteractionActions': 'affects^chemical synthesis|affects^metabolic processing',
                     'CTD_PubMedIDs': '15692831'}]}, ...}, ...}, ...},
    'relations': {
        'RO_0002434': {
            'Label': 'interacts with',
            'Description': 'A relationship that holds between two entities in which the processes executed by the two entities are causally connected.',
            'Synonym': 'in pairwise interaction with'}, ... }
}
```

<br>


<i><b>NOTE.</b> All entity metadata are written to the `metadata` directory as a `pickled` dictionary called `entity_metadata_dict.pkl`. The algorithm will look for this dictionary in the `metadata` directory and if it is not there, then no entity metadata will be created.</i>
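
The look-up behavior described in the note can be approximated as follows. This is a sketch only: the helper `load_entity_metadata` is hypothetical and is not a function from `pkt_kg`, and a temporary directory stands in for the `metadata` directory.

```python
import os
import pickle
import tempfile


def load_entity_metadata(metadata_dir, filename='entity_metadata_dict.pkl'):
    """Return the pickled metadata dictionary, or None when the file is absent."""
    path = os.path.join(metadata_dir, filename)
    if not os.path.exists(path):
        return None
    with open(path, 'rb') as infile:
        return pickle.load(infile)


# round trip in a temporary directory standing in for the `metadata` directory
with tempfile.TemporaryDirectory() as tmp_dir:
    missing = load_entity_metadata(tmp_dir)  # nothing has been written yet
    toy_metadata = {'nodes': {'NCBIGene_2052': {'Label': 'EPHX1'}},
                    'edges': {},
                    'relations': {'RO_0002434': {'Label': 'interacts with'}}}
    with open(os.path.join(tmp_dir, 'entity_metadata_dict.pkl'), 'wb') as outfile:
        pickle.dump(toy_metadata, outfile)
    loaded = load_entity_metadata(tmp_dir)
```

Returning `None` when the file is missing lets the caller decide whether to skip metadata creation, as the note describes.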

<br>

### Prepare Metadata Dictionaries
***

**Purpose:** To create the resources needed in order to create metadata dictionaries. This process has the following steps:

**1. [Generate Metadata Dictionaries](#generate-metadata-dictionaries):** In order to obtain metadata, we first read in the data source for each type and convert it into a dictionary. Then, the metadata dictionaries are merged into a single `master_metadata_dictionary`, keyed by identifier.
  - <u>Input Datasets</u>:  
    - [CTD_chem_gene_ixns.tsv](http://ctdbase.org/reports/CTD_chem_gene_ixns.tsv.gz)   
      - Edges: `chemical-gene`, `chemical-protein`, `chemical-rna`  
      - Identifier Maps:  
        - Chemicals: [MESH_CHEBI_MAP.txt](https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/MESH_CHEBI_MAP.txt)  
        - Proteins: [ENTREZ_GENE_PRO_ONTOLOGY_MAP.txt](https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/ENTREZ_GENE_PRO_ONTOLOGY_MAP.txt)   
        - RNA: [ENTREZ_GENE_ENSEMBL_TRANSCRIPT_MAP.txt](https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/ENTREZ_GENE_ENSEMBL_TRANSCRIPT_MAP.txt) 
    - [CTD_chem_go_enriched.tsv](http://ctdbase.org/reports/CTD_chem_go_enriched.tsv.gz)   
      - Edges: `chemical-gobp`, `chemical-gocc`, `chemical-gomf`  
      - Identifier Maps:  
        - Chemicals: [MESH_CHEBI_MAP.txt](https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/MESH_CHEBI_MAP.txt)  
    - [CTD_chemicals_diseases.tsv](http://ctdbase.org/reports/CTD_chemicals_diseases.tsv.gz)   
      - Edges: `chemical-disease`, `chemical-phenotype`  
      - Identifier Maps:  
        - Chemicals: [MESH_CHEBI_MAP.txt](https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/MESH_CHEBI_MAP.txt)  
        - Diseases: [DISEASE_MONDO_MAP.txt](https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/DISEASE_MONDO_MAP.txt) 
        - Phenotypes: [PHENOTYPE_HPO_MAP.txt](https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/PHENOTYPE_HPO_MAP.txt)  
    - [ChEBI2Reactome_All_Levels.txt](https://reactome.org/download/current/ChEBI2Reactome_All_Levels.txt)   
      - Edge: `chemical-pathway`   
    - [goa_human.gaf](http://current.geneontology.org/annotations/goa_human.gaf.gz)   
      - Edges: `protein-gobp`, `protein-gocc`, `protein-gomf`    
      - Identifier Maps:  
        - Proteins: [UNIPROT_ACCESSION_PRO_ONTOLOGY_MAP.txt](https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/UNIPROT_ACCESSION_PRO_ONTOLOGY_MAP.txt) 
    - [COMBINED.DEFAULT_NETWORKS.BP_COMBINING.txt](http://genemania.org/data/current/Homo_sapiens.COMBINED/COMBINED.DEFAULT_NETWORKS.BP_COMBINING.txt)   
      - Edge: `gene-gene`    
      - Identifier Maps:  
        - Genes: [UNIPROT_ACCESSION_ENTREZ_GENE_MAP.txt](https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/UNIPROT_ACCESSION_ENTREZ_GENE_MAP.txt) 
    - [phenotype.hpoa](http://purl.obolibrary.org/obo/hp/hpoa/phenotype.hpoa)   
      - Edge: `disease-phenotype`    
      - Identifier Maps:  
        - Diseases: [DISEASE_MONDO_MAP.txt](https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/DISEASE_MONDO_MAP.txt)   
    - [gene_association.reactome](https://reactome.org/download/current/gene_association.reactome.gz)   
      - Edges: `gobp-pathway`, `pathway-gocc`, `pathway-gomf`      
    - [UniProt2Reactome_All_Levels.txt](https://reactome.org/download/current/UniProt2Reactome_All_Levels.txt)   
      - Edge: `protein-pathway`      
      - Identifier Maps:  
        - Proteins: [UNIPROT_ACCESSION_PRO_ONTOLOGY_MAP.txt](https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/UNIPROT_ACCESSION_PRO_ONTOLOGY_MAP.txt) 
    - [CLINVAR_VARIANT_DISEASE_PHENOTYPE_EDGES.txt](https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/CLINVAR_VARIANT_DISEASE_PHENOTYPE_EDGES.txt)   
      - Edges: `variant-disease`, `variant-phenotype`      
      - Identifier Maps:  
        - Diseases: [DISEASE_MONDO_MAP.txt](https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/DISEASE_MONDO_MAP.txt)
        - Phenotypes: [PHENOTYPE_HPO_MAP.txt](https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/PHENOTYPE_HPO_MAP.txt)        
    - [CLINVAR_VARIANT_GENE_EDGES.txt](https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/CLINVAR_VARIANT_GENE_EDGES.txt)   
      - Edge: `variant-gene`      
    - [HPA_GTEX_RNA_GENE_PROTEIN_EDGES.txt](https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/HPA_GTEX_RNA_GENE_PROTEIN_EDGES.txt)   
      - Edges: `protein-anatomy`, `protein-cell`, `rna-anatomy`, `rna-cell`         
      - Identifier Maps:  
        - Proteins: [UNIPROT_ACCESSION_PRO_ONTOLOGY_MAP.txt](https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/UNIPROT_ACCESSION_PRO_ONTOLOGY_MAP.txt)         
        - Anatomy: [HPA_GTEx_TISSUE_CELL_MAP.txt](https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/HPA_GTEx_TISSUE_CELL_MAP.txt)
        - Cells: [HPA_GTEx_TISSUE_CELL_MAP.txt](https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/HPA_GTEx_TISSUE_CELL_MAP.txt)  
        - RNA: [GENE_SYMBOL_ENSEMBL_TRANSCRIPT_MAP.txt](https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/GENE_SYMBOL_ENSEMBL_TRANSCRIPT_MAP.txt) 
    - [UNIPROT_PROTEIN_CATALYST.txt](https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/UNIPROT_PROTEIN_CATALYST.txt)   
      - Edge: `protein-catalyst`
    - [UNIPROT_PROTEIN_COFACTOR.txt](https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/UNIPROT_PROTEIN_COFACTOR.txt)   
      - Edge: `protein-cofactor`
    - [9606.protein.links.v11.0.txt.gz](https://stringdb-static.org/download/protein.links.v11.0/9606.protein.links.v11.0.txt.gz)   
      - Edge: `protein-protein`      
      - Identifier Maps:  
        - Proteins: [STRING_PRO_ONTOLOGY_MAP.txt](https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/STRING_PRO_ONTOLOGY_MAP.txt)
    - [curated_gene_disease_associations.tsv](https://www.disgenet.org/static/disgenet_ap1/files/downloads/curated_gene_disease_associations.tsv.gz)   
      - Edge: `gene-disease`, `gene-phenotype`      
      - Identifier Maps:  
        - Diseases: [DISEASE_MONDO_MAP.txt](https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/DISEASE_MONDO_MAP.txt)
        - Phenotypes: [PHENOTYPE_HPO_MAP.txt](https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/PHENOTYPE_HPO_MAP.txt)  
    
<br>

**2. [Write Metadata Files](#write-metadata-files):** The `master_metadata_dictionary` dictionary from _Step 1_ is pickled and saved to the file `resources/metadata/entity_metadata_dict.pkl`.
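
A minimal sketch of the pickling step, assuming a toy dictionary and a temporary directory in place of `resources/metadata`. The `write_metadata` and `read_metadata` helper names are hypothetical and are not part of `pkt_kg`.

```python
import os
import pickle
import tempfile


def write_metadata(metadata, directory, filename='entity_metadata_dict.pkl'):
    """Pickle a metadata dictionary and return the path it was written to."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, filename)
    with open(path, 'wb') as f_out:
        pickle.dump(metadata, f_out, protocol=4)
    return path


def read_metadata(path):
    """Load a pickled metadata dictionary."""
    with open(path, 'rb') as f_in:
        return pickle.load(f_in)


# toy dictionary with the same top-level keys as master_metadata_dictionary
toy_metadata = {'nodes': {'NCBIGene_0000': {'Label': 'TOY1'}}, 'relations': {}, 'edges': {}}
with tempfile.TemporaryDirectory() as tmp_dir:
    saved_path = write_metadata(toy_metadata, tmp_dir)
    restored = read_metadata(saved_path)
```

Reading the file back right after writing it is a cheap way to confirm that the dictionary survives the round trip.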

<br>

***

In [None]:
# create the shell for the node and relation dictionary
master_metadata_dictionary = {'nodes': {}, 'relations': {}, 'edges': {}}

# # create temp metadata directory
# temp_location = metadata_location + 'temp'
# if os.path.exists(temp_location): shutil.rmtree(temp_location)
# os.mkdir(temp_location)
# os.mkdir(temp_location + '/nodes')
# os.mkdir(temp_location + '/relations')
# os.mkdir(temp_location + '/edges')

<br>

### Generate Metadata Dictionaries  <a class="anchor" id="generate-metadata-dictionaries"></a>

There are two types of data that are processed when building the metadata dictionary. The first type of data is *Primary*, meaning it consists of a small set of variables that are collected for all entities that are included in the knowledge graph (i.e., `Label`, `Description`, `DbXref`, `Synonym`). These data are collected for entities of type: genes, RNA, variants, and pathways. *Secondary* data are then collected for all edges in the knowledge graph that include entities that are not obtained from an ontology. For these sources, metadata may differ by source.  
- [Primary Metadata Elements](#primary)  
- [Secondary Metadata Elements](#secondary)
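
The two kinds of metadata end up at different depths of the same dictionary. The sketch below illustrates that layout with made-up identifiers and values; the `source` URL is a placeholder, and only the key names mirror the code cells in this section.

```python
# Toy illustration of the nested layout; identifiers and values are made up.
master = {'nodes': {}, 'relations': {}, 'edges': {}}

# primary metadata: one flat record per non-ontology entity
master['nodes']['NCBIGene_0000'] = {
    'Label': 'TOY1',
    'Description': "TOY1 locus group 'protein-coding'.",
    'Synonym': 'T1|TOY-1',
    'Dbxref': 'None'}

# secondary metadata: records nested under the URL of the source file
source = 'http://example.org/source_file.tsv'
master['nodes'].setdefault('CHEBI_0000', {})[source] = {
    'CTD_ChemicalName': {'toy chemical'}}
master['edges'].setdefault('CHEBI_0000-NCBIGene_0000', {})[source] = {
    'CTD_Evidence': [{'CTD_PubMedIDs': '0000000'}],
    'Type': 'chemical-gene'}
```

Keying secondary records by source URL lets several edge sources attach metadata to the same node or edge without overwriting one another.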

#### Primary Metadata Elements<a class="anchor" id="primary"></a>  
***

##### Genes Metadata Dictionary <a class="anchor" id="gene-metadata"></a>  

**Data Source Wiki Page:** [National Center for Biotechnology Information Gene](https://github.com/callahantiff/PheKnowLator/wiki/v4-Data-Sources#national-center-for-biotechnology-information-gene)

The nested dictionary of gene metadata is created by looping over the human [National Center for Biotechnology Information Gene](https://github.com/callahantiff/PheKnowLator/wiki/v4-Data-Sources#national-center-for-biotechnology-information-gene) identifier data set ([`Homo_sapiens.gene_info`](ftp://ftp.ncbi.nih.gov/gene/DATA/GENE_INFO/Mammalia/Homo_sapiens.gene_info.gz)). The `keys` of the dictionary are `Entrez gene identifiers` and the `values` are dictionaries for each metadata type.

In [None]:
# entrez gene data
entrez_gene_data = pandas.read_csv(unprocessed_data_location + 'Homo_sapiens.gene_info', header=0, delimiter='\t', low_memory=False)

# remove all rows that are not human
entrez_gene_data = entrez_gene_data.loc[entrez_gene_data['#tax_id'].apply(lambda x: x == 9606)]

# replace NaN and '-' with 'None'
entrez_gene_data.fillna('None', inplace=True)
entrez_gene_data.replace('-','None', inplace=True, regex=False)

# update prefixes
entrez_gene_data['GeneID'] = 'NCBIGene_' + entrez_gene_data['GeneID'].astype('str')

In [None]:
# create metadata
for idx, row in tqdm(entrez_gene_data.iterrows(), total=entrez_gene_data.shape[0]):
    if row['GeneID'] != 'None':
        genes, lab, desc, syn = [], [], [], []
        gene_id, sym, defn = row['GeneID'], row['Symbol'], row['description']
        gene_type, dbxref = row['type_of_gene'], row['dbXrefs']
        chrom, map_loc, s1, s2 = row['chromosome'], row['map_location'], row['Synonyms'], row['Other_designations']
        genes.append('http://www.ncbi.nlm.nih.gov/gene/' + str(gene_id))
        if sym != 'None' and sym != '': lab.append(sym)
        else: lab.append('Entrez_ID:' + gene_id)
        if 'None' not in [defn, gene_type, chrom, map_loc]:
            desc_str = "{} has locus group '{}' and is located on chromosome {} ({})."
            desc.append(desc_str.format(sym, gene_type, chrom, map_loc))
        else: desc.append("{} locus group '{}'.".format(sym, gene_type))
        if s1 != 'None' and s2 != 'None': syn.append('|'.join(set([x for x in (s1 + '|' + s2).split('|') if x != 'None' and x != ''])))
        elif s1 != 'None': syn.append('|'.join(set([x for x in s1.split('|') if x != 'None' and x != ''])))
        elif s2 != 'None': syn.append('|'.join(set([x for x in s2.split('|') if x != 'None' and x != ''])))
        else: syn.append('None')
        # update master dictionary
        master_metadata_dictionary['nodes'][gene_id] = {
            'Label': ''.join(lab),
            'Description': ''.join(desc),
            'Synonym': '|'.join(syn),
            'Dbxref': dbxref}

# delete unneeded data
del entrez_gene_data
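
The synonym handling above joins two `|`-delimited fields and removes placeholders and duplicates. A minimal sketch of the same idea as a standalone helper is shown below; `merge_synonyms` is a hypothetical name, and, unlike a `set`, it keeps the order of first appearance so repeated runs produce identical strings.

```python
def merge_synonyms(*fields, missing='None'):
    """Combine '|'-delimited synonym strings, dropping placeholders and duplicates.

    The order of first appearance is preserved so the output is deterministic.
    """
    merged = []
    for field in fields:
        for value in field.split('|'):
            value = value.strip()
            if value and value != missing and value not in merged:
                merged.append(value)
    return '|'.join(merged) if merged else missing


example = merge_synonyms('CML2|Hcml2|NAT8BP', 'ATase1|CML2|None')
empty = merge_synonyms('None', 'None')
```

Here `example` evaluates to `'CML2|Hcml2|NAT8BP|ATase1'` and `empty` to `'None'`.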

##### RNA Metadata Dictionary  <a class="anchor" id="rna-metadata"></a>  

**Data Source Wiki Page:** [Ensembl](https://github.com/callahantiff/PheKnowLator/wiki/v4-Data-Sources#ensembl)

The nested dictionary of RNA metadata is created by looping over the cleaned human [Ensembl](https://github.com/callahantiff/PheKnowLator/wiki/v4-Data-Sources#ensembl) gene, RNA, and protein identifier data set (`ensembl_identifier_data_cleaned.txt`). The `keys` of the dictionary are `Ensembl transcript identifiers` and the `values` are dictionaries for each metadata type.

In [None]:
# load data
rna_gene_data = pandas.read_csv(processed_data_location + 'ensembl_identifier_data_cleaned.txt', header=0, delimiter='\t', low_memory=False)

# remove rows without identifiers
rna_gene_data = rna_gene_data.loc[rna_gene_data['transcript_stable_id'].apply(lambda x: x != 'None')]

# remove unneeded columns
rna_gene_data.drop(['ensembl_gene_id', 'symbol', 'protein_stable_id', 'uniprot_id', 'master_transcript_type',
                    'entrez_id', 'ensembl_gene_type', 'master_gene_type'], axis=1, inplace=True)

# remove duplicates
rna_gene_data.drop_duplicates(subset=['transcript_stable_id', 'transcript_name', 'ensembl_transcript_type'], keep='first', inplace=True)

# update prefixes
rna_gene_data['transcript_stable_id'] = 'ensembl_' + rna_gene_data['transcript_stable_id'].astype('str')

# replace NaN with 'None'
rna_gene_data.fillna('None', inplace=True)

In [None]:
# create metadata
for idx, row in tqdm(rna_gene_data.iterrows(), total=rna_gene_data.shape[0]):
    rna, lab, desc, syn = [], [], [], []
    rna_id = row['transcript_stable_id']
    ent_type, nme = row['ensembl_transcript_type'], row['transcript_name']
    rna.append('https://uswest.ensembl.org/Homo_sapiens/Transcript/Summary?t=' + rna_id)
    if nme != 'None': lab.append(nme)
    else:
        lab.append('Ensembl_Transcript_ID:' + rna_id)
        nme = 'Ensembl_Transcript_ID:' + rna_id
    if ent_type != 'None': desc.append("Transcript {} is classified as type '{}'.".format(nme, ent_type))
    else: desc.append('None')
    syn.append('None')
    
    # update master dictionary
    master_metadata_dictionary['nodes'][rna_id] = {
        'Label': ''.join(lab),
        'Description': ''.join(desc),
        'Synonym': '|'.join(syn)}

# delete unneeded data
del rna_gene_data

##### Variant Metadata Dictionary <a class="anchor" id="variant-metadata"></a>   

**Data Source Wiki Page:** [ClinVar Variant](https://github.com/callahantiff/PheKnowLator/wiki/v4-Data-Sources#clinvar)

The nested dictionary of variant metadata is created by looping over the human [ClinVar Variant](https://github.com/callahantiff/PheKnowLator/wiki/v4-Data-Sources#clinvar) identifier data set ([`variant_summary.txt`](ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz)). Records are de-duplicated by dbSNP identifier; the `keys` of the dictionary are ClinVar variation identifiers (prefixed with `clinvar_`) and the `values` are dictionaries for each metadata type.

In [None]:
# download data
url = 'ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz'
if not os.path.exists(unprocessed_data_location + 'variant_summary.txt'):
    data_downloader(url, unprocessed_data_location)

# load data
var_data = pandas.read_csv(unprocessed_data_location + 'variant_summary.txt', header=0, delimiter='\t', low_memory=False)

# remove rows without identifiers
var_data = var_data.loc[var_data['Assembly'].apply(lambda x: x == 'GRCh38')]
var_data = var_data.loc[var_data['RS# (dbSNP)'].apply(lambda x: x != -1)]

# select the metadata columns (copy so the in-place updates below act on an independent frame)
var_metadata = var_data[['VariationID', '#AlleleID', 'Type', 'Name', 'ClinicalSignificance', 'RS# (dbSNP)', 'Origin',
                         'ChromosomeAccession', 'Chromosome', 'Start', 'Stop', 'ReferenceAllele', 'OtherIDs',
                         'Assembly', 'AlternateAllele','Cytogenetic', 'ReviewStatus', 'LastEvaluated']].copy()
# update prefixes
var_metadata['VariationID'] = 'clinvar_' + var_metadata['VariationID'].astype('str')


# replace 'na' and NaN with 'None'
var_metadata.replace('na', 'None', inplace=True)
var_metadata.fillna('None', inplace=True)

# remove duplicate dbSNP ids by choosing the most recent reviewed variant
var_metadata.sort_values('LastEvaluated', ascending=False, inplace=True)
var_metadata.drop_duplicates(subset='RS# (dbSNP)', keep='first', inplace=True)
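
The `LastEvaluated` column holds text, so sorting it orders rows alphabetically rather than by date. The sketch below shows a date-aware alternative, assuming dates are written like `Mar 12, 2020`; the toy frame and the `_evaluated` helper column are illustrative only.

```python
import pandas

toy = pandas.DataFrame({
    'RS# (dbSNP)': [111, 111, 222],
    'VariationID': ['clinvar_1', 'clinvar_2', 'clinvar_3'],
    'LastEvaluated': ['Mar 12, 2020', 'Jan 05, 2021', 'None']})

# parse the dates; values that do not match the format (e.g. 'None') become NaT
toy['_evaluated'] = pandas.to_datetime(toy['LastEvaluated'], format='%b %d, %Y', errors='coerce')

# newest first, undated rows last, then keep one row per dbSNP identifier
toy = toy.sort_values('_evaluated', ascending=False, na_position='last')
latest = toy.drop_duplicates(subset='RS# (dbSNP)', keep='first').drop(columns='_evaluated')
```

For the duplicated identifier `111`, a plain string sort would keep the 2020 record because `'Mar'` sorts after `'Jan'`, while the parsed dates keep the 2021 record.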

In [None]:
# create metadata
for idx, row in tqdm(var_metadata.iterrows(), total=var_metadata.shape[0]):
    if row['VariationID'] != 'None':
        variant, label, desc, syn = [], [], [], []
        var_id, lab, dbxref = row['VariationID'], row['Name'], row['OtherIDs']
        rs_id = str(row['RS# (dbSNP)'])
        variant.append('https://www.ncbi.nlm.nih.gov/snp/rs' + rs_id)
        if lab != 'None': label.append(lab)
        else: label.append('dbSNP_ID:rs' + rs_id)
        sent = "This variant is a {} {} located on chromosome {} ({}, start:{}/stop:{} positions, " +\
               "cytogenetic location:{}) and has clinical significance '{}'. " +\
               "This entry is for the {} and was last reviewed on {} with review status '{}'."
        desc.append(sent.format(row['Origin'].replace(';', '/'), row['Type'].replace(';', '/'), row['Chromosome'], row['ChromosomeAccession'],
                                row['Start'], row['Stop'], row['Cytogenetic'], row['ClinicalSignificance'],
                                row['Assembly'], row['LastEvaluated'], row['ReviewStatus']).replace('None', 'UNKNOWN'))
        syn.append('None')
        
        # update master dictionary
        master_metadata_dictionary['nodes'][var_id] = {
            'Label': ''.join(label),
            'Description': ''.join(desc),
            'Synonym': '|'.join(syn),
            'Dbxref': dbxref}

# delete unneeded data
del var_metadata

##### Pathway Metadata Dictionary <a class="anchor" id="pathway-metadata"></a>  

**Data Source Wiki Page:** [Reactome Pathway Database](https://github.com/callahantiff/PheKnowLator/wiki/v4-Data-Sources#reactome-pathway-database)

The nested dictionary of pathway metadata is created by looping over the human [Reactome Pathway Database](https://github.com/callahantiff/PheKnowLator/wiki/v4-Data-Sources#reactome-pathway-database) identifier data set ([`ReactomePathways.txt`](https://reactome.org/download/current/ReactomePathways.txt)); Reactome-Gene Association data ([`gene_association.reactome.gz`](https://reactome.org/download/current/gene_association.reactome.gz)), and Reactome-ChEBI data ([`ChEBI2Reactome_All_Levels.txt`](https://reactome.org/download/current/ChEBI2Reactome_All_Levels.txt)). The `keys` of the dictionary are `Reactome identifiers` and the `values` are dictionaries for each metadata type.

In [None]:
# download reactome pathways data
url = 'https://reactome.org/download/current/ReactomePathways.txt'
if not os.path.exists(unprocessed_data_location + 'ReactomePathways.txt'):
    data_downloader(url, unprocessed_data_location)
# load data
reactome_pathways = pandas.read_csv(unprocessed_data_location + 'ReactomePathways.txt', header=None, delimiter='\t', low_memory=False)
reactome_pathways = reactome_pathways.loc[reactome_pathways[2].apply(lambda x: x == 'Homo sapiens')] 

# download reactome gene association data
url = 'https://reactome.org/download/current/gene_association.reactome.gz'
if not os.path.exists(unprocessed_data_location + 'gene_association.reactome'):
    data_downloader(url, unprocessed_data_location)
# load data
reactome_pathways2 = pandas.read_csv(unprocessed_data_location + 'gene_association.reactome', header=None, delimiter='\t', skiprows=4, low_memory=False)
reactome_pathways2 = reactome_pathways2.loc[reactome_pathways2[12].apply(lambda x: x == 'taxon:9606')]
reactome_pathways2[5] = reactome_pathways2[5].str.replace('REACTOME:', '', regex=False)

# download reactome CHEBI data
url = 'https://reactome.org/download/current/ChEBI2Reactome_All_Levels.txt'
if not os.path.exists(unprocessed_data_location + 'ChEBI2Reactome_All_Levels.txt'):
    data_downloader(url, unprocessed_data_location)
# load data
reactome_pathways3 = pandas.read_csv(unprocessed_data_location + 'ChEBI2Reactome_All_Levels.txt', header=None, delimiter='\t', low_memory=False)
# remove all non-human pathways and save as list
reactome_pathways3 = reactome_pathways3.loc[reactome_pathways3[5].apply(lambda x: x == 'Homo sapiens')] 

In [None]:
# get metadata
nodes = list(set(reactome_pathways[0]) | set(reactome_pathways2[5]) | set(reactome_pathways3[1]))
pathway_metadata_final = metadata_api_mapper(nodes)

# update dictionary
pathway_metadata_final['ID'] = pathway_metadata_final['ID'].map('reactome_{}'.format)
pathway_metadata_final.set_index('ID', inplace=True)

# add entries to existing dictionary
master_metadata_dictionary['nodes'].update(pathway_metadata_final.to_dict('index'))

# delete unneeded data
del reactome_pathways, reactome_pathways2, reactome_pathways3
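
The prefixing and dictionary conversion used above can be sketched on a toy frame. The frame below stands in for the output of `metadata_api_mapper`; apart from `ID`, its column names are assumptions, and the identifiers are made up.

```python
import pandas

# stand-in for the frame returned by metadata_api_mapper (columns other than 'ID' are assumed)
toy_pathways = pandas.DataFrame({
    'ID': ['R-HSA-0000001', 'R-HSA-0000002'],
    'Label': ['toy pathway A', 'toy pathway B'],
    'Description': ['None', 'None'],
    'Synonym': ['None', 'None']})

# prefix every identifier, then turn each row into a {column: value} record keyed by ID
toy_pathways['ID'] = toy_pathways['ID'].map('reactome_{}'.format)
pathway_records = toy_pathways.set_index('ID').to_dict(orient='index')
```

Because the result is a plain dictionary of dictionaries, it can be merged into `master_metadata_dictionary['nodes']` with `dict.update`.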

##### Relations Metadata Dictionary <a class="anchor" id="relations-metadata"></a>   

**Data Source Wiki Page:** [Relations Ontology](https://github.com/callahantiff/PheKnowLator/wiki/v4-Data-Sources#relations-ontology)

The nested dictionary of relation metadata is created by looping over the classes and object properties of the [Relations Ontology](https://github.com/callahantiff/PheKnowLator/wiki/v4-Data-Sources#relations-ontology) with its imports merged (`ro_with_imports.owl`). The `keys` of the dictionary are `Relations Ontology identifiers` and the `values` are dictionaries for each metadata type.

In [None]:
# download ontology
if not os.path.exists(unprocessed_data_location + 'ro_with_imports.owl'):
    command = '{} {} --merge-import-closure -o {}'
    os.system(command.format(owltools_location, 'http://purl.obolibrary.org/obo/ro.owl',
                             unprocessed_data_location + 'ro_with_imports.owl'))
# load graph
ro_graph = Graph().parse(unprocessed_data_location + 'ro_with_imports.owl')
print('There are {} edges in the ontology (date:{})'.format(len(ro_graph), datetime.datetime.now().strftime('%m/%d/%Y')))

In [None]:
# get metadata
relation_metadata_dict, obo = {}, Namespace('http://purl.obolibrary.org/obo/')

# get ontology information
cls = [x for x in gets_ontology_classes(ro_graph) if '/RO_' in str(x)] +\
      [x for x in gets_object_properties(ro_graph) if '/RO_' in str(x)]
master_synonyms = [x for x in ro_graph if 'synonym' in str(x[1]).lower() and isinstance(x[0], URIRef)]

for x in tqdm(cls):
    # labels (untagged or English only)
    cls_label = [lbl for lbl in ro_graph.objects(x, RDFS.label) if '@' not in n3(lbl) or '@en' in n3(lbl)]
    labels = str(cls_label[0]) if len(cls_label) > 0 else 'None'
    # synonyms
    cls_syn = [str(i[2]) for i in master_synonyms if x == i[0]]
    synonym = str(cls_syn[0]) if len(cls_syn) > 0 else 'None'
    # description (untagged or English only)
    cls_desc = [d for d in ro_graph.objects(x, obo.IAO_0000115) if '@' not in n3(d) or '@en' in n3(d)]
    desc = str(cls_desc[0]) if len(cls_desc) > 0 else 'None'
    
    relation_metadata_dict[str(x).split('/')[-1]] = {
        'Label': labels, 'Description': desc, 'Synonym': synonym
    }

# add entries to existing dictionary
master_metadata_dictionary['relations'].update(relation_metadata_dict)

# delete unneeded data
del ro_graph

<br>

#### Secondary Metadata Elements<a class="anchor" id="secondary"></a>  

***

##### Download Identifier Maps

This code chunk loads the identifier mapping files that were created in the prior steps.

- Chemicals: [MESH_CHEBI_MAP.txt](https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/MESH_CHEBI_MAP.txt)  
- Genes: [UNIPROT_ACCESSION_ENTREZ_GENE_MAP.txt](https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/UNIPROT_ACCESSION_ENTREZ_GENE_MAP.txt) 
- Proteins:  
  - [ENTREZ_GENE_PRO_ONTOLOGY_MAP.txt](https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/ENTREZ_GENE_PRO_ONTOLOGY_MAP.txt)   
  - [UNIPROT_ACCESSION_PRO_ONTOLOGY_MAP.txt](https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/UNIPROT_ACCESSION_PRO_ONTOLOGY_MAP.txt) 
  - [STRING_PRO_ONTOLOGY_MAP.txt](https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/STRING_PRO_ONTOLOGY_MAP.txt)
- RNA:  
  - [ENTREZ_GENE_ENSEMBL_TRANSCRIPT_MAP.txt](https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/ENTREZ_GENE_ENSEMBL_TRANSCRIPT_MAP.txt)  
  - [GENE_SYMBOL_ENSEMBL_TRANSCRIPT_MAP.txt](https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/GENE_SYMBOL_ENSEMBL_TRANSCRIPT_MAP.txt)
- Diseases: [DISEASE_MONDO_MAP.txt](https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/DISEASE_MONDO_MAP.txt) 
- Phenotypes: [PHENOTYPE_HPO_MAP.txt](https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/PHENOTYPE_HPO_MAP.txt) 

In [None]:
# entrez-ensembl map
rna_map = pandas.read_csv(processed_data_location + 'ENTREZ_GENE_ENSEMBL_TRANSCRIPT_MAP.txt',
                          header=None, delimiter='\t', low_memory=False,
                          names=['Entrez_Gene_IDs', 'Ensembl_Transcript_IDs', 'Entrez_Gene_Type',
                                 'Ensembl_Transcript_Type', 'Master_Gene_Type', 'Master_Transcript_Type',
                                 'Entrez_Gene_prefix'])
# entrez-pro map
entrez_pro_map = pandas.read_csv(processed_data_location + 'ENTREZ_GENE_PRO_ONTOLOGY_MAP.txt',
                                 header=None, delimiter='\t', low_memory=False, usecols = [0, 1, 2, 4],
                                 names=['Gene_IDs', 'Protein_Ontology_IDs', 'Entrez_Gene_Type',
                                        'Master_Gene_Type', 'Entrez_Gene_Prefix'])
# symbol-ensembl map
symbol_transcript_map = pandas.read_csv(processed_data_location + 'GENE_SYMBOL_ENSEMBL_TRANSCRIPT_MAP.txt',
                           header=None, delimiter='\t', low_memory=False,
                           names=['Gene_Symbols', 'Ensembl_Transcript_IDs',
                                  'Gene_Type', 'Ensembl_Transcript_Type',
                                  'Master_Gene_Type', 'Master_Transcript_Type'])

# string-pro map
string_pro_map = pandas.read_csv(processed_data_location + 'STRING_PRO_ONTOLOGY_MAP.txt',
                                 header=None, delimiter='\t', low_memory=False, usecols=[0, 1],
                                 names=['STRING_IDs', 'Protein_Ontology_IDs'])
# uniprot-pro map
uniprot_pro_map = pandas.read_csv(processed_data_location + 'UNIPROT_ACCESSION_PRO_ONTOLOGY_MAP.txt',
                                  header=None, delimiter='\t', low_memory=False, usecols=[0, 1],
                                  names=['Uniprot_Accession_IDs', 'Protein_Ontology_IDs'])
# uniprot-entrez gene map
uniprot_entrez_data = pandas.read_csv(processed_data_location + 'UNIPROT_ACCESSION_ENTREZ_GENE_MAP.txt',
                                      header=None, delimiter='\t', low_memory=False, usecols=[0, 1, 2, 3],
                                      names=['Uniprot_Accession_IDs', 'Entrez_Gene_IDs',
                                             'master_gene_type', 'gene_type_update'])
# mesh-chebi map
mesh_chebi_map = pandas.read_csv(processed_data_location + 'MESH_CHEBI_MAP.txt', header=None, 
                                 names=['MESH_ID', 'CHEBI_ID'], delimiter='\t')
# disease maps
disease_maps = pandas.read_csv(processed_data_location + 'DISEASE_MONDO_MAP.txt', header=None,
                               names=['Disease_IDs', 'MONDO_IDs'], delimiter='\t')
# phenotype maps
phenotype_maps = pandas.read_csv(processed_data_location + 'PHENOTYPE_HPO_MAP.txt', header=None,
                                 names=['Disease_IDs', 'HP_IDs'], delimiter='\t')

# cells and anatomical entities
anatomy_maps = pandas.read_csv(processed_data_location + 'HPA_GTEx_TISSUE_CELL_MAP.txt', header=None,
                               names=['anatomy_ids', 'ontolgoy_ids'], delimiter='\t')



##### Genomic Entity Metadata<a class="anchor" id="genomicinfo"></a>  

Process the dictionary created in the prior steps in order to assist with creating a master metadata file for all nodes that are a genomic entity (i.e., genes, transcripts, or proteins). Some example output is shown below:

``` python
{'NCBIGene_51471': {
    'Synonyms': ['acetyltransferase 1',
                 'Hcml2',
                 'probable N-acetyltransferase 8B',
                 'CML2',
                 'putative N-acetyltransferase 8B',
                 'ATase1',
                 'N-acetyltransferase 8B (putative, gene/pseudogene)',
                 'N-acetyltransferase Camello 2',
                 'NAT8BP',
                 'camello-like protein 2',
                 'N-acetyltransferase 8B (GCN5-related, putative, gene/pseudogene)',
                 'putative N-acetyltransferase 8B',
                 'ATase1',
                 'N-acetyltransferase 8B (GCN5-related, putative, gene/pseudogene)',
                 'N-acetyltransferase Camello 2',
                 'acetyltransferase 1',
                 'camello-like protein 2',
                 'probable N-acetyltransferase 8B'],
   'PR': ['PR_Q9UHF3'],
   'GeneSymbol': ['GeneSymbol_NAT8BP',
                  'GeneSymbol_Hcml2',
                  'GeneSymbol_CML2',
                  'GeneSymbol_NAT8B'],
   'ensembl gene': ['ensembl_ENSG00000204872'],
   'ensembl protein': ['ensembl_ENSP00000485054'],
   'map_location': ['2p13.1'],
   'Label': ['N-acetyltransferase 8B (putative, gene/pseudogene)'],
   'ensembl transcript': ['ensembl_ENST00000377712'],
   'transcript_name': ['NAT8B-201'],
   'HGNC_ID': ['HGNC_ID_30235'],
   'chromosome': ['2'],
   'uniprot': ['uniprot_Q9UHF3']}}
```

In [None]:
# load data -- only reload if the dictionary has not been populated
if 'reformatted_mapped_identifiers' not in locals():
    filepath = processed_data_location + 'Merged_gene_rna_protein_identifiers.pkl'
    max_bytes = 2**31 - 1; input_size = os.path.getsize(filepath); bytes_in = bytearray(0)
    with open(filepath, 'rb') as f_in:
        for _ in range(0, input_size, max_bytes):
            bytes_in += f_in.read(max_bytes)
    reformatted_mapped_identifiers = pickle.loads(bytes_in)
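
The cell above reads the pickle file in pieces of at most `2**31 - 1` bytes before unpickling it, presumably to avoid platforms on which a single very large read fails; that motivation is an assumption. A minimal sketch of the same pattern as a reusable function, where `load_pickle_in_chunks` is a hypothetical name:

```python
import os
import pickle
import tempfile


def load_pickle_in_chunks(path, chunk_size=2**31 - 1):
    """Read a pickle file in pieces no larger than chunk_size bytes, then unpickle it."""
    buffer = bytearray()
    with open(path, 'rb') as f_in:
        while True:
            chunk = f_in.read(chunk_size)
            if not chunk:  # an empty read signals end of file
                break
            buffer += chunk
    return pickle.loads(bytes(buffer))


# demonstrate with a tiny file and an artificially small chunk size
toy = {'entrez_id_0': ['symbol_TOY']}
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, 'toy.pkl')
    with open(toy_path, 'wb') as f_out:
        pickle.dump(toy, f_out)
    reloaded = load_pickle_in_chunks(toy_path, chunk_size=8)
```

Looping until an empty read avoids having to know the file size in advance.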

In [None]:
# clean up data for use with master metadata
genomic_metadata = dict()
for key, value in tqdm(reformatted_mapped_identifiers.items()):
    old_prefix = '_'.join(key.split('_')[0:-1]); idx = key.split('_')[-1]; pass_var = True; new_prefix = None
    if old_prefix == 'entrez_id': new_prefix = 'NCBIGene'
    elif old_prefix in ['ensembl_gene_id', 'protein_stable_id', 'transcript_stable_id']: new_prefix = 'ensembl'
    elif old_prefix == 'pro_id_PR': new_prefix = 'PR'
    else: pass_var = False
    if pass_var and new_prefix is not None:
        updated_key = new_prefix + '_' + idx; master_metadata_dict = {updated_key: {}}
        for x in value:
            i, j = '_'.join(x.split('_')[0:-1]), x.split('_')[-1]
            if 'type' in i: continue
            elif i == 'entrez_id': new_i = 'NCBIGene'; j = new_i + '_' + j
            elif i == 'ensembl_gene_id': new_i = 'ensembl gene'; j = 'ensembl_' + j
            elif i == 'protein_stable_id': new_i = 'ensembl protein'; j = 'ensembl_' + j
            elif i == 'transcript_stable_id': new_i = 'ensembl transcript'; j = 'ensembl_' + j
            elif i == 'pro_id_PR': new_i = 'PR'; j = new_i + '_' + j
            elif i == 'hgnc_id': new_i = 'HGNC_ID'; j = new_i + '_' + j
            elif i == 'uniprot_id': new_i = 'uniprot'; j = new_i + '_' + j
            elif i == 'symbol': new_i = 'GeneSymbol'; j = new_i + '_' + j
            else:
                if i == 'synonyms': new_i = 'Synonyms'
                elif i == 'name': new_i = 'Label'
                elif i == 'Other_designations': new_i = 'Synonyms'; j = j.split('|')
                else: new_i = i
            if new_i in master_metadata_dict[updated_key].keys():
                if isinstance(j, list): master_metadata_dict[updated_key][new_i] += j
                else: master_metadata_dict[updated_key][new_i] += [j]
            # keep the value a flat list even when j is already a list (e.g., Other_designations)
            else: master_metadata_dict[updated_key][new_i] = j if isinstance(j, list) else [j]
        genomic_metadata[updated_key] = master_metadata_dict
        
# delete unneeded data
del reformatted_mapped_identifiers
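
The key renaming in the cell above splits each key on its last underscore and swaps the old prefix for a new one. A minimal sketch of that step in isolation, where `PREFIX_MAP` and `rename_key` are hypothetical names that mirror the prefixes handled above:

```python
PREFIX_MAP = {  # old prefix -> new prefix, as handled in the cell above
    'entrez_id': 'NCBIGene',
    'ensembl_gene_id': 'ensembl',
    'protein_stable_id': 'ensembl',
    'transcript_stable_id': 'ensembl',
    'pro_id_PR': 'PR'}


def rename_key(key):
    """Split 'prefix_identifier' on the last underscore and swap in the new prefix.

    Returns None when the prefix is not one of the recognised genomic prefixes.
    """
    old_prefix, _, identifier = key.rpartition('_')
    new_prefix = PREFIX_MAP.get(old_prefix)
    return None if new_prefix is None else '{}_{}'.format(new_prefix, identifier)


gene_key = rename_key('entrez_id_51471')
protein_key = rename_key('pro_id_PR_Q9UHF3')
skipped = rename_key('symbol_NAT8B')
```

Splitting on the last underscore assumes that the identifier itself contains no underscore, which is also true of the original `split('_')` approach.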

***

#### `CTD_chem_gene_ixns.tsv` <a class="anchor" id="chemical-gene"></a>  

**Data Source Wiki Page:** [Comparative Toxicogenomics Database](https://github.com/callahantiff/PheKnowLator/wiki/v4-Data-Sources#comparative-toxicogenomics-database)


**Edges:**  
- `chemical-gene`  
- `chemical-protein`  
- `chemical-rna` 

**Identifier Maps:**    
- Chemicals: [MESH_CHEBI_MAP.txt](https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/MESH_CHEBI_MAP.txt)  
- Proteins: [ENTREZ_GENE_PRO_ONTOLOGY_MAP.txt](https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/ENTREZ_GENE_PRO_ONTOLOGY_MAP.txt)   
- RNA: [ENTREZ_GENE_ENSEMBL_TRANSCRIPT_MAP.txt](https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/ENTREZ_GENE_ENSEMBL_TRANSCRIPT_MAP.txt)

This chunk processes the [`CTD_chem_gene_ixns.tsv`](http://ctdbase.org/reports/CTD_chem_gene_ixns.tsv.gz) file and obtains the following node and edge metadata:  
- **Nodes:**  
  _Chemical_  
  - `ChemicalID`: A string containing the chemical's MeSH identifier, which serves as a database cross-reference. The source file provides the identifier without a prefix.    
  - `CasRN`: A string containing the chemical's CAS Registry Number, if available, which also serves as a database cross-reference.    
  - `ChemicalName`: A string containing the name of the chemical, which serves as a synonym.   
  
  _Gene, RNA, and Protein_    
  - `GenomicInformation`: A dictionary of gene, RNA, and protein identifier information. See the [Genomic Entity Metadata](#genomicinfo) code chunk for more details.  


- **Edges:**  
  - `Interaction`: A string describing a chemical-gene/protein/rna interaction.   
  - `InteractionActions`: A "|"-delimited list of the actions that underlie an interaction.   
  - `PubMedIDs`: A "|"-delimited list of PubMed identifiers that do not include a prefix.  

In [None]:
# download data
url = 'http://ctdbase.org/reports/CTD_chem_gene_ixns.tsv.gz'
if not os.path.exists(unprocessed_data_location + 'CTD_chem_gene_ixns.tsv'):
    data_downloader(url, unprocessed_data_location, 'CTD_chem_gene_ixns.tsv')

# load data
ctd_gene_inx = pandas.read_csv(unprocessed_data_location + 'CTD_chem_gene_ixns.tsv', header=0, delimiter='\t', skiprows=27)
ctd_gene_inx = ctd_gene_inx[ctd_gene_inx['# ChemicalName'] != '#']
ctd_gene_inx = ctd_gene_inx[ctd_gene_inx['OrganismID'] == 9606]
ctd_gene_inx = ctd_gene_inx[ctd_gene_inx['PubMedIDs'].notna()]
ctd_gene_inx.fillna('None', inplace=True)
# fix variable typing
ctd_gene_inx['GeneID'] = ctd_gene_inx['GeneID'].astype('Int64')
ctd_gene_inx['OrganismID'] = ctd_gene_inx['OrganismID'].astype('Int64')
# update prefix
ctd_gene_inx['ChemicalID'] = 'MESH:' + ctd_gene_inx['ChemicalID']
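
When filtering out rows with a missing `PubMedIDs` value, note that an inequality test against NaN keeps every row, because NaN never compares equal to anything. A minimal sketch in plain Python, with toy rows and a hypothetical `has_value` helper:

```python
import math

nan = float('nan')

# NaN never compares equal to anything, itself included, so '!=' is always True
always_true = (nan != nan)

rows = [{'PubMedIDs': '0000001'}, {'PubMedIDs': nan}]


def has_value(value):
    """True unless value is a float NaN."""
    return not (isinstance(value, float) and math.isnan(value))


kept = [row for row in rows if has_value(row['PubMedIDs'])]
```

In pandas the equivalent explicit tests are `Series.notna()` and `Series.isna()`.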

*Merge Identifier Maps*

In [None]:
# merge identifier maps
ctd_gene_inx = ctd_gene_inx.merge(mesh_chebi_map, left_on='ChemicalID', right_on='MESH_ID')
ctd_gene_inx = ctd_gene_inx.merge(rna_map, left_on='GeneID', right_on='Entrez_Gene_IDs')
ctd_gene_inx = ctd_gene_inx.merge(entrez_pro_map, left_on='GeneID', right_on='Gene_IDs')

# visualize data
ctd_gene_inx.head(n=3)
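
Each `merge` call above uses the pandas default, an inner join, so interactions whose chemical, transcript, or protein identifier has no mapping are dropped without warning. The sketch below uses toy frames with made-up identifiers to show one way to count what would be lost before committing to the inner join.

```python
import pandas

toy_edges = pandas.DataFrame({'ChemicalID': ['MESH:D000001', 'MESH:D000002'], 'GeneID': [1, 2]})
toy_map = pandas.DataFrame({'MESH_ID': ['MESH:D000001'], 'CHEBI_ID': ['CHEBI_0000']})

# inner join (the pandas default): rows without a mapping are silently dropped
inner = toy_edges.merge(toy_map, left_on='ChemicalID', right_on='MESH_ID')

# left join with an indicator column: unmapped rows are kept and flagged
audit = toy_edges.merge(toy_map, left_on='ChemicalID', right_on='MESH_ID', how='left', indicator=True)
unmapped = audit.loc[audit['_merge'] == 'left_only', 'ChemicalID'].tolist()
```

Logging the size of `unmapped` for each identifier map makes it easier to explain why the merged table is smaller than the source file.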

*Create Metadata Dictionary*

In [None]:
# master_metadata_dictionary['edges'] = {'chemical-gene': {}, 'chemical-rna': {}, 'chemical-protein': {}}

# create dictionary
for idx, row in tqdm(ctd_gene_inx.iterrows(), total=ctd_gene_inx.shape[0]):
    chebi = row['CHEBI_ID'].rstrip(); gene_form = None
    chemical_name = row['# ChemicalName']; chemical_id = row['ChemicalID'].rstrip(); casrn = row['CasRN']
    evidence = [{'CTD_Interaction': row['Interaction'],
                 'CTD_InteractionActions': row['InteractionActions'],
                 'CTD_PubMedIDs': row['PubMedIDs']}]
    if row['GeneForms'] == 'gene':
        node_key = row['Entrez_Gene_prefix'].rstrip(); gene_form = row['GeneForms']
        if node_key in genomic_metadata.keys(): genomic_info_dict = genomic_metadata[node_key]
        else: genomic_info_dict = None
        edge_key = '{}-{}'.format(chebi, node_key); edge_type = 'chemical-gene'
    if row['GeneForms'] == 'protein':
        node_key = row['Protein_Ontology_IDs'].rstrip(); gene_form = row['GeneForms']
        if node_key in genomic_metadata.keys(): genomic_info_dict = genomic_metadata[node_key]
        else: genomic_info_dict = None
        edge_key = '{}-{}'.format(chebi, node_key); edge_type = 'chemical-protein'
    if row['GeneForms'] == 'mRNA':
        node_key = row['Ensembl_Transcript_IDs'].rstrip(); gene_form = row['GeneForms']
        if node_key in genomic_metadata.keys(): genomic_info_dict = genomic_metadata[node_key]
        else: genomic_info_dict = None
        edge_key = '{}-{}'.format(chebi, node_key); edge_type = 'chemical-rna'
    if gene_form is not None:
        # add chebi metadata
        if chebi in master_metadata_dictionary['nodes'].keys():
            if url in master_metadata_dictionary['nodes'][chebi].keys():
                master_metadata_dictionary['nodes'][chebi][url]['CTD_ChemicalName'] |= {chemical_name}
                master_metadata_dictionary['nodes'][chebi][url]['CTD_ChemicalID'] |= {chemical_id}
                master_metadata_dictionary['nodes'][chebi][url]['CTD_CasRN'] |= {casrn}
            else:
                master_metadata_dictionary['nodes'][chebi].update({
                    url: {'CTD_ChemicalID': {chemical_id},
                          'CTD_CasRN': {casrn},
                          'CTD_ChemicalName': {chemical_name}}})
        else:
            master_metadata_dictionary['nodes'].update({chebi: {
                url: {'CTD_ChemicalID': {chemical_id},
                      'CTD_CasRN': {casrn},
                      'CTD_ChemicalName': {chemical_name}}}})
        
        # add genomic information
        if node_key in master_metadata_dictionary['nodes'].keys():
            if genomic_info_dict is not None:
                master_metadata_dictionary['nodes'][node_key].update({'genomic_data': genomic_info_dict})
            else: master_metadata_dictionary['nodes'][node_key].update({'genomic_data': 'None'})
        else:
            if genomic_info_dict is not None:
                master_metadata_dictionary['nodes'].update({node_key: {'genomic_data': genomic_info_dict}})
            else: master_metadata_dictionary['nodes'].update({node_key: {'genomic_data': 'None'}})

        # add relation data to dictionary
        if edge_key in master_metadata_dictionary['edges'].keys():
            if url in master_metadata_dictionary['edges'][edge_key].keys():
                if 'CTD_Evidence' in master_metadata_dictionary['edges'][edge_key][url].keys():
                    inital_ev = master_metadata_dictionary['edges'][edge_key][url]
                    inital_ev = inital_ev['CTD_Evidence'] + evidence
                    ev = [json.loads(i) for i in set(json.dumps(item, sort_keys=True) for item in inital_ev)]
                    master_metadata_dictionary['edges'][edge_key][url]['CTD_Evidence'] = ev
                else: master_metadata_dictionary['edges'][edge_key][url].update({'CTD_Evidence': evidence})
            else: master_metadata_dictionary['edges'][edge_key].update({url: {'CTD_Evidence': evidence, 'Type': edge_type}})
        else: master_metadata_dictionary['edges'].update({edge_key: {url: {'CTD_Evidence': evidence, 'Type': edge_type}}})
            
# delete unneeded data
del ctd_gene_inx
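
The evidence-merging step in the loop above serializes each evidence dictionary to a JSON string with sorted keys, so that identical records collapse inside a set. A minimal standalone sketch of that idea follows. The helper name `merge_evidence` and the sample records are hypothetical and are not part of the notebook.

```python
import json

def merge_evidence(existing, new):
    """Combine two lists of evidence dicts, dropping exact duplicates."""
    # dicts are unhashable, so serialize each one to a canonical JSON string first
    unique = {json.dumps(item, sort_keys=True) for item in existing + new}
    # sort the strings so the output order is reproducible (set order is arbitrary)
    return [json.loads(s) for s in sorted(unique)]

# hypothetical evidence records; the first two differ only in key order
old = [{'CTD_PubMedIDs': '123', 'CTD_InteractionActions': 'increases^expression'}]
new = [{'CTD_InteractionActions': 'increases^expression', 'CTD_PubMedIDs': '123'},
       {'CTD_PubMedIDs': '456', 'CTD_InteractionActions': 'decreases^activity'}]
merged = merge_evidence(old, new)
print(len(merged))
```

This approach only works when every value in an evidence record is JSON-serializable.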

***

#### `CTD_chem_go_enriched.tsv` <a class="anchor" id="chemical-go"></a>  

**Data Source Wiki Page:** [Comparative Toxicogenomics Database](https://github.com/callahantiff/PheKnowLator/wiki/v4-Data-Sources#comparative-toxicogenomics-database)

**Edges:**  
- `chemical-gobp`  
- `chemical-gocc`  
- `chemical-gomf` 

**Identifier Maps:**    
- Chemicals: [MESH_CHEBI_MAP.txt](https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/MESH_CHEBI_MAP.txt)  

This chunk processes the [`CTD_chem_go_enriched.tsv`](http://ctdbase.org/reports/CTD_chem_go_enriched.tsv.gz) file and obtains the following node and edge metadata:  
- **Nodes:**  
  _Chemical_  
  - `ChemicalID`: A database cross-reference for the concept. Cross-references are normally formatted as Prefix:ID; in this file the value is a MeSH identifier provided as a string without a prefix.  
  - `CasRN`: A database cross-reference for the concept. In this file the value is a CAS Registry Number, if available.  
  - `ChemicalName`: A synonym for the concept. Synonyms derived from an ontology are prefixed by the synonym type; in this file the value is the name of the chemical.  
  
  _GO Biological Process, Cellular Component, Molecular Function_  
  - `GOTermName`: A string containing the concept's synonym. 
  - `Ontology`: A string naming the GO Ontology subset.  


- **Edges:**  
  - `HighestGOLevel`: The highest level to which the GO term is assigned within the GO hierarchical ontology. Many GO terms are located at multiple levels within the ontology; only the highest level is displayed. Level 1 constitutes “children” of the most general Biological Process, Cellular Component, and Molecular Function terms. Source: http://ctdbase.org/help/chemGODetailHelp.jsp.   
  - `PValue`: The raw p-value. Source: http://ctdbase.org/help/chemGODetailHelp.jsp.   
  - `CorrectedPValue`: The corrected p-value calculated using the Bonferroni multiple testing adjustment. Source: http://ctdbase.org/help/chemGODetailHelp.jsp.  
  - `TargetMatchQty`: The count of matches to the target. Source: http://ctdbase.org/help/chemGODetailHelp.jsp.  
  - `TargetTotalQty`: The total matches to the target. Source: http://ctdbase.org/help/chemGODetailHelp.jsp.
  - `BackgroundMatchQty`: The count of matches to the genome. Source: http://ctdbase.org/help/chemGODetailHelp.jsp.
  - `BackgroundTotalQty`: The total matches to the genome. Source: http://ctdbase.org/help/chemGODetailHelp.jsp.
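
As a worked illustration of how `CorrectedPValue` relates to the raw p-value, the textbook Bonferroni adjustment multiplies the raw value by the number of tests and caps the result at 1. The sketch below assumes that textbook definition. The number of tests CTD uses is not stated here, so `n_tests` and the raw p-values are hypothetical.

```python
def bonferroni(p_value, n_tests):
    """Textbook Bonferroni adjustment: scale by the number of tests, cap at 1."""
    return min(1.0, p_value * n_tests)

# hypothetical raw p-values and a hypothetical number of tests
raw = [0.00001, 0.004, 0.2]
n_tests = 100
corrected = [bonferroni(p, n_tests) for p in raw]
print(corrected)
```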

In [None]:
# download data
url = 'http://ctdbase.org/reports/CTD_chem_go_enriched.tsv.gz'
if not os.path.exists(unprocessed_data_location + 'CTD_chem_go_enriched.tsv'):
    data_downloader(url, unprocessed_data_location, 'CTD_chem_go_enriched.tsv')

# load data
ctd_chem_go = pandas.read_csv(unprocessed_data_location + 'CTD_chem_go_enriched.tsv', header=0, delimiter='\t', skiprows=27)
ctd_chem_go = ctd_chem_go[ctd_chem_go['# ChemicalName'] != '#']
ctd_chem_go.fillna('None', inplace=True)
# update prefix
ctd_chem_go['ChemicalID'] = 'MESH:' + ctd_chem_go['ChemicalID']
ctd_chem_go['GOTermID'] = ctd_chem_go['GOTermID'].str.replace(':', '_')

*Merge Identifier Maps*

In [None]:
# merge identifier maps
ctd_chem_go = ctd_chem_go.merge(mesh_chebi_map, left_on='ChemicalID', right_on='MESH_ID')

# visualize data
ctd_chem_go.head(n=3)

*Create Metadata Dictionary*

In [None]:
# master_metadata_dictionary['edges'].update({'chemical-gobp': {}, 'chemical-gocc': {}, 'chemical-gomf': {}})

# create dictionary
for idx, row in tqdm(ctd_chem_go.iterrows(), total=ctd_chem_go.shape[0]):
    chebi = row['CHEBI_ID'].rstrip(); node_key = row['GOTermID']
    chemical_name = row['# ChemicalName']; chemical_id = row['ChemicalID'].rstrip(); casrn = row['CasRN']
    ontology = row['Ontology']; go_name = row['GOTermName']
    evidence = [{'CTD_Pvalue': row['PValue'],
                 'CTD_CorrectedPValue': row['CorrectedPValue'],
                 'CTD_TargetMatchQty': row['TargetMatchQty'],
                 'CTD_TargetTotalQty': row['TargetTotalQty'],
                 'CTD_BackgroundMatchQty': row['BackgroundMatchQty'],
                 'CTD_BackgroundTotalQty': row['BackgroundTotalQty'],
                 'CTD_HighestGOLevel': row['HighestGOLevel']}]
    # specify edge type, which is related to the ontology aspect
    if ontology == 'Biological Process': edge_key = '{}-{}'.format(chebi, node_key); edge_type = 'chemical-gobp'
    if ontology == 'Cellular Component': edge_key = '{}-{}'.format(chebi, node_key); edge_type = 'chemical-gocc'
    if ontology == 'Molecular Function': edge_key = '{}-{}'.format(chebi, node_key); edge_type = 'chemical-gomf'    
    
    # add chebi metadata
    if chebi in master_metadata_dictionary['nodes'].keys():
        if url in master_metadata_dictionary['nodes'][chebi].keys():
            master_metadata_dictionary['nodes'][chebi][url]['CTD_ChemicalName'] |= {chemical_name}
            master_metadata_dictionary['nodes'][chebi][url]['CTD_ChemicalID'] |= {chemical_id}
            master_metadata_dictionary['nodes'][chebi][url]['CTD_CasRN'] |= {casrn}
        else:
            master_metadata_dictionary['nodes'][chebi].update({
                url: {'CTD_ChemicalID': {chemical_id},
                      'CTD_CasRN': {casrn},
                      'CTD_ChemicalName': {chemical_name}}})
    else:
        master_metadata_dictionary['nodes'].update({chebi: {
                url: {'CTD_ChemicalID': {chemical_id},
                      'CTD_CasRN': {casrn},
                      'CTD_ChemicalName': {chemical_name}}}})
    
    # add go information
    if node_key in master_metadata_dictionary['nodes'].keys():
        if url in master_metadata_dictionary['nodes'][node_key].keys():
            master_metadata_dictionary['nodes'][node_key][url]['CTD_Ontology'] |= {ontology}
            master_metadata_dictionary['nodes'][node_key][url]['CTD_GOTermName'] |= {go_name}
        else:
            master_metadata_dictionary['nodes'][node_key].update({
                url: {'CTD_Ontology': {ontology},
                      'CTD_GOTermName': {go_name}}})
    else:
        master_metadata_dictionary['nodes'].update({node_key: {
            url: {'CTD_Ontology': {ontology},
                  'CTD_GOTermName': {go_name}}}})
    
    # add relation data to dictionary
    if edge_key in master_metadata_dictionary['edges'].keys():
        if url in master_metadata_dictionary['edges'][edge_key].keys():
            if 'CTD_Evidence' in master_metadata_dictionary['edges'][edge_key][url].keys():
                inital_ev = master_metadata_dictionary['edges'][edge_key][url]
                inital_ev = inital_ev['CTD_Evidence'] + evidence
                ev = [json.loads(i) for i in set(json.dumps(item, sort_keys=True) for item in inital_ev)]
                master_metadata_dictionary['edges'][edge_key][url]['CTD_Evidence'] = ev
            else: master_metadata_dictionary['edges'][edge_key][url].update({'CTD_Evidence': evidence})
        else: master_metadata_dictionary['edges'][edge_key].update({url: {'CTD_Evidence': evidence, 'Type': edge_type}}) 
    else: master_metadata_dictionary['edges'].update({edge_key: {url: {'CTD_Evidence': evidence, 'Type': edge_type}}})

# delete unneeded data
del ctd_chem_go
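
The edge type assigned in the loop above depends only on the `Ontology` value, so the three conditionals can also be read as a lookup table. The sketch below shows that mapping together with the `GO:` to `GO_` prefix conversion. The helper `go_edge` and the identifiers are hypothetical, and returning `None` for an unrecognized ontology is an assumption about how such rows should be handled.

```python
EDGE_TYPES = {'Biological Process': 'chemical-gobp',
              'Cellular Component': 'chemical-gocc',
              'Molecular Function': 'chemical-gomf'}

def go_edge(chebi, go_id, ontology):
    """Return (edge_key, edge_type), or None when the ontology is not recognized."""
    edge_type = EDGE_TYPES.get(ontology)
    if edge_type is None:
        return None
    node_key = go_id.replace(':', '_')  # e.g. 'GO:0008150' becomes 'GO_0008150'
    return '{}-{}'.format(chebi, node_key), edge_type

# hypothetical chemical identifier paired with a GO term
result = go_edge('CHEBI_1234', 'GO:0008150', 'Biological Process')
print(result)
```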

***

#### `CTD_chemicals_diseases.tsv` <a class="anchor" id="chemical-disease"></a>  

**Data Source Wiki Page:** [Comparative Toxicogenomics Database](https://github.com/callahantiff/PheKnowLator/wiki/v4-Data-Sources#comparative-toxicogenomics-database)

**Edges:**  
- `chemical-disease`  
- `chemical-phenotype`  

**Identifier Maps:**    
- Chemicals: [MESH_CHEBI_MAP.txt](https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/MESH_CHEBI_MAP.txt)  
- Diseases: [DISEASE_MONDO_MAP.txt](https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/DISEASE_MONDO_MAP.txt) 
- Phenotypes: [PHENOTYPE_HPO_MAP.txt](https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/PHENOTYPE_HPO_MAP.txt)   

This chunk processes the [`CTD_chemicals_diseases.tsv`](http://ctdbase.org/reports/CTD_chemicals_diseases.tsv.gz) file and obtains the following node and edge metadata:  
- **Nodes:**  
  _Chemical_  
  - `ChemicalID`: A database cross-reference for the concept. Cross-references are normally formatted as Prefix:ID; in this file the value is a MeSH identifier provided as a string without a prefix.  
  - `CasRN`: A database cross-reference for the concept. In this file the value is a CAS Registry Number, if available.  
  - `ChemicalName`: A synonym for the concept. Synonyms derived from an ontology are prefixed by the synonym type; in this file the value is the name of the chemical.  
  
  _Disease, Phenotype_  
  - `DiseaseName`: A string containing the concept's synonym. 
  - `DiseaseID`: A string containing the concept's database cross-reference, which is formatted as Prefix:ID.  
  - `OmimIDs`: A string containing the concept's database cross-reference, which is formatted as Prefix:ID.  


- **Edges:**  
  - `DirectEvidence`: '|'-delimited list of strings that include keywords.   
  - `InferenceScore`: The inference score (float) reflects the degree of similarity between CTD chemical–gene–disease networks and a similar scale-free random network. The higher the score, the more likely the inference network has atypical connectivity.   
  - `PubMedIDs`: '|'-delimited list of PubMed identifiers that do not include a prefix.  
  - `InferenceGeneSymbol`: A string containing the gene symbol. The genes on which the inferred association is based (i.e., genes that have curated interactions with the chemical and curated associations with the disease).  
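
Because `DirectEvidence` and `PubMedIDs` are '|'-delimited strings, downstream code may want them as lists. A minimal sketch of such a split is shown below. The helper `split_pipe` is hypothetical, the PubMed identifiers are made up, and treating the `'None'` placeholder as missing is an assumption.

```python
def split_pipe(value):
    """Split a '|'-delimited string into a list; missing values give an empty list."""
    # treat None, NaN (which is never equal to itself), '', and 'None' as missing
    if value is None or value != value or value in ('', 'None'):
        return []
    return [part.strip() for part in str(value).split('|') if part.strip()]

pubmed_ids = split_pipe('1234567|7654321')                     # hypothetical identifiers
direct_evidence = split_pipe('marker/mechanism|therapeutic')
missing = split_pipe(float('nan'))
print(pubmed_ids, direct_evidence, missing)
```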

In [None]:
# download data
url = 'http://ctdbase.org/reports/CTD_chemicals_diseases.tsv.gz'
if not os.path.exists(unprocessed_data_location + 'CTD_chemicals_diseases.tsv'):
    data_downloader(url, unprocessed_data_location, 'CTD_chemicals_diseases.tsv')

# load data
ctd_chem_dis = pandas.read_csv(unprocessed_data_location + 'CTD_chemicals_diseases.tsv', header=0, delimiter='\t', skiprows=27)
ctd_chem_dis = ctd_chem_dis[ctd_chem_dis['# ChemicalName'] != '#']
# drop rows with missing values (NaN never compares equal to itself, so `!= numpy.nan` filters nothing)
ctd_chem_dis = ctd_chem_dis[ctd_chem_dis['PubMedIDs'].notna()]
ctd_chem_dis = ctd_chem_dis[ctd_chem_dis['DiseaseID'].notna()]
# update prefix
ctd_chem_dis['ChemicalID'] = 'MESH:' + ctd_chem_dis['ChemicalID']
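
One detail worth keeping in mind when filtering missing values: under IEEE 754 rules NaN never compares equal to anything, including itself, so an inequality test against NaN is true for every row. The standalone sketch below illustrates this with plain Python floats and contrasts it with an explicit missing-value check. The sample records and the helper `is_missing` are hypothetical.

```python
import math

nan = float('nan')
# hypothetical records; the second has no PubMed identifiers
rows = [{'DiseaseID': 'MESH:D000001', 'PubMedIDs': '123'},
        {'DiseaseID': 'MESH:D000002', 'PubMedIDs': nan}]

# an inequality test against NaN keeps every row, including the missing one
kept_by_inequality = [r for r in rows if r['PubMedIDs'] != nan]

def is_missing(value):
    """True only for float NaN values."""
    return isinstance(value, float) and math.isnan(value)

# an explicit check removes the missing row
kept_by_check = [r for r in rows if not is_missing(r['PubMedIDs'])]
print(len(kept_by_inequality), len(kept_by_check))
```

In pandas, the equivalent explicit checks are `Series.notna()` and `DataFrame.dropna(subset=[...])`.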

*Merge Identifier Maps*

In [None]:
ctd_chem_dis = ctd_chem_dis.merge(mesh_chebi_map, left_on='ChemicalID', right_on='MESH_ID')
ctd_chem_dis = ctd_chem_dis.merge(disease_maps, left_on='DiseaseID', right_on='Disease_IDs')
ctd_chem_dis = ctd_chem_dis.merge(phenotype_maps, left_on='DiseaseID', right_on='Disease_IDs')

# visualize data
ctd_chem_dis.head(n=3)

*Create Metadata Dictionary*

In [None]:
# master_metadata_dictionary['edges'].update({'chemical-disease': {}, 'chemical-phenotype': {}})

# create dictionary
for idx, row in tqdm(ctd_chem_dis.iterrows(), total=ctd_chem_dis.shape[0]):
    chebi = row['CHEBI_ID'].rstrip()
    chemical_name = row['# ChemicalName']; chemical_id = row['ChemicalID'].rstrip(); casrn = row['CasRN']
    dis_name = row['DiseaseName']; dis_id = row['DiseaseID']
    omim = row['OmimIDs'] if not pandas.isna(row['OmimIDs']) else 'None'
    evidence = [{'CTD_DirectEvidence': row['DirectEvidence'] if not pandas.isna(row['DirectEvidence']) else 'None',
                 'CTD_InferenceScore': row['InferenceScore'] if not pandas.isna(row['InferenceScore']) else 'None',
                 'CTD_PubMedIDs': row['PubMedIDs'],
                 'CTD_InferenceGeneSymbol': row['InferenceGeneSymbol'] if not pandas.isna(row['InferenceGeneSymbol']) else 'None'}]
    for node_key in [row['MONDO_IDs'], row['HP_IDs']]:
        # skip missing identifiers so that values from a previous iteration are never reused
        if pandas.isna(node_key): continue
        if node_key.startswith('MONDO'):
            edge_key = '{}-{}'.format(chebi, node_key); edge_type = 'chemical-disease'
        if node_key.startswith('HP'):
            edge_key = '{}-{}'.format(chebi, node_key); edge_type = 'chemical-phenotype'
        
        # add chebi metadata
        if chebi in master_metadata_dictionary['nodes'].keys():
            if url in master_metadata_dictionary['nodes'][chebi].keys():
                master_metadata_dictionary['nodes'][chebi][url]['CTD_ChemicalName'] |= {chemical_name}
                master_metadata_dictionary['nodes'][chebi][url]['CTD_ChemicalID'] |= {chemical_id}
                master_metadata_dictionary['nodes'][chebi][url]['CTD_CasRN'] |= {casrn}
            else:
                master_metadata_dictionary['nodes'][chebi].update({
                url: {'CTD_ChemicalID': {chemical_id},
                      'CTD_CasRN': {casrn},
                      'CTD_ChemicalName': {chemical_name}}})
        else:
            master_metadata_dictionary['nodes'].update({chebi: {
                url: {'CTD_ChemicalID': {chemical_id},
                      'CTD_CasRN': {casrn},
                      'CTD_ChemicalName': {chemical_name}}}})
        
        # add disease information
        if node_key in master_metadata_dictionary['nodes'].keys():
            if url in master_metadata_dictionary['nodes'][node_key]:
                master_metadata_dictionary['nodes'][node_key][url]['CTD_DiseaseName'] |= {dis_name}
                master_metadata_dictionary['nodes'][node_key][url]['CTD_DiseaseID'] |= {dis_id}
                master_metadata_dictionary['nodes'][node_key][url]['CTD_OmimIDs'] |= {omim}
            else:
                master_metadata_dictionary['nodes'][node_key].update({
                    url: {'CTD_DiseaseName': {dis_name},
                          'CTD_DiseaseID': {dis_id},
                          'CTD_OmimIDs': {omim}}})
        else:
            master_metadata_dictionary['nodes'].update({node_key: {
                    url: {'CTD_DiseaseName': {dis_name},
                          'CTD_DiseaseID': {dis_id},
                          'CTD_OmimIDs': {omim}}}})
        
        # add relation data to dictionary
        if edge_key in master_metadata_dictionary['edges'].keys():
            if url in master_metadata_dictionary['edges'][edge_key].keys():
                if 'CTD_Evidence' in master_metadata_dictionary['edges'][edge_key][url].keys():
                    inital_ev = master_metadata_dictionary['edges'][edge_key][url]
                    inital_ev = inital_ev['CTD_Evidence'] + evidence
                    ev = [json.loads(i) for i in set(json.dumps(item, sort_keys=True) for item in inital_ev)]
                    master_metadata_dictionary['edges'][edge_key][url]['CTD_Evidence'] = ev
                else: master_metadata_dictionary['edges'][edge_key][url].update({'CTD_Evidence': evidence})
            else: master_metadata_dictionary['edges'][edge_key].update({url: {'CTD_Evidence': evidence, 'Type': edge_type}})
        else: master_metadata_dictionary['edges'].update({edge_key: {url: {'CTD_Evidence': evidence, 'Type': edge_type}}})

# delete unneeded data
del ctd_chem_dis

***

#### `ChEBI2Reactome_All_Levels.txt` <a class="anchor" id="gene-pathway"></a>  

**Data Source Wiki Page:** [Reactome Pathway Database](https://github.com/callahantiff/PheKnowLator/wiki/v4-Data-Sources#reactome-pathway-database)     


**Edges:**  
- `chemical-pathway`  

This chunk processes the [`ChEBI2Reactome_All_Levels.txt`](https://reactome.org/download/current/ChEBI2Reactome_All_Levels.txt) file and obtains the following node and edge metadata:  
- **Nodes:**  
  _Pathway_  
  - `DBReference`: A string containing the concept's database cross-reference, which is formatted as prefix:ID.       


- **Edges:**  
  - `EvidenceID`: A string containing an evidence code.
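
To make the column positions used by the loading code concrete, the sketch below parses one hypothetical tab-separated record and builds the prefixed identifiers and edge key. The six-column layout (ChEBI ID, pathway ID, URL, pathway name, evidence code, species) is assumed from how the notebook indexes the columns, and every value in the sample record is made up.

```python
import csv
import io

# one hypothetical record in the assumed six-column layout
sample = ('10033\tR-HSA-0000000\thttps://example.org/R-HSA-0000000\t'
          'Example pathway\tTAS\tHomo sapiens\n')

records = []
for fields in csv.reader(io.StringIO(sample), delimiter='\t'):
    if fields[5] != 'Homo sapiens':      # keep human pathways only
        continue
    chebi = 'CHEBI_' + fields[0]         # add the ontology prefix
    pathway = 'reactome_' + fields[1]
    records.append({'edge_key': '{}-{}'.format(chebi, pathway),
                    'pathway_name': fields[3],
                    'evidence': fields[4]})
print(records)
```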

In [None]:
# download data
url = 'https://reactome.org/download/current/ChEBI2Reactome_All_Levels.txt'
if not os.path.exists(unprocessed_data_location + 'ChEBI2Reactome_All_Levels.txt'):
    data_downloader(url, unprocessed_data_location, 'ChEBI2Reactome_All_Levels.txt')

# load data
rtm_chem_path = pandas.read_csv(unprocessed_data_location + 'ChEBI2Reactome_All_Levels.txt', header=None, delimiter='\t', skiprows=0)
rtm_chem_path = rtm_chem_path[rtm_chem_path[5] == 'Homo sapiens']
rtm_chem_path.fillna('None', inplace=True)
# update prefix
rtm_chem_path[0] = 'CHEBI_' + rtm_chem_path[0].astype('str')
rtm_chem_path[1] = 'reactome_' + rtm_chem_path[1]

# visualize data
rtm_chem_path.head(n=3)

*Create Metadata Dictionary*

In [None]:
# master_metadata_dictionary['edges'].update({'chemical-pathway': {}})

# create dictionary
for idx, row in tqdm(rtm_chem_path.iterrows(), total=rtm_chem_path.shape[0]):
    chebi = row[0].rstrip(); node_key = row[1]; path_name = row[3]
    evidence = [{'Reactome_EvidenceID': row[4]}]
    edge_key = '{}-{}'.format(chebi, node_key); edge_type = 'chemical-pathway'   
    
    # add reactome information 
    if node_key in master_metadata_dictionary['nodes'].keys():
        if url in master_metadata_dictionary['nodes'][node_key].keys():
            master_metadata_dictionary['nodes'][node_key][url]['Reactome_PathwayName'] |= {path_name}
        else:
            master_metadata_dictionary['nodes'][node_key].update({
                url: {'Reactome_PathwayName': {path_name}}})
    else:
        master_metadata_dictionary['nodes'].update({node_key: {
            url: {'Reactome_PathwayName': {path_name}}}})
    
    # add relation data to dictionary
    if edge_key in master_metadata_dictionary['edges'].keys():
        if url in master_metadata_dictionary['edges'][edge_key].keys():
            if 'Reactome_Evidence' in master_metadata_dictionary['edges'][edge_key][url].keys():
                inital_ev = master_metadata_dictionary['edges'][edge_key][url]
                inital_ev = inital_ev['Reactome_Evidence'] + evidence
                ev = [json.loads(i) for i in set(json.dumps(item, sort_keys=True) for item in inital_ev)]
                master_metadata_dictionary['edges'][edge_key][url]['Reactome_Evidence'] = ev
            else: master_metadata_dictionary['edges'][edge_key][url].update({'Reactome_Evidence': evidence})
        else: master_metadata_dictionary['edges'][edge_key].update({url: {'Reactome_Evidence': evidence, 'Type': edge_type}})
    else: master_metadata_dictionary['edges'].update({edge_key: {url: {'Reactome_Evidence': evidence, 'Type': edge_type}}})

# delete unneeded data
del rtm_chem_path
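
The nested membership checks in the loop above follow one pattern: create the node entry if needed, create the per-source entry if needed, then add new values to a set. A compact way to express the same pattern with `dict.setdefault` is sketched below. The helper `add_node_metadata`, the placeholder URL, and the sample values are hypothetical.

```python
def add_node_metadata(nodes, node_key, source_url, **fields):
    """Add each field value to a set stored under nodes[node_key][source_url]."""
    source_entry = nodes.setdefault(node_key, {}).setdefault(source_url, {})
    for name, value in fields.items():
        source_entry.setdefault(name, set()).add(value)
    return nodes

nodes = {}
source = 'https://example.org/source.txt'  # placeholder for the data source URL
add_node_metadata(nodes, 'reactome_R-HSA-0000000', source,
                  Reactome_PathwayName='Example pathway')
add_node_metadata(nodes, 'reactome_R-HSA-0000000', source,
                  Reactome_PathwayName='Example pathway (alternate name)')
print(nodes)
```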

***

#### `goa_human.gaf` <a class="anchor" id="goa"></a>  

**Data Source Wiki Page:** [Gene Ontology](https://github.com/callahantiff/PheKnowLator/wiki/v4-Data-Sources#gene-ontology)   


**Edges:**  
- `protein-gobp`  
- `protein-gocc`  
- `protein-gomf`  

**Identifier Maps:**    
- Proteins: [UNIPROT_ACCESSION_PRO_ONTOLOGY_MAP.txt](https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/UNIPROT_ACCESSION_PRO_ONTOLOGY_MAP.txt)  

This chunk processes the [`goa_human.gaf`](http://current.geneontology.org/annotations/goa_human.gaf.gz) file and obtains the following node and edge metadata:  
- **Nodes:**  
  _GO Biological Process, Cellular Component, and Molecular Function_  
  - `DB_Object_Name`: A string containing the concept's synonym.    
  - `DB_Object_Synonym`: A string containing the concept's synonym.      
  - `DB_Object_Symbol`: A string containing the concept's database cross-reference, which is formatted as Prefix:ID. 
  - `With_Or_From`: A string containing the concept's database cross-reference, which is formatted as Prefix:ID.  
  - `DB_Object_Type`: A string indicating the type of object that has been annotated.  
  
  _Protein_    
  - `GenomicInformation`: A dictionary of protein identifier information. See the [Genomic Entity Metadata](#genomicinfo) code chunk for more details. 


- **Edges:**  
  - `Qualifier`: Some annotations are modified by qualifiers, which have specific usage rules and meanings within GO. 
  - `DB_Reference`: One or more unique identifiers for a single source cited as an authority for the attribution of the GO ID to the DB Object ID. This may be a literature reference or a database record. The syntax is DB:accession_number.    
  - `EvidenceCode`: Each annotation includes an evidence code to indicate how the annotation to a particular term is supported:
    - Inferred from Experiment (EXP)
    - Inferred from Direct Assay (IDA)
    - Inferred from Physical Interaction (IPI)
    - Inferred from Mutant Phenotype (IMP)
    - Inferred from Genetic Interaction (IGI)
    - Inferred from Expression Pattern (IEP)
    - Inferred from High Throughput Experiment (HTP)
    - Inferred from High Throughput Direct Assay (HDA)
    - Inferred from High Throughput Mutant Phenotype (HMP)
    - Inferred from High Throughput Genetic Interaction (HGI)
    - Inferred from High Throughput Expression Pattern (HEP)
    - Inferred from Biological aspect of Ancestor (IBA)
    - Inferred from Biological aspect of Descendant (IBD)
    - Inferred from Key Residues (IKR)
    - Inferred from Rapid Divergence (IRD)
    - Inferred from Sequence or structural Similarity (ISS)
    - Inferred from Sequence Orthology (ISO)
    - Inferred from Sequence Alignment (ISA)
    - Inferred from Sequence Model (ISM)
    - Inferred from Genomic Context (IGC)
    - Inferred from Reviewed Computational Analysis (RCA)
    - Traceable Author Statement (TAS)
    - Non-traceable Author Statement (NAS)
    - Inferred by Curator (IC)
    - No biological Data available (ND)
    - Inferred from Electronic Annotation (IEA)  
  - `AssignedBy`: A string indicating who assigned the association.  
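
The sketch below shows how the zero-based column positions used in the loading code map onto a single tab-separated annotation line, and how the aspect code selects the edge type. The helper `parse_gaf_line` and all values in the sample record are hypothetical. The handling of `NOT|`-prefixed qualifiers assumes the convention used in recent GAF versions.

```python
ASPECT_TO_EDGE = {'P': 'protein-gobp', 'C': 'protein-gocc', 'F': 'protein-gomf'}

def parse_gaf_line(line):
    """Return a small annotation dict, or None for rows that should be skipped."""
    if line.startswith('!'):  # header and comment lines begin with '!'
        return None
    fields = line.rstrip('\n').split('\t')
    qualifier, obj_type, taxon = fields[3], fields[11], fields[12]
    if obj_type != 'protein' or taxon != 'taxon:9606':
        return None
    if 'NOT' in qualifier.split('|'):  # negated annotations, e.g. 'NOT|enables'
        return None
    return {'uniprot': fields[1],
            'go': fields[4].replace(':', '_'),
            'edge_type': ASPECT_TO_EDGE.get(fields[8]),
            'evidence': {'GOA_Qualifier': qualifier, 'GOA_DB_Reference': fields[5],
                         'GOA_EvidenceCode': fields[6], 'GOA_AssignedBy': fields[14]}}

# hypothetical 17-column record
columns = ['UniProtKB', 'P00000', 'GENE1', 'enables', 'GO:0003674', 'PMID:0000000', 'IDA', '',
           'F', 'Example protein', 'EX1|EX2', 'protein', 'taxon:9606', '20200101', 'UniProt', '', '']
annotation = parse_gaf_line('\t'.join(columns))
negated = parse_gaf_line('\t'.join(columns[:3] + ['NOT|enables'] + columns[4:]))
print(annotation, negated)
```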

In [None]:
# download data
url = 'http://current.geneontology.org/annotations/goa_human.gaf.gz'
if not os.path.exists(unprocessed_data_location + 'goa_human.gaf'):
    data_downloader(url, unprocessed_data_location, 'goa_human.gaf')

# load data
goa_gene = pandas.read_csv(unprocessed_data_location + 'goa_human.gaf', header=None, delimiter='\t', skiprows=41, low_memory=False)
goa_gene = goa_gene[goa_gene[12] == 'taxon:9606']
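# remove negated annotations, including compound qualifiers such as 'NOT|enables'
goa_gene = goa_gene[~goa_gene[3].astype(str).str.startswith('NOT')]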
goa_gene = goa_gene[goa_gene[11] == 'protein']
# fix prefix
goa_gene[4] = goa_gene[4].str.replace(':', '_')
goa_gene.fillna('None', inplace=True)


*Merge Identifier Maps*

In [None]:
goa_gene = goa_gene.merge(uniprot_pro_map, left_on=1, right_on='Uniprot_Accession_IDs')

# visualize data
goa_gene.head(n=3)

*Create Metadata Dictionary*

In [None]:
# master_metadata_dictionary['edges'].update({'protein-gobp': {}, 'protein-gocc': {}, 'protein-gomf': {}})

# create dictionary
for idx, row in tqdm(goa_gene.iterrows(), total=goa_gene.shape[0]):
    pr = row['Protein_Ontology_IDs'].rstrip(); node_key = row[4]
    pr_db = row[9]; pr_syn = row[10]; pr_symb = row[2]; aspect = row[8]; db_with = row[7]
    evidence = [{'GOA_Qualifier': row[3], 'GOA_DB_Reference': row[5],
                 'GOA_EvidenceCode': row[6], 'GOA_AssignedBy': row[14]}]
    if aspect == 'P': edge_key = '{}-{}'.format(pr, node_key); edge_type = 'protein-gobp'
    if aspect == 'C': edge_key = '{}-{}'.format(pr, node_key); edge_type = 'protein-gocc'
    if aspect == 'F': edge_key = '{}-{}'.format(pr, node_key); edge_type = 'protein-gomf'    
    if pr in genomic_metadata.keys(): genomic_info_dict = genomic_metadata[pr]
    else: genomic_info_dict = None
    
    # add pr information
    if pr in master_metadata_dictionary['nodes'].keys():      
        if url in master_metadata_dictionary['nodes'][pr].keys():
            master_metadata_dictionary['nodes'][pr][url]['GOA_DB_Object_Name'] |= {pr_db}
            master_metadata_dictionary['nodes'][pr][url]['GOA_DB_Object_Synonym'] |= {pr_syn}
            master_metadata_dictionary['nodes'][pr][url]['GOA_DB_Object_Symbol'] |= {pr_symb}
            master_metadata_dictionary['nodes'][pr][url]['GOA_With_Or_From'] |= {db_with}
        else:
            master_metadata_dictionary['nodes'][pr].update({
                url: {'GOA_DB_Object_Name': {pr_db},
                      'GOA_DB_Object_Synonym': {pr_syn},
                      'GOA_DB_Object_Symbol': {pr_symb},
                      'GOA_With_Or_From': {db_with}}})
    else:
        master_metadata_dictionary['nodes'].update({pr: {
            url: {'GOA_DB_Object_Name': {pr_db},
                  'GOA_DB_Object_Synonym': {pr_syn},
                  'GOA_DB_Object_Symbol': {pr_symb},
                  'GOA_With_Or_From': {db_with}}}})
    
    # add genomic information to the protein node (genomic_metadata is looked up by protein identifier)
    if pr in master_metadata_dictionary['nodes'].keys():
        if genomic_info_dict is not None:
            master_metadata_dictionary['nodes'][pr].update({'genomic_data': genomic_info_dict})
        else: master_metadata_dictionary['nodes'][pr].update({'genomic_data': 'None'})
    else:
        if genomic_info_dict is not None:
            master_metadata_dictionary['nodes'].update({pr: {'genomic_data': genomic_info_dict}})
        else: master_metadata_dictionary['nodes'].update({pr: {'genomic_data': 'None'}})

    # add relation data to dictionary
    if edge_key in master_metadata_dictionary['edges'].keys():
        if url in master_metadata_dictionary['edges'][edge_key].keys():
            if 'GOA_Evidence' in master_metadata_dictionary['edges'][edge_key][url].keys():
                inital_ev = master_metadata_dictionary['edges'][edge_key][url]
                inital_ev = inital_ev['GOA_Evidence'] + evidence
                ev = [json.loads(i) for i in set(json.dumps(item, sort_keys=True) for item in inital_ev)]
                master_metadata_dictionary['edges'][edge_key][url]['GOA_Evidence'] = ev
            else: master_metadata_dictionary['edges'][edge_key][url].update({'GOA_Evidence': evidence})
        else: master_metadata_dictionary['edges'][edge_key].update({url: {'GOA_Evidence': evidence, 'Type': edge_type}})
    else: master_metadata_dictionary['edges'].update({edge_key: {url: {'GOA_Evidence': evidence, 'Type': edge_type}}})

# delete unneeded data
del goa_gene

***

#### `COMBINED.DEFAULT_NETWORKS.BP_COMBINING.txt` <a class="anchor" id="gene-gene"></a>  

**Data Source Wiki Page:** [GeneMANIA](https://github.com/callahantiff/PheKnowLator/wiki/v4-Data-Sources#genemania)      

**Edges:**  
- `gene-gene`  

**Identifier Maps:**    
- Genes: [UNIPROT_ACCESSION_ENTREZ_GENE_MAP.txt](https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/UNIPROT_ACCESSION_ENTREZ_GENE_MAP.txt) 

This chunk processes the [`COMBINED.DEFAULT_NETWORKS.BP_COMBINING.txt`](http://genemania.org/data/current/Homo_sapiens.COMBINED/COMBINED.DEFAULT_NETWORKS.BP_COMBINING.txt) file and obtains the following node and edge metadata:    
- **Nodes:**  
  _Genes_    
  - `GenomicInformation`: A dictionary of gene identifier information. See the [Genomic Entity Metadata](#genomicinfo) code chunk for more details. 
- **Edges:**  
  - `Weight`: Assumes the input gene list is related through GO biological processes. The score will vary depending on the type of network, but in general is a number ranging from zero (no interaction) to 1 (strong interaction). See `PMID:25254104` for more detail.   
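
The identifier mapping for this source is applied twice, once per gene column. The sketch below mimics that with a plain dictionary lookup on a tiny hypothetical excerpt. It assumes a one-to-one identifier map, and it drops rows with an unmapped gene, as an inner merge would. All identifiers and weights are made up.

```python
import csv
import io

# hypothetical excerpt with the three columns used by the notebook
network = ('Gene_A\tGene_B\tWeight\n'
           'P00001\tP00002\t0.0031\n'
           'P00001\tP99999\t0.0007\n')
# hypothetical identifier map (UniProt accession to Entrez gene)
uniprot_to_entrez = {'P00001': '1111', 'P00002': '2222'}

edges = {}
for row in csv.DictReader(io.StringIO(network), delimiter='\t'):
    gene_a = uniprot_to_entrez.get(row['Gene_A'])
    gene_b = uniprot_to_entrez.get(row['Gene_B'])
    if gene_a is None or gene_b is None:  # no mapping for one of the genes, so skip the row
        continue
    edges['{}-{}'.format(gene_a, gene_b)] = float(row['Weight'])
print(edges)
```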

In [None]:
# download data
url = 'http://genemania.org/data/current/Homo_sapiens.COMBINED/COMBINED.DEFAULT_NETWORKS.BP_COMBINING.txt'
if not os.path.exists(unprocessed_data_location + 'COMBINED.DEFAULT_NETWORKS.BP_COMBINING.txt'):
    data_downloader(url, unprocessed_data_location, 'COMBINED.DEFAULT_NETWORKS.BP_COMBINING.txt')

# load data
gm_gene_gene = pandas.read_csv(unprocessed_data_location + 'COMBINED.DEFAULT_NETWORKS.BP_COMBINING.txt', header=0, delimiter='\t', skiprows=0)


*Merge Identifier Maps*

In [None]:
gm_gene_gene = gm_gene_gene.merge(uniprot_entrez_data, left_on='Gene_A', right_on='Uniprot_Accession_IDs')
gm_gene_gene.rename(columns={'Entrez_Gene_IDs': 'Entrez_Gene_A'}, inplace=True)
gm_gene_gene = gm_gene_gene.merge(uniprot_entrez_data, left_on='Gene_B', right_on='Uniprot_Accession_IDs')
gm_gene_gene.rename(columns={'Entrez_Gene_IDs': 'Entrez_Gene_B'}, inplace=True)

# visualize data
gm_gene_gene.head(n=3)

*Create Metadata Dictionary*

In [None]:
# master_metadata_dictionary['edges'].update({'gene-gene': {}})

# create dictionary
for idx, row in tqdm(gm_gene_gene.iterrows(), total=gm_gene_gene.shape[0]):
    genes = [row['Entrez_Gene_A'], row['Entrez_Gene_B']]; weight = row['Weight']; gene_info = []
    edge_key = '{}-{}'.format(row['Entrez_Gene_A'], row['Entrez_Gene_B']); edge_type = 'gene-gene' 
    
    for node_key in genes:
        if node_key in genomic_metadata.keys(): genomic_info_dict = genomic_metadata[node_key]
        else: genomic_info_dict = None
        
        # add genomic information
        if node_key in master_metadata_dictionary['nodes'].keys():
            if genomic_info_dict is not None:
                master_metadata_dictionary['nodes'][node_key].update({'genomic_data': genomic_info_dict})
            else: master_metadata_dictionary['nodes'][node_key].update({'genomic_data': 'None'})
        else:
            if genomic_info_dict is not None:
                master_metadata_dictionary['nodes'].update({node_key: {'genomic_data': genomic_info_dict}})
            else: master_metadata_dictionary['nodes'].update({node_key: {'genomic_data': 'None'}}) 
    
    # add relation data to dictionary
    if edge_key in master_metadata_dictionary['edges'].keys():
        if url in master_metadata_dictionary['edges'][edge_key].keys():
            if 'GeneMania_Evidence' in master_metadata_dictionary['edges'][edge_key][url].keys():
                master_metadata_dictionary['edges'][edge_key][url]['GeneMania_Evidence'] = weight
            else: master_metadata_dictionary['edges'][edge_key][url].update({'GeneMania_Evidence': weight})
        else: master_metadata_dictionary['edges'][edge_key].update({url: {'GeneMania_Evidence': weight, 'Type': edge_type}})
    else: master_metadata_dictionary['edges'].update({edge_key: {url: {'GeneMania_Evidence': weight, 'Type': edge_type}}})

# delete unneeded data
del gm_gene_gene

***

#### `phenotype.hpoa` <a class="anchor" id="phenotype-disease"></a>  

**Data Source Wiki Page:** [Human Phenotype Ontology](https://github.com/callahantiff/PheKnowLator/wiki/v4-Data-Sources#human-phenotype-ontology)      

**Edges:**  
- `disease-phenotype`  

**Identifier Maps:**    
- Diseases: [DISEASE_MONDO_MAP.txt](https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/DISEASE_MONDO_MAP.txt) 

This chunk processes the [`phenotype.hpoa`](http://purl.obolibrary.org/obo/hp/hpoa/phenotype.hpoa) file and obtains the following node and edge metadata:  
- **Nodes:**  
  _Disease, Phenotype_  
  - `DiseaseName`: A string containing the concept's synonym.  


- **Edges:**  
  - `Reference`: This required field indicates the source of the information used for the annotation. This may be the clinical experience of the annotator or may be taken from an article as indicated by a PubMed id. Each collaborating center of the Human Phenotype Ontology consortium is assigned a HPO:Ref id. In addition, if appropriate, a PubMed id for an article describing the clinical abnormality may be used.    
  - `Evidence`: This required field indicates the level of evidence supporting the annotation. Annotations that have been extracted by parsing the Clinical Features sections of the omim.txt file are assigned the evidence code IEA. PCS (published clinical study) should be used for information extracted from articles in the medical literature. ICE can be used for annotations based on individual clinical experience, which may be appropriate for disorders with a limited amount of published data; it must be accompanied by an entry in the DB:Reference field denoting the individual or center performing the annotation together with an identifier. For instance, GH:007 might be used to refer to the seventh such annotation made by a specialist from Gotham Hospital (assuming the prefix GH has been registered with the HPO). Finally, TAS stands for "traceable author statement", usually reviews or disease entries (e.g., OMIM) that only refer to the original publication.  
  - `Frequency`: There are three allowed options for this field.
      1. A term-id from the HPO-sub-ontology below the term Frequency.
      2. A count of patients affected within a cohort. For instance, 7/13 would indicate that 7 of the 13 patients with the specified disease were found to have the phenotypic abnormality referred to by the HPO term in question in the study referred to by the DB_Reference.
      3. A percentage value such as 17%, again referring to the percentage of patients found to have the phenotypic abnormality referred to by the HPO term in question in the study referred to by the DB_Reference. If the exact data is available, the 7/13 format is preferred over the percentage format.  
  - `Sex`: This field contains the strings MALE or FEMALE if the annotation in question is limited to males or females. This field refers to the phenotypic (and not the chromosomal) sex, and does not intend to capture the further complexities of sex determination. If a phenotype is limited to one or the other sex, then the corresponding term from the Clinical modifier subontology should also be used in the Modifier field.
  - `Modifier`: A term from the Clinical modifier subontology.  
  - `Aspect`: One of P (Phenotypic abnormality), I (inheritance), C (onset and clinical course). This field is mandatory; cardinality 1.  
  - `Biocuration`: This mandatory field refers to the center or user making the annotation and the date on which the annotation was made; the date format is YYYY-MM-DD. Multiple entries can be separated by a semicolon if an annotation was revised, e.g., HPO:skoehler[2010-04-21];HPO:lcarmody[2019-06-02]. 
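
The `Frequency` and `Biocuration` formats described above can be parsed with a few regular expressions. The following is a minimal sketch, not part of the PheKnowLator code base: the function names `parse_hpo_frequency` and `parse_biocuration` are hypothetical, and the sketch assumes HPO term identifiers keep their `HP:` prefix followed by seven digits in the `Frequency` column.

```python
import re


def parse_hpo_frequency(value):
    """Classify a Frequency value as an HPO term, a patient count, or a percentage."""
    value = value.strip()
    # Option 1: a term from the Frequency sub-ontology (assumed form: HP:0000000).
    if re.fullmatch(r'HP:\d{7}', value):
        return {'kind': 'hpo_term', 'term': value}
    # Option 2: a count of affected patients within a cohort, e.g. 7/13.
    count = re.fullmatch(r'(\d+)/(\d+)', value)
    if count:
        return {'kind': 'count', 'affected': int(count.group(1)), 'total': int(count.group(2))}
    # Option 3: a percentage such as 17%.
    percent = re.fullmatch(r'(\d+(?:\.\d+)?)%', value)
    if percent:
        return {'kind': 'percent', 'value': float(percent.group(1))}
    # Anything else (including a missing-value marker) is left unclassified.
    return {'kind': 'unknown', 'raw': value}


def parse_biocuration(value):
    """Split a Biocuration string into (curator, date) pairs."""
    return re.findall(r'([^;\[\]]+)\[(\d{4}-\d{2}-\d{2})\]', value)


print(parse_hpo_frequency('7/13'))
print(parse_hpo_frequency('17%'))
print(parse_biocuration('HPO:skoehler[2010-04-21];HPO:lcarmody[2019-06-02]'))
```

Values that match none of the three forms are returned with `kind` set to `unknown` rather than raising, so a whole column can be classified in one pass.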

In [None]:
# download data
url = 'http://purl.obolibrary.org/obo/hp/hpoa/phenotype.hpoa'
if not os.path.exists(unprocessed_data_location + 'phenotype.hpoa'):
    data_downloader(url, unprocessed_data_location, 'phenotype.hpoa')

# load data
hpo_dis_phe = pandas.read_csv(unprocessed_data_location + 'phenotype.hpoa', header=0, delimiter='\t', skiprows=4, low_memory=False)
hpo_dis_phe = hpo_dis_phe[hpo_dis_phe['Qualifier'] != 'NOT'].copy()
hpo_dis_phe.fillna('None', inplace=True)

# fix prefix
hpo_dis_phe['HPO_ID'] = hpo_dis_phe['HPO_ID'].str.replace(':', '_')

 *Merge Identifier Maps*
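
`DataFrame.merge` defaults to an inner join, so annotations whose identifier is missing from the map are dropped without notice. The sketch below shows one way to count those rows before discarding them. It uses toy stand-ins for `hpo_dis_phe` and `disease_maps`; every identifier in it is made up for illustration.

```python
import pandas

# Toy stand-ins for the annotation table and the identifier map (made-up identifiers).
annotations = pandas.DataFrame({
    '#DatabaseID': ['OMIM:100', 'OMIM:200', 'ORPHA:300'],
    'HPO_ID': ['HP_0000001', 'HP_0000002', 'HP_0000003'],
})
id_map = pandas.DataFrame({
    'Disease_IDs': ['OMIM:100', 'ORPHA:300'],
    'MONDO_IDs': ['MONDO_0000001', 'MONDO_0000003'],
})

# A left join with indicator=True keeps unmapped rows and labels them 'left_only'.
checked = annotations.merge(id_map, how='left', left_on='#DatabaseID',
                            right_on='Disease_IDs', indicator=True)
unmapped = checked[checked['_merge'] == 'left_only']
print('{} of {} rows have no MONDO mapping'.format(len(unmapped), len(checked)))

# Keeping only the 'both' rows reproduces the result of the default inner join.
mapped = checked[checked['_merge'] == 'both'].drop(columns='_merge')
```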

In [None]:
hpo_dis_phe = hpo_dis_phe.merge(disease_maps, left_on='#DatabaseID', right_on='Disease_IDs')

# visualize data
hpo_dis_phe.head(n=3)

*Create Metadata Dictionary*
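
The metadata cells in this notebook merge evidence lists by serializing each dictionary to JSON with sorted keys, collecting the strings in a set, and parsing them back. The sketch below isolates that idiom. The helper name `merge_evidence` is hypothetical, and, unlike the notebook cells, it sorts the serialized strings so that the output order is reproducible.

```python
import json


def merge_evidence(existing, new):
    """Concatenate two lists of evidence dicts and drop exact duplicates."""
    # sort_keys=True makes dicts with the same content serialize identically,
    # regardless of the order in which their keys were inserted.
    serialized = {json.dumps(item, sort_keys=True) for item in existing + new}
    # Sorting is an addition for determinism; a bare set has arbitrary order.
    return [json.loads(item) for item in sorted(serialized)]


first = [{'HPO_EvidenceCode': 'TAS', 'HPO_Sex': 'None'}]
second = [{'HPO_Sex': 'None', 'HPO_EvidenceCode': 'TAS'},
          {'HPO_EvidenceCode': 'IEA', 'HPO_Sex': 'None'}]
merged = merge_evidence(first, second)
print(len(merged))
```

The approach only works when every value in the evidence dictionaries can be serialized by `json.dumps`.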

In [None]:
# master_metadata_dictionary['edges'].update({'disease-phenotype': {}})

# create dictionary
for idx, row in tqdm(hpo_dis_phe.iterrows(), total=hpo_dis_phe.shape[0]):
    concepts = [row['MONDO_IDs'], row['HPO_ID']]
    disease_names = {row['MONDO_IDs']: {row['DiseaseName']}, row['HPO_ID']: {'None'}}
    edge_key = '{}-{}'.format(row['MONDO_IDs'], row['HPO_ID']); edge_type = 'disease-phenotype'
    evidence = [{'HPO_Reference': row['Reference'],
                 'HPO_EvidenceCode': row['Evidence'],
                 'HPO_Frequency': row['Frequency'],
                 'HPO_Sex': row['Sex'],
                 'HPO_Modifier': row['Modifier'],
                 'HPO_Aspect': row['Aspect'],
                 'HPO_Biocuration': row['Biocuration']}]
    for node_key in concepts:
        if node_key in master_metadata_dictionary['nodes'].keys():
            if url in master_metadata_dictionary['nodes'][node_key].keys():
                master_metadata_dictionary['nodes'][node_key][url]['HPO_DiseaseName'] |= disease_names[node_key]
            else: master_metadata_dictionary['nodes'][node_key].update({url: {'HPO_DiseaseName': disease_names[node_key]}})
        else: master_metadata_dictionary['nodes'].update({node_key: {url: {'HPO_DiseaseName': disease_names[node_key]}}})
    
    # add relation data to dictionary
    if edge_key in master_metadata_dictionary['edges'].keys():
        if url in master_metadata_dictionary['edges'][edge_key].keys():
            if 'HPO_Evidence' in master_metadata_dictionary['edges'][edge_key][url].keys():
                inital_ev = master_metadata_dictionary['edges'][edge_key][url]
                inital_ev = inital_ev['HPO_Evidence'] + evidence
                ev = [json.loads(i) for i in set(json.dumps(item, sort_keys=True) for item in inital_ev)]
                master_metadata_dictionary['edges'][edge_key][url]['HPO_Evidence'] = ev
            else: master_metadata_dictionary['edges'][edge_key][url].update({'HPO_Evidence': evidence})
        else: master_metadata_dictionary['edges'][edge_key].update({url: {'HPO_Evidence': evidence, 'Type': edge_type}})
    else: master_metadata_dictionary['edges'].update({edge_key: {url: {'HPO_Evidence': evidence, 'Type': edge_type}}})

# delete unneeded data
del hpo_dis_phe

***

#### `gene_association.reactome` <a class="anchor" id="reactome-go"></a>  

**Data Source Wiki Page:** [Reactome Pathway Database](https://github.com/callahantiff/PheKnowLator/wiki/v4-Data-Sources#reactome-pathway-database)    

**Edges:**  
- `gobp-pathway`     
- `pathway-gocc`   
- `pathway-gomf`  

This chunk processes the [`gene_association.reactome.tsv`](https://reactome.org/download/current/gene_association.reactome.gz) file and obtains the following node and edge metadata:  
- **Nodes:**  
  _Pathway_  
  - `DBReference`: A string containing the concept's database cross-reference, which is formatted as Prefix:ID.    
  
  _Gene Ontology Biological Process, Cellular Component, Molecular Function_
  - `Aspect`: A variable indicating what GO subontology is being used.    


- **Edges:**  
  - `EvidenceCode`: Each annotation includes an evidence code to indicate how the annotation to a particular term is supported:
    - Inferred from Experiment (EXP)
    - Inferred from Direct Assay (IDA)
    - Inferred from Physical Interaction (IPI)
    - Inferred from Mutant Phenotype (IMP)
    - Inferred from Genetic Interaction (IGI)
    - Inferred from Expression Pattern (IEP)
    - Inferred from High Throughput Experiment (HTP)
    - Inferred from High Throughput Direct Assay (HDA)
    - Inferred from High Throughput Mutant Phenotype (HMP)
    - Inferred from High Throughput Genetic Interaction (HGI)
    - Inferred from High Throughput Expression Pattern (HEP)
    - Inferred from Biological aspect of Ancestor (IBA)
    - Inferred from Biological aspect of Descendant (IBD)
    - Inferred from Key Residues (IKR)
    - Inferred from Rapid Divergence (IRD)
    - Inferred from Sequence or structural Similarity (ISS)
    - Inferred from Sequence Orthology (ISO)
    - Inferred from Sequence Alignment (ISA)
    - Inferred from Sequence Model (ISM)
    - Inferred from Genomic Context (IGC)
    - Inferred from Reviewed Computational Analysis (RCA)
    - Traceable Author Statement (TAS)
    - Non-traceable Author Statement (NAS)
    - Inferred by Curator (IC)
    - No biological Data available (ND)
    - Inferred from Electronic Annotation (IEA)   
  - `AssignedBy`: A string indicating who assigned the association.   
  - `Qualifier`: Some annotations are modified by qualifiers, which have specific usage rules and meanings within GO. 
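
When filtering edges by evidence strength it can help to collapse the codes listed above into broader categories. The lookup below is a sketch that follows the categories used in the Gene Ontology evidence code guide as I understand them; the group labels and the `evidence_group` helper are my own names, not part of PheKnowLator.

```python
# Evidence codes grouped by category; the group labels are illustrative.
GO_EVIDENCE_GROUPS = {
    'experimental': ['EXP', 'IDA', 'IPI', 'IMP', 'IGI', 'IEP'],
    'high_throughput': ['HTP', 'HDA', 'HMP', 'HGI', 'HEP'],
    'phylogenetic': ['IBA', 'IBD', 'IKR', 'IRD'],
    'computational': ['ISS', 'ISO', 'ISA', 'ISM', 'IGC', 'RCA'],
    'author_statement': ['TAS', 'NAS'],
    'curator_statement': ['IC', 'ND'],
    'electronic': ['IEA'],
}

# Invert the mapping so that each code points at its group.
CODE_TO_GROUP = {code: group
                 for group, codes in GO_EVIDENCE_GROUPS.items()
                 for code in codes}


def evidence_group(code):
    """Return the category for an evidence code, or 'unrecognized'."""
    return CODE_TO_GROUP.get(code, 'unrecognized')


print(evidence_group('IDA'))
print(evidence_group('IEA'))
```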

In [None]:
# download data
url = 'https://reactome.org/download/current/gene_association.reactome.gz'
if not os.path.exists(unprocessed_data_location + 'gene_association.reactome'):
    data_downloader(url, unprocessed_data_location, 'gene_association.reactome')

# load data
rce_go_ptw = pandas.read_csv(unprocessed_data_location + 'gene_association.reactome', header=None, delimiter='\t', skiprows=4)
rce_go_ptw.fillna('None', inplace=True)
rce_go_ptw = rce_go_ptw[rce_go_ptw[12] == 'taxon:9606']
rce_go_ptw = rce_go_ptw[[x.startswith('REACTOME') for x in rce_go_ptw[5]]].copy()

# fix variable prefixing
rce_go_ptw[4] = rce_go_ptw[4].str.replace(':', '_')
rce_go_ptw[5] = rce_go_ptw[5].str.replace('REACTOME:', 'reactome_')

# visualize data
rce_go_ptw.head(n=3)

*Create Metadata Dictionary*
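
The metadata cells in this notebook update nested dictionaries with explicit `if`/`else` chains. As a sketch of an alternative, `dict.setdefault` can express the same "create the level if it is missing, then add" logic in two lines. The helper `add_node_value` and the placeholder source URL are hypothetical. Unlike a bare `|=` on an existing entry, this version also handles a source that is already present but lacks the requested field.

```python
def add_node_value(metadata, node_key, source, field, value):
    """Add value to the set stored at metadata['nodes'][node_key][source][field]."""
    # Each setdefault call returns the existing level or creates an empty one.
    source_dict = metadata['nodes'].setdefault(node_key, {}).setdefault(source, {})
    source_dict.setdefault(field, set()).add(value)


metadata = {'nodes': {}, 'edges': {}}
src = 'https://example.org/source'  # placeholder, not a real data source

add_node_value(metadata, 'GO_0000001', src, 'Reactome_Aspect', 'P')
add_node_value(metadata, 'GO_0000001', src, 'Reactome_Aspect', 'P')  # duplicate is ignored
add_node_value(metadata, 'GO_0000001', src, 'Reactome_DBReference', 'UniProtKB')
print(metadata['nodes'])
```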

In [None]:
# master_metadata_dictionary['edges'].update({'gobp-pathway': {}, 'pathway-gocc': {}, 'pathway-gomf': {}})

# create dictionary
for idx, row in tqdm(rce_go_ptw.iterrows(), total=rce_go_ptw.shape[0]):
    react = row[5].rstrip(); node_key = row[4]; pathway_db = row[0]; aspect = row[8]
    evidence = [{'Reactome_EvidenceCode': row[6],
                 'Reactome_AssignedBy': row[14],
                 'Reactome_Qualifier': row[3]}]
    # specify edge type, which is related to the ontology aspect
    if aspect == 'P': edge_key = '{}-{}'.format(node_key, react); edge_type = 'gobp-pathway'
    elif aspect == 'C': edge_key = '{}-{}'.format(react, node_key); edge_type = 'pathway-gocc'
    elif aspect == 'F': edge_key = '{}-{}'.format(react, node_key); edge_type = 'pathway-gomf'
    else: continue  # skip rows with an unrecognized aspect instead of reusing the previous row's edge_key
    
    # add reactome metadata
    if react in master_metadata_dictionary['nodes'].keys():
        if url in master_metadata_dictionary['nodes'][react].keys():
            master_metadata_dictionary['nodes'][react][url]['Reactome_DBReference'] |= {pathway_db}
        else: master_metadata_dictionary['nodes'][react].update({url: {'Reactome_DBReference': {pathway_db}}})
    else: master_metadata_dictionary['nodes'].update({react: {url: {'Reactome_DBReference': {pathway_db}}}})
    
    # add go information
    if node_key in master_metadata_dictionary['nodes'].keys():
        if url in master_metadata_dictionary['nodes'][node_key].keys():
            master_metadata_dictionary['nodes'][node_key][url]['Reactome_Aspect'] |= {aspect}
        else: master_metadata_dictionary['nodes'][node_key].update({url: {'Reactome_Aspect': {aspect}}})
    else: master_metadata_dictionary['nodes'].update({node_key: {url: {'Reactome_Aspect': {aspect}}}})
    
    # add relation data to dictionary
    if edge_key in master_metadata_dictionary['edges'].keys():
        if url in master_metadata_dictionary['edges'][edge_key].keys():
            if 'Reactome_Evidence' in master_metadata_dictionary['edges'][edge_key][url].keys():
                inital_ev = master_metadata_dictionary['edges'][edge_key][url]
                inital_ev = inital_ev['Reactome_Evidence'] + evidence
                ev = [json.loads(i) for i in set(json.dumps(item, sort_keys=True) for item in inital_ev)]
                master_metadata_dictionary['edges'][edge_key][url]['Reactome_Evidence'] = ev
            else: master_metadata_dictionary['edges'][edge_key][url].update({'Reactome_Evidence': evidence})
        else: master_metadata_dictionary['edges'][edge_key].update({url: {'Reactome_Evidence': evidence, 'Type': edge_type}})
    else: master_metadata_dictionary['edges'].update({edge_key: {url: {'Reactome_Evidence': evidence, 'Type': edge_type}}})

# delete unneeded data
del rce_go_ptw

***

#### `UniProt2Reactome_All_Levels.txt` <a class="anchor" id="uniprot-react"></a>  

**Data Source Wiki Page:** [Reactome Pathway Database](https://github.com/callahantiff/PheKnowLator/wiki/v4-Data-Sources#reactome-pathway-database)    

**Edges:**  
- `protein-pathway`   

**Identifier Maps:**    
- Proteins: [UNIPROT_ACCESSION_PRO_ONTOLOGY_MAP.txt](https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/UNIPROT_ACCESSION_PRO_ONTOLOGY_MAP.txt)   

This chunk processes the [`UniProt2Reactome_All_Levels.txt`](https://reactome.org/download/current/UniProt2Reactome_All_Levels.txt) file and obtains the following node and edge metadata:  
- **Nodes:**   
  _Pathway_  
  - `PathwayName`: A string containing the concept's label.   
    
  _Protein_    
  - `GenomicInformation`: A dictionary of protein identifier information. See the [Genomic Entity Metadata](#genomicinfo) code chunk for more details. 


- **Edges:**  
  - `EvidenceID`: Each annotation includes an evidence code to indicate how the annotation to a particular term is supported:
    - Inferred from Experiment (EXP)
    - Inferred from Direct Assay (IDA)
    - Inferred from Physical Interaction (IPI)
    - Inferred from Mutant Phenotype (IMP)
    - Inferred from Genetic Interaction (IGI)
    - Inferred from Expression Pattern (IEP)
    - Inferred from High Throughput Experiment (HTP)
    - Inferred from High Throughput Direct Assay (HDA)
    - Inferred from High Throughput Mutant Phenotype (HMP)
    - Inferred from High Throughput Genetic Interaction (HGI)
    - Inferred from High Throughput Expression Pattern (HEP)
    - Inferred from Biological aspect of Ancestor (IBA)
    - Inferred from Biological aspect of Descendant (IBD)
    - Inferred from Key Residues (IKR)
    - Inferred from Rapid Divergence (IRD)
    - Inferred from Sequence or structural Similarity (ISS)
    - Inferred from Sequence Orthology (ISO)
    - Inferred from Sequence Alignment (ISA)
    - Inferred from Sequence Model (ISM)
    - Inferred from Genomic Context (IGC)
    - Inferred from Reviewed Computational Analysis (RCA)
    - Traceable Author Statement (TAS)
    - Non-traceable Author Statement (NAS)
    - Inferred by Curator (IC)
    - No biological Data available (ND)
    - Inferred from Electronic Annotation (IEA)    
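
The file has no header row, so the notebook addresses its columns by position. The sketch below loads two made-up rows with named columns, which can make the later steps easier to read. The column names are my own labels, inferred from how the notebook uses positions 0, 1, 3, 4, and 5; treating position 2 as a pathway URL is an assumption.

```python
import io

import pandas

# Hypothetical labels for the six positional columns.
columns = ['uniprot_accession', 'reactome_id', 'url',
           'pathway_name', 'evidence_code', 'species']

# Two made-up rows standing in for the downloaded file.
sample = io.StringIO(
    'P00001\tR-HSA-0000001\thttps://example.org/R-HSA-0000001\tExample pathway\tTAS\tHomo sapiens\n'
    'P00002\tR-MMU-0000002\thttps://example.org/R-MMU-0000002\tExample pathway\tIEA\tMus musculus\n'
)
pathways = pandas.read_csv(sample, sep='\t', header=None, names=columns)

# Copy the filtered frame so the prefix assignment acts on an independent object.
human = pathways[pathways['species'] == 'Homo sapiens'].copy()
human['reactome_id'] = 'reactome_' + human['reactome_id']
print(human[['uniprot_accession', 'reactome_id', 'evidence_code']])
```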

In [None]:
# download data
url = 'https://reactome.org/download/current/UniProt2Reactome_All_Levels.txt'
if not os.path.exists(unprocessed_data_location + 'UniProt2Reactome_All_Levels.txt'):
    data_downloader(url, unprocessed_data_location, 'UniProt2Reactome_All_Levels.txt')

# load data
rce_prot_pth = pandas.read_csv(unprocessed_data_location + 'UniProt2Reactome_All_Levels.txt', header=None, delimiter='\t', skiprows=0)
rce_prot_pth = rce_prot_pth[rce_prot_pth[5] == 'Homo sapiens'].copy()

# fix prefixes
rce_prot_pth[1] = 'reactome_' + rce_prot_pth[1]

 *Merge Identifier Maps*

In [None]:
rce_prot_pth = rce_prot_pth.merge(uniprot_pro_map, left_on=0, right_on='Uniprot_Accession_IDs')

# visualize data
rce_prot_pth.head(n=3)

*Create Metadata Dictionary*

In [None]:
# master_metadata_dictionary['edges'].update({'protein-pathway': {}})

# create dictionary
for idx, row in tqdm(rce_prot_pth.iterrows(), total=rce_prot_pth.shape[0]):
    pr = row['Protein_Ontology_IDs'].rstrip(); node_key = row[1]; react_name = row[3]
    evidence = [{'Reactome_EvidenceID': row[4]}]
    edge_key = '{}-{}'.format(pr, node_key); edge_type = 'protein-pathway'
    if pr in genomic_metadata.keys(): genomic_info_dict = genomic_metadata[pr]
    else: genomic_info_dict = None
    
    # add reactome information
    if node_key in master_metadata_dictionary['nodes'].keys():
        if url in master_metadata_dictionary['nodes'][node_key].keys():
            master_metadata_dictionary['nodes'][node_key][url]['Reactome_PathwayName'] |= {react_name}
        else:
            master_metadata_dictionary['nodes'][node_key].update({url: {'Reactome_PathwayName': {react_name}}})
    else:
        master_metadata_dictionary['nodes'].update({node_key: {url: {'Reactome_PathwayName': {react_name}}}})
    
    # add genomic information
    if pr in master_metadata_dictionary['nodes'].keys():
        if genomic_info_dict is not None:
            master_metadata_dictionary['nodes'][pr].update({'genomic_data': genomic_info_dict})
        else: master_metadata_dictionary['nodes'][pr].update({'genomic_data': 'None'})
    else:
        if genomic_info_dict is not None:
            master_metadata_dictionary['nodes'].update({pr: {'genomic_data': genomic_info_dict}})
        else: master_metadata_dictionary['nodes'].update({pr: {'genomic_data': 'None'}})

    # add relation data to dictionary
    if edge_key in master_metadata_dictionary['edges'].keys():
        if url in master_metadata_dictionary['edges'][edge_key].keys():
            if 'Reactome_Evidence' in master_metadata_dictionary['edges'][edge_key][url].keys():
                inital_ev = master_metadata_dictionary['edges'][edge_key][url]
                inital_ev = inital_ev['Reactome_Evidence'] + evidence
                ev = [json.loads(i) for i in set(json.dumps(item, sort_keys=True) for item in inital_ev)]
                master_metadata_dictionary['edges'][edge_key][url]['Reactome_Evidence'] = ev
            else: master_metadata_dictionary['edges'][edge_key][url].update({'Reactome_Evidence': evidence})
        else: master_metadata_dictionary['edges'][edge_key].update({url: {'Reactome_Evidence': evidence, 'Type': edge_type}})
    else: master_metadata_dictionary['edges'].update({edge_key: {url: {'Reactome_Evidence': evidence, 'Type': edge_type}}})

# delete unneeded data
del rce_prot_pth

***

#### `CLINVAR_VARIANT_DISEASE_PHENOTYPE_EDGES.txt` <a class="anchor" id="variant-disease"></a>  

**Data Source Wiki Page:** [ClinVar](https://github.com/callahantiff/PheKnowLator/wiki/v4-Data-Sources#clinvar)  

**Edges:**  
- `variant-disease`  
- `variant-phenotype`  

**Identifier Maps:**    
- Diseases: [DISEASE_MONDO_MAP.txt](https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/DISEASE_MONDO_MAP.txt)
- Phenotypes: [PHENOTYPE_HPO_MAP.txt](https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/PHENOTYPE_HPO_MAP.txt)    

This chunk processes the [`CLINVAR_VARIANT_DISEASE_PHENOTYPE_EDGES.txt`](https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/CLINVAR_VARIANT_DISEASE_PHENOTYPE_EDGES.txt) file and obtains the following node and edge metadata:  
- **Nodes:**  
  _Disease, Phenotype_  
  - `Phenotype`: A string containing a disease identifier and prefix. Sources are OMIM, MedGen (UMLS), and Orphanet.  
  
  _Variant_  
  - `VariantName`: A string containing the name of the variant.  
  - `rs_id`: An integer that represents a dbSNP identifier.    
  - `AlleleID`: An integer that represents an Allele identifier.    
  - `RCVaccession`: An integer that represents an RCV accession identifier. 
  - `Type`: Character, the type of variant represented by the AlleleID.  
  - `Assembly`: A list of dictionaries, stored as a string, that contains information related to the assembly (i.e., Assembly, ChromosomeAccession, Chromosome, Start, Stop, ReferenceAllele, AlternateAllele, Cytogenetic, and PositionVCF).   


- **Edges:**  
  - `OtherIDs`: A "|"-delimited list of other identifiers associated with the variant edge. Note that each identifier included also includes a prefix.    
  - `GeneID`: An identifier for the gene associated with each variant (wherever possible).  
  - `Guidelines`: Character, ACMG only right now.  
  - `TestedInGTR`: Character, Y/N for Yes/No if there is a test registered as specific to this variant in the NIH Genetic Testing Registry (GTR).  
  - `LastEvaluated`: Date, the latest date any submitter reported clinical significance.  
  - `ReviewStatus`: Character, highest review status for reporting this measure.  
  - `ClinicalSignificance`: Character, comma-separated list of aggregate values of clinical significance calculated for this variant.  
  - `ClinSigSimple`: Integer.  
    - `0` = no current value of Likely pathogenic or Pathogenic.  
    - `1` = at least one current record submitted with an interpretation of Likely pathogenic or Pathogenic (independent of whether that record includes assertion criteria and evidence).  
    - `-1` = no values for clinical significance at all for this variant or set of variants; used for the "included" variants that are only in ClinVar because they are included in a haplotype or genotype with an interpretation.  
  - `Origin`: Character, list of all allelic origins for this variant.  
  - `OriginSimple`: Character, processed from Origin to make it easier to distinguish between germline and somatic.  
  - `SubmitterCategories`: Coded value to indicate whether data were submitted by another resource (1), any other type of source (2), both (3), or none (4).  
  - `NumberSubmitters`: Integer, number of submitters describing this variant.  
  - `Citation`: A "|"-delimited list of evidence supporting the variant association. Sources are either PubMed, PubMedCentral, or the NCBI Bookshelf.  
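
Several of the edge fields above are coded or delimited strings. The sketch below shows one way to decode `ClinSigSimple` and to split the "|"-delimited `OtherIDs` and `Citation` fields into lists. The helper names and the sample identifiers are hypothetical, and treating the string `'None'` as the missing-value marker is an assumption.

```python
# Short labels for the three ClinSigSimple codes described above.
CLIN_SIG_SIMPLE = {
    0: 'no current Likely pathogenic or Pathogenic value',
    1: 'at least one Likely pathogenic or Pathogenic record',
    -1: 'no clinical significance values',
}


def describe_clin_sig_simple(code):
    """Translate a ClinSigSimple code (int or numeric string) into a short label."""
    return CLIN_SIG_SIMPLE.get(int(code), 'unrecognized code')


def split_pipe_list(value, missing='None'):
    """Split a '|'-delimited field into a list; a missing value yields an empty list."""
    if value is None or value == missing or value == '':
        return []
    return [part.strip() for part in str(value).split('|') if part.strip()]


print(describe_clin_sig_simple(1))
print(split_pipe_list('PubMed:123|PubMed:456'))
```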

In [None]:
# download data
url = 'https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/CLINVAR_VARIANT_DISEASE_PHENOTYPE_EDGES.txt'
if not os.path.exists(unprocessed_data_location + 'CLINVAR_VARIANT_DISEASE_PHENOTYPE_EDGES.txt'):
    data_downloader(url, unprocessed_data_location, 'CLINVAR_VARIANT_DISEASE_PHENOTYPE_EDGES.txt')

# load data
clv_var_dis = pandas.read_csv(unprocessed_data_location + 'CLINVAR_VARIANT_DISEASE_PHENOTYPE_EDGES.txt', header=0, delimiter='\t', low_memory=False)

 *Merge Identifier Maps*

In [None]:
clv_var_dis = clv_var_dis.merge(disease_maps, left_on='Phenotype', right_on='Disease_IDs')
clv_var_dis = clv_var_dis.merge(phenotype_maps, left_on='Phenotype', right_on='Disease_IDs')

# visualize data
clv_var_dis.head(n=3)

*Create Metadata Dictionary*

In [None]:
# master_metadata_dictionary['edges'].update({'variant-disease': {}, 'variant-phenotype': {}})

# create dictionary
for idx, row in tqdm(clv_var_dis.iterrows(), total=clv_var_dis.shape[0]):
    node_key = row['VariationID']; rcv = row['RCVaccession']; var_type = row['Type']
    pheno = row['Phenotype']; rs_id = row['RS# (dbSNP)']; allele_id = row['AlleleID']
    assembly = row['Assembly']; var_name = row['VariantName']
    evidence = [{'ClinVar_GeneID': row['GeneID'],
                 'ClinVar_OtherIDs': row['OtherIDs'],
                 'ClinVar_Guidelines': row['Guidelines'],
                 'ClinVar_TestedInGTR': row['TestedInGTR'],
                 'ClinVar_LastEvaluated': row['LastEvaluated'],
                 'ClinVar_ReviewStatus': row['ReviewStatus'],
                 'ClinVar_ClinicalSignificance': row['ClinicalSignificance'],
                 'ClinVar_ClinSigSimple': row['ClinSigSimple'],
                 'ClinVar_Origin': row['Origin'],
                 'ClinVar_OriginSimple': row['OriginSimple'],
                 'ClinVar_SubmitterCategories': row['SubmitterCategories'],
                 'ClinVar_NumberSubmitters': row['NumberSubmitters'],
                 'ClinVar_Citation': row['Citation']}]   
    for idx in [row['MONDO_IDs'].rstrip(), row['HP_IDs'].rstrip()]:
        if idx.startswith('MONDO'): edge_key = '{}-{}'.format(node_key, idx); edge_type = 'variant-disease'
        else: edge_key = '{}-{}'.format(node_key, idx); edge_type = 'variant-phenotype'
    
        # add disease/phenotype metadata
        if idx in master_metadata_dictionary['nodes'].keys():
            if url in master_metadata_dictionary['nodes'][idx].keys():
                master_metadata_dictionary['nodes'][idx][url]['ClinVar_Phenotype'] |= {pheno}
            else: master_metadata_dictionary['nodes'][idx].update({url: {'ClinVar_Phenotype': {pheno}}})
        else: master_metadata_dictionary['nodes'].update({idx: {url: {'ClinVar_Phenotype': {pheno}}}})

        # add variant information
        if node_key in master_metadata_dictionary['nodes'].keys():
            if url in master_metadata_dictionary['nodes'][node_key].keys():
                master_metadata_dictionary['nodes'][node_key][url]['ClinVar_VariantName'] |= {var_name}
                master_metadata_dictionary['nodes'][node_key][url]['ClinVar_rs_id'] |= {rs_id}
                master_metadata_dictionary['nodes'][node_key][url]['ClinVar_AlleleID'] |= {allele_id}
                master_metadata_dictionary['nodes'][node_key][url]['ClinVar_RCVaccession'] |= {rcv}
                master_metadata_dictionary['nodes'][node_key][url]['ClinVar_Type'] |= {var_type}
                master_metadata_dictionary['nodes'][node_key][url]['ClinVar_Assembly'] |= {assembly}
            else:
                master_metadata_dictionary['nodes'][node_key].update({
                    url: {'ClinVar_rs_id': {rs_id},
                          'ClinVar_VariantName': {var_name},
                          'ClinVar_AlleleID': {allele_id},
                          'ClinVar_RCVaccession': {rcv},
                          'ClinVar_Type': {var_type},
                          'ClinVar_Assembly': {assembly}
                         }})
        else:
            master_metadata_dictionary['nodes'].update({node_key: {
                url: {'ClinVar_rs_id': {rs_id},
                      'ClinVar_VariantName': {var_name},
                      'ClinVar_AlleleID': {allele_id},
                      'ClinVar_RCVaccession': {rcv},
                      'ClinVar_Type': {var_type},
                      'ClinVar_Assembly': {assembly}}}})

        # add relation data to dictionary
        if edge_key in master_metadata_dictionary['edges'].keys():
            if url in master_metadata_dictionary['edges'][edge_key].keys():
                if 'ClinVar_Evidence' in master_metadata_dictionary['edges'][edge_key][url].keys():
                    inital_ev = master_metadata_dictionary['edges'][edge_key][url]
                    inital_ev = inital_ev['ClinVar_Evidence'] + evidence
                    ev = [json.loads(i) for i in set(json.dumps(item, sort_keys=True) for item in inital_ev)]
                    master_metadata_dictionary['edges'][edge_key][url]['ClinVar_Evidence'] = ev
                else: master_metadata_dictionary['edges'][edge_key][url].update({'ClinVar_Evidence': evidence})
            else: master_metadata_dictionary['edges'][edge_key].update({url: {'ClinVar_Evidence': evidence, 'Type': edge_type}})
        else: master_metadata_dictionary['edges'].update({edge_key: {url: {'ClinVar_Evidence': evidence, 'Type': edge_type}}})

# delete unneeded data
del clv_var_dis

***

#### `CLINVAR_VARIANT_GENE_EDGES.txt` <a class="anchor" id="variant-gene"></a>  

**Data Source Wiki Page:** [ClinVar](https://github.com/callahantiff/PheKnowLator/wiki/v4-Data-Sources#clinvar)  

**Edges:**  
- `variant-gene`     

This chunk processes the [`CLINVAR_VARIANT_GENE_EDGES.txt`](https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/CLINVAR_VARIANT_GENE_EDGES.txt) file and obtains the following node and edge metadata:  
- **Nodes:**  
  _Variant_  
  - `VariantName`: A string containing the name of the variant.  
  - `rs_id`: An integer that represents a dbSNP identifier.    
  - `AlleleID`: An integer that represents an Allele identifier.    
  - `RCVaccession`: An integer that represents an RCV accession identifier. 
  - `Type`: Character, the type of variant represented by the AlleleID.  
  - `Assembly`: A list of dictionaries, stored as a string, that contains information related to the assembly (i.e., Assembly, ChromosomeAccession, Chromosome, Start, Stop, ReferenceAllele, AlternateAllele, Cytogenetic, and PositionVCF).  
  - `GenesPerAlleleID`: An integer that represents the count of genes that are found in the allele which corresponds to the variant.  
  - `Category`: The type of allele-gene relationship. The values for category are:
      - Asserted, but not computed: Submitted as related to a gene, but not within the location of that gene on the genome
      - Genes overlapped by variant: The gene and variant overlap
      - Near gene, downstream: Outside the location of the gene on the genome, within 5 kb
      - Near gene, upstream: Outside the location of the gene on the genome, within 5 kb
      - Within multiple genes by overlap: The variant is within genes that overlap on the genome. Includes introns
      - Within single gene: The variant is in only one gene. Includes introns
      
  _Gene_    
  - `GenomicInformation`: A dictionary of gene identifier information. See the [Genomic Entity Metadata](#genomicinfo) code chunk for more details. 


- **Edges:**  
  - `OtherIDs`: A "|"-delimited list of other identifiers associated with the variant edge. Note that each identifier included also includes a prefix.    
  - `Guidelines`: Character, ACMG only right now.  
  - `TestedInGTR`: Character, Y/N for Yes/No if there is a test registered as specific to this variant in the NIH Genetic Testing Registry (GTR).  
  - `LastEvaluated`: Date, the latest date any submitter reported clinical significance.  
  - `ReviewStatus`: Character, highest review status for reporting this measure.  
  - `ClinicalSignificance`: Character, comma-separated list of aggregate values of clinical significance calculated for this variant.  
  - `ClinSigSimple`: Integer.  
    - `0` = no current value of Likely pathogenic or Pathogenic.  
    - `1` = at least one current record submitted with an interpretation of Likely pathogenic or Pathogenic (independent of whether that record includes assertion criteria and evidence).  
    - `-1` = no values for clinical significance at all for this variant or set of variants; used for the "included" variants that are only in ClinVar because they are included in a haplotype or genotype with an interpretation.  
  - `Origin`: Character, list of all allelic origins for this variant.  
  - `OriginSimple`: Character, processed from Origin to make it easier to distinguish between germline and somatic.  
  - `SubmitterCategories`: Coded value to indicate whether data were submitted by another resource (1), any other type of source (2), both (3), or none (4).  
  - `NumberSubmitters`: Integer, number of submitters describing this variant.  
  - `Citation`: A "|"-delimited list of evidence supporting the variant association. Sources are either PubMed, PubMedCentral, or the NCBI Bookshelf.  
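
Because `Assembly` is a list of dictionaries stored as a string, it has to be parsed before individual keys can be read. The sketch below assumes the string uses Python literal syntax (the `repr` of a list of dictionaries), which this section does not confirm; if the file stores JSON instead, `json.loads` would be the equivalent call. The helper name `parse_assembly` and the sample record are made up.

```python
import ast


def parse_assembly(value):
    """Parse a stringified list of dicts; return [] when the value cannot be parsed."""
    try:
        # literal_eval only evaluates Python literals, so no arbitrary code is run.
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return []
    return parsed if isinstance(parsed, list) else []


# Made-up record using a few of the keys named in the field description.
sample = "[{'Assembly': 'GRCh38', 'Chromosome': '7', 'Start': 100, 'Stop': 100}]"
records = parse_assembly(sample)
print(records[0]['Assembly'], records[0]['Chromosome'])
```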

In [None]:
# download data
url = 'https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/CLINVAR_VARIANT_GENE_EDGES.txt'
if not os.path.exists(unprocessed_data_location + 'CLINVAR_VARIANT_GENE_EDGES.txt'):
    data_downloader(url, unprocessed_data_location, 'CLINVAR_VARIANT_GENE_EDGES.txt')

# load data
clv_var_gene = pandas.read_csv(unprocessed_data_location + 'CLINVAR_VARIANT_GENE_EDGES.txt', header=0, delimiter='\t', low_memory=False)

# visualize data
clv_var_gene.head(n=3)

*Create Metadata Dictionary*

In [None]:
# master_metadata_dictionary['edges'].update({'variant-gene': {}})

# create dictionary
for idx, row in tqdm(clv_var_gene.iterrows(), total=clv_var_gene.shape[0]):
    node_key = row['VariationID']; rcv = row['RCVaccession']; var_type = row['Type']
    gene = row['GeneID']; var_name = row['VariantName']
    rs_id = row['RS# (dbSNP)']; allele_id = row['AlleleID']
    assembly = row['Assembly']; gpa = row['GenesPerAlleleID']; category = row['Category']
    evidence = [{'ClinVar_OtherIDs': row['OtherIDs'],
                 'ClinVar_Guidelines': row['Guidelines'],
                 'ClinVar_TestedInGTR': row['TestedInGTR'],
                 'ClinVar_LastEvaluated': row['LastEvaluated'],
                 'ClinVar_ReviewStatus': row['ReviewStatus'],
                 'ClinVar_ClinicalSignificance': row['ClinicalSignificance'],
                 'ClinVar_ClinSigSimple': row['ClinSigSimple'],
                 'ClinVar_Origin': row['Origin'],
                 'ClinVar_OriginSimple': row['OriginSimple'],
                 'ClinVar_SubmitterCategories': row['SubmitterCategories'],
                 'ClinVar_NumberSubmitters': row['NumberSubmitters'],
                 'ClinVar_Citation': row['Citation']}]   
    edge_key = '{}-{}'.format(node_key, gene); edge_type = 'variant-gene'

    # add genomic information
    if gene in genomic_metadata.keys(): genomic_info_dict = genomic_metadata[gene]
    else: genomic_info_dict = None
    if gene in master_metadata_dictionary['nodes'].keys():
        if genomic_info_dict is not None:
            master_metadata_dictionary['nodes'][gene].update({'genomic_data': genomic_info_dict})
        else: master_metadata_dictionary['nodes'][gene].update({'genomic_data': 'None'})
    else:
        if genomic_info_dict is not None:
            master_metadata_dictionary['nodes'].update({gene: {'genomic_data': genomic_info_dict}})
        else: master_metadata_dictionary['nodes'].update({gene: {'genomic_data': 'None'}})
    
    # add variant information
    if node_key in master_metadata_dictionary['nodes'].keys():
        if url in master_metadata_dictionary['nodes'][node_key].keys():
            master_metadata_dictionary['nodes'][node_key][url]['ClinVar_VariantName'] |= {var_name}
            master_metadata_dictionary['nodes'][node_key][url]['ClinVar_rs_id'] |= {rs_id}
            master_metadata_dictionary['nodes'][node_key][url]['ClinVar_AlleleID'] |= {allele_id}
            master_metadata_dictionary['nodes'][node_key][url]['ClinVar_RCVaccession'] |= {rcv}
            master_metadata_dictionary['nodes'][node_key][url]['ClinVar_Type'] |= {var_type}
            master_metadata_dictionary['nodes'][node_key][url]['ClinVar_Assembly'] |= {assembly}
            master_metadata_dictionary['nodes'][node_key][url]['ClinVar_GenesPerAlleleID'] |= {gpa}
            master_metadata_dictionary['nodes'][node_key][url]['ClinVar_Category'] |= {category}
        else:
            master_metadata_dictionary['nodes'][node_key].update({
                url: {'ClinVar_rs_id': {rs_id},
                      'ClinVar_VariantName': {var_name},
                      'ClinVar_AlleleID': {allele_id},
                      'ClinVar_RCVaccession': {rcv},
                      'ClinVar_Type': {var_type},
                      'ClinVar_Assembly': {assembly},
                      'ClinVar_GenesPerAlleleID': {gpa},
                      'ClinVar_Category': {category}
                     }})
    else:
        master_metadata_dictionary['nodes'].update({node_key: {
            url: {'ClinVar_rs_id': {rs_id},
                  'ClinVar_VariantName': {var_name},
                  'ClinVar_AlleleID': {allele_id},
                  'ClinVar_RCVaccession': {rcv},
                  'ClinVar_Type': {var_type},
                  'ClinVar_Assembly': {assembly},
                  'ClinVar_GenesPerAlleleID': {gpa},
                  'ClinVar_Category': {category}}}})

    # add relation data to dictionary
    if edge_key in master_metadata_dictionary['edges'].keys():
        if url in master_metadata_dictionary['edges'][edge_key].keys():
            if 'ClinVar_Evidence' in master_metadata_dictionary['edges'][edge_key][url].keys():
                inital_ev = master_metadata_dictionary['edges'][edge_key][url]
                inital_ev = inital_ev['ClinVar_Evidence'] + evidence
                ev = [json.loads(i) for i in set(json.dumps(item, sort_keys=True) for item in inital_ev)]
                master_metadata_dictionary['edges'][edge_key][url]['ClinVar_Evidence'] = ev
            else: master_metadata_dictionary['edges'][edge_key][url].update({'ClinVar_Evidence': evidence})
        else: master_metadata_dictionary['edges'][edge_key].update({url: {'ClinVar_Evidence': evidence, 'Type': edge_type}})
    else: master_metadata_dictionary['edges'].update({edge_key: {url: {'ClinVar_Evidence': evidence, 'Type': edge_type}}})

# delete unneeded data
del clv_var_gene        
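
The evidence de-duplication step used when an edge already exists can be hard to read inline. The following is a minimal sketch of the same idea as a standalone function, assuming every value in an evidence dictionary is JSON-serializable. The function name `merge_evidence` and the sample values are hypothetical.

```python
import json


def merge_evidence(existing, new):
    """Concatenate two lists of evidence dicts and drop exact duplicates.

    Dicts are not hashable, so each one is serialized to a canonical JSON
    string (sort_keys=True), de-duplicated in a set, and parsed back. The
    strings are sorted here so the output order is reproducible.
    """
    serialized = {json.dumps(item, sort_keys=True) for item in existing + new}
    return [json.loads(s) for s in sorted(serialized)]


# made-up evidence records; the first "new" record repeats the old one with reordered keys
old = [{'ClinVar_ReviewStatus': 'reviewed', 'ClinVar_NumberSubmitters': 2}]
new = [{'ClinVar_NumberSubmitters': 2, 'ClinVar_ReviewStatus': 'reviewed'},
       {'ClinVar_ReviewStatus': 'no assertion', 'ClinVar_NumberSubmitters': 1}]
merged = merge_evidence(old, new)
print(len(merged))
```

Note that `json.dumps` raises a `TypeError` on NumPy scalar types, so values taken directly from a pandas row may need converting to built-in Python types first.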

***

#### `HPA_GTEX_RNA_GENE_PROTEIN_EDGES.txt` <a class="anchor" id="hpa"></a>  

**Data Source Wiki Page:**  
- [Genotype-Tissue Expression Project](https://github.com/callahantiff/PheKnowLator/wiki/v4-Data-Sources#genotype-tissue-expression-project)  
- [Human Protein Atlas](https://github.com/callahantiff/PheKnowLator/wiki/v4-Data-Sources#human-protein-atlas)  

**Edges:**  
- `protein-anatomy`  
- `protein-cell`  
- `rna-anatomy`  
- `rna-cell`  

**Identifier Maps:**    
- Proteins: [UNIPROT_ACCESSION_PRO_ONTOLOGY_MAP.txt](https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/UNIPROT_ACCESSION_PRO_ONTOLOGY_MAP.txt)    
- Anatomy: [HPA_GTEx_TISSUE_CELL_MAP.txt](https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/HPA_GTEx_TISSUE_CELL_MAP.txt)
- Cells: [HPA_GTEx_TISSUE_CELL_MAP.txt](https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/HPA_GTEx_TISSUE_CELL_MAP.txt)  
- RNA: [GENE_SYMBOL_ENSEMBL_TRANSCRIPT_MAP.txt](https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/GENE_SYMBOL_ENSEMBL_TRANSCRIPT_MAP.txt)  

This chunk processes the [`HPA_GTEX_RNA_GENE_PROTEIN_EDGES.txt`](https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/HPA_GTEX_RNA_GENE_PROTEIN_EDGES.txt) file and obtains the following node and edge metadata:  
- **Nodes:**  
  _Anatomy, Cell_ 
  - `Anatomy`: A string containing the concept's synonym. If derived from an ontology, the string will be prefixed by the synonym type.    
  - `Anatomy_Type`: A string indicating the type of annotation.    
  - `Subcellular_Location`: A string containing a subcellular compartment.  
  
  _Protein, RNA_    
  - `GenomicInformation`: A dictionary of protein identifier information. See the [Genomic Entity Metadata](#genomicinfo) code chunk for more details. 


- **Edges:**  
  - `Expression_Value`: The expression value derived from the experiments.    
  - `Source`: A string indicating the source of the data (i.e., Human Protein Atlas or the Genotype-Tissue Expression project).  
  - `Evidence`: A string indicating if the evidence is at the transcript or protein level.  
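
The four edge types listed for this file depend on two fields: the evidence level and whether the annotation is an anatomical entity or a cell type. A minimal sketch of that decision is shown below. The evidence strings and the `'anatomy'` type value are the ones used in this notebook's code, while the function name `hpa_gtex_edge_type` is hypothetical.

```python
def hpa_gtex_edge_type(evidence, anatomy_type):
    """Return the edge type for one row, or None when the evidence level is unrecognized."""
    if evidence == 'Evidence at protein level':
        source = 'protein'
    elif evidence == 'Evidence at transcript level':
        source = 'rna'
    else:
        return None
    # anything that is not annotated as anatomy is treated as a cell type
    target = 'anatomy' if anatomy_type == 'anatomy' else 'cell'
    return '{}-{}'.format(source, target)


print(hpa_gtex_edge_type('Evidence at transcript level', 'cell line'))
```

Returning `None` for unrecognized evidence levels lets the caller skip the row explicitly instead of reusing values from a previous iteration.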

In [None]:
# download data
url = 'https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/HPA_GTEX_RNA_GENE_PROTEIN_EDGES.txt'
if not os.path.exists(unprocessed_data_location + 'HPA_GTEX_RNA_GENE_PROTEIN_EDGES.txt'):
    data_downloader(url, unprocessed_data_location, 'HPA_GTEX_RNA_GENE_PROTEIN_EDGES.txt')

# load data
hpa_gen_ant = pandas.read_csv(unprocessed_data_location + 'HPA_GTEX_RNA_GENE_PROTEIN_EDGES.txt', header=None, delimiter='\t', skiprows=0)

# filter data
hpa_gen_ant = hpa_gen_ant[hpa_gen_ant[3] != 'No human protein/transcript evidence']

*Merge Identifier Maps*

In [None]:
hpa_gen_ant = hpa_gen_ant.merge(uniprot_pro_map, left_on=2, right_on='Uniprot_Accession_IDs')
hpa_gen_ant = hpa_gen_ant.merge(anatomy_maps, left_on=6, right_on='anatomy_ids')
hpa_gen_ant = hpa_gen_ant.merge(symbol_transcript_map, left_on=1, right_on='Gene_Symbols')

# visualize data
hpa_gen_ant.head(n=3)

*Create Metadata Dictionary*

In [None]:
# master_metadata_dictionary['edges'].update({'protein-anatomy': {}, 'protein-cell': {}, 'rna-anatomy': {}, 'rna-cell': {}})

# create dictionary
for idx, row in tqdm(hpa_gen_ant.iterrows(), total=hpa_gen_ant.shape[0]):
    node_key = row['ontolgoy_ids']; anatomy = row[6]; anatomy_type = row[4]; subcell = row[5]
    evidence = [{'HPA_GTEx_Expression_Value': row[7], 'HPA_GTEx_Source': row[8], 'HPA_GTEx_Evidence': row[3]}]   
    protein = row['Protein_Ontology_IDs'].rstrip(); rna = row['Ensembl_Transcript_IDs']
    if row[3] == 'Evidence at protein level' and row[4] == 'anatomy':
        node_key2 = protein; edge_key = '{}-{}'.format(node_key2, node_key); edge_type = 'protein-anatomy'
    elif row[3] == 'Evidence at protein level' and row[4] != 'anatomy':
        node_key2 = protein; edge_key = '{}-{}'.format(node_key2, node_key); edge_type = 'protein-cell'
    elif row[3] == 'Evidence at transcript level' and row[4] == 'anatomy':
        node_key2 = rna; edge_key = '{}-{}'.format(node_key2, node_key); edge_type = 'rna-anatomy'
    elif row[3] == 'Evidence at transcript level' and row[4] != 'anatomy':
        node_key2 = rna; edge_key = '{}-{}'.format(node_key2, node_key); edge_type = 'rna-cell'
    else: continue  # skip rows with an unrecognized evidence level
    
    # add anatomical information
    if node_key in master_metadata_dictionary['nodes'].keys():
        if url in master_metadata_dictionary['nodes'][node_key].keys():
            master_metadata_dictionary['nodes'][node_key][url]['HPA_GTEx_Anatomy'] |= {anatomy}
            master_metadata_dictionary['nodes'][node_key][url]['HPA_GTEx_Anatomy_Type'] |= {anatomy_type}
            master_metadata_dictionary['nodes'][node_key][url]['HPA_GTEx_Subcellular_Location'] |= {subcell}
        else:
            master_metadata_dictionary['nodes'][node_key].update({url: {
                'HPA_GTEx_Anatomy': {anatomy},
                'HPA_GTEx_Anatomy_Type': {anatomy_type},
                'HPA_GTEx_Subcellular_Location': {subcell}}})
    else:
        master_metadata_dictionary['nodes'].update({node_key: {url: {
            'HPA_GTEx_Anatomy': {anatomy},
            'HPA_GTEx_Anatomy_Type': {anatomy_type},
            'HPA_GTEx_Subcellular_Location': {subcell}}}})

    # add genomic information
    if node_key2 in genomic_metadata.keys(): genomic_info_dict = genomic_metadata[node_key2]
    else: genomic_info_dict = None
    if node_key2 in master_metadata_dictionary['nodes'].keys():
        if genomic_info_dict is not None:
            master_metadata_dictionary['nodes'][node_key2].update({'genomic_data': genomic_info_dict})
        else: master_metadata_dictionary['nodes'][node_key2].update({'genomic_data': 'None'})
    else:
        if genomic_info_dict is not None:
            master_metadata_dictionary['nodes'].update({node_key2: {'genomic_data': genomic_info_dict}})
        else: master_metadata_dictionary['nodes'].update({node_key2: {'genomic_data': 'None'}})

    # add relation data to dictionary
    if edge_key in master_metadata_dictionary['edges'].keys():
        if url in master_metadata_dictionary['edges'][edge_key].keys():
            if 'HPA_GTEx_Evidence' in master_metadata_dictionary['edges'][edge_key][url].keys():
                inital_ev = master_metadata_dictionary['edges'][edge_key][url]
                inital_ev = inital_ev['HPA_GTEx_Evidence'] + evidence
                ev = [json.loads(i) for i in set(json.dumps(item, sort_keys=True) for item in inital_ev)]
                master_metadata_dictionary['edges'][edge_key][url]['HPA_GTEx_Evidence'] = ev
            else: master_metadata_dictionary['edges'][edge_key][url].update({'HPA_GTEx_Evidence': evidence})
        else: master_metadata_dictionary['edges'][edge_key].update({url: {'HPA_GTEx_Evidence': evidence, 'Type': edge_type}})
    else: master_metadata_dictionary['edges'].update({edge_key: {url: {'HPA_GTEx_Evidence': evidence, 'Type': edge_type}}})

# delete unneeded data
del hpa_gen_ant 

***

#### `UNIPROT_PROTEIN_CATALYST.txt` <a class="anchor" id="uniprot-catalyst"></a>  

**Data Source Wiki Page:** [Universal Protein Resource Knowledgebase](https://github.com/callahantiff/PheKnowLator/wiki/v4-Data-Sources#universal-protein-resource-knowledgebase)  

**Edges:**  
- `protein-catalyst`  

This chunk processes the [`UNIPROT_PROTEIN_CATALYST.txt`](https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/UNIPROT_PROTEIN_CATALYST.txt) file and obtains the following node and edge metadata:  
 
- **Nodes:**  
  _Protein_    
  - `GenomicInformation`: A dictionary of protein identifier information. See the [Genomic Entity Metadata](#genomicinfo) code chunk for more details. 


- **Edges:**  
  - `Status`: A string to indicate the status of the entry in Uniprot.   

In [None]:
# download data
url = 'https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/UNIPROT_PROTEIN_CATALYST.txt'
if not os.path.exists(unprocessed_data_location + 'UNIPROT_PROTEIN_CATALYST.txt'):
    data_downloader(url, unprocessed_data_location, 'UNIPROT_PROTEIN_CATALYST.txt')

# load data
upt_prot_cat = pandas.read_csv(unprocessed_data_location + 'UNIPROT_PROTEIN_CATALYST.txt', header=None, delimiter='\t', skiprows=0)

# visualize data
upt_prot_cat.head(n=3)

*Create Metadata Dictionary*

In [None]:
# master_metadata_dictionary['edges'].update({'protein-catalyst': {}})

# create dictionary
for idx, row in tqdm(upt_prot_cat.iterrows(), total=upt_prot_cat.shape[0]):
    node_key = row[0]; chebi = row[1]; evidence = [{'Uniprot_Status': row[2]}]   
    edge_key = '{}-{}'.format(node_key, chebi); edge_type = 'protein-catalyst'
    
    # add catalyst information
    if chebi in master_metadata_dictionary['nodes'].keys():
        if url in master_metadata_dictionary['nodes'][chebi].keys():
            master_metadata_dictionary['nodes'][chebi][url]['Uniprot_CHEBI'] |= {chebi}
        else: master_metadata_dictionary['nodes'][chebi].update({url: {'Uniprot_CHEBI': {chebi}}})
    else: master_metadata_dictionary['nodes'].update({chebi: {url: {'Uniprot_CHEBI': {chebi}}}})
    
    # add genomic information
    if node_key in genomic_metadata.keys(): genomic_info_dict = genomic_metadata[node_key]
    else: genomic_info_dict = None
    if node_key in master_metadata_dictionary['nodes'].keys():
        if genomic_info_dict is not None:
            master_metadata_dictionary['nodes'][node_key].update({'genomic_data': genomic_info_dict})
        else: master_metadata_dictionary['nodes'][node_key].update({'genomic_data': 'None'})
    else:
        if genomic_info_dict is not None:
            master_metadata_dictionary['nodes'].update({node_key: {'genomic_data': genomic_info_dict}})
        else: master_metadata_dictionary['nodes'].update({node_key: {'genomic_data': 'None'}})

    # add relation data to dictionary
    if edge_key in master_metadata_dictionary['edges'].keys():
        if url in master_metadata_dictionary['edges'][edge_key].keys():
            if 'Uniprot_Evidence' in master_metadata_dictionary['edges'][edge_key][url].keys():
                inital_ev = master_metadata_dictionary['edges'][edge_key][url]
                inital_ev = inital_ev['Uniprot_Evidence'] + evidence
                ev = [json.loads(i) for i in set(json.dumps(item, sort_keys=True) for item in inital_ev)]
                master_metadata_dictionary['edges'][edge_key][url]['Uniprot_Evidence'] = ev
            else: master_metadata_dictionary['edges'][edge_key][url].update({'Uniprot_Evidence': evidence})
        else: master_metadata_dictionary['edges'][edge_key].update({url: {'Uniprot_Evidence': evidence, 'Type': edge_type}})
    else: master_metadata_dictionary['edges'].update({edge_key: {url: {'Uniprot_Evidence': evidence, 'Type': edge_type}}})

# delete unneeded data
del upt_prot_cat
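
The "add genomic information" step is repeated in each chunk of this section. A minimal sketch of how it could be factored into a helper is shown below, assuming `genomic_metadata` is a plain dictionary keyed by node identifier. The function name `add_genomic_data` and the sample identifiers are hypothetical.

```python
def add_genomic_data(nodes, node_key, genomic_metadata):
    """Attach genomic metadata to a node entry, creating the entry if needed.

    Mirrors the notebook's convention of storing the string 'None' when no
    genomic metadata is available for the identifier.
    """
    genomic_info = genomic_metadata.get(node_key)  # None when the key is absent
    entry = nodes.setdefault(node_key, {})         # keeps any existing metadata
    entry['genomic_data'] = genomic_info if genomic_info is not None else 'None'
    return entry


# made-up identifiers and metadata
nodes = {'PR_X': {'some_source': {}}}
genomic_metadata = {'PR_X': {'symbol': 'GENE_X'}}
add_genomic_data(nodes, 'PR_X', genomic_metadata)
add_genomic_data(nodes, 'PR_Y', genomic_metadata)
print(nodes)
```

Looking the value up with `.get` on every call means a missing identifier can never pick up metadata left over from an earlier row.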

***

#### `UNIPROT_PROTEIN_COFACTOR.txt` <a class="anchor" id="uniprot-cofactor"></a>  

**Data Source Wiki Page:** [Universal Protein Resource Knowledgebase](https://github.com/callahantiff/PheKnowLator/wiki/v4-Data-Sources#universal-protein-resource-knowledgebase)  

**Edges:**  
- `protein-cofactor`  

This chunk processes the [`UNIPROT_PROTEIN_COFACTOR.txt`](https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/UNIPROT_PROTEIN_COFACTOR.txt) file and obtains the following node and edge metadata:  
 
- **Nodes:**  
  _Protein_    
  - `GenomicInformation`: A dictionary of protein identifier information. See the [Genomic Entity Metadata](#genomicinfo) code chunk for more details. 


- **Edges:**  
  - `Status`: A string to indicate the status of the entry in Uniprot. 

In [None]:
# download data
url = 'https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/UNIPROT_PROTEIN_COFACTOR.txt'
if not os.path.exists(unprocessed_data_location + 'UNIPROT_PROTEIN_COFACTOR.txt'):
    data_downloader(url, unprocessed_data_location, 'UNIPROT_PROTEIN_COFACTOR.txt')

# load data
upt_prot_cof = pandas.read_csv(unprocessed_data_location + 'UNIPROT_PROTEIN_COFACTOR.txt', header=None, delimiter='\t', skiprows=0)

# visualize data
upt_prot_cof.head(n=3)

*Create Metadata Dictionary*

In [None]:
# master_metadata_dictionary['edges'].update({'protein-cofactor': {}})

# create dictionary
for idx, row in tqdm(upt_prot_cof.iterrows(), total=upt_prot_cof.shape[0]):
    node_key = row[0]; chebi = row[1]; evidence = [{'Uniprot_Status': row[2]}]   
    edge_key = '{}-{}'.format(node_key, chebi); edge_type = 'protein-cofactor'
    
    # add cofactor information
    if chebi in master_metadata_dictionary['nodes'].keys():
        if url in master_metadata_dictionary['nodes'][chebi].keys():
            master_metadata_dictionary['nodes'][chebi][url]['Uniprot_CHEBI'] |= {chebi}
        else: master_metadata_dictionary['nodes'][chebi].update({url: {'Uniprot_CHEBI': {chebi}}})
    else: master_metadata_dictionary['nodes'].update({chebi: {url: {'Uniprot_CHEBI': {chebi}}}})
    
    # add genomic information
    if node_key in genomic_metadata.keys(): genomic_info_dict = genomic_metadata[node_key]
    else: genomic_info_dict = None
    if node_key in master_metadata_dictionary['nodes'].keys():
        if genomic_info_dict is not None:
            master_metadata_dictionary['nodes'][node_key].update({'genomic_data': genomic_info_dict})
        else: master_metadata_dictionary['nodes'][node_key].update({'genomic_data': 'None'})
    else:
        if genomic_info_dict is not None:
            master_metadata_dictionary['nodes'].update({node_key: {'genomic_data': genomic_info_dict}})
        else: master_metadata_dictionary['nodes'].update({node_key: {'genomic_data': 'None'}})

    # add relation data to dictionary
    if edge_key in master_metadata_dictionary['edges'].keys():
        if url in master_metadata_dictionary['edges'][edge_key].keys():
            if 'Uniprot_Evidence' in master_metadata_dictionary['edges'][edge_key][url].keys():
                inital_ev = master_metadata_dictionary['edges'][edge_key][url]
                inital_ev = inital_ev['Uniprot_Evidence'] + evidence
                ev = [json.loads(i) for i in set(json.dumps(item, sort_keys=True) for item in inital_ev)]
                master_metadata_dictionary['edges'][edge_key][url]['Uniprot_Evidence'] = ev
            else: master_metadata_dictionary['edges'][edge_key][url].update({'Uniprot_Evidence': evidence})
        else: master_metadata_dictionary['edges'][edge_key].update({url: {'Uniprot_Evidence': evidence, 'Type': edge_type}})
    else: master_metadata_dictionary['edges'].update({edge_key: {url: {'Uniprot_Evidence': evidence, 'Type': edge_type}}})

# delete unneeded data
del upt_prot_cof

***

#### `9606.protein.links.v11.0.txt.gz` <a class="anchor" id="protein-protein"></a>  

**Data Source Wiki Page:** [Search Tool for Recurring Instances of Neighbouring Genes Database](https://github.com/callahantiff/PheKnowLator/wiki/v4-Data-Sources#search-tool-for-recurring-instances-of-neighbouring-genes-database) 

**Edges:**  
- `protein-protein`   

**Identifier Maps:**    
- Proteins: [STRING_PRO_ONTOLOGY_MAP.txt](https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/STRING_PRO_ONTOLOGY_MAP.txt)  

This chunk processes the [`9606.protein.links.v11.0.txt.gz`](https://stringdb-static.org/download/protein.links.v11.0/9606.protein.links.v11.0.txt.gz) file and obtains the following node and edge metadata:  
 
- **Nodes:**  
  _Protein_    
  - `GenomicInformation`: A dictionary of protein identifier information. See the [Genomic Entity Metadata](#genomicinfo) code chunk for more details. 


- **Edges:**  
  - `combined_score`: The combined score is computed by combining the probabilities from the different evidence channels and corrected for the probability of randomly observing an interaction. Scores range from 0-1000. For a more detailed description please see PMID:15608232.   
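
Because `combined_score` is stored as an integer from 0 to 1000, it is often rescaled or thresholded before use. The sketch below shows both steps on a toy table. The protein identifiers are made up, and the cutoff of 700 is an illustrative assumption rather than a setting used by this notebook.

```python
import pandas

# toy rows mimicking the STRING link columns; identifiers are made up
links = pandas.DataFrame({
    'protein1': ['9606.ENSP0000A', '9606.ENSP0000A', '9606.ENSP0000B'],
    'protein2': ['9606.ENSP0000B', '9606.ENSP0000C', '9606.ENSP0000C'],
    'combined_score': [950, 420, 700],
})

# rescale the 0-1000 integer score to a 0-1 confidence value
links['confidence'] = links['combined_score'] / 1000.0

# keep interactions at or above an illustrative cutoff (an assumption)
min_score = 700
high_conf = links[links['combined_score'] >= min_score].reset_index(drop=True)
print(high_conf)
```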

In [None]:
# download data
url = 'https://stringdb-static.org/download/protein.links.v11.0/9606.protein.links.v11.0.txt.gz'
if not os.path.exists(unprocessed_data_location + '9606.protein.links.v11.0.txt'):
    data_downloader(url, unprocessed_data_location, '9606.protein.links.v11.0.txt')

# load data
stg_prot_prot = pandas.read_csv(unprocessed_data_location + '9606.protein.links.v11.0.txt', header=0, delimiter=' ', skiprows=0)

*Merge Identifier Maps*

In [None]:
stg_prot_prot = stg_prot_prot.merge(string_pro_map, left_on='protein1', right_on='STRING_IDs')
stg_prot_prot = stg_prot_prot.merge(string_pro_map, left_on='protein2', right_on='STRING_IDs')

# visualize data
stg_prot_prot.head(n=3)

*Create Metadata Dictionary*

In [None]:
# master_metadata_dictionary['edges'].update({'protein-protein': {}})

# create dictionary
for idx, row in tqdm(stg_prot_prot.iterrows(), total=stg_prot_prot.shape[0]):
    proteins = [row['Protein_Ontology_IDs_x'], row['Protein_Ontology_IDs_y']]; score = row['combined_score']; gene_info = []
    edge_key = '{}-{}'.format(row['Protein_Ontology_IDs_x'], row['Protein_Ontology_IDs_y']); edge_type = 'protein-protein' 
    
    for node_key in proteins:
        if node_key in genomic_metadata.keys(): genomic_info_dict = genomic_metadata[node_key]
        else: genomic_info_dict = None
        
        # add genomic information
        if node_key in master_metadata_dictionary['nodes'].keys():
            if genomic_info_dict is not None:
                master_metadata_dictionary['nodes'][node_key].update({'genomic_data': genomic_info_dict})
            else: master_metadata_dictionary['nodes'][node_key].update({'genomic_data': 'None'})
        else:
            if genomic_info_dict is not None:
                master_metadata_dictionary['nodes'].update({node_key: {'genomic_data': genomic_info_dict}})
            else: master_metadata_dictionary['nodes'].update({node_key: {'genomic_data': 'None'}}) 
    
    # add relation data to dictionary
    if edge_key in master_metadata_dictionary['edges'].keys():
        if url in master_metadata_dictionary['edges'][edge_key].keys():
            if 'String_Evidence' in master_metadata_dictionary['edges'][edge_key][url].keys():
                master_metadata_dictionary['edges'][edge_key][url]['String_Evidence'] = score
            else: master_metadata_dictionary['edges'][edge_key][url].update({'String_Evidence': score})
        else: master_metadata_dictionary['edges'][edge_key].update({url: {'String_Evidence': score, 'Type': edge_type}})
    else: master_metadata_dictionary['edges'].update({edge_key: {url: {'String_Evidence': score, 'Type': edge_type}}})

# delete unneeded data
del stg_prot_prot

***

#### `curated_gene_disease_associations.tsv` <a class="anchor" id="gene-phen"></a>  

**Data Source Wiki Page:** [DisGeNET](https://github.com/callahantiff/PheKnowLator/wiki/v4-Data-Sources#disgenet)

**Edges:**  
- `gene-disease`   
- `gene-phenotype`    

**Identifier Maps:**    
- Diseases: [DISEASE_MONDO_MAP.txt](https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/DISEASE_MONDO_MAP.txt)
- Phenotypes: [PHENOTYPE_HPO_MAP.txt](https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/PHENOTYPE_HPO_MAP.txt)   

This chunk processes the [curated_gene_disease_associations.tsv](https://www.disgenet.org/static/disgenet_ap1/files/downloads/curated_gene_disease_associations.tsv.gz) file and obtains the following node and edge metadata:  
- **Nodes:**  
  _Disease, Phenotype_  
  - `diseaseId`: A string containing the concept's database cross-reference, formatted as DB:ID. Otherwise, a MeSH or OMIM identifier, provided as a string with the "MESH" or "OMIM" prefix in all caps.    
  - `diseaseName`: A string containing the concept's synonym. If derived from an ontology, the string will be prefixed by the synonym type.    
  - `diseaseSemanticType`: A string containing a high-level grouper or typing variable for the disease.    
  - `diseaseClass`: A ";"-delimited list of ICD codes that can be used to classify the disease.  
 
  _Gene_    
  - `GenomicInformation`: A dictionary of gene identifier information. See the [Genomic Entity Metadata](#genomicinfo) code chunk for more details. 


- **Edges:**  
  - `DSI`: The Disease Specificity Index ranges from 0.25 to 1. It is calculated as: DSI = log2(# diseases assoc with gene/total # of diseases in DisGeNET) / log2(1/total # of diseases in DisGeNET).  
  - `DPI`: The Disease Pleiotropy Index ranges from 0 to 1. It is calculated as: DPI = (# of MeSH disease classes of disease assoc with gene/total # of MeSH disease classes)*100.   
  - `score`: The score ranges from 0 to 1 and takes into account the number and type of sources (level of curation, model organisms) and the number of publications supporting the association.  
  - `EI`: The Evidence Index (EI) indicates whether the publications supporting an association contain contradictory results. EI=1 indicates that all the publications support the GDA or the VDA, while EI<1 indicates that there are publications that assert that there is no association between the gene/variants and the disease. If the gene/variant has no EI value, it indicates that the index has not been computed for this association. It is calculated as: EI = (# positive pubs/total # of pubs). It is distinct from the Evidence Level (EL), a metric developed by ClinGen that measures the strength of evidence of a gene-disease relationship and correlates to a qualitative classification: "Definitive", "Strong", "Moderate", "Limited", "Disputed" ([PMID:28552198](https://www.ncbi.nlm.nih.gov/pubmed/28552198)).  
  - `YearInitial`: First time that the association was reported.  
  - `YearFinal`: Last time that the association was reported.  
  - `NofPmids`: Count of associated Pubmed IDs.  
  - `NofSnps`: Count of associated SNPs.  
  - `Source`: The original source reporting the Gene-Disease Association.  
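
To make the DSI, DPI, and EI formulas listed above concrete, the sketch below evaluates them for made-up counts. The function names and all input numbers are hypothetical. In practice these values are read from the DisGeNET file rather than recomputed.

```python
import math


def dsi(n_gene_diseases, n_total_diseases):
    """DSI = log2(Nd / NT) / log2(1 / NT)."""
    return math.log2(n_gene_diseases / n_total_diseases) / math.log2(1 / n_total_diseases)


def dpi(n_gene_classes, n_total_classes):
    """DPI = (Ndc / NTC) * 100."""
    return (n_gene_classes / n_total_classes) * 100


def ei(n_positive_pubs, n_total_pubs):
    """EI = positive publications / total publications."""
    return n_positive_pubs / n_total_pubs


# made-up counts chosen so the arithmetic is easy to follow
print(dsi(1, 1024))    # a gene associated with a single disease
print(dsi(32, 1024))   # a gene associated with 32 of 1024 diseases
print(dpi(5, 20))      # diseases spanning 5 of 20 MeSH classes
print(ei(9, 10))       # 9 of 10 publications support the association
```

A gene linked to only one disease gets the highest DSI, and the value decreases as the gene is linked to more diseases.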

In [None]:
# download data
url = 'https://www.disgenet.org/static/disgenet_ap1/files/downloads/curated_gene_disease_associations.tsv.gz'
if not os.path.exists(unprocessed_data_location + 'curated_gene_disease_associations.tsv'):
    data_downloader(url, unprocessed_data_location, 'curated_gene_disease_associations.tsv')

# load data
dgt_dis_gene = pandas.read_csv(unprocessed_data_location + 'curated_gene_disease_associations.tsv', header=0, delimiter='\t', skiprows=0)
dgt_dis_gene = dgt_dis_gene[dgt_dis_gene['diseaseType'] != 'group']

# fix variable typing
dgt_dis_gene['YearInitial'] = dgt_dis_gene['YearInitial'].astype('float').astype('Int64')
dgt_dis_gene['YearFinal'] = dgt_dis_gene['YearFinal'].astype('float').astype('Int64')

# fix prefix
dgt_dis_gene['geneId'] = 'NCBIGene_' + dgt_dis_gene['geneId'].astype('str')


*Merge Identifier Maps*

In [None]:
dgt_dis_gene = dgt_dis_gene.merge(disease_maps, left_on='diseaseId', right_on='Disease_IDs')
dgt_dis_gene = dgt_dis_gene.merge(phenotype_maps, left_on='diseaseId', right_on='Disease_IDs')

# visualize data
dgt_dis_gene.head(n=3)
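
Because `merge` performs an inner join by default, any row whose `diseaseId` is missing from an identifier map is dropped without warning. The sketch below, built on toy data, shows one way to check how many rows would be lost by running a left join with `indicator=True`. All identifiers are made up, and the frame names `associations` and `disease_map` are hypothetical stand-ins.

```python
import pandas

# toy stand-ins for the association table and one identifier map (all values made up)
associations = pandas.DataFrame({'geneId': ['NCBIGene_1', 'NCBIGene_2', 'NCBIGene_3'],
                                 'diseaseId': ['C0000001', 'C0000002', 'C0000003']})
disease_map = pandas.DataFrame({'Disease_IDs': ['C0000001', 'C0000003'],
                                'MONDO_IDs': ['MONDO_0000001', 'MONDO_0000003']})

# default inner join: the unmapped row (C0000002) is silently dropped
inner = associations.merge(disease_map, left_on='diseaseId', right_on='Disease_IDs')

# left join with indicator=True keeps every association and flags unmapped ones
checked = associations.merge(disease_map, left_on='diseaseId', right_on='Disease_IDs',
                             how='left', indicator=True)
unmapped = checked.loc[checked['_merge'] == 'left_only', 'diseaseId'].tolist()
print(len(inner), unmapped)
```

Running a check like this before each merge makes it clear whether chaining two inner joins on the same key leaves the expected number of rows.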

*Create Metadata Dictionary*

In [None]:
# master_metadata_dictionary['edges'].update({'gene-disease': {}, 'gene-phenotype': {}})

# create dictionary
for idx, row in tqdm(dgt_dis_gene.iterrows(), total=dgt_dis_gene.shape[0]):
    node_key = row['geneId']; dis_id = row['diseaseId']; dis_name = row['diseaseName']
    sem_type = row['diseaseSemanticType']; dis_cls = row['diseaseClass']
    evidence = [{'DisGeNET_DSI': row['DSI'] if not pandas.isna(row['DSI']) else 'None',
                 'DisGeNET_DPI': row['DPI'] if not pandas.isna(row['DPI']) else 'None',
                 'DisGeNET_score': row['score'] if not pandas.isna(row['score']) else 'None',
                 'DisGeNET_EI': row['EI'] if not pandas.isna(row['EI']) else 'None',
                 'DisGeNET_YearInitial': row['YearInitial'] if not pandas.isna(row['YearInitial']) else 'None',
                 'DisGeNET_YearFinal': row['YearFinal'] if not pandas.isna(row['YearFinal']) else 'None',
                 'DisGeNET_NofPmids': row['NofPmids'],
                 'DisGeNET_NofSnps': row['NofSnps']}]   
    if row['diseaseType'] == 'disease':
        node_key2 = row['MONDO_IDs']; edge_key = '{}-{}'.format(node_key, node_key2); edge_type = 'gene-disease'
    else:
        node_key2 = row['HP_IDs']; edge_key = '{}-{}'.format(node_key, node_key2); edge_type = 'gene-phenotype'
    
    # add disease/phenotype information
    if node_key2 in master_metadata_dictionary['nodes'].keys():
        if url in master_metadata_dictionary['nodes'][node_key2].keys():
            master_metadata_dictionary['nodes'][node_key2][url]['DisGeNET_diseaseId'] |= {dis_id}
            master_metadata_dictionary['nodes'][node_key2][url]['DisGeNET_diseaseName'] |= {dis_name}
            master_metadata_dictionary['nodes'][node_key2][url]['DisGeNET_diseaseSemanticType'] |= {sem_type}
            master_metadata_dictionary['nodes'][node_key2][url]['DisGeNET_diseaseClass'] |= {dis_cls}
        else:
            master_metadata_dictionary['nodes'][node_key2].update({url: {
                'DisGeNET_diseaseId': {dis_id},
                'DisGeNET_diseaseName': {dis_name},
                'DisGeNET_diseaseSemanticType': {sem_type},
                'DisGeNET_diseaseClass': {dis_cls}}})
    else:
        master_metadata_dictionary['nodes'].update({node_key2: {url: {
            'DisGeNET_diseaseId': {dis_id},
            'DisGeNET_diseaseName': {dis_name},
            'DisGeNET_diseaseSemanticType': {sem_type},
            'DisGeNET_diseaseClass': {dis_cls}}}})

    # add genomic information
    if node_key in genomic_metadata.keys(): genomic_info_dict = genomic_metadata[node_key]
    else: genomic_info_dict = None
    if node_key in master_metadata_dictionary['nodes'].keys():
        if genomic_info_dict is not None:
            master_metadata_dictionary['nodes'][node_key].update({'genomic_data': genomic_info_dict})
        else: master_metadata_dictionary['nodes'][node_key].update({'genomic_data': 'None'})
    else:
        if genomic_info_dict is not None:
            master_metadata_dictionary['nodes'].update({node_key: {'genomic_data': genomic_info_dict}})
        else: master_metadata_dictionary['nodes'].update({node_key: {'genomic_data': 'None'}})

    # add relation data to dictionary
    if edge_key in master_metadata_dictionary['edges'].keys():
        if url in master_metadata_dictionary['edges'][edge_key].keys():
            if 'DisGeNET_Evidence' in master_metadata_dictionary['edges'][edge_key][url].keys():
                inital_ev = master_metadata_dictionary['edges'][edge_key][url]
                inital_ev = inital_ev['DisGeNET_Evidence'] + evidence
                ev = [json.loads(i) for i in set(json.dumps(item, sort_keys=True) for item in inital_ev)]
                master_metadata_dictionary['edges'][edge_key][url]['DisGeNET_Evidence'] = ev
            else: master_metadata_dictionary['edges'][edge_key][url].update({'DisGeNET_Evidence': evidence})
        else: master_metadata_dictionary['edges'][edge_key].update({url: {'DisGeNET_Evidence': evidence, 'Type': edge_type}})
    else: master_metadata_dictionary['edges'].update({edge_key: {url: {'DisGeNET_Evidence': evidence, 'Type': edge_type}}})

# delete unneeded data
del dgt_dis_gene

<br>

***

#### Save Metadata Dictionary
Write the metadata dictionary to a file named `entity_metadata_dict.pkl` in the `resources/metadata/` directory.
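
The cells in this section write each entry to its own file with `dump_jsonl` and keep only the file path in the dictionary. The helper itself is not shown here, so the following is only a sketch of that pattern under stated assumptions: `write_jsonl` and `to_jsonable` are hypothetical stand-ins, set values are converted to sorted lists because JSON has no set type, and set members are assumed to be mutually comparable.

```python
import json
import os
import tempfile


def to_jsonable(obj):
    """Recursively convert sets to sorted lists so the object can be JSON-encoded."""
    if isinstance(obj, dict):
        return {str(k): to_jsonable(v) for k, v in obj.items()}
    if isinstance(obj, (set, frozenset)):
        return sorted(to_jsonable(v) for v in obj)
    if isinstance(obj, (list, tuple)):
        return [to_jsonable(v) for v in obj]
    return obj


def write_jsonl(records, path):
    """Write one JSON object per line (hypothetical stand-in for `dump_jsonl`)."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, 'w', encoding='utf-8') as out:
        for record in records:
            out.write(json.dumps(to_jsonable(record)) + '\n')


# toy metadata entry; the layout mirrors the notebook but the values are made up
nodes = {'node_1': {'source_url': {'ClinVar_Type': {'deletion', 'indel'}}}}
paths = {}
with tempfile.TemporaryDirectory() as tmp:
    for key in list(nodes.keys()):  # copy the keys so entries can be popped safely
        file_loc = os.path.join(tmp, 'nodes', key + '.json')
        write_jsonl([{key: nodes.pop(key)}], file_loc)
        paths[key] = file_loc
    with open(paths['node_1'], encoding='utf-8') as handle:
        first_line = json.loads(handle.readline())
print(first_line)
```

Iterating over a copy of the keys matters because removing entries from a dictionary while iterating over it directly raises a `RuntimeError`.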

*Node Data*

In [None]:
# # create list of dictionaries
# print('Creating Node List ...')
# node_list = [{k: v} for k, v in tqdm(master_metadata_dictionary['nodes'].items())]
# master_metadata_dictionary['nodes'] = {} 

In [None]:
node_temp = {}
# iterate over a copy of the keys so entries can be removed inside the loop
for k in tqdm(list(master_metadata_dictionary['nodes'].keys())):
    file_loc = metadata_location + 'temp/nodes/' + str(k) + '.json'
    # remove the entry from the dictionary and write it to the temp directory
    v = master_metadata_dictionary['nodes'].pop(k)
    dump_jsonl([{k: v}], file_loc)
    # add dictionary entry with file path
    node_temp[k] = file_loc

# update nodes entry
master_metadata_dictionary['nodes'] = node_temp

*Relation Data*

In [None]:
# print('\nCreating Relations List ...')
# relation_list = [{k: v} for k, v in tqdm(master_metadata_dictionary['relations'].items())]
# master_metadata_dictionary['relations'] = {} 

In [None]:
relations_temp = {}
# iterate over a copy of the keys so entries can be removed during the loop
for k in tqdm(list(master_metadata_dictionary['relations'].keys())):
    file_loc = metadata_location + 'temp/relations/' + k + '.json'
    # remove the entry from the dictionary and write it, keyed by its identifier, to the temp directory
    v = master_metadata_dictionary['relations'].pop(k)
    dump_jsonl([{k: v}], file_loc)
    # add dictionary entry with file path
    relations_temp[k] = file_loc

# update relations entry
master_metadata_dictionary['relations'] = relations_temp

*Edge Data*

In [None]:
# print('\nCreating Edge List ...')
# edge_list = [{k: v} for k, v in tqdm(master_metadata_dictionary['edges'].items())]
# master_metadata_dictionary['edges'] = {} 

In [None]:
edges_temp = {}
# iterate over a copy of the keys so entries can be removed during the loop
for k in tqdm(list(master_metadata_dictionary['edges'].keys())):
    file_loc = metadata_location + 'temp/edges/' + k + '.json'
    # remove the entry from the dictionary and write it, keyed by its identifier, to the temp directory
    v = master_metadata_dictionary['edges'].pop(k)
    dump_jsonl([{k: v}], file_loc)
    # add dictionary entry with file path
    edges_temp[k] = file_loc

# update edges entry
master_metadata_dictionary['edges'] = edges_temp
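
The node, relation, and edge cells above share one pattern: each entry is written to its own file, and the in-memory dictionary keeps only the path to that file. The sketch below illustrates the pattern on toy data. It does not use the notebook's `dump_jsonl` helper; `write_jsonl` and `spill_to_disk` are hypothetical stand-ins, and the sketch assumes that every key is a valid file name.

```python
import json
import os
import tempfile


def write_jsonl(records, path):
    """Hypothetical stand-in for a JSON Lines writer: one JSON object per line."""
    with open(path, 'w', encoding='utf-8') as f_out:
        for record in records:
            f_out.write(json.dumps(record) + '\n')


def spill_to_disk(entries, out_dir):
    """Write each entry to its own file and return a key -> file path index."""
    os.makedirs(out_dir, exist_ok=True)
    index = {}
    # iterate over a snapshot of the keys so entries can be removed safely
    for key in list(entries.keys()):
        path = os.path.join(out_dir, key + '.json')
        # pop() removes the entry, releasing its memory once it is written
        write_jsonl([{key: entries.pop(key)}], path)
        index[key] = path
    return index


# toy data; the keys and fields are illustrative only
toy = {'node_1': {'Label': 'a'}, 'node_2': {'Label': 'b'}}
tmp_dir = tempfile.mkdtemp()
index = spill_to_disk(toy, tmp_dir)

# read one entry back from its file
with open(index['node_1'], encoding='utf-8') as f_in:
    restored = json.loads(f_in.readline())
```

Iterating over `list(entries.keys())` matters here: removing items from a dictionary while iterating over it directly raises a `RuntimeError` in Python 3.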

*Save Updated Metadata Dictionary to Metadata Location*

In [None]:
# save a copy of the dictionary
# output > 4GB requires special approach: https://stackoverflow.com/questions/42653386/does-pickle-randomly-fail-with-oserror-on-large-files
filepath = metadata_location + 'entity_metadata_dict.pkl'

# defensive way to write the pickle in chunks, allowing for very large files on all platforms
max_bytes, bytes_out = 2**31 - 1, pickle.dumps(master_metadata_dictionary)
# len() gives the payload size; sys.getsizeof() would also count the bytes object's overhead
n_bytes = len(bytes_out)

with open(filepath, 'wb') as f_out:
    for idx in range(0, n_bytes, max_bytes):
        f_out.write(bytes_out[idx:idx+max_bytes])
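
A file written in chunks like this can be read back the same way. The following is a minimal sketch, assuming the same `2**31 - 1` chunk size; the function names are hypothetical, and the demonstration uses a tiny chunk size on toy data only to force several chunks.

```python
import os
import pickle
import tempfile


def write_pickle_in_chunks(obj, path, max_bytes=2**31 - 1):
    """Serialize obj and write it to path in slices of at most max_bytes."""
    data = pickle.dumps(obj)
    with open(path, 'wb') as f_out:
        for idx in range(0, len(data), max_bytes):
            f_out.write(data[idx:idx + max_bytes])


def read_pickle_in_chunks(path, max_bytes=2**31 - 1):
    """Read path in slices of at most max_bytes and deserialize the result."""
    buffer = bytearray()
    size = os.path.getsize(path)
    with open(path, 'rb') as f_in:
        for _ in range(0, size, max_bytes):
            buffer += f_in.read(max_bytes)
    return pickle.loads(bytes(buffer))


# toy dictionary standing in for the metadata dictionary
example = {'nodes': {'n1': 'temp/nodes/n1.json'}, 'edges': {}}
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.pkl')
write_pickle_in_chunks(example, demo_path, max_bytes=16)
loaded = read_pickle_in_chunks(demo_path, max_bytes=16)
```

For files that fit comfortably within platform limits, a plain `pickle.load` on the opened file returns the same object.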


<br>

***
***

```
@misc{callahan_tj_2019_3401437,
  author       = {Callahan, TJ},
  title        = {PheKnowLator},
  month        = mar,
  year         = 2019,
  doi          = {10.5281/zenodo.3401437},
  url          = {https://doi.org/10.5281/zenodo.3401437}
}
```