***
# PheKnowLator - Data Preparation
***
***

**Author:** [TJCallahan](https://mail.google.com/mail/u/0/?view=cm&fs=1&tf=1&to=callahantiff@gmail.com)  
**GitHub Repository:** [PheKnowLator](https://github.com/callahantiff/PheKnowLator/wiki)  
**Release:** **[v2.0.0](https://github.com/callahantiff/PheKnowLator/wiki/v2.0.0)**

**Purpose:** This notebook serves as a script to download and process data in order to generate mapping and filtering data needed to build edges for the PheKnowLator knowledge graph. For more information on the data sources utilize within this script, please see the [Data Sources](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources) Wiki page.

**Assumptions:**   
- Raw data downloads ➞ `./resources/processed_data/unprocessed_data`    
- Processed data write location ➞ `./resources/processed_data`   

**Dependencies:** This notebook utilizes several helper functions, which are stored in the [`data_preparation_helper_functions.py`](https://github.com/callahantiff/PheKnowLator/blob/master/scripts/python/data_preparation_helper_functions.py) script. Hyperlinks to all downloaded and generated data sources are provided on the [Data Sources](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources) Wiki page as well as within each source subsection of this notebook. All generated data is freely available for download from DropBox. 

_____
***

## Table of Contents
***

### [Create Identifier Maps ](#create-identifier-maps)  
- [HUMAN TRANSCRIPT, GENE, AND PROTEIN IDENTIFIER MAPPING](#human-transcript,-gene,-and-protein-identifier-mapping)
  - [Ensembl Gene-Ensembl Transcript](#ensemblgene-ensembltranscript)  
  - [Ensembl Gene-Entrez Gene](#ensemblgene-entrezgene)
  - [Ensembl Transcript-Protein Ontology](#ensembltranscript-proteinontology)
  - [Gene Symbol-Ensembl Transcript](#genesymbol-ensembltranscript)
  - [Entrez Gene-Protein Ontology](#entrezgene-proteinontology)  
  - [STRING-Protein Ontology](#string-proteinontology)  
  - [Uniprot Accession-Protein Ontology](#uniprotaccession-proteinontology)


- [OTHER IDENTIFIER MAPPING](#other-identifier-mapping) 
  - [ChEBI Identifiers](#mesh-chebi) 
  - [Human Disease and Phenotype Identifiers](#disease-identifiers)
  - [Human Protein Atlas Tissue and Cell Types](#hpa-uberon)  

<br>

### [Create Edge Datasets](#create-edge-datasets)
- [ONTOLOGIES](#ontologies)  
  - [Protein Ontology](#protein-ontology)  
  - [Relations Ontology](#relations-ontology)  


- [LINKED DATA](#linked-data)  
  - [Clinvar Variant-Diseases and Phenotypes](#clinvar-variant)
  - [NCBI Gene Protein-Coding Genes and Proteins](#ncbi-protein-coding-genes)  
  - [Reactome Chemical-Complex Data](#reactome-chemical-complex)
  - [Reactome Complex-Complex Data](#reactome-complex-complex)
  - [Reactome Complex-Pathway Data](#reactome-complex-pathway)
  - [Reactome Protein-Complex Data](#reactome-protein-complex)
  - [Uniprot Protein-Cofactor and Protein-Catalyst](#uniprot-protein-cofactorcatalyst)  

<br>

### [Gather Instance Data Metadata](#create-instance-metadata)  
- [Genes/RNA](#gene-and-rna-metadata)
- [Pathways](#pathway-metadata)
- [Complexes](#complex-metadata)
- [Reactions](#reaction-metadata)
- [Variants](#variant-metadata) 

____

<br>

### Set-Up Environment
_____

In [1]:
# import needed libraries
import glob
import networkx
import numpy
import pandas

from functools import reduce
from owlready2 import subprocess
from rdflib import Graph, Namespace, URIRef, BNode, extras, Literal
from rdflib.extras.external_graph_libs import *
from reactome2py import content
from tqdm import tqdm

# import script containing helper functions
from scripts.python.data_preparation_helper_functions import *

**Define Global Variables**

In [2]:
# directory to read unprocessed data files from
unprocessed_data_location = 'resources/processed_data/unprocessed_data/'

# directory to write processed data files to
processed_data_location = 'resources/processed_data/'

<br>

***
***
### CREATE MAPPING DATASETS  <a class="anchor" id="create-identifier-maps"></a>
***
***

### Human Transcript, Gene, and Protein Identifier Mapping  <a class="anchor" id="human-transcript,-gene,-and-protein-identifier-mapping"></a>
***

**Data Source Wiki Pages:**   
- [Ensembl](https://uswest.ensembl.org/)  

- [Uniprot Knowledgebase](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#uniprot-knowledgebase)  
- [HGNC](ftp://ftp.ebi.ac.uk/pub/databases/genenames/new/tsv/hgnc_complete_set.txt) 
- [NCBI Gene](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#ncbi-gene) 
- [Protein Ontology](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#protein-ontology)

<br>

**Purpose:** To map create protein-coding gene-protein relations and mappings between the identifiers listed below. The edges types produced from each of these mappings will be further described within each identifier mapping section:  
- [Ensembl Gene-Ensembl Transcript](#ensemblgene-ensembltranscript)  
- [Entrez Gene-Ensembl Transcript](#entrezgene-ensembltranscript)  
- [Entrez Gene-Protein Ontology](#entrezgene-proteinontology)  
- [Ensembl Gene-Entrez Gene](#ensemblgene-entrezgene)
- [Uniprot Accession-Protein Ontology](#uniprotaccession-proteinontology)
- [STRING-Protein Ontology](#string-proteinontology)

<br>

**Output:** This script downloads and saves the following data:  
- Human Ensembl Gene Set ➞ [`Homo_sapiens.GRCh38.99.gtf`](ftp://ftp.ensembl.org/pub/release-99/gtf/homo_sapiens/Homo_sapiens.GRCh38.99.gtf.gz)
- Human Ensembl-UniProt Identifiers ➞ [`Homo_sapiens.GRCh38.98.uniprot.tsv`](https://www.dropbox.com/s/cesjvqz1b8c7ami/Homo_sapiens.GRCh38.98.uniprot.tsv?dl=1) 
- Human Ensembl-Entrez Identifiers ➞ [`Homo_sapiens.GRCh38.98.entrez.tsv`](https://www.dropbox.com/s/5kstw70py0azvws/Homo_sapiens.GRCh38.98.entrez.tsv?dl=1) 
- Human Gene Identifiers ➞ [`Homo_sapiens.gene_info`](https://www.dropbox.com/s/vazlmzxydgv6xzz/Homo_sapiens.gene_info?dl=1), [`hgnc_complete_set.txt`](ftp://ftp.ebi.ac.uk/pub/databases/genenames/new/tsv/hgnc_complete_set.txt)  
- Human Protein Identifiers ➞ [`promapping.txt`](https://www.dropbox.com/s/x7wdimv6ph6bl8k/promapping.txt?dl=1) 

_All Merged Data Sets:_ [`Merged_Human_Ensembl_Entrez_HGNC_Uniprot_Identifiers.txt`](https://www.dropbox.com/s/fiek6h5rowi7dh0/Merged_Human_Ensembl_Entrez_HGNC_Uniprot_Identifiers.txt?dl=1)

***

***
**Process Data:** `hgnc_complete_set.txt`

In [4]:
url = 'ftp://ftp.ebi.ac.uk/pub/databases/genenames/new/tsv/hgnc_complete_set.txt'
data_downloader(url, unprocessed_data_location)

Downloading data from ftp server


In [3]:
# read in ensembl-uniprot data
hgnc = pandas.read_csv(unprocessed_data_location + 'hgnc_complete_set.txt',
                       header = 0,
                       delimiter = '\t',
                       low_memory=False)

# drop uneeded columns
hgnc = hgnc[['hgnc_id', 'entrez_id', 'ensembl_gene_id', 'uniprot_ids', 'symbol', 'name', 'locus_group', 'location', 'alias_symbol']]

# replace NaN with 'None'
hgnc.fillna('None', inplace=True)

# make data columns of type string
hgnc['entrez_id'] = hgnc['entrez_id'].apply(lambda x: str(int(x)) if x != 'None' else 'None')

# explode nested data
explode_df_hgnc = explode(hgnc.copy(), ['entrez_id', 'ensembl_gene_id', 'uniprot_ids'], '|')

# preview data
explode_df_hgnc.head(n=3)

Unnamed: 0,hgnc_id,entrez_id,ensembl_gene_id,uniprot_ids,symbol,name,locus_group,location,alias_symbol
0,HGNC:5,1,ENSG00000121410,P04217,A1BG,alpha-1-B glycoprotein,protein-coding gene,19q13.43,
1,HGNC:37133,503538,ENSG00000268895,,A1BG-AS1,A1BG antisense RNA 1,non-coding RNA,19q13.43,FLJ23569
2,HGNC:24086,29974,ENSG00000148584,Q9NQ94,A1CF,APOBEC1 complementation factor,protein-coding gene,10q11.23,ACF|ASP|ACF64|ACF65|APOBEC1CF


***
**Process Data:** `Homo_sapiens.GRCh38.99.gtf.gz` + `Homo_sapiens.GRCh38.98.uniprot.tsv` + `Homo_sapiens.GRCh38.98.entrez.tsv`

In [52]:
# full human gene set
url = 'ftp://ftp.ensembl.org/pub/release-99/gtf/homo_sapiens/Homo_sapiens.GRCh38.99.gtf.gz'
data_downloader(url, unprocessed_data_location)

# uniprot annotations
url1 = 'ftp://ftp.ensembl.org/pub/release-99/tsv/homo_sapiens/Homo_sapiens.GRCh38.99.uniprot.tsv.gz'
data_downloader(url1, unprocessed_data_location)

# entrez annotations
url2 = 'ftp://ftp.ensembl.org/pub/release-99/tsv/homo_sapiens/Homo_sapiens.GRCh38.99.entrez.tsv.gz'
data_downloader(url2, unprocessed_data_location)

Downloading gzipped data from ftp server
Decompressing and writing gzipped data


_Read in Gene Set Data_

In [4]:
ensembl_geneset = pandas.read_csv(unprocessed_data_location + 'Homo_sapiens.GRCh38.99.gtf',
                                  header = None,
                                  delimiter = '\t',
                                  skiprows = 5,
                                  low_memory=False)

# iterate over the nested column and un ravel it
column_names = ['gene_id', 'gene_version', 'transcript_id', 'transcript_version', 'exon_number', 'gene_name',
                'gene_source', 'gene_biotype', 'transcript_name', 'transcript_source', 'transcript_biotype',
                'exon_id', 'exon_version', 'tag', 'transcript_support_level']

cleaned_column = []

for idx, row in tqdm(ensembl_geneset.iterrows(), total=ensembl_geneset.shape[0]):
    row_results, col_res = [], row[8].split(';')

    for col in column_names:
        match = [x.replace(col, '').strip().strip('"') for x in col_res if col in x]
        row_results.append(match[0].replace(col, '') if len(match) > 0 else 'None')

    cleaned_column += [row_results]
          
# remove nested column
ensembl_geneset = ensembl_geneset[[1, 2]]

# add columns back to data frame
ensembl_geneset['gene_stable_id'] = [x[0] for x in cleaned_column]
ensembl_geneset['transcript_stable_id'] = [x[2] for x in cleaned_column]
ensembl_geneset['exon_number'] = [x[4] for x in cleaned_column]
ensembl_geneset['gene_name'] = [x[5] for x in cleaned_column]
ensembl_geneset['gene_type'] = [x[7] for x in cleaned_column]
ensembl_geneset['transcript_name'] = [x[8] for x in cleaned_column]
ensembl_geneset['transcript_type'] = [x[10] for x in cleaned_column]
ensembl_geneset['exon_stable_id'] = [x[11] for x in cleaned_column]
ensembl_geneset['tag'] = [x[13] for x in cleaned_column]
ensembl_geneset['transcript_support_level'] = [x[14] for x in cleaned_column]

# rename columns
ensembl_geneset.rename(columns={1: 'ensembl_source', 2: 'molecule_type'}, inplace=True)

# replace NaN with 'None'
ensembl_geneset.fillna('None', inplace=True)

# preview data
ensembl_geneset.head(n=3)

100%|██████████| 2905054/2905054 [09:10<00:00, 5281.62it/s]


Unnamed: 0,ensembl_source,molecule_type,gene_stable_id,transcript_stable_id,exon_number,gene_name,gene_type,transcript_name,transcript_type,exon_stable_id,tag,transcript_support_level
0,havana,gene,ENSG00000223972,,,DDX11L1,transcribed_unprocessed_pseudogene,,,,,
1,havana,transcript,ENSG00000223972,ENST00000456328,,DDX11L1,transcribed_unprocessed_pseudogene,DDX11L1-202,processed_transcript,,basic,1.0
2,havana,exon,ENSG00000223972,ENST00000456328,1.0,DDX11L1,transcribed_unprocessed_pseudogene,DDX11L1-202,processed_transcript,ENSE00002234944,basic,1.0


_Read in Annotation Data_

In [7]:
# read in ensembl-uniprot data
ensembl1 = pandas.read_csv(unprocessed_data_location + 'Homo_sapiens.GRCh38.99.uniprot.tsv',
                           header = 0,
                           delimiter = '\t',
                           low_memory=False)
# replace "-"
ensembl1.replace('-','None', inplace=True)

# read in entrez-uniprot data
ensembl2 = pandas.read_csv(unprocessed_data_location + 'Homo_sapiens.GRCh38.99.entrez.tsv',
                           header = 0,
                           delimiter = '\t',
                           low_memory=False)

# replace "-"
ensembl2.replace('-','None', inplace=True)

# merge annotation datasets
ensembl_annot = pandas.merge(ensembl1[['gene_stable_id', 'transcript_stable_id', 'protein_stable_id', 'xref']],
                             ensembl2[['gene_stable_id', 'transcript_stable_id', 'protein_stable_id', 'xref']],
                             left_on=['gene_stable_id', 'transcript_stable_id', 'protein_stable_id'],
                             right_on=['gene_stable_id', 'transcript_stable_id', 'protein_stable_id'],
                             how='outer')

# rename columns
ensembl_annot.rename(columns={'xref_x': 'xref_uniprot', 'xref_y': 'xref_entrez'}, inplace=True)

# replace NaN with 'None'
ensembl_annot.fillna('None', inplace=True)

# preview data
ensembl_annot.head(n=3)

Unnamed: 0,gene_stable_id,transcript_stable_id,protein_stable_id,xref_uniprot,xref_entrez
0,ENSG00000186092,ENST00000641515,ENSP00000493376,A0A2U3U0J3,79501
1,ENSG00000186092,ENST00000335137,ENSP00000334393,Q8NH21,79501
2,ENSG00000284733,ENST00000426406,ENSP00000409316,Q6IEY1,729759


_Merge Ensembl Annotation and Gene Set Data_

In [8]:
# merge annotation data with enseble gene set
ensembl = pandas.merge(ensembl_geneset,
                       ensembl_annot,
                       left_on = ['gene_stable_id', 'transcript_stable_id'],
                       right_on = ['gene_stable_id', 'transcript_stable_id'],
                       how='outer')

# replace NaN with 'None'
ensembl.fillna('None', inplace=True)

# preview data
ensembl.head(n=3)

Unnamed: 0,ensembl_source,molecule_type,gene_stable_id,transcript_stable_id,exon_number,gene_name,gene_type,transcript_name,transcript_type,exon_stable_id,tag,transcript_support_level,protein_stable_id,xref_uniprot,xref_entrez
0,havana,gene,ENSG00000223972,,,DDX11L1,transcribed_unprocessed_pseudogene,,,,,,,,
1,havana,transcript,ENSG00000223972,ENST00000456328,,DDX11L1,transcribed_unprocessed_pseudogene,DDX11L1-202,processed_transcript,,basic,1.0,,,
2,havana,exon,ENSG00000223972,ENST00000456328,1.0,DDX11L1,transcribed_unprocessed_pseudogene,DDX11L1-202,processed_transcript,ENSE00002234944,basic,1.0,,,


***
***
**Process Data:** `Homo_sapiens.gene_info`

In [11]:
url = 'ftp://ftp.ncbi.nih.gov/gene/DATA/GENE_INFO/Mammalia/Homo_sapiens.gene_info.gz'
data_downloader(url, unprocessed_data_location)

Downloading gzipped data from ftp server
Decompressing and writing gzipped data


In [9]:
ncbi_gene = pandas.read_csv(unprocessed_data_location + 'Homo_sapiens.gene_info', header = 0, delimiter = '\t')

# replace "-" with "None"
ncbi_gene.replace('-','None', inplace=True)

# explode nested data
explode_df_ncbi_gene = explode(ncbi_gene.copy(), ['dbXrefs'], '|')

# remove identifier type, which appears before ':'
explode_df_ncbi_gene['dbXrefs'].replace('(^\w*\:)','', inplace=True, regex=True)

# remove unneeded columns
explode_df_ncbi_gene = explode_df_ncbi_gene[['GeneID', 'Symbol', 'Synonyms', 'dbXrefs', 'chromosome', 'map_location',
                                             'description', 'type_of_gene', 'Other_designations']]

# preview data
explode_df_ncbi_gene.head(n=3)

Unnamed: 0,GeneID,Symbol,Synonyms,dbXrefs,chromosome,map_location,description,type_of_gene,Other_designations
0,1,A1BG,A1B|ABG|GAB|HYST2477,138670,19,19q13.43,alpha-1-B glycoprotein,protein-coding,alpha-1B-glycoprotein|HEL-S-163pA|epididymis s...
1,1,A1BG,A1B|ABG|GAB|HYST2477,HGNC:5,19,19q13.43,alpha-1-B glycoprotein,protein-coding,alpha-1B-glycoprotein|HEL-S-163pA|epididymis s...
2,1,A1BG,A1B|ABG|GAB|HYST2477,ENSG00000121410,19,19q13.43,alpha-1-B glycoprotein,protein-coding,alpha-1B-glycoprotein|HEL-S-163pA|epididymis s...


***
**Process Data:** `promapping.txt`

In [None]:
url = 'https://proconsortium.org/download/current/promapping.txt'
data_downloader(url, unprocessed_data_location)

In [10]:
pro_mapping = pandas.read_csv(unprocessed_data_location + 'promapping.txt',
                              header = None,
                              names = ['pro_id', 'Entry', 'pro_mapping'],
                              delimiter = '\t')

# remove rows without 'UniProtKB'
pro_mapping = pro_mapping.loc[pro_mapping['Entry'].apply(lambda x: x.startswith('UniProtKB:'))] 

# remove identifier type, which appears before ':'
pro_mapping['Entry'].replace('(^\w*\:)','', inplace=True, regex=True)

# preview data
pro_mapping.head(n=3)

Unnamed: 0,pro_id,Entry,pro_mapping
6,PR:000000005,P37173,is_a
7,PR:000000005,P38438,is_a
8,PR:000000005,Q62312,is_a


***
**Merge Processed Data:** `hgnc` + `ensembl`

In [11]:
# rename columns before merging
ensembl.rename(columns={'gene_stable_id': 'ensembl_gene_id', 'xref_uniprot': 'uniprot_ids', 'xref_entrez': 'entrez_id', 'gene_name': 'symbol'}, inplace=True)

# merge uniprot and ncbi data
ensembl_hgnc_merged_data = pandas.merge(ensembl,
                                        hgnc,
                                        left_on=['ensembl_gene_id', 'entrez_id', 'uniprot_ids', 'symbol'],
                                        right_on=['ensembl_gene_id', 'entrez_id', 'uniprot_ids', 'symbol'],
                                        how='outer')

# replace NaN with 'None'
ensembl_hgnc_merged_data.fillna('None', inplace=True)

# preview data
ensembl_hgnc_merged_data.head(n=3)

Unnamed: 0,ensembl_source,molecule_type,ensembl_gene_id,transcript_stable_id,exon_number,symbol,gene_type,transcript_name,transcript_type,exon_stable_id,tag,transcript_support_level,protein_stable_id,uniprot_ids,entrez_id,hgnc_id,name,locus_group,location,alias_symbol
0,havana,gene,ENSG00000223972,,,DDX11L1,transcribed_unprocessed_pseudogene,,,,,,,,,,,,,
1,havana,transcript,ENSG00000223972,ENST00000456328,,DDX11L1,transcribed_unprocessed_pseudogene,DDX11L1-202,processed_transcript,,basic,1.0,,,,,,,,
2,havana,exon,ENSG00000223972,ENST00000456328,1.0,DDX11L1,transcribed_unprocessed_pseudogene,DDX11L1-202,processed_transcript,ENSE00002234944,basic,1.0,,,,,,,,


***
***
**Merge Processed Data:** `ensembl_hgnc_merged_data` + `Homo_sapiens.gene_info`

In [12]:
# rename columns before merging
explode_df_ncbi_gene.rename(columns={'GeneID': 'entrez_id', 'Symbol': 'symbol', 'map_location': 'location', 'description': 'name'}, inplace=True)

# update cell values
ensembl_hgnc_merged_data['gene_type'].replace('protein_coding', 'protein-coding', inplace=True, regex=True)
ensembl_hgnc_merged_data['gene_type'].replace('pseudogene', 'pseudo', inplace=True, regex=True)
ensembl_hgnc_merged_data['locus_group'].replace('protein-coding gene', 'protein-coding', inplace=True, regex=True)
ensembl_hgnc_merged_data['locus_group'].replace('non-coding RNA', 'ncRNA', inplace=True, regex=True)
ensembl_hgnc_merged_data['locus_group'].replace('protein-coding gene', 'protein-coding', inplace=True, regex=True)
ensembl_hgnc_merged_data['locus_group'].replace('pseudogene', 'pseudo', inplace=True, regex=True)

# make sure that merge columns are of same type
explode_df_ncbi_gene['entrez_id'] = explode_df_ncbi_gene['entrez_id'].astype(str)

# merge uniprot and ncbi data
ensembl_hgnc_ncbi_merged_data = pandas.merge(ensembl_hgnc_merged_data,
                                             explode_df_ncbi_gene,
                                             left_on=['entrez_id', 'symbol', 'location', 'name'],
                                             right_on=['entrez_id', 'symbol', 'location', 'name'],
                                             how='outer')

# replace NaN with 'None'
ensembl_hgnc_ncbi_merged_data.fillna('None', inplace=True)

# preview data
ensembl_hgnc_ncbi_merged_data.head(n=3)

Unnamed: 0,ensembl_source,molecule_type,ensembl_gene_id,transcript_stable_id,exon_number,symbol,gene_type,transcript_name,transcript_type,exon_stable_id,...,hgnc_id,name,locus_group,location,alias_symbol,Synonyms,dbXrefs,chromosome,type_of_gene,Other_designations
0,havana,gene,ENSG00000223972,,,DDX11L1,transcribed_unprocessed_pseudo,,,,...,,,,,,,,,,
1,havana,transcript,ENSG00000223972,ENST00000456328,,DDX11L1,transcribed_unprocessed_pseudo,DDX11L1-202,processed_transcript,,...,,,,,,,,,,
2,havana,exon,ENSG00000223972,ENST00000456328,1.0,DDX11L1,transcribed_unprocessed_pseudo,DDX11L1-202,processed_transcript,ENSE00002234944,...,,,,,,,,,,


_Clean Merged Data_

In [13]:
# clean up merged data by combining columns of same type and removing un-needed columns
gene_type = []

# loop over data and fill in missing values
for idx, row in tqdm(ensembl_hgnc_ncbi_merged_data.iterrows(), total=ensembl_hgnc_ncbi_merged_data.shape[0]):
    if row['locus_group'] != 'None' and row['type_of_gene'] != 'None' and row['gene_type'] != 'None':
        if (row['locus_group'] == row['type_of_gene']):
            gene_type.append(row['locus_group']) 
        else:
            gene_type.append('{} (HGNC)|{} (Entrez Gene)|{} (Ensembl)'.format(row['locus_group'], row['type_of_gene'], row['gene_type']))
    elif row['locus_group'] != 'None' and (row['entrez_id'] == 'None' and row['gene_type'] == 'None'):
        gene_type.append(row['locus_group'])
    elif row['type_of_gene'] != 'None' and (row['locus_group'] == 'None' and row['gene_type'] == 'None'):
        gene_type.append(row['type_of_gene'])
    elif row['gene_type'] != 'None' and (row['locus_group'] == 'None' and row['type_of_gene'] == 'None'):
        gene_type.append(row['gene_type'])
    else:
        gene_type.append('None')
            
# reduce columns
ensembl_hgnc_ncbi_merged_data_clean = ensembl_hgnc_ncbi_merged_data.copy()
ensembl_hgnc_ncbi_merged_data_clean = ensembl_hgnc_ncbi_merged_data_clean[['ensembl_gene_id', 'transcript_stable_id', 'protein_stable_id',
                                                                           'uniprot_ids', 'entrez_id', 'hgnc_id', 'chromosome', 'symbol',
                                                                           'location', 'name', 'alias_symbol', 'Synonyms', 'Other_designations',
                                                                           'dbXrefs', 'transcript_name', 'transcript_type', 'gene_type']]

# add cleaned columns
ensembl_hgnc_ncbi_merged_data_clean['gene_type_cleaned'] = gene_type

# remove duplicates
ensembl_hgnc_ncbi_merged_data_clean.drop_duplicates(subset=None, keep='first', inplace=True)
    
# preview data
ensembl_hgnc_ncbi_merged_data_clean.head(n=3)

100%|██████████| 5337480/5337480 [17:28<00:00, 5088.43it/s] 


Unnamed: 0,ensembl_gene_id,transcript_stable_id,protein_stable_id,uniprot_ids,entrez_id,hgnc_id,chromosome,symbol,location,name,alias_symbol,Synonyms,Other_designations,dbXrefs,transcript_name,transcript_type,gene_type,gene_type_cleaned
0,ENSG00000223972,,,,,,,DDX11L1,,,,,,,,,transcribed_unprocessed_pseudo,transcribed_unprocessed_pseudo
1,ENSG00000223972,ENST00000456328,,,,,,DDX11L1,,,,,,,DDX11L1-202,processed_transcript,transcribed_unprocessed_pseudo,transcribed_unprocessed_pseudo
5,ENSG00000223972,ENST00000450305,,,,,,DDX11L1,,,,,,,DDX11L1-201,transcribed_unprocessed_pseudogene,transcribed_unprocessed_pseudo,transcribed_unprocessed_pseudo


***
***
**Merge Processed Data:** `ensembl_ncbi_merged_data_clean` + `promapping.txt`  

In [14]:
# rename columns before merging
pro_mapping.rename(columns={'Entry': 'uniprot_ids'}, inplace=True)

# merge uniprot and ncbi data
merged_data = pandas.merge(ensembl_hgnc_ncbi_merged_data_clean,
                           pro_mapping,
                           left_on='uniprot_ids',
                           right_on='uniprot_ids',
                           how='outer')

# replace NaN with 'None'
merged_data.fillna('None', inplace=True)

# preview data
merged_data.head(n=3)

Unnamed: 0,ensembl_gene_id,transcript_stable_id,protein_stable_id,uniprot_ids,entrez_id,hgnc_id,chromosome,symbol,location,name,alias_symbol,Synonyms,Other_designations,dbXrefs,transcript_name,transcript_type,gene_type,gene_type_cleaned,pro_id,pro_mapping
0,ENSG00000223972,,,,,,,DDX11L1,,,,,,,,,transcribed_unprocessed_pseudo,transcribed_unprocessed_pseudo,,
1,ENSG00000223972,ENST00000456328,,,,,,DDX11L1,,,,,,,DDX11L1-202,processed_transcript,transcribed_unprocessed_pseudo,transcribed_unprocessed_pseudo,,
2,ENSG00000223972,ENST00000450305,,,,,,DDX11L1,,,,,,,DDX11L1-201,transcribed_unprocessed_pseudogene,transcribed_unprocessed_pseudo,transcribed_unprocessed_pseudo,,


**Clean Full Merged Data**

In [21]:
# remove duplicates
merged_data_clean = merged_data.drop_duplicates(subset=None, keep='first')

# replace NaN with 'None'
merged_data_clean.fillna('None', inplace=True)

# clean up gene symbols that have been converted into dates
clean_dates = []

for x in tqdm(list(merged_data_clean['symbol'])):
    if '-' in x and len(x.split('-')[0]) < 3 and len(x.split('-')[1]) == 3:
        clean_dates.append(x.split('-')[1].upper() + x.split('-')[0])
    else:
        clean_dates.append(x)
    
merged_data_clean['symbol'] = clean_dates

# write data
merged_data_clean.to_csv(processed_data_location + 'Merged_Human_Ensembl_Entrez_HGNC_Uniprot_Identifiers.txt',
                         header = True,
                         sep = '\t',
                         index = False)
    
# preview data
merged_data_clean.head(n=3)

100%|██████████| 851764/851764 [00:00<00:00, 1667300.82it/s]


Unnamed: 0,ensembl_gene_id,transcript_stable_id,protein_stable_id,uniprot_ids,entrez_id,hgnc_id,chromosome,symbol,location,name,alias_symbol,Synonyms,Other_designations,dbXrefs,transcript_name,transcript_type,gene_type,gene_type_cleaned,pro_id,pro_mapping
0,ENSG00000223972,,,,,,,DDX11L1,,,,,,,,,transcribed_unprocessed_pseudo,transcribed_unprocessed_pseudo,,
1,ENSG00000223972,ENST00000456328,,,,,,DDX11L1,,,,,,,DDX11L1-202,processed_transcript,transcribed_unprocessed_pseudo,transcribed_unprocessed_pseudo,,
2,ENSG00000223972,ENST00000450305,,,,,,DDX11L1,,,,,,,DDX11L1-201,transcribed_unprocessed_pseudogene,transcribed_unprocessed_pseudo,transcribed_unprocessed_pseudo,,


<br>

***
***

### Ensembl Gene-Ensembl Transcript <a class="anchor" id="ensemblgene-ensembltranscript"></a>

**Purpose:** To map Ensembl gene identifiers to Ensembl transcript identifiers when creating the following edges: 
- rna-cell   
- rna-tissue types  

**Output:** [`ENSEMBL_GENE_ENSEMBL_TRANSCRIPT_MAP.txt`](https://www.dropbox.com/s/8n1isqytlz2z1g6/ENSEMBL_GENE_ENSEMBL_TRANSCRIPT_MAP.txt?dl=1)

In [None]:
# de-dup data
df_ens = merged_data_clean.drop_duplicates(subset=['ensembl_gene_id', 'transcript_stable_id'], keep='first', inplace=False) 

# reformat data and write it out
with open(processed_data_location + 'ENSEMBL_GENE_ENSEMBL_TRANSCRIPT_MAP.txt', 'w') as outfile:
    for idx, row in tqdm(df_ens.iterrows(), total=df_ens.shape[0]):
        if row['ensembl_gene_id'] != 'None' and row['transcript_stable_id'] != 'None': 
            outfile.write(row['ensembl_gene_id'].strip() + '\t' + row['transcript_stable_id'].strip() + '\t' + row['gene_type_cleaned'] + '\t' + row['transcript_type'] + '\n')

outfile.close()

**Preview Processed Data**

In [84]:
eget_data = pandas.read_csv(processed_data_location + 'ENSEMBL_GENE_ENSEMBL_TRANSCRIPT_MAP.txt',
                            header = None,
                            names=['Ensembl_Gene_IDs', 'Ensembl_Transcript_IDs', 'Gene_Type', 'Transcript_Type'],
                            delimiter = '\t')

print('There are {edge_count} ensembl gene-ensembl transcript edges'.format(edge_count=len(eget_data)))

There are 243643 ensembl gene-ensembl transcript edges


In [85]:
eget_data.head(n=5)

Unnamed: 0,Ensembl_Gene_IDs,Ensembl_Transcript_IDs,Gene_Type,Transcript_Type
0,ENSG00000223972,ENST00000456328,transcribed_unprocessed_pseudo,processed_transcript
1,ENSG00000223972,ENST00000450305,transcribed_unprocessed_pseudo,transcribed_unprocessed_pseudogene
2,ENSG00000227232,ENST00000488147,unprocessed_pseudo,unprocessed_pseudogene
3,ENSG00000278267,ENST00000619216,miRNA,miRNA
4,ENSG00000243485,ENST00000473358,lncRNA,lncRNA


<br>

***

### Ensembl Gene-Entrez Gene <a class="anchor" id="ensemblgene-entrezgene"></a>


**Purpose:** To map Ensembl gene identifiers to Entrez gene identifiers when creating the following edges:   
- gene-gene

**Output:** [`ENSEMBL_GENE_ENTREZ_GENE_MAP.txt`](https://www.dropbox.com/s/crghjh2we5v7pws/ENSEMBL_GENE_ENTREZ_GENE_MAP.txt?dl=1)

In [None]:
# de-dup data
df_ens = merged_data_clean.drop_duplicates(subset=['ensembl_gene_id', 'entrez_id'], keep='first', inplace=False) 

# reformat data and write it out
with open(processed_data_location + 'ENSEMBL_GENE_ENTREZ_GENE_MAP.txt', 'w') as outfile:
    for idx, row in tqdm(df_ens.iterrows(), total=df_ens.shape[0]):
        if row['ensembl_gene_id'] != 'None' and row['entrez_id'] != 'None': 
            outfile.write(row['ensembl_gene_id'].strip() + '\t' + row['entrez_id'].strip() + '\t' + row['gene_type_cleaned'] + '\n')

outfile.close()

**Preview Processed Data**

In [81]:
egeg_data = pandas.read_csv(processed_data_location + 'ENSEMBL_GENE_ENTREZ_GENE_MAP.txt',
                            header = None,
                            names=['Ensembl_Gene_IDs', 'Entrez_Gene_IDs', 'Gene_Type'],
                            delimiter = '\t')

print('There are {edge_count} ensembl gene-entrez gene edges'.format(edge_count=len(egeg_data)))


There are 42196 ensembl gene-entrez gene edges


In [82]:
egeg_data.head(n=5)

Unnamed: 0,Ensembl_Gene_IDs,Entrez_Gene_IDs,Gene_Type
0,ENSG00000187634,148398,protein-coding
1,ENSG00000188976,26155,protein-coding
2,ENSG00000187961,339451,protein-coding
3,ENSG00000187583,84069,protein-coding
4,ENSG00000187642,84808,protein-coding


<br>

***

### Ensembl Transcript-Protein Ontology <a class="anchor" id="ensembltranscript-proteinontology"></a>

**Purpose:** To map Ensembl transcript identifiers to Protein Ontology identifiers when creating the following edges: 
- rna-protein  

**Output:** [`ENSEMBL_TRANSCRIPT_PROTEIN_ONTOLOGY_MAP.txt`](https://www.dropbox.com/s/ckrw11nfyu6a08c/ENSEMBL_TRANSCRIPT_PROTEIN_ONTOLOGY_MAP.txt?dl=1)


In [None]:
# de-dup data
df_po = merged_data_clean.drop_duplicates(subset=['transcript_stable_id', 'pro_id'], keep='first', inplace=False) 

# reformat data and write it out
with open(processed_data_location + 'ENSEMBL_TRANSCRIPT_PROTEIN_ONTOLOGY_MAP.txt', 'w') as outfile:
    for idx, row in tqdm(df_po.iterrows(), total=df_po.shape[0]):
        if row['transcript_stable_id'] != 'None' and row['pro_id'] != 'None': 
            outfile.write(row['transcript_stable_id'].strip() + '\t' + row['pro_id'].replace('PR:', 'PR_').strip() + '\t' + row['gene_type_cleaned'] + '\t' + row['transcript_type'] + '\n')

outfile.close()

**Preview Processed Data**

In [78]:
etpr_data = pandas.read_csv(processed_data_location + 'ENSEMBL_TRANSCRIPT_PROTEIN_ONTOLOGY_MAP.txt',
                            header = None,
                            names=['Ensembl_Transcript_IDs', 'Protein_Ontology_IDs', 'Gene_Type', 'Transcript_Type'],
                            delimiter = '\t',
                            low_memory=False)

print('There are {edge_count} ensembl transcript-protein ontology edges'.format(edge_count=len(etpr_data)))

There are 92270 ensembl transcript-protein ontology edges


In [79]:
etpr_data.head(n=5)

Unnamed: 0,Ensembl_Transcript_IDs,Protein_Ontology_IDs,Gene_Type,Transcript_Type
0,ENST00000335137,PR_000011836,protein-coding,protein_coding
1,ENST00000335137,PR_Q8NH21,protein-coding,protein_coding
2,ENST00000426406,PR_000011834,protein-coding,protein_coding
3,ENST00000426406,PR_Q6IEY1,protein-coding,protein_coding
4,ENST00000332831,PR_000011834,protein-coding,protein_coding


<br>

***
***

### Gene Symbol-Ensembl Transcript <a class="anchor" id="genesymbol-ensembltranscript"></a>

**Purpose:** To map gene symbols to Ensembl transcript identifiers when creating the following edges: 
- gene-rna 

**Output:** [`GENE_SYMBOL_ENSEMBL_TRANSCRIPT_MAP.txt`](https://www.dropbox.com/s/5o8yt7eejbf819x/GENE_SYMBOL_ENSEMBL_TRANSCRIPT_MAP.txt?dl=1)

In [74]:
# de-dup data
df_ens = merged_data_clean.drop_duplicates(subset=['symbol', 'transcript_stable_id'], keep='first', inplace=False) 

# reformat data and write it out
with open(processed_data_location + 'GENE_SYMBOL_ENSEMBL_TRANSCRIPT_MAP.txt', 'w') as outfile:
    for idx, row in tqdm(df_ens.iterrows(), total=df_ens.shape[0]):
        if row['symbol'] != 'None' and row['transcript_stable_id'] != 'None': 
            outfile.write(row['symbol'].strip() + '\t' + row['transcript_stable_id'].strip() + '\t' + row['gene_type_cleaned'] + '\t' + row['transcript_type'] + '\n')

outfile.close()

**Preview Processed Data**

In [75]:
set_data = pandas.read_csv(processed_data_location + 'GENE_SYMBOL_ENSEMBL_TRANSCRIPT_MAP.txt',
                            header = None,
                            names=['Gene_Symbols', 'Ensembl_Transcript_IDs', 'Gene_Type', 'Transcript_Type'],
                            delimiter = '\t')

print('There are {edge_count} gene symbol-ensembl transcript edges'.format(edge_count=len(set_data)))

There are 227818 gene symbol-ensembl transcript edges


In [76]:
set_data.head(n=5)

Unnamed: 0,Gene_Symbols,Ensembl_Transcript_IDs,Gene_Type,Transcript_Type
0,DDX11L1,ENST00000456328,transcribed_unprocessed_pseudo,processed_transcript
1,DDX11L1,ENST00000450305,transcribed_unprocessed_pseudo,transcribed_unprocessed_pseudogene
2,WASH7P,ENST00000488147,unprocessed_pseudo,unprocessed_pseudogene
3,MIR6859-1,ENST00000619216,miRNA,miRNA
4,MIR1302-2HG,ENST00000473358,lncRNA,lncRNA


<br>


***

### Entrez Gene-Protein Ontology <a class="anchor" id="entrezgene-proteinontology"></a>

**Purpose:** To map Protein Ontology identifiers to Ensembl transcript identifiers when creating the following edges:   
- chemical-protein  

**Output:** [`ENTREZ_GENE_PRO_ONTOLOGY_MAP.txt`](https://www.dropbox.com/s/ufbp5o6zgagriw7/ENTREZ_GENE_PRO_ONTOLOGY_MAP.txt?dl=1)

In [None]:
# de-dup data
df_egpr = merged_data_clean.drop_duplicates(subset=['entrez_id', 'pro_id'], keep='first', inplace=False) 

# reformat data and write it out
with open(processed_data_location + 'ENTREZ_GENE_PRO_ONTOLOGY_MAP.txt', 'w') as outfile:
    for idx, row in tqdm(df_egpr.iterrows(), total=df_egpr.shape[0]):
        if row['entrez_id'] != 'None' and row['pro_id'] != 'None': 
            outfile.write(row['entrez_id'].strip() + '\t' + row['pro_id'].replace(':', '_').strip() + '\t' + row['gene_type_cleaned'] + '\n')

outfile.close()

**Preview Processed Data**

In [72]:
egpr_data = pandas.read_csv(processed_data_location + 'ENTREZ_GENE_PRO_ONTOLOGY_MAP.txt',
                            header = None,
                            names=['Gene_IDs', 'Protein_Ontology_IDs', 'Gene_Type'],
                            delimiter = '\t')

print('There are {edge_count} entrez gene-protein ontology edges'.format(edge_count=len(egpr_data)))

There are 38996 entrez gene-protein ontology edges


In [73]:
egpr_data.head(n=5)

Unnamed: 0,Gene_IDs,Protein_Ontology_IDs,Gene_Type
0,79501,PR_000011836,protein-coding
1,79501,PR_Q8NH21,protein-coding
2,729759,PR_000011834,protein-coding
3,729759,PR_Q6IEY1,protein-coding
4,81399,PR_000011834,protein-coding


<BR>

***


### STRING-Protein Ontology <a class="anchor" id="string-proteinontology"></a>

**Purpose:** To map STRING identifiers to Protein Ontology identifiers when creating the following edges:   
- protein-protein  

**Output:** [`STRING_PRO_ONTOLOGY_MAP.txt`](https://www.dropbox.com/s/mekh5lr3bxp7gvu/STRING_PRO_ONTOLOGY_MAP.txt?dl=1)

In [None]:
# de-dup data
df_ens = merged_data_clean.drop_duplicates(subset=['protein_stable_id', 'pro_id'], keep='first', inplace=False) 

# reformat data and write it out
with open(processed_data_location + 'STRING_PRO_ONTOLOGY_MAP.txt', 'w') as outfile:
    for idx, row in tqdm(df_ens.iterrows(), total=df_ens.shape[0]):
        if row['protein_stable_id'] != 'None' and row['pro_id'] != 'None':
            outfile.write('9606.' + row['protein_stable_id'].strip() + '\t' + row['pro_id'].replace(':', '_').strip() + '\t' + row['gene_type_cleaned'] + '\n')

outfile.close()

**Preview Processed Data**

In [69]:
stpr_data = pandas.read_csv(processed_data_location + 'STRING_PRO_ONTOLOGY_MAP.txt',
                            header = None,
                            names=['STRING_IDs', 'Protein_Ontology_IDs', 'Gene_Type'],
                            delimiter = '\t')

print('There are {edge_count} string-protein ontology edges'.format(edge_count=len(stpr_data)))

There are 92270 string-protein ontology edges


In [70]:
stpr_data.head(n=5)

Unnamed: 0,STRING_IDs,Protein_Ontology_IDs,Gene_Type
0,9606.ENSP00000334393,PR_000011836,protein-coding
1,9606.ENSP00000334393,PR_Q8NH21,protein-coding
2,9606.ENSP00000409316,PR_000011834,protein-coding
3,9606.ENSP00000409316,PR_Q6IEY1,protein-coding
4,9606.ENSP00000329982,PR_000011834,protein-coding


<br>

***
***

### Uniprot Accession-Protein Ontology <a class="anchor" id="uniprotaccession-proteinontology"></a>

**Purpose:** To map Uniprot accession identifiers to Protein Ontology identifiers when creating the following edges:  
- protein-gobp  
- protein-gomf  
- protein-gocc  
- protein-complex  
- protein-cofactor  
- protein-catalyst 
- protein-reaction  
- protein-pathway

**Output:** [`UNIPROT_ACCESSION_PRO_ONTOLOGY_MAP.txt`](https://www.dropbox.com/s/txp8tqdipzwus9p/UNIPROT_ACCESSION_PRO_ONTOLOGY_MAP.txt?dl=1)

In [None]:
# de-dup data
df_ens = merged_data_clean.drop_duplicates(subset=['uniprot_ids', 'pro_id'], keep='first', inplace=False) 

# reformat data and write it out
with open(processed_data_location + 'UNIPROT_ACCESSION_PRO_ONTOLOGY_MAP.txt', 'w') as outfile:
    for idx, row in tqdm(df_ens.iterrows(), total=df_ens.shape[0]):
        if row['uniprot_ids'] != 'None' and row['pro_id'] != 'None': 
            outfile.write(row['uniprot_ids'].strip() + '\t' + row['pro_id'].replace(':', '_').strip() + '\t' + row['gene_type_cleaned'] + '\n')

outfile.close()

**Preview Processed Data**

In [66]:
uapr_data = pandas.read_csv(processed_data_location + 'UNIPROT_ACCESSION_PRO_ONTOLOGY_MAP.txt',
                            header = None,
                            names=['Uniprot_Accession_IDs', 'Protein_Ontology_IDs', 'Gene_Types'],
                            delimiter = '\t')

print('There are {edge_count} uniprot accession-protein ontology edges'.format(edge_count=len(uapr_data)))

There are 313776 uniprot accession-protein ontology edges


In [67]:
uapr_data.head(n=5)

Unnamed: 0,Uniprot_Accession_IDs,Protein_Ontology_IDs,Gene_Types
0,Q8NH21,PR_000011836,protein-coding
1,Q8NH21,PR_Q8NH21,protein-coding
2,Q6IEY1,PR_000011834,protein-coding
3,Q6IEY1,PR_Q6IEY1,protein-coding
4,Q96NU1,PR_000014441,protein-coding


<br>

***
***
### Other Identifier Mapping <a class="anchor" id="other-identifier-mapping"></a>
***
* [ChEBI Identifiers](#mesh-chebi)  
* [Human Protein Atlas Tissue and Cell Types](#hpa-uberon) 
* [Human Disease and Phenotype Identifiers](#disease-identifiers) 

***
***

### ChEBI-MeSH Identifiers <a class="anchor" id="mesh-chebi"></a>

**Data Source Wiki Page:** [mapping-mesh-to-chebi](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#mapping-mesh-identifiers-to-chebi-identifiers)  

**Purpose:** Map MeSH identifiers to ChEBI identifiers when creating the following edges:  
- chemical-gene  
- chemical-disease

**Dependencies:** This script assumes that the [`ncbo_rest_api.py`](https://github.com/callahantiff/PheKnowLator/blob/development/scripts/python/ncbo_rest_api.py) script was run and the data generated from this file was written to `./resources/processed_data/temp`. 

**Output:** [`MESH_CHEBI_MAP.txt`](https://www.dropbox.com/s/5nr87v5h6x8oc1b/MESH_CHEBI_MAP.txt?dl=1)


In [100]:
with open(processed_data_location + 'MESH_CHEBI_MAP.txt', 'w') as out:
    for filename in tqdm(glob.glob(processed_data_location + 'temp/*.txt')):
        for row in list(filter(None, open(filename, 'r').read().split('\n'))):
            mesh = '_'.join(row.split('\t')[0].split('/')[-2:])
            chebi = row.split('\t')[1].split('/')[-1]
            out.write(mesh + '\t' + chebi + '\n')

out.close()



100%|██████████| 44/44 [00:00<00:00, 670.44it/s]


**Preview Processed Data**

In [101]:
mc_data = pandas.read_csv(processed_data_location + 'MESH_CHEBI_MAP.txt',
                          delimiter = '\t',
                          header=None,
                          names=['MeSH_IDs', 'ChEBI_IDs'])

print('There are {edge_count} MeSH-ChEBI edges'.format(edge_count=len(mc_data)))

There are 11434 MeSH-ChEBI edges


In [102]:
mc_data.head(n=5)

Unnamed: 0,MeSH_IDs,ChEBI_IDs
0,MESH_C535085,CHEBI_133814
1,MESH_C008574,CHEBI_17221
2,MESH_C492482,CHEBI_34581
3,MESH_C007556,CHEBI_135978
4,MESH_C500395,CHEBI_29138


<br>

***

### Disease and Phenotype Identifiers <a class="anchor" id="disease-identifiers"></a>

**Data Source Wiki Page:** [disgenet](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#disgenet)  

**Purpose:** This script downloads the [disease_mappings.tsv](https://www.disgenet.org/static/disgenet_ap1/files/downloads/disease_mappings.tsv.gz) to map UMLS identifiers to Human Disease and Human Phenotype identifiers when creating the following edges:  
- chemical-disease  
- disease-phenotype

**Output:**   
- Human Disease Ontology Mappings ➞ [`DISEASE_DOID_MAP.txt`](https://www.dropbox.com/s/q30ferujl7k574j/DISEASE_DOID_MAP.txt?dl=1)  
- Human Phenotype Ontology Mappings ➞ [`PHENOTYPE_HPO_MAP.txt`](https://www.dropbox.com/s/5ayl0c5qm7r4tdm/PHENOTYPE_HPO_MAP.txt?dl=1)

In [None]:
url = 'https://www.disgenet.org/static/disgenet_ap1/files/downloads/disease_mappings.tsv.gz'
data_downloader(url, unprocessed_data_location)

In [103]:
disease_data = pandas.read_csv(unprocessed_data_location + 'disease_mappings.tsv',
                               header = 0,
                               delimiter = '|')

disease_data.head(n=3)

Unnamed: 0,diseaseId,name,vocabulary,code,vocabularyName
0,C0018923,Hemangiosarcoma,DO,1816,angiosarcoma
1,C0854893,Angiosarcoma non-metastatic,DO,1816,angiosarcoma
2,C0033999,Pterygium,DO,2116,pterygium


In [None]:
# convert to dictionary
disease_dict = {}

for idx, row in tqdm(disease_data.iterrows(), total=disease_data.shape[0]):
    if row['vocabulary'] == 'MSH':
        mesh_finder(disease_data, row['code'], 'MESH:', disease_dict)
    elif row['vocabulary'] == 'OMIM':
        mesh_finder(disease_data, row['code'], 'OMIM:', disease_dict)
    elif row['vocabulary'] == 'ORDO':
        mesh_finder(disease_data, row['code'], 'ORPHA:', disease_dict)
    elif row['diseaseId'] in disease_dict.keys():
        if row['vocabulary'] == 'DO':
            disease_dict[row['diseaseId']].append('DOID_' + row['code']) 
        if row['vocabulary'] == 'HPO':
            disease_dict[row['diseaseId']].append(row['code'].replace('HP:', 'HP_'))
    else:
        if row['vocabulary'] == 'DO':
            disease_dict[row['diseaseId']] = ['DOID_' + row['code']] 
        if row['vocabulary'] == 'HPO':
            disease_dict[row['diseaseId']] = [row['code'].replace('HP:', 'HP_')] 

In [None]:
# reformat data and write it out
with open(processed_data_location + 'DISEASE_DOID_MAP.txt', 'w') as outfile1,open(processed_data_location + 'PHENOTYPE_HPO_MAP.txt', 'w') as outfile2:
    for key, value in tqdm(disease_dict.items()):
        for i in value:
            # get diseases
            if i.startswith('DOID_'): 
                outfile1.write(key.split(':')[-1] + '\t' + i + '\n')

            # get phenotypes
            if i.startswith('HP_'): 
                outfile2.write(key.split(':')[-1] + '\t' + i + '\n')

outfile1.close()
outfile2.close()

**Preview Processed Data**

_Preview Disease (DOID) Mappings_

In [106]:
dis_data = pandas.read_csv(processed_data_location + 'DISEASE_DOID_MAP.txt',
                           header = None,
                           names=['Disease_IDs', 'DOID_IDs'],
                           delimiter = '\t')

print('There are {} disease-DOID edges'.format(len(dis_data)))

There are 46720 disease-DOID edges


In [107]:
dis_data.head(n=5)

Unnamed: 0,Disease_IDs,DOID_IDs
0,C0018923,DOID_0001816
1,C0854893,DOID_0001816
2,C0033999,DOID_0002116
3,C4520843,DOID_0002116
4,C0024814,DOID_0014667


_Preview Phenotype (HP) Mappings_

In [108]:
hp_data = pandas.read_csv(processed_data_location + 'PHENOTYPE_HPO_MAP.txt',
                          header = None,
                          names=['Disease_IDs', 'HP_IDs'],
                          delimiter = '\t')

print('There are {} phenotype-HPO edges'.format(len(hp_data)))

There are 21676 phenotype-HPO edges


In [109]:
hp_data.head(n=5)

Unnamed: 0,Disease_IDs,HP_IDs
0,C0018923,HP_0200058
1,C0033999,HP_0001059
2,C4520843,HP_0001059
3,C0037199,HP_0000246
4,C0008780,HP_0012265


<br>

***
***

### Human Protein Atlas/GTEx Tissue/Cells - UBERON + Cell Ontology + Cell Line Ontology <a class="anchor" id="hpa-uberon"></a>

**Data Source Wiki Page:**  
- [human-protein-atlas](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#human-protein-atlas) 
- [genotype-tissue-expression-project](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#the-genotype-tissue-expression-gtex-project)  

<br>

**Purpose:** Downloads a query for cell, tissue, and blood types with overexpressed protein-coding genes in the human proteome ([`proteinatlas_search.tsv`](https://www.proteinatlas.org/api/search_download.php?search=&columns=g,eg,up,pe,rnatsm,rnaclsm,rnacasm,rnabrsm,rnabcsm,rnablsm,scl,t_RNA_adipose_tissue,t_RNA_adrenal_gland,t_RNA_amygdala,t_RNA_appendix,t_RNA_basal_ganglia,t_RNA_bone_marrow,t_RNA_breast,t_RNA_cerebellum,t_RNA_cerebral_cortex,t_RNA_cervix,_uterine,t_RNA_colon,t_RNA_corpus_callosum,t_RNA_ductus_deferens,t_RNA_duodenum,t_RNA_endometrium_1,t_RNA_epididymis,t_RNA_esophagus,t_RNA_fallopian_tube,t_RNA_gallbladder,t_RNA_heart_muscle,t_RNA_hippocampal_formation,t_RNA_hypothalamus,t_RNA_kidney,t_RNA_liver,t_RNA_lung,t_RNA_lymph_node,t_RNA_midbrain,t_RNA_olfactory_region,t_RNA_ovary,t_RNA_pancreas,t_RNA_parathyroid_gland,t_RNA_pituitary_gland,t_RNA_placenta,t_RNA_pons_and_medulla,t_RNA_prostate,t_RNA_rectum,t_RNA_retina,t_RNA_salivary_gland,t_RNA_seminal_vesicle,t_RNA_skeletal_muscle,t_RNA_skin_1,t_RNA_small_intestine,t_RNA_smooth_muscle,t_RNA_spinal_cord,t_RNA_spleen,t_RNA_stomach_1,t_RNA_testis,t_RNA_thalamus,t_RNA_thymus,t_RNA_thyroid_gland,t_RNA_tongue,t_RNA_tonsil,t_RNA_urinary_bladder,t_RNA_vagina,t_RNA_B-cells,t_RNA_dendritic_cells,t_RNA_granulocytes,t_RNA_monocytes,t_RNA_NK-cells,t_RNA_T-cells,t_RNA_total_PBMC,cell_RNA_A-431,cell_RNA_A549,cell_RNA_AF22,cell_RNA_AN3-CA,cell_RNA_ASC_diff,cell_RNA_ASC_TERT1,cell_RNA_BEWO,cell_RNA_BJ,cell_RNA_BJ_hTERT+,cell_RNA_BJ_hTERT+_SV40_Large_T+,cell_RNA_BJ_hTERT+_SV40_Large_T+_RasG12V,cell_RNA_CACO-2,cell_RNA_CAPAN-2,cell_RNA_Daudi,cell_RNA_EFO-21,cell_RNA_fHDF/TERT166,cell_RNA_HaCaT,cell_RNA_HAP1,cell_RNA_HBEC3-KT,cell_RNA_HBF_TERT88,cell_RNA_HDLM-2,cell_RNA_HEK_293,cell_RNA_HEL,cell_RNA_HeLa,cell_RNA_Hep_G2,cell_RNA_HHSteC,cell_RNA_HL-60,cell_RNA_HMC-1,cell_RNA_HSkMC,cell_RNA_hTCEpi,cell_RNA_hTEC/SVTERT24-B,cell_RNA_hTERT-HME1,cell_RNA_HUVEC_TERT2,cell_RNA_K-562,cell_RNA_Karpas-707,cell_RNA_LHCN-M2,cell_RNA_MCF7,cell_RNA_MOLT-4,cell_RNA_NB-4,cell_RNA_NTERA-2,cell_RNA_PC-3,cell_RNA_REH,cell_RNA_RH-30,cell_RNA_RPMI-8226,cell_RNA_RPTEC_TERT1,cell_RNA_RT4,cell_RNA_SCLC-21H,cell_RNA_SH-SY5Y,cell_RNA_SiHa,cell_RNA_SK-BR-3,cell_RNA_SK-MEL-30,cell_RNA_T-47d,cell_RNA_THP-1,cell_RNA_TIME,cell_RNA_U-138_MG,cell_RNA_U-2_OS,cell_RNA_U-2197,cell_RNA_U-251_MG,cell_RNA_U-266/70,cell_RNA_U-266/84,cell_RNA_U-698,cell_RNA_U-87_MG,cell_RNA_U-937,cell_RNA_WM-115,blood_RNA_basophil,blood_RNA_classical_monocyte,blood_RNA_eosinophil,blood_RNA_gdT-cell,blood_RNA_intermediate_monocyte,blood_RNA_MAIT_T-cell,blood_RNA_memory_B-cell,blood_RNA_memory_CD4_T-cell,blood_RNA_memory_CD8_T-cell,blood_RNA_myeloid_DC,blood_RNA_naive_B-cell,blood_RNA_naive_CD4_T-cell,blood_RNA_naive_CD8_T-cell,blood_RNA_neutrophil,blood_RNA_NK-cell,blood_RNA_non-classical_monocyte,blood_RNA_plasmacytoid_DC,blood_RNA_T-reg,blood_RNA_total_PBMC,brain_RNA_amygdala,brain_RNA_basal_ganglia,brain_RNA_cerebellum,brain_RNA_cerebral_cortex,brain_RNA_hippocampal_formation,brain_RNA_hypothalamus,brain_RNA_midbrain,brain_RNA_olfactory_region,brain_RNA_pons_and_medulla,brain_RNA_thalamus&format=tsv)) and median gene-level TPM by tissue for all genes that are not protein-coding ([`GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_median_tpm.gct`](https://storage.googleapis.com/gtex_analysis_v8/rna_seq_data/GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_median_tpm.gct.gz)) in order to create mappings between cell and tissue type strings to the Uber-Anatomy, Cell Ontology, and Cell Line Ontology concepts (see [human-protein-atlas](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#human-protein-atlas) for details on the mapping process). The [`Merged_Human_Ensembl_Entrez_HGNC_Uniprot_Identifiers.txt`](https://www.dropbox.com/s/fiek6h5rowi7dh0/Merged_Human_Ensembl_Entrez_HGNC_Uniprot_Identifiers.txt?dl=1) file was used to filter the GTEx data for genes that were not protein-coding. The mappings are then used to create the following edge types:  
- rna-cell line  
- rna-tissue type   
- protein-cell line  
- protein-tissue type  

<br>

**Output:**  
- All HPA tissue and cell type strings ➞ [`HPA_tissues.txt`](https://www.dropbox.com/s/m0spn8h1l8kxb61/HPA_tissues.txt?dl=1)  
- Mapping HPA strings to ontology concepts (documentation) ➞ [`zooma_tissue_cell_mapping_04JAN2020.xlsx`](https://www.dropbox.com/s/lxp8vxj39eumvcn/zooma_tissue_cell_mapping_04JAN2020.xlsx?dl=1)  
- Final HPA-ontology mappings ➞ [`HPA_GTEx_TISSUE_CELL_MAP.txt`](https://www.dropbox.com/s/snzdwv1cvs0v9pp/HPA_GTEx_TISSUE_CELL_MAP.txt?dl=1)
- HPA Edges ➞ [`HPA_GTEX_RNA_GENE_PROTEIN_EDGES.txt`](https://www.dropbox.com/s/u7elnc056zxypc6/HPA_GTEX_RNA_GENE_PROTEIN_EDGES.txt?dl=1)

In [20]:
# read in data used to identify genes that do not code for proteins from GTEx
merged_data_clean = pandas.read_csv(processed_data_location + 'Merged_Human_Ensembl_Entrez_HGNC_Uniprot_Identifiers.txt',
                                   header = 0,
                                   low_memory=False,
                                   delimiter = '\t')

**Human Protein Atlas**

In [None]:
url = 'https://www.proteinatlas.org/api/search_download.php?search=&columns=g,eg,up,pe,rnatsm,rnaclsm,rnacasm,rnabrsm,rnabcsm,rnablsm,scl,t_RNA_adipose_tissue,t_RNA_adrenal_gland,t_RNA_amygdala,t_RNA_appendix,t_RNA_basal_ganglia,t_RNA_bone_marrow,t_RNA_breast,t_RNA_cerebellum,t_RNA_cerebral_cortex,t_RNA_cervix,_uterine,t_RNA_colon,t_RNA_corpus_callosum,t_RNA_ductus_deferens,t_RNA_duodenum,t_RNA_endometrium_1,t_RNA_epididymis,t_RNA_esophagus,t_RNA_fallopian_tube,t_RNA_gallbladder,t_RNA_heart_muscle,t_RNA_hippocampal_formation,t_RNA_hypothalamus,t_RNA_kidney,t_RNA_liver,t_RNA_lung,t_RNA_lymph_node,t_RNA_midbrain,t_RNA_olfactory_region,t_RNA_ovary,t_RNA_pancreas,t_RNA_parathyroid_gland,t_RNA_pituitary_gland,t_RNA_placenta,t_RNA_pons_and_medulla,t_RNA_prostate,t_RNA_rectum,t_RNA_retina,t_RNA_salivary_gland,t_RNA_seminal_vesicle,t_RNA_skeletal_muscle,t_RNA_skin_1,t_RNA_small_intestine,t_RNA_smooth_muscle,t_RNA_spinal_cord,t_RNA_spleen,t_RNA_stomach_1,t_RNA_testis,t_RNA_thalamus,t_RNA_thymus,t_RNA_thyroid_gland,t_RNA_tongue,t_RNA_tonsil,t_RNA_urinary_bladder,t_RNA_vagina,t_RNA_B-cells,t_RNA_dendritic_cells,t_RNA_granulocytes,t_RNA_monocytes,t_RNA_NK-cells,t_RNA_T-cells,t_RNA_total_PBMC,cell_RNA_A-431,cell_RNA_A549,cell_RNA_AF22,cell_RNA_AN3-CA,cell_RNA_ASC_diff,cell_RNA_ASC_TERT1,cell_RNA_BEWO,cell_RNA_BJ,cell_RNA_BJ_hTERT+,cell_RNA_BJ_hTERT+_SV40_Large_T+,cell_RNA_BJ_hTERT+_SV40_Large_T+_RasG12V,cell_RNA_CACO-2,cell_RNA_CAPAN-2,cell_RNA_Daudi,cell_RNA_EFO-21,cell_RNA_fHDF/TERT166,cell_RNA_HaCaT,cell_RNA_HAP1,cell_RNA_HBEC3-KT,cell_RNA_HBF_TERT88,cell_RNA_HDLM-2,cell_RNA_HEK_293,cell_RNA_HEL,cell_RNA_HeLa,cell_RNA_Hep_G2,cell_RNA_HHSteC,cell_RNA_HL-60,cell_RNA_HMC-1,cell_RNA_HSkMC,cell_RNA_hTCEpi,cell_RNA_hTEC/SVTERT24-B,cell_RNA_hTERT-HME1,cell_RNA_HUVEC_TERT2,cell_RNA_K-562,cell_RNA_Karpas-707,cell_RNA_LHCN-M2,cell_RNA_MCF7,cell_RNA_MOLT-4,cell_RNA_NB-4,cell_RNA_NTERA-2,cell_RNA_PC-3,cell_RNA_REH,cell_RNA_RH-30,cell_RNA_RPMI-8226,cell_RNA_RPTEC_TERT1,cell_RNA_RT4,cell_RNA_SCLC-21H,cell_RNA_SH-SY5Y,cell_RNA_SiHa,cell_RNA_SK-BR-3,cell_RNA_SK-MEL-30,cell_RNA_T-47d,cell_RNA_THP-1,cell_RNA_TIME,cell_RNA_U-138_MG,cell_RNA_U-2_OS,cell_RNA_U-2197,cell_RNA_U-251_MG,cell_RNA_U-266/70,cell_RNA_U-266/84,cell_RNA_U-698,cell_RNA_U-87_MG,cell_RNA_U-937,cell_RNA_WM-115,blood_RNA_basophil,blood_RNA_classical_monocyte,blood_RNA_eosinophil,blood_RNA_gdT-cell,blood_RNA_intermediate_monocyte,blood_RNA_MAIT_T-cell,blood_RNA_memory_B-cell,blood_RNA_memory_CD4_T-cell,blood_RNA_memory_CD8_T-cell,blood_RNA_myeloid_DC,blood_RNA_naive_B-cell,blood_RNA_naive_CD4_T-cell,blood_RNA_naive_CD8_T-cell,blood_RNA_neutrophil,blood_RNA_NK-cell,blood_RNA_non-classical_monocyte,blood_RNA_plasmacytoid_DC,blood_RNA_T-reg,blood_RNA_total_PBMC,brain_RNA_amygdala,brain_RNA_basal_ganglia,brain_RNA_cerebellum,brain_RNA_cerebral_cortex,brain_RNA_hippocampal_formation,brain_RNA_hypothalamus,brain_RNA_midbrain,brain_RNA_olfactory_region,brain_RNA_pons_and_medulla,brain_RNA_thalamus&format=tsv'
data_downloader(url, unprocessed_data_location, 'proteinatlas_search.tsv.gz')

In [50]:
hpa = pandas.read_csv(unprocessed_data_location + 'proteinatlas_search.tsv',
                      header = 0,
                      delimiter = '\t')

# replace NaN with 'None'
hpa.fillna('None', inplace=True)

_Identify HPA Terms Needing Mapping_

In [51]:
# retrieve terms to map
terms_to_map = list(hpa.columns)

# write results
with open(unprocessed_data_location + 'HPA_tissues.txt', 'w') as outfile:
    for x in tqdm(terms_to_map):
        if x.endswith('[NX]'):
            term = x.split('RNA - ')[-1].split(' [NX]')[:-1][0]
            outfile.write(term + '\n')

outfile.close()

100%|██████████| 161/161 [00:00<00:00, 52715.30it/s]


**Genotype-Tissue Expression Project**

In [13]:
url='https://storage.googleapis.com/gtex_analysis_v8/rna_seq_data/GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_median_tpm.gct.gz'
data_downloader(url, unprocessed_data_location)

Downloading gzipped data file


In [52]:
gtex = pandas.read_csv(unprocessed_data_location + 'GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_median_tpm.gct',
                      header = 0,
                      skiprows=2,
                      delimiter = '\t')

# replace NaN with 'None'
gtex.fillna('None', inplace=True)

**Get Mapping Data**  
Here, we are reading back in the concepts that we externally mapped from HPA and GTEx tissue, cell, and cell lines to [UBERON](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#uber-anatomy-ontology), the [Cell Ontology](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#cell-ontology), and the [Cell Line Ontology](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#cell-line-ontology).

In [53]:
# read back in mapped tissue/cell data
mapping_data = pandas.read_excel(open(unprocessed_data_location + 'zooma_tissue_cell_mapping_04JAN2020.xlsx', 'rb'),
                                 sheet_name='Concept_Mapping - 04JAN2020',
                                 header=0)

# convert NaN to None
mapping_data.fillna('None', inplace=True)

# preview data
mapping_data.head(n=3)

Unnamed: 0,ORIGINAL TERM,UBERON ID,UBERON LABEL,CL ID,CL LABEL,CLO ID,CLO LABEL,UBERON MAPPING,CL MAPPING,CLO MAPPING
0,A-431,UBERON_0000014,zone of skin,CL_0000066,epithelial cell,CLO_0001591,A431 cell,Manual,Manual,Manual
1,A549,UBERON_0002048,lung,CL_0000141,epithelial cell of lung,CLO_0001601,A549 cell,Manual,Manual,Manual
2,Adipose - Subcutaneous,UBERON_0002190,subcutaneous adipose tissue,,,,,GTEX,,


In [54]:
# reformat data and write it out
with open(processed_data_location + 'HPA_GTEx_TISSUE_CELL_MAP.txt', 'w') as outfile:
    for idx, row in tqdm(mapping_data.iterrows(), total=mapping_data.shape[0]):
        if row['UBERON ID'] != 'None':
            outfile.write(str(row['ORIGINAL TERM']).strip() + '\t' + str(row['UBERON ID']).strip() + '\n')
        if row['CL ID'] != 'None':
            outfile.write(str(row['ORIGINAL TERM']).strip() + '\t' + str(row['CL ID']).strip() + '\n')
        if row['CLO ID'] != 'None':
            outfile.write(str(row['ORIGINAL TERM']).strip() + '\t' + str(row['CLO ID']).strip() + '\n')

outfile.close()

100%|██████████| 207/207 [00:00<00:00, 2953.23it/s]


**Preview Processed Data**

In [55]:
hpa_data = pandas.read_csv(processed_data_location + 'HPA_GTEx_TISSUE_CELL_MAP.txt',
                           header = None,
                           names=['TISSUE_CELL_TERM', 'ONTOLOGY_IDs'],
                           delimiter = '\t')

print('There are {edge_count} edges'.format(edge_count=len(hpa_data)))

There are 337 edges


In [56]:
hpa_data.head(n=3)

Unnamed: 0,TISSUE_CELL_TERM,ONTOLOGY_IDs
0,A-431,UBERON_0000014
1,A-431,CL_0000066
2,A-431,CLO_0001591


**Create Edge Data Set**

_Genotype-Tissue Expression Project_

In [None]:
# remove rows that contain protein coding genes
gtex_genes = merged_data_clean.loc[merged_data_clean['gene_type_cleaned'].apply(lambda x: 'protein-coding' not in x.lower() and 'none' not in x.lower())]

# merge gtex results with gene type data to allow filtering out of protein-coding genes
merged_gtex = pandas.merge(gtex, gtex_genes, left_on='Description', right_on='symbol', how='left')

# loop over data and re-organize - only keep results with tpm >= 1 and if gene symbol is not a protein-coding gene
gtex_results = []

for idx, row in tqdm(gtex.iterrows(), total=gtex.shape[0]):    
    for col in list(gtex.columns)[2:]:
        if row[col] >= 1.0:           
            gtex_results += [[row['Name'], row['Description'], 'None', 'Evidence at transcript level', 'cell line' if 'Cells' in col else 'anatomy', col]]

_Human Protein Atlas_

In [None]:
hpa_results = []

for idx, row in tqdm(hpa.iterrows(), total=hpa.shape[0]):
    if row['RNA tissue specific NX'] != 'None':
        for x in row['RNA tissue specific NX'].split(';'):
            hpa_results += [[row['Ensembl'], row['Gene'], row['Uniprot'], row['Evidence'], 'anatomy', x.split(':')[0]]]

    if row['RNA cell line specific NX'] != 'None':
        for x in row['RNA cell line specific NX'].split(';'):
            hpa_results += [[row['Ensembl'], row['Gene'], row['Uniprot'], row['Evidence'], 'cell line', x.split(':')[0]]]

    if row['RNA brain regional specific NX'] != 'None':
        for x in row['RNA brain regional specific NX'].split(';'):
            hpa_results += [[row['Ensembl'], row['Gene'], row['Uniprot'], row['Evidence'], 'anatomy', x.split(':')[0]]]

    if row['RNA blood cell specific NX'] != 'None':
        for x in row['RNA blood cell specific NX'].split(';'):
            hpa_results += [[row['Ensembl'], row['Gene'], row['Uniprot'], row['Evidence'], 'anatomy', x.split(':')[0]]]

    if row['RNA blood lineage specific NX'] != 'None':
        for x in row['RNA blood lineage specific NX'].split(';'):
            hpa_results += [[row['Ensembl'], row['Gene'], row['Uniprot'], row['Evidence'], 'anatomy', x.split(':')[0]]]

_Write Results_

In [None]:
# reformat data and write it out
with open(processed_data_location + 'HPA_GTEX_RNA_GENE_PROTEIN_EDGES.txt', 'w') as outfile:
    for res in tqdm(hpa_results + gtex_results):
        outfile.write(res[0] + '\t' + res[1] + '\t' + res[2] + '\t' + res[3] + '\t' + res[4] + '\t' + res[5] + '\n')

outfile.close()

**Preview Processed Data**

In [91]:
hpa_edges = pandas.read_csv(processed_data_location + 'HPA_GTEX_RNA_GENE_PROTEIN_EDGES.txt',
                           header = None,
                           names=['Ensembl_IDs', 'Gene_Symbols', 'Uniport_IDs', 'Evidence', 'Anatomy_Type', 'Anatomy'],
                           low_memory=False,
                           delimiter = '\t')

print('There are {edge_count} edges'.format(edge_count=len(hpa_edges)))

There are 952962 edges


In [92]:
hpa_edges.head(n=5)

Unnamed: 0,Ensembl_IDs,Gene_Symbols,Uniport_IDs,Evidence,Anatomy_Type,Anatomy
0,ENSG00000121410,A1BG,P04217,Evidence at protein level,anatomy,liver
1,ENSG00000121410,A1BG,P04217,Evidence at protein level,cell line,HEK 293
2,ENSG00000121410,A1BG,P04217,Evidence at protein level,cell line,Hep G2
3,ENSG00000121410,A1BG,P04217,Evidence at protein level,cell line,REH
4,ENSG00000121410,A1BG,P04217,Evidence at protein level,cell line,U-266/70


<br>

***
***
### CREATE EDGE DATASETS  <a class="anchor" id="create-edge-datasets"></a>
***
***

### Ontologies  <a class="anchor" id="ontologies"></a>
***
- [Protein Ontology](#protein-ontology)  
- [Relations Ontology](#relations-ontology)  

***

### Protein Ontology <a class="anchor" id="protein-ontology"></a>

**Data Source Wiki Page:** [protein-ontology](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#human-phenotype-ontology)  

**Purpose:** This script downloads the [pr.owl](http://purl.obolibrary.org/obo/pr.owl) file from [ProConsortium.org](https://proconsortium.org/) in order to create a version of the ontology that contains only human proteins. This is achieved by performing forward and reverse breadth first search over all proteins which are `owl:subClassOf` [Homo sapiens protein](https://proconsortium.org/app/entry/PR%3A000029067/).

<br>

**Output:**  
- Human Protein Ontology ➞ [`human_pro.owl`](https://www.dropbox.com/s/jw8jksgnqbcz9sm/human_pro.owl?dl=1)
- Classified Human Protein Ontology (Hermit) ➞ [`human_pro_closed.owl`](https://www.dropbox.com/s/6ux85agl95ja3wx/human_pro_closed.owl?dl=1)


In [None]:
url = 'http://purl.obolibrary.org/obo/pr.owl'
data_downloader(url, unprocessed_data_location)

In [None]:
# read in ontology as graph (the ontology is large so this takes ~60 minutes) - 11,757,623 edges on 12/18/2019
graph = Graph()
graph.parse(unprocessed_data_location + 'pr.owl')

print('There are {} edges in the ontology'.format(len(graph)))

**Convert Ontology to Directed MulitGraph:**  
In order to create a version of the ontology which includes all relevant human edges, we need to first convert the KG to a [directed multigraph](https://networkx.github.io/documentation/stable/reference/classes/multidigraph.html).

In [None]:
# convert RDF graph to multidigraph (the ontology is large so this takes ~45 minutes)
networkx_mdg = rdflib_to_networkx_multidigraph(graph)

**Identify Human Proteins:**   
A list of human proteins is obtained by querying the ontology to return all ontology classes `only_in_taxon some Homo sapiens`. To expedite the query time, the following SPARQL query is run from the [ProConsortium](https://proconsortium.org/pro_sparql.shtml) SPARQL endpoint: 

```SPARQL
PREFIX obo: <http://purl.obolibrary.org/obo/>

SELECT ?PRO_term
FROM <http://purl.obolibrary.org/obo/pr>
WHERE {
       ?PRO_term rdf:type owl:Class .
       ?PRO_term rdfs:subClassOf ?restriction .
       ?restriction owl:onProperty obo:RO_0002160 .
       ?restriction owl:someValuesFrom obo:NCBITaxon_9606 .

       # use this to filter-out things like hgnc ids
       FILTER (regex(?PRO_term,"http://purl.obolibrary.org/obo/*")) .
}

```


In [None]:
# download data - pro classes only_in_taxon some Homo sapiens (61,064 classes on 12/18/2019)
url = 'http://sparql.proconsortium.org/virtuoso/sparql?query=PREFIX+obo%3A+%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2F%3E%0D%0ASELECT+%3FPRO_term%0D%0AFROM+%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2Fpr%3E%0D%0AWHERE%0D%0A%7B%0D%0A+++%3FPRO_term+rdf%3Atype+owl%3AClass+.%0D%0A+++%3FPRO_term+rdfs%3AsubClassOf+%3Frestriction+.%0D%0A+++%3Frestriction+owl%3AonProperty+obo%3ARO_0002160+.%0D%0A+++%3Frestriction+owl%3AsomeValuesFrom+obo%3ANCBITaxon_9606+.%0D%0A%0D%0A+++FILTER+%28regex%28%3FPRO_term%2C%22http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2F*%22%29%29+.%0D%0A%0D%0A%7D%0D%0A&format=text%2Fhtml&debug='
html = requests.get(url, allow_redirects=True).content

# extract data from html table
df_list = pandas.read_html(html)
human_pro_classes = list(df_list[-1]['PRO_term'])

print('There are {protein_count} human classes in the PRO ontology'.format(protein_count=len(human_pro_classes)))

**Construct Human PRO:**   
Now that we have all of the paths from the original graph that are relevant to humans, we can construct a human-only version of the PRotein ontology.

In [None]:
# create a new graph using bfs paths
human_pro_graph = Graph()
human_networkx_mdg = networkx.MultiDiGraph()

for node in tqdm(human_pro_classes):
    forward = list(networkx.edge_bfs(networkx_mdg, URIRef(node), orientation='original'))
    reverse = list(networkx.edge_bfs(networkx_mdg, URIRef(node), orientation='reverse'))
    
    # add edges from forward and reverse bfs paths
    for path in forward + reverse:
        human_pro_graph.add((path[0], path[2], path[1]))
        human_networkx_mdg.add_edge(path[0], path[1], path[2])

In [None]:
# verify that the constructed ontology only has 1 component
networkx.number_connected_components(human_networkx_mdg.to_undirected())

In [None]:
# save filtered ontology
human_pro_graph.serialize(destination=unprocessed_data_location + 'human_pro.owl', format='xml')

**Classify Ontology:**  
To ensure that we have correclty built the new ontology, we run the hermit reasoner over it to ensure that there are no incomplete triples or inconsistent classes. In order to do this, we will call the reasoner using [OWLTools](https://github.com/owlcollab/owltools), which this script assumes has already been downloaded to the `../resources/lib` directory. The following arguments are then called to run the reasoner (from the command line):  

```bash
./resources/lib/owltools ./resources/unprocessed_data/human_pro_filtered.owl --reasoner hermit --run-reasoner --assert-implied -o ./resources/processed_data/human_pro_closed.owl
```

_**Note.** This step takes around 30-45 minutes to run. When run from the command line the reasoner determined that the ontology was consistent and 174 new axioms were inferrred (12/18/2019)._

In [None]:
# run reasoner -- RUN FROM COMMAND LINE NOT HERE
# subprocess.run(['../../resources/lib/owltools',
#                 '../../resources/unprocessed_data/human_pro_filtered.owl',
#                 '--reasoner hermit',
#                 '--run-reasoner',
#                 '--assert-implied',
#                 '--list-unsatisfiable',
#                 '-o ./resources/processed_data/human_pro_closed.owl'])

**Examine Cleaned Human PRO:**  
Once we have cleaned the ontology we can get counts of components, nodes, edges, and then write the cleaned graph to the `../../resources/processed_data` repository.

In [None]:
# get count of connected components
pro_human_graph = Graph()
pro_human_graph.parse(processed_data_location + 'human_pro_closed.owl')

# get node and edge count
edge_count = len(human_pro_graph)
node_count = len(set([str(node) for edge in list(human_pro_graph) for node in edge[0::2]]))

print('\n The classified, filtered Human version of PRO contains {node} nodes and {edge} edges\n'.format(node=node_count, edge=edge_count))

<br>

***

### Relations Ontology <a class="anchor" id="relations-ontology"></a>

**Data Source Wiki Page:** [RO](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#relation-ontology)  

**Purpose:** This script downloads the [ro.owl](http://purl.obolibrary.org/obo/ro.owl) file from [obofoundry.org](http://www.obofoundry.org/) in order to obtain all `ObjectProperties` and their inverse relations.  

**Output:** 
- Relations and Inverse Relations ➞ [`INVERSE_RELATIONS.txt`](https://www.dropbox.com/s/sd8qlib8f6gqyz4/INVERSE_RELATIONS.txt?dl=1)
- Relations and Labels ➞ [`RELATIONS_LABELS.txt`](https://www.dropbox.com/s/k2hm9p0r8l9ecj3/RELATIONS_LABELS.txt?dl=1)

In [None]:
url = 'http://purl.obolibrary.org/obo/ro.owl'
data_downloader(url)

In [None]:
ro_graph = Graph()
ro_graph.parse(unprocessed_data_location + 'ro.owl')

print('There are {} edges in the ontology'.format(len(ro_graph))) #5,669 edges on 12/15/2019


___

**Identify Relations and Inverse Relations:**  
Identify all relations and their inverse relations using the `owl:inverseOf` property. To make it easier to look up the inverse relations, each pair is listed twice, for example:  
- [location of](http://www.ontobee.org/ontology/RO?iri=http://purl.obolibrary.org/obo/RO_0001015) `owl:inverseOf` [located in](http://www.ontobee.org/ontology/RO?iri=http://purl.obolibrary.org/obo/RO_0001025)  
- [located in](http://www.ontobee.org/ontology/RO?iri=http://purl.obolibrary.org/obo/RO_0001025) `owl:inverseOf` [location of](http://www.ontobee.org/ontology/RO?iri=http://purl.obolibrary.org/obo/RO_0001015)

In [None]:
with open('./resources/relations_data/INVERSE_RELATIONS.txt', 'w') as outfile:
    
    # write column names
    outfile.write('Relation' + '\t' + 'Inverse_Relation' + '\n')
    
    # manually add missing relations
    outfile.write('RO_0000056' + '\t' + 'RO_0000057' + '\n') #participates_in/has_participant
    outfile.write('RO_0000057' + '\t' + 'RO_0000056' + '\n') #participates_in/has_participant
    outfile.write('RO_0000085' + '\t' + 'RO_0000079' + '\n') #has_function/function_of
    outfile.write('RO_0000079' + '\t' + 'RO_0000085' + '\n') #has_function/function_of
    outfile.write('RO_0001025' + '\t' + 'RO_0001015' + '\n') #located_in/has_location
    outfile.write('RO_0001015' + '\t' + 'RO_0001025' + '\n') #located_in/has_location

    # find inverse relations
    for s, p, o in tqdm(ro_graph):
        if 'owl#inverseOf' in str(p):
            if 'RO' in str(s) and 'RO' in str(o):
                outfile.write(str(s.split('/')[-1]) + '\t' + str(o.split('/')[-1]) + '\n')
                outfile.write(str(o.split('/')[-1]) + '\t' + str(s.split('/')[-1]) + '\n')

outfile.close()

**Preview Processed Data**

In [None]:
ro_data = pandas.read_csv('./resources/relations_data/INVERSE_RELATIONS.txt',
                          header = 0,
                          delimiter = '\t')

print('There are {edge_count} RO Relations and Inverse Relations'.format(edge_count=len(ro_data)))

In [None]:
ro_data.head(n=5)

***
**Get Relations Labels:**  
Identify all relations and their labels for use when building the knowledge graph.

In [None]:
results = ro_graph.query(
    """SELECT DISTINCT ?p ?p_label
           WHERE {
              ?p rdf:type owl:ObjectProperty .
              ?p rdfs:label ?p_label . }
           """, initNs={"rdf": 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
                        "rdfs": 'http://www.w3.org/2000/01/rdf-schema#',
                        "owl": 'http://www.w3.org/2002/07/owl#'})    

In [None]:
# write data to file
with open('./resources/relations_data/RELATIONS_LABELS.txt', 'w') as outfile:
    
    # write column names
    outfile.write('Relation' + '\t' + 'Label' + '\n')

    for p, p_label in list(results):
        outfile.write(str(p).split('/')[-1] + '\t' + str(p_label) + '\n')

**Preview Processed Data**

In [None]:
ro_data_label = pandas.read_csv('./resources/relations_data/RELATIONS_LABELS.txt',
                                header = 0,
                                delimiter = '\t')

print('There are {edge_count} RO Relations and Labels'.format(edge_count=len(ro_data_label)))

In [None]:
ro_data_label.head(n=5)

<br>

***
***
### Linked Data <a class="anchor" id="linked-data"></a>
***
* [Clinvar Variant-Diseases and Phenotypes](#clinvar-variant)
* [NCBI Gene Protein-Coding Genes and Proteins](#ncbi-protein-coding-genes)  
* [Reactome Chemical-Complex Data](#reactome-chemical-complex)  
* [Reactome Complex-Complex Data](#reactome-complex-complex)  
* [Reactome Protein-Complex Data](#reactome-protein-complex)  
* [Uniprot Protein-Cofactor and Protein-Catalyst](#uniprot-protein-cofactorcatalyst)  

***

### Clinvar Variant-Diseases and Phenotypes <a class="anchor" id="clinvar-variant"></a>

**Data Source Wiki Page:** [Clinvar](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#clinvar)  

**Purpose:** This script downloads the [variant_summary.txt](ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz) file from [CLinVar](https://www.ncbi.nlm.nih.gov/clinvar/) in order to create the following edges:  
- gene-variant  
- variant-disease  
- variant-phenotype  

**Output:** [`CLINVAR_VARIANT_GENE_DISEASE_PHENOTYPE_EDGES.txt`](https://www.dropbox.com/s/1doj3lj46ufgdpd/CLINVAR_VARIANT_GENE_DISEASE_PHENOTYPE_EDGES.txt?dl=1)


In [None]:
url = 'ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz'
data_downloader(url, unprocessed_data_location)

In [110]:
# read in data and provided labels (needed to unnest data)
clinvar_data = pandas.read_csv(unprocessed_data_location + 'variant_summary.txt',
                               header = 0,
                               delimiter = '\t',
                               low_memory=False)

# replace NaN with 'None'
clinvar_data.fillna('None', inplace=True)

In [111]:
# explode nested data
explode_df_clinvar = explode(clinvar_data.copy(), ['PhenotypeIDS'], ';')
explode_df_clinvar = explode(explode_df_clinvar.copy(), ['PhenotypeIDS'], ',')

# edit column formatting
explode_df_clinvar['PhenotypeIDS'].replace('Orphanet:ORPHA','ORPHA:', inplace=True, regex=True)
explode_df_clinvar['PhenotypeIDS'].replace('Human Phenotype Ontology:HP:','HP_', inplace=True, regex=True)

# write data
explode_df_clinvar.to_csv(processed_data_location + 'CLINVAR_VARIANT_GENE_DISEASE_PHENOTYPE_EDGES.txt', header = True, sep='\t', encoding='utf-8', index=False)

**Preview Processed Data**

In [112]:
print('There are {edge_count} variant edges'.format(edge_count=len(explode_df_clinvar)))

There are 3587975 variant edges


In [113]:
# preview data
explode_df_clinvar.head(n=5)

Unnamed: 0,#AlleleID,Type,Name,GeneID,GeneSymbol,HGNC_ID,ClinicalSignificance,ClinSigSimple,LastEvaluated,RS# (dbSNP),...,ReferenceAllele,AlternateAllele,Cytogenetic,ReviewStatus,NumberSubmitters,Guidelines,TestedInGTR,OtherIDs,SubmitterCategories,VariationID
0,15228,deletion,NM_001017995.3(SH3PXD2B):c.969del (p.Arg324fs),285590,SH3PXD2B,HGNC:29242,Pathogenic/Likely pathogenic,1,"Jun 01, 2017",794728006,...,GC,G,5q35.1,no assertion criteria provided,3,,N,OMIM Allelic Variant:613293.0002,3,189
1,15228,deletion,NM_001017995.3(SH3PXD2B):c.969del (p.Arg324fs),285590,SH3PXD2B,HGNC:29242,Pathogenic/Likely pathogenic,1,"Jun 01, 2017",794728006,...,GC,G,5q35.1,no assertion criteria provided,3,,N,OMIM Allelic Variant:613293.0002,3,189
2,15228,deletion,NM_001017995.3(SH3PXD2B):c.969del (p.Arg324fs),285590,SH3PXD2B,HGNC:29242,Pathogenic/Likely pathogenic,1,"Jun 01, 2017",794728006,...,GC,G,5q35.1,no assertion criteria provided,3,,N,OMIM Allelic Variant:613293.0002,3,189
3,15229,single nucleotide variant,NM_001017995.3(SH3PXD2B):c.127C>T (p.Arg43Trp),285590,SH3PXD2B,HGNC:29242,Pathogenic,1,"Feb 12, 2010",267607046,...,G,A,5q35.1,no assertion criteria provided,1,,N,"OMIM Allelic Variant:613293.0003,UniProtKB (pr...",1,190
4,15229,single nucleotide variant,NM_001017995.3(SH3PXD2B):c.127C>T (p.Arg43Trp),285590,SH3PXD2B,HGNC:29242,Pathogenic,1,"Feb 12, 2010",267607046,...,G,A,5q35.1,no assertion criteria provided,1,,N,"OMIM Allelic Variant:613293.0003,UniProtKB (pr...",1,190


<br>

***
***

### NCBI Gene Protein-Coding Gene-Protein <a class="anchor" id="ncbi-protein-coding-genes"></a>

**Data Source Wiki Page:** [Uniprot](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#uniprot-knowledgebase) 

**Purpose:** This script utilizes the merged data created in the [Human-Transcript, Gene, and Protein Identifier Mapping](#Human-Transcript,-Gene,-and-Protein-Identifier-Mapping) subsection in order to create the following edges:  
- gene-protein

**Output:** [`PROTEIN_CODING_GENES_PROTEINS.txt`](https://www.dropbox.com/s/79ce6oe68jt72ph/PROTEIN_CODING_GENES_PROTEINS.txt?dl=1)  

In [None]:
# de-dup data
df_ens = merged_data_clean.drop_duplicates(subset=['entrez_id', 'pro_id'], keep='first', inplace=False) 

# reformat data and write it out
with open(processed_data_location + 'PROTEIN_CODING_GENES_PROTEINS.txt', 'w') as outfile:
    for idx, row in tqdm(df_ens.iterrows(), total=df_ens.shape[0]):
        if (row['entrez_id'] != 'None' and row['pro_id'] != 'None') and row['gene_type'] == 'protein-coding': 
            outfile.write(row['entrez_id'].strip() + '\t' + row['pro_id'].replace('PR:', 'PR_').strip() + '\t' + row['gene_type_cleaned'] + '\n')

outfile.close()

**Preview Processed Data**

In [96]:
hpe_data = pandas.read_csv(processed_data_location + 'PROTEIN_CODING_GENES_PROTEINS.txt',
                           header = None,
                           names=['Entrez_Gene_IDs', 'Protein_Ontology_IDs', "Gene_Type"],
                           delimiter = '\t')

print('There are {edge_count} protein-coding gene edges'.format(edge_count=len(hpe_data)))

There are 37569 protein-coding gene edges


In [97]:
hpe_data.head(n=5)

Unnamed: 0,Entrez_Gene_IDs,Protein_Ontology_IDs,Gene_Type
0,79501,PR_000011836,protein-coding
1,79501,PR_Q8NH21,protein-coding
2,729759,PR_000011834,protein-coding
3,729759,PR_Q6IEY1,protein-coding
4,81399,PR_000011834,protein-coding


<br>

***

### Reactome Chemical-Complex Data <a class="anchor" id="reactome-chemical-complex"></a>

**Data Souurce Wiki Page:** [Reactome](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#reactome-pathway-database)  

**Purpose:** This script downloads the [ComplexParticipantsPubMedIdentifiers_human.txt](https://reactome.org/download/current/ComplexParticipantsPubMedIdentifiers_human.txt) file from [Reactome](https://reactome.orgt) in order to create the following edges:  
- chemical-complex  

**Output:** [`REACTOME_CHEMICAL_COMPLEX.txt`](https://www.dropbox.com/s/qoetjt0vfy6qb3y/REACTOME_CHEMICAL_COMPLEX.txt?dl=1)

In [None]:
# process data
data = open(unprocessed_data_location + 'ComplexParticipantsPubMedIdentifiers_human.txt').readlines()

# reformat data and write it out
with open(processed_data_location + 'REACTOME_CHEMICAL_COMPLEX.txt', 'w') as outfile:
    for line in tqdm(data[1:]):
        row = line.split('\t')
        
        if (row[0].strip().startswith('R-HSA') or row[0].strip().startswith('R-ALL')):
            # find all proteins in a complex
            for x in row[2].split('|'):
                if x.startswith('chebi:'):            
                    outfile.write(x.replace('chebi:', 'CHEBI_') + '\t' + row[0].strip() + '\n')

outfile.close()

**Preview Processed Data**

In [115]:
cc1_data = pandas.read_csv(processed_data_location + 'REACTOME_CHEMICAL_COMPLEX.txt',
                           header = None,
                           names=['CHEBI_IDs', 'Reactome_IDs'],
                           delimiter = '\t')

print('There are {edge_count} chemical-complex edges'.format(edge_count=len(cc1_data)))

There are 5589 chemical-complex edges


In [116]:
cc1_data.head(n=5)

Unnamed: 0,CHEBI_IDs,Reactome_IDs
0,CHEBI_24505,R-HSA-1006173
1,CHEBI_28879,R-HSA-1006173
2,CHEBI_59888,R-HSA-1013011
3,CHEBI_59888,R-HSA-1013017
4,CHEBI_29105,R-HSA-109266


<br>

***

### Reactome Complex-Complex Data <a class="anchor" id="reactome-complex-complex"></a>

**Data Source Wiki Page:** [Reactome](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#reactome-pathway-database)  

**Purpose:** This script downloads the [ComplexParticipantsPubMedIdentifiers_human.txt](https://reactome.org/download/current/ComplexParticipantsPubMedIdentifiers_human.txt) file from [Reactome](https://reactome.orgt) in order to create the following edges:  
- complex-complex  

**Output:** [`REACTOME_COMPLEX_COMPLEX.txt`](https://www.dropbox.com/s/sojaq8u3hwfw4jz/REACTOME_COMPLEX_COMPLEX.txt?dl=1)


In [117]:
# create label dictionary
labels = pandas.read_csv(unprocessed_data_location + 'ComplexParticipantsPubMedIdentifiers_human.txt',
                         header = 0,
                         delimiter = '\t')

# convert to dictionary
label_dict = {row[0]:row[1] for idx, row in labels.iterrows()}

In [None]:
# process data
data = open(unprocessed_data_location + 'ComplexParticipantsPubMedIdentifiers_human.txt').readlines()

# reformat data and write it out
with open(processed_data_location + 'REACTOME_COMPLEX_COMPLEX.txt', 'w') as outfile:
    for line in tqdm(data[1:]):
        row = line.split('\t')
        
        if row[0].strip().startswith('R-HSA'):
            # find all complexes
            for x in row[3].split('|'):
                if (x.startswith('R-HSA-') or x.startswith('R-ALL-')) and x.strip() in label_dict.keys():            
                    outfile.write(row[0].strip() + '\t' + x.strip() + '\n')

outfile.close()

**Preview Processed Data**

In [119]:
cc_data = pandas.read_csv(processed_data_location + 'REACTOME_COMPLEX_COMPLEX.txt',
                          header = None,
                          names=['Reactome_Complex_u', 'Reactome_Complex_v'],
                          delimiter = '\t')

print('There are {edge_count} complex-complex edges'.format(edge_count=len(cc_data)))

There are 13606 complex-complex edges


In [120]:
cc_data.head(n=5)

Unnamed: 0,Reactome_Complex_u,Reactome_Complex_v
0,R-HSA-1008206,R-HSA-1008229
1,R-HSA-1013011,R-HSA-1013017
2,R-HSA-1013011,R-HSA-1013019
3,R-HSA-1013011,R-HSA-420698
4,R-HSA-1013011,R-HSA-420748


<br>

***
***

### Reactome Complex-Pathway Data <a class="anchor" id="reactome-complex-pathway"></a>

**Data Source Wiki Page:** [Reactome](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#reactome-pathway-database)  

**Purpose:** This script downloads the [Complex_2_Pathway_human.txt](https://reactome.org/download/current/Complex_2_Pathway_human.txt) file from [Reactome](https://reactome.orgt) in order to create the following edges:  
- complex-pathway  

**Output:** [`REACTOME_COMPLEX_PATHWAY.txt`](https://www.dropbox.com/s/my03w16fjw7bt20/REACTOME_COMPLEX_PATHWAY.txt?dl=1)


In [None]:
url = 'https://reactome.org/download/current/Complex_2_Pathway_human.txt'
data_downloader(url, unprocessed_data_location)

In [None]:
# process data
data = open(unprocessed_data_location + 'Complex_2_Pathway_human.txt').readlines()

# reformat data and write it out
with open(processed_data_location + 'REACTOME_COMPLEX_PATHWAY.txt', 'w') as outfile:
    for line in tqdm(data[1:]):
        row = line.split('\t')
        if row[0].startswith('R-HSA-'):            
            outfile.write(row[0].strip() + '\t' + row[1].strip() + '\n')

outfile.close()

**Previewed Processed Data**

In [122]:
cp_data = pandas.read_csv(processed_data_location + 'REACTOME_COMPLEX_PATHWAY.txt',
                          header = None,
                          names=['Reactome_Complex', 'Reactome_Pathway'],
                          delimiter = '\t')

print('There are {edge_count} complex-pathway edges'.format(edge_count=len(cp_data)))

There are 20480 complex-pathway edges


In [123]:
cp_data.head(n=5)

Unnamed: 0,Reactome_Complex,Reactome_Pathway
0,R-HSA-1006173,R-HSA-977606
1,R-HSA-1008206,R-HSA-983231
2,R-HSA-1008229,R-HSA-983231
3,R-HSA-1008252,R-HSA-983231
4,R-HSA-1011577,R-HSA-983231


<br>

***
***

### Reactome Protein-Complex Data <a class="anchor" id="reactome-protein-complex"></a>

**Data Source Wiki Page:** [Reactome](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#reactome-pathway-database)  

**Purpose:** This script downloads the [ComplexParticipantsPubMedIdentifiers_human.txt](https://reactome.org/download/current/ComplexParticipantsPubMedIdentifiers_human.txt) file from [Reactome](https://reactome.org) in order to create the following edges:  
- protein-complex

**Output:** [`REACTOME_PROTEIN_COMPLEX.txt`](https://www.dropbox.com/s/7meu0cdz1mrnsz7/REACTOME_PROTEIN_COMPLEX.txt?dl=1)


In [None]:
url = 'https://reactome.org/download/current/ComplexParticipantsPubMedIdentifiers_human.txt'
data_downloader(url, unprocessed_data_location)

In [None]:
# process data
data = open(unprocessed_data_location + 'ComplexParticipantsPubMedIdentifiers_human.txt').readlines()

# reformat data and write it out
with open(processed_data_location + 'REACTOME_PROTEIN_COMPLEX.txt', 'w') as outfile:
    for line in tqdm(data):
        row = line.split('\t')
        
        if row[0].strip().startswith('R-HSA'):
            # find all proteins in a complex
            for x in row[2].split('|'):
                if x.startswith('uniprot:'):            
                    outfile.write(x.split(':')[-1].strip() + '\t' + row[0].strip() + '\n')

outfile.close()

**Preview Processed Data**

In [125]:
pc_data = pandas.read_csv(processed_data_location + 'REACTOME_PROTEIN_COMPLEX.txt',
                       header = None,
                       names=['Uniprot_Protein', 'Reactome_Complex'],
                       delimiter = '\t')

print('There are {edge_count} protein-complex edges'.format(edge_count=len(pc_data)))

There are 91201 protein-complex edges


In [126]:
pc_data.head(n=5)

Unnamed: 0,Uniprot_Protein,Reactome_Complex
0,P08603,R-HSA-1006173
1,Q16621,R-HSA-1008206
2,Q9ULX9,R-HSA-1008206
3,O15525,R-HSA-1008206
4,O60675,R-HSA-1008206


<br>

***
***

### Uniprot  Protein-Cofactor and Protein-Catalyst <a class="anchor" id="uniprot-protein-cofactorcatalyst"></a>

**Data Source Wiki Page:** [Uniprot](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#uniprot-knowledgebase)  

**Purpose:** This script downloads the [uniprot-cofactor-catalyst.tab](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#uniprot-knowledgebase) file from the [Uniprot Knowledge Base](https://www.uniprot.org) in order to create the following edges:  
- protein-cofactor  
- protein-catalyst  

**Output:**  
- protein-cofactor ➞ [`UNIPROT_PROTEIN_COFACTOR.txt`](https://www.dropbox.com/s/ij9t89botd8nmmj/UNIPROT_PROTEIN_COFACTOR.txt?dl=1)
- protein-catalyst ➞ [`UNIPROT_PROTEIN_CATALYST.txt`](https://www.dropbox.com/s/pvopvs0iq8x3oq2/UNIPROT_PROTEIN_CATALYST.txt?dl=1)


In [None]:
url = 'https://www.uniprot.org/uniprot/?query=&fil=organism%3A%22Homo%20sapiens%20(Human)%20%5B9606%5D%22&columns=id%2Centry%20name%2Creviewed%2Cdatabase(PRO)%2Cchebi(Cofactor)%2Cchebi(Catalytic%20activity)&format=tab'
data_downloader(url, unprocessed_data_location, 'uniprot-cofactor-catalyst.tab')

In [None]:
data = open(unprocessed_data_location + 'uniprot-cofactor-catalyst.tab').readlines()

# reformat data and write it out
with open(processed_data_location + 'UNIPROT_PROTEIN_COFACTOR.txt', 'w') as outfile1, open(processed_data_location + 'UNIPROT_PROTEIN_CATALYST.txt', 'w') as outfile2:
    for line in tqdm(data):

        # get cofactors
        if 'CHEBI' in line.split('\t')[4]: 
            for i in line.split('\t')[4].split(';'):
                chebi = i.split('[')[-1].replace(']', '').replace(':', '_')
                outfile1.write('PR_' + line.split('\t')[3].strip(';') + '\t' + chebi + '\n')
        
        # get catalysts
        if 'CHEBI' in line.split('\t')[5]:       
            for i in line.split('\t')[5].split(';'):
                chebi = i.split('[')[-1].replace(']', '').replace(':', '_')
                outfile2.write('PR_' + line.split('\t')[3].strip(';') + '\t' + chebi + '\n')

outfile1.close()
outfile2.close()

**Preview Processed Data**

_Preview Cofactor Data_

In [128]:
pcp1_data = pandas.read_csv(processed_data_location + 'UNIPROT_PROTEIN_COFACTOR.txt',
                            header = None,
                            names=['Protein_Ontology_IDs', 'CHEBI_IDs'],
                            delimiter = '\t')

print('There are {edge_count} protein-cofactor edges'.format(edge_count=len(pcp1_data)))

There are 5577 protein-cofactor edges


In [129]:
pcp1_data.head(n=5)

Unnamed: 0,Protein_Ontology_IDs,CHEBI_IDs
0,PR_Q9BRS2,CHEBI_18420
1,PR_Q05823,CHEBI_18420
2,PR_Q05823,CHEBI_29035
3,PR_Q13472,CHEBI_18420
4,PR_Q9BXA7,CHEBI_18420


_Preview Catalyst Data_

In [130]:
pcp2_data = pandas.read_csv(processed_data_location + 'UNIPROT_PROTEIN_CATALYST.txt',
                            header = None,
                            names=['Protein_Ontology_IDs', 'CHEBI_IDs'],
                            delimiter = '\t')

print('There are {edge_count} protein-catalyst edges'.format(edge_count=len(pcp2_data)))

There are 59863 protein-catalyst edges


In [131]:
pcp2_data.head(n=5)

Unnamed: 0,Protein_Ontology_IDs,CHEBI_IDs
0,PR_Q9NP80,CHEBI_15377
1,PR_Q9NP80,CHEBI_15378
2,PR_Q9NP80,CHEBI_28868
3,PR_Q9NP80,CHEBI_16870
4,PR_Q9NP80,CHEBI_58168


<br>

***
***
### INSTANCE METADATA <a class="anchor" id="create-instance-metadata"></a>
***

**Data Source Wiki Page:** [Dependencies](https://github.com/callahantiff/PheKnowLator/wiki/Dependencies/#node-metadata) 

**Purpose:** The goal of this section is to obtain metadata for each instance data source used in the knowledge graph. To determine which of the edges contains instance data, the [`Master_Edge_List_Dict.json`](https://www.dropbox.com/s/4j0vrwx26dh8hd1/Master_Edge_List_Dict.json?dl=1) file is parsed and saved to a nested dictionary (see example below). 

```python
{
  'complex': {
              'chemical-complex': [[node_1, node_2]...[node_n, node_m]],
              'complex-complex':  [[node_1, node_2]...[node_n, node_m]],
              'complex-pathway':  [[node_1, node_2]...[node_n, node_m]],
              },
     'gene': {
                'chemical-gene':  [[node_1, node_2]...[node_n, node_m]],
                 'gene-disease':  [[node_1, node_2]...[node_n, node_m]],
              }
}
```

<br>

Once this dictionary is created, each major data type (examples shown in the list below) will be processed. For **[`Release V2.0.0`](https://github.com/callahantiff/PheKnowLator/wiki/v2.0.0)**, the following are instance data and require the compiling of metadata:
- [Genes](#gene-metadata)
- [RNA](#rna-metadata)
- [Pathways](#pathway-metadata)
- [Complexes](#complex-metadata)
- [Reactions](#reaction-metadata)
- [Variants](#variant-metadata)


<br>

____

**Metadata:** The <u>metadata</u> we will gather includes:  

| **Metadata Type** | **Definition** | **Example Node**  | **Example Node Metadata** | 
| :---: | :---: | :---: | :---: | 
| Label | The primary label or name for the node | `R-HSA-1006173` | "CFH:Host cell surface" |       
| Description | A definition or other useful details about the node | `rs794727058` | This `germline` `single nucleotide variant (allele alteration: C➞T)` located on chromosome `5 (GRCh38: NC_000005.10, start/stop positions (126555930/126555930))` with `pathogenic` clinical significance and a last review date of `2/23/2015` (review status: `criteria provided, single submitter`). |        
| Synonym | Alternative terms used for a node | `81399` | "OR1-1, OR7-21" |           

<br>

The metadata information will be used to create the following edges in the knowledge graph:  
- **Label** ➞ node `rdfs:label`  
- **Description** ➞ node `obo:IAO_0000115` description 
- **Synonyms** ➞ node `oboInOwl:hasExactSynonym` synonym 

<br>

*<b>NOTE.</b> All node metadata datasets are written to the `node_data` directory. The algorithm will look for data in this directory and if it is not there, then no node metadata will be created.*

_____

### Prepare Metadata Dictionaries
***

**Purpose:** To create the resources needed in order to create metadata dictionaries, which are in turn used to obtain metadata for instance data nodes. This process has the following steps:

**1. [Identify Instance Data Nodes](#identify-instance-data-nodes):** In order to automatically obtain the list of edges that include an instance data source and their corresponding edge lists, the `Master_Edge_List_Dict.json` is read in and processed.  
  - <u>Input Data</u>: [`Master_Edge_List_Dict.json`](./resources/Master_Edge_List_Dict.json)  

<br>

**2. [Generate Metadata Dictionaries](#generate-metadata-dictionaries):** In order to efficiently obtain metadata for the instance data nodes identified in _Step 1_, we first read in the data for each node type (i.e. genes, rna, complexes, pathways, reactions, and variants) and convert it into a dictionary. Then, each metadata dictionary is saved to a `master_metadata_dictioanry`, keyed by node type.
  - <u>Input Datasets</u>:  
    - Genes ➞ [`Homo_sapiens.gene_info`](https://www.dropbox.com/s/vazlmzxydgv6xzz/Homo_sapiens.gene_info?dl=1)    
    - RNA ➞ [`Merged_Human_Ensembl_Entrez_Uniprot_Identifiers.txt`](https://www.dropbox.com/s/l79166x1fx6vc4l/Merged_Human_Ensembl_Entrez_Uniprot_Identifiers.txt?dl=1) 
    - Complexes ➞ [`reactome2py API`](https://github.com/reactome/reactome2py)  
    - Pathways ➞ [`reactome2py API`](https://github.com/reactome/reactome2py)  
    - Reactions ➞ [`reactome2py API`](https://github.com/reactome/reactome2py)  
    - Variants ➞ [`variant_summary.txt.gz`](ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz)  

<br>

**3. [Write Metadata Files](#write-metadata-files):** The Instance data node dictionary from _Step 1_ and metadata dictionaries from _Step 2_ are used to write `.txt` files for all `edge-type` data included in the instance node dictionary.

<br>

***
***

### Identify Instance Data Nodes  <a class="anchor" id="identify-instance-data-nodes"></a>

In [3]:
# read in data files for each edge type
edge_data = json.load(open('./resources/Master_Edge_List_Dict.json', 'r'))
edge_dict = {key:[edge_data[key]['data_type'], edge_data[key]['edge_list']] for key in edge_data.keys()}

**Sort Data:** For all edges in the `edge_dict()` that include instance data, we create a new dictionary where each edge type is further organized by node from the edge type that references the instance data (e.g. from the `chemical-gene` edge type, the `gene` node references instance data).

In [4]:
# sort data files
metadata_file_info = {}

for edge in tqdm(edge_dict.keys()): 
    if 'instance' in edge_dict[edge][0]:
        
        # get instance type
        inst_type = edge.split('-')[edge_dict[edge][0].split('-').index('instance')]
        
        # read in data
        if inst_type in metadata_file_info.keys(): 
            metadata_file_info[inst_type][edge] = {}
            metadata_file_info[inst_type][edge]['data'] = edge_dict[edge][1]
            metadata_file_info[inst_type][edge]['instance_idx'] = edge_dict[edge][0]
            
        else:
            metadata_file_info[inst_type] = {}
            metadata_file_info[inst_type][edge] =  {}
            metadata_file_info[inst_type][edge]['data'] =  edge_dict[edge][1]
            metadata_file_info[inst_type][edge]['instance_idx'] = edge_dict[edge][0]


100%|██████████| 40/40 [00:00<00:00, 51797.52it/s]


In [5]:
# set directory to write node data to
node_directory = './resources/node_data/'

<br>

***

### Generate Metadata Dictionaries  <a class="anchor" id="generate-metadata-dictionaries"></a>
In this step, the goal is to create a metadata dictionary for each node type that does not rely on API data. In this case, only the **Gene**, **RNA**, and **Variant** nodes require data that is not from an API.


**Genes Metadata Dictionary**

In [None]:
url = 'ftp://ftp.ncbi.nih.gov/gene/DATA/GENE_INFO/Mammalia/Homo_sapiens.gene_info.gz'
data_downloader(url, unprocessed_data_location)

In [6]:
# read in ncbi gene data
ncbi_gene = pandas.read_csv(unprocessed_data_location + 'Homo_sapiens.gene_info',
                            header = 0,
                            delimiter = '\t')


# replace "-" with "None"
ncbi_gene.replace('-','None', inplace=True)

In [7]:
# create metadata
genes, label, description, synonym = [], [], [], []

for idx, row in tqdm(ncbi_gene.iterrows(), total=ncbi_gene.shape[0]):
    
    # node 
    if row['GeneID'] != 'None':
        genes.append(row['GeneID'])
    
    # label -- only want metadata if there is a label
    if row['Symbol'] != 'None':
        label.append(row['Symbol'])
    
        # description
        description.append('{desc} is a {gene} gene that is located on chromosome {chrom} (map_location: {map_loc}).'.format(desc=row['description'].title(),
                                                                                                                            gene=row['type_of_gene'],
                                                                                                                            chrom=row['chromosome'],
                                                                                                                            map_loc=row['map_location']))
        # synonym
        if row['Synonyms'] != 'None' and row['Other_designations'] != 'None':
            synonym.append(row['Synonyms'] + '|' + row['Other_designations'])

        elif row['Synonyms'] != 'None' and row['Other_designations'] == 'None':
            synonym.append(row['Synonyms'])

        elif row['Synonyms'] == 'None' and row['Other_designations'] != 'None':
            synonym.append(row['Other_designations'])

        else:
            synonym.append('None')
            
    
# combine into new data frame        
gene_metadata_final = pandas.DataFrame(list(zip(genes, label, description, synonym)), columns =['ID', 'Label', 'Description', 'Synonym'])

# make all variables string
gene_metadata_final = gene_metadata_final.astype(str)

# convert df to dictionary
gene_metadata_final.set_index('ID', inplace=True)
gene_metadata_dict = gene_metadata_final.to_dict('index')

100%|██████████| 61645/61645 [00:15<00:00, 3943.38it/s]


**RNA Metadata Dictionary**

In [8]:
# read in data
rna_gene_data = pandas.read_csv(processed_data_location + 'Merged_Human_Ensembl_Entrez_Uniprot_Identifiers.txt',
                                header = 0,
                                delimiter = '\t',
                                low_memory=False)

In [9]:
# remove rows without identifiers
rna_data = rna_gene_data.loc[rna_gene_data['transcript_stable_id'].apply(lambda x: x != 'None')]
rna_data_labels = rna_gene_data.loc[rna_gene_data['GeneID_Cleaned'].apply(lambda x: x != 'None')]

# de-dup data
rna_metadata = rna_data[['transcript_stable_id', 'type_of_gene', 'Symbol', 'Synonyms', 'description', 'chromosome', 'map_location', 'Other_designations']].drop_duplicates(subset=['transcript_stable_id', 'type_of_gene', 'Symbol', 'Synonyms', 'description', 'chromosome', 'map_location', 'Other_designations'], keep='first', inplace=False) 


# aggregate mapping identifiers
agg_cols = []

for x in [ x for x in list(rna_data_labels) if x != 'transcript_stable_id']:
    if x == 'GeneID_Cleaned':
        agg_cols.append(rna_data_labels[['transcript_stable_id', x]].groupby('transcript_stable_id', as_index=False).agg(lambda x: '|'.join([x for x in list(set(x)) if x != 'None'])))
    
    else:
        agg_cols.append(rna_data_labels[['transcript_stable_id', x]].groupby('transcript_stable_id', as_index=False).agg(lambda x: ', '.join([x for x in list(set(x)) if x != 'None'])))

# merged aggreagted columns back together
rna_merged = reduce(lambda  left, right: pandas.merge(left, right, on=['transcript_stable_id'], how='outer'), agg_cols)

# replace NaN with 'None'
rna_merged.replace('','None', inplace=True)

# replace NaN with 'None'
rna_merged.fillna('None', inplace=True)

# remove rows without symbols

In [10]:
# create metadata
rna, label, description, synonym = [], [], [], []

for idx, row in tqdm(rna_merged.iterrows(), total=rna_merged.shape[0]):
    
    # node
    if row['transcript_stable_id'] != 'None':
        rna.append(row['transcript_stable_id'])
    
    # label -- only want metadata if there is a label
    if row['Symbol'] != 'None':
        label.append(row['Symbol'])
    
        # description
        description.append('This transcript was transcribed from {desc}, a {gene} gene that is located on chromosome {chrom} (map_location: {map_loc}).'.format(desc=row['description'].title(),
                                                                                                                                                                 gene=row['type_of_gene'],
                                                                                                                                                                 chrom=row['chromosome'],
                                                                                                                                                                 map_loc=row['map_location']))

        # synonym
        if row['Synonyms'] != 'None' and row['Other_designations'] != 'None':
            synonym.append(row['Synonyms'] + '|' + row['Other_designations'])

        elif row['Synonyms'] != 'None' and row['Other_designations'] == 'None':
            synonym.append(row['Synonyms'])

        elif row['Synonyms'] == 'None' and row['Other_designations'] != 'None':
            synonym.append(row['Other_designations'])

        else:
            synonym.append('None')
    
# combine into new data frame
rna_metadata_final = pandas.DataFrame(list(zip(rna, label, description, synonym)), columns =['ID', 'Label', 'Description', 'Synonym'])

# make all variables string
rna_metadata_final = rna_metadata_final.astype(str)

# convert df to dictionary
rna_metadata_final.set_index('ID', inplace=True)
rna_metadata_dict = rna_metadata_final.to_dict('index')

100%|██████████| 171919/171919 [00:48<00:00, 3535.18it/s]


**Variant Metadata Dictionary**

In [None]:
url = 'ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz'
data_downloader(url, unprocessed_data_location)

In [11]:
var_data = pandas.read_csv(unprocessed_data_location + 'variant_summary.txt',
                           header = 0,
                           delimiter = '\t',
                           low_memory=False)

In [12]:
# remove rows without identifiers
var_data = var_data.loc[var_data['Assembly'].apply(lambda x: x == 'GRCh38')]
var_data = var_data.loc[var_data['RS# (dbSNP)'].apply(lambda x: x != -1)]

# de-dup data
var_metadata = var_data[['#AlleleID', 'Type', 'Name', 'ClinicalSignificance', 'RS# (dbSNP)', 'Origin',
                         'ChromosomeAccession', 'Chromosome', 'Start', 'Stop', 'ReferenceAllele',
                         'Assembly', 'AlternateAllele','Cytogenetic', 'ReviewStatus', 'LastEvaluated']] 

# replace NaN with 'None'
var_metadata.fillna('None', inplace=True)

# remove duplicate dbSNP ids by choosing the most recent reviewed variant
var_metadata.sort_values('LastEvaluated', ascending=False, inplace=True)
var_metadata.drop_duplicates(subset='RS# (dbSNP)', keep='first', inplace=True)

In [13]:
# create metadata
variant, label, description = [], [], []

for idx, row in tqdm(var_metadata.iterrows(), total=var_metadata.shape[0]):
    
    # node
    if row['RS# (dbSNP)'] != 'None':
        variant.append('rs' + str(row['RS# (dbSNP)']))
    
    # label -- only want metadata if there is a label
    if row['Name'] != 'None':
        label.append(row['Name'])
    
        # description
        sent = 'This variant is a {Origin} {Type} that results when a {ReferenceAllele} allele is changed to {AlternateAllele} on chromosome {Chromosome} ({ChromosomeAccession}, start:{Start}/stop:{Stop} positions, cytogenetic location:{Cytogenetic}) and has clinical significance {ClinicalSignificance}. This entry is for the {Assembly} and was last reviewed on {LastEvaluated} with review status "{ReviewStatus}".'
        description.append(sent.format(Origin=row['Origin'], Type=row['Type'], ReferenceAllele=row['ReferenceAllele'],
                                       AlternateAllele=row['AlternateAllele'], Chromosome=row['Chromosome'],
                                       ChromosomeAccession=row['ChromosomeAccession'], Start=row['Start'],
                                       Stop=row['Stop'], Cytogenetic=row['Cytogenetic'], ClinicalSignificance=row['ClinicalSignificance'],
                                       Assembly=row['Assembly'], LastEvaluated=row['LastEvaluated'], ReviewStatus=row['ReviewStatus']))

# combine into new data frame
var_metadata_final = pandas.DataFrame(list(zip(variant, label, description)), columns =['ID', 'Label', 'Description'])

# drop duplicates
var_metadata_final.drop_duplicates(subset=None, keep='first', inplace=True)

# make all variables string
var_metadata_final = var_metadata_final.astype(str)

# convert df to dictionary
var_metadata_final.set_index('ID', inplace=True)
var_metadata_dict = var_metadata_final.to_dict('index')                      

100%|██████████| 429639/429639 [02:15<00:00, 3180.52it/s]


**Create Master Metadata Dictionary**  
To make it easier to navigate the mapping of each instance node in an edge, a master dictionary is created and keyed by node type. This is most useful when both nodes in an edge are instances, but of different data types (e.g. `gene-rna`).


In [14]:
master_metadata_dictionary = {'gene': gene_metadata_dict, 'rna': rna_metadata_dict, 'variant': var_metadata_dict}

<br>

***

### Write Metadata Files  <a class="anchor" id="write-metadata-files"></a>   
using the `Master Metadata Dictionary` created in the prior step, all of the `edge-type` data is processed and the resulting data written out `.txt` file to the `./resource/node_data` repository.

- [Genes](#gene-metadata)
- [RNA](#rna-metadata)
- [Pathways](#pathway-metadata)
- [Complexes](#complex-metadata)
- [Reactions](#reaction-metadata)
- [Variants](#variant-metadata)

### Genes <a class="anchor" id="gene-metadata"></a>

**Data Source Wiki Pages:** [NCBI Gene](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#ncbi-gene) 

**Output:**  
- chemical-gene ➞ [`chemical-gene_GENE_METADATA.txt`](https://www.dropbox.com/s/fvkqnuk5xhs0huh/chemical-gene_GENE_METADATA.txt?dl=1) 
- gene-disease ➞ [`gene-disease_GENE_METADATA.txt`](https://www.dropbox.com/s/o0y21rx3b829q6d/gene-disease_GENE_METADATA.txt?dl=1) 
- gene-gene ➞ [`gene-gene_GENE_METADATA.txt`](https://www.dropbox.com/s/i4gznnct7rzh7pn/gene-gene_GENE_METADATA.txt?dl=1) 
- gene-pathway ➞ [`gene-pathway_GENE_METADATA.txt`](https://www.dropbox.com/s/yncd95vanhkp0ey/gene-pathway_GENE_METADATA.txt?dl=1) 
- gene-phenotype ➞ [`gene-phenotype_GENE_METADATA.txt`](https://www.dropbox.com/s/jghcoc5xzada011/gene-phenotype_GENE_METADATA.txt?dl=1) 
- gene-protein ➞ [`gene-protein_GENE_METADATA.txt`](https://www.dropbox.com/s/6vu961lna08qn08/gene-protein_GENE_METADATA.txt?dl=1) 
- gene-rna ➞ [`gene-rna_GENE_METADATA.txt`](https://www.dropbox.com/s/vs0kirmugdo9zkd/gene-rna_GENE_METADATA.txt?dl=1) 

In [None]:
node_type = 'gene'

for edge_type in tqdm(metadata_file_info[node_type]):
    print('\nPROCESSING EDGE TYPE: {}'.format(edge_type))

    # gather vars for processing data
    data = metadata_file_info[node_type][edge_type]['data']
    edge_data_type = metadata_file_info[node_type][edge_type]['instance_idx']
    inst_idx = edge_data_type.split('-').index('instance')

    # get list of nodes to map and the dictionary to use
    # when nodes are of the same type (i.e. gene-gene)
    if (edge_type.split('-')[0] == edge_type.split('-')[1]):
        nodes = set([x for y in data for x in y])
        
        if edge_type.split('-')[0] in master_metadata_dictionary.keys():
            metadata_dictionaries = master_metadata_dictionary[edge_type.split('-')[0]]
            results = metadata_dictionary_mapper(nodes, metadata_dictionaries)
        else:
            results = metadata_api_mapper(list(nodes))
            
    # when nodes are both instances, but different types (i.e. gene-rna)
    elif edge_data_type.split('-')[0] == edge_data_type.split('-')[1]:
        data_res = []

        for node in edge_type.split('-'):
            nodes = set([x[int(edge_type.split('-').index(node))] for x in data])
            
            if node in master_metadata_dictionary:
                metadata_dictionaries = master_metadata_dictionary[node]
                data_res.append(metadata_dictionary_mapper(nodes, metadata_dictionaries))
            else:
                data_res.append(metadata_api_mapper(list(nodes)))
    
        # combine data into single df
        results = pandas.concat(data_res, ignore_index=True)
                
    # when only one node is an instance
    else:
        nodes = set([x[int(inst_idx)] for x in data])
        
        if edge_type.split('-')[int(inst_idx)] in master_metadata_dictionary.keys():
            metadata_dictionaries = master_metadata_dictionary[edge_type.split('-')[int(inst_idx)]]
            results = metadata_dictionary_mapper(nodes, metadata_dictionaries)
        else:
            results = metadata_api_mapper(list(nodes))

    # write data
    results.to_csv(node_directory + edge_type + '_' + node_type.upper() + '_METADATA.txt', header = True, sep = '\t', index = False)


<br>

***

### RNA<a class="anchor" id="rna-metadata"></a>

**Data Source Wiki Pages:**  
- [Ensembl](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#clinvar)  
- [NCBI Gene](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#ncbi-gene) 

**Output:**  
- chemical-rna ➞ [`chemical-rna_RNA_METADATA.txt`](https://www.dropbox.com/s/sm0orl0waq5iqhd/chemical-rna_RNA_METADATA.txt?dl=1) 
- rna-anatomy ➞ [`rna-anatomy_RNA_METADATA.txt`](https://www.dropbox.com/s/plkrunhhusx6mez/rna-anatomy_RNA_METADATA.txt?dl=1) 
- rna-cell ➞ [`rna-cell_RNA_METADATA.txt`](https://www.dropbox.com/s/dld0eadxyyzr44y/rna-cell_RNA_METADATA.txt?dl=1) 
- rna-protein ➞ [`rna-protein_RNA_METADATA.txt`](https://www.dropbox.com/s/3g72sb2e685rptn/rna-protein_RNA_METADATA.txt?dl=1) 

In [16]:
node_type = 'rna'

for edge_type in tqdm(metadata_file_info[node_type]):
    print('\nPROCESSING EDGE TYPE: {}'.format(edge_type))

    # gather vars for processing data
    data = metadata_file_info[node_type][edge_type]['data']
    edge_data_type = metadata_file_info[node_type][edge_type]['instance_idx']
    inst_idx = edge_data_type.split('-').index('instance')

    # get list of nodes to map and the dictionary to use
    # when nodes are of the same type (i.e. gene-gene)
    if (edge_type.split('-')[0] == edge_type.split('-')[1]):
        nodes = set([x for y in data for x in y])
        
        if edge_type.split('-')[0] in master_metadata_dictionary.keys():
            metadata_dictionaries = master_metadata_dictionary[edge_type.split('-')[0]]
            results = metadata_dictionary_mapper(nodes, metadata_dictionaries)
        else:
            results = metadata_api_mapper(list(nodes))
            
    # when nodes are both instances, but different types (i.e. gene-rna)
    elif edge_data_type.split('-')[0] == edge_data_type.split('-')[1]:
        data_res = []

        for node in edge_type.split('-'):
            nodes = set([x[int(edge_type.split('-').index(node))] for x in data])
            
            if node in master_metadata_dictionary:
                metadata_dictionaries = master_metadata_dictionary[node]
                data_res.append(metadata_dictionary_mapper(nodes, metadata_dictionaries))
            else:
                data_res.append(metadata_api_mapper(list(nodes)))
    
        # combine data into single df
        results = pandas.concat(data_res, ignore_index=True)
                
    # when only one node is an instance
    else:
        nodes = set([x[int(inst_idx)] for x in data])
        
        if edge_type.split('-')[int(inst_idx)] in master_metadata_dictionary.keys():
            metadata_dictionaries = master_metadata_dictionary[edge_type.split('-')[int(inst_idx)]]
            results = metadata_dictionary_mapper(nodes, metadata_dictionaries)
        else:
            results = metadata_api_mapper(list(nodes))

    # write data
    results.to_csv(node_directory + edge_type + '_' + node_type.upper() + '_METADATA.txt', header = True, sep = '\t', index = False)


<br>

***

### Pathways<a class="anchor" id="pathway-metadata"></a>

**Data Source Wiki Page:** [Reactome](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#reactome-pathway-database)  

**Output:**    
- chemical-pathway ➞ [`chemical-pathway_PATHWAY_METADATA.txt`](https://www.dropbox.com/s/2txg2ui4e6y7rnm/chemical-pathway_PATHWAY_METADATA.txt?dl=1)
- gobp-pathway ➞ [`gobp-pathway_PATHWAY_METADATA.txt`](https://www.dropbox.com/s/bq0g1g4ef40vwxj/gobp-pathway_PATHWAY_METADATA.txt?dl=1)
- pathway-gocc ➞ [`pathway-gocc_PATHWAY_METADATA.txt`](https://www.dropbox.com/s/6fzkzjxj08u6jfi/pathway-gocc_PATHWAY_METADATA.txt?dl=1)
- pathway-gomf ➞ [`pathway-gomf_PATHWAY_METADATA.txt`](https://www.dropbox.com/s/gfqt86vujnoo7j5/pathway-gomf_PATHWAY_METADATA.txt?dl=1)
- protein-pathway ➞ [`protein-pathway_PATHWAY_METADATA.txt`](https://www.dropbox.com/s/xadtz4c0ab4a7p9/protein-pathway_PATHWAY_METADATA.txt?dl=1)

In [17]:
node_type = 'pathway'

for edge_type in tqdm(metadata_file_info[node_type]):
    print('\nPROCESSING EDGE TYPE: {}'.format(edge_type))

    # gather vars for processing data
    data = metadata_file_info[node_type][edge_type]['data']
    edge_data_type = metadata_file_info[node_type][edge_type]['instance_idx']
    inst_idx = edge_data_type.split('-').index('instance')

    # get list of nodes to map and the dictionary to use
    # when nodes are of the same type (i.e. gene-gene)
    if (edge_type.split('-')[0] == edge_type.split('-')[1]):
        nodes = set([x for y in data for x in y])
        
        if edge_type.split('-')[0] in master_metadata_dictionary.keys():
            metadata_dictionaries = master_metadata_dictionary[edge_type.split('-')[0]]
            results = metadata_dictionary_mapper(nodes, metadata_dictionaries)
        else:
            results = metadata_api_mapper(list(nodes))
            
    # when nodes are both instances, but different types (i.e. gene-rna)
    elif edge_data_type.split('-')[0] == edge_data_type.split('-')[1]:
        data_res = []

        for node in edge_type.split('-'):
            nodes = set([x[int(edge_type.split('-').index(node))] for x in data])
            
            if node in master_metadata_dictionary:
                metadata_dictionaries = master_metadata_dictionary[node]
                data_res.append(metadata_dictionary_mapper(nodes, metadata_dictionaries))
            else:
                data_res.append(metadata_api_mapper(list(nodes)))
    
        # combine data into single df
        results = pandas.concat(data_res, ignore_index=True)
                
    # when only one node is an instance
    else:
        nodes = set([x[int(inst_idx)] for x in data])
        
        if edge_type.split('-')[int(inst_idx)] in master_metadata_dictionary.keys():
            metadata_dictionaries = master_metadata_dictionary[edge_type.split('-')[int(inst_idx)]]
            results = metadata_dictionary_mapper(nodes, metadata_dictionaries)
        else:
            results = metadata_api_mapper(list(nodes))

    # write data
    results.to_csv(node_directory + edge_type + '_' + node_type.upper() + '_METADATA.txt', header = True, sep = '\t', index = False)


<br>

***
***

### Complexes<a class="anchor" id="complex-metadata"></a>

**Data Source Wiki Page:** [Reactome](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#reactome-pathway-database)    

**Output:**    
- chemical-complex ➞ [`chemical-complex_COMPLEX_METADATA.txt`](https://www.dropbox.com/s/mu53u8fv5v0epvf/chemical-complex_COMPLEX_METADATA.txt?dl=1) 
- complex-complex ➞ [`complex-complex_COMPLEX_METADATA.txt`](https://www.dropbox.com/s/y4qt0ne47ix1tqb/complex-complex_COMPLEX_METADATA.txt?dl=1) 
- complex-pathway ➞ [`complex-pathway_COMPLEX_METADATA.txt`](https://www.dropbox.com/s/6n9w0vvxabi7efl/complex-pathway_COMPLEX_METADATA.txt?dl=1) 

In [18]:
node_type = 'complex'

for edge_type in tqdm(metadata_file_info[node_type]):
    print('\nPROCESSING EDGE TYPE: {}'.format(edge_type))

    # gather vars for processing data
    data = metadata_file_info[node_type][edge_type]['data']
    edge_data_type = metadata_file_info[node_type][edge_type]['instance_idx']
    inst_idx = edge_data_type.split('-').index('instance')

    # get list of nodes to map and the dictionary to use
    # when nodes are of the same type (i.e. gene-gene)
    if (edge_type.split('-')[0] == edge_type.split('-')[1]):
        nodes = set([x for y in data for x in y])
        
        if edge_type.split('-')[0] in master_metadata_dictionary.keys():
            metadata_dictionaries = master_metadata_dictionary[edge_type.split('-')[0]]
            results = metadata_dictionary_mapper(nodes, metadata_dictionaries)
        else:
            results = metadata_api_mapper(list(nodes))
            
    # when nodes are both instances, but different types (i.e. gene-rna)
    elif edge_data_type.split('-')[0] == edge_data_type.split('-')[1]:
        data_res = []

        for node in edge_type.split('-'):
            nodes = set([x[int(edge_type.split('-').index(node))] for x in data])
            
            if node in master_metadata_dictionary:
                metadata_dictionaries = master_metadata_dictionary[node]
                data_res.append(metadata_dictionary_mapper(nodes, metadata_dictionaries))
            else:
                data_res.append(metadata_api_mapper(list(nodes)))
    
        # combine data into single df
        results = pandas.concat(data_res, ignore_index=True)
                
    # when only one node is an instance
    else:
        nodes = set([x[int(inst_idx)] for x in data])
        
        if edge_type.split('-')[int(inst_idx)] in master_metadata_dictionary.keys():
            metadata_dictionaries = master_metadata_dictionary[edge_type.split('-')[int(inst_idx)]]
            results = metadata_dictionary_mapper(nodes, metadata_dictionaries)
        else:
            results = metadata_api_mapper(list(nodes))

    # write data
    results.to_csv(node_directory + edge_type + '_' + node_type.upper() + '_METADATA.txt', header = True, sep = '\t', index = False)


<br>

***

### Reactions<a class="anchor" id="reaction-metadata"></a>

**Data Source Wiki Page:** [Reactome](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#reactome-pathway-database)   

**Output:**    
- chemical-reaction ➞ [`chemical-reaction_REACTION_METADATA.txt`](https://www.dropbox.com/s/6iztwaxrhrp1f7h/chemical-reaction_REACTION_METADATA.txt?dl=1)
- protein-reaction ➞ [`protein-reaction_REACTION_METADATA.txt`](https://www.dropbox.com/s/92vsuon54w4uq4j/protein-reaction_REACTION_METADATA.txt?dl=1)

In [None]:
node_type = 'reaction'

for edge_type in tqdm(metadata_file_info[node_type]):
    print('\nPROCESSING EDGE TYPE: {}'.format(edge_type))

    # gather vars for processing data
    data = metadata_file_info[node_type][edge_type]['data']
    edge_data_type = metadata_file_info[node_type][edge_type]['instance_idx']
    inst_idx = edge_data_type.split('-').index('instance')

    # get list of nodes to map and the dictionary to use
    # when nodes are of the same type (i.e. gene-gene)
    if (edge_type.split('-')[0] == edge_type.split('-')[1]):
        nodes = set([x for y in data for x in y])
        
        if edge_type.split('-')[0] in master_metadata_dictionary.keys():
            metadata_dictionaries = master_metadata_dictionary[edge_type.split('-')[0]]
            results = metadata_dictionary_mapper(nodes, metadata_dictionaries)
        else:
            results = metadata_api_mapper(list(nodes))
            
    # when nodes are both instances, but different types (i.e. gene-rna)
    elif edge_data_type.split('-')[0] == edge_data_type.split('-')[1]:
        data_res = []

        for node in edge_type.split('-'):
            nodes = set([x[int(edge_type.split('-').index(node))] for x in data])
            
            if node in master_metadata_dictionary:
                metadata_dictionaries = master_metadata_dictionary[node]
                data_res.append(metadata_dictionary_mapper(nodes, metadata_dictionaries))
            else:
                data_res.append(metadata_api_mapper(list(nodes)))
    
        # combine data into single df
        results = pandas.concat(data_res, ignore_index=True)
                
    # when only one node is an instance
    else:
        nodes = set([x[int(inst_idx)] for x in data])
        
        if edge_type.split('-')[int(inst_idx)] in master_metadata_dictionary.keys():
            metadata_dictionaries = master_metadata_dictionary[edge_type.split('-')[int(inst_idx)]]
            results = metadata_dictionary_mapper(nodes, metadata_dictionaries)
        else:
            results = metadata_api_mapper(list(nodes))

    # write data
    results.to_csv(node_directory + edge_type + '_' + node_type.upper() + '_METADATA.txt', header = True, sep = '\t', index = False)


<br>

***
***

### Variants<a class="anchor" id="variant-metadata"></a>

**Data Source Wiki Page:** [ClinVar](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#clinvar)  

**Output:**  
- variant-disease ➞ [`variant-disease_VARIANT_METADATA.txt`](https://www.dropbox.com/s/vj440u5efwdwibl/variant-disease_VARIANT_METADATA.txt?dl=1)  
- variant-gene ➞ [`variant-gene_VARIANT_METADATA.txt`](https://www.dropbox.com/s/geui7nby9h055bc/variant-gene_VARIANT_METADATA.txt?dl=1)  
- variant-phenotype ➞ [`variant-phenotype_VARIANT_METADATA.txt`](https://www.dropbox.com/s/hnocd802detivdd/variant-phenotype_VARIANT_METADATA.txt?dl=1)  

In [None]:
node_type = 'variant'

for edge_type in tqdm(metadata_file_info[node_type]):
    print('\nPROCESSING EDGE TYPE: {}'.format(edge_type))

    # gather vars for processing data
    data = metadata_file_info[node_type][edge_type]['data']
    edge_data_type = metadata_file_info[node_type][edge_type]['instance_idx']
    inst_idx = edge_data_type.split('-').index('instance')

    # get list of nodes to map and the dictionary to use
    # when nodes are of the same type (i.e. gene-gene)
    if (edge_type.split('-')[0] == edge_type.split('-')[1]):
        nodes = set([x for y in data for x in y])
        
        if edge_type.split('-')[0] in master_metadata_dictionary.keys():
            metadata_dictionaries = master_metadata_dictionary[edge_type.split('-')[0]]
            results = metadata_dictionary_mapper(nodes, metadata_dictionaries)
        else:
            results = metadata_api_mapper(list(nodes))
            
    # when nodes are both instances, but different types (i.e. gene-rna)
    elif edge_data_type.split('-')[0] == edge_data_type.split('-')[1]:
        data_res = []

        for node in edge_type.split('-'):
            nodes = set([x[int(edge_type.split('-').index(node))] for x in data])
            
            if node in master_metadata_dictionary:
                metadata_dictionaries = master_metadata_dictionary[node]
                data_res.append(metadata_dictionary_mapper(nodes, metadata_dictionaries))
            else:
                data_res.append(metadata_api_mapper(list(nodes)))
    
        # combine data into single df
        results = pandas.concat(data_res, ignore_index=True)
                
    # when only one node is an instance
    else:
        nodes = set([x[int(inst_idx)] for x in data])
        
        if edge_type.split('-')[int(inst_idx)] in master_metadata_dictionary.keys():
            metadata_dictionaries = master_metadata_dictionary[edge_type.split('-')[int(inst_idx)]]
            results = metadata_dictionary_mapper(nodes, metadata_dictionaries)
        else:
            results = metadata_api_mapper(list(nodes))

    # write data
    results.to_csv(node_directory + edge_type + '_' + node_type.upper() + '_METADATA.txt', header = True, sep = '\t', index = False)



<br>

***
***

```
@misc{callahan_tj_2019_3401437,
  author       = {Callahan, TJ},
  title        = {PheKnowLator},
  month        = mar,
  year         = 2019,
  doi          = {10.5281/zenodo.3401437},
  url          = {https://doi.org/10.5281/zenodo.3401437}
}
```