***
# PheKnowLator - Data Preparation
***
***

**Author:** [TJCallahan](https://mail.google.com/mail/u/0/?view=cm&fs=1&tf=1&to=callahantiff@gmail.com)  
**GitHub Repository:** [PheKnowLator](https://github.com/callahantiff/PheKnowLator/wiki)  
**Release:** **[v2.0.0](https://github.com/callahantiff/PheKnowLator/wiki/v2.0.0)**

**Purpose:** This notebook serves as a script to download and process data in order to generate the mapping and filtering data needed to build edges for the PheKnowLator knowledge graph. For more information on the data sources used within this script, please see the [Data Sources](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources) Wiki page.

**Assumptions:**   
- Raw data downloads ➞ `./resources/processed_data/unprocessed_data`    
- Processed data write location ➞ `./resources/processed_data`   

**Dependencies:** This notebook uses several helper functions, which are stored in the [`DataPreparationHelperFunctions.py`](https://github.com/callahantiff/PheKnowLator/blob/master/scripts/python/DataPreparationHelperFunctions.py) script. Hyperlinks to all downloaded and generated data sources are provided on the [Data Sources](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources) Wiki page as well as within each source subsection of this notebook. All generated data is freely available for download from Dropbox. 

_____
***

## Table of Contents
***

### [Create Identifier Maps ](#create-identifier-maps)  
- [HUMAN TRANSCRIPT, GENE, AND PROTEIN IDENTIFIER MAPPING](#human-transcript,-gene,-and-protein-identifier-mapping)
  - [Ensembl Gene-Ensembl Transcript](#ensemblgene-ensembltranscript)  
  - [Ensembl Gene-Entrez Gene](#ensemblgene-entrezgene)
  - [Ensembl Transcript-Protein Ontology](#ensembltranscript-proteinontology)
  - [Entrez Gene-Ensembl Transcript](#entrezgene-ensembltranscript)
  - [Entrez Gene-Protein Ontology](#entrezgene-proteinontology)  
  - [STRING-Protein Ontology](#string-proteinontology)  
  - [Uniprot Accession-Protein Ontology](#uniprotaccession-proteinontology)


- [OTHER IDENTIFIER MAPPING](#other-identifier-mapping) 
  - [ChEBI Identifiers](#mesh-chebi) 
  - [Human Disease and Phenotype Identifiers](#disease-identifiers)
  - [Human Protein Atlas Tissue and Cell Types](#hpa-uberon)  

<br>

### [Create Edge Datasets](#create-edge-datasets)
- [ONTOLOGIES](#ontologies)  
  - [Protein Ontology](#protein-ontology)  
  - [Relations Ontology](#relations-ontology)  


- [LINKED DATA](#linked-data)  
  - [Clinvar Variant-Diseases and Phenotypes](#clinvar-variant)
  - [NCBI Gene Protein-Coding Genes and Proteins](#ncbi-protein-coding-genes)  
  - [Reactome Chemical-Complex Data](#reactome-chemical-complex)
  - [Reactome Complex-Complex Data](#reactome-complex-complex)
  - [Reactome Complex-Pathway Data](#reactome-complex-pathway)
  - [Reactome Protein-Complex Data](#reactome-protein-complex)
  - [Uniprot Protein-Cofactor and Protein-Catalyst](#uniprot-protein-cofactorcatalyst)  

<br>

### [Gather Instance Data Metadata](#create-instance-metadata)  
- [Genes/RNA](#gene-and-rna-metadata)
- [Pathways](#pathway-metadata)
- [Complexes](#complex-metadata)
- [Reactions](#reaction-metadata)
- [Variants](#variant-metadata) 

____

<br>

### Set-Up Environment
_____

In [3]:
# import needed libraries
import glob
import networkx
import numpy
import pandas
import re

from functools import reduce
from owlready2 import subprocess
from rdflib import Graph, Namespace, URIRef, BNode, extras, Literal
from rdflib.extras.external_graph_libs import *
from tqdm import tqdm

# import script containing helper functions
from scripts.python.DataPreparationHelperFunctions import *

**Define Global Variables**

In [4]:
# directory to read unprocessed data files from
unprocessed_data_location = 'resources/processed_data/unprocessed_data/'

# directory to write processed data files to
processed_data_location = 'resources/processed_data/'

***
***
### CREATE MAPPING DATASETS  <a class="anchor" id="create-identifier-maps"></a>
***
***

### Human Transcript, Gene, and Protein Identifier Mapping  <a class="anchor" id="human-transcript,-gene,-and-protein-identifier-mapping"></a>
***

**Data Source Wiki Pages:**   
- [Ensembl](https://uswest.ensembl.org/)
- [Uniprot Knowledgebase](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#uniprot-knowledgebase)   
- [NCBI Gene](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#ncbi-gene) 
- [Protein Ontology](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#protein-ontology)

**Purpose:** To create protein-coding gene-protein relations and mappings between the identifiers listed below. The edge types produced from each of these mappings are described within the corresponding identifier mapping section:  
- [Ensembl Gene-Ensembl Transcript](#ensemblgene-ensembltranscript)  
- [Entrez Gene-Ensembl Transcript](#entrezgene-ensembltranscript)  
- [Entrez Gene-Protein Ontology](#entrezgene-proteinontology)  
- [Ensembl Gene-Entrez Gene](#ensemblgene-entrezgene)
- [Uniprot Accession-Protein Ontology](#uniprotaccession-proteinontology)
- [STRING-Protein Ontology](#string-proteinontology)

**Output:** This script downloads and saves the following data:  

- Human Ensembl-UniProt Identifiers: [Homo_sapiens.GRCh38.98.uniprot.tsv.gz](ftp://ftp.ensembl.org/pub/release-98/tsv/homo_sapiens/Homo_sapiens.GRCh38.98.uniprot.tsv.gz) ➞ [`Homo_sapiens.GRCh38.98.uniprot.tsv`](https://www.dropbox.com/s/cesjvqz1b8c7ami/Homo_sapiens.GRCh38.98.uniprot.tsv?dl=1) 
- Human Ensembl-Entrez Identifiers: [Homo_sapiens.GRCh38.98.entrez.tsv.gz](ftp://ftp.ensembl.org/pub/release-98/tsv/homo_sapiens/Homo_sapiens.GRCh38.98.entrez.tsv.gz) ➞ [`Homo_sapiens.GRCh38.98.entrez.tsv`](https://www.dropbox.com/s/5kstw70py0azvws/Homo_sapiens.GRCh38.98.entrez.tsv?dl=1) 
- Human Gene Identifiers: [Homo_sapiens.gene_info.gz](ftp://ftp.ncbi.nih.gov/gene/DATA/GENE_INFO/Mammalia/Homo_sapiens.gene_info.gz) ➞ [`Homo_sapiens.gene_info`](https://www.dropbox.com/s/vazlmzxydgv6xzz/Homo_sapiens.gene_info?dl=1)  
- Human Protein Identifiers: [promapping.txt](https://proconsortium.org/download/current/promapping.txt) ➞ [`promapping.txt`](https://www.dropbox.com/s/x7wdimv6ph6bl8k/promapping.txt?dl=1) 

_All Merged Data Sets:_ [`Merged_Human_Ensembl_Entrez_Uniprot_Identifiers.txt`](https://www.dropbox.com/s/l79166x1fx6vc4l/Merged_Human_Ensembl_Entrez_Uniprot_Identifiers.txt?dl=1)

***

***
**Process Data:** `Homo_sapiens.GRCh38.98.uniprot.tsv`

In [45]:
url1 = 'ftp://ftp.ensembl.org/pub/release-98/tsv/homo_sapiens/Homo_sapiens.GRCh38.98.uniprot.tsv.gz'
data_downloader(url1, unprocessed_data_location)

url2 = 'ftp://ftp.ensembl.org/pub/release-98/tsv/homo_sapiens/Homo_sapiens.GRCh38.98.entrez.tsv.gz'
data_downloader(url2, unprocessed_data_location)

Downloading gzipped data from ftp server
Decompressing and writing gzipped data
Downloading gzipped data from ftp server
Decompressing and writing gzipped data
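
The `data_downloader` helper lives in `DataPreparationHelperFunctions.py` and is not shown in this notebook. As a rough illustration of the "decompressing and writing" step logged above, the sketch below decompresses a gzip payload that is already in memory and writes it to disk. The function name `write_decompressed` is hypothetical and this is not the helper's actual implementation.

```python
import gzip
from pathlib import Path


def write_decompressed(gz_bytes: bytes, out_path: str) -> int:
    """Decompress an in-memory gzip payload and write it to out_path.

    Hypothetical helper for illustration only; returns the number of
    decompressed bytes written.
    """
    data = gzip.decompress(gz_bytes)
    # create the target directory if it does not exist yet
    Path(out_path).parent.mkdir(parents=True, exist_ok=True)
    Path(out_path).write_bytes(data)
    return len(data)
```

Keeping decompression separate from the network call makes the step easy to check without an FTP connection.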


In [56]:
# read in ensembl-uniprot data
ensembl1 = pandas.read_csv(unprocessed_data_location + 'Homo_sapiens.GRCh38.98.uniprot.tsv',
                          header = 0,
                          delimiter = '\t',
                          low_memory=False)
# replace "-"
ensembl1.replace('-','None', inplace=True)

# read in ensembl-entrez data
ensembl2 = pandas.read_csv(unprocessed_data_location + 'Homo_sapiens.GRCh38.98.entrez.tsv',
                          header = 0,
                          delimiter = '\t',
                          low_memory=False)

# replace "-"
ensembl2.replace('-','None', inplace=True)

# merge datasets
ensembl = pandas.merge(ensembl1[['gene_stable_id', 'transcript_stable_id', 'protein_stable_id', 'xref']],
                       ensembl2[['gene_stable_id', 'transcript_stable_id', 'protein_stable_id', 'xref']],
                       left_on=['gene_stable_id', 'transcript_stable_id', 'protein_stable_id'],
                       right_on=['gene_stable_id', 'transcript_stable_id', 'protein_stable_id'],
                       how='outer')

# rename columns
ensembl.rename(columns={'xref_x': 'xref_uniprot', 'xref_y': 'xref_entrez'}, inplace=True)

# replace NaN with 'None'
ensembl.fillna('None', inplace=True)

# preview data
ensembl.head(n=3)

Unnamed: 0,gene_stable_id,transcript_stable_id,protein_stable_id,xref_uniprot,xref_entrez
0,ENSG00000186092,ENST00000641515,ENSP00000493376,A0A2U3U0J3,79501
1,ENSG00000186092,ENST00000335137,ENSP00000334393,Q8NH21,79501
2,ENSG00000284733,ENST00000426406,ENSP00000409316,Q6IEY1,729759


***

**Process Data:** `Homo_sapiens.gene_info`

In [5]:
url = 'ftp://ftp.ncbi.nih.gov/gene/DATA/GENE_INFO/Mammalia/Homo_sapiens.gene_info.gz'
data_downloader(url, unprocessed_data_location)

Downloading gzipped data from ftp server
Decompressing and writing gzipped data


In [109]:
ncbi_gene = pandas.read_csv(unprocessed_data_location + 'Homo_sapiens.gene_info', header = 0, delimiter = '\t')

# replace "-" with "None"
ncbi_gene.replace('-','None', inplace=True)

# explode nested data
explode_df_ncbi_gene = explode(ncbi_gene.copy(), ['dbXrefs'], '|')

# remove identifier type, which appears before ':' (raw string avoids an invalid-escape warning)
explode_df_ncbi_gene['dbXrefs'] = explode_df_ncbi_gene['dbXrefs'].replace(r'^\w*:', '', regex=True)

# preview data
explode_df_ncbi_gene.head(n=3)

Unnamed: 0,#tax_id,GeneID,Symbol,LocusTag,Synonyms,dbXrefs,chromosome,map_location,description,type_of_gene,Symbol_from_nomenclature_authority,Full_name_from_nomenclature_authority,Nomenclature_status,Other_designations,Modification_date,Feature_type
0,9606,1,A1BG,,A1B|ABG|GAB|HYST2477,138670,19,19q13.43,alpha-1-B glycoprotein,protein-coding,A1BG,alpha-1-B glycoprotein,O,alpha-1B-glycoprotein|HEL-S-163pA|epididymis s...,20191220,
1,9606,1,A1BG,,A1B|ABG|GAB|HYST2477,HGNC:5,19,19q13.43,alpha-1-B glycoprotein,protein-coding,A1BG,alpha-1-B glycoprotein,O,alpha-1B-glycoprotein|HEL-S-163pA|epididymis s...,20191220,
2,9606,1,A1BG,,A1B|ABG|GAB|HYST2477,ENSG00000121410,19,19q13.43,alpha-1-B glycoprotein,protein-coding,A1BG,alpha-1-B glycoprotein,O,alpha-1B-glycoprotein|HEL-S-163pA|epididymis s...,20191220,
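
The `explode` helper is also defined in the helper script and is not shown here. The following is a minimal pure-Python sketch of the two steps above (splitting a pipe-delimited field into one row per value, then dropping the leading source label), assuming rows are plain dictionaries. The names `explode_field` and `strip_prefix` are hypothetical, and the sample value is chosen to be consistent with the preview above.

```python
import re


def explode_field(rows, field, sep='|'):
    """Return one output row per sep-delimited value found in row[field]."""
    exploded = []
    for row in rows:
        for value in str(row[field]).split(sep):
            new_row = dict(row)  # copy so the input rows are left untouched
            new_row[field] = value
            exploded.append(new_row)
    return exploded


def strip_prefix(identifier):
    """Drop the leading 'Source:' label, e.g. 'MIM:138670' becomes '138670'."""
    return re.sub(r'^\w*:', '', identifier)


rows = [{'GeneID': 1, 'dbXrefs': 'MIM:138670|HGNC:HGNC:5|Ensembl:ENSG00000121410'}]
xrefs = [strip_prefix(r['dbXrefs']) for r in explode_field(rows, 'dbXrefs')]
print(xrefs)  # ['138670', 'HGNC:5', 'ENSG00000121410']
```

Only the first label is removed, which is why `HGNC:HGNC:5` becomes `HGNC:5` rather than `5`.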


In [None]:
# create dictionary for quick look-up of ensembl gene ids by entrez gene ids
# EXPLANATION: an incorrect number of identifiers is returned when attempting to merge columns; this strategy is more reliable
ensembl_map = {}

# loop over data and fill in missing values
for idx, row in tqdm(explode_df_ncbi_gene.iterrows(), total=explode_df_ncbi_gene.shape[0]):
    if (row['GeneID'] != 'None' and row['dbXrefs'] != 'None') and row['dbXrefs'].startswith('ENSG'):
        if row['GeneID'] in ensembl_map.keys():
            ensembl_map[row['GeneID']].append(row['dbXrefs'])
        else:
            ensembl_map[row['GeneID']] = [row['dbXrefs']]
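
The same lookup can be sketched without the explicit key check by using `collections.defaultdict`. This is an illustrative alternative operating on `(gene_id, xref)` pairs rather than the notebook's own code; the function name `build_ensembl_map` is hypothetical.

```python
from collections import defaultdict


def build_ensembl_map(pairs):
    """Group Ensembl gene IDs by Entrez gene ID, skipping non-Ensembl cross-references."""
    ensembl_map = defaultdict(list)
    for gene_id, xref in pairs:
        if gene_id != 'None' and xref != 'None' and xref.startswith('ENSG'):
            ensembl_map[gene_id].append(xref)
    return dict(ensembl_map)


pairs = [(1, '138670'), (1, 'HGNC:5'), (1, 'ENSG00000121410'), (2, 'None')]
lookup = build_ensembl_map(pairs)
print(lookup)  # {1: ['ENSG00000121410']}
```

With a DataFrame, the pairs could come from `zip(df['GeneID'], df['dbXrefs'])`, which avoids the per-row overhead of `iterrows`.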

***
**Process Data:** `promapping.txt`

In [4]:
url = 'https://proconsortium.org/download/current/promapping.txt'
data_downloader(url, unprocessed_data_location)

Downloading data file


In [58]:
pro_mapping = pandas.read_csv(unprocessed_data_location + 'promapping.txt',
                              header = None,
                              names = ['pro_id', 'Entry', 'pro_mapping'],
                              delimiter = '\t')

# keep only rows whose entry is a UniProtKB identifier (copy to avoid modifying a slice)
pro_mapping = pro_mapping.loc[pro_mapping['Entry'].apply(lambda x: x.startswith('UniProtKB:'))].copy()

# remove identifier type, which appears before ':'
pro_mapping['Entry'] = pro_mapping['Entry'].replace(r'^\w*:', '', regex=True)

# preview data
pro_mapping.head(n=3)

Unnamed: 0,pro_id,Entry,pro_mapping
6,PR:000000005,P37173,is_a
7,PR:000000005,P38438,is_a
8,PR:000000005,Q62312,is_a


***
***
**Merge Processed Data:** `ensembl` + `Homo_sapiens.gene_info`

In [201]:
# make sure that merge columns are of same type
ncbi_gene['GeneID'] = ncbi_gene['GeneID'].astype(str)

# merge ensembl and ncbi gene data
ensembl_ncbi_merged_data = pandas.merge(ensembl,
                                        ncbi_gene,
                                        left_on=['xref_entrez'],
                                        right_on=['GeneID'],
                                        how='outer')

# replace NaN with 'None'
ensembl_ncbi_merged_data.fillna('None', inplace=True)

_Clean Merged Data_

In [None]:
# clean up merged data by combining columns of same type and removing un-needed columns
gene_ids, ensemble_genes = [], []

# loop over data and fill in missing values
for idx, row in tqdm(ensembl_ncbi_merged_data.iterrows(), total=ensembl_ncbi_merged_data.shape[0]):
    
    # gene ids
    if row['xref_entrez'] != 'None' or row['GeneID'] != 'None':
        gene_ids.append(re.sub('None', '', ''.join(set([row['xref_entrez'], row['GeneID']]))))
    else:
        gene_ids.append('None')
    
    # fill in missing ensembl gene ids
    if row['gene_stable_id'] == 'None' and row['GeneID'] != 'None':
        if int(row['GeneID']) in ensembl_map.keys():
            ensemble_genes.append(ensembl_map[int(row['GeneID'])][0])
        else:
            ensemble_genes.append('None')
    else:
        ensemble_genes.append(row['gene_stable_id'])
        
# reduce columns (copy the selection so new columns can be added safely)
ensembl_ncbi_merged_data_clean = ensembl_ncbi_merged_data[['transcript_stable_id', 'protein_stable_id', 'xref_uniprot', 'Symbol',
                                                           'Synonyms', 'description', 'type_of_gene', 'chromosome', 'map_location', 'Other_designations']].copy()

# add cleaned columns
ensembl_ncbi_merged_data_clean['GeneID_Cleaned'] = gene_ids
ensembl_ncbi_merged_data_clean['EnsemblGenes_Cleaned'] = ensemble_genes

# remove duplicates
ensembl_ncbi_merged_data_clean.drop_duplicates(subset=None, keep='first', inplace=True)
    
# preview data
ensembl_ncbi_merged_data_clean.head(n=5)
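
The gene-identifier step above collapses the two join columns into one value. A minimal sketch of that logic as a standalone function follows; `combine_ids` is a hypothetical name, and the sketch assumes missing values are encoded as the string `'None'`, as in the cells above.

```python
def combine_ids(left, right, missing='None'):
    """Collapse two ID columns from an outer merge into a single value."""
    known = {value for value in (left, right) if value != missing}
    if not known:
        return missing
    # After an equality join both sides agree whenever both are present,
    # so the set normally holds exactly one value; sorting only makes the
    # result deterministic if the two values ever differ.
    return ''.join(sorted(known))


print(combine_ids('79501', 'None'))   # 79501
print(combine_ids('79501', '79501'))  # 79501
print(combine_ids('None', 'None'))    # None
```

Filtering out the placeholder before joining avoids removing the substring `None` from an identifier after the fact.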

***
**Merge Processed Data:** `ensembl_ncbi_merged_data_clean` + `promapping.txt`  

In [203]:
# merge cleaned ensembl-ncbi data with protein ontology mappings
merged_data = pandas.merge(ensembl_ncbi_merged_data_clean,
                           pro_mapping,
                           left_on='xref_uniprot',
                           right_on='Entry',
                           how='outer')

# replace NaN with 'None'
merged_data.fillna('None', inplace=True)

**Clean Full Merged Data**

In [None]:
# clean up merged data by combining columns of same type and removing un-needed columns
uniprot_ids = []

# loop over data and fill in missing values
for idx, row in tqdm(merged_data.iterrows(), total=merged_data.shape[0]):
    
    # uniprot ids
    if row['xref_uniprot'] != 'None' or row['Entry'] != 'None':
        uniprot_ids.append(re.sub('None', '', ''.join(set([row['xref_uniprot'], row['Entry']]))))
    else:
        uniprot_ids.append('None')
        
# reduce columns (copy the selection so new columns can be added safely)
merged_data_clean = merged_data[['EnsemblGenes_Cleaned', 'transcript_stable_id', 'protein_stable_id',
                                 'GeneID_Cleaned', 'type_of_gene', 'Symbol', 'Synonyms', 'description',
                                 'chromosome', 'map_location', 'Other_designations','pro_id', 'pro_mapping']].copy()

# add cleaned columns
merged_data_clean['UniprotAccessionID_Cleaned'] = uniprot_ids

# remove duplicates
merged_data_clean.drop_duplicates(subset=None, keep='first', inplace=True)

# replace NaN with 'None'
merged_data_clean.fillna('None', inplace=True)

# write data
merged_data_clean.to_csv(processed_data_location + 'Merged_Human_Ensembl_Entrez_Uniprot_Identifiers.txt',
                         header = True,
                         sep = '\t',
                         index = False)
    
# preview data
merged_data_clean.head(n=5)

<br>

***

### Ensembl Gene-Ensembl Transcript <a class="anchor" id="ensemblgene-ensembltranscript"></a>

**Purpose:** To map Ensembl gene identifiers to Ensembl transcript identifiers when creating the following edges: 
- RNA-Cell   
- RNA-Tissue Types  

**Output:** [`ENSEMBL_GENE_ENSEMBL_TRANSCRIPT_MAP.txt`](https://www.dropbox.com/s/8n1isqytlz2z1g6/ENSEMBL_GENE_ENSEMBL_TRANSCRIPT_MAP.txt?dl=1)

In [None]:
# de-dup data
df_ens = merged_data_clean.drop_duplicates(subset=['EnsemblGenes_Cleaned', 'transcript_stable_id'], keep='first', inplace=False) 

# reformat data and write it out
with open(processed_data_location + 'ENSEMBL_GENE_ENSEMBL_TRANSCRIPT_MAP.txt', 'w') as outfile:
    for idx, row in tqdm(df_ens.iterrows(), total=df_ens.shape[0]):
        if row['EnsemblGenes_Cleaned'] != 'None' and row['transcript_stable_id'] != 'None': 
            outfile.write(row['EnsemblGenes_Cleaned'].strip() + '\t' + row['transcript_stable_id'].strip() + '\n')


**Preview Processed Data**

In [206]:
eget_data = pandas.read_csv(processed_data_location + 'ENSEMBL_GENE_ENSEMBL_TRANSCRIPT_MAP.txt',
                            header = None,
                            names=['Ensembl_Gene_IDs', 'Ensembl_Transcript_IDs'],
                            delimiter = '\t')

print('There are {edge_count} ensembl gene-ensembl transcript edges'.format(edge_count=len(eget_data)))

There are 173782 ensembl gene-ensembl transcript edges


In [207]:
eget_data.head(n=5)

Unnamed: 0,Ensembl_Gene_IDs,Ensembl_Transcript_IDs
0,ENSG00000186092,ENST00000641515
1,ENSG00000186092,ENST00000335137
2,ENSG00000284733,ENST00000426406
3,ENSG00000284662,ENST00000332831
4,ENSG00000230178,ENST00000456475
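
The de-duplicate, filter, and write pattern used for this map recurs for each identifier map in this notebook. As an illustration only, it could be factored into one helper. The sketch below works on rows given as dictionaries and assumes missing values are the string `'None'`; `write_pair_map` is a hypothetical name, not a function from the helper script.

```python
def write_pair_map(rows, left_key, right_key, path, missing='None'):
    """Write unique (left, right) ID pairs as tab-separated lines; return the pair count."""
    seen = set()
    with open(path, 'w') as outfile:
        for row in rows:
            pair = (str(row[left_key]).strip(), str(row[right_key]).strip())
            # skip pairs with a missing side and pairs already written
            if missing in pair or pair in seen:
                continue
            seen.add(pair)
            outfile.write('\t'.join(pair) + '\n')
    return len(seen)
```

With pandas, the rows could be supplied as `df.to_dict('records')`. The first occurrence of each pair is kept, matching `keep='first'` in the cells of this notebook.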


<br>

***

### Ensembl Gene-Entrez Gene <a class="anchor" id="ensemblgene-entrezgene"></a>


**Purpose:** To map Ensembl gene identifiers to Entrez gene identifiers when creating the following edges:   
- gene-gene

**Output:** [`ENSEMBL_GENE_ENTREZ_GENE_MAP.txt`](https://www.dropbox.com/s/crghjh2we5v7pws/ENSEMBL_GENE_ENTREZ_GENE_MAP.txt?dl=1)

In [None]:
# de-dup data
df_ens = merged_data_clean.drop_duplicates(subset=['EnsemblGenes_Cleaned', 'GeneID_Cleaned'], keep='first', inplace=False) 

# reformat data and write it out
with open(processed_data_location + 'ENSEMBL_GENE_ENTREZ_GENE_MAP.txt', 'w') as outfile:
    for idx, row in tqdm(df_ens.iterrows(), total=df_ens.shape[0]):
        if row['EnsemblGenes_Cleaned'] != 'None' and row['GeneID_Cleaned'] != 'None': 
            outfile.write(row['EnsemblGenes_Cleaned'].strip() + '\t' + row['GeneID_Cleaned'].strip() + '\n')


**Preview Processed Data**

In [209]:
egeg_data = pandas.read_csv(processed_data_location + 'ENSEMBL_GENE_ENTREZ_GENE_MAP.txt',
                            header = None,
                            names=['Ensembl_Gene_IDs', 'Entrez_Gene_IDs'],
                            delimiter = '\t')

print('There are {edge_count} ensembl gene-entrez gene edges'.format(edge_count=len(egeg_data)))


There are 29691 ensembl gene-entrez gene edges


In [210]:
egeg_data.head(n=5)

Unnamed: 0,Ensembl_Gene_IDs,Entrez_Gene_IDs
0,ENSG00000186092,79501
1,ENSG00000284733,729759
2,ENSG00000284662,81399
3,ENSG00000230178,26683
4,ENSG00000187634,148398


<br>

***
***

### Ensembl Transcript-Protein Ontology <a class="anchor" id="ensembltranscript-proteinontology"></a>

**Purpose:** To map Ensembl transcript identifiers to Protein Ontology identifiers when creating the following edges: 
- RNA-Protein  

**Output:** [`ENSEMBL_TRANSCRIPT_PROTEIN_ONTOLOGY_MAP.txt`](https://www.dropbox.com/s/ckrw11nfyu6a08c/ENSEMBL_TRANSCRIPT_PROTEIN_ONTOLOGY_MAP.txt?dl=1)


In [None]:
# de-dup data
df_po = merged_data_clean.drop_duplicates(subset=['transcript_stable_id', 'pro_id'], keep='first', inplace=False) 

# reformat data and write it out
with open(processed_data_location + 'ENSEMBL_TRANSCRIPT_PROTEIN_ONTOLOGY_MAP.txt', 'w') as outfile:
    for idx, row in tqdm(df_po.iterrows(), total=df_po.shape[0]):
        if row['transcript_stable_id'] != 'None' and row['pro_id'] != 'None': 
            outfile.write(row['transcript_stable_id'].strip() + '\t' + row['pro_id'].replace('PR:', 'PR_').strip() + '\n')


**Preview Processed Data**

In [212]:
etpr_data = pandas.read_csv(processed_data_location + 'ENSEMBL_TRANSCRIPT_PROTEIN_ONTOLOGY_MAP.txt',
                            header = None,
                            names=['Ensembl_Transcript_IDs', 'Protein_Ontology_IDs'],
                            delimiter = '\t',
                            low_memory=False)

print('There are {edge_count} ensembl transcript-protein ontology edges'.format(edge_count=len(etpr_data)))

There are 92201 ensembl transcript-protein ontology edges


In [213]:
etpr_data.head(n=5)

Unnamed: 0,Ensembl_Transcript_IDs,Protein_Ontology_IDs
0,ENST00000335137,PR_000011836
1,ENST00000335137,PR_Q8NH21
2,ENST00000426406,PR_000011834
3,ENST00000426406,PR_Q6IEY1
4,ENST00000332831,PR_000011834


<br>

***
***

### Entrez Gene-Ensembl Transcript <a class="anchor" id="entrezgene-ensembltranscript"></a>

**Purpose:** To map Entrez gene identifiers to Ensembl transcript identifiers when creating the following edges: 
- gene-RNA 
- chemical-RNA

**Output:** [`ENTREZ_GENE_ENSEMBL_TRANSCRIPT_MAP.txt`](https://www.dropbox.com/s/yqnofd8h90luygu/ENTREZ_GENE_ENSEMBL_TRANSCRIPT_MAP.txt?dl=1)

In [None]:
# de-dup data
df_ens = merged_data_clean.drop_duplicates(subset=['GeneID_Cleaned', 'transcript_stable_id'], keep='first', inplace=False) 

# reformat data and write it out
with open(processed_data_location + 'ENTREZ_GENE_ENSEMBL_TRANSCRIPT_MAP.txt', 'w') as outfile:
    for idx, row in tqdm(df_ens.iterrows(), total=df_ens.shape[0]):
        if row['GeneID_Cleaned'] != 'None' and row['transcript_stable_id'] != 'None': 
            outfile.write(row['GeneID_Cleaned'].strip() + '\t' + row['transcript_stable_id'].strip() + '\n')


**Preview Processed Data**

In [215]:
eet_data = pandas.read_csv(processed_data_location + 'ENTREZ_GENE_ENSEMBL_TRANSCRIPT_MAP.txt',
                            header = None,
                            names=['Entrez_Gene_IDs', 'Ensembl_Transcript_IDs'],
                            delimiter = '\t')

print('There are {edge_count} entrez gene-ensembl transcript edges'.format(edge_count=len(eet_data)))

There are 175662 entrez gene-ensembl transcript edges


In [216]:
eet_data.head(n=5)

Unnamed: 0,Entrez_Gene_IDs,Ensembl_Transcript_IDs
0,79501,ENST00000641515
1,79501,ENST00000335137
2,729759,ENST00000426406
3,81399,ENST00000332831
4,26683,ENST00000456475


<br>

***

### Entrez Gene-Protein Ontology <a class="anchor" id="entrezgene-proteinontology"></a>

**Purpose:** To map Entrez gene identifiers to Protein Ontology identifiers when creating the following edges:   
- chemical-protein  

**Output:** [`ENTREZ_GENE_PRO_ONTOLOGY_MAP.txt`](https://www.dropbox.com/s/ufbp5o6zgagriw7/ENTREZ_GENE_PRO_ONTOLOGY_MAP.txt?dl=1)

In [None]:
# de-dup data
df_egpr = merged_data_clean.drop_duplicates(subset=['GeneID_Cleaned', 'pro_id'], keep='first', inplace=False) 

# reformat data and write it out
with open(processed_data_location + 'ENTREZ_GENE_PRO_ONTOLOGY_MAP.txt', 'w') as outfile:
    for idx, row in tqdm(df_egpr.iterrows(), total=df_egpr.shape[0]):
        if row['GeneID_Cleaned'] != 'None' and row['pro_id'] != 'None': 
            outfile.write(row['GeneID_Cleaned'].strip() + '\t' + row['pro_id'].replace(':', '_').strip() +  '\n')


**Preview Processed Data**

In [218]:
egpr_data = pandas.read_csv(processed_data_location + 'ENTREZ_GENE_PRO_ONTOLOGY_MAP.txt',
                            header = None,
                            names=['Gene_IDs', 'Protein_Ontology_IDs'],
                            delimiter = '\t')

print('There are {edge_count} entrez gene-protein ontology edges'.format(edge_count=len(egpr_data)))

There are 37580 entrez gene-protein ontology edges


In [219]:
egpr_data.head(n=5)

Unnamed: 0,Gene_IDs,Protein_Ontology_IDs
0,79501,PR_000011836
1,79501,PR_Q8NH21
2,729759,PR_000011834
3,729759,PR_Q6IEY1
4,81399,PR_000011834


<br>

***

### STRING-Protein Ontology <a class="anchor" id="string-proteinontology"></a>

**Purpose:** To map STRING identifiers to Protein Ontology identifiers when creating the following edges:   
- protein-protein  

**Output:** [`STRING_PRO_ONTOLOGY_MAP.txt`](https://www.dropbox.com/s/mekh5lr3bxp7gvu/STRING_PRO_ONTOLOGY_MAP.txt?dl=1)

In [220]:
# de-dup data
df_ens = merged_data_clean.drop_duplicates(subset=['protein_stable_id', 'pro_id'], keep='first', inplace=False) 

# reformat data and write it out
with open(processed_data_location + 'STRING_PRO_ONTOLOGY_MAP.txt', 'w') as outfile:
    for idx, row in tqdm(df_ens.iterrows(), total=df_ens.shape[0]):
        if row['protein_stable_id'] != 'None' and row['pro_id'] != 'None':
            outfile.write('9606.' + row['protein_stable_id'].strip() + '\t' + row['pro_id'].replace(':', '_').strip() +  '\n')


**Preview Processed Data**

In [221]:
stpr_data = pandas.read_csv(processed_data_location + 'STRING_PRO_ONTOLOGY_MAP.txt',
                            header = None,
                            names=['STRING_IDs', 'Protein_Ontology_IDs'],
                            delimiter = '\t')

print('There are {edge_count} string-protein ontology edges'.format(edge_count=len(stpr_data)))

There are 92201 string-protein ontology edges


In [222]:
stpr_data.head(n=5)

Unnamed: 0,STRING_IDs,Protein_Ontology_IDs
0,9606.ENSP00000334393,PR_000011836
1,9606.ENSP00000334393,PR_Q8NH21
2,9606.ENSP00000409316,PR_000011834
3,9606.ENSP00000409316,PR_Q6IEY1
4,9606.ENSP00000329982,PR_000011834
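
The identifier formatting used for this map can be summarized in two small functions. This is a sketch with hypothetical names (`to_string_id`, `to_pro_underscore`). It relies on STRING protein identifiers being an NCBI taxonomy ID followed by a period and the Ensembl protein ID, with 9606 being the taxonomy ID for *Homo sapiens*.

```python
HUMAN_TAXON_ID = '9606'  # NCBI Taxonomy ID for Homo sapiens


def to_string_id(ensembl_protein_id, taxon_id=HUMAN_TAXON_ID):
    """Prefix an Ensembl protein ID with the taxon ID, as STRING does."""
    return taxon_id + '.' + ensembl_protein_id.strip()


def to_pro_underscore(pro_id):
    """Convert a Protein Ontology CURIE such as 'PR:Q8NH21' to 'PR_Q8NH21'."""
    return pro_id.strip().replace(':', '_')


print(to_string_id('ENSP00000334393'))  # 9606.ENSP00000334393
print(to_pro_underscore('PR:Q8NH21'))   # PR_Q8NH21
```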


<br>

***

### Uniprot Accession-Protein Ontology <a class="anchor" id="uniprotaccession-proteinontology"></a>

**Purpose:** To map Uniprot accession identifiers to Protein Ontology identifiers when creating the following edges:  
- protein-gobp  
- protein-gomf  
- protein-gocc  
- protein-complex  
- protein-cofactor  
- protein-catalyst 
- protein-reaction  
- protein-pathway

**Output:** [`UNIPROT_ACCESSION_PRO_ONTOLOGY_MAP.txt`](https://www.dropbox.com/s/txp8tqdipzwus9p/UNIPROT_ACCESSION_PRO_ONTOLOGY_MAP.txt?dl=1)

In [None]:
# de-dup data
df_ens = merged_data_clean.drop_duplicates(subset=['UniprotAccessionID_Cleaned', 'pro_id'], keep='first', inplace=False) 

# reformat data and write it out
with open(processed_data_location + 'UNIPROT_ACCESSION_PRO_ONTOLOGY_MAP.txt', 'w') as outfile:
    for idx, row in tqdm(df_ens.iterrows(), total=df_ens.shape[0]):
        if row['UniprotAccessionID_Cleaned'] != 'None' and row['pro_id'] != 'None': 
            outfile.write(row['UniprotAccessionID_Cleaned'].strip() + '\t' + row['pro_id'].replace(':', '_').strip() +  '\n')


**Preview Processed Data**

In [224]:
uapr_data = pandas.read_csv(processed_data_location + 'UNIPROT_ACCESSION_PRO_ONTOLOGY_MAP.txt',
                            header = None,
                            names=['Uniprot_Accession_IDs', 'Protein_Ontology_IDs'],
                            delimiter = '\t')

print('There are {edge_count} uniprot accession-protein ontology edges'.format(edge_count=len(uapr_data)))

There are 313776 uniprot accession-protein ontology edges


In [225]:
uapr_data.head(n=5)

Unnamed: 0,Uniprot_Accession_IDs,Protein_Ontology_IDs
0,Q8NH21,PR_000011836
1,Q8NH21,PR_Q8NH21
2,Q6IEY1,PR_000011834
3,Q6IEY1,PR_Q6IEY1
4,Q96NU1,PR_000014441


<br>

***
***
### Other Identifier Mapping <a class="anchor" id="other-identifier-mapping"></a>
***
* [ChEBI Identifiers](#mesh-chebi)  
* [Human Protein Atlas Tissue and Cell Types](#hpa-uberon) 
* [Human Disease and Phenotype Identifiers](#disease-identifiers) 

***
***

### ChEBI-MeSH Identifiers <a class="anchor" id="mesh-chebi"></a>

**Data Source Wiki Page:** [mapping-mesh-to-chebi](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#mapping-mesh-identifiers-to-chebi-identifiers)  

**Purpose:** Map MeSH identifiers to ChEBI identifiers when creating the following edges:  
- Chemical-Gene  
- Chemical-Disease

**Dependencies:** This script assumes that the `NCBO_rest_api.py` script has been run and that its output was written to `./resources/processed_data/temp`. 

**Output:** [`MESH_CHEBI_MAP.txt`](https://www.dropbox.com/s/5nr87v5h6x8oc1b/MESH_CHEBI_MAP.txt?dl=1)


In [30]:
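with open(processed_data_location + 'MESH_CHEBI_MAP.txt', 'w') as out:
    for filename in tqdm(glob.glob(processed_data_location + 'temp/*.txt')):
        # open each input file in a context manager so the handle is closed
        with open(filename, 'r') as infile:
            # skip empty lines
            for row in filter(None, infile.read().split('\n')):
                # keep the last two path segments of the MeSH URI, joined by '_'
                mesh = '_'.join(row.split('\t')[0].split('/')[-2:])
                # keep the last path segment of the ChEBI URI
                chebi = row.split('\t')[1].split('/')[-1]
                out.write(mesh + '\t' + chebi + '\n')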

100%|██████████| 44/44 [00:00<00:00, 643.29it/s]
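
To make the URI handling in this step concrete, here is a small sketch applied to a single row. The function name `parse_mapping_row` is hypothetical, and the example URIs are assumptions chosen to be consistent with the identifiers in the output preview; they are not copied from the files in `temp`.

```python
def parse_mapping_row(row):
    """Split a tab-separated 'MeSH URI<TAB>ChEBI URI' row into two short identifiers."""
    mesh_uri, chebi_uri = row.split('\t')[:2]
    # last two path segments of the MeSH URI, joined by an underscore
    mesh = '_'.join(mesh_uri.split('/')[-2:])
    # last path segment of the ChEBI URI
    chebi = chebi_uri.split('/')[-1]
    return mesh, chebi


# hypothetical input row (assumed URI layout)
example = ('http://purl.bioontology.org/ontology/MESH/C535085'
           '\thttp://purl.obolibrary.org/obo/CHEBI_133814')
print(parse_mapping_row(example))  # ('MESH_C535085', 'CHEBI_133814')
```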


**Preview Processed Data**

In [31]:
mc_data = pandas.read_csv(processed_data_location + 'MESH_CHEBI_MAP.txt',
                          delimiter = '\t',
                          header=None,
                          names=['MeSH_IDs', 'ChEBI_IDs'])

print('There are {edge_count} MeSH-ChEBI edges'.format(edge_count=len(mc_data)))

There are 11434 MeSH-ChEBI edges


In [32]:
mc_data.head(n=5)

Unnamed: 0,MeSH_IDs,ChEBI_IDs
0,MESH_C535085,CHEBI_133814
1,MESH_C008574,CHEBI_17221
2,MESH_C492482,CHEBI_34581
3,MESH_C007556,CHEBI_135978
4,MESH_C500395,CHEBI_29138


<br>

***

### Disease and Phenotype Identifiers <a class="anchor" id="disease-identifiers"></a>

**Data Source Wiki Page:** [disgenet](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#disgenet)  

**Purpose:** This script downloads [disease_mappings.tsv](https://www.disgenet.org/static/disgenet_ap1/files/downloads/disease_mappings.tsv.gz) to map UMLS identifiers to Human Disease Ontology and Human Phenotype Ontology identifiers when creating the following edges:  
- chemical-disease  
- disease-phenotype

**Output:**   
- Human Disease Ontology Mappings ➞ [`DISEASE_DOID_MAP.txt`](https://www.dropbox.com/s/q30ferujl7k574j/DISEASE_DOID_MAP.txt?dl=1)  
- Human Phenotype Ontology Mappings ➞ [`PHENOTYPE_HPO_MAP.txt`](https://www.dropbox.com/s/5ayl0c5qm7r4tdm/PHENOTYPE_HPO_MAP.txt?dl=1)

In [None]:
url = 'https://www.disgenet.org/static/disgenet_ap1/files/downloads/disease_mappings.tsv.gz'
data_downloader(url, unprocessed_data_location)

In [3]:
disease_data = pandas.read_csv(unprocessed_data_location + 'disease_mappings.tsv',
                               header = 0,
                               delimiter = '|')

disease_data.head(n=3)

Unnamed: 0,diseaseId,name,vocabulary,code,vocabularyName
0,C0018923,Hemangiosarcoma,DO,1816,angiosarcoma
1,C0854893,Angiosarcoma non-metastatic,DO,1816,angiosarcoma
2,C0033999,Pterygium,DO,2116,pterygium


In [4]:
# convert to dictionary
disease_dict = {}

for idx, row in tqdm(disease_data.iterrows(), total=disease_data.shape[0]):
    
    if row['vocabulary'] == 'MSH':
        mesh_finder(disease_data, row['code'], 'MESH:', disease_dict)
        
    elif row['vocabulary'] == 'OMIM':
        mesh_finder(disease_data, row['code'], 'OMIM:', disease_dict)
        
    elif row['vocabulary'] == 'ORDO':
        mesh_finder(disease_data, row['code'], 'ORPHA:', disease_dict)
    
    elif row['diseaseId'] in disease_dict.keys():
        if row['vocabulary'] == 'DO':
            disease_dict[row['diseaseId']].append('DOID_' + row['code']) 
        
        if row['vocabulary'] == 'HPO':
            disease_dict[row['diseaseId']].append(row['code'].replace('HP:', 'HP_'))
    
    else:
        if row['vocabulary'] == 'DO':
            disease_dict[row['diseaseId']] = ['DOID_' + row['code']] 
        
        if row['vocabulary'] == 'HPO':
            disease_dict[row['diseaseId']] = [row['code'].replace('HP:', 'HP_')] 

100%|██████████| 97502/97502 [13:00<00:00, 124.92it/s] 
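
The DO and HPO branches of the loop above amount to grouping ontology codes by UMLS concept ID. A minimal sketch of just that part follows, assuming records are `(diseaseId, vocabulary, code)` tuples. The name `group_ontology_codes` is hypothetical, the `MSH` record in the example is made up for illustration, and the `mesh_finder` branches (MSH, OMIM, ORDO) are not reproduced because that helper is not shown in this notebook.

```python
from collections import defaultdict


def group_ontology_codes(records):
    """Map each UMLS concept ID to its DO and HPO identifiers."""
    grouped = defaultdict(list)
    for disease_id, vocabulary, code in records:
        if vocabulary == 'DO':
            grouped[disease_id].append('DOID_' + str(code))
        elif vocabulary == 'HPO':
            grouped[disease_id].append(str(code).replace('HP:', 'HP_'))
        # other vocabularies are ignored in this sketch
    return dict(grouped)


records = [
    ('C0018923', 'DO', '1816'),
    ('C0018923', 'HPO', 'HP:0200058'),
    ('C0033999', 'MSH', 'D000000'),  # made-up code, ignored here
]
print(group_ontology_codes(records))
# {'C0018923': ['DOID_1816', 'HP_0200058']}
```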


In [5]:
# reformat data and write it out
with open(processed_data_location + 'DISEASE_DOID_MAP.txt', 'w') as outfile1, \
        open(processed_data_location + 'PHENOTYPE_HPO_MAP.txt', 'w') as outfile2:
    for key, value in tqdm(disease_dict.items()):
        for i in value:
            # get diseases
            if i.startswith('DOID_'): 
                outfile1.write(key.split(':')[-1] + '\t' + i + '\n')

            # get phenotypes
            if i.startswith('HP_'): 
                outfile2.write(key.split(':')[-1] + '\t' + i + '\n')

outfile1.close()
outfile2.close()

100%|██████████| 38595/38595 [00:00<00:00, 303979.02it/s]


**Preview Processed Data**

_Preview Disease (DOID) Mappings_

In [6]:
dis_data = pandas.read_csv(processed_data_location + 'DISEASE_DOID_MAP.txt',
                           header = None,
                           names=['Disease_IDs', 'DOID_IDs'],
                           delimiter = '\t')

print('There are {} disease-DOID edges'.format(len(dis_data)))

There are 46720 disease-DOID edges


In [7]:
dis_data.head(n=5)

Unnamed: 0,Disease_IDs,DOID_IDs
0,C0018923,DOID_0001816
1,C0854893,DOID_0001816
2,C0033999,DOID_0002116
3,C4520843,DOID_0002116
4,C0024814,DOID_0014667


_Preview Phenotype (HP) Mappings_

In [8]:
hp_data = pandas.read_csv(processed_data_location + 'PHENOTYPE_HPO_MAP.txt',
                          header = None,
                          names=['Disease_IDs', 'HP_IDs'],
                          delimiter = '\t')

print('There are {} phenotype-HPO edges'.format(len(hp_data)))

There are 21676 phenotype-HPO edges


In [9]:
hp_data.head(n=5)

Unnamed: 0,Disease_IDs,HP_IDs
0,C0018923,HP_0200058
1,C0033999,HP_0001059
2,C4520843,HP_0001059
3,C0037199,HP_0000246
4,C0008780,HP_0012265


<br>

***
***

### Human Protein Atlas Tissue/Cells - UBERON + Cell Ontology + Cell Line Ontology <a class="anchor" id="hpa-uberon"></a>

**Data Source Wiki Page:** [human-protein-atlas](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#human-protein-atlas)  

**Purpose:** This script downloads the results of a Human Protein Atlas (HPA) query for cell, tissue, and blood types with overexpressed protein-coding genes in the human proteome ([`proteinatlas_search.tsv`](https://www.proteinatlas.org/api/search_download.php?search=&columns=g,eg,up,pe,rnatsm,rnaclsm,rnacasm,rnabrsm,rnabcsm,rnablsm,scl,t_RNA_adipose_tissue,t_RNA_adrenal_gland,t_RNA_amygdala,t_RNA_appendix,t_RNA_basal_ganglia,t_RNA_bone_marrow,t_RNA_breast,t_RNA_cerebellum,t_RNA_cerebral_cortex,t_RNA_cervix,_uterine,t_RNA_colon,t_RNA_corpus_callosum,t_RNA_ductus_deferens,t_RNA_duodenum,t_RNA_endometrium_1,t_RNA_epididymis,t_RNA_esophagus,t_RNA_fallopian_tube,t_RNA_gallbladder,t_RNA_heart_muscle,t_RNA_hippocampal_formation,t_RNA_hypothalamus,t_RNA_kidney,t_RNA_liver,t_RNA_lung,t_RNA_lymph_node,t_RNA_midbrain,t_RNA_olfactory_region,t_RNA_ovary,t_RNA_pancreas,t_RNA_parathyroid_gland,t_RNA_pituitary_gland,t_RNA_placenta,t_RNA_pons_and_medulla,t_RNA_prostate,t_RNA_rectum,t_RNA_retina,t_RNA_salivary_gland,t_RNA_seminal_vesicle,t_RNA_skeletal_muscle,t_RNA_skin_1,t_RNA_small_intestine,t_RNA_smooth_muscle,t_RNA_spinal_cord,t_RNA_spleen,t_RNA_stomach_1,t_RNA_testis,t_RNA_thalamus,t_RNA_thymus,t_RNA_thyroid_gland,t_RNA_tongue,t_RNA_tonsil,t_RNA_urinary_bladder,t_RNA_vagina,t_RNA_B-cells,t_RNA_dendritic_cells,t_RNA_granulocytes,t_RNA_monocytes,t_RNA_NK-cells,t_RNA_T-cells,t_RNA_total_PBMC,cell_RNA_A-431,cell_RNA_A549,cell_RNA_AF22,cell_RNA_AN3-CA,cell_RNA_ASC_diff,cell_RNA_ASC_TERT1,cell_RNA_BEWO,cell_RNA_BJ,cell_RNA_BJ_hTERT+,cell_RNA_BJ_hTERT+_SV40_Large_T+,cell_RNA_BJ_hTERT+_SV40_Large_T+_RasG12V,cell_RNA_CACO-2,cell_RNA_CAPAN-2,cell_RNA_Daudi,cell_RNA_EFO-21,cell_RNA_fHDF/TERT166,cell_RNA_HaCaT,cell_RNA_HAP1,cell_RNA_HBEC3-KT,cell_RNA_HBF_TERT88,cell_RNA_HDLM-2,cell_RNA_HEK_293,cell_RNA_HEL,cell_RNA_HeLa,cell_RNA_Hep_G2,cell_RNA_HHSteC,cell_RNA_HL-60,cell_RNA_HMC-1,cell_RNA_HSkMC,cell_RNA_hTCEpi,cell_RNA_hTEC/SVTERT24-B,cell_RNA_hTERT-HME1,cell_RNA_HUVEC_TERT2,cell_RNA_K-562,cell_RNA_Karpas-707,cell_RNA_LHCN-M2,cell_RNA_MCF7,cell_RNA_MOLT-4,cell_RNA_NB-4,cell_RNA_NTERA-2,cell_RNA_PC-3,cell_RNA_REH,cell_RNA_RH-30,cell_RNA_RPMI-8226,cell_RNA_RPTEC_TERT1,cell_RNA_RT4,cell_RNA_SCLC-21H,cell_RNA_SH-SY5Y,cell_RNA_SiHa,cell_RNA_SK-BR-3,cell_RNA_SK-MEL-30,cell_RNA_T-47d,cell_RNA_THP-1,cell_RNA_TIME,cell_RNA_U-138_MG,cell_RNA_U-2_OS,cell_RNA_U-2197,cell_RNA_U-251_MG,cell_RNA_U-266/70,cell_RNA_U-266/84,cell_RNA_U-698,cell_RNA_U-87_MG,cell_RNA_U-937,cell_RNA_WM-115,blood_RNA_basophil,blood_RNA_classical_monocyte,blood_RNA_eosinophil,blood_RNA_gdT-cell,blood_RNA_intermediate_monocyte,blood_RNA_MAIT_T-cell,blood_RNA_memory_B-cell,blood_RNA_memory_CD4_T-cell,blood_RNA_memory_CD8_T-cell,blood_RNA_myeloid_DC,blood_RNA_naive_B-cell,blood_RNA_naive_CD4_T-cell,blood_RNA_naive_CD8_T-cell,blood_RNA_neutrophil,blood_RNA_NK-cell,blood_RNA_non-classical_monocyte,blood_RNA_plasmacytoid_DC,blood_RNA_T-reg,blood_RNA_total_PBMC,brain_RNA_amygdala,brain_RNA_basal_ganglia,brain_RNA_cerebellum,brain_RNA_cerebral_cortex,brain_RNA_hippocampal_formation,brain_RNA_hypothalamus,brain_RNA_midbrain,brain_RNA_olfactory_region,brain_RNA_pons_and_medulla,brain_RNA_thalamus&format=tsv)) in order to map HPA cell and tissue type strings to Uber-Anatomy, Cell Ontology, and Cell Line Ontology concepts (see [human-protein-atlas](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#human-protein-atlas) for details on the mapping process). This mapping is then used to create the following edge types:  
- RNA-cell Line  
- RNA-tissue type   
- Protein-cell Line  
- Protein-tissue type  


**Output:**  
- All HPA tissue and cell type strings ➞ [`HPA_tissues.txt`](https://www.dropbox.com/s/m0spn8h1l8kxb61/HPA_tissues.txt?dl=1)  
- Mapping HPA strings to ontology concepts (documentation) ➞ [`zooma_tissue_cell_mapping_04JAN2020.xlsx`](https://www.dropbox.com/s/lxp8vxj39eumvcn/zooma_tissue_cell_mapping_04JAN2020.xlsx?dl=1)  
- Final HPA-ontology mappings ➞ [`HPA_TISSUE_CELL_MAP.txt`](https://www.dropbox.com/s/dsh1x88u6251w76/HPA_TISSUE_CELL_MAP.txt?dl=1)
- HPA Edges ➞ [`HPA_RNA_GENE_PROTEIN_EDGES.txt`](https://www.dropbox.com/s/vww5mif0i8h7bhj/HPA_RNA_GENE_PROTEIN_EDGES.txt?dl=1)

In [None]:
url = 'https://www.proteinatlas.org/api/search_download.php?search=&columns=g,eg,up,pe,rnatsm,rnaclsm,rnacasm,rnabrsm,rnabcsm,rnablsm,scl,t_RNA_adipose_tissue,t_RNA_adrenal_gland,t_RNA_amygdala,t_RNA_appendix,t_RNA_basal_ganglia,t_RNA_bone_marrow,t_RNA_breast,t_RNA_cerebellum,t_RNA_cerebral_cortex,t_RNA_cervix,_uterine,t_RNA_colon,t_RNA_corpus_callosum,t_RNA_ductus_deferens,t_RNA_duodenum,t_RNA_endometrium_1,t_RNA_epididymis,t_RNA_esophagus,t_RNA_fallopian_tube,t_RNA_gallbladder,t_RNA_heart_muscle,t_RNA_hippocampal_formation,t_RNA_hypothalamus,t_RNA_kidney,t_RNA_liver,t_RNA_lung,t_RNA_lymph_node,t_RNA_midbrain,t_RNA_olfactory_region,t_RNA_ovary,t_RNA_pancreas,t_RNA_parathyroid_gland,t_RNA_pituitary_gland,t_RNA_placenta,t_RNA_pons_and_medulla,t_RNA_prostate,t_RNA_rectum,t_RNA_retina,t_RNA_salivary_gland,t_RNA_seminal_vesicle,t_RNA_skeletal_muscle,t_RNA_skin_1,t_RNA_small_intestine,t_RNA_smooth_muscle,t_RNA_spinal_cord,t_RNA_spleen,t_RNA_stomach_1,t_RNA_testis,t_RNA_thalamus,t_RNA_thymus,t_RNA_thyroid_gland,t_RNA_tongue,t_RNA_tonsil,t_RNA_urinary_bladder,t_RNA_vagina,t_RNA_B-cells,t_RNA_dendritic_cells,t_RNA_granulocytes,t_RNA_monocytes,t_RNA_NK-cells,t_RNA_T-cells,t_RNA_total_PBMC,cell_RNA_A-431,cell_RNA_A549,cell_RNA_AF22,cell_RNA_AN3-CA,cell_RNA_ASC_diff,cell_RNA_ASC_TERT1,cell_RNA_BEWO,cell_RNA_BJ,cell_RNA_BJ_hTERT+,cell_RNA_BJ_hTERT+_SV40_Large_T+,cell_RNA_BJ_hTERT+_SV40_Large_T+_RasG12V,cell_RNA_CACO-2,cell_RNA_CAPAN-2,cell_RNA_Daudi,cell_RNA_EFO-21,cell_RNA_fHDF/TERT166,cell_RNA_HaCaT,cell_RNA_HAP1,cell_RNA_HBEC3-KT,cell_RNA_HBF_TERT88,cell_RNA_HDLM-2,cell_RNA_HEK_293,cell_RNA_HEL,cell_RNA_HeLa,cell_RNA_Hep_G2,cell_RNA_HHSteC,cell_RNA_HL-60,cell_RNA_HMC-1,cell_RNA_HSkMC,cell_RNA_hTCEpi,cell_RNA_hTEC/SVTERT24-B,cell_RNA_hTERT-HME1,cell_RNA_HUVEC_TERT2,cell_RNA_K-562,cell_RNA_Karpas-707,cell_RNA_LHCN-M2,cell_RNA_MCF7,cell_RNA_MOLT-4,cell_RNA_NB-4,cell_RNA_NTERA-2,cell_RNA_PC-3,cell_RNA_REH,cell_RNA_RH-30,cell_RNA_RPMI-8226,cell_RNA_RPTEC_TERT1,cell_RNA_RT4,cell_RNA_SCLC-21H,cell_RNA_SH-SY5Y,cell_RNA_SiHa,cell_RNA_SK-BR-3,cell_RNA_SK-MEL-30,cell_RNA_T-47d,cell_RNA_THP-1,cell_RNA_TIME,cell_RNA_U-138_MG,cell_RNA_U-2_OS,cell_RNA_U-2197,cell_RNA_U-251_MG,cell_RNA_U-266/70,cell_RNA_U-266/84,cell_RNA_U-698,cell_RNA_U-87_MG,cell_RNA_U-937,cell_RNA_WM-115,blood_RNA_basophil,blood_RNA_classical_monocyte,blood_RNA_eosinophil,blood_RNA_gdT-cell,blood_RNA_intermediate_monocyte,blood_RNA_MAIT_T-cell,blood_RNA_memory_B-cell,blood_RNA_memory_CD4_T-cell,blood_RNA_memory_CD8_T-cell,blood_RNA_myeloid_DC,blood_RNA_naive_B-cell,blood_RNA_naive_CD4_T-cell,blood_RNA_naive_CD8_T-cell,blood_RNA_neutrophil,blood_RNA_NK-cell,blood_RNA_non-classical_monocyte,blood_RNA_plasmacytoid_DC,blood_RNA_T-reg,blood_RNA_total_PBMC,brain_RNA_amygdala,brain_RNA_basal_ganglia,brain_RNA_cerebellum,brain_RNA_cerebral_cortex,brain_RNA_hippocampal_formation,brain_RNA_hypothalamus,brain_RNA_midbrain,brain_RNA_olfactory_region,brain_RNA_pons_and_medulla,brain_RNA_thalamus&format=tsv'
data_downloader(url, unprocessed_data_location, 'proteinatlas_search.tsv.gz')

_Read in Data Files_

In [41]:
hpa = pandas.read_csv(unprocessed_data_location + 'proteinatlas_search.tsv',
                      header = 0,
                      delimiter = '\t')

# replace NaN with 'None'
hpa.fillna('None', inplace=True)

**Get Mapping Data**

In [42]:
# retrieve terms to map
terms_to_map = list(hpa.columns)

# write results
with open(unprocessed_data_location + 'HPA_tissues.txt', 'w') as outfile:
    for x in tqdm(terms_to_map):
        if x.endswith('[NX]'):
            # keep only the tissue/cell name: drop the text up to 'RNA - ' and the trailing ' [NX]'
            term = x.split('RNA - ')[-1].split(' [NX]')[:-1][0]
            outfile.write(term + '\n')

100%|██████████| 161/161 [00:00<00:00, 227444.58it/s]
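The string handling in the cell above can be hard to read at a glance. The following is a small sketch of the same extraction as a standalone function, assuming column names look like `Tissue RNA - adipose tissue [NX]` (the exact header format is an assumption here, and `extract_term` is a hypothetical name).

```python
def extract_term(column_name):
    """Return the tissue/cell label from an HPA expression column name, or None."""
    suffix = ' [NX]'
    if not column_name.endswith(suffix):
        return None
    # drop everything up to and including 'RNA - ', then the trailing ' [NX]'
    return column_name.split('RNA - ')[-1][:-len(suffix)]


# assumed header examples; 'Gene' stands in for any non-expression column
columns = ['Gene', 'Tissue RNA - adipose tissue [NX]', 'Cell RNA - A-431 [NX]']
terms = [t for t in (extract_term(c) for c in columns) if t is not None]
```

Unlike the notebook cell, this sketch requires a space before `[NX]`; headers without that space would be skipped.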


In [43]:
# read back in mapped tissue/cell data
hpa_mapping_data = pandas.read_excel(open(unprocessed_data_location + 'zooma_tissue_cell_mapping_04JAN2020.xlsx', 'rb'),
                                     sheet_name='Concept_Mapping - 04JAN2020',
                                     header=0)

hpa_mapping_data.fillna('None', inplace=True)

# preview data
hpa_mapping_data.head(n=3)

Unnamed: 0,ORIGINAL TERM,UBERON ID,UBERON LABEL,CL ID,CL LABEL,CLO ID,CLO LABEL,UBERON MAPPING,CL MAPPING,CLO MAPPING
0,A-431,UBERON_0000014,zone of skin,CL_0000066,epithelial cell,CLO_0001591,A431 cell,Manual,Manual,Manual
1,A549,UBERON_0002048,lung,CL_0000141,epithelial cell of lung,CLO_0001601,A549 cell,Manual,Manual,Manual
2,adipose tissue,UBERON_0001013,adipose tissue,,,,,Zooma,,


In [44]:
# reformat data and write it out
with open(processed_data_location + 'HPA_TISSUE_CELL_MAP.txt', 'w') as outfile:
    for idx, row in tqdm(hpa_mapping_data.iterrows(), total=hpa_mapping_data.shape[0]):
        if row['UBERON ID'] != 'None':
            outfile.write(str(row['ORIGINAL TERM']).strip() + '\t' + str(row['UBERON ID']).strip() + '\n')

        if row['CL ID'] != 'None':
            outfile.write(str(row['ORIGINAL TERM']).strip() + '\t' + str(row['CL ID']).strip() + '\n')
        
        if row['CLO ID'] != 'None':
            outfile.write(str(row['ORIGINAL TERM']).strip() + '\t' + str(row['CLO ID']).strip() + '\n')

outfile.close()

100%|██████████| 153/153 [00:00<00:00, 2605.23it/s]


**Preview Processed Data**

In [45]:
hpa_data = pandas.read_csv(processed_data_location + 'HPA_TISSUE_CELL_MAP.txt',
                           header = None,
                           names=['Tissue/Cell TERM', 'ONTOLOGY_IDs'],
                           delimiter = '\t')

print('There are {edge_count} edges'.format(edge_count=len(hpa_data)))

There are 281 edges


In [43]:
hpa_data.head(n=3)

Unnamed: 0,Tissue/Cell TERM,ONTOLOGY_IDs
0,A-431,UBERON_0000014
1,A-431,CL_0000066
2,A-431,CLO_0001591


**Create Edge Data Set**

In [46]:
# reformat data and write it out
with open(processed_data_location + 'HPA_RNA_GENE_PROTEIN_EDGES.txt', 'w') as outfile:
    for idx, row in tqdm(hpa.iterrows(), total=hpa.shape[0]):
        if row['RNA tissue specific NX'] != 'None':
            for x in row['RNA tissue specific NX'].split(';'):
                outfile.write(row['Ensembl'] + '\t' + row['Gene'] + '\t' + row['Uniprot'] + '\t' + row['Evidence'] + '\t' + 'anatomy' + '\t' + x.split(':')[0] + '\n')

        if row['RNA cell line specific NX'] != 'None':
            for x in row['RNA cell line specific NX'].split(';'):
                outfile.write(row['Ensembl'] + '\t' + row['Gene'] + '\t' + row['Uniprot'] + '\t' + row['Evidence'] + '\t' + 'cell line' + '\t' + x.split(':')[0] + '\n')

        if row['RNA brain regional specific NX'] != 'None':
            for x in row['RNA brain regional specific NX'].split(';'):
                outfile.write(row['Ensembl'] + '\t' + row['Gene'] + '\t' + row['Uniprot'] + '\t' + row['Evidence'] + '\t' + 'anatomy' + '\t' + x.split(':')[0] + '\n')

        if row['RNA blood cell specific NX'] != 'None':
            for x in row['RNA blood cell specific NX'].split(';'):
                outfile.write(row['Ensembl'] + '\t' + row['Gene'] + '\t' + row['Uniprot'] + '\t' + row['Evidence'] + '\t' + 'anatomy' + '\t' + x.split(':')[0] + '\n')

        if row['RNA blood lineage specific NX'] != 'None':
            for x in row['RNA blood lineage specific NX'].split(';'):
                outfile.write(row['Ensembl'] + '\t' + row['Gene'] + '\t' + row['Uniprot'] + '\t' + row['Evidence'] + '\t' + 'anatomy' + '\t' + x.split(':')[0] + '\n')

outfile.close()

100%|██████████| 19670/19670 [00:08<00:00, 2378.03it/s]
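The five branches in the cell above differ only in the column they read and the anatomy type they write, so the same logic can be sketched with a lookup table. This is an illustrative alternative, not the notebook's code: `row_to_edges` is a hypothetical name, the example row uses made-up expression values, and the `name: value` layout of the specificity fields is an assumption.

```python
# column name -> value written to the Anatomy_Type field (taken from the cell above)
SPECIFICITY_COLUMNS = {
    'RNA tissue specific NX': 'anatomy',
    'RNA cell line specific NX': 'cell line',
    'RNA brain regional specific NX': 'anatomy',
    'RNA blood cell specific NX': 'anatomy',
    'RNA blood lineage specific NX': 'anatomy',
}


def row_to_edges(row):
    """Yield one tab-separated edge line per tissue/cell entry in a dict-like HPA row."""
    for column, anatomy_type in SPECIFICITY_COLUMNS.items():
        value = row.get(column, 'None')
        if value == 'None':
            continue
        for entry in value.split(';'):
            yield '\t'.join([row['Ensembl'], row['Gene'], row['Uniprot'], row['Evidence'],
                             anatomy_type, entry.split(':')[0]])


# illustrative row; the numeric values are placeholders
example_row = {
    'Ensembl': 'ENSG00000121410', 'Gene': 'A1BG', 'Uniprot': 'P04217',
    'Evidence': 'Evidence at protein level',
    'RNA tissue specific NX': 'liver: 100.0',
    'RNA cell line specific NX': 'HEK 293: 10.0;Hep G2: 20.0',
}
edges = list(row_to_edges(example_row))
```

Keeping the column-to-type mapping in one dictionary means a new specificity column only needs one new entry.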


**Preview Processed Data**

In [47]:
hpa_edges = pandas.read_csv(processed_data_location + 'HPA_RNA_GENE_PROTEIN_EDGES.txt',
                           header = None,
                           names=['Ensembl_IDs', 'Gene_Symbols', 'Uniport_IDs', 'Evidence', 'Anatomy_Type', 'Anatomy'],
                           delimiter = '\t')

print('There are {edge_count} edges'.format(edge_count=len(hpa_edges)))

There are 68164 edges


In [48]:
hpa_edges.head(n=5)

Unnamed: 0,Ensembl_IDs,Gene_Symbols,Uniport_IDs,Evidence,Anatomy_Type,Anatomy
0,ENSG00000121410,A1BG,P04217,Evidence at protein level,anatomy,liver
1,ENSG00000121410,A1BG,P04217,Evidence at protein level,cell line,HEK 293
2,ENSG00000121410,A1BG,P04217,Evidence at protein level,cell line,Hep G2
3,ENSG00000121410,A1BG,P04217,Evidence at protein level,cell line,REH
4,ENSG00000121410,A1BG,P04217,Evidence at protein level,cell line,U-266/70


<br>

***
***
### CREATE EDGE DATASETS  <a class="anchor" id="create-edge-datasets"></a>

***
### Ontologies  <a class="anchor" id="ontologies"></a>
***
- [Protein Ontology](#protein-ontology)  
- [Relations Ontology](#relations-ontology)  

***

### Protein Ontology <a class="anchor" id="protein-ontology"></a>

**Data Source Wiki Page:** [protein-ontology](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#human-phenotype-ontology)  

**Purpose:** This script downloads the [pr.owl](http://purl.obolibrary.org/obo/pr.owl) file from [ProConsortium.org](https://proconsortium.org/) in order to create a version of the ontology that contains only human proteins. This is achieved by performing forward and reverse breadth-first search over all proteins which are `rdfs:subClassOf` [Homo sapiens protein](https://proconsortium.org/app/entry/PR%3A000029067/).

<br>

**Output:**  
- Human Protein Ontology ➞ [`human_pro.owl`](https://www.dropbox.com/s/jw8jksgnqbcz9sm/human_pro.owl?dl=1)
- Classified Human Protein Ontology (HermiT) ➞ [`human_pro_closed.owl`](https://www.dropbox.com/s/6ux85agl95ja3wx/human_pro_closed.owl?dl=1)


In [120]:
url = 'http://purl.obolibrary.org/obo/pr.owl'
data_downloader(url, unprocessed_data_location)

Downloading data file


In [12]:
# read in ontology as graph (the ontology is large so this takes ~60 minutes) - 11,757,623 edges on 12/18/2019
graph = Graph()
graph.parse(unprocessed_data_location + 'pr.owl')

print('There are {} edges in the ontology'.format(len(graph)))

There are 11757623 edges in the ontology


**Convert Ontology to Directed MultiGraph:**  
In order to create a version of the ontology which includes all relevant human edges, we first need to convert the RDF graph to a [directed multigraph](https://networkx.github.io/documentation/stable/reference/classes/multidigraph.html).

In [122]:
# convert RDF graph to multidigraph (the ontology is large so this takes ~45 minutes)
networkx_mdg = rdflib_to_networkx_multidigraph(graph)

**Identify Human Proteins:**   
A list of human proteins is obtained by querying the ontology to return all ontology classes `only_in_taxon some Homo sapiens`. To expedite the query time, the following SPARQL query is run from the [ProConsortium](https://proconsortium.org/pro_sparql.shtml) SPARQL endpoint: 

```SPARQL
PREFIX obo: <http://purl.obolibrary.org/obo/>

SELECT ?PRO_term
FROM <http://purl.obolibrary.org/obo/pr>
WHERE {
       ?PRO_term rdf:type owl:Class .
       ?PRO_term rdfs:subClassOf ?restriction .
       ?restriction owl:onProperty obo:RO_0002160 .
       ?restriction owl:someValuesFrom obo:NCBITaxon_9606 .

       # use this to filter-out things like hgnc ids
       FILTER (regex(?PRO_term,"http://purl.obolibrary.org/obo/*")) .
}

```
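The query above has to be percent-encoded before it can be sent as a GET request. The following is a minimal sketch that builds the request URL with `urllib.parse.urlencode` instead of encoding it by hand. The endpoint and the `query`, `format`, and `debug` parameters are taken from the URL used in this notebook; `build_sparql_url` is a hypothetical helper name, and the query's comment line is omitted.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

ENDPOINT = 'http://sparql.proconsortium.org/virtuoso/sparql'

QUERY = """PREFIX obo: <http://purl.obolibrary.org/obo/>
SELECT ?PRO_term
FROM <http://purl.obolibrary.org/obo/pr>
WHERE {
   ?PRO_term rdf:type owl:Class .
   ?PRO_term rdfs:subClassOf ?restriction .
   ?restriction owl:onProperty obo:RO_0002160 .
   ?restriction owl:someValuesFrom obo:NCBITaxon_9606 .
   FILTER (regex(?PRO_term,"http://purl.obolibrary.org/obo/*")) .
}
"""


def build_sparql_url(endpoint, query, result_format='text/html'):
    """Percent-encode a SPARQL query into a GET URL (hypothetical helper)."""
    params = {'query': query, 'format': result_format, 'debug': ''}
    return endpoint + '?' + urlencode(params)


sparql_url = build_sparql_url(ENDPOINT, QUERY)

# decoding the URL again recovers the original query text
decoded_query = parse_qs(urlsplit(sparql_url).query)['query'][0]
```

Building the URL from the readable query keeps the two in sync when the query is edited.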


In [123]:
# download data - pro classes only_in_taxon some Homo sapiens (61,064 classes on 12/18/2019)
url = 'http://sparql.proconsortium.org/virtuoso/sparql?query=PREFIX+obo%3A+%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2F%3E%0D%0ASELECT+%3FPRO_term%0D%0AFROM+%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2Fpr%3E%0D%0AWHERE%0D%0A%7B%0D%0A+++%3FPRO_term+rdf%3Atype+owl%3AClass+.%0D%0A+++%3FPRO_term+rdfs%3AsubClassOf+%3Frestriction+.%0D%0A+++%3Frestriction+owl%3AonProperty+obo%3ARO_0002160+.%0D%0A+++%3Frestriction+owl%3AsomeValuesFrom+obo%3ANCBITaxon_9606+.%0D%0A%0D%0A+++FILTER+%28regex%28%3FPRO_term%2C%22http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2F*%22%29%29+.%0D%0A%0D%0A%7D%0D%0A&format=text%2Fhtml&debug='
html = requests.get(url, allow_redirects=True).content

# extract data from html table
df_list = pandas.read_html(html)
human_pro_classes = list(df_list[-1]['PRO_term'])

print('There are {protein_count} human classes in the PRO ontology'.format(protein_count=len(human_pro_classes)))

There are 61064 human classes in the PRO ontology


**Construct Human PRO:**   
Now that we have the list of human PRO classes, we can collect every edge reachable from each class by forward and reverse breadth-first search and use those edges to construct a human-only version of the Protein Ontology.

In [124]:
# create a new graph using bfs paths
human_pro_graph = Graph()
human_networkx_mdg = networkx.MultiDiGraph()

for node in tqdm(human_pro_classes):
    forward = list(networkx.edge_bfs(networkx_mdg, URIRef(node), orientation='original'))
    reverse = list(networkx.edge_bfs(networkx_mdg, URIRef(node), orientation='reverse'))
    
    # add edges from forward and reverse bfs paths
    for path in forward + reverse:
        human_pro_graph.add((path[0], path[2], path[1]))
        human_networkx_mdg.add_edge(path[0], path[1], path[2])

100%|██████████| 61064/61064 [36:49<00:00, 27.63it/s]    
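To make the forward and reverse traversal concrete, here is a plain-Python stand-in for the edge-based breadth-first search used above. It is a sketch on a toy graph, not networkx's implementation; `reachable_edges` is a hypothetical name and all identifiers in the toy triples are made up.

```python
from collections import defaultdict, deque


def reachable_edges(triples, start, reverse=False):
    """Return all (s, p, o) triples reachable from `start` by breadth-first search.

    With reverse=False edges are followed subject -> object; with reverse=True
    they are followed object -> subject.
    """
    index = defaultdict(list)
    for s, p, o in triples:
        index[o if reverse else s].append((s, p, o))

    seen_nodes, seen_edges, found = {start}, set(), []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for triple in index.get(node, []):
            if triple in seen_edges:
                continue
            seen_edges.add(triple)
            found.append(triple)
            neighbor = triple[0] if reverse else triple[2]
            if neighbor not in seen_nodes:
                seen_nodes.add(neighbor)
                queue.append(neighbor)
    return found


# toy ontology; identifiers are illustrative only
triples = [
    ('human_A', 'subClassOf', 'gene_level_A'),
    ('gene_level_A', 'subClassOf', 'protein'),
    ('human_A_isoform', 'subClassOf', 'human_A'),
    ('mouse_A', 'subClassOf', 'gene_level_A'),
]

forward = reachable_edges(triples, 'human_A')
backward = reachable_edges(triples, 'human_A', reverse=True)
kept = set(forward) | set(backward)
```

In this toy graph the forward search collects the ancestors of `human_A`, the reverse search collects its descendants, and the `mouse_A` edge is reached by neither, so it is left out of the filtered graph.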


In [125]:
# verify that the constructed ontology only has 1 component
networkx.number_connected_components(human_networkx_mdg.to_undirected())

1

In [25]:
# save filtered ontology
human_pro_graph.serialize(destination=unprocessed_data_location + 'human_pro.owl', format='xml')

**Classify Ontology:**  
To ensure that we have correctly built the new ontology, we run the HermiT reasoner over it to check that there are no incomplete triples or inconsistent classes. To do this, we call the reasoner using [OWLTools](https://github.com/owlcollab/owltools), which this script assumes has already been downloaded to the `../resources/lib` directory. The following command runs the reasoner (from the command line):  

```bash
./resources/lib/owltools ./resources/unprocessed_data/human_pro_filtered.owl --reasoner hermit --run-reasoner --assert-implied -o ./resources/processed_data/human_pro_closed.owl
```

_**Note.** This step takes around 30-45 minutes to run. When run from the command line, the reasoner determined that the ontology was consistent and 174 new axioms were inferred (12/18/2019)._

In [None]:
# run reasoner -- RUN FROM COMMAND LINE NOT HERE
# (each flag and its value must be a separate list item when no shell is used)
# subprocess.run(['../../resources/lib/owltools',
#                 '../../resources/unprocessed_data/human_pro_filtered.owl',
#                 '--reasoner', 'hermit',
#                 '--run-reasoner',
#                 '--assert-implied',
#                 '--list-unsatisfiable',
#                 '-o', './resources/processed_data/human_pro_closed.owl'])

**Examine Cleaned Human PRO:**  
Once we have cleaned the ontology, we can get counts of components, nodes, and edges, and then write the cleaned graph to the `../../resources/processed_data` directory.

In [None]:
# read in the classified ontology
pro_human_graph = Graph()
pro_human_graph.parse(processed_data_location + 'human_pro_closed.owl')

# get node and edge count (nodes = unique subjects and objects)
edge_count = len(pro_human_graph)
node_count = len(set([str(node) for edge in list(pro_human_graph) for node in edge[0::2]]))

print('\n The classified, filtered Human version of PRO contains {node} nodes and {edge} edges\n'.format(node=node_count, edge=edge_count))
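The node count above relies on slicing each triple with `[0::2]`. The following toy example, with made-up identifiers, shows what that slice selects and how the counts are derived.

```python
# toy triples; identifiers are illustrative only
toy_triples = [
    ('PR_A', 'subClassOf', 'PR_B'),
    ('PR_B', 'subClassOf', 'PR_C'),
    ('PR_A', 'label', 'protein A'),
]

# triple[0::2] keeps positions 0 and 2 (subject and object) and skips the predicate
nodes = {str(node) for triple in toy_triples for node in triple[0::2]}

edge_total = len(toy_triples)
node_total = len(nodes)
```

Note that with this approach literal values such as labels are counted as nodes alongside classes.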

<br>

***
***

### Relations Ontology <a class="anchor" id="relations-ontology"></a>

**Data Source Wiki Page:** [RO](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#relation-ontology)  

**Purpose:** This script downloads the [ro.owl](http://purl.obolibrary.org/obo/ro.owl) file from [obofoundry.org](http://www.obofoundry.org/) in order to obtain all `ObjectProperties` and their inverse relations.  

**Output:** 
- Relations and Inverse Relations ➞ [`INVERSE_RELATIONS.txt`](https://www.dropbox.com/s/sd8qlib8f6gqyz4/INVERSE_RELATIONS.txt?dl=1)
- Relations and Labels ➞ [`RELATIONS_LABELS.txt`](https://www.dropbox.com/s/k2hm9p0r8l9ecj3/RELATIONS_LABELS.txt?dl=1)

In [129]:
url = 'http://purl.obolibrary.org/obo/ro.owl'
data_downloader(url, unprocessed_data_location)

Downloading data file


In [74]:
ro_graph = Graph()
ro_graph.parse(unprocessed_data_location + 'ro.owl')

print('There are {} edges in the ontology'.format(len(ro_graph))) #5,669 edges on 12/15/2019

There are 5669 edges in the ontology



___

**Identify Relations and Inverse Relations:**  
Identify all relations and their inverse relations using the `owl:inverseOf` property. To make it easier to look up the inverse relations, each pair is listed twice, for example:  
- [location of](http://www.ontobee.org/ontology/RO?iri=http://purl.obolibrary.org/obo/RO_0001015) `owl:inverseOf` [located in](http://www.ontobee.org/ontology/RO?iri=http://purl.obolibrary.org/obo/RO_0001025)  
- [located in](http://www.ontobee.org/ontology/RO?iri=http://purl.obolibrary.org/obo/RO_0001025) `owl:inverseOf` [location of](http://www.ontobee.org/ontology/RO?iri=http://purl.obolibrary.org/obo/RO_0001015)
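Listing each pair in both directions is equivalent to building a two-way lookup. The following is a minimal sketch using the pair from the example above; `inverse_lookup` is a hypothetical helper name, not part of the notebook's helper functions.

```python
def inverse_lookup(pairs):
    """Build a two-way lookup from (relation, inverse) pairs."""
    lookup = {}
    for relation, inverse in pairs:
        lookup[relation] = inverse
        lookup[inverse] = relation
    return lookup


# 'location of' (RO_0001015) and 'located in' (RO_0001025)
inverses = inverse_lookup([('RO_0001015', 'RO_0001025')])
```

With both directions stored, the inverse of any relation can be found with a single dictionary access, whichever side of the `owl:inverseOf` statement it appeared on.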

In [None]:
with open('./resources/relations_data/INVERSE_RELATIONS.txt', 'w') as outfile:
    
    # write column names
    outfile.write('Relation' + '\t' + 'Inverse_Relation' + '\n')
    
    # manually add participates in
    outfile.write('RO_0000056' + '\t' + 'RO_0000057' + '\n')

    # find inverse relations
    for s, p, o in tqdm(ro_graph):
        if 'owl#inverseOf' in str(p):
            if 'RO' in str(s) and 'RO' in str(o):
                outfile.write(str(s.split('/')[-1]) + '\t' + str(o.split('/')[-1]) + '\n')
                outfile.write(str(o.split('/')[-1]) + '\t' + str(s.split('/')[-1]) + '\n')

outfile.close()

**Preview Processed Data**

In [79]:
ro_data = pandas.read_csv('./resources/relations_data/INVERSE_RELATIONS.txt',
                          header = 0,
                          delimiter = '\t')

print('There are {edge_count} RO Relations and Inverse Relations'.format(edge_count=len(ro_data)))

There are 173 RO Relations and Inverse Relations


In [80]:
ro_data.head(n=5)

Unnamed: 0,Relation,Inverse_Relation
0,RO_0000056,RO_0000057
1,RO_0010001,RO_0010002
2,RO_0010002,RO_0010001
3,RO_0002248,RO_0002249
4,RO_0002249,RO_0002248


***
**Get Relations Labels:**  
Identify all relations and their labels for use when building the knowledge graph.

In [7]:
results = ro_graph.query(
    """SELECT DISTINCT ?p ?p_label
           WHERE {
              ?p rdf:type owl:ObjectProperty .
              ?p rdfs:label ?p_label . }
           """, initNs={"rdf": 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
                        "rdfs": 'http://www.w3.org/2000/01/rdf-schema#',
                        "owl": 'http://www.w3.org/2002/07/owl#'})    

In [8]:
# write data to file
with open('./resources/relations_data/RELATIONS_LABELS.txt', 'w') as outfile:
    
    # write column names
    outfile.write('Relation' + '\t' + 'Relation_Label' + '\n')

    for p, p_label in list(results):
        outfile.write(str(p).split('/')[-1] + '\t' + str(p_label) + '\n')

**Preview Processed Data**

In [9]:
ro_data_label = pandas.read_csv('./resources/relations_data/RELATIONS_LABELS.txt',
                                header = 0,
                                delimiter = '\t')

print('There are {edge_count} RO Relations and Labels'.format(edge_count=len(ro_data_label)))

There are 454 RO Relations and Labels


In [10]:
ro_data_label.head(n=5)

Unnamed: 0,Relation,Relation_Label
0,RO_0002622,visits flowers of
1,RO_0002467,is mutualism
2,RO_0002014,has negative regulatory component activity
3,RO_0002244,related via exposure to
4,RO_0002606,is substance that treats


<br>

***
***
### Linked Data <a class="anchor" id="linked-data"></a>
***
* [Clinvar Variant-Diseases and Phenotypes](#clinvar-variant)
* [NCBI Gene Protein-Coding Genes and Proteins](#ncbi-protein-coding-genes)  
* [Reactome Chemical-Complex Data](#reactome-chemical-complex)  
* [Reactome Complex-Complex Data](#reactome-complex-complex)  
* [Reactome Protein-Complex Data](#reactome-protein-complex)  
* [Uniprot Protein-Cofactor and Protein-Catalyst](#uniprot-protein-cofactorcatalyst)  

***

### Clinvar Variant-Diseases and Phenotypes <a class="anchor" id="clinvar-variant"></a>

**Data Source Wiki Page:** [Clinvar](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#clinvar)  

**Purpose:** This script downloads the [variant_summary.txt](ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz) file from [ClinVar](https://www.ncbi.nlm.nih.gov/clinvar/) in order to create the following edges:  
- gene-variant  
- variant-disease  
- variant-phenotype  

**Output:** [`CLINVAR_VARIANT_GENE_DISEASE_PHENOTYPE_EDGES.txt`](https://www.dropbox.com/s/1doj3lj46ufgdpd/CLINVAR_VARIANT_GENE_DISEASE_PHENOTYPE_EDGES.txt?dl=1)


In [49]:
url = 'ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz'
data_downloader(url, unprocessed_data_location)

Downloading gzipped data from ftp server
Decompressing and writing gzipped data


In [40]:
# read in data and provided labels (needed to unnest data)
clinvar_data = pandas.read_csv(unprocessed_data_location + 'variant_summary.txt',
                               header = 0,
                               delimiter = '\t',
                               low_memory=False)

# replace NaN with 'None'
clinvar_data.fillna('None', inplace=True)

In [41]:
# explode nested data
explode_df_clinvar = explode(clinvar_data.copy(), ['PhenotypeIDS'], ';')
explode_df_clinvar = explode(explode_df_clinvar.copy(), ['PhenotypeIDS'], ',')

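# edit column formatting (assign the result back rather than using inplace=True on a
# column selection, which may not update the DataFrame in recent pandas versions)
explode_df_clinvar['PhenotypeIDS'] = explode_df_clinvar['PhenotypeIDS'].replace('Orphanet:ORPHA', 'ORPHA:', regex=True)
explode_df_clinvar['PhenotypeIDS'] = explode_df_clinvar['PhenotypeIDS'].replace('Human Phenotype Ontology:HP:', 'HP_', regex=True)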

# write data
explode_df_clinvar.to_csv(processed_data_location + 'CLINVAR_VARIANT_GENE_DISEASE_PHENOTYPE_EDGES.txt', header = True, sep='\t', encoding='utf-8', index=False)
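The net effect of the two `explode` calls and the prefix replacements on a single `PhenotypeIDS` value can be sketched in plain Python. This is a stand-in for illustration only: the `explode` helper lives in `DataPreparationHelperFunctions.py` and operates on whole DataFrames, `split_phenotype_ids` is a hypothetical name, and the identifier string below is made up.

```python
def split_phenotype_ids(value):
    """Split a nested PhenotypeIDS string on ';' and ',' and normalize prefixes."""
    ids = []
    for group in value.split(';'):
        for item in group.split(','):
            item = item.replace('Orphanet:ORPHA', 'ORPHA:')
            item = item.replace('Human Phenotype Ontology:HP:', 'HP_')
            ids.append(item)
    return ids


# illustrative value, not copied from variant_summary.txt
example = 'MedGen:C0000000,Orphanet:ORPHA0000;Human Phenotype Ontology:HP:0000001'
phenotype_ids = split_phenotype_ids(example)
```

Each resulting identifier corresponds to one row in the exploded DataFrame, which is why the edge count is much larger than the number of variants.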

**Preview Processed Data**

In [42]:
print('There are {edge_count} variant edges'.format(edge_count=len(explode_df_clinvar)))

There are 3554505 variant edges


In [43]:
# preview data
explode_df_clinvar.head(n=5)

Unnamed: 0,#AlleleID,Type,Name,GeneID,GeneSymbol,HGNC_ID,ClinicalSignificance,ClinSigSimple,LastEvaluated,RS# (dbSNP),...,ReferenceAllele,AlternateAllele,Cytogenetic,ReviewStatus,NumberSubmitters,Guidelines,TestedInGTR,OtherIDs,SubmitterCategories,VariationID
0,191195,duplication,NM_006920.6(SCN1A):c.2011-13dup,6323,SCN1A,HGNC:10585,Conflicting interpretations of pathogenicity,0,"Feb 15, 2019",549232924,...,T,TA,2q24.3,"criteria provided, conflicting interpretations",3,,N,-,2,194032
1,191195,duplication,NM_006920.6(SCN1A):c.2011-13dup,6323,SCN1A,HGNC:10585,Conflicting interpretations of pathogenicity,0,"Feb 15, 2019",549232924,...,T,TA,2q24.3,"criteria provided, conflicting interpretations",3,,N,-,2,194032
2,191196,single nucleotide variant,NM_001182.5(ALDH7A1):c.1093+1G>A,501,ALDH7A1,HGNC:877,Pathogenic,1,"Feb 23, 2015",794727058,...,C,T,5q23.2,"criteria provided, single submitter",1,,N,-,2,194033
3,191196,single nucleotide variant,NM_001182.5(ALDH7A1):c.1093+1G>A,501,ALDH7A1,HGNC:877,Pathogenic,1,"Feb 23, 2015",794727058,...,C,T,5q23.2,"criteria provided, single submitter",1,,N,-,2,194033
4,191197,single nucleotide variant,NM_001195263.2(PDZD7):c.1752T>C (p.Tyr584=),79955,PDZD7,HGNC:26257,Likely benign,0,"Mar 16, 2015",368563439,...,A,G,10q24.31,"criteria provided, single submitter",1,,N,-,2,194034


<br>

***
***

### NCBI Gene Protein-Coding Gene-Protein <a class="anchor" id="ncbi-protein-coding-genes"></a>

**Data Source Wiki Page:** [Uniprot](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#uniprot-knowledgebase) 

**Purpose:** This script utilizes the merged data created in the [Human Transcript, Gene, and Protein Identifier Mapping](#human-transcript,-gene,-and-protein-identifier-mapping) subsection in order to create the following edges:  
- gene-protein

**Output:** [`PROTEIN_CODING_GENES_PROTEINS.txt`](https://www.dropbox.com/s/79ce6oe68jt72ph/PROTEIN_CODING_GENES_PROTEINS.txt?dl=1)  

In [None]:
# de-dup data
df_ens = merged_data_clean.drop_duplicates(subset=['GeneID_Cleaned', 'pro_id'], keep='first', inplace=False) 

# reformat data and write it out
with open(processed_data_location + 'PROTEIN_CODING_GENES_PROTEINS.txt', 'w') as outfile:
    for idx, row in tqdm(df_ens.iterrows(), total=df_ens.shape[0]):
        if (row['GeneID_Cleaned'] != 'None' and row['pro_id'] != 'None') and row['type_of_gene'] == 'protein-coding': 
            outfile.write(row['GeneID_Cleaned'].strip() + '\t' + row['pro_id'].replace('PR:', 'PR_').strip() + '\n')

outfile.close()

**Preview Processed Data**

In [227]:
hpe_data = pandas.read_csv(processed_data_location + 'PROTEIN_CODING_GENES_PROTEINS.txt',
                           header = None,
                           names=['Entrez_Gene_IDs', 'Protein_Ontology_IDs'],
                           delimiter = '\t')

print('There are {edge_count} protein-coding gene edges'.format(edge_count=len(hpe_data)))

There are 37502 protein-coding gene edges


In [228]:
len(set(list(hpe_data['Entrez_Gene_IDs'])))

19096

In [66]:
hpe_data.head(n=5)

Unnamed: 0,Entrez_Gene_IDs,Protein_Ontology_IDs
0,79501,PR_000011836
1,79501,PR_Q8NH21
2,729759,PR_000011834
3,729759,PR_Q6IEY1
4,81399,PR_000011834


<br>

***

### Reactome Chemical-Complex Data <a class="anchor" id="reactome-chemical-complex"></a>

**Data Source Wiki Page:** [Reactome](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#reactome-pathway-database)  

**Purpose:** This script downloads the [ComplexParticipantsPubMedIdentifiers_human.txt](https://reactome.org/download/current/ComplexParticipantsPubMedIdentifiers_human.txt) file from [Reactome](https://reactome.org) in order to create the following edges:  
- chemical-complex  

**Output:** [`REACTOME_CHEMICAL_COMPLEX.txt`](https://www.dropbox.com/s/qoetjt0vfy6qb3y/REACTOME_CHEMICAL_COMPLEX.txt?dl=1)

In [5]:
# process data
data = open(unprocessed_data_location + 'ComplexParticipantsPubMedIdentifiers_human.txt').readlines()

# reformat data and write it out
with open(processed_data_location + 'REACTOME_CHEMICAL_COMPLEX.txt', 'w') as outfile:
    for line in tqdm(data[1:]):
        row = line.split('\t')
        
        if (row[0].strip().startswith('R-HSA') or row[0].strip().startswith('R-ALL')):
            # find all proteins in a complex
            for x in row[2].split('|'):
                if x.startswith('chebi:'):            
                    outfile.write(x.replace('chebi:', 'CHEBI_') + '\t' + row[0].strip() + '\n')


100%|██████████| 12429/12429 [00:00<00:00, 183156.16it/s]
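
The filtering logic in the cell above can be checked on a single row before running it over the whole file. The following is a minimal sketch, assuming the column layout the cell relies on (complex identifier in column 0, `|`-delimited participants in column 2); the function name `chebi_complex_edges` and the example row are hypothetical and made up for illustration.

```python
def chebi_complex_edges(line):
    """Return (CHEBI id, complex id) pairs for one tab-delimited Reactome row."""
    row = line.rstrip('\n').split('\t')
    complex_id = row[0].strip()

    # keep only human (R-HSA) or species-independent (R-ALL) complexes
    if not complex_id.startswith(('R-HSA', 'R-ALL')):
        return []

    # participants are pipe-delimited; keep the ChEBI ones and rewrite the prefix
    return [(p.strip().replace('chebi:', 'CHEBI_'), complex_id)
            for p in row[2].split('|')
            if p.strip().startswith('chebi:')]


# hypothetical row, not taken from the Reactome file
example = 'R-HSA-0000001\tExample complex [cytosol]\tchebi:15377|uniprot:P00000\t-\t-\n'
print(chebi_complex_edges(example))
```

Returning the pairs instead of writing them directly makes it easy to inspect or de-duplicate edges before they reach `REACTOME_CHEMICAL_COMPLEX.txt`.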


**Preview Processed Data**

In [6]:
cc1_data = pandas.read_csv(processed_data_location + 'REACTOME_CHEMICAL_COMPLEX.txt',
                           header = None,
                           names=['CHEBI_IDs', 'Reactome_IDs'],
                           delimiter = '\t')

print('There are {edge_count} chemical-complex edges'.format(edge_count=len(cc1_data)))

There are 5589 chemical-complex edges


In [7]:
cc1_data.head(n=5)

Unnamed: 0,CHEBI_IDs,Reactome_IDs
0,CHEBI_24505,R-HSA-1006173
1,CHEBI_28879,R-HSA-1006173
2,CHEBI_59888,R-HSA-1013011
3,CHEBI_59888,R-HSA-1013017
4,CHEBI_29105,R-HSA-109266


<br>

***

### Reactome Complex-Complex Data <a class="anchor" id="reactome-complex-complex"></a>

**Data Source Wiki Page:** [Reactome](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#reactome-pathway-database)  

**Purpose:** This script downloads the [ComplexParticipantsPubMedIdentifiers_human.txt](https://reactome.org/download/current/ComplexParticipantsPubMedIdentifiers_human.txt) file from [Reactome](https://reactome.org) to create the following edges:  
- complex-complex  

**Output:** [`REACTOME_COMPLEX_COMPLEX.txt`](https://www.dropbox.com/s/sojaq8u3hwfw4jz/REACTOME_COMPLEX_COMPLEX.txt?dl=1)


In [12]:
# create label dictionary
labels = pandas.read_csv(unprocessed_data_location + 'ComplexParticipantsPubMedIdentifiers_human.txt',
                         header = 0,
                         delimiter = '\t')

# convert to dictionary
label_dict = {row[0]:row[1] for idx, row in labels.iterrows()}

In [13]:
# process data
data = open(unprocessed_data_location + 'ComplexParticipantsPubMedIdentifiers_human.txt').readlines()

# reformat data and write it out
with open(processed_data_location + 'REACTOME_COMPLEX_COMPLEX.txt', 'w') as outfile:
    for line in tqdm(data[1:]):
        row = line.split('\t')
        
        if row[0].strip().startswith('R-HSA'):
            # find all complexes
            for x in row[3].split('|'):
                if (x.startswith('R-HSA-') or x.startswith('R-ALL-')) and x.strip() in label_dict.keys():            
                    outfile.write(row[0].strip() + '\t' + x.strip() + '\n')


100%|██████████| 12429/12429 [00:00<00:00, 188990.70it/s]
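
The cell above keeps a sub-complex only when its identifier also appears as a complex in the same file. A minimal sketch of that membership filter follows, using a `set` of known identifiers in place of the label dictionary; the function name and the two example rows are hypothetical, and the column positions (0 for the complex, 3 for participating complexes) are taken from the cell above.

```python
def complex_complex_edges(lines):
    """Return (complex, sub-complex) pairs from tab-delimited data rows (no header)."""
    rows = [line.rstrip('\n').split('\t') for line in lines]

    # every identifier that appears as a complex in its own right
    known = {row[0].strip() for row in rows}

    edges = []
    for row in rows:
        parent = row[0].strip()
        if not parent.startswith('R-HSA'):
            continue
        for child in row[3].split('|'):
            child = child.strip()
            # keep sub-complexes that are themselves listed in the file
            if child.startswith(('R-HSA-', 'R-ALL-')) and child in known:
                edges.append((parent, child))
    return edges


# hypothetical rows, made up for illustration
rows = [
    'R-HSA-1\tA:B [cytosol]\tuniprot:P1\tR-HSA-2|R-HSA-9\t-\n',
    'R-HSA-2\tB [cytosol]\tuniprot:P2\t-\t-\n',
]
print(complex_complex_edges(rows))
```

Here `R-HSA-9` is dropped because it never appears in column 0, which mirrors the `label_dict` check.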


**Preview Processed Data**

In [14]:
cc_data = pandas.read_csv(processed_data_location + 'REACTOME_COMPLEX_COMPLEX.txt',
                          header = None,
                          names=['Reactome_Complex_u', 'Reactome_Complex_v'],
                          delimiter = '\t')

print('There are {edge_count} complex-complex edges'.format(edge_count=len(cc_data)))

There are 13606 complex-complex edges


In [15]:
cc_data.head(n=5)

Unnamed: 0,Reactome_Complex_u,Reactome_Complex_v
0,R-HSA-1008206,R-HSA-1008229
1,R-HSA-1013011,R-HSA-1013017
2,R-HSA-1013011,R-HSA-1013019
3,R-HSA-1013011,R-HSA-420698
4,R-HSA-1013011,R-HSA-420748


<br>

***
***

### Reactome Complex-Pathway Data <a class="anchor" id="reactome-complex-pathway"></a>

**Data Source Wiki Page:** [Reactome](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#reactome-pathway-database)  

**Purpose:** This script downloads the [Complex_2_Pathway_human.txt](https://reactome.org/download/current/Complex_2_Pathway_human.txt) file from [Reactome](https://reactome.org) to create the following edges:  
- complex-pathway  

**Output:** [`REACTOME_COMPLEX_PATHWAY.txt`](https://www.dropbox.com/s/my03w16fjw7bt20/REACTOME_COMPLEX_PATHWAY.txt?dl=1)


In [None]:
url = 'https://reactome.org/download/current/Complex_2_Pathway_human.txt'
data_downloader(url, unprocessed_data_location)

In [16]:
# process data
data = open(unprocessed_data_location + 'Complex_2_Pathway_human.txt').readlines()

# reformat data and write it out
with open(processed_data_location + 'REACTOME_COMPLEX_PATHWAY.txt', 'w') as outfile:
    for line in tqdm(data[1:]):
        row = line.split('\t')
        if row[0].startswith('R-HSA-'):            
            outfile.write(row[0].strip() + '\t' + row[1].strip() + '\n')


100%|██████████| 20902/20902 [00:00<00:00, 402062.57it/s]


**Preview Processed Data**

In [17]:
cp_data = pandas.read_csv(processed_data_location + 'REACTOME_COMPLEX_PATHWAY.txt',
                          header = None,
                          names=['Reactome_Complex', 'Reactome_Pathway'],
                          delimiter = '\t')

print('There are {edge_count} complex-pathway edges'.format(edge_count=len(cp_data)))

There are 20480 complex-pathway edges


In [18]:
cp_data.head(n=5)

Unnamed: 0,Reactome_Complex,Reactome_Pathway
0,R-HSA-1006173,R-HSA-977606
1,R-HSA-1008206,R-HSA-983231
2,R-HSA-1008229,R-HSA-983231
3,R-HSA-1008252,R-HSA-983231
4,R-HSA-1011577,R-HSA-983231


<br>

***
***

### Reactome Protein-Complex Data <a class="anchor" id="reactome-protein-complex"></a>

**Data Source Wiki Page:** [Reactome](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#reactome-pathway-database)  

**Purpose:** This script downloads the [ComplexParticipantsPubMedIdentifiers_human.txt](https://reactome.org/download/current/ComplexParticipantsPubMedIdentifiers_human.txt) file from [Reactome](https://reactome.org) in order to create the following edges:  
- protein-complex

**Output:** [`REACTOME_PROTEIN_COMPLEX.txt`](https://www.dropbox.com/s/7meu0cdz1mrnsz7/REACTOME_PROTEIN_COMPLEX.txt?dl=1)


In [7]:
url = 'https://reactome.org/download/current/ComplexParticipantsPubMedIdentifiers_human.txt'
data_downloader(url, unprocessed_data_location)

Downloading data file


In [15]:
# process data
data = open(unprocessed_data_location + 'ComplexParticipantsPubMedIdentifiers_human.txt').readlines()

# reformat data and write it out
with open(processed_data_location + 'REACTOME_PROTEIN_COMPLEX.txt', 'w') as outfile:
    for line in tqdm(data):
        row = line.split('\t')
        
        if row[0].strip().startswith('R-HSA'):
            # find all proteins in a complex
            for x in row[2].split('|'):
                if x.startswith('uniprot:'):            
                    outfile.write(x.split(':')[-1].strip() + '\t' + row[0].strip() + '\n')


100%|██████████| 12430/12430 [00:00<00:00, 66838.80it/s]


**Preview Processed Data**

In [16]:
pc_data = pandas.read_csv(processed_data_location + 'REACTOME_PROTEIN_COMPLEX.txt',
                       header = None,
                       names=['Uniprot_Protein', 'Reactome_Complex'],
                       delimiter = '\t')

print('There are {edge_count} protein-complex edges'.format(edge_count=len(pc_data)))

There are 91201 protein-complex edges


In [17]:
pc_data.head(n=5)

Unnamed: 0,Uniprot_Protein,Reactome_Complex
0,P08603,R-HSA-1006173
1,Q16621,R-HSA-1008206
2,Q9ULX9,R-HSA-1008206
3,O15525,R-HSA-1008206
4,O60675,R-HSA-1008206


<br>

***
***

### Uniprot Protein-Cofactor and Protein-Catalyst <a class="anchor" id="uniprot-protein-cofactorcatalyst"></a>

**Data Source Wiki Page:** [Uniprot](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#uniprot-knowledgebase)  

**Purpose:** This script downloads the [uniprot-cofactor-catalyst.tab](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#uniprot-knowledgebase) file from the [Uniprot Knowledge Base](https://www.uniprot.org) in order to create the following edges:  
- protein-cofactor  
- protein-catalyst  

**Output:**  
- protein-cofactor ➞ [`UNIPROT_PROTEIN_COFACTOR.txt`](https://www.dropbox.com/s/ij9t89botd8nmmj/UNIPROT_PROTEIN_COFACTOR.txt?dl=1)
- protein-catalyst ➞ [`UNIPROT_PROTEIN_CATALYST.txt`](https://www.dropbox.com/s/pvopvs0iq8x3oq2/UNIPROT_PROTEIN_CATALYST.txt?dl=1)


In [8]:
url = 'https://www.uniprot.org/uniprot/?query=&fil=organism%3A%22Homo%20sapiens%20(Human)%20%5B9606%5D%22&columns=id%2Centry%20name%2Creviewed%2Cdatabase(PRO)%2Cchebi(Cofactor)%2Cchebi(Catalytic%20activity)&format=tab'
data_downloader(url, unprocessed_data_location, 'uniprot-cofactor-catalyst.tab')

Downloading data file


In [79]:
data = open(unprocessed_data_location + 'uniprot-cofactor-catalyst.tab').readlines()

# reformat data and write it out
with open(processed_data_location + 'UNIPROT_PROTEIN_COFACTOR.txt', 'w') as outfile1, open(processed_data_location + 'UNIPROT_PROTEIN_CATALYST.txt', 'w') as outfile2:
    for line in tqdm(data):

        # get cofactors
        if 'CHEBI' in line.split('\t')[4]: 
            for i in line.split('\t')[4].split(';'):
                chebi = i.split('[')[-1].replace(']', '').replace(':', '_')
                outfile1.write('PR_' + line.split('\t')[3].strip(';') + '\t' + chebi + '\n')
        
        # get catalysts
        if 'CHEBI' in line.split('\t')[5]:       
            for i in line.split('\t')[5].split(';'):
                chebi = i.split('[')[-1].replace(']', '').replace(':', '_').strip()
                outfile2.write('PR_' + line.split('\t')[3].strip(';') + '\t' + chebi + '\n')


100%|██████████| 188350/188350 [00:00<00:00, 389801.93it/s]
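
The cell above pulls each ChEBI identifier out of the bracketed text by splitting on `[` and stripping `]`. A regular expression is one alternative that ignores surrounding names, whitespace, and newlines. This is a minimal sketch, assuming the cofactor and catalyst columns look like `name [CHEBI:id]; name [CHEBI:id]`; the helper name `chebi_ids` and the example string are illustrative.

```python
import re

# match the numeric part of every CHEBI identifier in a field
CHEBI_PATTERN = re.compile(r'CHEBI:(\d+)')


def chebi_ids(field):
    """Return all ChEBI identifiers in a field, rewritten as CHEBI_<digits>."""
    return ['CHEBI_' + digits for digits in CHEBI_PATTERN.findall(field)]


# illustrative field value (format is an assumption)
cofactor_field = 'Mg(2+) [CHEBI:18420]; Mn(2+) [CHEBI:29035]\n'
print(chebi_ids(cofactor_field))
```

A field with no ChEBI annotation yields an empty list, so the `'CHEBI' in ...` guard becomes unnecessary with this approach.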


**Preview Processed Data**

_Preview Cofactor Data_

In [80]:
pcp1_data = pandas.read_csv(processed_data_location + 'UNIPROT_PROTEIN_COFACTOR.txt',
                            header = None,
                            names=['Protein_Ontology_IDs', 'CHEBI_IDs'],
                            delimiter = '\t')

print('There are {edge_count} protein-cofactor edges'.format(edge_count=len(pcp1_data)))

There are 5577 protein-cofactor edges


In [81]:
pcp1_data.head(n=5)

Unnamed: 0,Protein_Ontology_IDs,CHEBI_IDs
0,PR_Q9BRS2,CHEBI_18420
1,PR_Q05823,CHEBI_18420
2,PR_Q05823,CHEBI_29035
3,PR_Q13472,CHEBI_18420
4,PR_Q9BXA7,CHEBI_18420


_Preview Catalyst Data_

In [82]:
pcp2_data = pandas.read_csv(processed_data_location + 'UNIPROT_PROTEIN_CATALYST.txt',
                            header = None,
                            names=['Protein_Ontology_IDs', 'CHEBI_IDs'],
                            delimiter = '\t')

print('There are {edge_count} protein-catalyst edges'.format(edge_count=len(pcp2_data)))

There are 59863 protein-catalyst edges


In [83]:
pcp2_data.head(n=5)

Unnamed: 0,Protein_Ontology_IDs,CHEBI_IDs
0,PR_Q9NP80,CHEBI_15377
1,PR_Q9NP80,CHEBI_15378
2,PR_Q9NP80,CHEBI_28868
3,PR_Q9NP80,CHEBI_16870
4,PR_Q9NP80,CHEBI_58168


<br>

***
***
### INSTANCE METADATA <a class="anchor" id="create-instance-metadata"></a>
***

**Purpose:** The goal of this section is to obtain metadata for each instance data source used in the knowledge graph. For **[`Release V2.0.0`](https://github.com/callahantiff/PheKnowLator/wiki/v2.0.0)**, the following instance data types require metadata:
- [Genes/RNA](#gene-and-rna-metadata)
- [Pathways](#pathway-metadata)
- [Complexes](#complex-metadata)
- [Reactions](#reaction-metadata)
- [Variants](#variant-metadata)

<br>

**Metadata:** The <u>metadata</u> we will gather includes:  

| **Metadata Type** | **Definition** | **Example Node**  | **Example Node Metadata** | 
| :---: | :---: | :---: | :---: | 
| Label | The primary label or name for the node | `R-HSA-1006173` | "CFH:Host cell surface" |       
| Description | A definition or other useful details about the node | `rs794727058` | This `germline` `single nucleotide variant (allele alteration: C➞T)` located on chromosome `5 (GRCh38: NC_000005.10, start/stop positions (126555930/126555930))` with `pathogenic` clinical significance and a last review date of `2/23/2015` (review status: `criteria provided, single submitter`). |        
| Synonym | Alternative terms used for a node | `81399` | "OR1-1, OR7-21" |         
| DbXref | Mapped identifiers used when constructing the knowledge graph | `81399` | `ENST00000456475`<br>`ENST00000426406`<br>`ENST00000332831`<br> |    

<br>

The metadata information will be used to create the following edges in the knowledge graph:  
- **Label** ➞ node `rdfs:label`  
- **Description** ➞ node `obo:IAO_0000115` description 
- **Synonyms** ➞ node `oboInOwl:hasExactSynonym` synonym 
- **DbXref** ➞ node `oboInOwl:hasDbXref` DbXref 
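
The mapping above can be illustrated by turning one metadata record into subject-predicate-object tuples. This is only a sketch of the idea, not the code the knowledge graph build uses: the function name `metadata_triples` is hypothetical, and it assumes `|` separates multiple synonyms or cross-references and that the string `'None'` marks a missing value, as in the metadata files written in this notebook.

```python
# full IRIs for the prefixed predicate names listed above
PREDICATES = {
    'Label': 'http://www.w3.org/2000/01/rdf-schema#label',
    'Description': 'http://purl.obolibrary.org/obo/IAO_0000115',
    'Synonym': 'http://www.geneontology.org/formats/oboInOwl#hasExactSynonym',
    'DbXref': 'http://www.geneontology.org/formats/oboInOwl#hasDbXref',
}


def metadata_triples(node, record):
    """Return (node, predicate, value) tuples for one metadata record."""
    triples = []
    for field, predicate in PREDICATES.items():
        value = record.get(field, 'None')
        if value == 'None':
            continue  # missing metadata produces no triple
        # synonyms and cross-references may hold several pipe-delimited values
        values = value.split('|') if field in ('Synonym', 'DbXref') else [value]
        triples.extend((node, predicate, v) for v in values)
    return triples


record = {'Label': 'OR4F16', 'Synonym': 'OR1-1|OR7-21', 'DbXref': 'ENSG00000284662'}
for triple in metadata_triples('81399', record):
    print(triple)
```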

<br>

*<b>NOTE.</b> All node metadata datasets are written to the `node_data` directory. The algorithm will look for data in this directory and if it is not there, then no node metadata will be created.*

_____


In [1]:
# set directory to write node data to
node_directory = './resources/node_data/'

<br>

***

### Genes and RNA<a class="anchor" id="gene-and-rna-metadata"></a>

**Data Source Wiki Pages:**  
- [Ensembl](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#clinvar)  
- [NCBI Gene](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#ncbi-gene) 

**Input:** 
- [`Merged_Human_Ensembl_Entrez_Uniprot_Identifiers.txt`](https://www.dropbox.com/s/l79166x1fx6vc4l/Merged_Human_Ensembl_Entrez_Uniprot_Identifiers.txt?dl=1)   
- [`Homo_sapiens.gene_info`](https://www.dropbox.com/s/vazlmzxydgv6xzz/Homo_sapiens.gene_info?dl=1)  


**Output:**  
- Genes ➞ [`GENE_METADATA.txt`](https://www.dropbox.com/s/100qxettd7krxys/GENE_METADATA.txt?dl=1)  
- RNA ➞ [`RNA_METADATA.txt`](https://www.dropbox.com/s/zzweg00odanrfwy/RNA_METADATA.txt?dl=1)  

In [229]:
# read in data
rna_gene_data = pandas.read_csv(processed_data_location + 'Merged_Human_Ensembl_Entrez_Uniprot_Identifiers.txt',
                                header = 0,
                                delimiter = '\t',
                                low_memory=False)

# read in ncbi gene data
ncbi_gene = pandas.read_csv(unprocessed_data_location + 'Homo_sapiens.gene_info',
                            header = 0,
                            delimiter = '\t')

# replace "-" with "None"
ncbi_gene.replace('-','None', inplace=True)

# explode nested data
explode_df_ncbi_gene = explode(ncbi_gene.copy(), ['dbXrefs'], '|')

# keep only rows whose cross-reference is an Ensembl identifier
explode_df_ncbi_gene = explode_df_ncbi_gene.loc[explode_df_ncbi_gene['dbXrefs'].apply(lambda x: x.startswith('Ensembl:'))] 

# remove identifier type, which appears before ':'
explode_df_ncbi_gene['dbXrefs'].replace(r'(^\w*\:)','', inplace=True, regex=True)
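
The explode, filter, and prefix-stripping steps above can be traced on a single `dbXrefs` value. The sketch below uses plain Python rather than the `explode` helper (whose implementation lives in `DataPreparationHelperFunctions.py` and is not shown here); the function name `ensembl_ids` and the example string are illustrative.

```python
import re


def ensembl_ids(dbxrefs):
    """Return the Ensembl identifiers found in a pipe-delimited dbXrefs value."""
    # "explode": one entry per cross-reference, keeping only Ensembl ones
    entries = [x for x in dbxrefs.split('|') if x.startswith('Ensembl:')]
    # drop the identifier type, which appears before the first ':'
    return [re.sub(r'^\w*:', '', x) for x in entries]


# illustrative value in the "source:identifier|source:identifier" style
print(ensembl_ids('MIM:138670|HGNC:HGNC:5|Ensembl:ENSG00000121410'))
```

Rows whose `dbXrefs` value is `'None'` produce an empty list, which corresponds to those rows being dropped by the filter.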

**Genes** 

*Process Data*

In [231]:
# remove rows without identifiers
gene_data = rna_gene_data.loc[rna_gene_data['GeneID_Cleaned'].apply(lambda x: x != 'None')]
gene_data_labels = rna_gene_data.loc[rna_gene_data['EnsemblGenes_Cleaned'].apply(lambda x: x != 'None')]

# de-dup data
genes_metadata = gene_data[['GeneID_Cleaned', 'type_of_gene', 'Symbol', 'Synonyms', 'description', 'chromosome', 'map_location', 'Other_designations']].drop_duplicates(subset=['GeneID_Cleaned', 'type_of_gene', 'Symbol', 'Synonyms', 'description', 'chromosome', 'map_location', 'Other_designations'], keep='first', inplace=False) 

# aggregate mapping identifiers
gene_ids = gene_data_labels[['GeneID_Cleaned', 'EnsemblGenes_Cleaned']].groupby('GeneID_Cleaned', as_index=False).agg(lambda x: '|'.join([x for x in list(set(x)) if x != 'None']))

# merge mapping identifiers with data
gene_data_merged = pandas.merge(genes_metadata, gene_ids, left_on='GeneID_Cleaned', right_on='GeneID_Cleaned', how='left')

# replace '' with 'None'
gene_data_merged.replace('','None', inplace=True)

# replace NaN with 'None'
gene_data_merged.fillna('None', inplace=True)
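
The `groupby(...).agg(...)` step above collapses all Ensembl gene identifiers that map to the same Entrez gene into one `|`-delimited string. A pure-Python sketch of the same aggregation is shown below; the function name `aggregate_ids` and the identifier pairs are hypothetical. It sorts the values before joining, whereas joining a `set` directly leaves the order unspecified.

```python
from collections import defaultdict


def aggregate_ids(pairs, sep='|'):
    """Group (key, value) pairs by key and join the distinct values."""
    grouped = defaultdict(set)
    for key, value in pairs:
        values = grouped[key]      # creates the key even if no value is kept
        if value != 'None':
            values.add(value)
    # sorted() gives a stable order; an empty group falls back to 'None'
    return {key: sep.join(sorted(values)) or 'None'
            for key, values in grouped.items()}


# hypothetical identifier pairs, made up for illustration
pairs = [('1', 'ENSG_B'), ('1', 'ENSG_A'), ('1', 'ENSG_A'), ('2', 'None')]
print(aggregate_ids(pairs))
```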

*Write Metadata*

In [None]:
# create metadata
genes, label, description, synonym, dbxref = [], [], [], [], []

for idx, row in tqdm(gene_data_merged.iterrows(), total=gene_data_merged.shape[0]):
    
    # node 
    genes.append(row['GeneID_Cleaned'])
    
    # label
    if row['Symbol'] != 'None':
        label.append(row['Symbol'])
    
    else:
        label.append('None')
    
    # description
    description.append('{desc} is a "{gene}" gene that is located on chromosome {chrom} (map_location: {map_loc}).'.format(desc=row['description'].title(),
                                                                                                                        gene=row['type_of_gene'],
                                                                                                                        chrom=row['chromosome'],
                                                                                                                        map_loc=row['map_location']))
    
    # synonym
    if row['Synonyms'] != 'None' and row['Other_designations'] != 'None':
        synonym.append(row['Synonyms'] + '|' + row['Other_designations'])
    
    elif row['Synonyms'] != 'None' and row['Other_designations'] == 'None':
        synonym.append(row['Synonyms'])
    
    elif row['Synonyms'] == 'None' and row['Other_designations'] != 'None':
        synonym.append(row['Other_designations'])
    
    else:
        synonym.append('None')
    
    # dbxref
    dbxref.append(row['EnsemblGenes_Cleaned'])
    
# combine into new data frame
gene_metadata_final = pandas.DataFrame(list(zip(genes, label, description, synonym, dbxref)), columns =['ID', 'Label', 'Description', 'Synonym', 'DbXref'])

# write data
gene_metadata_final.to_csv(node_directory + 'GENE_METADATA.txt', header = True, sep = '\t', index = False)
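
The four-branch synonym logic in the loop above amounts to "join whichever fields are not `'None'`". A compact sketch of that rule, with a hypothetical helper name `merge_synonyms`, could look like this:

```python
def merge_synonyms(*fields, missing='None', sep='|'):
    """Join the fields that hold a value; return the missing marker if none do."""
    kept = [f for f in fields if f != missing]
    return sep.join(kept) if kept else missing


print(merge_synonyms('OR1-1|OR7-21', 'olfactory receptor 4F3/4F16/4F29'))
print(merge_synonyms('None', 'olfactory receptor 4F5'))
print(merge_synonyms('None', 'None'))
```

Inside the loop this would be called as `merge_synonyms(row['Synonyms'], row['Other_designations'])`, and the same helper would serve the RNA metadata loop.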


In [233]:
# preview data
gene_metadata_final.head(n=5)

Unnamed: 0,ID,Label,Description,Synonym,DbXref
0,79501,OR4F5,Olfactory Receptor Family 4 Subfamily F Member...,olfactory receptor 4F5,ENSG00000186092
1,729759,OR4F29,Olfactory Receptor Family 4 Subfamily F Member...,OR7-21|olfactory receptor 4F3/4F16/4F29|olfact...,ENSG00000284733
2,81399,OR4F16,Olfactory Receptor Family 4 Subfamily F Member...,OR1-1|OR7-21|olfactory receptor 4F3/4F16/4F29|...,ENSG00000284662
3,26683,OR4F3,Olfactory Receptor Family 4 Subfamily F Member...,olfactory receptor 4F3/4F16/4F29|olfactory rec...,ENSG00000230178
4,148398,SAMD11,Sterile Alpha Motif Domain Containing 11 is a ...,MRS|sterile alpha motif domain-containing prot...,ENSG00000187634


**RNA** 

_Process Data_

In [234]:
# remove rows without identifiers
rna_data = rna_gene_data.loc[rna_gene_data['transcript_stable_id'].apply(lambda x: x != 'None')]
rna_data_labels = rna_gene_data.loc[rna_gene_data['GeneID_Cleaned'].apply(lambda x: x != 'None')]

# de-dup data
rna_metadata = rna_data[['transcript_stable_id', 'type_of_gene', 'Symbol', 'Synonyms', 'description', 'chromosome', 'map_location', 'Other_designations']].drop_duplicates(subset=['transcript_stable_id', 'type_of_gene', 'Symbol', 'Synonyms', 'description', 'chromosome', 'map_location', 'Other_designations'], keep='first', inplace=False) 


# aggregate mapping identifiers
agg_cols = []

for x in [ x for x in list(rna_data_labels) if x != 'transcript_stable_id']:
    if x == 'GeneID_Cleaned':
        agg_cols.append(rna_data_labels[['transcript_stable_id', x]].groupby('transcript_stable_id', as_index=False).agg(lambda x: '|'.join([x for x in list(set(x)) if x != 'None'])))
    
    else:
        agg_cols.append(rna_data_labels[['transcript_stable_id', x]].groupby('transcript_stable_id', as_index=False).agg(lambda x: ', '.join([x for x in list(set(x)) if x != 'None'])))

# merge aggregated columns back together
rna_merged = reduce(lambda  left, right: pandas.merge(left, right, on=['transcript_stable_id'], how='outer'), agg_cols)

# replace '' with 'None'
rna_merged.replace('','None', inplace=True)

# replace NaN with 'None'
rna_merged.fillna('None', inplace=True)

*Write Metadata*

In [None]:
# create metadata
rna, label, description, synonym, dbxref = [], [], [], [], []

for idx, row in tqdm(rna_merged.iterrows(), total=rna_merged.shape[0]):
    
    # node 
    rna.append(row['transcript_stable_id'])
    
    # label
    if row['Symbol'] != 'None':
        label.append(row['Symbol'])
    
    else:
        label.append('None')
    
    # description
    description.append('This transcript was transcribed from {desc}, a "{gene}" gene that is located on chromosome {chrom} (map_location: {map_loc}).'.format(desc=row['description'].title(),
                                                                                                                                                             gene=row['type_of_gene'],
                                                                                                                                                             chrom=row['chromosome'],
                                                                                                                                                             map_loc=row['map_location']))
    
    # synonym
    if row['Synonyms'] != 'None' and row['Other_designations'] != 'None':
        synonym.append(row['Synonyms'] + '|' + row['Other_designations'])
    
    elif row['Synonyms'] != 'None' and row['Other_designations'] == 'None':
        synonym.append(row['Synonyms'])
    
    elif row['Synonyms'] == 'None' and row['Other_designations'] != 'None':
        synonym.append(row['Other_designations'])
    
    else:
        synonym.append('None')
    
    # dbxref
    dbxref.append(row['GeneID_Cleaned'])
    
# combine into new data frame
rna_metadata_final = pandas.DataFrame(list(zip(rna, label, description, synonym, dbxref)), columns =['ID', 'Label', 'Description', 'Synonym', 'DbXref'])

# write data
rna_metadata_final.to_csv(node_directory + 'RNA_METADATA.txt', header = True, sep = '\t', index = False)

In [236]:
# preview data
rna_metadata_final.head(n=5)

Unnamed: 0,ID,Label,Description,Synonym,DbXref
0,ENST00000000233,ARF5,This transcript was transcribed from Adp Ribos...,ADP-ribosylation factor 5,381
1,ENST00000000412,M6PR,This transcript was transcribed from Mannose-6...,CD-M6PR|CD-MPR|MPR 46|MPR-46|MPR46|SMPR|cation...,4074
2,ENST00000000442,ESRRA,This transcript was transcribed from Estrogen ...,ERR1|ERRa|ERRalpha|ESRL1|NR3B1|steroid hormone...,2101
3,ENST00000001008,FKBP4,This transcript was transcribed from Fkbp Prol...,FKBP51|FKBP52|FKBP59|HBI|Hsp56|PPIase|p52|pept...,2288
4,ENST00000001146,CYP26B1,This transcript was transcribed from Cytochrom...,CYP26A2|P450RAI-2|P450RAI2|RHFCA|cytochrome P4...,56603


<br>

***
***

### Pathways<a class="anchor" id="pathway-metadata"></a>

**Data Source Wiki Page:** [Reactome](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#reactome-pathway-database)  
**Input:** [`ReactomePathways.txt`](https://reactome.org/download/current/ReactomePathways.txt)  
**Output:** [`PATHWAY_METADATA.txt`](https://www.dropbox.com/s/mdxc9gjsrt20l06/PATHWAY_METADATA.txt?dl=1)  

In [None]:
url = 'https://reactome.org/download/current/ReactomePathways.txt'
data_downloader(url, unprocessed_data_location)

In [12]:
# read in data
pathway_data = pandas.read_csv(unprocessed_data_location + 'ReactomePathways.txt',
                               header = None,
                               delimiter = '\t',
                               low_memory=False)

In [13]:
# filter data to only include human data
pathway_metadata = pathway_data.loc[pathway_data[2].apply(lambda x: x.startswith('Homo sapien'))] 

# reorder data
pathway_metadata = pathway_metadata[[0, 1]].drop_duplicates(subset=None, keep='first', inplace=False)

# rename columns
pathway_metadata.rename(columns={0: 'ID', 1: 'Label'}, inplace=True)

# replace NaN with 'None'
pathway_metadata.fillna('None', inplace=True)

# write data
pathway_metadata.to_csv(node_directory + 'PATHWAY_METADATA.txt', header = True, sep = '\t', index = False)

In [14]:
# preview data
pathway_metadata.head(n=5)

Unnamed: 0,ID,Label
10025,R-HSA-164843,2-LTR circle formation
10026,R-HSA-73843,5-Phosphoribose 1-diphosphate biosynthesis
10027,R-HSA-1971475,A tetrasaccharide linker sequence is required ...
10028,R-HSA-5619084,ABC transporter disorders
10029,R-HSA-1369062,ABC transporters in lipid homeostasis


<br>

***
***

### Complexes<a class="anchor" id="complex-metadata"></a>

**Data Source Wiki Page:** [Reactome](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#reactome-pathway-database)    
**Input:** [`ComplexParticipantsPubMedIdentifiers_human.txt`](https://reactome.org/download/current/ComplexParticipantsPubMedIdentifiers_human.txt)    
**Output:** [`COMPLEX_METADATA.txt`](https://www.dropbox.com/s/hqawjrmjtrsxsl9/COMPLEX_METADATA.txt?dl=1)  

In [None]:
url = 'https://reactome.org/download/current/ComplexParticipantsPubMedIdentifiers_human.txt'
data_downloader(url, unprocessed_data_location)

In [15]:
# read in data
complex_data = pandas.read_csv(unprocessed_data_location + 'ComplexParticipantsPubMedIdentifiers_human.txt',
                               header = 0,
                               delimiter = '\t',
                               low_memory=False)

In [16]:
# reorder data
complex_metadata = complex_data[['identifier', 'name']].drop_duplicates(subset=None, keep='first', inplace=False)

# rename columns
complex_metadata.rename(columns={'identifier': 'ID', 'name': 'Label'}, inplace=True)

# replace NaN with 'None'
complex_metadata.fillna('None', inplace=True)

# write data
complex_metadata.to_csv(node_directory + 'COMPLEX_METADATA.txt', header = True, sep = '\t', index = False)

In [17]:
# preview data
complex_metadata.head(n=5)

Unnamed: 0,ID,Label
0,R-HSA-1006173,CFH:Host cell surface [plasma membrane]
1,R-HSA-1008206,NF-E2:Promoter region of beta-globin [nucleopl...
2,R-HSA-1008229,NF-E2 [nucleoplasm]
3,R-HSA-1008252,IRF1:Promoter region of IFN beta [nucleoplasm]
4,R-HSA-1011577,C-terminal EH domain containing proteins:Raben...


<br>

***
***

### Reactions<a class="anchor" id="reaction-metadata"></a>

**Data Source Wiki Page:** [Reactome](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#reactome-pathway-database)   
**Input:** [`UniProt2ReactomeReactions.txt`](https://reactome.org/download/current/UniProt2ReactomeReactions.txt)  + [`ChEBI2ReactomeReactions.txt`](https://reactome.org/download/current/ChEBI2ReactomeReactions.txt)  
**Output:** [`REACTION_METADATA.txt`](https://www.dropbox.com/s/lmf2ekadzzqdcf9/REACTION_METADATA.txt?dl=1)  

In [None]:
url = 'https://reactome.org/download/current/UniProt2ReactomeReactions.txt'
data_downloader(url, unprocessed_data_location)

url = 'https://reactome.org/download/current/ChEBI2ReactomeReactions.txt'
data_downloader(url, unprocessed_data_location)

In [18]:
react_data1 = pandas.read_csv(unprocessed_data_location + 'UniProt2ReactomeReactions.txt',
                              header = None,
                              delimiter = '\t',
                              low_memory=False)

react_data2 = pandas.read_csv(unprocessed_data_location + 'ChEBI2ReactomeReactions.txt',
                              header = None,
                              delimiter = '\t',
                              low_memory=False)

In [19]:
# merge datasets
merged_react_data = pandas.merge(react_data1[[1, 3, 5]], react_data2[[1, 3, 5]], left_on=[1, 3, 5], right_on=[1, 3, 5], how='outer')

# filter data to only include human reactions
merged_react_data = merged_react_data.loc[merged_react_data[5].apply(lambda x: x.startswith('Homo sapiens'))] 

# reorder data
merged_react_data = merged_react_data[[1, 3]].drop_duplicates(subset=None, keep='first', inplace=False)

# rename columns
merged_react_data.rename(columns={1: 'ID', 3: 'Label'}, inplace=True)

# replace NaN with 'None'
merged_react_data.fillna('None', inplace=True)

# write data
merged_react_data.to_csv(node_directory + 'REACTION_METADATA.txt', header = True, sep = '\t', index = False)
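
Because the outer merge above joins on all three selected columns and is followed by `drop_duplicates`, its net effect is the set of distinct human (ID, Label) pairs found in either file. A minimal set-based sketch of that result follows; the function name and rows are hypothetical, and only the positions of columns 1 (reaction ID), 3 (label), and 5 (species) are taken from the cell above, while the meaning of the other columns is assumed.

```python
def reaction_labels(*tables, species='Homo sapiens'):
    """Return the distinct (reaction ID, label) pairs for one species."""
    pairs = {(row[1], row[3])
             for table in tables
             for row in table
             if row[5].startswith(species)}
    return sorted(pairs)


# hypothetical rows: (source id, reaction id, url, label, evidence, species)
uniprot_rows = [('P00000', 'R-HSA-0000001', 'url', 'Example reaction', 'TAS', 'Homo sapiens')]
chebi_rows = [
    ('15377', 'R-HSA-0000001', 'url', 'Example reaction', 'TAS', 'Homo sapiens'),
    ('15377', 'R-MMU-0000002', 'url', 'Other reaction', 'IEA', 'Mus musculus'),
]
print(reaction_labels(uniprot_rows, chebi_rows))
```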

In [20]:
# preview data
merged_react_data.head(n=5)

Unnamed: 0,ID,Label
16004,R-HSA-1112666,BLNK (SLP-65) Signalosome hydrolyzes phosphati...
16384,R-HSA-166753,Conversion of C4 into C4a and C4b
17320,R-HSA-166792,Conversion of C2 into C2a and C2b
18247,R-HSA-173626,Activation of C1r
18532,R-HSA-173631,Activation of C1s


<br>

***

### Variants<a class="anchor" id="variant-metadata"></a>

**Data Source Wiki Page:** [ClinVar](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#clinvar)  
**Input:** [`variant_summary.txt.gz`](ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz)   
**Output:** [`VARIANT_METADATA.txt`](https://www.dropbox.com/s/2scki0c33a4w0ez/VARIANT_METADATA.txt?dl=1)  

In [9]:
url = 'ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz'
data_downloader(url, unprocessed_data_location)

Downloading gzipped data from ftp server
Decompressing and writing gzipped data


In [21]:
var_data = pandas.read_csv(unprocessed_data_location + 'variant_summary.txt',
                           header = 0,
                           delimiter = '\t',
                           low_memory=False)

*Process Data*

In [22]:
# keep GRCh38 entries that have a dbSNP identifier (-1 marks a missing RS#)
var_data = var_data.loc[var_data['Assembly'].apply(lambda x: x == 'GRCh38')]
var_data = var_data.loc[var_data['RS# (dbSNP)'].apply(lambda x: x != -1)]

# select the columns needed for the metadata
var_metadata = var_data[['#AlleleID', 'Type', 'Name', 'ClinicalSignificance', 'RS# (dbSNP)', 'Origin',
                         'ChromosomeAccession', 'Chromosome', 'Start', 'Stop', 'ReferenceAllele',
                         'Assembly', 'AlternateAllele','Cytogenetic', 'ReviewStatus', 'LastEvaluated']] 

# replace NaN with 'None'
var_metadata.fillna('None', inplace=True)

# remove duplicate dbSNP ids, keeping the first row after sorting on LastEvaluated in descending order
var_metadata.sort_values('LastEvaluated', ascending=False, inplace=True)
var_metadata.drop_duplicates(subset='RS# (dbSNP)', keep='first', inplace=True)
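
One caveat with the sort above: if `LastEvaluated` holds date strings rather than parsed dates, a descending sort is lexicographic and need not be chronological. The sketch below illustrates the difference, assuming a `Mon DD, YYYY` date format (an assumption about the file, not something verified here); the helper name `review_date` and the sample values are made up.

```python
from datetime import datetime


def review_date(value, fmt='%b %d, %Y'):
    """Parse a review date; unparseable values sort as the oldest possible date."""
    try:
        return datetime.strptime(value, fmt)
    except (TypeError, ValueError):
        return datetime.min


dates = ['Feb 23, 2015', 'Mar 01, 2019', 'None', 'Dec 31, 2016']

print(sorted(dates, reverse=True)[0])   # lexicographic "latest": 'None'
print(max(dates, key=review_date))      # chronological latest: 'Mar 01, 2019'
```

In pandas, a comparable effect could be obtained by converting the column with `pandas.to_datetime(..., errors='coerce')` before sorting.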

*Write Metadata*

In [23]:
# create metadata
variant, label, description = [], [], []

for idx, row in tqdm(var_metadata.iterrows(), total=var_metadata.shape[0]):
    
    # node 
    variant.append(row['RS# (dbSNP)'])
    
    # label
    if row['Name'] != 'None':
        label.append(row['Name'])
    
    else:
        label.append('None')
    
    # description
    sent = 'This variant is a {Origin} {Type} that results when a {ReferenceAllele} allele is changed to {AlternateAllele} on chromosome {Chromosome} ({ChromosomeAccession}, start:{Start}/stop:{Stop} positions, cytogenetic location:{Cytogenetic}) and has clinical significance "{ClinicalSignificance}". This entry is for the {Assembly} and was last reviewed on {LastEvaluated} with review status "{ReviewStatus}".'
    description.append(sent.format(Origin=row['Origin'], Type=row['Type'], ReferenceAllele=row['ReferenceAllele'],
                                   AlternateAllele=row['AlternateAllele'], Chromosome=row['Chromosome'],
                                   ChromosomeAccession=row['ChromosomeAccession'], Start=row['Start'],
                                   Stop=row['Stop'], Cytogenetic=row['Cytogenetic'], ClinicalSignificance=row['ClinicalSignificance'],
                                   Assembly=row['Assembly'], LastEvaluated=row['LastEvaluated'], ReviewStatus=row['ReviewStatus']))
    
# combine into new data frame
var_metadata_final = pandas.DataFrame(list(zip(variant, label, description)), columns =['ID', 'Label', 'Description'])

# drop duplicates
var_metadata_final.drop_duplicates(subset=None, keep='first', inplace=True)

# write data
var_metadata_final.to_csv(node_directory + 'VARIANT_METADATA.txt', header = True, sep = '\t', index = False)                      

100%|██████████| 413724/413724 [02:17<00:00, 3009.58it/s]
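
The long `sent.format(...)` call above repeats every column name twice. Since the placeholder names match the column names, `str.format_map` can fill the template directly from a mapping. This is a minimal sketch with a shortened, illustrative template and a plain dictionary standing in for a row (a pandas row could be converted with `row.to_dict()`).

```python
# shortened template for illustration; placeholders match column names
TEMPLATE = ('This variant is a {Origin} {Type} on chromosome {Chromosome} '
            'with clinical significance "{ClinicalSignificance}".')


def describe(row, template=TEMPLATE):
    """Fill the template from a mapping; extra keys in the mapping are ignored."""
    return template.format_map(row)


row = {'Origin': 'germline', 'Type': 'single nucleotide variant',
       'Chromosome': '5', 'ClinicalSignificance': 'Pathogenic', 'Start': 126555930}
print(describe(row))
```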


In [24]:
# preview data
var_metadata_final.head(n=5)

Unnamed: 0,ID,Label,Description
0,1057519493,NM_033004.4(NLRP1):c.197C>T (p.Ala66Val),This variant is a germline single nucleotide v...
1,1057524876,NM_033004.4(NLRP1):c.3641C>G (p.Pro1214Arg),This variant is a germline single nucleotide v...
2,776245016,NM_033004.4(NLRP1):c.2176C>T (p.Arg726Trp),This variant is a germline single nucleotide v...
3,1057519492,NM_033004.4(NLRP1):c.160G>A (p.Ala54Thr),This variant is a germline single nucleotide v...
4,2004640,NM_001098629.3(IRF5):c.-12+198=,This variant is a germline single nucleotide v...
