
# PheKnowLator - Data Preparation


***
***

**Author:** [TJCallahan](https://mail.google.com/mail/u/0/?view=cm&fs=1&tf=1&to=callahantiff@gmail.com)  
**GitHub Repository:** [PheKnowLator](https://github.com/callahantiff/PheKnowLator/wiki)  
**Release:** **[v2.0.0](https://github.com/callahantiff/PheKnowLator/wiki/v2.0.0)**

**Purpose:** This notebook serves as a script to preprocess data and/or generate mapping and filtering data for the PheKnowLator project. The script creates each of the mapping and/or filtering data sources described on the [Data Sources](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources) Wiki page. 

**Assumptions:** The script assumes that there are files, which need further processing and are located in the `./resources/processed_data/unprocessed_data` directory.


***

## Table of Contents
### Create Mapping Data    
* [MESH-ChEBI](#mesh-chebi)  
* [Ensembl Gene - Ensembl Transcript](#ensemblgene-ensembltranscript)  
* [Ensembl Gene - Entrez Gene](#ensemblgene-entrezgene)  
* [Ensembl Gene - Uniprot Accession](#ensemblgene-uniprot)  
* [Ensembl Protein - Uniprot Accession](#ensemblprotein-uniprot)  
* [Uniprot Accession - Protein Ontology](#uniprot-pro)  
* [Ensembl Protein - Protein Ontology](#ensemblprotein-pro) 
* [HPA Tissue/Cells - UBERON + Cell Ontology](#hpa-uberon) 
* [Disease and Phenotype Identifiers](#disease-identifiers) 

### Process Edge Data
**Ontologies**
* [Protein Ontology](#protein-ontology)  
* [Relation Ontology](#relation-ontology)  

**Linked Data**
* [Reactome: Protein-Complex Data](#reactome-protein-complex)  
* [Reactome: Complex-Complex Data](#reactome-complex-complex)  
* [Reactome: Chemical-Complex Data](#reactome-chemical-complex)  
* [Uniprot: Protein-Cofactor and Protein-Catalyst](#uniprot-protein-cofactorcatalyst)  
* [NCBI Gene: Protein-Coding Genes](#ncbi-protein-coding-genes)    

***
***

### Set-Up Environment


In [1]:
# import needed libraries
import glob
import networkx
import pandas

from owlready2 import subprocess
from rdflib import Graph, Namespace, URIRef, BNode, extras, Literal
from rdflib.extras.external_graph_libs import *
from tqdm import tqdm

# import script containing helper functions
from scripts.python.data_preparation_helper_functions import *

**Define Global Variables**

In [2]:
# directory to read unprocessed data files from
unprocessed_data_location = 'resources/processed_data/unprocessed_data/'

# directory to write processed data files to
processed_data_location = 'resources/processed_data/'


***
***
***

### CREATE MAPPING DATA  <a class="anchor" id="mapping-data"></a>


### MESH - ChEBI <a class="anchor" id="mesh-chebi"></a>

**Wiki Page:** [mapping-mesh-to-chebi](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#mapping-mesh-identifiers-to-chebi-identifiers)  

**Purpose:** This script assumes that the `NCBO_rest_api.py` script was run and the data generated from this file was written to `./resources/processed_data/temp`.


In [3]:
with open(processed_data_location + 'MESH_CHEBI_MAP.txt', 'w') as out:
    for filename in tqdm(glob.glob(processed_data_location + 'temp/*.txt')):
        for row in list(filter(None, open(filename, 'r').read().split('\n'))):
            mesh = '_'.join(row.split('\t')[0].split('/')[-2:])
            chebi = row.split('\t')[1].split('/')[-1]
            out.write(mesh + '\t' + chebi + '\n')

out.close()

100%|██████████| 44/44 [00:00<00:00, 748.68it/s]


**Preview Processed Data**

In [4]:
mc_data = pandas.read_csv(processed_data_location + 'MESH_CHEBI_MAP.txt',
                          delimiter = '\t',
                          header=None,
                          names=['MeSH_IDs', 'ChEBI_IDs'])

print('There are {edge_count} MeSH-ChEBI edges'.format(edge_count=len(mc_data)))

There are 11434 MeSH-ChEBI edges


In [5]:
mc_data.head(n=5)

Unnamed: 0,MeSH_IDs,ChEBI_IDs
0,MESH_C535085,CHEBI_133814
1,MESH_C008574,CHEBI_17221
2,MESH_C492482,CHEBI_34581
3,MESH_C007556,CHEBI_135978
4,MESH_C500395,CHEBI_29138



***
***

### Ensembl Gene - Ensembl Transcript <a class="anchor" id="ensemblgene-ensembltranscript"></a>

**Wiki Page:** [uniprot-knowledgebase](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#uniprot-knowledgebase)  

**Purpose:** This script downloads the [HUMAN_9606_idmapping_selected.tab.gz](ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/by_organism/HUMAN_9606_idmapping_selected.tab.gz) file from the [Uniprot Knolwedge Base](https://www.uniprot.org/) and saves it to the `./resources/processed_data/unprocessed_data` directory.

In [6]:
url = 'ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/by_organism/HUMAN_9606_idmapping_selected.tab.gz'
data_downloader(url, unprocessed_data_location)

Downloading gzipped data from ftp server
Decompressing and writing gzipped data


In [None]:
data_processor(filepath=unprocessed_data_location + 'HUMAN_9606_idmapping_selected.tab',
               row_splitter='\t',
               column_list=[18, 19, 19],
               output_name='ENSEMBL_GENE_ENSEMBL_TRANSCRIPT_MAP.txt',
               write_location = processed_data_location,
               line_splitter=';')

**Preview Processed Data**

In [None]:
eget_data = pandas.read_csv(processed_data_location + 'ENSEMBL_GENE_ENSEMBL_TRANSCRIPT_MAP.txt',
                            header = None,
                            names=['Ensembl_Gene_IDs', 'Ensembl_Transcript_IDs'],
                            delimiter = '\t')

print('There are {edge_count} ensembl gene-ensembl transcript edges'.format(edge_count=len(eget_data)))

In [None]:
eget_data.head(n=5)



***

### Ensembl Gene - Entrez Gene <a class="anchor" id="ensemblgene-entrezgene"></a>

**Wiki Page:** [uniprot-knowledgebase](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#uniprot-knowledgebase)  

**Purpose:** This script assumes the [HUMAN_9606_idmapping_selected.tab.gz](ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/by_organism/HUMAN_9606_idmapping_selected.tab.gz) file from the [Uniprot Knolwedge Base](https://www.uniprot.org/) has been downloaded and saved to the `./resources/processed_data/unprocessed_data` directory.


In [28]:
data_processor(filepath=unprocessed_data_location + 'HUMAN_9606_idmapping_selected.tab',
               row_splitter='\t',
               column_list=[18, 2],
               output_name='ENSEMBL_GENE_ENTREZ_GENE_MAP.txt',
               write_location = processed_data_location,
               line_splitter=';')

100%|██████████| 188349/188349 [00:01<00:00, 184857.93it/s]


**Preview Processed Data**

In [29]:
egeg_data = pandas.read_csv(processed_data_location + 'ENSEMBL_GENE_ENTREZ_GENE_MAP.txt',
                            header = None,
                            names=['Ensembl_Gene_IDs', 'Entrez_IDs'],
                            delimiter = '\t')

print('There are {edge_count} ensembl gene-entrez gene edges'.format(edge_count=len(egeg_data)))

There are 25786 ensembl gene-entrez gene edges


In [35]:
egeg_data.head(n=5)

Unnamed: 0,Ensembl_Gene_IDs,Entrez_IDs
0,ENSG00000166913,7529
1,ENSG00000108953,7531
2,ENSG00000274474,7531
3,ENSG00000128245,7533
4,ENSG00000170027,7532



***

### Ensembl Data Labels <a class="anchor" id="ensembl-labels"></a>

**Wiki Page:** [gene-labels](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#gene-labels)  

**Purpose:** This script downloads the [Homo_sapiens.gene_info.gz](ftp://ftp.ncbi.nih.gov/gene/DATA/GENE_INFO/Mammalia/Homo_sapiens.gene_info.gz) file from [NCBI Gene](https://www.ncbi.nlm.nih.gov/gene/) and saves it to the `./resources/processed_data/unprocessed_data` directory.

<br>
This script adds labels to the following files stored in the `./resources/processed_data` directory, and saves the results to the same file name:  
- `ENSEMBL_GENE_ENSEMBL_TRANSCRIPT_MAP.txt`  
- `ENSEMBL_GENE_ENTREZ_GENE_MAP.txt`

In [38]:
url = 'ftp://ftp.ncbi.nih.gov/gene/DATA/GENE_INFO/Mammalia/Homo_sapiens.gene_info.gz'
data_downloader(url, unprocessed_data_location)

Downloading gzipped data from ftp server
Decompressing and writing gzipped data


In [40]:
data_labels = pandas.read_csv(unprocessed_data_location + 'Homo_sapiens.gene_info',
                              header = 0,
                              delimiter = '\t')

`ENSEMBL_GENE_ENSEMBL_TRANSCRIPT_MAP.txt`

In [41]:
# create dictionary to store label data
label_dic = label_dict(data_labels,
                       'Symbol_from_nomenclature_authority',
                       'Full_name_from_nomenclature_authority',
                       'Synonyms',
                       'dbXrefs',
                       'ENS')

100%|██████████| 61646/61646 [00:09<00:00, 6354.94it/s]


In [42]:
# get attribute information
data_atts = label_attributes(eget_data, label_dic, 'Ensembl_Gene_IDs', 'Ensembl_Transcript_IDs')

# copy data frame
data_cp = eget_data.copy(deep=True)

# add columns to data frames
data_cp['Ensembl_Names'] = data_atts[0]
data_cp['Ensembl_Descriptions'] = data_atts[1]
data_cp['Ensembl_Synonyms'] = data_atts[2]
    
# save data
if len(data_cp) == len(eget_data):
    data_cp.to_csv(processed_data_location + 'ENSEMBL_GENE_ENSEMBL_TRANSCRIPT_MAP.txt',
                   header=True,
                   index=False,
                   sep='\t')
else:
    raise Exception('ERROR: Data sets are not the same size')

100%|██████████| 145246/145246 [00:27<00:00, 5319.50it/s]


In [43]:
data_cp.head(n=5)

Unnamed: 0,Ensembl_Gene_IDs,Ensembl_Transcript_IDs,Ensembl_Names,Ensembl_Descriptions,Ensembl_Synonyms
0,ENSG00000166913,ENST00000353703,YWHAB,tyrosine 3-monooxygenase/tryptophan 5-monooxyg...,GW128|HEL-S-1|HS1|KCIP-1|YWHAA
1,ENSG00000166913,ENST00000372839,YWHAB,tyrosine 3-monooxygenase/tryptophan 5-monooxyg...,GW128|HEL-S-1|HS1|KCIP-1|YWHAA
2,ENSG00000108953,ENST00000264335,YWHAE,tyrosine 3-monooxygenase/tryptophan 5-monooxyg...,14-3-3E|HEL2|KCIP-1|MDCR|MDS
3,ENSG00000108953,ENST00000571732,YWHAE,tyrosine 3-monooxygenase/tryptophan 5-monooxyg...,14-3-3E|HEL2|KCIP-1|MDCR|MDS
4,ENSG00000108953,ENST00000616643,YWHAE,tyrosine 3-monooxygenase/tryptophan 5-monooxyg...,14-3-3E|HEL2|KCIP-1|MDCR|MDS


`ENSEMBL_GENE_ENTREZ_GENE_MAP.txt`

In [44]:
# create dictionary to store label data
label_dic = label_dict(data_labels, 'Symbol_from_nomenclature_authority', 'Full_name_from_nomenclature_authority',  'Synonyms', 'GeneID', None)

100%|██████████| 61646/61646 [00:10<00:00, 5972.77it/s]


In [45]:
# get attribute information
data_atts = label_attributes(egeg_data, label_dic, 'Entrez_IDs', None)

# copy data frame
data_cp = egeg_data.copy(deep=True)

# add columns to data frames
data_cp['Entrez_Names'] = data_atts[0]
data_cp['Entrez_Descriptions'] = data_atts[1]
data_cp['Entrez_Synonyms'] = data_atts[2]
    
# save data
if len(data_cp) == len(egeg_data):
    data_cp.to_csv(processed_data_location + 'ENSEMBL_GENE_ENTREZ_GENE_MAP.txt',
                   header=True,
                   index=False,
                   sep='\t')
else:
    raise Exception('ERROR: Data sets are not the same size')

100%|██████████| 25785/25785 [00:04<00:00, 6228.58it/s]


In [46]:
data_cp.head(n=5)

Unnamed: 0,Ensembl_Gene_IDs,Entrez_IDs,Entrez_Names,Entrez_Descriptions,Entrez_Synonyms
0,ENSG00000166913,7529,YWHAB,tyrosine 3-monooxygenase/tryptophan 5-monooxyg...,GW128|HEL-S-1|HS1|KCIP-1|YWHAA
1,ENSG00000108953,7531,YWHAE,tyrosine 3-monooxygenase/tryptophan 5-monooxyg...,14-3-3E|HEL2|KCIP-1|MDCR|MDS
2,ENSG00000274474,7531,YWHAE,tyrosine 3-monooxygenase/tryptophan 5-monooxyg...,14-3-3E|HEL2|KCIP-1|MDCR|MDS
3,ENSG00000128245,7533,YWHAH,tyrosine 3-monooxygenase/tryptophan 5-monooxyg...,YWHA1
4,ENSG00000170027,7532,YWHAG,tyrosine 3-monooxygenase/tryptophan 5-monooxyg...,14-3-3GAMMA|EIEE56|PPP1R170



***

### Ensembl Gene - Uniprot Accession <a class="anchor" id="ensemblgene-uniprot"></a>

**Wiki Page:** [uniprot-knowledgebase](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#uniprot-knowledgebase)  

**Purpose:** This script assumes the [HUMAN_9606_idmapping_selected.tab.gz](ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/by_organism/HUMAN_9606_idmapping_selected.tab.gz) file from the [Uniprot Knolwedge Base](https://www.uniprot.org/) has been downloaded and saved to the `./resources/processed_data/unprocessed_data` directory.

In [47]:
data_processor(filepath=unprocessed_data_location + 'HUMAN_9606_idmapping_selected.tab',
               row_splitter='\t',
               column_list=[18, 0],
               output_name='ENSEMBL_GENE_UNIPROT_ACCESSION_MAP.txt',
               write_location = processed_data_location,
               line_splitter=';')

100%|██████████| 188349/188349 [00:01<00:00, 157397.02it/s]


**Preview Processed Data**

In [48]:
egua_data = pandas.read_csv(processed_data_location + 'ENSEMBL_GENE_UNIPROT_ACCESSION_MAP.txt',
                            header = None,
                            names=['Ensembl_Gene_IDs', 'Uniprot_Accession_IDs'],
                            delimiter = '\t')

print('There are {edge_count} ensembl gene-uniprot accession edges'.format(edge_count=len(data)))

There are 145246 ensembl gene-uniprot accession edges


In [49]:
egua_data.head(n=5)

Unnamed: 0,Ensembl_Gene_IDs,Uniprot_Accession_IDs
0,ENSG00000166913,P31946
1,ENSG00000108953,P62258
2,ENSG00000274474,P62258
3,ENSG00000128245,Q04917
4,ENSG00000170027,P61981



***

### Ensembl Protein - Uniprot Accession <a class="anchor" id="ensemblprotein-uniprot"></a>

**Wiki Page:** [uniprot-knowledgebase](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#uniprot-knowledgebase)  

**Purpose:** This script assumes the [HUMAN_9606_idmapping_selected.tab.gz](ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/by_organism/HUMAN_9606_idmapping_selected.tab.gz) file from the [Uniprot Knolwedge Base](https://www.uniprot.org/) has been downloaded and saved to the `./resources/processed_data/unprocessed_data` directory.

In [50]:
data_processor(filepath=unprocessed_data_location + 'HUMAN_9606_idmapping_selected.tab',
               row_splitter='\t',
               column_list=[20, 0],
               output_name='ENSEMBL_PROTEIN_UNIPROT_ACCESSION_MAP.txt',
               write_location = processed_data_location,
               line_splitter=';')

100%|██████████| 188349/188349 [00:01<00:00, 165541.35it/s]


**Preview Processed Data**

In [28]:
epua_data = pandas.read_csv(processed_data_location + 'ENSEMBL_PROTEIN_UNIPROT_ACCESSION_MAP.txt',
                            header = None,
                            names=['Ensembl_Protein_IDs', 'Uniprot_Accession_IDs'],
                            delimiter = '\t')

print('There are {edge_count} ensembl protein-uniprot accession edges'.format(edge_count=len(epua_data)))

There are 108627 ensembl protein-uniprot accession edges


In [53]:
epua_data.head(n=5)

Unnamed: 0,Ensembl_Protein_IDs,Uniprot_Accession_IDs
0,ENSP00000300161,P31946
1,ENSP00000361930,P31946
2,ENSP00000264335,P62258
3,ENSP00000461762,P62258
4,ENSP00000481059,P62258



***

### Uniprot Accession - Protein Ontology <a class="anchor" id="uniprot-pro"></a>

**Wiki Page:** [protein-ontology](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#protein-ontology)  

**Purpose:** This script downloads the [promapping.txt](https://proconsortium.org/download/current/promapping.txt) file from the [Pro Consortium](https://proconsortium.org/download/current/) and saves to the `./resources/processed_data/unprocessed_data` directory.


In [54]:
url = 'https://proconsortium.org/download/current/promapping.txt'
data_downloader(url, unprocessed_data_location)

Downloading data file


In [55]:
# reformat data and write it out
data = open(unprocessed_data_location + 'promapping.txt').readlines()

with open(processed_data_location + 'UNIPROT_ACCESSION_PRO_MAP.txt', 'w') as outfile:
    for line in tqdm(data):
        row = line.split('\t')
        if row[1].startswith('UniProtKB'):
            outfile.write(row[0].strip().replace(':', '_') + '\t' + row[1].strip().split(':')[-1] + '\n')

outfile.close()

100%|██████████| 426253/426253 [00:00<00:00, 489333.34it/s]


**Preview Processed Data**

In [27]:
uapr_data = pandas.read_csv(processed_data_location + 'UNIPROT_ACCESSION_PRO_MAP.txt',
                            header = None,
                            names=['PRO_IDs', 'Uniprot_Accession_IDs'],
                            delimiter = '\t')

print('There are {edge_count} uniprot accession-protein ontology edges'.format(edge_count=len(uapr_data)))

There are 314714 uniprot accession-protein ontology edges


In [58]:
uapr_data.head(n=5)

Unnamed: 0,PRO_IDs,Uniprot_Accession_IDs
0,PR_000000005,P37173
1,PR_000000005,P38438
2,PR_000000005,Q62312
3,PR_000000005,Q90999
4,PR_000000007,F1R709



***

### Ensembl Protein - Protein Ontology <a class="anchor" id="ensemblprotein-pro"></a>

**Wiki Page:** [protein-ontology](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#protein-ontology)  

**Purpose:** This script assumes that the `UNIPROT_ACCESSION_PRO_MAP.txt` and the `ENSEMBL_PROTEIN_UNIPROT_ACCESSION_MAP.txt` files were created and saved to the `./resources/processed_data/unprocessed_data` directory.


In [59]:
pro_kb = pandas.read_csv(processed_data_location + 'UNIPROT_ACCESSION_PRO_MAP.txt',
                         header = None,
                         delimiter = '\t')

# convert to dictionary
pro_dict = {}

for idx, row in tqdm(pro_kb.iterrows(), total=pro_kb.shape[0]):
    if row[1] in pro_dict.keys():
        pro_dict[row[1]].append(row[0]) 
    else:
        pro_dict[row[1]] = [row[0]]

100%|██████████| 314714/314714 [00:52<00:00, 6012.95it/s]


In [60]:
ens_uni = pandas.read_csv(processed_data_location + 'ENSEMBL_PROTEIN_UNIPROT_ACCESSION_MAP.txt',
                          header = None,
                          delimiter = '\t')

# write out data
with open(processed_data_location + 'ENSEMBL_PROTEIN_PRO_MAP.txt', 'w') as outfile:
    for idx, row in tqdm(ens_uni.iterrows(), total=ens_uni.shape[0]):        
        if row[1] in pro_dict.keys():
            for x in pro_dict[row[1]]: 
                outfile.write(row[0] + '\t' + x + '\n')

outfile.close()

100%|██████████| 108627/108627 [00:17<00:00, 6175.13it/s]


**Preview Processed Data**

In [26]:
eppr_data = pandas.read_csv(processed_data_location + 'ENSEMBL_PROTEIN_PRO_MAP.txt',
                            header = None,
                            names=['Ensembl_Protein_IDs', 'PRO_IDs'],
                            delimiter = '\t')

print('There are {edge_count} ensembl protein-protein ontology edges'.format(edge_count=len(eppr_data)))

There are 92755 ensembl protein-protein ontology edges


In [62]:
eppr_data.head(n=5)

Unnamed: 0,Ensembl_Protein_IDs,PRO_IDs
0,ENSP00000300161,PR_000002175
1,ENSP00000300161,PR_P31946
2,ENSP00000361930,PR_000002175
3,ENSP00000361930,PR_P31946
4,ENSP00000264335,PR_000003104



***

### HPA Tissue/Cells - UBERON + Cell Ontology <a class="anchor" id="hpa-uberon"></a>

**Wiki Page:** [human-protein-atlas](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#human-protein-atlas)  

**Purpose:** This script downloads the [rna_tissue_consensus.tsv](https://www.proteinatlas.org/download/rna_tissue_consensus.tsv.zip) and [normal_tissue.tsv](https://www.proteinatlas.org/download/normal_tissue.tsv.zip) files from the [Human Protein Atlas](https://www.proteinatlas.org) and saves them to the `./resources/processed_data/unprocessed_data` directory.


In [3]:
url_normal = 'https://www.proteinatlas.org/download/normal_tissue.tsv.zip'
data_downloader(url_normal, unprocessed_data_location)

url_abnormal = 'https://www.proteinatlas.org/download/rna_tissue_consensus.tsv.zip'
data_downloader(url_abnormal, unprocessed_data_location)

Downloading zipped data file
Downloading zipped data file


_Read in Data Files_

In [4]:
abnormal_tissue = []

for line in tqdm(open(unprocessed_data_location + 'rna_tissue_consensus.tsv').readlines()):
    abnormal_tissue.append(line.split('\t')[2].strip())

100%|██████████| 1193056/1193056 [00:01<00:00, 915078.26it/s]


In [5]:
normal_tissue = []

for line in tqdm(open(unprocessed_data_location + 'normal_tissue.tsv').readlines()):
    normal_tissue.append(line.split('\t')[2].strip() + ' - ' + line.split('\t')[3].strip())

100%|██████████| 1056062/1056062 [00:02<00:00, 517483.38it/s]


In [6]:
# combine normal and abnormal tissue and cell types into single list
combo = set(abnormal_tissue + normal_tissue)

# write results
with open(unprocessed_data_location + 'HPA_tissues.txt', 'w') as outfile:
    for x in tqdm(combo):
        outfile.write(x.strip() + '\n')

outfile.close()

100%|██████████| 208/208 [00:00<00:00, 276605.97it/s]


In [9]:
# read back in mapped tissue/cell data
hpa_mapping_data = pandas.read_excel(open(unprocessed_data_location + 'zooma_tissue_cell_mapping_04DEC2019.xlsx', 'rb'),
                                     sheet_name='zooma_tissue_cell_mapping_04DEC',
                                     header=0)

hpa_mapping_data.fillna('None', inplace=True)

# preview data
data.head(n=5)

Unnamed: 0,TISSUE,CELL TYPE,ONTOLOGY,ONTOLOGY ID,ONTOLOGY LABEL,MAPPING
0,adipose tissue,,UBERON,http://purl.obolibrary.org/obo/UBERON_0001013,adipose tissue,ZOOMA
1,adipose tissue,adipocytes,UBERON,http://purl.obolibrary.org/obo/UBERON_0001013,adipose tissue,ZOOMA
2,adipose tissue,adipocytes,CL,http://purl.obolibrary.org/obo/CL_0001070,fat cell,Manual
3,adrenal gland,,UBERON,http://purl.obolibrary.org/obo/UBERON_0002369,adrenal gland,ZOOMA
4,adrenal gland,cells in zona fasciculata,UBERON,http://purl.obolibrary.org/obo/UBERON_0002054,zona fasciculata of adrenal gland,ZOOMA


In [10]:
# reformat data and write it out
with open(processed_data_location + 'HPA_TISSUE_CELL_MAP.txt', 'w') as outfile:
    for idx, row in tqdm(data.iterrows(), total=data.shape[0]):
        if row['TISSUE'] != 'None':
            outfile.write(str(row['TISSUE']).strip() + '\t' + str(row['ONTOLOGY ID']).strip() + '\n')

        if row['CELL TYPE'] != 'None':
            outfile.write(str(row['CELL TYPE']).strip() + '\t' + str(row['ONTOLOGY ID']).strip() + '\n')

outfile.close()

100%|██████████| 340/340 [00:00<00:00, 4113.13it/s]


**Preview Processed Data**

In [25]:
hpa_data = pandas.read_csv(processed_data_location + 'HPA_TISSUE_CELL_MAP.txt',
                           header = None,
                           names=['Tissue/Cell', 'UBERON_CL_IDs'],
                           delimiter = '\t')

print('There are {edge_count} edges'.format(edge_count=len(hpa_data)))

There are 622 edges


In [13]:
hpa_data.head(n=5)

Unnamed: 0,Tissue/Cell,UBERON_CL_IDs
0,adipose tissue,http://purl.obolibrary.org/obo/UBERON_0001013
1,adipose tissue,http://purl.obolibrary.org/obo/UBERON_0001013
2,adipocytes,http://purl.obolibrary.org/obo/UBERON_0001013
3,adipose tissue,http://purl.obolibrary.org/obo/CL_0001070
4,adipocytes,http://purl.obolibrary.org/obo/CL_0001070



***

### Disease and Phenotype Identifiers <a class="anchor" id="disease-identifiers"></a>

**Wiki Page:** [disgenet](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#disgenet)  

**Purpose:** This script downloads the [disease_mappings.tsv](https://www.disgenet.org/static/disgenet_ap1/files/downloads/disease_mappings.tsv.gz) file from [DisGeNET](https://www.disgenet.org) and saves it to the `./resources/processed_data/unprocessed_data` directory.


In [14]:
url = 'https://www.disgenet.org/static/disgenet_ap1/files/downloads/disease_mappings.tsv.gz'
data_downloader(url, unprocessed_data_location)

Downloading gzipped data file


In [15]:
data = pandas.read_csv(unprocessed_data_location + 'disease_mappings.tsv',
                       header = 0,
                       delimiter = '|')

In [16]:
data.head(n=5)

Unnamed: 0,diseaseId,name,vocabulary,code,vocabularyName
0,C0018923,Hemangiosarcoma,DO,1816,angiosarcoma
1,C0854893,Angiosarcoma non-metastatic,DO,1816,angiosarcoma
2,C0033999,Pterygium,DO,2116,pterygium
3,C4520843,Pterygium of eye,DO,2116,pterygium
4,C0024814,Marinesco-Sjogren syndrome,DO,14667,disease of metabolism


In [17]:
# convert to dictionary
disease_dict = {}

for idx, row in tqdm(data.iterrows(), total=data.shape[0]):
    
    if row['diseaseId'] in disease_dict.keys():
        if row['vocabulary'] == 'DO':
            disease_dict[row['diseaseId']].append('DOID_' + row['code'] ) 
        
        if row['vocabulary'] == 'HPO':
            disease_dict[row['diseaseId']].append(row['code'].replace('HP:', 'HP_')) 
    
    else:
        if row['vocabulary'] == 'DO':
            disease_dict[row['diseaseId']] = ['DOID_' + row['code']] 
        
        if row['vocabulary'] == 'HPO':
            disease_dict[row['diseaseId']] = [row['code'].replace('HP:', 'HP_')] 

100%|██████████| 97502/97502 [00:16<00:00, 5894.33it/s]


In [68]:
# reformat data and write it out
with open(processed_data_location + 'DISEASE_DOID_MAP.txt', 'w') as outfile1,open('../../resources/processed_data/PHENOTYPE_HPO_MAP.txt', 'w') as outfile2:
    
    for key, value in tqdm(disease_dict.items()):
        for i in value:
            # get diseases
            if i.startswith('DOID_'): 
                outfile1.write(key + '\t' + i + '\n')

            # get phenotypes
            if i.startswith('HP_'): 
                outfile2.write(key + '\t' + i + '\n')

outfile1.close()
outfile2.close()

100%|██████████| 15421/15421 [00:00<00:00, 223784.25it/s]


**Preview Processed Data**

_Preview Disease (DOID) Mappings_

In [19]:
dis_data = pandas.read_csv(processed_data_location + 'DISEASE_DOID_MAP.txt',
                           header = None,
                           names=['Disease_IDs', 'DOID_IDs'],
                           delimiter = '\t')

print('There are {} disease-DOID edges'.format(len(dis_data)))

There are 97502 disease-DOID edges


In [24]:
dis_data.head(n=5)

Unnamed: 0,Disease_IDs,DOID_IDs
0,C0018923,DOID_0001816
1,C0854893,DOID_0001816
2,C0033999,DOID_0002116
3,C4520843,DOID_0002116
4,C0024814,DOID_0014667


_Preview Phenotype (HP) Mappings_

In [21]:
hp_data = pandas.read_csv(processed_data_location + 'PHENOTYPE_HPO_MAP.txt',
                          header = None,
                          names=['Disease_IDs', 'HP_IDs'],
                          delimiter = '\t')

print('There are {} phenotype-HPO edges'.format(len(hp_data)))

There are 8289 phenotype-HPO edges


In [23]:
hp_data.head(n=5)

Unnamed: 0,Disease_IDs,HP_IDs
0,C0018923,HP_0200058
1,C0033999,HP_0001059
2,C4520843,HP_0001059
3,C0037199,HP_0000246
4,C0008780,HP_0012265


***
***

### PROCESS EDGE DATA: ONTOLOGIES  
- [Protein Ontology](#protein-ontology)  
- [Relation Ontology](#relation-ontology)  

### Protein Ontology <a class="anchor" id="protein-ontology"></a>

**Wiki Page:** [protein-ontology](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#human-phenotype-ontology)  

**Purpose:** This script downloads the [pr.owl](http://purl.obolibrary.org/obo/pr.owl) file from [ProConsortium.org](https://proconsortium.org/) and saves it to the `./resources/processed_data/unprocessed_data` directory. The file is then read back in and filtered to contain only human proteins by performing forward and reverse breadth first search over all proteins which are `owl:subClassOf` [Homo sapiens protein](https://proconsortium.org/app/entry/PR%3A000029067/).


In [120]:
url = 'http://purl.obolibrary.org/obo/pr.owl'
data_downloader(url, unprocessed_data_location)

Downloading data file


In [121]:
# read in ontology as graph (the ontology is large so this takes ~60 minutes) - 11,757,623 edges on 12/18/2019
graph = Graph()
graph.parse(unprocessed_data_location + 'pr.owl')

print('There are {} edges in the ontology'.format(len(graph)))

There are 11757623 edges in the ontology


**Convert Ontology to Directed MulitGraph:**  
In order to create a verison of the ontology which includes all relevant human edges, we need to first convert the KG to a [directed multigraph](https://networkx.github.io/documentation/stable/reference/classes/multidigraph.html).

In [122]:
# convert RDF graph to multidigraph (the ontology is large so this takes ~45 minutes)
networkx_mdg = rdflib_to_networkx_multidigraph(graph)

**Identify Human Proteins:**   
A list of human proteins is obtained by querying the ontology to return all ontology classes `only_in_taxon some Homo sapiens`. To expedite the query time, the following SPARQL query is run from the [ProConsortium](https://proconsortium.org/pro_sparql.shtml) SPARQL endpoint: 

```SPARQL
PREFIX obo: <http://purl.obolibrary.org/obo/>

SELECT ?PRO_term
FROM <http://purl.obolibrary.org/obo/pr>
WHERE {
       ?PRO_term rdf:type owl:Class .
       ?PRO_term rdfs:subClassOf ?restriction .
       ?restriction owl:onProperty obo:RO_0002160 .
       ?restriction owl:someValuesFrom obo:NCBITaxon_9606 .

       # use this to filter-out things like hgnc ids
       FILTER (regex(?PRO_term,"http://purl.obolibrary.org/obo/*")) .
}

```


In [123]:
# download data - pro classes only_in_taxon some Homo sapiens (61,064 classes on 12/18/2019)
url = 'http://sparql.proconsortium.org/virtuoso/sparql?query=PREFIX+obo%3A+%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2F%3E%0D%0ASELECT+%3FPRO_term%0D%0AFROM+%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2Fpr%3E%0D%0AWHERE%0D%0A%7B%0D%0A+++%3FPRO_term+rdf%3Atype+owl%3AClass+.%0D%0A+++%3FPRO_term+rdfs%3AsubClassOf+%3Frestriction+.%0D%0A+++%3Frestriction+owl%3AonProperty+obo%3ARO_0002160+.%0D%0A+++%3Frestriction+owl%3AsomeValuesFrom+obo%3ANCBITaxon_9606+.%0D%0A%0D%0A+++FILTER+%28regex%28%3FPRO_term%2C%22http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2F*%22%29%29+.%0D%0A%0D%0A%7D%0D%0A&format=text%2Fhtml&debug='
html = requests.get(url, allow_redirects=True).content

# extract data from html table
df_list = pandas.read_html(html)
human_pro_classes = list(df_list[-1]['PRO_term'])

print('There are {protein_count} human classes in the PRO ontology'.format(protein_count=len(human_pro_classes)))

There are 61064 human classes in the PRO ontology


**Construct Human PRO:**   
Now that we have all of the paths from the original graph that are relevant to humans, we can construct a human-only version of the PRotein ontology.

In [124]:
# create a new graph using bfs paths
human_pro_graph = Graph()
human_networkx_mdg = networkx.MultiDiGraph()

for node in tqdm(human_pro_classes):
    forward = list(networkx.edge_bfs(networkx_mdg, URIRef(node), orientation='original'))
    reverse = list(networkx.edge_bfs(networkx_mdg, URIRef(node), orientation='reverse'))
    
    # add edges from forward and reverse bfs paths
    for path in forward + reverse:
        human_pro_graph.add((path[0], path[2], path[1]))
        human_networkx_mdg.add_edge(path[0], path[1], path[2])

100%|██████████| 61064/61064 [36:49<00:00, 27.63it/s]    


In [125]:
# verify that the constructed ontology only has 1 component
networkx.number_connected_components(human_networkx_mdg.to_undirected())

1

In [126]:
# save filtered ontology
human_pro_graph.serialize(destination=unprocessed_data_location + 'human_pro.owl', format='xml')

**Classify Ontology:**  
To ensure that we have correclty built the new ontology, we run the hermit reasoner over it to ensure that there are no incomplete triples or inconsistent classes. In order to do this, we will call the reasoner using [OWLTools](https://github.com/owlcollab/owltools), which this script assumes has already been downloaded to the `../resources/lib` directory. The following arguments are then called to run the reasoner (from the command line):  

```bash
./resources/lib/owltools ./resources/unprocessed_data/human_pro_filtered.owl --reasoner hermit --run-reasoner --assert-implied -o ./resources/processed_data/human_pro_closed.owl
```

_**Note.** This step takes around 30-45 minutes to run. When run from the command line the reasoner determined that the ontology was consistent and 174 new axioms were inferrred (12/18/2019)._

In [None]:
# run reasoner -- RUN FROM COMMAND LINE NOT HERE
# subprocess.run(['../../resources/lib/owltools',
#                 '../../resources/unprocessed_data/human_pro_filtered.owl',
#                 '--reasoner hermit',
#                 '--run-reasoner',
#                 '--assert-implied',
#                 '--list-unsatisfiable',
#                 '-o ./resources/processed_data/human_pro_closed.owl'])

**Examine Cleaned Human PRO:**  
Once we have cleaned the ontology we can get counts of components, nodes, edges, and then write the cleaned graph to the `../../resources/processed_data` repository.

In [128]:
# get count of connected components
pro_human_graph = Graph()
pro_human_graph.parse(processed_data_location + 'human_pro_closed.owl')

# get node and edge count
edge_count = len(human_pro_graph)
node_count = len(set([str(node) for edge in list(human_pro_graph) for node in edge[0::2]]))

print('\n The classified, filtered Human version of PRO contains {node} nodes and {edge} edges\n'.format(node=node_count, edge=edge_count))


 The classified, filtered Human version of PRO contains 1212556 nodes and 2135174 edges




***
***

### Relation Ontology <a class="anchor" id="relation-ontology"></a>

**Wiki Page:** [RO](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#relation-ontology)  

**Purpose:** This script downloads the [ro.owl](http://purl.obolibrary.org/obo/ro.owl) file from [obofoundry.org](http://www.obofoundry.org/) and saves it to the `./resources/processed_data/unprocessed_data` directory. The file is then read back in and queried to obtain all `ObjectProperties` and their inverse relations.

In [129]:
url = 'http://purl.obolibrary.org/obo/ro.owl'
data_downloader(url)

Downloading data file


In [130]:
ro_graph = Graph()
ro_graph.parse(unprocessed_data_location + 'ro.owl')

print('There are {} edges in the ontology'.format(len(ro_graph))) #5,669 edges on 12/15/2019

There are 5669 edges in the ontology


**Identify Relations and Inverse Relations:**  
Identify all relations and their inverse relations using the `owl:inverseOf` property. To make it easier to look up the inverse relations, each pair is listed twice, for example:  
- [location of](http://www.ontobee.org/ontology/RO?iri=http://purl.obolibrary.org/obo/RO_0001015) `owl:inverseOf` [located in](http://www.ontobee.org/ontology/RO?iri=http://purl.obolibrary.org/obo/RO_0001025)  
- [located in](http://www.ontobee.org/ontology/RO?iri=http://purl.obolibrary.org/obo/RO_0001025) `owl:inverseOf` [location of](http://www.ontobee.org/ontology/RO?iri=http://purl.obolibrary.org/obo/RO_0001015)

In [131]:
with open(processed_data_location + 'RO_INVERSE_RELATIONS.txt', 'w') as outfile:

    # find inverse relations
    for s, p, o in tqdm(ro_graph):
        if 'owl#inverseOf' in str(p):
            if 'RO' in str(s) and 'RO' in str(o):
                outfile.write(str(s) + '\t' + str(p) + '\t' + str(o) + '\n')
                outfile.write(str(o) + '\t' + str(p) + '\t' + str(s) + '\n')

outfile.close()

**Preview Processed Data**

In [33]:
ro_data = pandas.read_csv(processed_data_location + 'RO_INVERSE_RELATIONS.txt',
                          header = None,
                          names=['RO_Edge', 'RO_ObjectProperty', 'RO_Inverse_Edge'],
                          delimiter = '\t')

print('There are {edge_count} RO Relations and Inverse Relations'.format(edge_count=len(ro_data)))

There are 172 RO Relations and Inverse Relations


In [34]:
ro_data.head(n=5)

Unnamed: 0,RO_Edge,RO_ObjectProperty,RO_Inverse_Edge
0,http://purl.obolibrary.org/obo/RO_0003000,http://www.w3.org/2002/07/owl#inverseOf,http://purl.obolibrary.org/obo/RO_0003001
1,http://purl.obolibrary.org/obo/RO_0003001,http://www.w3.org/2002/07/owl#inverseOf,http://purl.obolibrary.org/obo/RO_0003000
2,http://purl.obolibrary.org/obo/RO_0002233,http://www.w3.org/2002/07/owl#inverseOf,http://purl.obolibrary.org/obo/RO_0002352
3,http://purl.obolibrary.org/obo/RO_0002352,http://www.w3.org/2002/07/owl#inverseOf,http://purl.obolibrary.org/obo/RO_0002233
4,http://purl.obolibrary.org/obo/RO_0002234,http://www.w3.org/2002/07/owl#inverseOf,http://purl.obolibrary.org/obo/RO_0002353


***
***

### PROCESS EDGE DATA: LINKED DATA


### Reactome: Protein-Complex Data <a class="anchor" id="reactome-protein-complex"></a>

**Wiki Page:** [Reactome](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#reactome-pathway-database)  

**Purpose:** This script downloads the [ComplexParticipantsPubMedIdentifiers_human.txt](https://reactome.org/download/current/ComplexParticipantsPubMedIdentifiers_human.txt) file from [Reactome](https://reactome.org) and saves it to the `./resources/processed_data/unprocessed_data` directory.


In [35]:
url = 'https://reactome.org/download/current/ComplexParticipantsPubMedIdentifiers_human.txt'
data_downloader(url, unprocessed_data_location)

Downloading data file


In [36]:
# process data
data = open(unprocessed_data_location + 'ComplexParticipantsPubMedIdentifiers_human.txt').readlines()

# reformat data and write it out
with open(processed_data_location + 'REACTOME_PROTEIN_COMPLEX.txt', 'w') as outfile:
    for line in tqdm(data):
        row = line.split('\t')
        
        # find all proteins in a complex
        for x in row[2].split('|'):
            if x.startswith('uniprot:'):            
                outfile.write(row[0].strip() + '\t' + x.split(':')[-1].strip() + '\t' + row[1].strip() + '\n')

outfile.close()

100%|██████████| 12430/12430 [00:00<00:00, 51810.45it/s]


**Preview Processed Data**

In [37]:
pc_data = pandas.read_csv(processed_data_location + 'REACTOME_PROTEIN_COMPLEX.txt',
                       header = None,
                       names=['Reactome_Complex', 'Uniprot_Protein', 'Reactome_Label'],
                       delimiter = '\t')

print('There are {edge_count} protein-complex edges'.format(edge_count=len(pc_data)))

There are 91985 protein-complex edges


In [38]:
pc_data.head(n=5)

Unnamed: 0,Reactome_Complex,Uniprot_Protein,Reactome_Label
0,R-HSA-1006173,P08603,CFH:Host cell surface [plasma membrane]
1,R-HSA-1008206,Q16621,NF-E2:Promoter region of beta-globin [nucleopl...
2,R-HSA-1008206,Q9ULX9,NF-E2:Promoter region of beta-globin [nucleopl...
3,R-HSA-1008206,O15525,NF-E2:Promoter region of beta-globin [nucleopl...
4,R-HSA-1008206,O60675,NF-E2:Promoter region of beta-globin [nucleopl...



***
***

### Reactome: Complex-Complex Data <a class="anchor" id="reactome-complex-complex"></a>

**Wiki Page:** [Reactome](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#reactome-pathway-database)  

**Purpose:** This script downloads the [ComplexParticipantsPubMedIdentifiers_human.txt](https://reactome.org/download/current/ComplexParticipantsPubMedIdentifiers_human.txt) file from [Reactome](https://reactome.orgt) and saves it to the `./resources/processed_data/unprocessed_data` directory.


In [69]:
# create label dictionary
labels = pandas.read_csv(unprocessed_data_location + 'ComplexParticipantsPubMedIdentifiers_human.txt',
                         header = 0,
                         delimiter = '\t')

# convert to dictionary
label_dict = {row[0]:row[1] for idx, row in labels.iterrows()}

In [70]:
# process data
data = open(unprocessed_data_location + 'ComplexParticipantsPubMedIdentifiers_human.txt').readlines()

# reformat data and write it out
with open(processed_data_location + 'REACTOME_COMPLEX_COMPLEX.txt', 'w') as outfile:
    for line in tqdm(data[1:]):
        row = line.split('\t')
        
        # find all proteins in a complex
        for x in row[3].split('|'):
            if x.startswith('R-HSA-') and x.strip() in label_dict.keys():            
                outfile.write(row[0].strip() + '\t' + x.strip() + '\t' + label_dict[x.strip()] + '\n')

outfile.close()

100%|██████████| 12429/12429 [00:00<00:00, 247140.16it/s]


**Preview Processed Data**

In [71]:
cc_data = pandas.read_csv(processed_data_location + 'REACTOME_COMPLEX_COMPLEX.txt',
                          header = None,
                          names=['Reactome_Complex_u', 'Reactome_Complex_v', 'Reactome Label'],
                          delimiter = '\t')

print('There are {edge_count} complex-complex edges'.format(edge_count=len(cc_data)))

There are 13722 complex-complex edges


In [72]:
cc_data.head(n=5)

Unnamed: 0,Reactome_Complex_u,Reactome_Complex_v,Reactome Label
0,R-HSA-1008206,R-HSA-1008229,NF-E2 [nucleoplasm]
1,R-HSA-1013011,R-HSA-1013017,GABA B receptor G-protein beta-gamma complex [...
2,R-HSA-1013011,R-HSA-1013019,G-protein beta-gamma subunits [plasma membrane]
3,R-HSA-1013011,R-HSA-420698,GABAB receptor:GABA [plasma membrane]
4,R-HSA-1013011,R-HSA-420748,GABAB receptor [plasma membrane]



***
***

### Reactome: Chemical-Complex Data <a class="anchor" id="reactome-chemical-complex"></a>

**Wiki Page:** [Reactome](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#reactome-pathway-database)  

**Purpose:** This script downloads the [ComplexParticipantsPubMedIdentifiers_human.txt](https://reactome.org/download/current/ComplexParticipantsPubMedIdentifiers_human.txt) file from [Reactome](https://reactome.orgt) and saves it to the `./resources/processed_data/unprocessed_data` directory.


In [73]:
# process data
data = open(unprocessed_data_location + 'ComplexParticipantsPubMedIdentifiers_human.txt').readlines()

# reformat data and write it out
with open(processed_data_location + 'REACTOME_CHEMICAL_COMPLEX.txt', 'w') as outfile:
    for line in tqdm(data[1:]):
        row = line.split('\t')
        
        # find all proteins in a complex
        for x in row[2].split('|'):
            if x.startswith('chebi:'):            
                outfile.write(row[0].strip() + '\t' + x.replace('chebi:', 'CHEBI_') + '\t' + row[0].strip() + '\n')

outfile.close()

100%|██████████| 12429/12429 [00:00<00:00, 106105.70it/s]


In [74]:
cc1_data = pandas.read_csv(processed_data_location + 'REACTOME_CHEMICAL_COMPLEX.txt',
                           header = None,
                           names=['Reactome_IDs', 'CHEBI_IDs', 'Reactome_Label'],
                           delimiter = '\t')

print('There are {edge_count} chemical-complex edges'.format(edge_count=len(cc1_data)))

There are 5608 chemical-complex edges


In [75]:
cc1_data.head(n=5)

Unnamed: 0,Reactome_IDs,CHEBI_IDs,Reactome_Label
0,R-HSA-1006173,CHEBI_24505,R-HSA-1006173
1,R-HSA-1006173,CHEBI_28879,R-HSA-1006173
2,R-HSA-1013011,CHEBI_59888,R-HSA-1013011
3,R-HSA-1013017,CHEBI_59888,R-HSA-1013017
4,R-HSA-109266,CHEBI_29105,R-HSA-109266



***

### Uniprot:  Protein-Cofactor and Protein-Catalyst <a class="anchor" id="uniprot-protein-cofactorcatalyst"></a>

**Wiki Page:** [Uniprot](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#uniprot-knowledgebase)  

**Purpose:** This script downloads the [uniprot-cofactor-catalyst.tab](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#uniprot-knowledgebase) file from the [Uniprot Knowledge Base](https://www.uniprot.org) and saves it to the `./resources/processed_data/unprocessed_data` directory.


In [48]:
url = 'https://www.uniprot.org/uniprot/?query=&fil=organism%3A%22Homo%20sapiens%20(Human)%20%5B9606%5D%22&columns=id%2Centry%20name%2Creviewed%2Cdatabase(PRO)%2Cchebi(Cofactor)%2Cchebi(Catalytic%20activity)&format=tab'
data_downloader(url, unprocessed_data_location, 'uniprot-cofactor-catalyst.tab')

Downloading data file


In [49]:
data = open(unprocessed_data_location + 'uniprot-cofactor-catalyst.tab').readlines()

# reformat data and write it out
with open(processed_data_location + 'UNIPROT_PROTEIN_COFACTOR.txt', 'w') as outfile1, open(processed_data_location + 'UNIPROT_PROTEIN_CATALYST.txt', 'w') as outfile2:
    for line in tqdm(data):

        # get cofactors
        if 'CHEBI' in line.split('\t')[4]: 
            for i in line.split('\t')[4].split(';'):
                chebi = i.split('[')[-1].replace(']', '').replace(':', '_')
                outfile1.write('PR_' + line.split('\t')[3].strip(';') + '\t' + chebi + '\n')
        
        # get catalysts
        if 'CHEBI' in line.split('\t')[5]:       
            for i in line.split('\t')[5].split(';'):
                chebi = i.split('[')[-1].replace(']', '').replace(':', '_')
                outfile2.write('PR_' + line.split('\t')[3].strip(';') + '\t' + chebi + '\n')

outfile1.close()
outfile2.close()

100%|██████████| 188350/188350 [00:00<00:00, 405727.43it/s]


**Preview Processed Data**

_Preview Cofactor Data_

In [50]:
pcp1_data = pandas.read_csv(processed_data_location + 'UNIPROT_PROTEIN_COFACTOR.txt',
                            header = None,
                            names=['Protein_Ontology_IDs', 'CHEBI_IDs'],
                            delimiter = '\t')

print('There are {edge_count} protein-cofactor edges'.format(edge_count=len(pcp1_data)))

There are 5577 protein-cofactor edges


In [51]:
pcp1_data.head(n=5)

Unnamed: 0,Protein_Ontology_IDs,CHEBI_IDs
0,PR_Q9BRS2,CHEBI_18420
1,PR_Q05823,CHEBI_18420
2,PR_Q05823,CHEBI_29035
3,PR_Q13472,CHEBI_18420
4,PR_Q9BXA7,CHEBI_18420


_Preview Catalyst Data_

In [52]:
pcp2_data = pandas.read_csv(processed_data_location + 'UNIPROT_PROTEIN_CATALYST.txt',
                            header = None,
                            names=['Protein_Ontology_IDs', 'CHEBI_IDs'],
                            delimiter = '\t')

print('There are {edge_count} protein-catalyst edges'.format(edge_count=len(pcp2_data)))

There are 59863 protein-catalyst edges


In [53]:
pcp2_data.head(n=5)

Unnamed: 0,Protein_Ontology_IDs,CHEBI_IDs
0,PR_Q9NP80,CHEBI_15377
1,PR_Q9NP80,CHEBI_15378
2,PR_Q9NP80,CHEBI_28868
3,PR_Q9NP80,CHEBI_16870
4,PR_Q9NP80,CHEBI_58168



***
***

### NCBI Gene:  Protein-Coding Gene-Protein <a class="anchor" id="ncbi-protein-coding-genes"></a>

**Wiki Page:** [Uniprot](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#uniprot-knowledgebase) 

**Purpose:** This script downloads the [Swiss-Prot Reviewed Uniprot Human Proteome](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/_edit#uniprot-knowledgebase) file from the [Uniprot](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#uniprot-knowledgebase)  and saves it to the `./resources/processed_data/unprocessed_data` directory.


In [54]:
url = 'https://www.uniprot.org/uniprot/?query=reviewed%3Ayes%20AND%20proteome%3Aup000005640&columns=id%2Centry%20name%2Creviewed%2Cdatabase(GeneID)%2Cdatabase(PRO)%2Cgenes(PREFERRED)%2Cgenes(ALTERNATIVE)&format=tab'
data_downloader(url, unprocessed_data_location, 'uniprot-human-proteome.tab')

Downloading data file


In [55]:
data = open(unprocessed_data_location + 'uniprot-human-proteome.tab').readlines()

# reformat data and write it out
with open(processed_data_location + 'UNIPROT_HUMAN_PROTEOME.txt', 'w') as outfile:
    for line in data[1:]:
        if line.split('\t')[4].strip() != '' and line.split('\t')[3].strip() != '':
            pro = 'PR_' + line.split('\t')[4].strip().strip(';')
            gene = line.split('\t')[3].strip(';')
            gene_name = line.split('\t')[5].strip()
            gene_syn = line.split('\t')[6].strip().replace(',', '|') if line.split('\t')[6].strip() != '' else ''
            
            outfile.write(pro + '\t' + gene + '\t' + gene_name + '\t' + gene_syn + '\n')

outfile.close()

**Preview Processed Data**

In [56]:
hpe_data = pandas.read_csv(processed_data_location + 'UNIPROT_HUMAN_PROTEOME.txt',
                           header = None,
                           names=['Protein_Ontology_IDs', 'Entrez_Gene_IDs', 'Gene_Name', 'Gene_Synonyms'],
                           delimiter = '\t')

print('There are {edge_count} protein-coding gene edges'.format(edge_count=len(hpe_data)))

There are 18752 protein-coding gene edges


In [57]:
hpe_data.head(n=5)

Unnamed: 0,Protein_Ontology_IDs,Entrez_Gene_IDs,Gene_Name,Gene_Synonyms
0,PR_Q9Y263,9373,PLAA,PLAP
1,PR_Q96RE7,112939,NACC1,BTBD14B NAC1
2,PR_O43312,9788,MTSS1,KIAA0429 MIM
3,PR_Q9NP80,50640,PNPLA8,IPLA22 IPLA2G
4,PR_Q15319,5459,POU4F3,BRN3C


<br>
***
***

```
@misc{callahan_tj_2019_3401437,
  author       = {Callahan, TJ},
  title        = {PheKnowLator},
  month        = mar,
  year         = 2019,
  doi          = {10.5281/zenodo.3401437},
  url          = {https://doi.org/10.5281/zenodo.3401437}
}
```