
# PheKnowLator - Data Preparation


***
***

**Author:** [TJCallahan](https://mail.google.com/mail/u/0/?view=cm&fs=1&tf=1&to=callahantiff@gmail.com)  
**GitHub Repository:** [PheKnowLator](https://github.com/callahantiff/PheKnowLator/wiki)  

**Purpose:** This notebook serves as a script to preprocess data and/or generate mapping and filtering data for the PheKnowLator project. The script creates each of the mapping and/or filtering data sources described on the [Data Sources](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources) Wiki page. 

**Assumptions:** The script assumes that there are files, which need further processing and are located in the `./resources/processed_data/unprocessed_data/` directory.


***

## Table of Contents
### Create Mapping Data    
* [MESH-ChEBI](#mesh-chebi)  
* [Ensembl Gene - Ensembl Transcript](#ensemblgene-ensembltranscript)  
* [Ensembl Gene - Entrez Gene](#ensemblgene-entrezgene)  
* [Ensembl Gene - Uniprot Accession](#ensemblgene-uniprot)  
* [Ensembl Protein - Uniprot Accession](#ensemblprotein-uniprot)  
* [Uniprot Accession - Protein Ontology](#uniprot-pro)  
* [Ensembl Protein - Protein Ontology](#ensemblprotein-pro) 
* [HPA Tissue/Cells - UBERON + Cell Ontology](#hpa-uberon) 
* [Disease Identifiers](#disease-identifiers) 
* [Phenotype Identifiers](#phenotype-identifiers)  

### Process Edge Data
**Ontologies**
* [Protein Ontology](#protein-ontology)  

**Linked Data**
* [Reactome: Protein-Complex Data](#reactome-protein-complex)  
* [Reactome: Complex-Complex Data](#reactome-complex-complex)  
* [Reactome: Chemical-Complex Data](#reactome-chemical-complex)  
* [Uniprot: Protein-Cofactor and Protein-Catalyst](#uniprot-protein-cofactorcatalyst)  
* [Uniprot: Protein-Coding Genes](#uniprot-protein-coding-genes)    

***
***

#### Set-Up Enviornment


In [38]:
# import needed libraries
import ftplib
import glob
import gzip
import os
import pandas
import re
import requests
import shutil

import networkx as nx
import urllib.request as request

from contextlib import closing
from io import BytesIO
from rdflib import Graph, Namespace, URIRef, extras
from rdflib.extras.external_graph_libs import *
from rdflib.namespace import RDF
from tqdm import tqdm
from zipfile import ZipFile

**Define Global Variables**

In [3]:
# directory to read unprocessed data files from
unprocessed_data_location = '../../resources/processed_data/unprocessed_data/'

# directory to write processed data files to
processed_data_location = '../../resources/processed_data/'

**Create Helper Functions**

_Download Data Files_

In [4]:
def url_download(url: str, filename: str):
    """Downloads a file from a URL.
    
    Args:
        url: A string that points to the location of a temp mapping file that needs to be processed.
        filename: A string containing a filepath for where to write data to.

    Return:
        None.
    """
    print('Downloading data file')
        
    r = requests.get(url, allow_redirects=True)

    # save results
    open(unprocessed_data_location + '{filename}'.format(filename=filename), 'wb').write(r.content)

In [5]:
def ftp_url_download(url: str, filename: str):
    """Downloads a file from an ftp server.
    
    Args:
        url: A string that points to the location of a temp mapping file that needs to be processed.
        filename: A string containing a filepath for where to write data to.

    Return:
        None.
    """
    print('Downloading data from ftp server')

    with closing(request.urlopen(url)) as r:
        with open(unprocessed_data_location + '{filename}'.format(filename=filename), 'wb') as f:
            shutil.copyfileobj(r, f)

In [6]:
def gzipped_ftp_url_download(url: str):
    """Downloads a gzipped file from an ftp server.
    
    Args:
        url: A string that points to the location of a temp mapping file that needs to be processed.

    Return:
        None.
    """
    
    # get ftp server info
    server = url.replace('ftp://', '').split('/')[0]
    directory = '/'.join(url.replace('ftp://', '').split('/')[1:-1])
    file = url.replace('ftp://', '').split('/')[-1]
    write_loc = unprocessed_data_location + '{filename}'.format(filename=file)

    # download ftp gzipped file
    print('Downloading gzipped data from ftp server')
    with closing(ftplib.FTP(server)) as ftp, open(write_loc, 'wb') as fid:
        ftp.login()
        ftp.cwd(directory)
        ftp.retrbinary('RETR {}'.format(file), fid.write)

    # read in gzipped file,uncompress, and write to directory
    print('Uncompressing and writing gzipped data')
    with gzip.open(write_loc, 'rb') as fid_in:
        with open(write_loc.replace('.gz', ''), 'wb') as f:
            f.write(fid_in.read())

    # remove gzipped file
    os.remove(write_loc)

In [7]:
def zipped_url_download(url: str, filename: str):
    """Downloads a zipped file from a URL.
    
    Args:
        url: A string that points to the location of a temp mapping file that needs to be processed.
        filename: A string containing a filepath for where to write data to.

    Return:
        None.
    """
    print('Downloading zipped data file')
    
    with urlopen(zipurl) as zipresp:
        with ZipFile(BytesIO(zipresp.read())) as zfile:
            zfile.extractall(unprocessed_data_location[:-1])

In [8]:
def gzipped_url_download(url: str, filename: str):
    """Downloads a gzipped file from a URL.
    
    Args:
        url: A string that points to the location of a temp mapping file that needs to be processed.
        filename: A string containing a filepath for where to write data to.

    Return:
        None.
    """
    print('Downloading gzipped data file')
    
    with open(unprocessed_data_location + '{filename}'.format(filename=filename), 'wb') as outfile:
        outfile.write(gzip.decompress(request.urlopen(url).read()))

In [9]:
# function to download data from a URL
def data_downloader(url: str, filename: str = ''):
    """Downloads data from a URL and saves the file to the `/resources/processed_data/unprocessed_data' directory.

    Args:
        url: A string that points to the location of a temp mapping file that needs to be processed.
        filename: A string containing a filepath for where to write data to.

    Return:
        None.
    """

    # get filename from url
    file =  filename if filename != '' else re.sub('.gz|.zip', '', url.split('/')[-1])
    
    # zipped data
    if '.zip' in url:
        zipped_url_download(url, file)
        
    elif '.gz' in url:
        if 'ftp' in url:
            gzipped_ftp_url_download(url)
        else:
            gzipped_url_download(url, file)
        
    # not zipped data
    else:
        # download and write data
        if 'ftp' in url:
            ftp_url_download(url, file)
        else:
            # download data from URL
            url_download(url, file)

_Reformat Data Files_

In [10]:
# function to format data files
def data_processor(filepath: str, row_splitter: str, column_list: list, output_name: str, line_splitter: str = ''):
    """Reads in a file using input file path and reduces the file to only include specific columns specified by the
    input var. The reduced file is then saved as a text file and written to the `/resources/processed_data' directory.

    Args:
        filepath: A string that points to the location of a temp mapping file that needs to be processed.
        row_splitter: A string that contains a character used to split rows.
        column_list: A list that contains two numbers, which correspond to indices in the input data file and
                            which appear in the order of write preference.
        output_name: A string naming the processed data file.
        line_splitter: A character used to separate multiple data points from a string. Defaults to an empty
                       string which is used to indicate the string contains a single value.

    Return:
        None.
    """

    # read in data
    data = open(filepath).readlines()

    # process and write out data
    with open(unprocessed_data_location + '{filename}'.format(filename=output_name), 'w') as outfile:

        for line in data:
            subj = line.split(row_splitter)[column_list[0]]
            obj = line.split(row_splitter)[column_list[1]]

            if subj != '' and obj != '':
                for i in [subj.split(line_splitter) if line_splitter != '' else [subj]][0]:
                    for j in [obj.split(line_splitter) if line_splitter != '' else [obj]][0]:
                        outfile.write(i.strip() + '\t' + j.strip() + '\n')

    outfile.close()

***
***

### CREATE MAPPING DATA  <a class="anchor" id="mapping-data"></a>


### MESH - ChEBI <a class="anchor" id="mesh-chebi"></a>

**Wiki Page:** [mapping-mesh-to-chebi](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#mapping-mesh-to-chebi)  

**Purpose:** This script assumes that the `NCBO_rest_api.py` script was run and the data generated from this file was written to `./resources/processed_data/temp`.


In [455]:
with open(processed_data_location + 'MESH_CHEBI_MAP.txt', 'w') as out:
    for filename in glob.glob(processed_data_location + 'temp/*.txt'):
        for row in list(filter(None, open(filename, 'r').read().split('\n'))):
            mesh = '_'.join(row.split('\t')[0].split('/')[-2:])
            chebi = row.split('\t')[1].split('/')[-1]
            out.write(mesh + '\t' + chebi + '\n')

out.close()

In [456]:
# preview data
data = pandas.read_csv(processed_data_location + 'MESH_CHEBI_MAP.txt',
                       header = None,
                       delimiter = '\t')

print('There are {edge_count} MESH-ChEBI edges'.format(edge_count=len(data)))

There are 11434 MESH-ChEBI edges


In [457]:
data.head(n=10)

Unnamed: 0,0,1
0,MESH_C535085,CHEBI_133814
1,MESH_C008574,CHEBI_17221
2,MESH_C492482,CHEBI_34581
3,MESH_C007556,CHEBI_135978
4,MESH_C500395,CHEBI_29138
5,MESH_C028026,CHEBI_15439
6,MESH_C560044,CHEBI_68089
7,MESH_C035745,CHEBI_138829
8,MESH_C511148,CHEBI_6566
9,MESH_C050445,CHEBI_28160


<br>

***
***

### Ensembl Gene - Ensembl Transcript <a class="anchor" id="ensemblgene-ensembltranscript"></a>

**Wiki Page:** [mapping-transcript-protein-and-gene-identifiers](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#mapping-transcript-protein-and-gene-identifiers)  

**Purpose:** This script downloads the [HUMAN_9606_idmapping_selected.tab.gz](ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/by_organism/HUMAN_9606_idmapping_selected.tab.gz) file from the [Uniprot Knolwedge Base](https://www.uniprot.org/) and saves it to the `./resources/processed_data/unprocessed_data/` directory.


In [458]:
# download data
url = 'ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/by_organism/HUMAN_9606_idmapping_selected.tab.gz'
data_downloader(url)

Downloading gzipped data from ftp server
Uncompressing and writing gzipped data


In [459]:
data_processor(filepath=unprocessed_data_location + 'HUMAN_9606_idmapping_selected.tab',
                   row_splitter='\t',
                   column_list=[18, 19],
                   output_name='ENSEMBL_GENE_ENSEMBL_TRANSCRIPT_MAP',
                   line_splitter=';')

In [460]:
# preview data
data = pandas.read_csv(processed_data_location + 'ENSEMBL_GENE_ENSEMBL_TRANSCRIPT_MAP.txt',
                       header = None,
                       delimiter = '\t')

print('There are {edge_count} ensembl gene-ensembl transcript edges'.format(edge_count=len(data)))

There are 144643 ensembl gene-ensembl transcript edges


In [461]:
data.head(n=10)

Unnamed: 0,0,1
0,ENSG00000166913,ENST00000353703
1,ENSG00000166913,ENST00000372839
2,ENSG00000108953,ENST00000264335
3,ENSG00000108953,ENST00000571732
4,ENSG00000108953,ENST00000616643
5,ENSG00000108953,ENST00000627231
6,ENSG00000274474,ENST00000264335
7,ENSG00000274474,ENST00000571732
8,ENSG00000274474,ENST00000616643
9,ENSG00000274474,ENST00000627231


<br>

***
***

### Ensembl Gene - Entrez Gene <a class="anchor" id="ensemblgene-entrezgene"></a>

**Wiki Page:** [mapping-transcript-protein-and-gene-identifiers](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#mapping-transcript-protein-and-gene-identifiers)  

**Purpose:** This script downloads the [HUMAN_9606_idmapping_selected.tab.gz](ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/by_organism/HUMAN_9606_idmapping_selected.tab.gz) file from the [Uniprot Knolwedge Base](https://www.uniprot.org/) and saves it to the `./resources/processed_data/unprocessed_data/` directory.


In [462]:
data_processor(filepath=unprocessed_data_location + 'HUMAN_9606_idmapping_selected.tab',
               row_splitter='\t',
               column_list=[18, 2],
               output_name='ENSEMBL_GENE_ENTREZ_GENE_MAP',
               line_splitter=';')

In [463]:
# preview data
data = pandas.read_csv(processed_data_location + 'ENSEMBL_GENE_ENTREZ_GENE_MAP.txt',
                       header = None,
                       delimiter = '\t')

print('There are {edge_count} ensembl gene-entrez gene edges'.format(edge_count=len(data)))

There are 25605 ensembl gene-entrez gene edges


In [464]:
data.head(n=10)

Unnamed: 0,0,1
0,ENSG00000166913,7529
1,ENSG00000108953,7531
2,ENSG00000274474,7531
3,ENSG00000128245,7533
4,ENSG00000170027,7532
5,ENSG00000175793,2810
6,ENSG00000134308,10971
7,ENSG00000164924,7534
8,ENSG00000110455,84680
9,ENSG00000205126,390110


<br>

***
***

### Ensembl Gene - Uniprot Accession <a class="anchor" id="ensemblgene-uniprot"></a>

**Wiki Page:** [mapping-transcript-protein-and-gene-identifiers](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#mapping-transcript-protein-and-gene-identifiers)  

**Purpose:** This script downloads the [HUMAN_9606_idmapping_selected.tab.gz](ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/by_organism/HUMAN_9606_idmapping_selected.tab.gz) file from the [Uniprot Knolwedge Base](https://www.uniprot.org/) and saves it to the `./resources/processed_data/unprocessed_data/` directory.


In [465]:
data_processor(filepath=unprocessed_data_location + 'HUMAN_9606_idmapping_selected.tab',
               row_splitter='\t',
               column_list=[18, 0],
               output_name='ENSEMBL_GENE_UNIPROT_ACCESSION_MAP',
               line_splitter=';')

In [466]:
# preview data
data = pandas.read_csv(processed_data_location + 'ENSEMBL_GENE_UNIPROT_ACCESSION_MAP.txt',
                       header = None,
                       delimiter = '\t')

print('There are {edge_count} ensembl gene-uniprot accession edges'.format(edge_count=len(data)))

There are 78437 ensembl gene-uniprot accession edges


In [467]:
data.head(n=10)

Unnamed: 0,0,1
0,ENSG00000166913,P31946
1,ENSG00000108953,P62258
2,ENSG00000274474,P62258
3,ENSG00000128245,Q04917
4,ENSG00000170027,P61981
5,ENSG00000175793,P31947
6,ENSG00000134308,P27348
7,ENSG00000164924,P63104
8,ENSG00000110455,Q96QU6
9,ENSG00000205126,Q4AC99


<br>

***
***

### Ensembl Protein - Uniprot Accession <a class="anchor" id="ensemblprotein-uniprot"></a>

**Wiki Page:** [mapping-transcript-protein-and-gene-identifiers](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#mapping-transcript-protein-and-gene-identifiers)  

**Purpose:** This script downloads the [HUMAN_9606_idmapping_selected.tab.gz](ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/by_organism/HUMAN_9606_idmapping_selected.tab.gz) file from the [Uniprot Knolwedge Base](https://www.uniprot.org/) and saves it to the `./resources/processed_data/unprocessed_data/` directory.


In [468]:
data_processor(filepath=unprocessed_data_location + 'HUMAN_9606_idmapping_selected.tab',
               row_splitter='\t',
               column_list=[20, 0],
               output_name='ENSEMBL_PROTEIN_UNIPROT_ACCESSION_MAP',
               line_splitter=';')

In [469]:
# preview data
data = pandas.read_csv(processed_data_location + 'ENSEMBL_PROTEIN_UNIPROT_ACCESSION_MAP.txt',
                       header = None,
                       delimiter = '\t')

print('There are {edge_count} ensembl protein-uniprot accession edges'.format(edge_count=len(data)))

There are 107976 ensembl protein-uniprot accession edges


In [470]:
data.head(n=10)

Unnamed: 0,0,1
0,ENSP00000300161,P31946
1,ENSP00000361930,P31946
2,ENSP00000264335,P62258
3,ENSP00000461762,P62258
4,ENSP00000481059,P62258
5,ENSP00000487356,P62258
6,ENSP00000248975,Q04917
7,ENSP00000306330,P61981
8,ENSP00000340989,P31947
9,ENSP00000238081,P27348


<br>

***
***

### Uniprot Accession - Protein Ontology <a class="anchor" id="uniprot-pro"></a>

**Wiki Page:** [mapping-protein-identifiers](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#mapping-protein-identifiers)  

**Purpose:** This script downloads the [promapping.txt](https://proconsortium.org/download/current/promapping.txt) file from the [Pro Consortium](https://proconsortium.org/download/current/) and saves to the `./resources/processed_data/unprocessed_data/` directory.


In [471]:
# download data
url = 'https://proconsortium.org/download/current/promapping.txt'
data_downloader(url)

Downloading data file


In [472]:
data = open(unprocessed_data_location + 'promapping.txt').readlines()

# reformat data and write it out
with open(processed_data_location + 'UNIPROT_ACCESSION_PRO_MAP.txt', 'w') as outfile:
    for line in data:
        row = line.split('\t')

        if row[1].startswith('UniProtKB'):
            outfile.write(row[0].strip().replace(':', '_') + '\t' + row[1].strip().split(':')[-1] + '\n')

outfile.close()

In [473]:
# preview data
data = pandas.read_csv(processed_data_location + 'UNIPROT_ACCESSION_PRO_MAP.txt',
                       header = None,
                       delimiter = '\t')

print('There are {edge_count} uniprot accession-protein ontology edges'.format(edge_count=len(data)))

There are 314714 uniprot accession-protein ontology edges


In [474]:
data.head(n=10)

Unnamed: 0,0,1
0,PR_000000005,P37173
1,PR_000000005,P38438
2,PR_000000005,Q62312
3,PR_000000005,Q90999
4,PR_000000007,F1R709
5,PR_000000009,Q16671
6,PR_000000009,Q62893
7,PR_000000009,Q8K592
8,PR_000000010,O57472
9,PR_000000010,Q24025


<br>

***
***

### Ensembl Protein - Protein Ontology <a class="anchor" id="ensemblprotein-pro"></a>

**Wiki Page:** [mapping-protein-identifiers](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#mapping-protein-identifiers)  

**Purpose:** This script assumes that the `UNIPROT_ACCESSION_PRO_MAP.txt` and the `ENSEMBL_PROTEIN_UNIPROT_ACCESSION_MAP.txt` files were created and saved to the `./resources/processed_data/unprocessed_data/` directory.


In [476]:
pro_kb = pandas.read_csv(processed_data_location + 'UNIPROT_ACCESSION_PRO_MAP.txt',
                         header = None,
                         delimiter = '\t')

# convert to dictionary
pro_dict = {}

for idx, row in pro_kb.iterrows():
    if row[1] in pro_dict.keys():
        pro_dict[row[1]].append(row[0]) 
    else:
        pro_dict[row[1]] = [row[0]]

In [477]:
ens_uni = pandas.read_csv(processed_data_location + 'ENSEMBL_PROTEIN_UNIPROT_ACCESSION_MAP.txt',
                       header = None,
                       delimiter = '\t')

# write out data
with open(processed_data_location + 'ENSEMBL_PROTEIN_PRO_MAP.txt', 'w') as outfile:
    for idx, row in ens_uni.iterrows():        
        if row[1] in pro_dict.keys():
            for x in pro_dict[row[1]]: 
                outfile.write(row[0] + '\t' + x + '\n')

outfile.close()

In [478]:
# preview data
data = pandas.read_csv(processed_data_location + 'ENSEMBL_PROTEIN_PRO_MAP.txt',
                       header = None,
                       delimiter = '\t')

print('There are {edge_count} ensembl protein-protein ontology edges'.format(edge_count=len(data)))

There are 92199 ensembl protein-protein ontology edges


In [479]:
data.head(n=10)

Unnamed: 0,0,1
0,ENSP00000300161,PR_000002175
1,ENSP00000300161,PR_P31946
2,ENSP00000361930,PR_000002175
3,ENSP00000361930,PR_P31946
4,ENSP00000264335,PR_000003104
5,ENSP00000264335,PR_P62258
6,ENSP00000461762,PR_000003104
7,ENSP00000461762,PR_P62258
8,ENSP00000481059,PR_000003104
9,ENSP00000481059,PR_P62258


<br>

***
***

### HPA Tissue/Cells - UBERON + Cell Ontology <a class="anchor" id="hpa-uberon"></a>

**Wiki Page:** [mapping-human-protein-atlas-tissues-and-cell-types](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#mapping-human-protein-atlas-tissues-and-cell-types)  

**Purpose:** This script downloads the [rna_tissue_consensus.tsv](https://www.proteinatlas.org/download/rna_tissue_consensus.tsv.zip) and [normal_tissue.tsv](https://www.proteinatlas.org/download/normal_tissue.tsv.zip) files from the [Human Protein Atlas](https://www.proteinatlas.org) and saves them to the `./resources/processed_data/unprocessed_data/` directory.


In [480]:
# download data
url1 = 'https://www.proteinatlas.org/download/normal_tissue.tsv.zip'
data_downloader(url1)

url2 = 'https://www.proteinatlas.org/download/rna_tissue_consensus.tsv.zip'
data_downloader(url2)

Downloading zipped data file
Downloading zipped data file


In [481]:
# abnormal tissue
abnormal_tissue = []
for line in open(unprocessed_data_location + 'rna_tissue_consensus.tsv').readlines():
    abnormal_tissue.append(line.split('\t')[2].strip())

# normal tissue
normal_tissue = []

for line in open(unprocessed_data_location + 'normal_tissue.tsv').readlines():
    normal_tissue.append(line.split('\t')[2].strip() + ' - ' + line.split('\t')[3].strip())

# combine normal and abnormal tissue and cells into single list
combo = set(abnormal_tissue + normal_tissue)

# write results
with open(unprocessed_data_location + 'HPA_tissues.txt', 'w') as outfile:
    for x in combo:
        outfile.write(x.strip() + '\n')

outfile.close()

In [482]:
# read back in mapped tissue/cell data
data = pandas.read_excel(open(unprocessed_data_location + 'zooma_tissue_cell_mapping_04DEC2019.xlsx',
                              'rb'),
                         sheet_name='zooma_tissue_cell_mapping_04DEC',
                         header=0)

data.fillna('None', inplace=True)

# prerview data
data.head(n=5)

Unnamed: 0,TISSUE,CELL TYPE,ONTOLOGY,ONTOLOGY ID,ONTOLOGY LABEL,MAPPING
0,adipose tissue,,UBERON,http://purl.obolibrary.org/obo/UBERON_0001013,adipose tissue,ZOOMA
1,adipose tissue,adipocytes,UBERON,http://purl.obolibrary.org/obo/UBERON_0001013,adipose tissue,ZOOMA
2,adipose tissue,adipocytes,CL,http://purl.obolibrary.org/obo/CL_0001070,fat cell,Manual
3,adrenal gland,,UBERON,http://purl.obolibrary.org/obo/UBERON_0002369,adrenal gland,ZOOMA
4,adrenal gland,cells in zona fasciculata,UBERON,http://purl.obolibrary.org/obo/UBERON_0002054,zona fasciculata of adrenal gland,ZOOMA


In [483]:
# reformat data and write it out
with open(processed_data_location + 'HPA_TISSUE_CELL_MAP.txt', 'w') as outfile:
    for idx, row in data.iterrows():

        if row['TISSUE'] != 'None':
            outfile.write(str(row['TISSUE']).strip() + '\t' + str(row['ONTOLOGY ID']).strip() + '\n')

        if row['CELL TYPE'] != 'None':
            outfile.write(str(row['CELL TYPE']).strip() + '\t' + str(row['ONTOLOGY ID']).strip() + '\n')

outfile.close()

In [484]:
# preview data
data = pandas.read_csv(processed_data_location + 'HPA_TISSUE_CELL_MAP.txt',
                       header = None,
                       delimiter = '\t')

print('There are {edge_count} edges'.format(edge_count=len(data)))

There are 622 edges


In [485]:
data.head(n=10)

Unnamed: 0,0,1
0,adipose tissue,http://purl.obolibrary.org/obo/UBERON_0001013
1,adipose tissue,http://purl.obolibrary.org/obo/UBERON_0001013
2,adipocytes,http://purl.obolibrary.org/obo/UBERON_0001013
3,adipose tissue,http://purl.obolibrary.org/obo/CL_0001070
4,adipocytes,http://purl.obolibrary.org/obo/CL_0001070
5,adrenal gland,http://purl.obolibrary.org/obo/UBERON_0002369
6,adrenal gland,http://purl.obolibrary.org/obo/UBERON_0002054
7,cells in zona fasciculata,http://purl.obolibrary.org/obo/UBERON_0002054
8,adrenal gland,http://purl.obolibrary.org/obo/CL_0002136
9,cells in zona fasciculata,http://purl.obolibrary.org/obo/CL_0002136


<br>

***
***

### Disease and Phenotype Identifiers <a class="anchor" id="disease-identifiers"></a>

**Wiki Page:** [mapping-disease-phenotype-identifiers](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#mapping-disease-identifiers)  

**Purpose:** This script downloads the [disease_mappings.tsv](https://www.disgenet.org/static/disgenet_ap1/files/downloads/disease_mappings.tsv.gz) file from [DisGeNET](https://www.disgenet.org) and saves it to the `./resources/processed_data/unprocessed_data/` directory.


In [486]:
# download data
url = 'https://www.disgenet.org/static/disgenet_ap1/files/downloads/disease_mappings.tsv.gz'
data_downloader(url)

Downloading gzipped data file


In [487]:
data = pandas.read_csv(unprocessed_data_location + 'disease_mappings.tsv',
                       header = 0,
                       delimiter = '|')

# convert to dictionary
disease_dict = {}

for idx, row in data.iterrows():
    
    if row['diseaseId'] in disease_dict.keys():
        if row['vocabulary'] == 'DO':
            disease_dict[row['diseaseId']].append('DOID_' + row['code']) 
        
        if row['vocabulary'] == 'HPO':
            disease_dict[row['diseaseId']].append(row['code'].replace('HP:', 'HP_')) 
    
    else:
        if row['vocabulary'] == 'DO':
            disease_dict[row['diseaseId']] = ['DOID_' + row['code']] 
        
        if row['vocabulary'] == 'HPO':
            disease_dict[row['diseaseId']] = [row['code'].replace('HP:', 'HP_')] 

In [488]:
# reformat data and write it out
with open(processed_data_location + 'DISEASE_DOID_MAP.txt', 'w') as outfile1, open('../../resources/processed_data/PHENOTYPE_HPO_MAP.txt', 'w') as outfile2:
    
    for key, value in disease_dict.items():
        for i in value:
            # get diseases
            if i.startswith('DOID_'): 
                outfile1.write(key + '\t' + i + '\n')

            # get phenotypes
            if i.startswith('HP_'): 
                outfile2.write(key + '\t' + i + '\n')

outfile1.close()
outfile2.close()

In [489]:
# preview cofactor data
data = pandas.read_csv(processed_data_location + 'DISEASE_DOID_MAP.txt',
                       header = None,
                       delimiter = '\t')

print('There are {} disease-DOID edges'.format(len(data)))

There are 15675 disease-DOID edges


In [490]:
data.head(n=10)

Unnamed: 0,0,1
0,C0018923,DOID_0001816
1,C0854893,DOID_0001816
2,C0033999,DOID_0002116
3,C4520843,DOID_0002116
4,C0024814,DOID_0014667
5,C0024814,DOID_0080195
6,C0024814,DOID_9277
7,C0025517,DOID_0014667
8,C0024291,DOID_0050120
9,C0272199,DOID_0050120


In [491]:
# preview catalyst data
data = pandas.read_csv(processed_data_location + 'PHENOTYPE_HPO_MAP.txt',
                       header = None,
                       delimiter = '\t')

print('There are {} phenotype-HPO edges'.format(len(data)))

There are 8289 phenotype-HPO edges


In [492]:
data.head(n=10)

Unnamed: 0,0,1
0,C0018923,HP_0200058
1,C0033999,HP_0001059
2,C4520843,HP_0001059
3,C0037199,HP_0000246
4,C0008780,HP_0012265
5,C0032290,HP_0011951
6,C0242770,HP_0011945
7,C0238378,HP_0005942
8,C0030469,HP_0000246
9,C0151516,HP_0005990


***
***

### PROCESS EDGE DATA: ONTOLOGIES


### Protein Ontology <a class="anchor" id="protein-ontology"></a>

**Wiki Page:** [PRO](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#human-phenotype-ontology)  

**Purpose:** This script downloads the [pr.owl](http://purl.obolibrary.org/obo/pr.owl) file from [ProConsortium.org](https://proconsortium.org/) and saves it to the `./resources/processed_data/unprocessed_data/` directory. The file is then read back in and filtered to contain only human proteins by performing forward and reverse breadth first search over all proteins which are `owl:subClassOf` [Homo sapiens protein](https://proconsortium.org/app/entry/PR%3A000029067/).


In [11]:
# download data
url = 'http://purl.obolibrary.org/obo/pr.owl'
data_downloader(url)

Downloading data file


In [12]:
# read in ontology as graph (the ontology is large so this takes ~60 minutes)
graph = Graph()
graph.parse(unprocessed_data_location + 'pr.owl')

print('There are {} edges in the ontology'.format(len(graph))) #11757623 edges on 12/12/2019


There are 11757623 edges in the ontology


**Filter Ontology:**  
The first step is to remove specific properties we know will cause errors when running breadth first search. All edges containing the following properties are removed:  

**`http://purl.obolibrary.org/obo/pr#lacks_part`**  
>The meaning of C lacks-part D is that all instances of C have no instance of D as part (C subClassOf: not (has-part some D)). [Ontology definition](https://www.ebi.ac.uk/ols/ontologies/pr/properties?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2Fpr%23lacks_part); [PMID:20807438](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2942855/)

<br>

**`http://www.w3.org/2002/07/owl#disjointWith`**
>The disjointness of a set of classes can be expressed using the owl:disjointWith constructor. It guarantees that an individual that is a member of one class cannot simultaneously be an instance of a specified other class. [W3C Definition](https://www.w3.org/TR/owl-guide/)

<br>

**`http://purl.obolibrary.org/obo/pr#1to1_ortholog_of`**
>If a and b are genes, a 1to1_ortholog_of b iff neither a nor b was duplicated after the speciation event that created them. If a and b are proteins, then a 1to1_ortholog_of b iff a has_gene_template some gene A and b has_gene_template some gene B and A 1to1_ortholog_of B. [Ontology Definition](https://www.ebi.ac.uk/ols/search?q=1to1_ortholog_of)


There is good evidence from the literature to support the removal of `lacks_part` and `owl#disjointWith` edges. For more information, see the following articles: [PMID:20807438](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2942855/), [PMID:17925014](https://www.ncbi.nlm.nih.gov/pubmed/17925014/), [PMID:17108603](https://www.ncbi.nlm.nih.gov/pubmed/17108603/).

In [13]:
# remove edges that contain certain properties (removed 12346 edges on 12/12/2019)

# lacks_part
graph.remove((None, None, URIRef(str(Namespace('http://purl.obolibrary.org/obo/pr#')) + 'lacks_part')))
graph.remove((None, URIRef(str(Namespace('http://purl.obolibrary.org/obo/pr#')) + 'lacks_part'), None))
graph.remove((URIRef(str(Namespace('http://purl.obolibrary.org/obo/pr#')) + 'lacks_part'), None, None))

# lacks_part
graph.remove((None, None, URIRef(str(Namespace('http://www.w3.org/2002/07/owl#')) + 'disjointWith')))
graph.remove((None, URIRef(str(Namespace('http://www.w3.org/2002/07/owl#')) + 'disjointWith'), None))
graph.remove((URIRef(str(Namespace('http://www.w3.org/2002/07/owl#')) + 'disjointWith'), None, None))

# 1to1_ortholog_of
graph.remove((None, None, URIRef(str(Namespace('http://purl.obolibrary.org/obo/pr#')) + '1to1_ortholog_of')))
graph.remove((None, URIRef(str(Namespace('http://purl.obolibrary.org/obo/pr#')) + '1to1_ortholog_of'), None))
graph.remove((URIRef(str(Namespace('http://purl.obolibrary.org/obo/pr#')) + '1to1_ortholog_of'), None, None))

print('There are {edge_count} edges after removing axioms containing "lacks_part", "disjointWith", and "1to1_ortholog_of"'.format(edge_count=len(graph)))


There are 11745277 edges after removing axioms containing "lacks_part", "disjointWith", and "1to1_ortholog_of"


**Query Ontology:**   
A list of human proteins is obtained by querying the ontology to return all proteins that are `OWL:subClassOf` the [Homo sapiens protein](https://proconsortium.org/app/entry/PR%3A000029067/) class. 


In [14]:
# query graph for all human proteins (19,732 human proteins on 12/12/2019)
results = graph.query(
    """SELECT DISTINCT ?c
       WHERE {
          ?c rdf:type owl:Class .
          ?c rdfs:subClassOf obo:PR_000029067.}
       """, initNs={'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
                    'rdfs': 'http://www.w3.org/2000/01/rdf-schema#',
                    'owl': 'http://www.w3.org/2002/07/owl#',
                    'obo': 'http://purl.obolibrary.org/obo/'})


In [15]:
print('There are {protein_count} human proteins in the ontology'.format(protein_count=len(results)))

There are 19732 human proteins in the ontology


**Convert Ontology to Directed MulitGraph:**  
In order to create a verison of the ontology which includes all relevant human edges, we need to first convert the KG to a [directed multigraph](https://networkx.github.io/documentation/stable/reference/classes/multidigraph.html). Once converted, we can obtain all the relevant edges of the ontogoloy by running breadth first search over each of the human proteins obtained from the prior step.

In [16]:
# convert RDF grpah to multidigraph (the ontology is large so this takes ~6xx minutes; start: 14:07:20)
mdg = rdflib_to_networkx_multidigraph(graph)

In [19]:
# store paths
bfs_paths = []

for node in tqdm(list(results)):
    
    forward = list(nx.edge_bfs(mdg, node, orientation='original'))
    reverse = list(nx.edge_bfs(mdg, node, orientation='reverse'))
    
    bfs_paths.append(forward + reverse)

100%|██████████| 19732/19732 [09:40<00:00, 33.98it/s]  


In [21]:
# unnest paths from lists of edges to a single edge list
bfs_edges = [x for y in bfs_paths for x in y]

print('There are {edge_count} human edges'.format(edge_count=len(bfs_edges)))
      

There are 6263797 human edges


**Construct Human PRO:**   
Now that we have all of the paths from the original graph that are relevant to humans, we can construct a new graph.

In [35]:
# create a new graph using bfs paths
human_pro_graph = Graph()

# add edges from path
for path in tqdm(bfs_edges):
    
    if path[3] == 'forward':
        human_pro_graph.add((path[0], path[2], path[1]))
    
    # adding reverse path graph edges in the correct order     
    else:
        human_pro_graph.add((path[1], path[2], path[0]))

100%|██████████| 6263797/6263797 [02:58<00:00, 35080.84it/s]


In [37]:
# get node and edge count
edge_count = len(human_pro_graph)
node_count = len(set([str(node) for edge in list(human_pro_graph) for node in edge[0::2]]))
print('\n Human PRO contains {node} nodes and {edge} edges\n'.format(node=node_count, edge=edge_count))

# serialize graph
human_pro_graph.serialize(destination=processed_data_location + 'human_pro.owl', format='xml')


 Human PRO contains 851909 nodes and 1394498 edges



In [41]:
# convert back to graph and count connected components
mdg2 = rdflib_to_networkx_multidigraph(human_pro_graph)
nx.number_connected_components(mdg2.to_undirected())

1

***
***

### PROCESS EDGE DATA: LINKED DATA


### Reactome: Protein-Complex Data <a class="anchor" id="reactome-protein-complex"></a>

**Wiki Page:** [Reactome](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#reactome-pathway-database)  

**Purpose:** This script downloads the [ComplexParticipantsPubMedIdentifiers_human.txt](https://reactome.org/download/current/ComplexParticipantsPubMedIdentifiers_human.txt) file from [Reactome](https://reactome.org) and saves it to the `./resources/processed_data/unprocessed_data/` directory.


In [218]:
# download data
url = 'https://reactome.org/download/current/ComplexParticipantsPubMedIdentifiers_human.txt'
data_downloader(url)

Downloading data file


In [24]:
data = open(unprocessed_data_location + 'ComplexParticipantsPubMedIdentifiers_human.txt').readlines()

# reformat data and write it out
with open(processed_data_location + 'REACTOME_PROTEIN_COMPLEX.txt', 'w') as outfile:
    for line in data:
        row = line.split('\t')
        
        # find all proteins in a complex
        for x in row[2].split('|'):
            if x.startswith('uniprot:'):            
                outfile.write(row[0].strip() + '\t' + x.split(':')[-1].strip() + '\n')

outfile.close()

In [25]:
# preview data
data = pandas.read_csv(processed_data_location + 'REACTOME_PROTEIN_COMPLEX.txt',
                       header = None,
                       delimiter = '\t')

print('There are {edge_count} protein-complex edges'.format(edge_count=len(data)))

There are 91985 protein-complex edges


In [26]:
data.head(n=10)

Unnamed: 0,0,1
0,R-HSA-1006173,P08603
1,R-HSA-1008206,Q16621
2,R-HSA-1008206,Q9ULX9
3,R-HSA-1008206,O15525
4,R-HSA-1008206,O60675
5,R-HSA-1008229,Q16621
6,R-HSA-1008229,Q9ULX9
7,R-HSA-1008229,O15525
8,R-HSA-1008229,O60675
9,R-HSA-1008252,P10914


### Reactome: Complex-Complex Data <a class="anchor" id="reactome-complex-complex"></a>

**Wiki Page:** [Reactome](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#reactome-pathway-database)  

**Purpose:** This script downloads the [ComplexParticipantsPubMedIdentifiers_human.txt](https://reactome.org/download/current/ComplexParticipantsPubMedIdentifiers_human.txt) file from [Reactome](https://reactome.orgt) and saves it to the `./resources/processed_data/unprocessed_data/` directory.


In [222]:
data = open(unprocessed_data_location + 'ComplexParticipantsPubMedIdentifiers_human.txt').readlines()

# reformat data and write it out
with open(processed_data_location + 'REACTOME_COMPLEX_COMPLEX.txt', 'w') as outfile:
    for line in data:
        row = line.split('\t')
        
        # find all proteins in a complex
        for x in row[3].split('|'):
            if x.startswith('R-HSA-'):            
                outfile.write(row[0].strip() + '\t' + x.strip() + '\n')

outfile.close()

In [223]:
# preview data
data = pandas.read_csv(processed_data_location + 'REACTOME_COMPLEX_COMPLEX.txt',
                       header = None,
                       delimiter = '\t')

print('There are {edge_count} complex-complex edges'.format(edge_count=len(data)))

There are 13746 complex-complex edges


In [224]:
data.head(n=10)

Unnamed: 0,0,1
0,R-HSA-1008206,R-HSA-1008229
1,R-HSA-1013011,R-HSA-1013017
2,R-HSA-1013011,R-HSA-1013019
3,R-HSA-1013011,R-HSA-420698
4,R-HSA-1013011,R-HSA-420748
5,R-HSA-1013017,R-HSA-1013019
6,R-HSA-1013017,R-HSA-420698
7,R-HSA-1013017,R-HSA-420748
8,R-HSA-1015697,R-HSA-913527
9,R-HSA-1015697,R-HSA-909693


### Reactome: Chemical-Complex Data <a class="anchor" id="reactome-chemical-complex"></a>

**Wiki Page:** [Reactome](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#reactome-pathway-database)  

**Purpose:** This script downloads the [ComplexParticipantsPubMedIdentifiers_human.txt](https://reactome.org/download/current/ComplexParticipantsPubMedIdentifiers_human.txt) file from [Reactome](https://reactome.orgt) and saves it to the `./resources/processed_data/unprocessed_data/` directory.


In [225]:
data = open(unprocessed_data_location + 'ComplexParticipantsPubMedIdentifiers_human.txt').readlines()

# reformat data and write it out
with open(processed_data_location + 'REACTOME_CHEMICAL_COMPLEX.txt', 'w') as outfile:
    for line in data:
        row = line.split('\t')
        
        # find all proteins in a complex
        for x in row[2].split('|'):
            if x.startswith('chebi:'):            
                outfile.write(row[0].strip() + '\t' + x.replace('chebi:', 'CHEBI_') + '\n')

outfile.close()

In [226]:
# preview data
data = pandas.read_csv(processed_data_location + 'REACTOME_CHEMICAL_COMPLEX.txt',
                       header = None,
                       delimiter = '\t')

print('There are {edge_count} chemical-complex edges'.format(edge_count=len(data)))

There are 5608 chemical-complex edges


In [227]:
data.head(n=10)

Unnamed: 0,0,1
0,R-HSA-1006173,CHEBI_24505
1,R-HSA-1006173,CHEBI_28879
2,R-HSA-1013011,CHEBI_59888
3,R-HSA-1013017,CHEBI_59888
4,R-HSA-109266,CHEBI_29105
5,R-HSA-109318,CHEBI_18420
6,R-HSA-109363,CHEBI_18420
7,R-HSA-109433,CHEBI_18420
8,R-HSA-109468,CHEBI_18420
9,R-HSA-109497,CHEBI_18420


<br>

***
***

### Uniprot:  Protein-Cofactor and Protein-Catalyst <a class="anchor" id="uniprot-protein-cofactorcatalyst"></a>

**Wiki Page:** [Uniprot](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#uniprot-knowledgebase)  

**Purpose:** This script downloads the [uniprot-cofactor-catalyst.tab](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#uniprot-knowledgebase) file from the [Uniprot Knowledge Base](https://www.uniprot.org) and saves it to the `./resources/processed_data/unprocessed_data/` directory.


In [259]:
# download data
url = 'https://www.uniprot.org/uniprot/?query=&fil=organism%3A%22Homo%20sapiens%20(Human)%20%5B9606%5D%22&columns=id%2Centry%20name%2Creviewed%2Cgenes%2Cchebi(Cofactor)%2Cchebi(Catalytic%20activity)%2Cprotein%20names&format=tab'
data_downloader(url, 'uniprot-cofactor-catalyst.tab')


Downloading data file


In [264]:
data = open(unprocessed_data_location + 'uniprot-cofactor-catalyst.tab').readlines()

# reformat data and write it out
with open(processed_data_location + 'UNIPROT_PROTEIN_COFACTOR.txt', 'w') as outfile1, open(processed_data_location + 'UNIPROT_PROTEIN_CATALYST.txt', 'w') as outfile2:
    for line in data:

        # get cofactors
        if 'CHEBI' in line.split('\t')[4]: 
            for i in line.split('\t')[4].split(';'):
                chebi = i.split('[')[-1].replace(']', '').replace(':', '_')
                outfile1.write(row[0].strip() + '\t' + chebi + '\n')
        
        # get catalysts
        if 'CHEBI' in line.split('\t')[5]:       
            for i in line.split('\t')[5].split(';'):
                chebi = i.split('[')[-1].replace(']', '').replace(':', '_')
                outfile2.write(row[0].strip() + '\t' + chebi + '\n')

outfile1.close()
outfile2.close()

In [265]:
# preview cofactor data
data = pandas.read_csv(processed_data_location + 'UNIPROT_PROTEIN_COFACTOR.txt',
                       header = None,
                       delimiter = '\t')

print('There are {edge_count} protein-cofactor edges'.format(edge_count=len(data)))

There are 5541 protein-cofactor edges


In [266]:
data.head(n=10)

Unnamed: 0,0,1
0,R-NUL-997399,CHEBI_29105
1,R-NUL-997399,CHEBI_18420
2,R-NUL-997399,CHEBI_29035
3,R-NUL-997399,CHEBI_29105
4,R-NUL-997399,CHEBI_18420
5,R-NUL-997399,CHEBI_597326
6,R-NUL-997399,CHEBI_29105
7,R-NUL-997399,CHEBI_29103
8,R-NUL-997399,CHEBI_29035
9,R-NUL-997399,CHEBI_49883


In [267]:
# preview catalyst data
data = pandas.read_csv(processed_data_location + 'UNIPROT_PROTEIN_CATALYST.txt',
                       header = None,
                       delimiter = '\t')

print('There are {edge_count} protein-catalyst edges'.format(edge_count=len(data)))

There are 59645 protein-catalyst edges


In [268]:
data.head(n=10)

Unnamed: 0,0,1
0,R-NUL-997399,CHEBI_15378
1,R-NUL-997399,CHEBI_43474
2,R-NUL-997399,CHEBI_58228
3,R-NUL-997399,CHEBI_30616
4,R-NUL-997399,CHEBI_17544
5,R-NUL-997399,CHEBI_28938
6,R-NUL-997399,CHEBI_456216
7,R-NUL-997399,CHEBI_15378
8,R-NUL-997399,CHEBI_30616
9,R-NUL-997399,CHEBI_61977


<br>

***
***

### Uniprot:  Protein-Coding Genes <a class="anchor" id="uniprot-protein-coding-genes"></a>

**Wiki Page:** [Uniprot](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/_edit#uniprot-knowledgebase)  

**Purpose:** This script downloads the [uniprot-protein-coding-genes.tab](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#uniprot-knowledgebase) file from the [Uniprot Knowledge Base](https://www.uniprot.org) and saves it to the `./resources/processed_data/unprocessed_data/` directory.


In [260]:
# download data
url = 'https://www.uniprot.org/uniprot/?query=keyword%3A%22Complete%20proteome%20%5BKW-0181%5D%22&fil=organism%3A%22Homo%20sapiens%20(Human)%20%5B9606%5D%22%20AND%20proteome%3Aup000005640&columns=id%2Centry%20name%2Creviewed%2Cprotein%20names%2Cgenes%2Cproteome%2Cdatabase(GeneID)%2Cdatabase(Ensembl)&format=tab'
data_downloader(url, 'uniprot-protein-coding-genes.tab')


Downloading data file


In [261]:
data = open(unprocessed_data_location + 'uniprot-protein-coding-genes.tab').readlines()

# reformat data and write it out
with open(processed_data_location + 'UNIPROT_PROTEIN_CODING_GENES.txt', 'w') as outfile:
    for line in data[1:]:
        for i in line.split('\t')[6].split(';'):
            if i != '':
                outfile.write(line.split('\t')[0].strip() + '\t' + i.strip() + '\n')

outfile.close()

In [262]:
# preview data
data = pandas.read_csv(processed_data_location + 'UNIPROT_PROTEIN_CODING_GENES.txt',
                       header = None,
                       delimiter = '\t')


print('There are {edge_count} protein-coding gene edges'.format(edge_count=len(data)))

There are 23005 protein-coding gene edges


In [263]:
data.head(n=10)

Unnamed: 0,0,1
0,Q96IY4,1361
1,P22362,6346
2,Q8NCR9,119467
3,Q8IUK8,147381
4,Q9BX69,84674
5,Q8N7E2,158506
6,Q8NEA5,147685
7,P31327,1373
8,Q3B7T3,146227
9,Q99933,573
