
# PheKnowLator - Mapping Data


***
***

**Author:** [TJCallahan](https://mail.google.com/mail/u/0/?view=cm&fs=1&tf=1&to=callahantiff@gmail.com)  
**GitHub Repository:** [PheKnowLator](https://github.com/callahantiff/PheKnowLator/wiki)  

**Purpose:** This notebook generates the mapping and filtering data for the PheKnowLator project. It creates each of the mapping and/or filtering data sources described on the [Data Sources](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#mapping-and-filtering-data) Wiki page. 

**Assumptions:** The notebook assumes that the files requiring further processing are located in the `./resources/data_maps/unprocessed_data/` directory.


***

### Table of Contents
* [MESH-ChEBI](#mesh-chebi)  
* [Ensembl Gene - Ensembl Transcript](#ensemblgene-ensembltranscript)  
* [Ensembl Gene - ENTREZ Gene](#ensemblgene-entrezgene)  
* [Ensembl Gene - Uniprot Accession](#ensemblgene-uniprot)  
* [Ensembl Protein - Uniprot Accession](#ensemblprotein-uniprot)  
* [HPA Tissue/Cells - UBERON + Cell Ontology](#hpa-uberon)  
* [Uniprot Accession - Protein Ontology](#uniprot-pro)  
* [Ensembl Protein - Protein Ontology](#ensemblprotein-pro)  

***
***

In [3]:
# import needed libraries
import glob
import pandas


#### Create Helper Function


In [13]:
def data_processor(filepath: str, row_splitter: str, column_list: list, output_name: str, line_splitter: str = ''):
    """Reads in a file using the input file path and reduces it to the two columns specified by `column_list`.
    The reduced file is then saved as a text file and written to the `/resources/data_maps` directory.

    Args:
        filepath (str): A string that points to the location of a temp mapping file that needs to be processed.
        row_splitter (str): A string that contains the character used to split each row into columns.
        column_list (list): A list that contains two numbers, which correspond to column indices in the input data
                            file and which appear in the order of write preference.
        output_name (str): A string naming the processed data file.
        line_splitter (str): A character used to separate multiple data points within a single column value.
                             Defaults to an empty string, which indicates that the value contains a single item.

    Returns:
        None.
    """

    # read in data
    with open(filepath) as infile:
        data = infile.readlines()

    # process and write out data (the context manager closes the file)
    with open('../../resources/data_maps/{filename}.txt'.format(filename=output_name), 'w') as outfile:
        for line in data:
            row = line.split(row_splitter)
            subj, obj = row[column_list[0]], row[column_list[1]]

            if subj != '' and obj != '':
                # split multi-valued fields; an empty line_splitter means a single value
                subjects = subj.split(line_splitter) if line_splitter != '' else [subj]
                objects = obj.split(line_splitter) if line_splitter != '' else [obj]

                for i in subjects:
                    for j in objects:
                        outfile.write(i.strip() + '\t' + j.strip() + '\n')
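
The pairing logic inside `data_processor` can be illustrated in isolation. The sketch below uses a hypothetical helper, `expand_pairs`, which is not part of the notebook; it is meant to mirror the nested loops above by pairing every delimiter-separated subject value with every object value.

```python
from itertools import product
from typing import List, Tuple


def expand_pairs(subj: str, obj: str, line_splitter: str = '') -> List[Tuple[str, str]]:
    """Hypothetical helper (not in the notebook) mirroring data_processor's nested loops."""
    if subj == '' or obj == '':
        return []  # rows with a missing identifier are skipped

    # an empty splitter means each field holds a single value
    subjects = subj.split(line_splitter) if line_splitter else [subj]
    objects = obj.split(line_splitter) if line_splitter else [obj]

    return [(s.strip(), o.strip()) for s, o in product(subjects, objects)]


pairs = expand_pairs('ENSG00000108953; ENSG00000274474', 'ENST00000264335; ENST00000571732', ';')
for gene, transcript in pairs:
    print(gene + '\t' + transcript)
```

Because every subject is paired with every object, a row listing two genes and two transcripts yields four output lines. This is why several genes can appear with the same set of transcripts in the mapping files.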


<br>

***

### MESH - ChEBI <a class="anchor" id="mesh-chebi"></a>

**Wiki Page:** [mapping-mesh-to-chebi](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#mapping-mesh-to-chebi)  

**Purpose:** This script assumes that the `NCBO_rest_api.py` script was run and that its output was written to `./resources/data_maps/temp`.


In [7]:
with open('../../resources/data_maps/MESH_CHEBI_MAP.txt', 'w') as out:
    for filename in glob.glob('../../resources/data_maps/temp/*.txt'):
        # read the non-empty rows of each temp file
        with open(filename, 'r') as infile:
            rows = list(filter(None, infile.read().split('\n')))

        for row in rows:
            mesh = '_'.join(row.split('\t')[0].split('/')[-2:])
            chebi = row.split('\t')[1].split('/')[-1]
            out.write(mesh + '\t' + chebi + '\n')
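
To make the string handling concrete, the sketch below applies the same splits to a single sample row. The two URIs are assumed examples of the input format (the contents of the `temp` files are not shown in this notebook), and `parse_mapping_row` is a hypothetical name.

```python
def parse_mapping_row(row: str):
    """Hypothetical helper: converts a tab-separated pair of URIs into MESH/ChEBI identifiers."""
    mesh_uri, chebi_uri = row.split('\t')[:2]
    # keep the last two path segments of the MeSH URI, e.g. 'MESH' and 'C535085'
    mesh = '_'.join(mesh_uri.split('/')[-2:])
    # the ChEBI identifier is the last path segment
    chebi = chebi_uri.split('/')[-1]
    return mesh, chebi


# assumed example row; real rows come from the NCBO_rest_api.py output
sample = 'http://purl.bioontology.org/ontology/MESH/C535085\thttp://purl.obolibrary.org/obo/CHEBI_133814'
result = parse_mapping_row(sample)
print(result)
```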
    

In [12]:
# preview data
data = pandas.read_csv('../../resources/data_maps/MESH_CHEBI_MAP.txt',
                       header = None,
                       delimiter = '\t')

data.head(n=10)


Unnamed: 0,0,1
0,MESH_C535085,CHEBI_133814
1,MESH_C008574,CHEBI_17221
2,MESH_C492482,CHEBI_34581
3,MESH_C007556,CHEBI_135978
4,MESH_C500395,CHEBI_29138
5,MESH_C028026,CHEBI_15439
6,MESH_C560044,CHEBI_68089
7,MESH_C035745,CHEBI_138829
8,MESH_C511148,CHEBI_6566
9,MESH_C050445,CHEBI_28160


<br>

***
***

### Ensembl Gene - Ensembl Transcript <a class="anchor" id="ensemblgene-ensembltranscript"></a>

**Wiki Page:** [mapping-transcript-protein-and-gene-identifiers](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#mapping-transcript-protein-and-gene-identifiers)  

**Purpose:** This script assumes that the `HUMAN_9606_idmapping_selected.tab.gz` file from the [UniProt Knowledgebase](ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/by_organism/HUMAN_9606_idmapping_selected.tab.gz) was downloaded, decompressed, and saved to the `./resources/data_maps/unprocessed_data/` directory.


In [14]:
data_processor(filepath='../../resources/data_maps/unprocessed_data/HUMAN_9606_idmapping_selected.tab',
                   row_splitter='\t',
                   column_list=[18, 19],
                   output_name='ENSEMBL_GENE_ENSEMBL_TRANSCRIPT_MAP',
                   line_splitter=';')


In [15]:
# preview data
data = pandas.read_csv('../../resources/data_maps/ENSEMBL_GENE_ENSEMBL_TRANSCRIPT_MAP.txt',
                       header = None,
                       delimiter = '\t')

data.head(n=10)


Unnamed: 0,0,1
0,ENSG00000166913,ENST00000353703
1,ENSG00000166913,ENST00000372839
2,ENSG00000108953,ENST00000264335
3,ENSG00000108953,ENST00000571732
4,ENSG00000108953,ENST00000616643
5,ENSG00000108953,ENST00000627231
6,ENSG00000274474,ENST00000264335
7,ENSG00000274474,ENST00000571732
8,ENSG00000274474,ENST00000616643
9,ENSG00000274474,ENST00000627231


<br>

***
***

### Ensembl Gene - ENTREZ Gene <a class="anchor" id="ensemblgene-entrezgene"></a>

**Wiki Page:** [mapping-transcript-protein-and-gene-identifiers](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#mapping-transcript-protein-and-gene-identifiers)  

**Purpose:** This script assumes that the `HUMAN_9606_idmapping_selected.tab.gz` file from the [UniProt Knowledgebase](ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/by_organism/HUMAN_9606_idmapping_selected.tab.gz) was downloaded, decompressed, and saved to the `./resources/data_maps/unprocessed_data/` directory.

In [16]:
data_processor(filepath='../../resources/data_maps/unprocessed_data/HUMAN_9606_idmapping_selected.tab',
                   row_splitter='\t',
                   column_list=[18, 2],
                   output_name='ENSEMBL_GENE_ENTREZ_GENE_MAP',
                   line_splitter=';')


In [17]:
# preview data
data = pandas.read_csv('../../resources/data_maps/ENSEMBL_GENE_ENTREZ_GENE_MAP.txt',
                       header = None,
                       delimiter = '\t')

data.head(n=10)


Unnamed: 0,0,1
0,ENSG00000166913,7529
1,ENSG00000108953,7531
2,ENSG00000274474,7531
3,ENSG00000128245,7533
4,ENSG00000170027,7532
5,ENSG00000175793,2810
6,ENSG00000134308,10971
7,ENSG00000164924,7534
8,ENSG00000110455,84680
9,ENSG00000205126,390110


<br>

***
***

### Ensembl Gene - Uniprot Accession <a class="anchor" id="ensemblgene-uniprot"></a>

**Wiki Page:** [mapping-transcript-protein-and-gene-identifiers](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#mapping-transcript-protein-and-gene-identifiers)  

**Purpose:** This script assumes that the `HUMAN_9606_idmapping_selected.tab.gz` file from the [UniProt Knowledgebase](ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/by_organism/HUMAN_9606_idmapping_selected.tab.gz) was downloaded, decompressed, and saved to the `./resources/data_maps/unprocessed_data/` directory.

In [18]:
data_processor(filepath='../../resources/data_maps/unprocessed_data/HUMAN_9606_idmapping_selected.tab',
                   row_splitter='\t',
                   column_list=[18, 0],
                   output_name='ENSEMBL_GENE_UNIPROT_ACCESSION_MAP',
                   line_splitter=';')


In [19]:
# preview data
data = pandas.read_csv('../../resources/data_maps/ENSEMBL_GENE_UNIPROT_ACCESSION_MAP.txt',
                       header = None,
                       delimiter = '\t')

data.head(n=10)


Unnamed: 0,0,1
0,ENSG00000166913,P31946
1,ENSG00000108953,P62258
2,ENSG00000274474,P62258
3,ENSG00000128245,Q04917
4,ENSG00000170027,P61981
5,ENSG00000175793,P31947
6,ENSG00000134308,P27348
7,ENSG00000164924,P63104
8,ENSG00000110455,Q96QU6
9,ENSG00000205126,Q4AC99


<br>

***
***

### Ensembl Protein - Uniprot Accession <a class="anchor" id="ensemblprotein-uniprot"></a>

**Wiki Page:** [mapping-transcript-protein-and-gene-identifiers](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#mapping-transcript-protein-and-gene-identifiers)  

**Purpose:** This script assumes that the `HUMAN_9606_idmapping_selected.tab.gz` file from the [UniProt Knowledgebase](ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/by_organism/HUMAN_9606_idmapping_selected.tab.gz) was downloaded, decompressed, and saved to the `./resources/data_maps/unprocessed_data/` directory.


In [20]:
data_processor(filepath='../../resources/data_maps/unprocessed_data/HUMAN_9606_idmapping_selected.tab',
                   row_splitter='\t',
                   column_list=[20, 0],
                   output_name='ENSEMBL_PROTEIN_UNIPROT_ACCESSION_MAP',
                   line_splitter=';')


In [21]:
# preview data
data = pandas.read_csv('../../resources/data_maps/ENSEMBL_PROTEIN_UNIPROT_ACCESSION_MAP.txt',
                       header = None,
                       delimiter = '\t')

data.head(n=10)


Unnamed: 0,0,1
0,ENSP00000300161,P31946
1,ENSP00000361930,P31946
2,ENSP00000264335,P62258
3,ENSP00000461762,P62258
4,ENSP00000481059,P62258
5,ENSP00000487356,P62258
6,ENSP00000248975,Q04917
7,ENSP00000306330,P61981
8,ENSP00000340989,P31947
9,ENSP00000238081,P27348


<br>

***
***

### HPA Tissue/Cells - UBERON + Cell Ontology <a class="anchor" id="hpa-uberon"></a>

**Wiki Page:** [mapping-human-protein-atlas-tissues-and-cell-types](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#mapping-human-protein-atlas-tissues-and-cell-types)  

**Purpose:** This script assumes that the `rna_tissue_consensus.tsv` and `normal_tissue.tsv` files from the [Human Protein Atlas](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#human-protein-atlas) were downloaded and saved to the `./resources/data_maps/unprocessed_data/` directory.


In [27]:
# abnormal tissue
abnormal_tissue = []
with open('../../resources/data_maps/unprocessed_data/rna_tissue_consensus.tsv') as infile:
    for line in infile:
        abnormal_tissue.append(line.split('\t')[2].strip())

# normal tissue
normal_tissue = []
with open('../../resources/data_maps/unprocessed_data/normal_tissue.tsv') as infile:
    for line in infile:
        row = line.split('\t')
        normal_tissue.append(row[2].strip() + ' - ' + row[3].strip())

# combine normal and abnormal tissue and cells into single list
combo = set(abnormal_tissue + normal_tissue)

# write results
with open('../../resources/data_maps/unprocessed_data/HPA_tissues.txt', 'w') as outfile:
    for x in combo:
        outfile.write(x.strip() + '\n')
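
Note that the loops above also read each file's header row, so the column names end up in `HPA_tissues.txt`. One possible way to avoid this is to read the columns by name, sketched here with in-memory sample data. The column names (`Tissue`, `Cell type`) and the sample rows are assumptions about the HPA file layout and should be checked against the downloaded files.

```python
import csv
import io

# assumed sample in the layout of normal_tissue.tsv; real files have many more rows
normal_tsv = (
    'Gene\tGene name\tTissue\tCell type\tLevel\tReliability\n'
    'GENE_EXAMPLE_1\tNAME1\tadipose tissue\tadipocytes\tNot detected\tApproved\n'
    'GENE_EXAMPLE_2\tNAME2\tadrenal gland\tglandular cells\tNot detected\tApproved\n'
)

# DictReader consumes the header row, so it never reaches the output
reader = csv.DictReader(io.StringIO(normal_tsv), delimiter='\t')
normal_tissue = {row['Tissue'].strip() + ' - ' + row['Cell type'].strip() for row in reader}

for label in sorted(normal_tissue):
    print(label)
```

With a real file, `io.StringIO(normal_tsv)` would be replaced by an open file handle.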


In [29]:
# read back in mapped tissue/cell data
data = pandas.read_excel(open('../../resources/data_maps/unprocessed_data/zooma_tissue_cell_mapping_04DEC2019.xlsx',
                              'rb'),
                         sheet_name='zooma_tissue_cell_mapping_04DEC',
                         header=0)

data.fillna('None', inplace=True)

# preview data
data.head(n=5)


Unnamed: 0,TISSUE,CELL TYPE,ONTOLOGY,ONTOLOGY ID,ONTOLOGY LABEL,MAPPING
0,adipose tissue,,UBERON,http://purl.obolibrary.org/obo/UBERON_0001013,adipose tissue,ZOOMA
1,adipose tissue,adipocytes,UBERON,http://purl.obolibrary.org/obo/UBERON_0001013,adipose tissue,ZOOMA
2,adipose tissue,adipocytes,CL,http://purl.obolibrary.org/obo/CL_0001070,fat cell,Manual
3,adrenal gland,,UBERON,http://purl.obolibrary.org/obo/UBERON_0002369,adrenal gland,ZOOMA
4,adrenal gland,cells in zona fasciculata,UBERON,http://purl.obolibrary.org/obo/UBERON_0002054,zona fasciculata of adrenal gland,ZOOMA


In [24]:
# reformat data and write it out
with open('../../resources/data_maps/HPA_TISSUE_CELL_MAP.txt', 'w') as outfile:
    for idx, row in data.iterrows():

        if row['TISSUE'] != 'None':
            outfile.write(str(row['TISSUE']).strip() + '\t' + str(row['ONTOLOGY ID']).strip() + '\n')

        if row['CELL TYPE'] != 'None':
            outfile.write(str(row['CELL TYPE']).strip() + '\t' + str(row['ONTOLOGY ID']).strip() + '\n')

outfile.close()
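
Each spreadsheet row can produce up to two output lines, one for the tissue and one for the cell type, and both receive the row's `ONTOLOGY ID`. The sketch below reproduces that logic with plain dictionaries in place of DataFrame rows; the two rows are taken from the spreadsheet preview above, with the missing cell type stored as `'None'`.

```python
# rows from the preview above; fillna('None') has replaced the missing cell type
rows = [
    {'TISSUE': 'adipose tissue', 'CELL TYPE': 'None',
     'ONTOLOGY ID': 'http://purl.obolibrary.org/obo/UBERON_0001013'},
    {'TISSUE': 'adipose tissue', 'CELL TYPE': 'adipocytes',
     'ONTOLOGY ID': 'http://purl.obolibrary.org/obo/CL_0001070'},
]

lines = []
for row in rows:
    # emit one line per non-missing label, tissue first
    for column in ('TISSUE', 'CELL TYPE'):
        if row[column] != 'None':
            lines.append(str(row[column]).strip() + '\t' + str(row['ONTOLOGY ID']).strip())

for line in lines:
    print(line)
```

As a consequence, a tissue label such as `adipose tissue` is also paired with the Cell Ontology identifier of any cell type listed on the same row.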


In [25]:
# preview data
data = pandas.read_csv('../../resources/data_maps/HPA_TISSUE_CELL_MAP.txt',
                       header = None,
                       delimiter = '\t')

data.head(n=10)


Unnamed: 0,0,1
0,adipose tissue,http://purl.obolibrary.org/obo/UBERON_0001013
1,adipose tissue,http://purl.obolibrary.org/obo/UBERON_0001013
2,adipocytes,http://purl.obolibrary.org/obo/UBERON_0001013
3,adipose tissue,http://purl.obolibrary.org/obo/CL_0001070
4,adipocytes,http://purl.obolibrary.org/obo/CL_0001070
5,adrenal gland,http://purl.obolibrary.org/obo/UBERON_0002369
6,adrenal gland,http://purl.obolibrary.org/obo/UBERON_0002054
7,cells in zona fasciculata,http://purl.obolibrary.org/obo/UBERON_0002054
8,adrenal gland,http://purl.obolibrary.org/obo/CL_0002136
9,cells in zona fasciculata,http://purl.obolibrary.org/obo/CL_0002136


<br>

***
***

### Uniprot Accession - Protein Ontology <a class="anchor" id="uniprot-pro"></a>

**Wiki Page:** [mapping-protein-identifiers](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#mapping-protein-identifiers)  

**Purpose:** This script assumes that the `promapping.txt` file from the [Pro Consortium](https://proconsortium.org/download/current/) was downloaded and saved to the `./resources/data_maps/unprocessed_data/` directory.


In [30]:
with open('../../resources/data_maps/unprocessed_data/promapping.txt') as infile:
    data = infile.readlines()

# reformat data and write it out (the context manager closes the file)
with open('../../resources/data_maps/UNIPROT_ACCESSION_PRO_MAP.txt', 'w') as outfile:
    for line in data:
        row = line.split('\t')

        # keep only rows that map a PRO identifier to a UniProtKB accession
        if row[1].startswith('UniProtKB'):
            outfile.write(row[0].strip().replace(':', '_') + '\t' + row[1].strip().split(':')[-1] + '\n')
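
The filtering and reformatting step can be seen on individual lines. In the sketch below, `parse_pro_line` is a hypothetical helper and the two sample lines are assumed examples of the `promapping.txt` layout (a PRO identifier, an external identifier, and a third column that is ignored here).

```python
def parse_pro_line(line: str):
    """Hypothetical helper: returns (PRO id, UniProt accession), or None for non-UniProtKB rows."""
    row = line.split('\t')
    if len(row) < 2 or not row[1].startswith('UniProtKB'):
        return None
    # 'PR:000000005' -> 'PR_000000005'; 'UniProtKB:P37173' -> 'P37173'
    return row[0].strip().replace(':', '_'), row[1].strip().split(':')[-1]


# assumed example lines; only the first maps to a UniProtKB accession
kept = parse_pro_line('PR:000000005\tUniProtKB:P37173\tis_a\n')
skipped = parse_pro_line('PR:000000005\tHGNC:11773\texact\n')
print(kept, skipped)
```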


In [31]:
# preview data
data = pandas.read_csv('../../resources/data_maps/UNIPROT_ACCESSION_PRO_MAP.txt',
                       header = None,
                       delimiter = '\t')

data.head(n=10)


Unnamed: 0,0,1
0,PR_000000005,P37173
1,PR_000000005,P38438
2,PR_000000005,Q62312
3,PR_000000005,Q90999
4,PR_000000007,F1R709
5,PR_000000009,Q16671
6,PR_000000009,Q62893
7,PR_000000009,Q8K592
8,PR_000000010,O57472
9,PR_000000010,Q24025


<br>

***
***

### Ensembl Protein - Protein Ontology <a class="anchor" id="ensemblprotein-pro"></a>


In [None]:
pro_kb = open('../../resources/data_maps/UNIPROT_ACCESSION_PRO_MAP.txt').readlines()

# convert to dictionary


In [None]:
# preview data
data = pandas.read_csv('../../resources/data_maps/UNIPROT_ACCESSION_PRO_MAP.txt',
                       header = None,
                       delimiter = '\t')

data.head(n=10)
