# UniProt Gene to Protein Mappings
**[Work in progress]**

This notebook downloads and standardizes chromosome and gene information for viral and human proteins for ingestion into the Knowledge Graph.

Data source: [UniProt](https://www.uniprot.org/)

Authors: Peter Rose (pwrose@ucsd.edu)

In [1]:
import os
import re
import hashlib
import pandas as pd
import numpy as np
from pathlib import Path

In [2]:
pd.options.display.max_rows = None  # display all rows
pd.options.display.max_columns = None  # display all columns

In [3]:
NEO4J_IMPORT = Path(os.getenv('NEO4J_IMPORT'))
print(NEO4J_IMPORT)

/Users/peter/Library/Application Support/com.Neo4j.Relate/data/dbmss/dbms-8bf637fc-0d20-4d9f-9c6f-f7e72e92a4da/import


### Get list of organisms to be processed

In [4]:
organisms = pd.read_csv("../../reference_data/Organism.csv", dtype=str)

In [5]:
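# keep only organisms identified by a taxonomy CURIE; copy to avoid modifying a slice
organisms = organisms[organisms['id'].str.startswith('taxonomy')].copy()
# strip the CURIE prefix, e.g., 'taxonomy:9606' -> '9606'
organisms['taxonomy'] = organisms['id'].apply(lambda x: x.split(':')[1])
taxonomy_ids = organisms['taxonomy'].unique()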
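
The CURIE handling above can be tried on a toy table. This is a minimal sketch: the sample ids and the `sample` variable names are made up for illustration, and only the `taxonomy:` prefix convention is taken from the notebook.

```python
import pandas as pd

# Hypothetical sample rows; the real table is read from Organism.csv
sample = pd.DataFrame(
    {'id': ['taxonomy:9606', 'taxonomy:2697049', 'other:123', 'taxonomy:9606']}
)

# keep only rows whose id is a taxonomy CURIE
sample = sample[sample['id'].str.startswith('taxonomy')].copy()

# keep the part after the colon, i.e., the bare taxonomy id
sample['taxonomy'] = sample['id'].str.split(':').str[1]

# unique() keeps the order of first appearance
sample_taxonomy_ids = sample['taxonomy'].unique()
print(sample_taxonomy_ids)
```

The non-taxonomy row is dropped and the duplicate id is collapsed, so each organism is queried only once.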

### Download data from UniProt using web services
The column names used in the query are documented at https://www.uniprot.org/help/uniprotkb_column_names

In [6]:
columns = 'id,genes,sequence,proteome,organism-id'

In [7]:
dfs = list()
for taxon in taxonomy_ids:
    url = f'https://www.uniprot.org/uniprot/?query=organism:{taxon}&columns={columns}&format=tab'
    try:
        df = pd.read_csv(url, sep='\t', dtype='str')
        if df.shape[0] > 0:
            print(f'Downloaded {df.shape[0]} genes for taxonomy id {taxon}')
            dfs.append(df)
        else:
            print(f'Downloaded 0 genes for taxonomy id {taxon}')
    except Exception:  # e.g., empty response or network error; does not swallow KeyboardInterrupt
        print(f'Downloaded 0 genes for taxonomy id {taxon}')

Downloaded 18413 genes for taxonomy id 2697049
Downloaded 10 genes for taxonomy id 1263720
Downloaded 96 genes for taxonomy id 694009
Downloaded 9 genes for taxonomy id 443239
Downloaded 1578 genes for taxonomy id 31631
Downloaded 455 genes for taxonomy id 11137
Downloaded 985 genes for taxonomy id 277944
Downloaded 14 genes for taxonomy id 12131
Downloaded 16 genes for taxonomy id 12134
Downloaded 10 genes for taxonomy id 766791
Downloaded 146 genes for taxonomy id 693998
Downloaded 15 genes for taxonomy id 1487703
Downloaded 12 genes for taxonomy id 285949
Downloaded 194237 genes for taxonomy id 9606
Downloaded 86618 genes for taxonomy id 10090
Downloaded 144 genes for taxonomy id 59477
Downloaded 22 genes for taxonomy id 608659
Downloaded 0 genes for taxonomy id 49442
Downloaded 80 genes for taxonomy id 9974
Downloaded 187 genes for taxonomy id 143292
Downloaded 0 genes for taxonomy id 71116
Downloaded 0 genes for taxonomy id 9608
Downloaded 41789 genes for taxonomy id 9685
Download
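
The query URL in the download loop is assembled with an f-string. As a hedged alternative, the sketch below builds the same URL with percent-encoded parameters. `build_query_url` and `BASE_URL` are hypothetical names introduced here. The endpoint is copied from the notebook, and whether it is still served is not verified. No request is sent.

```python
from urllib.parse import urlencode

# Endpoint as used in this notebook (assumption: not checked for availability)
BASE_URL = 'https://www.uniprot.org/uniprot/'


def build_query_url(taxon, columns):
    """Build the tab-separated download URL for one taxonomy id (hypothetical helper)."""
    params = {'query': f'organism:{taxon}', 'columns': columns, 'format': 'tab'}
    # urlencode percent-encodes ':' and ',' in the parameter values
    return f'{BASE_URL}?{urlencode(params)}'


example_url = build_query_url('9606', 'id,genes,sequence,proteome,organism-id')
print(example_url)
```

Keeping URL construction in one function makes it easier to log the exact request when a taxonomy id returns no rows.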

In [8]:
unp = pd.concat(dfs)

In [9]:
unp.reset_index(drop=True, inplace=True)

### Standardize data

In [10]:
unp.fillna('', inplace=True)

Add the missing gene name for UniProt entry P0DTC1.
I've filed an issue about the missing gene name with the UniProt help desk.

In [11]:
index = unp.query("Entry == 'P0DTC1'").index.values[0]
unp.at[index, 'Gene names'] = '1a'
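
The patch above fails with an `IndexError` if the entry is not in the download, and it overwrites the gene name even after UniProt fixes the record. A minimal sketch of a more defensive variant on a toy frame follows. The `toy` data is made up, and the mask-based approach is a suggestion, not the notebook's method.

```python
import pandas as pd

# Hypothetical two-row stand-in for the downloaded table
toy = pd.DataFrame({'Entry': ['P0DTC5', 'P0DTC1'], 'Gene names': ['M', '']})

# select the entry only if its gene name is still missing
mask = (toy['Entry'] == 'P0DTC1') & (toy['Gene names'] == '')

# a no-op when the mask selects no rows
toy.loc[mask, 'Gene names'] = '1a'
print(toy)
```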

In [12]:
print(unp.shape)

(393931, 5)


In [13]:
unp['proteome'] = unp['Proteomes'].apply(lambda r: r.split(','))

In [14]:
unp = unp.explode('proteome')

In [15]:
unp[['proteome', 'chromosome']] = unp['proteome'].str.split(':', expand=True)

In [16]:
unp.head()

| | Entry | Gene names | Sequence | Proteomes | Organism ID | proteome | chromosome |
|---|---|---|---|---|---|---|---|
| 0 | P0DTC5 | M | MADSNGTITVEELKKLLEQWNLVIGFLFLTWICLLQFAYANRNRFL... | UP000464024: Genome | 2697049 | UP000464024 | Genome |
| 1 | P0DTC1 | 1a | MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGDSVEEVLSEARQHL... | UP000464024: Genome | 2697049 | UP000464024 | Genome |
| 2 | P0DTD3 | ORF14 | MLQSCYNFLKEQHCQKASTQKGAEAAVKPLLVPHHVVATVQEIQLQ... | UP000464024: Genome | 2697049 | UP000464024 | Genome |
| 3 | P0DTC4 | E 4 | MYSFVSEETGTLIVNSVLLFLAFVVFLLVTLAILTALRLCAYCCNI... | UP000464024: Genome | 2697049 | UP000464024 | Genome |
| 4 | P0DTC2 | S 2 | MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSS... | UP000464024: Genome | 2697049 | UP000464024 | Genome |
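
The three steps above (split the `Proteomes` string, explode it into one row per proteome, then separate the proteome id from the chromosome) can be traced on a toy frame. This is a sketch under assumptions: the entries `X1` and `X2` and the multi-chromosome string are invented to show the comma-separated case. It also uses `n=1` and an index reset, which the notebook does not.

```python
import pandas as pd

# Hypothetical rows: one entry mapped to two chromosomes, one viral entry
toy = pd.DataFrame({
    'Entry': ['X1', 'X2'],
    'Proteomes': [
        'UP000005640: Chromosome 5, UP000005640: Chromosome 6',
        'UP000464024: Genome',
    ],
})

# one list element per proteome assignment
toy['proteome'] = toy['Proteomes'].str.split(',')

# one row per proteome assignment; reset the index so labels are unique again
toy = toy.explode('proteome').reset_index(drop=True)

# split at the first colon only, then trim surrounding whitespace
parts = toy['proteome'].str.split(':', n=1, expand=True)
toy['proteome'] = parts[0].str.strip()
toy['chromosome'] = parts[1].str.strip()
print(toy[['Entry', 'proteome', 'chromosome']])
```

Entry `X1` ends up on two rows, one per chromosome, which is why duplicates are dropped before the genes are saved.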


In [17]:
unp.fillna('', inplace=True)

Assign the first gene name as the preferred gene name

In [18]:
unp['name'] = unp['Gene names'].str.split(' ', expand=True)[0]

Create a semicolon-separated list of gene name synonyms so that it can be represented in a CSV file

In [19]:
unp['synonymes'] = unp['Gene names'].str.replace(' ', ';')
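
A small sketch of how a space-separated `Gene names` value maps to the preferred name and the synonym list. The TSPAN17 string is reassembled from the sample output later in this notebook. The other rows and the `toy` frame are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({'Gene names': ['TSPAN17 FBXO23 TM4SF17', 'M', '']})

# the first token is the preferred name (empty string if no name is given)
toy['name'] = toy['Gene names'].str.split(' ').str[0]

# literal replacement of spaces by semicolons; regex=False makes that explicit
toy['synonymes'] = toy['Gene names'].str.replace(' ', ';', regex=False)
print(toy)
```

Note that the synonym list includes the preferred name itself as its first element.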

In [20]:
unp['chromosome'] = unp['chromosome'].str.strip()

For consistency, relabel 'Genome' entries as 'Viral Chromosome'

In [21]:
unp['chromosome'] = unp['chromosome'].str.replace('Genome', 'Viral Chromosome')

The MD5 hash of the protein sequence is used as the identifier that links to the protein level

In [22]:
unp['id'] = unp['Sequence'].apply(lambda seq: 'md5:' + hashlib.md5(seq.encode()).hexdigest())

In [23]:
# append the taxonomy id to disambiguate ids, since identical sequences can occur in different organisms
unp['id'] = unp['id'] + '-' + unp['Organism ID']
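
The identifier scheme from the two cells above can be written as one function. This is a minimal sketch: `gene_id` is a hypothetical helper name and the short sequence is a toy fragment, not a full UniProt sequence.

```python
import hashlib


def gene_id(sequence, organism_id):
    """Return 'md5:<hex digest of sequence>-<organism id>' (hypothetical helper)."""
    digest = hashlib.md5(sequence.encode()).hexdigest()
    return f'md5:{digest}-{organism_id}'


# the same toy sequence in two organisms yields two distinct ids
id_a = gene_id('MADSNGTITVEELKK', '2697049')
id_b = gene_id('MADSNGTITVEELKK', '9606')
print(id_a)
print(id_b)
```

The two ids share the same digest and differ only in the organism suffix, so the sequence-level link is preserved while the ids stay unique per organism.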

Assign unique identifiers (CURIEs)

In [24]:
unp['taxonomyId'] = 'taxonomy:' + unp['Organism ID']

### Save genes

In [25]:
gene = unp[['id', 'taxonomyId', 'name', 'synonymes', 'chromosome']].copy()
gene.drop_duplicates(inplace=True)
gene.to_csv(NEO4J_IMPORT / '01a-UniProtGene.csv', index=False)

In [26]:
gene.sample(5)

| | id | taxonomyId | name | synonymes | chromosome |
|---|---|---|---|---|---|
| 349143 | md5:25dfec4a3ecea2c926a9ef73adf964e5-452646 | taxonomy:452646 | TTL | TTL | |
| 353274 | md5:072bc8f58849408ed7aa5aaa900524df-452646 | taxonomy:452646 | Q4JCQ8 | Q4JCQ8 | |
| 26578 | md5:066cb7b88e47572767efcf8b52d55d51-9606 | taxonomy:9606 | TSPAN17 | TSPAN17;FBXO23;TM4SF17 | Chromosome 5 |
| 235449 | md5:636d7f540464d89e5e4a318e2d553eb8-10090 | taxonomy:10090 | Olfr137 | Olfr137 | Chromosome 17 |
| 349591 | md5:a3e8410b6dd70f1ce4576d041d86cedf-452646 | taxonomy:452646 | H0Y3M2 | H0Y3M2 | |


#### Genes with no names

In [27]:
gene.query("name == ''").head()

| | id | taxonomyId | name | synonymes | chromosome |
|---|---|---|---|---|---|
| 42 | md5:44ac354326e066c8a56b733a2eb409b1-2697049 | taxonomy:2697049 | | | |
| 132 | md5:3921b4859e760676e600fbf687bb10ee-2697049 | taxonomy:2697049 | | | |
| 612 | md5:a0e6d3d9c4e7aac7fca20d7ea58d7398-2697049 | taxonomy:2697049 | | | |
| 646 | md5:0e5acbb538e201e15bdc6ad30f2c37f6-2697049 | taxonomy:2697049 | | | |
| 815 | md5:7b37cc2515fc1acbd1d73dc4e63fd420-2697049 | taxonomy:2697049 | | | |
