# UniProt Gene to Protein Mappings
**[Work in progress]**

This notebook downloads and standardizes Chromosome and Gene information for viral and human proteins for ingestion into the Knowledge Graph.

Data source: [UniProt](https://www.uniprot.org/)

Authors: Peter Rose (pwrose@ucsd.edu)

In [1]:
import os
import re
import hashlib
import pandas as pd
import numpy as np
from pathlib import Path

In [2]:
pd.options.display.max_rows = None  # display all rows
pd.options.display.max_columns = None  # display all columns

In [3]:
NEO4J_IMPORT = Path(os.getenv('NEO4J_IMPORT'))
print(NEO4J_IMPORT)

/Users/peter/Library/Application Support/com.Neo4j.Relate/data/dbmss/dbms-8bf637fc-0d20-4d9f-9c6f-f7e72e92a4da/import


### Get list of organisms to be processed

In [4]:
organisms = pd.read_csv("../../reference_data/Organism.csv", dtype=str)

In [5]:
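# keep only rows identified by a taxonomy CURIE; copy to avoid modifying a view
organisms = organisms[organisms['id'].str.startswith('taxonomy')].copy()
# strip the CURIE prefix, e.g. 'taxonomy:9606' -> '9606'
organisms['taxonomy'] = organisms['id'].apply(lambda x: x.split(':')[1])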
taxonomy_ids = organisms['taxonomy'].unique()
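As a minimal sketch of the prefix-stripping step, the helper below splits a CURIE at its first colon. The name `local_id` is hypothetical and not part of the notebook. Unlike the lambda above, it raises a descriptive error when no colon is present.

```python
def local_id(curie: str) -> str:
    """Return the part of a CURIE after the first colon."""
    prefix, sep, reference = curie.partition(':')
    if not sep:
        # no colon found, so the input is not of the form prefix:reference
        raise ValueError(f'not a CURIE: {curie!r}')
    return reference


print(local_id('taxonomy:9606'))  # prints 9606
```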

### Download data from UniProt using web services
The available column names are listed at https://www.uniprot.org/help/uniprotkb_column_names

In [6]:
columns = 'id,genes,sequence,proteome,organism-id,reviewed'

In [7]:
dfs = list()
for taxon in taxonomy_ids:
    url = f'https://www.uniprot.org/uniprot/?query=organism:{taxon}&columns={columns}&format=tab'
    try:
        df = pd.read_csv(url, sep='\t', dtype='str')
        if df.shape[0] > 0:
            print(f'Downloaded {df.shape[0]} genes for taxonomy id {taxon}')
            dfs.append(df)
        else:
            print(f'Downloaded 0 genes for taxonomy id {taxon}')
    except Exception:  # download or parsing failed; report it as zero rows
        print(f'Downloaded 0 genes for taxonomy id {taxon}')

Downloaded 18413 genes for taxonomy id 2697049
Downloaded 10 genes for taxonomy id 1263720
Downloaded 96 genes for taxonomy id 694009
Downloaded 9 genes for taxonomy id 443239
Downloaded 1578 genes for taxonomy id 31631
Downloaded 455 genes for taxonomy id 11137
Downloaded 985 genes for taxonomy id 277944
Downloaded 10 genes for taxonomy id 2709072
Downloaded 42 genes for taxonomy id 2708335
Downloaded 14 genes for taxonomy id 12131
Downloaded 16 genes for taxonomy id 12134
Downloaded 10 genes for taxonomy id 766791
Downloaded 146 genes for taxonomy id 693998
Downloaded 15 genes for taxonomy id 1487703
Downloaded 12 genes for taxonomy id 285949
Downloaded 194237 genes for taxonomy id 9606
Downloaded 86618 genes for taxonomy id 10090
Downloaded 34082 genes for taxonomy id 59479
Downloaded 144 genes for taxonomy id 59477
Downloaded 22 genes for taxonomy id 608659
Downloaded 0 genes for taxonomy id 49442
Downloaded 80 genes for taxonomy id 9974
Downloaded 187 genes for taxonomy id 143292


In [8]:
unp = pd.concat(dfs)

In [9]:
unp.reset_index(drop=True,inplace=True)

### Standardize data

In [10]:
unp.fillna('', inplace=True)

Add the missing gene name for UniProt entry P0DTC1.
I've filed an issue about the missing gene name with the UniProt help desk.

In [11]:
index = unp.query("Entry == 'P0DTC1'").index.values[0]
unp.at[index, 'Gene names'] = '1a'

In [12]:
print(unp.shape)

(478030, 6)


In [13]:
unp['reviewed'] = unp['Status'].apply(lambda s: 'True' if s == 'reviewed' else 'False')

In [14]:
unp['proteome'] = unp['Proteomes'].apply(lambda r: r.split(','))

In [15]:
unp = unp.explode('proteome')

In [16]:
unp[['proteome', 'chromosome']] = unp['proteome'].str.split(':', expand=True)

In [17]:
unp.head()

Unnamed: 0,Entry,Gene names,Sequence,Proteomes,Organism ID,Status,reviewed,proteome,chromosome
0,P0DTC5,M,MADSNGTITVEELKKLLEQWNLVIGFLFLTWICLLQFAYANRNRFL...,UP000464024: Genome,2697049,reviewed,True,UP000464024,Genome
1,P0DTC1,1a,MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGDSVEEVLSEARQHL...,UP000464024: Genome,2697049,reviewed,True,UP000464024,Genome
2,P0DTD3,ORF14,MLQSCYNFLKEQHCQKASTQKGAEAAVKPLLVPHHVVATVQEIQLQ...,UP000464024: Genome,2697049,reviewed,True,UP000464024,Genome
3,P0DTC4,E 4,MYSFVSEETGTLIVNSVLLFLAFVVFLLVTLAILTALRLCAYCCNI...,UP000464024: Genome,2697049,reviewed,True,UP000464024,Genome
4,P0DTC2,S 2,MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSS...,UP000464024: Genome,2697049,reviewed,True,UP000464024,Genome
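The split-and-explode steps above can be illustrated on a toy frame. This is a sketch only: the accessions and proteome ids are made up, and the index is reset after `explode` so that each row has a unique label.

```python
import pandas as pd

# Toy data mimicking the 'Proteomes' column (all values are illustrative)
toy = pd.DataFrame({
    'Entry': ['X1', 'X2'],
    'Proteomes': [
        'UP000000001: Chromosome 1, UP000000002: Chromosome 2',
        'UP000000003: Genome',
    ],
})

# One list element per proteome assignment
toy['proteome'] = toy['Proteomes'].str.split(',')
# One row per proteome assignment
toy = toy.explode('proteome').reset_index(drop=True)
# Separate the proteome id from the chromosome label and trim whitespace
parts = toy['proteome'].str.split(':', expand=True)
toy['proteome'] = parts[0].str.strip()
toy['chromosome'] = parts[1].str.strip()

print(toy[['Entry', 'proteome', 'chromosome']])
```

Entry `X1` ends up on two rows, one per proteome, which is why the row count grows after the explode step.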


In [18]:
unp.fillna('', inplace=True)

Assign the first gene name as the preferred gene name

In [19]:
unp['name'] = unp['Gene names'].str.split(' ', expand=True)[0]

Create a semicolon-separated list of gene name synonyms so that it can be stored in a single CSV field

In [20]:
unp['synonymes'] = unp['Gene names'].str.replace(' ', ';')
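A small sketch of the two string operations on made-up gene name strings. Note that the synonym list produced this way still contains the preferred name as its first element. Passing `regex=False` is an addition here to make the literal replacement explicit.

```python
import pandas as pd

# Space-separated gene names as delivered by UniProt (illustrative values)
gene_names = pd.Series(['S 2', 'ORF14', ''])

# Preferred name: first token of the space-separated list
name = gene_names.str.split(' ', expand=True)[0]
# Synonyms: the same list, joined with semicolons instead of spaces
synonyms = gene_names.str.replace(' ', ';', regex=False)

print(pd.DataFrame({'name': name, 'synonymes': synonyms}))
```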

In [21]:
unp['chromosome'] = unp['chromosome'].str.strip()

For consistency, rename the 'Genome' chromosome label to 'Viral Chromosome'

In [22]:
unp['chromosome'] = unp['chromosome'].str.replace('Genome', 'Viral Chromosome')
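`str.replace` performs a substring replacement, so a label such as 'Genome assembly' becomes 'Viral Chromosome assembly' (this is visible in the sample output for taxonomy id 10036). If only the exact label 'Genome' is meant to be renamed, `Series.replace` matches whole values instead. The sketch below contrasts the two on illustrative labels. Whether 'Genome assembly' should be renamed is a decision this example does not make.

```python
import pandas as pd

chromosome = pd.Series(['Genome', 'Genome assembly', 'Chromosome 7', ''])

# Substring replacement: every occurrence of 'Genome' inside a value is rewritten
substring = chromosome.str.replace('Genome', 'Viral Chromosome', regex=False)
# Whole-value replacement: only values equal to 'Genome' are rewritten
exact = chromosome.replace('Genome', 'Viral Chromosome')

print(pd.DataFrame({'original': chromosome, 'substring': substring, 'exact': exact}))
```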

The MD5 hash of the protein sequence is used as the identifier that links each gene to its protein

In [23]:
unp['id'] = unp['Sequence'].apply(lambda seq: 'md5:' + hashlib.md5(seq.encode()).hexdigest())

In [24]:
# disambiguate id by taxonomyId (same sequence for different organisms)
unp['id'] = unp['id'] + '-' + unp['Organism ID']
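The two cells above can be combined into one function for illustration. The name `gene_id` is hypothetical and the sequences are toy strings. The sketch only shows how the identifier is assembled from the sequence hash and the taxonomy id.

```python
import hashlib


def gene_id(sequence: str, taxonomy_id: str) -> str:
    """Build 'md5:<hex digest of sequence>-<taxonomy id>'."""
    digest = hashlib.md5(sequence.encode()).hexdigest()
    return f'md5:{digest}-{taxonomy_id}'


# Identical sequences from different organisms get different identifiers
human = gene_id('MADSNGTITVEEL', '9606')
mouse = gene_id('MADSNGTITVEEL', '10090')
print(human)
print(mouse)
```

Because the hash depends only on the sequence, two UniProt entries of the same organism with identical sequences share one identifier. This is what the later `drop_duplicates` step relies on.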

Assign unique identifiers (CURIEs)

In [25]:
unp['taxonomyId'] = 'taxonomy:' + unp['Organism ID']

In [27]:
unp.sort_values('reviewed', ascending=False, inplace=True)

In [28]:
unp.shape[0]

480363

In [29]:
unp.sample(5)

Unnamed: 0,Entry,Gene names,Sequence,Proteomes,Organism ID,Status,reviewed,proteome,chromosome,name,synonymes,id,taxonomyId
307711,A0A671DWL3,ARPC1B,MAYHSFLVEPISCHAWNKDRTQIAICPNNHEVHIYEKSGAKWVKVH...,UP000472240: Chromosome 7,59479,unreviewed,False,UP000472240,Chromosome 7,ARPC1B,ARPC1B,md5:42ce192deb3b841178e73e35779fe78e-59479,taxonomy:59479
445030,U6CPG2,KYNU,MEPSSLELAADTVQRIASELGCLPTDERVALHLDEEDKLRHFKEHF...,,452646,unreviewed,False,,,KYNU,KYNU,md5:fbd57511337be3ad1e1f8392c9c3afb1-452646,taxonomy:452646
445401,Q60440,IL4,MGLRPQLAAILLCLLACTGNWTLGCHHGALKEIIHILNQVTEKGTP...,UP000189706: Genome assembly,10036,reviewed,True,UP000189706,Viral Chromosome assembly,IL4,IL4,md5:0362442d30093cd5ade7f8da040babf5-10036,taxonomy:10036
249412,A0A0G2JFN5,Larp1b,MANWPTPGELVNTG,UP000000589: Chromosome 3,10090,unreviewed,False,UP000000589,Chromosome 3,Larp1b,Larp1b,md5:1931786fa609882db26db295d5ee650d-10090,taxonomy:10090
316459,A0A671EAC0,GCG,MKSIYFVAGLFVMLVQSSWQRSLQDSEEKSSSFPAPQTDPFNDPEQ...,UP000472240: Chromosome 8,59479,unreviewed,False,UP000472240,Chromosome 8,GCG,GCG,md5:60b751eb762b115cead9e500f28469ab-59479,taxonomy:59479


In [30]:
unp.drop_duplicates(subset=['id'], inplace=True)

In [31]:
unp.shape[0]

446821
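Sorting by `reviewed` in descending order before dropping duplicates means that, when several entries share an id, a reviewed entry is kept in preference to an unreviewed one, because `drop_duplicates` keeps the first occurrence by default. A toy sketch with made-up ids and accessions follows. The stable `mergesort` is an addition here so that ties keep their original order.

```python
import pandas as pd

toy = pd.DataFrame({
    'id': ['md5:aaa-9606', 'md5:aaa-9606', 'md5:bbb-9606'],
    'Entry': ['A0A000', 'P00001', 'A0A001'],
    'reviewed': ['False', 'True', 'False'],
})

# 'True' sorts after 'False' as a string, so descending order puts reviewed rows first
toy = toy.sort_values('reviewed', ascending=False, kind='mergesort')
# Keep the first row per id, i.e. the reviewed one where available
toy = toy.drop_duplicates(subset=['id'])

print(toy)
```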

### Save genes

In [32]:
gene = unp[['id', 'taxonomyId', 'name', 'synonymes', 'chromosome', 'reviewed']].copy()
gene.drop_duplicates(inplace=True)
gene.to_csv(NEO4J_IMPORT / '01a-UniProtGene.csv', index = False)

In [33]:
gene.sample(5)

Unnamed: 0,id,taxonomyId,name,synonymes,chromosome,reviewed
259596,md5:c3a7575cb54d21d76400cd182d25efe2-10090,taxonomy:10090,Psmc4,Psmc4,Chromosome 7,False
156834,md5:3e0df6dee5eed8ca7d44e8fd19af0e9c-9606,taxonomy:9606,CALM2,CALM2,Chromosome 2,False
441442,md5:536ee579e2f4bc7d8af6e5041aac5dfd-452646,taxonomy:452646,ZN202,ZN202,,False
205236,md5:b78ff3e7aea2d86fedce2f63fdae57e1-9606,taxonomy:9606,CAST,CAST,Chromosome 5,False
213477,md5:e56ccc771c6fb579c604519a245d26c9-9606,taxonomy:9606,KCNJ10,KCNJ10,Chromosome 1,False


#### Genes with no names

In [34]:
gene.query("name == ''").head()

Unnamed: 0,id,taxonomyId,name,synonymes,chromosome,reviewed
39315,md5:3ffe9916b5a7a6316ffe86be634cb8e4-9606,taxonomy:9606,,,Unplaced,True
38831,md5:a090464ae3cf84b0fb45ca4506c55e1f-9606,taxonomy:9606,,,,True
40120,md5:6e86bef8a617461c8480d108406d1932-9606,taxonomy:9606,,,Unplaced,True
40399,md5:096c7955063334efe5646f1065bbe2e3-9606,taxonomy:9606,,,Unplaced,True
37310,md5:0aef285f24e9c644c02aab0ee7d71793-9606,taxonomy:9606,,,,True
