# UniProt Gene to Protein Mappings
**[Work in progress]**

This notebook downloads and standardizes chromosome and gene information for viral and human proteins for ingestion into the Knowledge Graph.

Data source: [UniProt](https://www.uniprot.org/)

Authors: Peter Rose (pwrose@ucsd.edu)

In [1]:
import os
import re
import hashlib
import urllib

import pandas as pd
import numpy as np

from pathlib import Path
from Bio import SeqIO


pd.options.display.max_rows = None  # display all rows
pd.options.display.max_columns = None  # display all columns

In [2]:
# fail early with a KeyError if the NEO4J_IMPORT environment variable is not set
NEO4J_IMPORT = Path(os.environ['NEO4J_IMPORT'])
print(NEO4J_IMPORT)

/Users/peter/Library/Application Support/com.Neo4j.Relate/data/dbmss/dbms-8bf637fc-0d20-4d9f-9c6f-f7e72e92a4da/import


### Get list of organisms to be processed

In [3]:
#genomes = pd.read_csv("../../reference_data/Genome.csv", dtype=str)

In [4]:
#genomes['taxonomy'] = genomes['taxonomyId'].apply(lambda x: x.split(':')[1])

In [5]:
organisms = pd.read_csv("../../reference_data/Organism.csv", dtype=str)

In [6]:
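# keep only organisms identified by a taxonomy CURIE; copy to avoid modifying a view
organisms = organisms[organisms['id'].str.startswith('taxonomy')].copy()
# remove the CURIE prefix, e.g. 'taxonomy:9606' -> '9606'
organisms['taxonomy'] = organisms['id'].apply(lambda x: x.split(':')[1])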
taxonomy_ids = organisms['taxonomy'].unique()
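
The prefix removal above can be illustrated without pandas. This is a minimal sketch, not the notebook's own code: `strip_curie_prefix` and the example list are hypothetical names introduced here, and the identifiers are sample values in the same `taxonomy:<id>` form.

```python
def strip_curie_prefix(curie: str) -> str:
    """Return the local part of a CURIE such as 'taxonomy:9606'."""
    prefix, sep, local_id = curie.partition(':')
    if not sep or not local_id:
        # No colon, or nothing after it: not a usable CURIE
        raise ValueError(f'not a CURIE: {curie!r}')
    return local_id


ids = ['taxonomy:2697049', 'taxonomy:9606', 'taxonomy:9606']
# dict.fromkeys removes duplicates while keeping first-seen order
taxonomy_ids_example = list(dict.fromkeys(strip_curie_prefix(i) for i in ids))
print(taxonomy_ids_example)
```

Unlike `x.split(':')[1]`, `partition` keeps everything after the first colon, and the function raises an explicit error for values that lack a prefix.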

### Download data from UniProt using web services
Column names are documented at https://www.uniprot.org/help/uniprotkb_column_names

In [7]:
columns = 'id,genes,sequence,proteome,organism-id'

In [8]:
urls = [f'https://www.uniprot.org/uniprot/?query=organism:{taxon}&columns={columns}&format=tab'
        for taxon in taxonomy_ids]
#urls = [f'https://www.uniprot.org/uniprot/?query=organism:{taxon}+AND+reviewed:yes&columns={columns}&format=tab'
#        for taxon in genomes['taxonomy'].unique()]
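
The query URLs can also be assembled from named parameters instead of an f-string. The sketch below assumes the same endpoint and column list used in this notebook; `build_query_url` and `BASE_URL` are hypothetical names, and the endpoint reflects the UniProt web service as used here, which may differ from the current service.

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.uniprot.org/uniprot/'  # endpoint used in this notebook


def build_query_url(taxon: str, columns: str) -> str:
    """Build a tab-format query URL for all entries of one organism."""
    params = {
        'query': f'organism:{taxon}',
        'columns': columns,
        'format': 'tab',
    }
    # Keep ':' and ',' unescaped so the URL matches the hand-written form
    return BASE_URL + '?' + urlencode(params, safe=':,')


example_url = build_query_url('2697049', 'id,genes,sequence,proteome,organism-id')
print(example_url)
```

Keeping the parameters in a dictionary makes it easier to add a filter such as the commented-out `reviewed:yes` restriction without editing the URL template by hand.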

In [9]:
print(urls)

['https://www.uniprot.org/uniprot/?query=organism:2697049&columns=id,genes,sequence,proteome,organism-id&format=tab', 'https://www.uniprot.org/uniprot/?query=organism:1263720&columns=id,genes,sequence,proteome,organism-id&format=tab', 'https://www.uniprot.org/uniprot/?query=organism:694009&columns=id,genes,sequence,proteome,organism-id&format=tab', 'https://www.uniprot.org/uniprot/?query=organism:443239&columns=id,genes,sequence,proteome,organism-id&format=tab', 'https://www.uniprot.org/uniprot/?query=organism:31631&columns=id,genes,sequence,proteome,organism-id&format=tab', 'https://www.uniprot.org/uniprot/?query=organism:11137&columns=id,genes,sequence,proteome,organism-id&format=tab', 'https://www.uniprot.org/uniprot/?query=organism:277944&columns=id,genes,sequence,proteome,organism-id&format=tab', 'https://www.uniprot.org/uniprot/?query=organism:12131&columns=id,genes,sequence,proteome,organism-id&format=tab', 'https://www.uniprot.org/uniprot/?query=organism:12134&columns=id,genes,

In [15]:
dfs = list()
for url in urls:
    try:
        print(url)
        df = pd.read_csv(url, sep='\t', dtype='str')
        print(df.shape)
        if df.shape[0] > 0:
            dfs.append(df)
    except Exception:
        # raised, for example, when the response is empty or the request fails
        print('No genes found')

https://www.uniprot.org/uniprot/?query=organism:2697049&columns=id,genes,sequence,proteome,organism-id&format=tab
(18413, 5)
https://www.uniprot.org/uniprot/?query=organism:1263720&columns=id,genes,sequence,proteome,organism-id&format=tab
(10, 5)
https://www.uniprot.org/uniprot/?query=organism:694009&columns=id,genes,sequence,proteome,organism-id&format=tab
(96, 5)
https://www.uniprot.org/uniprot/?query=organism:443239&columns=id,genes,sequence,proteome,organism-id&format=tab
(9, 5)
https://www.uniprot.org/uniprot/?query=organism:31631&columns=id,genes,sequence,proteome,organism-id&format=tab
(1578, 5)
https://www.uniprot.org/uniprot/?query=organism:11137&columns=id,genes,sequence,proteome,organism-id&format=tab
(455, 5)
https://www.uniprot.org/uniprot/?query=organism:277944&columns=id,genes,sequence,proteome,organism-id&format=tab
(985, 5)
https://www.uniprot.org/uniprot/?query=organism:12131&columns=id,genes,sequence,proteome,organism-id&format=tab
(14, 5)
https://www.uniprot.org/uni
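
The download loop reports every failure as "No genes found", so a network error looks the same as an organism without entries. The sketch below shows one way to record the reason for each failure. It is an illustration under assumptions, not the notebook's method: `collect_tables` and `fake_fetch` are hypothetical, and the fetch function is passed in so that the example runs without network access.

```python
import urllib.error


def collect_tables(urls, fetch):
    """Call fetch(url) for each URL; return (non-empty tables, failures)."""
    tables, failures = [], []
    for url in urls:
        try:
            table = fetch(url)
        except (urllib.error.URLError, ValueError) as err:
            # Keep the reason so a network error is not mistaken for
            # an organism without UniProt entries
            failures.append((url, repr(err)))
            continue
        if len(table) > 0:
            tables.append(table)
    return tables, failures


def fake_fetch(url):
    """Stand-in for a real download, used only for this demonstration."""
    if url.endswith('empty'):
        raise ValueError('No columns to parse from file')
    return [('P0DTC5', 'M')]


tables, failures = collect_tables(['u/ok', 'u/empty'], fake_fetch)
print(len(tables), failures)
```

With pandas, `fetch` could be `lambda url: pd.read_csv(url, sep='\t', dtype=str)`. The `ValueError` clause is intended to cover pandas' `EmptyDataError`, which derives from `ValueError`.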

In [16]:
unp = pd.concat(dfs)
#unp = pd.concat((pd.read_csv(url, sep='\t', dtype='str') for url in urls))

In [17]:
unp.reset_index(drop=True,inplace=True)

### Standardize data

In [18]:
unp.fillna('', inplace=True)

Add the missing gene name for UniProt entry P0DTC1.
I have filed an issue about the missing gene name with the UniProt help desk.

In [19]:
# set the gene name by boolean mask; this is a no-op if the entry is absent
unp.loc[unp['Entry'] == 'P0DTC1', 'Gene names'] = '1a'

In [20]:
print(unp.shape)

(393748, 5)


In [21]:
unp['proteome'] = unp['Proteomes'].apply(lambda r: r.split(','))

In [22]:
unp = unp.explode('proteome')

In [23]:
unp[['proteome', 'chromosome']] = unp['proteome'].str.split(':', expand=True)

In [24]:
unp.head()

Unnamed: 0,Entry,Gene names,Sequence,Proteomes,Organism ID,proteome,chromosome
0,P0DTC5,M,MADSNGTITVEELKKLLEQWNLVIGFLFLTWICLLQFAYANRNRFL...,UP000464024: Genome,2697049,UP000464024,Genome
1,P0DTC1,1a,MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGDSVEEVLSEARQHL...,UP000464024: Genome,2697049,UP000464024,Genome
2,P0DTD3,ORF14,MLQSCYNFLKEQHCQKASTQKGAEAAVKPLLVPHHVVATVQEIQLQ...,UP000464024: Genome,2697049,UP000464024,Genome
3,P0DTC4,E 4,MYSFVSEETGTLIVNSVLLFLAFVVFLLVTLAILTALRLCAYCCNI...,UP000464024: Genome,2697049,UP000464024,Genome
4,P0DTC2,S 2,MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSS...,UP000464024: Genome,2697049,UP000464024,Genome
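
The three cells above split the `Proteomes` field on commas, expand each item into its own row, and then split each item on the colon. A minimal pure-Python sketch of the same parsing for a single field value follows. `split_proteomes` is a hypothetical helper, and the two-item input is a made-up value used only to show the multi-proteome case.

```python
def split_proteomes(field: str):
    """Split 'UPxxx: Location, UPyyy: Location' into (proteome, location) pairs."""
    pairs = []
    for item in field.split(','):
        proteome, _, location = item.partition(':')
        # strip() removes the blank that follows ':' and ',' in the raw field
        pairs.append((proteome.strip(), location.strip()))
    return pairs


single = split_proteomes('UP000464024: Genome')
# Hypothetical two-proteome value, for illustration only
double = split_proteomes('UP000000001: Chromosome 6, UP000000002: Unplaced')
print(single)
print(double)
```

An empty field yields a single pair of empty strings, which corresponds to the empty `proteome` and `chromosome` values that the notebook produces after `fillna('')`.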


In [25]:
unp.fillna('', inplace=True)

Assign the first gene name as the preferred gene name

In [26]:
unp['name'] = unp['Gene names'].str.split(' ', expand=True)[0]

Create a semicolon-separated list of gene name synonyms so that it can be stored in a single CSV field

In [27]:
unp['synonymes'] = unp['Gene names'].str.replace(' ', ';')
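
The two steps above, choosing the preferred name and building the synonym list, can be sketched for a single `Gene names` value as follows. `preferred_name_and_synonyms` is a hypothetical helper. It uses `str.split()` without arguments, which, unlike the `replace(' ', ';')` call above, collapses repeated blanks.

```python
def preferred_name_and_synonyms(gene_names: str):
    """Return (first gene name, semicolon-separated list of all names)."""
    names = gene_names.split()  # split on any run of whitespace
    preferred = names[0] if names else ''
    return preferred, ';'.join(names)


print(preferred_name_and_synonyms('HLA-DQB2 HLA-DXB'))
print(preferred_name_and_synonyms(''))
```

Entries without any gene name return two empty strings, matching the rows listed under "Genes with no names".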

In [28]:
unp['chromosome'] = unp['chromosome'].str.strip()

For consistency, rename viral 'Genome' entries to 'Viral Chromosome'

In [29]:
unp['chromosome'] = unp['chromosome'].str.replace('Genome', 'Viral Chromosome')
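
`str.replace` substitutes substrings, so any longer value that contains 'Genome' is rewritten as well. The sampled output in the "Save genes" section shows 'Viral Chromosome assembly' for taxonomy:10036, which suggests that a value such as 'Genome assembly' was affected (this original value is an assumption). A minimal sketch of an exact-match alternative, using the hypothetical helper `rename_viral_genome`:

```python
def rename_viral_genome(chromosome: str) -> str:
    """Rename only the exact value 'Genome'; leave all other values unchanged."""
    return 'Viral Chromosome' if chromosome == 'Genome' else chromosome


for value in ['Genome', 'Genome assembly', 'Chromosome 6', '']:
    print(repr(value), '->', repr(rename_viral_genome(value)))
```

In pandas, the same exact-match behavior is available through `unp['chromosome'].replace({'Genome': 'Viral Chromosome'})`, which compares whole values rather than substrings.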

Assign unique identifiers (CURIEs)

In [30]:
unp['taxonomyId'] = 'taxonomy:' + unp['Organism ID']

The MD5 hash of the protein sequence serves as the identifier that links each gene to its protein

In [31]:
unp['id'] = unp['Sequence'].apply(lambda seq: 'md5:' + hashlib.md5(seq.encode()).hexdigest())
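
The identifier depends only on the sequence string, so identical sequences receive the same `id` regardless of which UniProt entry they come from. A small sketch of this property, using the hypothetical helper `sequence_id` and short made-up sequences:

```python
import hashlib


def sequence_id(sequence: str) -> str:
    """Return an 'md5:'-prefixed identifier derived from the sequence text."""
    return 'md5:' + hashlib.md5(sequence.encode()).hexdigest()


# Short made-up sequences, for illustration only
id_a = sequence_id('MADSNGTITVEEL')
id_b = sequence_id('MADSNGTITVEEL')
id_c = sequence_id('MESLVPGFNEKTH')
print(id_a == id_b, id_a == id_c)
```

Because identical sequences collapse to one identifier, the `drop_duplicates` call in the "Save genes" section removes rows that agree in every exported column, including this `id`.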

### Save genes

In [32]:
gene = unp[['id', 'taxonomyId', 'name', 'synonymes', 'chromosome']].copy()
gene.drop_duplicates(inplace=True)
gene.to_csv(NEO4J_IMPORT / '01a-UniProtGene.csv', index=False)

In [33]:
gene.sample(5)

Unnamed: 0,id,taxonomyId,name,synonymes,chromosome
38896,md5:e28eaf30c70af2a0974dfa2c150dabc9,taxonomy:9606,HLA-DQB2,HLA-DQB2;HLA-DXB,Chromosome 6
127132,md5:c65f849a442b70a039117af1e45f6200,taxonomy:9606,HLA-C,HLA-C,
388919,md5:3b10f38e2dae53e12417f0a7bab73183,taxonomy:10036,Rnf112,Rnf112,Viral Chromosome assembly
85186,md5:a3435e64a8eb097319ab810b77c16976,taxonomy:9606,LITAF,LITAF,Chromosome 16
42574,md5:dd5896560806b3664e43218be885e71d,taxonomy:9606,HLA-DQA1,HLA-DQA1,


#### Genes with no names

In [34]:
gene.query("name == ''").head()

Unnamed: 0,id,taxonomyId,name,synonymes,chromosome
42,md5:44ac354326e066c8a56b733a2eb409b1,taxonomy:2697049,,,
132,md5:3921b4859e760676e600fbf687bb10ee,taxonomy:2697049,,,
612,md5:a0e6d3d9c4e7aac7fca20d7ea58d7398,taxonomy:2697049,,,
646,md5:0e5acbb538e201e15bdc6ad30f2c37f6,taxonomy:2697049,,,
815,md5:7b37cc2515fc1acbd1d73dc4e63fd420,taxonomy:2697049,,,
