# UniProt Gene to Protein Mappings
**[Work in progress]**

This notebook downloads chromosome and gene information for reviewed viral and human proteins from UniProt and standardizes it for ingestion into the Knowledge Graph.

Data source: [UniProt](https://www.uniprot.org/)

Authors: Peter Rose (pwrose@ucsd.edu)

In [1]:
import os
import re
import hashlib
import urllib

import pandas as pd
import numpy as np

from pathlib import Path
from Bio import SeqIO


pd.options.display.max_rows = None  # display all rows
pd.options.display.max_columns = None  # display all columns

In [2]:
NEO4J_IMPORT = Path(os.getenv('NEO4J_IMPORT'))
print(NEO4J_IMPORT)

/Users/peter/Library/Application Support/com.Neo4j.Relate/data/dbmss/dbms-8bf637fc-0d20-4d9f-9c6f-f7e72e92a4da/import
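`os.getenv` returns `None` when a variable is unset, and `Path(None)` then fails with a `TypeError` that does not name the missing variable. The following is a minimal sketch of a more explicit check. The helper name `get_import_dir` is hypothetical and not part of the original notebook.

```python
import os
from pathlib import Path


def get_import_dir(var_name="NEO4J_IMPORT"):
    """Return the directory named by an environment variable.

    Hypothetical helper for illustration; raises a descriptive error
    instead of the TypeError produced by Path(None).
    """
    value = os.getenv(var_name)
    if not value:
        raise EnvironmentError(f"Environment variable {var_name} is not set")
    return Path(value)


# Usage (commented out so the sketch runs without the variable being set):
# NEO4J_IMPORT = get_import_dir()
```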


### Get list of organisms to be processed

In [3]:
genomes = pd.read_csv("../../reference_data/Genome.csv", dtype=str)

In [4]:
genomes['taxonomy'] = genomes['taxonomyId'].apply(lambda x: x.split(':')[1])
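To illustrate what this step does, here is a small sketch on toy data (the two-row frame `genomes_demo` is made up for the example; the real values come from `Genome.csv`). It strips the `taxonomy:` prefix from each CURIE, using the vectorized `.str` accessor instead of `apply`, which should give the same result for well-formed values.

```python
import pandas as pd

# Toy stand-in for Genome.csv; only the taxonomyId column is needed here
genomes_demo = pd.DataFrame({"taxonomyId": ["taxonomy:2697049", "taxonomy:9606"]})

# Split each CURIE once at the first ':' and keep the local identifier
genomes_demo["taxonomy"] = genomes_demo["taxonomyId"].str.split(":", n=1).str[1]

print(genomes_demo["taxonomy"].tolist())  # ['2697049', '9606']
```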

### Download data from UniProt using web services
The available column names are listed at https://www.uniprot.org/help/uniprotkb_column_names

In [5]:
columns = 'id,genes,sequence,proteome,organism-id'

In [6]:
urls = [f'https://www.uniprot.org/uniprot/?query=organism:{taxon}+AND+reviewed:yes&columns={columns}&format=tab'
        for taxon in genomes['taxonomy'].unique()]
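To make the query structure easier to inspect, the sketch below builds the same URL for a single taxon and prints it. The function name `build_query_url` is hypothetical; the URL pattern and column list are copied from the cells above, and the two taxon IDs are the ones that appear in this notebook's output.

```python
def build_query_url(taxon, columns="id,genes,sequence,proteome,organism-id"):
    """Build the tab-separated query URL for reviewed entries of one organism."""
    return (
        "https://www.uniprot.org/uniprot/"
        f"?query=organism:{taxon}+AND+reviewed:yes"
        f"&columns={columns}&format=tab"
    )


# Print the URLs without downloading anything
for taxon in ["2697049", "9606"]:
    print(build_query_url(taxon))
```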

In [7]:
unp = pd.concat((pd.read_csv(url, sep='\t', dtype='str') for url in urls))

In [8]:
unp.reset_index(drop=True, inplace=True)

### Standardize data

In [9]:
unp.fillna('', inplace=True)

Add the missing gene name for UniProt entry P0DTC1.
I've filed an issue about the missing gene name with the UniProt help desk.

In [10]:
index = unp.query("Entry == 'P0DTC1'").index.values[0]
unp.at[index, 'Gene names'] = '1a'
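The lookup via `.index.values[0]` raises an `IndexError` if the entry is absent from the download. A possible alternative, sketched here on toy data (`unp_demo` is made up for the example), is a boolean mask with `.loc`, which updates every matching row and does nothing when there is no match.

```python
import pandas as pd

# Toy frame with one entry that lacks a gene name
unp_demo = pd.DataFrame({
    "Entry": ["P0DTC1", "P0DTC7"],
    "Gene names": ["", "7a"],
})

# Select rows by accession and assign the gene name in one step
mask = unp_demo["Entry"] == "P0DTC1"
unp_demo.loc[mask, "Gene names"] = "1a"

print(unp_demo)
```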

In [11]:
print(unp.shape)

(20431, 5)


In [12]:
unp['proteome'] = unp['Proteomes'].apply(lambda r: r.split(','))

In [13]:
unp = unp.explode('proteome')

In [14]:
unp[['proteome', 'chromosome']] = unp['proteome'].str.split(':', expand=True)

In [15]:
unp.head()

Unnamed: 0,Entry,Gene names,Sequence,Proteomes,Organism ID,proteome,chromosome
0,P0DTD1,rep 1a-1b,MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGDSVEEVLSEARQHL...,UP000464024: Genome,2697049,UP000464024,Genome
1,P0DTC7,7a,MKIILFLALITLATCELYHYQECVRGTTVLLKEPCSSGTYEGNSPF...,UP000464024: Genome,2697049,UP000464024,Genome
2,P0DTD2,9b,MDPKISEMHPALRLVDPQIQLAVTRMENAVGRDQNNVGPKVYPIIL...,UP000464024: Genome,2697049,UP000464024,Genome
3,P0DTC9,N,MSDNGPQNQRNAPRITFGGPSDSTGSNQNGERSGARSKQRRPQGLP...,UP000464024: Genome,2697049,UP000464024,Genome
4,P0DTC3,3a,MDLFMRIFTIGTVTLKQGEIKDATPSDFVRATATIPIQASLPFGWL...,UP000464024: Genome,2697049,UP000464024,Genome
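The three cells above split the `Proteomes` column into one row per proteome entry and then separate the proteome ID from the chromosome label. Here is a minimal sketch of the same idea on toy data. The entry names `demo1` and `demo2` and the multi-chromosome value are illustrative assumptions, not rows from the download.

```python
import pandas as pd

demo = pd.DataFrame({
    "Entry": ["demo1", "demo2"],
    "Proteomes": [
        "UP000464024: Genome",
        "UP000005640: Chromosome 5, UP000005640: Chromosome X",
    ],
})

# One list element per proteome entry, then one row per element
demo["proteome"] = demo["Proteomes"].str.split(",")
demo = demo.explode("proteome").reset_index(drop=True)

# Split 'ID: label' once at the first ':' and trim surrounding whitespace
parts = demo["proteome"].str.split(":", n=1, expand=True)
demo["proteome"] = parts[0].str.strip()
demo["chromosome"] = parts[1].str.strip()

print(demo[["Entry", "proteome", "chromosome"]])
```

Stripping both parts avoids a leading space on the proteome ID of the second and later entries, which follow a `, ` separator.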


In [16]:
unp.fillna('', inplace=True)

Assign the first gene name as the preferred gene name

In [17]:
unp['name'] = unp['Gene names'].str.split(' ', expand=True)[0]

Create a semicolon-separated list of gene name synonyms so that it can be stored in a single CSV field

In [18]:
unp['synonymes'] = unp['Gene names'].str.replace(' ', ';')
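As a small illustration of the two steps above, the sketch below derives the preferred name and the synonym list from a toy series (`gene_names` is made up; the first value mirrors a row in this notebook's sample output). Passing `regex=False` is an addition here that makes the literal replacement explicit.

```python
import pandas as pd

# Space-separated gene names as delivered by UniProt; '' stands for a missing value
gene_names = pd.Series(["CEACAM1 BGP BGP1", "PAIP1", ""])

# The first token is the preferred name
name = gene_names.str.split(" ").str[0]

# All tokens, joined by ';' so they fit in one CSV field
synonyms = gene_names.str.replace(" ", ";", regex=False)

print(name.tolist())      # ['CEACAM1', 'PAIP1', '']
print(synonyms.tolist())  # ['CEACAM1;BGP;BGP1', 'PAIP1', '']
```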

In [19]:
unp['chromosome'] = unp['chromosome'].str.strip()

Assign unique identifiers (CURIEs)

In [20]:
unp['taxonomyId'] = 'taxonomy:' + unp['Organism ID']

The MD5 hash of the protein sequence serves as the identifier that links each gene record to the corresponding protein

In [21]:
unp['id'] = unp['Sequence'].apply(lambda seq: 'md5:' + hashlib.md5(seq.encode()).hexdigest())
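The identifier scheme can be sketched as a standalone function. The name `sequence_id` is hypothetical; the body mirrors the lambda above. Identical sequences always map to the same identifier, which is what allows the link by sequence.

```python
import hashlib


def sequence_id(seq):
    """Return an 'md5:'-prefixed identifier for a protein sequence string."""
    return "md5:" + hashlib.md5(seq.encode()).hexdigest()


# Identical sequences give identical identifiers; different sequences differ
print(sequence_id("MKIILFLALITLATC") == sequence_id("MKIILFLALITLATC"))  # True
print(sequence_id("MKIILFLALITLATC") == sequence_id("MDPKISEMHPALRLV"))  # False
```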

### Save genes

In [22]:
gene = unp[['id', 'taxonomyId', 'name', 'synonymes', 'chromosome']].copy()
gene.drop_duplicates(inplace=True)
gene.to_csv(NEO4J_IMPORT / '01a-UniProtGene.csv', index=False)

In [23]:
gene.sample(5)

Unnamed: 0,id,taxonomyId,name,synonymes,chromosome
11125,md5:375f706e0683e4e205df51fd872d35f5,taxonomy:9606,PAIP1,PAIP1,Chromosome 5
2243,md5:9312466155149a5be81c11dab864bc55,taxonomy:9606,CEACAM1,CEACAM1;BGP;BGP1,Chromosome 19
6670,md5:9630d56c0ff23f5fc3dc85f12dadb4f3,taxonomy:9606,NNMT,NNMT,Chromosome 11
3437,md5:8180914e8918bc888b09341befdf6e60,taxonomy:9606,SLC25A6,SLC25A6;ANT3;CDABP0051,Chromosome X
7513,md5:68f017355ace61c1d60008a052eb5d44,taxonomy:9606,MENT,MENT;C1orf56;UNQ547/PRO1104,Chromosome 1
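To check the export step in isolation, the following sketch deduplicates a toy frame and writes it to a temporary directory instead of the Neo4j import folder. The frame `gene_demo`, its shortened `id` values, and the file name `genes_demo.csv` are made up for the example.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy gene table with one exact duplicate row
gene_demo = pd.DataFrame({
    "id": ["md5:aaa", "md5:aaa", "md5:bbb"],
    "taxonomyId": ["taxonomy:9606", "taxonomy:9606", "taxonomy:9606"],
    "name": ["PAIP1", "PAIP1", "NNMT"],
    "synonymes": ["PAIP1", "PAIP1", "NNMT"],
    "chromosome": ["Chromosome 5", "Chromosome 5", "Chromosome 11"],
})

# Drop rows that are identical across all columns
gene_demo = gene_demo.drop_duplicates()

# Write without the index column and read the file back to verify its shape
with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "genes_demo.csv"
    gene_demo.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path, dtype=str)

print(reloaded.shape)  # (2, 5)
```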
