# UniProt Gene to Protein Mappings
**[Work in progress]**

This notebook downloads and standardizes Chromosome and Gene information for viral and human proteins for ingestion into the Knowledge Graph.

Data source: [UniProt](https://www.uniprot.org/)

Authors: Peter Rose (pwrose@ucsd.edu)

In [1]:
import os
import re
import hashlib 
import urllib

import pandas as pd
import numpy as np

from pathlib import Path
from Bio import SeqIO


pd.options.display.max_rows = None  # display all rows
pd.options.display.max_columns = None  # display all columns

In [2]:
NEO4J_IMPORT = Path(os.getenv('NEO4J_IMPORT'))
print(NEO4J_IMPORT)

/Users/peter/Library/Application Support/com.Neo4j.Relate/data/dbmss/dbms-8bf637fc-0d20-4d9f-9c6f-f7e72e92a4da/import


### Get list of organisms to be processed

In [3]:
genomes = pd.read_csv("../../reference_data/Genome.csv", dtype=str)

In [4]:
genomes['taxonomy'] = genomes['taxonomyId'].apply(lambda x: x.split(':')[1])

### Download data from UniProt using web services
https://www.uniprot.org/help/uniprotkb_column_names

In [5]:
columns = 'id,genes,sequence,proteome,organism-id'

In [6]:
urls = [f'https://www.uniprot.org/uniprot/?query=organism:{taxon}+AND+reviewed:yes&columns={columns}&format=tab'
        for taxon in genomes['taxonomy'].unique()]
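
The query URLs above are assembled by string interpolation. As a minimal sketch, the same URL can be built with `urllib.parse.urlencode`, which handles escaping of the spaces in the query. The helper name `build_query_url` is hypothetical and not part of this notebook; the endpoint and parameters are copied from the cell above.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.uniprot.org/uniprot/?"  # endpoint used in the cell above

def build_query_url(taxon, columns):
    """Hypothetical helper: build the tab-format query URL for reviewed entries of one taxon."""
    params = {
        "query": f"organism:{taxon} AND reviewed:yes",
        "columns": columns,
        "format": "tab",
    }
    # safe=':,' keeps ':' and ',' unescaped; spaces are encoded as '+'
    return BASE_URL + urlencode(params, safe=":,")

url = build_query_url("2697049", "id,genes,sequence,proteome,organism-id")
print(url)
```

With these arguments the result should match the first URL printed by the notebook.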

In [7]:
print(urls)

['https://www.uniprot.org/uniprot/?query=organism:2697049+AND+reviewed:yes&columns=id,genes,sequence,proteome,organism-id&format=tab', 'https://www.uniprot.org/uniprot/?query=organism:1263720+AND+reviewed:yes&columns=id,genes,sequence,proteome,organism-id&format=tab', 'https://www.uniprot.org/uniprot/?query=organism:694009+AND+reviewed:yes&columns=id,genes,sequence,proteome,organism-id&format=tab', 'https://www.uniprot.org/uniprot/?query=organism:443239+AND+reviewed:yes&columns=id,genes,sequence,proteome,organism-id&format=tab', 'https://www.uniprot.org/uniprot/?query=organism:31631+AND+reviewed:yes&columns=id,genes,sequence,proteome,organism-id&format=tab', 'https://www.uniprot.org/uniprot/?query=organism:11137+AND+reviewed:yes&columns=id,genes,sequence,proteome,organism-id&format=tab', 'https://www.uniprot.org/uniprot/?query=organism:277944+AND+reviewed:yes&columns=id,genes,sequence,proteome,organism-id&format=tab', 'https://www.uniprot.org/uniprot/?query=organism:9606+AND+reviewed:y

In [8]:
unp = pd.concat((pd.read_csv(url, sep='\t', dtype='str') for url in urls))

In [9]:
unp.reset_index(drop=True,inplace=True)

### Standardize data

In [10]:
unp.fillna('', inplace=True)

Add the missing gene name for UniProt entry P0DTC1.
I've filed an issue about the missing gene name with the UniProt help desk.

In [11]:
index = unp.query("Entry == 'P0DTC1'").index.values[0]
unp.at[index, 'Gene names'] = '1a'
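
The lookup above raises an `IndexError` if the entry is absent from the download. A minimal sketch of a more defensive variant, using a boolean mask on a toy table, is shown below. The helper `set_gene_name` and the toy values are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the downloaded table (illustrative values only)
toy = pd.DataFrame({
    "Entry": ["P0DTC1", "P0DTC7"],
    "Gene names": ["", "7a"],
})

def set_gene_name(df, entry, gene_name):
    """Hypothetical helper: set 'Gene names' for one entry, do nothing if it is absent."""
    mask = df["Entry"] == entry
    if mask.any():
        df.loc[mask, "Gene names"] = gene_name
    return df

set_gene_name(toy, "P0DTC1", "1a")
set_gene_name(toy, "XXXXXX", "n/a")  # absent entry: table is left unchanged
print(toy)
```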

In [12]:
print(unp.shape)

(20457, 5)


In [13]:
unp['proteome'] = unp['Proteomes'].apply(lambda r: r.split(','))

In [14]:
unp = unp.explode('proteome')

In [15]:
unp[['proteome', 'chromosome']] = unp['proteome'].str.split(':', expand=True)

In [16]:
unp.head()

| | Entry | Gene names | Sequence | Proteomes | Organism ID | proteome | chromosome |
|---|---|---|---|---|---|---|---|
| 0 | P0DTD1 | rep 1a-1b | MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGDSVEEVLSEARQHL... | UP000464024: Genome | 2697049 | UP000464024 | Genome |
| 1 | P0DTC7 | 7a | MKIILFLALITLATCELYHYQECVRGTTVLLKEPCSSGTYEGNSPF... | UP000464024: Genome | 2697049 | UP000464024 | Genome |
| 2 | P0DTD2 | 9b | MDPKISEMHPALRLVDPQIQLAVTRMENAVGRDQNNVGPKVYPIIL... | UP000464024: Genome | 2697049 | UP000464024 | Genome |
| 3 | P0DTC9 | N | MSDNGPQNQRNAPRITFGGPSDSTGSNQNGERSGARSKQRRPQGLP... | UP000464024: Genome | 2697049 | UP000464024 | Genome |
| 4 | P0DTC3 | 3a | MDLFMRIFTIGTVTLKQGEIKDATPSDFVRATATIPIQASLPFGWL... | UP000464024: Genome | 2697049 | UP000464024 | Genome |
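
The split, explode, and split steps above turn one row per protein into one row per proteome and chromosome pair. The sketch below replays them on a toy table. The multi-chromosome value for entry `B` is an assumed example of the `Proteomes` format, not data taken from UniProt, and the index is reset after `explode` to keep it unique.

```python
import pandas as pd

# Toy table; entry 'B' is an assumed example with two chromosome assignments
toy = pd.DataFrame({
    "Entry": ["A", "B"],
    "Proteomes": ["UP1: Genome", "UP2: Chromosome 1, UP2: Chromosome 2"],
})

# One list element per proteome assignment
toy["proteome"] = toy["Proteomes"].apply(lambda r: r.split(","))

# One row per list element; reset the index so it stays unique
toy = toy.explode("proteome").reset_index(drop=True)

# Separate the proteome id from the chromosome label
toy[["proteome", "chromosome"]] = toy["proteome"].str.split(":", expand=True)
toy["proteome"] = toy["proteome"].str.strip()
toy["chromosome"] = toy["chromosome"].str.strip()

print(toy[["Entry", "proteome", "chromosome"]])
```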


In [17]:
unp.fillna('', inplace=True)

Assign the first gene name as the preferred gene name

In [18]:
unp['name'] = unp['Gene names'].str.split(' ', expand=True)[0]

Create a semicolon-separated list of gene name synonyms so it can be represented in a CSV file

In [19]:
unp['synonymes'] = unp['Gene names'].str.replace(' ', ';')
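
As a small illustration of the two steps above, the sketch below derives the preferred name and the synonym list from a few space-separated gene name strings. The inputs are modeled on values shown elsewhere in this notebook, and `regex=False` is an optional addition that makes the literal replacement explicit.

```python
import pandas as pd

gene_names = pd.Series(["rep 1a-1b", "HACD1 PTPLA", "N", ""])

# Preferred name: the first space-separated token
name = gene_names.str.split(" ").str[0]

# Synonyms: all tokens joined by ';' (literal replacement, not a regex)
synonyms = gene_names.str.replace(" ", ";", regex=False)

print(pd.DataFrame({"name": name, "synonymes": synonyms}))
```

Note that the synonym list still contains the preferred name as its first element.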

In [20]:
unp['chromosome'] = unp['chromosome'].str.strip()

For consistency, rename viral chromosomes labeled 'Genome' to 'Viral Chromosome'

In [21]:
unp['chromosome'] = unp['chromosome'].str.replace('Genome', 'Viral Chromosome')

Assign unique identifiers (CURIEs)

In [22]:
unp['taxonomyId'] = 'taxonomy:' + unp['Organism ID']

The MD5 hash of the protein sequence is used as the identifier that links each gene to the protein level

In [23]:
unp['id'] = unp['Sequence'].apply(lambda seq: 'md5:' + hashlib.md5(seq.encode()).hexdigest())
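
A minimal sketch of the identifier scheme, using short made-up sequences: the same sequence always maps to the same identifier, and a single-residue change gives a different one. The function name `sequence_id` is hypothetical.

```python
import hashlib

def sequence_id(seq):
    """Hypothetical helper: return an 'md5:'-prefixed hex digest of a protein sequence."""
    return "md5:" + hashlib.md5(seq.encode()).hexdigest()

id_a = sequence_id("MKIILFLALITLATC")   # made-up short sequence
id_b = sequence_id("MKIILFLALITLATC")   # identical sequence
id_c = sequence_id("MKIILFLALITLATA")   # last residue changed

print(id_a == id_b)  # identical sequences share one identifier
print(id_a == id_c)  # different sequences get different identifiers
```

One consequence of this design is that entries with identical sequences, even from different organisms, share the same `id`.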

### Save genes

In [24]:
gene = unp[['id', 'taxonomyId', 'name', 'synonymes', 'chromosome']].copy()
gene.drop_duplicates(inplace=True)
gene.to_csv(NEO4J_IMPORT / '01a-UniProtGene.csv', index = False)

In [25]:
gene.sample(5)

| | id | taxonomyId | name | synonymes | chromosome |
|---|---|---|---|---|---|
| 10692 | md5:6b3bd79a7ec1f17c3499536a7f0b941e | taxonomy:9606 | HACD1 | HACD1;PTPLA | Chromosome 10 |
| 12697 | md5:bd6f52c713012fcf14328c0ad20302b1 | taxonomy:9606 | AP1M2 | AP1M2 | Chromosome 19 |
| 11689 | md5:ddaa9c02a66ee77d6cefbae79d1f9af3 | taxonomy:9606 | SENP5 | SENP5;FKSG45 | Chromosome 3 |
| 1565 | md5:8d4bf7acc2a2a6decd2bb4b6052840cd | taxonomy:9606 | ZMYND8 | ZMYND8;KIAA1125;PRKCBP1;RACK7 | Chromosome 20 |
| 12977 | md5:eefe3d9635a4130e00044c5dcbc19a79 | taxonomy:9606 | GRIA2 | GRIA2;GLUR2 | Chromosome 4 |
