# Protein Metadata

**Created**: 19 February 2022

I will generate metadata for the proteins in Yuxin's MS proteomics data: a table that maps UniProt IDs to Ensembl gene IDs and gene information. This table will be used later for colocalization.

## Environment

In [1]:
import os
import re

import numpy as np
import pandas as pd

In [2]:
os.chdir('/nfs/users/nfs_n/nm18/eQTL_pQTL_Characterization/')

## Load Data

In [3]:
uniprot = pd.read_csv('02_pQTL_Mapping/data/UniProt/HUMAN_9606_idmapping_selected.tab', sep='\t', header=None, low_memory=False)
uniprot = uniprot.iloc[:,[0, 18]]
uniprot.columns = ['UniProt_ID', 'Gene_IDs']

In [4]:
uniprot.head()

Unnamed: 0,UniProt_ID,Gene_IDs
0,P31946,ENSG00000166913
1,P62258,ENSG00000108953; ENSG00000274474
2,Q04917,ENSG00000128245
3,P61981,ENSG00000170027
4,P31947,ENSG00000175793


In [5]:
prot_info = pd.read_csv('/nfs/users/nfs_n/nm18/gains_team282/proteomics/MS2019_processed_data/protein_info_291_MS2019.csv')

In [6]:
prot_info.head()

Unnamed: 0,Protein,Accession,Entry Name,Gene.Names,Protein Length,Coverage,Protein Existence,Description
0,sp|A0A075B6I9|LV746_HUMAN,A0A075B6I9,LV746_HUMAN,IGLV7-46,117,76.9,3:Protein inferred from homology,Immunoglobulin lambda variable 7-46
1,sp|A0A075B6P5|KV228_HUMAN,A0A075B6P5,KV228_HUMAN,IGKV2-28,120,100.0,3:Protein inferred from homology,Immunoglobulin kappa variable 2-28
2,sp|A0A087WW87|KV240_HUMAN,A0A087WW87,KV240_HUMAN,IGKV2-40,121,81.0,3:Protein inferred from homology,Immunoglobulin kappa variable 2-40
3,sp|A0A0B4J1V0|HV315_HUMAN,A0A0B4J1V0,HV315_HUMAN,IGHV3-15,119,84.0,3:Protein inferred from homology,Immunoglobulin heavy variable 3-15
4,sp|A0A0B4J1V2|HV226_HUMAN,A0A0B4J1V2,HV226_HUMAN,IGHV2-26,119,70.6,3:Protein inferred from homology,Immunoglobulin heavy variable 2-26


In [7]:
gene_info = pd.read_table('/lustre/scratch118/humgen/resources/rna_seq_genomes/Homo_sapiens.GRCh38.99.gtf', sep='\t', skiprows=5, header=None, low_memory=False)
gene_info.columns = ['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attribute']
gene_info = gene_info[gene_info.feature == 'gene']
gene_info = gene_info.reset_index(drop=True)

In [8]:
gene_info.head()

Unnamed: 0,seqname,source,feature,start,end,score,strand,frame,attribute
0,1,havana,gene,11869,14409,.,+,.,"gene_id ""ENSG00000223972""; gene_version ""5""; g..."
1,1,havana,gene,14404,29570,.,-,.,"gene_id ""ENSG00000227232""; gene_version ""5""; g..."
2,1,mirbase,gene,17369,17436,.,-,.,"gene_id ""ENSG00000278267""; gene_version ""1""; g..."
3,1,havana,gene,29554,31109,.,+,.,"gene_id ""ENSG00000243485""; gene_version ""5""; g..."
4,1,mirbase,gene,30366,30503,.,+,.,"gene_id ""ENSG00000284332""; gene_version ""1""; g..."


In [9]:
gene_info_attributes = {
    'gene_id': list(),
    'gene_name': list(),
    'gene_biotype': list()
}

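for attribute_str in gene_info.attribute.values:
    # Split 'key "value"; key "value"; ...' into [key, '"value"'] pairs, skipping empty fragments
    attributes = [x.split() for x in attribute_str.split(';') if x.strip()]
    # Strip the double quotes from each value
    attributes = {x[0]: re.sub('"', '', x[1]) for x in attributes}
    # Keep only the attributes of interest
    for key in gene_info_attributes:
        gene_info_attributes[key].append(attributes[key])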

gene_info.pop('attribute')

gene_info = pd.concat((gene_info, pd.DataFrame(gene_info_attributes)), axis=1)
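
The attribute parsing above can also be sketched with a regular expression. This is a minimal illustration, not the method used in this notebook: `ATTRIBUTE_PATTERN` and `parse_gtf_attributes` are hypothetical names, and the example string is modelled on the first gene row rather than read from the GTF file. Matching on the quotes keeps values that contain spaces intact, which whitespace splitting would truncate.

```python
import re

# Matches key "value" pairs in a GTF attribute string (hypothetical helper pattern).
ATTRIBUTE_PATTERN = re.compile(r'(\S+) "([^"]*)"')


def parse_gtf_attributes(attribute_str):
    """Return a dict of key -> value for one GTF attribute string."""
    return dict(ATTRIBUTE_PATTERN.findall(attribute_str))


# Illustrative string in the Ensembl GTF style (assumed, not read from the file).
example = (
    'gene_id "ENSG00000223972"; gene_version "5"; gene_name "DDX11L1"; '
    'gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene";'
)
parsed = parse_gtf_attributes(example)
print(parsed['gene_id'], parsed['gene_name'], parsed['gene_biotype'])

# A value with a space survives because the pattern reads up to the closing quote.
spaced = parse_gtf_attributes('gene_id "G1"; note "two words";')
print(spaced['note'])
```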

In [10]:
gene_info['tss'] = gene_info.apply(lambda x: x.start if x.strand == '+' else x.end, axis=1)

In [11]:
gene_info.head()

Unnamed: 0,seqname,source,feature,start,end,score,strand,frame,gene_id,gene_name,gene_biotype,tss
0,1,havana,gene,11869,14409,.,+,.,ENSG00000223972,DDX11L1,transcribed_unprocessed_pseudogene,11869
1,1,havana,gene,14404,29570,.,-,.,ENSG00000227232,WASH7P,unprocessed_pseudogene,29570
2,1,mirbase,gene,17369,17436,.,-,.,ENSG00000278267,MIR6859-1,miRNA,17436
3,1,havana,gene,29554,31109,.,+,.,ENSG00000243485,MIR1302-2HG,lncRNA,29554
4,1,mirbase,gene,30366,30503,.,+,.,ENSG00000284332,MIR1302-2,miRNA,30366
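
The TSS rule used here (gene start on the `+` strand, gene end on the `-` strand) can also be written without a row-wise `apply`. The sketch below uses `np.where` on a toy table whose two rows are copied from the output above; `toy` is a hypothetical name, and on the full table this would be expected to give the same values faster.

```python
import numpy as np
import pandas as pd

# Toy rows copied from the gene table shown above.
toy = pd.DataFrame({
    'gene_name': ['DDX11L1', 'WASH7P'],
    'start': [11869, 14404],
    'end': [14409, 29570],
    'strand': ['+', '-'],
})

# TSS is the start coordinate on the forward strand and the end coordinate on the reverse strand.
toy['tss'] = np.where(toy['strand'] == '+', toy['start'], toy['end'])
print(toy)
```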


## Generate Combined Table

We only care about genes on the autosomes and the X chromosome, so I filter the gene information to these chromosomes.

In [12]:
gene_info.seqname.unique()

array(['1', '2', '3', '4', '5', '6', '7', 'X', '8', '9', '11', '10', '12',
       '13', '14', '15', '16', '17', '18', '20', '19', 'Y', '22', '21',
       'MT', 'KI270728.1', 'KI270727.1', 'KI270442.1', 'GL000225.1',
       'GL000009.2', 'GL000194.1', 'GL000205.2', 'GL000195.1',
       'KI270733.1', 'GL000219.1', 'GL000216.2', 'KI270744.1',
       'KI270734.1', 'GL000213.1', 'GL000220.1', 'GL000218.1',
       'KI270731.1', 'KI270750.1', 'KI270721.1', 'KI270726.1',
       'KI270711.1', 'KI270713.1'], dtype=object)

In [13]:
chromosomes = [str(x) for x in range(1, 23)] + ['X']
gene_info_auto_X = gene_info[gene_info.seqname.isin(chromosomes)]

I merge the UniProt mapping with the protein information from Yuxin's MS proteomics data to attach Ensembl gene IDs to each protein. Some proteins have no Ensembl gene ID in the UniProt mapping; I keep these with a missing Gene_ID so they can be curated by hand. Other proteins are associated with multiple Ensembl gene IDs (for example, hemoglobin subunit alpha, which is encoded by both HBA1 and HBA2), so I expand these into one row per protein-gene pair.

In [14]:
prot_info_all = prot_info.merge(uniprot, left_on='Accession', right_on='UniProt_ID')
prot_info_all.Gene_IDs = prot_info_all.Gene_IDs.fillna('')

In [15]:
prot_info_all_genes_expanded = prot_info_all.Gene_IDs.str.split('; ', expand=True)
prot_info_all = prot_info_all \
    .join(prot_info_all_genes_expanded) \
    .melt(id_vars=prot_info_all.columns, value_vars=prot_info_all_genes_expanded.columns) \
    .dropna(subset=['value'])

del prot_info_all['Gene_IDs']
del prot_info_all['variable']
prot_info_all = prot_info_all.rename(columns={'value': 'Gene_ID'})
prot_info_all = prot_info_all.sort_values(by='Accession').reset_index(drop=True)
prot_info_all.Gene_ID = prot_info_all.Gene_ID.replace({'': np.nan})
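
The split-join-melt step above produces one row per protein-gene pair. A shorter way to get the same shape, sketched here on toy data, is `Series.str.split` followed by `DataFrame.explode`. This is an alternative illustration rather than the code that produced the outputs in this notebook; `toy` and `long` are hypothetical names, and the accessions are only placeholders for the table layout.

```python
import numpy as np
import pandas as pd

# Toy input modelled on the merged table; the last row has no Ensembl ID.
toy = pd.DataFrame({
    'Accession': ['P31946', 'P62258', 'A2VEC9'],
    'Gene_IDs': ['ENSG00000166913', 'ENSG00000108953; ENSG00000274474', np.nan],
})

long = (
    toy
    .assign(Gene_ID=toy['Gene_IDs'].str.split('; '))  # list of IDs per protein; NaN stays NaN
    .explode('Gene_ID')                                # one row per (protein, gene) pair
    .drop(columns='Gene_IDs')
    .sort_values('Accession')
    .reset_index(drop=True)
)
print(long)
```

Because missing values pass through `str.split` and `explode` unchanged, the `fillna('')` and later `replace({'': np.nan})` round trip is not needed in this variant.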

In [16]:
prot_info_all[prot_info_all.Gene_ID.isna()]

Unnamed: 0,Protein,Accession,Entry Name,Gene.Names,Protein Length,Coverage,Protein Existence,Description,UniProt_ID,Gene_ID
22,sp|A2VEC9|SSPO_HUMAN,A2VEC9,SSPO_HUMAN,SSPO,5150,61.6,2:Experimental evidence at transcript level,SCO-spondin,A2VEC9,
23,sp|A6NJ88|SGE2P_HUMAN,A6NJ88,SGE2P_HUMAN,SAGE2P,616,47.1,5:Protein uncertain,Putative SAGE1-like protein,A6NJ88,
49,sp|P00736|C1R_HUMAN,P00736,C1R_HUMAN,C1R,705,87.7,1:Experimental evidence at protein level,Complement C1r subcomponent,P00736,
89,sp|P01782|HV309_HUMAN,P01782,HV309_HUMAN,IGHV3-9,118,83.9,1:Experimental evidence at protein level,Immunoglobulin heavy variable 3-9,P01782,
90,sp|P01834|IGKC_HUMAN,P01834,IGKC_HUMAN,IGKC,107,100.0,1:Experimental evidence at protein level,Immunoglobulin kappa constant,P01834,
95,sp|P01860|IGHG3_HUMAN,P01860,IGHG3_HUMAN,IGHG3,377,92.6,1:Experimental evidence at protein level,Immunoglobulin heavy constant gamma 3,P01860,
117,sp|P02746|C1QB_HUMAN,P02746,C1QB_HUMAN,C1QB,253,92.1,1:Experimental evidence at protein level,Complement C1q subcomponent subunit B,P02746,
236,sp|P26927|HGFL_HUMAN,P26927,HGFL_HUMAN,MST1,711,73.4,1:Experimental evidence at protein level,Hepatocyte growth factor-like protein,P26927,
302,sp|Q4V348|Z658B_HUMAN,Q4V348,Z658B_HUMAN,ZNF658B,819,45.2,2:Experimental evidence at transcript level,Zinc finger protein 658B,Q4V348,


The UniProt mapping lacks Ensembl gene IDs for these proteins, so I assign them by manual curation. P01782 (IGHV3-9) is not assigned here and keeps a missing Gene_ID.

In [17]:
prot_info_all.loc[prot_info_all.Accession == 'A2VEC9', 'Gene_ID'] = ['ENSG00000197558']  # SSPO
prot_info_all.loc[prot_info_all.Accession == 'A6NJ88', 'Gene_ID'] = ['ENSG00000198022']  # SAGE2P
prot_info_all.loc[prot_info_all.Accession == 'P00736', 'Gene_ID'] = ['ENSG00000159403']  # C1R
prot_info_all.loc[prot_info_all.Accession == 'P01834', 'Gene_ID'] = ['ENSG00000211592']  # IGKC
prot_info_all.loc[prot_info_all.Accession == 'P01860', 'Gene_ID'] = ['ENSG00000211897']  # IGHG3
prot_info_all.loc[prot_info_all.Accession == 'P02746', 'Gene_ID'] = ['ENSG00000173369']  # C1QB
prot_info_all.loc[prot_info_all.Accession == 'P26927', 'Gene_ID'] = ['ENSG00000173531']  # MST1
prot_info_all.loc[prot_info_all.Accession == 'Q4V348', 'Gene_ID'] = ['ENSG00000198416']  # ZNF658B
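
The curated assignments can also be kept in a single dictionary and applied in one step, which makes it easy to list accessions that remain unassigned. The sketch below runs on a toy table; `curated_gene_ids`, `toy`, and `still_missing` are hypothetical names, and only two of the curated entries are repeated for brevity.

```python
import numpy as np
import pandas as pd

# Subset of the curated assignments above, keyed by UniProt accession.
curated_gene_ids = {
    'A2VEC9': 'ENSG00000197558',  # SSPO
    'P00736': 'ENSG00000159403',  # C1R
}

# Toy table: three accessions without a Gene_ID and one that already has one.
toy = pd.DataFrame({
    'Accession': ['A2VEC9', 'P00736', 'P01782', 'P31946'],
    'Gene_ID': [np.nan, np.nan, np.nan, 'ENSG00000166913'],
})

# Fill only missing Gene_IDs; existing values are left untouched.
toy['Gene_ID'] = toy['Gene_ID'].fillna(toy['Accession'].map(curated_gene_ids))

# Report accessions that still lack a Gene_ID after curation.
still_missing = toy.loc[toy['Gene_ID'].isna(), 'Accession'].tolist()
print(still_missing)
```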

In [18]:
prot_info_all.head()

Unnamed: 0,Protein,Accession,Entry Name,Gene.Names,Protein Length,Coverage,Protein Existence,Description,UniProt_ID,Gene_ID
0,sp|A0A075B6I9|LV746_HUMAN,A0A075B6I9,LV746_HUMAN,IGLV7-46,117,76.9,3:Protein inferred from homology,Immunoglobulin lambda variable 7-46,A0A075B6I9,ENSG00000211649
1,sp|A0A075B6P5|KV228_HUMAN,A0A075B6P5,KV228_HUMAN,IGKV2-28,120,100.0,3:Protein inferred from homology,Immunoglobulin kappa variable 2-28,A0A075B6P5,ENSG00000244116
2,sp|A0A075B6P5|KV228_HUMAN,A0A075B6P5,KV228_HUMAN,IGKV2-28,120,100.0,3:Protein inferred from homology,Immunoglobulin kappa variable 2-28,A0A075B6P5,ENSG00000282025
3,sp|A0A087WSY6|KVD15_HUMAN,A0A087WSY6,KVD15_HUMAN,IGKV3D-15,115,96.5,3:Protein inferred from homology,Immunoglobulin kappa variable 3D-15,A0A087WSY6,ENSG00000224041
4,sp|A0A087WW87|KV240_HUMAN,A0A087WW87,KV240_HUMAN,IGKV2-40,121,81.0,3:Protein inferred from homology,Immunoglobulin kappa variable 2-40,A0A087WW87,ENSG00000273962


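Merge the protein information with gene information from the GTF file used for RNA-Seq mapping, which is from Ensembl release 99 (GRCh38). The left merge keeps proteins whose gene ID is absent from the GTF, leaving their gene columns empty.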

In [19]:
# Use the gene table filtered to the autosomes and chromosome X
metadata = prot_info_all.merge(gene_info_auto_X, left_on='Gene_ID', right_on='gene_id', how='left')

In [20]:
metadata.head()

Unnamed: 0,Protein,Accession,Entry Name,Gene.Names,Protein Length,Coverage,Protein Existence,Description,UniProt_ID,Gene_ID,...,feature,start,end,score,strand,frame,gene_id,gene_name,gene_biotype,tss
0,sp|A0A075B6I9|LV746_HUMAN,A0A075B6I9,LV746_HUMAN,IGLV7-46,117,76.9,3:Protein inferred from homology,Immunoglobulin lambda variable 7-46,A0A075B6I9,ENSG00000211649,...,gene,22369614.0,22370087.0,.,+,.,ENSG00000211649,IGLV7-46,IG_V_gene,22369614.0
1,sp|A0A075B6P5|KV228_HUMAN,A0A075B6P5,KV228_HUMAN,IGKV2-28,120,100.0,3:Protein inferred from homology,Immunoglobulin kappa variable 2-28,A0A075B6P5,ENSG00000244116,...,gene,89221698.0,89222461.0,.,-,.,ENSG00000244116,IGKV2-28,IG_V_gene,89222461.0
2,sp|A0A075B6P5|KV228_HUMAN,A0A075B6P5,KV228_HUMAN,IGKV2-28,120,100.0,3:Protein inferred from homology,Immunoglobulin kappa variable 2-28,A0A075B6P5,ENSG00000282025,...,,,,,,,,,,
3,sp|A0A087WSY6|KVD15_HUMAN,A0A087WSY6,KVD15_HUMAN,IGKV3D-15,115,96.5,3:Protein inferred from homology,Immunoglobulin kappa variable 3D-15,A0A087WSY6,ENSG00000224041,...,gene,90114838.0,90115402.0,.,+,.,ENSG00000224041,IGKV3D-15,IG_V_gene,90114838.0
4,sp|A0A087WW87|KV240_HUMAN,A0A087WW87,KV240_HUMAN,IGKV2-40,121,81.0,3:Protein inferred from homology,Immunoglobulin kappa variable 2-40,A0A087WW87,ENSG00000273962,...,gene,89330110.0,89330429.0,.,-,.,ENSG00000273962,IGKV2-40,IG_V_gene,89330429.0
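
Row 2 above shows a gene ID (ENSG00000282025) with no match in the gene table. One way to list all such IDs is to pass `indicator=True` to `merge`, and `validate='many_to_one'` additionally checks that each `gene_id` appears only once in the gene table. The sketch below uses toy tables with values copied from the output above; `proteins`, `genes`, `checked`, and `unmatched` are hypothetical names.

```python
import pandas as pd

# Toy tables; ENSG00000282025 is absent from the toy gene table, as in the output above.
proteins = pd.DataFrame({
    'Accession': ['A0A075B6I9', 'A0A075B6P5', 'A0A075B6P5'],
    'Gene_ID': ['ENSG00000211649', 'ENSG00000244116', 'ENSG00000282025'],
})
genes = pd.DataFrame({
    'gene_id': ['ENSG00000211649', 'ENSG00000244116'],
    'gene_name': ['IGLV7-46', 'IGKV2-28'],
})

checked = proteins.merge(
    genes, left_on='Gene_ID', right_on='gene_id',
    how='left', validate='many_to_one', indicator=True,
)

# Rows marked 'left_only' have a Gene_ID with no entry in the gene table.
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Gene_ID'].tolist()
print(unmatched)
```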


In [21]:
metadata.to_csv('/nfs/users/nfs_n/nm18/gains_team282/nikhil/colocalization/eQTL_pQTL_metadata.tsv', index=False, sep='\t')