# Uniprot dataset update analysis

### Task: Compare dataset from manuscript 1 to current version of Uniprot

- Uniprot releases updates every 8 weeks
    - Current version is 2022_05, manuscript 1 used 2021_04
- Are there changes to proteins/annotations in new Uniprot release?


### Results:

Dataset:
- Swissprot 2022_05
- Protein evidence at transcript level or protein level
- Transmembrane domain keyword
- Substrate transport keywords
- Clustering at 70% identity threshold
- Only valid sequences

Stats:
- Transmembrane transporters ("Transmembrane" keyword + substrate transport keywords):
    - Removed annotations
        - 67 annotations in 47 proteins were deleted
        - Only one of them might have been one of the outliers in the dataset in M1 (a human sugar transporter that is actually an ATP/ADP exchanger).
    - Added annotations
        - Before clustering: 386 proteins have 478 new substrate annotations
        - After clustering: 222 cluster representatives, which carry 328 substrate annotations in total (new and pre-existing)
        - Most were added in human, mouse and rat
        - Most commonly added substrates (before clustering): Lipid, Ion, Protein, Potassium, Amino-acid
- Cellular component keywords (before filtering and clustering, top 4):
    - Added annotations
        - Total: 5258
        - Membrane: 1402
        - Nucleus: 1082
        - Transmembrane: 894
        - Cell membrane: 893
    - Deleted annotations
        - Total: 210
        - Membrane: 90
        - Transmembrane: 51
        - Cell membrane: 24
        - Nucleus: 22


### Conclusion

Using the most recent version of Uniprot provides additional annotations, and therefore more samples, that could be useful for training our models. The deleted annotations show that some errors in the previous version have been corrected, among them a human sugar transporter (P78383) that was wrongly classified and may have been one of the proteins we removed from our dataset as outliers in Manuscript 1.

## Imports

In [1]:
from subpred.dataset import (
    create_dataset,
    SUBSTRATE_KEYWORDS,
    get_keywords_df,
    KEYWORDS_LOCATION,
)


Dataset used in manuscript 1:

In [2]:
df_swissprot_manuscript1 = create_dataset(
    # keywords_substrate_filter=["Amino-acid transport", "Sugar transport"],
    # keywords_component_filter=["Transmembrane"],
    # keywords_transport_filter=["Transport"],
    input_file="../data/raw/uniprot/swissprot_data_2021_04_manuscript1.tsv.gz",
    multi_substrate="keep",
    verbose=True,
    # tax_ids_filter=[3702, 9606, 83333, 559292],
    # outliers=outliers,
    # sequence_clustering=70,
    evidence_code=2,
    invalid_amino_acids="remove_amino_acids",
    # force_update=True,
    tcdb_substrates_file="../data/raw/tcdb/tcdb_substrates.tsv",
)
df_swissprot_manuscript1.shape


Found pickle, reading...


(141894, 16)

Swissprot 2022_05:

In [3]:
df_swissprot_new = create_dataset(
    # keywords_substrate_filter=["Amino-acid transport", "Sugar transport"],
    # keywords_component_filter=["Transmembrane"],
    # keywords_transport_filter=["Transport"],
    input_file="../data/raw/uniprot/uniprot_2022_05_evidence1-2_nofragments.tsv",
    multi_substrate="keep",
    verbose=True,
    # tax_ids_filter=[3702, 9606, 83333, 559292],
    # outliers=outliers,
    # sequence_clustering=70,
    evidence_code=2,
    invalid_amino_acids="remove_amino_acids",
    # force_update=True,
    tcdb_substrates_file="../data/raw/tcdb/tcdb_substrates.tsv",
)
df_swissprot_new.shape


Found pickle, reading...


(146597, 16)

## New entries in general:

In [4]:
df_swissprot_new.shape[0] - df_swissprot_manuscript1.shape[0]


4703
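
The difference in row counts is a net figure: it does not say how many entries were added and how many were removed. A minimal sketch of how the two could be separated, using toy accession indexes (the IDs below are hypothetical and only stand in for the `Uniprot` index of the two dataframes):

```python
import pandas as pd

# Toy accession indexes standing in for the two releases (hypothetical IDs).
old_ids = pd.Index(["P1", "P2", "P3"], name="Uniprot")
new_ids = pd.Index(["P2", "P3", "P4", "P5"], name="Uniprot")

added = new_ids.difference(old_ids)    # entries only in the new release
removed = old_ids.difference(new_ids)  # entries only in the old release

# The net difference equals the number added minus the number removed.
net = len(new_ids) - len(old_ids)
print(list(added), list(removed), net)
```

On the real data, the same two `difference` calls could be applied to `df_swissprot_new.index` and `df_swissprot_manuscript1.index`, assuming both are indexed by accession.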

## New keyword annotations:

In [5]:
df_keywords_m1 = get_keywords_df(df_swissprot_manuscript1)
df_keywords_m1


Unnamed: 0,Uniprot,keyword
0,Q5SW45,Cell projection
1,Q5SW45,Cilium
2,Q5SW45,Cilium biogenesis/degradation
3,Q5SW45,Cytoplasm
4,Q5SW45,Cytoskeleton
...,...,...
1173582,P50402,Nucleus
1173583,P50402,Phosphoprotein
1173584,P50402,Reference proteome
1173585,P50402,Transmembrane
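
The implementation of `get_keywords_df` is not shown in this notebook. The long format above (one row per protein and keyword) could be produced along the following lines. This is only a sketch under assumptions: the helper name `keywords_long`, the column name `keywords` and the `;` separator are hypothetical and may differ from what `subpred` actually does.

```python
import pandas as pd


def keywords_long(df, column="keywords", sep=";"):
    """Return one (Uniprot, keyword) row per keyword annotation."""
    long = (
        df[column]
        .str.split(sep)      # "A;B" -> ["A", "B"]
        .explode()           # one row per list element, index is repeated
        .str.strip()
        .rename("keyword")
        .reset_index()       # move the Uniprot index into a column
    )
    # Drop empty strings left over from trailing separators.
    return long[long.keyword != ""].reset_index(drop=True)


# Toy input with hypothetical accessions.
toy = pd.DataFrame(
    {"keywords": ["Nucleus;Transmembrane", "Transport"]},
    index=pd.Index(["P1", "P2"], name="Uniprot"),
)
df_long = keywords_long(toy)
print(df_long)
```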


In [6]:
df_keywords_m2 = get_keywords_df(df_swissprot_new)
df_keywords_m2


Unnamed: 0,Uniprot,keyword
0,A0A0C5B5G6,DNA-binding
1,A0A0C5B5G6,Mitochondrion
2,A0A0C5B5G6,Nucleus
3,A0A0C5B5G6,Osteogenesis
4,A0A0C5B5G6,Reference proteome
...,...,...
1219343,Q9T1X7,Reference proteome
1219344,Q9UAS2,Acetylation
1219345,Q9UAS2,Lipid-binding
1219346,Q9UAS2,Transport


In total, there are 45,761 more keyword annotations in the new release:

In [7]:
df_keywords_m2.shape[0] - df_keywords_m1.shape[0]


45761

## New substrate annotations

No filtering for transmembrane transporters has been applied yet; these are all substrate annotations in Swissprot.

In [8]:
print(SUBSTRATE_KEYWORDS)


{'Electron transport', 'Peptide transport', 'Hydrogen ion transport', 'Viral movement protein', 'mRNA transport', 'Translocation', 'Sugar transport', 'Protein transport', 'Bacterial flagellum protein export', 'Ammonia transport', 'Potassium transport', 'Copper transport', 'Ion transport', 'Lipid transport', 'Sodium/potassium transport', 'Nickel transport', 'Zinc transport', 'Bacteriocin transport', 'Cobalt transport', 'Sulfate transport', 'Phosphonate transport', 'Neurotransmitter transport', 'Polysaccharide transport', 'Oxygen transport', 'Calcium transport', 'Chloride', 'Amino-acid transport', 'Iron transport', 'Anion exchange', 'Phosphate transport', 'Sodium transport'}


In [9]:
df_substrates_m1 = df_keywords_m1[df_keywords_m1.keyword.isin(SUBSTRATE_KEYWORDS)]


In [10]:
df_substrates_m2 = df_keywords_m2[df_keywords_m2.keyword.isin(SUBSTRATE_KEYWORDS)]


In net terms, there are 645 more substrate annotations in all of Swissprot:

In [11]:
df_substrates_m2.shape[0] - df_substrates_m1.shape[0]


645

#### Merging dataframes to see differences:

In [12]:
df_substrates_merged = df_substrates_m2.merge(
    df_substrates_m1, indicator=True, how="outer"
)
df_substrates_merged


Unnamed: 0,Uniprot,keyword,_merge
0,B7U540,Ion transport,both
1,B7U540,Potassium transport,both
2,O00161,Protein transport,both
3,O00168,Ion transport,both
4,O00168,Potassium transport,both
...,...,...,...
19766,Q10045,Protein transport,right_only
19767,Q5U520,Protein transport,right_only
19768,Q7ZUU1,Protein transport,right_only
19769,Q53HI1,Protein transport,right_only
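
How the `_merge` column separates added from deleted annotations can be seen on a small example. The two toy frames below are hypothetical; the new release is the left frame, as in the cell above, so `left_only` marks added annotations and `right_only` marks deleted ones.

```python
import pandas as pd

# Toy (Uniprot, keyword) tables for the old and new release (hypothetical data).
old = pd.DataFrame(
    {
        "Uniprot": ["A", "A", "B"],
        "keyword": ["Ion transport", "Sugar transport", "Protein transport"],
    }
)
new = pd.DataFrame(
    {
        "Uniprot": ["A", "B", "B"],
        "keyword": ["Ion transport", "Protein transport", "Lipid transport"],
    }
)

# Outer merge on all shared columns; indicator=True adds the _merge column.
merged = new.merge(old, how="outer", indicator=True)

added = merged[merged._merge == "left_only"]     # only in the new release
deleted = merged[merged._merge == "right_only"]  # only in the old release
print(added[["Uniprot", "keyword"]])
print(deleted[["Uniprot", "keyword"]])
```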


### Deleted annotations:

67 keyword annotations in 47 proteins were deleted in the new version, most of them related to protein transport:

In [13]:
display(
    df_substrates_merged[
        df_substrates_merged._merge == "right_only"
    ].keyword.value_counts()
)
# display(df_substrates_merged[df_substrates_merged._merge == "right_only"])
print(
    df_substrates_merged[df_substrates_merged._merge == "right_only"]
    .Uniprot.unique()
    .size
)


Protein transport     37
Translocation         11
mRNA transport        10
Sugar transport        4
Lipid transport        3
Chloride               1
Electron transport     1
Name: keyword, dtype: int64

47
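
As a quick consistency check, the per-keyword counts above can be summed and compared with the other totals reported in this notebook. The sketch below simply re-enters the printed numbers by hand; the 712 added annotations are taken from the "Added annotations" section further down.

```python
import pandas as pd

# Deleted substrate annotations per keyword, copied from the output above.
deleted_counts = pd.Series(
    {
        "Protein transport": 37,
        "Translocation": 11,
        "mRNA transport": 10,
        "Sugar transport": 4,
        "Lipid transport": 3,
        "Chloride": 1,
        "Electron transport": 1,
    }
)

n_deleted = int(deleted_counts.sum())  # deleted annotations (not proteins)
n_added = 712                          # added annotations reported in this notebook
net = n_added - n_deleted              # should match the net difference of 645

print(n_deleted, net)
```

The number of deleted annotations is larger than the number of affected proteins (47) because a protein can lose more than one substrate keyword.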


#### What are the four deleted sugar transporters?

It looks like they all belong to the same protein family (solute carrier family 35 member B1), which was previously described as related to UDP-galactose transport. The new release annotates them as ATP/ADP exchangers.

In [14]:
import pandas as pd

pd.set_option("max_colwidth", 200)
df_substrates_merged_deleted_sugar = df_substrates_merged[
    (df_substrates_merged._merge == "right_only")
    & (df_substrates_merged.keyword == "Sugar transport")
]
print("manuscript 1:")
display(
    df_substrates_merged_deleted_sugar.set_index("Uniprot", drop=True).join(
        df_swissprot_manuscript1[["protein_names", "organism_id"]], how="left"
    )
)
print("manuscript 2:")
df_substrates_merged_deleted_sugar.set_index("Uniprot", drop=True).join(
    df_swissprot_new[["protein_names", "organism_id"]], how="left"
)


manuscript 1:


Uniprot,keyword,_merge,protein_names,organism_id
P78383,Sugar transport,right_only,Solute carrier family 35 member B1 (UDP-galactose transporter-related protein 1) (UGTrel1) (hUGTrel1),9606
P97858,Sugar transport,right_only,Solute carrier family 35 member B1 (UDP-galactose translocator 2) (UDP-galactose transporter-related protein 1) (UGTrel1),10090
Q8MII5,Sugar transport,right_only,Solute carrier family 35 member B1 (Endoplasmic reticulum nucleotide sugar transporter 1),9913
Q6V7K3,Sugar transport,right_only,Solute carrier family 35 member B1 (UDP-galactose transporter-related protein 1) (UGTrel1),10116


manuscript 2:


Uniprot,keyword,_merge,protein_names,organism_id
P78383,Sugar transport,right_only,Solute carrier family 35 member B1 (ATP/ADP exchanger ER) (AXER) (Endoplasmic reticulum ATP/ADP translocase) (UDP-galactose transporter-related protein 1) (UGTrel1),9606
P97858,Sugar transport,right_only,Solute carrier family 35 member B1 (ATP/ADP exchanger ER) (AXER) (Endoplasmic reticulum ATP/ADP translocase) (UDP-galactose transporter-related protein 1) (UGTrel1),10090
Q8MII5,Sugar transport,right_only,Solute carrier family 35 member B1 (ATP/ADP exchanger ER) (AXER) (Endoplasmic reticulum ATP/ADP translocase) (UDP-galactose transporter-related protein 1) (UGTrel1),9913
Q6V7K3,Sugar transport,right_only,Solute carrier family 35 member B1 (ATP/ADP exchanger ER) (AXER) (Endoplasmic reticulum ATP/ADP translocase) (UDP-galactose transporter-related protein 1) (UGTrel1),10116


They are all annotated with "Transmembrane" and "Transport", so they would have passed our keyword filter, and at least the human protein was in our dataset. We did have trouble with some UDP-galactose transporters in the human dataset and removed some of them as outliers, so this misannotation could be the reason.

In [15]:
tmp = df_keywords_m1[
    df_keywords_m1.Uniprot.isin(df_substrates_merged_deleted_sugar.Uniprot)
]
tmp[tmp.keyword.isin(["Transport", "Transmembrane"])]


Unnamed: 0,Uniprot,keyword
163486,P78383,Transmembrane
163488,P78383,Transport
163520,P97858,Transmembrane
163522,P97858,Transport
178701,Q8MII5,Transmembrane
178703,Q8MII5,Transport
398971,Q6V7K3,Transmembrane
398973,Q6V7K3,Transport


### Added annotations

Substrate annotations that were added between releases 2021_04 and 2022_05:

- There are many new lipid transport annotations, followed by ion and protein transport.

In [16]:
df_substrates_new = df_substrates_merged[df_substrates_merged._merge == "left_only"]
print(len(df_substrates_new))
df_substrates_new.keyword.value_counts()


712


Lipid transport               233
Ion transport                 107
Protein transport              90
Electron transport             51
Amino-acid transport           31
Potassium transport            27
Hydrogen ion transport         27
Translocation                  23
Sodium transport               23
Phosphate transport            17
Chloride                       17
Sugar transport                14
mRNA transport                 12
Calcium transport              11
Iron transport                  8
Zinc transport                  6
Oxygen transport                5
Anion exchange                  3
Viral movement protein          2
Peptide transport               2
Cobalt transport                1
Copper transport                1
Neurotransmitter transport      1
Name: keyword, dtype: int64

#### How many of them are transmembrane transporters?

So far, we have neither filtered for transmembrane transporters nor clustered the sequences. How does this change the additional data?

In [17]:
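# Count the distinct required keywords per protein; nunique() is robust to duplicated rows
required_keywords = ["Transport", "Transmembrane"]
keyword_matches = (
    df_keywords_m2[df_keywords_m2.keyword.isin(required_keywords)]
    .groupby("Uniprot")
    .keyword.nunique()
)
# Keep only proteins that are annotated with both keywords
transmembrane_transport = set(
    keyword_matches[keyword_matches == len(required_keywords)].index
)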
df_substrates_new_transmembrane_transport = df_substrates_new[
    df_substrates_new.Uniprot.isin(transmembrane_transport)
]
display(df_substrates_new_transmembrane_transport.keyword.value_counts())
print(len(df_substrates_new_transmembrane_transport))
print(len(df_substrates_new_transmembrane_transport.Uniprot.unique()))


Lipid transport               184
Ion transport                  79
Protein transport              40
Potassium transport            27
Amino-acid transport           26
Sodium transport               18
Phosphate transport            17
Electron transport             17
Chloride                       16
Hydrogen ion transport         14
Calcium transport              10
Sugar transport                10
Translocation                   7
Zinc transport                  6
Anion exchange                  3
Iron transport                  1
Cobalt transport                1
Copper transport                1
Neurotransmitter transport      1
Name: keyword, dtype: int64

478
386
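
The selection of proteins that carry both the "Transport" and the "Transmembrane" keyword can be illustrated on toy data. The sketch below uses hypothetical accessions and counts distinct keywords per protein; it also shows why counting rows instead could be misleading if a keyword row were ever duplicated.

```python
import pandas as pd

required = {"Transport", "Transmembrane"}

# Toy keyword table (hypothetical accessions); P3 has a duplicated row.
kw = pd.DataFrame(
    {
        "Uniprot": ["P1", "P1", "P1", "P2", "P3", "P3"],
        "keyword": [
            "Transport",
            "Transmembrane",
            "Nucleus",
            "Transport",
            "Transmembrane",
            "Transmembrane",
        ],
    }
)

matches = kw[kw.keyword.isin(required)].groupby("Uniprot").keyword

n_distinct = matches.nunique()  # distinct required keywords per protein
n_rows = matches.size()         # plain row count per protein

both = set(n_distinct[n_distinct == len(required)].index)
both_by_rows = set(n_rows[n_rows == len(required)].index)

print(both)          # only P1 has both keywords
print(both_by_rows)  # the row count would also (wrongly) admit P3
```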


So, there are 478 new annotations for 386 transmembrane transporters before clustering. How are they distributed across organisms?

- Most were added in human (9606), followed by mouse (10090) and rat (10116). This could make these organisms suitable for ion transporter training sets.

In [18]:
df_organisms_new = df_swissprot_new[["organism_id"]].reset_index(drop=False)

In [19]:
# df_substrates_new_transmembrane_transport.merge
# df_swissprot_new

df_substrates_new_transmembrane_transport = (
    df_substrates_new_transmembrane_transport.merge(
        df_organisms_new, how="left", on="Uniprot"
    )
)
print(df_substrates_new_transmembrane_transport.organism_id.value_counts())


9606      64
10090     60
10116     44
9913      24
6239      19
          ..
29397      1
395019     1
83332      1
10036      1
37316      1
Name: organism_id, Length: 86, dtype: int64
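
The NCBI taxonomy IDs in the output are easier to read with species names attached. A small sketch with the five most frequent IDs, re-entered by hand from the output above; the mapping dictionary `TAXON_NAMES` is introduced here for illustration, and the share calculation assumes that all 478 new annotations have an organism ID.

```python
import pandas as pd

# NCBI taxonomy IDs of the five most frequent organisms in the output above.
TAXON_NAMES = {
    9606: "Homo sapiens",
    10090: "Mus musculus",
    10116: "Rattus norvegicus",
    9913: "Bos taurus",
    6239: "Caenorhabditis elegans",
}

# New annotations per organism, copied from the output above (top 5 only).
counts = pd.Series({9606: 64, 10090: 60, 10116: 44, 9913: 24, 6239: 19})
named = counts.rename(index=TAXON_NAMES)

# Share of human, mouse and rat among all 478 new annotations.
share_top3 = counts.iloc[:3].sum() / 478

print(named)
print(round(share_top3, 3))
```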


#### Filtering out invalid proteins:

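How many proteins are in our filtered dataset, where only proteins with evidence at protein or transcript level and valid sequences are allowed?

- All 386 proteins are present in the dataset, which was created with `evidence_code=2` and `invalid_amino_acids="remove_amino_acids"`. This matches the 386 sequences that are passed to cd-hit below.
- The number of proteins is lower than the 478 annotations because proteins often have multiple substrate annotations.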

In [20]:
df_swissprot_new_sequence_only = df_swissprot_new[["sequence"]].reset_index(drop=False)
tmp = df_swissprot_new_sequence_only[
    df_swissprot_new_sequence_only.Uniprot.isin(
        df_substrates_new_transmembrane_transport.Uniprot
    )
]
tmp


Unnamed: 0,Uniprot,sequence
105,O14975,MLSAIYTVLAGLLFLPLLVNLCCPYFFQDIGYFLKVAAVGRRVRSYGKRRPARTILRAFLEKARQTPHKPFLLFRDETLTYAQVDRRSNQVARALHDHLGLRQGDCVALLMGNEPAYVWLWLGLVKLGCAMACLNYNIRAKSLLHCFQCCGAKVLLVSPELQAAVEEILPSLKKDDVSIYYVSRTSNTDGIDSFLD...
116,O15162,MDKQNSQMNASHPETNLPVGYPPQYPPTAFQGPPGYSGYPGPQVSYPPPPAGHSGPGPAGFPVPNQPVYNQPVYNQPVGAAGVPWMPAPQPPLNCPPGLEYLSQIDQILIHQQIELLEVLTGFETNNKYEIKNSFGQRVYFAAEDTDCCTRNCCGPSRPFTLRIIDNMGQEVITLERPLRCSSCCCPCCLQEIEIQ...
123,O15260,MGQNDLMGTAEDFADQFLRVTKQYLPHVARLCLISTFLEDGIRMWFQWSEQRDYIDTTWNCGYLLASSFVFLNLLGQLTGCVLVLSRNFVQYACFGLFGIIALQTIAYSILWDLKFLMRNLALGGGLLLLLAESRSEGKSMFAGVPTMRESSPKQYMQLGGRVLLVLMFMTLLHFDASFFSIVQNIVGTALMILVA...
317,O95342,MSDSVILRSIKKFGEENDGFESDKSYNNDKKSRLQDEKKGDGVRVGFFQLFRFSSSTDIWLMFVGSLCAFLHGIAQPGVLLIFGTMTDVFIDYDVELQELQIPGKACVNNTIVWTNSSLNQNMTNGTRCGLLNIESEMIKFASYYAGIAVAVLITGYIQICFWVIAAARQIQKMRKFYFRRIMRMEIGWFDCNSVG...
1382,Q5T3U5,MERLLAQLCGSSAAWPLPLWEGDTTGHCFTQLVLSALPHALLAVLSACYLGTPRSPDYILPCSPGWRLRLAASFLLSVFPLLDLLPVALPPGAGPGPIGLEVLAGCVAAVAWISHSLALWVLAHSPHGHSRGPLALALVALLPAPALVLTVLWHCQRGTLLPPLLPGPMARLCLLILQLAALLAYALGWAAPGGPR...
...,...,...
137657,P54251,AIFKSYCEIIVTHFPFDEQNCSMKLGTWTYDSSVVVINPESDQPDLSNFMESGEWVIKEARGWKHNVTYACCLTTHYLDITYHF
137787,Q15B89,KGEAPAKSSTHRHDEELGMASAETLTVFLKLLAAGFYGVSSFLIVVVNKSVLTNYRFPSSLCVGLGQMVATVAVLWVGKALRVVKFPDFDRNVPRKTFPLPLLYFGNQITGLFSTKKLNLPMFTVLRRFSILFTMFAEGVLLKKTFSWGIKMTVFAMIIGAFVAASSDLAFDLEGYVFILINDVLTAANGAYVKQK...
140258,P54248,AIFKSYCEIIVTHFPFDEQNCSMKLGTWTYDGSKVAINAESEHPDLSNFMESGEWVIKEARGWKHWVFYACCPTTPYLDITYHF
141566,P54249,AIFKSYCEIIVTHFPFDEQNCSMKLGTWTYDGSVVAINPENDQPDLSNFMESGEWVIKEARGWKHRVIYACCPSTPYLDITYHF


#### How many are left after clustering?

- After clustering, we have 222 transporters with new annotations

In [21]:
from subpred.cdhit import cd_hit

valid_transporters_cluster_representatives = cd_hit(
    tmp.set_index("Uniprot").sequence, identity_threshold=70
)
print(len(valid_transporters_cluster_representatives))


cd-hit: clustered 386 sequences into 222 clusters at threshold 70
222


These 222 proteins have 328 substrate annotations in the new release. Note that this count includes pre-existing annotations, not only the newly added ones:

- Ion is now the largest class, followed by Lipid, Protein, Sodium and Amino-acid.


In [22]:
# All substrate annotations (new and pre-existing) of the cluster representatives
df_substrates_representatives = df_substrates_m2[
    df_substrates_m2.Uniprot.isin(valid_transporters_cluster_representatives)
]
df_substrates_representatives.keyword.value_counts()


Ion transport                 86
Lipid transport               71
Protein transport             30
Sodium transport              27
Amino-acid transport          25
Potassium transport           20
Electron transport            15
Hydrogen ion transport        12
Calcium transport              9
Sugar transport                9
Zinc transport                 6
Translocation                  4
Phosphate transport            3
Anion exchange                 3
Iron transport                 2
Chloride                       2
Neurotransmitter transport     2
Cobalt transport               1
Copper transport               1
Name: keyword, dtype: int64

We can't give meaningful statistics on the organism distribution of these final proteins, since the representative of each cluster is selected by cd-hit based on criteria such as sequence length, not based on the organism.
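
Why the organism of a representative is somewhat arbitrary can be shown with a toy version of greedy incremental clustering, in which sequences are processed from longest to shortest and the first member of each cluster becomes its representative. This is only an illustration, not cd-hit itself: the function `greedy_cluster` is hypothetical, the sequences are made up, and `difflib.SequenceMatcher.ratio()` is used as a crude stand-in for cd-hit's sequence identity.

```python
from difflib import SequenceMatcher


def greedy_cluster(sequences, threshold=0.7):
    """Return representative IDs; longer sequences are considered first."""
    representatives = []
    # Sort by length (longest first); ties are broken by ID for reproducibility.
    ordered = sorted(sequences.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for seq_id, seq in ordered:
        similar = any(
            SequenceMatcher(None, seq, sequences[rep], autojunk=False).ratio()
            >= threshold
            for rep in representatives
        )
        # A sequence that matches no existing representative starts a new cluster.
        if not similar:
            representatives.append(seq_id)
    return representatives


# Made-up sequences: the "mouse" sequence is one residue longer than "human".
toy = {
    "human": "MKTLLVAGGAVLLSA",
    "mouse": "MKTLLVAGGAVLLSAQ",
    "yeast": "MSDNEEQWRPHYCFI",
}
reps = greedy_cluster(toy)
print(reps)
```

In this toy case the mouse sequence represents the human/mouse cluster only because it is one residue longer, so counting organisms among representatives would say little about which organisms gained annotations.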

## Changes to cellular components

In [23]:
df_keywords_locations_merged = df_keywords_m2[
    df_keywords_m2.keyword.isin(KEYWORDS_LOCATION)
].merge(
    df_keywords_m1[df_keywords_m1.keyword.isin(KEYWORDS_LOCATION)],
    how="outer",
    indicator=True,
)
df_keywords_locations_merged


Unnamed: 0,Uniprot,keyword,_merge
0,A0A0C5B5G6,Mitochondrion,left_only
1,A0A0C5B5G6,Nucleus,left_only
2,A0A1B0GTW7,Membrane,left_only
3,A0A1B0GTW7,Transmembrane,left_only
4,A0PK11,Cell membrane,left_only
...,...,...,...
147002,O94325,Membrane,right_only
147003,Q1AE95,Membrane,right_only
147004,Q1AE95,Transmembrane,right_only
147005,Q8IM46,Membrane,right_only


#### Stats (before filtering and clustering)

210 cellular component keyword annotations were deleted, 5258 were added:

In [24]:
df_keywords_locations_deleted = df_keywords_locations_merged[
    df_keywords_locations_merged._merge == "right_only"
]
print(df_keywords_locations_deleted.shape[0])
df_keywords_locations_deleted.keyword.value_counts()

210


Membrane                        90
Transmembrane                   51
Cell membrane                   24
Nucleus                         22
Mitochondrion inner membrane     9
Mitochondrion                    6
Endoplasmic reticulum            5
Cell inner membrane              2
Cell outer membrane              1
Name: keyword, dtype: int64

In [25]:
df_keywords_locations_added = df_keywords_locations_merged[
    df_keywords_locations_merged._merge == "left_only"
]
print(df_keywords_locations_added.shape[0])
df_keywords_locations_added.keyword.value_counts()

5258


Membrane                        1402
Nucleus                         1082
Transmembrane                    894
Cell membrane                    893
Endoplasmic reticulum            381
Mitochondrion                    333
Mitochondrion inner membrane     101
Cell inner membrane               91
Mitochondrion outer membrane      45
Cell outer membrane               21
Postsynaptic cell membrane        14
Plastid inner membrane             1
Name: keyword, dtype: int64
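
The two distributions can be combined into a net change per keyword. The sketch below re-enters the printed counts by hand and subtracts the deleted from the added annotations; keywords that only occur in one of the two outputs are treated as zero in the other.

```python
import pandas as pd

# Added and deleted cellular component annotations, copied from the outputs above.
added = pd.Series(
    {
        "Membrane": 1402,
        "Nucleus": 1082,
        "Transmembrane": 894,
        "Cell membrane": 893,
        "Endoplasmic reticulum": 381,
        "Mitochondrion": 333,
        "Mitochondrion inner membrane": 101,
        "Cell inner membrane": 91,
        "Mitochondrion outer membrane": 45,
        "Cell outer membrane": 21,
        "Postsynaptic cell membrane": 14,
        "Plastid inner membrane": 1,
    }
)
deleted = pd.Series(
    {
        "Membrane": 90,
        "Transmembrane": 51,
        "Cell membrane": 24,
        "Nucleus": 22,
        "Mitochondrion inner membrane": 9,
        "Mitochondrion": 6,
        "Endoplasmic reticulum": 5,
        "Cell inner membrane": 2,
        "Cell outer membrane": 1,
    }
)

# Net change per keyword; fill_value=0 handles keywords missing on one side.
net = added.sub(deleted, fill_value=0).astype(int).sort_values(ascending=False)
print(net)
print(int(added.sum()), int(deleted.sum()), int(net.sum()))
```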