# Uniprot dataset analysis

Task: Compare dataset from manuscript 1 to current version of Uniprot

- Uniprot releases four updates per year
    - Current version is 2022_04, manuscript 1 used 2021_04
- Are there new proteins in the new Uniprot release?
- Comparison of samples annotated with keywords
    - In general
    - Individual organisms


## Imports

In [105]:
from subpred.dataset import create_dataset, SUBSTRATE_KEYWORDS, get_keywords_df


Dataset used in manuscript 1:

In [106]:
df_swissprot_manuscript1 = create_dataset(
    # keywords_substrate_filter=["Amino-acid transport", "Sugar transport"],
    # keywords_component_filter=["Transmembrane"],
    # keywords_transport_filter=["Transport"],
    input_file="../data/raw/uniprot/swissprot_data_2021_04_manuscript1.tsv.gz",
    multi_substrate="keep",
    verbose=True,
    # tax_ids_filter=[3702, 9606, 83333, 559292],
    # outliers=outliers,
    # sequence_clustering=70,
    evidence_code=2,
    invalid_amino_acids="remove_protein",
    # force_update=True,
    tcdb_substrates_file="../data/raw/tcdb/tcdb_substrates.tsv",
)
df_swissprot_manuscript1.shape


Found pickle, reading...


(141892, 16)

Swissprot 2022_04:

In [107]:
df_swissprot_new = create_dataset(
    # keywords_substrate_filter=["Amino-acid transport", "Sugar transport"],
    # keywords_component_filter=["Transmembrane"],
    # keywords_transport_filter=["Transport"],
    input_file="../data/raw/uniprot/swissprot_data_2022_04.tsv.gz",
    multi_substrate="keep",
    verbose=True,
    # tax_ids_filter=[3702, 9606, 83333, 559292],
    # outliers=outliers,
    # sequence_clustering=70,
    evidence_code=2,
    invalid_amino_acids="remove_protein",
    # force_update=True,
    tcdb_substrates_file="../data/raw/tcdb/tcdb_substrates.tsv",
)
df_swissprot_new.shape


Found pickle, reading...


(144929, 16)

## New entries in general

In [108]:
df_swissprot_new.shape[0] - df_swissprot_manuscript1.shape[0]


3037
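Note that the difference in row counts is a net figure: entries removed or merged between releases offset newly added ones. A minimal sketch of separating the two, assuming both frames are indexed by accession (as the later `reset_index` calls suggest). The accessions and the names `added` and `removed` are illustrative and not taken from the dataset:

```python
import pandas as pd

# Toy stand-ins for the two releases, indexed by accession (hypothetical IDs).
df_old = pd.DataFrame({"sequence": ["MKT", "MAA", "MGG"]}, index=["P1", "P2", "P3"])
df_new = pd.DataFrame(
    {"sequence": ["MKT", "MGG", "MLL", "MVV"]}, index=["P1", "P3", "P4", "P5"]
)

# Accessions present only in the new release, and only in the old one.
added = df_new.index.difference(df_old.index)
removed = df_old.index.difference(df_new.index)

# The net row-count change equals additions minus removals.
net = df_new.shape[0] - df_old.shape[0]
print(len(added), len(removed), net)
```

Applied to the real frames, the same two `difference` calls would show how many of the 3037 net entries are genuinely new accessions.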

## New keyword annotations

In [109]:
df_keywords_m1 = get_keywords_df(df_swissprot_manuscript1)
df_keywords_m1


Unnamed: 0,Uniprot,keyword
0,Q5SW45,Cell projection
1,Q5SW45,Cilium
2,Q5SW45,Cilium biogenesis/degradation
3,Q5SW45,Cytoplasm
4,Q5SW45,Cytoskeleton
...,...,...
1173568,P50402,Nucleus
1173569,P50402,Phosphoprotein
1173570,P50402,Reference proteome
1173571,P50402,Transmembrane


In [110]:
df_keywords_m2 = get_keywords_df(df_swissprot_new)
df_keywords_m2


Unnamed: 0,Uniprot,keyword
0,A0A024SC78,3D-structure
1,A0A024SC78,Disulfide bond
2,A0A024SC78,Hydrolase
3,A0A024SC78,Secreted
4,A0A024SC78,Serine esterase
...,...,...
1205125,V5XVW4,Capsid protein
1205126,V5XVW4,Direct protein sequencing
1205127,V5XVW4,Virion
1205128,W5X2N3,


There are 31,557 more keyword annotations in total:

In [111]:
df_keywords_m2.shape[0] - df_keywords_m1.shape[0]


31557
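The implementation of `get_keywords_df` lives in `subpred` and is not shown in this notebook. A plausible sketch of how such a long-format table (one row per protein and keyword) could be built, assuming a semicolon-separated `keywords` column; both the column name and the separator are assumptions:

```python
import pandas as pd

# Toy frame; the "keywords" column name and the ";" separator are assumptions.
df = pd.DataFrame(
    {"keywords": ["Transport;Transmembrane;Sugar transport", "Nucleus;Phosphoprotein"]},
    index=pd.Index(["P1", "P2"], name="Uniprot"),
)

# Split each keyword string into a list, then emit one row per (protein, keyword).
df_keywords_long = (
    df.keywords.str.split(";")
    .explode()
    .str.strip()
    .rename("keyword")
    .reset_index()
)
print(df_keywords_long)
```

In this layout, the row count is the number of keyword annotations, which is what the subtraction above compares between releases.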

## New substrate annotations

The data was not filtered for transmembrane transporters beforehand, so these counts cover all substrate annotations.

In [112]:
print(SUBSTRATE_KEYWORDS)


{'Protein transport', 'Bacteriocin transport', 'mRNA transport', 'Sodium transport', 'Peptide transport', 'Bacterial flagellum protein export', 'Ammonia transport', 'Phosphate transport', 'Electron transport', 'Nickel transport', 'Anion exchange', 'Copper transport', 'Potassium transport', 'Translocation', 'Sugar transport', 'Ion transport', 'Sodium/potassium transport', 'Sulfate transport', 'Calcium transport', 'Polysaccharide transport', 'Oxygen transport', 'Lipid transport', 'Cobalt transport', 'Viral movement protein', 'Iron transport', 'Neurotransmitter transport', 'Hydrogen ion transport', 'Amino-acid transport', 'Zinc transport', 'Phosphonate transport', 'Chloride'}


In [113]:
df_substrates_m1 = df_keywords_m1[df_keywords_m1.keyword.isin(SUBSTRATE_KEYWORDS)]
df_substrates_m1


Unnamed: 0,Uniprot,keyword
106,Q02455,Protein transport
108,Q02455,Translocation
110,Q02455,mRNA transport
294,Q9BXB5,Lipid transport
320,Q8VHK5,Ion transport
...,...,...
1172631,Q8S8A0,Amino-acid transport
1172637,Q3E965,Amino-acid transport
1172643,Q3EAV6,Amino-acid transport
1172654,Q3E8L0,Amino-acid transport


In [114]:
df_substrates_m2 = df_keywords_m2[df_keywords_m2.keyword.isin(SUBSTRATE_KEYWORDS)]
df_substrates_m2


Unnamed: 0,Uniprot,keyword
63,A0A061ACU2,Ion transport
478,A0A0B4K7J2,mRNA transport
482,A0A0B4K7J2,Protein transport
487,A0A0B4K7J2,Translocation
563,A0A0B7P9G0,Ion transport
...,...,...
1199326,R9RZK8,Oxygen transport
1200308,O59813,Amino-acid transport
1201946,Q01247,Protein transport
1204289,Q8YEE8,Amino-acid transport


There are 445 more substrate annotations in all of Swissprot (a net figure, additions minus deletions):

In [115]:
df_substrates_m2.shape[0] - df_substrates_m1.shape[0]


445

Merging dataframes to see differences:

In [116]:
df_substrates_merged = df_substrates_m2.merge(
    df_substrates_m1, indicator=True, how="outer"
)
df_substrates_merged


Unnamed: 0,Uniprot,keyword,_merge
0,A0A061ACU2,Ion transport,both
1,A0A0B4K7J2,mRNA transport,both
2,A0A0B4K7J2,Protein transport,both
3,A0A0B4K7J2,Translocation,both
4,A0A0B7P9G0,Ion transport,both
...,...,...,...
19565,Q10045,Protein transport,right_only
19566,Q5U520,Protein transport,right_only
19567,Q7ZUU1,Protein transport,right_only
19568,Q53HI1,Protein transport,right_only
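For readers unfamiliar with the `indicator` option: an outer merge without an `on` argument joins on all shared columns, and the added `_merge` column records which side each row came from. A small sketch on made-up data (the accessions are hypothetical):

```python
import pandas as pd

# Toy annotation tables (hypothetical accessions).
new = pd.DataFrame(
    {
        "Uniprot": ["P1", "P2", "P3"],
        "keyword": ["Ion transport", "Lipid transport", "Sugar transport"],
    }
)
old = pd.DataFrame(
    {"Uniprot": ["P1", "P4"], "keyword": ["Ion transport", "Protein transport"]}
)

# With no "on" argument, merge joins on all shared columns (Uniprot, keyword).
merged = new.merge(old, how="outer", indicator=True)

# "left_only": only in `new` (added); "right_only": only in `old` (deleted).
counts = merged["_merge"].value_counts()
print(counts)
```

Because the new release is the left frame here, `left_only` rows are added annotations and `right_only` rows are deleted ones.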


### Deleted annotations:

66 substrate keyword annotations were deleted in the new version, most of them related to protein transport:

In [117]:
display(
    df_substrates_merged[
        df_substrates_merged._merge == "right_only"
    ].keyword.value_counts()
)
display(df_substrates_merged[df_substrates_merged._merge == "right_only"])


Protein transport     37
Translocation         10
mRNA transport        10
Sugar transport        4
Lipid transport        3
Chloride               1
Electron transport     1
Name: keyword, dtype: int64

Unnamed: 0,Uniprot,keyword,_merge
19504,Q9VHN5,Protein transport,right_only
19505,Q96RL7,Protein transport,right_only
19506,P78383,Sugar transport,right_only
19507,P97858,Sugar transport,right_only
19508,Q27966,Protein transport,right_only
...,...,...,...
19565,Q10045,Protein transport,right_only
19566,Q5U520,Protein transport,right_only
19567,Q7ZUU1,Protein transport,right_only
19568,Q53HI1,Protein transport,right_only


#### What are the four deleted sugar transporters?

All four belong to the same protein family (solute carrier family 35 member B1), which was originally described as related to UDP-galactose transporters. The updated protein names show that they are now annotated as ATP/ADP exchangers.

In [118]:
import pandas as pd

pd.set_option("max_colwidth", 200)
df_substrates_merged_deleted_sugar = df_substrates_merged[
    (df_substrates_merged._merge == "right_only")
    & (df_substrates_merged.keyword == "Sugar transport")
]
print("manuscript 1:")
display(
    df_substrates_merged_deleted_sugar.set_index("Uniprot", drop=True).join(
        df_swissprot_manuscript1[["protein_names", "organism_id"]], how="left"
    )
)
print("manuscript 2:")
df_substrates_merged_deleted_sugar.set_index("Uniprot", drop=True).join(
    df_swissprot_new[["protein_names", "organism_id"]], how="left"
)


manuscript 1:


Unnamed: 0_level_0,keyword,_merge,protein_names,organism_id
Uniprot,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
P78383,Sugar transport,right_only,Solute carrier family 35 member B1 (UDP-galactose transporter-related protein 1) (UGTrel1) (hUGTrel1),9606
P97858,Sugar transport,right_only,Solute carrier family 35 member B1 (UDP-galactose translocator 2) (UDP-galactose transporter-related protein 1) (UGTrel1),10090
Q8MII5,Sugar transport,right_only,Solute carrier family 35 member B1 (Endoplasmic reticulum nucleotide sugar transporter 1),9913
Q6V7K3,Sugar transport,right_only,Solute carrier family 35 member B1 (UDP-galactose transporter-related protein 1) (UGTrel1),10116


manuscript 2:


Unnamed: 0_level_0,keyword,_merge,protein_names,organism_id
Uniprot,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
P78383,Sugar transport,right_only,Solute carrier family 35 member B1 (ATP/ADP exchanger ER) (AXER) (Endoplasmic reticulum ATP/ADP translocase) (UDP-galactose transporter-related protein 1) (UGTrel1),9606
P97858,Sugar transport,right_only,Solute carrier family 35 member B1 (ATP/ADP exchanger ER) (AXER) (Endoplasmic reticulum ATP/ADP translocase) (UDP-galactose transporter-related protein 1) (UGTrel1),10090
Q8MII5,Sugar transport,right_only,Solute carrier family 35 member B1 (ATP/ADP exchanger ER) (AXER) (Endoplasmic reticulum ATP/ADP translocase) (UDP-galactose transporter-related protein 1) (UGTrel1),9913
Q6V7K3,Sugar transport,right_only,Solute carrier family 35 member B1 (ATP/ADP exchanger ER) (AXER) (Endoplasmic reticulum ATP/ADP translocase) (UDP-galactose transporter-related protein 1) (UGTrel1),10116


All four are annotated with "Transmembrane" and "Transport", so they passed our keyword filter; at least the human protein was in our dataset. We did have trouble with some UDP-galactose transporters in the human dataset and removed some of them as outliers, which this reannotation could explain.

In [119]:
tmp = df_keywords_m1[
    df_keywords_m1.Uniprot.isin(df_substrates_merged_deleted_sugar.Uniprot)
]
tmp[tmp.keyword.isin(["Transport", "Transmembrane"])]


Unnamed: 0,Uniprot,keyword
163478,P78383,Transmembrane
163480,P78383,Transport
163512,P97858,Transmembrane
163514,P97858,Transport
178693,Q8MII5,Transmembrane
178695,Q8MII5,Transport
398963,Q6V7K3,Transmembrane
398965,Q6V7K3,Transport


### Added annotations

Substrate annotations that were added in the last year:

- Most of the new annotations are for lipid transport, followed by ion and protein transport.

In [120]:
df_substrates_new = df_substrates_merged[df_substrates_merged._merge == "left_only"]
print(len(df_substrates_new))
df_substrates_new.keyword.value_counts()


511


Lipid transport           213
Ion transport              68
Protein transport          45
Electron transport         33
Hydrogen ion transport     23
Amino-acid transport       19
Sodium transport           18
Chloride                   16
Phosphate transport        16
Potassium transport        16
mRNA transport             11
Sugar transport             9
Iron transport              7
Zinc transport              4
Translocation               3
Calcium transport           3
Viral movement protein      2
Peptide transport           2
Anion exchange              2
Cobalt transport            1
Name: keyword, dtype: int64
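The added and deleted counts can also be combined into a net change per keyword. A minimal sketch on a made-up merge result; the values are illustrative, and only the column names follow the notebook:

```python
import pandas as pd

# Toy merge result (values are made up).
merged = pd.DataFrame(
    {
        "keyword": [
            "Lipid transport",
            "Lipid transport",
            "Protein transport",
            "Protein transport",
            "Sugar transport",
        ],
        "_merge": ["left_only", "left_only", "left_only", "right_only", "right_only"],
    }
)

added = merged[merged["_merge"] == "left_only"].keyword.value_counts()
deleted = merged[merged["_merge"] == "right_only"].keyword.value_counts()

# Align on keyword; keywords missing on one side count as zero.
net_change = added.sub(deleted, fill_value=0).astype(int).sort_values(ascending=False)
print(net_change)
```

Summing `net_change` gives added minus deleted annotations, which for the real data should reproduce 511 - 66 = 445.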

#### How many of them are transmembrane transporters?

So far we have neither filtered for transmembrane transporters nor clustered the sequences. How do these steps change the number of added annotations?

In [121]:
keyword_matches = (
    df_keywords_m2[df_keywords_m2.keyword.isin(["Transport", "Transmembrane"])]
    .groupby("Uniprot")
    .apply(len)
)
transmembrane_transport = set(
    keyword_matches[keyword_matches == len(["Transport", "Transmembrane"])]
    .index.unique()
    .values
)
df_substrates_new_transmembrane_transport = df_substrates_new[
    df_substrates_new.Uniprot.isin(transmembrane_transport)
]
display(df_substrates_new_transmembrane_transport.keyword.value_counts())
print(len(df_substrates_new_transmembrane_transport))


Lipid transport           172
Ion transport              44
Phosphate transport        16
Potassium transport        16
Chloride                   15
Protein transport          15
Amino-acid transport       14
Sodium transport           13
Hydrogen ion transport     13
Electron transport          7
Sugar transport             5
Zinc transport              4
Anion exchange              2
Calcium transport           2
Cobalt transport            1
Iron transport              1
Translocation               1
Name: keyword, dtype: int64

341
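The filter above keeps accessions whose number of matching rows equals the number of required keywords. That is correct as long as the keyword table contains no duplicate (accession, keyword) rows. A slightly more defensive sketch counts distinct keywords instead. The toy table is made up, and whether duplicates ever occur in the real table is not established here:

```python
import pandas as pd

required = {"Transport", "Transmembrane"}

# Toy long-format keyword table (hypothetical accessions).
# P3 carries a duplicated "Transmembrane" row but no "Transport" keyword.
df_kw = pd.DataFrame(
    {
        "Uniprot": ["P1", "P1", "P1", "P2", "P3", "P3"],
        "keyword": [
            "Transport",
            "Transmembrane",
            "Sugar transport",
            "Transport",
            "Transmembrane",
            "Transmembrane",
        ],
    }
)

# Count distinct required keywords per protein; nunique ignores duplicate rows.
n_required = df_kw[df_kw.keyword.isin(required)].groupby("Uniprot").keyword.nunique()
both = set(n_required[n_required == len(required)].index)
print(both)
```

With a plain row count, P3 would pass the filter by accident; counting distinct keywords excludes it.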


So there are 341 new annotations for transmembrane transporters before clustering. How are they distributed across organisms?

- Most were added for human, followed by mouse and rat, which could make these organisms suitable for ion transporter training sets.

In [122]:
# Attach organism information to each new annotation via the accession.
df_organisms_new = df_swissprot_new[["organism_id", "organism"]].reset_index(drop=False)
df_substrates_new_transmembrane_transport = (
    df_substrates_new_transmembrane_transport.merge(
        df_organisms_new, how="left", on="Uniprot"
    )
)
print(df_substrates_new_transmembrane_transport.organism.value_counts())


Homo sapiens (Human)                                                                                           57
Mus musculus (Mouse)                                                                                           56
Rattus norvegicus (Rat)                                                                                        38
Bos taurus (Bovine)                                                                                            19
Caenorhabditis elegans                                                                                         19
                                                                                                               ..
Toxoplasma gondii                                                                                               1
Plasmodium berghei (strain Anka)                                                                                1
Escherichia coli O6:H1 (strain CFT073 / ATCC 700928 / UPEC)                             
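These counts are per annotation, so a protein with several substrate keywords contributes more than once to its organism. A sketch of the difference between counting annotations and counting proteins, on made-up data:

```python
import pandas as pd

# Toy annotation table after the organism merge (made-up values).
df_annot = pd.DataFrame(
    {
        "Uniprot": ["P1", "P1", "P2", "P3"],
        "keyword": [
            "Ion transport",
            "Sodium transport",
            "Lipid transport",
            "Lipid transport",
        ],
        "organism": [
            "Homo sapiens (Human)",
            "Homo sapiens (Human)",
            "Mus musculus (Mouse)",
            "Homo sapiens (Human)",
        ],
    }
)

# Annotations per organism: a protein with two keywords is counted twice.
annotations_per_organism = df_annot.organism.value_counts()

# Proteins per organism: deduplicate on accession first.
proteins_per_organism = df_annot.drop_duplicates("Uniprot").organism.value_counts()
print(annotations_per_organism, proteins_per_organism, sep="\n")
```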

Filtering out invalid proteins:

How many of these proteins are in our filtered dataset, which only contains proteins with experimental evidence and valid sequences?

- 282 proteins are left.
- The number is lower than the 341 annotations partly because proteins often have multiple substrate annotations.

In [123]:
df_swissprot_new_sequence_only = df_swissprot_new[["sequence"]].reset_index(drop=False)
tmp = df_swissprot_new_sequence_only[
    df_swissprot_new_sequence_only.Uniprot.isin(
        df_substrates_new_transmembrane_transport.Uniprot
    )
]
tmp


Unnamed: 0,Uniprot,sequence
82,A0A0G2K1Q8,MVVLRQLRLLLWKNYTLKKRKVLVTVLELFLPLLFSGILIWLRLKIQSENVPNATVYPDQHIQELPLFFSFPPPGGSWELAYVPSHSDAARTITEAVRREFMIKMRVHGFSSEKDFEDYVRYDNHSSNVLAAVVFEHTFNHSKDPLPLAVRYHLRFSYTRRNYMWTQTGNLFLKETEGWHTASLFPLFPSPGPREP...
153,A0A0U1QT59,MQNDEEPAAAAGTSGLSNGESLRSPPAPAPRRPKPGILRLDIGKPRRSSGGSVDFRCVGSSSSNGNTSNVATGANSENNSGVTSPHQLSVTWAPPCDLDRGGWQMQSSADAKREFYKGQRGRRAASQEDHRSYELNDFPLQNQSSDAESCHQEPHFAHQRSPGIGFDEDGGGGDIDDEESYTISVSAIMQRRASVR...
341,A0JNM1,MNYSEKLTGAPPMTEVPLELLEEMLWFFRVEDATPWNCSMFVLAALVAIISFILLGRNIQANRNQKKLPPEKQTPEVLYLAEGGNKDDKNLTSLTETLLSEKPTLAQGEMEAKCSDVPRVHLPDPQEPES
610,A2AVZ9,MASKGLPLYLATLLTGLLECIGFAGVLFGWTSLLFVFKAENYFSEPCEQDCLLQSNVTGPSDLKAQDEKFSLIFTLASFMNNFMTFPTGYIFDRFKTTVARLIAIFFYTCATIIIAFTSANTAMLLFLAMPMLAVGGILFLITNLQIGNLFGKHRSTIITLYNGAFDSSSAVFLVIKLLYEQGISLRSSFIFMSVC...
749,A4FV52,MEFRQEEFRKLAGRALGKLHRLLEKRQEGAETLELSADGRPVTTQTRDPPVVDCTCFGLPRRYIIAIMSGLGFCISFGIRCNLGVAIVSMVNNSTTHRGGHVVMQKAQFNWDPETVGLIHGSFFWGYIVTQIPGGFICQKFAANRVFGFAIVATSTLNMLIPSAARVHYGCVIFVRILQGLVEGVTYPACHGIWSK...
...,...,...
121783,A3M137,MAAEEHALTSTEYIKHHLTNMTYGKMPDGTWKLAETAEEAHSMGFTAIHLDSMGWSIGLGVIFCLLFWIVARAANAGVPTKFQSAIEMIIEFVDSSVRDTFHGKSRLIAPLALTIFVWIFLMNLMDLIPVDWIPQVAAFVGANVFGMDPHHVYFKIVPSTDPNITLGMSLSVFVLILFYSIREKGVGGFVGELALN...
126412,P30144,MLGGAVWFPYVLLGVGLFFTIYLKFPQIRYFKHACQVVSGKFDKKDTEGDTTHFQALATALSGTVGTGNIGGVALAISIGGPAALFWMWMTAFFGMTTKFVEVTLSHKYREKTEDGTMSGGPMYYMDKRLNMKWLAILFAVATVISSFGTGSLPQINNIAQGMEATFGFAPMATGAVLSILLALVILGGIKRIAAI...
131439,Q3SWT5,MVVTQLSLEFRFQGKKLRGFSCELTRSPHGVLPESVLSTTCQVAIPILLSGLGMMTAGLVMNTVQHWPVFRDVKDLLTLVPPLVGLKGNLEMTLASRLSTSANTGQIDDRQERYRIISSNLAVVQVQATVVGLLAAVASLMLGTVSHEEFDWAKVALLCTSSVITAFLAALALGILMICIVIGARKFGVNPDNIAT...
132517,Q56036,MSALNKKSFLTWLKEGGIYVVLLVLLAIIIFQDPTFLSLLNLSNILTQSSVRIIIALGVAGLIVTQGTDLSAGRQVGLAAVVAATLLQSMENANKVFPEMATMPIALVILIVCAIGAVIGLVNGIIIAYLNVTPFITTLGTMIIVYGINSLYYDFVGASPISGFDSGFSTFAQGFVAMGSFRLSYITFYALIAVAF...
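The sequence check itself happens inside `create_dataset` (via `invalid_amino_acids="remove_protein"`) and is not shown here. A minimal sketch of what such a check might look like, assuming "valid" means that only the 20 standard one-letter amino acid codes occur. That definition, the function name, and the example sequences are assumptions:

```python
# The 20 standard amino acids in one-letter code.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def has_only_standard_amino_acids(sequence: str) -> bool:
    """Return True if the sequence is non-empty and uses only standard residues."""
    return bool(sequence) and set(sequence) <= STANDARD_AMINO_ACIDS


# Made-up examples: P2 contains ambiguity codes (X, B, Z), P3 contains U.
sequences = {"P1": "MKTAYIAK", "P2": "MKXBZ", "P3": "MSELU"}
valid = {acc for acc, seq in sequences.items() if has_only_standard_amino_acids(seq)}
print(valid)
```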


How many are left after clustering?

- After clustering at a 70% sequence identity threshold, 142 transporters with new annotations remain.

In [124]:
from subpred.cdhit import cd_hit

valid_transporters_cluster_representatives = cd_hit(
    tmp.set_index("Uniprot").sequence, identity_threshold=70
)
print(len(valid_transporters_cluster_representatives))


cd-hit: clustered 282 sequences into 142 clusters at threshold 70
142


These 142 proteins have 217 substrate annotations in total (counting all of their substrate keywords in the new release, not only the added ones):

- Lipid is still the largest class, followed by Ion, Sodium and Amino-acid.


In [125]:
# Select all substrate annotations of the cluster representatives in the new release.
df_substrates_representatives = df_substrates_m2[
    df_substrates_m2.Uniprot.isin(valid_transporters_cluster_representatives)
]
df_substrates_representatives.keyword.value_counts()

Lipid transport               64
Ion transport                 57
Sodium transport              24
Amino-acid transport          14
Hydrogen ion transport        11
Potassium transport           11
Protein transport             10
Electron transport             6
Zinc transport                 4
Sugar transport                4
Iron transport                 2
Anion exchange                 2
Phosphate transport            2
Calcium transport              2
Cobalt transport               1
Chloride                       1
Neurotransmitter transport     1
Translocation                  1
Name: keyword, dtype: int64

We can't give meaningful statistics on the organism distribution of these final proteins: cd-hit selects the representative of each cluster by deterministic criteria such as sequence length, not by organism, so the organism a representative comes from is incidental.
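To illustrate the point, here is a toy version of greedy incremental clustering in which the longest sequence of each cluster becomes its representative. It is only a sketch: `difflib.SequenceMatcher` stands in for cd-hit's alignment-based identity, and the function name, sequences, and labels are made up.

```python
from difflib import SequenceMatcher


def greedy_cluster(sequences: dict, threshold: float = 0.7) -> dict:
    """Greedy incremental clustering: longest sequences become representatives first."""
    clusters = {}  # representative label -> list of member labels
    # Process from longest to shortest; ties are broken by label for determinism.
    for acc in sorted(sequences, key=lambda a: (-len(sequences[a]), a)):
        for rep in clusters:
            # Toy similarity; cd-hit uses its own alignment-based identity.
            ratio = SequenceMatcher(None, sequences[rep], sequences[acc]).ratio()
            if ratio >= threshold:
                clusters[rep].append(acc)
                break
        else:
            # No existing representative is similar enough: start a new cluster.
            clusters[acc] = [acc]
    return clusters


seqs = {
    "human": "MKTAYIAKQRQISFVK",
    "mouse": "MKTAYIAKQRQISFVKSH",
    "yeast": "GGGGPPPPLLLLWWWW",
}
clusters = greedy_cluster(seqs)
print(clusters)
```

Here the mouse sequence represents the human one only because it is two residues longer, which is why counting representatives per organism says little about which organisms gained annotations.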

## Changes to transport-related keywords

## Changes to cellular components

## Effects on the actual dataset

## Realistic test

- Clustering is performed to only keep distinct protein sequences
- Filter for organism
- Filter for ...