# UniProt dataset analysis

Tasks:

1. Compare the dataset from manuscript 1 to the current version of UniProt
    - UniProt releases four updates per year
        - Current version is 2022_04, manuscript 1 used 2021_04
    - Are there new proteins in the new UniProt release?
    - Comparison of samples annotated with keywords
        - In general
        - Individual organisms
2. Compare the Swiss-Prot dataset (reviewed=True) with the Swiss-Prot+TrEMBL dataset (which also includes unreviewed proteins).
    - Does the additional data make a difference in sample counts for our substrates?
    - TODO questions

## Imports

In [3]:
from subpred.dataset import create_dataset, SUBSTRATE_KEYWORDS

Dataset used in manuscript 1:

In [5]:
df_swissprot_manuscript1 = create_dataset(
    # keywords_substrate_filter=["Amino-acid transport", "Sugar transport"],
    # keywords_component_filter=["Transmembrane"],
    # keywords_transport_filter=["Transport"],
    input_file="../data/raw/swissprot/uniprot-reviewed_yes.tab.gz",
    multi_substrate="keep",
    verbose=True,
    # tax_ids_filter=[3702, 9606, 83333, 559292],
    # outliers=outliers,
    # sequence_clustering=70,
    evidence_code=2,
    invalid_amino_acids="remove_protein",
    # force_update=True,
    tcdb_substrates_file="../data/raw/tcdb/tcdb_substrates.tsv"
)
df_swissprot_manuscript1.shape


Found pickle, reading...


(141892, 16)

Dataset used in manuscript 2:

In [6]:
df_swissprot_new = create_dataset(
    # keywords_substrate_filter=["Amino-acid transport", "Sugar transport"],
    # keywords_component_filter=["Transmembrane"],
    # keywords_transport_filter=["Transport"],
    input_file="../data/raw/swissprot/uniprot_data_2022_04.tab.gz",
    multi_substrate="keep",
    verbose=True,
    # tax_ids_filter=[3702, 9606, 83333, 559292],
    # outliers=outliers,
    # sequence_clustering=70,
    evidence_code=2,
    invalid_amino_acids="remove_protein",
    # force_update=True,
    tcdb_substrates_file="../data/raw/tcdb/tcdb_substrates.tsv"
)
df_swissprot_new.shape


Found pickle, reading...


(144929, 16)
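
The two shapes differ by 3,037 rows, but a net difference does not show which proteins were added or removed between releases. A minimal sketch of that comparison follows, assuming the DataFrames returned by `create_dataset` are indexed by UniProt accession (this is not confirmed above). The helper `compare_releases` and the toy accessions are hypothetical and only illustrate the idea.

```python
import pandas as pd


def compare_releases(df_old: pd.DataFrame, df_new: pd.DataFrame) -> dict:
    """Compare two releases by their row labels.

    Assumption: the index of each DataFrame holds the protein accession.
    """
    old_ids = set(df_old.index)
    new_ids = set(df_new.index)
    return {
        "added": sorted(new_ids - old_ids),      # only in the newer release
        "removed": sorted(old_ids - new_ids),    # only in the older release
        "shared": sorted(old_ids & new_ids),     # present in both
    }


# Toy data standing in for the two releases (accessions are made up).
df_old_toy = pd.DataFrame(
    {"keywords": ["Sugar transport", "Ion transport"]},
    index=["A00001", "A00002"],
)
df_new_toy = pd.DataFrame(
    {"keywords": ["Sugar transport", "Amino-acid transport"]},
    index=["A00001", "A00003"],
)

release_diff = compare_releases(df_old_toy, df_new_toy)
print({name: len(ids) for name, ids in release_diff.items()})
```

On the real data, the call would be `compare_releases(df_swissprot_manuscript1, df_swissprot_new)`, provided the index assumption holds.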

## Downloading the raw data

While downloading the raw data for Swiss-Prot+TrEMBL, we found that the dataset contains over 200 million entries. Since the UniProt pagination downloader that we implemented can download about 3,000 proteins per minute, this download would take at least 46 days to complete.
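
The estimate can be checked with a short calculation. It uses the two figures quoted above, so the result is a lower bound, because the entry count is only given as "over 200 million".

```python
n_entries = 200_000_000   # lower bound on the number of entries
rate_per_minute = 3_000   # approximate throughput of the downloader

minutes_needed = n_entries / rate_per_minute
days_needed = minutes_needed / (60 * 24)

print(f"{days_needed:.1f} days")  # prints "46.3 days"
```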

To solve this problem, we decided to download only proteins whose existence is verified at the protein level and whose sequence is not a fragment. This download contains only 257,658 proteins. As a consequence, we no longer have the option of including proteins that were verified only at the transcript level, but for unreviewed proteins that might be a bad idea anyway.
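
The two criteria were applied when downloading. For illustration only, the same selection can be written as a filter on an annotation table that is already loaded. The column names `protein_existence` and `fragment`, their values, and the helper `filter_verified_complete` are assumptions made for this sketch and are not taken from the actual download.

```python
import pandas as pd


def filter_verified_complete(df: pd.DataFrame) -> pd.DataFrame:
    """Keep entries with protein-level evidence and a non-fragment sequence.

    Assumed columns (hypothetical): 'protein_existence' holds the evidence
    level as text, 'fragment' is empty (NaN/None) for complete sequences.
    """
    at_protein_level = df["protein_existence"] == "Evidence at protein level"
    not_fragment = df["fragment"].isna()
    return df[at_protein_level & not_fragment]


# Toy table with made-up accessions.
toy_annotations = pd.DataFrame(
    {
        "protein_existence": [
            "Evidence at protein level",
            "Evidence at transcript level",
            "Evidence at protein level",
            "Predicted",
        ],
        "fragment": [None, None, "fragment", None],
    },
    index=["A00001", "A00002", "A00003", "A00004"],
)

filtered_annotations = filter_verified_complete(toy_annotations)
print(filtered_annotations.index.tolist())  # prints ['A00001']
```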

In [5]:
df_all = create_dataset(
    # keywords_substrate_filter=["Amino-acid transport", "Sugar transport"],
    # keywords_component_filter=["Transmembrane"],
    # keywords_transport_filter=["Transport"],
    input_file="../data/raw/swissprot/uniprot_data_all_evidence1-2_2022_04.tsv",
    multi_substrate="keep",
    verbose=True,
    # tax_ids_filter=[3702, 9606, 83333, 559292],
    # outliers=outliers,
    # sequence_clustering=70,
    evidence_code=2,
    invalid_amino_acids="remove_protein",
    # force_update=True
)



Found pickle, reading...


In [4]:
print(SUBSTRATE_KEYWORDS)

{'Bacteriocin transport', 'Oxygen transport', 'Calcium transport', 'Electron transport', 'Ion transport', 'Phosphonate transport', 'Potassium transport', 'Sodium/potassium transport', 'Cobalt transport', 'Sugar transport', 'Polysaccharide transport', 'Anion exchange', 'Bacterial flagellum protein export', 'Peptide transport', 'Phosphate transport', 'Zinc transport', 'Sulfate transport', 'Copper transport', 'Viral movement protein', 'Amino-acid transport', 'Nickel transport', 'Sodium transport', 'Neurotransmitter transport', 'mRNA transport', 'Ammonia transport', 'Chloride', 'Hydrogen ion transport', 'Protein transport', 'Lipid transport', 'Translocation', 'Iron transport'}
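
To compare sample counts per substrate keyword between datasets, one option is to count how many proteins carry each keyword. The sketch below assumes the dataset has a `keywords` column holding a semicolon-separated string per protein. Both the column name and the separator are assumptions, and `count_keywords` is a hypothetical helper, not part of `subpred`.

```python
import pandas as pd


def count_keywords(df, keywords, column="keywords", sep=";"):
    """Count proteins annotated with each of the given keywords.

    Assumption: df[column] holds one separator-delimited string per protein.
    """
    # One row per (protein, keyword) pair, with surrounding whitespace removed.
    exploded = df[column].fillna("").str.split(sep).explode().str.strip()
    counts = exploded[exploded.isin(keywords)].value_counts()
    # Report every requested keyword, including those with zero proteins.
    return counts.reindex(sorted(keywords), fill_value=0)


# Toy data; the real call would pass SUBSTRATE_KEYWORDS as `keywords`.
toy_proteins = pd.DataFrame(
    {
        "keywords": [
            "Transport;Sugar transport;Transmembrane",
            "Ion transport; Sodium transport",
            "Sugar transport",
            None,
        ]
    }
)
toy_substrates = {"Sugar transport", "Ion transport", "Sodium transport", "Zinc transport"}

keyword_counts = count_keywords(toy_proteins, toy_substrates)
print(keyword_counts.to_dict())
```

Running the same function on `df_swissprot_new` and `df_all` and placing the two results side by side would show, per substrate, how many samples the unreviewed entries add.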
