# GO vs. Keywords for substrates

Up until now, we have been using UniProt keywords for annotations. To get all transmembrane sugar transporters, we filter the proteins in the dataset by the keywords "Transmembrane" and "Sugar transport".

This works most of the time, but a small number of proteins fulfil these requirements without actually transporting the substrate. One example would be a "Sugar transport" protein that exchanges ATP and ADP, where that exchange in turn promotes sugar transport. Such a protein promotes sugar transport, but is not a sugar transporter itself. To learn the function from the sequence, every protein in the dataset needs to be an actual sugar transporter.

To prevent these cases, we could turn to the Gene Ontology (GO). The advantage is that the "molecular function" GO terms are much more specific: instead of filtering for multiple keywords, we could filter for a single term such as "sugar transmembrane transporter activity". The downside might be a lower number of samples, or lower-quality annotations if we use automatically annotated GO terms.
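The difference between the two filters can be illustrated on toy data. The following is only a sketch: the accessions `P1` to `P3`, the annotation dictionaries, and the helper functions `filter_by_keywords` and `filter_by_go` are made up for illustration and are not part of `subpred`. Plain Python sets stand in for the annotation tables.

```python
# Toy annotations keyed by made-up accessions (assumption: not real entries).
keywords = {
    "P1": {"Transmembrane", "Sugar transport"},
    "P2": {"Transmembrane", "Sugar transport"},  # e.g. an ATP/ADP exchanger
    "P3": {"Transmembrane", "Amino-acid transport"},
}
go_terms = {
    "P1": {"sugar transmembrane transporter activity"},
    "P2": {"ATP:ADP antiporter activity"},
    "P3": {"amino acid transmembrane transporter activity"},
}


def filter_by_keywords(annotations, required):
    """Return the proteins annotated with every keyword in `required`."""
    required = set(required)
    return {acc for acc, kws in annotations.items() if required <= kws}


def filter_by_go(annotations, term):
    """Return the proteins annotated with the given GO term name."""
    return {acc for acc, terms in annotations.items() if term in terms}


kw_hits = filter_by_keywords(keywords, ["Transmembrane", "Sugar transport"])
go_hits = filter_by_go(go_terms, "sugar transmembrane transporter activity")

print(sorted(kw_hits))            # proteins selected by the keyword filter
print(sorted(go_hits))            # proteins selected by the GO filter
print(sorted(kw_hits - go_hits))  # keyword-only hits, candidates for manual review
```

In this toy setup, `P2` passes the keyword filter but not the GO filter, which is the kind of protein described above.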

Tasks:

- Implement a tree-shaped data structure for ontologies like GO and ChEMBL
- How many substrates do we get with keywords compared to GO?
- What can we say about proteins that have either a GO term or a keyword, but not both? Are they outliers, or are they simply not annotated correctly in GO/KW?
- Compare the GO term sets from the UniProt dataset to the entire GO dataset
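As a starting point for the first task, here is a minimal sketch of a hierarchy with parent and child lookups. GO terms can have more than one parent, so the sketch stores a directed acyclic graph rather than a strict tree. The class name `Ontology`, its methods, and the example edges are assumptions made for illustration; the edges are not taken from a GO release.

```python
from collections import defaultdict


class Ontology:
    """Minimal is_a hierarchy: each term maps to its parents and children."""

    def __init__(self):
        self.parents = defaultdict(set)
        self.children = defaultdict(set)

    def add_is_a(self, child, parent):
        """Register that `child` is a more specific kind of `parent`."""
        self.parents[child].add(parent)
        self.children[parent].add(child)

    def _walk(self, term, edges):
        # Iterative depth-first traversal; `seen` prevents revisiting terms
        # that are reachable through several paths.
        seen, stack = set(), [term]
        while stack:
            for nxt in edges.get(stack.pop(), ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    def ancestors(self, term):
        """All terms above `term`, excluding `term` itself."""
        return self._walk(term, self.parents)

    def descendants(self, term):
        """All terms below `term`, excluding `term` itself."""
        return self._walk(term, self.children)


# Illustrative edges only (assumption: simplified, not the real GO graph).
go = Ontology()
go.add_is_a("sugar transmembrane transporter activity",
            "transmembrane transporter activity")
go.add_is_a("glucose transmembrane transporter activity",
            "sugar transmembrane transporter activity")
go.add_is_a("amino acid transmembrane transporter activity",
            "transmembrane transporter activity")

# A protein annotated with a more specific term should also count for the
# general one, so a filter would match the term plus all of its descendants.
sugar = "sugar transmembrane transporter activity"
sugar_terms = {sugar} | go.descendants(sugar)
print(sorted(sugar_terms))
```

Expanding a term to its descendants before filtering matters for the sample counts in the second task, because annotations are often made at the most specific level available.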

## Information

What is the difference between the GO annotation included in the UniProtKB entry view, and the information accessible via the link "Complete GO annotation"?

https://www.uniprot.org/help/complete_go_annotation


What are the differences between UniProtKB keywords and the GO terms?

https://www.uniprot.org/help/keywords_vs_go

Gene Ontology (GO) annotations in UniProt

https://www.uniprot.org/help/gene_ontology

Transmembrane annotations in UniProt

https://www.uniprot.org/help/transmem

In [15]:
from subpred.dataset import create_dataset, get_go_df, get_keywords_df, get_tcdb_substrates
df_swissprot = create_dataset(
    # keywords_classes=["Amino-acid transport", "Sugar transport"],
    # keywords_filter=["Transmembrane", "Transport"],
    input_file="../data/raw/uniprot/swissprot_data_2022_04.tsv.gz",
    multi_substrate="keep",
    verbose=True,
    # tax_ids_filter=[3702, 9606, 83333, 559292],
    # outliers=outliers,
    # sequence_clustering=70,
    evidence_code=2,
    invalid_amino_acids="remove_protein",
    # force_update=True,
    tcdb_substrates_file="../data/raw/tcdb/tcdb_substrates.tsv",
)
df_swissprot.shape

Found pickle, reading...


(144929, 16)

In [19]:
df_sp_go = get_go_df(df_swissprot)
df_sp_go.shape

(1399804, 3)

In [20]:
df_sp_kw = get_keywords_df(df_swissprot)
df_sp_kw.shape

(1205130, 2)

In [21]:
df_sp_ch = get_tcdb_substrates(df_swissprot)
df_sp_ch.shape

(7804, 4)