# GO vs. Keywords for substrates

Up until now, we have been using Uniprot keywords for annotations. To get all transmembrane sugar transporters, we filter the proteins in the dataset by the keywords "Transmembrane" and "Sugar transport".

This works most of the time, but occasionally a few proteins fulfil these requirements without actually transporting the substrate. An example would be a "Sugar transport" protein that exchanges ATP and ADP, which is in turn used to drive sugar transport: the protein promotes sugar transport, but is not a sugar transporter itself. To learn the function from the sequence, all proteins in the dataset need to be actual sugar transporters.

To prevent these cases, we could turn to the Gene Ontology. The advantage is that the "molecular function" GO terms are much more specific: instead of filtering for multiple keywords, we could filter for "sugar transmembrane transporter activity" alone. The downside might be a lower number of samples, or lower-quality annotations if we use automatically annotated GO terms.
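
The difference between the two filtering strategies can be sketched with a few lines of plain Python. The three proteins and their annotations are invented toy data (not taken from Uniprot), and the helper names `filter_by_keywords` and `filter_by_go_term` are hypothetical:

```python
# Toy annotations for three hypothetical proteins (illustration only).
keywords = {
    "sugar_carrier": {"Transmembrane", "Sugar transport"},
    # Promotes sugar transport, but moves ATP/ADP rather than sugar:
    "atp_adp_exchanger": {"Transmembrane", "Sugar transport"},
    "soluble_helper": {"Sugar transport"},
}
go_terms = {
    "sugar_carrier": {"sugar transmembrane transporter activity"},
    "atp_adp_exchanger": {"ATP:ADP antiporter activity"},
    "soluble_helper": set(),
}


def filter_by_keywords(annotations, required):
    """Keep proteins that carry every keyword in `required`."""
    return {p for p, kws in annotations.items() if required <= kws}


def filter_by_go_term(annotations, term):
    """Keep proteins annotated with exactly this GO term label."""
    return {p for p, terms in annotations.items() if term in terms}


by_keywords = filter_by_keywords(keywords, {"Transmembrane", "Sugar transport"})
by_go = filter_by_go_term(go_terms, "sugar transmembrane transporter activity")

print("keyword filter:", sorted(by_keywords))
print("GO filter:     ", sorted(by_go))
print("keyword only:  ", sorted(by_keywords - by_go))
```

In this toy case the keyword filter also returns the exchanger, while the single, more specific GO term does not.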

Tasks:

- Implement tree-shaped datastructure for ontologies like GO and ChEMBL (done)
    - This allows us to get all descendants/children for a GO term, automatically
    - In contrast to Keywords, a protein is not annotated with all of its GO terms. If it is annotated with a descendant of "transmembrane transporter activity", such as "sugar transmembrane transporter activity", then it does not have to be annotated with "transmembrane transporter activity".
- Create four annotation datasets:
    - Uniprot GO annotations (filtered and preprocessed by uniprot)
    - Uniprot annotation from GO website without electronically inferred evidence (IEA) (doing the filtering ourselves)
    - Uniprot annotation from GO website with electronically inferred evidence (IEA) (doing the filtering ourselves)
    - Uniprot keywords
- Statistics for Swissprot
    - How many transmembrane transporters are in each dataset?
    - How many sugar/amino-acid/ion transporters?
- Comparative analysis
    - How many substrates do we get with keywords/go in comparison? (create table)
    - What is the size of the intersection/difference sets?
    - Look at samples of transporters that have GO term but not Keyword
- Statistics for Swissprot + TrEMBL
    - How many more transporters do we get with each annotation dataset, when we do not filter for Swissprot?
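
For the comparative analysis in the task list, the intersection and difference sets can be computed with plain set operations. A minimal sketch, assuming each annotation source has already been reduced to a list of accessions; the accessions below are placeholders and `compare_sources` is a hypothetical helper:

```python
def compare_sources(first, second):
    """Return sizes of the intersection and both difference sets."""
    first, second = set(first), set(second)  # duplicates are collapsed here
    return {
        "both": len(first & second),
        "only_first": len(first - second),
        "only_second": len(second - first),
    }


# Placeholder accessions; "P3" appears twice on purpose.
go_proteins = ["P1", "P2", "P3", "P3"]
keyword_proteins = ["P2", "P3", "P4"]

result = compare_sources(go_proteins, keyword_proteins)
print(result)
```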

## Information

#### Transmembrane annotations in Uniprot

https://www.uniprot.org/help/transmem

Transmembrane helices in Swissprot are annotated via:

- Prediction tools
- Experimentally determined
- Similarity to protein family that is known to contain TM domains

In TrEMBL, the annotation happens automatically via the TMHMM program.

Multi-pass membrane proteins are annotated as such in the "Subcellular location" column, which might also be interesting for filtering.

Beta-barrels are not predicted by Uniprot, so this information might be absent. The beta-barrel property can be found in the Domains column.

#### Uniprot GO preprocessing

https://www.uniprot.org/help/complete_go_annotation

- Annotation filtering is applied
    - manual annotation preferred
- Some information is removed
    - with/from column
    - the qualifiers "NOT, contributes_to, colocalizes_with"
    - annotations made to isoform identifiers
- Info on Uniprot is about three months behind QuickGO website

#### What are the differences between UniProtKB keywords and the GO terms?

https://www.uniprot.org/help/keywords_vs_go

- Keywords are manual for Swissprot, and automatic for TrEMBL
- GO terms are manually mapped to keywords and xrefs

#### Gene Ontology (GO) annotations in Uniprot:

https://www.uniprot.org/help/gene_ontology

- Keyword annotations made by Uniprot curators are included in GO with IEA evidence code!
    - Keyword annotations and GO annotations will probably not be the same
    - Swissprot is already annotated by curators, but GO might use more recent research
    - Union set of GO and keywords might increase sample count, especially for TrEMBL

## Swissprot dataset

Basic filtering is applied in the function: Removing fragmented sequences, and proteins that do not have evidence at either transcript level or protein level, among other things.

In [1]:
from subpred.dataset import create_dataset, get_go_df, get_keywords_df, get_tcdb_substrates
from subpred.go_utils import GeneOntology, read_go_uniprot
df_swissprot = create_dataset(
    # keywords_classes=["Amino-acid transport", "Sugar transport"],
    # keywords_filter=["Transmembrane", "Transport"],
    input_file="../data/raw/uniprot/swissprot_data_2022_04.tsv.gz",
    multi_substrate="keep",
    verbose=True,
    # tax_ids_filter=[3702, 9606, 83333, 559292],
    # outliers=outliers,
    # sequence_clustering=70,
    evidence_code=2,
    invalid_amino_acids="remove_protein",
    # force_update=True,
    tcdb_substrates_file="../data/raw/tcdb/tcdb_substrates.tsv",
)
df_swissprot.shape

Found pickle, reading...


(144929, 16)

## Uniprot GO dataset

Get annotated go terms from Uniprot file:

In [2]:
df_sp_go = get_go_df(df_swissprot)
df_sp_go.shape

(1399804, 3)

GO terms of interest (for now):

In [3]:
# Cellular Component
cc_terms = {"membrane": "GO:0016020", "plasma membrane": "GO:0005886"}

mf_terms = {
    "transmembrane transporter activity": "GO:0022857",
    "amino acid transmembrane transporter activity": "GO:0015171",
    "carbohydrate transmembrane transporter activity": "GO:0015144",
    "sugar transmembrane transporter activity": "GO:0051119",
    "ion transmembrane transporter activity": "GO:0015075"
}

Ontology tree data structure that allows us to retrieve children/parents of GO terms, etc.:

In [4]:
go = GeneOntology("../data/raw/ontologies/go.owl")
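
How descendant retrieval can work on such a structure is sketched below with a plain dictionary and a breadth-first search. This is not the implementation of `GeneOntology`; the `is_a` excerpt is heavily simplified (the real ontology has intermediate terms), and whether the real `get_descendants` includes the query term itself is not known here, so the sketch makes that explicit with an `include_self` flag:

```python
from collections import deque

# Simplified child -> parents ("is_a") excerpt, for illustration only.
IS_A = {
    "GO:0015144": ["GO:0022857"],  # carbohydrate t.t.a. -> t.t.a.
    "GO:0051119": ["GO:0015144"],  # sugar t.t.a. -> carbohydrate t.t.a.
    "GO:0015171": ["GO:0022857"],  # amino acid t.t.a. -> t.t.a. (simplified)
}


def build_children(is_a):
    """Invert the child -> parents mapping into parent -> children."""
    children = {}
    for child, parents in is_a.items():
        for parent in parents:
            children.setdefault(parent, set()).add(child)
    return children


def get_descendants(term, children, include_self=True):
    """Collect all terms reachable from `term` via parent -> child edges."""
    seen = {term} if include_self else set()
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for child in children.get(current, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


children = build_children(IS_A)
print(sorted(get_descendants("GO:0022857", children)))
print(sorted(get_descendants("GO:0015144", children, include_self=False)))
```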

How many proteins in the dataset are annotated with "transmembrane transporter activity"?

In [5]:
print(
    "proteins annotated with transmembrane transporter activity:",
    df_sp_go[df_sp_go.go_id == "GO:0022857"].Uniprot.unique().shape[0],
)

proteins annotated with transmembrane transporter activity: 1166


1166 is not that many.

Is every protein that is annotated with a descendant of "transmembrane transporter activity" also annotated with "transmembrane transporter activity"?

In [6]:
tta_descendants = go.get_descendants("GO:0022857")
print(f"There are {len(tta_descendants)} descendant go terms of transmembrane transporter activity")

tta_descendants_proteins = df_sp_go[df_sp_go.go_id.isin(tta_descendants)].Uniprot.unique()
print(f"Proteins annotated with transmembrane transporter activity or descendant go term: {tta_descendants_proteins.shape[0]}")

There are 1024 descendant go terms of transmembrane transporter activity
Proteins annotated with transmembrane transporter activity or descendant go term: 9103


No, that is not the case! We can get 9103 unique proteins by looking at descendants of the term.

## Official GO dataset (noiea)

Is the Uniprot GO annotation complete? Or can we gain something by reading the official GO annotations for Uniprot from the GO website?

To test this, we will read the Uniprot annotation file from the GO website. These GO annotations have not been filtered by Uniprot.

Qualifiers starting with "NOT" and entries for other DBs have already been removed by the function read_go_uniprot. Annotations that were inferred electronically (IEA) are not part of this dataset.

We should pay attention to the qualifiers column.

In [7]:
df_go = read_go_uniprot("/home/ad/gene_ontology/goa_uniprot_all_noiea.gaf.gz")

What evidence codes are in the dataset? This "noiea" version explicitly does not contain IEA annotations. The downside of the IEA version is the file size: the NOIEA version is 6 MB in size, the IEA version is 171 GB. The IEA version might also be noisier, since most of its annotations have not been checked by a curator.

GO evidence code guide: http://geneontology.org/docs/guide-go-evidence-codes/

We have mapped the evidence codes to descriptions, to make it easier to interpret. After reading the explanations, this looks good:

In [8]:
df_go.evidence_description.value_counts()

curator_statement                        129517
computational_analysis                   115756
experimental_evidence                     56587
experimental_evidence_high_throughput      3156
author_statement                           3154
Name: evidence_description, dtype: int64
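
A mapping from evidence codes to descriptions can be sketched as a lookup table. The grouping below follows the categories of the GO evidence code guide linked above; the labels for the phylogenetic and electronic groups are assumptions, and the mapping inside `read_go_uniprot` evidently uses fewer groups, so it may assign some codes differently:

```python
# Evidence code groups following the GO evidence code guide.
# Group labels not seen in the output above are assumptions.
EVIDENCE_GROUPS = {
    "experimental_evidence": ["EXP", "IDA", "IPI", "IMP", "IGI", "IEP"],
    "experimental_evidence_high_throughput": ["HTP", "HDA", "HMP", "HGI", "HEP"],
    "phylogenetic_evidence": ["IBA", "IBD", "IKR", "IRD"],  # assumed label
    "computational_analysis": ["ISS", "ISO", "ISA", "ISM", "IGC", "RCA"],
    "author_statement": ["TAS", "NAS"],
    "curator_statement": ["IC", "ND"],
    "electronic_annotation": ["IEA"],  # assumed label
}

# Invert to a flat code -> description lookup.
CODE_TO_DESCRIPTION = {
    code: group for group, codes in EVIDENCE_GROUPS.items() for code in codes
}


def describe_evidence(code):
    """Return the description for an evidence code, or 'unknown'."""
    return CODE_TO_DESCRIPTION.get(code, "unknown")


for code in ["IDA", "ND", "IEA", "XYZ"]:
    print(code, "->", describe_evidence(code))
```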

#### Explanation of qualifiers

Relations in GO: http://geneontology.org/docs/ontology-relations/

The most common relation between go_terms is "is_a", for example "carbohydrate transmembrane transporter activity" "is_a" "transmembrane transporter activity". This term is used within the same sub-ontology, in this case Molecular Function. Other relations are used when switching from one part of the ontology to another, for example the MF "transmembrane transporter activity" is "part_of" the Biological Process "transmembrane transport".

Most common relations between gene products and GO terms (via GO wiki):

- "involved_in": Function of gene product means that it is involved in a biological process
- "enables": Function of gene product explicitly enables the molecular function. This should be the most important category for us.
- "is_active_in": The cellular component where the gene product carries out its function
- "located_in": No documentation found, but presumably relation between gene product and CC
- "part_of": Location of the gene product, without making a statement about activity in that location. Can also be used to relate GO terms to each other, such as CC terms.
- "acts_upstream_of_or_within": Experimental evidence is not sufficient to relate the gene product's activity to a biological process. Often derived from mutant phenotypes.
- "contributes_to": The gene product is a member of a complex that enables a molecular function. We should probably remove those annotations, since we are only interested in the actual channels/carriers.
- "colocalizes_with": The resolution of the assay is not accurate enough to say that the gene product is in the cellular component. Different interpretations of the term are possible. Should be removed from the dataset.
- "acts_upstream_of*": The molecular function of the gene product has an impact on a Biological Process that is downstream of the gene product. These relations should also be removed from our dataset.


What we should do:

- Keep "enables" for the MF terms, and "is_active_in" for the CC terms.

In [9]:
df_go.qualifier.value_counts()

involved_in                                   121545
enables                                        94792
is_active_in                                   43369
located_in                                     38776
part_of                                         5102
acts_upstream_of_or_within                      3458
contributes_to                                   519
colocalizes_with                                 476
acts_upstream_of                                  67
acts_upstream_of_positive_effect                  28
acts_upstream_of_or_within_positive_effect        22
acts_upstream_of_negative_effect                  15
acts_upstream_of_or_within_negative_effect         1
Name: qualifier, dtype: int64

#### Creating the GO tables:

Filtering the terms by Molecular Function:

In [10]:
df_go_mf = df_go[
    (df_go.aspect == "F") 
]
df_go_mf.shape

(95311, 18)

Only keeping the most accurate qualifier ("enables") 

In [11]:
df_go_mf = df_go_mf[df_go_mf.qualifier == "enables"]
df_go_mf.shape

(94792, 18)

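This removes only 519 of the 95,311 MF annotations (about 0.5%), which matches the number of "contributes_to" annotations in the qualifier counts above.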

#### Comparison to Uniprot GO terms

Now we have a GO annotation dataset directly from GO. From the Uniprot annotations, we had 1166 proteins with the "transmembrane transporter activity", and 9103 proteins when including the descendant go terms. How does that compare to our new GO dataset?

In [12]:
df_go_mf[df_go_mf.go_id == mf_terms["transmembrane transporter activity"]].shape[0]

18

Only 18 annotations carry the term "transmembrane transporter activity" directly! (The count is over annotation rows, so the number of distinct proteins is at most 18.) What about its descendants?

In [13]:
df_go_mf[
    df_go_mf.go_id.isin(
        go.get_descendants(mf_terms["transmembrane transporter activity"])
    )
].shape[0]

2031

With descendants, the count rises to 2031 annotations. That is still far fewer than the 9103 proteins annotated with those GO terms in Uniprot. From the Information section above, we know that keywords are mapped to GO terms, but are labeled as "IEA" in the actual GO dataset. Since this file excludes IEA annotations, that could explain the gap.

What are the most common descendant GO terms?

In [14]:
go_descendants_value_counts = df_go_mf[
    df_go_mf.go_id.isin(
        go.get_descendants(mf_terms["transmembrane transporter activity"])
    )
].go_id.value_counts()

# Name the index and the counts explicitly, so the column names do not depend on the pandas version
go_descendants_value_counts = go_descendants_value_counts.rename_axis("go_id").reset_index(name="count")
go_descendants_value_counts["go_term"] = go_descendants_value_counts.go_id.apply(go.get_label)

go_descendants_value_counts.head(50)

Unnamed: 0,go_id,count,go_term
0,GO:0008137,311,NADH dehydrogenase (ubiquinone) activity
1,GO:0042626,79,ATPase-coupled transmembrane transporter activity
2,GO:0015106,71,bicarbonate transmembrane transporter activity
3,GO:0005244,66,voltage-gated ion channel activity
4,GO:0015250,49,water channel activity
5,GO:0051119,41,sugar transmembrane transporter activity
6,GO:0015108,39,chloride transmembrane transporter activity
7,GO:0005254,38,chloride channel activity
8,GO:0005391,36,P-type sodium:potassium-exchanging transporter...
9,GO:0008320,36,protein transmembrane transporter activity


When printing the top 50, there are some more general terms, but also some usable ones such as chloride and sugar. We would also have to look at the descendants of those terms, since some of the sugar transporters might only be annotated with a descendant of "sugar transmembrane transporter activity".

In [20]:
ion_transporter_descendants = go.get_descendants(go.get_identifier("ion transmembrane transporter activity"))
sugar_transporter_descendants = go.get_descendants(go.get_identifier("sugar transmembrane transporter activity"))
carbo_transporter_descendants = go.get_descendants(go.get_identifier("carbohydrate transmembrane transporter activity"))
aa_transporter_descendants = go.get_descendants(go.get_identifier("amino acid transmembrane transporter activity"))
# for go_id_set in [ion_transporter_descendants, sugar_transporter_descendants, carbo_transporter_descendants, aa_transporter_descendants]:
# TODO

## Official GO dataset (including iea) 

Now let's look at the full dataframe to compare:

In [16]:
df_go_mf.head()

Unnamed: 0,db,db_object_id,db_object_symbol,qualifier,go_id,db_reference,evidence_code,with_or_from,aspect,db_object_name,db_object_synonym,db_object_type,taxon,date,assigned_by,annotation_extension,gene_product_form_id,evidence_description
2,UniProtKB,Q87UX2,blc,enables,GO:0003674,GO_REF:0000015,ND,,F,Outer membrane lipoprotein Blc,blc|PSPTO_5170,protein,taxon:223283,20061207,JCVI,,,curator_statement
4,UniProtKB,C8VDI1,AN11006,enables,GO:0003674,GO_REF:0000015,ND,,F,Uncharacterized protein AN11006,AN11006,protein,taxon:227321,20200401,AspGD,,,curator_statement
11,UniProtKB,Q2GIQ1,omp-1X,enables,GO:0003674,GO_REF:0000015,ND,,F,Omp-1X,omp-1X|APH_1219,protein,taxon:212042,20061212,TIGR,,,curator_statement
14,UniProtKB,Q93AM0,fldI,enables,GO:0008047,PMID:11967068,IDA,,F,(R)-phenyllactate dehydratase activator,fldI,protein,taxon:1509,20130610,UniProt,,,experimental_evidence
16,UniProtKB,G4NEF6,MGG_00119,enables,GO:0003674,GO_REF:0000015,ND,,F,Prothymosin alpha,MGG_00119,protein,taxon:242507,20080211,PAMGO_MGG,,,curator_statement


To read the whole 171 GB file, I first filtered it down using a C++ program that removes the comments and only keeps "Molecular Function" terms with the "enables" qualifier. All annotations for gene products that are not in Uniprot, such as functional RNA, were removed as well. The filtered file is 11 GB in size.
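
The same kind of streaming filter can be sketched in Python. This is not the C++ program mentioned above, only an illustration under the assumption that the input follows the tab-separated GAF column order (DB, object ID, symbol, qualifier, GO ID, reference, evidence code, with/from, aspect, ...). The function name `filter_gaf` and the second and third sample rows are made up; the first sample row is taken from the `head()` output above:

```python
def filter_gaf(lines, aspect="F", qualifier="enables", db="UniProtKB"):
    """Yield (accession, go_id, evidence_code) for matching GAF lines.

    Works on any iterable of lines, so a large file can be streamed
    without loading it into memory.
    """
    for line in lines:
        if line.startswith("!"):  # GAF header/comment lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:  # malformed or empty line
            continue
        # An exact match on the qualifier also drops "NOT|enables".
        if fields[0] == db and fields[3] == qualifier and fields[8] == aspect:
            yield fields[1], fields[4], fields[6]


def make_line(*fields):
    return "\t".join(fields) + "\n"


sample = [
    "!gaf-version: 2.2\n",
    make_line("UniProtKB", "Q93AM0", "fldI", "enables", "GO:0008047",
              "PMID:11967068", "IDA", "", "F"),
    # Hypothetical rows that should be filtered out:
    make_line("UniProtKB", "TOY001", "toy1", "involved_in", "GO:0055085",
              "PMID:0", "IDA", "", "P"),
    make_line("RNAcentral", "TOY002", "toy2", "enables", "GO:0003674",
              "GO_REF:0000015", "ND", "", "F"),
]

records = list(filter_gaf(sample))
print(records)
```

For a compressed file, the same generator could be fed from `gzip.open(path, "rt")`, which yields lines lazily.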

In [23]:
import pandas as pd
df_go_all = pd.read_table("~/gene_ontology/goa_sp_iea_mf_enables.tsv", dtype=str, header=None, names=["Uniprot", "go_id", "evidence_code"])

In [24]:
df_go_all.head()

Unnamed: 0,Uniprot,go_id,evidence_code
0,A0A8H7LV99,GO:0022857,IEA
1,A0A852VR81,GO:0003677,IEA
2,A0A852VR81,GO:0003677,IEA
3,A0A852VR81,GO:0003677,IEA
4,A0A851ER61,GO:0003746,IEA


This dataset now also contains IEA evidence codes:

In [34]:
df_go_all.evidence_code.value_counts()

IEA    465678194
ND         43142
ISS        24489
IDA        12067
IPI        11924
IMP         1532
TAS          459
EXP          234
ISA          161
RCA          161
IGC          158
NAS          144
ISM          116
IGI          102
IC            52
ISO           39
IEP           12
Name: evidence_code, dtype: int64

In [36]:
transmembrane_transporter_protein_set = df_go_all[df_go_all.go_id == mf_terms["transmembrane transporter activity"]].Uniprot
print("Unique proteins with transmembrane transporter activity go term:",transmembrane_transporter_protein_set.unique().shape[0])

Unique proteins with transmembrane transporter activity go term: 4401573


There are 4,401,573 proteins with the "transmembrane transporter activity" GO term! How many of them are in Swissprot?

In [37]:
print(
    "Unique proteins in our filtered Swissprot dataset with transmembrane transporter activity go term:",
    transmembrane_transporter_protein_set[
        transmembrane_transporter_protein_set.isin(df_swissprot.index)
    ].shape[0],
)

print("Total proteins in filtered Swissprot dataset: ", df_swissprot.shape[0])


Unique proteins in our filtered Swissprot dataset with transmembrane transporter activity go term: 682
Total proteins in filtered Swissprot dataset:  144929


Only 682 out of 144,929, which is fewer than in the Uniprot GO annotations (1166), but more than in the no-IEA GO annotations from the official website (18). Note that this count is taken over rows without deduplication, so 682 is an upper bound on the number of distinct proteins. How many are there if we include descendants of "transmembrane transporter activity"?

In [38]:
transmembrane_transporter_protein_set = df_go_all[df_go_all.go_id.isin(go.get_descendants(mf_terms["transmembrane transporter activity"]))].Uniprot

print("Unique proteins with transmembrane transporter activity go term or descendant:",transmembrane_transporter_protein_set.unique().shape[0])

print(
    "Unique proteins in our filtered Swissprot dataset with transmembrane transporter activity go term or descendant go term:",
    transmembrane_transporter_protein_set[
        transmembrane_transporter_protein_set.isin(df_swissprot.index)
    ].shape[0],
)

Unique proteins with transmembrane transporter activity go term or descendant: 13812246
Unique proteins in our filtered Swissprot dataset with transmembrane transporter activity go term or descendant go term: 6329
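
The Swissprot count above is taken over rows, and the filtered file can contain several identical rows per protein (see the `head()` output of `df_go_all`). A small sketch of how row counts and unique counts can differ, using the four rows shown in that output:

```python
# (accession, go_id) pairs copied from rows 1-4 of the head() output.
rows = [
    ("A0A852VR81", "GO:0003677"),
    ("A0A852VR81", "GO:0003677"),
    ("A0A852VR81", "GO:0003677"),
    ("A0A851ER61", "GO:0003746"),
]

accessions = [accession for accession, _ in rows]

n_rows = len(accessions)         # counts every annotation row
n_unique = len(set(accessions))  # counts each protein once

print("rows:", n_rows)
print("unique proteins:", n_unique)
```

On a pandas Series, `.nunique()` or `.unique().shape[0]` gives the deduplicated count, whereas `.shape[0]` alone counts rows.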


TODO:

- Three GO datasets, one Keywords dataset
- Compare number of transmembrane transporters (with or without descendants)
- Compare number of sugar, amino-acid, ion transporters



## Keywords dataset

In [None]:
# All keyword annotations in swissprot
df_sp_kw = get_keywords_df(df_swissprot)
df_sp_kw.shape

(1205130, 2)

Substrate keywords:

In [None]:
tm_tp = df_sp_kw[df_sp_kw.keyword.isin(["Transmembrane", "Transport"])].groupby("Uniprot").apply(len) == 2

tm_tp = set(tm_tp[tm_tp].index.tolist())

len(tm_tp)

9624

In [None]:
from subpred.dataset import SUBSTRATE_KEYWORDS

df_sp_substrates = df_sp_kw[df_sp_kw.keyword.isin(SUBSTRATE_KEYWORDS)]

df_sp_substrates.keyword.value_counts()

Ion transport                         4487
Protein transport                     4114
Electron transport                    2189
Oxygen transport                       889
Hydrogen ion transport                 816
Lipid transport                        785
Sodium transport                       732
Potassium transport                    704
Sugar transport                        703
mRNA transport                         669
Translocation                          626
Amino-acid transport                   545
Calcium transport                      463
Chloride                               392
Iron transport                         293
Zinc transport                         191
Neurotransmitter transport             175
Phosphate transport                    143
Peptide transport                      119
Copper transport                        84
Ammonia transport                       79
Sodium/potassium transport              77
Anion exchange                          51
Sulfate tra

There are almost 200,000 more GO term annotations than keyword annotations in total:

In [None]:
df_sp_go.shape[0] - df_sp_kw.shape[0]

194674

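3384 entries in the GO table have no GO term (presumably proteins without any GO annotation), whereas the keywords table has no missing values: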

In [None]:
print(df_sp_go[df_sp_go.go_term.isnull()].shape[0])
print(df_sp_kw[df_sp_kw.keyword.isnull()].shape[0])

3384
0


In [None]:
df_sp_ch = get_tcdb_substrates(df_swissprot)
df_sp_ch.shape

(7804, 4)

## Results

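We created four different datasets. The table compares the number of proteins annotated with "transmembrane transporter activity" (t.t.a.), with and without descendant terms:

| Dataset | t.t.a., Swissprot | t.t.a., Swissprot+TrEMBL | t.t.a. + descendants, Swissprot | t.t.a. + descendants, Swissprot+TrEMBL |
|---|---|---|---|---|
| GO-Uniprot (processed by Uniprot) | | | | |
| GO-NOIEA (official website) | | | | |
| GO-IEA (official website) | | | | |
| Keywords | | | | |

Filters applied to the official versions: "enables" qualifier, molecular function terms only. This does not remove many annotations.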