Steps:

- Filter protein dataset
    - Organisms
- Annotate with GO terms
- Create subgraph of GO
    - transmembrane transporter activity
    - only GO terms that occur in the protein dataset
- Annotate GO network with ChEBI terms
- ChEBI network: sub-network of molecules that occur in the organism
    - Node size: Number of proteins
    - Edges: Overlap in terms of proteins

TODO compare to graph notebook

TODO turn into functions in graph.py

## Protein dataset

In [34]:
from subpred.util import load_df

df_uniprot = load_df("uniprot")
df_uniprot

Uniprot,gene_names,protein_names,reviewed,protein_existence,sequence,organism_id
A0A0C5B5G6,MT-RNR1,Mitochondrial-derived peptide MOTS-c (Mitochon...,True,1,MRWQEMGYIFYPRKLR,9606
A0A1B0GTW7,CIROP LMLN2,Ciliated left-right organizer metallopeptidase...,True,1,MLLLLLLLLLLPPLVLRVAASRCLHDETQKSVSLLRPPFSQLPSKS...,9606
A0JNW5,BLTP3B KIAA0701 SHIP164 UHRF1BP1L,Bridge-like lipid transfer protein family memb...,True,1,MAGIIKKQILKHLSRFTKNLSPDKINLSTLKGEGELKNLELDEEVL...,9606
A0JP26,POTEB3,POTE ankyrin domain family member B3,True,1,MVAEVCSMPAASAVKKPFDLRSKMGKWCHHRFPCCRGSGKSNMGTS...,9606
A0PK11,CLRN2,Clarin-2,True,1,MPGWFKKAWYGLASLLSFSSFILIIVALVVPHWLSGKILCQTGVDL...,9606
...,...,...,...,...,...,...
X5L4R4,NOD-2,Nucleotide-binding oligomerization domain-cont...,False,2,MSPGCYKGWPFNCHLSHEEDKRRNETLLQEAETSNLQITASFVSGL...,586796
X5MBL2,GT34D,"Putative galacto(Gluco)mannan alpha-1,6-galact...",False,2,KVLYDRAFNSSDDQSALVYLLLKEKDKWADRIFIEHKYYLNGYWLD...,3352
X5MFI4,GT34D,"Putative galacto(Gluco)mannan alpha-1,6-galact...",False,2,MDEDVLCKGPLHGGSARSLKGSLKRLKRIMESLNDGLIFMGGAVSA...,3352
X5MI49,GT34A,"Putative galacto(Gluco)mannan alpha-1,6-galact...",False,2,MVNDSKLETISGNMVQKRKSFDGLPFWTVSIAGGLLLCWSLWRICF...,3352


Creating a subset of the protein dataset for the organism:

In [35]:
ORGANISM_ID = 83333

df_uniprot_organism = df_uniprot[df_uniprot.organism_id == ORGANISM_ID].drop(
    "organism_id", axis=1
)
df_uniprot_organism.shape[0]


3284

Removing entries with missing data: 

In [36]:
df_uniprot_organism = df_uniprot_organism[
    ~df_uniprot_organism.gene_names.isnull()
    & ~df_uniprot_organism.protein_names.isnull()
    & ~df_uniprot_organism.sequence.isnull()
]
df_uniprot_organism.shape[0]


3284

Filtering for available evidence and manual curation: 

In [37]:
REVIEWED_ONLY = True
PROTEIN_EXISTENCE_PROTEIN_LEVEL = True

display(
    df_uniprot_organism[["reviewed", "protein_existence"]]
    .groupby(["reviewed", "protein_existence"])
    .apply(len)
)

if REVIEWED_ONLY:
    df_uniprot_organism = df_uniprot_organism[df_uniprot_organism.reviewed]
if PROTEIN_EXISTENCE_PROTEIN_LEVEL:
    df_uniprot_organism = df_uniprot_organism[
        df_uniprot_organism.protein_existence == 1
    ]


reviewed  protein_existence
False     1                       1
          2                       1
True      1                    3118
          2                     164
dtype: int64
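
The filtering steps above could be bundled into a single function, in line with the TODO about moving code into graph.py. This is only a minimal sketch: the function name `filter_proteins` and the toy dataframe are assumptions, only the column names follow `df_uniprot`.

```python
import pandas as pd


def filter_proteins(
    df: pd.DataFrame,
    organism_id: int,
    reviewed_only: bool = True,
    protein_level_only: bool = True,
) -> pd.DataFrame:
    """Hypothetical helper combining the organism, missing-data and evidence filters."""
    df = df[df.organism_id == organism_id].drop("organism_id", axis=1)
    # Drop rows with missing gene names, protein names or sequences
    df = df.dropna(subset=["gene_names", "protein_names", "sequence"])
    if reviewed_only:
        df = df[df.reviewed]
    if protein_level_only:
        # protein_existence == 1 corresponds to evidence at protein level
        df = df[df.protein_existence == 1]
    return df


# Toy data for illustration only (not real UniProt entries)
df_toy = pd.DataFrame(
    {
        "gene_names": ["geneA", "geneB", None, "geneD"],
        "protein_names": ["Protein A", "Protein B", "Protein C", "Protein D"],
        "reviewed": [True, False, True, True],
        "protein_existence": [1, 1, 1, 2],
        "sequence": ["MKT", "MAG", "MLL", "MVA"],
        "organism_id": [83333, 83333, 83333, 83333],
    },
    index=pd.Index(["P1", "P2", "P3", "P4"], name="Uniprot"),
)
df_toy_filtered = filter_proteins(df_toy, organism_id=83333)
print(df_toy_filtered.index.tolist())
```

Keeping the two flags as parameters mirrors `REVIEWED_ONLY` and `PROTEIN_EXISTENCE_PROTEIN_LEVEL`, so the same function could be reused with looser settings for organisms with fewer curated entries.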

## GO Annotations

In [38]:
df_goa_uniprot = load_df("go")
df_goa_uniprot


,Uniprot,qualifier,go_id,evidence_code,aspect
0,A0A009FND8,enables,GO:0000166,IEA,F
1,A0A009FND8,enables,GO:0005524,IEA,F
2,A0A009FND8,enables,GO:0051082,IEA,F
3,A0A009FND8,enables,GO:0140662,IEA,F
4,A0A009FND8,involved_in,GO:0006457,IEA,P
...,...,...,...,...,...
7452013,Z9JND5,enables,GO:0000166,IEA,F
7452014,Z9JND5,enables,GO:0005524,IEA,F
7452015,Z9JND5,enables,GO:0051082,IEA,F
7452016,Z9JND5,enables,GO:0140662,IEA,F


Creating the Molecular Function (MF) subset. Only the "enables" qualifier is kept, as it is the most accurate one.

In [56]:
df_goa_uniprot_enables_mf = df_goa_uniprot[
    (df_goa_uniprot.qualifier == "enables") & (df_goa_uniprot.aspect == "F")
].reset_index(drop=True)
df_goa_uniprot_enables_mf


,Uniprot,qualifier,go_id,evidence_code,aspect
0,A0A009FND8,enables,GO:0000166,IEA,F
1,A0A009FND8,enables,GO:0005524,IEA,F
2,A0A009FND8,enables,GO:0051082,IEA,F
3,A0A009FND8,enables,GO:0140662,IEA,F
4,A0A009FS68,enables,GO:0000166,IEA,F
...,...,...,...,...,...
2697668,Z9JMY9,enables,GO:0140662,IEA,F
2697669,Z9JND5,enables,GO:0000166,IEA,F
2697670,Z9JND5,enables,GO:0005524,IEA,F
2697671,Z9JND5,enables,GO:0051082,IEA,F


Filter GO annotations for the organism:

In [57]:
df_goa_uniprot_enables_mf_organism = df_goa_uniprot_enables_mf[
    df_goa_uniprot_enables_mf.Uniprot.isin(df_uniprot_organism.index)
].reset_index(drop=True)
df_goa_uniprot_enables_mf_organism


,Uniprot,qualifier,go_id,evidence_code,aspect
0,A5A627,enables,GO:0005253,IDA,F
1,A5A627,enables,GO:0005253,IEA,F
2,C1P5Z7,enables,GO:0004857,IMP,F
3,O32583,enables,GO:0000166,IEA,F
4,O32583,enables,GO:0097163,IDA,F
...,...,...,...,...,...
16863,Q93K97,enables,GO:0019144,IBA,F
16864,Q93K97,enables,GO:0019144,IDA,F
16865,Q93K97,enables,GO:0046872,IEA,F
16866,Q93K97,enables,GO:0047631,IDA,F


In [40]:
# TODO ancestors
# TODO IEA?
# TODO stats on IEA terms, comparison between number of samples. Is it worth it?
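
One way to approach the TODO on IEA statistics is to compare how many unique (protein, GO term) pairs exist overall with how many are backed by at least one non-IEA evidence code. The following is a sketch on made-up data; only the column names match `df_goa_uniprot_enables_mf_organism`, and the GO identifiers are placeholders.

```python
import pandas as pd

# Toy annotations for illustration only
df_toy_goa = pd.DataFrame(
    {
        "Uniprot": ["P1", "P1", "P1", "P2", "P2"],
        "go_id": ["GO:A", "GO:A", "GO:B", "GO:A", "GO:C"],
        "evidence_code": ["IDA", "IEA", "IEA", "IEA", "IMP"],
    }
)

# Number of annotation rows per evidence code
print(df_toy_goa.evidence_code.value_counts())

# Unique (protein, GO term) pairs, with and without electronic annotations
pairs_all = df_toy_goa[["Uniprot", "go_id"]].drop_duplicates()
pairs_non_iea = df_toy_goa.loc[
    df_toy_goa.evidence_code != "IEA", ["Uniprot", "go_id"]
].drop_duplicates()

n_pairs_all = len(pairs_all)
n_pairs_non_iea = len(pairs_non_iea)
print(f"{n_pairs_all} pairs in total, {n_pairs_non_iea} with non-IEA evidence")
```

The difference between the two counts gives a rough idea of how many samples would be lost by excluding IEA annotations.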


## GO NetworkX

In [41]:
import obonet
import networkx as nx

graph_go = obonet.read_obo("../data/raw/ontologies/go.obo", ignore_obsolete=True)


Dicts for converting between labels and identifiers:

In [42]:
# go_id instead of id, to avoid shadowing the built-in id()
id_to_name = {go_id: data["name"] for go_id, data in graph_go.nodes(data=True)}
name_to_id = {name: go_id for go_id, name in id_to_name.items()}


What edge-annotations are in the data?

In [43]:
{edge[2] for edge in graph_go.edges(data=True, keys=True)}


{'ends_during',
 'happens_during',
 'has_part',
 'is_a',
 'negatively_regulates',
 'occurs_in',
 'part_of',
 'positively_regulates',
 'regulates'}

Creating sub-graph for Molecular Function terms:

In [44]:
graph_go_mf = graph_go.subgraph(
    nodes=[
        node
        for node, data in graph_go.nodes(data=True)
        if data["namespace"] == "molecular_function"
    ]
)


Which relations can we find in the MF-subgraph?

- has_part, part_of: Used for example for protein complexes and their complex members
- regulation: For transcription factors, messengers, etc.
- is_a: Direct, logical relation. This is what we want for finding all sub-types of transmembrane transporter activity.

In [45]:
{edge[2] for edge in graph_go_mf.edges(data=True, keys=True)}


{'has_part',
 'is_a',
 'negatively_regulates',
 'part_of',
 'positively_regulates',
 'regulates'}

Creating subgraph for is_a relationship:

In [46]:
graph_go_mf_isa = graph_go_mf.edge_subgraph(
    {edge for edge in graph_go_mf.edges(keys=True) if edge[2] == "is_a"}
)

The subgraph should be acyclic. Is it?

In [47]:
print(nx.is_directed_acyclic_graph(graph_go_mf_isa))


True
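
Because obonet stores each relation as an edge pointing from the child term to its parent, all sub-types of a term are the nodes from which that term can be reached, which is what `nx.ancestors` returns. The sketch below shows this on a toy graph with placeholder node names; with the real data one would presumably pass `graph_go_mf_isa` and `name_to_id["transmembrane transporter activity"]` instead.

```python
import networkx as nx

# Toy is_a hierarchy with edges pointing child -> parent, as in obonet graphs
graph_toy = nx.MultiDiGraph()
graph_toy.add_edge("transporter", "root", key="is_a")
graph_toy.add_edge("transmembrane transporter", "transporter", key="is_a")
graph_toy.add_edge("ion channel", "transmembrane transporter", key="is_a")
graph_toy.add_edge("binding", "root", key="is_a")

root_term = "transmembrane transporter"
# nx.ancestors walks the edges backwards, i.e. from a term to all of its sub-types
subterms = nx.ancestors(graph_toy, root_term) | {root_term}
graph_toy_sub = graph_toy.subgraph(subterms)
print(sorted(graph_toy_sub.nodes()))
```

The counter-intuitive naming (ancestors in the graph are descendants in the ontology) is a consequence of the edge direction only, so it is worth a comment wherever it is used.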


## Intersection between Graph and GOA

Are there any GO annotations from Uniprot whose terms are not included in the graph? If so, the two files might be based on different releases, and should be updated to the same version.

In [48]:
print(
    "number of GO terms not in graph:",
    df_goa_uniprot_enables_mf_organism[
        ~df_goa_uniprot_enables_mf_organism.go_id.isin(set(graph_go.nodes()))
    ].shape[0],
)

number of GO terms not in graph: 0


In this case, all GO terms are in the graph. We still apply the filter, so that annotations with unknown terms are removed when the code is used with other datasets.

In [49]:
df_goa_uniprot_enables_mf_organism = df_goa_uniprot_enables_mf_organism[
    df_goa_uniprot_enables_mf_organism.go_id.isin(set(graph_go.nodes()))
]

Subgraph for organism:

In [58]:
graph_go_mf_isa_organism = graph_go_mf_isa.subgraph(
    df_goa_uniprot_enables_mf_organism.go_id.unique()
)

## Add abstract GO terms to annotation data

In [51]:
# TODO filter network for E. coli proteins
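
A possible way to add the more abstract terms: for every annotated GO term, also annotate the protein with all of its superterms in the is_a graph. With obonet's edge direction (child to parent), the superterms are given by `nx.descendants`. The sketch below uses toy data, and the helper name `add_ancestor_terms` is an assumption, not an existing function.

```python
import networkx as nx
import pandas as pd


def add_ancestor_terms(df_goa: pd.DataFrame, graph_isa: nx.MultiDiGraph) -> pd.DataFrame:
    """Hypothetical helper: add all is_a superterms of each annotated GO term."""
    records = set()
    for uniprot, go_id in zip(df_goa.Uniprot, df_goa.go_id):
        # Edges point child -> parent, so graph descendants are the superterms
        for term in nx.descendants(graph_isa, go_id) | {go_id}:
            records.add((uniprot, term))
    return pd.DataFrame(sorted(records), columns=["Uniprot", "go_id"])


# Toy hierarchy: channel is_a transporter is_a root
graph_toy = nx.MultiDiGraph()
graph_toy.add_edge("channel", "transporter", key="is_a")
graph_toy.add_edge("transporter", "root", key="is_a")

df_toy_goa = pd.DataFrame({"Uniprot": ["P1", "P2"], "go_id": ["channel", "root"]})
df_toy_goa_full = add_ancestor_terms(df_toy_goa, graph_toy)
print(df_toy_goa_full)
```

Every `go_id` has to be a node of the graph passed in, otherwise `nx.descendants` raises an error. Propagating annotations this way may also help with the organism subgraph: restricting the is_a graph to directly annotated terms can drop intermediate terms and disconnect the hierarchy.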


## Annotate GO network with number of organism annotations
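
One option for this step is to store the number of annotated proteins per GO term as a node attribute, which could later drive the node size in a plot. This is a sketch on toy data; the attribute name `n_proteins` is an assumption.

```python
import networkx as nx
import pandas as pd

# Toy graph and annotations for illustration only
graph_toy = nx.MultiDiGraph()
graph_toy.add_edge("channel", "transporter", key="is_a")

df_toy_goa = pd.DataFrame(
    {
        "Uniprot": ["P1", "P1", "P2", "P2"],
        "go_id": ["channel", "transporter", "transporter", "transporter"],
    }
)

# Unique proteins per GO term, so that a protein annotated with the
# same term under several evidence codes is counted only once
protein_counts = df_toy_goa.groupby("go_id").Uniprot.nunique().to_dict()

# Terms without any annotation get a count of 0
nx.set_node_attributes(
    graph_toy,
    {node: protein_counts.get(node, 0) for node in graph_toy.nodes()},
    name="n_proteins",
)
node_counts = nx.get_node_attributes(graph_toy, "n_proteins")
print(node_counts)
```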

## Link from GO to ChEBI

ChEBI NetworkX
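
The ChEBI network outlined in the steps at the top (node size as number of proteins, edges as overlap in proteins) could be built roughly as follows. This is a sketch under assumptions: the molecule names and protein sets are placeholders, and "overlap" is interpreted here as the number of shared proteins, which is only one possible choice.

```python
from itertools import combinations

import networkx as nx

# Toy mapping from molecule to the set of proteins associated with it
molecule_to_proteins = {
    "molecule_a": {"P1", "P2", "P3"},
    "molecule_b": {"P2", "P3"},
    "molecule_c": {"P4"},
}

graph_molecules = nx.Graph()
for molecule, proteins in molecule_to_proteins.items():
    # Node size: number of proteins
    graph_molecules.add_node(molecule, n_proteins=len(proteins))

for mol1, mol2 in combinations(molecule_to_proteins, 2):
    overlap = molecule_to_proteins[mol1] & molecule_to_proteins[mol2]
    if overlap:
        # Edge weight: number of shared proteins
        graph_molecules.add_edge(mol1, mol2, weight=len(overlap))

print(graph_molecules.edges(data=True))
```

A normalized measure such as the Jaccard index could replace the raw count if large molecule classes would otherwise dominate the edges.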