# Dataset creation

The input file for this notebook is a Uniprot custom download. Proteins whose existence is not supported by evidence at the protein or transcript level, as well as proteins with fragmented sequences, are excluded by the download query.

The Uniprot version used here is 2022_05.

The result will be datasets of:

- Sequences, along with information about manual review status and gene/protein names
- Uniprot keywords
- GO terms
- TCDB data, including annotated CHEBI molecules
- Interpro domains

The data is saved as pickled dataframes, which load much faster than the raw text files. We also tried other binary formats (feather, parquet, HDF) as well as gz- and lzma-compressed pickles. Uncompressed pickles gave the best tradeoff between reading speed and file size.
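
The tradeoff described above can be illustrated with a minimal, standard-library sketch. It is not the comparison that was actually run for this notebook: the toy `records` dictionary and the `benchmark` helper are made up for illustration, and real dataframes will show different absolute numbers.

```python
import gzip
import lzma
import pickle
import random
import time

# Toy stand-in for a sequence table: accession -> random amino acid sequence
rng = random.Random(0)
records = {
    f"TOY{i:05d}": "".join(rng.choices("ACDEFGHIKLMNPQRSTVWY", k=300))
    for i in range(2000)
}


def benchmark(dumps, loads):
    """Serialize `records`, then return (size in bytes, seconds to load)."""
    blob = dumps(records)
    start = time.perf_counter()
    restored = loads(blob)
    elapsed = time.perf_counter() - start
    assert restored == records  # the round trip must be lossless
    return len(blob), elapsed


results = {
    "pickle": benchmark(pickle.dumps, pickle.loads),
    "pickle+gz": benchmark(
        lambda obj: gzip.compress(pickle.dumps(obj)),
        lambda blob: pickle.loads(gzip.decompress(blob)),
    ),
    "pickle+lzma": benchmark(
        lambda obj: lzma.compress(pickle.dumps(obj)),
        lambda blob: pickle.loads(lzma.decompress(blob)),
    ),
}

for name, (size, seconds) in results.items():
    print(f"{name:<12} {size:>9} bytes  {seconds:.4f} s to load")
```

Compressed variants produce smaller files but have to be decompressed on every load, which is the cost that has to be weighed against disk usage.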

## Download commands

- Uncomment to re-download
- Some links need to be updated to the most recent version for compatibility across the datasets

In [1]:
# !curl "http://current.geneontology.org/ontology/go.owl" > "../data/raw/ontologies/go.owl"
# !curl "https://ftp.ebi.ac.uk/pub/databases/chebi/ontology/chebi.owl.gz" | gunzip -c > ../data/raw/ontologies/chebi.owl
# !curl "https://tcdb.org/cgi-bin/substrates/getSubstrates.py" > ../data/raw/tcdb/tcdb_substrates.tsv
# !python3 ../subpred/uniprot_downloader.py "https://rest.uniprot.org/uniprotkb/search?compressed=false&fields=accession%2Cgene_names%2Cprotein_name%2Creviewed%2Cprotein_existence%2Csequence%2Corganism_id%2Cgo_id%2Ckeywordid%2Ckeyword%2Cxref_tcdb%2Cxref_interpro&format=tsv&query=%28%28fragment%3Afalse%29%20AND%20%28existence%3A1%29%20OR%20%28existence%3A2%29%29&size=500" "../data/raw/uniprot/uniprot_2022_05_evidence1-2_nofragments.tsv"
# !wget https://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/goa_uniprot_all.gaf.gz -O - | gunzip -c | awk 'BEGIN {OFS="\t";FS="\t"} ($1 == "UniProtKB") {print $2,$4,$5,$7,$9,$14}' | sort -u | xz -T0 > ../data/raw/gene_ontology/goa_uniprot_all_ebi_filtered.tsv.xz
# # !curl "https://ftp.ebi.ac.uk/pub/databases/interpro/current_release/entry.list" > "../data/raw/interpro/interpro_entries.tsv"
# !curl "https://ftp.ebi.ac.uk/pub/databases/interpro/releases/90.0/entry.list" > "../data/raw/interpro/interpro_entries.tsv"
# !curl https://release.geneontology.org/2023-01-01/ontology/go.obo > ../data/raw/ontologies/go.obo
# !curl https://ftp.ebi.ac.uk/pub/databases/chebi/ontology/chebi.obo.gz | gzip -d  > ../data/raw/ontologies/chebi.obo
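
The `awk` step in the GOA download command keeps only rows whose first column is `UniProtKB` and prints six of the tab-separated columns. As a rough sketch of what that filter does, here is a Python equivalent on toy input. The function name `filter_gaf` and all toy values are made up for illustration, and the ordering of the result may differ from that of `sort -u`.

```python
# Zero-based positions of the awk fields $2, $4, $5, $7, $9 and $14
KEPT_FIELDS = (1, 3, 4, 6, 8, 13)


def filter_gaf(lines):
    """Mimic: awk '($1 == "UniProtKB") {print $2,$4,$5,$7,$9,$14}' | sort -u"""
    rows = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] != "UniProtKB":
            continue  # this also skips the "!" comment lines of the file
        fields += [""] * (14 - len(fields))  # awk prints missing fields as empty
        rows.add(tuple(fields[i] for i in KEPT_FIELDS))
    return sorted(rows)


# Hypothetical record with 17 tab-separated columns
toy_record = [
    "UniProtKB", "TOY001", "toy1", "enables", "GO:0022857", "PMID:0", "IDA",
    "", "F", "", "", "protein", "taxon:0", "20220101", "TOYDB", "", "",
]
toy_lines = [
    "!gaf-version: 2.2\n",
    "\t".join(toy_record) + "\n",
    "\t".join(toy_record) + "\n",  # exact duplicate, removed like sort -u
    "\t".join(["RNAcentral"] + toy_record[1:]) + "\n",  # non-UniProtKB row
]

filtered = filter_gaf(toy_lines)
print(filtered)
```

The six kept columns correspond to the names used later when reading the file: `Uniprot`, `qualifier`, `go_id`, `evidence_code`, `aspect` and `date`.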

## Preprocessing

- Basic cleanup on the raw text files
- Saving as pickles for fast reading speeds

In [2]:
OUTPUT_FOLDER = "../data/datasets"

In [3]:
import requests, sys
import pandas as pd
from subpred.util import save_df, load_df
from subpred.cdhit import cd_hit
import obonet
import json


In [4]:

names = [
    "Uniprot",
    "gene_names",
    "protein_names",
    "reviewed",
    "protein_existence",
    "sequence",
    "organism_id",
    "go_ids",
    "keyword_ids",
    "keywords",
    "tcdb_ids",
    "interpro_ids",
]
dtypes = {
    "Uniprot": "string",
    "gene_names": "string",
    "protein_names": "string",
    "reviewed": "category",
    "protein_existence": "category",
    "sequence": "string",
    "organism_id": "int",
    "go_ids": "string",
    "keyword_ids": "string",
    "keywords": "string",
    "tcdb_ids": "string",
    "interpro_ids": "string",
}

df = pd.read_table(
    "../data/raw/uniprot/uniprot_2022_05_evidence1-2_nofragments.tsv",
    index_col=0,
    header=None,
    names=names,
    dtype=dtypes,
    skiprows=1,
)


## Cleanup

In [5]:
print(df.reviewed.value_counts().to_string())
df.reviewed = df.reviewed.transform(lambda x: x == "reviewed").astype("bool")
print(df.reviewed.value_counts().to_string())

reviewed
unreviewed    1530743
reviewed       162964
reviewed
False    1530743
True      162964


In [6]:
print(df.protein_existence.value_counts().to_string())
df.protein_existence = df.protein_existence.map({"Evidence at transcript level":2,"Evidence at protein level":1}).astype("int")
print(df.protein_existence.value_counts().to_string())

protein_existence
Evidence at transcript level    1429578
Evidence at protein level        264129
protein_existence
2    1429578
1     264129


### Filtering out proteins without gene names

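Proteins without gene names appear to be mostly peptides and pollen proteins, with no transporters among them. The filter below is currently commented out.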

In [7]:
# print("before", len(df))
# df = df[~df.gene_names.isnull()]
# print("after", len(df))

### Parsing sequences

In [8]:
import re
print("proteins with non-standard amino acids:")
df[~df.sequence.str.fullmatch(re.compile("[ACDEFGHIKLMNPQRSTVWY]+"))].shape[0]

proteins with non-standard amino acids:


45718

Saving the accessions of these proteins to check later how many of them end up in our datasets:

In [17]:
df[
    ~df.sequence.str.fullmatch(re.compile("[ACDEFGHIKLMNPQRSTVWY]+"))
].index.to_frame().drop_duplicates().to_csv(
    "../data/datasets/proteins_nonstandard_codes.txt", index=False, header=False
)

In [11]:
df.shape

(1693707, 11)

Removing non-standard amino acids from sequences:

In [8]:
df.sequence = df.sequence.str.replace(re.compile("[^ACDEFGHIKLMNPQRSTVWY]+"), "", regex=True)

In [9]:
print("proteins with non-standard amino acids:")
df[~df.sequence.str.fullmatch(re.compile("[ACDEFGHIKLMNPQRSTVWY]+"))].shape[0]

proteins with non-standard amino acids:


0
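
The detection and removal steps above use two complementary character classes. A small sketch with the standard `re` module on made-up toy sequences shows the effect; the accession names and sequences are hypothetical.

```python
import re

# Matches sequences made up exclusively of the 20 standard amino acids
STANDARD_ONLY = re.compile("[ACDEFGHIKLMNPQRSTVWY]+")
# Matches any run of characters outside that alphabet (e.g. X, U, B)
NON_STANDARD = re.compile("[^ACDEFGHIKLMNPQRSTVWY]+")

# Hypothetical toy sequences
sequences = {"toy1": "MKTAYIAK", "toy2": "MKXXTAUB"}

# Accessions whose sequence contains at least one non-standard code
flagged = [acc for acc, seq in sequences.items() if not STANDARD_ONLY.fullmatch(seq)]

# Delete the non-standard codes, as done for the dataframe column above
cleaned = {acc: NON_STANDARD.sub("", seq) for acc, seq in sequences.items()}

print(flagged)
print(cleaned)
```

Note that deleting characters shortens the affected sequences, so residue positions in the cleaned sequences no longer line up with the original entries.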

### Saving dataset

In [12]:
df_sequences = df.drop(
    ["go_ids", "keyword_ids", "keywords", "tcdb_ids", "interpro_ids"], axis=1
)

In [13]:
save_df(df=df_sequences, dataset_name="uniprot", folder_path=OUTPUT_FOLDER)

### Clustering

In [14]:
cluster_representatives_70 = cd_hit(df_sequences.sequence, identity_threshold=70, n_threads=0, memory=8192)
df_sequences_cdhit70 = df_sequences.loc[cluster_representatives_70]
save_df(df=df_sequences_cdhit70, dataset_name="uniprot70", folder_path=OUTPUT_FOLDER)

cd-hit: clustered .......... sequences into finished clusters at threshold 70


In [15]:
cluster_representatives_100 = cd_hit(df_sequences.sequence, identity_threshold=100, n_threads=0, memory=8192)
df_sequences_cdhit100 = df_sequences.loc[cluster_representatives_100]
save_df(df=df_sequences_cdhit100, dataset_name="uniprot100", folder_path=OUTPUT_FOLDER)

cd-hit: clustered .......... sequences into finished clusters at threshold 100


## Annotation datasets

### Keywords

In [16]:
df_keywords = (
    df.keywords.dropna()
    .str.split(";")
    .explode()
    .str.strip()
    .astype("category")
    .rename("keyword")
    .to_frame()
    .reset_index(drop=False)
    .drop_duplicates()
)
df_keywords

Unnamed: 0,Uniprot,keyword
0,A0A0C5B5G6,DNA-binding
1,A0A0C5B5G6,Mitochondrion
2,A0A0C5B5G6,Nucleus
3,A0A0C5B5G6,Osteogenesis
4,A0A0C5B5G6,Reference proteome
...,...,...
7182749,X5MPI4,Hydrolase
7182750,X5MPI4,Protease
7182751,X5MPI4,Serine protease
7182752,X5MPI4,Signal


In [17]:
save_df(df=df_keywords, dataset_name="keywords", folder_path=OUTPUT_FOLDER)
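
To make the split/explode chain above easier to follow, here is the same sequence of operations on a toy series. The accessions and keyword strings are made up; only the pandas calls mirror the cell above.

```python
import pandas as pd

# Hypothetical toy column: one ";"-separated keyword string per protein
toy = pd.Series(
    {"P1": "Membrane; Transport", "P2": None, "P3": "Transport;Transport "},
    name="keywords",
)
toy.index.name = "Uniprot"

df_toy_keywords = (
    toy.dropna()            # proteins without keywords are dropped
    .str.split(";")         # one list of keywords per protein
    .explode()              # one row per (protein, keyword) pair
    .str.strip()            # remove the whitespace left over from "; "
    .rename("keyword")
    .to_frame()
    .reset_index(drop=False)
    .drop_duplicates()      # runs after strip, so "Transport " == "Transport"
)
print(df_toy_keywords)
```

Stripping before `drop_duplicates` matters: otherwise entries that differ only in surrounding whitespace would be kept as separate rows.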

There are keyword ids as well, but their order does not match that of the keyword strings column:

In [18]:
df.keyword_ids.str.split(";").explode().str.strip()

Uniprot
A0A0C5B5G6    KW-0238
A0A0C5B5G6    KW-0496
A0A0C5B5G6    KW-0539
A0A0C5B5G6    KW-0892
A0A0C5B5G6    KW-1185
               ...   
X5MPI4        KW-0378
X5MPI4        KW-0645
X5MPI4        KW-0720
X5MPI4        KW-0732
X5MPI5        KW-0808
Name: keyword_ids, Length: 7456053, dtype: object

### Interpro Domains

In [19]:
df_interpro = (
    df.interpro_ids.dropna()
    .str.rstrip(";")
    .str.split(";")
    .explode()
    .rename("interpro_id")
    .to_frame()
    .reset_index(drop=False)
    .drop_duplicates()
)
df_interpro

Unnamed: 0,Uniprot,interpro_id
0,A0A1B0GTW7,IPR001577
1,A0JNW5,IPR026728
2,A0JNW5,IPR026854
3,A0JP26,IPR002110
4,A0JP26,IPR036770
...,...,...
6251499,X5MPI4,IPR001254
6251500,X5MPI4,IPR018114
6251501,X5MPI4,IPR033116
6251502,X5MPI5,IPR008630


#### Adding Names for Interpro domains

In [20]:
df_interpro_annotations = pd.read_table(
    "../data/raw/interpro/interpro_entries.tsv",
    names=["accession", "type", "name"],
    skiprows=1,
    index_col=0,
    dtype={"accession": str, "type": "category", "name": str},
)
df_interpro_annotations.head()

accession,type,name
IPR000126,Active_site,"Serine proteases, V8 family, serine active site"
IPR000138,Active_site,"Hydroxymethylglutaryl-CoA lyase, active site"
IPR000169,Active_site,"Cysteine peptidase, cysteine active site"
IPR000180,Active_site,"Membrane dipeptidase, active site"
IPR000189,Active_site,"Prokaryotic transglycosylase, active site"


What are the entry types?

https://www.ebi.ac.uk/training/online/courses/interpro-functional-and-structural-analysis/what-is-an-interpro-entry/interpro-entry-types/

InterPro entries are classified into one of five categories, depending on the biological entity they represent: homologous superfamily, protein family, domain, repeat or site.

In [21]:
df_interpro_annotations.type.value_counts()

type
Family                    23997
Domain                    11954
Homologous_superfamily     3399
Conserved_site              696
Repeat                      327
Active_site                 132
Binding_site                 75
PTM                          17
Name: count, dtype: int64

It seems that the textual data splits "site" into conserved site, active site, binding site, and PTM. We can just keep all of them, since it will be obvious later which category a particular annotation belongs to.

In [22]:
df_interpro_annotations[df_interpro_annotations.type == "PTM"]

accession,type,name
IPR000152,PTM,EGF-type aspartate/asparagine hydroxylation site
IPR001020,PTM,"Phosphotransferase system, HPr histidine phosp..."
IPR002114,PTM,"Phosphotransferase system, HPr serine phosphor..."
IPR002332,PTM,"Nitrogen regulatory protein P-II, urydylation ..."
IPR004091,PTM,"Chemotaxis methyl-accepting receptor, methyl-a..."
IPR006141,PTM,Intein N-terminal splicing region
IPR006162,PTM,Phosphopantetheine attachment site
IPR012902,PTM,Prokaryotic N-terminal methylation site
IPR018051,PTM,"Surfactant-associated polypeptide, palmitoylat..."
IPR018070,PTM,"Neuromedin U, amidation site"


In [23]:
df_interpro = df_interpro.merge(df_interpro_annotations, how="left", left_on="interpro_id", right_index=True)
df_interpro

Unnamed: 0,Uniprot,interpro_id,type,name
0,A0A1B0GTW7,IPR001577,Family,"Peptidase M8, leishmanolysin"
1,A0JNW5,IPR026728,Family,UHRF1-binding protein 1-like
2,A0JNW5,IPR026854,Domain,Vacuolar protein sorting-associated protein 13...
3,A0JP26,IPR002110,Repeat,Ankyrin repeat
4,A0JP26,IPR036770,Homologous_superfamily,Ankyrin repeat-containing domain superfamily
...,...,...,...,...
6251499,X5MPI4,IPR001254,Domain,"Serine proteases, trypsin domain"
6251500,X5MPI4,IPR018114,Active_site,"Serine proteases, trypsin family, histidine ac..."
6251501,X5MPI4,IPR033116,Active_site,"Serine proteases, trypsin family, serine activ..."
6251502,X5MPI5,IPR008630,Family,Glycosyltransferase 34


When using the current Interpro file, 61901 annotations had no name in the file. When we looked them up manually (e.g. https://www.ebi.ac.uk/interpro/entry/InterPro/IPR027181/), all of them turned out to be inactive, removed entries. We therefore switched to version 90, which is the version our Uniprot 2022_05 dataset was linked with. Now there are no identifiers with missing names left.

In [24]:
df_interpro[df_interpro.name.isnull() | df_interpro.type.isnull()]

Unnamed: 0,Uniprot,interpro_id,type,name


In [25]:
save_df(df=df_interpro, dataset_name="interpro", folder_path=OUTPUT_FOLDER)
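
The check for missing names relies on how a left merge behaves: identifiers that do not occur in the annotation table are kept, with null values in the added columns. A minimal sketch with made-up identifiers (none of them real Interpro accessions):

```python
import pandas as pd

# Hypothetical protein-to-domain hits; "IPR_GONE" stands for a removed entry
df_hits = pd.DataFrame(
    {
        "Uniprot": ["P1", "P1", "P2"],
        "interpro_id": ["IPR_A", "IPR_GONE", "IPR_B"],
    }
)

# Hypothetical annotation table, indexed by accession like the entry list above
df_names = pd.DataFrame(
    {"type": ["Domain", "Family"], "name": ["Toy domain", "Toy family"]},
    index=pd.Index(["IPR_A", "IPR_B"], name="accession"),
)

# how="left" keeps every hit, even if its identifier has no annotation
merged = df_hits.merge(df_names, how="left", left_on="interpro_id", right_index=True)

# Identifiers without a name point to entries missing from the annotation file
unmatched = merged.loc[merged["name"].isnull(), "interpro_id"].unique().tolist()
print(unmatched)
```

An empty `unmatched` list corresponds to the empty dataframe shown in the check above.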

### TCDB substrates

In [26]:
df_substrates = pd.read_table(
    "../data/raw/tcdb/tcdb_substrates.tsv",
    header=None,
    names=["tcdb_id", "tcdb_substrates"],
    index_col=0,
)
df_substrates = (
    df_substrates.tcdb_substrates.str.split("|")
    .explode()
    .str.split(";", expand=True)
    .rename(columns={0: "chebi_id", 1: "chebi_term"})
    .reset_index(drop=False)
    .melt(id_vars=["tcdb_id"], value_vars=["chebi_id", "chebi_term"])
)
df_substrates

Unnamed: 0,tcdb_id,variable,value
0,2.A.52.2.2,chebi_id,CHEBI:23337
1,2.A.52.2.2,chebi_id,CHEBI:25517
2,2.A.22.2.5,chebi_id,CHEBI:9175
3,2.A.22.2.5,chebi_id,CHEBI:8345
4,2.A.90.2.4,chebi_id,CHEBI:8816
...,...,...,...
26829,1.H.1.1.17,chebi_term,cation
26830,8.A.139.2.2,chebi_term,peptide
26831,1.B.6.2.13,chebi_term,molecule
26832,2.A.66.1.20,chebi_term,hydron
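
Judging from the parsing code above, each line of the substrate file holds a TCDB identifier, a tab, and a `|`-separated list of `id;term` pairs. The following standard-library sketch parses that layout into the same long format; the toy identifiers and terms are made up, and the function name `parse_substrates` is hypothetical.

```python
def parse_substrates(text):
    """Turn 'tcdb_id<TAB>id;term|id;term' lines into (tcdb_id, variable, value) rows."""
    records = []
    for line in text.splitlines():
        tcdb_id, substrates = line.split("\t")
        for substrate in substrates.split("|"):
            # Split on the first ";" only, in case a term itself contains one
            chebi_id, chebi_term = substrate.split(";", 1)
            records.append((tcdb_id, "chebi_id", chebi_id))
            records.append((tcdb_id, "chebi_term", chebi_term))
    return records


# Hypothetical toy input in the assumed layout
toy_text = "toy.1\tCHEBI:1;substrate a|CHEBI:2;substrate b\ntoy.2\tCHEBI:3;substrate c"

rows = parse_substrates(toy_text)
for row in rows:
    print(row)
```

Unlike `melt`, which lists all `chebi_id` rows before the `chebi_term` rows, this sketch interleaves them; the set of rows is the same.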


In [27]:
df_tcdb_uniprot = (
    df.tcdb_ids.dropna()
    .str.rstrip(";")
    .str.split(";")
    .explode()
    .rename("tcdb_id")
    .reset_index(drop=False)
    .drop_duplicates()
    .melt(id_vars=["tcdb_id"], value_vars=["Uniprot"])
)
df_tcdb_uniprot

Unnamed: 0,tcdb_id,variable,value
0,9.A.46.1.2,Uniprot,A0PK11
1,9.B.438.1.1,Uniprot,A2RU14
2,1.N.2.1.1,Uniprot,A6NI61
3,9.A.80.1.1,Uniprot,A6NKB5
4,1.I.1.1.3,Uniprot,O00159
...,...,...,...
8084,8.A.88.1.3,Uniprot,V5IFI8
8085,8.A.69.1.3,Uniprot,V9L7R9
8086,1.C.52.1.29,Uniprot,V9MH43
8087,1.C.115.1.4,Uniprot,W5U3Z6


In [28]:
df_uniprot_tcdb_chebi = pd.concat([df_substrates, df_tcdb_uniprot]).reset_index(drop=True)
df_uniprot_tcdb_chebi = df_uniprot_tcdb_chebi.drop_duplicates()
df_uniprot_tcdb_chebi = df_uniprot_tcdb_chebi.reset_index(drop=True)
df_uniprot_tcdb_chebi

Unnamed: 0,tcdb_id,variable,value
0,2.A.52.2.2,chebi_id,CHEBI:23337
1,2.A.52.2.2,chebi_id,CHEBI:25517
2,2.A.22.2.5,chebi_id,CHEBI:9175
3,2.A.22.2.5,chebi_id,CHEBI:8345
4,2.A.90.2.4,chebi_id,CHEBI:8816
...,...,...,...
34839,8.A.88.1.3,Uniprot,V5IFI8
34840,8.A.69.1.3,Uniprot,V9L7R9
34841,1.C.52.1.29,Uniprot,V9MH43
34842,1.C.115.1.4,Uniprot,W5U3Z6


In [29]:
save_df(df=df_uniprot_tcdb_chebi, dataset_name="tcdb_substrates", folder_path=OUTPUT_FOLDER)
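
The saved table stacks substrate rows and Uniprot rows in one long format, keyed by `tcdb_id`. One possible way to recover protein-to-substrate pairs from it is to split the table by `variable` and merge the parts on `tcdb_id`. This is a sketch on made-up toy rows, not necessarily how the downstream code reads the dataset.

```python
import pandas as pd

# Hypothetical rows in the (tcdb_id, variable, value) layout saved above
df_long = pd.DataFrame(
    [
        ("toy.1", "chebi_id", "CHEBI:1"),
        ("toy.1", "chebi_id", "CHEBI:2"),
        ("toy.2", "chebi_id", "CHEBI:3"),
        ("toy.1", "Uniprot", "P1"),
        ("toy.3", "Uniprot", "P2"),
    ],
    columns=["tcdb_id", "variable", "value"],
)


def select(variable):
    """Return the rows of one variable, with `value` renamed to that variable."""
    subset = df_long[df_long.variable == variable]
    return subset.drop(columns="variable").rename(columns={"value": variable})


# The inner merge keeps only TCDB ids that have both a protein and a substrate
pairs = select("Uniprot").merge(select("chebi_id"), on="tcdb_id", how="inner")
pairs = pairs[["Uniprot", "tcdb_id", "chebi_id"]]
print(pairs)
```

In this toy example, `P2` disappears because its TCDB id has no substrate row, and `toy.2` disappears because no protein maps to it.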

### GO

This file from EBI contains all GO term annotations from Uniprot and from Interpro. It has been filtered for Uniprot accessions, which removes RNA annotations, for example. Some of the unnecessary columns were also removed with awk, which reduces the file size from 170GB (uncompressed raw data) to 1.5GB (as xz file). The command is in the Makefile. We then keep only the annotations of Uniprot proteins in our dataset, which reduces the pickle file size from around 8GB to 100MB.

In [30]:
# TODO free up memory

del cluster_representatives_100
del cluster_representatives_70
del df_interpro
del df_interpro_annotations
del df_keywords
del df_sequences
del df_sequences_cdhit100
del df_sequences_cdhit70
del df_substrates
del df_tcdb_uniprot
del df_uniprot_tcdb_chebi

import gc
gc.collect()

0

In [35]:
uniprot_accessions = set(df.index.unique())

# Collect the filtered chunks in a list and concatenate once at the end.
# Calling pd.concat inside the loop would copy the growing dataframe every iteration.
filtered_chunks = []

with pd.read_table(
    "../data/raw/gene_ontology/goa_uniprot_all_ebi_filtered.tsv.xz",
    header=None,
    names=["Uniprot", "qualifier", "go_id", "evidence_code", "aspect", "date"],
    dtype={
        "Uniprot": "string",
        "qualifier": "category",
        "go_id": "string",
        "evidence_code": "category",
        "aspect": "category",
        "date": "int",
    },
    chunksize=10 ** 6,
    # parse_dates=["date"],
) as reader:
    for df_chunk in reader:
        df_chunk_filtered = df_chunk[df_chunk.Uniprot.isin(uniprot_accessions)].drop("date", axis=1).drop_duplicates()
        filtered_chunks.append(df_chunk_filtered)

# Duplicates can span chunk boundaries, so deduplicate once more after concatenating
df_go_ebi = pd.concat(filtered_chunks, axis=0).drop_duplicates().reset_index(drop=True)

In [37]:
# This halves the file size!
df_go_ebi = df_go_ebi.reset_index(drop=True)
save_df(df_go_ebi, "go", OUTPUT_FOLDER)
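
The chunked reading pattern can be tried on a small in-memory table. The sketch below uses made-up accessions and GO ids and a tiny `chunksize` so that duplicates end up in different chunks; it assumes a pandas version in which the chunked reader can be used as a context manager, as in the cell above.

```python
import io

import pandas as pd

# Hypothetical toy annotations: accession, GO id, evidence code
toy_tsv = (
    "P1\tGO:1\tIEA\n"
    "P9\tGO:2\tIDA\n"
    "P1\tGO:1\tIEA\n"
    "P2\tGO:3\tIEA\n"
    "P1\tGO:1\tIEA\n"
)
wanted_accessions = {"P1", "P2"}

chunks = []
with pd.read_table(
    io.StringIO(toy_tsv),
    header=None,
    names=["Uniprot", "go_id", "evidence_code"],
    chunksize=2,  # tiny on purpose, so the duplicates land in different chunks
) as reader:
    for chunk in reader:
        # Filter each chunk right away, so only the relevant rows stay in memory
        chunks.append(chunk[chunk.Uniprot.isin(wanted_accessions)].drop_duplicates())

# Deduplicating per chunk is not enough: the same row can occur in several chunks
df_toy_go = pd.concat(chunks).drop_duplicates().reset_index(drop=True)
print(df_toy_go)
```

Without the final `drop_duplicates`, the row for `P1` would appear three times, once per chunk.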

In [28]:
CREATE_FULL_GO_TABLE = False
if CREATE_FULL_GO_TABLE:
    df_go_ebi = pd.read_table(
        "../data/raw/gene_ontology/goa_uniprot_all_ebi_filtered.tsv.xz",
        header=None,
        names=["Uniprot", "qualifier", "go_id", "evidence_code", "aspect", "date"],
        dtype={
            "Uniprot": "string",
            "qualifier": "category",
            "go_id": "string",
            "evidence_code": "category",
            "aspect": "category",
            "date": "int",
        },
        # parse_dates=["date"],
    )
    save_df(df_go_ebi, "go_complete", OUTPUT_FOLDER)
    uniprot_accessions = set(df.index.unique())
    df_go_ebi = df_go_ebi[df_go_ebi.Uniprot.isin(uniprot_accessions)]
    print(df_go_ebi.shape)
    df_go_ebi = df_go_ebi.drop("date", axis=1)
    df_go_ebi = df_go_ebi.drop_duplicates()
    print(df_go_ebi.shape)
    # This halves the file size!
    df_go_ebi = df_go_ebi.reset_index(drop=True)
    save_df(df_go_ebi, "go", OUTPUT_FOLDER)

## GO-slim transmembrane transporter activity

Currently not used, since the entries are too recent for our current Uniprot version, and we implemented a similar pipeline already.

In [1]:
# import requests, sys
# import json
# import pandas as pd

# REDOWNLOAD_TRANSPORTER_GO_TERMS = False
# if REDOWNLOAD_TRANSPORTER_GO_TERMS:
#     requestURL = "https://www.ebi.ac.uk/QuickGO/services/ontology/go/slim?slimsToIds=GO%3A0022857&relations=is_a"

#     r = requests.get(requestURL, headers={ "Accept" : "application/json"})

#     if not r.ok:
#       r.raise_for_status()
#       sys.exit()

#     responseBody = r.text

#     json_transporter_goslim_isa = json.loads(responseBody)["results"]

#     with open("../data/raw/gene_ontology/goslim_transmembrane_transport.json", "w") as json_file:
#         json.dump(json_transporter_goslim_isa, json_file)

# df_goslim_transporter = pd.read_json("../data/raw/gene_ontology/goslim_transmembrane_transport.json")
# assert not (df_goslim_transporter.slimsToIds.apply(len) != 1).any()
# df_goslim_transporter["slimsToIds"] = df_goslim_transporter.slimsToIds.apply(lambda x: x[0])
# df_goslim_transporter

Unnamed: 0,slimsFromId,slimsToIds
0,GO:0015313,GO:0022857
1,GO:0015554,GO:0022857
2,GO:0015312,GO:0022857
3,GO:0042933,GO:0022857
4,GO:0015553,GO:0022857
...,...,...
1027,GO:0033266,GO:0022857
1028,GO:0015615,GO:0022857
1029,GO:0015614,GO:0022857
1030,GO:0015612,GO:0022857


### GO/Chebi connection

QuickGO contains links between GO terms and ChEBI molecule identifiers. We could not find them anywhere else, so downloading them through GET requests seems to be the best option.

In [38]:
REDOWNLOAD_GO_XRELS = False
if REDOWNLOAD_GO_XRELS:

    go_terms_uniprot = load_df("go").go_id.unique().tolist()

    responses = list()

    max_entries = 300
    # Query QuickGO in batches of at most max_entries GO terms per request
    for i in range(0, len(go_terms_uniprot), max_entries):
        go_terms_sublist = go_terms_uniprot[i : i + max_entries]
        go_terms_str_encoded = "%2C".join([node.replace(":", "%3A") for node in go_terms_sublist])
        requestURL = f"https://www.ebi.ac.uk/QuickGO/services/ontology/go/terms/{go_terms_str_encoded}/xontologyrelations"

        r = requests.get(requestURL, headers={"Accept": "application/json"})
        # Raises requests.HTTPError on a 4xx/5xx response, which stops the download
        r.raise_for_status()

        responses.append(r.text)

    json_entries = list()
    for response in responses:
        json_entry = json.loads(response)["results"]
        json_entries.extend(json_entry)
    with open("../data/raw/gene_ontology/go_xrel.json", "w") as xrel_json_file: 
        json.dump(json_entries, xrel_json_file)
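
The batching and URL encoding in the download cell can be checked without sending any request. The following sketch only builds the request URLs. It uses `urllib.parse.quote` instead of the manual `%3A`/`%2C` replacement, and the helper name `build_batch_urls` is made up for illustration.

```python
from urllib.parse import quote

BASE_URL = "https://www.ebi.ac.uk/QuickGO/services/ontology/go/terms/{}/xontologyrelations"


def build_batch_urls(go_terms, batch_size=300):
    """Return one request URL per batch of at most `batch_size` GO terms."""
    urls = []
    for start in range(0, len(go_terms), batch_size):
        batch = go_terms[start : start + batch_size]
        # safe="" percent-encodes ":" as %3A and "," as %2C
        urls.append(BASE_URL.format(quote(",".join(batch), safe="")))
    return urls


# Three GO ids taken from the tables in this notebook, in batches of two
urls = build_batch_urls(["GO:0006633", "GO:0006629", "GO:0006665"], batch_size=2)
for url in urls:
    print(url)
```

With a batch size of two, the three terms are spread over two URLs; the notebook itself sends up to 300 terms per request.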

In [39]:
df_go_xrel = pd.read_json("../data/raw/gene_ontology/go_xrel.json")

In [40]:
# go term to list of dicts
dict_go_to_xrels = df_go_xrel.set_index("id").xRelations.to_dict()#.apply(pd.Series)
go_to_xrel_records = []
go_to_xrel_records_header = ["go_id", "xrel_id","xrel_term", "namespace", "relation"]
for go_term, xrel_list in dict_go_to_xrels.items():
    if not isinstance(xrel_list, list):
        continue
    for entry in xrel_list:
        go_to_xrel_records.append(
            [
                go_term, entry["id"], entry["term"], entry["namespace"], entry["relation"]
            ]
        )

df_go_xrel = pd.DataFrame.from_records(go_to_xrel_records, columns=go_to_xrel_records_header)
# df_go_to_chebi = df_go_to_chebi[df_go_to_chebi.namespace == "CHEBI"].drop("namespace", axis=1)
df_go_xrel = df_go_xrel.astype({"relation": "category", "namespace": "category"})

df_go_xrel

Unnamed: 0,go_id,xrel_id,xrel_term,namespace,relation
0,GO:0006633,CHEBI:28868,fatty acid anion,CHEBI,has_primary_output
1,GO:0006629,CHEBI:18059,lipid,CHEBI,has_primary_input_or_output
2,GO:0006665,CHEBI:26739,sphingolipid,CHEBI,has_primary_input_or_output
3,GO:0004045,CHEBI:59874,N-acyl-L-alpha-amino acid anion,CHEBI,has_participant
4,GO:0004045,CHEBI:15377,water,CHEBI,has_participant
...,...,...,...,...,...
22853,GO:0009709,CHEBI:65321,terpenoid indole alkaloid,CHEBI,has_primary_output
22854,GO:0140125,CHEBI:26948,vitamin B1,CHEBI,has_primary_input
22855,GO:0140145,CHEBI:23378,copper cation,CHEBI,has_primary_input
22856,GO:0007272,CL:0000125,glial cell,CL,has_participant
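
The flattening step above turns one nested list of cross relations per GO term into one flat record per relation. A standard-library sketch of the same idea, assuming the JSON entries carry the keys used in the cell above (`id`, `xRelations`, and per relation `id`, `term`, `namespace`, `relation`). The two relations are copied from the table above; the third entry is hypothetical.

```python
entries = [
    {
        "id": "GO:0006633",
        "xRelations": [
            {
                "id": "CHEBI:28868",
                "term": "fatty acid anion",
                "namespace": "CHEBI",
                "relation": "has_primary_output",
            }
        ],
    },
    {
        "id": "GO:0007272",
        "xRelations": [
            {
                "id": "CL:0000125",
                "term": "glial cell",
                "namespace": "CL",
                "relation": "has_participant",
            }
        ],
    },
    {"id": "GO:0000000"},  # hypothetical term without any cross relations
]

# One (go_id, xrel_id, xrel_term, namespace, relation) tuple per cross relation
records = [
    (entry["id"], xrel["id"], xrel["term"], xrel["namespace"], xrel["relation"])
    for entry in entries
    for xrel in entry.get("xRelations") or []  # skip terms without relations
]

# Keep only the links to ChEBI molecules
chebi_records = [record for record in records if record[3] == "CHEBI"]
print(chebi_records)
```

Terms without cross relations contribute no records, which corresponds to the `isinstance(xrel_list, list)` check in the cell above.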


The cross relations include:

- Chemical Entities of Biological Interest (ChEBI): a freely available dictionary of molecular entities focused on ‘small’ chemical compounds.

- Cell Ontology (CL): an OBO Foundry ontology for the representation of cell types.

- Plant Ontology (PO): a major international bioinformatics effort to standardize the nomenclature, definitions, synonyms and relations of terms that describe anatomical entities as well as the growth and developmental stages of plants.

#### Chebi data:

In [41]:
df_go_to_chebi = (
    df_go_xrel[df_go_xrel.namespace == "CHEBI"]
    .drop("namespace", axis=1)
    .drop_duplicates()
    .reset_index(drop=True)
    .rename(columns={"xrel_id": "chebi_id", "xrel_term": "chebi_term"})
)
df_go_to_chebi


Unnamed: 0,go_id,chebi_id,chebi_term,relation
0,GO:0006633,CHEBI:28868,fatty acid anion,has_primary_output
1,GO:0006629,CHEBI:18059,lipid,has_primary_input_or_output
2,GO:0006665,CHEBI:26739,sphingolipid,has_primary_input_or_output
3,GO:0004045,CHEBI:59874,N-acyl-L-alpha-amino acid anion,has_participant
4,GO:0004045,CHEBI:15377,water,has_participant
...,...,...,...,...
21732,GO:1902434,CHEBI:16189,sulfate,has_primary_input
21733,GO:0010166,CHEBI:73702,wax,has_primary_input_or_output
21734,GO:0009709,CHEBI:65321,terpenoid indole alkaloid,has_primary_output
21735,GO:0140125,CHEBI:26948,vitamin B1,has_primary_input


In [42]:
save_df(df=df_go_to_chebi, dataset_name="go_chebi")

## Ontology files

To make reading the graphs faster, we are going to pickle them as well. 

In [43]:
graph_go = obonet.read_obo("../data/raw/ontologies/go.obo")

save_df(graph_go, "go_obo")

In [44]:
graph_chebi = obonet.read_obo("../data/raw/ontologies/chebi.obo")

save_df(graph_chebi, "chebi_obo")

## Total file sizes:

In [45]:
!du -hs ../data/datasets/*

197M	../data/datasets/chebi_obo.gpickle
211M	../data/datasets/go.pickle
928K	../data/datasets/go_chebi.pickle
16G	../data/datasets/go_complete.pickle.deprecated
29M	../data/datasets/go_obo.gpickle
147M	../data/datasets/interpro.pickle
56M	../data/datasets/keywords.pickle
904K	../data/datasets/tcdb_substrates.pickle
740M	../data/datasets/uniprot.pickle
693M	../data/datasets/uniprot100.pickle
248M	../data/datasets/uniprot70.pickle
