# Dataset creation

The input file for this notebook is a custom Uniprot download. Proteins whose existence is not supported by evidence at protein level or transcript level, as well as proteins with fragmented sequences, were removed during the dataset creation process.

The Uniprot version used here is 2022_05.

The result will be datasets of:

- Sequences, along with information about manual review status and gene/protein names
- Uniprot keywords
- GO terms
- TCDB data, including annotated CHEBI molecules
- Interpro domains

The data will be saved as pickled dataframes, which makes reading them back much faster. We also tried other binary formats, such as feather, parquet, HDF, and gz/lzma-compressed pickles. Uncompressed pickles gave the best tradeoff between reading speed and file size.
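
A comparison like the one described above can be sketched with the standard library alone. This is not the benchmark we ran: the payload is a made-up stand-in for a dataframe, and only plain and gzip-compressed pickles are compared. Timings depend on the machine, so none are stated here.

```python
import gzip
import os
import pickle
import tempfile
import time

# Made-up payload standing in for a dataframe; not the notebook's data.
payload = {
    "accession": [f"P{i:05d}" for i in range(50_000)],
    "sequence": ["ACDEFGHIKLMNPQRSTVWY" * 5] * 50_000,
}


def timed_load(path, opener):
    """Return (object, seconds) for reading a pickle through the given opener."""
    start = time.perf_counter()
    with opener(path, "rb") as handle:
        obj = pickle.load(handle)
    return obj, time.perf_counter() - start


with tempfile.TemporaryDirectory() as tmp:
    plain_path = os.path.join(tmp, "data.pickle")
    gz_path = os.path.join(tmp, "data.pickle.gz")

    # Write the same object once uncompressed and once gzip-compressed.
    with open(plain_path, "wb") as handle:
        pickle.dump(payload, handle, protocol=pickle.HIGHEST_PROTOCOL)
    with gzip.open(gz_path, "wb") as handle:
        pickle.dump(payload, handle, protocol=pickle.HIGHEST_PROTOCOL)

    plain_size = os.path.getsize(plain_path)
    gz_size = os.path.getsize(gz_path)
    plain_obj, plain_seconds = timed_load(plain_path, open)
    gz_obj, gz_seconds = timed_load(gz_path, gzip.open)

print(f"plain: {plain_size} bytes, read in {plain_seconds:.4f} s")
print(f"gzip:  {gz_size} bytes, read in {gz_seconds:.4f} s")
```

The compressed file is smaller but has to be decompressed on every read, which is the tradeoff mentioned above.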

In [1]:
OUTPUT_FOLDER = "../data/datasets"

In [2]:
import requests, sys
import pandas as pd
from subpred.util import save_df, load_df
from subpred.cdhit import cd_hit
import obonet

In [3]:

names = [
    "Uniprot",
    "gene_names",
    "protein_names",
    "reviewed",
    "protein_existence",
    "sequence",
    "organism_id",
    "go_ids",
    "keyword_ids",
    "keywords",
    "tcdb_ids",
    "interpro_ids",
]
dtypes = {
    "Uniprot": "string",
    "gene_names": "string",
    "protein_names": "string",
    "reviewed": "category",
    "protein_existence": "category",
    "sequence": "string",
    "organism_id": "int",
    "go_ids": "string",
    "keyword_ids": "string",
    "keywords": "string",
    "tcdb_ids": "string",
    "interpro_ids": "string",
}

df = pd.read_table(
    "../data/raw/uniprot/uniprot_2022_05_evidence1-2_nofragments.tsv",
    index_col=0,
    header=None,
    names=names,
    dtype=dtypes,
    skiprows=1,
)


## Cleanup

In [4]:
print(df.reviewed.value_counts().to_string())
df.reviewed = (df.reviewed == "reviewed").astype("bool")
print(df.reviewed.value_counts().to_string())

reviewed
unreviewed    1530743
reviewed       162964
reviewed
False    1530743
True      162964


In [5]:
print(df.protein_existence.value_counts().to_string())
df.protein_existence = df.protein_existence.map({"Evidence at transcript level":2,"Evidence at protein level":1}).astype("int")
print(df.protein_existence.value_counts().to_string())

protein_existence
Evidence at transcript level    1429578
Evidence at protein level        264129
protein_existence
2    1429578
1     264129


### Filtering out proteins without gene names

The proteins without gene names appear to be mostly peptides and pollen proteins, with no transporters among them.

In [6]:
print("before", len(df))
df = df[~df.gene_names.isnull()]
print("after", len(df))

before 1693707
after 1021957


### Parsing sequences

In [7]:
import re
print("proteins with non-standard amino acids:")
df[~df.sequence.str.fullmatch(re.compile("[ACDEFGHIKLMNPQRSTVWY]+"))].shape[0]

proteins with non-standard amino acids:


21352

Removing non-standard amino acids from sequences:

In [8]:
df.sequence = df.sequence.str.replace(re.compile("[^ACDEFGHIKLMNPQRSTVWY]+"), "", regex=True)

In [9]:
print("proteins with non-standard amino acids:")
df[~df.sequence.str.fullmatch(re.compile("[ACDEFGHIKLMNPQRSTVWY]+"))].shape[0]

proteins with non-standard amino acids:


0
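
To make the effect of the substitution above concrete, here is a minimal sketch on a single made-up sequence, using only the `re` module. The helper name `strip_non_standard` and the example string are illustrative and not part of the notebook.

```python
import re

STANDARD = "ACDEFGHIKLMNPQRSTVWY"
# Runs of characters outside the 20 standard amino acids.
NON_STANDARD_RUN = re.compile(f"[^{STANDARD}]+")
# A sequence made up of standard amino acids only.
ONLY_STANDARD = re.compile(f"[{STANDARD}]+")


def strip_non_standard(sequence: str) -> str:
    """Delete every character that is not a standard amino acid."""
    return NON_STANDARD_RUN.sub("", sequence)


example = "MKXAUB"  # X, U and B are not in the standard alphabet
cleaned = strip_non_standard(example)
print(cleaned)
```

Note that deleting characters shortens the sequence, so residue positions after a removed character shift.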

### Saving dataset

In [10]:
df_sequences = df.drop(
    ["go_ids", "keyword_ids", "keywords", "tcdb_ids", "interpro_ids"], axis=1
)

In [11]:
save_df(df=df_sequences, dataset_name="uniprot", folder_path=OUTPUT_FOLDER)

### Clustering

In [12]:
cluster_representatives_70 = cd_hit(df_sequences.sequence, identity_threshold=70, n_threads=0)
df_sequences_cdhit70 = df_sequences.loc[cluster_representatives_70]
save_df(df=df_sequences_cdhit70, dataset_name="uniprot70", folder_path=OUTPUT_FOLDER)

cd-hit: clustered .......... sequences into finished clusters at threshold 70


## Annotation datasets

### Keywords

In [13]:
df_keywords = (
    df.keywords.dropna()
    .str.split(";")
    .explode()
    .str.strip()
    .astype("category")
    .rename("keyword")
    .to_frame()
    .reset_index(drop=False)
    .drop_duplicates()
)
df_keywords

Unnamed: 0,Uniprot,keyword
0,A0A0C5B5G6,DNA-binding
1,A0A0C5B5G6,Mitochondrion
2,A0A0C5B5G6,Nucleus
3,A0A0C5B5G6,Osteogenesis
4,A0A0C5B5G6,Reference proteome
...,...,...
5318151,X5MI49,Membrane
5318152,X5MI49,Transferase
5318153,X5MI49,Transmembrane
5318154,X5MI49,Transmembrane helix


In [14]:
save_df(df=df_keywords, dataset_name="keywords", folder_path=OUTPUT_FOLDER)

There are keyword ids as well, but their order does not match that of the keyword strings column:

In [15]:
df.keyword_ids.str.split(";").explode().str.strip()

Uniprot
A0A0C5B5G6    KW-0238
A0A0C5B5G6    KW-0496
A0A0C5B5G6    KW-0539
A0A0C5B5G6    KW-0892
A0A0C5B5G6    KW-1185
               ...   
X5MI49        KW-0472
X5MI49        KW-0808
X5MI49        KW-0812
X5MI49        KW-1133
X5MPI5        KW-0808
Name: keyword_ids, Length: 5410571, dtype: object

### Interpro Domains

In [16]:
df_interpro = (
    df.interpro_ids.dropna()
    .str.rstrip(";")
    .str.split(";")
    .explode()
    .rename("interpro_id")
    .to_frame()
    .reset_index(drop=False)
    .drop_duplicates()
)
df_interpro

Unnamed: 0,Uniprot,interpro_id
0,A0A1B0GTW7,IPR001577
1,A0JNW5,IPR026728
2,A0JNW5,IPR026854
3,A0JP26,IPR002110
4,A0JP26,IPR036770
...,...,...
4190605,X5MFI4,IPR029044
4190606,X5MI49,IPR008630
4190607,X5MI49,IPR029044
4190608,X5MPI5,IPR008630


#### Adding Names for Interpro domains

In [17]:
df_interpro_annotations = pd.read_table("../data/raw/interpro/interpro_entries.tsv", names=["accession", "type", "name"], skiprows=1, index_col=0, dtype={"accession":str, "type": "category", "name":str})
df_interpro_annotations.head()

accession,type,name
IPR000126,Active_site,"Serine proteases, V8 family, serine active site"
IPR000138,Active_site,"Hydroxymethylglutaryl-CoA lyase, active site"
IPR000169,Active_site,"Cysteine peptidase, cysteine active site"
IPR000180,Active_site,"Membrane dipeptidase, active site"
IPR000189,Active_site,"Prokaryotic transglycosylase, active site"


What are the entry types?

https://www.ebi.ac.uk/training/online/courses/interpro-functional-and-structural-analysis/what-is-an-interpro-entry/interpro-entry-types/#:~:text=InterPro%20entries%20are%20classified%20into,of%20an%20InterPro%20entry%20page.

InterPro entries are classified into one of five categories, depending on the biological entity they represent: homologous superfamily, protein family, domain, repeat or site.

In [18]:
df_interpro_annotations.type.value_counts()

type
Family                    23997
Domain                    11954
Homologous_superfamily     3399
Conserved_site              696
Repeat                      327
Active_site                 132
Binding_site                 75
PTM                          17
Name: count, dtype: int64

It seems that the textual data splits "site" into conserved site, active site, binding site, and PTM. We can just keep all of them, since it will be obvious later which category a particular annotation belongs to.

In [19]:
df_interpro_annotations[df_interpro_annotations.type == "PTM"]

accession,type,name
IPR000152,PTM,EGF-type aspartate/asparagine hydroxylation site
IPR001020,PTM,"Phosphotransferase system, HPr histidine phosp..."
IPR002114,PTM,"Phosphotransferase system, HPr serine phosphor..."
IPR002332,PTM,"Nitrogen regulatory protein P-II, urydylation ..."
IPR004091,PTM,"Chemotaxis methyl-accepting receptor, methyl-a..."
IPR006141,PTM,Intein N-terminal splicing region
IPR006162,PTM,Phosphopantetheine attachment site
IPR012902,PTM,Prokaryotic N-terminal methylation site
IPR018051,PTM,"Surfactant-associated polypeptide, palmitoylat..."
IPR018070,PTM,"Neuromedin U, amidation site"


In [20]:
df_interpro = df_interpro.merge(df_interpro_annotations, how="left", left_on="interpro_id", right_index=True)
df_interpro

Unnamed: 0,Uniprot,interpro_id,type,name
0,A0A1B0GTW7,IPR001577,Family,"Peptidase M8, leishmanolysin"
1,A0JNW5,IPR026728,Family,UHRF1-binding protein 1-like
2,A0JNW5,IPR026854,Domain,Vacuolar protein sorting-associated protein 13...
3,A0JP26,IPR002110,Repeat,Ankyrin repeat
4,A0JP26,IPR036770,Homologous_superfamily,Ankyrin repeat-containing domain superfamily
...,...,...,...,...
4190605,X5MFI4,IPR029044,Homologous_superfamily,Nucleotide-diphospho-sugar transferases
4190606,X5MI49,IPR008630,Family,Glycosyltransferase 34
4190607,X5MI49,IPR029044,Homologous_superfamily,Nucleotide-diphospho-sugar transferases
4190608,X5MPI5,IPR008630,Family,Glycosyltransferase 34


When using the current Interpro file, 61901 annotations had no name in the file. Looking them up manually (e.g. https://www.ebi.ac.uk/interpro/entry/InterPro/IPR027181/) showed that all of them were inactive, removed entries. We therefore switched to version 90, which is the version our Uniprot dataset 2022_05 is linked to. With that version, there are no more identifiers with missing names.

In [21]:
df_interpro[df_interpro.name.isnull() | df_interpro.type.isnull()]

Unnamed: 0,Uniprot,interpro_id,type,name
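
A check like the one above can also list which identifiers are missing from the annotation file. The following is a minimal sketch on toy data, assuming the same layout as in the notebook (links with an `interpro_id` column, annotations indexed by accession). All identifiers and names in it are placeholders.

```python
import pandas as pd

# Placeholder protein-to-domain links; none of these ids are real.
df_links = pd.DataFrame(
    {
        "Uniprot": ["P_TEST_1", "P_TEST_1", "P_TEST_2"],
        "interpro_id": ["IPR_TEST_1", "IPR_TEST_2", "IPR_TEST_3"],
    }
)
# Placeholder annotation table, indexed by accession like the notebook's.
df_names = pd.DataFrame(
    {"type": ["Family", "Domain"], "name": ["Example family", "Example domain"]},
    index=pd.Index(["IPR_TEST_1", "IPR_TEST_2"], name="accession"),
)

# indicator=True adds a "_merge" column telling where each row was found.
merged = df_links.merge(
    df_names, how="left", left_on="interpro_id", right_index=True, indicator=True
)
missing_ids = sorted(
    merged.loc[merged["_merge"] == "left_only", "interpro_id"].unique()
)
print(missing_ids)
```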


In [22]:
save_df(df=df_interpro, dataset_name="interpro", folder_path=OUTPUT_FOLDER)

### TCDB substrates

In [23]:
df_substrates = pd.read_table(
    "../data/raw/tcdb/tcdb_substrates.tsv",
    header=None,
    names=["tcdb_id", "tcdb_substrates"],
    index_col=0,
)
df_substrates = (
    df_substrates.tcdb_substrates.str.split("|")
    .explode()
    .str.split(";", expand=True)
    .rename(columns={0: "chebi_id", 1: "chebi_term"})
    .reset_index(drop=False)
    .melt(id_vars=["tcdb_id"], value_vars=["chebi_id", "chebi_term"])
)
df_substrates

Unnamed: 0,tcdb_id,variable,value
0,2.A.52.2.2,chebi_id,CHEBI:23337
1,2.A.52.2.2,chebi_id,CHEBI:25517
2,2.A.22.2.5,chebi_id,CHEBI:9175
3,2.A.22.2.5,chebi_id,CHEBI:8345
4,2.A.90.2.4,chebi_id,CHEBI:8816
...,...,...,...
26829,1.H.1.1.17,chebi_term,cation
26830,8.A.139.2.2,chebi_term,peptide
26831,1.B.6.2.13,chebi_term,molecule
26832,2.A.66.1.20,chebi_term,hydron
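
The string layout that the parsing above relies on (substrates separated by `|`, identifier and term separated by `;`) can be shown on a single field. This is a plain-Python sketch with a made-up field value. Unlike the pandas version, it splits on the first `;` only, which is our assumption about how a term containing `;` should be handled.

```python
def parse_substrates(field: str) -> list[tuple[str, str]]:
    """Split one substrate field into (chebi_id, chebi_term) pairs."""
    pairs = []
    for item in field.split("|"):
        # maxsplit=1 keeps any further ";" inside the term (an assumption).
        chebi_id, chebi_term = item.split(";", maxsplit=1)
        pairs.append((chebi_id, chebi_term))
    return pairs


# Placeholder value, not taken from the TCDB file.
field = "CHEBI:0000001;example substrate A|CHEBI:0000002;example substrate B"
parsed = parse_substrates(field)
print(parsed)
```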


In [24]:
df_tcdb_uniprot = (
    df.tcdb_ids.dropna()
    .str.rstrip(";")
    .str.split(";")
    .explode()
    .rename("tcdb_id")
    .reset_index(drop=False)
    .drop_duplicates()
    .melt(id_vars=["tcdb_id"], value_vars=["Uniprot"])
)
df_tcdb_uniprot

Unnamed: 0,tcdb_id,variable,value
0,9.A.46.1.2,Uniprot,A0PK11
1,9.B.438.1.1,Uniprot,A2RU14
2,1.N.2.1.1,Uniprot,A6NI61
3,9.A.80.1.1,Uniprot,A6NKB5
4,1.I.1.1.3,Uniprot,O00159
...,...,...,...
7553,2.A.47.1.5,Uniprot,Q9W7I2
7554,2.A.18.3.2,Uniprot,Q9XE48
7555,2.A.18.3.3,Uniprot,Q9XE49
7556,2.A.29.11.2,Uniprot,Q9ZNY4


In [25]:
df_uniprot_tcdb_chebi = pd.concat([df_substrates, df_tcdb_uniprot]).reset_index(drop=True)
df_uniprot_tcdb_chebi = df_uniprot_tcdb_chebi.drop_duplicates()
df_uniprot_tcdb_chebi = df_uniprot_tcdb_chebi.reset_index(drop=True)
df_uniprot_tcdb_chebi

Unnamed: 0,tcdb_id,variable,value
0,2.A.52.2.2,chebi_id,CHEBI:23337
1,2.A.52.2.2,chebi_id,CHEBI:25517
2,2.A.22.2.5,chebi_id,CHEBI:9175
3,2.A.22.2.5,chebi_id,CHEBI:8345
4,2.A.90.2.4,chebi_id,CHEBI:8816
...,...,...,...
34308,2.A.47.1.5,Uniprot,Q9W7I2
34309,2.A.18.3.2,Uniprot,Q9XE48
34310,2.A.18.3.3,Uniprot,Q9XE49
34311,2.A.29.11.2,Uniprot,Q9ZNY4


In [26]:
save_df(df=df_uniprot_tcdb_chebi, dataset_name="tcdb_substrates", folder_path=OUTPUT_FOLDER)

### GO

This file from EBI contains all GO terms from Uniprot and from Interpro. It has been filtered for Uniprot accessions, which removes RNA annotations, for example. Some of the unnecessary columns have also been removed with awk, in order to reduce the file size from 170GB (uncompressed raw data) to 1.5GB (as xz file). The command is in the Makefile. We then keep only the Uniprot proteins that are in our dataset, which reduces the pickle file size from around 8GB to 100MB.

In [27]:
# # TODO chunk reading file
# uniprot_accessions = set(df.index.unique())

# with pd.read_table(
#     "../data/raw/gene_ontology/goa_uniprot_all_ebi_filtered.tsv.xz",
#     header=None,
#     names=["Uniprot", "qualifier", "go_id", "evidence_code", "aspect", "date"],
#     dtype={
#         "Uniprot": "string",
#         "qualifier": "category",
#         "go_id": "string",
#         "evidence_code": "category",
#         "aspect": "category",
#         "date": "int",
#     },
#     chunksize=10 ** 6
#     # parse_dates=["date"],
# ) as reader:
#     for chunk in reader:
#         pass
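
The chunked reading marked as TODO above could look roughly like the following sketch. It is not what the notebook runs: the in-memory TSV, the three-column layout, and the accession set are made up so that the example is self-contained. The idea is to filter each chunk before keeping it, so the full table never has to fit into memory.

```python
import io

import pandas as pd

# Toy stand-in for the large annotation file (placeholder values).
raw = (
    "P1\tenables\tGO:0000001\n"
    "P2\tenables\tGO:0000002\n"
    "P3\tinvolved_in\tGO:0000003\n"
    "P1\tlocated_in\tGO:0000004\n"
    "P4\tenables\tGO:0000005\n"
)
wanted_accessions = {"P1", "P3"}

filtered_chunks = []
with pd.read_table(
    io.StringIO(raw),
    header=None,
    names=["Uniprot", "qualifier", "go_id"],
    chunksize=2,  # tiny on purpose; the TODO above uses 10 ** 6
) as reader:
    for chunk in reader:
        # Keep only the rows for accessions we care about.
        filtered_chunks.append(chunk[chunk.Uniprot.isin(wanted_accessions)])

df_filtered = pd.concat(filtered_chunks, ignore_index=True)
print(df_filtered)
```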

In [28]:
df_go_ebi = pd.read_table(
    "../data/raw/gene_ontology/goa_uniprot_all_ebi_filtered.tsv.xz",
    header=None,
    names=["Uniprot", "qualifier", "go_id", "evidence_code", "aspect", "date"],
    dtype={
        "Uniprot": "string",
        "qualifier": "category",
        "go_id": "string",
        "evidence_code": "category",
        "aspect": "category",
        "date": "int",
    },
    # parse_dates=["date"],
)
save_df(df_go_ebi, "go_complete", OUTPUT_FOLDER)

In [29]:
# df_go_ebi = load_df("go_complete")

In [30]:
uniprot_accessions = set(df.index.unique())
df_go_ebi = df_go_ebi[df_go_ebi.Uniprot.isin(uniprot_accessions)]

In [31]:
print(df_go_ebi.shape)
df_go_ebi = df_go_ebi.drop("date", axis=1)
df_go_ebi = df_go_ebi.drop_duplicates()
print(df_go_ebi.shape)

(7580427, 6)
(7452018, 5)


In [32]:
# Resetting the index halves the pickle file size!
df_go_ebi = df_go_ebi.reset_index(drop=True)
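
A plausible reason for the size reduction (an assumption, not checked against the notebook's files): after row filtering, the dataframe carries an explicit integer index with one stored label per row, while `reset_index(drop=True)` replaces it with a `RangeIndex`, which only stores start, stop, and step. The toy example below shows the change of index type.

```python
import pandas as pd

df_all = pd.DataFrame({"value": range(10)})

# Filtering keeps the original row labels of the surviving rows.
df_filtered = df_all[df_all.value.isin([0, 1, 4, 9])]
print(type(df_filtered.index).__name__, list(df_filtered.index))

# Resetting the index replaces those labels with a compact RangeIndex.
df_reset = df_filtered.reset_index(drop=True)
print(type(df_reset.index).__name__, list(df_reset.index))
```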

In [33]:
save_df(df_go_ebi, "go", OUTPUT_FOLDER)

In [34]:
# df_go_ebi_mf_enables = df_go_ebi[
#     (df_go_ebi.qualifier == "enables")&
#     (df_go_ebi.aspect == "F")
# ].drop(["qualifier", "aspect"], axis=1).reset_index(drop=True)
# save_df(df_go_ebi_mf_enables, "go_mf", OUTPUT_FOLDER)

### GO/Chebi connection

QuickGO contains links between GO terms and Chebi molecule identifiers. We could not find them anywhere else, so downloading them through GET requests seems to be the best option.

In [35]:
REDOWNLOAD_GO_XRELS = False
if REDOWNLOAD_GO_XRELS:
    go_terms_uniprot = load_df("go").go_id.unique().tolist()

    responses = list()

    # Query QuickGO in batches of at most max_entries GO terms per request
    max_entries = 300
    for i in range(0, len(go_terms_uniprot), max_entries):
        go_terms_sublist = go_terms_uniprot[i : i + max_entries]
        # URL-encode ":" and "," in the comma-separated list of GO ids
        go_terms_str_encoded = "%2C".join([node.replace(":", "%3A") for node in go_terms_sublist])
        requestURL = f"https://www.ebi.ac.uk/QuickGO/services/ontology/go/terms/{go_terms_str_encoded}/xontologyrelations"

        r = requests.get(requestURL, headers={"Accept": "application/json"})
        # Raises requests.HTTPError on a 4xx/5xx response, which aborts the download
        r.raise_for_status()

        responses.append(r.text)

    df_go_xrel_parts = []
    for response in responses:
        df_go_xrel_part = pd.read_json(response).results.apply(pd.Series)
        df_go_xrel_parts.append(df_go_xrel_part)
    df_go_xrel = pd.concat(df_go_xrel_parts, ignore_index=True)
    df_go_xrel.to_csv("../data/raw/gene_ontology/go_xrel.tsv", sep="\t")
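
The batching and URL encoding used in the download cell can be tried in isolation, without sending any request. The sketch below uses `urllib.parse.quote` instead of manual string replacement. The helper names `batched` and `build_url` are ours, and the batch size of 2 is only there to keep the example small.

```python
from urllib.parse import quote


def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def build_url(go_ids):
    """Build the QuickGO cross-ontology-relations URL for a list of GO ids."""
    # safe="" makes quote() encode ":" as %3A and "," as %2C as well.
    encoded = quote(",".join(go_ids), safe="")
    return (
        "https://www.ebi.ac.uk/QuickGO/services/ontology/go/terms/"
        f"{encoded}/xontologyrelations"
    )


go_ids = ["GO:0006633", "GO:0006629", "GO:0006665", "GO:0006754", "GO:0051028"]
urls = [build_url(batch) for batch in batched(go_ids, batch_size=2)]
for url in urls:
    print(url)
```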

In [37]:
import ast

df_go_xrel = pd.read_table("../data/raw/gene_ontology/go_xrel.tsv")

# go term to list of dicts
dict_go_to_xrels = df_go_xrel.set_index("id").xRelations.to_dict()#.apply(pd.Series)
go_to_xrel_records = []
go_to_xrel_records_header = ["go_id", "xrel_id","xrel_term", "namespace", "relation"]
for go_term, xrel_list in dict_go_to_xrels.items():
    # After the TSV round trip, a list may arrive as its string representation; parse it back
    if isinstance(xrel_list, str):
        xrel_list = ast.literal_eval(xrel_list)
    # Terms without cross relations hold NaN instead of a list
    if not isinstance(xrel_list, list):
        continue
    for entry in xrel_list:
        go_to_xrel_records.append(
            [
                go_term, entry["id"], entry["term"], entry["namespace"], entry["relation"]
            ]
        )

df_go_xrel = pd.DataFrame.from_records(go_to_xrel_records, columns=go_to_xrel_records_header)
# df_go_to_chebi = df_go_to_chebi[df_go_to_chebi.namespace == "CHEBI"].drop("namespace", axis=1)
df_go_xrel = df_go_xrel.astype({"relation": "category", "namespace": "category"})

df_go_xrel

Unnamed: 0,go_id,xrel_id,xrel_term,namespace,relation
0,GO:0006633,CHEBI:28868,fatty acid anion,CHEBI,has_primary_output
1,GO:0006629,CHEBI:18059,lipid,CHEBI,has_primary_input_or_output
2,GO:0006665,CHEBI:26739,sphingolipid,CHEBI,has_primary_input_or_output
3,GO:0006754,CHEBI:30616,ATP(4-),CHEBI,has_primary_output
4,GO:0051028,CHEBI:33699,messenger RNA,CHEBI,has_primary_input
...,...,...,...,...,...
22470,GO:0140786,CHEBI:28300,glutamine,CHEBI,has_input
22471,GO:1904474,CHEBI:57504,L-dopa zwitterion,CHEBI,has_input
22472,GO:0061431,CHEBI:64558,methionine zwitterion,CHEBI,has_input
22473,GO:0048353,PO:0000038,primary endosperm cell,PO,part_of
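
The flattening step can be followed on a single entry. The sketch below assumes that a list of cross relations written to a TSV file comes back as text and has to be parsed before use. The first entry reuses row 0 of the table above, and the second GO id is a placeholder for a term without cross relations.

```python
import ast

# Cross relations per GO term as they might look after reading the TSV file.
xrels_by_go_term = {
    "GO:0006633": (
        "[{'id': 'CHEBI:28868', 'term': 'fatty acid anion', "
        "'namespace': 'CHEBI', 'relation': 'has_primary_output'}]"
    ),
    "GO:0000000": float("nan"),  # placeholder term without cross relations
}

records = []
for go_id, xrels in xrels_by_go_term.items():
    if isinstance(xrels, str):
        # Turn the text back into a list of dicts.
        xrels = ast.literal_eval(xrels)
    if not isinstance(xrels, list):
        continue  # nothing to flatten for this term
    for entry in xrels:
        records.append(
            (go_id, entry["id"], entry["term"], entry["namespace"], entry["relation"])
        )

print(records)
```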


The cross relations include:

- Chemical Entities of Biological Interest (ChEBI): a freely available dictionary of molecular entities focused on ‘small’ chemical compounds.

- Cell Type Ontology (CL): an OBO Foundry ontology for the representation of cell types.

- Plant Ontology (PO): a major international bioinformatics effort to standardize the nomenclature, definitions, synonyms and relations of terms that describe anatomical entities as well as the growth and developmental stages of plants.

#### Chebi data:

In [38]:
df_go_to_chebi = (
    df_go_xrel[df_go_xrel.namespace == "CHEBI"]
    .drop("namespace", axis=1)
    .drop_duplicates()
    .reset_index(drop=True)
    .rename(columns={"xrel_id": "chebi_id", "xrel_term": "chebi_term"})
)
df_go_to_chebi


Unnamed: 0,go_id,chebi_id,chebi_term,relation
0,GO:0006633,CHEBI:28868,fatty acid anion,has_primary_output
1,GO:0006629,CHEBI:18059,lipid,has_primary_input_or_output
2,GO:0006665,CHEBI:26739,sphingolipid,has_primary_input_or_output
3,GO:0006754,CHEBI:30616,ATP(4-),has_primary_output
4,GO:0051028,CHEBI:33699,messenger RNA,has_primary_input
...,...,...,...,...
21350,GO:0033521,CHEBI:58404,"(E)-3,7,11,15-tetramethylhexadec-2-en-1-yl dip...",has_primary_output
21351,GO:0140786,CHEBI:28300,glutamine,has_input
21352,GO:1904474,CHEBI:57504,L-dopa zwitterion,has_input
21353,GO:0061431,CHEBI:64558,methionine zwitterion,has_input


In [39]:
save_df(df=df_go_to_chebi, dataset_name="go_chebi")

## Ontology files

To make reading the graphs faster, we are going to pickle them as well. 

In [40]:
graph_go = obonet.read_obo("../data/raw/ontologies/go.obo")

save_df(graph_go, "go_obo")

In [41]:
graph_chebi = obonet.read_obo("../data/raw/ontologies/chebi.obo")

save_df(graph_chebi, "chebi_obo")

## Total file sizes:

In [42]:
!du -hs ../data/datasets/*

197M	../data/datasets/chebi_obo.gpickle
912K	../data/datasets/go_chebi.pickle
16G	../data/datasets/go_complete.pickle
29M	../data/datasets/go_obo.gpickle
107M	../data/datasets/go.pickle
98M	../data/datasets/interpro.pickle
41M	../data/datasets/keywords.pickle
892K	../data/datasets/tcdb_substrates.pickle
158M	../data/datasets/uniprot70.pickle
498M	../data/datasets/uniprot.pickle
