# Dataset creation

The input file for this notebook is a custom download from Uniprot. Proteins whose existence is not supported by evidence at protein level or transcript level, as well as proteins with fragmented sequences, were removed during dataset creation.

The Uniprot version used here is 2022_05.

The result will be datasets of:

- Sequences, along with information about manual review status and gene/protein names
- Uniprot keywords
- GO terms
- TCDB data, including annotated CHEBI molecules
- Interpro domains

The data will be saved as pickled dataframes, which are much faster to read. We also tried other binary formats (feather, parquet, HDF) as well as gz/lzma-compressed pickles. Uncompressed pickles gave the best tradeoff between reading speed and file size.
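
The format comparison described above can be reproduced in miniature. The sketch below is not the benchmark that was run for this notebook. It only compares plain, gzip-compressed, and xz-compressed pickles (the variants pandas handles without extra dependencies) on a small synthetic dataframe. The helper name `benchmark_pickles` and the toy data are assumptions made for illustration.

```python
import random
import tempfile
import time
from pathlib import Path

import pandas as pd


def benchmark_pickles(df, compressions=(None, "gzip", "xz")):
    """Write df once per compression setting; record file size and read time."""
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        for compression in compressions:
            path = Path(tmp) / f"toy_{compression}.pickle"
            df.to_pickle(path, compression=compression)
            start = time.perf_counter()
            df_read = pd.read_pickle(path, compression=compression)
            elapsed = time.perf_counter() - start
            results[compression] = {
                "bytes": path.stat().st_size,
                "read_seconds": elapsed,
                "round_trip_ok": df_read.equals(df),
            }
    return results


# Synthetic sequences; real data will compress and load differently
rng = random.Random(0)
toy = pd.DataFrame(
    {"sequence": ["".join(rng.choices("ACDEFGHIKLMNPQRSTVWY", k=200)) for _ in range(2000)]}
)
benchmark = benchmark_pickles(toy)
for compression, stats in benchmark.items():
    print(compression, stats)
```

Timings on such a small dataframe are noisy, so the numbers only become meaningful when the same function is run on one of the real datasets.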

In [5]:
OUTPUT_FOLDER = "../data/datasets"

In [6]:
from subpred.util import save_df
from subpred.cdhit import cd_hit
import pandas as pd

In [7]:

names = [
    "Uniprot",
    "gene_names",
    "protein_names",
    "reviewed",
    "protein_existence",
    "sequence",
    "organism_id",
    "go_ids",
    "keyword_ids",
    "keywords",
    "tcdb_ids",
    "interpro_ids",
]
dtypes = {
    "Uniprot": "string",
    "gene_names": "string",
    "protein_names": "string",
    "reviewed": "category",
    "protein_existence": "category",
    "sequence": "string",
    "organism_id": "int",
    "go_ids": "string",
    "keyword_ids": "string",
    "keywords": "string",
    "tcdb_ids": "string",
    "interpro_ids": "string",
}

df = pd.read_table(
    "../data/raw/uniprot/uniprot_2022_05_evidence1-2_nofragments.tsv",
    index_col=0,
    header=None,
    names=names,
    dtype=dtypes,
    skiprows=1,
)


## Cleanup

In [8]:
print(df.reviewed.value_counts().to_string())
df.reviewed = (df.reviewed == "reviewed").astype("bool")
print(df.reviewed.value_counts().to_string())

unreviewed    1530743
reviewed       162964
False    1530743
True      162964


In [9]:
print(df.protein_existence.value_counts().to_string())
df.protein_existence = df.protein_existence.map({"Evidence at transcript level":2,"Evidence at protein level":1}).astype("int")
print(df.protein_existence.value_counts().to_string())

Evidence at transcript level    1429578
Evidence at protein level        264129
2    1429578
1     264129
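
`Series.map` with a dictionary turns every value that is missing from the dictionary into a missing value, and the later integer cast then fails with a generic conversion error. A small guard makes that failure explicit. The sketch below uses toy data, and the names `EXISTENCE_LEVELS` and `encode_existence` are hypothetical, not part of `subpred`.

```python
import pandas as pd

EXISTENCE_LEVELS = {
    "Evidence at protein level": 1,
    "Evidence at transcript level": 2,
}


def encode_existence(series):
    """Map evidence strings to integer levels, rejecting unknown labels."""
    # Cast to object first so that map returns plain values, not a categorical
    encoded = series.astype(object).map(EXISTENCE_LEVELS)
    unknown = series[encoded.isna()].unique()
    if len(unknown) > 0:
        raise ValueError(f"unmapped protein_existence values: {list(unknown)}")
    return encoded.astype("int")


toy = pd.Series(
    ["Evidence at protein level", "Evidence at transcript level"],
    dtype="category",
)
print(encode_existence(toy).tolist())
```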


### Filtering out proteins without gene names

The proteins without gene names appear to be mostly peptides and pollen proteins; none of them appear to be transporters.

In [10]:
print("before", len(df))
# .copy() avoids a SettingWithCopyWarning when columns are reassigned below
df = df[df.gene_names.notnull()].copy()
print("after", len(df))

before 1693707
after 1021957


### Parsing sequences

In [11]:
import re
print("proteins with non-standard amino acids:")
df[~df.sequence.str.fullmatch(re.compile("[ACDEFGHIKLMNPQRSTVWY]+"))].shape[0]

proteins with non-standard amino acids:


21352

Removing non-standard amino acids from sequences:

In [12]:
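# regex=True must be passed explicitly: since pandas 2.0 the default is regex=False,
# which rejects compiled patterns
df.sequence = df.sequence.str.replace(re.compile("[^ACDEFGHIKLMNPQRSTVWY]+"), "", regex=True)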

In [13]:
print("proteins with non-standard amino acids:")
df[~df.sequence.str.fullmatch(re.compile("[ACDEFGHIKLMNPQRSTVWY]+"))].shape[0]

proteins with non-standard amino acids:


0
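
The detection and removal steps above can be checked on a few toy sequences. This is a minimal sketch with made-up sequences; the constant name `STANDARD_AA` is an assumption introduced here.

```python
import pandas as pd

STANDARD_AA = "ACDEFGHIKLMNPQRSTVWY"

toy = pd.Series(["MKTAYIAK", "MKXUAB", "ACDZ"], dtype="string")

# True where the whole sequence consists of the 20 standard amino acids
is_standard = toy.str.fullmatch(f"[{STANDARD_AA}]+")

# Delete every character outside the standard alphabet
cleaned = toy.str.replace(f"[^{STANDARD_AA}]+", "", regex=True)

print(is_standard.tolist())  # [True, False, False]
print(cleaned.tolist())      # ['MKTAYIAK', 'MKA', 'ACD']
```

Deleting characters shortens the affected sequences, so residue positions in the cleaned sequences no longer match the original Uniprot entries.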

### Saving dataset

In [14]:
df_sequences = df.drop(
    ["go_ids", "keyword_ids", "keywords", "tcdb_ids", "interpro_ids"], axis=1
)

In [15]:
save_df(df=df_sequences, dataset_name="uniprot", folder_path=OUTPUT_FOLDER)

### Clustering

In [16]:
cluster_representatives_70 = cd_hit(df_sequences.sequence, identity_threshold=70, n_threads=0)
df_sequences_cdhit70 = df_sequences.loc[cluster_representatives_70]
save_df(df=df_sequences_cdhit70, dataset_name="uniprot70", folder_path=OUTPUT_FOLDER)

cd-hit: clustered .......... sequences into finished clusters at threshold 70


## Annotation datasets

### Keywords

In [17]:
df_keywords = (
    df.keywords.dropna()
    .str.split(";")
    .explode()
    .str.strip()
    .astype("category")
    .rename("keyword")
    .to_frame()
    .reset_index(drop=False)
    .drop_duplicates()
)
df_keywords

Unnamed: 0,Uniprot,keyword
0,A0A0C5B5G6,DNA-binding
1,A0A0C5B5G6,Mitochondrion
2,A0A0C5B5G6,Nucleus
3,A0A0C5B5G6,Osteogenesis
4,A0A0C5B5G6,Reference proteome
...,...,...
5318151,X5MI49,Membrane
5318152,X5MI49,Transferase
5318153,X5MI49,Transmembrane
5318154,X5MI49,Transmembrane helix


In [18]:
save_df(df=df_keywords, dataset_name="keywords", folder_path=OUTPUT_FOLDER)

There are keyword ids as well, but their order does not match that of the keyword strings column:

In [19]:
df.keyword_ids.str.split(";").explode().str.strip()

Uniprot
A0A0C5B5G6    KW-0238
A0A0C5B5G6    KW-0496
A0A0C5B5G6    KW-0539
A0A0C5B5G6    KW-0892
A0A0C5B5G6    KW-1185
               ...   
X5MI49        KW-0472
X5MI49        KW-0808
X5MI49        KW-0812
X5MI49        KW-1133
X5MPI5        KW-0808
Name: keyword_ids, Length: 5410571, dtype: object

### Interpro Domains

In [55]:
df_interpro = (
    df.interpro_ids.dropna()
    .str.rstrip(";")
    .str.split(";")
    .explode()
    .rename("interpro_id")
    .to_frame()
    .reset_index(drop=False)
    .drop_duplicates()
)
df_interpro

Unnamed: 0,Uniprot,interpro_id
0,A0A1B0GTW7,IPR001577
1,A0JNW5,IPR026728
2,A0JNW5,IPR026854
3,A0JP26,IPR002110
4,A0JP26,IPR036770
...,...,...
4190605,X5MFI4,IPR029044
4190606,X5MI49,IPR008630
4190607,X5MI49,IPR029044
4190608,X5MPI5,IPR008630


#### Adding Names for Interpro domains

In [56]:
df_interpro_annotations = pd.read_table(
    "../data/raw/interpro/interpro_entries.tsv",
    names=["accession", "type", "name"],
    skiprows=1,
    index_col=0,
    dtype={"accession": str, "type": "category", "name": str},
)
df_interpro_annotations.head()

accession,type,name
IPR000126,Active_site,"Serine proteases, V8 family, serine active site"
IPR000138,Active_site,"Hydroxymethylglutaryl-CoA lyase, active site"
IPR000169,Active_site,"Cysteine peptidase, cysteine active site"
IPR000180,Active_site,"Membrane dipeptidase, active site"
IPR000189,Active_site,"Prokaryotic transglycosylase, active site"


What are the entry types?

https://www.ebi.ac.uk/training/online/courses/interpro-functional-and-structural-analysis/what-is-an-interpro-entry/interpro-entry-types/

InterPro entries are classified into one of five categories, depending on the biological entity they represent: homologous superfamily, protein family, domain, repeat or site.

In [57]:
df_interpro_annotations.type.value_counts()

Family                    23997
Domain                    11954
Homologous_superfamily     3399
Conserved_site              696
Repeat                      327
Active_site                 132
Binding_site                 75
PTM                          17
Name: type, dtype: int64

It seems that the entry file splits "site" into conserved site, active site, binding site, and PTM. We can keep all of them, since the type column records which category each annotation belongs to.

In [58]:
df_interpro_annotations[df_interpro_annotations.type == "PTM"]

accession,type,name
IPR000152,PTM,EGF-type aspartate/asparagine hydroxylation site
IPR001020,PTM,"Phosphotransferase system, HPr histidine phosp..."
IPR002114,PTM,"Phosphotransferase system, HPr serine phosphor..."
IPR002332,PTM,"Nitrogen regulatory protein P-II, urydylation ..."
IPR004091,PTM,"Chemotaxis methyl-accepting receptor, methyl-a..."
IPR006141,PTM,Intein N-terminal splicing region
IPR006162,PTM,Phosphopantetheine attachment site
IPR012902,PTM,Prokaryotic N-terminal methylation site
IPR018051,PTM,"Surfactant-associated polypeptide, palmitoylat..."
IPR018070,PTM,"Neuromedin U, amidation site"


In [59]:
df_interpro = df_interpro.merge(df_interpro_annotations, how="left", left_on="interpro_id", right_index=True)
df_interpro

Unnamed: 0,Uniprot,interpro_id,type,name
0,A0A1B0GTW7,IPR001577,Family,"Peptidase M8, leishmanolysin"
1,A0JNW5,IPR026728,Family,UHRF1-binding protein 1-like
2,A0JNW5,IPR026854,Domain,Vacuolar protein sorting-associated protein 13...
3,A0JP26,IPR002110,Repeat,Ankyrin repeat
4,A0JP26,IPR036770,Homologous_superfamily,Ankyrin repeat-containing domain superfamily
...,...,...,...,...
4190605,X5MFI4,IPR029044,Homologous_superfamily,Nucleotide-diphospho-sugar transferases
4190606,X5MI49,IPR008630,Family,Glycosyltransferase 34
4190607,X5MI49,IPR029044,Homologous_superfamily,Nucleotide-diphospho-sugar transferases
4190608,X5MPI5,IPR008630,Family,Glycosyltransferase 34


With the current Interpro entry file, 61901 annotations had no name. Looking them up manually (e.g. https://www.ebi.ac.uk/interpro/entry/InterPro/IPR027181/) showed that all of them were inactive, removed entries. We therefore switched to version 90 of the entry file, which is the release our Uniprot 2022_05 dataset is linked to. With that version, there are no identifiers with missing names.

In [61]:
df_interpro[df_interpro.name.isnull() | df_interpro.type.isnull()]

Unnamed: 0,Uniprot,interpro_id,type,name
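
Unmatched identifiers, such as those caused by a version mismatch between the Uniprot download and the Interpro entry file, can also be listed directly during the merge. The sketch below uses the `indicator` option of `DataFrame.merge` on made-up identifiers; it is an alternative to the null check above, not what this notebook runs.

```python
import pandas as pd

toy_hits = pd.DataFrame(
    {"Uniprot": ["P1", "P1", "P2"], "interpro_id": ["IPR_A", "IPR_B", "IPR_C"]}
)
toy_entries = pd.DataFrame(
    {"type": ["Family", "Domain"], "name": ["toy family", "toy domain"]},
    index=pd.Index(["IPR_A", "IPR_B"], name="accession"),
)

merged = toy_hits.merge(
    toy_entries,
    how="left",
    left_on="interpro_id",
    right_index=True,
    indicator=True,  # adds a "_merge" column: "both" or "left_only"
)
# Identifiers that occur in the protein table but not in the entry file
unmatched = merged.loc[merged["_merge"] == "left_only", "interpro_id"].unique()
print(sorted(unmatched))
```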


In [62]:
save_df(df=df_interpro, dataset_name="interpro", folder_path=OUTPUT_FOLDER)

### TCDB substrates

In [25]:
df_substrates = pd.read_table(
    "../data/raw/tcdb/tcdb_substrates.tsv",
    header=None,
    names=["tcdb_id", "tcdb_substrates"],
    index_col=0,
)
df_substrates = (
    df_substrates.tcdb_substrates.str.split("|")
    .explode()
    .str.split(";", expand=True)
    .rename(columns={0: "chebi_id", 1: "chebi_term"})
    .reset_index(drop=False)
    .melt(id_vars=["tcdb_id"], value_vars=["chebi_id", "chebi_term"])
)
df_substrates

Unnamed: 0,tcdb_id,variable,value
0,2.A.52.2.2,chebi_id,CHEBI:23337
1,2.A.52.2.2,chebi_id,CHEBI:25517
2,2.A.22.2.5,chebi_id,CHEBI:9175
3,2.A.22.2.5,chebi_id,CHEBI:8345
4,2.A.90.2.4,chebi_id,CHEBI:8816
...,...,...,...
26829,1.H.1.1.17,chebi_term,cation
26830,8.A.139.2.2,chebi_term,peptide
26831,1.B.6.2.13,chebi_term,molecule
26832,2.A.66.1.20,chebi_term,hydron


In [26]:
df_tcdb_uniprot = (
    df.tcdb_ids.dropna()
    .str.rstrip(";")
    .str.split(";")
    .explode()
    .rename("tcdb_id")
    .reset_index(drop=False)
    .drop_duplicates()
    .melt(id_vars=["tcdb_id"], value_vars=["Uniprot"])
)
df_tcdb_uniprot

Unnamed: 0,tcdb_id,variable,value
0,9.A.46.1.2,Uniprot,A0PK11
1,9.B.438.1.1,Uniprot,A2RU14
2,1.N.2.1.1,Uniprot,A6NI61
3,9.A.80.1.1,Uniprot,A6NKB5
4,1.I.1.1.3,Uniprot,O00159
...,...,...,...
7553,2.A.47.1.5,Uniprot,Q9W7I2
7554,2.A.18.3.2,Uniprot,Q9XE48
7555,2.A.18.3.3,Uniprot,Q9XE49
7556,2.A.29.11.2,Uniprot,Q9ZNY4


In [31]:
df_uniprot_tcdb_chebi = pd.concat([df_substrates, df_tcdb_uniprot]).reset_index(drop=True)
df_uniprot_tcdb_chebi = df_uniprot_tcdb_chebi.drop_duplicates()
df_uniprot_tcdb_chebi

Unnamed: 0,tcdb_id,variable,value
0,2.A.52.2.2,chebi_id,CHEBI:23337
1,2.A.52.2.2,chebi_id,CHEBI:25517
2,2.A.22.2.5,chebi_id,CHEBI:9175
3,2.A.22.2.5,chebi_id,CHEBI:8345
4,2.A.90.2.4,chebi_id,CHEBI:8816
...,...,...,...
34387,2.A.47.1.5,Uniprot,Q9W7I2
34388,2.A.18.3.2,Uniprot,Q9XE48
34389,2.A.18.3.3,Uniprot,Q9XE49
34390,2.A.29.11.2,Uniprot,Q9ZNY4
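
Because proteins and substrates are stored in the same long table, linking a Uniprot accession to its ChEBI identifiers requires a join on `tcdb_id`. The following is one possible way to do that on toy data; the helper `select_variable` is hypothetical and not part of `subpred`.

```python
import pandas as pd

toy_long = pd.DataFrame(
    {
        "tcdb_id": ["0.X.1.1.1", "0.X.1.1.1", "0.X.1.1.1", "0.X.2.1.1"],
        "variable": ["chebi_id", "chebi_term", "Uniprot", "Uniprot"],
        "value": ["CHEBI:1", "toy ion", "P1", "P2"],
    }
)


def select_variable(df_long, variable):
    """Return the rows of one variable as a (tcdb_id, <variable>) table."""
    rows = df_long.loc[df_long["variable"] == variable, ["tcdb_id", "value"]]
    return rows.rename(columns={"value": variable})


# An inner join keeps only TCDB entries that have both a protein and a substrate
uniprot_to_chebi = select_variable(toy_long, "Uniprot").merge(
    select_variable(toy_long, "chebi_id"), on="tcdb_id", how="inner"
)
print(uniprot_to_chebi)
```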


In [33]:
save_df(df=df_uniprot_tcdb_chebi, dataset_name="tcdb_substrates", folder_path=OUTPUT_FOLDER)

### GO

This file from EBI contains all GO annotations from Uniprot and Interpro. It has been filtered for Uniprot accessions, which removes RNA annotations, for example. Some unnecessary columns were also removed with awk, which reduces the file size from 170GB (uncompressed raw data) to 1.5GB (as an xz file). The command is in the Makefile. Below, we additionally restrict the annotations to the Uniprot proteins in our dataset, which reduces the pickle file size from around 8GB to 100MB.

In [6]:
df_go_ebi = pd.read_table(
    "../data/raw/gene_ontology/goa_uniprot_all_ebi_filtered.tsv.xz",
    header=None,
    names=["Uniprot", "qualifier", "go_id", "evidence_code", "aspect", "date"],
    dtype={
        "Uniprot": "string",
        "qualifier": "category",
        "go_id": "string",
        "evidence_code": "category",
        "aspect": "category",
        "date": "int",
    },
    # parse_dates=["date"],
)

In [7]:
save_df(df_go_ebi, "go_complete", OUTPUT_FOLDER)

In [34]:
uniprot_accessions = set(df.index.unique())
df_go_ebi = df_go_ebi[df_go_ebi.Uniprot.isin(uniprot_accessions)]
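
If only the filtered annotations are needed, the file could also be read in chunks and filtered chunk by chunk, so that the complete table never has to fit in memory. This notebook does not do that, since it saves the complete table first. The sketch below writes a tiny xz-compressed stand-in file; the function name `read_filtered`, the two-column layout, and the chunk size are assumptions for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd


def read_filtered(path, accessions, chunksize=2):
    """Read a headerless two-column TSV in chunks, keeping only known accessions."""
    chunks = pd.read_table(
        path,
        header=None,
        names=["Uniprot", "go_id"],
        dtype="string",
        chunksize=chunksize,  # returns an iterator of dataframes
    )
    kept = [chunk[chunk.Uniprot.isin(accessions)] for chunk in chunks]
    return pd.concat(kept, ignore_index=True)


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_goa.tsv.xz"
    toy = pd.DataFrame(
        {"Uniprot": ["P1", "P2", "P3", "P1"], "go_id": ["GO:1", "GO:2", "GO:3", "GO:4"]}
    )
    # The ".xz" suffix makes pandas compress on write and decompress on read
    toy.to_csv(path, sep="\t", header=False, index=False)
    filtered = read_filtered(path, accessions={"P1", "P3"})

print(filtered)
```

For the real file a much larger `chunksize` (for example a few million rows) would be more sensible than the value used for this toy example.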

In [35]:
print(df_go_ebi.shape)
df_go_ebi = df_go_ebi.drop("date", axis=1)
df_go_ebi = df_go_ebi.drop_duplicates()
print(df_go_ebi.shape)

(7580427, 6)
(7452018, 5)


In [36]:
# Resetting the index after filtering halves the pickle file size
df_go_ebi = df_go_ebi.reset_index(drop=True)
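
A plausible explanation for the size reduction is that a filtered dataframe keeps its original row labels as an explicit integer array, whereas `reset_index(drop=True)` replaces them with a `RangeIndex` that stores only start, stop, and step. The toy comparison below illustrates the effect; the exact ratio depends on how many columns the dataframe has.

```python
import pickle

import pandas as pd

toy = pd.DataFrame({"flag": [True, False] * 50_000})

# Boolean filtering keeps the original labels 0, 2, 4, ...
filtered = toy[toy.flag]
# Renumbering replaces them with a compact RangeIndex 0..n-1
renumbered = filtered.reset_index(drop=True)

size_filtered = len(pickle.dumps(filtered))
size_renumbered = len(pickle.dumps(renumbered))
print(size_filtered, size_renumbered)
```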

In [27]:
save_df(df_go_ebi, "go", OUTPUT_FOLDER)

In [28]:
# df_go_ebi_mf_enables = df_go_ebi[
#     (df_go_ebi.qualifier == "enables")&
#     (df_go_ebi.aspect == "F")
# ].drop(["qualifier", "aspect"], axis=1).reset_index(drop=True)
# save_df(df_go_ebi_mf_enables, "go_mf", OUTPUT_FOLDER)

In [1]:
!du -hs ../data/datasets/*

16G	../data/datasets/go_complete.pickle
107M	../data/datasets/go.pickle
130M	../data/datasets/interpro.pickle
81M	../data/datasets/keywords.pickle
1.2M	../data/datasets/tcdb_substrates.pickle
158M	../data/datasets/uniprot70.pickle
498M	../data/datasets/uniprot.pickle
