# Preparing the Gene Ontology Annotations Database for Integration

> A Gene Ontology (GO) annotation is a statement about the function of a particular gene. GO annotations are created by associating a gene or gene product with a GO term. Together, these statements comprise a “snapshot” of current biological knowledge. Hence, GO annotations capture statements about how a gene functions at the molecular level, where in the cell it functions, and what biological processes (pathways, programs) it helps to carry out. (quoted from https://geneontology.org/docs/go-annotations)

This notebook downloads the Gene Ontology annotation database and walks through two steps: 1) validating its prefixes, local unique identifiers, and CURIEs, then 2) standardizing them. Many datasets need this kind of standardization before they are readily interoperable with other datasets.
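
To make the terminology concrete: a CURIE (compact URI) joins a prefix and a local unique identifier with a colon, as in `GO:0003723`. The following is a minimal sketch in plain Python and not part of the Bioregistry API; the helper name `split_curie` is hypothetical.

```python
def split_curie(curie: str) -> tuple[str, str]:
    """Split a CURIE into its prefix and local unique identifier."""
    # Split on the first colon only, since identifiers may contain colons
    prefix, sep, identifier = curie.partition(":")
    if not sep or not prefix or not identifier:
        raise ValueError(f"not a CURIE: {curie!r}")
    return prefix, identifier


print(split_curie("GO:0003723"))
print(split_curie("taxon:9606"))
```

In the GO annotation file, some columns already hold full CURIEs, while the subject is stored as a separate prefix column and identifier column.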

To begin, we load the most recent human GO annotations from http://geneontology.org/gene-associations/goa_human.gaf.gz. The format of this file is explained at https://geneontology.org/docs/go-annotation-file-gaf-format-2.2/, but we only look at a subset of columns and keep only the first 100 rows.

In [1]:
import pandas as pd

import bioregistry.pandas as brpd

# Load only these GAF columns (0-indexed positions) and give them descriptive names
columns = [0, 1, 4, 5, 12]
names = [
    "subject_prefix",
    "subject_identifier",
    "object_curie",
    "reference_curie",
    "taxon_curie",
]

df = pd.read_csv(
    "http://geneontology.org/gene-associations/goa_human.gaf.gz",
    sep="\t",
    comment="!",
    header=None,
    usecols=columns,
    names=names,
    dtype=str,
).head(100)


df.head()

|    | subject_prefix | subject_identifier | object_curie | reference_curie | taxon_curie |
|----|----------------|--------------------|--------------|-----------------|-------------|
| 0  | UniProtKB      | A0A024RBG1         | GO:0003723   | GO_REF:0000043  | taxon:9606  |
| 1  | UniProtKB      | A0A024RBG1         | GO:0005515   | PMID:33961781   | taxon:9606  |
| 2  | UniProtKB      | A0A024RBG1         | GO:0008486   | GO_REF:0000003  | taxon:9606  |
| 3  | UniProtKB      | A0A024RBG1         | GO:0016462   | GO_REF:0000002  | taxon:9606  |
| 4  | UniProtKB      | A0A024RBG1         | GO:0016787   | GO_REF:0000002  | taxon:9606  |


## Validation

In [2]:
idx = brpd.validate_prefixes(df, column="subject_prefix")

brpd.summarize_prefix_validation(df, idx, column="subject_prefix")

100 of 100 (100%) rows with the following prefixes need to be fixed: ['UniProtKB']
The following prefixes could be normalized using normalize_curies():

| raw       | standardized   |
|-----------|----------------|
| UniProtKB | uniprot        |


In [3]:
idx = brpd.validate_identifiers(
    df, column="subject_identifier", prefix_column="subject_prefix", use_tqdm=True
)

print(f"{(~idx).sum():,} rows have invalid identifiers")

0 rows have invalid identifiers
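
Conceptually, identifier validation checks each local unique identifier against a regular expression associated with its prefix. The sketch below illustrates the idea with Python's `re` module. The pattern is assumed to be the accession format that UniProt documents and is hard-coded here for illustration; the helper `is_valid_uniprot` is hypothetical, and this is not necessarily how `validate_identifiers()` is implemented.

```python
import re

# Assumption: UniProt accession pattern, hard-coded for illustration only
UNIPROT_PATTERN = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def is_valid_uniprot(identifier: str) -> bool:
    """Return True if the whole string matches the accession pattern."""
    return UNIPROT_PATTERN.fullmatch(identifier) is not None


# Two accessions from the table above plus one deliberately malformed string
identifiers = ["A0A024RBG1", "A0A075B6I9", "not-an-accession"]
results = {identifier: is_valid_uniprot(identifier) for identifier in identifiers}
print(results)
```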


In [4]:
idx = brpd.validate_curies(df, column="object_curie")
brpd.summarize_curie_validation(df, idx, column="object_curie")

100 of 100 (100%) rows with the following CURIEs need to be fixed: ['GO:0002250', 'GO:0002376', 'GO:0003723', 'GO:0005515', 'GO:0005576', 'GO:0005737', 'GO:0005829', 'GO:0005886', 'GO:0008486', 'GO:0016020', 'GO:0016462', 'GO:0016787', 'GO:0019814', 'GO:0046872']


In [5]:
idx = brpd.validate_curies(df, column="reference_curie")
brpd.summarize_curie_validation(df, idx, column="reference_curie")

100 of 100 (100%) rows with the following CURIEs need to be fixed: ['GO_REF:0000002', 'GO_REF:0000003', 'GO_REF:0000043', 'GO_REF:0000044', 'GO_REF:0000052', 'GO_REF:0000117', 'PMID:33961781']


In [6]:
idx = brpd.validate_curies(df, column="taxon_curie")
brpd.summarize_curie_validation(df, idx, column="taxon_curie")

100 of 100 (100%) rows with the following CURIEs need to be fixed: ['taxon:9606']


## Standardization

In [7]:
# The normalize_* functions update the given column of the dataframe in place
brpd.normalize_prefixes(df, column="subject_prefix")

brpd.normalize_curies(df, column="object_curie")
brpd.normalize_curies(df, column="reference_curie")
brpd.normalize_curies(df, column="taxon_curie")

df = df[["subject_prefix", "subject_identifier", "object_curie", "reference_curie", "taxon_curie"]]
df.head()

|    | subject_prefix | subject_identifier | object_curie | reference_curie | taxon_curie    |
|----|----------------|--------------------|--------------|-----------------|----------------|
| 0  | uniprot        | A0A024RBG1         | go:0003723   | go.ref:0000043  | ncbitaxon:9606 |
| 1  | uniprot        | A0A024RBG1         | go:0005515   | pubmed:33961781 | ncbitaxon:9606 |
| 2  | uniprot        | A0A024RBG1         | go:0008486   | go.ref:0000003  | ncbitaxon:9606 |
| 3  | uniprot        | A0A024RBG1         | go:0016462   | go.ref:0000002  | ncbitaxon:9606 |
| 4  | uniprot        | A0A024RBG1         | go:0016787   | go.ref:0000002  | ncbitaxon:9606 |
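
The effect of the normalization calls can be pictured as a lookup from each raw prefix to its standardized form. The sketch below hard-codes only the mappings visible in this notebook's outputs. The names `PREFIX_MAP` and `normalize_curie` are hypothetical, and Bioregistry resolves prefixes and their synonyms from its registry rather than from a hand-written table like this one.

```python
from typing import Optional

# Assumption: mappings copied from the outputs shown in this notebook
PREFIX_MAP = {
    "UniProtKB": "uniprot",
    "GO": "go",
    "GO_REF": "go.ref",
    "PMID": "pubmed",
    "taxon": "ncbitaxon",
}


def normalize_curie(curie: str) -> Optional[str]:
    """Rewrite a CURIE with its standardized prefix, or return None if unknown."""
    prefix, sep, identifier = curie.partition(":")
    standard_prefix = PREFIX_MAP.get(prefix)
    if not sep or standard_prefix is None:
        return None
    return f"{standard_prefix}:{identifier}"


raw = ["GO:0003723", "GO_REF:0000043", "PMID:33961781", "taxon:9606"]
normalized = [normalize_curie(curie) for curie in raw]
print(normalized)
```

Returning `None` for an unrecognized prefix keeps unmapped values easy to spot instead of silently passing them through.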


In [8]:
brpd.identifiers_to_curies(df, column="subject_identifier", prefix_column="subject_prefix")

0     uniprot:A0A024RBG1
1     uniprot:A0A024RBG1
2     uniprot:A0A024RBG1
3     uniprot:A0A024RBG1
4     uniprot:A0A024RBG1
             ...        
95    uniprot:A0A075B6I9
96    uniprot:A0A075B6I9
97    uniprot:A0A075B6I9
98    uniprot:A0A075B6I9
99    uniprot:A0A075B6I9
Length: 100, dtype: object

In [9]:
# Collapse the separate prefix and identifier columns into a single CURIE column
brpd.pd_collapse_to_curies(
    df,
    prefix_column="subject_prefix",
    identifier_column="subject_identifier",
    target_column="subject_curie",
)
df

|     | object_curie | reference_curie | taxon_curie    | subject_curie      |
|-----|--------------|-----------------|----------------|--------------------|
| 0   | go:0003723   | go.ref:0000043  | ncbitaxon:9606 | uniprot:A0A024RBG1 |
| 1   | go:0005515   | pubmed:33961781 | ncbitaxon:9606 | uniprot:A0A024RBG1 |
| 2   | go:0008486   | go.ref:0000003  | ncbitaxon:9606 | uniprot:A0A024RBG1 |
| 3   | go:0016462   | go.ref:0000002  | ncbitaxon:9606 | uniprot:A0A024RBG1 |
| 4   | go:0016787   | go.ref:0000002  | ncbitaxon:9606 | uniprot:A0A024RBG1 |
| ... | ...          | ...             | ...            | ...                |
| 95  | go:0002250   | go.ref:0000043  | ncbitaxon:9606 | uniprot:A0A075B6I9 |
| 96  | go:0002376   | go.ref:0000043  | ncbitaxon:9606 | uniprot:A0A075B6I9 |
| 97  | go:0005576   | go.ref:0000043  | ncbitaxon:9606 | uniprot:A0A075B6I9 |
| 98  | go:0005576   | go.ref:0000044  | ncbitaxon:9606 | uniprot:A0A075B6I9 |
