# Preparing the Gene Ontology Annotations Database for Integration

> A Gene Ontology (GO) annotation is a statement about the function of a particular gene. GO annotations are created by associating a gene or gene product with a GO term. Together, these statements comprise a “snapshot” of current biological knowledge. Hence, GO annotations capture statements about how a gene functions at the molecular level, where in the cell it functions, and what biological processes (pathways, programs) it helps to carry out. (quoted from https://geneontology.org/docs/go-annotations)

This notebook downloads the Gene Ontology annotation database and walks through two steps: 1) validating how prefixes, local unique identifiers, and CURIEs are used, then 2) standardizing them. Many datasets need this kind of standardization before they can be readily combined with other datasets.
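
To make the three terms concrete, here is a minimal sketch of how a CURIE breaks down into a prefix and a local unique identifier. The helper `split_curie` is a hypothetical name introduced for illustration and is not part of `bioregistry`; the example values are taken from the annotation file used in this notebook.

```python
def split_curie(curie: str) -> tuple[str, str]:
    """Split a CURIE into (prefix, local unique identifier) at the first colon."""
    prefix, sep, identifier = curie.partition(":")
    if not sep or not prefix or not identifier:
        raise ValueError(f"not a CURIE: {curie!r}")
    return prefix, identifier


# Hypothetical helper applied to values that appear in the GO annotation file
for curie in ["GO:0003723", "GO_REF:0000043", "taxon:9606"]:
    prefix, identifier = split_curie(curie)
    print(f"{curie} -> prefix={prefix!r}, identifier={identifier!r}")
```

Validation asks whether each part follows the registry's conventions; standardization rewrites the parts into their preferred form.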

In the first step, we load the most recent GO annotations database from http://geneontology.org/gene-associations/goa_human.gaf.gz. The format of this file is explained at https://geneontology.org/docs/go-annotation-file-gaf-format-2.2/, but we only look at a subset of columns.

In [4]:
import pandas as pd

import bioregistry.pandas as brpd

# Focus on these columns when displaying the data
columns = [0, 1, 4, 5, 12]
names = [
    "subject_prefix", "subject_identifier", "object_curie", 
    "reference_curie", "taxon_curie",
]

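df = pd.read_csv(
    "http://geneontology.org/gene-associations/goa_human.gaf.gz",
    sep="\t",
    comment="!",  # GAF header lines start with "!"
    header=None,  # the file has no column header row
    usecols=columns,
    names=names,
    dtype=str,  # keep identifiers as strings
).head(100)  # keep only the first 100 rows for this walkthrough
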
df.head()

|   | subject_prefix | subject_identifier | object_curie | reference_curie | taxon_curie |
|---|----------------|--------------------|--------------|-----------------|-------------|
| 0 | UniProtKB | A0A024RBG1 | GO:0003723 | GO_REF:0000043 | taxon:9606 |
| 1 | UniProtKB | A0A024RBG1 | GO:0046872 | GO_REF:0000043 | taxon:9606 |
| 2 | UniProtKB | A0A024RBG1 | GO:0005829 | GO_REF:0000052 | taxon:9606 |
| 3 | UniProtKB | A0A075B6H7 | GO:0002250 | GO_REF:0000043 | taxon:9606 |
| 4 | UniProtKB | A0A075B6H7 | GO:0005886 | GO_REF:0000044 | taxon:9606 |
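
The column selection in the loading step can also be sketched with the standard library alone, which shows what the `comment`, `usecols`, and `names` arguments are doing. This is an illustration under stated assumptions: the sample line below is a placeholder in which only the five selected fields carry values from the output above, and every other field is a `-` stand-in rather than real annotation data.

```python
import csv
import io

# Placeholder GAF content (17 tab-separated fields per line in GAF 2.2).
# Only fields 0, 1, 4, 5, and 12 hold values from the notebook output;
# the "-" entries are stand-ins, not real annotation data.
fields = ["-"] * 17
fields[0], fields[1] = "UniProtKB", "A0A024RBG1"
fields[4], fields[5], fields[12] = "GO:0003723", "GO_REF:0000043", "taxon:9606"
sample = "!gaf-version: 2.2\n" + "\t".join(fields) + "\n"

columns = [0, 1, 4, 5, 12]
names = [
    "subject_prefix", "subject_identifier", "object_curie",
    "reference_curie", "taxon_curie",
]

rows = []
for line in csv.reader(io.StringIO(sample), delimiter="\t"):
    # Skip empty lines and header lines that begin with "!"
    if not line or line[0].startswith("!"):
        continue
    # Pick the selected positions and label them with the chosen names
    rows.append({name: line[i] for name, i in zip(names, columns)})

print(rows[0])
```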


## Validation

In [7]:
idx = brpd.validate_prefixes(df, column="subject_prefix")

brpd.summarize_prefix_validation(df, idx, column="subject_prefix")

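KeyError: 'subject_prefix'

This `KeyError` is most likely an artifact of execution order: the cell ran as In [7], after the Standardize cell (In [5]) had already replaced `subject_prefix` with `subject_curie`. On the frame as loaded, the column is present.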
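
As a rough picture of what prefix validation produces, the sketch below builds one boolean per row, assuming that "valid" means the value is already the registry's canonical prefix. The set `CANONICAL_PREFIXES` and the function `validate_prefixes` are hypothetical stand-ins written for this example; they are not the `bioregistry` implementation, and the canonical forms are the ones that appear in this notebook's standardized output.

```python
# Hypothetical stand-in for the registry's canonical prefixes
CANONICAL_PREFIXES = {"uniprot", "go", "go.ref", "ncbitaxon"}


def validate_prefixes(prefixes):
    """Return one boolean per value: True where the prefix is already canonical."""
    return [prefix in CANONICAL_PREFIXES for prefix in prefixes]


prefixes = ["UniProtKB", "uniprot", "GO"]
mask = validate_prefixes(prefixes)

# Collect the values that would need standardizing
invalid = [prefix for prefix, ok in zip(prefixes, mask) if not ok]
print(f"{len(invalid)} of {len(prefixes)} prefixes are non-canonical: {invalid}")
```

Under this assumption, `UniProtKB` counts as invalid even though it is a recognizable synonym, which is why the standardization step is needed.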

## Standardize

In [5]:
brpd.normalize_prefixes(df, column="subject_prefix")

# Collapse split prefix/identifier columns together into curies
brpd.pd_collapse_to_curies(
    df,
    prefix_column="subject_prefix",
    identifier_column="subject_identifier",
    target_column="subject_curie",
)

brpd.normalize_curies(df, column="object_curie")
brpd.normalize_curies(df, column="reference_curie")
brpd.normalize_curies(df, column="taxon_curie")

df = df[["subject_curie", "object_curie", "reference_curie", "taxon_curie"]]
df.head()

|   | subject_curie | object_curie | reference_curie | taxon_curie |
|---|---------------|--------------|-----------------|-------------|
| 0 | uniprot:A0A024RBG1 | go:0003723 | go.ref:0000043 | ncbitaxon:9606 |
| 1 | uniprot:A0A024RBG1 | go:0046872 | go.ref:0000043 | ncbitaxon:9606 |
| 2 | uniprot:A0A024RBG1 | go:0005829 | go.ref:0000052 | ncbitaxon:9606 |
| 3 | uniprot:A0A075B6H7 | go:0002250 | go.ref:0000043 | ncbitaxon:9606 |
| 4 | uniprot:A0A075B6H7 | go:0005886 | go.ref:0000044 | ncbitaxon:9606 |
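
The effect of the standardization step on a single row can be sketched in plain Python. The mapping `SYNONYMS` is hard-coded from the before and after outputs shown in this notebook and is an assumption of the sketch; a real run consults the Bioregistry instead. The function names are hypothetical and only mirror the roles of the `brpd` calls above.

```python
# Synonym -> canonical prefix, read off the before/after outputs in this notebook.
# Hard-coded for illustration; the real lookup lives in the Bioregistry.
SYNONYMS = {"UniProtKB": "uniprot", "GO": "go", "GO_REF": "go.ref", "taxon": "ncbitaxon"}
CANONICAL = set(SYNONYMS.values())


def normalize_prefix(prefix):
    """Return the canonical prefix, or None if the prefix is unknown."""
    if prefix in CANONICAL:
        return prefix
    return SYNONYMS.get(prefix)


def collapse_to_curie(prefix, identifier):
    """Join a split prefix and identifier into one normalized CURIE."""
    norm = normalize_prefix(prefix)
    return None if norm is None else f"{norm}:{identifier}"


def normalize_curie(curie):
    """Normalize the prefix part of an existing CURIE."""
    prefix, sep, identifier = curie.partition(":")
    if not sep or not identifier:
        return None
    return collapse_to_curie(prefix, identifier)


# First row of the loaded table, before standardization
row = ("UniProtKB", "A0A024RBG1", "GO:0003723", "GO_REF:0000043", "taxon:9606")
standardized = (
    collapse_to_curie(row[0], row[1]),
    normalize_curie(row[2]),
    normalize_curie(row[3]),
    normalize_curie(row[4]),
)
print(standardized)
```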


## CURIEs

In [None]:
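# Columns are named at load time, so refer to the GO term column by name
idx = brpd.validate_curies(df, column="object_curie")

brpd.summarize_curie_validation(df, idx)

In [None]:
brpd.normalize_curies(df, column="object_curie")

df.head()

In [None]:
# Re-validate after normalization
idx = brpd.validate_curies(df, column="object_curie")

brpd.summarize_curie_validation(df, idx)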

## Identifiers

In [None]:
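# Needs the split prefix/identifier columns, so run this on the frame as loaded
idx = brpd.validate_identifiers(
    df, column="subject_identifier", prefix_column="subject_prefix", use_tqdm=True,
)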
print(f"{(~idx).sum():,} rows have invalid identifiers")

In [None]:
(~idx).sum()
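
Identifier validation can be pictured as matching each local unique identifier against a pattern registered for its prefix. The sketch below assumes that model; the two patterns are illustrative assumptions written for this example rather than values copied from the registry, and `is_valid_identifier` here is a hypothetical local function. The identifier `3723` is a deliberately malformed example.

```python
import re

# Illustrative patterns only (assumptions for this sketch)
PATTERNS = {
    "go": re.compile(r"^\d{7}$"),       # seven digits
    "ncbitaxon": re.compile(r"^\d+$"),  # one or more digits
}


def is_valid_identifier(prefix, identifier):
    """Check an identifier against the pattern for its prefix."""
    pattern = PATTERNS.get(prefix)
    if pattern is None:
        return False  # unknown prefix: treated as invalid in this sketch
    return bool(pattern.match(identifier))


rows = [("go", "0003723"), ("go", "3723"), ("ncbitaxon", "9606")]
idx = [is_valid_identifier(prefix, identifier) for prefix, identifier in rows]

# Count the rows that fail, mirroring the summary printed above
n_invalid = sum(not ok for ok in idx)
print(f"{n_invalid:,} rows have invalid identifiers")
```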

In [None]:
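# Assumes `df` still has the split columns from the loading step
brpd.identifiers_to_curies(
    df, column="subject_identifier", prefix_column="subject_prefix",
)

# The prefix is now part of the CURIE, so leave the redundant prefix column out
display_columns = [c for c in names if c != "subject_prefix"]

df[display_columns].head()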