
Defining gene to category annotations

Ben Fulcher edited this page Jun 30, 2020 · 2 revisions

This page describes how to retrieve data from an ontology, such as GO biological processes, and process it into the form of gene-to-category annotations. These categories define the units on which enrichment is assessed.

Note that this step can be skipped by downloading pre-processed results from figshare (computed on data as of 2019-04-17).

This involves the following steps:

  1. Retrieve and process the GO hierarchy data.
  2. Retrieve and process the annotations of genes to GO Terms.
  3. Iteratively propagate gene-to-Term annotations from child to parent up the GO hierarchy.

Processing the GO-Term Hierarchy

Downloading the data

There are several ways to download the GO Term hierarchy. We used the termdb MySQL database dump, and connected to this database from Matlab using a MySQL Java connector. Code for achieving this (e.g., in the Matlab_mySQL repository) is a dependency for this package.

Note: the data is also provided in raw form as go-basic.obo (the basic file ensures that annotations can be propagated), and you can also download the data as a database.
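For readers taking the go-basic.obo route rather than the database, the following is a minimal Python sketch of how terms and their is_a parents could be read from the OBO text format. It is not the code used by this package (which reads from the termdb database in Matlab). The function name parse_obo_terms and the short hand-written sample are assumptions made for illustration.

```python
# Minimal sketch (not this package's method): read [Term] stanzas from
# OBO-formatted text and keep non-obsolete terms in one namespace.
# parse_obo_terms and SAMPLE_OBO are hypothetical, for illustration only.

SAMPLE_OBO = """format-version: 1.2

[Term]
id: GO:0008150
name: biological_process
namespace: biological_process

[Term]
id: GO:0009987
name: cellular process
namespace: biological_process
is_a: GO:0008150 ! biological_process

[Term]
id: GO:0000005
name: obsolete ribosomal chaperone activity
namespace: molecular_function
is_obsolete: true

[Typedef]
id: part_of
name: part of
"""


def parse_obo_terms(text, namespace="biological_process"):
    """Return {term_id: {"name": str, "parents": [term_id, ...]}}."""
    terms = {}
    current = None

    def flush(term):
        # Store a finished stanza if it is a live term in the wanted namespace.
        if term and term["namespace"] == namespace and not term["obsolete"]:
            terms[term["id"]] = {"name": term["name"], "parents": term["parents"]}

    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza begins: close the previous one.
            flush(current)
            if line == "[Term]":
                current = {"id": None, "name": "", "namespace": None,
                           "parents": [], "obsolete": False}
            else:
                current = None  # e.g. [Typedef] stanzas are ignored
        elif current is not None and ": " in line:
            tag, value = line.split(": ", 1)
            if tag == "id":
                current["id"] = value
            elif tag == "name":
                current["name"] = value
            elif tag == "namespace":
                current["namespace"] = value
            elif tag == "is_a":
                # "is_a: GO:0008150 ! biological_process" -> keep the ID only
                current["parents"].append(value.split()[0])
            elif tag == "is_obsolete":
                current["obsolete"] = (value == "true")
    flush(current)
    return terms


terms = parse_obo_terms(SAMPLE_OBO)
for term_id, info in sorted(terms.items()):
    print(term_id, info["name"], info["parents"])
```

To apply the same function to the real file, read go-basic.obo into a string and pass it in place of SAMPLE_OBO.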

Reading and Processing:

  1. Set up the downloaded termdb MySQL database, and put the connection details in ConnectMeDatabase.
  2. Retrieve Biological Process GO Terms, and save the filtered set of terms to a .mat file:
GOTerms = GetGOTerms('biological_process',true);

This saves the filtered terms to ProcessedData/GOTerms_BP.mat.

Processing GO Term Annotations

Now that we have the GO Terms in Matlab format, we next need data on which genes are annotated to which GO Terms.

Downloading raw GO annotation data

Annotation files should be downloaded directly from the GO website.

  • For Mus musculus, the annotation file is mgi.gaf.
  • For Homo sapiens, the annotation file is goa_human.gaf.

The appropriate annotation file(s) should be placed in the RawData directory.

Processing data from the annotation file

Each line in the annotation file represents an association between a gene product and a GO term with a certain evidence code, and the reference to support the association. The ReadDirectAnnotationFile function reads in all of this raw data, and processes it into a Matlab table, with a row for each GO Category, including information about the category and the genes that are annotated to it.

Before ReadDirectAnnotationFile can be run, it requires a mapping from MGI gene identifiers to NCBI Entrez gene identifiers. For mouse, this mapping is retrieved from MouseMine:

python3 MGI_NCBI_downloadall.py

This saves the required gene identifier mapping to ALL_MGI_ID_NCBI.csv. In the case of human data, we mapped onto gene symbols from processed gene-expression data from the Allen Human Brain Atlas.

ReadDirectAnnotationFile('mouse')

This saves the processed data as GOAnnotationDirect-mouse.mat (or GOAnnotationDirect-human.mat) in the ProcessedData directory.
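To make the annotation-file structure concrete, here is a minimal Python sketch that groups gene identifiers by GO term from GAF-formatted lines. It is not the ReadDirectAnnotationFile implementation. It assumes the standard tab-delimited GAF column layout (column 2: DB Object ID, column 4: Qualifier, column 5: GO ID, column 9: Aspect), and it skips NOT-qualified annotations, which is an assumption rather than a documented behavior of this package. The function name and the sample rows (including the gene identifiers) are hypothetical.

```python
# Minimal sketch (not ReadDirectAnnotationFile): group gene IDs by GO term
# from GAF lines. Sample rows and gene IDs below are made up for illustration.
from collections import defaultdict


def read_gaf_annotations(lines, aspect="P"):
    """Return {go_id: set(gene_ids)} for one aspect (P = biological process)."""
    genes_by_term = defaultdict(set)
    for line in lines:
        # Header/comment lines in GAF files start with "!".
        if line.startswith("!") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 15:
            continue  # malformed row: GAF rows have at least 15 columns
        gene_id, qualifier, go_id, row_aspect = cols[1], cols[3], cols[4], cols[8]
        if row_aspect != aspect:
            continue
        # Assumption: drop negated annotations ("NOT" in the qualifier column).
        if "NOT" in qualifier.split("|"):
            continue
        genes_by_term[go_id].add(gene_id)
    return dict(genes_by_term)


def make_row(gene_id, qualifier, go_id, aspect):
    # Build a 17-column GAF-like row with placeholder values elsewhere.
    cols = ["MGI", gene_id, "Sym", qualifier, go_id, "PMID:0", "IDA", "",
            aspect, "", "", "gene", "taxon:10090", "20190417", "MGI", "", ""]
    return "\t".join(cols) + "\n"


sample_lines = [
    "!gaf-version: 2.1\n",
    make_row("MGI:0000001", "", "GO:0009987", "P"),
    make_row("MGI:0000002", "", "GO:0009987", "P"),
    make_row("MGI:0000003", "NOT", "GO:0009987", "P"),  # negated: skipped
    make_row("MGI:0000001", "", "GO:0003674", "F"),     # other aspect: skipped
]

annotations = read_gaf_annotations(sample_lines)
for go_id, genes in annotations.items():
    print(go_id, sorted(genes))
```

For a real file, the same function can be called on an open file handle, e.g. one for mgi.gaf in the RawData directory.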

Note (NOT RECOMMENDED): Annotations processed from GEMMA can alternatively be read using ReadGEMMAAnnotationFile.

Propagate annotations up through the hierarchy

Annotations are made to the most specific applicable term in the GO hierarchy, and an annotation to a term also applies to all of its parent terms. To perform enrichment, we therefore need to iteratively propagate direct annotations up the hierarchy, using is_a (child-parent) relationships from the term2term table of the GO Term database.

For mouse biological processes, this is achieved using:

propagateHierarchy('mouse','biological_process');

The code takes processed data (e.g., GOAnnotationDirect-mouse.mat) and saves propagated output as GOAnnotationDirect-mouse-biological_process-Prop.mat. These propagated annotations can then be used for enrichment analysis.
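The propagation idea can be illustrated with a short Python sketch. This is not the propagateHierarchy code: rather than iterating level by level, it walks each annotated term's ancestors through the is_a links and adds the term's genes to every ancestor, which should give the same final gene sets. The function name, the term labels A to D, and the gene labels are hypothetical.

```python
# Minimal sketch (not propagateHierarchy): push direct gene annotations up
# to every ancestor reachable through child -> parent (is_a) links.
# Term labels (A-D) and gene labels (g1-g3) are made up for illustration.


def propagate_annotations(direct, parents):
    """direct: {term: set(genes)}; parents: {term: [parent terms]}.

    Returns {term: set(genes)} where each term also carries the genes
    directly annotated to any of its descendants.
    """
    propagated = {term: set(genes) for term, genes in direct.items()}
    for term, genes in direct.items():
        stack = list(parents.get(term, []))
        seen = set()
        while stack:
            ancestor = stack.pop()
            if ancestor in seen:
                continue  # a term can be reached by more than one path
            seen.add(ancestor)
            propagated.setdefault(ancestor, set()).update(genes)
            stack.extend(parents.get(ancestor, []))
    return propagated


# Toy hierarchy: C is_a B, B is_a A, and D is_a both B and A.
parents = {"C": ["B"], "B": ["A"], "D": ["B", "A"]}
direct = {"C": {"g1"}, "D": {"g2"}, "A": {"g3"}}

propagated = propagate_annotations(direct, parents)
for term in sorted(propagated):
    print(term, sorted(propagated[term]))
```

In this toy example, term B has no direct annotations but inherits g1 and g2 from its children, so it becomes a usable category for enrichment.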