Annotation Enrichment Analysis http://arxiv.org/abs/1208.412 Information for AnnotationEnrichmentAnalysis.c. Written by Kimberly Glass (kglass@jimmy.harvard.edu), available under GPL. As academic code it is provided without warranty. Please contact the author with any comments/questions/concerns.
This distribution contains code and additional relevant material for the methodology described in "Annotation Enrichment Analysis: An Alternate Method for Evaluating the Functional Properties of Gene Sets" (http://arxiv.org/abs/1208.4127).
AnnotationEnrichmentAnalysis.c performs an Annotation Enrichment Analysis (AEA) for a given input set of gene sets and groups of GO terms (likely defined by branches of the GO hierarchy). The code is written in c++. After compilation run program for usage information. Input files for the human signatures, random gene sets, and Biological Process annotations used in the publication as also included in the InputFiles directory.
g++ AnnotationEnrichmentAnalysis.c -o AEA ./AEA
example: ./AEA -i HumanBP.index -a HumanBP.annotations -p HumanBP_Branches.par -s HumanBPSigs.txt -o AEAresults.txt
Program requires five inputs: -i: an "index" file that gives a numeric identifier to each gene. -a: an "annotation" file that indicates the genes (in terms of the numeric identifiers) that are annotated to each GO term. -p: a "partition" file that indicates the branches to which each term belongs. The first column is the GO term, the second is the list of branches (in terms of numeric identifiers). Note that if the number of branches equals the number of terms, the code assumes that the branch "name" is the GO term found on the corresponding numeric "row". -s: a "signature" file containing a list of the genes in each signature. The first column is the signature name, the second is the list of genes in the signature, separated by commas. -o: the "output" file name to which the AEA results will be written. This output file has eight columns. The first two are the internal numeric identities given to the signature and branch, the next are the signature and branch names. Next are the number of annotations made to the signature and branch, then the number of annotations that are common between the two. Finally, the last column is the AEA-predicted p-value for the signature's enrichment in the branch. Optional inputs: -n: the number of randomizations to use.
Additional Options (use with caution):
-t: the "type" of randomization to use. 0 selects the analytic approximation (AEA-A). 1 only randomizes genes;
2 only randomizes terms; 3 only randomizes genes and calculates the p-value by evaluating the overlap in
genes (AEA-G).
-r: randomization options. 0 creates random gene sets with the same number of genes as the sets in the input
"signatures" file, 1 creates random communities of terms of the same size as those included in the
"partition" file. 2 does both. Note that these randomizations only hold the number of genes and terms
fixed, but do not consider any of the annotation properties of the randomly selected genes or terms.
They are NOT equivalent to the randomizations done in the publication.
Please see the files included with this distribution for more details concerning the formatting. In order to create updated index and annotation files, or these files for other organisms and ontologies you are welcome to use the OntologyAnnotations.c code provided with this distribution. This code was designed to accept OBO and GAF files as supplied by the Gene Ontology consortium (available geneontology.org/downloads). Note, however, that the formatting details of the files provided at geneontology.org appear to have changed since this code was written, and there might be some memory issues if you download the most recent annotations from the Gene Ontology website.
- Fixed code to compile on g++ 4.8.2 (ubuntu 14.04)
- Added Makefile
//wbr, yy