Single Enrichment

Ralonso edited this page Jun 10, 2015 · 96 revisions

FatiGO. This is the conventional enrichment test, in which two lists of genes are compared. Usually a group of genes that are significant in a given test is compared to the rest of the genes in the experiment, although any two groups formed in any way can be tested against each other. This approach detects significant over-representation of functional annotations in one gene set with respect to the other. The following sections provide a tutorial on how to use this tool:

Input data

The input for the Single Enrichment Analysis is a list of genes, transcripts or proteins. If you have a ranked list of the whole genome or proteome, we recommend using Gene Set Enrichment instead.

Steps

  1. Online examples. Here you can load small datasets from our server. You can use them to run examples and see how the tool works. Click on the links to load the data. Example1: motor vs apoptosis. The two lists contain genes associated with motor and apoptosis processes. With this analysis we can compare the terms enriched in each list and see that the significant terms are clearly associated with the appropriate list.

  2. Define your comparison. Input data should be uploaded with the ID list data type. See data types here.

    Single Enrichment offers two modes to define your comparison:

    • Id List vs Id List: computes the functional enrichment analysis between two lists of identifiers of interest.
    • Id List vs Rest of genome: computes the functional enrichment analysis of a list of identifiers against the rest of the genome.
  3. Select your data. Here you can select the dataset you want to analyse with Single Enrichment. You should have uploaded it previously using Upload or directly from Select your data. Your data should be tagged with the idlist data type.

    If Id List vs Id List has been chosen, the two lists must be selected using the Select your data and List 2 boxes. When the second list is the rest of the genome or the complementary list of your annotations, Single Enrichment calculates the second list and the user must supply only the first one.

  4. Options. Choose the options for the Fisher test and for removing duplicates.

    • Fisher exact test

      The user can choose among three options for the Fisher exact test:

      • Two-tailed: detects over- and under-represented terms when comparing both lists.
      • Over-represented terms in list 1 (recommended in genome comparison): detects terms over-represented in list 1 compared to list 2. This is the best choice when list 2 is the rest of the genome.
      • Over-represented terms in list 2: detects terms over-represented in list 2 compared to list 1.
    • Remove duplicates

      Single Enrichment can check for and remove duplicates in the lists of gene or protein identifiers. Four options are given:

      • Never: lists 1 and 2 are not checked for duplicate occurrences. Use this option only if you are sure that none of the identifiers is duplicated.
      • Remove on each list separately: each list is preprocessed for duplicates separately. One member of each duplicate pair is kept.
      • Remove on each list and common ids: first, each list is preprocessed for duplicates separately and one member of each duplicate pair is kept. Then both lists are preprocessed for duplicates together, and neither member of a pair shared between the lists is kept.
      • Remove from list 2 those appearing in list 1 (complementary list).
  5. Databases. Select the databases which include the annotations that you are interested in.

    • Organism: select your organism of interest from a wide range of species.

    • Databases included in Babelomics: the databases (DB) available for the selected organism can be tested simultaneously, and a specific filter option is provided for each one:

      • GO - biological process:

        Options Form:

        Filter terms by number of annotated ids in DB: enter a minimum and a maximum number of annotated ids per gene set to filter out terms that are too small or too large.

        Propagation of functional annotation: in the options form you can choose whether or not to use propagated GO annotations. Propagated is selected by default.

      • GO - molecular function. You can define a filter with the same parameters as described above in the GO - biological process section.

      • GO - cellular component. You can define a filter with the same parameters as described above in the GO - biological process section.

      • GOSlim GOA. You can define a filter with the same parameters as described above in the GO - biological process section.

      • Interpro motifs. You can define a filter with the same parameters as described above in the GOSlim GOA section.

      • Genome-Scale Metabolic Network (Recon). You can define a filter with the same parameters as described above in the GOSlim GOA section.

    • Your annotations. This is a useful option for users who work on species that our database does not support or who want to use their own annotation.

  6. Job information. Fill in the information for this job. Give the job a name and select the folder where the resulting files should be saved.

    • Select the output folder
    • Choose a job name
    • Specify a description for the job if desired.
  7. Launch job. Press the Launch job button and wait until the analysis is finished. You can check the state of your job by clicking the jobs button at the top right of the panel menu. A box will appear on the right side listing all your jobs. When the analysis is finished, the job is labelled "Ready". Click on it and you will be redirected to the results page. A normal job usually takes less than two minutes, but the time may vary depending on the size of the list (number of genes/transcripts/proteins).
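
The four duplicate-management options of step 4 can be pictured with a small Python sketch. It only illustrates the behaviour described above and is not the Babelomics implementation. The mode names ("never", "each", "each_and_common", "complementary") are hypothetical labels, and the sketch assumes that the complementary option leaves list 1 untouched and only removes list 1 identifiers from list 2.

```python
from typing import List, Tuple


def dedupe(ids: List[str]) -> List[str]:
    """Keep the first occurrence of each identifier, preserving order."""
    seen = set()
    unique = []
    for identifier in ids:
        if identifier not in seen:
            seen.add(identifier)
            unique.append(identifier)
    return unique


def manage_duplicates(list1: List[str], list2: List[str],
                      mode: str) -> Tuple[List[str], List[str]]:
    """Apply one of the four duplicate-management options (hypothetical mode names)."""
    if mode == "never":
        # Lists are used exactly as supplied.
        return list(list1), list(list2)
    if mode == "complementary":
        # Assumption: only ids of list 1 are removed from list 2.
        in_list1 = set(list1)
        return list(list1), [i for i in list2 if i not in in_list1]
    unique1, unique2 = dedupe(list1), dedupe(list2)
    if mode == "each":
        # One member of each duplicate pair is kept, per list.
        return unique1, unique2
    if mode == "each_and_common":
        # Ids present in both lists are dropped from both.
        common = set(unique1) & set(unique2)
        return ([i for i in unique1 if i not in common],
                [i for i in unique2 if i not in common])
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    genes1 = ["A", "B", "B", "C"]
    genes2 = ["C", "D", "D", "E"]
    for mode in ("never", "each", "each_and_common", "complementary"):
        print(mode, manage_duplicates(genes1, genes2, mode))
```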

Interpreting the output results

The FatiGO method (Al-Shahrour et al., 2004) was the first proposal for functional enrichment that took into account the multiple testing problem. Single Enrichment works as follows:

  • Single Enrichment compares two lists of genes. Usually a group of genes that are significant in a given test is compared to the rest of the genes in the experiment, although any two groups formed in any way can be tested against each other.
  • These gene lists are transformed into two lists of annotations using the corresponding gene-annotation association table.
  • Then a Fisher's exact test for 2×2 contingency tables is used to check for significant over-representation of functional terms in one of the lists with respect to the other one.
  • A multiple testing correction is applied to account for the multiple hypotheses tested (one for each functional term). Single Enrichment uses the Benjamini & Hochberg (B&H) FDR method.
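
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the Babelomics code. It assumes a two-sided Fisher's exact test computed from the hypergeometric distribution and the standard Benjamini & Hochberg step-up adjustment. The function names, the toy gene identifiers and the term names are all hypothetical.

```python
from math import comb
from typing import Dict, List, Set, Tuple


def fisher_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    a, b: list 1 ids annotated / not annotated to the term
    c, d: list 2 ids annotated / not annotated to the term
    """
    n1, n2, k = a + b, c + d, a + c  # list sizes and total annotated ids
    total = comb(n1 + n2, k)

    def prob(x: int) -> float:
        # Hypergeometric probability of x annotated ids in list 1.
        return comb(n1, x) * comb(n2, k - x) / total

    observed = prob(a)
    lo, hi = max(0, k - n2), min(k, n1)
    # Sum the probability of every table no more likely than the observed one.
    tail = sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= observed * (1 + 1e-9))
    return min(1.0, tail)


def benjamini_hochberg(pvalues: List[float]) -> List[float]:
    """Benjamini & Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def enrich(list1: Set[str], list2: Set[str],
           annotations: Dict[str, Set[str]]) -> Dict[str, Tuple[float, float]]:
    """Return {term: (p-value, adjusted p-value)} for every term."""
    terms = sorted(annotations)
    pvalues = []
    for term in terms:
        members = annotations[term]
        a = len(list1 & members)
        c = len(list2 & members)
        pvalues.append(fisher_two_sided(a, len(list1) - a, c, len(list2) - c))
    return dict(zip(terms, zip(pvalues, benjamini_hochberg(pvalues))))


if __name__ == "__main__":
    # Toy data: hypothetical identifiers and terms.
    first = {"g1", "g2", "g3", "g4", "g5"}
    second = {"g6", "g7", "g8", "g9", "g10"}
    table = {"termA": {"g1", "g2", "g3", "g4", "g5"}, "termB": {"g1", "g6"}}
    for term, (p, adj) in enrich(first, second, table).items():
        print(term, p, adj)
```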

The results page contains the following sections:

  1. Job information. A short description of your analysis.

  2. Input data. In this section you will find a reminder of the parameters or settings you used to run the analysis. The two lists analysed can also be downloaded as text files, even when the genome or your annotations comparison has been chosen.

    Duplicates management

    • Number of duplicates: number of duplicates and percentage.
    • Number of finally used ids: number of finally used ids after removing duplicate identifiers.
  3. Summary. A three-column table showing the number of genes annotated to each database in each list. This table can be downloaded as a text file.

    Id annotations per DB:

    • DB: the name of the database chosen. Only the databases with at least one gene annotated in one of the lists will be analysed by Single Enrichment.
    • List1: shows three elements: the number of genes in List1 annotated in the database over the total number of genes remaining in List1 after duplicates management; the percentage of genes in List1 annotated in the database; and the ratio of annotations per identifier.
    • List2/Genome/Your annotations: the same structure as for List1 above, but applied to List2, the genome or your annotations after duplicates management.
  4. Significant results

    • Number of significant terms per DB

      A three-column table showing the number of significant terms encountered after the functional analysis.

      • DB: the name of the database analysed. Only the databases with at least one gene annotated in one of the lists are analysed by Single Enrichment.
      • Total number of significant terms: number of significant terms highlighted in red if the test had positive results.
    • Table of significant terms per DB

      If a database has significant terms (depending on the p-value selected), this section provides a downloadable text file and a table, both containing information about the significant terms, plus a representation of the DAG for GO databases (GO network section). The table is initially sorted by adjusted p-value and can be sorted in ascending or descending order by clicking on the column headings. At the bottom of the table the user can move between the pages into which the table has been split.

      • Term: description and id of the term.
      • Term size: number of identifiers in the input data annotated to this term.
      • Term size (in genome): global size of the term in the whole genome.
      • Term annotation % per list: percentage of identifiers in each list annotated to the term, with respect to the total number of identifiers. The percentage is represented as a color bar (red for List1 and blue for List2), darker for the list with the higher percentage and lighter for the list with the lower percentage.
      • Annotated ids: lists of identifiers annotated to this term. Clicking on a list opens a pop-up window with its genes, where each gene is linked to its accession in Ensembl.
      • Odds ratio (log e): the odds ratio compares whether the probability of a certain event is the same for two groups, and its sign indicates which of the lists is enriched in this term. When log e (odds ratio) is positive, List1 is enriched in the term; when it is negative, the enrichment is in List2.
      • p-value: unadjusted p-value from the Fisher's exact test.
      • Adjusted p-value: p-value from the Fisher's exact test after correcting for multiple testing with the FDR procedure of Benjamini & Hochberg. Use this p-value instead of the unadjusted one, because Single Enrichment performs a large number of tests simultaneously and the error rate must be corrected.
    • GO network. A representation of the Directed Acyclic Graph (DAG) in GO databases for the significant GO terms. Colored nodes represent the significant results (red for GOs over-represented and blue for GOs under-represented in list 1), whereas white nodes represent the parents of the significant GOs. A more intense color indicates a smaller adjusted p-value (higher statistical significance). You can choose different visualization options in the tool bar of the embedded application. For further information about how to use Network Viewer Maps, visit the Network viewer documentation.

  5. All results. A downloadable text file for each database provides the results for all terms. The file has the same structure as the significant results file.

    • Term: identifier of the term.
    • Term size: number of identifiers in the input data annotated to this term.
    • Term size (in genome): global size of the term in the whole genome.
    • List1 positives: number of identifiers of List1 annotated to the functional term.
    • List1 negatives: number of identifiers of List1 not annotated to the functional term.
    • List1 percentage: percentage of identifiers of List1 annotated to the functional term.
    • List2 positives: number of identifiers of List2 annotated to the functional term.
    • List2 negatives: number of identifiers of List2 not annotated to the functional term.
    • List2 percentage: percentage of identifiers of List2 annotated to the functional term.
    • List1 positive ids: list of identifiers in List1 annotated to this term.
    • List2 positive ids: list of identifiers in List2 annotated to this term.
    • Odds ratio (log e): the odds ratio compares whether the probability of a certain event is the same for two groups, and its sign indicates which of the lists is enriched in this term. When log e (odds ratio) is positive, List1 is enriched in the term; when it is negative, the enrichment is in List2.
    • p-value: unadjusted p-value from the Fisher's exact test.
    • Adjusted p-value: p-value from the Fisher's exact test after correcting for multiple testing with the FDR procedure of Benjamini & Hochberg. Use this p-value instead of the unadjusted one, because Single Enrichment performs a large number of tests simultaneously and the error rate must be corrected.
  6. Annotation files. A list of downloadable files containing the identifier - functional term correspondence. If a gene is annotated to several terms, each annotation is listed on a separate line. These lists can be used as an input annotation file in the Single Enrichment Your Annotations mode.

  7. Other actions

  • Change p-value: the p-value threshold used to decide whether a functional term is significant is highlighted and can be modified to make it more stringent or more liberal. The default threshold is 0.05, but the user can change it and Single Enrichment will show the significant results according to the new value.
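
The sign convention of the log odds ratio and the effect of changing the p-value threshold can be reproduced from the counts in the All results file. The sketch below assumes the plain sample odds ratio and a strict "smaller than" comparison against the threshold. How the tool handles zero counts or ties is not documented here, so those branches are assumptions. The function and term names are hypothetical.

```python
from math import inf, log
from typing import Dict, List


def log_odds_ratio(l1_pos: int, l1_neg: int, l2_pos: int, l2_neg: int) -> float:
    """Natural log of (l1_pos / l1_neg) / (l2_pos / l2_neg).

    Positive values point to enrichment in List1, negative values to List2.
    """
    numerator = l1_pos * l2_neg
    denominator = l1_neg * l2_pos
    if numerator == 0 and denominator == 0:
        return float("nan")  # undefined (assumption)
    if denominator == 0:
        return inf           # term annotated only on the List1 side (assumption)
    if numerator == 0:
        return -inf          # term annotated only on the List2 side (assumption)
    return log(numerator / denominator)


def significant_terms(adjusted: Dict[str, float],
                      threshold: float = 0.05) -> List[str]:
    """Terms whose adjusted p-value is below the threshold, most significant first."""
    hits = [term for term, p in adjusted.items() if p < threshold]
    return sorted(hits, key=lambda term: adjusted[term])


if __name__ == "__main__":
    print(log_odds_ratio(8, 2, 1, 5))   # positive: enriched in List1
    print(log_odds_ratio(1, 5, 8, 2))   # negative: enriched in List2
    adjusted_pvalues = {"termA": 0.2, "termB": 0.01, "termC": 0.04}
    print(significant_terms(adjusted_pvalues))        # default threshold
    print(significant_terms(adjusted_pvalues, 0.02))  # more stringent
```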

Worked examples and exercises

How functional profiling should never be done

It is not uncommon to find the following assertion in papers and talks: "then we examined our set of genes selected in this way (whatever) and we discovered that 65% of them were related to metabolism, so we can conclude that our experiment activates metabolism genes". This may or may not be true, depending on the relative abundance of the term. If you look at the rest of the genes, those not activated in the experiment, and the proportion of them related to metabolism is, say, 10%, then you are right. Conversely, if the proportion is, say, 61%, then the experiment probably has nothing to do with metabolism. A statistical comparison is compulsory to support such assertions.
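
The point can be checked numerically. The sketch below applies a one-sided Fisher's exact test (over-representation in the selected set) to the two scenarios. The gene counts are hypothetical and chosen only to match the percentages in the text: 100 selected genes with 65 related to metabolism, compared to 1000 remaining genes with either 100 (10%) or 610 (61%) related to metabolism.

```python
from math import comb


def over_representation_pvalue(sel_pos: int, sel_total: int,
                               rest_pos: int, rest_total: int) -> float:
    """One-sided Fisher's exact test.

    Probability of seeing at least sel_pos annotated genes in the selected
    set if annotated genes were spread at random over both sets.
    """
    annotated = sel_pos + rest_pos
    upper = min(annotated, sel_total)
    # Exact integer sum of the hypergeometric upper tail.
    tail = sum(comb(sel_total, x) * comb(rest_total, annotated - x)
               for x in range(sel_pos, upper + 1))
    return tail / comb(sel_total + rest_total, annotated)


if __name__ == "__main__":
    # Hypothetical counts: 65% of 100 selected genes are metabolism genes.
    p_background_10 = over_representation_pvalue(65, 100, 100, 1000)
    p_background_61 = over_representation_pvalue(65, 100, 610, 1000)
    print("rest of genes at 10%:", p_background_10)
    print("rest of genes at 61%:", p_background_61)
```

With these counts the first p-value is far below 0.05 and the second is well above it, so the same 65% supports the claim in one case and not in the other.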

Comparing two lists of genes

There are many situations in which the comparison of two lists of genes answers a relevant biological question. In fact, a large number of problems can be addressed in this way. For example, one might be interested in knowing whether a group of genes that co-express are functionally related. Typically this involves comparing a set of genes that clustered together (by any clustering method) to the rest of the genes. Another commonly addressed question is whether the genes differentially expressed between two experimental conditions are functionally related. Many other similar questions are commonly asked when analysing microarray data or, in general, genomic data. This approach has been specifically designed to answer these kinds of questions.


1. Exploring differences in GO terms with Enrichment Analysis, basics

The simplest use of the tool is to take a quick look at the functional processes in which a set of genes takes part. The submitted list of genes is analysed against the rest of the genome to assess the significance of the abundance of GO terms or other gene sets.

  • Here you can find the corresponding file for this worked example, containing a list of genes of Saccharomyces cerevisiae. Save this file to your desktop or local directory and upload it to Babelomics as an idlist type of data.
  • Create a new project (e.g. worked_example1) and start a new Enrichment Analysis job in the Functional analysis » Single enrichment analysis section of the tools.
  • Choose the Id List vs. Rest of genome option.
  • Choose as List1 the data you have already uploaded.
  • In the Options section choose Over-represented terms in List1, as we want to compare the functions of our list of genes against the rest of the genome. Even if we know our list of genes and are sure that it contains no duplicates, it is better to make sure by applying a duplicate management option.
  • Choose the organism database Saccharomyces cerevisiae
  • In the Databases section, check GO - biological process
  • Give a name to the new job (e.g. example1_Enrichment_Analysis)
  • Maintain the rest of the parameters as default
  • Submit the job (press the launch button)

The number of significant functional terms is summarised in a table. If you look at the significant results, you can sort them by adjusted p-value.

You will get a summary table with the number of significant GO terms associated with the genes, and then a table for each database with information about the test for each significant functional term. The table can be sorted by the difference in the percentage of genes annotated to the GO term in each list, by p-value or by adjusted p-value.

Submit other jobs, playing around with other parameters: Gene Ontology database, other databases, p-value.


2. Exploring differences in other functional information with Enrichment Analysis, basics

As in the previous worked example, Enrichment Analysis can be used to check other functional information, such as InterPro or Genome-Scale Metabolic Network terms.

  • Here you can find the corresponding file for the second worked example, containing a list of genes of Homo sapiens. Save this file to your desktop or local directory and upload it as a gene - idlist data type.
  • Create a new project (e.g. worked_Example2) and start a new Enrichment Analysis job in the Functional analysis » Single enrichment analysis section of the tools.
  • Choose the Id List vs. Rest of genome option.
  • Choose as List1 the data you have already uploaded.
  • In the Options section choose Over-represented terms in List1, as we want to compare the functions of our list of genes against the rest of the genome. Even if we know our list of genes and are sure that it contains no duplicates, it is better to make sure by applying a duplicate management option.
  • Choose the organism database Homo sapiens
  • In the Databases section, check the GO - biological process, InterPro and Genome-Scale Metabolic Network check boxes, then click the options link of GO - biological process.
  • Give a name to the new job (e.g. example2_Enrichment_Analysis)
  • Maintain the rest of the parameters as default
  • Submit the job (press the launch button)

The number of significant functional terms for each database is summarised in a table. If you look at the significant results, you can sort them by adjusted p-value.

You will get a summary table with the number of significant GO terms associated with the genes, and then a table for each database with information about the test for each significant functional term. The table can be sorted by the difference in the percentage of genes annotated to the GO term in each list, by p-value or by adjusted p-value.

Afterwards, launch more jobs, choosing other databases or several at a time, and change the option parameters.


3. Exploring differences in GO terms with Enrichment Analysis

Let us illustrate the application of Enrichment Analysis with a classical example. We use the data from Chu et al. (1998), The Transcriptional Program of Sporulation in Budding Yeast, Science, 282, 699-705, and cluster the genes according to their expression patterns. We choose a cluster of co-expressing genes and test the hypothesis that "genes of similar function will tend to co-express".

  • The files for the third worked example correspond to a cluster of co-expressing genes and the rest of the genes in the experiment on Saccharomyces cerevisiae. Save both files to your desktop or local directory and upload them as a gene - idlist data type.
  • Create a new project (e.g. sporulation) and start a new Enrichment Analysis job in the Functional analysis » Single enrichment analysis section of the tools.
  • Choose the Id List vs. Id List option.
  • Choose as List1 the data sporulation_clus42 and as List2 the data sporulation_all_but_clus42 that you have already uploaded.
  • In the Options section choose Over-represented terms in List1, as we want to compare the functions of our cluster of co-expressed genes against the rest of the clusters. Even if we know our list of genes and are sure that it contains no duplicates, it is better to make sure by applying a duplicate management option.
  • Choose the organism database Saccharomyces cerevisiae
  • In the Databases section, check the GO - biological process check box and click the options link. Also check GO - Cellular Component.
  • Give a name to the new job (e.g. sporulationEnrichment Analysis)
  • Maintain the rest of the parameters as default
  • Submit the job (press the launch button)

If we compare the cluster to the rest of the genes in the experiment, we can see that several terms related to meiosis and chromosome components are significantly over-represented in the cluster of co-expressing genes. Keep in mind that this test assumes that you do not have any a priori hypothesis about which biological process is operating in this particular cluster of genes.

4. Exploring differences in Gene Ontology

Similarly you can explore functional differences using other functional relevant terms such as cellular components or molecular functions. We can use Enrichment Analysis for this purpose.

  • The files for the fourth worked example correspond to a list of genes related to apoptosis, which will be compared to a list of genes extracted from chromosome 19 of Homo sapiens. Save both files to your desktop or local directory and upload them as a gene - idlist data type.
  • Create a new project (e.g. apoptosis) and start a new Enrichment Analysis job in the Functional analysis » Single enrichment analysis section of the tools.
  • Choose the Id List vs. Id List option.
  • Choose as List1 the data fatigo_apoptosis and as List2 fatigo_chr19 that you already uploaded.
  • In the Options section choose Over-represented terms in List1. Choose the option to manage duplicates separately in the two lists.
  • Choose the organism database Homo sapiens
  • In the Databases section, check GO - Biological Process, GO - Cellular Component and GO - Molecular Function.
  • Give a name to the new job (e.g. apoptosis_vs_chr19_Enrichment Analysis)
  • Maintain the rest of the parameters as default
  • Submit the job (press the launch button)

Note that the significant terms are enriched only in the apoptosis-related list. The terms are associated with programmed cell death, as can be seen in the GO term descriptions.


Citing

  • Al-Shahrour, F., Minguez, P., Tárraga, J., Medina, I., Alloza, E., Montaner, D., & Dopazo, J. (2007). FatiGO+: a functional profiling tool for genomic data. Integration of functional annotation, regulatory motifs and interaction data with microarray experiments. Nucleic Acids Research 35 (Web Server issue): W91-96
  • Al-Shahrour, F., Díaz-Uriarte, R. & Dopazo, J. (2004). FatiGO: a web tool for finding significant associations of Gene Ontology terms with groups of genes. Bioinformatics 20: 578-580
