
Functional GO Enrichment

Francisco García edited this page Jan 20, 2015 · 34 revisions

INPUT

Input data should be a raw counts matrix uploaded with the ID list data type. See data types [here](Data Types).

Online examples

Here you can load small datasets from our server. You can use them to run examples and see how the tool works. Click on the links to load the data.

  • Example1: motor vs apoptosis. The two lists contain genes associated with motor and apoptosis processes. With this analysis we can compare the terms enriched in each list and see that the significant terms are clearly associated with their corresponding list.

STEPS

  1. [Define your comparison](Functional GO Enrichment. Define your comparison)

  2. Select your data

  3. Choose a data set of raw counts among the data sets you have already uploaded to your personal user folder. Data should NOT be normalized.

  4. If desired, select a normalization method. Available normalization methods can be reviewed [here](Differential Expression for RNA-Seq). We recommend using a normalization method.

  5. Select multiple test-correction method. This is the method used by R to adjust the p-values.

  6. Select the value of the adjusted p-value.

  7. Select the output folder. REMEMBER: Use a different folder for each job.

  8. Choose job name and specify a description for the job if desired.

  9. Press Run button.

Define your comparison

Select your data

Here you can select the dataset you want to analyse by single enrichment. You should have uploaded it previously using the Upload Menu in Babelomics and tagged it with the idlist data type.

If the Id List vs Id List comparison has been chosen, both lists must be selected. When the second list is the rest of the genome or the complementary list of your annotations, GO Enrichment calculates the second list and the user only has to supply the first one.

Options

Fisher exact test

The user can choose between the three options for the Fisher exact test:

  • Two-tailed: detects over and under-represented terms comparing both lists
  • Over-represented terms in list 1 (recommended in genome comparison): detects over-represented terms in list 1 of genes. The best choice when list 2 of genes is the rest of the genome.
  • Over-represented terms in list 2: detects over-represented terms in list 2 of genes compared to list 1 of genes.

See Methods section for details on the test.
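As an illustration of the three options above, here is a minimal sketch of Fisher's exact test on a 2x2 table, written with the Python standard library only. It follows the textbook hypergeometric formulation and is not taken from the Babelomics source; the function name `fisher_exact` and the `alternative` labels are assumptions chosen for this example. "two-sided" corresponds to the two-tailed option, "greater" to over-represented terms in list 1, and "less" to over-represented terms in list 2.

```python
from math import comb


def fisher_exact(a, b, c, d, alternative="two-sided"):
    """Fisher's exact test on the 2x2 table [[a, b], [c, d]].

    a, b: ids of list 1 annotated / not annotated to the term
    c, d: ids of list 2 annotated / not annotated to the term
    alternative: "two-sided", "greater" (term over-represented in list 1)
                 or "less" (term over-represented in list 2)
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x annotated ids in list 1
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo = max(0, col1 - row2)  # smallest possible value of a
    hi = min(row1, col1)      # largest possible value of a

    if alternative == "greater":
        p = sum(prob(x) for x in range(a, hi + 1))
    elif alternative == "less":
        p = sum(prob(x) for x in range(lo, a + 1))
    elif alternative == "two-sided":
        p_obs = prob(a)
        # Add up every table that is as probable as, or less probable than,
        # the observed one (small tolerance for floating-point ties)
        p = sum(q for q in map(prob, range(lo, hi + 1))
                if q <= p_obs * (1 + 1e-7))
    else:
        raise ValueError(f"unknown alternative: {alternative}")
    return min(1.0, p)


# Hypothetical counts: 8 of 10 ids in list 1 and 1 of 6 ids in list 2
# are annotated to the term
print(fisher_exact(8, 2, 1, 5, "two-sided"))
print(fisher_exact(8, 2, 1, 5, "greater"))
```

When list 2 is the rest of the genome, only the "greater" alternative answers the question of interest, which is consistent with the recommendation above.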

Remove duplicates?

GO Enrichment can check for and remove duplicates in the lists of gene or protein identifiers. Four options are given:

  • Never: list 1 and 2 are not checked for duplicate occurrences. Use this option only if you are sure that none of the references are duplicated.
  • Remove on each list separately: each list is preprocessed for duplicates separately. One member of the duplicate pair is kept.
  • Remove on each list and common ids: first, each list is preprocessed for duplicates separately and one member of each duplicate pair is kept. Then both lists are preprocessed for duplicates together and neither member of a pair is kept.
  • Remove from list 2 those appearing in list 1 (complementary list).
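The four options above can be sketched as follows. This is an illustration of the described behaviour, not the tool's actual code: the function names and the mode labels (`"never"`, `"each"`, `"each_and_common"`, `"complementary"`) are hypothetical, and for the complementary mode the sketch assumes that only list 2 is modified.

```python
def unique(ids):
    """Drop repeated ids, keeping the first occurrence and the original order."""
    return list(dict.fromkeys(ids))


def remove_duplicates(list1, list2, mode="each"):
    """Return cleaned copies of (list1, list2) for the chosen duplicate policy."""
    if mode == "never":
        # Lists are used exactly as supplied
        return list(list1), list(list2)
    if mode == "each":
        # Each list is cleaned on its own; one copy of every id is kept
        return unique(list1), unique(list2)
    if mode == "each_and_common":
        # Clean each list, then drop ids shared by both lists from both
        l1, l2 = unique(list1), unique(list2)
        common = set(l1) & set(l2)
        return ([i for i in l1 if i not in common],
                [i for i in l2 if i not in common])
    if mode == "complementary":
        # Assumption: list 1 is left untouched and list 2 loses its ids
        in_list1 = set(list1)
        return list(list1), [i for i in list2 if i not in in_list1]
    raise ValueError(f"unknown mode: {mode}")


list1 = ["A", "B", "A", "C"]
list2 = ["C", "D", "D"]
print(remove_duplicates(list1, list2, "each_and_common"))
```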

Databases

  • Organism: select your organism of interest from a wide range of species.

  • Databases: databases available for the organism selected can be tested simultaneously and a specific filter option is provided for each one:

    • GO - biological process:

      Filter terms by number of annotated ids in DB.

      Introduce a minimum and maximum number of annotated ids in the gene set to filter out smaller and bigger terms. The filter can be applied in all the genome or in the user input ids.

    • GO - molecular function: you can define a filter with the same parameters as described above in the GO - biological process section.

    • GO - cellular component: you can define a filter with the same parameters as described above in the GO - biological process section.

    • GOSlim GOA

      Filter terms by number of annotated ids in DB.

      Introduce a minimum and maximum number of annotated ids in the gene set to filter out smaller and bigger terms. The filter can be applied in all the genome or in the user input ids.

    • Interpro motifs: you can define a filter with the same parameters as described above in GOSlim GOA section.

    • KEGG pathways:

      Filter terms by number of annotated ids in DB.

      Introduce a minimum and maximum number of annotated ids in the gene set to filter out smaller and bigger terms. The filter can be applied in all the genome or in the user input ids.

    • Your annotations: this is a useful option for users working on species that our database does not support or who want to use their own annotation.
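The "Filter terms by number of annotated ids" option described for the databases above can be pictured with the sketch below. It assumes the annotation is held as a mapping from term to the set of ids annotated to it; the function name, the parameter names and the term labels `T1` to `T3` are hypothetical. Passing `input_ids` counts term size among the user's ids only, otherwise the size in the whole genome is used.

```python
def filter_terms_by_size(annotations, min_size, max_size, input_ids=None):
    """Keep the terms whose number of annotated ids lies in [min_size, max_size].

    annotations: dict mapping term -> set of ids annotated to it in the DB
    input_ids:   if given, sizes are counted among these ids only;
                 if None, sizes are counted over the whole genome
    """
    scope = None if input_ids is None else set(input_ids)
    kept = {}
    for term, ids in annotations.items():
        counted = ids if scope is None else ids & scope
        if min_size <= len(counted) <= max_size:
            kept[term] = ids
    return kept


annotations = {
    "T1": {"a", "b", "c"},
    "T2": {"a"},
    "T3": {"a", "b", "c", "d", "e"},
}
print(sorted(filter_terms_by_size(annotations, 2, 4)))
print(sorted(filter_terms_by_size(annotations, 2, 4, input_ids=["a", "b"])))
```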

Job

  • Job name. Give a short name to your analysis job
  • Job description. You can use this section to document further the characteristics of this analysis

Their aim is to help you identify the analysis you are running and distinguish between several analyses. Setting the name is mandatory, but you can leave the description empty if you do not want to use it.

Run

Once all options are set you can run the job. You may get some error message if some parameters are not properly set. If you do, just check the options you have chosen.

See Output results section for details on the result data format and plots.


OUTPUT

Input data

In this section you will find a reminder of the parameters or settings you have used to run the analysis. The two lists analysed can also be downloaded as a text file, even when the genome or your annotations comparison has been chosen.

Summary

  • Id annotations per DB

A three-column table showing the number of genes annotated to each database in each list:

  • DB: the name of the database chosen. Only the databases with at least one gene annotated in one of the lists will be analysed by GO Enrichment.
  • List1: shows three elements. The number of genes in List1 annotated in the database over the total number of genes remaining in List1 after the duplicates management. The percentage of genes in List1 annotated in the database. The ratio of annotations per identifier.
  • List2/Genome/Your annotations: the same structure as List1, explained above, but applied to List2, Genome or Your Annotations after the duplicates management.
  • Duplicates management

    • Number of duplicates: number of duplicates and percentage.
    • Number of finally used ids: number of finally used ids after removing duplicate identifiers.
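The three figures reported per list in the summary table can be reproduced roughly as below. This is a sketch under assumptions: the function and key names are hypothetical, and "ratio of annotations per identifier" is interpreted here as the number of annotations divided by the number of annotated ids, which the page does not state explicitly.

```python
def annotation_summary(list_ids, id_to_terms):
    """Summarise how well a list is annotated in one database.

    list_ids:    ids remaining in the list after duplicates management
    id_to_terms: dict mapping id -> set of terms annotated to that id
    """
    annotated = [i for i in list_ids if id_to_terms.get(i)]
    n_annotations = sum(len(id_to_terms[i]) for i in annotated)
    total = len(list_ids)
    return {
        "annotated": len(annotated),
        "total": total,
        "percentage": 100.0 * len(annotated) / total if total else 0.0,
        # Assumption: ratio taken over annotated ids, not over all ids
        "annotations_per_id": n_annotations / len(annotated) if annotated else 0.0,
    }


id_to_terms = {"g1": {"T1", "T2"}, "g2": {"T1"}, "g3": set()}
print(annotation_summary(["g1", "g2", "g3", "g4"], id_to_terms))
```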

Significant results

  • Number of significant terms per DB

A three-column table showing the number of significant terms encountered after the functional analysis.

  • DB: the name of the database analysed. Only the databases with at least one gene annotated in one of the lists are analysed by GO Enrichment.
  • Total number of significant terms: number of significant terms, highlighted in red if the test had positive results.
  • Table of significant terms per DB

If a database has significant terms (depending on the p-value selected), this section provides a downloadable text file and a table, both containing information about the significant terms, and, for GO databases, a representation of the DAG. The table is initially sorted by the adjusted p-value and can be sorted in ascending or descending order by clicking on the column headings. At the bottom of the table the user can move between the pages into which the table has been split.

  • Term: description and id of the term. You can follow the link to general databases such as Ensembl or novo|seek, or to more specific data related to the term in its original database, for instance the pathway plot of a KEGG pathway.
  • Term size: number of identifiers in the input data annotated to this term.
  • Term size (in genome): global size of the term in the whole genome.
  • Term annotation % per list: percentage of identifiers in each list annotated to the term with respect to the total number of identifiers. The percentage is represented as a colour bar (red for List1 and blue for List2), darker for the list with the higher percentage and lighter for the list with the lower percentage.
  • Annotated ids: lists of identifiers annotated to this term. Clicking on each one opens the list of genes in a pop-up window where each gene is linked to its accession in Ensembl.
  • Odds ratio (log e): the odds ratio is a way of comparing whether the probability of a certain event is the same for two groups. Its sign indicates which of the lists is enriched in this term: when the log e (odds ratio) is positive, List1 is enriched in the term; when it is negative, the enrichment is related to List2.
  • p-value: unadjusted p-value from the Fisher's exact test.
  • Adjusted p-value: adjusted p-value from the Fisher's exact test, after correcting for multiple testing using the FDR procedure of Benjamini & Hochberg. Use this p-value instead of the unadjusted one, because GO Enrichment performs a large number of tests simultaneously and the error rate must be corrected.

The representation of the DAG in the GO databases analysis shows the GO terms coloured by the adjusted p-value of their enrichment in the lists. Note that the DAG is only created when up to 100 GO terms are significant.
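To show why the adjusted p-value is larger than the unadjusted one, the Benjamini & Hochberg FDR adjustment named above can be sketched in a few lines. This is the standard step-up procedure written in plain Python for illustration; the tool itself relies on R for the adjustment, and the function name here is an assumption.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0  # also caps adjusted values at 1
    # Walk from the largest p-value down so that adjusted values
    # never increase as the raw p-value decreases
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical unadjusted p-values for four terms
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Each raw p-value is multiplied by the number of tests divided by its rank, so a term that looks significant on its own can stop being significant once all the tested terms are taken into account.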

All results

A downloadable individual text file for each database provides results for all terms. The file has the same structure as the significant results one.

  • Term: identifier of the term.
  • Term size: number of identifiers in the input data annotated to this term.
  • Term size (in genome): global size of the term in the whole genome.
  • List1 positives: number of identifiers of List1 annotated to the functional term.
  • List1 negatives: number of identifiers of List1 not annotated to the functional term.
  • List1 percentage: percentage of identifiers of List1 annotated to the functional term.
  • List2 positives: number of identifiers of List2 annotated to the functional term.
  • List2 negatives: number of identifiers of List2 not annotated to the functional term.
  • List2 percentage: percentage of identifiers of List2 annotated to the functional term.
  • List1 positive ids: list of identifiers in List1 annotated to this term.
  • List2 positive ids: list of identifiers in List2 annotated to this term.
  • Odds ratio (log e): the odds ratio is a way of comparing whether the probability of a certain event is the same for two groups. Its sign indicates which of the lists is enriched in this term: when the log e (odds ratio) is positive, List1 is enriched in the term; when it is negative, the enrichment is related to List2.
  • p-value: unadjusted p-value from the Fisher's exact test.
  • Adjusted p-value: adjusted p-value from the Fisher's exact test, after correcting for multiple testing using the FDR procedure of Benjamini & Hochberg. Use this p-value instead of the unadjusted one, because GO Enrichment performs a large number of tests simultaneously and the error rate must be corrected.
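The relation between the positives and negatives columns and the sign of the log odds ratio can be checked with a small sketch. The function name is hypothetical, and returning `None` when a cell is zero is an assumption made for this example; the page does not say how the tool handles that case.

```python
import math


def log_odds_ratio(l1_pos, l1_neg, l2_pos, l2_neg):
    """Natural log of the odds ratio of List1 versus List2 for one term.

    Returns None when any cell is zero, because the ratio is then
    zero, infinite or undefined (an assumption of this sketch).
    """
    if 0 in (l1_pos, l1_neg, l2_pos, l2_neg):
        return None
    odds_list1 = l1_pos / l1_neg
    odds_list2 = l2_pos / l2_neg
    return math.log(odds_list1 / odds_list2)


# Hypothetical counts: the term is more frequent in List1, so the value is positive
print(log_odds_ratio(8, 2, 1, 5))
# Swapping the lists flips the sign
print(log_odds_ratio(1, 5, 8, 2))
```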

Annotation files

A list of downloadable files containing the identifier - functional term correspondence. If a gene is annotated to several labels, each of the annotations is listed in a separate line. These lists can be used as an input annotation file in the GO Enrichment Your Annotations mode.
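A file with one identifier and term pair per line can be read back into a lookup table as sketched below. The exact layout of the annotation files is not specified here, so the tab separator, the skipping of `#` comment lines and the function name are all assumptions.

```python
from collections import defaultdict


def parse_annotations(text, sep="\t"):
    """Build a dict id -> set of terms from 'identifier<sep>term' lines."""
    id_to_terms = defaultdict(set)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        identifier, found, term = line.partition(sep)
        if not found:
            continue  # skip lines without a separator
        id_to_terms[identifier.strip()].add(term.strip())
    return dict(id_to_terms)


# gene1 is annotated to two terms, so it appears on two lines
example = "gene1\tT1\ngene1\tT2\n\ngene2\tT1\n"
print(parse_annotations(example))
```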

Other actions

  • Change p-value: the p-value threshold used to decide whether a functional term is significant is highlighted and can be modified to make it more stringent or more liberal. The default p-value is 0.05, but the user can change it and GO Enrichment will show the significant results according to the new p-value.