Gene Set Enrichment

Francisco García edited this page Jul 6, 2015 · 25 revisions

Logistic model. Gene set methods are much more sensitive than single enrichment methods at detecting gene sets (sets of genes with a common annotation) that behave collectively in a genomic experiment. This method efficiently detects gene sets (functional annotations) that are consistently associated with high or low values in a ranked list of genes. For further information about these topics, see Gene Set Enrichment.


Input data

The input for the Gene Set Enrichment Analysis is a ranked list of genes, transcripts or proteins. If you have an unordered list of genes, we recommend using Single Enrichment instead.

Gene Set Enrichment Analysis can be applied to study the relationship between biological labels and any type of experiment whose outcome is a sorted list of genes. Genes sorted by differential expression between two experimental conditions can be studied, but so can genes correlated with a clinical variable (such as the level of a metabolite) or even with survival. Lists of genes ranked by any other experimental or theoretical criterion (e.g. genes arranged by physico-chemical properties, mutability, structural parameters, etc.) can also be studied in order to understand whether some biological feature (among the labels used) is related to the experimental parameter studied.

We propose using this procedure to scan ordered lists of genes and understand the biological processes operating behind them. It can be useful in situations where statistically significant differences cannot be obtained from the experimental measurements alone (low prevalence diseases, etc.).
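
As an illustration of how such a ranked list might be produced outside Babelomics, the following minimal sketch ranks genes by their Pearson correlation with a clinical variable. The gene names, the expression values and the metabolite levels are made up for the example, and the exact file format expected by the tool is not shown here.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: one metabolite level per sample, and the
# expression of each gene in the same samples.
metabolite = [1.0, 2.0, 3.0, 4.0, 5.0]
expression = {
    "geneA": [2.1, 3.9, 6.2, 8.1, 9.8],  # rises with the metabolite
    "geneB": [5.0, 4.1, 3.2, 1.9, 1.1],  # falls with the metabolite
    "geneC": [3.0, 3.4, 2.9, 3.3, 3.1],  # roughly flat
}

# Ranked list: (gene, statistic) pairs sorted from the highest to the lowest value.
ranked = sorted(
    ((gene, pearson(values, metabolite)) for gene, values in expression.items()),
    key=lambda pair: pair[1],
    reverse=True,
)

for gene, value in ranked:
    print(f"{gene}\t{value:.3f}")
```

Any other statistic (a t statistic, a fold change, an evolutionary rate) could take the place of the correlation, as long as it yields one value per gene that can be sorted.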

Steps

  1. Online examples. Here you can load small datasets from our server. You can use them to run examples and see how the tool works. Click on the links to load the data.

  2. Select your data. In this section of the form you can select the data you want to analyze. Only data which are tagged as Ranked ID list can be analyzed in Babelomics Gene Set Analysis Tools.

    Generally you will run a Gene Set Analysis on data that you have created with another Babelomics tool, for instance the Expression tools. Some of the files resulting from that previous analysis will be tagged as Ranked ID.

    Otherwise, you may upload any Ranked ID list you want using the Upload Menu in Babelomics, or you can follow the link "Or go to Upload Data form", which will take you to the Upload Menu.

    Gene Set GO Enrichment offers the option of checking for and removing duplicates in the ranked list.

  3. Method. The logistic model method is used by default.

  4. Databases. Here you can choose which databases are to be used in your gene set analysis. After selecting the organism you are studying, you will be able to choose among those databases that are available in Babelomics for it. All databases can be filtered before the analysis following the options link.

    • Organism: select your organism of interest from a wide range of species.

    • Databases included in Babelomics: databases (DB) available for the organism selected can be tested simultaneously and a specific filter option is provided for each one:

      • GO - biological process:

        Filter terms by number of annotated ids in DB.

        Enter a minimum and a maximum number of annotated ids per gene set to filter out terms that are too small or too large.

      • GO - molecular function. You can define a filter with the same parameters as described above in the GO - biological process section.

      • GO - cellular component. You can define a filter with the same parameters as described above in the GO - biological process section.

      • GOSlim GOA. You can define a filter with the same parameters as described above in the GO - biological process section.

      • Interpro motifs. You can define a filter with the same parameters as described above in the GO - biological process section.

      • Genome-Scale Metabolic Network (Recon). You can define a filter with the same parameters as described above in the GO - biological process section.

    • Your annotations. This is a useful option for users who work on species that our database does not support or who want to use their own annotation.

  5. Job information. Fill in the information for this job: give the job a name and choose the folder where the resulting files should be saved.

    • Select the output folder.
    • Choose a job name.
    • Specify a description for the job if desired.
  6. Launch job. Press the Launch job button and wait until the analysis is finished. A typical job takes less than two minutes, but the time may vary depending on the size of the list. Check the state of your job by clicking the jobs button at the top right of the panel menu. A box listing all your jobs will appear at the right of the web browser. When the analysis is finished, the job is labeled "Ready". Click on it to be redirected to the results page.
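
The duplicate check in step 2 and the term size filter in step 4 can be pictured with the following minimal sketch. It is an illustration only: the function names and data are hypothetical, keeping the first occurrence of a duplicated id is an assumption about the duplicate policy, and so is treating the minimum and maximum as inclusive bounds.

```python
def remove_duplicates(ranked):
    """Keep the first occurrence of each id, preserving the ranking order.

    Assumption: the first (highest ranked) occurrence is the one to keep.
    """
    seen = set()
    unique = []
    for gene_id, value in ranked:
        if gene_id not in seen:
            seen.add(gene_id)
            unique.append((gene_id, value))
    return unique


def filter_by_size(annotation, min_size, max_size):
    """Keep terms whose number of annotated ids lies in [min_size, max_size]."""
    return {
        term: ids
        for term, ids in annotation.items()
        if min_size <= len(ids) <= max_size
    }


# Hypothetical ranked list with one duplicated id ("g1").
ranked = [("g1", 3.2), ("g2", 2.5), ("g1", 2.4), ("g3", -0.7), ("g4", -1.9)]

# Hypothetical annotation: term -> set of annotated ids.
annotation = {
    "term_small": {"g1"},
    "term_medium": {"g1", "g2", "g3"},
    "term_large": {"g1", "g2", "g3", "g4", "g5", "g6"},
}

clean_ranked = remove_duplicates(ranked)
filtered = filter_by_size(annotation, min_size=2, max_size=5)

print(clean_ranked)
print(sorted(filtered))
```

Filtering out very small terms avoids testing gene sets with too few genes to show a trend, while filtering out very large ones removes terms that are too general to be informative.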

Interpreting the output results

From a systems biology perspective, simple functional enrichment analysis is far from efficient for understanding the molecular basis of a genome-scale experiment. Methods inspired by systems biology can use lists of genes ranked by any biological criterion (e.g. differential expression when comparing cases and healthy controls, genes with different evolutionary rates, etc.) and directly search for the distribution of blocks of functionally related genes across the list without imposing any artificial threshold. Any macroscopic observation that gives rise to this ranked list of genes will be the consequence of the cooperative action of genes arranged into functional classes, pathways, etc.

Each functional class responsible for the macroscopic observation will, consequently, be found at the extremes of the ranking with the highest probability. This perspective avoids imposing a threshold on the rank values that ignores the cooperation among genes. Methods inspired by systems biology directly search for groups of functionally related genes that accumulate significantly at the extremes of these ranked lists.

Gene set enrichment implements a segmentation test that checks for asymmetrical distributions of biological labels associated with genes ranked in a list. Gene set enrichment works as follows:

  • First, a list of genes is ordered using experimental information on their differential expression, according to the phenotype studied in the experiment, or using another type of value (e.g. large-scale genotyping, evolutionary analysis, etc.). For example, genes can be ordered on the basis of their differential expression between two experimental conditions (e.g. pre and post drug administration, healthy versus diseased samples, etc.).
  • The second step involves the use of a logistic model to find the association of each functional block with the high or low values of the ranked list.
  • Finally, a table with the significant terms obtained upon the application of the test can be used to detect significant asymmetrical distributions of genes, responsible for diverse biological processes, across the list.
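
The second step can be pictured with the minimal sketch below, which fits a univariate logistic regression of gene set membership on the ranking statistic and reads the slope as a log odds ratio. This illustrates the general idea only, under stated assumptions: the data are made up, the fitting routine (plain Newton-Raphson) and all function names are hypothetical, and the significance test and multiple testing adjustment are left out. The implementation in Babelomics may differ in its details.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(x, y, max_iter=50, tol=1e-10):
    """Fit P(y = 1) = sigmoid(a + b * x) by Newton-Raphson.

    Returns (a, b); b is the change in log odds per unit of x.
    """
    a = b = 0.0
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = sigmoid(a + b * xi)
            w = p * (1.0 - p)
            g0 += yi - p            # gradient with respect to a
            g1 += (yi - p) * xi     # gradient with respect to b
            h00 += w                # entries of the information matrix
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if det == 0.0:
            break  # degenerate case (e.g. perfect separation)
        step_a = (h11 * g0 - h01 * g1) / det
        step_b = (h00 * g1 - h01 * g0) / det
        a += step_a
        b += step_b
        if abs(step_a) + abs(step_b) < tol:
            break
    return a, b


def log_odds_ratio(ranked, gene_set):
    """Slope of the logistic regression of set membership on the statistic."""
    x = [value for _, value in ranked]
    y = [1.0 if gene in gene_set else 0.0 for gene, _ in ranked]
    _, slope = fit_logistic(x, y)
    return slope


# Hypothetical ranked list (gene, statistic) and two hypothetical terms.
ranked = [("g1", 3.2), ("g2", 2.5), ("g3", 1.9), ("g4", 0.8), ("g5", 0.1),
          ("g6", -0.4), ("g7", -1.1), ("g8", -1.8), ("g9", -2.6), ("g10", -3.0)]
top_term = {"g1", "g2", "g4", "g7"}       # mostly at the top of the list
bottom_term = {"g4", "g7", "g9", "g10"}   # mostly at the bottom of the list

lor_top = log_odds_ratio(ranked, top_term)
lor_bottom = log_odds_ratio(ranked, bottom_term)
print(f"top_term LOR = {lor_top:.3f}, bottom_term LOR = {lor_bottom:.3f}")
```

A positive slope means that membership in the term becomes more likely as the statistic increases, and a negative slope means the opposite. No cut-off on the ranked list is needed at any point.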

The results page contains the following sections:

  1. Job information. A short description of your analysis.

  2. Input data. This section is a reminder of the parameters and settings you used to run the analysis. You can also download the ranked list of genes, the sorted Id list and the statistic list as text files.

  3. Summary. A two-column table showing the number of genes annotated to each database in each list. This table can be downloaded as a text file.

  4. Significant results. If a database has significant terms (depending on the p-value selected), this section provides a downloadable text file and a table, both containing information about the significant terms, and, for GO databases, a representation of the DAG (GO network section).

    • Number of significant terms per DB: a two-column table showing the number of significant terms found by the functional analysis.

      • DB: the name of the database analysed. Only the databases with at least one gene annotated in one of the lists (above and below the partition point) are analysed.
      • Total number of significant terms. Highlighted in red if the test had positive results.
    • Summary of significant terms: a seven-column table showing the significant terms found by the functional analysis. The table can be sorted in ascending or descending order by clicking on the column headings. The total number of significant terms is shown at the bottom left, and the controls at the bottom right let you move between the pages into which the table has been split.

      • Term: term name or identifier.
      • Term size: number of genes in each particular list.
      • Term size (in genome): number of genes in the whole genome.
      • Annotated genes: genes annotated to the functional term in each list.
      • Converged ids
      • LOR: log odds ratio. When comparing two groups or experimental conditions, LOR > 0 indicates that a functional term is over-represented among genes up-expressed in the first condition (genes down-expressed in the second condition), and LOR < 0 indicates that a functional term is over-represented among genes down-expressed in the first condition (genes up-expressed in the second condition).

      • Adjusted p-value: adjusted p-value from the logistic regression.

    • GO network. A representation of the Directed Acyclic Graph (DAG) in GO databases for the significant GO terms. Colored nodes represent the significant results (red for GOs overrepresented and blue for GOs underrepresented in list 1), whereas white nodes represent the parents of the significant GOs. A more intense color indicates a smaller adjusted p-value (higher statistical significance). You can choose different visualization options in the tool bar of the embedded application. For further information about how to use Network Viewer Maps, visit the Network viewer documentation.

  5. All results. An individual downloadable text file for each database provides the results for all terms. The file has the same structure as the significant results file.

  6. Annotation files. A list of downloadable files containing the gene-functional term correspondence. If a gene is annotated to several labels, each annotation is listed on a separate line. If the submitted list of genes (List 1, above the partition point, or List 2, below the partition point) contains genes annotated in a particular database, an icon to download the file is shown; otherwise the text No references found is displayed. These files can be used as input annotation files in the Your Annotations mode of Gene Set GO enrichment.

  7. Other actions

    • Change p-value: the p-value used to decide whether a functional term is significant is highlighted and can be modified to make the test more stringent or more liberal. The default p-value is 0.05, but the user can change it and Gene Set Enrichment will show the significant results according to the new p-value.
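
The effect of changing the p-value, together with the reading of the LOR sign described above, can be sketched as follows for a results table loaded from the All results file. The rows, the function name and the direction labels are hypothetical, and a strict "less than" comparison against the threshold is an assumption.

```python
def significant_terms(results, threshold=0.05):
    """Select terms with adjusted p-value below the threshold.

    Each row is (term, lor, adjusted_p). Returns (term, direction) pairs,
    where the direction follows the sign of the LOR.
    """
    selected = []
    for term, lor, adjusted_p in results:
        if adjusted_p < threshold:
            if lor > 0:
                direction = "up in first condition"
            else:
                direction = "up in second condition"
            selected.append((term, direction))
    return selected


# Hypothetical rows: (term, LOR, adjusted p-value).
results = [
    ("term_a", 1.2, 0.001),
    ("term_b", -0.8, 0.030),
    ("term_c", 0.3, 0.200),
]

default_hits = significant_terms(results)             # default p-value of 0.05
stringent_hits = significant_terms(results, 0.01)     # more stringent p-value

print(default_hits)
print(stringent_hits)
```

Lowering the threshold can only remove terms from the significant set, never add them, so a more stringent p-value always returns a subset of the default results.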

Worked examples and exercises

We downloaded a microarray experiment from GEO (GDS715). It describes a set of Acute Myeloid Leukemia (AML) samples treated with a panel of compounds that induce, with varying success, their differentiation to mature cells. The gene expression data of each AML sample treated with a compound were compared to the expression data of the negative controls: AML cells and AML cells treated with compounds that do not alter gene expression. To compare both conditions, we applied a Student t-test to every pair of classes: AML+compound and control.

The output of Expression Differential Analysis is a set of lists of genes, sorted by the t statistic or, in other words, by their importance in the difference between the compound action and the AML status. We then wanted to functionally annotate these lists using Gene Set Enrichment.
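
For readers who want to see how such a list arises, the following minimal sketch computes a two-sample Student t statistic (pooled variance) for each gene and sorts the genes by it. The gene names and expression values are invented for illustration and are not taken from GDS715, and the Babelomics Expression tools may compute the statistic differently.

```python
import math
from statistics import mean, variance


def student_t(a, b):
    """Two-sample Student t statistic with pooled variance."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled * (1.0 / na + 1.0 / nb))


# Hypothetical expression values: gene -> (AML + compound, control).
expression = {
    "gene_up": ([8.1, 7.9, 8.3], [5.0, 5.2, 4.9]),
    "gene_flat": ([4.0, 4.2, 4.1], [4.1, 4.0, 4.2]),
    "gene_down": ([3.0, 3.2, 2.9], [6.1, 5.8, 6.0]),
}

# Ranked list sorted from the highest to the lowest t statistic.
ranked_genes = sorted(
    ((gene, student_t(treated, control))
     for gene, (treated, control) in expression.items()),
    key=lambda pair: pair[1],
    reverse=True,
)

for gene, t_value in ranked_genes:
    print(f"{gene}\t{t_value:.3f}")
```

Genes at the top of the list are more expressed in the treated samples and genes at the bottom are more expressed in the controls, which is the kind of ordering that Gene Set Enrichment takes as input.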

Examples of Gene Set Enrichment input list files:

  1. AML + sulmazole
  2. AML + fluorouridine
  3. AML + phenanthroline

For each input list file:

  • Create a new project (e.g. AML_sulmazole) and start a new Gene Set Enrichment Analysis job in the Functional analysis >> Gene Set Enrichment section of the tools.
  • Upload the input file
  • Choose the organism
  • In the Databases section, check GO - biological process
  • Give a name to the new job (e.g. example1_GSE_analysis)
  • Leave the rest of the parameters at their default values
  • Submit the job (press the launch button)

Repeat this analysis for each input file, selecting different biological databases.


Citing

  • Montaner D, Dopazo J (2010). Multidimensional gene set analysis of genomic data. PLoS One 5(4): e10348. doi: 10.1371/journal.pone.0010348
  • Al-Shahrour F, Arbiza L, Dopazo H, Huerta J, Minguez P, Montaner D, Dopazo J (2007). From genes to functional classes in the study of biological systems. BMC Bioinformatics 8: 114
  • Al-Shahrour F, Díaz-Uriarte R, Dopazo J (2005). Discovering molecular functions significantly related to phenotypes by combining gene expression data and biological information. Bioinformatics 21: 2988-2993
