Differential Expression for arrays

Francisco García edited this page Jan 30, 2015 · 7 revisions

Introduction

Babelomics' tools for differential gene expression analysis can be found under the Differential expression button of the Tools drop-down menu. There are three different experimental contexts that you can explore using Babelomics:

  • You can find genes that are differentially expressed within a single class or between two or more classes:

    • You can detect the most differentiated expression patterns among several two-color arrays of the same class (one-class).
    • You can study differential gene expression between two conditions. For example, when comparing tumor and healthy cells you can find which genes are over-expressed or under-expressed in the tumor compared to the healthy cells.
    • You can also analyze differential expression among more than two conditions. For instance, if you are studying three different types of tumor cells, you can find the genes whose expression patterns differ most among them.
  • You can also analyze differential gene expression related to a continuous variable. For example, if you treat cells with different doses of a drug and measure their gene expression levels, you can find genes whose expression increases or decreases with the level of treatment.

  • You can also explore gene expression related to survival time. For example, you can study which genes are most directly related to the death of your cells by analyzing the relationship between gene expression and the survival time of the cells.

There are many different statistical approaches to the aforementioned scenarios. Babelomics implements some classical methodologies as well as some of the newer algorithms that are known to have enhanced performance in dealing with gene expression data.

Estimates for statistics and p-values are used to order genes in terms of their pattern of differential expression. When it is desirable, p-values adjusted for multiple testing are provided.
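As an illustration of what a multiple-testing adjustment does, the sketch below applies the Benjamini-Hochberg procedure (one of the methods listed in the References) to a handful of raw p-values. It is a minimal sketch, not Babelomics' own code: the function name `benjamini_hochberg` and the example p-values are assumptions made for this example, and the adjustment method that Babelomics applies in a given analysis may differ.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for five genes.
raw = [0.01, 0.04, 0.03, 0.20, 0.002]
adjusted = benjamini_hochberg(raw)
```

Each adjusted value is at least as large as the raw p-value it comes from, so fewer genes pass a given significance threshold after adjustment.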

Your data must be pre-processed and normalized before you analyze them with the differential gene expression module in Babelomics. For some microarray platforms you can do this with Babelomics' normalization and pre-processing tools; for others you will need to use external software and then upload the preprocessed data to Babelomics.

In general, missing values are not allowed in your data. See Babelomics' Preprocessing tool for ways to deal with missing data and for other data handling options.

Statistical methods

As noted above, the Babelomics Differential Gene Expression module distinguishes three conceptually different testing cases:

1. Class comparison

1.1. One-class

limma This option detects genes whose expression is significantly different from 0 across several two-color arrays in the same experimental condition or class. Limma is a package for the analysis of gene expression microarray data, especially the use of linear models for analysing designed experiments and the assessment of differential expression. Rather than estimating the variability of each gene in isolation, limma uses an empirical Bayes approach that borrows information across genes (Smyth, 2004). [More information about limma package](http://www.bioconductor.org/packages/release/bioc/html/limma.html).

1.2. Two-classes

The purpose of this set of tests is to study differential expression between two groups or classes of arrays. Only two classes are allowed in this set of tests. If you want to analyze more than two classes, go to the multi-class set of tests.

For each test, genes are ranked from most expressed in the first class (the first value of the class variable in the form) to most expressed in the second class (the second and last value of your class variable), passing through genes that are not differentially expressed. You might find this ordering useful in further studies, for example when looking for sets of functionally related genes that are also differentially expressed using functional tools such as Single Enrichment or Gene Set Enrichment.

t-test This option performs, for each gene, a t-test for the difference in mean expression between the two groups of arrays. T-statistics and p-values are reported. In the output file as well as in the image, genes are ranked according to the t-statistic. Genes at the top of the results list are those more expressed in your first class. Genes at the bottom of the list are those more expressed in the second class. More information
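The ranking described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the Babelomics implementation: it uses Student's pooled-variance form of the t-statistic (the source does not say which variant is applied), the gene names and expression values are invented, and p-values are left out because the Python standard library has no t-distribution.

```python
from math import sqrt
from statistics import mean, variance


def t_statistic(group1, group2):
    """Student's two-sample t-statistic with pooled variance (assumed variant)."""
    n1, n2 = len(group1), len(group2)
    pooled_var = (
        (n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)
    ) / (n1 + n2 - 2)
    return (mean(group1) - mean(group2)) / sqrt(pooled_var * (1 / n1 + 1 / n2))


# Hypothetical normalized expression values: gene -> (first class, second class).
expression = {
    "geneA": ([8.1, 8.4, 8.3], [6.0, 6.2, 5.9]),
    "geneB": ([5.0, 5.2, 4.9], [5.1, 5.0, 5.2]),
    "geneC": ([3.1, 2.9, 3.0], [4.8, 5.1, 5.0]),
}

t_stats = {gene: t_statistic(g1, g2) for gene, (g1, g2) in expression.items()}
# Largest positive t first (more expressed in the first class),
# most negative t last (more expressed in the second class).
ranked = sorted(t_stats, key=t_stats.get, reverse=True)
```

In this toy data set the gene with similar means in both classes ends up in the middle of the list, which matches the "passing through not differentially expressed" ordering.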

limma Limma results are interpreted in the same way as t-test results. Limma is a package for the analysis of gene expression microarray data, especially the use of linear models for analysing designed experiments and the assessment of differential expression. Unlike the classical t-test, limma estimates the variability of the data with an empirical Bayes approach that borrows information across genes (Smyth, 2004). [More information about limma package](http://www.bioconductor.org/packages/release/bioc/html/limma.html).

fold-change Fold-change analysis is used to identify genes whose expression ratios or differences between two classes fall outside a given cutoff or threshold. If your normalized data include a logarithmic transformation, you should calculate the fold-change as the difference between the means of the two classes. Otherwise, you can calculate the fold-change as the log2 of the ratio between the means of the two classes.
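The two ways of computing the fold-change can be sketched as below. The function names, the intensity values, and the cutoff of 1.0 are assumptions chosen for illustration only.

```python
from math import log2
from statistics import mean


def fold_change_log_data(class1, class2):
    """Data already log-transformed: difference between the class means."""
    return mean(class1) - mean(class2)


def fold_change_raw_data(class1, class2):
    """Data on the original scale: log2 of the ratio of the class means."""
    return log2(mean(class1) / mean(class2))


# Hypothetical intensities for one gene on the original scale.
raw_class1 = [400.0, 400.0]
raw_class2 = [100.0, 100.0]
# The same measurements after a log2 transformation.
log_class1 = [log2(v) for v in raw_class1]
log_class2 = [log2(v) for v in raw_class2]

fc_raw = fold_change_raw_data(raw_class1, raw_class2)
fc_log = fold_change_log_data(log_class1, log_class2)

cutoff = 1.0  # assumed threshold: at least a two-fold change
selected = abs(fc_log) >= cutoff
```

Both calculations agree here because the values within each class are identical. With real replicates the mean of the logs and the log of the mean differ slightly, so the two formulas are not interchangeable on the same data.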

1.3. More than two classes

The purpose of this set of tests is to study differential expression among more than two groups or classes of arrays. If you want to analyze just two classes, go to the two-class set of tests.

While the mathematical treatment of this kind of data is similar to that of two classes, our tools treat the case of more than two classes separately because of its conceptual implications. With two classes, genes can be ranked from most expressed in the first class to most expressed in the second one, passing through non-differentially expressed genes. Such a ranking has a straightforward biological interpretation as well as many advantages for further analyses. However, when more than two classes describe the data, ranking genes by their differential expression is not straightforward, and the results need to be interpreted more cautiously.

ANOVA For each gene, this option performs a classical analysis of variance to test for mean differences between the array groups defined by the class labels. In the output file as well as in the image, genes are ranked by their estimated p-values, from most differentiated between groups to most similar among them. More information.
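The statistic behind a classical one-way analysis of variance can be sketched as follows. This is an illustrative sketch with an assumed function name and invented data. It computes only the F-statistic; converting it to a p-value requires the F distribution, which is not part of the Python standard library.

```python
from statistics import mean


def anova_f(groups):
    """One-way ANOVA F-statistic for a list of groups of expression values."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the values around their own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical expression values for two genes measured in three classes.
genes = {
    "geneA": [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]],
    "geneB": [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0], [1.0, 2.0, 3.0]],
}
f_stats = {gene: anova_f(groups) for gene, groups in genes.items()}
ranked = sorted(f_stats, key=f_stats.get, reverse=True)
```

When every gene is measured on the same arrays, the degrees of freedom are identical across genes, so sorting by decreasing F gives the same order as sorting by increasing p-value.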

limma Limma results are interpreted in the same way as ANOVA results. Limma is a package for the analysis of gene expression microarray data, especially the use of linear models for analysing designed experiments and the assessment of differential expression. Unlike the classical ANOVA, limma estimates the variability of the data with an empirical Bayes approach that borrows information across genes (Smyth, 2004). [More information about limma package](http://www.bioconductor.org/packages/release/bioc/html/limma.html).

2. Correlation

The purpose of this section is to study gene expression related to a continuous independent variable. The statistical methods implemented here allow you to find genes whose expression depends on a continuous variable such as the level of a metabolite.

In the output file as well as in the image, genes are arranged first according to the sign of the correlation coefficient (or the slope in the case of regression analysis) and second by p-values. Genes at the top of the results list are those with the strongest positive linear dependence on the explanatory variable. Genes at the bottom of the list are those with negative linear dependence on the explanatory variable.

Pearson's test. Pearson's correlation coefficient is computed between the intensities measured for each gene and the values of the independent variable. P-values to test for the null hypothesis that the correlation is zero are provided. [More information](http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient).

Spearman's test. Calculation of the Spearman rank order correlation coefficient is performed. P-values to test for the null hypothesis that the correlation is zero are provided. [More information](http://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient).
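The difference between the two coefficients can be seen in a small sketch. The function names and the dose/expression values below are assumptions for illustration, and p-values are omitted. Spearman's coefficient is computed here as Pearson's coefficient applied to ranks, with tied values sharing their average rank.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson product-moment correlation coefficient."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical drug doses and the expression of one gene at each dose.
dose = [1.0, 2.0, 3.0, 4.0, 5.0]
expression = [1.0, 4.0, 9.0, 16.0, 25.0]

r_pearson = pearson(dose, expression)
r_spearman = spearman(dose, expression)
```

Because the relationship in this example is monotone but not linear, Spearman's coefficient reaches 1 while Pearson's coefficient stays below it.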

Regression. A regression analysis is performed for each gene; measured intensities being regressed on the independent variable. Estimates for the intercept and slope are provided together with the statistic and p-value to test for the null hypothesis that the slope is zero. [More information](http://en.wikipedia.org/wiki/Linear_regression).
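A per-gene regression of this kind can be sketched with ordinary least squares. The function name and the data are assumptions for illustration. The sketch returns the intercept, the slope, and the t-statistic for the null hypothesis that the slope is zero; the p-value is omitted because it requires the t-distribution.

```python
from math import sqrt
from statistics import mean


def simple_regression(x, y):
    """Least-squares fit y = intercept + slope * x.

    Returns (intercept, slope, t_statistic_of_slope).
    """
    n = len(x)
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    intercept = my - slope * mx
    residual_ss = sum((b - intercept - slope * a) ** 2 for a, b in zip(x, y))
    # Standard error of the slope, using n - 2 residual degrees of freedom.
    slope_se = sqrt(residual_ss / (n - 2) / sxx)
    return intercept, slope, slope / slope_se


# Hypothetical values of the independent variable and one gene's intensities.
variable = [0.0, 1.0, 2.0, 3.0]
intensity = [1.1, 2.9, 5.2, 6.8]

intercept, slope, t_stat = simple_regression(variable, intensity)
```

A positive slope places the gene in the upper part of the results list and a negative slope in the lower part, following the ordering described for this section.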

3. Survival Analysis

The purpose of this tool is to study gene expression as an explanatory variable for the survival times associated with each array. A Cox proportional hazards regression model is fitted for each gene. We report estimates of the regression coefficient modeling the hazard function (on the log scale). We also provide p-values to test the null hypothesis that the coefficient is zero. [More information](http://en.wikipedia.org/wiki/Survival_analysis).

In the results file and in the image, genes are ranked first according to the sign of the coefficient and second by p-values. Genes at the top of this arrangement are those for which high expression is associated with higher values of the hazard function (shorter survival times). Genes at the bottom are those for which high expression is associated with lower values of the hazard function (longer survival times).
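To make the sign convention concrete, the sketch below fits a Cox model with a single covariate by Newton-Raphson on the partial likelihood. It is a simplified illustration, not the code Babelomics runs: the function name, the iteration settings, and the data are assumptions, risk sets are formed in the Breslow style, and no standard error or p-value is computed.

```python
from math import exp


def cox_coefficient(times, events, x, iterations=50, tol=1e-10):
    """Newton-Raphson estimate of the coefficient of a one-covariate Cox model.

    times  -- survival or censoring time of each array
    events -- 1 if the event was observed, 0 if the array is censored
    x      -- expression of one gene in each array
    """
    beta = 0.0
    for _ in range(iterations):
        score, information = 0.0, 0.0
        for t_i, d_i, x_i in zip(times, events, x):
            if not d_i:
                continue  # censored arrays only contribute through risk sets
            # Risk set: every array still under observation at time t_i.
            risk = [x_j for t_j, x_j in zip(times, x) if t_j >= t_i]
            weights = [exp(beta * x_j) for x_j in risk]
            s0 = sum(weights)
            s1 = sum(w * x_j for w, x_j in zip(weights, risk))
            s2 = sum(w * x_j * x_j for w, x_j in zip(weights, risk))
            score += x_i - s1 / s0
            information += s2 / s0 - (s1 / s0) ** 2
        step = score / information
        beta += step
        if abs(step) < tol:
            break
    return beta


# Hypothetical data: arrays with high expression tend to fail early.
times = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
events = [1, 1, 0, 1, 1, 1]
expression = [2.0, 3.0, 1.0, 2.5, 0.5, 1.5]

beta = cox_coefficient(times, events, expression)
```

In this toy example the arrays with high expression tend to have the shortest survival times, so the estimated coefficient is positive and the gene would appear in the upper part of the ranking.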

References

  • Bolstad, B, Irizarry, R, Astrand, M, & Speed, T. (2003). A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics, 19(2), 185-193. doi: 10.1093/bioinformatics/19.2.185.

  • Benjamini, Y, & Hochberg, Y. (1995). Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society. Series B (Methodological), 57(1), 289-300. doi: 10.2307/2346101.

  • Benjamini, Y, & Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency. The Annals of Statistics. Volume 29, Number 4, 1165-1188.

  • Smyth GK. (2004). Linear Models and Empirical Bayes Methods for Assessing Differential Expression in Microarray Experiments. Statistical Applications in Genetics and Molecular Biology, 3(1). doi: 10.2202/1544-6115.1027.

  • http://www.bioconductor.org/
