Tutorial Expression. Class comparison

Francisco García edited this page Mar 27, 2015 · 22 revisions

INPUT

STEPS

1. Select your data
2. Select the class to analyse
3. Select test
4. Choose multiple-test correction
5. Define a threshold for adjusted p-value
6. Fill information job
7. Press Launch job button

OUTPUT

WORKED EXAMPLES AND EXERCISES





INPUT

Input data

Input data should be a matrix uploaded with the data type Data matrix expression. See data types here.

Online example

Here you can load a small dataset from our server. You can use it to run this example and see how the tool works. Click on the link to load the data: correlation.txt.


STEPS

Select your data

The first step is to select the data you want to analyze.

Select the class to analyse
  • This variable describes the experimental design.
  • You can select all or some of the values of the class variable. If you do not want to use a particular value of the class variable, click none for that value.
Select test

Select the test that you want to perform:

  • One-class: limma.
  • Two-classes: t-test, limma, fold-change.
  • Multi-classes: ANOVA, limma.

See the Differential Expression section for detailed information about the methods.

See the Correlation section for detailed information about the methods.
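
Of the two-class options listed above, fold-change is the simplest to illustrate. The sketch below is not the Babelomics implementation. It assumes that the expression values are already on a log2 scale (as is usual after RMA normalization), so the log2 fold-change of a gene is the difference between the two group means. The function name and the toy values are hypothetical.

```python
from statistics import mean


def log2_fold_change(group_a, group_b):
    """Difference of mean log2 expression: group_b minus group_a."""
    return mean(group_b) - mean(group_a)


# Toy log2 expression values for a single gene (hypothetical numbers).
control = [7.0, 7.5, 6.5]
disease = [9.0, 8.5, 9.5]

lfc = log2_fold_change(control, disease)
print(lfc)       # log2 fold-change, disease vs. control
print(2 ** lfc)  # the same change expressed as a ratio on the original scale
```

A positive value means higher expression in the second group. Unlike the t-test or limma, fold-change alone takes no account of the variability within each group.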

Choose multiple-test correction
  • Several methods are implemented to adjust p-values for multiple statistical tests. This adjustment of significance is needed when many genes are tested in the same analysis.
  • You can select between FDR (False Discovery Rate), Holm, Hochberg, Bonferroni and BY (Benjamini and Yekutieli).
  • See the Differential Expression section for detailed information about the methods.
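
To make the idea of an adjusted p-value concrete, here is a minimal pure-Python sketch of three of the corrections named above. It is an illustration, not the code that Babelomics runs, and it assumes that the FDR option refers to the Benjamini-Hochberg procedure, as it does in many statistics packages. The function names are hypothetical.

```python
def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def holm(pvalues):
    """Holm step-down adjustment."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # ascending p-values
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        value = min(1.0, (m - rank) * pvalues[i])
        running_max = max(running_max, value)  # keep adjusted values monotone
        adjusted[i] = running_max
    return adjusted


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg step-up adjustment (assumed here to be 'FDR')."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for k, i in enumerate(order):
        rank = m - k  # rank of this p-value in ascending order
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Raw p-values for four genes (toy numbers).
raw = [0.01, 0.04, 0.03, 0.005]
print(bonferroni(raw))
print(holm(raw))
print(benjamini_hochberg(raw))
```

Each function returns the adjusted values in the same order as the input. On this toy input Bonferroni gives the largest adjusted values and Benjamini-Hochberg the smallest, with Holm in between.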
Define a threshold for adjusted p-value

You can choose an adjusted p-value threshold between 0 and 1.
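
As a rough illustration of what the threshold does, the sketch below keeps the genes whose adjusted p-value falls below the chosen cut-off. Whether Babelomics uses a strict or a non-strict comparison is not stated here, so the strict `<` is an assumption, and the function and gene names are hypothetical.

```python
def significant_genes(adjusted_pvalues, threshold):
    """Return the gene ids whose adjusted p-value is below the threshold."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")
    return [gene for gene, p in adjusted_pvalues.items() if p < threshold]


# Toy adjusted p-values (hypothetical genes and numbers).
adjusted = {"geneA": 0.001, "geneB": 0.03, "geneC": 0.2}

print(significant_genes(adjusted, 0.05))  # ['geneA', 'geneB']
print(significant_genes(adjusted, 0.25))  # ['geneA', 'geneB', 'geneC']
```

A lower threshold gives a shorter, more conservative list of significant genes.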

Fill information job
  • Select the output folder
  • Choose a job name
  • Specify a description for the job if desired.
Press Launch job button

Press the launch button and wait until the job is finished. A normal job usually takes a few minutes, but the time may vary depending on the size of the data. You can check the state of your job by clicking the jobs button at the top right of the panel menu. A box will appear on the right of the web browser with all your jobs. When the analysis is finished, you will see the label "Ready". Click on it and you will be redirected to the results page.


OUTPUT

Input parameters

In this section you will find a reminder of the parameters or settings you have used to run the analysis.

Output files

  • In the output file link you will find a text file with the results of the analysis for all genes. In general this file has a first column of gene identifiers followed by columns with the estimated statistics, their respective raw and corrected p-values (see the multiple testing section) and some other results. Since each statistical method reports different parameters, the exact layout of the results file depends on the method that you applied to your data.

  • The order of the genes in the results files is intended to be statistically meaningful according to the method used in the analysis, and as meaningful as possible for the biological interpretation of the results.

Significant results

  • List of genes and heatmap including only significant results.

  • In any analysis you run, we will provide a grid image representing your data. Each gene is represented in a row and each condition or array is represented in a column. High intensity measurements of gene expression are represented in red colors while blue colors represent lower measurements.

  • Genes are sorted according to their expression patterns in the same order as they are in the output file. Experimental conditions or arrays are ordered depending on their labels.

  • When studying differential expression under two or more conditions, arrays are sorted by class. The first class on the left is the one that appears first among the selected values of the class variable. The second class from the left is the second one to appear among the selected values, and so on.

  • Network viewer. Cell Maps visualization of the protein network of significant results. You can choose the number of significant UP- or DOWN-regulated genes to show in the Select number of nodes in the top (resp. bottom) list of the differential expression result box. Colored nodes represent the significant results, whereas uncolored nodes are those directly connected to them. You can choose different visualization options in the tool bar of the embedded application. For further information about how to use Cell Maps, visit the Cell Maps User Manual.

Continue processing

You can redirect the output data to other Babelomics tools to continue with your specific analysis pipeline. Specifically, you can:

  • Redirect files to the Single enrichment analysis. For more information about the Single enrichment tool please visit Single Enrichment Tool. For specific information about how to use the tool, see the Single Enrichment page of the tutorial. You can redirect the file of the most UP-regulated genetic features vs. the whole genome, the file of the most DOWN-regulated genetic features vs. the whole genome and the files of the most UP-regulated and DOWN-regulated genetic features.

  • Redirect the file with the statistics to the Gene set enrichment tool. For further information on the Gene set enrichment tool, see Gene Set Enrichment Tool. For specific information about how to use the tool, see the Gene Set Enrichment page of the tutorial.

  • Redirect files to the Network enrichment analysis. For more information about the Network enrichment tool please visit Network Enrichment. For specific information about how to use the tool, see the Network Enrichment (SNOW) page of the tutorial. You can redirect the file of the most UP-regulated genetic features vs. the whole genome, the file of the most DOWN-regulated genetic features vs. the whole genome and the files of the most UP-regulated and DOWN-regulated genetic features.

  • Redirect the file with the statistics to the Gene set network enrichment tool. For further information on the Gene set network enrichment tool, see Gene Set Network Enrichment. For specific information about how to use the tool, see the Functional Gene Set Network Enrichment page of the tutorial.

  • Redirect the truncated data matrix of the significant genetic features to the Clustering tool. For further information on the Clustering tool, see Clustering. For specific information about how to use the tool, see the Clustering page of the tutorial.


Worked examples and exercises

A. Worked Examples

Example 1. Rheumatoid Arthritis and Osteoarthritis Study
  • Download the Rheumatoid Arthritis data. Open the file with a text editor to see what it looks like. These data correspond to 15 samples from two conditions: disease and control.

  • The Affymetrix (GeneChip Human Genome U95A Array) platform was used for the hybridization. The data presented here have been normalized using the RMA methodology implemented in the Babelomics normalization tools.

  • The original data, including .CEL files and information about the samples, can be downloaded from GEO. They correspond to the series GEO GSE1919.

  • We want to analyze differential gene expression between disease and control. To do this kind of analysis you can use the two-class section of the Babelomics Differential Gene Expression module.

Two steps:

  1. Upload data. Data type is "Data matrix expression".

  2. Go to the Class comparison section of Babelomics / Expression / Differential Expression Microarrays and:

     • select your data: arthritis_rma.txt.
     • select the class to analyze: arthritis (this variable differentiates two groups: 0 is control, 1 is disease).
     • select test: t-test.
     • select multiple-test correction: FDR.
     • select adj. p-value: 0.005.
     • name your job and launch it.

This option performs, for each gene, a t-test for the difference in mean expression between the two groups of arrays. T-statistics and p-values are reported.

In the output file as well as in the image, genes are ranked according to the t-statistic. Genes at the top of the results list are those more expressed in group 0 (control). Genes at the bottom of the list are those more expressed in group 1 (disease).
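
The per-gene test and the ranking described above can be sketched in a few lines of pure Python. This is an illustration only: it computes Welch's form of the t-statistic (whether Babelomics uses the equal-variance or the unequal-variance form is not stated here), it does not compute p-values, and the gene names, values and helper functions are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def t_statistic(group0, group1):
    """Welch t-statistic for mean(group0) - mean(group1)."""
    se = sqrt(variance(group0) / len(group0) + variance(group1) / len(group1))
    return (mean(group0) - mean(group1)) / se


def split_by_class(values, classes):
    """Split one gene's values into the class 0 and class 1 arrays."""
    group0 = [v for v, c in zip(values, classes) if c == 0]
    group1 = [v for v, c in zip(values, classes) if c == 1]
    return group0, group1


# Toy log2 expression matrix: gene -> one value per array (hypothetical).
expression = {
    "geneA": [8.0, 8.2, 7.8, 5.0, 5.2, 4.8],
    "geneB": [6.0, 6.1, 5.9, 6.0, 6.2, 5.8],
    "geneC": [4.0, 4.2, 3.8, 7.0, 7.2, 6.8],
}
classes = [0, 0, 0, 1, 1, 1]  # 0 = control, 1 = disease

stats = {
    gene: t_statistic(*split_by_class(values, classes))
    for gene, values in expression.items()
}

# Largest positive statistic first: genes more expressed in group 0 on top.
ranked = sorted(stats, key=stats.get, reverse=True)
print(ranked)  # ['geneA', 'geneB', 'geneC']
```

With the statistic defined as group 0 minus group 1, a positive sign means higher expression in control and a negative sign means higher expression in disease, which matches the ordering described above.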


Example 2. Molecular Apocrine Breast Tumor

Download the Molecular Apocrine Breast Tumor data. These data correspond to 49 tumors from breast cancer patients. The tumors are classified into 3 classes: apocrine, basal and luminal.

The Affymetrix (GeneChip Human Genome U133 Array Set HG-U133A) platform was used for the hybridization. The data presented here have been normalized using the RMA methodology implemented in the Babelomics normalization tools.

The original data, including .CEL files and information about the samples, can be downloaded from GEO. They correspond to the series GEO GSE1561.

Imagine that we are now interested in finding the genes whose expression pattern is most heterogeneous across the three tumor classes. Such genes could be the ones involved in the processes that differentiate the tumors' behavior, and therefore be of clinical interest.

To do this kind of analysis you can use the Class Comparison section of the Babelomics Expression module.

If you upload the data to Babelomics and run Babelomics / Expression / Microarray / Class comparison with the anova method, you will get a graph like this:

As in the two-class analysis, rows of the grid represent genes and columns represent arrays. In the rightmost columns of the table you have the estimates of the F-statistic and the adjusted p-values (FDR).

Genes are ranked according to the significance of the differential expression among groups. Genes at the top of the table are those with the most differentiated pattern across tumors. Genes at the bottom of the table are those showing the least difference between tumors. If you open the results file you will see that genes are ranked in the same way as in the graph. In this case, the Babelomics Differential Gene Expression module arranges the genes from the most differentially expressed among the classes to those not differentially expressed.
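
For readers who want to see where the F-statistic comes from, the following pure-Python sketch computes a one-way ANOVA F-statistic per gene and ranks the genes by it. It is not the Babelomics implementation. The gene names and values are hypothetical, and ranking by F is used as a stand-in for ranking by significance, which gives the same order when every gene has the same number of arrays per class.

```python
from statistics import mean


def anova_f(groups):
    """One-way ANOVA F-statistic for a list of groups of values."""
    all_values = [v for group in groups for v in group]
    grand_mean = mean(all_values)
    k = len(groups)      # number of classes
    n = len(all_values)  # total number of arrays
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Toy log2 expression per gene, one list per class, in the order
# apocrine, basal, luminal (hypothetical numbers).
expression = {
    "geneX": [[5.0, 5.2, 4.8], [8.0, 8.2, 7.8], [2.0, 2.2, 1.8]],
    "geneY": [[6.0, 6.4, 5.6], [6.1, 6.5, 5.7], [5.9, 6.3, 5.5]],
}

f_stats = {gene: anova_f(groups) for gene, groups in expression.items()}

# Largest F first: genes with the most differentiated pattern on top.
ranked = sorted(f_stats, key=f_stats.get, reverse=True)
print(ranked)  # ['geneX', 'geneY']
```

A large F means that the differences between class means are large compared with the variability within each class, which is why such genes end up at the top of the table.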


B. Exercises

Exercise 1. Aged Muscle Dataset
  • Download the data from here. Open the file with a text editor to see what it looks like.

  • These data correspond to 24 samples of human tissue. They were taken from healthy men, aged from 56 to 76, before and after 3 months of physical training.

  • The Affymetrix (GeneChip Human Genome U133 Array Set HG-U133A) platform was used for the hybridization. The data presented here have been normalized using RMA as implemented in the Babelomics Affymetrix Normalization Tools.

  • The original data, including .CEL files and information about the samples, can be downloaded from the NCBI repository called Gene Expression Omnibus. They correspond to the series GSE1786.

Questions

1. Use the data of this example to compare the samples of sedentary men with the trained ones. Use the t-test method and see the results file.

  • Can you find any gene differentially expressed between the two groups?
  • What is the implication of the sign of the estimate of the t-statistic?
  • What is the relationship among the estimate of the statistic, the p-values and the adjusted p-values?

2. Use limma method to perform the same analysis. (Input parameters. Adj.p-value: 0.05 and Multiple-test correction: FDR)

  • Compare the ranking of the genes given by the different methods. Are they very different?
  • How many significant genes do you have?
  • Change the adj. p-value to 0.1, using "Other actions/Open input form" from the form. Now, how many significant genes do you have?
  • Redirect some of the results to FatiGO or Logistic models (the only parameters you need to select are the "Homo sapiens" label in the Organism box and some biological database, for example "GO Biological Process"). Don't worry! We will see these interesting functional tools in the next session.

3. Suppose that we are interested in finding genes whose expression is higher or lower in elderly men than in young men.

  • What tool do you have to use in Babelomics to evaluate the relationship between the expression of the genes and the age of the men?
  • Use different tests. How are the genes arranged in the heatmap?


Exercise 2. Molecular Apocrine Breast Tumor Dataset

This is the dataset that we used in worked example 2.

Questions

1. Previously we used the anova test to evaluate gene expression between experimental conditions. Repeat the analysis using the limma test. (Input parameters. Adj. p-value: 0.01, multiple-test correction: FDR)

  • How are the genes arranged in the heatmap?
  • How many significant genes do you have?
  • Can you indicate the ten most differentially expressed genes? What does it mean in our experimental context?

2. Now we are interested in comparing these conditions in pairs. (Input parameters: adj. p-value: 0.01, multiple-test correction: FDR)

  • Compare the basal tumors with the luminal ones. How many genes do you have?
  • Can you indicate the ten genes most under-expressed in luminal? And the gene most over-expressed in luminal?
  • Compare the basal tumors with the apocrine ones. How many genes do you have?



Go back to the Expression page
Go back to the Home page
Go back to the Worked examples for all tools page