Class prediction
Pages 113
- Home
- Affymetrix
- affymetrix_expression_normalization_with_apt
- Agilent
- Association Analysis
- Association Analysis doc
- Babelomics version
- Babelomics web structure
- Burden test
- Cancer
- CDF
- Changes in this version
- Class comparison. Worked examples and exercises
- Class prediction
- Class prediction. Worked examples and exercises
- Clustering
- Clustering. Worked examples and exercises
- Cross hybridization
- data matrix expression
- Data types
- Define your comparison
- Detailed example of analysis of expression data in Babelomics: from raw data to expression differential and functional profiling
- Differential Expression for arrays
- Differential Expression for RNA Seq
- Dye bias
- Edit
- Edit your data
- example data
- Expression
- Expression array pipeline
- FAQ
- Functional
- Functional Gene Set Network Enrichment
- Functional GO Enrichment
- GAL
- Gene Set Enrichment
- Gene Set Network Enrichment (Network Miner)
- Gene vs annotation
- Genepix
- Genomics
- Genomics doc
- How to cite babelomics
- Id
- Logging in
- Main areas. Cancer
- Main areas. Expression
- Main areas. Functional
- Main areas. Genomics
- Main areas. Processing
- Main areas: Cancer
- Main areas: Expression
- Main areas: Functional
- Main areas: Genomics
- Main areas: Processing
- Network Enrichment (SNOW)
- Other biological data
- Overview and pipelines
- p values adjusted for multiple testing
- PED
- PED_MAP zipped
- Pipelines
- plink.assoc
- plink.assoc.linear
- plink.assoc.logistic
- plink.fisher
- plink.hh
- plink.log
- plink.tdt
- Preprocessing for data matrix
- Preprocessing for microarrays
- Preprocessing for RNA Seq
- Processing
- Ranked
- Requirements
- RNA Seq Normalization
- RNA Seq pipeline
- SDK (Software Development Kit)
- Single Enrichment
- Single Enrichment. Options
- SNPs array pipeline
- Software and databases used
- Technical Info
- The Babelomics Team
- tut_SNP_association
- Tutorial
- Tutorial Affymetrix Expression Microarray Normalization
- Tutorial Agilent One Color Microarray Normalization
- Tutorial Agilent Two Colors Microarray Normalization
- Tutorial Burden test
- Tutorial Class prediction
- Tutorial Clustering
- Tutorial Data matrix preprocessing
- Tutorial Differential Expression for arrays
- Tutorial Differential Expression for RNA Seq
- Tutorial Expression
- Tutorial Expression. Class comparison
- Tutorial Expression. Correlation
- Tutorial Expression. Survival
- Tutorial Functional
- Tutorial Genepix One Color Microarray Normalization
- Tutorial Genepix Two Colors Microarray Normalization
- Tutorial Genomics
- Tutorial OncodriveClust
- Tutorial OncodriveFM
- Tutorial Processing
- Tutorial SNP Association Analysis
- Tutorial SNP stratification
- Upload your data
- VCF 4.0
- VCF file pipeline
- Visualization tools
- Worked examples
- Workflow
- Show 98 more pages…
General
Tutorial
Analysis tools
Worked examples
-
Expression
-
Functional
Clone this wiki locally
Introduction
Introduction and purpose
In the last years the use of microarrays as predictors of clinical outcomes (van 't Veer et al., 2002), despite not being free of criticisms (Simon, 2005), has fuelled the use of the methodology because of its practical implications in biomedicine. Many other fields, such as agriculture, toxicology, etc are now using this methodology for prognostic and diagnostic purposes.
Predictors are used to assign a new data (expression data in this case) to a specific class (e.g. diseased case or healthy control) based on a rule constructed with a previous dataset containing the classes among which we aim to discriminate. This dataset is usually known as the training set. The rationale under this strategy is the following: if the differences between the classes (our macroscopic observations, e.g. cancer versus healthy cases) is a consequence of certain differences an gene level, and these differences can be measured as differences in the level of gene expression, then it is (in theory) possible finding these gene expression differences and use them to assign the class membership for a new array. This is not always easy, but can be aimed. There are different mathematical methods and operative strategies that can be used for this purpose.
This Babelomics module is a web interface to help in the process of building a "good predictor". We have implemented several widely accepted strategies so as this tool can build up simple, yet powerful predictors, along with a carefully designed cross-validation of the whole process (in order to avoid the widespread problem of "selection bias").
Babelomics allows combining several classification algorithms with different methods for gene selection.
How to build a predictor
A predictor is a mathematical tool that is able to use a data set composed by different classes of objects (here microarrays) and “learn” to distinguish between these classes. There are different methods that can do that (see below). The most important aspect in this learning is the evaluation of the performance of the classifier. This is usually carried out by means of a procedure called cross-validation. The original dataset is randomly divided into several parts (in this case three, which would correspond to the case of three-fold cross validation). Each part must contain a fair representation of the classes to be learned. Then, one of the parts is set aside (the test set) and the rest of the parts (the training set) are used to train the classifier. Then, the efficiency of the classifier is checked by using the corresponding test set which has not been used for the training of the classifier. This process is repeated as many times as the number of partitions performed and finally, an average of the efficiency of classification is obtained.
Methods
We have included in the program several methods that have been shown to perform very well with microarray data (Dudoit et al., 2002; Romualdi et al., 2003; Wessels et al., 2005). These are support vector machines (SVM), k-nearest neighbor (KNN) and Random Forest (RF).
Classification methods
-
Support Vector Machines (SVM): Recently, SVM (Vapnik, 1999) are gaining popularity as classifiers in microarrays (Furey et al., 2000; Lee & Lee, 2003; Ramaswamy et al., 2001). The SVM tries to find a hyperplane that separates the data belonging to two different classes the most. It maximizes the margin, which is the minimal distance between the hyperplane and each data class. When the data are not separable, SVM still tries to maximize the margin but allow some classification errors subject to the constraint that the total error (distance from the hyperplane to the wrongly classifies samples) is below a threshold.
-
Nearest neighbour (KNN): KNN is a non-parametric classification method that predicts the sample of a test case as the majority vote among the k nearest neighbors of the test case (Ripley 1996; Hastie et al., 2001). The number of neighbors used (k) is often chosen by cross-validation. (Ripley 1996; Hastie et al., 2001).
-
Random Forest (RF): It is an ensemble classifier that consists of many decision trees and outputs the class that is the mode of the class's output by individual trees (Leo Breiman,2001)
Feature selection: finding the "important genes"
In machine learning and statistics, feature selection, also known as variable selection, feature reduction, attribute selection or variable subset selection, is the technique of selecting a subset of relevant features for building robust learning models. When applied in biology domain, the technique is also called discriminative gene selection, which detects influential genes based on DNA microarray experiments
- Correlation-based Feature Selection (CFS), This method comes from the thesis of Mark A. Hall., 1999. This algorithm claims that feature selection for supervised machine learning tasks can be accomplished on the basis of correlation between features. In particular, this algorithm investigates the following hypothesis:
"A good feature subset is one that contains features highly correlated with (predictive of) the class, yet uncorrelated with (not predictive of) each other".
- Principal Components Analysis (PCA): involves a mathematical procedure that transforms a number of possibly correlated variables into a smaller number of uncorrelated variables called principal components. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible. Dimensionality reduction is accomplished by choosing enough eigenvectors to account for some percentage of the variance in the original data, default 0.95 (95%). Attribute noise can be filtered by transforming to the PC space, eliminating some of the worst eigenvectors, and then transforming back to the original space.
After ranking the genes by PCA, we examine the performance of the class prediction algorithm using different numbers of the best ranked genes and select the best performing predictor. In the current version of Babelomics we build the predictor using the best 5, 10, 15, 20, 25, 30, 35, 40, 45 and 50 genes. You can choose other combinations of numbers anyway. Note, however, that most of the methods require starting by, at least, 5 genes.
Potential sources of errors
Selection bias
If the gene selection process is not taken into account in the cross-validation, the estimations of the errors will be artificially optimistic. This is the problem of selection bias, which has been discussed several times in the microarray literature (Ambroise & McLachlan 2002; Simon et al. 2003). Essentially, the problem is that we use all the arrays to do the filtering, and then we perform the cross-validation of only the classifier, with an already selected set of genes. This cannot account properly for the effect of pre-selecting the genes. As just said, this can lead to severe underestimates of prediction error (and the references given provide several alarming examples). In addition it is very easily to obtain (apparently) very efficient predictors with completely random data, if we do not account for the pre-selection.
Finding the best subset among many trials
The optimal number of genes to be included in a predictor is not know beforehand and it depends on the own predictor and on the particular dataset. A rather intuitive way of guessing about it is by building predictors for different number of genes (see above). Unfortunately, by doing this we are again falling into a situation of selection bias, because we are estimating the error rate of the predictor without taking into account that we are choosing the best among several trials (8 in this case).
Thus, another layer of cross-validation has been added. We need to evaluate the error rate of a predictor that is built by selecting among a set of rules the one with the smallest error.
The cross-validation strategy implemented here return the cross-validated error rate of the complete process. The cross-validation affect to the complete process of building several predictors and then choosing the one with the smallest error rate.
References
- Ambroise, C. and McLachlan, G. J., (2002). Selection bias in gene extraction on the basis of microarray gene expression data. Proc Natl Acad Sci USA, 99:6562-6566.
- Breiman, L., 2001. Random forests. Machine Learning, 45:5-32.
- Dudoit, S., Fridlyand, J., and Speed, T. P. (2002). Comparison of discrimination methods for the classification of tumors suing gene expression data. J Am Stat Assoc, 97:77-87.
- Hastie, T., Tibshirani, R., and Friedman, J., (2001). The elements of statistical learning. Springer, New York.
- Lee, JW , Lee JB, Park M, Song SH (2005) An extensive comparison of recent classification tools applied to microarray data Computational Statistics & Data Analysis 48: 869-885
- van 't Veer, L.J., Dai, H., van de Vijver, M.J., He, Y.D., Hart, A.A., Mao, M., Peterse, H.L., van der Kooy, K., Marton, M.J., Witteveen, A.T. et al. (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415: 530-536.
- Ripley, B. D. (1996). Pattern recognition and neural networks. Cambridge University Press, Cambridge.
- Simon, R. (2005) Roadmap for developing and validating therapeutically relevant genomic classifiers. J Clin Oncol, 23: 7332-7341
- Simon, R., Radmacher, M. D., Dobbin, K., and McShane, L. M., (2003). Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. Journal of the National Cancer Institute, 95:14-18.
- Tibshirani, R., Hastie, T., Narasimhan, B., and Chu, G., (2002). Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc Natl Acad Sci USA, 99:6567-6572.
- Vapnik, V (1999) Statistical learning theory. John Wiley and Sons. New York.
- Wessels LF, Reinders MJ, Hart AA, Veenman CJ, Dai H, He YD, van't Veer LJ(2005) A protocol for building and evaluating predictors of disease state based on microarray data. Bioinformatics.21(19):3755-62.
Find the Babelomics suite at http://babelomics.org