Class prediction. Worked examples and exercises
#### INPUT
#### STEPS
1. [Select train data](tutorial-expression.-class-prediction#select-train-data)
2. [Select test data](tutorial-expression.-class-prediction#select-test-data)
3. [Choose algorithms](tutorial-expression.-class-prediction#choose-algorithms)
4. [Select method for error estimation](tutorial-expression.-class-prediction#select-method-for-error-estimation)
5. [Define method for gene subset selection](tutorial-expression.-class-prediction#define-method-for-gene-subset-selection)
6. [Press *Launch job* button](tutorial-expression.-class-prediction#press-launch-job-button)

#### OUTPUT
- [Train](tutorial-expression.-class-prediction#train)
- [Test](tutorial-expression.-class-prediction#test)

**[Worked examples and exercises](tutorial-expression.-class-prediction#worked-examples-and-exercises)**
### INPUT
##### Input data
The input data matrix should be a plain-text, tab-separated file like the following:

```
# some comments
# more comments
#VARIABLE tumor CATEGORICAL{ALL,AML} VALUES{ALL,ALL,ALL,AML,AML,AML}
#VARIABLE resp_treatment CATEGORICAL{Y,N} VALUES{Y,N,N,N,Y,Y}
#NAMES Cond1 Cond2 Cond3 Cond4 Cond5 Cond6
gen1 -3.06 -2.25 -1.15 -6.64 0.40 1.08
gen2 -1.36 -0.67 -0.17 -0.97 -2.32 -5.06
gen3 -0.17 0.48 1.23 1.52 1.11
gen4 1.61 -0.27 0.71 -0.62 0.14
gen5 2.09 2.12 2.62 1.95 1.04 2.18
gen6 0.20 -3.06 -0.03 0.64 0.84
gen7 -2.00 -0.64 -0.29 0.08 -1.00
gen8 0.93 1.29 -0.23 -0.74 -2.00 -1.25
gen9 0.88 0.31 -0.22 3.25
gen10 0.71 1.03 -0.25 1.03
```
Things to take into consideration:
- Matrix rows correspond to genes and matrix columns correspond to samples (arrays).
- All data items must be separated by tabs.
- Missing values are not allowed here. You can use the Preprocessing tool either to remove them or to impute values.
- A line with the #NAMES tag is mandatory.
- Lines beginning with #VARIABLE will be selectable in the form for predictor training.
- All other lines beginning with "#" (other than #NAMES and #VARIABLE) are treated as comments.
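The format rules above can be checked programmatically. Below is a minimal sketch (not Babelomics code; `parse_matrix` is a hypothetical helper) of how a file following these rules could be parsed and validated in Python:

```python
def parse_matrix(lines):
    """Parse a tab-separated expression matrix in the format described above."""
    names, variables, rows = None, {}, {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#NAMES"):
            # mandatory header naming the sample columns
            names = line.split("\t")[1:]
        elif line.startswith("#VARIABLE"):
            # e.g. "#VARIABLE tumor CATEGORICAL{ALL,AML} VALUES{ALL,AML,...}"
            parts = line.split()
            values = parts[3][len("VALUES{"):-1].split(",")
            variables[parts[1]] = values
        elif line.startswith("#"):
            continue  # any other "#" line is a comment
        else:
            fields = line.split("\t")
            rows[fields[0]] = [float(x) for x in fields[1:]]
    if names is None:
        raise ValueError("a #NAMES line is mandatory")
    for gene, values in rows.items():
        if len(values) != len(names):
            raise ValueError(f"{gene}: missing values are not allowed")
    return names, variables, rows
```

A row with fewer values than the number of samples in #NAMES is rejected, mirroring the "missing values are not allowed" rule.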
##### Online example
Here you can load a small dataset from our server and use it to run this example and see how the tool works. Click on the link to load the data: leukemia train dataset.
### STEPS
##### Select train data
The first step is to select the data to analyze.
The train data matrix must follow the format described in the INPUT section above.
##### Select test data
- This data is optional.
- The input data matrix should be a plain-text, tab-separated file like the following:

```
gen1 -3.06 -2.25 -1.15 -6.64 0.40 1.08
gen2 -1.36 -0.67 -0.17 -0.97 -2.32 -5.06
gen3 -0.17 0.48 1.23 1.52 1.11
gen4 1.61 -0.27 0.71 -0.62 0.14
gen5 2.09 2.12 2.62 1.95 1.04 2.18
gen6 0.20 -3.06 -0.03 0.64 0.84
gen7 -2.00 -0.64 -0.29 0.08 -1.00
gen8 0.93 1.29 -0.23 -0.74 -2.00 -1.25
gen9 0.88 0.31 -0.22 3.25
gen10 0.71 1.03 -0.25 1.03
```

- Things to take into consideration:
  - Matrix rows correspond to genes and matrix columns correspond to samples (arrays).
  - All data items must be separated by tabs.
  - Missing values are not allowed here. You can use the Preprocessing tool either to remove them or to impute values.
  - This matrix does not include the header lines (#NAMES, #VARIABLE): the class of each test sample is what we want to predict.
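Since predictions are made gene by gene, the test matrix should describe the same genes as the training matrix. A quick consistency check (an illustrative sketch; `check_same_genes` is a hypothetical helper, not part of Babelomics):

```python
def check_same_genes(train_rows, test_rows):
    """Return genes present in train but not test, and vice versa.

    Both arguments map gene names to lists of expression values.
    """
    missing = [g for g in train_rows if g not in test_rows]
    extra = [g for g in test_rows if g not in train_rows]
    return missing, extra
```

If both returned lists are empty, the two matrices cover the same genes.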
##### Choose algorithms
Choose one of the following prediction methods:
- Support Vector Machines (SVM)
- K-nearest neighbours (KNN)
- Random Forest (RF)

See the [Class prediction methods](Class prediction) section for details about the different algorithms.
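To build intuition for one of the methods listed above, here is a toy KNN sketch: a test sample is assigned the majority class among its k closest training samples. This is an illustration only, not the implementation Babelomics runs:

```python
from collections import Counter

def knn_predict(train, labels, sample, k=3):
    """Classify `sample` by majority vote of its k nearest training profiles.

    train: list of expression profiles (lists of floats)
    labels: class label for each training profile
    """
    # sort training indices by squared Euclidean distance to the sample
    order = sorted(
        range(len(train)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train[i], sample)),
    )
    nearest = [labels[i] for i in order[:k]]
    return Counter(nearest).most_common(1)[0][0]
```

For example, a sample whose expression profile sits near the ALL training profiles is labelled ALL.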
##### Select method for error estimation
Choose the cross-validation method used to validate training:
- Leave-one-out: a single observation from the original sample is used as the validation data, and the remaining observations as the training data. This is repeated so that each observation is used once for validation.
- K-fold: the original sample is randomly partitioned into K subsamples. A single subsample is retained as the validation data for testing the model, and the process is repeated K times so that each subsample is used once for validation.

See the [Class prediction methods](Class prediction) section for details about methods for error estimation.
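The two error-estimation schemes can be sketched with a small helper that produces train/validation index splits (an assumption for illustration: sequential fold assignment, whereas Babelomics partitions randomly):

```python
def kfold_indices(n_samples, k):
    """Split sample indices 0..n_samples-1 into k train/validation folds.

    Leave-one-out is the special case k == n_samples.
    """
    folds = []
    for f in range(k):
        test = list(range(f, n_samples, k))          # fold f's validation set
        train = [i for i in range(n_samples) if i % k != f]
        folds.append((train, test))
    return folds
```

Each sample appears in exactly one validation set, so every observation is used once for validation, as described above.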
##### Fill in job information
- Select the output folder.
- Choose a job name.
- Optionally, add a description for the job.
##### Press *Launch job* button
Press the launch button and wait for the job to finish. A typical job takes a few minutes, but the time varies with the size of the data. You can check the state of your job by clicking the jobs button at the top right of the panel menu; a box listing all your jobs will appear at the right of the browser. When the analysis is finished, the job is labelled "Ready". Click on it to be redirected to the results page.

### OUTPUT
#### Train
##### Summary
- **Selection of best 5 classifiers by algorithm**. This table compares the different classifiers using several indicators: [accuracy](http://en.wikipedia.org/wiki/Accuracy_and_precision), [MCC](http://en.wikipedia.org/wiki/Matthews_correlation_coefficient), [AUC](http://en.wikipedia.org/wiki/Receiver_operating_characteristic) and [RMSE](http://en.wikipedia.org/wiki/Root-mean-square_deviation).
- **Percentages of correct classification per sample and classifier**

##### Results
- All classifiers by algorithm.
- This information is summarized in a comparative plot.
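Two of the indicators in the summary table, accuracy and MCC, are simple functions of the binary confusion matrix (true/false positives and negatives). A sketch of how they are computed (illustrative only, not Babelomics code):

```python
import math

def accuracy(tp, tn, fp, fn):
    """Fraction of samples classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient: +1 perfect, 0 random, -1 inverse."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom
```

Unlike accuracy, MCC stays informative when the classes are unbalanced (as with 27 ALL vs 11 AML samples), which is why both are reported.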
#### Test
##### Results
Class prediction for the 5 best classifiers of the selected algorithms.
### Worked examples and exercises
In this example we analyse a dataset from Golub et al. (1999), who studied two types of leukaemia, acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL), in order to detect differences between them. The dataset has 3051 genes and 38 arrays, 27 labelled as ALL and 11 as AML.
Using Class prediction we will build a predictor to try to distinguish between the two classes. The train file contains 30 arrays, 21 ALL and 9 AML. The rest, 6 ALL and 2 AML, are in the test file for prediction.
You can find the dataset for this exercise in the following files:
- The file to train the predictor: datatraingolub.txt
- The file used to predict the classes (test dataset): datatestgolub.txt
##### Some training exercises
- Train with KNN: upload the data file and select the variable TUMOR. To run the exercise quickly, select 5 repeats of 5-fold cross-validation. Do not select any feature selection method in this exercise.
- Repeat the exercise but select the CFS feature selection method. Which one works better? Why? How many genes were selected?
- Now try SVM with no feature selection method. Which one performs better, SVM or KNN?
- To finish, try SVM with the CFS feature selection method. How many features were selected? Why does it match KNN with CFS?
- Finally, which is the best combination? Why does SVM do better alone than with CFS?
##### Some test exercises
Now select the option Train and test, and select datatraingolub and datatestgolub:
We can select KNN without a feature selection method to speed up the exercise.
To check the accuracy of the prediction, you can compare against the correct labels for the test file:
ALL ALL ALL ALL ALL ALL AML AML
Are the predictions right? Do you get the same results with SVM?
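A quick way to score your predictions against the true labels above (the `true` list is taken from this page; `predicted` is a placeholder you would replace with the tool's output):

```python
# correct labels for the 8 test samples, as listed above
true = ["ALL"] * 6 + ["AML"] * 2
# placeholder prediction: substitute the labels returned by Babelomics
predicted = ["ALL", "ALL", "ALL", "ALL", "ALL", "ALL", "AML", "ALL"]

correct = sum(t == p for t, p in zip(true, predicted))
print(f"{correct}/{len(true)} correct")
```

With the placeholder above this reports 7/8 correct; rerun it with the KNN and SVM predictions to compare the two algorithms.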
Find the Babelomics suite at http://babelomics.org