Tutorial Class prediction

Francisco García edited this page Mar 16, 2016 · 29 revisions

INPUT

STEPS

1. Select train data
2. Select test data
3. Choose algorithms
4. Select method for error estimation
5. Define method for gene subset selection
6. Press Launch job button

OUTPUT

WORKED EXAMPLES AND EXERCISES





INPUT

Input data

The input data matrix must be a plain-text, tab-separated file like the following:

# some comments
# more comments
#VARIABLE tumor CATEGORICAL{ALL,AML} VALUES{ALL,ALL,ALL,AML,AML,AML}
#VARIABLE resp_treatment CATEGORICAL{Y,N} VALUES{Y,N,N,N,Y,Y}
#NAMES Cond1 Cond2 Cond3 Cond4 Cond5 Cond6
gen1    -3.06   -2.25   -1.15   -6.64   0.40    1.08
gen2    -1.36   -0.67   -0.17   -0.97   -2.32   -5.06
gen3    -0.17   0.48    1.23    1.52    1.11    
gen4        1.61    -0.27   0.71    -0.62   0.14
gen5    2.09    2.12    2.62    1.95    1.04    2.18
gen6    0.20    -3.06   -0.03   0.64    0.84    
gen7    -2.00   -0.64   -0.29   0.08    -1.00   
gen8    0.93    1.29    -0.23   -0.74   -2.00   -1.25
gen9    0.88    0.31    -0.22   3.25        
gen10   0.71    1.03    -0.25       1.03

Things to take into consideration:

  • Matrix rows correspond to genes and matrix columns correspond to samples (arrays).
  • All data items must be separated by tabs.
  • Missing values are not allowed here. You can use Preprocessing either to remove them or to impute the values.
  • A line with the #NAMES tag is mandatory.
  • Variables defined in lines beginning with "#VARIABLE" will be selectable in the form for predictor training.
  • All other lines beginning with "#" (that is, other than "#NAMES" and "#VARIABLE") are treated as comments.
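
To check a file against the format rules above before uploading it, a small parser can help. The following is a minimal sketch, not the parser that Babelomics itself uses; the function name parse_matrix and the returned structure are assumptions made for illustration.

```python
import re

# Pattern for lines such as:
#   #VARIABLE tumor CATEGORICAL{ALL,AML} VALUES{ALL,ALL,AML}
VARIABLE_RE = re.compile(
    r"#VARIABLE\s+(\S+)\s+CATEGORICAL\{(.*?)\}\s+VALUES\{(.*?)\}"
)

def parse_matrix(text):
    """Parse a tab-separated expression matrix with '#' header lines.

    Returns (variables, names, rows):
      variables: dict, variable name -> list of per-sample values
      names:     list of sample names taken from the #NAMES line
      rows:      dict, gene name -> list of raw string fields
    """
    variables, names, rows = {}, None, {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("#VARIABLE"):
            match = VARIABLE_RE.match(line)
            if match:
                variables[match.group(1)] = match.group(3).split(",")
        elif line.startswith("#NAMES"):
            names = line.split()[1:]
        elif line.startswith("#"):
            continue  # any other '#' line is a comment
        else:
            fields = line.split("\t")
            rows[fields[0]] = fields[1:]
    if names is None:
        raise ValueError("a #NAMES line is mandatory")
    return variables, names, rows
```

Calling parse_matrix on the contents of a training file returns the class variables that would be selectable in the form, the sample names, and the expression values per gene, so the number of values per variable can be compared with the number of sample names.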
Online example

Here you can load a small dataset from our server and use it to run this example and see how the tool works. Click on the link to load the data: leukemia train dataset.


STEPS

Select train data

The first step is to select the data to analyze. The training file must follow the format described in the Input data section above: a plain-text, tab-separated matrix with a mandatory #NAMES line and the #VARIABLE lines that will be selectable in the form for predictor training.

Select test data
  • The test data is optional.
  • The input data matrix must be a plain-text, tab-separated file like the following:
gen1    -3.06   -2.25   -1.15   -6.64   0.40    1.08
gen2    -1.36   -0.67   -0.17   -0.97   -2.32   -5.06
gen3    -0.17   0.48    1.23    1.52    1.11    
gen4        1.61    -0.27   0.71    -0.62   0.14
gen5    2.09    2.12    2.62    1.95    1.04    2.18
gen6    0.20    -3.06   -0.03   0.64    0.84    
gen7    -2.00   -0.64   -0.29   0.08    -1.00   
gen8    0.93    1.29    -0.23   -0.74   -2.00   -1.25
gen9    0.88    0.31    -0.22   3.25        
gen10   0.71    1.03    -0.25       1.03
  • Things to take into consideration:
    • Matrix rows correspond to genes and matrix columns correspond to samples (arrays).
    • All data items must be separated by tabs.
    • Missing values are not allowed here. You can use Preprocessing either to remove them or to impute the values.
    • This matrix doesn't include information about #NAMES. We want to predict this information.
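
Because missing values are not allowed, it can be useful to locate empty fields before uploading a matrix. The sketch below is only an illustration under stated assumptions: the helper name find_missing is hypothetical, and the expected number of samples is passed in by the caller.

```python
def find_missing(lines, n_samples):
    """Return {gene: [1-based sample positions]} for empty or absent values.

    lines:     iterable of tab-separated data lines (gene name first)
    n_samples: number of sample columns every gene is expected to have
    """
    problems = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        fields = line.split("\t")
        gene, values = fields[0], fields[1:]
        # Pad short rows so that trailing absent values are also reported.
        values = values + [""] * (n_samples - len(values))
        missing = [i for i, v in enumerate(values, start=1) if v.strip() == ""]
        if missing:
            problems[gene] = missing
    return problems
```

An empty result means that every gene has a value for every sample; otherwise the reported genes need to be removed or imputed with Preprocessing first.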
Choose algorithms

Choose one or more of the following prediction methods:

  • Support Vector Machines (SVM)
  • Nearest neighbour (KNN)
  • Random Forest (RF)

See Class prediction methods section for details about different algorithms.

Select method for error estimation

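Choose the cross-validation method used to validate the training.

  • Leave-one-out: A single observation from the original sample is used as the validation data, and the remaining observations are used as the training data. This is repeated so that each observation is used once for validation.
  • KFold: The original sample is randomly partitioned into K subsamples. A single subsample is retained as the validation data for testing the model, and the remaining K-1 subsamples are used as the training data. This is repeated K times so that each subsample is used once for validation.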
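
The two error estimation schemes above differ only in how the samples are split. The following minimal sketch generates the splits as lists of sample indexes; it illustrates the general techniques and is not the code that the tool runs. The function names and the seed parameter are assumptions.

```python
import random

def leave_one_out(n):
    """Yield (train_idx, valid_idx) pairs, holding out one sample each time."""
    for i in range(n):
        yield [j for j in range(n) if j != i], [i]

def k_fold(n, k, seed=0):
    """Yield (train_idx, valid_idx) pairs for a random K-fold partition."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)          # random partition of the samples
    folds = [idx[i::k] for i in range(k)]     # K roughly equal subsamples
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield sorted(train), sorted(folds[i])
```

With 30 samples, leave_one_out(30) produces 30 splits with one validation sample each, while k_fold(30, 5) produces 5 splits with 6 validation samples each. Repeating K-fold with different seeds gives repeated cross-validation.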

See Class prediction methods section for details about methods for error estimation.

Fill in the job information
  • Select the output folder
  • Choose a job name
  • Specify a description for the job if desired.
Press Launch job button

Press the Launch job button and wait until the job has finished. A typical job takes a few minutes, but the time varies with the size of the data. To check the state of your job, click the jobs button at the top right of the panel menu. A box listing all your jobs will appear at the right of the web browser. When the analysis has finished, the job is labeled "Ready". Click on it to be redirected to the results page.


OUTPUT

Train

Summary
  • Selection of the 5 best classifiers per algorithm. This table compares the classifiers using several indicators: accuracy, MCC (Matthews correlation coefficient), AUC (area under the ROC curve) and RMSE (root mean squared error).
  • Percentages of correct classification per sample and classifier.
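
Two of the indicators in the summary table, accuracy and MCC, can be computed directly from the counts of correct and incorrect predictions for a two-class problem. The sketch below uses the standard definitions; the function names are chosen for illustration, and the tool may compute or average these values differently.

```python
import math

def confusion(true, pred, positive):
    """Count true/false positives and negatives for a two-class problem."""
    pairs = list(zip(true, pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    """Fraction of samples that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

Unlike accuracy, MCC takes all four counts into account, which makes it more informative when one class is much larger than the other.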
Results
  • All classifiers by algorithm.
  • This information is summarized by a comparative plot.

Test

Results

Class predictions from the 5 best classifiers for each selected algorithm.


Worked examples and exercises

Exercise 1

In this example we are going to analyse a dataset from Golub et al. (1999). In that paper the authors studied two different types of leukemia, acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL), in order to detect differences between them. This dataset has 3051 genes and 38 arrays, 27 of them labeled as ALL and 11 of them as AML.

Using Class prediction we are going to build a predictor to try to distinguish between the two classes. The train file contains 30 arrays, 21 ALL and 9 AML. The remaining 8 arrays, 6 ALL and 2 AML, are in the test file for prediction.

You can find the dataset for this exercise in the following files:

A. Training
  • Train with the KNN algorithm. Upload the data file and select the variable TUMOR. To keep the exercise fast, select 5 repeats of 5-fold cross-validation. In this exercise do not select any feature selection method.
  • Repeat the exercise but select the CFS feature selection method. Which one works better? Why? How many genes were selected?
  • Now try the SVM algorithm with no feature selection method. Which one performs better, SVM or KNN?
  • To finish, you can try SVM with the CFS feature selection method. How many features were selected? Why does the number match KNN with CFS?
  • Finally, which is the best combination? Why does SVM do better alone than with CFS?

B. Test

  • Now select the option Train and test, and select the files datatraingolub and datatestgolub.
  • You can select KNN without a feature selection method to speed up the exercise.
  • To check the accuracy of the prediction, compare it with the correct labels for the test file:
 ALL ALL ALL ALL ALL ALL AML AML
  • Are the predictions right? Do you get the same results with SVM?
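
The comparison with the correct labels can also be done programmatically. In this minimal sketch the expected labels are the ones listed above, while the predicted labels are hypothetical values made up for illustration; replace them with the predictions shown on your results page.

```python
def mismatches(expected, predicted):
    """Return (position, expected, predicted) for every wrong prediction.

    Positions are 1-based and follow the sample order of the test file.
    """
    if len(expected) != len(predicted):
        raise ValueError("label lists differ in length")
    return [
        (i, e, p)
        for i, (e, p) in enumerate(zip(expected, predicted), start=1)
        if e != p
    ]

# Correct labels for the test file, as listed in the exercise.
expected = "ALL ALL ALL ALL ALL ALL AML AML".split()
# Hypothetical predictions, NOT real output of the tool.
predicted = "ALL ALL ALL ALL ALL AML AML AML".split()

for position, exp, pred in mismatches(expected, predicted):
    print(f"sample {position}: expected {exp}, predicted {pred}")
```

An empty list means that all test samples were predicted correctly.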

Exercise 2

Data description. RNA-Seq data of Lung squamous cell carcinoma (LUSC) samples taken from The Cancer Genome Atlas (TCGA) data portal.

Goals. We want to train several classification models. After this step, we will evaluate the best way of classifying the data in a test dataset.

1) Download tca_gene_lusc_train.txt. It contains 11 Normal and 150 Tumor samples.

2) Download tca_gene_lusc_test.txt. It contains 6 Normal and 75 Tumor samples.

3) Upload your files to Babelomics 5.0. Go to the section Expression > Class Prediction.

4) Try several classification strategies:

  • Select SVM, KNN and Random Forest
  • Select Leave-one-out for error estimation
  • Select Correlation-based Feature Selection (CFS)

5) Download test_result.txt

  • Which supervised classification method(s) work better?
  • How many genes were used for the prediction?
  • Are the selected genes the same for all methods?
