Tutorial Data matrix preprocessing

Francisco García edited this page Jan 28, 2015 · 3 revisions

INPUT


STEPS

1. Select your data
2. Select logarithmic transformation
3. Choose exponential function
4. Merge replicates
5. Filter missing values
6. Impute missing values
7. Extract IDs from dataset and save into a file
8. Filter genes by names
9. Fill in job information
10. Press Launch job button


OUTPUT





INPUT

Input data

Input data should be a matrix uploaded with the data type Data matrix expression. See data types here.

Online example

Here you can load small datasets from our server. You can use them to run this example and see how the tool works. Click on the links to load the data: preprocessing a two classes matrix or preprocessing a correlation matrix.


STEPS

Select your data

The first step is to select the data you want to analyze.

Select logarithmic transformation

This function calculates the logarithm of the expression values. You can select the base you prefer for this.
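
As a rough illustration of what this step computes, the sketch below applies a logarithm in a chosen base to every value of a small matrix. It is not the Babelomics implementation: the function name `log_transform`, the use of `None` for missing values, and the choice to treat zero or negative values as missing are assumptions made for this example.

```python
import math


def log_transform(matrix, base=2.0):
    """Return a new matrix with the logarithm of every value in the given base.

    Assumption for this sketch: missing values are represented by None, and
    values for which the logarithm is undefined (zero or negative) also
    become None.
    """
    result = []
    for row in matrix:
        new_row = []
        for value in row:
            if value is None or value <= 0:
                new_row.append(None)
            else:
                new_row.append(math.log(value, base))
        result.append(new_row)
    return result


# Hypothetical expression values: two features measured in three samples.
expression = [[8.0, 16.0, None], [1.0, 0.0, 4.0]]
logged = log_transform(expression, base=2.0)
```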

Choose exponential function

This option applies an exponential function to the expression values. You can select the base you prefer for this.
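
The exponential option can be pictured as the inverse of the logarithmic one: each value x becomes base raised to x. The sketch below is only an illustration; `exp_transform` and the `None` convention for missing values are assumptions of this example, not names used by the tool.

```python
def exp_transform(matrix, base=2.0):
    """Return a new matrix where each value x is replaced by base ** x.

    Assumption for this sketch: missing values are None and stay None.
    """
    return [
        [None if value is None else base ** value for value in row]
        for row in matrix
    ]


# Hypothetical log2-scale values converted back to the linear scale.
linear = exp_transform([[3.0, None], [0.0, 1.0]], base=2.0)
```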

Merge replicates

This function looks for replicated clones (ids, genes...) and merges their patterns. You can choose between averaging the original patterns or taking the median.
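
A minimal sketch of merging replicates, assuming rows that share the same id are combined column by column with the mean or the median. The function name `merge_replicates` and the way missing values are handled (they are ignored when combining) are assumptions of this example; the tool may treat them differently.

```python
from statistics import mean, median


def merge_replicates(ids, matrix, method="mean"):
    """Merge rows that share an id, combining each column by mean or median.

    Assumption for this sketch: missing values are None and are ignored when
    combining; a column with no values at all stays None.
    """
    if method not in ("mean", "median"):
        raise ValueError("method must be 'mean' or 'median'")
    combine = mean if method == "mean" else median

    grouped = {}  # id -> list of rows, in order of first appearance
    for row_id, row in zip(ids, matrix):
        grouped.setdefault(row_id, []).append(row)

    merged_ids, merged_rows = [], []
    for row_id, rows in grouped.items():
        merged_row = []
        for column in zip(*rows):
            present = [v for v in column if v is not None]
            merged_row.append(combine(present) if present else None)
        merged_ids.append(row_id)
        merged_rows.append(merged_row)
    return merged_ids, merged_rows


# Hypothetical data: "g1" appears twice and is merged into a single row.
ids = ["g1", "g2", "g1"]
values = [[1.0, 2.0], [5.0, 6.0], [3.0, None]]
merged_ids, merged_values = merge_replicates(ids, values, method="mean")
```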

Filter missing values
  • This option is intended for removing the patterns with many missing values. You can choose the "Minimum percentage of existing values" you want to impose.

  • For example, if you have a dataset with 10 conditions and you set the minimum percentage of existing values to 70%, all the patterns with fewer than 7 existing values will be removed, i.e., all the patterns with more than 3 missing values will be removed.

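
The filtering rule in the example above (10 conditions, 70% threshold) can be sketched as follows. `filter_missing` and its parameter name are hypothetical, and missing values are assumed to be represented by `None`.

```python
def filter_missing(ids, matrix, min_percent_existing=70.0):
    """Keep only the rows whose percentage of existing values meets the threshold.

    Assumption for this sketch: missing values are represented by None.
    """
    kept_ids, kept_rows = [], []
    for row_id, row in zip(ids, matrix):
        if not row:
            continue  # an empty row has no values to count
        existing = sum(1 for value in row if value is not None)
        if 100.0 * existing / len(row) >= min_percent_existing:
            kept_ids.append(row_id)
            kept_rows.append(row)
    return kept_ids, kept_rows


# Hypothetical dataset with 10 conditions: "g1" has 7 existing values and is
# kept, "g2" has only 6 and is removed.
ids = ["g1", "g2"]
values = [
    [1.0] * 7 + [None] * 3,
    [1.0] * 6 + [None] * 4,
]
kept_ids, kept_values = filter_missing(ids, values, min_percent_existing=70.0)
```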

Impute missing values

This function fills out missing values. Several algorithms are available:

  • Impute with zeros: replace missing values with zeros. This is the simplest option, and we do not recommend using it unless you really know what you are doing.

  • Row mean imputation: replace missing values with the row average. This option is better than the first one, but again we do not recommend using it unless you really know what you are doing.

  • Row median imputation: replace missing values with the row median. This option is better than the first one, but again we do not recommend using it unless you really know what you are doing.

  • KNN imputation: replace missing values with the average value of the K nearest patterns. A pattern needs at least 5 non-missing values for the rest of its values to be imputed. Good values for K are around 15.

See Troyanskaya et al. (2001) Missing value estimation methods for DNA microarrays. Bioinformatics 17 (6), pp. 520-525
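
To make the difference between these options concrete, here is a simplified sketch. It is not the tool's implementation and not the exact KNNimpute procedure from the paper: the KNN part uses a plain (unweighted) average of the K nearest rows, measures distance only over columns where both rows have a value, and represents missing values as `None`. The function names are hypothetical; the defaults K = 15 and 5 non-missing values come from the text above.

```python
import math
from statistics import mean, median


def impute_row_statistic(matrix, method="mean"):
    """Fill None with zero, the row mean or the row median."""
    if method not in ("zero", "mean", "median"):
        raise ValueError("method must be 'zero', 'mean' or 'median'")
    filled = []
    for row in matrix:
        present = [v for v in row if v is not None]
        if method == "zero":
            fill = 0.0
        elif not present:
            fill = None  # nothing to summarise, leave the row as it is
        elif method == "mean":
            fill = mean(present)
        else:
            fill = median(present)
        filled.append([fill if v is None else v for v in row])
    return filled


def shared_distance(a, b):
    """Root mean squared difference over columns where both rows have a value."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf
    return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))


def impute_knn(matrix, k=15, min_existing=5):
    """Fill None with the average of that column in the k nearest rows."""
    filled = [list(row) for row in matrix]
    for i, row in enumerate(matrix):
        existing = sum(1 for v in row if v is not None)
        if existing == len(row) or existing < min_existing:
            continue  # nothing to impute, or too few values to compare rows
        ranked = sorted(
            (shared_distance(row, other), j)
            for j, other in enumerate(matrix)
            if j != i
        )
        ranked = [(d, j) for d, j in ranked if d != math.inf]
        for col, value in enumerate(row):
            if value is not None:
                continue
            # Nearest rows first; only rows with a value in this column donate.
            donors = [matrix[j][col] for _, j in ranked if matrix[j][col] is not None][:k]
            if donors:
                filled[i][col] = mean(donors)
    return filled


# Hypothetical data: the first row misses its last value.
data = [
    [1.0, 2.0, 3.0, 4.0, 5.0, None],
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    [1.0, 2.0, 3.0, 4.0, 5.0, 8.0],
    [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
]
by_mean = impute_row_statistic(data, method="mean")
by_knn = impute_knn(data, k=2)
```

With K = 2 the two identical neighbours supply the missing value, whereas the row mean ignores what similar patterns look like in that column.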

Extract IDs from dataset and save into a file

Sometimes it is useful to have all the ids in a file for use in other analysis tools.
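
Outside the tool, the same idea can be sketched in a few lines, assuming a tab-delimited matrix whose header lines start with # and whose first column holds the ids. The function name `extract_ids`, the file names, and the one-id-per-line output layout are assumptions of this example.

```python
import tempfile
from pathlib import Path


def extract_ids(matrix_path, ids_path):
    """Read the first column of a tab-delimited matrix and save it, one id per line.

    Assumption for this sketch: lines starting with '#' are headers and are skipped.
    """
    ids = []
    with open(matrix_path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\r\n")
            if not line or line.startswith("#"):
                continue
            ids.append(line.split("\t", 1)[0])
    Path(ids_path).write_text("\n".join(ids) + "\n", encoding="utf-8")
    return ids


# Hypothetical two-gene matrix written to a temporary folder for the demo.
with tempfile.TemporaryDirectory() as folder:
    matrix_file = Path(folder) / "matrix.txt"
    ids_file = Path(folder) / "ids.txt"
    matrix_file.write_text(
        "#NAMES\ts1\ts2\ngeneA\t1.0\t2.0\ngeneB\t3.0\t4.0\n", encoding="utf-8"
    )
    extracted = extract_ids(matrix_file, ids_file)
    saved = ids_file.read_text(encoding="utf-8")
```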

Filter genes by names

This option will remove all the genes that are present in the extra list you upload.
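
A minimal sketch of this filter, assuming exact, case-sensitive matching between the ids in the matrix and the names in the uploaded list. `filter_by_names` is a hypothetical name used only for this example.

```python
def filter_by_names(ids, matrix, names_to_remove):
    """Drop every row whose id appears in names_to_remove.

    Assumption for this sketch: names are compared exactly (case-sensitive).
    """
    blocked = set(names_to_remove)
    kept = [(row_id, row) for row_id, row in zip(ids, matrix) if row_id not in blocked]
    return [row_id for row_id, _ in kept], [row for _, row in kept]


# Hypothetical data: "g2" is in the uploaded list and is removed.
ids = ["g1", "g2", "g3"]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
remaining_ids, remaining_values = filter_by_names(ids, values, ["g2", "g9"])
```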

Fill in job information

  • Select the output folder.
  • Choose a job name.
  • Specify a description for the job if desired.

Press Launch job button

Press the Launch job button and wait until the job has finished. A typical job takes a few minutes, but the time varies with the size of the data. To check the state of your job, click the jobs button at the top right of the panel menu. A box listing all your jobs will appear on the right side of the browser window. When the analysis is finished, the job is labelled "Ready". Click on it and you will be redirected to the results page.


OUTPUT

Input parameters

In this section you will find a reminder of the parameters or settings you have used to run the analysis.

Output files

Preprocessing a data matrix yields another data matrix with transformed measurements.

The processed matrix is stored in a new tab-delimited text file.

In this file:

  • Arrays or samples are arranged in columns.
  • Genes, spots or features are set in rows.
  • Some header lines may be included at the beginning of the file. They all start with #.
  • One of those header lines, starting with #NAMES, contains the names of the arrays in your dataset.
  • The first column contains feature identifiers. For Agilent one color arrays, Babelomics tries to figure out which is the best feature id among those provided within the raw data files. Some other feature ids will be reported in the Feature Data File.
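
The layout described above could be read with a small parser such as the sketch below. Several details are assumptions that the description does not confirm: that the array names follow the #NAMES tag separated by tabs, and that missing values appear as non-numeric fields (for example NA or an empty field). The function names are hypothetical.

```python
import math


def parse_value(field):
    """Convert a field to float; non-numeric or NaN fields become None (assumption)."""
    try:
        value = float(field)
    except ValueError:
        return None
    return None if math.isnan(value) else value


def read_matrix(text):
    """Split a tab-delimited matrix into array names, feature ids and value rows."""
    names, ids, rows = [], [], []
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("#"):
            if line.startswith("#NAMES"):
                # Assumed layout: "#NAMES<TAB>name1<TAB>name2..."
                names = line.split("\t")[1:]
            continue
        fields = line.split("\t")
        ids.append(fields[0])
        rows.append([parse_value(field) for field in fields[1:]])
    return names, ids, rows


# Hypothetical file content with two arrays and two features.
example = "#NAMES\ts1\ts2\ng1\t1.5\tNA\ng2\t2.0\t3.0\n"
names, ids, rows = read_matrix(example)
```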


