Preprocessing for microarrays

Francisco García edited this page Jan 30, 2015 · 15 revisions

Generalities

When starting a new microarray analysis, we have to normalize the raw data in order to correct undesirable effects. After the transformation we need to check whether the modified data (the normalized data, in microarray terminology) are free of the original artifacts that we wanted to remove. This assessment is done using several diagnostic plots.

There are two general assumptions about the data on which Babelomics' normalization methods rely:

  • That the proportion of genes that change across biological samples of any condition is small.
  • That there is a similar number of genes with increased and decreased measurements when comparing two biological samples.

These assumptions seem reasonable in general microarray experiments that record whole genome expression data. Nevertheless, researchers should review their particular experimental context before using the methodologies presented here.

What these hypotheses basically imply in statistical terms is that, if there were no technical artifacts in the data, there should be no general trend or pattern in the gene differences between any two samples of the data set. Most normalization algorithms exploit this fact by first fitting the trend of the raw data and then correcting the data for that trend.

Thus it seems clear that, at some point in the normalization process, all the arrays in the dataset must be treated together. The general rule of microarray data processing is to normalize together all the microarrays that are going to be analyzed together later.

Two-color versus single-channel hybridization

Originally, microarrays were used under the two color scheme of a competitive hybridization. Here, two biological samples, labeled with different fluorescent dyes, are hybridized onto the same array slide. The attachment of the genomic material to the glass keeps the proportion of the molecular concentrations in the samples. Thus, the intensities measured in the two channels represent the abundance of molecules in one sample relative to the other.

Generally in two color studies, the logarithm of the ratio of the two intensity channels (log ratio) is reported as a summary of the differences between two samples. If the above mentioned assumptions hold, the statistical distribution of the log ratios of each array will be centered on zero and the variability across arrays will be similar.

Many microarray applications still use this two color approach, but some newer microarray technologies are hybridized using a single channel protocol. They use only one type of dye to label each sample, which is hybridized on its own onto the array. In this approach, each microarray yields intensity measurements that represent the absolute abundance of molecules for a single biological sample.

Microarray Platforms

There are many different microarray platforms or manufacturers. Each of them uses its own technology and design to build up its microarrays and suggests different hybridization protocols. Hence, microarrays from different platforms have particularities that will need to be taken into account by the normalization algorithms.

But the platform is not the only thing that defines the data we get from the microarray. The scanner used to read the chip and the image processing software determine the final raw data format. Some manufacturers, like Affymetrix or Agilent, provide their own scanner alongside the microarray slides. Other platforms (including home-made ones) produce microarrays to be read in general purpose scanners like those of GenePix.

Hence, the first thing we require from normalization software is that it is able to read the format of our raw data files.

Babelomics can read 5 file formats from 3 different platforms (or, more appropriately, from 3 different scanners):

  • Affymetrix raw data files: .CEL files (Affymetrix arrays are always one color).
  • Agilent one color raw data files: one channel Agilent .TXT files.
  • Agilent two color raw data files: two channel Agilent .TXT files.
  • GenePix one color raw data files: one channel GPR files.
  • GenePix two color raw data files: two channel GPR files.

In Babelomics (as in general microarray contexts) we consider such files to be the raw data of the microarray experiment; the starting point of the data analysis process. However, this does not mean you have to normalize your data using Babelomics. If you have already pre-processed data you can input the normalized values into Babelomics to perform further steps in your analysis.

Normalization Steps

There are four general steps to be followed in microarray data preprocessing. Not all of them are necessary in all contexts; for instance, the Within Array Correction is not necessary, or even meaningful, in one color arrays.

These steps are described below in a particular order, which is useful to understand what has to be done when normalizing microarray data. In practice, though, most methodologies or algorithms have an effect on several of them.

Background Correction

The aim of this step is to correct for what is usually known as the background effect, that is, any source of technical variation reflected in a spatial pattern of the intensity measurements.

Probes or features are usually randomly scattered on the surface of the microarray. Therefore, there is no biological reason to expect such a spatial effect or trend in the intensity measurements; it may be caused by irregularities in the glass surface, differences in the hybridization efficiency, array washing problems or scanner effects.

In two color microarrays the background effect may affect each of the intensity channels differently. Hence, the background correction has to deal with each color separately. Similarly, the background effect will differ between the arrays in the experiment, so each array needs its own correction. Nevertheless, this does not mean that the algorithms used in the background correction deal with the arrays one at a time. Some background correction methodologies, like RMA, use information from all the arrays in the experiment in order to correct each of them. Hence the final corrected values for one array will change if it is normalized within a different dataset.

Some array platforms, like Agilent or other spotted arrays scanned using GenePix, provide a local background estimate for each of the features. In the background correction step, such information is used by the normalization algorithms to do a first correction of the foreground measurements. This first correction affects each feature (within each color) independently of the other ones.

Affymetrix array designs do not have a background estimate for each probe; instead they include a mismatch probe (MM) or set of probes to correct for cross-hybridization or non-specific binding. Background correction algorithms may take advantage of such features.

Within Array Correction

In two color microarray technologies, two biological samples, each of them labeled with a different dye, are hybridized onto the same chip. Ideally, the ratios of the two intensities are representative of the concentration ratio of the genes in both samples.

But differences in the processing of the two samples, in the efficiency of the dyeing or in how the scanner reads the red or the green channel may end up distorting such ratios. This distortion is known as dye bias.

Dye bias correction deals with non-biological differences in the two channel intensities of each array. It is the first aim of the within array correction, but this step also has a second purpose: summarization. The two color signals of each gene or feature in the array are merged into a single measurement. This is achieved by computing the log ratio of the two intensity measurements. This log ratio value is generally called the M-value.

After log ratio transformation the M-values should have a distribution centered around zero. Some normalization algorithms, like loess normalization, use this to fit the trend of the noise and correct for it. This transformation relies upon the general assumption that a similar number of genes will have increased or decreased expression levels in one channel relative to the other.
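As a minimal sketch of the log ratio transformation, the function below computes M-values for one array, together with the A-values (average log intensity) that are conventionally plotted against them. The function name `ma_values` and the toy intensities are illustrative assumptions; the red over green direction follows what this page states for Babelomics further down.

```python
import math

def ma_values(red, green):
    """Return (M, A) lists for the paired red/green intensities of one array."""
    m_values, a_values = [], []
    for r, g in zip(red, green):
        log_r, log_g = math.log2(r), math.log2(g)
        m_values.append(log_r - log_g)        # M: log2 ratio, red over green
        a_values.append((log_r + log_g) / 2)  # A: average log2 intensity
    return m_values, a_values

# Toy intensities (hypothetical): one gene up in red, one unchanged, one up in green.
red = [1024.0, 512.0, 256.0]
green = [256.0, 512.0, 1024.0]
m, a = ma_values(red, green)
```

With as many genes up as down, the M-values of this toy array are symmetric around zero, which is the situation the assumption above describes.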

Between Array Scaling

In this step measurements from all microarrays are rescaled into a single final distribution. This is necessary in order to get data from different samples calibrated to one another. Otherwise any analysis done will be meaningless.

Generally a consensus distribution is defined from the dataset and then data from each array is transformed into that distribution. This is the basis, for instance, of quantile normalization.
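The idea of a consensus distribution can be sketched with a bare-bones quantile normalization. This is an illustration only, not the Bioconductor code Babelomics calls: the function name is hypothetical, all arrays are assumed to have the same length with no missing values, and ties are broken arbitrarily by sort order.

```python
def quantile_normalize(arrays):
    """arrays: list of equal-length lists, one per array. Returns normalized copies."""
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Consensus distribution: mean of the k-th smallest value across arrays.
    consensus = [sum(s[k] for s in sorted_arrays) / len(arrays) for k in range(n)]
    normalized = []
    for a in arrays:
        # Indices of this array ordered from smallest to largest value.
        order = sorted(range(n), key=lambda i: a[i])
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = consensus[rank]  # replace each value by the consensus quantile
        normalized.append(out)
    return normalized

# Two toy arrays with three features each (hypothetical values).
arrays = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
normalized = quantile_normalize(arrays)
```

After the transformation every array holds exactly the same set of values; only the feature that receives each value differs between arrays.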

If we succeed in removing dye bias in two color arrays, then all array median values should be centered on zero. The only thing left for the Between Array Scaling to do is to standardize their variabilities.

Summarization

This is the final step (at least conceptually) in the pre-processing of microarray data.

In general array designs there may be several probes or spots designed to hybridize with the same gene, transcript or biological feature. There may also be control spots designed for quality checking, background signal estimation or to measure cross hybridization.

In this final step, array intensities are summarized into a final measurement for each biological feature of interest in the study. If, for instance, we are doing an experiment at gene level, all probes matching a gene will be averaged in some way into a single number reflecting the expression of the gene. If we are investigating at exon level, the probes of the array will be summarized for each exon.
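A simple way to picture summarization is to take the median of all probes that map to the same gene. The sketch below assumes a hypothetical probe-to-gene annotation given as a parallel list; real methods (median polish in RMA, for example) are more elaborate.

```python
from collections import defaultdict
from statistics import median

def summarize_by_gene(probe_genes, intensities):
    """Collapse probe-level values into one value per gene using the median."""
    groups = defaultdict(list)
    for gene, value in zip(probe_genes, intensities):
        groups[gene].append(value)
    return {gene: median(values) for gene, values in groups.items()}

# Hypothetical annotation: three probes for geneA, two for geneB.
probe_genes = ["geneA", "geneA", "geneA", "geneB", "geneB"]
log_intensities = [8.0, 8.5, 12.0, 6.0, 7.0]
summary = summarize_by_gene(probe_genes, log_intensities)
```

The median is used here rather than the mean so that a single outlying probe (12.0 in the example) does not dominate the gene-level value.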

Also in this final step control spots in the array will be removed so only biological measurements remain in the normalized data.

A good example of how the steps above are not always performed in the described order is that, generally, control spots are removed at the very beginning of the analysis, even before background correction, so they do not influence further transformations of the data.

Normalization Output

Babelomics stores normalized intensity measurements from all arrays into a unique normalized data matrix. In this matrix genes are arranged by rows and arrays (or experiments) are ordered in columns.

This matrix can be downloaded as a tab delimited text file or redirected to other Babelomics modules for further analysis.
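The layout described above (genes in rows, arrays in columns, tab delimited) can be sketched as follows. The header label, gene identifiers and array names are made up for illustration; the exact header Babelomics writes is not specified on this page.

```python
import csv
import io

def write_matrix(handle, gene_ids, array_names, matrix):
    """Write a genes-by-arrays matrix as tab delimited text."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["gene"] + array_names)   # hypothetical header row
    for gene, row in zip(gene_ids, matrix):
        writer.writerow([gene] + row)

# Write to an in-memory buffer; a real file handle works the same way.
buffer = io.StringIO()
write_matrix(buffer, ["geneA", "geneB"], ["array1", "array2"],
             [[8.5, 8.7], [6.5, 6.4]])
text = buffer.getvalue()
```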

In order to assess how well the normalization has performed on your data, some plots are also provided with the normalized data. These plots are devised to represent array data distribution and can be used to compare datasets normalized using different methodologies. They can also be used to compare normalized data with raw data and see how much the normalization reduced the noise.

General plots provided after normalization will be:

  • Box-plots of the normalized data are displayed to help you assess the performance of the normalization method used. Each box-plot represents the distribution of intensity measurements of one array. You would expect all of them to show the same shape. If you are normalizing two color arrays you will also expect all of the boxes to be centered on zero.

  • MA plots representing the normalized intensity distribution of each sample against a consensus mean sample. A LOESS line fitting the trend between M and A values is drawn in red. After normalization you expect no trend in the LOESS line, that is, you expect it to be as close as possible to the horizontal 0 axis.

  • Pseudo Image Plots: represent the normalized intensity of each spot within the array coordinates, creating a pseudo photo of the normalized array. High intensities are represented in red colors, low intensities in blue colors. Ideally you will see an evenly colored image, meaning that, after normalization, there is no spatial effect in the array measurements. M-values, i.e. log ratios of the two channel intensities, are represented in these plots.
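To make the MA plot against a consensus sample more concrete, the sketch below computes the M and A coordinates for one array of a log2 data matrix. It assumes the consensus is the per-gene mean across arrays; whether Babelomics builds its consensus exactly this way is an assumption, and the function name is hypothetical.

```python
def ma_against_consensus(log_matrix, array_index):
    """log_matrix: rows = genes, columns = arrays, values on the log2 scale."""
    m_values, a_values = [], []
    for row in log_matrix:
        consensus = sum(row) / len(row)            # mean sample for this gene
        sample = row[array_index]
        m_values.append(sample - consensus)        # difference to the consensus
        a_values.append((sample + consensus) / 2)  # average of sample and consensus
    return m_values, a_values

# Two genes measured on two arrays (toy values).
log_matrix = [[8.0, 10.0], [6.0, 6.0]]
m0, a0 = ma_against_consensus(log_matrix, 0)
```

Plotting `m0` against `a0` and fitting a smooth line through the points gives the kind of picture described in the MA plot bullet.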


Affymetrix Expression Arrays Normalization Methods

Using Babelomics you can normalize and compute present-absent calls for Affymetrix expression arrays.

Either 3' Gene Expression Analysis Arrays (the old style chips) or Whole-Transcript Expression Exon and Gene Level Arrays (the newer microarrays) can be processed in Babelomics. At the moment Whole-Transcript Affymetrix arrays are only processed at gene level in Babelomics.

Available normalization methods are:

  • RMA: Robust Multi-Array background correction is followed by a quantile scale transformation to get a similar intensity distribution across arrays. Only PM probes are used in the computation (MM probes are discarded) and median polish is used to summarize probe-sets into a single intensity measurement. RMA returns normalized data in log2 scale, as in the original RMA implementation.

  • PLIER: quantile scale transformation of probe intensities is followed by a PM-MM correction. In this correction the MisMatch probe intensity (in 3' arrays, or some control in WT arrays) is subtracted from the Perfect Match measurement. Then a Probe Logarithmic Intensity Error summarization step is taken. In Babelomics PLIER also returns normalized data in log2 scale; notice that this was not so in the original implementation of the algorithm.
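The median polish summarization mentioned for RMA can be sketched as follows. This is a plain Tukey median polish written for illustration, not the APT code that Babelomics runs; the function name, iteration limit and tolerance are assumptions. Rows are the probes of one probe-set, columns are arrays, and the summarized value for each array is taken as the overall effect plus that array's column effect.

```python
from statistics import median

def median_polish(table, max_iter=10, tol=1e-6):
    """table: rows = probes of one probe-set, columns = arrays (log2 values).
    Returns one summarized value per array (overall + column effect)."""
    residuals = [list(row) for row in table]
    n_rows, n_cols = len(residuals), len(residuals[0])
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(max_iter):
        # Sweep row medians out of the residuals and into the row effects.
        row_med = [median(r) for r in residuals]
        for i in range(n_rows):
            residuals[i] = [v - row_med[i] for v in residuals[i]]
            row_eff[i] += row_med[i]
        shift = median(col_eff)
        col_eff = [c - shift for c in col_eff]
        overall += shift
        # Sweep column medians out of the residuals and into the column effects.
        col_med = [median(residuals[i][j] for i in range(n_rows))
                   for j in range(n_cols)]
        for i in range(n_rows):
            residuals[i] = [residuals[i][j] - col_med[j] for j in range(n_cols)]
        col_eff = [c + m for c, m in zip(col_eff, col_med)]
        shift = median(row_eff)
        row_eff = [r - shift for r in row_eff]
        overall += shift
        # Stop once a full sweep no longer changes anything noticeably.
        if max(abs(m) for m in row_med + col_med) < tol:
            break
    return [overall + c for c in col_eff]

# Three probes on two arrays; the second array is 2 log2 units higher.
probe_set = [[8.0, 10.0], [9.0, 11.0], [7.0, 9.0]]
expression = median_polish(probe_set)
```

Because medians are used at every sweep, a probe that behaves erratically on one array has little influence on the summarized values.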

The present-absent call algorithm uses a Wilcoxon’s rank test to detect which probe-sets are expressed well above the MisMatch intensities (in 3' arrays) or the background control intensities (in WT arrays). For Whole-Transcript arrays the algorithm works only at gene level, not at exon level. See the Affymetrix whitepaper for details.

Babelomics uses Affymetrix Power Tools (APT) for normalizing Affymetrix expression microarrays. Babelomics imports all Affymetrix expression library files, so you should be able to normalize any standard Affymetrix array. Nevertheless, you will not be able to use Babelomics with custom arrays, as their library files are not available.

See here for details of the implementation of APT into Babelomics.


Agilent One Color Arrays Normalization Methods

Using Babelomics you can normalize one color Agilent arrays.

Agilent array designs usually include several control spots. See Agilent Feature Extraction Software Reference Guide for details. Babelomics tries to remove those control spots from the data before normalization computation.

Background Correction Methods

  • agilent: uses the Agilent ProcessedSignal as returned by the Agilent Feature Extraction Software (see its Reference Guide).
  • normexp: a convolution of normal and exponential distributions is fitted to the foreground intensities using the background intensities as a covariate. Similar to rma but uses maximum likelihood estimation to fit the model. Babelomics uses its implementation in the limma package from Bioconductor.
  • rma: Robust Multi-Array Average (RMA) normalization proposed by Irizarry et al. (2003). Babelomics uses its implementation in the affy package from Bioconductor.
  • half: like subtract but any intensity which is less than 0.5 after background subtraction is reset to be equal to 0.5.
  • subtract: subtracts the background intensities from the foreground intensities.
  • none: no background correction is applied. Only foreground intensities are used; background intensities are treated as zero.
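The three simplest options in the list above (subtract, half and none) can be sketched in a few lines. The function below is an illustration of the descriptions given here, not the limma code that Babelomics calls, and its name and signature are hypothetical.

```python
def background_correct(foreground, background, method="subtract"):
    """Apply a simple background correction to one channel of one array."""
    if method == "none":
        # Background is treated as zero: keep the foreground as it is.
        return list(foreground)
    corrected = [f - b for f, b in zip(foreground, background)]
    if method == "subtract":
        return corrected
    if method == "half":
        # Any corrected intensity below 0.5 is reset to 0.5.
        return [max(value, 0.5) for value in corrected]
    raise ValueError(f"unsupported method: {method}")

# Toy spot intensities (hypothetical).
fg = [100.0, 50.0, 20.0]
bg = [30.0, 50.0, 25.0]
subtracted = background_correct(fg, bg, "subtract")
halved = background_correct(fg, bg, "half")
untouched = background_correct(fg, bg, "none")
```

Plain subtraction can produce zero or negative values, which cannot be log transformed; the half method avoids that by flooring the corrected intensities.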

Scaling Methods

Flagged spots

Spots within Agilent microarrays may be flagged for spot quality. That is, if a spot is considered to have bad quality because of its shape, size or any other parameter set by the scanner software, it is flagged as a bad spot. You can decide whether or not to use these flagged spots in your normalization steps:

  • Flags not fitted: if you tick this box, flagged spots will not be used in the fitting algorithms of the normalization process
  • Flags as missing: if you tick this box, flagged spots will be returned as missing data.

Hence you can combine these two options to deal with your flagged spots as you consider necessary.
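How the two flag options interact can be illustrated with a toy normalization. In the sketch below a simple median centering stands in for the real fitting algorithm (an assumption made only to keep the example short); the parameter names mirror the two check boxes but are otherwise hypothetical.

```python
from statistics import median

def center_with_flags(values, flagged, flags_not_fitted=True, flags_as_missing=False):
    """Median-center one array while honoring the two flag options."""
    # "Flags not fitted": leave flagged spots out when estimating the center.
    used = [v for v, f in zip(values, flagged) if not (f and flags_not_fitted)]
    center = median(used)
    result = []
    for v, f in zip(values, flagged):
        if f and flags_as_missing:
            result.append(None)        # "Flags as missing": report as missing data
        else:
            result.append(v - center)  # otherwise the spot is still corrected
    return result

values = [1.0, 2.0, 3.0, 100.0]
flagged = [False, False, False, True]   # the last spot is a flagged outlier

fitted_without_flag = center_with_flags(values, flagged)
reported_missing = center_with_flags(values, flagged, flags_as_missing=True)
fitted_with_flag = center_with_flags(values, flagged, flags_not_fitted=False)
```

With only the first option ticked, the flagged spot does not influence the fit but still gets a normalized value; with both ticked, it neither influences the fit nor appears in the output.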


Genepix One Color Arrays Normalization Methods

Using Babelomics you can normalize one color GPR files generated by GenePix scanners.

Some array designs include control spots. If these non-biological features are registered in the GPR files, Babelomics will try to remove them from the data before normalization computation.

Background Correction Methods

  • normexp: a convolution of normal and exponential distributions is fitted to the foreground intensities using the background intensities as a covariate. Similar to rma but uses maximum likelihood estimation to fit the model. Babelomics uses its implementation in the limma package from Bioconductor.
  • rma: Robust Multi-Array Average (RMA) normalization proposed by Irizarry et al. (2003). Babelomics uses its implementation in the affy package from Bioconductor.
  • half: like subtract but any intensity which is less than 0.5 after background subtraction is reset to be equal to 0.5.
  • subtract: subtracts the background intensities from the foreground intensities.
  • none: no background correction is applied. Just foreground intensities used. Background intensities are treated as zero.

Scaling Methods

  • quantiles: quantile scale transformation proposed by Bolstad et al. (2003). Babelomics uses its implementation in the affy package from Bioconductor.
  • scale: transforms array measurements to have the same median absolute deviation (MAD), as proposed in Yang et al. (2002). Babelomics uses its implementation in the limma package from Bioconductor.
  • none: no scaling is applied.
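The scale option can be pictured as rescaling every array so that all of them end up with the same MAD. The sketch below follows that description; using the geometric mean of the per-array MADs as the common target is an assumption, the function names are hypothetical, and every array is assumed to have a MAD greater than zero.

```python
import math
from statistics import median

def mad(values):
    """Median absolute deviation around the median."""
    center = median(values)
    return median(abs(v - center) for v in values)

def scale_to_common_mad(arrays):
    """Rescale each array so that all arrays share the same MAD."""
    mads = [mad(a) for a in arrays]
    # Target spread: geometric mean of the per-array MADs (an assumption).
    target = math.exp(sum(math.log(m) for m in mads) / len(mads))
    return [[v * target / m for v in a] for a, m in zip(arrays, mads)]

# Two toy arrays with the same center but different spread.
arrays = [[-1.0, 0.0, 1.0], [-4.0, 0.0, 4.0]]
scaled = scale_to_common_mad(arrays)
```

Only the spread of each array changes; values that were zero stay at zero, so the centering achieved in earlier steps is preserved.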

Flagged spots

GenePix Microarray Image Analysis Software can account for spot quality and flag low quality features. That is, if a spot is considered to have bad quality because of its shape, size or any other parameter set by the scanner software, it is flagged as a bad spot. You can decide whether or not to use these flagged spots in your normalization steps:

  • Flags not fitted: if you tick this box, flagged spots will not be used in the fitting algorithms of the normalization process
  • Flags as missing: if you tick this box, flagged spots will be returned as missing data.

Hence you can combine these two options to deal with your flagged spots as you consider necessary.


Agilent Two Colors Arrays Normalization Methods

Using Babelomics you can normalize two color Agilent arrays.

Agilent array designs usually include several control spots. See Agilent Feature Extraction Software Reference Guide for details. Babelomics tries to remove those control spots from the data before normalization computation.

Background Correction Methods

  • normexp: a convolution of normal and exponential distributions is fitted to the foreground intensities using the background intensities as a covariate. Similar to rma but uses maximum likelihood estimation to fit the model. Babelomics uses its implementation in the limma package from Bioconductor.
  • rma: Robust Multi-Array Average (RMA) normalization proposed by Irizarry et al. (2003). Babelomics uses its implementation in the affy package from Bioconductor.
  • half: like subtract but any intensity which is less than 0.5 after background subtraction is reset to be equal to 0.5.
  • subtract: subtracts the background intensities from the foreground intensities.
  • minimum: like subtract but any intensity which is zero or negative after background subtraction is set equal to half the minimum of the positive corrected intensities for that array.
  • movingmin: like subtract but the background estimates are replaced with the minimums of the backgrounds of the spot and its eight neighbors, i.e., the background is replaced by a moving minimum of 3x3 grids of spots.
  • edwards: a log-linear interpolation method is used to adjust lower intensities. Edwards (2003).
  • none: no background correction is applied. Just foreground intensities used. Background intensities are treated as zero.
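The minimum and movingmin options in the list above can be sketched as follows. Both functions are illustrations of the descriptions given here rather than the limma implementation; their names are hypothetical, and the treatment of spots on the edge of the array (the 3x3 window is simply truncated) is an assumption.

```python
def correct_minimum(foreground, background):
    """Subtract background; zero or negative results become half the
    smallest positive corrected intensity of the array."""
    corrected = [f - b for f, b in zip(foreground, background)]
    floor = min(v for v in corrected if v > 0) / 2
    return [v if v > 0 else floor for v in corrected]

def moving_min_background(background_grid):
    """Replace each background estimate by the minimum over the spot and its
    neighbours in a 3x3 window (truncated at the array edges)."""
    n_rows, n_cols = len(background_grid), len(background_grid[0])
    result = []
    for i in range(n_rows):
        row = []
        for j in range(n_cols):
            neighbours = [
                background_grid[r][c]
                for r in range(max(0, i - 1), min(n_rows, i + 2))
                for c in range(max(0, j - 1), min(n_cols, j + 2))
            ]
            row.append(min(neighbours))
        result.append(row)
    return result

# Toy values (hypothetical): four spots, then a 3x3 background grid.
fg = [100.0, 40.0, 30.0, 60.0]
bg = [20.0, 40.0, 50.0, 50.0]
corrected = correct_minimum(fg, bg)

grid = [[5, 9, 9],
        [9, 9, 9],
        [9, 9, 1]]
smoothed = moving_min_background(grid)
```

The smoothed background from `moving_min_background` would then be subtracted from the foreground in the same way as in the subtract method.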

Within array normalization Methods

There are two objectives in this step of the normalization process for two color microarrays. The first one is to do a dye-bias correction: rectify artifacts produced by differences in the measurements of the two signal channels. The second one is to summarize the two color signals into a single measurement for each gene or feature. This second objective is achieved by computing, for each feature in the array, the log ratio of the two intensity measurements. This log ratio value is generally called the M-value. In Babelomics, M-values are computed as the base 2 logarithm of the red signal over the green one.

Dye-bias correction methods available are:

  • loess: loess normalization is applied globally to all spots in the array. Yang et al. (2002). Babelomics uses its implementation in the limma package from Bioconductor.
  • median: subtracts the weighted median from the log-ratios (M-values) for each array.
  • none: log-ratios (M-values) are computed without any other correction.
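The median option is the easiest of these to write down. The sketch below subtracts the plain median of the M-values of one array; the weighted median mentioned above is simplified here to equal weights for every spot, which is an assumption of this example, and the function name is hypothetical.

```python
from statistics import median

def median_center(m_values):
    """Shift the M-values of one array so that their median is zero."""
    center = median(m_values)
    return [m - center for m in m_values]

# Toy M-values with a global dye bias of about +1 (hypothetical).
m_raw = [0.5, 1.5, 1.0, 2.0, 1.0]
m_centered = median_center(m_raw)
```

Unlike loess, this correction removes only a constant offset; it does not account for a dye bias that changes with intensity.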

Between Array Scaling Methods

In two color arrays the scaling step is done over the M-values.

Available methods are:

Flagged spots

Spots within Agilent microarrays may be flagged for spot quality. That is, if a spot is considered to have bad quality because of its shape, size or any other parameter set by the scanner software, it is flagged as a bad spot. You can decide whether or not to use these flagged spots in your normalization steps:

  • Flags not fitted: if you tick this box, flagged spots will not be used in the fitting algorithms of the normalization process
  • Flags as missing: if you tick this box, flagged spots will be returned as missing data.

Hence you can combine these two options to deal with your flagged spots as you consider necessary.

Two color normalization methods for Agilent microarrays are those implemented in the limma package from Bioconductor. See Smyth and Speed (2003) for details of the method implementation.

See MA-plot description for a better understanding of M-values and A-values.


Genepix Two Colors Arrays Normalization Methods

Using Babelomics you can normalize two color GPR files generated by GenePix scanners.

Some array designs include control spots. If these non-biological features are registered in the GPR files, Babelomics will try to remove them from the data before normalization computation.

Genepix scanners are generally used together with spotted arrays, and are prepared to report, for each feature, the print tip block in which it is allocated. Such local information is used to improve normalization in the two color microarray context, for instance in the printtiploess correction method.

Background Correction Methods

  • normexp: a convolution of normal and exponential distributions is fitted to the foreground intensities using the background intensities as a covariate. Similar to rma but uses maximum likelihood estimation to fit the model. Babelomics uses its implementation in the limma package from Bioconductor.
  • rma: Robust Multi-Array Average (RMA) normalization proposed by Irizarry et al. (2003). Babelomics uses its implementation in the affy package from Bioconductor.
  • half: like subtract but any intensity which is less than 0.5 after background subtraction is reset to be equal to 0.5.
  • subtract: subtracts the background intensities from the foreground intensities.
  • minimum: like subtract but any intensity which is zero or negative after background subtraction is set equal to half the minimum of the positive corrected intensities for that array.
  • movingmin: like subtract but the background estimates are replaced with the minimums of the backgrounds of the spot and its eight neighbors, i.e., the background is replaced by a moving minimum of 3x3 grids of spots.
  • edwards: a log-linear interpolation method is used to adjust lower intensities. Edwards (2003).
  • none: no background correction is applied. Just foreground intensities used. Background intensities are treated as zero.

Within array normalization Methods

There are two objectives in this step of the normalization process for two color microarrays. The first one is to do a dye-bias correction: rectify artifacts produced by differences in the measurements of the two signal channels. The second one is to summarize the two color signals into a single measurement for each gene or feature. This second objective is achieved by computing, for each feature in the array, the log ratio of the two intensity measurements. This log ratio value is generally called the M-value. In Babelomics, M-values are computed as the base 2 logarithm of the red signal over the green one.

Dye-bias correction methods available are:

  • printtiploess: loess normalization is applied to each print-tip block. Yang et al. (2002). Takes advantage of the fact that all spots in the same print-tip block share spatial and technical characteristics. Care should be taken when print-tip blocks are small because the method can overfit the data. Babelomics uses its implementation in the limma package from Bioconductor.
  • loess: loess normalization is applied globally to all spots in the array. Yang et al. (2002). Babelomics uses its implementation in the limma package from Bioconductor.
  • median: subtracts the weighted median from the log-ratios (M-values) for each array.
  • none: log-ratios (M-values) are computed without any other correction.
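The idea of correcting each print-tip block separately can be illustrated with a deliberately simplified stand-in: instead of fitting a loess curve per block, as printtiploess does, the sketch below only subtracts the median M-value of each block. It shows the block-wise structure, not the loess fit itself; the function name and block labels are hypothetical.

```python
from collections import defaultdict
from statistics import median

def center_by_block(m_values, blocks):
    """Subtract from each M-value the median of its own print-tip block."""
    by_block = defaultdict(list)
    for m, block in zip(m_values, blocks):
        by_block[block].append(m)
    centers = {block: median(values) for block, values in by_block.items()}
    return [m - centers[block] for m, block in zip(m_values, blocks)]

# Six spots printed by two tips; block 1 is biased up, block 2 down (toy values).
m_raw = [1.0, 1.5, 0.5, -0.5, -0.25, -0.75]
blocks = [1, 1, 1, 2, 2, 2]
m_blockwise = center_by_block(m_raw, blocks)
```

The fewer spots a block contains, the more its own biological signal leaks into the estimated correction, which is the overfitting risk mentioned for small print-tip blocks.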

Between Array Scaling Methods

In two color arrays the scaling step is done over the M-values.

Available methods are:

Flagged spots

Spots within GenePix microarrays may be flagged for spot quality. That is, if a spot is considered to have bad quality because of its shape, size or any other parameter set by the scanner software, it is flagged as a bad spot. You can decide whether or not to use these flagged spots in your normalization steps:

  • Flags not fitted: if you tick this box, flagged spots will not be used in the fitting algorithms of the normalization process
  • Flags as missing: if you tick this box, flagged spots will be returned as missing data.

Hence you can combine these two options to deal with your flagged spots as you consider necessary.

Two color normalization methods for GenePix microarrays are those implemented in the limma package from Bioconductor. See Smyth and Speed (2003) for details of the method implementation.

See MA-plot description for a better understanding of M-values and A-values.