affymetrix_expression_normalization_with_apt

Francisco García edited this page Jan 30, 2015 · 3 revisions

APT commands for Affymetrix Microarray Normalization in Babelomics

Introduction

There are two kinds of Affymetrix expression arrays:

  • 3' Gene Expression Analysis Arrays: the old style chips
  • Whole-Transcript Expression Exon and Gene Level Arrays: the new style chips

Regarding normalization:

  • 3' Gene Expression Arrays can be normalized only at gene level
  • Whole-Transcript Expression Arrays can be normalized at both exon and gene level (the arrays named Gene... can in fact be normalized at exon level, and the ones named Exon... can be normalized at gene level too)

In general, Babelomics has to detect the type of chip the user submits and classify it as 3' or WT. To do this we need a list of all array names that Babelomics accepts, together with their classification.
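As a minimal sketch of such a classification, a lookup by array name could look like the function below. The function name classify_chip and the labels 3prime and WT are hypothetical, and only the two array names used in the examples on this page are listed; the real list would cover every accepted array.

```shell
#!/usr/bin/env bash
# Hypothetical lookup: array name -> chip class ("3prime" or "WT").
# Only the two arrays used as examples on this page are listed here.
classify_chip() {
    case "$1" in
        HG-U133A_2)       echo "3prime" ;;
        MoGene-1_0-st-v1) echo "WT" ;;
        *)                echo "unknown"; return 1 ;;   # array not accepted
    esac
}

# Usage: classify_chip HG-U133A_2
```

Returning a non-zero status for unknown names lets the caller reject unsupported arrays before any normalization is attempted.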

See all Affymetrix expression arrays and their classification into 3' or WT.

Several Library Files are needed to normalize each kind of chip using APT. This is the complete list of Library Files provided by Affymetrix. Babelomics should export all of them.

In general Babelomics will need...

  • for normalizing 3' arrays: a single file with the extension .cdf
  • for normalizing WT arrays: several files with the extensions .clf, .pgf, .bgp and .qcc. In addition, if the normalization is to be done at gene level, Babelomics will also need a file with the extension .mps
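The file requirements above could be checked before running APT. The sketch below is only an illustration: the function names, the 3prime/WT labels and the gene/exon level argument are assumptions, and it only tests that at least one file with each extension exists in a directory.

```shell
#!/usr/bin/env bash
# Print the library file extensions needed for a chip class and level.
# CLASS is "3prime" or "WT"; LEVEL is "gene" (default) or "exon".
required_extensions() {
    local chip_class=$1 level=${2:-gene}
    case "$chip_class" in
        3prime) echo "cdf" ;;
        WT)
            if [ "$level" = "gene" ]; then
                echo "clf pgf bgp qcc mps"   # .mps only needed at gene level
            else
                echo "clf pgf bgp qcc"
            fi ;;
        *) return 1 ;;
    esac
}

# Succeed if DIR contains at least one file ending in .EXT.
has_file_with_ext() {
    # If the glob matches nothing it stays literal and the -e test fails.
    set -- "$1"/*."$2"
    [ -e "$1" ]
}

# Print each missing extension; return non-zero if anything is missing.
missing_library_files() {
    local dir=$1 exts ext missing=0
    exts=$(required_extensions "$2" "$3") || return 2
    for ext in $exts; do
        if ! has_file_with_ext "$dir" "$ext"; then
            echo "$ext"
            missing=1
        fi
    done
    return "$missing"
}

# Usage: missing_library_files MoGene-1_0-st-v1.r4.analysis-lib-files WT gene
```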

Finding these files on the Affymetrix page may be a bit confusing. Here is a tip:

  • For Whole-Transcript GENE arrays: all these files (including .mps) can be found on the Affymetrix website, in the compressed files displayed with the tag Array, Analysis (the file name is tagged as analysis-lib-files).
  • For Whole-Transcript EXON arrays: most of the necessary files also come in the same Array Analysis zipped file. Only the .mps files are in a separate compressed file tagged as Meta Probeset, for instance Human Exon 1.0 ST Array Probeset, and Meta Probeset Files, core, full, extended and comprehensive hg18. Notice that there are 3 .mps files here; Babelomics should use the one called core.mps. There are also 2 .bgp files; Babelomics should use the one called .antigenomic.bgp.

Babelomics Normalization Pipeline for Affymetrix Arrays

The Pipeline will be:

  • Convert CEL files to a text format.
    • From the text-formatted files extract the array-type and check that all arrays in a data set are of the same type.
    • Check that the type of chip is among those that Babelomics can deal with.
    • Classify the chip as 3' or WT in order to determine the kind of analysis that can be done (in the future Babelomics will offer the possibility of normalizing WT arrays at exon level)
    • Extract the array-dimension from the text-formatted files and check that the array information is complete, i.e. that the number of lines in the array is the expected one.
  • Generate a matrix of raw data to be used afterwards (given at probe level)
  • Normalize the data (including present-absent calls if required). The user will be able to choose among these options:
    • RMA:
       -a rma
      or, when the dataset is big
       -a rma-sketch
    • Plier:
      if the chip type is 3'
       -a plier-mm
      or, when the dataset is big
       -a plier-mm-sketch
      if the chip type is Whole-Transcript
       -a plier-gcbg
      or, when the dataset is big
       -a plier-gcbg-sketch
    • Present-Absent Calls:
      if the chip type is 3'
       -a pm-mm,mas5-detect.calls=1.pairs=1
      if the chip type is Whole-Transcript
       -a dabg
  • Properly derive the calls in the Present-Absent Calls option:
    • for 3' arrays, p-values and proper calls (A, M, P values) are returned by apt-probeset-summarize, so no further transformation is needed.
    • for WT arrays, only p-values are returned by apt-probeset-summarize. Hence calls have to be derived from p-values as follows:
      • IF p-value < 0.05 THEN call is P
      • IF 0.05 <= p-value <= 0.065 THEN call is M
      • IF p-value > 0.065 THEN call is A
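The choice of -a value in the pipeline above depends on the method, the chip class and the size of the dataset. A minimal sketch of that selection follows; the function name apt_analysis, the method names rma/plier/calls, the 3prime/WT labels and the yes/no flag for big datasets are all hypothetical, and only the -a values themselves come from this page.

```shell
#!/usr/bin/env bash
# apt_analysis METHOD CHIP_CLASS [BIG]
#   METHOD:     rma | plier | calls
#   CHIP_CLASS: 3prime | WT
#   BIG:        "yes" selects the -sketch variant for big datasets
# Prints the value to pass to the -a option of apt-probeset-summarize.
apt_analysis() {
    local method=$1 chip_class=$2 big=${3:-no} analysis
    case "$method:$chip_class" in
        rma:3prime|rma:WT) analysis="rma" ;;
        plier:3prime)      analysis="plier-mm" ;;
        plier:WT)          analysis="plier-gcbg" ;;
        # Present-absent calls have no sketch variant on this page.
        calls:3prime)      echo "pm-mm,mas5-detect.calls=1.pairs=1"; return 0 ;;
        calls:WT)          echo "dabg"; return 0 ;;
        *)                 echo "unsupported: $method on $chip_class" >&2; return 1 ;;
    esac
    if [ "$big" = "yes" ]; then
        analysis="$analysis-sketch"
    fi
    echo "$analysis"
}

# Usage: apt-probeset-summarize ... -a "$(apt_analysis plier WT yes)" ...
```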

See APT complete documentation for more details.

The following sections explain how to carry out each of these three steps using APT.

Affymetrix Power Tools (APT) for CEL file conversion to a text file

The same command is used for all kinds of arrays, either 3' or WT.

Indicating a directory where to find the CEL files

apt-cel-convert -f text \
                -o txt_converted_cel_files_dir \
                cel_file_dir/*.CEL

Indicating a text file with the paths to the CEL files

apt-cel-convert -f text \
                -o txt_converted_cel_files_dir \
                --cel-files cell_paths_file.txt

Note: --cel-files takes a text file listing the CEL files to process, one per line, with the first line being 'cel_files'.
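A file in that layout could be generated as sketched below. The function name write_cel_list is hypothetical; it simply writes the 'cel_files' header line followed by one path per line for every .CEL file in a directory.

```shell
#!/usr/bin/env bash
# write_cel_list CEL_DIR OUT_FILE
# Writes the header line 'cel_files' followed by one CEL file path per line.
write_cel_list() {
    local cel_dir=$1 out_file=$2 f
    echo "cel_files" > "$out_file"
    for f in "$cel_dir"/*.CEL; do
        [ -e "$f" ] || continue      # skip the literal pattern if nothing matches
        echo "$f" >> "$out_file"
    done
}

# Usage:
#   write_cel_list cel_file_dir cell_paths_file.txt
#   apt-cel-convert -f text -o txt_converted_cel_files_dir --cel-files cell_paths_file.txt
```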

Affymetrix Power Tools (APT) for raw data extraction

3' arrays

apt-cel-extract -o raw_intensities.txt \
                -d HG-U133A_2.cdf \
                cel_file_dir/*.CEL

Note: the --cel-files option is also available.

Whole-Transcript arrays

apt-cel-extract -o raw_intensities.txt \
                -c MoGene-1_0-st-v1.r4.analysis-lib-files/MoGene-1_0-st-v1.r4.clf \
                -p MoGene-1_0-st-v1.r4.analysis-lib-files/MoGene-1_0-st-v1.r4.pgf \
                -b MoGene-1_0-st-v1.r4.analysis-lib-files/MoGene-1_0-st-v1.r4.bgp \
                cel_file_dir/*.CEL

Note: the --cel-files option is also available.

Affymetrix Power Tools (APT) for Gene Expression Level normalization

3' arrays

apt-probeset-summarize -o data_normalized_dir/ \
                       -d HG-U133A_2.cdf \
                       -a pm-mm,mas5-detect.calls=1.pairs=1 \
                       -a rma \
                       -a rma-sketch \
                       -a plier-mm \
                       -a plier-mm-sketch \
                       data_raw/expression/*.CEL

Note: the --cel-files option is also available. Each -a option selects one of the available analysis or normalization methods. They can all be run at the same time, but Babelomics will use just one at a time.

Whole-Transcript arrays


Gene arrays:

apt-probeset-summarize -o data_processed/data_normalized/gene_level \
                       -c MoGene-1_0-st-v1.r4.analysis-lib-files/MoGene-1_0-st-v1.r4.clf \
                       -p MoGene-1_0-st-v1.r4.analysis-lib-files/MoGene-1_0-st-v1.r4.pgf \
                       -b MoGene-1_0-st-v1.r4.analysis-lib-files/MoGene-1_0-st-v1.r4.bgp \
           --qc-probesets MoGene-1_0-st-v1.r4.analysis-lib-files/MoGene-1_0-st-v1.r4.qcc \
                       -m MoGene-1_0-st-v1.r4.analysis-lib-files/MoGene-1_0-st-v1.r4.mps \
                       -a dabg \
                       -a rma \
                       -a rma-sketch \
                       -a plier-gcbg \
                       -a plier-gcbg-sketch \
                       data_raw/expression/*.CEL


Exon arrays:

apt-probeset-summarize -o data_processed/data_normalized/gene_level \
                       -c MoEx-1_0-st-v1.r4.analysis-lib-files/MoGene-1_0-st-v1.r4.clf \
                       -p MoEx-1_0-st-v1.r4.analysis-lib-files/MoGene-1_0-st-v1.r4.pgf \
                       -b MoEx-1_0-st-v1.r4.analysis-lib-files/MoGene-1_0-st-v1.r4.antigenomic.bgp \
           --qc-probesets MoEx-1_0-st-v1.r4.analysis-lib-files/MoGene-1_0-st-v1.r4.qcc \
                       -m MoEx-1_0-st-v1.r4.analysis-lib-files/MoGene-1_0-st-v1.r4.core.mps \
                       -a dabg \
                       -a rma \
                       -a rma-sketch \
                       -a plier-gcbg \
                       -a plier-gcbg-sketch \
                       data_raw/expression/*.CEL

Note: the --cel-files option is also available. Each -a option selects one of the available analysis or normalization methods. They can all be run at the same time, but Babelomics will use just one at a time.

-a dabg only generates the p-value matrix (summary file), not the call matrix (unlike -a pm-mm,mas5-detect.calls=1.pairs=1, which produces both). Hence the call matrix is derived by Babelomics as follows:

  • P if p-value < 0.05
  • M if 0.05 <= p-value <= 0.065
  • A if p-value > 0.065
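The thresholds above could be applied to a p-value matrix with awk, as in the sketch below. The function name pvalues_to_calls is hypothetical, and the input layout is an assumption: a tab-separated file whose comment lines start with #, followed by one column header row and then one row per probeset with the identifier in the first column.

```shell
#!/usr/bin/env bash
# pvalues_to_calls < p-value matrix > call matrix
# Assumed input: tab-separated, optional '#' comment lines, one header row,
# then rows of "probeset_id <TAB> p-value <TAB> p-value ...".
pvalues_to_calls() {
    awk 'BEGIN { FS = OFS = "\t" }
         /^#/         { print; next }                    # keep comment lines as they are
         !seen_header { seen_header = 1; print; next }   # keep the column header row
         {
             for (i = 2; i <= NF; i++) {
                 if ($i < 0.05)        $i = "P"          # present
                 else if ($i <= 0.065) $i = "M"          # marginal
                 else                  $i = "A"          # absent
             }
             print
         }'
}

# Usage: pvalues_to_calls < dabg_pvalues.txt > dabg_calls.txt
```

The input and output file names in the usage line are placeholders; the actual name of the summary file written by apt-probeset-summarize should be taken from its output directory.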

Affymetrix Power Tools (APT) for Exon Level normalization

Only Whole-Transcript arrays can be processed at exon level.


Gene arrays:

apt-probeset-summarize -o data_processed/data_normalized/exon_level \
                       -c MoGene-1_0-st-v1.r4.analysis-lib-files/MoGene-1_0-st-v1.r4.clf \
                       -p MoGene-1_0-st-v1.r4.analysis-lib-files/MoGene-1_0-st-v1.r4.pgf \
                       -b MoGene-1_0-st-v1.r4.analysis-lib-files/MoGene-1_0-st-v1.r4.bgp \
           --qc-probesets MoGene-1_0-st-v1.r4.analysis-lib-files/MoGene-1_0-st-v1.r4.qcc \
                       -a dabg \
                       -a rma \
                       -a rma-sketch \
                       -a plier-gcbg \
                       -a plier-gcbg-sketch \
                       data_raw/expression/*.CEL


Exon arrays:

apt-probeset-summarize -o data_processed/data_normalized/exon_level \
                       -c MoEx-1_0-st-v1.r4.analysis-lib-files/MoGene-1_0-st-v1.r4.clf \
                       -p MoEx-1_0-st-v1.r4.analysis-lib-files/MoGene-1_0-st-v1.r4.pgf \
                       -b MoEx-1_0-st-v1.r4.analysis-lib-files/MoGene-1_0-st-v1.r4.antigenomic.bgp \
           --qc-probesets MoEx-1_0-st-v1.r4.analysis-lib-files/MoGene-1_0-st-v1.r4.qcc \
                       -a dabg \
                       -a rma \
                       -a rma-sketch \
                       -a plier-gcbg \
                       -a plier-gcbg-sketch \
                       data_raw/expression/*.CEL

Note: the --cel-files option is also available. Each -a option selects one of the available analysis or normalization methods. They can all be run at the same time, but Babelomics will use just one at a time. The command is exactly the same as the one for gene level normalization, but without the -m (.mps) line.