affymetrix_expression_normalization_with_apt
APT commands for Affymetrix Microarray Normalization in Babelomics
Introduction
There are two kinds of Affymetrix expression arrays:
- 3' Gene Expression Analysis Arrays: the old style chips
- Whole-Transcript Expression Exon and Gene Level Arrays: the new style chips
Regarding normalization:
- 3' Gene Expression Arrays can be normalized only at gene level
- Whole-Transcript Expression Arrays can be normalized both at exon and gene level (the arrays called Gene... can indeed be normalized at exon level; the ones called Exon... can be normalized at gene level too)
In general, Babelomics has to identify the type of chip the user is sending and classify it as 3' or WT. To do this, we need a list of all array names that Babelomics accepts, together with their classification.
See all Affymetrix expression arrays and their classification into 3' or WT.
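The classification step could be sketched as a simple lookup. This is only an illustration: the function names are hypothetical, and the table holds just the three array types that appear in the examples on this page, not the full list of arrays Babelomics accepts.

```shell
# Hypothetical lookup: map an array-type name to its class (3prime or WT).
# Only the array types used in the examples on this page are listed.
classify_array() {
  case "$1" in
    HG-U133A_2)        echo "3prime" ;;
    MoGene-1_0-st-v1)  echo "WT" ;;
    MoEx-1_0-st-v1)    echo "WT" ;;
    *)                 echo "unsupported"; return 1 ;;
  esac
}

# Check that every array in a data set has the same type, then classify it.
# Arguments: the array-type name of each CEL file in the data set.
classify_dataset() {
  local first="$1" t
  for t in "$@"; do
    if [ "$t" != "$first" ]; then
      echo "mixed"
      return 1
    fi
  done
  classify_array "$first"
}
```

For instance, classify_dataset MoGene-1_0-st-v1 MoGene-1_0-st-v1 prints WT, while a data set mixing two array types prints mixed and returns a non-zero status.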
Several Library Files are needed to normalize each kind of chip using APT. This is the complete list of Library Files provided by Affymetrix. Babelomics should export all of them.
In general Babelomics will need...
- for normalizing 3' arrays: a single file with the extension .cdf
- for normalizing WT arrays: several files with the extensions .clf .pgf .bgp .qcc. In addition, if the normalization is to be done at a gene level, Babelomics will need also a file with the extension .mps
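A minimal sketch of how these requirements might be checked before running APT. The helper names and the 3prime / WT labels are assumptions made for illustration; only the file extensions come from the list above.

```shell
# Hypothetical helper: list the library-file extensions needed for a chip
# class (3prime or WT) and a summarization level (gene or exon).
required_library_extensions() {
  local chip_class="$1" level="${2:-gene}"
  case "$chip_class" in
    3prime) echo "cdf" ;;
    WT)
      if [ "$level" = "gene" ]; then
        echo "clf pgf bgp qcc mps"   # gene level also needs the .mps file
      else
        echo "clf pgf bgp qcc"
      fi
      ;;
    *) echo "unknown chip class: $chip_class" >&2; return 1 ;;
  esac
}

# Print the extensions for which no file exists in a library directory.
missing_library_files() {
  local dir="$1" chip_class="$2" level="$3" ext exts
  exts=$(required_library_extensions "$chip_class" "$level") || return 1
  for ext in $exts; do
    # ls fails when the glob matches nothing, so the extension is reported
    ls "$dir"/*."$ext" > /dev/null 2>&1 || echo "$ext"
  done
}
```

An empty output from missing_library_files means every required extension is present in the directory.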
Finding these files on the Affymetrix page may be a bit confusing. Here is a tip:
- For Whole-Transcript GENE arrays: all these files (including .mps) can be found on the Affymetrix website, in the compressed files displayed with the tag Array, Analysis (the file name is tagged as analysis-lib-files).
- For Whole-Transcript EXON arrays: most of the necessary files also come in the same Array Analysis zipped file. Only the .mps files are in a separate compressed file tagged as Meta Probeset, for instance Human Exon 1.0 ST Array Probeset, and Meta Probeset Files, core, full, extended and comprehensive hg18. Notice that there are 3 .mps files here; Babelomics should use the one called core.mps. There are also 2 .bgp files; Babelomics should use the one called .antigenomic.bgp.
Babelomics Normalization Pipeline for Affymetrix Arrays
The Pipeline will be:
- Convert CEL files to a text format.
- From the text formatted files extract the array-type and check that all arrays in a data set are of the same type.
- Check that the type of chip is among those that Babelomics can deal with.
- Classify the chip as 3' or WT in order to set the kind of analysis that can be done (in the future Babelomics will offer the possibility of normalizing WT arrays at exon level)
- Extract the array-dimension from the text formatted files and check that the array information is complete; i.e. the number of lines in the array is the expected one.
- Generate a matrix of raw data to be used afterwards (the data are given at probe level)
- Normalize the data (including present-absent calls if required). The user will be able to choose among these options:
  - RMA: -a rma or, when the dataset is big, -a rma-sketch
  - Plier:
    - if the chip type is 3': -a plier-mm or, when the dataset is big, -a plier-mm-sketch
    - if the chip type is Whole-Transcript: -a plier-gcbg or, when the dataset is big, -a plier-gcbg-sketch
  - Present-Absent Calls:
    - if the chip type is 3': -a pm-mm,mas5-detect.calls=1.pairs=1
    - if the chip type is Whole-Transcript: -a dabg
- Properly derive the calls in the Present-Absent Calls option:
  - for 3' arrays, p-values and proper calls (A, M, P values) are returned by apt-probeset-summarize, so no further transformation is needed.
  - for WT arrays, only p-values are returned by apt-probeset-summarize. Hence calls have to be derived from p-values as follows:
    - IF p-value < 0.05 THEN call is P
    - IF 0.05 <= p-value <= 0.065 THEN call is M
    - IF p-value > 0.065 THEN call is A
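The choice of the -a value in the normalization step can be sketched as a small selection function. The function name, the method keywords (rma, plier, calls) and the big switch are hypothetical; this page does not say how large a dataset must be before the sketch variants are used, so that decision is left to the caller.

```shell
# Hypothetical helper: choose the value passed to -a in apt-probeset-summarize.
#   $1 method:     rma | plier | calls
#   $2 chip class: 3prime | WT
#   $3 "big" to request the sketch variant (optional; ignored for calls)
analysis_flag() {
  local method="$1" chip_class="$2" size="${3:-small}" flag
  case "$method:$chip_class" in
    rma:3prime|rma:WT) flag="rma" ;;
    plier:3prime)      flag="plier-mm" ;;
    plier:WT)          flag="plier-gcbg" ;;
    calls:3prime)      echo "pm-mm,mas5-detect.calls=1.pairs=1"; return 0 ;;
    calls:WT)          echo "dabg"; return 0 ;;
    *) echo "unsupported combination: $method on $chip_class" >&2; return 1 ;;
  esac
  if [ "$size" = "big" ]; then
    flag="$flag-sketch"
  fi
  echo "$flag"
}
```

For example, analysis_flag plier WT big prints plier-gcbg-sketch, which would then be passed as -a plier-gcbg-sketch.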
See APT complete documentation for more details.
The following sections explain how to carry out each of these three steps using APT.
Affymetrix Power Tools (APT) for CEL file conversion to a text file
The same code is used for all kinds of arrays, either 3' or WT.
Indicating a directory where to find the CEL files
apt-cel-convert -f text \
-o txt_converted_cel_files_dir \
cel_file_dir/*.CEL
Indicating a text file with the paths to the CEL files
apt-cel-convert -f text \
-o txt_converted_cel_files_dir \
--cel-files cell_paths_file.txt
Note: --cel-files: file specifying cel files to process, one per line with the first line being 'cel_files'.
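A file in the format described in the note could be generated along these lines. The function name is hypothetical; the only facts taken from this page are the cel_files header line and the one-path-per-line layout.

```shell
# Write a --cel-files list: the header line 'cel_files', then one CEL path
# per line.
#   $1 directory holding the .CEL files   $2 output list file
make_cel_list() {
  local cel_dir="$1" out="$2" f
  echo "cel_files" > "$out"
  for f in "$cel_dir"/*.CEL; do
    # Skip the unexpanded pattern when the directory has no .CEL files
    if [ -e "$f" ]; then
      echo "$f" >> "$out"
    fi
  done
}
```

The resulting file can then be passed to any of the APT commands on this page through --cel-files instead of listing the CEL files on the command line.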
Affymetrix Power Tools (APT) for raw data extraction
3' arrays
apt-cel-extract -o raw_intensities.txt \
-d HG-U133A_2.cdf \
cel_file_dir/*.CEL
Note: the --cel-files option is also available.
Whole-Transcript arrays
apt-cel-extract -o raw_intensities.txt \
-c MoGene-1_0-st-v1.r4.analysis-lib-files/MoGene-1_0-st-v1.r4.clf \
-p MoGene-1_0-st-v1.r4.analysis-lib-files/MoGene-1_0-st-v1.r4.pgf \
-b MoGene-1_0-st-v1.r4.analysis-lib-files/MoGene-1_0-st-v1.r4.bgp \
cel_file_dir/*.CEL
Note: the --cel-files option is also available.
Affymetrix Power Tools (APT) for Gene Expression Level normalization
3' arrays
apt-probeset-summarize -o data_normalized_dir/ \
-d HG-U133A_2.cdf \
-a pm-mm,mas5-detect.calls=1.pairs=1 \
-a rma \
-a rma-sketch \
-a plier-mm \
-a plier-mm-sketch \
data_raw/expression/*.CEL
Note: the --cel-files option is also available. Each -a option selects one of the available analysis or normalization methods. They can all be run at the same time, but Babelomics will use just one at a time.
Whole-Transcript arrays
Gene arrays:
apt-probeset-summarize -o data_processed/data_normalized/gene_level \
-c MoGene-1_0-st-v1.r4.analysis-lib-files/MoGene-1_0-st-v1.r4.clf \
-p MoGene-1_0-st-v1.r4.analysis-lib-files/MoGene-1_0-st-v1.r4.pgf \
-b MoGene-1_0-st-v1.r4.analysis-lib-files/MoGene-1_0-st-v1.r4.bgp \
--qc-probesets MoGene-1_0-st-v1.r4.analysis-lib-files/MoGene-1_0-st-v1.r4.qcc \
-m MoGene-1_0-st-v1.r4.analysis-lib-files/MoGene-1_0-st-v1.r4.mps \
-a dabg \
-a rma \
-a rma-sketch \
-a plier-gcbg \
-a plier-gcbg-sketch \
data_raw/expression/*.CEL
Exon arrays:
apt-probeset-summarize -o data_processed/data_normalized/gene_level \
-c MoEx-1_0-st-v1.r4.analysis-lib-files/MoGene-1_0-st-v1.r4.clf \
-p MoEx-1_0-st-v1.r4.analysis-lib-files/MoGene-1_0-st-v1.r4.pgf \
-b MoEx-1_0-st-v1.r4.analysis-lib-files/MoGene-1_0-st-v1.r4.antigenomic.bgp \
--qc-probesets MoEx-1_0-st-v1.r4.analysis-lib-files/MoGene-1_0-st-v1.r4.qcc \
-m MoEx-1_0-st-v1.r4.analysis-lib-files/MoGene-1_0-st-v1.r4.core.mps \
-a dabg \
-a rma \
-a rma-sketch \
-a plier-gcbg \
-a plier-gcbg-sketch \
data_raw/expression/*.CEL
Note: the --cel-files option is also available. Each -a option selects one of the available analysis or normalization methods. They can all be run at the same time, but Babelomics will use just one at a time.
-a dabg generates only the p-value matrix (summary file), not the call matrix (unlike -a pm-mm,mas5-detect.calls=1.pairs=1). Hence the call matrix is derived by Babelomics as follows:
- P if p-value < 0.05
- M if 0.05 <= p-value <= 0.065
- A if p-value > 0.065
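The derivation of calls from DABG p-values (P below 0.05, M from 0.05 to 0.065 inclusive, A above 0.065) can be sketched with awk. This is not the Babelomics implementation. It assumes a tab-separated summary file whose first non-comment line is the column header, whose first column is the probeset identifier, and whose optional leading comment lines start with #; the function name is hypothetical.

```shell
# Convert a matrix of p-values into a matrix of P / M / A calls.
# Reads the files given as arguments, or standard input when there are none.
pvalues_to_calls() {
  awk 'BEGIN { FS = OFS = "\t" }
       /^#/ { print; next }                            # keep comment lines
       !header_done { print; header_done = 1; next }   # keep the header row
       {
         for (i = 2; i <= NF; i++) {     # column 1 is the probeset id
           p = $i + 0
           if (p < 0.05)        $i = "P"
           else if (p <= 0.065) $i = "M"
           else                 $i = "A"
         }
         print
       }' "$@"
}
```

Usage would look like pvalues_to_calls p_values.txt > calls.txt, where both file names are placeholders.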
Affymetrix Power Tools (APT) for Exon Level
Only Whole-Transcript arrays can be processed at exon level.
Gene arrays:
apt-probeset-summarize -o data_processed/data_normalized/exon_level \
-c MoGene-1_0-st-v1.r4.analysis-lib-files/MoGene-1_0-st-v1.r4.clf \
-p MoGene-1_0-st-v1.r4.analysis-lib-files/MoGene-1_0-st-v1.r4.pgf \
-b MoGene-1_0-st-v1.r4.analysis-lib-files/MoGene-1_0-st-v1.r4.bgp \
--qc-probesets MoGene-1_0-st-v1.r4.analysis-lib-files/MoGene-1_0-st-v1.r4.qcc \
-a dabg \
-a rma \
-a rma-sketch \
-a plier-gcbg \
-a plier-gcbg-sketch \
data_raw/expression/*.CEL
Exon arrays:
apt-probeset-summarize -o data_processed/data_normalized/exon_level \
-c MoEx-1_0-st-v1.r4.analysis-lib-files/MoGene-1_0-st-v1.r4.clf \
-p MoEx-1_0-st-v1.r4.analysis-lib-files/MoGene-1_0-st-v1.r4.pgf \
-b MoEx-1_0-st-v1.r4.analysis-lib-files/MoGene-1_0-st-v1.r4.antigenomic.bgp \
--qc-probesets MoEx-1_0-st-v1.r4.analysis-lib-files/MoGene-1_0-st-v1.r4.qcc \
-a dabg \
-a rma \
-a rma-sketch \
-a plier-gcbg \
-a plier-gcbg-sketch \
data_raw/expression/*.CEL
Note: the --cel-files option is also available. Each -a option selects one of the available analysis or normalization methods. They can all be run at the same time, but Babelomics will use just one at a time. (Indeed, the command is the same as the one for gene level normalization, but without the .mps line.)
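Since the gene level and exon level commands differ only in the -m line, both could be produced by one wrapper. The sketch below only prints the command rather than running it. The function name and its arguments are hypothetical, and it assumes that all library files share one prefix and use the plain .bgp and .mps names, as in the MoGene example; Exon arrays would need the .antigenomic.bgp and .core.mps names instead.

```shell
# Hypothetical wrapper: print the apt-probeset-summarize command for a WT
# array. The -m option is added only for gene level summarization.
#   $1 level (gene | exon)     $2 library prefix (path without extension)
#   $3 analysis passed to -a   $4 output directory
wt_summarize_command() {
  local level="$1" lib="$2" analysis="$3" out_dir="$4"
  set -- apt-probeset-summarize -o "$out_dir" \
    -c "$lib.clf" -p "$lib.pgf" -b "$lib.bgp" \
    --qc-probesets "$lib.qcc"
  if [ "$level" = "gene" ]; then
    set -- "$@" -m "$lib.mps"
  fi
  set -- "$@" -a "$analysis"
  printf '%s\n' "$*"    # the caller appends the CEL files or --cel-files
}
```

Printing the command first makes it easy to inspect which library files will be used before launching a long normalization run.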
Find the Babelomics suite at http://babelomics.org