Preprocessing for RNA-Seq

mhg-cipf edited this page Jan 16, 2015 · 7 revisions

Biases

Depending on the biases present in the data, one normalization method or another should be applied. Babelomics can correct three kinds of bias:

  • Library depth bias: The number of counts per gene is proportional to the library depth (the total number of reads in the sample). Deeper libraries yield more counts per gene. Samples with the same library depth are not affected by this bias relative to one another.

  • Gene length bias: The number of counts per gene is proportional to the gene length. Longer genes typically accumulate more reads than shorter genes expressed at the same level.

  • RNA composition bias: This bias occurs when some genes are very highly expressed in some samples but not in others. Even when the total number of counts is the same for every sample, the highly expressed genes take up a larger share of the reads, so genes that are equally expressed in every sample will not show a similar number of counts.
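
The RNA composition bias can be illustrated with a toy calculation. The sketch below is not part of Babelomics; the gene names, expression values and the helper `simulate_counts` are hypothetical, and it assumes that reads are distributed exactly in proportion to expression. Both samples have the same library depth, yet the three genes with identical expression end up with very different counts.

```python
def simulate_counts(true_expression, library_size):
    """Distribute a fixed number of reads proportionally to expression.

    Hypothetical helper for illustration only: real sequencing adds
    sampling noise and gene length effects that are ignored here.
    """
    total = sum(true_expression.values())
    return {gene: round(library_size * level / total)
            for gene, level in true_expression.items()}


LIBRARY_SIZE = 1_000_000  # same depth for both samples

# gene1..gene3 are equally expressed in both samples;
# gene4 is much more highly expressed in sample B only.
expression_a = {"gene1": 100, "gene2": 100, "gene3": 100, "gene4": 100}
expression_b = {"gene1": 100, "gene2": 100, "gene3": 100, "gene4": 700}

sample_a = simulate_counts(expression_a, LIBRARY_SIZE)
sample_b = simulate_counts(expression_b, LIBRARY_SIZE)

for gene in expression_a:
    print(gene, sample_a[gene], sample_b[gene])
```

In this toy setting gene1 receives 250000 reads in sample A but only 100000 in sample B, although its expression is unchanged, which is why equalizing library depth alone does not remove this bias.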

Normalization methods

Babelomics' normalization methods are:

  • Reads-Per-Kilobase-per-Million (RPKM) (Mortazavi et al., 2008): Gene counts are divided by the gene length in kilobases and by the total number of mapped reads in millions. This normalization corrects the library depth bias and the gene length bias. However, it is not recommended for differential expression analysis.

  • Trimmed Mean of M-values (TMM) (Robinson and Oshlack, 2010): A scaling factor for the library depth is computed for each sample in order to correct the RNA composition bias. Although TMM does not usually correct the gene length bias, Babelomics uses the implementation in the NOISeq package with the option that also corrects this kind of bias.
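
As a minimal sketch of the RPKM formula described above (not the Babelomics implementation), the following function normalizes the counts of a single sample. The function name `rpkm`, the gene names and the numbers are assumptions made for illustration, and gene lengths are assumed to be given in base pairs.

```python
def rpkm(counts, gene_lengths_bp):
    """Compute RPKM for one sample.

    counts: dict mapping gene -> raw read count
    gene_lengths_bp: dict mapping gene -> gene length in base pairs
    """
    total_mapped = sum(counts.values())
    if total_mapped == 0:
        raise ValueError("sample has no mapped reads")
    millions_mapped = total_mapped / 1e6
    normalized = {}
    for gene, count in counts.items():
        length_kb = gene_lengths_bp[gene] / 1000
        # reads per kilobase of gene, per million mapped reads
        normalized[gene] = count / length_kb / millions_mapped
    return normalized


# Hypothetical sample: geneB has three times the counts of geneA,
# but it is also three times longer.
counts = {"geneA": 500, "geneB": 1500, "geneC": 8000}
lengths = {"geneA": 1000, "geneB": 3000, "geneC": 2000}

result = rpkm(counts, lengths)
for gene, value in result.items():
    print(gene, round(value, 2))
```

After normalization geneA and geneB obtain approximately the same value, because the difference in their raw counts is fully explained by gene length.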

Babelomics can also automatically select and run the normalization method best suited to the data.

Automatic optimal method

Babelomics can automatically determine which normalization method best fits the data. The optimal method is chosen with the following procedure:

  • If gene length information is not available for the genes, the TMM method is recommended.

  • If gene length information is available, the diagnostic test for differences in RNA composition from the NOISeq package (Tarazona et al., 2011) is applied. For data passing the test (no RNA composition differences detected), the RPKM method is recommended. For data failing the test, the TMM method with gene length correction is recommended.
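
The selection procedure above can be sketched as a small decision function. This is an illustrative sketch, not the Babelomics code: the function name `choose_normalization` and its parameters are hypothetical, and the outcome of the NOISeq diagnostic test is assumed to be supplied by the caller as a boolean.

```python
def choose_normalization(gene_lengths_available, passes_composition_test=None):
    """Return the recommended normalization method as a string.

    gene_lengths_available: True if a length is known for the genes.
    passes_composition_test: result of the RNA composition diagnostic
        test (True = passed); only needed when lengths are available.
    """
    if not gene_lengths_available:
        # Without gene lengths, RPKM cannot be computed.
        return "TMM"
    if passes_composition_test is None:
        raise ValueError("composition test result is required "
                         "when gene lengths are available")
    if passes_composition_test:
        return "RPKM"
    return "TMM with gene length correction"


print(choose_normalization(False))
print(choose_normalization(True, passes_composition_test=True))
print(choose_normalization(True, passes_composition_test=False))
```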

References


  • Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008 Jul;5(7):621-8. doi: 10.1038/nmeth.1226. Epub 2008 May 30. PubMed PMID: 18516045.

  • Robinson MD, Oshlack A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 2010;11(3):R25. doi: 10.1186/gb-2010-11-3-r25. Epub 2010 Mar 2. PubMed PMID: 20196867; PubMed Central PMCID: PMC2864565.

  • Tarazona S, García-Alcalde F, Dopazo J, Ferrer A, Conesa A. Differential expression in RNA-seq: a matter of depth. Genome Res. 2011 Dec;21(12):2213-23. doi: 10.1101/gr.124321.111.