
bamCoverage and RNA-seq data #401

Closed
friedue opened this issue Aug 11, 2016 · 18 comments


@friedue
Collaborator

friedue commented Aug 11, 2016

I rarely use deepTools for RNA-seq data, so apologies if that turns out to be a trivial issue.
I'm just wondering whether the documentation in regard to the RNA-seq handling should be improved.

Let's assume I have several RNA-seq data sets, just plain ol' single read. Currently, we basically say "run bamCoverage in default": bamCoverage -b reads.bam -o coverage.bw.

If bamCoverage is used in default mode, I am assuming that the bigWig will simply contain the counts per bin? No normalization for sequencing depth or the like, right?
Should one not use the --skipNonCoveredRegions with RNA-seq since the vast majority of the genome is not expected to be covered? Will this influence the way the bins are made, i.e., will this force a behavior where the bins will focus on transcribed regions? Or should we simply recommend to have bin size = 1 bp for RNA-seq data?

I was asked whether there's a way to get a "normalized" RNA-seq coverage track using deepTools. I guess, --normalizeTo1x should be avoided for RNA-seq since the effective genome size is not really appropriate here (neither is the fragment size, particularly not for paired-end data!)?
--normalizeUsingRPKM is also a bit misleading since RPKMs in the RNA-seq world refer to the transcriptome length (which is a thorny issue in itself). With -bs 1 this option would basically result in a simple division by total counts, which is perhaps the best we can do?

How have you used bamCoverage with RNA-seq data?

@friedue
Collaborator Author

friedue commented Aug 11, 2016

Alright, I'll just keep on rambling further until someone stops me :)
So I thought to myself: if someone's not happy with the two normalization options deepTools offers, she could just use the --scaleFactor option, e.g., supplying the inverse of the size factor calculated by DESeq2 (this was actually proposed by someone on the deepTools mailing list, if I remember correctly). The --help for bamCoverage currently states that the scaleFactor will be used to multiply the scaling factor determined by either --normalizeTo1x or --normalizeUsingRPKM. This is quite different from what I thought was happening, because I thought the scaleFactor was used to multiply the number of (raw) reads.
Also - not that it ever mattered to me before, but it may be something we should clarify - is there no way to turn off the normalization?
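For reference, DESeq2's size factors come from the median-of-ratios method. Here's a minimal pure-Python sketch of that computation (toy counts, not a replacement for DESeq2's actual implementation, which also handles zero counts by dropping those genes), with the inverse being what one would hand to --scaleFactor:

```python
import math
import statistics

def size_factor(counts, all_samples):
    """Median-of-ratios size factor (DESeq2-style sketch).
    counts: per-gene counts for one sample.
    all_samples: list of per-gene count lists, one per sample.
    Assumes no zero counts (DESeq2 drops genes with zeros here)."""
    n_genes = len(counts)
    # Pseudo-reference: geometric mean of each gene's counts across samples.
    ref = [math.exp(sum(math.log(s[g]) for s in all_samples) / len(all_samples))
           for g in range(n_genes)]
    # Size factor = median of the per-gene ratios sample / reference.
    return statistics.median(counts[g] / ref[g] for g in range(n_genes))

# Toy example: two samples, the second sequenced twice as deep.
s1 = [10, 20, 30, 40]
s2 = [20, 40, 60, 80]
sf = size_factor(s2, [s1, s2])
scale_factor_for_bamCoverage = 1.0 / sf  # supply this via --scaleFactor
```

In the toy example sf comes out as sqrt(2), so the deeper sample's bin counts get scaled down, which is the intended behavior.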

@dpryan79
Collaborator

If you're not filtering anything then the counts are just the reads/bin (i.e., the final scale factor is really 1). If you do any filtering then use the --verbose option so the final scaling factor will get printed. You can then rerun things and specify the inverse as the scale factor.

--skipNonCoveredRegions just prevents bins with 0 in them from being written; it won't make any other difference. The bins are the same (both the values and where they are) regardless of whether that's used.

The most DESeq2-like normalization is probably SES, since it should at least theoretically be more robust to outliers. One could use estimateScaleFactor with the same reference sample to get appropriate scaling factors. I'm not sure how well that'd work in practice, since SES tends to bork when it gets a lot of 0 bins. The best bet is probably what you suggested regarding directly using the DESeq2 scale factors.

Regarding --normalizeTo1x, it's effectively equivalent to the older "library size normalization" in RNAseq, which is known to be problematic.

The only real thing to keep in mind with RNAseq is to not extend reads, since that really mucks up introns.

BTW, if you specify a scale factor, then it's used directly unless you use the RPKM normalization or do some filtering. The final bin count is then scale factor * number of reads.

@fidelram
Collaborator

I suppose that --normalizeTo1x using the exome length as effective genome size could produce meaningful results that can be compared between samples. Just keep in mind the issue with library size normalization mentioned by Devon.
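To make the suggestion concrete: --normalizeTo1x scales bin counts by effective genome size / (number of mapped reads × fragment or read length). A sketch of what swapping in an exome length would do (the 75 Mb exome size and the read numbers are placeholder assumptions, not values from this thread):

```python
def one_x_scale_factor(effective_size, n_mapped_reads, read_length):
    # bamCoverage --normalizeTo1x multiplies bin counts by this factor,
    # so that mean coverage over `effective_size` bases becomes 1x.
    return effective_size / (n_mapped_reads * read_length)

# Made-up numbers: 30 million single-end 50 bp reads.
whole_genome = one_x_scale_factor(2_451_960_000, 30_000_000, 50)  # genome-sized
exome_only = one_x_scale_factor(75_000_000, 30_000_000, 50)       # ~75 Mb exome (assumption)
```

Using the exome length gives a much smaller scale factor, i.e., it targets 1x coverage over the transcribed fraction rather than the whole genome.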

Regarding the SES normalization, I think the scaling factors would end up being based on the background read coverage, which most likely comes from non-exonic regions with very little coverage. Thus, the results may be unreliable.

Finally, the confusion about the scaling factors comes from the fact that if you combine --normalizeTo1x and a scale factor, deepTools respects your wish and applies the given scale factor on top of the --normalizeTo1x scaling. If --normalizeTo1x is not given, then only the scale factor is applied.

@friedue
Collaborator Author

friedue commented Aug 12, 2016

Hi guys,

thanks for the info.
So, to sum up, the recommendations would be:

  • neither --normalizeTo1x nor --normalizeUsingRPKM; instead, --scaleFactor set to 1/<DESeq2's sizeFactor>
  • no read extension (which is already the default)
  • --skipNonCoveredRegions to decrease the size of the output file

Any opinions on the influence of the bin size?

My remaining confusion stems from the observation that deepTools tells me it's normalizing for depth when I'm not specifying either of the normalization methods:

$ bamCoverage -b Aligned.bam -o Coverage.bw
normalization: depth
minFragmentLength: 0
verbose: False
out_file_for_raw_data: None
numberOfSamples: None
bedFile: None
ignoreDuplicates: False
numberOfProcessors: 16
samFlag_exclude: None
save_data: False
stepSize: 50
smoothLength: None
center_read: False
defaultFragmentLength: read length
chrsToSkip: []
region: None
maxPairedFragmentLength: 1000
samFlag_include: None
binLength: 50
blackListFileName: None
maxFragmentLength: 0
minMappingQuality: None
zerosToNans: False

@dpryan79
Collaborator

Oh, someone requested the "normalization: depth" output, though we should change the wording. It should probably be something like "percentage kept after filtering" or "None". I'd vote for "None", but have no strong opinion on it.

For the bin size, I guess it depends on the goals. The default size is probably OK as long as people just want to look at things in IGV.

@friedue
Collaborator Author

friedue commented Aug 12, 2016

I would also vote for "normalization: none" (if nothing is done to the read counts except aggregating them over bins)

The bins will be "drawn" regardless of the annotation, right? So, in principle, read counts that originated from exons could "bleed" into intronic regions. I'm not saying that's a huge concern necessarily, but maybe we should make a note of that in the documentation.

@dpryan79
Collaborator

Correct, bamCoverage doesn't take gene annotation into account when drawing bins. I should note that if someone wants to do a "metagene" plot at some point, then smaller bins probably make more sense, though 1 base bins are likely overkill (having said that, that's what I used in those cases).

@vivekbhr
Member

vivekbhr commented Aug 12, 2016

Wait, I missed the discussion before, but why wouldn't --normalizeUsingRPKM be one of the recommended options? If the bin size is smaller than 1 kb then the RPKM gets scaled down, but it's still valid.

Plus the size factor calculated by DESeq could be misleading since it's based on the total annotated gene counts, and you don't always want to normalize by this.

@dpryan79
Collaborator

RPKM is fine, it's just that people probably naively think it's doing something it's not. I would generally prefer using the scale factor, since it's more robust, but as long as rRNAs are blacklisted or aren't a problem, RPKM should work too.

Speaking of that, at least for the atypical kinds of RNAseq (e.g., RiboSeq), you end up needing to do a LOT of blacklisting of regions so they don't throw off the scaling. I expect that'll be the case for other RNAseq variants or even normal RNAseq if it wasn't polyA selected.

@friedue
Collaborator Author

friedue commented Aug 12, 2016

Two points:

  • You can have a bin size smaller than 1 bp? When would you choose that?
  • RPKM (per bin) = number of reads per bin / (number of mapped reads in millions * bin length in kb) -- this is basically normalization by library size, too. I mostly wanted to deprecate it because its name is misleading (people will not get the same values they would get if they generated RPKM values with, say, edgeR), but you're right, it may be a good alternative to having to resort to featureCounts and DESeq2 when all you want is to look at the signal in a manner that at least accounts for differences in sequencing depth.
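The per-bin RPKM formula in the second bullet can be sketched in a few lines (toy numbers), which also shows why a 1 bp bin reduces it to division by total counts up to a constant:

```python
def rpkm_per_bin(reads_in_bin, total_mapped_reads, bin_length_bp):
    # RPKM (per bin) = reads in bin / (mapped reads in millions * bin length in kb)
    return reads_in_bin / ((total_mapped_reads / 1e6) * (bin_length_bp / 1e3))

# With -bs 1 this collapses to (reads / total) times a constant (1e9):
single_bp = rpkm_per_bin(3, 30_000_000, 1)  # == (3 / 30_000_000) * 1e9
```

So with -bs 1 every bin's value is just its read count divided by the library size, scaled by 10^9, i.e., plain sequencing-depth normalization under another name.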

@dpryan79
Collaborator

@friedue 1kb, not 1bp :)

@friedue
Collaborator Author

friedue commented Aug 12, 2016

blacklisting rRNAs and tRNAs is a very good point!

@dpryan79
Collaborator

Are there any remaining issues surrounding this, or should we close it?

@vivekbhr
Member

close.. 👍

@friedue
Collaborator Author

friedue commented Aug 15, 2016

I haven't yet changed the documentation to reflect these recommendations. I can assign it to myself and aim to get it done this week.

@dpryan79
Collaborator

Sounds good!

@friedue
Collaborator Author

friedue commented Aug 15, 2016

I assume that the same points we discussed for single files will also apply to bamCompare? The recommendation being that SES normalization should not be applied, fragmentLength extension should not be turned on and, if wanted, the DESeq size factors can be supplied. Anything else?

@dpryan79
Copy link
Collaborator

Yep, they apply to bamCompare too.
