BigWig computation rules are slow #89
A couple of deepTools issues discussing performance. I am just posting this as a reminder that I actually looked at this, with negative results :_)
I don't think any of this is our case. Anyway, I looked at the code and I see that RPKM only uses a formula. For RPGC, what they do is go to the BAM file to calculate the fragment length (this uses a mapReduce function, but it iterates while not enough values have been found, so BAM files with scarce coverage could be a problem; that doesn't seem to be an issue for us). Then they combine those values into a composite scaling factor.

At this point of the pipeline execution we have already calculated all of those values, so we could basically feed that composite scaling factor instead of the one we use right now. I don't think this would help a lot for an individual bigWig file, but we do have a large number of bigWig files, so maybe it would shave off some minutes per file. I'm not sure this is worth pursuing, though: in a mini test on a ~500 MB BAM file it reduced the execution time by about 10 seconds out of a total of 4 minutes (on a single thread).
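As a sketch of what "feeding the scaling factor" could look like: bamCoverage accepts an explicit `--scaleFactor`, so we could compute an RPGC-style factor from values the pipeline already tracks and pass it in, skipping deepTools' own read counting and fragment-length estimation. The numbers and variable names below are placeholders, not our actual pipeline values:

```python
import subprocess

# Values the pipeline has already computed upstream (placeholders here).
n_mapped = 12_345_678                  # mapped reads in the BAM file
fragment_length = 200                  # estimated fragment length
effective_genome_size = 2_913_022_398  # e.g. the GRCh38 value from the deepTools docs

# RPGC-style "1x" scaling: genome size divided by total sequenced bases.
scale_factor = effective_genome_size / (n_mapped * fragment_length)

subprocess.run(
    [
        "bamCoverage",
        "--bam", "sample.bam",
        "--outFileName", "sample.bw",
        "--binSize", "25",
        # Explicit factor instead of --normalizeUsing RPGC, so bamCoverage
        # does not have to re-derive fragment length and read counts.
        "--scaleFactor", f"{scale_factor:.6f}",
    ],
    check=True,
)
```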
And for completeness, I want to mention deeptools/deepTools#912, where I reported that bamCoverage is slow on BAM files with very few reads.
As discussed in #80.
The `scaled_bigwig` and `unscaled_bigwig` rules are responsible for a good chunk of the total runtime of the pipeline. It is possible that bamCoverage is quite inefficient for our case, and we may want to look into alternatives. For example, it needs 100 CPU minutes to compute a 69 MB BigWig file from an 850 MB BAM file (samtools view needs 16 seconds to iterate over the same file). That is longer than the actual read mapping takes. I believe this is slower than it needs to be and that there is some potential for optimization. Especially once more bin sizes are added, this will definitely become a problem.
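For reference, one commonly suggested alternative is to build the coverage track with bedtools genomecov and convert it with UCSC's bedGraphToBigWig. The sketch below is only an illustration of the idea: the file names and the chrom.sizes path are placeholders, and whether this is actually faster on our BAM files would need benchmarking.

```python
import subprocess

bam = "sample.bam"
chrom_sizes = "genome.chrom.sizes"  # chromosome sizes, e.g. from a .fai index
bedgraph = "sample.bedgraph"
bigwig = "sample.bw"

# Stream the BAM once and emit coverage intervals in bedGraph format.
with open(bedgraph, "w") as out:
    genomecov = subprocess.Popen(
        ["bedtools", "genomecov", "-bg", "-ibam", bam],
        stdout=subprocess.PIPE,
    )
    # bedGraphToBigWig requires position-sorted input (sort -k1,1 -k2,2n).
    subprocess.run(
        ["sort", "-k1,1", "-k2,2n"],
        stdin=genomecov.stdout,
        stdout=out,
        check=True,
    )
    genomecov.stdout.close()
    genomecov.wait()

subprocess.run(["bedGraphToBigWig", bedgraph, chrom_sizes, bigwig], check=True)
```

Note that this skips deepTools entirely, so any RPKM/RPGC scaling would have to be applied to the bedGraph values by the pipeline itself.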