BigWig computation rules are slow #89

Open · marcelm opened this issue Nov 17, 2020 · 2 comments

marcelm commented Nov 17, 2020

As discussed in #80.

The scaled_bigwig and unscaled_bigwig rules are responsible for a good chunk of the pipeline's total runtime. It is possible that bamCoverage is quite inefficient for our case, and we may want to look into alternatives. For example, it needs 100 CPU minutes to compute a 69 MB BigWig file from an 850 MB BAM file (samtools view needs 16 seconds to iterate over the same file). This takes longer than the actual read mapping. I believe this is slower than it needs to be and that there is some potential for improvement.

Especially when more bin sizes are added, this will definitely be a problem.

cnluzon commented Dec 4, 2020

Here are a couple of deepTools issues discussing performance. I am mostly posting these as a reminder that I actually looked into this, with negative results :_)

  1. Here they seem to conclude that --normalizeUsing RPGC is what slows things down, since (they argue) the get_scale_factor method takes up a large share of the execution time.

  2. Here they also complain that deepTools used to perform poorly on small BAM files because it does some subsampling in order to compute the scaling factor. They now provide the --exactScaling parameter, which skips the subsampling and uses all reads instead. I am not sure whether this could help; it depends on what they consider "big" and how much the estimation actually hurts performance.

I don't think either of these applies to our case: if they did, the unscaled rules would take significantly (or at least perceptibly) longer than the scaled ones, and according to the values in #80, this is not happening. The aggregated value is higher for the unscaled rules, but I think that is just because there are more of them (+ Input). Individual values don't seem to differ.

Anyway, I looked at the code: RPKM only uses a formula, while for RPGC they go to the BAM file to estimate the fragment length (this uses a mapReduce function, but it keeps iterating until enough values are found, so BAM files with scarce coverage could be a problem; that does not seem to be an issue for us). They then use 1.0 / (float(bam_mapped * fragment_length) / args.effectiveGenomeSize) as the scaling factor.
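To make the formula concrete, here is a minimal sketch of that computation (this is not deepTools' actual code; the function name and the example numbers are made up for illustration):

```python
# Sketch of the RPGC ("1x coverage") scaling factor described above.
# bam_mapped, fragment_length and effective_genome_size are values the pipeline
# already knows at this point; the numbers below are hypothetical.

def rpgc_scale_factor(bam_mapped: int, fragment_length: float, effective_genome_size: int) -> float:
    """Return the factor that scales coverage to 1x over the effective genome size."""
    return 1.0 / (float(bam_mapped * fragment_length) / effective_genome_size)


if __name__ == "__main__":
    bam_mapped = 10_000_000               # mapped reads in the BAM file
    fragment_length = 200                 # estimated fragment length in bp
    effective_genome_size = 2_150_570_000  # made-up effective genome size

    print(rpgc_scale_factor(bam_mapped, fragment_length, effective_genome_size))  # ~1.08
```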

At this point of the pipeline execution we have already calculated all of those values, so we could basically feed that composite scaling factor to bamCoverage instead of the one we use right now. I don't think this would help a lot for an individual bigWig file, but since we generate a large number of bigWig files, it might shave off some minutes per file. I'm not sure this is worth pursuing, though: in a mini test on a ~500 MB BAM file it reduced the execution time by about 10 seconds out of a total of 4 minutes (single thread, --binSize 50).
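For reference, a sketch of what that could look like: pass the precomputed factor via --scaleFactor and omit --normalizeUsing RPGC so bamCoverage does not re-derive it. File names and numbers here are hypothetical.

```python
# Sketch: feed bamCoverage a precomputed scaling factor instead of letting it
# derive one via --normalizeUsing RPGC. Paths and values are made up.
import subprocess

bam_mapped = 10_000_000                 # mapped reads, already known in the pipeline
fragment_length = 200                   # estimated fragment length in bp
effective_genome_size = 2_150_570_000   # made-up value for illustration

scale_factor = 1.0 / (float(bam_mapped * fragment_length) / effective_genome_size)

subprocess.run(
    [
        "bamCoverage",
        "--bam", "sample.bam",
        "--outFileName", "sample.bw",
        "--binSize", "50",
        "--scaleFactor", str(scale_factor),
    ],
    check=True,
)
```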

--binSize has a large influence on this, so we could also consider how much resolution we actually need for the other bigWig files we generate.

marcelm commented Dec 8, 2020

And for completeness, I want to mention deeptools/deepTools#912, where I reported that bamCoverage is slow on BAM files with very few reads.
