New default compression level 1 is a bigger change than advertised #879
Comments
@tfenne, thanks for bringing this up. We're rolling back the change to the default compression now and will do a Picard release tomorrow morning. I think we'll just leave it at 5 and change the invocations on our side for the medium term.
Intel's original evaluation was done using GATK4's
@droazen - thanks. At least on the timings that makes a lot more sense to me - i.e. that the 3x speedup is likely a combined read/write cycle, where the performance at level 1 is expected to be much better than at level 5. I also see approximately a 2.5-3x improvement. Notably (and tangentially), I also see a decent speed bump at level=5 between the JDK and Intel deflaters, maybe 30-35%, which I didn't expect given I'd heard that the Intel deflater didn't have much advantage at higher compression levels.

I'm not in a great position to try on full WGS BAMs, but I have some small test BAMs constructed by extracting reads overlapping a few hundred kb of genome from WGS samples. I made the table below using such a BAM built from the 1KG PCR-free WGS data for NA19625. Not ideal, but I would expect it to behave fairly similarly to a full WGS BAM for compression purposes. What I see is that at compression level 1 the Intel deflater produces a significantly larger BAM than the JDK deflater at level=1.
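The size gap between levels is easy to reproduce outside of htsjdk. A minimal sketch using Python's standard `zlib` (the same DEFLATE format that BGZF blocks use); the synthetic data and resulting sizes are purely illustrative, not BAM-specific:

```python
import random
import zlib

# Synthetic, compressible stand-in for read data (NOT a real BAM);
# a fixed seed keeps the result reproducible.
random.seed(0)
data = bytes(random.choice(b"ACGT") for _ in range(1_000_000))

sizes = {}
for level in (1, 5, 9):
    sizes[level] = len(zlib.compress(data, level))
    print(f"level {level}: {sizes[level]:,} bytes")

# Lower levels trade output size for speed: level 1 output is the largest.
assert sizes[1] >= sizes[5] >= sizes[9]
```

The magnitude of the gap depends heavily on the input's redundancy, which is presumably why BAM results vary between deflater implementations as well.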
Fixed in 2.10.5.
I expect that a lot of testing was done internally at Broad on the Intel deflater/inflater, but I'm having a hard time understanding the assertions that:

The following test is 100% anecdotal, but it highlights that the above is not universal. I took a single BAM file that represents ~7000X coverage of a 100-gene panel, and used the latest release JAR of Picard to run `SamFormatConverter` and emit BAMs at the default (`1`) compression level and again at `5`, both using the Intel deflater. The BAM at compression level `1` came out at 3,152,719,064 bytes; the one at level `5` came out at 1,811,428,073 bytes. I.e. going from `5` to `1` produced a 74% increase in file size.

I then tried running `CollectInsertSizeMetrics`, thinking that it is a Picard tool that decompresses the entire BAM but then does very little beyond that (it just reads the isize field). Running on the level-`1` BAM was slower than running on the level-`5` BAM by about 13% (1.11 minutes vs. 0.98 minutes). The test was run on a Mac with a local SSD; I would imagine results to be significantly worse if using any kind of network storage. I'm wondering if the observed "3x faster" reading is just the amount of time spent in the decompression code, and doesn't account for the time to read the bytes from disk?

The change in the default compression level should also, IMHO, have triggered a major version bump. It's non-backwards-compatible in that the results produced are significantly different, and the change is easy to miss since it won't cause any programs to outright fail (unless they run out of disk space).
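The decompression-only vs. end-to-end distinction can be sketched the same way: in isolation, inflating a level-1 stream is cheaper per byte than a level-5 stream, but a real read path also pays to pull the (larger) level-1 file off storage. Again standard-library `zlib` on synthetic data, so the numbers are only illustrative:

```python
import random
import time
import zlib

# Reproducible synthetic payload (NOT real BAM data).
random.seed(1)
data = bytes(random.choice(b"ACGTN") for _ in range(2_000_000))

blobs = {level: zlib.compress(data, level) for level in (1, 5)}
for level, blob in blobs.items():
    start = time.perf_counter()
    out = zlib.decompress(blob)
    elapsed = time.perf_counter() - start
    assert out == data  # round-trip sanity check
    print(f"level {level}: {len(blob):,} bytes stored, "
          f"decompressed in {elapsed * 1000:.2f} ms")
```

On a network filesystem, the extra bytes that a level-1 file has to move can easily swamp the CPU time saved in the inflater, which would be consistent with the slowdown observed above.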
How about changing the default back to `5` until this is better understood?
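In the meantime, callers who don't want to depend on the default can pin the level explicitly via Picard's standard `COMPRESSION_LEVEL` option. A sketch of such an invocation (the JAR path and file names are placeholders):

```shell
# Hypothetical invocation: force compression level 5 explicitly,
# regardless of what the release JAR's default happens to be.
java -jar picard.jar SamFormatConverter \
    INPUT=input.bam \
    OUTPUT=output.bam \
    COMPRESSION_LEVEL=5
```

Pinning the level in pipeline scripts also makes output sizes stable across Picard upgrades.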