
New default compression level 1 is a bigger change than advertised #879

Closed
tfenne opened this issue Jul 25, 2017 · 4 comments

@tfenne
Collaborator

tfenne commented Jul 25, 2017

I expect that a lot of testing was done internally at Broad on the Intel deflater/inflater, but I'm having a hard time squaring my results with the assertions that:

  1. The Intel deflater produces BAMs at compression level 1 that are only "slightly larger" than those at the old default of 5.
  2. Reading BAMs compressed at level 1 is 3x faster than reading BAMs compressed at level 5.

The following test is 100% anecdotal, but it shows that the above is not universal. I took a single BAM file representing ~7000X coverage of a 100-gene panel and used the latest release JAR of picard to run SamFormatConverter, emitting BAMs at the default compression level (1) and again at 5, both using the Intel deflater. The BAM at compression level 1 came out at 3,152,719,064 bytes; the one at level 5 came out at 1,811,428,073 bytes. I.e., going from 5 to 1 produced a 74% increase in file size.
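For reference, the conversions were along these lines (a sketch only; file names are placeholders, and COMPRESSION_LEVEL is picard's standard common argument):

```
java -jar picard.jar SamFormatConverter \
    INPUT=panel.bam OUTPUT=panel.c1.bam COMPRESSION_LEVEL=1   # the new default
java -jar picard.jar SamFormatConverter \
    INPUT=panel.bam OUTPUT=panel.c5.bam COMPRESSION_LEVEL=5   # the old default
```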

I then ran CollectInsertSizeMetrics, reasoning that it is a Picard tool that decompresses the entire BAM but does very little beyond that (it just reads the isize field). Running on the level 1 BAM was about 13% slower than running on the level 5 BAM (1.11 minutes vs. 0.98 minutes). The test was run on a Mac with a local SSD; I would expect results to be significantly worse on any kind of network storage. I wonder whether the observed "3x faster" reading reflects only the time spent in the decompression code, without accounting for the time to read the bytes from disk?

The change in the default compression level should also, IMHO, have triggered a major version bump. It's backwards-incompatible in that the results produced are significantly different, and the change is easy to miss since it won't cause any programs to fail outright (unless they run out of disk space).

How about changing the default back to 5 until this is better understood?

@ktibbett
Contributor

@tfenne, thanks for bringing this up. We're rolling back the change to the default compression now and will do a picard release tomorrow morning. I think we'll just leave it at 5 and change the invocations on our side for the medium term.

@droazen
Contributor

droazen commented Jul 25, 2017

Intel's original evaluation was done using GATK4's PrintReads on whole-genome data:

[figure: compression_tradeoffs — benchmark chart of compression level trade-offs]

PrintReads is just a thin front-end to htsjdk's SamReader/SAMFileWriter. It's possible that performance/compression characteristics differ between smaller and larger inputs -- this is certainly worth investigating. I'll ask Intel to run some tests with BAMs smaller than a whole genome.
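For anyone who wants to reproduce this outside of GATK, the equivalent htsjdk-level loop is roughly the following (a minimal sketch against htsjdk's public SamReaderFactory/SAMFileWriterFactory API; the class name and argument handling here are placeholders of mine):

```java
import htsjdk.samtools.SAMFileWriter;
import htsjdk.samtools.SAMFileWriterFactory;
import htsjdk.samtools.SAMRecord;
import htsjdk.samtools.SamReader;
import htsjdk.samtools.SamReaderFactory;

import java.io.File;

// PrintReads-style copy: read a BAM and re-write it at a chosen compression
// level; wall-clock time and output size are the quantities being compared.
public class CompressionCopy {
    public static void main(final String[] args) throws Exception {
        final int level = Integer.parseInt(args[2]); // e.g. 1 or 5
        try (final SamReader in = SamReaderFactory.makeDefault().open(new File(args[0]));
             final SAMFileWriter out = new SAMFileWriterFactory()
                     .setCompressionLevel(level)
                     .makeBAMWriter(in.getFileHeader(), true, new File(args[1]))) {
            for (final SAMRecord rec : in) {
                out.addAlignment(rec);
            }
        }
    }
}
```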

@tfenne
Collaborator Author

tfenne commented Jul 25, 2017

@droazen - thanks. The timings make a lot more sense to me now - i.e. the 3x speedup is likely for a combined read/write cycle, where performance at level 1 is expected to be much better than at level 5; I also see approximately a 2.5-3x improvement there. Notably (and tangentially), I also see a decent speed bump at level 5 between the JDK and Intel deflaters, maybe 30-35%, which I didn't expect given I'd heard that the Intel deflater didn't have much advantage at higher compression levels.

I'm not in a great position to try full WGS BAMs, but I have some small test BAMs constructed by extracting reads overlapping a few hundred kb of genome from WGS samples. I made the table below using such a BAM built from the 1KG PCR-free WGS data for NA19625. Not ideal, but I would expect it to behave fairly similarly to a full WGS BAM for compression purposes. What I see is that at compression level 1 the Intel deflater produces a significantly larger BAM than the JDK deflater at level 1 (how I switched between the two deflaters is sketched below the table).

| Compression Level | Intel Deflater File Size | Intel Deflater % of JDK l=5 | JDK Deflater File Size | JDK Deflater % of JDK l=5 |
|---|---|---|---|---|
| 1 | 54,840,445 | 175.23% | 38,543,684 | 123.16% |
| 2 | 35,782,642 | 114.33% | 36,745,494 | 117.41% |
| 3 | 34,989,899 | 111.80% | 35,262,326 | 112.67% |
| 4 | 31,815,698 | 101.66% | 32,549,560 | 104.00% |
| 5 | 31,240,892 | 99.82% | 31,296,433 | 100.00% |
| 6 | 30,675,174 | 98.01% | 30,577,906 | 97.70% |
| 7 | 30,379,699 | 97.07% | 30,380,325 | 97.07% |
| 8 | 30,124,200 | 96.25% | 30,124,375 | 96.25% |
| 9 | 30,064,322 | 96.06% | 30,064,325 | 96.06% |
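The deflater switch itself was done before writing, along these lines (a sketch; it assumes Intel GKL's IntelDeflaterFactory class and htsjdk's static setDefaultDeflaterFactory hook, both of which should be double-checked against your htsjdk/GKL versions):

```java
import com.intel.gkl.compression.IntelDeflaterFactory; // from Intel GKL (assumed class name)
import htsjdk.samtools.util.BlockCompressedOutputStream;
import htsjdk.samtools.util.zip.DeflaterFactory;

public class DeflaterSwitch {
    public static void main(final String[] args) {
        if (args.length > 0 && "intel".equals(args[0])) {
            // Intel GKL deflater; GKL falls back to the JDK deflater if the
            // native library is unavailable on the platform.
            BlockCompressedOutputStream.setDefaultDeflaterFactory(new IntelDeflaterFactory());
        } else {
            // The base DeflaterFactory hands back a plain java.util.zip.Deflater.
            BlockCompressedOutputStream.setDefaultDeflaterFactory(new DeflaterFactory());
        }
        // ...then run the same copy loop at each compression level to fill in a row.
    }
}
```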

@ktibbett
Contributor

Fixed in 2.10.5.
