
New default compression level 1 is a bigger change than advertised #879

Closed
tfenne opened this issue Jul 25, 2017 · 4 comments

@tfenne
Collaborator

tfenne commented Jul 25, 2017

I expect that a lot of testing was done internally at Broad on the Intel deflater/inflater, but I'm having a hard time squaring my results with the assertions that:

  1. The Intel deflater produces BAMs at compression level 1 that are only "slightly larger" than those at the old default of 5.
  2. Reading BAMs compressed at level 1 is 3x faster than reading BAMs compressed at level 5.

The following test is 100% anecdotal, but it shows that the above is not universal. I took a single BAM file representing ~7000X coverage of a 100-gene panel and used the latest release JAR of picard to run SamFormatConverter, emitting BAMs at the default compression level (1) and again at 5, both using the Intel deflater. The BAM at compression level 1 came out at 3,152,719,064 bytes; the one at level 5 came out at 1,811,428,073 bytes. I.e., going from 5 to 1 produced a 74% increase in file size.
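For reference, the conversions were along these lines (a sketch only; file names are placeholders, and COMPRESSION_LEVEL is picard's standard common argument):

```
java -jar picard.jar SamFormatConverter \
    INPUT=panel.bam OUTPUT=panel.c1.bam COMPRESSION_LEVEL=1   # the new default
java -jar picard.jar SamFormatConverter \
    INPUT=panel.bam OUTPUT=panel.c5.bam COMPRESSION_LEVEL=5   # the old default
```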

I then ran CollectInsertSizeMetrics, reasoning that it is a Picard tool that decompresses the entire BAM but does very little beyond that (it just reads the isize field). Running on the level 1 BAM was about 13% slower than running on the level 5 BAM (1.11 minutes vs. 0.98 minutes). The test was run on a Mac with a local SSD; I would expect results to be significantly worse on any kind of network storage. I wonder whether the observed "3x faster" reading reflects only the time spent in the decompression code, without accounting for the time to read the bytes from disk?

The change in the default compression level should also, IMHO, have triggered a major version bump. It's backwards-incompatible in that the results produced are significantly different, and the change is easy to miss since it won't cause any programs to fail outright (unless they run out of disk space).

How about changing the default back to 5 until this is better understood?

@ktibbett
Contributor

@tfenne, thanks for bringing this up. We're rolling back the change to the default compression now and will do a picard release tomorrow morning. I think we'll just leave it at 5 and change the invocations on our side for the medium term.

@droazen
Contributor

droazen commented Jul 25, 2017

Intel's original evaluation was done using GATK4's PrintReads on whole-genome data:

[figure: compression_tradeoffs — benchmark chart of compression level trade-offs]

PrintReads is just a thin front-end to htsjdk's SamReader/SAMFileWriter. It's possible that performance/compression characteristics differ between smaller and larger inputs -- this is certainly worth investigating. I'll ask Intel to run some tests with BAMs smaller than a whole genome.
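For anyone who wants to reproduce this outside of GATK, the equivalent htsjdk-level loop is roughly the following (a minimal sketch against htsjdk's public SamReaderFactory/SAMFileWriterFactory API; the class name and argument handling here are placeholders of mine):

```java
import htsjdk.samtools.SAMFileWriter;
import htsjdk.samtools.SAMFileWriterFactory;
import htsjdk.samtools.SAMRecord;
import htsjdk.samtools.SamReader;
import htsjdk.samtools.SamReaderFactory;

import java.io.File;

// PrintReads-style copy: read a BAM and re-write it at a chosen compression
// level; wall-clock time and output size are the quantities being compared.
public class CompressionCopy {
    public static void main(final String[] args) throws Exception {
        final int level = Integer.parseInt(args[2]); // e.g. 1 or 5
        try (final SamReader in = SamReaderFactory.makeDefault().open(new File(args[0]));
             final SAMFileWriter out = new SAMFileWriterFactory()
                     .setCompressionLevel(level)
                     .makeBAMWriter(in.getFileHeader(), true, new File(args[1]))) {
            for (final SAMRecord rec : in) {
                out.addAlignment(rec);
            }
        }
    }
}
```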

@tfenne
Collaborator Author

tfenne commented Jul 25, 2017

@droazen - thanks. The timings make a lot more sense to me now - i.e. the 3x speedup is likely for a combined read/write cycle, where performance at level 1 is expected to be much better than at level 5; I also see approximately a 2.5-3x improvement there. Notably (and tangentially), I also see a decent speed bump at level 5 between the JDK and Intel deflaters, maybe 30-35%, which I didn't expect given I'd heard that the Intel deflater didn't have much advantage at higher compression levels.

I'm not in a great position to try full WGS BAMs, but I have some small test BAMs constructed by extracting reads overlapping a few hundred kb of genome from WGS samples. I made the table below using such a BAM built from the 1KG PCR-free WGS data for NA19625. Not ideal, but I would expect it to behave fairly similarly to a full WGS BAM for compression purposes. What I see is that at compression level 1 the Intel deflater produces a significantly larger BAM than the JDK deflater at level 1 (how I switched between the two deflaters is sketched below the table).

| Compression Level | Intel Deflater File Size | Intel Deflater % of JDK l=5 | JDK Deflater File Size | JDK Deflater % of JDK l=5 |
|---|---|---|---|---|
| 1 | 54,840,445 | 175.23% | 38,543,684 | 123.16% |
| 2 | 35,782,642 | 114.33% | 36,745,494 | 117.41% |
| 3 | 34,989,899 | 111.80% | 35,262,326 | 112.67% |
| 4 | 31,815,698 | 101.66% | 32,549,560 | 104.00% |
| 5 | 31,240,892 | 99.82% | 31,296,433 | 100.00% |
| 6 | 30,675,174 | 98.01% | 30,577,906 | 97.70% |
| 7 | 30,379,699 | 97.07% | 30,380,325 | 97.07% |
| 8 | 30,124,200 | 96.25% | 30,124,375 | 96.25% |
| 9 | 30,064,322 | 96.06% | 30,064,325 | 96.06% |
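The deflater switch itself was done before writing, along these lines (a sketch; it assumes Intel GKL's IntelDeflaterFactory class and htsjdk's static setDefaultDeflaterFactory hook, both of which should be double-checked against your htsjdk/GKL versions):

```java
import com.intel.gkl.compression.IntelDeflaterFactory; // from Intel GKL (assumed class name)
import htsjdk.samtools.util.BlockCompressedOutputStream;
import htsjdk.samtools.util.zip.DeflaterFactory;

public class DeflaterSwitch {
    public static void main(final String[] args) {
        if (args.length > 0 && "intel".equals(args[0])) {
            // Intel GKL deflater; GKL falls back to the JDK deflater if the
            // native library is unavailable on the platform.
            BlockCompressedOutputStream.setDefaultDeflaterFactory(new IntelDeflaterFactory());
        } else {
            // The base DeflaterFactory hands back a plain java.util.zip.Deflater.
            BlockCompressedOutputStream.setDefaultDeflaterFactory(new DeflaterFactory());
        }
        // ...then run the same copy loop at each compression level to fill in a row.
    }
}
```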

@ktibbett
Contributor

Fixed in 2.10.5.
