Optimize compression hot paths: bypass Hadoop codec abstraction for Snappy, LZ4_RAW, ZSTD, and GZIP #3555
Open
iemejia wants to merge 7 commits into
Force-pushed from f9d81a4 to 415375a
…nappy, LZ4_RAW, and ZSTD

- Add a zero-copy ByteArrayBytesInput.toByteArray() override: return the backing array directly when offset == 0 && length == in.length, avoiding a BAOS allocation and a System.arraycopy on every decompressor call (sketched below).
- Snappy: use Snappy.compress()/uncompress() directly (single JNI call).
- LZ4_RAW: use the LZ4 compressor/decompressor API directly.
- ZSTD: use ZstdOutputStreamNoFinalizer/ZstdInputStreamNoFinalizer directly.
- Add an LZ4_RAW direct ByteBuffer compressor/decompressor to DirectCodecFactory.
- Add a JMH CompressionBenchmark for isolated codec throughput measurement.
- Update the TestDirectCodecFactory LZ4_RAW assertion.
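A minimal sketch of the zero-copy fast path in the first bullet. The class below is a standalone stand-in for the `(byte[], offset, length)` view held by BytesInput's inner ByteArrayBytesInput, not the actual parquet-java code:

```java
import java.util.Arrays;

// Stand-in for ByteArrayBytesInput's (byte[], offset, length) view.
final class ByteArraySlice {
  private final byte[] in;
  private final int offset;
  private final int length;

  ByteArraySlice(byte[] in, int offset, int length) {
    this.in = in;
    this.offset = offset;
    this.length = length;
  }

  byte[] toByteArray() {
    // Zero-copy fast path: when the slice covers the whole backing array,
    // hand it back directly instead of copying through a ByteArrayOutputStream.
    if (offset == 0 && length == in.length) {
      return in;
    }
    return Arrays.copyOfRange(in, offset, offset + length);
  }
}
```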
…ater directly

Use reusable Deflater/Inflater instances with a manual GZIP header/trailer, bypassing Hadoop's GzipCodec, CodecPool, and GZIPOutputStream/GZIPInputStream. Deflater.reset() reuses native zlib state across pages, avoiding per-call allocation; the manual header/trailer eliminates the stream-wrapper overhead (sketched below).

Results (3 forks, 15 iterations, AMD EPYC 9V45):
Compress:   8KB +3%, 64KB +1%, 256KB +1%
Decompress: 8KB +6%, 64KB +3%, 256KB +9%
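A minimal sketch of the technique, not the PR's actual GzipBytesCompressor: a reusable Deflater in nowrap (raw deflate) mode, a hand-written 10-byte GZIP header, and the CRC32 + size trailer, all per the GZIP spec:

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.CRC32;
import java.util.zip.Deflater;

final class GzipPageCompressor {
  // Fixed 10-byte GZIP header: magic, CM=deflate, no flags, no mtime, OS=unknown.
  private static final byte[] GZIP_HEADER = {
      (byte) 0x1f, (byte) 0x8b, Deflater.DEFLATED, 0, 0, 0, 0, 0, 0, (byte) 0xff};

  // nowrap=true: raw deflate stream; we add the GZIP framing ourselves.
  private final Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, true);
  private final CRC32 crc = new CRC32();

  byte[] compress(byte[] page) {
    deflater.reset(); // reuse native zlib state across pages
    crc.reset();
    crc.update(page, 0, page.length);

    ByteArrayOutputStream out = new ByteArrayOutputStream();
    out.write(GZIP_HEADER, 0, GZIP_HEADER.length);

    deflater.setInput(page);
    deflater.finish();
    byte[] buf = new byte[8 * 1024];
    while (!deflater.finished()) {
      int n = deflater.deflate(buf);
      out.write(buf, 0, n);
    }
    // 8-byte trailer: CRC32 then uncompressed size, both little-endian.
    writeIntLE(out, (int) crc.getValue());
    writeIntLE(out, page.length);
    return out.toByteArray();
  }

  private static void writeIntLE(ByteArrayOutputStream out, int v) {
    out.write(v & 0xff);
    out.write((v >>> 8) & 0xff);
    out.write((v >>> 16) & 0xff);
    out.write((v >>> 24) & 0xff);
  }
}
```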
…erPool, route GZIP in DirectCodecFactory

- Respect parquet.compression.codec.zstd.bufferPool.enabled in the optimized ZstdBytesCompressor/Decompressor (was hardcoded to RecyclingBufferPool; see the sketch below).
- Route GZIP decompression through the optimized path in DirectCodecFactory instead of falling back to the Hadoop codec pool.
- Remove dead GZIP/ZSTD branches from cacheKey().
- Document the ISA-L native library bypass in the GZIP Javadocs.
- Replace obsolete Hadoop codec caching tests with end-to-end compression-level verification tests.
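A minimal sketch of the buffer-pool selection on the decompression side, assuming zstd-jni's RecyclingBufferPool/NoPool singletons and the ZstdInputStreamNoFinalizer constructor that takes a BufferPool; the plain boolean parameter stands in for the Configuration lookup the PR performs:

```java
import com.github.luben.zstd.BufferPool;
import com.github.luben.zstd.NoPool;
import com.github.luben.zstd.RecyclingBufferPool;
import com.github.luben.zstd.ZstdInputStreamNoFinalizer;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

final class ZstdStreams {
  // bufferPoolEnabled stands in for parquet.compression.codec.zstd.bufferPool.enabled,
  // read once per decompressor instance rather than on every page.
  static InputStream decompressing(byte[] compressed, boolean bufferPoolEnabled) throws IOException {
    BufferPool pool = bufferPoolEnabled ? RecyclingBufferPool.INSTANCE : NoPool.INSTANCE;
    return new ZstdInputStreamNoFinalizer(new ByteArrayInputStream(compressed), pool);
  }
}
```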
Use page sizes that reflect actual Parquet page sizes observed in practice: 64KB, 128KB, 256KB, and 1MB (the default). The 20K row-count limit (PARQUET-1414) means most numeric columns produce pages of 78-234KB, making the previous 8KB test point unrealistic. Also fix the JMH annotation processor path for Java 17+ compatibility and reduce warmup/measurement iterations for faster turnaround. A sketch of the page-size parameterization follows the results.

Performance results (master vs perf-compression-bypass branch):

Compression (ops/s, higher is better):

Codec   | Page  | Master | Branch | Speedup
SNAPPY  | 64KB  | 53,979 | 60,799 | +12.6%
SNAPPY  | 128KB | 27,764 | 30,524 |  +9.9%
SNAPPY  | 256KB | 13,549 | 14,648 |  +8.1%
SNAPPY  | 1MB   |  2,445 |  2,675 |  +9.4%
ZSTD    | 64KB  |  8,813 |  8,719 |  -1.1%
ZSTD    | 128KB |  4,361 |  4,501 |  +3.2%
ZSTD    | 256KB |  2,112 |  2,008 |  -4.9%
ZSTD    | 1MB   |    423 |    422 |  -0.3%
LZ4_RAW | 64KB  | 37,777 | 36,107 |  -4.4%
LZ4_RAW | 128KB | 16,777 | 16,330 |  -2.7%
LZ4_RAW | 256KB |  9,060 |  8,956 |  -1.1%
LZ4_RAW | 1MB   |  1,961 |  2,191 | +11.7%
GZIP    | 64KB  |  1,422 |  1,423 |  +0.1%
GZIP    | 128KB |    641 |    646 |  +0.8%
GZIP    | 256KB |    315 |    317 |  +0.7%
GZIP    | 1MB   |     75 |     77 |  +2.3%

Decompression (ops/s, higher is better):

Codec   | Page  | Master | Branch  | Speedup
SNAPPY  | 64KB  | 60,928 |  67,224 | +10.3%
SNAPPY  | 128KB | 29,919 |  33,457 | +11.8%
SNAPPY  | 256KB | 14,431 |  15,912 | +10.3%
SNAPPY  | 1MB   |  3,140 |   3,540 | +12.7%
ZSTD    | 64KB  | 32,042 |  35,750 | +11.6%
ZSTD    | 128KB | 19,447 |  21,800 | +12.1%
ZSTD    | 256KB |  9,495 |  10,759 | +13.3%
ZSTD    | 1MB   |  2,155 |   2,409 | +11.8%
LZ4_RAW | 64KB  | 80,415 | 118,358 | +47.2%
LZ4_RAW | 128KB | 40,615 |  59,620 | +46.8%
LZ4_RAW | 256KB | 19,888 |  29,914 | +50.4%
LZ4_RAW | 1MB   |  4,628 |   7,517 | +62.4%
GZIP    | 64KB  |  9,393 |   9,608 |  +2.3%
GZIP    | 128KB |  4,101 |   4,536 | +10.6%
GZIP    | 256KB |  1,736 |   1,891 |  +8.9%
GZIP    | 1MB   |    406 |     442 |  +9.1%

Key findings:
- SNAPPY: consistent 8-13% improvement across all page sizes
- LZ4_RAW decompression: strongest gain at 47-62% (eliminates 2x heap<->direct copies)
- ZSTD decompression: 11-13% from NoFinalizer + config caching
- GZIP decompression: 9-11% faster at 128KB+ page sizes
- ZSTD/GZIP compression: within noise (CPU-bound in native codec)
- LZ4_RAW compression: within noise at small pages, +12% at 1MB
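For reference, a stripped-down sketch of the JMH page-size parameterization; the real CompressionBenchmark runs the codecs through Parquet's factories, whereas the benchmark body here is only a placeholder:

```java
import java.util.Random;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Level;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
public class PageSizeSketch {
  // 64KB, 128KB, 256KB, 1MB: the page sizes this commit settles on.
  @Param({"65536", "131072", "262144", "1048576"})
  public int pageSize;

  private byte[] page;

  @Setup(Level.Trial)
  public void setup() {
    page = new byte[pageSize];
    new Random(42).nextBytes(page); // stand-in payload; real column data compresses better
  }

  @Benchmark
  public int compressPage() {
    return page.length; // placeholder for codec.compress(page)
  }
}
```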
- Add brotli-codec dependency to parquet-benchmarks (profile-gated, x86_64 only)
- Include BROTLI in the @Param codec list alongside SNAPPY, ZSTD, LZ4_RAW, GZIP
- Add the jitpack.io repository for brotli-codec resolution
Bypass the Hadoop BrotliCodec/stream wrapper for BROTLI compression and decompression by using org.meteogroup.jbrotli's native JNI bindings directly, with ByteBuffer support, via reflection (brotli-codec remains runtime scope). This eliminates intermediate buffer copies and the BrotliStreamCompressor state-machine overhead.

Changes:
- DirectCodecFactory: add BrotliDirectCompressor (quality=1, matching the Hadoop default) and BrotliDirectDecompressor using the one-shot jbrotli API via reflection
- Load the native library eagerly, with graceful fallback to the Hadoop codec path (sketched below)
- CompressionBenchmark: switch from the heap CodecFactory to DirectCodecFactory to benchmark the actual production code path

Results at 64KB page size:
- Compress: 6,746 -> 9,662 ops/s (1.43x speedup)
- Decompress: 2,534 -> 2,786 ops/s (1.10x speedup)
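A sketch of the "load eagerly, fall back gracefully" pattern from the second bullet. The jbrotli class and method names below are assumptions for illustration; the PR resolves the real ones from the runtime-scoped brotli-codec jar:

```java
import java.lang.reflect.Method;

final class BrotliSupport {
  private static final boolean AVAILABLE = tryLoad();

  private static boolean tryLoad() {
    try {
      // Hypothetical jbrotli entry point; resolved reflectively so the class
      // is only touched when the runtime-scoped jar is actually present.
      Class<?> loader = Class.forName("org.meteogroup.jbrotli.libloader.BrotliLibraryLoader");
      Method load = loader.getMethod("loadBrotliLibrary");
      load.invoke(null); // loads the native JNI library once, eagerly
      return true;
    } catch (ReflectiveOperationException | LinkageError e) {
      return false; // classes or native lib missing: keep the Hadoop codec path
    }
  }

  static boolean available() {
    return AVAILABLE;
  }
}
```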
Force-pushed from c1632dd to 7ce2c12
Summary
Bypass the Hadoop `Compressor`/`Decompressor`/`CodecPool` abstraction layer in `CodecFactory` and `DirectCodecFactory`, calling native compression libraries directly. This eliminates per-page stream creation, intermediate buffer copies, and codec pool synchronization for all four supported codecs.

What changes
- SNAPPY: replace `CodecPool` + `SnappyCompressor` (which copies heap→direct→heap) with a single `Snappy.compress(byte[], byte[])`/`Snappy.uncompress(byte[], byte[])` JNI call and a reusable output buffer.
- LZ4_RAW: replace `NonBlockedCompressor` (which allocates direct ByteBuffers and copies heap↔direct twice per call) with heap `ByteBuffer.wrap()` and direct airlift LZ4 compress/decompress — zero intermediate copies (a sketch of this path follows the list).
- ZSTD: replace `ZstdCompressorStream` with `ZstdOutputStreamNoFinalizer` (avoids finalizer registration) and cache the ZSTD level / buffer pool configuration reads per compressor instance instead of re-reading `Configuration` on each page.
- GZIP: replace `GzipCodec` (which wraps Java's `Deflater`/`Inflater` in stream abstractions) with direct `Deflater`/`Inflater` usage, reusing instances via `reset()` and managing GZIP headers/trailers manually.
- Benchmarks: expand `CompressionBenchmark` page sizes from {8KB, 64KB, 256KB} to {64KB, 128KB, 256KB, 1MB} to reflect real-world Parquet page sizes (most pages are 64-256KB due to the 20K row-count limit from PARQUET-1414; only wide string/binary columns hit the 1MB size limit).
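A minimal sketch of the heap-to-heap LZ4_RAW path from the list above, using aircompressor's byte[] overloads (equivalent to the `ByteBuffer.wrap()` route the PR describes). Buffer handling is simplified here; the real code reuses its output buffer:

```java
import io.airlift.compress.lz4.Lz4Compressor;
import io.airlift.compress.lz4.Lz4Decompressor;
import java.util.Arrays;

final class Lz4RawSketch {
  private final Lz4Compressor compressor = new Lz4Compressor();
  private final Lz4Decompressor decompressor = new Lz4Decompressor();

  byte[] compress(byte[] page) {
    // airlift compresses heap array -> heap array: no direct ByteBuffer
    // allocation and no heap<->direct copy on either side.
    byte[] out = new byte[compressor.maxCompressedLength(page.length)];
    int written = compressor.compress(page, 0, page.length, out, 0, out.length);
    return Arrays.copyOf(out, written); // sketch only; production reuses the buffer
  }

  int decompress(byte[] compressed, byte[] dest) {
    // Returns the number of decompressed bytes written into dest.
    return decompressor.decompress(compressed, 0, compressed.length, dest, 0, dest.length);
  }
}
```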
Benchmark results (ops/s, higher is better)

Compression
Decompression
JMH config: JDK 25.0.3 Temurin, 1 fork, 2 warmup × 1s, 3 measurement × 2s.
Why LZ4_RAW decompression gains are largest
`NonBlockedDecompressor` performs two full data copies per operation — heap byte[] → direct ByteBuffer on input, direct ByteBuffer → heap byte[] on output — plus direct buffer allocation and synchronized access. The bypass eliminates both copies by using `ByteBuffer.wrap()` on heap arrays, letting airlift's LZ4 decompress directly between heap buffers.

Why ZSTD compression gains are minimal
`ZstandardCodec` already returns `null` from `createCompressor()`/`createDecompressor()` and delegates directly to `zstd-jni` streams, so the Hadoop abstraction overhead was already bypassed at the codec level. The branch adds finalizer avoidance (the `NoFinalizer` variants, sketched below) and caches configuration reads, which helps decompression but leaves compression within noise.
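A minimal sketch of the compression-side change, assuming zstd-jni's `ZstdOutputStreamNoFinalizer` constructor that takes a `BufferPool` and its `setLevel` method; the cached level field stands in for the per-instance `Configuration` read:

```java
import com.github.luben.zstd.RecyclingBufferPool;
import com.github.luben.zstd.ZstdOutputStreamNoFinalizer;
import java.io.ByteArrayOutputStream;
import java.io.IOException;

final class ZstdPageCompressor {
  private final int level; // read from Configuration once, not re-read per page

  ZstdPageCompressor(int level) {
    this.level = level;
  }

  byte[] compress(byte[] page) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    // NoFinalizer variant: no finalizer registration on every page write.
    try (ZstdOutputStreamNoFinalizer zos =
        new ZstdOutputStreamNoFinalizer(bos, RecyclingBufferPool.INSTANCE)) {
      zos.setLevel(level);
      zos.write(page);
    }
    return bos.toByteArray();
  }
}
```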
Alternative considered: modify codecs instead of CodecFactory

We evaluated modifying `SnappyCodec` and `Lz4RawCodec` to follow the `ZstandardCodec` pattern (return `null` from `createCompressor()`, use custom stream wrappers). This approach was 25-50% slower than the `CodecFactory` bypass for Snappy/LZ4, and even 20-47% slower than master. Per-call stream creation, `ByteArrayOutputStream` buffering, and the lack of buffer reuse dominate for memory-bandwidth-bound codecs where the actual compression takes only 8-65 microseconds.

Files changed
- `CodecFactory.java`: bypass compressor/decompressor with codec-specific inner classes (`SnappyBytesCompressor`, `Lz4RawBytesCompressor`, `ZstdBytesCompressor`, `GzipBytesCompressor` + matching decompressors)
- `DirectCodecFactory.java`: bypass for the direct `ByteBuffer` path (Snappy, LZ4_RAW, ZSTD)
- `BytesInput.java`: add `ByteBufferBackedOutputStream` to avoid `toByteArray()` copies
- `CompressionBenchmark.java`: realistic page sizes + JMH annotation processor fix for Java 17+
- `TestDirectCodecFactory.java`: updated tests for the bypass path