Optimize compression hot paths: bypass Hadoop codec abstraction for Snappy, LZ4_RAW, ZSTD, and GZIP #3555
Open
iemejia wants to merge 7 commits into
Force-pushed from f9d81a4 to 415375a
…nappy, LZ4_RAW, and ZSTD

- Add a zero-copy ByteArrayBytesInput.toByteArray() override: return the backing array directly when offset == 0 && length == in.length, avoiding a BAOS allocation and a System.arraycopy on every decompressor call (sketched below).
- Snappy: use Snappy.compress()/uncompress() directly (single JNI call).
- LZ4_RAW: use the LZ4 compressor/decompressor API directly.
- ZSTD: use ZstdOutputStreamNoFinalizer/ZstdInputStreamNoFinalizer directly.
- Add an LZ4_RAW direct ByteBuffer compressor/decompressor to DirectCodecFactory.
- Add a JMH CompressionBenchmark for isolated codec throughput measurement.
- Update the TestDirectCodecFactory LZ4_RAW assertion.
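A minimal sketch of the zero-copy fast path in the first bullet. The class below is a standalone stand-in for the `(byte[], offset, length)` view held by BytesInput's inner ByteArrayBytesInput, not the actual parquet-java code:

```java
import java.util.Arrays;

// Stand-in for ByteArrayBytesInput's (byte[], offset, length) view.
final class ByteArraySlice {
  private final byte[] in;
  private final int offset;
  private final int length;

  ByteArraySlice(byte[] in, int offset, int length) {
    this.in = in;
    this.offset = offset;
    this.length = length;
  }

  byte[] toByteArray() {
    // Zero-copy fast path: when the slice covers the whole backing array,
    // hand it back directly instead of copying through a ByteArrayOutputStream.
    if (offset == 0 && length == in.length) {
      return in;
    }
    return Arrays.copyOfRange(in, offset, offset + length);
  }
}
```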
…ater directly

Use reusable Deflater/Inflater instances with a manual GZIP header/trailer, bypassing Hadoop's GzipCodec, CodecPool, and GZIPOutputStream/GZIPInputStream. Deflater.reset() reuses native zlib state across pages, avoiding per-call allocation; the manual header/trailer eliminates the stream-wrapper overhead (sketched below).

Results (3 forks, 15 iterations, AMD EPYC 9V45):
Compress:   8KB +3%, 64KB +1%, 256KB +1%
Decompress: 8KB +6%, 64KB +3%, 256KB +9%
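A minimal sketch of the technique, not the PR's actual GzipBytesCompressor: a reusable Deflater in nowrap (raw deflate) mode, a hand-written 10-byte GZIP header, and the CRC32 + size trailer, all per the GZIP spec:

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.CRC32;
import java.util.zip.Deflater;

final class GzipPageCompressor {
  // Fixed 10-byte GZIP header: magic, CM=deflate, no flags, no mtime, OS=unknown.
  private static final byte[] GZIP_HEADER = {
      (byte) 0x1f, (byte) 0x8b, Deflater.DEFLATED, 0, 0, 0, 0, 0, 0, (byte) 0xff};

  // nowrap=true: raw deflate stream; we add the GZIP framing ourselves.
  private final Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, true);
  private final CRC32 crc = new CRC32();

  byte[] compress(byte[] page) {
    deflater.reset(); // reuse native zlib state across pages
    crc.reset();
    crc.update(page, 0, page.length);

    ByteArrayOutputStream out = new ByteArrayOutputStream();
    out.write(GZIP_HEADER, 0, GZIP_HEADER.length);

    deflater.setInput(page);
    deflater.finish();
    byte[] buf = new byte[8 * 1024];
    while (!deflater.finished()) {
      int n = deflater.deflate(buf);
      out.write(buf, 0, n);
    }
    // 8-byte trailer: CRC32 then uncompressed size, both little-endian.
    writeIntLE(out, (int) crc.getValue());
    writeIntLE(out, page.length);
    return out.toByteArray();
  }

  private static void writeIntLE(ByteArrayOutputStream out, int v) {
    out.write(v & 0xff);
    out.write((v >>> 8) & 0xff);
    out.write((v >>> 16) & 0xff);
    out.write((v >>> 24) & 0xff);
  }
}
```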
…erPool, route GZIP in DirectCodecFactory

- Respect parquet.compression.codec.zstd.bufferPool.enabled in the optimized ZstdBytesCompressor/Decompressor (was hardcoded to RecyclingBufferPool; see the sketch below).
- Route GZIP decompression through the optimized path in DirectCodecFactory instead of falling back to the Hadoop codec pool.
- Remove dead GZIP/ZSTD branches from cacheKey().
- Document the ISA-L native library bypass in the GZIP Javadocs.
- Replace obsolete Hadoop codec caching tests with end-to-end compression-level verification tests.
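A minimal sketch of the buffer-pool selection on the decompression side, assuming zstd-jni's RecyclingBufferPool/NoPool singletons and the ZstdInputStreamNoFinalizer constructor that takes a BufferPool; the plain boolean parameter stands in for the Configuration lookup the PR performs:

```java
import com.github.luben.zstd.BufferPool;
import com.github.luben.zstd.NoPool;
import com.github.luben.zstd.RecyclingBufferPool;
import com.github.luben.zstd.ZstdInputStreamNoFinalizer;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

final class ZstdStreams {
  // bufferPoolEnabled stands in for parquet.compression.codec.zstd.bufferPool.enabled,
  // read once per decompressor instance rather than on every page.
  static InputStream decompressing(byte[] compressed, boolean bufferPoolEnabled) throws IOException {
    BufferPool pool = bufferPoolEnabled ? RecyclingBufferPool.INSTANCE : NoPool.INSTANCE;
    return new ZstdInputStreamNoFinalizer(new ByteArrayInputStream(compressed), pool);
  }
}
```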
Use page sizes that reflect actual Parquet page sizes observed in practice: 64KB, 128KB, 256KB, and 1MB (the default). The 20K row-count limit (PARQUET-1414) means most numeric columns produce pages of 78-234KB, making the previous 8KB test point unrealistic. Also fix the JMH annotation processor path for Java 17+ compatibility and reduce warmup/measurement iterations for faster turnaround. A sketch of the page-size parameterization follows the results.

Performance results (master vs perf-compression-bypass branch):

Compression (ops/s, higher is better):

Codec   | Page  | Master | Branch | Speedup
SNAPPY  | 64KB  | 53,979 | 60,799 | +12.6%
SNAPPY  | 128KB | 27,764 | 30,524 |  +9.9%
SNAPPY  | 256KB | 13,549 | 14,648 |  +8.1%
SNAPPY  | 1MB   |  2,445 |  2,675 |  +9.4%
ZSTD    | 64KB  |  8,813 |  8,719 |  -1.1%
ZSTD    | 128KB |  4,361 |  4,501 |  +3.2%
ZSTD    | 256KB |  2,112 |  2,008 |  -4.9%
ZSTD    | 1MB   |    423 |    422 |  -0.3%
LZ4_RAW | 64KB  | 37,777 | 36,107 |  -4.4%
LZ4_RAW | 128KB | 16,777 | 16,330 |  -2.7%
LZ4_RAW | 256KB |  9,060 |  8,956 |  -1.1%
LZ4_RAW | 1MB   |  1,961 |  2,191 | +11.7%
GZIP    | 64KB  |  1,422 |  1,423 |  +0.1%
GZIP    | 128KB |    641 |    646 |  +0.8%
GZIP    | 256KB |    315 |    317 |  +0.7%
GZIP    | 1MB   |     75 |     77 |  +2.3%

Decompression (ops/s, higher is better):

Codec   | Page  | Master | Branch  | Speedup
SNAPPY  | 64KB  | 60,928 |  67,224 | +10.3%
SNAPPY  | 128KB | 29,919 |  33,457 | +11.8%
SNAPPY  | 256KB | 14,431 |  15,912 | +10.3%
SNAPPY  | 1MB   |  3,140 |   3,540 | +12.7%
ZSTD    | 64KB  | 32,042 |  35,750 | +11.6%
ZSTD    | 128KB | 19,447 |  21,800 | +12.1%
ZSTD    | 256KB |  9,495 |  10,759 | +13.3%
ZSTD    | 1MB   |  2,155 |   2,409 | +11.8%
LZ4_RAW | 64KB  | 80,415 | 118,358 | +47.2%
LZ4_RAW | 128KB | 40,615 |  59,620 | +46.8%
LZ4_RAW | 256KB | 19,888 |  29,914 | +50.4%
LZ4_RAW | 1MB   |  4,628 |   7,517 | +62.4%
GZIP    | 64KB  |  9,393 |   9,608 |  +2.3%
GZIP    | 128KB |  4,101 |   4,536 | +10.6%
GZIP    | 256KB |  1,736 |   1,891 |  +8.9%
GZIP    | 1MB   |    406 |     442 |  +9.1%

Key findings:
- SNAPPY: consistent 8-13% improvement across all page sizes
- LZ4_RAW decompression: strongest gain at 47-62% (eliminates 2x heap<->direct copies)
- ZSTD decompression: 11-13% from NoFinalizer + config caching
- GZIP decompression: 9-11% faster at 128KB+ page sizes
- ZSTD/GZIP compression: within noise (CPU-bound in native codec)
- LZ4_RAW compression: within noise at small pages, +12% at 1MB
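For reference, a stripped-down sketch of the JMH page-size parameterization; the real CompressionBenchmark runs the codecs through Parquet's factories, whereas the benchmark body here is only a placeholder:

```java
import java.util.Random;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Level;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
public class PageSizeSketch {
  // 64KB, 128KB, 256KB, 1MB: the page sizes this commit settles on.
  @Param({"65536", "131072", "262144", "1048576"})
  public int pageSize;

  private byte[] page;

  @Setup(Level.Trial)
  public void setup() {
    page = new byte[pageSize];
    new Random(42).nextBytes(page); // stand-in payload; real column data compresses better
  }

  @Benchmark
  public int compressPage() {
    return page.length; // placeholder for codec.compress(page)
  }
}
```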
- Add brotli-codec dependency to parquet-benchmarks (profile-gated, x86_64 only)
- Include BROTLI in the @Param codec list alongside SNAPPY, ZSTD, LZ4_RAW, GZIP
- Add the jitpack.io repository for brotli-codec resolution
Bypass the Hadoop BrotliCodec/stream wrapper for BROTLI compression and decompression by using org.meteogroup.jbrotli's native JNI bindings directly, with ByteBuffer support, via reflection (brotli-codec remains runtime scope). This eliminates intermediate buffer copies and the BrotliStreamCompressor state-machine overhead.

Changes:
- DirectCodecFactory: add BrotliDirectCompressor (quality=1, matching the Hadoop default) and BrotliDirectDecompressor using the one-shot jbrotli API via reflection
- Load the native library eagerly, with graceful fallback to the Hadoop codec path (sketched below)
- CompressionBenchmark: switch from the heap CodecFactory to DirectCodecFactory to benchmark the actual production code path

Results at 64KB page size:
- Compress: 6,746 -> 9,662 ops/s (1.43x speedup)
- Decompress: 2,534 -> 2,786 ops/s (1.10x speedup)
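A sketch of the "load eagerly, fall back gracefully" pattern from the second bullet. The jbrotli class and method names below are assumptions for illustration; the PR resolves the real ones from the runtime-scoped brotli-codec jar:

```java
import java.lang.reflect.Method;

final class BrotliSupport {
  private static final boolean AVAILABLE = tryLoad();

  private static boolean tryLoad() {
    try {
      // Hypothetical jbrotli entry point; resolved reflectively so the class
      // is only touched when the runtime-scoped jar is actually present.
      Class<?> loader = Class.forName("org.meteogroup.jbrotli.libloader.BrotliLibraryLoader");
      Method load = loader.getMethod("loadBrotliLibrary");
      load.invoke(null); // loads the native JNI library once, eagerly
      return true;
    } catch (ReflectiveOperationException | LinkageError e) {
      return false; // classes or native lib missing: keep the Hadoop codec path
    }
  }

  static boolean available() {
    return AVAILABLE;
  }
}
```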
Force-pushed from c1632dd to 7ce2c12
Summary
Bypass the Hadoop `Compressor`/`Decompressor`/`CodecPool` abstraction layer in `CodecFactory` and `DirectCodecFactory`, calling native compression libraries directly. This eliminates per-page stream creation, intermediate buffer copies, and codec pool synchronization for all four supported codecs.

What changes
- SNAPPY: replace `CodecPool` + `SnappyCompressor` (which copies heap→direct→heap) with a single `Snappy.compress(byte[], byte[])`/`Snappy.uncompress(byte[], byte[])` JNI call and a reusable output buffer.
- LZ4_RAW: replace `NonBlockedCompressor` (which allocates direct ByteBuffers and copies heap↔direct twice per call) with heap `ByteBuffer.wrap()` and direct airlift LZ4 compress/decompress — zero intermediate copies (a sketch of this path follows the list).
- ZSTD: replace `ZstdCompressorStream` with `ZstdOutputStreamNoFinalizer` (avoids finalizer registration) and cache the ZSTD level / buffer pool configuration reads per compressor instance instead of re-reading `Configuration` on each page.
- GZIP: replace `GzipCodec` (which wraps Java's `Deflater`/`Inflater` in stream abstractions) with direct `Deflater`/`Inflater` usage, reusing instances via `reset()` and managing GZIP headers/trailers manually.
- Benchmarks: expand `CompressionBenchmark` page sizes from {8KB, 64KB, 256KB} to {64KB, 128KB, 256KB, 1MB} to reflect real-world Parquet page sizes (most pages are 64-256KB due to the 20K row-count limit from PARQUET-1414; only wide string/binary columns hit the 1MB size limit).
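A minimal sketch of the heap-to-heap LZ4_RAW path from the list above, using aircompressor's byte[] overloads (equivalent to the `ByteBuffer.wrap()` route the PR describes). Buffer handling is simplified here; the real code reuses its output buffer:

```java
import io.airlift.compress.lz4.Lz4Compressor;
import io.airlift.compress.lz4.Lz4Decompressor;
import java.util.Arrays;

final class Lz4RawSketch {
  private final Lz4Compressor compressor = new Lz4Compressor();
  private final Lz4Decompressor decompressor = new Lz4Decompressor();

  byte[] compress(byte[] page) {
    // airlift compresses heap array -> heap array: no direct ByteBuffer
    // allocation and no heap<->direct copy on either side.
    byte[] out = new byte[compressor.maxCompressedLength(page.length)];
    int written = compressor.compress(page, 0, page.length, out, 0, out.length);
    return Arrays.copyOf(out, written); // sketch only; production reuses the buffer
  }

  int decompress(byte[] compressed, byte[] dest) {
    // Returns the number of decompressed bytes written into dest.
    return decompressor.decompress(compressed, 0, compressed.length, dest, 0, dest.length);
  }
}
```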
Benchmark results (ops/s, higher is better)

Compression
Decompression
JMH config: JDK 25.0.3 Temurin, 1 fork, 2 warmup × 1s, 3 measurement × 2s.
Why LZ4_RAW decompression gains are largest
`NonBlockedDecompressor` performs two full data copies per operation — heap byte[] → direct ByteBuffer on input, direct ByteBuffer → heap byte[] on output — plus direct buffer allocation and synchronized access. The bypass eliminates both copies by using `ByteBuffer.wrap()` on heap arrays, letting airlift's LZ4 decompress directly between heap buffers.

Why ZSTD compression gains are minimal
`ZstandardCodec` already returns `null` from `createCompressor()`/`createDecompressor()` and delegates directly to `zstd-jni` streams, so the Hadoop abstraction overhead was already bypassed at the codec level. The branch adds finalizer avoidance (the `NoFinalizer` variants, sketched below) and caches configuration reads, which helps decompression but leaves compression within noise.
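A minimal sketch of the compression-side change, assuming zstd-jni's `ZstdOutputStreamNoFinalizer` constructor that takes a `BufferPool` and its `setLevel` method; the cached level field stands in for the per-instance `Configuration` read:

```java
import com.github.luben.zstd.RecyclingBufferPool;
import com.github.luben.zstd.ZstdOutputStreamNoFinalizer;
import java.io.ByteArrayOutputStream;
import java.io.IOException;

final class ZstdPageCompressor {
  private final int level; // read from Configuration once, not re-read per page

  ZstdPageCompressor(int level) {
    this.level = level;
  }

  byte[] compress(byte[] page) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    // NoFinalizer variant: no finalizer registration on every page write.
    try (ZstdOutputStreamNoFinalizer zos =
        new ZstdOutputStreamNoFinalizer(bos, RecyclingBufferPool.INSTANCE)) {
      zos.setLevel(level);
      zos.write(page);
    }
    return bos.toByteArray();
  }
}
```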
Alternative considered: modify codecs instead of CodecFactory

We evaluated modifying `SnappyCodec` and `Lz4RawCodec` to follow the `ZstandardCodec` pattern (return `null` from `createCompressor()`, use custom stream wrappers). This approach was 25-50% slower than the `CodecFactory` bypass for Snappy/LZ4, and even 20-47% slower than master. Per-call stream creation, `ByteArrayOutputStream` buffering, and the lack of buffer reuse dominate for memory-bandwidth-bound codecs where the actual compression takes only 8-65 microseconds.

Files changed
- `CodecFactory.java`: bypass compressor/decompressor with codec-specific inner classes (`SnappyBytesCompressor`, `Lz4RawBytesCompressor`, `ZstdBytesCompressor`, `GzipBytesCompressor` + matching decompressors)
- `DirectCodecFactory.java`: bypass for the direct `ByteBuffer` path (Snappy, LZ4_RAW, ZSTD)
- `BytesInput.java`: add `ByteBufferBackedOutputStream` to avoid `toByteArray()` copies
- `CompressionBenchmark.java`: realistic page sizes + JMH annotation processor fix for Java 17+
- `TestDirectCodecFactory.java`: updated tests for the bypass path