
Optimize compression hot paths: bypass Hadoop codec abstraction for Snappy, LZ4_RAW, ZSTD, and GZIP #3555

Open

iemejia wants to merge 7 commits into apache:master from iemejia:perf-compression-bypass

Conversation

iemejia (Member) commented on May 11, 2026

Summary

Bypass the Hadoop Compressor/Decompressor/CodecPool abstraction layer in CodecFactory and DirectCodecFactory, calling native compression libraries directly. This eliminates per-page stream creation, intermediate buffer copies, and codec pool synchronization for all four supported codecs.

What changes

  • Snappy: Replace CodecPool + SnappyCompressor (which copies heap→direct→heap) with a single Snappy.compress() / Snappy.uncompress() JNI call and a reusable output buffer (see the sketch after this list).
  • LZ4_RAW: Replace NonBlockedCompressor (which allocates direct ByteBuffers and copies heap↔direct twice per call) with ByteBuffer.wrap() over the heap arrays and direct calls into airlift's LZ4 compress/decompress — zero intermediate copies.
  • ZSTD: Replace ZstdCompressorStream with ZstdOutputStreamNoFinalizer (avoids finalizer registration) and cache the ZSTD level / buffer pool configuration reads per compressor instance instead of re-reading Configuration on each page.
  • GZIP: Replace Hadoop's GzipCodec (which wraps Java's Deflater/Inflater in stream abstractions) with direct Deflater/Inflater usage, reusing instances via reset() and managing GZIP headers/trailers manually.
  • Benchmark: Update CompressionBenchmark page sizes from {8KB, 64KB, 256KB} to {64KB, 128KB, 256KB, 1MB} to reflect real-world Parquet page sizes (most pages are 64-256KB due to the 20K row-count limit from PARQUET-1414; only wide string/binary columns hit the 1MB size limit).
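
For concreteness, a minimal sketch of the one-shot Snappy shape, assuming xerial snappy-java's offset-based Snappy.compress/uncompress overloads (the class and field names are illustrative, not the actual CodecFactory inner classes):

```java
import java.io.IOException;

import org.xerial.snappy.Snappy;

// Illustrative sketch of the one-shot Snappy path; not the actual
// CodecFactory inner classes from this PR.
final class SnappyOneShotSketch {
  private byte[] outputBuffer = new byte[0]; // grown once, reused across pages

  // Returns the number of compressed bytes written into outputBuffer;
  // the caller can wrap (outputBuffer, 0, n) without copying.
  int compress(byte[] page, int length) throws IOException {
    int maxLen = Snappy.maxCompressedLength(length);
    if (outputBuffer.length < maxLen) {
      outputBuffer = new byte[maxLen];
    }
    // Single JNI call: no Compressor, no CodecPool, no per-page streams.
    return Snappy.compress(page, 0, length, outputBuffer, 0);
  }

  byte[] decompress(byte[] compressed, int length) throws IOException {
    byte[] out = new byte[Snappy.uncompressedLength(compressed, 0, length)];
    Snappy.uncompress(compressed, 0, length, out, 0);
    return out;
  }
}
```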

Benchmark results (ops/s, higher is better)

Compression

| Codec | Page Size | Master | Branch | Delta |
|---|---|---|---|---|
| SNAPPY | 64 KB | 53,979 | 60,799 | +12.6% |
| SNAPPY | 128 KB | 27,764 | 30,524 | +9.9% |
| SNAPPY | 256 KB | 13,549 | 14,648 | +8.1% |
| SNAPPY | 1 MB | 2,445 | 2,675 | +9.4% |
| LZ4_RAW | 1 MB | 1,961 | 2,191 | +11.7% |
| LZ4_RAW | 64-256 KB | | | within noise (-1 to -4%) |
| ZSTD | all sizes | | | within noise |
| GZIP | all sizes | | | within noise |

Decompression

| Codec | Page Size | Master | Branch | Delta |
|---|---|---|---|---|
| LZ4_RAW | 64 KB | 80,415 | 118,358 | +47.2% |
| LZ4_RAW | 128 KB | 40,615 | 59,620 | +46.8% |
| LZ4_RAW | 256 KB | 19,888 | 29,914 | +50.4% |
| LZ4_RAW | 1 MB | 4,628 | 7,517 | +62.4% |
| SNAPPY | 64 KB | 60,928 | 67,224 | +10.3% |
| SNAPPY | 128 KB | 29,919 | 33,457 | +11.8% |
| SNAPPY | 256 KB | 14,431 | 15,912 | +10.3% |
| SNAPPY | 1 MB | 3,140 | 3,540 | +12.7% |
| ZSTD | 64 KB | 32,042 | 35,750 | +11.6% |
| ZSTD | 128 KB | 19,447 | 21,800 | +12.1% |
| ZSTD | 256 KB | 9,495 | 10,759 | +13.3% |
| ZSTD | 1 MB | 2,155 | 2,409 | +11.8% |
| GZIP | 128 KB | 4,101 | 4,536 | +10.6% |
| GZIP | 256 KB | 1,736 | 1,891 | +8.9% |
| GZIP | 1 MB | 406 | 442 | +9.1% |

JMH config: Temurin JDK 25.0.3, 1 fork, 2 warmup iterations × 1 s, 3 measurement iterations × 2 s.

Why LZ4_RAW decompression gains are largest

NonBlockedDecompressor performs two full data copies per operation — heap byte[] → direct ByteBuffer on input, direct ByteBuffer → heap byte[] on output — plus direct buffer allocation and synchronized access. The bypass eliminates both copies by using ByteBuffer.wrap() on heap arrays, letting airlift's LZ4 decompress directly between heap buffers.
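
A minimal sketch of that zero-copy shape, assuming airlift's io.airlift.compress.lz4.Lz4Decompressor (the wrapper class is illustrative):

```java
import java.nio.ByteBuffer;

import io.airlift.compress.lz4.Lz4Decompressor;

// Illustrative sketch of the heap-to-heap LZ4_RAW decompress path.
final class Lz4RawHeapDecompressSketch {
  private final Lz4Decompressor decompressor = new Lz4Decompressor();

  byte[] decompress(byte[] compressed, int compressedLen, int uncompressedLen) {
    byte[] out = new byte[uncompressedLen];
    // wrap() creates views over the existing heap arrays: no direct
    // ByteBuffer allocation, no heap<->direct copies, no synchronization.
    ByteBuffer input = ByteBuffer.wrap(compressed, 0, compressedLen);
    ByteBuffer output = ByteBuffer.wrap(out);
    decompressor.decompress(input, output);
    return out;
  }
}
```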

Why ZSTD compression gains are minimal

ZstandardCodec already returns null from createCompressor()/createDecompressor() and delegates directly to zstd-jni streams. The Hadoop abstraction overhead was already bypassed at the codec level. The branch adds finalizer avoidance (NoFinalizer variants) and caches configuration reads, which helps decompression but leaves compression within noise.
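
A sketch of the NoFinalizer usage, assuming zstd-jni's ZstdOutputStreamNoFinalizer / ZstdInputStreamNoFinalizer with RecyclingBufferPool (the surrounding class is illustrative, not the PR's exact code):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;

import com.github.luben.zstd.RecyclingBufferPool;
import com.github.luben.zstd.ZstdInputStreamNoFinalizer;
import com.github.luben.zstd.ZstdOutputStreamNoFinalizer;

// Illustrative sketch of the NoFinalizer ZSTD path; in the PR the level and
// pool settings are read once per compressor instance, not per page.
final class ZstdNoFinalizerSketch {
  byte[] compress(byte[] page, int cachedLevel) throws IOException {
    ByteArrayOutputStream sink = new ByteArrayOutputStream();
    // NoFinalizer variant skips finalizer registration on every page.
    try (ZstdOutputStreamNoFinalizer zstd =
        new ZstdOutputStreamNoFinalizer(sink, RecyclingBufferPool.INSTANCE)) {
      zstd.setLevel(cachedLevel); // cached, not re-read from Configuration
      zstd.write(page);
    }
    return sink.toByteArray();
  }

  byte[] decompress(byte[] compressed, int uncompressedLen) throws IOException {
    byte[] out = new byte[uncompressedLen];
    try (ZstdInputStreamNoFinalizer zstd = new ZstdInputStreamNoFinalizer(
        new ByteArrayInputStream(compressed), RecyclingBufferPool.INSTANCE)) {
      int off = 0;
      while (off < uncompressedLen) {
        int n = zstd.read(out, off, uncompressedLen - off);
        if (n < 0) {
          break; // stream ended early
        }
        off += n;
      }
    }
    return out;
  }
}
```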

Alternative considered: modify codecs instead of CodecFactory

We evaluated modifying SnappyCodec and Lz4RawCodec to follow the ZstandardCodec pattern (return null from createCompressor(), use custom stream wrappers). This approach was 25-50% slower than the CodecFactory bypass for Snappy/LZ4 and even 20-47% slower than master. The per-call stream creation, ByteArrayOutputStream buffering, and lack of buffer reuse dominate for memory-bandwidth-bound codecs where the actual compression takes only 8-65 microseconds.

Files changed

  • CodecFactory.java: Bypass compressor/decompressor with codec-specific inner classes (SnappyBytesCompressor, Lz4RawBytesCompressor, ZstdBytesCompressor, GzipBytesCompressor + matching decompressors)
  • DirectCodecFactory.java: Bypass for direct ByteBuffer path (Snappy, LZ4_RAW, ZSTD)
  • BytesInput.java: Add ByteBufferBackedOutputStream to avoid toByteArray() copies (a minimal sketch follows this list)
  • CompressionBenchmark.java: Realistic page sizes + JMH annotation processor fix for Java 17+
  • TestDirectCodecFactory.java: Updated tests for bypass path
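
The essential shape of a ByteBuffer-backed OutputStream is small; this sketch is an assumption about the shape, not the exact class added to BytesInput.java:

```java
import java.io.OutputStream;
import java.nio.ByteBuffer;

// Illustrative sketch: an OutputStream that writes straight into a
// ByteBuffer, so callers can avoid a trailing toByteArray() copy.
final class ByteBufferBackedOutputStreamSketch extends OutputStream {
  private final ByteBuffer buffer;

  ByteBufferBackedOutputStreamSketch(ByteBuffer buffer) {
    this.buffer = buffer;
  }

  @Override
  public void write(int b) {
    buffer.put((byte) b);
  }

  @Override
  public void write(byte[] src, int off, int len) {
    buffer.put(src, off, len); // bulk path: no intermediate byte[] copy
  }
}
```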

iemejia added 7 commits on May 13, 2026 at 21:30
…nappy, LZ4_RAW, and ZSTD

- Add zero-copy ByteArrayBytesInput.toByteArray() override (sketched after this
  list): returns the backing array directly when offset==0 && length==in.length,
  avoiding BAOS allocation and System.arraycopy on every decompressor call.
- Snappy: use Snappy.compress()/uncompress() directly (single JNI call).
- LZ4_RAW: use LZ4 compressor/decompressor API directly.
- ZSTD: use ZstdOutputStreamNoFinalizer/ZstdInputStreamNoFinalizer directly.
- Add LZ4_RAW direct ByteBuffer compressor/decompressor to DirectCodecFactory.
- Add JMH CompressionBenchmark for isolated codec throughput measurement.
- Update TestDirectCodecFactory LZ4_RAW assertion.
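
A sketch of the zero-copy toByteArray() override from the first bullet; field names follow the description above, not necessarily the real ByteArrayBytesInput:

```java
import java.util.Arrays;

// Illustrative sketch of the zero-copy override described above.
final class ByteArrayBytesInputSketch {
  private final byte[] in;
  private final int offset;
  private final int length;

  ByteArrayBytesInputSketch(byte[] in, int offset, int length) {
    this.in = in;
    this.offset = offset;
    this.length = length;
  }

  byte[] toByteArray() {
    if (offset == 0 && length == in.length) {
      return in; // whole-array case: hand back the backing array, no copy
    }
    return Arrays.copyOfRange(in, offset, offset + length); // sliced case
  }
}
```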
…ater directly

Use reusable Deflater/Inflater instances with manual GZIP header/trailer,
bypassing Hadoop's GzipCodec, CodecPool, and GZIPOutputStream/GZIPInputStream.
Deflater.reset() reuses native zlib state across pages, avoiding per-call
allocation. Manual header/trailer eliminates stream wrapper overhead.
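
A sketch of the manual-framing idea (the class is illustrative; the header/trailer layout follows RFC 1952, and the scratch buffer size is arbitrary):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.CRC32;
import java.util.zip.Deflater;

// Illustrative sketch: raw deflate plus hand-written GZIP framing, with the
// Deflater reset and reused across pages instead of reallocated.
final class GzipManualFramingSketch {
  // RFC 1952 header: magic 0x1f 0x8b, CM=8 (deflate), FLG=0, MTIME=0, XFL=0, OS=255.
  private static final byte[] GZIP_HEADER =
      {0x1f, (byte) 0x8b, 8, 0, 0, 0, 0, 0, 0, (byte) 0xff};

  private final Deflater deflater =
      new Deflater(Deflater.DEFAULT_COMPRESSION, true); // nowrap = raw deflate
  private final CRC32 crc = new CRC32();
  private final byte[] scratch = new byte[64 * 1024];

  byte[] compress(byte[] page) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    out.write(GZIP_HEADER);

    deflater.reset(); // reuse native zlib state across pages
    deflater.setInput(page);
    deflater.finish();
    while (!deflater.finished()) {
      int n = deflater.deflate(scratch);
      out.write(scratch, 0, n);
    }

    crc.reset();
    crc.update(page, 0, page.length);
    writeIntLE(out, (int) crc.getValue()); // trailer: CRC32 of uncompressed data
    writeIntLE(out, page.length);          // trailer: ISIZE mod 2^32
    return out.toByteArray();
  }

  private static void writeIntLE(ByteArrayOutputStream out, int v) {
    out.write(v & 0xff);
    out.write((v >>> 8) & 0xff);
    out.write((v >>> 16) & 0xff);
    out.write((v >>> 24) & 0xff);
  }
}
```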

Results (3 forks, 15 iterations, AMD EPYC 9V45):
  Compress:   8KB +3%, 64KB +1%, 256KB +1%
  Decompress: 8KB +6%, 64KB +3%, 256KB +9%
…erPool, route GZIP in DirectCodecFactory

Respect parquet.compression.codec.zstd.bufferPool.enabled in the optimized
ZstdBytesCompressor/Decompressor (was hardcoded to RecyclingBufferPool).
Route GZIP decompression through the optimized path in DirectCodecFactory
instead of falling back to the Hadoop codec pool. Remove dead GZIP/ZSTD
branches from cacheKey(). Document ISA-L native library bypass in GZIP
Javadocs. Replace obsolete Hadoop codec caching tests with end-to-end
compression level verification tests.
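
The pool selection amounts to a single cached read; a sketch (the property name comes from the commit above, the default value shown is an assumption):

```java
import org.apache.hadoop.conf.Configuration;

import com.github.luben.zstd.BufferPool;
import com.github.luben.zstd.NoPool;
import com.github.luben.zstd.RecyclingBufferPool;

// Illustrative sketch: read the toggle once per compressor instance, then
// reuse the chosen pool for every page. Default value is an assumption.
final class ZstdPoolSelectionSketch {
  static BufferPool select(Configuration conf) {
    boolean pooled = conf.getBoolean(
        "parquet.compression.codec.zstd.bufferPool.enabled", true);
    return pooled ? RecyclingBufferPool.INSTANCE : NoPool.INSTANCE;
  }
}
```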
Use page sizes that reflect actual Parquet page sizes observed in practice:
64KB, 128KB, 256KB, and 1MB (the default). The 20K row-count limit
(PARQUET-1414) means most numeric columns produce pages of 78-234KB,
making the previous 8KB test point unrealistic.

Also fix JMH annotation processor path for Java 17+ compatibility
and reduce warmup/measurement iterations for faster iteration.
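
In JMH terms the parameter change looks roughly like this (class and field names are illustrative, not the real CompressionBenchmark):

```java
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

// Illustrative sketch of the updated page-size parameters:
// 64KB, 128KB, 256KB, and the 1MB default page size limit.
@State(Scope.Benchmark)
public class CompressionBenchmarkParamsSketch {
  @Param({"65536", "131072", "262144", "1048576"})
  public int pageSize;
}
```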

Performance results (master vs perf-compression-bypass branch):

Compression (ops/s, higher is better):
  Codec    | Page   | Master  | Branch  | Speedup
  SNAPPY   | 64KB   | 53,979  | 60,799  | +12.6%
  SNAPPY   | 128KB  | 27,764  | 30,524  |  +9.9%
  SNAPPY   | 256KB  | 13,549  | 14,648  |  +8.1%
  SNAPPY   | 1MB    |  2,445  |  2,675  |  +9.4%
  ZSTD     | 64KB   |  8,813  |  8,719  |  -1.1%
  ZSTD     | 128KB  |  4,361  |  4,501  |  +3.2%
  ZSTD     | 256KB  |  2,112  |  2,008  |  -4.9%
  ZSTD     | 1MB    |    423  |    422  |  -0.3%
  LZ4_RAW  | 64KB   | 37,777  | 36,107  |  -4.4%
  LZ4_RAW  | 128KB  | 16,777  | 16,330  |  -2.7%
  LZ4_RAW  | 256KB  |  9,060  |  8,956  |  -1.1%
  LZ4_RAW  | 1MB    |  1,961  |  2,191  | +11.7%
  GZIP     | 64KB   |  1,422  |  1,423  |  +0.1%
  GZIP     | 128KB  |    641  |    646  |  +0.8%
  GZIP     | 256KB  |    315  |    317  |  +0.7%
  GZIP     | 1MB    |     75  |     77  |  +2.3%

Decompression (ops/s, higher is better):
  Codec    | Page   | Master  | Branch  | Speedup
  SNAPPY   | 64KB   | 60,928  | 67,224  | +10.3%
  SNAPPY   | 128KB  | 29,919  | 33,457  | +11.8%
  SNAPPY   | 256KB  | 14,431  | 15,912  | +10.3%
  SNAPPY   | 1MB    |  3,140  |  3,540  | +12.7%
  ZSTD     | 64KB   | 32,042  | 35,750  | +11.6%
  ZSTD     | 128KB  | 19,447  | 21,800  | +12.1%
  ZSTD     | 256KB  |  9,495  | 10,759  | +13.3%
  ZSTD     | 1MB    |  2,155  |  2,409  | +11.8%
  LZ4_RAW  | 64KB   | 80,415  |118,358  | +47.2%
  LZ4_RAW  | 128KB  | 40,615  | 59,620  | +46.8%
  LZ4_RAW  | 256KB  | 19,888  | 29,914  | +50.4%
  LZ4_RAW  | 1MB    |  4,628  |  7,517  | +62.4%
  GZIP     | 64KB   |  9,393  |  9,608  |  +2.3%
  GZIP     | 128KB  |  4,101  |  4,536  | +10.6%
  GZIP     | 256KB  |  1,736  |  1,891  |  +8.9%
  GZIP     | 1MB    |    406  |    442  |  +9.1%

Key findings:
- SNAPPY: consistent 8-13% improvement across all page sizes
- LZ4_RAW decompression: strongest gain at 47-62% (eliminates 2x heap<->direct copies)
- ZSTD decompression: 11-13% from NoFinalizer + config caching
- GZIP decompression: 9-11% faster at 128KB+ page sizes
- ZSTD/GZIP compression: within noise (CPU-bound in native codec)
- LZ4_RAW compression: within noise at small pages, +12% at 1MB
- Add brotli-codec dependency to parquet-benchmarks (profile-gated, x86_64 only)
- Include BROTLI in @Param codec list alongside SNAPPY, ZSTD, LZ4_RAW, GZIP
- Add jitpack.io repository for brotli-codec resolution
Bypass the Hadoop BrotliCodec/stream wrapper for BROTLI compression and
decompression by using org.meteogroup.jbrotli's native JNI bindings directly
with ByteBuffer support via reflection (brotli-codec remains runtime scope).
This eliminates intermediate buffer copies and the BrotliStreamCompressor
state machine overhead.

Changes:
- DirectCodecFactory: Add BrotliDirectCompressor (quality=1, matching Hadoop
  default) and BrotliDirectDecompressor using one-shot jbrotli API via reflection
- Load native library eagerly with graceful fallback to Hadoop codec path
- CompressionBenchmark: Switch from heap CodecFactory to DirectCodecFactory
  to benchmark the actual production code path

Results at 64KB page size:
- Compress: 6,746 -> 9,662 ops/s (1.43x speedup)
- Decompress: 2,534 -> 2,786 ops/s (1.10x speedup)
iemejia force-pushed the perf-compression-bypass branch from c1632dd to 7ce2c12 on May 13, 2026 at 19:30