GH-3530: Bypass Hadoop codec abstraction to optimize compression performance by iemejia · Pull Request #3570 · apache/parquet-java

iemejia · 2026-05-17T22:39:00Z

Part of #3530 — Apache Parquet Java Performance Improvements

Summary

Bypass the Hadoop CompressionCodec abstraction for all six supported codecs, eliminating per-page codec-pool lookups, stream-wrapper allocation, and unnecessary buffer copies in both CodecFactory and DirectCodecFactory.

Codec	Before	After
Snappy	Hadoop `SnappyCodec` stream wrappers	xerial `Snappy.compress`/`uncompress` direct calls
LZ4_RAW	Hadoop codec abstraction	airlift `LZ4Compressor`/`LZ4Decompressor` direct
ZSTD	Streaming `ZstdOutputStreamNoFinalizer`/`ZstdInputStreamNoFinalizer`	Reusable `ZstdCompressCtx`/`ZstdDecompressCtx` single-call APIs
GZIP	Hadoop `GzipCodec` with codec-pool overhead	JDK `GZIPOutputStream`/`GZIPInputStream` direct
LZO	GPL `com.hadoop.compression.lzo.LzoCodec`	aircompressor `LzoHadoopStreams` (Apache 2.0, wire-compatible)
Brotli	Abandoned `brotli-codec` (jbrotli, 2016, x86-only)	`brotli4j` 1.23.0 (10 platforms incl. aarch64, reflection-loaded)
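
To illustrate the GZIP row, here is a minimal sketch of compressing and decompressing one page with the JDK streams directly, without a codec pool or Hadoop stream wrappers. The class name `DirectGzipSketch` and its method signatures are hypothetical and are not taken from the PR. `GZIPOutputStream` has no compression-level parameter, so the sketch sets the level on the protected `def` deflater from an anonymous subclass, which is one common way to do it and not necessarily what the PR does.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not the PR's class: GZIP via the JDK only.
public final class DirectGzipSketch {
  private DirectGzipSketch() {}

  /** Compresses one page at the given zlib level (0-9, or -1 for the default). */
  public static byte[] compress(byte[] page, int level) throws IOException {
    ByteArrayOutputStream sink = new ByteArrayOutputStream(Math.max(32, page.length / 2));
    // The instance initializer runs after the GZIP header is written but
    // before any data is deflated, so the level applies to the whole page.
    try (GZIPOutputStream gzip = new GZIPOutputStream(sink) {
      {
        def.setLevel(level);
      }
    }) {
      gzip.write(page);
    }
    return sink.toByteArray();
  }

  /** Decompresses a complete GZIP member back into the original page bytes. */
  public static byte[] decompress(byte[] compressed) throws IOException {
    try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
      return gzip.readAllBytes();
    }
  }
}
```

In a writer, the `level` argument would presumably come from configuration; the sketch leaves that lookup out.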

Notable side effects:

LZO: Removes GPL dependency; uses Apache 2.0 aircompressor. Wire-compatible framing.
Brotli: Enables aarch64 support (Linux, macOS, Windows). Removes non-aarch64 Maven profile guards and test skips.
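
The LZO framing is described in the commit message as big-endian length-prefixed blocks. A rough sketch of that kind of framing follows. The exact layout used here (a 4-byte uncompressed length, then a 4-byte compressed length, then the compressed bytes) is an assumption for illustration, and `java.util.zip.Deflater` stands in for LZO because the JDK ships no LZO implementation. `LengthPrefixedFramingSketch` is a hypothetical name.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical illustration of length-prefixed block framing; Deflater is a stand-in for LZO.
public final class LengthPrefixedFramingSketch {
  private LengthPrefixedFramingSketch() {}

  /** Writes [uncompressed length][compressed length][compressed bytes]; DataOutputStream ints are big-endian. */
  public static byte[] writeBlock(byte[] raw) throws IOException {
    byte[] packed = deflate(raw);
    ByteArrayOutputStream sink = new ByteArrayOutputStream(packed.length + 8);
    DataOutputStream out = new DataOutputStream(sink);
    out.writeInt(raw.length);
    out.writeInt(packed.length);
    out.write(packed);
    out.flush();
    return sink.toByteArray();
  }

  /** Reads one block written by writeBlock and returns the uncompressed bytes. */
  public static byte[] readBlock(byte[] framed) throws IOException, DataFormatException {
    DataInputStream in = new DataInputStream(new ByteArrayInputStream(framed));
    int rawLength = in.readInt();
    int packedLength = in.readInt();
    byte[] packed = new byte[packedLength];
    in.readFully(packed);

    Inflater inflater = new Inflater();
    try {
      inflater.setInput(packed);
      byte[] raw = new byte[rawLength];
      int filled = 0;
      while (filled < rawLength && !inflater.finished()) {
        int n = inflater.inflate(raw, filled, rawLength - filled);
        if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
          throw new IOException("compressed block is truncated");
        }
        filled += n;
      }
      if (filled != rawLength) {
        throw new IOException("block is shorter than its length prefix");
      }
      return raw;
    } finally {
      inflater.end();
    }
  }

  private static byte[] deflate(byte[] raw) {
    Deflater deflater = new Deflater();
    try {
      deflater.setInput(raw);
      deflater.finish();
      ByteArrayOutputStream sink = new ByteArrayOutputStream();
      byte[] chunk = new byte[4096];
      while (!deflater.finished()) {
        sink.write(chunk, 0, deflater.deflate(chunk));
      }
      return sink.toByteArray();
    } finally {
      deflater.end();
    }
  }
}
```

Because the length prefixes are independent of the compressor, a reader only needs to agree on the prefix layout and the block codec to stay wire-compatible.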

JMH benchmarks: CompressionBenchmark, CpuReadBenchmark, CpuWriteBenchmark, FileReadBenchmark, FileWriteBenchmark, ConcurrentReadWriteBenchmark.

Benchmark results

Environment: JDK 25.0.3 (Temurin), OpenJDK 64-Bit Server VM, JMH 1.37, Linux x86_64.

End-to-end file write (100K rows, SingleShotTime, ms/op lower is better):

Codec	V1 dict=true (before -> after, speedup)	V2 dict=true (before -> after)	V2 Speedup
SNAPPY	50.6 -> 40.9 (1.24x)	69.7 -> 38.7	1.80x
ZSTD	52.3 -> 43.6 (1.20x)	70.7 -> 40.6	1.74x
LZ4_RAW	49.6 -> 41.3 (1.20x)	70.2 -> 39.0	1.80x
GZIP	149.9 -> 119.3 (1.26x)	123.4 -> 67.6	1.83x
BROTLI	55.4 -> 46.8 (1.18x)	72.8 -> 41.8	1.74x

End-to-end file read (ms/op lower is better):

Codec	V1 Speedup	V2 Speedup
SNAPPY	1.50x	1.61x
ZSTD	1.49x	1.60x
LZ4_RAW	1.23x	1.57x
GZIP	1.47x	1.49x
BROTLI	1.83x	1.91x

Raw codec throughput (DirectCodecFactory): Snappy/ZSTD/LZ4/GZIP unchanged (already had native access). Brotli decompression improved 2.3-2.7x (brotli4j >> jbrotli).

V2 shows consistently larger speedups than V1 because V2 encoding produces more, smaller pages, meaning more codec invocations per file where the per-invocation Hadoop overhead accumulates.

…n performance

Some of the Parquet compression codecs rely on Hadoop's CompressionCodec. After evaluating with performance tests that isolate the CPU utilization it is clear that the Hadoop abstraction introduces considerable overhead. This PR improves that for Snappy, LZ4_RAW, ZSTD, GZIP, LZO, and BROTLI. It also migrates Brotli from jbrotli to brotli4j.

Bypass Hadoop CompressionCodec for Snappy (xerial JNI), LZ4_RAW (airlift), ZSTD (zstd-jni), GZIP (JDK), LZO (airlift), and BROTLI (brotli4j) in both CodecFactory and DirectCodecFactory, eliminating per-page codec pool lookups, stream wrapper allocation, and unnecessary buffer copies.

ZSTD: replace streaming ZstdOutputStreamNoFinalizer/ZstdInputStreamNoFinalizer with reusable ZstdCompressCtx/ZstdDecompressCtx single-call APIs.

GZIP: bypass Hadoop's GzipCodec and its codec-pool/stream-wrapper overhead with direct JDK GZIPOutputStream/GZIPInputStream. Compression level is read from the existing "zlib.compress.level" Hadoop configuration key.

LZO: bypass the GPL-licensed com.hadoop.compression.lzo.LzoCodec entirely using aircompressor's LzoHadoopStreams (Apache 2.0). The framing format (big-endian length-prefixed blocks) is wire-compatible with Hadoop's LzoCodec, so existing LZO Parquet files remain readable. Removes the GPL dependency for LZO support. Uncomment previously disabled LZO benchmarks and tests.

BROTLI: migrate from abandoned brotli-codec (jbrotli, 2016, x86-only) to brotli4j 1.23.0 (com.aayushatharva.brotli4j) which supports 10 platforms including linux/darwin/windows aarch64. brotli4j is a runtime-only optional dependency accessed via reflection (Encoder.compress and Decoder.decompress) to avoid a compile-time dependency. Uses Decoder.decompress(byte[], int, int) instead of DirectDecompress to avoid loading classes that reference Netty. Remove non-aarch64 Maven profile guards and aarch64 test skips.
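
The reflection-based access to an optional runtime dependency can be sketched roughly as follows. This is not the PR's code: `OptionalStaticMethod` and its methods are hypothetical names, and the sketch only shows the general idea of resolving a static method by name and treating a missing class as "codec unavailable" instead of failing at class-load time.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.Optional;

// Hypothetical helper for calling a static method on a class that may be absent at runtime.
public final class OptionalStaticMethod {
  private OptionalStaticMethod() {}

  /** Resolves a public static method, or returns empty when the class or method is unavailable. */
  public static Optional<Method> find(String className, String methodName, Class<?>... parameterTypes) {
    try {
      Method method = Class.forName(className).getMethod(methodName, parameterTypes);
      return Modifier.isStatic(method.getModifiers()) ? Optional.of(method) : Optional.empty();
    } catch (ClassNotFoundException | NoSuchMethodException | LinkageError e) {
      // The optional dependency (or its native library) is not usable on this classpath.
      return Optional.empty();
    }
  }

  /** Invokes a resolved static method that returns byte[]. */
  public static byte[] invokeBytes(Method method, Object... args) {
    try {
      return (byte[]) method.invoke(null, args);
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException("Reflective codec call failed: " + method, e);
    }
  }
}
```

For Brotli, the lookup would presumably target the brotli4j `Encoder.compress` and `Decoder.decompress(byte[], int, int)` methods named above. The fully qualified class names (for example `com.aayushatharva.brotli4j.encoder.Encoder`) are an assumption here, and a real integration would also have to make sure the native library is loaded before the first call.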
ByteBuffer decompressors use native APIs with slice + manual position advancement pattern (matching DirectCodecFactory.BaseDecompressor):

- Snappy: Snappy.uncompress(slice, slice)
- ZSTD: Zstd.decompress(slice, slice)
- LZ4_RAW: decompressor.decompress(slice, slice)
- GZIP: ByteBufferInputStream.wrap(slice) -> GZIPInputStream
- LZO: ByteBufferInputStream.wrap(slice) -> LzoHadoopInputStream
- BROTLI: byte[] copy through Decoder.decompress (no direct ByteBuffer API)

Add BytesInput.toByteArray() zero-copy override in ByteArrayBytesInput.

Add benchmarks: CompressionBenchmark, CpuReadBenchmark, CpuWriteBenchmark, FileReadBenchmark, FileWriteBenchmark, InMemoryInputFile, InMemoryOutputFile, ConcurrentReadWriteBenchmark. Remove encoding/row-group benchmarks.

Add 15 new tests in TestDirectCodecFactory, 3 new tests in TestBytesInput.
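
The slice + manual position advancement pattern described for the ByteBuffer decompressors can be sketched as below. The sketch uses the JDK's `Inflater` (which accepts `ByteBuffer` arguments since Java 11) as a stand-in for the native Snappy/ZSTD/LZ4 calls, and `SliceAndAdvanceSketch` with its method signature is a hypothetical shape, not the signature of `DirectCodecFactory.BaseDecompressor`.

```java
import java.nio.ByteBuffer;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

// Hypothetical illustration; Inflater stands in for a native (slice, slice) codec call.
public final class SliceAndAdvanceSketch {
  private SliceAndAdvanceSketch() {}

  public static void decompress(ByteBuffer input, int compressedSize, ByteBuffer output, int decompressedSize)
      throws DataFormatException {
    // Bounded views: the codec cannot read or write past this page.
    ByteBuffer in = input.slice();
    in.limit(compressedSize);
    ByteBuffer out = output.slice();
    out.limit(decompressedSize);

    Inflater inflater = new Inflater();
    try {
      inflater.setInput(in);
      while (out.hasRemaining() && !inflater.finished()) {
        int n = inflater.inflate(out);
        if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
          throw new DataFormatException("compressed page is truncated");
        }
      }
      if (out.hasRemaining()) {
        throw new DataFormatException("page is shorter than the expected decompressed size");
      }
    } finally {
      inflater.end();
    }

    // Slices have independent positions, so the caller's buffers are advanced by hand.
    input.position(input.position() + compressedSize);
    output.position(output.position() + decompressedSize);
  }
}
```

Working on slices keeps the caller's position and limit untouched until the page has been decoded successfully, so a failed call leaves both buffers where they were.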

This was referenced May 17, 2026

Optimize compression hot paths: bypass Hadoop codec abstraction for Snappy, LZ4_RAW, ZSTD, and GZIP #3555

Closed

GH-3511: Add JMH encoding benchmarks and fix parquet-benchmarks shaded jar #3512

Closed

Apache Parquet Java Performance Improvements #3530

Open

iemejia wants to merge 1 commit into apache:master from iemejia:parquet-perf-v2-par6-compression
