Add JMH benchmarks for encoding/decoding paths and fix parquet-benchmarks shaded jar #3511

@iemejia

Description

Background

The 6 perf-optimization PRs currently open (#3494, #3496, #3500, #3504, #3506, #3510) report headline numbers (12x decode speedup, 7193% Binary.hashCode improvement, etc.) but cite JMH benchmarks that do not exist on master. Reviewers cannot reproduce the numbers without manually copying benchmark sources from elsewhere.

This issue tracks contributing the JMH benchmarks themselves so reviewers can reproduce, validate, and continue measuring across future changes.

Problems

1. parquet-benchmarks shaded jar is broken on master

A build of parquet-benchmarks from the current master produces a jar that is non-functional:

$ java -jar parquet-benchmarks/target/parquet-benchmarks.jar
Exception in thread "main" java.lang.RuntimeException: ERROR: Unable to find the resource: /META-INF/BenchmarkList

The parquet-benchmarks/pom.xml is missing two pieces of configuration:

  • The maven-compiler-plugin lacks the annotationProcessorPaths / annotationProcessors config for jmh-generator-annprocess. As a result the JMH annotation processor never runs, and META-INF/BenchmarkList and META-INF/CompilerHints are never generated. (Workaround: pass -Dmaven.compiler.proc=full, but this is undiscoverable.)
  • The maven-shade-plugin lacks AppendingTransformer entries for META-INF/BenchmarkList and META-INF/CompilerHints. Even if the resources were generated, shading would drop them.
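
Both gaps leave the same symptom: the two generated resources are absent from the shaded jar. A minimal sketch of a check for this, using only the JDK's `java.util.jar` API, might look like the following. The class name `BenchmarkJarCheck` and its method are hypothetical and not part of parquet-benchmarks; the default jar path is the one from the command shown above.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.jar.JarFile;

// Hypothetical helper, not part of parquet-benchmarks.
final class BenchmarkJarCheck {
    // Resources the JMH annotation processor generates and the shaded jar must retain.
    static final List<String> REQUIRED =
            List.of("META-INF/BenchmarkList", "META-INF/CompilerHints");

    /** Returns the required resources that are absent from the given jar. */
    static List<String> missingResources(Path jar) throws IOException {
        List<String> missing = new ArrayList<>();
        try (JarFile jarFile = new JarFile(jar.toFile())) {
            for (String name : REQUIRED) {
                if (jarFile.getEntry(name) == null) {
                    missing.add(name);
                }
            }
        }
        return missing;
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the jar path used in the issue; pass another path as the first argument.
        Path jar = Path.of(args.length > 0
                ? args[0]
                : "parquet-benchmarks/target/parquet-benchmarks.jar");
        List<String> missing = missingResources(jar);
        System.out.println(missing.isEmpty() ? "OK" : "Missing: " + missing);
    }
}
```

This only confirms that the entries exist. It does not verify that `BenchmarkList` is populated, so it complements a real JMH run rather than replacing one.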

2. No benchmarks for the optimizations under review

The 6 open perf PRs touch encode/decode paths in parquet-column and parquet-common (PlainValuesReader/Writer, Binary.hashCode, ByteStreamSplitValuesReader/Writer, BinaryPlainValuesReader). Master's parquet-benchmarks covers only file-level read/write, not these CPU-bound encoding paths.

Proposal

Land the following in a single PR against parquet-benchmarks:

  1. pom.xml fix: add JMH annotation-processor config + AppendingTransformer entries so the shaded jar is runnable.
  2. 11 new JMH benchmark files covering the encoding/decoding paths under optimization, plus supporting infrastructure:
    • IntEncodingBenchmark — encode/decode with PLAIN, DELTA_BINARY_PACKED, BYTE_STREAM_SPLIT, RLE, and dictionary, parameterized on value count and data distribution
    • BinaryEncodingBenchmark — Binary write/read paths (PLAIN, dictionary), parameterized on length and cardinality
    • ByteStreamSplitEncodingBenchmark, ByteStreamSplitDecodingBenchmark — BSS encode/decode for float/double/int/long
    • FixedLenByteArrayEncodingBenchmark — FLBA encode/decode
    • FileReadBenchmark, FileWriteBenchmark — CPU-focused file-level benchmarks (minimal I/O via temp files)
    • RowGroupFlushBenchmark — flush-path benchmark
    • ConcurrentReadWriteBenchmark — multi-threaded read/write throughput
    • BlackHoleOutputFile — OutputFile that discards bytes, used to isolate CPU work from I/O
    • TestDataFactory — shared test-data generation utilities
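
To illustrate the idea behind BlackHoleOutputFile, here is a minimal sketch of a byte sink that discards everything written to it while still tracking the write position. It is an assumption about the general technique, not the proposed implementation: the name `DiscardingOutputStream` is hypothetical, and it extends the JDK's `java.io.OutputStream` rather than Parquet's `OutputFile` / `PositionOutputStream` types so that it compiles without Parquet on the classpath.

```java
import java.io.OutputStream;
import java.util.Objects;

// Hypothetical sketch: a sink that drops all bytes but keeps a running position,
// so a writer that asks "how many bytes so far?" still gets a consistent answer.
final class DiscardingOutputStream extends OutputStream {
    private long position;

    @Override
    public void write(int b) {
        position++; // one byte "written", nothing stored
    }

    @Override
    public void write(byte[] buf, int off, int len) {
        // Validate the range as a real stream would, then only advance the counter.
        Objects.checkFromIndexSize(off, len, buf.length);
        position += len;
    }

    /** Number of bytes accepted so far. */
    long getPos() {
        return position;
    }
}
```

Because no bytes are copied or flushed to disk, time measured against such a sink is dominated by the encoding work itself, which is the isolation the benchmark list above aims for.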

After this lands, each existing perf PR will be amended with a one-line "How to reproduce" snippet pointing at the relevant *Benchmark class.

Out of scope (deferred)

The existing ReadBenchmarks, WriteBenchmarks, and NestedNullWritingBenchmarks could be modernized (Hadoop-free LocalInputFile, parameterized over compression and writer version, JMH-idiomatic state setup). That is a separate concern and will be proposed in a follow-up PR.

Validation

With the proposed pom changes, the shaded jar contains a populated META-INF/BenchmarkList (87 benchmarks registered) and runs cleanly. As a sanity check, IntEncodingBenchmark.decodePlain reproduces the ~91M ops/s baseline cited in #3493/#3494 (master, JDK 21, JMH 1.37, 3 warmup + 5 measurement iterations).
