chore: Add microbenchmark for IcebergScan operator serde roundtrip #3296
Conversation
This benchmark measures the serialization/deserialization performance of Iceberg FileScanTask objects to protobuf, starting from actual Iceberg Java objects rather than pre-constructed protobuf messages. The benchmark:
- Creates a real Iceberg table with a configurable number of partitions
- Extracts FileScanTask objects through query planning
- Benchmarks conversion from FileScanTask to Protobuf
- Benchmarks serialization to bytes and deserialization

Usage: make benchmark-org.apache.spark.sql.benchmark.CometOperatorSerdeBenchmark -- 30000

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
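For orientation, here is a minimal sketch of what such a roundtrip benchmark can look like, built on Spark's Benchmark utility and Iceberg's scan-planning API. The toProto and fromBytes parameters are stand-ins for the PR's actual conversion code (the CometIcebergNativeScan.convert() path); they are assumptions for illustration, not the code added here.

```scala
import scala.collection.JavaConverters._

import com.google.protobuf.Message
import org.apache.iceberg.{FileScanTask, Table}
import org.apache.spark.benchmark.Benchmark

object IcebergScanSerdeSketch {
  // `toProto` and `fromBytes` are placeholders for the real FileScanTask <->
  // protobuf conversion, not Comet's actual API.
  def run(
      table: Table,
      numPartitions: Int,
      toProto: FileScanTask => Message,
      fromBytes: Array[Byte] => Message): Unit = {
    // Plan the scan to obtain the FileScanTask objects the benchmark starts from.
    val tasks: Seq[FileScanTask] = table.newScan().planFiles().asScala.toSeq

    val benchmark = new Benchmark(
      s"IcebergScan serde ($numPartitions partitions, ${tasks.size} tasks)",
      tasks.size.toLong)

    // Phase 1: Iceberg object -> protobuf message.
    benchmark.addCase("FileScanTask -> protobuf") { _ =>
      tasks.foreach(toProto)
    }

    // Phase 2: protobuf message -> bytes -> protobuf message.
    benchmark.addCase("protobuf -> bytes -> protobuf") { _ =>
      tasks.foreach { task =>
        val bytes = toProto(task).toByteArray
        fromBytes(bytes)
      }
    }

    benchmark.run()
  }
}
```

Measuring the two phases separately is what lets the results attribute cost to object conversion versus raw protobuf serialization.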
}

/**
 * Creates an Iceberg table with the specified number of partitions. Each partition contains one
Is this true? Often when I write in batches to Iceberg tables I get different files for iterations of inserts, and then need to do a compaction after.
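For context on that point, a typical pattern looks something like this, assuming a SparkSession named spark with the Iceberg Spark SQL extensions enabled; the catalog and table names are illustrative, and rewrite_data_files is Iceberg's standard compaction procedure:

```scala
// Repeated batch inserts typically add new data files to each affected partition.
spark.sql("INSERT INTO demo.db.events SELECT * FROM staging_batch_1")
spark.sql("INSERT INTO demo.db.events SELECT * FROM staging_batch_2")

// Compaction merges the small files back down, so "one file per partition"
// only holds after a rewrite like this (or when each partition is written exactly once).
spark.sql("CALL demo.system.rewrite_data_files(table => 'db.events')")
```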
// Benchmark the serialization
val iterations = 100
val benchmark = new Benchmark(
  s"IcebergScan serde ($numPartitions partitions, ${tasks.size()} tasks)",
The table has numPartitions partitions, right? But does the query run with that many? I figured Spark would use however many shuffle partitions it's configured for, and if num shuffle partitions < num table partitions, table partitions get grouped together. Could you confirm what you mean here? For serde it might be interesting to measure Spark partitions, not table partitions. I just want to make sure we're measuring what we expect here.
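One quick way to see the Spark-side numbers for a given run (the table name and spark session here are illustrative, not from this PR) would be:

```scala
// Compare the table's partition count against what Spark actually produces for
// the scan; for serde cost, the number of FileScanTasks is what ultimately matters.
val df = spark.table("demo.db.benchmark_table")
println(s"Spark scan partitions:        ${df.rdd.getNumPartitions}")
println(s"spark.sql.shuffle.partitions: ${spark.conf.get("spark.sql.shuffle.partitions")}")
```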
mbutrovich left a comment
Thanks @andygrove. Let's merge this so we can quantify the serialization efforts we're doing. This is a good foundation.
Summary
This PR adds a microbenchmark for measuring the serialization/deserialization performance of Iceberg FileScanTask objects to protobuf. The benchmark:
- Creates a real Iceberg table with a configurable number of partitions
- Extracts FileScanTask objects through query planning
- Benchmarks conversion from FileScanTask to Protobuf via CometIcebergNativeScan.convert()
- Benchmarks serialization to bytes and deserialization

Usage

make benchmark-org.apache.spark.sql.benchmark.CometOperatorSerdeBenchmark -- 30000
Sample Results (1000 partitions)
Key insight: The conversion from FileScanTask to protobuf dominates (~99% of time). Protobuf parsing is extremely fast.

Serialized size: 178.7 KB for 1000 tasks (~179 bytes/task)
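The per-task figure is just total serialized bytes divided by task count. A sketch of how such a number can be computed, where toProto again stands in for the actual conversion code:

```scala
import com.google.protobuf.Message
import org.apache.iceberg.FileScanTask

// Average serialized size per task, e.g. ~178.7 KB / 1000 tasks ≈ 179 bytes/task.
def avgSerializedBytes(tasks: Seq[FileScanTask], toProto: FileScanTask => Message): Double = {
  val totalBytes = tasks.map(t => toProto(t).toByteArray.length.toLong).sum
  totalBytes.toDouble / tasks.size
}
```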
Test plan
🤖 Generated with Claude Code