
Conversation

@andygrove (Member)

Summary

This PR adds a microbenchmark for measuring the serialization/deserialization performance of Iceberg FileScanTask objects to protobuf.

The benchmark:

  • Creates a real Iceberg table with a configurable number of partitions (default: 30,000)
  • Extracts actual FileScanTask objects through query planning
  • Benchmarks conversion from FileScanTask to Protobuf via CometIcebergNativeScan.convert()
  • Benchmarks serialization to bytes and deserialization (see the sketch after this list)
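
A rough sketch of the measured cases follows. It assumes the benchmark and iterations values set up in the source and that tasks is the java.util.List[FileScanTask] collected during planning; the convert() signature is simplified and ScanTaskProto is a stand-in name for the generated protobuf message class, so treat this as illustrative rather than the exact code in the PR.

// Sketch only: `benchmark`, `iterations`, and `tasks` come from the benchmark setup;
// CometIcebergNativeScan.convert() is simplified to a single argument and
// ScanTaskProto is a stand-in name for the generated protobuf message class.
val taskSeq = (0 until tasks.size()).map(i => tasks.get(i))
val serialized = taskSeq.map(t => CometIcebergNativeScan.convert(t).toByteArray)

benchmark.addCase("FileScanTask -> Protobuf (convert)", iterations) { _ =>
  taskSeq.foreach(t => CometIcebergNativeScan.convert(t))
}
benchmark.addCase("FileScanTask -> Protobuf -> bytes", iterations) { _ =>
  taskSeq.foreach(t => CometIcebergNativeScan.convert(t).toByteArray)
}
benchmark.addCase("bytes -> Protobuf (parseFrom)", iterations) { _ =>
  serialized.foreach(bytes => ScanTaskProto.parseFrom(bytes))
}
benchmark.addCase("Full roundtrip (convert + serialize + deserialize)", iterations) { _ =>
  taskSeq.foreach(t => ScanTaskProto.parseFrom(CometIcebergNativeScan.convert(t).toByteArray))
}
benchmark.run()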

Usage

# Run with default 30000 partitions
make benchmark-org.apache.spark.sql.benchmark.CometOperatorSerdeBenchmark

# Run with custom partition count
make benchmark-org.apache.spark.sql.benchmark.CometOperatorSerdeBenchmark -- 1000

Sample Results (1000 partitions)

IcebergScan serde (1000 partitions, 1000 tasks):    Best Time(ms)   Avg Time(ms)   Relative
-------------------------------------------------------------------------------------------
FileScanTask -> Protobuf (convert)                           1043           1058       1.0X
FileScanTask -> Protobuf -> bytes                            1126           1133       0.9X
bytes -> Protobuf (parseFrom)                                  10             11     107.7X
Full roundtrip (convert + serialize + deserialize)           1150           1159       0.9X

Key insight: the FileScanTask -> protobuf conversion dominates the roundtrip (~90% of the time, and ~98% once serialization to bytes is included), while protobuf parsing is extremely fast.

Serialized size: 178.7 KB for 1000 tasks (~179 bytes/task)
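
One way to compute that report, sketched with the serialized byte arrays from the sketch above (not the PR's actual reporting code):

// Sketch: total serialized size across all tasks and the average per task.
val totalBytes = serialized.map(_.length.toLong).sum
val perTask = totalBytes / serialized.size
println(f"Serialized size: ${totalBytes / 1024.0}%.1f KB for ${serialized.size} tasks (~$perTask bytes/task)")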

Test plan

  • Benchmark compiles and runs successfully
  • Results are consistent across multiple runs

🤖 Generated with Claude Code

This benchmark measures the serialization/deserialization performance
of Iceberg FileScanTask objects to protobuf, starting from actual
Iceberg Java objects rather than pre-constructed protobuf messages.

The benchmark:
- Creates a real Iceberg table with configurable number of partitions
- Extracts FileScanTask objects through query planning
- Benchmarks conversion from FileScanTask to Protobuf
- Benchmarks serialization to bytes and deserialization

Usage:
  make benchmark-org.apache.spark.sql.benchmark.CometOperatorSerdeBenchmark -- 30000

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@andygrove changed the title from "Add microbenchmark for IcebergScan operator serde roundtrip" to "chore: Add microbenchmark for IcebergScan operator serde roundtrip" on Jan 27, 2026
@andygrove requested a review from mbutrovich on January 27, 2026 at 15:19
}

/**
* Creates an Iceberg table with the specified number of partitions. Each partition contains one
Contributor:

Is this true? Often when I write in batches to Iceberg tables I get different files for iterations of inserts, and then need to do a compaction after.

// Benchmark the serialization
val iterations = 100
val benchmark = new Benchmark(
s"IcebergScan serde ($numPartitions partitions, ${tasks.size()} tasks)",
Contributor:

The table has numPartitions partitions, right? But does the query run with that many? I figured Spark would use however many shuffle partitions it's configured for, and if num shuffle partitions < num table partitions, table partitions get grouped together. Could you confirm what you mean here? For serde it might be interesting to measure Spark partitions, not table partitions. I just want to make sure we're measuring what we expect here.
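
(A hypothetical way to check this, not part of the PR: compare the table's partition count with the number of Spark partitions the scan actually produces and with the FileScanTask count. Here spark and tableName stand in for the session and table created in the benchmark setup.)

// Hypothetical sanity check: table partitions vs. Spark scan partitions vs. FileScanTasks.
val df = spark.sql(s"SELECT * FROM $tableName")
println(s"Spark scan partitions: ${df.rdd.getNumPartitions}")
println(s"Iceberg FileScanTasks: ${tasks.size()}")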

@mbutrovich (Contributor) left a comment:

Thanks @andygrove. Let's merge this so we can quantify the serialization efforts we're doing. This is a good foundation.

@mbutrovich merged commit 7ea1cd4 into apache:main on Jan 28, 2026
4 of 5 checks passed
@andygrove deleted the add-iceberg-serde-benchmark branch on January 28, 2026 at 14:37
vigneshsiva11 pushed a commit to vigneshsiva11/datafusion-comet that referenced this pull request Jan 29, 2026