
(WIP) benchmark: [RFC-98] COW table read performance comparison#18351

Draft
geserdugarov wants to merge 8 commits into apache:master from geserdugarov:dsv2-benchmark

Conversation


@geserdugarov (Contributor) commented Mar 19, 2026

Describe the issue this Pull Request addresses

This draft PR contains the benchmark used to support the design doc #18276 and the COW implementation in #18277.
Only the latest commit contains the benchmark; all previous commits are copied from #18277, which makes this branch independent and ready for rechecking.

Data: 800 Parquet files with column stats, 30 million rows, 300 columns, 100 GB in total.

The results of reading data locally:

============================================================
DSv2 vs DSv1 PERFORMANCE COMPARISON
============================================================

Full scan (COW)                    : DSv1 avg 277.7s, DSv2 avg 279.2s, speedup 0.99x (DSv1 FASTER)
Projected (COW)                    : DSv1 avg 7.4s, DSv2 avg 6.0s, speedup 1.25x (DSv2 FASTER)
Filter (COW)                       : DSv1 avg 7.2s, DSv2 avg 6.1s, speedup 1.19x (DSv2 FASTER)
Limit (COW)                        : DSv1 avg 55.6s, DSv2 avg 59.1s, speedup 0.94x (DSv1 FASTER)
Aggregate COUNT(*)                 : DSv1 avg 3.5s, DSv2 avg 0.2s, speedup 14.98x (DSv2 FASTER)
Aggregate MIN/MAX                  : DSv1 avg 3.8s, DSv2 avg 0.2s, speedup 19.55x (DSv2 FASTER)

PASS: DSv2 is faster than DSv1 in 4 of 6 scenarios

The results of reading data from remote HDFS:

============================================================
DSv2 vs DSv1 PERFORMANCE COMPARISON
============================================================

Full scan (COW)                    : DSv1 avg 273.3s, DSv2 avg 278.0s, speedup 0.98x (DSv1 FASTER)
Projected (COW)                    : DSv1 avg 7.3s, DSv2 avg 5.9s, speedup 1.24x (DSv2 FASTER)
Filter (COW)                       : DSv1 avg 7.2s, DSv2 avg 6.0s, speedup 1.20x (DSv2 FASTER)
Limit (COW)                        : DSv1 avg 56.6s, DSv2 avg 59.5s, speedup 0.95x (DSv1 FASTER)
Aggregate COUNT(*)                 : DSv1 avg 3.6s, DSv2 avg 0.2s, speedup 18.43x (DSv2 FASTER)
Aggregate MIN/MAX                  : DSv1 avg 3.8s, DSv2 avg 0.2s, speedup 20.95x (DSv2 FASTER)

PASS: DSv2 is faster than DSv1 in 4 of 6 scenarios
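The "speedup" and PASS lines above can be reproduced from the reported averages. The sketch below (a hypothetical reconstruction, not the actual benchmark code in this PR) computes the per-scenario speedup as the ratio of DSv1 to DSv2 average time and counts the scenarios where DSv2 wins, using the local-read numbers:

```scala
// Hypothetical helper mirroring the comparison logic behind the report above.
// speedup = DSv1 avg / DSv2 avg; a value > 1.0 means DSv2 is faster.
case class Scenario(name: String, dsv1Avg: Double, dsv2Avg: Double) {
  def speedup: Double = dsv1Avg / dsv2Avg
}

// Averages taken from the local-read results in this PR description.
val scenarios = Seq(
  Scenario("Full scan (COW)",     277.7, 279.2),
  Scenario("Projected (COW)",       7.4,   6.0),
  Scenario("Filter (COW)",          7.2,   6.1),
  Scenario("Limit (COW)",          55.6,  59.1),
  Scenario("Aggregate COUNT(*)",    3.5,   0.2),
  Scenario("Aggregate MIN/MAX",     3.8,   0.2)
)

// Count scenarios where DSv2 beats DSv1, matching the PASS summary line.
val dsv2Wins = scenarios.count(_.speedup > 1.0)
val verdict  = s"DSv2 is faster than DSv1 in $dsv2Wins of ${scenarios.size} scenarios"
```

With these numbers `dsv2Wins` is 4, matching "PASS: DSv2 is faster than DSv1 in 4 of 6 scenarios".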

Summary and Changelog

Scala code, with a description, for submitting to a Spark cluster; it performs the DSv2 vs DSv1 read comparison.
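The averages in the report are wall-clock times over repeated runs. A minimal sketch of such a timing methodology (a warm-up run followed by an averaged set of measured runs) is shown below; `scenario` stands in for a Spark read action such as `df.count()`, replaced here by a dummy workload so the sketch is self-contained. This is an assumed shape, not the code shipped in this PR:

```scala
// Hypothetical timing harness: one warm-up run, then average wall-clock
// seconds over `runs` measured executions of the scenario.
def avgSeconds(runs: Int)(scenario: () => Unit): Double = {
  scenario() // warm-up, excluded from the average
  val times = (1 to runs).map { _ =>
    val start = System.nanoTime()
    scenario()
    (System.nanoTime() - start) / 1e9
  }
  times.sum / runs
}

// Dummy workload standing in for a Spark read action like df.count().
val dummy = () => { (1 to 100000).foldLeft(0L)(_ + _); () }
val avg   = avgSeconds(3)(dummy)
```

In the real benchmark each scenario (full scan, projection, filter, limit, aggregates) would be timed this way once with the DSv1 path and once with the DSv2 path, and the two averages compared.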

Impact

None. Supporting materials.

Risk Level

None

Documentation Update

None

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@github-actions github-actions bot added the size:XL PR with lines of changes > 1000 label Mar 19, 2026
@geserdugarov geserdugarov force-pushed the dsv2-benchmark branch 2 times, most recently from dc2433f to 5d3093d on March 19, 2026 at 13:16
@geserdugarov geserdugarov force-pushed the dsv2-benchmark branch 2 times, most recently from 713d749 to b137583 on March 19, 2026 at 17:14
@codecov-commenter

Codecov Report

❌ Patch coverage is 70.61728% with 119 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.44%. Comparing base (19c4cc9) to head (98af4fe).
⚠️ Report is 3 commits behind head on master.

Files with missing lines Patch % Lines
...g/apache/spark/sql/hudi/v2/HoodieScanBuilder.scala 61.64% 36 Missing and 48 partials ⚠️
.../apache/spark/sql/hudi/v2/HoodieSparkV2Table.scala 71.05% 3 Missing and 8 partials ⚠️
...ache/spark/sql/hudi/v2/HoodiePartitionReader.scala 75.75% 3 Missing and 5 partials ⚠️
.../apache/spark/sql/hudi/v2/HoodieDataSourceV2.scala 58.82% 3 Missing and 4 partials ⚠️
.../apache/spark/sql/hudi/catalog/HoodieCatalog.scala 84.61% 1 Missing and 1 partial ⚠️
...rk/sql/hudi/analysis/HoodieSparkBaseAnalysis.scala 0.00% 0 Missing and 1 partial ⚠️
...spark/sql/hudi/catalog/HoodieInternalV2Table.scala 95.00% 0 Missing and 1 partial ⚠️
...org/apache/spark/sql/hudi/v2/HoodieBatchScan.scala 96.29% 0 Missing and 1 partial ⚠️
...pache/spark/sql/hudi/v2/HoodieInputPartition.scala 75.00% 1 Missing ⚠️
...org/apache/spark/sql/hudi/v2/HoodieLocalScan.scala 80.00% 1 Missing ⚠️
... and 2 more
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18351      +/-   ##
============================================
+ Coverage     68.41%   68.44%   +0.02%     
- Complexity    27411    27532     +121     
============================================
  Files          2423     2433      +10     
  Lines        132458   132853     +395     
  Branches      15972    16050      +78     
============================================
+ Hits          90623    90928     +305     
- Misses        34786    34804      +18     
- Partials       7049     7121      +72     
Flag Coverage Δ
common-and-other-modules 44.21% <1.73%> (-0.14%) ⬇️
hadoop-mr-java-client 45.10% <ø> (-0.07%) ⬇️
spark-client-hadoop-common 48.31% <ø> (+<0.01%) ⬆️
spark-java-tests 48.97% <70.61%> (+0.15%) ⬆️
spark-scala-tests 44.85% <2.71%> (-0.16%) ⬇️
utilities 38.46% <2.48%> (-0.15%) ⬇️

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
...apache/spark/sql/hudi/v2/PartialLimitPushDown.java 100.00% <100.00%> (ø)
...main/scala/org/apache/hudi/DataSourceOptions.scala 95.44% <100.00%> (+0.04%) ⬆️
...ark/sql/hudi/v2/HoodiePartitionReaderFactory.scala 100.00% <100.00%> (ø)
...rg/apache/spark/sql/hudi/v2/HoodieStatistics.scala 100.00% <100.00%> (ø)
...park/sql/hudi/analysis/HoodieSpark33Analysis.scala 66.66% <ø> (ø)
...park/sql/hudi/analysis/HoodieSpark34Analysis.scala 66.66% <ø> (ø)
...park/sql/hudi/analysis/HoodieSpark35Analysis.scala 48.05% <ø> (+1.29%) ⬆️
...park/sql/hudi/analysis/HoodieSpark40Analysis.scala 48.05% <ø> (+1.29%) ⬆️
...rk/sql/hudi/analysis/HoodieSparkBaseAnalysis.scala 73.77% <0.00%> (-0.41%) ⬇️
...spark/sql/hudi/catalog/HoodieInternalV2Table.scala 65.51% <95.00%> (+32.18%) ⬆️
... and 10 more

... and 10 files with indirect coverage changes

