
(WIP) benchmark: [RFC-98] COW table read performance comparison#18351

Draft
geserdugarov wants to merge 8 commits into apache:master from geserdugarov:dsv2-benchmark

Conversation


@geserdugarov (Contributor) commented Mar 19, 2026

Describe the issue this Pull Request addresses

This draft PR contains the benchmark used to support the design doc #18276 and the COW implementation in #18277.
Only the latest commit contains the benchmark; all previous commits are copied from #18277, which makes this branch independent and ready for rechecking.

Data: 800 Parquet files with column stats, 30 million rows, 300 columns, 100 GB in total.

The results of reading data locally:

============================================================
DSv2 vs DSv1 PERFORMANCE COMPARISON
============================================================

Full scan (COW)                    : DSv1 avg 277.7s, DSv2 avg 279.2s, speedup 0.99x (DSv1 FASTER)
Projected (COW)                    : DSv1 avg 7.4s, DSv2 avg 6.0s, speedup 1.25x (DSv2 FASTER)
Filter (COW)                       : DSv1 avg 7.2s, DSv2 avg 6.1s, speedup 1.19x (DSv2 FASTER)
Limit (COW)                        : DSv1 avg 55.6s, DSv2 avg 59.1s, speedup 0.94x (DSv1 FASTER)
Aggregate COUNT(*)                 : DSv1 avg 3.5s, DSv2 avg 0.2s, speedup 14.98x (DSv2 FASTER)
Aggregate MIN/MAX                  : DSv1 avg 3.8s, DSv2 avg 0.2s, speedup 19.55x (DSv2 FASTER)

PASS: DSv2 is faster than DSv1 in 4 of 6 scenarios

The results of reading data from remote HDFS:

============================================================
DSv2 vs DSv1 PERFORMANCE COMPARISON
============================================================

Full scan (COW)                    : DSv1 avg 273.3s, DSv2 avg 278.0s, speedup 0.98x (DSv1 FASTER)
Projected (COW)                    : DSv1 avg 7.3s, DSv2 avg 5.9s, speedup 1.24x (DSv2 FASTER)
Filter (COW)                       : DSv1 avg 7.2s, DSv2 avg 6.0s, speedup 1.20x (DSv2 FASTER)
Limit (COW)                        : DSv1 avg 56.6s, DSv2 avg 59.5s, speedup 0.95x (DSv1 FASTER)
Aggregate COUNT(*)                 : DSv1 avg 3.6s, DSv2 avg 0.2s, speedup 18.43x (DSv2 FASTER)
Aggregate MIN/MAX                  : DSv1 avg 3.8s, DSv2 avg 0.2s, speedup 20.95x (DSv2 FASTER)

PASS: DSv2 is faster than DSv1 in 4 of 6 scenarios
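The "speedup" and PASS lines above can be reproduced from the reported averages. The sketch below (a hypothetical reconstruction, not the actual benchmark code in this PR) computes the per-scenario speedup as the ratio of DSv1 to DSv2 average time and counts the scenarios where DSv2 wins, using the local-read numbers:

```scala
// Hypothetical helper mirroring the comparison logic behind the report above.
// speedup = DSv1 avg / DSv2 avg; a value > 1.0 means DSv2 is faster.
case class Scenario(name: String, dsv1Avg: Double, dsv2Avg: Double) {
  def speedup: Double = dsv1Avg / dsv2Avg
}

// Averages taken from the local-read results in this PR description.
val scenarios = Seq(
  Scenario("Full scan (COW)",     277.7, 279.2),
  Scenario("Projected (COW)",       7.4,   6.0),
  Scenario("Filter (COW)",          7.2,   6.1),
  Scenario("Limit (COW)",          55.6,  59.1),
  Scenario("Aggregate COUNT(*)",    3.5,   0.2),
  Scenario("Aggregate MIN/MAX",     3.8,   0.2)
)

// Count scenarios where DSv2 beats DSv1, matching the PASS summary line.
val dsv2Wins = scenarios.count(_.speedup > 1.0)
val verdict  = s"DSv2 is faster than DSv1 in $dsv2Wins of ${scenarios.size} scenarios"
```

With these numbers `dsv2Wins` is 4, matching "PASS: DSv2 is faster than DSv1 in 4 of 6 scenarios".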

Summary and Changelog

Scala code, with a description, for submitting to a Spark cluster; it performs the DSv2 vs DSv1 read comparison.
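The averages in the report are wall-clock times over repeated runs. A minimal sketch of such a timing methodology (a warm-up run followed by an averaged set of measured runs) is shown below; `scenario` stands in for a Spark read action such as `df.count()`, replaced here by a dummy workload so the sketch is self-contained. This is an assumed shape, not the code shipped in this PR:

```scala
// Hypothetical timing harness: one warm-up run, then average wall-clock
// seconds over `runs` measured executions of the scenario.
def avgSeconds(runs: Int)(scenario: () => Unit): Double = {
  scenario() // warm-up, excluded from the average
  val times = (1 to runs).map { _ =>
    val start = System.nanoTime()
    scenario()
    (System.nanoTime() - start) / 1e9
  }
  times.sum / runs
}

// Dummy workload standing in for a Spark read action like df.count().
val dummy = () => { (1 to 100000).foldLeft(0L)(_ + _); () }
val avg   = avgSeconds(3)(dummy)
```

In the real benchmark each scenario (full scan, projection, filter, limit, aggregates) would be timed this way once with the DSv1 path and once with the DSv2 path, and the two averages compared.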

Impact

None. Supporting materials.

Risk Level

None

Documentation Update

None

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@github-actions github-actions bot added the size:XL PR with lines of changes > 1000 label Mar 19, 2026
@geserdugarov geserdugarov force-pushed the dsv2-benchmark branch 2 times, most recently from dc2433f to 5d3093d on March 19, 2026 at 13:16
@geserdugarov geserdugarov force-pushed the dsv2-benchmark branch 2 times, most recently from 713d749 to b137583 on March 19, 2026 at 17:14
@codecov-commenter

Codecov Report

❌ Patch coverage is 70.61728% with 119 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.44%. Comparing base (19c4cc9) to head (98af4fe).
⚠️ Report is 3 commits behind head on master.

Files with missing lines Patch % Lines
...g/apache/spark/sql/hudi/v2/HoodieScanBuilder.scala 61.64% 36 Missing and 48 partials ⚠️
.../apache/spark/sql/hudi/v2/HoodieSparkV2Table.scala 71.05% 3 Missing and 8 partials ⚠️
...ache/spark/sql/hudi/v2/HoodiePartitionReader.scala 75.75% 3 Missing and 5 partials ⚠️
.../apache/spark/sql/hudi/v2/HoodieDataSourceV2.scala 58.82% 3 Missing and 4 partials ⚠️
.../apache/spark/sql/hudi/catalog/HoodieCatalog.scala 84.61% 1 Missing and 1 partial ⚠️
...rk/sql/hudi/analysis/HoodieSparkBaseAnalysis.scala 0.00% 0 Missing and 1 partial ⚠️
...spark/sql/hudi/catalog/HoodieInternalV2Table.scala 95.00% 0 Missing and 1 partial ⚠️
...org/apache/spark/sql/hudi/v2/HoodieBatchScan.scala 96.29% 0 Missing and 1 partial ⚠️
...pache/spark/sql/hudi/v2/HoodieInputPartition.scala 75.00% 1 Missing ⚠️
...org/apache/spark/sql/hudi/v2/HoodieLocalScan.scala 80.00% 1 Missing ⚠️
... and 2 more
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18351      +/-   ##
============================================
+ Coverage     68.41%   68.44%   +0.02%     
- Complexity    27411    27532     +121     
============================================
  Files          2423     2433      +10     
  Lines        132458   132853     +395     
  Branches      15972    16050      +78     
============================================
+ Hits          90623    90928     +305     
- Misses        34786    34804      +18     
- Partials       7049     7121      +72     
Flag Coverage Δ
common-and-other-modules 44.21% <1.73%> (-0.14%) ⬇️
hadoop-mr-java-client 45.10% <ø> (-0.07%) ⬇️
spark-client-hadoop-common 48.31% <ø> (+<0.01%) ⬆️
spark-java-tests 48.97% <70.61%> (+0.15%) ⬆️
spark-scala-tests 44.85% <2.71%> (-0.16%) ⬇️
utilities 38.46% <2.48%> (-0.15%) ⬇️

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
...apache/spark/sql/hudi/v2/PartialLimitPushDown.java 100.00% <100.00%> (ø)
...main/scala/org/apache/hudi/DataSourceOptions.scala 95.44% <100.00%> (+0.04%) ⬆️
...ark/sql/hudi/v2/HoodiePartitionReaderFactory.scala 100.00% <100.00%> (ø)
...rg/apache/spark/sql/hudi/v2/HoodieStatistics.scala 100.00% <100.00%> (ø)
...park/sql/hudi/analysis/HoodieSpark33Analysis.scala 66.66% <ø> (ø)
...park/sql/hudi/analysis/HoodieSpark34Analysis.scala 66.66% <ø> (ø)
...park/sql/hudi/analysis/HoodieSpark35Analysis.scala 48.05% <ø> (+1.29%) ⬆️
...park/sql/hudi/analysis/HoodieSpark40Analysis.scala 48.05% <ø> (+1.29%) ⬆️
...rk/sql/hudi/analysis/HoodieSparkBaseAnalysis.scala 73.77% <0.00%> (-0.41%) ⬇️
...spark/sql/hudi/catalog/HoodieInternalV2Table.scala 65.51% <95.00%> (+32.18%) ⬆️
... and 10 more

... and 10 files with indirect coverage changes

