[HUDI-8382] Improve MOR-Snapshot-Query performance for COW like table by TheR1sing3un · Pull Request #12112 · apache/hudi

TheR1sing3un · 2024-10-16T10:55:04Z

In some cases, the latest file slices of a MOR table (or the file slices visible at a time-travel instant) contain only base files and no log files. When running a snapshot query on such a table, we can treat it as a MOR read-optimized query and provide a HadoopFsRelation to Spark.
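A minimal sketch of the check described above, assuming simplified stand-in types. `FileSlice`, `ReadPath`, and `SnapshotReadPlanner` are hypothetical names used only for illustration; they are not Hudi's actual classes, and the real change operates on Hudi's relation implementations.

```scala
// Hypothetical, simplified stand-in for a Hudi file slice (not Hudi's real class):
// an optional base file plus zero or more log files.
final case class FileSlice(baseFile: Option[String], logFiles: Seq[String])

// Hypothetical labels for the two read paths discussed in this PR.
sealed trait ReadPath
case object BaseFileOnlyPath extends ReadPath // could fall back to HadoopFsRelation
case object MergeOnReadPath extends ReadPath  // must merge base files with log files

object SnapshotReadPlanner {
  // A slice is base-file-only when it has a base file and no log files.
  def isBaseFileOnly(slice: FileSlice): Boolean =
    slice.baseFile.isDefined && slice.logFiles.isEmpty

  // Pick the cheaper path only when every slice qualifies.
  // Note: an empty slice list also yields BaseFileOnlyPath, since forall is
  // vacuously true; that is an assumption of this sketch.
  def choosePath(slices: Seq[FileSlice]): ReadPath =
    if (slices.forall(isBaseFileOnly)) BaseFileOnlyPath else MergeOnReadPath
}
```

A single slice with a log file is enough to keep the whole query on the merge path, which is why the optimization only applies to "COW like" MOR tables.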

Change Logs

Treat a MOR snapshot query on a table whose file slices are all base-file-only as a MOR read-optimized query.

Impact

none

Risk level (write none, low medium or high below)

low


Contributor's checklist

Read through contributor's guide
Change Logs and Impact were stated clearly
Adequate tests were added if applicable
CI passed

danny0405 · 2024-10-17T00:23:52Z

Is this change related to #12080?

TheR1sing3un · 2024-10-18T04:57:39Z

> Is this change related to #12080?

#12080 optimizes filter pushdown for HoodieBaseRelation by pruning unnecessary columns.
My change focuses on treating a MergeOnReadRelation whose file slices are all base-file-only as a BaseFileOnlyRelation, so that it can fall back to HadoopFsRelation. Spark has many optimizations for HadoopFsRelation that can improve our query performance.

danny0405 · 2024-10-21T01:20:38Z

hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala


  val INCREMENTAL_READ_HANDLE_HOLLOW_COMMIT: ConfigProperty[String] = HoodieCommonConfig.INCREMENTAL_READ_HANDLE_HOLLOW_COMMIT

+  val ENABLE_OPTIMIZED_READ_FOR_MOR_WITH_ALL_BASE_FILE_ONLY_SLICE: ConfigProperty[Boolean] = ConfigProperty


If it's a pure optimization, let's eliminate this option. cc @jonvex and @yihua to take a look too.

Isn't that just a read optimized query?

TheR1sing3un · 2024-10-21T02:13:07Z

@hudi-bot run azure

hudi-bot · 2024-10-21T03:20:48Z

CI report:

083aff2 UNKNOWN
0802e2b Azure: FAILURE Azure: SUCCESS

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure: re-run the last Azure build

jonvex · 2024-10-21T18:44:40Z

We have this all optimized with the new filegroup reader already so I'm not sure how much longer the relation implementations will be around anyways

TheR1sing3un · 2024-10-22T02:34:54Z

> We have this all optimized with the new filegroup reader already so I'm not sure how much longer the relation implementations will be around anyways

Do you mean using NewHoodieParquetFileFormat?

It still looks like an experimental feature. In many real scenarios in our production environment, we still query through the relation implementations.
IMO, we could introduce a MOR snapshot query fallback to HadoopFsRelation within the relation implementations, rather than switching directly to NewHoodieParquetFileFormat.
Looking forward to your reply!

github-actions bot added the size:S PR with lines of changes in (10, 100] label Oct 16, 2024

feat: regard mor snapshot query with all base-file-only table as mor …

083aff2

…read-optimized query 1. regard mor snapshot query with all base-file-only table as mor read-optimized query Signed-off-by: TheR1sing3un <chaoyang@apache.org>

TheR1sing3un force-pushed the feat_optimize_mor_read_with_empty_log branch from be6673e to 083aff2 Compare October 18, 2024 04:52

github-actions bot added size:M PR with lines of changes in (100, 300] and removed size:S PR with lines of changes in (10, 100] labels Oct 18, 2024

TheR1sing3un added 3 commits October 18, 2024 13:46

feat: optimize TestDataSkippingWithMORColstats

3b56049

1. optimize TestDataSkippingWithMORColstats Signed-off-by: TheR1sing3un <chaoyang@apache.org>

test: fix TestNestedSchemaPruningOptimization

813ad02

1. fix TestNestedSchemaPruningOptimization Signed-off-by: TheR1sing3un <chaoyang@apache.org>

test: fix TestSparkSqlWithCustomKeyGenerator

0802e2b

1. fix TestSparkSqlWithCustomKeyGenerator, need unified time conversion for mor and cow in the future to solve the root problem Signed-off-by: TheR1sing3un <chaoyang@apache.org>

danny0405 reviewed Oct 21, 2024


TheR1sing3un closed this May 28, 2025

hudi-bot mentioned this pull request Dec 9, 2025

Improve MOR-Snapshot-Query performance for COW like table #16684

Closed

TheR1sing3un wants to merge 4 commits into apache:branch-0.x from TheR1sing3un:feat_optimize_mor_read_with_empty_log

4 participants

