[HUDI-8382] Improve MOR-Snapshot-Query performance for COW like table #12112

Closed
TheR1sing3un wants to merge 4 commits into apache:branch-0.x from TheR1sing3un:feat_optimize_mor_read_with_empty_log

TheR1sing3un (Member) commented Oct 16, 2024

In some cases, all of the latest file slices of a MOR table (or the file slices visible at a time-travel instant) contain only a base file and no log files. When a snapshot query runs against such a table, we can treat it as a MOR read-optimized query and provide a HadoopFsRelation to Spark.
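
The condition described above can be sketched in Scala as follows. This is a minimal illustration, not the code in this PR: `SliceSketch` and `allBaseFileOnly` are hypothetical names, and the case class is a simplified stand-in for Hudi's file-slice abstraction.

```scala
// Hypothetical, simplified stand-in for a file slice: an optional base file
// plus the log files written on top of it. Not Hudi's actual FileSlice class.
final case class SliceSketch(baseFile: Option[String], logFiles: Seq[String])

object BaseFileOnlyCheck {
  // True when every slice has a base file and no log files, i.e. there is
  // nothing to merge at read time. Note that forall is also true for an empty
  // sequence; how an empty table should be treated is left open here.
  def allBaseFileOnly(slices: Seq[SliceSketch]): Boolean =
    slices.forall(s => s.baseFile.isDefined && s.logFiles.isEmpty)
}
```

A single slice with a non-empty log-file list is enough to make the check fail, in which case the regular merge-on-read path would still be needed.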

Change Logs

  1. Treat a MOR snapshot query on a table whose file slices are all base-file-only as a MOR read-optimized query

Impact

none

Risk level (write none, low medium or high below)

low

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".

  • The config description must be updated if new configs are added or the default value of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

github-actions bot added the "size:S PR with lines of changes in (10, 100]" label Oct 16, 2024
danny0405 (Contributor) commented

Is this change related to #12080?

…read-optimized query

1. regard mor snapshot query with all base-file-only table as mor read-optimized query

Signed-off-by: TheR1sing3un <chaoyang@apache.org>
TheR1sing3un force-pushed the feat_optimize_mor_read_with_empty_log branch from be6673e to 083aff2 on October 18, 2024 04:52
TheR1sing3un (Member, Author) commented Oct 18, 2024

> Is this change related to #12080?

#12080 optimizes filter pushdown for HoodieBaseRelation by pruning unnecessary columns.
My change treats a MergeOnReadRelation whose file slices are all base-file-only as a BaseFileOnlyRelation, so that it can fall back to HadoopFsRelation. Spark has many optimizations for HadoopFsRelation that can improve query performance.
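
A rough sketch of that selection logic, under stated assumptions: `RelationChoice`, `BaseFileOnlyPath`, `MergeOnReadPath`, `choose`, and the per-slice log-file counts are hypothetical names used only for illustration, while `snapshot` and `read_optimized` are values Hudi accepts for `hoodie.datasource.query.type`. It is not the code in this PR.

```scala
// Hypothetical outcome of relation selection for a MOR table.
sealed trait RelationChoice
case object BaseFileOnlyPath extends RelationChoice // can be handed to Spark as a HadoopFsRelation
case object MergeOnReadPath extends RelationChoice  // merges base files with log files at read time

object RelationSelection {
  // queryType: only "snapshot" and "read_optimized" are modeled in this sketch;
  // any other value falls through to the merge-on-read path.
  // logFileCounts: number of log files in each latest file slice (assumed input).
  def choose(queryType: String, logFileCounts: Seq[Int]): RelationChoice =
    queryType match {
      case "read_optimized"                           => BaseFileOnlyPath
      case "snapshot" if logFileCounts.forall(_ == 0) => BaseFileOnlyPath
      case _                                          => MergeOnReadPath
    }
}
```

In this sketch a read-optimized query always ignores log files, whereas the snapshot shortcut applies only when there are no log files to ignore, so the snapshot result would be unchanged by taking the faster path.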

github-actions bot added the "size:M PR with lines of changes in (100, 300]" label and removed the "size:S PR with lines of changes in (10, 100]" label Oct 18, 2024
1. optimize TestDataSkippingWithMORColstats

Signed-off-by: TheR1sing3un <chaoyang@apache.org>
1. fix TestNestedSchemaPruningOptimization

Signed-off-by: TheR1sing3un <chaoyang@apache.org>
1. fix TestSparkSqlWithCustomKeyGenerator, need unified time conversion for mor and cow in the future to solve the root problem

Signed-off-by: TheR1sing3un <chaoyang@apache.org>

val INCREMENTAL_READ_HANDLE_HOLLOW_COMMIT: ConfigProperty[String] = HoodieCommonConfig.INCREMENTAL_READ_HANDLE_HOLLOW_COMMIT

val ENABLE_OPTIMIZED_READ_FOR_MOR_WITH_ALL_BASE_FILE_ONLY_SLICE: ConfigProperty[Boolean] = ConfigProperty
Contributor

If it's a pure optimization, let's eliminate this option. cc @jonvex and @yihua to take a look too.

Contributor

Isn't that just a read optimized query?

TheR1sing3un (Member, Author) commented

@hudi-bot run azure

hudi-bot (Collaborator) commented

CI report:

Bot commands
@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

jonvex (Contributor) commented Oct 21, 2024

We have this all optimized with the new filegroup reader already so I'm not sure how much longer the relation implementations will be around anyways

TheR1sing3un (Member, Author) commented Oct 22, 2024

> We have this all optimized with the new filegroup reader already so I'm not sure how much longer the relation implementations will be around anyways

Do you mean using NewHoodieParquetFileFormat?
It still looks like an experimental feature. In many real scenarios in our production environment, we still query through the relation implementations.
IMO, we could introduce [MOR-SNAPSHOT-QUERY-FALLBACK-TO-HadoopFsRelation] within the relation implementations, rather than switching directly to NewHoodieParquetFileFormat.
Looking forward to your reply!

Labels

size:M PR with lines of changes in (100, 300]

4 participants