[HUDI-8483] Remove unnecessary code by linliu-code · Pull Request #12323 · apache/hudi

linliu-code · 2024-11-24T00:06:33Z

Change Logs

When there are log files, the default query type is snapshot query.
When there are no log files, there is no difference between snapshot query and ro queries.

Impact

Simplified code.

Risk level (write none, low medium or high below)

None.

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".

The config description must be updated if new configs are added or the default value of the configs are changed
Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
ticket number here and follow the instruction to make
changes to the website.

Contributor's checklist

Read through contributor's guide
Change Logs and Impact were stated clearly
Adequate tests were added if applicable
CI passed

hudi-bot · 2024-11-24T00:10:30Z

CI report:

891d09c Azure: PENDING

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure re-run the last Azure build

yihua · 2024-11-25T20:58:49Z

...n/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java

-    if (hasLogFiles) {
-      params.put("hoodie.datasource.query.type", "snapshot");
-    } else {
-      params.put("hoodie.datasource.query.type", "read_optimized");
-    }


By default, hoodie.datasource.query.type is set to snapshot, and the new HadoopFSRelation based reader logic in Spark makes sure there's no performance degradation for base file-only cases in MOR, so params.put("hoodie.datasource.query.type", "read_optimized") should not be needed either. Could you point out what errors are thrown if these lines are removed? It would be good to record and understand the errors to make sure there is no other related issue.

https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala#L250, which shows there is no default value set.
I did not know if it is safe to rely the default value either.

Meanwhile, do we know how HadoopFSRelation ensure no performance degradation for base file-only cases?

The config itself has the default value defined but as you pointed out it's not honored through the read path with file paths provided (which the clustering execution uses).

val QUERY_TYPE: ConfigProperty[String] = ConfigProperty .key("hoodie.datasource.query.type") .defaultValue(QUERY_TYPE_SNAPSHOT_OPT_VAL) .withAlternatives("hoodie.datasource.view.type") .withValidValues(QUERY_TYPE_SNAPSHOT_OPT_VAL, QUERY_TYPE_READ_OPTIMIZED_OPT_VAL, QUERY_TYPE_INCREMENTAL_OPT_VAL) .withDocumentation("Whether data needs to be read, in `" + QUERY_TYPE_INCREMENTAL_OPT_VAL + "` mode (new data since an instantTime) " + "(or) `" + QUERY_TYPE_READ_OPTIMIZED_OPT_VAL + "` mode (obtain latest view, based on base files) (or) `" + QUERY_TYPE_SNAPSHOT_OPT_VAL + "` mode " + "(obtain latest view, by merging base and (if any) log files)")

Also, after checking the code again, when the file paths are provided through "hoodie.datasource.read.paths", the relation-based read path is used (i.e., useNewParquetFileFormat is false). We can keep this for now.

Filed HUDI-8576 as a follow-up.

HUDI-8577 to track default change. We can take this after Hudi 1.0 is released.

linliu-code · 2024-11-25T21:57:32Z

Anyways, I can reopen this case, and add the default query type.

linliu-code · 2024-11-25T22:06:11Z

Cool. thanks. then I will close this for now.

linliu-code closed this Nov 24, 2024

yihua reviewed Nov 25, 2024

View reviewed changes

linliu-code reopened this Nov 25, 2024

github-actions bot added the size:XS PR with lines of changes in <= 10 label Nov 25, 2024

linliu-code added 2 commits November 25, 2024 13:59

Remove unnecessary code

b8a885d

Add default query type

a1f4d7a

linliu-code force-pushed the HUDI-8483_remove_time_travel_with_clustering branch from 891d09c to a1f4d7a Compare November 25, 2024 22:02

linliu-code closed this Nov 25, 2024

hudi-bot mentioned this pull request Nov 30, 2025

Use new file format instead of relations for clustering execution #16742

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[HUDI-8483] Remove unnecessary code#12323

[HUDI-8483] Remove unnecessary code#12323
linliu-code wants to merge 2 commits intoapache:masterfrom
linliu-code:HUDI-8483_remove_time_travel_with_clustering

linliu-code commented Nov 24, 2024

Uh oh!

hudi-bot commented Nov 24, 2024

Uh oh!

yihua Nov 25, 2024 •

edited

Loading

Uh oh!

linliu-code Nov 25, 2024

Uh oh!

yihua Nov 25, 2024

Uh oh!

yihua Nov 25, 2024

Uh oh!

linliu-code commented Nov 25, 2024

Uh oh!

linliu-code commented Nov 25, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

linliu-code commented Nov 24, 2024

Change Logs

Impact

Risk level (write none, low medium or high below)

Documentation Update

Contributor's checklist

Uh oh!

hudi-bot commented Nov 24, 2024

CI report:

Uh oh!

yihua Nov 25, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

linliu-code Nov 25, 2024

Choose a reason for hiding this comment

Uh oh!

yihua Nov 25, 2024

Choose a reason for hiding this comment

Uh oh!

yihua Nov 25, 2024

Choose a reason for hiding this comment

Uh oh!

linliu-code commented Nov 25, 2024

Uh oh!

linliu-code commented Nov 25, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

yihua Nov 25, 2024 •

edited

Loading