Skip to content

[HUDI-8483] Remove unnecessary code#12323

Closed
linliu-code wants to merge 2 commits intoapache:masterfrom
linliu-code:HUDI-8483_remove_time_travel_with_clustering
Closed

[HUDI-8483] Remove unnecessary code#12323
linliu-code wants to merge 2 commits intoapache:masterfrom
linliu-code:HUDI-8483_remove_time_travel_with_clustering

Conversation

@linliu-code
Copy link
Collaborator

Change Logs

When there are log files, the default query type is snapshot query.
When there are no log files, there is no difference between snapshot query and ro queries.

Impact

Simplified code.

Risk level (write none, low medium or high below)

None.

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".

  • The config description must be updated if new configs are added or the default value of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@hudi-bot
Copy link
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

Comment on lines -446 to -450
if (hasLogFiles) {
params.put("hoodie.datasource.query.type", "snapshot");
} else {
params.put("hoodie.datasource.query.type", "read_optimized");
}
Copy link
Contributor

@yihua yihua Nov 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By default, hoodie.datasource.query.type is set to snapshot, and the new HadoopFSRelation based reader logic in Spark makes sure there's no performance degradation for base file-only cases in MOR, so params.put("hoodie.datasource.query.type", "read_optimized") should not be needed either. Could you point out what errors are thrown if these lines are removed? It would be good to record and understand the errors to make sure there is no other related issue.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala#L250, which shows there is no default value set.
I did not know if it is safe to rely the default value either.

Meanwhile, do we know how HadoopFSRelation ensure no performance degradation for base file-only cases?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The config itself has the default value defined but as you pointed out it's not honored through the read path with file paths provided (which the clustering execution uses).

val QUERY_TYPE: ConfigProperty[String] = ConfigProperty
    .key("hoodie.datasource.query.type")
    .defaultValue(QUERY_TYPE_SNAPSHOT_OPT_VAL)
    .withAlternatives("hoodie.datasource.view.type")
    .withValidValues(QUERY_TYPE_SNAPSHOT_OPT_VAL, QUERY_TYPE_READ_OPTIMIZED_OPT_VAL, QUERY_TYPE_INCREMENTAL_OPT_VAL)
    .withDocumentation("Whether data needs to be read, in `" + QUERY_TYPE_INCREMENTAL_OPT_VAL + "` mode (new data since an instantTime) " +
      "(or) `" + QUERY_TYPE_READ_OPTIMIZED_OPT_VAL + "` mode (obtain latest view, based on base files) (or) `" + QUERY_TYPE_SNAPSHOT_OPT_VAL + "` mode " +
      "(obtain latest view, by merging base and (if any) log files)")

Also, after checking the code again, when the file paths are provided through "hoodie.datasource.read.paths", the relation-based read path is used (i.e., useNewParquetFileFormat is false). We can keep this for now.

Filed HUDI-8576 as a follow-up.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HUDI-8577 to track default change. We can take this after Hudi 1.0 is released.

@linliu-code
Copy link
Collaborator Author

Anyways, I can reopen this case, and add the default query type.

@linliu-code linliu-code reopened this Nov 25, 2024
@github-actions github-actions bot added the size:XS PR with lines of changes in <= 10 label Nov 25, 2024
@linliu-code linliu-code force-pushed the HUDI-8483_remove_time_travel_with_clustering branch from 891d09c to a1f4d7a Compare November 25, 2024 22:02
@linliu-code
Copy link
Collaborator Author

Cool. thanks. then I will close this for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:XS PR with lines of changes in <= 10

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants