[HUDI-8483] Remove unnecessary code #12323
if (hasLogFiles) {
  params.put("hoodie.datasource.query.type", "snapshot");
} else {
  params.put("hoodie.datasource.query.type", "read_optimized");
}
By default, hoodie.datasource.query.type is set to snapshot, and the new HadoopFSRelation based reader logic in Spark makes sure there's no performance degradation for base file-only cases in MOR, so params.put("hoodie.datasource.query.type", "read_optimized") should not be needed either. Could you point out what errors are thrown if these lines are removed? It would be good to record and understand the errors to make sure there is no other related issue.
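For reference, the branch under discussion can be isolated into a small helper. This is only a minimal sketch: the class `QueryTypeParams` and the method `buildReadParams` are hypothetical names used for illustration and are not part of the Hudi codebase. Only the config key and the two values come from the diff above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper (not a Hudi class) that mirrors the branch in the diff.
class QueryTypeParams {
    // Config key taken from the diff under review.
    static final String QUERY_TYPE_KEY = "hoodie.datasource.query.type";

    // Builds the read params the same way the existing code does:
    // "snapshot" when log files are present, "read_optimized" otherwise.
    static Map<String, String> buildReadParams(boolean hasLogFiles) {
        Map<String, String> params = new HashMap<>();
        params.put(QUERY_TYPE_KEY, hasLogFiles ? "snapshot" : "read_optimized");
        return params;
    }
}
```

Written this way, the question in this thread becomes whether the call that sets the key can be dropped entirely, which depends on whether the reader applies a default when the key is absent.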
See https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala#L250, which shows that no default value is set.
I also do not know whether it is safe to rely on the default value.
Meanwhile, do we know how HadoopFSRelation ensures there is no performance degradation for base file-only cases?
The config itself defines a default value, but, as you pointed out, that default is not honored on the read path where file paths are provided (which the clustering execution uses).
val QUERY_TYPE: ConfigProperty[String] = ConfigProperty
  .key("hoodie.datasource.query.type")
  .defaultValue(QUERY_TYPE_SNAPSHOT_OPT_VAL)
  .withAlternatives("hoodie.datasource.view.type")
  .withValidValues(QUERY_TYPE_SNAPSHOT_OPT_VAL, QUERY_TYPE_READ_OPTIMIZED_OPT_VAL, QUERY_TYPE_INCREMENTAL_OPT_VAL)
  .withDocumentation("Whether data needs to be read, in `" + QUERY_TYPE_INCREMENTAL_OPT_VAL + "` mode (new data since an instantTime) " +
    "(or) `" + QUERY_TYPE_READ_OPTIMIZED_OPT_VAL + "` mode (obtain latest view, based on base files) (or) `" + QUERY_TYPE_SNAPSHOT_OPT_VAL + "` mode " +
    "(obtain latest view, by merging base and (if any) log files)")
Also, after checking the code again, when the file paths are provided through "hoodie.datasource.read.paths", the relation-based read path is used (i.e., useNewParquetFileFormat is false). We can keep this for now.
Filed HUDI-8576 as a follow-up.
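The gap described above, a declared default that one read path does not apply, can be illustrated with a small resolver. This is a sketch under stated assumptions: `QueryTypeResolver` and `resolve` are hypothetical names, not Hudi APIs, and the real lookup logic in Hudi may differ. The key, the alternative key, and the set of valid values follow the `QUERY_TYPE` definition quoted above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical resolver (not a Hudi class) showing how a declared default
// could be applied when the caller does not set the query type explicitly.
class QueryTypeResolver {
    static final String KEY = "hoodie.datasource.query.type";
    static final String ALT_KEY = "hoodie.datasource.view.type";
    static final String DEFAULT_VALUE = "snapshot";
    // String literals assumed for the three *_OPT_VAL constants.
    static final List<String> VALID_VALUES =
        Arrays.asList("snapshot", "read_optimized", "incremental");

    static String resolve(Map<String, String> params) {
        // Prefer the primary key, then the alternative key, then the default.
        String value = params.get(KEY);
        if (value == null) {
            value = params.get(ALT_KEY);
        }
        if (value == null) {
            value = DEFAULT_VALUE;
        }
        if (!VALID_VALUES.contains(value)) {
            throw new IllegalArgumentException("Invalid query type: " + value);
        }
        return value;
    }
}
```

A read path that looks the key up directly in the options map, without a fallback step like the one in `resolve`, sees no value when the caller omits the key. That is why the explicit `params.put` calls are kept for now.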
Filed HUDI-8577 to track the default change. We can take this up after Hudi 1.0 is released.
Anyway, I can reopen this and add the default query type.
891d09c to a1f4d7a
Cool, thanks. Then I will close this for now.
Change Logs
When there are log files, the default query type is the snapshot query.
When there are no log files, there is no difference between snapshot and read-optimized (RO) queries.
Impact
Simplified code.
Risk level (write none, low, medium or high below)
None.