[HUDI-6821] Support multiple base file formats in Hudi table#9761
codope merged 2 commits into apache:master
 * Base relation to handle table with multiple base file formats.
 */
abstract class BaseHoodieMultiFileFormatRelation(override val sqlContext: SQLContext,
                                                 override val metaClient: HoodieTableMetaClient,
Why do we need a new relation abstraction here? The base file format can always be inferred from the file extension, right?
Yes, we are using the file extension to figure out the base file format. However, the usual BaseFileOnlyRelation converts to HadoopFsRelation, which cannot be done for multiple file formats because the file format is not known at the time of creating the relation. It is only known when the relation is collecting file splits. Hence, a separate relation to handle multiple file formats. I will add this to the scaladoc of the class.
Plus, I think a separate relation keeps the code cleaner and more maintainable.
which cannot be done for multiple file formats because the file format is not known at the time creating the relation
You mean the HadoopFsRelation? We can know the file format because the fileFormat is already there; currently it is either an OrcFileFormat or a ParquetFileFormat.
Conversion to HadoopFsRelation happens while resolving BaseFileOnlyRelation in DefaultSource.
At this point, we don't know the file format. Previously, it worked because the code was written with the assumption that there would be just a single base file format, either Parquet or ORC.
Discussed offline. We think that implementing a new FileFormat which works with multiple base file formats should be possible. So, I'm going to attempt that.
Added the new file format implementation in HoodieMultipleBaseFileFormat
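To illustrate the idea discussed in this thread, here is a minimal sketch of a reader that does not know the file format up front and instead resolves it per file from the extension. It is not the code in HoodieMultipleBaseFileFormat; `Reader`, `extensionOf`, and `multiFormatReader` are hypothetical names, and the per-format readers are stubs that only tag the path.

```scala
object MultiFormatReaderSketch {
  // Hypothetical stand-in for a per-format reader: takes a file path, returns rows.
  type Reader = String => Iterator[String]

  // Lower-cased extension of the file name (text after the last dot), if any.
  def extensionOf(path: String): Option[String] = {
    val name = path.substring(path.lastIndexOf('/') + 1)
    val dot = name.lastIndexOf('.')
    if (dot >= 0 && dot < name.length - 1) Some(name.substring(dot + 1).toLowerCase)
    else None
  }

  // Builds a single reader that picks the delegate per file, at read time.
  def multiFormatReader(readers: Map[String, Reader]): Reader = { path =>
    val ext = extensionOf(path).getOrElse(
      throw new IllegalArgumentException(s"No file extension in path: $path"))
    val reader = readers.getOrElse(ext,
      throw new UnsupportedOperationException(s"Unsupported base file format: $ext"))
    reader(path)
  }

  // Stub readers for illustration only; real readers would decode the file.
  val read: Reader = multiFormatReader(Map(
    "parquet" -> ((p: String) => Iterator(s"parquet:$p")),
    "orc" -> ((p: String) => Iterator(s"orc:$p"))
  ))
}
```

Because the delegate is chosen inside the returned function, the format decision is deferred until each file is actually read, which is the property the single-format conversion lacked.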
Address comments
Remove unused table config and check write config
danny0405 left a comment
+1, overall looks great, just left some minor comments.
assertEquals(metadataMetaClient.getTableConfig().getBaseFileFormat(), HoodieFileFormat.HFILE,
    "Metadata Table base file format should be HFile");

// Metadata table has a fixed number of partitions
Going forward, we'll have to remove this check, as we can have multiple file formats even in the metadata table once we support certain secondary indexes in formats other than HFile. This check did not add much value anyway.
val baseFile = createPartitionedFile(partitionValues, hoodieBaseFile.getHadoopPath, 0, hoodieBaseFile.getFileLen)
baseFileFormat match {
  case "parquet" => parquetBaseFileReader(baseFile)
  case "orc" => orcBaseFileReader(baseFile)
Maybe we can avoid hardcoding these file format constants.
Good point! We do have HoodieFileFormat enum which can be used here. Let me do that in a minor followup PR where I will refactor some parts. I have HUDI-6986 to track the refactoring.
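As a rough sketch of what such a refactoring could look like, the match can be driven by a typed value instead of string literals. `BaseFormat` below is a hypothetical stand-in defined only for this example, not the HoodieFileFormat enum itself, and `readerName` just returns the name of the reader that would be invoked.

```scala
object BaseFormatSketch {
  // Hypothetical stand-in for an enum such as HoodieFileFormat; not the Hudi class.
  sealed abstract class BaseFormat(val extension: String)
  case object Parquet extends BaseFormat(".parquet")
  case object Orc extends BaseFormat(".orc")

  val all: Seq[BaseFormat] = Seq(Parquet, Orc)

  // Resolve the format from the path suffix; None for unknown extensions.
  def fromPath(path: String): Option[BaseFormat] =
    all.find(f => path.toLowerCase.endsWith(f.extension))

  // The match is on typed values, so a misspelled constant fails to compile.
  def readerName(format: BaseFormat): String = format match {
    case Parquet => "parquetBaseFileReader"
    case Orc => "orcBaseFileReader"
  }
}
```

Since `BaseFormat` is sealed, the compiler can also warn when a newly added format is not handled in the match.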
hiveSyncConfig.setValue(HoodieSyncConfig.META_SYNC_BASE_PATH, hoodieCatalogTable.tableLocation)
hiveSyncConfig.setValue(HoodieSyncConfig.META_SYNC_BASE_FILE_FORMAT, hoodieCatalogTable.baseFileFormat)
hiveSyncConfig.setValue(HoodieSyncConfig.META_SYNC_BASE_FILE_FORMAT, props.getString(HoodieSyncConfig.META_SYNC_BASE_FILE_FORMAT.key, HoodieSyncConfig.META_SYNC_BASE_FILE_FORMAT.defaultValue))
hiveSyncConfig.setValue(HoodieSyncConfig.META_SYNC_DATABASE_NAME, hoodieCatalogTable.table.identifier.database.getOrElse("default"))
Is there a functional regression if the user does not provide the option HoodieSyncConfig.META_SYNC_BASE_FILE_FORMAT?
No, it will ultimately fall back to using the table config, because HoodieSyncConfig.META_SYNC_BASE_FILE_FORMAT has an infer function.
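The fallback described here can be pictured with a small sketch. It assumes the resolution order is explicit value first, then the infer function, then the default; `Prop`, `resolve`, and both keys are hypothetical and are not Hudi's config classes or real option names.

```scala
object InferredConfigSketch {
  // Hypothetical property: a key, a default, and an optional infer function
  // that can derive a value from the other properties.
  final case class Prop(key: String,
                        default: String,
                        infer: Map[String, String] => Option[String] = _ => None)

  // Assumed order: explicitly set value, then inferred value, then default.
  def resolve(prop: Prop, props: Map[String, String]): String =
    props.get(prop.key).orElse(prop.infer(props)).getOrElse(prop.default)

  // Hypothetical keys, used only for this illustration.
  val tableFormatKey = "table.base.file.format"
  val syncFormat: Prop = Prop("sync.base.file.format", "PARQUET", _.get(tableFormatKey))
}
```

With this shape, a user who sets only the table-level format still gets it picked up by the sync property, which matches the behavior the reply describes.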
Change Logs
To support multiple base file formats in a Hudi table, the following changes have been made:
FileFormat implementation has been introduced. HoodieMultipleBaseFileFormat is created only when the above table config is set to true.
Users need to set hoodie.table.multiple.base.file.formats.enable=true to be able to read/write with multiple base file formats.
Impact
This is a format change. Only newer Hudi table versions will support it. Note that the change is compatible, i.e., the readers used for the new relation should work even if multiple base file formats are not enabled.
The performance impact is minimal. There is one extra iteration over the splits in order to infer the file format from the file extension.
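For a sense of the cost, the extra iteration amounts to one linear pass over the file paths. The sketch below is only an illustration under that assumption; `countByExtension` is a hypothetical helper, and plain strings stand in for the splits.

```scala
object SplitGroupingSketch {
  // Counts files per lower-cased extension in a single pass over the paths.
  // Files without an extension are counted under the empty string.
  def countByExtension(paths: Seq[String]): Map[String, Int] =
    paths.foldLeft(Map.empty[String, Int]) { (acc, p) =>
      val name = p.substring(p.lastIndexOf('/') + 1)
      val dot = name.lastIndexOf('.')
      val ext = if (dot >= 0) name.substring(dot + 1).toLowerCase else ""
      acc.updated(ext, acc.getOrElse(ext, 0) + 1)
    }
}
```

Each path is touched once and only string operations are involved, so the overhead grows linearly with the number of splits.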
Risk level (write none, low, medium or high below)
low