
[HUDI-6821] Support multiple base file formats in Hudi table#9761

Merged
codope merged 2 commits into apache:master from codope:HUDI-6821-base-file-fmt
Oct 26, 2023

Conversation

Member

@codope codope commented Sep 21, 2023

Change Logs

In order to support multiple base file formats for a Hudi table, the following changes have been made:

  1. Added a new boolean table config to indicate whether the table supports multiple base file formats. By default, it is false.
  2. The file format is decided based on the file extension, and then the corresponding writer/reader is created.
  3. On the Spark reader side, a new FileFormat implementation has been introduced. HoodieMultipleBaseFileFormat is created only when the above table config is set to true.
  4. The change has been tested for snapshot reads on both MOR and COW tables.
  5. Bootstrapped tables and incremental queries are not yet supported.

Users need to set hoodie.table.multiple.base.file.formats.enable=true to be able to read/write with multiple base file formats.
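The extension-based dispatch described in point 2 above can be sketched roughly as follows. This is illustrative Java only; the class, enum, and method names are hypothetical, not Hudi's actual API.

```java
import java.util.Locale;

// Hypothetical sketch of point 2: infer the base file format from the
// file extension before a reader/writer is created. Not Hudi's actual API.
class FormatInference {
    enum BaseFileFormat { PARQUET, ORC, HFILE }

    static BaseFileFormat inferFromPath(String path) {
        String lower = path.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".parquet")) {
            return BaseFileFormat.PARQUET;
        } else if (lower.endsWith(".orc")) {
            return BaseFileFormat.ORC;
        } else if (lower.endsWith(".hfile")) {
            return BaseFileFormat.HFILE;
        }
        throw new IllegalArgumentException("Unrecognized base file extension: " + path);
    }
}
```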

Impact

This is a format change. Only newer Hudi table versions will support it. Note that the change is backward compatible, i.e., the readers used for the new relation work even if multiple base file formats are not enabled.
The performance impact is minimal. There is one extra iteration over the splits in order to infer the file format from the file extension.
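The "one extra iteration over the splits" can be pictured as a single pass that buckets split paths by their extension, so that each bucket is handed to the matching reader. This is a sketch with hypothetical names, not Hudi's actual split-handling code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the extra pass over the splits: bucket each split path by its
// file extension so the matching reader can be chosen per bucket.
// Names are illustrative, not Hudi's actual classes.
class SplitBucketer {
    static Map<String, List<String>> groupByExtension(List<String> splitPaths) {
        Map<String, List<String>> byExt = new HashMap<>();
        for (String path : splitPaths) {
            int dot = path.lastIndexOf('.');
            String ext = dot >= 0 ? path.substring(dot + 1) : "";
            byExt.computeIfAbsent(ext, k -> new ArrayList<>()).add(path);
        }
        return byExt;
    }
}
```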

Risk level (write none, low medium or high below)

low

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

  • The config description must be updated if new configs are added or the default value of a config is changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here, and follow the instructions to make changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@codope codope force-pushed the HUDI-6821-base-file-fmt branch 3 times, most recently from 42e452b to be1d48f Compare September 28, 2023 16:18
@codope codope requested a review from danny0405 October 12, 2023 02:02
Comment thread hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableMetaClient.java Outdated
* Base relation to handle table with multiple base file formats.
*/
abstract class BaseHoodieMultiFileFormatRelation(override val sqlContext: SQLContext,
override val metaClient: HoodieTableMetaClient,
Contributor

What is the reason we need a new relation abstraction here? The base file format can be always inferred from the file extension right?

Member Author

Yes, we are using the file extension to figure out the base file format. However, the usual BaseFileOnlyRelation converts to HadoopFsRelation, which cannot be done for multiple file formats because the file format is not known at the time of creating the relation. It is only known when the relation is collecting file splits. Hence, a separate relation to handle multiple file formats. I will add this in the scaladoc of the class.
Plus, I think a separate relation keeps the code cleaner and more maintainable.

Contributor

which cannot be done for multiple file formats because the file format is not known at the time of creating the relation

You mean the HadoopFsRelation? We can know the file format because the fileFormat is already there; currently it is either an OrcFileFormat or a ParquetFileFormat.

Member Author

Conversion to HadoopFsRelation happens while resolving BaseFileOnlyRelation in DefaultSource -


At this point, we don't know the file format. Previously, it used to work because the code was written with the assumption that there would be just a single base file format, either Parquet or ORC.

Member Author

Discussed offline. We think that implementing a new FileFormat which works with multiple base file formats should be possible, so I'm going to attempt that.

Member Author

Added the new file format implementation in HoodieMultipleBaseFileFormat

@codope codope force-pushed the HUDI-6821-base-file-fmt branch from be1d48f to 60875bd Compare October 13, 2023 05:29
@apache apache deleted a comment from hudi-bot Oct 14, 2023
Comment thread hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableConfig.java Outdated
Address comments

Remove unused table config and check write config
@codope codope force-pushed the HUDI-6821-base-file-fmt branch from 89e72e0 to 4ec731d Compare October 25, 2023 14:51
@hudi-bot
Collaborator

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

Contributor

@danny0405 danny0405 left a comment

+1, overall looks great, just left some minor comments.

assertEquals(metadataMetaClient.getTableConfig().getBaseFileFormat(), HoodieFileFormat.HFILE,
"Metadata Table base file format should be HFile");

// Metadata table has a fixed number of partitions
Contributor

Why remove this check

Member Author

Going forward we'll have to remove this check, as we can have multiple file formats even in the metadata table once we support certain secondary indexes in formats other than HFile. This check did not add much value anyway.

val baseFile = createPartitionedFile(partitionValues, hoodieBaseFile.getHadoopPath, 0, hoodieBaseFile.getFileLen)
baseFileFormat match {
case "parquet" => parquetBaseFileReader(baseFile)
case "orc" => orcBaseFileReader(baseFile)
Contributor

Maybe we can avoid hardcoding these file format constants.

Member Author

Good point! We do have the HoodieFileFormat enum, which can be used here. Let me do that in a minor follow-up PR where I will refactor some parts. I have HUDI-6986 to track the refactoring.
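The suggested refactor could look roughly like this: match on an enum that carries its own extension rather than on raw "parquet"/"orc" string literals. The enum below is a simplified stand-in for Hudi's HoodieFileFormat, not the real class.

```java
// Simplified stand-in for Hudi's HoodieFileFormat enum, showing how the
// reader dispatch could match on an enum instead of hardcoded strings.
enum FileFormatEnum {
    PARQUET(".parquet"),
    ORC(".orc");

    private final String extension;

    FileFormatEnum(String extension) {
        this.extension = extension;
    }

    String getFileExtension() {
        return extension;
    }

    // Resolve the enum constant from a file path instead of comparing
    // raw string literals at each call site.
    static FileFormatEnum fromPath(String path) {
        for (FileFormatEnum f : values()) {
            if (path.endsWith(f.getFileExtension())) {
                return f;
            }
        }
        throw new IllegalArgumentException("Unsupported base file: " + path);
    }
}
```

With this shape, the `baseFileFormat match { case "parquet" => ... }` block quoted above could dispatch on `FileFormatEnum.fromPath(...)` and the compiler would flag any missing case when a new format is added.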

hiveSyncConfig.setValue(HoodieSyncConfig.META_SYNC_BASE_PATH, hoodieCatalogTable.tableLocation)
hiveSyncConfig.setValue(HoodieSyncConfig.META_SYNC_BASE_FILE_FORMAT, hoodieCatalogTable.baseFileFormat)
hiveSyncConfig.setValue(HoodieSyncConfig.META_SYNC_BASE_FILE_FORMAT, props.getString(HoodieSyncConfig.META_SYNC_BASE_FILE_FORMAT.key, HoodieSyncConfig.META_SYNC_BASE_FILE_FORMAT.defaultValue))
hiveSyncConfig.setValue(HoodieSyncConfig.META_SYNC_DATABASE_NAME, hoodieCatalogTable.table.identifier.database.getOrElse("default"))
Contributor

Do we have a functional regression if the user does not provide the option HoodieSyncConfig.META_SYNC_BASE_FILE_FORMAT?

Member Author

No, it will ultimately fall back to using the table config because HoodieSyncConfig.META_SYNC_BASE_FILE_FORMAT has an infer function -

.withInferFunction(cfg -> Option.ofNullable(cfg.getString(BASE_FILE_FORMAT)))
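The fallback described here can be sketched as a config property whose value, when not set explicitly, is derived by an infer function from another key before the static default applies. This is a minimal sketch; the class and key names below are hypothetical simplifications of Hudi's config machinery.

```java
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Minimal sketch of an infer-function fallback: if the option is not set
// explicitly, derive it from another config key (e.g. the table's base
// file format), otherwise use the static default. Hypothetical names.
class InferringProperty {
    static String resolve(Map<String, String> props,
                          String key,
                          Function<Map<String, String>, Optional<String>> inferFn,
                          String defaultValue) {
        if (props.containsKey(key)) {
            return props.get(key);
        }
        return inferFn.apply(props).orElse(defaultValue);
    }
}
```

With an infer function reading the table's base-file-format key, an unset sync option falls back to the table config before the default, matching the behavior described in the comment above.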

@codope codope merged commit 8bf44c0 into apache:master Oct 26, 2023