[HUDI-8631] Support of hoodie.populate.meta.fields for Flink append mode #12516
this.fileId = fileId;
this.newRecordLocation = new HoodieRecordLocation(instantTime, fileId);
this.preserveHoodieMetadata = preserveHoodieMetadata;
this.skipMetadataWrite = skipMetadataWrite;
Doesn't the flag preserveHoodieMetadata already control this behavior? By the way, there is another PR raised by @usberkeley for fixing all the scenarios: #12404
Yes, I saw #12404 and was confused by the ticket name, which mentions the Flink table config hoodie.populate.meta.fields, but I didn't find any changes in hudi-flink-client or hudi-flink-datasource that would change the current behavior. I described my point of view in #12404 (comment).
To support that comment, I've created this MR, which shows that hoodie.populate.meta.fields is not supported in Flink.
preserveHoodieMetadata is actually used in the code as an indicator of whether to take the metadata from the row data or to generate it by calling the corresponding methods, so the naming is a little confusing. I believe the values of preserveHoodieMetadata can be described by this schema:

It looks like preserveHoodieMetadata can be true only for the clustering operator.
It looks like the flag preserveHoodieMetadata indicates whether the source row already includes the metadata fields. For a table service like clustering, this should be true by default (because clustering is just a rewrite). For a regular write, the metadata fields should be generated on the fly.
Let's check in which cases the option hoodie.populate.meta.fields could be false.
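The distinction discussed here can be sketched in a few lines of Java. This is only an illustration, not Hudi code: the `MetaMode` enum, the `resolve` method, and the assumption that a disabled hoodie.populate.meta.fields takes priority over preserveHoodieMetadata are all hypothetical.

```java
class MetaFieldsSketch {
    /** Hypothetical modes for handling meta columns; not part of Hudi. */
    enum MetaMode {
        SKIP,      // do not write meta columns at all
        PRESERVE,  // copy meta columns from the source row (e.g. clustering rewrite)
        GENERATE   // compute meta columns on the fly (regular write)
    }

    /**
     * Assumption for this sketch: when populateMetaFields is false,
     * meta columns are skipped regardless of preserveHoodieMetadata.
     */
    static MetaMode resolve(boolean populateMetaFields, boolean preserveHoodieMetadata) {
        if (!populateMetaFields) {
            return MetaMode.SKIP;
        }
        return preserveHoodieMetadata ? MetaMode.PRESERVE : MetaMode.GENERATE;
    }

    public static void main(String[] args) {
        // Regular append write with hoodie.populate.meta.fields = false
        System.out.println(resolve(false, false));
        // Clustering rewrite with meta fields enabled
        System.out.println(resolve(true, true));
        // Regular write with meta fields enabled
        System.out.println(resolve(true, false));
    }
}
```

Under these assumptions, the only combination that carries existing metadata through is a rewrite with meta fields enabled, which matches the observation that preserveHoodieMetadata is true only for clustering.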
The user sets the value of the hoodie.populate.meta.fields option, which is true by default. The description of this config mentions "append only/immutable data" as a use case:
For this reason, in this MR I added support for hoodie.populate.meta.fields in Flink only for append mode.
For a quick check I use SQL queries like the following, which run in append mode:
CREATE TABLE hudi_debug (
  id INT,
  part INT,
  `desc` STRING,
  PRIMARY KEY (id) NOT ENFORCED
)
WITH (
  'connector' = 'hudi',
  'path' = '...',
  'table.type' = 'COPY_ON_WRITE',
  'write.operation' = 'insert',
  'hoodie.populate.meta.fields' = 'false'
);
INSERT INTO hudi_debug VALUES
  (1,100,'aaa'),
  (2,200,'bbb');
Expected results: there are no exceptions during
SELECT * FROM hudi_debug;
and the corresponding parquet files in HDFS don't contain the metadata columns.
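The second expectation, that the written files carry no metadata columns, could be checked with a small helper like the one below. It is a minimal sketch, not part of this MR: the class and method names are hypothetical, and it assumes the column names of a written file have already been read into a list by some other means. The five names are the standard Hudi meta columns.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class MetaColumnCheck {
    // The five standard Hudi meta columns.
    static final List<String> HOODIE_META_COLUMNS = Arrays.asList(
        "_hoodie_commit_time",
        "_hoodie_commit_seqno",
        "_hoodie_record_key",
        "_hoodie_partition_path",
        "_hoodie_file_name");

    /** Returns the meta columns present in the given schema column names. */
    static List<String> findMetaColumns(List<String> columnNames) {
        return columnNames.stream()
            .filter(HOODIE_META_COLUMNS::contains)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Column names as they would appear for the hudi_debug table
        // written with hoodie.populate.meta.fields = false.
        List<String> columns = Arrays.asList("id", "part", "desc");
        List<String> found = findMetaColumns(columns);
        if (!found.isEmpty()) {
            throw new IllegalStateException("Unexpected meta columns: " + found);
        }
        System.out.println("No meta columns found");
    }
}
```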
I've also found that we can write a MOR table in upsert mode without metadata. The call stack in this case includes HoodieAppendHandle. But we can't read the resulting MOR table with Flink afterwards, because an exception is thrown during:
SELECT * FROM hudi_debug;
I've created a separate bug for reading MOR tables without meta columns, HUDI-8785, and will fix it in a separate MR.
Can we at least add an integration test in ITTestHoodieDataSource?
@danny0405
I've added ITTestHoodieDataSource::testWriteWithoutMetaColumns. For a proper check, though, it would be better to write the data with Flink and then read it with Spark, because in Spark
SELECT * FROM table;
returns all columns, including the metadata ones.
That would also be a really useful check of engine interoperability. I've created a corresponding task, HUDI-8788.
I will restart Azure CI, but we have a problem with timeout:
@hudi-bot run azure
Change Logs
Currently, hoodie.populate.meta.fields is not supported in Flink. This config should be used for append mode, so this MR adds support for this case.
Before, master, 3cb874f:

After:

TPC-H lineitem table is used for profiling.
Writes in append mode for Flink are 14% faster when hoodie.populate.meta.fields = false.
The next possible optimization for this scenario is to consider whether Bloom filters are really needed in this use case, because they cost 16% of CPU after the optimization in this MR:

Impact
No
Risk level (write none, low medium or high below)
Low
Documentation Update
No need. The "All Configurations" page does not mention that hoodie.populate.meta.fields is not supported for Flink.
Contributor's checklist