[HUDI-18383] Support selective meta field population #18384
prashantwason wants to merge 3 commits into apache:master
Conversation
Add `hoodie.meta.fields.to.exclude` config to selectively skip meta field population. Excluded fields are written as null for optimal Parquet storage savings while retaining incremental query capability via `_hoodie_commit_time`. Covers all 4 write paths: Avro, Spark InternalRow, Spark SQL row-writer, and Flink. Uses a pre-computed `boolean[5]` for zero-overhead per-row checks. Closes apache#18383 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
```java
    : null;
metaFields[2] = populateField[2] ? recordKey : null;
metaFields[3] = populateField[3] ? row.getUTF8String(HoodieRecord.PARTITION_PATH_META_FIELD_ORD) : null;
metaFields[4] = populateField[4] ? fileName : null;
```
So the metadata fields are still in the table schema; it's just the population that is selective.
Should we change the table schema to remove the metadata fields too? cc @vinothchandar for visibility.
Yes, that's correct. The meta field columns remain in the schema for compatibility. When a field is excluded, its value is written as null (which in Parquet takes zero data bytes — stored only as a bit flag in the definition level). This preserves schema consistency while saving storage for fields that can be virtualized (e.g. record_key, partition_path, file_name).
Got it, thanks for the clarification.
suryaprasanna left a comment:
Can we have unit tests for this change?
```java
  return getBooleanOrDefault(HoodieTableConfig.POPULATE_META_FIELDS);
}

public Set<String> getMetaFieldsToExclude() {
```
Should this be private?
Done. Made it private — it's only used internally by getMetaFieldPopulationFlags().
```java
flags[2] = !excluded.contains(HoodieRecord.RECORD_KEY_METADATA_FIELD);
flags[3] = !excluded.contains(HoodieRecord.PARTITION_PATH_METADATA_FIELD);
flags[4] = !excluded.contains(HoodieRecord.FILENAME_METADATA_FIELD);
return flags;
```
Do we need to include OPERATION_METADATA_FIELD as well?
No — OPERATION_METADATA_FIELD (_hoodie_operation) is not part of the standard 5 HOODIE_META_COLUMNS. It's in the separate HOODIE_META_COLUMNS_WITH_OPERATION set and is marked as temporary in the codebase. The boolean[5] array maps exactly to the 5 standard meta fields by ordinal (commit_time, commit_seqno, record_key, partition_path, file_name).
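The ordinal mapping described above can be sketched as a self-contained snippet (field names per the PR; the class and method names here are hypothetical, not the actual Hudi API):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class MetaFieldFlags {
  // The 5 standard meta field names, in ordinal order (assumption:
  // mirrors HoodieRecord's HOODIE_META_COLUMNS ordering).
  static final String[] META_COLUMNS = {
      "_hoodie_commit_time", "_hoodie_commit_seqno",
      "_hoodie_record_key", "_hoodie_partition_path", "_hoodie_file_name"};

  // Pre-compute one boolean per meta field: true = populate, false = write null.
  static boolean[] populationFlags(Set<String> excluded) {
    boolean[] flags = new boolean[META_COLUMNS.length];
    for (int i = 0; i < META_COLUMNS.length; i++) {
      flags[i] = !excluded.contains(META_COLUMNS[i]);
    }
    return flags;
  }

  public static void main(String[] args) {
    Set<String> excluded = new HashSet<>(
        Arrays.asList("_hoodie_record_key", "_hoodie_file_name"));
    System.out.println(Arrays.toString(populationFlags(excluded)));
    // prints [true, true, false, true, false]
  }
}
```

Since `_hoodie_operation` lives outside this 5-element ordinal space, extending the array would change the meaning of the indices everywhere they are consumed.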
```java
private final String fileId;
private final boolean preserveHoodieMetadata;
private final boolean skipMetadataWrite;
private final boolean[] populateField;
```
Should this be `populateIndividualMetaFields` or something like that, instead of `populateField`? What do you think?
Good suggestion! Renamed to populateIndividualMetaFields across all 9 files.
```java
private final UTF8String instantTime;

private final boolean populateMetaFields;
private final boolean[] populateField;
```
Should this be `populateIndividualMetaFields` or something like that, instead of `populateField`? What do you think?
Done — renamed to populateIndividualMetaFields.
```java
row.update(COMMIT_SEQNO_METADATA_FIELD.ordinal(), UTF8String.fromString(seqIdGenerator.apply(recordCount)));
row.update(RECORD_KEY_METADATA_FIELD.ordinal(), recordKey);
row.update(PARTITION_PATH_METADATA_FIELD.ordinal(), UTF8String.fromString(partitionPath));
row.update(FILENAME_METADATA_FIELD.ordinal(), fileName);
```
Do we need to include OPERATION_METADATA_FIELD as well?
Same as above — OPERATION_METADATA_FIELD is not part of the standard 5 meta columns, so it's not applicable here.
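The guarded per-row population pattern the patch uses can be sketched self-contained as follows (a simplification of the guarded `row.update()` calls above: plain `String[]` instead of an `InternalRow`, and hypothetical helper names):

```java
import java.util.Arrays;

public class ConditionalMetaPopulation {
  // Minimal sketch: each meta value is written only if its ordinal flag is
  // set; otherwise the slot stays null, so Parquet stores just a
  // definition-level bit for it instead of data bytes.
  static String[] populate(boolean[] flags, String commitTime, String seqNo,
                           String recordKey, String partitionPath, String fileName) {
    String[] metaFields = new String[5];
    metaFields[0] = flags[0] ? commitTime : null;
    metaFields[1] = flags[1] ? seqNo : null;
    metaFields[2] = flags[2] ? recordKey : null;
    metaFields[3] = flags[3] ? partitionPath : null;
    metaFields[4] = flags[4] ? fileName : null;
    return metaFields;
  }

  public static void main(String[] args) {
    // Keep only commit_time and commit_seqno; virtualize the rest.
    boolean[] flags = {true, true, false, false, false};
    System.out.println(Arrays.toString(
        populate(flags, "20260101", "20260101_0_1", "key1", "p1", "f1.parquet")));
    // prints [20260101, 20260101_0_1, null, null, null]
  }
}
```

Because the flags are computed once per writer, the per-row cost is a plain array read and a ternary, with no allocation.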
```scala
  sparkKeyGenerator
}

val populateField = config.getMetaFieldPopulationFlags
```
Should this be `populateIndividualMetaFields` or something like that, instead of `populateField`? What do you think?
```java
.withDocumentation("When enabled, populates all meta fields. When disabled, no meta fields are populated "
    + "and incremental queries will not be functional. This is only meant to be used for append only/immutable data for batch processing");

public static final ConfigProperty<String> META_FIELDS_TO_EXCLUDE = ConfigProperty
```
This would require table version upgrade, not sure how we want to track it as part of next version.
This is a table config property stored in hoodie.properties, not a schema or format change. It only affects write behavior (which meta fields to populate). Existing tables without this property behave identically to before (all meta fields populated). Since it's purely additive with no format changes, I don't think a table version upgrade is required. Happy to discuss further if you see it differently.
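For illustration, the property would simply be one more key stored alongside the existing entries in `hoodie.properties` (the table name and exclusion values below are hypothetical):

```properties
# hoodie.properties (illustrative snippet; existing tables simply lack the new key)
hoodie.table.name=trips
hoodie.populate.meta.fields=true
hoodie.meta.fields.to.exclude=_hoodie_record_key,_hoodie_file_name
```

A reader that does not know the key ignores it, which is the usual compatibility argument for additive table configs.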
```java
}

public HoodieAvroOrcWriter(String instantTime, StoragePath file, HoodieOrcConfig config, HoodieSchema schema,
                           TaskContextSupplier taskContextSupplier, boolean[] populateField) throws IOException {
```
Should this be `populateIndividualMetaFields` or something like that, instead of `populateField`? What do you think?
…LDS_TO_EXCLUDE Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Rename `boolean[] populateField` to `populateIndividualMetaFields` across all 9 source files for clarity (per review feedback)
- Make `getMetaFieldsToExclude()` private in `HoodieWriteConfig` since it is only used internally by `getMetaFieldPopulationFlags()`
- Add `withMetaFieldsToExclude()` builder method in `HoodieWriteConfig`
- Add 8 unit tests in `TestHoodieWriteConfig` for `getMetaFieldPopulationFlags`: default, disabled, selective exclusion, exclude all, single exclusion, whitespace handling, invalid field names, empty list
- Add integration test in `TestHoodieRowCreateHandle` verifying selective meta field population writes nulls for excluded fields in parquet output

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Codecov Report

❌ Patch coverage — additional details and impacted files:

```
@@             Coverage Diff              @@
##             master   #18384      +/-  ##
============================================
- Coverage     68.37%   65.99%    -2.39%
+ Complexity    27573    22478     -5095
============================================
  Files          2433     1996      -437
  Lines        133268   111142    -22126
  Branches      16034    14044     -1990
============================================
- Hits          91122    73347    -17775
+ Misses        35093    31531     -3562
+ Partials       7053     6264      -789
```
@suryaprasanna Thanks for the review! I've addressed all the feedback.

Changes in latest commit:
Unit tests added:
On OPERATION_METADATA_FIELD: it's not part of the standard 5 meta columns (it lives in the separate `HOODIE_META_COLUMNS_WITH_OPERATION` set), so the `boolean[5]` does not cover it.
On table version upgrade: this is a purely additive table config with no schema or format change, so I don't believe an upgrade is required.
I don't think this can be a writer property, and there is another reason why this can't be just a writer property. Let's rethink the solution. I am not reviewing the patch for now. Let's get alignment on the requirements and approach first.
Describe the issue this Pull Request addresses
Closes #18383
Discussion: #17959
Currently `hoodie.populate.meta.fields` is all-or-nothing: either all 5 meta columns are populated, or none are (all get empty strings). Users who disable it to save storage lose incremental query capability (which requires `_hoodie_commit_time`). Fields like `_hoodie_record_key`, `_hoodie_partition_path`, and `_hoodie_file_name` can be virtualized and don't need physical storage.

Summary and Changelog
Adds `hoodie.meta.fields.to.exclude` config for selective meta field population. Excluded meta fields are written as null (not empty string) for optimal Parquet storage savings (nulls take zero data bytes, stored as bit flags in definition levels).

Changes:
- New `hoodie.meta.fields.to.exclude` config property in `HoodieTableConfig`
- `getMetaFieldPopulationFlags()` in `HoodieWriteConfig` returning a pre-computed `boolean[5]` array indexed by meta field ordinal
- Avro write path (`HoodieAvroParquetWriter`, `HoodieAvroOrcWriter`, `HoodieAvroHFileWriter`) via new `prepRecordWithMetadata()` overload
- Spark InternalRow write path (`HoodieSparkParquetWriter`) via conditional `updateRecordMetadata()`
- Spark SQL row-writer path (`HoodieRowCreateHandle`, `HoodieDatasetBulkInsertHelper`) via conditional meta field array
- Flink write path (`HoodieRowDataCreateHandle`) via conditional values in `HoodieRowDataCreation.create()`
- Handling for when `_hoodie_record_key` is excluded (bloom filter indexes record keys)
- `AbstractHoodieRowData.getString()` updated to handle null meta columns without NPE

Example config:
Impact
New config `hoodie.meta.fields.to.exclude` (default: empty). No behavior change for existing users. When configured, excluded meta fields are written as null instead of computed values. Public API addition only (new config property).

Risk Level

low - Additive change. Default behavior is unchanged (empty exclude list = all fields populated). The `boolean[5]` array is pre-computed once per writer constructor with zero per-row allocation overhead.

Documentation Update

New config `hoodie.meta.fields.to.exclude` added with inline documentation. Valid values are the 5 meta field names: `_hoodie_commit_time`, `_hoodie_commit_seqno`, `_hoodie_record_key`, `_hoodie_partition_path`, `_hoodie_file_name`. Only effective when `hoodie.populate.meta.fields=true`.

Contributor's checklist