
[HUDI-18383] Support selective meta field population #18384

Open
prashantwason wants to merge 3 commits into apache:master from prashantwason:selective-meta-field-population

Conversation

@prashantwason
Member

@prashantwason prashantwason commented Mar 25, 2026

Describe the issue this Pull Request addresses

Closes #18383
Discussion: #17959

Currently hoodie.populate.meta.fields is all-or-nothing: either all 5 meta columns are populated, or none are (all get empty strings). Users who disable it to save storage lose incremental query capability (which requires _hoodie_commit_time). Fields like _hoodie_record_key, _hoodie_partition_path, and _hoodie_file_name can be virtualized and don't need physical storage.
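To illustrate what "virtualized" means here, below is a minimal hypothetical sketch (the class and helper names are made up for illustration, not Hudi code): a field like _hoodie_partition_path can be re-derived from the file's location instead of being stored in every row, assuming the conventional `<basePath>/<partitionPath>/<fileName>` layout.

```java
// Hypothetical sketch: deriving the partition path from the file path
// rather than reading a physically stored _hoodie_partition_path column.
public class VirtualizedMetaFieldSketch {

    // Assumed layout: <basePath>/<partitionPath>/<fileName>
    static String partitionPathFromFilePath(String basePath, String filePath) {
        // Strip "<basePath>/" prefix, then drop the trailing file name.
        String relative = filePath.substring(basePath.length() + 1);
        int lastSlash = relative.lastIndexOf('/');
        return lastSlash < 0 ? "" : relative.substring(0, lastSlash);
    }

    public static void main(String[] args) {
        System.out.println(partitionPathFromFilePath(
            "/data/hudi/tbl", "/data/hudi/tbl/2026/03/25/abc.parquet"));
        // prints "2026/03/25"
    }
}
```

Since the reader already knows the file it is scanning, this information costs nothing to reconstruct, which is why storing it per row is pure overhead for such fields.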

Summary and Changelog

Adds hoodie.meta.fields.to.exclude config for selective meta field population. Excluded meta fields are written as null (not empty string) for optimal Parquet storage savings (nulls take zero data bytes, stored as bit flags in definition levels).

Changes:

  • Added hoodie.meta.fields.to.exclude config property in HoodieTableConfig
  • Added getMetaFieldPopulationFlags() in HoodieWriteConfig returning a pre-computed boolean[5] array indexed by meta field ordinal
  • Modified all 4 write engine paths to conditionally populate meta fields:
    • Avro file writers (HoodieAvroParquetWriter, HoodieAvroOrcWriter, HoodieAvroHFileWriter) via new prepRecordWithMetadata() overload
    • Spark InternalRow writer (HoodieSparkParquetWriter) via conditional updateRecordMetadata()
    • Spark SQL row-writer (HoodieRowCreateHandle, HoodieDatasetBulkInsertHelper) via conditional meta field array
    • Flink writer (HoodieRowDataCreateHandle) via conditional values in HoodieRowDataCreation.create()
  • Disabled bloom filter when _hoodie_record_key is excluded (bloom filter indexes record keys)
  • Fixed null safety in Flink AbstractHoodieRowData.getString() to handle null meta columns without NPE
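The flag computation described above can be sketched as follows. This is a hedged, self-contained approximation of the getMetaFieldPopulationFlags() behavior the PR describes (the class name and the standalone helper are invented for illustration; the real method lives in HoodieWriteConfig):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Sketch: turn the comma-separated exclude list into a boolean[5]
// indexed by the standard meta field ordinal.
public class MetaFieldFlagsSketch {

    // The 5 standard Hudi meta columns, in ordinal order.
    static final String[] META_FIELDS = {
        "_hoodie_commit_time", "_hoodie_commit_seqno",
        "_hoodie_record_key", "_hoodie_partition_path", "_hoodie_file_name"
    };

    static boolean[] metaFieldPopulationFlags(String excludeCsv, boolean populateMetaFields) {
        boolean[] flags = new boolean[META_FIELDS.length];
        if (!populateMetaFields) {
            return flags; // all false: no meta fields populated at all
        }
        Set<String> excluded = new HashSet<>();
        for (String f : excludeCsv.split(",")) {
            if (!f.trim().isEmpty()) {
                excluded.add(f.trim()); // tolerate whitespace around names
            }
        }
        for (int i = 0; i < META_FIELDS.length; i++) {
            flags[i] = !excluded.contains(META_FIELDS[i]);
        }
        return flags;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(
            metaFieldPopulationFlags("_hoodie_record_key,_hoodie_file_name", true)));
        // prints [true, true, false, true, false]
    }
}
```

Because the array is computed once in the writer constructor, the per-row cost is a plain array index check rather than a set lookup.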

Example config:

hoodie.populate.meta.fields=true
hoodie.meta.fields.to.exclude=_hoodie_record_key,_hoodie_partition_path,_hoodie_file_name,_hoodie_commit_seqno

Impact

New config hoodie.meta.fields.to.exclude (default: empty). No behavior change for existing users. When configured, excluded meta fields are written as null instead of computed values. Public API addition only (new config property).

Risk Level

Low. Additive change. Default behavior is unchanged (an empty exclude list means all fields are populated). The boolean[5] array is pre-computed once per writer constructor, with zero per-row allocation overhead.

Documentation Update

New config hoodie.meta.fields.to.exclude added with inline documentation. Valid values are the 5 meta field names: _hoodie_commit_time, _hoodie_commit_seqno, _hoodie_record_key, _hoodie_partition_path, _hoodie_file_name. Only effective when hoodie.populate.meta.fields=true.

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

Add hoodie.meta.fields.to.exclude config to selectively skip meta field
population. Excluded fields are written as null for optimal Parquet storage
savings while retaining incremental query capability via _hoodie_commit_time.

Covers all 4 write paths: Avro, Spark InternalRow, Spark SQL row-writer,
and Flink. Uses pre-computed boolean[5] for zero-overhead per-row checks.

Closes apache#18383

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions github-actions bot added the size:M PR with lines of changes in (100, 300] label Mar 25, 2026
: null;
metaFields[2] = populateField[2] ? recordKey : null;
metaFields[3] = populateField[3] ? row.getUTF8String(HoodieRecord.PARTITION_PATH_META_FIELD_ORD) : null;
metaFields[4] = populateField[4] ? fileName : null;
Contributor

So the metadata fields are still in the table schema; they are just selectively left unpopulated.

Contributor

Should we change the table schema to remove the metadata fields too? cc @vinothchandar for visibility.

Member Author

Yes, that's correct. The meta field columns remain in the schema for compatibility. When a field is excluded, its value is written as null (which in Parquet takes zero data bytes — stored only as a bit flag in the definition level). This preserves schema consistency while saving storage for fields that can be virtualized (e.g. record_key, partition_path, file_name).

Contributor

got it, thanks for the clarification.

Contributor

@suryaprasanna suryaprasanna left a comment

Can we have unit tests for this change?

return getBooleanOrDefault(HoodieTableConfig.POPULATE_META_FIELDS);
}

public Set<String> getMetaFieldsToExclude() {
Contributor

Should this be private?

Member Author

Done. Made it private — it's only used internally by getMetaFieldPopulationFlags().

flags[2] = !excluded.contains(HoodieRecord.RECORD_KEY_METADATA_FIELD);
flags[3] = !excluded.contains(HoodieRecord.PARTITION_PATH_METADATA_FIELD);
flags[4] = !excluded.contains(HoodieRecord.FILENAME_METADATA_FIELD);
return flags;
Contributor

Do we need to include OPERATION_METADATA_FIELD as well?

Member Author

No — OPERATION_METADATA_FIELD (_hoodie_operation) is not part of the standard 5 HOODIE_META_COLUMNS. It's in the separate HOODIE_META_COLUMNS_WITH_OPERATION set and is marked as temporary in the codebase. The boolean[5] array maps exactly to the 5 standard meta fields by ordinal (commit_time, commit_seqno, record_key, partition_path, file_name).

private final String fileId;
private final boolean preserveHoodieMetadata;
private final boolean skipMetadataWrite;
private final boolean[] populateField;
Contributor

Should this be populateIndividualMetaFields or something like that, instead of populateField? What do you think?

Member Author

Good suggestion! Renamed to populateIndividualMetaFields across all 9 files.

private final UTF8String instantTime;

private final boolean populateMetaFields;
private final boolean[] populateField;
Contributor

Should this be populateIndividualMetaFields or something like that, instead of populateField? What do you think?

Member Author

Done — renamed to populateIndividualMetaFields.

row.update(COMMIT_SEQNO_METADATA_FIELD.ordinal(), UTF8String.fromString(seqIdGenerator.apply(recordCount)));
row.update(RECORD_KEY_METADATA_FIELD.ordinal(), recordKey);
row.update(PARTITION_PATH_METADATA_FIELD.ordinal(), UTF8String.fromString(partitionPath));
row.update(FILENAME_METADATA_FIELD.ordinal(), fileName);
Contributor

Do we need to include OPERATION_METADATA_FIELD as well?

Member Author

Same as above — OPERATION_METADATA_FIELD is not part of the standard 5 meta columns, so it's not applicable here.

sparkKeyGenerator
}

val populateField = config.getMetaFieldPopulationFlags
Contributor

Should this be populateIndividualMetaFields or something like that, instead of populateField? What do you think?

Member Author

Done.

.withDocumentation("When enabled, populates all meta fields. When disabled, no meta fields are populated "
+ "and incremental queries will not be functional. This is only meant to be used for append only/immutable data for batch processing");

public static final ConfigProperty<String> META_FIELDS_TO_EXCLUDE = ConfigProperty
Contributor

This would require a table version upgrade; I'm not sure how we want to track it as part of the next version.

Member Author

This is a table config property stored in hoodie.properties, not a schema or format change. It only affects write behavior (which meta fields to populate). Existing tables without this property behave identically to before (all meta fields populated). Since it's purely additive with no format changes, I don't think a table version upgrade is required. Happy to discuss further if you see it differently.

}

public HoodieAvroOrcWriter(String instantTime, StoragePath file, HoodieOrcConfig config, HoodieSchema schema,
TaskContextSupplier taskContextSupplier, boolean[] populateField) throws IOException {
Contributor

Should this be populateIndividualMetaFields or something like that, instead of populateField? What do you think?

Member Author

Done.

prashantwason and others added 2 commits March 26, 2026 15:57
…LDS_TO_EXCLUDE

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Rename boolean[] populateField to populateIndividualMetaFields across
  all 9 source files for clarity (per review feedback)
- Make getMetaFieldsToExclude() private in HoodieWriteConfig since it is
  only used internally by getMetaFieldPopulationFlags()
- Add withMetaFieldsToExclude() builder method in HoodieWriteConfig
- Add 8 unit tests in TestHoodieWriteConfig for getMetaFieldPopulationFlags:
  default, disabled, selective exclusion, exclude all, single exclusion,
  whitespace handling, invalid field names, empty list
- Add integration test in TestHoodieRowCreateHandle verifying selective
  meta field population writes nulls for excluded fields in parquet output

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions github-actions bot added size:L PR with lines of changes in (300, 1000] and removed size:M PR with lines of changes in (100, 300] labels Mar 29, 2026
@hudi-bot
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@codecov-commenter

Codecov Report

❌ Patch coverage is 56.88073% with 47 lines in your changes missing coverage. Please review.
✅ Project coverage is 65.99%. Comparing base (69fa35b) to head (143b68b).
⚠️ Report is 5 commits behind head on master.

Files with missing lines Patch % Lines
...ache/hudi/io/storage/HoodieSparkParquetWriter.java 40.90% 10 Missing and 3 partials ⚠️
...g/apache/hudi/io/storage/HoodieAvroFileWriter.java 0.00% 13 Missing ⚠️
...java/org/apache/hudi/config/HoodieWriteConfig.java 70.83% 0 Missing and 7 partials ⚠️
...udi/io/storage/hadoop/HoodieAvroParquetWriter.java 54.54% 2 Missing and 3 partials ⚠️
.../hudi/io/storage/hadoop/HoodieAvroHFileWriter.java 66.66% 2 Missing and 1 partial ⚠️
...he/hudi/io/storage/hadoop/HoodieAvroOrcWriter.java 66.66% 2 Missing and 1 partial ⚠️
...rg/apache/hudi/HoodieDatasetBulkInsertHelper.scala 33.33% 0 Missing and 2 partials ⚠️
...che/hudi/io/storage/row/HoodieRowCreateHandle.java 90.90% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18384      +/-   ##
============================================
- Coverage     68.37%   65.99%   -2.39%     
+ Complexity    27573    22478    -5095     
============================================
  Files          2433     1996     -437     
  Lines        133268   111142   -22126     
  Branches      16034    14044    -1990     
============================================
- Hits          91122    73347   -17775     
+ Misses        35093    31531    -3562     
+ Partials       7053     6264     -789     
Flag Coverage Δ
common-and-other-modules ?
hadoop-mr-java-client 44.97% <32.39%> (-0.19%) ⬇️
spark-client-hadoop-common 48.28% <22.93%> (-0.29%) ⬇️
spark-java-tests 48.65% <55.04%> (-0.09%) ⬇️
spark-scala-tests 45.23% <38.53%> (-0.16%) ⬇️
utilities 38.37% <36.69%> (-0.16%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...torage/row/HoodieInternalRowFileWriterFactory.java 89.28% <100.00%> (+0.39%) ⬆️
...rg/apache/hudi/common/table/HoodieTableConfig.java 94.37% <100.00%> (-0.46%) ⬇️
...che/hudi/io/storage/row/HoodieRowCreateHandle.java 85.58% <90.90%> (-0.40%) ⬇️
...rg/apache/hudi/HoodieDatasetBulkInsertHelper.scala 90.90% <33.33%> (-1.46%) ⬇️
.../hudi/io/storage/hadoop/HoodieAvroHFileWriter.java 87.95% <66.66%> (-2.96%) ⬇️
...he/hudi/io/storage/hadoop/HoodieAvroOrcWriter.java 83.09% <66.66%> (-3.06%) ⬇️
...udi/io/storage/hadoop/HoodieAvroParquetWriter.java 81.48% <54.54%> (-18.52%) ⬇️
...java/org/apache/hudi/config/HoodieWriteConfig.java 86.98% <70.83%> (-2.87%) ⬇️
...ache/hudi/io/storage/HoodieSparkParquetWriter.java 65.90% <40.90%> (-27.20%) ⬇️
...g/apache/hudi/io/storage/HoodieAvroFileWriter.java 43.47% <0.00%> (-56.53%) ⬇️

... and 812 files with indirect coverage changes


@prashantwason
Member Author

@suryaprasanna Thanks for the review! I've addressed all the feedback:

Changes in latest commit:

  • Renamed populateField to populateIndividualMetaFields across all 9 files
  • Made getMetaFieldsToExclude() private (only used internally)
  • Added withMetaFieldsToExclude() builder method for test convenience

Unit tests added:

  • 8 tests in TestHoodieWriteConfig covering getMetaFieldPopulationFlags(): default behavior, populateMetaFields=false, selective exclusion, exclude all, single field exclusion, whitespace handling, invalid field names, empty list
  • 1 integration test in TestHoodieRowCreateHandle (testSelectiveMetaFieldPopulation) that writes with excluded fields and verifies null values in parquet output

On OPERATION_METADATA_FIELD: It's not part of the standard 5 HOODIE_META_COLUMNS — it's in the separate HOODIE_META_COLUMNS_WITH_OPERATION set and marked as temporary. No change needed.

On table version upgrade: hoodie.meta.fields.to.exclude is a table config property in hoodie.properties, not a schema/format change. Existing tables without this property behave identically. Since it's purely additive, a table version upgrade shouldn't be required.

@nsivabalan
Contributor

I don't think this can be a writer property.
In the case of a MOR table, for snapshot reads, to merge base files and log files we need to know whether the _hoodie_record_key meta field is populated or not. That's why we added hoodie.populate.meta.fields as a table property, so that readers can rely on it.

Another reason this can't be just a writer property: we can't let users switch between true and false for it. For example, if meta fields were null in the first 5 commits and then enabled for the next 5, the merge handle would assume the meta fields are available and use them to merge with the previous base file, but in the previous base file the meta fields could be empty.

Let's rethink the solution. I have not reviewed the patch for now. Let's get alignment on the requirements and approach first.


Labels

size:L PR with lines of changes in (300, 1000]

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support selective meta field population

6 participants