feat(flink): collect event time in HoodieRowDataCreateHandle for min/max event time metrics #18250

Merged
prashantwason merged 2 commits into apache:master from jianchun:feat/flink-event-time-create-handle-oss
Feb 27, 2026

Contributor

@jianchun jianchun commented Feb 26, 2026

Describe the issue this Pull Request addresses

Flink bulk insert does not populate min/max event time in WriteStatus, so use cases that rely on event time stats (e.g. metrics) do not work. This PR adds event time collection in HoodieRowDataCreateHandle when hoodie.payload.event.time.field is configured, so that the Flink WriteStatus can report min/max event time.

Summary and Changelog

  • Summary: When hoodie.payload.event.time.field is configured, HoodieRowDataCreateHandle reads the event time from each record and passes it to WriteStatus.markSuccess as record metadata. WriteStatus then updates min/max event time for the file. No time unit is interpreted in the handle; the value is converted to long then string. WriteStatus infers unit by string length (e.g. 10 digits vs 13) as today.
  • Changelog:
    • Support DOUBLE and BIGINT for the event time field (unsupported types are skipped with a warning).
    • Use RowData.FieldGetter (Flink API) to read the field by name/type; getter is null when the field is not configured, not found, or unsupported.
    • Extract value as long then string; no unit conversion in the handle.
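The changelog above can be illustrated with a minimal sketch. It uses plain-Java stand-ins rather than the Flink or Hudi classes: `EventTimeGetterSketch`, `FieldType`, `createGetter`, and `extract` are hypothetical names, a `Function<Object[], Object>` stands in for `RowData.FieldGetter`, and an `Object[]` stands in for a row. The actual handle code in the PR may be structured differently.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

/** Sketch only: plain-Java stand-ins, not the Flink or Hudi classes. */
final class EventTimeGetterSketch {

  /** Hypothetical stand-in for the column types; only DOUBLE and BIGINT are supported. */
  enum FieldType { DOUBLE, BIGINT, STRING }

  /**
   * Returns a reader for the event time column, or null when the field is
   * not configured, not found in the schema, or has an unsupported type.
   */
  static Function<Object[], Object> createGetter(
      String eventTimeField, List<String> fieldNames, List<FieldType> fieldTypes) {
    if (eventTimeField == null || eventTimeField.isEmpty()) {
      return null; // config not set
    }
    int index = fieldNames.indexOf(eventTimeField);
    if (index < 0) {
      return null; // field not present in the schema
    }
    FieldType type = fieldTypes.get(index);
    if (type != FieldType.DOUBLE && type != FieldType.BIGINT) {
      return null; // unsupported type (the PR describes logging a warning in this case)
    }
    return row -> row[index];
  }

  /** Reads the value, narrows it to long, and renders it as a string; no unit conversion. */
  static Optional<String> extract(Function<Object[], Object> getter, Object[] row) {
    if (getter == null) {
      return Optional.empty();
    }
    Object value = getter.apply(row);
    if (!(value instanceof Number)) {
      return Optional.empty(); // null or unexpected value
    }
    return Optional.of(String.valueOf(((Number) value).longValue()));
  }
}
```

In this sketch a DOUBLE value of epoch seconds loses its fractional part when narrowed to long, while a BIGINT value passes through unchanged.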

Impact

  • User-facing: Existing config hoodie.payload.event.time.field (field name only) is used. When set and the field is DOUBLE or BIGINT, Flink bulk insert will populate min/max event time in WriteStatus. No new config; no change when the config is unset or the field is missing/unsupported.
  • Performance: One field read per record when event time is enabled; negligible.

Risk Level

Low. Change is scoped to Flink HoodieRowDataCreateHandle and its tests. Event time is optional and only applied when the configured field exists and is DOUBLE or BIGINT. Defensive try-catch avoids failing the write on read errors.
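The defensive try-catch mentioned here might look roughly like the following sketch. Only the method name `extractEventTimeMetadata` comes from the commit message; the class name, signature, and body are assumptions for illustration, with a `Function<Object[], Object>` standing in for the Flink field getter and an `Object[]` standing in for a row.

```java
import java.util.Optional;
import java.util.function.Function;

/** Sketch only: illustrates swallowing read errors so the write itself does not fail. */
final class DefensiveExtractSketch {

  static Optional<String> extractEventTimeMetadata(
      Function<Object[], Object> getter, Object[] row) {
    if (getter == null) {
      return Optional.empty(); // event time collection is disabled
    }
    try {
      Object value = getter.apply(row);
      if (value == null) {
        return Optional.empty();
      }
      // A value of an unexpected type raises ClassCastException here.
      return Optional.of(String.valueOf(((Number) value).longValue()));
    } catch (RuntimeException e) {
      // Schema mismatch or similar: skip event time for this record.
      return Optional.empty();
    }
  }
}
```

The design choice in this sketch is that a record with an unreadable event time is still written; it simply does not contribute to the min/max statistics.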

Documentation Update

None. The config key is existing; behavior is an extension of its use for Flink (min/max event time in WriteStatus). No new configs or defaults.

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

…max event time metrics

- Add event time field index from hoodie.payload.event.time.field (DOUBLE, epoch seconds)
- Extract event time in write() when !skipMetadataWrite and pass to WriteStatus.markSuccess
- Enables Flink ingestion min/max event time in commit stats for latency metrics
- Add TestHoodieRowDataCreateHandle for event time field index and extraction

Made-with: Cursor
for (int i = 0; i < rowType.getFieldCount(); i++) {
  if (rowType.getFieldNames().get(i).equals(eventTimeField)) {
    RowType.RowField field = rowType.getFields().get(i);
    if (field.getType().getTypeRoot() == LogicalTypeRoot.DOUBLE) {
Contributor

Why double instead of long? Long-typed milliseconds seems to be the most widely used.

Contributor Author

Changed to support both double and long.

  return Option.empty();
}
double eventTimeSeconds = rowData.getDouble(eventTimeFieldIndex);
long eventTimeMillis = (long) (eventTimeSeconds * 1000);
Contributor

How are we assured of the time unit?

Contributor Author

You are correct. I searched the code and could not find a good way to tell the time unit here. WriteStatus later uses a heuristic based on digit length. The easiest approach is to skip the conversion here and let WriteStatus continue to handle it. If the field is a double holding seconds, we lose a bit of precision (the millisecond part), which seems acceptable. I changed the implementation to skip the conversion here, and it now supports both double and long types.
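To make the trade-off concrete, here is a rough sketch of a digit-length heuristic and of the precision lost when a double holding seconds is narrowed to long. The class and method names are hypothetical, and the exact rules in WriteStatus are an assumption here: the sketch treats 10 digits as epoch seconds and 13 digits as epoch milliseconds, and gives up on any other length.

```java
import java.util.OptionalLong;

/** Sketch only: an assumed digit-length heuristic, not the actual WriteStatus code. */
final class EventTimeUnitSketch {

  /** What the handle passes along for a DOUBLE field: the value narrowed to long, as a string. */
  static String doubleToMetadata(double eventTime) {
    return String.valueOf((long) eventTime); // fractional seconds are dropped
  }

  /** Assumed heuristic: 10 digits means seconds, 13 digits means milliseconds. */
  static OptionalLong toEpochMillis(String eventTime) {
    if (eventTime.length() == 10) {
      return OptionalLong.of(Long.parseLong(eventTime) * 1000L);
    }
    if (eventTime.length() == 13) {
      return OptionalLong.of(Long.parseLong(eventTime));
    }
    return OptionalLong.empty(); // unit cannot be inferred from the length
  }
}
```

Under these assumptions, a double value of 1700000000.789 seconds becomes the string "1700000000" and is read back as 1700000000000 ms, so the 789 ms fraction is lost, whereas a BIGINT value of 1700000000789 is preserved exactly.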

@github-actions github-actions Bot added the size:L PR with lines of changes in (300, 1000] label Feb 26, 2026
…ateHandle

- Use RowData.FieldGetter for event time field (hoodie.payload.event.time.field)
- Support both DOUBLE and BIGINT; getter is null when field not configured or unsupported type
- Extract value as long then string (no unit interpretation; WriteStatus infers by length)
- Add defensive try-catch in extractEventTimeMetadata for schema mismatch edge cases
- Add testEventTimeExtractionWithBigintMillis for BIGINT (millis) coverage

Made-with: Cursor
@hudi-bot
Collaborator

CI report:

@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 57.29%. Comparing base (fb7b1a5) to head (c83c1f0).
⚠️ Report is 5 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18250      +/-   ##
============================================
- Coverage     57.30%   57.29%   -0.01%     
  Complexity    18561    18561              
============================================
  Files          1945     1946       +1     
  Lines        106256   106319      +63     
  Branches      13131    13140       +9     
============================================
+ Hits          60885    60914      +29     
- Misses        39648    39677      +29     
- Partials       5723     5728       +5     
Flag Coverage Δ
hadoop-mr-java-client 45.41% <ø> (+0.01%) ⬆️
spark-java-tests 47.42% <ø> (+<0.01%) ⬆️
spark-scala-tests 45.51% <ø> (-0.01%) ⬇️

see 20 files with indirect coverage changes

@prashantwason prashantwason merged commit e1ae9c6 into apache:master Feb 27, 2026
72 of 74 checks passed
@jianchun jianchun deleted the feat/flink-event-time-create-handle-oss branch February 27, 2026 18:29
dwshmilyss pushed a commit to dwshmilyss/hudi that referenced this pull request Mar 11, 2026
…max event time metrics (apache#18250)


Labels

size:L PR with lines of changes in (300, 1000]

5 participants