Skip to content

[HUDI-6801] Implement merging partial updates from log files for MOR tables#9883

Merged
yihua merged 23 commits intoapache:masterfrom
yihua:HUDI-6801-merging-partial-updates
Oct 31, 2023
Merged

[HUDI-6801] Implement merging partial updates from log files for MOR tables#9883
yihua merged 23 commits intoapache:masterfrom
yihua:HUDI-6801-merging-partial-updates

Conversation

@yihua
Copy link
Contributor

@yihua yihua commented Oct 19, 2023

Change Logs

This PR adds the logic of merging partial updates in the new file group reader with Spark record merger.

Impact

Supports snapshot queries on file groups with partial updates.

Risk level

low

Documentation Update

N/A

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@yihua yihua force-pushed the HUDI-6801-merging-partial-updates branch from 7eb16f5 to 2de35c7 Compare October 19, 2023 05:26
@yihua yihua changed the title [HUDI-6801][WIP] Implement merging partial updates from log files for MOR … [HUDI-6801] Implement merging partial updates from log files for MOR tables Oct 21, 2023
@yihua yihua force-pushed the HUDI-6801-merging-partial-updates branch 5 times, most recently from a54b000 to c140ff4 Compare October 23, 2023 06:27
@yihua yihua marked this pull request as ready for review October 23, 2023 06:27
@yihua yihua force-pushed the HUDI-6801-merging-partial-updates branch 2 times, most recently from 085a858 to 0b36862 Compare October 25, 2023 23:28
Map<String, Object> meta = new HashMap<>();
meta.put(INTERNAL_META_RECORD_KEY, getRecordKey(record, schema));
meta.put(INTERNAL_META_SCHEMA, schema);
meta.put(INTERNAL_META_IS_PARTIAL, isPartial);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering whether we can represent the metadata as a POJO to make the interface more explicit and clear.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can. We can take it up in a separate PR.

String recordKey = readerContext.getRecordKey(baseRecord, baseFileSchema);
Pair<Option<T>, Map<String, Object>> logRecordInfo = records.remove(recordKey);
Map<String, Object> metadata = readerContext.generateMetadataForRecord(
baseRecord, baseFileSchema, false);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution for the performace regression for per-record metadata construction.

Copy link
Contributor Author

@yihua yihua Oct 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll do benchmarking on this. Likely it's ok given we're not doing much work inside the method and the existing log reader also extract some metadata for each record.

@yihua yihua force-pushed the HUDI-6801-merging-partial-updates branch from 78480b5 to 48be011 Compare October 26, 2023 07:02
boolean isValid = data == null || data instanceof UnsafeRow
|| schema != null && (data instanceof HoodieInternalRow || SparkAdapterSupport$.MODULE$.sparkAdapter().isColumnarBatchRow(data));
|| schema != null && (data instanceof HoodieInternalRow
|| data instanceof GenericInternalRow
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original indentation is more clear.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The line is too long. I've revised the indentation to make it more clear.

Copy link
Contributor

@danny0405 danny0405 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, let's supplement with more tests if it is necessary.

@apache apache deleted a comment from hudi-bot Oct 30, 2023
@yihua yihua merged commit 26ecb28 into apache:master Oct 31, 2023
LXin96 pushed a commit to LXin96/hudi that referenced this pull request Nov 2, 2023
…tables (apache#9883)

This commit adds the logic of merging partial updates in the new file group reader with Spark record merger.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants