[HUDI-6452] Add MOR snapshot reader to integrate with query engines without using Hadoop APIs #9066
Conversation
Force-pushed from 8662958 to 60c1b8c
LGTM with a few clarification questions.
...hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieMergeOnReadSnapshotReader.java
this.logRecordScanner = getMergedLogRecordScanner();
LOG.debug("Time taken to scan log records: {}", timer.endTimer());
this.baseFileReader = getBaseFileReader(new Path(baseFilePath), jobConf);
this.logRecordsByKey = logRecordScanner.getRecords();
Do we need an external spillable map for the log records as well?
The external spillable map is used to hold the merged records by key, while logRecordsByKey is a simple map. We could use a simple map for the merged records too, but I wasn't sure how many records there could be across base and log files, or what resources the user's environment has, so I kept the external spillable map.
My concern is whether the records in logRecordsByKey can exceed memory, and why we use a simple map for logRecordsByKey but not for mergedRecordsByKey. If the existing realtime reader also uses a simple map there, it should be OK. Is that the case?
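For context, the trade-off under discussion can be sketched as follows. This is a toy illustration of the spillable-map idea only; it is not Hudi's actual ExternalSpillableMap, and the class and field names here are hypothetical. A real implementation spills overflow entries to disk, which the secondary map below merely stands in for.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Toy sketch of a spillable map: keeps at most maxInMemory entries in
 * memory and routes the overflow to a secondary store. Here the secondary
 * store is just another in-memory map standing in for a disk-backed file.
 */
class SpillableMapSketch<K, V> {
  private final int maxInMemory;
  private final Map<K, V> inMemory = new LinkedHashMap<>();
  private final Map<K, V> spilled = new HashMap<>(); // stand-in for disk

  SpillableMapSketch(int maxInMemory) {
    this.maxInMemory = maxInMemory;
  }

  void put(K key, V value) {
    if (inMemory.containsKey(key) || inMemory.size() < maxInMemory) {
      inMemory.put(key, value);
    } else {
      spilled.put(key, value); // over the memory budget: spill
    }
  }

  V get(K key) {
    V v = inMemory.get(key);
    return (v != null) ? v : spilled.get(key);
  }

  int size() {
    return inMemory.size() + spilled.size();
  }
}
```

The point of the discussion above is exactly this budget: a simple map is fine if the total record payload is known to fit in memory, while a spillable map trades some lookup cost for a bounded memory footprint.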
Force-pushed from 2eabd28 to 8588674
 * @param start Start offset
 * @param length Length of the split
 */
public HoodieMergeOnReadSnapshotReader(String tableBasePath,
oh man. another record reader impl?
Did we take a look at HoodieMergedReadHandle?
I understand this is specifically avoiding Hadoop APIs, but do we have a common interface and some generic impl? Maybe we don't need two separate impls? Or am I missing something.
Change Logs
Adds HoodieMergeOnReadSnapshotReader, which implements Iterator<HoodieRecord>. It merges the base Parquet data with Avro data in log files.

Impact
It is a public API and can be used in query engines to execute Hudi MOR snapshot queries. It is an alternative to using RealtimeRecordReaders.
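The merge semantics described above can be sketched as a simple key-based overlay. This is an illustrative example only, not the reader's actual implementation: the class and method names are hypothetical, and real MOR merging also has to handle deletes, ordering fields, and payload-specific merge logic, which this sketch omits.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Toy sketch of MOR snapshot merge semantics: a log record overrides the
 * base-file record with the same key (an update), and keys present only in
 * the log files are new inserts appended after the base records.
 */
class MorMergeSketch {
  static Map<String, String> mergeSnapshot(Map<String, String> baseRecords,
                                           Map<String, String> logRecords) {
    // Start from the base Parquet records, then overlay the log records:
    // putAll replaces values for matching keys and adds log-only keys.
    Map<String, String> merged = new LinkedHashMap<>(baseRecords);
    merged.putAll(logRecords);
    return merged;
  }
}
```

An Iterator<HoodieRecord> over the result of such a merge is, conceptually, what a MOR snapshot query consumes: the latest value per key across base and log files.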
Risk level (write none, low, medium, or high below)
low
Documentation Update
Describe any necessary documentation update if there is any new feature, config, or user-facing change; attach the ticket number here and follow the instructions to make changes to the website.
Contributor's checklist