[HUDI-6452] Add MOR snapshot reader to integrate with query engines without using Hadoop APIs #9066

Merged
5 commits merged into apache:master on Jul 5, 2023

@codope (Member) commented Jun 27, 2023

Change Logs

Impact

This PR adds a public API that query engines can use to execute Hudi MOR (merge-on-read) snapshot queries without going through Hadoop APIs. It is an alternative to using the RealtimeRecordReaders.

Risk level (write none, low, medium, or high below)

low

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

  • The config description must be updated if new configs are added or the default values of configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here, and follow the instructions to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@codope marked this pull request as draft June 27, 2023 16:32
@codope added the priority:critical (production down; pipelines stalled; need help ASAP), release-0.14.0, and query-engine (trino, presto, athena, impala, etc.) labels Jun 28, 2023
@codope force-pushed the dehadoop-trino-bundle branch 2 times, most recently from 8662958 to 60c1b8c, June 29, 2023 10:10
@codope marked this pull request as ready for review June 29, 2023 10:11
@codope added the priority:blocker label and removed the priority:critical label Jun 30, 2023
@apache deleted a comment from hudi-bot Jun 30, 2023
Contributor

@yihua left a comment

LGTM with a few clarification questions.

this.logRecordScanner = getMergedLogRecordScanner();
LOG.debug("Time taken to scan log records: {}", timer.endTimer());
this.baseFileReader = getBaseFileReader(new Path(baseFilePath), jobConf);
this.logRecordsByKey = logRecordScanner.getRecords();
Contributor

Do we need an external spillable map for the log records as well?

Member Author

The external spillable map is used to hold the merged records by key; logRecordsByKey is a simple map. We could use a simple map for the merged records too, but I wasn't sure how many records there can be across the base and log files, or what resources the user's environment has, so I kept the external spillable map.
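To make the trade-off in this reply concrete, here is a minimal sketch of the general idea behind a spillable map: keep entries in memory up to a budget, then write further entries to local disk. It is not Hudi's implementation. The class name `SpillableMapSketch`, the entry-count budget, and the one-file-per-entry layout are all assumptions chosen for brevity (a real implementation would more likely budget by estimated bytes).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not Hudi's external spillable map.
final class SpillableMapSketch {
    private final int maxInMemoryEntries;                          // assumed budget: entry count, not bytes
    private final Path spillDir;                                   // temp directory holding spilled values
    private final Map<String, String> inMemory = new HashMap<>();
    private final Map<String, Path> spilled = new HashMap<>();     // key -> file holding the value
    private long nextFileId = 0;

    public SpillableMapSketch(int maxInMemoryEntries) {
        this.maxInMemoryEntries = maxInMemoryEntries;
        Path dir;
        try {
            dir = Files.createTempDirectory("spill-sketch");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        this.spillDir = dir;
    }

    public void put(String key, String value) {
        if (inMemory.containsKey(key)) {
            inMemory.put(key, value);                              // update an in-memory entry in place
            return;
        }
        Path file = spilled.get(key);
        if (file == null && inMemory.size() < maxInMemoryEntries) {
            inMemory.put(key, value);                              // new key and still room in memory
            return;
        }
        if (file == null) {                                        // new key, memory budget exhausted
            file = spillDir.resolve("entry-" + nextFileId++);
            spilled.put(key, file);
        }
        try {
            Files.write(file, value.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public String get(String key) {
        String value = inMemory.get(key);
        if (value != null) {
            return value;
        }
        Path file = spilled.get(key);
        if (file == null) {
            return null;                                           // key is absent
        }
        try {
            return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public int size() {
        return inMemory.size() + spilled.size();
    }

    public int inMemoryCount() {
        return inMemory.size();
    }
}
```

Note that this sketch still keeps every key in memory and only moves values to disk, so it bounds value memory rather than total memory.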

Contributor

My concern is whether the records in logRecordsByKey can exceed available memory, and why we use a simple map for logRecordsByKey but not for mergedRecordsByKey. If the existing realtime reader also uses a simple map for the log records, it should be OK. Is that the case?
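For context on the two maps discussed in this thread, the following is a minimal sketch of a merge-on-read snapshot merge: base file records are combined with log records keyed by record key, and the result is collected in a merged-records map. It is an illustration under stated assumptions, not the code in this PR. `MorMergeSketch`, `Rec`, and `mergeSnapshot` are hypothetical names, and the rule that a log record always overrides its base record is a simplification (the actual merge semantics are not shown in this conversation).

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; names and merge rule are assumptions.
final class MorMergeSketch {

    /** A record reduced to a key, a payload, and a delete marker. */
    public static final class Rec {
        public final String key;
        public final String value;
        public final boolean deleted;

        public Rec(String key, String value, boolean deleted) {
            this.key = key;
            this.value = value;
            this.deleted = deleted;
        }
    }

    /**
     * Merges base file records with log records keyed by record key.
     * Assumed rule: a log record always wins over the base record with the same key.
     */
    public static Map<String, String> mergeSnapshot(List<Rec> baseRecords,
                                                    Map<String, Rec> logRecordsByKey) {
        Map<String, String> mergedRecordsByKey = new LinkedHashMap<>();
        // Copy so we can track which log records never matched a base record.
        Map<String, Rec> unmatchedLogRecords = new LinkedHashMap<>(logRecordsByKey);

        for (Rec base : baseRecords) {
            Rec log = unmatchedLogRecords.remove(base.key);
            if (log == null) {
                mergedRecordsByKey.put(base.key, base.value);   // untouched base record
            } else if (!log.deleted) {
                mergedRecordsByKey.put(base.key, log.value);    // update from the log
            }
            // A delete in the log drops the base record from the snapshot.
        }

        // Log records with no base counterpart are inserts, unless they are deletes.
        for (Rec log : unmatchedLogRecords.values()) {
            if (!log.deleted) {
                mergedRecordsByKey.put(log.key, log.value);
            }
        }
        return mergedRecordsByKey;
    }
}
```

In this sketch both maps live fully in memory; the question raised above is which of them is large enough to need spilling to disk.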


@codope merged commit 66e11e5 into apache:master Jul 5, 2023
24 checks passed
* @param start Start offset
* @param length Length of the split
*/
public HoodieMergeOnReadSnapshotReader(String tableBasePath,
Contributor

Oh man, another record reader impl?
Did we take a look at HoodieMergedReadHandle?

Contributor

I understand this is specifically avoiding Hadoop APIs, but do we have a common interface and some generic impl? Maybe we don't need two separate impls, or am I missing something?

Labels
priority:blocker, query-engine (trino, presto, athena, impala, etc.), release-0.14.0
Projects
Status: ✅ Done

4 participants