[HUDI-6452] Add MOR snapshot reader to integrate with query engines without using Hadoop APIs #9066
Conversation
Force-pushed from 8662958 to 60c1b8c
LGTM with a few clarification questions.
...hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieMergeOnReadSnapshotReader.java
this.logRecordScanner = getMergedLogRecordScanner();
LOG.debug("Time taken to scan log records: {}", timer.endTimer());
this.baseFileReader = getBaseFileReader(new Path(baseFilePath), jobConf);
this.logRecordsByKey = logRecordScanner.getRecords();
Do we need an external spillable map for the log records as well?
The external spillable map is used to hold the merged records by key, while logRecordsByKey is a simple map. We could use a simple map for the merged records too, but I wasn't sure how many records there could be across base and log files, or what resources the user's environment has, so I kept the external spillable map.
My concern is whether the records in logRecordsByKey can exceed memory, and why we use a simple map for logRecordsByKey but not for mergedRecordsByKey. If the existing realtime reader also uses a simple map there, it should be OK. Is that the case?
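For context, the trade-off under discussion can be sketched as follows. This is a toy illustration of the spillable-map idea only; it is not Hudi's actual ExternalSpillableMap, and the class and field names here are hypothetical. A real implementation spills overflow entries to disk, which the secondary map below merely stands in for.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Toy sketch of a spillable map: keeps at most maxInMemory entries in
 * memory and routes the overflow to a secondary store. Here the secondary
 * store is just another in-memory map standing in for a disk-backed file.
 */
class SpillableMapSketch<K, V> {
  private final int maxInMemory;
  private final Map<K, V> inMemory = new LinkedHashMap<>();
  private final Map<K, V> spilled = new HashMap<>(); // stand-in for disk

  SpillableMapSketch(int maxInMemory) {
    this.maxInMemory = maxInMemory;
  }

  void put(K key, V value) {
    if (inMemory.containsKey(key) || inMemory.size() < maxInMemory) {
      inMemory.put(key, value);
    } else {
      spilled.put(key, value); // over the memory budget: spill
    }
  }

  V get(K key) {
    V v = inMemory.get(key);
    return (v != null) ? v : spilled.get(key);
  }

  int size() {
    return inMemory.size() + spilled.size();
  }
}
```

The point of the discussion above is exactly this budget: a simple map is fine if the total record payload is known to fit in memory, while a spillable map trades some lookup cost for a bounded memory footprint.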
Force-pushed from 2eabd28 to 8588674
 * @param start Start offset
 * @param length Length of the split
 */
public HoodieMergeOnReadSnapshotReader(String tableBasePath,
oh man. another record reader impl?
Did we take a look at HoodieMergedReadHandle?
I understand this is specifically avoiding Hadoop APIs, but do we have a common interface and some generic impl? Maybe we don't need two separate impls? Or am I missing something.
Change Logs
Adds HoodieMergeOnReadSnapshotReader, which implements Iterator<HoodieRecord>. It merges the base Parquet data with Avro data in log files.

Impact
It is a public API and can be used in query engines to execute Hudi MOR snapshot queries. It is an alternative to using RealtimeRecordReaders.
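The merge semantics described above can be sketched as a simple key-based overlay. This is an illustrative example only, not the reader's actual implementation: the class and method names are hypothetical, and real MOR merging also has to handle deletes, ordering fields, and payload-specific merge logic, which this sketch omits.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Toy sketch of MOR snapshot merge semantics: a log record overrides the
 * base-file record with the same key (an update), and keys present only in
 * the log files are new inserts appended after the base records.
 */
class MorMergeSketch {
  static Map<String, String> mergeSnapshot(Map<String, String> baseRecords,
                                           Map<String, String> logRecords) {
    // Start from the base Parquet records, then overlay the log records:
    // putAll replaces values for matching keys and adds log-only keys.
    Map<String, String> merged = new LinkedHashMap<>(baseRecords);
    merged.putAll(logRecords);
    return merged;
  }
}
```

An Iterator<HoodieRecord> over the result of such a merge is, conceptually, what a MOR snapshot query consumes: the latest value per key across base and log files.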
Risk level (write none, low, medium, or high below)
low
Documentation Update
Describe any necessary documentation update if there is any new feature, config, or user-facing change; attach the ticket number here and follow the instructions to make changes to the website.
Contributor's checklist