[HUDI-3191] Removing duplicated file-listing process w/in Hive's MOR `FileInputFormat`s #4556
Conversation
I took one pass and need some clarification. @alexeykudinkin It would be great if you can put a quick summary of the actual logic changes in the query path and file listing in the PR description. That'll make it easier to review the nuances.
List<String> inputPaths = tableView.getLatestBaseFiles()
    .map(baseFile -> new Path(baseFile.getPath()).getParent().toString())
    .collect(Collectors.toList());
Same here
} else if (s instanceof RealtimeBootstrapBaseFileSplit) {
  rtSplits.add(s);
  RealtimeBootstrapBaseFileSplit bs = unsafeCast(s);
Is the type cast redundant here?
This is Java, sir :-)
Yeah, my point is, you're not calling methods only in `RealtimeBootstrapBaseFileSplit` here. `s` is an instance of `RealtimeBootstrapBaseFileSplit` already, and `RealtimeBootstrapBaseFileSplit` is a subclass of `InputSplit`. Directly adding `s` with `rtSplits.add(s);` is not a problem.
I see now. Makes sense
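A standalone illustration of the point, with simplified stand-in types rather than the actual Hudi classes:

import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins for the Hudi types, just to show the typing rules.
public class UpcastSketch {
  static class InputSplit {}

  static class RealtimeBootstrapBaseFileSplit extends InputSplit {
    String bootstrapBaseFilePath() { return "/tmp/base.parquet"; }
  }

  public static void main(String[] args) {
    List<InputSplit> rtSplits = new ArrayList<>();
    InputSplit s = new RealtimeBootstrapBaseFileSplit();

    // No cast needed: s already has static type InputSplit.
    rtSplits.add(s);

    if (s instanceof RealtimeBootstrapBaseFileSplit) {
      // A cast only pays off when subclass-specific members are accessed.
      RealtimeBootstrapBaseFileSplit bs = (RealtimeBootstrapBaseFileSplit) s;
      System.out.println(bs.bootstrapBaseFilePath());
    }
  }
}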
if (includeLogFilesForSnapshotView()) {
  if (baseFileOpt.isPresent()) {
    return createRealtimeFileStatusUnchecked(baseFileOpt.get(), logFiles, latestCompletedInstantOpt, tableMetaClient);
  } else if (latestLogFileOpt.isPresent()) {
    return createRealtimeFileStatusUnchecked(latestLogFileOpt.get(), logFiles, latestCompletedInstantOpt, tableMetaClient);
  } else {
    throw new IllegalStateException("Invalid state: either base-file or log-file has to be present");
  }
} else {
  if (baseFileOpt.isPresent()) {
    return getFileStatusUnchecked(baseFileOpt.get());
  } else {
    throw new IllegalStateException("Invalid state: base-file has to be present");
  }
}
It looks like the logic is changed here to cover broader cases, especially when both base file and log files exist. Basically, if the base file existed, log files were ignored before. However, for Snapshot mode, should log files be ignored by default? Or do you intend to use this method `listStatusForSnapshotMode()` broadly for different query modes (renaming it if so)?
To clarify, for a Snapshot query on a MOR table, we do need the merging of base file and log files. For a Snapshot query on a COW table, only the base file is needed. I think that is the intention of your changes here.
`listStatusForSnapshotMode()` is used for both COW/MOR, and whether we're handling COW or MOR is controlled by `includeLogFilesForSnapshotView` (will clean this up in a follow-up, cleaning up some assertions after the majority of the PRs land).

@yihua good call out. The idea is to remove duplicated file-listing operations w/in Hive's `FileInputFormat`s.
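A minimal sketch of how such a flag could be wired across the COW and MOR input formats (hypothetical class names and layout; the actual hierarchy in the PR may differ):

// Hypothetical sketch: COW vs MOR snapshot behavior toggled by one flag.
abstract class FileInputFormatBaseSketch {
  protected abstract boolean includeLogFilesForSnapshotView();
}

class CowInputFormatSketch extends FileInputFormatBaseSketch {
  @Override
  protected boolean includeLogFilesForSnapshotView() {
    return false; // COW snapshot: reading base files alone is sufficient
  }
}

class MorRealtimeInputFormatSketch extends CowInputFormatSketch {
  @Override
  protected boolean includeLogFilesForSnapshotView() {
    return true; // MOR snapshot: base files must be merged with log files
  }
}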
@hudi-bot run azure
@hudi-bot run azure
Have some clarifications apart from the comments already made. Will reach out.
    Stream<HoodieLogFile> logFiles,
    Option<HoodieInstant> latestCompletedInstantOpt,
    HoodieTableMetaClient tableMetaClient) {
  List<HoodieLogFile> sortedLogFiles = logFiles.sorted(HoodieLogFile.getLogFileComparator()).collect(Collectors.toList());
Did we move this method from elsewhere, or did you add it as part of this patch?
This is a new method.
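For context, a standalone analog of the sorting step above (simplified types, not the actual Hudi classes; `HoodieLogFile.getLogFileComparator()` orders log files, roughly, by base instant time and log version so they can be scanned in order):

import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LogFileSortSketch {
  // Simplified stand-in for HoodieLogFile.
  record LogFile(String baseInstantTime, int version) {}

  public static void main(String[] args) {
    Stream<LogFile> logFiles = Stream.of(
        new LogFile("20220101", 2),
        new LogFile("20220101", 1),
        new LogFile("20211231", 1));

    // Mirrors logFiles.sorted(HoodieLogFile.getLogFileComparator()).collect(...)
    List<LogFile> sortedLogFiles = logFiles
        .sorted(Comparator.comparing(LogFile::baseInstantTime)
            .thenComparingInt(LogFile::version))
        .collect(Collectors.toList());

    sortedLogFiles.forEach(System.out::println);
  }
}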
…es (#4716)

This change is addressing issues in regards to Metadata Table observing ingesting duplicated records leading to it persisting incorrect file-sizes for the files referred to in those records. There are multiple issues that were leading to that:

- [HUDI-3322] Incorrect Rollback Plan generation: Rollback Plan generated for MOR tables was overly expansively listing all log-files with the latest base-instant as the ones that have been affected by the rollback, leading to invalid MT records being ingested referring to those.
- [HUDI-3343] Metadata Table including Uncommitted Log Files during Bootstrap: Since MT is bootstrapped at the end of the commit operation execution (after FS activity, but before committing to the timeline), it was actually incorrectly ingesting some files that were part of the intermediate state of the operation being committed.

This change will unblock Stack of PRs based off #4556
…plit`; Added assertions
Made sure new fields could be properly deser'd
@alexeykudinkin is this PR ready for another look, or are you still addressing comments?
@yihua yeah, it's rebased on master now and ready for another pass
@hudi-bot run azure
      latestCommitInstant.getTimestamp(),
      false);
} else if (split instanceof BaseFileWithLogsSplit) {
  BaseFileWithLogsSplit baseFileWithLogsSplit = unsafeCast(split);
Will the maxCommitTime in baseFileSplit be in sync with the latestCommitInstant computed at L89? Prior to this patch, we used the latestCommitInstant computed here, whereas now we just reuse the one that comes from BaseFileWithLogsSplit.
Just wanted to confirm, as this code is new to me.
Yes, it's in sync. However, you brought up a very good point that the instant shouldn't actually be set here. This will be cleaned up in subsequent PRs where `HoodieRealtimeFileSplit` will be merged with `BaseFileWithLogsSplit`.
public class HoodieRealtimeInputFormatUtils extends HoodieInputFormatUtils {

  private static final Logger LOG = LogManager.getLogger(HoodieRealtimeInputFormatUtils.class);

-  public static InputSplit[] getRealtimeSplits(Configuration conf, Stream<FileSplit> fileSplits) {
+  public static InputSplit[] getRealtimeSplits(Configuration conf, List<FileSplit> fileSplits) throws IOException {
This refactoring makes total sense, assuming each FileSplit will correspond to one FileSlice and there won't be a case where multiple FileSplits can store info about a single FileSlice.
Thanks for doing this.
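One plausible motivation for the `Stream<FileSplit>` to `List<FileSplit>` change (an assumption, not stated in the thread): a `Stream` can be consumed only once, while a `List` can be traversed repeatedly, which matters once the splits need more than one pass:

import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class StreamVsListSketch {
  public static void main(String[] args) {
    Stream<String> splits = Stream.of("split-1", "split-2");
    splits.forEach(System.out::println);
    // A second terminal operation on the same Stream would throw
    // IllegalStateException ("stream has already been operated upon or closed"):
    // splits.count();

    List<String> splitList = Stream.of("split-1", "split-2")
        .collect(Collectors.toList());
    splitList.forEach(System.out::println); // first pass
    System.out.println(splitList.size());   // further passes are fine on a List
  }
}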
LGTM
protected abstract boolean includeLogFilesForSnapshotView();

@Nonnull
private static RealtimeFileStatus createRealtimeFileStatusUnchecked(HoodieBaseFile baseFile,
As discussed offline, one-line javadocs are going to be added for both methods for clarity in the following PR:
/**
* Creates real-time FileStatus for a base file with log files.
*/
@Nonnull
private static RealtimeFileStatus createRealtimeFileStatusUnchecked(HoodieBaseFile baseFile,
Stream<HoodieLogFile> logFiles,
Option<HoodieInstant> latestCompletedInstantOpt,
HoodieTableMetaClient tableMetaClient)
/**
* Creates real-time FileStatus for the log files only.
*/
@Nonnull
private static RealtimeFileStatus createRealtimeFileStatusUnchecked(HoodieLogFile latestLogFile,
Stream<HoodieLogFile> logFiles,
Option<HoodieInstant> latestCompletedInstantOpt,
HoodieTableMetaClient tableMetaClient)
What is the purpose of the pull request
Removing duplicated file-listing process w/in Hive's MOR `Hoodie{Parquet|HFile}RealtimeInputFormat`
Brief change log
See above
Verify this pull request
This pull request is already covered by existing tests, such as (please describe tests).
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.