Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[HUDI-3191] Removing duplicating file-listing process w/in Hive's MOR FileInputFormats #4556

Merged
merged 29 commits into from
Feb 3, 2022

Conversation

alexeykudinkin
Copy link
Contributor

@alexeykudinkin alexeykudinkin commented Jan 11, 2022

Tips

What is the purpose of the pull request

Removing duplicated file-listing process w/in Hive's MOR Hoodie{Parquet|HFile}RealtimeInputFormat

Brief change log

See above

Verify this pull request

This pull request is already covered by existing tests, such as (please describe tests).

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@alexeykudinkin alexeykudinkin force-pushed the ak/rpath-ref-4 branch 2 times, most recently from 31b0669 to 096e26a Compare January 12, 2022 01:52
@vinothchandar vinothchandar added this to Ready for Review in PR Tracker Board Jan 13, 2022
@alexeykudinkin alexeykudinkin force-pushed the ak/rpath-ref-4 branch 2 times, most recently from 97dd5ec to 0538475 Compare January 14, 2022 18:56
@alexeykudinkin alexeykudinkin force-pushed the ak/rpath-ref-4 branch 3 times, most recently from 368c099 to 28a5a4f Compare January 19, 2022 21:35
@alexeykudinkin alexeykudinkin changed the title [HUDI-3191][Stacked on 4531] Removing duplicating file-listing process w/in Hive's MOR FIleInputFormats [HUDI-3191] Removing duplicating file-listing process w/in Hive's MOR FIleInputFormats Jan 19, 2022
@alexeykudinkin alexeykudinkin changed the title [HUDI-3191] Removing duplicating file-listing process w/in Hive's MOR FIleInputFormats [HUDI-3191] Removing duplicating file-listing process w/in Hive's MOR FileInputFormats Jan 21, 2022
@yihua yihua self-assigned this Jan 21, 2022
Copy link
Contributor

@yihua yihua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took one pass and need some clarification. @alexeykudinkin It would be great if you can put a quick summary of the actual logic changes in the query path and file listing in the PR description. That'll make it easier to review the nuances.

Comment on lines +217 to +219
List<String> inputPaths = tableView.getLatestBaseFiles()
.map(baseFile -> new Path(baseFile.getPath()).getParent().toString())
.collect(Collectors.toList());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here

} else if (s instanceof RealtimeBootstrapBaseFileSplit) {
rtSplits.add(s);
RealtimeBootstrapBaseFileSplit bs = unsafeCast(s);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is type cast redundant here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is Java, sir

:-)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, my point is, you're not calling methods only in RealtimeBootstrapBaseFileSplit here. s is a instance of RealtimeBootstrapBaseFileSplit already and RealtimeBootstrapBaseFileSplit is a subclass of InputSplit. Directly adding s with rtSplits.add(s); is not a problem.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see now. Makes sense

Comment on lines +295 to 308
if (includeLogFilesForSnapshotView()) {
if (baseFileOpt.isPresent()) {
return createRealtimeFileStatusUnchecked(baseFileOpt.get(), logFiles, latestCompletedInstantOpt, tableMetaClient);
} else if (latestLogFileOpt.isPresent()) {
return createRealtimeFileStatusUnchecked(latestLogFileOpt.get(), logFiles, latestCompletedInstantOpt, tableMetaClient);
} else {
throw new IllegalStateException("Invalid state: either base-file or log-file has to be present");
}
} else {
throw new IllegalStateException("Invalid state: either base-file or log-file should be present");
if (baseFileOpt.isPresent()) {
return getFileStatusUnchecked(baseFileOpt.get());
} else {
throw new IllegalStateException("Invalid state: base-file has to be present");
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the logic is changed here to cover broader cases, especially when both base file and log files exist. Basically, if base file exists, log files are ignored before. However, for Snapshot mode, should log files be ignored by default? Or do you intend to use this method listStatusForSnapshotMode() broadly for different query mode (renaming it if true)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To clarify, for Snapshot query on MOR table, we do need the merging of base file and log files. For Snapshot query on COW table, only base file is needed. I think that is your intention of changes here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

listStatusForSnapshotMode() is used for both COW/MOR, and whether we're handling COW/MOR is controlled by includeLogFilesForSnapshotView (will clean this up in a follow up, cleaning up some assertions after majority of the PRs land)

@alexeykudinkin
Copy link
Contributor Author

@yihua good call out. Idea is to remove duplicated file-listing operations w/in HoodieInputFormatUtils.getRealtimeSplits methods, leveraging file-structure fetched by listFileStatus w/in InputFormat impls.

@alexeykudinkin alexeykudinkin force-pushed the ak/rpath-ref-4 branch 2 times, most recently from 5b8f581 to 13702fc Compare January 29, 2022 01:37
@alexeykudinkin alexeykudinkin changed the title [HUDI-3191] Removing duplicating file-listing process w/in Hive's MOR FileInputFormats [HUDI-3191][Stacked on 4716] Removing duplicating file-listing process w/in Hive's MOR FileInputFormats Jan 29, 2022
@alexeykudinkin
Copy link
Contributor Author

@hudi-bot run azure

@alexeykudinkin
Copy link
Contributor Author

@hudi-bot run azure

2 similar comments
@alexeykudinkin
Copy link
Contributor Author

@hudi-bot run azure

@alexeykudinkin
Copy link
Contributor Author

@hudi-bot run azure

Copy link
Contributor

@nsivabalan nsivabalan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

have some clarifications apart from the comments already made. will reach out.

Stream<HoodieLogFile> logFiles,
Option<HoodieInstant> latestCompletedInstantOpt,
HoodieTableMetaClient tableMetaClient) {
List<HoodieLogFile> sortedLogFiles = logFiles.sorted(HoodieLogFile.getLogFileComparator()).collect(Collectors.toList());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did we move this method from elsewhere or did you add it as part of this patch ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a new

nsivabalan pushed a commit that referenced this pull request Feb 2, 2022
…es (#4716)

This change is addressing issues in regards to Metadata Table observing ingesting duplicated records leading to it persisting incorrect file-sizes for the files referred to in those records.

There are multiple issues that were leading to that:

- [HUDI-3322] Incorrect Rollback Plan generation: Rollback Plan generated for MOR tables was overly expansively listing all log-files with the latest base-instant as the ones that have been affected by the rollback, leading to invalid MT records being ingested referring to those.
- [HUDI-3343] Metadata Table including Uncommitted Log Files during Bootstrap: Since MT is bootstrapped at the end of the commit operation execution (after FS activity, but before committing to the timeline), it was actually incorrectly ingesting some files that were part of the intermediate state of the operation being committed.

This change will unblock Stack of PRs based off #4556
@alexeykudinkin alexeykudinkin changed the title [HUDI-3191][Stacked on 4716] Removing duplicating file-listing process w/in Hive's MOR FileInputFormats [HUDI-3191] Removing duplicating file-listing process w/in Hive's MOR FileInputFormats Feb 2, 2022
@yihua
Copy link
Contributor

yihua commented Feb 2, 2022

@alexeykudinkin is this PR ready for another look or you're still addressing comments

@alexeykudinkin
Copy link
Contributor Author

@yihua yeah, it's rebased on master now and ready for another pass

@alexeykudinkin
Copy link
Contributor Author

@hudi-bot run azure

latestCommitInstant.getTimestamp(),
false);
} else if (split instanceof BaseFileWithLogsSplit) {
BaseFileWithLogsSplit baseFileWithLogsSplit = unsafeCast(split);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does the maxCommitTime in baseFileSplit will be in sync with latestCommitInstant computed at L89. Prior to this patch, use the latestCommitInstant computed here, where as now, we just reuse the same thats comes from BaseFileWithLogsSplit.
Just wanted to confirm as these are new code to me.

Copy link
Contributor Author

@alexeykudinkin alexeykudinkin Feb 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's in sync. However, you brought up a very good point that the instant shouldn't actually be set here. This will be cleaned up in subsequent PRs where HoodieRealtimeFileSplit will be merged with BaseWithLogFilesSplit

public class HoodieRealtimeInputFormatUtils extends HoodieInputFormatUtils {

private static final Logger LOG = LogManager.getLogger(HoodieRealtimeInputFormatUtils.class);

public static InputSplit[] getRealtimeSplits(Configuration conf, Stream<FileSplit> fileSplits) {
public static InputSplit[] getRealtimeSplits(Configuration conf, List<FileSplit> fileSplits) throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this refactoring makes total sense assuming each FileSplit will correspond to one FileSlice. and there won't be a case where multiple FileSplits can store info about a single FileSlice.
thanks for doing this.

@hudi-bot
Copy link

hudi-bot commented Feb 3, 2022

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

Copy link
Contributor

@yihua yihua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

protected abstract boolean includeLogFilesForSnapshotView();

@Nonnull
private static RealtimeFileStatus createRealtimeFileStatusUnchecked(HoodieBaseFile baseFile,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed offline, one-line javadocs are going to be added for both methods for clarity in the following PR:

/**
* Creates real-time FileStatus for a base file with log files.
*/
@Nonnull
private static RealtimeFileStatus createRealtimeFileStatusUnchecked(HoodieBaseFile baseFile,
                                                                      Stream<HoodieLogFile> logFiles,
                                                                      Option<HoodieInstant> latestCompletedInstantOpt,
                                                                      HoodieTableMetaClient tableMetaClient)
/**
* Creates real-time FileStatus for the log files only.
*/
@Nonnull
private static RealtimeFileStatus createRealtimeFileStatusUnchecked(HoodieLogFile latestLogFile,
                                                                      Stream<HoodieLogFile> logFiles,
                                                                      Option<HoodieInstant> latestCompletedInstantOpt,
                                                                      HoodieTableMetaClient tableMetaClient)

PR Tracker Board automation moved this from Ready for Review to Nearing Landing Feb 3, 2022
@yihua yihua merged commit 69dfcda into apache:master Feb 3, 2022
PR Tracker Board automation moved this from Nearing Landing to Done Feb 3, 2022
nsivabalan pushed a commit to onehouseinc/hudi that referenced this pull request Feb 9, 2022
…es (apache#4716)

This change is addressing issues in regards to Metadata Table observing ingesting duplicated records leading to it persisting incorrect file-sizes for the files referred to in those records.

There are multiple issues that were leading to that:

- [HUDI-3322] Incorrect Rollback Plan generation: Rollback Plan generated for MOR tables was overly expansively listing all log-files with the latest base-instant as the ones that have been affected by the rollback, leading to invalid MT records being ingested referring to those.
- [HUDI-3343] Metadata Table including Uncommitted Log Files during Bootstrap: Since MT is bootstrapped at the end of the commit operation execution (after FS activity, but before committing to the timeline), it was actually incorrectly ingesting some files that were part of the intermediate state of the operation being committed.

This change will unblock Stack of PRs based off apache#4556
liusenhua pushed a commit to liusenhua/hudi that referenced this pull request Mar 1, 2022
…es (apache#4716)

This change is addressing issues in regards to Metadata Table observing ingesting duplicated records leading to it persisting incorrect file-sizes for the files referred to in those records.

There are multiple issues that were leading to that:

- [HUDI-3322] Incorrect Rollback Plan generation: Rollback Plan generated for MOR tables was overly expansively listing all log-files with the latest base-instant as the ones that have been affected by the rollback, leading to invalid MT records being ingested referring to those.
- [HUDI-3343] Metadata Table including Uncommitted Log Files during Bootstrap: Since MT is bootstrapped at the end of the commit operation execution (after FS activity, but before committing to the timeline), it was actually incorrectly ingesting some files that were part of the intermediate state of the operation being committed.

This change will unblock Stack of PRs based off apache#4556
liusenhua pushed a commit to liusenhua/hudi that referenced this pull request Mar 1, 2022
vingov pushed a commit to vingov/hudi that referenced this pull request Apr 3, 2022
…es (apache#4716)

This change is addressing issues in regards to Metadata Table observing ingesting duplicated records leading to it persisting incorrect file-sizes for the files referred to in those records.

There are multiple issues that were leading to that:

- [HUDI-3322] Incorrect Rollback Plan generation: Rollback Plan generated for MOR tables was overly expansively listing all log-files with the latest base-instant as the ones that have been affected by the rollback, leading to invalid MT records being ingested referring to those.
- [HUDI-3343] Metadata Table including Uncommitted Log Files during Bootstrap: Since MT is bootstrapped at the end of the commit operation execution (after FS activity, but before committing to the timeline), it was actually incorrectly ingesting some files that were part of the intermediate state of the operation being committed.

This change will unblock Stack of PRs based off apache#4556
vingov pushed a commit to vingov/hudi that referenced this pull request Apr 3, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Development

Successfully merging this pull request may close these issues.

None yet

4 participants