[HUDI-3191] Removing duplicating file-listing process w/in Hive's MOR `FileInputFormat`s #4556

alexeykudinkin · 2022-01-11T00:26:44Z

What is the purpose of the pull request

Removing the duplicated file-listing process w/in Hive's MOR input formats, `Hoodie{Parquet|HFile}RealtimeInputFormat`.

Brief change log

See above

Verify this pull request

This pull request is already covered by existing tests, such as (please describe tests).

Committer checklist

Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

yihua

I took one pass and need some clarification. @alexeykudinkin It would be great if you can put a quick summary of the actual logic changes in the query path and file listing in the PR description. That'll make it easier to review the nuances.

...client/hudi-spark-client/src/test/java/org/apache/hudi/table/TestHoodieMergeOnReadTable.java

yihua · 2022-01-24T23:25:44Z

...java/org/apache/hudi/table/functional/TestHoodieSparkMergeOnReadTableInsertUpdateDelete.java

+      List<String> inputPaths = tableView.getLatestBaseFiles()
+          .map(baseFile -> new Path(baseFile.getPath()).getParent().toString())
+          .collect(Collectors.toList());


.../src/test/java/org/apache/hudi/table/functional/TestHoodieSparkMergeOnReadTableRollback.java

hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieRealtimeInputFormatUtils.java

yihua · 2022-01-25T01:49:27Z

hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieRealtimeInputFormatUtils.java

        } else if (s instanceof RealtimeBootstrapBaseFileSplit) {
-          rtSplits.add(s);
+          RealtimeBootstrapBaseFileSplit bs = unsafeCast(s);


Is the type cast redundant here?

This is Java, sir

:-)

Yeah, my point is that you're not calling any methods specific to RealtimeBootstrapBaseFileSplit here. s is already an instance of RealtimeBootstrapBaseFileSplit, and RealtimeBootstrapBaseFileSplit is a subclass of InputSplit, so adding s directly with rtSplits.add(s); is not a problem.

I see now. Makes sense
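
The point about the redundant cast can be illustrated with a minimal sketch. The classes below are hypothetical stand-ins (InputSplitSketch, RealtimeBootstrapSplitSketch), not the real Hadoop or Hudi types; they only show that an instance of a subclass can be added to a list of the supertype without any cast.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Hadoop's InputSplit.
class InputSplitSketch {}

// Hypothetical stand-in for RealtimeBootstrapBaseFileSplit.
class RealtimeBootstrapSplitSketch extends InputSplitSketch {}

class UpcastSketch {
  // Collects the realtime bootstrap splits. The instanceof check alone is
  // enough: a subclass instance is already assignable to its supertype.
  static List<InputSplitSketch> collect(List<InputSplitSketch> splits) {
    List<InputSplitSketch> rtSplits = new ArrayList<>();
    for (InputSplitSketch s : splits) {
      if (s instanceof RealtimeBootstrapSplitSketch) {
        rtSplits.add(s); // no cast needed here
      }
    }
    return rtSplits;
  }

  public static void main(String[] args) {
    List<InputSplitSketch> splits = new ArrayList<>();
    splits.add(new RealtimeBootstrapSplitSketch());
    splits.add(new InputSplitSketch());
    System.out.println(collect(splits).size());
  }
}
```

A cast only becomes necessary once subclass-specific methods are called on the element.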

yihua · 2022-01-25T02:48:26Z

hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieFileInputFormatBase.java

+                if (includeLogFilesForSnapshotView()) {
+                  if (baseFileOpt.isPresent()) {
+                    return createRealtimeFileStatusUnchecked(baseFileOpt.get(), logFiles, latestCompletedInstantOpt, tableMetaClient);
+                  } else if (latestLogFileOpt.isPresent()) {
+                    return createRealtimeFileStatusUnchecked(latestLogFileOpt.get(), logFiles, latestCompletedInstantOpt, tableMetaClient);
+                  } else {
+                    throw new IllegalStateException("Invalid state: either base-file or log-file has to be present");
+                  }
                } else {
-                  throw new IllegalStateException("Invalid state: either base-file or log-file should be present");
+                  if (baseFileOpt.isPresent()) {
+                    return getFileStatusUnchecked(baseFileOpt.get());
+                  } else {
+                    throw new IllegalStateException("Invalid state: base-file has to be present");
+                  }
                }


It looks like the logic is changed here to cover broader cases, especially when both a base file and log files exist. Previously, if a base file existed, the log files were ignored. However, for Snapshot mode, should log files be ignored by default? Or do you intend to use this method, listStatusForSnapshotMode(), broadly for different query modes (renaming it if so)?

To clarify: a Snapshot query on a MOR table needs the base file merged with the log files, while a Snapshot query on a COW table needs only the base file. I think that is the intention of your changes here.

listStatusForSnapshotMode() is used for both COW and MOR, and which of the two we're handling is controlled by includeLogFilesForSnapshotView (I will clean this up in a follow-up, along with some assertions, after the majority of the PRs land)
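
The branching in the diff above can be sketched in isolation. This is a simplified illustration, not the Hudi implementation: the class name SnapshotStatusSketch, the select method, and the string labels are hypothetical, and java.util.Optional stands in for Hudi's Option type.

```java
import java.util.Optional;

class SnapshotStatusSketch {
  // Returns a label for the kind of file status that would be produced.
  // includeLogFiles mirrors includeLogFilesForSnapshotView(): true for MOR
  // snapshot reads, false for COW reads.
  static String select(boolean includeLogFiles,
                       Optional<String> baseFile,
                       Optional<String> latestLogFile) {
    if (includeLogFiles) {
      if (baseFile.isPresent()) {
        // Base file plus its log files (log files are no longer ignored).
        return "realtime(base=" + baseFile.get() + ")";
      } else if (latestLogFile.isPresent()) {
        // File slice that consists of log files only.
        return "realtime(log=" + latestLogFile.get() + ")";
      } else {
        throw new IllegalStateException("Invalid state: either base-file or log-file has to be present");
      }
    } else {
      if (baseFile.isPresent()) {
        return "plain(" + baseFile.get() + ")";
      } else {
        throw new IllegalStateException("Invalid state: base-file has to be present");
      }
    }
  }

  public static void main(String[] args) {
    System.out.println(select(true, Optional.of("f1.parquet"), Optional.of("f1.log.1")));
    System.out.println(select(true, Optional.empty(), Optional.of("f2.log.1")));
    System.out.println(select(false, Optional.of("f3.parquet"), Optional.empty()));
  }
}
```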

alexeykudinkin · 2022-01-25T04:35:08Z

@yihua good callout. The idea is to remove the duplicated file-listing operations w/in the HoodieInputFormatUtils.getRealtimeSplits methods by leveraging the file structure already fetched by listFileStatus w/in the InputFormat impls.

alexeykudinkin · 2022-01-29T06:12:18Z

@hudi-bot run azure

alexeykudinkin · 2022-02-01T16:46:38Z

@hudi-bot run azure

alexeykudinkin · 2022-02-01T18:40:30Z

@hudi-bot run azure

alexeykudinkin · 2022-02-01T23:24:07Z

@hudi-bot run azure

nsivabalan

I have some clarifications apart from the comments already made. Will reach out.

...client/hudi-spark-client/src/test/java/org/apache/hudi/table/TestHoodieMergeOnReadTable.java

hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieFileInputFormatBase.java

nsivabalan · 2022-02-02T04:43:18Z

hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieFileInputFormatBase.java

+                                                                      Stream<HoodieLogFile> logFiles,
+                                                                      Option<HoodieInstant> latestCompletedInstantOpt,
+                                                                      HoodieTableMetaClient tableMetaClient) {
+    List<HoodieLogFile> sortedLogFiles = logFiles.sorted(HoodieLogFile.getLogFileComparator()).collect(Collectors.toList());


Did we move this method from elsewhere, or did you add it as part of this patch?

This is a new method
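
The sorting step in the snippet above follows a common pattern: sort a stream with a comparator and collect it into a list. A minimal sketch, assuming a hypothetical LogFileSketch type ordered by version only (the actual ordering used by HoodieLogFile.getLogFileComparator() is not shown in this thread):

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical stand-in for HoodieLogFile.
class LogFileSketch {
  final String name;
  final int version;

  LogFileSketch(String name, int version) {
    this.name = name;
    this.version = version;
  }
}

class SortLogFilesSketch {
  // Assumed ordering for illustration: ascending log version.
  static final Comparator<LogFileSketch> BY_VERSION =
      Comparator.comparingInt(f -> f.version);

  // Sorts the stream and materializes it, returning the file names in order.
  static List<String> sortedNames(Stream<LogFileSketch> logFiles) {
    return logFiles.sorted(BY_VERSION)
        .map(f -> f.name)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    Stream<LogFileSketch> logs = Stream.of(
        new LogFileSketch("f1.log.3", 3),
        new LogFileSketch("f1.log.1", 1),
        new LogFileSketch("f1.log.2", 2));
    System.out.println(sortedNames(logs));
  }
}
```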

…es (#4716)

This change is addressing issues in regards to Metadata Table observing ingesting duplicated records leading to it persisting incorrect file-sizes for the files referred to in those records. There are multiple issues that were leading to that:

- [HUDI-3322] Incorrect Rollback Plan generation: Rollback Plan generated for MOR tables was overly expansively listing all log-files with the latest base-instant as the ones that have been affected by the rollback, leading to invalid MT records being ingested referring to those.
- [HUDI-3343] Metadata Table including Uncommitted Log Files during Bootstrap: Since MT is bootstrapped at the end of the commit operation execution (after FS activity, but before committing to the timeline), it was actually incorrectly ingesting some files that were part of the intermediate state of the operation being committed.

This change will unblock Stack of PRs based off #4556

… dereferencing)

…plit`; Added assertions

Made sure new fields could be properly deser'd

yihua · 2022-02-02T23:08:52Z

@alexeykudinkin is this PR ready for another look, or are you still addressing comments?

alexeykudinkin · 2022-02-02T23:25:41Z

@yihua yeah, it's rebased on master now and ready for another pass

alexeykudinkin · 2022-02-03T01:35:55Z

@hudi-bot run azure

nsivabalan · 2022-02-03T01:52:41Z

hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieRealtimeInputFormatUtils.java

+              latestCommitInstant.getTimestamp(),
+              false);
+        } else if (split instanceof BaseFileWithLogsSplit) {
+          BaseFileWithLogsSplit baseFileWithLogsSplit = unsafeCast(split);


Will the maxCommitTime in baseFileSplit be in sync with the latestCommitInstant computed at L89? Prior to this patch, we used the latestCommitInstant computed here, whereas now we just reuse the one that comes from BaseFileWithLogsSplit.
Just wanted to confirm, as this code is new to me.

Yes, it's in sync. However, you brought up a very good point that the instant shouldn't actually be set here. This will be cleaned up in subsequent PRs where HoodieRealtimeFileSplit will be merged with BaseFileWithLogsSplit

nsivabalan · 2022-02-03T01:55:13Z

hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieRealtimeInputFormatUtils.java

 public class HoodieRealtimeInputFormatUtils extends HoodieInputFormatUtils {

  private static final Logger LOG = LogManager.getLogger(HoodieRealtimeInputFormatUtils.class);

-  public static InputSplit[] getRealtimeSplits(Configuration conf, Stream<FileSplit> fileSplits) {
+  public static InputSplit[] getRealtimeSplits(Configuration conf, List<FileSplit> fileSplits) throws IOException {


This refactoring makes total sense, assuming each FileSplit corresponds to exactly one FileSlice and there is no case where multiple FileSplits store info about a single FileSlice.
Thanks for doing this.
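
The one-to-one assumption stated above can be sketched as follows. The names here (SplitMappingSketch, toRealtimeSplits, the "rt:" prefix) are hypothetical and use plain strings instead of Hadoop's FileSplit; the sketch only shows that, with a List as input, each incoming split yields exactly one output split and the result array can be sized up front.

```java
import java.util.Arrays;
import java.util.List;

class SplitMappingSketch {
  // Maps each input split to exactly one "realtime" split, preserving order.
  // No listing or grouping step is involved: one element in, one element out.
  static String[] toRealtimeSplits(List<String> fileSplits) {
    String[] rtSplits = new String[fileSplits.size()];
    for (int i = 0; i < fileSplits.size(); i++) {
      rtSplits[i] = "rt:" + fileSplits.get(i);
    }
    return rtSplits;
  }

  public static void main(String[] args) {
    List<String> splits = Arrays.asList("slice-a", "slice-b");
    System.out.println(Arrays.toString(toRealtimeSplits(splits)));
  }
}
```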

hudi-bot · 2022-02-03T02:34:37Z

CI report:

77d1113 UNKNOWN
3d9c2ae UNKNOWN
31b0669 UNKNOWN
28a5a4f UNKNOWN
c09e228 UNKNOWN
5b8f581 UNKNOWN
5d37935 UNKNOWN
7f79374 Azure: FAILURE Azure: SUCCESS

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure re-run the last Azure build

yihua

LGTM

...client/hudi-spark-client/src/test/java/org/apache/hudi/table/TestHoodieMergeOnReadTable.java

.../src/test/java/org/apache/hudi/table/functional/TestHoodieSparkMergeOnReadTableRollback.java

hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieFileInputFormatBase.java

yihua · 2022-02-03T21:57:14Z

hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieFileInputFormatBase.java

+  protected abstract boolean includeLogFilesForSnapshotView();
+
+  @Nonnull
+  private static RealtimeFileStatus createRealtimeFileStatusUnchecked(HoodieBaseFile baseFile,


As discussed offline, one-line javadocs are going to be added for both methods for clarity in the following PR:

/**
 * Creates real-time FileStatus for a base file with log files.
 */
@Nonnull
private static RealtimeFileStatus createRealtimeFileStatusUnchecked(HoodieBaseFile baseFile,
                                                                    Stream<HoodieLogFile> logFiles,
                                                                    Option<HoodieInstant> latestCompletedInstantOpt,
                                                                    HoodieTableMetaClient tableMetaClient)

/**
 * Creates real-time FileStatus for the log files only.
 */
@Nonnull
private static RealtimeFileStatus createRealtimeFileStatusUnchecked(HoodieLogFile latestLogFile,
                                                                    Stream<HoodieLogFile> logFiles,
                                                                    Option<HoodieInstant> latestCompletedInstantOpt,
                                                                    HoodieTableMetaClient tableMetaClient)

hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieRealtimeFileSplit.java

hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieRealtimeInputFormatUtils.java

alexeykudinkin force-pushed the ak/rpath-ref-4 branch 2 times, most recently from 31b0669 to 096e26a Compare January 12, 2022 01:52

vinothchandar added this to Ready for Review in PR Tracker Board Jan 13, 2022

alexeykudinkin force-pushed the ak/rpath-ref-4 branch 2 times, most recently from 97dd5ec to 0538475 Compare January 14, 2022 18:56

alexeykudinkin mentioned this pull request Jan 14, 2022

[HUDI-3010] Unbundle parquet-avro and shade hbase in presto-bundle #4551

Merged

5 tasks

alexeykudinkin force-pushed the ak/rpath-ref-4 branch 3 times, most recently from 368c099 to 28a5a4f Compare January 19, 2022 21:35

alexeykudinkin changed the title ~~[HUDI-3191][Stacked on 4531] Removing duplicating file-listing process w/in Hive's MOR FIleInputFormats~~ [HUDI-3191] Removing duplicating file-listing process w/in Hive's MOR FIleInputFormats Jan 19, 2022

alexeykudinkin changed the title ~~[HUDI-3191] Removing duplicating file-listing process w/in Hive's MOR FIleInputFormats~~ [HUDI-3191] Removing duplicating file-listing process w/in Hive's MOR FileInputFormats Jan 21, 2022

yihua self-assigned this Jan 21, 2022

alexeykudinkin force-pushed the ak/rpath-ref-4 branch from 906607b to dc1b033 Compare January 25, 2022 00:06

yihua reviewed Jan 25, 2022

alexeykudinkin mentioned this pull request Jan 29, 2022

[HUDI-3322][HUDI-3343] Fixing Metadata Table Records Duplication Issues #4716

Merged

5 tasks

alexeykudinkin force-pushed the ak/rpath-ref-4 branch 2 times, most recently from 5b8f581 to 13702fc Compare January 29, 2022 01:37

alexeykudinkin changed the title ~~[HUDI-3191] Removing duplicating file-listing process w/in Hive's MOR FileInputFormats~~ [HUDI-3191][Stacked on 4716] Removing duplicating file-listing process w/in Hive's MOR FileInputFormats Jan 29, 2022

alexeykudinkin force-pushed the ak/rpath-ref-4 branch from 9bdd8a6 to aebb9df Compare January 31, 2022 22:25

nsivabalan reviewed Feb 2, 2022

alexeykudinkin changed the title ~~[HUDI-3191][Stacked on 4716] Removing duplicating file-listing process w/in Hive's MOR FileInputFormats~~ [HUDI-3191] Removing duplicating file-listing process w/in Hive's MOR FileInputFormats Feb 2, 2022

Alexey Kudinkin added 2 commits February 2, 2022 13:14

Tidying up

7cd90b3

Leverage appropriate RealtimeFileStatus objects for MOR tables

6725975

Alexey Kudinkin added 13 commits February 2, 2022 13:14

Fixed incorrect paths being passed to Hive's FileInputFormat impl

edadcb8

Fixing more RDD de-referencing

5943bf8

Typo

e880aa0

Fixed incorrectly data files being written multiple times (due to RDD…

81c97ab

… dereferencing)

Added flag belongsToIncrementalQuery to `RealtimeBootstrapBaseFileS…

8f895b7

…plit`; Added assertions

Fixed missing handler for BootstrapBaseFileSplit

6b21ebb

lint

2a7536e

Tidying up

cc8ad23

Fixing tests

3a6e814

Fixing compilation

a9d42e2

After rebase fixes

43d49b4

Added improperly removed default ctors;

7dfb2f5

Made sure new fields could be properly deser'd

Amended java-docs

34b94b9

alexeykudinkin force-pushed the ak/rpath-ref-4 branch from f911d86 to 34b94b9 Compare February 2, 2022 21:14

Tidying up

7f79374

nsivabalan reviewed Feb 3, 2022

yihua approved these changes Feb 3, 2022

PR Tracker Board automation moved this from Ready for Review to Nearing Landing Feb 3, 2022

yihua merged commit 69dfcda into apache:master Feb 3, 2022

PR Tracker Board automation moved this from Nearing Landing to Done Feb 3, 2022

yihua mentioned this pull request Feb 5, 2022

[HUDI-3206] Unify Hive's MOR implementations to avoid duplication #4559

Merged

5 tasks

liusenhua pushed a commit to liusenhua/hudi that referenced this pull request Mar 1, 2022

[HUDI-3191] Removing duplicating file-listing process w/in Hive's MOR…

43540df

… `FileInputFormat`s (apache#4556)

vingov pushed a commit to vingov/hudi that referenced this pull request Apr 3, 2022

[HUDI-3191] Removing duplicating file-listing process w/in Hive's MOR…

bfac6c0

… `FileInputFormat`s (apache#4556)
