
[HUDI-1089] Refactor hudi-client to support multi-engine #1827

Merged: 6 commits merged into apache:master on Oct 1, 2020

Conversation

@wangxianghu (Contributor) commented Jul 14, 2020:


What is the purpose of the pull request

Refactor hudi-client to support multi-engine

Brief change log

  • Refactor hudi-client to support multi-engine

Verify this pull request

This pull request is already covered by existing tests.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@wangxianghu (Contributor, Author) commented Jul 14, 2020:

Hi @vinothchandar @yanghua @leesf, as the refactor is finished, I have filed a JIRA ticket to track this work.
Please review the refactor work on this PR :)

@leesf (Contributor) commented Jul 15, 2020:

> Hi @vinothchandar @yanghua @leesf, as the refactor is finished, I have filed a JIRA ticket to track this work.
> Please review the refactor work on this PR :)

Ack. @Mathieu1124, please check the Travis failure.

@vinothchandar (Member) commented:

@leesf @Mathieu1124 @lw309637554 so this replaces #1727, right?

@vinothchandar (Member) commented:

It would be good to get @n3nash's review here as well, to make sure we are not breaking anything for the RDD client users.

@wangxianghu (Contributor, Author) commented:

> @leesf @Mathieu1124 @lw309637554 so this replaces #1727, right?

Yes, #1727 can be closed now.

@wangxianghu (Contributor, Author) commented:

> > Hi @vinothchandar @yanghua @leesf, as the refactor is finished, I have filed a JIRA ticket to track this work.
> > Please review the refactor work on this PR :)
>
> Ack. @Mathieu1124, please check the Travis failure.

Copy that. I have resolved the CI failure and the conflicts with master; I will push after work.

@wangxianghu wangxianghu force-pushed the HUDI-1089 branch 5 times, most recently from 9d544ec to 13795f5 Compare July 15, 2020 16:27
@wangxianghu (Contributor, Author) commented:

Hi @vinothchandar @yanghua @leesf @n3nash, CI is green; this PR is ready for review now :)

@yanghua (Contributor) left a review comment:

@Mathieu1124 I reviewed some changes. Since this is a huge PR, can we revert the method-alignment (formatting-only) changes in this PR so that we reduce the number of change points? That would make the review easier.

-  private String getRenamesToBePrinted(List<RenameOpResult> res, Integer limit, String sortByField, boolean descending,
-      boolean headerOnly, String operation) {
+  private String getRenamesToBePrinted(List<BaseCompactionAdminClient.RenameOpResult> res, Integer limit, String sortByField, boolean descending,
+      boolean headerOnly, String operation) {
@yanghua (Contributor) commented:

The new style or the old style, which one is right?

@wangxianghu (Contributor, Author) replied:

Hi @yanghua, thanks for your review. I am not sure which one is right either; I will roll back these style changes to keep them the same as before.

@@ -40,7 +39,7 @@
  * HoodieParquetWriter extends the ParquetWriter to help limit the size of underlying file. Provides a way to check if
  * the current file can take more records with the <code>canWrite()</code>
  */
-public class HoodieParquetWriter<T extends HoodieRecordPayload, R extends IndexedRecord>
+public class HoodieParquetWriter<R extends IndexedRecord>
@yanghua (Contributor) commented:

Why do we need to change this class?

@wangxianghu (Contributor, Author) replied:

> Why do we need to change this class?

The generic parameter `T` is unused in this class, and it causes generics problems in the abstraction, so I removed it.
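
To illustrate the point, a minimal sketch (the simplified writer classes and the `Demo` usage below are invented for this example, not Hudi code): an unused type parameter forces every caller and subclass to bind a type the class never touches.

  // Before: T is declared but never referenced in the body, so every
  // caller and subclass must still pick some T.
  class ParquetWriterBefore<T, R> {
    void write(R record) { /* write the record */ }
  }

  // After: dropping the unused parameter leaves only the type that matters.
  class ParquetWriterAfter<R> {
    void write(R record) { /* write the record */ }
  }

  class Demo {
    void usage() {
      // Before: an arbitrary T (here String) must be supplied even though it is unused.
      ParquetWriterBefore<String, Object> before = new ParquetWriterBefore<>();
      // After: only the record type is needed.
      ParquetWriterAfter<Object> after = new ParquetWriterAfter<>();
    }
  }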

@vinothchandar (Member) commented:

@Mathieu1124 @leesf can you please share any tests you may have done in your own environment to ensure existing functionality is intact? This is a major signal we may not completely get from a PR review alone.

@wangxianghu (Contributor, Author) commented Jul 17, 2020:

> @Mathieu1124 @leesf can you please share any tests you may have done in your own environment to ensure existing functionality is intact? This is a major signal we may not completely get from a PR review alone.

@vinothchandar, my testing is limited: just all the unit tests in the source code, plus all the demos in the Quick-Start Guide. I am planning to test it in a Docker environment.

@vinothchandar vinothchandar force-pushed the HUDI-1089 branch 2 times, most recently from 56690a5 to c7b1cb1 Compare October 1, 2020 06:42
* Making HoodieSnapshotCopier/HoodieSnapshotExporter all use HoodieContext
* More replacements of jsc.parallelize across hudi-spark-client (see the engine-context sketch after this list)
* More replacements of jsc.setJobGroup across hudi-spark-client
* Removing usages of HoodieIndex#fetchRecordLocation everywhere
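
As a rough illustration of the jsc.parallelize replacement pattern mentioned above (a minimal sketch: the `EngineContext` interface and `LocalEngineContext` class are invented for the example, and the real HoodieEngineContext signatures may differ):

  import java.util.List;
  import java.util.function.Function;
  import java.util.stream.Collectors;

  // Hypothetical engine-agnostic context: callers pass a plain list and a function,
  // and each engine decides how to run the work in parallel.
  interface EngineContext {
    <I, O> List<O> map(List<I> data, Function<I, O> func, int parallelism);
  }

  // A local, Spark-free implementation used only to keep this sketch self-contained.
  class LocalEngineContext implements EngineContext {
    @Override
    public <I, O> List<O> map(List<I> data, Function<I, O> func, int parallelism) {
      return data.stream().map(func).collect(Collectors.toList());
    }
  }

  // Before (Spark-specific): jsc.parallelize(paths, parallelism).map(this::process).collect()
  // After (engine-agnostic):  engineContext.map(paths, this::process, parallelism)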
@codecov-commenter commented:
Codecov Report

Merging #1827 into master will decrease coverage by 3.75%.
The diff coverage is 30.00%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master    #1827      +/-   ##
============================================
- Coverage     59.89%   56.14%   -3.76%     
+ Complexity     4454     2658    -1796     
============================================
  Files           558      324     -234     
  Lines         23378    14775    -8603     
  Branches       2348     1539     -809     
============================================
- Hits          14003     8295    -5708     
+ Misses         8355     5783    -2572     
+ Partials       1020      697     -323     
Flag                   Coverage Δ                  Complexity Δ
#hudicli               38.37% <30.00%> (-27.83%)   193.00 <0.00> (-1615.00)
#hudiclient            100.00% <ø> (+25.46%) ⬆️    0.00 <ø> (-1615.00)
#hudicommon            54.74% <ø> (ø)              1793.00 <ø> (ø)
#hudihadoopmr          ?                           ?
#hudispark             67.18% <ø> (-0.02%)         311.00 <ø> (ø)
#huditimelineservice   64.43% <ø> (ø)              49.00 <ø> (ø)
#hudiutilities         69.43% <ø> (+0.05%)         312.00 <ø> (ø)

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ Complexity Δ
...rg/apache/hudi/cli/commands/SavepointsCommand.java 14.28% <0.00%> (ø) 3.00 <0.00> (ø)
...main/java/org/apache/hudi/cli/utils/SparkUtil.java 0.00% <0.00%> (ø) 0.00 <0.00> (ø)
...n/java/org/apache/hudi/cli/commands/SparkMain.java 6.43% <37.50%> (+0.40%) 4.00 <0.00> (ø)
...src/main/java/org/apache/hudi/DataSourceUtils.java 45.36% <0.00%> (ø) 21.00% <0.00%> (ø%)
...in/scala/org/apache/hudi/HoodieStreamingSink.scala 24.00% <0.00%> (ø) 10.00% <0.00%> (ø%)
...n/scala/org/apache/hudi/HoodieSparkSqlWriter.scala 56.20% <0.00%> (ø) 0.00% <0.00%> (ø%)
...in/java/org/apache/hudi/utilities/UtilHelpers.java 64.59% <0.00%> (ø) 30.00% <0.00%> (ø%)
...apache/hudi/utilities/deltastreamer/DeltaSync.java 68.16% <0.00%> (ø) 39.00% <0.00%> (ø%)
.../hudi/async/SparkStreamingAsyncCompactService.java 0.00% <0.00%> (ø) 0.00% <0.00%> (ø%)
.../hudi/internal/HoodieDataSourceInternalWriter.java 87.50% <0.00%> (ø) 8.00% <0.00%> (ø%)
... and 46 more

@wangxianghu (Contributor, Author) commented:

> @wangxianghu @yanghua I have rebased this against master. Please take a look at my changes.
>
> High level, we could re-use more code, but it needs an abstraction that can wrap RDD or DataSet or DataStream adequately and support basic operations like .map(), reduceByKey() etc. We can do this in a second pass once we have a working Flink impl. For now this will do.
>
> I am trying to get the tests to pass. If they do, we could go ahead and merge.

Thanks @vinothchandar, this is really great work!
Yes, we can add more abstractions for the basic map and reduceByKey methods in HoodieEngineContext, or in some util classes, next.
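
As a rough sketch of the kind of wrapper discussed above (all names here — `EngineData`, `EnginePairData`, and the function interfaces — are invented for illustration, not the API in this PR): a Spark implementation would wrap a JavaRDD, a Flink one a DataStream, and each method would delegate to the engine's native operator.

  import java.io.Serializable;
  import java.util.function.Function;

  // Engine-agnostic handle over a distributed collection (RDD / DataSet / DataStream).
  interface EngineData<T> {
    <O> EngineData<O> map(SerializableFunction<T, O> func);
    <K> EnginePairData<K, T> keyBy(SerializableFunction<T, K> keyExtractor);
  }

  // Keyed view supporting Spark-style aggregations.
  interface EnginePairData<K, V> {
    // merges all values sharing a key, mirroring Spark's reduceByKey
    EnginePairData<K, V> reduceByKey(SerializableBinaryOperator<V> combiner);
  }

  // Functions must be Serializable so engines can ship them to executors.
  interface SerializableFunction<I, O> extends Function<I, O>, Serializable {}

  interface SerializableBinaryOperator<V> extends Serializable {
    V apply(V left, V right);
  }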

@vinothchandar (Member) commented:

I actually figured out that we can remove `P` altogether, since HoodieIndex#fetchRecordLocation is not used much outside of internal APIs. So I will push a final change for that. Tests are passing now.

* Not used by any other major API
* Removing `P` from the templatized list of parameters
@vinothchandar (Member) commented:

@wangxianghu Please help test this out if possible. Once the tests pass again, I am planning to merge this in the morning, PST.

cc @yanghua

@leesf (Contributor) commented Oct 1, 2020:

  1. Ran the quickstart demo and found this warn log: 20/10/01 21:11:18 WARN embedded.EmbeddedTimelineService: Unable to find driver bind address from spark config. The demo works fine, but this warn log is not present in 0.6.0. @vinothchandar @wangxianghu
  2. Ran my own unit tests; they work fine.

@vinothchandar (Member) commented:

@leesf, do you see the following exception? I could not understand how you would even get the other one.

  LOG.info("Starting Timeline service !!");
  Option<String> hostAddr = context.getProperty(EngineProperty.EMBEDDED_SERVER_HOST);
  if (!hostAddr.isPresent()) {
    throw new HoodieException("Unable to find host address to bind timeline server to.");
  }
  timelineServer = Option.of(new EmbeddedTimelineService(context, hostAddr.get(),
      config.getClientSpecifiedViewStorageConfig()));

Either way, good pointer. The behavior around this has actually changed a bit, so I will try to tweak it and push a fix.

@wangxianghu (Contributor, Author) commented:

@vinothchandar @yanghua @leesf The demo runs well locally, except for the warning WARN embedded.EmbeddedTimelineService: Unable to find driver bind address from spark config.

@vinothchandar (Member) commented:

@wangxianghu, can you please test the latest commit? To be clear, you are saying you don't get the warning on master, but you do get it on this branch, right?

If this round of tests passes and you confirm, we can land this from my perspective.

@wangxianghu (Contributor, Author) commented Oct 1, 2020:

> @wangxianghu, can you please test the latest commit? To be clear, you are saying you don't get the warning on master, but you do get it on this branch, right?
>
> If this round of tests passes and you confirm, we can land this from my perspective.

Hi @vinothchandar, the warn log is still there in the HUDI-1089 branch (master is OK, no warn log).
I think we should check embeddedTimelineServiceHostAddr instead of hostAddr:

  private void setHostAddr(String embeddedTimelineServiceHostAddr) {
    // here we should check embeddedTimelineServiceHostAddr instead of hostAddr
    if (hostAddr != null) {
      LOG.info("Overriding hostIp to (" + embeddedTimelineServiceHostAddr + ") found in spark-conf. It was " + this.hostAddr);
      this.hostAddr = embeddedTimelineServiceHostAddr;
    } else {
      LOG.warn("Unable to find driver bind address from spark config");
      this.hostAddr = NetworkUtils.getHostname();
    }
  }

@wangxianghu (Contributor, Author) commented:

> Hi @vinothchandar, the warn log is still there in the HUDI-1089 branch (master is OK, no warn log). I think we should check embeddedTimelineServiceHostAddr instead of hostAddr.

I have tested the latest commit with the check condition changed to

  if (embeddedTimelineServiceHostAddr != null) {

It runs well locally, and the warn log disappeared.
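
For reference, the corrected method per the discussion above (a minimal sketch restating the fix; LOG, hostAddr, and NetworkUtils are the surrounding members of the existing EmbeddedTimelineService class):

  private void setHostAddr(String embeddedTimelineServiceHostAddr) {
    // check the incoming parameter rather than the still-null hostAddr field
    if (embeddedTimelineServiceHostAddr != null) {
      LOG.info("Overriding hostIp to (" + embeddedTimelineServiceHostAddr + ") found in spark-conf. It was " + this.hostAddr);
      this.hostAddr = embeddedTimelineServiceHostAddr;
    } else {
      LOG.warn("Unable to find driver bind address from spark config");
      this.hostAddr = NetworkUtils.getHostname();
    }
  }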

@wangxianghu (Contributor, Author) commented:

@vinothchandar The warn log issue is fixed.

@vinothchandar (Member) commented:

@wangxianghu, duh, of course. I understand now. Thanks for jumping in!

@vinothchandar vinothchandar merged commit 1f7add9 into apache:master Oct 1, 2020
@vinothchandar (Member) commented:

@wangxianghu Just merged! Thanks again for the herculean effort.

Maybe some follow-ups will pop up. Would you be interested in taking them up? If so, I'll mention you along the way.

@wangxianghu (Contributor, Author) commented:

> @wangxianghu Just merged! Thanks again for the herculean effort.
>
> Maybe some follow-ups will pop up. Would you be interested in taking them up? If so, I'll mention you along the way.

Sure, just ping me when needed.

prashantwason pushed a commit to prashantwason/incubator-hudi that referenced this pull request Feb 22, 2021
- This change breaks `hudi-client` into `hudi-client-common` and `hudi-spark-client` modules
- Simple usages of Spark via jsc.parallelize() have been redone using EngineContext#map, EngineContext#flatMap, etc.
- Code changes in the PR break classes into `BaseXYZ` parent classes, with no Spark dependencies, living in `hudi-client-common`
- Classes in `hudi-spark-client` are named `SparkXYZ`, extending the parent classes with all the Spark dependencies (see the sketch after this commit message)
- To simplify/clean up, HoodieIndex#fetchRecordLocation has been removed and its usages in tests replaced with alternatives

Co-authored-by: Vinoth Chandar <vinoth@apache.org>
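
A rough illustration of the Base/Spark split described in that commit message (the class and method names below are invented for the sketch; only the pattern — an engine-neutral parent in `hudi-client-common`, a Spark subclass in `hudi-spark-client` — comes from the PR):

  import java.util.List;

  // hudi-client-common: engine-agnostic parent, no Spark imports.
  abstract class BaseExampleExecutor<I, O> {
    public abstract O execute(I input);
  }

  // hudi-spark-client: Spark-specific subclass; in the real module this would
  // distribute work via the JavaSparkContext instead of running serially.
  class SparkExampleExecutor extends BaseExampleExecutor<List<String>, List<String>> {
    @Override
    public List<String> execute(List<String> input) {
      return input; // placeholder: real code would parallelize and collect
    }
  }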
Comment on lines +154 to +157
protected abstract Iterator<List<WriteStatus>> handleInsert(String idPfx,
Iterator<HoodieRecord<T>> recordItr) throws Exception;

protected abstract Iterator<List<WriteStatus>> handleUpdate(String partitionPath, String fileId,
A contributor commented:

May I ask whether the difference between the parameters idPfx and fileId in the methods handleInsert and handleUpdate is by design?

I can understand that for the update operation we need to find the previous version of the base file, so we use the file id to locate it.

But in the method BaseSparkCommitActionExecutor.handleUpsertPartition, both handleInsert and handleUpdate get their parameter from binfo.fileIdPrefix. It's really confusing.

  protected Iterator<List<WriteStatus>> handleUpsertPartition(String instantTime, Integer partition, Iterator recordItr,
                                                              Partitioner partitioner) {
    SparkHoodiePartitioner upsertPartitioner = (SparkHoodiePartitioner) partitioner;
    BucketInfo binfo = upsertPartitioner.getBucketInfo(partition);
    BucketType btype = binfo.bucketType;
    try {
      if (btype.equals(BucketType.INSERT)) {
        return handleInsert(binfo.fileIdPrefix, recordItr);
      } else if (btype.equals(BucketType.UPDATE)) {
        return handleUpdate(binfo.partitionPath, binfo.fileIdPrefix, recordItr);
      } else {
        throw new HoodieUpsertException("Unknown bucketType " + btype + " for partition :" + partition);
      }
    } catch (Throwable t) {
      String msg = "Error upserting bucketType " + btype + " for partition :" + partition;
      LOG.error(msg, t);
      throw new HoodieUpsertException(msg, t);
    }
  }
