[HUDI-1089] Refactor hudi-client to support multi-engine #1827
Conversation
Hi, @vinothchandar @yanghua @leesf, as the refactor is finished, I have filed a Jira ticket to track this work.
ack. @Mathieu1124 pls check travis failure.
@leesf @Mathieu1124 @lw309637554 so this replaces #1727 right?
Good to get @n3nash's review here as well to make sure we are not breaking anything for the RDD client users.
yes, #1727 can be closed now
copy, have resolved the ci failure and conflicts with master, will push it after work.
Force-pushed from 9d544ec to 13795f5
Hi, @vinothchandar @yanghua @leesf @n3nash, ci is green, this pr is ready for review now :)
@Mathieu1124 I reviewed some changes. Since this is a huge PR, can we revert the method-alignment changes in this PR so that we reduce the number of change points? Then the review work will be easier.
-  private String getRenamesToBePrinted(List<RenameOpResult> res, Integer limit, String sortByField, boolean descending,
-      boolean headerOnly, String operation) {
+  private String getRenamesToBePrinted(List<BaseCompactionAdminClient.RenameOpResult> res, Integer limit, String sortByField, boolean descending,
+      boolean headerOnly, String operation) {
The new style or the old style, which one is right?
Hi, @yanghua, thanks for your review. I am not sure which one is right either; I will roll back these style changes to keep them the same as before.
@@ -40,7 +39,7 @@
  * HoodieParquetWriter extends the ParquetWriter to help limit the size of underlying file. Provides a way to check if
  * the current file can take more records with the <code>canWrite()</code>
  */
-public class HoodieParquetWriter<T extends HoodieRecordPayload, R extends IndexedRecord>
+public class HoodieParquetWriter<R extends IndexedRecord>
Why do we need to change this class?
> Why do we need to change this class?

The generic parameter `T` is unused in this class, and it causes some generics problems in the abstraction, so I removed it.
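To illustrate the generics cleanup being discussed, here is a hedged, minimal sketch using simplified stand-in types (`IndexedRecord` and `ParquetWriterSketch` below are illustrative placeholders, not the actual Hudi classes):

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for the record interface the writer actually uses.
interface IndexedRecord {
  Object get(int field);
}

// Before the change, the class carried an extra payload parameter T
// (class HoodieParquetWriter<T extends HoodieRecordPayload, R extends IndexedRecord>),
// yet nothing in the body referenced T, so every caller and subclass still had
// to bind it. After dropping it, only the record type R that the class
// actually uses remains:
class ParquetWriterSketch<R extends IndexedRecord> {
  private final List<R> buffered = new ArrayList<>();

  public void write(R record) {
    buffered.add(record);
  }

  public int bufferedCount() {
    return buffered.size();
  }
}
```

Callers no longer need to supply a payload type just to satisfy the signature, which is what made the abstraction awkward before.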
@Mathieu1124 @leesf can you please share any tests you may have done in your own environment to ensure existing functionality is intact? This is a major signal we may not completely get from a PR review.
@vinothchandar, my testing is limited: just all the unit tests in the source code and all the demos in the Quick-Start Guide. I am planning to test it in a docker env.
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/common/HoodieEngineContext.java
Force-pushed from 56690a5 to c7b1cb1
* Making HoodieSnapshotCopier/HoodieSnapshotExporter all use HoodieContext
* More replacements of jsc.parallelize across hudi-spark-client
* More replacements of jsc.setJobGroup across hudi-spark-client
* Removing usages of HoodieIndex#fetchRecordLocation everywhere
Force-pushed from c7b1cb1 to 6a79819
Codecov Report
@@ Coverage Diff @@
## master #1827 +/- ##
============================================
- Coverage 59.89% 56.14% -3.76%
+ Complexity 4454 2658 -1796
============================================
Files 558 324 -234
Lines 23378 14775 -8603
Branches 2348 1539 -809
============================================
- Hits 14003 8295 -5708
+ Misses 8355 5783 -2572
+ Partials 1020 697 -323
Flags with carried forward coverage won't be shown.
Thanks, @vinothchandar, this is really great work!
I actually figured out that we can remove

* Not used by any other major API
* Removing `P` from the templatized list of parameters
@wangxianghu Please help test this out if possible. Once the tests pass again, planning to merge this in the morning PST. cc @yanghua
@leesf do you see the following exception? I could not understand how you'll even get the other one.
Either way, good pointer. The behavior has actually changed around this a bit, so I will try to tweak it and push a fix.
Force-pushed from be83c08 to d99096d
@vinothchandar @yanghua @leesf The demo runs well locally, except for the warning
Force-pushed from d99096d to 9d7a51d
@wangxianghu can you please test the latest commit? To be clear, you are saying you don't get the warning on master, but do get it on this branch, right? If this round of tests passes and you confirm, we can land this from my perspective.
Hi @vinothchandar, the warn log is still there in the HUDI-1089 branch (master is ok, no warn log).
I have tested the latest commit with the check condition changed to
It runs well locally, and the warn log disappeared.
@vinothchandar The warn log issue is fixed
@wangxianghu duh ofc. I understand now. Thanks for jumping in @wangxianghu!
@wangxianghu Just merged! Thanks again for the herculean effort. Some followups may pop up; would you be interested in taking them up? If so, I'll mention you along the way.
sure, just ping me when needed |
- This change breaks `hudi-client` into `hudi-client-common` and `hudi-spark-client` modules
- Simple usages of Spark via jsc.parallelize() have been redone using EngineContext#map, EngineContext#flatMap etc
- Code changes in the PR break classes into `BaseXYZ` parent classes, with no spark dependencies, living in `hudi-client-common`
- Classes in `hudi-spark-client` are named `SparkXYZ`, extending the parent classes with all the Spark dependencies
- To simplify/cleanup, HoodieIndex#fetchRecordLocation has been removed and its usages in tests replaced with alternatives

Co-authored-by: Vinoth Chandar <vinoth@apache.org>
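The engine-abstraction pattern summarized above can be sketched roughly as follows. This is a hedged, simplified illustration: `EngineContextSketch` and `LocalEngineContext` are hypothetical names standing in for the actual Hudi `HoodieEngineContext` hierarchy, and the Spark-backed variant would delegate to `jsc.parallelize(data, parallelism).map(fn)` instead of local streams:

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Code in hudi-client-common programs against an engine-context interface;
// each engine module plugs in its own implementation.
interface EngineContextSketch {
  <I, O> List<O> map(List<I> data, Function<I, O> fn, int parallelism);
}

// A single-JVM implementation standing in for the Spark-backed one.
class LocalEngineContext implements EngineContextSketch {
  @Override
  public <I, O> List<O> map(List<I> data, Function<I, O> fn, int parallelism) {
    // parallelism is ignored here; a parallel stream still preserves
    // encounter order when collected into a list.
    return data.parallelStream().map(fn).collect(Collectors.toList());
  }
}
```

With this split, engine-neutral logic in `BaseXYZ` classes can call `map`/`flatMap` without a compile-time Spark dependency, which is the point of the module separation.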
  protected abstract Iterator<List<WriteStatus>> handleInsert(String idPfx,
      Iterator<HoodieRecord<T>> recordItr) throws Exception;

  protected abstract Iterator<List<WriteStatus>> handleUpdate(String partitionPath, String fileId,
May I ask if the difference of parameters `idPfx` and `fileId` between the methods `handleInsert` and `handleUpdate` is designed on purpose? I can understand that for the update operation we need to find the previous version of the base file, so we use the file id to locate it. But in the method `BaseSparkCommitActionExecutor.handleUpsertPartition`, both `handleInsert` and `handleUpdate` get their parameters from `binfo.fileIdPrefix`. It's really confusing.
protected Iterator<List<WriteStatus>> handleUpsertPartition(String instantTime, Integer partition, Iterator recordItr,
Partitioner partitioner) {
SparkHoodiePartitioner upsertPartitioner = (SparkHoodiePartitioner) partitioner;
BucketInfo binfo = upsertPartitioner.getBucketInfo(partition);
BucketType btype = binfo.bucketType;
try {
if (btype.equals(BucketType.INSERT)) {
return handleInsert(binfo.fileIdPrefix, recordItr);
} else if (btype.equals(BucketType.UPDATE)) {
return handleUpdate(binfo.partitionPath, binfo.fileIdPrefix, recordItr);
} else {
throw new HoodieUpsertException("Unknown bucketType " + btype + " for partition :" + partition);
}
} catch (Throwable t) {
String msg = "Error upserting bucketType " + btype + " for partition :" + partition;
LOG.error(msg, t);
throw new HoodieUpsertException(msg, t);
}
}
Tips
What is the purpose of the pull request
Refactor hudi-client to support multi-engine
Brief change log
Verify this pull request
This pull request is already covered by existing tests.
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.