[HUDI-728]: Implement custom key generator #1433
This PR is currently a work in progress. I will add more test cases to ensure everything works.
@bvaradar @vinothchandar please have a look and let me know if this looks good.
@nsivabalan could you review this please?
@pratyakshsharma thanks for your patience. I will review this myself and get back in a day.
@vinothchandar: my bad. I have taken a look. You can review once my feedback is addressed. Will keep you posted.
@vinothchandar @nsivabalan please take a pass.
LGTM. @vinothchandar: do you want to review, or can we go ahead and merge it?
@vinothchandar Let us close this? :)
@pratyakshsharma sorry, this fell off my radar since I was not an assignee. I do have some concerns. Review coming by your daytime :)
I have some concerns on the implementation.. Also I think you may have checked in a few parquet files from testing?
 * The complete partition path is created as <value for field1 basis PartitionKeyType1>/<value for field2 basis PartitionKeyType2> and so on.
 *
 * Few points to consider:
 * 1. If you want to customise some partition path field on a timestamp basis, you can use field1:timestampBased
all this needs to be documented for the user?
in our documentation you mean to say? Happy to do that @vinothchandar
Will raise a follow up PR once this gets merged.
Yes.
Is the CustomKeyGenerator compatible with the SimpleKeyGenerator configs? I am wondering if we can replace the default with this, without forcing the user to do any additional work. I think this is worth pursuing. (We can then rename this to DefaultKeyGenerator.)
No, the configs are not compatible, since CustomKeyGenerator expects partitionPathFields to be provided in a particular format. :(
But since we are going with a major release next, I guess we can make this the default? (My thinking here is that users can expect some breaking changes in major releases, and we will mention all the changes in the release notes anyway.) WDYT?
Also because this key generator pretty much covers all the possible cases for key generation.
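To make the incompatibility concrete, here is a rough sketch of why a SimpleKeyGenerator-style value (a bare field name) would not satisfy a parser that expects `field:type` entries, plus one possible way to bridge the two. This is not code from the PR: the class `ConfigCompatSketch`, both method names, and the `simple` type label are assumptions made purely for illustration.

```java
// Hypothetical sketch only; not part of CustomKeyGenerator in this PR.
class ConfigCompatSketch {

  // Returns true only when every comma-separated entry looks like "field:type".
  static boolean isCustomFormat(String config) {
    for (String entry : config.split(",")) {
      String[] parts = entry.trim().split(":");
      if (parts.length != 2 || parts[0].isEmpty() || parts[1].isEmpty()) {
        return false;
      }
    }
    return true;
  }

  // Assumed fallback: treat a bare field name as "field:simple" so that an
  // existing SimpleKeyGenerator-style value could be accepted unchanged.
  static String normalize(String config) {
    StringBuilder out = new StringBuilder();
    for (String entry : config.split(",")) {
      String trimmed = entry.trim();
      if (out.length() > 0) {
        out.append(',');
      }
      out.append(trimmed.contains(":") ? trimmed : trimmed + ":simple");
    }
    return out.toString();
  }
}
```

With a fallback along these lines, a value such as `country` would be read as `country:simple`, which is one way the default could be swapped without users changing their configs. Whether that is desirable is exactly the question raised above.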
My bad. Will remove them.
Force-pushed from b6d4314 to 454116a.
Codecov Report
@@             Coverage Diff              @@
##             master    #1433      +/-   ##
============================================
- Coverage     71.66%   71.37%   -0.30%
+ Complexity      294      289       -5
============================================
  Files           378      379       +1
  Lines         16551    16603      +52
  Branches       1670     1674       +4
============================================
- Hits          11861    11850      -11
- Misses         3959     4029      +70
+ Partials        731      724       -7
@vinothchandar I have tried to address the comments, please have a look again. Also, looking at the Codecov report, I feel we need to add test cases for the key generators that currently have none, such as SimpleKeyGenerator and ComplexKeyGenerator. Should I include that in this PR or open a separate PR? Please suggest.
Your call. Doing it here is fine by me as well, since you are touching all those files anyway.
Few comments around usability..
@vinothchandar Please take a pass.
@pratyakshsharma rebase again? I can take a final pass.
@pratyakshsharma Rebased and removed the parquet files etc.
@nsivabalan can you shepherd this one home from here?
@pratyakshsharma: I have left a few comments. I see that this has been going on for quite some time, partly because of delays in reviews. I will make sure I keep tabs on it. Ping me once you have addressed the comments.
    return partitionPath;
  }

  public String getRecordKey(GenericRecord record) {
Did you consider whether we need to make this an abstract method in KeyGenerator?
Yeah this has been discussed already. Please refer to this - #1433 (comment)
@nsivabalan I have tried to include the changes from #1597 as well in this. Please take a pass.
It would have been easier to have it in two different diffs, since I now need to review that patch as well. It is fine, but can you help me with something? Can you point me to the exact commits where you addressed my last set of comments and the commits where you pulled in the other PR? When I looked at the last 3 commits, it was a bit confusing to me.
Codecov Report
@@             Coverage Diff              @@
##             master    #1433      +/-   ##
============================================
+ Coverage     16.71%   18.24%   +1.52%
- Complexity      795      854      +59
============================================
  Files           340      347       +7
  Lines         15030    15257     +227
  Branches       1499     1525      +26
============================================
+ Hits           2512     2783     +271
+ Misses        12188    12122      -66
- Partials        330      352      +22
Guess you have added some parquet files by mistake. Rest looks good except for one minor comment. Once done, do squash all your commits to one. We can merge it.
@pratyakshsharma: can you fix the build issue?
@nsivabalan I squashed the commits and force-pushed after unstaging the 2 parquet files, but they are still showing.
cc @wangxianghu. @pratyakshsharma confirmed he will resume this work and take it across the finish line.
Force-pushed from 9cab65f to 2810271.
@nsivabalan can we merge this now?
LGTM. can you squash all commits and let me know.
I have already squashed @nsivabalan :)
Pinging to see if you have any more concerns here or we can merge this @nsivabalan ? :)
What is the purpose of the pull request
We have TimestampBasedKeyGenerator for defining custom partition paths, and ComplexKeyGenerator for using a combination of fields as the record key or partition key.
However, we do not support the case where one wants a combination of fields as the record key together with custom partition paths.
This PR aims to give a generic implementation where a key generator type can be defined for every field in the partition path.
Brief change log
Introduced PartitionKeyType in KeyGenerator and added 2 new functions:
String getPartitionPath(GenericRecord record, String partitionPathField)
String getRecordKey(GenericRecord record)
Introduced a new class CustomKeyGenerator, which accepts the partition path field config in the form field1:PartitionKeyType1,field2:PartitionKeyType2
All the corner cases have been handled. Added a test class TestCustomKeyGenerator with only one test case for now; more will be added.
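As a rough illustration of the `field1:PartitionKeyType1,field2:PartitionKeyType2` format described above, the sketch below parses such a config and joins the per-field values with `/`. It is not the CustomKeyGenerator from this PR. The class name `PartitionPathSketch`, the enum values `SIMPLE` and `TIMESTAMPBASED`, the use of a plain `Map` in place of an Avro `GenericRecord`, and the choice to render timestamps as a UTC `yyyy-MM-dd` date are all assumptions made to keep the example self-contained.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch; names and behavior are assumptions, not the PR's code.
class PartitionPathSketch {

  // Assumed type names; the PR's PartitionKeyType values may differ.
  enum KeyType { SIMPLE, TIMESTAMPBASED }

  // One "field:type" entry from the partition path config.
  static final class FieldSpec {
    final String field;
    final KeyType type;

    FieldSpec(String field, KeyType type) {
      this.field = field;
      this.type = type;
    }
  }

  // Parses "field1:type1,field2:type2" into an ordered list of specs.
  static List<FieldSpec> parse(String config) {
    List<FieldSpec> specs = new ArrayList<>();
    for (String entry : config.split(",")) {
      String[] parts = entry.trim().split(":");
      if (parts.length != 2 || parts[0].isEmpty()) {
        throw new IllegalArgumentException("Expected field:type but got '" + entry + "'");
      }
      specs.add(new FieldSpec(parts[0], KeyType.valueOf(parts[1].toUpperCase(Locale.ROOT))));
    }
    return specs;
  }

  // Builds <value for field1>/<value for field2>/... in config order.
  static String partitionPath(Map<String, Object> record, List<FieldSpec> specs) {
    List<String> segments = new ArrayList<>();
    for (FieldSpec spec : specs) {
      Object value = record.get(spec.field);
      if (value == null) {
        throw new IllegalArgumentException("Missing partition field: " + spec.field);
      }
      if (spec.type == KeyType.TIMESTAMPBASED) {
        // Assumption: the value is epoch milliseconds, rendered as a UTC date.
        long millis = ((Number) value).longValue();
        segments.add(Instant.ofEpochMilli(millis).atZone(ZoneOffset.UTC).toLocalDate().toString());
      } else {
        segments.add(value.toString());
      }
    }
    return String.join("/", segments);
  }

  public static void main(String[] args) {
    Map<String, Object> record = new HashMap<>();
    record.put("country", "IN");
    record.put("ts", 1588291200000L);
    // prints IN/2020-05-01
    System.out.println(partitionPath(record, parse("country:simple,ts:timestampBased")));
  }
}
```

The record key side (a combination of fields, as with ComplexKeyGenerator) is left out here; the sketch only covers the per-field partition path handling.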
Verify this pull request
(Please pick either of the following options)
This pull request is a trivial rework / code cleanup without any test coverage.
(or)
This pull request is already covered by existing tests, such as (please describe tests).
(or)
This change added tests and can be verified as follows:
(example:)
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.