Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[HUDI-728]: Implement custom key generator #1433

Merged
merged 1 commit into from Jul 9, 2020

Conversation

pratyakshsharma
Copy link
Contributor

@pratyakshsharma pratyakshsharma commented Mar 21, 2020

Tips

What is the purpose of the pull request

We have TimestampBasedKeyGenerator for defining custom partition paths and we have ComplexKeyGenerator for supporting having combination of fields as record key or partition key.

However we do not have support for the case where one wants to have combination of fields as record key along with being able to define custom partition paths.
This PR aims to give a generic implementation where we can define key generator for every field in partition path.

Brief change log

  • Introduced PartitionKeyType in KeyGenerator, also added 2 new functions
    String getPartitionPath(GenericRecord record, String partitionPathField)
    String getRecordKey(GenericRecord record)

  • Introduced a new class CustomKeyGenerator which accepts input for partition path field in form -> field1:PartitionKeyType1,field2:PartitionKeyType2

  • All the corner cases have been handled. Added a test class TestCustomKeyGenerator with only one test case for now. Will be adding more.

Verify this pull request

(Please pick either of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end.
  • Added HoodieClientWriteTest to verify the change.
  • Manually verified the change by running a job locally.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@pratyakshsharma
Copy link
Contributor Author

This PR is a currently work in progress. Will be adding more test cases to ensure everything works fine.

@pratyakshsharma
Copy link
Contributor Author

@bvaradar @vinothchandar please have a look and let me know if this looks good.

@vinothchandar
Copy link
Member

@nsivabalan could you review this please?

@vinothchandar
Copy link
Member

@pratyakshsharma thanks for your patience.. I will review this myself and get back in a day.

@nsivabalan
Copy link
Contributor

@vinothchandar : my bad. I have taken a look. You can review once my feedback is addressed. Will keep you posted.

@pratyakshsharma
Copy link
Contributor Author

@vinothchandar @nsivabalan please take pass.

@nsivabalan
Copy link
Contributor

LGTM. @vinothchandar : do you want to review or can we go ahead and merge it.

@pratyakshsharma
Copy link
Contributor Author

@vinothchandar Let us close this? :)

@vinothchandar
Copy link
Member

@pratyakshsharma sorry .. fell off my radar since I was not an assignee.. I do have some concerns.. Review coming by your day time :)

Copy link
Member

@vinothchandar vinothchandar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some concerns on the implementation.. Also I think you may have checked in a few parquet files from testing?

* The complete partition path is created as <value for field1 basis PartitionKeyType1>/<value for field2 basis PartitionKeyType2> and so on.
*
* Few points to consider:
* 1. If you want to customise some partition path field on a timestamp basis, you can use field1:timestampBased
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all this needs to be documented for the user?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in our documentation you mean to say? Happy to do that @vinothchandar
Will raise a follow up PR once this gets merged.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes..

is the CustomKeyGenerator compatible with the SimpleKeyGenerator configs? I am wondering if we can replace the default with this, without forcing user to do any additional work.. I think this is worth pursuing.. (We can then rename this DefaultKeyGenerator

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No the configs are not compatible, since CustomKeyGenerator expects partitionPathFields to be provided in a particular format. :(
But since we are going with a major release next, I guess we can make this as the default? (My thinking here is users can expect a bit of breaking changes in major releases, anyways we will mention all the changes in the release notes). WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also because this key generator pretty much covers all the possible cases for key generation.

@pratyakshsharma
Copy link
Contributor Author

I have some concerns on the implementation.. Also I think you may have checked in a few parquet files from testing?

My bad. Will remove them.

@codecov-io
Copy link

Codecov Report

Merging #1433 into master will decrease coverage by 0.29%.
The diff coverage is 35.80%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master    #1433      +/-   ##
============================================
- Coverage     71.66%   71.37%   -0.30%     
+ Complexity      294      289       -5     
============================================
  Files           378      379       +1     
  Lines         16551    16603      +52     
  Branches       1670     1674       +4     
============================================
- Hits          11861    11850      -11     
- Misses         3959     4029      +70     
+ Partials        731      724       -7     
Impacted Files Coverage Δ Complexity Δ
...e/hudi/exception/HoodieDeltaStreamerException.java 0.00% <ø> (ø) 0.00 <0.00> (?)
...ava/org/apache/hudi/keygen/CustomKeyGenerator.java 0.00% <0.00%> (ø) 0.00 <0.00> (?)
...main/java/org/apache/hudi/keygen/KeyGenerator.java 60.00% <0.00%> (-40.00%) 0.00 <0.00> (ø)
...apache/hudi/keygen/TimestampBasedKeyGenerator.java 0.00% <0.00%> (ø) 0.00 <0.00> (?)
...apache/hudi/utilities/deltastreamer/DeltaSync.java 72.44% <ø> (ø) 37.00 <0.00> (ø)
...va/org/apache/hudi/keygen/ComplexKeyGenerator.java 92.30% <94.44%> (+5.46%) 0.00 <0.00> (ø)
...g/apache/hudi/keygen/GlobalDeleteKeyGenerator.java 91.30% <100.00%> (+0.82%) 0.00 <0.00> (ø)
...ava/org/apache/hudi/keygen/SimpleKeyGenerator.java 90.47% <100.00%> (+2.24%) 0.00 <0.00> (ø)
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 19cc15c...f5b7e4b. Read the comment docs.

@pratyakshsharma
Copy link
Contributor Author

@vinothchandar Have tried to address the comments, please have a look again. Also looking at the Codecov report, I feel there is a need to add test cases for all key generators for which we do not have right now like SimpleKeyGenerator, ComplexKeyGenerator etc. Should I include that in this PR or should I go for a separate PR. Please suggest.

@vinothchandar
Copy link
Member

Should I include that in this PR or should I go for a separate PR. Please suggest.

Your call.. Doing it here is fine by me as well, since you are touching all those files anyway..

Copy link
Member

@vinothchandar vinothchandar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few comments around usability..

* The complete partition path is created as <value for field1 basis PartitionKeyType1>/<value for field2 basis PartitionKeyType2> and so on.
*
* Few points to consider:
* 1. If you want to customise some partition path field on a timestamp basis, you can use field1:timestampBased
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes..

is the CustomKeyGenerator compatible with the SimpleKeyGenerator configs? I am wondering if we can replace the default with this, without forcing user to do any additional work.. I think this is worth pursuing.. (We can then rename this DefaultKeyGenerator

@pratyakshsharma
Copy link
Contributor Author

@vinothchandar Please take a pass.

@vinothchandar
Copy link
Member

@pratyakshsharma rebase again? I can take a final pass

@vinothchandar
Copy link
Member

@pratyakshsharma Rebased and removed the parquet files etc..

@vinothchandar vinothchandar removed their assignment May 14, 2020
@vinothchandar
Copy link
Member

@nsivabalan can you shepherd this one home from here>

Copy link
Contributor

@nsivabalan nsivabalan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pratyakshsharma : have left few comments. I see that this has been going on for quite some time, partly because of delay in reviews. I will ensure I keep a tab at it. ping me once you have addressed the comments.

return partitionPath;
}

public String getRecordKey(GenericRecord record) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did you think if we need to make this an abstract method in KeyGenerator?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this has been discussed already. Please refer to this - #1433 (comment)

@pratyakshsharma
Copy link
Contributor Author

@nsivabalan I have tried to include the changes from #1597 as well in this. Please take a pass.

Copy link
Contributor

@nsivabalan nsivabalan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would have been easier to have it in two different diffs. I need to review that patch now. It is fine. But can you help me with something. Can you point me to exact commits where you addressed my last set of comments and commits where you pulled in the other PR? When I looked at the last 3 commits, it was bit confusing to me.

@pratyakshsharma
Copy link
Contributor Author

Can you point me to exact commits where you addressed my last set of comments and commits where you pulled in the other PR?

Sure. I addressed your last set of comments in - 4ebd505

I pulled the other PR in - cfe94e4.

Hope that helps.

@codecov-commenter
Copy link

codecov-commenter commented May 20, 2020

Codecov Report

Merging #1433 into master will increase coverage by 1.52%.
The diff coverage is 8.45%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master    #1433      +/-   ##
============================================
+ Coverage     16.71%   18.24%   +1.52%     
- Complexity      795      854      +59     
============================================
  Files           340      347       +7     
  Lines         15030    15257     +227     
  Branches       1499     1525      +26     
============================================
+ Hits           2512     2783     +271     
+ Misses        12188    12122      -66     
- Partials        330      352      +22     
Impacted Files Coverage Δ Complexity Δ
...e/hudi/exception/HoodieDeltaStreamerException.java 0.00% <ø> (ø) 0.00 <0.00> (?)
...va/org/apache/hudi/keygen/ComplexKeyGenerator.java 0.00% <0.00%> (ø) 0.00 <0.00> (ø)
...ava/org/apache/hudi/keygen/CustomKeyGenerator.java 0.00% <0.00%> (ø) 0.00 <0.00> (?)
...g/apache/hudi/keygen/GlobalDeleteKeyGenerator.java 0.00% <0.00%> (ø) 0.00 <0.00> (ø)
...apache/hudi/keygen/NonpartitionedKeyGenerator.java 0.00% <0.00%> (ø) 0.00 <0.00> (ø)
...apache/hudi/keygen/TimestampBasedKeyGenerator.java 0.00% <0.00%> (ø) 0.00 <0.00> (?)
...ava/org/apache/hudi/keygen/SimpleKeyGenerator.java 73.68% <75.00%> (+14.86%) 0.00 <0.00> (ø)
...ain/java/org/apache/hudi/avro/HoodieAvroUtils.java 48.09% <0.00%> (-5.88%) 22.00% <0.00%> (ø%)
...in/scala/org/apache/hudi/AvroConversionUtils.scala 45.45% <0.00%> (-4.55%) 0.00% <0.00%> (ø%)
...c/main/java/org/apache/hudi/index/HoodieIndex.java 36.84% <0.00%> (-4.34%) 3.00% <0.00%> (ø%)
... and 36 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 25a0080...5f0ad60. Read the comment docs.

Copy link
Contributor

@nsivabalan nsivabalan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Guess you have added some parquet files by mistake. Rest looks good except for one minor comment. Once done, do squash all your commits to one. We can merge it.

@nsivabalan
Copy link
Contributor

@pratyakshsharma : can you fix the build issue.

@pratyakshsharma
Copy link
Contributor Author

@nsivabalan I squashed the commits and force pushed after unstaging the 2 parquet files, but they are still showing.

@vinothchandar
Copy link
Member

cc @wangxianghu ..

@pratyakshsharma confirmed, he will resume this wiork and take it across finish line��

@pratyakshsharma
Copy link
Contributor Author

@nsivabalan can we merge this now?

Copy link
Contributor

@nsivabalan nsivabalan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. can you squash all commits and let me know.

@pratyakshsharma
Copy link
Contributor Author

LGTM. can you squash all commits and let me know

I have already squashed @nsivabalan :)

@pratyakshsharma
Copy link
Contributor Author

Pinging to see if you have any more concerns here or we can merge this @nsivabalan ? :)

@nsivabalan nsivabalan merged commit c7f1a78 into apache:master Jul 9, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants