Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[HUDI-1951] Add bucket hash index, compatible with the hive bucket #3173

Merged
merged 13 commits into from Dec 30, 2021

Conversation

minihippo
Copy link
Contributor

What is the purpose of the pull request

Index pattern 1 in RFC-29: Hash Index
https://cwiki.apache.org/confluence/display/HUDI/RFC+-+29%3A+Hash+Index

Brief change log

  • add a new index method
  • fix the compatibility issue when adding member variables to HoodieKey

Verify this pull request

This change added tests and can be verified as follows:

  • Added unit tests for bucket index and serialization
  • Modify existing integration tests to verify the new index method

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@codecov-commenter
Copy link

codecov-commenter commented Jun 28, 2021

Codecov Report

Merging #3173 (fda740e) into master (d024439) will decrease coverage by 41.33%.
The diff coverage is 0.00%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master   #3173       +/-   ##
============================================
- Coverage     44.10%   2.77%   -41.34%     
+ Complexity     5157      85     -5072     
============================================
  Files           936     286      -650     
  Lines         41629   12074    -29555     
  Branches       4189    1010     -3179     
============================================
- Hits          18362     335    -18027     
+ Misses        21638   11713     -9925     
+ Partials       1629      26     -1603     
Flag Coverage Δ
hudicli ?
hudiclient 0.00% <0.00%> (-34.47%) ⬇️
hudicommon ?
hudiflink ?
hudihadoopmr ?
hudisparkdatasource ?
hudisync 4.86% <0.00%> (-50.87%) ⬇️
huditimelineservice ?
hudiutilities 9.11% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...java/org/apache/hudi/config/HoodieIndexConfig.java 0.00% <0.00%> (-69.35%) ⬇️
...pache/hudi/execution/CopyOnWriteInsertHandler.java 0.00% <0.00%> (ø)
...c/main/java/org/apache/hudi/index/HoodieIndex.java 0.00% <0.00%> (-33.34%) ⬇️
.../java/org/apache/hudi/keygen/BaseKeyGenerator.java 0.00% <0.00%> (-40.00%) ⬇️
...rg/apache/hudi/keygen/ComplexAvroKeyGenerator.java 0.00% <0.00%> (-50.00%) ⬇️
...org/apache/hudi/keygen/CustomAvroKeyGenerator.java 0.00% <0.00%> (-10.82%) ⬇️
...ache/hudi/keygen/GlobalAvroDeleteKeyGenerator.java 0.00% <0.00%> (-42.86%) ⬇️
.../main/java/org/apache/hudi/keygen/KeyGenUtils.java 0.00% <0.00%> (-17.15%) ⬇️
...he/hudi/keygen/NonpartitionedAvroKeyGenerator.java 0.00% <0.00%> (-41.67%) ⬇️
...org/apache/hudi/keygen/SimpleAvroKeyGenerator.java 0.00% <0.00%> (-61.54%) ⬇️
... and 719 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d024439...fda740e. Read the comment docs.

@vinothchandar vinothchandar added this to Ready for Review in PR Tracker Board Jun 30, 2021
@minihippo
Copy link
Contributor Author

The ITTestHoodieDemo failure is due to its own instability rather than being affected by this new feature, and HUDI-2113 has already fixed it.

@vinothchandar
Copy link
Member

@minihippo can you please rebase this PR again?

@minihippo
Copy link
Contributor Author

@vinothchandar done

@vinothchandar
Copy link
Member

@minihippo been behind on this.

is there anyone else who wants to take a first pass?

@minihippo
Copy link
Contributor Author

Hi @leesf, I consider the patch is too large. Should I divided it into 2 pr for better review?

@leesf
Copy link
Contributor

leesf commented Jul 12, 2021

Hi @leesf, I consider the patch is too large. Should I divided it into 2 pr for better review?

After removing the csv file, the PR become smaller and i think it is ok to contains the changes in one PR.

@vinothchandar vinothchandar moved this from Ready for Review to Nearing Landing in PR Tracker Board Dec 27, 2021
Copy link
Member

@vinothchandar vinothchandar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@minihippo Thanks for the push, I also cleaned up some docs, naming. Please take a look at the last commit. Once CI passes and you are happy with it, I'll land. I have pointed out some follow-ups in this review. If you could add subtasks for them and track them to completion, that would be awesome.


public HoodieBucketLayout() {
super();
partitionClass = "org.apache.hudi.table.action.commit.SparkBucketIndexPartitioner";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@minihippo I understand you did this to avoid a reverse dep from common to spark-client. Wondering, if we need a LayoutConfig introduced so we can pass the partitioner class name along from client side.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix


import java.io.Serializable;

public class HoodieStorageLayout implements Serializable {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should probably pull a separate class HoodieDefaultLayout and make this abstract. and nicely move all layout specific optimizations there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix

@@ -526,6 +526,12 @@ object DataSourceWriteOptions {
.noDefaultValue()
.withDocumentation("Mode to choose for Hive ops. Valid values are hms, jdbc and hiveql.")

val HIVE_SYNC_BUCKET_SYNC: ConfigProperty[Boolean] = ConfigProperty
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we also fix the deltastreamer path?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

/**
*
*/
class TestDataSourceForBucketIndex extends HoodieClientTestBase {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i was referring to taking an existing test and running it across bucket and non-bucket cases. We can revisit this again. Could we add a test JIRA under the umbrella.

@vinothchandar
Copy link
Member

There are some test failures

R] Errors: 
[ERROR]   TestHoodieDeltaStreamer.testBulkInsertsAndUpsertsWithSQLBasedTransformerFor2StepPipeline:1088 » IllegalArgument
[ERROR]   TestHoodieDeltaStreamer.testPayloadClassUpdate:1176 » IllegalArgument Property...
[ERROR]   TestHoodieDeltaStreamer.testPayloadClassUpdateWithCOWTable:1202 » IllegalArgument
[INFO] 

[ERROR] Errors: 
[ERROR]   TestDataSourceUtils.testBuildHiveSyncConfig:224 » IllegalArgument Property hoo...
[ERROR]   TestDataSourceUtils.testBuildHiveSyncConfig:224 » IllegalArgument Property hoo...

Don't think these are related to my pushes. @minihippo Could you please check

@minihippo
Copy link
Contributor Author

@vinothchandar I addressed all comments and the failure ut is not related with this pr. Can we land this?

@minihippo
Copy link
Contributor Author

@hudi-bot run azure

@vinothchandar
Copy link
Member

Looking into the failures.

@vinothchandar
Copy link
Member

@minihippo I was thinking we can name all parameters hoodie.storage.layout.. instead, but the space curve PRs are all named hoodie.layout.optimize anyway. So I think its ok

@vinothchandar
Copy link
Member

[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 125.714 s - in org.apache.hudi.integ.command.ITTestHoodieSyncCommand

[ERROR] Tests run: 2, Failures: 1, Errors: 0, Skipped: 1, Time elapsed: 29.812 s <<< FAILURE! - in org.apache.hudi.integ.ITTestHoodieDemo
[ERROR] org.apache.hudi.integ.ITTestHoodieDemo.testParquetDemo  Time elapsed: 29.622 s  <<< FAILURE!
org.opentest4j.AssertionFailedError: Command ([hdfs, dfsadmin, -safemode, wait]) expected to succeed. Exit (255) ==> expected: <0> but was: <255>
	at org.apache.hudi.integ.ITTestHoodieDemo.setupDemo(ITTestHoodieDemo.java:167)
	at org.apache.hudi.integ.ITTestHoodieDemo.testParquetDemo(ITTestHoodieDemo.java:107)

This keeps failing. Could you rebase again with latest master? want to try running the tests again

@minihippo
Copy link
Contributor Author

@minihippo I was thinking we can name all parameters hoodie.storage.layout.. instead, but the space curve PRs are all named hoodie.layout.optimize anyway. So I think its ok

I didn't modify the hoodie.layout.optimize directly, considering the history config compatibility.

@hudi-bot
Copy link

CI report:

  • ee973637958fb9c1496cfb45f78346e2f01ffa02 UNKNOWN
  • c435898754ec3ce579eb2b67c473443c9ee70e46 UNKNOWN
  • 2de5524 UNKNOWN
  • 924337e UNKNOWN
  • 527e8f5 Azure: SUCCESS
Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

Copy link
Member

@vinothchandar vinothchandar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@minihippo Couple more follow ups. Landing

assignUpdates(profile);
}

private void assignUpdates(WorkloadProfile profile) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

using java streams instead of for loops. its a minor comment

* Storage layout related config.
*/
@Immutable
@ConfigClassProperty(name = "Layout Configs",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably have to make this layout a TableConfig. i.e add it to HoodieTableConfig and persist in hoodie.properties, handling cases where there is going to be tables with and without the property.

@vinothchandar vinothchandar merged commit a4e622a into apache:master Dec 30, 2021
PR Tracker Board automation moved this from Nearing Landing to Done Dec 30, 2021
@vinishjail97 vinishjail97 mentioned this pull request Jan 5, 2022
5 tasks
@vinishjail97 vinishjail97 mentioned this pull request Jan 24, 2022
5 tasks
liusenhua pushed a commit to liusenhua/hudi that referenced this pull request Mar 1, 2022
…pache#3173)

* [HUDI-2154] Add index key field to HoodieKey

* [HUDI-2157] Add the bucket index and its read/write implemention of Spark engine.
* revert HUDI-2154 add index key field to HoodieKey
* fix all comments and introduce a new tricky way to get index key at runtime
support double insert for bucket index
* revert spark read optimizer based on bucket index
* add the storage layout
* index tag, hash function and add ut
* fix ut
* address partial comments
* Code review feedback
* add layout config and docs
* fix ut
* rename hoodie.layout and rebase master

Co-authored-by: Vinoth Chandar <vinoth@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Development

Successfully merging this pull request may close these issues.

None yet