New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[HUDI-1951] Add bucket hash index, compatible with the hive bucket #3173
Conversation
Codecov Report
@@ Coverage Diff @@
## master #3173 +/- ##
============================================
- Coverage 44.10% 2.77% -41.34%
+ Complexity 5157 85 -5072
============================================
Files 936 286 -650
Lines 41629 12074 -29555
Branches 4189 1010 -3179
============================================
- Hits 18362 335 -18027
+ Misses 21638 11713 -9925
+ Partials 1629 26 -1603
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
The |
@minihippo can you please rebase this PR again? |
@vinothchandar done |
@minihippo been behind on this. is there anyone else who wants to take a first pass? |
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java
Outdated
Show resolved
Hide resolved
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/keygen/SimpleAvroKeyGenerator.java
Outdated
Show resolved
Hide resolved
...-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/BucketInfo.java
Outdated
Show resolved
Hide resolved
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/utils/HiveBucketUtils.java
Outdated
Show resolved
Hide resolved
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/utils/HiveBucketUtils.java
Outdated
Show resolved
Hide resolved
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/utils/HiveBucketUtils.java
Outdated
Show resolved
Hide resolved
hudi-client/hudi-client-common/src/test/resources/hive_bucket_id_check.csv
Outdated
Show resolved
Hide resolved
Hi @leesf, I consider the patch is too large. Should I divided it into 2 pr for better review? |
After removing the csv file, the PR become smaller and i think it is ok to contains the changes in one PR. |
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/keygen/BaseKeyGenerator.java
Outdated
Show resolved
Hide resolved
...lient-common/src/main/java/org/apache/hudi/table/action/commit/BaseCommitActionExecutor.java
Outdated
Show resolved
Hide resolved
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/utils/HiveHasher.java
Outdated
Show resolved
Hide resolved
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/utils/HiveHasher.java
Outdated
Show resolved
Hide resolved
hudi-client/hudi-client-common/src/test/java/org/apache/hudi/utils/TestHiveBucketUtils.java
Outdated
Show resolved
Hide resolved
...lient/hudi-spark-client/src/main/java/org/apache/hudi/index/bucket/SparkHiveBucketIndex.java
Outdated
Show resolved
Hide resolved
...lient/hudi-spark-client/src/main/java/org/apache/hudi/index/bucket/SparkHiveBucketIndex.java
Outdated
Show resolved
Hide resolved
...lient/hudi-spark-client/src/main/java/org/apache/hudi/index/bucket/SparkHiveBucketIndex.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@minihippo Thanks for the push, I also cleaned up some docs, naming. Please take a look at the last commit. Once CI passes and you are happy with it, I'll land. I have pointed out some follow-ups in this review. If you could add subtasks for them and track them to completion, that would be awesome.
|
||
public HoodieBucketLayout() { | ||
super(); | ||
partitionClass = "org.apache.hudi.table.action.commit.SparkBucketIndexPartitioner"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@minihippo I understand you did this to avoid a reverse dep from common to spark-client. Wondering, if we need a LayoutConfig
introduced so we can pass the partitioner class name along from client side.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fix
|
||
import java.io.Serializable; | ||
|
||
public class HoodieStorageLayout implements Serializable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should probably pull a separate class HoodieDefaultLayout
and make this abstract. and nicely move all layout specific optimizations there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fix
@@ -526,6 +526,12 @@ object DataSourceWriteOptions { | |||
.noDefaultValue() | |||
.withDocumentation("Mode to choose for Hive ops. Valid values are hms, jdbc and hiveql.") | |||
|
|||
val HIVE_SYNC_BUCKET_SYNC: ConfigProperty[Boolean] = ConfigProperty |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we also fix the deltastreamer path?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
/** | ||
* | ||
*/ | ||
class TestDataSourceForBucketIndex extends HoodieClientTestBase { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i was referring to taking an existing test and running it across bucket and non-bucket cases. We can revisit this again. Could we add a test JIRA under the umbrella.
There are some test failures
Don't think these are related to my pushes. @minihippo Could you please check |
@vinothchandar I addressed all comments and the failure ut is not related with this pr. Can we land this? |
@hudi-bot run azure |
Looking into the failures. |
@minihippo I was thinking we can name all parameters |
This keeps failing. Could you rebase again with latest master? want to try running the tests again |
…untime support double insert for bucket index
I didn't modify the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@minihippo Couple more follow ups. Landing
assignUpdates(profile); | ||
} | ||
|
||
private void assignUpdates(WorkloadProfile profile) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
using java streams instead of for loops. its a minor comment
* Storage layout related config. | ||
*/ | ||
@Immutable | ||
@ConfigClassProperty(name = "Layout Configs", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably have to make this layout a TableConfig. i.e add it to HoodieTableConfig and persist in hoodie.properties
, handling cases where there is going to be tables with and without the property.
…pache#3173) * [HUDI-2154] Add index key field to HoodieKey * [HUDI-2157] Add the bucket index and its read/write implemention of Spark engine. * revert HUDI-2154 add index key field to HoodieKey * fix all comments and introduce a new tricky way to get index key at runtime support double insert for bucket index * revert spark read optimizer based on bucket index * add the storage layout * index tag, hash function and add ut * fix ut * address partial comments * Code review feedback * add layout config and docs * fix ut * rename hoodie.layout and rebase master Co-authored-by: Vinoth Chandar <vinoth@apache.org>
What is the purpose of the pull request
Index pattern 1 in
RFC-29: Hash Index
https://cwiki.apache.org/confluence/display/HUDI/RFC+-+29%3A+Hash+Index
Brief change log
Verify this pull request
This change added tests and can be verified as follows:
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.