-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[HUDI-2909] Handle logical type in TimestampBasedKeyGenerator #4203
Conversation
@codope : is this good to review. If there are any pending work, I can take it up as I am focusing on all issues and jiras. let me know. |
e12df0f
to
5b68cad
Compare
@nsivabalan It is now ready to review. Both row-writer and non-row writer path will emit same value for logical timestamp type column if the config is enabled. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's see if we can avoid changing public apis in KeyGenerator.
@@ -889,6 +890,13 @@ public String getKeyGeneratorClass() { | |||
return getString(KEYGENERATOR_CLASS_NAME); | |||
} | |||
|
|||
public boolean isConsistentLogicalTimestampEnabled() { | |||
if (getBoolean(KeyGeneratorOptions.KEYGENERATOR_CONSISTENT_LOGICAL_TIMESTAMP_ENABLED) == null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wouldn't getBoolean return the default if not found?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
.../hudi-client-common/src/main/java/org/apache/hudi/keygen/TimestampBasedAvroKeyGenerator.java
Show resolved
Hide resolved
...t/hudi-spark-client/src/test/java/org/apache/hudi/keygen/TestTimestampBasedKeyGenerator.java
Show resolved
Hide resolved
...datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/SqlKeyGenerator.scala
Outdated
Show resolved
Hide resolved
@nsivabalan @codope |
@YannByron You bring up a good point. Adding another config in future is tedious. However, the intention behind adding a new config was to avoid discrepancy in existing pipelines. @nsivabalan has explained this in more detail on the jira HUDI-2909. I do not expect such changes to be frequent. Nevertheless, i'll try to avoid making incompatible changes to public APIs. |
5b68cad
to
2c0565c
Compare
hudi-common/src/main/java/org/apache/hudi/keygen/constant/KeyGeneratorOptions.java
Show resolved
Hide resolved
@codope : TestGenericRecordAndRowConsistency test failed in CI. Can you please check. |
fde98bb
to
cb2d16c
Compare
@hudi-bot run azure |
Add config to enable consistent logical timestamp values Address review comments Add an example in the config doc Cleanup spark context in test
cb2d16c
to
9c22d6c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
* [HUDI-2909] Handle logical type in TimestampBasedKeyGenerator Timestampbased key generator was returning diff values for row writer and non row writer path. this patch fixes it and is guarded by a config flag (`hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled`)
@@ -444,15 +445,15 @@ public static Schema generateProjectionSchema(Schema originalSchema, List<String | |||
/** | |||
* Obtain value of the provided field as string, denoted by dot notation. e.g: a.b.c | |||
*/ | |||
public static String getNestedFieldValAsString(GenericRecord record, String fieldName, boolean returnNullIfNotFound) { | |||
Object obj = getNestedFieldVal(record, fieldName, returnNullIfNotFound); | |||
public static String getNestedFieldValAsString(GenericRecord record, String fieldName, boolean returnNullIfNotFound, boolean consistentLogicalTimestampEnabled) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@codope @nsivabalan folks, i don't believe this is the right fix to address the problem: bubbling down this config value we're now exposing every user of this API to be aware of it, which, very likely, will have very little to do with it.
…#4203) * [HUDI-2909] Handle logical type in TimestampBasedKeyGenerator Timestampbased key generator was returning diff values for row writer and non row writer path. this patch fixes it and is guarded by a config flag (`hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled`)
…#4203) * [HUDI-2909] Handle logical type in TimestampBasedKeyGenerator Timestampbased key generator was returning diff values for row writer and non row writer path. this patch fixes it and is guarded by a config flag (`hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled`)
…#4203) * [HUDI-2909] Handle logical type in TimestampBasedKeyGenerator Timestampbased key generator was returning diff values for row writer and non row writer path. this patch fixes it and is guarded by a config flag (`hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled`)
What is the purpose of the pull request
#3944 handled logical timestamp type for complex keygen. This PR fixes HUDI-2909 by handing it for timestamp based keygen.
When the new config
hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled
is set to true, consistent value will be generated for a logical timestamp type column like timestamp-millis and timestamp-micros, irrespective of whether row-writer is enabled. Disabled by default so as not to break the pipeline that deploy either fully row-writer path or non row-writer path. For example, if it is kept disabled then record key of timestamp type with value2016-12-29 09:54:00
will be written as timestamp2016-12-29 09:54:00.0
in row-writer path, while it will be written as long value1483023240000000
in non row-writer path. If enabled, then the timestamp value will be written in both the cases.Brief change log
(for example:)
Verify this pull request
Added UT and manually verified with spark datasource.
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.