-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[HUDI-1160] Support update partial fields for CoW table #2666
Conversation
Codecov Report
@@ Coverage Diff @@
## master #2666 +/- ##
============================================
- Coverage 51.70% 9.52% -42.18%
+ Complexity 3584 48 -3536
============================================
Files 475 53 -422
Lines 22495 1963 -20532
Branches 2393 235 -2158
============================================
- Hits 11630 187 -11443
+ Misses 9852 1763 -8089
+ Partials 1013 13 -1000
Flags with carried forward coverage won't be shown. Click here to find out more. |
The content in this pr #1929 comment is resolved here, please review |
@liujinhui1994 : Thanks for the contribution. There are 2 to 3 PRs with similar goal. Did you get happen to check out existing ones before putting this up? |
There does not seem to be the same function as this PR at the moment This PR inherits from here, this one #1929 will be obsolete |
@liujinhui1994 We are also need the same feature in hudi. Is there any working branch which can be referred? |
@Sugamber This branch should be available |
Is there any timeline for this pull request? |
Maybe after 0.8 is released
…------------------ 原始邮件 ------------------
发件人: "apache/hudi" ***@***.***>;
发送时间: 2021年3月24日(星期三) 晚上6:03
***@***.***>;
***@***.******@***.***>;
主题: Re: [apache/hudi] [HUDI-1160] Support update partial fields for CoW table (#2666)
Is there any timeline for this pull request?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or unsubscribe.
|
@liujinhui1994 : Can you please update the description with an example. |
1、old data 2、new data 3、 final result |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry for late turn around on reviewing this. We should definitely get this before next release.
I am yet to review tests. but few high level thoughts on reviewing source code.
- Shouldn't we check schema compatibility? what incase new incoming batch is not compatible w/ table schema w/ partial updates set to true? did we cover this scenario.
- I see we have added support only for COW. should we throw exception if the config is set for MOR?
- I don't have a good idea of adding sql DML support to hoodie table. But if feasible, once such support is added, do you think we can leverage this w/o duplicating the work for sql DML. for eg, "update col1 = 'new_york' where col2= '123'" Such partial updates should translate from sql layer to this right.
- In tests, do verify that schema in commit metadata refers to table schema and not incoming partial schema.
@@ -79,6 +80,7 @@ | |||
public static final String TIMELINE_LAYOUT_VERSION = "hoodie.timeline.layout.version"; | |||
public static final String BASE_PATH_PROP = "hoodie.base.path"; | |||
public static final String AVRO_SCHEMA = "hoodie.avro.schema"; | |||
public static final String LAST_AVRO_SCHEMA = "hoodie.last.avro.schema"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
may be we can call this as "latest table schema".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, i think it's good
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@danny0405 Can you help answer your questions about SQL?
try { | ||
TableSchemaResolver resolver = new TableSchemaResolver(hoodieTable.getMetaClient()); | ||
Schema lastSchema = resolver.getTableAvroSchema(); | ||
config.setLastSchema(lastSchema.toString()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lastSchema could be null if this is first commit or last commit is an operation which does not inject schema in commit metadata. Can we check if not null and then set the last schema. If not, .toString() could throw NullPointerException.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
infact, would prefer to throw an exception if table schema is not present and if someone is trying to use updatePartialFields.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lastSchema could be null if this is first commit or last commit is an operation which does not inject schema in commit metadata. Can we check if not null and then set the last schema. If not, .toString() could throw NullPointerException.
ok
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
infact, would prefer to throw an exception if table schema is not present and if someone is trying to use updatePartialFields.
Is this too serious? I don't think it's necessary to throw this exception here
@@ -106,7 +110,7 @@ | |||
public HoodieMergeHandle(HoodieWriteConfig config, String instantTime, HoodieTable<T, I, K, O> hoodieTable, | |||
Iterator<HoodieRecord<T>> recordItr, String partitionPath, String fileId, | |||
TaskContextSupplier taskContextSupplier) { | |||
super(config, instantTime, partitionPath, fileId, hoodieTable, taskContextSupplier); | |||
super(config, instantTime, partitionPath, fileId, hoodieTable, getWriterSchemaIncludingAndExcludingMetadataPair(config, hoodieTable), taskContextSupplier); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you help me understand something. Here we have HoodieWriteConfig at two places. As a separate entity (1st arg in constructor) and HoodieTable.getConfig(). I see within getWriterSchemaIncludingAndExcludingMetadataPair(...), we update lastSchema in hoodieWriteConfig, which will update the 1st arg. But table.getConfig() may not be updated right.
If above statement is right, how do we rely on table.getConfig.getLastSchema() in *MergeHelper classes.
May be I am missing something. can you throw some light.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found through debug,when config.setLastSchema is successfully updated, table.getConfig.getLastSchema is also updated, and they both take their values from the same properties
import java.io.IOException; | ||
import java.util.List; | ||
|
||
public class PartialUpdatePayload extends OverwriteWithLatestAvroPayload { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
java docs w/ example would be great.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok,i will add
} | ||
|
||
@Override | ||
public Option<IndexedRecord> combineAndGetUpdateValue(IndexedRecord lastValue, Schema schema) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
schema here refers to incoming partial schema right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: instead of "lastValue", may be "existingValue" or "existingStorageValue"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
schema here refers to incoming partial schema right?
yes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
existing
ok
} | ||
return recordOption; | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
line break please.
|
||
HoodieCommitMetadata commitMetadata = buildMetadataFromStats(writeStats, partitionToReplaceFileIds, commitActionType, operationType); | ||
|
||
// add in extra metadata | ||
if (extraMetadata.isPresent()) { | ||
extraMetadata.get().forEach(commitMetadata::addMetadata); | ||
} | ||
if (updatePartialFields) { | ||
try { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if updatePartialFields is set, can't we rely on config.getLastSchema() in all places where this method is called similar to how we pass in config.getSchema(). Is it not guaranteed that config.lastSchema() will be set by the time we reach here if updatePartialFields is set to true?
Trying to see if we can avoid parsing the schema from table once again since we have already done it once.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good,i will delete it
Your suggestion is very good, please let me think about it, and then reply to you |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry for late turn around on reviewing this. We should definitely get this before next release.
I am yet to review tests. but few high level thoughts on reviewing source code.
- Shouldn't we check schema compatibility? what incase new incoming batch is not compatible w/ table schema w/ partial updates set to true? did we cover this scenario.
- I see we have added support only for COW. should we throw exception if the config is set for MOR?
- I don't have a good idea of adding sql DML support to hoodie table. But if feasible, once such support is added, do you think we can leverage this w/o duplicating the work for sql DML. for eg, "update col1 = 'new_york' where col2= '123'" Such partial updates should translate from sql layer to this right.
- In tests, do verify that schema in commit metadata refers to table schema and not incoming partial schema.
You are right, let me think about how to be compatible with this scene.
I have resolved the other comment issues. I will update the PR after this issue is resolved.
|
||
HoodieCommitMetadata commitMetadata = buildMetadataFromStats(writeStats, partitionToReplaceFileIds, commitActionType, operationType); | ||
|
||
// add in extra metadata | ||
if (extraMetadata.isPresent()) { | ||
extraMetadata.get().forEach(commitMetadata::addMetadata); | ||
} | ||
if (updatePartialFields) { | ||
try { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good,i will delete it
I have the same feeling, we should still use the old schema with full fields there, for new records with partial values, we can patch them up with a builtin placeholder values, and when we pre_combine the old and new, if we encounter the placeholder values, use the value from the existing record. In any case, to be consistent with SQL, please do not modify the schema which mess the thing up. |
Okay, I think of a way to support such a situation |
agree. This is actually a fairly involved feature. and I would like to design things for both COW and MOR at the outset. IMO if we design something only for COW, then over time it leads to a pretty confusing matrix of feature gaps. Can we first write down the high level design in a comment, summarize open issues, and we can tackle them one by one? |
Please allow me to think about it and I will reply once there is a result |
@vinothchandar : my thoughts on adding this payload support for MOR. |
@vinothchandar Please take the time to look at @nsivabalan idea of using some of the fields in COW to update. |
@liujinhui1994 : may I know whats the status on this. Did you put up any other patch instead of this one. Wondering why we closed this one ? |
This PR might have helped this user - #5180 and several others in achieving their end goals. @liujinhui1994 |
Tips
What is the purpose of the pull request
Support update partial fields for Cow table
Brief change log
Support update partial fields for Cow table, currently, when updating hudi table, users have to provide all fields, however in some cases, users would not provide all fields, and hudi needs to support update partial fields.
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.