Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[HUDI-1160] Support update partial fields for CoW table #2666

Closed
wants to merge 2 commits into from
Closed

[HUDI-1160] Support update partial fields for CoW table #2666

wants to merge 2 commits into from

Conversation

liujinhui1994
Copy link
Contributor

Tips

What is the purpose of the pull request

Support update partial fields for Cow table

Brief change log

Support update partial fields for Cow table, currently, when updating hudi table, users have to provide all fields, however in some cases, users would not provide all fields, and hudi needs to support update partial fields.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@codecov-io
Copy link

codecov-io commented Mar 12, 2021

Codecov Report

Merging #2666 (7eeba90) into master (7424194) will decrease coverage by 42.17%.
The diff coverage is n/a.

Impacted file tree graph

@@             Coverage Diff              @@
##             master   #2666       +/-   ##
============================================
- Coverage     51.70%   9.52%   -42.18%     
+ Complexity     3584      48     -3536     
============================================
  Files           475      53      -422     
  Lines         22495    1963    -20532     
  Branches       2393     235     -2158     
============================================
- Hits          11630     187    -11443     
+ Misses         9852    1763     -8089     
+ Partials       1013      13     -1000     
Flag Coverage Δ Complexity Δ
hudicli ? ?
hudiclient ? ?
hudicommon ? ?
hudiflink ? ?
hudihadoopmr ? ?
hudisparkdatasource ? ?
hudisync ? ?
huditimelineservice ? ?
hudiutilities 9.52% <ø> (-60.42%) 0.00 <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ Complexity Δ
...va/org/apache/hudi/utilities/IdentitySplitter.java 0.00% <0.00%> (-100.00%) 0.00% <0.00%> (-2.00%)
...va/org/apache/hudi/utilities/schema/SchemaSet.java 0.00% <0.00%> (-100.00%) 0.00% <0.00%> (-3.00%)
...a/org/apache/hudi/utilities/sources/RowSource.java 0.00% <0.00%> (-100.00%) 0.00% <0.00%> (-4.00%)
.../org/apache/hudi/utilities/sources/AvroSource.java 0.00% <0.00%> (-100.00%) 0.00% <0.00%> (-1.00%)
.../org/apache/hudi/utilities/sources/JsonSource.java 0.00% <0.00%> (-100.00%) 0.00% <0.00%> (-1.00%)
...rg/apache/hudi/utilities/sources/CsvDFSSource.java 0.00% <0.00%> (-100.00%) 0.00% <0.00%> (-10.00%)
...g/apache/hudi/utilities/sources/JsonDFSSource.java 0.00% <0.00%> (-100.00%) 0.00% <0.00%> (-4.00%)
...apache/hudi/utilities/sources/JsonKafkaSource.java 0.00% <0.00%> (-100.00%) 0.00% <0.00%> (-6.00%)
...pache/hudi/utilities/sources/ParquetDFSSource.java 0.00% <0.00%> (-100.00%) 0.00% <0.00%> (-5.00%)
...lities/schema/SchemaProviderWithPostProcessor.java 0.00% <0.00%> (-100.00%) 0.00% <0.00%> (-4.00%)
... and 451 more

@liujinhui1994
Copy link
Contributor Author

The content in this pr #1929 comment is resolved here, please review

@liujinhui1994
Copy link
Contributor Author

@n3nash @satishkotha @yanghua

@nsivabalan nsivabalan added the priority:major degraded perf; unable to move forward; potential bugs label Mar 15, 2021
@nsivabalan
Copy link
Contributor

@liujinhui1994 : Thanks for the contribution. There are 2 to 3 PRs with similar goal. Did you get happen to check out existing ones before putting this up?

@liujinhui1994
Copy link
Contributor Author

There does not seem to be the same function as this PR at the moment
@nsivabalan

This PR inherits from here, this one #1929 will be obsolete

@Sugamber
Copy link

@liujinhui1994 We are also need the same feature in hudi. Is there any working branch which can be referred?

@liujinhui1994
Copy link
Contributor Author

@Sugamber This branch should be available

@Sugamber
Copy link

Is there any timeline for this pull request?

@liujinhui1994
Copy link
Contributor Author

liujinhui1994 commented Mar 24, 2021 via email

@nsivabalan nsivabalan added writer-core Issues relating to core transactions/write actions feature-enquiry issue contains feature enquiries/requests or great improvement ideas labels Mar 30, 2021
@nsivabalan
Copy link
Contributor

@liujinhui1994 : Can you please update the description with an example.

@liujinhui1994
Copy link
Contributor Author

@liujinhui1994 : Can you please update the description with an example.

1、old data
id name age
1 liujinhui 26
2 sam 25
3 madeline 30

2、new data
{"id": 1,"name": "abc"}

3、 final result
1 abc 26
2 sam 25
3 madeline 30

Copy link
Contributor

@nsivabalan nsivabalan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry for late turn around on reviewing this. We should definitely get this before next release.

I am yet to review tests. but few high level thoughts on reviewing source code.

  • Shouldn't we check schema compatibility? what incase new incoming batch is not compatible w/ table schema w/ partial updates set to true? did we cover this scenario.
  • I see we have added support only for COW. should we throw exception if the config is set for MOR?
  • I don't have a good idea of adding sql DML support to hoodie table. But if feasible, once such support is added, do you think we can leverage this w/o duplicating the work for sql DML. for eg, "update col1 = 'new_york' where col2= '123'" Such partial updates should translate from sql layer to this right.
  • In tests, do verify that schema in commit metadata refers to table schema and not incoming partial schema.

@@ -79,6 +80,7 @@
public static final String TIMELINE_LAYOUT_VERSION = "hoodie.timeline.layout.version";
public static final String BASE_PATH_PROP = "hoodie.base.path";
public static final String AVRO_SCHEMA = "hoodie.avro.schema";
public static final String LAST_AVRO_SCHEMA = "hoodie.last.avro.schema";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

may be we can call this as "latest table schema".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, i think it's good

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@danny0405 Can you help answer your questions about SQL?

try {
TableSchemaResolver resolver = new TableSchemaResolver(hoodieTable.getMetaClient());
Schema lastSchema = resolver.getTableAvroSchema();
config.setLastSchema(lastSchema.toString());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lastSchema could be null if this is first commit or last commit is an operation which does not inject schema in commit metadata. Can we check if not null and then set the last schema. If not, .toString() could throw NullPointerException.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

infact, would prefer to throw an exception if table schema is not present and if someone is trying to use updatePartialFields.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lastSchema could be null if this is first commit or last commit is an operation which does not inject schema in commit metadata. Can we check if not null and then set the last schema. If not, .toString() could throw NullPointerException.

ok

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

infact, would prefer to throw an exception if table schema is not present and if someone is trying to use updatePartialFields.

Is this too serious? I don't think it's necessary to throw this exception here

@@ -106,7 +110,7 @@
public HoodieMergeHandle(HoodieWriteConfig config, String instantTime, HoodieTable<T, I, K, O> hoodieTable,
Iterator<HoodieRecord<T>> recordItr, String partitionPath, String fileId,
TaskContextSupplier taskContextSupplier) {
super(config, instantTime, partitionPath, fileId, hoodieTable, taskContextSupplier);
super(config, instantTime, partitionPath, fileId, hoodieTable, getWriterSchemaIncludingAndExcludingMetadataPair(config, hoodieTable), taskContextSupplier);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you help me understand something. Here we have HoodieWriteConfig at two places. As a separate entity (1st arg in constructor) and HoodieTable.getConfig(). I see within getWriterSchemaIncludingAndExcludingMetadataPair(...), we update lastSchema in hoodieWriteConfig, which will update the 1st arg. But table.getConfig() may not be updated right.
If above statement is right, how do we rely on table.getConfig.getLastSchema() in *MergeHelper classes.
May be I am missing something. can you throw some light.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found through debug,when config.setLastSchema is successfully updated, table.getConfig.getLastSchema is also updated, and they both take their values from the same properties

import java.io.IOException;
import java.util.List;

public class PartialUpdatePayload extends OverwriteWithLatestAvroPayload {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

java docs w/ example would be great.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok,i will add

}

@Override
public Option<IndexedRecord> combineAndGetUpdateValue(IndexedRecord lastValue, Schema schema) throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

schema here refers to incoming partial schema right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: instead of "lastValue", may be "existingValue" or "existingStorageValue"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

schema here refers to incoming partial schema right?
yes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

existing

ok

}
return recordOption;
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

line break please.


HoodieCommitMetadata commitMetadata = buildMetadataFromStats(writeStats, partitionToReplaceFileIds, commitActionType, operationType);

// add in extra metadata
if (extraMetadata.isPresent()) {
extraMetadata.get().forEach(commitMetadata::addMetadata);
}
if (updatePartialFields) {
try {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if updatePartialFields is set, can't we rely on config.getLastSchema() in all places where this method is called similar to how we pass in config.getSchema(). Is it not guaranteed that config.lastSchema() will be set by the time we reach here if updatePartialFields is set to true?
Trying to see if we can avoid parsing the schema from table once again since we have already done it once.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good,i will delete it

@liujinhui1994
Copy link
Contributor Author

liujinhui1994 commented Apr 2, 2021

Your suggestion is very good, please let me think about it, and then reply to you
@nsivabalan

Copy link
Contributor Author

@liujinhui1994 liujinhui1994 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry for late turn around on reviewing this. We should definitely get this before next release.

I am yet to review tests. but few high level thoughts on reviewing source code.

  • Shouldn't we check schema compatibility? what incase new incoming batch is not compatible w/ table schema w/ partial updates set to true? did we cover this scenario.
  • I see we have added support only for COW. should we throw exception if the config is set for MOR?
  • I don't have a good idea of adding sql DML support to hoodie table. But if feasible, once such support is added, do you think we can leverage this w/o duplicating the work for sql DML. for eg, "update col1 = 'new_york' where col2= '123'" Such partial updates should translate from sql layer to this right.
  • In tests, do verify that schema in commit metadata refers to table schema and not incoming partial schema.

You are right, let me think about how to be compatible with this scene.

I have resolved the other comment issues. I will update the PR after this issue is resolved.


HoodieCommitMetadata commitMetadata = buildMetadataFromStats(writeStats, partitionToReplaceFileIds, commitActionType, operationType);

// add in extra metadata
if (extraMetadata.isPresent()) {
extraMetadata.get().forEach(commitMetadata::addMetadata);
}
if (updatePartialFields) {
try {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good,i will delete it

@danny0405
Copy link
Contributor

sorry for late turn around on reviewing this. We should definitely get this before next release.

I am yet to review tests. but few high level thoughts on reviewing source code.

  • Shouldn't we check schema compatibility? what incase new incoming batch is not compatible w/ table schema w/ partial updates set to true? did we cover this scenario.
  • I see we have added support only for COW. should we throw exception if the config is set for MOR?
  • I don't have a good idea of adding sql DML support to hoodie table. But if feasible, once such support is added, do you think we can leverage this w/o duplicating the work for sql DML. for eg, "update col1 = 'new_york' where col2= '123'" Such partial updates should translate from sql layer to this right.
  • In tests, do verify that schema in commit metadata refers to table schema and not incoming partial schema.

I have the same feeling, we should still use the old schema with full fields there, for new records with partial values, we can patch them up with a builtin placeholder values, and when we pre_combine the old and new, if we encounter the placeholder values, use the value from the existing record.

In any case, to be consistent with SQL, please do not modify the schema which mess the thing up.

@vinothchandar vinothchandar added this to Ready For Review in PR Tracker Board Apr 15, 2021
@liujinhui1994
Copy link
Contributor Author

sorry for late turn around on reviewing this. We should definitely get this before next release.
I am yet to review tests. but few high level thoughts on reviewing source code.

  • Shouldn't we check schema compatibility? what incase new incoming batch is not compatible w/ table schema w/ partial updates set to true? did we cover this scenario.
  • I see we have added support only for COW. should we throw exception if the config is set for MOR?
  • I don't have a good idea of adding sql DML support to hoodie table. But if feasible, once such support is added, do you think we can leverage this w/o duplicating the work for sql DML. for eg, "update col1 = 'new_york' where col2= '123'" Such partial updates should translate from sql layer to this right.
  • In tests, do verify that schema in commit metadata refers to table schema and not incoming partial schema.

I have the same feeling, we should still use the old schema with full fields there, for new records with partial values, we can patch them up with a builtin placeholder values, and when we pre_combine the old and new, if we encounter the placeholder values, use the value from the existing record.

In any case, to be consistent with SQL, please do not modify the schema which mess the thing up.

Okay, I think of a way to support such a situation

@vinothchandar
Copy link
Member

we should still use the old schema with full fields there, for new records with partial values, we can patch them up with a builtin placeholder values

agree. This is actually a fairly involved feature. and I would like to design things for both COW and MOR at the outset. IMO if we design something only for COW, then over time it leads to a pretty confusing matrix of feature gaps.

Can we first write down the high level design in a comment, summarize open issues, and we can tackle them one by one?

@vinothchandar vinothchandar self-assigned this Apr 19, 2021
@liujinhui1994
Copy link
Contributor Author

we should still use the old schema with full fields there, for new records with partial values, we can patch them up with a builtin placeholder values

agree. This is actually a fairly involved feature. and I would like to design things for both COW and MOR at the outset. IMO if we design something only for COW, then over time it leads to a pretty confusing matrix of feature gaps.

Can we first write down the high level design in a comment, summarize open issues, and we can tackle them one by one?

Please allow me to think about it and I will reply once there is a result

@nsivabalan
Copy link
Contributor

@vinothchandar : my thoughts on adding this payload support for MOR.
It may not make sense to add this payload support to MOR. bcoz, in MOR we don't lookup previous record values as we only append records to logs. But this payload involves fetching previous record and merging with incoming one. So, might as well end up using COW for simplicity and since MOR is mainly used to shave off some latency during writes.
Open to hear your thoughts.

@hudi-bot
Copy link

hudi-bot commented Jul 17, 2021

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run travis re-run the last Travis build
  • @hudi-bot run azure re-run the last Azure build

@liujinhui1994
Copy link
Contributor Author

@vinothchandar Please take the time to look at @nsivabalan idea of using some of the fields in COW to update.
Your guidance and approval would be much appreciated

PR Tracker Board automation moved this from Under Discussion PRs to Done Oct 13, 2021
@nsivabalan
Copy link
Contributor

@liujinhui1994 : may I know whats the status on this. Did you put up any other patch instead of this one. Wondering why we closed this one ?

@pratyakshsharma
Copy link
Contributor

This PR might have helped this user - #5180 and several others in achieving their end goals. @liujinhui1994
I have the same question as Shiv here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature-enquiry issue contains feature enquiries/requests or great improvement ideas priority:major degraded perf; unable to move forward; potential bugs writer-core Issues relating to core transactions/write actions
Projects
Development

Successfully merging this pull request may close these issues.

None yet

9 participants