[SUPPORT] - Partial Update : update few columns of a table #2637
Comments
@n3nash: Can you please take a look at this request when you get a chance?
@Sugamber The HoodieRecordPayload interface gives you two callbacks to customize merging: preCombine and combineAndGetUpdateValue. Each of these APIs is a callback that hands you the records being merged. See an example here -> https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/common/model/OverwriteWithLatestAvroPayload.java#L50 This payload simply takes the latest record; you could add custom logic in the above two methods and achieve the desired behavior.
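For reference, the callbacks mentioned above have roughly the following shape (paraphrased from the Hudi codebase; parameter names and default methods vary by version, so treat this as a sketch rather than the exact interface):

import java.io.IOException;
import java.io.Serializable;
import org.apache.avro.Schema;
import org.apache.avro.generic.IndexedRecord;
import org.apache.hudi.common.util.Option;

public interface HoodieRecordPayload<T> extends Serializable {
    // Merge two incoming records with the same key before writing
    // (e.g. keep the one with the latest precombine value).
    T preCombine(T anotherRecord);

    // Merge the incoming record with the record already stored on disk
    // and return the value to persist.
    Option<IndexedRecord> combineAndGetUpdateValue(IndexedRecord currentValue, Schema schema) throws IOException;

    // Produce the record to write when no record exists yet for the key.
    Option<IndexedRecord> getInsertValue(Schema schema) throws IOException;
}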
@n3nash How can we pass the custom class name? I copied the same class as CustomRecordUpdate and tried to set it during save, but it throws a class not found exception. I tried all three ways of the Scala API and none of them worked.
@Sugamber: Did you confirm that your class exists on the classpath while you run your Spark job? classOf[CustomRecordUpdate].getName should have worked.
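For anyone hitting the same question, here is a hedged example of wiring in a custom payload through the Spark DataSource writer. The table name, key fields, and paths are placeholders rather than values from this thread, CustomRecordUpdate is the class discussed above, and the payload class must be on both the driver and executor classpath:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

public class UpsertWithCustomPayload {
    public static void upsert(Dataset<Row> df, String basePath) {
        df.write()
            .format("hudi")
            .option("hoodie.table.name", "my_table")
            .option("hoodie.datasource.write.recordkey.field", "id")
            .option("hoodie.datasource.write.partitionpath.field", "dt")
            .option("hoodie.datasource.write.precombine.field", "ts")
            // Fully qualified name of the custom payload class.
            .option("hoodie.datasource.write.payload.class", CustomRecordUpdate.class.getName())
            .mode(SaveMode.Append)
            .save(basePath);
    }
}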
@nsivabalan Yes, I'm able to find the class name in the jar using a Linux command.
I'm able to resolve the class not found exception.
What was the issue or fix? Do you mind updating it here?
@nsivabalan, I had created a shaded jar and it was causing the issue, as a few dependency versions were conflicting.
I have created a class implementing HoodieRecordPayload. There are three methods for which we have to write our logic.
In my use case, I'm only getting a few columns out of 20 in the incremental data. I have built the schema in the constructor, as the preCombine method does not have any schema details. For example: the Hudi table is built with 20 columns. Now the requirement is to update only 3 columns, and only those columns arrive in the incremental data feed, along with the RECORDKEY_FIELD_OPT_KEY, PARTITIONPATH_FIELD_OPT_KEY and PRECOMBINE_FIELD_OPT_KEY columns. I have implemented the class as below. Please let me know in which method I'll be getting the full schema of the table.
public class PartialColumnUpdate implements HoodieRecordPayload<PartialColumnUpdate> {
    // preCombine, combineAndGetUpdateValue and getInsertValue are overridden here
}
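Regarding which method sees the full table schema: the schema passed into combineAndGetUpdateValue during a merge is the schema the writer merges with, which should be the full table schema. Below is a minimal sketch of a partial-update payload along those lines; it assumes the incoming record is serialized with the full table schema, treats null incoming values as "column not sent", and extends OverwriteWithLatestAvroPayload for convenience. It is an illustration under those assumptions, not the final solution adopted in this thread, and import paths may differ slightly between Hudi versions:

import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.IndexedRecord;
import org.apache.hudi.avro.HoodieAvroUtils;
import org.apache.hudi.common.model.OverwriteWithLatestAvroPayload;
import org.apache.hudi.common.util.Option;

public class PartialColumnUpdate extends OverwriteWithLatestAvroPayload {

    public PartialColumnUpdate(GenericRecord record, Comparable orderingVal) {
        super(record, orderingVal);
    }

    @Override
    public Option<IndexedRecord> combineAndGetUpdateValue(IndexedRecord currentValue, Schema schema)
            throws IOException {
        // 'schema' here is the schema used for the merge, i.e. the full table schema.
        GenericRecord incoming = HoodieAvroUtils.bytesToAvro(recordBytes, schema);
        GenericRecord existing = (GenericRecord) currentValue;
        // Keep the existing value for every column the incremental feed did not send
        // (assumed to arrive as null).
        for (Schema.Field field : schema.getFields()) {
            Object incomingValue = incoming.get(field.name());
            if (incomingValue != null) {
                existing.put(field.name(), incomingValue);
            }
        }
        return Option.of(existing);
    }
}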
Can this use case be achieved using Hudi, given that the target schema and the incremental schema are not the same?
@n3nash Please confirm whether this use case can be achieved. If yes, please provide a few inputs.
There is an open pull request for partial updates on CoW tables. It looks like my use case is similar to it.
@nsivabalan Do we have any timeline for this pull request?
@Sugamber: We are currently busy with an upcoming release. Once it is completed, I will start reviewing this work item. And yes, the linked PRs are similar to your ask. I guess a few other folks are interested in this as well. We can target it for the next release.
@Sugamber Your code looks correct. Here is the flow:
Now, if your target schema (the schema of the record_from_disk) is different from the incremental_schema, that is not a problem as long as the target_schema and incremental_schema are backwards compatible. At a high level, the incremental_schema should always be a superset (all existing fields plus any new fields) of the target schema.
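As a hedged illustration of that compatibility rule (field names are made up, built here with Avro's SchemaBuilder): the incremental schema below contains every field of the target schema plus one new nullable field, so it remains backwards compatible.

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

public class SchemaCompatibilityCheck {
    public static void main(String[] args) {
        // Target (table) schema already on disk.
        Schema target = SchemaBuilder.record("trip").fields()
            .requiredString("id")
            .requiredLong("ts")
            .requiredDouble("fare")
            .endRecord();

        // Incremental (writer) schema: all target fields plus one new nullable field.
        Schema incremental = SchemaBuilder.record("trip").fields()
            .requiredString("id")
            .requiredLong("ts")
            .requiredDouble("fare")
            .optionalString("rider_name")
            .endRecord();

        System.out.println(target.toString(true));
        System.out.println(incremental.toString(true));
    }
}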
Thank you!!!
@Sugamber If your issue is addressed, please close this issue.
We can close out this issue, as we have a tracking JIRA and a PR that is being actively reviewed.
We have one table with more than 300 columns. We would like to update only a few fields.
I read the configuration documentation, and it suggests that we have to use HoodieRecordPayload and provide our own merge logic.
I didn't see any example in the Hudi documentation.
Please help me with this.