[SUPPORT] PreCombineAndUpdate in Payload #1582
Comments
@nandini57 : The flag is for internal Hudi logic to preserve the old record when Hudi is not able to create a valid updated record to write. I am not sure I am following your use-case. From #1569, if you are using unique keys per batch, you should not be seeing merges anyway.
My apologies, let me try to explain. If I don't upsert the data with each batch where applicable, then when I query the table back it will have duplicates, because batch "n" needs to carry data from batches "n-1", "n-2", and so on. I would need to do a group by upsertKey .. max(commit_time) to get the latest view of the data, and doing a group by on every read won't scale. Instead, if I can preserve the current value with a deleted identifier in a custom payload, and also return both the incoming and current payloads in combineAndGetUpdateValue, I can preserve the required data for audit while reads filter out records with the deleted identifier. Does this make sense? Any other ideas? Possibly making copyOldRecord a configurable property, defaulting to false, if that doesn't impact anything else.
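The soft-delete idea above can be sketched without any Hudi code: the merge keeps the old image as a tombstone so reads can filter it out while the audit trail survives. This is a conceptual sketch in plain Python, not the actual Hudi payload API, and the `_deleted` field name is a hypothetical choice:

```python
# Conceptual sketch of the tombstone approach: plain dicts stand in for
# Avro records; "_deleted" is an illustrative marker field, not a Hudi field.

def combine(current, incoming):
    """Return both images: the old one marked deleted, plus the new one."""
    tombstone = dict(current)
    tombstone["_deleted"] = True
    return [tombstone, incoming]

def read_view(records):
    """Latest view: drop tombstoned images at query time."""
    return [r for r in records if not r.get("_deleted", False)]

merged = combine({"key": "k1", "val": 1}, {"key": "k1", "val": 2})
latest = read_view(merged)
```

The trade-off, as the maintainer notes below, is that returning two records per key breaks Hudi's record-key uniqueness contract, which is why the discussion moves toward nesting the history inside a single record instead.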
Thanks for the details. One of the primary contracts within Hudi is the uniqueness of the record key within a partition/dataset. Instead, can you materialize the grouping within the record? To elaborate: add a nested array-of-struct field "audit_log" to your schema (the inner struct having the same structure as the top-level struct, minus audit_log) which would contain the list of record images at each ingest time, and have your custom payload append all previous images as part of combineAndGetUpdateValue and preCombine. This way, if you want the latest image, you simply skip projecting "audit_log" in your query and don't have to deal with a reduce-by.
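The "audit_log" suggestion can be sketched without Hudi at all. Below, plain Python dicts stand in for Avro records, and the merge function models the logic a custom payload's combineAndGetUpdateValue would implement; this is a sketch of the idea, not Hudi's actual API:

```python
def merge_with_audit(current, incoming):
    """Append the current image (and its prior history) to incoming's audit_log."""
    prior = dict(current)
    history = prior.pop("audit_log", [])   # inner images carry no audit_log
    merged = dict(incoming)
    merged["audit_log"] = history + [prior]
    return merged

r1 = {"key": "k1", "val": 1}
r2 = merge_with_audit(r1, {"key": "k1", "val": 2})
r3 = merge_with_audit(r2, {"key": "k1", "val": 3})
# Latest view = project everything except audit_log; no reduce-by needed.
```

Each merge keeps exactly one record per key, preserving Hudi's uniqueness contract, while the full chain of images remains queryable inside the record.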
I did think about this, but our schemas are heavily nested and contain more than 5000 columns even for a fairly modest one, so I need to think more about it. If I rethink the problem as figuring out my state of data as of business day <= X, it is possible to track this if I tag the record key while inserting with _X, _X-1, _X-2, etc. Does that sound like a logical thing to do?
Sorry, not following your solution. Are you referring to creating unique record keys per batch and treating them as inserts?
Hi Balaji,

Probably switching to the parquet format instead of Hudi, and doing a spark.read.parquet(partitionPath).dropDuplicates where commit_time = X, is an option? The following works if I want to go back to commit X and have a view of the data as of then; however, the same query against the Hudi format doesn't give me the right view as of commit X:

def audit(spark: SparkSession, partitionPath: String, tablePath: String, commitTime: String): Unit = {

Did a little digging, and the following code in HoodieROTablePathFilter seems to take only the latest base file, dropping the other files. The impact in my case is that I get an incorrect view as of time X, because the query reads the latest file, which has 2 records as of time X, one of which was upserted and got a new commit time. Is that understanding correct? How do I get around this? Can I use a custom path filter?
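The snapshot-as-of-commit view being described above (latest image per key among records committed at or before X) can be sketched conceptually as follows; this is plain Python standing in for the Spark dropDuplicates query, and the `_commit_time` field name is illustrative:

```python
def view_as_of(records, commit_x):
    """Latest image per key among records committed at or before commit_x."""
    latest = {}
    for r in records:
        if r["_commit_time"] <= commit_x:
            k = r["key"]
            if k not in latest or r["_commit_time"] > latest[k]["_commit_time"]:
                latest[k] = r
    return sorted(latest.values(), key=lambda r: r["key"])

rows = [
    {"key": "a", "val": 1, "_commit_time": 100},
    {"key": "a", "val": 2, "_commit_time": 200},  # upsert after commit 100
    {"key": "b", "val": 9, "_commit_time": 100},
]
```

The issue described in the thread is that a read-optimized Hudi view only exposes the latest base file, so the `val = 1` image for key "a" as of commit 100 is no longer visible to this kind of query.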
Turns out the incremental query is what can get me the data back in time. Thanks again.

public static void audit(SparkSession spark, String tablePath, Long commitTime) {
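Conceptually, an incremental pull returns only the records committed within a (begin, end] instant-time window, rather than the full snapshot. A minimal Python sketch of that semantics (not the Hudi reader itself; `_commit_time` is an illustrative field name):

```python
def incremental_pull(records, begin_time, end_time=None):
    """Records whose commit time falls in the window (begin_time, end_time]."""
    return [
        r for r in records
        if r["_commit_time"] > begin_time
        and (end_time is None or r["_commit_time"] <= end_time)
    ]

rows = [
    {"key": "a", "val": 1, "_commit_time": 100},
    {"key": "a", "val": 2, "_commit_time": 200},
]
```

Bounding the window at a past instant is what lets a job reconstruct how the data looked around a given commit, which is the audit use-case this thread resolves.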
Can we close if the issue is resolved?
Can I get a list of open issues related to the incremental query option, to be aware of anything that could hit my job?
Jira (https://jira.apache.org/jira/projects/HUDI/summary) would be a good place to look. For a Copy on Write table, you should not see any surprises w.r.t. query engine support. Spark DataSource support for the incremental view over a Merge on Read table is an open item.
@vinothchandar @bvaradar
In continuation of the recently raised issue #1569: for custom merge logic, is there a way to preserve the currentValue on disk? It seems that in HoodieMergeHandle the copyOldRecord flag is false, so the current value is lost.

@Override
public Option combineAndGetUpdateValue(IndexedRecord currentValue, Schema schema) throws IOException {