[SUPPORT] Upsert data with an identical record key and pre-combine field #3266
Comments
Thanks for the feedback. I guess this is because our payload does not choose the latest data when the preCombine fields are the same. See this PR: #3267
Thank you @danny0405 for the help. If I understand well, what will happen if I upsert a dataframe that contains two records with the same recordKey & preCombine? Will this end up upserting the latest record in the dataframe?
It will, but only after this patch.
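For anyone following along, here is a minimal sketch (Spark Scala; the table name and base path are hypothetical, not from this thread) of the scenario under discussion: an upsert batch in which two rows share both the record key and the pre-combine value. Before the patch referenced above, Hudi does not guarantee which of the two rows survives.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("hudi-precombine-tie").getOrCreate()
import spark.implicits._

// Two rows share the record key ("id") AND the pre-combine value ("ts").
// Without the fix above, which row survives the upsert is not deterministic.
val df = Seq(
  (1, "value-old", "2021-07-08 10:47:53"),
  (1, "value-new", "2021-07-08 10:47:53")
).toDF("id", "payload", "ts")

df.write.format("hudi")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.datasource.write.operation", "upsert")
  .option("hoodie.table.name", "precombine_tie_demo") // hypothetical table name
  .mode(SaveMode.Append)
  .save("s3://my-bucket/precombine_tie_demo")         // hypothetical base path
```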
Hi @danny0405, thanks. Will the PR be merged soon?
Yes, the PR would be merged soon once the CI tests pass. |
related to PR #3401 |
The PR is merged, please verify again if your issue is resolved. |
Hello @danny0405, I'm facing a similar issue when extracting Kafka events. When a delete is followed by an insert, usually inside a transaction, sometimes the timestamp field that indicates when the events were processed is the same for both events. When I run my Spark job to extract the events, the delete event takes precedence over the insert. Thank you
The delete is still problematic for MOR tables: Hudi always writes the DELETE block after the DATA block, the DELETE block does not record the event time, and the reader always reads the DELETE block afterwards. So even if a DELETE message has a smaller event time than the INSERT message, the INSERT message is still deleted. We have fixed the sequence for COW tables in the 0.9 release, though.
Hi @danny0405 |
Hi @danny0405, thank you for your reply.
Hi @Rap70r, 0.9.0 has been released, so I will just close this issue as resolved. If you have other problems, feel free to reopen it.
Thank you @danny0405, I have tested and confirmed that the issue I described above has been resolved for me.
I am using the AWS DMS Change Data Capture service to get change data from my database, and then using Apache Hudi with an AWS Glue ETL job to process the change data and create a table in Hive. As the pre-combine field I am using a timestamp sent from AWS DMS that indicates when the data was committed (update_ts_dms).
I have a few use cases where inserts/updates and deletes for the same primary key carry the same timestamp from DMS, and after change data processing Apache Hudi does not give the latest updated row for that primary key in the table. It keeps an arbitrary insert or update instead, probably because the records have the same primary key and the same pre-combine field.
Is there any suggested solution for such a case?
Sample Data:
{ "Op": "I", "update_ts_dms": "2021-07-08 10:47:53", "id": 10125412, "brand_id": 9722520, "type": "EXPLICIT", "created": "2021-07-08 10:47:53", "updated": "2021-07-08 10:47:53" } { "Op": "D", "update_ts_dms": "2021-07-08 10:47:53", "id": 10125412, "brand_id": 9722520, "type": "EXPLICIT", "created": "2021-07-08 10:47:53", "updated": "2021-07-08 10:47:53" }
Environment Description
Hudi version (jar): hudi-spark-bundle_2.11-0.9.0-SNAPSHOT.jar
Spark version : 2.4
Hadoop version : 2.8
Storage : S3
Running on Docker?: no